Mixture of Experts Meets Prompt-Based Continual Learning

Minh Le³, An Nguyen², Huy Nguyen¹, Trang Nguyen³, Trang Pham³, Linh Van Ngo², Nhat Ho¹

¹The University of Texas at Austin  ²Hanoi University of Science and Technology  ³VinAI Research

Abstract. Exploiting the power of pre-trained models, prompt-based approaches stand out among continual learning solutions in effectively preventing catastrophic forgetting, even with very few learnable parameters and without the need for a memory buffer. While existing prompt-based continual learning methods excel at leveraging prompts for state-of-the-art performance, they often lack a theoretical explanation for the effectiveness of prompting. This paper conducts a theoretical analysis to unravel how prompts bestow such advantages in continual learning, thus offering a new perspective on prompt design. We first show that the attention block of pre-trained models such as Vision Transformers inherently encodes a special mixture of experts architecture, characterized by linear experts and quadratic gating score functions. This realization drives us to provide a novel view on prefix tuning, reframing it as the addition of new task-specific experts, thereby inspiring the design of a novel gating mechanism termed Non-linear Residual Gates (NoRGa). Through the incorporation of non-linear activation and residual connection, NoRGa enhances continual learning performance while preserving parameter efficiency. The effectiveness of NoRGa is substantiated both theoretically and empirically across diverse benchmarks and pre-training paradigms. Our code is publicly available at https://github.com/Minhchuyentoancbn/MoE_PromptCL.

1 Introduction

Humans possess a remarkable ability to learn continuously by integrating new skills and knowledge while retaining past experiences. However, current AI models often fail to retain this ability.
Unlike humans, they often suffer from catastrophic forgetting [28, 30, 32, 38], a phenomenon where they struggle to retain knowledge from previous tasks while learning new ones. Inspired by human learning, continual learning [2, 28, 43, 1, 12] is an ongoing field that aims to train a model across a sequence of tasks while mitigating this challenge. Traditional continual learning methods often rely on storing past data for fine-tuning, which raises concerns about memory usage and privacy [5, 39, 51]. To address these limitations, prompt-based approaches have emerged as a promising alternative within rehearsal-free continual learning. By attaching prompts, small sets of learnable parameters, to a frozen pre-trained model, these approaches enable efficient adaptation to new tasks with minimal modifications to the underlying model [56, 26, 61]. The effectiveness of prompt-based methods has been demonstrated by several recent works achieving state-of-the-art performance on various continual learning benchmarks [49, 53, 54].

While prompt-based methods have demonstrably achieved impressive results, their emphasis largely lies on prompt utility, leaving a gap in our theoretical comprehension of their effectiveness. This absence of a theoretical foundation hinders our ability to further refine and optimize these methods. In this work, we offer a new perspective by focusing on prefix tuning [26] and its connection to mixture of experts models [19, 15, 13, 11]. We demonstrate that self-attention blocks in Vision Transformers [8] implicitly encode a special mixture of experts architecture, revealing a surprising connection between these seemingly disparate concepts. Leveraging this connection, we propose that applying prefix tuning within pre-trained models can be interpreted as introducing new experts.

[Footnote: Equal contribution. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).]
The newly introduced experts collaborate with the pre-trained experts, facilitating efficient adaptation of the model to new tasks. Drawing insights from this analysis, we observe that the original prefix tuning suffers from suboptimal sample efficiency, requiring a substantial amount of data for reasonable parameter estimation. To address this challenge, we propose a novel gating mechanism termed Non-linear Residual Gates (NoRGa). This architecture integrates non-linear activation functions and residual connections within the gating score functions. Our work focuses on improving within-task prediction accuracy, a key component of continual learning performance as identified in previous research [22, 49]. We posit that NoRGa can enhance this aspect, contributing to improved overall continual learning performance while maintaining parameter efficiency. We further provide theoretical justification for this improvement, demonstrating how NoRGa accelerates parameter estimation rates.

Our contributions can be summarized as follows: (1) we reveal a novel connection between self-attention and mixture of experts, providing a fresh perspective on prompt-based continual learning approaches; (2) leveraging this insight, we propose Non-linear Residual Gates (NoRGa), an innovative gating mechanism that enhances continual learning performance while maintaining parameter efficiency, and we provide a theoretical justification for this improvement; (3) extensive experiments across various continual learning benchmarks and pre-training settings demonstrate that our approach achieves state-of-the-art performance compared to existing methods.

Notation. For any $n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, 2, \ldots, n\}$. Next, for any set $S$, we let $|S|$ stand for its cardinality. For any vector $u := (u_1, u_2, \ldots, u_d) \in \mathbb{R}^d$ and $\alpha := (\alpha_1, \alpha_2, \ldots, \alpha_d) \in \mathbb{N}^d$, we let $u^{\alpha} = u_1^{\alpha_1} u_2^{\alpha_2} \ldots u_d^{\alpha_d}$, $|u| := u_1 + u_2 + \ldots + u_d$ and $\alpha! := \alpha_1! \alpha_2! \ldots \alpha_d!$, while $\|u\|$ stands for its 2-norm value.
Lastly, for any two positive sequences $\{a_n\}_{n \geq 1}$ and $\{b_n\}_{n \geq 1}$, we write $a_n = \mathcal{O}(b_n)$ or $a_n \lesssim b_n$ if $a_n \leq C b_n$ for all $n \in \mathbb{N}$, where $C > 0$ is some universal constant. The notation $a_n = \mathcal{O}_P(b_n)$ indicates that $a_n / b_n$ is stochastically bounded.

2 Background and Related Works

We first provide background and related works on continual learning. Then, we define the attention mechanism, followed by discussions on prompt-based continual learning and mixture of experts.

Continual Learning (CL) addresses the challenge of training a model incrementally on a sequence of $T$ tasks, denoted by $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$. Each task's training data $\mathcal{D}_t = \{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^{N_t}$ contains pairs of an input sample $x_i^{(t)} \in \mathcal{X}^{(t)}$ and a corresponding label $y_i^{(t)} \in \mathcal{Y}^{(t)}$. Notably, the class labels are distinct for each task, i.e., $\mathcal{Y}^{(t)} \cap \mathcal{Y}^{(t')} = \emptyset$ for all $t \neq t'$. Consider a neural network with a backbone function $f_\theta$ and an output layer $h_\psi$. The model predicts a label $\hat{y} = h_\psi(f_\theta(x)) \in \mathcal{Y} = \bigcup_{t=1}^{T} \mathcal{Y}^{(t)}$, where $x \in \mathcal{X} = \bigcup_{t=1}^{T} \mathcal{X}^{(t)}$ is an unseen test sample from an arbitrary task. Importantly, during training on a new task, the model can only access the current data, without access to data from previous tasks. Prior approaches often rely on storing past task samples for training on new tasks, raising concerns regarding storage and privacy [5, 6, 39, 51, 59]. Our work focuses on the class-incremental learning (CIL) setting, where task identities are not provided during inference, unlike in task-incremental learning (TIL) [46]. A recent theory by [22] analyzes the CIL objective by decomposing the probability of a test sample $x$ of the $j$-th class in task $t$ into two probabilities:
$$P(x \in \mathcal{X}_j^{(t)} \mid \mathcal{D}) = P(x \in \mathcal{X}_j^{(t)} \mid x \in \mathcal{X}^{(t)}, \mathcal{D})\, P(x \in \mathcal{X}^{(t)} \mid \mathcal{D}), \quad (1)$$
where the first term involves within-task prediction (WTP) and the second term pertains to task-identity inference (TII). This equation highlights that by improving either the WTP performance or the TII, we can consequently improve the overall CIL performance, as shown in [22, 49].

Attention Mechanism.
Within the Transformer architecture, the attention mechanism plays a crucial role. One prevalent variant is scaled dot-product attention [47], formally defined as follows:

Definition 2.1 (Scaled Dot-Product Attention). Let $K \in \mathbb{R}^{N \times d_k}$ be a key matrix with $N$ key vectors, and $V \in \mathbb{R}^{N \times d_v}$ be a value matrix with $N$ corresponding value vectors. Given a query matrix $Q \in \mathbb{R}^{M \times d_k}$, Attention over $(K, V)$ is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad (2)$$
where the softmax function acts on the rows of the matrix $Q K^{\top} \in \mathbb{R}^{M \times N}$.

Vision Transformer (ViT) [8] employs the same attention mechanism within multiple Multi-head Self-Attention (MSA) layers, which are formally defined as follows:

Definition 2.2 (Multi-head Self-Attention Layer). Let $X_Q, X_K, X_V$ denote the input query, key, and value matrices, respectively, where $X_Q = X_K = X_V = [x_1, \ldots, x_N]^{\top} \in \mathbb{R}^{N \times d}$, and $N$ is the length of the input sequence. The output is expressed as
$$\mathrm{MSA}(X_Q, X_K, X_V) := \mathrm{Concat}(h_1, \ldots, h_m) W^O \in \mathbb{R}^{N \times d}, \quad (3)$$
$$h_i := \mathrm{Attention}(X_Q W_i^Q, X_K W_i^K, X_V W_i^V), \quad i \in [m], \quad (4)$$
where $W^O \in \mathbb{R}^{m d_v \times d}$, $W_i^Q \in \mathbb{R}^{d \times d_k}$, $W_i^K \in \mathbb{R}^{d \times d_k}$, and $W_i^V \in \mathbb{R}^{d \times d_v}$ are projection matrices, and $m$ is the number of heads in the MSA layer. In ViTs, $d_k = d_v = d/m$.

Prompt-based continual learning. Prompt-based approaches have emerged as a promising alternative within rehearsal-free continual learning [61, 52, 44]. In vision tasks, prompt-based methods often leverage a pre-trained ViT as a feature extractor $f_\theta$, with its parameters $\theta$ typically frozen. These methods enhance the model by introducing prompts, small sets of learnable parameters that influence the operations of the MSA layer [53]. Prompts are strategically injected into the query, key, and value matrices to guide the ViT in learning new tasks. We denote the prompt parameters by $p \in \mathbb{R}^{L_p \times d}$, where $L_p$ is the sequence length and $d$ is the embedding dimension. Previous work [53] outlines two main prompt-based approaches: Prompt Tuning (ProT) [25] and Prefix Tuning (PreT) [26].
While Prompt Tuning directly concatenates the same prompt parameter $p$ to the query, key, and value, prefix tuning splits $p$ into prefixes $\{p_K, p_V\} \subset \mathbb{R}^{\frac{L_p}{2} \times d}$ and appends them to the key and value vectors:
$$f_{\mathrm{prompt}}^{\mathrm{PreT}}(p, X_Q, X_K, X_V) := \mathrm{MSA}(X_Q, [p_K; X_K], [p_V; X_V]) = \mathrm{Concat}(\tilde{h}_1, \ldots, \tilde{h}_m) W^O. \quad (5)$$

Existing prompt-based methods in CL address catastrophic forgetting by creating new adaptive prompts for each new task. During testing, the model chooses suitable prompt combinations to handle unseen data from any encountered task [49]. L2P [54] proposes a shared prompt pool for all tasks, utilizing a query-key mechanism for prompt selection. Instead of using the same prompt pool across tasks, DualPrompt [53] introduces G-Prompt and E-Prompt to capture task-agnostic and task-specific information, respectively. S-Prompt [52] focuses on learning task-specific prompts and employs a ProT strategy similar to L2P. CODA-Prompt [42] expands the prompt pool across tasks and performs a weighted summation of the prompt pool using attention factors. A recent work, HiDe-Prompt [49], achieves state-of-the-art performance by introducing a new hierarchical decomposition of CIL objectives and optimizing each component for better performance.

In this study, we focus on prefix tuning as our primary prompt-based methodology and follow the framework presented in HiDe-Prompt [49]. During training, HiDe-Prompt co-optimizes task-specific prompts $p_t$ and the model's output layer parameters $\psi$ for each new task $t$ using the WTP objective. These prompts are stored within a prompt pool $P = \{p_1, \ldots, p_T\}$. At test time, a separate lightweight auxiliary output layer $\hat{h}_\omega: \mathbb{R}^D \to \mathbb{R}^T$, trained with the TII objective, takes the uninstructed representation $f_\theta(x)$ of a new data point $x$ as input to infer the task identity, guiding the selection of the most suitable prompt $p_k$ from the prompt pool $P$. Subsequently, the final prediction is given as $\hat{y} = h_\psi(f_\theta(x, p_k))$. For further details, please refer to Appendix D.
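As an illustrative sketch of equation (5), a minimal NumPy implementation of one prefix-tuned head is given below. This is a hypothetical sketch for intuition, not the authors' code; the function names and random weights are assumptions.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Definition 2.1): softmax(Q K^T / sqrt(d_k)) V."""
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def prefix_head(XQ, XK, XV, pK, pV, WQ, WK, WV):
    """One prefix-tuned head of equation (5): the prefixes p_K, p_V are prepended
    to the key/value inputs, while the query sequence is left unchanged."""
    return attention(XQ @ WQ,
                     np.vstack([pK, XK]) @ WK,
                     np.vstack([pV, XV]) @ WV)

rng = np.random.default_rng(0)
N, L, d, dk = 5, 2, 8, 8                     # sequence length, prefix length, dims
X = rng.normal(size=(N, d))                  # X_Q = X_K = X_V for self-attention
pK, pV = rng.normal(size=(L, d)), rng.normal(size=(L, d))
WQ, WK, WV = (rng.normal(size=(d, dk)) for _ in range(3))

out = prefix_head(X, X, X, pK, pV, WQ, WK, WV)
assert out.shape == (N, dk)   # output length stays N; only keys/values grow to N + L
```

Note that with empty prefixes the head reduces exactly to the frozen pre-trained attention, which is why only $p_K$ and $p_V$ need to be learned per task.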
Mixture of experts (MoE) extends classical mixture models with an adaptive gating mechanism [19, 21]. An MoE model consists of a group of $N$ expert networks $f_i: \mathbb{R}^d \to \mathbb{R}^{d_v}$, for all $i \in [N]$, and a gate function $G: \mathbb{R}^d \to \mathbb{R}^N$. Given an input $h \in \mathbb{R}^d$, the MoE computes a weighted sum of the expert outputs $f_i(h)$ based on a learned score function $s_i: \mathbb{R}^d \to \mathbb{R}$ for each expert:
$$\sum_{j=1}^{N} G(h)_j\, f_j(h) := \sum_{j=1}^{N} \frac{\exp(s_j(h))}{\sum_{\ell=1}^{N} \exp(s_\ell(h))}\, f_j(h), \quad (6)$$
where $G(h) := \mathrm{softmax}(s_1(h), \ldots, s_N(h))$. Building on this concept, works by [10, 41] established the MoE layer as a fundamental building block to scale up model capacity efficiently. Please refer to Appendix C for a comprehensive discussion of related works.

Figure 1: An illustrative depiction of the relationship between self-attention and MoE. Each output vector of a head in the MSA layer can be viewed as the output of an MoE model. These MoE models share the same set of experts encoded in the value matrix. Each entry in the attention matrix corresponds to a score function within this architecture.

3 Connection between Prefix Tuning and Mixture of Experts

We first explore the relationship between attention and mixture of experts in Section 3.1, followed by establishing the connection between prefix tuning and mixture of experts in Section 3.2.

3.1 Mixture of Experts Meets Attention

Following the notation established in Definition 2.2, let us consider the $l$-th head within the MSA layer. Let $X = [x_1^{\top}, \ldots, x_N^{\top}]^{\top} \in \mathbb{R}^{Nd}$, which is the concatenation of the input sequence embeddings into a single one-dimensional vector. We define the matrix $E_i \in \mathbb{R}^{d \times Nd}$ such that $E_i X := x_i$ for all $i \in [N]$.
Furthermore, we introduce an MoE architecture consisting of a group of $N$ expert networks $f_j: \mathbb{R}^{Nd} \to \mathbb{R}^{d_v}$ and $N$ gating functions $G_i: \mathbb{R}^{Nd} \to \mathbb{R}^N$, where the score function for the $j$-th expert of the $i$-th gate is $s_{i,j}: \mathbb{R}^{Nd} \to \mathbb{R}$, with
$$f_j(X) := (W_l^V)^{\top} E_j X = (W_l^V)^{\top} x_j, \qquad s_{i,j}(X) := \frac{X^{\top} E_i^{\top} W_l^Q (W_l^K)^{\top} E_j X}{\sqrt{d_v}} = \frac{x_i^{\top} W_l^Q (W_l^K)^{\top} x_j}{\sqrt{d_v}}$$
for $i, j \in [N]$. From equation (4), we can express the output of the $l$-th head as
$$h_l = \mathrm{softmax}\left(\frac{X_Q W_l^Q (W_l^K)^{\top} X_K^{\top}}{\sqrt{d_v}}\right) X_V W_l^V = [h_{l,1}, \ldots, h_{l,N}]^{\top} \in \mathbb{R}^{N \times d_v}, \quad (7)$$
$$h_{l,i} = \sum_{j=1}^{N} \frac{\exp\!\left(\frac{x_i^{\top} W_l^Q (W_l^K)^{\top} x_j}{\sqrt{d_v}}\right)}{\sum_{k=1}^{N} \exp\!\left(\frac{x_i^{\top} W_l^Q (W_l^K)^{\top} x_k}{\sqrt{d_v}}\right)}\, f_j(X) = \sum_{j=1}^{N} \frac{\exp(s_{i,j}(X))}{\sum_{k=1}^{N} \exp(s_{i,k}(X))}\, f_j(X), \quad (8)$$
for $i \in [N]$. Expanding on equation (8), we can discern that each attention head within the MSA layer implicitly embodies a special mixture of experts architecture. This architecture encompasses $N$ MoE models, each featuring its own quadratic gating function $G_i$. However, instead of employing $N^2$ separate expert networks for each model, this architecture utilizes $N$ shared linear expert networks $f_j$ for $j \in [N]$, significantly reducing the number of parameters. Notably, each expert network and its corresponding gating function process the entire input sequence directly, rather than individual embeddings $x_i$ as in traditional MoE layers [41]. This connection between self-attention and mixture of experts is depicted in Figure 1. In the subsequent section, we explore how prompt-based techniques can be viewed through this lens.

Figure 2: Left: An illustrative depiction of prefix tuning as the introduction of new experts into pre-trained MoE models. Right: Visualization of the NoRGa implementation, integrating non-linear activation and residual connections into the prefix tuning attention matrix.
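To make the correspondence in equation (8) concrete, the following sketch verifies it numerically. This is a hypothetical NumPy check with randomly drawn weights standing in for a pre-trained head; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, dv = 3, 6, 6                                  # tokens, embed dim, head dim
X = rng.normal(size=(N, d))                         # token embeddings x_1, ..., x_N
WQ, WK, WV = (rng.normal(size=(d, dv)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Standard head output, as in equation (7)
A = (X @ WQ) @ (X @ WK).T / np.sqrt(dv)             # attention matrix of the head
head = np.stack([softmax(A[i]) @ (X @ WV) for i in range(N)])

# MoE view of equation (8): shared experts f_j(X) = (W_V)^T x_j with
# quadratic scores s_ij(X) = x_i^T W_Q (W_K)^T x_j / sqrt(d_v)
experts = [WV.T @ X[j] for j in range(N)]
for i in range(N):
    s = np.array([X[i] @ WQ @ WK.T @ X[j] / np.sqrt(dv) for j in range(N)])
    gate = softmax(s)                               # G_i(X) = softmax(s_i1, ..., s_iN)
    moe_row = sum(g * f for g, f in zip(gate, experts))
    assert np.allclose(moe_row, head[i])            # row i of the head == MoE output
```

The assertion passing for every row confirms that each output vector of the head is exactly a softmax-gated mixture of the $N$ shared linear experts.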
3.2 Prefix Tuning via the Perspective of Mixture of Experts

Building on the connection between self-attention and mixture of experts, we propose that applying prefix tuning can be interpreted as the introduction of new experts that customize the pre-trained model for a specific task, as illustrated in Figure 2. Specifically, as in Section 3.1, we consider the $l$-th head within the MSA layer. We denote $p_K = [p_1^{K\top}, \ldots, p_L^{K\top}]^{\top} \in \mathbb{R}^{L \times d}$ and $p_V = [p_1^{V\top}, \ldots, p_L^{V\top}]^{\top} \in \mathbb{R}^{L \times d}$, where $L = \frac{L_p}{2}$. We define new prefix experts $f_{N+j}: \mathbb{R}^{Nd} \to \mathbb{R}^{d_v}$ along with their corresponding new score functions $s_{i,N+j}: \mathbb{R}^{Nd} \to \mathbb{R}$ as follows:
$$f_{N+j}(X) := (W_l^V)^{\top} p_j^V, \qquad s_{i,N+j}(X) := \frac{X^{\top} E_i^{\top} W_l^Q (W_l^K)^{\top} p_j^K}{\sqrt{d_v}} = \frac{x_i^{\top} W_l^Q (W_l^K)^{\top} p_j^K}{\sqrt{d_v}} \quad (9)$$
for $i \in [N]$ and $j \in [L]$. Then, from equation (5), the output of the $l$-th head can be expressed as
$$\tilde{h}_l = \mathrm{Attention}\big(X_Q W_l^Q, [p_K; X_K] W_l^K, [p_V; X_V] W_l^V\big) = [\tilde{h}_{l,1}, \ldots, \tilde{h}_{l,N}]^{\top} \in \mathbb{R}^{N \times d_v}, \quad (10)$$
$$\tilde{h}_{l,i} = \sum_{j=1}^{N} \frac{\exp(s_{i,j}(X))}{\sum_{k=1}^{N} \exp(s_{i,k}(X)) + \sum_{k'=1}^{L} \exp(s_{i,N+k'}(X))}\, f_j(X) + \sum_{j'=1}^{L} \frac{\exp(s_{i,N+j'}(X))}{\sum_{k=1}^{N} \exp(s_{i,k}(X)) + \sum_{k'=1}^{L} \exp(s_{i,N+k'}(X))}\, f_{N+j'}(X). \quad (11)$$

It is worth noting that $W_l^Q$, $W_l^K$, and $W_l^V$ remain fixed, with only $p_K$ and $p_V$ being learnable. By examining equations (8) and (11), we can interpret each head in a multi-head self-attention layer within a pre-trained model as a mixture of experts architecture with pre-trained experts $f_j$ and gating score functions $s_{i,j}$ for $i, j \in [N]$. Prefix tuning extends this MoE by introducing $L$ additional prefix experts $f_{N+j}$ defined by the prefix vectors $p_j^V$, together with linear score functions $s_{i,N+j}$ for $i \in [N]$ and $j \in [L]$. These new experts collaborate with the pre-trained ones within the MoE model, facilitating the model's adaptation to downstream tasks. We argue that this novel connection between self-attention, prefix tuning, and MoE offers a fresh perspective on the design of previous prompt-based continual learning methods.
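The identity in equation (11) can likewise be checked numerically. Below is a hypothetical NumPy sketch (random weights and prefixes, illustrative only) confirming that a prefix-tuned head equals an MoE with $N$ pre-trained experts plus $L$ prefix experts competing inside one softmax gate:

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, d, dv = 3, 2, 4, 4
X = rng.normal(size=(N, d))                                 # token embeddings
pK, pV = rng.normal(size=(L, d)), rng.normal(size=(L, d))   # learnable prefixes
WQ, WK, WV = (rng.normal(size=(d, dv)) for _ in range(3))   # frozen projections

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Prefix-tuned head, as in equation (10): keys/values extended by the prefixes
K_ext, V_ext = np.vstack([pK, X]) @ WK, np.vstack([pV, X]) @ WV
A = (X @ WQ) @ K_ext.T / np.sqrt(dv)
head = np.stack([softmax(A[i]) @ V_ext for i in range(N)])

# MoE view of equation (11): N pre-trained experts plus L prefix experts
for i in range(N):
    s_new = [X[i] @ WQ @ WK.T @ pK[j] / np.sqrt(dv) for j in range(L)]  # s_{i,N+j}
    s_pre = [X[i] @ WQ @ WK.T @ X[j] / np.sqrt(dv) for j in range(N)]   # s_{i,j}
    gate = softmax(np.array(s_new + s_pre))        # prefix scores first, matching K_ext
    moe_row = sum(gate[j] * (WV.T @ pV[j]) for j in range(L)) \
            + sum(gate[L + j] * (WV.T @ X[j]) for j in range(N))
    assert np.allclose(moe_row, head[i])
```

Note how the prefix experts $(W_l^V)^{\top} p_j^V$ are constant in $X$: the same two expert vectors are reused for every query position, with only their gate weights varying.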
In the context of continual learning, the pre-trained experts serve as a knowledge base, while prefix tuning augments it with task-specific knowledge encoded in new experts. Moreover, we draw a parallel between the pre-trained experts and the G(eneral)-Prompt utilized in DualPrompt, which captures task-agnostic information [53]. Both are shared across tasks, making them useful for prediction, especially when task identification is incorrect. Notably, the new experts achieve their efficiency through simple linear gating functions and their independence from the input, unlike the pre-trained experts. For simplicity, we refer to the MoE model in equation (11) as the linear gating prefix MoE.

Statistical suboptimality. The connection between prefix tuning and MoE within the linear gating prefix MoE model (11) allows us to theoretically explore the statistical behavior of prefix tuning. In Appendix A, by interpreting the linear gating prefix MoE as a regression problem with sample size $n$, we demonstrate that the convergence rate for estimating the model parameters, e.g., the prompts, can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$, where $\tau > 0$ is some constant. This suggests that a huge amount of data is required to achieve reasonable parameter estimation in the linear gating prefix MoE model, which can be discouraging in practice. To address this statistical limitation, the next section introduces a novel non-linear residual gating score function to replace the linear gating function.

4 Non-linear Residual Gate Meets Prefix Tuning

As discussed earlier, prefix tuning introduces additional experts within the MoE framework, resulting in the linear gating prefix MoE model. However, as outlined in Appendix A, this approach suffers from suboptimal sample efficiency for parameter estimation.
To overcome this and enhance overall CIL performance, we propose an innovative approach that significantly improves sample efficiency while promoting WTP performance in Section 4.1, and we provide theoretical explanations in Section 4.2.

4.1 NoRGa: Non-linear Residual Gate

We propose a simple yet effective modification to the linear gating prefix MoE model by incorporating a non-linear activation and a residual connection within the score functions of the prefix experts:
$$\hat{s}_{i,N+j}(X) := \frac{X^{\top} E_i^{\top} W_l^Q (W_l^K)^{\top} p_j^K}{\sqrt{d_v}} + \alpha\, \sigma\!\left(\tau\, \frac{X^{\top} E_i^{\top} W_l^Q (W_l^K)^{\top} p_j^K}{\sqrt{d_v}}\right) = s_{i,N+j}(X) + \alpha\, \sigma(\tau\, s_{i,N+j}(X)), \quad i \in [N],\ j \in [L], \quad (12)$$
where $\alpha, \tau \in \mathbb{R}$ are learnable scalar factors and $\sigma$ is a non-linear activation function. The new score function in equation (12) consists of a linear and a non-linear component. We call the new prefix MoE model with the score functions (12) the non-linear residual gating prefix MoE. The use of a non-linear activation function here is motivated by the algebraic independence condition in Definition 4.2, which theoretically guarantees the optimal sample efficiency of expert and parameter estimation (cf. Theorem 4.3). It is worth noting that removing the linear component $s_{i,N+j}(X)$ from the score function (12) could potentially lead to the vanishing gradient problem during training. To mitigate this challenge, we incorporate a residual connection [14] into the formulation. Our modification introduces minimal additional parameters ($\alpha$ and $\tau$) compared to the original score function, ensuring parameter efficiency. This is particularly crucial in continual learning scenarios, where the number of parameters grows with each new task. For implementation, we define $H_l = W_l^Q (W_l^K)^{\top}$. From equation (5), the attention matrix of the $l$-th head can then be written as
$$A_l = \frac{X_Q H_l [p_K; X_K]^{\top}}{\sqrt{d_v}} = \left[\frac{X_Q H_l p_K^{\top}}{\sqrt{d_v}},\ \frac{X_Q H_l X_K^{\top}}{\sqrt{d_v}}\right] = [A_l^{\mathrm{prompt}},\ A_l^{\mathrm{pretrain}}]. \quad (13)$$
Here, $A_l^{\mathrm{prompt}}$ denotes the attention score matrix for the prompts, and $A_l^{\mathrm{pretrain}}$ represents the attention score matrix for the pre-trained experts.
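In code, the residual-gated score of equation (12) amounts to a one-line, element-wise transform of the prompt block of the attention matrix. The sketch below is a hypothetical NumPy illustration (the matrices and the values of `alpha`, `tau` are made up), not the authors' implementation:

```python
import numpy as np

def norga_gate(A_prompt, alpha, tau, sigma=np.tanh):
    """NoRGa score (equation (12)): s + alpha * sigma(tau * s), applied
    element-wise to the prompt part of the attention matrix."""
    return A_prompt + alpha * sigma(tau * A_prompt)

# Illustrative attention blocks: A_l = [A_prompt, A_pretrain] as in equation (13)
rng = np.random.default_rng(0)
A_prompt   = rng.normal(size=(4, 2))   # X_Q H_l p_K^T / sqrt(d_v): 4 queries, 2 prefixes
A_pretrain = rng.normal(size=(4, 4))   # X_Q H_l X_K^T / sqrt(d_v): left untouched

A_hat = np.hstack([norga_gate(A_prompt, alpha=1.0, tau=1.0), A_pretrain])
assert A_hat.shape == (4, 6)
# With alpha = 0 the gate reduces exactly to plain prefix tuning:
assert np.allclose(norga_gate(A_prompt, alpha=0.0, tau=1.0), A_prompt)
```

Because the transform touches only the prompt columns before the row-wise softmax, the pre-trained scores and all frozen weights are left intact.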
To implement NoRGa, we can directly modify the final attention matrix as follows:
$$\hat{A}_l = [\hat{A}_l^{\mathrm{prompt}},\ A_l^{\mathrm{pretrain}}], \quad (14)$$
$$\hat{A}_l^{\mathrm{prompt}} = A_l^{\mathrm{prompt}} + \alpha\, \sigma(\tau A_l^{\mathrm{prompt}}). \quad (15)$$
The implementation of NoRGa is illustrated in Figure 2. Despite its simplicity, our modification can significantly enhance sample efficiency and promote more reasonable parameter estimation, as demonstrated in our theoretical analysis in Section 4.2. Within the HiDe-Prompt framework, task-specific prompt parameters are trained using the WTP objective for each new task. Consequently, our modification leads to better parameter estimation, which directly contributes to improved WTP performance, ultimately improving overall continual learning efficacy. Importantly, NoRGa maintains the same parameter count as HiDe-Prompt, which is crucial in CL because of the memory constraint. Here, we evaluated $\sigma$ with tanh, sigmoid, and GELU, finding tanh to perform well in most cases.

4.2 Theoretical Explanation for Non-linear Residual Gating Prefix MoE

Similar to the setting in Appendix A, we prove that estimating the parameters of the non-linear residual gating prefix MoE model (12) is statistically efficient in terms of the number of data points. To provide a fair comparison with the linear gating prefix MoE, we focus only on the first head and its first row, namely $l = 1$ and $i = 1$ in equation (12). We then provide a theoretical justification of our claim by viewing this row as the output of a regression setting. In particular, we assume that $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n) \in \mathbb{R}^{Nd} \times \mathbb{R}$ are i.i.d. samples generated from the model
$$Y_i = g_{G_*}(X_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n, \quad (16)$$
where $\varepsilon_1, \ldots, \varepsilon_n$ are independent Gaussian noise variables such that $\mathbb{E}[\varepsilon_i \mid X_i] = 0$ and $\mathrm{Var}(\varepsilon_i \mid X_i) = \nu^2$ for all $1 \leq i \leq n$. Additionally, we assume that $X_1, X_2, \ldots, X_n$ are i.i.d. samples from some probability distribution $\mu$.
The regression function $g_{G_*}(\cdot)$ in equation (16) then takes the form of a non-linear residual gating prefix MoE model with $N$ pre-trained experts and $L_*$ unknown experts,
$$g_{G_*}(X) = \sum_{j=1}^{N} \frac{\exp(X^{\top} B_j^0 X + c_j^0)}{T(X)}\, h(X, \eta_j^0) + \sum_{j'=1}^{L_*} \frac{\exp\big((\beta_{1j'}^*)^{\top} X + \alpha\, \sigma(\tau (\beta_{1j'}^*)^{\top} X) + \beta_{0j'}^*\big)}{T(X)}\, h(X, \eta_{j'}^*), \quad (17)$$
where $T(X) := \sum_{k=1}^{N} \exp(X^{\top} B_k^0 X + c_k^0) + \sum_{k'=1}^{L_*} \exp\big((\beta_{1k'}^*)^{\top} X + \alpha\, \sigma(\tau (\beta_{1k'}^*)^{\top} X) + \beta_{0k'}^*\big)$, and $G_* := \sum_{j'=1}^{L_*} \exp(\beta_{0j'}^*)\, \delta_{(\beta_{1j'}^*, \eta_{j'}^*)}$ denotes a mixing measure, i.e., a weighted sum of Dirac measures $\delta$, associated with the unknown parameters $(\beta_{1j'}^*, \beta_{0j'}^*, \eta_{j'}^*)_{j'=1}^{L_*}$ in $\mathbb{R}^{Nd} \times \mathbb{R} \times \mathbb{R}^q$. Here, the matrix $B_j^0$ is equivalent to $E_1^{\top} W_1^Q (W_1^K)^{\top} E_j / \sqrt{d_v}$ in the score function $s_{1,j}(X)$, and the vector $\beta_{1j'}^*$ corresponds to the vector $E_1^{\top} W_1^Q (W_1^K)^{\top} p_{j'}^K / \sqrt{d_v}$ in $\hat{s}_{1,N+j'}(X)$. Furthermore, the experts $h(X, \eta_j^0)$ and $h(X, \eta_{j'}^*)$ represent $f_j(X)$ and $f_{N+j'}(X)$, respectively. In our formulation, for the generality of the ensuing theory, we consider general parametric forms of the experts $h(X, \eta_j^0)$ and $h(X, \eta_{j'}^*)$, i.e., we do not constrain these expert functions to the forms of the simple experts in the aforementioned model. Similar to the setting in Appendix A, $B_j^0$, $c_j^0$, and the expert parameters $\eta_j^0$ are known. Our goal is to estimate the unknown prompt-related parameters $\beta_{1j'}^*$, $\beta_{0j'}^*$, and $\eta_{j'}^*$.

Least squares estimation. We use the least squares method [45] to estimate the unknown parameters $(\beta_{0j'}^*, \beta_{1j'}^*, \eta_{j'}^*)_{j'=1}^{L_*}$ or, equivalently, the ground-truth mixing measure $G_*$. In particular, we consider the estimator
$$\widehat{G}_n := \arg\min_{G \in \mathcal{G}_{L'}(\Theta)} \sum_{i=1}^{n} \big(Y_i - g_G(X_i)\big)^2, \quad (18)$$
where we denote $\mathcal{G}_{L'}(\Theta) := \{G = \sum_{i=1}^{\ell} \exp(\beta_{0i})\, \delta_{(\beta_{1i}, \eta_i)} : 1 \leq \ell \leq L',\ (\beta_{0i}, \beta_{1i}, \eta_i) \in \Theta\}$ as the set of all mixing measures with at most $L'$ atoms. In practice, since the true number of experts $L_*$ is typically unknown, we assume that the number of fitted experts $L'$ is sufficiently large, i.e., $L' > L_*$.
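As an illustrative toy instance of the estimator (18), the following hypothetical one-dimensional sketch uses one known pre-trained expert and one prefix expert; all parameter values are made up for the demonstration. Least squares (here, a crude grid search standing in for the arg-min) recovers the prompt-related parameters from noisy draws of the model (16):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x, b1, eta):
    """Toy scalar version of the regression function (17): one known
    pre-trained expert h0(x) = x with quadratic score, and one prefix expert
    h(x, eta) = eta * x with a tanh residual-gated score (alpha = tau = 1)."""
    s0 = -0.5 * x**2                    # known pre-trained score  x^T B0 x + c0
    s1 = b1 * x + np.tanh(b1 * x)       # NoRGa score  beta1*x + sigma(tau*beta1*x)
    m = np.maximum(s0, s1)
    w0, w1 = np.exp(s0 - m), np.exp(s1 - m)
    return (w0 * x + w1 * eta * x) / (w0 + w1)

# Generate data from the true parameters, then recover them via least squares (18)
b1_true, eta_true, n = 1.0, 2.0, 200
x = rng.uniform(-2, 2, size=n)
y = g(x, b1_true, eta_true) + rng.normal(scale=0.01, size=n)

grid_b1, grid_eta = np.linspace(-2, 2, 17), np.linspace(0, 3, 13)
_, b1_hat, eta_hat = min((np.sum((y - g(x, b1, eta))**2), b1, eta)
                         for b1 in grid_b1 for eta in grid_eta)
assert abs(b1_hat - b1_true) < 0.3 and abs(eta_hat - eta_true) < 0.3
```

The point of Theorems 4.1 and 4.3 is to quantify how fast such estimates concentrate around the truth as $n$ grows under the non-linear residual gate.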
To begin with, we explore the convergence behavior of the regression estimator $g_{\widehat{G}_n}(\cdot)$ toward the true regression function $g_{G_*}(\cdot)$ under the $L^2(\mu)$-norm in the following theorem:

Theorem 4.1 (Regression Estimation Rate). Equipped with the least squares estimator $\widehat{G}_n$ given in equation (18), the model estimate $g_{\widehat{G}_n}(\cdot)$ converges to the true model $g_{G_*}(\cdot)$ at the following rate:
$$\|g_{\widehat{G}_n} - g_{G_*}\|_{L^2(\mu)} = \mathcal{O}_P\big(\sqrt{\log(n)/n}\big). \quad (19)$$

The proof of Theorem 4.1 is in Appendix B.1. The bound (19) implies that the rate for estimating the regression function $g_{G_*}(\cdot)$ is of order $\mathcal{O}_P(\sqrt{\log(n)/n})$, which is parametric in the sample size $n$. More importantly, it also indicates that if there exists a loss function among parameters $\mathcal{L}$ such that $\|g_{\widehat{G}_n} - g_{G_*}\|_{L^2(\mu)} \gtrsim \mathcal{L}(\widehat{G}_n, G_*)$, then we would obtain the bound $\mathcal{L}(\widehat{G}_n, G_*) = \mathcal{O}_P(\sqrt{\log(n)/n})$, which leads to the desired parameter and expert estimation rates.

We now turn our attention to the parameter and expert estimation problems. To understand how the non-linear residual gating affects these problems, we analyze the properties of the expert $h(\cdot, \eta)$ and the activation function $\sigma(\cdot)$ to determine which formulations achieve favorable performance.

Definition 4.2 (Algebraic independence). We say that an expert function $h(\cdot, \eta)$ and an activation function $\sigma(\cdot)$ are algebraically independent if they are twice differentiable w.r.t. their parameters, and if for any $k \geq 1$ and pairwise distinct parameters $(\beta_{11}, \eta_1), \ldots, (\beta_{1k}, \eta_k)$, the following set of functions in $X$ is linearly independent for almost every $X \in \mathbb{R}^{Nd}$:
$$\Big\{ X^{\nu} \Big[ \big(1 + \sigma'(\beta_{1j}^{\top} X)\big)^{|\nu|} + \mathbf{1}_{\{|\nu| = 2\}}\, \sigma''(\beta_{1j}^{\top} X) \Big] \frac{\partial^{|\gamma|} h}{\partial \eta^{\gamma}}(X, \eta_j) : j \in [k],\ \nu \in \mathbb{N}^{Nd},\ \gamma \in \mathbb{N}^{q},\ 0 \leq |\nu| + |\gamma| \leq 2 \Big\}.$$

Intuitively, the algebraic independence condition ensures that there are no interactions among the parameters of the expert function $h(\cdot, \eta)$ and the activation function $\sigma(\cdot)$.
Technically, a key step in our argument is to decompose the regression discrepancy $g_{\widehat{G}_n}(X) - g_{G_*}(X)$ into a combination of linearly independent terms by applying Taylor expansions to the product of the softmax's numerator and the expert function, i.e., $\exp(\beta_1^{\top} X + \alpha\, \sigma(\tau \beta_1^{\top} X))\, h(X, \eta)$. Thus, the above condition guarantees that all the derivative terms in the Taylor expansion are linearly independent. To exemplify the algebraic independence condition, we consider the following simple examples of expert functions $h(\cdot, \eta)$ and activations $\sigma(\cdot)$ that are algebraically independent.

Example. When the expert function $h(\cdot, \eta)$ is formulated as a neural network $h(X, (a, b)) = \varphi(a^{\top} X + b)$ with activation $\varphi(\cdot) \in \{\mathrm{ReLU}(\cdot), \mathrm{GELU}(\cdot), z \mapsto z^p\}$, where $(a, b) \in \mathbb{R}^{Nd} \times \mathbb{R}$, and the activation function $\sigma(\cdot)$ is one of $\mathrm{sigmoid}(\cdot)$, $\tanh(\cdot)$, and $\mathrm{GELU}(\cdot)$, then they satisfy the algebraic independence condition in Definition 4.2.

Finally, we establish the rates for estimating parameters and experts in the non-linear residual gating prefix MoE model in Theorem 4.3. Prior to presenting the theorem statement, we design a loss function among parameters based on the notion of Voronoi cells [27], a commonly employed approach for the convergence analysis of expert estimation in MoE models [36, 34, 35, 33], tailored here to the setting of this paper. In particular, the Voronoi loss used for our analysis is defined as
$$\mathcal{L}_1(G, G_*) := \sum_{j \in [L_*]: |\mathcal{V}_j| > 1}\ \sum_{i \in \mathcal{V}_j} \exp(\beta_{0i}) \big[ \|\Delta \beta_{1ij}\|^2 + \|\Delta \eta_{ij}\|^2 \big] + \sum_{j \in [L_*]: |\mathcal{V}_j| = 1}\ \sum_{i \in \mathcal{V}_j} \exp(\beta_{0i}) \big[ \|\Delta \beta_{1ij}\| + \|\Delta \eta_{ij}\| \big] + \sum_{j \in [L_*]} \Big| \sum_{i \in \mathcal{V}_j} \exp(\beta_{0i}) - \exp(\beta_{0j}^*) \Big|, \quad (20)$$
where we denote $\Delta \beta_{1ij} := \beta_{1i} - \beta_{1j}^*$ and $\Delta \eta_{ij} := \eta_i - \eta_j^*$. Above, $\mathcal{V}_j \equiv \mathcal{V}_j(G)$, for $j \in [L_*]$, is a Voronoi cell associated with the mixing measure $G$, generated by the true component $\omega_j^* := (\beta_{1j}^*, \eta_j^*)$ and defined as
$$\mathcal{V}_j := \{ i \in \{1, 2, \ldots, L'\} : \|\omega_i - \omega_j^*\| \leq \|\omega_i - \omega_\ell^*\|,\ \forall \ell \neq j \}, \quad (21)$$
where $\omega_i := (\beta_{1i}, \eta_i)$ denotes a component of $G$.
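As a small numerical illustration of equation (21) (with hypothetical values chosen only to show the cell construction), the Voronoi cells assign each fitted component to its nearest true component:

```python
import numpy as np

# True components omega*_j = (beta*_1j, eta*_j); here L* = 2 scalar pairs
true_omega = np.array([[0.0, 0.0],
                       [3.0, 3.0]])
# Fitted components omega_i of G, with L' = 3 > L* (over-specified fit)
fitted_omega = np.array([[0.1, -0.1],
                         [0.2,  0.1],
                         [2.9,  3.2]])

# Voronoi cell V_j collects the indices i whose omega_i is closest to omega*_j
dist = np.linalg.norm(fitted_omega[:, None, :] - true_omega[None, :, :], axis=-1)
assignment = dist.argmin(axis=1)          # nearest true component per fitted atom
cells = {j: np.where(assignment == j)[0].tolist() for j in range(len(true_omega))}

assert cells == {0: [0, 1], 1: [2]}       # V_1 over-specified (|V_1| = 2), V_2 exact
```

In the Voronoi loss (20), the two atoms in the over-specified cell would be penalized through squared differences, while the single atom in the exactly-specified cell enters with first-order terms, which is precisely what produces the two different rates in Theorem 4.3.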
Note that the cardinality of each Voronoi cell $\mathcal{V}_j$ indicates the number of components $\omega_i$ of $G$ approximating the true component $\omega_j^*$ of $G_*$. Additionally, since $\mathcal{L}_1(G, G_*) = 0$ if and only if $G \equiv G_*$, it follows that when $\mathcal{L}_1(G, G_*)$ becomes sufficiently small, the differences $\Delta \beta_{1ij}$ and $\Delta \eta_{ij}$ are also small. This observation indicates that, although $\mathcal{L}_1(G, G_*)$ is not a proper metric as it is not symmetric, it is an appropriate loss function for measuring the discrepancy between the least squares estimator $\widehat{G}_n$ and the true mixing measure $G_*$.

Theorem 4.3. Assume that the expert function $h(\cdot, \eta)$ and the activation $\sigma(\cdot)$ are algebraically independent. Then, we achieve the following lower bound for any $G \in \mathcal{G}_{L'}(\Theta)$: $\|g_G - g_{G_*}\|_{L^2(\mu)} \gtrsim \mathcal{L}_1(G, G_*)$, which together with Theorem 4.1 indicates that $\mathcal{L}_1(\widehat{G}_n, G_*) = \widetilde{\mathcal{O}}_P(n^{-1/2})$.

The proof of Theorem 4.3 is in Appendix B.2. A few comments on Theorem 4.3 are in order: (i) From the bound $\mathcal{L}_1(\widehat{G}_n, G_*) = \widetilde{\mathcal{O}}_P(n^{-1/2})$, we deduce that the estimation rates for the over-specified parameters $\beta_{1j}^*$, $\eta_j^*$, where $j \in [L_*]: |\mathcal{V}_j| > 1$, are all of order $\mathcal{O}_P(\sqrt[4]{\log(n)/n})$. Since the expert $h(\cdot, \eta)$ is twice differentiable over a bounded domain, it is also a Lipschitz function. Thus, denote

Table 1: Overall performance comparison on Split CIFAR-100 and Split ImageNet-R. We present the Final Average Accuracy (FA), Cumulative Average Accuracy (CA), and Average Forgetting Measure (FM) of all methods under different pre-trained models.
| PTM | Method | CIFAR-100 FA (↑) | CIFAR-100 CA (↑) | CIFAR-100 FM (↓) | ImageNet-R FA (↑) | ImageNet-R CA (↑) | ImageNet-R FM (↓) |
|---|---|---|---|---|---|---|---|
| Sup-21K | L2P | 83.06 ± 0.17 | 88.27 ± 0.71 | 5.61 ± 0.32 | 67.53 ± 0.44 | 71.98 ± 0.52 | 5.84 ± 0.38 |
| Sup-21K | DualPrompt | 87.30 ± 0.27 | 91.23 ± 0.65 | 3.87 ± 0.43 | 70.93 ± 0.08 | 75.67 ± 0.52 | 5.47 ± 0.19 |
| Sup-21K | S-Prompt | 87.57 ± 0.42 | 91.38 ± 0.69 | 3.63 ± 0.41 | 69.88 ± 0.51 | 74.25 ± 0.55 | 4.73 ± 0.47 |
| Sup-21K | CODA-Prompt | 86.94 ± 0.63 | 91.57 ± 0.75 | 4.04 ± 0.18 | 70.03 ± 0.47 | 74.26 ± 0.24 | 5.17 ± 0.22 |
| Sup-21K | HiDe-Prompt | 92.61 ± 0.28 | 94.03 ± 0.01 | 1.50 ± 0.28 | 75.06 ± 0.12 | 76.60 ± 0.01 | 4.09 ± 0.13 |
| Sup-21K | NoRGa (Ours) | 94.48 ± 0.13 | 95.83 ± 0.37 | 1.44 ± 0.27 | 75.40 ± 0.39 | 79.52 ± 0.07 | 4.59 ± 0.07 |
| iBOT-21K | L2P | 79.13 ± 1.25 | 85.13 ± 0.05 | 7.50 ± 1.21 | 61.31 ± 0.50 | 68.81 ± 0.52 | 10.72 ± 0.40 |
| iBOT-21K | DualPrompt | 78.84 ± 0.47 | 86.16 ± 0.02 | 8.84 ± 0.67 | 58.69 ± 0.61 | 66.61 ± 0.67 | 11.75 ± 0.92 |
| iBOT-21K | S-Prompt | 79.14 ± 0.65 | 85.85 ± 0.17 | 8.23 ± 1.15 | 57.96 ± 1.10 | 66.42 ± 0.71 | 11.27 ± 0.72 |
| iBOT-21K | CODA-Prompt | 80.83 ± 0.27 | 87.02 ± 0.20 | 7.50 ± 0.25 | 61.22 ± 0.35 | 66.76 ± 0.37 | 9.66 ± 0.20 |
| iBOT-21K | HiDe-Prompt | 93.02 ± 0.15 | 94.56 ± 0.05 | 1.26 ± 0.13 | 70.83 ± 0.17 | 73.23 ± 0.08 | 6.77 ± 0.23 |
| iBOT-21K | NoRGa (Ours) | 94.76 ± 0.15 | 95.86 ± 0.31 | 1.34 ± 0.14 | 73.06 ± 0.26 | 77.46 ± 0.42 | 6.88 ± 0.49 |
| iBOT-1K | L2P | 75.51 ± 0.88 | 82.53 ± 1.10 | 6.80 ± 1.70 | 59.43 ± 0.28 | 66.83 ± 0.92 | 11.33 ± 1.25 |
| iBOT-1K | DualPrompt | 76.21 ± 1.00 | 83.54 ± 1.23 | 9.89 ± 1.81 | 60.41 ± 0.76 | 66.87 ± 0.41 | 9.21 ± 0.43 |
| iBOT-1K | S-Prompt | 76.60 ± 0.61 | 82.89 ± 0.89 | 8.60 ± 1.36 | 59.56 ± 0.60 | 66.60 ± 0.13 | 8.83 ± 0.81 |
| iBOT-1K | CODA-Prompt | 79.11 ± 1.02 | 86.21 ± 0.49 | 7.69 ± 1.57 | 66.56 ± 0.68 | 73.14 ± 0.57 | 7.22 ± 0.38 |
| iBOT-1K | HiDe-Prompt | 93.48 ± 0.11 | 95.02 ± 0.01 | 1.63 ± 0.10 | 71.33 ± 0.21 | 73.62 ± 0.13 | 7.11 ± 0.02 |
| iBOT-1K | NoRGa (Ours) | 94.01 ± 0.04 | 95.11 ± 0.35 | 1.61 ± 0.30 | 72.77 ± 0.20 | 76.55 ± 0.46 | 7.10 ± 0.39 |
| DINO-1K | L2P | 72.23 ± 0.35 | 79.71 ± 1.26 | 8.37 ± 2.30 | 57.21 ± 0.69 | 64.09 ± 0.74 | 7.47 ± 0.96 |
| DINO-1K | DualPrompt | 73.95 ± 0.49 | 81.85 ± 0.59 | 9.32 ± 1.42 | 57.98 ± 0.71 | 65.39 ± 0.27 | 9.32 ± 0.69 |
| DINO-1K | S-Prompt | 74.39 ± 0.17 | 81.60 ± 0.74 | 9.07 ± 1.13 | 57.55 ± 0.72 | 64.90 ± 0.13 | 8.73 ± 0.56 |
| DINO-1K | CODA-Prompt | 77.50 ± 0.64 | 84.81 ± 0.30 | 8.10 ± 0.01 | 63.15 ± 0.39 | 69.73 ± 0.25 | 6.86 ± 0.11 |
| DINO-1K | HiDe-Prompt | 92.51 ± 0.11 | 94.25 ± 0.01 | 1.67 ± 0.20 | 68.11 ± 0.18 | 71.70 ± 0.01 | 6.45 ± 0.58 |
| DINO-1K | NoRGa (Ours) | 93.43 ± 0.33 | 94.65 ± 0.62 | 1.65 ± 0.25 | 71.77 ± 0.44 | 75.76 ± 0.49 | 6.42 ± 0.68 |
| MoCo-1K | L2P | 77.24 ± 0.69 | 83.73 ± 0.70 | 5.57 ± 0.75 | 54.13 ± 0.67 | 62.09 ± 0.76 | 4.88 ± 0.42 |
| MoCo-1K | DualPrompt | 77.56 ± 0.63 | 84.37 ± 0.51 | 6.54 ± 0.50 | 54.45 ± 0.30 | 62.92 ± 0.41 | 5.34 ± 0.41 |
| MoCo-1K | S-Prompt | 77.20 ± 0.39 | 84.47 ± 0.37 | 7.00 ± 0.62 | 53.94 ± 0.32 | 62.42 ± 0.51 | 5.16 ± 0.48 |
| MoCo-1K | CODA-Prompt | 77.83 ± 0.34 | 84.97 ± 0.23 | 12.60 ± 0.02 | 55.75 ± 0.26 | 65.49 ± 0.36 | 10.46 ± 0.04 |
| MoCo-1K | HiDe-Prompt | 91.57 ± 0.20 | 93.70 ± 0.01 | 1.51 ± 0.17 | 63.77 ± 0.49 | 68.26 ± 0.01 | 9.37 ± 0.71 |
| MoCo-1K | NoRGa (Ours) | 93.52 ± 0.06 | 94.94 ± 0.29 | 1.63 ± 0.13 | 64.52 ± 0.16 | 70.21 ± 0.64 | 9.06 ± 0.19 |

$\widehat{G}_n := \sum_{i=1}^{L_n} \exp(\widehat{\beta}_{0i}^n)\, \delta_{(\widehat{\beta}_{1i}^n, \widehat{\eta}_i^n)}$; we achieve that
$$\sup_X \big| h(X, \widehat{\eta}_i^n) - h(X, \eta_j^*) \big| \lesssim \|\widehat{\eta}_i^n - \eta_j^*\| = \mathcal{O}_P\big(\sqrt[4]{\log(n)/n}\big). \quad (22)$$
The above bound indicates that if the experts $h(\cdot, \eta_j^*)$ are fitted by at least two other experts, then their estimation rates are also of order $\mathcal{O}_P(\sqrt[4]{\log(n)/n})$; (ii) for exactly-specified parameters $\beta_{1j}^*$, $\eta_j^*$, where $j \in [L_*]: |\mathcal{V}_j| = 1$, the rates for estimating them are faster than those of their over-specified counterparts, standing at order $\mathcal{O}_P(\sqrt{\log(n)/n})$. By arguing similarly to equation (22), the experts $h(\cdot, \eta_j^*)$ also enjoy the faster estimation rate of order $\mathcal{O}_P(\sqrt{\log(n)/n})$, which is parametric in the sample size $n$; (iii) it follows from the above rates that we only need a polynomial number of data points (roughly $\epsilon^{-4}$, where $\epsilon$ is the desired approximation error) to estimate the parameters and experts of the non-linear residual gating prefix MoE. By contrast, when using the linear gating, as demonstrated in Appendix A, an exponential number of data points is required. This highlights the statistical benefits of the non-linear residual gating prefix MoE model over the linear gating prefix MoE model.

5 Experiments

Datasets. We evaluate various continual learning methods on widely used CIL benchmarks, including Split CIFAR-100 [23] and Split ImageNet-R [23], consistent with prior work [49]. We further explore the model's performance on fine-grained classification tasks with Split CUB-200 [48] and on large inter-task differences with 5-Datasets [9]. Please refer to Appendix E for more details.

Evaluation Metrics. We utilize several established metrics described in [50].
These include: final average accuracy (FA), which represents the average accuracy after the final task; cumulative average accuracy (CA), which refers to the historical average accuracy; and the average forgetting measure (FM). We give more emphasis to FA and CA due to their comprehensiveness, as noted in [42].

Baselines. We compare our approach with several representative prompt-based approaches, including L2P [54], DualPrompt [53], CODA-Prompt [42], S-Prompt [52], and HiDe-Prompt [49]. Additionally, we evaluate against state-of-the-art pre-trained model-based continual learning methods, including ADAM [60] and RanPAC [29]. We further extend our evaluation by applying HiDe-Prompt with parameter-efficient fine-tuning techniques such as LoRA [17] and Adapters [16]. In line with [49], we utilize ViT checkpoints obtained with supervised pre-training on ImageNet-21K (denoted Sup-21K) as well as with self-supervised pre-training, namely iBOT-21K, iBOT-1K [62], DINO-1K [4], and MoCo-1K [7]. For implementation details, see Appendix E.

Table 2: Final average accuracy (FA) on Split CUB-200 and 5-Datasets.

| Method | Split CUB-200, Sup-21K | Split CUB-200, iBOT-21K | 5-Datasets, Sup-21K | 5-Datasets, iBOT-21K |
|---|---|---|---|---|
| L2P | 75.46 | 46.60 | 81.84 | 82.25 |
| DualPrompt | 77.56 | 45.93 | 77.91 | 68.03 |
| S-Prompt | 77.13 | 44.22 | 86.06 | 77.20 |
| CODA-Prompt | 74.34 | 47.79 | 64.18 | 51.65 |
| HiDe-Prompt | 86.56 | 78.23 | 93.83 | 94.88 |
| NoRGa (Ours) | 90.90 | 80.69 | 94.16 | 94.92 |

Table 3: Ablation study of different activation functions, measured by final average accuracy (FA).

| Method | Split CIFAR-100, Sup-21K | Split CIFAR-100, iBOT-21K | Split CUB-200, Sup-21K | Split CUB-200, iBOT-21K |
|---|---|---|---|---|
| HiDe-Prompt | 92.61 | 93.02 | 86.56 | 78.23 |
| NoRGa (tanh) | 94.36 | 94.76 | 90.87 | 80.69 |
| NoRGa (sigmoid) | 94.48 | 94.69 | 90.90 | 80.18 |
| NoRGa (GELU) | 94.05 | 94.63 | 90.74 | 80.54 |

Main Results. In Table 1, we evaluate several continual learning methods on Split CIFAR-100 and Split ImageNet-R using diverse pre-trained models. NoRGa achieves state-of-the-art FA and CA across all datasets and models, consistently outperforming HiDe-Prompt.
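The FA, CA, and FM values reported in these tables can be computed from a per-task accuracy matrix. Below is a minimal sketch of one common reading of these metrics (the exact definitions follow [50]; the function name and the toy numbers are ours, for illustration only):

```python
import numpy as np

def continual_metrics(acc):
    """acc[t, i]: accuracy (%) on task i after training through task t (defined for i <= t)."""
    T = acc.shape[0]
    avg_after = [acc[t, : t + 1].mean() for t in range(T)]  # average accuracy after each task
    fa = float(avg_after[-1])                               # final average accuracy (FA)
    ca = float(np.mean(avg_after))                          # cumulative (historical) average accuracy (CA)
    # forgetting measure (FM): drop from the best past accuracy on each old task to its final accuracy
    fm = float(np.mean([acc[:, i].max() - acc[-1, i] for i in range(T - 1)]))
    return fa, ca, fm

# toy 3-task run (hypothetical numbers)
acc = np.array([[90.0,  0.0,  0.0],
                [85.0, 88.0,  0.0],
                [80.0, 84.0, 86.0]])
fa, ca, fm = continual_metrics(acc)  # fa ≈ 83.33, ca ≈ 86.61, fm = 7.0
```

Under this reading, FA summarizes end-of-sequence performance, CA additionally rewards methods that were accurate throughout training, and FM isolates how much accuracy on earlier tasks was lost.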
On Sup-21K, NoRGa demonstrates impressive FA results on both CIFAR-100 and ImageNet-R. It also maintains the highest CA, with significant margins of 1.80% and 2.92% on CIFAR-100 and ImageNet-R, respectively, compared to HiDe-Prompt. These results highlight NoRGa's strong ability to retain knowledge and exhibit minimal forgetting, as evidenced by the low FM values on both datasets. NoRGa also surpasses HiDe-Prompt on self-supervised models, with FA improvements of up to 1.95% and 3.66%. We further investigate two scenarios, fine-grained classification tasks and large inter-task differences, through experiments on Split CUB-200 and 5-Datasets, respectively, as shown in Table 2. NoRGa maintains its lead, achieving FA gaps of 4.34% and 2.46% on Split CUB-200, and the highest FA on 5-Datasets. While gains in some metrics may be modest, NoRGa consistently outperforms HiDe-Prompt in either FA or CA, underscoring its robustness. For example, on Split ImageNet-R with Sup-21K weights, the FA improvement is small (75.06% vs. 75.40%), but the CA gain is substantial (76.60% vs. 79.52%), demonstrating the method's effectiveness and robustness.

Ablation Study. To assess the impact of non-linear activation functions on NoRGa's performance, we evaluate the model's behavior with different choices of the activation function σ, including tanh, sigmoid, and GELU, in Table 3. The results show that NoRGa achieves state-of-the-art performance on both Split CIFAR-100 and Split CUB-200 with all three activation functions. These findings suggest that NoRGa is robust to the choice of non-linear activation within a reasonable range. While all functions perform well, the tanh activation demonstrates generally strong performance across scenarios. Further results are provided in the Appendix.

6 Conclusion

This paper presents an initial exploration of self-attention and prefix tuning through the lens of mixture of experts.
We find that applying prefix tuning can be viewed as introducing new prefix experts to adapt the pre-trained model. However, limitations in sample efficiency exist. We address this by proposing NoRGa, a novel gating mechanism that enhances continual learning performance. Our results demonstrate NoRGa's effectiveness both theoretically and empirically. While the current implementation of the expert network prioritizes simplicity, future research could investigate more intricate architectures. Furthermore, the choice of activation function in our work requires fine-tuning, which opens avenues for future research on adaptively learning activations.

References

[1] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3366–3375, 2017.
[2] E. Belouadah, A. Popescu, and I. Kanellos. A comprehensive study of class incremental learning algorithms for visual tasks. Neural Networks, 135:38–54, 2021.
[3] Y. Bulatov. notMNIST dataset. Google (Books/OCR), Tech. Rep. [Online]. Available: http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html, 2011.
[4] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
[5] H. Cha, J. Lee, and J. Shin. Co2L: Contrastive continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9516–9525, October 2021.
[6] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. S. Torr, and M. Ranzato. On tiny episodic memories in continual learning, 2019.
[7] X. Chen, S. Xie, and K. He. An empirical study of training self-supervised vision transformers, 2021.
[8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G.
Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[9] S. Ebrahimi, F. Meier, R. Calandra, T. Darrell, and M. Rohrbach. Adversarial continual learning. arXiv preprint arXiv:2003.09553, 2020.
[10] D. Eigen, M. Ranzato, and I. Sutskever. Learning factored representations in a deep mixture of experts. In ICLR Workshops, 2014.
[11] Z. Fan, R. Sarkar, Z. Jiang, T. Chen, K. Zou, Y. Cheng, C. Hao, Z. Wang, et al. M3ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. Advances in Neural Information Processing Systems, 35:28441–28457, 2022.
[12] N. L. Hai, T. Nguyen, L. N. Van, T. H. Nguyen, and K. Than. Continual variational dropout: a view of auxiliary local variables in continual learning. Machine Learning, 113(1):281–323, 2024.
[13] H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathiamoorthy, Y. Chen, R. Mazumder, L. Hong, and E. Chi. DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. Advances in Neural Information Processing Systems, 34:29335–29347, 2021.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015.
[15] N. Ho, C.-Y. Yang, and M. I. Jordan. Convergence rates for Gaussian mixtures of experts. Journal of Machine Learning Research, 23(323):1–81, 2022.
[16] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[17] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[18] Q. Huang, Z. An, N. Zhuang, M. Tao, C. Zhang, Y. Jin, K. Xu, L. Chen, S. Huang, and Y. Feng. Harder tasks need more experts: Dynamic routing in MoE models.
arXiv preprint arXiv:2403.07652, 2024.
[19] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3, 1991.
[20] P. Janson, W. Zhang, R. Aljundi, and M. Elhoseiny. A simple baseline that questions the use of pretrained-models in continual learning. arXiv preprint arXiv:2210.04428, 2022.
[21] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[22] G. Kim, C. Xiao, T. Konishi, Z. Ke, and B. Liu. A theoretical study on solving continual learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5065–5079. Curran Associates, Inc., 2022.
[23] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[25] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059. Association for Computational Linguistics, Nov. 2021.
[26] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation, 2021.
[27] T. Manole and N. Ho. Refined convergence rates for maximum likelihood estimation under finite mixture models. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 14979–15006. PMLR, 17–23 Jul 2022.
[28] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
[29] M. D. McDonnell, D. Gong, A. Parvaneh, E. Abbasnejad, and A. van den Hengel.
RanPAC: Random projections and pre-trained models for continual learning. Advances in Neural Information Processing Systems, 36, 2024.
[30] S. V. Mehta, D. Patil, S. Chandar, and E. Strubell. An empirical investigation of the role of pre-training in lifelong learning. Journal of Machine Learning Research, 24(214):1–50, 2023.
[31] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[32] C. V. Nguyen, A. Achille, M. Lam, T. Hassner, V. Mahadevan, and S. Soatto. Toward understanding catastrophic forgetting in continual learning, 2019.
[33] H. Nguyen, P. Akbarian, and N. Ho. Is temperature sample efficient for softmax Gaussian mixture of experts? In Proceedings of the ICML, 2024.
[34] H. Nguyen, P. Akbarian, F. Yan, and N. Ho. Statistical perspective of top-k sparse softmax gating mixture of experts. In International Conference on Learning Representations, 2024.
[35] H. Nguyen, N. Ho, and A. Rinaldo. On least square estimation in softmax gating mixture of experts. In Proceedings of the ICML, 2024.
[36] H. Nguyen, T. Nguyen, and N. Ho. Demystifying softmax gating function in Gaussian mixture of experts. In Advances in Neural Information Processing Systems, 2023.
[37] A. Panos, Y. Kobe, D. O. Reino, R. Aljundi, and R. E. Turner. First session adaptation: A strong replay-free baseline for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18820–18830, 2023.
[38] H. Phan, A. P. Tuan, S. Nguyen, N. V. Linh, and K. Than. Reducing catastrophic forgetting in neural networks via Gaussian mixture approximation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 106–117. Springer, 2022.
[39] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[40] T. Ridnik, E. Ben-Baruch, A. Noy, and L. Zelnik-Manor. ImageNet-21K pretraining for the masses, 2021.
[41] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.
[42] J. S. Smith, L. Karlinsky, V. Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira. CODA-Prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11909–11919, June 2023.
[43] Q. Tran, H. Phan, K. Than, D. Phung, and T. Le. Continual learning with optimal transport based mixture model. arXiv preprint arXiv:2211.16780, 2022.
[44] Q. Tran, L. Tran, K. Than, T. Tran, D. Phung, and T. Le. KOPPA: Improving prompt-based continual learning with key-query orthogonal projection and prototype-based one-versus-all. arXiv preprint arXiv:2311.15414, 2023.
[45] S. van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
[46] G. M. van de Ven, T. Tuytelaars, and A. S. Tolias. Three types of incremental learning. Nature Machine Intelligence, 4:1185–1197, 2022.
[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[48] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. California Institute of Technology, 2011.
[49] L. Wang, J. Xie, X. Zhang, M. Huang, H. Su, and J. Zhu. Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality. Advances in Neural Information Processing Systems, 2023.
[50] L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application, 2024.
[51] L. Wang, X. Zhang, K. Yang, L. Yu, C. Li, H. Lanqing, S. Zhang, Z. Li, Y. Zhong, and J. Zhu. Memory replay with data compression for continual learning. In International Conference on Learning Representations, 2021.
[52] Y. Wang, Z. Huang, and X. Hong. S-Prompts learning with pre-trained transformers: An Occam's razor for domain incremental learning. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
[53] Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy, et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, 2022.
[54] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022.
[55] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.
[56] Y. Xin, S. Luo, H. Zhou, J. Du, X. Liu, Y. Fan, Q. Li, and Y. Du. Parameter-efficient fine-tuning for pre-trained vision models: A survey, 2024.
[57] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435, 1997.
[58] J. Yu, Y. Zhuge, L. Zhang, P. Hu, D. Wang, H. Lu, and Y. He. Boosting continual learning of vision-language models via mixture-of-experts adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23219–23230, 2024.
[59] G. Zhang, L. Wang, G. Kang, L. Chen, and Y. Wei. SLCA: Slow learner with classifier alignment for continual learning on a pre-trained model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[60] D.-W. Zhou, Z.-W. Cai, H.-J. Ye, D.-C. Zhan, and Z. Liu.
Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need. International Journal of Computer Vision, pages 1–21, 2024.
[61] D.-W. Zhou, H.-L. Sun, J. Ning, H.-J. Ye, and D.-C. Zhan. Continual learning with pre-trained models: A survey, 2024.
[62] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. iBOT: Image BERT pre-training with online tokenizer. In International Conference on Learning Representations (ICLR), 2022.

Supplement to "Mixture of Experts Meets Prompt-Based Continual Learning"

In this supplementary material, we first analyze the statistical suboptimality of the linear gating prefix MoE model (11) in Appendix A. Appendix B provides proofs for the theoretical results presented in Section 4.2. In Appendix C, we discuss related work on mixture of experts. Appendix D outlines the training algorithm for HiDe-Prompt, while Appendix E presents the experimental setup and details. Further, Appendix F presents additional experiments in the task-incremental learning setting to empirically demonstrate the benefits of our proposed non-linear residual gating prefix MoE (12) over the linear gating prefix MoE model. Appendix G and Appendix H compare NoRGa with other parameter-efficient fine-tuning techniques and pre-trained model-based methods. In Appendix I, we present efficiency tests, while Appendix J explores the impact of learnable α and τ. Finally, Appendix K compares the training times of NoRGa and HiDe-Prompt.

A Statistical Suboptimality of the Linear Gating Prefix MoE Model

In this appendix, we demonstrate that estimating parameters and experts in the linear gating prefix MoE model (11) can be statistically inefficient in terms of the number of data. To simplify our findings, we focus on the first head, namely l = 1 in equation (11), and the first row of this head, namely i = 1 in equation (11).
Then, we proceed to provide a theoretical justification of our claim on the suboptimality of the linear gating prefix MoE by viewing this row as the output of a regression setting. In particular, we assume that $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n) \in \mathbb{R}^{Nd} \times \mathbb{R}$ is an i.i.d. sample generated from the following model:

$$Y_i = f_{G_*}(X_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad (23)$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are independent Gaussian noise variables such that $\mathbb{E}[\varepsilon_i \mid X_i] = 0$ and $\mathrm{Var}(\varepsilon_i \mid X_i) = \nu^2$ for all $1 \le i \le n$. Additionally, we assume that $X_1, X_2, \ldots, X_n$ are i.i.d. samples from some probability distribution $\mu$. Motivated by the linear gating prefix MoE model (11), the regression function $f_{G_*}(\cdot)$ in equation (23) admits the form of the linear gating prefix MoE model with $N$ pre-trained experts and $L_*$ unknown experts, namely

$$f_{G_*}(X) := \sum_{j=1}^{N} \frac{\exp(X^\top B^0_j X + c^0_j)}{\sum_{k=1}^{N}\exp(X^\top B^0_k X + c^0_k) + \sum_{k'=1}^{L_*}\exp((\beta^*_{1k'})^\top X + \beta^*_{0k'})}\, h(X, \eta^0_j) + \sum_{j'=1}^{L_*} \frac{\exp((\beta^*_{1j'})^\top X + \beta^*_{0j'})}{\sum_{k=1}^{N}\exp(X^\top B^0_k X + c^0_k) + \sum_{k'=1}^{L_*}\exp((\beta^*_{1k'})^\top X + \beta^*_{0k'})}\, h(X, \eta^*_{j'}), \qquad (24)$$

where $G_* := \sum_{j'=1}^{L_*} \exp(\beta^*_{0j'})\,\delta_{(\beta^*_{1j'}, \eta^*_{j'})}$ denotes a mixing measure, i.e., a weighted sum of Dirac measures $\delta$, associated with the unknown parameters $(\beta^*_{1j'}, \beta^*_{0j'}, \eta^*_{j'})_{j'=1}^{L_*}$ in $\mathbb{R}^{Nd} \times \mathbb{R} \times \mathbb{R}^q$. Here, the matrix $B^0_j$ plays the role of the matrix $E^\top_1 W^Q_1 (W^K_1)^\top E_j/\sqrt{d_v}$ in the score function $s_{1,j}(X)$, while the vector $\beta^*_{1j}$ corresponds to the vector $E^\top_i W^Q_l (W^K_l)^\top p^K_j/\sqrt{d_v}$ in the score function $s_{1,N+j}(X)$. Furthermore, the experts $h(X, \eta^0_j)$ correspond to $f_j(X)$ and the experts $h(X, \eta^*_j)$ correspond to $f_{N+j}(X)$. In our formulation, we consider general parametric forms of the experts $h(X, \eta^0_j)$ and $h(X, \eta^*_j)$; that is, we do not constrain these expert functions to the simple expert forms of the linear gating prefix MoE model. As in the linear gating prefix MoE model (11), the matrices $B^0_j$, the biases $c^0_j$, and the expert parameters $\eta^0_j$ are known. Our aim is to estimate the unknown gating parameters $\beta^*_{1j}$, $\beta^*_{0j}$, and expert parameters $\eta^*_j$, which correspond to the prompts.
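To make the data-generating process (23)-(24) concrete, the following is a small numerical sketch of it; all dimensions, the parameter draws, and the choice of linear experts $h(X, \eta) = \eta^\top X$ are illustrative assumptions on our part, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, L, n = 4, 3, 2, 500   # input dim, pre-trained experts, prompt experts, sample size

# Known pre-trained components: quadratic gating scores X^T B X + c (hypothetical values).
B0 = rng.normal(size=(N, d, d)) * 0.1
c0 = rng.normal(size=N)
eta0 = rng.normal(size=(N, d))
# Unknown prompt components: linear gating scores beta1^T X + beta0 (ground truth to recover).
beta1 = rng.normal(size=(L, d))
beta0 = rng.normal(size=L)
eta_star = rng.normal(size=(L, d))

def f_mix(X):
    """Regression function in the spirit of (24): softmax over N quadratic and
    L linear gating scores, mixing linear experts h(X, eta) = eta^T X."""
    scores = np.concatenate([
        np.einsum('nd,kde,ne->nk', X, B0, X) + c0,  # pre-trained scores X^T B_k X + c_k
        X @ beta1.T + beta0,                        # prompt scores beta1^T X + beta0
    ], axis=1)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax gating weights
    experts = X @ np.concatenate([eta0, eta_star]).T  # all expert outputs
    return (w * experts).sum(axis=1)

X = rng.normal(size=(n, d))
Y = f_mix(X) + 0.1 * rng.normal(size=n)  # model (23) with noise level nu = 0.1
```

The least squares analysis below then asks how many such pairs $(X_i, Y_i)$ are needed to recover the prompt parameters `beta1`, `beta0`, and `eta_star` when the pre-trained components are held fixed.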
Least squares estimation: We use the least squares method [45] to estimate the unknown parameters $(\beta^*_{0j'}, \beta^*_{1j'}, \eta^*_{j'})_{j'=1}^{L_*}$ or, equivalently, the ground-truth mixing measure $G_*$. In particular, we take into account the estimator

$$\widetilde G_n := \arg\min_{G \in \mathcal{G}_{L'}(\Theta)} \sum_{i=1}^{n}\big(Y_i - f_G(X_i)\big)^2, \qquad (25)$$

where we denote $\mathcal{G}_{L'}(\Theta) := \{G = \sum_{i=1}^{\ell} \exp(\beta_{0i})\,\delta_{(\beta_{1i}, \eta_i)} : 1 \le \ell \le L', \ (\beta_{0i}, \beta_{1i}, \eta_i) \in \Theta\}$ as the set of all mixing measures with at most $L'$ atoms. In practice, since the true number of experts $L_*$ is typically unknown, we assume that the number of fitted experts $L'$ is sufficiently large, i.e., $L' > L_*$.

Recall that our main objective in this appendix is to show that using the linear gating in the prefix MoE model is not sample efficient. To illustrate this point, we consider a simple scenario in which the expert function takes the form $h(X, (a, b)) = (a^\top X + b)^p$ for some $p \in \mathbb{N}$. Additionally, we design a new Voronoi loss function to facilitate our arguments:

$$\mathcal{L}_{2,r}(G, G_*) := \sum_{j=1}^{L_*}\Big|\sum_{i \in V_j} \exp(\beta_{0i}) - \exp(\beta^*_{0j})\Big| + \sum_{j=1}^{L_*}\sum_{i \in V_j} \exp(\beta_{0i})\Big[\|\Delta\beta_{1ij}\|^r + \|\Delta a_{ij}\|^r + |\Delta b_{ij}|^r\Big],$$

where we denote $\Delta\beta_{1ij} := \beta_{1i} - \beta^*_{1j}$ and $\Delta\eta_{ij} := \eta_i - \eta^*_j$, i.e., $\Delta a_{ij} := a_i - a^*_j$ and $\Delta b_{ij} := b_i - b^*_j$. Now, we are ready to state the result of parameter estimation under the linear gating prefix MoE model in the following theorem:

Theorem A.1. Assume that the experts take the form $h(X, (a, b)) = (a^\top X + b)^p$ for some $p \in \mathbb{N}$. Then, we achieve the following minimax lower bound for estimating $G_*$:

$$\inf_{\overline G_n \in \mathcal{G}_{L'}(\Theta)}\ \sup_{G \in \mathcal{G}_{L'}(\Theta)\setminus\mathcal{G}_{L_*-1}(\Theta)} \mathbb{E}_{f_G}\big[\mathcal{L}_{2,r}(\overline G_n, G)\big] \gtrsim n^{-1/2},$$

for any $r \ge 1$, where $\mathbb{E}_{f_G}$ indicates the expectation taken with respect to the product measure with $f^n_G$.

There are two main implications of the result in Theorem A.1:

(i) The rates for estimating the parameters $\beta^*_{1j}$, $a^*_j$, and $b^*_j$ are slower than $\mathcal{O}_P(n^{-1/2r})$ for any $r \ge 1$. This means that they are slower than any polynomial rate, and could be of order $\mathcal{O}_P(1/\log(n))$. Using the same reasoning as described after equation (22), we have

$$\sup_{X}\big|\varphi((\widehat a^n_i)^\top X + \widehat b^n_i) - \varphi((a^*_j)^\top X + b^*_j)\big| \lesssim \|\widehat a^n_i - a^*_j\| + |\widehat b^n_i - b^*_j|. \qquad (27)$$

As a consequence, the rates for estimating the experts $\varphi((a^*_j)^\top X + b^*_j)$ are no better than those for estimating the parameters $a^*_j$ and $b^*_j$, and could also be as slow as $\mathcal{O}_P(1/\log(n))$.

(ii) The above rates imply that we need an exponential number of data (roughly $\exp(1/\epsilon^{\tau})$, where $\epsilon$ is the desired approximation error) to estimate the parameters and experts of the linear gating prefix MoE. This fact demonstrates that using the linear gating in the prefix MoE model is not sample efficient from the perspective of the expert estimation problem.

Proof of Theorem A.1. Prior to presenting the main proof of Theorem A.1, let us introduce the following key result:

Lemma A.2. If the following holds for any $r \ge 1$:

$$\lim_{\varepsilon \to 0}\ \inf_{G \in \mathcal{G}_{L'}(\Theta):\, \mathcal{L}_{2,r}(G, G_*) \le \varepsilon} \frac{\|f_G - f_{G_*}\|_{L^2(\mu)}}{\mathcal{L}_{2,r}(G, G_*)} = 0, \qquad (28)$$

then we obtain that

$$\inf_{\overline G_n \in \mathcal{G}_{L'}(\Theta)}\ \sup_{G \in \mathcal{G}_{L'}(\Theta)\setminus\mathcal{G}_{L_*-1}(\Theta)} \mathbb{E}_{f_G}\big[\mathcal{L}_{2,r}(\overline G_n, G)\big] \gtrsim n^{-1/2}. \qquad (29)$$

Proof of Lemma A.2. Indeed, from the Gaussian assumption on the noise variables $\varepsilon_i$, we obtain that $Y_i \mid X_i \sim \mathcal{N}(f_{G_*}(X_i), \sigma^2)$ for all $i \in [n]$. Next, the assumption in equation (28) indicates that, for sufficiently small $\varepsilon > 0$ and a fixed constant $C_1 > 0$ which we will choose later, we can find a mixing measure $G' \in \mathcal{G}_{L'}(\Theta)$ such that $\mathcal{L}_{2,r}(G', G_*) = 2\varepsilon$ and $\|f_{G'} - f_{G_*}\|_{L^2(\mu)} \le C_1\varepsilon$. From Le Cam's lemma [57], as the Voronoi loss function $\mathcal{L}_{2,r}$ satisfies the weak triangle inequality, we obtain that

$$\inf_{\overline G_n \in \mathcal{G}_{L'}(\Theta)}\ \sup_{G \in \mathcal{G}_{L'}(\Theta)\setminus\mathcal{G}_{L_*-1}(\Theta)} \mathbb{E}_{f_G}\big[\mathcal{L}_{2,r}(\overline G_n, G)\big] \gtrsim \mathcal{L}_{2,r}(G', G_*)\exp\Big(-n\,\mathbb{E}_{X\sim\mu}\big[\mathrm{KL}\big(\mathcal{N}(f_{G'}(X), \sigma^2),\, \mathcal{N}(f_{G_*}(X), \sigma^2)\big)\big]\Big) \gtrsim \varepsilon\exp\big(-n\,\|f_{G'} - f_{G_*}\|^2_{L^2(\mu)}\big) \gtrsim \varepsilon\exp(-C_1 n\varepsilon^2), \qquad (30)$$

where the second inequality is due to the fact that $\mathrm{KL}\big(\mathcal{N}(f_{G'}(X), \sigma^2),\, \mathcal{N}(f_{G_*}(X), \sigma^2)\big) = \frac{(f_{G'}(X) - f_{G_*}(X))^2}{2\sigma^2}$. By choosing $\varepsilon = n^{-1/2}$, we obtain that $\varepsilon\exp(-C_1 n\varepsilon^2) = n^{-1/2}\exp(-C_1)$. As a consequence, we achieve the desired minimax lower bound in equation (29).

Main proof. We need to prove that the following limit holds true for any $r \ge 1$:

$$\lim_{\varepsilon \to 0}\ \inf_{G \in \mathcal{G}_{L'}(\Theta):\, \mathcal{L}_{2,r}(G, G_*) \le \varepsilon} \frac{\|f_G - f_{G_*}\|_{L^2(\mu)}}{\mathcal{L}_{2,r}(G, G_*)} = 0. \qquad (31)$$

For that purpose, it suffices to build a sequence of mixing measures $(G_n)_{n \ge 1}$ such that both $\mathcal{L}_{2,r}(G_n, G_*) \to 0$ and $\|f_{G_n} - f_{G_*}\|_{L^2(\mu)}/\mathcal{L}_{2,r}(G_n, G_*) \to 0$ as $n \to \infty$. To this end, we consider the sequence $G_n = \sum_{i=1}^{L_*+1}\exp(\beta^n_{0i})\,\delta_{(\beta^n_{1i}, a^n_i, b^n_i)}$, where

- $\exp(\beta^n_{01}) = \exp(\beta^n_{02}) = \frac{1}{2}\exp(\beta^*_{01}) + \frac{1}{2n^{r+1}}$ and $\exp(\beta^n_{0i}) = \exp(\beta^*_{0(i-1)})$ for any $3 \le i \le L_* + 1$;
- $\beta^n_{11} = \beta^n_{12} = \beta^*_{11}$ and $\beta^n_{1i} = \beta^*_{1(i-1)}$ for any $3 \le i \le L_* + 1$;
- $a^n_1 = a^n_2 = a^*_1$ and $a^n_i = a^*_{i-1}$ for any $3 \le i \le L_* + 1$;
- $b^n_1 = b^*_1 + \frac{1}{n}$, $b^n_2 = b^*_1 - \frac{1}{n}$, and $b^n_i = b^*_{i-1}$ for any $3 \le i \le L_* + 1$.

As a result, the loss function $\mathcal{L}_{2,r}(G_n, G_*)$ reduces to

$$\mathcal{L}_{2,r}(G_n, G_*) = \frac{1}{n^{r+1}} + \Big[\exp(\beta^*_{01}) + \frac{1}{n^{r+1}}\Big]\cdot\frac{1}{n^r} = \mathcal{O}(n^{-r}), \qquad (32)$$

which indicates that $\mathcal{L}_{2,r}(G_n, G_*) \to 0$ as $n \to \infty$. Now, we prove that $\|f_{G_n} - f_{G_*}\|_{L^2(\mu)}/\mathcal{L}_{2,r}(G_n, G_*) \to 0$. For that purpose, let us consider the quantity

$$Q_n(X) := \Big[\sum_{i'=1}^{N}\exp(X^\top B^0_{i'} X + c^0_{i'}) + \sum_{j'=1}^{L_*}\exp((\beta^*_{1j'})^\top X + \beta^*_{0j'})\Big]\,\big[f_{G_n}(X) - f_{G_*}(X)\big].$$

For simplicity, we consider the polynomial degree $p = 1$, as the arguments for other values of $p$ can be adapted accordingly. Recall from equation (46) that $Q_n(X)$ can be decomposed as follows:

$$Q_n(X) = \sum_{j=1}^{L_*}\sum_{i \in A_j}\exp(\beta^n_{0i})\Big[\exp((\beta^n_{1i})^\top X)\big((a^n_i)^\top X + b^n_i\big) - \exp((\beta^*_{1j})^\top X)\big((a^*_j)^\top X + b^*_j\big)\Big] - \sum_{j=1}^{L_*}\sum_{i \in A_j}\exp(\beta^n_{0i})\Big[\exp((\beta^n_{1i})^\top X)\,f_{G_n}(X) - \exp((\beta^*_{1j})^\top X)\,f_{G_n}(X)\Big] + \sum_{j=1}^{L_*}\Big(\sum_{i \in A_j}\exp(\beta^n_{0i}) - \exp(\beta^*_{0j})\Big)\Big[\exp((\beta^*_{1j})^\top X)\big((a^*_j)^\top X + b^*_j\big) - \exp((\beta^*_{1j})^\top X)\,f_{G_n}(X)\Big] := A_n(X) - B_n(X) + C_n(X).$$

From the definitions of $\beta^n_{1i}$, $a^n_i$, and $b^n_i$, we can rewrite $A_n(X)$ as

$$A_n(X) = \frac{1}{2}\Big[\exp(\beta^*_{01}) + \frac{1}{n^{r+1}}\Big]\exp((\beta^*_{11})^\top X)\big[(b^n_1 - b^*_1) + (b^n_2 - b^*_1)\big] = 0,$$

since $b^n_1 - b^*_1 = 1/n$ and $b^n_2 - b^*_1 = -1/n$. Additionally, it can also be checked that $B_n(X) = 0$ and $C_n(X) = \mathcal{O}(n^{-(r+1)})$. Therefore, it follows that $C_n(X)/\mathcal{L}_{2,r}(G_n, G_*) \to 0$. As a consequence, $Q_n(X)/\mathcal{L}_{2,r}(G_n, G_*) \to 0$ as $n \to \infty$ for almost every $X$. Since the factor $\big[\sum_{i'=1}^{N}\exp(X^\top B^0_{i'} X + c^0_{i'}) + \sum_{j'=1}^{L_*}\exp((\beta^*_{1j'})^\top X + \beta^*_{0j'})\big]$ is bounded, we deduce that $[f_{G_n}(X) - f_{G_*}(X)]/\mathcal{L}_{2,r}(G_n, G_*) \to 0$ for almost every $X$. This result indicates that $\|f_{G_n} - f_{G_*}\|_{L^2(\mu)}/\mathcal{L}_{2,r}(G_n, G_*) \to 0$ as $n \to \infty$.
Hence, the proof of claim (31) is completed.

B Proofs of Theoretical Results

In this appendix, we present rigorous proofs for the theoretical results introduced in Section 4, namely Theorem 4.1 and Theorem 4.3, in that order.

B.1 Proof of Theorem 4.1

For the proof of the theorem, we first introduce some notation. Firstly, we denote by $\mathcal{F}_{L'}(\Theta)$ the set of conditional densities of all mixing measures in $\mathcal{G}_{L'}(\Theta)$, that is, $\mathcal{F}_{L'}(\Theta) := \{g_G(X) : G \in \mathcal{G}_{L'}(\Theta)\}$. Additionally, for each $\delta > 0$, the $L^2(\mu)$ ball centered at the conditional density $g_{G_*}(Y \mid X)$ and intersected with the set $\mathcal{F}_{L'}(\Theta)$ is defined as

$$\mathcal{F}_{L'}(\Theta, \delta) := \big\{g \in \mathcal{F}_{L'}(\Theta) : \|g - g_{G_*}\|_{L^2(\mu)} \le \delta\big\}.$$

In order to measure the size of the above set, van de Geer [45] suggests using the following quantity:

$$\mathcal{J}_B(\delta, \mathcal{F}_{L'}(\Theta, \delta)) := \int_{\delta^2/2^{13}}^{\delta} H_B^{1/2}\big(t, \mathcal{F}_{L'}(\Theta, t), \|\cdot\|_{L^2(\mu)}\big)\,dt \ \vee\ \delta, \qquad (33)$$

where $H_B(t, \mathcal{F}_{L'}(\Theta, t), \|\cdot\|_{L^2(\mu)})$ stands for the bracketing entropy [45] of $\mathcal{F}_{L'}(\Theta, t)$ under the $L^2(\mu)$ norm, and $t \vee \delta := \max\{t, \delta\}$. By using proof arguments similar to those of Theorem 7.4 and Theorem 9.2 in [45], with notation adapted to this work, we obtain the following lemma:

Lemma B.1. Take $\Psi(\delta) \ge \mathcal{J}_B(\delta, \mathcal{F}_{L'}(\Theta, \delta))$ such that $\Psi(\delta)/\delta^2$ is a non-increasing function of $\delta$. Then, for some universal constant $c$ and for some sequence $(\delta_n)$ such that $\sqrt{n}\,\delta_n^2 \ge c\,\Psi(\delta_n)$, we achieve that

$$\mathbb{P}\big(\|g_{\widehat G_n} - g_{G_*}\|_{L^2(\mu)} > \delta\big) \le c\exp\Big(-\frac{n\delta^2}{c^2}\Big),$$

for all $\delta \ge \delta_n$.

We now demonstrate that when the expert functions are Lipschitz continuous, the following bound holds for any $0 < \varepsilon \le 1/2$:

$$H_B\big(\varepsilon, \mathcal{F}_{L'}(\Theta), \|\cdot\|_{L^2(\mu)}\big) \lesssim \log(1/\varepsilon). \qquad (34)$$

Indeed, for any function $g_G \in \mathcal{F}_{L'}(\Theta)$, since the expert functions are bounded, we obtain that $|h(X, \eta)| \le M$ for all $X$, where $M$ is a bound on the expert functions. Let $\zeta \le \varepsilon$ and let $\{\pi_1, \ldots, \pi_{\bar N}\}$ be a $\zeta$-cover under the $\|\cdot\|_\infty$ norm of the set $\mathcal{F}_{L'}(\Theta)$, where $\bar N := N(\zeta, \mathcal{F}_{L'}(\Theta), \|\cdot\|_\infty)$ is the $\zeta$-covering number of the metric space $(\mathcal{F}_{L'}(\Theta), \|\cdot\|_\infty)$. Then, we construct brackets of the form $[L_i(X), U_i(X)]$ for all $i \in [\bar N]$ as follows:

$$L_i(X) := \max\{\pi_i(X) - \zeta,\, 0\}, \quad U_i(X) := \min\{\pi_i(X) + \zeta,\, M\}.$$
From the above construction, we can verify that $\mathcal{F}_{L'}(\Theta) \subset \bigcup_{i=1}^{\bar N}[L_i(X), U_i(X)]$ and $U_i(X) - L_i(X) \le \min\{2\zeta, M\}$. Therefore, it follows that

$$\|U_i - L_i\|^2_{L^2(\mu)} = \int (U_i - L_i)^2\,d\mu(X) \le \int 4\zeta^2\,d\mu(X) = 4\zeta^2,$$

which implies that $\|U_i - L_i\|_{L^2(\mu)} \le 2\zeta$. By the definition of the bracketing entropy, we deduce that

$$H_B\big(2\zeta, \mathcal{F}_{L'}(\Theta), \|\cdot\|_{L^2(\mu)}\big) \le \log \bar N = \log N(\zeta, \mathcal{F}_{L'}(\Theta), \|\cdot\|_\infty). \qquad (35)$$

Therefore, we need to provide an upper bound for the covering number $\bar N$. In particular, we denote $\Delta := \{(\beta_1, \beta_0) : (\beta_1, \beta_0, \eta) \in \Theta\}$ and $\Omega := \{\eta \in \mathbb{R}^q : (\beta_1, \beta_0, \eta) \in \Theta\}$. Since $\Theta$ is a compact set, $\Delta$ and $\Omega$ are also compact. Therefore, we can find $\zeta$-covers $\Delta_\zeta$ and $\Omega_\zeta$ of $\Delta$ and $\Omega$, respectively, with $|\Delta_\zeta| \lesssim (1/\zeta)^{(Nd+1)L'}$ and $|\Omega_\zeta| \lesssim (1/\zeta)^{qL'}$. For each mixing measure $G = \sum_{i=1}^{L'}\exp(\beta_{0i})\,\delta_{(\beta_{1i}, \eta_i)} \in \mathcal{G}_{L'}(\Theta)$, we consider two other mixing measures:

$$\check G := \sum_{i=1}^{L'}\exp(\beta_{0i})\,\delta_{(\beta_{1i}, \overline\eta_i)}, \qquad \overline G := \sum_{i=1}^{L'}\exp(\overline\beta_{0i})\,\delta_{(\overline\beta_{1i}, \overline\eta_i)}.$$

Here, $\overline\eta_i \in \Omega_\zeta$ is the closest point to $\eta_i$ in that set, while $(\overline\beta_{1i}, \overline\beta_{0i}) \in \Delta_\zeta$ is the closest point to $(\beta_{1i}, \beta_{0i})$ in that set. From the above formulations, we get that

$$\|g_G - g_{\check G}\|_\infty \le \sum_{j=1}^{L'}\sup_{X \in \mathcal{X}} \frac{\exp\big(\beta_{1j}^\top X + \alpha\sigma(\tau\beta_{1j}^\top X) + \beta_{0j}\big)\,\big|h(X, \eta_j) - h(X, \overline\eta_j)\big|}{\sum_{i'=1}^{N}\exp(X^\top B^0_{i'}X + c^0_{i'}) + \sum_{j'=1}^{L'}\exp\big(\beta_{1j'}^\top X + \alpha\sigma(\tau\beta_{1j'}^\top X) + \beta_{0j'}\big)} \le \sum_{j=1}^{L'}\sup_{X \in \mathcal{X}} \big|h(X, \eta_j) - h(X, \overline\eta_j)\big| \le \sum_{j=1}^{L'}\sup_{X \in \mathcal{X}} \big[L_1(X)\,\|\eta_j - \overline\eta_j\|\big] \lesssim \zeta.$$

Here, the second inequality holds because each softmax weight is bounded by one, and the third follows from the fact that the expert $h(X, \cdot)$ is a Lipschitz function with some Lipschitz constant $L_1(X) > 0$. Next, let us denote

$$D := \sum_{i'=1}^{N}\exp(X^\top B^0_{i'}X + c^0_{i'}) + \sum_{j'=1}^{L'}\exp\big(\beta_{1j'}^\top X + \alpha\sigma(\tau\beta_{1j'}^\top X) + \beta_{0j'}\big), \qquad \overline D := \sum_{i'=1}^{N}\exp(X^\top B^0_{i'}X + c^0_{i'}) + \sum_{j'=1}^{L'}\exp\big(\overline\beta_{1j'}^\top X + \alpha\sigma(\tau\overline\beta_{1j'}^\top X) + \overline\beta_{0j'}\big).$$
Then, we have

$$\|g_{\check G} - g_{\overline G}\|_\infty = \bigg\|\frac{1}{D}\Big[\sum_{i=1}^{N}\exp(X^\top B^0_i X + c^0_i)\,h(X, \eta^0_i) + \sum_{j=1}^{L'}\exp\big(\beta_{1j}^\top X + \alpha\sigma(\tau\beta_{1j}^\top X) + \beta_{0j}\big)h(X, \overline\eta_j)\Big] - \frac{1}{\overline D}\Big[\sum_{i=1}^{N}\exp(X^\top B^0_i X + c^0_i)\,h(X, \eta^0_i) + \sum_{j=1}^{L'}\exp\big(\overline\beta_{1j}^\top X + \alpha\sigma(\tau\overline\beta_{1j}^\top X) + \overline\beta_{0j}\big)h(X, \overline\eta_j)\Big]\bigg\|_\infty \le \sum_{i=1}^{N}\sup_{X \in \mathcal{X}}\exp(X^\top B^0_i X + c^0_i)\,\big|h(X, \eta^0_i)\big|\,\Big|\frac{1}{D} - \frac{1}{\overline D}\Big| + \sum_{j=1}^{L'}\sup_{X \in \mathcal{X}}\bigg|\frac{\exp\big(\beta_{1j}^\top X + \alpha\sigma(\tau\beta_{1j}^\top X) + \beta_{0j}\big)}{D} - \frac{\exp\big(\overline\beta_{1j}^\top X + \alpha\sigma(\tau\overline\beta_{1j}^\top X) + \overline\beta_{0j}\big)}{\overline D}\bigg|\,\big|h(X, \overline\eta_j)\big|. \qquad (36)$$

Now, we bound the two terms on the above right-hand side. Firstly, since both the input space $\mathcal{X}$ and the parameter space $\Theta$ are bounded, we have

$$\Big|\exp\big(\beta_{1j}^\top X + \alpha\sigma(\tau\beta_{1j}^\top X) + \beta_{0j}\big) - \exp\big(\overline\beta_{1j}^\top X + \alpha\sigma(\tau\overline\beta_{1j}^\top X) + \overline\beta_{0j}\big)\Big| \lesssim \big|(\beta_{1j} - \overline\beta_{1j})^\top X\big| + |\alpha|\,\big|\sigma(\tau\beta_{1j}^\top X) - \sigma(\tau\overline\beta_{1j}^\top X)\big| + |\beta_{0j} - \overline\beta_{0j}| \le \|\beta_{1j} - \overline\beta_{1j}\|\,\|X\| + |\alpha\tau|\,\|\beta_{1j} - \overline\beta_{1j}\|\,\|X\| + |\beta_{0j} - \overline\beta_{0j}| \le (B + |\alpha\tau|B + 1)\,\zeta \lesssim \zeta,$$

where $B$ is a bound on $\|X\|$. As a result, we deduce that

$$\sum_{i=1}^{N}\sup_{X \in \mathcal{X}}\exp(X^\top B^0_i X + c^0_i)\,\big|h(X, \eta^0_i)\big|\,\Big|\frac{1}{D} - \frac{1}{\overline D}\Big| \lesssim \zeta. \qquad (37)$$

Regarding the second term, since both the input space and the parameter space are bounded, the same argument gives

$$\bigg|\frac{\exp\big(\beta_{1j}^\top X + \alpha\sigma(\tau\beta_{1j}^\top X) + \beta_{0j}\big)}{D} - \frac{\exp\big(\overline\beta_{1j}^\top X + \alpha\sigma(\tau\overline\beta_{1j}^\top X) + \overline\beta_{0j}\big)}{\overline D}\bigg| \lesssim (B + |\alpha\tau|B + 1)\,\zeta \lesssim \zeta,$$

which yields that

$$\sum_{j=1}^{L'}\sup_{X \in \mathcal{X}}\bigg|\frac{\exp\big(\beta_{1j}^\top X + \alpha\sigma(\tau\beta_{1j}^\top X) + \beta_{0j}\big)}{D} - \frac{\exp\big(\overline\beta_{1j}^\top X + \alpha\sigma(\tau\overline\beta_{1j}^\top X) + \overline\beta_{0j}\big)}{\overline D}\bigg|\,\big|h(X, \overline\eta_j)\big| \lesssim \zeta. \qquad (38)$$

From equations (36), (37), and (38), we obtain that $\|g_{\check G} - g_{\overline G}\|_\infty \lesssim \zeta$. According to the triangle inequality, we have

$$\|g_G - g_{\overline G}\|_\infty \le \|g_G - g_{\check G}\|_\infty + \|g_{\check G} - g_{\overline G}\|_\infty \lesssim \zeta.$$

By the definition of the covering number, we deduce that

$$N(\zeta, \mathcal{F}_{L'}(\Theta), \|\cdot\|_\infty) \le |\Delta_\zeta|\cdot|\Omega_\zeta| \lesssim (1/\zeta)^{(Nd+1)L'}\cdot(1/\zeta)^{qL'} = (1/\zeta)^{(Nd+1+q)L'}. \qquad (39)$$

Combining equations (35) and (39), we achieve that $H_B(2\zeta, \mathcal{F}_{L'}(\Theta), \|\cdot\|_{L^2(\mu)}) \lesssim \log(1/\zeta)$. Setting $\zeta = \varepsilon/2$, we obtain $H_B(\varepsilon, \mathcal{F}_{L'}(\Theta), \|\cdot\|_{L^2(\mu)}) \lesssim \log(1/\varepsilon)$. As a result, it follows that

$$\mathcal{J}_B(\delta, \mathcal{F}_{L'}(\Theta, \delta)) = \int_{\delta^2/2^{13}}^{\delta} H_B^{1/2}\big(t, \mathcal{F}_{L'}(\Theta, t), \|\cdot\|_{L^2(\mu)}\big)\,dt \ \vee\ \delta \lesssim \int_{\delta^2/2^{13}}^{\delta}\sqrt{\log(1/t)}\,dt \ \vee\ \delta. \qquad (40)$$

Let $\Psi(\delta) = \delta\,[\log(1/\delta)]^{1/2}$; then $\Psi(\delta)/\delta^2$ is a non-increasing function of $\delta$.
Furthermore, equation (40) indicates that $\Psi(\delta) \ge \mathcal{J}_B(\delta, \mathcal{F}_{L'}(\Theta, \delta))$. In addition, let $\delta_n = \sqrt{\log(n)/n}$; then we get that $\sqrt{n}\,\delta_n^2 \ge c\,\Psi(\delta_n)$ for some universal constant $c$. Finally, by applying Lemma B.1, we achieve the desired conclusion of the theorem.

B.2 Proof of Theorem 4.3

Our goal is to demonstrate the following inequality:

$$\inf_{G \in \mathcal{G}_{L'}(\Theta)}\ \|g_G - g_{G_*}\|_{L^2(\mu)}/\mathcal{L}_1(G, G_*) > 0. \qquad (41)$$

For that purpose, we divide the proof of the above inequality into a local part and a global part.

Local part: In this part, we demonstrate that

$$\lim_{\varepsilon \to 0}\ \inf_{G \in \mathcal{G}_{L'}(\Theta):\, \mathcal{L}_1(G, G_*) \le \varepsilon}\ \|g_G - g_{G_*}\|_{L^2(\mu)}/\mathcal{L}_1(G, G_*) > 0. \qquad (42)$$

Assume by contradiction that the above claim is not true. Then there exists a sequence of mixing measures $G_n = \sum_{i=1}^{L'}\exp(\beta^n_{0i})\,\delta_{(\beta^n_{1i}, \eta^n_i)}$ in $\mathcal{G}_{L'}(\Theta)$ such that

$$\mathcal{L}_{1n} := \mathcal{L}_1(G_n, G_*) \to 0 \quad \text{and} \quad \|g_{G_n} - g_{G_*}\|_{L^2(\mu)}/\mathcal{L}_{1n} \to 0, \qquad (43)$$

as $n \to \infty$. Let us denote by $V^n_j := V_j(G_n)$ the Voronoi cell of $G_n$ generated by the $j$-th component of $G_*$. Since our arguments are asymptotic, we may assume that those Voronoi cells do not depend on the sample size, i.e., $V_j = V^n_j$. Thus, the Voronoi loss $\mathcal{L}_{1n}$ can be represented as

$$\mathcal{L}_{1n} = \sum_{j:\, |V_j| > 1}\ \sum_{i \in V_j}\exp(\beta^n_{0i})\big[\|\Delta\beta^n_{1ij}\|^2 + \|\Delta\eta^n_{ij}\|^2\big] + \sum_{j:\, |V_j| = 1}\ \sum_{i \in V_j}\exp(\beta^n_{0i})\big[\|\Delta\beta^n_{1ij}\| + \|\Delta\eta^n_{ij}\|\big] + \sum_{j=1}^{L_*}\Big|\sum_{i \in V_j}\exp(\beta^n_{0i}) - \exp(\beta^*_{0j})\Big|, \qquad (44)$$

where we denote $\Delta\beta^n_{1ij} := \beta^n_{1i} - \beta^*_{1j}$ and $\Delta\eta^n_{ij} := \eta^n_i - \eta^*_j$. Since $\mathcal{L}_{1n} \to 0$, we get that $(\beta^n_{1i}, \eta^n_i) \to (\beta^*_{1j}, \eta^*_j)$ and $\sum_{i \in V_j}\exp(\beta^n_{0i}) \to \exp(\beta^*_{0j})$ as $n \to \infty$, for any $i \in V_j$ and $j \in [L_*]$. Now, we divide the proof of the local part into three steps as follows:

Step 1 - Taylor expansion. In this step, we would like to decompose the quantity

$$Q_n(X) := \Big[\sum_{i'=1}^{N}\exp(X^\top A^0_{i'}X + c^0_{i'}) + \sum_{j'=1}^{L_*}\exp\big((\beta^*_{1j'})^\top X + \alpha\sigma(\tau(\beta^*_{1j'})^\top X) + \beta^*_{0j'}\big)\Big]\,\big[g_{G_n}(X) - g_{G_*}(X)\big] \qquad (45)$$

into a combination of linearly independent elements using Taylor expansion.
In particular, the quantity $Q_n(X)$ is decomposed as follows:
\[
Q_n(X) = \sum_{j=1}^{L} \sum_{i \in V_j} \exp(\beta^n_{0i}) \Big[ \exp\big((\beta^n_{1i})^{\top} X + \alpha\sigma(\tau(\beta^n_{1i})^{\top} X)\big) h(X; \eta^n_i) - \exp\big((\beta^*_{1j})^{\top} X + \alpha\sigma(\tau(\beta^*_{1j})^{\top} X)\big) h(X; \eta^*_j) \Big] - \sum_{j=1}^{L} \sum_{i \in V_j} \exp(\beta^n_{0i}) \Big[ \exp\big((\beta^n_{1i})^{\top} X + \alpha\sigma(\tau(\beta^n_{1i})^{\top} X)\big) - \exp\big((\beta^*_{1j})^{\top} X + \alpha\sigma(\tau(\beta^*_{1j})^{\top} X)\big) \Big] g_{G_n}(X) + \sum_{j=1}^{L} \Big[ \sum_{i \in V_j} \exp(\beta^n_{0i}) - \exp(\beta^*_{0j}) \Big] \exp\big((\beta^*_{1j})^{\top} X + \alpha\sigma(\tau(\beta^*_{1j})^{\top} X)\big) \Big[ h(X; \eta^*_j) - g_{G_n}(X) \Big] := A_n(X) - B_n(X) + C_n(X). \tag{46}
\]

Decomposition of $A_n(X)$. Let us denote $E(X; \beta_1) := \exp(\beta_1^{\top} X + \alpha\sigma(\tau\beta_1^{\top} X))$; then $A_n$ can be separated into two terms as follows:
\[
A_n(X) = \sum_{j: |V_j| = 1} \sum_{i \in V_j} \exp(\beta^n_{0i}) \Big[ E(X; \beta^n_{1i}) h(X; \eta^n_i) - E(X; \beta^*_{1j}) h(X; \eta^*_j) \Big] + \sum_{j: |V_j| > 1} \sum_{i \in V_j} \exp(\beta^n_{0i}) \Big[ E(X; \beta^n_{1i}) h(X; \eta^n_i) - E(X; \beta^*_{1j}) h(X; \eta^*_j) \Big] := A_{n,1}(X) + A_{n,2}(X).
\]
By means of the first-order Taylor expansion, we have
\[
A_{n,1}(X) = \sum_{j: |V_j| = 1} \sum_{i \in V_j} \exp(\beta^n_{0i}) \sum_{|\alpha_1| + |\alpha_2| = 1} \frac{1}{\alpha!} (\Delta\beta^n_{1ij})^{\alpha_1} (\Delta\eta^n_{ij})^{\alpha_2} \cdot \frac{\partial^{|\alpha_1|} E}{\partial \beta_1^{\alpha_1}}(X; \beta^*_{1j}) \frac{\partial^{|\alpha_2|} h}{\partial \eta^{\alpha_2}}(X; \eta^*_j) + R_{n,1}(X) = \sum_{j: |V_j| = 1} \sum_{|\alpha_1| + |\alpha_2| = 1} S_{n,j,\alpha_1,\alpha_2} \cdot \frac{\partial^{|\alpha_1|} E}{\partial \beta_1^{\alpha_1}}(X; \beta^*_{1j}) \frac{\partial^{|\alpha_2|} h}{\partial \eta^{\alpha_2}}(X; \eta^*_j) + R_{n,1}(X),
\]
where $R_{n,1}(X)$ is a Taylor remainder such that $R_{n,1}(X)/\mathcal{L}_{1n} \to 0$ as $n \to \infty$, and
\[
S_{n,j,\alpha_1,\alpha_2} := \sum_{i \in V_j} \frac{\exp(\beta^n_{0i})}{\alpha!} (\Delta\beta^n_{1ij})^{\alpha_1} (\Delta\eta^n_{ij})^{\alpha_2}.
\]
On the other hand, by applying the second-order Taylor expansion, we get that
\[
A_{n,2}(X) = \sum_{j: |V_j| > 1} \sum_{1 \le |\alpha_1| + |\alpha_2| \le 2} S_{n,j,\alpha_1,\alpha_2} \cdot \frac{\partial^{|\alpha_1|} E}{\partial \beta_1^{\alpha_1}}(X; \beta^*_{1j}) \frac{\partial^{|\alpha_2|} h}{\partial \eta^{\alpha_2}}(X; \eta^*_j) + R_{n,2}(X),
\]
in which $R_{n,2}(X)$ is a Taylor remainder such that $R_{n,2}(X)/\mathcal{L}_{1n} \to 0$ as $n \to \infty$.

Decomposition of $B_n(X)$. Recall that we have
\[
B_n(X) = \sum_{j: |V_j| = 1} \sum_{i \in V_j} \exp(\beta^n_{0i}) \Big[ E(X; \beta^n_{1i}) - E(X; \beta^*_{1j}) \Big] g_{G_n}(X) + \sum_{j: |V_j| > 1} \sum_{i \in V_j} \exp(\beta^n_{0i}) \Big[ E(X; \beta^n_{1i}) - E(X; \beta^*_{1j}) \Big] g_{G_n}(X) := B_{n,1}(X) + B_{n,2}(X).
\]
By invoking first-order and second-order Taylor expansions for $B_{n,1}(X)$ and $B_{n,2}(X)$, respectively, it follows that
\[
B_{n,1}(X) = \sum_{j: |V_j| = 1} \sum_{|\ell| = 1} T_{n,j,\ell} \cdot \frac{\partial^{|\ell|} E}{\partial \beta_1^{\ell}}(X; \beta^*_{1j}) g_{G_n}(X) + R_{n,3}(X), \qquad B_{n,2}(X) = \sum_{j: |V_j| > 1} \sum_{1 \le |\ell| \le 2} T_{n,j,\ell} \cdot \frac{\partial^{|\ell|} E}{\partial \beta_1^{\ell}}(X; \beta^*_{1j}) g_{G_n}(X) + R_{n,4}(X),
\]
where we define
\[
T_{n,j,\ell} := \sum_{i \in V_j} \frac{\exp(\beta^n_{0i})}{\ell!} (\Delta\beta^n_{1ij})^{\ell}.
\]
Additionally, $R_{n,3}(X)$ and $R_{n,4}(X)$ are Taylor remainders such that $R_{n,3}(X)/\mathcal{L}_{1n} \to 0$ and $R_{n,4}(X)/\mathcal{L}_{1n} \to 0$ as $n \to \infty$.
Collecting the above results together, we can represent $Q_n(X)$ as
\[
Q_n(X) = \sum_{j=1}^{L} \sum_{0 \le |\alpha_1| + |\alpha_2| \le 2} S_{n,j,\alpha_1,\alpha_2} \cdot \frac{\partial^{|\alpha_1|} E}{\partial \beta_1^{\alpha_1}}(X; \beta^*_{1j}) \frac{\partial^{|\alpha_2|} h}{\partial \eta^{\alpha_2}}(X; \eta^*_j) - \sum_{j=1}^{L} \sum_{0 \le |\ell| \le 2} T_{n,j,\ell} \cdot \frac{\partial^{|\ell|} E}{\partial \beta_1^{\ell}}(X; \beta^*_{1j}) g_{G_n}(X) + \sum_{i=1}^{4} R_{n,i}(X), \tag{47}
\]
where we define $S_{n,j,\mathbf{0},\mathbf{0}} = T_{n,j,\mathbf{0}} = \sum_{i \in V_j} \exp(\beta^n_{0i}) - \exp(\beta^*_{0j})$ for any $j \in [L]$.

Step 2 - Non-vanishing coefficients. In this step, we demonstrate that not all of the ratios $S_{n,j,\alpha_1,\alpha_2}/\mathcal{L}_{1n}$ and $T_{n,j,\ell}/\mathcal{L}_{1n}$ vanish as $n$ tends to infinity. Indeed, assume by contradiction that $S_{n,j,\alpha_1,\alpha_2}/\mathcal{L}_{1n} \to 0$ and $T_{n,j,\ell}/\mathcal{L}_{1n} \to 0$ for any $j \in [L]$ and $0 \le |\alpha_1|, |\alpha_2|, |\ell| \le 2$. Then, taking $(\alpha_1, \alpha_2) = (\mathbf{0}, \mathbf{0})$, we get
\[
\frac{1}{\mathcal{L}_{1n}} \sum_{j=1}^{L} \Big| \sum_{i \in V_j} \exp(\beta^n_{0i}) - \exp(\beta^*_{0j}) \Big| = \sum_{j=1}^{L} |S_{n,j,\mathbf{0},\mathbf{0}}| / \mathcal{L}_{1n} \to 0. \tag{48}
\]
Now, we consider indices $j \in [L]$ whose corresponding Voronoi cell has only one element, i.e., $|V_j| = 1$. For arbitrary $u, v \in [Nd]$, let $\alpha_1 \in \mathbb{N}^{Nd \times Nd}$ and $\alpha_2 = \mathbf{0}_q$ be such that $\alpha_1^{(uv)} = 1$ while all other entries equal zero. Then, we have
\[
\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) \big| (\Delta\beta^n_{1ij})^{(uv)} \big| = |S_{n,j,\alpha_1,\alpha_2}| / \mathcal{L}_{1n} \to 0
\]
as $n \to \infty$. By taking the summation of the previous term over $u, v \in [Nd]$, we achieve that $\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) \|\Delta\beta^n_{1ij}\|_1 \to 0$. Owing to the topological equivalence between the $\ell_1$-norm and the $\ell_2$-norm, it follows that
\[
\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) \|\Delta\beta^n_{1ij}\| \to 0. \tag{49}
\]
For arbitrary $u \in [q]$, let $\alpha_1 = \mathbf{0}_{Nd \times Nd}$ and $\alpha_2 \in \mathbb{N}^q$ be such that $\alpha_2^{(u)} = 1$ while all other entries equal zero. Then, we get $\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) |(\Delta\eta^n_{ij})^{(u)}| = |S_{n,j,\alpha_1,\alpha_2}| / \mathcal{L}_{1n} \to 0$ as $n \to \infty$. By taking the summation of the previous term over $u \in [q]$, we achieve that $\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) \|\Delta\eta^n_{ij}\|_1 \to 0$, or equivalently,
\[
\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) \|\Delta\eta^n_{ij}\| \to 0. \tag{50}
\]
Combining the limits in equations (49) and (50), we obtain that
\[
\frac{1}{\mathcal{L}_{1n}} \sum_{j: |V_j| = 1} \sum_{i \in V_j} \exp(\beta^n_{0i}) \Big[ \|\Delta\beta^n_{1ij}\| + \|\Delta\eta^n_{ij}\| \Big] \to 0. \tag{51}
\]
Next, we consider indices $j \in [L]$ whose corresponding Voronoi cell has more than one element, i.e., $|V_j| > 1$. For arbitrary $u, v \in [Nd]$, let $\alpha_1 \in \mathbb{N}^{Nd \times Nd}$ and $\alpha_2 = \mathbf{0}_q$ be such that $\alpha_1^{(uv)} = 2$ while all other entries equal zero. Then, we have $\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) |(\Delta\beta^n_{1ij})^{(uv)}|^2 = |S_{n,j,\alpha_1,\alpha_2}| / \mathcal{L}_{1n} \to 0$ as $n \to \infty$. By taking the summation of the previous term over $u, v \in [Nd]$, we achieve that
\[
\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) \|\Delta\beta^n_{1ij}\|^2 \to 0. \tag{52}
\]
For arbitrary $u \in [q]$, let $\alpha_1 = \mathbf{0}_{Nd \times Nd}$ and $\alpha_2 \in \mathbb{N}^q$ be such that $\alpha_2^{(u)} = 2$ while all other entries equal zero. Then, we get $\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) |(\Delta\eta^n_{ij})^{(u)}|^2 = |S_{n,j,\alpha_1,\alpha_2}| / \mathcal{L}_{1n} \to 0$ as $n \to \infty$. By taking the summation of the previous term over $u \in [q]$, we achieve that
\[
\frac{1}{\mathcal{L}_{1n}} \sum_{i \in V_j} \exp(\beta^n_{0i}) \|\Delta\eta^n_{ij}\|^2 \to 0. \tag{53}
\]
Putting the limits in equations (52) and (53) together, we have
\[
\frac{1}{\mathcal{L}_{1n}} \sum_{j: |V_j| > 1} \sum_{i \in V_j} \exp(\beta^n_{0i}) \Big[ \|\Delta\beta^n_{1ij}\|^2 + \|\Delta\eta^n_{ij}\|^2 \Big] \to 0 \tag{54}
\]
as $n \to \infty$. Taking the summation of the three limits in equations (48), (51) and (54), we deduce that $1 = \mathcal{L}_{1n}/\mathcal{L}_{1n} \to 0$ as $n \to \infty$, which is a contradiction. Thus, at least one of the ratios $S_{n,j,\alpha_1,\alpha_2}/\mathcal{L}_{1n}$ and $T_{n,j,\ell}/\mathcal{L}_{1n}$ does not go to zero as $n$ tends to infinity.

Step 3 - Application of Fatou's lemma. In this step, we show that all the ratios $S_{n,j,\alpha_1,\alpha_2}/\mathcal{L}_{1n}$ and $T_{n,j,\ell}/\mathcal{L}_{1n}$ go to zero as $n \to \infty$, which contradicts the conclusion of Step 2. In particular, denote by $m_n$ the maximum of the absolute values of those ratios. From the result of Step 2, it follows that $1/m_n \not\to \infty$. Recall from the hypothesis in equation (43) that $\|g_{G_n} - g_{G_*}\|_{L^2(\mu)}/\mathcal{L}_{1n} \to 0$ as $n \to \infty$, which indicates that $\|g_{G_n} - g_{G_*}\|_{L^1(\mu)}/\mathcal{L}_{1n} \to 0$. Therefore, by applying Fatou's lemma, we get that
\[
0 = \lim_{n \to \infty} \frac{\|g_{G_n} - g_{G_*}\|_{L^1(\mu)}}{m_n \mathcal{L}_{1n}} \ge \int \liminf_{n \to \infty} \frac{\big| g_{G_n}(X) - g_{G_*}(X) \big|}{m_n \mathcal{L}_{1n}} \, \mathrm{d}\mu(X) \ge 0.
\]
This result implies that $\frac{1}{m_n \mathcal{L}_{1n}} \big[ g_{G_n}(X) - g_{G_*}(X) \big] \to 0$ as $n \to \infty$ for $\mu$-almost every $X$. Looking at the formulation of $Q_n(X)$ in equation (45), since the term $\big[ \sum_{i'=1}^{N} \exp(X^{\top} A^0_{i'} X + c^0_{i'}) + \sum_{j'=1}^{L} \exp((\beta^*_{1j'})^{\top} X + \alpha\sigma(\tau(\beta^*_{1j'})^{\top} X) + \beta^*_{0j'}) \big]$ is bounded, we deduce that $\frac{Q_n(X)}{m_n \mathcal{L}_{1n}} \to 0$ for $\mu$-almost every $X$. Let us denote
\[
\frac{S_{n,j,\alpha_1,\alpha_2}}{m_n \mathcal{L}_{1n}} \to \phi_{j,\alpha_1,\alpha_2}, \qquad \frac{T_{n,j,\ell}}{m_n \mathcal{L}_{1n}} \to \varphi_{j,\ell},
\]
with the note that at least one of these limits is non-zero. Then, from the decomposition of $Q_n(X)$ in equation (47), we have
\[
\sum_{j=1}^{L} \sum_{|\alpha_1| + |\alpha_2| = 0}^{1 + \mathbf{1}_{\{|V_j| > 1\}}} \phi_{j,\alpha_1,\alpha_2} \cdot \frac{\partial^{|\alpha_1|} E}{\partial \beta_1^{\alpha_1}}(X; \beta^*_{1j}) \frac{\partial^{|\alpha_2|} h}{\partial \eta^{\alpha_2}}(X; \eta^*_j) - \sum_{j=1}^{L} \sum_{|\ell| = 0}^{1 + \mathbf{1}_{\{|V_j| > 1\}}} \varphi_{j,\ell} \cdot \frac{\partial^{|\ell|} E}{\partial \beta_1^{\ell}}(X; \beta^*_{1j}) g_{G_*}(X) = 0,
\]
for $\mu$-almost every $X$.
It is worth noting that the term $\frac{\partial^{|\alpha_1|} E}{\partial \beta_1^{\alpha_1}}(X; \beta^*_{1j}) \frac{\partial^{|\alpha_2|} h}{\partial \eta^{\alpha_2}}(X; \eta^*_j)$ can be explicitly expressed as follows:
- When $|\alpha_1| = 0$, $|\alpha_2| = 0$: $E(X; \beta^*_{1j}) \, h(X; \eta^*_j)$;
- When $|\alpha_1| = 1$, $|\alpha_2| = 0$: $X^{(u)} \big[ 1 + \alpha\tau \sigma'(\tau(\beta^*_{1j})^{\top} X) \big] E(X; \beta^*_{1j}) \, h(X; \eta^*_j)$;
- When $|\alpha_1| = 0$, $|\alpha_2| = 1$: $E(X; \beta^*_{1j}) \, \frac{\partial h}{\partial \eta^{(w)}}(X; \eta^*_j)$;
- When $|\alpha_1| = 1$, $|\alpha_2| = 1$: $X^{(u)} \big[ 1 + \alpha\tau \sigma'(\tau(\beta^*_{1j})^{\top} X) \big] E(X; \beta^*_{1j}) \, \frac{\partial h}{\partial \eta^{(w)}}(X; \eta^*_j)$;
- When $|\alpha_1| = 2$, $|\alpha_2| = 0$: $X^{(u)} X^{(v)} \big[ \big(1 + \alpha\tau \sigma'(\tau(\beta^*_{1j})^{\top} X)\big)^2 + \alpha\tau^2 \sigma''(\tau(\beta^*_{1j})^{\top} X) \big] E(X; \beta^*_{1j}) \, h(X; \eta^*_j)$;
- When $|\alpha_1| = 0$, $|\alpha_2| = 2$: $E(X; \beta^*_{1j}) \, \frac{\partial^2 h}{\partial \eta^{(w)} \partial \eta^{(w')}}(X; \eta^*_j)$.

Recall that the expert function $h$ satisfies the condition in Definition 4.2, i.e., the set
\[
\Big\{ X^{\nu} \big[ \big(1 + \alpha\tau\sigma'(\tau(\beta^*_{1j})^{\top} X)\big)^{|\nu|} + \mathbf{1}_{\{|\nu| = 2\}} \, \alpha\tau^2\sigma''(\tau(\beta^*_{1j})^{\top} X) \big] \cdot \frac{\partial^{|\gamma|} h}{\partial \eta^{\gamma}}(X, \eta^*_j) : j \in [L], \; 0 \le |\nu| + |\gamma| \le 2 \Big\}
\]
is linearly independent for almost every $X$. Therefore, we obtain that $\phi_{j,\alpha_1,\alpha_2} = \varphi_{j,\ell} = 0$ for all $j \in [L]$ and $0 \le |\alpha_1| + |\alpha_2|, |\ell| \le 1 + \mathbf{1}_{\{|V_j| > 1\}}$. This result contradicts the fact that at least one of these limits is different from zero. Hence, we achieve the inequality in equation (42).

Global part. It is worth noting that the inequality (42) implies that there exists a positive constant $\varepsilon'$ such that
\[
\inf_{G \in \mathcal{G}_{L}(\Theta): \mathcal{L}_1(G, G_*) \le \varepsilon'} \|g_{G} - g_{G_*}\|_{L^2(\mu)} / \mathcal{L}_1(G, G_*) > 0.
\]
Therefore, it is sufficient to prove that
\[
\inf_{G \in \mathcal{G}_{L}(\Theta): \mathcal{L}_1(G, G_*) > \varepsilon'} \|g_{G} - g_{G_*}\|_{L^2(\mu)} / \mathcal{L}_1(G, G_*) > 0. \tag{55}
\]
Assume by contradiction that the inequality (55) does not hold true. Then we can find a sequence of mixing measures $G'_n \in \mathcal{G}_{L}(\Theta)$ such that $\mathcal{L}_1(G'_n, G_*) > \varepsilon'$ and
\[
\lim_{n \to \infty} \frac{\|g_{G'_n} - g_{G_*}\|_{L^2(\mu)}}{\mathcal{L}_1(G'_n, G_*)} = 0,
\]
which indicates that $\|g_{G'_n} - g_{G_*}\|_{L^2(\mu)} \to 0$ as $n \to \infty$. Recall that $\Theta$ is a compact set; therefore, we can replace the sequence $G'_n$ by one of its subsequences that converges to a mixing measure $G' \in \mathcal{G}_{L}(\Theta)$. Since $\mathcal{L}_1(G'_n, G_*) > \varepsilon'$, we deduce that $\mathcal{L}_1(G', G_*) \ge \varepsilon'$. Next, by invoking Fatou's lemma, we have that
\[
0 = \lim_{n \to \infty} \|g_{G'_n} - g_{G_*}\|^2_{L^2(\mu)} \ge \int \liminf_{n \to \infty} \big| g_{G'_n}(X) - g_{G_*}(X) \big|^2 \, \mathrm{d}\mu(X).
\]
Thus, we get that $g_{G'}(X) = g_{G_*}(X)$ for $\mu$-almost every $X$.
From the identifiability property of the non-linear residual gating prefix MoE (cf. the end of this proof), we deduce that $G' \equiv G_*$. Consequently, it follows that $\mathcal{L}_1(G', G_*) = 0$, contradicting the fact that $\mathcal{L}_1(G', G_*) \ge \varepsilon' > 0$. Hence, the proof is completed.

Identifiability of the Non-linear Residual Gating MoE. We now prove the identifiability of the non-linear residual gating prefix MoE. In particular, we will show that if $g_{G}(X) = g_{G'}(X)$ for almost every $X$, then it follows that $G \equiv G'$. For ease of presentation, let us denote
\[
\mathrm{softmax}_{G}(u) := \frac{\exp(u)}{\sum_{i'=1}^{N} \exp(X^{\top} B^0_{i'} X + c^0_{i'}) + \sum_{j'=1}^{L'} \exp((\beta_{1j'})^{\top} X + \alpha\sigma(\tau(\beta_{1j'})^{\top} X) + \beta_{0j'})},
\]
\[
\mathrm{softmax}_{G'}(u') := \frac{\exp(u')}{\sum_{i'=1}^{N} \exp(X^{\top} B^0_{i'} X + c^0_{i'}) + \sum_{j'=1}^{L} \exp((\beta'_{1j'})^{\top} X + \alpha\sigma(\tau(\beta'_{1j'})^{\top} X) + \beta'_{0j'})},
\]
for
\[
u \in \big\{ X^{\top} B^0_i X + c^0_i, \; (\beta_{1j})^{\top} X + \alpha\sigma(\tau(\beta_{1j})^{\top} X) + \beta_{0j} : i \in [N], \; j \in [L'] \big\},
\]
\[
u' \in \big\{ X^{\top} B^0_i X + c^0_i, \; (\beta'_{1j})^{\top} X + \alpha\sigma(\tau(\beta'_{1j})^{\top} X) + \beta'_{0j} : i \in [N], \; j \in [L] \big\}.
\]
Since $g_{G}(X) = g_{G'}(X)$ for almost every $X$, we have
\[
\sum_{i=1}^{N} \mathrm{softmax}_{G}\big(X^{\top} B^0_i X + c^0_i\big) h(X, \eta^0_i) + \sum_{j=1}^{L'} \mathrm{softmax}_{G}\big((\beta_{1j})^{\top} X + \alpha\sigma(\tau(\beta_{1j})^{\top} X) + \beta_{0j}\big) h(X, \eta_j) = \sum_{i=1}^{N} \mathrm{softmax}_{G'}\big(X^{\top} B^0_i X + c^0_i\big) h(X, \eta^0_i) + \sum_{j=1}^{L} \mathrm{softmax}_{G'}\big((\beta'_{1j})^{\top} X + \alpha\sigma(\tau(\beta'_{1j})^{\top} X) + \beta'_{0j}\big) h(X, \eta'_j).
\]
As the expert function $h$ satisfies the conditions in Definition 4.2, the set $\{h(X, \eta'_i) : i \in [k']\}$, where $\eta'_1, \ldots, \eta'_{k'}$ are distinct vectors for some $k' \in \mathbb{N}$, is linearly independent. If $L' \neq L$, then there exists some $i \in [L']$ such that $\eta_i \neq \eta'_j$ for any $j \in [L]$. By linear independence, the corresponding coefficient $\mathrm{softmax}_{G}\big((\beta_{1i})^{\top} X + \alpha\sigma(\tau(\beta_{1i})^{\top} X) + \beta_{0i}\big)$ would have to vanish, which is a contradiction since softmax weights are strictly positive. Thus, we must have that $L = L'$. As a result,
\[
\big\{ \mathrm{softmax}_{G}\big((\beta_{1j})^{\top} X + \alpha\sigma(\tau(\beta_{1j})^{\top} X) + \beta_{0j}\big) : j \in [L'] \big\} = \big\{ \mathrm{softmax}_{G'}\big((\beta'_{1j})^{\top} X + \alpha\sigma(\tau(\beta'_{1j})^{\top} X) + \beta'_{0j}\big) : j \in [L] \big\},
\]
for almost every $X$. Without loss of generality, we may assume that
\[
\mathrm{softmax}_{G}\big((\beta_{1j})^{\top} X + \alpha\sigma(\tau(\beta_{1j})^{\top} X) + \beta_{0j}\big) = \mathrm{softmax}_{G'}\big((\beta'_{1j})^{\top} X + \alpha\sigma(\tau(\beta'_{1j})^{\top} X) + \beta'_{0j}\big), \tag{56}
\]
for almost every $X$, for any $j \in [L]$. Since the softmax function is invariant to translation, this result indicates that $\beta_{1j} = \beta'_{1j}$ and $\beta_{0j} = \beta'_{0j} + v_0$ for some $v_0 \in \mathbb{R}$, for any $j \in [L]$.
Recall from the universal assumption that $\beta_{0L} = \beta'_{0L} = 0$; hence $v_0 = 0$ and we get that $\beta_{0j} = \beta'_{0j}$ for any $j \in [L]$. Then, equation (56) can be rewritten as
\[
\sum_{j=1}^{L} \exp(\beta_{0j}) \exp\big( (\beta_{1j})^{\top} X + \alpha\sigma(\tau(\beta_{1j})^{\top} X) \big) h(X, \eta_j) = \sum_{j=1}^{L} \exp(\beta'_{0j}) \exp\big( (\beta'_{1j})^{\top} X + \alpha\sigma(\tau(\beta'_{1j})^{\top} X) \big) h(X, \eta'_j), \tag{58}
\]
for almost every $X$. Next, we denote by $P_1, P_2, \ldots, P_m$ a partition of the index set $[L]$, where $m \le L$, such that $\exp(\beta_{0i}) = \exp(\beta_{0i'})$ for any $i, i'$ in the same set $P_j$. On the other hand, when $i$ and $i'$ do not belong to the same set $P_j$, we have $\exp(\beta_{0i}) \neq \exp(\beta_{0i'})$. Thus, we can reformulate equation (58) as
\[
\sum_{j=1}^{m} \sum_{i \in P_j} \exp(\beta_{0i}) \exp\big( (\beta_{1i})^{\top} X + \alpha\sigma(\tau(\beta_{1i})^{\top} X) \big) h(X, \eta_i) = \sum_{j=1}^{m} \sum_{i \in P_j} \exp(\beta'_{0i}) \exp\big( (\beta'_{1i})^{\top} X + \alpha\sigma(\tau(\beta'_{1i})^{\top} X) \big) h(X, \eta'_i),
\]
for almost every $X$. Recall that $\beta_{1i} = \beta'_{1i}$ and $\beta_{0i} = \beta'_{0i}$ for any $i \in [L]$; the above equation then leads to
\[
\{\eta_i : i \in P_j\} \equiv \{\eta'_i : i \in P_j\}
\]
for any $j \in [m]$. As a consequence,
\[
G = \sum_{j=1}^{m} \sum_{i \in P_j} \exp(\beta_{0i}) \, \delta_{(\beta_{1i}, \eta_i)} = \sum_{j=1}^{m} \sum_{i \in P_j} \exp(\beta'_{0i}) \, \delta_{(\beta'_{1i}, \eta'_i)} = G'.
\]
Hence, we reach the conclusion of this proposition.

C Discussion of related Mixture of Experts works

Recently, the MoE model has been employed to mitigate catastrophic forgetting in continual learning. For example, [58] focused on continual learning in vision-language models by adapting a pre-trained vision-language model to new tasks through learning a mixture of specialized adapter modules: an MoE structure is introduced onto a frozen CLIP, utilizing a mixture of adapters to modify the MLP block after the MSA layer. In contrast, our work centers on general continual learning with pre-trained models, leveraging the inherent MoE architecture of MSA layers. Consequently, our MoE model placement differs from that of [58]. By employing prefix tuning, we demonstrate that adaptation is analogous to introducing new prefix experts to scale and adapt these pre-trained MoE models to downstream tasks. Furthermore, while [58] utilizes task-specific routers, our approach employs task-specific prompts that encapsulate both task-specific router and expert parameters.
Parameter cost is an important consideration in practical memory-constrained continual learning scenarios. A dynamic routing mechanism can be employed for gating-based neural networks [18], and integrating it into the proposed method can improve the parameter efficiency of the final model. Specifically, each head in the MSA layers comprises N MoE models, where N is the length of the input sequence; this allows a dynamic routing mechanism to enhance efficiency. For instance, [18] proposed a dynamic routing strategy that adaptively adjusts the number of activated experts based on the input. The computation of any MoE model's gating is directly correlated with the corresponding row of the attention matrix, which encapsulates that MoE model's score functions. For example, selecting the top k experts via Top-K routing in the i-th MoE model is equivalent to identifying the k largest values in the i-th row of the attention matrix. To implement [18], we first sort the elements of the i-th row from highest to lowest, then find the smallest set of experts whose cumulative probability exceeds a threshold. Consequently, unselected experts remain inactive, reducing the need to compute all elements of the value matrix within self-attention.

D Training Algorithm of HiDe-Prompt

In this appendix, we outline the detailed algorithm of HiDe-Prompt, utilizing the same notation as in Section 2. Each previously encountered class $c \in \mathcal{Y}(i)$, $i = 1, \ldots, t-1$, has its instructed and uninstructed representations approximated by Gaussian distributions, denoted as $\mathcal{G}_c$ and $\hat{\mathcal{G}}_c$, respectively. HiDe-Prompt maintains an expandable pool of task-specific prompts $e_t$, each optimized for a specific task $\mathcal{D}_t$ using a cross-entropy loss within the WTP objective. To prevent forgetting, previous prompts $e_1, \ldots, e_{t-1}$ remain frozen.
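Returning to the dynamic-routing discussion above: the threshold-based expert selection over one row of the attention matrix can be sketched as follows (a sketch under our notation; the function name and threshold value are our own, not from [18]):

```python
import numpy as np

def select_experts(attention_row, threshold=0.9):
    """Threshold-based dynamic routing over one row of the attention matrix.

    The i-th row plays the role of the i-th MoE model's gating scores; we
    keep the smallest set of experts whose cumulative softmax probability
    exceeds `threshold` and leave the rest inactive.
    """
    probs = np.exp(attention_row - attention_row.max())
    probs /= probs.sum()                      # softmax gating scores
    order = np.argsort(probs)[::-1]           # experts sorted by score, descending
    cumulative = np.cumsum(probs[order])
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return order[:k]                          # indices of the active experts

# Unselected experts stay inactive, so the corresponding entries of the
# value matrix never need to be computed.
```

Only the rows of the value matrix belonging to the returned indices then participate in the attention output.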
Knowledge transfer across tasks is facilitated by a prompt ensemble (PE) strategy: the current prompt is initialized from the previous one, $e_t \leftarrow e_{t-1}$, and refined using a weighted combination of all past prompts, $p_t = \alpha \sum_{i=1}^{t-1} e_i + (1-\alpha) e_t$, where $\alpha$ is a hyper-parameter. Notably, HiDe-Prompt incorporates contrastive regularization within the WTP objective, pushing features of the new task away from those of past tasks, represented by the prototypes of the old class distributions $\mathcal{G}_c$. Let $\mathcal{H}_t = \{f_\theta(x^{(t)}_i, p_t) \mid i = 1, \ldots, N_t\}$ be the embedding transformation of $\mathcal{D}_t$ and $\mu_c$ be the mean of $\mathcal{G}_c$. The contrastive loss can be written as
\[
\mathcal{L}_{CR}(p_t) = \sum_{h \in \mathcal{H}_t} \sum_{i=1}^{t-1} \sum_{c \in \mathcal{Y}(i)} \log \frac{\exp(h^{\top} \mu_c / \tau)}{\sum_{h' \in \mathcal{H}_t} \exp(h^{\top} h' / \tau) + \sum_{i'=1}^{t-1} \sum_{c' \in \mathcal{Y}(i')} \exp(h^{\top} \mu_{c'} / \tau)}, \tag{59}
\]
where $\tau$ is the temperature, set to 0.8. The overall objective function of WTP for learning a new task $t$ is defined as
\[
\mathcal{L}_{WTP}(\psi, p_t) = \mathcal{L}_{CE}(\psi, p_t) + \lambda \mathcal{L}_{CR}(p_t), \tag{60}
\]
where $\lambda$ is a hyper-parameter.

Algorithm 1 HiDe-Prompt's training algorithm
Input: pre-trained transformer backbone $f_\theta$, training sets $\mathcal{D}_1, \ldots, \mathcal{D}_T$, number of tasks $T$, number of epochs $E$, hyper-parameters $\alpha$, $\tau$ and $\lambda$.
Output: parameters $p_1, \ldots, p_T$, $\omega$ and $\psi$.
1: Initialize $e_1$, $\omega$ and $\psi$
2: for $t = 1, \ldots, T$ do
3:   for $c \in \mathcal{Y}(t)$ do
4:     Obtain $\hat{\mathcal{G}}_c$ from $f_\theta$ and $\mathcal{D}_t$  ▷ Uninstructed Representations
5:   if $t > 1$ then
6:     Initialize $e_t \leftarrow e_{t-1}$
7:     Construct $p_t = \alpha \sum_{i=1}^{t-1} e_i + (1-\alpha) e_t$
8:   else
9:     Construct $p_t = e_t$
10:  for epoch $= 1, \ldots, E$ do
11:    Optimize $p_t$ and $\psi$ with $\mathcal{L}_{WTP}$ in Eq. (60)  ▷ Within-Task Prediction
12:    Optimize $\omega$ with $\mathcal{L}_{TII}$ in Eq. (62)  ▷ Task-Identity Inference
13:    Optimize $\psi$ with $\mathcal{L}_{TAP}$ in Eq. (61)  ▷ Task-Adaptive Prediction
14:  for $c \in \mathcal{Y}(t)$ do
15:    Obtain $\mathcal{G}_c$ from $f_\theta$, $p_t$ and $\mathcal{D}_t$  ▷ Instructed Representations
16: return $(p_1, \ldots, p_T, \omega, \psi)$

Following WTP, HiDe-Prompt performs a further refinement step on the output-layer parameters $\psi$ using a separate objective called task-adaptive prediction (TAP). TAP addresses potential classifier bias by considering the Gaussian distributions of all classes encountered so far.
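A minimal sketch of the contrastive term in Eq. (59), assuming the normalization written above (the function name and array layout are our own):

```python
import numpy as np

def contrastive_regularization(H_t, prototypes, tau=0.8):
    """Sketch of L_CR in Eq. (59): push new-task embeddings H_t away from
    the prototypes (class means mu_c) of previously seen classes.

    H_t:        (N_t, D) embeddings of the current task's samples.
    prototypes: (C_old, D) means of the old class Gaussians G_c.
    """
    sim_proto = H_t @ prototypes.T / tau      # similarity to old prototypes
    sim_self = H_t @ H_t.T / tau              # similarity among new-task features
    denom = np.exp(sim_self).sum(axis=1) + np.exp(sim_proto).sum(axis=1)
    # Minimizing this sum drives exp(h^T mu_c / tau) toward zero, i.e.
    # new features move away from old-class prototypes.
    return np.sum(sim_proto - np.log(denom)[:, None])
```

The loss decreases as the new embeddings become less similar to the stored prototypes, which is the intended separation effect.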
The final output layer $h_\psi$ is further optimized for the TAP objective,
\[
\mathcal{L}_{TAP}(\psi) = -\sum_{i=1}^{t} \sum_{c \in \mathcal{Y}(i)} \sum_{h \in \mathcal{H}_{i,c}} \log \frac{\exp(h_\psi(h)[c])}{\sum_{j=1}^{t} \sum_{c' \in \mathcal{Y}(j)} \exp(h_\psi(h)[c'])}, \tag{61}
\]
where $\mathcal{H}_{i,c}$ is constructed by sampling an equal number of pseudo representations from $\mathcal{G}_c$ for $c \in \mathcal{Y}(i)$ and $i = 1, \ldots, t$. For TII, HiDe-Prompt leverages a lightweight auxiliary output layer $\hat{h}_\omega: \mathbb{R}^D \to \mathbb{R}^T$ to map uninstructed representations directly to task identity. This mapping is learned explicitly through a cross-entropy loss,
\[
\mathcal{L}_{TII}(\omega) = -\sum_{c} \sum_{\hat h \in \hat{\mathcal{H}}_c} \log \frac{\exp(\hat h_\omega(\hat h)[c])}{\sum_{c' \in \mathcal{Y}_t} \exp(\hat h_\omega(\hat h)[c'])}, \tag{62}
\]
where $\hat{\mathcal{H}}_c$ is constructed by sampling an equal number of pseudo representations from $\hat{\mathcal{G}}_c$ for $c \in \mathcal{Y}(i)$ and $i = 1, \ldots, t$. Please refer to Algorithm 1 for more details.

E Experimental Details

Datasets. We use datasets commonly used in the field of continual learning: (1) Split CIFAR-100 [23]: this dataset comprises images from 100 classes, divided randomly into 10 separate incremental tasks, with each task featuring a distinct set of classes. (2) Split ImageNet-R [23]: this dataset is composed of images from 200 classes; it includes challenging examples from the original ImageNet [40] dataset and newly gathered images representing diverse styles, and the classes are randomly divided into 10 distinct incremental tasks. (3) Split CUB-200 [48]: this dataset consists of fine-grained images of 200 different bird species, randomly divided into 10 incremental tasks, each comprising a unique class subset. (4) 5-Datasets [9]: this composite dataset incorporates CIFAR-10 [23], MNIST [24], Fashion-MNIST [55], SVHN [31], and notMNIST [3]; each of these datasets is treated as a separate incremental task, permitting assessment of the effects of significant variation between tasks.

Prompt-Based Approaches. We compare NoRGa against recent prompt-based continual learning approaches: L2P [54], DualPrompt [53], CODA-Prompt [42], S-Prompt [52] and HiDe-Prompt [49].
Table 4: Performance comparison in the task-incremental learning setting. Here we present Final Average Accuracy (FA).

Method          | Split CIFAR-100           | Split CUB-200
                | Sup-21K      | iBOT-21K   | Sup-21K      | iBOT-21K
HiDe-Prompt     | 97.87±0.31   | 97.48±0.33 | 97.57±0.08   | 92.34±0.34
NoRGa tanh      | 98.55±0.45   | 98.26±0.36 | 97.86±0.14   | 92.85±0.33
NoRGa sigmoid   | 98.63±0.35   | 98.15±0.29 | 97.89±0.14   | 92.85±0.22
NoRGa GELU      | 98.41±0.47   | 98.17±0.30 | 97.76±0.10   | 93.00±0.11

Table 5: Performance comparison of different PEFT methods using ViT-B/16 with Sup-21K weights. Here we present Final Average Accuracy (FA).

Method       | Split CIFAR-100 | Split CUB-200
HiDe-Prompt  | 92.61           | 86.56
HiDe-LoRA    | 92.71           | 87.37
HiDe-Adapter | 92.73           | 87.10
NoRGa        | 94.48           | 90.90

To ensure a fair comparison, we replicate these methods using the configurations reported in their respective papers. S-Prompt originally trains a separate prompt and classifier head for each task; at evaluation, it infers the domain id as the nearest centroid obtained by applying K-Means to the training data. To adapt S-Prompt to CIL, we use one common classifier head for all tasks. For NoRGa, we adopt the same configuration as HiDe-Prompt, which utilizes prefix tuning [26] as its prompt-based methodology. The learnable scalar factors $\alpha$ and $\tau$ are frozen after the first task's training to mitigate catastrophic forgetting. We further optimize NoRGa by selecting the best non-linear activation function $\sigma$ via cross-validation among tanh, sigmoid, and GELU.

Evaluation Metric. We employ three common metrics to measure the performance of the methods: final average accuracy (FA), cumulative average accuracy (CA), and average forgetting measure (FM). Let $S_{i,t}$ denote the accuracy on the $i$-th task after learning the $t$-th task, and let $A_t = \frac{1}{t}\sum_{i=1}^{t} S_{i,t}$ denote the average accuracy after learning task $t$. Upon learning all $T$ tasks, we compute $FA = A_T$, $CA = \frac{1}{T}\sum_{t=1}^{T} A_t$, and $FM = \frac{1}{T-1}\sum_{i=1}^{T-1}\max_{t \in \{1,\ldots,T-1\}} (S_{i,t} - S_{i,T})$.
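The three metrics above can be computed directly from the accuracy matrix $S$; a sketch (the function name and 0-indexed layout are our own):

```python
import numpy as np

def continual_metrics(S):
    """Compute FA, CA and FM from the accuracy matrix S.

    S[i, t] is the accuracy on task i (0-indexed) after learning task t;
    entries with t < i are unused since task i has not been seen yet.
    """
    T = S.shape[0]
    A = np.array([S[: t + 1, t].mean() for t in range(T)])   # A_t
    FA = A[-1]                                               # final average accuracy
    CA = A.mean()                                            # cumulative average accuracy
    # Forgetting: best past accuracy on task i minus its final accuracy.
    FM = np.mean([S[i, i:T - 1].max() - S[i, T - 1] for i in range(T - 1)])
    return FA, CA, FM
```

For example, with $T = 2$ and accuracies 0.9 and 0.7 on task 1 (before and after task 2) and 0.8 on task 2, this yields FA = 0.75, CA = 0.825, and FM = 0.2.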
It is worth noting that FA and CA are prioritized over FM, as they inherently encompass both plasticity and forgetting, with FM providing supplementary context [42].

Implementation Details. We train and test on one NVIDIA A100 GPU for both the baselines and our method. We leverage a pre-trained ViT-B/16 model as the backbone. Training employs an Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$), a batch size of 128, and a constant learning rate of 0.005 for all methods except CODA-Prompt, which utilizes a cosine-decaying learning rate starting at 0.001. Additionally, a grid search was used to determine the most appropriate number of epochs for effective training.

F Task-incremental Learning Results

Because HiDe-Prompt optimizes prompt parameters specifically for within-task prediction (WTP), NoRGa inherently aligns with this objective, leading to generally better continual learning performance. We demonstrate this improvement through experiments in a task-incremental learning setting, where task labels are available during inference (Table 4). While HiDe-Prompt performs well, NoRGa shows consistent improvement across all scenarios. Notably, NoRGa with sigmoid activation achieves the highest final average accuracy on both Split CIFAR-100 and Split CUB-200 with Sup-21K pre-training. Additionally, NoRGa demonstrates its effectiveness even with self-supervised pre-training, further solidifying its advantage over the original prefix tuning model. Overall, NoRGa variants outperform HiDe-Prompt on both datasets and under both pre-training paradigms.

G Comparison to Different Parameter-Efficient Fine-Tuning Methods

As the advantages of different parameter-efficient fine-tuning (PEFT) methods remain an open question, we briefly characterize them through our revealed connection between self-attention and MoE.

Table 6: Performance comparison of pre-trained model-based continual learning methods using ViT-B/16 with Sup-21K weights.
Here we present Final Average Accuracy (FA).

Method         | Split CIFAR-100 | Split CUB-200
ADAM + VPT-D   | 85.04           | 85.28
ADAM + SSF     | 85.27           | 85.67
ADAM + Adapter | 87.29           | 85.84
RanPAC         | 92.20           | 90.30
NoRGa          | 94.48           | 90.90

Table 7: Ablation study on the effect of learnable $\alpha$ and $\tau$ with Sup-21K weights. Here we present Final Average Accuracy (FA).

Method                   | Split CIFAR-100 | Split CUB-200
HiDe-Prompt              | 92.61           | 86.56
Learnable α, Fixed τ     | 94.38           | 90.45
Fixed α, Learnable τ     | 94.42           | 90.48
Fixed α, Fixed τ         | 94.29           | 90.32
Learnable α, Learnable τ | 94.48           | 90.90

Prefix tuning introduces additional parameters at the input of MSA layers to adapt the pre-trained model representation, in contrast to Adapter [16], which inserts adaptive parameters between layers, typically around the MLP blocks. LoRA [17] approximates weight updates with low-rank matrices and adds them to the backbone weights. Our work shows that the MSA layer in a pre-trained model can be seen as a pre-trained MoE architecture. Applying LoRA to the MSA layer refines both the pre-trained experts and their corresponding score functions for downstream tasks. In contrast, prefix tuning expands the pre-trained MoE models by incorporating new experts while preserving the original components, rather than modifying the pre-trained experts as LoRA does. NoRGa emerges as a simple, parameter-efficient fine-tuning method and can be regarded as a distinct implementation of prompts. Moreover, our novel perspective on the interplay between self-attention, prefix tuning, and mixture of experts enables us to theoretically substantiate the effectiveness of NoRGa, as shown in Section 4. For empirical comparison, we integrate the framework of HiDe-Prompt with different PEFT techniques and Sup-21K weights, evaluating performance on Split CIFAR-100 and Split CUB-200. The results are summarized in Table 5: NoRGa consistently outperforms the other PEFT methods on both datasets, suggesting its effectiveness.
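The distinction drawn above, LoRA refining the frozen mapping versus prefix tuning appending new experts, can be sketched as follows (shapes and variable names are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, r = 8, 4, 2, 2             # embedding dim, tokens, prefix length, LoRA rank

X = rng.normal(size=(n, d))         # input token sequence
W_k = rng.normal(size=(d, d))       # frozen key projection of a pre-trained MSA layer

# LoRA: refine the pre-trained experts by perturbing the frozen weight.
# A is initialized to zero, so the adapted layer starts identical to the backbone.
B, A = rng.normal(size=(d, r)), np.zeros((r, d))
K_lora = X @ (W_k + B @ A)          # same number of keys, modified mapping

# Prefix tuning: keep W_k intact and prepend m new learnable keys,
# i.e. add new experts alongside the pre-trained ones.
prefix_keys = rng.normal(size=(m, d))
K_prefix = np.concatenate([prefix_keys, X @ W_k], axis=0)
```

LoRA keeps the number of keys fixed while changing the projection; prefix tuning leaves the projection frozen and grows the key set, which is exactly the "new experts" view used in our analysis.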
Nevertheless, further investigation with LoRA and Adapter would be necessary to draw more definitive conclusions. While exploring alternative PEFT methods might offer improvements in WTP performance, these approaches lack theoretical guarantees and could lead to an increased number of parameters. In contrast, our NoRGa method modifies the original score functions of prefix tuning to enhance WTP performance with theoretical rigor. Importantly, NoRGa maintains the same parameter count as HiDe-Prompt, which is crucial in CL due to memory constraints.

H Comparison with Pre-trained Model-based Methods

Previous works have demonstrated that utilizing pre-trained models (PTM) significantly enhances performance for continual learning, often surpassing non-PTM-based methods. Moreover, studies have shown that first-task adaptation and simple PEFT-style tuning can achieve performance competitive with prompt-based methods [20, 37, 60, 29]. For instance, [20] demonstrated that appending a nearest class mean (NCM) classifier to a ViT model's feature outputs can serve as a strong baseline. [37, 60] enhanced this strategy by adapting the pre-trained model to the first task, using three PEFT methods for transformer networks [60] and the FiLM method for CNNs [37]. Additionally, [29] improved NCM by incorporating second-order statistics (covariance and Gram matrices). However, these methods, which fine-tune only the backbone for the initial task, may not always ensure satisfactory separation of new tasks' features. Our work focuses on continually adapting the backbone, utilizing task-specific prompts to consistently capture the characteristics of emerging tasks, and proposing a novel method to enhance the CL performance of previous prompting methods.

Figure 3: Validation loss on Split CUB-200 throughout the training of the first task.

Table 8: Comparison of training times for HiDe-Prompt and NoRGa. All experiments were conducted on a single NVIDIA A100 GPU.

Method       | Split CIFAR-100 | Split ImageNet-R | Split CUB-200 | 5-Datasets
HiDe-Prompt  | 2.80h           | 2.67h            | 1.04h         | 24.06h
NoRGa        | 2.85h           | 2.70h            | 1.10h         | 24.23h

To validate our approach, we compare it against state-of-the-art PTM-based continual learning methods, including ADAM [60] and RanPAC [29], on the Split CIFAR-100 and Split CUB-200 datasets. The results are summarized in Table 6. In comparison to other PTM-based continual learning methods, NoRGa demonstrates competitive performance across both evaluated datasets. For instance, on Split CIFAR-100, NoRGa achieves an FA of 94.48%, exceeding the next best method by over 2%. Similarly, on Split CUB-200, NoRGa delivers strong results relative to the other baselines. These improvements highlight the effectiveness of our method in mitigating catastrophic forgetting and preserving knowledge retention across multiple tasks.

I Efficiency Tests

We compare the validation loss of NoRGa and HiDe-Prompt throughout the first task on Split CUB-200, as illustrated in Figure 3. The results demonstrate that NoRGa consistently outperforms HiDe-Prompt throughout the training process. This empirical evidence supports the theoretical advantages of NoRGa over HiDe-Prompt.

J Effect of Learnable Hyperparameters

As described in Section 4.1 and Appendix E, $\alpha$ and $\tau$ are learnable hyperparameters in our framework, optimized through backpropagation during the first task, eliminating the need for manual tuning. Additionally, our theoretical analysis of NoRGa's statistical efficiency in Section 4.2 holds for any values of $\alpha$ and $\tau$, demonstrating theoretical robustness. To investigate further, we evaluated both fixed and learnable settings for these hyperparameters; for the fixed case, we set their values to 1. We report FA on Split CUB-200 and Split CIFAR-100 with Sup-21K weights. The results are summarized in Table 7.
Although performance slightly decreased with fixed hyperparameters, it still outperforms HiDe-Prompt, indicating our method's empirical robustness.

K Training Times

We utilize a single A100 GPU for all experiments. The training times are summarized in Table 8. While NoRGa exhibits slightly longer training times compared to HiDe-Prompt, it consistently achieves significantly better performance. This demonstrates the effectiveness of NoRGa while maintaining competitive training efficiency.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The paper's contributions and scope are reflected accurately in the abstract and introduction.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We have already discussed the limitations of our work in the conclusion section.
Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The full set of assumptions and proofs is given in Section 4.2 and the Appendices.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: All the experimental details for reproducibility are specified in Section 5 and Appendix E.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general,
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results in supplemental material. Guidelines: The answer NA means that paper does not include experiments requiring code. 
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: All the experimental details are specified in Section 5 and Appendix E.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We report error bars and relevant information in Section 5 and Appendix F.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]

Justification: We have provided sufficient information on the computer resources in Appendix E.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: We have read and followed the NeurIPS Code of Ethics.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: Given the theoretical nature of the paper, we do not think there are any positive or negative societal impacts of the work performed. Experiments are conducted only for empirically justifying the theoretical results.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: Our paper does not release any data or models that have a high risk for misuse.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: The data and models used in the paper are properly credited. See Section 5 and Appendix E for further details.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: Our paper does not release new assets.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: Our paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]

Justification: Our paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.