# theory_on_mixtureofexperts_in_continual_learning__7af5c700.pdf

Published as a conference paper at ICLR 2025

THEORY ON MIXTURE-OF-EXPERTS IN CONTINUAL LEARNING

Hongbo Li1,3, Sen Lin2, Lingjie Duan1, Yingbin Liang3, Ness B. Shroff3,4

1Engineering Systems and Design Pillar, Singapore University of Technology and Design, hongbo_li@mymail.sutd.edu.sg, lingjie_duan@sutd.edu.sg 2Department of Computer Science, University of Houston, slin50@central.uh.edu 3Department of ECE, The Ohio State University, {li.15242, liang.889, shroff.11}@osu.edu 4Department of CSE, The Ohio State University

Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (Mo E) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of Mo E and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of Mo E in CL via the lens of overparameterized linear regression tasks. We establish the benefit of Mo E over a single expert by proving that the Mo E model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Our study further suggests an intriguing fact that the Mo E in CL needs to terminate the update of the gating network after sufficient training rounds to attain system convergence, which is not needed in the existing Mo E studies that do not consider the continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of Mo E in the learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs), which also shed light on the practical algorithm design for Mo E in CL.

1 INTRODUCTION

Continual Learning (CL) has emerged as an important paradigm in machine learning (Parisi et al. (2019); Wang et al. (2024)), in which an expert aims to learn a sequence of tasks one by one over time. The expert is anticipated to leverage the knowledge gained from old tasks to facilitate learning new tasks, while simultaneously enhancing the performance of old tasks via the knowledge obtained from new ones. Given the dynamic nature of CL, one major challenge herein is known as catastrophic forgetting (Mc Closkey & Cohen (1989); Kirkpatrick et al. (2017)), where the expert can perform poorly on (i.e., easily forget) the previous tasks when learning new tasks if data distributions change largely across tasks. This becomes a more serious issue if a single expert continues to serve an increasing number of tasks.

Recently, the sparsely-gated Mixture-of-Experts (Mo E) model has achieved astonishing successes in deep learning, especially in the development of large language models (LLMs) (e.g., Du et al. (2022); Li et al. (2024); Lin et al. (2024); Xue et al. (2024)). By adaptively routing different input data to one of the multiple experts through a gating network (Eigen et al. (2013); Shazeer et al. (2016)), different experts in the Mo E model will be specialized to grasp different knowledge in the data. Thus inspired, there have emerged several attempts to leverage Mo E in mitigating the forgetting issue in CL (e.g., Hihn & Braun (2021); Wang et al. (2022); Doan et al. (2023); Rype s c et al. (2023); Yu et al. (2024)) by training each expert to handle a particular set of tasks. However, these studies have primarily

Published as a conference paper at ICLR 2025

focused on experimental investigations, whereas the theoretical understanding of Mo E and its impact on the learning performance in CL is still lacking. In this paper, we aim to fill this gap by providing the first explicit theoretical results to comprehensively understand Mo E in CL.

To this end, we study the sparsely-gated Mo E model with M experts in CL through the lens of overparameterized linear regression tasks (Viele & Tong (2002); Anandkumar et al. (2014); Zhong et al. (2016)). In our setting of CL, one learning task arrives in each round and its dataset is generated with the ground truth randomly drawn from a shared pool encompassing N unknown linear models. Subsequently, the data is fed into a parameterized gating network, guided by which a softmax-based router will route the task to one of the M experts for model training. The trained Mo E model then further updates the gating network by gradient descent (GD). After that, a new task arrives and the above process repeats until the end of CL. It is worth noting that analyzing linear models is an important first step towards understanding the performance of deep neural networks (DNNs), as shown in many recent studies (e.g., Evron et al. (2022); Lin et al. (2023); Chen et al. (2022)).

Our main contributions are summarized as follows.

We provide the first theoretical analysis to understand the behavior of Mo E in CL, through the lens of overparameterized linear regression tasks. By updating the gating network with a carefully designed loss function, we show that after sufficient training rounds (on the order of O(M)) in CL for expert exploration and router learning, the Mo E model will diversify and move into a balanced system state: Each expert will specialize either in a specific task (if M > N) or in a cluster of similar tasks (if M < N), and the router will consistently select the right expert for each task. Another interesting finding is that, unlike existing studies in Mo E (e.g., Fedus et al. (2022); Chen et al. (2022); Li et al. (2024)), it is necessary to terminate the update of the gating network for Mo E due to the dynamics of task arrival in CL. This will ensure that the learning system eventually converges to a stable state with balanced loads among all experts.

We provide explicit expressions of the expected forgetting and generalization error to characterize the benefit of Mo E on the performance of CL. 1) Compared to the single expert case (M = 1), where tasks are learned by a single diverged model, the Mo E model with diversified experts significantly enhances the learning performance, especially with large changes in data distributions across tasks. 2) Regardless of whether there are more experts (M > N) or fewer experts (M < N), both forgetting and generalization error converge to a small constant. This occurs because the router consistently selects the right expert for each task after the Mo E model converges, efficiently minimizing model errors caused by switching tasks. 3) In Mo E, initially adding more experts requires additional exploration rounds before convergence, which does not necessarily improve learning performance.

Finally, we conduct extensive experiments to verify our theoretical results. Specifically, our experimental results on synthetic data with linear models not only support our above-mentioned theoretical findings, but also show that load balancing reduces the average generalization error. This effectively improves the capacity of the Mo E model compared to the unbalanced case. More importantly, the experiments on real datasets suggest that our theoretical findings can be further carried over beyond linear models to DNNs, which also provides insights on practical algorithm design for Mo E in CL.

2 RELATED WORK

Continual learning. In the past decade, various empirical approaches have been proposed to tackle catastrophic forgetting in CL, which generally fall into three categories: 1) Regularization-based approaches (e.g., Kirkpatrick et al. (2017); Ritter et al. (2018); Gou et al. (2021); Liu & Liu (2021)), which introduce explicit regularization terms on key model parameters trained by previous tasks to balance old and new tasks. 2) Parameter-isolation-based approaches (e.g., Chaudhry et al. (2018); Serra et al. (2018); Jerfel et al. (2019); Yoon et al. (2019); Konishi et al. (2023)), which isolate parameters associated with different tasks to prevent interference between parameters. 3) Memorybased approaches (e.g., Farajtabar et al. (2020); Jin et al. (2021); Lin et al. (2021); Saha et al. (2020); Tang et al. (2021); Gao & Liu (2023)), which store data or gradient information from old tasks and replay them during training of new tasks.

On the other hand, theoretical studies on CL are very limited. Among them, Doan et al. (2021) and Bennani et al. (2020) introduce NTK overlap matrix to measure the task similarity and propose variants of the orthogonal gradient descent approach to address catastrophic forgetting. Lee et al.

Published as a conference paper at ICLR 2025

(2021) consider a teacher-student framework to examine the impact of task similarity on learning performance. Peng et al. (2023) propose an ideal CL framework that can achieve no forgetting by assuming i.i.d. data distributions for all tasks. Evron et al. (2022) provide forgetting bounds in overparameterized linear models on different task orders. Further, Lin et al. (2023) provide explicit forms of forgetting and overall generalization error based on the testing error. These works collectively suggest that learning performance with a single expert tends to deteriorate when subsequent tasks exhibit significant diversification. In contrast, our work is the first to conduct theoretical analysis to understand the benefit of multiple experts on CL.

Mixture-of-Experts model. The Mo E model has been extensively studied over the years for enhancing model capacity in deep learning (e.g., Eigen et al. (2013); Riquelme et al. (2021); Wang & Van Hoof (2022); Zhou et al. (2022); Chi et al. (2022); Zadouri et al. (2023)). Recently, it has found widespread applications in emerging fields such as LLMs (e.g., Du et al. (2022); Li et al. (2024); Lin et al. (2024); Xue et al. (2024)). To improve training stability and simplify the Mo E structure, Shazeer et al. (2016) propose to sparsify the output of the gating network. Subsequently, Fedus et al. (2022) suggest routing each data sample to a single expert instead of multiple experts. For theoretical studies, Nguyen & Chamroukhi (2018) propose a maximum quasi-likelihood method for estimating Mo E parameters, while Chen et al. (2022) analyze Mo E mechanisms in deep learning for single-task classification. Unlike these works, which do not address sequential task training in CL, our study focuses on Mo E in CL, introducing distinct training phases for the gating network. Additionally, we derive explicit expressions for forgetting and generalization errors.

Mo E in CL. Recently, the Mo E model has been applied to reducing catastrophic forgetting in CL (Lee et al. (2020); Hihn & Braun (2021); Wang et al. (2022); Doan et al. (2023); Rype s c et al. (2023); Yu et al. (2024)). For example, Lee et al. (2020) expand the number of experts using the Bayesian nonparametric framework to address task-free CL. Rype s c et al. (2023) propose to diverse experts by routing data with minimal distribution overlap to each expert and then combine experts knowledge during task predictions to enhance learning stability. Additionally, Yu et al. (2024) apply Mo E to expand the capacity of vision-language models, alleviating forgetting in CL. However, these works solely focus on empirical methods, lacking theoretical analysis of how the Mo E performs in CL.

3 PROBLEM SETTING AND MOE MODEL DESIGN

Notations. For a vector w, let w 2 and w denote its ℓ-2 and ℓnorms, respectively. For some positive constant c1 and c2, we define x = Ω(y) if x > c2|y|, x = Θ(y) if c1|y| < x < c2|y|, and x = O(y) if x < c1|y|. We also denote by x = o(y) if x/y 0.

3.1 CL IN LINEAR MODELS

General setting. We consider the CL setting with T training rounds. In each round t [T], one out of N tasks randomly arrives to be learned by the Mo E model with M experts. For each task, we follow most theoretical work on CL by fitting a linear model f(X) = X w with ground truth w Rd (e.g., Evron et al. (2022); Lin et al. (2023)), which serves as a foundation for understanding DNN generalization performance (Belkin et al. (2018); Ju et al. (2020)). Then, for the task arrival in the t-th training round, it corresponds to a linear regression problem, where the training dataset is denoted by Dt = (Xt, yt). Here Xt Rd st is the feature matrix with st samples of d-dimensional vectors, and yt Rst is the output vector. In this study, we focus on the overparameterized regime, where st < d. Consequently, there exist numerous linear models that can perfectly fit the data.

Ground truth and dataset. Let W = {w1, , w N} represent the collection of ground truth vectors of all N tasks. For any two tasks n, n [N], we assume wn wn = O(σ0), where σ0 (0, 1) denotes the variance. Moreover, we assume that task n possesses a unique feature signal vn Rd with vn = O(1) (Chen et al. (2022); Huang et al. (2024)).

In each training round t [T], let nt [N] denote the index of the current task arrival with ground truth wnt W. In the following, we formally define the generation of dataset per training round.

Definition 1. At the beginning of each training round t [T], the dataset Dt = (Xt, yt) of the new task arrival nt is generated by the following steps:

1) Uniformly draw a ground truth wn from ground-truth pool W and let wnt = wn.

Published as a conference paper at ICLR 2025

2) Independently generate a random variable βt (0, C], where C is a constant satisfying C = O(1).

3) Generate Xt as a collection of st samples, where one sample is given by βtvnt and the rest of the st 1 samples are drawn from normal distribution N(0, σ2 t Id), where σt 0 is the noise level.

4) Generate the output to be yt = X t wnt.

In any training round t [T], the actual ground truth wnt of task arrival nt is unknown. However, according to Definition 1, task nt can be classified into one of N clusters based on its feature signal vnt. Although the position of vnt in feature matrix Xt is not specified for each task nt, we can address this binary classification sub-problem over Xt using a single gating network in Mo E (Shazeer et al. (2016); Fedus et al. (2022); Chen et al. (2022)). In this context, we aim to investigate whether the Mo E model can enhance the learning performance in CL. For ease of exposition, we assume st = s for all t [T] in this paper. Then we will introduce the Mo E model in the following subsections.

3.2 STRUCTURE OF THE MOE MODEL

Gating Network

Expert 1 Expert 2 Expert M . . .

Figure 1: An illustration of the Mo E model.

As shown in Figure 1, an Mo E model comprises a collection of M experts, a router, and a gating network which is typically set to be linear (Shazeer et al. (2016); Fedus et al. (2022); Chen et al. (2022)). In the t-th round, upon the arrival of task nt and input of its data Dt = (Xt, yt), the gating network computes its linear output hm(Xt, θ(m) t ) for each expert m [M], where θ(m) t Rd is gating network parameter for expert m. Define h(Xt, Θt) := [h1(Xt, θ(1) t ) h M(Xt, θ(M) t )] and Θt := [θ(1) t θ(M) t ] as the outputs and the parameters of the gating network for all experts, respectively. Then we obtain h(Xt, Θt) = P

i [st] Θ t Xt,i, where Xt,i is the i-th sample of the feature matrix Xt.

To sparsify the gating network and reduce the computation cost, we employ top-1 switch routing", which maintains model quality while lowering routing computation, as demonstrated by Fedus et al. (2022); Chen et al. (2022); Yang et al. (2021). Although this top-1 gating model is simple, it is fundamental gaining a theoretical understanding of the behavior of Mo E in CL, and its theoretical analysis is already non-trivial. Extending to the top-k routing strategy (as introduced by Shazeer et al. (2016)) is nontrivial and falls outside the scope of this work. However, we still provide a discussion on the learning performance of the top-k routing strategy later in Section 5.2.

In each round t, as depicted in Figure 1, for task nt, the router selects the expert with the maximum gate output hm(Xt, θ(m) t ), denoted as mt, from the M experts (Chen et al. (2022); Nguyen et al. (2024)). In practice, to encourage exploration across experts and stabilize Mo E training, we add perturbations to the router (Shazeer et al. (2016); Fedus et al. (2022); Chen et al. (2022)). Specifically, task nt will be routed to the expert that satisfies mt = arg maxm{hm(Xt, θ(m) t ) + r(m) t }, (1)

where r(m) t for any m [M] is drawn independently from the uniform distribution Unif[0, λ]. We analyze in Appendix B that this routing strategy Eq. (1) ensures continuous and stable transitions for tasks. Additionally, the router calculates the softmax gate outputs, derived by

πm(Xt, Θt) = exp(hm(Xt,θ(m) t )) PM m =1 exp(hm (Xt,θ(m) t )), m [M], (2)

for the Mo E to update the gating network parameter Θt+1 for all experts.

3.3 TRAINING OF THE MOE MODEL WITH KEY DESIGNS

Expert model. Let w(m) t denote the model of expert m in the t-th training round, where each model is initialized from zero, i.e., w(m) 0 = 0 for any m [M]. After the router determines the expert mt by Eq. (1), it transfers the dataset Dt = (Xt, yt) to this expert for updating w(mt) t . For any other expert m [M] not selected ( i.e., m = mt), its model w(m) t remains unchanged from w(m) t 1. In

Published as a conference paper at ICLR 2025

each round t, the training loss is defined by the mean-squared error (MSE) relative to dataset Dt: Ltr t (w(mt) t , Dt) = 1

st (Xt) w(mt) t yt 2 2. (3) Since we focus on the overparameterized regime, there exist infinitely many solutions that perfectly satisfy Ltr t (w(mt) t , Dt) = 0 in Eq. (3). Among these solutions, gradient descent (GD) starting from the previous expert model w(mt) t 1 at the convergent point provides a unique solution for minimizing Ltr t (w(mt) t , Dt) in Eq. (3), which is determined by the following optimization problem (Evron et al. (2022); Gunasekar et al. (2018); Lin et al. (2023)): min wt wt w(mt) t 1 2, s.t. X t wt = yt. (4)

Solving Eq. (4), we update the selected expert in the Mo E model for the current task arrival nt as follows, while keeping the other experts unchanged: w(mt) t = w(mt) t 1 + Xt(X t Xt) 1(yt X t w(mt) t 1 ). (5)

Gating network parameters. After obtaining w(mt) t in Eq. (5), the Mo E updates the gating network parameter from Θt to Θt+1 using GD for the next training round. On one hand, we aim for θ(m) t+1 of each expert m to specialize in a specific task, which helps mitigate learning loss caused by the incorrect routing of distinct tasks. On the other hand, the router needs to balance the load among all experts (Fedus et al. (2022); Shazeer et al. (2016); Li et al. (2024)) to reduce the risk of model overfitting and enhance the learning performance in CL. To achieve this, we introduce our first key design of multi-objective training loss for gating network updates.

Key design I: Multi-objective training loss. First, based on the updated expert model w(mt) t in Eq. (5), we propose the following locality loss function for updating Θt: Lloc t (Θt, Dt) = P m [M] πm(Xt, Θt) w(m) t w(m) t 1 2, (6) where πm(Xt, Θt) is the softmax output defined in Eq. (2). Since our designed locality loss in Eq. (6) is minimized when the tasks with similar ground truths are routed to the same expert m (e.g., w(m) t = w(m) t 1), it enjoys several benefits as shown later in our theoretical results: each expert will specialize in a particular set of tasks which leads to fast convergence of expert model w(m) t , and the performance of CL in terms of forgetting and generalization error will be improved. Note that in Eq. (6), we only need to calculate the locality loss for the single expert mt, as w(m) t w(m) t 1 2 = 0 for any expert m = mt that has not updated its model, leading to low computational complexity.

In addition to the novel locality loss in Eq. (6), we follow the existing Mo E literature (e.g., Fedus et al. (2022); Shazeer et al. (2016); Li et al. (2024)) where an auxiliary loss is typically defined to characterize load balance among the experts: Laux t (Θt, Dt) = α M P

m [M] f (m) t P (m) t , (7)

Algorithm 1 Training of the Mo E model for CL

1: Input: T, σ0, Γ = O(σ1.25 0 ), λ = Θ(σ1.25 0 ), I(m) = 0, α = O(σ0.5 0 ), η = O(σ0.5 0 ), T1 = η 1M ; 2: Initialize θ(m) 0 = 0 and w(m) 0 = 0, m [M]; 3: for t = 1, , T do 4: Generate r(m) t for any m [M]; 5: Select mt in Eq. (1) and update w(mt) t in Eq. (5); 6: if t > T1 then 7: for m [M] with |hm hmt| < Γ do 8: I(m) = 1; // Convergence flag 9: end for 10: end if 11: if m, s.t. I(m) = 0 then 12: Update θ(m) t as in Eq. (9) for any m [M]; 13: end if 14: end for

where α is constant, f (m) t = 1 t Pt τ=1 1{mτ = m} is the fraction of tasks dispatched to expert m since t = 1, and P (m) t = 1 t Pt τ=1 πm(Xτ, Θτ) 1{mτ = m} is the average probability that the router chooses expert m since t = 1. The auxiliary loss in Eq. (7) encourages exploration across all experts since it is minimized under a uniform routing with f (m) t = 1 M and P (m) t = 1 M . Although the definition of auxiliary loss in Eq. (7) is not new, it is necessary and plays a crucial role for balancing the load across experts in the Mo E model for CL.

Based on Eq. (3), Eq. (6) and Eq. (7), we finally define the task loss for each task arrival

Published as a conference paper at ICLR 2025

nt as follows: Ltask t (Θt, w(mt) t , Dt) = Ltr t (w(mt) t , Dt) + Lloc t (Θt, Dt) + Laux t (Θt, Dt). (8) Commencing from the initialization Θ0, the gating network is updated based on GD: θ(m) t+1 = θ(m) t η θ(m) t Ltask t (Θt, w(mt) t , Dt), m [M], (9)

where η > 0 is the learning rate. Note that w(mt) t in Eq. (5) is also the optimal solution for minimizing Ltask t (Θt, w(mt) t , Dt) in Eq. (8). This is because both Lloc t (Θt, Dt) and Laux t (Θt, Dt) are derived after updating w(mt) t , making Ltr t (w(mt) t , Dt) the sole objective for w(mt) t in Eq. (8).

Key design II: Early termination. To ensure the stable convergence of the system with balanced loads among experts (which we will theoretically justify in Section 4), after training sufficient (i.e., T1) rounds for expert exploration, we introduce an early termination strategy in Algorithm 1 by evaluating a convergence flag I(m) for each expert m. This flag assesses the output gap, defined as |hm(Xt, θt) hmt(Xt, θt)|, between the expert itself and any selected expert mt for t > T1. If this gap exceeds threshold Γ for expert m, which indicates that gating network parameter θ(m) t has not converged, then the Mo E model continues updating Θt for all experts based on Eq. (9). Otherwise, the update of Θt is permanently terminated.

4 THEORETICAL RESULTS ON MOE TRAINING FOR CL

In this section, we provide theoretical analysis for the training of expert models and the gating network in Algorithm 1, which further justifies our key designs in Section 3. Specifically, (i) we first support our key design I by proving that the expert model converges fast via updating Θt under our designed locality loss in Eq. (6). (ii) We then show that our key design II, early termination in updating Θt, is necessary to ensure a stable convergence system state with experts balanced loads. For clarity, we study the case with M > N in this section (labeled as M > N version), and further extend the results to the M < N version in appendices. To characterize expert specialization, we first show that each expert s gate output is determined by the input feature signal vn of Xt.

Lemma 1 (M > N version). For any two feature matrices X and X with the same feature signal vn, with probability at least 1 o(1), their corresponding gate outputs of the same expert m satisfy hm(X, θ(m) t ) hm( X, θ(m) t ) = O(σ1.5 0 ). (10)

The full version containing M < N case and the proof of Lemma 1 are given in Appendix C. According to Lemma 1, the router decides expert mt for task nt based primarily on its feature signal vnt. Consequently, given N tasks, all experts can be grouped into N sets according to their specialty, i.e., their gating parameter θ(m) t s to identify feature signal vn, where each expert set is defined as: Mn = m [M] n = arg maxj [N](θ(m) t ) vj . (11)

The following proposition indicates the convergence of the expert model after sufficient training rounds under Algorithm 1. Proposition 1 (M > N version). Under Algorithm 1, with probability at least 1 o(1), for any t > T1, where T1 = η 1M , each expert m [M] stabilizes within an expert set Mn, and its expert model remains unchanged beyond time T1, satisfying w(m) T1+1 = = w(m) T .

The full version and the proof of Proposition 1 are given in Appendix E. Proposition 1 demonstrates that after T1 rounds of expert exploration, each expert will specialize in a specific task, enforced by minimizing locality loss in Eq. (6). After that, expert models remain unchanged until the end of T.

Next, the following proposition characterizes the dynamics of gate outputs if there is no termination of updating gating network parameters Θt in Algorithm 1. Proposition 2 (M > N version). If the Mo E keeps updating Θt by Eq. (9) at any round t [T], we obtain: 1) At round t1 = η 1σ 0.25 0 M , the following property holds hm(Xt1, θ(m) t1 ) hm (Xt1, θ(m ) t1 ) = O(σ1.75 0 ), if m, m Mn, Θ(σ0.75 0 ), otherwise. (12)

2) At round t2 = η 1σ 0.75 0 M , the following property holds hm(Xt2, θ(m) t2 ) hm (Xt2, θ(m ) t2 ) = O(σ1.75 0 ), m, m [M]. (13)

Published as a conference paper at ICLR 2025

The full version and the proof of Proposition 2 are given in Appendix F. According to Proposition 2, if the Mo E updates Θt at any round t, the gap between gate outputs of experts within the same expert set converges to O(σ1.75 0 ) by round t1 (Eq. (12)). In contrast, the gap between experts in different sets is sufficiently large, i.e., Θ(σ0.75 0 ), indicating that the Mo E has successfully diversified experts into different sets at round t1 as in Eq. (11). However, unlike Mo E in single-task learning that can stop training at any time after the expert models converge (e.g., Celik et al. (2022); Li et al. (2024)), Mo E in CL requires both the gating network and the experts to be suitably updated with the continuous arrival of new tasks. This is necessary to balance the load on each expert and maximize the system capacity utilization. However, continuing updating Θt according to Eq. (9) will eventually reduce the output gap between any two experts to O(σ1.75 0 ) in Eq. (13) at training round t2, causing the router to select wrong experts for subsequent task arrivals and incurring additional training errors.

Based on Proposition 2, it is necessary to terminate the update of Θt to preserve a sufficiently large output gap between any two experts in different sets, ensuring expert diversity as in Eq. (12) at round t1. This motivates our design of early termination in Algorithm 1, outlined from Line 7 to Line 13. In the next proposition, we prove the benefit of terminating updating Θt in Algorithm 1.

Proposition 3 (M > N version). Under Algorithm 1, the Mo E terminates updating Θt since round T2 = O(η 1σ 0.25 0 M). Then for any task arrival nt at t > T2, the router selects any expert m Mnt with an identical probability of 1 |Mnt|, where |Mnt| is the number of experts in set Mn.

The full version and the proof of Proposition 3 are given in Appendix G. According to Algorithm 1, once the Mo E terminates updating Θt, the random noise r(m) t in Eq. (1) will guide the router to select experts in the same expert set with identical probability, effectively balancing the loads across experts therein. Our theoretical analysis will be further corroborated by the experiments later in Section 6.

5 THEORETICAL RESULTS ON FORGETTING AND GENERALIZATION

For the Mo E model described in Section 3, we define Et(w(mt) t ) as the model error in the t-th round: Et(w(mt) t ) = w(mt) t wnt 2 2, (14)

which characterizes the generalization performance of the selected expert mt with model w(mt) t for task nt at round t. Following the existing literature on CL (e.g., Lin et al. (2023); Chaudhry et al. (2018)), we assess the performance of Mo E in CL using the metrics of forgetting and overall generalization error, defined as follows:

(1) Forgetting: Define Ft as the forgetting of old tasks after learning task nt for t {2, , T}:

τ=1 (Eτ(w(mτ ) t ) Eτ(w(mτ ) τ )). (15)

(2) Overall generalization error: We evaluate the generalization performance of the model w(m) T from the last training round T by computing the average model error across all tasks:

τ=1 Eτ(w(mτ ) T ). (16)

In the following, we present explicit forms of the above two metrics for learning with a single expert (i.e., M = 1) as a benchmark (cf. Lin et al. (2023)). Here we define r := 1 s

d as the overparameterization ratio.

Proposition 4. If M = 1, for any training round t {2, , T}, we have

E[Ft] = 1 t 1 Pt 1 τ=1 n rt rτ

N PN n=1 wn 2 + rτ rt

n =n wn wn 2o , (17)

E[GT ] = r T

N PN n=1 wn 2 + 1 r T

n =n wn w n 2. (18)

Note that the setting here (with M = 1) differs slightly from Lin et al. (2023) as we have N tasks in total. Hence a proof of Proposition 4 is provided in Appendix H. Proposition 4 implies that distinct tasks with large model gap P

n =n wn w n 2 lead to poor performance of both E[Ft] in Eq. (17) and E[Gt] in Eq. (18), which is missing in the existing CL literature (e.g., Lesort et al. (2023)).

Published as a conference paper at ICLR 2025

Next, we investigate the impact of Mo E on CL under two cases: (I) when there are more experts than tasks (M > N), and (II) when there are fewer experts than tasks (M < N). The benefit of Mo E will be characterized by comparing our results with the single-expert baseline in Proposition 4.

5.1 CASE I: MORE EXPERTS THAN TASKS

Based on Proposition 1, we derive the explicit upper bounds for both forgetting and overall generalization error in the following theorem. To simplify notations, we define L(m) t := t f (m) t as the cumulative number of task arrivals routed to expert m up to round t, where f (m) t is given in Eq. (7).

Theorem 1. If M = Ω(N ln(N)), for each round t {2, , T1}, the expected forgetting satisfies

E[Ft] < 1 t 1 Pt 1 τ=1 n r L(mτ ) t r L(mτ ) τ N PN n=1 wn 2 + r L(mτ ) τ r L(mτ ) t N2 P n =n wn wn 2o .

(19) For each t {T1 + 1, , T}, we have E[Ft] = T1 1

t 1 E[FT1]. Further, after training task n T in the last round T, the overall generalization error satisfies

T PT τ=1 n r L(mτ ) T1

N PN n=1 wn 2 + 1 r L(mτ ) T1 N2 P

n =n wn wn 2o . (20)

The proof of Theorem 1 is given in Appendix I, and we have the following insights.

1) Forgetting. If t T1, in Eq. (19), the coefficient r L(mτ ) t r L(mτ ) τ of the term PN n=1 wn 2 is smaller than 0 because L(mτ ) t L(mτ ) τ and r < 1, indicating that the training of new tasks enhances the performance of old tasks due to the repeated task arrivals in this phase. Meanwhile, the coefficient of model gap P

n =n wn wn 2 is greater than 0, indicating that the forgetting is due to experts exploration of distinct tasks. However, as stated in Proposition 1, once the expert models converge at t = T1, training on newly arriving tasks with correct routing no longer causes forgetting of previous tasks. Consequently, for t {T1 + 1, , T}, E[Ft] = T1 1

t 1 E[FT1] decreases with t and converges to zero as T . This result highlights that, unlike the oscillatory forgetting observed in Eq. (17) for a single expert, the Mo E model effectively minimizes expected forgetting in CL through its correct routing mechanism. Furthermore, a decrease in task similarity, i.e., larger model gaps, further amplifies the learning benefit of the Mo E model.

2) Generalization error. Note that the second term PN n =n wn wn 2 dominates the generalization

error when tasks are less similar with large model gaps. In Eq. (20), the coefficient 1 r L(mτ ) T1 of the model gaps PN n =n wn wn 2 is smaller than 1 r T in Eq. (18), due to the convergence of expert models after round T1. Therefore, the generalization error under Mo E is reduced compared to that of a single expert, especially as T increases (where 1 r T approaches 1 in Eq. (18)).

3) Expert number. According to Theorem 1, for t > T1, E[Ft] increases with T1 as additional rounds of expert exploration accumulate more model errors in Eq. (19). Regarding E[GT ] in Eq. (20), a

longer exploration period T1 for experts increases the coefficient 1 r L(mτ ) T1 of the model gaps, leading to an increase in E[GT ] when the model gaps across tasks are large. Since T1 increases with expert number M, adding more experts does not enhance learning performance but delays convergence. Note that if M = 1 in Theorem 1, L(mτ ) t becomes t for the single expert, causing Eq. (19) and Eq. (20) to specialize to Eq. (17) and Eq. (18), respectively.

5.2 CASE II: FEWER EXPERTS THAN TASKS

Next, we consider a more general case with fewer experts than tasks, i.e., M < N, where Algorithm 1 still works efficiently. In particular, we assume that the N ground truths in W can be classified into K clusters, where K < M, based on the task similarity. Let Wk denote the k-th task cluster. For any two tasks n, n [N] in the same cluster with wn, wn Wk, we assume wn wn = O(σ1.5 0 ). Then we let set Mk include all experts that specialize in tasks within the k-th task cluster.

Recall Proposition 1 indicates that expert models converge after T1 rounds of exploration if M > N. However, in the case of fewer experts than tasks (M < N), each expert has to specialize in learning a cluster of similar tasks. Consequently, as similar tasks within the same cluster are continuously

Published as a conference paper at ICLR 2025

routed to each expert, the expert models keep updating after round T1, behaving differently from the M > N case in Proposition 1. Given the above understanding, we have the following theorem. Theorem 2. If M < N and M = Ω(K ln(K)), for any t {1, , T1}, the expected forgetting E[Ft] is the same as Eq. (19). While for any t {T1 + 1, , T}, the expected forgetting satisfies

E[Ft] < 1 t 1 Pt 1 τ=1 r L(mτ ) t r L(mτ ) τ N PN n=1 wn 2 + 1 t 1 PT1 τ=1 r L(mτ ) i r L(mτ ) T1 N2 P

n =n wn wn 2

t 1 Pt τ=T1+1(1 r L(mτ ) t L(mτ ) τ )r L(mτ ) t L(mτ ) T1 1, (21)

where Ψ1 = 1 N PN n=1 P

n,n Wk wn wn 2

|Wk| is the expected model gap between any two tasks in the same task cluster. After training task n T in round T, the overall generalization error satisfies

T PT τ=1 r L(mτ ) T

N PN n=1 wn 2 + 1

T PT1 τ=1 r L(mτ ) T L(mτ ) T1 (1 r L(mτ ) T1 ) N2 P

n =n wn wn 2

T n PT1 τ=1(1 r L(mτ ) T L(mτ ) T1 ) + PT τ=T1+1 r L(mτ ) T L(mτ ) T1 (1 r L(mτ ) T1 ) o

T PT τ=T1+1(1 r L(mτ ) T L(mτ ) T1 ), (22)

where Ψ2 = 1

N PN n=1 1 K PK k=1 1 |Wk| P

n Wk wn wn 2 is the expected model gap between a randomly chosen task in W and any task in a fixed ground-truth cluster Wk.

The proof of Theorem 2 is given in Appendix J, and we provide the following insights.

1) Forgetting. Compared to Theorem 1, E[Ft] in Eq. (21) introduces an additional term Ψ1, which measures the forgetting of task arrivals during {T1 + 1, , τ} caused by updating expert models for new task arrival in round τ {T1 + 1, , T}. However, since tasks routed to the same expert during {T1 + 1, , T} belong to the same cluster, their small model gaps lead to minimal forgetting. If there is only one task in each cluster, Ψ becomes 0 and E[Ft] becomes the same as Eq. (19).

2) Generalization error. As expert models continuously update at any t, E[GT ] in Eq. (22) comprises three terms: a) the expected model gap 1 N 2 PN n =n wn wn 2 between any two random task arrivals for t < T1, b) the expected model gap Ψ1 between two tasks in the same cluster for t > T1, and c) the expected model gap Ψ2 between a random task arrival for t < T1 and any task arrival in a fixed ground-truth cluster Wk for t > T1. If each cluster contains only one task, the coefficient of

Ψ2 simplifies to PT τ=T1+1(1 r L(mτ ) T1 ), given L(mτ ) T = L(mτ ) T1 without updates after T1. Moreover, Ψ2 = 1 N 2 PN n =n wn wn 2 and Ψ1 = 0, resulting in Eq. (22) specializing to Eq. (20).

Note that under our assumption wn wn = O(σ1.5 0 ), the router cannot distinguish between similar tasks n and n within the same cluster. Consequently, adding more experts cannot avoid the errors Ψ1 and Ψ2 in Eq. (21) and Eq. (22). Therefore, similar to our insights of Theorem 1, when there are enough experts than clusters (i.e., M = Ω(K ln(K))), adding more experts does not enhance learning performance but delays convergence. Although the learning performance degrades compared to Theorem 1, it still benefits from Mo E compared to the single expert in Proposition 4.

Note that if we extend to the top-k routing strategy in Eq. (1), the router will select k experts to train the same data at a time. In this case, the forgetting described in Eq. (21) may decrease, as each expert may handle a smaller cluster of tasks compared to the case with top-1 routing strategy. However, similar tasks that belong to the same cluster in the top-1 case now may be divided into different clusters and handled by different expert, which may reduce the potential positive knowledge transfer among these tasks. Consequently, the generalization error in Eq. (22) may not be smaller for the top-k case.

6 EXPERIMENTS

In this section, we present extensive experiments on both linear models and DNNs to validate our theoretical analysis. Due to space constraints, we include detailed experimental setups and additional results on datasets such as MNIST Le Cun et al. (1989), CIFAR-100 Krizhevsky et al. (2009) and Tiny Image Net Le & Yang (2015) in Appendix A.

Key design of early termination. In the first experiment, we aim to check the necessity of terminating the update of Θt in Line 11 of Algorithm 1. Here we set T = 2000, N = 6, K = 3 and vary

Published as a conference paper at ICLR 2025

expert number M {1, 5, 10, 20}. As depicted in Figure 2(a) and Figure 2(c), both forgetting and generalization error first increase due to the expert exploration and then converge to almost zero for all Mo E models with termination of update, verifying Theorem 1 and Theorem 2. In stark contrast, learning without termination leads to poor performance with large oscillations in Figure 2(b) and Figure 2(d), as the router selects the wrong expert for a new task arrival after the continual update of Θt. In addition, in both Figure 2(a) and Figure 2(c), the Mo E model significantly improves the performance of CL compared to a single model. The comparison between M = 10 and M = 20 also indicates that adding extra experts delays the convergence if M > N, which does not improve learning performance, verifying our analysis in Theorem 1 and Theorem 2.

0 500 1000 1500 2000 Rounds

forgetting Ft

M = 1 M = 5 M = 10 M = 20

(a) With termination.

0 500 1000 1500 2000 Rounds

forgetting Ft

M = 1 M = 5 M = 10 M = 20

(b) Without termination.

0 500 1000 1500 2000 Rounds

M = 1 M = 5 M = 10 M = 20

(c) With termination.

0 500 1000 1500 2000 Rounds

M = 1 M = 5 M = 10 M = 20

(d) Without termination.

Figure 2: The dynamics of forgetting and overall generalization errors with and without termination of updating Θt in Algorithm 1. Here we set N = 6 with K = 3 clusters and vary M {1, 5, 10, 20}.

Real-data validation. Finally, we extend our Algorithm 1 and insights from linear models to DNNs by conducting experiments on the CIFAR-10 dataset (Krizhevsky et al. (2009)). The details of our experiment setup is given in Appendix A.3. We set K = 4, N = 300 and vary M {1, 4, 12}. In each training round, to diversify the model gaps of different tasks, we transform the d d matrix into a d d dimensional normalized vector to serve as input for the gating network. Then we calculate the variance σ0 of each element across all tasks from the input vector, which is then used for parameter setting in Algorithm 1. Figure 3 illustrates that our theoretical insights from linear models also hold for DNNs, in terms of the impact of Mo E and early termination on the performance of CL in practice.

0 50 100 150 200 250 300 Rounds

M = 1 M = 4 M = 12

(a) With termination.

0 50 100 150 200 250 300 Rounds

M = 1 M = 4 M = 12

(b) Without termination.

0 50 100 150 200 250 300 Rounds

M = 1 M = 4 M = 12

(c) With termination.

0 50 100 150 200 250 300 Rounds

M = 1 M = 4 M = 12

(d) Without termination.

Figure 3: The dynamics of overall generalization error and test accuracy under the CIFAR-10 dataset (Krizhevsky et al. (2009)). Here we set K = 4, N = 300 and M {1, 4, 12}.

7 CONCLUSION

In this work, we conducted the first theoretical analysis of Mo E and its impact on learning performance in CL, focusing on an overparameterized linear regression problem. We establish the benefit of Mo E over a single expert by proving that the Mo E model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Then we demonstrated that, under CL, terminating the updating of gating network parameters after sufficient training rounds is necessary for system convergence. Furthermore, we provided explicit forms of the expected forgetting and overall generalization error to assess the impact of Mo E. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conducted experiments on real datasets using DNNs to show that certain insights can extend beyond linear models.

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS

This work has been supported in part by the U.S. National Science Foundation under the grants: NSF AI Institute (AI-EDGE) 2112471, CNS-2312836, and was sponsored by the Army Research Laboratory under Cooperative Agreement Number W911NF-23-2-0225. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

This work was also supported in part by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 Grant with Award no. MOE-T2EP20121-0001; in part by SUTD Kickstarter Initiative (SKI) Grant with no. SKI 2021_04_07; and in part by the Joint SMU-SUTD Grant with no. 22-LKCSB-SMU-053.

The work of Yingbin Liang was supported in part by the U.S. National Science Foundation with the grants RINGS-2148253, CNS-2112471, and ECCS-2413528.

We sincerely thank Qinhang Wu for his invaluable assistance in conducting extensive experiments during our rebuttal, which greatly contributed to the acceptance of this paper.

Animashree Anandkumar, Rong Ge, Daniel J Hsu, Sham M Kakade, Matus Telgarsky, et al. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res., 15(1):2773 2832, 2014.

Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541 549. PMLR, 2018.

Mehdi Abbana Bennani, Thang Doan, and Masashi Sugiyama. Generalisation guarantees for continual learning with orthogonal gradient descent. ar Xiv preprint ar Xiv:2006.11942, 2020.

Onur Celik, Dongzhuoran Zhou, Ge Li, Philipp Becker, and Gerhard Neumann. Specializing versatile skill libraries using local mixture of experts. In Conference on Robot Learning, pp. 1423 1433. PMLR, 2022.

Arslan Chaudhry, Marc Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. ar Xiv preprint ar Xiv:1812.00420, 2018.

Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. Advances in Neural Information Processing Systems, 35:23049 23062, 2022.

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems, 35:34600 34613, 2022.

Thang Doan, Mehdi Abbana Bennani, Bogdan Mazoure, Guillaume Rabusseau, and Pierre Alquier. A theoretical analysis of catastrophic forgetting through the ntk overlap matrix. In International Conference on Artificial Intelligence and Statistics, pp. 1072 1080. PMLR, 2021.

Thang Doan, Seyed Iman Mirzadeh, and Mehrdad Farajtabar. Continual learning beyond a single model. In Conference on Lifelong Learning Agents, pp. 961 991. PMLR, 2023.

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp. 5547 5569. PMLR, 2022.

David Eigen, Marc Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. ar Xiv preprint ar Xiv:1312.4314, 2013.

Published as a conference paper at ICLR 2025

Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, and Daniel Soudry. How catastrophic can catastrophic forgetting be in linear regression? In Conference on Learning Theory, pp. 4028 4079. PMLR, 2022.

Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pp. 3762 3773. PMLR, 2020.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1): 5232 5270, 2022.

Rui Gao and Weiwei Liu. Ddgr: Continual learning with deep diffusion-based generative replay. In International Conference on Machine Learning, pp. 10744 10763. PMLR, 2023.

Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789 1819, 2021.

Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pp. 1832 1841. PMLR, 2018.

Heinke Hihn and Daniel A Braun. Mixture-of-variational-experts for continual learning. ar Xiv preprint ar Xiv:2110.12667, 2021.

Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. In International Conference on Machine Learning. PMLR, 2024.

Ghassen Jerfel, Erin Grant, Tom Griffiths, and Katherine A Heller. Reconciling meta-learning and continual learning with online mixtures of tasks. Advances in neural information processing systems, 32, 2019.

Xisen Jin, Arka Sadhu, Junyi Du, and Xiang Ren. Gradient-based editing of memory examples for online task-free continual learning. Advances in Neural Information Processing Systems, 34: 29193 29205, 2021.

Peizhong Ju, Xiaojun Lin, and Jia Liu. Overfitting can be harmless for basis pursuit, but only to a degree. Advances in Neural Information Processing Systems, 33:7956 7967, 2020.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 (13):3521 3526, 2017.

Tatsuya Konishi, Mori Kurokawa, Chihiro Ono, Zixuan Ke, Gyuhak Kim, and Bing Liu. Parameterlevel soft-masking for continual learning. In International Conference on Machine Learning, pp. 17492 17505. PMLR, 2023.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.

Yann Le Cun, Bernhard Boser, John Denker, Donnie Henderson, Richard Howard, Wayne Hubbard, and Lawrence Jackel. Handwritten digit recognition with a back-propagation network. Advances in neural information processing systems, 2, 1989.

Sebastian Lee, Sebastian Goldt, and Andrew Saxe. Continual learning in the teacher-student setup: Impact of task similarity. In International Conference on Machine Learning, pp. 6109 6119. PMLR, 2021.

Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim. A neural dirichlet process mixture model for task-free continual learning. In International Conference on Learning Representations, 2020.

Published as a conference paper at ICLR 2025

Timothée Lesort, Oleksiy Ostapenko, Pau Rodríguez, Diganta Misra, Md Rifat Arefin, Laurent Charlin, and Irina Rish. Challenging common assumptions about catastrophic forgetting and knowledge accumulation. In Conference on Lifelong Learning Agents, pp. 43 65. PMLR, 2023.

Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, and Xin Chen. Locmoe: A low-overhead moe for large language model training. ar Xiv preprint ar Xiv:2401.13920, 2024.

Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. ar Xiv preprint ar Xiv:2401.15947, 2024.

Sen Lin, Li Yang, Deliang Fan, and Junshan Zhang. Trgp: Trust region gradient projection for continual learning. In International Conference on Learning Representations, 2021.

Sen Lin, Peizhong Ju, Yingbin Liang, and Ness Shroff. Theory on forgetting and generalization of continual learning. In International Conference on Machine Learning, pp. 21078 21100. PMLR, 2023.

Hao Liu and Huaping Liu. Continual learning with recursive gradient optimization. In International Conference on Learning Representations, 2021.

Michael Mc Closkey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109 165. Elsevier, 1989.

Hien D Nguyen and Faicel Chamroukhi. Practical and theoretical aspects of mixture-of-experts modeling: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1246, 2018.

Huy Nguyen, Pedram Akbarian, Fanqi Yan, and Nhat Ho. Statistical perspective of top-k sparse softmax gating mixture of experts. In International Conference on Learning Representations, 2024.

German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54 71, 2019.

Liangzu Peng, Paris Giampouras, and René Vidal. The ideal continual learner: An agent that never forgets. In International Conference on Machine Learning, pp. 27585 27610. PMLR, 2023.

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583 8595, 2021.

Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured laplace approximations for overcoming catastrophic forgetting. Advances in Neural Information Processing Systems, 31, 2018.

Grzegorz Rype s c, Sebastian Cygert, Valeriya Khan, Tomasz Trzcinski, Bartosz Michał Zieli nski, and Bartłomiej Twardowski. Divide and not forget: Ensemble of selectively trained experts in continual learning. In The Twelfth International Conference on Learning Representations, 2023.

Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2020.

Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pp. 4548 4557. PMLR, 2018.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2016.

Published as a conference paper at ICLR 2025

Shixiang Tang, Dapeng Chen, Jinguo Zhu, Shijie Yu, and Wanli Ouyang. Layerwise optimization by gradient decomposition for continual learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pp. 9634 9643, 2021.

Kert Viele and Barbara Tong. Modeling with mixtures of linear regressions. Statistics and Computing, 12:315 330, 2002.

Liyuan Wang, Xingxing Zhang, Qian Li, Jun Zhu, and Yi Zhong. Coscl: Cooperation of small continual learners is stronger than a big one. In European Conference on Computer Vision, pp. 254 271. Springer, 2022.

Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

Qi Wang and Herke Van Hoof. Learning expressive meta-representations with mixture of expert neural processes. Advances in neural information processing systems, 35:26242 26255, 2022.

Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. Openmoe: An early effort on open mixture-of-experts language models. ar Xiv preprint ar Xiv:2402.01739, 2024.

An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, et al. M6-t: Exploring sparse expert models and beyond. ar Xiv preprint ar Xiv:2105.15082, 2021.

Jaehong Yoon, Saehoon Kim, Eunho Yang, and Sung Ju Hwang. Scalable and order-robust continual learning with additive parameter decomposition. In International Conference on Learning Representations, 2019.

Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. ar Xiv preprint ar Xiv:2403.11549, 2024.

Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermis, Acyr Locatelli, and Sara Hooker. Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. In The Twelfth International Conference on Learning Representations, 2023.

Kai Zhong, Prateek Jain, and Inderjit S Dhillon. Mixed linear regression with multiple components. Advances in Neural Information Processing Systems, 29, 2016.

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103 7114, 2022.

Published as a conference paper at ICLR 2025

A Experimental details and additional experiments 16

A.1 Experiments compute resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.2 Experimental details of Figure 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.3 Experimental details of Figure 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.4 Experiments on the MNIST dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.5 Experiments on the CIFAR-100 dataset . . . . . . . . . . . . . . . . . . . . . . . . . 17

A.6 Experiments on the Tiny Image Net dataset . . . . . . . . . . . . . . . . . . . . . . . . 17

A.7 Experiments on termination threshold and load balance . . . . . . . . . . . . . . . . . 18

B Smooth router 19

C Full version and proof of Lemma 1 20

D Analysis of loss function 21

E Full version and proof of Proposition 1 23

E.1 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

E.2 Proof of Lemma 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

E.3 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

E.4 Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

E.5 Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

E.6 Proof of Lemma 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

E.7 Proof of Lemma 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

E.8 Final proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

F Full version and proof of Proposition 2 28

F.1 Proof of Lemma 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

F.2 Proof of Lemma 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

F.3 Final proof of Proposition 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

G Full version and proof of Proposition 3 31

H Proof of Proposition 4 32

I Proof of Theorem 1 33

I.1 Proof of Lemma 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

I.2 Final proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

J Proof of Theorem 2 35

Published as a conference paper at ICLR 2025

A EXPERIMENTAL DETAILS AND ADDITIONAL EXPERIMENTS

A.1 EXPERIMENTS COMPUTE RESOURCES

Operating system: Red Hat Enterprise Linux Server 7.9 (Maipo)

Type of CPU: 2.9 GHz 48-Core Intel Xeon 8268s

Type of GPU: NVIDIA Volta V100 w/32 GB GPU memory

A.2 EXPERIMENTAL DETAILS OF FIGURE 2

Synthetic data generation. We first generate N ground truths and their corresponding feature signals. For each ground truth wn Rd, where n [N], we randomly generate d elements by a normal distribution N(0, σ0). These ground truths are then scaled by a constant to obtain their feature signals vn. In each training round t, we generate (Xt, yt) according to Definition 1 based on ground-truth pool W and feature signals. Specifically, after drawing wnt from W, for Xt Rd s, we randomly select one out of s samples to fill with vnt. The other s 1 samples are generated from N(0, σ2 t Id). Finally, we compute the output yt = X t wnt. Here we set σ0 = 0.4, σt = 0.1, d = 10 and s = 6. In Figure 2, we set η = 0.5, α = 0.5 and λ = 0.3.

A.3 EXPERIMENTAL DETAILS OF FIGURE 3

Datasets. We use the CIFAR-10 (Krizhevsky et al. (2009)) dataset, selecting 512 samples randomly for training and 2000 samples for testing at each training round.

DNN architecture and training details. We employ a non-pretrained Res Net-18 as our base model. Each task is learned using Adam with a learning rate governed by a cosine annealing schedule for 5 epoches, with a minibatch size of 32, a weight decay of 0.005. The initial learning rate is set to 0.0005, and it is reduced to a minimum value of 10 6 over a total of 300 rounds.

Task setups. We define the ground truth pool as W = {(0), (4), (5), (9)}, representing K = 4 clusters of tasks for recognizing the image classes airplane, deer, dog, truck, respectively. The experiment spans T = 300 training rounds with N = 300 tasks. We randomly generate the task arrival sequence [nt]t [T ], where each nt is drawn from (0), (4), (5), (9) with equal probability 1

4. We then conduct two experiments (with and without termination) using the same task arrival order. For each task t [T], we randomly select its type (e.g., task (0) for recognizing the airplane class image) and 512 corresponding samples, ensuring that tasks have distinct distributions and features.

0 100 200 300 Rounds

forgetting Ft

M = 1 M = 4

(a) With termination

0 100 200 300 Rounds

forgetting Ft

M = 1 M = 4

(b) Without termination

Figure 4: The dynamics of forgetting under the CIFAR-10 dataset. Here we set N = 4 and M {1, 4}.

A.4 EXPERIMENTS ON THE MNIST DATASET

Datasets. We use the MNIST dataset Le Cun et al. (1989), selecting 100 samples randomly for training and 1000 samples for testing at each training round.

DNN architecture and training details. We use a five-layer neural network, consisting of two convolutional layers and three fully connected layers. Re LU activation is applied to the first four layers, while Sigmoid is used for the final layer. The first convolutional layer is followed by a 2D max-pooling operation with a stride of 2. Each task is learned using SGD with a learning rate of

Published as a conference paper at ICLR 2025

0.2 for 600 epochs. The forgetting and overall generalization error are evaluated as described in Eq. (15) and Eq. (16), respectively. Here, Et(w(mt) t ) is defined as the mean-squared test error instead of Eq. (14).

Task setups. We define the ground truth pool as W = {(1), (4), (7)}, representing K = 3 clusters of tasks for recognizing the numbers 1, 4, and 7, respectively. The experiment spans T = 60 training rounds with N = 60 tasks. Before the experiments in Figure 5, we randomly generate the task arrival sequence [nt]t [T ], where each nt is drawn from (1), (4), (7) with equal probability 1

3. We then conduct two experiments (with and without termination) using the same task arrival order. For each task t [T], we randomly select its type (e.g., task (1) for recognizing the number 1) and 100 corresponding samples, ensuring that tasks have distinct distributions and features.

0 10 20 30 40 50 60 Rounds

forgetting Ft

M = 1 M = 4 M = 7

(a) With termination.

0 10 20 30 40 50 60 Rounds

forgetting Ft

M = 1 M = 4 M = 7

(b) Without termination.

0 10 20 30 40 50 60 Rounds

M = 1 M = 4 M = 7

(c) With termination.

0 10 20 30 40 50 60 Rounds

M = 1 M = 4 M = 7

(d) Without termination.

Figure 5: Learning performance under the MNIST dataset (Le Cun et al. (1989)). Here we set K = 3, N = 60 and M {1, 4, 7}.

A.5 EXPERIMENTS ON THE CIFAR-100 DATASET

Datasets. We use the CIFAR-100 (Krizhevsky et al. (2009)) dataset, selecting 192 samples randomly for training and 600 samples for testing at each training round.

DNN architecture and training details. They are the same as the experiments on the CIFAR-10 dataset in Appendix A.3.

Task setups. We define the ground truth pool as W = {(28), (40), (52), (72), (79), (99)}, representing K = 6 clusters of tasks for recognizing the image classes telephone, bee, mountain, bear, turtle, tractor respectively. The experiment spans T = 350 training rounds with N = 350 tasks. We randomly generate the task arrival sequence [nt]t [T ], where each nt is drawn from (28), (40), (52), (72), (79), (99) with equal probability 1

6. We then conduct two experiments (with and without termination) using the same task arrival order. For each task t [T], we randomly select its type (e.g., task (28) for recognizing the telephone class image) and 192 corresponding samples, ensuring that tasks have distinct distributions and features.

0 100 200 300 Rounds

M = 1 M = 8 M = 10

(a) With termination

0 100 200 300 Rounds

M = 1 M = 8 M = 10

(b) Without termination

Figure 6: Learning performance under the CIFAR-100 dataset. Here we set N = 6 and M {1, 8, 10}.

A.6 EXPERIMENTS ON THE TINY IMAGENET DATASET

Datasets. We use the Tiny Image Net (Le & Yang (2015)) dataset, selecting 192 samples randomly for training and 300 samples for testing at each training round.

Published as a conference paper at ICLR 2025

Table 1: The average incremental accuracy under the CIFAR-100 dataset.

Expert number Test accuracy (%)

M = 1 17.1 M = 8 25.5 M = 8 w/o ET 10.3 M = 10 32.9 M = 10 w/o ET 9.1

DNN architecture and training details. They are the same as the experiments on the CIFAR-10 dataset in Appendix A.3.

Task setups. We define the ground truth pool as W = {(20), (50), (83), (145), (168), (179)}, representing K = 6 clusters of tasks for recognizing six disjoint image classes from Tiny Imagenet respectively. The experiment spans T = 300 training rounds with N = 300 tasks. We randomly generate the task arrival sequence [nt]t [T ], where each nt is drawn from (20), (50), (83), (145), (168), (179) with equal probability 1

6. We then conduct two experiments (with and without termination) using the same task arrival order. For each task t [T], we randomly select its type (e.g., task (20)) and 192 corresponding samples, ensuring that tasks have distinct distributions and features.

0 50 100 150 200 250 300 Rounds

M = 1 M = 8 M = 10

(a) With termination

0 50 100 150 200 250 300 Rounds

M = 1 M = 8 M = 10

(b) Without termination

Figure 7: Learning performance under the Tiny Image Net dataset. Here we set N = 6 and M {1, 8, 10}.

Table 2: The average incremental accuracy under the Tiny Image Net dataset.

Expert number Test accuracy (%)

M = 1 17.4 M = 8 37.6 M = 8 w/o ET 10.5 M = 10 28.3 M = 10 w/o ET 10.3

A.7 EXPERIMENTS ON TERMINATION THRESHOLD AND LOAD BALANCE

In additional experiments, we vary termination threshold Γ {σ0.75 0 , σ0, σ1.25 0 , σ1.5 0 } in Line 7 of Algorithm 1 to investigate its effect on load balance and learning performance, under the same synthetic data generation as Figure 2 in Appendix A.2.

Initially, we set σ0 = 0.4, λ = σ1.25 0 , M = 5, and N = 6 with K = 3 task clusters: W1 = {1, 4}, W2 = {2, 5}, and W3 = {3, 6}. Figure 8 illustrates the recorded task arrivals per round and their routed experts. Figure 8(a) and Figure 8(b) depict that if the Mo E model terminates the update based on Γ > λ, the small noise r(m) t cannot alter the router s decision from the expert with the maximum gate output for each task cluster (e.g., expert 5 for W2 = {2, 5} in Figure 8(a)) in Eq. (1), leading to imbalanced expert load. While for Γ λ in Figure 8(c) and Figure 8(d), the random

Published as a conference paper at ICLR 2025

noise r(m) t in Eq. (1) can mitigate the gaps of gate outputs among experts within the same expert set, ensuring load balance (e.g., experts 3 and 4 for W2 = {2, 5} in Figure 8(c)).

0 100 200 300 400 500 Rounds

Task : 1 Task : 2 Task : 3 Task : 4 Task : 5 Task : 6

(a) Threshold = σ0.75 0 .

0 100 200 300 400 500 Rounds

Task : 1 Task : 2 Task : 3 Task : 4 Task : 5 Task : 6

(b) Threshold = σ0.

0 100 200 300 400 500 Rounds

Task : 1 Task : 2 Task : 3 Task : 4 Task : 5 Task : 6

(c) Threshold = σ1.25 0 .

0 100 200 300 400 500 Rounds

Task : 1 Task : 2 Task : 3 Task : 4 Task : 5 Task : 6

(d) Threshold = σ1.5 0 .

Figure 8: The records of task arrivals and their selected experts under different termination thresholds in Line 7 of Algorithm 1: Γ {σ0.75 0 , σ0, σ1.25 0 , σ1.5 0 }. Here we set M = 5, N = 6 and K = 3.

To further examine how load balance affects learning performance, we increase the task number to N = 30. We repeat the experiment 100 times and plot the average forgetting and generalization errors in Figure 9. Figure 9(a) illustrates that the forgetting is robust to a wide range of Γ, due to the convergence of expert models after T1 rounds exploration. However, Figure 9(b) shows that balanced loads under Γ {σ1.25 0 , σ1.5 0 } lead to smaller generalization errors compared to imbalanced loads under Γ {σ0.75 0 , σ1 0}. This is because diverse expert models help mitigate model errors compared to a single overfitted model.

0 100 200 300 400 500 Rounds

forgetting Ft

0 100 200 300 400 500 Rounds

Figure 9: We run the experiment for 100 times to take the average learning performance under four termination thresholds: Γ {σ0.75 0 , σ0, σ1.25 0 , σ1.5 0 }. Here we set M = 5, N = 30 and K = 3.

B SMOOTH ROUTER

We first prove that Eq. (1) ensures a smooth transition between different routing behaviors, which makes the router more stable. Suppose that there are two different datasets (X, y) and ( ˆX, ˆy) simultaneously acting as input of the Mo E. Let h and ˆh denote the corresponding output of the gating network, respectively. Denote the probability vectors by p and ˆp, which tell the probabilities that each expert gets routed for the two datasets. For example, pm = P(arg maxm [M]{hm + r(m )} = m) and ˆpm = P(arg maxm [M]{ˆhm + r(m )} = m) according to Eq. (1). Then we propose the following lemma to prove the smooth router.

Lemma 2. The two probability vectors satisfy p ˆp λM 2 h ˆh .

Proof. Let m1 = arg maxm{hm + r(m)} and m2 = arg maxm{ˆhm + r(m)}. We first consider the event that m1 = m2. In this case, we have hm1 + r(m1) hm2 + r(m2), ˆhm2 + r(m2) ˆhm1 + r(m1), which implies that ˆhm2 ˆhm1 > r(m1) r(m2) hm2 hm1. (23)

Define C(m1, m2) = ˆhm2 ˆhm1+hm2 hm1

2 . Based on Eq. (23), we obtain

|r(m1) r(m2) C(m1, m2)| ˆhm2 ˆhm1 hm2 + hm1

2 ˆh h . (24)

Published as a conference paper at ICLR 2025

Therefore, we calculate that m1 = m2 below P(arg max m {hm + r(m)} = arg max m {ˆhm + r(m)})

P( m1 = m2 [M], s.t. |r(m1) r(m2) C(m1, m2)| ˆh h )

m1<m2 P |r(m1) r(m2) C(m1, m2)| ˆh h

m1<m2 E P(r(m2) + C(m1, m2) ˆh h r(m1) r(m2) + C(m1, m2) + ˆh h )|r(m2)

λM 2 ˆh h , where the first inequality is derived by Eq. (24), the second inequality is because of union bound, and the last inequality is due to the fact that r(m) is drawn from Unif [0, λ].

Then for any j [M], we have

|ˆpi pi| E h 1(arg max m {ˆhm + r(m) = i}) 1(arg max m {ˆhm + r(m) = i}) i

E h 1(arg max m {ˆhm + r(m) = i}) 1(arg max m {ˆhm + r(m) = i}) i

P(arg max m {hm + r(m)} = arg max m {ˆhm + r(m)})

M 2 ˆh h . This completes the proof of Lemma 2.

Intuitively, Lemma 2 tells that if the outputs h and ˆh of the gating network are similar for two tasks, their data sets (X, y) and ( ˆX, ˆy) will be routed to the same expert with a high probability. It means that the router transitions are smooth and continuous.

C FULL VERSION AND PROOF OF LEMMA 1

Lemma 1 (Full version). For any two feature matrices X and X with feature signals vn and v n, if wn = wn under M > N or wn, wn Wk under M < N, with probability at least 1 o(1), their corresponding gate outputs of the same expert m satisfy hm(X, θ(m) t ) hm( X, θ(m) t ) = O(σ1.5 0 ). (25)

Proof. We first focus on the M > N case to prove Lemma 1. Then we consider the M < N case to prove Lemma 1. For dataset (Xt, yt) generated in Definition 1 per round t, we can assume that the first sample of Xt is the signal vector. Therefore, we rewrite Xt = [βtvn Xt,2 Xt,s]. Let Xt = [βtvnt 0 0] represents the matrix that only keeps the feature signal.

Based on the definition of the gating network in Section 3, we have hm(Xt, θ(m) t ) = Ps i=1(θ(m) t ) Xt,i. Then we calculate hm(Xt, θ(m) t ) hm( Xt, θ(m) t ) =

(θ(m) t ) s X

j=1 (θ(m) t,j ) Xt,(i,j)

max t,j {θ(m) t,j }

j=1 Xt,(i,j) ,

where θ(m) t,j is the j-th element of vector θ(m) t and Xt,(i,j) is the (i, j)-th element of matrix Xt.

Published as a conference paper at ICLR 2025

Then we apply Hoeffding s inequality to obtain

j=1 Xt,(i,j)| < s d σ0 1 2 exp ( σ2 0s2d2

As Xt,(i,j) N(0, σ2 t ), we have Xt,i = O(σt), indicating exp ( σ2 0s2d2

Xt,i ) = o(1). Therefore,

with probability at least 1 o(1), we have Ps i=2 Pd j=1 Xt,(i,j) = O(σ0). Consequently, we obtain hm(Xt, θ(m) t ) hm( Xt, θ(m) t ) = O(σ1.5 0 ) due to the fact that Ps i=2 Pd j=1 Xt,(i,j) = O(σ0)

and θ(m) t,j = O(σ0.5 0 ) proven in Lemma 6 later.

If M < N, we calculate hm(Xt, θ(m) t ) hm( Xt, θ(m) t ) = (θ(m) t ) s X

(θ(m) t ) s X

i=2 Xt,i + (θ(m) t ) (vn vn )

max t,j {θ(m) t,j }

j=1 Xt,(i,j) + O(σ2 0)

= O(σ1.5 0 ),

where the second inequality is based on the union bound, the third inequality is because of θ(m) t,j = O(σ0.5 0 ) and our assumption vn vn = O(σ1.5 0 ), and the last inequality is based on our proof of the M > N case above. This completes the proof of the full version of Lemma 1.

Based on Lemma 1, we revisit Lemma 2 to obtain the following conclusion for the smooth router.

Corollary 1. If datasets (X, y) and ( ˆX, ˆy) are generated by the same ground truth, then the two probability vectors in Lemma 2 satisfy p ˆp = O λM 2σ1.5 0 .

D ANALYSIS OF LOSS FUNCTION

In this section, we analyze the loss function for both gating network parameters and expert models before analyzing Mo E.

Lemma 3. Under update rule Eq. (5), if the current task nt routed to expert mt have the same ground truth with the last task nτ, where τ < t, routed to expert mt, i.e., wnt = wnτ , then the model of expert mt satisfies w(mt) t = w(mt) t 1 = = w(mt) τ .

It is easy to prove Lemma 3 by the updating rule Eq. (5) such that we skip the proof here.

Next, we examine the training of the gating network parameter.

Lemma 4. For any training round t 1, we have that PM m=1 θ(m) t Ltask t = 0.

Proof. As the training loss θ(m)Ltr t = 0, we obtain θ(m)Ltask t = θ(m)Lloc t + θ(m)Laux t . (26) Next, we will prove PM m=1 θ(m) t Lloc t = 0 and PM m=1 θ(m) t Lloc t = 0, respectively. Based on the two equations, we can prove Lemma 4.

According to the definition of locality loss in Eq. (6), we calculate

θ(m) t Lloc t = πmt(Xt, Θt)

θ(m) t w(mt) t w(mt) t 1 2. (27)

Published as a conference paper at ICLR 2025

If m = mt, we obtain πmt(Xt, Θt)

θ(m) t = πmt(Xt, Θt) X

m =mt πm (Xt, Θt) hm(Xt, θ(m) t )

= πmt(Xt, Θt) X

m =mt πm (Xt, Θt) X

i [st] Xt,i. (28)

If m = mt, we obtain πmt(Xt, Θt)

θ(m) t = πmt(Xt, Θt) πm(Xt, Θt) X

i [st] Xt,i. (29)

Based on Eq. (27), Eq. (28) and Eq. (29), we obtain M X

m=1 θ(m) t Lloc t = w(mt) t w(mt) t 1 2

πmt(Xt, Θt)

θ(m) t = 0.

According to the definition of auxiliary loss in Eq. (7), we calculate

θ(m) t Laux t = αM

m =1 f (m ) t P (m ) t θ(m) t

t f (mt) t πmt(Xt, Θt)

θ(m) t , (30)

where the second equality is due to the fact that P (mt) t θ(m) t = 1

t πmt(Xt,Θt)

θ(m) t and P (m ) t θ(m) t = 0 for any

m = mt by Eq. (7). Then based on Eq. (28) and Eq. (29), we similarly obtain M X

m=1 θ(m) t Laux t = αM

πmt(Xt, Θt)

θ(m) t = 0.

In summary, we finally prove PM m=1 θ(m) t Ltask t = 0 in Eq. (26).

In the following lemma, we analyze the gradient of loss function with respect to each expert.

Lemma 5. For any training round t {1, , T}, the following property holds

θ(m) t Ltask t = Ω(σ0), if t {1, , T1}, O(σ0), if t {T1 + 1, , T} (31)

for any expert m [M], where T1 = η 1M is the length of the exploration stage.

Proof. We prove Eq. (31) by analyzing θ(m)Lloc t = O(σ0) and θ(m)Laux t = O( αM

t ) in Eq. (26), respectively.

First, we calculate θ(m) t Lloc t . Based on Eq. (27), we have

θ(m) t Lloc t = πmt(Xt, Θt)

θ(m) t w(mt) t w(mt) t 1 2

= 1{wnτ = wnt} 0 + 1{wnτ = wnt} w(mt) t w(mt) t 1 2 πmt(Xt, Θt)

w(mt) t w(mt) t 1 2 πmt(Xt, Θt)

where τ is the index of the last task that routed to expert mt, the second equality is derived by Lemma 3. As πmt(Xt,Θt)

θ(m) t = O(1) and wn N(0, σ2 0), we finally obtain

θ(m) t Lloc t = O(σ0).

Next, we further calculate θ(m) t Laux t , which contains the following two cases.

Published as a conference paper at ICLR 2025

If t {1, , T1}, by Eq. (30), we have

θ(m) t Laux t αM

T1 f (mt) t πmt(Xt, Θt)

σ0f (mt) t πmt(Xt, Θt)

θ(m) t = Ω(σ0),

where the second inequality is derived by setting η = Ω(σ0.5 0 ) to make T1 = σ 0.5 0 M .

If t {T1 + 1, , T}, we calculate

θ(m) t Laux t αM

T1 f (mt) t πmt(Xt, Θt)

Based on the derived θ(m) t Lloc t and θ(m) t Laux t above, we can finally calculate

θ(m) t Ltask t based on Eq. (26).

For t {1, , T1}, if m = mt, we have θ(m) t Ltask t = θ(m) t Lloc t + θ(m) t Laux t

T1 f (mt) t πmt(Xt, Θt)

Similarly, for any t {T1 + 1, , T}, we can can obtain θ(m) t Ltask t = θ(m) t Lloc t + θ(m) t Laux t

T1 f (mt) t πmt(Xt, Θt)

= O(σ0). This completes the proof of Eq. (31).

Given θ(m) 0 = 0 for any expert m [M], in the next lemma, we obtain the upper bound of θ(m) t at any round t {1, , T}.

Lemma 6. For any training round t {1, , T}, the gating network parameter of any expert m satisfies θ(m) t = O(σ0.5 0 ).

Proof. Based on Lemma 5, for any t {1, , T1} the accumulated update of θ(m) t throughout the exploration stage satisfies θ(m) t η T1 αM = O(σ0.5 0 ).

For any t {T1 + 1, , T}, the accumulated update of θ(m) t throughout the router learning phase satisfies

θ(m) t θ(m) T1 + η (T T1) αM

T1 = O(σ0.5 0 ) + O(σ0.5 0 σ0) = O(σ0.5 0 ).

In summary, θ(m) t = O(σ0.5 0 ) for any round t {1, , T}.

E FULL VERSION AND PROOF OF PROPOSITION 1

Proposition 1 (Full version). Under Algorithm 1, with probability at least 1 o(1), for any t > T1, where T1 = η 1M , each expert m [M] satisfies the following properties:

1) If M > N, expert m stabilizes within an expert set Mn, and its expert model remains unchanged beyond time T1, satisfying w(m) T1+1 = = w(m) T .

Published as a conference paper at ICLR 2025

2) If M < N, expert m stabilizes within an expert set Mk, and its expert model satisfies w(m) t w(m) T1+1 = O(σ1.5 0 ) for any t {T1 + 2, , T}.

We first propose the following lemmas before formally proving Proposition 1. Then we prove Proposition 1 in Appendix E.8.

Lemma 7. At any training round t {1, , T1}, for any feature signal vn, the gating network parameter of expert m [M] satisfies

θ(m) t+1 θ(m) t , vn = O(σ0), if m = mt, O(M 1σ0), if m = mt.

Lemma 7 tells that for any expert mt being selected by the router, its softmax value under the updated θmt t+1 is reduced since the next task t. While for any expert m without being selected, its softmax value is increased. This is to ensure fair exploration of each expert m under the auxiliary loss function in Eq. (7). In addition, for any expert m without being selected, its gating network parameter θ(m) t is updated at the same speed with others for any signal vector vn.

Lemma 8. At the end of the exploration stage, with probability at least 1 δ, the fraction of tasks dispatched to any expert m [M] satisfies f (m) T1 1

= O(η0.5M 1). (32)

Lemma 8 tells that during the exploration stage, all the M experts are expected to be evenly explored by all tasks. Therefore, the gating network parameter θ(m) t of expert m is updated similarly to all the others.

Lemma 9. At the end of the exploration stage, i.e., t = T1, the following property holds θ(m) T1 θ(m ) T1

= O(η 0.5σ0),

for any m, m [M] and m = m .

Define δΘ = |hm(Xt, θ(m) t ) hm (Xt, θ(m) t )|. Then we obtain the following lemma.

Lemma 10. At any round t, if δΘt is close to 0, it satisfies |πm(Xt, Θt) πm (Xt, Θt)| = O(δΘ). Otherwise, |πm(Xt, Θt) πm (Xt, Θt)| = Ω(δΘ).

Lemma 11. If M = Ω(N ln(N)), we have |Mn| 1 for all n [N] with probability at least 1 o(1). If M < N, given M = Ω(K ln(K)), we have |Mk| 1 for all k [K] with probability at least 1 o(1).

Lemma 12. At any round t, we have the following properties:

1) for task arrival nt with ground truth wnt = wn under M > N, if it is routed to a correct expert mt Mn, then θ(m) t Lloc t = 0 for any expert m [M].

2) for task arrival nt with ground truth wnt Wk under M < N, if it is routed to a correct expert mt Mk, then θ(m) t Lloc t = O(σ1.5 0 ) for any expert m [M].

Let Xn and Xn denote two feature matrices containing feature signals vn and vn , respectively.

Lemma 13. If n and n satisfy: 1) n = n under M > N or 2) wn Wk and wn Wk with k = k under M < N, then if expert m satisfies 1) m Mn under M > N or 2) m Mk under M < N, at the beginning of the router learning stage t = T1 + 1, then the following property holds at any round t {T1 + 2, , T}: πm(Xn, Θt) > πm(Xn , Θt), m [M]. (33)

Based on these lemmas, the proof of Proposition 1 is given in Appendix E.8.

E.1 PROOF OF LEMMA 7

Proof. According to Lemma 5, for any round t {1, , T1}, the auxiliary loss is the primary loss to update Θt of the gating network. Then based on the update rule of θ(m) t in Eq. (9) and the gradient

Published as a conference paper at ICLR 2025

of θ(mt) t L(aux) t in Eq. (30), we obtain

θ(mt) t+1 θ(mt) t = η θ(mt) t L(aux) t

=O(σ0.5 0 η) πmt(Xt, Θt) X

m =mt πm (Xt, Θt) X

i [st] Xt,i

=O(σ0), based on the fact that πmt(Xt, Θt) P

m =mt πm (Xt, Θt) = πmt(Xt, Θt) (1 πmt(Xt, Θt)) 1

4 and Xt = O(1).

While for any m = mt, we calculate θ(m) t+1 θ(m) t = η θ(m) t L(aux) t

=O(σ0.5 0 η) πmt(Xt, Θt) πm(Xt, Θt) X

i [st] Xt,i

=O(M 1σ0), due to the fact that πmt(Xt, Θt) πm(Xt, Θt) = O(M 1).

Note that by Eq. (30), we have θ(mt) t L(aux) t > 0 and θ(m) t L(aux) t < 0 for any m = mt. Consequently, for expert mt, its corresponding output hmt at the gating network will be reduced by O(σ0) for the same feature signal vn since task t + 1. While for any expert m = mt, its corresponding output hm is increased by O(M 1σ0).

E.2 PROOF OF LEMMA 8

Proof. By the symmetric property, we have that for any m [M], E[f (m) T1 ] = 1 M .

By Hoeffding s inequality, we obtain

P(|f (m) T1 1

M | ϵ) 1 2 exp ( 2ϵ2T1).

Then we further obtain

P(|f (m) T1 1

M | ϵ, m [M]) (1 2 exp ( 2ϵ2T1))M

1 2M exp ( 2ϵ2T1)).

Let δ = 1 2M exp ( 2ϵ2T1)). Then we obtain ϵ = O(η0.5M 1). Subsequently, there is a probability of at least 1 δ that f (m) T1 1 M = O(η0.5M 1).

E.3 PROOF OF LEMMA 9

Proof. Based on Lemma 7 and Lemma 8 and their corresponding proofs above, we can prove Lemma 9 below.

For experts m and m , they are selected by the router for T1 f (m) T1 and T1 f (m ) T1 times during the exploration stage, respectively. Therefore, we obtain θ(m) T1 = f (m) T1 T1 O(σ0) (1 f (m) T1 ) T1 O(M 1σ0),

θ(m ) T1 = f (m ) T1 T1 O(σ0) (1 f (m ) T1 ) T1 O(M 1σ0).

Then by Eq. (9) and Lemma 7, we calculate θ(m) T1 θ(m ) T1 = f (m) T1 f (m ) T1 T1 O(σ0) (1 f (m) T1 ) (1 f (m ) T1 ) T1 O(M 1σ0)

= |f (m) T1 f (m ) T1 | T1 O(σ0)

= O(η 0.5σ0), where the first equality is derived based on the update steps in Lemma 7, and the last equality is because of T1 |f (m) T1 f (m ) T1 | = O(η 0.5) by Eq. (32) in Lemma 8.

Published as a conference paper at ICLR 2025

E.4 PROOF OF LEMMA 10

Proof. At any round t, we calculate

|πm(Xt, Θt) πm (Xt, Θt)| = πm (Xt, Θt) exp (hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )) πm (Xt, Θt)

= πm (Xt, Θt) exp (hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )) 1 ,

where the first equality is by solving Eq. (2). Then if δΘt is close to 0, by applying Taylor series with sufficiently small δΘ, we obtain

|πm(Xt, Θt) πm (Xt, Θt)| πm (Xt, Θt)|hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )| = O(δΘ), where the last equality is because of πm( Xt, Θt) 1.

While if δΘt is not sufficiently small, we obtain

|πm(Xt, Θt) πm (Xt, Θt)| > πm (Xt, Θt)|hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )| = Ω(δΘ). This completes the proof.

E.5 PROOF OF LEMMA 11

Proof. If M > N, by the symmetric property, we have that for all n [N], m [M],

P(m Mn) = 1

Therefore, the probability that |Mn| at least includes one expert is

P(|Mn| 1) 1 1 1

By applying union bound, we obtain

P(|Mn| 1, n) 1 1 1

M N 1 N 1 1

M 1 N exp M

where the second inequality is because (1 N 1)M is small enough, and the last inequality is because of M = Ω N ln N

While if M < N, we can use the same method to prove that P(|Mk| 1, k) 1 δ, given M = Ω K ln K

δ and P(m Mk) = 1

K by the symmetric property.

E.6 PROOF OF LEMMA 12

Proof. In the case of M > N, as |Mn| = 1, if task nt with nt = n is routed to the correct expert mt Mn, we have wmt t = wmt t 1 by Eq. (5). Consequently, the caused locality loss Lloc t (Θt, Dt) = 0, based on its definition in Eq. (6).

In the case of M < N, as |Mk| 1 for each cluster k, if task nt with wnt Wk is routed to the correct expert mt Mk, we have w(mt) t w(mt) t 1 = Xt(X t Xt) 1(yt X t w(mt) t 1 )

= Xt(X t Xt) 1X t (wnt w(mt) t 1 )

= O( wt w(mt) t 1 )

= O(σ1.5 0 ), where the second equality is because of yt = X t wt, and the third equality is because of Xt = O(1). Therefore, we obtain θ(m) t Lloc t = O(σ1.5 0 ) by solving Eq. (27), for any m [M].

E.7 PROOF OF LEMMA 13

Proof. We first focus on the M > N case to prove Lemma 13. Based on Eq. (35), we obtain πm(Xn, ΘT1) πm(Xn , ΘT1) = Ω(σ0.5 0 ) for m Mn, given θ(m) t = O(σ0.5 0 ) derived in

Published as a conference paper at ICLR 2025

Lemma 6. To prove Eq. (33), we will prove that πm(Xn, Θt) πm(Xn , Θt) = Ω(σ0.5 0 ) holds for any t T1 + 1.

Based on Lemma 5, we have θ(m) t Ltask t = O(σ0), leading to θ(mt) t+1 θ(mt) t , vn = O(σ1.5 0 )

for expert mt and θ(m) t+1 θ(m) t , vn = O(σ1.5 0 ) for any other expert m = mt. As T2 = σ 0.5 0 η 1M , we calculate

θ(m) T2 < O(σ0.5 0 ) θ(m) T1 Ltask T1 η (T2 T1)

O(σ0.5 0 ) O(σ0) O(σ 0.25 0 )

= O(σ0.5 0 ).

Therefore, θ(m) t = O(σ0.5 0 ) is always true for t {T1 + 1, , T2}, and thus πm(Xn, Θt) πm(Xn , Θt) = Ω(σ0.5 0 ) holds, meaning πm(Xn, Θt) > πm(Xn , Θt), m [M].

For the case of M < N, we can use the same method to prove Eq. (33). This completes the proof.

E.8 FINAL PROOF OF PROPOSITION 1

Proof. In the case of M > N, to prove Proposition 1, we equivalently prove that at the end of the exploration stage with t = T1, for any two experts m Mn and m Mn with n = n , the following properties hold: πm(Xn, Θt) > πm (Xn, Θt), πm (Xn , Θt) > πm(Xn , Θt), (34) where Xn and Xn contain feature signals vn and vn , respectively. Based on Eq. (34), we have each expert m [M] stabilizes within an expert set Mn.

According to Lemma 1, the gating network only focuses on feature signals. Then for each expert m, we calculate |hm(Xn, θ(m) t ) hm(Xn , θ(m) t )| = | θ(m) t , vn vn |

= θ(m) t vn vn = O(σ0.5 0 ),

where the last equality is because of θ(m) t = O(σ0.5 0 ) in Lemma 6 and vn vn = O(1).

Then based on Lemma 10, we obtain |πm(Xn, Θt) πm(Xn , Θt)| = Ω(σ0.5 0 ). (35)

Next, we prove Eq. (34) by contradiction. Assume there exist two experts m Mn and m Mn such that πm(Xn, Θt) > πm (Xn, Θt), πm(Xn , Θt) > πm (Xn , Θt), which is equivalent to πm(Xn, Θt) > πm(Xn , Θt) > πm (Xn , Θt) > πm (Xn, Θt), (36) because of πm(Xn, Θt) > πm(Xn , Θt) and πm (Xn , Θt) > πm (Xn, Θt) based on the definition of expert set Mn in Eq. (11). Then we prove Eq. (36) does not exist at t = T1.

For task t = T1, we calculate |hm(Xn, θ(m) T1 ) hm (Xn, θ(m) T1 )| θ(m) T1 θ(m ) T1 vn

= O(σ0η 0.5), (37) where the first inequality is derived by union bound, and the second equality is because of vn = O(1) and θ(m) T1 θ(m ) T1 = O(σ0η 0.5) derived in Lemma 9 at the end of the exploration phase.

Then according to Lemma 10 and Eq. (37), we obtain |πm(Xn, ΘT1) πm (Xn, ΘT1)| = O(σ0η 0.5). (38)

Based on Eq. (36), we further calculate |πm(Xn, ΘT1) πm (Xn, ΘT1)| |πm(Xn, ΘT1) πm(Xn , ΘT1)|

= Ω(σ0.5 0 ),

Published as a conference paper at ICLR 2025

where the first inequality is derived by Eq. (36), and the last equality is derived in Eq. (35). This contradicts with Eq. (38) as σ0η 0.5 < σ0.5 0 given η = O(σ0.5 0 ). Therefore, Eq. (36) does not exist for t = T1, and Eq. (34) is true for t = T1.

Based on Lemma 13, we obtain that each expert set Mn is stable during the router learning stage. Therefore, at any round t {T1 + 1, , T}, task nt with ground truth wnt will be routed to one of its best experts in Mnt. Then based on Lemma 12, we have that θ(m) t Lloc t = 0 holds in the router

learning stage. Subsequently, w(m) t of any expert m remains unchanged, based on Lemma 3.

In the case of M < N, we similarly obtain that at the end of the exploration phase with t = T1, for any two experts m Mk and m Mk with k = k , the following property holds πm(Xk, Θt) > πm (Xk, Θt), πm (Xk , Θt) > πm(Xk , Θt), where Xk and Xk contain feature signals vk and vk , respectively.

Then during the router learning stage with t > T1, any task nt with wnt Wk will be routed to the correct expert m Mk. Let w(m) denote the minimum ℓ2-norm offline solution for expert m. Based on the update rule of w(mt) t in Eq. (5), we calculate w(mt) t w(mt) = w(mt) t 1 + Xt(X t Xt) 1(yt X t w(mt) t 1 ) w(mt)

= (I Xt(X t Xt) 1X t )w(mt) t 1 + Xt(X t Xt) 1X t w(mt) w(mt)

= (I Xt(X t Xt) 1X t )(w(mt) t 1 w(mt)), where the second equality is because of yt = X t w(mt). Define Pt = Xt(X t Xt) 1X t for task nt, which is the projection operator on the solution space wnt. Then we obtain w(m) t w(m) = (I Pt) (I PT1+1)(w(m) T1 w(m)) for each expert m [M].

Since orthogonal projections Pt s are non-expansive operators, it also follows that t {T1 + 1, , T}, w(m) t w(m) w(m) t 1 w(m) w(m) T1 w(m) . As the solution spaces Wk is fixed for each expert m Mk, we further obtain w(m) t w(m) T1+1 = w(m) t w(m) + w(m) w(m) T1+1

w(m) t w(m) + w(m) T1+1 w(m) max wn,wn Wk wn wn

= O(σ1.5 0 ), where the first inequality is derived by the union bound, the second inequality is because of the orthogonal projections for the update of w(m) t per task, and the last equality is because of wn wn = O(σ1.5 0 ) for any two ground truths in the same set Wk.

F FULL VERSION AND PROOF OF PROPOSITION 2

Proposition 2 (Full version). If the Mo E keeps updating Θt by Eq. (9) at any round t [T], we obtain: 1) At round t1 = η 1σ 0.25 0 M , the following property holds

hm(Xt1, θ(m) t1 ) hm (Xt1, θ(m ) t1 ) =

O(σ1.75 0 ), if m, m Mn under M > N or m, m Mk under M < N, Θ(σ0.75 0 ), otherwise. 2) At round t2 = η 1σ 0.75 0 M , the following property holds hm(Xt2, θ(m) t2 ) hm (Xt2, θ(m ) t2 ) = O(σ1.75 0 ), m, m [M].

We first propose the following lemmas as preliminaries to prove Proposition 2. Then we prove Proposition 2 in Appendix F.3.

For any two experts m, m , define Θ = |πm(Xt, Θt) πm (Xt, Θt)|.

Lemma 14. At any round t {T1 + 1, , t1}, m = mt, the following property holds θ(m) t Ltask t θ(mt) t Ltask t = O σ0 . (39)

Published as a conference paper at ICLR 2025

Let Xn and Xn denote two feature matrices containing feature signals vn and vn , respectively.

Lemma 15. In the router learning stage with t {T1 + 1, , t1}, for any two experts satisfying 1) m Mn and m Mn with n = n under M > N, or 2) m Mk and m Mk with wn Wk, wn Wk and k = k under M < N, the following properties hold: πm(Xn, Θt) > πm (Xn, Θt), πm (Xn , Θt) > πm(Xn , Θt). (40)

F.1 PROOF OF LEMMA 14

Proof. Based on the proof of Lemma 5, for any m = mt, we calculate θ(m) t Ltask t θ(mt) t Ltask t = ( w(mt) t w(mt) t 1 2 + αM

t f mt t ) πmt

= O σ0 + αM

= O(σ0), where the last equality is because of πmt(Xt,Θt) < 1, Xt = O(1) and αM

t σ0 for any t {T1 + 1, , t1}.

F.2 PROOF OF LEMMA 15

Proof. We will use the same method as in Appendix E.8 to prove Lemma 15 by contradiction. Here, we only prove the case of M > N. The proof for the case of M < N is similar.

Recall Proposition 1 that Eq. (40) is true at t = T1 + 1. Based on Lemma 13, we have |πm(Xn, Θt) πm(Xn , Θt)| = Ω(σ0.5 0 ). Then in the following, we aim to prove |πm(Xn, ΘT1) πm (Xn, ΘT1)| = O(σ0η 0.5) in Eq. (38) is always true for any m Mn and m Mn during the router learning stage. Then according to the proof of Proposition 1 in Appendix E.8, Eq. (40) is also true.

Under Eq. (40) at t = T1 + 1, the router will route task t to its best expert mt Mnt, leading to θ(m) t Lloc t = 0 by Lemma 12. Therefore, θ(m) t Ltask t = O(σ0) makes θ(mt) t+1 θ(mt) t , vn =

O(σ1.5 0 ) for expert mt and θ(m) t+1 θ(m) t , vn = O(σ1.5 0 ) any other expert m = mt at task t = T1 + 1.

Subsequently, for any two experts m Mn and m Mn , we calculate

θ(m) t1 θ(m ) t1 θ(m) T1+2 θ(m ) T1+2 + (t1 T1) O(σ1.5 0 )

O(σ0η 0.5) + O(η 1σ 0.25 0 ) O(σ1.5 0 )

= O(σ0η 0.5),

where the first inequality is because of θ(m) t+1 θ(m) t , vn = O(σ1.5 0 ), and the second inequality is because of t1 T1 t1.

As θ(m) t1 θ(m ) t1 = O(σ0η 0.5) and |πm(Xn, Θt) πm(Xn , Θt)| = Ω(σ0.5 0 ), Eq. (40) is true for any t {T1 + 2, , t1}, based on our proof of Proposition 1.

F.3 FINAL PROOF OF PROPOSITION 2

Proof. We first focus on the M > N case to prove |hm(Xt1, θ(m) t1 ) hm (Xt1, θ(m ) t1 )| = O(σ1.75 0 ) for any two different experts m, m Mn in the same expert set in Eq. (12). Then we prove |hm(Xt1, θ(m) t1 ) hm (Xt1, θ(m ) t1 )| = Θ(σ0.75 0 ) for any two experts m Mn and m Mn in different expert sets in Eq. (12). After that, we prove Eq. (13) at round t2 = η 1σ 0.75 0 M . Finally, we prove that the above analysis can be generalized to the case of M < N.

In the case of M > N, let M and m denote the two experts within set Mn with the maximum and minimum softmax values, respectively. In other words, we have M = arg max m Mn{πm(Xt, Θt)}, m = arg min m Mn{πm(Xt, Θt)}, (41)

Published as a conference paper at ICLR 2025

where vn Xt. If the two experts satisfies |h M (Xt1, θ(M ) t1 ) hm (Xt1, θ(m ) t1 )| = O(σ1.75 0 ), then this equation holds for any two experts in Mt.

At the beginning of the router learning stage, we have |πM (XT1, ΘT1) πm (XT1, ΘT1)| = O(σ0η 0.5) = O(σ0.75 0 ), based on Proposition 1.

According to the routing strategy in Eq. (1), if the new task t has ground truth wn, it is always routed to expert M until its softmax value is reduced to smaller than others. Therefore, we calculate the reduced output of gating network for expert M in the router learning stage:

θ(M ) T1+1 θ(M ) t1 , vn O(σ0) η (t1 T1)

= O(σ0.75 0 ). While for expert m , it will not be routed until its softmax value increased to the maximum. Therefore, we similarly calculate the increased gating output θ(m ) t1 θ(m ) T1+1, vn = O(σ0.75 0 ) for expert m .

Based on Lemma 4 and Lemma 14, the gating network parameters of experts m and M will converge to the same value, with an error smaller than the update step of Θt. Therefore, we obtain

θ(M ) t1 θ(m ) t1 = θ (mt1 1)

t1 1 Ltask t1 1 η

= θ (mt1 1)

t1 1 Laux t1 1 η

= O(σ1.75 0 ), based on the fact that θ (mt1 1)

t1 1 Laux t1 1 = O(σ1.25 0 ) and η = O(σ0.5 0 ). Then according to Lemma 10,

we obtain |h M (Xt1, θ(M ) t1 ) hm (Xt1, θ(m ) t1 )| = O(σ1.75 0 ), which also holds for any two experts in the same expert set Mn.

Next, we prove |hm(Xt1, θ(m) t1 ) hm (Xt1, θ(m ) t1 )| = Θ(σ0.75 0 ) in Eq. (12) for expert m in Eq. (41) and another expert m / Mn.

Let m Mn denote the index of the expert in other expert sets with the maximum softmax value of dataset Xn, where vn Xt. In other words, m = arg maxm/ Mn{πm(Xt, Θt)}. According to the proof of Proposition 1, we obtain πm (XT1, ΘT1) π m(XT1, ΘT1) = O(σ0.75 0 ). This equation indicates that during the router learning stage with t {T1 + 1, , t1}, any task arrival nt with ground truth wnt = wn will not be routed to expert m. Therefore, π m(Xt, Θt) keeps increasing with t. Then we calculate the difference between the parameter gradient of expert m and expert m per round t: θ( m) t Ltask t θ(m ) t Ltask t = αM

t f mt t πmt(Xt, Θt)(πm (Xt, Θt) π m(Xt, Θt)) P

i [st] Xt,i

0, based on the fact that πm (Xt, Θt) π m(Xt, Θt) > 0 under Lemma 13. As θ( m) t Ltask t < 0 and

θ(m ) t Ltask t < 0, we obtain that hm (Xt, θ(m ) t ) h m(Xt, θ( m) t ) increases with t during the router

learning stage. According to our former analysis of hm (Xt, θ(m ) t ), we obtain

|hm (Xt1, θ(m ) t1 ) h m(Xt1, θ( m) t1 )|

=|hm (XT1, θ(m ) T1 ) h m(XT1, θ( m) T1 )| + θ( m) t Ltask t θ(m ) t Ltask t η (t1 T1)

=O(σ0.75 0 ) + Θ(σ0.75 0 ) = Θ(σ0.75 0 ), where the first equality is because of θ( m) t Ltask t θ(m ) t Ltask t = O(σ0). This completes the proof of Eq. (12).

Subsequently, we prove hm(Xt2, θ(m) t2 ) hm (Xt2, θ(m ) t2 ) = O(σ1.75 0 ), m, m [M] in Eq. (13)

by proving |h M (Xt2, θ(M ) t2 ) hm(Xt2, θ(m) t2 )| = O(σ1.75 0 ) between expert M Mn in Eq. (41) and any other expert m / Mn.

Published as a conference paper at ICLR 2025

Based on Lemma 7, we obtain

θ(m) t+1 θ(m) t , vn = O(σ1.25 0 ), if m = mt, O(M 1σ1.25 0 ), if m = mt. According to our analysis above, expert M Mn is periodically selected by the router for training task arrivals nt = n during t {t1, , t2}. After each training of task n at expert M , its gate output h M (Xt, θ(M ) t ) is reduced by O(σ1.25 0 ). While at other rounds without being selected, its gate output is increased by O(M 1σ1.25 0 ). Under such training behavior, we obtain h M (Xt1, θ(M ) t1 ) h M (Xt2, θ(M ) t2 ) = O(σ1.75 0 ), by assuming vn Xt1 and vn Xt2.

While for expert m / Mn, its gate output hm(Xt, θ(m) t ) keeps increasing for any data Xt of task n. For any t {t1, , t2}, we have θ(m) t Laux t = O(σ1.25 0 ) for expert m. Assuming that expert m is never selected by the router for training task nt = n in the period, we obtain |hm(Xt2, θ(m) t2 ) hm(Xt1, θ(m) t1 )| = θ(m) t Laux t (t2 t1) η

> |h M (Xt1, θ(M ) t1 ) hm(Xt1, θ(m) t1 )|

where the inequality is because of θ(m) t Laux t (t2 t1) η = O(σ0.5 0 ) and |h M (Xt1, θ(M ) t1 )

hm(Xt1, θ(m) t1 )| = Θ(σ0.75 0 ). This inequality indicates that there exists a training round t

{t1, , t2} such that hm(Xt , θ(m) t ) > h M (Xt , θ(M ) t ) for task arrival nt = n. Consequently, expert m is selected to train task n again, meaning that m Mn at round t . Then the gating network parameters of experts m and M will converge to the same value, with an error of O(σ1.75 0 ), based on Lemma 4 and Lemma 14. This completes the proof of Eq. (13) in the case of M > N.

In the case of M < N, let M and m denote the experts of set Mk with the maximum and minimum softmax values, respectively. In other words, we have M = arg max m Mk{πm(Xt, Θt)}, m = arg min m Mk{πm(Xt, Θt)},

where the ground truth of task t satisfies wn Wk.

According to the proof of Proposition 2 in Appendix F.3, the gating network parameters of experts m and M will converge to the same value at the end of the router learning stage, with an error smaller than the update step of Θt. Therefore, we obtain

θ(M ) t1 θ(m ) t1 = θ (mt1 1)

t1 1 Ltask t1 1 η

= ( θ (mt1 1)

t1 1 Lloc t1 1 + θ (mt1 1)

t1 1 Laux t1 1) η

= O(σ1.75 0 ), based on the fact that θ (mt1 1)

t1 1 Laux t1 1 = O(σ1.25 0 ) and θ (mt1 1)

t1 1 Lloc t1 1 = O(σ1.5 0 ) derived in

Lemma 12. Then we obtain |h M (Xt1, θ(M ) t1 ) hm (Xt1, θ(m ) t1 )| = O(σ1.75 0 ), which also holds for any two experts in the same expert set Mk.

Similarly, for any two experts in different expert sets, we can derive hm(Xt1, θ(m) t1 )

hm (Xt1, θ(m ) t 1 ) = Θ(σ0.75 0 ). For training round t2, the proof for the case of M < N is the same as the case of M > N above, and thus we skip it here.

G FULL VERSION AND PROOF OF PROPOSITION 3

Proposition 3 (Full version). Under Algorithm 1, the Mo E terminates updating Θt since round T2 = O(η 1σ 0.25 0 M). Then for any task arrival nt at t > T2, the following property holds:

1) If M > N, the router selects any expert m Mnt with an identical probability of 1 |Mnt|, where |Mnt| is the number of experts in set Mn.

Published as a conference paper at ICLR 2025

2) If M < N and wnt Wk, the router selects any expert m Mk, with an identical probability of 1 |Mk|, where |Mk| is the number of experts in set Mk.

Proof. In the case of M > N, according to Algorithm 1 and Proposition 2, after the termination of gating network update, the following properties hold: 1) |hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )| = Θ(σ0.75 0 ) for any m Mn and m Mn and 2) |hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )| = O(Γ) for any m, m Mn, where Γ = O(σ1.25 0 ).

If the ground truth of task arrival nt satisfies wnt = wn, for any experts m Mn and m / Mn, we have hm(Xt, θ(m) t ) + r(m) t (hm (Xt, θ(m ) t ) + r(m ) t ) hm(Xt, θ(m) t ) hm (Xt, θ(m ) t ) + r(m ) t = Θ(σ0.75 0 ),

given r(m ) t = Θ(σ1.25 0 ) and hm(Xt, θ(m) t ) hm (Xt, θ(m ) t ) = Θ(σ0.75 0 ). Therefore, any expert m / Mn will not be selected to learn task t, and only experts in set Mn will be selected.

For any experts m Mn, we calculate

P mt = m|m Mn = P m = arg max m Mn

n hm (Xt, θ(m ) t ) + r(m ) t o

= P hm(Xt, θ(m) t ) + r(m) t (hm (Xt, θ(m ) t ) + r(m ) t ) > 0, m Mn

= P r(m) t > r(m ) t , m Mn = 1 |Mn|,

where the third equality is because of r(m) t = Θ(σ1.25 0 ) and |hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )| = O(σ1.25 0 ) derived under Algorithm 1, and the last equality is due to the fact that r(m) t satisfies a uniform distribution Unif[0, λ].

In the case of M < N, we similarly derive the following properties: 1) |hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )| = Θ(σ0.75 0 ) for any m Mk and m Mk and 2) |hm(Xt, θ(m) t ) hm (Xt, θ(m ) t )| = O(Γ) for any m, m Mk. Based on the two properties, hm(Xt, θ(m) t ) + r(m) t (hm (Xt, θ(m ) t ) + r(m ) t ) = Θ(σ0.75 0 ) is always true for any two experts m and m not in the same expert set Mk. Furthermore, we can similarly calculate

P mt = m|m Mk = P r(m) t > r(m ) t , m Mk = 1 |Mk|.

This completes the proof of Proposition 3.

H PROOF OF PROPOSITION 4

Proof. In the single-expert system, based on the definition of forgetting in Eq. (15) and Eq. (42) in Lemma 16, for any training rounds t [T] and i {1, , t}, we calculate:

E[Ft] = 1 t 1

i=1 E h w(mi) t wni 2 w(mi) ni wni 2i

n (rt ri)E[ wni 2] +

l=1 (1 r)rt l E[ wnl wni 2]

j=1 (1 r)ri j E[ wnj wni 2] o

n=1 wn 2 + ri rt

n =n wn wn 2o ,

Published as a conference paper at ICLR 2025

where we let wni denote the ground truth of the task arrival at round i. Here the last equality is derived by E[ wni 2] = 1

N PN n=1 wn 2 and

E[ wnj wni 2] = E[ 1

N PN n=1 wnj wn 2] = 1 N2 P

n =n wn wn 2

when there is only a single expert.

Similarly, we can calculate the generalization error E[GT ] = 1

T PT i=1 E[ w(mi) T wni 2]

T PT i=1 r T E[ wni 2] + PT l=1(1 r)r T l E[ wnl wni 2]

N PN n=1 wn 2 + 1 r T

n =n wn w n 2, where the second equality is because of Eq. (42) in Lemma 16. This completes the proof of Proposition 4.

I PROOF OF THEOREM 1

Before proving Theorem 1, we first propose the following lemma. Then we formally prove Theorem 1 in Appendix I.2.

For expert m, let τ (m)(l) {1, , T1} represent the training round of the l-th time that the router selects expert m during the exploration stage. For instance, τ (1)(2) = 5 indicates that round t = 5 is the second time the router selects expert 1.

Lemma 16. At any round t {T1 + 1, , T}, for i {T1 + 1, , t}, we have w(mi) t wni 2 = w(mi) T1 wni 2. While at any round t {1, , T1}, for any i {1, , t}, we have

E[ w(mi) t wni 2] = r L (mi) t E[ wni 2] +

l=1 (1 r)r L (mi) t l E[ wτ (mi)(l) wni 2], (42)

where L(mi) t = t f (mi) t and r = 1 s

I.1 PROOF OF LEMMA 16

Proof. At any round t {T1 + 1, , T}, we have w(m) t = w(m) T1 , based on Proposition 1 and

Proposition 2. Therefore, w(mi) t wni 2 = w(mi) T1 wni 2 is true for any round t {T1 + 1, , T} and i {T1 + 1, . . . , t}.

Next, we prove Eq. (42) for round t {1, , T1}. Define Pt = Xt(X t Xt) 1X t for task t. At current task t, there are totally L(m) t = t f (m) t tasks routed to expert m, where f (m) t is in Eq. (7).

Based on the update rule of w(m) t in Eq. (5), we calculate w(mi) t wni 2 = w(mi)

τ (mi)(L (mi) t ) wni 2

= (I Pt)w(mi)

τ (mi)(L (mi) t 1) + Ptwτ mi(L (mi) t ) wni 2

= (I Pt)(w(mi)

τ (mi)(L (mi) t 1) wni) + Pt(wτ mi(L (mi) t ) wni) 2,

where the first equality is because there is no update of w(mi) t for t {τ (mi)(L(mi) t ), , t}, and the second equality is by Eq. (5).

As Pt is the orthogonal projection matrix for the row space of Xt, based on the rotational symmetry of the standard normal distribution, it follows that E Pt(wτ mi(L (mi) t ) wni) = s

d wτ mi(L (mi) t )

Published as a conference paper at ICLR 2025

wni 2. Then we further calculate

E[ w(mi) t wni 2] = (1 s

τ (mi)(L (mi) t 1) wni 2] + s

d E[ wτ (mi)(L (mi) t ) wni 2]

d)L (mi) t E[ w(mi) 0 wni 2]

d)L (mi) t l s

d E[ wτ (mi)(L (mi) t ) wni 2]

= r L (mi) t E[ wni 2] +

l=1 (1 r)r L (mi) t l E[ wτ (mi)(l) wni 2],

where the second equality is derived by iterative calculation, and the last equality is because of w(m) 0 = 0 for any expert m. Here we denote by r = 1 s

d to simplify notations.

I.2 FINAL PROOF OF THEOREM 1

Proof. Based on Eq. (42) in Lemma 16, we obtain

E[ w(mi) i wni 2] = r L (mi) i E[ wni 2] +

l=1 (1 r)r L (mi) i l E[ wτ (mi)(l) wni 2],

where τ (mi)(L(mi) i ) = i.

Then at any round t {2, , T1}, we calculate the expected forgetting as:

E[Ft] = 1 t 1

i=1 E h w(mi) t wni 2 w(mi) i wni 2i

n (r L (mi) t r L (mi) i )E[ wni 2] +

l=1 (1 r)r L (mi) t l E[ wτ (mi)(l) wni 2]

j=1 (1 r)r L (mi) i j E[ wτ (mi)(j) wni 2] o

where ci,j = (1 r)(r L (mi) t L (mi) i + r L (mi) t j rj L (mi) i ).

As task i s ground truth wni is randomly drawn from ground truth pool W with identical probability 1 N , we have

E[ wni 2] = 1

According to Lemma 8 and Proposition 1, each expert m will converge to an expert set Mn before t = T1. Therefore, we obtain

E[ wτ (mi)(l) wni 2] = E[ 1

n=1 wτ (mi)(l) wn 2]

n =1 wn wn 2

n =n wn wn 2, (43)

where the inequality is because the expected error E[ wτ (mi)(l) wn 2] per round for t < T1 is smaller than the uniformly random routing strategy with expected error E[ wτ (mi)(l) wn 2] =

1 N PN n =1 wn wn 2, and the last equality is because of wn wn 2 = 0 for n = n.

Published as a conference paper at ICLR 2025

Therefore, we finally obtain

E[Ft] < 1 t 1 Pt 1 i=1 n r L(mi) t r L(mi) i N PN n=1 wn 2 + 1 r

N2 PL (mi) t l=1 r L (mi) t l PN n =n wn wn 2

N 2 PL (mi) i j=1 r L (mi) i j PN n =n wn wn 2o

= 1 t 1 Pt 1 i=1 n r L(mi) t r L(mi) i N PN n=1 wn 2 + 1 r L(mi) t N2 PN n =n wn wn 2o

1 r L(mi) i N 2 PN n =n wn wn 2o

nr L (mi) t r L (mi) i N

n=1 wn 2 + r L (mi) i r L (mi) t

n =n wn wn 2o .

For any t {T1 + 1, , T}, based on Proposition 2, the expert model w(m) t = w(m) T1 for any expert m [M]. Therefore, we calculate the caused forgetting

E[Ft] = 1 t 1

i=1 E h w(mi) t wni 2 w(mi) i wni 2i

i=1 E h w(mi) T1 wni 2 w(mi) i wni 2i

Based on Eq. (42), we can also calculate the close form of the generalization error:

i=1 E[ w(mi) T wni 2]

i=1 E[ w(mi) T1 wni 2]

r L (mi) T1 E[ wni 2] +

L (mi) T1 X

l=1 (1 r)r L (mi) T1 l E[ wτ (mi)(l) wni 2]

< PT i=1 r L (mi) T1 NT

n=1 wn 2 + 1 r

L (mi) T1 X

l=1 r L (mi) T1 l N X

n =n wn wn 2

= PT i=1 r L (mi) T1 NT

n=1 wn 2 + PT i=1(1 r L (mi) T1 ) N 2T

n =n wn wn 2,

where the inequality is because of Eq. (43).

J PROOF OF THEOREM 2

Proof. For any round t {1, , T1}, the forgetting is the same as the case of M > N in Theorem 1, as tasks randomly arrive and are routed to different experts. Therefore, we skip the proof for t < T1.

For t {T1 + 1, , T}, the router will route tasks in the same cluster to each expert per round. Therefore, we divide the t rounds into two subintervals: i {1, , T1} and i {T1 + 1, , t} to

Published as a conference paper at ICLR 2025

calculate the forgetting as

E[Ft] = 1 t 1 Pt i=1 E h w(mi) t wni 2 w(mi) i wni 2i

= 1 t 1 PT1 i=1 n (r L (mi) t r L (mi) i )E[ wni 2] + PL (mi) t l=1 (1 r)r L (mi) t l E[ wτ (mi)(l) wni 2]

PL (mi) i j=1 (1 r)r L (mi) i j E[ wτ (mi)(j) wni 2] o +

1 t 1 PT i=T1+1 n (r L (mi) t r L (mi) i )E[ wni 2] +

l=L (mi) T1 +1 (1 r)r L (mi) t l E[ wτ (mi)(l) wni 2]

| {z } term a

j=L (mi) T1 +1 (1 r)r L (mi) i j E[ wτ (mi)(j) wni 2]

| {z } term b

where the first term in the second equality is similarly derived as forgetting in Eq. (19). For the term a term b in the second equality, we calculate term a term b

L (mi) t L (mi) i +L (mi) T1 +1 X

l=L (mi) T1 +1 (1 r)r L (mi) t l E[ wτ (mi)(l) wni 2]

=r L (mi) t L (mi) T1 (1 r L (mi) t L (mi) i )E[ wτ (mi)(l) wni 2]

=r L (mi) t L (mi) T1 1(1 r L (mi) t L (mi) i ) N

n=1 E[ wτ (mi)(l) wn 2]

=r L (mi) t L (mi) T1 1(1 r L (mi) t L (mi) i ) N

where the third equality is derived by E[ wτ (mi)(l) wni 2] = 1 N PN n=1 E[ wτ (mi)(l) wn 2], and the last equality is because the router always routes task wτ (mi)(l) = wn within the same cluster Wk to expert mi. Taking the result of term a term b into E[Ft], we obtain

E[Ft] < 1 t 1

r L (mi) t r L (mi) i N

n=1 wn 2 + 1 t 1

r L (mi) i r L (mi) T1 N 2

n =n wn wn 2

r L (mi) t L (mi) T1 1(1 r L (mi) t L (mi) i ) N

Published as a conference paper at ICLR 2025

Similarly, we can calculate the overall generalization error as

i=1 E[ w(mi) T wni 2]

r L (mi) T E[ wni 2] +

l=1 (1 r)r L (mi) T l E[ wτ (mi)(l) wni 2]

i=1 r L (mi) T E[ wni 2] + 1

n L (mi) T1 X

l=1 (1 r)r L (mi) T l E[ wτ (mi)(l) wni 2] | {z } term 1

l=L (mi) T1 +1 (1 r)r L (mi) T l E[ wτ (mi)(l) wni 2] | {z } term 2

n L (mi) T1 X

l=1 (1 r)r L (mi) T l E[ wτ (mi)(l) wni 2] | {z } term 2

l=L (mi) T1 +1 (1 r)r L (mi) T l E[ wτ (mi)(l) wni 2] | {z } term 3

where the second equality is derived by Eq. (42), and the third equality is derived by dividing T rounds into two subintervals {1, , T1} and {T1 + 1, , T}.

In the above equation, term 1 means both wni and wτ (mi)(l) are randomly drawn from the N ground truths, due to the fact that i T1 and τ (mi)(l) T1. Term 2 means one of wni and wτ (mi)(l) is randomly drawn from the N ground truths before the T1-th round, while the other one has fixed to a cluster Mk after the T1-th round (i.e., i T1 and τ (mi)(l) > T1 or i > T1 and τ (mi)(l) T1). Term 3 means wni and wτ (mi)(l) are in the same cluster Mk, as i > T1 and τ (mi)(l) > T1.

For i {1, , T1}, based on the proof of Theorem 1 in Appendix I, we obtain term 1:

E[ wτ (mi)(l) wni 2] < 1 N 2 X

n =n wn wn 2.

While for i {T1 + 1, , t}, we calculate term 3 as:

E[ wτ (mi)(l) wni 2] = 1

n=1 E[ wτ (mi)(l) wn 2]

Finally, we calculate term 2:

E[ wτ (mi)(l) wni 2] = 1

n=1 E[ wτ (mi)(l) wn 2]

n Wk wn wn 2.

Published as a conference paper at ICLR 2025

Based on the expressions of the three terms above, we obtain

r L (mi) T N

n=1 wn 2 + 1

r L (mi) T L (mi) T1 (1 r L (mi) T1 ) N 2

n =n wn wn 2

1 r L (mi) T L (mi) T1 1

n Wk wn wn 2

r L (mi) T L (mi) T1 (1 r L (mi) T1 ) N

n Wk wn wn 2

1 r L (mi) T L (mi) T1 1

which completes the proof.