# federated_fewshot_classincremental_learning__4278ec3e.pdf

Published as a conference paper at ICLR 2025

FEDERATED FEW-SHOT CLASS-INCREMENTAL LEARNING

M. Anwar Ma'sum, Mahardhika Pratama, Lin Liu, Habibullah Habibullah, and Ryszard Kowalczyk
University of South Australia, Mawson Lakes, SA, 5095, Australia
masmy039@mymail.unisa.edu.au, dhika.pratama@unisa.edu.au, lin.liu@unisa.edu.au, habibullah.habibullah@unisa.edu.au, ryszard.kowalczyk@unisa.edu.au

ABSTRACT

This study proposes a challenging yet practical Federated Few-Shot Class Incremental Learning (FFSCIL) problem, where clients only hold very few samples for new classes. We develop a novel Unified Optimized Prototype Prompt (UOPP) model to simultaneously handle catastrophic forgetting, over-fitting, and prototype bias in FFSCIL. UOPP utilizes task-wise prompt learning to mitigate task interference and over-fitting, unified static-dynamic prototypes to achieve a stability-plasticity balance, and adaptive dual heads for enhanced inference. Dynamic prototypes represent new classes in the current few-shot task and are rectified to deal with prototype bias. Our comprehensive experimental results show that UOPP significantly outperforms state-of-the-art (SOTA) methods on three datasets, with improvements of up to 76% on average accuracy and 90% on harmonic mean accuracy respectively. Our extensive analysis shows the robustness of UOPP across various numbers of local clients and global rounds, its low communication costs, and its moderate running time. The source code of UOPP is publicly available at https://github.com/anwarmaxsum/FFSCIL.

1 INTRODUCTION

Figure 1: The importance of prototype rectification for handling prototype bias: (a) initial prototype per client; (b) aggregation without rectification can't handle prototype bias; (c) aggregation with rectification overcomes prototype bias.

Previous studies on Federated Class Incremental Learning (FCIL) address catastrophic forgetting in a dynamic environment with data privacy constraints. Coordinated by a central server, a collection of clients continually develops a global recognition model without sharing their local data. The first issue of FCIL is that existing works, i.e. LGA (Dong et al., 2023), TARGET (Zhang et al., 2023b), and LANDER (Tran et al., 2024), assume that clients carry abundant training data and are thus impractical in resource-constrained environments. They are data-hungry, so they suffer from prototype bias and over-fitting under data scarcity. As in stand-alone few-shot learning (Zhang et al., 2022), federated learning with few samples leads to prototype bias. Figure 1 visualizes how prototypes of the observed classes are generated and aggregated from the few samples carried by the clients. In the FCIL simulation (Dong et al., 2023; 2022), where a client holds only part of the classes, e.g. 60% of the total classes, the few samples observed by a client lead to prototype bias: a prototype does not represent the true population but rather the few locally gathered samples. To overcome prototype bias, each prototype should be refined toward its correct location. As shown in Figure 1, aggregating the prototypes alone cannot handle prototype bias. Few-shot learning (FSL) methods, e.g. MetaNODE (Zhang et al., 2022), or few-shot class incremental learning (FSCIL) methods, e.g. S3C (Kalla & Biswas, 2022), cannot be relied upon either, since they require the presence of all classes whereas, in a federated setting, a client carries only a subset of all classes.

Corresponding author


Second, the current FCIL methods train and share the whole backbone, which results in a large number of parameters during optimization and implies a long training time and a high communication cost. Third, in the current SOTAs, clients generate and share synthetic images for global model aggregation on the server side. Aside from increasing the communication costs, this mechanism may violate data privacy principles, since synthetic data may reveal partial information about the private data of a client to another party due to its similarity to the original data. Last but not least, some FCIL methods, e.g. LGA (Dong et al., 2023) and GLFC (Dong et al., 2022), save several exemplars from previous tasks for rehearsal, which may breach a data openness policy where data are only open to a client at a specific moment.

These gaps motivate us, first, to address a new direction of FCIL, i.e. Federated Few-Shot Class Incremental Learning (FFSCIL), where a client participating in the federated learning process only possesses very few samples. Second, we develop a novel, efficient yet effective approach to the FFSCIL problem, where a client trains and shares as few parameters as possible but produces a highly accurate global model without sharing any synthetic samples or saving exemplars from previous tasks. Therefore, our proposed method handles catastrophic forgetting in dynamic collaborative learning with data privacy and data scarcity constraints. The contributions of this paper are: (1) We emphasize the data scarcity issue leading to the prototype-bias and over-fitting problems in FCIL and define a new problem, namely Federated Few-Shot Class Incremental Learning (FFSCIL); (2) We propose a novel method for the FFSCIL problem termed Unified Optimized Prototype Prompt (UOPP), built upon a prompt learning framework coupled with static and dynamic prototypes optimized by a Neural Ordinary Differential Equation (ODE). The proposed method utilizes an adaptive dual head to enhance its predictive accuracy. To our knowledge, our proposed method is the first prompt-based method that integrates prompt tuning, prototype rectification by a trainable network, and adaptive dual classifiers in a single pipeline; (3) We offer theoretical studies for the convergence and generalization of the proposed method; (4) We provide a comprehensive analysis on three benchmark datasets showing that the proposed method outperforms the baseline and current SOTAs with significant gaps and achieves an improved stability-plasticity balance. Our analysis emphasizes the robustness of the proposed method under various numbers of participating clients and small numbers of rounds per task.

2 RELATED WORKS

Federated Class Incremental Learning (FCIL): FCIL studies address the catastrophic forgetting problem while preserving data privacy, e.g. FedWeIT (Yoon et al., 2021), GLFC (Dong et al., 2022), and LGA (Dong et al., 2023) optimize the global model by aggregating models locally optimized by the participating clients. The current SOTAs prove more effective than combining FedAvg (McMahan et al., 2017) with a class incremental learning (CIL) method such as iCaRL (Rebuffi et al., 2017) or BiC (Wu et al., 2019). However, the SOTAs tune and send the whole backbone, producing long training times and high communication costs as a consequence. Furthermore, the SOTAs assume that a client saves several samples as memory, which may not be practically applicable. Other studies, i.e. TARGET (Zhang et al., 2023a), leverage synthetic samples for rehearsal instead of real samples. TARGET achieves higher performance than FedWeIT (Yoon et al., 2021) but is still outperformed by LGA. A different approach, i.e. FedCIL (Qi et al., 2023), trains a generative model, i.e. ACGAN (Odena et al., 2017), on the local clients' side to generate fake samples for aggregation on the central server's side. It achieves higher performance than the combination of FedAvg or FedProx (Li et al., 2020) with ACGAN, DGR (Shin et al., 2017), or LwF-2T (Usmanova et al., 2021), but it incurs higher communication costs and training time due to the generative model.

Few Shot Class Incremental Learning (FSCIL): Previous studies on FSCIL have attempted to maintain the stability-plasticity tradeoff under data scarcity by adding extra representations, e.g. TOPIC (Tao et al., 2020b) introduces neural gas as the graph of mapped features and CEC (Zhang et al., 2021) continually evolves its classifier to adapt to new tasks. Another approach modifies the learning mechanism, e.g. FSLL (Mazumder et al., 2021) updates with a self-supervised loss, F2M (Shi et al., 2021) finds flat minima regions on the base task and then forces parameter updates on few-shot tasks to reside within the flat region, S3C (Kalla & Biswas, 2022) trains a stochastic classifier with a supervised loss, and MgSvF (Zhao et al., 2024) applies a multi-grained fast-slow learning mechanism. Prototype-based methods, e.g. TEEN (Wang et al., 2024), NC-FSCIL (Yang et al., 2023), and OrCo (Ahmed et al., 2024), show the importance of prototype correction for dealing with prototype bias in FSCIL. However, prototype refinement under data scarcity is still an open challenge. Besides, FSCIL methods are not yet proven in federated settings under non-i.i.d. constraints. A comprehensive literature review is presented in Appendix G.


Figure 2: Visualization of UOPP, which includes task-wise prompt learning empowered by shared unified static-dynamic prototypes, dynamic prototype rectification, adaptive dual heads, and weighted aggregation. Panels: (a) local training at a client; (b) detailed unification, rectification, and dual-head modules; (c) federated training, i.e. the interaction between clients and the central server. The gray-colored parameters are frozen and unshareable, the green-colored parameters are trainable and shareable, and the blue-colored parameters are trainable but unshareable.

3 PROBLEM FORMULATION

Federated Few-Shot Class-Incremental Learning (FFSCIL) is defined as follows. Given a sequence of tasks $[0, 1, 2, \dots, T]$, each task $t$ carries a labeled training set $\mathcal{T}^t = \{(x_i^t, y_i^t)\}_{i=1}^{|\mathcal{T}^t|}$, where $x_i^t \in \mathcal{X}$ denotes an input image, $y_i^t \in \mathcal{Y}$ denotes its label, and $|\cdot|$ denotes cardinality. Each task $t$ is disjoint from any other task $t'$, i.e. $\forall t \neq t'$, $\mathcal{T}^t \cap \mathcal{T}^{t'} = \emptyset$. On each task $t$, a set of clients $\{l\}_{l=1}^{L_{all}}$ coordinated by a central server $G$ is deployed to learn $\mathcal{T}^t$. In the first task ($t = 0$), each client $l$ carries abundant training samples, while in the remaining tasks ($t > 0$) it carries far fewer samples than in the first task, i.e. $|\mathcal{T}_l^0| \gg |\mathcal{T}_l^{t>0}|$. For convenience, task 0 is called the base task while the rest are called few-shot tasks (FS tasks). A client $l$ holds only a subset of the current task's training set, i.e. $\mathcal{T}_l^t \subset \mathcal{T}^t$. The data of each client $l$ are not identically distributed (non-i.i.d.) with respect to any other client $l'$, i.e. $\forall l \neq l'$, $D_l^t \neq D_{l'}^t$, where $D_l^t$ and $D_{l'}^t$ are the distributions of $\mathcal{T}_l^t$ and $\mathcal{T}_{l'}^t$ respectively. Following FCIL, the non-i.i.d. distribution is represented by the percentage of available classes $\eta$. Each task $t$ is learned in a federated way over $R_T$ rounds, where in each round $r \in [1..R_T]$ a set of local clients is randomly selected from all available clients, i.e. $\{l\}_{l=1}^{L} \subset \{l\}_{l=1}^{L_{all}}$. Due to data privacy constraints, a client $l$ is not allowed to share any training sample $(x_i, y_i) \in \mathcal{T}_l^t$ with another client or the server, but is permitted to share its parameters.

Let a deep neural network $g_\Phi(f_\Theta(\cdot))$ be parameterized by $\Theta$ and $\Phi$, where $f(\cdot)$ and $g(\cdot)$ are the feature extractor and classifier respectively. In each round $r$ of task $t$, a central server $G$ coordinates the selected local clients $\{l\}_{l=1}^{L}$ to conduct local CIL training using their training samples $\{\mathcal{T}_l^t\}$. Each client $l$ optimizes its local parameters $(\Theta_l, \Phi_l)$, then sends its locally optimized parameters to the central server $G$ to be aggregated. The central server $G$ aggregates all received local parameters into optimum global parameters, i.e. $(\Theta_G, \Phi_G) = \mathrm{Agg}(\{(\Theta_l, \Phi_l)\}_{l=1}^{L})$, and communicates them back to all clients for the next round. The objective of FFSCIL is to achieve an optimum global model $g_{\Phi_G}(f_{\Theta_G}(\cdot))$ that recognizes the learned classes from the first task (task 0) until the current task (task $t$), i.e. $\{\mathcal{T}^0, \dots, \mathcal{T}^t\}$.

4 PROPOSED METHOD: UNIFIED OPTIMIZED PROTOTYPE PROMPT (UOPP)

We design our method, termed Unified Optimized Prototype Prompt (UOPP), to address the challenges in FFSCIL, i.e. catastrophic forgetting, over-fitting, and prototype bias, simultaneously under data privacy constraints. Figure 2 exhibits the flow of our method in both the local training view (a) and the federated training view (c). Looking at Figure 2 (a), we utilize a prompt-based approach on top of the frozen ViT backbone, as it minimizes task interference and thereby handles catastrophic forgetting better with lightweight learnable parameters (prompts). Then we add a rectification block to handle prototype bias in the few-shot tasks ($t > 0$) by iteratively rectifying the prototypes.


Note that each client only holds few samples for any class in the few-shot tasks. Then we add a unification block to unify the rectified prototypes and the features produced by the ViT encoder. Since the prototypes are shareable to and from the server, this mechanism handles the non-i.i.d. challenge, as a client $l$ is able to learn knowledge representations of the classes not available in $\mathcal{T}_l^t$. Last, we design an adaptive dual head that leverages the strengths of both MLP and prototype-based classifiers. In the federated view (c), the figure shows that in our method a client shares only small-sized parameters, i.e. prompts, prototypes, and MLP parameters, rather than the whole backbone. Thus, it minimizes the communication cost between clients and the central server. The details of our method are presented in the following subsections. Our method differs from the existing prototype-based methods, i.e. (Wang et al., 2024), (Yang et al., 2023), (Ahmed et al., 2024), (Goswami et al., 2024), and (Guo et al., 2024), in three ways. First, we utilize a trainable Neural ODE that works on support and query samples drawn from different distributions, while the existing methods utilize a similarity ratio for the rectification process. Second, alongside prototype rectification, we adjust task-wise key and prompt parameters to improve inter-task separability. Different from PILoRA (Guo et al., 2024), the prompt is prepended to $W_K x$ and $W_V x$ of the ViT model, while PILoRA adds the matrices $A \cdot B$ to $W_Q$ and $W_V$ and does not use task-wise keys. Third, we utilize dual-head classifiers, combined with task prediction, to leverage the strengths of both heads.

4.1 MINIMIZING TASK INTERFERENCE VIA PROMPT LEARNING

Prompt Structure: we define a prompt $P_l$ for each client $l$ to learn a sequence of tasks $\{t\}_{t=0}^{T}$ as $P_l = [(K_l^0, P_l^0), \dots, (K_l^t, P_l^t), \dots, (K_l^T, P_l^T)]$, $t \in [0..T]$, where $(K_l^t, P_l^t)$ is the prompt key-and-value pair corresponding to task $t$. Our method utilizes the prefix tuning technique (Li & Liang, 2021) because it usually outperforms the prompt tuning method. Following the definition of prefix tuning and the attention mechanism in ViT, the output of the ViT backbone given an input $x$ and prompt $(K_l^t, P_l^t)$ is defined as:

$$f_{(K_l^t, P_l^t)}(x) = A\big(Q_{i,j},\ [K_l^t \mathbin{+\!\!+} K_{i,j}],\ [P_l^t \mathbin{+\!\!+} V_{i,j}]\big) \quad (1)$$

where $Q_{i,j}$, $K_{i,j}$, $V_{i,j}$ are the query, key, and value of the $j$-th multi-head self-attention (MSA) head in the $i$-th layer of the ViT encoder, $\mathbin{+\!\!+}$ denotes concatenation, and $A$ denotes the attention function. Note that the function is applied for all MSA heads $j \in [1..J]$ and all encoder layers $i \in [1..I]$. At task $t$, a client $l$ optimizes only $(K_l^t, P_l^t)$ and not the other prompt key-value pairs, and $(K_l^t, P_l^t)$ is adjusted only in task $t$, not in the previous or upcoming tasks. Therefore, once optimal for $\mathcal{T}_l^t$, the pair $(K_l^t, P_l^t)$ remains robust against forgetting since its value is not adjusted afterward. In addition, $K_l^t$ is adjusted to match each sample $x_i \in \mathcal{T}_l^t$ by using a matching loss $L_m$ during training. This mechanism is designed to make $(K_l^t, P_l^t)$ exclusive to task $t$ and not to the samples from other tasks. In other words, it minimizes task interference. In addition, since only prompts rather than the whole network parameters are adjusted, this strategy alleviates the over-fitting problem caused by few samples.
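The prefix mechanism of equation 1 can be illustrated with a minimal single-head sketch in plain Python. It assumes standard scaled dot-product attention for $A$; the function names `attention` and `prefix_attention` are illustrative and are not taken from the released code. The prompt vectors are prepended to the keys and values only, so the number of output tokens is unchanged.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention for one head (assumed form of A).
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

def prefix_attention(queries, keys, values, prompt_keys, prompt_values):
    # Eq. 1: concatenate the task prompt in front of the keys and values;
    # the queries are left untouched.
    return attention(queries, prompt_keys + keys, prompt_values + values)

if __name__ == "__main__":
    out = prefix_attention(
        queries=[[1.0, 0.0]],
        keys=[[1.0, 0.0]],
        values=[[0.0, 2.0]],
        prompt_keys=[[1.0, 0.0]],
        prompt_values=[[2.0, 0.0]],
    )
    print(out)
```

In this toy call the prompt key and the token key score equally against the query, so the output is the average of the prompt value and the token value.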

4.2 UNIFIED STATIC-DYNAMIC PROTOTYPE AND ITS USABILITY

Static Prototype: we define a $D$-dimensional vector $\bar{z}_c$ as the prototype for class $c \in \mathcal{T}^t$ that is produced from the ViT encoder, where $D$ is the embedding dimension. A static prototype set $\bar{Z}_l = \{\bar{z}_{lc}\}$ is the collection of static prototypes of the classes $c$ that are available at client $l$. Assuming that a prototype follows a Gaussian distribution, i.e. $\bar{z}_c \sim N(\mu_c, \Sigma_c)$, and forms $D$ disjoint univariate distributions, a prototype of class $c$ is represented as $\bar{z}_c \sim N(\mu_c, \sigma_c^2)$, where $\sigma_c^2 = I_D \cdot \sigma_{c,i}^2$, $i \in \{1, 2, \dots, D\}$, and $I_D$ is the identity matrix. Note that at task $t$, a selected local client $l$ holds its local training set $\mathcal{T}_l^t$. Suppose that $\mathcal{T}_{lc}^t = \{(x_{li}^t, y_{li}^t) \in \mathcal{T}_l^t,\ y_{li}^t = c\}$ is the set of samples of class $c$ in $\mathcal{T}_l^t$ and $|\cdot|$ denotes the number of samples. Then, a static prototype $\bar{z}_{lc} \sim N(\mu_{lc}, \sigma_{lc}^2)$ for a class $c$ available in $\mathcal{T}_l^t$ is computed by Eq. 2 and 3.

$$\mu_{lc} = \frac{1}{|\mathcal{T}_{lc}^t|} \sum_{i=1}^{|\mathcal{T}_{lc}^t|} f_{(K_l^t, P_l^t)}(x_i), \quad x_i \in \mathcal{T}_{lc}^t \quad (2)$$

$$\sigma_{lc}^2 = \frac{1}{|\mathcal{T}_{lc}^t|} \sum_{i=1}^{|\mathcal{T}_{lc}^t|} \big(\mu_{lc} - f_{(K_l^t, P_l^t)}(x_i)\big)^2, \quad x_i \in \mathcal{T}_{lc}^t \quad (3)$$
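Equations 2 and 3 amount to a per-dimension mean and (biased) variance over the embeddings of one class. A minimal sketch follows, assuming the prompted ViT features have already been computed and are given as lists of floats; `static_prototype` is an illustrative name.

```python
def static_prototype(features):
    """Per-dimension mean (Eq. 2) and variance (Eq. 3) of one class.

    `features` is a list of D-dimensional embeddings, standing in for
    the prompted ViT outputs of the samples of that class.
    """
    n = len(features)
    dim = len(features[0])
    mu = [sum(f[j] for f in features) / n for j in range(dim)]
    var = [sum((mu[j] - f[j]) ** 2 for f in features) / n for j in range(dim)]
    return mu, var

if __name__ == "__main__":
    mu, var = static_prototype([[1.0, 2.0], [3.0, 6.0]])
    print(mu, var)
```

With only one to five samples per class, these statistics describe the observed samples rather than the population, which is the prototype bias that the rectification step targets.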

Dynamic Prototype: we define a $D$-dimensional vector $\hat{z}_c$ as the prototype for class $c \in \mathcal{T}^{t>0}$ that is iteratively rectified during local training on task $t$, where $D$ is the embedding dimension. A dynamic prototype set $\hat{Z}_l = \{\hat{z}_{lc}\}$ is the collection of dynamic prototypes of the classes $c$ available at client $l$. At the beginning of local training at task $t > 0$ on client $l$, a dynamic prototype $\hat{z}_{lc}$ is initialized with its corresponding static prototype value $\bar{z}_{lc}$, then iteratively rectified by GradNet $g_{\Psi_l}$ during local training, i.e. $\hat{z}_{lc} = \mathrm{Rectification}(g_{\Psi_l}, \bar{z}_{lc})$. After finishing the training of task $t$, the dynamic prototypes of all classes $c \in \mathcal{T}^t$ are not updated anymore and are stored as static prototypes. Therefore, the dynamic prototype set $\hat{Z}_l$ includes prototypes from currently learned classes only, while the static prototype set $\bar{Z}_l$ includes both currently learned classes and previously learned classes. The details of the rectification process are explained in the following sub-section.

Unified Prototype: we define the unified prototype set $Z_l = \bar{Z}_l \cup \hat{Z}_l$ as the union of the static and dynamic prototype sets of client $l$. Following the definitions of static and dynamic prototypes, $Z_l$ contains static prototypes of previously learned classes and dynamic prototypes of currently learned classes that are still updated during local training at client $l$. The unified prototype set $Z_l$ plays an important role in our method, both in the base task ($t = 0$) and in the few-shot tasks ($t > 0$). As illustrated by the unification block in Figure 2, in the base task, $Z_l$ is unified with the output of ViT, i.e. $f_{(K_l^t, P_l^t)}(x)$. Since $Z_l$ is shareable, it contains prototypes of classes unavailable in $\mathcal{T}_l^t$ that are shared by the central server. Therefore, each client $l$ can learn all classes in $\mathcal{T}^t$. In the few-shot tasks, each client $l$ rectifies $\hat{z}_{lc}$ for all classes $c \in \mathcal{T}_l^t$. Similarly, the shareable $Z_l$ enhances the separability of $\hat{z}_{lc}$ since it is contrasted with the prototypes of all other learned classes.

Figure 3: Visualization of (a) biased and optimal prototypes and (b) prototype rectification from step s = 0 to step s = M.

4.3 HANDLING PROTOTYPE-BIAS VIA DYNAMIC PROTOTYPE RECTIFICATION

Previous studies (Chen et al., 2018; Zhang et al., 2022) emphasize that generating prototypes by averaging few samples leads to prototype bias, as the prototypes only represent the observed samples and not the whole population, as illustrated in Figure 3(a). Therefore, our method rectifies a prototype $\hat{z}_c$ to become as close as possible to the population mean. Following the framework of episodic training in few-shot learning (Vinyals et al., 2016; Zhang et al., 2022), the rectification process of a prototype $\hat{z}_c$ can be defined as $\hat{z}_c(M) = \mathrm{Rectification}(g_\Psi(\cdot), \hat{z}_c(0), S, Q, s = M)$, where $s$ is the number of rectification steps, $\hat{z}_c(0)$ and $\hat{z}_c(M)$ are the initial prototype and the final prototype after the $M$-step rectification process, $S$ is the support set that contains few observed samples per class, $Q$ is the query set comprising unlabeled samples, and $g_\Psi(\cdot)$ is the gradient network (GradNet) parameterized by $\Psi$. In more detail, let $L(\hat{z}_c(s))$ be a differentiable loss function of the prototype $\hat{z}_c(s)$ and $\nabla L(\hat{z}_c(s))$ its gradient during the optimization process. One step of prototype rectification can then be defined as an iteration of the gradient descent algorithm, as described in equation 4. The symbol $\alpha$ denotes the learning rate, while $\omega$ denotes the $\ell_2$-norm regularizer.

$$\hat{z}_c(s + 1) = \hat{z}_c(s) - \alpha\big(\nabla L(\hat{z}_c(s)) + \omega\,\hat{z}_c(s)\big) \quad (4)$$
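As a worked illustration of equation 4, the sketch below applies the gradient-descent update to a toy prototype. The loss is an assumption made only for this example: a squared distance $L(z) = \frac{1}{2}\lVert z - z^\ast \rVert^2$ to a hypothetical target $z^\ast$, whose gradient is $z - z^\ast$. In the paper the gradient is produced by the learned GradNet rather than by a closed-form loss.

```python
def rectify_step(z, grad, alpha, omega):
    # One step of Eq. 4: z <- z - alpha * (grad + omega * z).
    return [zi - alpha * (gi + omega * zi) for zi, gi in zip(z, grad)]

def rectify(z0, target, alpha, omega, steps):
    # Toy loss (assumption): L(z) = 0.5 * ||z - target||^2, so grad = z - target.
    z = list(z0)
    for _ in range(steps):
        grad = [zi - ti for zi, ti in zip(z, target)]
        z = rectify_step(z, grad, alpha, omega)
    return z

if __name__ == "__main__":
    biased = [2.0, 0.0]
    population_mean = [1.0, 1.0]  # hypothetical target for illustration
    print(rectify(biased, population_mean, alpha=0.5, omega=0.0, steps=50))
```

With $\omega = 0$ the iterates converge to the target; a positive $\omega$ shrinks the fixed point toward the origin, which is the usual effect of an $\ell_2$ regularizer.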

The previous study (Chen et al., 2018) discovered that the iterative process above can be viewed as the Euler discretization of an Ordinary Differential Equation (ODE). Therefore, the gradient term in the process above can be expressed as equation 5, where $s$ represents a continuous variable such as time, and $\frac{d\hat{z}_c(s)}{ds}$ represents the continuous gradient flow of the prototype $\hat{z}_c(s)$ over $s$. Since the optimization process is executed by a neural network model $g_\Psi(\cdot)$ (the last part of equation 5), the ODE becomes a Neural ODE.

$$-\nabla L(\hat{z}_c(s)) = \frac{d\hat{z}_c(s)}{ds} = g_{\Psi_l}\big(\hat{z}_c(s), S, Q, s\big) \quad (5)$$

We follow the implementation of GradNet (Zhang et al., 2022) as the neural network model $g_\Psi(\cdot)$ that executes the rectification process. Following the implementation of GradNet, the optimum prototype is produced in the last step of rectification, i.e. $\hat{z}_c(M) = \hat{z}_c(0) + \int_{s=0}^{M} g_{\Psi_l}\big(\hat{z}_c(s), S, Q, s\big)\,ds$. We follow the implementation of the previous studies (Chen et al., 2018; Zhang et al., 2022), where the last term is solved by an ODESolver based on the Runge-Kutta method (Alexander, 1990). Therefore, the optimum prototype $\hat{z}_c(M)$ can be obtained by executing equation 6.

$$\hat{z}_c(M) = \mathrm{ODESolver}\big(g_{\Psi_l}, \hat{z}_c(0), S, Q, s = M\big) \quad (6)$$
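To make the role of the ODE solver in equation 6 concrete, the sketch below integrates a toy gradient field with a forward Euler solver and a classical fourth-order Runge-Kutta (RK4) solver. The field $g(z, s) = z^\ast - z$ is a hand-written stand-in for the learned GradNet and ignores $S$ and $Q$; it is chosen because its exact solution $z(M) = z^\ast + (z(0) - z^\ast)e^{-M}$ is known, so the two solvers can be compared. The exact Runge-Kutta variant used by the authors is not specified here.

```python
import math

def euler_solve(field, z0, total, steps):
    # Forward Euler: each step has the same form as the update in Eq. 4.
    h = total / steps
    z, s = list(z0), 0.0
    for _ in range(steps):
        k = field(z, s)
        z = [zi + h * ki for zi, ki in zip(z, k)]
        s += h
    return z

def rk4_solve(field, z0, total, steps):
    # Classical fourth-order Runge-Kutta integration of dz/ds = field(z, s).
    h = total / steps
    z, s = list(z0), 0.0
    for _ in range(steps):
        k1 = field(z, s)
        k2 = field([zi + 0.5 * h * k for zi, k in zip(z, k1)], s + 0.5 * h)
        k3 = field([zi + 0.5 * h * k for zi, k in zip(z, k2)], s + 0.5 * h)
        k4 = field([zi + h * k for zi, k in zip(z, k3)], s + h)
        z = [
            zi + (h / 6.0) * (a + 2.0 * b + 2.0 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)
        ]
        s += h
    return z

def toy_field(target):
    # Stand-in for GradNet (assumption): pulls the prototype toward `target`.
    return lambda z, s: [t - zi for zi, t in zip(z, target)]

if __name__ == "__main__":
    field = toy_field([1.0])
    exact = 1.0 - math.exp(-1.0)
    print(euler_solve(field, [0.0], 1.0, 10)[0], rk4_solve(field, [0.0], 1.0, 10)[0], exact)
```

For the same number of steps, the RK4 result is much closer to the exact solution than the Euler result, which is one motivation for using a higher-order solver instead of plain gradient-descent steps.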

In previous studies (Chen et al., 2018; Zhang et al., 2022), $Q$ contains images from the base task. Under the limitations of FFSCIL, we cannot afford to save exemplars from previous tasks or gather images from other clients. However, we have a shareable unified prototype set $Z_l = \bar{Z}_l \cup \hat{Z}_l$ containing the prototypes of all learned classes. Therefore, the uniqueness of our rectification is that we construct the support set $S$ by drawing prototypes from $N(\mu_{lc}, \sigma_{lc}^2)$, where $(\mu_{lc}, \sigma_{lc}^2)$ is computed by eq. 2-3 for all classes $c$ available in $\mathcal{T}_l^t$, and generate the query set $Q$ by drawing prototypes from $N(\hat{\mu}_l, \hat{\sigma}_l^2)$, where $(\hat{\mu}_l, \hat{\sigma}_l^2)$ are the parameters of $\hat{z}_l \in \hat{Z}_l$. As a consequence, our rectification occurs fully in the embedding space and is free of pseudo-rehearsal. Note that the unified prototype set $Z_l$ is assigned the aggregated prototypes, i.e. $Z_l = Z_G$, in each federated round.
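The construction of $S$ and $Q$ in the embedding space can be sketched as Gaussian sampling from per-class statistics. The sketch assumes diagonal Gaussians given as `(mean, variance)` pairs per class; the function names and the number of draws are hypothetical and are not taken from the paper.

```python
import math
import random

def sample_prototypes(mu, var, n, rng):
    # Draw n vectors from a diagonal Gaussian N(mu, var).
    return [
        [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]
        for _ in range(n)
    ]

def build_support_and_query(static_stats, dynamic_stats, n_support, n_query, seed=0):
    """Build S from static statistics and Q from dynamic statistics.

    Both arguments map a class label to a (mean, variance) pair.
    S keeps its labels; Q is returned as a flat, unlabeled list.
    """
    rng = random.Random(seed)
    support = {
        c: sample_prototypes(mu, var, n_support, rng)
        for c, (mu, var) in static_stats.items()
    }
    query = [
        vec
        for mu, var in dynamic_stats.values()
        for vec in sample_prototypes(mu, var, n_query, rng)
    ]
    return support, query

if __name__ == "__main__":
    stats = {"dog": ([0.0, 1.0], [0.1, 0.1]), "cat": ([1.0, 0.0], [0.2, 0.2])}
    S, Q = build_support_and_query(stats, stats, n_support=5, n_query=3)
    print(len(S["dog"]), len(Q))
```

Because only means and variances are needed, no image or exemplar has to be stored or exchanged.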

4.4 ADAPTIVE DUAL-HEAD CLASSIFIER FOR A BETTER PREDICTION

We design an adaptive dual-head classifier by leveraging an MLP classifier $g_\Phi(\cdot)$, which works effectively in the richly labeled task ($\mathcal{T}^0$), and a prototype-based (PB) classifier $g_Z(\cdot)$, which has been proven robust in few-shot tasks ($\mathcal{T}^{t>0}$). We combine both MLP and PB classifiers as one united head layer. The MLP parameter $\Phi$ is optimized in the base task, while the unified prototype set $Z$ is optimized in all few-shot tasks. In the testing phase, we deploy a head selector to select which classifier to use for an input $x$. The head selector works by predicting the task ID based on input-key matching. Given an input image $x$ and a sequence of prompt keys $K^0, K^1, \dots, K^T$, we find the task ID to which $x$ belongs by finding the highest similarity between $x$ and $K^t$, $t \in [0..T]$. The predicted label $\hat{y}$ of an input $x$ is computed by equation 7.

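$$\hat{y} = \begin{cases} g_{\Phi}\big(f_{(K^0, P^0)}(x)\big), & \text{if } \hat{t} = 0 \\ g_{Z}\big(f_{(K^{\hat{t}}, P^{\hat{t}})}(x)\big), & \text{otherwise} \end{cases} \quad \text{where } \hat{t} = \arg\min_{t} L_m(x, K^t) \quad (7)$$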

where $L_m$ is the matching loss between the input $x$ and the prompt key $K^t$. Note that equation 7 is applicable on both the client and the server side. The united classifier is expected to improve the prediction result over employing either the MLP or the PB classifier alone. This hypothesis is examined in our ablation study.
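The head selection of equation 7 can be sketched as follows. Two simplifications are assumed for illustration: the matching loss is taken to be one minus the cosine similarity between a feature vector and the prompt key, and the same feature vector is used both for key matching and for classification. `mlp_head` is any callable standing in for $g_\Phi$, and all names are illustrative.

```python
import math

def cosine(a, b):
    # Cosine similarity with a guard against zero-norm vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def predict_task(feature, keys):
    # t_hat = argmin_t L_m(x, K^t), with L_m assumed to be 1 - cosine.
    losses = [1.0 - cosine(feature, k) for k in keys]
    return losses.index(min(losses))

def dual_head_predict(feature, keys, mlp_head, prototypes):
    # Eq. 7: MLP head for the base task, prototype-based head otherwise.
    t_hat = predict_task(feature, keys)
    if t_hat == 0:
        return mlp_head(feature)
    return max(prototypes, key=lambda c: cosine(feature, prototypes[c]))

if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]              # K^0 and K^1
    prototypes = {"lion": [0.0, 1.0], "frog": [-1.0, 0.2]}
    mlp_head = lambda f: "base-class"            # placeholder for g_Phi
    print(dual_head_predict([0.9, 0.1], keys, mlp_head, prototypes))
    print(dual_head_predict([0.1, 0.9], keys, mlp_head, prototypes))
```

The first input matches the base-task key and is routed to the MLP head; the second matches a few-shot key and is assigned to its nearest prototype.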

4.5 FEDERATED LEARNING AND SERVER-SIDE AGGREGATION

Figure 2 shows that on a task $t$, each client $l$ optimizes its local parameters based on its local training data $\mathcal{T}_l^t$. Then the optimized parameters, i.e. $((K_l^t, P_l^t), \Phi_l, Z_l)$, are sent to the central server $G$ to be aggregated. We propose a simple weighted aggregation applying the principle "the more you learn, the better your knowledge". We consider the total participation of a client $S_l$ on the task $t$ as the client's weight $w_l$. In a round $r$ of a task $t$, given clients that carry locally optimized parameters $\{((K_l^t, P_l^t), \Phi_l, Z_l)\}_{l=1}^{L}$ and their weights $\{w_l\}_{l=1}^{L}$, the aggregated parameter $((K_G^t, P_G^t), \Phi_G, Z_G)$ is computed using equation 8.

$$\big((K_G^t, P_G^t), \Phi_G, Z_G\big) = \frac{1}{\sum_{l=1}^{L} w_l} \sum_{l=1}^{L} \big((K_l^t, P_l^t), \Phi_l, Z_l\big) \cdot w_l \quad (8)$$

In the base task ($t = 0$), the server aggregates and distributes the prompt parameters $(K_G^t, P_G^t)$, the MLP head parameter $\Phi_G$, and the unified prototype set $Z_G^t$ to all participating clients, while in the few-shot tasks ($t > 0$), the server aggregates and distributes the prompt parameters and the unified prototype set only. The GradNet parameter $\Psi_l$ is unshared and utilized only for local prototype rectification. Instead, a client sends the rectified dynamic prototypes $\hat{Z}_l \subset Z_l$ to the central server.
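The weighted aggregation of equation 8 is a weighted average of the clients' parameters. The sketch below treats each client's shareable parameters as one flat list of floats, which is a simplification: in practice the prompts, MLP head, and prototypes would be aggregated tensor by tensor, and prototypes only over the classes a client holds.

```python
def aggregate(client_params, weights):
    """Weighted average of client parameter vectors (Eq. 8).

    `client_params` is a list of equally sized float lists and
    `weights` holds the participation-based weight w_l of each client.
    """
    total = sum(weights)
    dim = len(client_params[0])
    return [
        sum(p[j] * w for p, w in zip(client_params, weights)) / total
        for j in range(dim)
    ]

if __name__ == "__main__":
    params = [[1.0, 2.0], [3.0, 6.0]]
    weights = [1, 3]  # the second client has participated three times as often
    print(aggregate(params, weights))
```

A client with a larger participation count pulls the global parameters closer to its own, which reflects the principle stated above.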

4.6 WRAP UP AND FINAL OBJECTIVE

In the base task, each client $l$ unifies $f_{(K_l^0, P_l^0)}(x)$ with the unified prototype set $Z_l$, then updates the prompt and MLP head parameters, i.e. $(K_l^0, P_l^0)$ and $\Phi_l$. In the few-shot tasks, client $l$ computes $f_{(K_l^t, P_l^t)}(x)$ to form the static prototypes, then constructs the support samples ($S$) and query samples ($Q$), and performs rectification for all $\hat{z}_{lc} \in \hat{Z}_l$. Afterward, it updates the prompt parameters $(K_l^t, P_l^t)$ and the GradNet parameter $\Psi_l$. Therefore, the objectives of local training are defined in the equations below:

(i). Base Task ($t = 0$) objective:

$$L_{t=0} = L_{ce}\big(g_{\Phi_l}(f_{(K_l^0, P_l^0)}(x_i) \cup Z_l),\ y_i \cup C_l\big) + \lambda L_m(x_i, K_l^0), \quad (x_i, y_i) \in \mathcal{T}_l^0 \quad (9)$$

(ii). Few Shot Tasks ($t \geq 1$) objective:

$$L_{t>0} = L_{pce}\big(g_{Z_l}(g_{\Psi_l}(\hat{Z}_{lc}(s), S, Q, s)),\ C_l\big) + \lambda L_m(x_i, K_l^t), \quad (x_i, y_i) \in \mathcal{T}_l^t \quad (10)$$

where $L_{ce}$ denotes the cross-entropy loss, $L_{pce}$ denotes the prototype cross-entropy loss utilizing cosine distance as the similarity measurement, $L_m$ denotes the matching loss, $C_l$ is the label set of the unified prototype set $Z_l$, $f_{(K_l^t, P_l^t)}$ denotes the prompting output as in eq. 1, and $g_{\Psi_l}$ denotes the rectification function as in eq. 5. The detailed process of our method is presented in Appendix A.
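One plausible reading of the prototype cross-entropy term is a softmax cross-entropy over cosine similarities between an embedding and every prototype in the unified set. The sketch below follows that reading; the `scale` factor, the form of the matching loss (one minus cosine similarity), and all function names are assumptions for illustration rather than details confirmed by the text.

```python
import math

def cosine(a, b):
    # Cosine similarity with a guard against zero-norm vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def prototype_cross_entropy(embedding, prototypes, label, scale=1.0):
    # Cross-entropy of softmax(scale * cosine(embedding, prototype_c)).
    logits = {c: scale * cosine(embedding, p) for c, p in prototypes.items()}
    m = max(logits.values())
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_sum_exp - logits[label]

def few_shot_objective(embedding, prototypes, label, feature, key, lam):
    # Sketch of the few-shot objective: L_pce + lambda * L_m.
    matching = 1.0 - cosine(feature, key)
    return prototype_cross_entropy(embedding, prototypes, label) + lam * matching

if __name__ == "__main__":
    protos = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
    print(prototype_cross_entropy([1.0, 0.0], protos, "a"))
    print(prototype_cross_entropy([1.0, 0.0], protos, "b"))
```

The loss is small when the embedding is closest to the prototype of its own class and grows as it moves toward other prototypes, which is what drives the separability among all learned classes.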


5 THEORETICAL ANALYSIS

Let $\Theta = (P, \Phi, \Psi)$ be the trainable parameters and $F(\Theta) = E[L(\mathcal{T}; \Theta)] = E[L(\mathcal{T}; (P, \Phi, \Psi))]$ the expected loss function, and let $k$, $E$, $R$, and $L_S$ denote the local iteration, local epoch, global round, and number of selected local clients respectively. We follow the assumptions of an $L$-smooth and $\mu$-strongly convex $F$, uniformly $G$-bounded gradients, uniformly random batches, and a decreasing learning rate as in (Li et al., 2019) and (Bottou et al., 2018). We state Theorems 1, 2, and 3 in Appendix B. Theorems 1 and 2 establish the convergence of UOPP local training and federated training respectively, while Theorem 3 establishes the generalization of UOPP.

6 EXPERIMENTAL RESULTS AND ANALYSIS

6.1 EXPERIMENTAL SETTING

Datasets: our experiments use three benchmarks, i.e. split CIFAR100, split MiniImageNet, and split CUB200. The CIFAR100 and MiniImageNet datasets contain 100 classes, while CUB200 is a dataset of 200 classes. We follow the settings from (Tao et al., 2020a) for the task and classes-per-task split, and (Dong et al., 2023) for the federated setting. For CIFAR100 and MiniImageNet, we split the dataset into 9 tasks, i.e. 60 classes for the base task ($t = 0$) and 5 classes for each few-shot task ($t \geq 1$). We split the CUB200 dataset into 11 tasks, i.e. 100 classes for the base task and 10 classes for each few-shot task. Few-shot tasks are measured in 5-shot and 1-shot settings.
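The task counts follow directly from the split sizes: 60 base classes plus 8 few-shot tasks of 5 classes cover 100 classes in 9 tasks, and 100 base classes plus 10 few-shot tasks of 10 classes cover 200 classes in 11 tasks. A minimal sketch, assuming classes are assigned to tasks in index order (the actual class ordering is not stated here):

```python
def split_tasks(num_classes, base_classes, classes_per_fs_task):
    # Task 0 is the base task; the remaining classes are cut into
    # equally sized few-shot tasks.
    tasks = [list(range(base_classes))]
    for start in range(base_classes, num_classes, classes_per_fs_task):
        end = min(start + classes_per_fs_task, num_classes)
        tasks.append(list(range(start, end)))
    return tasks

if __name__ == "__main__":
    cifar_tasks = split_tasks(100, 60, 5)
    cub_tasks = split_tasks(200, 100, 10)
    print(len(cifar_tasks), len(cub_tasks))
```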

Benchmark Algorithms: UOPP is compared with 9 benchmark algorithms, i.e. LGA (Dong et al., 2023), TARGET (Zhang et al., 2023b), LANDER (Tran et al., 2024), and Fed-CPrompt (Bagwe et al., 2023), which represent the SOTA in federated class incremental learning; Fed-S3C (Kalla & Biswas, 2022), which represents the SOTA in few-shot class incremental learning; and Fed-L2P (Wang et al., 2022b), Fed-DualP (Wang et al., 2022a), Fed-CODAP (Smith et al., 2023), and PILoRA (Guo et al., 2024), which represent the SOTA in FCIL. Except for Fed-CPrompt, the prefix Fed- denotes that the method is adapted to a federated manner from its original (stand-alone) mode by using FedAvg as the aggregation function. We only run LANDER and PILoRA on CIFAR100, since their official code, settings, and pre-trained embeddings that can be executed in our setting are only available for the CIFAR100 dataset. We also evaluated our method in standalone FSCIL and compared it to FSCIL SOTAs, i.e. TEEN (Wang et al., 2024), NC-FSCIL (Yang et al., 2023), OrCo (Ahmed et al., 2024), and PriViLege (Park et al., 2024). Please see Appendix D.1.

Details and Metrics: our numerical study is executed on a single NVIDIA A100 GPU with 40 GB memory across 3 different random seeds. Adapted from (Dong et al., 2023), the simulation is run with 20 total clients and 1 global server, where in each round 6 (30%) local clients are selected randomly. Each client randomly receives 60% ($\eta = 0.6$) of the classes. The total number of global rounds is set to 90 (10 rounds/task) for CIFAR100 and MiniImageNet and 110 for CUB200. Our task split setting is different from that of the recent study (Jiang et al., 2024), since it follows the FCIL setting, while our setting follows the FSCIL setting. We evaluate the consolidated algorithms on all learned classes with the accuracy (Acc.) and performance drop (PD) metrics. Besides, we measure the accuracy of base classes, the accuracy of novel classes, and the harmonic mean accuracy that represents stability-plasticity performance. Please see Appendix F for the detailed experiment settings, hyperparameters, and metrics.
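The simulation setting can be sketched as follows: each of the 20 clients receives a random 60% of the task's classes, and 6 clients are drawn per round. The harmonic mean is written in its usual FSCIL form, $2ab/(a+b)$ for base accuracy $a$ and novel accuracy $b$; this form and the helper names are assumptions, since the exact definitions are deferred to Appendix F.

```python
import random

def assign_classes(task_classes, num_clients, eta, rng):
    # Each client receives a random fraction eta of the task's classes.
    k = max(1, round(eta * len(task_classes)))
    return [sorted(rng.sample(task_classes, k)) for _ in range(num_clients)]

def select_clients(num_clients, num_selected, rng):
    # Clients taking part in one federated round, drawn without replacement.
    return sorted(rng.sample(range(num_clients), num_selected))

def harmonic_mean(base_acc, novel_acc):
    # Assumed form: harmonic mean of base-class and novel-class accuracy.
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)

if __name__ == "__main__":
    rng = random.Random(0)
    base_task = list(range(60))
    client_classes = assign_classes(base_task, num_clients=20, eta=0.6, rng=rng)
    for r in range(1, 4):
        print(r, select_clients(20, 6, rng))
    print(len(client_classes[0]), harmonic_mean(90.0, 60.0))
```

The harmonic mean penalizes a model that keeps base accuracy high while failing on novel classes, which is why it is used as the stability-plasticity indicator.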

6.2 MAIN RESULTS

a) General Performance: the numerical results of the consolidated algorithms are shown in Table 1. The proposed method (UOPP) achieves the highest accuracy, with a significant gap of 5 to 76% over the competitor methods in both the 5-shot and 1-shot settings. Fed-S3C, TARGET, LGA, and LANDER achieve relatively low performance, with a 30 to 76% gap to UOPP on the 3 benchmark datasets in both the 5-shot and 1-shot settings. The results confirm that the FCIL and FSCIL methods cannot be applied directly to FFSCIL. Meanwhile, the prompt-based methods, i.e. Fed-L2P, Fed-DualP, Fed-CODAP, and Fed-CPrompt, achieve relatively better performance than those 4 methods. Compared to UOPP, these methods have a relatively smaller gap, i.e. 5 to 36%. The results indicate that prompt-based methods are more promising than the SOTAs of FCIL and FSCIL. The proposed method also achieves the lowest performance drop (0.7 to 19%), followed by Fed-L2P with (0.4 to 22%) PD and Fed-DualP with (-0.4 to 28%) PD. Despite utilizing ViT and a prototype approach, PILoRA achieves lower performance than the prompt-based methods, i.e. < 52% on average. Looking at per-task performance, Figure 4 shows that UOPP achieves the highest accuracy in all tasks in those


Table 1: Numerical results of the consolidated algorithms on the CIFAR100, MiniImageNet, and CUB200 datasets in the 5-shot and 1-shot settings across 3 different seeded runs.

(a) Complete numerical results on the CIFAR100 dataset with the 5-shot setting. Columns 0 to 8 report the accuracy in each session (%).

| Method | Trainable Params. | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | CNN | 44.51 | 48.97 | 47.77 | 45.35 | 43.48 | 41.47 | 40.33 | 39.32 | 37.71 | 43.21 | 6.80 | 46.80 |
| TARGET | CNN | 68.90 | 63.61 | 59.06 | 55.12 | 51.68 | 48.64 | 45.94 | 43.52 | 41.34 | 53.09 | 27.56 | 36.92 |
| LGA | CNN | 73.76 | 69.80 | 65.59 | 60.26 | 56.87 | 52.94 | 50.66 | 47.69 | 44.89 | 58.05 | 28.87 | 31.96 |
| LANDER | CNN | 58.60 | 61.75 | 56.26 | 52.11 | 47.71 | 44.71 | 41.69 | 40.28 | 38.87 | 49.11 | 19.73 | 40.90 |
| Fed-L2P | Prompt | 73.47 | 74.20 | 73.37 | 71.88 | 70.85 | 70.72 | 69.28 | 68.66 | 68.37 | 71.20 | 5.10 | 18.81 |
| Fed-DualP | Prompt | 76.39 | 82.75 | 83.37 | 80.80 | 79.93 | 78.26 | 77.73 | 76.98 | 77.11 | 79.26 | -0.72 | 10.75 |
| Fed-CODAP | Prompt | 81.73 | 69.29 | 70.81 | 68.67 | 67.17 | 66.14 | 64.32 | 64.79 | 64.12 | 68.56 | 17.62 | 21.45 |
| Fed-CPrompt | Prompt | 88.00 | 64.63 | 69.30 | 67.39 | 63.39 | 62.33 | 61.11 | 59.78 | 59.00 | 66.10 | 29.00 | 23.91 |
| PILoRA | Params (A,B) | 67.40 | 62.22 | 57.77 | 53.92 | 50.55 | 47.58 | 44.93 | 42.57 | 40.44 | 51.93 | 26.96 | 38.08 |
| UOPP (Ours) | Prompt | 90.57 | 90.58 | 90.85 | 90.96 | 91.23 | 91.51 | 91.56 | 91.74 | 81.05 | 90.01 | 9.52 | 0.00 |

(b) Summarized numerical results on the CIFAR100 dataset with the 1-shot setting, and on the MiniImageNet and CUB200 datasets with the 5-shot and 1-shot settings.

| Method | CIFAR100 1-shot Avg | PD | Gap | MiniImageNet 5-shot Avg | PD | Gap | MiniImageNet 1-shot Avg | PD | Gap | CUB200 5-shot Avg | PD | Gap | CUB200 1-shot Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | 41.97 | 9.07 | 46.65 | 29.73 | 5.48 | 63.19 | 29.16 | 8.24 | 63.05 | 14.91 | 7.22 | 65.89 | 14.40 | 7.85 | 62.33 |
| TARGET | 53.09 | 27.56 | 35.53 | 44.77 | 23.24 | 48.15 | 44.77 | 23.24 | 47.44 | 21.47 | 16.77 | 59.33 | 20.70 | 14.17 | 56.03 |
| LGA | 58.29 | 24.80 | 30.33 | 35.07 | 30.51 | 57.85 | 31.65 | 28.44 | 60.56 | 16.92 | 14.16 | 63.88 | 10.02 | 17.86 | 66.71 |
| LANDER | 48.43 | 19.90 | 40.19 | - | - | - | - | - | - | - | - | - | - | - | - |
| Fed-L2P | 74.29 | 3.75 | 14.33 | 78.92 | 1.01 | 14.00 | 80.80 | 0.45 | 11.41 | 58.26 | 22.67 | 22.54 | 57.12 | 24.34 | 19.61 |
| Fed-DualP | 81.95 | -0.87 | 6.67 | 85.91 | -0.45 | 7.01 | 86.88 | 0.67 | 5.33 | 62.89 | 28.21 | 17.91 | 62.41 | 26.94 | 14.32 |
| Fed-CODAP | 68.72 | 21.62 | 19.90 | 80.11 | 15.12 | 12.81 | 80.13 | 15.70 | 12.08 | 37.55 | 42.26 | 43.25 | 39.80 | 45.68 | 36.93 |
| Fed-CPrompt | 73.82 | 24.69 | 14.80 | 88.77 | 8.29 | 4.15 | 86.34 | 12.38 | 5.87 | 61.23 | 37.33 | 19.57 | 58.26 | 36.86 | 18.47 |
| PILoRA | 51.78 | 26.88 | 36.84 | - | - | - | - | - | - | - | - | - | - | - | - |
| UOPP (Ours) | 88.62 | 5.83 | 0.00 | 92.92 | 0.73 | 0.00 | 92.21 | 2.18 | 0.00 | 80.80 | 10.90 | 0.00 | 76.73 | 19.54 | 0.00 |

Figure 4: Visualization of the performance of consolidated algorithms in MiniImageNet, CIFAR100 and CUB200. (Six panels: CIFAR100, MiniImageNet, and CUB200, each in the 5-shot and 1-shot settings; x-axis: session; y-axis: accuracy (%); legend: Fed-S3C, TARGET, LGA, LANDER, Fed-L2P, Fed-DualP, Fed-CODAP, Fed-CPrompt, PILoRA, UOPP.)

three datasets. In the first task (base task), the proposed method achieves higher performance than the competitor methods with a small gap, i.e. 1–2%. However, as the number of tasks increases, the gap grows, e.g. 3% in task-1, 4% in task-2, and 6% in task-9 (last task). This shows that the proposed method handles catastrophic forgetting better than the competitor methods. Please see our extended analysis on different k-shot settings and standalone FSCIL in Appendix D. The complete numerical results for all datasets are presented in Appendix H.

Table 2: Harmonic mean accuracy of the consolidated algorithms on the CIFAR100 dataset with the 5-shot setting. Columns 1 to 8 report the harmonic mean accuracy by session (%).

| Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | 49.3 | 42.8 | 38.0 | 35.4 | 33.9 | 33.4 | 32.9 | 32.1 | 37.2 | 17.3 | 53.3 |
| TARGET | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 90.5 |
| LGA | 44.7 | 31.5 | 19.3 | 16.2 | 12.5 | 12.3 | 9.9 | 7.8 | 19.3 | 36.9 | 71.2 |
| LANDER | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 90.5 |
| Fed-L2P | 46.0 | 50.2 | 49.2 | 50.0 | 53.5 | 54.8 | 56.1 | 57.7 | 52.2 | -11.7 | 38.3 |
| Fed-DualP | 67.4 | 71.4 | 66.8 | 68.4 | 68.2 | 69.3 | 70.1 | 71.8 | 69.2 | -4.5 | 21.3 |
| Fed-CODAP | 69.8 | 69.3 | 66.1 | 65.2 | 63.3 | 61.8 | 62.0 | 61.3 | 64.9 | 8.5 | 25.6 |
| Fed-CPrompt | 73.3 | 74.4 | 67.6 | 63.5 | 62.3 | 60.4 | 59.4 | 58.9 | 65.0 | 14.3 | 25.5 |
| PILoRA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 90.5 |
| UOPP | 90.6 | 91.6 | 91.6 | 91.9 | 92.1 | 92.0 | 92.1 | 81.8 | 90.5 | 8.8 | 0.0 |

b) Stability-Plasticity Analysis: We evaluate the stability-plasticity performance of UOPP through the harmonic mean accuracy on each few-shot task. Table 2 shows the harmonic mean accuracy of the consolidated methods on CIFAR100 in the 5-shot setting, while Figure 5 visualizes the accuracy for base classes, novel classes, and the harmonic mean accuracy. Both show that UOPP achieves the best harmonic mean accuracy, with a 15% gap on each task and 18–58% on average over all tasks. The results indicate that UOPP handles the stability-plasticity dilemma better than its competitors. Looking at base-class and novel-class performance, Figure 5 shows that UOPP achieves the highest base-class and novel-class accuracy consistently through all tasks. Fed-L2P shows increasing base-class and novel-class accuracy, but its performance is still below that of UOPP by a significant gap. Fed-DualP, Fed-CODAP, and Fed-CPrompt have relatively stable base-class accuracy but decreasing novel-class accuracy. TARGET,


Figure 5: Visualization of the performance of consolidated methods for base classes and novel classes in CIFAR100 Dataset. (Six panels: base-class accuracy, novel-class accuracy, and harmonic mean accuracy, each in the 5-shot and 1-shot settings; x-axis: session; y-axis: accuracy (%); legend: Fed-S3C, TARGET, LGA, LANDER, Fed-L2P, Fed-DualP, Fed-CODAP, Fed-CPrompt, PILoRA, UOPP.)

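Table 3: Summary of the numerical results of the consolidated algorithms on the CIFAR100 dataset with variation of the non-i.i.d. level (a), the number of selected local clients (b), and the total number of global rounds (c).

(a) Non-i.i.d. level (η)

| Method | η=0.6 (60%) Avg | PD | η=0.4 (40%) Avg | PD | η=0.2 (20%) Avg | PD |
|---|---|---|---|---|---|---|
| Fed-S3C | 43.2 | 6.8 | 34.2 | 0.5 | 18.9 | 1.4 |
| TARGET | 53.1 | 27.6 | 41.3 | 27.5 | 23.2 | 13.5 |
| LGA | 58.1 | 28.9 | 56.4 | 29.5 | 56.4 | 29.7 |
| LANDER | 49.1 | 19.7 | 37.6 | 18.9 | 17.5 | 8.9 |
| Fed-DualP | 79.3 | -0.7 | 60.2 | -41.9 | 6.6 | -21.5 |
| Fed-CPrompt | 66.1 | 29.0 | 66.2 | 28.1 | 66.2 | 28.1 |
| PILoRA | 51.9 | 27.0 | 43.4 | 22.5 | 35.3 | 18.3 |
| UOPP | 90.0 | 9.5 | 86.3 | 4.3 | 68.8 | 29.4 |

(b) Selected local clients (L)

| Method | L=4 (20%) Avg | PD | L=6 (30%) Avg | PD | L=8 (40%) Avg | PD |
|---|---|---|---|---|---|---|
| Fed-S3C | 43.9 | 4.2 | 43.2 | 6.8 | 42.7 | 6.2 |
| TARGET | 51.4 | 26.7 | 53.1 | 27.6 | 56.7 | 29.4 |
| LGA | 56.3 | 30.4 | 58.1 | 28.9 | 57.9 | 29.1 |
| LANDER | 47.4 | 22.3 | 49.1 | 19.7 | 51.6 | 20.8 |
| Fed-DualP | 76.2 | -11.2 | 79.3 | -0.7 | 79.2 | 11.1 |
| Fed-CPrompt | 62.5 | 36.9 | 66.1 | 29.0 | 59.6 | 38.1 |
| PILoRA | 50.6 | 26.3 | 51.9 | 27.0 | 51.1 | 26.5 |
| UOPP | 89.1 | 4.2 | 90.0 | 9.5 | 91.3 | 0.0 |

(c) Total global rounds (R)

| Method | R=54 (6 r/task) Avg | PD | R=72 (8 r/task) Avg | PD | R=90 (10 r/task) Avg | PD |
|---|---|---|---|---|---|---|
| Fed-S3C | 43.4 | 5.3 | 47.4 | 10.1 | 42.6 | 6.5 |
| TARGET | 44.4 | 23.0 | 51.8 | 26.9 | 53.5 | 27.8 |
| LGA | 57.1 | 22.1 | 55.2 | 26.3 | 58.0 | 29.3 |
| LANDER | 37.9 | 31.4 | 47.2 | 26.0 | 49.1 | 19.7 |
| Fed-DualP | 81.5 | -0.9 | 80.2 | 1.8 | 79.3 | -0.7 |
| Fed-CPrompt | 62.5 | 36.9 | 59.6 | 38.1 | 66.1 | 29.0 |
| PILoRA | 55.9 | 29.0 | 53.9 | 28.0 | 51.9 | 27.0 |
| UOPP | 90.7 | -1.5 | 90.8 | 0.2 | 90.0 | 9.5 |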

LANDER, and PILoRA cannot achieve plasticity since their novel-class accuracy is close to 0. Fed-S3C keeps stability and plasticity relatively balanced, but the performance of both components is low (< 40%). The complete numerical results are presented in Appendix I.
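The zero rows of the harmonic mean results follow directly from how a harmonic mean behaves. The sketch below assumes the harmonic mean commonly used in FSCIL, 2·A_base·A_novel / (A_base + A_novel); the exact definition used here is given in Appendix F, and the input accuracies in the example are hypothetical.

```python
def harmonic_mean_acc(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (both in %).

    Assumed definition: 2 * base * novel / (base + novel), with 0 returned
    when both accuracies are 0 to avoid a division by zero.
    """
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# A method that keeps base classes but never learns novel ones scores 0,
# no matter how high its base-class accuracy is (hypothetical numbers).
print(harmonic_mean_acc(70.0, 0.0))
# A balanced method is rewarded: the score equals the shared accuracy.
print(harmonic_mean_acc(90.0, 90.0))
# An unbalanced method is pulled toward its weaker side.
print(harmonic_mean_acc(80.0, 60.0))
```

Because the score collapses whenever either term is near zero, a novel-class accuracy close to 0 is sufficient to explain a harmonic mean of 0.0 regardless of base-class accuracy.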

6.3 ROBUSTNESS, ABLATION, AND FURTHER ANALYSIS:

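Table 4: Summary of the numerical results of our ablation study on the CIFAR100 dataset. PB and Rect. denote prototype-based and rectification, respectively. A check mark denotes a component that is present and a dash denotes a component that is absent.

| Conf. | Static Proto. | Dynamic Proto. | MLP Head | PB Head | Rect. | Avg | PD | Gap | Time |
|---|---|---|---|---|---|---|---|---|---|
| A | - | ✓ | ✓ | ✓ | ✓ | 80.62 | 7.76 | 9.39 | 6.01h |
| B | ✓ | - | ✓ | ✓ | - | 85.32 | 10.06 | 4.69 | 4.17h |
| C | ✓ | ✓ | - | ✓ | ✓ | 88.72 | 5.49 | 1.29 | 8.00h |
| D | - | - | ✓ | - | - | 69.23 | 37.76 | 20.78 | 3.50h |
| UOPP | ✓ | ✓ | ✓ | ✓ | ✓ | 90.01 | 9.52 | 0.00 | 6.05h |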

a) Different Non-i.i.d. level: Table 3 (a) shows the performance of the methods w.r.t. the non-i.i.d. level, represented by the percentage of available classes η (lower is harder). Our method outperforms the existing methods at all non-i.i.d. levels by a significant margin, i.e. up to 50%. The smallest class availability (η = 20%) remains challenging, since all the methods achieve less than 70% on average. In contrast, in the other cases (η ≥ 40%), our method achieves > 86% accuracy on average.
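The role of η can be made concrete with a small partitioning sketch. It is a hypothetical illustration, not the simulation code: it assumes each client independently draws a random η fraction of the classes of the current task, and the 5-class task, the class ids, and the function name `assign_classes` are made up for the example.

```python
import random


def assign_classes(num_clients, task_classes, eta, seed=0):
    """Give each client a random subset containing a fraction eta of the task classes.

    Assumption: every client samples independently and holds at least one class.
    """
    rng = random.Random(seed)
    k = max(1, round(eta * len(task_classes)))
    return {c: sorted(rng.sample(task_classes, k)) for c in range(num_clients)}


task_classes = list(range(60, 65))  # hypothetical few-shot task with 5 new classes
for eta in (0.6, 0.4, 0.2):
    parts = assign_classes(20, task_classes, eta)
    per_client = len(parts[0])
    print(f"eta={eta}: each client sees {per_client} of {len(task_classes)} classes")
```

With η = 0.2 a client in this toy setup holds a single class of the task, so every local prototype set is missing most classes, which is consistent with the lower accuracy reported for that setting.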

b) Different Participating Local Clients: we evaluate the robustness of UOPP under various numbers of selected local clients, simulating fluctuations of participating clients in real-life applications. Table 3 (b) shows the performance of the consolidated methods with 4 (20%), 6 (30%), and 8 (40%) selected local clients out of 20 total clients. UOPP achieves the highest performance in all configurations, with a significant gap of more than 10% over the competitor methods. Table 3 also shows that UOPP achieves higher performance with a higher percentage of participating local clients. Besides, UOPP achieves a relatively lower performance drop (PD) than Fed-S3C, TARGET, LGA, LANDER and Fed-CPrompt. In the 8 (40%) local client case, UOPP achieves the lowest PD, while in the 4 (20%) and 6 (30%) local client cases UOPP experiences a higher PD due to an accuracy drop on the last task. The accuracy drop is caused by a mismatch between the samples and the prompt keys, resulting in inaccurate feature extraction and classifier selection. The more detailed per-task results are presented in Appendix J.

c) Variation of Total Global Rounds: we evaluate the robustness of UOPP with fewer global rounds, simulating real-world conditions where the global model is urgently needed and fewer rounds are therefore required. Table 3 (c) summarizes our investigation of 54 (6 r/task) to 90 (10 r/task) global rounds on the CIFAR100 dataset, while the complete results are presented in Appendix J. The table shows that UOPP achieves the highest performance, with a significant gap of at least 9% over the competitors. With fewer global rounds, UOPP achieves even better performance than with the normal number of global rounds, as it does not experience the aforementioned accuracy drop. UOPP also achieves the smallest


PD compared to the competitor methods with fewer rounds. The prompt-based method Fed-DualP proves to be more robust than Fed-S3C, TARGET, LGA, LANDER, and Fed-CPrompt. Furthermore, those 3 methods achieve lower accuracy with fewer global rounds. This finding confirms the robustness of our proposed method when the number of global rounds is low.
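The relation between the total number of rounds and the rounds per task, together with the per-round client selection, can be sketched as follows. This is a minimal illustration under stated assumptions: the rounds are split evenly over 9 tasks (sessions 0 to 8 on CIFAR100), clients are drawn uniformly without replacement, and the function name `round_schedule` is hypothetical.

```python
import random


def round_schedule(total_rounds, num_tasks, num_clients=20, selected=6, seed=0):
    """Return (rounds_per_task, schedule) for an evenly split federated run.

    schedule is a list of (task, round_in_task, selected_client_ids).
    """
    rng = random.Random(seed)
    rounds_per_task = total_rounds // num_tasks
    schedule = []
    for task in range(num_tasks):
        for r in range(rounds_per_task):
            clients = sorted(rng.sample(range(num_clients), selected))
            schedule.append((task, r, clients))
    return rounds_per_task, schedule


for total in (54, 72, 90):
    per_task, sched = round_schedule(total, num_tasks=9)
    print(f"R={total}: {per_task} rounds/task, {len(sched)} rounds in total")
```

For R = 54, 72, and 90 this gives 6, 8, and 10 rounds per task, matching the settings compared in Table 3 (c).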

d) Ablation Study: we conduct an ablation study to investigate the contribution of each component of the proposed method. The results are summarized in Table 4, while the detailed results are presented in Appendix K. The absence of the static prototype (Conf. A) and of the dynamic prototype (Conf. B) drops the average performance by 9.4% and 4.7%, respectively. This result shows the importance of unified prototypes for prompt learning in dealing with FFSCIL problems. The absence of the MLP head (Conf. C) drops the performance by 1.3%, which shows that the MLP classifier contributes to the model's prediction performance. Last, the absence of the prototype-based head (Conf. D) drops the performance by the largest margin, i.e. 20.8%. This indicates that the PB classifier is essential for few-shot tasks. Note that the presence of rectification (Rect.) follows the presence of the dynamic prototype (e.g. in configurations B and D, where the dynamic prototype is absent, rectification is absent as well).

Figure 6: Visualization of Novel Classes Validation Loss with vs without Rectification. (Two panels: novel-class loss on CIFAR100 5-shot and on CIFAR100 1-shot; x-axis: session; y-axis: validation loss; legend: Without Rectification, With Rectification.)

e) The Importance of Prototype Rectification: We analyze the impact of prototype rectification in our proposed method. Figure 6 shows the validation loss of novel classes with rectification and without rectification on CIFAR100 dataset with 5-shot and 1-shot settings. The figure shows that, without prototype rectification (red line), our method produces a far higher validation loss for novel classes as there are many misclassified samples. On the contrary, with prototype rectification (blue line), our method produces far smaller and more stable validation loss. The figure also shows that the difference in loss magnitude between the two variants is even higher in the 1-shot setting. This finding emphasizes the importance of prototype rectification in our method.
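The bias that rectification targets can be illustrated with a toy simulation. The sketch below does not implement the rectification used by UOPP; it only assumes, for illustration, that class features are drawn from a standard Gaussian whose true mean is the origin, and measures how far a k-shot class-mean prototype lands from that true mean. The function name and all parameters are hypothetical.

```python
import math
import random


def prototype_error(shots, dim=16, trials=2000, seed=0):
    """Average distance between a k-shot mean prototype and the true class mean.

    Toy assumption: features ~ N(0, I), so the true class mean is the origin.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Prototype = per-dimension mean of `shots` sampled feature vectors.
        proto = [
            sum(rng.gauss(0.0, 1.0) for _ in range(shots)) / shots
            for _ in range(dim)
        ]
        total += math.sqrt(sum(p * p for p in proto))
    return total / trials


err_1 = prototype_error(shots=1)
err_5 = prototype_error(shots=5)
print(f"1-shot error: {err_1:.2f}, 5-shot error: {err_5:.2f}")
```

In this toy model the error shrinks roughly with the square root of the number of shots, so a 1-shot prototype is about twice as far from the true mean as a 5-shot one, which is in line with the larger loss difference observed in the 1-shot setting.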

Table 5: Comparison of parameters, communication cost, and training time on the CIFAR100 dataset.

| Method | Trainable Params. (M), Base Task | Trainable Params. (M), FS Task | Sharable Params. (M), Base Task | Sharable Params. (M), FS Task | Comm. Cost (MB), Base Task | Comm. Cost (MB), FS Task | Running Time (h) |
|---|---|---|---|---|---|---|---|
| Fed-S3C | 11.7 | 11.7 | 11.7 | 11.7 | 46.8 | 46.8 | 4.1 |
| TARGET | 11.3 | 11.3 | 17.45 | 17.45 | 61.7 | 61.7 | 2.02 |
| LGA | 11.3 | 11.3 | 15.42 | 15.42 | 69.8 | 69.8 | 8.22 |
| LANDER | 11.3 | 11.3 | 17.45 | 17.45 | 61.7 | 61.7 | 4.78 |
| Fed-L2P | 0.29 | 0.29 | 0.29 | 0.29 | 1.16 | 1.16 | 4.83 |
| Fed-DualP | 0.33 | 0.33 | 0.33 | 0.33 | 1.32 | 1.32 | 3.2 |
| Fed-CODAP | 0.33 | 0.33 | 0.33 | 0.33 | 1.32 | 1.32 | 1.42 |
| Fed-CPrompt | 0.33 | 0.33 | 0.33 | 0.33 | 1.32 | 1.32 | 2.03 |
| PILoRA | 0.30 | 0.30 | 0.34 | 0.37 | 1.37 | 1.50 | 6.24 |
| UOPP | 0.33 | 34.1 | 0.38 | 0.41 | 1.5 | 1.63 | 6.05 |

f) Complexity, Running Time, Parameters, and Communication Cost: our complexity analysis shows that our proposed method has the same complexity, i.e. O(R·L·N), where R is the number of global rounds, L is the number of participating local clients in each round, and N is the size of the training data in each client. The detailed complexity analysis is provided in Appendix C. Table 5 shows the training time of the consolidated methods on the CIFAR100 dataset. The proposed method requires a moderate training time: it is lower than that of LGA and PILoRA and higher than that of the other methods. In the base task, UOPP trains only a small number (0.33 M) of parameters, since it trains only its prompts and MLP classifier. In the few-shot tasks, however, it also trains GradNet for prototype rectification, which accounts for the high number of trainable parameters. Nevertheless, UOPP keeps its communication cost low in both the base task and the few-shot tasks, since it only shares prompt+MLP or prompt+prototypes. Please see Appendix D and E for our extended analysis on running time, memory consumption, limitations and potential solutions.
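The communication cost of the prompt-based methods can be related to the number of sharable parameters with a back-of-the-envelope calculation. The sketch below assumes that each sharable parameter is transmitted as a 32-bit float (4 bytes) and that 1 MB = 10^6 bytes; the conversion is not stated in this section, so it is an assumption that happens to reproduce, for example, the Fed-L2P and Fed-DualP rows of Table 5. The function name is hypothetical.

```python
BYTES_PER_PARAM = 4  # assumption: parameters are sent as 32-bit floats


def comm_cost_mb(sharable_params_millions):
    """Per-message payload in MB for a given number of sharable parameters (in millions)."""
    # (millions of params * 1e6) * 4 bytes / 1e6 bytes per MB = millions * 4
    return sharable_params_millions * BYTES_PER_PARAM


prompt_cost = comm_cost_mb(0.33)  # e.g. a prompt-based method with 0.33 M sharable params
cnn_cost = comm_cost_mb(11.7)     # e.g. a CNN-based method with 11.7 M sharable params
print(f"prompt-based: {prompt_cost:.2f} MB, CNN-based: {cnn_cost:.1f} MB")
print(f"ratio: {cnn_cost / prompt_cost:.0f}x")
```

Under this assumption, sharing only prompts and a small head or prototype set costs on the order of 1 to 2 MB per message, roughly 35 times less than sharing a full 11.7 M parameter backbone.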

7 CONCLUDING REMARKS

We define a new Federated Few-Shot Class-Incremental Learning (FFSCIL) problem and develop a novel Unified Optimized Prototype Prompt (UOPP) model that utilizes task-wise prompt learning to mitigate task interference empowered by shared static-dynamic prototypes, adaptive dual heads, and weighted aggregation. The dynamic prototype tackles prototype bias by iterative rectifications. Our comprehensive experimental results show that UOPP significantly outperforms existing SOTA methods of FCIL, FSCIL, and CIL, on three datasets with a significant gap i.e. up to 76%. Our deeper analysis confirms that the proposed method achieves better stability-plasticity trade-off, and robustness in different local clients and small global rounds. Our analysis shows that our proposed method requires moderate training time but a lower communication cost than the SOTAs.


ACKNOWLEDGMENTS

M. Anwar Ma'sum acknowledges the support of Tokopedia-UI Centre of Excellence for GPU access to run the experiments.

Noor Ahmed, Anna Kukleva, and Bernt Schiele. Orco: Towards better generalization via orthogonality and contrast for few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 28762–28771, 2024.

Roger Alexander. Solving ordinary differential equations i: Nonstiff problems (e. hairer, sp norsett, and g. wanner). Siam Review, 32(3):485, 1990.

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pp. 139–154, 2018.

Gaurav Bagwe, Xiaoyong Yuan, Miao Pan, and Lan Zhang. Fed-cprompt: Contrastive prompt for rehearsal-free federated continual learning. arXiv preprint arXiv:2307.04869, 2023.

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018.

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems, 33:15920–15930, 2020.

Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European conference on computer vision (ECCV), pp. 233–248, 2018.

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.

Jiahua Dong, Lixu Wang, Zhen Fang, Gan Sun, Shichao Xu, Xiao Wang, and Qi Zhu. Federated class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10164–10173, 2022.

Jiahua Dong, Hongliu Li, Yang Cong, Gan Sun, Yulun Zhang, and Luc Van Gool. No one left behind: Real-world federated class-incremental learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–17, 2023. doi: 10.1109/TPAMI.2023.3334213.

Weifeng Ge. Deep metric learning with hierarchical triplet loss. In Proceedings of the European conference on computer vision (ECCV), pp. 269–285, 2018.

Dipam Goswami, Bartłomiej Twardowski, and Joost Van De Weijer. Calibrating higher-order statistics for few-shot class-incremental learning with pre-trained vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4075–4084, 2024.

Haiyang Guo, Fei Zhu, Wenzhuo Liu, Xu-Yao Zhang, and Cheng-Lin Liu. Pilora: Prototype guided incremental lora for federated class-incremental learning. In Proceedings of the European Conference on Computer Vision, 2024.

Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pp. 831–839, 2019.

Yalan Jiang, Yang Cheng, Dan Wang, and Bin Song. Rra-ffscil: Inter-intra classes representation and relationship augmentation federated few-shot incremental learning. Neurocomputing, pp. 127956, 2024.

Jayateja Kalla and Soma Biswas. S3c: Self-supervised stochastic classifiers for few-shot class-incremental learning. In European Conference on Computer Vision, pp. 432–448. Springer, 2022.


James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.

Muhammad Rifki Kurniawan, Xiang Song, Zhiheng Ma, Yuhang He, Yihong Gong, Yang Qi, and Xing Wei. Evolving parameterized prompt memory for continual learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 13301–13309, 2024.

Steinar Laenen and Luca Bertinetto. On episodes, prototypical networks, and few-shot learning. Advances in Neural Information Processing Systems, 34:24581–24592, 2021.

Gwen Legate, Nicolas Bernier, Lucas Page-Caccia, Edouard Oyallon, and Eugene Belilovsky. Guiding the last layer in federated learning with pre-trained models. Advances in Neural Information Processing Systems, 36, 2024.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.

Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.

Muhammad Anwar Ma'sum, Mahardhika Pratama, Savitha Ramasamy, Lin Liu, Habibullah Habibullah, and Ryszard Kowalczyk. Pip: Prototypes-injected prompt for federated class incremental learning. CIKM '24, pp. 1670–1679. Association for Computing Machinery, 2024. ISBN 9798400704369. doi: 10.1145/3627673.3679794.

Pratik Mazumder, Pravendra Singh, and Piyush Rai. Few-shot lifelong learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 2337–2345, 2021.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR, 2017.

Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pp. 2642–2651. PMLR, 2017.

Keon-Hee Park, Kyungwoo Song, and Gyeong-Moon Park. Pre-trained vision and language transformers are few-shot incremental learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23881–23890, 2024.

Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. Gdumb: A simple approach that questions our progress in continual learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 524–540. Springer, 2020.

Daiqing Qi, Handong Zhao, and Sheng Li. Better generative replay for continual federated learning. arXiv preprint arXiv:2302.13001, 2023.

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010, 2017.

Anurag Roy, Riddhiman Moulick, Vinay K Verma, Saptarshi Ghosh, and Abir Das. Convolutional prompting meets language models for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23616–23626, 2024.


Guangyuan Shi, Jiaxin Chen, Wenlong Zhang, Li-Ming Zhan, and Xiao-Ming Wu. Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima. Advances in neural information processing systems, 34:6747–6761, 2021.

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.

James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11909–11919, 2023.

Yue Tan, Guodong Long, Jie Ma, Lu Liu, Tianyi Zhou, and Jing Jiang. Federated learning from pre-trained models: A contrastive learning approach. Advances in neural information processing systems, 35:19332–19344, 2022.

Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, and Yihong Gong. Few-shot class-incremental learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12180–12189, 2020a.

Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, and Yihong Gong. Few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12183–12192, 2020b.

Minh-Tuan Tran, Trung Le, Xuan-May Le, Mehrtash Harandi, and Dinh Phung. Text-enhanced data-free approach for federated class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23870–23880, 2024.

Anastasiia Usmanova, François Portet, Philippe Lalanda, and German Vega. A distillation-based approach integrating continual learning and federated learning for pervasive services. arXiv preprint arXiv:2109.04197, 2021.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.

Qi-Wei Wang, Da-Wei Zhou, Yi-Kai Zhang, De-Chuan Zhan, and Han-Jia Ye. Few-shot class-incremental learning via training-free prototype calibration. Advances in Neural Information Processing Systems, 36, 2024.

Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pp. 631–648. Springer, 2022a.

Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 139–149, 2022b.

Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 374–382, 2019.

Yibo Yang, Haobo Yuan, Xiangtai Li, Zhouchen Lin, Philip Torr, and Dacheng Tao. Neural collapse inspired feature-classifier alignment for few-shot class-incremental learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=y5W8tpojhtJ.

Jaehong Yoon, Wonyong Jeong, Giwoong Lee, Eunho Yang, and Sung Ju Hwang. Federated continual learning with weighted inter-client transfer. In International Conference on Machine Learning, pp. 12073–12086. PMLR, 2021.


Baoquan Zhang, Xutao Li, Shanshan Feng, Yunming Ye, and Rui Ye. Metanode: Prototype optimization as a neural ode for few-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 9014–9021, 2022.

Chi Zhang, Nan Song, Guosheng Lin, Yun Zheng, Pan Pan, and Yinghui Xu. Few-shot incremental learning with continually evolved classifiers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12455–12464, 2021.

Jie Zhang, Chen Chen, Weiming Zhuang, and Lingjuan Lv. Addressing catastrophic forgetting in federated class-continual learning. arXiv preprint arXiv:2303.06937, 2023a.

Jie Zhang, Chen Chen, Weiming Zhuang, and Lingjuan Lyu. Target: Federated class-continual learning via exemplar-free distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4782–4793, 2023b.

Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. Class-incremental learning via deep model consolidation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1131–1140, 2020.

Hanbin Zhao, Yongjian Fu, Mintong Kang, Qi Tian, Fei Wu, and Xi Li. Mgsvf: Multi-grained slow versus fast framework for few-shot class-incremental learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(3):1576–1588, 2024. doi: 10.1109/TPAMI.2021.3133897.

A DETAILED PROCESS OF UNIFIED OPTIMIZED PROTOTYPE PROMPT (UOPP)

In this section, we present the detailed algorithm of UOPP as shown in algorithm 1.

B DETAILED THEORETICAL ANALYSIS

Let Θ = (P, Φ, Ψ) be the trainable parameters and F(Θ) = E[L(T ; Θ)] = E[L(T ; (P, Φ, Ψ))] the expected loss function, and let k, E, R, and L_S be the local iteration, local epoch, global round, and number of selected local clients, respectively. We follow the assumptions of L-smooth and µ-strongly convex F, uniformly G-bounded gradients, uniformly random batches, and a decreasing learning rate, as in (Li et al., 2019) and (Bottou et al., 2018). We state the following theorems.

Theorem 1: $\liminf_{k \to \infty} \mathbb{E}\left[\|\nabla F(\Theta^{k})\|_2^2\right] = 0$

Theorem 2: Let Θ1, ΘR, Θ is the initial, last updated (R-th), and optimum parameter respectively, F is minimum of F, exist A, B, C, δ > 0 so that: E[F(ΘR)] F A

2 E||Θ1 Θ ||).

Theorem 3: Given Θ and Θ are optimal parameter in T t l Z and T t l respectively, where T t l T t, where |T t l |/|T t| = η (0, 1), then at least there s ϵ > 0 that satisfy F(Θ; T t) F(Θ ; T t) ϵ. Theorem 1 and 2 prove UOPP local training and federated convergence respectively, while theorem 3 proves UOPP generalization. The detailed theoretical analysis, assumptions, and proofs are presented in Appendix B.

Let Θ = (P, Φ, Ψ) be the trainable parameters, F(Θ) = E[L(T ; Θ)] = E[L(T ; (P, Φ, Ψ))] is the expected loss function, k, E, R, and L is local iteration, local epoch, global round, and number of selected local clients respectively. Please note that in this analysis, L denotes the number of selected local clients, while l 1 denotes a constant for the l-smooth coefficient. Following the update rule in section 4.3, the expression of F(Θ) above can be detailed as follows:

(i) Base Task (t = 0): Θ = (P, Φ), and F(Θ) = E[L(T ; Θ)] = E[Ll+(T ; (P, Φ))] as local clients update (P, Φ) using Ll+ following equations 7 and 8.

(ii) FS Task (t ≥ 1): Θ = (P, Ψ), and F(Θ) = E[L(T ; Θ)] = E[Llfs+(T ; (P, Ψ))] as local clients update (P, Ψ) using Llfs+ following equations 9 and 10.

We adopt the assumptions of the SGD optimization convergence analysis (Bottou et al., 2018) and the FedAvg convergence analysis (Li et al., 2019) as follows:


Algorithm 1 UOPP

1: Input: number of clients L_all, number of selected local clients L, total number of rounds R, number of tasks T + 1, local epochs E, batch size B.
2: Distribute frozen ViT backbone f to all clients {l}_{l=1}^{L_all} and central server G
3: Initiate prompt, key, and head layer for all clients and central server: P_G = P_l, Φ_G = Φ_l, Ψ_l = init(), ∀ l ∈ {1..L_all}
4: R_T ← R/(T + 1), where R_T represents rounds per task
5: Init global and local unified prototypes Z_G = Z_l = Z = ∅
6: for t = 0 : T do
7:   for r = 1 : R_T do
8:     {l}_{l=1}^{L} ← randomly select L local clients from L_all total clients
9:     Clients execute:
10:      if R_T = 1 then
11:        Compute static prototype Z_l as in Eq. (3)-(4), then send it to server
12:      end if
13:      Receive global parameters, i.e. prompt, FC layer, and prototype set P_G, Φ_G, and Z_G
14:      Assign local parameters (P_l, Φ_l, Z_l) ← (P_G, Φ_G, Z_G)
15:      𝓑 ← Split T_l^t into B-sized batches
16:      for e = 1 : E do
17:        for b = 1 : B do
18:          if (t = 0) then // Base Task Update
19:            Compute prompt-generated feature f_(K_l^t, P_l^t)(x) as in Eq. (2)
20:            Compute logits with FC classifier g_Φl(f_(K_l^t, P_l^t)(x) ∪ Z_G)
21:            Compute loss L_{t=0} as in Eq. (10)
22:            Update local parameters (K_l^t, P_l^t, Φ_l) based on L_{t=0}
23:          else (t > 0) // Few-shot Task Update
24:            Compute static prototype Z_l using feature f_(K_l^t, P_l^t)(x) as in Eq. (2)
25:            Draw S from Z_l and draw Q from Z_l = Z_G
26:            Rectify dynamic prototype Ẑ_l using g_Ψ(.) as in Eq. (5) to (7)
27:            Form unified prototype Z_l = Z_G ∪ Ẑ_l
28:            Compute logits with PB classifier g_Zl(f_(K_l^t, P_l^t)(x) ∪ S)
29:            Compute loss L_{t>0} as in Eq. (11)
30:            Update local parameters (K_l^t, P_l^t, Ψ_l) based on L_{t>0}
31:          end if
32:        end for
33:        if t = 0 then
34:          Update local static prototype Z_l as in Eq. (3)-(4) for all classes c ∈ T_l^t
35:        end if
36:      end for
37:      if t = 0 then
38:        Set unified prototype Z_l = Z_G ∪ Z_l
39:      else
40:        Set unified prototype Z_l = Z_G ∪ Ẑ_l
41:      end if
42:      Store local parameters (K_l^t, P_l^t, Φ_l, Ψ_l, Z_l)
43:      Compute client weight ω_l
44:      Send local parameters (K_l^t, P_l^t, Φ_l, Z_l) and weight ω_l^t to server
45:    Server executes:
46:      if R_T = 1 then
47:        Receive clients' initial static prototypes Z_l for l ∈ [1..L]
48:        Generate Z_G = Z_G ∪ Agg(Z_l for l ∈ [1..L]) and send Z_G to clients
49:      end if
50:      Receive selected clients' parameters (K_l^t, P_l^t, Φ_l, Z_l) and weights ω_l for l ∈ [1..L]
51:      Do weighted aggregation as in Eq. (9)
52:      Send global parameters (K_G^t, P_G^t, Φ_G, Z_G) to clients for the next round
53:   end for
54: end for
55: Output: Optimal global parameters (P_G, Φ_G, Z_G)
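Two building blocks of Algorithm 1, the static prototype computation and the server-side weighted aggregation, can be sketched in plain Python. The sketch is not the authors' implementation: Eq. (3)-(4) and Eq. (9) are not reproduced in this appendix, so it assumes that a static prototype is the per-class mean of the extracted features and that the aggregation is a FedAvg-style average with weights normalized to sum to one. The function names and the toy inputs are hypothetical.

```python
def class_mean_prototypes(features, labels):
    """Per-class mean feature vector (assumed form of a static prototype)."""
    sums, counts = {}, {}
    for feat, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(feat)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], feat)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def weighted_aggregate(client_params, client_weights):
    """Weighted average of flat parameter vectors (assumed FedAvg-style rule)."""
    total = sum(client_weights)
    dim = len(client_params[0])
    return [
        sum(w * p[i] for p, w in zip(client_params, client_weights)) / total
        for i in range(dim)
    ]


# Toy usage with made-up 2-D features and two clients.
protos = class_mean_prototypes([[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]], [0, 0, 1])
global_prompt = weighted_aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(protos)
print(global_prompt)
```

In this sketch a client with a larger weight pulls the global parameters toward its own; how the client weight ω_l is actually computed is specified by the method, not by this example.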


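Assumption 1: $F_1, \dots, F_l, \dots, F_{L_S}$ are all $L$-smooth: for all $\Theta$ and $\Theta'$,

$$F_l(\Theta) \le F_l(\Theta') + (\Theta - \Theta')^T \nabla F_l(\Theta') + \frac{L}{2}\|\Theta - \Theta'\|_2^2.$$

Assumption 2: $F_1, \dots, F_l, \dots, F_{L_S}$ are all $\mu$-strongly convex: for all $\Theta$ and $\Theta'$,

$$F_l(\Theta) \ge F_l(\Theta') + (\Theta - \Theta')^T \nabla F_l(\Theta') + \frac{\mu}{2}\|\Theta - \Theta'\|_2^2.$$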

Assumption 3: Let $\xi_l^k$ be sampled uniformly at random from the local data of the $l$-th client at the $k$-th iteration. The variance of stochastic gradients in each client is bounded: $\mathbb{E}\|\nabla F_l(\Theta_l^k, \xi_l^k) - \nabla F_l(\Theta_l^k)\|^2 \le \sigma_l^2$ for $l = 1, 2, \dots, L_S$.

Assumption 4: The expected squared norm of stochastic gradients in each client is bounded: $\mathbb{E}\|\nabla F_l(\Theta_l^k, \xi_l^k)\|^2 \le G^2$ for all $l = 1, 2, \dots, L_S$ and $k = 1, 2, \dots, K$, where $K \in \mathbb{N}$.

Assumption 5: $\sum_{k=1}^{\infty} \alpha_l^k = \infty$ and $\sum_{k=1}^{\infty} (\alpha_l^k)^2 < \infty$, where $\alpha_l^k$ is the learning rate of the $l$-th client in the $k$-th training step.

We follow the theoretical analysis of the federated class-incremental learning method in (Ma'sum et al., 2024), as it has similar characteristics, i.e. it is a prompt-based method supported by shared prototypes.

B.1 PROOF OF THEOREM 1

Let client $l$ be trained locally with its local data $\mathcal{T}_l^t \cup Z$, where $\mathcal{T}_l^t$ is the set of locally observed training samples for the $t$-th task and $Z = Z_l = Z_G$ is the aggregated unified prototype set for task $t$ shared by the server. We assume that $Z$ is augmented so that $|z_{cb}| \approx |x_{ca}|$ for $z_{cb} \in Z$ and $x_{ca}^t \in \mathcal{T}_l^t \subset \mathcal{T}^t$. As an implication, the number of prototypes of the classes unavailable in $\mathcal{T}_l^t$ and the number of samples of the classes available in $\mathcal{T}_l^t$ are balanced. The local model $\Theta_l = (P_l, \Phi_l)$ or $\Theta_l = (P_l, \Psi_l)$ is then updated for $K$ iterations based on minibatches drawn from $\mathcal{T}_l^t \cup Z$. Since the feature extractor parameters are frozen and $\mathcal{T}_l^t \cup Z$ has balanced samples for all classes, $\xi_l^k$ approximates $\xi^k$, a sample from $\mathcal{T}^t$. The local model parameters are optimized by a stochastic gradient (SG) approach. Suppose that $g(\Theta_l, \xi_l^k)$ is an SG function; then the parameter update can be simplified as:

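$$\Theta_l^{k+1} \leftarrow \Theta_l^k - \alpha_l^k\, g(\Theta_l^k, \xi_l^k) \tag{A11}$$

Applying Assumption 1 to the local parameter update, which iterates the stochastic gradient with sample $\xi_l^k$, we get:

$$\begin{aligned} F_l(\Theta_l^{k+1}) - F_l(\Theta_l^k) &\le (\Theta_l^{k+1} - \Theta_l^k)^T \nabla F_l(\Theta_l^k) + \frac{L}{2}\|\Theta_l^{k+1} - \Theta_l^k\|_2^2 \\ &\le -\alpha_l^k \nabla F_l(\Theta_l^k)^T g(\Theta_l^k, \xi_l^k) + (\alpha_l^k)^2 \frac{L}{2}\|g(\Theta_l^k, \xi_l^k)\|_2^2 \end{aligned} \tag{A12}$$

Taking the expectation over $\xi_l^k$, the inequality above becomes:

$$\mathbb{E}_{\xi_l^k}[F_l(\Theta_l^{k+1})] - F_l(\Theta_l^k) \le -\alpha_l^k \nabla F_l(\Theta_l^k)^T \mathbb{E}[g(\Theta_l^k, \xi_l^k)] + (\alpha_l^k)^2 \frac{L}{2}\, \mathbb{E}_{\xi_l^k}[\|g(\Theta_l^k, \xi_l^k)\|_2^2] \tag{A13}$$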

The inequality above describes the optimization of $\Theta_l^k$ by the SG method at step $k$. It shows that the change of $F_l$ (left side) is bounded by the right side, which involves the directional derivative of $F_l$ at $\Theta_l^k$ along $g(\Theta_l^k, \xi_l^k)$ (first term) and the second moment of $g(\Theta_l^k, \xi_l^k)$ (second term). Let $g(\Theta_l^k, \xi_l^k)$ be an unbiased estimator of $\nabla F_l(\Theta_l^k)$; then the inequality above becomes:

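$$\mathbb{E}_{\xi_l^k}[F_l(\Theta_l^{k+1})] - F_l(\Theta_l^k) \le -\alpha_l^k \|\nabla F_l(\Theta_l^k)\|_2^2 + (\alpha_l^k)^2 \frac{L}{2}\, \mathbb{E}_{\xi_l^k}[\|g(\Theta_l^k, \xi_l^k)\|_2^2] \tag{A14}$$

The inequality above guarantees SGD convergence as long as the stochastic directions and step sizes are chosen appropriately. To limit the harm of the second term on the right side of the inequality above, we restrict the variance of the stochastic gradient, defined as:

$$\mathbb{V}[g(\Theta_l^k, \xi_l^k)] = \mathbb{E}[\|g(\Theta_l^k, \xi_l^k)\|_2^2] - \|\mathbb{E}[g(\Theta_l^k, \xi_l^k)]\|_2^2. \tag{A15}$$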


Adopting the first and second-moment limits in (Bottou et al., 2018), we add the following assumption.

Assumption 6: The objective function $F_l$ and the SG satisfy the following conditions.

(a) The sequence $\{\Theta_l^k\}$ is contained in an open set over which $F_l$ is bounded below by a scalar $F_{inf}$.

(b) There exist scalars $\nu_G \ge \nu > 0$ such that, for all $k \in \mathbb{N}$:

$$\nabla F_l(\Theta_l^k)^T\, \mathbb{E}_{\xi_l^k}[g(\Theta_l^k, \xi_l^k)] \ge \nu \|\nabla F_l(\Theta_l^k)\|_2^2, \quad \text{and} \quad \|\mathbb{E}_{\xi_l^k}[g(\Theta_l^k, \xi_l^k)]\|_2 \le \nu_G \|\nabla F_l(\Theta_l^k)\|_2. \tag{A16}$$

(c) There exist scalars $m_1 \ge 0$ and $m_2 \ge 0$ such that, for all $k \in \mathbb{N}$:

$$\mathbb{V}[g(\Theta_l^k, \xi_l^k)] \le m_1 + m_2 \|\nabla F_l(\Theta_l^k)\|_2^2 \tag{A17}$$

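Combining Assumption 6 with the variance definition in equation (A15), we have:

$$\mathbb{E}_{\xi_l^k}[\|g(\Theta_l^k, \xi_l^k)\|_2^2] \le m_1 + m_G \|\nabla F_l(\Theta_l^k)\|_2^2, \quad \text{with} \quad m_G = m_2 + \nu_G^2 \ge \nu^2 > 0 \tag{A18}$$

Then, by substituting the bound on $\mathbb{E}_{\xi_l^k}[\|g(\Theta_l^k, \xi_l^k)\|_2^2]$ from equation (A18) into equation (A13), we have:

$$\mathbb{E}_{\xi_l^k}[F_l(\Theta_l^{k+1})] - F_l(\Theta_l^k) \le -\alpha_l^k \nabla F_l(\Theta_l^k)^T \mathbb{E}[g(\Theta_l^k, \xi_l^k)] + (\alpha_l^k)^2 \frac{L}{2}\left(m_1 + m_G \|\nabla F_l(\Theta_l^k)\|_2^2\right) \tag{A19}$$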

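Assumption 5 ensures that $\{\alpha_l^k\} \to 0$, which is practically achievable by applying a learning rate scheduler (with decay) that reduces the learning rate in each step of local training. Then, by choosing $\alpha_l^k L m_G \le \nu$ and bounding $\nabla F_l(\Theta_l^k)^T \mathbb{E}[g(\Theta_l^k, \xi_l^k)]$ in equation (A19) with the condition in Assumption 6(b), we have

$$\mathbb{E}_{\xi_l^k}[F_l(\Theta_l^{k+1})] - F_l(\Theta_l^k) \le -\alpha_l^k \nu \|\nabla F_l(\Theta_l^k)\|_2^2 + (\alpha_l^k)^2 \frac{L}{2}\left(m_1 + m_G \|\nabla F_l(\Theta_l^k)\|_2^2\right) \tag{A20}$$

Taking the total expectation of the inequality above, we get

$$\begin{aligned} \mathbb{E}[F_l(\Theta_l^{k+1})] - \mathbb{E}[F_l(\Theta_l^k)] &\le -\alpha_l^k \nu\, \mathbb{E}[\|\nabla F_l(\Theta_l^k)\|_2^2] + (\alpha_l^k)^2 \frac{L}{2}\left(m_1 + m_G\, \mathbb{E}[\|\nabla F_l(\Theta_l^k)\|_2^2]\right) \\ &\le -\frac{1}{2}\nu \alpha_l^k\, \mathbb{E}[\|\nabla F_l(\Theta_l^k)\|_2^2] + \frac{1}{2}(\alpha_l^k)^2 L m_1 \end{aligned} \tag{A21}$$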

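Summing both sides over $k \in \{1, \dots, K\}$, we get

$$\begin{aligned} F_{inf} - \mathbb{E}[F_l(\Theta_l^1)] &\le \mathbb{E}[F_l(\Theta_l^{K+1})] - \mathbb{E}[F_l(\Theta_l^1)] \\ &\le -\frac{1}{2}\nu \sum_{k=1}^{K} \alpha_l^k\, \mathbb{E}[\|\nabla F_l(\Theta_l^k)\|_2^2] + \frac{1}{2} L m_1 \sum_{k=1}^{K} (\alpha_l^k)^2 \end{aligned} \tag{A22}$$

Dividing both sides by $\nu/2$ and rearranging, we get

$$\sum_{k=1}^{K} \alpha_l^k\, \mathbb{E}[\|\nabla F_l(\Theta_l^k)\|_2^2] \le \frac{2(\mathbb{E}[F_l(\Theta_l^1)] - F_{inf})}{\nu} + \frac{L m_1}{\nu} \sum_{k=1}^{K} (\alpha_l^k)^2 \tag{A23}$$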


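Taking $K \to \infty$ and applying Assumption 5 to the inequality above, we get

$$\lim_{K \to \infty} \sum_{k=1}^{K} \alpha_l^k\, \mathbb{E}[\|\nabla F_l(\Theta_l^k)\|_2^2] \le \frac{2(\mathbb{E}[F_l(\Theta_l^1)] - F_{inf})}{\nu} + \frac{L m_1}{\nu} \sum_{k=1}^{\infty} (\alpha_l^k)^2 < \infty \tag{A24}$$

Dividing both sides by $\sum_{k=1}^{K} \alpha_l^k$ and following Assumption 5, where $\lim_{K \to \infty} \sum_{k=1}^{K} \alpha_l^k = \infty$ and $\lim_{K \to \infty} \sum_{k=1}^{K} (\alpha_l^k)^2 < \infty$, the right side tends to 0. Therefore, we have

$$\lim_{K \to \infty} \frac{\sum_{k=1}^{K} \mathbb{E}[\alpha_l^k \|\nabla F_l(\Theta_l^k)\|_2^2]}{\sum_{k=1}^{K} \alpha_l^k} = 0 \tag{A25}$$

$$\lim_{K \to \infty} \mathbb{E}\left[\frac{\sum_{k=1}^{K} \alpha_l^k \|\nabla F_l(\Theta_l^k)\|_2^2}{\sum_{k=1}^{K} \alpha_l^k}\right] = 0 \tag{A26}$$

$$\lim_{k \to \infty} \mathbb{E}[\|\nabla F_l(\Theta_l^k)\|_2^2] = 0 \tag{A27}$$

Equation (A27) proves the convergence of local training in the $l$-th client: the gradient of the loss $F_l$ converges to 0 as the training step $k$ increases and the learning rate $\alpha$ decreases.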
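The role of Assumption 5 in the last step can be checked numerically. The sketch below uses the decaying schedule $\alpha^k = \beta/(k+\delta)$ (the same form chosen later in B.2) with hypothetical constants $\beta = \delta = 1$, and compares the partial sums $\sum \alpha^k$ and $\sum (\alpha^k)^2$; `step_size`, `partial_sums`, and `bound_ratio` are illustrative names, and the constants `c0`, `c1` merely stand in for the two terms on the right side of (A23).

```python
def step_size(k, beta=1.0, delta=1.0):
    """Decaying learning rate alpha^k = beta / (k + delta)."""
    return beta / (k + delta)


def partial_sums(num_steps, beta=1.0, delta=1.0):
    """Return (sum of alpha^k, sum of (alpha^k)^2) for k = 1..num_steps."""
    sum_alpha = 0.0
    sum_alpha_sq = 0.0
    for k in range(1, num_steps + 1):
        a = step_size(k, beta, delta)
        sum_alpha += a
        sum_alpha_sq += a * a
    return sum_alpha, sum_alpha_sq


def bound_ratio(num_steps, c0=1.0, c1=1.0):
    """Right side of (A23) divided by sum(alpha); c0 and c1 are placeholder constants."""
    sum_alpha, sum_alpha_sq = partial_sums(num_steps)
    return (c0 + c1 * sum_alpha_sq) / sum_alpha


for K in (10, 1000, 100000):
    s1, s2 = partial_sums(K)
    print(K, round(s1, 3), round(s2, 3), round(bound_ratio(K), 3))
```

The first sum keeps growing (logarithmically) while the second stays bounded, so the ratio shrinks as $K$ grows, which is the mechanism behind (A25).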

B.2 PROOF OF THEOREM 2

Let the selected local clients $\{l\}_{l=1}^{L_S}$ conduct local optimization on their local training data $\{\mathcal{T}_l^t \cup Z\}_{l=1}^{L_S}$, coordinated by the central server $G$, where $\mathcal{T}_l^t$ is the local training sample of client $l$ for task $t$. Each client's local update is conducted in $k$ iterations using minibatches $\xi_l^k$ sampled from its local training data $\mathcal{T}_l^t$. Global model aggregation is executed in each round $r \in \{1, 2, \dots, R\}$. We define the set of global aggregation steps as $I_E = \{rE \mid r = 1, 2, \dots, R\}$. Following (Li et al., 2019), $\Theta_l^{k+1}$ denotes the local parameter of client $l$ after the communication step, while $\varphi_l^{k+1}$ denotes the local parameter immediately after one step of stochastic gradient descent. These definitions satisfy the following expressions:

$$\varphi_l^{k+1} = \Theta_l^k - \alpha_l^k \nabla F_l(\Theta_l^k, \xi_l^k) \tag{A28}$$

$$\Theta_l^{k+1} = \begin{cases} \varphi_l^{k+1} & \text{if } k+1 \notin I_E \\ \sum_{l=1}^{L_S} w_l\, \varphi_l^{k+1} & \text{if } k+1 \in I_E \end{cases} \tag{A29}$$

where $w_l = \omega_l / \sum_{l=1}^{L_S} \omega_l$ and $\omega_l$ is the weight of client $l$. We define $\bar{\varphi}^{k+1} = \sum_{l=1}^{L_S} w_l \varphi_l^{k+1}$ and $\bar{\Theta}^{k+1} = \sum_{l=1}^{L_S} w_l \Theta_l^{k+1}$; $\bar{\varphi}^{k+1}$ is the result of a single stochastic gradient descent step from $\bar{\Theta}^{k}$. We then define $\bar{g}^k = \sum_{l=1}^{L_S} w_l \nabla F_l(\Theta_l^k)$ and $g^k = \sum_{l=1}^{L_S} w_l \nabla F_l(\Theta_l^k, \xi_l^k)$. We adopt the lemmas below from (Li et al., 2019), whose derivation is obtained for fully participating clients in the FL setting.
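The weighted averaging used at aggregation steps can be sketched as follows. This is a minimal illustration of the normalized weights $w_l = \omega_l / \sum_l \omega_l$ applied to flat parameter vectors; the helper names and the toy numbers are hypothetical, and real model parameters would be tensors rather than Python lists.

```python
def normalize_weights(raw_weights):
    """w_l = omega_l / sum(omega_l); raw weights must sum to a positive value."""
    total = sum(raw_weights)
    if total <= 0:
        raise ValueError("client weights must sum to a positive value")
    return [w / total for w in raw_weights]


def aggregate(client_params, raw_weights):
    """Weighted average of per-client parameter vectors (all of equal length)."""
    weights = normalize_weights(raw_weights)
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]


# Three toy clients with 2-dimensional parameters; the third client has double weight.
client_params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
client_weights = [1.0, 1.0, 2.0]
global_params = aggregate(client_params, client_weights)
```

With these toy values the normalized weights are 0.25, 0.25, and 0.5, so the client with the larger $\omega_l$ pulls the global parameters toward its own.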

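Lemma 1: By applying Assumptions 1 and 2, in one step of SGD training with $\alpha^k \le \frac{1}{4L}$, we have

$$\mathbb{E}[\|\bar{\varphi}^{k+1} - \Theta^*\|^2] \le (1 - \alpha^k \mu)\, \mathbb{E}[\|\bar{\Theta}^k - \Theta^*\|^2] + (\alpha^k)^2\, \mathbb{E}[\|g^k - \bar{g}^k\|^2] + 6L(\alpha^k)^2 \Gamma + 2\, \mathbb{E}\left[\sum_{l=1}^{L_S} w_l \|\bar{\Theta}^k - \Theta_l^k\|^2\right]$$

where $\Gamma = F^* - \sum_{l=1}^{L_S} w_l F_l^* \ge 0$.

Lemma 2: By applying Assumption 3, the gradient function satisfies:

$$\mathbb{E}[\|g^k - \bar{g}^k\|^2] \le \sum_{l=1}^{L_S} w_l^2 \sigma_l^2,$$

where $\sigma_l^2$ is the stochastic gradient variance bound of client $l$ from Assumption 3.

Lemma 3: By applying Assumption 4, where $\alpha^k$ is non-increasing and satisfies $\alpha^k \le 2\alpha^{k+E}$ for all $k \ge 0$, we have

$$\mathbb{E}\left[\sum_{l=1}^{L_S} w_l \|\bar{\Theta}^{k} - \Theta_l^k\|^2\right] \le 4(\alpha^k)^2 (E-1)^2 G^2.$$

In the FL setting with all clients participating, we get $\bar{\Theta}^{k+1} = \bar{\varphi}^{k+1}$. However, in a setting with only partial client participation, we utilize a random sampling mechanism that satisfies $\mathbb{E}_{S_L}[\bar{\Theta}^{k+1}] = \bar{\varphi}^{k+1}$. We also adopt the bounding condition from (Li et al., 2019), as shown in Lemma 4.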


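Lemma 4: The expected difference between $\bar{\Theta}^{k+1}$ and $\bar{\varphi}^{k+1}$ is bounded by $\mathbb{E}_{S_L}[\|\bar{\varphi}^{k+1} - \bar{\Theta}^{k+1}\|^2] \le \frac{4}{L_S}(\alpha^k)^2 E^2 G^2$, and in the case where $w_l$ is uniform for all clients $l$, $\mathbb{E}_{S_L}[\|\bar{\varphi}^{k+1} - \bar{\Theta}^{k+1}\|^2] \le \frac{4(N_S - L_S)}{N_S - 1}(\alpha^k)^2 E^2 G^2$, where $N_S$ is the total number of clients and $L_S$ is the number of selected clients.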

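Please note that

$$\|\bar{\Theta}^{k+1} - \Theta^*\|^2 = \|\bar{\Theta}^{k+1} - \Theta^*\|^2 \tag{A30}$$

$$\|\bar{\Theta}^{k+1} - \Theta^*\|^2 = \|\bar{\Theta}^{k+1} - \bar{\varphi}^{k+1} + \bar{\varphi}^{k+1} - \Theta^*\|^2 \tag{A31}$$

$$\|\bar{\Theta}^{k+1} - \Theta^*\|^2 = \|\bar{\Theta}^{k+1} - \bar{\varphi}^{k+1}\|^2 + \|\bar{\varphi}^{k+1} - \Theta^*\|^2 + 2\|\bar{\Theta}^{k+1} - \bar{\varphi}^{k+1}\| \cdot \|\bar{\varphi}^{k+1} - \Theta^*\| \tag{A32}$$

$$\|\bar{\Theta}^{k+1} - \Theta^*\|^2 = \|\bar{\Theta}^{k+1} - \bar{\varphi}^{k+1}\|^2 + \|\bar{\varphi}^{k+1} - \Theta^*\|^2 + 2\langle \bar{\Theta}^{k+1} - \bar{\varphi}^{k+1},\, \bar{\varphi}^{k+1} - \Theta^* \rangle \tag{A33}$$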

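In the case of $k+1 \notin I_E$, the term $\|\bar{\Theta}^{k+1} - \bar{\varphi}^{k+1}\|^2$ vanishes. Then, by applying Lemmas 1 to 3, we get

$$\mathbb{E}[\|\bar{\Theta}^{k+1} - \Theta^*\|^2] \le (1 - \alpha^k \mu)\, \mathbb{E}[\|\bar{\Theta}^{k} - \Theta^*\|^2] + (\alpha^k)^2 B \tag{A34}$$

In the case of $k+1 \in I_E$, by additionally applying Lemma 4, we get

$$\mathbb{E}[\|\bar{\Theta}^{k+1} - \Theta^*\|^2] \le (1 - \alpha^k \mu)\, \mathbb{E}[\|\bar{\Theta}^{k} - \Theta^*\|^2] + (\alpha^k)^2 (B + C) \tag{A35}$$

where $B = \sum_{l=1}^{L_S} w_l^2 \sigma_l^2 + 6L\Gamma + 8(E-1)^2 G^2$, and $C = \frac{4(N_S - L_S)}{N_S - 1} E^2 G^2$ if $w_l$ is uniform and $C = \frac{4}{L_S} E^2 G^2$ otherwise.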

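By choosing $\alpha^k = \frac{\beta}{k + \delta}$ for some $\beta > \frac{1}{\mu}$ and $\delta > 0$ so that $\alpha^1 \le \min\{\frac{1}{\mu}, \frac{1}{4L}\} = \frac{1}{4L}$ and $\alpha^k \le 2\alpha^{k+E}$, we have $\mathbb{E}[\|\bar{\Theta}^{k} - \Theta^*\|^2] \le \frac{v}{\delta + k}$, where $v = \max\left\{\frac{\beta^2 (B + C)}{\beta\mu - 1},\ (\delta + 1)\|\bar{\Theta}^{1} - \Theta^*\|^2\right\}$.

Then, by the $L$-smoothness of $F$ (Assumption 1), we have

$$\mathbb{E}[F(\bar{\Theta}^k)] - F^* \le \frac{L}{2}\, \mathbb{E}[\|\bar{\Theta}^k - \Theta^*\|^2] \le \frac{L}{2} \cdot \frac{v}{\delta + k} \tag{A36}$$

where $F^*$ is the minimum value of $F$, attained at the optimum parameter $\Theta^*$. If we further choose $\beta = 2/\mu$ and $\delta = \max\{8L/\mu, E\}$ and denote $\kappa = L/\mu$, so that $\alpha^k = \frac{2}{\mu} \cdot \frac{1}{\delta + k}$, then we have

$$\mathbb{E}[F(\bar{\Theta}^k)] - F^* \le \frac{\kappa}{\delta + k - 1}\left(\frac{2(B + C)}{\mu} + \frac{\mu\delta}{2}\, \mathbb{E}\|\Theta^1 - \Theta^*\|^2\right) \tag{A37}$$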

The inequality above covers FL where the model is optimized for a total of $k$ iterations; in a practical implementation, $k = b \cdot E \cdot R$, where $b$ is the number of batches. Since $E$ and $b$ are positive integers, we have $k \ge R$. Therefore, substituting $k$ with $R$ in the inequality above yields a larger value on the right side, and the inequality becomes:

$$\mathbb{E}[F(\Theta^R)] - F^* \le \frac{\kappa}{\delta + R - 1}\left(\frac{2(B + C)}{\mu} + \frac{\mu\delta}{2}\, \mathbb{E}\|\Theta^1 - \Theta^*\|^2\right) \tag{A38}$$

Factoring $R$ out of the denominator gives:

$$\mathbb{E}[F(\Theta^R)] - F^* \le \frac{1}{R} \cdot \frac{\kappa}{\delta/R + 1 - 1/R}\left(\frac{2(B + C)}{\mu} + \frac{\mu\delta}{2}\, \mathbb{E}\|\Theta^1 - \Theta^*\|^2\right) \tag{A39}$$

Let $A = \frac{\kappa}{\delta/R + 1 - 1/R}$, which is a positive number. Then the inequality above becomes:

$$\mathbb{E}[F(\Theta^R)] - F^* \le \frac{A}{R}\left(\frac{2(B + C)}{\mu} + \frac{\mu\delta}{2}\, \mathbb{E}\|\Theta^1 - \Theta^*\|^2\right) \tag{A40}$$

Inequality (A40) guarantees that the proposed weighted federated learning converges, with the gap to the optimum upper bounded by the right side.
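To see how the bound behaves as the number of rounds grows, the right side of (A38) can be evaluated directly. The sketch below is only illustrative: `convergence_bound` is a hypothetical helper, and all constants passed to it (smoothness, strong convexity, $B$, $C$, $E$, and the initial squared distance) are made-up values, not quantities measured in the paper.

```python
def convergence_bound(num_rounds, smoothness, strong_convexity, b_const, c_const,
                      local_epochs, init_dist_sq):
    """Right side of (A38) with delta = max(8L/mu, E) and kappa = L/mu."""
    kappa = smoothness / strong_convexity
    delta = max(8.0 * kappa, float(local_epochs))
    scale = kappa / (delta + num_rounds - 1)
    return scale * (2.0 * (b_const + c_const) / strong_convexity
                    + strong_convexity * delta / 2.0 * init_dist_sq)


# Hypothetical constants chosen only to make the arithmetic easy to follow.
constants = dict(smoothness=1.0, strong_convexity=1.0, b_const=1.0, c_const=1.0,
                 local_epochs=5, init_dist_sq=1.0)
bounds = {R: convergence_bound(R, **constants) for R in (1, 10, 93, 1000)}
```

With these constants, $\kappa = 1$ and $\delta = 8$, so the bound reduces to $8/(7 + R)$ and decays at the $O(1/R)$ rate stated by (A40).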


B.3 PROOF OF THEOREM 3

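Let $\Theta'$ and $\Theta$ be the optimal parameters on $\mathcal{T}_l^t \cup Z$ and $\mathcal{T}_l^t$ respectively, where $\mathcal{T}_l^t \subset \mathcal{T}^t$ and $|\mathcal{T}_l^t| / |\mathcal{T}^t| = \eta \in (0, 1)$. Then we have

$$F(\Theta; \mathcal{T}^t) = \eta F(\Theta; \mathcal{T}_l^t) + (1 - \eta) F(\Theta; (\mathcal{T}^t \setminus \mathcal{T}_l^t)) \tag{A41}$$

$$F(\Theta'; \mathcal{T}^t) = \eta F(\Theta'; \mathcal{T}_l^t) + (1 - \eta) F(\Theta'; (\mathcal{T}^t \setminus \mathcal{T}_l^t)) \tag{A42}$$

Let $\Theta_o$ be the initial value of $\Theta$ and $\Theta'$, assigned by random uniform initialization. Thus, for every class $c$ with $\mathcal{T}_c^t = \mathcal{T}_{y=c}^t$, it satisfies $F(\Theta_o; \mathcal{T}_c^t) = e_o$. After optimal learning on $\mathcal{T}_l^t$ and $\mathcal{T}_l^t \cup Z$, $\Theta_o$ becomes $\Theta$ and $\Theta'$ respectively. Note that $\Theta$ knows only the classes available in $\mathcal{T}_l^t$, while $\Theta'$ knows both the classes available in $\mathcal{T}_l^t$ and, via $Z$, the classes in $\mathcal{T}^t \setminus \mathcal{T}_l^t$. Assuming that the loss for predicting the classes in $\mathcal{T}_l^t$ is $e_a < e_o$, we have $F(\Theta; \mathcal{T}_l^t) = F(\Theta'; \mathcal{T}_l^t) = e_a < e_o$. Since all the backbone parameters are frozen and $\Theta'$ learns $Z$, we get $F(\Theta; (\mathcal{T}^t \setminus \mathcal{T}_l^t)) = e_o$, while $F(\Theta'; (\mathcal{T}^t \setminus \mathcal{T}_l^t)) = e_b$, where $e_a \le e_b < e_o$.

Then, equations (A41) and (A42) become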

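$$F(\Theta; \mathcal{T}^t) = \eta e_a + (1 - \eta) e_o \tag{A43}$$

$$F(\Theta'; \mathcal{T}^t) = \eta e_a + (1 - \eta) e_b \tag{A44}$$

Subtracting the equations above, we have

$$F(\Theta; \mathcal{T}^t) - F(\Theta'; \mathcal{T}^t) = \eta e_a + (1 - \eta) e_o - (\eta e_a + (1 - \eta) e_b) \tag{A45}$$

$$F(\Theta; \mathcal{T}^t) - F(\Theta'; \mathcal{T}^t) = \eta e_a + (1 - \eta) e_o - \eta e_a - (1 - \eta) e_b \tag{A46}$$

$$F(\Theta; \mathcal{T}^t) - F(\Theta'; \mathcal{T}^t) = (1 - \eta) e_o - (1 - \eta) e_b \tag{A47}$$

$$F(\Theta; \mathcal{T}^t) - F(\Theta'; \mathcal{T}^t) = (1 - \eta)(e_o - e_b) \tag{A48}$$

As we have $0 < \eta < 1$ and $e_o > e_b$, the right side of the equation above is positive. Then, by choosing a small positive number $\epsilon > 0$ where $(1 - \eta)(e_o - e_b) \ge \epsilon$, we have

$$F(\Theta; \mathcal{T}^t) - F(\Theta'; \mathcal{T}^t) \ge \epsilon \tag{A49}$$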

The inequality above proves that $\Theta'$ generalizes better to $\mathcal{T}^t$ than $\Theta$. This shows that our idea, i.e. empowering prompt learning with shared unified prototypes, improves model generalization.
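A small numeric instance makes the gap in (A48) concrete. The sketch below uses $\eta = 0.6$, the class-availability ratio of the experimental setting, together with hypothetical loss values $e_a$, $e_b$, $e_o$ that satisfy $e_a \le e_b < e_o$; the function name `expected_loss` is illustrative.

```python
def expected_loss(eta, loss_available, loss_unavailable):
    """Loss on the full task: eta-weighted mix of available and unavailable classes."""
    return eta * loss_available + (1.0 - eta) * loss_unavailable


eta = 0.6   # share of the task data held by the client
e_a = 0.2   # hypothetical loss on locally available classes after training
e_b = 0.5   # hypothetical loss on unavailable classes when prototypes Z are shared
e_o = 1.0   # hypothetical loss on unavailable classes without Z (initial level)

loss_without_prototypes = expected_loss(eta, e_a, e_o)   # corresponds to (A43)
loss_with_prototypes = expected_loss(eta, e_a, e_b)      # corresponds to (A44)
gap = loss_without_prototypes - loss_with_prototypes     # corresponds to (A48)
```

Here the gap equals $(1 - \eta)(e_o - e_b) = 0.4 \times 0.5 = 0.2$, so any $\epsilon \le 0.2$ satisfies (A49) for these assumed values.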

C DETAILED COMPLEXITY ANALYSIS

Following the pseudo-code in Algorithm 1, UOPP has several operations, e.g. generating static prototypes (lines 11, 24, 34), drawing $S$ and $Q$ from prototypes (line 25), rectifying prototypes (line 26), updating model parameters (lines 20-22, 28-30), forming unified prototypes (lines 27, 38, 40), and data exchange between clients and server. Accumulated over all batches, generating prototypes or computing features from $\mathcal{T}_l^t$ costs $O(N_l^t)$, drawing (augmenting) samples from features costs $O(N_l^t)$, rectifying prototypes costs $O(N_l^t)$, parameter updates cost $O(N_l^t)$, forming unified prototypes costs $O(1)$, and parameter exchange including aggregation costs $O(1)$. With 1 base task and $T$ few-shot tasks ($T+1$ tasks in total), the UOPP complexity is:

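$$O(\text{UOPP}) = O(\text{Base Task}) + O(\text{Few-Shot Task}) \tag{A50}$$

$$O(\text{UOPP}) = O(1) + R_T \cdot (O(\text{clients}_{base}) + O(\text{server}_{base})) + O(1) + T \cdot R_T \cdot (O(\text{clients}_{fs}) + O(\text{server}_{fs})) \tag{A51}$$

$$O(\text{UOPP}) = O(1) + R_T \cdot (L \cdot O(\text{1client}_{base}) + O(\text{server}_{base})) + T \cdot R_T \cdot (L \cdot O(\text{1client}_{fs}) + O(\text{server}_{fs})) \tag{A52}$$

$$\begin{aligned} O(\text{UOPP}) = {} & O(1) + R_T \cdot \left(L\left(O(N_l^0) + O(E \cdot N_l^0) + O(E \cdot N_l^0)\right) + O(1)\right) \\ & + T \cdot R_T \cdot \left(L\left(O(N_l^t) + O(E \cdot N_l^t) + O(E \cdot N_l^t) + O(E \cdot 1) + O(E \cdot N_l^t)\right) + O(1)\right) \end{aligned} \tag{A53}$$

$$O(\text{UOPP}) = O(1) + R_T \cdot L \cdot O(E \cdot N_l^0) + T \cdot R_T \cdot L \cdot O(E \cdot N_l^t) \tag{A54}$$

$$O(\text{UOPP}) = O(1) + R_T \cdot L \cdot E \cdot O(N_l^0) + T \cdot R_T \cdot L \cdot E \cdot O(N_l^t) \tag{A55}$$

$$O(\text{UOPP}) = O(1) + R_T \cdot L \cdot E \left(O(N_l^0) + T \cdot O(N_l^t)\right) \tag{A56}$$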

Please note that $N_l = N_l^0 + N_l^1 + \dots + N_l^T = N_l^0 + T \cdot N_l^t$, $t \in [1..T]$. Therefore, the equation above becomes:

$$O(\text{UOPP}) = O(1) + R_T \cdot L \cdot E \cdot O(N_l^0 + T \cdot N_l^t) \tag{A57}$$

$$O(\text{UOPP}) = R_T \cdot L \cdot E \cdot O(N_l) \tag{A58}$$

$$O(\text{UOPP}) = O(R_T \cdot L \cdot E \cdot N_l) \tag{A59}$$

Since $E$ is set to a small constant in our method, i.e. 1-20, and $R_T < R$, the UOPP complexity is:

$$O(\text{UOPP}) = O(R \cdot L \cdot N_l) \tag{A60}$$

Our derivation shows that the proposed method has a complexity of $O(R \cdot L \cdot N_l)$, where $R$ is the total number of global rounds, $L$ is the number of selected local clients in each round, and $N_l$ is the number of samples in each client.
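The counting argument behind (A56)-(A59) can be mirrored by a simple tally of how many samples are processed across the run. This is a rough sketch under the same simplifications as the derivation (constant-cost terms dropped, equal few-shot task sizes); `sample_visits` and the example sizes are hypothetical.

```python
def sample_visits(rounds_per_task, clients_per_round, local_epochs,
                  n_base, n_few_shot, num_few_shot_tasks):
    """Count sample visits: R_T * L * E * (N^0 + T * N^t), as in (A56)."""
    base = rounds_per_task * clients_per_round * local_epochs * n_base
    few_shot = (num_few_shot_tasks * rounds_per_task * clients_per_round
                * local_epochs * n_few_shot)
    return base + few_shot


# Hypothetical sizes: 1000 base samples and 25 samples per few-shot task on a client.
visits = sample_visits(rounds_per_task=10, clients_per_round=6, local_epochs=2,
                       n_base=1000, n_few_shot=25, num_few_shot_tasks=8)
n_client_total = 1000 + 8 * 25          # N_l = N^0 + T * N^t
```

The tally equals $R_T \cdot L \cdot E \cdot N_l$ exactly, which is the quantity that (A59) places inside the $O(\cdot)$.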

D EXTENDED EXPERIMENT RESULTS AND ANALYSIS

D.1 EVALUATION ON STANDALONE FSCIL

We measure the performance of our proposed method in the standalone FSCIL setting to evaluate our idea of prompting with a unified static-dynamic prototype and dual-head classifiers. We compare our method with existing SOTAs, i.e. TEEN (NeurIPS 2023) (Wang et al., 2024), NC-FSCIL (ICLR 2023) (Yang et al., 2023), OrCo (CVPR 2024) (Ahmed et al., 2024), and PriViLege (CVPR 2024) (Park et al., 2024). The evaluation is conducted on 3 datasets, i.e. CIFAR100, MiniImageNet, and CUB200, following common settings in FSCIL.

Table A6 shows the detailed numerical results of our experiment. The table shows that our proposed method achieves better performance in general. In comparison to TEEN, NC-FSCIL, and OrCo, our method achieves significantly better performance, i.e. with margins of more than 20%, 25%, and 14% on the CIFAR100, MiniImageNet, and CUB200 datasets respectively. In comparison to PriViLege, our method achieves higher performance on the CIFAR100 and CUB200 datasets with 1.65% and 4.9% margins respectively. Our method achieves lower performance than PriViLege on the MiniImageNet dataset. Looking at a more detailed view, the lower performance is mainly due to our method achieving lower performance in the base task, i.e. 3% lower accuracy, which affects the later (few-shot) tasks and the average performance. Please note that PriViLege utilizes both vision and language modalities, while the other methods utilize only the vision modality. This factor should be taken into account in the performance comparison.

In terms of the performance drop (PD) metric, our method achieves the lowest performance drop on all datasets by a significant margin, i.e. 2-28%. This indicates that our method handles catastrophic forgetting better than the existing methods.


Table A6: Numerical results of the FSCIL methods on the CIFAR100, MiniImageNet, and CUB200 datasets in the 5-shot setting. Columns 0-10 give the accuracy (%) in each session, PD indicates the performance drop, and Gap indicates the gap between the respective method and our proposed method (UOPP).

CIFAR100: 60 base classes, 5 classes per few-shot task

| Method | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TEEN (NeurIPS 2023) | 74.92 | 72.65 | 68.74 | 65.01 | 62.01 | 59.29 | 57.90 | 54.76 | 52.64 | - | - | 63.10 | 22.28 | 26.63 |
| NC-FSCIL (ICLR 2023) | 89.51 | 82.62 | 76.72 | 71.61 | 67.13 | 63.18 | 59.67 | 56.53 | 53.70 | - | - | 68.96 | 35.81 | 20.77 |
| OrCo (CVPR 2024) | 80.08 | 68.16 | 66.99 | 60.97 | 59.78 | 58.60 | 57.04 | 55.13 | 52.19 | - | - | 62.10 | 27.89 | 27.63 |
| PriViLege (CVPR 2024) | 90.88 | 89.39 | 88.97 | 87.55 | 87.83 | 87.35 | 87.53 | 87.15 | 86.06 | - | - | 88.08 | 4.82 | 1.65 |
| UOPP (Ours) | 90.27 | 90.45 | 90.14 | 89.40 | 89.26 | 89.52 | 89.70 | 89.48 | 89.35 | - | - | 89.73 | 0.92 | 0.00 |

MiniImageNet: 60 base classes, 5 classes per few-shot task

| Method | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TEEN (NeurIPS 2023) | 73.53 | 70.55 | 66.37 | 63.23 | 60.53 | 57.95 | 55.24 | 53.44 | 52.08 | - | - | 61.44 | 21.45 | 31.58 |
| NC-FSCIL (ICLR 2023) | 77.25 | 71.30 | 66.21 | 61.80 | 57.94 | 54.53 | 51.50 | 48.79 | 46.35 | - | - | 59.52 | 30.90 | 33.50 |
| OrCo (CVPR 2024) | 83.30 | 75.32 | 71.53 | 68.16 | 65.63 | 63.12 | 60.20 | 58.82 | 58.08 | - | - | 67.13 | 25.22 | 25.89 |
| PriViLege (CVPR 2024) | 96.68 | 96.49 | 95.65 | 95.54 | 95.54 | 94.91 | 94.33 | 94.19 | 94.10 | - | - | 95.27 | 2.58 | -2.25 |
| UOPP (Ours) | 93.57 | 93.66 | 93.51 | 93.51 | 93.13 | 92.24 | 92.23 | 92.66 | 92.71 | - | - | 93.02 | 0.86 | 0.00 |

CUB200: 100 base classes, 10 classes per few-shot task

| Method | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TEEN (NeurIPS 2023) | 77.26 | 76.13 | 72.81 | 68.16 | 67.77 | 64.40 | 63.25 | 62.29 | 61.19 | 60.32 | 59.31 | 68.14 | 17.95 | 14.82 |
| NC-FSCIL (ICLR 2023) | 78.49 | 71.52 | 65.54 | 60.30 | 55.81 | 51.96 | 48.72 | 45.78 | 43.18 | 40.92 | 38.80 | 57.92 | 39.69 | 25.04 |
| OrCo (CVPR 2024) | 75.59 | 66.85 | 64.05 | 63.69 | 62.20 | 60.38 | 60.18 | 59.20 | 58.00 | 58.42 | 57.94 | 63.35 | 17.66 | 19.61 |
| PriViLege (CVPR 2024) | 82.21 | 81.25 | 80.45 | 77.76 | 77.78 | 75.95 | 75.69 | 76.00 | 75.19 | 75.19 | 75.08 | 78.03 | 7.13 | 4.93 |
| UOPP (Ours) | 86.63 | 86.99 | 85.19 | 83.37 | 82.42 | 80.19 | 80.21 | 80.76 | 80.93 | 81.16 | 81.17 | 82.96 | 5.46 | 0.00 |

D.2 PERFORMANCE ON DIFFERENT K-SHOT

We extend our investigation of the methods' performance to different k-shot values, i.e. 7-shot, 5-shot, 3-shot, and 1-shot. Table A7 presents the detailed numerical results on the CIFAR100 dataset with 1-7 shot settings. Table A7 shows that our proposed method consistently achieves the highest performance across the different k-shot values. The performance gap is consistently significant, i.e. 7-46%. Regarding average performance, our method tends to achieve lower performance at lower k-shot values, i.e. 1 and 3. UOPP's performance in the 5-shot and 7-shot settings is comparable on average. However, in terms of the performance in the last (final) session, UOPP achieves a better final performance in the 7-shot setting. This indicates that more samples for each client improve the global model performance.

This trend also applies to the other methods, whose performance on higher shots is better than on lower shots. However, Fed-CPrompt shows an anomaly: its performance on lower shots is better than on higher shots. This indicates that its increase in novel-class accuracy is smaller than its drop in base-class accuracy, so a higher number of samples is counterproductive for its few-shot task training. Regarding the performance drop (PD), the table shows that UOPP achieves a lower (better) performance drop at the higher k-shot value. This is in line with the trend of higher performance at higher k-shot values mentioned above.

D.3 MEMORY CONSUMPTION ANALYSIS

We extend our resource analysis by evaluating the memory consumption of the client and the server. Table A8 shows the memory consumption of a single client and of the central server. Please note that each client trains its local model on its local dataset, while the central server coordinates the clients, aggregates the locally optimized models into a global model, and redistributes it to the clients.

Table A8 shows the memory consumption of our proposed method, Fed-DualP, and PILoRA, as these 3 methods have comparable communication costs and similar aggregation methods. First, as commonly known, Table A8 shows that the memory consumption of deep learning training is highly affected by the batch size (BS) of the data loader, while the memory consumption of model aggregation is affected by the number of participating clients. Second, the table shows that the memory consumption of the central server is relatively lower than the client's memory consumption. Third, the memory consumption of our method is comparable with Fed-DualP and PILoRA even though our method utilizes more operations for prototype rectification and has more trainable parameters during the few-shot tasks. This indicates that our method has comparable scalability and resource requirements to the existing methods. Last but not least, the table shows the high possibility of practical deployment of our method in both cross-silo and cross-device scenarios. The local training setting can be adjusted to the end device specification. For example, in the case of cross-device


Table A7: Numerical results of the FFSCIL methods on CIFAR100 in the 7-shot, 5-shot, 3-shot, and 1-shot settings. S indicates the number of shots, columns 0-8 give the accuracy (%) in each session, PD indicates the performance drop, and Gap indicates the gap between the respective method and our proposed method (UOPP).

| Method | S | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | 7 | 43.98 | 49.65 | 48.01 | 45.83 | 43.21 | 41.68 | 40.39 | 39.69 | 38.18 | 43.40 | 5.80 | 46.36 |
| TARGET | 7 | 68.65 | 63.40 | 58.36 | 52.68 | 50.04 | 46.58 | 43.96 | 39.37 | 34.04 | 50.78 | 34.61 | 38.98 |
| LGA | 7 | 73.15 | 67.08 | 64.56 | 59.87 | 56.15 | 52.04 | 50.00 | 47.29 | 44.39 | 57.17 | 28.76 | 32.59 |
| LANDER | 7 | 66.30 | 60.65 | 55.89 | 51.05 | 48.15 | 44.53 | 42.66 | 40.83 | 38.26 | 49.81 | 28.04 | 39.95 |
| Fed-DualP | 7 | 78.45 | 83.71 | 84.61 | 83.40 | 83.29 | 81.86 | 81.84 | 81.51 | 81.80 | 82.27 | -3.35 | 7.49 |
| Fed-CPrompt | 7 | 88.13 | 77.9385 | 69.6857 | 67.13 | 66.7125 | 64.8824 | 64.28 | 62.51 | 61.51 | 69.20 | 26.62 | 20.56 |
| PILoRA | 7 | 66.48 | 61.37 | 56.99 | 53.19 | 49.86 | 46.93 | 44.32 | 41.99 | 39.89 | 51.22 | 26.59 | 38.54 |
| UOPP | 7 | 90.77 | 90.86 | 91.04 | 90.75 | 91.08 | 89.35 | 88.84 | 87.87 | 87.32 | 89.76 | 3.45 | 0.00 |
| Fed-S3C | 5 | 44.51 | 48.97 | 47.77 | 45.35 | 43.48 | 41.47 | 40.33 | 39.32 | 37.71 | 43.21 | 6.80 | 46.80 |
| TARGET | 5 | 68.90 | 63.61 | 59.06 | 55.12 | 51.68 | 48.64 | 45.94 | 43.52 | 41.34 | 53.09 | 27.56 | 36.92 |
| LGA | 5 | 73.76 | 69.80 | 65.59 | 60.26 | 56.87 | 52.94 | 50.66 | 47.69 | 44.89 | 58.05 | 28.87 | 31.96 |
| LANDER | 5 | 58.60 | 61.75 | 56.26 | 52.11 | 47.71 | 44.71 | 41.69 | 40.28 | 38.87 | 49.11 | 19.73 | 40.90 |
| Fed-L2P | 5 | 73.47 | 74.20 | 73.37 | 71.88 | 70.85 | 70.72 | 69.28 | 68.66 | 68.37 | 71.20 | 5.10 | 18.81 |
| Fed-DualP | 5 | 76.39 | 82.75 | 83.37 | 80.80 | 79.93 | 78.26 | 77.73 | 76.98 | 77.11 | 79.26 | -0.72 | 10.75 |
| Fed-CODAP | 5 | 81.73 | 69.29 | 70.81 | 68.67 | 67.17 | 66.14 | 64.32 | 64.79 | 64.12 | 68.56 | 17.62 | 21.45 |
| Fed-CPrompt | 5 | 88.00 | 64.63 | 69.30 | 67.39 | 63.39 | 62.33 | 61.11 | 59.78 | 59.00 | 66.10 | 29.00 | 23.91 |
| PILoRA | 5 | 67.40 | 62.22 | 57.77 | 53.92 | 50.55 | 47.58 | 44.93 | 42.57 | 40.44 | 51.93 | 26.96 | 38.08 |
| UOPP | 5 | 90.57 | 90.58 | 90.85 | 90.96 | 91.23 | 91.51 | 91.56 | 91.74 | 81.05 | 90.01 | 9.52 | 0.00 |
| Fed-S3C | 3 | 43.98 | 49.60 | 48.03 | 45.19 | 42.69 | 41.12 | 39.87 | 38.82 | 37.47 | 42.97 | 6.51 | 45.64 |
| TARGET | 3 | 68.65 | 63.03 | 58.10 | 52.64 | 49.00 | 44.92 | 42.09 | 37.18 | 32.20 | 49.76 | 36.45 | 38.85 |
| LGA | 3 | 73.52 | 67.37 | 63.57 | 59.07 | 56.01 | 52.35 | 49.50 | 46.89 | 44.15 | 56.94 | 29.37 | 31.67 |
| LANDER | 3 | 66.30 | 59.77 | 55.56 | 50.56 | 47.40 | 44.04 | 41.72 | 40.72 | 38.43 | 49.39 | 27.87 | 39.22 |
| Fed-DualP | 3 | 73.20 | 79.31 | 82.73 | 81.53 | 82.06 | 81.16 | 81.56 | 81.21 | 80.34 | 80.34 | -7.14 | 8.27 |
| Fed-CPrompt | 3 | 88.32 | 83.62 | 81.46 | 77.68 | 77.18 | 75.40 | 74.32 | 70.63 | 68.51 | 77.46 | 19.81 | 11.15 |
| PILoRA | 3 | 66.43 | 61.32 | 56.94 | 53.15 | 49.83 | 46.89 | 44.29 | 41.96 | 39.86 | 51.19 | 26.57 | 37.42 |
| UOPP | 3 | 90.62 | 90.57 | 90.84 | 89.84 | 90.18 | 89.69 | 88.20 | 88.24 | 79.31 | 88.61 | 11.31 | 0.00 |
| Fed-S3C | 1 | 44.51 | 48.70 | 46.76 | 44.26 | 42.09 | 40.11 | 38.51 | 37.35 | 35.44 | 41.97 | 9.07 | 46.65 |
| TARGET | 1 | 68.90 | 63.61 | 59.06 | 55.12 | 51.68 | 48.64 | 45.94 | 43.52 | 41.34 | 53.09 | 27.56 | 35.53 |
| LGA | 1 | 73.58 | 67.00 | 63.44 | 58.88 | 56.90 | 54.00 | 52.82 | 49.25 | 48.78 | 58.29 | 24.80 | 30.33 |
| LANDER | 1 | 57.60 | 61.75 | 56.09 | 51.29 | 47.33 | 44.22 | 39.73 | 40.15 | 37.70 | 48.43 | 19.90 | 40.19 |
| Fed-L2P | 1 | 77.00 | 74.91 | 74.73 | 74.40 | 73.45 | 74.20 | 73.22 | 73.46 | 73.25 | 74.29 | 3.75 | 14.33 |
| Fed-DualP | 1 | 73.20 | 79.31 | 82.73 | 81.53 | 82.06 | 81.16 | 81.56 | 81.21 | 80.34 | 80.34 | -7.14 | 8.28 |
| Fed-CODAP | 1 | 83.29 | 73.27 | 72.54 | 68.37 | 67.52 | 65.54 | 62.11 | 64.19 | 61.67 | 68.72 | 21.62 | 19.90 |
| Fed-CPrompt | 1 | 87.53 | 82.86 | 79.24 | 74.87 | 73.04 | 69.92 | 68.28 | 65.81 | 62.84 | 73.82 | 24.69 | 14.80 |
| PILoRA | 1 | 67.20 | 62.03 | 57.60 | 53.76 | 50.40 | 47.44 | 44.80 | 42.44 | 40.32 | 51.78 | 26.88 | 36.84 |
| UOPP | 1 | 90.65 | 90.16 | 89.97 | 89.49 | 89.53 | 88.29 | 87.66 | 87.04 | 84.82 | 88.62 | 5.83 | 0.00 |


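Table A8: Analysis of Memory Consumption by Client and Server

Single client memory consumption (GB) for different batch sizes (BS)

| Method | BS=128 | BS=64 | BS=32 | BS=16 | BS=8 | BS=4 |
|---|---|---|---|---|---|---|
| Fed-DualP | 14.88 | 9.15 | 5.82 | 4.2 | 3.25 | 2.82 |
| PILoRA | 15.13 | 9.97 | 7.02 | 4.33 | 3.6 | 3.2 |
| UOPP | 14.89 | 8.68 | 5.59 | 3.98 | 3.26 | 2.83 |

Server memory consumption (GB) for different numbers of participating clients (L)

| Method | L=1000 | L=100 | L=50 | L=10 | L=5 | L=2 |
|---|---|---|---|---|---|---|
| Fed-DualP | 1.32 | 0.1320 | 0.0660 | 0.0013 | 0.0066 | 0.0026 |
| PILoRA | 1.50 | 0.1495 | 0.0748 | 0.0015 | 0.0075 | 0.0030 |
| UOPP | 1.63 | 0.1627 | 0.0814 | 0.0016 | 0.0081 | 0.0033 |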

Table A9: Running Time (h) w.r.t. Percentage of Available Classes (η) and Number of Selected Local Clients in Each Round (L)

| Method | η=0.6 (60%) | η=0.4 (40%) | η=0.2 (20%) | L=4 | L=6 | L=8 |
|---|---|---|---|---|---|---|
| Fed-S3C | 4.10 | 2.07 | 2.08 | 2.18 | 4.10 | 6.23 |
| TARGET | 2.02 | 2.15 | 1.99 | 1.50 | 2.02 | 2.50 |
| LGA | 8.22 | 5.35 | 5.03 | 3.51 | 8.22 | 15.23 |
| LANDER | 4.78 | 1.90 | 1.85 | 3.90 | 4.78 | 4.60 |
| Fed-DualP | 3.20 | 2.17 | 1.90 | 2.07 | 3.20 | 4.07 |
| Fed-CPrompt | 2.03 | 2.37 | 2.45 | 1.50 | 2.03 | 2.77 |
| PILoRA | 6.17 | 4.78 | 3.13 | 4.12 | 6.17 | 8.62 |
| UOPP | 6.05 | 5.87 | 5.35 | 4.70 | 6.05 | 8.81 |

settings where local training is conducted by small edge devices such as laptops or IoT nodes, the batch size can be set to a smaller value, i.e. 16 or less. In a cross-silo setting, where local training is conducted by more powerful end devices such as a corporate server, a larger batch size can be chosen, i.e. 128 or more.

D.4 RUNNING TIME ANALYSIS

We extend our scalability analysis by investigating the simulation time w.r.t. the training data size, indicated by the percentage of available classes (η), and the number of participating local clients (L). Table A9 shows the running time of our simulation w.r.t. those 2 factors. Please note that our simulation is run on a single GPU device, so each local training is executed sequentially. If the local training were conducted in parallel, the difference in simulation time would be smaller.

First, the table shows that the ratio of the training data does not imply the same running time ratio. For example, in the case of η = 0.6, where each client carries 3x the number of samples of η = 0.2, the running time for η = 0.6 is not equal to 3x the training time for η = 0.2. In our method, the increase in running time in that case is less than 20% (5.35 h to 6.05 h). Thus, we found that scaling up the training samples by a factor r increases the training time by far less than a factor r. Second, as mentioned earlier regarding the sequential process of local training, Table A9 shows a linear trend w.r.t. L, which is a logical result. For example, for UOPP and PILoRA, the total simulation time is roughly L x 1 h. In a real application, local training would be conducted in parallel since each client runs on its own local device. The bottleneck would then shift to the server that receives the locally trained models and to the delay of round-trip communication (model transfer) between the clients and the server.
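The two observations above can be re-derived from the UOPP and PILoRA rows of Table A9. The short sketch below only re-arranges the reported numbers; the variable and function names are illustrative.

```python
# Run times (hours) copied from Table A9.
uopp_by_clients = {4: 4.70, 6: 6.05, 8: 8.81}
pilora_by_clients = {4: 4.12, 6: 6.17, 8: 8.62}
uopp_by_eta = {0.6: 6.05, 0.4: 5.87, 0.2: 5.35}


def hours_per_client(runtime_by_clients):
    """Sequential simulation time divided by the number of selected clients."""
    return {num: hours / num for num, hours in runtime_by_clients.items()}


uopp_rate = hours_per_client(uopp_by_clients)
pilora_rate = hours_per_client(pilora_by_clients)

# Relative run-time increase when the data per client grows 3x (eta 0.2 -> 0.6).
data_ratio = 0.6 / 0.2
time_ratio = uopp_by_eta[0.6] / uopp_by_eta[0.2]
```

The per-client rates stay close to 1 hour for both methods, while tripling the data per client raises the UOPP run time by only about 13%.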

E DISCUSSION ON LIMITATION AND POTENTIAL SOLUTION

Our study has several limitations that can be improved in future studies.

a) Same η for each client: In our simulation, each client has the same non-i.i.d. level, represented by the same percentage of available classes (η). In a real application, each client may have a different degree of class availability from the other clients. Thus, in a future study, the simulation can be extended to a varying percentage of available classes, where each client is assigned a random η picked from a min-max range. Besides simulating a more realistic setting, this variable η raises a new challenge for FFSCIL.

b) Simulation on a single device (GPU): Our simulation is conducted on a single GPU, where each local training is executed sequentially, one by one. This limitation produces a training time that increases linearly with the number of selected local clients. It is less realistic, especially when we want to measure the training time. This limitation can be solved by utilizing a server or workstation that has multiple GPUs, such as an NVIDIA DGX server. Another solution is utilizing multiple cloud devices or servers as the clients.

c) Fixed-size prompt: In this study, our method utilizes a fixed-size prompt for all clients. Given the randomly selected available classes and the new challenge of different η (point a), it would be more realistic if a client decided its prompt size according to the condition of its local data. Evolving prompt approaches such as ConvPrompt (Roy et al., 2024) and EvoPrompt (Kurniawan et al., 2024) may be suitable for that case. However, this raises a new challenge in the aggregation process, i.e. how to produce an optimum global model from different-sized local models that is optimal for the current task as well as the previously learned tasks.

d) Overfitting handling: The current version of our method does not utilize advanced overfitting handling. Thus, in some cases, e.g. on the CIFAR100 dataset, the model may suffer from overfitting, indicated by the performance drop in the last task. One potential solution is applying early stopping during the training process. Another potential solution is applying learning rate decay to reduce the learning rate as the training epochs increase.

e) Multi-modality: The current version of our proposed method utilizes the vision modality only. With the advance of vision-language models, a language-guided approach may become a promising way to improve model performance. Alternatively, language embeddings can be utilized for prototype rectification instead of the prototypes generated by ViT.

F EXPERIMENT DETAILS AND HYPER PARAMETER SETTING

Experimental Details: Our numerical study is executed on a single NVIDIA A100 GPU with 40 GB memory across 3 runs with different random seeds. Fed-L2P, Fed-DualP, Fed-CODAP, and UOPP train $T$ prompts $P \in \mathbb{R}^{5 \times 768}$ and a head layer $\Phi \in \mathbb{R}^{|C| \times 768}$, while the competitors train whole CNN models following their original implementation. Following (Dong et al., 2023), each experiment is simulated with 20 total clients and 1 global server, where in each round, 6 (30%) local clients are selected randomly. Each client randomly receives 60% (η = 0.6) of the classes. The total number of global rounds is set to 90 (10 rounds per task) for CIFAR100 and MiniImageNet and 110 for CUB200. For all methods, the local training on each client is set to a maximum of 20 epochs, and the learning rate is chosen as the best value in the range 0.001 to 5.0 by grid search with 2 increment factors. Our setting differs from the recent study (Jiang et al., 2024), since it follows the FCIL setting, while our setting follows the FSCIL setting for the number of tasks, base classes, and novel classes in the few-shot tasks.

Performance Metric: In each session, we evaluate the consolidated algorithms on all learned classes with the accuracy metric (Acc(.)). Besides, we also measure the accuracy of base classes, the accuracy of novel classes, and the harmonic mean accuracy, which indicates the balance between the performance on base classes and novel classes; in other words, it represents stability-plasticity performance. We also measure the performance drop (PD), the accuracy difference between the first task and the last task.
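These metrics are simple to compute from per-session accuracies. The sketch below reproduces the Avg and PD values of the UOPP CIFAR100 row in Table A6; the function names are illustrative, and the base/novel accuracies passed to `harmonic_mean` are hypothetical values used only to show the formula.

```python
def average_accuracy(session_acc):
    """Mean accuracy over all sessions."""
    return sum(session_acc) / len(session_acc)


def performance_drop(session_acc):
    """PD: accuracy of the first session minus accuracy of the last session."""
    return session_acc[0] - session_acc[-1]


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# UOPP on CIFAR100 (5-shot), sessions 0-8, copied from Table A6.
uopp_cifar100 = [90.27, 90.45, 90.14, 89.40, 89.26, 89.52, 89.70, 89.48, 89.35]
avg = round(average_accuracy(uopp_cifar100), 2)
pd_value = round(performance_drop(uopp_cifar100), 2)

# Hypothetical base/novel accuracies, only to illustrate the harmonic mean.
hm = harmonic_mean(80.0, 60.0)
```

The harmonic mean is pulled toward the smaller of the two accuracies, which is why it penalizes methods that keep base-class accuracy high but fail on novel classes.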

CNN-Based Methods: The competitor methods i.e. Fed-S3C, LGA, TARGET, and LANDER run with 2-20 local epochs on each client. The learning rate is set with the best result from 0.001 to 5.0 by 5 or 10 increment factor. The other hyperparameters such as weight decay, momentum, and dropout rate are set with their original setting. The methods utilize Res Net18 as the backbone model. LGA utilizes Le Net as the perturbation model. TARGET and LANDER use CNN as their synthesizer model. TARGET and LANDER generate 10000-50000 synthetic images on each task.

Prompt-Based Methods: PILoRA and the prompt-based methods, i.e., Fed-L2P, Fed-DualP, Fed-CODAP, Fed-CPrompt, and UOPP, run with a ViT backbone. The base task runs for 1-2 epochs, while the few-shot tasks run for 2-20 local epochs. For UOPP, the rectification step M is set to 40 steps per iteration. The initial learning rate is set to the best result from 0.001 to 0.2 with an increment factor of 2 or 5. The learning rates for the few-shot tasks and the base task may differ. The prompt length is set to 5. The dual-head selection is executed batch-wise for convenience of implementation. The other parameters keep their default settings.

G EXTENDED LITERATURE STUDY

Few-Shot Class Incremental Learning (FSCIL): Previous studies on FSCIL have attempted to maintain the stability-plasticity trade-off over sequences of tasks with few labeled samples by adding extra representations, e.g., TOPIC (Tao et al., 2020b) introduces a neural gas network as the graph of mapped features, and CEC (Zhang et al., 2021) continually evolves its classifier to adapt to new tasks. Other approaches modify the learning mechanism, e.g., FSLL (Mazumder et al., 2021) updates only a subset of the model parameters with a self-supervised loss, F2M (Shi et al., 2021) finds flat minima regions on the base task and then forces the parameter updates on the few-shot tasks to stay within the flat region, S3C (Kalla & Biswas, 2022) trains a stochastic classifier with a supervised loss, and MgSvF (Zhao et al., 2024) applies a multi-grained fast-slow learning mechanism. FSCIL methods demonstrate that representation-based or prototype-based inference tends to be more stable (less forgetting) than linear classifiers under the data scarcity constraint. Nevertheless, the prototypes still have to be refined to avoid the prototype bias caused by data scarcity.

Class Incremental Learning (CIL): L2P (Wang et al., 2022b), DualP (Wang et al., 2022a), and CODAP (Smith et al., 2023) offer a breakthrough solution for CIL by training small task-wise parameters called prompts while the feature extractor, e.g., ViT, which holds the largest share of the parameters, stays frozen. This mitigates task interference because each task trains its own prompt parameters, which are matched to a sample through a trainable key. This approach simplifies the training process and reduces memory consumption. The prompt-based approach has proven more effective than the rehearsal approach, e.g., iCaRL (Rebuffi et al., 2017), EEIL (Castro et al., 2018), GD (Prabhu et al., 2020), and DER++ (Buzzega et al., 2020), which saves exemplars from the previous tasks and replays them along with the current task samples; the bias correction approach, e.g., BiC (Wu et al., 2019) and LUCIR (Hou et al., 2019), which trains an additional task-wise bias layer to balance the model's stability-plasticity dilemma; and the regularization approach, e.g., EWC (Kirkpatrick et al., 2017), MAS (Aljundi et al., 2018), LwF (Li & Hoiem, 2017), and DMC (Zhang et al., 2020), which tunes the base learner parameters to accommodate both the previous and the current tasks. Despite its excellent performance in CIL, the prompt-based approach has not yet been proven in federated or few-sample settings.
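The key-based prompt matching described above can be illustrated with a short sketch. It assumes cosine similarity, one key and one prompt per task, and random arrays standing in for the frozen ViT features; `select_prompt` and all variable names are hypothetical, and the code is a simplification rather than the exact procedure of L2P, DualP, or CODAP.

```python
import numpy as np


def select_prompt(query, keys):
    """Return the index of the key most cosine-similar to the query feature."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return int(np.argmax(k @ q))


rng = np.random.default_rng(0)
num_tasks, prompt_len, dim = 3, 5, 768  # prompt length 5 and width 768 as in the setup

keys = rng.normal(size=(num_tasks, dim))                 # one trainable key per task
prompts = rng.normal(size=(num_tasks, prompt_len, dim))  # one trainable prompt per task

# Stand-in for the frozen backbone's feature of a sample that belongs to task 1.
query = keys[1] + 0.1 * rng.normal(size=dim)
task_id = select_prompt(query, keys)

# The selected prompt is prepended to the (frozen) patch embeddings.
patch_tokens = rng.normal(size=(196, dim))
tokens = np.concatenate([prompts[task_id], patch_tokens], axis=0)
```

Only `keys` and `prompts` would be trained in such a scheme, which is why the number of trainable parameters stays small compared with the backbone.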

Few-Shot Learning (FSL): FSL methods, e.g., metric learning (Ge, 2018), prototype networks (Laenen & Bertinetto, 2021), and Neural ODE (Chen et al., 2018; Zhang et al., 2022), work effectively with few labeled training samples in a single session but have not yet been tested in continual or federated settings. Nevertheless, they confirm that optimizing the prototypes tackles prototype bias and greatly improves model performance.

Prototype-based Methods: Prototype-based FSCIL methods such as TEEN (Wang et al., 2024), NC-FSCIL (Yang et al., 2023), and OrCo (Ahmed et al., 2024) highlight the importance of prototype adjustment. TEEN recalibrates the prototypes using a similarity ratio of a calibrated prototype to the base-class prototypes and the novel-class prototype. NC-FSCIL utilizes the neural collapse principle for prototype alignment, while OrCo generates multi-angle prototypes to improve class representation and discrimination. Prototype-based FL methods such as FedPCL (Tan et al., 2022) and FedNCM (Legate et al., 2024) show how to conduct prototype learning in a federated way, where a set of clients coordinated by a central server work together to achieve globally optimal prototypes. Prototype-based FCIL methods such as PILoRA show how a learnable prototype improves a parameter-efficient fine-tuning method to handle catastrophic forgetting under data-privacy constraints.
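As a generic illustration of federated prototype learning, the sketch below computes per-class mean features on each client, aggregates them on the server with a sample-weighted average, and classifies by the nearest prototype. The function names and the toy 2-D features are hypothetical, and the sketch deliberately omits any rectification step, so it is not the method of FedPCL, FedNCM, PILoRA, or UOPP.

```python
import numpy as np


def local_prototypes(features, labels):
    """Mean feature vector and sample count per class on one client."""
    return {
        int(c): (features[labels == c].mean(axis=0), int((labels == c).sum()))
        for c in np.unique(labels)
    }


def aggregate(client_protos):
    """Sample-weighted average of the per-class prototypes across clients."""
    sums, counts = {}, {}
    for protos in client_protos:
        for c, (mean, n) in protos.items():
            sums[c] = sums.get(c, 0) + mean * n
            counts[c] = counts.get(c, 0) + n
    return {c: sums[c] / counts[c] for c in sums}


def predict(feature, protos):
    """Return the class whose prototype is closest in Euclidean distance."""
    classes = sorted(protos)
    dists = [np.linalg.norm(feature - protos[c]) for c in classes]
    return classes[int(np.argmin(dists))]


# Two clients, each holding only a subset of the classes.
client_a = local_prototypes(
    np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]), np.array([0, 0, 1])
)
client_b = local_prototypes(
    np.array([[4.0, 0.0], [0.0, 10.0]]), np.array([0, 2])
)
global_protos = aggregate([client_a, client_b])
label = predict(np.array([3.0, 1.0]), global_protos)
```

With so few samples per class, each local mean can sit far from the true class centre, and plain averaging only mixes these biased estimates.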

The strengths and weaknesses of CIL, FSL, FSCIL, and FCIL methods above inspire us to tackle the FFSCIL problem by developing a rehearsal-free prompt learning method combined with optimal prototypes to minimize communication costs.

H DETAILED NUMERICAL RESULTS ON BENCHMARK DATASETS

In this section, we present the detailed numerical results, as shown in Tables A10 and A11.

Table A10: Numerical results of the consolidated algorithms on the MiniImageNet dataset with the 5-shot and 1-shot settings across 3 different seeded runs. S indicates the number of shots for the few-shot tasks, PD indicates the performance drop, and Gap indicates the gap between the respective method and our proposed method (UOPP).

Accuracy in each session (%), sessions 0-8:

| Method | S | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | 5 | 31.91 | 32.97 | 31.74 | 30.96 | 29.93 | 28.92 | 27.66 | 27.06 | 26.43 | 29.73 | 5.5 | 63.2 |
| TARGET | 5 | 58.10 | 53.64 | 49.80 | 46.48 | 43.58 | 41.02 | 38.74 | 36.70 | 34.86 | 44.77 | 23.2 | 48.2 |
| LGA | 5 | 50.68 | 49.20 | 45.19 | 38.05 | 29.24 | 29.91 | 27.26 | 25.94 | 20.17 | 35.07 | 30.5 | 57.8 |
| Fed-L2P | 5 | 81.02 | 78.22 | 78.66 | 79.44 | 79.67 | 78.14 | 77.65 | 77.48 | 80.00 | 78.92 | 1.0 | 14.0 |
| Fed-DualP | 5 | 83.93 | 89.31 | 88.11 | 87.47 | 87.23 | 84.97 | 83.96 | 83.78 | 84.38 | 85.91 | -0.4 | 7.0 |
| Fed-CODAP | 5 | 90.21 | 83.35 | 82.31 | 80.27 | 78.89 | 77.79 | 77.13 | 75.94 | 75.09 | 80.11 | 15.1 | 12.8 |
| Fed-CPrompt | 5 | 93.57 | 92.26 | 90.71 | 89.60 | 89.09 | 87.22 | 85.80 | 85.41 | 85.28 | 88.77 | 8.29 | 4.15 |
| UOPP | 5 | 93.65 | 93.24 | 92.97 | 92.60 | 92.73 | 92.92 | 92.49 | 92.73 | 92.92 | 92.92 | 0.7 | 0.0 |
| Fed-S3C | 1 | 32.84 | 33.19 | 31.83 | 30.69 | 29.17 | 27.90 | 26.54 | 25.68 | 24.60 | 29.16 | 8.2 | 63.8 |
| TARGET | 1 | 58.10 | 53.64 | 49.80 | 46.48 | 43.58 | 41.02 | 38.74 | 36.70 | 34.86 | 44.77 | 23.2 | 48.2 |
| LGA | 1 | 50.15 | 41.26 | 37.24 | 33.33 | 26.04 | 27.02 | 24.61 | 23.53 | 21.71 | 31.65 | 28.44 | 60.56 |
| Fed-L2P | 1 | 83.02 | 79.99 | 79.92 | 79.54 | 80.20 | 80.84 | 80.55 | 80.60 | 82.57 | 80.80 | 0.4 | 12.1 |
| Fed-DualP | 1 | 85.47 | 90.06 | 88.77 | 88.44 | 88.14 | 86.22 | 85.04 | 84.98 | 84.80 | 86.88 | 0.7 | 6.0 |
| Fed-CODAP | 1 | 90.94 | 83.65 | 81.53 | 80.21 | 79.50 | 77.48 | 76.71 | 75.89 | 75.24 | 80.13 | 15.7 | 12.8 |
| Fed-CPrompt | 1 | 93.42 | 91.72 | 88.94 | 87.89 | 87.00 | 84.12 | 81.82 | 81.15 | 81.04 | 86.34 | 12.38 | 5.87 |
| UOPP | 1 | 93.66 | 93.15 | 92.72 | 92.04 | 92.20 | 92.05 | 91.03 | 91.56 | 91.48 | 92.21 | 2.2 | 0.0 |
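The Avg and Gap columns can be reproduced from the per-session accuracies. The sketch below assumes that Avg is the arithmetic mean over sessions and that Gap is the Avg of UOPP minus the Avg of the compared method, which is consistent with the reported 5-shot rows of Table A10; the function names are hypothetical.

```python
def average(session_acc):
    """Arithmetic mean of the per-session accuracies (in %)."""
    return sum(session_acc) / len(session_acc)


def gap(reference_acc, method_acc):
    """Average accuracy of the reference method minus that of the compared method."""
    return average(reference_acc) - average(method_acc)


# 5-shot rows of Table A10 (sessions 0-8).
uopp = [93.65, 93.24, 92.97, 92.60, 92.73, 92.92, 92.49, 92.73, 92.92]
fed_s3c = [31.91, 32.97, 31.74, 30.96, 29.93, 28.92, 27.66, 27.06, 26.43]

uopp_avg = average(uopp)
fed_s3c_avg = average(fed_s3c)
fed_s3c_gap = gap(uopp, fed_s3c)
```

Rounded to the precision used in the table, these values match the reported Avg of 92.92 and 29.73 and the reported Gap of 63.2.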

Table A11: Numerical results of the consolidated algorithms on the CUB200 dataset with the 5-shot and 1-shot settings across 3 different seeded runs. S indicates the number of shots for the few-shot tasks, PD indicates the performance drop, and Gap indicates the gap between the respective method and our proposed method (UOPP).

Accuracy in each session (%), sessions 0-10:

| Method | S | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | 5 | 18.65 | 18.54 | 17.97 | 15.85 | 15.41 | 14.26 | 13.70 | 13.19 | 12.70 | 12.27 | 11.43 | 14.91 | 7.22 | 65.89 |
| TARGET | 5 | 32.03 | 27.75 | 25.44 | 23.48 | 21.80 | 20.35 | 19.08 | 17.96 | 16.96 | 16.07 | 15.26 | 21.47 | 16.77 | 59.33 |
| LGA | 5 | 25.07 | 22.95 | 21.03 | 19.89 | 17.86 | 16.11 | 14.94 | 13.88 | 12.26 | 11.19 | 10.91 | 16.92 | 14.16 | 63.88 |
| Fed-L2P | 5 | 73.24 | 69.49 | 63.32 | 60.46 | 61.05 | 56.65 | 53.24 | 51.98 | 51.47 | 49.42 | 50.57 | 58.26 | 22.67 | 22.54 |
| Fed-DualP | 5 | 78.98 | 77.83 | 72.34 | 67.29 | 65.75 | 62.44 | 58.25 | 55.17 | 51.99 | 50.95 | 50.77 | 62.89 | 28.21 | 17.91 |
| Fed-CODAP | 5 | 71.69 | 53.03 | 42.26 | 32.81 | 34.38 | 29.69 | 30.73 | 30.10 | 29.24 | 29.73 | 29.44 | 37.55 | 42.26 | 43.25 |
| Fed-CPrompt | 5 | 87.81 | 82.02 | 78.28 | 60.76 | 59.24 | 52.15 | 50.76 | 50.78 | 50.77 | 50.53 | 50.48 | 61.23 | 37.33 | 19.57 |
| UOPP | 5 | 86.18 | 85.95 | 84.96 | 83.02 | 81.62 | 79.48 | 78.57 | 78.15 | 77.70 | 77.86 | 75.28 | 80.80 | 10.90 | 0.00 |
| Fed-S3C | 1 | 18.65 | 18.22 | 17.34 | 15.65 | 14.86 | 13.64 | 13.22 | 12.64 | 11.80 | 11.57 | 10.79 | 14.40 | 7.85 | 62.33 |
| TARGET | 1 | 29.03 | 27.01 | 24.76 | 22.86 | 21.22 | 19.81 | 18.57 | 17.48 | 16.51 | 15.64 | 14.86 | 20.70 | 14.17 | 56.03 |
| LGA | 1 | 23.87 | 10.74 | 10.67 | 10.17 | 8.83 | 9.55 | 7.71 | 8.88 | 7.25 | 6.55 | 6.01 | 10.02 | 17.86 | 66.71 |
| Fed-L2P | 1 | 73.74 | 67.78 | 62.05 | 58.91 | 61.08 | 57.03 | 52.70 | 50.25 | 48.41 | 46.96 | 49.40 | 57.12 | 24.34 | 19.61 |
| Fed-DualP | 1 | 78.20 | 76.41 | 71.23 | 66.34 | 65.63 | 62.69 | 58.37 | 54.25 | 51.67 | 50.48 | 51.27 | 62.41 | 26.94 | 14.32 |
| Fed-CODAP | 1 | 73.07 | 56.54 | 48.81 | 39.62 | 37.46 | 35.15 | 32.56 | 30.81 | 28.32 | 28.10 | 27.39 | 39.80 | 45.68 | 36.93 |
| Fed-CPrompt | 1 | 87.22 | 72.41 | 66.88 | 50.86 | 59.63 | 53.31 | 51.58 | 50.99 | 47.60 | 50.02 | 50.36 | 58.26 | 36.86 | 18.47 |
| UOPP | 1 | 85.88 | 84.66 | 83.17 | 80.19 | 79.19 | 76.57 | 74.02 | 72.71 | 71.05 | 70.26 | 66.34 | 76.73 | 19.54 | 0.00 |

Table A12: Base-class accuracy of the consolidated algorithms on the CIFAR100 dataset with the 5-shot and 1-shot settings across 3 different seeded runs. S indicates the number of shots for the few-shot tasks, PD indicates the performance drop, and Gap indicates the gap between the respective method and our proposed method (UOPP).

Base-class accuracy in each session (%), sessions 0-8:

| Method | S | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | 5 | 44.51 | 48.90 | 49.46 | 48.93 | 48.69 | 47.78 | 47.64 | 47.58 | 46.54 | 47.78 | -3.07 | 40.83 |
| TARGET | 5 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 0.00 | 19.70 |
| LGA | 5 | 73.76 | 72.92 | 73.17 | 72.55 | 72.79 | 72.14 | 72.62 | 72.40 | 72.07 | 72.71 | 1.36 | 15.90 |
| LANDER | 5 | 66.90 | 65.63 | 65.13 | 63.62 | 63.33 | 62.53 | 63.78 | 64.78 | 64.15 | 64.43 | 2.12 | 24.18 |
| Fed-L2P | 5 | 73.47 | 77.66 | 79.49 | 81.02 | 82.53 | 83.81 | 83.53 | 84.19 | 84.78 | 81.17 | -10.72 | 7.44 |
| Fed-DualP | 5 | 76.39 | 85.00 | 87.19 | 87.50 | 87.89 | 87.60 | 87.98 | 87.86 | 88.09 | 86.17 | -11.47 | 2.44 |
| Fed-CODAP | 5 | 81.73 | 69.20 | 71.38 | 70.26 | 68.93 | 69.48 | 68.23 | 70.16 | 70.88 | 71.14 | 11.58 | 17.47 |
| Fed-CPrompt | 5 | 88.00 | 62.67 | 66.90 | 67.22 | 63.25 | 62.38 | 62.48 | 60.68 | 59.28 | 65.87 | 27.32 | 22.74 |
| UOPP | 5 | 90.57 | 90.57 | 90.56 | 90.57 | 90.57 | 90.56 | 90.57 | 90.56 | 72.97 | 88.61 | 0.01 | 0.00 |
| Fed-S3C | 1 | 44.51 | 49.42 | 49.96 | 49.52 | 49.56 | 49.21 | 48.97 | 48.93 | 48.06 | 48.68 | -4.42 | 40.68 |
| TARGET | 1 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 68.91 | 0.00 | 20.46 |
| LGA | 1 | 73.58 | 72.41 | 73.57 | 72.26 | 72.63 | 71.87 | 72.40 | 70.34 | 70.05 | 72.12 | 3.24 | 17.24 |
| LANDER | 1 | 66.90 | 65.43 | 64.12 | 63.10 | 62.65 | 59.60 | 63.57 | 62.83 | 61.95 | 63.35 | 4.07 | 26.01 |
| Fed-L2P | 1 | 77.00 | 78.36 | 79.67 | 80.19 | 80.59 | 81.52 | 81.00 | 80.93 | 80.33 | 79.95 | -3.93 | 9.41 |
| Fed-DualP | 1 | 78.66 | 86.60 | 88.35 | 87.74 | 87.82 | 87.88 | 87.57 | 87.70 | 87.11 | 86.60 | -9.04 | 2.76 |
| Fed-CODAP | 1 | 83.29 | 74.56 | 74.18 | 72.00 | 71.12 | 69.91 | 67.09 | 70.44 | 68.42 | 72.33 | 12.85 | 17.03 |
| Fed-CPrompt | 1 | 87.53 | 85.38 | 82.57 | 81.40 | 80.73 | 79.60 | 79.48 | 77.92 | 75.77 | 81.15 | 9.62 | 8.21 |
| UOPP | 1 | 90.65 | 90.65 | 90.65 | 90.66 | 90.66 | 89.48 | 88.82 | 87.97 | 84.72 | 89.36 | 2.68 | 0.00 |

Table A13: Novel-class accuracy of the consolidated algorithms on the CIFAR100 dataset with the 5-shot and 1-shot settings across 3 different seeded runs. S indicates the number of shots for the few-shot tasks, PD indicates the performance drop, and Gap indicates the gap between the respective method and our proposed method (UOPP).

Novel-class accuracy in each session (%), sessions 1-8:

| Method | S | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | 5 | 49.80 | 37.67 | 31.00 | 27.85 | 26.31 | 25.70 | 25.16 | 24.48 | 31.00 | 25.32 | 61.92 |
| TARGET | 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 92.92 |
| LGA | 5 | 32.27 | 20.10 | 11.13 | 9.12 | 6.87 | 6.75 | 5.33 | 4.13 | 11.96 | 28.14 | 80.96 |
| LANDER | 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 92.92 |
| Fed-L2P | 5 | 32.67 | 36.67 | 35.29 | 35.82 | 39.29 | 40.77 | 42.04 | 43.76 | 38.29 | -11.09 | 54.63 |
| Fed-DualP | 5 | 55.80 | 60.43 | 54.02 | 56.03 | 55.84 | 57.22 | 58.33 | 60.64 | 57.29 | -4.84 | 35.63 |
| Fed-CODAP | 5 | 70.40 | 67.40 | 62.33 | 61.88 | 58.12 | 56.52 | 55.60 | 53.98 | 60.78 | 16.43 | 32.14 |
| Fed-CPrompt | 5 | 88.20 | 83.70 | 68.07 | 63.80 | 62.20 | 58.37 | 58.23 | 58.58 | 67.64 | 29.63 | 25.27 |
| UOPP | 5 | 90.73 | 92.60 | 92.56 | 93.20 | 93.77 | 93.53 | 93.75 | 93.18 | 92.92 | -2.45 | 0.00 |
| Fed-S3C | 1 | 40.07 | 27.53 | 23.22 | 19.67 | 18.27 | 17.60 | 17.49 | 16.52 | 22.55 | 23.55 | 62.74 |
| TARGET | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 85.29 |
| LGA | 1 | 2.07 | 2.67 | 5.36 | 9.68 | 11.11 | 13.67 | 13.11 | 16.88 | 9.32 | -14.81 | 75.97 |
| LANDER | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 85.29 |
| Fed-L2P | 1 | 33.47 | 45.10 | 51.24 | 52.02 | 56.65 | 57.67 | 60.66 | 62.63 | 52.43 | -29.16 | 32.86 |
| Fed-DualP | 1 | 65.87 | 67.07 | 67.02 | 66.85 | 66.48 | 67.98 | 67.46 | 68.15 | 67.11 | -2.28 | 18.18 |
| Fed-CODAP | 1 | 57.80 | 62.65 | 53.87 | 56.73 | 55.04 | 52.13 | 53.49 | 51.55 | 55.41 | 6.25 | 29.88 |
| Fed-CPrompt | 1 | 52.60 | 59.30 | 48.73 | 49.95 | 46.68 | 45.87 | 45.06 | 43.45 | 48.95 | 9.15 | 36.34 |
| UOPP | 1 | 84.27 | 85.90 | 84.82 | 86.15 | 85.43 | 85.34 | 85.46 | 84.96 | 85.29 | -0.69 | 0.00 |

I DETAILED NUMERICAL RESULTS ON STABILITY-PLASTICITY ANALYSIS

In this section, we present the detailed numerical results on the stability-plasticity analysis of UOPP as shown in Tables A12, A13, and A14.

J DETAILED NUMERICAL RESULTS ON DIFFERENT LOCAL CLIENTS AND GLOBAL ROUNDS

In this section, we present the detailed numerical results for different numbers of local clients and global rounds, as shown in Tables A15 and A16.

K DETAILED NUMERICAL RESULTS OF ABLATION STUDY

In this section, we present the detailed numerical results of the ablation study, as shown in Table A17.

Table A14: Harmonic mean accuracy of the consolidated algorithms on the CIFAR100 dataset with the 5-shot and 1-shot settings across 3 different seeded runs. S indicates the number of shots for the few-shot tasks, PD indicates the performance drop, and Gap indicates the gap between the respective method and our proposed method (UOPP).

Harmonic mean accuracy in each session (%), sessions 1-8:

| Method | S | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fed-S3C | 5 | 49.35 | 42.76 | 37.95 | 35.43 | 33.93 | 33.39 | 32.91 | 32.08 | 37.23 | 17.26 | 53.27 |
| TARGET | 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 90.50 |
| LGA | 5 | 44.74 | 31.54 | 19.30 | 16.20 | 12.54 | 12.35 | 9.93 | 7.81 | 19.30 | 36.93 | 71.20 |
| LANDER | 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 90.50 |
| Fed-L2P | 5 | 45.99 | 50.18 | 49.16 | 49.95 | 53.50 | 54.79 | 56.08 | 57.72 | 52.17 | -11.74 | 38.33 |
| Fed-DualP | 5 | 67.37 | 71.39 | 66.80 | 68.44 | 68.20 | 69.34 | 70.12 | 71.83 | 69.19 | -4.46 | 21.31 |
| Fed-CODAP | 5 | 69.79 | 69.33 | 66.06 | 65.21 | 63.30 | 61.82 | 62.04 | 61.28 | 64.85 | 8.51 | 25.65 |
| Fed-CPrompt | 5 | 73.27 | 74.36 | 67.64 | 63.52 | 62.29 | 60.35 | 59.43 | 58.93 | 64.98 | 14.35 | 25.52 |
| PILoRA | 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 90.50 |
| UOPP | 5 | 90.65 | 91.57 | 91.55 | 91.86 | 92.14 | 92.03 | 92.13 | 81.85 | 90.47 | 8.80 | 0.00 |
| Fed-S3C | 1 | 44.25 | 35.50 | 31.62 | 28.16 | 26.64 | 25.90 | 25.77 | 24.58 | 30.30 | 19.67 | 56.90 |
| TARGET | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 87.20 |
| LGA | 1 | 4.02 | 5.15 | 9.97 | 17.09 | 19.24 | 22.99 | 22.10 | 27.20 | 15.97 | -23.19 | 71.23 |
| LANDER | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 87.20 |
| Fed-L2P | 1 | 46.90 | 57.60 | 62.53 | 63.22 | 66.85 | 67.37 | 69.34 | 70.38 | 63.02 | -23.48 | 24.18 |
| Fed-DualP | 1 | 74.82 | 76.25 | 76.00 | 75.91 | 75.70 | 76.54 | 76.26 | 76.47 | 75.99 | -1.65 | 11.21 |
| Fed-CODAP | 1 | 65.12 | 67.93 | 61.63 | 63.11 | 61.59 | 58.67 | 60.80 | 58.80 | 62.21 | 6.32 | 24.99 |
| Fed-CPrompt | 1 | 65.10 | 69.03 | 60.97 | 61.72 | 58.85 | 58.17 | 57.10 | 55.23 | 60.77 | 9.87 | 26.43 |
| PILoRA | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 87.20 |
| UOPP | 1 | 87.34 | 88.21 | 87.64 | 88.35 | 87.41 | 87.05 | 86.69 | 84.84 | 87.19 | 2.50 | 0.00 |

Table A15: Accuracy of the consolidated algorithms on the CIFAR100 dataset with the 5-shot setting for different numbers of selected local clients across 3 different seeded runs. S indicates the number of shots for the few-shot tasks, PD indicates the performance drop, and L indicates the number of selected local clients.

Accuracy in each session (%), sessions 0-8:

| Method | L | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S3C | 4 | 42.63 | 50.02 | 48.77 | 46.47 | 44.56 | 42.72 | 41.53 | 40.39 | 38.39 | 43.94 | 4.24 |
| S3C | 6 | 44.51 | 48.97 | 47.77 | 45.35 | 43.48 | 41.47 | 40.33 | 39.32 | 37.71 | 43.21 | 6.80 |
| S3C | 8 | 43.43 | 48.60 | 46.90 | 44.81 | 43.16 | 41.34 | 40.33 | 38.84 | 37.21 | 42.74 | 6.22 |
| TARGET | 4 | 66.75 | 61.62 | 57.21 | 53.40 | 50.06 | 47.12 | 44.50 | 42.16 | 40.05 | 51.43 | 26.70 |
| TARGET | 6 | 68.90 | 63.61 | 59.06 | 55.12 | 51.68 | 48.64 | 45.94 | 43.52 | 41.34 | 53.09 | 27.56 |
| TARGET | 8 | 73.53 | 67.88 | 63.03 | 58.83 | 55.15 | 51.91 | 49.02 | 46.44 | 44.12 | 56.66 | 29.41 |
| LGA | 4 | 72.98 | 68.8 | 62.81 | 57.57 | 54.29 | 51.51 | 49.81 | 46.67 | 42.58 | 56.34 | 30.40 |
| LGA | 6 | 73.76 | 69.80 | 65.59 | 60.26 | 56.87 | 52.94 | 50.66 | 47.69 | 44.89 | 58.05 | 28.87 |
| LGA | 8 | 73.73 | 70.03 | 65.9 | 60.4 | 56.31 | 52.68 | 50.37 | 47.32 | 44.61 | 57.93 | 29.12 |
| LANDER | 4 | 59.60 | 58.03 | 53.23 | 49.81 | 45.80 | 43.11 | 40.64 | 39.06 | 37.27 | 47.40 | 22.33 |
| LANDER | 6 | 58.60 | 61.75 | 56.26 | 52.11 | 47.71 | 44.71 | 41.69 | 40.28 | 38.87 | 49.11 | 19.73 |
| LANDER | 8 | 61.60 | 63.80 | 58.84 | 54.37 | 50.46 | 47.76 | 44.17 | 42.18 | 40.82 | 51.56 | 20.78 |
| Fed-DualP | 4 | 64.90 | 75.14 | 80.11 | 78.63 | 79.35 | 77.79 | 76.84 | 76.67 | 76.05 | 76.17 | -11.15 |
| Fed-DualP | 6 | 76.39 | 82.75 | 83.37 | 80.80 | 79.93 | 78.26 | 77.73 | 76.98 | 77.11 | 79.26 | -0.72 |
| Fed-DualP | 8 | 84.65 | 84.77 | 83.90 | 80.41 | 78.56 | 76.64 | 75.93 | 74.60 | 73.58 | 79.23 | 11.07 |
| Fed-CPrompt | 4 | 87.78 | 44.85 | 35.84 | 35.13 | 38.63 | 40.78 | 38.12 | 41.14 | 41.30 | 44.84 | 46.48 |
| Fed-CPrompt | 6 | 88.00 | 64.63 | 69.30 | 67.39 | 63.39 | 62.33 | 61.11 | 59.78 | 59.00 | 66.10 | 29.00 |
| Fed-CPrompt | 8 | 87.65 | 82.52 | 80.99 | 77.53 | 76.64 | 73.74 | 71.70 | 70.67 | 68.39 | 76.65 | 19.26 |
| UOPP | 4 | 89.18 | 89.49 | 89.57 | 89.91 | 90.35 | 89.99 | 89.23 | 88.88 | 84.96 | 89.06 | 4.22 |
| UOPP | 6 | 90.57 | 90.58 | 90.85 | 90.96 | 91.23 | 91.51 | 91.56 | 91.74 | 81.05 | 90.01 | 9.52 |
| UOPP | 8 | 90.93 | 90.91 | 91.40 | 91.61 | 91.74 | 91.78 | 91.26 | 91.36 | 90.90 | 91.32 | 0.03 |

Table A16: Accuracy of the consolidated algorithms on the CIFAR100 dataset with the 5-shot setting for different numbers of rounds across 3 different seeded runs. S indicates the number of shots for the few-shot tasks, PD indicates the performance drop, and R indicates the number of rounds.

Accuracy in each session (%), sessions 0-8:

| Method | R | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S3C | 54 | 43.42 | 49.51 | 48.01 | 45.41 | 43.69 | 41.96 | 40.89 | 39.64 | 38.10 | 43.40 | 5.32 |
| S3C | 72 | 51.20 | 53.95 | 51.97 | 49.09 | 47.10 | 45.11 | 43.74 | 42.88 | 41.13 | 47.35 | 10.07 |
| S3C | 90 | 44.51 | 48.97 | 47.77 | 45.35 | 43.48 | 41.47 | 40.33 | 39.32 | 37.71 | 43.21 | 6.80 |
| TARGET | 54 | 57.60 | 53.17 | 49.37 | 46.08 | 43.20 | 40.66 | 38.40 | 36.38 | 34.56 | 44.38 | 23.04 |
| TARGET | 72 | 67.28 | 62.11 | 57.67 | 53.83 | 50.46 | 47.49 | 44.86 | 42.49 | 40.37 | 51.84 | 26.91 |
| TARGET | 90 | 68.90 | 63.61 | 59.06 | 55.12 | 51.68 | 48.64 | 45.94 | 43.52 | 41.34 | 53.09 | 27.56 |
| LGA | 54 | 69.57 | 62.32 | 60.94 | 61.41 | 57.44 | 53.71 | 49.76 | 51.24 | 47.51 | 57.10 | 22.06 |
| LGA | 72 | 68.35 | 66.26 | 61.61 | 58.15 | 54.75 | 51.55 | 49.21 | 45.19 | 42.08 | 55.24 | 26.27 |
| LGA | 90 | 73.76 | 69.80 | 65.59 | 60.26 | 56.87 | 52.94 | 50.66 | 47.69 | 44.89 | 58.05 | 28.87 |
| LANDER | 54 | 60.60 | 43.89 | 40.59 | 38.16 | 34.89 | 32.74 | 31.04 | 30.14 | 29.23 | 37.92 | 31.37 |
| LANDER | 72 | 62.60 | 57.00 | 52.09 | 49.21 | 46.21 | 43.13 | 40.42 | 37.05 | 36.64 | 47.15 | 25.96 |
| LANDER | 90 | 58.60 | 61.75 | 56.26 | 52.11 | 47.71 | 44.71 | 41.69 | 40.28 | 38.87 | 49.11 | 19.73 |
| Fed-DualP | 54 | 79.40 | 81.40 | 83.44 | 82.67 | 82.39 | 81.85 | 81.31 | 80.78 | 80.32 | 81.51 | -0.92 |
| Fed-DualP | 72 | 78.85 | 83.25 | 83.96 | 81.61 | 80.28 | 79.73 | 78.63 | 78.15 | 77.06 | 80.17 | 1.79 |
| Fed-DualP | 90 | 76.39 | 82.75 | 83.37 | 80.80 | 79.93 | 78.26 | 77.73 | 76.98 | 77.11 | 79.26 | -0.72 |
| Fed-CPrompt | 54 | 87.92 | 72.05 | 64.70 | 54.99 | 66.09 | 52.84 | 56.98 | 55.86 | 51.01 | 62.49 | 36.91 |
| Fed-CPrompt | 72 | 87.92 | 72.42 | 49.91 | 57.41 | 60.11 | 55.87 | 53.98 | 49.23 | 49.84 | 59.63 | 38.08 |
| Fed-CPrompt | 90 | 88.00 | 64.63 | 69.30 | 67.39 | 63.39 | 62.33 | 61.11 | 59.78 | 59.00 | 66.10 | 29.00 |
| UOPP | 54 | 89.87 | 89.75 | 90.33 | 90.73 | 90.91 | 91.12 | 91.33 | 91.18 | 91.32 | 90.73 | -1.45 |
| UOPP | 72 | 90.37 | 90.29 | 90.70 | 90.99 | 90.94 | 91.34 | 91.27 | 91.07 | 90.13 | 90.79 | 0.24 |
| UOPP | 90 | 90.57 | 90.58 | 90.85 | 90.96 | 91.23 | 91.51 | 91.56 | 91.74 | 81.05 | 90.01 | 9.52 |

Table A17: Accuracy of different configurations on the CIFAR100 dataset with the 5-shot setting across 3 different seeded runs. S indicates the number of shots for the few-shot tasks, PD indicates the performance drop, and Gap indicates the accuracy difference to the full UOPP.

Accuracy in each session (%), sessions 0-8:

| Conf. | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg | PD | Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A (w/o Static Proto) | 84.37 | 82.54 | 80.91 | 80.56 | 80.29 | 80.54 | 80.70 | 79.03 | 76.61 | 80.62 | 7.76 | 9.39 |
| B (w/o Dynamic Proto) | 90.27 | 87.38 | 85.67 | 84.40 | 84.69 | 84.84 | 85.24 | 85.16 | 80.21 | 85.32 | 10.06 | 4.69 |
| C (w/o MLP Head) | 88.25 | 88.66 | 89.07 | 89.16 | 89.63 | 90.11 | 90.27 | 90.62 | 82.76 | 88.72 | 5.49 | 1.29 |
| D (w/o PB. Head) | 90.10 | 83.17 | 77.23 | 72.08 | 67.58 | 63.60 | 60.07 | 56.91 | 52.34 | 69.23 | 37.76 | 20.78 |
| UOPP | 90.57 | 90.58 | 90.85 | 90.96 | 91.23 | 91.51 | 91.56 | 91.74 | 81.05 | 90.01 | 9.52 | 0.00 |