# Federated Continual Learning with Weighted Inter-client Transfer

Jaehong Yoon 1 *, Wonyong Jeong 1 2 *, Giwoong Lee 1, Eunho Yang 1 2, Sung Ju Hwang 1 2

Abstract

There has been a surge of interest in continual learning and federated learning, both of which are important for deploying deep neural networks in real-world scenarios. Yet little research has been done regarding the scenario where each client learns on a sequence of tasks from a private local data stream. This problem of federated continual learning poses new challenges to continual learning, such as utilizing knowledge from other clients while preventing interference from irrelevant knowledge. To resolve these issues, we propose a novel federated continual learning framework, Federated Weighted Inter-client Transfer (FedWeIT), which decomposes the network weights into global federated parameters and sparse task-specific parameters, and each client receives selective knowledge from other clients by taking a weighted combination of their task-specific parameters. FedWeIT minimizes interference between incompatible tasks, and also allows positive knowledge transfer across clients during learning. We validate FedWeIT against existing federated learning and continual learning methods under varying degrees of task similarity across clients, and our model significantly outperforms them with a large reduction in the communication cost. Code is available at https://github.com/wyjeong/FedWeIT.

*Equal contribution. 1Korea Advanced Institute of Science and Technology (KAIST), South Korea. 2AITRICS, South Korea. Correspondence to: Jaehong Yoon, Wonyong Jeong. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Figure 1. Concept. A continual learner at a hospital which learns on a sequence of disease prediction tasks may want to utilize relevant task parameters from other hospitals. FCL allows such inter-client knowledge transfer via the communication of task-decomposed parameters.

1. Introduction

Continual learning (Thrun, 1995; Kumar & Daume III, 2012; Ruvolo & Eaton, 2013; Kirkpatrick et al., 2017; Schwarz et al., 2018) describes a learning scenario where a model continuously trains on a sequence of tasks; it is inspired by the human learning process, as a person learns to perform numerous tasks with large diversity over his/her lifespan, making use of past knowledge to learn about new tasks without forgetting previously learned ones. Continual learning is a long-studied topic since having such an ability leads to the potential of building a general artificial intelligence. However, there are crucial challenges in implementing it with conventional models such as deep neural networks (DNNs), such as catastrophic forgetting, which describes the problem where parameters or semantic representations learned for past tasks drift toward the new tasks during training. The problem has been tackled by various prior works (Kirkpatrick et al., 2017; Shin et al., 2017; Riemer et al., 2019). More recent works tackle other issues, such as scalability or order-robustness (Schwarz et al., 2018; Hung et al., 2019; Yoon et al., 2020). However, all of these models are fundamentally limited in that they can only learn from direct experience - they only learn from the sequence of tasks they have trained on.
Contrarily, humans can learn from the indirect experience of others, through different means (e.g., verbal communication, books, or various media). Then wouldn't it be beneficial to implement such an ability in a continual learning framework, such that multiple models learning on different machines can learn from the knowledge of tasks that have already been experienced by other clients? One problem that arises here is that, due to data privacy on individual clients and exorbitant communication cost, it may not be possible to communicate data directly between the clients or between the server and clients. Federated learning (McMahan et al., 2016; Li et al., 2018; Yurochkin et al., 2019) is a learning paradigm that tackles this issue by communicating the parameters instead of the raw data itself. We may have a server that receives the parameters locally trained on multiple clients, aggregates them into a single model parameter, and sends it back to the clients. Motivated by our intuition on learning from indirect experience, we tackle the problem of Federated Continual Learning (FCL), where we perform continual learning with multiple clients trained on private task sequences, which communicate their task-specific parameters via a global server.

Figure 2. Challenge of Federated Continual Learning. Interference from other clients, resulting from sharing irrelevant knowledge, may hinder an optimal training of target clients (Red), while relevant knowledge from other clients will be beneficial for their learning (Green).

Figure 1 depicts an example scenario of FCL. Suppose that we are building a network of hospitals, each of which has a disease diagnosis model which continuously learns to perform diagnosis given CT scans, for new types of diseases. Then, under our framework, any diagnosis model which has learned about a new type of disease (e.g., COVID-19) will transmit its task-specific parameters to the global server, which will redistribute them to other hospitals for the local models to utilize. This allows all participants to benefit from the new task knowledge without compromising data privacy. Yet, the problem of federated continual learning also brings new challenges. First, there is not only the catastrophic forgetting from continual learning, but also the threat of potential interference from other clients. Figure 2 describes this challenge with the results of a simple experiment. Here, we train a model for MNIST digit recognition while communicating the parameters from another client trained on a different dataset. When the knowledge transferred from the other client is relevant to the target task (SVHN), the model starts with high accuracy, converges faster, and reaches higher accuracy (green line), whereas the model underperforms the base model if the transferred knowledge is from a task highly different from the target task (CIFAR-10, red line). Thus, we need to selectively utilize knowledge from other clients to minimize the inter-client interference and maximize inter-client knowledge transfer. Another problem with federated learning is efficient communication: the communication cost could become excessively large when utilizing the knowledge of the other clients, and it could be the main bottleneck in practical scenarios with edge devices. Thus we want the knowledge to be represented as compactly as possible.
To tackle these challenges, we propose a novel framework for federated continual learning, Federated Weighted Inter-client Transfer (FedWeIT), which decomposes the local model parameters into a dense base parameter and sparse task-adaptive parameters. FedWeIT reduces the interference between different tasks since the base parameters will encode task-generic knowledge, while the task-specific knowledge will be encoded into the task-adaptive parameters. When we utilize the generic knowledge, we also want each client to selectively utilize task-specific knowledge obtained at other clients. To this end, we allow each model to take a weighted combination of the task-adaptive parameters broadcast from the server, such that it can select task-specific knowledge helpful for the task at hand. FedWeIT is communication-efficient, since the task-adaptive parameters are highly sparse and only need to be communicated once when created. Moreover, when communication efficiency is not a critical issue, as in cross-silo federated learning (Kairouz et al., 2019), we can use our framework to incentivize each client based on the attention weights on its task-adaptive parameters. We validate our method on multiple different scenarios with varying degrees of task similarity across clients, against various federated learning and local continual learning models. The results show that our model obtains significantly superior performance over all baselines and adapts faster to new tasks, with largely reduced communication cost. The main contributions of this paper are as follows:

- We introduce a new problem of Federated Continual Learning (FCL), where multiple models continuously learn on distributed clients, which poses new challenges such as prevention of inter-client interference and inter-client knowledge transfer.
- We propose a novel and communication-efficient framework for federated continual learning, which allows each client to adaptively update the federated parameter and selectively utilize the past knowledge from other clients, by communicating sparse parameters.

2. Related Work

Continual learning. While continual learning (Kumar & Daume III, 2012; Ruvolo & Eaton, 2013) is a long-studied topic with a vast literature, we only discuss recent relevant works. Regularization-based: EWC (Kirkpatrick et al., 2017) leverages the Fisher information matrix to restrict the change of the model parameters such that the model finds a solution that is good for both the previous and the current task, and IMM (Lee et al., 2017) proposes to learn the posterior distribution for multiple tasks as a mixture of Gaussians. Stable SGD (Mirzadeh et al., 2020) shows impressive performance gains by controlling essential hyperparameters and gradually decreasing the learning rate each time a new task arrives. Architecture-based: DEN (Yoon et al., 2018) tackles this issue by expanding the network size as necessary via iterative neuron/filter pruning and splitting, and RCL (Xu & Zhu, 2018) tackles the same problem using reinforcement learning. APD (Yoon et al., 2020) additively decomposes the parameters into shared and task-specific parameters to minimize the increase in network complexity. Coreset-based: GEM variants (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019) minimize the loss on both the actual dataset and a stored episodic memory. FRCL (Titsias et al., 2020) memorizes approximated posteriors of previous tasks with sophisticatedly constructed inducing points.
To the best of our knowledge, none of the existing approaches has considered communicability for continual learning of deep neural networks, which we tackle. CoLLA (Rostami et al., 2018) aims at solving multi-agent lifelong learning with sparse dictionary learning, but it does not have a central server to guide collaboration among clients and is formulated as a simple dictionary learning problem, and is thus not applicable to modern neural networks. Also, CoLLA is restricted to synchronous training with homogeneous clients.

Federated learning. Federated learning is a distributed learning framework under differential privacy, which aims to learn a global model on a server while aggregating the parameters learned at the clients on their private data. FedAvg (McMahan et al., 2016) aggregates the models trained across multiple clients by computing a weighted average of them based on the number of data points trained on. FedProx (Li et al., 2018) trains the local models with a proximal term which restricts their updates to be close to the global model. FedCurv (Shoham et al., 2019) aims to minimize the model disparity across clients during federated learning by adopting a modified version of EWC. Recent works (Yurochkin et al., 2019; Wang et al., 2020) introduce well-designed aggregation policies by leveraging Bayesian non-parametric methods. A crucial challenge of federated learning is the reduction of communication cost. TWAFL (Chen et al., 2019) tackles this problem by performing layer-wise parameter aggregation, where shallow layers are aggregated at every step, but deep layers are aggregated only in the last few steps of a loop. Karimireddy et al. (2020) suggest an algorithm for rapid convergence, which minimizes the interference among discrepant tasks at clients by sacrificing local optimality. This is the opposite direction from personalized federated learning methods (Fallah et al., 2020; Lange et al., 2020; Deng et al., 2020), which put more emphasis on the performance of local models. FCL is a parallel research direction to both, and to the best of our knowledge, ours is the first work that considers task-incremental learning of clients under the federated learning framework.

3. Federated Continual Learning with FedWeIT

Motivated by the human learning process from indirect experiences, we introduce a novel continual learning setting under federated learning, which we refer to as Federated Continual Learning (FCL). FCL assumes that multiple clients are trained on a sequence of tasks from private data streams, while communicating the learned parameters with a global server. We first formally define the problem in Section 3.1, and then propose naive solutions that straightforwardly combine existing federated learning and continual learning methods in Section 3.2. Then, in Sections 3.3 and 3.4, we discuss two novel challenges that are introduced by federated continual learning, and propose a novel framework, Federated Weighted Inter-client Transfer (FedWeIT), which can effectively handle the two problems while also reducing the client-to-server communication cost.

3.1. Problem Definition

In standard continual learning (on a single machine), the model iteratively learns from a sequence of tasks $\{\mathcal{T}^{(1)}, \mathcal{T}^{(2)}, \ldots, \mathcal{T}^{(T)}\}$, where $\mathcal{T}^{(t)}$ is the labeled dataset of the $t$-th task, $\mathcal{T}^{(t)} = \{x_i^{(t)}, y_i^{(t)}\}_{i=1}^{N_t}$, which consists of $N_t$ pairs of instances $x_i^{(t)}$ and their corresponding labels $y_i^{(t)}$. Assuming the most realistic situation, we consider the case where the task sequence is a task stream with an unknown arriving order, such that the model can access $\mathcal{T}^{(t)}$ only during the training period of task $t$, after which it becomes inaccessible. Given $\mathcal{T}^{(t)}$ and the model learned so far, the learning objective at task $t$ is:

$$\underset{\theta^{(t)}}{\text{minimize}}\ \mathcal{L}\big(\theta^{(t)};\, \theta^{(t-1)},\, \mathcal{T}^{(t)}\big),$$

where $\theta^{(t)}$ is the set of model parameters at task $t$.
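As a concrete illustration of this setup, below is a minimal, self-contained Python sketch of the single-machine continual learning loop; the linear model, synthetic tasks, and training procedure are illustrative placeholders of ours, not the models used in the experiments.

```python
import numpy as np

# Minimal sketch of single-machine continual learning: a single parameter
# vector theta is kept across tasks, and each task's data is only visible
# while that task is being trained on. Model and tasks are toy placeholders.

d = 10
theta = np.zeros(d)                      # theta^(0), carried across tasks

def make_task(seed, n=64):
    r = np.random.default_rng(seed)
    x = r.normal(size=(n, d))
    y = x @ r.normal(size=d)             # synthetic regression targets
    return x, y

for t in range(1, 4):                    # task stream T^(1), T^(2), T^(3)
    x, y = make_task(seed=t)             # T^(t) is accessible only here
    for _ in range(200):                 # minimize L(theta^(t); theta^(t-1), T^(t))
        grad = x.T @ (x @ theta - y) / len(x)
        theta -= 0.05 * grad
    # after this point T^(t) is discarded; theta^(t) carries over to task t+1
```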
We now extend conventional continual learning to the federated learning setting with multiple clients and a global server. Let us assume that we have $C$ clients, where each client $c_c \in \{c_1, \ldots, c_C\}$ trains a model on a privately accessible sequence of tasks $\{\mathcal{T}_c^{(1)}, \mathcal{T}_c^{(2)}, \ldots, \mathcal{T}_c^{(t)}\} \subseteq \mathcal{T}$. Please note that there is no relation among the tasks $\mathcal{T}_{1:C}^{(t)}$ received at step $t$ across clients. The goal is now to effectively train $C$ continual learning models on their own private task streams, by communicating the model parameters with the global server, which aggregates the parameters sent from each client and redistributes them to the clients.

Figure 3. Updates of FedWeIT. (a) Communication of general knowledge: a client sends the sparsified federated parameter $\mathbf{B}_c \odot \mathbf{m}_c^{(t)}$; after that, the server redistributes the aggregated parameters to the clients. (b) Communication of task-adaptive knowledge: the knowledge base stores the previous task-adaptive parameters of the clients, and each client selectively utilizes them with an attention mask.

3.2. Communicable Continual Learning

In conventional federated learning settings, the learning is done with multiple rounds of local learning and parameter aggregation. At each round of communication $r$, each client $c_c$ and the server $s$ perform the following two procedures: local parameter transmission, and parameter aggregation & broadcasting. In the local parameter transmission step, for a randomly selected subset of clients at round $r$, $\mathcal{C}^{(r)} \subseteq \{c_1, c_2, \ldots, c_C\}$, each client $c_c \in \mathcal{C}^{(r)}$ sends its updated parameters $\theta_c^{(r)}$ to the server. The server-client transmission is not done for every client because some of the clients may be temporarily disconnected. Then the server aggregates the parameters $\theta_c^{(r)}$ sent from the clients into a single parameter. The most popular frameworks for this aggregation are FedAvg (McMahan et al., 2016) and FedProx (Li et al., 2018). However, naive federated continual learning with these two algorithms on local sequences of tasks may result in catastrophic forgetting. One simple solution is to use a regularization-based method, such as Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), which allows the model to obtain a solution that is optimal for both the previous and the current tasks. There exist other advanced solutions (Nguyen et al., 2018; Chaudhry et al., 2019) that successfully prevent catastrophic forgetting. However, the prevention of catastrophic forgetting at the client level is a problem orthogonal to federated learning. Thus we focus on the challenges that newly arise in this federated continual learning setting.
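To make the naive setup concrete, the sketch below shows one such communication round with FedAvg-style weighted averaging. It is only an illustration of the baseline aggregation discussed above; `local_update` is a hypothetical placeholder for the client-side continual learning step and is not part of the paper's implementation.

```python
import numpy as np

def fedavg_round(theta_G, clients, local_update):
    """One naive FCL round: every reachable client trains locally on its current
    task starting from theta_G, and the server averages the returned parameters
    into a single global parameter (weighted by the clients' data sizes)."""
    updates, sizes = [], []
    for client in clients:                          # the sampled subset C^(r)
        theta_c, n_c = local_update(client, theta_G.copy())
        updates.append(theta_c)
        sizes.append(n_c)
    w = np.asarray(sizes, dtype=float)
    w /= w.sum()
    # a single aggregated parameter is broadcast back to every client
    return sum(wi * ti for wi, ti in zip(w, updates))
```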
In the federated continual learning framework, the aggregation of the parameters into a global parameter $\theta_G$ allows inter-client knowledge transfer, since a task $\mathcal{T}_i^{(q)}$ learned at client $c_i$ at round $q$ may be similar or related to $\mathcal{T}_j^{(r)}$ learned at client $c_j$ at round $r$. Yet, using a single aggregated parameter $\theta_G$ may be suboptimal in achieving this goal, since knowledge from irrelevant tasks may not be useful or may even hinder the training at each client by altering its parameters in incorrect directions, which we describe as inter-client interference. Another problem that is also practically important is communication efficiency. Both the parameter transmission from the client to the server and from the server to the client incur large communication costs, which is problematic for the continual learning setting, since the clients may train on a possibly unlimited stream of tasks.

3.3. Federated Weighted Inter-client Transfer

How can we then maximize the knowledge transfer between clients while minimizing the inter-client interference and the communication cost? We now describe our model, Federated Weighted Inter-client Transfer (FedWeIT), which can resolve these two problems that arise with a naive combination of continual learning approaches with the federated learning framework. The main cause of the problems, as briefly alluded to earlier, is that the knowledge of all tasks learned at multiple clients is stored into a single set of parameters $\theta_G$. However, for the knowledge transfer to be effective, each client should selectively utilize only the knowledge of the relevant tasks trained at other clients. This selective transfer is also the key to minimizing inter-client interference, as it disregards the knowledge of irrelevant tasks that may interfere with learning.

We tackle this problem by decomposing the parameters into three different types with different roles: global parameters ($\theta_G$) that capture the global and generic knowledge across all clients, local base parameters ($\mathbf{B}$) which capture generic knowledge for each client, and task-adaptive parameters ($\mathbf{A}$) for each specific task per client, motivated by Yoon et al. (2020). The set of model parameters $\theta_c^{(t)}$ for task $t$ at continual learning client $c_c$ is then defined as follows:

$$\theta_c^{(t)} = \mathbf{B}_c^{(t)} \odot \mathbf{m}_c^{(t)} + \mathbf{A}_c^{(t)} + \sum_{i \neq c} \sum_{j < |t|} \alpha_{i,j}^{(t)} \mathbf{A}_i^{(j)}, \qquad (1)$$

where $\mathbf{B}_c^{(t)} \in \{\mathbb{R}^{I_l \times O_l}\}_{l=1}^{L}$ is the set of base parameters of the $c$-th client shared across all tasks in the client, $\mathbf{m}_c^{(t)} \in \{\mathbb{R}^{O_l}\}_{l=1}^{L}$ is the set of sparse vector masks which adaptively transform $\mathbf{B}_c^{(t)}$ for task $t$, and $\mathbf{A}_c^{(t)} \in \{\mathbb{R}^{I_l \times O_l}\}_{l=1}^{L}$ is the set of sparse task-adaptive parameters at client $c_c$. Here, $L$ is the number of layers in the neural network, and $I_l$, $O_l$ are the input and output dimensions of the weights at layer $l$, respectively.

The first term allows selective utilization of the global knowledge. We want the base parameter $\mathbf{B}_c^{(t)}$ at each client to capture generic knowledge across all tasks across all clients. In Figure 3(a), we initialize it at each round $t$ with the global parameter from the previous iteration, $\theta_G^{(t-1)}$, which aggregates the parameters sent from the clients. This allows $\mathbf{B}_c^{(t)}$ to also benefit from the global knowledge about all the tasks. However, since $\theta_G^{(t-1)}$ also contains knowledge irrelevant to the current task, instead of using it as is, we learn the sparse mask $\mathbf{m}_c^{(t)}$ to select only the parameters relevant to the given task. This sparse parameter selection helps minimize inter-client interference, and also allows for efficient communication.
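For clarity, here is a minimal NumPy sketch of the per-layer composition in Equation 1. The shapes follow the definitions above, but the variable names and toy dimensions are ours, not those of the reference implementation.

```python
import numpy as np

def compose_layer(B, m, A_own, A_others, alpha):
    """theta = B * m + A_own + sum_j alpha_j * A_others[j]  (Eq. 1, one layer).

    B:        dense base parameter, shape (I_l, O_l)
    m:        sparse mask over output units, shape (O_l,)
    A_own:    this client's sparse task-adaptive parameter, shape (I_l, O_l)
    A_others: task-adaptive parameters received from other clients
    alpha:    learned attention weights over A_others
    """
    theta = B * m + A_own                      # masked base + own task knowledge
    for a_j, w_j in zip(A_others, alpha):      # weighted inter-client transfer
        theta = theta + w_j * a_j
    return theta

# toy usage with made-up shapes
I_l, O_l = 8, 4
rng = np.random.default_rng(0)
B = rng.normal(size=(I_l, O_l))
m = (rng.random(O_l) > 0.5).astype(float)      # crude sparse mask for the sketch
A_own = np.zeros((I_l, O_l))
A_others = [rng.normal(size=(I_l, O_l)) * 0.01 for _ in range(3)]
alpha = np.array([0.6, 0.3, 0.1])
theta_layer = compose_layer(B, m, A_own, A_others, alpha)
```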
The second term is the task-adaptive parameter $\mathbf{A}_c^{(t)}$. Since we additively decompose the parameters, this term learns to capture knowledge about the task that is not captured by the first term, and thus captures specific knowledge about the task $\mathcal{T}_c^{(t)}$. The final term describes weighted inter-client knowledge transfer. We have a set of parameters transmitted from the server, which contains the task-adaptive parameters from the other clients. To selectively utilize these indirect experiences from other clients, we further allocate an attention $\alpha_c^{(t)}$ over these parameters, to take a weighted combination of them. By learning this attention, each client can select only the relevant task-adaptive parameters that help it learn the given task. Although we design $\mathbf{A}_i^{(j)}$ to be highly sparse, using about 2-3% of the memory of the full parameters in practice, sending all task knowledge is not desirable. Thus we transmit randomly sampled task-adaptive parameters across all time steps from the knowledge base, which we empirically find to achieve good results in practice.

Training. We learn the decomposable parameters $\theta_c^{(t)}$ by optimizing the following objective:

$$\underset{\mathbf{B}_c^{(t)},\, \mathbf{m}_c^{(t)},\, \mathbf{A}_c^{(1:t)},\, \alpha_c^{(t)}}{\text{minimize}}\ \mathcal{L}\big(\theta_c^{(t)};\, \mathcal{T}_c^{(t)}\big) + \lambda_1 \Omega\big(\{\mathbf{m}_c^{(t)}, \mathbf{A}_c^{(1:t)}\}\big) + \lambda_2 \sum_{i=1}^{t-1} \big\| \Delta\mathbf{B}_c^{(t)} \odot \mathbf{m}_c^{(i)} + \Delta\mathbf{A}_c^{(i)} \big\|_2^2, \qquad (2)$$

where $\mathcal{L}$ is a loss function and $\Omega(\cdot)$ is a sparsity-inducing regularization term on all task-adaptive parameters and the masking variables (we use $\ell_1$-norm regularization) to make them sparse. The second regularization term is used for the retroactive update of the past task-adaptive parameters, which helps the task-adaptive parameters maintain the original solutions for their target tasks by reflecting the change of the base parameter. Here, $\Delta\mathbf{B}_c^{(t)} = \mathbf{B}_c^{(t)} - \mathbf{B}_c^{(t-1)}$ is the difference between the base parameters at the current and previous timesteps, and $\Delta\mathbf{A}_c^{(i)}$ is the difference between the task-adaptive parameters for task $i$ at the current and previous timesteps. This regularization is essential for preventing catastrophic forgetting. $\lambda_1$ and $\lambda_2$ are hyperparameters controlling the effect of the two regularizers.

Algorithm 1 Federated Weighted Inter-client Transfer

input: datasets $\{\mathcal{D}_c^{(1:t)}\}_{c=1}^{C}$, global parameter $\theta_G$, hyperparameters $\lambda_1, \lambda_2$, knowledge base $\mathbf{kb} \leftarrow \{\}$
output: $\{\mathbf{B}_c, \mathbf{m}_c^{(1:t)}, \alpha_c^{(1:t)}, \mathbf{A}_c^{(1:t)}\}_{c=1}^{C}$
1: Initialize $\mathbf{B}_c$ to $\theta_G$ for all clients $\mathcal{C} = \{c_1, \ldots, c_C\}$
2: for task $t = 1, 2, \ldots$ do
3:   Randomly sample knowledge base $\mathbf{kb}^{(t)} \subseteq \mathbf{kb}$
4:   for round $r = 1, 2, \ldots$ do
5:     Collect communicable clients $\mathcal{C}^{(r)} \subseteq \mathcal{C}$
6:     Distribute $\theta_G$ and $\mathbf{kb}^{(t)}$ to client $c_c \in \mathcal{C}^{(r)}$ if $c_c$ meets $\mathbf{kb}^{(t)}$ for the first time; otherwise distribute only $\theta_G$
7:     Minimize Equation 2 to solve the local CL problems
8:     $\mathbf{B}_c^{(t,r)} \odot \mathbf{m}_c^{(t,r)}$ are transmitted from $\mathcal{C}^{(r)}$ to the server
9:     Update $\theta_G \leftarrow \frac{1}{|\mathcal{C}^{(r)}|} \sum_c \mathbf{B}_c^{(t,r)} \odot \mathbf{m}_c^{(t,r)}$
10:   end for
11:   Update knowledge base $\mathbf{kb} \leftarrow \mathbf{kb} \cup \{\mathbf{A}_j^{(t)}\}_{j \in \mathcal{C}}$
12: end for

3.4. Efficient Communication via Sparse Parameters

FedWeIT learns via server-to-client communication. As discussed earlier, a crucial challenge here is to reduce the communication cost. We describe what happens at the client and the server at each step.

Client: At each round $r$, each client $c_c$ partially updates its base parameter with the nonzero components of the global parameter sent from the server; that is, $\mathbf{B}_c(n) = \theta_G(n)$, where $n$ is a nonzero element of the global parameter. After training the model using Equation 2, it obtains the sparsified base parameter $\widehat{\mathbf{B}}_c^{(t)} = \mathbf{B}_c^{(t)} \odot \mathbf{m}_c^{(t)}$ and the task-adaptive parameter $\mathbf{A}_c^{(t)}$ for the new task, both of which are sent to the server at a smaller cost compared to naive FCL baselines.
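A compact sketch of one communication round following Algorithm 1 is given below. Local training of Equation 2 is abstracted behind a placeholder, the mask is treated as an elementwise array for simplicity, and none of the names come from the released code.

```python
import numpy as np

def client_round(B, m, A, theta_G, train_locally):
    """Client side of one round (cf. Algorithm 1, lines 6-8): take over the
    nonzero entries of theta_G, run local training on the current task, then
    transmit the sparsified base B_hat = B * m together with the task-adaptive A."""
    nz = theta_G != 0
    B = B.copy()
    B[nz] = theta_G[nz]                   # partial update from the global parameter
    B, m, A = train_locally(B, m, A)      # placeholder for minimizing Eq. 2
    B_hat = B * m                         # sparse parameter sent to the server
    return B_hat, A

def server_round(B_hats):
    """Server side (cf. Algorithm 1, line 9): average the sparsified base parameters."""
    return sum(B_hats) / len(B_hats)
```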
While naive FCL baselines require $|\mathcal{C}| \times R \times |\theta|$ for client-to-server communication, FedWeIT requires $|\mathcal{C}| \times (R \times |\widehat{\mathbf{B}}| + |\mathbf{A}|)$, where $R$ is the number of communication rounds per task and $|\cdot|$ denotes the number of parameters.

Server: The server first aggregates the base parameters sent from all the clients by taking a weighted average of them: $\theta_G = \frac{1}{C} \sum_i \widehat{\mathbf{B}}_i^{(t)}$. Then, it broadcasts $\theta_G$ to all the clients. The task-adaptive parameters of task $t-1$, $\{\mathbf{A}_i^{(t-1)}\}_{i \in \mathcal{C} \setminus c}$, are broadcast once per client during the training of task $t$. While naive FCL baselines require $|\mathcal{C}| \times R \times |\theta|$ for the server-to-client communication cost, FedWeIT requires $|\mathcal{C}| \times (R \times |\theta_G| + (|\mathcal{C}|-1) \times |\mathbf{A}|)$, in which $\theta_G$ and $\mathbf{A}$ are highly sparse. We describe the FedWeIT algorithm in Algorithm 1.

Figure 4. Configuration of task sequences: We first split a dataset D into multiple sub-tasks in a non-IID manner ((a) and (b)). Then, we distribute them to multiple clients (C#). Mixed tasks from multiple datasets (colored circles) are distributed across all clients ((c)).

4. Experiments

We validate FedWeIT under two different configurations of task sequences, namely Overlapped-CIFAR-100 and NonIID-50, against relevant baselines. 1) Overlapped-CIFAR-100: We group the 100 classes of the CIFAR-100 dataset into 20 non-IID superclass tasks. Then, we randomly sample 10 out of the 20 tasks and split the instances to create a task sequence for each client with overlapping tasks. 2) NonIID-50: We use the following eight benchmark datasets: MNIST (LeCun et al., 1998), CIFAR-10/-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), Fashion-MNIST (Xiao et al., 2017), Not-MNIST (Bulatov, 2011), FaceScrub (Ng & Winkler, 2014), and Traffic Signs (Stallkamp et al., 2011). We split the classes in the 8 datasets into 50 non-IID tasks, each of which is composed of 5 classes that are disjoint from the classes used for the other tasks. This is a large-scale experiment, containing 280,000 images of 293 classes from 8 heterogeneous datasets. After generating and processing the tasks, we randomly distribute them to multiple clients as illustrated in Figure 4. We follow the metrics for accuracy and forgetting from recent works (Chaudhry et al., 2020; Mirzadeh et al., 2020; 2021).

Experimental setup. We use a modified version of LeNet (LeCun et al., 1998) for the experiments with both the Overlapped-CIFAR-100 and NonIID-50 datasets. Further, we use ResNet-18 (He et al., 2016) with the NonIID-50 dataset. We follow the other experimental setups of Serrà et al. (2018) and Yoon et al. (2020). For detailed descriptions of the task configuration, metrics, hyperparameters, and more experimental results, please see the supplementary file.

Baselines and our model. 1) STL: Single Task Learning at each arriving task. 2) EWC: Individual continual learning with EWC (Kirkpatrick et al., 2017) per client. 3) Stable-SGD: Individual continual learning with Stable SGD (Mirzadeh et al., 2020) per client. 4) APD: Individual continual learning with APD (Yoon et al., 2020) per client. 5) FedProx: FCL using the FedProx (Li et al., 2018) algorithm. 6) Scaffold: FCL using the Scaffold (Karimireddy et al., 2020) algorithm.
7) FedCurv: FCL using the FedCurv (Shoham et al., 2019) algorithm. 8) FedProx-[model]: FCL trained using the FedProx algorithm with [model]. 9) FedWeIT: Our FedWeIT algorithm.

Table 1. Average per-task performance on Overlapped-CIFAR-100 during FCL with 100 clients (F=0.05, R=20, 1,000 tasks in total).

| Methods | Accuracy | Forgetting | Model Size |
|---|---|---|---|
| FedProx | 24.11 (±0.44) | 0.14 (±0.01) | 1.22 GB |
| FedCurv | 29.11 (±0.20) | 0.09 (±0.02) | 1.22 GB |
| FedProx-SSGD | 22.29 (±0.51) | 0.14 (±0.01) | 1.22 GB |
| FedProx-APD | 32.55 (±0.29) | 0.02 (±0.01) | 6.97 GB |
| FedWeIT (Ours) | 39.58 (±0.27) | 0.01 (±0.00) | 4.03 GB |
| STL | 32.96 (±0.23) | – | 12.20 GB |

Figure 5. Averaged task adaptation during training of the last two (9th and 10th) tasks with 5 and 100 clients.

4.1. Experimental Results

We first validate our model on both Overlapped-CIFAR-100 and NonIID-50 task sequences against single task learning (STL), continual learning (EWC, APD), federated learning (FedProx, Scaffold, FedCurv), and naive federated continual learning (FedProx-based) baselines. Table 2 shows the final average per-task performance after the completion of (federated) continual learning on both datasets. We observe that FedProx-based federated continual learning (FCL) approaches degrade the performance of continual learning (CL) methods compared to the same methods without federated learning. This is because the aggregation of all client parameters that are learned on irrelevant tasks results in severe interference in the learning for each task, which leads to catastrophic forgetting and suboptimal task adaptation. Scaffold achieves poor performance on FCL, as its regularization on the local gradients is harmful for FCL, where all clients learn from different task sequences. While FedCurv reduces inter-task disparity in parameters, it cannot minimize inter-task interference, which causes it to underperform single-machine CL methods.

Table 2. Averaged per-task performance on both datasets during FCL with 5 clients (fraction=1.0). We measured task accuracy and model size after completing all learning phases, over 3 individual trials. We also measured the C2S/S2C communication cost for training each task.
NonIID-50 Dataset (F=1.0, R=20)

| Methods | Accuracy | Forgetting | Model Size | Client-to-Server Cost | Server-to-Client Cost |
|---|---|---|---|---|---|
| EWC (Kirkpatrick et al., 2017) | 74.24 (±0.11) | 0.10 (±0.01) | 61 MB | N/A | N/A |
| Stable SGD (Mirzadeh et al., 2020) | 76.22 (±0.26) | 0.14 (±0.01) | 61 MB | N/A | N/A |
| APD (Yoon et al., 2020) | 81.42 (±0.89) | 0.02 (±0.01) | 90 MB | N/A | N/A |
| FedProx (Li et al., 2018) | 68.03 (±2.14) | 0.17 (±0.01) | 61 MB | 1.22 GB | 1.22 GB |
| Scaffold (Karimireddy et al., 2020) | 30.84 (±1.41) | 0.11 (±0.02) | 61 MB | 2.44 GB | 2.44 GB |
| FedCurv (Shoham et al., 2019) | 72.39 (±0.32) | 0.13 (±0.02) | 61 MB | 1.22 GB | 1.22 GB |
| FedProx-EWC | 68.27 (±0.72) | 0.12 (±0.01) | 61 MB | 1.22 GB | 1.22 GB |
| FedProx-Stable-SGD | 75.02 (±1.44) | 0.12 (±0.01) | 79 MB | 1.22 GB | 1.22 GB |
| FedProx-APD | 81.20 (±1.52) | 0.01 (±0.01) | 79 MB | 1.22 GB | 1.22 GB |
| FedWeIT (Ours) | 84.11 (±0.27) | 0.00 (±0.00) | 78 MB | 0.37 GB | 1.08 GB |
| Single Task Learning | 85.78 (±0.17) | – | 610 MB | N/A | N/A |

Overlapped-CIFAR-100 Dataset (F=1.0, R=20)

| Methods | Accuracy | Forgetting | Model Size | Client-to-Server Cost | Server-to-Client Cost |
|---|---|---|---|---|---|
| EWC (Kirkpatrick et al., 2017) | 44.26 (±0.53) | 0.13 (±0.01) | 61 MB | N/A | N/A |
| Stable SGD (Mirzadeh et al., 2020) | 43.31 (±0.44) | 0.08 (±0.01) | 61 MB | N/A | N/A |
| APD (Yoon et al., 2020) | 50.82 (±0.41) | 0.02 (±0.01) | 73 MB | N/A | N/A |
| FedProx (Li et al., 2018) | 38.96 (±0.37) | 0.13 (±0.02) | 61 MB | 1.22 GB | 1.22 GB |
| Scaffold (Karimireddy et al., 2020) | 22.80 (±0.47) | 0.09 (±0.01) | 61 MB | 2.44 GB | 2.44 GB |
| FedCurv (Shoham et al., 2019) | 40.36 (±0.44) | 0.15 (±0.02) | 61 MB | 1.22 GB | 1.22 GB |
| FedProx-EWC | 41.53 (±0.39) | 0.13 (±0.01) | 61 MB | 1.22 GB | 1.22 GB |
| FedProx-Stable-SGD | 43.29 (±1.45) | 0.07 (±0.01) | 61 MB | 1.22 GB | 1.22 GB |
| FedProx-APD | 52.20 (±0.50) | 0.02 (±0.01) | 75 MB | 1.22 GB | 1.22 GB |
| FedWeIT (Ours) | 55.16 (±0.19) | 0.01 (±0.00) | 75 MB | 0.37 GB | 1.07 GB |
| Single Task Learning | 57.15 (±0.07) | – | 610 MB | N/A | N/A |

On the other hand, FedWeIT significantly outperforms both single-machine CL baselines and naive FCL baselines on both datasets. Even with a larger number of clients (C = 100), FedWeIT consistently outperforms all baselines (Figure 5). This improvement largely owes to FedWeIT's ability to selectively utilize the knowledge from other clients to rapidly adapt to the target task and obtain better final performance. The fast adaptation to new tasks is another clear advantage of inter-client knowledge transfer. To further demonstrate the practicality of our method with larger networks, we experiment on the NonIID-50 dataset with ResNet-18 (Table 3), on which FedWeIT still significantly outperforms the strongest baseline (FedProx-APD) while using fewer parameters.

Efficiency of FedWeIT. We also report the accuracy as a function of network capacity in Tables 1 to 3, which we measure by the number of parameters used. We observe that FedWeIT obtains much higher accuracy while utilizing a smaller number of parameters compared to FedProx-APD. This efficiency mainly comes from the reuse of task-adaptive parameters from other clients, which is not possible with single-machine CL methods or naive FCL methods. We also examine the communication cost (the size of the nonzero parameters transmitted) of each method. Table 2 reports both the client-to-server (C2S) and server-to-client (S2C) communication cost of training each task.

Figure 6. Catastrophic forgetting. Performance comparison of current task adaptation at the 3rd, 6th, and 8th tasks during federated continual learning on NonIID-50. We provide the full version in our supplementary file.
FedWeIT uses only about 30% and 3% of the parameters of the dense models for $\widehat{\mathbf{B}}$ and $\mathbf{A}$, respectively. We observe that FedWeIT is significantly more communication-efficient than the FCL baselines, even though it broadcasts task-adaptive parameters, due to the high sparsity of the parameters. Figure 7 shows the accuracy as a function of C2S cost when transmitting the top-κ% most informative parameters. Since FedWeIT selectively utilizes task-specific parameters learned from other clients, it achieves superior performance over APD-based baselines, especially with sparse communication of model parameters.

Catastrophic forgetting. Further, we examine how the performance on past tasks changes during continual learning, to see the severity of catastrophic forgetting with each method. Figure 6 shows the performance of FedWeIT and FCL baselines on the 3rd, 6th, and 8th tasks, at the end of training for later tasks. We observe that naive FCL baselines suffer from more severe catastrophic forgetting than local continual learning with EWC because of inter-client interference, where the knowledge of irrelevant tasks from other clients overwrites the knowledge of the past tasks. Contrarily, our model shows no sign of catastrophic forgetting. This is mainly due to the selective utilization of the prior knowledge learned from other clients through the global/task-adaptive parameters, which allows it to effectively alleviate inter-client interference. FedProx-APD also does not suffer from catastrophic forgetting, but it yields inferior performance due to ineffective knowledge transfer.

Table 3. FCL results on the NonIID-50 dataset with ResNet-18 (F=1.0, R=20).

| Methods | Accuracy | Forgetting | Model Size |
|---|---|---|---|
| APD | 92.44 (±0.17) | 0.02 (±0.00) | 1.86 GB |
| FedProx-APD | 92.89 (±0.22) | 0.02 (±0.01) | 2.05 GB |
| FedWeIT (Ours) | 94.86 (±0.13) | 0.00 (±0.00) | 1.84 GB |

Figure 7. Accuracy over client-to-server cost. We report the communication cost relative to the original network. All results are averaged over the 5 clients.

Weighted inter-client knowledge transfer. By analyzing the attention α in Equation 1, we examine which task parameters from other clients each client selected. Figure 8 shows examples of the attention weights learned for the 0th split of MNIST and the 10th split of CIFAR-100. We observe that large attentions are allocated to the task parameters from the same dataset (CIFAR-100 utilizes parameters from CIFAR-100 tasks with disjoint classes) or from a similar dataset (MNIST utilizes parameters from Traffic Signs and SVHN). This shows that FedWeIT effectively selects beneficial parameters to maximize inter-client knowledge transfer. This is an impressive result, since it does not know which datasets the parameters were trained on.

Figure 8. Inter-client transfer for NonIID-50. We compare the scale of the attentions at the first FC layer, which gives the weights on transferred task-adaptive parameters from other clients.

Figure 9. FedWeIT with asynchronous federated continual learning on the NonIID-50 dataset. We measure the test accuracy of all tasks per client.

Asynchronous Federated Continual Learning. We now consider FedWeIT under the asynchronous federated continual learning scenario, where there is no synchronization across clients for each task.
This is a more realistic scenario, since each task may require a different number of training rounds to converge during federated continual learning. Here, asynchronous implies that each task requires different training costs (i.e., time, epochs, or rounds). Under the asynchronous federated learning scenario, FedWeIT transfers any available task-adaptive parameters from the knowledge base to each client. In Figure 9, we plot the average test accuracy over all tasks during synchronous/asynchronous federated continual learning. In asynchronous FedWeIT, each task requires different training rounds and receives new tasks and task-adaptive parameters in an asynchronous manner, and the performance of asynchronous FedWeIT is almost identical to that of synchronous FedWeIT.

5. Conclusion

We tackled a novel problem of federated continual learning, which continuously learns local models at each client while allowing them to utilize indirect experience (task knowledge) from other clients. This poses new challenges such as inter-client knowledge transfer and prevention of inter-client interference between irrelevant tasks. To tackle these challenges, we additively decomposed the model parameters at each client into global parameters that are shared across all clients and sparse local task-adaptive parameters that are specific to each task. Further, we allowed each model to selectively update the global task-shared parameters and selectively utilize the task-adaptive parameters from other clients. The experimental validation of our model under varying task similarity across clients, against existing federated learning and continual learning baselines, shows that our model significantly outperforms the baselines with reduced communication cost. We believe that federated continual learning is a practically important topic of large interest to both the continual learning and federated learning research communities, and that it will lead to new research directions.

6. Acknowledgement

This work was supported by the Samsung Research Funding Center of Samsung Electronics (No. SRFC-IT150251), Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd., the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (No. 2016M3C4A7952634), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2018R1A5A1059921), and the Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UDI190031RD).

References

Bulatov, Y. Not-MNIST dataset. 2011.

Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with A-GEM. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

Chaudhry, A., Khan, N., Dokania, P. K., and Torr, P. H. Continual learning in low-rank orthogonal subspaces. In Advances in Neural Information Processing Systems (NIPS), 2020.

Chen, Y., Sun, X., and Jin, Y. Communication-efficient federated deep learning with asynchronous model update and temporally weighted aggregation. arXiv preprint arXiv:1903.07424, 2019.

Deng, Y., Kamani, M. M., and Mahdavi, M. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.

Fallah, A., Mokhtari, A., and Ozdaglar, A. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948, 2020.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Hung, C.-Y., Tu, C.-H., Wu, C.-E., Chen, C.-H., Chan, Y.-M., and Chen, C.-S. Compacting, picking and growing for unforgetting continual learning. In Advances in Neural Information Processing Systems (NIPS), 2019.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. Scaffold: Stochastic controlled averaging for on-device federated learning. In Proceedings of the International Conference on Machine Learning (ICML), 2020.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.

Krizhevsky, A. and Hinton, G. E. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.

Kumar, A. and Daume III, H. Learning task grouping and overlap in multi-task learning. In Proceedings of the International Conference on Machine Learning (ICML), 2012.

Lange, M. D., Jia, X., Parisot, S., Leonardis, A., Slabaugh, G., and Tuytelaars, T. Unsupervised model personalization while preserving privacy and scalability: An open problem. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Lee, S.-W., Kim, J.-H., Jun, J., Ha, J.-W., and Zhang, B.-T. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems (NIPS), 2017.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NIPS), 2017.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

Mirzadeh, S. I., Farajtabar, M., Pascanu, R., and Ghasemzadeh, H. Understanding the role of training regimes in continual learning. In Advances in Neural Information Processing Systems (NIPS), 2020.

Mirzadeh, S. I., Farajtabar, M., Gorur, D., Pascanu, R., and Ghasemzadeh, H. Linear mode connectivity in multitask
and continual learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.

Ng, H.-W. and Winkler, S. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pp. 343-347. IEEE, 2014.

Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. Variational continual learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., and Tesauro, G. Learning to learn without forgetting by maximizing transfer and minimizing interference. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

Rostami, M., Kolouri, S., Kim, K., and Eaton, E. Multi-agent distributed lifelong learning for collective knowledge acquisition. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

Schwarz, J., Luketina, J., Czarnecki, W. M., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.

Serrà, J., Surís, D., Miron, M., and Karatzoglou, A. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems (NIPS), 2017.

Shoham, N., Avidor, T., Keren, A., Israel, N., Benditkis, D., Mor-Yosef, L., and Zeitak, I. Overcoming forgetting in federated learning on non-IID data. arXiv preprint arXiv:1910.07796, 2019.

Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. The German traffic sign recognition benchmark: a multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, 2011.

Thrun, S. A Lifelong Learning Perspective for Mobile Robot Control. Elsevier, 1995.

Titsias, M. K., Schwarz, J., Matthews, A. G. d. G., Pascanu, R., and Teh, Y. W. Functional regularisation for continual learning with Gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.

Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning with matched averaging. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BkluqlSFDS.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Xu, J. and Zhu, Z. Reinforced continual learning. In Advances in Neural Information Processing Systems (NIPS), 2018.

Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

Yoon, J., Kim, S., Yang, E., and Hwang, S. J. Scalable and order-robust continual learning with additive parameter decomposition. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.

Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, T. N., and Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. Proceedings of the International Conference on Machine Learning (ICML), 2019.
Federated Continual Learning with Weighted Inter-client Transfer: Supplementary File

Organization. We provide in-depth descriptions and explanations that are not covered in the main document, and report additional experiments, organized as follows:

- Section A - We further describe the experimental details, including the network architecture, training configurations, forgetting measures, and datasets.
- Section B - We report additional experimental results, such as the effect of the communication frequency (Section B.1) and an ablation study on the Overlapped-CIFAR-100 dataset (Section B.2).

A. Experimental Details

We provide the experimental settings in further detail, including descriptions of the network architectures, hyperparameters, and dataset configuration.

Network Architecture. We utilize a modified version of LeNet and a conventional ResNet-18 as the backbone network architectures for validation. In the LeNet, the first two layers are convolutional layers with 20 and 50 filters of 5×5 kernels, followed by two fully-connected layers of 800 and 500 units each. Rectified linear unit activations and local response normalization are subsequently applied to each layer. We use 2×2 max-pooling after each convolutional layer. All layers are initialized with the variance scaling method. A detailed description of the LeNet architecture is given in Table A.4.

Table A.4. Implementation details of the base network architecture (LeNet). Note that T indicates the number of tasks that each client sequentially learns on.

| Layer | Filter Shape | Stride | Output |
|---|---|---|---|
| Input | N/A | N/A | 32 × 32 × 3 |
| Conv 1 | 5 × 5 × 20 | 1 | 32 × 32 × 20 |
| Max Pooling 1 | 3 × 3 | 2 | 16 × 16 × 20 |
| Conv 2 | 5 × 5 × 50 | 1 | 16 × 16 × 50 |
| Max Pooling 2 | 3 × 3 | 2 | 8 × 8 × 50 |
| Flatten | 3200 | N/A | 1 × 1 × 3200 |
| FC 1 | 800 | N/A | 1 × 1 × 800 |
| FC 2 | 500 | N/A | 1 × 1 × 500 |
| Softmax Classifier | – | N/A | 1 × 1 × (5 × T) |

Total number of parameters: 3,012,920.

Configurations. We use an Adam optimizer with adaptive learning rate decay, which decays the learning rate by a factor of 3 every 5 epochs with no consecutive decrease in the validation loss. We stop training early and start learning the next task (if available) when the learning rate reaches ρ. For the experiments on LeNet with 5 clients, we initialize the learning rate to 1e-3 × 1/3 at the beginning of each new task, and ρ = 1e-7. The mini-batch size is 100, the number of rounds per task is 20, and the number of epochs per round is 1. The setting for ResNet-18 is identical, except for the initial learning rate, which is 1e-4. In the case of experiments with 20 and 100 clients, we use the same settings except that the mini-batch size is reduced from 100 to 10 with an initial learning rate of 1e-4. We use client fractions of 0.25 and 0.05, respectively, at each communication round. We set $\lambda_1 \in [1\mathrm{e}{-1}, 4\mathrm{e}{-1}]$ and $\lambda_2 = 100$ for all experiments. Further, we use $\mu = 5\mathrm{e}{-3}$ for FedProx, and $\lambda \in [1\mathrm{e}{-2}, 1.0]$ for EWC and FedCurv. We initialize the attention parameters $\alpha_c^{(t)}$ to sum to one, $\alpha_{c,j}^{(t)} = 1/|\alpha_c^{(t)}|$.

Metrics. We evaluate all the methods on two metrics following the continual learning literature (Chaudhry et al., 2019; Mirzadeh et al., 2020).

1. Averaged Accuracy: We measure the averaged test accuracy over all tasks after the completion of continual learning at task $t$ by $A_t = \frac{1}{t}\sum_{i=1}^{t} a_{t,i}$, where $a_{t,i}$ is the test accuracy of task $i$ after learning on task $t$.

2. Averaged Forgetting: We measure forgetting as the averaged disparity between the best accuracy obtained on a task during continual training and its final accuracy. More formally, for $T$ tasks, the forgetting is defined as $F = \frac{1}{T-1}\sum_{i=1}^{T-1} \max_{t \in \{1,\ldots,T-1\}} \big(a_{t,i} - a_{T,i}\big)$.
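Both metrics can be computed directly from a matrix of per-task test accuracies; a small sketch follows (the matrix layout `acc[t, i]`, the accuracy on task i after training on task t, is our assumption, not part of the released code):

```python
import numpy as np

def averaged_accuracy(acc, t):
    """A_t = (1/t) * sum_{i<=t} a_{t,i}; t is 1-indexed, acc is 0-indexed."""
    return float(np.mean(acc[t - 1, :t]))

def averaged_forgetting(acc):
    """F = 1/(T-1) * sum_i [ max_{t<T} a_{t,i} - a_{T,i} ] over the first T-1 tasks."""
    T = acc.shape[0]
    drops = [acc[: T - 1, i].max() - acc[T - 1, i] for i in range(T - 1)]
    return float(np.mean(drops))
```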
Datasets. We create both the Overlapped-CIFAR-100 and NonIID-50 datasets. For Overlapped-CIFAR-100, we generate 20 non-IID tasks based on the 20 superclasses, each of which holds 5 subclasses. We split the instances of the 20 tasks according to the number of clients (5, 20, and 100) and then distribute the tasks across all clients. For the NonIID-50 dataset, we utilize 8 heterogeneous datasets and create 50 non-IID tasks in total, as shown in Table A.5. Then we arbitrarily select 10 tasks without duplication and distribute them to 5 clients. The average performance of single task learning on the dataset is 85.78 ± 0.17%, measured with our base LeNet architecture.

Table A.5. Dataset details of the NonIID-50 task, including the 8 heterogeneous datasets, number of sub-tasks, classes per sub-task, and instances of the train, valid, and test sets.

| Dataset | Num. Classes | Num. Tasks | Num. Classes (Task) | Num. Train | Num. Valid | Num. Test |
|---|---|---|---|---|---|---|
| CIFAR-100 | 100 | 15 | 5 | 36,750 | 10,500 | 5,250 |
| FaceScrub | 100 | 16 | 5 | 13,859 | 3,959 | 1,979 |
| Traffic Signs | 43 | 9 | 5 (3) | 32,170 | 9,191 | 4,595 |
| SVHN | 10 | 2 | 5 | 61,810 | 17,660 | 8,830 |
| MNIST | 10 | 2 | 5 | 42,700 | 12,200 | 6,100 |
| CIFAR-10 | 10 | 2 | 5 | 36,750 | 10,500 | 5,250 |
| Not-MNIST | 10 | 2 | 5 | 11,339 | 3,239 | 1,619 |
| Fashion-MNIST | 10 | 2 | 5 | 42,700 | 12,200 | 6,100 |
| Total | 293 | 50 | 248 | 278,078 | 39,723 | 79,449 |

Figure A.10. Task adaptation comparison between FedWeIT and APD using (a) 20 clients and (b) 100 clients. We visualize the last 5 tasks out of 10 tasks per client. The Overlapped-CIFAR-100 dataset is used after splitting instances according to the number of clients (20 and 100).

Table B.6. Experimental results on the Overlapped-CIFAR-100 dataset with 20 tasks. All results are the mean accuracies over 5 clients, averaged over 3 individual trials.

| Methods | Accuracy | Model Size | C2S/S2C Cost |
|---|---|---|---|
| FedProx | 29.76 ± 0.39 | 0.061 GB | 1.22 / 1.22 GB |
| FedProx-EWC | 27.80 ± 0.58 | 0.061 GB | 1.22 / 1.22 GB |
| FedProx-APD | 43.80 ± 0.76 | 0.093 GB | 1.22 / 1.22 GB |
| FedWeIT | 46.78 ± 0.14 | 0.092 GB | 0.37 / 1.07 GB |

B. Additional Experimental Results

We further include a quantitative analysis of the communication round frequency and additional experimental results across different numbers of clients.

B.1. Effect of the Communication Frequency

We provide an analysis of the effect of the communication frequency by comparing the performance of the model as a function of the number of training epochs per communication round. We run 4 different FedWeIT variants with 1, 2, 5, and 20 training epochs per round. Figure A.11 shows the performance of our FedWeIT variants. When clients frequently update the model parameters through communication with the central server, the model achieves higher performance while maintaining a smaller network capacity, since frequent communication efficiently updates the model parameters while transferring inter-client knowledge. However, it requires a much heavier communication cost than sparser communication. For example, the model trained for 1 epoch per round requires about 16.9 times larger total communication cost than the model trained for 20 epochs per round.
Hence, there is a trade-off between the model performance of federated continual learning and the communication efficiency, whereas FedWeIT variants consistently outperform (federated) continual learning baselines.

Figure A.11. Number of epochs per round. We show error bars over the number of training epochs per communication round on Overlapped-CIFAR-100 with 5 clients. All models transmit the full local base parameters and highly sparse task-adaptive parameters. All results are the mean accuracy over 5 clients across 3 individual trials. Red arrows at each point describe the standard deviation of the performance.

| Methods | Accuracy | Model Size | C2S/S2C Cost | Epochs / Round |
|---|---|---|---|---|
| FedWeIT | 55.16 ± 0.19 | 0.075 GB | 0.37 / 1.07 GB | 1 |
| FedWeIT | 55.18 ± 0.08 | 0.077 GB | 0.19 / 0.53 GB | 2 |
| FedWeIT | 53.73 ± 0.44 | 0.083 GB | 0.08 / 0.22 GB | 5 |
| FedWeIT | 53.22 ± 0.14 | 0.088 GB | 0.02 / 0.07 GB | 20 |

B.2. Ablation Study with Model Components

We perform an ablation study to analyze the role of each component of our FedWeIT. We compare the performance of four different variations of our model. w/o B communication describes the model that does not transfer the base parameter B and only communicates the task-adaptive ones. w/o A communication is the model that does not communicate task-adaptive parameters. w/o A is the model that trains only with sparse transmission of the local base parameter, and w/o m is the model without the sparse vector mask. As shown in Table B.7, without communicating B or A, the model yields significantly lower performance compared to the full model, since it does not benefit from inter-client knowledge transfer. The model w/o A obtains very low performance due to catastrophic forgetting, and the model w/o the sparse mask m achieves lower accuracy with larger capacity and cost, which demonstrates the importance of performing selective transmission.

Table B.7. Ablation studies to analyze the effectiveness of the parameter decomposition in FedWeIT. All experiments are performed on the NonIID-50 dataset.

| Methods | Acc. | Model Size | C2S/S2C Cost |
|---|---|---|---|
| FedWeIT | 84.11% | 0.078 GB | 0.37 / 1.07 GB |
| w/o B comm. | 77.88% | 0.070 GB | 0.01 / 0.01 GB |
| w/o A comm. | 79.21% | 0.079 GB | 0.37 / 1.04 GB |
| w/o A | 65.66% | 0.061 GB | 0.37 / 1.04 GB |
| w/o m | 78.71% | 0.087 GB | 1.23 / 1.25 GB |

B.3. Ablation Study with Regularization Terms

We also perform an additional analysis by eliminating the proposed regularization terms in Table B.8. As described in Section 3.3, without the ℓ1 term, the method achieves even better performance but requires significantly larger memory. Without the ℓ2 term, the method suffers from forgetting.

Table B.8. Ablation study on knowledge transfer (NonIID-50).

| Method | Avg. Accuracy | Memory Size | BwT |
|---|---|---|---|
| FedWeIT | 84.43% (±0.50) | 68.93 MB | -0.0014 |
| FedWeIT w/o ℓ1 | 87.12% (±0.24) | 354.41 MB | -0.0007 |
| FedWeIT w/o ℓ2 | 56.76% (±0.84) | 63.44 MB | -0.3203 |

B.4. Forgetting Analysis

In Figure B.12, we present the forgetting performance of our baseline models, namely Local-EWC/APD, FedAvg-EWC/APD, and FedProx-EWC/APD, and our method, FedWeIT. As shown in the figure, EWC-based methods, including local continual learning and the combinations with FedAvg and FedProx, show performance degradation while learning on new tasks. For example, the performance of FedAvg-EWC (red with inverted triangular markers) drops rapidly from Task 1 to Task 10, as visualized in the leftmost plot in the first row.
On the other hand, both the APD-based methods and our method show a compelling ability to prevent catastrophic forgetting, regardless of how many and which tasks the clients learn afterwards. We provide the corresponding results in Table 2 of the main document.

Figure B.12. Forgetting analysis. Performance change over the increasing number of tasks, for all tasks except the last one (1st to 9th), during federated continual learning on NonIID-50. We observe that our method does not suffer from forgetting on any task.