# CoDeC: Communication-Efficient Decentralized Continual Learning

Published in Transactions on Machine Learning Research (03/2024)

Sakshi Choudhary (choudh23@purdue.edu), Sai Aparna Aketi (saketi@purdue.edu), Gobinda Saha (gsaha@purdue.edu), Kaushik Roy (kaushik@purdue.edu)
Department of Electrical and Computer Engineering, Purdue University

Reviewed on OpenReview: https://openreview.net/forum?id=N05OnQG1BA

## Abstract

Training at the edge utilizes continuously evolving data generated at different locations. Privacy concerns prohibit the co-location of this spatially as well as temporally distributed data, making it crucial to design training algorithms that enable efficient continual learning over decentralized private data. Decentralized learning allows serverless training with spatially distributed data. A fundamental barrier in such setups is the high bandwidth cost of communicating model updates between agents. Moreover, existing works under this training paradigm are not inherently suitable for learning a temporal sequence of tasks while retaining the previously acquired knowledge. In this work, we propose CoDeC, a novel communication-efficient decentralized continual learning algorithm that addresses these challenges. We mitigate catastrophic forgetting while learning a distributed task sequence by incorporating orthogonal gradient projection within a gossip-based decentralized learning algorithm. Further, CoDeC includes a novel lossless communication compression scheme based on the gradient subspaces.
We theoretically analyze the convergence rate of our algorithm and demonstrate through an extensive set of experiments that CoDeC successfully learns distributed continual tasks with minimal forgetting. The proposed compression scheme results in up to a 4.8x reduction in communication costs without any loss in performance.¹

## 1 Introduction

Deep neural networks have demonstrated exceptional performance on many visual recognition tasks over the past decade. This has been fueled by the explosive growth of available training data and powerful computing resources. Edge devices such as smartphones, drones, and Internet-of-Things (IoT) sensors contribute towards generating this massive amount of data (Shi et al., 2020). Interestingly, this data is spatially distributed while continuously evolving over time. Large-scale deep neural network training has traditionally relied upon the availability of a humongous amount of data at a central server. This poses three main challenges: (1) high network bandwidth requirements to collect this dispersed data from numerous learning agents, (2) data privacy concerns for locally-generated data accessed by the central server, and (3) adapting to changing data distributions without expensive training from scratch.

¹The PyTorch implementation can be found at https://github.com/Sakshi09Ch/CoDeC

Figure 1: An overview of CoDeC. (a) Data for each incoming task is independently and identically distributed (IID) over the decentralized agents. Each agent has a GPM (Gradient Projection Memory) which is updated after learning each task. (b) Based on the sparse graph topology, the agents communicate coefficients associated with the model updates at each training iteration. (c) GPM partitions each layer's gradient space into two orthogonal subspaces.
This motivates the need for learning algorithms that enable efficient distributed training by utilizing spatially and temporally distributed data. Centralized distributed learning (i.e. federated learning) has emerged to train models over spatially distributed data without compromising user privacy (Konečný et al., 2016). This approach relies upon a central parameter server to collect local model updates, process them, and send the global updates back to the agents. However, the central server may become a single point of failure and cause network bandwidth issues (Assran et al., 2019). To address these concerns, several decentralized learning algorithms have been developed (Bianchi et al., 2013; Lan et al., 2017; Lian et al., 2017; Assran et al., 2019). Decentralized learning is a peer-to-peer learning paradigm, where agents communicate only with their neighbors without the need for a central server. Each agent learns a globally generalized model by aggregating locally computed model updates shared by neighbors. However, decentralized learning algorithms are not inherently equipped to thrive in dynamic learning environments with a temporal sequence of changing data distributions. Gradient-based optimization methods like plain SGD and DPSGD (Lian et al., 2017) inherently update model parameters by minimizing the loss function with respect to the current data distribution. This results in overwriting of parameters learned for the previous task(s), leading to catastrophic forgetting (Mccloskey & Cohen, 1989; Ratcliff, 1990). Hence, continual learning techniques focus on learning consecutive tasks without forgetting the previously acquired knowledge (Lee et al., 2017; Lopez-Paz & Ranzato, 2017; Saha et al., 2021a; Mallya et al., 2018; Kirkpatrick et al., 2017; Farajtabar et al., 2020; Wang et al., 2021; Saha et al., 2021). Table 1 summarizes previous works that address the challenges of learning with spatially and/or temporally distributed data.
In this paper, we propose CoDeC to enable serverless training with data distributed across space as well as time. To the best of our knowledge, this is the first work that demonstrates such a decentralized continual learning setup. Our algorithm has three components: (1) SGD combined with gossip averaging (Xiao & Boyd, 2003) as shown in (Lian et al., 2017) to learn with spatially distributed private data, (2) Gradient Projection Memory (GPM) (Saha et al., 2021) to continually learn a temporal task sequence with minimal forgetting, and (3) a novel lossless communication compression scheme to reduce the bandwidth requirements during training. Our setup is illustrated in Figure 1. GPM partitions each layer's gradient space into two orthogonal subspaces: Core Gradient Space (CGS) and Residual Gradient Space (RGS), as shown in Figure 1(c). Important gradient directions (CGS) for previous tasks are stored in gradient projection memory (GPM), and gradient updates for the new tasks are taken along RGS to minimize interference. We find the basis vectors that span RGS and represent model updates as a linear combination of these vectors. We communicate the coefficients associated with these basis vectors instead of the model updates and achieve lossless communication compression.

Table 1: Comparison with prior works in the Continual Learning (CL), Federated Learning (FL) and Decentralized Learning (DL) paradigms. Note that FL techniques require a server, while DL techniques are serverless. (✓*) denotes decentralized agents independently learning different tasks at a given time, and communicating via a fully-connected topology.

| Technique | CL | FL | DL |
|---|---|---|---|
| FedOpt (Konečný et al., 2016) | | ✓ | |
| DPSGD (Lian et al., 2017) | | | ✓ |
| EWC (Kirkpatrick et al., 2017) | ✓ | | |
| GPM (Saha et al., 2021) | ✓ | | |
| FedWeIT (Yoon et al., 2021) | ✓ | ✓ | |
| FLwF-2T (Usmanova et al., 2021) | ✓ | ✓ | |
| DCIL (Zhang et al., 2022) | ✓ | | ✓ |
| SKILL (Ge et al., 2023) | ✓ | | ✓* |
| CoDeC (ours) | ✓ | | ✓ |
Further, theoretical insights into the convergence rate of CoDeC show that it is possible to achieve rates similar to state-of-the-art decentralized learning approaches such as DPSGD (Lian et al., 2017). We provide empirical evidence by performing experiments over various standard image-classification datasets, networks, graph sizes, and topologies. We also extend our analysis by designing and evaluating a decentralized continual learning benchmark, MedMNIST-5, using biomedical image-classification datasets from MedMNIST-v2 (Yang et al., 2021). This imitates a practical real-world application where multiple healthcare organizations aim to learn a globally generalized model without sharing the locally accessible patients' data. These models need to be updated in a continual manner with the emergence of variants of a disease, new diseases, or new diagnostic methods.

Contributions: We summarize our contributions as follows:
- We propose CoDeC, a communication-efficient decentralized continual learning algorithm that addresses a challenging problem: leveraging spatially and temporally distributed data to optimize a global model while preserving data privacy.
- We introduce a novel lossless communication compression scheme based on gradient subspaces.
- We theoretically show that our algorithm converges at the rate of O(1/√(NK)), where N is the number of agents and K is the number of training iterations.
- Experiments over a variety of image-classification datasets, networks, graph sizes, and topologies demonstrate minimal forgetting and up to a 4.8x reduction in communication costs with iso-performance relative to the full-communication baseline.

## 2 Related Work

### 2.1 Decentralized Learning

Several works exist in the decentralized learning paradigm which enable distributed training without utilizing a central server (Bianchi et al., 2013; Lan et al., 2017; Lian et al., 2017; Assran et al., 2019; Balu et al., 2021).
DPSGD (Lian et al., 2017) provides a theoretical analysis of the convergence rate of decentralized learning algorithms, proving it to be similar to that of their centralized counterpart (Dean et al., 2012). In CoDeC, we utilize DPSGD (Lian et al., 2017) and modify it to send model updates instead of model parameters. Note that these existing works are not equipped to learn a temporal task sequence without forgetting past knowledge.

To reduce the communication overhead of decentralized learning, several compression techniques (Koloskova et al., 2019; Tang et al., 2019; Aketi et al., 2021) have been explored. DeepSqueeze (Tang et al., 2019) introduced error-compensated communication compression to decentralized training. Choco-SGD (Koloskova et al., 2019) communicates compressed model updates rather than parameters and achieves better performance than Tang et al. (2019). However, it is orthogonal to the compression scheme we present and can be used in synergy with our approach. Moreover, all of the above-mentioned compression techniques are lossy and require additional hyperparameter tuning.

### 2.2 Continual Learning

The majority of continual learning works fall into three categories (Wickramasinghe et al., 2023): network expansion, replay, and regularization-based methods. Network-expansion-based methods (Rusu et al., 2016; Lee et al., 2017) overcome catastrophic forgetting by dedicating different model parameters to each task. Replay-based methods store training samples from past tasks in memory or synthesize old data from generative models for rehearsal (Chaudhry et al., 2019; Rebuffi et al., 2017; Shin et al., 2017). Regularization-based methods penalize changes to parameters (Kirkpatrick et al., 2017; Zenke et al., 2017), or constrain gradient directions (Saha et al., 2021; Wang et al., 2021; Saha & Roy, 2023) important for previous tasks.
These methods rely on the availability of centrally located training data and hence are not directly applicable to a distributed learning scenario. Network-expansion-based methods in a decentralized continual learning setup may give rise to model heterogeneity across agents over time, while replay-based methods can lead to privacy concerns. Thus, we explore regularization-based methods such as GPM (Saha et al., 2021), SGP (Saha & Roy, 2023), EWC (Kirkpatrick et al., 2017) and SI (Zenke et al., 2017) in this work. We utilize GPM in CoDeC and show superior performance to D-EWC and D-SI, the decentralized continual learning baselines we implemented with EWC and SI respectively. We further extend CoDeC to incorporate scaled gradient updates as shown in SGP.

### 2.3 Distributed Continual Learning

FedWeIT (Yoon et al., 2021) tackled the problem of federated continual learning through the decomposition of model parameters at each client into global and local task-adaptive parameters. FLwF-2T (Usmanova et al., 2021) developed a distillation-based method for class-incremental federated continual learning. Unlike our serverless training setup, these works utilize a central server to aggregate and send global updates to the agents. CoLLA (Rostami et al., 2017) focused on multi-agent distributed lifelong learning and proposed a distributed optimization algorithm for a network of synchronous learning agents. Note that it uses parametric models and is not directly applicable to modern deep neural networks. SKILL (Ge et al., 2023) proposes a distributed lifelong learning mechanism, where each agent uses a common pre-trained backbone and learns a task-specific head module. After training, these task-specific heads are shared among agents via a fully-connected graph. However, each SKILL agent independently learns a different task at a given time, lacking the concept of collaborative learning as demonstrated in CoDeC.
## 3 Methodology

### 3.1 Problem Formulation

In this work, we optimize a DNN model to learn from spatially and temporally distributed data. The communication topology is modeled as a graph G = ([N], W), where N is the number of learning agents and W is the mixing matrix indicating the graph's connectivity. w_ij encodes the effect of agent j on agent i, and w_ij = 0 implies there is no direct communication link between the two agents. Consider a learning scenario where T tasks are learned sequentially. For any task τ ∈ {1, .., T}, the corresponding dataset D_τ is independently and identically distributed (IID) across the N agents as {D_τ,1, D_τ,2, ..., D_τ,N}. For each τ, we aim to minimize the global loss function F_τ(x) given in equation 1. Here, F_τ,i(d_τ,i, x) is the local loss function per task at agent i and f_τ,i(x) is the expected value of F_τ,i(d_τ,i, x) over the dataset D_τ,i:

$$\min_{x \in \mathbb{R}^d} F_\tau(x) = \frac{1}{N}\sum_{i=1}^{N} f_{\tau,i}(x), \quad \text{where } f_{\tau,i}(x) = \mathbb{E}_{d_{\tau,i} \sim D_{\tau,i}}\left[F_{\tau,i}(d_{\tau,i}, x)\right] \;\;\forall i \tag{1}$$

Decentralized optimization of this global loss function F_τ(x) is based on the current dataset D_τ. A crucial challenge is to optimize F_τ(x) such that the past information acquired from tasks 1, 2, .., (τ-1) is retained. Inspired by Saha et al. (2021), we define a subspace that contains important gradient directions associated with all the past tasks and modify the local gradient updates of the current task to be orthogonal to this subspace, i.e., to lie in RGS. This ensures minimal catastrophic forgetting. Typically, decentralized agents communicate the model parameters with their neighbors in each training iteration (Lian et al., 2017). Note that in the proposed algorithm the model updates lie in RGS, which is a smaller vector subspace compared to the entire gradient space. To utilize this property for enabling lossless communication compression, we communicate model updates with neighbors, similar to Koloskova et al. (2019), rather than the model parameters.
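The gossip-averaging mechanism underlying this decentralized optimization can be sketched numerically. The snippet below is an illustration, not the paper's implementation: the 8-agent directed ring and the 0.5 self-weight are assumptions for the example, chosen to produce a doubly stochastic mixing matrix.

```python
import numpy as np

def ring_mixing_matrix(n, self_weight=0.5):
    """Doubly stochastic mixing matrix W for a directed ring:
    each agent mixes its own model with one neighbor's copy."""
    W = np.eye(n) * self_weight
    for i in range(n):
        W[i, (i + 1) % n] = 1.0 - self_weight
    return W

# 8 agents, each holding a 4-dimensional "model" (one row per agent)
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
target = X.mean(axis=0)          # gossip preserves (and converges to) the mean

W = ring_mixing_matrix(8)
for _ in range(400):             # repeated gossip rounds: x_i <- sum_j w_ij x_j
    X = W @ X

print(np.allclose(X, target, atol=1e-8))  # every agent reaches the average
```

Because W is doubly stochastic with spectral radius of the non-consensus modes strictly below 1, repeated mixing drives all agents to the global average, which is what makes serverless consensus possible.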
### 3.2 Approach

We demonstrate the flow of CoDeC in Algorithm 1. All hyperparameters are synchronized between the agents at the beginning of training. Each agent i computes the gradient update g_i = ∇f_τ,i(d_τ,i; x_i) with respect to model parameters x_i, evaluated on mini-batch d_τ,i. We obtain g̃_i, the orthogonal projection of the local gradients, using the GPM memory M (line 6, Algorithm 1). The parameters of each agent are updated using this g̃_i, which ensures minimal forgetting. Then, each agent performs a gossip averaging step using x_i and x̂_j (line 8, Algorithm 1). x̂_j represents the copy of x_j maintained by all the neighbors of agent j, and in general x_j ≠ x̂_j. The computed model updates (denoted by q_i^k) lie in the RGS subspace spanned by the basis vectors contained in O^l. Therefore, we express them as a linear combination of these basis vectors and find the associated coefficients c_i to communicate with the neighbors (line 10, Algorithm 1). Upon receiving these coefficients, the agents reconstruct the neighbors' updates without any loss of information (line 13, Algorithm 1). Communicating the coefficients (c_i) leads to lossless compression, which we elaborate upon in Section 3.3. The local copy x̂_j is updated using the reconstructed model updates q_j (line 14, Algorithm 1). Note that our algorithm requires each agent to only store the sum of neighbors' models Σ_{j∈N(i)} w_ij x̂_j, resulting in O(1) memory overhead, independent of the number of neighbors. At the end of each task, important gradient directions are obtained using a Singular Value Decomposition (SVD) representation of the input activations of each layer (Saha et al., 2021). These gradient directions are added as basis vectors to the CGS matrix M and subsequently removed from the RGS matrix O. The SVD is calculated using a subset of training data at a randomly chosen agent and communicated to the other agents iteratively over the communication graph.
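The orthogonal projection step (line 6 of Algorithm 1) can be illustrated with a small NumPy sketch. The dimensions and the random orthonormal CGS basis M below are hypothetical stand-ins for the SVD-derived basis the algorithm actually uses.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 5                            # gradient dimension, CGS rank (illustrative)

# Orthonormal CGS basis M (d x r); in CoDeC this comes from the SVD of activations
M, _ = np.linalg.qr(rng.normal(size=(d, r)))

g = rng.normal(size=d)                  # a local stochastic gradient
g_tilde = g - M @ (M.T @ g)             # projection step (line 6, Algorithm 1)

# The projected update has no component along the CGS directions:
print(np.allclose(M.T @ g_tilde, 0.0))  # True
```

Since M has orthonormal columns, M M^T is the projector onto CGS, so subtracting it removes exactly the directions important for previous tasks while leaving the RGS component of the gradient untouched.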
### 3.3 Lossless Compression

Stochastic Gradient Descent (SGD) updates lie in the span of the input data points (Zhang et al., 2017). Leveraging this fact, GPM (Saha et al., 2021) performs SVD on a representation matrix R^l_τ and finds basis vectors corresponding to the important gradient directions for previous tasks. R^l_τ is constructed by performing a forward pass of n_s samples from the training dataset for task τ through the network and concatenating the input activations for each layer l (equation 2). Subsequently, the SVD of the representation R^l_τ in equation 2 is used to obtain the matrix U^l_τ containing a set of orthonormal basis vectors which span the entire gradient space:

$$R^l_\tau = [x^l_{1,\tau}, x^l_{2,\tau}, \ldots, x^l_{n_s,\tau}], \qquad \mathrm{SVD}(R^l_\tau) = U^l_\tau \Sigma^l_\tau (V^l_\tau)^T \tag{2}$$

The threshold hyperparameter ϵ_th determines the number of basis vectors chosen from U^l_τ to represent important gradient directions for any particular task. These vectors span a subspace of the gradient space which we define as the Core Gradient Space (CGS). They are added to the GPM matrix M = {(M^l)}_{l=1}^L.

Algorithm 1: Communication-Efficient Decentralized Continual Learning (CoDeC)

Input: Each agent i ∈ [1, N] initializes model parameters x_i^0, step size η, mixing matrix W = [w_ij]_{i,j∈[1,N]}, x̂_i^(0) = x_i^0, M^l = [ ] and O^l = [I] for all layers l = 1, 2, ..., L, GPM memory M = {(M^l)}_{l=1}^L, RGS matrix O = {(O^l)}_{l=1}^L, N(i): neighbors of agent i (including itself), T: total tasks, K: training iterations. Each agent simultaneously implements the TRAIN() procedure.

1.  procedure TRAIN()
2.    for τ = 1, ..., T do
3.      for k = 0, 1, ..., K-1 do
4.        sample d_τ,i ∼ D_τ,i
5.        g_i^k = ∇f_τ,i(d_τ,i; x_i^k)
6.        g̃_i^k = g_i^k − (M^l (M^l)^T) g_i^k   # for each layer l
7.        x_i^(k+1/2) = x_i^k − η g̃_i^k
8.        x_i^(k+1) = x_i^(k+1/2) + Σ_{j∈N(i)} w_ij (x̂_j^k − x_i^k)
9.        q_i^k = x_i^(k+1) − x_i^k
10.       c_i^k = (O^l)^T q_i^k
11.       for each j ∈ N(i) do
12.         send c_i^k and receive c_j^k
13.         q_j^k = (O^l) c_j^k
14.         x̂_j^(k+1) = q_j^k + x̂_j^k
15.       end
16.     end
        # GPM Update
17.     p = random(1, 2, ..., N)
18.     if i == p do
19.       update M^l, O^l for each layer l ∈ L
20.       update M = {(M^l)}_{l=1}^L
21.       update O = {(O^l)}_{l=1}^L
22.       send M, O to all agents
23.     end
24.   end
25.   return

For task τ > 1, we ensure that the CGS vectors being added to the GPM matrix in round τ are orthogonal to all the CGS vectors stored in M. Before performing SVD on the representation matrix R^l_τ for each layer l, we perform the following projection step:

$$\hat{R}^l_\tau = R^l_\tau - (M^l (M^l)^T) R^l_\tau \tag{3}$$

SVD is then performed on R̂^l_τ and new orthogonal basis vectors are added to M. This ensures that the newly added basis vectors for task τ are unique and orthogonal to the vectors already present in M. The following update rule is used to obtain the orthogonal gradient update g̃_i for the later tasks:

$$\tilde{g}_i = g_i - (M^l (M^l)^T) g_i \tag{4}$$

Here, g_i is the original local gradient update at agent i at layer l, and the projection of g_i on CGS is (M^l (M^l)^T) g_i. Let the input space for a layer be of dimension n_l. This implies that U^l_τ contains n_l orthonormal basis vectors. Now, based on ϵ_th, after every task, a set of r_l basis vectors corresponding to the top r_l singular values is stored in M. Hence, g̃_i lies in an (n_l − r_l)-dimensional orthogonal subspace denoted as the Residual Gradient Space (RGS). The basis vectors which span RGS are the remaining (n_l − r_l) vectors contained in U^l_τ. We store them in the RGS matrix O = {(O^l)}_{l=1}^L. Note that (n_l − r_l) < n_l, and r_l increases as the task sequence progresses. We note that the gradient updates tend to lie in a smaller subspace (i.e. RGS) whose dimensionality decreases based on ϵ_th and the number of tasks.

In Algorithm 1, model updates q_i^k are computed at every training iteration k. Since all the local gradients g̃_i^k lie in RGS, the updates q_i^k also lie in RGS (refer to A.6 for the proof).
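Because the updates lie in RGS, encoding them as RGS coefficients and decoding at the neighbor is a change of basis with no information loss. The sketch below uses illustrative dimensions (a real layer's n_l and r_l would come from the SVD and ϵ_th):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 16, 5                             # layer input dim n_l, CGS rank r_l (illustrative)

# Split an orthonormal basis of R^d into CGS (M) and RGS (O)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
M, O = U[:, :r], U[:, r:]                # d x r and d x (d - r)

# A model update that lies in RGS, as the projected gradients guarantee
q = O @ rng.normal(size=d - r)

c = O.T @ q                              # encode: only (d - r) coefficients are sent
q_rec = O @ c                            # decode at the neighbor

print(np.allclose(q, q_rec))             # True: reconstruction is exact (lossless)
```

Only d − r numbers cross the wire instead of d, so the per-layer compression ratio is d/(d − r), which grows as more CGS directions accumulate over the task sequence.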
Therefore, we express the layer-wise q_i^k as a linear combination of the basis vectors in O^l and find the associated coefficients c_i^k. The neighbors of agent i reconstruct the updates q_i^k from the received c_i^k. This encoding and decoding of q_i^k requires two additional matrix multiplications (lines 10 and 13 in Algorithm 1). Our approach ensures that all agents have the same M and O matrices so that the reconstruction is exact. Hence, we achieve lossless communication compression by virtue of taking orthogonal gradient updates to avoid catastrophic forgetting.

## 4 Convergence Rate Analysis

In this section, we provide a convergence analysis for our algorithm. In particular, we provide an upper bound on ‖∇F(x̄_k)‖², the gradient norm achieved by the model averaged across all agents. Since our claims are valid for each task, the task subscript is dropped for the following analysis. We make the following assumptions:

Assumption 1 - Lipschitz Gradients: Each function f_i(x) is L-smooth.

Assumption 2 - Bounded Variance: The variance of the stochastic gradients is assumed to be bounded. There exist constants σ and δ such that

$$\mathbb{E}_{d \sim D_i}\,\|\nabla F_i(x; d) - \nabla f_i(x)\|^2 \leq \sigma^2 \quad \forall i, x \tag{5}$$

$$\frac{1}{N}\sum_{i=1}^{N} \|\nabla f_i(x) - \nabla F(x)\|^2 \leq \delta^2 \quad \forall x \tag{6}$$

Assumption 3 - Doubly Stochastic Mixing Matrix: W is a real doubly stochastic matrix with λ_1(W) = 1 and max{|λ_2(W)|, |λ_N(W)|} ≤ ρ < 1, where λ_i(W) is the i-th largest eigenvalue of W and ρ is a constant.

The above assumptions are commonly used in most decentralized learning works (Lian et al., 2017; Tang et al., 2019; Esfandiari et al., 2021). Since we modify the original gradient update g_i, we introduce an additional assumption:

Assumption 4 - Bounded Orthogonal Updates: For each agent i, we have

$$\|\tilde{g}_i\| \leq \mu\,\|g_i\| \tag{7}$$

where μ ∈ (0, 1] signifies how constrained the gradient space is. In particular, μ encapsulates the average impact of the dimension of the RGS subspace during training.
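Assumption 4 can be probed numerically: since g̃_i is an orthogonal projection of g_i, the ratio ‖g̃_i‖/‖g_i‖ can never exceed 1. The Gaussian "gradients" and random CGS basis below are purely illustrative, not drawn from any trained model.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 64, 16                                    # illustrative dimensions
M, _ = np.linalg.qr(rng.normal(size=(d, r)))     # stand-in CGS basis

# Ratio ||g_tilde|| / ||g|| for many random "gradients"
ratios = []
for _ in range(1000):
    g = rng.normal(size=d)
    g_tilde = g - M @ (M.T @ g)
    ratios.append(np.linalg.norm(g_tilde) / np.linalg.norm(g))

mu_hat = max(ratios)        # empirical mu for this subspace
print(0.0 < mu_hat <= 1.0)  # True: an orthogonal projection never grows the norm
```

A smaller RGS (larger r relative to d) pushes the typical ratio down, which matches the intuition that μ shrinks as the gradient space becomes more constrained over the task sequence.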
To ensure that the gradient update after projection is in the descent direction, we provide the following lemma:

Lemma 1. Given that the original gradient update −g_i is in the descent direction, the orthogonal gradient update −g̃_i is also in the descent direction.

Please refer to Appendix A.1 for the proof. Theorem 2 presents the convergence of CoDeC (proof in Appendix A.3).

Theorem 2. Given assumptions 1-4, let the step size η satisfy

$$\eta L \leq \frac{1-\rho}{\sqrt{(1-\rho)^2 + 12\mu^2}} \tag{8}$$

For all K ≥ 1, we have

$$\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\left[\|\nabla F(\bar{x}_k)\|^2\right] \leq \frac{1}{C_1 K}\,\mathbb{E}\left[F(\bar{x}_0) - F^*\right] + \frac{C_2\sigma^2}{N} + C_3\,\eta^2\mu^2\left(\frac{3\sigma^2}{(1-\rho)^2} + \frac{3\delta^2}{(1-\rho)^2}\right) \tag{9}$$

where C_1 = (η/2)(1 − ηL), C_2 = Lη²/(2C_1), and C_3 = L²η/(2C_1).

The result of Theorem 2 shows that the norm of the average gradient achieved by the consensus model is upper-bounded by the suboptimality gap (F(x̄_0) − F*), the sampling variance (σ), the gradient variations (δ), and the constraint on the gradient space (μ). The suboptimality gap signifies how good the model initialization is. σ indicates the variation in gradients due to stochasticity, while δ is related to gradient variations across the agents. From equation 9, we observe that μ appears in the last term and effectively scales σ and δ. A detailed explanation of the constraints on the step size η is presented in Appendix A.4. We present a corollary to show the convergence rate of CoDeC in terms of the training iterations. Note that we denote a_n = O(b_n) if a_n ≤ c·b_n, where c > 0 is a constant.

Corollary 3. Suppose that the step size satisfies η = O(√(N/K)). For a sufficiently large K and some constant C > 0,

$$\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\left[\|\nabla F(\bar{x}_k)\|^2\right] \leq C\left(\frac{1}{\sqrt{NK}} + \frac{\mu^2}{K}\right) \tag{10}$$

The proof for Corollary 3 is detailed in Appendix A.5. It indicates that CoDeC achieves a convergence rate of O(1/√(NK)) for each task. This rate is similar to the well-known best result in decentralized SGD algorithms (Lian et al., 2017). Since μ² appears only in the higher-order 1/K term, it does not affect the order of the convergence rate.
## 5 Experimental Setup

Implementation details: For each task, the data distribution is IID across the agents. The agents communicate with their neighbors after every mini-batch update. We present results for different graph topologies and sizes: a directed ring with 4/8/16 agents and an undirected torus with 8/16 agents. In the directed ring topology, each agent has only 1 neighbor. Meanwhile, the torus topology has higher connectivity, with 3 and 4 neighbors for graph sizes of 8 and 16 agents respectively. We evaluate CoDeC on three well-known continual learning benchmark datasets: 10-Split CIFAR-100 (Krizhevsky, 2009), 20-Split miniImageNet (Vinyals et al., 2016) and a sequence of 5-Datasets (Ebrahimi et al., 2020). 10-Split CIFAR-100 is constructed by splitting CIFAR-100 into 10 tasks, where each task comprises 10 classes. We use a 5-layer AlexNet for experiments with Split CIFAR-100. 20-Split miniImageNet has 20 sequential tasks, where each task comprises 5 classes. The sequence of 5-Datasets includes CIFAR-10, MNIST, SVHN (Netzer et al., 2011), notMNIST (Bulatov, 2011) and FashionMNIST (Xiao et al., 2017). For Split miniImageNet and 5-Datasets, we use a reduced ResNet18 architecture similar to Lopez-Paz & Ranzato (2017). We design MedMNIST-5, a biomedical decentralized continual learning benchmark based on the datasets in MedMNIST-v2 (Yang et al., 2021). MedMNIST-5 consists of the following sequential classification tasks: TissueMNIST, OrganAMNIST, OCTMNIST, PathMNIST and BloodMNIST. A ResNet-18 architecture is used to evaluate the performance. For all experiments, batch normalization parameters are learned for the first task and frozen for subsequent tasks. We use a multi-head setting, where each task has a separate final classifier with no constraints on gradient updates. Please refer to Appendices A.7, A.8 and A.9 for details related to architectures, dataset statistics, and training hyperparameters, respectively.
Baselines: To the best of our knowledge, our proposed setup is unique and hence there are no directly comparable baselines. Therefore, for a fair comparison, we implement two baselines, D-EWC and D-SI, based on the well-known continual learning techniques Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) and Synaptic Intelligence (SI) (Zenke et al., 2017) respectively. Please refer to Algorithms 2 and 3 for the detailed implementations. At the end of each task, these methods compute statistics used to penalize parameter updates and mitigate forgetting. D-EWC computes the Fisher information matrix F, while D-SI utilizes parameter-specific contributions to changes in the total loss to compute the importance measure Ω. To provide an upper bound on performance, we add two baselines: STL and D-STL. Single Task Learning (STL) represents a setting where all the tasks are learned sequentially in a centralized setup without any constraints. The Decentralized Single Task Learning (D-STL) baseline extends STL to a decentralized setting.

Performance Metrics: We mainly focus on the following metrics: (1) Average Accuracy (ACC) measures the average test classification accuracy over all tasks; (2) Backward Transfer (BWT) indicates the impact on past knowledge after learning new tasks, where negative BWT implies forgetting; (3) Communication Compression (CC) measures the relative reduction in communication cost achieved through our lossless compression scheme. ACC and BWT can be formally defined as:

$$ACC = \frac{1}{T}\sum_{i=1}^{T} A_{T,i}; \qquad BWT = \frac{1}{T-1}\sum_{i=1}^{T-1}\left(A_{T,i} - A_{i,i}\right) \tag{11}$$

Here, T is the total number of tasks and A_{T,i} is the accuracy of the model on the i-th task after learning T tasks sequentially.

## 6 Results and Discussions

Continual Learning Statistics Aggregation: We implement two versions of each technique based on how continual learning statistics are computed and consolidated at the end of each task: broadcast and all-gather.
In broadcast, an agent is randomly chosen and the corresponding statistics are calculated and sent to all the other agents. In all-gather, all agents compute these statistics using their local data, and the global average of these statistics is utilized to mitigate forgetting. Note that all-gather incurs more computational cost than broadcast. D-SI and D-EWC give sub-optimal performance with broadcast, while CoDeC gives similar performance for both versions, as shown in Table 2 (refer to Table 11 in the Appendix for additional results). Hence, for the results in Tables 3, 4, 5 and 6 we choose all-gather for D-EWC and D-SI, and broadcast for CoDeC.

Table 2: Impact of broadcast and all-gather continual learning statistics aggregation for Split miniImageNet over an 8-agent directed ring.

| Setup | ACC (%) | BWT (%) |
|---|---|---|
| D-SI (broadcast) | 36.14 ± 1.61 | -12.90 ± 1.15 |
| D-SI (all-gather) | 45.58 ± 1.24 | -3.67 ± 1.27 |
| D-EWC (broadcast) | 37.20 ± 0.58 | 0.27 ± 0.15 |
| D-EWC (all-gather) | 46.39 ± 1.54 | -1.64 ± 1.11 |
| CoDeC (broadcast) | 53.22 ± 1.82 | 0.08 ± 0.45 |
| CoDeC (all-gather) | 53.25 ± 1.48 | 0.50 ± 0.21 |

Performance Comparison: We present two versions of our approach: CoDeC, which uses the lossless compression scheme, and CoDeC(f), an implementation with full communication. As shown in Table 3, for Split CIFAR-100 we obtain 3-4% better ACC than D-EWC with a similar order of BWT. CoDeC achieves 12-13% better ACC than D-SI, with marginally better BWT. Our proposed compression technique results in a 1.86x reduction in communication cost on average without any performance degradation. Table 4 demonstrates learning a longer task sequence, Split miniImageNet, and we outperform both D-EWC and D-SI by 6-11% in terms of ACC, with better BWT in some cases. We achieve a 1.43x reduction in communication cost on average over a range of graph sizes. Learning dynamics of task 1 after learning each task in Split miniImageNet are shown in Figure 2.
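The ACC and BWT values reported in these tables follow equation 11. As a minimal illustration (the 3-task accuracy matrix below is hypothetical, not from our experiments):

```python
import numpy as np

def continual_metrics(A):
    """A[t, i]: accuracy on task i after sequentially learning tasks 0..t.
    Returns (ACC, BWT) as defined in equation 11."""
    T = A.shape[0]
    acc = A[T - 1].mean()                                    # average final accuracy
    bwt = np.mean([A[T - 1, i] - A[i, i] for i in range(T - 1)])
    return acc, bwt

# Hypothetical 3-task accuracy matrix (percent)
A = np.array([[80.0,  0.0,  0.0],
              [78.0, 70.0,  0.0],
              [77.0, 69.0, 75.0]])

acc, bwt = continual_metrics(A)
print(round(acc, 2))  # 73.67
print(bwt)            # -2.0
```

The negative BWT here reflects the accuracy lost on tasks 1 and 2 after the full sequence was learned, which is exactly the forgetting the tables quantify.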
Note that for D-SI, the training converges only for a lower learning rate as compared to D-EWC and Co De C (details in Appendix A.9). Results on 5-Datasets demonstrate learning across diverse datasets. As shown in Table 5, although we report better BWT for DEWC and D-SI in most cases, we achieve 0.5-5% better accuracy and upto 2.2x reduced communication cost. We believe that final average accuracy (ACC) and backward transfer (BWT) are both equally important in continual learning scenarios. It is possible to achieve zero BWT by freezing the model weights after learning the first task. However, that leads to sub-optimal ACC due to lack of ability to learn the subsequent tasks effectively. In our setup, we strive to achieve a balance between these two metrics by tuning the threshold hyperparameter ϵth accordingly. It is possible to use a higher ϵth and achieve a lower BWT, but at the cost of lower ACC. Hence, ϵth can be tuned to achieve lower BWT or higher ACC as required. Additionally, we present results on Med MNIST-5, a decentralized continual learning benchmark we propose to imitate a scenario where healthcare organizations aim to optimize a global model while maintaining Published in Transactions on Machine Learning Research (03/2024) Table 3: Split CIFAR-100 over Alex Net using directed ring and torus. ( ): methods that don t adhere to CL setup and provide an upper-bound on the performance. 
| Agents | Setup | Ring ACC (%) | Ring BWT (%) | Ring CC | Torus ACC (%) | Torus BWT (%) | Torus CC |
|---|---|---|---|---|---|---|---|
| - | STL | 70.56 ± 0.20 | - | - | - | - | - |
| 4 | D-STL | 69.22 ± 0.10 | - | - | - | - | - |
| 4 | D-SI | 44.90 ± 0.10 | -0.83 ± 0.64 | 1x | - | - | - |
| 4 | D-EWC | 53.12 ± 0.62 | 0.24 ± 0.18 | 1x | - | - | - |
| 4 | CoDeC(f) | 57.54 ± 0.25 | -1.22 ± 0.22 | 1x | - | - | - |
| 4 | CoDeC | 57.83 ± 0.25 | -0.95 ± 0.05 | 1.86x | - | - | - |
| 8 | D-STL | 64.99 ± 0.41 | - | - | 65.17 ± 0.44 | - | - |
| 8 | D-SI | 39.54 ± 0.16 | -1.08 ± 0.85 | 1x | 39.36 ± 0.40 | -1.25 ± 0.65 | 1x |
| 8 | D-EWC | 50.52 ± 0.58 | 0.51 ± 0.09 | 1x | 49.41 ± 0.88 | 0.29 ± 0.27 | 1x |
| 8 | CoDeC(f) | 53.57 ± 0.38 | -0.65 ± 0.52 | 1x | 53.54 ± 0.35 | -1.15 ± 0.41 | 1x |
| 8 | CoDeC | 53.63 ± 0.25 | -0.43 ± 0.33 | 1.85x | 53.62 ± 0.29 | -0.64 ± 0.36 | 1.86x |
| 16 | D-STL | 58.31 ± 0.49 | - | - | 59.29 ± 0.12 | - | - |
| 16 | D-SI | 34.66 ± 1.15 | -1.23 ± 0.40 | 1x | 34.86 ± 0.68 | -1.16 ± 0.54 | 1x |
| 16 | D-EWC | 45.52 ± 0.60 | 0.22 ± 0.34 | 1x | 44.53 ± 0.77 | -0.20 ± 0.56 | 1x |
| 16 | CoDeC(f) | 48.05 ± 0.45 | -0.38 ± 0.12 | 1x | 48.19 ± 0.27 | -0.29 ± 0.11 | 1x |
| 16 | CoDeC | 48.16 ± 0.33 | -0.18 ± 0.28 | 1.84x | 48.36 ± 0.04 | -0.26 ± 0.31 | 1.84x |

Table 4: Split miniImageNet over ResNet-18 using directed ring and torus. ( ): methods that don't adhere to the CL setup.

| Agents | Setup | Ring ACC (%) | Ring BWT (%) | Ring CC | Torus ACC (%) | Torus BWT (%) | Torus CC |
|---|---|---|---|---|---|---|---|
| - | STL | 70.18 ± 2.75 | - | - | - | - | - |
| 4 | D-STL | 69.36 ± 0.78 | - | - | - | - | - |
| 4 | D-SI | 51.01 ± 0.86 | -4.00 ± 0.61 | 1x | - | - | - |
| 4 | D-EWC | 52.81 ± 2.80 | -1.07 ± 2.03 | 1x | - | - | - |
| 4 | CoDeC(f) | 60.03 ± 0.75 | 0.36 ± 1.01 | 1x | - | - | - |
| 4 | CoDeC | 59.00 ± 2.56 | -0.79 ± 0.27 | 1.51x | - | - | - |
| 8 | D-STL | 63.13 ± 0.86 | - | - | 66.27 ± 1.47 | - | - |
| 8 | D-SI | 45.58 ± 1.24 | -3.67 ± 1.27 | 1x | 46.00 ± 0.73 | -3.21 ± 0.51 | 1x |
| 8 | D-EWC | 46.39 ± 1.54 | -1.64 ± 1.11 | 1x | 48.23 ± 3.14 | -1.02 ± 1.16 | 1x |
| 8 | CoDeC(f) | 53.22 ± 1.82 | 0.08 ± 0.45 | 1x | 59.90 ± 0.48 | 0.37 ± 0.24 | 1x |
| 8 | CoDeC | 53.30 ± 1.25 | -0.46 ± 0.48 | 1.37x | 59.97 ± 0.87 | -0.19 ± 0.98 | 1.53x |
| 16 | D-STL | 57.09 ± 1.55 | - | - | 63.51 ± 0.61 | - | - |
| 16 | D-SI | 39.55 ± 0.87 | -2.03 ± 0.69 | 1x | 39.96 ± 0.47 | -1.74 ± 0.96 | 1x |
| 16 | D-EWC | 39.67 ± 1.37 | -1.32 ± 1.18 | 1x | 45.14 ± 0.18 | -0.64 ± 0.23 | 1x |
| 16 | CoDeC(f) | 45.29 ± 3.58 | -0.99 ± 1.40 | 1x | 51.03 ± 2.51 | -0.01 ± 0.67 | 1x |
| 16 | CoDeC | 45.68 ± 0.77 | 0.61 ± 0.79 | 1.42x | 51.32 ± 1.05 | 0.26 ± 0.56 | 1.39x |

patients' privacy. Results in Table 6 show that CoDeC outperforms D-EWC and D-SI by 7% and 1%, respectively, in terms of ACC while incurring minimal BWT. The lossless compression scheme results in a 2x reduction in communication cost.
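The idea behind the lossless compression scheme can be illustrated with a small sketch. This is an assumption-laden toy, not the released code: we assume an orthonormal RGS basis of which every agent holds an identical copy, and the function and variable names are ours. Because the model update lies in the span of that basis, an agent can transmit only the k basis coefficients instead of all d entries, and the receiver reconstructs the update exactly.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def compress(update, rgs_basis):
    """Sender: project the update onto the RGS basis -> k coefficients."""
    return [dot(b, update) for b in rgs_basis]

def decompress(coeffs, rgs_basis):
    """Receiver: rebuild the update from coefficients and its own basis copy."""
    d = len(rgs_basis[0])
    return [sum(c * b[j] for c, b in zip(coeffs, rgs_basis)) for j in range(d)]

# Hypothetical orthonormal RGS basis spanning 2 of 4 dimensions.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
update = [0.5, -0.25, 0.0, 0.0]        # update constrained to lie in the RGS span
coeffs = compress(update, basis)       # 2 numbers sent instead of 4 (CC = 2x here)
assert decompress(coeffs, basis) == update   # exact reconstruction: lossless
```

The compression ratio in this picture is simply d/k, which is why the reported CC grows as the RGS subspace shrinks over the task sequence.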
In all our experiments, we observe that ACC decreases as the graph size increases, while BWT remains of a similar order. We also compare CoDeC with SKILL, a distributed lifelong learning algorithm (results in Table 12). Furthermore, we present results for a non-IID data distribution across the agents in Table 13.

Task-wise Communication Costs: The reduction in communication cost reflects the constraints on the direction of the gradient updates. As the gradient updates are not constrained for the first task, they occupy the entire gradient space. However, gradient updates after learning task 1 are constrained to the RGS subspace, whose dimensionality decreases as the task sequence progresses. This implies an increase

Table 5: 5-Datasets over ResNet-18 using directed ring and torus. ( ): methods that don't adhere to the CL setup.

| Agents | Setup | Ring ACC (%) | Ring BWT (%) | Ring CC | Torus ACC (%) | Torus BWT (%) | Torus CC |
|---|---|---|---|---|---|---|---|
| - | STL | 92.79 ± 0.08 | - | - | - | - | - |
| 4 | D-STL | 92.51 ± 0.18 | - | - | - | - | - |
| 4 | D-SI | 82.44 ± 0.29 | -1.52 ± 0.13 | 1x | - | - | - |
| 4 | D-EWC | 86.82 ± 0.25 | -3.37 ± 0.80 | 1x | - | - | - |
| 4 | CoDeC(f) | 87.24 ± 0.23 | -4.05 ± 0.05 | 1x | - | - | - |
| 4 | CoDeC | 87.41 ± 0.44 | -4.03 ± 0.30 | 2.13x | - | - | - |
| 8 | D-STL | 92.31 ± 0.06 | - | - | 92.32 ± 0.15 | - | - |
| 8 | D-SI | 80.36 ± 0.15 | -2.64 ± 0.07 | 1x | 79.55 ± 0.33 | -3.07 ± 0.13 | 1x |
| 8 | D-EWC | 85.69 ± 0.19 | -0.92 ± 0.14 | 1x | 82.99 ± 3.25 | -2.10 ± 1.60 | 1x |
| 8 | CoDeC(f) | 86.54 ± 0.04 | -4.37 ± 0.17 | 1x | 85.92 ± 0.18 | -5.10 ± 0.17 | 1x |
| 8 | CoDeC | 86.23 ± 0.22 | -4.61 ± 0.32 | 2.17x | 86.15 ± 0.17 | -4.85 ± 0.26 | 2.19x |
| 16 | D-STL | 92.16 ± 0.16 | - | - | 91.76 ± 0.09 | - | - |
| 16 | D-SI | 78.53 ± 0.62 | -5.20 ± 0.56 | 1x | 77.00 ± 0.11 | -5.76 ± 0.22 | 1x |
| 16 | D-EWC | 82.19 ± 0.45 | -0.18 ± 0.05 | 1x | 81.48 ± 0.12 | -0.56 ± 0.14 | 1x |
| 16 | CoDeC(f) | 86.36 ± 0.15 | -4.36 ± 0.19 | 1x | 84.91 ± 0.20 | -5.48 ± 0.22 | 1x |
| 16 | CoDeC | 86.41 ± 0.16 | -4.37 ± 0.24 | 2.16x | 85.00 ± 0.55 | -5.52 ± 0.35 | 2.23x |

Table 6: MedMNIST-5 over ResNet-18 using a 16-agent directed ring topology.
| Setup | ACC (%) | BWT (%) | CC |
|---|---|---|---|
| D-STL | 76.55 ± 0.70 | - | - |
| D-SI | 61.53 ± 0.49 | -7.51 ± 0.72 | 1x |
| D-EWC | 55.62 ± 1.23 | -0.72 ± 0.25 | 1x |
| CoDeC(f) | 62.51 ± 0.46 | -0.37 ± 0.27 | 1x |
| CoDeC | 62.47 ± 0.13 | -0.07 ± 0.44 | 2.03x |

in compression ratios, which is clearly reflected in our results highlighting task-wise CC in Figure 3. In essence, as the gradient space becomes more constrained, it suffices for agents to communicate less with their neighbors. Hence, we achieve a CC of 2.1x for task 2, increasing up to 4.8x for task 5.

Scope of ACC-CC Trade-off: It is possible to obtain better ACC with CoDeC, albeit at the cost of reduced CC. This is achieved by relaxing the orthogonal gradient constraint and allowing gradient updates along some CGS basis vectors, as demonstrated in Scaled Gradient Projection (SGP) (Saha & Roy, 2023). The gradient update along each CGS basis vector is scaled according to its importance λ for the past tasks.

Figure 2: Evolution of task 1 accuracy over the course of 20 tasks from Split miniImageNet over a 4-agent ring topology.

Figure 3: Task-wise CC for 5-Datasets over ResNet-18.

Table 7: Split CIFAR-100 over AlexNet using directed ring and torus topologies with scaled gradient updates.
| Agents | Setup | Ring ACC (%) | Ring BWT (%) | Ring CC | Torus ACC (%) | Torus BWT (%) | Torus CC |
|---|---|---|---|---|---|---|---|
| 4 | CoDeC(f) | 59.07 ± 0.51 | -2.19 ± 0.25 | 1x | - | - | - |
| 4 | CoDeC | 59.20 ± 0.60 | -1.95 ± 0.26 | 1.42x | - | - | - |
| 8 | CoDeC(f) | 55.33 ± 0.21 | -1.02 ± 0.59 | 1x | 56.02 ± 0.40 | -0.88 ± 0.35 | 1x |
| 8 | CoDeC | 55.68 ± 0.36 | -1.10 ± 0.27 | 1.42x | 55.69 ± 0.27 | -1.01 ± 0.06 | 1.42x |
| 16 | CoDeC(f) | 49.68 ± 0.51 | -0.51 ± 0.37 | 1x | 50.15 ± 0.12 | -0.09 ± 0.19 | 1x |
| 16 | CoDeC | 49.56 ± 0.90 | -0.57 ± 0.30 | 1.42x | 50.00 ± 0.49 | -0.88 ± 0.23 | 1.42x |

Figure 4: Impact of α (over 5, 10, 30, 100, 1000) on CC for Split CIFAR-100 over an 8-agent ring. The data labels denote (ACC, BWT): (55.42, -0.2), (55.56, -0.1), (55, -0.71), (54.31, -0.8), (53.77, -0.86).

The lower the λ, the lower the importance of the corresponding basis direction; λ = 1 implies that no gradient steps can be taken along that basis direction. The following update rule is used to obtain the scaled gradient update g̃_i:

g̃_i = g_i − (M^l Λ (M^l)^T) g_i    (12)

Here, Λ is a diagonal matrix containing the λ values on its diagonal. Note that Λ = I corresponds to the GPM update rule in equation 4. The computation of λ involves a scale coefficient α; λ tends to 1 for large α values, and SGP converges to GPM. To incorporate gradient scaling in CoDeC, the basis vectors in M with λ < 1 are appended to O at the end of each task. This provides a knob to achieve an ACC-CC trade-off, as shown in Figure 4. We find that ACC increases as α decreases, albeit at the cost of lower CC, since the gradient updates lie in a larger subspace for smaller values of α. For α = 1000, the results are similar to those presented in Table 3, where we allow only orthogonal gradient updates. The best ACC is obtained at α = 10 with CC = 1.42x, and we report these results in Table 7.
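The scaled projection of equation 12 can be illustrated with a toy orthonormal basis (the numbers below are ours, not from the paper): Λ = I removes the entire gradient component in the span of M, recovering the fully orthogonal GPM update, while an entry λ < 1 lets a scaled fraction of the gradient survive along that basis direction.

```python
def scaled_projection(g, M, lam):
    """Compute g - M diag(lam) M^T g for M given as a list of orthonormal rows."""
    d = len(g)
    out = list(g)
    for m, l in zip(M, lam):
        c = l * sum(mi * gi for mi, gi in zip(m, g))  # scaled basis coefficient
        for j in range(d):
            out[j] -= c * m[j]
    return out

g = [1.0, 2.0, 3.0]
M = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # toy GPM basis (first two axes)
# Lambda = I removes everything in span(M), i.e., the GPM update rule:
assert scaled_projection(g, M, [1.0, 1.0]) == [0.0, 0.0, 3.0]
# lambda < 1 relaxes the constraint along the first basis direction:
assert scaled_projection(g, M, [0.5, 1.0]) == [0.5, 0.0, 3.0]
```

This makes the ACC-CC trade-off tangible: with λ < 1 the update no longer lies in the orthogonal subspace alone, so it spans more basis directions and fewer coefficients can be dropped during communication.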
Figure 5: Average consensus error (CE) for Split CIFAR-100 (left) and Split miniImageNet (right) over an 8-agent ring topology. Task τ DPSGD (CoDeC) denotes CE when the τ-th task is learned without (with) orthogonal gradient constraints.

Consensus Error: We also investigate the effect of taking orthogonal gradient updates on the average consensus error, which we formally define as:

CE = (1/N) Σ_{i=1}^{N} ‖x̄_k − x^i_k‖²    (13)

Here, x̄_k represents the global average of the model parameters x^i_k at any given iteration k. We provide an upper bound for the consensus error in Appendix A.2. CE is a measure of the effectiveness of gossip averaging in the decentralized learning scenario. In particular, a lower CE implies that the agents are closer to achieving global consensus. In Figure 5, we show CE with and without orthogonal updates for tasks 2 and 9 after each training epoch for Split CIFAR-100 and Split miniImageNet. As training progresses, CE consistently reduces, as expected. We observe that the rate of achieving consensus is similar in the two cases. In other words, CoDeC enables decentralized continual learning without hindering the gossip averaging mechanism.

7 Conclusion

This work proposes CoDeC, a novel communication-efficient decentralized continual learning algorithm. CoDeC enables serverless training with spatially and temporally distributed private data and mitigates catastrophic forgetting by taking gradient steps orthogonal to the gradient directions important for previous tasks. These orthogonal gradient updates, and hence the model updates, lie in a lower-dimensional gradient subspace. We exploit this fact to achieve lossless communication compression without requiring any additional hyperparameters. Further, we provide theoretical insights into the convergence rate of our algorithm.
Our results demonstrate that CoDeC is highly effective in learning distributed continual tasks with minimal backward transfer and up to 4.8x reduced communication overhead during training.

8 Acknowledgments

This work was supported in part by the Center for the Co-Design of Cognitive Systems (COCOSYS), a DARPA-sponsored JUMP center, the Semiconductor Research Corporation (SRC), the National Science Foundation, and DARPA ShELL.

References

Sai Aparna Aketi, Amandeep Singh, and Jan M. Rabaey. Sparse-push: Communication- & energy-efficient decentralized distributed learning over directed & time-varying graphs with non-iid datasets. CoRR, abs/2102.05715, 2021. URL https://arxiv.org/abs/2102.05715.

Sai Aparna Aketi, Sangamesh Kodge, and Kaushik Roy. Neighborhood gradient mean: An efficient decentralized learning method for non-iid data. Transactions on Machine Learning Research, 2023.

Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Mike Rabbat. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pp. 344–353. PMLR, 2019.

Aditya Balu, Zhanhong Jiang, Sin Yong Tan, Chinmay Hegde, Young M. Lee, and Soumik Sarkar. Decentralized deep learning using momentum-accelerated consensus. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3675–3679, 2021. doi: 10.1109/ICASSP39728.2021.9414564.

Pascal Bianchi et al. Performance of a distributed stochastic approximation algorithm. IEEE Transactions on Information Theory, 59(11):7405–7418, 2013. doi: 10.1109/TIT.2013.2275131.

Yaroslav Bulatov. notMNIST dataset. Google (Books/OCR), Tech. Rep. [Online], 2011. URL http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html.

Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In International Conference on Learning Representations, 2019.
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. Large scale distributed deep networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf.

Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, and Marcus Rohrbach. Adversarial continual learning. In The European Conference on Computer Vision (ECCV), 2020.

Yasaman Esfandiari, Sin Yong Tan, Zhanhong Jiang, Aditya Balu, Ethan Herron, Chinmay Hegde, and Soumik Sarkar. Cross-gradient aggregation for decentralized learning from non-iid data. CoRR, abs/2103.02051, 2021. URL https://arxiv.org/abs/2103.02051.

Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pp. 3762–3773. PMLR, 2020.

Yunhao Ge, Yuecheng Li, Di Wu, Ao Xu, Adam M. Jones, Amanda Sofie Rios, Iordanis Fostiropoulos, Shixian Wen, Po-Hsuan Huang, Zachary William Murdock, Gozde Sahin, Shuo Ni, Kiran Lekkala, Sumedh Anand Sontakke, and Laurent Itti. Lightweight learner for shared knowledge lifelong learning. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=Jjl2c8kWUc.

Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip B. Gibbons. The non-iid data quagmire of decentralized machine learning. CoRR, abs/1910.00189, 2019. URL http://arxiv.org/abs/1910.00189.

James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks.
Proceedings of the National Academy of Sciences, 114:3521–3526, 2017.

Anastasia Koloskova, Tao Lin, Sebastian U. Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression, 2019. URL https://arxiv.org/abs/1907.09356.

Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence, 2016. URL https://arxiv.org/abs/1610.02527.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Guanghui Lan et al. Communication-efficient algorithms for decentralized and stochastic optimization, 2017. URL https://arxiv.org/abs/1701.03961.

Jeongtae Lee, Jaehong Yoon, Eunho Yang, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. CoRR, abs/1708.01547, 2017. URL http://arxiv.org/abs/1708.01547.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, 2017. URL https://arxiv.org/abs/1705.09056.

Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, and Martin Jaggi. Quasi-global momentum: Accelerating decentralized deep learning on heterogeneous data. CoRR, abs/2102.04761, 2021. URL https://arxiv.org/abs/2102.04761.

David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, volume 30, 2017.

Arun Mallya et al. PackNet: Adding multiple tasks to a single network by iterative pruning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7765–7773, 2018.

Michael McCloskey and Neil J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24:104–169, 1989.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.

Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990.

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5533–5542, 2017.

Mohammad Rostami, Soheil Kolouri, Kyungnam Kim, and Eric Eaton. Multi-agent distributed lifelong learning for collective knowledge acquisition. arXiv preprint arXiv:1709.05412, 2017.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv, abs/1606.04671, 2016.

Gobinda Saha and Kaushik Roy. Continual learning with scaled gradient projection. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8):9677–9685, Jun. 2023. doi: 10.1609/aaai.v37i8.26157. URL https://ojs.aaai.org/index.php/AAAI/article/view/26157.

Gobinda Saha, Isha Garg, Aayush Ankit, and Kaushik Roy. SPACE: Structured compression and sharing of representational space for continual learning. IEEE Access, 9:150480–150494, 2021a. doi: 10.1109/ACCESS.2021.3126027.

Gobinda Saha et al. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=3AOj0RCNC2.

Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4548–4557. PMLR, 10–15 Jul 2018.
Yuanming Shi, Kai Yang, Tao Jiang, Jun Zhang, and Khaled B. Letaief. Communication-efficient edge AI: Algorithms and systems. IEEE Communications Surveys & Tutorials, 22(4):2167–2191, 2020.

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. CoRR, abs/1705.08690, 2017. URL http://arxiv.org/abs/1705.08690.

Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. DeepSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. CoRR, abs/1907.07346, 2019. URL http://arxiv.org/abs/1907.07346.

Anastasiia Usmanova, François Portet, Philippe Lalanda, and German Vega. A distillation-based approach integrating continual learning and federated learning for pervasive services. arXiv preprint arXiv:2109.04197, 2021.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, volume 29, 2016.

Shipeng Wang, Xiaorong Li, Jian Sun, and Zongben Xu. Training networks in null space of feature covariance for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 184–193, June 2021.

Buddhi Wickramasinghe, Gobinda Saha, and Kaushik Roy. Continual learning: A review of techniques, challenges and future directions. IEEE Transactions on Artificial Intelligence, pp. 1–21, 2023. doi: 10.1109/TAI.2023.3339091.

Han Xiao et al. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv, abs/1708.07747, 2017.

Lin Xiao and S. Boyd. Fast linear iterations for distributed averaging. In 42nd IEEE International Conference on Decision and Control (IEEE Cat. No. 03CH37475), volume 5, pp. 4997–5002, 2003. doi: 10.1109/CDC.2003.1272421.
Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. MedMNIST v2: A large-scale lightweight benchmark for 2D and 3D biomedical image classification. CoRR, abs/2110.14795, 2021. URL https://arxiv.org/abs/2110.14795.

Jaehong Yoon, Wonyong Jeong, Giwoong Lee, Eunho Yang, and Sung Ju Hwang. Federated continual learning with weighted inter-client transfer. In International Conference on Machine Learning, 2021. URL https://arxiv.org/abs/2003.03196.

Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization, 2019. URL https://arxiv.org/abs/1905.03817.

Friedemann Zenke et al. Continual learning through synaptic intelligence. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3987–3995, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/zenke17a.html.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

Xiaohan Zhang, Songlin Dong, Jinjie Chen, Qi Tian, Yihong Gong, and Xiaopeng Hong. Deep class-incremental learning from decentralized data. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2022. doi: 10.1109/TNNLS.2022.3214573.

Proofs for the lemma, theorem, and corollary presented in the main paper are detailed in Sections A.1 through A.5. Details related to the network architectures and datasets used in our experiments are presented in A.7 and A.8, respectively. We list all our training hyperparameters in A.9. We also provide details about the implementation of our baselines, D-EWC and D-SI, in A.10.
Some additional results are available in A.11. We perform our experiments on a single machine with 4 NVIDIA GeForce GTX 1080 Ti GPUs. The agents in our experiments are distributed evenly over these 4 GPUs; for instance, in the case of a 16-agent ring/torus topology, each GPU hosts 4 agents.

A.1 Proof of Lemma 1

The orthogonal projection g̃_i of the original gradient update g_i with respect to GPM is obtained as:

g̃_i = g_i − (M^l (M^l)^T) g_i    (14)

From the above equation we can write:

g_i = g̃_i + (M^l (M^l)^T) g_i    (15)

We have:

⟨g̃_i, g_i⟩ = ⟨g̃_i, g̃_i + (M^l (M^l)^T) g_i⟩ = ⟨g̃_i, g̃_i⟩ + ⟨g̃_i, (M^l (M^l)^T) g_i⟩    (16)

Since g̃_i and (M^l (M^l)^T) g_i are orthogonal to each other:

⟨g̃_i, (M^l (M^l)^T) g_i⟩ = 0    (17)

Substituting equation 17 into equation 16:

⟨g̃_i, g_i⟩ = ⟨g̃_i, g̃_i⟩ = ‖g̃_i‖² > 0    (18)

We ensure g̃_i ≠ 0 by using ϵ_th < 1. From the above equation, we see that the dot product is greater than 0. This implies that if −g_i is in the descent direction, −g̃_i is also in the descent direction.

A.2 Bounds on Consensus Error

This section provides an upper bound on the consensus error. We follow the same approach as Esfandiari et al. (2021). The update rule for our algorithm is as follows:

x̄_k = x̄_{k−1} − η (1/N) Σ_{i=1}^{N} g̃^i_{k−1}    (19)

x̄_k denotes the averaged model across all the agents at a given iteration k. For the rest of the analysis, the initial value will be directly set to 0. From equation 19 we have:

x̄_{k+1} − x̄_k = −η (1/N) Σ_{i=1}^{N} g̃^i_k    (20)

We introduce some key notations and properties:

G̃_k ≜ [g̃^1_k, g̃^2_k, ..., g̃^N_k],  X_k ≜ [x^1_k, x^2_k, ..., x^N_k],  G_k ≜ [g^1_k, g^2_k, ..., g^N_k],  H_k ≜ [∇f_1(x^1_k), ∇f_2(x^2_k), ..., ∇f_N(x^N_k)]

For all the above matrices, ‖A‖²_F = Σ_{i=1}^{N} ‖a_i‖², where a_i is the i-th column of the matrix A. Thus, with Q = (1/N)11^T, we obtain:

‖X_k(I − Q)‖²_F = Σ_{i=1}^{N} ‖x^i_k − x̄_k‖².    (22)

For each doubly stochastic matrix W, the following properties hold true: (I − Q)W = W(I − Q); for any integer k ≥ 1, ‖(I − Q)W^k‖_S ≤ (√ρ)^k, where ‖·‖_S is the spectral norm of a matrix.
For N arbitrary real square matrices A_j, j ∈ {1, 2, ..., N},

‖Σ_{j=1}^{N} A_j‖_F ≤ Σ_{j=1}^{N} ‖A_j‖_F.    (23)

We are now ready to provide a bound on the consensus error. Since X_k = X_{k−1}W − η G̃_k, we have:

X_k(I − Q) = X_{k−1}(I − Q)W − η G̃_k(I − Q)    (24)

Applying the above equation k times, we have:

X_k(I − Q) = X_0(I − Q)W^k − η Σ_{τ=1}^{k} G̃_τ(I − Q)W^{k−τ} = −η Σ_{τ=1}^{k} G̃_τ(I − Q)W^{k−τ}    (25)

so that, after reindexing,

E‖X_k(I − Q)‖²_F = η² E‖Σ_{τ=0}^{k−1} G̃_τ(I − Q)W^{k−1−τ}‖²_F.    (26)

We find the upper bound for term I ≜ E‖Σ_{τ=0}^{k−1} G̃_τ(I − Q)W^{k−1−τ}‖²_F:

I ≤(a) Σ_{τ=0}^{k−1} Σ_{τ'=0}^{k−1} E[‖G̃_τ(I − Q)W^{k−1−τ}‖_F ‖G̃_τ'(I − Q)W^{k−1−τ'}‖_F]
  ≤ Σ_{τ=0}^{k−1} Σ_{τ'=0}^{k−1} ρ^{(2(k−1)−τ−τ')/2} E[‖G̃_τ‖_F ‖G̃_τ'‖_F]
  ≤(b) Σ_{τ=0}^{k−1} Σ_{τ'=0}^{k−1} µ² ρ^{(2(k−1)−τ−τ')/2} E[‖G_τ‖_F ‖G_τ'‖_F]
  ≤(c) Σ_{τ=0}^{k−1} Σ_{τ'=0}^{k−1} µ² ρ^{(2(k−1)−τ−τ')/2} ((1/2)E[‖G_τ‖²_F] + (1/2)E[‖G_τ'‖²_F])
  = Σ_{τ=0}^{k−1} Σ_{τ'=0}^{k−1} µ² ρ^{(2(k−1)−τ−τ')/2} E[‖G_τ‖²_F]
  ≤(d) (µ²/(1 − √ρ)) Σ_{τ=0}^{k−1} ρ^{(k−1−τ)/2} E[‖G_τ‖²_F]    (27)

(a) follows from equation 23. (b) follows from Assumption 4. (c) follows from the inequality xy ≤ (1/2)(x² + y²) for any two real numbers x, y. (d) is derived from Σ_{τ'=0}^{k−1} ρ^{(k−1−τ')/2} ≤ 1/(1 − √ρ).

We proceed with finding a bound for E[‖G_τ‖²_F]:

E[‖G_τ‖²_F] = E[‖G_τ − H_τ + H_τ − H_τQ + H_τQ‖²_F]
  ≤ 3E[‖G_τ − H_τ‖²_F] + 3E[‖H_τ(I − Q)‖²_F] + 3E[‖H_τQ‖²_F]
  ≤(a) 3Nσ² + 3Nδ² + 3N E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_τ)‖²]    (28)

(a) holds because E[‖H_τQ‖²_F] = N E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_τ)‖²].

Substituting (28) in (27):

I ≤ (µ²/(1 − √ρ)) Σ_{τ=0}^{k−1} ρ^{(k−1−τ)/2} (3Nσ² + 3Nδ² + 3N E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_τ)‖²])
  ≤ 3Nµ²(σ² + δ²)/(1 − √ρ)² + (3Nµ²/(1 − √ρ)) Σ_{τ=0}^{k−1} ρ^{(k−1−τ)/2} E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_τ)‖²]    (29)

Substituting (29) into the main inequality (26):

E‖X_k(I − Q)‖²_F ≤ 3Nη²µ²(σ² + δ²)/(1 − √ρ)² + (3Nη²µ²/(1 − √ρ)) Σ_{τ=0}^{k−1} ρ^{(k−1−τ)/2} E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_τ)‖²]    (30)

Summing over k ∈ {1, .
. . , K − 1} and noting that E‖X_0(I − Q)‖²_F = 0, we have:

Σ_{k=0}^{K−1} E‖X_k(I − Q)‖²_F ≤ CK + (3Nη²µ²/(1 − √ρ)) Σ_{k=0}^{K−1} Σ_{τ=0}^{k−1} ρ^{(k−1−τ)/2} E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_τ)‖²]
  ≤ CK + (3Nη²µ²/(1 − √ρ)²) Σ_{k=0}^{K−1} E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_k)‖²]    (31)

where C = (η²µ²/(1 − √ρ)²)(3Nσ² + 3Nδ²). Dividing both sides by N:

(1/N) Σ_{k=0}^{K−1} E‖X_k(I − Q)‖²_F ≤ (η²µ²K/(1 − √ρ)²)(3σ² + 3δ²) + (3η²µ²/(1 − √ρ)²) Σ_{k=0}^{K−1} E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_k)‖²]    (32)

This directly implies:

Σ_{k=0}^{K−1} (1/N) Σ_{i=1}^{N} E‖x̄_k − x^i_k‖² ≤ (η²µ²K/(1 − √ρ)²)(3σ² + 3δ²) + (3η²µ²/(1 − √ρ)²) Σ_{k=0}^{K−1} E[‖(1/N) Σ_{i=1}^{N} ∇f_i(x^i_k)‖²]    (33)

A.3 Proof for Theorem 2

When F is L-smooth, we have:

E[F(x̄_{k+1})] ≤ E[F(x̄_k)] + E[⟨∇F(x̄_k), x̄_{k+1} − x̄_k⟩] + (L/2) E[‖x̄_{k+1} − x̄_k‖²]    (34)

where we denote the middle term by I. We proceed by analysing I:

I = E[⟨∇F(x̄_k), −η (1/N) Σ_{i} g̃^i_k⟩]    (35)
  = E[⟨∇F(x̄_k), −η (1/N) Σ_{i} (g̃^i_k − g^i_k)⟩] + E[⟨∇F(x̄_k), −η (1/N) Σ_{i} g^i_k⟩]    (36)

where the first term is II and the second is III. We first analyse II:

II = η E[⟨∇F(x̄_k), (1/N) Σ_{i} (g^i_k − g̃^i_k)⟩] ≤ (1/(2L)) E[‖∇F(x̄_k)‖²] + (Lη²/2) E[‖(1/N) Σ_{i} (g̃^i_k − g^i_k)‖²]    (37)

This holds as ⟨a, b⟩ ≤ (1/(2L))‖a‖² + (L/2)‖b‖². Analysing III:

III = E[⟨∇F(x̄_k), −η (1/N) Σ_{i} g^i_k⟩] = −η E[⟨∇F(x̄_k), (1/N) Σ_{i} ∇f_i(x^i_k)⟩]    (38)

With the aid of the identity ⟨a, b⟩ = (1/2)[‖a‖² + ‖b‖² − ‖a − b‖²], we have:

⟨∇F(x̄_k), (1/N) Σ_{i} ∇f_i(x^i_k)⟩ = (1/2)[‖∇F(x̄_k)‖² + ‖(1/N) Σ_{i} ∇f_i(x^i_k)‖² − ‖∇F(x̄_k) − (1/N) Σ_{i} ∇f_i(x^i_k)‖²]    (39)

Analysing the last term:

‖∇F(x̄_k) − (1/N) Σ_{i} ∇f_i(x^i_k)‖² = ‖(1/N) Σ_{i} (∇f_i(x̄_k) − ∇f_i(x^i_k))‖² ≤ (1/N) Σ_{i} ‖∇f_i(x̄_k) − ∇f_i(x^i_k)‖² ≤ (1/N) Σ_{i} L² ‖x̄_k − x^i_k‖²    (40)

Substituting (40) back into (39), we have:

⟨∇F(x̄_k), (1/N) Σ_{i} ∇f_i(x^i_k)⟩ ≥ (1/2)(‖∇F(x̄_k)‖² + ‖(1/N) Σ_{i} ∇f_i(x^i_k)‖² − L² (1/N) Σ_{i} ‖x̄_k − x^i_k‖²)    (41)

Substituting (37) and (41) into (36), and (36) into (35):

E[⟨∇F(x̄_k), x̄_{k+1} − x̄_k⟩] ≤ ((1/(2L)) − (η/2)) E[‖∇F(x̄_k)‖²] + (Lη²/2) E[‖(1/N) Σ_{i} (g̃^i_k − g^i_k)‖²] − (η/2) E[‖(1/N) Σ_{i} ∇f_i(x^i_k)‖²] + (ηL²/2) E[(1/N) Σ_{i} ‖x̄_k − x^i_k‖²]    (42)

From equation (20), we have:

E[‖x̄_{k+1} − x̄_k‖²] = η² E[‖(1/N) Σ_{i} g̃^i_k‖²].    (43)

Substituting (42) and (43) in (34):

E[F(x̄_{k+1})] ≤ E[F(x̄_k)] + ((1/(2L)) − (η/2)) E[‖∇F(x̄_k)‖²] + (Lη²/2) E[‖(1/N) Σ_{i} (g̃^i_k − g^i_k)‖²] − (η/2) E[‖(1/N) Σ_{i} ∇f_i(x^i_k)‖²] + (ηL²/2) E[(1/N) Σ_{i} ‖x̄_k − x^i_k‖²] + (η²L/2) E[‖(1/N) Σ_{i} g̃^i_k‖²]    (44)

Rearranging the terms and dividing by C1 = η/2 − 1/(2L) > 0 to find the bound for E[‖∇F(x̄_k)‖²]:

E[‖∇F(x̄_k)‖²] ≤ (1/C1)(E[F(x̄_k)] − E[F(x̄_{k+1})]) + C2 (E[‖(1/N) Σ_{i} (g̃^i_k − g^i_k)‖²] + E[‖(1/N) Σ_{i} g̃^i_k‖²]) + C3 E[(1/N) Σ_{i} ‖x̄_k − x^i_k‖²] − C4 E[‖(1/N) Σ_{i} ∇f_i(x^i_k)‖²]    (45)

where C2 = Lη²/(2C1), C3 = L²η/(2C1), C4 = η/(2C1).
We first analyze the bracketed term in (45):

E[‖(1/N) Σ_{i} (g̃^i_k − g^i_k)‖²] + E[‖(1/N) Σ_{i} g̃^i_k‖²]
  = (1/N²)(E[‖Σ_{i} (g̃^i_k − g^i_k)‖²] + E[‖Σ_{i} g̃^i_k‖²])
  =(a) (1/N²)(E[‖Σ_{i} MM^T g^i_k‖²] + E[‖Σ_{i} (I − MM^T) g^i_k‖²])
  =(b) (1/N²)(E[‖MM^T Σ_{i} g^i_k‖²] + E[‖(I − MM^T) Σ_{i} g^i_k‖²])
  = E[‖(1/N) Σ_{i} g^i_k‖²]
  ≤(c) σ²/N + E[‖(1/N) Σ_{i} ∇f_i(x^i_k)‖²]    (46)

(a) follows from the fact that g̃^i_k is an orthogonal projection of g^i_k, defined by the GPM matrix M. (b) follows from all agents having the same GPM matrix M; since MM^T is an orthogonal projector, ‖MM^T v‖² + ‖(I − MM^T) v‖² = ‖v‖². (c) is the conclusion of Lemma 1 in Yu et al. (2019).

Substituting (46) into (45) and summing over k ∈ {0, 1, . . . , K − 1}:

Σ_{k=0}^{K−1} E[‖∇F(x̄_k)‖²] ≤ (1/C1) E[F(x̄_0) − F(x̄_K)] + C2 K σ²/N + (C2 − C4) Σ_{k=0}^{K−1} E[‖(1/N) Σ_{i} ∇f_i(x^i_k)‖²] + C3 Σ_{k=0}^{K−1} E[(1/N) Σ_{i} ‖x̄_k − x^i_k‖²]    (47)

Dividing both sides by K:

(1/K) Σ_{k=0}^{K−1} E[‖∇F(x̄_k)‖²] ≤ (1/(C1K)) E[F(x̄_0) − F*] + C2 σ²/N + (C2 − C4)(1/K) Σ_{k=0}^{K−1} E[‖(1/N) Σ_{i} ∇f_i(x^i_k)‖²] + C3 (1/K) Σ_{k=0}^{K−1} E[(1/N) Σ_{i} ‖x̄_k − x^i_k‖²]    (48)

Using equation 33 in the above equation, we have:

(1/K) Σ_{k=0}^{K−1} E[‖∇F(x̄_k)‖²] ≤ (1/(C1K)) E[F(x̄_0) − F*] + C2 σ²/N + C3 η²µ² (3σ² + 3δ²)/(1 − √ρ)² + (C2 + 3C3η²µ²/(1 − √ρ)² − C4)(1/K) Σ_{k=0}^{K−1} E[‖(1/N) Σ_{i} ∇f_i(x^i_k)‖²]    (49)

Rearranging the terms, when C2 + 3C3η²µ²/(1 − √ρ)² − C4 ≤ 0, we have:

(1/K) Σ_{k=0}^{K−1} E[‖∇F(x̄_k)‖²] ≤ (1/(C1K)) E[F(x̄_0) − F*] + C2 σ²/N + C3 η²µ² (3σ² + 3δ²)/(1 − √ρ)²    (51)

A.4 Discussion on the Step Size

Recall the condition C1 > 0. This implies η > 1/L. The condition for equation (51) to be true is C2 + 3C3η²µ²/(1 − √ρ)² − C4 ≤ 0. Substituting C2, C3 and C4, this is equivalent to:

3L²µ²η²/(1 − √ρ)² + ηL − 1 ≤ 0    (52)

Solving this inequality, combined with the fact that η > 0, we then have the specific form of η:

η ≤ (1 − √ρ)(√((1 − √ρ)² + 12µ²) − (1 − √ρ))/(6Lµ²)    (53)

Hence, the step size η is defined as:

1/L < η ≤ (1 − √ρ)(√((1 − √ρ)² + 12µ²) − (1 − √ρ))/(6Lµ²)    (54)

A.5 Proof for Corollary 3

According to equation (51), the right-hand side has three terms with different dependence on the step size η. We separately investigate each term for the choice η = O(√(N/K)), which implies C1 = O(√(N/K)).
Therefore, for the first term, (1/(C1K)) E[F(x̄_0) − F*] = O(1/√(NK)). For the second term, C2 σ²/N = O(1/√(NK)). For the third term, C3 η²µ² (3σ² + 3δ²)/(1 − √ρ)² = O(N/K). By omitting N in the non-dominant terms, there exists a constant C > 0 such that the overall convergence rate is as follows:

(1/K) Σ_{k=0}^{K−1} E[‖∇F(x̄_k)‖²] ≤ C (1/√(NK) + 1/K)

which suggests that when N is fixed and K is sufficiently large, CoDeC enables a convergence rate of O(1/√(NK)).

A.6 Proof of q^i_k Lying in RGS

Here, we prove that q^i_k from line 9 in Algorithm 1 lies in RGS.

Line 8 (Algorithm 1): x^i_{k+1} = x^i_{k+1/2} + Σ_{j∈N(i)} w_ij (x̂^j_k − x^i_k)
Line 9 (Algorithm 1): q^i_k = x^i_{k+1} − x^i_k

Expanding q^i_k:

q^i_k = x^i_{k+1/2} + Σ_{j∈N(i)} w_ij (x̂^j_k − x^i_k) − x^i_k    (59)

From line 7 in Algorithm 1 we have x^i_{k+1/2} = x^i_k − η g̃^i_k. Substituting x^i_{k+1/2} in equation 59:

q^i_k = −η g̃^i_k + Σ_{j∈N(i)} w_ij (x̂^j_k − x^i_k)

Now, we know that g̃^i_k is the orthogonal gradient component of g^i_k, and hence it lies in the RGS subspace. If the gossip update component, i.e., Σ_{j∈N(i)} w_ij (x̂^j_k − x^i_k), also lies in RGS, we can conclude that q^i_k lies in RGS by linearity. We prove this using the linearity property of vector spaces and induction. Let α^i_k = Σ_j w_ij (x̂^j_k − x^i_k) = Σ_j w_ij (x^j_k − x^i_k). α^i_0 lies in RGS, as x^j_0 = x^i_0 (synchronized initialization). Now, α^i_{k+1} = Σ_j w_ij (α^j_k − η(g̃^j_k − g̃^i_k)). The GPM update ensures that the gradients g̃^j_k and g̃^i_k lie in RGS, and by the induction hypothesis, α^j_k lies in RGS. From linearity, we conclude that α^i_{k+1} lies in RGS for any agent i. Hence, q^i_k also lies in RGS.

A.7 Network Architecture

AlexNet-like architecture: For our experiments, we scale the output channels in each layer of the architecture used in Serrà et al. (2018). The network consists of 3 convolutional layers of 16, 32, and 64 filters with 4×4, 3×3, and 2×2 kernel sizes, respectively, and 2 fully connected layers of 512 units each. A 2×2 max-pooling layer follows the convolutional layers. Rectified linear units are used as activations.
Dropout of 0.2 is used for the first two layers and 0.5 for the remaining layers.

Reduced ResNet18 architecture: This is similar to the architecture used by Lopez-Paz & Ranzato (2017). We replace the 4×4 average-pooling layer with a 2×2 layer. For experiments with miniImageNet, we use a convolution with stride 2 in the first layer. All networks use ReLU in the hidden units and softmax with cross-entropy loss in the final layer.

A.8 Datasets

Tables 8 and 9 provide the details of the datasets used in our experiments. For MedMNIST-5 dataset statistics, please refer to Table 2 in Yang et al. (2021). The training samples for each task are independently and identically distributed (IID) across agents without any data overlap. For instance, for a graph of 4 agents, each agent has 5000/4 = 1250 training samples for a particular task in Split CIFAR-100.

Table 8: Dataset statistics for Split CIFAR-100 and Split miniImageNet.

| | Split CIFAR-100 | Split miniImageNet |
|---|---|---|
| num. of tasks | 10 | 20 |
| input size | 3 × 32 × 32 | 3 × 84 × 84 |
| # Classes/task | 10 | 5 |
| # Training samples/task | 5,000 | 2,500 |
| # Test samples/task | 1,000 | 500 |

Table 9: 5-Datasets statistics.

| | CIFAR-10 | MNIST | SVHN | FashionMNIST | notMNIST |
|---|---|---|---|---|---|
| Classes | 10 | 10 | 10 | 10 | 10 |
| # Training samples/task | 50,000 | 60,000 | 73,257 | 60,000 | 16,853 |
| # Test samples/task | 10,000 | 10,000 | 26,032 | 10,000 | 1,873 |

A.9 Hyperparameters

All our experiments were run for three randomly chosen seeds. We decay the learning rate by a factor of 10 after 50% and 75% of the training, unless mentioned otherwise. For Split CIFAR-100, we use a mini-batch size of 22 per agent and run all experiments for a total of 100 epochs per task. For Split miniImageNet, we use a mini-batch size of 10 per agent and 10 epochs per task. For 5-Datasets and MedMNIST-5, we use a mini-batch size of 32 per agent and 50 epochs per task. We list additional hyperparameters in Table 10.
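The IID partitioning described in A.8 (e.g., 5000/4 = 1250 samples per agent for one Split CIFAR-100 task) can be sketched as a generic shard split; this is our illustration, not the repository's data loader.

```python
import random

def iid_partition(num_samples, num_agents, seed=0):
    """Shuffle sample indices and split them evenly across agents (no overlap)."""
    idx = list(range(num_samples))
    random.Random(seed).shuffle(idx)
    shard = num_samples // num_agents
    return [idx[a * shard:(a + 1) * shard] for a in range(num_agents)]

# One Split CIFAR-100 task: 5,000 training samples over 4 agents -> 1,250 each.
shards = iid_partition(5000, 4)
assert [len(s) for s in shards] == [1250, 1250, 1250, 1250]
assert len(set().union(*map(set, shards))) == 5000   # disjoint and complete
```

Shuffling before slicing is what makes each shard an (approximately) identically distributed sample of the task data, as assumed in the experiments; a non-IID split would instead sort or group indices by class before sharding.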
For D-SI, in the case of Split CIFAR-100, we use a lower learning rate of 0.0005 for the last 2 tasks when the number of agents is 8 or 16. Similarly, for 5-Datasets, we lower the learning rate to 0.001 after the 1st task. This tuning is required for D-SI training to converge. The average consensus error plots shown in Figure 5 in the main paper were obtained with cosine-annealing learning rate scheduling instead of the step decay mentioned earlier.

Table 10: List of hyperparameters for the baselines and our approach. lr represents the initial learning rate; incre represents how ϵth is incremented for each new task.

| Dataset | Setup | Hyperparameters |
|---|---|---|
| Split CIFAR-100 | D-SI | lr: 0.001, c: 0.1 |
| | D-EWC | lr: 0.05, λ: 5000 |
| | CoDeC | lr: 0.01, ϵth: 0.97, incre: 0.003 |
| Split miniImageNet | D-SI | lr: 0.001, c: 0.3 |
| | D-EWC | lr: 0.03, λ: 5000 |
| | CoDeC | lr: 0.1, ϵth: 0.985, incre: 0.0003 |
| 5-Datasets | D-SI | lr: 0.01, c: 0.1 |
| | D-EWC | lr: 0.03, λ: 5000 |
| | CoDeC | lr: 0.1, ϵth: 0.965, incre: 0 |
| MedMNIST-5 | D-SI | lr: 0.001, c: 0.5 |
| | D-EWC | lr: 0.001, λ: 5000 |
| | CoDeC | lr: 0.001, ϵth: 0.99, incre: 0 |

A.10 Baseline Implementation

Algorithms 2 and 3 demonstrate the flow of D-EWC and D-SI, respectively, the baselines which extend EWC (Kirkpatrick et al., 2017) and SI (Zenke et al., 2017) to a decentralized setting. The loss function minimized in D-EWC is of the form fτ,i(d^i_{τ,k}; x^i_k) shown in line 5 of Algorithm 2. Here, λ is a regularization coefficient which signifies the importance given to the past tasks. x^{i,l}_k and x^{i,l}_{τ−1} represent the model parameters of a particular layer l. Unlike CoDeC, here we generate the Fisher matrix F_i at each agent and then perform a global averaging step before utilizing it for continually learning the next task. We do so because D-EWC performs best when the entire training data is used to generate the Fisher matrix. SI aims to minimize a loss function fτ,i(d^i_{τ,k}; x^i_k) of the form shown in line 5 of Algorithm 3.
Here, c is a dimensionless strength parameter which signifies the importance given to the past tasks. SI computes ω in an online manner, which is used to update Ω at the end of each task. Ω is an importance estimate which scales the per-parameter regularization strength. Ω is generated at each agent and then globally averaged before being utilized for the next task. STL and D-STL provide upper bounds on performance; these are not continual learning techniques and may not be feasible in resource-constrained environments as they require an excessive number of model parameters.

A.11 Additional Results

Continual Learning Statistics Aggregation: Additional results in Table 11 demonstrate the impact of broadcast and all-gather for continual learning statistics aggregation.

Task-wise CC: We present additional results for task-wise CC, similar to Figure 3 in the main paper. Figure 6(a) demonstrates that task-wise CC ranges from 1.2x to 1.8x for Split miniImageNet. Figure 6(b) shows task-wise CC ranging from 1.2x to 4.45x for Split CIFAR-100.

Training Loss vs. Epochs with and without Compression: We present results to emphasize the lossless nature of our proposed communication compression scheme. Figure 7 shows the training loss after each epoch for a particular agent for tasks 2 and 9 of the Split CIFAR-100 sequence with and without compression. The convergence rate of the training loss is not affected by applying the proposed compression scheme.

Table 11: Impact of the all-gather and broadcast techniques for continual learning over an 8-agent directed ring.
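The SI bookkeeping described above can be sketched for a single layer as follows (an illustrative sketch mirroring lines 7 and 18 of Algorithm 3; the class name and interface are hypothetical):

```python
import numpy as np

class SIEstimator:
    """Online synaptic-importance tracking for one layer:
    omega accumulates -g * delta_x during training; at a task boundary,
    Omega += omega / ((x_tau - x_{tau-1})**2 + xi), then omega resets."""
    def __init__(self, shape, xi=1e-3):
        self.omega = np.zeros(shape)   # running (per-task) importance
        self.Omega = np.zeros(shape)   # accumulated importance across tasks
        self.xi = xi                   # damping parameter

    def step(self, grad, delta_x):
        # path-integral contribution of this update to the loss change
        self.omega -= grad * delta_x

    def end_task(self, x_new, x_old):
        self.Omega += self.omega / ((x_new - x_old) ** 2 + self.xi)
        self.omega = np.zeros_like(self.omega)
```

In the decentralized baseline, each agent's Ω is then averaged across all agents before the next task, as described above.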
| Dataset | Setup | ACC | BWT |
|---|---|---|---|
| Split CIFAR-100 | D-SI (broadcast) | 27.2 | -14.14 |
| | D-SI (all-gather) | 39.36 | -1.98 |
| | D-EWC (broadcast) | 46.66 | -0.06 |
| | D-EWC (all-gather) | 50.46 | 0.42 |
| | CoDeC (broadcast) | 53.13 | -1.00 |
| | CoDeC (all-gather) | 52.7 | -0.39 |
| | D-SI (broadcast) | 37.57 | -43.65 |
| | D-SI (all-gather) | 80.23 | -2.70 |
| | D-EWC (broadcast) | 71.54 | -0.12 |
| | D-EWC (all-gather) | 85.9 | -0.92 |
| | CoDeC (broadcast) | 86.56 | -4.38 |
| | CoDeC (all-gather) | 86.46 | -4.00 |

Figure 6: Task-wise CC for (a) Split miniImageNet and (b) Split CIFAR-100 over a directed ring topology with 4, 8, and 16 agents.

Figure 7: Training loss vs. epochs for (a) task 2 and (b) task 9 in the Split CIFAR-100 sequence with CoDeC(f) and CoDeC using AlexNet over a directed ring with 8 agents.

CoDeC vs. SKILL: Each agent in SKILL (Ge et al., 2023) uses a common pre-trained frozen backbone built in at initialization, so that only the last layer (or head) and a unique set of bias parameters are learned for each task. After training, these bias parameters and task-specific heads are shared among agents via a fully connected graph. However, each SKILL agent independently learns a different task at a given time, lacking the concept of collaborative learning as demonstrated in CoDeC. The data for each task is spatially distributed in our setup, while SKILL assumes access to the entire data available in a centralized manner at a single agent. Hence, in CoDeC it becomes essential to communicate with peers during training.
Moreover, we do not assume access to a pre-trained model at initialization, and we update the weights for each task while mitigating forgetting. Unlike SKILL, where the number of learned parameters increases with each task, we update the same set of weights for the entire task sequence. Hence, SKILL and CoDeC target two different scenarios, even though both utilize decentralized agents while continually learning a task sequence. To compare SKILL with CoDeC, we employ SKILL in a scenario where each task's data is distributed across the agents. Similar to CoDeC, the SKILL agents aim to learn a globally generalized model with spatially and temporally distributed data. For an iso-comparison, both SKILL and CoDeC use ResNet-18 pre-trained on ImageNet and a perfect task oracle at test time. In SKILL, the agents do not communicate with their peers during training and share the bias parameters and heads only at the end of training. Meanwhile, the agents in CoDeC communicate with their peers during training by utilizing the gossip averaging mechanism. Results in Table 12 show the importance of collaborative learning in scenarios where each task's data is spatially distributed. CoDeC achieves about 5.6% better accuracy than SKILL by virtue of communicating while learning each task. As the number of agents increases from 8 to 16, SKILL performs much worse than CoDeC because the number of available training samples at each agent also reduces.

Table 12: Comparison between SKILL and CoDeC for Split miniImageNet over ResNet-18 using a directed ring topology

| Agents | Setup | ACC (%) | BWT (%) |
|---|---|---|---|
| 8 | SKILL | 79.21 | 0.00 |
| | CoDeC | 83.81 | 1.15 |
| 16 | SKILL | 74.33 | 0.00 |
| | CoDeC | 80.94 | 0.77 |

Table 13: Results for non-IID data with 0.5 skew for Split CIFAR-100 and Split miniImageNet over a directed ring topology with 8 agents. D-STL is not a continual learning baseline and serves as an upper bound on performance.
| Dataset | Setup | ACC | BWT | CC |
|---|---|---|---|---|
| Split CIFAR-100 | D-STL | 62.40 ± 0.50 | - | - |
| | D-SI | 38.81 ± 0.72 | -0.79 ± 0.54 | 1x |
| | D-EWC | 49.20 ± 0.35 | -0.13 ± 0.21 | 1x |
| | CoDeC(f) | 50.71 ± 0.47 | -0.46 ± 0.23 | 1x |
| | CoDeC | 50.60 ± 0.38 | -0.41 ± 0.39 | 2.53x |
| Split miniImageNet | D-STL | 60.89 ± 0.61 | - | - |
| | D-SI | 45.64 ± 0.51 | -1.97 ± 0.32 | 1x |
| | D-EWC | 46.57 ± 2.07 | -1.15 ± 0.64 | 1x |
| | CoDeC(f) | 51.28 ± 0.23 | -0.08 ± 0.76 | 1x |
| | CoDeC | 50.93 ± 0.16 | -0.88 ± 0.50 | 1.32x |

Results on Non-IID Data: We conducted experiments with 50% label-wise non-IIDness across the agents (i.e., skew = 0.5) (Hsieh et al., 2019). CoDeC utilizes a variant of DPSGD (Lian et al., 2017) as the base decentralized algorithm for IID data across the agents. However, DPSGD has been shown to perform poorly with non-IID data. Techniques like Quasi-Global Momentum (QGM) (Lin et al., 2021) and Neighborhood Gradient Mean (NGM) (Aketi et al., 2023) were introduced to improve decentralized learning performance in the presence of data heterogeneity across the agents. For our experiments, we employ QGM for learning with non-IID data as it improves performance without introducing extra communication overhead. Note that NGM performs better than QGM, but at the cost of 2× communication. Our results in Table 13 show that CoDeC outperforms D-EWC and D-SI by 1.5% and 11.8% respectively for Split CIFAR-100. CoDeC achieves minimal backward transfer while achieving a CC of 2.53x. For Split miniImageNet, CoDeC achieves about 4-5% better accuracy than D-EWC and D-SI while incurring low BWT and a CC of 1.32x.

Runtime Comparison: We present relative training runtimes in Table 14. For Split CIFAR-100 and Split miniImageNet, CoDeC(f), D-EWC and D-SI have similar runtimes. For 5-Datasets, D-EWC and D-SI have about 30% higher runtimes than CoDeC(f). CoDeC generally has the highest runtime due to the two extra matrix multiplications required for communication compression.
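Those two matrix multiplications can be sketched as follows: given an orthonormal basis B of the subspace containing the model update, encoding projects the flattened update onto B (so only the much shorter coefficient vector is communicated) and decoding reconstructs it exactly. This is a simplified sketch, not the released implementation; the function names are illustrative:

```python
import numpy as np

def encode(update, basis):
    """Project a flattened model update onto the orthonormal basis
    columns; only these r coefficients are communicated (r << d)."""
    return basis.T @ update          # shape: (r,)

def decode(coeffs, basis):
    """Reconstruct the d-dimensional update from received coefficients."""
    return basis @ coeffs            # shape: (d,)

# toy check: an update lying in span(basis) is recovered losslessly
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((100, 5)))  # d=100, r=5
update = basis @ rng.standard_normal(5)
assert np.allclose(decode(encode(update, basis), basis), update)
```

Because the communicated update lies within the span of the basis, the round trip is exact, which is why the compression is lossless while shrinking the message from d to r numbers.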
Specifically, line 10 in Algorithm 1 encodes model updates into coefficients which each agent communicates to its neighbors, and line 11 in Algorithm 1 decodes the received coefficients back into model updates. Therefore, CoDeC achieves the highest accuracy and lossless communication compression at the cost of increased runtime.

Table 14: Training time (relative) for Split CIFAR-100, Split miniImageNet and 5-Datasets over a directed ring topology with 8 agents

| Dataset | Setup | Training time |
|---|---|---|
| Split CIFAR-100 | CoDeC(f) | 1 |
| | CoDeC | 1.27 |
| | D-EWC | 1.09 |
| | D-SI | 0.98 |
| Split miniImageNet | CoDeC(f) | 1 |
| | CoDeC | 1.68 |
| | D-EWC | 1 |
| | D-SI | 0.92 |
| 5-Datasets | CoDeC(f) | 1 |
| | CoDeC | 1.61 |
| | D-EWC | 1.30 |
| | D-SI | 1.31 |

Algorithm 2: Decentralized Elastic Weight Consolidation (D-EWC)

Input: Each agent $i \in [1, N]$ initializes model parameters $x^i_0$; step size $\eta$; mixing matrix $W = [w_{ij}]_{i,j \in [1,N]}$; $\hat{x}^i_0 = 0$; $F_l = [\,]$ for all layers $l = 1, \ldots, L$; Fisher matrix $F^i = \{F_l\}_{l=1}^{L}$; old model parameters $x^i_{(0)} = 0$; $N(i)$: neighbors of agent $i$ (including itself); $T$: total tasks; $K$: number of training iterations.

Each agent simultaneously implements the TRAIN() procedure:

1. procedure TRAIN()
2. for $\tau = 1, \ldots, T$ do
3.   for $k = 0, 1, \ldots, K-1$ do
4.     $d_{\tau,i} \sim D_{\tau,i}$
5.     $\tilde{f}_{\tau,i}(d_{\tau,i}; x^i_k) = f_{\tau,i}(d_{\tau,i}; x^i_k) + \sum_{l=1}^{L} \frac{\lambda}{2} F_l (x^{i,l}_k - x^{i,l}_{\tau-1})^2$
6.     $g^i_k = \nabla \tilde{f}_{\tau,i}(d_{\tau,i}; x^i_k)$
7.     $x^i_{(k+\frac{1}{2})} = x^i_k - \eta g^i_k$
8.     $x^i_{k+1} = x^i_{(k+\frac{1}{2})} + \sum_{j \in N(i)} w_{ij} (\hat{x}^j_k - x^i_k)$
9.     $q^i_k = x^i_{k+1} - x^i_k$
10.    for each $j \in N(i)$ do
11.      Send $q^i_k$ and receive $q^j_k$
12.      $\hat{x}^j_{k+1} = q^j_k + \hat{x}^j_k$
13.    end
14.  end
15.  Save $x^i_\tau$
16.  # EWC update
17.  Update $F_l$ for each layer $l$
18.  Update $F^i = \{F_l\}_{l=1}^{L}$
19.  $p = \mathrm{random}(1, 2, \ldots, N)$
20.  if $i == p$ do
21.    Gather $F^j$ from all agents
22.    $F = \mathrm{avg}(F^1, F^2, \ldots, F^N)$
23.    Send $F$ to all agents
24.  end
25. end
26. return

Algorithm 3: Decentralized Synaptic Intelligence (D-SI)

Input: Each agent $i \in [1, N]$ initializes model parameters $x^i_0$; step size $\eta$; mixing matrix $W = [w_{ij}]_{i,j \in [1,N]}$; $\hat{x}^i_0 = 0$; $\Omega_l = [\,]$ for all layers $l = 1, \ldots, L$; importance measure $\Omega^i = \{\Omega_l\}_{l=1}^{L}$; running importance estimate $\omega = [\,]$; damping parameter $\xi = 10^{-3}$; old model parameters $x^i_{(0)} = 0$; $N(i)$: neighbors of agent $i$ (including itself); $T$: total tasks; $K$: number of training iterations.

Each agent simultaneously implements the TRAIN() procedure:

1. procedure TRAIN()
2. for $\tau = 1, \ldots, T$ do
3.   for $k = 0, 1, \ldots, K-1$ do
4.     $d_{\tau,i} \sim D_{\tau,i}$
5.     $\tilde{f}_{\tau,i}(d_{\tau,i}; x^i_k) = f_{\tau,i}(d_{\tau,i}; x^i_k) + c \sum_{l=1}^{L} \Omega_l (x^{i,l}_k - x^{i,l}_{\tau-1})^2$
6.     $g^i_k = \nabla \tilde{f}_{\tau,i}(d_{\tau,i}; x^i_k)$
7.     $\omega_l = \omega_l - g^{i,l}_k (x^{i,l}_k - x^{i,l}_{k-1})$
8.     $x^i_{(k+\frac{1}{2})} = x^i_k - \eta g^i_k$
9.     $x^i_{k+1} = x^i_{(k+\frac{1}{2})} + \sum_{j \in N(i)} w_{ij} (\hat{x}^j_k - x^i_k)$
10.    $q^i_k = x^i_{k+1} - x^i_k$
11.    for each $j \in N(i)$ do
12.      Send $q^i_k$ and receive $q^j_k$
13.      $\hat{x}^j_{k+1} = q^j_k + \hat{x}^j_k$
14.    end
15.  end
16.  Save $x^i_\tau$
17.  # SI update
18.  $\Omega_l = \Omega_l + \frac{\omega_l}{(x^{i,l}_\tau - x^{i,l}_{\tau-1})^2 + \xi}$ (for each layer $l$)
19.  Update $\Omega^i = \{\Omega_l\}_{l=1}^{L}$
20.  $p = \mathrm{random}(1, 2, \ldots, N)$
21.  if $i == p$ do
22.    Gather $\Omega^j$ from all agents
23.    $\Omega = \mathrm{avg}(\Omega^1, \Omega^2, \ldots, \Omega^N)$
24.    Send $\Omega$ to all agents
25.  end
26. end
27. return
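The gossip exchange shared by both baselines (lines 7-13 of Algorithm 2, lines 8-14 of Algorithm 3) can be sketched as a single-process toy simulation of all N agents with scalar parameters; this is an illustrative sketch under those simplifications, not the distributed implementation:

```python
import numpy as np

def gossip_round(x, x_hat, W, grads, eta):
    """One iteration for all N agents at once: local SGD step, mixing
    with the locally held neighbor copies x_hat, then each agent sends
    its change q so neighbors can update their copy of its parameters."""
    x_half = x - eta * grads                 # local SGD step
    mix = W @ x_hat - W.sum(axis=1) * x      # sum_j w_ij (x_hat_j - x_i)
    x_new = x_half + mix                     # gossip-averaged parameters
    q = x_new - x                            # change communicated to neighbors
    x_hat_new = x_hat + q                    # neighbors update their copies
    return x_new, x_hat_new

# toy run: 3 agents, doubly-stochastic mixing, zero gradients; the agents
# should reach consensus on the average of their initial parameters
W = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
x = np.array([1.0, 2.0, 3.0])
x_hat = x.copy()                             # assume copies start in sync
for _ in range(50):
    x, x_hat = gossip_round(x, x_hat, W, np.zeros(3), 0.1)
assert np.allclose(x, 2.0)
```

With zero gradients and synchronized copies, each round reduces to x ← Wx, so a connected, doubly-stochastic W drives all agents to the average, which is the consensus behavior the mixing matrix is designed to provide.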