# flash_concept_drift_adaptation_in_federated_learning__90208351.pdf Flash: Concept Drift Adaptation in Federated Learning Kunjal Panchal 1 Sunav Choudhary 2 Subrata Mitra 2 Koyel Mukherjee 2 Somdeb Sarkhel 3 Saayan Mitra 3 In Federated Learning (FL), adaptive optimization is an effective approach to addressing the statistical heterogeneity issue but cannot adapt quickly to concept drifts. In this work, we propose a novel adaptive optimizer called FLASH that simultaneously addresses both statistical heterogeneity and the concept drift issues. The fundamental insight is that a concept drift can be detected based on the magnitude of parameter updates that are required to fit the global model to each participating client s local data distribution. FLASH uses a two-pronged approach that synergizes clientside early-stopping training to facilitate detection of concept drifts and the server-side drift-aware adaptive optimization to effectively adjust effective learning rate. We theoretically prove that FLASH matches the convergence rate of state-ofthe-art adaptive optimizers and further empirically evaluate the efficacy of FLASH on a variety of FL benchmarks using different concept drift settings. 1. Introduction Federated Learning (FL) (Mc Mahan et al., 2017) is a distributed machine learning paradigm where edge devices (called clients ) jointly train a machine learning (ML) model while keeping the training data on their devices. FL has cross-device and cross-silo settings (Wu et al., 2022). In cross-device FL, which is our primary focus, a server picks a small fraction of clients from available clients to train an ML model (called global model ) on their local data at each round. The server then aggregates the trained models from the participating clients to update the global model. As the server does not have access to clients data, FL is inherently privacy-preserving and has been applied in many industries such as health care (Loftus et al., 2022; 1University of Massachusetts, Amherst, USA 2Adobe Research, Bangalore, India 3Adobe Research, San Jose, USA. Correspondence to: Kunjal Panchal . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). du Terrail et al., 2022) and Io T (Dara et al., 2022; Li et al., 2022) where data privacy is of paramount importance. Adaptive optimization is an effective approach to addressing the statistical heterogeneity issue in FL, where one client s data distribution could be different from another client. As participating clients are different from one round to another, statistical heterogeneity causes differences in data distributions across rounds, leading to convergence problems (Karimireddy et al., 2020). Adaptive optimizers such as FEDYOGI and FEDADAM (Reddi et al., 2021) use adaptive learning rates which incorporate knowledge of past rounds to perform more informed model updates. They successfully smooth the updates applied to the global model and improve the convergence in FL. Problem. Existing adaptive optimization approaches, however, cannot adapt quickly to concept drift, another practical issue faced by deploying FL in real-world applications yet largely overlooked in the literature. A concept drift happens when the participating clients change their class conditional distributions due to phenomena like seasonality effects, geographic biases, diurnal patterns, and change in users habits (Kairouz et al., 2021; Canonaco et al., 2021). For example, the use of the term corona would generate different outputs preand post-pandemic. Formally, the class conditional distribution before concept drift, P(y | x) = P c C qc Pc(y | x), is different from the distribution after concept drift, P (y | x), where qc is weight of cth client, C is a set of available client for a particular round in FL, and (x, y) are data samples. Adapting quickly to a concept drift requires a large adaptive learning rate to adjust global model parameters so that updated parameters can fit new conditional distribution P (y | x). But existing adaptive optimizers often result in a relatively small effective learning rate towards an optimum despite a concept drift. Without loss of generality, existing adaptive optimizers update model parameters using w(r) g = w(r 1) g + ηg m(r) v(r)+τ , where w(r 1) g are global model parameters at round r 1, and m(r) and v(r) are the estimates of first and second moments of the gradients at round r. We refer to ηg v(r)+τ as the effective learning rate of a parameter, which is the learning rate ηg scaled by the inverse of the second moments of the gradients. The second Flash: Concept Drift Adaptation in Federated Learning 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Effective Learning Rate g Fed Yogi Flash 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Yogi Fed Yogi Flash Figure 1. (Left) Effective learning rate and (Right) Accuracy across rounds on CIFAR10 dataset, where drift occurs at round 2000. = with drift, = without drift. moments of the gradients would be relatively high when clients train the global model, which has captured P, on the new distribution P , resulting in an effective learning rate at the onset of a concept drift which is not large enough to adapt to the new distribution quickly. To illustrate the problem, Figure 1 shows the effective learning rate and the validation accuracy of the state-of-the-art adaptive optimizer Fed Yogi (Reddi et al., 2021). We calculate the learning rate by taking average of per-parameter ηg/ v(r) values. Before the drift, the effective learning rate of FEDYOGI is decreasing as the training approaches convergence. But after a drift at round 2000, the effective learning rate increases slowly for FEDYOGI. This slowly increased effective learning rate delays the drift adaptation, while also causing a large accuracy dip during the drift. Proposed Approach. In this paper, we design a novel adaptive optimization algorithm called FLASH that can quickly adapt to concept drift while simultaneously addressing the statistical heterogeneity issue. The core intuition in FLASH is to use a larger effective learning rate at the onset of concept drifts while preserving the effectiveness of adaptivity to address statistical heterogeneity as shown in Figure 1. Our fundamental insight is that a concept drift can be detected based on how large client parameter updates are required to fit the received global model to each client s local data distribution in a round. When the global model needs larger updates to fit a client s data in a round, it implies that the client s data is unlikely to be sampled from the global distribution captured by the global model. When larger updates are required by all participating clients in that round, it indicates a concept drift and thus the needs of a larger effective learning rate to update the global model. To this end, FLASH uses a two-pronged approach that synergizes client-side training to facilitate concept drift detection and server-side drift-aware adaptive optimization to dynamically adjust effective learning rate. On the client side, FLASH allows clients to train the global model until it reaches a steady state, where the trained model fits well to their local data, in each round. This results in a varying number of epochs per client. In contrast, existing adap- tive optimizers typically have participating clients train the global model for a fixed number of epochs. On the server side, the server adapts effective learning rate based on effective gradients, which results from aggregating local parameter updates from participating clients. The principle of the adaptivity is to retain the similar effective learning rate as existing adaptive optimizers when no concept drift happens but significantly increase it at the onset of the concept drift. We achieve the adaptivity by tracking ||( (r)) 2 v(r)|| at each round r that is, the difference between the current squared effective gradients ( (r))2 and the second moment of the gradients v(r) and found that this metric leads to a boost in the effective learning rate in case of a drift, while still getting a stabilized performance when no distribution change. We evaluate the effectiveness of FLASH both theoretically and empirically. Theoretically, we prove that FLASH can match the convergence rate of FEDYOGI, a state-of-the-art adaptive optimizer, while it can also adapt quicker to a drift by decreasing the lower bound of the change in second order effective gradient. Empirically, we compare FLASH against the best of the adaptive optimizers, personalization, drift correction, and drift detection methods using CIFAR10/100, and EMNIST datasets. In no concept drift settings, FLASH gives a comparable performance than the best performing baselines. In concept drift settings, FLASH has the highest accuracy at the onset of concept drift, needs the least number of FL rounds to recover accuracy, and achieves the highest recovered accuracy after concept drifts. The fast concept drift adaptation from FLASH saves 11.18% (CIFAR10), 10.42% (CIFAR100) , and 11.79% (EMNIST) of local epochs done by clients, as the global model adapts to the new distribution. In concept drift settings, FLASH only underperforms against ORACLE (an algorithm which has knowledge of when a drift occurs and access to both preand postdrift distributions) with the lowest accuracy dip difference of 1.48%-2.99%, while the best performing baselines exhibit 3.15%-9.22% more accuracy dip compared to the ORACLE. We summarize the contributions of this work as follows: We propose a drift-aware adaptive optimization strategy that can quickly adapt to various concept drift patterns (sudden, incremental, and recurrent). Empirical evaluation demonstrates FLASH can achieve comparable performance as the most performing baselines in no concept drift settings while outperforms all baselines in various concept drift settings. We give theoretical analysis on convergence of FLASH, an epoch bound for early-stopping SGD for client-side training, and draw a relation between the change in second order effective gradients with concept drifts. Flash: Concept Drift Adaptation in Federated Learning 2. Related Work FLASH is related to techniques that address statistical heterogeneity and adaptive optimizations in particular, and techniques that address concept drift. Techniques to Address Statistical Heterogeneity. Adaptive optimization to address the challenges of statistical heterogeneity has been studied in FEDYOGI/FEDADAM (Reddi et al., 2021). They are slow to adapt to concept drifts because of their relatively small effective learning rate when a synchronized concept drift happens. FLASH proposes a new adaptive update rule where the effective learning rate is higher at the onset of a drift for faster adaptation. Personalization also addresses statistical heterogeneity HYPCLUSTER (Mansour et al., 2020), MAML (Jiang et al., 2019), PERFEDAVG (Fallah et al., 2020), APFL (Deng et al., 2020), DITTO (Li et al., 2021), FEDROD (Chen and Chao, 2022). It usually creates a separate personalized model, along with a local model. FLASH uses early-stopping training to get a well-trained local model for a client. And the same model, when sent to the server, is used to detect and adapt to (if any) concept drifts. We do not need to store any model states on client, as the model states can be rendered useless in case of drifts. Drift correction strategies address statistical heterogeneity by reducing gradient mismatch caused by solving two separate objectives (local and global). SCAFFOLD (Karimireddy et al., 2020) introduces a gradient correction term named control variate to help reduce the variance across client updates. FEDDYN (Acar et al., 2021) further reduces the communication cost related to drift-correction. These methods consider the concept drift as client-level noise, resulting in a local model update correction towards the pre-drift distribution. FLASH instead adapts the global model to the drift, resulting in less efforts from each client in local training and gradient correction. Techniques to Address Concept Drift. Concept drifts in FL fall into two categories, distributed drift, and synchronized drift. Distributed drift, where clients can drift to multiple distributions in the same round, is explored in FEDDRIFT (Jothimurugesan et al., 2022). They propose a multi-model solution where each new distribution drift by any client spawns a new global model. Maintaining multiple global models to track all the past and current distributions poses a scalability challenge. Besides, their experiments do not consider heterogeneous data setting. FLASH focuses on synchronized drift, where all the clients face a drift towards one new distribution. This setting is explored in centralized learning in (Gama et al., 2014; Tahmasbi et al., 2020). FL poses an additional challenge of telling apart client-level heterogeneity from a concept drift. ADAPTIVEFEDAVG (Canonaco et al., 2021) attempts to addresses the synchronized drift issue through a preliminary study on adaptive optimizers, but it is incomplete and it falls short in improv- ing the global accuracy dip during the drift. FLASH has a rigorous study with various patterns of synchronized drift, on various baselines based on adaptive optimization, drift correction, and drift adaptation. We also provide theoretical analysis for FLASH. 3. Methodology This section describes our proposed adaptive optimization algorithm FLASH. Compared to existing adaptive optimizers, FLASH has two distinct features: (1) Clients train local model via early-stopping training to facilitate the concept drift detection and, at the same time, address statistical heterogeneity, and (2) the server updates the global model via drift-aware adaptive optimization. The synergy of the two features enables FLASH to quickly adapt to concept drift. Algorithm 1 provides the pseudo-code for FLASH. Algorithm 1 FLASH Input: R: total number of rounds, r: round index, p: participation rate, C(r): a set of available clients for rth round, C: a set of currently participating clients, c: client index, ηℓ: local learning rate, ηg: global learning rate, D(r) c : local dataset of client c in rth round, e(r) c : number of epochs for client c for round r, E: maximum number of epochs a client can train for, Lc: local objective for client c, β{1,2}: exponential decay rates, τ: adaptivity rate, γ: loss decrement threshold Output: w(R) g : global model server randomly initializes w(0) g 1: for r [R] round do 2: sample C from C(r) with the rate of p 3: send w(r 1) g to all the clients in C 4: for c C in parallel do 5: w(r) c,0 w(r 1) g 6: for e [E] epochs do 7: if ℓ(r) c,e 1 ℓ(r) c,e γ/e then 8: w(r) c,e w(r) c,e 1 ηℓ Lc(w(r) c,e 1; D(r) c,train) 9: ℓ(r) c,e P (x,y) D(r) c,valid fc(w(r) c,e, x, y) 10: else 11: break 12: end if 13: end for 14: return w(r) c to the server 15: end for 16: (r) 1 |C| P c C(w(r) c w(r 1) g ) 17: m(r) β1m(r 1) + (1 β1) (r) 18: v(r) β2v(r 1) + (1 β2)( (r))2 2 ( (r) j )2 v(r) j 2+ v(r 1) j 20: d(r) j β3jd(r 1) j + (1 β3j) ( (r) j )2 v(r) j ; 21: w(r) g w(r 1) g + ηg m(r) v(r) - d(r) +τ 22: end for Flash: Concept Drift Adaptation in Federated Learning 3.1. Client-side Early-stopping Training In one round of FL, each selected client receives the global model from the server, trains the global model locally via early-stopping training, and sends the locally-updated global model (called local model ) back to the server. Early-stopping training trains the global model until an early stop criteria based on model fitness is met. This is in contrast to the prevalent way of client-side training (Canonaco et al., 2021; Deng et al., 2020; Li et al., 2021; Mansour et al., 2020), i.e, each client trains the the global model for a fixed number of epochs. Lines 5 to 15 in Algorithm 1 describe the training procedure of each of the participating clients. As a stopping criterion, we use decrement in the validation loss value (see Line 7) to indicate when a local model w(r) c reaches its steady state. If the validation loss stops to decrease by a threshold γ/e, where γ is a threshold hyperparameter and e is the current epoch count, we stop training for the client. The threshold decreases as epoch count grows since the validation loss is also expected to decrease. Otherwise a client can train the local model until a set number of maximum epochs E. The rationale behind the early-stopping training are twofolds. First, training for longer epochs in an FL round has been demonstrated to be an effective approach for addressing statistical heterogeneity because it creates local models which can better fit the clients data distributions than training for only a few epochs (Wu et al., 2022). Early-stopping further avoids potential over-fitting of the local model. Second, early-stopping training produces parameter updates that are necessary for the global model to fit a client s local data distribution. The magnitude of the parameter updates could indicate if concept drifts. We empirically verify the two rationales in Section 5.4 and demonstrated that earlystopping training is able to improve validation accuracy by 1.41%-3.87% across datasets in no concept drift settings. And in concept drift settings, early-stopping training is more effective in detecting concept drifts than fixed-epoch training. 3.2. Server-side Drift-aware Adaptive Optimization After receiving the local model from each participating client, the server aggregates the parameter updates into effective gradients (Line 17) and updates the global model via drift-aware adaptive optimization (Lines 18-22). Here the goal is to automatically increase effective learning rate when there is a concept drift while preserving the convergence properties of existing adaptive optimizers. Although the magnitude of effective gradients is a good indicator of concept drift, the challenge lies in how much the effective learning rate needs to be adjusted accordingly. To address the challenge, we propose a metric called gra- 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Round r ||( (r))2 v(r)||2 (a) CIFAR 10 0 500 1000 1500 2000 2500 3000 Number of Rounds ||( (r))2 v(r)||2 (b) CIFAR 100 Figure 2. The gradient disparity ||( (r)) 2 v(r)|| v.s. round r. Sudden concept drift occurs at r = 2000 for FLASH and FEDYOGI. = with drift, = without drift. dient disparity that calculates the difference between the squared gradients ( (r))2 and the second moment of the gradients v(r) at round r, i.e., ||( (r)) 2 v(r)||. We find that this metric not only correlates well with the inside of a concept drift but also lies in a value range that is comparable to v(r)) and thus can be used to adjust the effective learning rate. Figure 2 shows how the concept drift impacts gradient disparity. Concept drift at round r = 2000 results in a larger magnitude of the effective gradient (r) and thus a larger gradient disparity. This sudden increase in gradient disparity indicates that the global model needs to adapt to the new distribution. Since an absolute value of ( (r))2 v(r) doesn t indicate whether the value is larger or nearly equal to the previous ones , we track a moving average of ( (r))2 v(r) as shown in Line 20 in Algorithm 1, symbolized by d(r). Instead of a fixed weight parameter β3, we use an adaptive β3 for weighing the two terms of d(r). Our aim is to weigh d(r 1) term more in case there is no large change in ( (r))2 v(r) compared to its previous values. Inversely, we want a larger weighted ( (r))2 v(r) if the value widely differs from the previous ones. We assign β3 as shown in Line 19 in Algorithm 1. We use the equation in Line 21 to update the global model. 4. Analysis of FLASH In this section, we theoretically analyze the convergence of early-stopping training on the client side and the convergence of FLASH. The goal is to show that, despite the drift-aware design, FLASH retains the convergence property of existing adaptive optimizers. Specifically, given a client with a local variance of σc, (defined in Assumption C.3), we derive the upper bound of the number of epochs ec needed to train a global model to meet our stopping criteria. We prove that FLASH has the same convergence rate as FEDYOGI. We also derive a relation between FEDYOGI and FLASH in terms of drift. The proofs are in Appendix C. Theorem 4.1 (Convergence of Early-stopping SGD). Flash: Concept Drift Adaptation in Federated Learning Assume functions {Fc} satisfy Assumptions C.1, C.2, C.3, and C.4. The output of the early-stopping SGD with the early stopping criteria, P x,y fc(w(r) c,e 1; x, y) P x,y fc(w(r) c,e; x, y) γ/e, e [E] and (x, y) D(r) c,valid, has an expected error smaller than ϵ for γ (F ϵ) ln E+ 1 E and ηℓ, ec satisfying Strongly convex, 1 µE ηℓ log (max (1,µ2ED/c)) ec = O min µD2 µϵ + σ2 c 2µϵ, exp( F ϵ General convex, 1 ec = O min D2 + G2D2 ϵ2 + σ2 c D2 2ϵ2 , exp( F ϵ Non-convex, 1 ec = O min F + G2F ϵ2 + σ2 c LF 2 2ϵ2 , exp( F ϵ where c := G2 + µ2 c 2 , D := E w(r) c,0 w c , and F = fc(w(r) c,0) fc(w c). Discussion. Here we discuss the relationship between ec, epochs taken to achieve an optimization error of ϵ = Er h fc(w(r) c,e(r) c ) i fc(w c), and the concept drift D = w(r) c,0 w c . In case of FL, w(r) c,0 := w(r 1) g . We focus on the general convex case, but the same analysis can be applied for non-convex case as well. Dropping all the terms related to L, G, and lower powers of ϵ, we get ec = O D2(1 + 1 ϵ2 + σ2 c ϵ2 ) . This indicates that the number of epochs a client would run the local training for depends on the drift between w(r 1) g and w c. If P(r 1) = P(r) c , then according to Lemma C.8, D would be larger, implying a large ec. Besides, ec also depends on the early stopping parameter γ. A small γ mean large ec, as a lower limit to the error difference fc(w(r) c,e 1) fc(w(r) c,e) allows for more epochs of local SGD, and vice versa. Theorem 4.2 (Convergence of FLASH). Let assumptions C.1 to C.4 hold. Suppose the server and client learning rates satisfy ηℓ min |C| 30L2E 1 2 , τ 6(B2 1)[G(β2+ β2)+Lηg] the iterates of Algorithm 1 for ηℓ = Θ(1/L E), ηg = Θ(1/ R), and τ = G/L for FLASH satisfy min 0 r RE f(w(r) g ) 2 O f(w(0) g ) Er[f(w(R) g )] ER|C| (σ2 ℓ+ 6Eσ2 g) + 6Lσ2 ℓ RG2|C| + 6L Discussion. FLASH obtains convergence at the rate of O(1/ R), which matches FEDYOGI s convergence rate. As R , the error tends to 0. The local and global variance terms are also weighed by 1/|C|. Meaning that with more participating clients, the impact of local and global variance decreases. In Section 5, we empirically show that FLASH converges faster than FEDYOGI. Theorem 4.3 (Lower Bounding the Change in the Second Moment of Effective Gradients). Let {Fc} satisfy Assumptions C.1, C.3, and C.5. In round r, the updates in FLASH and FEDYOGI satisfy, FEDYOGI Er h ( (r))2 v(r) i FLASH Er h ( (r))2 v(r) i r w(r) g w(r 1) g η2 ℓE2G2(1 βr 1 2 ) r w(r) g w(r 1) g vr ηℓEG(1 βr 3) + τ η2 ℓE2G2(1 βr 1 2 ) Discussion. Lower bounding the change in the second moment of effective gradients, as in the above theorem, gives us an intuition of what would be the behavior (in terms of the effective learning rate) of an adaptive optimizer during, and after a concept drift. Since the change in β3 indicates a concept drift, we now analyze how β3 affects the above bound. Recall that β3 (Line 19, Algorithm 1) is the adaptive weighing parameter for tracking the rolling average of gradient disparity d (Line 20, Algorithm 1) as defined in the last paragraph of Section 3.2. A larger β3 means there is no large change in the rolling average of the gradient disparity. On the contrary, a smaller β3 would mean that there is a large change in the rolling average of the gradient disparity. From the stated lower bound, we make two main observations: (a) As the global model reaches its convergence, we see that β3 1. Consequently, FLASH behaves similar to FEDYOGI since the the lower bound becomes zero. This similarity in the behavior is confirmed by the results shown in Figure 2 during pre-drift stages. (b) At the onset of a drift, due to a large gradient disparity, β3 would be closer to 0. β3 0 results in a higher value for the lower bound of the change in the second moment of effective gradients, which in turn results in FEDYOGI being slower to adapt to the drift than FLASH. When β3 0, the lower bound of the LHS term FLASH Er ( (r))2 v(r) decreases due to the negative term ηℓEG(1 βr 3) in its bound. This is the reason why we see a lower value of ( (r))2 v(r) for FLASH during and after a concept drift. Since FEDYOGI does not have such β3 term, the value ( (r))2 v(r) stays higher during and after the drift, which results in Flash: Concept Drift Adaptation in Federated Learning lower learning rate and hence slower drift adaptation for FEDYOGI. 5. Experiments We evaluate FLASH on both convex and non-convex tasks using 6 non-iid federated datasets. We test the efficacy of FLASH on three concept drift setups: sudden, incremental, and recurrent, and compare our results against state-of-theart adaptive optimization, personalization, drift correction, and drift adaptation approaches. 5.1. Experiment Settings Datasets, Tasks, and Models. We have a convex task: Classification of Synthetic data (Li et al., 2020) with a 2 layer fully connected model (Li et al., 2020) (Synthetic). For non-convex tasks, we used Stackoverflow Next Word Prediction (Stackoverflow, 2023) with a 1 layer LSTM model (Reddi et al., 2021) (Stackoverflow), EMNIST (Cohen et al., 2017) Image Classification with a 2 layer CNN model (Reddi et al., 2021) (EMNIST), Shakespeare (Mc Mahan et al., 2017) Next Character Prediction with a 2 layer LSTM model (Reddi et al., 2021) (Shakespeare). The above three datasets have natural non-IID partitions based on the authors. Each author is considered a client in the federated setup, the images/texts generated by that author are samples for that client. We also use CIFAR10/100 (Krizhevsky et al., 2009) for image classification tasks based on Res Net18 (He et al., 2016) (CIFAR10/100). Both datasets are generated artificially in a federated non-iid fashion based on Dirichlet distribution as described in (Reddi et al., 2021). More details in Appendix A. Baselines. Our baselines fall in four groups: (a) Serverand client-side optimization (FEDAVG, FEDPROX, FEDYOGI, FEDNOVA), (b) Personalization (APFL, DITTO, HYPCLUS- TER), (c) Drift correction (SCAFFOLD, FEDDYN, FEDDC), and (d) Drift adaptation (FEDDRIFT, ADAPTIVEFEDAVG shortened to ADAPFA). Note that in case of FEDDRIFT, we are using FEDDRIFTEAGER since our use case demands a synchronized concept drift. We also use an ORACLE algorithm which has knowledge about at what round which client faces a concept drift. ORACLE uses FEDYOGI as its base algorithm. ORACLE simultaneously trains two global models, each based on the preand postdrift distributions. With its knowledge about when the drift occurs, ORACLE simply switches the current global model to reflect the drift. Metrics. Here we define four metrics used to compare performance of FLASH and its baselines. Generalized Accuracy for a round r is the average accuracy of the global model w(r) g on the test data D(r) c of clients c C(r), where C(r) is a set of available clients for the round r. Personalized Accuracy for a round r is the average accuracy of the local model w(r) c,ec on the test data D(r) c of clients c C, where C is a set of participating clients for the round r. Accuracy Dip is the lowest generalized accuracy during a concept drift starting from round r1 to round r2. It is defined as dip(r1,r2) = min({acc(r) g r [r1, r2]}). Rounds till Recovery is the number of rounds taken for the global model w(r) g to reach a steady state after a concept drift has occurred. Concept Drift Setups. Recall that concept drift occurs when the conditional distribution P(y | x) changes. Following the practice in (Tahmasbi et al., 2020; Jothimurugesan et al., 2022), we use label swapping to simulate concept drifts, given the same input features. Since it is not feasible to swap labels for next word prediction tasks, we use the classification tasks (EMNIST, CIFAR10, CIFAR100, and Synthetic) for concept drift experiments. For a task with n labels, we swap ith label with i + 1st label i [0, 2, . . . , n]. Specifically, we simulate three types of concept drifts. (1) Sudden Drift. All the clients face the distribution change abruptly, at the same round. For EMNIST, CIFAR10, CIFAR100, and Synthetic datasets, the concept drift occurs at 600th, 2000th, 2000th, and 500th rounds respectively. (2) Incremental Drift. Starting on the same rounds as described in sudden drift, for every 100 rounds, 20% more clients change their distributions to the new one. (3) Recurrent Drift. First drift occurs abruptly for EMNIST, CIFAR10, CIFAR100, and Synthetic datasets at 600th, 2000th, 2000th, and 500th rounds respectively. Next drift to the old (initial) distribution occurs at 1000th, 2500th, 2500th, and 800th rounds. We use Flower (Beutel et al., 2020) library to implement FLASH and all its baselines. The implementation is available on 1. We use an NVidia 2080ti GPU to run all the experiments with 3 runs for each. The random seeds used are 0, 44, and 56. For all the tasks and datasets, we have uniformly randomly sampled 10 clients per round. Hyperparameter details are given in Appendix A. 5.2. Performance Comparison without Concept Drift This set of experiments aim to demonstrate that FLASH retains the benefits of adaptivity to address statistical heterogeneity issues in FL. We ran FLASH and its baselines in a no concept drift setup to get the base generalized and personalized accuracies. For a fair comparison, we used the same early-stopping criteria for client-side training for all the baselines. Table 1 reports the generalized accuracy. Personalized accuracy shows similar trends and is reported in Appendix B, tables 5 and 6. Overall, FLASH achieves comparable generalized accuracy as state-of-the-art techniques in ad- 1Source Code Flash: Concept Drift Adaptation in Federated Learning Table 1. Generalized accuracy (the higher, the better) for FLASH and baselines with no concept drift. EM = EMNIST, SO = Stackoverflow, SH = Shakespeare, C10 = CIFAR10, C100 = CIFAR100. Full results in Appendix Tables 3 - 6. Tasks EM SO SH C10 C100 C10 C100 Non-iid ness Writers Authors Authors α = 0.1 α = 0.1 α = 0.6 α = 0.6 FEDAVG 84.12% 28.24% 57.29% 56.91% 36.00% 75.28% 41.52% FEDPROX 87.52% 27.78% 57.43% 56.74% 35.83% 76.23% 42.63% FEDYOGI 87.71% 28.96% 57.82% 67.57% 39.92% 78.30% 44.33% FEDNOVA 88.53% 28.56% 58.23% 66.59% 37.45% 77.89% 43.82% SCAFFOLD 88.10% 27.68% 57.59% 65.86% 32.48% 75.43% 44.14% FEDDYN 88.18% 26.08% 57.43% 65.08% 35.20% 77.33% 45.33% FEDDC 90.64% 27.50% 56.46% 65.16% 39.89% 79.28% 44.89% APFL 89.78% 24.32% 59.46% 68.90% 36.44% 78.05% 45.78% DITTO 87.17% 27.97% 59.16% 69.41% 35.41% 78.34% 45.73% HYPCLUSTER 89.66% 27.96% 58.50% 68.23% 39.24% 78.66% 45.49% ADAPFA 88.46% 28.76% 58.42% 66.91% 39.48% 78.31% 44.36% FEDDRIFT 89.12% 27.65% 58.28% 69.18% 38.09% 78.46% 44.49% FLASH 88.17% 29.13% 58.27% 69.45% 40.65% 79.58% 45.65% dressing statistical heterogeneity. Specifically, FLASH outperforms adaptive optimization approaches FEDYOGI and drift correction methods SCAFFOLD and FEDDYN (except with EMNIST) across all tasks. It demonstrates that the new model updating rules in FLASH, designed for concept drift adaptation, retains the benefits of adaptive learning rates. Moreover, FLASH gives comparable performance against personalization methods APFL, DITTO, and HYPCLUSTER. The results echo the findings in (Wu et al., 2022) that multiple epochs of local training (or finetuning) works better than other personalization methods. FEDDRIFT in no drift setting shows no advantage over another clustering algorithm, HYPCLUSTER. ADAPFA, also being an optimizer for drift adaptation, shows comparable performance to FEDYOGI. 5.3. Performance Comparison with Concept Drift This set of experiments aims to demonstrate the better performance of FLASH than baselines in adapting to concept drifts. Here, we report the results from one type of concept drift, incremental drift, as described in Section 5.1. Results for the other two types, sudden and recurrent concept drifts, are similar and reported in Appendix B. Figure 3 shows validation generalized accuracy curves with an incremental drift. Table 2 further reports the accuracy dip and rounds till recovery to the steady state. We observe that although ADAPFA competes with FLASH in terms of round till recovery, it does it with the sacrifice in its during-drift accuracy degradation of up to 49.14% (C100) compared to FLASH. While other methods like FEDDRIFT come close to FLASH in the accuracy dip criteria, they still struggle to recover quickly due to new model creation and training at the onset of a drift. ORACLE does not face any performance degradation due to possessing global models based on both the preand postdrift distribution, FLASH still outperforms it through a more stable effective learning rate adaptivity. Given the same number of rounds pre-drift and post-drift, FEDYOGI is unable to recover to the same steady-state Table 2. The lowest accuracy (the higher, the better) during the concept drift (D) and rounds till recovery (the lower, the better) to the steady state (R) in an incremental concept drift setting. EM = EMNIST, C10 = CIFAR10, C100 = CIFAR100. Tasks EM (D) (R) C10 (D) (R) C100 (D) (R) C10 (D) (R) C100 (D) (R) Non-iid Writers α = 0.1 α = 0.1 α = 0.6 α = 0.6 FEDAVG 14.64% 510 43.63% 290 18.39% 610 62.75% 420 14.88% 670 FEDPROX 64.94% 460 44.40% 270 17.67% 570 63.10% 400 19.48% 540 FEDYOGI 46.90% 360 45.35% 220 19.46% 510 64.40% 180 16.26% 510 FEDNOVA 58.16% 430 44.13% 240 20.64% 530 62.76% 270 17.49% 550 SCAFFOLD 64.35% 620 47.30% 230 20.89% 500 56.30% 250 15.40% 380 FEDDYN 64.24% 380 52.60% 220 19.83% 430 72.55% 230 18.64% 490 FEDDC 81.73% 300 53.15% 230 20.19% 400 71.64% 240 19.33% 480 APFL 88.50% 320 59.44% 180 25.43% 350 64.75% 190 20.07% 560 DITTO 59.71% 540 56.18% 160 28.49% 430 64.28% 170 18.99% 600 HYPCLUST 85.35% 400 58.65% 140 26.20% 580 64.00% 490 13.41% 620 ADAPFA 59.46% 320 52.25% 70 20.41% 260 70.31% 110 23.17% 270 FEDDRIFT 83.63% 350 63.63% 130 36.10% 240 74.35% 160 38.08% 430 ORACLE 91.65% 0 72.85% 0 40.13% 0 79.29% 0 42.72% 0 FLASH 89.18% 260 70.45% 60 38.65% 210 76.30% 80 40.52% 220 accuracies in case of EMNIST, CIFAR100, and CIFAR10 (α = 0.1) tasks. For all the datasets, the accuracy dip (see Table 2) is also larger for FEDYOGI compared to FLASH. The faster recovery of FLASH when the drift is injected for 20% more clients every 100 rounds indicates that in the first 100 rounds interval, FLASH can still detect the drift and adapt accordingly. We also make observations for the cases of adaptation and starting from scratch in the concept drift setup. The main difference between FLASH and FEDDRIFT is that while FLASH lets the same global model adapt (on server side) to a new distribution, FEDDRIFT starts a new global model when a client encounters a new distribution. As described in Section 5.1, for this specific case of concept drift, the feature distribution P(x) remains the same after the labels have been swapped. Hence, the global model based on P1(x, y) would still offer utility when learning a new distribution P2(x, y) in terms of having similar feature extraction parameters before and after the concept drift. Hence we see a lower accuracy dip and faster adaptation in case of FLASH. Drift correction methods are rigid to a single global distribution. They update local models based on the assumption of a single global distribution, rather than adapt global model to the distributions learned by the local models. Personalization helps in case of the aforementioned issue of rigidity since the personalized models of each client do not need to correct themselves according to any global distribution. Yet the global models generated by each client remain suboptimal because of the slower adaptability of the server-side aggregation methods of these personalization approaches. Computation Savings after Concept Drifts. We further report the computation savings resulting from FLASH s faster adaptation to concept drifts compared to FEDYOGI. After concept drifts, FLASH requires less number of federated rounds to recover accuracy, which directly translates to the less number of epochs on the client side to train the global Flash: Concept Drift Adaptation in Federated Learning 200 400 600 800 1000 1200 1400 Number of Rounds Validation Accuracy Fed Yogi Fed DC APFL Fed Drift Adap FA Oracle Flash 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Yogi Fed DC Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (b) CIFAR 100 (α = 0.6) 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Yogi Fed Dyn APFL Fed Drift Adap Fed Avg Oracle Flash (c) CIFAR 10 (α = 0.6) Figure 3. Accuracy curves with incremental drift after steady state. The validation accuracies have been averaged across 100 rounds of interval. Plotted baselines are best of each categories from server-side optimization (FEDYOGI), personalization, and drift correction (varies depending on the dataset). All drift adaptation baselines along with FLASH and ORACLE are shown. Rest of the baselines depicted in Figure 8 in Appendix B. 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Cumulative Epochs across the Participating Clients Fed Yogi without Shift Flash without Shift Fed Yogi with Sudden Shift Flash with Sudden Shift (a) CIFAR 10 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Cumulative Epochs across the Participating Clients Fed Yogi without Shift Flash without Shift Fed Yogi with Sudden Shift Flash with Sudden Shift (b) CIFAR 100 Figure 4. Total number of epochs ran across participating clients per round r for FLASH vs FEDYOGI. Global model adaptation in FLASH leads to lower amount of local training done by the clients. model. Figure 4 shows the total number of epochs to train the global model to reach the same early-stopping criteria for FLASH and FEDYOGI in each round. The total number of epochs is the sum of the number of epochs per participating client in that round. At the onset of both the training (round 0) and the drift (round 2000), a client has to train the global model for many epochs (4-6 epochs in average per client) as the global model has not yet reached a steady state. As the training approaches the steady state (starting from around r = 1000), each client uses less number of epochs (less than 3 epochs in average). We observe that, after concept drift occurs, FLASH has 11.18-11.79% lower number of epochs per round compared to Fed Yogi because FLASH adapts to the new distribution faster, resulting in clients having to do less work. Impact of the Ratio of Drifted Clients. We observe the impact of a ratio pdrift of clients facing a concept drift to total clients in a round for FLASH. As the ratio decreases, the global impact of only a few clients (pdrift < 0.25) getting injected with a drift gets absorbed by rest of the clients who do not face a drift. Results on the drift ratio and its impact on adaptation are in Appendix B. 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Personalized Accuracy E = 1 E = 3 E = 5 E = 7 E = 10 Dynamic E (a) No drift 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Personalized Accuracy Flash Dynamic Epochs Flash Fixed Epochs (5) (b) With sudden drift Figure 5. Personalized accuracy of FLASH with dynamic number of epochs (through early-stopping) versus fixed number of epochs E {1, 3, 5, 7, 10} on CIFAR10. 5.4. Ablation Study: Fixed Epochs v.s. Early Stopping This set of experiments aims to verify the importance of early-stopping training by comparing it with fixed-epoch training. We empirically verify the two benefits of earlystopping training (see Section 3.1): (1) It improves the personalized accuracy of the local model on a client and (2) It produces a more reliable signal in detecting concept drifts. Figure 5a reports the personalized accuracy curves from fixed-epoch training and early-stopping training in a no concept drift setting. We observe that early-stopping training outperforms the best of the fixed-epoch training (with E = 5) in personalized accuracy. With early stopping, each client s epoch count depends on the local gradient variance of the client and its rate of loss reduction (see Theorem 4.1), which in turn is based on its heterogeneity. Hence, a less heterogeneous client could spend less epochs on local training, avoiding over-fitting. The same trend is also seen in generalized accuracy curves. Figure 5b shows the accuracy curves with sudden concept drifts at round 2000 using CIFAR10 dataset. The lower accuracy dip from early-stopping training at the onset of the Flash: Concept Drift Adaptation in Federated Learning concept drift demonstrates that early-stopping training is better than fixed-epoch training in adapting concept drift. 6. Conclusion In this work, we studied concept drift adaptation in federated learning. We proposed FLASH that leverages clientside early-stopping training to facilitate the concept drift detection and server-side drift-aware adaptive optimization to address both statistical heterogeneity and concept drift adaptation. We gave convergence rate for FLASH and its client-side early stopping training. We also empirically evaluated the effectiveness of FLASH in improving generalized and personalized accuracy and reducing accuracy dips at the onset of concept drifts and the federated rounds taken to recover from concept drifts using a set of tasks and different drift patterns. For future work, it would be interesting to see how this approach translates to real-world drifts where the drifts can occur in an ad-hoc manner and changes in joint distribution instead of class conditional distribution. Acknowledgement We would like to express our sincere gratitude to the UMass Amherst Startup Funding and the gift funding from Adobe for supporting this work. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017. Shanshan Wu, Tian Li, Zachary Charles, Yu Xiao, Ziyu Liu, Zheng Xu, and Virginia Smith. Motley: Benchmarking heterogeneity and personalization in federated learning. ar Xiv preprint 2206.09262, 2022. Tyler J Loftus, Matthew M Ruppert, Benjamin Shickel, Tezcan Ozrazgat-Baslanti, Jeremy A Balch, Philip A Efron, Jr. Gilbert R Upchurch, Parisa Rashidi, Christopher Tignanelli, Jiang Bian, and Azra Bihorac. Federated learning for preserving data privacy in collaborative healthcare research. Digital Health, 2022. Jean Ogier du Terrail, Samy-Safwan Ayed, Edwige Cyffers, Felix Grimberg, Chaoyang He, Regis Loeb, Paul Mangold, Tanguy Marchand, Othmane Marfoq, Erum Mushtaq, Boris Muzellec, Constantin Philippenko, Santiago Silva, Maria Tele nczuk, Shadi Albarqouni, Salman Avestimehr, Aur elien Bellet, Aymeric Dieuleveut, Martin Jaggi, Sai Praneeth Karimireddy, Marco Lorenzi, Giovanni Neglia, Marc Tommasi, and Mathieu Andreux. FLamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings. In Thirtysixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. Suresh Dara, Ambedkar Kanapala, A. Ramesh Babu, Swetha Dhamercherala, Ankit Vidyarthi, and Ruchi Agarwal. Scalable federated-learning and internet-of-things enabled architecture for chest computer tomography image classification. Computers and Electrical Engineering, 2022. Li Li, Xi Yu, Xuliang Cai, Xin He, and Yanhong Liu. Contract theory based incentive mechanism for federated learning in health crowdsensing. IEEE Internet of Things Journal, 2022. Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, 2020. Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcn y, Sanjiv Kumar, and Hugh Brendan Mc Mahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021. Peter Kairouz, H. Brendan Mc Mahan, Brendan Avent, Aur elien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adri a Gasc on, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Za ıd Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konecn y, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancr ede Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Ozg ur, Rasmus Pagh, Hang Qi, Daniel Ramage, Ramesh Raskar, Mariana Raykova, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tram er, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 2021. Giuseppe Canonaco, Alex Bergamasco, Alessio Mongelluzzo, and Manuel Roveri. Adaptive federated learning in presence of concept drift. In 2021 International Joint Conference on Neural Networks (IJCNN), 2021. Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three approaches for Flash: Concept Drift Adaptation in Federated Learning personalization with applications to federated learning. arxiv preprint 2002.10619, 2020. Yihan Jiang, Jakub Koneˇcn y, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. arxiv preprint 1909.12488, 2019. Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. In Advances in Neural Information Processing Systems, 2020. Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. ar Xiv preprint 2003.13461, 2020. Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. International Conference on Machine Learning, 2021. Hong-You Chen and Wei-Lun Chao. On bridging generic and personalized federated learning for image classification. In International Conference on Learning Representations, 2022. Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2021. Ellango Jothimurugesan, Kevin Hsieh, Jianyu Wang, Gauri Joshi, and Phillip Gibbons. Federated learning under distributed concept drift. In Neur IPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022. Jo ao Gama, Indrundefined ˇZliobaitundefined, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 2014. Ashraf Tahmasbi, Ellango Jothimurugesan, Srikanta Tirthapura, and Phillip B. Gibbons. Driftsurf: A riskcompetitive learning algorithm under concept drift. arxiv preprint ar Xiv:2003.06508, 2020. Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020. Stackoverflow. Tensor Flow Federated Datasets. https: //www.tensorflow.org/federated/api_ docs/python/tff/simulation/datasets, 2023. [Online; accessed 09-Jan-2023]. Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andr e van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921 2926, 2017. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images, 2009. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D Lane. Flower: A friendly federated learning research framework. ar Xiv preprint 2007.14390, 2020. Stackoverflow Dataset Fed ML. Fed ML Federated Stackoverflow Heterogenity . https: //github.com/Fed ML-AI/Fed ML/blob/ ecd2d81222301d315ca3a84be5a5ce4f33d6181c/ doc/en/simulation/benchmark/ BENCHMARK_MPI.md, 2022. [Online; accessed 14-Sep-2022]. Shakespeare. Tensor Flow Federated Datasets. https://www.tensorflow.org/federated/ api_docs/python/tff/simulation/ datasets/shakespeare, 2023. [Online; accessed 09-Jan-2023]. CIFAR100. Tensor Flow Federated Datasets. https://www.tensorflow.org/federated/ api_docs/python/tff/simulation/ datasets/cifar100/load_data, 2023. [Online; accessed 18-Jan-2023]. Flash: Concept Drift Adaptation in Federated Learning A. Datasets and Hyperparameters EMNIST EMNIST dataset (Cohen et al., 2017) has 3400 unique clients where each client has train, validation, and test datasets. The train datasets of all the clients combined have in total 671,585 instances. The validation and test datasets of all the clients combined have a total of 77,483 instances each. The heterogeneity of EMNIST clients stems from the individual writing style of each client (where one client is one person), as discussed in Appendix C.2 of (Reddi et al., 2021). We have trained each of our baselines and FLASH for R = 1500 rounds, with the batch size of N = 20 instances, 10 clients per round. All the experiments have ran for E = 10 with the early-stopping criteria discussed in 3.1. The default learning rates for all the experiments is ηℓ= 0.05 and ηg = 1.00. Although SCAFFOLD and FEDDYN required ηℓ= 0.03. For both FEDPROX and FEDDYN, λ was assigned 0.001. APFL has α = 0.25. And DITTO has λ = 0.1 and client learning rate of ηℓ= 0.01. α in FEDDC has been assigned to 0.5. While ρ in FEDNOVA has been assigned to 0.8. FLASH has γ = 0.04. Stackoverflow Stackoverflow dataset (Stackoverflow, 2023) has separate clients for training, validation, and testing. There are 342,477 train clients, having a combined sample count of 135,818,730. Similarly, there are 38,758 validation and 204,088 test clients having a combined sample count of 16,491,230 and 16,586,035 respectively. Since we need validation set for early-stopping training for each client, we divide a client s train dataset in a 7:3 split to get the training and the validation sets. This is a naturally heterogeneous dataset (Fed ML, 2022). Each user of Stackoverflow is a client and their posts form a dataset for the client. The dataset is heterogeneous in two ways: First, users have different writing styles and thus clients datasets are not i.i.d. Second, the total number of posts from each user is also different, leading to different sizes of datasets per client. We have trained each of our baselines and FLASH for R = 2000 rounds, with the batch size of N = 16 instances, 10 clients per round. The vocabulary consists is of 10,000 tokens. Similar to EMNIST, all the experiments have ran for E = 10 with the validation loss based early-stopping criteria. The default learning rates for all the experiments is ηℓ= 0.3 and ηg = 1.00. FEDPROX has λ set to 0.01. While λ for FEDDYN and DITTO have been set to 0.001 and 0.1 respectively. For APFL, the interpolation factor α is 0.25. The value of ρ for FEDNOVA is 0.9. FLASH has γ set to 0.05. Shakespeare Shakespeare dataset (Shakespeare, 2023) has 715 unique clients where each client has train, validation, and test datasets. The train datasets of all the clients combined have in total 12,854 instances. The validation and test datasets of all the clients combined have a total of 3,214 and 2,356 instances respectively. Shakespeare dataset is heterogeneous because of each client is one play written by William Shakespeare, and all the plays have different setting and characters. Each of our baselines and FLASH are trained for R = 1500 rounds, with the batch size of N = 4, 10 clients per round. The vocabulary size is 90 as each token is related to a character in the English language. The maximum number of epochs E is set to 10 for all the algorithms, with early-stopping. Default learning rates for all the experiments are ηℓ= 0.1 and ηg = 1.00. FEDPROX and FEDDYN has their λ = 0.001. While DITTO has λ = 0.1. APFL and FEDDC have their α = 0.5. FEDNOVA has ρ = 0.85. Early stopping threshold γ of FLASH is set to 0.05. CIFAR10 CIFAR10 dataset is created from the centralized version of CIFAR10 dataset (Krizhevsky et al., 2009) having 50,000 images. Federated CIFAR10 dataset has 500 unique clients, each having 100 samples for training, and 20 samples for testing. A client would receive the training and testing samples according to the Dirichlet distribution (Reddi et al., 2021). A Dirichlet distribution with the parameter α [0, 1] determines the heterogeneity of a client, with a client being more heterogeneous as α 0. Here, heterogeneity means how dissimilar the dataset instances sampled from a distribution are. We have experimented on α = 0.1 and α = 0.6. FLASH and its baselines are trained for R = 4000 rounds, with batch size of N = 20, with 10 clients per round. Maximum number of epochs is set to E = 15 for the early stopping criteria. Default learning rates for all the experiments are ηℓ= 0.05 and ηg = 0.5. SCAFFOLD, HYPCLUSTER, FEDDYN and APFL are set for ηℓ= 0.01, 0.01 and 0.09 respectively. FEDPROX, DITTO and FEDDYN have their λ set to 0.08, 0.01, and 0.05. APFL and FEDDC have their α = 0.25 and 0.01. FEDNOVA has ρ = 0.9. FLASH has γ = 0.04. CIFAR100 Similar to CIFAR10, CIFAR100 dataset (CIFAR100, 2023) is also created from CIFAR100 dataset (Krizhevsky et al., 2009) having 50,000 images. The client count and train-test image count are same as that of CIFAR10. Here too, we have experimented with the Dirichlet parameter α = 0.1 and α = 0.6. FLASH and its baselines are trained for R = 4000 rounds, with batch size of N = 20, with 10 clients per round. Maximum Flash: Concept Drift Adaptation in Federated Learning number of epochs is set to E = 15 for the early stopping criteria. Default learning rates for all the experiments are ηℓ= 0.01 and ηg = 0.5. Although SCAFFOLD works best at ηℓ= 0.03. FEDPROX, DITTO and FEDDYN have their λ set to 0.01. APFL and FEDDC have their α = 0.25 and 0.01. FEDNOVA has ρ = 0.8. FLASH has γ = 0.05. Synthetic Synthetic dataset is generated with the same procedure described in (Li et al., 2020). The total number of clients is 30. The dimension of input features is 60, and there are 10 output classes. Each client has sample count randomly generated from the log-normal distribution with its µ = 4 and σ = 2, and size = number of clients. The parameter which controls how much local models differ from each other is set to α = 0.5. And the parameter which controls how much the local data at each client differs from that of other clients is set to β = 0.5. We have trained all the baselines and FLASH for R = 1000 rounds, with batch size of N = 10, with 10 clients per round. Maximum number of epochs is set to E = 8 for the early stopping criteria. Default learning rates for all the experiments are ηℓ= 0.01 and ηg = 0.05. Although SCAFFOLD, FEDDYN, FEDDC, DITTO prefer ηℓ= 0.005, ηℓ= 0.005, ηℓ= 0.05, ηℓ= 0.005. FEDPROX, DITTO and FEDDYN have their λ set to 0.25, 0.01, and 0.01. APFL and FEDDC have their α = 0.5 and 0.1. FEDNOVA has ρ = 0.95. FLASH has γ = 0.03. B. Additional Results B.1. Few clients with sudden drift: FLASH balances local training and global adaptation Recall that the goal of FLASH is to let all heterogeneous clients have well-trained local models. If the majority of those local models exhibit a drift towards a new data distribution, we let the global model adapt to that new distribution. To this end, we find it interesting to see what ratio of clients have to face a concept drift for the global adaptation to get triggered. We experiment with different ratios of clients which face a sudden concept drift over all the participating clients of a certain round. As shown in Figure 6, pdrift = 1.00 is the most extreme case where all the clients face the drift. We see the highest accuracy dip there. We also observe that the accuracy achieved after global drift adaptation is lower than what a global model following the same distribution throughout the training can achieve. 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy pshift = 0.1 pshift = 0.25 pshift = 0.50 pshift = 0.75 pshift = 1.00 (a) CIFAR 10 200 400 600 800 1000 1200 1400 Number of Rounds Validation Accuracy pdrift = 0.1 pdrift = 0.25 pdrift = 0.50 pdrift = 0.75 pdrift = 1.00 Figure 6. Behavior of FLASH with changing pdrift, the ratio of participating clients facing the sudden concept drift over total participating clients for a round r. The sudden concept drift occurs at 2000th and 600th rounds for CIFAR10 and EMNIST respectively. B.2. Performance Comparison without Concept Drift In this section, we discuss the generalized and personalized accuracies in case of no concept drift. As discussed in Section 5, Generalized Accuracy for a round r is the average accuracy of the global model w(r) g on the test data D(r) c of clients Flash: Concept Drift Adaptation in Federated Learning Table 3. Generalized accuracy (P), standard deviation of individual client s accuracy (I), and standard deviation of individual experiment run (E) for FLASH vs baselines, in a no distribution change setting. EM = EMNIST, SO = Stackoverflow, C10 = CIFAR10, C100 = CIFAR100. Tasks EM (P) (I) (E) SO (P) (I) (E) C10 (P) (I) (E) C100 (P) (I) (E) Non-iid ness Writers Authors α = 0.1 α = 0.1 LOCAL ONLY - - - - - - - - - - - - FEDAVG 84.12% 7.92% 1.35% 28.24% 4.44% 0.22% 56.91% 17.69% 1.42% 36.00% 10.21% 0.54% FEDPROX 87.52% 4.92% 1.10% 27.78% 4.26% 0.09% 56.74% 18.34% 1.23% 35.83% 11.80% 0.30% FEDYOGI 87.71% 7.14% 1.02% 28.96% 4.36% 0.19% 67.57% 15.08% 1.57% 39.92% 10.39% 0.34% FEDNOVA 88.53% 7.15% 0.96% 28.56% 4.03% 0.22% 66.59% 8.23% 1.77% 37.45% 10.49% 0.46% SCAFFOLD 88.10% 7.47% 1.12% 27.68% 4.32% 0.20% 65.86% 13.17% 1.65% 32.48% 10.46% 0.40% FEDDYN 88.18% 7.84% 0.86% 26.08% 4.52% 0.12% 65.08% 12.90% 1.43% 35.20% 11.19% 0.62% FEDDC 90.64% 6.81% 0.79% 27.50% 4.76% 0.26% 65.16% 9.30% 1.14% 39.89% 10.44% 0.48% APFL 89.78% 7.48% 0.90% 24.32% 4.49% 0.31% 68.90% 15.82% 0.95% 36.44% 11.13% 0.44% DITTO 87.17% 5.01% 0.95% 27.97% 4.10% 0.20% 69.41% 13.73% 1.08% 35.41% 10.77% 0.35% HYPCLUSTER 89.66% 7.31% 1.25% 27.96% 4.51% 0.24% 68.23% 11.08% 1.76% 39.24% 10.85% 0.55% ADAPFA 88.46% 6.26% 1.06% 28.76% 4.68% 0.19% 66.91% 10.20% 1.27% 39.48% 10.60% 0.37% FEDDRIFT 89.12% 6.36% 1.07% 27.65% 4.75% 0.23% 69.18% 11.27% 1.18% 38.09% 11.06% 0.39% FLASH 88.17% 5.27% 1.20% 29.13% 4.22% 0.12% 69.45% 9.10% 1.24% 40.65% 11.24% 0.42% c C(r), where C(r) is a set of available clients for the round r. It is formulated as acc(r) g = 1 (x,y) D(r) c 1{y = w(r) g (x)} |D(r) c | . (1) Personalized Accuracy for a round r is the average accuracy of the local model w(r) c,ec on the test data D(r) c of clients c C, where C is a set of participating clients for the round r. It is formulated as acc(r) p = 1 (x,y) D(r) c 1{y = w(r) c,e(r) c (x)} |D(r) c | . (2) Besides the above two metrics, we also compute Standard Deviation of both the accuracies across clients. Using the notation of acc(r) c for the generalized or personalized accuracies based on the test dataset of client c, we formulate it as follows, s P c C(r)(acc(r) acc(r) c )2 Tables 3 and 4 show the generalized accuracy (denoted as P in the tables), standard deviation of the generalized accuracy across clients (denoted as I in the tables), and standard deviation across experiment runs based on different seeds (denoted as E in the tables), for all the datasets. The main observation from these results is that FLASH gives comparable performance to FEDYOGI as in case of no concept drift, FLASH reduces to FEDYOGI as β3 1 in line 19 in Algorithm 1, resulting in line 20 being d(r) d(r 1). Initially d(0) is set to 0, and without any drifts, d(r) for any r [R] would stay closer to 0. This results in Flash turning into its base adaptive optimizer (which can be any of FEDADAM, FEDYOGI and the like in (Reddi et al., 2021)). The same observations are also made for personalized accuracy and its standard deviations in Tables 5 and 6. B.3. Performance Comparison with Concept Drift In this section we show learning curves of the global model in case of sudden, increment, and recurrent concept drifts, as shown in Figures 7, 8, and 9. There are two metrics under focus here: (a) Accuracy Dip is the lowest generalized accuracy during a concept drift starting from round r1 to round r2. It is defined as dip(r1,r2) = min({acc(r) g r [r1, r2]}). (4) (b) Rounds till Recovery is the number of rounds taken for the global model w(r) g to reach a steady state after a concept drift has occurred. Figures 7, 8, and 9 depict superiority of FLASH over all categories of its baselines (Serverand client-side Flash: Concept Drift Adaptation in Federated Learning Table 4. Generalized accuracy (P), standard deviation of individual client s accuracy (I), and standard deviation of individual experiment run (E) for FLASH vs baselines, in a no distribution change setting. SH = Shakespeare, SY = Synthetic, C10 = CIFAR10, C100 = CIFAR100. Tasks SH (P) (I) (E) SY (P) (I) (E) C10 (P) (I) (E) C100 (P) (I) (E) Non-iid ness Authors α = 0.5, β = 0.5 α = 0.6 α = 0.6 LOCAL ONLY - - - - - - - - - - - - FEDAVG 57.29% 9.72% 0.35% 90.46% 24.07% 0.09% 75.28% 9.54% 0.78% 41.52% 10.52% 0.19% FEDPROX 57.43% 10.48% 0.41% 92.76% 24.58% 0.09% 76.23% 9.00% 0.69% 42.63% 9.89% 0.36% FEDYOGI 57.82% 10.59% 0.44% 93.20% 24.65% 0.16% 78.30% 9.02% 0.94% 44.33% 10.42% 0.24% FEDNOVA 58.23% 9.43% 0.53% 93.57% 22.18% 0.23% 77.89% 8.41% 1.15% 43.82% 8.52% 0.53% SCAFFOLD 57.59% 10.81% 0.53% 92.92% 28.64% 0.20% 75.43% 8.74% 0.86% 44.14% 9.80% 0.35% FEDDYN 57.43% 11.57% 0.39% 91.89% 24.68% 0.14% 77.33% 9.35% 0.71% 45.33% 10.08% 0.14% FEDDC 56.46% 11.55% 0.41% 93.10% 26.07% 0.15% 79.28% 9.83% 1.02% 44.89% 10.44% 0.35% APFL 59.46% 7.04% 0.51% 91.78% 25.72% 0.08% 78.05% 8.72% 0.75% 45.78% 10.63% 0.29% DITTO 59.16% 11.23% 0.43% 92.34% 15.79% 0.10% 78.34% 10.79% 0.83% 45.73% 10.58% 0.26% HYPCLUSTER 58.50% 10.99% 0.60% 92.36% 25.15% 0.19% 78.66% 8.74% 0.74% 45.49% 9.91% 0.19% ADAPFA 58.42% 11.40% 0.58% 92.88% 27.77% 0.17% 78.31% 9.33% 0.67% 44.36% 10.02% 0.14% FEDDRIFT 58.28% 9.30% 0.51% 92.38% 28.36% 0.24% 78.46% 9.38% 0.83% 44.49% 9.27% 0.13% FLASH 58.27% 6.98% 0.56% 93.92% 25.42% 0.11% 79.58% 8.14% 0.85% 45.65% 10.23% 0.26% Table 5. Personalized accuracy (P), standard deviation of individual client s accuracy (I), and standard deviation of individual experiment run (E) for FLASH vs baselines, in a no distribution change setting. EM = EMNIST, SO = Stackoverflow, C10 = CIFAR10, C100 = CIFAR100. Tasks EM (P) (I) (E) SO (P) (I) (E) C10 (P) (I) (E) C100 (P) (I) (E) Non-iid ness Writers Authors α = 0.1 α = 0.1 LOCAL ONLY 28.18% 18.56% 1.14% 15.93% 5.31% 0.25% 49.78% 16.65% 1.56% 36.19% 11.91% 0.43% FEDAVG 83.06% 7.80% 0.95% 28.12 % 4.45% 0.12% 55.81% 17.77% 1.03/% 35.69% 10.41% 0.53% FEDPROX 86.67% 4.75% 1.02% 28.28% 4.51% 0.16% 55.56% 17.99% 1.25% 35.39% 10.60% 0.47% FEDYOGI 87.31% 6.68% 1.25% 29.42% 4.85% 0.21% 68.35% 15.44% 1.63% 39.84% 10.42% 0.36% FEDNOVA 89.26% 4.52% 0.89% 29.06% 4.67% 0.15% 67.56% 14.67% 1.27% 37.26% 9.76% 0.34% SCAFFOLD 87.51% 7.83% 1.08% 27.82% 4.31% 0.11% 65.84% 13.03% 1.78% 38.44% 10.53% 0.54% FEDDYN 87.22% 7.92% 1.12% 26.00% 4.61% 0.18% 65.90% 12.85% 0.96% 39.90% 10.55% 0.49% FEDDC 89.98% 6.91% 1.32% 27.14% 4.75% 0.22% 66.47% 9.58% 1.12% 41.72% 10.38% 0.61% APFL 90.22% 7.19% 1.27% 27.34% 4.32% 0.13% 69.11% 15.84% 1.43% 40.66% 9.81% 0.41% DITTO 88.11% 5.01% 0.93% 28.92% 4.55% 0.20% 69.28% 13.09% 1.11% 39.39% 10.56% 0.56% HYPCLUSTER 89.04% 7.47% 1.18% 28.65% 4.50% 0.19% 70.17% 11.12% 1.52% 39.10% 10.49% 0.59% ADAPFA 88.86% 6.22% 0.83% 29.01% 4.66% 0.24% 67.29% 10.34% 1.53% 39.25% 9.76% 0.53% FEDDRIFT 89.68% 7.42% 0.92% 28.34% 4.19% 0.28% 70.45% 12.78% 1.67% 39.22% 10.62% 0.47% FLASH 89.44 % 7.90% 1.04% 30.24% 4.36% 0.14% 70.07% 8.37% 1.21% 41.23% 11.57% 0.46% Table 6. Personalized accuracy (P), standard deviation of individual client s accuracy (I), and standard deviation of individual experiment run (E) for FLASH vs baselines, in a no distribution change setting. SH = Shakespeare, SY = Synethtic, C10 = CIFAR10, C100 = CIFAR100. Tasks SH (P) (I) (E) SY (P) (I) (E) C10 (P) (I) (E) C100 (P) (I) (E) Non-iid ness Authors α = 0.5, β = 0.5 α = 0.6 α = 0.6 LOCAL ONLY - - - - - - - - - - - - FEDAVG 57.35% 9.72% 0.35% 90.46% 24.07% 0.09% 75.53% 9.54% 0.78% 41.81% 10.52% 0.19% FEDPROX 57.27% 10.48% 0.41% 92.76% 24.58% 0.09% 76.35% 9.00% 0.69% 42.76% 9.89% 0.36% FEDYOGI 57.75% 10.59% 0.44% 93.20% 24.65% 0.16% 78.47% 9.02% 0.94% 41.71% 10.42% 0.24% FEDNOVA 58.86% 9.43% 0.53% 93.57% 22.18% 0.23% 75.93% 8.41% 1.15% 46.12% 8.52% 0.53% SCAFFOLD 57.65% 10.81% 0.53% 92.92% 28.64% 0.20% 75.84% 8.74% 0.86% 44.40% 9.80% 0.35% FEDDYN 56.54% 11.57% 0.39% 91.89% 24.68% 0.14% 77.49% 9.35% 0.71% 45.57% 10.08% 0.14% FEDDC 56.62% 11.55% 0.41% 93.10% 26.07% 0.15% 79.43% 9.83% 1.02% 45.06% 10.44% 0.35% APFL 59.06% 7.04% 0.51% 91.78% 25.72% 0.08% 80.56% 8.72% 0.75% 46.87% 10.63% 0.29% DITTO 60.05% 11.23% 0.43% 92.34% 15.79% 0.10% 79.86% 10.79% 0.83% 46.77% 10.58% 0.26% HYPCLUSTER 59.84% 10.99% 0.60% 92.36% 25.15% 0.19% 79.76% 8.74% 0.74% 46.92% 9.91% 0.19% ADAPFA 59.07% 11.40% 0.58% 92.88% 27.77% 0.17% 78.56% 9.33% 0.67% 43.86% 10.02% 0.14% FEDDRIFT 59.18% 9.30% 0.51% 92.38% 28.36% 0.24% 78.82% 9.38% 0.83% 46.04% 9.27% 0.13% FLASH 59.19% 6.98% 0.56% 93.92% 25.42% 0.11% 79.95% 8.14% 0.85% 45.90% 10.23% 0.26% Flash: Concept Drift Adaptation in Federated Learning Table 7. Lowest accuracy during the concept drift (P), rounds till recovery to the steady state (R) for FLASH vs baselines for all the tasks, in a sudden concept drift setting. EM = EMNIST, C10 = CIFAR10, C100 = CIFAR100, SY = Synthetic. Tasks EM (P) (R) C10 (P) (R) C100 (P) (R) C10 (P) (R) C100 (P) (R) SY (P) (R) Non-iid ness Writers α = 0.1 α = 0.1 α = 0.6 α = 0.6 α = 0.5, β = 0.5 FEDAVG 5.11% 490 49.05% 240 13.05% 570 57.85% 380 9.25% 610 86.25% 160 FEDPROX 61.50% 390 43.10% 230 12.65% 530 63.60% 310 15.99% 510 84.61% 80 FEDYOGI 43.82% 370 46.42% 190 17.50% 480 65.40% 150 15.60% 550 86.07% 150 FEDNOVA 52.48% 370 50.12% 210 16.48% 500 64.18% 220 16.74% 520 82.09% 120 SCAFFOLD 58.51% 600 56.50% 240 18.73% 460 53.45% 200 14.20% 500 81.69% 80 FEDDYN 63.18% 360 56.22% 210 17.95% 380 69.65% 190 18.25% 430 75.16% 70 FEDDC 80.43% 260 55.38% 210 19.30% 360 64.30% 190 18.75% 410 84.07% 90 APFL 84.33% 270 59.70% 140 21.00% 320 69.95% 190 16.60% 520 85.35% 110 DITTO 75.05% 510 58.86% 150 15.21% 410 62.83% 220 16.96% 470 86.19% 80 HYPCLUSTER 81.32% 350 60.50% 150 14.16% 560 56.00% 450 11.80% 600 83.06% 50 ADAPFA 44.08% 270 54.08% 410 19.00% 230 65.95% 90 18.48% 90 87.04% 60 FEDDRIFT 82.31% 310 65.67% 100 34.22% 300 76.32% 130 31.07% 380 90.30% 90 ORACLE 90.28% 0 72.15% 0 35.75% 0 78.25% 0 44.06% 0 93.55% 0 FLASH 88.00% 210 70.99% 50 35.81% 180 76.90% 60 37.33% 160 91.56% 40 optimization, Personalization, Drift correction, and Drift adaptation), in both of the above metrics for all 3 concept drift settings. For sudden concept drift, FLASH comes closest to the ORACLE (an algorithm which has knowledge of when the drift would occur, and also has a well-trained global model for the new global data distribution already available) with the accuracy dips of only 1.03% (EMNIST), 2.33% (CIFAR10 0.1), 1.85% (CIFAR10 0.6), 1.09% (CIFAR100 0.1), and 1.08% (CIFAR100 0.6). Compared to the above results, FEDYOGI sees larger accuracy dips with respect to ORACLE, with 5.55% (EMNIST), 16.86% (CIFAR10 0.1), 3.37% (CIFAR10 0.6), 4.48% (CIFAR100 0.1), and 1.47% (CIFAR100 0.6). The lowest accuracy during the sudden concept drift and rounds till recovery metrics are depicted in Table 7, indicating that FLASH not only takes the least damage in terms of lower accuracy during the drift, but also recovers from the drift faster than its state-of-the-art baselines. For incremental concept drift setting, the adaptability of FLASH compared to FEDYOGI during the drift is highlighted even further. FLASH sees accuracy dips of only 0.83% (EMNIST), 2.77% (CIFAR10 0.1), 2.44% (CIFAR10 0.6), 2.16% (CIFAR100 0.1), and 1.07% (CIFAR100 0.6). While FEDYOGI sees accuracy dips of 4.98% (EMNIST), 14.96% (CIFAR10 0.1), 4.95% (CIFAR10 0.6), 8.40% (CIFAR100 0.1), and 1.08% (CIFAR100 0.6) Similar observations are made in Figure 9, with one additional observation: the recurring drift stage sees a lower dip in the accuracies compared to the first drift. This indicates that the global models still retain some of the information related to the first distribution even after drift has occurred. Flash: Concept Drift Adaptation in Federated Learning 200 400 600 800 1000 1200 1400 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash 200 400 600 800 1000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (b) Synthetic 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (c) CIFAR 10 (α = 0.1) 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (d) CIFAR 10 (α = 0.6) 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (e) CIFAR 100 (α = 0.6) Figure 7. Accuracy curves for EMNIST, Synthetic, and CIFAR 10 / 100 datasets with sudden concept drift after steady state Flash: Concept Drift Adaptation in Federated Learning 200 400 600 800 1000 1200 1400 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash 200 400 600 800 1000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (b) Synthetic 0 500 1000 1500 2000 2500 3000 3500 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (c) CIFAR 10 (α = 0.1) 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (d) CIFAR 10 (α = 0.6) 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (e) CIFAR 100 (α = 0.6) Figure 8. Accuracy curves for EMNIST, Synthetic, and CIFAR 10 / 100 datasets with incremental concept drift after steady state Flash: Concept Drift Adaptation in Federated Learning 200 400 600 800 1000 1200 1400 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash 200 400 600 800 1000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (b) Synthetic 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (c) CIFAR 10 (α = 0.1) 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (d) CIFAR 10 (α = 0.6) 0 500 1000 1500 2000 2500 3000 3500 4000 Number of Rounds Validation Accuracy Fed Avg Fed Prox Fed Yogi Fed Nova Scaffold Fed Dyn Fed DC APFL Ditto Hyp Cluster Fed Drift Adap Fed Avg Oracle Flash (e) CIFAR 100 (α = 0.6) Figure 9. Accuracy curves for EMNIST, Synthetic, and CIFAR 10 / 100 datasets with recurrent concept drift after steady state Flash: Concept Drift Adaptation in Federated Learning C. Analysis of FLASH In Federated Learning, we solve an optimization problem of the form: min wg Rd f(wg) = 1 c C Fc(wg), where Fc(wg) = E(x,y) Dc [fc(wg; x, y)] is the loss function of the cth client, and Dc is the data for the cth client. The functions Fc and therefore f may be non-convex. For each c and wg, we assume access to an unbiased stochastic gradient fc(wg) of the client s true gradient Fc(wg). In addition, we make the following assumptions. Assumption C.1 (Lipschitz Gradient). The function Fc is L-smooth for all c C i.e., || Fc(w) Fc(z)|| L||w z|| w, z Rd. Assumption C.2 (µ-convexity). The function Fc is µ-convex for µ 0 and satisfies: Fc(w), z w Fc(w) Fc(z) + µ 2 w z 2 c, w, z. Assumption C.3 (Bounded Variance). (Assumption 2 in (Reddi et al., 2021)) The function Fc for any c C has σℓ-bounded local variance i.e., E || [fc(w; x, y)]j [ Fc(w)]j||2 = σ2 c,j σ2 ℓ,j for all w Rd, j [d], and c C. Furthermore, we assume the global variance is bounded, 1 |C| P c C || [Fc(w)]j [ f(w)]j||2 σ2 g,j for all w Rd and j [d]. Assumption C.4 (Bounded Local Gradients). The function fc(w; x, y) have G-bounded gradients i.e., for any c C, w Rd, and (x, y) D(r) c we have |[ fc(w; x, y)]j| G for all j [d]. Assumption C.5 (Bounded Gradient Dissimilarity or (G, B)-BGD). (Assumption 1 in (Karimireddy et al., 2020)) There exists constants such as G 0 and B 1 such that 1 |C| P c C fc(wg) 2 G2 + B2 f(wg) 2, wg. If {Fc} are convex, we can relax the assumption to 1 |C| P c C fc(wg) 2 G2 + 2LB2(f(wg) f(w g)), wg. Lemma C.6 (One epoch progress of client-side SGD). Suppose {Fc} satisfies Assumptions C.1, C.2, C.3, and C.4. For a local step-size ηℓ, the updates of SGD satisfy Ee w(r) c,e w c 2 1 µηℓ Ee 1 w(r) c,e 1 w c 2 + 2ηℓ fc(w c) Ee 1[fc(w(r) c,e 1)] (5) + 2Lη3 ℓG2 + 2η4 ℓL2G2 + 2η2 ℓG2 + η2 ℓσ2 c (6) Proof. We start with restating the SGD update w(r) c,e w(r) c,e 1 ηℓ fc(w(r) c,e 1) which gives w(r) c,e = ηℓ fc(w(r) c,e 1) = Ee 1 h w(r) c,e i = ηℓEe 1 h fc(w(r) c,e 1) i Using the above update, we proceed as, Ee 1 w(r) c,e + w(r) c,e w c 2 = Ee 1 w(r) c,e w c 2 2ηℓ fc(w(r) c,e 1), w(r) c,e w c + η2 ℓEe 1 fc(w(r) c,e 1) 2 (7) Ee 1 w(r) c,e w c 2 2ηℓ fc(w(r) c,e 1), w(r) c,e w c | {z } T1 + η2 ℓEe 1 Fc(w(r) c,e 1) 2 +η2 ℓσ2 c (8) (Separating variance and mean according to Lemma 4 in (Karimireddy et al., 2020)) Flash: Concept Drift Adaptation in Federated Learning Bounding T1 T1 = 2ηℓ fc(w(r) c,e 1), w c w(r) c,e (9) fc(w c) fc(w(r) c,e) + L w(r) c,e w(r) c,e 1 2 µ w c w(r) c,e 2 from Lemma 5 of (Karimireddy et al., 2020) fc(w c) fc(w(r) c,e) µ w(r) c,e w c 2 + 2Lηℓ ηℓ fc(w(r) c,e 1) 2 (11) fc(w c) fc(w(r) c,e) µ w(r) c,e w c 2 + 2Lη3 ℓ fc(w(r) c,e 1) 2 (12) fc(w c) fc(w(r) c,e) µ w(r) c,e w c 2 + 2Lη3 ℓG2 (13) Bounding T2 T2 = η2 ℓEe 1 Fc(w(r) c,e 1) Fc(w(r) c,e) + Fc(w(r) c,e) 2 (14) 2η2 ℓEe 1 Fc(w(r) c,e 1) Fc(w(r) c,e) 2 + 2η2 ℓEe 1 Fc(w(r) c,e) 2 (15) 2η2 ℓL2Ee 1 w(r) c,e 1 w(r) c,e 2 + 2η2 ℓ Fc(w(r) c,e) 2 (16) 2η2 ℓL2 Ee 1 ηℓ fc(w(r) c,e 1) 2 + 2η2 ℓG2 (17) 2η4 ℓL2G2 + 2η2 ℓG2 = 2η2 ℓG2(η2 ℓL2 + 1) (18) The inequality (15) comes from the relaxed triangular inequality (Lemma 3 in (Karimireddy et al., 2020)). Inequalities (16) and (18) are respectively implied by Assumptions C.1 and C.4. Plugging in the bounds for T1 and T2 in Equation 8, Ee 1 w(r) c,e + w(r) c,e w c 2 Ee 1 w(r) c,e w c 2 2ηℓ fc(w(r) c,e 1), w(r) c,e w c | {z } T1 + η2 ℓE Fc(w(r) c,e 1) | {z } T2 Ee 1 w(r) c,e w c 2 + 2ηℓ fc(w c) fc(w(r) c,e) µ w(r) c,e w c 2 + 2Lη3 ℓG2 + 2η4 ℓL2G2 + 2η2 ℓG2 + η2 ℓσ2 c Ee 1 w(r) c,e w c 2 + 2ηℓ fc(w c) fc(w(r) c,e) + 2Lη3 ℓG2 + 2η4 ℓL2G2 + 2η2 ℓG2 + η2 ℓσ2 c (21) Theorem C.7 (Convergence of Early-stopping SGD). For functions {Fc} which satisfy Assumptions C.1, C.2, C.3, and C.4, the output of the early-stopping SGD with early stopping criteria P x,y fc(w(r) c,e 1; x, y) P x,y fc(w(r) c,e; x, y) γ/e, e [E] and (x, y) D(r) c,valid has expected error smaller than ϵ for γ (F ϵ) ln E+ 1 E and some values of ηℓ, ec satisfying Strongly convex, 1 µE ηℓ log (max (1,µ2ED/c)) and ec = O min µD2 µϵ + σ2 c 2µϵ + LG2 µ3ϵ , exp( F ϵ General convex, 1 E ηℓ, and ec = O min D2 + G2D2 ϵ2 + σ2 c D2 ϵ3/2 + (L2G2)1/3D2 ϵ4/3 , exp( F ϵ Non-convex, 1 E ηℓ, and ec = O min F + G2F ϵ2 + σ2 c LF 2 LGF ϵ3/2 + (L2G)1/3F ϵ4/3 , exp( F ϵ Flash: Concept Drift Adaptation in Federated Learning where c := G2 + µ2 c 2 , D := E w(r) c,0 w c , and F = fc(w(r) c,0) fc(w c). Proof. Using the Lemma C.6 statement, and moving fc(w c) fc(w(r) c,e) to LHS and rearranging the terms on RHS, E h fc(w(r) c,e 1) i fc(w c) 1 2ηℓ E w(r) c,e 1 w c 2 1 2ηℓ E w(r) c,e w c 2 + η2 ℓ 2ηℓ (2LηℓG2 + 2η2 ℓL2G2 + 2G2 + σ2 c) (22) E w(r) c,e 1 w c 2 1 2ηℓ E w(r) c,e w c 2 + ηℓ(G2 + σ2 c 2 ) + η2 ℓ(LG2) + η3 ℓ(L2G2) (23) Note that the indices are off by 1 with respect to Lemma C.6 to be consistent with the theorem statement. In case of µ = 0 (general convex case), we apply the sublinear convergence rate on a non-negative sequence (refer to Lemma 2 in (Karimireddy et al., 2020)) and the condition 1 E h fc(w(r) c,e) i fc(w c) D2 2ηℓE + G2 + σ2 c 2 ηℓ+ (LG2)η2 ℓ+ (LG)η3 ℓ (24) 2 + G2 + σ2 c 2 2 + (LG2) 1 3 D2 3 + (LG) 1 2 D2 For the µ-convex case, similar to the usage of Lemma 1 in (Karimireddy et al., 2020), we apply the linear convergence rate with the condition ηℓ 1 µE and get, E h fc(w(r) c,e) i fc(w c) 3 2D2µ exp ηℓµE + G2 + σ2 c 2 ηℓ+ (LG2)η2 ℓ+ (L2G2)η3 ℓ (26) 2 + G2 + σ2 c 2 µE + (LG2) 1 µ2E2 + (L2G2) 1 µ3E3 (27) Rearranging the terms and assigning the error as E h fc(w(r) c,e) i fc(w c) = ϵ get us the bounds shown in the theorem statement. For the bound on early stopping parameter γ, we get the lower bound of E h fc(w(r) c,e) i fc(w c) as follows, E h fc(w(r) c,e) i fc(w c) = E h fc(w(r) c,e) i E h fc(w(r) c,e 1) i + E h fc(w(r) c,e 1) i E h fc(w(E 2) c ) i + E h fc(w(E 2) c ) i E h fc(w(0) c ) i + E h fc(w(0) c ) i fc(w c) (28) i + E h fc(w(0) c ) i fc(w c) (29) + E h fc(w(0) c ) i fc(w c) (30) γ (Efc(w(0) c ) Efc(w(r) c,e)) ln E + 1 where F = E h fc(w(0) c ) i fc(w c) and ϵ = E h fc(w(r) c,e) i fc(w c). Lemma C.8 (Bounding the Client Drift wrt Current Round). Let {Fc} satisfy Assumptions C.1, C.3, and C.5. For any step size satisfying ηℓ q 1 L2E(E 1), we can bound the drift for E = max({Ec | c C}) where Ec is the number of epochs Flash: Concept Drift Adaptation in Federated Learning client c trains w(r) c,0 := w(r 1) g for, as c C E w(r) c,e w(r) g 2 5Eη2 ℓE[σ2 ℓ+ 6Eσ2 g] + 30E2η2 ℓE f(w(r) g ) 2 Proof. We have followed the same proof technique as in Lemma 3 of (Reddi et al., 2021). Lemma C.9 (Upper Bounding the Effective Gradients). Let {Fc} satisfy Assumptions C.1, C.3, and C.5. In round r, the updates in FLASH and FEDYOGI satisfy, Er ( (r))2 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 (32) Er ( (r))2 Er (r) + ηℓE f(w(r) g ) ηℓE f(w(r) g ) 2 (33) 2Er (r) + ηℓE f(w(r) g ) 2 + 2Er ηℓE f(w(r) g ) 2 (34) e=0 ηℓ fc(w(r) c,e) + ηℓE f(w(r) g ) + 2Er ηℓE f(w(r) g ) 2 (35) e=0 (ηℓ fc(w(r) c,e) ηℓ Fc(w(r) c,e) + ηℓ Fc(w(r) c,e) ηℓ Fc(w(r) g ) + ηℓ Fc(w(r) g )) + ηℓE f(w(r) g ) + 2Er ηℓE f(w(r) g ) 2 (36) e=0 ( fc(w(r) c,e) Fc(w(r) c,e) + Fc(w(r) c,e) Fc(w(r) g )) 1 e=0 Fc(w(r) g ) + E f(w(r) g ) + 2E2η2 ℓEr f(w(r) g ) 2 (37) e=0 ( fc(w(r) c,e) Fc(w(r) c,e)) e=0 ( Fc(w(r) c,e) Fc(w(r) g )) + 6η2E G2 + (B2 1)Er f(w(r) g ) 2 + 2E2η2 ℓEr f(w(r) g ) 2 (38) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓ |C|2 Er e=0 L(w(r) c,e w(r) g ) + (2E2η2 ℓ+ 6η2 ℓE(B2 1))Er f(w(r) g ) 2 + 6η2 ℓEG2 (39) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓL2E2 5Eη2 ℓ(σ2 ℓ+ 6Eσ2 g) + 30E2η2 ℓEr f(w(r) g ) 2 + (2E2η2 ℓ+ 6η2 ℓE(B2 1))Er f(w(r) g ) 2 + 6η2 ℓEG2 (40) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 Flash: Concept Drift Adaptation in Federated Learning The second to last inequality follows from Lemma C.8. Lemma C.10 (Upper Bounding the Rolling Average of Effective Gradients). Let {Fc} satisfy Assumptions C.1, C.3, and C.5. In round r, the updates in FLASH and FEDYOGI satisfy, Er h v(r) i η2 ℓE2G2(1 βr 1 2 ) + (2 β2) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + (2 β2) 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 (42) Proof. First we recall that v(r) = β2v(r 1) + (1 β2)( (r))2, similar to FEDADAM/FEDYOGI updates in (Reddi et al., 2021). Unrolling the recursion, we get v(r) = (1 β2) Pr i=1 βr i 2 ( (i))2. Replacing v(r) with its unrolled version, Er h v(r) i = (1 β2)Er i=1 βr i 2 ( (i))2 i=1 βr 1 i 2 ( (i))2 + Er β0 2( (r))2 i=1 βr 1 i 2 ( (i))2 + Er ( (r))2 i=1 βr 1 i 2 ( (i))2 + (1 β2) Er ( (r))2 | {z } T1 Plugging in the bounds for T1, from Lemma C.9, Flash: Concept Drift Adaptation in Federated Learning Er h v(r) i (1 β2) i=1 βr 1 i 2 ( (i))2 + (2 β2)Er ( (r))2 (47) i=1 βr 1 i 2 ( (i))2 + (2 β2) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 (48) i=1 βr 1 i 2 e=0 ηℓ fc(w(i) c,e) + (2 β2) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 (49) (1 β2)η2 ℓ |C|2 i=1 βr 1 i 2 e=0 fc(w(i) c,e) + (2 β2) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 (50) (1 β2)η2 ℓ |C|2 i=1 βr 1 i 2 (|C|EG)2 + (2 β2) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 (51) (1 β2)η2 ℓ|C|2E2G2 |C|2 (1 βr 1 2 ) (1 β2) + (2 β2) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 (52) η2 ℓE2G2(1 βr 1 2 ) + (2 β2) 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g + 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 (53) Theorem C.11 (Lower Bounding the Change in the Second Moment of Effective Gradients). Let {Fc} satisfy Assumptions C.1, C.3, and C.5. In round r, the updates in FLASH and FEDYOGI satisfy, Flash: Concept Drift Adaptation in Federated Learning FEDYOGI E h ( (r))2 v(r) i FLASH E h ( (r))2 v(r) i r w(r) g w(r 1) g vr + τ η2 ℓE2G2(1 βr 1 2 ) r w(r) g w(r 1) g vr ηℓEG(1 βr 3) + τ η2 ℓE2G2(1 βr 1 2 ) Proof. Using Jensen s Inequality, we have E [||a b||] |E||a|| E||b||| Hence we get, Er h ( (r))2 v(r) i Er||( (r))2|| Er||v(r)|| (55) Er||( (r))2|| i=1 βr 1 i 2 ( (i))2 + (1 β2)Er ( (r))2 (1 1 + β2)Er||( (r))2|| i=1 βr 1 i 2 ( (i))2 (1 1 + β2)Er||( (r))2|| i=1 βr 1 i 2 ( (i))2 β2Er||( (r))2|| η2 ℓE2G2(1 βr 1 2 ) (59) The second and the last inequalities both are based on the derivation shown in Lemma C.10. For FEDYOGI, we know that w(r) g = w(r 1) g + ηg (r) v(r)+τ , hence we get lower bound of Er ( (r))2 v(r) as, FEDYOGI E h ( (r))2 v(r) i r w(r) g w(r 1) g vr + τ η2 ℓE2G2(1 βr 1 2 ) and For FLASH, we know that w(r) g = w(r 1) g + ηg (r) v(r) d(r)+τ , hence we get lower bound of Er ( (r))2 v(r) as, FLASH E h ( (r))2 v(r) i r w(r) g w(r 1) g vr d(r) + τ η2 ℓE2G2(1 βr 1 2 ) r w(r) g w(r 1) g vr ηℓEG(1 βr 3) + τ η2 ℓE2G2(1 βr 1 2 ) Second inequality follows from the fact that d(r) ηℓEG(1 βr 3). Note that ηℓEG(1 βr 3) would always be positive. Flash: Concept Drift Adaptation in Federated Learning Therefore, we get FEDYOGI E h ( (r))2 v(r) i FLASH E h ( (r))2 v(r) i r w(r) g w(r 1) g vr + τ η2 ℓE2G2(1 βr 1 2 ) r w(r) g w(r 1) g vr ηℓEG(1 βr 3) + τ η2 ℓE2G2(1 βr 1 2 ) Theorem C.12 (Convergence of FLASH). Let assumptions C.1 to C.4 hold. Suppose the server and client learning rates satisfy " |C| 30L2E τ 6(B2 1) G(β2 + β2) + Lηg Then the iterates of Algorithm 1 for ηℓ= Θ(1/L E), ηg = Θ(1/ R), and τ = G/L for FLASH satisfy min 0 r R E f(w(r) g ) 2 O f(w(0) g ) Er[f(w(R) g )] ER|C| (σ2 ℓ+ 6Eσ2 g) + 6Lσ2 ℓ RG2|C| + 6L Proof. The proof strategy is similar to that of FEDADAM as described in (Reddi et al., 2021), except now we also have to manage the addition of moving average of the difference between ( (r))2 and v(r). We note that FLASH has the following update rule (see Algorithm 1 Line 21 w(r+1) g w(r) g + ηg (r) v(r) d(r) + τ . Similar to the analysis of FEDADAM, we are assuming β1 = 0. Hence, m(r) = (r). Note that d(r) j β3jd(r 1) j + (1 β3j)(( (r) j )2 v(r) j ) where β3j = ||v(r 1) j ||2 ||( (r) j )2 v(r) j ||2+||v(r 1) j ||2 for all j [d]. Using the L-smooth nature of the function f and the above update rule, we have the following, f(w(r+1) g ) f(w(r) g ) + D f(w(r) g ), w(r+1) g w(r) g E + L w(r+1) g w(r) g 2 (64) = f(w(r) g ) + ηg [ f(w(r) g )]j (r) j q v(r) j d(r) j + τ ( (r) j )2 q v(r) j d(r) j + τ 2 (65) The second step follows from the update rule of FLASH stated initially. Flash: Concept Drift Adaptation in Federated Learning Now we take expectation of f(wr+1 g ) (over randomness at round r) and rewrite the above inequality as, Er[f(w(r+1) g )] f(w(r) g ) + ηg f(w(r) g ), Er v(r) d(r) + τ (r) p β2v(r 1) β3d(r 1) + τ β2v(r 1) β3d(r 1) + τ + Lη2 g 2 Er v(r) d(r) + τ 2 = f(w(r) g ) + ηg f(w(r) g ), Er β2v(r 1) β3d(r 1) + τ f(w(r) g ), Er v(r) d(r) + τ (r) p β2v(r 1) β3d(r 1) + τ + Lη2 g 2 Er v(r) d(r) + τ 2 Bounding T1 f(w(r) g ), Er β2v(r 1) β3d(r 1) + τ * f(w(r) g ) p β2v(r 1) β3d(r 1) + τ , Er h (r) ηℓE f(w(r) g ) + ηℓE f(w(r) g ) i+ = ηℓE[ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + * f(w(r) g ) p β2v(r 1) β3d(r 1) + τ , Er h (r) + ηℓE f(w(r) g ) i+ Bounding T3 * f(w(r) g ) p β2v(r 1) β3d(r 1) + τ , Er e=0 ηℓ fc(w(r) c,e) + ηℓE f(w(r) g ) 2 [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ Fc(w(r) c,e) qp β2v(r 1) β3d(r 1) + τ 1 Fc(w(r) g ) qp β2v(r 1) β3d(r 1) + τ 2 [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + ηℓ Fc(w(r) c,e) Fc(w(r) g ) qp β2v(r 1) β3d(r 1) + τ 2 [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + ηℓL2 w(r) c,e w(r) g 2 # 2 [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + ηℓL2E 5Eη2 ℓ σ2 ℓ+ 6Eσ2 g + 30E2η2 ℓEr f(w(r) g ) 2 (75) Flash: Concept Drift Adaptation in Federated Learning Plugging the bound of T3 in T1, 2 [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + ηℓL2E 5Eη2 ℓ σ2 ℓ+ 6Eσ2 g + 30E2η2 ℓEr f(w(r) g ) 2 (76) 2 [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + 5η3 ℓL2E2 2τ σ2 ℓ+ 6Eσ2 g + 15η3 ℓL2E3 τ Er f(w(r) g ) 2 (77) Bounding T2 f(w(r) g ), Er v(r) d(r) + τ (r) p β2v(r 1) β3d(r 1) + τ j=1 [ f(w(r) g )]j (r) j β2v(r 1) j β3jd(r 1) j q v(r) j + d(r) j v(r) j d(r) j + τ)( q β2v(r 1) j β3jd(r 1) j + τ) (79) j=1 [ f(w(r) g )]j (r) j β2v(r 1) j q + d(r) j β3jd(r 1) j v(r) j d(r) j + τ)( q β2v(r 1) j β3jd(r 1) j + τ) (80) j=1 [ f(w(r) g )]j (r) j β2v(r 1) j v(r) j q β2v(r 1) j + q v(r) j + (1 β3j)(( (r) j )2 v(r) j ) v(r) j d(r) j + τ)( q β2v(r 1) j β3jd(r 1) j + τ) (81) j=1 [ f(w(r) g )]j (r) j β2v(r 1) j v(r) j + (1 β3j)(( (r) j )2 v(r) j ) v(r) j d(r) j + τ)( q β2v(r 1) j β3jd(r 1) j + τ) (82) Now we rearrange the nominator to convert it into terms with only (r) and v(r). j=1 [ f(w(r) g )]j (r) j β2( (r) j )2 v(r) j + β3j(v(r) j ( (r) j )2) v(r) j d(r) j + τ)( q β2v(r 1) j β3jd(r 1) j + τ) (83) j=1 [ f(w(r) g )]j (r) j ( (r) j )2 v(r) j d(r) j + τ)( q β2v(r 1) j β3jd(r 1) j + τ) j=1 [ f(w(r) g )]j (r) j v(r) j v(r) j d(r) j + τ)( q β2v(r 1) j β3jd(r 1) j + τ) j=1 [ f(w(r) g )]j (r) j β3j(v(r) j ( (r) j )2) v(r) j d(r) j + τ)( q β2v(r 1) j β3jd(r 1) j + τ) (84) j=1 [ f(w(r) g )]j (r) j ( (r) j )2 v(r) j d(r) j + τ) j=1 [ f(w(r) g )]j (r) j β2v(r 1) j β2v(r 1) j β3jd(r 1) j + τ) j=1 [ f(w(r) g )]j (r) j β3j(1 β2) Pr 1 i=0 βr i 2 ( (i) j )2 v(r) j d(r) j + τ) (85) Flash: Concept Drift Adaptation in Federated Learning Here, the first term inequality follows from the fact that q β2v(r 1) j β3d(r 1) j , since q β2v(r 1) j β3d(r 1) j β2(1 βr 2) + β3(1 βr 3) . Similarly for the second term, we can remove q v(r) j d(r) j from the denominator to get an upper bound since the term is positive ( q v(r) j d(r) j ηℓEG p (1 βr 2) + (1 βr 3) ). And v(r) j β2v(r 1) j , since v(r) j is always positive. For the third term, v(r) j ( (r) j )2 = (1 β2) Pr i=0 βr i 2 ( (i) j )2 ( (r) j )2 = (1 β2) Pr 1 i=0 βr i 2 ( (i) j )2. For the same of simplicity, we have assumed β3 [0, 1] to be a constant, but a similar anaysis can be derived for a dynamic β3 as well. Bounding the above term further, j=1 [ f(w(r) g )]j ( (r) j )2 + β2 j=1 [ f(w(r) g )]j (r) j q j=1 [ f(w(r) g )]j β3j(1 β2) i=0 βr i 2 ( (i) j )2 (86) The first term inequality follows from q v(r) j d(r) j (r) j . Second term inequality follows from q β3jd(r 1) j q β2v(r 1) j . j=1 [ f(w(r) g )]j ( (r) j )2 + β2 j=1 [ f(w(r) g )]j ( (r) j )2 j=1 [ f(w(r) g )]j β3j i=0 βr i 2 ( (i) j )2 (87) T2 (β2 + β2)G j=1 ( (r) j )2 + (1 β2)G i=0 βr i 2 ( (i) j )2 (88) j=1 ( (r) j )2 + (1 βr 1 2 )G3|C|2E2 j=1 ( (r) j )2 + (1 βr 1 2 )G3|C|2E2 Plugging in the bounds of T1 and T2 in Equation 67, Flash: Concept Drift Adaptation in Federated Learning Er[f(w(r+1) g )] = f(w(r) g ) + ηg f(w(r) g ), Er β2v(r 1) β3d(r 1) + τ f(w(r) g ), Er v(r) d(r) + τ (r) p β2v(r 1) β3d(r 1) + τ + Lη2 g 2 Er v(r) d(r) + τ 2 f(w(r) g ) + ηg 2 [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + 5η3 ℓL2E2 σ2 ℓ+ 6Eσ2 g + 15η3 ℓL2E3 τ Er f(w(r) g ) 2 ! j=1 ( (r) j )2 + G3|C|2E2(1 βr 1 2 ) τ 2 + Lη2 g 2τ 2 Er( (r))2 (92) Using the bounds derived on Er( (r))2 from Lemma C.9, Er[f(w(r+1) g )] f(w(r) g ) ηgηℓE 2τ [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + 5ηgη3 ℓL2E2 2τ (σ2 ℓ+ 6Eσ2 g) + 15ηgη3 ℓL2E3 τ Er f(w(r) g ) 2 + Lη2 g 2τ 2 + ηg G(β2 + β2) + G3|C|2E2(1 βr 1 2 ) τ 2 (93) f(w(r) g ) ηgηℓE 2τ [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + 5ηgη3 ℓL2E2 2|C|τ (σ2 ℓ+ 6Eσ2 g) Lη2 g 2τ 2 + ηg G(β2 + β2) ! 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2 + 30η4 ℓL2E3 |C| σ2 ℓ+ 6Eσ2 g Lη2 g 2τ 2 + ηg G(β2 + β2) ! 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1) Er f(w(r) g ) 2 + G3|C|2E2(1 βr 1 2 ) τ 2 + 15ηgη3 ℓL2E3 |C|τ Er f(w(r) g ) 2 (94) f(w(r) g ) ηgηℓE 2τ [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + 5ηgη3 ℓL2E2 2|C|τ + 30q1η4 ℓL2E3 (σ2 ℓ+ 6Eσ2 g) + 15ηgη3 ℓL2E3 |C|τ + q1q2 Er f(w(r) g ) 2 + q1q3 + G3|C|2E2(1 βr 1 2 ) τ 2 (95) where q1 = Lη2 g 2τ 2 + ηg G(β2+ β2) τ 2 , q2 = 180η4 ℓL2E4 |C| + 2E2η2 ℓ+ 6η2 ℓE(B2 1), q3 = 6η2 ℓEσ2 ℓ |C| + 6η2 ℓEG2. Flash: Concept Drift Adaptation in Federated Learning Summing over r = 1 to R gives us, Er[f(w(R) g )] f(w(0) g ) 2τ [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + 5ηgη3 ℓL2E2 2|C|τ + 30q1η4 ℓL2E3 (σ2 ℓ+ 6Eσ2 g) 15ηgη3 ℓL2E3 |C|τ + q1q2 Er f(w(r) g ) 2 + q1q3 + G3|C|2E2(1 βr 1 2 ) τ 2 2τ [ f(w(r) g )]2 p β2v(r 1) β3d(r 1) + τ + R 5ηgη3 ℓL2E2 2|C|τ + 30q1η4 ℓL2E3 (σ2 ℓ+ 6Eσ2 g) 15ηgη3 ℓL2E3 |C|τ + q1q2 Er f(w(r) g ) 2 + Rq1q3 + G3|C|2E2 R 1 βR 2 (1 β2) Using the fact 15ηgη3 ℓL2E3 |C|τ + q1q2 ηgηℓE 2τ derived from the bounds on the local learning rate ηℓ min |C| 30L2E 1 2 , τ 6(B2 1)[G(β2+ β2)+Lηg] Er[f(w(R) g )] f(w(0) g ) ηgηℓE [ f(w(r) g )]2 j q β2v(r 1) j β3d(r 1) j + τ + Rq1q3 + G3|C|2E2 R 1 βR 2 (1 β2) + R 5ηgη3 ℓL2E2 2|C|τ + 30q1η4 ℓL2E3 (σ2 ℓ+ 6Eσ2 g) (98) Using the following lower bound, [ f(w(r) g )]2 j q β2v(r 1) j β3d(r 1) j + τ [ f(w(r) g )]2 j ηℓEG( p β2(1 βr 2) + β3(1 βr 3)) + τ (99) β2(1 βr 2) + β3(1 βr 3)) + τ min 1 r R f(w(r) g ) 2 (100) we derive the convergence bound as follows, min 0 r R E f(w(r) g ) 2 2(ηℓEG( p β2(1 βr 2) + β3(1 βr 3)) + τ) ηℓηg ER h f(w(0) g ) Er[f(w(R) g )] i + 2(ηℓEG( p β2(1 βr 2) + β3(1 βr 3)) + τ) ηℓηg E 5ηgη3 ℓL2E2 2|C|τ + 30q1η4 ℓL2E3 (σ2 ℓ+ 6Eσ2 g) + q1q3 + 2(ηℓEG( p β2(1 βr 2) + β3(1 βr 3)) + τ) ηℓηg ER R 1 βR 2 1 β2 min 0 r R E f(w(r) g ) 2 O ηℓE(β3(1 βr 3)) + τ ηℓηg ER h f(w(0) g ) Er[f(w(R) g )] i + ηℓE(β3(1 βr 3)) + τ ηℓηg E 5ηgη3 ℓL2E2 2|C|τ + 30q1η4 ℓL2E3 (σ2 ℓ+ 6Eσ2 g) + q1q3 + ηℓE(β3(1 βr 3)) + τ ηℓηg ER R 1 βR 2 1 β2 Flash: Concept Drift Adaptation in Federated Learning Assuming ηℓ= Θ(1/L E), ηg = Θ(1/ R), and τ = G/L, we get the following asymptotic bound for the convergence of FLASH, min 0 r R E f(w(r) g ) 2 O f(w(0) g ) Er[f(w(R) g )] ER|C| (σ2 ℓ+ 6Eσ2 g) + 6Lσ2 ℓ RG2|C| + 6L