# efficient_personalized_federated_learning_via_sparse_modeladaptation__859ab476.pdf Efficient Personalized Federated Learning via Sparse Model-Adaptation Daoyuan Chen 1 Liuyi Yao 1 Dawei Gao 1 Bolin Ding 1 Yaliang Li 1 Federated Learning (FL) aims to train machine learning models for multiple clients without sharing their own private data. Due to the heterogeneity of clients local data distribution, recent studies explore the personalized FL that learns and deploys distinct local models with the help of auxiliary global models. However, the clients can be heterogeneous in terms of not only local data distribution, but also their computation and communication resources. The capacity and efficiency of personalized models are restricted by the lowest-resource clients, leading to suboptimal performance and limited practicality of personalized FL. To overcome these challenges, we propose a novel approach named p Fed Gate for efficient personalized FL by adaptively and efficiently learning sparse local models. With a lightweight trainable gating layer, p Fed Gate enables clients to reach their full potential in model capacity by generating different sparse models accounting for both the heterogeneous data distributions and resource constraints. Meanwhile, the computation and communication efficiency are both improved thanks to the adaptability between the model sparsity and clients resources. Further, we theoretically show that the proposed p Fed Gate has superior complexity with guaranteed convergence and generalization error. Extensive experiments show that p Fed Gate achieves superior global accuracy, individual accuracy and efficiency simultaneously over state-of-the-art methods. We also demonstrate that p Fed Gate performs better than competitors in the novel clients participation and partial clients participation scenarios, and can learn meaningful sparse local models adapted to different data distributions. 1Alibaba Group. Correspondence to: Yaliang Li . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). 1. Introduction Federated Learning (FL) gains increasing popularity in machine learning scenarios where the data are distributed in different places and can not be transmitted due to privacy concerns (Muhammad et al., 2020; Meng et al., 2021; Hong et al., 2021; Wang et al., 2022; Xie et al., 2023). Typical FL trains a unique global model from multiple data owners (clients) by transmitting and aggregating intermediate information with the help of a centralized server (Mc Mahan et al., 2017; Kairouz et al., 2021). Although using a shared global model for all clients shows promising average performance, the inherent statistical heterogeneity among clients challenges the existence and convergence of the global model (Sattler et al., 2020; Li et al., 2020). Recently, there are emerging efforts that introduce personalization into FL by learning and deploying distinct local models (Yang et al., 2019; Karimireddy et al., 2020; Tan et al., 2021). The distinct models are designed particularly to fit the heterogeneous local data distribution via techniques taking care of relationships between the global model and personalized local models, such as multi-task learning (Collins et al., 2021), meta-learning (Dinh et al., 2020a), model mixture (Li et al., 2021c), knowledge distillation (Zhu et al., 2021) and clustering (Ghosh et al., 2020). 
However, the heterogeneity among clients exists not only in local data distribution, but also in their computation and communication resources (Chai et al., 2019; 2020; Chen et al., 2023). The lowest-resource clients restrict the capacity and efficiency of the personalized models due to the following reasons: (1) The adopted model architecture of all clients is usually assumed to be the same for aggregation compatibility and (2) The communication bandwidth and participation frequency of clients usually determine how much can they contribute to the model training of other clients and how fast can they agree to meet a converged central point w.r.t their local models. This resource heterogeneity is under-explored in most existing personalized FL (PFL) works, and instead, they gain accuracy improvement with a large amount of additional computation or communication costs. Without special design taking the efficiency and resource heterogeneity into account, we can only gain sub-optimal performance and limited practicality of PFL. To overcome these challenges, in this paper, we propose Efficient Personalized Federated Learning via Sparse Model-Adaptation a novel method named p Fed Gate for efficient PFL, which learns to generate personalized sparse models based on adaptive gated weights and different clients resource constraints. Specifically, we introduce a lightweight trainable gating layer for each client, which predicts sparse, continues and block-wise gated weights and transforms the global model shared across all clients into a personalized sparse one. The gated weights prediction is conditioned on specific samples for better estimation of the heterogeneous data distributions. Thanks to the adaptability between the model sparsity and clients resources, the personalized models and FL training process gain better computation and communication efficiency. As a result, the model-adaption via sparse gated weights delivers the double benefit of personalization and efficiency: (1) The sparse model-adaption enables each client to reach its full potential in model capacity with no need for compatibility across other low-resource clients, and to deal with a small and focused hypothesis space that is restricted by the personalized sparsity and the local data distribution. (2) Different resource restrictions can be easily imposed on the predicted weights as we consider the block-wise masking under a flexible combinatorial optimization setting. We further provide space-time complexity analysis to show p Fed Gate superiority over state-of-the-art (SOTA) methods, and provide theoretical guarantees for p Fed Gate in terms of its generalization and convergence. We evaluate the proposed p Fed Gate on four FL benchmarks compared to several SOTA methods. We show that p Fed Gate achieves superior global accuracy, individual accuracy and efficiency simultaneously (up to 4.53% average accuracy improvement with 12x smaller sparsity than the compared strongest PFL method). We demonstrate the effectiveness and robustness of p Fed Gate in the partial clients participation and novel clients participation scenarios. We find that p Fed Gate can learn meaningful sparse local models adapted to different data distributions, and conduct extensive experiments to study the effect of sparsity and verify the necessity and effectiveness of p Fed Gate components. 
Our main contributions can be summarized as follows: We exploit the potential of co-design of model compression and personalization in FL, and propose a novel efficient PFL approach that learns to generate sparse local models with a fine-grind batch-level adaptation. We provide a new formulation for the efficient PFL considering the clients heterogeneity in both local data distribution and hardware resources, and provide theoretical results about the generalization, convergence and complexity of the proposed method. We achieve SOTA results on several FL benchmarks and illustrate the feasibility of gaining better accuracy with improved efficiency for PFL simultaneously. Our codes are at https://github.com/alibaba/Federated Scope/ tree/master/benchmark/p FL-Bench. 2. Related Works Personalized FL. Personalized FL draws increasing attention as it is a natural way to improve FL performance for heterogeneous clients. Many efforts have been devoted via multi-task learning (Smith et al., 2017; Corinzia & Buhmann, 2019; Huang et al., 2021; Marfoq et al., 2021), model mixture (Zhang et al., 2020; Li et al., 2021c), clustering (Briggs et al., 2020; Sattler et al., 2020; Chai et al., 2020), knowledge distillation (Lin et al., 2020; Zhu et al., 2021; Ozkara et al., 2021), meta-learning (Khodak et al., 2019; Jiang et al., 2019; Khodak et al., 2019; Singhal et al., 2021), and transfer learning (Yang et al., 2020; He et al., 2020; Zhang et al., 2021a). Although effective in accuracy improvements, most works pay the cost of additional computation or communication compared to non-personalized methods. For example, Sattler et al. (2020) consider groupwise client relationships, requiring client-wise distance calculation that is computationally intensive in cross-device scenarios. Fallah et al. (2020) leverage model agnostic meta-learning to enable fast local personalized training that requires computationally expensive second-order gradients. Zhang et al. (2021a) learn pair-wise client relationships and need to store and compute similarity matrix with square complexity w.r.t. the number of clients. Marfoq et al. (2021) learn a mixture of multiple global models which multiplies the storing and communicating costs. Our work differs from these works by considering a practical setting that clients are heterogeneous in both the data distribution and hardware resources. Under this setting, we achieve personalization from a novel perspective, personalized sparse model-adaptation, and show the feasibility of gaining better accuracy, computation and communication efficiency at the same time. Efficient FL. Fruitful FL literature has explored the improvement of communication efficiency such as methods based on gradient compression (Rothchild et al., 2020; Alistarh et al., 2017; Reisizadeh et al., 2020; Haddadpour et al., 2021; Zhang et al., 2021b), model ensemble (Hamer et al., 2020), model sub-parameter sharing (Liang et al., 2020), sub-model selection (Tianchun et al., 2022; Minh & Carl, 2021), and Bayesian neural network (Yurochkin et al., 2019b;a). Few works improve computation and communication efficiency with personalized local models via quantization (Ozkara et al., 2021; Li et al., 2021a; Hong et al., 2022) and model parameter decoupling (Diao et al., 2021; Collins et al., 2021; Huang et al., 2022). 
We mainly differ from them by adaptively generating the sparse model weights at a fine-grained batch level, which estimates the conditional distribution of heterogeneous local data well and gains high accuracy as shown in our experiments. Besides the methodology, we present several new theoretical results. Due to the space limit, we give more detailed descriptions of and comparisons with related works in Appendix G.

3. Problem Formulation

The goal of traditional federated learning (FL) is to fit a single global model by collaboratively training models from a set C of clients without sharing their local data. In this paper, we focus on the personalized federated learning problem in terms of not only client-distinct models, but also client-distinct resource limitations. Specifically, each client i ∈ C has its own private dataset S_i drawn from a local distribution D_i over X × Y. In general, the local data distributions {D_i}_{i∈C} are heterogeneous, and thus it is promising to learn a personalized model h_{θ_i} ∈ H : X → Y parameterized by θ_i for each local distribution D_i, where H is the set of hypotheses with d dimensions. Besides, clients may have heterogeneous computation and communication resources, especially in cross-device FL scenarios. To better account for the data heterogeneity and system heterogeneity, we aim to optimize the following objective:

$$\min_{\{h_{\theta_i}\}_{i\in C}} \sum_{i\in C} p_i \, \mathbb{E}_{(x,y)\sim D_i}\big[f(\theta_i; x, y)\big], \quad \text{s.t. } \mathrm{size}(\theta_i) \le d_i, \ \forall i \in C, \tag{1}$$

where f(θ_i; x, y) ≜ ℓ(h_{θ_i}(x), y) and ℓ: Y × Y → R_+ is the loss function, p_i is the non-negative aggregation weight for client i, and Σ_{i∈C} p_i = 1. In the typical FL setting (McMahan et al., 2017; Kairouz et al., 2021), p_i indicates the participation degree of client i and is set proportional to the size of the local dataset |S_i|. size(θ_i) indicates the number of parameters of θ_i, and d_i indicates the model size limitation for client i. Without special designs to handle the system heterogeneity (here the size restriction), most PFL methods simply adopt another hypothesis space H' with min({d_i}_{i∈C}) dimensions, constrained by the lowest-resource clients, and the problem degrades to the typical one that minimizes the objective without constraints. We consider the model size constraint for simplicity, as it fairly reflects the computation and communication cost. The formulation can be extended to other configurations such as inference and communication latency in a similar manner.

4. Learning to Generate Sparse Models

In practice, one can solve the problem in Equation (1) in a two-step manner: (1) find optimal local models via existing personalized federated learning methods without sparsity constraints, and then (2) compress the trained models into the required local model sizes via compression techniques such as pruning, quantization and knowledge distillation (Deng et al., 2020). However, the final local models from step (2) usually perform worse than the models found in step (1) (Jiang et al., 2022). Besides, the post-compression process still incurs computation and communication costs corresponding to the uncompressed models during the FL process. To alleviate the performance degradation and further improve the efficiency of the FL process, we propose to jointly learn and compress the federated models.
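For concreteness, a minimal sketch of the post-hoc compression in step (2) is given below, using simple global magnitude pruning to meet a client's parameter budget d_i. This is an illustrative baseline rather than the method proposed in this paper, and the model, budget and function names are our own assumptions.

```python
# Illustrative sketch (not the paper's method): step (2) of the two-step baseline,
# compressing an already-trained local model to meet a client's size budget d_i
# via global magnitude pruning. All names here are hypothetical.
import torch
import torch.nn as nn


def magnitude_prune_to_budget(model: nn.Module, budget: int) -> None:
    """Zero out the smallest-magnitude weights so that roughly `budget`
    parameters remain non-zero (ties may keep a few extra)."""
    flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    if budget >= flat.numel():
        return  # nothing to prune
    # Threshold = magnitude of the budget-th largest entry.
    threshold = torch.topk(flat, budget, largest=True).values.min()
    with torch.no_grad():
        for p in model.parameters():
            p.mul_((p.abs() >= threshold).float())


model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
d_i = 50_000  # hypothetical size limit for client i
magnitude_prune_to_budget(model, d_i)
nonzero = sum(int((p != 0).sum()) for p in model.parameters())
print(f"non-zero parameters: {nonzero} (budget {d_i})")
```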
Specifically, we can directly learn sparse local models whose numbers of non-zero parameters satisfy the size requirements:

$$\min_{\{h_{\theta'_i}\}_{i\in C}} \sum_{i\in C} p_i \, \mathbb{E}_{(x,y)\sim D_i}\big[\ell(h_{\theta'_i}(x), y)\big], \tag{2}$$

where h_{θ'_i} ∈ H_{d_i} and H_{d_i} indicates the subset of H containing hypotheses whose number of non-zero parameters is not larger than d_i. However, the sparse models h_{θ'_i} can be arbitrarily different across clients due to the various dimension requirements d_i and local data distributions D_i. Directly optimizing h_{θ'_i} and aggregating them may lead to sub-optimal performance and make the federated learning process fail to converge. As Marfoq et al. (2021) discussed, clients cannot benefit from each other without any additional assumption on the local distributions. Here we consider the case of bounded diversity as the following assumption shows:

Assumption 1. (Bounded Diversity) Let D_g be a global data distribution that is a mixture of all local data distributions D_i, i ∈ C. Denote h_{θ*_g} and h_{θ*_i} as the optimal estimators for the conditional probability of D_g and D_i respectively. There exists a D_g such that the variance between local and global gradients is bounded:

$$\|\nabla_\theta f(\theta_i^*) - \nabla_\theta f(\theta_g^*)\|^2 \le \sigma_i^2, \quad \forall i \in C, \tag{3}$$

where f(θ) is a compact notation for f(θ; x, y) over all data points (x, y) ∈ X × Y.

Proposition 1. Denote (θ'_i)* as the parameters of any optimal sparse estimator for D_i, ∀i ∈ C. If f is µ-strongly convex¹, then under Assumption 1 there exist continuous weights M*_i ∈ R^d such that (θ'_i)* = θ*_g ⊙ M*_i, where ⊙ indicates the Hadamard product over model parameters, and {M*_i}_{i∈C} can be bounded as

$$\big\|(M_i^* - \mathbf{1})^{\odot 4}\big\| \;\le\; \frac{R_{\theta_i}^4 R_{\theta_g}^4}{r_{\theta_i}^4 r_{\theta_g}^4} \cdot \frac{\sigma_i^4}{\mu^4 \big\|(\theta_g^*)^{\odot 4}\big\|}, \quad \forall i \in C, \tag{4}$$

where ⊙4 indicates the Hadamard power of 4, r_{θ_i} and r_{θ_g} are the infima of θ_i and θ_g respectively, and R_{θ_i} and R_{θ_g} are the suprema of θ_i and θ_g respectively.²

Proposition 1 suggests that we can deal with Problem (2) by indirectly generating personalized sparse models h_{θ'_i} parameterized by θ'_i = θ_g ⊙ M_i, with personalized weights M_i and a shared global model θ_g, which helps to transfer information across clients and leads to a controllable optimization. Specifically, denoting s_i as the model sparsity constraint of client i, Problem (2) can be transformed into the following form:

$$\min_{\theta_g, \{M_i\}_{i\in C}} \sum_{i\in C} p_i \, \mathbb{E}_{(x,y)\sim D_i}\big[f(\theta_g \odot M_i; x, y)\big], \quad \text{s.t. } \mathrm{count}(M_i \neq 0)/d \le s_i, \ \forall i \in C. \tag{5}$$

With this re-formulation, M_i can be regarded as sparsity-enforced gated weights that absorb the diversity of different local data distributions into the global model, via scaling and blocking the information flow through the shared model θ_g.

¹The strong convexity is only used in this motivating proposition and is not involved in the remaining theoretical analysis. We give more discussion on all the assumptions in Appendix E.
²The proofs of all propositions and theorems of our work are in the Appendix.
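The following toy sketch illustrates the re-formulation in Equation (5): a personalized sparse model is obtained as the Hadamard product of the shared global parameters θ_g and client-specific gated weights M_i whose fraction of non-zero entries respects the sparsity constraint s_i. All shapes and values below are illustrative assumptions.

```python
# A minimal sketch of Eq. (5): a personalized model is the Hadamard product of the
# shared global parameters and client-specific sparse gated weights M_i, whose
# fraction of non-zero entries is at most s_i. Shapes and values are illustrative.
import torch

d = 10                                   # (toy) number of global parameters
theta_g = torch.randn(d)                 # shared global model parameters
s_i = 0.3                                # sparsity constraint of client i

# A toy gated-weight vector: continuous scales on the kept entries, zeros elsewhere.
M_i = torch.zeros(d)
kept = torch.randperm(d)[: int(s_i * d)]
M_i[kept] = torch.rand(len(kept)) + 0.5

theta_i = theta_g * M_i                  # personalized sparse model (Hadamard product)
sparsity = (M_i != 0).float().mean().item()
assert sparsity <= s_i + 1e-8, "constraint count(M_i != 0)/d <= s_i"
print(theta_i, f"sparsity={sparsity:.2f}")
```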
5. PFL with Conditional Gated Weights

5.1. Overall Framework

Note that for discriminative models (e.g., neural networks), the personalized models are estimators of the conditional distribution P_{D_i}(y|x) associated with each client's local data D_i. To achieve efficient PFL and optimize the objective described in Equation (5), instead of learning client-level element-wise weights M_i ∈ R^d, we propose to learn to predict batch-level block-wise weights M'_{i,j} = g_{ϕ_i}(x_{i,j}), where ϕ_i indicates the parameters of a trainable personalized gating layer, and g_{ϕ_i} is the predictive function conditioned on a specific data batch x_{i,j} and the sparsity limitation s_i. The gating layer ϕ_i brings benefits in both efficiency and effectiveness: (1) predicting block-wise weights M'_{i,j} ∈ R^L enables us to use a lightweight ϕ_i for L ∈ N_+ sub-blocks of θ with L ≪ d, and thus incurs much smaller computational costs than the element-wise manner; (2) predicting batch-wise weights can estimate the conditional probability better than the client-level manner, since each client may have a complex local data distribution mixed over different classes, e.g., using different sparse models to classify dog and bird can achieve better performance than using the same one. We empirically verify the effectiveness of these choices in Section 7.6.

Based on the introduced gating layer, we establish an efficient PFL framework in the server-client setting as shown in Figure 1. To enable clients to benefit from each other, all clients share the same global model θ_g that is downloaded from the server. Each client i ∈ C has a private, lightweight and learnable gating layer ϕ_i, which is trained from local data in a differentiable manner, and learns to achieve fine-grained personalized model-adaptation θ'_{i,j} = θ_g ⊙' M'_{i,j} = θ_g ⊙' g_{ϕ_i}(x_{i,j}) given different data batches, where ⊙' indicates the block-wise product over θ. To handle diverse resource-limited scenarios, ϕ_i ensures that the personalized adapted model θ'_{i,j} has a sparsity not larger than s_i. Moreover, the sparsity speeds up the training and inference of local models, as well as the communication via uploading the sparse model updates (more discussion is in Sec. 6.3 and Appendix F).

Figure 1: The framework of pFedGate, which learns to generate sparse personalized models conditioned on clients' different data samples and resource limitations. The serial numbers indicate the FL processes at each communication round.

5.2. Adaptive Gating Layer

5.2.1. LAYER ARCHITECTURE

We adopt a simple two-path sandwich architecture for the gating layer as shown in the left part of Figure 2. One path predicts the block-wise gated weights M that may violate the hard sparsity constraints, and the other path predicts the block-wise importance scores that are used to adjust M into a sparse M' such that the final sparse model satisfies the hard limitation (we introduce the adjustment later).

Figure 2: The sparse model-adaptation via the gating layer.

Let the number of sub-blocks of θ_g be L and the dimension of the flattened input feature be d_X, i.e., x ∈ R^{d_X}. The middle two fully connected layers have the same size and each of them has (d_X · L) parameters. Here switchable normalization (Luo et al., 2019) is used to handle batched data samples and different input types by learning a mixture of batch norm, instance norm and layer norm with a negligible additional number of trainable parameters. Furthermore, the personalized normalization parameters within client-distinct gating layers enable our model-adaptation to absorb different feature distributions (similar empirical evidence is shown by Li et al. (2021d)).
The final batch normalization and sigmoid activation are used to stabilize the optimization of θ_g and {ϕ_i} by bounding M, and thus the scaling degrees on θ_g's parameters (recall that the upper bound in Proposition 1 depends on the infimum and supremum of the parameters after scaling). In total, the gating layer learns and predicts the sparse gated weights M' = g_ϕ(x) = g(ϕ; x) for a specific data batch x, where g: R^{d_X} → R^L, with a lightweight parameter size d_ϕ = 2·d_X·L that is usually much smaller than the size of the model θ_g. We show the effectiveness and necessity of these adopted components in Section 7.6.

5.2.2. OPERATOR-TYPE-FREE BLOCK SPLITTING

There are many alternatives for splitting a model into sub-blocks, e.g., a convolutional operator can be split channel-wise or filter-wise, and an attention operator can be split head-wise. Our gating layer is decoupled from specific block splitting manners and thus can easily support users' various splitting preferences. Besides, we propose a general operator-type-free splitting manner by default to enhance the usability of our approach, as illustrated in the right part of Figure 2. Denote B ∈ N_+ as a given block split factor, and denote d_l as the dimension of a flattened vector that parameterizes a given learnable operator. We group the vector's first ⌈d_l · s_min⌉ elements as the first sub-block, and equally split the remaining elements into (B − 1) sub-blocks (the last sub-block may have a smaller size than ⌈d_l(1 − s_min)/(B − 1)⌉). Here B provides flexible splitting granularity, s_min ∈ (0, s_i] indicates a pre-defined minimal sparsity factor, and the predicted gated weight for the first sub-block is enforced to be non-zero to avoid cutting the information flow through this operator. The proposed splitting is simple to implement but effective for PFL, as shown in our empirical evaluations.

5.2.3. SPARSE GATED WEIGHTS

To ensure a sparsity of the adapted model equal to or smaller than the limitation s_i, we propose to leverage the predicted block importance G to adjust the predicted gated weights M into a sparse M'. Specifically, denoting W ∈ Z^L as the parameter-size look-up table of the L sub-blocks, we transform M' = M ⊙ I, where I ∈ {0, 1}^L is a binary index and a solution of the following knapsack problem that maximizes the total block importance while satisfying the sparsity constraint:

$$\max_{I} \; I^{\top} G, \quad \text{s.t. } (I^{\top} W)/d \le s_i. \tag{6}$$

In practice, to enable gradient-based optimization of the gating layer, we leverage the straight-through trick (Hubara et al., 2016) with the differentiable G in the backward pass. In Appendix A, we summarize the overall algorithm and present more details about the gradient flows through the gating layer.
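To make Sections 5.2.1–5.2.3 concrete, here is a sketch of an adaptive gating layer in PyTorch. It follows the two-path design described above but makes simplifying assumptions of our own: plain batch normalization stands in for switchable normalization, the knapsack in Equation (6) is solved by a greedy heuristic instead of an exact solver, and the per-batch gate is obtained by averaging over the batch. Tensor names M, G and I follow the text; everything else is illustrative.

```python
# A sketch of the adaptive gating layer (Sec. 5.2), under simplifying assumptions:
# BatchNorm1d replaces switchable normalization, and the knapsack in Eq. (6) is
# solved greedily by importance density rather than exactly.
import torch
import torch.nn as nn


class GatingLayer(nn.Module):
    def __init__(self, d_x: int, num_blocks: int):
        super().__init__()
        self.pre_norm = nn.BatchNorm1d(d_x)              # stand-in for switchable norm
        self.fc_gate = nn.Linear(d_x, num_blocks)        # path 1: gated weights M
        self.fc_importance = nn.Linear(d_x, num_blocks)  # path 2: block importance G
        self.post_norm = nn.BatchNorm1d(num_blocks)

    def forward(self, x, block_sizes, d, s_i):
        h = self.pre_norm(x.flatten(1))
        M = torch.sigmoid(self.post_norm(self.fc_gate(h))).mean(0)  # (L,), per batch
        G = torch.sigmoid(self.fc_importance(h)).mean(0)            # (L,), per batch

        # Greedy knapsack: keep high-importance blocks while the kept parameter
        # count stays within the sparsity budget s_i * d.
        I = torch.zeros_like(G)
        budget = s_i * d
        for idx in torch.argsort(G / block_sizes, descending=True):
            if block_sizes[idx] <= budget:
                I[idx] = 1.0
                budget -= block_sizes[idx]

        # Straight-through: binary I in the forward pass, gradients flow through G.
        I_st = I - G.detach() + G
        return M * I_st                                   # sparse gated weights M'


d_x, L, d = 784, 8, 200_000
block_sizes = torch.full((L,), d / L)
gate = GatingLayer(d_x, L)
x = torch.randn(32, d_x)
M_sparse = gate(x, block_sizes, d, s_i=0.5)
print(M_sparse)
```

The straight-through line mirrors the trick summarized in Appendix A: the binary index I determines the forward sparsity pattern, while gradients reach the importance path through G.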
6. Algorithm Analysis

6.1. Generalization

Following the parameter-sharing analysis from Baxter (2000) and the generalization analysis from Shamsian et al. (2021), we give the generalization guarantee of our sparse model here. Based on the notations used in previous sections, let L̂(θ_g, ϕ_i) denote the empirical loss of the sparse model on client i:

$$\hat{L}(\theta_g, \phi_i) \triangleq \frac{1}{|S_i|}\sum_{(x,y)\in S_i} f\big(\theta_g \odot' g_{\phi_i}(x);\, x, y\big),$$

and the expected loss is denoted as

$$L(\theta_g, \phi_i) \triangleq \mathbb{E}_{(x,y)\sim D_i}\, f\big(\theta_g \odot' g_{\phi_i}(x);\, x, y\big).$$

Assumption 2. The parameters of the global model and the gating layer can be bounded in a ball with radius R. The following Lipschitz conditions hold: |f(θ; x, y) − f(θ'; x, y)| ≤ L_f‖θ − θ'‖, ‖h(θ_i; x) − h(θ'_i; x)‖ ≤ L_h‖θ_i − θ'_i‖ and ‖g(ϕ_i; x) − g(ϕ'_i; x)‖ ≤ L_g‖ϕ_i − ϕ'_i‖, where h(θ_i; ·) is the sparse model parameterized by θ_i and g(ϕ_i; ·) indicates the gating layer parameterized by ϕ_i.

Let the parameter spaces of the sparse model and the gating layer be of size d and d_ϕ, respectively. With the above assumption, we have the following theorem regarding the generalization error bound.

Theorem 1. Under Assumption 2, with probability at least 1 − δ, there exists

$$N = O\!\left(\frac{d}{|C|\epsilon^2}\log\frac{RL_fL_h(RL_g+s_i)}{\epsilon} + \frac{d_\phi}{\epsilon^2}\log\frac{RL_fL_h(RL_g+s_i)}{\epsilon} + \frac{\log(1/\delta)}{|C|\epsilon^2}\right)$$

such that, for all θ_g and ϕ_i, |L(θ_g, ϕ_i) − L̂(θ_g, ϕ_i)| ≤ ϵ when the number of client i's local data samples is greater than N.

Theorem 1 indicates that the generalization depends on the size of the global model (i.e., d) reduced by the number of clients |C| (the first term), as the global model is shared by all clients. It also depends on the size of the gating layers (the second term), which is not reduced by |C| since each client has its own personalized parameters. Besides, the generalization is also affected by the Lipschitz constants of the global model, the sparse model, and the gating layer, as they together constrain the parameter space that our method can search. Specifically, as the sparsity s_i decreases, N becomes smaller and the generalization becomes better. This is also, to some extent, a reflection of Ockham's razor in our approach. There is similar corroboration in other scenarios, such as the positive effect of sparsity regularization on generalization ability (Maurer & Pontil, 2012).

6.2. Convergence

Under the following assumptions, which are commonly used in the convergence analysis of many FL works (Kairouz et al., 2021) (more discussion about the assumptions is in Appendix E), we can see that the global model updates {Δθ^t_{g,i}}_{i∈C_s} become arbitrarily small, while the learnable parameters of pFedGate (both θ_g and the gating layers {ϕ_i}_{i∈C}) converge to stationary points with a convergence rate comparable to many existing works.

Assumption 3. (Smoothness) For all clients i ∈ C, (x, y) ∈ D_i and all possible θ_g, the function f(θ_g; x, y) ≜ ℓ(h_{θ_g}(x), y) is twice continuously differentiable and L-smooth: ‖∇f(θ'_g; x, y) − ∇f(θ_g; x, y)‖₂ ≤ L‖θ'_g − θ_g‖.

Assumption 4. (Bounded Output) The function f is bounded below by some f* ∈ R.

Assumption 5. (Bounded Variance) For each client i ∈ C and all possible θ_g, given a data sample (x, y) drawn from the local data S_i, the local gradient estimator ∇_θ ℓ(h_{θ_g}(x), y) is unbiased and has bounded variance σ².

Theorem 2. Under Assumptions 1–5, if clients use SGD as the local optimizer with learning rate η, there exists a large enough number of communication rounds T such that pFedGate converges with η = η_0/√T:

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\big\|\nabla_\theta f\big(\theta_g^t \odot' g_{\phi_i^t}(x);\, x, y\big)\big\|^2 \le O\!\Big(\frac{1}{\sqrt{T}}\Big),$$

where the expectation is over the random samples and O(·) hides polylogarithmic factors.

6.3. Space-Time Complexity

We briefly summarize the efficiency of pFedGate in terms of space-time costs. With the structured block sparsity (Sec. 5.2.2), the training and inference cost of pFedGate is O(d(s_i + s_{ϕ_i})) for client i at each round, where the model sparsity s_i and the relative sparsity of the gating layer s_{ϕ_i} = count(ϕ_i ≠ 0)/d are usually small and lead to (s_i + s_{ϕ_i}) < 1. By contrast, the computation cost of FedAvg and several SOTA PFL methods is at least O(d). The introduced sparsity thus provides great potential for boosting computation efficiency. For communication, the number of uploaded parameters of pFedGate is O(q_i d), which can be smaller than that of the baselines, O(d). Here q_i = count(Δθ_i ≠ 0)/d indicates the ratio of non-zero model updates, which depends on s_i and the local samples trained in the round.
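As a concrete picture of the O(q_i d) upload cost and the block-indexed aggregation described in Appendix A, the sketch below packs only the non-zero block updates on the client and weight-averages them per block index on the server. The dictionary format and the aggregation weights are illustrative assumptions, not the exact implementation.

```python
# A sketch of communicating only non-zero block updates and aggregating them by
# block index on the server (consistent with the toy example in Appendix A).
from collections import defaultdict
import torch


def pack_sparse_update(delta_blocks):
    """Client side: keep only the blocks whose update is non-zero."""
    return {idx: d for idx, d in enumerate(delta_blocks) if torch.any(d != 0)}


def aggregate_sparse_updates(global_blocks, client_updates, client_weights):
    """Server side: weighted-average the received deltas per block index."""
    sums, weights = defaultdict(lambda: 0.0), defaultdict(float)
    for upd, w in zip(client_updates, client_weights):
        for idx, delta in upd.items():
            sums[idx] = sums[idx] + w * delta
            weights[idx] += w
    for idx, total in sums.items():
        global_blocks[idx] += total / weights[idx]
    return global_blocks


# Toy example: 3 blocks, 2 clients that each touched a different subset of blocks.
global_blocks = [torch.zeros(4) for _ in range(3)]
upd_1 = pack_sparse_update([torch.ones(4), torch.zeros(4), 2 * torch.ones(4)])
upd_2 = pack_sparse_update([torch.zeros(4), torch.ones(4), torch.ones(4)])
aggregate_sparse_updates(global_blocks, [upd_1, upd_2], client_weights=[0.5, 0.5])
q = sum(len(u) for u in (upd_1, upd_2)) / (2 * 3)   # average ratio of uploaded blocks
print(global_blocks, f"upload ratio ~{q:.2f}")
```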
In Appendix F, we provide detailed discussions of these terms and show the superiority of pFedGate over several SOTA competitors.

7. Experiments

7.1. Experimental Settings

We adopt four widely used FL datasets in our experiments: EMNIST (Cohen et al., 2017), FEMNIST (Caldas et al., 2018), CIFAR10 and CIFAR100 (Krizhevsky, 2009). They are partitioned into several sub-datasets to simulate the local dataset of each client, and each of them is randomly split into train/val/test sets with ratio 6:2:2. In the experiments, the data partition follows the heterogeneous settings adopted by Marfoq et al. (2021); Dinh et al. (2020b). For baselines, we choose the classic FedAvg (McMahan et al., 2017) and FedAvg with fine-tuning (FedAvg-FT), as well as several SOTA PFL methods, including pFedMe (Dinh et al., 2020b), LG-FedAvg (Liang et al., 2020), Ditto (Li et al., 2021c) and FedEM (Marfoq et al., 2021). In addition, we consider several SOTA PFL methods that also improve efficiency via shared model representations (FedRep (Collins et al., 2021)), binary quantization (FedMask (Li et al., 2021a)), and sparse sub-models (HeteroFL (Diao et al., 2021)). Due to limited space, please refer to Appendix H for more details about the datasets, models and baselines.

Table 1: Accuracy comparison on widely-adopted FL datasets. Acc (%) and ⊥Acc (%) indicate the average accuracy of all clients and the accuracy of the bottom-decile clients, respectively. s̄ indicates the average of the finally achieved sparsity across all clients. Bold numbers indicate the best result in each column.

| Method | EMNIST Acc | EMNIST ⊥Acc | FEMNIST Acc | FEMNIST ⊥Acc | CIFAR10 Acc | CIFAR10 ⊥Acc | CIFAR100 Acc | CIFAR100 ⊥Acc | Average Acc | Average ⊥Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| Local | 71.91 | 64.28 | 71.12 | 57.93 | 61.91 | 53.71 | 26.95 | 21.13 | 57.97 | 49.26 |
| FedAvg | 82.54 | 75.07 | 76.51 | 60.82 | 68.33 | 60.22 | 35.21 | 29.82 | 65.65 | 56.48 |
| FedAvg-FT | 83.19 | 76.83 | 78.43 | 64.22 | 69.91 | 61.55 | 37.15 | 30.91 | 67.17 | 58.38 |
| pFedMe | 83.29 | 76.48 | 75.29 | 57.63 | 70.29 | 62.13 | 36.14 | 30.13 | 66.25 | 56.59 |
| LG-FedAvg | 83.25 | 76.14 | 75.13 | 57.62 | 69.54 | 61.86 | 36.72 | 30.51 | 66.16 | 56.53 |
| Ditto | 82.75 | 78.12 | 79.29 | 63.24 | 73.14 | 62.59 | 37.32 | 30.11 | 68.13 | 58.52 |
| FedEM | 83.41 | 76.59 | 80.12 | 64.81 | 72.43 | 62.88 | 38.28 | 31.04 | 68.56 | 58.83 |
| FedRep | 82.42 | 77.71 | 79.17 | 62.89 | 72.77 | 63.44 | 36.68 | 30.51 | 67.76 | 58.64 |
| FedMask, s̄=0.5 | 81.95 | 77.26 | 78.69 | 62.18 | 72.43 | 62.88 | 36.21 | 30.13 | 67.32 | 58.11 |
| HeteroFL, s̄=0.55 | 85.64 | 77.76 | 77.22 | 59.13 | 70.97 | 63.64 | 37.01 | 31.27 | 67.71 | 57.95 |
| pFedGate, s=1 | 87.11 | 81.43 | **87.32** | **77.14** | **75.18** | **66.67** | **42.01** | **35.03** | **72.91** | **65.07** |
| pFedGate, s=0.5, s̄=0.43 | **87.28** | 81.15 | 86.31 | 75.68 | 74.07 | 64.21 | 40.07 | 32.38 | 71.93 | 63.36 |
| pFedGate, s=0.3, s̄=0.25 | 87.09 | **82.52** | 86.75 | 76.47 | 73.65 | 64.39 | 39.53 | 31.63 | 71.76 | 63.75 |

7.2. Overall Performance

Clients' Average Performance. We first examine the overall performance by evaluating the accuracy on the local test set of each client, and averaging the results of all clients with weights proportional to their local dataset sizes. The detailed overall performance is shown in Table 1, where s̄ indicates the average of the finally achieved sparsity across all clients (our block-wise operation may lead to s̄ ≤ s). We can see that pFedGate achieves better accuracy (3.1% to 4.53% average accuracy improvements) and smaller sparsity at the same time, even with s=0.3. Compared with FedEM, the baseline with the best accuracy, pFedGate does not have the heavy storage burden, since FedEM needs to store multiple (3) global models on the client. These observations demonstrate the effectiveness and efficiency of the proposed method.
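For reference, the two reported metrics can be computed as in the following sketch, using the definitions given in the text: the average accuracy weighted by local dataset sizes, and the bottom-decile accuracy taken as the ⌈|C|/10⌉-th worst client accuracy. The per-client numbers below are made up.

```python
# A small sketch of the two metrics reported in Table 1 (illustrative numbers).
import math
import numpy as np

accs = np.array([0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.88, 0.79, 0.90, 0.83, 0.71, 0.86])
sizes = np.array([120, 300, 80, 150, 200, 90, 60, 110, 130, 170, 140, 100])

avg_acc = float(np.average(accs, weights=sizes))   # dataset-size-weighted average
k = math.ceil(len(accs) / 10)                       # |C|/10 rounded up
bottom_decile_acc = float(np.sort(accs)[k - 1])     # k-th worst accuracy
print(f"Acc = {100 * avg_acc:.2f}%, bottom-decile Acc = {100 * bottom_decile_acc:.2f}%")
```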
Note that compared to the SOTA efficient method HeteroFL, our method achieves better accuracy with smaller sparsity (0.25 vs. 0.55). When s=1, we achieve the best performance without sparsity constraints, verifying that the batch-level block-wise adaptation has a strong ability to achieve personalization. When s < 1, we still gain performance comparable to s=1, providing evidence that well-performing sparse models exist and can be effectively learned by our approach.

Individual Client Performance. We then examine whether pFedGate improves the average performance by sacrificing the performance of some clients. In Table 1, we mark the accuracy of the bottom-decile local models, ⊥Acc, as the ⌈|C|/10⌉-th worst accuracy (following Marfoq et al. (2021), we neglect the particularly noisy results from clients with worse accuracy due to their very small local data sizes). We find that pFedGate also gains significant ⊥Acc improvements (4.92% to 6.24% on average over FedEM). We plot the client-wise accuracy difference between our method with s=0.5 and the strongest baseline, FedEM, in Figure 3, in which pFedGate significantly improves most clients compared to the strongest baseline. The results in Table 1 and Figure 3 show that pFedGate not only improves the average accuracy and efficiency, but also fairly improves the individual client performance by learning to generate personalized models with adaptive gating layers.

Figure 3: The client-wise accuracy difference between pFedGate and FedEM on the CIFAR10 dataset.

7.3. Personalization Study

We propose to learn personalized gating layers [ϕ_i] that estimate the heterogeneous data distributions with sparse model-adaptation. To investigate the level of personalization achieved by pFedGate, we conduct experiments on another CIFAR10 non-i.i.d. partition with 10 clients and pathological splitting (McMahan et al., 2017), which sorts the data by labels and assigns them to clients in equal-sized shards. We perform hierarchical clustering on the parameters of the learned gating layers [ϕ_i] with Ward's method (Ward Jr, 1963) and Euclidean distance. We illustrate the results in Figure 4 for the point at which the clients' models converge with s=0.3 and achieve good performance (Acc=89.21%), where we mark the local label categories of the clients and the common label categories after merging in red. Interestingly, we can see that clients learn gating layers with smaller parameter distances as they share more similarities in label categories. This indicates that pFedGate can not only achieve good client-distinct personalization, but also implicitly learn group-wise personalization under a strong sparsity constraint, which could be beneficial to applications where there are inherent partitions among clients.

Figure 4: The hierarchical clustering result for the parameters of clients' gating layers on CIFAR-10 with the 10-client partition.
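The clustering analysis above can be reproduced along the lines of the following sketch, which flattens each client's gating-layer parameters and applies agglomerative clustering with Ward's method and Euclidean distance via SciPy; the random vectors merely stand in for the learned parameters [ϕ_i].

```python
# A sketch of the personalization analysis in Sec. 7.3: hierarchical clustering of
# (stand-in) gating-layer parameters with Ward's method and Euclidean distance.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
num_clients, d_phi = 10, 64
# Stand-in for [phi_i]: clients 0-4 and 5-9 drawn around two different centers.
phi = np.vstack([rng.normal(loc=0.0, size=(5, d_phi)),
                 rng.normal(loc=1.0, size=(5, d_phi))])

Z = linkage(phi, method="ward", metric="euclidean")   # agglomerative clustering
labels = fcluster(Z, t=2, criterion="maxclust")       # cut the tree into two groups
print("cluster assignment per client:", labels)
# dendrogram(Z)  # with matplotlib, this draws a tree like the one in Figure 4
```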
Table 2: Training time t (ms) and peak memory Mem (MB) for one FL round. bz indicates the batch size.

| Method | Mem (bz=32) | t (bz=32) | Mem (bz=128) | t (bz=128) | Mem (bz=512) | t (bz=512) |
|---|---|---|---|---|---|---|
| FedAvg | 26.0 | 3.07 | 27.1 | 3.89 | 31.8 | 4.51 |
| FedEM | 29.4 | 9.02 | 31.2 | 11.38 | 36.3 | 13.11 |
| ours, s=1 | 26.8 | 3.65 | 27.9 | 4.09 | 32.4 | 4.66 |
| ours, s=0.5 | 21.0 | 2.44 | 22.2 | 3.21 | 26.7 | 3.58 |
| ours, s=0.3 | 16.9 | 2.32 | 18.0 | 2.81 | 22.9 | 3.33 |

7.4. Efficiency Comparison

We have seen that pFedGate can achieve good accuracy with sparse personalization. To examine its efficiency as analyzed in Appendix F, we track the training time in ms and the peak memory in MB for one client during one FL round. The results on the CIFAR-10 dataset with different batch sizes are listed in Table 2, where the results are averaged over the number of trained data batches. We can see that pFedGate effectively improves the training efficiency and memory cost. For example, when s=0.3 and bz=128, pFedGate uses 66% of the peak memory and 72% of the training time of FedAvg, while achieving 5.32% Acc and 4.17% ⊥Acc improvements (Table 1). When compared to FedEM, which leverages 3 global models, the efficiency improvement is more significant (66% Mem and 28% t) with superior accuracy (1.22% Acc and 1.51% ⊥Acc improvements). It is worth noting that, in addition to the model parameters, the training process (including the backward propagation phase) also requires storing intermediate tensors, whose size depends on the model architecture, the shape of the inputs, and the specific implementation. Thus, once the model-adaptation is completed, the intermediate tensors involved in the forward and backward passes through the sparse model are correspondingly smaller thanks to the block sparsity, allowing us to obtain a lower peak training memory and less training time than FedAvg and FedEM. In addition, we empirically observe that pFedGate has a convergence speed comparable to the baselines (Appendix I.2). Together with the single-FL-round and single-client efficiency results given above, we can see that pFedGate is able to improve accuracy while at the same time improving system efficiency.

7.5. Partial Clients Participation

Table 3: Averaged accuracy when randomly sampling 20% of the clients at each round. The last two columns indicate the average drop compared to the non-sampled case.

| Method | FEMNIST Acc | FEMNIST ⊥Acc | CIFAR10 Acc | CIFAR10 ⊥Acc | Drop Acc | Drop ⊥Acc |
|---|---|---|---|---|---|---|
| FedAvg | 77.45 | 57.98 | 66.31 | 57.96 | 3.56 | 2.55 |
| FedEM | 78.54 | 63.37 | 71.12 | 60.76 | 1.45 | 1.78 |
| ours, s=1 | 87.86 | 77.43 | 74.29 | 64.11 | 0.82 | 1.13 |
| ours, s=0.5 | 86.72 | 75.12 | 74.25 | 64.36 | 0.79 | 0.21 |
| ours, s=0.3 | 85.78 | 74.89 | 73.91 | 64.38 | 0.36 | 0.80 |

We have seen that our method can consistently and simultaneously improve efficiency and model accuracy, as Table 1 shows. A natural question arises: when combined with other efficiency-improvement techniques such as client sampling, is pFedGate still effective? Here we run experiments by uniformly sampling 20% of the clients without replacement at each round, and summarize the results in Table 3. We observe that, compared with the baselines having about 1.45% to 3.56% performance drops, our method still achieves strong performance comparable to the non-sampled case, verifying the effectiveness and robustness of our proposed method again.
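A minimal sketch of the sampling protocol used here, assuming a hypothetical pool of 100 clients: at each round the server uniformly draws 20% of the clients without replacement.

```python
# A tiny sketch of the client-sampling setting in Sec. 7.5 (client IDs are illustrative).
import numpy as np

rng = np.random.default_rng(2023)
clients = np.arange(100)                  # hypothetical client pool
sample_ratio = 0.2

for round_id in range(3):                 # a few FL rounds
    selected = rng.choice(clients, size=int(sample_ratio * len(clients)), replace=False)
    print(f"round {round_id}: {len(selected)} clients, e.g., {sorted(selected)[:5]}")
```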
7.6. Ablation Study

Choices of Gating Layer. To gain further insight into our method, we ablate the gating layer w.r.t. its switchable norm layer (pre-norm) and batch norm layer (post-norm) within the sandwich structure, and the sigmoid activation that bounds the gated weights. We also change the batch-wise adaptation into a client-wise manner by feeding the input feature mean-pooled over all local training data samples of each client. We show the results in Table 4 with s=0.5.

Table 4: Average accuracy and the number of rounds needed to reach 0.8×Acc (T0.8), normalized by the standard pFedGate.

| Variant | EMNIST Acc | EMNIST T0.8 | CIFAR100 Acc | CIFAR100 T0.8 |
|---|---|---|---|---|
| pFedGate, s=0.5 | 87.28 | 1 | 39.72 | 1 |
| w/o pre-norm | 86.25 | 1.13 | 38.32 | 1.24 |
| w/o post-norm | 87.14 | 1.10 | 39.67 | 1.43 |
| w/o sigmoid | 83.55 | 1.85 | 31.29 | 1.93 |
| client-wise adaptation | 84.83 | 0.74 | 33.52 | 0.86 |

We find that the two normalization layers improve the convergence speed. Besides, pre-norm has a larger impact on performance than post-norm, since it helps to handle the heterogeneous feature distributions. The sigmoid significantly impacts both the convergence and the accuracy, showing the effectiveness of the gated-weight bounding operation, which restricts the dissimilarity between the personalized models and the global model and in turn helps to learn a global model suitable for different clients. Compared to the batch-level manner, although client-wise adaptation achieves faster training, it pays the cost of a significant accuracy drop due to the coarser-grained modeling of the conditional distribution of local data and the smaller parameter space that can be explored under sparsity constraints. In a nutshell, the ablation results verify the effectiveness and necessity of our choices in the gating layer.

Effect of s. We further study the effect of the sparsity factor s by varying s from 1 to 0.1 on the EMNIST dataset and illustrate the results in Figure 5, in which [0.5, 0.1] indicates that we randomly divide all clients into two equal-sized parts and set s to be 0.5 and 0.1 for these two parts respectively. We can see that our method robustly keeps a good performance when adopting not-very-small sparsity degrees. Another observation is that the performance variance increases as s decreases. Besides, our method still performs well in the sparsity-mixed setting, verifying the robustness of our method again.

Figure 5: The accuracy on EMNIST when varying s.

7.7. Overview of More Experiments in Appendix

Due to the space limitation, we provide further experiments and analysis in the Appendix in terms of:
- Generalization: In Appendix I.1, we present additional results of pFedGate for unseen clients that have not participated in the FL training stage (Marfoq et al., 2021; Yuan et al., 2022; Chen et al., 2022), in which case pFedGate achieves better performance than the baselines since it only needs to train the personalized lightweight gating layers with a small number of parameters.
- Convergence: In Appendix I.2, we show that pFedGate achieves effective optimization, which supports Theorem 2 and further verifies the efficiency of pFedGate.
- Sparse factor study: In Appendix I.3, we further study the effect of s w.r.t. the global-local model gap, the FL communication frequency and the data heterogeneity degree.
- Sparse model study: In Appendix I.4, we vary the block splitting factor B of pFedGate and make a more detailed comparison to FedMask.

8. Conclusion and Future Work

Existing personalized federated learning (PFL) methods usually focus only on the heterogeneity of local data distributions and pay additional computation and communication costs.
In this paper, we explore the organic combination of model compression and PFL by adaptively generating different sparse local models at a fine-grained batch level, which gains double benefit of personalization performance and efficiency. Theoretical analyses are provided w.r.t. generalization, convergence and space-time complexity. Extensive experiments are conducted to verify the effectiveness, efficiency and robustness of the proposed method in various benchmarks and scenarios. We demonstrate the feasibility of obtaining superior accuracy and efficiency simultaneously for PFL, and hope this work can facilitate more studies on efficient and practical PFL. Our work also raises some interesting questions such as do the sparse models help the protection against privacy attacks in FL (Zhu & Han, 2019; Qin et al., 2023), since attackers may face more challenging parameter space coupled with unknown personalized gating layers. Besides, the local model update of each client can be dense especially when the number of local training batches is large, as the final mask is the union over triggered sparse masks from different batches. Restricting the triggered masks via intersection operation without compromising the convergence is also worth exploring, which can further provide a worst-case guarantee with user-preferred sparsity values. Efficient Personalized Federated Learning via Sparse Model-Adaptation Ainsworth, S. K., Hayase, J., and Srinivasa, S. Git re-basin: Merging models modulo permutation symmetries. ar Xiv preprint ar Xiv:2209.04836, 2022. Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Neur IPS, 30:1709 1720, 2017. Baxter, J. A model of inductive bias learning. JAIR, 12: 149 198, 2000. Bibikar, S., Vikalo, H., Wang, Z., and Chen, X. Federated dynamic sparse training: Computing less, communicating less, yet learning better. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 6080 6088, 2022. Briggs, C., Fan, Z., and Andras, P. Federated learning with hierarchical clustering of local updates to improve training on non-iid data. In IJCNN, pp. 1 9. IEEE, 2020. Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Koneˇcn y, J., Mc Mahan, H. B., Smith, V., and Talwalkar, A. Leaf: A benchmark for federated settings. ar Xiv preprint ar Xiv:1812.01097, 2018. Chai, Z., Fayyaz, H., Fayyaz, Z., Anwar, A., Zhou, Y., Baracaldo, N., Ludwig, H., and Cheng, Y. Towards taming the resource and data heterogeneity in federated learning. In 2019 USENIX Conference on Operational Machine Learning (Op ML 19), pp. 19 21, 2019. Chai, Z., Ali, A., Zawad, S., Truex, S., Anwar, A., Baracaldo, N., Zhou, Y., Ludwig, H., Yan, F., and Cheng, Y. Tifl: A tier-based federated learning system. In Proceedings of the 29th International Symposium on High Performance Parallel and Distributed Computing, pp. 125 136, 2020. Chen, D., Gao, D., Kuang, W., Li, Y., and Ding, B. p FLbench: A comprehensive benchmark for personalized federated learning. In Neur IPS, 2022. Chen, D., Gao, D., Xie, Y., Pan, X., Li, Z., Li, Y., Ding, B., and Zhou, J. FS-REAL: Towards real-world cross-device federated learning. In KDD, 2023. Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A. Emnist: Extending mnist to handwritten letters. In IJCNN, pp. 2921 2926. IEEE, 2017. Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. Exploiting shared representations for personalized federated learning. In ICML, volume 139, pp. 
2089 2099, 18 24 Jul 2021. Corinzia, L. and Buhmann, J. M. Variational federated multi-task learning, 2019. Deng, L., Li, G., Han, S., Shi, L., and Xie, Y. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108 (4):485 532, 2020. Diao, E., Ding, J., and Tarokh, V. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In ICLR, 2021. Dinh, C. T., Tran, N. H., and Nguyen, T. D. Personalized federated learning with moreau envelopes. Ar Xiv, abs/2006.08848, 2020a. Dinh, C. T., Tran, N. H., and Nguyen, T. D. Personalized federated learning with moreau envelopes. In Neur IPS, 2020b. Fallah, A., Mokhtari, A., and Ozdaglar, A. Personalized federated learning with theoretical guarantees: A modelagnostic meta-learning approach. In Neur IPS, volume 33, pp. 3557 3568, 2020. Ghosh, A., Chung, J., Yin, D., and Ramchandran, K. An efficient framework for clustered federated learning. In Neur IPS, volume 33, 2020. Haddadpour, F., Kamani, M. M., Mokhtari, A., and Mahdavi, M. Federated learning with compression: Unified analysis and sharp guarantees. In AISTATS, pp. 2350 2358. PMLR, 2021. Hamer, J., Mohri, M., and Suresh, A. T. Fedboost: A communication-efficient algorithm for federated learning. In ICML, pp. 3973 3983. PMLR, 2020. Hardy, G. H., Littlewood, J. E., P olya, G., P olya, G., et al. Inequalities. Cambridge university press, 1952. He, C., Annavaram, M., and Avestimehr, S. Group knowledge transfer: Federated learning of large cnns at the edge. Neur IPS, 33, 2020. Hong, J., Zhu, Z., Yu, S., Wang, Z., Dodge, H., and Zhou, J. Federated adversarial debiasing for fair and transferable representations. In KDD, 2021. Hong, J., Wang, H., Wang, Z., and Zhou, J. Efficient splitmix federated learning for on-demand and in-situ customization. In ICLR, 2022. Huang, T., Liu, S., Li, S., He, F., Lin, W., and Tao, D. Achieving personalized federated learning with sparse local models. Ar Xiv, abs/2201.11380, 2022. Huang, Y., Chu, L., Zhou, Z., Wang, L., Liu, J., Pei, J., and Zhang, Y. Personalized cross-silo federated learning on non-iid data. In AAAI, volume 35, pp. 7865 7873, 2021. Efficient Personalized Federated Learning via Sparse Model-Adaptation Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. Neur IPS, 29, 2016. Jiang, Y., Koneˇcn y, J., Rush, K., and Kannan, S. Improving Federated Learning Personalization via Model Agnostic Meta Learning. ar Xiv:1909.12488, 2019. Jiang, Y., Wang, S., Valls, V., Ko, B. J., Lee, W.-H., Leung, K. K., and Tassiulas, L. Model pruning enables efficient federated learning on edge devices. IEEE Transactions on Neural Networks and Learning Systems, 2022. Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D Oliveira, R. G. L., Eichner, H., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gasc on, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konecn y, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Ozg ur, A., Pagh, R., Qi, H., Ramage, D., Raskar, R., Raykova, M., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tram er, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and open problems in federated learning. 
Foundations and Trends in Machine Learning, 14(1 2):1 210, 2021. ISSN 1935-8237. Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. Scaffold: Stochastic controlled averaging for federated learning. In ICML, pp. 5132 5143. PMLR, 2020. Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. In Neur IPS, pp. 5917 5928, 2019. Krizhevsky, A. Learning multiple layers of features from tiny images. Msc thesis, 2009. Li, A., Sun, J., Zeng, X., Zhang, M., Li, H., and Chen, Y. Fedmask: Joint computation and communicationefficient personalized federated learning via heterogeneous masking. In Sen Sys, pp. 42 55, 2021a. Li, D. and Wang, J. Fedmd: Heterogenous federated learning via model distillation. 2019. Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50 60, 2020. Li, T., Hu, S., Beirami, A., and Smith, V. Ditto: Fair and robust federated learning through personalization. In ICML, pp. 6357 6368. PMLR, 2021b. Li, T., Hu, S., Beirami, A., and Smith, V. Ditto: Fair and robust federated learning through personalization. In ICML, pp. 6357 6368, 2021c. Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of fedavg on non-iid data. In ICLR, 2019. Li, X., Jiang, M., Zhang, X., Kamp, M., and Dou, Q. Fed{bn}: Federated learning on non-{iid} features via local batch normalization. In ICLR, 2021d. Liang, P. P., Liu, T., Ziyin, L., Salakhutdinov, R., and Morency, L.-P. Think locally, act globally: Federated learning with local and global representations. ar Xiv preprint ar Xiv:2001.01523, 2020. Lin, T., Kong, L., Stich, S. U., and Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Neur IPS, 33, 2020. Luo, P., Ren, J., Peng, Z., Zhang, R., and Li, J. Differentiable learning-to-normalize via switchable normalization. ICLR, 2019. Ma, J., Long, G., Zhou, T., Jiang, J., and Zhang, C. On the convergence of clustered federated learning. ar Xiv preprint ar Xiv:2202.06187, 2022. Marfoq, O., Neglia, G., Bellet, A., Kameni, L., and Vidal, R. Federated multi-task learning under a mixture of distributions. Neur IPS, 34, 2021. Maurer, A. and Pontil, M. Structured sparsity and generalization. The Journal of Machine Learning Research, 13 (1):671 690, mar 2012. ISSN 1532-4435. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In AISTATS, pp. 1273 1282. PMLR, 2017. Meng, C., Rambhatla, S., and Liu, Y. Cross-node federated graph neural network for spatio-temporal data modeling. In KDD, pp. 1202 1211, 2021. Minh, H. and Carl, K. Neurips 2021, workshop on new frontiers in federated learning. In Personalized neural architecture search for federated learning, 2021. Muhammad, K., Wang, Q., O Reilly-Morgan, D., Tragos, E., Smyth, B., Hurley, N., Geraci, J., and Lawlor, A. Fedfast: Going beyond average for faster training of federated recommender systems. In KDD, pp. 1234 1242, 2020. Nguyen, J., Malik, K., Zhan, H., Yousefpour, A., Rabbat, M., Malek, M., and Huba, D. Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pp. 3581 3607. PMLR, 2022. Efficient Personalized Federated Learning via Sparse Model-Adaptation Ozkara, K., Singh, N., Data, D., and Diggavi, S. Quped: Quantized personalization via distillation with applications to federated learning. 
In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. Qin, Z., Yao, L., Chen, D., Li, Y., Ding, B., and Cheng, M. Revisiting personalized federated learning: Robustness against backdoor attacks. In KDD, 2023. Qiu, X., Fernandez-Marques, J., Gusmao, P. P., Gao, Y., Parcollet, T., and Lane, N. D. Zero FL: Efficient ondevice training for federated learning with local sparsity. In International Conference on Learning Representations, 2022. Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Koneˇcn y, J., Kumar, S., and Mc Mahan, H. B. Adaptive federated optimization. In ICLR, 2021. Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. In AISTATS, pp. 2021 2031. PMLR, 2020. Rothchild, D., Panda, A., Ullah, E., Ivkin, N., Stoica, I., Braverman, V., Gonzalez, J., and Arora, R. Fetchsgd: Communication-efficient federated learning with sketching. In ICML, pp. 8253 8265. PMLR, 2020. Sattler, F., M uller, K.-R., and Samek, W. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. TNNLS, 2020. Shamsian, A., Navon, A., Fetaya, E., and Chechik, G. Personalized federated learning using hypernetworks. In Meila, M. and Zhang, T. (eds.), ICML, volume 139, pp. 9489 9502, 7 2021. Singhal, K., Sidahmed, H., Garrett, Z., Wu, S., Rush, J., and Prakash, S. Federated reconstruction: Partially local federated learning. Advances in Neural Information Processing Systems, 34, 2021. Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. Federated Multi-Task Learning. In Neur IPS, volume 30, pp. 4427 4437, 2017. Tan, A. Z., Yu, H., Cui, L., and Yang, Q. Towards personalized federated learning. ar Xiv preprint ar Xiv:2103.00710, 2021. Tianchun, W., Wei, C., Dongsheng, L., Wenchao, Y., Jingchao, N., Liang, T., Haifeng, C., and Zhang, X. Icdm. In Personalized Federated Learning via Heterogeneous Modular Networks, 2022. Wang, Z., Kuang, W., Xie, Y., Yao, L., Li, Y., Ding, B., and Zhou, J. Federatedscope-gnn: Towards a unified, comprehensive and efficient package for federated graph learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4110 4120, 2022. Ward Jr, J. H. Hierarchical grouping to optimize an objective function. JASA, 58(301):236 244, 1963. Xie, Y., Wang, Z., Gao, D., Chen, D., Yao, L., Kuang, W., Li, Y., Ding, B., and Zhou, J. Federatedscope: A flexible federated learning platform for heterogeneity. Proceedings of the VLDB Endowment, 16(5):1059 1072, 2023. Yang, H., He, H., Zhang, W., and Cao, X. Fed Steg: A Federated Transfer Learning Framework for Secure Image Steganalysis. IEEE TNSE, 2020. Yang, Q., Liu, Y., Chen, T., and Tong, Y. Federated machine learning: Concept and applications. ar Xiv: Artificial Intelligence, 2019. Yuan, H., Morningstar, W. R., Ning, L., and Singhal, K. What do we mean by generalization in federated learning? In International Conference on Learning Representations, 2022. Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., and Hoang, N. Statistical model aggregation via parameter matching. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32, 2019a. Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. 
Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pp. 7252 7261. PMLR, 2019b. Zhang, J., Guo, S., Ma, X., Wang, H., Xu, W., and Wu, F. Parameterized knowledge transfer for personalized federated learning. In Neur IPS, volume 34, 2021a. Zhang, M., Sapra, K., Fidler, S., Yeung, S., and Alvarez, J. M. Personalized federated learning with first order model optimization. In ICLR, 2020. Zhang, Q., Gu, B., Deng, C., Gu, S., Bo, L., Pei, J., and Huang, H. Asysqn: Faster vertical federated learning algorithms with better computation resource utilization. In KDD, pp. 3917 3927, 2021b. Zhu, L. and Han, S. Deep leakage from gradients. In Neur IPS, volume 32. Springer, 2019. Zhu, Z., Hong, J., and Zhou, J. Data-free knowledge distillation for heterogeneous federated learning. In ICML, 2021. Efficient Personalized Federated Learning via Sparse Model-Adaptation We provide more details and further experiments about our work in the appendices: Sec.A: the summarized algorithm of the proposed method. Proofs: the full proof of Proposition 1 (Sec.B), Theorem 1 (Sec.C), and Theorem 2 (Sec.D). The discussions on the adopted assumptions are also presented in Sec.E. Sec.F: the detailed discussion about the space-time complexity comparing with other FL methods. Sec.G: the detailed comparison and discussion about more related works. Sec.H: the implementation details of experiments such as datasets, baselines, models, hyper-parameters and sparse model aggregation procedure of p Fed Gate. Sec.I: further experimental results including Sec.I.1, the performance in novel client participation case; Sec.I.2, the convergence study; Sec.I.3, the effect of s w.r.t. global-local model gap, FL communication frequency and data heterogeneity degree; and Sec.I.4, the model compression manner study, in which we compare p Fed Gate to the SOTA binary quantization method Fed Mask (Li et al., 2021a). A. The Algorithm of the proposed p Fed Gate method Algorithm 1 Efficient Personalized FL with p Fed Gate Input: T, ηg, η, θ0 g, {ϕ0 i }i C, smin, {si}i C 1 for t = 1, , T do 2 Server sends θt 1 g to clients Cs sampled from C for client i Cs in parallel do 3 for data batch (x, y) Si do 4 Get sparse gated weight: M i = g(ϕt 1 i ; x) Do personalized model-adaption: θ i = θt 1 g M i Update global model and local gating layer: θt g = θt 1 g ηg( θf(θ i; x, y)) ϕt i = ϕt 1 i η( ϕθ i θf(θ i; x, y)) 5 Upload sparse θt g,i := θt g θt 1 g 6 Server updates global model: θt g = AGGREGATE θt 1 g , { θt g,i}i Cs 7 return θg, {ϕi}i C We summarize the overall algorithm in Algorithm 1. Besides, we present more details about (1) the gradients flow via the gating layer, which contains a knapsack solver; and (2) the global model aggregation. Differentiability. We leverage the straight-through trick (Hubara et al., 2016) to make the combination of the two flows predicted by the gating layer (red G and blue M in Figure 2) still differentiable. Recall that the combination is M = M I where I is a binary index and a solution to the knapsack problem (Section 5.2). The straight-through (ST) here means that we build such a computation graph, in which we only use binary variable I in the forward pass stage, while replace I into the differentiable variable G in backward gradient propagation. Specifically, this can be implemented by the following demonstrative Py Torch code: I ST = I - G.detach() + G. 
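A minimal runnable expansion of this one-liner is sketched below; the tensor values are toy placeholders and the thresholding merely stands in for the knapsack solution I.

```python
# A runnable expansion of the demonstrative snippet above (values are illustrative):
# the binary index I is used in the forward pass, while gradients flow through the
# differentiable importance scores G in the backward pass.
import torch

G = torch.tensor([0.9, 0.2, 0.7], requires_grad=True)   # differentiable block importance
M = torch.tensor([0.8, 0.5, 0.6], requires_grad=True)   # predicted gated weights
I = (G > 0.5).float()                                    # binary knapsack solution (toy)

I_ST = I - G.detach() + G          # forward value equals I, backward gradient goes to G
M_sparse = I_ST * M                # sparse gated weights M'
M_sparse.sum().backward()
print(M_sparse.detach())           # tensor([0.8000, 0.0000, 0.6000])
print(G.grad, M.grad)              # gradients reach both G and M
```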
In this way, the tensors I_ST and $M$ are both differentiable, and the gradient can flow through their combination $M' = I_{ST} \odot M$ and the subsequently adapted parameters of the sparse model $\theta_i' = \theta_g^{t-1} \odot M_i'$. We can thus train the personalized model and the gating layer as shown in lines 6 and 7 of Algorithm 1.

Global Model Aggregation. We present more details about the global model aggregation of pFedGate. After local training, a client only uploads the updates of the global-model parameters that were selected to form its local model. In other words, the client uploads the non-zero parameter updates together with their corresponding indices in the global model. The server then aggregates the received model updates according to these indices. Consider the following toy example: there are four clients, and the global model has three parameter blocks in total. Each client uploads a dictionary whose keys denote block indices and whose values denote the corresponding parameter updates. Client 1: {0: δ0, 2: δ2}; Client 2: {1: δ1, 2: δ2}; Client 3: {0: δ0, 1: δ1}; Client 4: {0: δ0, 1: δ1, 2: δ2}. After receiving the above updates, the server aggregates the parameter updates with index 0 by weighted averaging of the δ0s from clients 1, 3, and 4. The other parameters are aggregated similarly.
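A sketch of this server-side aggregation is given below; the block indices and scalar updates are toy values, and the uniform client weights are an assumption made only for illustration.

```python
from collections import defaultdict

def aggregate_sparse_updates(client_updates, client_weights):
    """Weighted-average per-block updates, where each client uploads only the
    non-zero blocks of its model delta as {block_index: delta}."""
    sums, weights = defaultdict(float), defaultdict(float)
    for updates, w in zip(client_updates, client_weights):
        for block_idx, delta in updates.items():
            sums[block_idx] += w * delta
            weights[block_idx] += w
    # Blocks that no client selected simply keep a zero update.
    return {idx: sums[idx] / weights[idx] for idx in sums}

# Toy example with four clients and three parameter blocks (scalars for brevity).
updates = [
    {0: 0.1, 2: -0.2},          # client 1
    {1: 0.3, 2: 0.1},           # client 2
    {0: -0.1, 1: 0.2},          # client 3
    {0: 0.4, 1: -0.3, 2: 0.2},  # client 4
]
print(aggregate_sparse_updates(updates, client_weights=[1.0, 1.0, 1.0, 1.0]))
```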
B. Proof of Proposition 1

Since $f$ is $\mu$-strongly convex, by definition we have
$$f(\theta') \geq f(\theta) + \nabla_{\theta} f(\theta)^{T}(\theta' - \theta) + \frac{\mu}{2}\|\theta' - \theta\|^2, \quad \forall \theta, \theta'. \quad (8)$$
By the first-order condition together with Equation (8), $f'(\theta) = f(\theta) - \frac{\mu}{2}\|\theta\|^2$ is also convex. Considering the monotone gradient condition for the convexity of $f'$, we get
$$(\nabla_{\theta} f(\theta) - \nabla_{\theta} f(\theta'))^{T}(\theta - \theta') \geq \mu \|\theta - \theta'\|^2, \quad \forall \theta, \theta'. \quad (9)$$
Applying the Cauchy-Schwarz inequality to Equation (9), we get
$$\|\nabla_{\theta} f(\theta) - \nabla_{\theta} f(\theta')\| \cdot \|\theta - \theta'\| \geq (\nabla_{\theta} f(\theta) - \nabla_{\theta} f(\theta'))^{T}(\theta - \theta') \geq \mu \|\theta - \theta'\|^2. \quad (10)$$
Dividing both sides of Equation (10) by $\|\theta - \theta'\|$, we get
$$\|\theta - \theta'\| \leq \|\nabla_{\theta} f(\theta) - \nabla_{\theta} f(\theta')\| / \mu, \quad \forall \theta, \theta'. \quad (11)$$
Under Assumption 1, for all $(\theta_i^*, \theta_g^*)$ pairs, with Equation (11) we have
$$\|(M_i^* - 1) \odot \theta_g^*\|^2 \leq \frac{\sigma_i^2}{\mu^2}, \quad \forall i \in C. \quad (12)$$
Using the reversed Cauchy-Schwarz inequality (Hardy et al., 1952), we have
$$A_i \|(M_i^* - 1) \odot \theta_g^*\| \geq \sqrt{\|(M_i^* - 1)^{\odot 4}\| \cdot \|(\theta_g^*)^{\odot 4}\|}, \quad (13)$$
where $\odot$ indicates element-wise multiplication, $^{\odot 4}$ indicates the Hadamard power of 4, and $A_i = \frac{1}{4} \frac{(R_{\theta_i}^2 R_{\theta_g}^2 + r_{\theta_i}^2 r_{\theta_g}^2)^2}{R_{\theta_i}^2 R_{\theta_g}^2 r_{\theta_i}^2 r_{\theta_g}^2}$ is a positive constant, with $r_{\theta_i}$ and $r_{\theta_g}$ the infima of $\theta_i$ and $\theta_g$, and $R_{\theta_i}$ and $R_{\theta_g}$ their suprema, respectively. Combining Equation (12) and Equation (13), we get the upper bound in Proposition 1 and finish the proof.

C. Proof of Theorem 1

Let $\mathcal{H}_n$ denote the function space whose elements are parameterized by $(\theta_g, \phi_1, \dots, \phi_n)$, with the distance metric defined as
$$d\big((\theta_g, \phi_1, \dots, \phi_n), (\theta_g', \phi_1', \dots, \phi_n')\big) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{x,y \sim D_i}\big[ f(\theta_g, \phi_i; x, y) - f(\theta_g', \phi_i'; x, y) \big]. \quad (14)$$
With the Lipschitz conditions in Assumption 2, we have
$$\begin{aligned} d\big((\theta_g, \phi_1, \dots, \phi_n), (\theta_g', \phi_1', \dots, \phi_n')\big) &\leq \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{x,y \sim D_i}\big[ \ell(h_{\theta_g,\phi_i}(x), y) - \ell(h_{\theta_g',\phi_i'}(x), y) \big] \\ &\leq L_f \|h_{\theta_g,\phi_i} - h_{\theta_g',\phi_i'}\| = L_f \|h_{s(\theta_i)} - h_{s(\theta_i')}\| \\ &\leq L_f L_h \|\theta_g \odot M_i - \theta_g' \odot M_i'\| \qquad \text{(Lipschitz condition)} \\ &\leq L_f L_h \|\theta_g \odot M_i - \theta_g \odot M_i' + \theta_g \odot M_i' - \theta_g' \odot M_i'\| \\ &\leq L_f L_h \big[ \|\theta_g\| \, \|M_i - M_i'\| + \|M_i'\| \, \|\theta_g - \theta_g'\| \big] \qquad \text{(Assumption 2)} \\ &\leq L_f L_h \big[ R \|M_i - M_i'\| + s_i \|\theta_g - \theta_g'\| \big] \\ &\leq L_f L_h \big[ R \|g(\phi_i) - g(\phi_i')\| + s_i \|\theta_g - \theta_g'\| \big] \qquad \text{(Lipschitz condition)} \\ &\leq L_f L_h \big[ R L_g \|\phi_i - \phi_i'\| + s_i \|\theta_g - \theta_g'\| \big], \end{aligned} \quad (15)$$
where $R$ is the radius of the ball that bounds the parameters of the global model and the gating layer (mentioned in Assumption 2).

We can obtain an $\epsilon$-covering in the metric $d(\cdot,\cdot)$ if we select a covering in the parameter space with both $\|\phi_i - \phi_i'\|$ and $\|\theta_g - \theta_g'\|$ equal to $\frac{\epsilon}{L_f L_h (R L_g + s_i)}$. Therefore, the covering number of $\mathcal{H}_{|C|}$, denoted as $B(\epsilon, \mathcal{H}_{|C|})$, satisfies
$$\log B(\epsilon, \mathcal{H}_{|C|}) = O\Big( (|C| d_{\phi} + d) \log \frac{R L_f L_h (R L_g + s_i)}{\epsilon} \Big).$$
According to (Baxter, 2000; Shamsian et al., 2021), there exists $N$ with
$$N = O\Big( \frac{1}{n \epsilon^2} \log B(\epsilon, \mathcal{H}_{|C|}) \Big) = O\Big( \frac{d}{|C| \epsilon^2} \log \frac{R L_f L_h (R L_g + s_i)}{\epsilon} + \frac{d_{\phi}}{\epsilon^2} \log \frac{R L_f L_h (R L_g + s_i)}{\epsilon} + \frac{\log(1/\delta)}{|C| \epsilon^2} \Big).$$

D. Proof of Theorem 2

We note that (Marfoq et al., 2021) extends surrogate optimization to the FL setting and provides convergence results for algorithms that approximate the target objective functions with first-order surrogate or partial first-order surrogate functions. For simplicity, we use two compact notations for all the learnable parameters introduced in our formulation: $v_M \triangleq \{M_i\}_{i \in C}$ and $v_{M,\phi} \triangleq \{M_i' = g(\phi_i; x)\}_{i \in C, x \in X_i}$. Here we prove that our method described in Algorithm 1 can be regarded as optimizing the objective function $F(\theta_g, v_M)$ in Equation (5) with another partial first-order surrogate function, so that the convergence results of (Marfoq et al., 2021) apply to our method. We first recap the formal definition of a partial first-order surrogate:

Definition 1 (Partial first-order surrogate (Marfoq et al., 2021)). A function $F'(u, v): \mathbb{R}^{d_u} \times \mathcal{V} \rightarrow \mathbb{R}$ is a partial first-order surrogate of $F(u, v)$ w.r.t. $u$ near $(u_0, v_0) \in \mathbb{R}^{d_u} \times \mathcal{V}$ when the following conditions are satisfied:
1. $F'(u, v) \geq F(u, v)$ for all $u \in \mathbb{R}^{d_u}$ and $v \in \mathcal{V}$;
2. $e(u, v) \triangleq F'(u, v) - F(u, v)$ is differentiable and $L$-smooth with respect to $u$; moreover, $e(u_0, v_0) = 0$ and $\nabla_u e(u_0, v_0) = 0$;
3. $F'(u, v_0) - F'(u, v) = d_{\mathcal{V}}(v_0, v)$ for all $u \in \mathbb{R}^{d_u}$ and $v \in \arg\min_{v' \in \mathcal{V}} F'(u, v')$, where $d_{\mathcal{V}}$ is non-negative and $d_{\mathcal{V}}(v, v') = 0 \Leftrightarrow v = v'$.

Definition 1 indicates that, on the partial parameter set $\mathcal{V}$, $F'$ is majorant to $F$ and the smoothness condition holds in the neighborhood of a given point. Denote by $F_i'(\theta_g, v_{M,\phi})$ the transformed local objective function depicted by the gating layer $\phi_i$, i.e., the sparse gated weight $M_i'$ is generated by the process described in Section 5.2 and the intermediate objective in Equation (6). We then restrict our attention from the function $F(\theta_g, v_M)$ over all clients to the $|C|$ local functions $F_i(\theta_g, v_{M,\phi})$ for each client $i$, since the weighted-sum operation is convex and preserves the partial majorization and smoothness properties. Consider the iterative versions of $F_i$ and $F_i'$ corresponding to the procedure in Algorithm 1. We have the partial set of gated weights $\mathcal{V}_{\phi} = \{\{M_i' = g(\phi_i^t; x)\}_{x \in X_i}\}_{\phi_i^t = \arg\min_{\phi} F_i'(\theta_g^{t-1}, v_{\phi}^{t-1})}$ for all $t \in [T]$. For $v_{M,\phi_i^t} \in \mathcal{V}_{\phi}$, due to the optimality of $\phi_i^t$ and the equality between $v_M$ and $v_{M,\phi_i^t}$ obtained by the block-to-element scatter operation on the gated weights, we have $F_i'(\theta_g, v_M) = F_i(\theta_g, v_{M,\phi_i^t})$ and Condition 1 satisfied. Note that the difference between $F_i'(\theta_g, v_M)$ and $F_i(\theta_g, v_{M,\phi})$ is differentiable w.r.t. $\theta_g$, since both are differentiable under either the element-wise or the block-wise product of $\theta_g$ and the sparse gated weights. With the smoothness Assumption 3, Condition 2 is satisfied. For Condition 3, it is clear that if the right-hand statement holds, i.e., $v_{M,\phi} = v_{M,\phi}'$, then $d_{\mathcal{V}}(v_{M,\phi}, v_{M,\phi}') = F_i'(\theta_g, v_{M,\phi}) - F_i'(\theta_g, v_{M,\phi}') = 0$, so the sufficiency between the two statements of Condition 3 is satisfied.
For the necessity between the two statements of Condition 3, we can consider the set $\mathcal{V}_{\phi} \setminus (\bar{\mathcal{V}}_{\phi} \cup \tilde{\mathcal{V}}_{\phi})$, where $\bar{\mathcal{V}}_{\phi}$ indicates the subset of $\mathcal{V}_{\phi}$ with $t \in [\bar{T}, T]$ if $F_i'$ converges at step $\bar{T} \leq T$, and $\tilde{\mathcal{V}}_{\phi}$ indicates the subset of $\mathcal{V}_{\phi}$ with $t \in \{t \,|\, F_i'(\theta_g^{t-1}, v_{\phi}^{t-1}) = F_i'(\theta_g^{t'-1}, v_{\phi}^{t'-1}),\ t, t' \in [T],\ t \neq t'\}$. The new partial subset ensures that the optimal $v_{M,\phi}$ is unique, by discarding the redundant $v_{M,\phi}$ that share the same value of $F_i'(\theta_g, v_{M,\phi})$. For such a new partial subset $\mathcal{V}_{\phi} \setminus (\bar{\mathcal{V}}_{\phi} \cup \tilde{\mathcal{V}}_{\phi})$, the necessity between the two statements of Condition 3 is satisfied due to the uniqueness of the solutions of $\arg\min_{v_{M,\phi} \in \mathcal{V}_{\phi} \setminus (\bar{\mathcal{V}}_{\phi} \cup \tilde{\mathcal{V}}_{\phi})} F'(\theta_g, v_{M,\phi})$, while Condition 1 and Condition 2 still hold since $\mathcal{V}_{\phi} \setminus (\bar{\mathcal{V}}_{\phi} \cup \tilde{\mathcal{V}}_{\phi})$ is a subset of $\mathcal{V}_{\phi}$.

We finally show that the assumptions required by the convergence proof in (Marfoq et al., 2021) hold when using our approach.

Proposition 2. Under Assumptions 1-5, when changing the optimization variable from $\theta_g$ to $\theta_{\phi_i} = \theta_g \odot g(\phi_i; x)$ over all $i \in C, x \in X_i$, the gradient variance of $f$ and the gradient diversity across clients remain bounded, and the smoothness of $f$ holds.

Proof. Note that, by the chain rule, the gradient w.r.t. $\phi_i$ can be written as $\nabla_{\phi} f(\theta_{\phi_i}; x, y) = \nabla_{\phi}\theta_{\phi_i}\, \nabla_{\theta} f(\theta_{\phi_i}; x, y)$ for $(x, y) \in D_i$. For simplicity, we denote $\nabla_{\phi} f(\theta_{\phi_i}; x, y)$ as $\nabla_{\phi} f(\theta_{\phi_i})$. Since we generate the sparse model as $\theta_{\phi_i} = \theta_g \odot M_i'$, we have $\nabla_{\phi}\theta_{\phi_i} = \theta_g \odot \nabla_{\phi} M_i'$. The block-wise product operation is differentiable, and $f$ is differentiable w.r.t. $\theta_g$ according to Assumption 3; thus $f$ is also differentiable w.r.t. $\phi_i$. Recall that the optimal gated weights are bounded, as shown in Proposition 1. Further, we explicitly bound the sparse gated weights within $[0, 1]^L$ during the FL training process via the sigmoid function, as introduced in Section 5.2. We can thus assume that
$$\|\nabla_{\phi}\theta_{\phi_i'} - \nabla_{\phi}\theta_{\phi_i}\|_2 \leq L_{\phi} \|\theta_{\phi_i'} - \theta_{\phi_i}\|_2, \quad (16)$$
where $L_{\phi}$ is a Lipschitz constant corresponding to our bounded gated weights. We get
$$\begin{aligned} \|\nabla_{\phi} f(\theta_{\phi_i'}) - \nabla_{\phi} f(\theta_{\phi_i})\|_2 &= \|\nabla_{\phi}\theta_{\phi_i'} \nabla_{\theta} f(\theta_{\phi_i'}) - \nabla_{\phi}\theta_{\phi_i} \nabla_{\theta} f(\theta_{\phi_i})\|_2 \qquad \text{(chain rule)} \\ &= \|\nabla_{\phi}\theta_{\phi_i'} \nabla_{\theta} f(\theta_{\phi_i'}) - \nabla_{\phi}\theta_{\phi_i'} \nabla_{\theta} f(\theta_{\phi_i}) + \nabla_{\phi}\theta_{\phi_i'} \nabla_{\theta} f(\theta_{\phi_i}) - \nabla_{\phi}\theta_{\phi_i} \nabla_{\theta} f(\theta_{\phi_i})\|_2 \\ &\leq \|\nabla_{\phi}\theta_{\phi_i'} \nabla_{\theta} f(\theta_{\phi_i'}) - \nabla_{\phi}\theta_{\phi_i'} \nabla_{\theta} f(\theta_{\phi_i})\|_2 + \|\nabla_{\phi}\theta_{\phi_i'} \nabla_{\theta} f(\theta_{\phi_i}) - \nabla_{\phi}\theta_{\phi_i} \nabla_{\theta} f(\theta_{\phi_i})\|_2 \\ &\leq L \|\nabla_{\phi}\theta_{\phi_i'}\|_2 \|\theta_{\phi_i'} - \theta_{\phi_i}\|_2 + L_{\phi} \|\nabla_{\theta} f(\theta_{\phi_i})\|_2 \|\theta_{\phi_i'} - \theta_{\phi_i}\|_2 \\ &\leq 2 L L_{\phi} \|\theta_{\phi_i'} - \theta_{\phi_i}\|_2, \qquad \text{(Assumption 3 and Eqn. (16))} \end{aligned} \quad (17)$$
where the last two steps use the fact that the Lipschitz conditions imply bounded gradients. Thus $f$ is $(2 L L_{\phi})$-smooth under our method. Similar to this smoothness proof, which leverages the chain rule and the bounded $M_i$, we can generalize Assumption 1 and Assumption 5 to our method, such that the gradient dissimilarity is bounded and $f$ has bounded variance, characterized by another constant factor.

Now we see that each client optimizes a partial first-order surrogate function and that the necessary assumptions are satisfied by Proposition 2. Applying Theorem 3.2 in (Marfoq et al., 2021), we conclude the proof of our Theorem 2.
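As an informal complement to the chain-rule argument in Proposition 2, the following runnable toy check confirms that gradients reach both the gating parameters and the global parameters through the element-wise adaptation; the shapes, the block broadcast, and the toy loss are illustrative only.

```python
import torch

torch.manual_seed(0)
theta_g = torch.randn(12, requires_grad=True)   # flattened global model parameters
phi = torch.randn(4, requires_grad=True)        # gating-layer parameters (one per sub-block)
x = torch.randn(4)                              # toy input feature for the gating layer

gates = torch.sigmoid(phi * x)                  # block-wise gated weights bounded in (0, 1)
M = gates.repeat_interleave(3)                  # broadcast the 4 block gates to 12 parameters
theta_phi = theta_g * M                         # personalized sparse/scaled parameters
loss = (theta_phi ** 2).mean()                  # stand-in for f(theta_phi; x, y)
loss.backward()
print(phi.grad.shape, theta_g.grad.shape)       # gradients reach both phi and theta_g via the chain rule
```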
E. Discussion on the Assumptions

We note that the adopted theoretical assumptions are reasonable, fairly mild in FL scenarios, and widely used by numerous related FL works. To recap, there are five assumptions: Assumption 1 (bounded data diversity), Assumption 2 (bounded parameters of the global model and the gating layer), Assumption 3 (smoothness of gradients), Assumption 4 (bounded output of f), and Assumption 5 (bounded gradient variance). There are three theoretical results: the motivating Proposition 1 is based on Assumption 1; Theorem 1 (generalization) is based on Assumption 2; and Theorem 2 (convergence) is based on Assumptions 1-5.

About Assumption 1 (used in Proposition 1 and Theorem 2): it is necessary for a feasible and effective FL process, in which there is knowledge that can be shared and is mutually beneficial across the FL participants. We would like to clarify that we adopt a weaker form than the similar assumption widely used in related literature such as (Ma et al., 2022; Dinh et al., 2020b; Marfoq et al., 2021; Liang et al., 2020). For example, consider Assumption 3 in (Dinh et al., 2020b) (bounded diversity): the variance of the local gradients w.r.t. the global gradient is bounded as $\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(w) - \nabla f(w)\|^2 \leq \sigma_f^2$. By contrast, our assumption is that there exists $\sigma_i$ such that the variance between local and global gradients is bounded as $\|\nabla_{\theta} f(\theta_i^*) - \nabla_{\theta} f(\theta_g^*)\|^2 \leq \sigma_i^2$, $\forall i \in C$. Note that our assumption is actually a special case of their form applied to the sparse model, with our $f(\theta_i^*)$ term corresponding to their $f_i(w)$ term, and our $f(\theta_g^*)$ term corresponding to their $f(w)$ term.

As for Assumption 2 (used in Theorem 1), it depends on the hypothesis spaces of the adopted global model and gating layer, which are controllable by users. For example, we can hard-clip their parameters and introduce bounded activation functions such as the sigmoid into the gating layer structure, as the proposed method does. A more specific example is our implementation, where we use zero initialization, Kaiming-uniform initialization, and (0, 1)-uniform initialization for the parameters of the adopted global model and gating layer, so that all initial parameters are bounded by 0, 1, or constants determined by the bounded fan-in (usually given by the data shape). Then, during the FL course, we clip the gradients at each back-propagation step (via torch.nn.utils.clip_grad_norm_()), which keeps the parameters bounded so that the assumption holds. Besides, assumptions similar to our Assumption 2 can be found in related PFL works such as pFedHN (Shamsian et al., 2021), in which the authors assume that the introduced weights of the hyper-network and the embeddings are bounded and follow several Lipschitz conditions with constants different from ours.

Assumptions 3 and 4 are easily satisfied by most discriminative models such as neural networks (Tan et al., 2021; Kairouz et al., 2021; Li et al., 2019). Assumption 5 is necessary for a feasible and effective FL course, in which there is knowledge that can be shared and is mutually beneficial across the FL participants (Ma et al., 2022; Dinh et al., 2020b; Liang et al., 2020; Marfoq et al., 2021). As for the strong convexity, it is used only in Proposition 1, which motivates the proposed method, and the detailed upper bound of $M_i^*$ is not involved in our remaining theoretical analysis (Theorem 1 and Theorem 2). Although many SOTA FL and PFL works with provable analysis are also based on the strong convexity of the loss function, we leave relaxing the convexity assumption as future work.
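Returning to the boundedness discussion for Assumption 2 above, here is a minimal PyTorch sketch of bounded initialization plus per-step gradient clipping; the layer sizes, clipping norm, and toy loss are illustrative choices rather than our exact training configuration.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 62)                 # stand-in for the global model
gate = nn.Linear(128, 40)                  # stand-in for a gating layer
nn.init.kaiming_uniform_(model.weight)     # bounded via the (data-dependent) fan-in
nn.init.zeros_(model.bias)
nn.init.uniform_(gate.weight, 0.0, 1.0)    # (0, 1)-uniform initialization
params = list(model.parameters()) + list(gate.parameters())
opt = torch.optim.SGD(params, lr=0.1)

x, y = torch.randn(32, 128), torch.randint(0, 62, (32,))
loss = nn.functional.cross_entropy(model(x), y) + torch.sigmoid(gate(x)).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # clip gradients at every back-propagation step
opt.step()
```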
F. Space-Time Complexity

We now examine the efficiency of pFedGate, as described in Algorithm 1, in terms of computation, communication, and storage costs. Table 5 summarizes the costs for pFedGate and competitors including FedAvg and two SOTA personalized FL methods, where the training and inference costs on clients are considered for each training batch (or each sample when the batch size is 1), and the communication cost is considered for each FL round. We also mark some specific example values obtained by taking the default hyper-parameters of the different methods. For simplicity of comparison, we assume that the inference and back-propagation times are proportional to the model size. We hide constants and polylogarithmic factors that depend on the specific model architectures in $O(\cdot)$, since we compare these methods with the same model architectures and the same training data. For communication and storage, $O(\cdot)$ hides constants and lower-order factors that depend on specific implementations such as platforms, storage, and communication protocols.

Table 5: The proposed pFedGate achieves better computation, communication and storage complexity than state-of-the-art personalized FL methods. The values in parentheses are examples obtained with default hyper-parameters: embedding and hyper-network sizes d_v = 0.25|C| and d_h = 100 for pFedHN (Shamsian et al., 2021), component model number k = 3 for FedEM (Marfoq et al., 2021), and client sparsity s_i = 0.5 with relative gating-layer sparsity s_phi_i = 0.05 for pFedGate. For pFedGate we also report the average non-zero parameter ratio q_i = 0.67, which depends on the datasets and local training steps; it is averaged over 3 FL runs on FEMNIST and CIFAR10 with s_i = 0.5, a local update step of 1 epoch, and batch size 128.

Cost | FedAvg | pFedHN | FedEM | pFedGate
Computation, Train (Client) | O(2d) | O(2d) | O(2kd) (O(6d)) | O(2d(s_i + s_phi_i)) (O(1.1d))
Computation, Infer (Client) | O(d) | O(d) | O(d) | O(d(s_i + s_phi_i))
Computation, Train (Server) | O(1) | O((d_v + d_h)|C_s|) (O((0.25|C| + 100)|C_s|)) | O(1) | O(1)
Communication, Client | O(d) | O(d) | O(kd) (O(3d)) | O(q_i d) (O(0.67d))
Communication, Server | O(d|C_s|) | O(d|C_s|) | O(kd|C_s|) (O(3d|C_s|)) | O(d|C_s|)
Storage, Client (Peak Memory) | O(rd) | O(r'd) | O(krd) | O(d + r'' s_phi_i d)
Storage, Client (Disk) | O(d) | O(d) | O(kd) (O(3d)) | O((1 + s_phi_i)d) (O(1.05d))
Storage, Server | O(d) | O(d_v d_h + d_v|C|) (O(25|C| + 0.25|C|^2)) | O(kd) (O(3d)) | O(d)

Computation. Recall that our method is based on structured block sparsity (Sec.5.2.2), i.e., pFedGate removes some blocks and generates a sparse sub-network. This reduces the computation cost in both inference and training, even without sparse matrix computing libraries. The training and inference cost of pFedGate is $O(d(s_i + s_{\phi_i}))$ for client $i$ at each round.

Communication. In the communication stage, we do not need to transmit the zero sub-blocks (a detailed example is given in Sec.A), and only very light additional meta-data is needed for indexing. As a result, pFedGate uploads parameters in $O(q_i d)$, where $q_i = \mathrm{count}(\Delta\theta_i \neq 0)/d$ is the ratio of non-zero model updates. Specifically, $q_i$ can be regarded as the probability that a parameter is finally not masked in the local training process. We can consider $q_i = 1 - \prod_{j=1}^{|S|}(1 - p_j)$, where $p_j$ indicates the probability that a parameter is not within the sub-blocks the gating layer predicts to mask when taking the $j$-th data batch as input, and $|S|$ indicates the number of data batches trained in each local FL round. Note that $p_1$ is equal to the sparsity factor $s_i$ at client $i$, and $p_j$ becomes smaller as $j$ increases, because the local data batches usually share some similarities and the newly predicted sub-blocks are likely to coincide with sub-blocks that have already been selected.
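A toy calculation of this upload ratio is shown below; the per-batch keep probabilities after the first batch are made-up values for illustration.

```python
def nonzero_upload_ratio(p_per_batch):
    """q_i = 1 - prod_j (1 - p_j): probability that a parameter is kept at least once."""
    prod = 1.0
    for p in p_per_batch:
        prod *= (1.0 - p)
    return 1.0 - prod

# p_1 equals the sparsity factor s_i = 0.5; later values shrink (illustrative numbers).
print(nonzero_upload_ratio([0.5, 0.2, 0.1, 0.05]))  # ~0.66, close to the reported average of 0.67
```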
Storage. (1) For the peak memory cost on clients: since FL requires training on the client side, the memory cost during FL training contains, in addition to the model parameters, the intermediate tensors of both the forward and backward propagation phases, whose size depends on the model architecture, the shape of the inputs, and the specific implementation (we abstract this factor as r, r', and r'' for FedAvg, pFedHN, and pFedGate respectively in Table 5). Once the model-adaptation of pFedGate is completed, the intermediate tensors involved in the forward and backward passes through the sparse model are correspondingly smaller thanks to block sparsity, allowing a lower peak training memory than the compared methods (empirical results are in Section 7.4). (2) For the disk storage, compared with FedAvg, the additional storage overhead introduced by pFedGate is very small, as the gating layer has a simple and small structure whose size is roughly the product of the input feature dimension and the total number of sub-blocks. For example, the gating layer size $d_{\phi_i}$ is only 6.3% of the original model for the CNN used on CIFAR-10. We would also like to point out that pFedGate has great potential for further storage savings in multi-task applications: if we use one dense original model but learn E distinct gating layers for E different downstream tasks on a client, the storage cost is $d + E \cdot d_{\phi_i}$, which is much smaller than storing multiple distinct local models. It is promising to extend the current batch-level adaptation to task-level adaptation to further reduce the storage cost, and we leave this as future work.

Cross-method Comparison. From Table 5 we can see that pFedGate achieves better efficiency than FedAvg in training, inference, and uploading, with negligible additional storage cost on clients. By contrast, to achieve good personalization, FedEM introduces more than one global model and pays larger costs in training, communication, and storage. pFedHN learns client-wise embeddings on the server, which may lead to a single-point bottleneck in cross-device settings and potentially disclose personal information. To further verify this analysis, we present empirical support in Section 7.4.

G. More Related Work Comparison

Here we present more details about the works related to our method. QuPeD (Ozkara et al., 2021) combines quantization and knowledge distillation for personalized federated learning. FedMask (Li et al., 2021a) freezes the local models during the whole FL process, learns distinct binary masks for the last several layers of the models, and only transmits and aggregates the learned masks. Different from these quantization-based methods, we adopt continuous gated weights on all sub-blocks of the local model and transmit the sparse model updates, enabling flexible model size reduction and a high capability to capture client relationships via the shared global model. FedRep (Collins et al., 2021) proposes to upload a subset of the parameters as a shared representation and leave the others locally trained. HeteroFL (Diao et al., 2021) selects different subsets of global model parameters for clients based on their computational capabilities.
Different from these model-decoupling-based works, which use the same parameter subset for all clients with the same computational capability, our method generates different sub-models from the whole global model, with a larger learnable parameter space to handle heterogeneous data distributions. More importantly, our method adaptively generates the sparse model weights at a fine-grained batch level, which achieves a good estimation of the conditional probability of the heterogeneous local data and high accuracy, as shown in our experiments. FedPNAS (Minh & Carl, 2021) leverages neural architecture search (NAS) to find suitable sub-networks for clients, and FedMN (Tianchun et al., 2022) proposes to use a routing hyper-network to select personalized sub-blocks. We differ from FedPNAS in the target and the studied problem: FedPNAS mainly focuses on improving personalization performance with sufficient hardware resources (for example, it adopts 5 clients in its experiments, which usually corresponds to the cross-silo case), while pFedGate focuses on the cross-device setting, where the clients' resources are usually very limited. Compared with FedMN, the performance of pFedGate is theoretically guaranteed in terms of both generalization and convergence, while FedMN does not provide any theoretical analysis of generalization or convergence. Besides, the proposed pFedGate method adopts a much larger hypothesis space than FedMN: in FedMN, each module is either connected or dropped, whereas pFedGate not only decides the connections of the network (connected or dropped) but also simultaneously scales the weights of the remaining blocks to improve personalized performance.

There are also FL works that consider sparsity, namely FedDST (Bibikar et al., 2022) and ZeroFL (Qiu et al., 2022). However, we differ from them in several aspects. First, from the view of the studied problem, both works focus on non-personalized FL, seeking a single global model that performs well on all clients, while we consider the personalized case and learn client-wise models. Second, neither of them provides theoretical analysis of generalization or convergence, while pFedGate performs well with theoretical guarantees for both. Third, from the view of methodology, they are based on element-wise unstructured sparsity while ours is based on block-wise structured sparsity, which makes it easier to gain efficiency benefits in practical implementations. Finally, from the experimental results, FedDST works at sparsity 0.5 or 0.8 (Figure 3 in their paper) while ours works even at s=0.1; ZeroFL reported results at sparsity 0.1 on CIFAR10 and gained a 0.42% accuracy improvement over vanilla FedAvg (their Table 1, NIID column), while we gain a 3.02% improvement over FedAvg at sparsity 0.1 (our Table 1).

We also compare with meta-learning-based PFL methods, including Per-FedAvg (Fallah et al., 2020) and pFedMe (Dinh et al., 2020b). Per-FedAvg leverages model-agnostic meta-learning to enable fast local personalized training. However, it requires computationally expensive second-order gradients, which challenges real-world applications, especially with limited system resources. pFedMe improves upon Per-FedAvg with faster convergence and lower computational complexity. We empirically show that our method achieves better performance than both in Table 6.
Besides, knowledge distillation-based methods can also deal with hardware heterogeneity by allowing different local models to use different architectures. However, many existing KD-based FL/PFL works rely on a common reference object shared among clients to enable knowledge alignment between the teacher and student models. For example, some works (Li & Wang, 2019; Zhu et al., 2021) introduce additional public data for all clients and perform FL with the help of the model logits on the public data. It is worth noticing that such an additional common reference may not always be available. Further, we conduct an empirical comparison with a representative KD-based FL method, FedMD (Li & Wang, 2019). We list the results on FEMNIST and CIFAR-10 in Table 6, where "FedMD, Far-Trans" indicates a far-transfer case that adopts CIFAR-10 as the public dataset for the FEMNIST experiment and FEMNIST as the public dataset for the CIFAR-10 experiment, and "FedMD, Near-Trans" indicates a near-transfer case that adopts EMNIST as the public dataset for the FEMNIST experiment and CIFAR-100 as the public dataset for the CIFAR-10 experiment. We find that pFedGate still shows superiority in terms of the average/bottom accuracy, as well as robustness when we vary the sparsity for pFedGate and the public dataset for FedMD.

Table 6: Performance comparison to some baselines on the FEMNIST and CIFAR-10 datasets.

Method | FEMNIST Acc | FEMNIST Bottom Acc | CIFAR-10 Acc | CIFAR-10 Bottom Acc
FedAvg | 76.51 | 60.82 | 68.33 | 60.22
Per-FedAvg | 77.18 | 61.42 | 70.35 | 60.96
pFedMe | 75.29 | 57.63 | 70.29 | 62.13
FedMD, Far-Trans | 61.25 | 51.37 | 55.28 | 52.04
FedMD, Near-Trans | 75.87 | 60.29 | 67.52 | 59.43
FedEM | 80.12 | 64.81 | 72.43 | 62.88
pFedGate, s=1 | 87.32 | 77.14 | 75.18 | 66.67
pFedGate, s=0.5 | 86.31 | 75.68 | 74.07 | 64.21
pFedGate, s=0.3 | 86.76 | 76.47 | 73.65 | 64.39

H. Implementation Details

Datasets. We conduct experiments on several widely used FL datasets, including EMNIST (Cohen et al., 2017) and FEMNIST (Caldas et al., 2018) for 62-class handwritten character recognition, and CIFAR10/CIFAR100 (Krizhevsky, 2009) for 10-class/100-class image classification. We follow the heterogeneous partition manners used in (Marfoq et al., 2021; Caldas et al., 2018; Diao et al., 2021; Fallah et al., 2020): FEMNIST is partitioned by writers, and we generate the other three FL datasets using Dirichlet allocation with parameter α=0.4. We adopt 10% sub-samples of EMNIST (81,425 samples) and 15% sub-samples of FEMNIST (98,671 samples), and allocate the sub-datasets to 100 and 539 clients for EMNIST and FEMNIST respectively. For CIFAR10/CIFAR100 (60,000 samples), we allocate them to 100/50 clients. All datasets are randomly split into train/valid/test sets with a ratio of 6:2:2.
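A sketch of the Dirichlet-based label partition used to simulate heterogeneous clients is shown below; dataset loading, sub-sampling ratios, and the train/valid/test split are omitted, and the label array is a stand-in.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.4, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = np.where(labels == c)[0]
        rng.shuffle(idx_c)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx_c)).astype(int)
        for client_id, part in enumerate(np.split(idx_c, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices

labels = np.random.randint(0, 10, size=60000)     # stand-in for CIFAR-10 labels
parts = dirichlet_partition(labels, num_clients=100)
print(sum(len(p) for p in parts))                  # 60000: every sample is assigned to one client
```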
Baselines and models. We consider the following competitive baselines, including SOTA personalized and efficient FL methods:
FedAvg (McMahan et al., 2017): the classical FL method that simply aggregates model updates in a weighted-averaging manner;
FedAvg-FT: a basic personalized baseline that fine-tunes the global model with local data before local model evaluation;
Local: a naive personalized method that trains models on local datasets without FL communication;
pFedMe (Dinh et al., 2020b): decouples the personalized model and the global model with Moreau-envelope-based regularization;
LG-FedAvg (Liang et al., 2020): achieves personalization with improved communication efficiency;
Ditto (Li et al., 2021b): a SOTA PFL method that introduces a local personalized model for each client, trained with model parameter regularization towards the global model;
FedEM (Marfoq et al., 2021): deals with data heterogeneity via a mixture of multiple global models; we use 3 models following the authors' default choice;
FedRep (Collins et al., 2021): improves FL performance and communication efficiency by sharing only part of the model parameters among FL participants. Only the lower layers of the models, used for feature extraction, are uploaded to the server and aggregated, while the remaining head parameters are locally trained for personalization. Following the authors' sharing manner, we adopt the last classification layers as the local personalized model parameters;
FedMask (Li et al., 2021a): learns distinct binary masks for the last several layers of the local models, and aggregates the masks with an intersection operation. Following a mask-layer choice similar to the authors', in our experiments the one-shot mask is applied to the last layer of the adopted 2-layer CNN and the last 2 layers of the adopted 3-layer CNN respectively;
HeteroFL (Diao et al., 2021): uses heterogeneous model architectures for different clients to achieve personalization and improvements in both computational and communication efficiency.

Recall that we can vary each client's sparsity factor s to reflect the clients' resource heterogeneity, since the affordable sparsity factor fairly reflects a client's computation and communication capabilities: clients with very limited computation and communication resources have to set s close to 0, whereas clients with sufficient resources can set s close to 1. For HeteroFL, we use the full model parameters for 10% of the clients and 50% of the parameters for the other 90% of the clients, following the authors' computation complexity setting, which leads to an average sparsity s = 0.55.

To align with previous works, we use a 2-layer CNN (Reddi et al., 2021; Marfoq et al., 2021) for EMNIST/FEMNIST, and two LeNet-based CNNs (Dinh et al., 2020b; Liang et al., 2020; Shamsian et al., 2021) with different capacities for CIFAR10 and CIFAR100. Specifically, the model used for EMNIST/FEMNIST has two convolutional layers with 5x5 kernels, max pooling, and two dense layers, with a total of 2,171,786 parameters. The models used for CIFAR10 and CIFAR100 have a LeNet-based structure with an additional linear classification layer, and a total of 256,830 and 3,537,444 parameters respectively.
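For reference, a PyTorch sketch of such a 2-layer CNN for 28x28 inputs is given below; the channel and hidden sizes are illustrative guesses, so the parameter count will not exactly reproduce the 2,171,786 reported above.

```python
import torch.nn as nn

class TwoLayerCNN(nn.Module):
    """Two 5x5 conv layers with max pooling, followed by two dense layers."""
    def __init__(self, num_classes: int = 62):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 512), nn.ReLU(),   # 28x28 input -> 4x4 feature maps
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(sum(p.numel() for p in TwoLayerCNN().parameters()))  # parameter count of this sketch
```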
Table 7: The adopted learning rates for pFedGate on all datasets.

Sparsity | EMNIST ηg | EMNIST η | FEMNIST ηg | FEMNIST η | CIFAR10 ηg | CIFAR10 η | CIFAR100 ηg | CIFAR100 η
s=1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.03 | 0.05 | 0.1 | 1.5
s=0.5 | 0.1 | 0.1 | 0.1 | 0.05 | 0.03 | 1.5 | 0.03 | 1.5
s=0.3 | 0.03 | 0.5 | 0.3 | 0.05 | 0.03 | 1.5 | 0.05 | 0.5

Platform and Hyper-parameters. We implement all models with PyTorch and run experiments on Tesla V100 and NVIDIA GeForce GTX 1080 Ti GPUs. For fair comparisons, we adopt the same number of communication rounds for all methods and search for the optimal hyper-parameter configuration using the validation sets. We run each experiment 3 times with the optimal configurations and different random seeds, and report the average results. For each method on each dataset, we use the SGD optimizer and grid-search the learning rate ηg from [0.005, 0.01, 0.03, 0.05, 0.1, 0.3, 0.5], and we set the number of communication rounds T = 400, the batch size to 128, and the local update step to 1 epoch. For pFedGate, the learning rate of the gating layer η is searched from [0.01, 0.05, 0.1, 0.3, 0.5, 1, 1.5], and we set the block-size splitting factor B = 5 for all evaluated models. For FedAvg-FT, we use a local fine-tuning step of 1 epoch. For pFedMe, we search its penalization parameter µ from [0.0001, 0.001, 0.01, 0.1, 1, 10]. For Ditto, we search its regularization factor from [0.05, 0.1, 0.5, 0.8]. For FedRep, we search its learning rate for the personalized model from [0.05, 0.005, 0.5, 0.01, 0.1]. For FedEM, we set its number of global models to 3. We summarize the adopted learning rates in Table 7.

I. Additional Experiments

I.1. Novel Clients Generalization

To evaluate the generalization of pFedGate to novel clients that have not participated in the previous FL training stage, we conduct experiments with simulation settings similar to Yuan et al. (2022) and Chen et al. (2022), where the novel clients only fine-tune the global model on their local datasets. Specifically, we randomly select 20% of the clients as novel clients, and evaluate the selected and remaining clients separately. The results are summarized in Table 8, where Acc and gAcc indicate the results of FL-participating clients and novel clients respectively, and Δ = Acc - gAcc indicates the accuracy generalization gap.

Table 8: Averaged accuracy when testing on novel clients that have not participated in the training stage of FL. gAcc indicates the results of novel clients, and Δ indicates the generalization gap Acc - gAcc.

Method | FEMNIST Acc | FEMNIST gAcc | FEMNIST Δ | CIFAR10 Acc | CIFAR10 gAcc | CIFAR10 Δ
FedAvg | 76.39 | 74.93 | 1.46 | 67.78 | 66.15 | 1.63
FedEM | 79.21 | 78.26 | 0.95 | 71.92 | 70.75 | 1.17
ours, s=1 | 87.79 | 87.18 | 0.61 | 73.97 | 73.18 | 0.79
ours, s=0.5 | 86.64 | 86.19 | 0.45 | 74.12 | 73.19 | 0.93
ours, s=0.3 | 85.69 | 85.12 | 0.57 | 73.64 | 72.78 | 0.86

The results show that the proposed pFedGate generalizes well to novel clients and attains smaller accuracy gaps than the compared methods. For the novel clients, instead of re-training the over-parameterized global model on (possibly small amounts of) local data, pFedGate only needs to train the personalized lightweight gating layers with a small number of parameters, and thus achieves an efficient and effective local adaptation.
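A loose sketch of this gate-only local adaptation is shown below: the received global model is frozen and only the lightweight gating layer is trained on the novel client's data. The model and gate shapes, the per-sample scaling, and the loss are illustrative stand-ins rather than the actual block-wise adaptation.

```python
import torch
import torch.nn as nn

global_model = nn.Linear(128, 62)               # stand-in for the received global model
gating_layer = nn.Linear(128, 10)               # lightweight personalized gating layer
for p in global_model.parameters():
    p.requires_grad_(False)                     # freeze the shared global model
opt = torch.optim.SGD(gating_layer.parameters(), lr=0.1)

x, y = torch.randn(64, 128), torch.randint(0, 62, (64,))
gates = torch.sigmoid(gating_layer(x)).mean(dim=1, keepdim=True)  # toy per-sample scaling
loss = nn.functional.cross_entropy(global_model(x) * gates, y)
opt.zero_grad()
loss.backward()
opt.step()                                      # only the gating-layer parameters are updated
```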
I.2. Convergence Study

We show the convergence curves of FedAvg, HeteroFL (with an average sparsity of 0.55), the strongest baseline FedEM, and the proposed pFedGate with different sparsity factors on the EMNIST dataset in Figure 6. We can see that the proposed method achieves convergence speeds comparable to the baselines, which provides empirical evidence for Theorem 2.

Figure 6: The convergence curves of FedAvg, HeteroFL, pFedGate and the strongest baseline, FedEM, on EMNIST. pFedGate achieves comparable or even better (s=1) convergence speeds than the baselines.

In addition, to gain further insight into the proposed method, we plot the histogram of the averaged gating weights of all clients on the CIFAR10 dataset in Figure 7. We can see that the sparsity pattern becomes stable as we train the gating layer on more local data, showing that the proposed method achieves effective optimization and converges to certain global points under distinct personalized local models.

Figure 7: The convergence curves of pFedGate in terms of the histogram of averaged gating weights over all clients on the CIFAR10 dataset.

Besides, we summarize in Table 9 the FL round at which each method achieves 90% of its highest average accuracy, where the values are averaged over three runs with different random seeds. We observe that pFedGate attains convergence speed comparable to FedAvg and FedEM, providing further empirical evidence for Theorem 2. Combined with the single-FL-round and single-client efficiency results discussed in Section 7.4, we can see that pFedGate is indeed able to achieve significant accuracy improvements while at the same time improving system efficiency.

Table 9: Convergence FL round at which each method achieves 90% of its highest accuracy.

Method | EMNIST | FEMNIST | CIFAR10 | CIFAR100
FedAvg | 72.3 | 195.3 | 253.1 | 226.9
FedEM | 77.4 | 212.8 | 265.7 | 245.1
ours, s=1 | 54.6 | 152.9 | 190.9 | 183.8
ours, s=0.5 | 57.9 | 174.1 | 219.3 | 196.5
ours, s=0.3 | 75.2 | 214.4 | 271.8 | 260.7

I.3. More Studies for the Effect of s

Global vs. Local Models. In Section 7.6, we vary the sparsity factor s from 1 to 0.1 to examine the accuracy of pFedGate. Here we further compare the accuracy of the local (personalized) models and the global (shared) model. We first directly examined the accuracy of the global model at the same FL round at which the local model achieved its best performance, and found that the accuracy of the global model can be very poor: in the proposed pFedGate method, the global model does not receive a direct correction signal from the ground truth, but only an indirect one via the sparse adaptation. We therefore train the global model with one more local epoch, then aggregate the trained models and evaluate the result. The results on the EMNIST dataset are listed in Table 10.

Table 10: Effect of s when comparing the performance of the global and local (personalized) models on the EMNIST dataset.

Sparsity | Local Acc | Local Bottom Acc | Global Acc | Global Bottom Acc | Gap Acc | Gap Bottom Acc
s=1 | 87.11 | 81.43 | 79.47 | 71.28 | 7.64 | 10.15
s=0.5 | 87.28 | 81.15 | 76.46 | 66.98 | 10.82 | 14.17
s=0.3 | 87.09 | 82.52 | 77.14 | 66.29 | 9.95 | 16.23
s=0.1 | 83.58 | 75.14 | 76.45 | 66.98 | 7.13 | 8.16

Interestingly, we find that pFedGate's local model consistently achieves better accuracy than the global model, and the advantage is more significant for the bottom accuracy than for the average accuracy. Furthermore, as the sparsity factor decreases, the advantage first increases (from s=1 to s=0.3) and then decreases (from s=0.3 to s=0.1). This suggests that introducing moderate sparsity helps generalization, but that at very high sparsity our method risks a degradation that outweighs the benefit of increased sparsity, requiring further enhancement when s is small. A discussion of the relevant theoretical intuition is given in Theorem 1.
Communication Frequency and Data Distribution. In addition, to investigate the effect of s w.r.t. FL characteristics, we (1) change the number of local update steps (the smaller the number of steps, the more frequent the federated updates) and (2) change the Dirichlet factor to simulate different non-IID degrees (the smaller the α, the higher the data heterogeneity). The results on EMNIST with different local update steps and the results on CIFAR-10 with different values of α are listed in Table 11 and Table 12 respectively. Generally speaking, we find that pFedGate attains comparable performance when the local update step increases from 1 to 4, but drops significantly when the step becomes 8. Besides, as the non-IID degree increases, the accuracy of pFedGate mostly decreases, especially when α=0.01. These observations suggest that there is still room to improve the robustness of pFedGate, particularly in how to effectively aggregate the local models to attain high accuracy when the local and global models can easily differ greatly. For example, in the future we could consider combining asynchronous federated learning (Nguyen et al., 2022) and more advanced aggregation mechanisms in the model parameter space (Ainsworth et al., 2022) to allow for larger variations among the participants' models.

I.4. Model Compression Manner Study

We obtain sparse local models in the proposed framework by learning personalized gating layers and predicting sparse gated weights for model sub-blocks. As introduced in Section 5.2, we propose a simple and general split manner that chunks the model parameters into B equal-sized sub-blocks, where the split factor B provides flexible splitting granularity. Here we study the sparsification technique proposed in our method by varying the split factor B and comparing it with the SOTA PFL method FedMask (Li et al., 2021a), which adopts binary masks to compress the parameters of the models' last layers. The results are shown in Table 13, where we examine the FL accuracy with B = [2, 5, 10] and s = [0.3, 0.5]. From Table 13 we have the following observations. Our method achieves better performance than FedMask in various sparsity settings. Besides, when increasing the sparsity (from s = 0.5 to s = 0.3), our method with B = 5 still achieves good or even better performance (the accuracy change is +0.248 on average), while FedMask suffers large performance drops (the accuracy change is -1.543 on average). These results again verify the effectiveness of the proposed adaptive sparse weight prediction. Note that FedMask freezes the local models during the whole FL process and only learns binary masks for the last several layers of the models. Instead, we train both the local models and the personalized gating layers, which generate continuous gated weights that mask some sub-blocks and scale the weights of the other sub-blocks, leading to high capacity for the learnable personalized models. With a smaller sparsity hyper-parameter value, the size of the model hypothesis space searched by FedMask (determined only by the masks for the last several layers) is much smaller than that of pFedGate (determined by both the whole model and the gating weights for all the sub-blocks).
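To make the block-wise splitting concrete, here is a simplified sketch that chunks a flattened parameter vector into B sub-blocks and keeps the highest-gated blocks within a sparsity budget s. The greedy top-k selection is only a stand-in for the knapsack step of the actual method, and all sizes are illustrative.

```python
import torch

def block_sparse_mask(num_params: int, gates: torch.Tensor, s: float) -> torch.Tensor:
    """Keep the highest-gated sub-blocks under budget s and scale them by their gates."""
    B = gates.numel()                              # number of equal-sized sub-blocks
    block_size = (num_params + B - 1) // B
    keep = max(1, int(s * B))                      # block budget implied by sparsity s
    chosen = torch.topk(gates, keep).indices       # greedy selection by gate magnitude
    mask = torch.zeros(B)
    mask[chosen] = 1.0
    block_weights = mask * gates                   # kept blocks are scaled by their gate values
    return block_weights.repeat_interleave(block_size)[:num_params]

gates = torch.sigmoid(torch.randn(5))              # B = 5 gated weights in (0, 1)
weights_mask = block_sparse_mask(num_params=23, gates=gates, s=0.5)
print(weights_mask)                                # zero for dropped blocks, scaled values for kept ones
```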
As for the impact of B: on the one hand, a too-small B = 2 yields somewhat worse results than B = 5, as there may be too few sub-blocks to model the clients' diversity; on the other hand, the larger B = 10 achieves results similar to B = 5 but a larger performance drop when increasing the sparsity. We empirically found that B = 5 is robust and adopt it as the default value in our experiments.

Table 11: Effect of s when changing the FL communication frequency on the FEMNIST dataset. The smaller the number of local update steps, the more frequent the federated updates.

Method | Step=1 Acc | Step=1 Bottom Acc | Step=2 Acc | Step=2 Bottom Acc | Step=4 Acc | Step=4 Bottom Acc | Step=8 Acc | Step=8 Bottom Acc
pFedGate, s=1 | 87.11 | 81.43 | 86.87 | 81.21 | 87.01 | 81.07 | 84.28 | 79.53
pFedGate, s=0.5 | 87.28 | 81.15 | 87.25 | 80.99 | 87.09 | 80.65 | 84.62 | 78.45
pFedGate, s=0.3 | 87.09 | 82.52 | 87.48 | 82.63 | 86.39 | 82.08 | 85.31 | 80.86
pFedGate, s=0.1 | 83.58 | 75.14 | 83.27 | 74.81 | 82.99 | 74.58 | 80.75 | 73.05

Table 12: Effect of s when changing the Dirichlet factor α of the non-IID CIFAR-10 dataset. The smaller the α, the higher the data heterogeneity.

Method | α=0.4 Acc | α=0.4 Bottom Acc | α=0.2 Acc | α=0.2 Bottom Acc | α=0.1 Acc | α=0.1 Bottom Acc | α=0.01 Acc | α=0.01 Bottom Acc
pFedGate, s=1 | 75.18 | 66.67 | 74.29 | 65.71 | 73.55 | 65.23 | 72.63 | 64.64
pFedGate, s=0.5 | 74.07 | 64.21 | 73.05 | 63.44 | 72.55 | 63.03 | 71.76 | 62.02
pFedGate, s=0.3 | 73.65 | 64.39 | 72.58 | 64.05 | 71.95 | 62.83 | 71.58 | 62.68
pFedGate, s=0.1 | 70.26 | 60.45 | 69.74 | 59.56 | 68.96 | 59.38 | 67.45 | 58.04

Table 13: Averaged accuracy comparison for PFL methods with and without model compression. The numbers in parentheses indicate the accuracy difference between sparsity s = 0.5 and s = 0.3 for FedMask and pFedGate (with split factor B).

Method | FEMNIST Acc | FEMNIST Bottom Acc | CIFAR10 Acc | CIFAR10 Bottom Acc
FedAvg | 76.51 | 60.82 | 68.33 | 60.22
FedEM | 80.12 | 64.81 | 72.43 | 62.88
FedMask, s=0.5 | 78.69 | 62.18 | 70.54 | 61.37
FedMask, s=0.3 | 77.41 (-1.28) | 60.69 (-1.49) | 68.46 (-2.08) | 59.05 (-1.32)
pFedGate B=5, s=0.5 | 86.31 | 75.68 | 74.07 | 64.21
pFedGate B=5, s=0.3 | 86.75 (+0.44) | 76.47 (+0.79) | 73.65 (-0.42) | 64.39 (+0.18)
pFedGate B=2, s=0.5 | 81.10 | 71.28 | 70.81 | 61.03
pFedGate B=2, s=0.3 | 80.03 (-1.07) | 70.72 (-0.56) | 70.65 (-0.16) | 60.66 (-0.37)
pFedGate B=10, s=0.5 | 85.38 | 74.91 | 72.94 | 63.59
pFedGate B=10, s=0.3 | 84.67 (-0.71) | 75.10 (+0.19) | 72.48 (-0.46) | 63.58 (-0.11)