# fedbat_communicationefficient_federated_learning_via_learnable_binarization__a81e598c.pdf Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization Shiwei Li 1 * Wenchao Xu 2 Haozhao Wang 1 Xing Tang 3 Yining Qi 1 Shijie Xu 3 Weihong Luo 3 Yuhua Li 1 Xiuqiang He 3 Ruixuan Li 1 Federated learning is a promising distributed machine learning paradigm that can effectively exploit large-scale data without exposing users privacy. However, it may incur significant communication overhead, thereby potentially impairing the training efficiency. To address this challenge, numerous studies suggest binarizing the model updates. Nonetheless, traditional methods usually binarize model updates in a post-training manner, resulting in significant approximation errors and consequent degradation in model accuracy. To this end, we propose Federated Binarization Aware Training (Fed BAT), a novel framework that directly learns binary model updates during the local training process, thus inherently reducing the approximation errors. Fed BAT incorporates an innovative binarization operator, along with meticulously designed derivatives to facilitate efficient learning. In addition, we establish theoretical guarantees regarding the convergence of Fed BAT. Extensive experiments are conducted on four popular datasets. The results show that Fed BAT significantly accelerates the convergence and exceeds the accuracy of baselines by up to 9%, even surpassing that of Fed Avg in some cases. 1. Introduction Federated learning (FL) (Mc Mahan et al., 2017) stands out as a promising distributed machine learning paradigm designed to safeguard data privacy. It enables distributed *This work was done when Shiwei Li worked as an intern at Fi T, Tencent. 1Huazhong University of Science and Technology, Wuhan, China 2The Hong Kong Polytechnic University, Hong Kong, China 3Fi T, Tencent, Shenzhen, China. Correspondence to: Haozhao Wang , Xing Tang , Ruixuan Li . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). clients to collaboratively train a global model without sharing their local data, thus can protect user data privacy. In the classic FL system, a central server initially broadcasts the global model to several clients, who then train it using their respective datasets. After local training, the clients send their model parameters or model updates back to the server, who aggregates them to create a new global model for the next round of training. This process is repeated for several rounds until the global model converges. Despite FL s success in preserving local data privacy, the iterative transmission of model parameters introduces considerable communication overhead, adversely affecting training efficiency. Specifically, the communication occurs in two stages: the uplink, where clients send model updates to the server, and the downlink, where the server broadcasts the global model to clients. As suggested by H onig et al. (2022), the uplink typically imposes a tighter bottleneck than the downlink, particularly due to the global mobile upload bandwidth being less than one fourth of the download bandwidth. Therefore, in this paper, we seek to compress the uplink communication, as much research (Reisizadeh et al., 2020; Isik et al., 2023; Tang et al., 2023) has done. Sign SGD (Bernstein et al., 2018) is an effective binarization technique for reducing communication volume. It was originally proposed to communicate only the signs of gradients in distributed training systems. Recently, it has also been naturally applied to binarize model updates in FL (Ferreira et al., 2021). Specifically, it binarizes model updates m Rd as ˆ m = α m = α sign(m), where α denotes the magnitude of binary values, termed the step size and typically tuned as a hyperparameter. Sign SGD can efficiently compress the uplink communication by a factor of 32. However, the binarized model updates inevitably contains approximation errors in comparison to the original updates, which affects both the convergence speed and the model accuracy. To improve the performance of Sign SGD, subsequent research has mainly explored from two aspects, including error feedback methods (Karimireddy et al., 2019) and stochastic sign methods (Chen et al., 2020; Safaryan & Richt arik, 2021). EF-Sign SGD (Karimireddy et al., 2019) aims to compensate for the errors introduced by binariza- Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization Figure 1. An illustration of the t-th round within the Fed BAT framework. ①downlink: the server sends model parameters wt to clients; ②local training: clients train the model updates (mt+1 and αt+1) via learnable binarization; ③uplink: clients upload their binary model updates ( mt+1 and αt+1) to the server. ④model aggregation: the server aggregates binary model updates to generate wt+1. tion. Notably, EF-Sign SGD adds the binarized errors from the previous round onto the current model updates before conducting another binarization. On the other hand, stochastic sign methods aim to alleviate the bias estimation of Sign SGD, namely E[α sign(m)] = m. The key idea here is to perturb local model updates with random noise. Noisy-Sign SGD (Chen et al., 2020) adds Gaussian noise ζ N(0, σ2) to the model update and then performs binarization, where σ is an adjustable standard deviation. Similarly, for a model update m, Stoc-Sign SGD (Safaryan & Richt arik, 2021) adds uniform noise ζ U( m , m ) to m before binarization. In this way, each element mi will be binarized into +1 with a probability of (1/2 + mi/2 m ) and into -1 with a probability of (1/2 mi/2 m ). The above research may differ in their motivations and specific solutions, nevertheless, they share a fundamental similarity. Before binarization, a compensation term or disturbance term will be superimposed on the model updates to be binarized, which is the binarized errors from the previous round in EF-Sign SGD, Gaussian noise in Noisy-Sign SGD, and uniform noise in Stoc-Sign SGD. Although applying these methods in FL can indeed enhance the performance of Sign SGD (see Section 5.2), it is crucial to acknowledge that they still suffer from two main drawbacks. On one hand, the above methods still binarize model updates in a post-training manner as Sign SGD does. Binarized errors are introduced solely after the training phase, depriving them of the opportunity to be optimized during local training. On the other hand, the step size is usually kept as a hyperparameter, necessitating significant human effort to tune it for various applications. However, even with careful tuning, the resulting performance may still fall short of the optimum. In this paper, we aim to solve these two issues by exploiting the local training process of FL. Specifically, our primary goal is to equip clients with the capability of directly learning binary model updates and the corresponding step size. To this end, we propose a novel training paradigm, termed Federated Binarization-Aware Training (Fed BAT). The illustration of each round within Fed BAT is presented in Figure 1. The key idea of Fed BAT is to binarize the model update with the step size during the forward propagation. This entails calculating the output loss based on the binarized model updates. It provides an opportunity to compute gradients for both the binarized model updates and the step size, facilitating their subsequent optimization. However, the derivative of the vanilla binarization operator is always zero, making the gradient descent algorithm infeasible. To solve this issue, we further introduce a learnable binarization operator with well-designed derivatives. The main contributions can be summarized as follows: We analyze that the drawbacks of existing federated binarization methods lie in their post-training manner. This opens new avenues for binarization in FL. Based on our analysis, we propose Fed BAT to learn binary model updates during the local training process of FL. For this, a novel binarization operator is employed with well-designed derivatives. Theoretically, we establish convergence guarantees for Fed BAT, demonstrating a comparable convergence rate to its uncompressed counterpart, the Fed Avg. Experimentally, we validate Fed BAT on four popular datasets. The experimental results show that Fed BAT can significantly improve the convergence speed and test accuracy of existing binarization methods, even surpassing the accuracy of Fed Avg in some cases. Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization 2. Preliminaries FL involves N clients connecting to a server. The general goal of FL is to train a global model by multiple rounds of local training on each client s local dataset. Denoting the objective function of the k-th client as Fk, FL can be formulated as min w F(w) = k=1 pk Fk(w), (1) where pk is the proportion of the k-th client s data to all the data of the N clients. Fed Avg (Mc Mahan et al., 2017) is a widely used FL algorithm. In the t-th round, the server sends the model parameters wt to several randomly selected K clients, denoted as St. Each selected client performs certain steps of local training and sends the model update wk t+1 wt back to the server. The server aggregates these updates to generate a new global model as follows: wt+1 = wt + X k St p k(wk t+1 wt), (2) where p k = pk/P St pk denotes the proportion of the k-th client s data to all the data used in the t-th round. In this paper, we employ Sign SGD and its aforementioned variants to compress the uplink communication in FL. Each client only needs to send the signs of its model updates to the server. Note that the signs undergo a encoding { 1, 1} {0, 1} before transmission, and are decoded back after transmission. For simplicity, we omit this process in the following pages. Therefore, Eq.2 can be rewritten as wt+1 = wt + X k St p kα sign(wk t+1 wt). (3) 3. Methodology 3.1. Motivations and Objectives In the t-th round, the objective of local training for the k-th client can be formulated as min mk t+1 Fk(wt + mk t+1), (4) where Fk is the loss function of the k-th client. wt is the global model parameter in the t-th round and mk t+1 represents the model updates to be learned. Binarization is not considered as a constraint during local training but is only performed after obtaining the final updates mk t+1. The binarized updates ˆmk t+1 inevitably contain errors compared to the original updates mk t+1, thus reducing the model accuracy and slowing down the convergence speed. A natural motivation is to correct or compensate for the binarized errors during the local training process. In light of this, we try to introduce binarization of model updates into local training, thereby directly learning binarized model updates and their corresponding step size. Specifically, the objective of local training can be formulated as min mk t+1,αk t+1 Fk(wt + S(mk t+1, αk t+1)), (5) where S(mk t+1, αk t+1) are the binary model updates to be learned. S is a binarization operator. The model updates mk t+1 and the step size αk t+1 are learnable parameters. Next, we will introduce the learnable binarization operator S in Section 3.2, and then present the detailed pipeline of Fed BAT in Section 3.3. Theoretical guarantees on the convergence of Fed BAT will be provided in the next section. 3.2. Learnable Binarization We first define a binarization operator as follows: α x > α, S (x, α) α x α, α x < α, (6) where S (x, α) is a stochastic binarization with uniform noise ζ U(0, 1) added as follows: S (x, α) = α(2 α+x/2α + ζ 1) = α w.p. α+x/2α, α w.p. α x/2α. (7) However, the floor function x returns the largest integer not exceeding x, and its derivative is always zero, which makes gradient descent infeasible. To solve this issue, we adopt Straight-through Estimator (STE) (Hinton, 2012) to treat the floor function as an identity map during backpropagation, that is to say its derivative is equal to 1. Recent work has demonstrated that STE works as a first-order approximation of the gradient and affirmed its efficacy (Liu et al., 2023). Therefore, the gradient of x can be estimated by: 0 x > α, 1 α x α, 0 x < α, (8) and the gradient of α can be estimated by: 1 x > α, 2 α+x/2α + ζ x+α/α α x α, 1 x < α, (9) It is worth noting that α represents the step size and is supposed to be a positive number, otherwise the meaning of Eq.6 will be wrong. To restrict the value of α to be positive, we calculate α as follows: α = α eραe, (10) Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization where αe becomes the learnable step size, initialized to zero. α is the initial value of α and ρ is a hyperparameter that regulates the pace of the optimization process for αe. So far, we have developed a derivable binarization operator, which can be seamlessly integrated into a neural network. Let x denotes a layer of parameters within the network, and α denotes the corresponding step size calculated by Eq.10. During forward propagation, x will undergo element-wise binarization using Eq.6, where α is shared by all elements in x. During backpropagation, x and α will be optimized by the gradients calculated from Eq.8 and Eq.9. 3.3. Federated Binarization-Aware Training Here, we integrate the learnable binarization operator into local training of FL. In contrast to Sign SGD, this allows clients to consider binarized errors and make corrections when optimizing locally. We refer to the proposed training framework as Federated Binarization-Aware Training (Fed BAT), and the pipeline is shown in Algorithm 1. In Fed BAT, the operating procedures of the server is the same as that in Fed Avg. As described in lines 4-8, at the beginning of each round, the server sends global model parameters to several randomly selected clients. It subsequently collects the binarized model updates returned by the clients to calculate new global model parameters as follows: wt+1 = wt + X k St p k ˆmk t+1 = wt + X k St p kαk t+1 mk t+1, (11) where ˆmk t+1 and αk t+1 are the binarized model updates and the step size of the k-th client. Note that the model update of each layer within the model has a unique step size, thus, Eq.11 is actually performed layer-wise. For simplicity, we omit the representations of the different layers. Before introducing the local training procedures in Fed BAT, we highlight the variances in the local model architectures. We reiterate that Fed BAT is achieved by performing the binarization operator defined in Eq.6 on the model updates during local training. However, there is no explicit representation of model updates within the vanilla model. Therefore, we allow local models to maintain an extra copy of model parameters to represent model updates, denoted as m. In addition, the step size α of the update within each layer is also kept as a learnable parameter. The local training process comprises two distinct stages: full-precision training and binarization-aware training. In the t-th round, as depicted in lines 11-12, each client loads the global model parameters wt and initialize its model updates mt with zeros. Given that the model updates commence as zeros, learning to binarize them becomes more challenging. Therefore, to secure more precise initialization values for the model updates, we initially conduct training Algorithm 1 Federated Binarization-Aware Training 1: Input: the iteration rounds R; the local steps τ; the local warm-up ratio ϕ, the coefficient of the step size ρ. 2: function server(R) 3: Initialize global model parameters with w0. 4: for t = 0 to R 1 do 5: Send global model parameters wt to clients. 6: Get model updates ( mt+1, αt+1) from clients. 7: Aggregate binary model updates by Eq.11. 8: end for 9: end function 10: function client(wt, τ, ϕ, ρ) 11: Initialize local model parameters with wt. 12: Initialize local model updates mt with zeros. 13: for s = 0 to τ 1 do 14: if s < ϕτ then 15: Run full-precision training by Eq.12. 16: else if s = ϕτ then 17: Initialize the step size layer-wise by Eq.13. 18: Run binarization-aware training by Eq.14. 19: else 20: Run binarization-aware training by Eq.14. 21: end if 22: end for 23: mk t+1, αk t+1 = mk t,τ 1, αk t,τ 1. 24: Send binary model updates ( mk t+1, αk t+1) to server. 25: end function without binarization, that is the full-precision training where model updates will be optimized as follows: mk t,s+1 = mk t,s ηt Fk(wk t +mk t,s)/ mk t,s. (12) A warm-up ratio ϕ is defined to denote the proportion of full-precision training to the entire local training. After the full-precision training, α and αe defined in Eq.10 will be initialized for each layer of the k-th client as follows: (α )k l = (m)k l 1/dk l , (αe)k l = 0, (13) where (m)k l Rdk l represents the model update of the l-th layer in the k-th client. Next, Fed BAT performs binarizationaware training, optimizing the step size and the binarized model update using Eq.8 and Eq.9 as follows: mk t,s+1 = mk t,s ηt Fk(wk t + ˆ mk t,s)/ mk t,s, αk t,s+1 = αk t,s ηt Fk(wk t + ˆ mk t,s)/ αk t,s, ˆmk t,s+1 = S(mk t,s+1, αk t,s+1). After local training, the k-th client will send its binarized model update mk t+1 and the step size αk t+1 to the server. Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization 4. Convergence Analysis In this section, we provide theoretical guarantees for the convergence of Fed BAT, while considering the data heterogeneity within FL. For simplicity, we focus on the case where the warm-up ratio ϕ is zero, meaning that the warm-up training, as defined by Eq.12, is not performed. Furthermore, to ensure the unbiased property of the binarization operator S in Eq.6, we set the step size as αk t,s = mk t,s in each step instead of optimizing it during local training. Then we give the following notations and assumptions. Notations. Let F and F k be the minimum values of F and Fk, respectively, then Γ = F PN k=1 pk F k can be used to quantify the degree of data heterogeneity. τ is the number of local steps. In Section 3.3, the subscripts t [R] and s [τ] are used to represent the serial number of global rounds and local iterations, respectively. In the following analysis, we will only use the subscript t to represent the cumulative number of iteration steps in the sense that t [T], T = Rτ. Assumption 1. F1, ..., FN are all L-smooth: for w and v, Fk(v) Fk(w) + (v w)T Fk(w) + L Assumption 2. F1, ..., FN are u-strongly convex: for all w and v, Fk(v) Fk(w)+(v w)T Fk(w)+ µ Assumption 3. Let ξk t be sampled from the k-th client s local data uniformly at random. The variance of stochastic gradients in each device is bounded: E Fk(wk t , ξk t ) Fk(wk t ) 2 σ2 for all k = 1, ..., N. Assumption 4. The expected squared norm of stochastic gradients is uniformly bounded, i.e., E Fk(wk t , ξk t ) 2 G2 for all k = 1, ..., N and t = 1, ..., T. Assumption 5. The variance of binarization S grows with the l2-norm of its argument, i.e., E S(x) x q x . Assumptions 1-4 are commonplace in standard optimization analyses (Stich et al., 2018; Yu et al., 2019; Li et al., 2020). The condition in Assumption 5 is satisfied with many compression schemes including the binarization operator S as defined in Eq.6. Assumption 5 is also used in (Karimireddy et al., 2019; Reisizadeh et al., 2020) to analyze the convergence of federated algorithms. Theorems 1 and 2 show the convergence of Fed BAT under the strongly convex assumption with full and partial device participation, respectively. The convergence of Fed BAT under the non-convex assumption with partial device participation is shown in Theorems 3. All proofs are provided in Appendix. Theorem 1. Let Assumptions 1-5 hold and L, µ, σ, G, q be defined therein. Choose κ = L µ , γ = max{8κ, τ} 1 and the learning rate ηt = 2 µ(γ+t). Then Fed BAT with full device participation satisfies E[F(w T )] F κ γ + T (2B µ + µ(γ + 1) 2 E w1 w 2), where B = PN k=1 p2 kσ2 + 6LΓ + 8(1 + q2)(τ 1)2G2 + 4 PN k=1 p2 kq2τ 2G2. Theorem 2. Let Assumptions 1-5 hold and L, µ, σ, G, q be defined therein. Let κ, γ, ηt be defined in Theorem 1. Assuming that K devices are randomly selected to participate in each round of training and their data is balanced in the sense that p1 = ... = p N = 1 N . Then the same bound in Theorem 1 holds if we redefine the value of B to B = σ2 N + 6LΓ + 8(1 + q2)(τ 1)2G2 + 4 q2(N 1)+N K K(N 1) τ 2G2. Remark 1. By setting K = N, Theorem 2 transforms into Theorem 1. By setting q = 0, Theorem 1 and 2 are equivalent to the analysis of Fed Avg in (Li et al., 2020). By setting K = N and τ = 1, Theorem 1 and 2 recovers the convergence rate of Stoc-Sign SGD (Safaryan & Richt arik, 2021) when used in distributed training. By setting K = N, τ = 1 and q = 0, Theorem 1 and 2 recovers the convergence rate of vanilla SGD, i.e., O( 1 T ) for strongly-convex losses. Theorem 3. Let Assumptions 1 and 3-5 hold, i.e., without the convex assumption, and L, σ, G, q be defined therein. Assume the learning rate is set to η = 1 L T and the local dataset is balanced, then the following first-order stationary condition holds for Fed BAT with partial device participation t=0 E F(wt) 2 2L(F(w0) F + Γ) (16) where P = σ2 N + 4 q2(N 1)+N K K(N 1) τ 2G2 and Q = 4(1 + q2)(τ 1)2G2. Remark 2. Under the conditions of Theorems 1-3, the convergence rate of both Fed BAT and Fed Avg (q = 0) is O( 1 T ) in the strongly convex setting, and O( 1 T ) in the non-convex setting. Remark 3. For ease of analysis, Fed BAT is discussed in the case where the step size is set as αt,s i = mt,s i without optimization. However, learning the step size shall be able to achieve smaller value of q and enhance the performance of Fed BAT. Empirically, we show in Section 5 that optimizing the step size as defined in Eq.13-14 achieves better accuracy. 5. Experiments 5.1. Experimental Setup Datasets and Models. In this section, we evaluate Fed BAT using four widely recognized datasets: FMNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). To showcase the versatility of Fed BAT, we also assess its performance across different model architectures. Specifically, we employ a CNN with four convolution layers and one fully connected layer for FMNIST and SVHN, and Res Net-10 (He et al., 2016) for CIFAR-10 and CIFAR-100. Detailed model architectures are available in Appendix A.1. Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization (a) CIFAR-100, IID (b) CIFAR-100, Non-IID-1 (c) CIFAR-100, Non-IID-2 Figure 2. Convergence curves of Fed BAT and baselines on CIFAR-100 with 100 clients. Data Partitioning. We consider both cases of IID and Non-IID data distribution, referring to the data partitioning benchmark of FL (Li et al., 2022). Under IID partitioning, an equal quantity of data is randomly sampled for each client. The Non-IID scenario further encompasses two distinct label distributions, termed Non-IID-1 and Non-IID-2. In Non-IID-1, the proportion of the same label among clients follows the Dirichlet distribution (Yurochkin et al., 2019), while in Non-IID-2, each client only contains data of partial labels. For CIFAR-100, we set the Dirichlet parameter to 0.1 in Non-IID-1 and assign 10 random labels to each client in Non-IID-2. For the other datasets, we set the Dirichlet parameter to 0.3 in Non-IID-1 and assign 3 random labels to each client in Non-IID-2. Baseline Methods. All experiments are conducted on Flower (Beutel et al., 2020), an open-source training platform for FL. Fed Avg (Mc Mahan et al., 2017) is adopted as the backbone training algorithm. We compare Fed BAT with the binarization methods discussed in Section 1, including Sign SGD (Bernstein et al., 2018), EF-Sign SGD (Karimireddy et al., 2019), Noisy-Sign SGD (Chen et al., 2020) and Stoc-Sign SGD (Safaryan & Richt arik, 2021). We also compare Fed BAT with Zero FL (Qiu et al., 2022), a method that exploits local sparsity to compress communications. To ensure similar traffic volume to the binarization methods, we set the sparsity ratio of Zero FL to 97%. Details about the baselines are provided in Appendix A.2. Hyperparameters. The number of clients is set to 30 and 100, respectively. 10 clients will participate in every round. The local epoch is set to 10 and the batch size is set to 64. SGD (Bottou, 2010) is used as the local optimizer. The learning rate is tuned from (1.0, 0.1, 0.01) and set to 0.1. The number of rounds are set to 100 for CNN and 200 for Res Net-10. For the baselines, each hyperparameter is carefully tuned among (1.0, 0.1, 0.01, 0.001), including the step size and the coefficient of noise. Detailed tuning process and results are provided in Appendix A.2. In Fed BAT, the coefficient ρ is set to 6 and the warm-up ratio ϕ is set to 0.5 by default. Each experiment is run five times on Nvidia 3090 GPUs with Intel Xeon E5-2673 CPUs. Average results and the standard deviation are reported. 5.2. Overall Performance In this subsection, we compare the performance of Fed BAT and the baselines by the test accuracy and the convergence speed. All numerical results are reported in Table 1. The convergence curves on CIFAR-100 with 100 clients are shown in Figure 2. The convergence curves on other datasets are provided in Appendix B.3. In addition, experimental results of more clients and more related baselines are also available in Appendix B. As shown in Table 1, compared with Sign SGD, other binarization baselines can indeed improve the test accuracy in most cases. However, in a few cases, such as CIFAR-10 under the Non-IID-1 data distribution, their accuracy can be even worse than Sign SGD, which reveals the limitations of existing binarization methods. For Zero FL, we observe a generally lower accuracy compared to the binarization methods. This discrepancy can be attributed to the higher sparsity ratio, which hinders the update of most parameters. In addition, all baselines suffer significant accuracy loss compared to Fed Avg, particularly in scenarios involving high degrees of Non-IID data distribution. On the contrary, Fed BAT can generally achieve comparable or even higher accuracy than Fed Avg, irrespective of data distribution or the number of clients. In terms of convergence speed, as shown in the Figure 2, Fed BAT can consistently outperform all binarization baselines and approach that of Fed Avg. 5.3. Ablation Studies In this subsection, we conduct ablation studies to assess the feasibility and superiority of Fed BAT s design. Specifically, we adjust the coefficient of the step size ρ and the Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization Table 1. The test accuracy of all methods on four datasets. The best accuracy is bolded and the next best accuracy is underlined. N = 30 N = 100 IID Non-IID-1 Non-IID-2 IID Non-IID-1 Non-IID-2 FMNIST with CNN Fed Avg (Mc Mahan et al., 2017) 92.5 ( 0.1) 90.7 ( 0.1) 88.9 ( 0.2) 92.0 ( 0.1) 90.5 ( 0.2) 88.7 ( 0.2) Sign SGD (Bernstein et al., 2018) 91.3 ( 0.1) 86.5 ( 1.0) 78.9 ( 1.3) 90.3 ( 0.1) 88.2 ( 0.2) 80.5 ( 1.0) EF-Sign SGD (Karimireddy et al., 2019) 92.3 ( 0.1) 90.5 ( 0.1) 88.6 ( 0.2) 91.3 ( 0.1) 89.7 ( 0.1) 87.4 ( 0.1) Noisy-Sign SGD (Chen et al., 2020) 92.1 ( 0.1) 90.3 ( 0.2) 87.6 ( 0.3) 91.0 ( 0.1) 89.4 ( 0.1) 86.9 ( 0.1) Stoc-Sign SGD (Safaryan & Richt arik, 2021) 91.7 ( 0.1) 89.5 ( 0.2) 82.6 ( 0.8) 90.6 ( 0.1) 88.5 ( 0.2) 84.8 ( 0.8) Zero FL (Qiu et al., 2022) 91.0 ( 0.1) 89.3 ( 0.3) 87.4 ( 0.2) 90.2 ( 0.1) 88.8 ( 0.2) 86.6 ( 0.1) Fed BAT 92.5 ( 0.1) 90.8 ( 0.2) 89.1 ( 0.3) 91.8 ( 0.1) 90.6 ( 0.1) 89.0 ( 0.4) SVHN with CNN Fed Avg (Mc Mahan et al., 2017) 92.7 ( 0.1) 90.9 ( 0.1) 89.2 ( 0.2) 92.1 ( 0.1) 89.7 ( 0.3) 88.9 ( 0.2) Sign SGD (Bernstein et al., 2018) 92.3 ( 0.1) 80.1 ( 1.5) 66.1 ( 1.3) 90.7 ( 0.1) 80.4 ( 1.4) 64.7 ( 1.9) EF-Sign SGD (Karimireddy et al., 2019) 92.6 ( 0.2) 90.7 ( 0.1) 88.3 ( 0.1) 91.8 ( 0.1) 89.2 ( 0.2) 86.8 ( 0.4) Noisy-Sign SGD (Chen et al., 2020) 92.2 ( 0.1) 90.3 ( 0.2) 88.2 ( 0.1) 90.9 ( 0.2) 88.7 ( 0.1) 86.9 ( 0.2) Stoc-Sign SGD (Safaryan & Richt arik, 2021) 92.3 ( 0.1) 88.0 ( 0.6) 86.7 ( 0.8) 90.7 ( 0.2) 87.7 ( 0.3) 84.8 ( 0.2) Zero FL (Qiu et al., 2022) 91.7 ( 0.1) 90.2 ( 0.1) 87.6 ( 0.3) 90.3 ( 0.1) 88.4 ( 0.3) 87.0 ( 0.2) Fed BAT 92.9 ( 0.1) 91.1 ( 0.1) 89.3 ( 0.2) 92.5 ( 0.1) 90.7 ( 0.1) 89.2 ( 0.2) CIFAR-10 with Res Net-10 Fed Avg (Mc Mahan et al., 2017) 91.5 ( 0.1) 89.0 ( 0.2) 83.7 ( 0.4) 89.3 ( 0.1) 84.6 ( 0.2) 80.9 ( 0.5) Sign SGD (Bernstein et al., 2018) 88.9 ( 0.1) 87.8 ( 0.2) 76.1 ( 1.6) 87.3 ( 0.2) 82.2 ( 0.4) 76.6 ( 0.9) EF-Sign SGD (Karimireddy et al., 2019) 90.8 ( 0.1) 87.4 ( 0.2) 78.3 ( 0.9) 87.4 ( 0.2) 81.8 ( 0.5) 76.6 ( 0.9) Noisy-Sign SGD (Chen et al., 2020) 90.1 ( 0.2) 86.2 ( 0.1) 78.3 ( 0.7) 85.5 ( 0.2) 80.2 ( 0.3) 72.7 ( 0.4) Stoc-Sign SGD (Safaryan & Richt arik, 2021) 88.7 ( 0.2) 86.2 ( 0.2) 77.8 ( 0.4) 85.9 ( 0.2) 80.5 ( 0.3) 74.1 ( 1.0) Zero FL (Qiu et al., 2022) 89.0 ( 0.1) 86.3 ( 0.1) 79.0 ( 0.6) 85.2 ( 0.2) 78.4 ( 0.2) 73.8 ( 0.6) Fed BAT 91.2 ( 0.1) 88.6 ( 0.1) 82.8 ( 0.1) 89.2 ( 0.2) 84.9 ( 0.3) 81.0 ( 0.6) CIFAR-100 with Res Net-10 Fed Avg (Mc Mahan et al., 2017) 67.7 ( 0.2) 64.1 ( 0.3) 54.5 ( 0.4) 59.2 ( 0.3) 55.4 ( 0.8) 49.0 ( 0.6) Sign SGD (Bernstein et al., 2018) 58.9 ( 0.6) 53.9 ( 0.2) 34.3 ( 1.5) 54.2 ( 0.3) 39.3 ( 0.4) 32.5 ( 1.7) EF-Sign SGD (Karimireddy et al., 2019) 65.6 ( 0.2) 59.7 ( 0.4) 50.7 ( 0.4) 53.8 ( 0.5) 46.8 ( 0.5) 40.2 ( 0.6) Noisy-Sign SGD (Chen et al., 2020) 65.3 ( 0.2) 58.3 ( 0.2) 46.6 ( 0.2) 52.6 ( 0.6) 46.2 ( 0.5) 38.3 ( 0.3) Stoc-Sign SGD (Safaryan & Richt arik, 2021) 61.1 ( 0.4) 57.8 ( 0.4) 46.2 ( 0.7) 54.2 ( 0.2) 47.2 ( 0.3) 40.1 ( 0.6) Zero FL (Qiu et al., 2022) 63.7 ( 0.2) 59.9 ( 0.5) 47.7 ( 0.8) 50.5 ( 0.5) 45.6 ( 0.4) 36.1 ( 0.5) Fed BAT 66.3 ( 0.1) 63.9 ( 0.4) 53.9 ( 0.3) 58.6 ( 0.3) 54.3 ( 0.4) 49.2 ( 0.6) Table 2. The test accuracy of Fed BAT with varying ρ. ρ = 0 ρ = 2 ρ = 4 ρ = 6 ρ = 8 ρ = 10 FMNIST 87.9 88.2 88.9 89.0 88.7 88.9 SVHN 88.5 89.0 89.2 89.2 89.5 88.7 CIFAR-10 78.3 79.0 80.4 81.0 80.6 80.1 CIFAR-100 41.2 48.0 48.7 49.2 48.8 48.7 local warm-up ratio ϕ in the context of the Non-IID-2 data distribution with 100 clients. We show in the subsequent experiments that Fed BAT also outperforms the baseline methods in terms of hyperparameter tuning. We first explore the Fed BAT variant with ρ = 0, where the step size α remains fixed at its initial value α without undergoing any optimization. Comparing Table 1 and Table 2, it can be found that Fed BAT (ρ = 0) still achieves better accuracy than all binarization baselines, which proves the superiority of binarizing model updates during local training. Nevertheless, the accuracy of Fed BAT (ρ = 0) remains inferior to Fed Avg, which underscores the need for an adaptive step size. Therefore, we further tune ρ among {2, 4, 6, 8, 10} to verify the effect and robustness of learning the step size α. As illustrated in Table 2, setting the parameter ρ to 2 enhances the accuracy of Fed BAT, aligning Table 3. The test accuracy of Fed BAT with varying ϕ. ϕ = 0.1 ϕ = 0.3 ϕ = 0.5 ϕ = 0.7 ϕ = 0.9 FMNIST 88.6 88.7 89.0 89.0 88.5 SVHN 88.4 88.8 89.2 88.5 88.6 CIFAR-10 75.9 79.9 81.0 80.7 80.3 CIFAR-100 42.9 48.6 49.2 48.9 48.3 it more closely with Fed Avg. The gradual increase in the value of ρ leads to a slight continuous improvement in the accuracy of Fed BAT until ρ reaches 10. It is noteworthy that for ρ values of 4, 6, and 8, the accuracy difference in Fed BAT is comparatively minimal. This indicates a broad optimal range for the hyperparameter ρ, highlighting the robustness of Fed BAT to variations in ρ. Another hyperparameter of Fed BAT is the local warm-up ratio ϕ, which balances the trade-off between the binarization of model updates and their initialization. Here, we tune ϕ among {0.1, 0.3, 0.5, 0.7, 0.9}. As shown in the Table 3, the optimal accuracy for Fed BAT is always achieved when ϕ is set to 0.5, indicating an equilibrium between the significance of initializing and binarizing model updates. For ϕ values of 0.3 or 0.7, there is a marginal reduction in accuracy, although the variance in accuracy is minimal. Notably, Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization further deviation by increasing ϕ to 0.9 or decreasing it to 0.1 leads to more decline in the accuracy. Following the above ablation studies, we summarize the essential findings regarding hyperparameter tuning about Fed BAT. It is advisable to set the local warm-up ratio ϕ to 0.5 as a default configuration without necessitating hyperparameter tuning. The default setting for ρ in Fed BAT is recommended to be 6, with an option to tune it within the interval of [4,8] for marginal accuracy enhancements. 6. Related Work 6.1. Communication-Efficient Federated Learning Existing methods reduce communication costs in FL from two aspects: model compression and gradient compression. The proposed Fed BAT belongs to the latter paradigm. Frequently transferring large models between the server and clients is a significant burden, especially for clients with limited communication bandwidth. Therefore, researchers have turned to model compression techniques. Yang et al. (2021) train and communicate a binary neural network in FL. Different from their motivation, our focus lies in learning binary model updates rather than binary weights. (Caldas et al., 2018; Bouacida et al., 2021) enable clients to train randomly selected sub-models of a larger server model. Hyeon-Woo et al. (2022) use matrix factorization to reduce the size of the model that needs to be transferred. (Li et al., 2021; Isik et al., 2023) transfer a pruned model for efficient communication. Specifically, Fed PM (Isik et al., 2023) trains and communicates only a binary mask for each model parameter, while keeping the model parameters at their randomly initialized values. It is worth noting that Fed PM has certain similarities with Fed BAT. They both keep the base model parameters fixed during local training. Fed PM trains binary masks to prune the base model, while Fed BAT trains binary model updates with respect to the base model. The model trained by Fed PM enjoys the advantage of sparsity, however, the randomly initialized model parameters do not undergo any updates. Due to the lack of effective training on randomly initialized parameters, Fed PM may not be able to achieve a satisfactory accuracy as Fed Avg (Vallapuram et al., 2022). Apart from the model compression, another way to reduce communication costs is gradient compression. It is generally achieved by pruning (Qiu et al., 2022) or quantization (including binarization) (Reisizadeh et al., 2020; Bernstein et al., 2018) on the model updates. To enhance efficiency, recent quantization methods (Jhunjhunwala et al., 2021; Qu et al., 2022; H onig et al., 2022) try to adjust the quantization bitwidth used in different rounds adaptively. However, they suffer from the same post-training manner as existing binarization methods. We posit that the concept of Fed BAT can be seamlessly applied to federated quantization. 6.2. Binarization Binary Connect (Courbariaux et al., 2015) is a pioneering work on learning binary weights. It binarizes weights into { 1, +1} to calculate the output loss. During backpropagation, the full precision weights are optimized by STE (Hinton, 2012). However, Binary Connect does not involve learning the step size. Further, BWN and XNOR-Net are proposed to calculate a step size by minimizing the binarized errors (Rastegari et al., 2016). To be specific, the step size is set to the average of absolute weight values. Later, Hou et al. (2017) propose a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binarized weights and the step size. However, it requires the second-order gradient information, usually not available in SGD. In addition, considering that binarization is essentially a form of 1-bit quantization, many quantization-aware training methods are also suitable for binarization. For example, PACT (Choi et al., 2018) and LSQ (Esser et al., 2020) achieve end-to-end optimization by designing reasonable gradient for the step size. In this paper, we discovered the post-training manner of model updates compression in FL. We seek to solve this problem by directly learning binary model updates during local training. However, the above methods are designed to learn binarized or quantized weights by a long period of centralized training. It is difficult for them to learn accurate binary model updates during a short local training period. Therefore, after designing the gradient of the step size, we also introduced a temperature ρ to regulate the pace of its optimization, along with the implementation of warm-up training for improved initialization. 7. Conclusion We analyzed the challenges faced by existing binarization methods when applied in the context of FL. The analysis encourages us to leverage the local training process to learn binary model updates, instead of binarizing them after training. Therefore, we propose Fed BAT, a federated training framework designed with a focus on binarization awareness. Fed BAT has undergone comprehensive theoretical examination and experimental validation. It is able to exceed the binarization baselines in terms of the test accuracy and convergence speed, and is comparable to Fed Avg. An inherent limitation of Fed BAT resides in the requirement for the client to retain two distinct copies of model parameters: one for the initialization model and another for model updates. Although only one copy of parameters shall be trained, this slightly increases the memory overhead of the local training process in Fed BAT. However, by compressing the initialization model transmitted from the server before communication, the memory overhead in Fed BAT can be alleviated, as well as the downlink communication can be further compressed. We leave this as future work. Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization Acknowledgements This work is supported in part by National Natural Science Foundation of China under grants 62376103, 62302184, 62206102, Science and Technology Support Program of Hubei Province under grant 2022BAA046, and Ant Group through CCF-Ant Research Fund. Impact Statement In this paper, we propose Fed BAT to improve the communication efficiency of federated learning. It is our contention that Fed BAT does not result in any adverse social impact. Instead, Fed BAT enhances user privacy protection to a certain extent as the model updates uploaded by the clients are binarized and noise is introduced in the process. Bernstein, J., Wang, Y., Azizzadenesheli, K., and Anandkumar, A. SIGNSGD: compressed optimisation for nonconvex problems. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 559 568. PMLR, 2018. Beutel, D. J., Topal, T., Mathur, A., Qiu, X., Parcollet, T., and Lane, N. D. Flower: A friendly federated learning research framework. Co RR, abs/2007.14390, 2020. Bottou, L. Large-scale machine learning with stochastic gradient descent. In 19th International Conference on Computational Statistics, COMPSTAT, pp. 177 186. Physica Verlag, 2010. Bouacida, N., Hou, J., Zang, H., and Liu, X. Adaptive federated dropout: Improving communication efficiency and generalization for federated learning. In 2021 IEEE Conference on Computer Communications Workshops, INFOCOM Workshops, pp. 1 6. IEEE, 2021. Caldas, S., Koneˇcn y, J., Mc Mahan, H. B., and Talwalkar, A. Expanding the reach of federated learning by reducing client resource requirements. Co RR, abs/1812.07210, 2018. Chen, X., Chen, T., Sun, H., Wu, Z. S., and Hong, M. Distributed training with heterogeneous data: Bridging medianand mean-based algorithms. In Advances in Neural Information Processing Systems, Neur IPS, volume 33, pp. 21616 21626, 2020. Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I., Srinivasan, V., and Gopalakrishnan, K. PACT: parameterized clipping activation for quantized neural networks. Co RR, abs/1805.06085, 2018. Courbariaux, M., Bengio, Y., and David, J.-P. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, volume 28, 2015. Esser, S. K., Mc Kinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In 8th International Conference on Learning Representations, ICLR. Open Review.net, 2020. Ferreira, P. A., da Silva, P. N., Gottin, V., Stelling, R., and Calmon, T. Bayesian signsgd optimizer for federated learning. In Advances in Neural Information Processing Systems, Neur IPS, 2021. Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 315 323, 2011. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 770 778. IEEE Computer Society, 2016. Hinton, G. Neural networks for machine learning. Coursera video lectures, 2012. H onig, R., Zhao, Y., and Mullins, R. D. Dadaquant: Doublyadaptive quantization for communication-efficient federated learning. In International Conference on Machine Learning, ICML, volume 162, pp. 8852 8866. PMLR, 2022. Hou, L., Yao, Q., and Kwok, J. T. Loss-aware binarization of deep networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. Hyeon-Woo, N., Ye-Bin, M., and Oh, T. Fedpara: Low-rank hadamard product for communication-efficient federated learning. In The Tenth International Conference on Learning Representations, ICLR. Open Review.net, 2022. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML, pp. 448 456, 2015. Isik, B., Pase, F., G und uz, D., Weissman, T., and Zorzi, M. Sparse random networks for communication-efficient federated learning. In The Eleventh International Conference on Learning Representations, ICLR. Open Review.net, 2023. Jhunjhunwala, D., Gadhikar, A., Joshi, G., and Eldar, Y. C. Adaptive quantization of model updates for Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization communication-efficient federated learning. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 3110 3114. IEEE, 2021. Karimireddy, S. P., Rebjock, Q., Stich, S. U., and Jaggi, M. Error feedback fixes signsgd and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning,ICML, volume 97, pp. 3252 3261. PMLR, 2019. Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Li, A., Sun, J., Zeng, X., Zhang, M., Li, H., and Chen, Y. Fedmask: Joint computation and communicationefficient personalized federated learning via heterogeneous masking. In Sen Sys 21: The 19th ACM Conference on Embedded Networked Sensor Systems, pp. 42 55. ACM, 2021. Li, Q., Diao, Y., Chen, Q., and He, B. Federated learning on non-iid data silos: An experimental study. In 38th IEEE International Conference on Data Engineering, ICDE, pp. 965 978. IEEE, 2022. Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of fedavg on non-iid data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Liu, L., Dong, C., Liu, X., Yu, B., and Gao, J. Bridging discrete and backpropagation: Straight-through and beyond. Co RR, abs/2304.08612, 2023. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS, volume 54, pp. 1273 1282. PMLR, 2017. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. Qiu, X., Fern andez-Marqu es, J., de Gusmao, P. P. B., Gao, Y., Parcollet, T., and Lane, N. D. Zerofl: Efficient ondevice training for federated learning with local sparsity. In The Tenth International Conference on Learning Representations, ICLR. Open Review.net, 2022. Qu, L., Song, S., and Tsui, C. Feddq: Communicationefficient federated learning with descending quantization. In IEEE Global Communications Conference, GLOBECOM, pp. 281 286. IEEE, 2022. Raihan, M. A. and Aamodt, T. M. Sparse weight activation training. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, Neur IPS, 2020. Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pp. 525 542, 2016. Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS, volume 108, pp. 2021 2031. PMLR, 2020. Safaryan, M. and Richt arik, P. Stochastic sign descent methods: New algorithms and better theory. In Proceedings of the 38th International Conference on Machine Learning, ICML, volume 139, pp. 9224 9234. PMLR, 2021. Stich, S. U., Cordonnier, J., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems 31: Annual Conferenceon Neural Information Processing Systems, Neur IPS, pp. 4452 4463, 2018. Tang, Z., Wang, Y., and Chang, T. z-signfedavg: A unified stochastic sign-based compression for federated learning. Co RR, abs/2302.02589, 2023. Vallapuram, A. K., Zhou, P., Kwon, Y. D., Lee, L. H., Xu, H., and Hui, P. Hidenseek: Federated lottery ticket via serverside pruning and sign supermask. Co RR, abs/2206.04385, 2022. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. Co RR, abs/1708.07747, 2017. Yang, Y., Zhang, Z., and Yang, Q. Communication-efficient federated learning with binary neural networks. IEEE J. Sel. Areas Commun., 39(12):3836 3850, 2021. Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, pp. 5693 5700. AAAI Press, 2019. Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K. H., Hoang, T. N., and Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML, volume 97, pp. 7252 7261. PMLR, 2019. Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization A. Detailed Experimental Settings A.1. Model Architectures In this paper, we employ a CNN for FMNIST and SVHN, Res Net-10 for CIFAR-10 and CIFAR-100. The detailed model architectures are shown in Table 4. The Basic Block used by Res Net is the same as defined in (He et al., 2016). Batch normalization (BN) (Ioffe & Szegedy, 2015) is used to ensure stable training and Re LU (Glorot et al., 2011) is employed as the activation function. Table 4. Model architectures of the CNN, Res Net-10. CNN (FMNIST) CNN (SVHN) Res Net-10 (CIFAR-10) Res Net-10 (CIFAR-100) Convd2d(1,32,3) Convd2d(3,32,3) Convd2d(3,32,3) Convd2d(3,32,3) Convd2d(32,64,3) Convd2d(32,64,3) Basic Block(32) Basic Block(32) Convd2d(64,128,3) Convd2d(64,128,3) Basic Block(64) Basic Block(64) Convd2d(128,256,3) Convd2d(128,256,3) Basic Block(128) Basic Block(128) Linear(256,10) Linear(1024,10) Basic Block(256) Basic Block(256) Linear(256,10) Linear(256,100) A.2. Baseline Methods In our experiments, we compare Fed BAT with five baselines, including Sign SGD (Bernstein et al., 2018), EFSign SGD (Karimireddy et al., 2019), Noisy-Sign SGD (Chen et al., 2020), Stoc-Sign SGD (Safaryan & Richt arik, 2021) and Zero FL (Qiu et al., 2022). Here, we introduce each baseline and the hyperparameters involved in detail. We declare that all hyperparameters are tuned in {1.0, 0.1, 0.01, 0.001} for each dataset. The hyperparameter of Sign SGD is the step size α, which is tuned and set to 0.001 for all datasets. In EF-Sign SGD, there is no hyperparameter to tune. Notably, EF-Sign SGD adds the binarization errors from the previous round onto the current model updates before conducting another binarization. Furthermore, it sets the step size α to m 1/d for a more precise binarization of m Rd. Noisy-Sign SGD adds Gaussian noise ξ N(0, σ2) to the model update and then performs binarization, where σ is an adjustable standard deviation. We set the step size α and the standard deviation σ to 0.01 and 0.01 for Noisy-Sign SGD. Stoc-Sign SGD adds uniform noise ξ U( m , m ) to a model update m before binarization. In this way, each element mi will be binarized into +1 with probability (1/2 + mi/2 m ) and into -1 with probability (1/2 mi/2 m ). However, we observe that the value of m is so large that the above two probabilities are both close to 0.5. Therefore, in our experiments, we replaced m with m for a better binarization. Besides, the step size of Stoc-Sign SGD is tuned and set to 0.01. Zero FL performs SWAT (Raihan & Aamodt, 2020) in local training and prune the model updates after training. The sparsity ratio is set to 97% to ensure a communication compression ratio similar to the binarization methods. B. Additional Experiment Results B.1. Additional Baselines In this subsection, we test additional baselines on CIFAR-10, including Fed PAQ (Reisizadeh et al., 2020) and Bi FL (Yang et al., 2021). Fed PAQ quantizes model updates after local training. Table 5 shows that Fed PAQ achieve comparable accuracy to Fed BAT with a communication costs ranging from 3 to 4 bits per parameter (bpp). Bi FL trains a binary neural network (BNN) during local training. There are two ways to upload local weights in Bi FL, one is to communicate full-precision weights (Bi FL-Full), and the other is to communicate binarized weights (Bi FL). Bi FL-Full achieves higher accuracy as communication is uncompressed. However, Bi FL directly binarizes model weights rather than binarizing model updates, resulting in much lower accuracy compared to Sign SGD. Table 5. The test accuracy of additional baselines (including Fed PAQ and Bi FL) on CIFAR-10 with 100 clients. Fed Avg Sign SGD Fed PAQ [3-bit] Fed PAQ [4-bit] Bi FL Bi FL-Full Fed BAT IID 89.3 87.3 88.8 89.1 61.8 86.8 89.2 Non-IID-1 84.6 82.2 83.9 84.5 37.4 84.3 84.9 Non-IID-2 80.9 76.6 80.5 80.8 25.1 79.6 81.0 Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization B.2. More Clients In Section 5, the number of clients is set to 30 and 100. Now, we expand the number of clients to 200. As shown in Table 6, despite this increase, Fed BAT consistently achieves accuracy comparable to Fed Avg and notably surpasses Sign SGD. Table 6. The test accuracy of Fed Avg, Sign SGD and Fed BAT with 200 clients. FMNIST with CNN SVHN with CNN CIFAR-10 with Res Net-10 CIFAR-100 with Res Net-10 IID Non-IID-1 Non-IID-2 IID Non-IID-1 Non-IID-2 IID Non-IID-1 Non-IID-2 IID Non-IID-1 Non-IID-2 Fed Avg 91.6 90.2 88.4 91.4 88.6 87.8 84.9 81.0 78.2 49.7 48.8 42.9 Sign SGD 89.5 87.5 80.2 89.6 79.6 67.6 84.2 77.8 71.0 46.1 34.2 30.8 Fed BAT 91.4 89.9 88.1 91.4 89.2 87.8 85.0 80.8 77.8 49.1 48.4 43.2 B.3. Additional Convergence Curves Due to limited space, we illustrate the convergence curves of various methods for the FMNIST, SVHN and CIFAR-10 datasets in Figure 3. The convergence patterns of the methods closely resemble those observed in Figure 2. (a) FMNIST, IID (b) FMNIST, Non-IID-1 (c) FMNIST, Non-IID-2 (d) SVHN, IID (e) SVHN, Non-IID-1 (f) SVHN, Non-IID-2 (g) CIFAR-10, IID (h) CIFAR-10, Non-IID-1 (i) CIFAR-10, Non-IID-2 Figure 3. Convergence curves of Fed BAT and baselines on FMNIST, SVHN and CIFAR-10 with 100 clients. Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization C. Proof of Theorem 1 In this section, we analyze Fed BAT in the setting of full device participation. The theoretical analysis in this paper is rooted in the findings about Fed Avg presented in (Li et al., 2020). C.1. Additional Notation Let wk t be the model parameters maintained in the k-th device at the t-th step. Let Iτ be the set of global synchronization steps, i.e., Iτ = {nτ|n = 1, 2, ...}. If t + 1 Iτ, i.e., the time step to communication, Fed BAT activates all devices. Then the optimization of Fed BAT can be described as vk t+1 = wk t ηt Fk(xk t , ξk t ) (17) xk t = Sm(wk t ) (18) wk t+1 = vk t+1 if t + 1 / Iτ, PN k=1 pk Sm(vk t+1) if t + 1 Iτ. (19) Here, the variable vk t+1 is introduced to represent the immediate result of one step SGD update from wk t . We interpret wk t+1 as the parameters obtained after communication steps (if possible). Also, an additional variable xk t is introduced to represent the result of performing binarization on model updates. In our analysis, we define two virtual sequences vt = PN k=1 pkvk t and wt = PN k=1 pkwk t . vt+1 results from an single step of SGD from wt. When t + 1 / Iτ, both are inaccessible. When t + 1 Iτ, we can only fetch wt+1. For convenience, we define gt = PN k=1 pk Fk(xk t ) and gt = PN k=1 pk Fk(xk t , ξk t ). Therefore, vt+1 = wt ηtgt and Egt = gt. Notably, for any t 0, there exists a t0 t, such that t t0 τ 1 and wk t0 = wt0 for all k = 1, 2, ..., N. In this case, xk t = Sm(wk t ) = S(wk t wt0) + wt0. Therefore, we have Exk t = wk t and E xk t wk t 2 q2 wk t wt0 2. C.2. Key Lemmas To convey our proof clearly, it would be necessary to prove certain useful lemmas. We defer the proof of these lemmas to latter section and focus on proving the main theorem. Lemma 1. (Results of one step SGD). Assume Assumption 1 and 2. If ηt 1 4L, we have E vt+1 w 2 (1 ηtµ)E wt w 2 + η2 t E gt gt 2 + 6Lη2 t Γ + 2 k=1 pk E wt xk t 2. (20) Lemma 2. (Bounding the variance). Assume Assumption 3 holds. It follows that k=1 p2 kσ2. (21) Lemma 3. (Bounding the divergence of xk t ). Assume Assumption 4, that ηt is non-increasing and ηt 2ηt+τ for all t 0. It follows that k=1 pk E wt xk t 2 4(1 + q2)η2 t (τ 1)2G2. (22) Lemma 4. (Bounding the divergence of wt). Assume Assumption 4 and 5, that ηt is non-increasing and ηt 2ηt+τ for all t 0. It follows that E wt+1 vt+1 2 4 k=1 p2 kq2η2 t τ 2G2. (23) Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization C.3. Completing the Proof of Theorem 1 Note that E wt = E vt when we take expectation to erase the randomness of stochastic binarization, therefore E wt+1 w 2 = E wt+1 vt+1 + vt+1 w 2 = E wt+1 vt+1 2 + E vt+1 w 2. (24) Let t = E wt w 2. From Lemma 1-4, it follows that t+1 (1 ηtµ) t + η2 t B, (25) k=1 p2 kσ2 + 6LΓ + 8(1 + q2)(τ 1)2G2 + 4 k=1 p2 kq2τ 2G2. (26) For a diminishing stepsize, ηt = β t+γ for some β > 1 µ and γ > 0 such that η1 min{ 1 µ, 1 4L} = 1 4L and ηt 2ηt+τ. We will prove t v t+γ where v = max{ β2B βµ 1, (γ + 1) 1}. We prove it by induction. Firstly, the definition of v ensures that it holds for t = 1. Assume the conclusion holds for some t, it follows that t+1 (1 ηtµ) t + η2 t B (1 βµ t + γ ) v t + γ + β2B (t + γ)2 (t + γ)2 v + β2B (t + γ)2 βµ 1 v t + γ + 1. Then by the L-smoothness of F, E[F( wt)] F L 2 v γ + t. (28) Specifically, if we choose β = 2 µ, γ = max{8 L µ , τ} 1 and denote κ = L µ , then ηt = 2 µ 1 γ+t. One can verify that the choice of ηt satisfies ηt 2ηt+τ for t 1. Then, we have v = max{ β2B βµ 1, (γ + 1) 1} β2B βµ 1 + (γ + 1) 1 4B µ2 + (γ + 1) 1, (29) E[F( wt)] F L 2 v γ + t κ γ + t(2B µ + µ(γ + 1) C.4. Deferred Proofs of Key Lemmas Proof of Lemma 1. Notice that vt+1 = wt ηtgt and Egt = gt, then E vt+1 w 2 = E wt ηtgt w ηt gt + ηt gt 2 = E wt w ηt gt 2 | {z } A1 +η2 t E gt gt 2. (31) We next focus on bounding A1. Again we split A1 into three terms: wt w ηt gt 2 = wt w 2 2ηt wt w , gt | {z } B1 + η2 t gt 2 | {z } B2 From the the L-smoothness of Fk, it follows that Fk(xk t ) 2 2L(Fk(xk t ) F k ). (33) Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization By the convexity of 2 and Eq. 33, we have B2 = η2 t gt 2 η2 t k=1 pk Fk(xk t ) 2 2Lη2 t k=1 pk(Fk(xk t ) F k ). (34) B1 = 2ηt wt w , gt = 2ηt k=1 pk wt w , Fk(xk t ) k=1 pk wt xk t , Fk(xk t ) 2ηt k=1 pk xk t w , Fk(xk t ) . By Cauchy-Schwarz inequality and AM-GM inequality, we have 2 wt xk t , Fk(xk t ) 1 ηt wt xk t 2 + ηt Fk(xk t ) 2. (36) By the µ-strong convexity of Fk, we have xk t w , Fk(xk t ) (Fk(xk t ) Fk(w )) µ 2 xk t w 2. (37) Therefore, we have A1 = E wt w ηt gt 2 E wt w 2 + 2Lη2 t E k=1 pk(Fk(xk t ) F k ) ηt wt xk t 2 + ηt Fk(xk t ) 2) k=1 pk(Fk(xk t ) Fk(w ) + µ 2 xk t w 2) (1 µηt)E wt w 2 + k=1 pk E wt xk t 2 k=1 pk(Fk(xk t ) F k ) 2ηt k=1 pk(Fk(xk t ) Fk(w )) where we use Eq.33 again and the inequality E xk t w 2 = E wt xk t 2 E wt w 2 E wt w 2. We next aim to bound C. We define γt = 2ηt(1 2Lηt). Since ηt 1 4L, ηt γt 2ηt. Then we split C into two terms: C = 2ηt(1 2Lηt) k=1 pk(Fk(xk t ) F k ) + 2ηt k=1 pk(Fk(w ) F k ) k=1 pk(Fk(xk t ) F ) + (2ηt γt) k=1 pk(F F k ) k=1 pk(Fk(xk t ) F ) where in the last equation, we use the notation Γ = PN k=1 pk(F F k ) = F PN k=1 pk F k . Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization To bound D, we have k=1 pk(Fk(xk t ) F ) = k=1 pk(Fk(xk t ) Fk( wt)) + k=1 pk(Fk( wt) F ) k=1 pk Fk( wt), xk t wt + F( wt) F k=1 pk[ηt Fk( wt) 2 + 1 ηt xk t wt 2] + F( wt) F k=1 pk[ηt L(Fk( wt) F k ) + 1 2ηt xk t wt 2] + F( wt) F where the first inequality results from the convexity of Fk, the second inequality from AM-GM inequality and the third inequality from Eq. 33. Therefore k=1 pk[ηt L(Fk( wt) F k ) + 1 2ηt xk t wt 2] γt(F( wt) F ) + 4Lη2 t Γ = γt(ηt L 1) k=1 pk(Fk( wt) F k ) + (4Lη2 t + γtηt L)Γ + γt k=1 pk xk t wt 2 k=1 pk xk t wt 2 where in the last inequality, we use the following facts: (1)ηt L 1 3 4 0 and PN k=1 pk(Fk( wt) F ) = F( wt) F 0 (2) Γ 0 and 4Lη2 t + γtηt L 6η2 t L and (3) γt 2ηt 1. Recalling the expression of A1 and plugging C into it, we have A1 = E wt w ηt gt 2 (1 µηt)E wt w 2 + 2 k=1 pk E wt xk t 2 + 6Lη2 t Γ. (42) Plugging A1 into Eq. 31, we have the result in Lemma 1 E vt+1 w 2 (1 ηtµ)E wt w 2 + η2 t E gt gt 2 + 6Lη2 t Γ + 2 k=1 pk E wt xk t 2. (43) Proof of Lemma 2. From Assumption 3, the variance of the stochastic gradients in device k is bounded by σ2, then E gt gt 2 = E k=1 pk( Fk(wk t , ξk t ) Fk(wk t )) 2 k=1 p2 k E Fk(wk t , ξk t ) Fk(wk t ) 2 Proof of Lemma 3. Considering that Exk t = wk t , we have k=1 pk E wt xk t 2 = k=1 pk E ( wt wk t ) (wk t xk t ) 2 k=1 pk E wt wk t 2 + k=1 pk E wk t xk t 2 (45) Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization Since Fed BAT requires a communication each τ steps. Therefore, for any t 0, there exists a t0 t, such that t t0 τ 1 and wk t0 = wt0 for all k = 1, 2, ..., N. Also, we use the fact that ηt is non-increasing and ηt0 2ηt for all t t0 τ 1, then N X k=1 pk E wt wk t 2 = k=1 pk E (wk t wt0) ( wt wt0) 2 k=1 pk E wk t wt0 2 t=t0 (τ 1)η2 t Fk(xk t , ξk t ) 2 t=t0 (τ 1)η2 t G2 k=1 pkη2 t (τ 1)2G2 4η2 t (τ 1)2G2 Here in the first inequality, we use E X EX 2 E X 2 where X = wk t wt0 with probability pk. In the second inequality, we use Jensen inequality: wk t wt0 2 = t=t0 ηt Fk(wk t , ξk t ) 2 (t t0) t=t0 η2 t Fk(wk t , ξk t ) 2. (47) In the third inequality, we use ηt ηt0 for t t0 and E Fk(wk t , ξk t ) 2 G2 for k = 1, 2, ..., N and t 1. In the last inequality, we use ηt0 2ηt0+τ 2ηt for t0 t t0 + τ. According to Assumption 5, we have E wk t xk t 2 q2E wk t wt0 2 as discussed in Section C.1. Then the second term in Eq. 45 can be bounded by reusing the result in Eq. 46 as k=1 pk E wk t xk t 2 q2 N X k=1 pk E wk t wt0 2 4q2η2 t (τ 1)2G2. (48) Plugging Eq. 46 and Eq. 48 into Eq. 45, we have the result in Lemma 3 k=1 pk E wt xk t 2 4(1 + q2)η2 t (τ 1)2G2. (49) Proof of Lemma 4. Notice that wt+1 = vt+1 when t + 1 / Iτ and wt+1 = PN k=1 pk Sm(vt+1), vt+1 = PN k=1 pkvt+1 when t + 1 Iτ. Hence, if t + 1 Iτ, we have E wt+1 vt+1 2 = E k=1 pk(Sm(vk t+1) vk t+1) 2 k=1 p2 k E (Sm(vk t+1) vk t+1) 2 k=1 p2 kq2E vk t+1 wt0 2 k=1 p2 kq2η2 t τ 2G2, where the last inequality follows the result in Eq.47. Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization D. Proof of Theorem 2 In this section, we analyze Fed BAT in the setting of partial device participation. D.1. Additional Notation Recall that wk t is the model parameter maintained in the k-th device at the t-th step. Iτ = {nτ|n = 1, 2, ..., N} is the set of global synchronization steps. Again, gt = PN k=1 pk Fk(xk t ) and gt = PN k=1 pk Fk(xk t , ξk t ). Therefore, vt+1 = wt ηtgt and Egt = gt. Now we consider the case where Fed BAT samples a random set St of devices to participate in each round of training. This make the analysis a little bit intricate, since St varies each τ steps. Following (Li et al., 2020), we assume that Fed BAT always activates all devices at the beginning of each round and then uses the parameters maintained in only a few sampled devices to produce the next-round parameter. It is clear that this updating scheme is equivalent to the original. As assumed in Theorem 2 that p1 = ... = p N = 1 N , the update of Fed BAT with partial devices active can be described as: for all k [N], vk t+1 = wk t ηt Fk(xk t , ξk t ) (51) xk t = Sm(wk t ) (52) vk t+1 if t + 1 / Iτ, P k St+1 pk Sm(vk t ) P k St+1 pk = 1 k St+1 Sm(vk t+1) if t + 1 Iτ. (53) D.2. Key Lemmas Lemma 5. (Unbiased sampling scheme). In the case of partial device participation in Theorem 2, we have E wt+1 = E vt+1. (54) Lemma 6. (Bounding the variance of wt). In the case of partial device participation in Theorem 2, with Assumption 4 and 5, assume that ηt is non-increasing and ηt ηt+τ for all t 0. It follows that E wt+1 vt+1 2 4q2(N 1) + N K K(N 1) η2 t τ 2G2. (55) D.3. Completing the Proof of Theorem 2 Using Lemma 5, Eq.24 still holds, that is E wt+1 w 2 = E wt+1 vt+1 + vt+1 w 2 = E wt+1 vt+1 2 + E vt+1 w 2 (56) Let t = E wt w 2. From Lemma 1, 2, 3 and 6, it follows that t+1 (1 ηtµ) t + η2 t B (57) N + 6LΓ + 8(1 + q2)(τ 1)2G2 + 4q2(N 1) + N K K(N 1) τ 2G2 (58) The only difference between Eq.57 and Eq.25 is the value of constant B. Following the same process, we can get the result of Theorem 2 E[F( wt)] F L 2 v γ + t κ γ + t(2B µ + µ(γ + 1) 2 w1 w 2), (59) where γ = max{8 L µ , τ} 1 and κ = L Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization D.4. Deferred Proofs of Key Lemmas Proof of Lemma 5. Considering the partial device participation in Theorem 2, wt+1 = vt+1 when t + 1 / Iτ and wt+1 = 1 K P k St+1 Sm(vk t+1), vt+1 = 1 N PN k=1 vk t+1 when t + 1 Iτ. In the latter case, there are two kinds of randomness between wt+1 and vt+1, respectively from the client s random selection and stochastic binarization. To distinguish them, we use the notation ESt when we take expectation to erase the randomness of device selection, and use the notation ES when we take expectation to erase the randomness of binarization. Therefore, when t + 1 Iτ, we have E wt+1 = ESt[ES wt+1] = ESt[ES 1 K k St+1 Sm(vk t+1)] = ESt[ 1 k St+1 vk t+1] = vt+1 (60) Proof of Lemma 6. Notice that wt+1 = vt+1 when t + 1 / Iτ and wt+1 = 1 K P k St+1 Sm(vk t+1), vt+1 = PN k=1 pkvt+1 when t + 1 Iτ. Hence, we have E wt+1 vt+1 2 = E 1 k St+1 Sm(vk t+1) vt+1 2 k St+1 (Sm(vk t+1) vk t+1) + 1 k St+1 vk t+1 vt+1 2 k St+1 (Sm(vk t+1) vk t+1) 2 k St+1 vk t+1 vt+1 2 To bound A1, we have k St+1 (Sm(vk t+1) vk t+1) 2 k St+1 E (Sm(vk t+1) vk t+1) 2 k St+1 4q2η2 t τ 2G2 K q2η2 t τ 2G2, where in the inequality we use the result of Eq.50. Then, to bound A2, we have k St+1 vk t+1 vt+1 2 i=1 I{i St+1}(vi t+1 vt+1) 2 i=1 P(i St+1) vi t+1 vt+1 2 + X i =j P(i, j St+1) D vi t+1 vt+1, vj t+1 vt+1 E ] i=1 vi t+1 vt+1 2 + K 1 KN(N 1)E X D vi t+1 vt+1, vj t+1 vt+1 E = N K KN(N 1) i=1 E vi t+1 vt+1 2 where we use the following equalities: (1) P(i St+1) = K N and P(i, j St+1) = K(K 1) N(N 1) for all i = j and (2) Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization PN i=1 vi t+1 vt+1 2 + P i =j D vi t+1 vt+1, vj t+1 vt+1 E = 0. Also using the result of Eq.50, we have A2 = N K KN(N 1) i=1 E vi t+1 vt+1 2 N K K(N 1)4η2 t τ 2G2 (64) Plugging A1 and A2, we have the result in Lemma 6 E wt+1 vt+1 2 4q2(N 1) + N K K(N 1) η2 t τ 2G2. (65) E. Proof of Theorem 3 In this section, we proof the convergence of Fed BAT under non-convex settings. We only use Assumptions 1, 3, 4, 5. Note that Assumptions 2 is about the strongly convex, which we do not use in this case. E.1. Additional Notation Let us review the notations in Section D.1. wk t is the model parameters maintained in the k-th device at the t-th step. Let Iτ be the set of global synchronization steps, i.e., Iτ = {nτ|n = 1, 2, ...}. If t + 1 Iτ, i.e., the time step to communication, Fed BAT activates all devices. Then the update of Fed BAT can be described as vk t+1 = wk t ηt Fk(xk t , ξk t ) (66) xk t = Sm(wk t ) (67) wk t+1 = vk t+1 if t + 1 / Iτ, PN k=1 pk Sm(vk t+1) if t + 1 Iτ. (68) Here, the variable vk t+1 is introduced to represent the immediate result of one step SGD update from wk t . We interpret wk t+1 as the parameter obtained after communication steps (if possible). Also, an additional variable xk t is introduced to represent the result of binarization on model update. In our analysis, we define two virtual sequences vt = PN k=1 pkvk t and wt = PN k=1 pkwk t . It is obviously that E vt = E wt. vt+1 results from an single step of SGD from wt. When t + 1 / Iτ, both are inaccessible. When t + 1 Iτ, we can only fetch wt+1. For convenience, we define gt = F(xk t ) = PN k=1 pk Fk(xk t ) and gt = PN k=1 pk Fk(xk t , ξk t ). Therefore, vt+1 = wt ηtgt and Egt = gt. Notably, for any t 0, there exists a t0 t, such that t t0 τ 1 and wk t0 = wt0 for all k = 1, 2, ..., N. In this case, xk t = Sm(wk t ) = S(wk t wt0) + wt0. Therefore, we have Exk t = wk t and E xk t wk t 2 q2 wk t wt0 2. E.2. Key Lemmas Lemma 7. Assume Assumption 1 and 3, we have EF( wt+1) EF( wt) η 2E F( wt) 2 + Lη2 η k=1 E wt xk t 2 + Lη2 2 E gt gt 2 + L 2 E wt+1 vt+1 2 (69) E.3. Completing the Proof of Theorem 3 Since η = 1 L T , we have Lη2 η 2 0, then Eq.(69) can be rewritten as EF( wt+1) EF( wt) η 2E F( wt) 2 + η k=1 E wt xk t 2 + Lη2 2 E gt gt 2 + L 2 E wt+1 vt+1 2 Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization The last three terms can be bounded by Lemmas 2, 3 and 6. Note that these lemmas require no convex assumption. Therefore, EF( wt+1) EF( wt) η 2E F( wt) 2 2L24(1 + q2)η2(τ 1)2G2 + Lη2 2 4q2(N 1) + N K K(N 1) η2 t τ 2G2 (71) Now rearranging the terms and summing over t = 0, ..., T 1 yield that t=0 E F( wt) 2 F(w0) F(w T ) 2L24(1 + q2)η2(τ 1)2G2 + Lη2 2 4q2(N 1) + N K K(N 1) η2 t τ 2G2 (72) Picking the learning rate η = 1 L T , and with F(w T ) 1 N PN k=1 F k = F Γ, we have t=0 E F( wt) 2 2L(F( w0) F + Γ) where P = σ2 N + 4 q2(N 1)+N K K(N 1) τ 2G2 and Q = 4(1 + q2)(τ 1)2G2. E.4. Deferred Proofs of Key Lemmas Proof of Lemma 7. For any L-smooth function F, we have F( wt+1) F( vt+1) + F( vt+1), wt+1 vt+1 + L 2 wt+1 vt+1 2 (74) As E wt+1 = E vt+1, taking expectations for the randomness of stochastic binarization and client selection yields that EF( wt+1) EF( vt+1) + L 2 E wt+1 vt+1 2 (75) Since vt+1 = wt ηgt, with L-smoothness, we have F( vt+1) F( wt) η F( wt), gt + Lη2 2 gt 2 (76) The inner product term above can be written in expectation as follows: 2E F( wt), gt = E F( wt) 2 + E gt 2 E F( wt) gt 2 (77) Now, we consider the last term in Eq.77 with the fact that Egt = E gt E F( wt) gt 2 = E F( wt) gt + gt gt 2 = E F( wt) gt 2 + E gt gt 2 k=1 ( Fk( wt) Fk(xk t )) 2 + E gt gt 2 k=1 E wt xk t 2 + E gt gt 2 Further, we have ηE F( wt), gt η 2E F( wt) 2 η 2E gt 2 + η k=1 E wt xk t 2 + η 2E gt gt 2 (79) Fed BAT: Communication-Efficient Federated Learning via Learnable Binarization Summing Eq.79 into Eq.77, we have EF( vt+1) EF( wt) η 2E F( wt) 2 + (Lη2 2)E gt 2 + η k=1 E wt xk t 2 + η 2E gt gt 2 (80) E gt 2 can be expanded as follows: E gt 2 = E gt gt + gt 2 = E gt 2 + E gt gt 2 (81) Therefore, we have EF( vt+1) EF( wt) η 2E F( wt) 2 + (Lη2 2)E gt 2 + η k=1 E wt xk t 2 + Lη2 2 E gt gt 2 (82) Finally, summing Eq.82 into Eq.75 yields the result in Lemma 7.