# Provably Secure Federated Learning against Malicious Clients

Xiaoyu Cao, Jinyuan Jia, Neil Zhenqiang Gong
Duke University, Durham, NC 27708
{xiaoyu.cao, jinyuan.jia, neil.gong}@duke.edu

Federated learning enables clients to collaboratively learn a shared global model without sharing their local training data with a cloud server. However, malicious clients can corrupt the global model to predict incorrect labels for testing examples. Existing defenses against malicious clients leverage Byzantine-robust federated learning methods. However, these methods cannot provably guarantee that the predicted label for a testing example is not affected by malicious clients. We bridge this gap via ensemble federated learning. In particular, given any base federated learning algorithm, we use the algorithm to learn multiple global models, each of which is learnt using a randomly selected subset of clients. When predicting the label of a testing example, we take a majority vote among the global models. We show that our ensemble federated learning with any base federated learning algorithm is provably secure against malicious clients. Specifically, the label predicted by our ensemble global model for a testing example is provably not affected by a bounded number of malicious clients. Moreover, we show that our derived bound is tight. We evaluate our method on the MNIST and Human Activity Recognition datasets. For instance, our method can achieve a certified accuracy of 88% on MNIST when 20 out of 1,000 clients are malicious.

## Introduction

Federated learning (Konečný et al. 2016; McMahan et al. 2017) is an emerging machine learning paradigm, which enables many clients (e.g., smartphones, IoT devices, and organizations) to collaboratively learn a model without sharing their local training data with a cloud server. Due to its promise for protecting the privacy of the clients' local training data and the emerging privacy regulations such as the General Data Protection Regulation (GDPR), federated learning has been deployed by industry. For instance, Google has deployed federated learning for next-word prediction on Android Gboard.

Existing federated learning methods mainly follow a single-global-model paradigm. Specifically, a cloud server maintains a global model and each client maintains a local model. The global model is trained via multiple iterations of communications between the clients and the server. In each iteration, three steps are performed: 1) the server sends the current global model to the clients; 2) the clients update their local models based on the global model and their local training data, and send the model updates to the server; and 3) the server aggregates the model updates and uses them to update the global model. The learnt global model is then used to predict labels of testing examples.

However, such a single-global-model paradigm is vulnerable to security attacks. In particular, an attacker can inject fake clients into federated learning or compromise existing clients, where we call the fake/compromised clients malicious clients. Such malicious clients can corrupt the global model via carefully tampering with their local training data or model updates sent to the server. As a result, the corrupted global model has a low accuracy for normal testing examples (Fang et al. 2020; Xie, Koyejo, and Gupta 2019) or certain attacker-chosen testing examples (Bagdasaryan et al. 2020; Bhagoji et al. 2019; Xie et al. 2020).
For instance, when learning an image classifier, the malicious clients can re-label cars with certain stripes as birds in their local training data and scale up their model updates sent to the server, such that the learnt global model incorrectly predicts a car with the stripes as a bird (Bagdasaryan et al. 2020).

Various Byzantine-robust federated learning methods have been proposed to defend against malicious clients (Blanchard et al. 2017; Chen, Su, and Xu 2017; Mhamdi, Guerraoui, and Rouault 2018; Yin et al. 2018, 2019; Chen et al. 2018; Alistarh, Allen-Zhu, and Li 2018). The main idea of these methods is to mitigate the impact of statistical outliers among the clients' model updates. They can bound the difference between the global model parameters learnt without malicious clients and the global model parameters learnt when some clients become malicious. However, these methods cannot provably guarantee that the label predicted by the global model for a testing example is not affected by malicious clients. Indeed, studies showed that malicious clients can still substantially degrade the testing accuracy of a global model learnt by a Byzantine-robust method via carefully tampering with their model updates sent to the server (Bhagoji et al. 2019; Fang et al. 2020; Xie, Koyejo, and Gupta 2019).

In this work, we propose ensemble federated learning, the first federated learning method that is provably secure against malicious clients. Specifically, given $n$ clients, we define a subsample as a set of $k$ clients sampled from the $n$ clients uniformly at random without replacement. For each subsample, we can learn a global model using a base federated learning algorithm with the $k$ clients in the subsample. Since there are $\binom{n}{k}$ subsamples with $k$ clients, $\binom{n}{k}$ global models can be trained in total. Suppose we are given a testing example $x$. We define $p_i$ as the fraction of the $\binom{n}{k}$ global models that predict label $i$ for $x$, where $i = 1, 2, \dots, L$. We call $p_i$ label probability. Our ensemble global model predicts the label with the largest label probability for $x$. In other words, our ensemble global model takes a majority vote among the global models to predict the label for $x$. Since each global model is learnt using a subsample with $k$ clients, a majority of the global models are learnt using normal clients when most clients are normal. Therefore, the majority vote among the global models is secure against a bounded number of malicious clients.

Theory: Our first major theoretical result is that our ensemble global model provably predicts the same label for a testing example $x$ when the number of malicious clients is no larger than a threshold, which we call the certified security level. Our second major theoretical result is that we prove our derived certified security level is tight, i.e., when no assumptions are made on the base federated learning algorithm, it is impossible to derive a certified security level that is larger than ours. Note that the certified security level may be different for different testing examples.

Algorithm: Computing our certified security level for $x$ requires its largest and second largest label probabilities. When $\binom{n}{k}$ is small (e.g., the $n$ clients are dozens of organizations (Kairouz et al. 2019) and $k$ is small), we can compute the largest and second largest label probabilities exactly via training $\binom{n}{k}$ global models.
However, it is challenging to compute them exactly when $\binom{n}{k}$ is large. To address the computational challenge, we develop a Monte Carlo algorithm to estimate them with probabilistic guarantees via training $N$ instead of $\binom{n}{k}$ global models.

Evaluation: We empirically evaluate our method on the MNIST (LeCun, Cortes, and Burges 1998) and Human Activity Recognition (Anguita et al. 2013) datasets. We distribute the training examples in MNIST to clients to simulate federated learning scenarios, while the Human Activity Recognition dataset represents a real-world federated learning scenario, where each user is a client. We use the popular FedAvg developed by Google (McMahan et al. 2017) as the base federated learning algorithm. Moreover, we use certified accuracy as our evaluation metric, which is a lower bound of the testing accuracy that a method can provably achieve no matter how the malicious clients tamper with their local training data and model updates. For instance, our ensemble FedAvg with $N = 500$ and $k = 10$ can achieve a certified accuracy of 88% on MNIST when the training examples are evenly distributed among 1,000 clients and 20 of them are malicious.

In summary, our key contributions are as follows:

- Theory: We propose ensemble federated learning, the first provably secure federated learning method against malicious clients. We derive a certified security level for our ensemble federated learning. Moreover, we prove that our derived certified security level is tight.
- Algorithm: We propose a Monte Carlo algorithm to compute our certified security level in practice.
- Evaluation: We evaluate our methods on the MNIST and Human Activity Recognition datasets.

For space reasons, all proofs appear in (Cao, Jia, and Gong 2021).

## Background on Federated Learning

Suppose we have $n$ clients $C = \{1, 2, \dots, n\}$ and a cloud server in a federated learning setting. The $i$th client holds a local training dataset $D_i$, where $i = 1, 2, \dots, n$. Existing federated learning methods (Konečný et al. 2016; McMahan et al. 2017; Wang et al. 2020; Li et al. 2020b) mainly focus on learning a single global model for the $n$ clients. Specifically, the server maintains a global model and each client maintains a local model. Then, federated learning iteratively performs the following three steps, which are shown in Algorithm 1. In Step I, the server sends the current global model to the clients (the server may select a subset of clients, but we assume the server sends the global model to all clients for convenience). In Step II, each client trains a local model via fine-tuning the global model on its local training dataset. In particular, each client performs localIter iterations of stochastic gradient descent with a learning rate $\eta$ to train its local model. Then, each client sends its model update (i.e., the difference between the local model and the global model) to the server. In Step III, the server aggregates the clients' model updates according to some aggregation rule Agg and uses the aggregated model update to update the global model. The three steps are repeated for globalIter iterations.

Algorithm 1: Single-global-model federated learning
1: Input: $C$, globalIter, localIter, $\eta$, Agg.
2: Output: Global model $w$.
3: $w \leftarrow$ random initialization.
4: for $\text{Iter}_{\text{global}} = 1, 2, \dots, \text{globalIter}$ do
5:     /* Step I */
6:     The server sends $w$ to the clients.
7:     /* Step II */
8:     for $i \in C$ do
9:         $w_i \leftarrow w$.
10:        for $\text{Iter}_{\text{local}} = 1, 2, \dots, \text{localIter}$ do
11:            Sample a Batch from the local training data $D_i$.
12:            $w_i \leftarrow w_i - \eta \nabla \text{Loss}(\text{Batch}; w_i)$.
13:        end for
14:        Send $g_i = w_i - w$ to the server.
15:    end for
16:    /* Step III */
17:    $g \leftarrow \text{Agg}(g_1, g_2, \dots, g_{|C|})$.
18:    $w \leftarrow w + \eta \cdot g$.
19: end for
20: return $w$.
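To make the three steps concrete, below is a minimal NumPy sketch of Algorithm 1 with a FedAvg-style weighted average as the aggregation rule Agg. The linear model, squared loss, and toy data are illustrative assumptions for this sketch only and are not the models or settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, local_iter=5, eta=0.01, batch_size=8):
    """Step II: fine-tune the global model w on one client's data and return
    the model update g_i = w_i - w (linear regression with squared loss is an
    illustrative stand-in for the paper's CNN/DNN)."""
    w_i = w.copy()
    for _ in range(local_iter):
        idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w_i - yb) / len(yb)  # gradient of mean squared loss
        w_i -= eta * grad
    return w_i - w

def single_global_model(client_data, dim, global_iter=100, eta=1.0):
    """Algorithm 1 with a FedAvg-style Agg: the clients' model updates are
    averaged with weights proportional to their local dataset sizes."""
    w = np.zeros(dim)                                   # initialization
    sizes = np.array([len(y) for _, y in client_data], dtype=float)
    weights = sizes / sizes.sum()
    for _ in range(global_iter):
        # Step I: send w to the clients; Step II: collect their updates
        updates = [local_update(w, X, y) for X, y in client_data]
        # Step III: aggregate the updates and update the global model
        g = sum(wt * g_i for wt, g_i in zip(weights, updates))
        w += eta * g
    return w

# toy usage: 5 clients, each holding noisy samples of the same linear model
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(5):
    X = rng.normal(size=(40, 3))
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=40)))
print(single_global_model(clients, dim=3))  # should be close to w_true
```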
Existing federated learning algorithms essentially use different aggregation rules in Step III. For instance, Google developed FedAvg (McMahan et al. 2017), which computes the average of the clients' model updates weighted by the sizes of their local training datasets as the aggregated model update to update the global model. We call such a federated learning algorithm that learns a single global model a base federated learning algorithm and denote it as $\mathcal{A}$. Note that given any subset of the $n$ clients $C$, a base federated learning algorithm can learn a global model for them. Specifically, the server learns a global model via iteratively performing the three steps between the server and the given subset of clients.

## Our Ensemble Federated Learning

Unlike single-global-model federated learning, our ensemble federated learning trains multiple global models, each of which is trained using the base algorithm $\mathcal{A}$ and a subsample with $k$ clients sampled from the $n$ clients uniformly at random without replacement. Among the $n$ clients $C$, we have $\binom{n}{k}$ subsamples with $k$ clients. Therefore, $\binom{n}{k}$ global models can be trained in total if we train a global model using each subsample. For a given testing input $x$, these global models may predict different labels for it. We define $p_i$ as the fraction of the $\binom{n}{k}$ global models that predict label $i$ for $x$, where $i = 1, 2, \dots, L$. We call $p_i$ label probability. Note that $p_i$ is an integer multiple of $1/\binom{n}{k}$, which we will leverage to derive a tight security guarantee of ensemble federated learning. Moreover, $p_i$ can also be viewed as the probability that a global model trained on a random subsample with $k$ clients predicts label $i$ for $x$. Our ensemble global model predicts the label with the largest label probability for $x$, i.e., we define:

$$h(C, x) = \arg\max_i p_i, \quad (1)$$

where $h$ is our ensemble global model and $h(C, x)$ is the label that our ensemble global model predicts for $x$ when the ensemble global model is trained on clients $C$.

Defining provable security guarantees against malicious clients: Suppose some of the $n$ clients $C$ become malicious. These malicious clients can arbitrarily tamper with their local training data and model updates sent to the server in each iteration of federated learning. We denote by $C'$ the set of $n$ clients with malicious ones. Moreover, we denote by $M(C')$ the number of malicious clients in $C'$, e.g., $M(C') = m$ means that $m$ clients are malicious. Note that we don't know which clients are malicious. For a testing example $x$, our goal is to show that our ensemble global model $h$ provably predicts the same label for $x$ when the number of malicious clients is bounded. Formally, we aim to show the following:

$$h(C', x) = h(C, x), \quad \forall C',\ M(C') \le m^*, \quad (2)$$

where $h(C', x)$ is the label that the ensemble global model trained on the clients $C'$ predicts for $x$. We call $m^*$ the certified security level. When a global model satisfies Equation (2) for a testing example $x$, we say the global model achieves a provable security guarantee for $x$ with a certified security level $m^*$. Note that the certified security level may be different for different testing examples. Next, we derive the certified security level of our ensemble global model.
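The following sketch illustrates the ensemble construction and the majority vote in Equation (1). The nearest-centroid `base_algorithm` is a hypothetical stand-in for an arbitrary base federated learning algorithm $\mathcal{A}$; only the subsampling and voting logic mirror the description above.

```python
import numpy as np
from itertools import combinations
from collections import Counter

rng = np.random.default_rng(0)

def base_algorithm(client_datasets):
    """Stand-in for any base federated learning algorithm A: fit a
    nearest-centroid classifier on the chosen clients' pooled data
    (an illustrative assumption, not the paper's FedAvg)."""
    X = np.vstack([X_i for X_i, _ in client_datasets])
    y = np.concatenate([y_i for _, y_i in client_datasets])
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    return lambda x: min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

def ensemble_predict(clients, k, x, n_models=None):
    """Train one global model per subsample of k clients and take a majority
    vote for x.  With n_models=None all C(n, k) subsamples are used (exact
    label probabilities); otherwise n_models random subsamples approximate them."""
    n = len(clients)
    if n_models is None:
        subsamples = list(combinations(range(n), k))          # all C(n, k)
    else:
        subsamples = [rng.choice(n, size=k, replace=False) for _ in range(n_models)]
    votes = Counter()
    for s in subsamples:
        g = base_algorithm([clients[i] for i in s])
        votes[g(x)] += 1
    # label probability p_i = fraction of global models voting for label i
    probs = {c: v / len(subsamples) for c, v in votes.items()}
    return max(probs, key=probs.get), probs

# toy usage: 6 clients, each holding samples from two Gaussian classes
clients = []
for _ in range(6):
    X0 = rng.normal(loc=-1.0, size=(20, 2))
    X1 = rng.normal(loc=+1.0, size=(20, 2))
    clients.append((np.vstack([X0, X1]), np.array([0] * 20 + [1] * 20)))
label, probs = ensemble_predict(clients, k=3, x=np.array([0.9, 1.1]))
print(label, probs)  # expect label 1 with a label probability near 1
```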
Deriving certified security level using exact label probabilities: Suppose we are given a testing example $x$. Assume that, when there are no malicious clients, our ensemble global model predicts label $y$ for $x$, $p_y$ is the largest label probability, and $p_z$ is the second largest label probability. Moreover, we denote by $p'_y$ and $p'_z$ the label probabilities for $y$ and $z$, respectively, in the ensemble global model when there are malicious clients. Suppose $m$ clients become malicious. Then, a fraction $1 - \binom{n-m}{k}/\binom{n}{k}$ of the subsamples with $k$ clients include at least one malicious client. In the worst-case scenario, for each global model learnt using a subsample including at least one malicious client, its predicted label for $x$ changes from $y$ to $z$. Therefore, in the worst-case scenario, the $m$ malicious clients decrease the largest label probability $p_y$ by $1 - \binom{n-m}{k}/\binom{n}{k}$ and increase the second largest label probability $p_z$ by $1 - \binom{n-m}{k}/\binom{n}{k}$, i.e., we have $p'_y = p_y - \left(1 - \binom{n-m}{k}/\binom{n}{k}\right)$ and $p'_z = p_z + \left(1 - \binom{n-m}{k}/\binom{n}{k}\right)$. Our ensemble global model still predicts label $y$ for $x$, i.e., $h(C', x) = h(C, x) = y$, once $m$ satisfies the following inequality:

$$p'_y > p'_z \iff p_y - p_z > 2 - \frac{2\binom{n-m}{k}}{\binom{n}{k}}. \quad (3)$$

In other words, the largest integer $m$ that satisfies inequality (3) is our certified security level $m^*$ for the testing example $x$. Inequality (3) shows that our certified security level is related to the gap $p_y - p_z$ between the largest and second largest label probabilities in the ensemble global model trained on the clients $C$ without malicious ones. For instance, when a testing example has a larger gap $p_y - p_z$, inequality (3) may be satisfied by a larger $m$, which means that our ensemble global model may have a larger certified security level for the testing example.

Deriving certified security level using approximate label probabilities: When $\binom{n}{k}$ is small (e.g., several hundred), we can compute the exact label probabilities $p_y$ and $p_z$ via training $\binom{n}{k}$ global models, and compute the certified security level via inequality (3). However, when $\binom{n}{k}$ is large, it is computationally challenging to compute the exact label probabilities via training $\binom{n}{k}$ global models. For instance, when $n = 100$ and $k = 10$, there are already $1.73 \times 10^{13}$ global models, training all of which is computationally intractable in practice. Therefore, we also derive the certified security level using a lower bound $\underline{p_y}$ of $p_y$ (i.e., $\underline{p_y} \le p_y$) and an upper bound $\overline{p_z}$ of $p_z$ (i.e., $\overline{p_z} \ge p_z$). We use a lower bound of $p_y$ and an upper bound of $p_z$ because our certified security level is related to the gap $p_y - p_z$ and we aim to estimate a lower bound of the gap. The lower bound $\underline{p_y}$ and upper bound $\overline{p_z}$ may be estimated by different methods. For instance, in the next section, we propose a Monte Carlo algorithm to estimate a lower bound $\underline{p_y}$ and an upper bound $\overline{p_z}$ via training only $N$ of the $\binom{n}{k}$ global models. Next, we derive our certified security level based on the probability bounds $\underline{p_y}$ and $\overline{p_z}$. One way is to replace $p_y$ and $p_z$ in inequality (3) with $\underline{p_y}$ and $\overline{p_z}$, respectively. Formally, we have the following inequality:

$$\underline{p_y} - \overline{p_z} > 2 - \frac{2\binom{n-m}{k}}{\binom{n}{k}}. \quad (4)$$

If an $m$ satisfies inequality (4), then the $m$ also satisfies inequality (3), because $p_y - p_z \ge \underline{p_y} - \overline{p_z}$. Therefore, we can find the largest integer $m$ that satisfies inequality (4) as the certified security level $m^*$.
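Both inequality (3) and inequality (4) bound the same quantity: the largest $m$ for which the probability gap exceeds $2 - 2\binom{n-m}{k}/\binom{n}{k}$. Below is a minimal sketch of that search; the numbers in the example are illustrative, not results from the paper.

```python
from math import comb


def certified_security_level(gap, n, k):
    """Largest m such that gap > 2 - 2*C(n-m, k)/C(n, k), where gap is
    p_y - p_z for inequality (3) (exact label probabilities) or the
    lower-bound gap for inequality (4)."""
    total = comb(n, k)
    m = 0
    while m + 1 <= n - k and gap > 2 - 2 * comb(n - (m + 1), k) / total:
        m += 1
    return m


# e.g., n = 30 clients, subsamples of size k = 2, and a testing example
# with exact label probabilities p_y = 0.9 and p_z = 0.1 (illustrative)
print(certified_security_level(0.9 - 0.1, n=30, k=2))  # -> 6
```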
However, the certified security level $m^*$ derived based on inequality (4) is not tight, i.e., our ensemble global model may still predict label $y$ for $x$ even if the number of malicious clients is larger than the $m^*$ derived based on inequality (4). The key reason is that the label probabilities are integer multiples of $1/\binom{n}{k}$. Therefore, we normalize $\underline{p_y}$ and $\overline{p_z}$ into integer multiples of $1/\binom{n}{k}$ to derive a tight certified security level. Specifically, we derive the certified security level as the largest integer $m$ that satisfies the following inequality (formally described in Theorem 1):

$$\frac{\left\lceil \underline{p_y}\binom{n}{k} \right\rceil - \left\lfloor \overline{p_z}\binom{n}{k} \right\rfloor}{\binom{n}{k}} > 2 - \frac{2\binom{n-m}{k}}{\binom{n}{k}}. \quad (5)$$

Figure 1 illustrates the relationships between $p_y$, $\underline{p_y}$, and $\lceil \underline{p_y}\binom{n}{k}\rceil / \binom{n}{k}$ as well as $p_z$, $\overline{p_z}$, and $\lfloor \overline{p_z}\binom{n}{k}\rfloor / \binom{n}{k}$. When an $m$ satisfies inequality (4), the $m$ also satisfies inequality (5), because $\underline{p_y} - \overline{p_z} \le \lceil \underline{p_y}\binom{n}{k}\rceil / \binom{n}{k} - \lfloor \overline{p_z}\binom{n}{k}\rfloor / \binom{n}{k}$. Therefore, the certified security level derived based on inequality (4) is smaller than or equal to the certified security level derived based on inequality (5). Note that when $\underline{p_y} = p_y$ and $\overline{p_z} = p_z$, both (4) and (5) reduce to (3), as the label probabilities are integer multiples of $1/\binom{n}{k}$. The following theorem formally summarizes our certified security level.

Theorem 1. Given $n$ clients $C$, an arbitrary base federated learning algorithm $\mathcal{A}$, a subsample size $k$, and a testing example $x$, we define an ensemble global model $h$ as in Equation (1). $y$ and $z$ are the labels that have the largest and second largest label probabilities for $x$ in the ensemble global model. $\underline{p_y}$ is a lower bound of $p_y$ and $\overline{p_z}$ is an upper bound of $p_z$. Formally, $\underline{p_y}$ and $\overline{p_z}$ satisfy the following condition:

$$\max_{i \ne y} p_i = p_z \le \overline{p_z} \le \underline{p_y} \le p_y. \quad (6)$$

Then, $h$ provably predicts $y$ for $x$ when at most $m^*$ clients in $C$ become malicious, i.e., we have:

$$h(C', x) = h(C, x) = y, \quad \forall C',\ M(C') \le m^*, \quad (7)$$

where $m^*$ is the largest integer $m$ ($0 \le m \le n - k$) that satisfies inequality (5).

[Figure 1: An example to illustrate the relationships between $p_y$, $\underline{p_y}$, and $\lceil \underline{p_y}\binom{n}{k}\rceil / \binom{n}{k}$ as well as $p_z$, $\overline{p_z}$, and $\lfloor \overline{p_z}\binom{n}{k}\rfloor / \binom{n}{k}$.]

Our Theorem 1 is applicable to any base federated learning algorithm, and to any lower bound $\underline{p_y}$ of $p_y$ and any upper bound $\overline{p_z}$ of $p_z$ that satisfy (6). When the lower bound $\underline{p_y}$ and upper bound $\overline{p_z}$ are estimated more accurately, i.e., $\underline{p_y}$ and $\overline{p_z}$ are respectively closer to $p_y$ and $p_z$, our certified security level may be larger. The following theorem shows that our derived certified security level is tight, i.e., when no assumptions on the base federated learning algorithm are made, it is impossible to derive a certified security level that is larger than ours for the given probability bounds $\underline{p_y}$ and $\overline{p_z}$.

Theorem 2. Suppose $\underline{p_y} + \overline{p_z} \le 1$. For any $C'$ satisfying $M(C') > m^*$, i.e., at least $m^* + 1$ clients are malicious, there exists a base federated learning algorithm $\mathcal{A}^*$ that satisfies (6) but $h(C', x) \ne y$ or there exist ties.

## Computing the Certified Security Level

Suppose we are given $n$ clients $C$, a base federated learning algorithm $\mathcal{A}$, a subsample size $k$, and a testing dataset $D$ with $d$ testing examples. For each testing example $x_t$ in $D$, we aim to compute its label $\hat{y}_t$ predicted by our ensemble global model $h$ and the corresponding certified security level $\hat{m}^*_t$. To compute the certified security level based on our Theorem 1, we need a lower bound $\underline{p}_{\hat{y}_t}$ of the largest label probability $p_{\hat{y}_t}$ and an upper bound $\overline{p}_{\hat{z}_t}$ of the second largest label probability $p_{\hat{z}_t}$. When $\binom{n}{k}$ is small, we can compute the exact label probabilities via training $\binom{n}{k}$ global models.
When $\binom{n}{k}$ is large, we propose a Monte Carlo algorithm to estimate the predicted label and the two probability bounds for all testing examples in $D$ simultaneously with a confidence level $1 - \alpha$ via training $N$ of the $\binom{n}{k}$ global models.

Computing the predicted label and probability bounds for one testing example: We first discuss how to compute the predicted label $\hat{y}_t$ and probability bounds $\underline{p}_{\hat{y}_t}$ and $\overline{p}_{\hat{z}_t}$ for one testing example $x_t$. We sample $N$ subsamples with $k$ clients from the $n$ clients uniformly at random without replacement and use them to train $N$ global models $g_1, g_2, \dots, g_N$. We use the $N$ global models to predict labels for $x_t$ and count the frequency of each label. We treat the label with the largest frequency as the predicted label $\hat{y}_t$. Recall that, based on the definition of label probability, a global model trained on a random subsample with $k$ clients predicts label $\hat{y}_t$ for $x_t$ with the label probability $p_{\hat{y}_t}$. Therefore, the frequency $N_{\hat{y}_t}$ of the label $\hat{y}_t$ among the $N$ global models follows a binomial distribution $B(N, p_{\hat{y}_t})$ with parameters $N$ and $p_{\hat{y}_t}$. Thus, given $N_{\hat{y}_t}$ and $N$, we can use the standard one-sided Clopper-Pearson method (Clopper and Pearson 1934) to estimate a lower bound $\underline{p}_{\hat{y}_t}$ of $p_{\hat{y}_t}$ with a confidence level $1 - \alpha$. Specifically, we have $\underline{p}_{\hat{y}_t} = B(\alpha; N_{\hat{y}_t}, N - N_{\hat{y}_t} + 1)$, where $B(q; v, w)$ is the $q$th quantile of a beta distribution with shape parameters $v$ and $w$. Moreover, we can estimate $\overline{p}_{\hat{z}_t} = 1 - \underline{p}_{\hat{y}_t} \ge 1 - p_{\hat{y}_t} \ge p_{\hat{z}_t}$ as an upper bound of $p_{\hat{z}_t}$.

Computing predicted labels and probability bounds for d testing examples: One method to compute the predicted labels and probability bounds for the $d$ testing examples is to apply the above process to each testing example individually. However, such a method is computationally intractable because it requires training $N$ global models for every testing example. To address the computational challenge, we propose a method that only needs to train $N$ global models in total. Our idea is to split $\alpha$ among the $d$ testing examples. Specifically, we follow the above process to train $N$ global models and use them to predict labels for the $d$ testing examples. For each testing example $x_t$, we estimate the lower bound $\underline{p}_{\hat{y}_t} = B(\alpha/d; N_{\hat{y}_t}, N - N_{\hat{y}_t} + 1)$ with confidence level $1 - \alpha/d$ instead of $1 - \alpha$. According to the Bonferroni correction, the simultaneous confidence level of estimating the lower bounds for the $d$ testing examples is $1 - \alpha$. Following the above process, we still estimate $\overline{p}_{\hat{z}_t} = 1 - \underline{p}_{\hat{y}_t}$ as an upper bound of $p_{\hat{z}_t}$ for each testing example.

Complete algorithm: Algorithm 2 shows our algorithm to compute the predicted labels and certified security levels for the $d$ testing examples in $D$. The function SAMPLE&TRAIN randomly samples $N$ subsamples with $k$ clients and trains $N$ global models using the base federated learning algorithm $\mathcal{A}$. Given the probability bounds $\underline{p}_{\hat{y}_t}$ and $\overline{p}_{\hat{z}_t}$ for a testing example $x_t$, the function SEARCHLEVEL finds the certified security level $\hat{m}^*_t$ via finding the largest integer $m$ that satisfies (5). For example, SEARCHLEVEL can simply start $m$ from 0 and iteratively increase it by one until finding $\hat{m}^*_t$.

Algorithm 2: Computing Predicted Label and Certified Security Level
1: Input: $C$, $\mathcal{A}$, $k$, $N$, $D$, $\alpha$.
2: Output: Predicted label and certified security level for each testing example in $D$.
3: $g_1, g_2, \dots, g_N \leftarrow$ SAMPLE&TRAIN($C$, $\mathcal{A}$, $k$, $N$)
4: for $x_t$ in $D$ do
5:     counts[$i$] $\leftarrow \sum_{l=1}^{N} \mathbb{I}(g_l(x_t) = i)$, $i \in \{1, 2, \dots, L\}$  /* $\mathbb{I}$ is the indicator function */
6:     $\hat{y}_t \leftarrow$ index of the largest entry in counts (ties are broken uniformly at random)
7:     $\underline{p}_{\hat{y}_t} \leftarrow B(\alpha/d;\ N_{\hat{y}_t},\ N - N_{\hat{y}_t} + 1)$
8:     $\overline{p}_{\hat{z}_t} \leftarrow 1 - \underline{p}_{\hat{y}_t}$
9:     if $\underline{p}_{\hat{y}_t} > \overline{p}_{\hat{z}_t}$ then
10:        $\hat{m}^*_t \leftarrow$ SEARCHLEVEL($\underline{p}_{\hat{y}_t}$, $\overline{p}_{\hat{z}_t}$, $k$, $|C|$)
11:    else
12:        $\hat{y}_t \leftarrow$ ABSTAIN, $\hat{m}^*_t \leftarrow$ ABSTAIN
13:    end if
14: end for
15: return $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_d$ and $\hat{m}^*_1, \hat{m}^*_2, \dots, \hat{m}^*_d$
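Below is a minimal sketch of the estimation and search described above, assuming SciPy for the Beta quantile. The `certify` function corresponds to the per-example body of Algorithm 2: a Clopper-Pearson lower bound with the Bonferroni split $\alpha/d$, the upper bound $1 - \underline{p}$, and a SEARCHLEVEL-style scan over inequality (5). The function names and the example counts are illustrative assumptions.

```python
from fractions import Fraction
from math import ceil, comb, floor

from scipy.stats import beta


def lower_bound(N_y, N, alpha, d):
    """One-sided Clopper-Pearson lower bound on the largest label probability,
    with the Bonferroni split alpha/d so that the d per-example bounds hold
    simultaneously with confidence 1 - alpha."""
    if N_y == 0:
        return 0.0
    return float(beta.ppf(alpha / d, N_y, N - N_y + 1))


def search_level(p_y_lower, p_z_upper, n, k):
    """SEARCHLEVEL-style scan: the largest m (0 <= m <= n - k) satisfying the
    tight condition (5).  Exact rational arithmetic keeps the ceil/floor
    normalization meaningful even when C(n, k) is astronomically large."""
    total = comb(n, k)
    lhs = Fraction(ceil(Fraction(p_y_lower) * total)
                   - floor(Fraction(p_z_upper) * total), total)
    m = 0
    while m + 1 <= n - k and lhs > 2 - Fraction(2 * comb(n - (m + 1), k), total):
        m += 1
    return m


def certify(votes, N, n, k, alpha, d):
    """Per-example body of Algorithm 2, given votes[label] = number of the N
    sampled global models predicting that label for this testing example.
    Returns (ABSTAIN, ABSTAIN) when the bounds overlap."""
    y_hat = max(votes, key=votes.get)
    p_y_lower = lower_bound(votes[y_hat], N, alpha, d)
    p_z_upper = 1.0 - p_y_lower
    if p_y_lower > p_z_upper:
        return y_hat, search_level(p_y_lower, p_z_upper, n, k)
    return "ABSTAIN", "ABSTAIN"


# e.g., 470 of N = 500 sampled global models vote for label 3 when n = 1,000
# clients and k = 10 (illustrative counts, not results from the paper)
print(certify({3: 470, 5: 30}, N=500, n=1000, k=10, alpha=0.001, d=10000))
```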
Probabilistic guarantees: In Algorithm 2, since we estimate the lower bound $\underline{p}_{\hat{y}_t}$ using the Clopper-Pearson method, there is a probability that the estimated lower bound is incorrect, i.e., $\underline{p}_{\hat{y}_t} > p_{\hat{y}_t}$. When the lower bound is estimated incorrectly for a testing example $x_t$, the certified security level $\hat{m}^*_t$ outputted by Algorithm 2 for $x_t$ may also be incorrect, i.e., there may exist a $C'$ such that $M(C') \le \hat{m}^*_t$ but $h(C', x_t) \ne \hat{y}_t$. In other words, our Algorithm 2 has probabilistic guarantees for its outputted certified security levels. However, in the following theorem, we prove that the probability that Algorithm 2 returns an incorrect certified security level for at least one testing example is at most $\alpha$.

Theorem 3. The probability that Algorithm 2 returns an incorrect certified security level for at least one testing example in $D$ is bounded by $\alpha$, which is equivalent to:

$$\Pr\left(\bigcap_{x_t \in D}\left(h(C', x_t) = \hat{y}_t,\ \forall C' \text{ with } M(C') \le \hat{m}^*_t \,\middle|\, \hat{y}_t \ne \text{ABSTAIN}\right)\right) \ge 1 - \alpha. \quad (8)$$

Note that when the probability bounds are estimated deterministically, e.g., when $\binom{n}{k}$ is small and the exact label probabilities can be computed via training $\binom{n}{k}$ global models, the certified security level obtained from our Theorem 1 is also deterministic.

## Experiments

### Experimental Setup

Datasets, model architectures, and base algorithm: We use the MNIST (LeCun, Cortes, and Burges 1998) and Human Activity Recognition (HAR) (Anguita et al. 2013) datasets. MNIST is used to simulate federated learning scenarios, while HAR represents a real-world federated learning scenario. Specifically, MNIST has 60,000 training examples and 10,000 testing examples. We consider $n = 1,000$ clients and split them into 10 groups. We assign a training example with label $l$ to the $l$th group of clients with probability $q$ and to each remaining group with probability $\frac{1-q}{9}$. After assigning a training example to a group, we distribute it to a client in the group uniformly at random. The parameter $q$ controls the local training data distribution on the clients, and we call $q$ the degree of non-IID. $q = 0.1$ means that the clients' local training data are IID, while a larger $q$ indicates a larger degree of non-IID. By default, we set $q = 0.5$. However, we will study the impact of $q$ (degree of non-IID) on our method. HAR includes human activity data from 30 users, each of whom is a client. The task is to predict a user's activity based on the sensor signals (e.g., acceleration) collected from the user's smartphone. There are 6 possible activities (e.g., walking, sitting, and standing), indicating a 6-class classification problem. There are 10,299 examples in total and each example has 561 features. We use 75% of each user's examples as training examples and the rest as testing examples.
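For the MNIST partition described above, the following is a small sketch that distributes training examples to clients according to the degree of non-IID $q$; the group size and seed handling are illustrative assumptions.

```python
import numpy as np


def assign_to_clients(labels, n_clients=1000, n_groups=10, q=0.5, seed=0):
    """Distribute training examples to clients as in the MNIST setup above:
    an example with label l goes to group l with probability q and to each of
    the other groups with probability (1 - q) / (n_groups - 1), then to a
    uniformly random client inside the chosen group.  q = 0.1 gives IID local
    data; a larger q gives a larger degree of non-IID."""
    rng = np.random.default_rng(seed)
    clients_per_group = n_clients // n_groups
    client_of = np.empty(len(labels), dtype=int)
    for idx, label in enumerate(labels):
        probs = np.full(n_groups, (1.0 - q) / (n_groups - 1))
        probs[label] = q
        group = rng.choice(n_groups, p=probs)
        client_of[idx] = group * clients_per_group + rng.integers(clients_per_group)
    return client_of  # client_of[i] = index of the client holding example i


# toy usage with fake MNIST-like labels (10 classes, 60,000 examples)
labels = np.random.default_rng(1).integers(0, 10, size=60000)
assignment = assign_to_clients(labels, q=0.5)
print(np.bincount(assignment, minlength=1000)[:5])  # examples held by clients 0-4
```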
[Figure 2: FedAvg vs. ensemble FedAvg. Both panels plot certified accuracy @ m against the number of malicious clients m, on MNIST and HAR.]

[Figure 3: Impact of k on our ensemble FedAvg. Certified accuracy @ m against the number of malicious clients m, for k ∈ {10, 20, 50} on MNIST and k ∈ {2, 4, 6} on HAR.]

We consider a convolutional neural network (CNN) architecture (shown in (Cao, Jia, and Gong 2021)) for MNIST. For HAR, we consider a deep neural network (DNN) with two fully-connected hidden layers, each of which contains 256 neurons and uses ReLU as the activation function. We use the popular FedAvg (McMahan et al. 2017) as the base federated learning algorithm. Recall that a base federated learning algorithm has hyperparameters (shown in Algorithm 1): globalIter, localIter, learning rate $\eta$, and batch size. Table 1 summarizes these hyperparameters for FedAvg in our experiments. In particular, we set globalIter as in Table 1 because FedAvg converges with such settings.

| Dataset | MNIST | HAR |
| --- | --- | --- |
| Model architecture | CNN | DNN |
| Number of clients | 1,000 | 30 |
| globalIter | 3,000 | 5,000 |
| localIter | 5 | 5 |
| Learning rate $\eta$ | 0.001 | 0.001 |
| Batch size | 32 | 32 |

Table 1: Federated learning settings and hyperparameters.

Evaluation metric: We use certified accuracy as our evaluation metric. Specifically, we define the certified accuracy at $m$ malicious clients (denoted as CA@m) for a federated learning method as the fraction of testing examples in the testing dataset $D$ whose labels are correctly predicted by the method and whose certified security levels are at least $m$. Formally, we define CA@m as follows:

$$\text{CA@}m = \frac{\sum_{x_t \in D} \mathbb{I}(\hat{y}_t = y_t) \cdot \mathbb{I}(\hat{m}^*_t \ge m)}{|D|}, \quad (9)$$

where $\mathbb{I}$ is the indicator function, $y_t$ is the true label for $x_t$, and $\hat{y}_t$ and $\hat{m}^*_t$ are respectively the predicted label and certified security level for $x_t$. Intuitively, CA@m means that when at most $m$ clients are malicious, the accuracy of the federated learning method on $D$ is at least CA@m no matter what attacks the malicious clients use (i.e., no matter how the malicious clients tamper with their local training data and model updates). Note that CA@0 reduces to the standard accuracy when there are no malicious clients.

[Figure 4: Impact of N on our ensemble FedAvg. Certified accuracy @ m against the number of malicious clients m, for N ∈ {100, 500, 1000} on (a) MNIST and (b) HAR.]

[Figure 5: Impact of α on our ensemble FedAvg. Certified accuracy @ m against the number of malicious clients m, for α ∈ {0.1, 0.01, 0.001} on (a) MNIST and (b) HAR.]

When we can compute the exact label probabilities via training $\binom{n}{k}$ global models, the CA@m of our ensemble global model $h$ computed using the certified security levels derived from Theorem 1 is deterministic. When $\binom{n}{k}$ is large, we estimate predicted labels and certified security levels using Algorithm 2, and thus our CA@m has a confidence level $1 - \alpha$ according to Theorem 3.

Parameter settings: Our method has three parameters: $N$, $k$, and $\alpha$. Unless otherwise mentioned, we adopt the following default settings: $N = 500$, $\alpha = 0.001$, $k = 10$ for MNIST, and $k = 2$ for HAR. Under the default setting for HAR, we have $\binom{n}{k} = \binom{30}{2} = 435 < N = 500$, so we can compute the exact label probabilities via training 435 global models. Therefore, we have deterministic certified accuracy for HAR under the default setting. We will explore the impact of each parameter while using the default settings for the other two parameters. For HAR, we set $k = 4$ when exploring the impact of $N$ (i.e., Figure 4(b)) and $\alpha$ (i.e., Figure 5(b)), since the default setting $k = 2$ gives deterministic certified accuracy, making $N$ and $\alpha$ irrelevant.
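A small sketch of the CA@m metric in Equation (9); the predicted labels, certified levels, and ABSTAIN handling below are illustrative placeholders rather than outputs from the paper's experiments.

```python
def certified_accuracy(true_labels, predicted_labels, certified_levels, m):
    """CA@m from Equation (9): the fraction of testing examples whose label is
    correctly predicted and whose certified security level is at least m.
    Abstentions never count as correct."""
    hits = sum(
        1
        for y, y_hat, m_hat in zip(true_labels, predicted_labels, certified_levels)
        if y_hat == y and m_hat != "ABSTAIN" and m_hat >= m
    )
    return hits / len(true_labels)


# illustrative predictions and certified levels, not results from the paper
y_true = [3, 7, 1, 3]
y_pred = [3, 7, 2, "ABSTAIN"]
levels = [25, 10, 40, "ABSTAIN"]
print(certified_accuracy(y_true, y_pred, levels, m=20))  # -> 0.25
```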
### Experimental Results

Single-global-model FedAvg vs. ensemble FedAvg: Figure 2 compares single-global-model FedAvg and ensemble FedAvg with respect to certified accuracy on the two datasets. When there are no malicious clients (i.e., $m = 0$), single-global-model FedAvg is more accurate than ensemble FedAvg. This is because ensemble FedAvg uses a subsample of clients to train each global model. However, single-global-model FedAvg has 0 certified accuracy when just one client is malicious. This is because a single malicious client can arbitrarily manipulate the global model learnt by FedAvg (Blanchard et al. 2017). In contrast, the certified accuracy of ensemble FedAvg reduces to 0 only when as many as 61 clients (6.1%) on MNIST and 9 clients (30%) on HAR are malicious. Note that it is unknown whether existing Byzantine-robust federated learning methods have non-zero certified accuracy when $m > 0$, and thus we cannot compare ensemble FedAvg with them.

Impact of k, N, and α: Figures 3, 4, and 5 show the impact of $k$, $N$, and $\alpha$, respectively. $k$ achieves a trade-off between accuracy under no malicious clients and security under malicious clients. Specifically, when $k$ is larger, the ensemble global model is more accurate at $m = 0$, but the certified accuracy drops more quickly to 0 as $m$ increases. This is because when $k$ is larger, it is more likely for the sampled $k$ clients to include malicious ones. The certified accuracy increases as $N$ or $\alpha$ increases. This is because training more global models or using a larger $\alpha$ allows Algorithm 2 to estimate tighter probability bounds and larger certified security levels. When $N$ increases from 100 to 500, the certified accuracy increases significantly. However, when $N$ further grows to 1,000, the increase in certified accuracy is marginal. Our results show that we don't need to train too many global models in practice, as the certified accuracy saturates when $N$ is larger than some threshold.

[Figure 6: Impact of the degree of non-IID q on MNIST. Certified accuracy @ m against the number of malicious clients m, for q ∈ {0.1, 0.5, 0.9}.]

Impact of the degree of non-IID q: Figure 6 shows the certified accuracy of our ensemble FedAvg on MNIST when the clients' local training data have different degrees of non-IID. We observe that the certified accuracy drops when $q$ increases from 0.5 to 0.9, which represents a high degree of non-IID. However, the certified accuracy is still high when $m$ is small for $q = 0.9$, e.g., the certified accuracy is still 83% when $m = 10$. This is because although each global model trained using a subsample of clients is less accurate when the local training data are highly non-IID, the ensemble of multiple global models is still accurate.

## Related Work

In federated learning, the first category of studies (Smith et al. 2017; Li et al. 2020b; Wang et al. 2020; Liu et al. 2020; Peng et al. 2020) aims to design federated learning methods that can learn more accurate global models and/or analyze their convergence properties. For instance, FedMA (Wang et al. 2020) constructs the global model via matching and averaging the hidden elements in a neural network with similar feature extraction signatures. The second category of studies (Konečný et al. 2016; McMahan et al. 2017; Wen et al. 2017; Alistarh et al. 2017; Lee et al. 2017; Sahu et al.
2018; Bernstein et al. 2018; Vogels, Karimireddy, and Jaggi 2019; Yurochkin et al. 2019; Mohri, Sivek, and Suresh 2019; Wang et al. 2020; Li, Wen, and He 2020; Li et al. 2020c; Hamer, Mohri, and Suresh 2020; Rothchild et al. 2020; Malinovsky et al. 2020) aims to improve the communication efficiency between the clients and the server via sparsification, quantization, and/or encoding of the model updates sent from the clients to the server. The third category of studies (Bonawitz et al. 2017; Geyer, Klein, and Nabi 2017; Hitaj, Ateniese, and Perez-Cruz 2017; Melis et al. 2019; Zhu, Liu, and Han 2019; Mohri, Sivek, and Suresh 2019; Wang, Tong, and Shi 2020; Li et al. 2020a) aims to explore the privacy/fairness issues of federated learning and their defenses. These studies often assume a single global model is shared among the clients. Smith et al. (2017) proposed to learn a customized model for each client via multi-task learning. Our work is on the security of federated learning, which is orthogonal to the studies above.

Multiple studies (Fang et al. 2020; Bagdasaryan et al. 2020; Xie, Koyejo, and Gupta 2019; Bhagoji et al. 2019) showed that the global model's accuracy can be significantly downgraded by malicious clients. Existing defenses against malicious clients leverage Byzantine-robust aggregation rules such as Krum (Blanchard et al. 2017), trimmed mean (Yin et al. 2018), coordinate-wise median (Yin et al. 2018), and Bulyan (Mhamdi, Guerraoui, and Rouault 2018). However, they cannot provably guarantee that the global model's predicted label for a testing example is not affected by malicious clients. As a result, they may be broken by strong attacks that carefully craft the model updates sent from the malicious clients to the server, e.g., (Fang et al. 2020). We propose ensemble federated learning, whose predicted label for a testing example is provably not affected by a bounded number of malicious clients.

We note that ensemble methods were also proposed as provably secure defenses (e.g., (Jia, Cao, and Gong 2021)) against data poisoning attacks. However, they are insufficient to defend against malicious clients that can manipulate both the local training data and the model updates. In particular, a provably secure defense against data poisoning attacks guarantees that the label predicted for a testing example is unaffected by a bounded number of poisoned training examples. However, a single malicious client can poison an arbitrary number of its local training examples, breaking the assumption of provably secure defenses against data poisoning attacks.

In this work, we propose ensemble federated learning and derive its tight provable security guarantee against malicious clients. Moreover, we propose an algorithm to compute the certified security levels. Our empirical results on two datasets show that our ensemble federated learning can effectively defend against malicious clients with provable security guarantees. Interesting future work includes estimating the probability bounds deterministically and considering the internal structure of a base federated learning algorithm to further improve our provable security guarantees.

## Acknowledgements

We thank the anonymous reviewers for their insightful reviews. This work was supported by NSF grant No. 1937786.

## References

Alistarh, D.; Allen-Zhu, Z.; and Li, J. 2018. Byzantine stochastic gradient descent. In NeurIPS.
Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; and Vojnovic, M. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NeurIPS.

Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; and Reyes-Ortiz, J. L. 2013. A public domain dataset for human activity recognition using smartphones. In ESANN.

Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; and Shmatikov, V. 2020. How to backdoor federated learning. In AISTATS.

Bernstein, J.; Wang, Y.-X.; Azizzadenesheli, K.; and Anandkumar, A. 2018. signSGD: Compressed Optimisation for Non-Convex Problems. In ICML.

Bhagoji, A.; Chakraborty, S.; Mittal, P.; and Calo, S. 2019. Analyzing Federated Learning through an Adversarial Lens. In ICML.

Blanchard, P.; Mhamdi, E. M. E.; Guerraoui, R.; and Stainer, J. 2017. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In NeurIPS.

Bonawitz, K.; Ivanov, V.; Kreuter, B.; Marcedone, A.; McMahan, H. B.; Patel, S.; Ramage, D.; Segal, A.; and Seth, K. 2017. Practical secure aggregation for privacy-preserving machine learning. In CCS.

Cao, X.; Jia, J.; and Gong, N. Z. 2021. Provably Secure Federated Learning against Malicious Clients. arXiv preprint arXiv:2102.01854.

Chen, L.; Wang, H.; Charles, Z.; and Papailiopoulos, D. 2018. DRACO: Byzantine-resilient Distributed Training via Redundant Gradients. In ICML.

Chen, Y.; Su, L.; and Xu, J. 2017. Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent. In POMACS.

Clopper, C. J.; and Pearson, E. S. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26(4): 404-413.

Fang, M.; Cao, X.; Jia, J.; and Gong, N. Z. 2020. Local model poisoning attacks to Byzantine-robust federated learning. In USENIX Security.

Geyer, R. C.; Klein, T.; and Nabi, M. 2017. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557.

Hamer, J.; Mohri, M.; and Suresh, A. T. 2020. FedBoost: Communication-Efficient Algorithms for Federated Learning. In ICML.

Hitaj, B.; Ateniese, G.; and Perez-Cruz, F. 2017. Deep models under the GAN: information leakage from collaborative deep learning. In CCS.

Jia, J.; Cao, X.; and Gong, N. Z. 2021. Intrinsic certified robustness of bagging against data poisoning attacks. In AAAI.

Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. 2019. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.

Konečný, J.; McMahan, H. B.; Yu, F. X.; Richtárik, P.; Suresh, A. T.; and Bacon, D. 2016. Federated learning: Strategies for improving communication efficiency. In NeurIPS Workshop on Private Multi-Party Machine Learning.

LeCun, Y.; Cortes, C.; and Burges, C. 1998. MNIST handwritten digit database. URL http://yann.lecun.com/exdb/mnist. Last accessed on Feb 21, 2021.

Lee, K.; Lam, M.; Pedarsani, R.; Papailiopoulos, D.; and Ramchandran, K. 2017. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory 64(3): 1514-1529.

Li, Q.; Wen, Z.; and He, B. 2020. Practical Federated Gradient Boosting Decision Trees. In AAAI.

Li, T.; Sanjabi, M.; Beirami, A.; and Smith, V. 2020a. Fair Resource Allocation in Federated Learning. In ICLR.

Li, X.; Huang, K.; Yang, W.; Wang, S.; and Zhang, Z. 2020b. On the convergence of FedAvg on non-IID data. In ICLR.

Li, Z.; Kovalev, D.; Qian, X.; and Richtárik, P. 2020c. Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization. In ICML.

Liu, F.; Wu, X.; Ge, S.; Fan, W.; and Zou, Y. 2020. Federated Learning for Vision-and-Language Grounding Problems. In AAAI.

Malinovsky, G.; Kovalev, D.; Gasanov, E.; Condat, L.; and Richtárik, P. 2020. From Local SGD to Local Fixed Point Methods for Federated Learning. In ICML.

McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; et al. 2017. Communication-efficient learning of deep networks from decentralized data. In AISTATS.

Melis, L.; Song, C.; De Cristofaro, E.; and Shmatikov, V. 2019. Exploiting unintended feature leakage in collaborative learning. In IEEE S&P.

Mhamdi, E. M. E.; Guerraoui, R.; and Rouault, S. 2018. The Hidden Vulnerability of Distributed Learning in Byzantium. In ICML.

Mohri, M.; Sivek, G.; and Suresh, A. T. 2019. Agnostic Federated Learning. In ICML.

Peng, X.; Huang, Z.; Zhu, Y.; and Saenko, K. 2020. Federated Adversarial Domain Adaptation. In ICLR.

Rothchild, D.; Panda, A.; Ullah, E.; Ivkin, N.; Stoica, I.; Braverman, V.; Gonzalez, J.; and Arora, R. 2020. FetchSGD: Communication-Efficient Federated Learning with Sketching. In ICML.

Sahu, A. K.; Li, T.; Sanjabi, M.; Zaheer, M.; Talwalkar, A.; and Smith, V. 2018. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.

Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In NeurIPS.

Vogels, T.; Karimireddy, S. P.; and Jaggi, M. 2019. PowerSGD: Practical low-rank gradient compression for distributed optimization. In NeurIPS.

Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D.; and Khazaeni, Y. 2020. Federated Learning with Matched Averaging. In ICLR.

Wang, Y.; Tong, Y.; and Shi, D. 2020. Federated Latent Dirichlet Allocation: A Local Differential Privacy Based Framework. In AAAI.

Wen, W.; Xu, C.; Yan, F.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2017. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In NeurIPS.

Xie, C.; Huang, K.; Chen, P.-Y.; and Li, B. 2020. DBA: Distributed Backdoor Attacks against Federated Learning. In ICLR.

Xie, C.; Koyejo, S.; and Gupta, I. 2019. Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation. In UAI.

Yin, D.; Chen, Y.; Kannan, R.; and Bartlett, P. 2019. Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning. In ICML.

Yin, D.; Chen, Y.; Ramchandran, K.; and Bartlett, P. 2018. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. In ICML.

Yurochkin, M.; Agarwal, M.; Ghosh, S.; Greenewald, K.; Hoang, N.; and Khazaeni, Y. 2019. Bayesian Nonparametric Federated Learning of Neural Networks. In ICML.

Zhu, L.; Liu, Z.; and Han, S. 2019. Deep leakage from gradients. In NeurIPS.