Towards Understanding Ensemble Distillation in Federated Learning

Sejun Park 1  Kihun Hong 1  Ganguk Hwang 1

Federated Learning (FL) is a collaborative machine learning paradigm for data privacy preservation. Recently, a knowledge distillation (KD) based information sharing approach in FL, which conducts ensemble distillation on an unlabeled public dataset, has been proposed. However, despite its experimental success and usefulness, the theoretical analysis of the KD based approach has not been satisfactorily conducted. In this work, we build a theoretical foundation of the ensemble distillation framework in federated learning from the perspective of kernel ridge regression (KRR). To this end, we propose a KD based FL algorithm for KRR models which is related to some existing KD based FL algorithms, and we analyze our algorithm theoretically. We show that our algorithm makes the local prediction models as powerful as the centralized KRR model (i.e., a KRR model trained on all of the local datasets) in terms of the convergence rate of the generalization error if the unlabeled public dataset is sufficiently large. We also provide experimental results to verify our theoretical results on ensemble distillation in federated learning.

1. Introduction

Despite the rapid development of machine learning algorithms (Bochkovskiy et al., 2020; Nichol et al., 2022; Brown et al., 2020), the performance of machine learning models still heavily depends on the size of the training dataset. However, due to the hassle of data processing, excessive manpower and time are consumed to obtain data, especially labeled data, in supervised learning. Moreover, most data is inaccessible for training a prediction model due to data privacy preservation.

1 Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. Correspondence to: Ganguk Hwang .

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Recently, Federated Learning (FL) (McMahan et al., 2017) has been proposed as a solution to data shortage and data privacy preservation. Even though various FL algorithms (McMahan et al., 2017; Li et al., 2020a; Yurochkin et al., 2019; Wang et al., 2020; Karimireddy et al., 2020; Li et al., 2021) have been proposed, most of them iteratively conduct the following procedure in training. (i) A server distributes a global model to clients to update their local models, and then each client trains its local model using its local dataset. (ii) The clients send their trained local models (or gradients) to the server, and the server aggregates them to update the global model. Such approaches obviously improve the performance of the local models while protecting data privacy, but the performance improvement is sometimes not sufficient for real world problems. Moreover, they impose some restrictions on local training. For example, most FL algorithms require that all local models be the same model (McMahan et al., 2017; Li et al., 2020a), parts of a global model (Arivazhagan et al., 2019; Jiang et al., 2022), approximations of a global model (Diao et al., 2021; Yao et al., 2021), or models derived from a meta model (Fallah et al., 2020; Shamsian et al., 2021). Such restrictions do not matter in a server-centric (i.e., server provider-centric) learning framework.
However, in a client-centric training environment, where most training strategies are led by individual clients, they matter severely. Note that local model architectures may also need to be protected, just like local data.

To overcome the above restrictions, knowledge distillation (KD) (Hinton et al., 2015) based FL algorithms have been proposed (Mora et al., 2022). Some of them improve the effectiveness of FL algorithms by conducting knowledge distillation in addition to the existing FL algorithms (Lin et al., 2020b; Zhu et al., 2021). Others use knowledge distillation as the main information sharing method instead of parameter sharing, which can be applied to a client-centric training environment (Li & Wang, 2019; Cho et al., 2021; Zhang et al., 2021). These methods assume that there exists an unlabeled public dataset or an unlabeled data generator in a server or a shared data storage that all clients can access. The knowledge of all local models can therefore be obtained in the server via the predictions of the local models on the unlabeled public dataset. The server then combines the knowledge through a weighted average of the predictions and distributes the public dataset with the averaged predictions to the clients. Finally, the clients use the distributed dataset as well as their local datasets for their local training.

There are some intuitions as to why this ensemble distillation strategy works well. First, the predictions of the local models on the unlabeled public dataset contain the information of the local models, just as the parameters of the local models do in typical FL algorithms. Second, the ensemble distillation process is a kind of automatic crowdsourcing data labeling technique that provides additional data to each client. Therefore, this strategy directly alleviates the data shortage problem of the clients. Despite these intuitions, KD based FL algorithms lack sufficient theoretical analysis even for the independent and identically distributed (IID) case (i.e., all data points of all clients are independent and identically distributed), unlike typical federated learning algorithms such as FedAvg and FedProx (Li et al., 2020a;b; Yuan & Li, 2022).

In this work, we provide a theoretical analysis of the effectiveness of ensemble distillation in federated learning. Even though knowledge distillation and ensemble techniques are difficult to analyze, some recent works provide analytical results for them in the context of kernel ridge regression (KRR) (Zhang et al., 2013; Lin et al., 2017; 2020a; Mobahi et al., 2020; Afonin & Karimireddy, 2022). Extending the existing results, we provide a theoretical framework for ensemble distillation of KRR models in a general federated learning setting. Unlike the theoretical framework of typical FL algorithms, which follows the optimization approach, we leverage the generalization theory of statistical models. More precisely, we derive the convergence rate of the expected risk with respect to the dataset size.

Our contributions in this work are as follows:

1. First, we analyze knowledge distillation with an auxiliary dataset in terms of the convergence rate of the expected risk (Section 4). As an application of this result, we verify the effectiveness of one-shot ensemble distillation in federated learning with kernel ridge regression (Section 5.2).
2. Second, we propose and analyze a new iterative ensemble distillation algorithm, in the spirit of FedMD (Li & Wang, 2019), in federated learning with kernel ridge regression (Sections 5.3-5.4). In the proposed algorithm, to overcome the undesirable amplified regularization caused by the repeated distillation procedures, we introduce a de-regularization trick that leads to the effectiveness of the proposed iterative ensemble distillation algorithm (Section 5.3).

3. Third, we analyze how a random client selection strategy, which is considered to resolve the communication cost or system heterogeneity issue, affects the performance of the proposed iterative ensemble distillation algorithm (Section 5.5).

4. Lastly, we provide experimental results to validate our theoretical results for the proposed iterative ensemble distillation in federated learning (Section 6).

2. Related Works

Federated Learning: FL is a privacy-preserving decentralized machine learning framework proposed in McMahan et al. (2017). FL has several challenges to address, such as statistical heterogeneity (Li et al., 2020a; Karimireddy et al., 2020), system (e.g., computational capability) heterogeneity (Nishio & Yonetani, 2019), communication efficiency (Konečný et al., 2016; Caldas et al., 2018), and privacy concerns (Bagdasaryan et al., 2020). Note that there are theoretical results showing that classic FL algorithms such as FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2020a) converge in both the IID and non-IID cases (Li et al., 2020a;b; Yuan & Li, 2022).

Knowledge Distillation: KD was originally proposed in Buciluǎ et al. (2006); Hinton et al. (2015) in the context of model compression. The original KD encourages the outputs of a student model to match the outputs of a pretrained teacher model. There are some variants of the original student-teacher framework. Self-distillation utilizes the student model itself as a teacher (Zhang et al., 2019). Ensemble distillation (or co-distillation) uses an ensemble of models as a teacher (Hinton et al., 2015; Anil et al., 2018; Lin et al., 2020b). Since the latter is an inclusive concept, in this work we refer to a knowledge distillation framework with several students and a teacher that is an ensemble of (some of) them as ensemble distillation, following Lin et al. (2020b).

Knowledge Distillation in FL: There are mainly two lines of research on federated learning with knowledge distillation. One line of research aims at improving the efficiency or effectiveness of federated learning algorithms by applying knowledge distillation (Lin et al., 2020b; He et al., 2020; Zhu et al., 2021; Zhang et al., 2022a;b). In particular, Lin et al. (2020b) improve the performance by applying additional ensemble distillation on a server to train a server model rather than directly replacing it with an aggregated model of client models. They also propose an extended version for heterogeneous model cases. However, it allows only a few prototypes, and clients have to share their models with the server. The other line of research aims at enabling the use of black-box models in FL (Li & Wang, 2019; Cho et al., 2021; Zhang et al., 2021). Li & Wang (2019) first propose a framework for FL with black-box models. Subsequently, Cho et al. (2021); Zhang et al. (2021) apply advanced ensemble strategies to improve the performance in the non-IID setting. However, to the best of our knowledge, there is no previous work that provides a theoretical analysis of KD in FL.
Recently, Allen-Zhu & Li (2020) provide a theoretical analysis that explains why ensemble knowledge distillation works for neural networks with the same architecture. However, we cannot directly apply their results to ensemble distillation in FL since they assume all neural networks use the same dataset.

Theoretical Analysis of Kernel Ridge Regression: There are many prior works that analyze the convergence rate of kernel ridge regression, either in the expected risk sense or in the probability sense, thanks to the closed form expression of its solution. The convergence properties of the classical KRR model have been analyzed well in Caponnetto & De Vito (2007); Fischer & Steinwart (2020); Cui et al. (2021); Li et al. (2023). Recently, Zhang et al. (2013); Lin et al. (2017); Chang et al. (2017); Guo et al. (2017); Lin et al. (2020a); Yin et al. (2021) also analyze distributed learning with KRR models in various contexts. In particular, Lin et al. (2020a) propose a communication scheme for distributed learning. However, this approach does not preserve data privacy due to the nature of the form of KRR solutions. On the other hand, Mobahi et al. (2020); Borup & Andersen (2021) apply KRR to study self-distillation. To the best of our knowledge, Afonin & Karimireddy (2022) is the only preceding study that utilizes KRR to provide a theoretical analysis of distillation strategies in the federated learning setting. However, they mainly consider the case of two clients, and their strategy is an ensemble of infinitely many models, which is impractical. In our work, we analyze standard ensemble distillation methods for the KRR model with an arbitrary number of clients.

3. Backgrounds

In this section, we briefly introduce some background for our work. See Appendix A for details and comments.

3.1. Preliminaries and Notations

Our interest is a regression problem in the FL setting. Let $X \subset \mathbb{R}^d$ be an input space. We assume $X$ is compact. Our goal is to find a target function $f_0 : X \to \mathbb{R}$. Let $\rho_{x,y}$ be a data generating distribution such that $\rho_{x,y}(x, y) = \rho_x(x)\,\rho_{y|x}(y|x)$, where $\mathbb{E}_{y \sim \rho_{y|x}}[y|x] = f_0(x)$ and $\mathrm{Var}_{y \sim \rho_{y|x}}(y|x) = \sigma^2(x)$ for any $x \in X$. Let $k : X \times X \to \mathbb{R}$ be a continuous, symmetric, positive semi-definite kernel such that

$$\iint k(x_1, x_2) f(x_1) f(x_2) \, d\rho_x(x_1)\, d\rho_x(x_2) \ge 0 \quad (1)$$

for any $f \in L^2_{\rho_x}$, with equality if and only if $f = 0$. $H_k \subset \mathbb{R}^X$ denotes the reproducing kernel Hilbert space with kernel $k$. Set $\kappa = (\sup_{x \in X} k(x, x))^{1/2}$. Define the covariance operator $L_k : H_k \to H_k$ by

$$L_k f = \int f(x)\, k(x, \cdot) \, d\rho_x(x)$$

and its sample analog $L_{k,X} : H_k \to H_k$ by

$$L_{k,X} f = \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, k(x_i, \cdot)$$

where $X = \{x_1, \dots, x_n\} \subset X$. Let $\mathcal{N}(\lambda) = \mathrm{tr}((L_k + \lambda I)^{-1} L_k)$ and $f_\lambda = (L_k + \lambda I)^{-1} L_k f_0$ where $\lambda > 0$. Define the compact embedding $\iota_{\rho_x} : H_k \to L^2_{\rho_x}$ by $\iota_{\rho_x} h = [h]_{\rho_x}$, where $[h]_{\rho_x}$ is the equivalence class containing $h$ in $L^2_{\rho_x}$. Also, define its sample analog $S_D : H_k \to \mathbb{R}^n$ by $S_D f = [f(x_1), \dots, f(x_n)]^\top$ where $D = \{(x_1, y_1), \dots, (x_n, y_n)\} \subset X \times \mathbb{R}$.^1 $A^*$ denotes the adjoint operator of a given operator $A$. We write $k_x(\cdot) = k(x, \cdot)$ and

$$K_{X_1, X_2} = \begin{bmatrix} k(x_1^1, x_2^1) & \cdots & k(x_1^1, x_2^m) \\ \vdots & \ddots & \vdots \\ k(x_1^n, x_2^1) & \cdots & k(x_1^n, x_2^m) \end{bmatrix}$$

where $X_1 = \{x_1^1, \dots, x_1^n\}$ and $X_2 = \{x_2^1, \dots, x_2^m\}$. Also, we write $D(x) = \{x_1, \dots, x_n\}$ where $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ is a given dataset.

Technical Assumptions: In this work we make the following assumptions.

Assumption 3.1. We assume $\mathbb{E}_{y \sim \rho_y}\, y^2 < \infty$ and

$$\int \left( \exp\!\left(\frac{|y - f_0(x)|}{M}\right) - \frac{|y - f_0(x)|}{M} - 1 \right) d\rho_{y|x}(y|x) \;\le\; \frac{\gamma^2}{2M^2} \quad (2)$$

for any $x \in X$, where $M$ and $\gamma$ are positive constants.

Assumption 3.2. The target function $f_0$ satisfies $f_0 = L_k^r g_0$ for some $g_0 \in H_k$ and $r \in [0, \tfrac{1}{2}]$. In particular, $f_0 \in H_k$.

Some previous works (Guo et al., 2017; Chang et al., 2017; Lin et al., 2020a) assume that $\mathcal{N}(\lambda)$ satisfies

$$\mathcal{N}(\lambda) \le C_e \lambda^{-s} \quad (3)$$

for some $s \in (0, 1]$ and some constant $C_e \ge 1$ independent of $\lambda$. We do not assume any regularity condition on $\mathcal{N}(\lambda)$ in this work, in order to deal with the convergence rate of the general case. However, under the above assumption (3), a better convergence rate can be obtained by extending our analysis. Note that (3) always holds with $s = 1$.

^1 We use a scaled $L^2$ norm as the norm of the Euclidean space $\mathbb{R}^n$. See Appendix A.
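To make the notation above concrete, here is a small NumPy sketch (ours, not from the paper) that builds the kernel matrix $K_{X_1,X_2}$ and applies the sample covariance operator $L_{k,X}$ at query points. The Gaussian kernel, the bandwidth, and all function names are illustrative assumptions, since the kernel $k$ is left abstract at this point.

```python
import numpy as np

def gaussian_kernel(X1, X2, bandwidth=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)); rows of X1/X2 are points.
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_matrix(X1, X2, kernel=gaussian_kernel):
    # K_{X1,X2}[i, j] = k(x_1^i, x_2^j)
    return kernel(X1, X2)

def sample_covariance_apply(f_values, X, X_query, kernel=gaussian_kernel):
    # (L_{k,X} f)(x) = (1/n) * sum_i f(x_i) * k(x_i, x), evaluated at the query points.
    return kernel(X_query, X) @ f_values / X.shape[0]

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, 1))        # five sample inputs in [0, 1]
f_vals = np.sin(2 * np.pi * X[:, 0])          # a function evaluated at the samples
X_query = np.array([[0.3], [0.7]])
print(kernel_matrix(X, X).shape)              # (5, 5)
print(sample_covariance_apply(f_vals, X, X_query))
```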
3.2. Kernel Ridge Regression and Optimal Convergence Rate

Kernel ridge regression (KRR) is an algorithm to estimate the target function $f_0$. Let $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$ be a labeled dataset whose data points are independently drawn from $\rho_{x,y}$. Given a kernel $k$ and a regularization hyperparameter $\lambda > 0$, a KRR model is given by a minimizer of the following optimization problem:

$$\operatorname*{argmin}_{h \in H_k} \; \frac{1}{N} \sum_{i=1}^{N} (h(x_i) - y_i)^2 + \lambda \|h\|_{H_k}^2.$$

Note that the solution of the above optimization problem has the closed form expression

$$f_{D,\lambda} = (L_{k,D(x)} + \lambda I)^{-1} S_D^* \mathbf{y}$$

where $\mathbf{y} = [y_1, \dots, y_N]^\top$. One way to measure the performance of kernel ridge regression is to find the convergence rate of the expected loss $\mathbb{E}_D \|\iota_{\rho_x}(f_{D,\lambda} - f_0)\|_{L^2_{\rho_x}}$ with respect to the dataset size $N$. We provide a key theorem from Caponnetto & De Vito (2007); Lin et al. (2017); Fischer & Steinwart (2020).

Theorem 3.3. Under Assumptions 3.1 and 3.2, with $\lambda = N^{-\frac{1}{2r+2}}$,

$$\mathbb{E}_D \|\iota_{\rho_x}(f_{D,\lambda} - f_0)\|_{L^2_{\rho_x}} = O\!\left(N^{-\frac{2r+1}{4r+4}}\right).$$

Moreover, this convergence rate is optimal.

4. KRR with Knowledge Distillation

Let $g$ be a teacher model which approximates $f_0$. In the noiseless, data-free version, the optimization problem of KRR with KD is given by

$$\operatorname*{argmin}_{h \in H_k} \; \alpha \|\iota_{\rho_x}(h - f_0)\|^2_{L^2_{\rho_x}} + (1-\alpha) \|\iota_{\rho_x}(h - g)\|^2_{L^2_{\rho_x}} + \lambda \|h\|_{H_k}^2$$

where $\alpha \in (0, 1)$ controls the distillation effect and $\lambda > 0$ is a regularization hyperparameter. From the first order optimality condition, we obtain the minimizer

$$f_\lambda = (L_k + \lambda I)^{-1}(\alpha L_k f_0 + (1-\alpha) L_k g). \quad (4)$$

Now we consider a sample version. Let $D_1 = \{(x_1^1, y_1^1), \dots, (x_1^{N_1}, y_1^{N_1})\}$ be a labeled dataset whose data points are independently generated from $\rho_{x,y}$, and let $D_2(x) = \{x_2^1, \dots, x_2^{N_2}\}$ be an unlabeled dataset whose data points are independently generated from $\rho_x$. To distill the teacher's knowledge using $D_2(x)$, we consider the optimization problem

$$\operatorname*{argmin}_{h \in H_k} \; \alpha \frac{1}{N_1} \sum_{i=1}^{N_1} (h(x_1^i) - y_1^i)^2 + (1-\alpha) \frac{1}{N_2} \sum_{i=1}^{N_2} (h(x_2^i) - g(x_2^i))^2 + \lambda \|h\|_{H_k}^2.$$

Then, the solution of the above problem is

$$\tilde{f}_{D,\lambda} = (\alpha L_{k,D_1(x)} + (1-\alpha) L_{k,D_2(x)} + \lambda I)^{-1} (\alpha S_{D_1}^* \mathbf{y}_1 + (1-\alpha) L_{k,D_2(x)}\, g) \quad (5)$$

where $\mathbf{y}_1 = [y_1^1, \dots, y_1^{N_1}]^\top$. A natural question is how much the performance of the trained student model $\tilde{f}_{D,\lambda}$ can be improved, and the answer is given in the following.

Theorem 4.1 (Informal). The performance of the student model $\tilde{f}_{D,\lambda}$ is at least as good as the worse of the teacher model $g$ and a KRR model trained using both datasets of sizes $N_1$ and $N_2$ independently generated from $\rho_{x,y}$, in terms of the convergence rate of the expected risk, when adequate $\alpha$ and $\lambda$ are used.

Theorem 4.1 is one of the key ideas used to analyze ensemble distillation in federated learning. See Appendix B for the detailed statement and proof.
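As a concrete illustration of the closed forms above, the sketch below implements plain KRR via $f_{D,\lambda} = k_{D(x)}(\cdot)^\top (N\lambda I + K_{D(x),D(x)})^{-1}\mathbf{y}$ and the distillation objective of Section 4 via the matrix form of (5) derived in Appendix B. This is our own minimal NumPy sketch (reusing the gaussian_kernel helper defined above, which is an assumption), not the authors' implementation.

```python
import numpy as np

def krr_fit(X, y, lam, kernel=gaussian_kernel):
    """Plain KRR: predictor x -> k_{D(x)}(x)^T (K + N*lam*I)^{-1} y."""
    N = X.shape[0]
    coef = np.linalg.solve(kernel(X, X) + N * lam * np.eye(N), y)
    return lambda Xq: kernel(Xq, X) @ coef

def krr_distill_fit(X1, y1, X2, teacher_preds, alpha, lam, kernel=gaussian_kernel):
    """KRR with knowledge distillation on an auxiliary unlabeled set X2.

    Minimizes  alpha/N1 * sum_i (h(x_1^i) - y_1^i)^2
             + (1-alpha)/N2 * sum_i (h(x_2^i) - g(x_2^i))^2 + lam * ||h||^2,
    whose minimizer in matrix form (Appendix B) is
      f(x) = k_{D1(x) u D2(x)}(x)^T (lam*I + D*K)^{-1} D [y1; g(D2(x))],
    with D = diag(alpha/N1, ..., (1-alpha)/N2, ...).
    """
    N1, N2 = X1.shape[0], X2.shape[0]
    Z = np.vstack([X1, X2])
    v = np.concatenate([y1, teacher_preds])
    d = np.concatenate([np.full(N1, alpha / N1), np.full(N2, (1.0 - alpha) / N2)])
    coef = np.linalg.solve(lam * np.eye(N1 + N2) + d[:, None] * kernel(Z, Z), d * v)
    return lambda Xq: kernel(Xq, Z) @ coef
```

Choosing $\alpha = N_1/(N_1+N_2)$ and $\lambda = (N_1+N_2)^{-\frac{1}{2r+2}}$ corresponds to the choice analyzed in Appendix B.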
5. KRR with Ensemble Distillation in FL

5.1. Problem Formulation

We assume the following. There are $m$ clients in total, and client $j$ has a labeled dataset $D_j = \{(x_j^1, y_j^1), \dots, (x_j^N, y_j^N)\}$ whose data points are independently generated from $\rho_{x,y}$ ($j = 1, \dots, m$). We assume $|D_1| = \dots = |D_m| = N$ for simplicity. Set $\mathbf{y}_j = [y_j^1, \dots, y_j^N]^\top$ and $D = \bigcup_{j=1}^{m} D_j$. To distill knowledge without sharing models, we assume there is an unlabeled public dataset $D_p(x) = \{x_p^1, \dots, x_p^{N_p}\}$ whose data points are independently generated from $\rho_x$ and whose size is $N_p$.

Due to privacy concerns, only client $j$ can access its own dataset $D_j$. In addition, all clients can access the public dataset $D_p(x)$ with its pseudo labels. For simplicity, we write $L_{k,D_j(x)}$ as $L_{k,X_j}$ for $j = 1, \dots, m$ and $L_{k,D_p(x)}$ as $L_{k,X_p}$.

We consider and analyze an algorithm that performs public data labeling through an ensemble prediction of the local models and local model training through knowledge distillation. Algorithm 1 presents the detailed procedure.

Algorithm 1: KRR with Iterative Ensemble Distillation in FL
1: Input: hyperparameters $\alpha \in (0, 1)$, $\lambda > 0$, and $t \in \mathbb{N}$
2: Output: trained models $f_j$, $j = 1, \dots, m$
3: Pretrain: for $j = 1, \dots, m$, client $j$ trains its model $f_j$ using the loss function
$$\operatorname*{argmin}_{h \in H_k} \; \frac{1}{N} \sum_{i=1}^{N} (h(x_j^i) - y_j^i)^2 + \lambda \|h\|_{H_k}^2.$$
4: Each client downloads the unlabeled public dataset $D_p(x)$.
5: for $t_0 = 1, \dots, t$ do
6: For $j = 1, \dots, m$, client $j$ predicts on $D_p(x)$ and uploads its prediction $\mathbf{y}_p^j$ to the server.
7: The server computes an updated consensus prediction $\bar{\mathbf{y}}_p$ from the uploaded predictions (a weighted average; the simple average $\bar{\mathbf{y}}_p = \frac{1}{m}\sum_{j=1}^{m} \mathbf{y}_p^j$ in our analysis with $\alpha = 1/m$).
8: Each client downloads the ensemble prediction $\bar{\mathbf{y}}_p$.
9: For $j = 1, \dots, m$, client $j$ updates its model $f_j$ using the loss function
$$\operatorname*{argmin}_{h \in H_k} \; \alpha \frac{1}{N} \sum_{i=1}^{N} (h(x_j^i) - y_j^i)^2 + (1-\alpha) \frac{1}{N_p} \sum_{i=1}^{N_p} (h(x_p^i) - (\bar{\mathbf{y}}_p)_i)^2 + \lambda \|h\|_{H_k}^2.$$
10: end for

5.2. One-Shot Ensemble Distillation in FL

First, we deal with one-shot ensemble distillation in FL, i.e., Algorithm 1 with $t = 1$. Even though each client cannot access the one-shot ensemble model, using the result in Lin et al. (2020a) and Theorem 4.1, we can derive a strong result that guarantees the performance of the local models after one-shot ensemble distillation.

Theorem 5.1. Assume Assumptions 3.1 and 3.2 hold. Also, assume

$$m \le N^{2r+1-\epsilon} \quad (6)$$

for any fixed $\epsilon \in (0, 1)$ and $N_p \ge (m-1)N$. Let $\tilde{f}^j_{D,\lambda}$ be the local model of client $j$ after one-shot ensemble distillation ($j = 1, \dots, m$). Then, with $\alpha = 1/m$ and $\lambda = (mN)^{-\frac{1}{2r+2}}$,

$$\mathbb{E}\,\|\iota_{\rho_x}(\tilde{f}^j_{D,\lambda} - f_0)\|_{L^2_{\rho_x}} = O\!\left((mN)^{-\frac{2r+1}{4r+4}}\right)$$

for $j = 1, \dots, m$.

The proof of Theorem 5.1 is provided in Appendix C.1. Theorem 5.1 tells us that, under some assumptions, each local model $\tilde{f}^j_{D,\lambda}$ after the one-shot ensemble distillation has at least the same performance as the KRR model using the whole dataset of size $|D| = mN$ in terms of the convergence rate of the expected risk.

Note that Theorem 5.1 requires that the number of clients satisfy the restriction (6) for a fixed $\epsilon \in (0, 1)$. However, one of the main properties of FL is a massively distributed environment (McMahan et al., 2017). Therefore, it is desirable to remove this restriction on the number of clients, which is the main reason why we consider iterative ensemble distillation (i.e., $t > 1$).
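For concreteness, here is a minimal sketch of Algorithm 1 with KRR local models, reusing krr_fit, krr_distill_fit, and gaussian_kernel from the earlier sketches. Reading step 7 as a uniform average of the uploaded predictions (matching the choice $\alpha = 1/m$ used in the analysis) is our assumption; the paper only specifies a weighted average.

```python
import numpy as np

def iterative_ensemble_distillation(local_data, X_pub, alpha, lam, rounds,
                                    kernel=gaussian_kernel):
    """Sketch of Algorithm 1 (without the de-regularization trick of Section 5.3).

    local_data: list of (X_j, y_j) pairs, one per client.
    X_pub:      unlabeled public inputs D_p(x) shared by all clients.
    Returns the list of local KRR predictors after `rounds` communication rounds.
    """
    # Step 3: each client pretrains a plain KRR model on its own data.
    models = [krr_fit(Xj, yj, lam, kernel) for Xj, yj in local_data]
    for _ in range(rounds):
        # Step 6: every client predicts on the public inputs.
        preds = np.stack([f(X_pub) for f in models])
        # Step 7: the server forms the consensus prediction (uniform average here).
        y_bar = preds.mean(axis=0)
        # Steps 8-9: each client distills the consensus together with its own data.
        models = [krr_distill_fit(Xj, yj, X_pub, y_bar, alpha, lam, kernel)
                  for Xj, yj in local_data]
    return models
```

With rounds = 1 this reduces to the one-shot ensemble distillation of Section 5.2.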
5.3. A Toy Example: Iterative Ensemble Distillation in FL When m = 1

We now analyze iterative ensemble distillation. First, we consider the simple example $m = 1$ to obtain some intuition. In this case, the single client trains its model using its private dataset $D_1$ and the public dataset $D_p(x)$ with pseudo labels, which are the predictions of the local model itself. Thus, the problem is equivalent to self-distillation on an auxiliary unlabeled dataset. To analyze this algorithm, our first interest is to find the limiting regressor after infinitely many iterations.

Theorem 5.2. The local model $f_1$ converges to $f_{D_1,\lambda/\alpha} = (L_{k,X_1} + \tfrac{\lambda}{\alpha} I)^{-1} S_{D_1}^* \mathbf{y}_1$ after infinitely many iterations in Algorithm 1 with $m = 1$.

We provide the proof in Appendix C.2. Theorem 5.2 implies that the limiting regressor is just a kernel ridge regressor using the private dataset $D_1$ with an amplified regularization. This result is closely related to the study of self-distillation for kernel ridge regression (Mobahi et al., 2020; Borup & Andersen, 2021). However, this limits the analysis of the generalization error bound, since the amplified regularization makes the approximation error $\|\iota_{\rho_x}(f_0 - f_{\lambda/\alpha})\|_{L^2_{\rho_x}}$ excessively large when $\alpha$ is small.

To resolve this issue, we introduce a de-regularization trick. Inspired by the Kernel Inducing Point (KIP) scheme (Nguyen et al., 2021), we define the de-regularization trick for a prediction $\mathbf{v}$ on $D_p(x)$ as follows:

$$S_{D_p}(L_{k,X_p} + \lambda_0 I)^{-1} S_{D_p}^* \mathbf{w} = \mathbf{v}$$

where $\lambda_0 \ge 0$ is a de-regularization hyperparameter. For the well-definedness of $\mathbf{w}$, we require the condition $K_{X_p,X_p} > 0$. Under this condition,

$$\mathbf{w} = (S_{D_p}(L_{k,X_p} + \lambda_0 I)^{-1} S_{D_p}^*)^{-1} \mathbf{v} = (K_{X_p,X_p} + N_p \lambda_0 I) K_{X_p,X_p}^{-1} \mathbf{v}. \quad (7)$$

Note that $\lim_{\lambda_0 \to 0} \mathbf{w} = \mathbf{v}$. Applying the de-regularization trick, we modify Algorithm 1 so that each client uses the de-regularized predictions, i.e.,

$$\mathbf{y}'_p = (S_{D_p}(L_{k,X_p} + \lambda_0 I)^{-1} S_{D_p}^*)^{-1}\, \bar{\mathbf{y}}_p$$

instead of $\bar{\mathbf{y}}_p$, except in the last ensemble distillation step. The de-regularization trick can be implemented in the server since it only depends on $D_p(x)$. Therefore, this procedure does not require any additional computational resources on the client side. We provide the modified algorithm (Algorithm 2) in Appendix C.3. From now on, we set $\lambda_0 = \lambda$.

The limiting regressor obviously changes if we adopt the de-regularization trick.

Theorem 5.3. The local model $f_1$ converges to

$$\tilde{f}_{D_1,\lambda} = \Big(L_{k,X_1} + \lambda I + \tfrac{1-\alpha}{\alpha}\lambda P^{\perp}_{D_p(x)}\Big)^{-1} S_{D_1}^* \mathbf{y}_1$$

after infinitely many ensemble distillation steps with the de-regularization trick in Algorithm 2 with $m = 1$, where $P^{\perp}_{D_p(x)}$ is the orthogonal projection onto the orthogonal complement of the subspace

$$\mathrm{span}\{k(x, \cdot) : x \in D_p(x)\} \quad (8)$$

of the reproducing kernel Hilbert space $H_k$, under the assumption that $\lambda_0 = \lambda$ and $K_{X_p,X_p}$ is invertible.

Note that there is still the additional term $\tfrac{1-\alpha}{\alpha}\lambda P^{\perp}_{D_p(x)}$ in the inverse compared with $f_{D_1,\lambda}$. However, we can omit this term if $N_p$ is large. The following theorem supports this argument.

Theorem 5.4. Assume $\lambda_0 = \lambda$ and $K_{X_p,X_p}$ is invertible. We also assume that the density of $\rho_x$ is strictly positive on any non-empty open subset of $X$. Let $\tilde{f}_{D_1,\lambda} \in H_k$ be defined as in Theorem 5.3. If $\alpha$ and $\lambda$ do not depend on $N_p$, then

$$\lim_{N_p \to \infty} \tilde{f}_{D_1,\lambda} = (L_{k,X_1} + \lambda I)^{-1} S_{D_1}^* \mathbf{y}_1 = f_{D_1,\lambda}$$

almost surely in the $H_k$-norm sense.

The proof of Theorem 5.4 is provided in Appendix C.5. Then, our next question is how much public data is needed in order to ignore the additional term $\tfrac{1-\alpha}{\alpha}\lambda P^{\perp}_{D_p(x)}$ when we compute the convergence rate of the expected risk. We answer this question in Section 5.4 for the general setting of $m$ clients.
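The de-regularization step is a single linear solve on the server side. The helper below implements the matrix form (7), assuming $K_{X_p,X_p}$ is invertible as required in the text; the function name is ours and this is only a sketch.

```python
import numpy as np

def deregularize(y_bar, K_pub, lam0):
    """De-regularization trick (7): w = (K_{Xp,Xp} + Np*lam0*I) K_{Xp,Xp}^{-1} y_bar.

    y_bar:  consensus prediction on the public inputs D_p(x).
    K_pub:  Gram matrix K_{Xp,Xp} (assumed positive definite).
    lam0:   de-regularization hyperparameter (lam0 = lam in the analysis).
    """
    Np = K_pub.shape[0]
    return (K_pub + Np * lam0 * np.eye(Np)) @ np.linalg.solve(K_pub, y_bar)
```

In Algorithm 2 the server would apply this map to the consensus $\bar{\mathbf{y}}_p$ before broadcasting it, in every round except the last one; since it only uses $D_p(x)$, the clients incur no extra cost.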
5.4. Iterative Ensemble Distillation in FL for the General Case

We now turn to analyzing iterative ensemble distillation for $m$ clients. Our main question is whether the limiting regressors of the clients after infinitely many iterations in Algorithm 2 have the same performance as a KRR model using the whole dataset $D$ in terms of the convergence rate of the expected risk. The answer is yes if the unlabeled public dataset is sufficiently large, under some regularity conditions.

Theorem 5.5. Assume Assumptions 3.1 and 3.2 hold with $0 < r \le \tfrac{1}{2}$. We further assume $m \ge 2$, $\lambda_0 = \lambda$, $K_{X_p,X_p} > 0$, and that the public dataset is sufficiently large: $N_p \ge (m-1)N$ and, for a fixed $\epsilon \in (0, \tfrac{1}{2})$, $N_p^{1-\epsilon}$ is at least of the order of $m^{\frac{3r+2}{2r^2+2r}} N^{\frac{1}{2r+2}}$. Let $\tilde{f}^j_{D,\lambda}$ be the local model of client $j$ after conducting Algorithm 2 with $t = \infty$ ($j = 1, \dots, m$). Then, with $\alpha = 1/m$ and $\lambda = (mN)^{-\frac{1}{2r+2}}$,

$$\mathbb{E}\,\|\iota_{\rho_x}(\tilde{f}^j_{D,\lambda} - f_0)\|_{L^2_{\rho_x}} = O\!\left((mN)^{-\frac{2r+1}{4r+4}}\right)$$

for $j = 1, \dots, m$.

Theorem 5.5 tells us that iterative ensemble distillation can drop the restriction on $m$ and $N$ and still attain the same convergence rate as the KRR model trained on all of the clients' data, provided we have sufficient unlabeled public data. Thus, this theorem implies that Algorithm 2 works well in a massively distributed environment. Note that Theorem 5.5 requires more unlabeled public data than Theorem 5.1.

However, Theorem 5.5 does not guarantee the performance of the limiting regressors when $r = 0$, even if $N_p$ is large. The reason is that the de-regularization is closely related to the projection $P_{D_p(x)}$, but we have no convergence rate for $\|(I - P_{D_p(x)}) f_0\|_{H_k}$ when $r = 0$, where $P_{D_p(x)}$ is the orthogonal projection onto the subspace (8). For $r > 0$, the required public dataset size decreases as $r$ increases.

5.5. Effects of Client Selection Strategy

So far we have discussed the performance of Algorithms 1 and 2. Note that these algorithms are synchronous: at each step they have to wait until all clients finish training their local models. Hence, these algorithms may waste time, especially in a massively distributed and/or system heterogeneous environment. A typical approach to resolve this issue in FL is to adopt a client selection strategy in each communication round or to use an asynchronous algorithm.

When we directly adopt a random client selection strategy in Algorithm 2, the ensemble prediction $\bar{\mathbf{y}}_p$ on $D_p(x)$ may have high variance, which makes the algorithm unstable. Hence, we memorize the previous prediction $\bar{\mathbf{y}}_{p,\mathrm{old}}$ and update $\bar{\mathbf{y}}_p$ as

$$\bar{\mathbf{y}}_p = (1 - \gamma_{t_0})\, \bar{\mathbf{y}}_{p,\mathrm{old}} + \gamma_{t_0}\, \bar{\mathbf{y}}_{p,\mathrm{new}}$$

at each communication round $t_0$, like a Robbins-Monro stochastic approximation (Robbins & Monro, 1951). We provide a general framework (Algorithm 3) that considers a client selection strategy and an asynchronous strategy in Appendix C.7. See Appendix C.7 for details.

Based on stochastic approximation theory (Robbins & Monro, 1951; Bertsekas & Tsitsiklis, 1996), we obtain a result on the limiting regressor after infinitely many iterations in Algorithm 3 as follows.

Corollary 5.6. Assume $\lambda_0 = \lambda$ and $K_{X_p,X_p}$ is invertible. We also assume that $\sum_{t_0=1}^{\infty} \gamma_{t_0} = \infty$ and $\sum_{t_0=1}^{\infty} \gamma_{t_0}^2 < \infty$. If we sample a fixed number of clients uniformly from all clients at each communication round, then the prediction $\bar{\mathbf{y}}_p$ on $D_p(x)$ after infinitely many ensemble distillation steps with the de-regularization trick in Algorithm 3 converges to $S_{D_p} g$ almost surely, where $g$ is the limiting ensemble regressor after infinitely many ensemble distillation steps with the de-regularization trick in Algorithm 2. In conclusion, the local models after conducting Algorithm 3 with $t = \infty$ are the same as the local models after conducting Algorithm 2 with $t = \infty$.

We prove a general version of Corollary 5.6 in Appendix C.8. See Appendix C.8 for details. Corollary 5.6 implies that the local models derived from Algorithm 3 with uniform sampling are the same as those derived from Algorithm 2 as $t \to \infty$. Therefore, the client selection strategy does not affect the limiting regressor in the case of uniform client sampling. If we sample clients non-uniformly, then the effect of the local datasets on the limiting regressor can be different.
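The memorized consensus of this subsection is a one-line stochastic approximation step. The sketch below (ours; Algorithm 3 itself appears only in Appendix C.7) shows the server-side update under a random client selection.

```python
import numpy as np

def sampled_consensus_update(y_p_old, sampled_preds, gamma_t):
    """Robbins-Monro style update of the consensus prediction (Section 5.5).

    y_p_old:       previous consensus prediction on D_p(x), shape (Np,).
    sampled_preds: predictions of the clients sampled this round, shape (S, Np).
    gamma_t:       step size of round t_0.
    """
    y_p_new = sampled_preds.mean(axis=0)
    return (1.0 - gamma_t) * y_p_old + gamma_t * y_p_new
```

A step size such as $\gamma_{t_0} = 1/t_0$ satisfies $\sum_{t_0}\gamma_{t_0} = \infty$ and $\sum_{t_0}\gamma_{t_0}^2 < \infty$ as required by Corollary 5.6.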
5.6. Connection to Neural Network

From Sections 5.2 to 5.5, we have discussed the effectiveness of ensemble distillation algorithms for kernel ridge regression models. According to the approximation scheme of neural networks as kernel machines (Jacot et al., 2018; Domingos, 2020), ensemble distillation with neural networks can also be expected to be effective in regression problems. One remark is that we assume all clients use the same kernel $k$ in our analysis, and this assumption is violated when neural networks are used as local models. However, according to our analysis, we can expect algorithms that match features across neural networks to perform well. For instance, FedHeNN (Makhija et al., 2022) uses the Hilbert-Schmidt independence criterion to match the features, which improves the performance. Another remark is that the de-regularization trick can be omitted in the ridgeless case. Therefore, some existing algorithms (Li & Wang, 2019; Lin et al., 2020b) can be seen as special cases of our algorithm.

6. Experiments

In this section, we provide experimental results to validate our theoretical results. Rather than verifying the convergence rate of the expected risk, we now analyze the expected risk itself to confirm the practical effectiveness of ensemble distillation algorithms.

We conduct experiments on three synthetic datasets and one real world dataset. We refer to the three synthetic datasets as Dataset 1, Dataset 2, and Dataset 3. Dataset 1 and Dataset 3 are from existing works (Chang et al., 2017; Lin et al., 2020a). The real world dataset is a simplified regression version of the MNIST dataset from another existing work (Cui et al., 2021). We refer to this dataset as MNIST. The generating procedure for the datasets and the kernels used for KRR are described in Appendix E.1. To evaluate the performance, we compute the averaged mean squared error of the local models over the test dataset. To deal with the massively distributed environment mainly considered in FL, we set the number of clients (m) to be relatively large compared with the local dataset size (N). In addition, we set the unlabeled public dataset size Np = (m-1)N to compare one-shot ensemble distillation and iterative ensemble distillation fairly. We conduct an additional experiment to investigate the effect of the unlabeled public dataset size. The detailed experimental setup (e.g., hyperparameter configuration) is described in Appendix E.2.

6.2. Illustrative Example: Effect of De-regularization Trick

We first validate the effectiveness of the de-regularization trick for iterative ensemble distillation. In this experiment, we use Dataset 1 and set m = 20, N = 20, λ = 0.002, and Np = 380. As illustrated in Figure 1, the regularization effect is amplified as ensemble distillation is repeated when we do not use the de-regularization trick. In contrast, the de-regularization trick prevents the regularization effect from being amplified even for a large t.

[Figure 1: two panels, (a) with the de-regularization trick and (b) without the de-regularization trick, each plotting the KRR regressor of one client over x in [0, 1] after t = 1, 2, 3, 10, 1000 communication rounds together with the target function.]
Figure 1. KRR regressors of a client participating in iterative ensemble distillation after t communication rounds (a) with the de-regularization trick and (b) without the de-regularization trick. The solid blue line is the target function.
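To reproduce the qualitative behavior in Figure 1, the sketch below runs the $m = 1$ toy example of Section 5.3 with and without the de-regularization trick, reusing krr_fit, krr_distill_fit, and deregularize from the earlier sketches. The Laplacian kernel and the gridded public inputs are our choices to keep $K_{X_p,X_p}$ well conditioned for the exact inverse; the paper's actual kernels and data generating procedures are in Appendix E.1, which we do not reproduce here.

```python
import numpy as np

def laplacian_kernel(X1, X2, bandwidth=0.2):
    # exp(-||x - x'|| / bandwidth); its Gram matrix is better conditioned than a
    # Gaussian one, which matters because deregularize() inverts K_{Xp,Xp} exactly.
    d = np.sqrt(np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1))
    return np.exp(-d / bandwidth)

rng = np.random.default_rng(0)
target = lambda x: np.sin(2 * np.pi * x[:, 0])

X1 = rng.uniform(0.0, 1.0, size=(20, 1))                  # private data, N = 20
y1 = target(X1) + 0.1 * rng.standard_normal(20)
Xp = np.linspace(0.0, 1.0, 380)[:, None]                  # public inputs, Np = 380
Xt = np.linspace(0.0, 1.0, 200)[:, None]                  # evaluation grid
alpha, lam = 0.05, 0.002
K_pub = laplacian_kernel(Xp, Xp)

f = g = krr_fit(X1, y1, lam, laplacian_kernel)            # pretraining (t = 0)
for _ in range(1000):
    # (a) plain self-distillation on the public inputs (Algorithm 1, m = 1)
    f = krr_distill_fit(X1, y1, Xp, f(Xp), alpha, lam, laplacian_kernel)
    # (b) the same, but distilling de-regularized pseudo labels (Algorithm 2, m = 1)
    w = deregularize(g(Xp), K_pub, lam)
    g = krr_distill_fit(X1, y1, Xp, w, alpha, lam, laplacian_kernel)

f_lam = krr_fit(X1, y1, lam, laplacian_kernel)            # KRR with lambda
f_amp = krr_fit(X1, y1, lam / alpha, laplacian_kernel)    # KRR with lambda / alpha

# Theorem 5.2 predicts that (a) approaches f_amp (amplified regularization), while
# Theorems 5.3-5.4 predict that (b) stays close to f_lam once Np is large.
print("run (a) vs KRR(lambda/alpha):", np.max(np.abs(f(Xt) - f_amp(Xt))))
print("run (b) vs KRR(lambda):      ", np.max(np.abs(g(Xt) - f_lam(Xt))))
```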
6.3. Performance Comparison

We compare the local training, the one-shot ensemble distillation algorithm, the iterative ensemble distillation algorithm without the de-regularization, the iterative ensemble distillation algorithm with the de-regularization, and the central training (which gives the KRR model trained using all local datasets). The performance of standalone KRR models with sample sizes N ∈ {10, 20} is presented in Table 1.

Table 1. Performance of the standalone model on different datasets with sample sizes N = 10 and N = 20.

Dataset     N = 10   N = 20
Dataset 1   0.0329   0.0243
Dataset 2   0.0255   0.0144
Dataset 3   0.0785   0.0739
MNIST       0.9436   0.7845

We summarize the experimental results for the case with N = 10 and various m in Figure 2. We present additional experimental results with sample size N = 20 in Appendix E.3.

[Figure 2: averaged MSE versus the number of clients for (a) Dataset 1 with N = 10, (b) Dataset 2 with N = 10, (c) Dataset 3 with N = 10, and (d) MNIST with N = 10, comparing central training, one-shot ED, IED w/o deregularization, and IED w/ deregularization.]
Figure 2. Comparison between the performance of the one-shot ensemble distillation algorithm (one-shot ED), the iterative ensemble distillation algorithm without the de-regularization (IED w/o deregularization), the iterative ensemble distillation algorithm with the de-regularization (IED w/ deregularization), and the central training. We set N = 10 and conduct the experiments with various m.

We first observe that all ensemble distillation based FL algorithms improve the performance of the local models compared with the local training. They also achieve performance as good as the central training on Dataset 2. However, the one-shot ensemble distillation algorithm falls behind the central training in the other cases. In particular, it performs poorly in the massively distributed environment with high dimensional datasets (Dataset 3 and MNIST). The iterative ensemble distillation algorithm without the de-regularization is also worse than the central training on Dataset 3, and it does not achieve the same performance as the central training in some other cases either.

On the other hand, the iterative ensemble distillation algorithm with the de-regularization dominates both the one-shot ensemble distillation algorithm and the iterative ensemble distillation algorithm without the de-regularization in all settings. It is also comparable with the central training in all experiments. Although our theoretical result requires more than (m-1)N unlabeled public data points, we observe that (m-1)N unlabeled public data points are enough for iterative ensemble distillation to achieve the same performance as the central training in practice. Moreover, in our experiments it attains not only the same convergence rate but also the same expected risk as the central training. We additionally compare our proposed algorithm with FedMD (Li & Wang, 2019) using neural networks on multiple datasets. See Appendix E.3 for the results.

6.4. Effect of Public Dataset Size

We compare the performance of the iterative ensemble distillation algorithm with the de-regularization for various Np. We simulate this algorithm on Dataset 3 with N = 10 and m = 50. The results are summarized in Table 2.

Table 2. Performance of the iterative ensemble distillation algorithm on Dataset 3 with different public dataset sizes Np ∈ {50, 100, 200, 500, 1000}. We set N = 10 and m = 50.

Np             50       100      200      500      1000
Averaged MSE   0.0251   0.0198   0.0168   0.0168   0.0164

The performance of the algorithm improves as Np increases. However, the improvement slows down once Np is sufficiently large. In other words, more public data points may not improve the performance if there are already sufficient public data points. We provide additional experimental results, including a comparison between training with and without a client selection strategy, in Appendix E.3.
7. Conclusions

In this work, we provide a KRR based theoretical framework to verify the effectiveness of KD, one-shot ensemble distillation, and iterative ensemble distillation under some regularity conditions. We also analyze the effects of a client selection strategy in our setting. We simulate ensemble distillation to validate our theoretical results.

Acknowledgements

This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Grant No. NRF-2019R1A5A1028324) and in part by an internal fund/grant of the Electronics and Telecommunications Research Institute (ETRI) [22RB1100, Exploratory and Strategic Research of ETRI-KAIST ICT Future Technology].

References

Afonin, A. and Karimireddy, S. P. Towards model agnostic federated learning using knowledge distillation. In The Tenth International Conference on Learning Representations, 2022.

Allen-Zhu, Z. and Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020.

Anil, R., Pereyra, G., Passos, A., Ormándi, R., Dahl, G. E., and Hinton, G. E. Large scale distributed neural network training through online distillation. In 6th International Conference on Learning Representations, 2018.

Arivazhagan, M. G., Aggarwal, V., Singh, A. K., and Choudhary, S. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.

Aydın, A. D. and Gheondea, A. Probability error bounds for approximation of functions in reproducing kernel Hilbert spaces. Journal of Function Spaces, 2021, 2021.

Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2938-2948. PMLR, 2020.

Banach, S. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133-181, 1922.

Berlinet, A. and Thomas-Agnan, C. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

Bertsekas, D. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.

Bhatia, R. Matrix analysis, volume 169. Springer Science & Business Media, 2013.

Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M.
YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

Borup, K. and Andersen, L. N. Even your teacher needs guidance: Ground-truth targets dampen regularization imposed by self-distillation. Advances in Neural Information Processing Systems, 34:5316-5327, 2021.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877-1901, 2020.

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535-541, 2006.

Caldas, S., Konečný, J., McMahan, H. B., and Talwalkar, A. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210, 2018.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.

Chang, X., Lin, S.-B., and Zhou, D.-X. Distributed semi-supervised learning with kernel ridge regression. Journal of Machine Learning Research, 18(46):1-22, 2017.

Chatalic, A., Schreuder, N., Rosasco, L., and Rudi, A. Nyström kernel mean embeddings. In International Conference on Machine Learning, pp. 3006-3024. PMLR, 2022.

Cho, Y. J., Wang, J., Chiruvolu, T., and Joshi, G. Personalized federated learning for heterogeneous clients with clustered knowledge transfer. arXiv preprint arXiv:2109.08119, 2021.

Conway, J. B. A course in functional analysis, volume 96. Springer, 2019.

Cui, H., Loureiro, B., Krzakala, F., and Zdeborová, L. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131-10143, 2021.

Diao, E., Ding, J., and Tarokh, V. HeteroFL: Computation and communication efficient federated learning for heterogeneous clients. In 9th International Conference on Learning Representations, 2021.

Domingos, P. Every model learned by gradient descent is approximately a kernel machine. arXiv preprint arXiv:2012.00152, 2020.

Fallah, A., Mokhtari, A., and Ozdaglar, A. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33:3557-3568, 2020.

Ferreira, J. and Menegatto, V. A. Positive definiteness, reproducing kernel Hilbert spaces and beyond. Annals of Functional Analysis, 4(1), 2013.

Fischer, S. and Steinwart, I. Sobolev norm learning rates for regularized least-squares algorithms. The Journal of Machine Learning Research, 21(1):8464-8501, 2020.

Fujii, J., Fujii, M., Furuta, T., and Nakamoto, R. Norm inequalities equivalent to Heinz inequality. Proceedings of the American Mathematical Society, 118(3):827-830, 1993.

Guo, Z.-C., Lin, S.-B., and Zhou, D.-X. Learning theory of distributed spectral algorithms. Inverse Problems, 33(7):074009, 2017.

He, C., Annavaram, M., and Avestimehr, S. Group knowledge transfer: Federated learning of large CNNs at the edge. Advances in Neural Information Processing Systems, 33:14068-14080, 2020.

Hinton, G., Vinyals, O., Dean, J., et al.
Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Jiang, Y., Wang, S., Valls, V., Ko, B. J., Lee, W.-H., Leung, K. K., and Tassiulas, L. Model pruning enables efficient federated learning on edge devices. IEEE Transactions on Neural Networks and Learning Systems, 2022.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132-5143. PMLR, 2020.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, D. and Wang, J. FedMD: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, volume 2, pp. 429-450, 2020a.

Li, T., Hu, S., Beirami, A., and Smith, V. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pp. 6357-6368. PMLR, 2021.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. In 8th International Conference on Learning Representations, 2020b.

Li, Y., Zhang, H., and Lin, Q. On the saturation effect of kernel ridge regression. In The Eleventh International Conference on Learning Representations, 2023.

Lin, S.-B., Guo, X., and Zhou, D.-X. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(92):1-31, 2017.

Lin, S.-B., Wang, D., and Zhou, D.-X. Distributed kernel ridge regression with communications. Journal of Machine Learning Research, 21(93):1-38, 2020a.

Lin, T., Kong, L., Stich, S. U., and Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351-2363, 2020b.

Makhija, D., Han, X., Ho, N., and Ghosh, J. Architecture agnostic federated learning for neural networks. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 14860-14870. PMLR, 2022.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017.

Mobahi, H., Farajtabar, M., and Bartlett, P. Self-distillation amplifies regularization in Hilbert space. Advances in Neural Information Processing Systems, 33:3351-3361, 2020.

Mora, A., Tenison, I., Bellavista, P., and Rish, I. Knowledge distillation for federated learning: a practical guide. arXiv preprint arXiv:2211.04742, 2022.

Nguyen, T., Chen, Z., and Lee, J. Dataset meta-learning from kernel ridge-regression. In 9th International Conference on Learning Representations, 2021.

Nichol, A.
Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16784-16804. PMLR, 2022.

Nishio, T. and Yonetani, R. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications, pp. 1-7. IEEE, 2019.

Rasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Rudi, A., Camoriano, R., and Rosasco, L. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 28, 2015.

Shamsian, A., Navon, A., Fetaya, E., and Chechik, G. Personalized federated learning using hypernetworks. In International Conference on Machine Learning, pp. 9489-9502. PMLR, 2021.

Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008.

Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D. S., and Khazaeni, Y. Federated learning with matched averaging. In 8th International Conference on Learning Representations, 2020.

Yao, D., Pan, W., Wan, Y., Jin, H., and Sun, L. FedHM: Efficient federated learning for heterogeneous models via low-rank factorization. arXiv preprint arXiv:2111.14655, 2021.

Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.

Yin, R., Wang, W., and Meng, D. Distributed Nyström kernel learning with communications. In International Conference on Machine Learning, pp. 12019-12028. PMLR, 2021.

Yuan, X.-T. and Li, P. On convergence of FedProx: Local dissimilarity invariant bounds, non-smoothness and beyond. arXiv preprint arXiv:2206.05187, 2022.

Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pp. 7252-7261. PMLR, 2019.

Zhang, J., Guo, S., Ma, X., Wang, H., Xu, W., and Wu, F. Parameterized knowledge transfer for personalized federated learning. Advances in Neural Information Processing Systems, 34:10092-10104, 2021.

Zhang, J., Chen, C., Li, B., Lyu, L., Wu, S., Ding, S., Shen, C., and Wu, C. DENSE: Data-free one-shot federated learning. Advances in Neural Information Processing Systems, 35:21414-21428, 2022a.

Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., and Ma, K. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3713-3722, 2019.

Zhang, L., Shen, L., Ding, L., Tao, D., and Duan, L.-Y. Fine-tuning global model via data-free knowledge distillation for non-IID federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10174-10183, 2022b.

Zhang, Y., Duchi, J., and Wainwright, M. Divide and conquer kernel ridge regression. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pp. 592-617. PMLR, 2013.

Zhu, Z., Hong, J., and Zhou, J.
Data-free knowledge distillation for heterogeneous federated learning. In International Conference on Machine Learning, pp. 12878-12889. PMLR, 2021.

A. Comments and Details on Section 3

A.1. Comments on Section 3.1

For simplicity, we write $Ah = A(h)$ for any $h \in H_1$ and an operator $A : H_1 \to H_2$, where $H_1$ and $H_2$ are Hilbert spaces. Define $A^1 h = Ah$ and $A^t h = A A^{t-1} h$ for any integer $t \ge 2$. For linear operators $A : H_1 \to H_2$ and $B : H_2 \to H_3$, we write the composition $B \circ A$ as $BA$, where $H_1$, $H_2$ and $H_3$ are Hilbert spaces. We use the scaled $L^2$ norm

$$\|a\|_2 = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2}$$

as the norm of the Euclidean space $\mathbb{R}^n$.

Note that the kernel $k$ is bounded since it is continuous on the compact domain $X \times X$. Thus, $\kappa < \infty$. From the positive definiteness of $k$, $k(x_1, x_2) \le \kappa^2$ holds for any $x_1, x_2 \in X$. From the boundedness of $k$, the continuity of $k$, and the separability of $X$, we can conclude that $H_k$ is separable by Corollary 4 of Section 1.5 in Berlinet & Thomas-Agnan (2011).

By Mercer's theorem (Rasmussen & Williams, 2006) and Proposition 4.2 in Ferreira & Menegatto (2013), $k(x, x') = \sum_{i=1}^{\infty} \mu_i \varphi_i(x) \varphi_i(x')$, where $\{\mu_i\}_{i=1}^{\infty}$ are the eigenvalues of the linear operator $L_k$ and each $\varphi_i \in H_k$ is an eigenfunction of $L_k$ corresponding to $\mu_i$, such that $\mu_i \ge 0$, $\sum_{i=1}^{\infty} \mu_i < \infty$, and $\{\varphi_i\}_{i=1}^{\infty}$ is an orthonormal basis of $L^2_{\rho_x}$.^2 Moreover, we know that the $\varphi_i$ are continuous and that $H_k$ carries the inner product

$$\langle f, g \rangle_{H_k} = \sum_{i=1}^{\infty} \frac{f_i g_i}{\mu_i}$$

where $f = \sum_{i=1}^{\infty} f_i \varphi_i$ and $g = \sum_{i=1}^{\infty} g_i \varphi_i$. We can easily see that $\{\sqrt{\mu_i}\,\varphi_i\}_{i=1}^{\infty}$ is an orthonormal basis of $H_k$. For $f \in L^2_{\rho_x}$ such that $f = \sum_{i=1}^{\infty} f_i \varphi_i$ $\rho_x$-almost everywhere with $\sum_{i=1}^{\infty} \frac{f_i^2}{\mu_i} < \infty$, we can regard $f$ as the element $\sum_{i=1}^{\infty} f_i \varphi_i$ of $H_k$.

In the definition of $\iota_{\rho_x} : H_k \to L^2_{\rho_x}$, $[h]_{\rho_x}$ means the equivalence class of all measurable functions that are equal to $h \in H_k$ $\rho_x$-almost everywhere. The adjoint operator $\iota_{\rho_x}^* : L^2_{\rho_x} \to H_k$ of $\iota_{\rho_x}$ satisfies

$$\iota_{\rho_x}^*([h]_{\rho_x})(x) = \langle \iota_{\rho_x}^*([h]_{\rho_x}), k_x \rangle_{H_k} = \langle h, \iota_{\rho_x}(k_x) \rangle_{L^2_{\rho_x}} = \int k(x, t)\, h(t) \, d\rho_x(t).$$

Thus, $L_k = \iota_{\rho_x}^* \iota_{\rho_x}$. From this fact and the compactness of $\iota_{\rho_x}^*$ (Steinwart & Christmann, 2008), we can see that $L_k$ is compact, self-adjoint, and positive. Also,

$$\|\iota_{\rho_x} f\|^2_{L^2_{\rho_x}} = \langle \iota_{\rho_x} f, \iota_{\rho_x} f \rangle_{L^2_{\rho_x}} = \langle \iota_{\rho_x}^* \iota_{\rho_x} f, f \rangle_{H_k} = \langle L_k f, f \rangle_{H_k} = \langle L_k^{1/2} f, L_k^{1/2} f \rangle_{H_k} = \|L_k^{1/2} f\|^2_{H_k}. \quad (9)$$

^2 In $L^2_{\rho_x}$, they are considered as equivalence classes containing the continuous functions $\varphi_i$.

Let $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ be a labeled dataset. The adjoint $S_D^* : \mathbb{R}^n \to H_k$ of $S_D$ maps $c = [c_1, \dots, c_n]^\top \in \mathbb{R}^n$ to

$$S_D^* c = \frac{1}{n} \sum_{i=1}^{n} c_i k_{x_i}.$$

Note that $L_{k,X} = S_D^* S_D$. Obviously $S_D$ is compact since it has finite rank. Therefore $L_{k,X}$ is also compact, self-adjoint, and positive. Similarly to $\iota_{\rho_x}$,

$$\|S_D f\|_2^2 = \langle S_D f, S_D f \rangle_2 = \langle S_D^* S_D f, f \rangle_{H_k} = \langle L_{k,D(x)} f, f \rangle_{H_k} = \langle L_{k,D(x)}^{1/2} f, L_{k,D(x)}^{1/2} f \rangle_{H_k} = \|L_{k,D(x)}^{1/2} f\|^2_{H_k}. \quad (10)$$

Since $S_D$ only depends on $D(x)$, we naturally define $S_{D(x)}$ by $S_{D(x)} f = [f(x_1), \dots, f(x_n)]^\top$ for $D(x) = \{x_1, \dots, x_n\}$. We write $k_X(\cdot) = [k_{x_1}(\cdot), \dots, k_{x_n}(\cdot)]^\top$ where $X = \{x_1, \dots, x_n\} \subset X$.

Assumption 3.1 is a condition on the noise term and Assumption 3.2 concerns the regularity of the target function $f_0$. If the noise is uniformly bounded, Gaussian, or sub-Gaussian, then (2) is satisfied (Caponnetto & De Vito, 2007; Lin et al., 2020a). (3) with $s = 1$ is always satisfied since

$$\mathcal{N}(\lambda) = \sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + \lambda} \le \frac{1}{\lambda} \sum_{i=1}^{\infty} \mu_i \le \frac{\kappa^2}{\lambda},$$

which follows from

$$\kappa^2 \ge \int k(x, x) \, d\rho_x = \int_X \sum_{i=1}^{\infty} \mu_i \varphi_i(x)^2 \, d\rho_x = \sum_{i=1}^{\infty} \mu_i.$$

Since $\mathcal{N}(\lambda) = \sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + \lambda}$, it is monotonically decreasing with respect to $\lambda$ and it increases as $\lambda \to 0$. Thus, there exists $\lambda_1 > 0$ (e.g., $\lambda_1 = \mu_2$) such that $\mathcal{N}(\lambda) \ge 1$ for $0 < \lambda < \lambda_1$.

A.2. Details on Section 3.2

First, consider the noiseless, data-free version:

$$\operatorname*{argmin}_{h \in H_k} \; J[h] = \|\iota_{\rho_x}(h - f_0)\|^2_{L^2_{\rho_x}} + \lambda \|h\|^2_{H_k}.$$
From (9), we obtain J[h] = h f0, Lk(h f0) Hk + λ h 2 Hk. Observe that J[h + ϵu] J[h] = ϵ 2Lk(h f0) + 2λh, u Hk + o(ϵ) since Lk is self-adjoint. By the definition of functional derivatives, we have J[h] = 2Lk(h f0) + 2λh. Note that J is strongly convex since 2J[h] = 2(Lk + λI) 2λI holds. From the first order optimality condition, the minimizer of the optimization problem is fλ = (Lk + λI) 1Lkf0. Towards Understanding Ensemble Distillation in Federated Learning However, most of regression problems have limited datasets with noisy labels. Let D = {(x1, y1), , (x N, y N)} be a labeled dataset whose data points are independently drawn from ρx,y. Then the optimization problem is given by argmin h Hk J[h] = 1 i=1 (h(xi) yi)2 + λ h 2 Hk = SDh y 2 2 + λ h 2 Hk where y = [y1, , y N] . Then, J[h + ϵu] J[h] = ϵ 2 SDu, SDh y 2 + ϵ 2λh, u Hk + o(ϵ) = ϵ 2S DSDh 2S Dy + 2λh, u Hk + o(ϵ). In other words, J[h] = 2S DSDh 2S Dy + 2λh = 2(Lk,D(x)h S Dy + λh). Note that J is strongly convex since 2J[h] = 2(Lk,D(x)+λI) 2λI holds. From the first order optimality condition, we can see that the minimizer f D,λ is given by f D,λ = (Lk,D(x) + λI) 1S Dy. In the matrix form we have f D,λ = k D(x)(NλI + KD(x),D(x)) 1y. B. Details on Section 4 We now derive (4) and (5) as in Section A.2. Let g Hk be a teacher model which approximates f0. Consider a noiseless data-free version of kernel ridge regression problem with knowledge distillation : argmin h Hk J[h] = α ιρx(h f0) 2 L2ρx + (1 α) ιρx(h g) 2 L2ρx + λ h 2 Hk where α (0, 1) is a distillation hyperparameter and λ > 0 is a regularization hyperparameter. Then J[h + ϵu] J[h] = ϵ 2α Lk(h f0) + 2(1 α) Lk(h g) + 2λh, u Hk + o(ϵ) and so J[h] = 2α Lk(h f0) + 2(1 α) Lk(h g) + 2λh. Again, 2J[h] = 2(Lk + λI) 2λI holds. Also, the minimizer fλ of the optmization problem is given by fλ = (Lk + λI) 1(αLkf0 + (1 α)Lkg) using the first order optimality condition. Let D1 = {(x1 1, y1 1), , (x N1 1 , y N1 1 )} be a dataset to train a student model whose data points are independently generated from ρx,y. To distill knowledge of the teacher model g, we assume an unlabeled dataset D2(x) = {x1 2, , x N2 2 } whose data points are independently generated from ρx, is given. Then the optimization problem is argmin h Hk J[h] = α 1 i=1 (h(xi 1) yi 1)2 + (1 α) 1 i=1 (h(xi 2) g(xi 2))2 + λ h 2 Hk = α SD1h y1 2 2 + (1 α) SD2(h g) 2 2 + λ h 2 Hk where α (0, 1) and λ > 0 are hyperparameters and y1 = [y1 1, , y N 1 ] . From J[h + ϵu] J[h] = αϵ 2 SD1u, SD1h y1 2 + (1 α)ϵ 2 SD2u, SD2h SD2g 2 + ϵ 2λh, u Hk + o(ϵ) = ϵ 2αS D1SD1h + 2(1 α)S D2SD2h 2αS D1y1 2(1 α)S D2SD2g + 2λh, u Hk + o(ϵ), Towards Understanding Ensemble Distillation in Federated Learning we have J[h] = 2(αLk,D1(x) + (1 α)Lk,D2(x) + λI)h 2(αS D1y1 + (1 α)Lk,D2(x)g). Observe that 2J[h] = 2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 2λI holds. Applying the first order optimality condition, the minimizer f D,λ of the optimization problem is f D,λ = (αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)g). The solution is written in the matrix form as f D,λ = h k D1(x) k D2(x) i (λI + DKD1(x) D2(x),D1(x) D2(x)) 1D y1 g(D2(x)) where D = diag(α/N1, , α/N1 | {z } N1 , (1 α)/N2, , (1 α)/N2 | {z } N2 ) and g(D2(x)) = [g(x1 2), , g(x N2 2 )] . B.1. Proof of Theorem 4.1 We first prove the following theorem. Theorem B.1. With the same notation as in Section 4, under Assumption 3.1 and 3.2, E ιρx( f D,λ f0) L2ρx 9 α N1 + 1 α N2 ! 
2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk + (1 α)1/2E h (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 L1/2 k,D2(x)(g f0) Hk i Proof of Theorem B.1. Using (9) and Lemma D.3, we have ιρx( f D,λ f0) L2ρx = L1/2 k ( f D,λ f0) Hk (Lk + λI)1/2( f D,λ f0) Hk. By the triangle inequality and the submultiplicativity of the operator norm, (Lk + λI)1/2( f D,λ f0) Hk (Lk + λI)1/2((αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)f0) f0) Hk + (1 α) (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1Lk,D2(x)(g f0) Hk. (13) First, we bound the first term in (13). Note that (Lk + λI)1/2((αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)f0) f0) = (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk + λI)1/2(Lk + λI) 1/2(α(S D1y1 Lk,D1(x)f0) λf0). Let Qd = (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk + λI)1/2 . By Lemma D.1, Lemma D.5, the triangle inequality, and the submultiplicativity of the operator norm, we have Qd (αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk + λI) = I + α(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk Lk,D1(x)) + (1 α)(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk Lk,D2(x)) λ(α Lk,D1(x) Lk + (1 α) Lk,D2(x) Lk ). (14) Towards Understanding Ensemble Distillation in Federated Learning On the other hand, by the submultiplicativity of the operator norm and Lemma D.1, (Lk + λI) 1/2α(S D1y1 Lk,D1(x)f0) Hk α (Lk + λI) 1/2 S D1y1 Lk,D1(x)f0 λ S D1y1 Lk,D1(x)f0 . λ(Lk + λI) 1/2f0 Hk = λ(Lk + λI) 1/2Lr kg0 Hk λ 1 2 +r g0 Hk (15) by Lemma D.1. Therefore, by Lemma D.7 and Lemma D.8, we have (Lk + λI)1/2((αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)f0) f0) Hk α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk (log(6/δ))3/2 with confidence at least 1 δ. By Lemma D.9, E (Lk + λI)1/2((αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)f0) f0) Hk α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk since Γ(5/2) = 3 π/4 < 3/2. The second term in (13) satisfies (1 α) (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1Lk,D2(x)(g f0) Hk (1 α) (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1L1/2 k,D2(x)L1/2 k,D2(x)(g f0) Hk (1 α) (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 (αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2(Lk,D2(x) + λI)1/2 (Lk,D2(x) + λI) 1/2L1/2 k,D2(x) L1/2 k,D2(x)(g f0) Hk by the submultiplicativity of the operator norm. Applying Lemma D.1 and (1 α)1/2 (αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2(Lk,D2(x) + λI)1/2 1 which follows from Lemma D.2, we obtain that the second term in (13) is bounded by (1 α)1/2 (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 L1/2 k,D2(x)(g f0) Hk by the submultiplicativity of the operator norm. Taking the expectation completes the proof of Theorem B.1. The following corollary implies Theorem 4.1. Corollary B.2. Suppose Assumption 3.1 and 3.2 hold and g is independent of D2(x). With the same notation as in Section 4, if we set α = N1/(N1 + N2) and λ = (N1 + N2) 1 2r+2 , then E ιρx( f D,λ f0) L2ρx O (N1 + N2) 2r+1 4r+4 + E ιρx(g f0) 2 L2ρx Proof of Corollary B.2. Set α = N1/(N1 + N2) and λ = (N1 + N2) 1 2r+2 . We will bound the given upper bound of E ιρx( f D,λ f0) L2ρx in Theorem B.1. We first observe that α N1 + 1 α N2 2κ2(N1 + N2) 1 2r+2 N1 + N2 1 + 4κ2 (16) Towards Understanding Ensemble Distillation in Federated Learning 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk = 2κ(M + γ)(N1 + N2) 1 4r+4 N1 N1 + N2 1 N1 + (N1 + N2) 2r+1 (2κM + 2κγ + g0 Hk) (N1 + N2) 2r+1 since N1 + N2 p 2(N1 + N2) and N1 N1 + N2. Thus, we know that α N1 + 1 α N2 ! 
2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk = O (N1 + N2) 2r+1 Second, by the Cauchy-Schwarz inequality (Conway, 2019), E h (1 α)1/2 (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 L1/2 k,D2(x)(g f0) Hk i E (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 2 1/2 E L1/2 k,D2(x)(g f0) 2 Hk 1/2 . Since (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 2 = Qd where Qd is defined in the proof of Theorem B.1, by (14) and Lemma D.7, we have α N1 + 1 α N2 (log(4/δ))1/2 α N1 + 1 α N2 (log(4/δ)) (17) with confidence at least 1 δ where δ (0, 1) and so E (Lk +λI)1/2(αLk,D1(x) +(1 α)Lk,D2(x) +λI) 1/2 2 4Γ 3 α N1 + 1 α N2 by Lemma D.9 and (16) since Γ(3/2) = π/2 < 1. On the other hand, by (10) and the independence of g and D2(x), we find that E L1/2 k,D2(x)(g f0) 2 Hk = E SD2(g f0) 2 2 = E i=1 (g(xi 2) f0(xi 2))2 # i=1 (g(xi 2) f0(xi 2))2 | g = E(g(x1 2) f0(x1 2))2 = E ιρx(g f0) 2 L2ρx. E ιρx( f D,λ f0) L2ρx O (N1 + N2) 2r+1 4r+4 + E ιρx(g f0) 2 L2ρx Remark B.3. Since g is fixed in Section 4, we assume g is independent of D2(x) in Corollary B.2. In this case, when g is a KRR model trained using a dataset D whose size is N such that D is independent of D2(x), we know that E ιρx(g f0) 2 L2ρx 1/2 = O N 2r+1 which can be proved by a similar argument as in the proof of Theorem 3.3 provided in Caponnetto & De Vito (2007). Thus, we can conclude that the generalization error of f D,λ has the convergence rate O min(N1 + N2, N) 2r+1 4r+4 . If D is not independent of D2(x), then it could be more complicated. However, we can achieve the same result using Lemma D.13 under some assumptions. We deal with this case in Section 5.4. Towards Understanding Ensemble Distillation in Federated Learning C. Proofs and Comments on Section 5 C.1. Proof of Theorem 5.1 We first show the following theorem that is similar to Theorem 13 in Lin et al. (2020a). Theorem C.1. Let i=1 (Lk,Xi + λI) 1S Diyi. Under Assumption 3.1 and 3.2, E ιρx( f D,λ f0) 2 L2ρx = O λ2r+1 + 1 λm N + λ2r+1B2 + O(1) λ2r+1B2 + 1 λm N 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B holds if 0 < λ 1 and N(λ) 1 where B = 1 + log N(λ) 1 + log N(λ) Proof of Theorem C.1. We use a similar argument as in Lin et al. (2020a). Recall Proposition 9 in Lin et al. (2020a): E ιρx( f D,λ f0) 2 L2ρx 2 ιρx(fλ f0) 2 L2ρx + 1 m E Q4(P + S fλ Hk)2 + E h Q4R2 (Lk + λI)1/2(fλ f0) 2 Hk i P = (Lk + λI) 1/2(Lkf0 S D1y D1) Hk, Q = (Lk + λI)1/2(Lk,X1 + λI) 1/2 , R = (Lk + λI) 1/2(Lk Lk,X1)(Lk + λI) 1/2 , and S = (Lk + λI) 1/2(Lk Lk,X1) . By (9), Lemma D.1, Lemma D.3, and the submultiplicativity of the operator norm, we have ιρx(fλ f0) 2 L2ρx = L1/2 k (fλ f0) 2 Hk (Lk + λI)1/2(fλ f0) 2 Hk = λ(Lk + λI) 1/2Lr kg0 2 Hk λ2r+1 g0 2 Hk. (18) fλ Hk = (Lk + λI) 1Lkf0 Hk f0 Hk (19) by the submultiplicativity of the operator norm and Lemma D.1. Therefore, applying Lemma D.7 and Lemma D.8 leads to Q4(P + S fλ Hk)2 16κ2(2 2κ f0 Hk + M + γ)2 1 λN (log(6/δ))2 with confidence at least 1 δ for 12 exp( 1/4(κ2 + 1)B) δ < 1. On the other hand, using the trivial bound λ (Lk + λI)1/2 κ2 + λ which follows from the submultiplicativity of the operator norm and (Lk + λI)1/2 = (µ1 + λ)1/2 (κ2 + λ)1/2, we get Q4(P + S fλ Hk)2 4κ2(2 2κ f0 Hk + M + γ)2 1 + κ2 2 1 λN (log(6/δ))2 Towards Understanding Ensemble Distillation in Federated Learning with confidence at least 1 δ for δ (0, 1). By Lemma D.10 and Remark D.11, E Q4(P + S fλ Hk)2 = O 1 + O(1) 1 λN 1 + κ2 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B Next we turn to bound E Q4R2 (Lk + λI)1/2(fλ f0) 2 Hk . 
By Lemma D.7 and (20), Q4R2 16(κ2 + 1)2B2(log(8/δ))2 with confidence at least 1 δ for 8 exp( 1/4(κ2 + 1)B) δ < 1 and Q4R2 4 1 + κ2 2 (κ2 + 1)2B2(log(8/δ))2 with confidence at least 1 δ for δ (0, 1). Again, using (18), Lemma D.10, and Remark D.11, we obtain E h Q4R2 (Lk + λI)1/2(fλ f0) 2 Hk i = O(λ2r+1B2) + O(1) λ2r+1B2 1 + κ2 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B Combining the bounds, we are done. Now, we derive the performance of one-shot ensemble distillation. The following corollary implies Theorem 5.1. Corollary C.2. Assume Assumption 3.1 and 3.2 hold. Also, assume m N 2r+1 ϵ for any fixed ϵ (0, 1) and Np (m 1)N. Let f i D,λ = (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)Lk,Xp f D,λ) which is the local model of client i after one-shot ensemble distillation (i = 1, , m) where i=1 (Lk,Xi + λI) 1S Diyi. Then, with α = 1/m and λ = (m N) 1 2r+2 , E ιρx( f i D,λ f0) L2ρx = O (m N) 2r+1 for any i = 1, , m. Proof of Corollary C.2. When m or N increases, Np also increases. Thus, it is enough to show the statement under the assumption N N 1 2r+2 p Although we cannot directly apply Corollary B.2, we can show a similar statement using the same way as in the proof of Corollary B.2. Since α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk = O (m N) 2r+1 E (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 2 4 4(1 + 4κ2), Towards Understanding Ensemble Distillation in Federated Learning E ιρx( f i D,λ f0) L2ρx O (m N) 2r+1 4r+4 + E ιρx( f D,λ f0) 2 L2ρx Therefore, it suffices to show that E ιρx( f D,λ f0) 2 L2ρx = O (m N) 2r+1 From m N 2r+1 ϵ, B log(eκ2N) 1 + log(eκ2N) 1 + log(eκ2N) f(x) = log(eκ2x) 1 + log(eκ2x) is continuous on [1, ) and vanishes at , f(x) C(κ, r, ϵ) holds for some C(κ, r, ϵ) which only depends on κ, r and ϵ. Then B C(κ, r, ϵ) 1 ϵ 4(2r+2) . Also, since 0 < B C(κ, r, ϵ) and xβ exp 1 4(κ2 + 1)x is continuous on (0, C(κ, r, ϵ)] and vanishes at 0+, 1 Bβ exp 1 4(κ2 + 1)B C (κ, r, ϵ, β) for some C (κ, r, ϵ, β) which only depends on κ, r, ϵ and β when β > 0 is a fixed constant. Taking β = 1 + 8 ϵ (2r + 2), we find that exp 1 4(κ2 + 1)B B1+ 8(2r+2) ϵ C κ, r, ϵ, 1 + 8(2r + 2) ϵ C κ, r, ϵ, 1 + 8(2r + 2) Hence, 1 + κ2 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B (1 + κ2)2N 2C(κ, r, ϵ) ϵ C κ, r, ϵ, 1 + 8(2r + 2) N 2 1 + 1 4(κ2 + 1)B As a consequence, by Theorem C.1, we get E ιρx( f D,λ f0) 2 L2 ρx = O λ2r+1 + 1 λm N + λ2r+1B2 + O(1) λ2r+1B2 + 1 λm N 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B = O λ2r+1 + 1 λm N = O (m N) 2r+1 2r+2 + (m N) 2r+1 2r+2 = O (m N) 2r+1 Remark C.3. In the proof of Corollary C.2, m N 2r+1 is not sufficient to derive the same result since this condition does not guarantee the boundedness of B. Towards Understanding Ensemble Distillation in Federated Learning C.2. Proof of Theorem 5.2 Define an affine operator H : Hk Hk by Hg = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)Lk,Xpg) which is a unique solution of argmin h Hk J[h] = α 1 i=1 (h(xi 1) yi 1)2 + (1 α) 1 i=1 (h(xi 2) g(xi 2))2 + λ h 2 Hk. Therefore, the limiting regressor of the local model f1 is H f D1,λ = limt Htf D1,λ after infinitely many iterations in Algorithm 1. Now, we prove the following theorem which is a more general statement of Theorem 5.2. Theorem C.4. For any initial point h0 Hk, Hth0 converges to a unique fixed point f D1,λ/α of H as t . Proof of Theorem C.4. Define another operator H : Hk Hk by Hg = L1/2 k,Xp(αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)L1/2 k,Xpg). 
Observe that Hg1 Hg2 Hk (1 α)L1/2 k,Xp(αLk,X1 + (1 α)Lk,Xp + λI) 1L1/2 k,Xp Since (1 α)Lk,Xp < αLk,X1 + (1 α)Lk,Xp + λI and (1 α)Lk,Xp is of finite rank which implies it is compact, we have (1 α)L1/2 k,Xp(αLk,X1 + (1 α)Lk,Xp + λI) 1L1/2 k,Xp by Lemma D.2. Thus, H is a η-contraction where η (0, 1). By the Banach fixed point theorem (Banach, 1922), Hth0 converges to a unique fixed point g of H as t . Define operators A = L1/2 k,Xp and B = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)L1/2 k,Xp ). Then Hh = BAh and Hh = ABh for any h Hk. From the definition of g , B g = B H g = BAB g = HB g which implies that B g = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)L1/2 k,Xp g ) is a fixed point of H. Let g be a fixed point of H, i.e., g satisfies Hg = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)Lk,Xpg ) = g . Then, αS D1y1 + (1 α)Lk,Xpg = (αLk,X1 + (1 α)Lk,Xp + λI)g , i.e., g = (Lk,X1 + λ/αI) 1S D1y1 = f D1,λ/α. Hence, H has a unique a fixed point and so Hth0 = B Ht 1Ah0 converges to B g = g = f D1,λ/α as t . Remark C.5. Note that H may not be a contraction since (A + B + I) 1B < 1 may not be true even if operators A and B are positive compact operators on a Hilbert space. So we have to follow the above argument. Towards Understanding Ensemble Distillation in Federated Learning Algorithm 2 KRR with iterative De-regularized Ensemble Distillation in FL 1: Input: hyperparameters α (0, 1), λ > 0, λ0 0 and t N 2: Output: Trained model fj, j = 1, , m 3: Pretrain: For j = 1, , m, client j trains its model fj using the loss function argmin h Hk i=1 (h(xi j) yi j)2 + λ h 2 Hk. 4: Each client downloads the unlabeled public dataset Dp(x). 5: for t0 = 1, , t do 6: For j = 1, , m, client j predicts on Dp(x) and upload the prediction yj p to server. 7: The server computes an updated consensus 8: if t0 = t then 9: The server applies the de-regularization trick to yp: yp = (SDp(Lk,Xp + λ0I) 1S Dp) 1 yp. 10: end if 11: Each client downloads the ensemble prediction yp. 12: For j = 1, , m client j updates its model fj using the loss function argmin h Hk α 1 i=1 (h(xi j) yi j)2 + (1 α) 1 i=1 (h(xi p) ( yp)i)2 + λ h 2 Hk. 13: end for C.3. Algorithm on KRR with Iterative De-regularized Ensemble Distillation in Federated Learning Setting Applying the de-regularization trick except the last ensemble distillation step, we provide Algorithm 2. C.4. Proof of Theorem 5.3 Similarly as in Appendix C.2, define an affine operator T : Hk Hk by Tg = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDpg). Then the limiting regressor of the local model f1 is T f D1,λ = limt T tf D1,λ after infinitely many ensemble distillation steps with the de-regularization trick. As a result, the final local model is given by (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)Lk,Xp T f D1,λ). We deal with T f D1,λ in this subsection to obtain some motivations. Before we prove Theorem 5.3, we provide the following lemma. Lemma C.6. Assume KXp,Xp > 0. Then, S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp = Lk,Xp + λ0PDp holds where PDp is the orthogonal projection onto a subspace span (k(x, ) : x Dp(x)) of the reproducing kernel Hilbert space Hk. Towards Understanding Ensemble Distillation in Federated Learning Proof of Lemma C.6. Note that (Lk,Xp + λ0I) 1S Dpv = h k( , x1 p) k( , x Np p ) i (KXp,Xp + Npλ0I) 1v and so SDp(Lk,Xp + λ0I) 1S Dpv = KXp,Xp(KXp,Xp + Npλ0I) 1v. We thus obtain (Lk,Xp S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp)h = S Dp(I (SDp(Lk,Xp + λ0I) 1S Dp) 1)SDph Npλ0K 1 Xp,Xp h(x1 p) ... 
h(x Np p ) = λ0 h k( , x1 p) k( , x Np p ) i K 1 Xp,Xp h(x1 p) ... h(x Np p ) On the other hand, from SDp S Dpv = SDp i=1 vik(xi p, ) Np KXp,Xpv, S Dp(SDp S Dp) 1SDph = S Dp Np K 1 Xp,Xp h(x1 p) ... h(x Np p ) = h k( , x1 p) k( , x Np p ) i K 1 Xp,Xp h(x1 p) ... h(x Np p ) Therefore, Lk,Xp S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp = λ0S Dp(SDp S Dp) 1SDp, i.e., S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp = Lk,Xp + λ0S Dp(SDp S Dp) 1SDp. Set P = S Dp(SDp S Dp) 1SDp. Then we observe that (i) P is idempotent since P 2 = S Dp(SDp S Dp) 1SDp S Dp(SDp S Dp) 1SDp = S Dp(SDp S Dp) 1SDp = P. (ii) P is symmetric since P = (S Dp(SDp S Dp) 1SDp) = S Dp((SDp S Dp) 1) SDp = S Dp((SDp S Dp) ) 1SDp = S Dp(SDp S Dp) 1SDp = P. (iii) The range of P is span (k(x, ) : x Dp(x)) since the range of S Dp is contained in span (k(x, ) : x Dp(x)) and for any u( ) span (k(x, ) : x Dp(x)) there exists v RNp such that u = S Dpv which satisfies PS Dpv = S Dp(SDp S Dp) 1SDp S Dpv = S Dpv = u. By Proposition 3.3 of chapter 2 in Conway (2019), P is the orthogonal projection of Hk onto span (k(x, ) : x Dp(x)). From the above lemma, we can applied the Banach fixed point theorem (Banach, 1922) again as before. Towards Understanding Ensemble Distillation in Federated Learning Lemma C.7. Assume (1 α)λ0 < λ and KXp,Xp > 0. Define an operator T : Hk Hk as Tg = (S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp)1/2 (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)(S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp)1/2g). Then T is a η-contraction where η (0, 1). Thus, T tg converges to a unique fixed point g of T as t . Proof of Lemma C.7. By Lemma C.6, Tg = (Lk,Xp + λ0PDp)1/2(αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)(Lk,Xp + λ0PDp)1/2g). Since PDp is a projection, 0 PDp I and so Lk,Xp Lk,Xp + λ0PDp Lk,Xp + λ0I. 0 (1 α)(Lk,Xp + λ0PDp) < αLk,X1 + (1 α)Lk,Xp + λI. (21) Since (1 α)(Lk,Xp + λ0PDp) is of finite rank which implies it is compact, by Lemma D.2, (1 α)(Lk,Xp + λ0PDp)1/2(αLk,X1 + (1 α)Lk,Xp + λI) 1(Lk,Xp + λ0PDp)1/2 < 1. (22) Hence, from the submultiplicativity of the operator norm, we have Tg1 Tg2 Hk = (1 α)(Lk,Xp + λ0PDp)1/2(αLk,X1 + (1 α)Lk,Xp + λI) 1(Lk,Xp + λ0PDp)1/2(g1 g2) Hk (1 α)(Lk,Xp + λ0PDp)1/2(αLk,X1 + (1 α)Lk,Xp + λI) 1(Lk,Xp + λ0PDp)1/2 g1 g2 Hk, i.e., T is a η-contraction where η < 1. Applying the Banach fixed point theorem (Banach, 1922), we are done. Similarly as in Appendix C.2, we prove the following general statement. Obviously, the following theorem implies Theorem 5.3. Theorem C.8. Assume λ0 = λ and KXp,Xp > 0. For any h0 Hk, T th0 converges to a unique fixed point Lk,X1 + λI + 1 α of T as t . Proof of Theorem C.8. Let A = (Lk,Xp + λ0PDp)1/2, B = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)(Lk,Xp + λ0PDp)1/2 ). Then Th = BAh and Th = ABh for any h Hk. Let g be a unique fixed point of T where the existence and the uniqueness of the fixed point of T follow from Lemma C.7. From the definition, B g = B T g = BAB g = TB g holds, i.e., B g is a fixed point of T. Let g be a fixed point of T. From Tg = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)(Lk,Xp + λ0PDp)g ) = g , we obtain (αLk,X1 + λI (1 α)λ0PDp)g = αS D1y1. Set λ0 = λ. Then λI (1 α)λ0PDp = αλI + (1 α)λ(I PDp). By Proposition 3.2 of chapter 2 in Conway (2019), I PDp = P Dp. Therefore, g = Lk,X1 + λI + 1 α In particular, a fixed point of T is unique. Consequently, T th0 = B T t 1Ah0 converges to g as t by Lemma C.7. Towards Understanding Ensemble Distillation in Federated Learning Remark C.9. Without the de-regularization trick, the fixed point of H is Lk,X1 + λ αI 1 S D1y1. 
With the de-regularization trick, the fixed point of T is Lk,X1 + λI + 1 α Thus, we can see that the de-regularization replaces 1 α α λI by 1 α α λP Dp(x). C.5. Proof of Theorem 5.4 Proof of Theorem 5.4. Using the formula A 1 B 1 = A 1(B A)B 1, (23) f D,λ f D1,λ = Lk,X1 + λI + 1 α α λ(I PDp(x)) 1 1 α α λ(I PDp(x)) (Lk,X1 + λI) 1 S D1y1. By Theorem 11 in Aydın & Gheondea (2021), (I PDp(x)) (Lk,X1 + λI) 1 S D1y1 0 almost surely as Np . Therefore, by Lemma D.1, f D,λ f D1,λ Hk Lk,X1 + λI + 1 α α λ(I PDp(x)) 1 1 α α λ (I PDp(x)) (Lk,X1 + λI) 1 S D1y1 Hk (I PDp(x)) (Lk,X1 + λI) 1 S D1y1 Hk 0 almost surely as Np . Remark C.10. To apply the result provided in Aydın & Gheondea (2021), we assume that the density ρx is strictly positive on any non-empty open subset of X in this case. In Section 5.4, we assume Assumption 3.2 with 0 < r 1 2 instead of this condition to control the error. C.6. Proof of Theorem 5.5 In this subsection, we assume λ0 = λ and KXp,Xp > 0. Similarly as before, we define an affine operator T : Hk Hk by 1 m(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDpg) 1 m(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)(Lk,Xp + λPDp(x))g) (24) where the second equality follows from Lemma C.6. Also, define T : Hk Hk as 1 m(Lk,Xp + λPDp(x))1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)(Lk,Xp + λPDp(x))1/2g). Here, PDp(x) is the orthogonal projection onto a subspace span (k(x, ) : x Dp(x)) Towards Understanding Ensemble Distillation in Federated Learning of the reproducing kernel Hilbert space Hk. By the submultiplicativity of the operator norm, the triangle inequality, and (22), we have Tg1 Tg2 Hk 1 i=1 (Lk,Xp + λPDp(x))1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)(Lk,Xp + λPDp(x))1/2 i=1 (Lk,Xp + λPDp(x))1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)(Lk,Xp + λPDp(x))1/2 < 1, i.e., T is a η-contraction where η < 1. Set A = (Lk,Xp + λPDp(x))1/2, B = 1 m(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)(Lk,Xp + λPDp(x))1/2 ). Then Th = BAh and Th = ABh for any h Hk. Let g be a unique fixed point of T where the existence and the uniqueness of a fixed point of T follows from the Banach fixed point theorem (Banach, 1922). Then, for any h0 Hk, T th0 = B T t 1Ah0 converges to B g . The following lemma is regarding the computation of B g . Lemma C.11. With the same notation as given above, i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi + i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U1/2 ! i=1 U1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U1/2 ! 1 U1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi where U = Lk,Xp + λPDp(x) = S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp. Proof. From the definition of g , we have 1 m U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)U 1/2 g ). Note that, from (21), i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 < I which implies that I 1 m Pm i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 is invertible. Hence, i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi Plugging (25) into B g yields 1 m(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)U1/2 g ) i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi + i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U1/2 ! i=1 U1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U1/2 ! 1 U1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi Towards Understanding Ensemble Distillation in Federated Learning Since Algorithm 2 conducts knowledge distillation on the public dataset Dp(x) using B g in the last step, it suffices to evaluate the performance of B g on Dp(x). To this end, we need two additional lemmas: Lemma C.12 and Lemma C.13. 
Lemma C.12 can be viewed as an extension of the distributed semi-supervised kernel ridge regression (Chang et al., 2017). Lemma C.12. Assume m 2, Assumption 3.1 and 3.2. Let f s D,λ = 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi. SDp(f s D,λ f0) 2 Λ + 1 where Λ and Λi are random variables for i = 1, , m such that Λ Λ(log(6/δ))5/4 with confidence at least 1 δ and each Λi Λi(log(6/δ))3/2 with confidence at least 1 δ. Here, !1/2 2κ(M + γ) 2κ2 f0 Hk p λ1/2+r g0 Hk Λi = α (1 α) In particular, under the assumption that Np (m 1)N, with α = 1/m and λ = (m N) 1 2r+2 , we have Λ = O (m N) 2r+1 4r+4 and Λi = O (m N) 2r+1 4r+4 for i = 1, , m, which implies that E SDp(f s D,λ f0) 2 = O (m N) 2r+1 Proof of Lemma C.12. Using (10) and Lemma D.3, we obtain SDp(f s D,λ f0) 2 = L1/2 k,Xp(f s D,λ f0) Hk (Lk,Xp + λI)1/2(f s D,λ f0) Hk . f s D,λ = (αLk + (1 α)Lk,Xp + λI) 1S Dy = 1 i=1 (αLk + (1 α)Lk,Xp + λI) 1S Diyi where y = [y 1 , , y m] . By the triangle inequality, (Lk,Xp + λI)1/2(f s D,λ f0) Hk (Lk,Xp + λI)1/2( f s D,λ f0) Hk + (Lk,Xp + λI)1/2(f s D,λ f s D,λ) Hk . (26) First, we bound the first term in (26). Note that (Lk,Xp + λI)1/2( f s D,λ f0) = (Lk,Xp + λI)1/2(αLk + (1 α)Lk,Xp + λI) 1(S Dy Lkf0) + (Lk,Xp + λI)1/2((αLk + (1 α)Lk,Xp + λI) 1 (Lk + λI) 1)Lkf0 + (Lk,Xp + λI)1/2(fλ f0). (27) Towards Understanding Ensemble Distillation in Federated Learning Qp = (Lk,Xp + λI) 1/2(Lk + λI)1/2 , Qp = (Lk,Xp + λI)1/2(Lk + λI) 1/2 , P = (Lk + λI) 1/2(S Dy Lk,Xf0) Hk, S = Lk,X Lk , and Sp = Lk,Xp Lk . Note that, by Lemma D.2, αLk + (1 α)Lk,Xp + λI (1 α)(Lk,Xp + λI) (Lk,Xp + λI)1/2(αLk + (1 α)Lk,Xp + λI) 1(Lk,Xp + λI)1/2 1 1 α. (28) Thus, the first term in (27) satisfies (Lk,Xp + λI)1/2(αLk + (1 α)Lk,Xp + λI) 1(S Dy Lkf0) Hk 1 1 αQp Also, the third term in (27) satisfies (Lk,Xp + λI)1/2(fλ f0) Hk Qp (Lk + λI)1/2(fλ f0) Hk and the second term in (27) satisfies (Lk,Xp + λI)1/2((αLk + (1 α)Lk,Xp + λI) 1 (Lk + λI) 1)Lkf0 Hk = (Lk,Xp + λI)1/2(αLk + (1 α)Lk,Xp + λI) 1(1 α)(Lk Lk,Xp)(Lk + λI) 1Lkf0 Hk (Lk,Xp + λI) 1/2 Sp fλ Hk by the submultiplicativity of the operator norm, (23), and (28). Combining them yields (Lk,Xp + λI)1/2( f s D,λ f0) Hk 1 1 αQp λ S + f0 Hk λ Sp + Qp (Lk + λI)1/2(fλ f0) Hk (29) by Lemma D.1 and (19). From (Lk + λI)1/2(fλ f0) Hk = λ(Lk + λI) 1/2f0 Hk = λ(Lk + λI) 1/2Lr kg0 Hk λ1/2+r g0 Hk which follows from Lemma D.1 and the submultiplicativity of the operator norm, the inequality (29) becomes (Lk,Xp + λI)1/2( f s D,λ f0) Hk 1 1 αQp λ S + f0 Hk λ Sp + Qpλ1/2+r g0 Hk. Now, we derive PAC-bounds for Qp, Qp, P, S, and Sp. By Lemma D.1, Lemma D.4, the triangle inequality, and the submultiplicativity of the operator norm, we have Qp (Lk,Xp + λI) 1(Lk + λI) 1/2 = I + (Lk,Xp + λI) 1(Lk Lk,Xp) 1/2 1 + 1 λ Lk Lk,Xp 1/2 (30) Qp (Lk,Xp + λI)(Lk + λI) 1 1/2 = I + (Lk,Xp Lk)(Lk,Xp + λI) 1 1/2 1 + 1 λ Lk Lk,Xp 1/2 . Towards Understanding Ensemble Distillation in Federated Learning By Lemma D.7 and Lemma D.8, with confidence at least 1 δ, m N (log(6/δ))1/2, Lk,Xp Lk 2 Np (log(6/δ))1/2 and P 2κ(M + γ) λm N log(6/δ) where δ (0, 1). Then λ Lk Lk,Xp 1/2 (log(6/δ))1/4. (Lk,Xp + λI)1/2( f s D,λ f0) Hk 1 1 α !1/2 2κ(M + γ) (log(6/δ))5/4 Np (log(6/δ))1/2 + λ1/2+r g0 Hk(log(6/δ))1/4 with confidence at least 1 δ. Define !1/2 2κ(M + γ) 2κ2 f0 Hk p λ1/2+r g0 Hk. Then (Lk,Xp + λI)1/2( f s D,λ f0) Hk Λ(log(6/δ))5/4 with confidence at least 1 δ. We next bound the second term (Lk,Xp + λI)1/2( f s D,λ f s D,λ) Hk in (26). First, we bound the norm of f s,i D,λ = (αLk + (1 α)Lk,Xp + λI) 1S Diyi. 
Observe that f s,i D,λ = (αLk + (1 α)Lk,Xp + λI) 1(S Diyi Lkf0) + (αLk + (1 α)Lk,Xp + λI) 1(Lk + λI)(Lk + λI) 1Lkf0. Set Q p = (αLk + (1 α)Lk,Xp + λI) 1(Lk + λI) . By Lemma D.1 and the submultiplicativity of the operator norm, we have Q p = I + (1 α)(αLk + (1 α)Lk,Xp + λI) 1(Lk Lk,Xp) 1 + 1 α λ Lk Lk,Xp . Again, applying Lemma D.1, the submultiplicativity of the operator norm, the triangle inequality, and (19), we obtain f s,i D,λ Hk 1 λ S Diyi Lkf0 Hk + 1 + 1 α λ Lk Lk,Xp f0 Hk. Now, by the submultiplicativity of the operator norm, the triangle inequality, and (23), (Lk,Xp + λI)1/2( f s D,λ f s D,λ) Hk i=1 (Lk,Xp + λI)1/2((αLk + (1 α)Lk,Xp + λI) 1 (αLk,Xi + (1 α)Lk,Xp + λI) 1)S Diyi Hk i=1 (Lk,Xp + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1α(Lk,Xi Lk) f s,i D,λ Hk. Towards Understanding Ensemble Distillation in Federated Learning Simliarly as in (28), we can easily see that (Lk,Xp + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xp + λI)1/2 1 1 α. (31) Then, applying (31), Lemma D.1, and the submultiplicativity of the operator norm yields i=1 (Lk,Xp + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1α(Lk,Xi Lk) f s,i D,λ Hk λ Lk,Xi Lk Hk f s,i D,λ Hk. To find a PAC-bound of Lk,Xi Lk f s,i D,λ Hk, we use Lemma D.7 and Lemma D.8. Then, with confidence at least 1 δ, Np (log(6/δ))1/2, Lk,Xi Lk 2 N (log(6/δ))1/2, S Diyi Lk,Xif0 Hk 2κ(M + γ) N log(6/δ). Lk,Xi Lk f s,i D,λ Hk N (log(6/δ))3/2 with confidence at least 1 δ. Define Λi = α (1 α) for i = 1, , m. Then α (1 α) λ Lk,Xi Lk Hk f s,i D,λ Hk Λi(log(6/δ))3/2 with confidence at least 1 δ for any i = 1, , m. Now, we set α = 1/m and λ = (m N) 1 2r+2 . We also assume that Np (m 1)N. Then and 2κ(M + γ) (2κ(M + γ) + 2 2κ2 f0 Hk) m1/(2r+2). Thus, we have Λ 2(1 + 4κ2)1/2 2κ(M + γ) + 2 2κ2 f0 Hk (m N)(2r+1)/(4r+4) + 4κ2 f0 Hk (m N)(2r+1)/(4r+4) + (1 + 4 2κ2)1/2 g0 Hk (m N)(2r+1)/(4r+4) = O (m N) 2r+1 2κ2 (2κ(M + γ) + (1 + (4 + 2 2)κ2) f0 Hk) m1/(2r+2) m1/2 1 (m N)(2r+1)/(4r+4) = O (m N) 2r+1 By Lemma D.9, we conclude that E SDp(f s D,λ f0) 2 EΛ + 1 i=1 EΛi = O (m N) 2r+1 Towards Understanding Ensemble Distillation in Federated Learning The second lemma is regarding the convergence rate of f0 PDpf0 Hk. Aydın & Gheondea (2021) prove that f0 PDpf0 Hk 0 as Np under the condition that the density ρx is strictly positive on any non-empty open subset of X. However, they do not provide the convergence rate. The following lemma provides a convergence rate if we assume a regularity condition of f0. Lemma C.13. Assume Assumption 3.2. Then, f0 PDpf0 Hk 2λr g0 Hk with confidence at least 1 4 exp( 1/4(κ2 + 1)B) where B = 1 + log N(λ) 1 + log N(λ) for any λ > 0 such that λ (0, 1) and N(λ) 1. Also, f0 PDpf0 Hk f0 Hk almost surely. In particular, for any fixed ϵ (0, 1), E f0 PDpf0 Hk = O N ( 1+ϵ)r p . Proof of Lemma C.13. Note that (Lk,Xp + λI) 1Lk,Xpf0 = k Dp(x)(NpλI + KXp,Xp) 1SDpf0 span(k(x, ) : x Dp(x)) for any λ > 0. Thus, f0 PDpf0 Hk = min h span(k(x, ):x Dp(x)) f0 h Hk f0 (Lk,Xp + λI) 1Lk,Xpf0 Hk. f0 PDpf0 Hk λ(Lk,Xp + λI) 1Lr kg0 Hk λ (Lk + λI) 1/2 (Lk + λI)1/2(Lk,Xp + λI) 1(Lk + λI)1/2 (Lk + λI) 1/2Lr k g0 Hk by the submultiplicativity of the operator norm. Using Lemma D.1, λ (Lk + λI) 1/2 (Lk + λI)1/2(Lk,Xp + λI) 1(Lk + λI)1/2 (Lk + λI) 1/2Lr k g0 Hk λr (Lk + λI)1/2(Lk,Xp + λI) 1(Lk + λI)1/2 g0 Hk. When we assume λ (0, 1) and N(λ) 1, by Lemma D.7, we have f0 PDpf0 Hk 2λr g0 Hk with confidence at least 1 δ where 4 exp( 1/4(κ2 + 1)B) δ < 1. On the other hand, f0 PDpf0 Hk f0 Hk almost surely since I PDp 1. 
Combining them, we have E f0 PDpf0 Hk 2λr g0 Hk + 4 f0 Hk exp 1 4(κ2 + 1)B Set λ = N 1+ϵ p for a fixed ϵ > 0. Since λ 0 as Np , we may assume N(λ) 1. Then, 1 + log(eκ2Np) Towards Understanding Ensemble Distillation in Federated Learning f(x) = log(eκ2x) 1 + log(eκ2x) is continuous on [1, ) and vanishes at , f(x) C(κ, ϵ) for some C(κ, ϵ) > 0. Thus, B C(κ, ϵ) 1 g(x) = x β exp 1 4(κ2 + 1)x Then g is continuous on (0, C(κ, ϵ)] and vanishes at 0+ for any β > 0. Therefore, we know that g(x) C (κ, ϵ, β) on x (0, C(κ, ϵ)] for some C (κ, ϵ, β) > 0. Hence, we have E f0 PDpf0 Hk 2λr g0 Hk + f0 Hk exp 1 4(κ2 + 1)B 2N ( 1+ϵ)r p g0 Hk + f0 Hk B ϵ C κ, ϵ, 4(1 ϵ)r 2N ( 1+ϵ)r p g0 Hk + f0 Hk N ( 1+ϵ)r p C(κ, ϵ) ϵ C κ, ϵ, 4(1 ϵ)r = O N ( 1+ϵ)r p . We are now ready to derive the main result in this subsection. Theorem C.14. Assume m 2, λ0 = λ, Assumption 3.1, and Assumption 3.2 with 0 < r 1 2. We further assume 3r+2 2r2+2r N 1 2r+2 1/(1 ϵ) , (m 1)N for some fixed 0 < ϵ < 1 2. Let g = T h0 for any h0 Hk where the operator T is defined in (24). Then, with α = 1/m and λ = (m N) 1 2r+2 , E SDp(g f0) 2 = O (m N) 2r+1 Proof of Theorem C.14. When m or N increases, λ decreases and Np increases. Thus, we may assume N(λ) 1 and N(1/ p Np) 1. Set U = S Dp(SDp(Lk,Xp + λI) 1S Dp) 1SDp and V = Lk,Xp + λI. Note that V is invertible and Lk,Xp U = Lk,Xp + λPDp V = Lk,Xp + λI. i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0. By the triangle inequality, SDp(g f0) 2 SDp(g g) 2 + SDp( g f0) 2. First, note that SDp( g f0) 2 = L1/2 k,Xp( g f0) Hk (Lk,Xp + λI)1/2( g f0) Hk Towards Understanding Ensemble Distillation in Federated Learning by (10) and Lemma D.3. By the triangle inequality, (Lk,Xp + λI)1/2( g f0) Hk i=1 V 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xp + λPDp αLk,Xi (1 α)Lk,Xp λI)f0 Hk i=1 V 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1α(Lk,Xp Lk,Xi)f0 Hk i=1 λ V 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(I PDp)f0 Hk. By the submultiplicativity of the operator norm, Lemma D.1, and (31), V 1/2( g f0) Hk 1 λ Lk,Xp Lk,Xi f0 Hk + λ 1 α (I PDp)f0 Hk Set Λ1 i = α (1 α) λ Lk,Xp Lk,Xi f0 Hk. By the triangle inequality, Lk,Xp Lk,Xi Lk,Xp Lk + Lk Lk,Xi . By Lemma D.7, with confidence at least 1 δ, N (log(4/δ))1/2 and Lk,Xp Lk 2 Np (log(4/δ))1/2. Then, with confidence at least 1 δ, Λ1 i Λ1(log(4/δ))1/2 (33) 2ακ2 f0 Hk (1 α) = O (m N) 2r+1 λ 1 α (I PDp)f0 Hk 2(m N) 1 4r+4 O ((m 1)N) 2r 4r+4 = O (m N) 2r+1 by Lemma C.13. On the other hand, by (10) and Lemma D.3, SDp(g g) 2 = L1/2 k,Xp(g g) Hk U 1/2(g g) Hk. We can see that U 1/2 g = 1 i=1 U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 i=1 U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 + i=1 (1 α)U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 + i=1 (1 α)U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 U 1/2 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U Towards Understanding Ensemble Distillation in Federated Learning Thus, by the triangle inequality, the submultiplicativity of the operator norm, and Lemma C.11, we have U 1/2(g g) Hk i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U First, we bound the first term. 
By the triangle inequality and Lemma D.3, i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 f0 From (32), we know V 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 f0 λ 1 α (I PDp)f0 Hk. Also, using the same argument as in the proof of Lemma C.12, it satisfies that V 1/2 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi i=1 Λi (35) where random variables Λ and Λi(i = 1, , m) are given in Lemma C.12. In particular, Λ Λ(log(6/δ))5/4 with confidence at least 1 δ and Λi Λi(log(6/δ))3/2 with confidence at least 1 δ where Λ = O (m N) 2r+1 Λi = O (m N) 2r+1 4r+4 . We next turn to derive an upper bound of the second term. To bound i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 , we use Lemma D.6. From 0 U = Lk,Xp + λPDp V = Lk,Xp + λI < 1 1 α(αLk,Xi + (1 α)Lk,Xp + λI), we have (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1 < V 1 for any i = 1, , m. Hence i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1 < V 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1 ! 1 Towards Understanding Ensemble Distillation in Federated Learning Note that U is of finite rank which means that U 1/2 is of finite rank and so compact, and all of the above operators are positive. Applying Lemma D.6 and the above inequality yields i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 . By the submultiplicativity of the operator norm, i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 . Here, the second inequality follows from the fact that i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 1. which comes from the triangle inequality and (31). Set i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 On the other hand, since A 7 A 1 is operator convex (Bhatia, 2013), i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 1 i=1 (I V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2) 1. m Pm i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 1 is positive, i=1 (I V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2) 1 (I V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2) 1 . Since V 1/2 is invertible, by Lemma D.5 and the submultiplicativity of the operator norm, α (I V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2) 1 = α V 1/2(V 1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1) 1V 1/2 α (V 1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1) 1V 1 = α (I (1 α)V (αLk,Xi + (1 α)Lk,Xp + λI) 1) 1 = α (αLk,Xi + (1 α)Lk,Xp + λI)(αLk,Xi + αλI) 1 (αLk,Xi + (1 α)Lk,Xp + λI)(Lk,Xi + λI) 1 . (αLk,Xi + (1 α)Lk,Xp + λI)(Lk,Xi + λI) 1 = I + (1 α)(Lk,Xp Lk,Xi)(Lk,Xi + λI) 1, Towards Understanding Ensemble Distillation in Federated Learning by the triangle inequality, the submultiplicativity of the operator norm, and Lemma D.1, we get (αLk,Xi + (1 α)Lk,Xp + λI)(Lk,Xi + λI) 1 1 + 1 α λ Lk,Xi Lk + Lk,Xp Lk . By Lemma D.7, N (log(4/δ))1/2 and Lk,Xp Lk 2 Np (log(4/δ))1/2 with confidence at least 1 δ. Therefore, i=1 Θ1,i (36) where Θ1,i is a random variable such that Θ1,i Θ1(log(4/δ))1/2 with confidence at least 1 δ for i = 1, , m. Here, = O m 1 2r+2 . However, this upper bound (36) is not sufficiently small for our analysis. 
Thus, we will find a better upper bound now. First, we bound the norm of A 1 2 = V 1/2 1 m i=1 (αLk + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI) By the submultiplicativity of the operator norm, A 1 2 V 1/2(Lk + λI) 1/2 (Lk + λI)1/2 1 m i=1 Lk,Xi + λI (Lk + λI)1/2 (Lk + λI) 1/2(αLk + (1 α)Lk,Xp + λI)1/2 (αLk + (1 α)Lk,Xp + λI)1/2V 1/2 . By Lemma D.1, Lemma D.4, the triangle inequality, and the submultiplicativity of the operator norm, V 1/2(Lk + λI) 1/2 1 + 1 λ Lk,Xp Lk 1/2 , (αLk + (1 α)Lk,Xp + λI)1/2V 1/2 1 + α λ Lk,Xp Lk 1/2 and (Lk + λI) 1/2(αLk + (1 α)Lk,Xp + λI)1/2 1 + 1 α λ Lk,Xp Lk 1/2 . By Lemma D.1, Lemma D.5, the triangle inequality, and the submultiplicativity of the operator norm, (Lk + λI)1/2 1 m i=1 Lk,Xi + λI (Lk + λI)1/2 i=1 Lk,Xi + λI i=1 Lk,Xi Lk By Lemma D.7, Np (log(4/δ))1/2 and i=1 Lk,Xi Lk m N (log(4/δ))1/2 Towards Understanding Ensemble Distillation in Federated Learning with confidence at least 1 δ. Therefore, with confidence at least 1 δ, (log(4/δ))5/4 = O(1)(log(4/δ))5/4. Next, we bound A2 A1 . Note that i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! = V 1/2 1 α i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V = V 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI) A2 A1 = V 1/2 1 m i=1 (αLk + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI) i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI) i=1 V 1/2(αLk + (1 α)Lk,Xp + λI) 1α(Lk,Xi Lk)(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI)V 1/2 using (23). By the triangle inequality, the submultiplicativity of the operator norm, and the fact that (Lk + λI) 1/2(Lk,Xi + λI)(Lk + λI) 1/2 = (Lk + λI) 1/2(Lk,Xi Lk)(Lk + λI) 1/2 + I, we have V 1/2(αLk + (1 α)Lk,Xp + λI) 1α(Lk,Xi Lk)(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI)V 1/2 α V 1/2(αLk + (1 α)Lk,Xp + λI) 1/2 (αLk + (1 α)Lk,Xp + λI) 1/2(Lk + λI)1/2 R (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk + λI)1/2 (R + 1) (Lk + λI)1/2V 1/2 R = (Lk + λI) 1/2(Lk,Xi Lk)(Lk + λI) 1/2 . Since (1 α)V αLk + (1 α)Lk,Xp + λI, V 1/2(αLk + (1 α)Lk,Xp + λI) 1/2 (1 α) 1/2 by Lemma D.2. By Lemma D.1, Lemma D.4, the triangle inequality, and the submultiplicativity of the operator norm, (αLk + (1 α)Lk,Xp + λI) 1/2(Lk + λI)1/2 1 + 1 α λ Lk,Xp Lk 1/2 (Lk + λI)1/2V 1/2 1 + 1 λ Lk,Xp Lk 1/2 . By Lemma D.1, Lemma D.5, the triangle inequality, and the submultiplicativity of the operator norm, (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk + λI)1/2 1 + 1 λ α Lk,Xi Lk + (1 α) Lk,Xp Lk . Towards Understanding Ensemble Distillation in Federated Learning To find a PAC-bound, applying Lemma D.7 yields Np (log(12/δ))1/2, Lk,Xi Lk 2 N (log(12/δ))1/2, R 2(κ2 + 1) 1 + log N(λ) 1 + log N(λ) with confidence at least 1 δ for any δ (0, 1). Then, we apply Lemma D.13 to see that A2 A1 is the mean of random variables that are bounded by 1 + log N(λ) 1 + log N(λ) 1 + 2(κ2 + 1) 1 + log N(λ) 1 + log N(λ) (log(a0/δ))q0 for some a0 1 and q0 > 0 with confidence at least 1 δ. Note that 2(κ2 + 1) = O(1). On the other hand, 1 + log N(λ) 1 + log N(λ) log (eκ2)(m N) 1 2r+2 (m N) r 4r+4 + v u u tlog (eκ2)(m N) 1 2r+2 (m N) r 4r+4 = O(1) f(x) = log(eκ2x1/(2r+2)) is continuous on [1, ) and vanishes at which implies that log (eκ2)(m N) 1 2r+2 (m N) r 4r+4 = O(1). 1 + log N(λ) 1 + log N(λ) 1 + 2(κ2 + 1) 1 + log N(λ) 1 + log N(λ) = O m r 2r+2 and so A2 A1 is the mean of random variables that are bounded by O m r 2r+2 (log(a0/δ))q0 for some a0 1 and q0 > 0 with confidence at least 1 δ respectively. 
Thus, applying the formula j=0 A 1 2 (A2 A1)A 1 2 j + A 1 1 (A2 A1)A 1 2 1 (36), and Lemma D.13 yield j=0 Ξj (37) Towards Understanding Ensemble Distillation in Federated Learning where Ξj is the mean of random variables that are bounded by O(1)(log(a/δ))q for some a 1 and q > 0. The last part of the proof is devoted to bounding i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U = U 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xp + λPDp) i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1(S Diyi Lk,Xif0 λf0) 1 α i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1λ(I PDp)f0. By Lemma D.3 and the triangle inequality, the first term is bounded as follows: i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1(S Diyi Lk,Xif0 λf0) i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi f0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Lk,Xif0 f0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1 ! By Lemma C.12, V 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi f0 Using the exact same way as in the proof of Lemma C.12 (since it is just the noisefree case), V 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Lk,Xif0 f0 ! Hk Λ0 + 1 i=1 Λ0 i (38) where Λ0 and Λ0 i are random variables such that Λ0 O (m N) 2r+1 4r+4 (log(6/δ))5/4 with confidence at least 1 δ and each Λ0 i O (m N) 2r+1 4r+4 (log(6/δ))3/2 with confidence at least 1 δ. Also, by the triangle inequality, the submultiplicativity of the operator norm, (31), and Lemma D.1, i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1 ! i=1 (Lk,Xp + λI) 1/2Lr kg0 Hk = λ 1 α (Lk,Xp + λI) 1/2Lr kg0 Hk λ 1 αQp (Lk + λI) 1/2Lr kg0 Hk 1 1 α λr+ 1 where Qp is already defined in the proof of Lemma C.12. By (30), i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1 ! Hk 1 1 α 1 + 1 λ Lk Lk,Xp 1/2 λr+ 1 Towards Understanding Ensemble Distillation in Federated Learning By Lemma D.7, i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1 ! 2 g0 Hk(log(4/δ))1/4 with confidence at least 1 δ where 2 g0 Hk = O (m N) 2r+1 Lastly, by the triangle inequality, the submultiplicativity of the operator norm, Lemma D.1, Lemma D.3, and (31), 1 α i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1λ(I PDp)f0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1λ(I PDp)f0 λ α (I PDp)f0 Hk. SDp(g f0) 2 1 λ 1 α (I PDp)f0 Hk + α λ 1 α (I PDp)f0 Hk + Λ + 1 i=1 Λi + Λ0 + 1 i=1 Λ0 i + 1 1 α λr+1/2Qp g0 Hk + λ α (I PDp)f0 Hk By Lemma D.9, i=1 Λ1 i + α i=1 Λ1 i + Λ + 1 = O (m N) 2r+1 4r+4 . (41) From (34), we have λ 1 α (I PDp)f0 Hk = O (m N) 2r+1 By Lemma D.13, (37), (38), and (39), i=1 Λi + Λ0 + 1 i=1 Λ0 i + 1 1 α λr+1/2Qp g0 Hk = O (m N) 2r+1 4r+4 . (42) Since Lk κ2 and Lk,X κ2 for any X, by Lemma D.5, Lemma D.1, the triangle inequality, and the submultiplicativity of the operator norm, we obtain (Lk,Xi + λI) 1(αLk,Xi + (1 α)Lk,Xp + λI) λ + κ2 λ α (I PDp)f0 Hk almost surely. By Lemma C.13 and Lemma D.12, λ α (I PDp)f0 Hk f0 Hk 4 exp 1 4(κ2 + 1)B Towards Understanding Ensemble Distillation in Federated Learning B = 1 + log N(λp) 1 + log N(λp) and λp = 1/N 1 ϵ p which satisfies N(λp) N(1/ p Np) 1. Note that 1 + log N(λp) is continuous on [1, ) and vanishes at , g(x) C(κ, r, ϵ) for some C(κ, r, ϵ) > 0. Then B C(κ, r, ϵ)N ϵ/4 p . Since 0 < B C(κ, r, ϵ) and xβ exp 1 4(κ2 + 1)x is continuous on (0, C(κ, r, ϵ)] and vanishes at 0+, 1 Bβ exp 1 4(κ2 + 1)B C (κ, r, ϵ, β) for some C (κ, r, ϵ, β) > 0. Hence, exp 1 4(κ2 + 1)B B6/ϵ 1 B6/ϵ exp 1 4(κ2 + 1)B (1 + κ2) 1 α λ N 3/2 p C(κ, r, ϵ)6/ϵ C (κ, r, 6/ϵ) = O (m N) 2r+1 4r+4 . (44) On the other hand, from N r(1 ϵ) p (m N) r 2r+2 m, λ α N r( 1+ϵ) p = O (m N) 2r+1 This completes the proof of Theorem 5.5. 
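For intuition, the de-regularization map used throughout this subsection has a simple matrix form: by the computation in the proof of Lemma C.6, S_Dp(L_{k,Xp} + λ0 I)^{-1} S*_Dp acts on R^{Np} as the KRR smoother K_{Xp,Xp}(K_{Xp,Xp} + Np λ0 I)^{-1}, so de-regularizing a consensus vector amounts to applying the inverse of this matrix. The following minimal numerical sketch (an illustration with a placeholder kernel and consensus vector, not part of the paper's code) applies the map and checks that a subsequent λ0-regularized KRR fit on Dp(x), evaluated back on Dp(x), reproduces the original consensus, which mirrors the replacement of λI by λ times the projection complement observed in Remark C.9.

```python
import numpy as np

# Minimal sketch of the de-regularization map of Algorithm 2 in matrix form
# (illustration only, with placeholder data).  By the proof of Lemma C.6,
# S_Dp (L_{k,Xp} + lam0*I)^{-1} S*_Dp acts on R^{Np} as K (K + Np*lam0*I)^{-1},
# so de-regularizing a consensus vector applies the inverse of this matrix
# (K = K_{Xp,Xp} is assumed invertible).

def min_kernel(x, y):                       # k1(x, x') = 1 + min(x, x'), as in Appendix E.1
    return 1.0 + np.minimum(x[:, None], y[None, :])

rng = np.random.default_rng(0)
Np, lam0 = 50, 1e-2
Xp = rng.uniform(0.0, 1.0, Np)              # unlabeled public inputs (placeholder)
y_bar = np.sin(2.0 * np.pi * Xp)            # a consensus vector on Dp(x) (placeholder)

K = min_kernel(Xp, Xp)
smoother = K @ np.linalg.inv(K + Np * lam0 * np.eye(Np))

y_dereg = np.linalg.solve(smoother, y_bar)  # de-regularized consensus

# A lam0-regularized KRR fit to (Xp, y_dereg), evaluated back on Xp, returns the
# original consensus: the regularization bias on span{k(., x) : x in Dp(x)} is removed.
print(np.allclose(smoother @ y_dereg, y_bar))   # True up to numerical error
```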
Theorem C.15. Let g = T h0 for any h0 Hk and g i be a local kernel ridge regressor of client i trained by the local dataset Di and the public dataset Dp(x) with the predictions of g , i.e., g i = (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)Lk,Xpg ). Under the same assumption of Theorem C.14, with α = 1/m and λ = (m N) 1 2r+2 , E ιρx(g i f0) L2ρx = O (m N) 2r+1 Towards Understanding Ensemble Distillation in Federated Learning Proof of Theorem C.15. We use the same notation as in the proof of Theorem C.14. Since g is not independent of Dp(x), we cannot directly apply Corollary B.2. By Corollary B.1, we have E ιρx(g i f0) L2ρx 9 α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk + (1 α)1/2E h (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 L1/2 k,Xp(g f0) Hk i . α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk = O (m N) 2r+1 By Lemma D.1, Lemma D.4, the triangle inequality, and the submultiplicativity of the operator norm, (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 1 + α λ Lk,Xi Lk + 1 α λ Lk,Xp Lk 1/2 . By Lemma D.7, with confidence at least 1 δ λ Lk,Xi Lk + 1 α λ Lk,Xp Lk 1/2 (log(4/δ))1/4 where δ (0, 1) and hence (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 O(1)(log(4/δ))1/4 O(1) log(4/δ). (45) Recall L1/2 k,Xp(g f0) Hk = SDp(g f0) 2 from (10) and its bound in (40). By Lemma D.9, Lemma D.13, (33), and (35), (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 i=1 Λ1 i + α i=1 Λ1 i + Λ + 1 = O (m N) 2r+1 By Lemma D.13, (37), (38), and (39), (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 A 1 1 i=1 Λi + Λ0 + 1 i=1 Λ0 i + 1 1 αλ 1 2 +r Qp g0 Hk = O (m N) 2r+1 Note that there is a trivial bound (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 λ + κ2 By (45), Lemma C.13, and Lemma D.12, (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 (1 + α) λ 1 α (I PDp)f0 Hk λ 1 α 1 + κ2 1/2 f0 Hk 4 exp 1 4(κ2 + 1)B λ N r( 1+ϵ) p . By (37), (43), Lemma C.13, and Lemma D.13, (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 A 1 1 λ α (I PDp)f0 Hk 3/2 f0 Hk 4 exp 1 4(κ2 + 1)B λ α N r( 1+ϵ) p Towards Understanding Ensemble Distillation in Federated Learning Using the similar argument as in the derivation of the bound (44), (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 (1 + α) λ 1 α (I PDp)f0 Hk = O (m N) 2r+1 (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 A 1 1 λ α (I PDp)f0 Hk = O (m N) 2r+1 From (40), E ιρx(g i f0) L2ρx = O (m N) 2r+1 C.7. Algorithm on KRR with Iterative De-regularized Ensemble Distillation Participated by Partial Clients in Federated Learning Setting We provide an algorithm (Algorithm 3) on kernel ridge regression with iterative de-regularized ensemble distillation participated by partial clients in federated learning. At each communication round, we select a fixed number of clients that predict the unlabeled public dataset and train using their local datasets and the public dataset with the updated consensus obtained by stochastic approximation. We denote the number of selected clients in each communication round as C. C.8. Proof of Corollary 5.6 Here, we assume that D and Dp(x) are given. Define a sequence of independent random operators {T p t0}t0 N where random operator T p t0 : Hk Hk is defined as T p t0g = X 1 C (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λI) 1S Dp) 1SDpg) 1 C (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)(Lk,Xp + λP Dp(x))g) for each t0 N where {Ct0}t0 N is defined in Algorithm 3. We also define T p t0 : RNp RNp as T p t0v = X 1 C (SDp(Lk,Xp + λI) 1S Dp) 1/2SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1 (αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1/2v) for each t0 N. We prove Corollary 5.6 in a general setting. 
Theorem C.16. Assume λ0 = λ, KXp,Xp is invertible, and the number of selected clients in each communication round C is fixed, i.e., C = Pm i=1 1{i Ct0} for any t0 = 1, , t. We also assume that {γt0}t t0=1 [0, 1]t is fixed, t0=1 γt0 = , t0=1 γ2 t0 < , (46) {Ct0} t0=1 is independent, and P(j Ct0) = pj for any j {1, , m} and t0 = 1, . Then the prediction yp on Dp(x) after infinitely many iterations in Algorithm 3 converges to SDpˆg almost surely where ˆg = ˆT h0 for any h0 Hk and pi Pm j=1 pj (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λI) 1S Dp) 1SDpg). Towards Understanding Ensemble Distillation in Federated Learning Algorithm 3 KRR with iterative De-regularized Ensemble Distillation Participated by Partial Clients in FL 1: Input: hyperparameters α (0, 1), λ > 0, λ0 0, t N, C {1, , m} and {γt0}t t0=1 [0, 1]t such that γ1 = 1 2: Output: Trained model fj, j = 1, , m 3: Pretrain: For j = 1, , m, client j trains its model fj using the loss function argmin h Hk i=1 (h(xi j) yi j)2 + λ h 2 Hk. 4: Each client downloads the unlabeled public dataset Dp(x). 5: for t0 = 1, , t do 6: Determine a set of clients Ct0 whose size is C to participate the ensemble at time t0. 7: For j Ct0, client j predicts on Dp(x) and uploads the prediction yj p to server. 8: The server computes an updated consensus yp = (1 γt0) yp,old + γt0 1 j Ct0 yj p. 9: The server stores yp,old = yp. 10: if t0 = t then 11: The server applies the de-regularization trick to yp: yp = (SDp(Lk,Xp + λ0I) 1S Dp) 1 yp. 12: end if 13: Each client in Ct0 downloads the ensemble prediction yp. 14: For j Ct0, client j updates its model fj using the loss function argmin h Hk α 1 i=1 (h(xi j) yi j)2 + (1 α) 1 i=1 (h(xi p) ( yp)i)2 + λ h 2 Hk. 15: end for Proof. Define ˆT : RNp RNp as pi Pm j=1 pj (SDp(Lk,Xp + λI) 1S Dp) 1/2SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1 (αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λI) 1S Dp) 1/2v). j=1 1{j Ct0} C = E j=1 1{j Ct0} j=1 P(j Ct0) = by taking the expectation. Define an operator Si : RNp RNp by Siv = (SDp(Lk,Xp + λI) 1S Dp) 1/2SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1 (αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1/2v). E T p t0v = E i=1 1{i Ct0} 1 C Siv i=1 E h 1{i Ct0}Siv i = 1 Pm j=1 pj i=1 pi Siv = ˆTv. Towards Understanding Ensemble Distillation in Federated Learning Observe that (SDp(Lk,Xp + λI) 1S Dp) 1/2 = ((KXp,Xp + NpλI) 1KXp,Xp) 1/2 = (I + NpλK 1 Xp,Xp)1/2. (47) We also claim that SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)S Dp KXp,Xp KXp,Xi α λI 1 KXi,Xp KXp,Xp + Np 1 αλI KXp,Xi α λI 1 KXi,Xp I + Np 1 αλ KXp,Xp KXp,Xi α λI 1 KXi,Xp To show this, note that (αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)S Dp = k Xi k Xp (λI + DKXi Xp,Xi Xp) 1D 0 = k Xi k Xp (λD 1 + KXi Xp,Xi Xp) 1 0 where D = diag(α/N, , α/N | {z } N , (1 α)/Np, , (1 α)/Np | {z } Np ) which follows from (12). Then SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)S Dp = KXp,Xi KXp,Xp KXi,Xi + N1 α λI KXi,Xp KXp,Xi KXp,Xp + N2 1 αλI From the formula A B C D 1 = A 1 + A 1B(D CA 1B) 1CA 1 A 1B(D CA 1B) 1 (D CA 1B) 1CA 1 (D CA 1B) 1 SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)S Dp = KXp,Xi KXp,Xp (KXi,Xi + N1 α λI) 1KXi,Xp(KXp,Xp + N2 1 αλI KXp,Xi(KXi,Xi + N1 α λI) 1KXi,Xp) 1 (KXp,Xp + N2 1 αλI KXp,Xi(KXi,Xi + N1 α λI) 1KXi,Xp) 1 which gives the formula (48). Since I + Np 1 αλ KXp,Xp KXp,Xi α λI 1 KXi,Xp > I + NpλK 1 Xp,Xp, by (47), (48), Lemma D.2, and the triangle inequality, ˆT is a η-contraction where η (0, 1). Set ut0 as (SDp(Lk,Xp + λI) 1S Dp) 1/2( yp)t0 where ( yp)t0 is the updated consensus before applying the de-regularization at the t0-th iteration in Algorithm 3. 
Then ut0+1 = (1 γt0+1)ut0 + γt0+1 T p t0+1ut0 = (1 γt0+1)ut0 + γt0+1 ˆTut0 + T p t0+1ut0 ˆTut0 . E h T p t0+1ut0 ˆTut0 | Ft0 i = E h T p t0+1ut0 ˆTut0 | ut0 i = 0 (49) where Ft0 is an σ-algebra generated from u1, , ut0, T p 1 , , T p t0, γ1, , γt0 and γt0+1 since T p t0+1 is independent of Ft0. Since the number of possible realizations of T p t0+1 is finite, there exists B > 0 such that T p t0+1 ˆT B Towards Understanding Ensemble Distillation in Federated Learning almost surely. Therefore, E T p t0+1ut0 ˆTut0 2 = E T p t0+1ut0 ˆTut0 2 E T p t0+1 ˆT 2 ut0 2 Hk | ut0 E h B2 ut0 2 Hk | ut0 i = B2 ut0 2 Hk . (50) From (46), (49), and (50), by Proposition 4.4 in Bertsekas & Tsitsiklis (1996), (SDp(Lk,Xp + λI) 1S Dp) 1/2 yp converges to a unique fixed point of ˆT. Therefore, (SDp(Lk,Xp + λI) 1S Dp) 1/2 yp = ˆT = (SDp(Lk,Xp + λI) 1S Dp) 1/2SDp ˆT = (SDp(Lk,Xp + λI) 1S Dp) 1/2SDpˆg . Note that ˆT = ˆg can be shown using a similar argument as in the first part in Appendix C.6. D. Auxiliary Lemmas We provide the useful properties of operators. Lemma D.1. Let A be a bounded linear operator on a seperable Hilbert space H. If A is compact, self-adjoint, and positive, then (A + λI) r As 1/λr s for any r s 0 and λ > 0. Proof. Since A is compact, self-adjoint, and positive, we can set eigenvalues {µi} i=1 of A and their corresponding eigenfunctions {φi} i=1 satisfying µi 0 and {φi} i=1 is an orthonormal basis of H by the spectral theorem (Conway, 2019). For any φ = P i=1 aiφi H such that φ 2 = P i=1 a2 i = 1, we have (A + λI) r Asφ 2 = aiµs i (µi + λ)r φi aiµs i (µi + λ)r 2s 1 (µi + λ)2(r s) i=1 a2 i = 1 λ2(r s) which implies (A + λI) r As = sup φ =1 (A + λI) r Asφ 1 λr s . Lemma D.2. Let A and B be bounded positive self-adjoint linear operators on a seperable Hilbert space H. We assume B is invertible. Then A B implies A1/2B 1/2 2 = A1/2B 1A1/2 1. If we further assume A is compact, then A < B implies A1/2B 1/2 2 = A1/2B 1A1/2 < 1. Proof. We first assume A B. Since 0 B 1/2AB 1/2 I, we have B 1/2AB 1/2 1 by the definition of positive operator. Then, we obtain 1 B 1/2AB 1/2 = (B 1/2A1/2)(B 1/2A1/2) = (B 1/2A1/2) (B 1/2A1/2) = A1/2B 1A1/2 . We now assume A is compact. By Proposition 4.2(c) of chapter 2 in Conway (2019), B 1/2AB 1/2 is compact. Using 0 B 1/2AB 1/2 < I and the fact that B 1/2AB 1/2 is compact and self-adjoint yields B 1/2AB 1/2 < 1 since Towards Understanding Ensemble Distillation in Federated Learning if B 1/2AB 1/2 = 1 then B 1/2AB 1/2 has an eigenvalue 1 by Lemma 5.9 of chapter 2 in Conway (2019) which contradicts B 1/2AB 1/2 < I. Therefore, 1 > B 1/2AB 1/2 = A1/2B 1A1/2 . The following lemma is similar to Proposition 5 in Rudi et al. (2015). Lemma D.3. Let H be a seperable Hilbert space and X : H H and Y : H H be two bounded linear operators. If Y Y X X is positive, then Xf H Y f H for any f H. Proof. Let f H. Then Y f 2 H = Y f, Y f H = Y Y f, f H X Xf, f H = Xf, Xf H = Xf 2 H since (Y Y X X)f, f H 0. Recall Cordes inequality. Refer to Fujii et al. (1993) for its proof. Lemma D.4. Let A, B be two bounded positive linear operators on a seperable Hilbert space. Then for any s [0, 1]. The following lemma is fundamental but useful. Lemma D.5. Let A and B be two bounded positive self-adjoint linear operators on a seperable Hilbert space. Then A1/2B 1A1/2 B 1A when B is invertible. Proof. Since A and B are self-adjoint, A1/2B 1A1/2 = B 1/2A1/2 2. By Lemma D.4, B 1/2A1/2 2 B 1A . Lemma D.6. 
Let A, B and C be bounded positive self-adjoint linear operators on a seperable Hilbert space H such that 0 A B < C. We further assume A1/2 is compact. Then A1/2C 1A1/2(I A1/2C 1A1/2) 1 B1/2C 1B1/2(I B1/2C 1B1/2) 1 . Proof. Note that for any operator V such that V < 1, σp(V V ) \ {0} = σp(V V ) \ {0} where σp denotes the point spectrum of a given operator. Then σp((I V V ) 1 I) \ {0} = σp((I V V ) 1 I) \ {0}. (51) Since A1/2 is compact, A1/2C 1A1/2(I A1/2C 1A1/2) 1 is also compact by Proposition 4.2(c) of chapter 2 in Conway (2019). From the fact that A1/2C 1A1/2(I A1/2C 1A1/2) 1 is compact, self-adjoint, and positive, we have A1/2C 1A1/2(I A1/2C 1A1/2) 1 = σmax(A1/2C 1A1/2(I A1/2C 1A1/2) 1) by Lemma 5.9 of chapter 2 in Conway (2019) where σmax denotes the largest eigenvalue of a given (self-adjoint positive) operator. Using (51), σmax(A1/2C 1A1/2(I A1/2C 1A1/2) 1) = σmax((I A1/2C 1A1/2) 1 I) = σmax((I C 1/2AC 1/2) 1 I) = σmax(C 1/2AC 1/2(I C 1/2AC 1/2) 1). Towards Understanding Ensemble Distillation in Federated Learning C 1/2AC 1/2 C 1/2BC 1/2 (I C 1/2AC 1/2) (I C 1/2BC 1/2) (I C 1/2AC 1/2) 1 (I C 1/2BC 1/2) 1 (I C 1/2AC 1/2) 1 I (I C 1/2BC 1/2) 1 I. C 1/2AC 1/2 0 (I C 1/2AC 1/2) I (I C 1/2AC 1/2) 1 I (I C 1/2AC 1/2) 1 I 0. Therefore, 0 < σmax((I C 1/2AC 1/2) 1 I) σmax((I C 1/2BC 1/2) 1 I). Using the fact that σmax((I C 1/2BC 1/2) 1 I) = σmax((I B1/2C 1B1/2) 1 I) = σmax(B1/2C 1B1/2(I B1/2C 1B1/2) 1) B1/2C 1B1/2(I B1/2C 1B1/2) 1 completes the proof of Lemma D.6. By using concentration inequalities (Rudi et al., 2015; Chatalic et al., 2022), we can derive the following useful lemmas (Caponnetto & De Vito, 2007; Yao et al., 2007; Rudi et al., 2015; Lin et al., 2020a). Lemma D.7. Let X = {x1, , x N0} whose data points are independently drawn from ρx. Then, with our notation, Lk,X Lk Lk,X Lk HS 2 2κ2 N0 (log(2/δ))1/2 with confidence at least 1 δ where δ (0, 1) and HS is the Hilbert-Schmidt norm; (b) if 0 < λ 1 and N(λ) 1, (Lk + λI) 1/2(Lk Lk,X)(Lk + λI) 1/2 2(κ2 + 1) 1 + log N(λ) 1 + log N(λ) with confidence at least 1 δ where δ (0, 1); (c) if 0 < λ 1 and N(λ) 1, (Lk + λI)1/2(Lk,X + λI) 1/2 = (Lk + λI)1/2(Lk,X + λI) 1(Lk + λI)1/2 1/2 with confidence at least 1 δ where 1 4(κ2 + 1) 1 + log N(λ) 1 + log N(λ) Lemma D.8. Let D = {(x1, y1), , (x N0, y N0)} whose data points are independently drawn from ρx,y. Then, with our notation, S Dy Lk,D(x)f0 Hk 2κM N0 log(2/δ) + N0 log(2/δ) 2κ(M + γ) N0 log(2/δ) with confidence at least 1 δ where δ (0, 1). Towards Understanding Ensemble Distillation in Federated Learning The following lemmas are useful to compute the expectation using a PAC bound. Lemma D.9. Let A be a non-negative random variable such that A B(log(a/δ))q with confidence at least 1 δ for any δ (0, 1) where B > 0 is a constant, a 1, and q > 0. Then EA Γ(q + 1)a B. Proof. Note that P (A > t) a exp for any t > 0. Thus, 0 P (A > t) dt Z 0 a Bqsq 1 exp( s) ds = Γ(q + 1)a B. Lemma D.10. Let A be a non-negative random variable such that (a) A B(log(a/δ))q with confidence at least 1 δ for any δ (0, 1); (b) A B(log(a/δ))q with confidence at least 1 δ for any δ (δ0, 1) where B > 0 and B > 0 are constants, a 1, q N, and δ0 (0, 1). Then EA Γ(q + 1)a B + a Bqδ0 exp Proof. Note that P (A > t) a exp for any t > 0 and P (A > t) a exp for any 0 < t < B(log(a/δ0))q. Set t0 = B(log(a/δ0))q. Then, 0 P (A > t) dt Z t0 Γ(q + 1)a B + Z Towards Understanding Ensemble Distillation in Federated Learning by the proof of Lemma D.9. 
Also, dt = a Bq Z ( B/B)1/q log(a/δ0) uq 1 exp( u) du ( B/B)1/q log(a/δ0) = a Bqδ0 exp Remark D.11. If δ0 1, then we only have A B(log(a/δ))q with confidence at least 1 δ for any δ (0, 1) in Lemma D.10. Then, EA Γ(q + 1)a B Γ(q + 1)a Bδ0 by Lemma D.9. Combining this result and Lemma D.10, we have EA Γ(q + 1)a B + a Bqδ0 even if we only assume δ0 > 0 instead of δ0 (0, 1) in Lemma D.10. Sometimes Lemma D.10 is complicated, so we provide another lemma. Lemma D.12. Let A be a non-negative random variable such that A B(log(a/δ))q with confidence at least 1 δ for any δ (δ0, 1) and A B almost surely where B > 0 and B > 0 are constants, a 1 and q > 0. Then EA Γ(q + 1)a B + Bδ0. Proof. Since P (A > t) a exp for any 0 < t < B(log(a/δ0))q, EA = P(A B(log(a/δ0))q)E[A|A B(log(a/δ0))q] + P(A > B(log(a/δ0))q)E[A|A > B(log(a/δ0))q] 0 P(A B(log(a/δ0))q)P(A > t|A B(log(a/δ0))q) dt + δ0B Z B(log(a/δ0))q 0 P(A > t) dt + δ0B Z B(log(a/δ0))q = Γ(q + 1)a B + Bδ0. Note that if δ0 1 then EA Bδ0 is trivial. Thus, it also holds for δ0 1. Lastly, we introduce a lemma to deal with multiplications of random variables whose PAC bounds are given. Towards Understanding Ensemble Distillation in Federated Learning Lemma D.13. If A1 O(a(m, N))(log(a1/δ))q1 with confidence at least 1 δ and A2 O(b(m, N))(log(a2/δ))q2 with confidence at least 1 δ where δ (0, 1), a1, a2 1 and q1, q2 > 0, then A1 A2 O(a(m, N)b(m, N))(log(a3/δ))q3 with confidence at least 1 2δ for some a3 1 and q3 > 0. Proof. It directly follows from the fact that A1 A2 O(a(m, N))O(b(m, N))(log(2a1/δ))q1(log(2a2/δ))q2 O(a(m, N)b(m, N))(log max(2a1, 2a2)/δ)q1+q2 with confidence at least 1 2δ where δ ( 1 E. Experiment Details E.1. Dataset Description Details The generating procedure of the synthetic datasets is as follows: (i) the inputs are independently drawn from the uniform distribution on [0, 1]d with d = 1 for Dataset 1 and Dataset 2 and with d = 3 for Dataset 3; (ii) the corresponding outputs are generated from y = gi(x) + ϵ for Dataset i (i = 1, 2, 3) where ϵ is the independent noise that follows the normal distribution with mean 0 and variance 0.442 and gi are given by g1(x) = min(x, 1 x) for Dataset 1, for Dataset 2, and g3(x) = (1 x 2)6(35 x 2 2 + 18 x 2 + 3)1{ x 1} for Dataset 3. We use the kernel k1(x, x ) = 1 + min(x, x ) for Dataset 1 and Dataset 2 and k2(x, x ) = (1 x x 2)4(4 x x 2 + 1)1{ x x 2 1} for Dataset 3. Note that we do not give any noise for test datasets. We know that g1 Hk1 and g3 Hk2 (Lin et al., 2020a). Since g2(x) = x L2 ρx, 0 g2(x)k1(x, t) dρx(x) = Z t 0 (1 + x) x dx + Z 1 t (1 + t) x dx = g2(t) holds. From this fact, we can easily see that g2 Hk1 and g2 = L1/2 k1 ˆg2 for some ˆg2 Hk1. The generating procedure of MNIST is described in (Cui et al., 2021). We use the RBF kernel k3(x, x ) = exp 1 2 104 x x 2 E.2. Simulation Details We conduct experiments with the local datasets of sizes N = 10 and N = 20. We also set the number of clients m {10, 20, 30, 40, 50, 100} for N = 10 (except MNIST; for MNIST we set m {10, 20, 30, 40, 50} when N = 10) and m {10, 20, 30, 40, 50} for N = 20. Assume there is an unlabeled public dataset of size (m 1)N. For the iterative ensemble distillation algorithm, set the total communication round t = 200 for convergence. We use the fixed distillation hyperparameter α = 1/m but conduct the hyperparemeter tuning for λ > 0 using the grid search. In the test phase, we use a test dataset of size 1000 whose data points are generated from the procedure explained in Appendix E.1. 
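As a concrete reference for this setup, the following is a minimal sketch (an illustration, not the authors' code) of a single standalone local KRR model on Dataset 1, using the matrix form f_{D,λ} = k_{D(x)}(NλI + K_{D(x),D(x)})^{-1}y derived in Appendix A; the value of λ below is illustrative, whereas the experiments tune it by grid search.

```python
import numpy as np

# Minimal sketch of the Dataset 1 setup and one standalone local KRR model
# (illustration only; lam is grid-searched in the actual experiments, and the
# ensemble distillation steps are omitted here).
rng = np.random.default_rng(0)

def k1(x, y):                              # k1(x, x') = 1 + min(x, x')
    return 1.0 + np.minimum(x[:, None], y[None, :])

def g1(x):                                 # regression target of Dataset 1
    return np.minimum(x, 1.0 - x)

N, lam, noise_std = 10, 1e-2, 0.44
x_local = rng.uniform(0.0, 1.0, N)                       # local inputs ~ U[0, 1]
y_local = g1(x_local) + rng.normal(0.0, noise_std, N)    # noisy local labels

# Closed form f_{D,lam}(x) = k_D(x)^T (N*lam*I + K_{D(x),D(x)})^{-1} y.
coef = np.linalg.solve(N * lam * np.eye(N) + k1(x_local, x_local), y_local)

x_test = rng.uniform(0.0, 1.0, 1000)                     # noiseless test set of size 1000
test_mse = np.mean((k1(x_test, x_local) @ coef - g1(x_test)) ** 2)
print(f"standalone local model MSE: {test_mse:.4f}")
```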
We compute the averaged MSE (Mean Squared Error) over the local models to evaluate the performance. We simulate at least 100 times for each case and then average the results.

[Figure 3: four panels, (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, and (d) MNIST, each with N = 20; x-axis: the number of clients, y-axis: averaged MSE; curves: central training, one-shot ED, IED w/o deregularization, IED w/ deregularization.]

Figure 3. Comparison between the performance of the one-shot ensemble distillation algorithm (one-shot ED), the iterative ensemble distillation algorithm without the de-regularization (IED w/o deregularization), the iterative ensemble distillation algorithm with the de-regularization (IED w/ deregularization), and the central training. We set N = 20 and conduct the experiments with various m.

E.3. Additional Experimental Results

Performance Comparison with N = 20. We visualize the comparison results on the performance of the one-shot ensemble distillation algorithm, the iterative ensemble distillation algorithm without the de-regularization, the iterative ensemble distillation algorithm with the de-regularization trick, and the central training with N = 20 and m ∈ {10, 20, 30, 40, 50} in Figure 3. The results show a pattern similar to the cases with N = 10 summarized in Figure 2. The one-shot ensemble distillation algorithm performs much better than the standalone models, but it is worse on Dataset 3 and MNIST than the iterative ensemble distillation algorithms and the central training. The iterative ensemble distillation algorithm without the de-regularization is slightly worse than the one with the de-regularization and the central training, but the performance gap is not significant for N = 20. In all settings, the iterative ensemble distillation algorithm attains performance similar to the central training in the expected risk sense.

Performance Comparison with FedMD (Li & Wang, 2019). We compare our proposed algorithm with FedMD (Li & Wang, 2019) on Dataset 3 and MNIST. FedMD is a representative KD based FL algorithm using neural networks. In the experiments, we consider the unlabeled public dataset version of FedMD (which is used as a baseline in Zhang et al. (2021)). We use a 3-hidden-layer fully connected neural network and LeNet5 (LeCun et al., 1998) for FedMD. The results are summarized in Figure 4. We first note that the training strategy for the neural networks (such as model architecture and hyperparameters) is good enough, because the central training performance of the neural networks is better than that of KRR. However, on both Dataset 3 and MNIST, FedMD does not achieve the central training performance, and the performance gap is significant. On the other hand, the iterative ensemble distillation algorithm with the de-regularization achieves almost the same performance as the central training with KRR. Moreover, FedMD performs worse than the iterative ensemble distillation algorithm with the de-regularization.
Effect of Public Dataset Size. We provide additional experimental results on the effect of the unlabeled public dataset size $N_p$. Figure 5 visualizes the effect of $N_p$ on the performance of the one-shot ensemble distillation algorithm and the iterative ensemble distillation algorithm (with the de-regularization). We conduct the experiments with $N = 10$ and various $N_p \in \{0.2(m-1)N, 0.5(m-1)N, (m-1)N, 1.5(m-1)N\}$ on the three synthetic datasets. On Dataset 2, $N_p = 0.2(m-1)N$ already seems sufficient to achieve the performance of the central training for both the one-shot and the iterative ensemble distillation. We also observe that the one-shot ensemble distillation does not benefit from additional public data points. For the iterative ensemble distillation, a too small public dataset (e.g., $N_p = 0.2(m-1)N$) results in worse performance, but enlarging the public dataset has little effect once it is sufficiently large (e.g., $N_p \ge (m-1)N$).

Effect of Client Selection Strategy. We also provide experimental results on the effect of the client selection strategy. We conduct experiments to analyze the effect of the number of clients selected at each communication round and of the stochastic approximation weights, and we measure the performance of Algorithm 3 with a client selection strategy and compare it with Algorithm 2. Figure 6 visualizes the effect of the number of clients selected at each communication round (denoted by $C$ in Algorithm 3) and of the stochastic approximation weights (denoted by $\{\gamma_{t_0}\}_{t_0=1}^{\infty}$ in Algorithm 3). We set $N = 10$ and $m = 50$ in all experiments. The experiment details are as follows. We generate Dataset 3 using the procedure explained in Appendix E.1. Then, we find the convergent consensus $y_p^*$ by running Algorithm 2 with sufficiently many iterations. We run Algorithm 3 with various $C$ and $\{\gamma_{t_0}\}_{t_0=1}^{\infty}$ and measure the squared distance between $y_p^*$ and the consensus $y_p$ obtained from Algorithm 3. As illustrated in Figure 6(a), if we use only one client at each communication round (e.g., an asynchronous setting), the convergence is quite slow in our setting. However, selecting only 20% of the clients at each communication round is enough to achieve almost the same convergence speed as selecting all clients in Algorithm 3. To examine the effect of $\{\gamma_{t_0}\}_{t_0=1}^{\infty}$, we set $\gamma_{t_0} = 1/t_0^q$ with $q \ge 0$. To satisfy condition (46), $q$ should be larger than 0.5. We run Algorithm 3 with various $q \in \{0, 0.3, 0.501, 0.7\}$. Figure 6(b) shows that a smaller $q$ (slower decay) yields a larger decrease of the squared distance in the early stage of training. However, it also shows that the consensus prediction $y_p$ does not converge to $y_p^*$ if $q \le 0.5$. In particular, the squared distance between $y_p^*$ and $y_p$ is large when $q = 0$ (i.e., when the previous consensus $y_{p,\mathrm{old}}$ is not memorized). This means that using only the new consensus is inappropriate when a client selection strategy is employed. When $q = 0.3$, the squared distance between $y_p^*$ and $y_p$ does not go to zero, but it is quite small. On the other hand, a large $q$ guarantees the convergence of $y_p$ to $y_p^*$, but the convergence speed is slow.
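To illustrate the type of update discussed above, the following is a small NumPy sketch of a stochastic approximation consensus update with client selection. It is our own schematic illustration rather than a transcription of Algorithm 3; the variable names are ours, and the interaction with local retraining at each round is omitted.

```python
import numpy as np

def update_consensus(y_p_old, local_preds, t0, C, q=0.501, rng=None):
    """One schematic consensus update with client selection.

    y_p_old     : previous consensus predictions on the public inputs
    local_preds : list of per-client prediction vectors on the public inputs
    t0          : current communication round (1-indexed)
    C           : number of clients selected in this round
    q           : decay exponent of the weights gamma_{t0} = 1 / t0**q
    """
    rng = rng or np.random.default_rng()
    gamma = 1.0 / t0 ** q
    selected = rng.choice(len(local_preds), size=C, replace=False)
    new_estimate = np.mean([local_preds[i] for i in selected], axis=0)
    # Keep a memory of the previous consensus; gamma -> 0 drives convergence.
    return (1.0 - gamma) * y_p_old + gamma * new_estimate
```

In this sketch, $q = 0$ keeps no memory of the previous consensus, matching the $q = 0$ behavior discussed above, while $q > 0.5$ corresponds to the step size requirement stated above.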
[Figure 5: six panels, (a) one-shot ED on Dataset 1, (b) one-shot ED on Dataset 2, (c) one-shot ED on Dataset 3, (d) iterative ED on Dataset 1, (e) iterative ED on Dataset 2, (f) iterative ED on Dataset 3, each plotting the averaged MSE against the number of clients for p in {0.2, 0.5, 1.0, 1.5} and the central training.]

Figure 5. The effect of the public dataset size for the one-shot ensemble distillation algorithm and the iterative ensemble distillation algorithm. We conduct the experiments with $N = 10$ and $N_p \in \{0.2(m-1)N, 0.5(m-1)N, (m-1)N, 1.5(m-1)N\}$. We set $p = N_p/((m-1)N)$, e.g., $p = 1.0$ indicates $N_p = (m-1)N$.

Lastly, we measure the performance of Algorithm 3. Figure 7 visualizes the performance of Algorithm 3 with different numbers of iterations $t \in \{200, 1000, 2000, 5000, 10000, 20000\}$ on Dataset 1, Dataset 2, and Dataset 3. As shown in Figure 7, Algorithm 3 achieves the same performance as Algorithm 2 (with 200 iterations) after 5000 iterations on all three datasets.

[Figure 6: three panels, (a) effect of C, (b) effect of the stochastic approximation weights up to 20000 iterations, (c) effect of the stochastic approximation weights up to 50000 iterations, each plotting the squared distance from $y_p^*$ against the number of iterations.]

Figure 6. The effect of the number of clients selected at each communication round ($C$) and the stochastic approximation weights ($\{\gamma_{t_0}\}_{t_0=1}^{\infty}$) in Algorithm 3. We set $\gamma_{t_0} = 1/t_0^q$ where $q \in \{0, 0.3, 0.501, 0.7\}$. We set $N = 10$ and $m = 50$ and use Dataset 3 in all experiments. The y-axis indicates the squared distance between $y_p$ in Algorithm 3 and the convergent consensus $y_p^*$. We set $q = 0.501$ in (a) and $C = 10$ in (b) and (c).

[Figure 7: three panels, (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, each plotting the averaged MSE against the number of iterations, with and without client selection.]

Figure 7. Performance of Algorithm 3 with different $t \in \{200, 1000, 2000, 5000, 10000, 20000\}$ on the three datasets. The dashed line indicates the performance of Algorithm 2 with 200 iterations. We set $N = 10$, $m = 50$, $q = 0.501$ and $C = 10$ in all experiments, where $\gamma_{t_0} = 1/t_0^q$.