Towards Understanding Ensemble Distillation in Federated Learning

Sejun Park 1  Kihun Hong 1  Ganguk Hwang 1

Federated Learning (FL) is a collaborative machine learning paradigm for data privacy preservation. Recently, a knowledge distillation (KD) based information sharing approach in FL, which conducts ensemble distillation on an unlabeled public dataset, has been proposed. However, despite its experimental success and usefulness, the theoretical analysis of the KD based approach has not been satisfactorily conducted. In this work, we build a theoretical foundation of the ensemble distillation framework in federated learning from the perspective of kernel ridge regression (KRR). To this end, we propose a KD based FL algorithm for KRR models which is related to some existing KD based FL algorithms, and we analyze our algorithm theoretically. We show that our algorithm makes the local prediction models as powerful as the centralized KRR model (i.e., a KRR model trained on all of the local datasets) in terms of the convergence rate of the generalization error if the unlabeled public dataset is sufficiently large. We also provide experimental results to verify our theoretical results on ensemble distillation in federated learning.

1. Introduction

Despite the rapid development of machine learning algorithms (Bochkovskiy et al., 2020; Nichol et al., 2022; Brown et al., 2020), the performance of machine learning models still heavily depends on the size of the training dataset. However, due to the hassle of data processing, excessive manpower and time are consumed to obtain data, especially labeled data, in supervised learning. Moreover, most data is inaccessible for training a prediction model due to data privacy preservation.

1 Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. Correspondence to: Ganguk Hwang .

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Recently, Federated Learning (FL) (McMahan et al., 2017) has been proposed as a solution to data shortage and data privacy preservation. Even though various FL algorithms (McMahan et al., 2017; Li et al., 2020a; Yurochkin et al., 2019; Wang et al., 2020; Karimireddy et al., 2020; Li et al., 2021) have been proposed, most of them iteratively conduct the following procedure in training. (i) A server distributes a global model to clients to update their local models, and then each client trains its local model using its local dataset. (ii) The clients send their trained local models (or gradients) to the server, and the server aggregates them to update the global model. Such approaches obviously improve the performance of the local models while protecting data privacy, but the performance improvement is sometimes not sufficient for real world problems. Moreover, they impose some restrictions on local training. For example, most FL algorithms require that all local models be the same model (McMahan et al., 2017; Li et al., 2020a), parts of a global model (Arivazhagan et al., 2019; Jiang et al., 2022), approximations of a global model (Diao et al., 2021; Yao et al., 2021), or models derived from a meta model (Fallah et al., 2020; Shamsian et al., 2021). Such restrictions do not matter in a server-centric (i.e., server provider-centric) learning framework.
However, in a client-centric training environment, where most training strategies are led by individual clients, they matter severely. Note that local model architectures may also need to be protected, just like local data.

To overcome the above restrictions, knowledge distillation (KD) (Hinton et al., 2015) based FL algorithms have been proposed (Mora et al., 2022). Some of them improve the effectiveness of FL algorithms by conducting knowledge distillation in addition to the existing FL algorithms (Lin et al., 2020b; Zhu et al., 2021). Others use knowledge distillation as the main information sharing method instead of parameter sharing, which can be applied to a client-centric training environment (Li & Wang, 2019; Cho et al., 2021; Zhang et al., 2021). These methods assume that there exists an unlabeled public dataset or an unlabeled data generator in a server or a shared data storage that all clients can access. The knowledge of all local models can therefore be obtained in the server via the predictions of the local models on the unlabeled public dataset. The server then combines the knowledge through a weighted average of the predictions and distributes the public dataset with the averaged predictions to the clients. Finally, the clients use the distributed dataset as well as their local datasets for their local training.

There are some intuitions as to why this ensemble distillation strategy works well. First, the predictions of the local models on the unlabeled public dataset contain the information of the local models, just as the parameters of the local models do in typical FL algorithms. Second, the ensemble distillation process is a kind of automatic crowdsourcing data labeling technique that provides additional data to each client. Therefore, this strategy directly alleviates the data shortage problem of the clients. Despite these intuitions, KD based FL algorithms lack sufficient theoretical analysis even for the independent and identically distributed (IID) case (i.e., all data points of all clients are independent and identically distributed), unlike typical federated learning algorithms such as FedAvg and FedProx (Li et al., 2020a;b; Yuan & Li, 2022).

In this work, we provide a theoretical analysis of the effectiveness of ensemble distillation in federated learning. Even though knowledge distillation and ensemble techniques are difficult to analyze, some recent works provide analytical results for them in the context of kernel ridge regression (KRR) (Zhang et al., 2013; Lin et al., 2017; 2020a; Mobahi et al., 2020; Afonin & Karimireddy, 2022). Extending the existing results, we provide a theoretical framework for ensemble distillation of KRR models in a general federated learning setting. Unlike the theoretical framework of typical FL algorithms, which follows the optimization approach, we leverage the generalization theory of statistical models. More precisely, we derive the convergence rate of the expected risk with respect to the dataset size.

Our contributions in this work are as follows:

1. First, we analyze knowledge distillation with an auxiliary dataset in terms of the convergence rate of the expected risk (Section 4). As an application of this result, we verify the effectiveness of one-shot ensemble distillation in federated learning with kernel ridge regression (Section 5.2).
2. Second, we propose and analyze a new iterative ensemble distillation algorithm, in the spirit of FedMD (Li & Wang, 2019), in federated learning with kernel ridge regression (Sections 5.3-5.4). In the proposed algorithm, to overcome the undesirable amplified regularization caused by the repeated distillation procedures, we introduce a de-regularization trick that leads to the effectiveness of the proposed iterative ensemble distillation algorithm (Section 5.3).

3. Third, we analyze how a random client selection strategy, which is considered to resolve the communication cost or system heterogeneity issue, affects the performance of the proposed iterative ensemble distillation algorithm (Section 5.5).

4. Lastly, we provide experimental results to validate our theoretical results for the proposed iterative ensemble distillation in federated learning (Section 6).

2. Related Works

Federated Learning: FL is a privacy-preserving decentralized machine learning framework proposed in McMahan et al. (2017). FL has several challenges to address, such as statistical heterogeneity (Li et al., 2020a; Karimireddy et al., 2020), system (e.g., computational capability) heterogeneity (Nishio & Yonetani, 2019), communication efficiency (Konečný et al., 2016; Caldas et al., 2018), and privacy concerns (Bagdasaryan et al., 2020). Note that there are theoretical results showing that classic FL algorithms such as FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2020a) converge in both the IID and non-IID cases (Li et al., 2020a;b; Yuan & Li, 2022).

Knowledge Distillation: KD was originally proposed in Buciluǎ et al. (2006); Hinton et al. (2015) in the context of model compression. The original KD encourages the outputs of a student model to match the outputs of a pretrained teacher model. There are some variants of the original student-teacher framework. Self-distillation utilizes the student model itself as a teacher (Zhang et al., 2019). Ensemble distillation (or co-distillation) uses an ensemble of models as a teacher (Hinton et al., 2015; Anil et al., 2018; Lin et al., 2020b). Since the latter is an inclusive concept, in this work we refer to a knowledge distillation framework with several students and a teacher that is an ensemble of (some of) them as ensemble distillation, following Lin et al. (2020b).

Knowledge Distillation in FL: There are mainly two lines of research on federated learning with knowledge distillation. One line of research aims at improving the efficiency or effectiveness of federated learning algorithms by applying knowledge distillation (Lin et al., 2020b; He et al., 2020; Zhu et al., 2021; Zhang et al., 2022a;b). In particular, Lin et al. (2020b) improve the performance by applying additional ensemble distillation on a server to train a server model rather than directly replacing it with an aggregated model of client models. They also propose an extended version for heterogeneous model cases. However, it allows only a few prototypes, and clients have to share their models with the server. The other line of research aims at enabling the use of black-box models in FL (Li & Wang, 2019; Cho et al., 2021; Zhang et al., 2021). Li & Wang (2019) first propose a framework for FL with black-box models. Subsequently, Cho et al. (2021); Zhang et al. (2021) apply advanced ensemble strategies to improve the performance in the non-IID setting. However, to the best of our knowledge, there is no previous work that provides a theoretical analysis of KD in FL.
Recently, Allen-Zhu & Li (2020) provide a theoretical analysis that explains why ensemble knowledge distillation works for neural networks with the same architecture. However, we cannot directly apply their results to ensemble distillation in FL since they assume all neural networks use the same dataset.

Theoretical Analysis of Kernel Ridge Regression: There are many prior works that analyze the convergence rate of kernel ridge regression, either in the expected risk sense or in the probability sense, thanks to the closed form expression of its solution. The convergence properties of the classical KRR model have been analyzed well in Caponnetto & De Vito (2007); Fischer & Steinwart (2020); Cui et al. (2021); Li et al. (2023). Recently, Zhang et al. (2013); Lin et al. (2017); Chang et al. (2017); Guo et al. (2017); Lin et al. (2020a); Yin et al. (2021) also analyze distributed learning with KRR models in various contexts. In particular, Lin et al. (2020a) propose a communication scheme for distributed learning. However, this approach does not preserve data privacy due to the nature of the form of KRR solutions. On the other hand, Mobahi et al. (2020); Borup & Andersen (2021) apply KRR to study self-distillation. To the best of our knowledge, Afonin & Karimireddy (2022) is the only preceding study that utilizes KRR to provide a theoretical analysis of distillation strategies in the federated learning setting. However, they mainly consider the case of two clients, and their strategy is an ensemble of infinitely many models, which is impractical. In our work, we analyze standard ensemble distillation methods for the KRR model with an arbitrary number of clients.

3. Backgrounds

In this section, we briefly introduce some background for our work. See Appendix A for details and comments.

3.1. Preliminaries and Notations

Our interest is a regression problem in the FL setting. Let $X \subset \mathbb{R}^d$ be an input space. We assume $X$ is compact. Our goal is to find a target function $f_0 : X \to \mathbb{R}$. Let $\rho_{x,y}$ be a data generating distribution such that $\rho_{x,y}(x, y) = \rho_x(x)\,\rho_{y|x}(y|x)$, where $\mathbb{E}_{y \sim \rho_{y|x}}[y|x] = f_0(x)$ and $\mathrm{Var}_{y \sim \rho_{y|x}}(y|x) = \sigma^2(x)$ for any $x \in X$. Let $k : X \times X \to \mathbb{R}$ be a continuous, symmetric, positive semi-definite kernel such that

$$\iint k(x_1, x_2) f(x_1) f(x_2) \, d\rho_x(x_1)\, d\rho_x(x_2) \ge 0 \quad (1)$$

for any $f \in L^2_{\rho_x}$, with equality if and only if $f = 0$. $H_k \subset \mathbb{R}^X$ denotes the reproducing kernel Hilbert space with kernel $k$. Set $\kappa = (\sup_{x \in X} k(x, x))^{1/2}$. Define the covariance operator $L_k : H_k \to H_k$ by

$$L_k f = \int f(x)\, k(x, \cdot) \, d\rho_x(x)$$

and its sample analog $L_{k,X} : H_k \to H_k$ by

$$L_{k,X} f = \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, k(x_i, \cdot)$$

where $X = \{x_1, \dots, x_n\} \subset X$. Let $\mathcal{N}(\lambda) = \mathrm{tr}((L_k + \lambda I)^{-1} L_k)$ and $f_\lambda = (L_k + \lambda I)^{-1} L_k f_0$ where $\lambda > 0$. Define the compact embedding $\iota_{\rho_x} : H_k \to L^2_{\rho_x}$ by $\iota_{\rho_x} h = [h]_{\rho_x}$, where $[h]_{\rho_x}$ is the equivalence class containing $h$ in $L^2_{\rho_x}$. Also, define its sample analog $S_D : H_k \to \mathbb{R}^n$ by $S_D f = [f(x_1), \dots, f(x_n)]^\top$ where $D = \{(x_1, y_1), \dots, (x_n, y_n)\} \subset X \times \mathbb{R}$.^1 $A^*$ denotes the adjoint operator of a given operator $A$. We write $k_x(\cdot) = k(x, \cdot)$ and

$$K_{X_1, X_2} = \begin{bmatrix} k(x_1^1, x_2^1) & \cdots & k(x_1^1, x_2^m) \\ \vdots & \ddots & \vdots \\ k(x_1^n, x_2^1) & \cdots & k(x_1^n, x_2^m) \end{bmatrix}$$

where $X_1 = \{x_1^1, \dots, x_1^n\}$ and $X_2 = \{x_2^1, \dots, x_2^m\}$. Also, we write $D(x) = \{x_1, \dots, x_n\}$ where $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ is a given dataset.

Technical Assumptions: In this work we make the following assumptions.

Assumption 3.1. We assume $\mathbb{E}_{y \sim \rho_y}\, y^2 < \infty$ and

$$\int \left( \exp\!\left(\frac{|y - f_0(x)|}{M}\right) - \frac{|y - f_0(x)|}{M} - 1 \right) d\rho_{y|x}(y|x) \;\le\; \frac{\gamma^2}{2M^2} \quad (2)$$

for any $x \in X$, where $M$ and $\gamma$ are positive constants.

Assumption 3.2. The target function $f_0$ satisfies $f_0 = L_k^r g_0$ for some $g_0 \in H_k$ and $r \in [0, \tfrac{1}{2}]$. In particular, $f_0 \in H_k$.

Some previous works (Guo et al., 2017; Chang et al., 2017; Lin et al., 2020a) assume that $\mathcal{N}(\lambda)$ satisfies

$$\mathcal{N}(\lambda) \le C_e \lambda^{-s} \quad (3)$$

for some $s \in (0, 1]$ and some constant $C_e \ge 1$ independent of $\lambda$. We do not assume any regularity condition on $\mathcal{N}(\lambda)$ in this work, in order to deal with the convergence rate of the general case. However, under the above assumption (3), a better convergence rate can be obtained by extending our analysis. Note that (3) always holds with $s = 1$.

^1 We use a scaled $L^2$ norm as the norm of the Euclidean space $\mathbb{R}^n$. See Appendix A.
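To make the notation above concrete, here is a small NumPy sketch (ours, not from the paper) that builds the kernel matrix $K_{X_1,X_2}$ and applies the sample covariance operator $L_{k,X}$ at query points. The Gaussian kernel, the bandwidth, and all function names are illustrative assumptions, since the kernel $k$ is left abstract at this point.

```python
import numpy as np

def gaussian_kernel(X1, X2, bandwidth=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)); rows of X1/X2 are points.
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_matrix(X1, X2, kernel=gaussian_kernel):
    # K_{X1,X2}[i, j] = k(x_1^i, x_2^j)
    return kernel(X1, X2)

def sample_covariance_apply(f_values, X, X_query, kernel=gaussian_kernel):
    # (L_{k,X} f)(x) = (1/n) * sum_i f(x_i) * k(x_i, x), evaluated at the query points.
    return kernel(X_query, X) @ f_values / X.shape[0]

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, 1))        # five sample inputs in [0, 1]
f_vals = np.sin(2 * np.pi * X[:, 0])          # a function evaluated at the samples
X_query = np.array([[0.3], [0.7]])
print(kernel_matrix(X, X).shape)              # (5, 5)
print(sample_covariance_apply(f_vals, X, X_query))
```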
3.2. Kernel Ridge Regression and Optimal Convergence Rate

Kernel ridge regression (KRR) is an algorithm to estimate the target function $f_0$. Let $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$ be a labeled dataset whose data points are independently drawn from $\rho_{x,y}$. Given a kernel $k$ and a regularization hyperparameter $\lambda > 0$, a KRR model is given by a minimizer of the following optimization problem:

$$\operatorname*{argmin}_{h \in H_k} \; \frac{1}{N} \sum_{i=1}^{N} (h(x_i) - y_i)^2 + \lambda \|h\|_{H_k}^2.$$

Note that the solution of the above optimization problem has the closed form expression

$$f_{D,\lambda} = (L_{k,D(x)} + \lambda I)^{-1} S_D^* \mathbf{y}$$

where $\mathbf{y} = [y_1, \dots, y_N]^\top$. One way to measure the performance of kernel ridge regression is to find the convergence rate of the expected loss $\mathbb{E}_D \|\iota_{\rho_x}(f_{D,\lambda} - f_0)\|_{L^2_{\rho_x}}$ with respect to the dataset size $N$. We provide a key theorem from Caponnetto & De Vito (2007); Lin et al. (2017); Fischer & Steinwart (2020).

Theorem 3.3. Under Assumptions 3.1 and 3.2, with $\lambda = N^{-\frac{1}{2r+2}}$,

$$\mathbb{E}_D \|\iota_{\rho_x}(f_{D,\lambda} - f_0)\|_{L^2_{\rho_x}} = O\!\left(N^{-\frac{2r+1}{4r+4}}\right).$$

Moreover, this convergence rate is optimal.

4. KRR with Knowledge Distillation

Let $g$ be a teacher model which approximates $f_0$. In the noiseless, data-free version, the optimization problem of KRR with KD is given by

$$\operatorname*{argmin}_{h \in H_k} \; \alpha \|\iota_{\rho_x}(h - f_0)\|^2_{L^2_{\rho_x}} + (1-\alpha) \|\iota_{\rho_x}(h - g)\|^2_{L^2_{\rho_x}} + \lambda \|h\|_{H_k}^2$$

where $\alpha \in (0, 1)$ controls the distillation effect and $\lambda > 0$ is a regularization hyperparameter. From the first order optimality condition, we obtain the minimizer

$$f_\lambda = (L_k + \lambda I)^{-1}(\alpha L_k f_0 + (1-\alpha) L_k g). \quad (4)$$

Now we consider a sample version. Let $D_1 = \{(x_1^1, y_1^1), \dots, (x_1^{N_1}, y_1^{N_1})\}$ be a labeled dataset whose data points are independently generated from $\rho_{x,y}$, and let $D_2(x) = \{x_2^1, \dots, x_2^{N_2}\}$ be an unlabeled dataset whose data points are independently generated from $\rho_x$. To distill the teacher's knowledge using $D_2(x)$, we consider the optimization problem

$$\operatorname*{argmin}_{h \in H_k} \; \alpha \frac{1}{N_1} \sum_{i=1}^{N_1} (h(x_1^i) - y_1^i)^2 + (1-\alpha) \frac{1}{N_2} \sum_{i=1}^{N_2} (h(x_2^i) - g(x_2^i))^2 + \lambda \|h\|_{H_k}^2.$$

Then, the solution of the above problem is

$$\tilde{f}_{D,\lambda} = (\alpha L_{k,D_1(x)} + (1-\alpha) L_{k,D_2(x)} + \lambda I)^{-1} (\alpha S_{D_1}^* \mathbf{y}_1 + (1-\alpha) L_{k,D_2(x)}\, g) \quad (5)$$

where $\mathbf{y}_1 = [y_1^1, \dots, y_1^{N_1}]^\top$. A natural question is how much the performance of the trained student model $\tilde{f}_{D,\lambda}$ can be improved, and the answer is given in the following.

Theorem 4.1 (Informal). The performance of the student model $\tilde{f}_{D,\lambda}$ is at least as good as the worse of the teacher model $g$ and a KRR model trained using both datasets of sizes $N_1$ and $N_2$ independently generated from $\rho_{x,y}$, in terms of the convergence rate of the expected risk, when adequate $\alpha$ and $\lambda$ are used.

Theorem 4.1 is one of the key ideas used to analyze ensemble distillation in federated learning. See Appendix B for the detailed statement and proof.
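As a concrete illustration of the closed forms above, the sketch below implements plain KRR via $f_{D,\lambda} = k_{D(x)}(\cdot)^\top (N\lambda I + K_{D(x),D(x)})^{-1}\mathbf{y}$ and the distillation objective of Section 4 via the matrix form of (5) derived in Appendix B. This is our own minimal NumPy sketch (reusing the gaussian_kernel helper defined above, which is an assumption), not the authors' implementation.

```python
import numpy as np

def krr_fit(X, y, lam, kernel=gaussian_kernel):
    """Plain KRR: predictor x -> k_{D(x)}(x)^T (K + N*lam*I)^{-1} y."""
    N = X.shape[0]
    coef = np.linalg.solve(kernel(X, X) + N * lam * np.eye(N), y)
    return lambda Xq: kernel(Xq, X) @ coef

def krr_distill_fit(X1, y1, X2, teacher_preds, alpha, lam, kernel=gaussian_kernel):
    """KRR with knowledge distillation on an auxiliary unlabeled set X2.

    Minimizes  alpha/N1 * sum_i (h(x_1^i) - y_1^i)^2
             + (1-alpha)/N2 * sum_i (h(x_2^i) - g(x_2^i))^2 + lam * ||h||^2,
    whose minimizer in matrix form (Appendix B) is
      f(x) = k_{D1(x) u D2(x)}(x)^T (lam*I + D*K)^{-1} D [y1; g(D2(x))],
    with D = diag(alpha/N1, ..., (1-alpha)/N2, ...).
    """
    N1, N2 = X1.shape[0], X2.shape[0]
    Z = np.vstack([X1, X2])
    v = np.concatenate([y1, teacher_preds])
    d = np.concatenate([np.full(N1, alpha / N1), np.full(N2, (1.0 - alpha) / N2)])
    coef = np.linalg.solve(lam * np.eye(N1 + N2) + d[:, None] * kernel(Z, Z), d * v)
    return lambda Xq: kernel(Xq, Z) @ coef
```

Choosing $\alpha = N_1/(N_1+N_2)$ and $\lambda = (N_1+N_2)^{-\frac{1}{2r+2}}$ corresponds to the choice analyzed in Appendix B.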
5. KRR with Ensemble Distillation in FL

5.1. Problem Formulation

We assume the following. There are $m$ clients in total, and client $j$ has a labeled dataset $D_j = \{(x_j^1, y_j^1), \dots, (x_j^N, y_j^N)\}$ whose data points are independently generated from $\rho_{x,y}$ ($j = 1, \dots, m$). We assume $|D_1| = \dots = |D_m| = N$ for simplicity. Set $\mathbf{y}_j = [y_j^1, \dots, y_j^N]^\top$ and $D = \bigcup_{j=1}^{m} D_j$. To distill knowledge without sharing models, we assume there is an unlabeled public dataset $D_p(x) = \{x_p^1, \dots, x_p^{N_p}\}$ whose data points are independently generated from $\rho_x$ and whose size is $N_p$.

Due to privacy concerns, only client $j$ can access its own dataset $D_j$. In addition, all clients can access the public dataset $D_p(x)$ with its pseudo labels. For simplicity, we write $L_{k,D_j(x)}$ as $L_{k,X_j}$ for $j = 1, \dots, m$ and $L_{k,D_p(x)}$ as $L_{k,X_p}$.

We consider and analyze an algorithm that performs public data labeling through an ensemble prediction of the local models and local model training through knowledge distillation. Algorithm 1 presents the detailed procedure.

Algorithm 1: KRR with Iterative Ensemble Distillation in FL
1: Input: hyperparameters $\alpha \in (0, 1)$, $\lambda > 0$, and $t \in \mathbb{N}$
2: Output: trained models $f_j$, $j = 1, \dots, m$
3: Pretrain: for $j = 1, \dots, m$, client $j$ trains its model $f_j$ using the loss function
$$\operatorname*{argmin}_{h \in H_k} \; \frac{1}{N} \sum_{i=1}^{N} (h(x_j^i) - y_j^i)^2 + \lambda \|h\|_{H_k}^2.$$
4: Each client downloads the unlabeled public dataset $D_p(x)$.
5: for $t_0 = 1, \dots, t$ do
6: For $j = 1, \dots, m$, client $j$ predicts on $D_p(x)$ and uploads its prediction $\mathbf{y}_p^j$ to the server.
7: The server computes an updated consensus prediction $\bar{\mathbf{y}}_p$ from the uploaded predictions (a weighted average; the simple average $\bar{\mathbf{y}}_p = \frac{1}{m}\sum_{j=1}^{m} \mathbf{y}_p^j$ in our analysis with $\alpha = 1/m$).
8: Each client downloads the ensemble prediction $\bar{\mathbf{y}}_p$.
9: For $j = 1, \dots, m$, client $j$ updates its model $f_j$ using the loss function
$$\operatorname*{argmin}_{h \in H_k} \; \alpha \frac{1}{N} \sum_{i=1}^{N} (h(x_j^i) - y_j^i)^2 + (1-\alpha) \frac{1}{N_p} \sum_{i=1}^{N_p} (h(x_p^i) - (\bar{\mathbf{y}}_p)_i)^2 + \lambda \|h\|_{H_k}^2.$$
10: end for

5.2. One-Shot Ensemble Distillation in FL

First, we deal with one-shot ensemble distillation in FL, i.e., Algorithm 1 with $t = 1$. Even though each client cannot access the one-shot ensemble model, using the result in Lin et al. (2020a) and Theorem 4.1, we can derive a strong result that guarantees the performance of the local models after one-shot ensemble distillation.

Theorem 5.1. Assume Assumptions 3.1 and 3.2 hold. Also, assume

$$m \le N^{2r+1-\epsilon} \quad (6)$$

for any fixed $\epsilon \in (0, 1)$ and $N_p \ge (m-1)N$. Let $\tilde{f}^j_{D,\lambda}$ be the local model of client $j$ after one-shot ensemble distillation ($j = 1, \dots, m$). Then, with $\alpha = 1/m$ and $\lambda = (mN)^{-\frac{1}{2r+2}}$,

$$\mathbb{E}\,\|\iota_{\rho_x}(\tilde{f}^j_{D,\lambda} - f_0)\|_{L^2_{\rho_x}} = O\!\left((mN)^{-\frac{2r+1}{4r+4}}\right)$$

for $j = 1, \dots, m$.

The proof of Theorem 5.1 is provided in Appendix C.1. Theorem 5.1 tells us that, under some assumptions, each local model $\tilde{f}^j_{D,\lambda}$ after the one-shot ensemble distillation has at least the same performance as the KRR model using the whole dataset of size $|D| = mN$ in terms of the convergence rate of the expected risk.

Note that Theorem 5.1 requires that the number of clients satisfy the restriction (6) for a fixed $\epsilon \in (0, 1)$. However, one of the main properties of FL is a massively distributed environment (McMahan et al., 2017). Therefore, it is desirable to remove this restriction on the number of clients, which is the main reason why we consider iterative ensemble distillation (i.e., $t > 1$).
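For concreteness, here is a minimal sketch of Algorithm 1 with KRR local models, reusing krr_fit, krr_distill_fit, and gaussian_kernel from the earlier sketches. Reading step 7 as a uniform average of the uploaded predictions (matching the choice $\alpha = 1/m$ used in the analysis) is our assumption; the paper only specifies a weighted average.

```python
import numpy as np

def iterative_ensemble_distillation(local_data, X_pub, alpha, lam, rounds,
                                    kernel=gaussian_kernel):
    """Sketch of Algorithm 1 (without the de-regularization trick of Section 5.3).

    local_data: list of (X_j, y_j) pairs, one per client.
    X_pub:      unlabeled public inputs D_p(x) shared by all clients.
    Returns the list of local KRR predictors after `rounds` communication rounds.
    """
    # Step 3: each client pretrains a plain KRR model on its own data.
    models = [krr_fit(Xj, yj, lam, kernel) for Xj, yj in local_data]
    for _ in range(rounds):
        # Step 6: every client predicts on the public inputs.
        preds = np.stack([f(X_pub) for f in models])
        # Step 7: the server forms the consensus prediction (uniform average here).
        y_bar = preds.mean(axis=0)
        # Steps 8-9: each client distills the consensus together with its own data.
        models = [krr_distill_fit(Xj, yj, X_pub, y_bar, alpha, lam, kernel)
                  for Xj, yj in local_data]
    return models
```

With rounds = 1 this reduces to the one-shot ensemble distillation of Section 5.2.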
5.3. A Toy Example: Iterative Ensemble Distillation in FL When m = 1

We now analyze iterative ensemble distillation. First, we consider the simple example $m = 1$ to obtain some intuition. In this case, the single client trains its model using its private dataset $D_1$ and the public dataset $D_p(x)$ with pseudo labels, which are the predictions of the local model itself. Thus, the problem is equivalent to self-distillation on an auxiliary unlabeled dataset. To analyze this algorithm, our first interest is to find the limiting regressor after infinitely many iterations.

Theorem 5.2. The local model $f_1$ converges to $f_{D_1,\lambda/\alpha} = (L_{k,X_1} + \tfrac{\lambda}{\alpha} I)^{-1} S_{D_1}^* \mathbf{y}_1$ after infinitely many iterations in Algorithm 1 with $m = 1$.

We provide the proof in Appendix C.2. Theorem 5.2 implies that the limiting regressor is just a kernel ridge regressor using the private dataset $D_1$ with an amplified regularization. This result is closely related to the study of self-distillation for kernel ridge regression (Mobahi et al., 2020; Borup & Andersen, 2021). However, this limits the analysis of the generalization error bound, since the amplified regularization makes the approximation error $\|\iota_{\rho_x}(f_0 - f_{\lambda/\alpha})\|_{L^2_{\rho_x}}$ excessively large when $\alpha$ is small.

To resolve this issue, we introduce a de-regularization trick. Inspired by the Kernel Inducing Point (KIP) scheme (Nguyen et al., 2021), we define the de-regularization trick for a prediction $\mathbf{v}$ on $D_p(x)$ as follows:

$$S_{D_p}(L_{k,X_p} + \lambda_0 I)^{-1} S_{D_p}^* \mathbf{w} = \mathbf{v}$$

where $\lambda_0 \ge 0$ is a de-regularization hyperparameter. For the well-definedness of $\mathbf{w}$, we require the condition $K_{X_p,X_p} > 0$. Under this condition,

$$\mathbf{w} = (S_{D_p}(L_{k,X_p} + \lambda_0 I)^{-1} S_{D_p}^*)^{-1} \mathbf{v} = (K_{X_p,X_p} + N_p \lambda_0 I) K_{X_p,X_p}^{-1} \mathbf{v}. \quad (7)$$

Note that $\lim_{\lambda_0 \to 0} \mathbf{w} = \mathbf{v}$. Applying the de-regularization trick, we modify Algorithm 1 so that each client uses the de-regularized predictions, i.e.,

$$\mathbf{y}'_p = (S_{D_p}(L_{k,X_p} + \lambda_0 I)^{-1} S_{D_p}^*)^{-1}\, \bar{\mathbf{y}}_p$$

instead of $\bar{\mathbf{y}}_p$, except in the last ensemble distillation step. The de-regularization trick can be implemented in the server since it only depends on $D_p(x)$. Therefore, this procedure does not require any additional computational resources on the client side. We provide the modified algorithm (Algorithm 2) in Appendix C.3. From now on, we set $\lambda_0 = \lambda$.

The limiting regressor obviously changes if we adopt the de-regularization trick.

Theorem 5.3. The local model $f_1$ converges to

$$\tilde{f}_{D_1,\lambda} = \Big(L_{k,X_1} + \lambda I + \tfrac{1-\alpha}{\alpha}\lambda P^{\perp}_{D_p(x)}\Big)^{-1} S_{D_1}^* \mathbf{y}_1$$

after infinitely many ensemble distillation steps with the de-regularization trick in Algorithm 2 with $m = 1$, where $P^{\perp}_{D_p(x)}$ is the orthogonal projection onto the orthogonal complement of the subspace

$$\mathrm{span}\{k(x, \cdot) : x \in D_p(x)\} \quad (8)$$

of the reproducing kernel Hilbert space $H_k$, under the assumption that $\lambda_0 = \lambda$ and $K_{X_p,X_p}$ is invertible.

Note that there is still the additional term $\tfrac{1-\alpha}{\alpha}\lambda P^{\perp}_{D_p(x)}$ in the inverse compared with $f_{D_1,\lambda}$. However, we can omit this term if $N_p$ is large. The following theorem supports this argument.

Theorem 5.4. Assume $\lambda_0 = \lambda$ and $K_{X_p,X_p}$ is invertible. We also assume that the density of $\rho_x$ is strictly positive on any non-empty open subset of $X$. Let $\tilde{f}_{D_1,\lambda} \in H_k$ be defined as in Theorem 5.3. If $\alpha$ and $\lambda$ do not depend on $N_p$, then

$$\lim_{N_p \to \infty} \tilde{f}_{D_1,\lambda} = (L_{k,X_1} + \lambda I)^{-1} S_{D_1}^* \mathbf{y}_1 = f_{D_1,\lambda}$$

almost surely in the $H_k$-norm sense.

The proof of Theorem 5.4 is provided in Appendix C.5. Then, our next question is how much public data is needed in order to ignore the additional term $\tfrac{1-\alpha}{\alpha}\lambda P^{\perp}_{D_p(x)}$ when we compute the convergence rate of the expected risk. We answer this question in Section 5.4 for the general setting of $m$ clients.
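The de-regularization step is a single linear solve on the server side. The helper below implements the matrix form (7), assuming $K_{X_p,X_p}$ is invertible as required in the text; the function name is ours and this is only a sketch.

```python
import numpy as np

def deregularize(y_bar, K_pub, lam0):
    """De-regularization trick (7): w = (K_{Xp,Xp} + Np*lam0*I) K_{Xp,Xp}^{-1} y_bar.

    y_bar:  consensus prediction on the public inputs D_p(x).
    K_pub:  Gram matrix K_{Xp,Xp} (assumed positive definite).
    lam0:   de-regularization hyperparameter (lam0 = lam in the analysis).
    """
    Np = K_pub.shape[0]
    return (K_pub + Np * lam0 * np.eye(Np)) @ np.linalg.solve(K_pub, y_bar)
```

In Algorithm 2 the server would apply this map to the consensus $\bar{\mathbf{y}}_p$ before broadcasting it, in every round except the last one; since it only uses $D_p(x)$, the clients incur no extra cost.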
5.4. Iterative Ensemble Distillation in FL for the General Case

We now turn to analyzing iterative ensemble distillation for $m$ clients. Our main question is whether the limiting regressors of the clients after infinitely many iterations in Algorithm 2 have the same performance as a KRR model using the whole dataset $D$ in terms of the convergence rate of the expected risk. The answer is yes if the unlabeled public dataset is sufficiently large, under some regularity conditions.

Theorem 5.5. Assume Assumptions 3.1 and 3.2 hold with $0 < r \le \tfrac{1}{2}$. We further assume $m \ge 2$, $\lambda_0 = \lambda$, $K_{X_p,X_p} > 0$, and that the public dataset is sufficiently large: $N_p \ge (m-1)N$ and, for a fixed $\epsilon \in (0, \tfrac{1}{2})$, $N_p^{1-\epsilon}$ is at least of the order of $m^{\frac{3r+2}{2r^2+2r}} N^{\frac{1}{2r+2}}$. Let $\tilde{f}^j_{D,\lambda}$ be the local model of client $j$ after conducting Algorithm 2 with $t = \infty$ ($j = 1, \dots, m$). Then, with $\alpha = 1/m$ and $\lambda = (mN)^{-\frac{1}{2r+2}}$,

$$\mathbb{E}\,\|\iota_{\rho_x}(\tilde{f}^j_{D,\lambda} - f_0)\|_{L^2_{\rho_x}} = O\!\left((mN)^{-\frac{2r+1}{4r+4}}\right)$$

for $j = 1, \dots, m$.

Theorem 5.5 tells us that iterative ensemble distillation can drop the restriction on $m$ and $N$ and still attain the same convergence rate as the KRR model trained on all of the clients' data, provided we have sufficient unlabeled public data. Thus, this theorem implies that Algorithm 2 works well in a massively distributed environment. Note that Theorem 5.5 requires more unlabeled public data than Theorem 5.1.

However, Theorem 5.5 does not guarantee the performance of the limiting regressors when $r = 0$, even if $N_p$ is large. The reason is that the de-regularization is closely related to the projection $P_{D_p(x)}$, but we have no convergence rate for $\|(I - P_{D_p(x)}) f_0\|_{H_k}$ when $r = 0$, where $P_{D_p(x)}$ is the orthogonal projection onto the subspace (8). For $r > 0$, the required public dataset size decreases as $r$ increases.

5.5. Effects of Client Selection Strategy

So far we have discussed the performance of Algorithms 1 and 2. Note that these algorithms are synchronous: at each step they have to wait until all clients finish training their local models. Hence, these algorithms may waste time, especially in a massively distributed and/or system heterogeneous environment. A typical approach to resolve this issue in FL is to adopt a client selection strategy in each communication round or to use an asynchronous algorithm.

When we directly adopt a random client selection strategy in Algorithm 2, the ensemble prediction $\bar{\mathbf{y}}_p$ on $D_p(x)$ may have high variance, which makes the algorithm unstable. Hence, we memorize the previous prediction $\bar{\mathbf{y}}_{p,\mathrm{old}}$ and update $\bar{\mathbf{y}}_p$ as

$$\bar{\mathbf{y}}_p = (1 - \gamma_{t_0})\, \bar{\mathbf{y}}_{p,\mathrm{old}} + \gamma_{t_0}\, \bar{\mathbf{y}}_{p,\mathrm{new}}$$

at each communication round $t_0$, like a Robbins-Monro stochastic approximation (Robbins & Monro, 1951). We provide a general framework (Algorithm 3) that considers a client selection strategy and an asynchronous strategy in Appendix C.7. See Appendix C.7 for details.

Based on stochastic approximation theory (Robbins & Monro, 1951; Bertsekas & Tsitsiklis, 1996), we obtain a result on the limiting regressor after infinitely many iterations in Algorithm 3 as follows.

Corollary 5.6. Assume $\lambda_0 = \lambda$ and $K_{X_p,X_p}$ is invertible. We also assume that $\sum_{t_0=1}^{\infty} \gamma_{t_0} = \infty$ and $\sum_{t_0=1}^{\infty} \gamma_{t_0}^2 < \infty$. If we sample a fixed number of clients uniformly from all clients at each communication round, then the prediction $\bar{\mathbf{y}}_p$ on $D_p(x)$ after infinitely many ensemble distillation steps with the de-regularization trick in Algorithm 3 converges to $S_{D_p} g$ almost surely, where $g$ is the limiting ensemble regressor after infinitely many ensemble distillation steps with the de-regularization trick in Algorithm 2. In conclusion, the local models after conducting Algorithm 3 with $t = \infty$ are the same as the local models after conducting Algorithm 2 with $t = \infty$.

We prove a general version of Corollary 5.6 in Appendix C.8. See Appendix C.8 for details. Corollary 5.6 implies that the local models derived from Algorithm 3 with uniform sampling are the same as those derived from Algorithm 2 as $t \to \infty$. Therefore, the client selection strategy does not affect the limiting regressor in the case of uniform client sampling. If we sample clients non-uniformly, then the effect of the local datasets on the limiting regressor can be different.
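The memorized consensus of this subsection is a one-line stochastic approximation step. The sketch below (ours; Algorithm 3 itself appears only in Appendix C.7) shows the server-side update under a random client selection.

```python
import numpy as np

def sampled_consensus_update(y_p_old, sampled_preds, gamma_t):
    """Robbins-Monro style update of the consensus prediction (Section 5.5).

    y_p_old:       previous consensus prediction on D_p(x), shape (Np,).
    sampled_preds: predictions of the clients sampled this round, shape (S, Np).
    gamma_t:       step size of round t_0.
    """
    y_p_new = sampled_preds.mean(axis=0)
    return (1.0 - gamma_t) * y_p_old + gamma_t * y_p_new
```

A step size such as $\gamma_{t_0} = 1/t_0$ satisfies $\sum_{t_0}\gamma_{t_0} = \infty$ and $\sum_{t_0}\gamma_{t_0}^2 < \infty$ as required by Corollary 5.6.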
5.6. Connection to Neural Network

From Sections 5.2 to 5.5, we have discussed the effectiveness of ensemble distillation algorithms for kernel ridge regression models. According to the approximation scheme of neural networks as kernel machines (Jacot et al., 2018; Domingos, 2020), ensemble distillation with neural networks can also be expected to be effective in regression problems. One remark is that we assume all clients use the same kernel $k$ in our analysis, and this assumption is violated when neural networks are used as local models. However, according to our analysis, we can expect algorithms that match features across neural networks to perform well. For instance, FedHeNN (Makhija et al., 2022) uses the Hilbert-Schmidt independence criterion to match the features, which improves the performance. Another remark is that the de-regularization trick can be omitted in the ridgeless case. Therefore, some existing algorithms (Li & Wang, 2019; Lin et al., 2020b) can be seen as special cases of our algorithm.

6. Experiments

In this section, we provide experimental results to validate our theoretical results. Rather than verifying the convergence rate of the expected risk, we now analyze the expected risk itself to confirm the practical effectiveness of ensemble distillation algorithms.

We conduct experiments on three synthetic datasets and one real world dataset. We refer to the three synthetic datasets as Dataset 1, Dataset 2, and Dataset 3. Dataset 1 and Dataset 3 are from existing works (Chang et al., 2017; Lin et al., 2020a). The real world dataset is a simplified regression version of the MNIST dataset from another existing work (Cui et al., 2021). We refer to this dataset as MNIST. The generating procedure for the datasets and the kernels used for KRR are described in Appendix E.1. To evaluate the performance, we compute the averaged mean squared error of the local models over the test dataset. To deal with the massively distributed environment mainly considered in FL, we set the number of clients (m) to be relatively large compared with the local dataset size (N). In addition, we set the unlabeled public dataset size Np = (m-1)N to compare one-shot ensemble distillation and iterative ensemble distillation fairly. We conduct an additional experiment to investigate the effect of the unlabeled public dataset size. The detailed experimental setup (e.g., hyperparameter configuration) is described in Appendix E.2.

6.2. Illustrative Example: Effect of De-regularization Trick

We first validate the effectiveness of the de-regularization trick for iterative ensemble distillation. In this experiment, we use Dataset 1 and set m = 20, N = 20, λ = 0.002, and Np = 380. As illustrated in Figure 1, the regularization effect is amplified as ensemble distillation is repeated when we do not use the de-regularization trick. In contrast, the de-regularization trick prevents the regularization effect from being amplified even for a large t.

[Figure 1: two panels, (a) with the de-regularization trick and (b) without the de-regularization trick, each plotting the KRR regressor of one client over x in [0, 1] after t = 1, 2, 3, 10, 1000 communication rounds together with the target function.]
Figure 1. KRR regressors of a client participating in iterative ensemble distillation after t communication rounds (a) with the de-regularization trick and (b) without the de-regularization trick. The solid blue line is the target function.
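To reproduce the qualitative behavior in Figure 1, the sketch below runs the $m = 1$ toy example of Section 5.3 with and without the de-regularization trick, reusing krr_fit, krr_distill_fit, and deregularize from the earlier sketches. The Laplacian kernel and the gridded public inputs are our choices to keep $K_{X_p,X_p}$ well conditioned for the exact inverse; the paper's actual kernels and data generating procedures are in Appendix E.1, which we do not reproduce here.

```python
import numpy as np

def laplacian_kernel(X1, X2, bandwidth=0.2):
    # exp(-||x - x'|| / bandwidth); its Gram matrix is better conditioned than a
    # Gaussian one, which matters because deregularize() inverts K_{Xp,Xp} exactly.
    d = np.sqrt(np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1))
    return np.exp(-d / bandwidth)

rng = np.random.default_rng(0)
target = lambda x: np.sin(2 * np.pi * x[:, 0])

X1 = rng.uniform(0.0, 1.0, size=(20, 1))                  # private data, N = 20
y1 = target(X1) + 0.1 * rng.standard_normal(20)
Xp = np.linspace(0.0, 1.0, 380)[:, None]                  # public inputs, Np = 380
Xt = np.linspace(0.0, 1.0, 200)[:, None]                  # evaluation grid
alpha, lam = 0.05, 0.002
K_pub = laplacian_kernel(Xp, Xp)

f = g = krr_fit(X1, y1, lam, laplacian_kernel)            # pretraining (t = 0)
for _ in range(1000):
    # (a) plain self-distillation on the public inputs (Algorithm 1, m = 1)
    f = krr_distill_fit(X1, y1, Xp, f(Xp), alpha, lam, laplacian_kernel)
    # (b) the same, but distilling de-regularized pseudo labels (Algorithm 2, m = 1)
    w = deregularize(g(Xp), K_pub, lam)
    g = krr_distill_fit(X1, y1, Xp, w, alpha, lam, laplacian_kernel)

f_lam = krr_fit(X1, y1, lam, laplacian_kernel)            # KRR with lambda
f_amp = krr_fit(X1, y1, lam / alpha, laplacian_kernel)    # KRR with lambda / alpha

# Theorem 5.2 predicts that (a) approaches f_amp (amplified regularization), while
# Theorems 5.3-5.4 predict that (b) stays close to f_lam once Np is large.
print("run (a) vs KRR(lambda/alpha):", np.max(np.abs(f(Xt) - f_amp(Xt))))
print("run (b) vs KRR(lambda):      ", np.max(np.abs(g(Xt) - f_lam(Xt))))
```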
6.3. Performance Comparison

We compare the local training, the one-shot ensemble distillation algorithm, the iterative ensemble distillation algorithm without the de-regularization, the iterative ensemble distillation algorithm with the de-regularization, and the central training (which gives the KRR model trained using all local datasets). The performance of standalone KRR models with sample sizes N ∈ {10, 20} is presented in Table 1.

Table 1. Performance of the standalone model on different datasets with sample sizes N = 10 and N = 20.

Dataset     N = 10   N = 20
Dataset 1   0.0329   0.0243
Dataset 2   0.0255   0.0144
Dataset 3   0.0785   0.0739
MNIST       0.9436   0.7845

We summarize the experimental results for the case with N = 10 and various m in Figure 2. We present additional experimental results with sample size N = 20 in Appendix E.3.

[Figure 2: averaged MSE versus the number of clients for (a) Dataset 1 with N = 10, (b) Dataset 2 with N = 10, (c) Dataset 3 with N = 10, and (d) MNIST with N = 10, comparing central training, one-shot ED, IED w/o deregularization, and IED w/ deregularization.]
Figure 2. Comparison between the performance of the one-shot ensemble distillation algorithm (one-shot ED), the iterative ensemble distillation algorithm without the de-regularization (IED w/o deregularization), the iterative ensemble distillation algorithm with the de-regularization (IED w/ deregularization), and the central training. We set N = 10 and conduct the experiments with various m.

We first observe that all ensemble distillation based FL algorithms improve the performance of the local models compared with the local training. They also achieve performance as good as the central training on Dataset 2. However, the one-shot ensemble distillation algorithm falls behind the central training in the other cases. In particular, it performs poorly in the massively distributed environment with high dimensional datasets (Dataset 3 and MNIST). The iterative ensemble distillation algorithm without the de-regularization is also worse than the central training on Dataset 3, and it does not achieve the same performance as the central training in some other cases either.

On the other hand, the iterative ensemble distillation algorithm with the de-regularization dominates both the one-shot ensemble distillation algorithm and the iterative ensemble distillation algorithm without the de-regularization in all settings. It is also comparable with the central training in all experiments. Although our theoretical result requires more than (m-1)N unlabeled public data points, we observe that (m-1)N unlabeled public data points are enough for iterative ensemble distillation to achieve the same performance as the central training in practice. Moreover, in our experiments it attains not only the same convergence rate but also the same expected risk as the central training. We additionally compare our proposed algorithm with FedMD (Li & Wang, 2019) using neural networks on multiple datasets. See Appendix E.3 for the results.

6.4. Effect of Public Dataset Size

We compare the performance of the iterative ensemble distillation algorithm with the de-regularization for various Np. We simulate this algorithm on Dataset 3 with N = 10 and m = 50. The results are summarized in Table 2.

Table 2. Performance of the iterative ensemble distillation algorithm on Dataset 3 with different public dataset sizes Np ∈ {50, 100, 200, 500, 1000}. We set N = 10 and m = 50.

Np             50       100      200      500      1000
Averaged MSE   0.0251   0.0198   0.0168   0.0168   0.0164

The performance of the algorithm improves as Np increases. However, the improvement slows down once Np is sufficiently large. In other words, more public data points may not improve the performance if there are already sufficient public data points. We provide additional experimental results, including a comparison between training with and without a client selection strategy, in Appendix E.3.
7. Conclusions

In this work, we provide a KRR based theoretical framework to verify the effectiveness of KD, one-shot ensemble distillation, and iterative ensemble distillation under some regularity conditions. We also analyze the effects of a client selection strategy in our setting. We simulate ensemble distillation to validate our theoretical results.

Acknowledgements

This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Grant No. NRF-2019R1A5A1028324) and in part by an internal fund/grant of the Electronics and Telecommunications Research Institute (ETRI) [22RB1100, Exploratory and Strategic Research of ETRI-KAIST ICT Future Technology].

References

Afonin, A. and Karimireddy, S. P. Towards model agnostic federated learning using knowledge distillation. In The Tenth International Conference on Learning Representations, 2022.

Allen-Zhu, Z. and Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020.

Anil, R., Pereyra, G., Passos, A., Ormándi, R., Dahl, G. E., and Hinton, G. E. Large scale distributed neural network training through online distillation. In 6th International Conference on Learning Representations, 2018.

Arivazhagan, M. G., Aggarwal, V., Singh, A. K., and Choudhary, S. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.

Aydın, A. D. and Gheondea, A. Probability error bounds for approximation of functions in reproducing kernel Hilbert spaces. Journal of Function Spaces, 2021, 2021.

Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2938-2948. PMLR, 2020.

Banach, S. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133-181, 1922.

Berlinet, A. and Thomas-Agnan, C. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

Bertsekas, D. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.

Bhatia, R. Matrix analysis, volume 169. Springer Science & Business Media, 2013.

Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M.
YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

Borup, K. and Andersen, L. N. Even your teacher needs guidance: Ground-truth targets dampen regularization imposed by self-distillation. Advances in Neural Information Processing Systems, 34:5316-5327, 2021.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877-1901, 2020.

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535-541, 2006.

Caldas, S., Konečný, J., McMahan, H. B., and Talwalkar, A. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210, 2018.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.

Chang, X., Lin, S.-B., and Zhou, D.-X. Distributed semi-supervised learning with kernel ridge regression. Journal of Machine Learning Research, 18(46):1-22, 2017.

Chatalic, A., Schreuder, N., Rosasco, L., and Rudi, A. Nyström kernel mean embeddings. In International Conference on Machine Learning, pp. 3006-3024. PMLR, 2022.

Cho, Y. J., Wang, J., Chiruvolu, T., and Joshi, G. Personalized federated learning for heterogeneous clients with clustered knowledge transfer. arXiv preprint arXiv:2109.08119, 2021.

Conway, J. B. A course in functional analysis, volume 96. Springer, 2019.

Cui, H., Loureiro, B., Krzakala, F., and Zdeborová, L. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131-10143, 2021.

Diao, E., Ding, J., and Tarokh, V. HeteroFL: Computation and communication efficient federated learning for heterogeneous clients. In 9th International Conference on Learning Representations, 2021.

Domingos, P. Every model learned by gradient descent is approximately a kernel machine. arXiv preprint arXiv:2012.00152, 2020.

Fallah, A., Mokhtari, A., and Ozdaglar, A. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33:3557-3568, 2020.

Ferreira, J. and Menegatto, V. A. Positive definiteness, reproducing kernel Hilbert spaces and beyond. Annals of Functional Analysis, 4(1), 2013.

Fischer, S. and Steinwart, I. Sobolev norm learning rates for regularized least-squares algorithms. The Journal of Machine Learning Research, 21(1):8464-8501, 2020.

Fujii, J., Fujii, M., Furuta, T., and Nakamoto, R. Norm inequalities equivalent to Heinz inequality. Proceedings of the American Mathematical Society, 118(3):827-830, 1993.

Guo, Z.-C., Lin, S.-B., and Zhou, D.-X. Learning theory of distributed spectral algorithms. Inverse Problems, 33(7):074009, 2017.

He, C., Annavaram, M., and Avestimehr, S. Group knowledge transfer: Federated learning of large CNNs at the edge. Advances in Neural Information Processing Systems, 33:14068-14080, 2020.

Hinton, G., Vinyals, O., Dean, J., et al.
Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Jiang, Y., Wang, S., Valls, V., Ko, B. J., Lee, W.-H., Leung, K. K., and Tassiulas, L. Model pruning enables efficient federated learning on edge devices. IEEE Transactions on Neural Networks and Learning Systems, 2022.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132-5143. PMLR, 2020.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, D. and Wang, J. FedMD: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, volume 2, pp. 429-450, 2020a.

Li, T., Hu, S., Beirami, A., and Smith, V. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pp. 6357-6368. PMLR, 2021.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. In 8th International Conference on Learning Representations, 2020b.

Li, Y., Zhang, H., and Lin, Q. On the saturation effect of kernel ridge regression. In The Eleventh International Conference on Learning Representations, 2023.

Lin, S.-B., Guo, X., and Zhou, D.-X. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(92):1-31, 2017.

Lin, S.-B., Wang, D., and Zhou, D.-X. Distributed kernel ridge regression with communications. Journal of Machine Learning Research, 21(93):1-38, 2020a.

Lin, T., Kong, L., Stich, S. U., and Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351-2363, 2020b.

Makhija, D., Han, X., Ho, N., and Ghosh, J. Architecture agnostic federated learning for neural networks. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 14860-14870. PMLR, 2022.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017.

Mobahi, H., Farajtabar, M., and Bartlett, P. Self-distillation amplifies regularization in Hilbert space. Advances in Neural Information Processing Systems, 33:3351-3361, 2020.

Mora, A., Tenison, I., Bellavista, P., and Rish, I. Knowledge distillation for federated learning: a practical guide. arXiv preprint arXiv:2211.04742, 2022.

Nguyen, T., Chen, Z., and Lee, J. Dataset meta-learning from kernel ridge-regression. In 9th International Conference on Learning Representations, 2021.

Nichol, A.
Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16784-16804. PMLR, 2022.

Nishio, T. and Yonetani, R. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications, pp. 1-7. IEEE, 2019.

Rasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Rudi, A., Camoriano, R., and Rosasco, L. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 28, 2015.

Shamsian, A., Navon, A., Fetaya, E., and Chechik, G. Personalized federated learning using hypernetworks. In International Conference on Machine Learning, pp. 9489-9502. PMLR, 2021.

Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008.

Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D. S., and Khazaeni, Y. Federated learning with matched averaging. In 8th International Conference on Learning Representations, 2020.

Yao, D., Pan, W., Wan, Y., Jin, H., and Sun, L. FedHM: Efficient federated learning for heterogeneous models via low-rank factorization. arXiv preprint arXiv:2111.14655, 2021.

Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.

Yin, R., Wang, W., and Meng, D. Distributed Nyström kernel learning with communications. In International Conference on Machine Learning, pp. 12019-12028. PMLR, 2021.

Yuan, X.-T. and Li, P. On convergence of FedProx: Local dissimilarity invariant bounds, non-smoothness and beyond. arXiv preprint arXiv:2206.05187, 2022.

Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pp. 7252-7261. PMLR, 2019.

Zhang, J., Guo, S., Ma, X., Wang, H., Xu, W., and Wu, F. Parameterized knowledge transfer for personalized federated learning. Advances in Neural Information Processing Systems, 34:10092-10104, 2021.

Zhang, J., Chen, C., Li, B., Lyu, L., Wu, S., Ding, S., Shen, C., and Wu, C. DENSE: Data-free one-shot federated learning. Advances in Neural Information Processing Systems, 35:21414-21428, 2022a.

Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., and Ma, K. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3713-3722, 2019.

Zhang, L., Shen, L., Ding, L., Tao, D., and Duan, L.-Y. Fine-tuning global model via data-free knowledge distillation for non-IID federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10174-10183, 2022b.

Zhang, Y., Duchi, J., and Wainwright, M. Divide and conquer kernel ridge regression. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pp. 592-617. PMLR, 2013.

Zhu, Z., Hong, J., and Zhou, J.
Data-free knowledge distillation for heterogeneous federated learning. In International Conference on Machine Learning, pp. 12878-12889. PMLR, 2021.

A. Comments and Details on Section 3

A.1. Comments on Section 3.1

For simplicity, we write $Ah = A(h)$ for any $h \in H_1$ and an operator $A : H_1 \to H_2$, where $H_1$ and $H_2$ are Hilbert spaces. Define $A^1 h = Ah$ and $A^t h = A A^{t-1} h$ for any integer $t \ge 2$. For linear operators $A : H_1 \to H_2$ and $B : H_2 \to H_3$, we write the composition $B \circ A$ as $BA$, where $H_1$, $H_2$ and $H_3$ are Hilbert spaces. We use the scaled $L^2$ norm

$$\|a\|_2 = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2}$$

as the norm of the Euclidean space $\mathbb{R}^n$.

Note that the kernel $k$ is bounded since it is continuous on the compact domain $X \times X$. Thus, $\kappa < \infty$. From the positive definiteness of $k$, $k(x_1, x_2) \le \kappa^2$ holds for any $x_1, x_2 \in X$. From the boundedness of $k$, the continuity of $k$, and the separability of $X$, we can conclude that $H_k$ is separable by Corollary 4 of Section 1.5 in Berlinet & Thomas-Agnan (2011).

By Mercer's theorem (Rasmussen & Williams, 2006) and Proposition 4.2 in Ferreira & Menegatto (2013), $k(x, x') = \sum_{i=1}^{\infty} \mu_i \varphi_i(x) \varphi_i(x')$, where $\{\mu_i\}_{i=1}^{\infty}$ are the eigenvalues of the linear operator $L_k$ and each $\varphi_i \in H_k$ is an eigenfunction of $L_k$ corresponding to $\mu_i$, such that $\mu_i \ge 0$, $\sum_{i=1}^{\infty} \mu_i < \infty$, and $\{\varphi_i\}_{i=1}^{\infty}$ is an orthonormal basis of $L^2_{\rho_x}$.^2 Moreover, we know that the $\varphi_i$ are continuous and that $H_k$ carries the inner product

$$\langle f, g \rangle_{H_k} = \sum_{i=1}^{\infty} \frac{f_i g_i}{\mu_i}$$

where $f = \sum_{i=1}^{\infty} f_i \varphi_i$ and $g = \sum_{i=1}^{\infty} g_i \varphi_i$. We can easily see that $\{\sqrt{\mu_i}\,\varphi_i\}_{i=1}^{\infty}$ is an orthonormal basis of $H_k$. For $f \in L^2_{\rho_x}$ such that $f = \sum_{i=1}^{\infty} f_i \varphi_i$ $\rho_x$-almost everywhere with $\sum_{i=1}^{\infty} \frac{f_i^2}{\mu_i} < \infty$, we can regard $f$ as the element $\sum_{i=1}^{\infty} f_i \varphi_i$ of $H_k$.

In the definition of $\iota_{\rho_x} : H_k \to L^2_{\rho_x}$, $[h]_{\rho_x}$ means the equivalence class of all measurable functions that are equal to $h \in H_k$ $\rho_x$-almost everywhere. The adjoint operator $\iota_{\rho_x}^* : L^2_{\rho_x} \to H_k$ of $\iota_{\rho_x}$ satisfies

$$\iota_{\rho_x}^*([h]_{\rho_x})(x) = \langle \iota_{\rho_x}^*([h]_{\rho_x}), k_x \rangle_{H_k} = \langle h, \iota_{\rho_x}(k_x) \rangle_{L^2_{\rho_x}} = \int k(x, t)\, h(t) \, d\rho_x(t).$$

Thus, $L_k = \iota_{\rho_x}^* \iota_{\rho_x}$. From this fact and the compactness of $\iota_{\rho_x}^*$ (Steinwart & Christmann, 2008), we can see that $L_k$ is compact, self-adjoint, and positive. Also,

$$\|\iota_{\rho_x} f\|^2_{L^2_{\rho_x}} = \langle \iota_{\rho_x} f, \iota_{\rho_x} f \rangle_{L^2_{\rho_x}} = \langle \iota_{\rho_x}^* \iota_{\rho_x} f, f \rangle_{H_k} = \langle L_k f, f \rangle_{H_k} = \langle L_k^{1/2} f, L_k^{1/2} f \rangle_{H_k} = \|L_k^{1/2} f\|^2_{H_k}. \quad (9)$$

^2 In $L^2_{\rho_x}$, they are considered as equivalence classes containing the continuous functions $\varphi_i$.

Let $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ be a labeled dataset. The adjoint $S_D^* : \mathbb{R}^n \to H_k$ of $S_D$ maps $c = [c_1, \dots, c_n]^\top \in \mathbb{R}^n$ to

$$S_D^* c = \frac{1}{n} \sum_{i=1}^{n} c_i k_{x_i}.$$

Note that $L_{k,X} = S_D^* S_D$. Obviously $S_D$ is compact since it has finite rank. Therefore $L_{k,X}$ is also compact, self-adjoint, and positive. Similarly to $\iota_{\rho_x}$,

$$\|S_D f\|_2^2 = \langle S_D f, S_D f \rangle_2 = \langle S_D^* S_D f, f \rangle_{H_k} = \langle L_{k,D(x)} f, f \rangle_{H_k} = \langle L_{k,D(x)}^{1/2} f, L_{k,D(x)}^{1/2} f \rangle_{H_k} = \|L_{k,D(x)}^{1/2} f\|^2_{H_k}. \quad (10)$$

Since $S_D$ only depends on $D(x)$, we naturally define $S_{D(x)}$ by $S_{D(x)} f = [f(x_1), \dots, f(x_n)]^\top$ for $D(x) = \{x_1, \dots, x_n\}$. We write $k_X(\cdot) = [k_{x_1}(\cdot), \dots, k_{x_n}(\cdot)]^\top$ where $X = \{x_1, \dots, x_n\} \subset X$.

Assumption 3.1 is a condition on the noise term and Assumption 3.2 concerns the regularity of the target function $f_0$. If the noise is uniformly bounded, Gaussian, or sub-Gaussian, then (2) is satisfied (Caponnetto & De Vito, 2007; Lin et al., 2020a). (3) with $s = 1$ is always satisfied since

$$\mathcal{N}(\lambda) = \sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + \lambda} \le \frac{1}{\lambda} \sum_{i=1}^{\infty} \mu_i \le \frac{\kappa^2}{\lambda},$$

which follows from

$$\kappa^2 \ge \int k(x, x) \, d\rho_x = \int_X \sum_{i=1}^{\infty} \mu_i \varphi_i(x)^2 \, d\rho_x = \sum_{i=1}^{\infty} \mu_i.$$

Since $\mathcal{N}(\lambda) = \sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + \lambda}$, it is monotonically decreasing with respect to $\lambda$ and it increases as $\lambda \to 0$. Thus, there exists $\lambda_1 > 0$ (e.g., $\lambda_1 = \mu_2$) such that $\mathcal{N}(\lambda) \ge 1$ for $0 < \lambda < \lambda_1$.

A.2. Details on Section 3.2

First, consider the noiseless, data-free version:

$$\operatorname*{argmin}_{h \in H_k} \; J[h] = \|\iota_{\rho_x}(h - f_0)\|^2_{L^2_{\rho_x}} + \lambda \|h\|^2_{H_k}.$$
From (9), we obtain J[h] = h f0, Lk(h f0) Hk + λ h 2 Hk. Observe that J[h + ϵu] J[h] = ϵ 2Lk(h f0) + 2λh, u Hk + o(ϵ) since Lk is self-adjoint. By the definition of functional derivatives, we have J[h] = 2Lk(h f0) + 2λh. Note that J is strongly convex since 2J[h] = 2(Lk + λI) 2λI holds. From the first order optimality condition, the minimizer of the optimization problem is fλ = (Lk + λI) 1Lkf0. Towards Understanding Ensemble Distillation in Federated Learning However, most of regression problems have limited datasets with noisy labels. Let D = {(x1, y1), , (x N, y N)} be a labeled dataset whose data points are independently drawn from ρx,y. Then the optimization problem is given by argmin h Hk J[h] = 1 i=1 (h(xi) yi)2 + λ h 2 Hk = SDh y 2 2 + λ h 2 Hk where y = [y1, , y N] . Then, J[h + ϵu] J[h] = ϵ 2 SDu, SDh y 2 + ϵ 2λh, u Hk + o(ϵ) = ϵ 2S DSDh 2S Dy + 2λh, u Hk + o(ϵ). In other words, J[h] = 2S DSDh 2S Dy + 2λh = 2(Lk,D(x)h S Dy + λh). Note that J is strongly convex since 2J[h] = 2(Lk,D(x)+λI) 2λI holds. From the first order optimality condition, we can see that the minimizer f D,λ is given by f D,λ = (Lk,D(x) + λI) 1S Dy. In the matrix form we have f D,λ = k D(x)(NλI + KD(x),D(x)) 1y. B. Details on Section 4 We now derive (4) and (5) as in Section A.2. Let g Hk be a teacher model which approximates f0. Consider a noiseless data-free version of kernel ridge regression problem with knowledge distillation : argmin h Hk J[h] = α ιρx(h f0) 2 L2ρx + (1 α) ιρx(h g) 2 L2ρx + λ h 2 Hk where α (0, 1) is a distillation hyperparameter and λ > 0 is a regularization hyperparameter. Then J[h + ϵu] J[h] = ϵ 2α Lk(h f0) + 2(1 α) Lk(h g) + 2λh, u Hk + o(ϵ) and so J[h] = 2α Lk(h f0) + 2(1 α) Lk(h g) + 2λh. Again, 2J[h] = 2(Lk + λI) 2λI holds. Also, the minimizer fλ of the optmization problem is given by fλ = (Lk + λI) 1(αLkf0 + (1 α)Lkg) using the first order optimality condition. Let D1 = {(x1 1, y1 1), , (x N1 1 , y N1 1 )} be a dataset to train a student model whose data points are independently generated from ρx,y. To distill knowledge of the teacher model g, we assume an unlabeled dataset D2(x) = {x1 2, , x N2 2 } whose data points are independently generated from ρx, is given. Then the optimization problem is argmin h Hk J[h] = α 1 i=1 (h(xi 1) yi 1)2 + (1 α) 1 i=1 (h(xi 2) g(xi 2))2 + λ h 2 Hk = α SD1h y1 2 2 + (1 α) SD2(h g) 2 2 + λ h 2 Hk where α (0, 1) and λ > 0 are hyperparameters and y1 = [y1 1, , y N 1 ] . From J[h + ϵu] J[h] = αϵ 2 SD1u, SD1h y1 2 + (1 α)ϵ 2 SD2u, SD2h SD2g 2 + ϵ 2λh, u Hk + o(ϵ) = ϵ 2αS D1SD1h + 2(1 α)S D2SD2h 2αS D1y1 2(1 α)S D2SD2g + 2λh, u Hk + o(ϵ), Towards Understanding Ensemble Distillation in Federated Learning we have J[h] = 2(αLk,D1(x) + (1 α)Lk,D2(x) + λI)h 2(αS D1y1 + (1 α)Lk,D2(x)g). Observe that 2J[h] = 2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 2λI holds. Applying the first order optimality condition, the minimizer f D,λ of the optimization problem is f D,λ = (αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)g). The solution is written in the matrix form as f D,λ = h k D1(x) k D2(x) i (λI + DKD1(x) D2(x),D1(x) D2(x)) 1D y1 g(D2(x)) where D = diag(α/N1, , α/N1 | {z } N1 , (1 α)/N2, , (1 α)/N2 | {z } N2 ) and g(D2(x)) = [g(x1 2), , g(x N2 2 )] . B.1. Proof of Theorem 4.1 We first prove the following theorem. Theorem B.1. With the same notation as in Section 4, under Assumption 3.1 and 3.2, E ιρx( f D,λ f0) L2ρx 9 α N1 + 1 α N2 ! 
2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk + (1 α)1/2E h (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 L1/2 k,D2(x)(g f0) Hk i Proof of Theorem B.1. Using (9) and Lemma D.3, we have ιρx( f D,λ f0) L2ρx = L1/2 k ( f D,λ f0) Hk (Lk + λI)1/2( f D,λ f0) Hk. By the triangle inequality and the submultiplicativity of the operator norm, (Lk + λI)1/2( f D,λ f0) Hk (Lk + λI)1/2((αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)f0) f0) Hk + (1 α) (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1Lk,D2(x)(g f0) Hk. (13) First, we bound the first term in (13). Note that (Lk + λI)1/2((αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)f0) f0) = (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk + λI)1/2(Lk + λI) 1/2(α(S D1y1 Lk,D1(x)f0) λf0). Let Qd = (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk + λI)1/2 . By Lemma D.1, Lemma D.5, the triangle inequality, and the submultiplicativity of the operator norm, we have Qd (αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk + λI) = I + α(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk Lk,D1(x)) + (1 α)(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(Lk Lk,D2(x)) λ(α Lk,D1(x) Lk + (1 α) Lk,D2(x) Lk ). (14) Towards Understanding Ensemble Distillation in Federated Learning On the other hand, by the submultiplicativity of the operator norm and Lemma D.1, (Lk + λI) 1/2α(S D1y1 Lk,D1(x)f0) Hk α (Lk + λI) 1/2 S D1y1 Lk,D1(x)f0 λ S D1y1 Lk,D1(x)f0 . λ(Lk + λI) 1/2f0 Hk = λ(Lk + λI) 1/2Lr kg0 Hk λ 1 2 +r g0 Hk (15) by Lemma D.1. Therefore, by Lemma D.7 and Lemma D.8, we have (Lk + λI)1/2((αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)f0) f0) Hk α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk (log(6/δ))3/2 with confidence at least 1 δ. By Lemma D.9, E (Lk + λI)1/2((αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1(αS D1y1 + (1 α)Lk,D2(x)f0) f0) Hk α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk since Γ(5/2) = 3 π/4 < 3/2. The second term in (13) satisfies (1 α) (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1Lk,D2(x)(g f0) Hk (1 α) (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1L1/2 k,D2(x)L1/2 k,D2(x)(g f0) Hk (1 α) (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 (αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2(Lk,D2(x) + λI)1/2 (Lk,D2(x) + λI) 1/2L1/2 k,D2(x) L1/2 k,D2(x)(g f0) Hk by the submultiplicativity of the operator norm. Applying Lemma D.1 and (1 α)1/2 (αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2(Lk,D2(x) + λI)1/2 1 which follows from Lemma D.2, we obtain that the second term in (13) is bounded by (1 α)1/2 (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 L1/2 k,D2(x)(g f0) Hk by the submultiplicativity of the operator norm. Taking the expectation completes the proof of Theorem B.1. The following corollary implies Theorem 4.1. Corollary B.2. Suppose Assumption 3.1 and 3.2 hold and g is independent of D2(x). With the same notation as in Section 4, if we set α = N1/(N1 + N2) and λ = (N1 + N2) 1 2r+2 , then E ιρx( f D,λ f0) L2ρx O (N1 + N2) 2r+1 4r+4 + E ιρx(g f0) 2 L2ρx Proof of Corollary B.2. Set α = N1/(N1 + N2) and λ = (N1 + N2) 1 2r+2 . We will bound the given upper bound of E ιρx( f D,λ f0) L2ρx in Theorem B.1. We first observe that α N1 + 1 α N2 2κ2(N1 + N2) 1 2r+2 N1 + N2 1 + 4κ2 (16) Towards Understanding Ensemble Distillation in Federated Learning 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk = 2κ(M + γ)(N1 + N2) 1 4r+4 N1 N1 + N2 1 N1 + (N1 + N2) 2r+1 (2κM + 2κγ + g0 Hk) (N1 + N2) 2r+1 since N1 + N2 p 2(N1 + N2) and N1 N1 + N2. Thus, we know that α N1 + 1 α N2 ! 
2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk = O (N1 + N2) 2r+1 Second, by the Cauchy-Schwarz inequality (Conway, 2019), E h (1 α)1/2 (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 L1/2 k,D2(x)(g f0) Hk i E (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 2 1/2 E L1/2 k,D2(x)(g f0) 2 Hk 1/2 . Since (Lk + λI)1/2(αLk,D1(x) + (1 α)Lk,D2(x) + λI) 1/2 2 = Qd where Qd is defined in the proof of Theorem B.1, by (14) and Lemma D.7, we have α N1 + 1 α N2 (log(4/δ))1/2 α N1 + 1 α N2 (log(4/δ)) (17) with confidence at least 1 δ where δ (0, 1) and so E (Lk +λI)1/2(αLk,D1(x) +(1 α)Lk,D2(x) +λI) 1/2 2 4Γ 3 α N1 + 1 α N2 by Lemma D.9 and (16) since Γ(3/2) = π/2 < 1. On the other hand, by (10) and the independence of g and D2(x), we find that E L1/2 k,D2(x)(g f0) 2 Hk = E SD2(g f0) 2 2 = E i=1 (g(xi 2) f0(xi 2))2 # i=1 (g(xi 2) f0(xi 2))2 | g = E(g(x1 2) f0(x1 2))2 = E ιρx(g f0) 2 L2ρx. E ιρx( f D,λ f0) L2ρx O (N1 + N2) 2r+1 4r+4 + E ιρx(g f0) 2 L2ρx Remark B.3. Since g is fixed in Section 4, we assume g is independent of D2(x) in Corollary B.2. In this case, when g is a KRR model trained using a dataset D whose size is N such that D is independent of D2(x), we know that E ιρx(g f0) 2 L2ρx 1/2 = O N 2r+1 which can be proved by a similar argument as in the proof of Theorem 3.3 provided in Caponnetto & De Vito (2007). Thus, we can conclude that the generalization error of f D,λ has the convergence rate O min(N1 + N2, N) 2r+1 4r+4 . If D is not independent of D2(x), then it could be more complicated. However, we can achieve the same result using Lemma D.13 under some assumptions. We deal with this case in Section 5.4. Towards Understanding Ensemble Distillation in Federated Learning C. Proofs and Comments on Section 5 C.1. Proof of Theorem 5.1 We first show the following theorem that is similar to Theorem 13 in Lin et al. (2020a). Theorem C.1. Let i=1 (Lk,Xi + λI) 1S Diyi. Under Assumption 3.1 and 3.2, E ιρx( f D,λ f0) 2 L2ρx = O λ2r+1 + 1 λm N + λ2r+1B2 + O(1) λ2r+1B2 + 1 λm N 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B holds if 0 < λ 1 and N(λ) 1 where B = 1 + log N(λ) 1 + log N(λ) Proof of Theorem C.1. We use a similar argument as in Lin et al. (2020a). Recall Proposition 9 in Lin et al. (2020a): E ιρx( f D,λ f0) 2 L2ρx 2 ιρx(fλ f0) 2 L2ρx + 1 m E Q4(P + S fλ Hk)2 + E h Q4R2 (Lk + λI)1/2(fλ f0) 2 Hk i P = (Lk + λI) 1/2(Lkf0 S D1y D1) Hk, Q = (Lk + λI)1/2(Lk,X1 + λI) 1/2 , R = (Lk + λI) 1/2(Lk Lk,X1)(Lk + λI) 1/2 , and S = (Lk + λI) 1/2(Lk Lk,X1) . By (9), Lemma D.1, Lemma D.3, and the submultiplicativity of the operator norm, we have ιρx(fλ f0) 2 L2ρx = L1/2 k (fλ f0) 2 Hk (Lk + λI)1/2(fλ f0) 2 Hk = λ(Lk + λI) 1/2Lr kg0 2 Hk λ2r+1 g0 2 Hk. (18) fλ Hk = (Lk + λI) 1Lkf0 Hk f0 Hk (19) by the submultiplicativity of the operator norm and Lemma D.1. Therefore, applying Lemma D.7 and Lemma D.8 leads to Q4(P + S fλ Hk)2 16κ2(2 2κ f0 Hk + M + γ)2 1 λN (log(6/δ))2 with confidence at least 1 δ for 12 exp( 1/4(κ2 + 1)B) δ < 1. On the other hand, using the trivial bound λ (Lk + λI)1/2 κ2 + λ which follows from the submultiplicativity of the operator norm and (Lk + λI)1/2 = (µ1 + λ)1/2 (κ2 + λ)1/2, we get Q4(P + S fλ Hk)2 4κ2(2 2κ f0 Hk + M + γ)2 1 + κ2 2 1 λN (log(6/δ))2 Towards Understanding Ensemble Distillation in Federated Learning with confidence at least 1 δ for δ (0, 1). By Lemma D.10 and Remark D.11, E Q4(P + S fλ Hk)2 = O 1 + O(1) 1 λN 1 + κ2 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B Next we turn to bound E Q4R2 (Lk + λI)1/2(fλ f0) 2 Hk . 
By Lemma D.7 and (20), Q4R2 16(κ2 + 1)2B2(log(8/δ))2 with confidence at least 1 δ for 8 exp( 1/4(κ2 + 1)B) δ < 1 and Q4R2 4 1 + κ2 2 (κ2 + 1)2B2(log(8/δ))2 with confidence at least 1 δ for δ (0, 1). Again, using (18), Lemma D.10, and Remark D.11, we obtain E h Q4R2 (Lk + λI)1/2(fλ f0) 2 Hk i = O(λ2r+1B2) + O(1) λ2r+1B2 1 + κ2 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B Combining the bounds, we are done. Now, we derive the performance of one-shot ensemble distillation. The following corollary implies Theorem 5.1. Corollary C.2. Assume Assumption 3.1 and 3.2 hold. Also, assume m N 2r+1 ϵ for any fixed ϵ (0, 1) and Np (m 1)N. Let f i D,λ = (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)Lk,Xp f D,λ) which is the local model of client i after one-shot ensemble distillation (i = 1, , m) where i=1 (Lk,Xi + λI) 1S Diyi. Then, with α = 1/m and λ = (m N) 1 2r+2 , E ιρx( f i D,λ f0) L2ρx = O (m N) 2r+1 for any i = 1, , m. Proof of Corollary C.2. When m or N increases, Np also increases. Thus, it is enough to show the statement under the assumption N N 1 2r+2 p Although we cannot directly apply Corollary B.2, we can show a similar statement using the same way as in the proof of Corollary B.2. Since α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk = O (m N) 2r+1 E (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 2 4 4(1 + 4κ2), Towards Understanding Ensemble Distillation in Federated Learning E ιρx( f i D,λ f0) L2ρx O (m N) 2r+1 4r+4 + E ιρx( f D,λ f0) 2 L2ρx Therefore, it suffices to show that E ιρx( f D,λ f0) 2 L2ρx = O (m N) 2r+1 From m N 2r+1 ϵ, B log(eκ2N) 1 + log(eκ2N) 1 + log(eκ2N) f(x) = log(eκ2x) 1 + log(eκ2x) is continuous on [1, ) and vanishes at , f(x) C(κ, r, ϵ) holds for some C(κ, r, ϵ) which only depends on κ, r and ϵ. Then B C(κ, r, ϵ) 1 ϵ 4(2r+2) . Also, since 0 < B C(κ, r, ϵ) and xβ exp 1 4(κ2 + 1)x is continuous on (0, C(κ, r, ϵ)] and vanishes at 0+, 1 Bβ exp 1 4(κ2 + 1)B C (κ, r, ϵ, β) for some C (κ, r, ϵ, β) which only depends on κ, r, ϵ and β when β > 0 is a fixed constant. Taking β = 1 + 8 ϵ (2r + 2), we find that exp 1 4(κ2 + 1)B B1+ 8(2r+2) ϵ C κ, r, ϵ, 1 + 8(2r + 2) ϵ C κ, r, ϵ, 1 + 8(2r + 2) Hence, 1 + κ2 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B (1 + κ2)2N 2C(κ, r, ϵ) ϵ C κ, r, ϵ, 1 + 8(2r + 2) N 2 1 + 1 4(κ2 + 1)B As a consequence, by Theorem C.1, we get E ιρx( f D,λ f0) 2 L2 ρx = O λ2r+1 + 1 λm N + λ2r+1B2 + O(1) λ2r+1B2 + 1 λm N 2 exp 1 4(κ2 + 1)B 1 + 1 4(κ2 + 1)B = O λ2r+1 + 1 λm N = O (m N) 2r+1 2r+2 + (m N) 2r+1 2r+2 = O (m N) 2r+1 Remark C.3. In the proof of Corollary C.2, m N 2r+1 is not sufficient to derive the same result since this condition does not guarantee the boundedness of B. Towards Understanding Ensemble Distillation in Federated Learning C.2. Proof of Theorem 5.2 Define an affine operator H : Hk Hk by Hg = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)Lk,Xpg) which is a unique solution of argmin h Hk J[h] = α 1 i=1 (h(xi 1) yi 1)2 + (1 α) 1 i=1 (h(xi 2) g(xi 2))2 + λ h 2 Hk. Therefore, the limiting regressor of the local model f1 is H f D1,λ = limt Htf D1,λ after infinitely many iterations in Algorithm 1. Now, we prove the following theorem which is a more general statement of Theorem 5.2. Theorem C.4. For any initial point h0 Hk, Hth0 converges to a unique fixed point f D1,λ/α of H as t . Proof of Theorem C.4. Define another operator H : Hk Hk by Hg = L1/2 k,Xp(αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)L1/2 k,Xpg). 
Observe that Hg1 Hg2 Hk (1 α)L1/2 k,Xp(αLk,X1 + (1 α)Lk,Xp + λI) 1L1/2 k,Xp Since (1 α)Lk,Xp < αLk,X1 + (1 α)Lk,Xp + λI and (1 α)Lk,Xp is of finite rank which implies it is compact, we have (1 α)L1/2 k,Xp(αLk,X1 + (1 α)Lk,Xp + λI) 1L1/2 k,Xp by Lemma D.2. Thus, H is a η-contraction where η (0, 1). By the Banach fixed point theorem (Banach, 1922), Hth0 converges to a unique fixed point g of H as t . Define operators A = L1/2 k,Xp and B = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)L1/2 k,Xp ). Then Hh = BAh and Hh = ABh for any h Hk. From the definition of g , B g = B H g = BAB g = HB g which implies that B g = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)L1/2 k,Xp g ) is a fixed point of H. Let g be a fixed point of H, i.e., g satisfies Hg = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)Lk,Xpg ) = g . Then, αS D1y1 + (1 α)Lk,Xpg = (αLk,X1 + (1 α)Lk,Xp + λI)g , i.e., g = (Lk,X1 + λ/αI) 1S D1y1 = f D1,λ/α. Hence, H has a unique a fixed point and so Hth0 = B Ht 1Ah0 converges to B g = g = f D1,λ/α as t . Remark C.5. Note that H may not be a contraction since (A + B + I) 1B < 1 may not be true even if operators A and B are positive compact operators on a Hilbert space. So we have to follow the above argument. Towards Understanding Ensemble Distillation in Federated Learning Algorithm 2 KRR with iterative De-regularized Ensemble Distillation in FL 1: Input: hyperparameters α (0, 1), λ > 0, λ0 0 and t N 2: Output: Trained model fj, j = 1, , m 3: Pretrain: For j = 1, , m, client j trains its model fj using the loss function argmin h Hk i=1 (h(xi j) yi j)2 + λ h 2 Hk. 4: Each client downloads the unlabeled public dataset Dp(x). 5: for t0 = 1, , t do 6: For j = 1, , m, client j predicts on Dp(x) and upload the prediction yj p to server. 7: The server computes an updated consensus 8: if t0 = t then 9: The server applies the de-regularization trick to yp: yp = (SDp(Lk,Xp + λ0I) 1S Dp) 1 yp. 10: end if 11: Each client downloads the ensemble prediction yp. 12: For j = 1, , m client j updates its model fj using the loss function argmin h Hk α 1 i=1 (h(xi j) yi j)2 + (1 α) 1 i=1 (h(xi p) ( yp)i)2 + λ h 2 Hk. 13: end for C.3. Algorithm on KRR with Iterative De-regularized Ensemble Distillation in Federated Learning Setting Applying the de-regularization trick except the last ensemble distillation step, we provide Algorithm 2. C.4. Proof of Theorem 5.3 Similarly as in Appendix C.2, define an affine operator T : Hk Hk by Tg = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDpg). Then the limiting regressor of the local model f1 is T f D1,λ = limt T tf D1,λ after infinitely many ensemble distillation steps with the de-regularization trick. As a result, the final local model is given by (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)Lk,Xp T f D1,λ). We deal with T f D1,λ in this subsection to obtain some motivations. Before we prove Theorem 5.3, we provide the following lemma. Lemma C.6. Assume KXp,Xp > 0. Then, S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp = Lk,Xp + λ0PDp holds where PDp is the orthogonal projection onto a subspace span (k(x, ) : x Dp(x)) of the reproducing kernel Hilbert space Hk. Towards Understanding Ensemble Distillation in Federated Learning Proof of Lemma C.6. Note that (Lk,Xp + λ0I) 1S Dpv = h k( , x1 p) k( , x Np p ) i (KXp,Xp + Npλ0I) 1v and so SDp(Lk,Xp + λ0I) 1S Dpv = KXp,Xp(KXp,Xp + Npλ0I) 1v. We thus obtain (Lk,Xp S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp)h = S Dp(I (SDp(Lk,Xp + λ0I) 1S Dp) 1)SDph Npλ0K 1 Xp,Xp h(x1 p) ... 
h(x Np p ) = λ0 h k( , x1 p) k( , x Np p ) i K 1 Xp,Xp h(x1 p) ... h(x Np p ) On the other hand, from SDp S Dpv = SDp i=1 vik(xi p, ) Np KXp,Xpv, S Dp(SDp S Dp) 1SDph = S Dp Np K 1 Xp,Xp h(x1 p) ... h(x Np p ) = h k( , x1 p) k( , x Np p ) i K 1 Xp,Xp h(x1 p) ... h(x Np p ) Therefore, Lk,Xp S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp = λ0S Dp(SDp S Dp) 1SDp, i.e., S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp = Lk,Xp + λ0S Dp(SDp S Dp) 1SDp. Set P = S Dp(SDp S Dp) 1SDp. Then we observe that (i) P is idempotent since P 2 = S Dp(SDp S Dp) 1SDp S Dp(SDp S Dp) 1SDp = S Dp(SDp S Dp) 1SDp = P. (ii) P is symmetric since P = (S Dp(SDp S Dp) 1SDp) = S Dp((SDp S Dp) 1) SDp = S Dp((SDp S Dp) ) 1SDp = S Dp(SDp S Dp) 1SDp = P. (iii) The range of P is span (k(x, ) : x Dp(x)) since the range of S Dp is contained in span (k(x, ) : x Dp(x)) and for any u( ) span (k(x, ) : x Dp(x)) there exists v RNp such that u = S Dpv which satisfies PS Dpv = S Dp(SDp S Dp) 1SDp S Dpv = S Dpv = u. By Proposition 3.3 of chapter 2 in Conway (2019), P is the orthogonal projection of Hk onto span (k(x, ) : x Dp(x)). From the above lemma, we can applied the Banach fixed point theorem (Banach, 1922) again as before. Towards Understanding Ensemble Distillation in Federated Learning Lemma C.7. Assume (1 α)λ0 < λ and KXp,Xp > 0. Define an operator T : Hk Hk as Tg = (S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp)1/2 (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)(S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp)1/2g). Then T is a η-contraction where η (0, 1). Thus, T tg converges to a unique fixed point g of T as t . Proof of Lemma C.7. By Lemma C.6, Tg = (Lk,Xp + λ0PDp)1/2(αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)(Lk,Xp + λ0PDp)1/2g). Since PDp is a projection, 0 PDp I and so Lk,Xp Lk,Xp + λ0PDp Lk,Xp + λ0I. 0 (1 α)(Lk,Xp + λ0PDp) < αLk,X1 + (1 α)Lk,Xp + λI. (21) Since (1 α)(Lk,Xp + λ0PDp) is of finite rank which implies it is compact, by Lemma D.2, (1 α)(Lk,Xp + λ0PDp)1/2(αLk,X1 + (1 α)Lk,Xp + λI) 1(Lk,Xp + λ0PDp)1/2 < 1. (22) Hence, from the submultiplicativity of the operator norm, we have Tg1 Tg2 Hk = (1 α)(Lk,Xp + λ0PDp)1/2(αLk,X1 + (1 α)Lk,Xp + λI) 1(Lk,Xp + λ0PDp)1/2(g1 g2) Hk (1 α)(Lk,Xp + λ0PDp)1/2(αLk,X1 + (1 α)Lk,Xp + λI) 1(Lk,Xp + λ0PDp)1/2 g1 g2 Hk, i.e., T is a η-contraction where η < 1. Applying the Banach fixed point theorem (Banach, 1922), we are done. Similarly as in Appendix C.2, we prove the following general statement. Obviously, the following theorem implies Theorem 5.3. Theorem C.8. Assume λ0 = λ and KXp,Xp > 0. For any h0 Hk, T th0 converges to a unique fixed point Lk,X1 + λI + 1 α of T as t . Proof of Theorem C.8. Let A = (Lk,Xp + λ0PDp)1/2, B = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)(Lk,Xp + λ0PDp)1/2 ). Then Th = BAh and Th = ABh for any h Hk. Let g be a unique fixed point of T where the existence and the uniqueness of the fixed point of T follow from Lemma C.7. From the definition, B g = B T g = BAB g = TB g holds, i.e., B g is a fixed point of T. Let g be a fixed point of T. From Tg = (αLk,X1 + (1 α)Lk,Xp + λI) 1(αS D1y1 + (1 α)(Lk,Xp + λ0PDp)g ) = g , we obtain (αLk,X1 + λI (1 α)λ0PDp)g = αS D1y1. Set λ0 = λ. Then λI (1 α)λ0PDp = αλI + (1 α)λ(I PDp). By Proposition 3.2 of chapter 2 in Conway (2019), I PDp = P Dp. Therefore, g = Lk,X1 + λI + 1 α In particular, a fixed point of T is unique. Consequently, T th0 = B T t 1Ah0 converges to g as t by Lemma C.7. Towards Understanding Ensemble Distillation in Federated Learning Remark C.9. Without the de-regularization trick, the fixed point of H is Lk,X1 + λ αI 1 S D1y1. 
With the de-regularization trick, the fixed point of T is Lk,X1 + λI + 1 α Thus, we can see that the de-regularization replaces 1 α α λI by 1 α α λP Dp(x). C.5. Proof of Theorem 5.4 Proof of Theorem 5.4. Using the formula A 1 B 1 = A 1(B A)B 1, (23) f D,λ f D1,λ = Lk,X1 + λI + 1 α α λ(I PDp(x)) 1 1 α α λ(I PDp(x)) (Lk,X1 + λI) 1 S D1y1. By Theorem 11 in Aydın & Gheondea (2021), (I PDp(x)) (Lk,X1 + λI) 1 S D1y1 0 almost surely as Np . Therefore, by Lemma D.1, f D,λ f D1,λ Hk Lk,X1 + λI + 1 α α λ(I PDp(x)) 1 1 α α λ (I PDp(x)) (Lk,X1 + λI) 1 S D1y1 Hk (I PDp(x)) (Lk,X1 + λI) 1 S D1y1 Hk 0 almost surely as Np . Remark C.10. To apply the result provided in Aydın & Gheondea (2021), we assume that the density ρx is strictly positive on any non-empty open subset of X in this case. In Section 5.4, we assume Assumption 3.2 with 0 < r 1 2 instead of this condition to control the error. C.6. Proof of Theorem 5.5 In this subsection, we assume λ0 = λ and KXp,Xp > 0. Similarly as before, we define an affine operator T : Hk Hk by 1 m(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDpg) 1 m(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)(Lk,Xp + λPDp(x))g) (24) where the second equality follows from Lemma C.6. Also, define T : Hk Hk as 1 m(Lk,Xp + λPDp(x))1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)(Lk,Xp + λPDp(x))1/2g). Here, PDp(x) is the orthogonal projection onto a subspace span (k(x, ) : x Dp(x)) Towards Understanding Ensemble Distillation in Federated Learning of the reproducing kernel Hilbert space Hk. By the submultiplicativity of the operator norm, the triangle inequality, and (22), we have Tg1 Tg2 Hk 1 i=1 (Lk,Xp + λPDp(x))1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)(Lk,Xp + λPDp(x))1/2 i=1 (Lk,Xp + λPDp(x))1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)(Lk,Xp + λPDp(x))1/2 < 1, i.e., T is a η-contraction where η < 1. Set A = (Lk,Xp + λPDp(x))1/2, B = 1 m(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)(Lk,Xp + λPDp(x))1/2 ). Then Th = BAh and Th = ABh for any h Hk. Let g be a unique fixed point of T where the existence and the uniqueness of a fixed point of T follows from the Banach fixed point theorem (Banach, 1922). Then, for any h0 Hk, T th0 = B T t 1Ah0 converges to B g . The following lemma is regarding the computation of B g . Lemma C.11. With the same notation as given above, i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi + i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U1/2 ! i=1 U1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U1/2 ! 1 U1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi where U = Lk,Xp + λPDp(x) = S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1SDp. Proof. From the definition of g , we have 1 m U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)U 1/2 g ). Note that, from (21), i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 < I which implies that I 1 m Pm i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 is invertible. Hence, i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi Plugging (25) into B g yields 1 m(αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)U1/2 g ) i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi + i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U1/2 ! i=1 U1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U1/2 ! 1 U1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi Towards Understanding Ensemble Distillation in Federated Learning Since Algorithm 2 conducts knowledge distillation on the public dataset Dp(x) using B g in the last step, it suffices to evaluate the performance of B g on Dp(x). To this end, we need two additional lemmas: Lemma C.12 and Lemma C.13. 
Lemma C.12 can be viewed as an extension of the distributed semi-supervised kernel ridge regression (Chang et al., 2017). Lemma C.12. Assume m 2, Assumption 3.1 and 3.2. Let f s D,λ = 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi. SDp(f s D,λ f0) 2 Λ + 1 where Λ and Λi are random variables for i = 1, , m such that Λ Λ(log(6/δ))5/4 with confidence at least 1 δ and each Λi Λi(log(6/δ))3/2 with confidence at least 1 δ. Here, !1/2 2κ(M + γ) 2κ2 f0 Hk p λ1/2+r g0 Hk Λi = α (1 α) In particular, under the assumption that Np (m 1)N, with α = 1/m and λ = (m N) 1 2r+2 , we have Λ = O (m N) 2r+1 4r+4 and Λi = O (m N) 2r+1 4r+4 for i = 1, , m, which implies that E SDp(f s D,λ f0) 2 = O (m N) 2r+1 Proof of Lemma C.12. Using (10) and Lemma D.3, we obtain SDp(f s D,λ f0) 2 = L1/2 k,Xp(f s D,λ f0) Hk (Lk,Xp + λI)1/2(f s D,λ f0) Hk . f s D,λ = (αLk + (1 α)Lk,Xp + λI) 1S Dy = 1 i=1 (αLk + (1 α)Lk,Xp + λI) 1S Diyi where y = [y 1 , , y m] . By the triangle inequality, (Lk,Xp + λI)1/2(f s D,λ f0) Hk (Lk,Xp + λI)1/2( f s D,λ f0) Hk + (Lk,Xp + λI)1/2(f s D,λ f s D,λ) Hk . (26) First, we bound the first term in (26). Note that (Lk,Xp + λI)1/2( f s D,λ f0) = (Lk,Xp + λI)1/2(αLk + (1 α)Lk,Xp + λI) 1(S Dy Lkf0) + (Lk,Xp + λI)1/2((αLk + (1 α)Lk,Xp + λI) 1 (Lk + λI) 1)Lkf0 + (Lk,Xp + λI)1/2(fλ f0). (27) Towards Understanding Ensemble Distillation in Federated Learning Qp = (Lk,Xp + λI) 1/2(Lk + λI)1/2 , Qp = (Lk,Xp + λI)1/2(Lk + λI) 1/2 , P = (Lk + λI) 1/2(S Dy Lk,Xf0) Hk, S = Lk,X Lk , and Sp = Lk,Xp Lk . Note that, by Lemma D.2, αLk + (1 α)Lk,Xp + λI (1 α)(Lk,Xp + λI) (Lk,Xp + λI)1/2(αLk + (1 α)Lk,Xp + λI) 1(Lk,Xp + λI)1/2 1 1 α. (28) Thus, the first term in (27) satisfies (Lk,Xp + λI)1/2(αLk + (1 α)Lk,Xp + λI) 1(S Dy Lkf0) Hk 1 1 αQp Also, the third term in (27) satisfies (Lk,Xp + λI)1/2(fλ f0) Hk Qp (Lk + λI)1/2(fλ f0) Hk and the second term in (27) satisfies (Lk,Xp + λI)1/2((αLk + (1 α)Lk,Xp + λI) 1 (Lk + λI) 1)Lkf0 Hk = (Lk,Xp + λI)1/2(αLk + (1 α)Lk,Xp + λI) 1(1 α)(Lk Lk,Xp)(Lk + λI) 1Lkf0 Hk (Lk,Xp + λI) 1/2 Sp fλ Hk by the submultiplicativity of the operator norm, (23), and (28). Combining them yields (Lk,Xp + λI)1/2( f s D,λ f0) Hk 1 1 αQp λ S + f0 Hk λ Sp + Qp (Lk + λI)1/2(fλ f0) Hk (29) by Lemma D.1 and (19). From (Lk + λI)1/2(fλ f0) Hk = λ(Lk + λI) 1/2f0 Hk = λ(Lk + λI) 1/2Lr kg0 Hk λ1/2+r g0 Hk which follows from Lemma D.1 and the submultiplicativity of the operator norm, the inequality (29) becomes (Lk,Xp + λI)1/2( f s D,λ f0) Hk 1 1 αQp λ S + f0 Hk λ Sp + Qpλ1/2+r g0 Hk. Now, we derive PAC-bounds for Qp, Qp, P, S, and Sp. By Lemma D.1, Lemma D.4, the triangle inequality, and the submultiplicativity of the operator norm, we have Qp (Lk,Xp + λI) 1(Lk + λI) 1/2 = I + (Lk,Xp + λI) 1(Lk Lk,Xp) 1/2 1 + 1 λ Lk Lk,Xp 1/2 (30) Qp (Lk,Xp + λI)(Lk + λI) 1 1/2 = I + (Lk,Xp Lk)(Lk,Xp + λI) 1 1/2 1 + 1 λ Lk Lk,Xp 1/2 . Towards Understanding Ensemble Distillation in Federated Learning By Lemma D.7 and Lemma D.8, with confidence at least 1 δ, m N (log(6/δ))1/2, Lk,Xp Lk 2 Np (log(6/δ))1/2 and P 2κ(M + γ) λm N log(6/δ) where δ (0, 1). Then λ Lk Lk,Xp 1/2 (log(6/δ))1/4. (Lk,Xp + λI)1/2( f s D,λ f0) Hk 1 1 α !1/2 2κ(M + γ) (log(6/δ))5/4 Np (log(6/δ))1/2 + λ1/2+r g0 Hk(log(6/δ))1/4 with confidence at least 1 δ. Define !1/2 2κ(M + γ) 2κ2 f0 Hk p λ1/2+r g0 Hk. Then (Lk,Xp + λI)1/2( f s D,λ f0) Hk Λ(log(6/δ))5/4 with confidence at least 1 δ. We next bound the second term (Lk,Xp + λI)1/2( f s D,λ f s D,λ) Hk in (26). First, we bound the norm of f s,i D,λ = (αLk + (1 α)Lk,Xp + λI) 1S Diyi. 
Observe that f s,i D,λ = (αLk + (1 α)Lk,Xp + λI) 1(S Diyi Lkf0) + (αLk + (1 α)Lk,Xp + λI) 1(Lk + λI)(Lk + λI) 1Lkf0. Set Q p = (αLk + (1 α)Lk,Xp + λI) 1(Lk + λI) . By Lemma D.1 and the submultiplicativity of the operator norm, we have Q p = I + (1 α)(αLk + (1 α)Lk,Xp + λI) 1(Lk Lk,Xp) 1 + 1 α λ Lk Lk,Xp . Again, applying Lemma D.1, the submultiplicativity of the operator norm, the triangle inequality, and (19), we obtain f s,i D,λ Hk 1 λ S Diyi Lkf0 Hk + 1 + 1 α λ Lk Lk,Xp f0 Hk. Now, by the submultiplicativity of the operator norm, the triangle inequality, and (23), (Lk,Xp + λI)1/2( f s D,λ f s D,λ) Hk i=1 (Lk,Xp + λI)1/2((αLk + (1 α)Lk,Xp + λI) 1 (αLk,Xi + (1 α)Lk,Xp + λI) 1)S Diyi Hk i=1 (Lk,Xp + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1α(Lk,Xi Lk) f s,i D,λ Hk. Towards Understanding Ensemble Distillation in Federated Learning Simliarly as in (28), we can easily see that (Lk,Xp + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xp + λI)1/2 1 1 α. (31) Then, applying (31), Lemma D.1, and the submultiplicativity of the operator norm yields i=1 (Lk,Xp + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1α(Lk,Xi Lk) f s,i D,λ Hk λ Lk,Xi Lk Hk f s,i D,λ Hk. To find a PAC-bound of Lk,Xi Lk f s,i D,λ Hk, we use Lemma D.7 and Lemma D.8. Then, with confidence at least 1 δ, Np (log(6/δ))1/2, Lk,Xi Lk 2 N (log(6/δ))1/2, S Diyi Lk,Xif0 Hk 2κ(M + γ) N log(6/δ). Lk,Xi Lk f s,i D,λ Hk N (log(6/δ))3/2 with confidence at least 1 δ. Define Λi = α (1 α) for i = 1, , m. Then α (1 α) λ Lk,Xi Lk Hk f s,i D,λ Hk Λi(log(6/δ))3/2 with confidence at least 1 δ for any i = 1, , m. Now, we set α = 1/m and λ = (m N) 1 2r+2 . We also assume that Np (m 1)N. Then and 2κ(M + γ) (2κ(M + γ) + 2 2κ2 f0 Hk) m1/(2r+2). Thus, we have Λ 2(1 + 4κ2)1/2 2κ(M + γ) + 2 2κ2 f0 Hk (m N)(2r+1)/(4r+4) + 4κ2 f0 Hk (m N)(2r+1)/(4r+4) + (1 + 4 2κ2)1/2 g0 Hk (m N)(2r+1)/(4r+4) = O (m N) 2r+1 2κ2 (2κ(M + γ) + (1 + (4 + 2 2)κ2) f0 Hk) m1/(2r+2) m1/2 1 (m N)(2r+1)/(4r+4) = O (m N) 2r+1 By Lemma D.9, we conclude that E SDp(f s D,λ f0) 2 EΛ + 1 i=1 EΛi = O (m N) 2r+1 Towards Understanding Ensemble Distillation in Federated Learning The second lemma is regarding the convergence rate of f0 PDpf0 Hk. Aydın & Gheondea (2021) prove that f0 PDpf0 Hk 0 as Np under the condition that the density ρx is strictly positive on any non-empty open subset of X. However, they do not provide the convergence rate. The following lemma provides a convergence rate if we assume a regularity condition of f0. Lemma C.13. Assume Assumption 3.2. Then, f0 PDpf0 Hk 2λr g0 Hk with confidence at least 1 4 exp( 1/4(κ2 + 1)B) where B = 1 + log N(λ) 1 + log N(λ) for any λ > 0 such that λ (0, 1) and N(λ) 1. Also, f0 PDpf0 Hk f0 Hk almost surely. In particular, for any fixed ϵ (0, 1), E f0 PDpf0 Hk = O N ( 1+ϵ)r p . Proof of Lemma C.13. Note that (Lk,Xp + λI) 1Lk,Xpf0 = k Dp(x)(NpλI + KXp,Xp) 1SDpf0 span(k(x, ) : x Dp(x)) for any λ > 0. Thus, f0 PDpf0 Hk = min h span(k(x, ):x Dp(x)) f0 h Hk f0 (Lk,Xp + λI) 1Lk,Xpf0 Hk. f0 PDpf0 Hk λ(Lk,Xp + λI) 1Lr kg0 Hk λ (Lk + λI) 1/2 (Lk + λI)1/2(Lk,Xp + λI) 1(Lk + λI)1/2 (Lk + λI) 1/2Lr k g0 Hk by the submultiplicativity of the operator norm. Using Lemma D.1, λ (Lk + λI) 1/2 (Lk + λI)1/2(Lk,Xp + λI) 1(Lk + λI)1/2 (Lk + λI) 1/2Lr k g0 Hk λr (Lk + λI)1/2(Lk,Xp + λI) 1(Lk + λI)1/2 g0 Hk. When we assume λ (0, 1) and N(λ) 1, by Lemma D.7, we have f0 PDpf0 Hk 2λr g0 Hk with confidence at least 1 δ where 4 exp( 1/4(κ2 + 1)B) δ < 1. On the other hand, f0 PDpf0 Hk f0 Hk almost surely since I PDp 1. 
Combining them, we have E f0 PDpf0 Hk 2λr g0 Hk + 4 f0 Hk exp 1 4(κ2 + 1)B Set λ = N 1+ϵ p for a fixed ϵ > 0. Since λ 0 as Np , we may assume N(λ) 1. Then, 1 + log(eκ2Np) Towards Understanding Ensemble Distillation in Federated Learning f(x) = log(eκ2x) 1 + log(eκ2x) is continuous on [1, ) and vanishes at , f(x) C(κ, ϵ) for some C(κ, ϵ) > 0. Thus, B C(κ, ϵ) 1 g(x) = x β exp 1 4(κ2 + 1)x Then g is continuous on (0, C(κ, ϵ)] and vanishes at 0+ for any β > 0. Therefore, we know that g(x) C (κ, ϵ, β) on x (0, C(κ, ϵ)] for some C (κ, ϵ, β) > 0. Hence, we have E f0 PDpf0 Hk 2λr g0 Hk + f0 Hk exp 1 4(κ2 + 1)B 2N ( 1+ϵ)r p g0 Hk + f0 Hk B ϵ C κ, ϵ, 4(1 ϵ)r 2N ( 1+ϵ)r p g0 Hk + f0 Hk N ( 1+ϵ)r p C(κ, ϵ) ϵ C κ, ϵ, 4(1 ϵ)r = O N ( 1+ϵ)r p . We are now ready to derive the main result in this subsection. Theorem C.14. Assume m 2, λ0 = λ, Assumption 3.1, and Assumption 3.2 with 0 < r 1 2. We further assume 3r+2 2r2+2r N 1 2r+2 1/(1 ϵ) , (m 1)N for some fixed 0 < ϵ < 1 2. Let g = T h0 for any h0 Hk where the operator T is defined in (24). Then, with α = 1/m and λ = (m N) 1 2r+2 , E SDp(g f0) 2 = O (m N) 2r+1 Proof of Theorem C.14. When m or N increases, λ decreases and Np increases. Thus, we may assume N(λ) 1 and N(1/ p Np) 1. Set U = S Dp(SDp(Lk,Xp + λI) 1S Dp) 1SDp and V = Lk,Xp + λI. Note that V is invertible and Lk,Xp U = Lk,Xp + λPDp V = Lk,Xp + λI. i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0. By the triangle inequality, SDp(g f0) 2 SDp(g g) 2 + SDp( g f0) 2. First, note that SDp( g f0) 2 = L1/2 k,Xp( g f0) Hk (Lk,Xp + λI)1/2( g f0) Hk Towards Understanding Ensemble Distillation in Federated Learning by (10) and Lemma D.3. By the triangle inequality, (Lk,Xp + λI)1/2( g f0) Hk i=1 V 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xp + λPDp αLk,Xi (1 α)Lk,Xp λI)f0 Hk i=1 V 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1α(Lk,Xp Lk,Xi)f0 Hk i=1 λ V 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(I PDp)f0 Hk. By the submultiplicativity of the operator norm, Lemma D.1, and (31), V 1/2( g f0) Hk 1 λ Lk,Xp Lk,Xi f0 Hk + λ 1 α (I PDp)f0 Hk Set Λ1 i = α (1 α) λ Lk,Xp Lk,Xi f0 Hk. By the triangle inequality, Lk,Xp Lk,Xi Lk,Xp Lk + Lk Lk,Xi . By Lemma D.7, with confidence at least 1 δ, N (log(4/δ))1/2 and Lk,Xp Lk 2 Np (log(4/δ))1/2. Then, with confidence at least 1 δ, Λ1 i Λ1(log(4/δ))1/2 (33) 2ακ2 f0 Hk (1 α) = O (m N) 2r+1 λ 1 α (I PDp)f0 Hk 2(m N) 1 4r+4 O ((m 1)N) 2r 4r+4 = O (m N) 2r+1 by Lemma C.13. On the other hand, by (10) and Lemma D.3, SDp(g g) 2 = L1/2 k,Xp(g g) Hk U 1/2(g g) Hk. We can see that U 1/2 g = 1 i=1 U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 i=1 U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 + i=1 (1 α)U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 + i=1 (1 α)U 1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 U 1/2 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U Towards Understanding Ensemble Distillation in Federated Learning Thus, by the triangle inequality, the submultiplicativity of the operator norm, and Lemma C.11, we have U 1/2(g g) Hk i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U First, we bound the first term. 
By the triangle inequality and Lemma D.3, i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 f0 From (32), we know V 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Uf0 f0 λ 1 α (I PDp)f0 Hk. Also, using the same argument as in the proof of Lemma C.12, it satisfies that V 1/2 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi i=1 Λi (35) where random variables Λ and Λi(i = 1, , m) are given in Lemma C.12. In particular, Λ Λ(log(6/δ))5/4 with confidence at least 1 δ and Λi Λi(log(6/δ))3/2 with confidence at least 1 δ where Λ = O (m N) 2r+1 Λi = O (m N) 2r+1 4r+4 . We next turn to derive an upper bound of the second term. To bound i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 , we use Lemma D.6. From 0 U = Lk,Xp + λPDp V = Lk,Xp + λI < 1 1 α(αLk,Xi + (1 α)Lk,Xp + λI), we have (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1 < V 1 for any i = 1, , m. Hence i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1 < V 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1 ! 1 Towards Understanding Ensemble Distillation in Federated Learning Note that U is of finite rank which means that U 1/2 is of finite rank and so compact, and all of the above operators are positive. Applying Lemma D.6 and the above inequality yields i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! i=1 U 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U 1/2 ! 1 i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 . By the submultiplicativity of the operator norm, i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 . Here, the second inequality follows from the fact that i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 1. which comes from the triangle inequality and (31). Set i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 On the other hand, since A 7 A 1 is operator convex (Bhatia, 2013), i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! 1 1 i=1 (I V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2) 1. m Pm i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 1 is positive, i=1 (I V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2) 1 (I V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2) 1 . Since V 1/2 is invertible, by Lemma D.5 and the submultiplicativity of the operator norm, α (I V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2) 1 = α V 1/2(V 1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1) 1V 1/2 α (V 1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1) 1V 1 = α (I (1 α)V (αLk,Xi + (1 α)Lk,Xp + λI) 1) 1 = α (αLk,Xi + (1 α)Lk,Xp + λI)(αLk,Xi + αλI) 1 (αLk,Xi + (1 α)Lk,Xp + λI)(Lk,Xi + λI) 1 . (αLk,Xi + (1 α)Lk,Xp + λI)(Lk,Xi + λI) 1 = I + (1 α)(Lk,Xp Lk,Xi)(Lk,Xi + λI) 1, Towards Understanding Ensemble Distillation in Federated Learning by the triangle inequality, the submultiplicativity of the operator norm, and Lemma D.1, we get (αLk,Xi + (1 α)Lk,Xp + λI)(Lk,Xi + λI) 1 1 + 1 α λ Lk,Xi Lk + Lk,Xp Lk . By Lemma D.7, N (log(4/δ))1/2 and Lk,Xp Lk 2 Np (log(4/δ))1/2 with confidence at least 1 δ. Therefore, i=1 Θ1,i (36) where Θ1,i is a random variable such that Θ1,i Θ1(log(4/δ))1/2 with confidence at least 1 δ for i = 1, , m. Here, = O m 1 2r+2 . However, this upper bound (36) is not sufficiently small for our analysis. 
Thus, we will find a better upper bound now. First, we bound the norm of A 1 2 = V 1/2 1 m i=1 (αLk + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI) By the submultiplicativity of the operator norm, A 1 2 V 1/2(Lk + λI) 1/2 (Lk + λI)1/2 1 m i=1 Lk,Xi + λI (Lk + λI)1/2 (Lk + λI) 1/2(αLk + (1 α)Lk,Xp + λI)1/2 (αLk + (1 α)Lk,Xp + λI)1/2V 1/2 . By Lemma D.1, Lemma D.4, the triangle inequality, and the submultiplicativity of the operator norm, V 1/2(Lk + λI) 1/2 1 + 1 λ Lk,Xp Lk 1/2 , (αLk + (1 α)Lk,Xp + λI)1/2V 1/2 1 + α λ Lk,Xp Lk 1/2 and (Lk + λI) 1/2(αLk + (1 α)Lk,Xp + λI)1/2 1 + 1 α λ Lk,Xp Lk 1/2 . By Lemma D.1, Lemma D.5, the triangle inequality, and the submultiplicativity of the operator norm, (Lk + λI)1/2 1 m i=1 Lk,Xi + λI (Lk + λI)1/2 i=1 Lk,Xi + λI i=1 Lk,Xi Lk By Lemma D.7, Np (log(4/δ))1/2 and i=1 Lk,Xi Lk m N (log(4/δ))1/2 Towards Understanding Ensemble Distillation in Federated Learning with confidence at least 1 δ. Therefore, with confidence at least 1 δ, (log(4/δ))5/4 = O(1)(log(4/δ))5/4. Next, we bound A2 A1 . Note that i=1 V 1/2(1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V 1/2 ! = V 1/2 1 α i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1V = V 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI) A2 A1 = V 1/2 1 m i=1 (αLk + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI) i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI) i=1 V 1/2(αLk + (1 α)Lk,Xp + λI) 1α(Lk,Xi Lk)(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI)V 1/2 using (23). By the triangle inequality, the submultiplicativity of the operator norm, and the fact that (Lk + λI) 1/2(Lk,Xi + λI)(Lk + λI) 1/2 = (Lk + λI) 1/2(Lk,Xi Lk)(Lk + λI) 1/2 + I, we have V 1/2(αLk + (1 α)Lk,Xp + λI) 1α(Lk,Xi Lk)(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xi + λI)V 1/2 α V 1/2(αLk + (1 α)Lk,Xp + λI) 1/2 (αLk + (1 α)Lk,Xp + λI) 1/2(Lk + λI)1/2 R (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk + λI)1/2 (R + 1) (Lk + λI)1/2V 1/2 R = (Lk + λI) 1/2(Lk,Xi Lk)(Lk + λI) 1/2 . Since (1 α)V αLk + (1 α)Lk,Xp + λI, V 1/2(αLk + (1 α)Lk,Xp + λI) 1/2 (1 α) 1/2 by Lemma D.2. By Lemma D.1, Lemma D.4, the triangle inequality, and the submultiplicativity of the operator norm, (αLk + (1 α)Lk,Xp + λI) 1/2(Lk + λI)1/2 1 + 1 α λ Lk,Xp Lk 1/2 (Lk + λI)1/2V 1/2 1 + 1 λ Lk,Xp Lk 1/2 . By Lemma D.1, Lemma D.5, the triangle inequality, and the submultiplicativity of the operator norm, (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk + λI)1/2 1 + 1 λ α Lk,Xi Lk + (1 α) Lk,Xp Lk . Towards Understanding Ensemble Distillation in Federated Learning To find a PAC-bound, applying Lemma D.7 yields Np (log(12/δ))1/2, Lk,Xi Lk 2 N (log(12/δ))1/2, R 2(κ2 + 1) 1 + log N(λ) 1 + log N(λ) with confidence at least 1 δ for any δ (0, 1). Then, we apply Lemma D.13 to see that A2 A1 is the mean of random variables that are bounded by 1 + log N(λ) 1 + log N(λ) 1 + 2(κ2 + 1) 1 + log N(λ) 1 + log N(λ) (log(a0/δ))q0 for some a0 1 and q0 > 0 with confidence at least 1 δ. Note that 2(κ2 + 1) = O(1). On the other hand, 1 + log N(λ) 1 + log N(λ) log (eκ2)(m N) 1 2r+2 (m N) r 4r+4 + v u u tlog (eκ2)(m N) 1 2r+2 (m N) r 4r+4 = O(1) f(x) = log(eκ2x1/(2r+2)) is continuous on [1, ) and vanishes at which implies that log (eκ2)(m N) 1 2r+2 (m N) r 4r+4 = O(1). 1 + log N(λ) 1 + log N(λ) 1 + 2(κ2 + 1) 1 + log N(λ) 1 + log N(λ) = O m r 2r+2 and so A2 A1 is the mean of random variables that are bounded by O m r 2r+2 (log(a0/δ))q0 for some a0 1 and q0 > 0 with confidence at least 1 δ respectively. 
Thus, applying the formula j=0 A 1 2 (A2 A1)A 1 2 j + A 1 1 (A2 A1)A 1 2 1 (36), and Lemma D.13 yield j=0 Ξj (37) Towards Understanding Ensemble Distillation in Federated Learning where Ξj is the mean of random variables that are bounded by O(1)(log(a/δ))q for some a 1 and q > 0. The last part of the proof is devoted to bounding i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1U = U 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi 1 i=1 (1 α)(αLk,Xi + (1 α)Lk,Xp + λI) 1(Lk,Xp + λPDp) i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1(S Diyi Lk,Xif0 λf0) 1 α i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1λ(I PDp)f0. By Lemma D.3 and the triangle inequality, the first term is bounded as follows: i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1(S Diyi Lk,Xif0 λf0) i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi f0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Lk,Xif0 f0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1 ! By Lemma C.12, V 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1S Diyi f0 Using the exact same way as in the proof of Lemma C.12 (since it is just the noisefree case), V 1/2 1 m i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1Lk,Xif0 f0 ! Hk Λ0 + 1 i=1 Λ0 i (38) where Λ0 and Λ0 i are random variables such that Λ0 O (m N) 2r+1 4r+4 (log(6/δ))5/4 with confidence at least 1 δ and each Λ0 i O (m N) 2r+1 4r+4 (log(6/δ))3/2 with confidence at least 1 δ. Also, by the triangle inequality, the submultiplicativity of the operator norm, (31), and Lemma D.1, i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1 ! i=1 (Lk,Xp + λI) 1/2Lr kg0 Hk = λ 1 α (Lk,Xp + λI) 1/2Lr kg0 Hk λ 1 αQp (Lk + λI) 1/2Lr kg0 Hk 1 1 α λr+ 1 where Qp is already defined in the proof of Lemma C.12. By (30), i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1 ! Hk 1 1 α 1 + 1 λ Lk Lk,Xp 1/2 λr+ 1 Towards Understanding Ensemble Distillation in Federated Learning By Lemma D.7, i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1 ! 2 g0 Hk(log(4/δ))1/4 with confidence at least 1 δ where 2 g0 Hk = O (m N) 2r+1 Lastly, by the triangle inequality, the submultiplicativity of the operator norm, Lemma D.1, Lemma D.3, and (31), 1 α i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1λ(I PDp)f0 i=1 (αLk,Xi + (1 α)Lk,Xp + λI) 1λ(I PDp)f0 λ α (I PDp)f0 Hk. SDp(g f0) 2 1 λ 1 α (I PDp)f0 Hk + α λ 1 α (I PDp)f0 Hk + Λ + 1 i=1 Λi + Λ0 + 1 i=1 Λ0 i + 1 1 α λr+1/2Qp g0 Hk + λ α (I PDp)f0 Hk By Lemma D.9, i=1 Λ1 i + α i=1 Λ1 i + Λ + 1 = O (m N) 2r+1 4r+4 . (41) From (34), we have λ 1 α (I PDp)f0 Hk = O (m N) 2r+1 By Lemma D.13, (37), (38), and (39), i=1 Λi + Λ0 + 1 i=1 Λ0 i + 1 1 α λr+1/2Qp g0 Hk = O (m N) 2r+1 4r+4 . (42) Since Lk κ2 and Lk,X κ2 for any X, by Lemma D.5, Lemma D.1, the triangle inequality, and the submultiplicativity of the operator norm, we obtain (Lk,Xi + λI) 1(αLk,Xi + (1 α)Lk,Xp + λI) λ + κ2 λ α (I PDp)f0 Hk almost surely. By Lemma C.13 and Lemma D.12, λ α (I PDp)f0 Hk f0 Hk 4 exp 1 4(κ2 + 1)B Towards Understanding Ensemble Distillation in Federated Learning B = 1 + log N(λp) 1 + log N(λp) and λp = 1/N 1 ϵ p which satisfies N(λp) N(1/ p Np) 1. Note that 1 + log N(λp) is continuous on [1, ) and vanishes at , g(x) C(κ, r, ϵ) for some C(κ, r, ϵ) > 0. Then B C(κ, r, ϵ)N ϵ/4 p . Since 0 < B C(κ, r, ϵ) and xβ exp 1 4(κ2 + 1)x is continuous on (0, C(κ, r, ϵ)] and vanishes at 0+, 1 Bβ exp 1 4(κ2 + 1)B C (κ, r, ϵ, β) for some C (κ, r, ϵ, β) > 0. Hence, exp 1 4(κ2 + 1)B B6/ϵ 1 B6/ϵ exp 1 4(κ2 + 1)B (1 + κ2) 1 α λ N 3/2 p C(κ, r, ϵ)6/ϵ C (κ, r, 6/ϵ) = O (m N) 2r+1 4r+4 . (44) On the other hand, from N r(1 ϵ) p (m N) r 2r+2 m, λ α N r( 1+ϵ) p = O (m N) 2r+1 This completes the proof of Theorem 5.5. 
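For intuition, the de-regularization map used throughout this subsection has a simple matrix form: by the computation in the proof of Lemma C.6, S_Dp(L_{k,Xp} + λ0 I)^{-1} S*_Dp acts on R^{Np} as the KRR smoother K_{Xp,Xp}(K_{Xp,Xp} + Np λ0 I)^{-1}, so de-regularizing a consensus vector amounts to applying the inverse of this matrix. The following minimal numerical sketch (an illustration with a placeholder kernel and consensus vector, not part of the paper's code) applies the map and checks that a subsequent λ0-regularized KRR fit on Dp(x), evaluated back on Dp(x), reproduces the original consensus, which mirrors the replacement of λI by λ times the projection complement observed in Remark C.9.

```python
import numpy as np

# Minimal sketch of the de-regularization map of Algorithm 2 in matrix form
# (illustration only, with placeholder data).  By the proof of Lemma C.6,
# S_Dp (L_{k,Xp} + lam0*I)^{-1} S*_Dp acts on R^{Np} as K (K + Np*lam0*I)^{-1},
# so de-regularizing a consensus vector applies the inverse of this matrix
# (K = K_{Xp,Xp} is assumed invertible).

def min_kernel(x, y):                       # k1(x, x') = 1 + min(x, x'), as in Appendix E.1
    return 1.0 + np.minimum(x[:, None], y[None, :])

rng = np.random.default_rng(0)
Np, lam0 = 50, 1e-2
Xp = rng.uniform(0.0, 1.0, Np)              # unlabeled public inputs (placeholder)
y_bar = np.sin(2.0 * np.pi * Xp)            # a consensus vector on Dp(x) (placeholder)

K = min_kernel(Xp, Xp)
smoother = K @ np.linalg.inv(K + Np * lam0 * np.eye(Np))

y_dereg = np.linalg.solve(smoother, y_bar)  # de-regularized consensus

# A lam0-regularized KRR fit to (Xp, y_dereg), evaluated back on Xp, returns the
# original consensus: the regularization bias on span{k(., x) : x in Dp(x)} is removed.
print(np.allclose(smoother @ y_dereg, y_bar))   # True up to numerical error
```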
Theorem C.15. Let g = T h0 for any h0 Hk and g i be a local kernel ridge regressor of client i trained by the local dataset Di and the public dataset Dp(x) with the predictions of g , i.e., g i = (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)Lk,Xpg ). Under the same assumption of Theorem C.14, with α = 1/m and λ = (m N) 1 2r+2 , E ιρx(g i f0) L2ρx = O (m N) 2r+1 Towards Understanding Ensemble Distillation in Federated Learning Proof of Theorem C.15. We use the same notation as in the proof of Theorem C.14. Since g is not independent of Dp(x), we cannot directly apply Corollary B.2. By Corollary B.1, we have E ιρx(g i f0) L2ρx 9 α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk + (1 α)1/2E h (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 L1/2 k,Xp(g f0) Hk i . α N1 + 1 α N2 ! 2ακ(M + γ) λN1 + λ 1 2 +r g0 Hk = O (m N) 2r+1 By Lemma D.1, Lemma D.4, the triangle inequality, and the submultiplicativity of the operator norm, (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 1 + α λ Lk,Xi Lk + 1 α λ Lk,Xp Lk 1/2 . By Lemma D.7, with confidence at least 1 δ λ Lk,Xi Lk + 1 α λ Lk,Xp Lk 1/2 (log(4/δ))1/4 where δ (0, 1) and hence (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 O(1)(log(4/δ))1/4 O(1) log(4/δ). (45) Recall L1/2 k,Xp(g f0) Hk = SDp(g f0) 2 from (10) and its bound in (40). By Lemma D.9, Lemma D.13, (33), and (35), (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 i=1 Λ1 i + α i=1 Λ1 i + Λ + 1 = O (m N) 2r+1 By Lemma D.13, (37), (38), and (39), (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 A 1 1 i=1 Λi + Λ0 + 1 i=1 Λ0 i + 1 1 αλ 1 2 +r Qp g0 Hk = O (m N) 2r+1 Note that there is a trivial bound (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 λ + κ2 By (45), Lemma C.13, and Lemma D.12, (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 (1 + α) λ 1 α (I PDp)f0 Hk λ 1 α 1 + κ2 1/2 f0 Hk 4 exp 1 4(κ2 + 1)B λ N r( 1+ϵ) p . By (37), (43), Lemma C.13, and Lemma D.13, (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 A 1 1 λ α (I PDp)f0 Hk 3/2 f0 Hk 4 exp 1 4(κ2 + 1)B λ α N r( 1+ϵ) p Towards Understanding Ensemble Distillation in Federated Learning Using the similar argument as in the derivation of the bound (44), (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 (1 + α) λ 1 α (I PDp)f0 Hk = O (m N) 2r+1 (Lk + λI)1/2(αLk,Xi + (1 α)Lk,Xp + λI) 1/2 A 1 1 λ α (I PDp)f0 Hk = O (m N) 2r+1 From (40), E ιρx(g i f0) L2ρx = O (m N) 2r+1 C.7. Algorithm on KRR with Iterative De-regularized Ensemble Distillation Participated by Partial Clients in Federated Learning Setting We provide an algorithm (Algorithm 3) on kernel ridge regression with iterative de-regularized ensemble distillation participated by partial clients in federated learning. At each communication round, we select a fixed number of clients that predict the unlabeled public dataset and train using their local datasets and the public dataset with the updated consensus obtained by stochastic approximation. We denote the number of selected clients in each communication round as C. C.8. Proof of Corollary 5.6 Here, we assume that D and Dp(x) are given. Define a sequence of independent random operators {T p t0}t0 N where random operator T p t0 : Hk Hk is defined as T p t0g = X 1 C (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λI) 1S Dp) 1SDpg) 1 C (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)(Lk,Xp + λP Dp(x))g) for each t0 N where {Ct0}t0 N is defined in Algorithm 3. We also define T p t0 : RNp RNp as T p t0v = X 1 C (SDp(Lk,Xp + λI) 1S Dp) 1/2SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1 (αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1/2v) for each t0 N. We prove Corollary 5.6 in a general setting. 
Theorem C.16. Assume λ0 = λ, KXp,Xp is invertible, and the number of selected clients in each communication round C is fixed, i.e., C = Pm i=1 1{i Ct0} for any t0 = 1, , t. We also assume that {γt0}t t0=1 [0, 1]t is fixed, t0=1 γt0 = , t0=1 γ2 t0 < , (46) {Ct0} t0=1 is independent, and P(j Ct0) = pj for any j {1, , m} and t0 = 1, . Then the prediction yp on Dp(x) after infinitely many iterations in Algorithm 3 converges to SDpˆg almost surely where ˆg = ˆT h0 for any h0 Hk and pi Pm j=1 pj (αLk,Xi + (1 α)Lk,Xp + λI) 1(αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λI) 1S Dp) 1SDpg). Towards Understanding Ensemble Distillation in Federated Learning Algorithm 3 KRR with iterative De-regularized Ensemble Distillation Participated by Partial Clients in FL 1: Input: hyperparameters α (0, 1), λ > 0, λ0 0, t N, C {1, , m} and {γt0}t t0=1 [0, 1]t such that γ1 = 1 2: Output: Trained model fj, j = 1, , m 3: Pretrain: For j = 1, , m, client j trains its model fj using the loss function argmin h Hk i=1 (h(xi j) yi j)2 + λ h 2 Hk. 4: Each client downloads the unlabeled public dataset Dp(x). 5: for t0 = 1, , t do 6: Determine a set of clients Ct0 whose size is C to participate the ensemble at time t0. 7: For j Ct0, client j predicts on Dp(x) and uploads the prediction yj p to server. 8: The server computes an updated consensus yp = (1 γt0) yp,old + γt0 1 j Ct0 yj p. 9: The server stores yp,old = yp. 10: if t0 = t then 11: The server applies the de-regularization trick to yp: yp = (SDp(Lk,Xp + λ0I) 1S Dp) 1 yp. 12: end if 13: Each client in Ct0 downloads the ensemble prediction yp. 14: For j Ct0, client j updates its model fj using the loss function argmin h Hk α 1 i=1 (h(xi j) yi j)2 + (1 α) 1 i=1 (h(xi p) ( yp)i)2 + λ h 2 Hk. 15: end for Proof. Define ˆT : RNp RNp as pi Pm j=1 pj (SDp(Lk,Xp + λI) 1S Dp) 1/2SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1 (αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λI) 1S Dp) 1/2v). j=1 1{j Ct0} C = E j=1 1{j Ct0} j=1 P(j Ct0) = by taking the expectation. Define an operator Si : RNp RNp by Siv = (SDp(Lk,Xp + λI) 1S Dp) 1/2SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1 (αS Diyi + (1 α)S Dp(SDp(Lk,Xp + λ0I) 1S Dp) 1/2v). E T p t0v = E i=1 1{i Ct0} 1 C Siv i=1 E h 1{i Ct0}Siv i = 1 Pm j=1 pj i=1 pi Siv = ˆTv. Towards Understanding Ensemble Distillation in Federated Learning Observe that (SDp(Lk,Xp + λI) 1S Dp) 1/2 = ((KXp,Xp + NpλI) 1KXp,Xp) 1/2 = (I + NpλK 1 Xp,Xp)1/2. (47) We also claim that SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)S Dp KXp,Xp KXp,Xi α λI 1 KXi,Xp KXp,Xp + Np 1 αλI KXp,Xi α λI 1 KXi,Xp I + Np 1 αλ KXp,Xp KXp,Xi α λI 1 KXi,Xp To show this, note that (αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)S Dp = k Xi k Xp (λI + DKXi Xp,Xi Xp) 1D 0 = k Xi k Xp (λD 1 + KXi Xp,Xi Xp) 1 0 where D = diag(α/N, , α/N | {z } N , (1 α)/Np, , (1 α)/Np | {z } Np ) which follows from (12). Then SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)S Dp = KXp,Xi KXp,Xp KXi,Xi + N1 α λI KXi,Xp KXp,Xi KXp,Xp + N2 1 αλI From the formula A B C D 1 = A 1 + A 1B(D CA 1B) 1CA 1 A 1B(D CA 1B) 1 (D CA 1B) 1CA 1 (D CA 1B) 1 SDp(αLk,Xi + (1 α)Lk,Xp + λI) 1(1 α)S Dp = KXp,Xi KXp,Xp (KXi,Xi + N1 α λI) 1KXi,Xp(KXp,Xp + N2 1 αλI KXp,Xi(KXi,Xi + N1 α λI) 1KXi,Xp) 1 (KXp,Xp + N2 1 αλI KXp,Xi(KXi,Xi + N1 α λI) 1KXi,Xp) 1 which gives the formula (48). Since I + Np 1 αλ KXp,Xp KXp,Xi α λI 1 KXi,Xp > I + NpλK 1 Xp,Xp, by (47), (48), Lemma D.2, and the triangle inequality, ˆT is a η-contraction where η (0, 1). Set ut0 as (SDp(Lk,Xp + λI) 1S Dp) 1/2( yp)t0 where ( yp)t0 is the updated consensus before applying the de-regularization at the t0-th iteration in Algorithm 3. 
Then ut0+1 = (1 γt0+1)ut0 + γt0+1 T p t0+1ut0 = (1 γt0+1)ut0 + γt0+1 ˆTut0 + T p t0+1ut0 ˆTut0 . E h T p t0+1ut0 ˆTut0 | Ft0 i = E h T p t0+1ut0 ˆTut0 | ut0 i = 0 (49) where Ft0 is an σ-algebra generated from u1, , ut0, T p 1 , , T p t0, γ1, , γt0 and γt0+1 since T p t0+1 is independent of Ft0. Since the number of possible realizations of T p t0+1 is finite, there exists B > 0 such that T p t0+1 ˆT B Towards Understanding Ensemble Distillation in Federated Learning almost surely. Therefore, E T p t0+1ut0 ˆTut0 2 = E T p t0+1ut0 ˆTut0 2 E T p t0+1 ˆT 2 ut0 2 Hk | ut0 E h B2 ut0 2 Hk | ut0 i = B2 ut0 2 Hk . (50) From (46), (49), and (50), by Proposition 4.4 in Bertsekas & Tsitsiklis (1996), (SDp(Lk,Xp + λI) 1S Dp) 1/2 yp converges to a unique fixed point of ˆT. Therefore, (SDp(Lk,Xp + λI) 1S Dp) 1/2 yp = ˆT = (SDp(Lk,Xp + λI) 1S Dp) 1/2SDp ˆT = (SDp(Lk,Xp + λI) 1S Dp) 1/2SDpˆg . Note that ˆT = ˆg can be shown using a similar argument as in the first part in Appendix C.6. D. Auxiliary Lemmas We provide the useful properties of operators. Lemma D.1. Let A be a bounded linear operator on a seperable Hilbert space H. If A is compact, self-adjoint, and positive, then (A + λI) r As 1/λr s for any r s 0 and λ > 0. Proof. Since A is compact, self-adjoint, and positive, we can set eigenvalues {µi} i=1 of A and their corresponding eigenfunctions {φi} i=1 satisfying µi 0 and {φi} i=1 is an orthonormal basis of H by the spectral theorem (Conway, 2019). For any φ = P i=1 aiφi H such that φ 2 = P i=1 a2 i = 1, we have (A + λI) r Asφ 2 = aiµs i (µi + λ)r φi aiµs i (µi + λ)r 2s 1 (µi + λ)2(r s) i=1 a2 i = 1 λ2(r s) which implies (A + λI) r As = sup φ =1 (A + λI) r Asφ 1 λr s . Lemma D.2. Let A and B be bounded positive self-adjoint linear operators on a seperable Hilbert space H. We assume B is invertible. Then A B implies A1/2B 1/2 2 = A1/2B 1A1/2 1. If we further assume A is compact, then A < B implies A1/2B 1/2 2 = A1/2B 1A1/2 < 1. Proof. We first assume A B. Since 0 B 1/2AB 1/2 I, we have B 1/2AB 1/2 1 by the definition of positive operator. Then, we obtain 1 B 1/2AB 1/2 = (B 1/2A1/2)(B 1/2A1/2) = (B 1/2A1/2) (B 1/2A1/2) = A1/2B 1A1/2 . We now assume A is compact. By Proposition 4.2(c) of chapter 2 in Conway (2019), B 1/2AB 1/2 is compact. Using 0 B 1/2AB 1/2 < I and the fact that B 1/2AB 1/2 is compact and self-adjoint yields B 1/2AB 1/2 < 1 since Towards Understanding Ensemble Distillation in Federated Learning if B 1/2AB 1/2 = 1 then B 1/2AB 1/2 has an eigenvalue 1 by Lemma 5.9 of chapter 2 in Conway (2019) which contradicts B 1/2AB 1/2 < I. Therefore, 1 > B 1/2AB 1/2 = A1/2B 1A1/2 . The following lemma is similar to Proposition 5 in Rudi et al. (2015). Lemma D.3. Let H be a seperable Hilbert space and X : H H and Y : H H be two bounded linear operators. If Y Y X X is positive, then Xf H Y f H for any f H. Proof. Let f H. Then Y f 2 H = Y f, Y f H = Y Y f, f H X Xf, f H = Xf, Xf H = Xf 2 H since (Y Y X X)f, f H 0. Recall Cordes inequality. Refer to Fujii et al. (1993) for its proof. Lemma D.4. Let A, B be two bounded positive linear operators on a seperable Hilbert space. Then for any s [0, 1]. The following lemma is fundamental but useful. Lemma D.5. Let A and B be two bounded positive self-adjoint linear operators on a seperable Hilbert space. Then A1/2B 1A1/2 B 1A when B is invertible. Proof. Since A and B are self-adjoint, A1/2B 1A1/2 = B 1/2A1/2 2. By Lemma D.4, B 1/2A1/2 2 B 1A . Lemma D.6. 
Let A, B and C be bounded positive self-adjoint linear operators on a seperable Hilbert space H such that 0 A B < C. We further assume A1/2 is compact. Then A1/2C 1A1/2(I A1/2C 1A1/2) 1 B1/2C 1B1/2(I B1/2C 1B1/2) 1 . Proof. Note that for any operator V such that V < 1, σp(V V ) \ {0} = σp(V V ) \ {0} where σp denotes the point spectrum of a given operator. Then σp((I V V ) 1 I) \ {0} = σp((I V V ) 1 I) \ {0}. (51) Since A1/2 is compact, A1/2C 1A1/2(I A1/2C 1A1/2) 1 is also compact by Proposition 4.2(c) of chapter 2 in Conway (2019). From the fact that A1/2C 1A1/2(I A1/2C 1A1/2) 1 is compact, self-adjoint, and positive, we have A1/2C 1A1/2(I A1/2C 1A1/2) 1 = σmax(A1/2C 1A1/2(I A1/2C 1A1/2) 1) by Lemma 5.9 of chapter 2 in Conway (2019) where σmax denotes the largest eigenvalue of a given (self-adjoint positive) operator. Using (51), σmax(A1/2C 1A1/2(I A1/2C 1A1/2) 1) = σmax((I A1/2C 1A1/2) 1 I) = σmax((I C 1/2AC 1/2) 1 I) = σmax(C 1/2AC 1/2(I C 1/2AC 1/2) 1). Towards Understanding Ensemble Distillation in Federated Learning C 1/2AC 1/2 C 1/2BC 1/2 (I C 1/2AC 1/2) (I C 1/2BC 1/2) (I C 1/2AC 1/2) 1 (I C 1/2BC 1/2) 1 (I C 1/2AC 1/2) 1 I (I C 1/2BC 1/2) 1 I. C 1/2AC 1/2 0 (I C 1/2AC 1/2) I (I C 1/2AC 1/2) 1 I (I C 1/2AC 1/2) 1 I 0. Therefore, 0 < σmax((I C 1/2AC 1/2) 1 I) σmax((I C 1/2BC 1/2) 1 I). Using the fact that σmax((I C 1/2BC 1/2) 1 I) = σmax((I B1/2C 1B1/2) 1 I) = σmax(B1/2C 1B1/2(I B1/2C 1B1/2) 1) B1/2C 1B1/2(I B1/2C 1B1/2) 1 completes the proof of Lemma D.6. By using concentration inequalities (Rudi et al., 2015; Chatalic et al., 2022), we can derive the following useful lemmas (Caponnetto & De Vito, 2007; Yao et al., 2007; Rudi et al., 2015; Lin et al., 2020a). Lemma D.7. Let X = {x1, , x N0} whose data points are independently drawn from ρx. Then, with our notation, Lk,X Lk Lk,X Lk HS 2 2κ2 N0 (log(2/δ))1/2 with confidence at least 1 δ where δ (0, 1) and HS is the Hilbert-Schmidt norm; (b) if 0 < λ 1 and N(λ) 1, (Lk + λI) 1/2(Lk Lk,X)(Lk + λI) 1/2 2(κ2 + 1) 1 + log N(λ) 1 + log N(λ) with confidence at least 1 δ where δ (0, 1); (c) if 0 < λ 1 and N(λ) 1, (Lk + λI)1/2(Lk,X + λI) 1/2 = (Lk + λI)1/2(Lk,X + λI) 1(Lk + λI)1/2 1/2 with confidence at least 1 δ where 1 4(κ2 + 1) 1 + log N(λ) 1 + log N(λ) Lemma D.8. Let D = {(x1, y1), , (x N0, y N0)} whose data points are independently drawn from ρx,y. Then, with our notation, S Dy Lk,D(x)f0 Hk 2κM N0 log(2/δ) + N0 log(2/δ) 2κ(M + γ) N0 log(2/δ) with confidence at least 1 δ where δ (0, 1). Towards Understanding Ensemble Distillation in Federated Learning The following lemmas are useful to compute the expectation using a PAC bound. Lemma D.9. Let A be a non-negative random variable such that A B(log(a/δ))q with confidence at least 1 δ for any δ (0, 1) where B > 0 is a constant, a 1, and q > 0. Then EA Γ(q + 1)a B. Proof. Note that P (A > t) a exp for any t > 0. Thus, 0 P (A > t) dt Z 0 a Bqsq 1 exp( s) ds = Γ(q + 1)a B. Lemma D.10. Let A be a non-negative random variable such that (a) A B(log(a/δ))q with confidence at least 1 δ for any δ (0, 1); (b) A B(log(a/δ))q with confidence at least 1 δ for any δ (δ0, 1) where B > 0 and B > 0 are constants, a 1, q N, and δ0 (0, 1). Then EA Γ(q + 1)a B + a Bqδ0 exp Proof. Note that P (A > t) a exp for any t > 0 and P (A > t) a exp for any 0 < t < B(log(a/δ0))q. Set t0 = B(log(a/δ0))q. Then, 0 P (A > t) dt Z t0 Γ(q + 1)a B + Z Towards Understanding Ensemble Distillation in Federated Learning by the proof of Lemma D.9. 
Also, dt = a Bq Z ( B/B)1/q log(a/δ0) uq 1 exp( u) du ( B/B)1/q log(a/δ0) = a Bqδ0 exp Remark D.11. If δ0 1, then we only have A B(log(a/δ))q with confidence at least 1 δ for any δ (0, 1) in Lemma D.10. Then, EA Γ(q + 1)a B Γ(q + 1)a Bδ0 by Lemma D.9. Combining this result and Lemma D.10, we have EA Γ(q + 1)a B + a Bqδ0 even if we only assume δ0 > 0 instead of δ0 (0, 1) in Lemma D.10. Sometimes Lemma D.10 is complicated, so we provide another lemma. Lemma D.12. Let A be a non-negative random variable such that A B(log(a/δ))q with confidence at least 1 δ for any δ (δ0, 1) and A B almost surely where B > 0 and B > 0 are constants, a 1 and q > 0. Then EA Γ(q + 1)a B + Bδ0. Proof. Since P (A > t) a exp for any 0 < t < B(log(a/δ0))q, EA = P(A B(log(a/δ0))q)E[A|A B(log(a/δ0))q] + P(A > B(log(a/δ0))q)E[A|A > B(log(a/δ0))q] 0 P(A B(log(a/δ0))q)P(A > t|A B(log(a/δ0))q) dt + δ0B Z B(log(a/δ0))q 0 P(A > t) dt + δ0B Z B(log(a/δ0))q = Γ(q + 1)a B + Bδ0. Note that if δ0 1 then EA Bδ0 is trivial. Thus, it also holds for δ0 1. Lastly, we introduce a lemma to deal with multiplications of random variables whose PAC bounds are given. Towards Understanding Ensemble Distillation in Federated Learning Lemma D.13. If A1 O(a(m, N))(log(a1/δ))q1 with confidence at least 1 δ and A2 O(b(m, N))(log(a2/δ))q2 with confidence at least 1 δ where δ (0, 1), a1, a2 1 and q1, q2 > 0, then A1 A2 O(a(m, N)b(m, N))(log(a3/δ))q3 with confidence at least 1 2δ for some a3 1 and q3 > 0. Proof. It directly follows from the fact that A1 A2 O(a(m, N))O(b(m, N))(log(2a1/δ))q1(log(2a2/δ))q2 O(a(m, N)b(m, N))(log max(2a1, 2a2)/δ)q1+q2 with confidence at least 1 2δ where δ ( 1 E. Experiment Details E.1. Dataset Description Details The generating procedure of the synthetic datasets is as follows: (i) the inputs are independently drawn from the uniform distribution on [0, 1]d with d = 1 for Dataset 1 and Dataset 2 and with d = 3 for Dataset 3; (ii) the corresponding outputs are generated from y = gi(x) + ϵ for Dataset i (i = 1, 2, 3) where ϵ is the independent noise that follows the normal distribution with mean 0 and variance 0.442 and gi are given by g1(x) = min(x, 1 x) for Dataset 1, for Dataset 2, and g3(x) = (1 x 2)6(35 x 2 2 + 18 x 2 + 3)1{ x 1} for Dataset 3. We use the kernel k1(x, x ) = 1 + min(x, x ) for Dataset 1 and Dataset 2 and k2(x, x ) = (1 x x 2)4(4 x x 2 + 1)1{ x x 2 1} for Dataset 3. Note that we do not give any noise for test datasets. We know that g1 Hk1 and g3 Hk2 (Lin et al., 2020a). Since g2(x) = x L2 ρx, 0 g2(x)k1(x, t) dρx(x) = Z t 0 (1 + x) x dx + Z 1 t (1 + t) x dx = g2(t) holds. From this fact, we can easily see that g2 Hk1 and g2 = L1/2 k1 ˆg2 for some ˆg2 Hk1. The generating procedure of MNIST is described in (Cui et al., 2021). We use the RBF kernel k3(x, x ) = exp 1 2 104 x x 2 E.2. Simulation Details We conduct experiments with the local datasets of sizes N = 10 and N = 20. We also set the number of clients m {10, 20, 30, 40, 50, 100} for N = 10 (except MNIST; for MNIST we set m {10, 20, 30, 40, 50} when N = 10) and m {10, 20, 30, 40, 50} for N = 20. Assume there is an unlabeled public dataset of size (m 1)N. For the iterative ensemble distillation algorithm, set the total communication round t = 200 for convergence. We use the fixed distillation hyperparameter α = 1/m but conduct the hyperparemeter tuning for λ > 0 using the grid search. In the test phase, we use a test dataset of size 1000 whose data points are generated from the procedure explained in Appendix E.1. 
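As a concrete reference for this setup, the following is a minimal sketch (an illustration, not the authors' code) of a single standalone local KRR model on Dataset 1, using the matrix form f_{D,λ} = k_{D(x)}(NλI + K_{D(x),D(x)})^{-1}y derived in Appendix A; the value of λ below is illustrative, whereas the experiments tune it by grid search.

```python
import numpy as np

# Minimal sketch of the Dataset 1 setup and one standalone local KRR model
# (illustration only; lam is grid-searched in the actual experiments, and the
# ensemble distillation steps are omitted here).
rng = np.random.default_rng(0)

def k1(x, y):                              # k1(x, x') = 1 + min(x, x')
    return 1.0 + np.minimum(x[:, None], y[None, :])

def g1(x):                                 # regression target of Dataset 1
    return np.minimum(x, 1.0 - x)

N, lam, noise_std = 10, 1e-2, 0.44
x_local = rng.uniform(0.0, 1.0, N)                       # local inputs ~ U[0, 1]
y_local = g1(x_local) + rng.normal(0.0, noise_std, N)    # noisy local labels

# Closed form f_{D,lam}(x) = k_D(x)^T (N*lam*I + K_{D(x),D(x)})^{-1} y.
coef = np.linalg.solve(N * lam * np.eye(N) + k1(x_local, x_local), y_local)

x_test = rng.uniform(0.0, 1.0, 1000)                     # noiseless test set of size 1000
test_mse = np.mean((k1(x_test, x_local) @ coef - g1(x_test)) ** 2)
print(f"standalone local model MSE: {test_mse:.4f}")
```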
We compute the averaged MSE (Mean Squared Error) over the local models to evaluate the performance. We simulate at least 100 times for each case and then average the results.

[Figure 3: four panels, (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, and (d) MNIST, each with N = 20; x-axis: the number of clients, y-axis: averaged MSE; curves: central training, one-shot ED, IED w/o deregularization, IED w/ deregularization.]

Figure 3. Comparison between the performance of the one-shot ensemble distillation algorithm (one-shot ED), the iterative ensemble distillation algorithm without the de-regularization (IED w/o deregularization), the iterative ensemble distillation algorithm with the de-regularization (IED w/ deregularization), and the central training. We set N = 20 and conduct the experiments with various m.

E.3. Additional Experimental Results

Performance Comparison with N = 20. We visualize the comparison results on the performance of the one-shot ensemble distillation algorithm, the iterative ensemble distillation algorithm without the de-regularization, the iterative ensemble distillation algorithm with the de-regularization trick, and the central training with N = 20 and m ∈ {10, 20, 30, 40, 50} in Figure 3. The results show a pattern similar to the cases with N = 10 summarized in Figure 2. The one-shot ensemble distillation algorithm performs much better than the standalone models, but it is worse on Dataset 3 and MNIST than the iterative ensemble distillation algorithms and the central training. The iterative ensemble distillation algorithm without the de-regularization is slightly worse than the one with the de-regularization and the central training, but the performance gap is not significant for N = 20. In all settings, the iterative ensemble distillation algorithm attains performance similar to the central training in the expected risk sense.

Performance Comparison with FedMD (Li & Wang, 2019). We compare our proposed algorithm with FedMD (Li & Wang, 2019) on Dataset 3 and MNIST. FedMD is a representative KD based FL algorithm using neural networks. In the experiments, we consider the unlabeled public dataset version of FedMD (which is used as a baseline in Zhang et al. (2021)). We use a 3-hidden-layer fully connected neural network and LeNet5 (LeCun et al., 1998) for FedMD. The results are summarized in Figure 4. We first note that the training strategy for the neural networks (such as model architecture and hyperparameters) is good enough, because the central training performance of the neural networks is better than that of KRR. However, on both Dataset 3 and MNIST, FedMD does not achieve the central training performance, and the performance gap is significant. On the other hand, the iterative ensemble distillation algorithm with the de-regularization achieves almost the same performance as the central training with KRR. Moreover, FedMD performs worse than the iterative ensemble distillation algorithm with the de-regularization.
Effect of Public Dataset Size. We provide additional experimental results on the effect of the unlabeled public dataset size $N_p$. Figure 5 visualizes the effect of $N_p$ on the performance of the one-shot ensemble distillation algorithm and the iterative ensemble distillation algorithm (with the de-regularization). We conduct the experiments with $N = 10$ and various $N_p \in \{0.2(m-1)N, 0.5(m-1)N, (m-1)N, 1.5(m-1)N\}$ on the three synthetic datasets. On Dataset 2, $N_p = 0.2(m-1)N$ already seems sufficient to achieve the performance of the central training for both the one-shot and the iterative ensemble distillation. We also observe that the one-shot ensemble distillation does not benefit from additional public data points. For the iterative ensemble distillation, a too small public dataset (e.g., $N_p = 0.2(m-1)N$) results in worse performance, but enlarging the public dataset has little effect once it is sufficiently large (e.g., $N_p \ge (m-1)N$).

Effect of Client Selection Strategy. We also provide experimental results on the effect of the client selection strategy. We conduct experiments to analyze the effect of the number of clients selected at each communication round and of the stochastic approximation weights, and we measure the performance of Algorithm 3 with a client selection strategy and compare it with Algorithm 2. Figure 6 visualizes the effect of the number of clients selected at each communication round (denoted by $C$ in Algorithm 3) and of the stochastic approximation weights (denoted by $\{\gamma_{t_0}\}_{t_0=1}^{\infty}$ in Algorithm 3). We set $N = 10$ and $m = 50$ in all experiments. The experiment details are as follows. We generate Dataset 3 using the procedure explained in Appendix E.1. Then, we find the convergent consensus $y_p^*$ by running Algorithm 2 with sufficiently many iterations. We run Algorithm 3 with various $C$ and $\{\gamma_{t_0}\}_{t_0=1}^{\infty}$ and measure the squared distance between $y_p^*$ and the consensus $y_p$ obtained from Algorithm 3. As illustrated in Figure 6(a), if we use only one client at each communication round (e.g., an asynchronous setting), the convergence is quite slow in our setting. However, selecting only 20% of the clients at each communication round is enough to achieve almost the same convergence speed as selecting all clients in Algorithm 3. To examine the effect of $\{\gamma_{t_0}\}_{t_0=1}^{\infty}$, we set $\gamma_{t_0} = 1/t_0^q$ with $q \ge 0$. To satisfy condition (46), $q$ should be larger than 0.5. We run Algorithm 3 with various $q \in \{0, 0.3, 0.501, 0.7\}$. Figure 6(b) shows that a smaller $q$ (slower decay) yields a larger decrease of the squared distance in the early stage of training. However, it also shows that the consensus prediction $y_p$ does not converge to $y_p^*$ if $q \le 0.5$. In particular, the squared distance between $y_p^*$ and $y_p$ is large when $q = 0$ (i.e., when the previous consensus $y_{p,\mathrm{old}}$ is not memorized). This means that using only the new consensus is inappropriate when a client selection strategy is employed. When $q = 0.3$, the squared distance between $y_p^*$ and $y_p$ does not go to zero, but it is quite small. On the other hand, a large $q$ guarantees the convergence of $y_p$ to $y_p^*$, but the convergence speed is slow.
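To illustrate the type of update discussed above, the following is a small NumPy sketch of a stochastic approximation consensus update with client selection. It is our own schematic illustration rather than a transcription of Algorithm 3; the variable names are ours, and the interaction with local retraining at each round is omitted.

```python
import numpy as np

def update_consensus(y_p_old, local_preds, t0, C, q=0.501, rng=None):
    """One schematic consensus update with client selection.

    y_p_old     : previous consensus predictions on the public inputs
    local_preds : list of per-client prediction vectors on the public inputs
    t0          : current communication round (1-indexed)
    C           : number of clients selected in this round
    q           : decay exponent of the weights gamma_{t0} = 1 / t0**q
    """
    rng = rng or np.random.default_rng()
    gamma = 1.0 / t0 ** q
    selected = rng.choice(len(local_preds), size=C, replace=False)
    new_estimate = np.mean([local_preds[i] for i in selected], axis=0)
    # Keep a memory of the previous consensus; gamma -> 0 drives convergence.
    return (1.0 - gamma) * y_p_old + gamma * new_estimate
```

In this sketch, $q = 0$ keeps no memory of the previous consensus, matching the $q = 0$ behavior discussed above, while $q > 0.5$ corresponds to the step size requirement stated above.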
[Figure 5: six panels, (a) one-shot ED on Dataset 1, (b) one-shot ED on Dataset 2, (c) one-shot ED on Dataset 3, (d) iterative ED on Dataset 1, (e) iterative ED on Dataset 2, (f) iterative ED on Dataset 3, each plotting the averaged MSE against the number of clients for p in {0.2, 0.5, 1.0, 1.5} and the central training.]

Figure 5. The effect of the public dataset size for the one-shot ensemble distillation algorithm and the iterative ensemble distillation algorithm. We conduct the experiments with $N = 10$ and $N_p \in \{0.2(m-1)N, 0.5(m-1)N, (m-1)N, 1.5(m-1)N\}$. We set $p = N_p/((m-1)N)$, e.g., $p = 1.0$ indicates $N_p = (m-1)N$.

Lastly, we measure the performance of Algorithm 3. Figure 7 visualizes the performance of Algorithm 3 with different numbers of iterations $t \in \{200, 1000, 2000, 5000, 10000, 20000\}$ on Dataset 1, Dataset 2, and Dataset 3. As shown in Figure 7, Algorithm 3 achieves the same performance as Algorithm 2 (with 200 iterations) after 5000 iterations on all three datasets.

[Figure 6: three panels, (a) effect of C, (b) effect of the stochastic approximation weights up to 20000 iterations, (c) effect of the stochastic approximation weights up to 50000 iterations, each plotting the squared distance from $y_p^*$ against the number of iterations.]

Figure 6. The effect of the number of clients selected at each communication round ($C$) and the stochastic approximation weights ($\{\gamma_{t_0}\}_{t_0=1}^{\infty}$) in Algorithm 3. We set $\gamma_{t_0} = 1/t_0^q$ where $q \in \{0, 0.3, 0.501, 0.7\}$. We set $N = 10$ and $m = 50$ and use Dataset 3 in all experiments. The y-axis indicates the squared distance between $y_p$ in Algorithm 3 and the convergent consensus $y_p^*$. We set $q = 0.501$ in (a) and $C = 10$ in (b) and (c).

[Figure 7: three panels, (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, each plotting the averaged MSE against the number of iterations, with and without client selection.]

Figure 7. Performance of Algorithm 3 with different $t \in \{200, 1000, 2000, 5000, 10000, 20000\}$ on the three datasets. The dashed line indicates the performance of Algorithm 2 with 200 iterations. We set $N = 10$, $m = 50$, $q = 0.501$ and $C = 10$ in all experiments, where $\gamma_{t_0} = 1/t_0^q$.