Neural Tangent Kernel Empowered Federated Learning

Kai Yue 1, Richeng Jin 1, Ryan Pilgrim 2, Chau-Wai Wong 1, Dror Baron 1, Huaiyu Dai 1

1 NC State University. 2 Independent Scholar. Correspondence to: Chau-Wai Wong.

Abstract

Federated learning (FL) is a privacy-preserving paradigm in which multiple participants jointly solve a machine learning problem without sharing raw data. Unlike traditional distributed learning, a unique characteristic of FL is statistical heterogeneity: data distributions across participants differ from each other. Meanwhile, recent advances in the interpretation of neural networks have seen a wide use of neural tangent kernels (NTKs) for convergence analyses. In this paper, we propose a novel FL paradigm empowered by the NTK framework. The paradigm addresses the challenge of statistical heterogeneity by transmitting update data that are more expressive than those of conventional FL paradigms. Specifically, sample-wise Jacobian matrices, rather than model weights/gradients, are uploaded by participants. The server then constructs an empirical kernel matrix to update a global model without explicitly performing gradient descent. We further develop a variant with improved communication efficiency and enhanced privacy. Numerical results show that the proposed paradigm can achieve the same accuracy as federated averaging while reducing the number of communication rounds by an order of magnitude.

1. Introduction

Federated learning (FL) has emerged as a popular paradigm in which a large number of clients collaboratively solve a machine learning problem (Kairouz et al., 2021). In a typical FL framework, a server broadcasts a global model to selected clients and collects model updates without accessing the raw data. One popular algorithm is federated averaging (FedAvg) (McMahan et al., 2017), in which clients perform stochastic gradient descent (SGD) to
Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

update the local models and then upload the weight vectors to the server. A new global model is constructed on the server by averaging the received weight vectors. Li et al. (2020) characterized some unique challenges of FL. First, client data are generated locally and remain decentralized, which implies that they may not be independent and identically distributed (IID). Prior work has shown that statistical heterogeneity can negatively affect the convergence of FedAvg (Zhao et al., 2018). This phenomenon may be explained by noting that local updating under data heterogeneity causes cost-function inconsistency (Wang et al., 2020). A more challenging issue is system heterogeneity, including the diversity of client hardware, battery power, and network connectivity. Local updating schemes often exacerbate the straggler issue caused by heterogeneous system characteristics (Li et al., 2020). Recent studies have proposed various strategies to alleviate statistical heterogeneity. One possible solution is to share a globally available dataset with participants to reduce the distance between client-data distributions and the population distribution (Zhao et al., 2018). In practice, though, such a dataset may be unavailable or too small to meaningfully compensate for the heterogeneity. Some researchers replaced the coordinate-wise weight averaging strategy in FedAvg with nonlinear aggregation schemes (Wang et al., 2020; Chen & Chao, 2021). Nonlinear aggregation relies on a separate optimization routine, which can be elusive, especially when the federated model does not perform well. Another direction is to modify the local objectives or local update schemes to cancel the effects of client drift (Li et al., 2020; Karimireddy et al., 2020).
However, some studies reported that these methods are not consistently effective when evaluated in various settings (Reddi et al., 2021; Haddadpour et al., 2021; Chen & Chao, 2021). In this work, we present a neural tangent kernel empowered federated learning (NTK-FL) paradigm. NTK-FL outperforms state-of-the-art methods by achieving the target accuracy with fewer communication rounds. We summarize our contributions as follows.

- We propose a novel FL paradigm that does not require clients to perform local gradient descent. To the best of our knowledge, this is the first work using the NTK method to replace gradient descent to diversify the design of FL algorithms.
- Our scheme inherently solves the non-IID data problem of FL. Compared to FedAvg, it is robust to different degrees of data heterogeneity and has a consistently fast convergence speed. We verify the effectiveness of the paradigm theoretically and experimentally.
- We add communication-efficient and privacy-preserving features to the paradigm and develop CP-NTK-FL by combining strategies such as random projection and data subsampling. We show that some strategies can also be applied to traditional FL methods. Although such strategies cause performance degradation when applied to FedAvg, they only slightly worsen the model accuracy when applied to the proposed CP-NTK-FL.

2. Related Work

Neural Tangent Kernel. Jacot et al. (2018) showed that training an infinitely wide neural network with gradient descent in the parameter space is equivalent to kernel regression in the function space. Lee et al. (2019) used a first-order Taylor expansion to approximate the neural network output and derived the training dynamics in closed form. Chen et al. (2020) established generalization bounds for a two-layer over-parameterized neural network with the NTK framework.
The NTK computation has been extended to convolutional neural networks (CNNs) (Arora et al., 2019), recurrent neural networks (RNNs) (Alemohammad et al., 2021), and even to neural networks with arbitrary architectures (Yang & Littwin, 2021). Empirical studies have also provided a good understanding of wide neural network training (Lee et al., 2020).

Federated Learning. FL aims to train a model with distributed clients without transmitting local data (McMahan et al., 2017; Kairouz et al., 2021). FedAvg has been proposed as a generic solution with theoretical analyses and implementation variants. Recent studies have shown a growing interest in improving its communication efficiency, privacy guarantees, and robustness to heterogeneity. To reduce communication cost, gradient quantization and sparsification were incorporated into FedAvg (Reisizadeh et al., 2020; Sattler et al., 2019). From the security perspective, Zhu et al. (2019) showed that sharing gradients may cause privacy leakage under the model inversion attack. To address this challenge, differentially private federated optimization and decentralized aggregation methods were developed (Girgis et al., 2021; Cheng et al., 2021). Other works focused on the statistical heterogeneity issue and designed methods such as adding regularization terms to the objective function (Li et al., 2020; Smith et al., 2017) or employing personalized models (Liang et al., 2019). In this work, we focus on a novel FL paradigm where the global model is derived based on the NTK evolution. We show that the proposed NTK-FL is inherently robust to statistical heterogeneity, and we extend it to a variant with improved communication efficiency and enhanced privacy.

Kernel Methods in Federated Learning. The NTK framework has mostly been used for convergence analyses in FL. Seo et al.
(2020) studied two knowledge distillation methods in FL and compared their convergence properties based on the neural network function evolution in the NTK regime. Li et al. (2021) incorporated batch normalization layers into local models and provided theoretical justification for their faster convergence by studying the minimum nonnegative eigenvalue of the tangent kernel matrix. To facilitate the understanding of the conventional FL process, Huang et al. (2021) directly used the NTK framework to analyze the convergence rate and generalization bound of two-layer ReLU neural networks trained with FedAvg. Likewise, Su et al. (2021) studied the convergence behavior of a set of FL algorithms in the kernel regression setting. In comparison, our work does not focus on pure convergence analyses of existing algorithms. We propose a novel FL framework by replacing gradient descent with the NTK evolution.

3. Background and Preliminaries

Symbol conventions are as follows. We use $[N]$ to denote the set of integers $\{1, 2, \ldots, N\}$. Lowercase nonitalic boldface, nonitalic boldface capital, and italic boldface capital letters denote column vectors, matrices, and tensors, respectively. For example, for column vectors $\mathbf{a}_j \in \mathbb{R}^M$, $j \in [N]$, $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_N]$ is an $M \times N$ matrix. A third-order tensor $\boldsymbol{A} \in \mathbb{R}^{K \times M \times N}$ can be viewed as a concatenation of such matrices. We use a slice to denote a matrix in a third-order tensor obtained by varying two indices (Kolda & Bader, 2009). Take tensor $\boldsymbol{A}$, for instance: $\boldsymbol{A}_{i::}$ is its $i$th horizontal slice, and $\boldsymbol{A}_{:j:}$ is its $j$th lateral slice (Kolda & Bader, 2009). The indicator function of an event is denoted by $\mathbb{1}(\cdot)$.

3.1. Federated Learning Model

Consider an FL architecture where a server trains a global model by indirectly using datasets distributed among $M$ workers. The local dataset of the $m$th worker is denoted by $\mathcal{D}_m = \{(\mathbf{x}_{m,i}, \mathbf{y}_{m,i})\}_{i=1}^{N_m}$, where $(\mathbf{x}_{m,i}, \mathbf{y}_{m,i})$ is an input-output pair.
The local objective can be formulated as an empirical risk minimization over $N_m$ training examples:

$$F_m(\mathbf{w}) = \frac{1}{N_m} \sum_{i=1}^{N_m} R(\mathbf{w}; \mathbf{x}_{m,i}, \mathbf{y}_{m,i}), \tag{1}$$

where $R$ is a sample-wise risk function quantifying the error of the model with weight vector $\mathbf{w} \in \mathbb{R}^d$ in estimating the label $\mathbf{y}_{m,i}$ for an input $\mathbf{x}_{m,i}$. The global objective function is denoted by $F(\mathbf{w})$, and the optimization problem may be formulated as:

$$\min_{\mathbf{w} \in \mathbb{R}^d} F(\mathbf{w}) = \frac{1}{M} \sum_{m=1}^{M} F_m(\mathbf{w}). \tag{2}$$

3.2. Linearized Neural Network Model

Let $(\mathbf{x}_i, \mathbf{y}_i)$ denote a training pair, with $\mathbf{x}_i \in \mathbb{R}^{d_1}$ and $\mathbf{y}_i \in \mathbb{R}^{d_2}$, where $d_1$ is the input dimension and $d_2$ is the output dimension. $\mathbf{X} \triangleq [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ represents the input matrix and $\mathbf{Y} \triangleq [\mathbf{y}_1, \ldots, \mathbf{y}_N]$ represents the label matrix. Consider a neural network function $f: \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ parameterized by a vector $\mathbf{w} \in \mathbb{R}^d$, which is the vectorization of all weights of the multilayer network. Given an input $\mathbf{x}_i$, the network outputs a prediction $\hat{\mathbf{y}}_i = f(\mathbf{w}; \mathbf{x}_i)$. Let $\ell(\hat{\mathbf{y}}_i, \mathbf{y}_i)$ be the loss function measuring the dissimilarity between the predicted result $\hat{\mathbf{y}}_i$ and the true label $\mathbf{y}_i$. We are interested in finding an optimal weight vector $\mathbf{w}^\star$ that minimizes the empirical loss over $N$ training examples:

$$\mathbf{w}^\star = \operatorname*{argmin}_{\mathbf{w}} L(\mathbf{w}; \mathbf{X}, \mathbf{Y}) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell(\hat{\mathbf{y}}_i, \mathbf{y}_i). \tag{3}$$

One common optimization method is gradient descent. Given the learning rate $\eta$, gradient descent updates the weights at each time step as $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla_{\mathbf{w}} L$. To simplify the notation, let $f^{(t)}(\mathbf{x})$ be the output at time step $t$ with an input $\mathbf{x}$, i.e., $f^{(t)}(\mathbf{x}) \triangleq f(\mathbf{w}^{(t)}; \mathbf{x})$. Following Lee et al. (2019), we use the first-order Taylor expansion around the initial weight vector $\mathbf{w}^{(0)}$ to approximate the neural network output given an input $\mathbf{x}$, i.e.,

$$f^{(t)}(\mathbf{x}) \approx f^{(0)}(\mathbf{x}) + \mathbf{J}^{(0)}(\mathbf{x})\,\big(\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\big), \tag{4}$$

where $\mathbf{J}^{(0)}(\mathbf{x}) = [\nabla f_1^{(0)}(\mathbf{x}), \ldots, \nabla f_{d_2}^{(0)}(\mathbf{x})]^\top$, with $\nabla f_j^{(t)}(\mathbf{x}) \triangleq [\partial \hat{y}_j^{(t)}/\partial w_1^{(t)}, \ldots, \partial \hat{y}_j^{(t)}/\partial w_d^{(t)}]^\top$ being the gradient of the $j$th component of the neural network output with respect to $\mathbf{w}^{(t)}$.
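The linearization in Eq. (4) can be sanity-checked numerically. The sketch below, assuming a small hypothetical two-layer tanh network (all shapes and the helper `net` are illustrative, not the paper's model), compares the exact output after a small weight perturbation against the first-order prediction built from a finite-difference Jacobian:

```python
import numpy as np

rng = np.random.default_rng(6)
d1, n, d2 = 4, 16, 2

def net(w, x):
    """Tiny two-layer tanh network; w packs both weight matrices."""
    V = w[: d1 * n].reshape(n, d1)
    C = w[d1 * n :].reshape(d2, n)
    return C @ np.tanh(V @ x)

d = d1 * n + d2 * n
w0 = rng.standard_normal(d) * 0.3
x = rng.standard_normal(d1)

# Finite-difference Jacobian at w0, one row per output component (Eq. (4)).
eps = 1e-6
J = np.zeros((d2, d))
for i in range(d):
    e = np.zeros(d); e[i] = eps
    J[:, i] = (net(w0 + e, x) - net(w0 - e, x)) / (2 * eps)

# A small perturbation: the linearized output should closely track the
# exact output, with only a second-order error remaining.
dw = 1e-4 * rng.standard_normal(d)
lin = net(w0, x) + J @ dw
assert np.allclose(net(w0 + dw, x), lin, atol=1e-4)
```

For wide networks trained with small learning rates, Lee et al. (2019) show that this approximation stays accurate over entire training trajectories, which is what makes the closed-form evolution below usable.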
Consider the halved mean-squared error (MSE) loss $\ell$, namely, $\ell(\mathbf{a}, \mathbf{b}) = \frac{1}{d_2} \sum_{j=1}^{d_2} \frac{1}{2}(a_j - b_j)^2$. Based on the continuous-time limit, one can show that the dynamics of the gradient flow are governed by the following differential equation:

$$\frac{\mathrm{d} f^{(t)}(\mathbf{X})}{\mathrm{d} t} = -\frac{\eta}{N} \mathbf{H}^{(0)} \left( f^{(t)}(\mathbf{X}) - \mathbf{Y} \right), \tag{5}$$

where $f^{(t)}(\mathbf{X}) \in \mathbb{R}^{N \times d_2}$ is a matrix of concatenated outputs for all training examples, and $\mathbf{H}^{(0)}$ is the neural tangent kernel at time step 0, with each entry $(\mathbf{H}^{(0)})_{ij}$ equal to the scaled Frobenius inner product of the Jacobian matrices:

$$(\mathbf{H}^{(0)})_{ij} = \frac{1}{d_2} \left\langle \mathbf{J}^{(0)}(\mathbf{x}_i), \mathbf{J}^{(0)}(\mathbf{x}_j) \right\rangle_{\mathrm{F}}. \tag{6}$$

The differential equation (5) has the closed-form solution:

$$f^{(t)}(\mathbf{X}) = \left( \mathbf{I} - e^{-\frac{\eta t}{N} \mathbf{H}^{(0)}} \right) \mathbf{Y} + e^{-\frac{\eta t}{N} \mathbf{H}^{(0)}} f^{(0)}(\mathbf{X}). \tag{7}$$

The neural network state $f^{(t)}(\mathbf{X})$ can thus be directly obtained from (7) without running the gradient descent algorithm.

4. Proposed FL Paradigm via the NTK Framework

In this section, we present the NTK-FL paradigm (Figure 1) and then extend it to the variant CP-NTK-FL (Figure 2) with improved communication efficiency and enhanced privacy. The detailed algorithm descriptions are presented as follows.

4.1. NTK-FL Paradigm

NTK-FL follows four steps to update the global model in each round. First, the server will select a subset $\mathcal{C}_k$ of clients and broadcast to them a model weight vector $\mathbf{w}^{(k)}$ from the $k$th round. Here, the superscript $k$ is the communication round index, which should be distinguished from the gradient descent time step $t$ in Section 3.2. Second, each client will use its local training data $\mathcal{D}_m$ to compute a Jacobian tensor $\boldsymbol{J}^{(k)}_m \in \mathbb{R}^{N_m \times d_2 \times d}$, $m \in \mathcal{C}_k$, which is a concatenation of $N_m$ sample-wise Jacobian matrices $(\boldsymbol{J}^{(k)}_m)_{i::} = [\nabla f_1^{(k)}(\mathbf{x}_{m,i}), \ldots, \nabla f_{d_2}^{(k)}(\mathbf{x}_{m,i})]^\top$, $i \in [N_m]$. The client will then upload the Jacobian tensor $\boldsymbol{J}^{(k)}_m$, labels $\mathbf{Y}_m$, and initial condition $f^{(k)}(\mathbf{X}_m)$ to the server. The transmitted information corresponds to the variables in the solution for the state evolution in (7).
Third, the server will construct a global Jacobian tensor $\boldsymbol{J}^{(k)} \in \mathbb{R}^{N_k \times d_2 \times d}$ based on the received $\boldsymbol{J}^{(k)}_m$'s, with each client contributing $N_m$ horizontal slices to $\boldsymbol{J}^{(k)}$. Here, we use $N_k$ to denote the number of training examples in communication round $k$ when there is no ambiguity. We use a toy example to explain the process. Suppose the server selects client 1 and client 2 in a certain round. Clients 1 and 2 will compute the Jacobian tensors $\boldsymbol{J}^{(k)}_1$ and $\boldsymbol{J}^{(k)}_2$, respectively. The global Jacobian tensor is constructed as:

$$\boldsymbol{J}^{(k)}_{i::} = \begin{cases} \boldsymbol{J}^{(k)}_{1,i::}, & \text{if } i \leq N_1, \\ \boldsymbol{J}^{(k)}_{2,j::},\; j = i - N_1, & \text{if } i \geq N_1 + 1. \end{cases} \tag{8}$$

After obtaining the global Jacobian tensor $\boldsymbol{J}^{(k)}$, the $(i,j)$th entry of the global kernel $\mathbf{H}^{(k)}$ is calculated as the scaled Frobenius inner product of two horizontal slices of $\boldsymbol{J}^{(k)}$, i.e., $(\mathbf{H}^{(k)})_{ij} = \frac{1}{d_2} \langle \boldsymbol{J}^{(k)}_{i::}, \boldsymbol{J}^{(k)}_{j::} \rangle_{\mathrm{F}}$. Fourth, the server will perform the NTK evolution to obtain the updated neural network function $f^{(k+1)}$ and weight vector $\mathbf{w}^{(k+1)}$.

Figure 1: Schematic of NTK-FL. Each client first receives the weight $\mathbf{w}^{(k)}$, and then uploads the Jacobian tensor $\boldsymbol{J}^{(k)}_m$, label $\mathbf{Y}_m$, and initial condition $f^{(k)}(\mathbf{X}_m)$. The server builds a global kernel $\mathbf{H}^{(k)}$ and performs the weight evolution with $\{t_1, \ldots, t_\Psi\}$. We use (11a) to find the best $t_j$ and update the weight accordingly.

With a slight abuse of notation, let $f^{(k,t)}$ denote the neural network output at gradient descent step $t$ in communication round $k$.
The neural network function evolution dynamics and weight evolution dynamics are given by:

$$f^{(k,t)} = \left( \mathbf{I} - e^{-\frac{\eta t}{N_k} \mathbf{H}^{(k)}} \right) \mathbf{Y}^{(k)} + e^{-\frac{\eta t}{N_k} \mathbf{H}^{(k)}} f^{(k)}, \tag{9a}$$

$$\mathbf{w}^{(k,t)} = \sum_{j=1}^{d_2} \big(\boldsymbol{J}^{(k)}_{:j:}\big)^\top \mathbf{R}^{(k,t)}_{:j} + \mathbf{w}^{(k)}, \tag{9b}$$

where $\boldsymbol{J}^{(k)}_{:j:}$ is the $j$th lateral slice of $\boldsymbol{J}^{(k)}$, and $\mathbf{R}^{(k,t)}_{:j}$ is the $j$th column of the residual matrix $\mathbf{R}^{(k,t)}$ defined as follows:

$$\mathbf{R}^{(k,t)} \triangleq \frac{\eta}{N_k d_2} \sum_{u=0}^{t-1} \left( \mathbf{Y}^{(k)} - f^{(k,u)}(\mathbf{X}^{(k)}) \right). \tag{10}$$

Here, $\mathbf{X}^{(k)}$ and $\mathbf{Y}^{(k)}$ denote concatenations of client training examples and labels, respectively. The weight evolution in (9b) is derived by unfolding the gradient descent steps. To update the global weight, the server performs the evolution with various integer steps $\{t_1, \ldots, t_\Psi\}$ and selects the one with the smallest loss value. Our goal is to minimize the training loss with a small generalization gap (Nakkiran et al., 2020). The updated weight is decided by the following procedure:

$$t^{(k)} = \operatorname*{argmin}_{t_j} L\big(f(\mathbf{w}^{(k,t_j)}; \mathbf{X}^{(k)}); \mathbf{Y}^{(k)}\big), \tag{11a}$$
$$\mathbf{w}^{(k+1)} \leftarrow \mathbf{w}^{(k, t^{(k)})}. \tag{11b}$$

Alternatively, if the server has a validation dataset available, the optimal number of update steps can be selected based on the model's validation performance. In practice, such a validation dataset can be obtained from held-out clients (Wang et al., 2021). Based on the closed-form solution in (9b), the search for $t^{(k)}$ over the grid $\{t_1, \ldots, t_\Psi\}$ can be completed in $O(\Psi)$ time.

Robustness Against Statistical Heterogeneity. In essence, statistical heterogeneity comes from decentralized data with heterogeneous distributions owned by individual clients. If privacy were not an issue, the non-IID challenge could be readily resolved by mixing all clients' datasets and training a centralized model. In NTK-FL, instead of building a centralized dataset, we use Jacobian matrices to construct a global kernel $\mathbf{H}^{(k)}$, which is a concise representation of the gathered data points from all selected clients. This representation is nevertheless more expressive/less compact than that of a traditional FL algorithm.
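The four server-side steps above can be sketched end-to-end for a toy model that is exactly linear in its weights, so the linearization of Eq. (4) holds with equality, $d_2 = 1$, and each client's Jacobian upload is simply its input matrix. All names, shapes, and the candidate step grid below are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                                   # number of model weights
X1, y1 = rng.standard_normal((4, d)), rng.standard_normal(4)   # client 1
X2, y2 = rng.standard_normal((8, d)), rng.standard_normal(8)   # client 2

# For f(w; x) = w.x the sample-wise Jacobian is x itself; the server stacks
# the uploads as in Eq. (8) and builds the kernel via Frobenius products.
J = np.concatenate([X1, X2], axis=0)    # global Jacobian tensor, N_k x d
y = np.concatenate([y1, y2])
Nk = J.shape[0]
H = np.einsum('ia,ja->ij', J, J)        # (H)_ij = <J_i, J_j>_F with d2 = 1

# Weight evolution, Eqs. (9)-(11): advance the discrete residual recursion
# and keep the candidate step count with the smallest training loss.
eta, w0 = 0.05, np.zeros(d)
A = np.eye(Nk) - (eta / Nk) * H
grid = [10, 100, 1000]                  # candidate steps {t_1, ..., t_Psi}
r, cum, t_prev = -y, np.zeros(Nk), 0    # residual f - y at w0 = 0
best_loss, best_w = np.inf, w0
for t in grid:
    for _ in range(t - t_prev):
        cum += -r                       # accumulate (y - f^{(k,u)}), Eq. (10)
        r = A @ r
    t_prev = t
    w_t = w0 + (eta / Nk) * J.T @ cum   # Eq. (9b) specialized to d2 = 1
    loss = 0.5 * np.mean((J @ w_t - y) ** 2)
    if loss < best_loss:
        best_loss, best_w = loss, w_t

assert np.allclose(H, H.T)                               # kernel is symmetric
assert best_loss <= 0.5 * np.mean(y ** 2)                # loss decreased
assert np.isfinite(best_w).all()
```

Because each candidate weight comes from the same running sums, sweeping the whole grid costs little more than evolving to the largest candidate once, matching the $O(\Psi)$ search claim.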
More precisely, the update sent by NTK-FL for the $i$th training example of the $m$th client is $\mathbf{J}_m = [\nabla f_1(\mathbf{x}_{m,i}), \ldots, \nabla f_{d_2}(\mathbf{x}_{m,i})]^\top$, whereas the gradient update sent by FedAvg is $\nabla L(\mathbf{w}; \mathbf{x}_{m,i}, \mathbf{y}_{m,i}) = \frac{1}{d_2} \sum_{j=1}^{d_2} (\hat{y}_{m,i,j} - y_{m,i,j}) \nabla f_j(\mathbf{x}_{m,i})$, a weighted sum of the row vectors of $\mathbf{J}_m$. The gradient is further averaged over multiple training examples. By sending Jacobian matrices $\mathbf{J}_m$ and jointly processing them on the server, NTK-FL delays the more aggressive data aggregation step until after the communication stage and therefore better approximates the centralized learning setting than FedAvg does.

Comparison of NTK-FL and Huang et al. (2021). Huang et al. (2021) analyzed FedAvg by letting clients use local updates and upload gradients to train a two-layer neural network. In contrast, NTK-FL lets each client transmit Jacobian matrices without performing local SGD steps. The model weight is updated via the NTK evolution in (9b). The main differences are: (1) clients transmit more expressive Jacobian matrices to improve model performance in the non-IID FL setting; (2) more computation is shifted to the server.

4.2. CP-NTK-FL Variant

Compared to FedAvg, NTK-FL does not incur additional client computational overhead, since calculating the Jacobian tensor enjoys the same computational efficiency as computing aggregated gradients. Without locally updating weight vectors, NTK-FL is faster than FedAvg on the client side. In this section, we focus on communication efficiency and security in terms of data confidentiality and membership privacy.
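The claim that a per-example gradient is a weighted sum of the Jacobian's rows can be verified numerically. The sketch below uses a small hypothetical two-layer tanh network (the helper `net` and all shapes are illustrative) and finite differences to compute both the sample-wise Jacobian and the gradient of the halved MSE loss:

```python
import numpy as np

rng = np.random.default_rng(3)
d1, n, d2 = 4, 8, 3

def net(w, x):
    """Tiny two-layer tanh network; w packs both weight matrices."""
    V = w[: d1 * n].reshape(n, d1)
    C = w[d1 * n :].reshape(d2, n)
    return C @ np.tanh(V @ x)

d = d1 * n + d2 * n
w = rng.standard_normal(d) * 0.5
x = rng.standard_normal(d1)
y = rng.standard_normal(d2)

eps = 1e-6
J = np.zeros((d2, d))                  # sample-wise Jacobian, one row per output
grad = np.zeros(d)                     # gradient of the halved MSE loss
for i in range(d):
    e = np.zeros(d); e[i] = eps
    fp, fm = net(w + e, x), net(w - e, x)
    J[:, i] = (fp - fm) / (2 * eps)
    lp = np.mean(0.5 * (fp - y) ** 2)  # (1/d2) * sum_j 0.5 (f_j - y_j)^2
    lm = np.mean(0.5 * (fm - y) ** 2)
    grad[i] = (lp - lm) / (2 * eps)

# The gradient equals the residual-weighted sum of Jacobian rows.
resid = net(w, x) - y
assert np.allclose(grad, (resid @ J) / d2, atol=1e-5)
```

The check makes the expressiveness gap concrete: the server receives all $d_2$ rows of $\mathbf{J}_m$ rather than one particular weighted combination of them.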
For communication, we follow the widely adopted analysis framework in wireless communication and examine only the client uplink overhead, assuming that the downlink bandwidth is much larger and the server has enough transmission power (Tran et al., 2019). In NTK-FL, the client uplink communication cost and space complexity are dominated by the third-order tensor $\boldsymbol{J}^{(k)}_m$, i.e., an $O(N_m d_2 d)$ complexity compared to $O(d)$ in FedAvg. For security, we investigate a threat model where a curious server may perform membership inference attacks (Nasr et al., 2018) or model inversion/data reconstruction attacks (Zhu et al., 2019). Compared to the averaged gradient, sample-wise Jacobian matrices are more expressive, which may facilitate such attacks by the aggregation server. We extend NTK-FL by combining various tools to address these problems without severely jeopardizing performance.

Figure 2: Schematic of CP-NTK-FL. A trusted key server (orange) sends an encrypted seed $E(k_m^+, \rho)$ with the public key $k_m^+$ for random projection. The client transmits the required message to the shuffling server (blue) to perform a permutation.

Figure 3: Training results of 300 clients via NTK-FL and FedAvg, along with variants with local dataset subsampling and random projection, denoted as NTK-FL′ and FedAvg′, respectively. We train a two-layer multilayer perceptron on the Fashion-MNIST dataset. The joint effect causes more accuracy degradation in FedAvg than in NTK-FL.
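For concreteness, the uplink complexities quoted above can be turned into payload sizes under hypothetical magnitudes (a 784-100-10 MLP, 100 local examples, float32 entries); these numbers are illustrative, not the paper's measurements:

```python
# FedAvg uploads one weight/gradient vector, O(d); NTK-FL uploads the
# sample-wise Jacobian tensor, O(N_m * d2 * d). Magnitudes are hypothetical.
d1, h, d2, Nm = 784, 100, 10, 100
d = d1 * h + h + h * d2 + d2            # 79,510 weights and biases in total
fedavg_bytes = 4 * d                    # one float32 vector of length d
ntkfl_bytes = 4 * Nm * d2 * d           # Jacobian tensor, N_m x d2 x d

# The ratio is exactly N_m * d2, i.e., 1000x in this setting.
assert ntkfl_bytes // fedavg_bytes == Nm * d2
```

This factor of $N_m d_2$ is precisely what motivates the subsampling, projection, and compression modules of CP-NTK-FL described next.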
These tools are optional building modules and can be adopted separately or jointly, depending on the resources available in practice. Although it is possible to incorporate these tools into FedAvg, we will show that doing so leads to a more severe accuracy drop overall.

Jacobian Dimension Reduction. First, we let the $m$th client uniformly sample a subset $\mathcal{B}_m$ from its dataset $\mathcal{D}_m$ for training. Let $\beta \in (0, 1)$ denote the sampling rate; $\mathcal{B}_m$ contains $N'_m = \beta N_m$ data points, with the training pairs denoted by $(\mathbf{X}'_m, \mathbf{Y}'_m)$. In general, using more data points improves model generalizability (Mohri et al., 2018). The sampling rate $\beta$ controls the trade-off between efficiency and model performance. Next, we use a random projection to preprocess the input data via a seed shared by a trusted key server. Formally, the sampled training examples are projected into $\mathbf{Z}'_m$, i.e., $\mathbf{Z}'_m = \mathbf{X}'_m \mathbf{P}$, where $\mathbf{P} \in \mathbb{R}^{d_1 \times d'_1}$ is a projection matrix with IID standard Gaussian entries generated from a seed $\rho$. In general, we have $d'_1 < d_1$ and a non-invertible projection operation. The concept of a trusted key server follows the trusted third party in cryptography (Van Oorschot, 2020), and we assume it will not be compromised. These two steps already improve communication efficiency and confidentiality. We first examine the resulting Jacobian tensor $\boldsymbol{J}'^{(k)}_m \in \mathbb{R}^{N'_m \times d_2 \times d}$. Compared with its original version $\boldsymbol{J}^{(k)}_m$, it has reduced dimensionality at the cost of certain information loss. Meanwhile, the random projection defends against the data reconstruction attack, as the Jacobian tensor is now evaluated at the projected data $\mathbf{Z}'_m$. We empirically verify their impact on the test accuracy in Figure 3. We set $d'_1 = 100$ and sampling rate $\beta = 0.4$, and train a multilayer perceptron with 100 hidden nodes on the Fashion-MNIST dataset (Xiao et al., 2017). The joint effect of these strategies is a slight accuracy drop in NTK-FL and a nonnegligible accuracy degradation in FedAvg.
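The subsampling and seeded projection steps can be sketched as follows. Because every party regenerates $\mathbf{P}$ from the shared seed $\rho$, the projection matrix itself is never transmitted; the function name and all sizes below are illustrative:

```python
import numpy as np

def make_projection(seed, d1, d1_proj):
    """Regenerate the Gaussian projection P from the shared seed rho."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d1, d1_proj))

rho = 42                               # seed from the trusted key server
d1, d1_proj = 784, 100                 # e.g. Fashion-MNIST pixels to 100 dims
beta = 0.4                             # local subsampling rate

rng = np.random.default_rng(7)
X_m = rng.standard_normal((50, d1))    # a client's local inputs (toy data)
idx = rng.choice(50, size=int(beta * 50), replace=False)
Z_m = X_m[idx] @ make_projection(rho, d1, d1_proj)   # projected subsample

# Any party holding the seed reproduces exactly the same projection matrix.
assert np.allclose(make_projection(rho, d1, d1_proj),
                   make_projection(rho, d1, d1_proj))
assert Z_m.shape == (20, d1_proj)
```

Since $d'_1 < d_1$, the map $\mathbf{x} \mapsto \mathbf{x}^\top \mathbf{P}$ has a nontrivial null space, which is what makes exact input recovery impossible without side information.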
Jacobian Compression and Shuffling. We use a compression scheme to reduce the size of the Jacobian tensor by zeroing out the coordinates with small magnitudes (Alistarh et al., 2018). In addition to improving communication efficiency, this compression scheme is empirically effective against the data reconstruction attack (Zhu et al., 2019). To further ensure confidentiality and membership privacy, we introduce a shuffling server, inspired by recent frameworks (Girgis et al., 2021; Cheng et al., 2021), to permute the Jacobian tensors $\boldsymbol{J}^{(k)}_m$'s, neural network states $f^{(k)}_m$'s, and labels $\mathbf{Y}_m$'s. Based on (9b), we denote the model update by $\Delta \mathbf{w}^{(k)} \triangleq \mathbf{w}^{(k+1)} - \mathbf{w}^{(k)} = \sum_{j=1}^{d_2} (\boldsymbol{J}^{(k)}_{:j:})^\top \mathbf{R}^{(k,t)}_{:j}$, which is a sum of matrix products. If rows and columns are permuted in synchronization, the weight update $\Delta \mathbf{w}^{(k)}$ remains unchanged. Considering the high dimensionality of the neural network weights, the reconstruction attack becomes computationally infeasible. Since introducing a new differential privacy mechanism is not the main focus of this work and the privacy protection analysis is consistent with the existing framework (Girgis et al., 2021), we do not go into details. Meanwhile, as a provable differential privacy guarantee does not explicitly protect against data reconstruction attacks (Zhang et al., 2020), we empirically study the privacy loss under the deep leakage from gradients algorithm (Zhu et al., 2019) in Appendix B. A thorough study of different attack schemes is out of the scope of this work, and we leave it for future work.

5. Analysis of Algorithm

In this section, we analyze the loss decay rate between successive communication rounds in NTK-FL and make comparisons with FedAvg.
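The magnitude-based compression step can be sketched in a few lines. This is a generic top-k sparsifier in the spirit of Alistarh et al. (2018), not the paper's exact encoder; the `keep` fraction is an illustrative knob:

```python
import numpy as np

def sparsify(tensor, keep=0.1):
    """Zero out all but the fraction `keep` of largest-magnitude entries."""
    flat = tensor.ravel()
    k = max(1, int(keep * flat.size))
    thresh = np.partition(np.abs(flat), -k)[-k]   # k-th largest magnitude
    return np.where(np.abs(tensor) >= thresh, tensor, 0.0)

rng = np.random.default_rng(5)
J = rng.standard_normal((4, 3, 10))               # toy Jacobian tensor
Jc = sparsify(J, keep=0.1)

assert Jc.shape == J.shape
assert np.count_nonzero(Jc) == 12                 # 10% of 120 entries survive
```

Only the surviving values and their indices need to be transmitted, so the uplink payload shrinks roughly in proportion to `keep`.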
Similar to prior work (Du et al., 2019; Dukler et al., 2020), we consider a two-layer neural network $f: \mathbb{R}^{d_1} \to \mathbb{R}$ of the following form to facilitate our analysis:

$$f(\mathbf{w}; \mathbf{x}) = \frac{1}{\sqrt{n}} \sum_{r=1}^{n} c_r \sigma(\mathbf{v}_r^\top \mathbf{x}), \tag{12}$$

where $\mathbf{x} \in \mathbb{R}^{d_1}$ is an input, $\mathbf{v}_r \in \mathbb{R}^{d_1}$ is the weight vector in the first layer, $c_r$ is the weight in the second layer, and $\sigma(\cdot)$ is the rectified linear unit (ReLU) function, namely $\sigma(z) = \max(z, 0)$, applied coordinatewise. Without loss of generality, we assume the selected client set $\mathcal{C}_k$ is $\{1, 2, \ldots, M_k\}$ in communication round $k$. Let $\mathbf{X}^{(k)} = [\mathbf{X}_1^\top, \ldots, \mathbf{X}_{M_k}^\top]^\top \in \mathbb{R}^{N_k \times d_1}$ denote a concatenation of client inputs and $\mathbf{y}^{(k)} = [\mathbf{y}_1^\top, \ldots, \mathbf{y}_{M_k}^\top]^\top$ denote a concatenation of client labels. In the following analysis, we assume $d_2 = 1$ for simplicity. We state two assumptions as prerequisites.

Assumption 1 (Weight Distribution). When broadcast in communication round $k$, the first-layer weights $\mathbf{v}_r^{(k)}$'s follow the normal distribution $\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_k)$. The minimum eigenvalue of the covariance matrix is bounded as $\lambda_{\min}(\boldsymbol{\Sigma}_k) \geq \alpha_k^2$, where $\alpha_k$ is a positive constant. The second-layer weights $c_r$'s are sampled from $\{-1, 1\}$ with equal probability and are kept fixed during training.

Assumption 2 (Normalized Input). The input data are normalized, i.e., $\|\mathbf{x}_i\|_2 = 1$, $i \in [N_k]$.

For this neural network model, the $(i,j)$th entry of the empirical kernel matrix $\mathbf{H}^{(k)}$ given in (6) can be calculated as $(\mathbf{H}^{(k)})_{ij} = \frac{1}{n} \mathbf{x}_i^\top \mathbf{x}_j \sum_{r=1}^{n} \mathbb{1}^{(k)}_{ir} \mathbb{1}^{(k)}_{jr}$, where $\mathbb{1}^{(k)}_{ir} \triangleq \mathbb{1}\{\langle \mathbf{v}_r^{(k)}, \mathbf{x}_i \rangle \geq 0\}$, and the term $c_r^2$ is omitted according to Assumption 1. Define $\mathbf{H}^{(k)}_\infty$, whose $(i,j)$th entry is given by:

$$(\mathbf{H}^{(k)}_\infty)_{ij} \triangleq \mathbb{E}_{\mathbf{v}^{(k)}} \left[ \mathbf{x}_i^\top \mathbf{x}_j \mathbb{1}^{(k)}_i \mathbb{1}^{(k)}_j \right], \tag{13}$$

where $\mathbb{1}^{(k)}_i \triangleq \mathbb{1}(\langle \mathbf{v}^{(k)}, \mathbf{x}_i \rangle \geq 0)$ and $\mathbf{v}^{(k)}$ follows the normal distribution $\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_k)$. Let $\lambda_k$ denote the minimum eigenvalue of $\mathbf{H}^{(k)}_\infty$, which is restricted as follows.

Assumption 3 (Positive Definiteness). The kernel matrix $\mathbf{H}^{(k)}_\infty$ is positive definite, namely, $\lambda_k > 0$.

In fact, the positive definiteness of $\mathbf{H}^{(k)}_\infty$ can be shown under certain conditions (Dukler et al., 2020).
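The concentration of the empirical kernel around its expectation can be checked numerically in the special case of isotropic weights, $\boldsymbol{\Sigma}_k = \mathbf{I}$ (a special case of Assumption 1), where the expectation in Eq. (13) reduces to the well-known arc-cosine form $\mathbf{x}_i^\top \mathbf{x}_j (\pi - \theta_{ij})/(2\pi)$ with $\theta_{ij}$ the angle between inputs. All sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d1, N = 5000, 5, 6
V = rng.standard_normal((n, d1))                 # first-layer weights v_r
X = rng.standard_normal((N, d1))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # Assumption 2: ||x_i|| = 1

act = (X @ V.T >= 0).astype(float)               # indicators 1{<v_r, x_i> >= 0}
H = (X @ X.T) * (act @ act.T) / n                # empirical kernel matrix

# Closed-form expectation for isotropic Gaussian v (arc-cosine kernel).
cos = np.clip(X @ X.T, -1.0, 1.0)
H_inf = cos * (np.pi - np.arccos(cos)) / (2 * np.pi)

assert np.allclose(H, H_inf, atol=0.05)          # H concentrates around H_inf
```

With $n = 5000$ hidden units the per-entry Monte Carlo error is well below the tolerance, which mirrors the role of the width requirement in the theorems below.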
For simplicity, we omit the proof details and directly assume the positive definiteness of $\mathbf{H}^{(k)}_\infty$ in Assumption 3. Next, we study the residual term $\|f^{(k)}(\mathbf{X}^{(k)}) - \mathbf{y}^{(k)}\|_2^2$. We give the convergence result by analyzing how the residual term decays given the training examples $\mathbf{X}^{(k)}$.

Theorem 1. For the NTK-FL scheme under Assumptions 1 to 3, let the learning rate $\eta = O\big(\frac{\lambda_k}{N_k}\big)$ and the neural network width $n = \Omega\big(\frac{N_k^2}{\lambda_k^2} \ln \frac{N_k^2}{\delta}\big)$. Then, with probability at least $1 - \delta$,

$$\left\| f^{(k+1)}(\mathbf{X}^{(k)}) - \mathbf{y}^{(k)} \right\|_2^2 \leq \left( 1 - \frac{\eta \lambda_k}{2 N_k} \right)^{t^{(k)}} \left\| f^{(k)}(\mathbf{X}^{(k)}) - \mathbf{y}^{(k)} \right\|_2^2, \tag{14}$$

where $t^{(k)}$ is the number of update steps defined in (11a).

The proof of Theorem 1 can be found in Appendix D. We discuss the choice of the optimal number of update steps $t^{(k)}$ below.

Remark 1. Based on the solution given by (9a), the neural network prediction error is a decreasing function of the number of update steps $t$. However, one cannot pick an arbitrarily large $t$ as the final optimal number of update steps $t^{(k)}$. When $t$ increases, the neural network evolution in the function space does not consistently match the evolution in the weight space. Empirical studies in centralized training have confirmed a nonnegligible gap between the NTK weight and the gradient descent weight when $t$ exceeds the order of $10^2$ to $10^3$ (Lee et al., 2019). In Lemma 2 of Appendix C, we provide detailed explanations and show that the difference between the NTK weight in (9b) and the gradient descent weight increases with $t$.

Next, we compare the proposed NTK-FL with FedAvg. By studying the asymmetric kernel matrix caused by local updates (Huang et al., 2021), we have the following theorem for FedAvg, whose proof can be found in Appendix E.
Figure 4: Test accuracy versus communication round for different methods evaluated on: (a) the FEMNIST dataset, where the heterogeneity comes from feature skewness; (b) the non-IID MNIST dataset with label skewness ($\alpha = 0.5$); (c) the non-IID Fashion-MNIST dataset with label skewness ($\alpha = 0.5$). NTK-FL outperforms all baseline FL algorithms in the different scenarios and achieves test performance similar to that of the ideal centralized training case.

Theorem 2. For FedAvg under Assumptions 1 to 3, let the learning rate $\eta = O\big(\frac{\lambda_k}{\tau N_k M_k}\big)$ and the neural network width $n = \Omega\big(\frac{N_k^2}{\lambda_k^2} \ln \frac{N_k^2}{\delta}\big)$. Then, with probability at least $1 - \delta$,

$$\left\| f^{(k+1)}(\mathbf{X}^{(k)}) - \mathbf{y}^{(k)} \right\|_2^2 \leq \left( 1 - \frac{\eta \tau \lambda_k}{2 N_k M_k} \right) \left\| f^{(k)}(\mathbf{X}^{(k)}) - \mathbf{y}^{(k)} \right\|_2^2, \tag{15}$$

where $\tau$ is the number of local iterations, and $M_k$ is the number of clients in communication round $k$.

Remark 2 (Fast Convergence of NTK-FL). The convergence rate of NTK-FL is faster than that of FedAvg. To see this, we compare the Binomial approximation of the decay coefficient in Theorem 1 with the decay coefficient in Theorem 2:

$$\left( 1 - \frac{\eta_1 \lambda_k}{2 N_k} \right)^{t^{(k)}} \approx 1 - \frac{\eta_1 t^{(k)} \lambda_k}{2 N_k} + O(\eta_1^2) < 1 - \frac{\eta_2 \tau \lambda_k}{2 N_k M_k}, \tag{16}$$

where $\eta_1 \ll 1$ for a large $N_k$.¹ The number of NTK update steps $t^{(k)}$ is chosen dynamically in (11a) and is on the order of $10^2$ to $10^3$, whereas $\tau$ is often on the order of $10$ in the literature (Reisizadeh et al., 2020; Haddadpour et al., 2021). One can verify that $\eta_1 t^{(k)} \lambda_k$ is larger than $\eta_2 \tau \lambda_k / M_k$ and draw the conclusion.

6. Experimental Results

Federated Settings. We use three datasets, namely, MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and FEMNIST (Caldas et al., 2018) digits. All of them contain $C = 10$ categories. For MNIST and Fashion-MNIST, we follow Hsu et al.
(2019) to simulate non-IID data with the symmetric Dirichlet distribution (Good, 1976). Specifically, for the $m$th client, we draw a random vector $\mathbf{q}_m \sim \mathrm{Dir}(\alpha)$, where $\mathbf{q}_m = [q_{m,1}, \ldots, q_{m,C}]$ belongs to the $(C-1)$-standard simplex. Images with category $k$ are assigned to the $m$th client in proportion to $(100 \cdot q_{m,k})\%$. The heterogeneity in this setting mainly comes from label skewness. FEMNIST splits the dataset into shards indexed by the original writer of the digits; the heterogeneity mainly comes from feature skewness. A multilayer perceptron model with 100 hidden nodes is chosen as the target neural network model. We consider a total of 300 clients and select 20 of them with equal probability in each round.

¹For example, if we have 100 clients, each of which has 100 data points, then $N_k$ is on the order of $10^4$. Considering the choice of the learning rate $\eta_1$, the Binomial approximation holds.

Convergence. We empirically verify the convergence rate of the proposed method. For FedAvg, we choose the number of local iterations from $\{1, 3, \ldots, 9, 10, 20, \ldots, 50\}$ and report the best results. For NTK-FL, we choose $t^{(k)}$ over the set $\{100, 200, \ldots, 2000\}$. We use the following methods, which are robust to the non-IID setting, as baselines: (i) the data sharing scheme suggested by Zhao et al. (2018), where a global dataset is broadcast to clients for local training; the size of the global dataset is set to 10% of the total number of local data points; (ii) federated normalized averaging (FedNova) (Wang et al., 2020), where the clients transmit normalized gradient vectors to the server; (iii) a centralized training simulation, where the server collects the data points from the subset $\mathcal{C}_k$ of clients and performs gradient descent to directly train the global model. Scheme (iii) achieves performance that can be considered an upper bound for all other algorithms. The training curves over three repetitions are shown in Figure 4.
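The Dirichlet label-skew partition can be sketched as follows. Note one simplification relative to the text: a common implementation (and the one below) draws, for each class, a Dirichlet split over clients, rather than a per-client vector over classes; the function name and all sizes are illustrative:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign each class's examples to clients with Dir(alpha) proportions."""
    rng = np.random.default_rng(seed)
    C = labels.max() + 1
    client_idx = [[] for _ in range(num_clients)]
    q = rng.dirichlet(alpha * np.ones(num_clients), size=C)  # per-class split
    for c in range(C):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        cuts = (np.cumsum(q[c])[:-1] * len(idx)).astype(int)
        for m, part in enumerate(np.split(idx, cuts)):
            client_idx[m].extend(part.tolist())
    return client_idx

labels = np.repeat(np.arange(10), 100)      # 10 classes, 100 examples each
parts = dirichlet_partition(labels, num_clients=5, alpha=0.5)

assert sum(len(p) for p in parts) == len(labels)   # every example assigned
assert len(parts) == 5
```

Smaller `alpha` concentrates each class on fewer clients, which is exactly how the degree-of-heterogeneity sweep in the next experiment is generated.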
More implementation details and results on CIFAR-10 can be found in Appendix A. Our proposed NTK-FL method shows consistent advantages over the other methods in different non-IID scenarios.

Degree of Heterogeneity. In this experiment, we select the Dirichlet distribution parameter $\alpha$ from $\{0.1, 0.2, 0.3, 0.4, 0.5\}$ and simulate different degrees of heterogeneity on the Fashion-MNIST dataset. A smaller $\alpha$ increases the degree of heterogeneity in the data distribution. We evaluate the NTK-FL, Data Share, FedNova, and FedAvg model test accuracies after training for 50 rounds. The mean values over three repetitions are shown in Figure 5, with a standard deviation of less than 1% for each point. It can be observed that NTK-FL achieves stable test accuracy in the different heterogeneous settings. In comparison, FedAvg and FedNova show a performance drop in the small-$\alpha$ region. NTK-FL has more advantages over the baseline methods when the degree of heterogeneity is larger.

Figure 5: Test accuracy versus the Dirichlet distribution parameter $\alpha$ for different methods evaluated on the non-IID Fashion-MNIST dataset. Reducing the value of $\alpha$ increases the degree of heterogeneity in the data distribution. NTK-FL is robust to different heterogeneous data distributions, and shows more advantages over FedAvg and FedNova when the degree of heterogeneity is larger.

Figure 6: CP-NTK-FL test accuracy for different hyperparameters (sampling rate $\beta$ from 0.2 to 0.7 and dimension $d'_1$ from 100 to 600; accuracies range from 81.9% to 85.0%). A larger data sampling rate $\beta$ and a larger dimension $d'_1$ are expected to give a higher test accuracy. In general, the scheme is robust to different combinations of hyperparameters.
Effect of Hyperparameters. We study the effect of the tunable parameters in CP-NTK-FL. We vary the local data sampling rate β and the dimension d1, and evaluate the model test accuracy on the non-IID Fashion-MNIST dataset (α = 0.1) after 10 communication rounds. The results are shown in Figure 6. A larger data sampling rate β or a larger dimension d1 causes less information loss and is expected to yield a higher test accuracy. The results also show that the scheme is robust to different combinations of hyperparameters.

Uplink Communication. In federated learning, uplink communication overhead can be one of the bottlenecks in the training stage. We evaluate the uplink communication efficiency of CP-NTK-FL (d1 = 200, β = 0.3) by measuring the number of rounds and the cumulative uplink communication cost required to reach a test accuracy of 85% on the non-IID Fashion-MNIST dataset (α = 0.1). The results over three repetitions are shown in Table 1. Compared with federated learning with compression (FedCOM) (Haddadpour et al., 2021), quantized SGD (QSGD) (Alistarh et al., 2017), and FedAvg, CP-NTK-FL achieves the goal within an order of magnitude fewer iterations, which is particularly advantageous for applications with nonnegligible encoding/decoding delays or network latency.

Table 1: Uplink communication cost to reach an accuracy of 85% on the non-IID Fashion-MNIST dataset (α = 0.1) for different methods. CP-NTK-FL reaches the target within the fewest communication rounds without significantly increasing the communication cost.

Optimization algorithm   Comm. rounds   Comm. cost (MB)
CP-NTK-FL                26             386
FedCOM                   250            379
QSGD (4 bit)             614            465
FedAvg                   284            1720

7. Conclusion and Future Work

In this paper, we have proposed an NTK-empowered FL paradigm that inherently addresses the statistical heterogeneity challenge. By constructing a global kernel based on the local sample-wise Jacobian matrices, the global model weights can be updated via NTK evolution in the parameter space.
Compared with traditional algorithms such as FedAvg, NTK-FL has a more centralized training flavor by transmitting more expressive updates. The effectiveness of the proposed paradigm has been verified theoretically and experimentally. In future work, it will be interesting to extend the paradigm to other neural network architectures, such as CNNs, residual networks (ResNets) (He et al., 2016), and RNNs. It is also worthwhile to further improve the efficiency of NTK-FL and explore its savings in wall-clock time. We believe the proposed paradigm will provide a new perspective for solving federated learning challenges.

References

Aji, A. F. and Heafield, K. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.

Alemohammad, S., Wang, Z., Balestriero, R., and Baraniuk, R. The recurrent neural tangent kernel. In International Conference on Learning Representations, 2021.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, 2017.

Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., and Renggli, C. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, 2018.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 2019.

Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Konečný, J., McMahan, H. B., Smith, V., and Talwalkar, A. LEAF: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.

Chen, H.-Y. and Chao, W.-L. FedBE: Making Bayesian model ensemble applicable to federated learning. In International Conference on Learning Representations, 2021.

Chen, Z., Cao, Y., Gu, Q., and Zhang, T.
A generalized neural tangent kernel analysis for two-layer neural networks. In Advances in Neural Information Processing Systems, 2020.

Cheng, P.-C., Eykholt, K., Gu, Z., Jamjoom, H., Jayaram, K., Valdez, E., and Verma, A. Separation of powers in federated learning. arXiv preprint arXiv:2105.09400, 2021.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.

Dukler, Y., Montufar, G., and Gu, Q. Optimization theory for ReLU neural networks trained with normalization layers. In International Conference on Machine Learning, 2020.

Girgis, A. M., Data, D., Diggavi, S. N., Kairouz, P., and Suresh, A. T. Shuffled model of differential privacy in federated learning. In International Conference on Artificial Intelligence and Statistics, 2021.

Gonzalez, R. C. and Woods, R. E. Digital Image Processing, 3rd Edition. 2014.

Good, I. J. On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. The Annals of Statistics, 4(6):1159–1189, 1976.

Haddadpour, F., Kamani, M. M., Mokhtari, A., and Mahdavi, M. Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, 2021.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In International Conference on Computer Vision and Pattern Recognition, 2016.

Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

Huang, B., Li, X., Song, Z., and Yang, X. FL-NTK: A neural tangent kernel-based framework for federated learning convergence analysis. arXiv preprint arXiv:2105.05001, 2021.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks.
In Advances in Neural Information Processing Systems, 2018.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 2021.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, 2020.

Kolda, T. G. and Bader, B. W. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, Dept. of Comput. Sci., Univ. of Toronto, Toronto, Canada, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, 2019.

Lee, J., Schoenholz, S. S., Pennington, J., Adlam, B., Xiao, L., Novak, R., and Sohl-Dickstein, J. Finite versus infinite neural networks: An empirical study. In Advances in Neural Information Processing Systems, 2020.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020.

Li, X., Jiang, M., Zhang, X., Kamp, M., and Dou, Q. FedBN: Federated learning on non-IID features via local batch normalization. In International Conference on Learning Representations, 2021.

Liang, P. P., Liu, T., Ziyin, L., Allen, N. B., Auerbach, R. P., Brent, D., Salakhutdinov, R., and Morency, L.-P.
Think locally, act globally: Federated learning with local and global representations. NeurIPS Workshop on Federated Learning, 2019.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 2017.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2018.

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2020.

Nasr, M., Shokri, R., and Houmansadr, A. Machine learning with membership privacy using adversarial regularization. In ACM SIGSAC Conference on Computer and Communications Security, pp. 634–646, 2018.

Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečný, J., Kumar, S., and McMahan, H. B. Adaptive federated optimization. In International Conference on Learning Representations, 2021.

Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, 2020.

Sattler, F., Wiedemann, S., Müller, K.-R., and Samek, W. Robust and communication-efficient federated learning from non-IID data. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3400–3413, 2019.

Seo, H., Park, J., Oh, S., Bennis, M., and Kim, S.-L. Federated knowledge distillation. arXiv preprint arXiv:2011.02367, 2020.

Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. Federated multi-task learning. In Advances in Neural Information Processing Systems, 2017.

Su, L., Xu, J., and Yang, P. Achieving statistical optimality of federated learning: Beyond stationary points. arXiv preprint arXiv:2106.15216, 2021.

Tran, N. H., Bao, W., Zomaya, A., Nguyen, M. N., and Hong, C. S.
Federated learning over wireless networks: Optimization model design and analysis. In IEEE Conference on Computer Communications, pp. 1387–1395. IEEE, 2019.

Van Oorschot, P. C. Computer Security and the Internet: Tools and Jewels. Springer Nature, 2020.

Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning with matched averaging. In International Conference on Learning Representations, 2020.

Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems, 2020.

Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yang, G. and Littwin, E. Tensor programs IIb: Architectural universality of neural tangent kernel training dynamics. In International Conference on Machine Learning, 2021.

Zhang, Y., Jia, R., Pei, H., Wang, W., Li, B., and Song, D. The secret revealer: Generative model-inversion attacks against deep neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 253–261, 2020.

Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.

Zhu, L., Liu, Z., and Han, S. Deep leakage from gradients. In Advances in Neural Information Processing Systems, 2019.
A. Implementation and Additional Results

We give the detailed settings of the learning rate and batch size. For the learning rate η, we search over the set {10^{-3}, 3×10^{-3}, 10^{-2}, 3×10^{-2}, 10^{-1}}. The learning rate is fixed during training. The client batch size is set to 200 for all datasets. We use the same setup as in Section 6. We evaluate different methods, including a centralized training simulation, the data sharing method (Zhao et al., 2018), FedNova (Wang et al., 2020), FedAvg (McMahan et al., 2017), and the proposed NTK-FL, on the non-IID CIFAR-10 dataset (Krizhevsky, 2009), and present the results in Figure 7. NTK-FL outperforms the other FL algorithms and shows test accuracy close to that of the centralized simulation. This observation is consistent with the results in Figure 4. The implementation is available at https://github.com/KAI-YUE/ntk-fed.

Figure 7: Test accuracy versus communication round of different methods evaluated on the non-IID CIFAR-10 dataset, where the Dirichlet distribution parameter α is set to 0.1.

B. Reconstruction Attack

In the following experiment, we measure the privacy loss when using Jacobian sparsification under a data reconstruction attack. We compress the Jacobian tensor and perform the deep leakage attack (Zhu et al., 2019). The sparsity levels and the structural similarity index measure (SSIM) (Gonzalez & Woods, 2014) of the reconstructed images are shown in Figure 8. As the sparsity decreases, the reconstructions become closer to the original images, which is consistent with Zhu et al. (2019). The sparsity is set above 80%–90% when using the top-k sparsification approach (Aji & Heafield, 2017), a regime in which the privacy loss becomes difficult to quantify. A solid privacy-protection study is nontrivial, and we leave it for future work.

Figure 8: Attack results under different sparsity levels.

Sparsity   SSIM
90%        0.06
80%        0.23
70%        0.35
60%        0.47
50%        0.65
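The top-k sparsification step referenced above (Aji & Heafield, 2017) can be sketched as follows. This is a minimal illustration with names of our own choosing, not code from the released implementation.

```python
import numpy as np

def topk_sparsify(tensor, sparsity):
    """Keep only the largest-magnitude entries of a tensor.

    `sparsity` is the fraction of entries zeroed out, e.g. sparsity=0.9
    keeps the top 10% of entries by absolute value and drops the rest.
    """
    flat = tensor.ravel()
    k = max(1, int(round(flat.size * (1.0 - sparsity))))
    out = np.zeros_like(flat)
    keep = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k magnitudes
    out[keep] = flat[keep]
    return out.reshape(tensor.shape)

rng = np.random.default_rng(0)
jacobian = rng.standard_normal((8, 16))   # stand-in for one Jacobian block
sparse = topk_sparsify(jacobian, sparsity=0.9)
```

Kept entries are left untouched, so the compressed update is exactly the original tensor restricted to its largest-magnitude coordinates.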
C. The Number of Update Steps

We explain below why the optimal number of update steps, $t^{(k)}$, must not be too large in (11a). The weight $w^{(k,t)}$ is generated by the NTK method in (9b). Let $\tilde{w}^{(k,t)}$ denote the weight generated by the gradient descent method. For the two-layer neural network model given in (12), we will show an upper bound on the gap between $w^{(k,t)}$ and $\tilde{w}^{(k,t)}$. To this end, we first give Lemma 1 as a prerequisite. Define $S_i^{(k)}$ as the set of indices corresponding to neurons whose activation patterns can differ from the broadcast version $v^{(k)}$ for an input $x_i$:
$$S_i^{(k)} \triangleq \Big\{ r \in [n] : \exists\, v \text{ s.t. } \|v - v_r^{(k)}\|_2 \le R,\ \mathbb{1}_{ir}^{(k)} \ne \mathbb{1}\{\langle v, x_i\rangle \ge 0\} \Big\}. \tag{17}$$
We upper bound the cardinality of $S_i^{(k)}$ in Lemma 1.

Lemma 1. Under Assumptions 1 to 2, with probability at least $1-\delta$, we have
$$\big|S_i^{(k)}\big| \le \sqrt{\frac{2}{\pi}}\,\frac{nR}{\delta\alpha_k}, \quad \forall\, i \in [N_k]. \tag{18}$$

Proof. To bound $|S_i^{(k)}| = \sum_{r=1}^{n} \mathbb{1}\{r \in S_i^{(k)}\}$, consider an event $A_{ir}^{(k)}$ defined as follows:
$$A_{ir}^{(k)} \triangleq \Big\{ \exists\, v \text{ s.t. } \|v - v_r^{(k)}\|_2 \le R,\ \mathbb{1}_{ir}^{(k)} \ne \mathbb{1}\{\langle v, x_i\rangle \ge 0\} \Big\}. \tag{19}$$
Clearly, $\mathbb{1}\{r \in S_i^{(k)}\} = \mathbb{1}\{A_{ir}^{(k)}\}$. According to Assumption 2, it can be shown by a geometric argument that the event $A_{ir}^{(k)}$ happens if and only if $|(v_r^{(k)})^\top x_i| \le R$. From Assumption 1, we have $(v_r^{(k)})^\top x_i \sim \mathcal{N}(0,\, x_i^\top \Sigma_k x_i)$. The probability of the event $A_{ir}^{(k)}$ is
$$\mathbb{P}\big[A_{ir}^{(k)}\big] = \mathbb{P}\Big[\big|(v_r^{(k)})^\top x_i\big| \le R\Big] \tag{20a}$$
$$= \operatorname{erf}\!\bigg(\frac{R}{\sqrt{2\, x_i^\top \Sigma_k x_i}}\bigg) \le \sqrt{\frac{2}{\pi}}\,\frac{R}{\alpha_k}, \tag{20b}$$
where the error function is given by $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-t^2}\,\mathrm{d}t$. By Markov's inequality, we have with probability at least $1-\delta$,
$$\sum_{r=1}^{n} \mathbb{1}\{r \in S_i^{(k)}\} \le \sqrt{\frac{2}{\pi}}\,\frac{nR}{\delta\alpha_k}. \tag{21}$$
The proof is complete.

Lemma 2. Consider the residual term $r^{(k,u)} = y^{(k)} - f^{(k,u)}(X^{(k)})$. Suppose $\forall\, i \in [N_k]$, $|r_i^{(k,u)}| < \gamma^{(k,u)}$, where $\gamma^{(k,u)}$ is a positive constant and $\lim_{u\to\infty}\gamma^{(k,u)} = 0$. With probability at least $1-\delta$, we have
$$\big\|w^{(k,t)} - \tilde{w}^{(k,t)}\big\|_1 \le \sqrt{\frac{2}{\pi}}\,\frac{\eta N_k R \sqrt{n d_1}}{\delta\alpha_k}\sum_{u=1}^{t-1}\gamma^{(k,u)}. \tag{22}$$

Proof. The weights $w^{(k,t)}$ and $\tilde{w}^{(k,t)}$ can be written as
$$w^{(k,t)} = \eta\sum_{u=0}^{t-1}\nabla f^{(k,0)}(X^{(k)})^\top\big[y^{(k)} - f^{(k,u)}(X^{(k)})\big] + w^{(k,0)}, \tag{23}$$
$$\tilde{w}^{(k,t)} = \eta\sum_{u=0}^{t-1}\nabla f^{(k,u)}(X^{(k)})^\top\big[y^{(k)} - f^{(k,u)}(X^{(k)})\big] + w^{(k,0)}.$$
(24)

The ℓ1 norm of the difference between $w^{(k,t)}$ and $\tilde{w}^{(k,t)}$ is
$$\big\|w^{(k,t)} - \tilde{w}^{(k,t)}\big\|_1 = \eta\,\bigg\|\sum_{u=1}^{t-1}\big[\nabla f^{(k,u)}(X^{(k)}) - \nabla f^{(k,0)}(X^{(k)})\big]^\top\big[y^{(k)} - f^{(k,u)}(X^{(k)})\big]\bigg\|_1 \tag{25a}$$
$$\le \eta\sum_{u=1}^{t-1}\gamma^{(k,u)}\sum_{i=1}^{N_k}\sum_{r=1}^{n}\Big\|\tfrac{1}{\sqrt{n}}\,c_r x_i\big(\mathbb{1}_{ir}^{(k,u)} - \mathbb{1}_{ir}^{(k,0)}\big)\Big\|_1 \tag{25b}$$
$$\le \eta\sum_{u=1}^{t-1}\gamma^{(k,u)}\sum_{i=1}^{N_k}\sum_{r=1}^{n}\sqrt{\frac{d_1}{n}}\,\big|\mathbb{1}_{ir}^{(k,u)} - \mathbb{1}_{ir}^{(k,0)}\big| \tag{25c}$$
$$\le \eta\sum_{u=1}^{t-1}\gamma^{(k,u)}\sum_{i=1}^{N_k}\sum_{r=1}^{n}\sqrt{\frac{d_1}{n}}\,\mathbb{1}\{r \in S_i^{(k)}\} \tag{25d}$$
$$\le \sqrt{\frac{2}{\pi}}\,\frac{\eta N_k R\sqrt{n d_1}}{\delta\alpha_k}\sum_{u=1}^{t-1}\gamma^{(k,u)}. \tag{25e}$$

Figure 9: The ℓ1 norm of the difference between the NTK weight $w^{(k,t)}$ and the gradient descent weight $\tilde{w}^{(k,t)}$ on the (a) Fashion-MNIST dataset and (b) FEMNIST dataset. For the analytical upper bound, we calculate the term $\sum_u \gamma^{(k,u)}$ in (25e) and set the coefficient to 2. Both the theoretical result and real experiments show that the weight difference increases when more update steps are used.

The theoretical result indicates that the difference between the NTK weight $w^{(k,t)}$ and the gradient descent weight $\tilde{w}^{(k,t)}$ increases with the number of update steps $t$. We validate our theoretical result using real experiments and report the results in Figure 9. The consistent trend between the empirical weight difference and the analytical upper bound confirms our analysis that increasing the number of update steps enlarges the gap between $w^{(k,t)}$ and $\tilde{w}^{(k,t)}$. In the NTK evolution scheme, one cannot choose an arbitrarily large $t$ as the final number of update steps $t^{(k)}$.

D. Proof of Theorem 1

We first present some lemmas to facilitate the convergence analysis. We bound the perturbation of the kernel matrix $H^{(k,t)}$ in Lemma 3.

Lemma 3. Under Assumptions 1 to 2, if $\forall\, r \in [n]$, $\|v_r^{(k,t)} - v_r^{(k)}\|_2 \le R$, then
$$\big\|H^{(k,t)} - H^{(k)}\big\|_2 \le \frac{2\sqrt{2}\,N_k R}{\sqrt{\pi}\,\delta\alpha_k}. \tag{26}$$

Proof. We have
$$\big\|H^{(k,t)} - H^{(k)}\big\|_2^2 \le \big\|H^{(k,t)} - H^{(k)}\big\|_F^2 \tag{27a}$$
$$= \sum_{i=1}^{N_k}\sum_{j=1}^{N_k}\Big[(H^{(k,t)})_{ij} - (H^{(k)})_{ij}\Big]^2 \tag{27b}$$
$$= \frac{1}{n^2}\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}(x_i^\top x_j)^2\bigg(\sum_{r=1}^{n}\mathbb{1}_{ir}^{(k,t)}\mathbb{1}_{jr}^{(k,t)} - \mathbb{1}_{ir}^{(k)}\mathbb{1}_{jr}^{(k)}\bigg)^2. \tag{27c}$$
Consider the event $A_{ir}$ defined in (19). Let $\varphi_{ijr}^{(k,t)} \triangleq \mathbb{1}_{ir}^{(k,t)}\mathbb{1}_{jr}^{(k,t)} - \mathbb{1}_{ir}^{(k)}\mathbb{1}_{jr}^{(k)}$.
If neither $A_{ir}$ nor $A_{jr}$ happens, clearly we have $|\varphi_{ijr}^{(k,t)}| = 0$. Therefore, the expectation of $|\varphi_{ijr}^{(k,t)}|$ can be bounded as
$$\mathbb{E}\big[|\varphi_{ijr}^{(k,t)}|\big] \le \mathbb{P}(A_{ir} \cup A_{jr}) \cdot 1 \tag{28a}$$
$$\le \mathbb{P}(A_{ir}) + \mathbb{P}(A_{jr}) \tag{28b}$$
$$\le 2\sqrt{\frac{2}{\pi}}\,\frac{R}{\alpha_k}, \tag{28c}$$
where (28c) comes from (20b). By Markov's inequality, we have with probability at least $1-\delta$,
$$\big|\varphi_{ijr}^{(k,t)}\big| \le 2\sqrt{\frac{2}{\pi}}\,\frac{R}{\delta\alpha_k}. \tag{29}$$
Plugging (29) into (27c) yields
$$\big\|H^{(k,t)} - H^{(k)}\big\|_2^2 \le \frac{N_k^2}{n^2}\cdot\frac{8n^2R^2}{\pi\delta^2\alpha_k^2} = \frac{8N_k^2R^2}{\pi\delta^2\alpha_k^2}. \tag{30}$$
Taking the square root on both sides completes the proof.

Lemma 4. With probability at least $1-\delta$,
$$\big\|H^{(k)} - H^{(k)}_\infty\big\|_2 \le N_k\sqrt{\frac{\ln(2N_k^2/\delta)}{2n}}. \tag{31}$$

Proof. We have
$$\big\|H^{(k)} - H^{(k)}_\infty\big\|_2^2 \le \big\|H^{(k)} - H^{(k)}_\infty\big\|_F^2 = \sum_{i=1}^{N_k}\sum_{j=1}^{N_k}\Big[(H^{(k)})_{ij} - (H^{(k)}_\infty)_{ij}\Big]^2. \tag{32}$$
Note that $(H^{(k)})_{ij} = \frac{1}{n}\,x_i^\top x_j\sum_{r=1}^{n}\mathbb{1}_{ir}^{(k)}\mathbb{1}_{jr}^{(k)}$, so $(H^{(k)})_{ij} \in [-1, 1]$. By Hoeffding's inequality, we have with probability at least $1-\delta/N_k^2$,
$$\Big|(H^{(k)})_{ij} - (H^{(k)}_\infty)_{ij}\Big| \le \sqrt{\frac{\ln(2N_k^2/\delta)}{2n}}. \tag{33}$$
Applying the union bound over $i, j \in [N_k]$ yields
$$\big\|H^{(k)} - H^{(k)}_\infty\big\|_2 \le N_k\sqrt{\frac{\ln(2N_k^2/\delta)}{2n}}. \tag{34}$$
The proof is complete. Now we are going to prove Theorem 1.

Theorem 1. For the NTK-FL scheme under Assumptions 1 to 3, let the learning rate $\eta = O\big(\frac{\lambda_k}{N_k}\big)$ and the neural network width $n = \Omega\big(\frac{N_k^2}{\lambda_k^2}\ln\frac{N_k^2}{\delta}\big)$. Then, with probability at least $1-\delta$,
$$\big\|f^{(k+1)}(X^{(k)}) - y^{(k)}\big\|_2^2 \le \Big(1 - \frac{\eta\lambda_k}{N_k}\Big)^{t^{(k)}}\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2, \tag{35}$$
where $t^{(k)}$ is the number of update steps defined in (11a).

Proof. Taking the difference between successive neural network predictions yields
$$f^{(k,t+1)}(x_i) - f^{(k,t)}(x_i) = \frac{1}{\sqrt{n}}\sum_{r=1}^{n}\Big[c_r\sigma\big((v_r^{(k,t+1)})^\top x_i\big) - c_r\sigma\big((v_r^{(k,t)})^\top x_i\big)\Big]. \tag{36}$$
We decompose the difference term into the sum of $d_i^{\mathrm{I}}$ and $d_i^{\mathrm{II}}$, based on the set $S_i^{(k)}$ and its complement:
$$d_i^{\mathrm{I}} = \frac{1}{\sqrt{n}}\sum_{r\notin S_i^{(k)}}\Big[c_r\sigma\big((v_r^{(k,t+1)})^\top x_i\big) - c_r\sigma\big((v_r^{(k,t)})^\top x_i\big)\Big], \tag{37a}$$
$$d_i^{\mathrm{II}} = \frac{1}{\sqrt{n}}\sum_{r\in S_i^{(k)}}\Big[c_r\sigma\big((v_r^{(k,t+1)})^\top x_i\big) - c_r\sigma\big((v_r^{(k,t)})^\top x_i\big)\Big]. \tag{37b}$$
Consider the residual term
$$\big\|f^{(k,t+1)}(X^{(k)}) - y^{(k)}\big\|_2^2 \tag{38a}$$
$$= \big\|f^{(k,t+1)}(X^{(k)}) - f^{(k,t)}(X^{(k)}) + f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2 \tag{38b}$$
$$= \big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2 + 2\big\langle d^{\mathrm{I}} + d^{\mathrm{II}},\, f^{(k,t)}(X^{(k)}) - y^{(k)}\big\rangle + \big\|f^{(k,t+1)}(X^{(k)}) - f^{(k,t)}(X^{(k)})\big\|_2^2.$$
(38c)

We will give upper bounds for the inner product terms $\langle d^{\mathrm{I}}, f^{(k,t)}(X^{(k)}) - y^{(k)}\rangle$ and $\langle d^{\mathrm{II}}, f^{(k,t)}(X^{(k)}) - y^{(k)}\rangle$ and the difference term $\|f^{(k,t+1)}(X^{(k)}) - f^{(k,t)}(X^{(k)})\|_2^2$ separately. Based on the property of the set $S_i^{(k)}$, we have
$$d_i^{\mathrm{I}} = -\frac{\eta}{\sqrt{n}}\sum_{r\notin S_i^{(k)}} c_r\big\langle\nabla_{v_r}L,\, x_i\big\rangle\,\mathbb{1}_{ir}^{(k,t)} \tag{39a}$$
$$= -\frac{\eta}{nN_k}\sum_{j=1}^{N_k}\big(f^{(k,t)}(x_j) - y_j\big)\,x_j^\top x_i\sum_{r\notin S_i^{(k)}} c_r^2\,\mathbb{1}_{ir}^{(k,t)}\mathbb{1}_{jr}^{(k,t)} \tag{39b}$$
$$= -\frac{\eta}{N_k}\sum_{j=1}^{N_k}\big(f^{(k,t)}(x_j) - y_j\big)\Big[(H^{(k,t)})_{ij} - (H^{\perp(k,t)})_{ij}\Big], \tag{39c}$$
where $(H^{\perp(k,t)})_{ij}$ is defined as
$$(H^{\perp(k,t)})_{ij} \triangleq \frac{1}{n}\,x_i^\top x_j\sum_{r\in S_i^{(k)}}\mathbb{1}_{ir}^{(k,t)}\mathbb{1}_{jr}^{(k,t)}. \tag{40}$$
For the inner product term $\langle d^{\mathrm{I}}, f^{(k,t)}(X^{(k)}) - y^{(k)}\rangle$, we have
$$\big\langle d^{\mathrm{I}}, f^{(k,t)}(X^{(k)}) - y^{(k)}\big\rangle = -\frac{\eta}{N_k}\big(f^{(k,t)}(X^{(k)}) - y^{(k)}\big)^\top\big(H^{(k,t)} - H^{\perp(k,t)}\big)\big(f^{(k,t)}(X^{(k)}) - y^{(k)}\big). \tag{41}$$
Let $T_1$ and $T_2$ denote the following terms:
$$T_1 \triangleq \big(f^{(k,t)}(X^{(k)}) - y^{(k)}\big)^\top H^{(k,t)}\big(f^{(k,t)}(X^{(k)}) - y^{(k)}\big), \tag{42a}$$
$$T_2 \triangleq \big(f^{(k,t)}(X^{(k)}) - y^{(k)}\big)^\top H^{\perp(k,t)}\big(f^{(k,t)}(X^{(k)}) - y^{(k)}\big). \tag{42b}$$
With probability at least $1-\delta$, $T_1$ can be bounded as
$$T_1 = \big(f^{(k,t)}(X^{(k)}) - y^{(k)}\big)^\top\big(H^{(k,t)} - H^{(k)} + H^{(k)} - H^{(k)}_\infty + H^{(k)}_\infty\big)\big(f^{(k,t)}(X^{(k)}) - y^{(k)}\big) \tag{43a}$$
$$\ge -\big\|H^{(k,t)} - H^{(k)}\big\|_2\,\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2 - \big\|H^{(k)} - H^{(k)}_\infty\big\|_2\,\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2 + \lambda_k\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2 \tag{43b}$$
$$\ge \Bigg(\lambda_k - \frac{2\sqrt{2}\,N_kR}{\sqrt{\pi}\,\delta\alpha_k} - N_k\sqrt{\frac{\ln(2N_k^2/\delta)}{2n}}\Bigg)\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2, \tag{43c}$$
where (43c) comes from Lemma 3 and Lemma 4. To bound the term $T_2$, consider the ℓ2 norm of the matrix $H^{\perp(k,t)}$. With probability at least $1-\delta$, we have
$$\big\|H^{\perp(k,t)}\big\|_2 \le \big\|H^{\perp(k,t)}\big\|_F \tag{44a}$$
$$= \Bigg(\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}\bigg[\frac{1}{n}\,x_i^\top x_j\sum_{r\in S_i^{(k)}}\mathbb{1}_{ir}^{(k,t)}\mathbb{1}_{jr}^{(k,t)}\bigg]^2\Bigg)^{1/2} \le \Bigg(\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}\frac{|S_i^{(k)}|^2}{n^2}\Bigg)^{1/2} \tag{44b}$$
$$\le \sqrt{\frac{2}{\pi}}\,\frac{N_kR}{\delta\alpha_k}, \tag{44c}$$
where (44c) comes from Lemma 1. Therefore, with probability at least $1-\delta$, we have
$$T_2 \le \sqrt{\frac{2}{\pi}}\,\frac{N_kR}{\delta\alpha_k}\,\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{45}$$
Combining the results of (43c) and (45),
$$\big\langle d^{\mathrm{I}}, f^{(k,t)}(X^{(k)}) - y^{(k)}\big\rangle \le -\eta\Bigg(\frac{\lambda_k}{N_k} - \frac{3\sqrt{2}\,R}{\sqrt{\pi}\,\delta\alpha_k} - \sqrt{\frac{\ln(2N_k^2/\delta)}{2n}}\Bigg)\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2.$$
(46)

For the inner product term $\langle d^{\mathrm{II}}, f^{(k,t)}(X^{(k)}) - y^{(k)}\rangle$, we first bound $\|d^{\mathrm{II}}\|_2^2$ as follows:
$$\big\|d^{\mathrm{II}}\big\|_2^2 = \sum_{i=1}^{N_k}\Bigg[\frac{1}{\sqrt{n}}\sum_{r\in S_i^{(k)}}\Big(c_r\sigma\big((v_r^{(k,t+1)})^\top x_i\big) - c_r\sigma\big((v_r^{(k,t)})^\top x_i\big)\Big)\Bigg]^2 \tag{47a}$$
$$\le \frac{\eta^2}{n}\sum_{i=1}^{N_k}\big|S_i^{(k)}\big|\sum_{r\in S_i^{(k)}}\big(c_r\langle\nabla_{v_r}L,\, x_i\rangle\big)^2 \tag{47b}$$
$$\le \frac{\eta^2}{n}\sum_{i=1}^{N_k}\big|S_i^{(k)}\big|\sum_{r\in S_i^{(k)}}\big\|\nabla_{v_r}L\big\|_2^2\,\|x_i\|_2^2 \tag{47c}$$
$$\le \frac{\eta^2}{n}\sum_{i=1}^{N_k}\big|S_i^{(k)}\big|^2\max_{r\in[n]}\big\|\nabla_{v_r}L\big\|_2^2 \tag{47d}$$
$$\le \frac{\eta^2\max_{i}|S_i^{(k)}|^2}{n^2}\,\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2, \tag{47e}$$
where (47b) comes from the Lipschitz continuity of the ReLU function $\sigma(\cdot)$, and (47c) holds due to the Cauchy–Schwarz inequality. Plugging (21) into (47e), we have with probability at least $1-\delta$,
$$\big\|d^{\mathrm{II}}\big\|_2^2 \le \frac{2\eta^2R^2}{\pi\delta^2\alpha_k^2}\,\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{48}$$
The inner product term $\langle d^{\mathrm{II}}, f^{(k,t)}(X^{(k)}) - y^{(k)}\rangle$ can thus be bounded as
$$\big\langle d^{\mathrm{II}}, f^{(k,t)}(X^{(k)}) - y^{(k)}\big\rangle \le \sqrt{\frac{2}{\pi}}\,\frac{\eta R}{\delta\alpha_k}\,\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{49}$$
Finally, the bound for the difference term is derived as
$$\big\|f^{(k,t+1)}(X^{(k)}) - f^{(k,t)}(X^{(k)})\big\|_2^2 = \sum_{i=1}^{N_k}\Bigg[\frac{1}{\sqrt{n}}\sum_{r=1}^{n}c_r\big\langle v_r^{(k,t+1)} - v_r^{(k,t)},\, x_i\big\rangle\,\mathbb{1}_{ir}\Bigg]^2 \le \eta^2\,\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{50}$$
Combining the results of (46), (49), and (50),
$$\big\|f^{(k,t+1)}(X^{(k)}) - y^{(k)}\big\|_2^2 \le \Bigg[1 - \frac{2\eta\lambda_k}{N_k} + \frac{8\sqrt{2}\,\eta R}{\sqrt{\pi}\,\delta\alpha_k} + 2\eta\sqrt{\frac{\ln(2N_k^2/\delta)}{2n}} + \eta^2\Bigg]\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{51}$$
Letting $R = O\big(\frac{\delta\alpha_k\lambda_k}{N_k}\big)$, $n = \Omega\big(\frac{N_k^2}{\lambda_k^2}\ln\frac{N_k^2}{\delta}\big)$, and $\eta = O\big(\frac{\lambda_k}{N_k}\big)$, we have
$$\big\|f^{(k,t+1)}(X^{(k)}) - y^{(k)}\big\|_2^2 \le \Big(1 - \frac{\eta\lambda_k}{N_k}\Big)\big\|f^{(k,t)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{52}$$
Invoking the inequality (52) recursively completes the proof.

E. Proof of Theorem 2

Theorem 2. For FedAvg under Assumptions 1 to 3, let the learning rate $\eta = O\big(\frac{\lambda_k}{\tau N_k M_k}\big)$ and the neural network width $n = \Omega\big(\frac{N_k^2}{\lambda_k^2}\ln\frac{N_k^2}{\delta}\big)$. Then, with probability at least $1-\delta$,
$$\big\|f^{(k+1)}(X^{(k)}) - y^{(k)}\big\|_2^2 \le \Big(1 - \frac{\eta\tau\lambda_k}{2N_kM_k}\Big)\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2, \tag{53}$$
where $\tau$ is the number of local iterations, and $M_k$ is the number of clients in communication round $k$.

Proof. We first construct a different set of kernel matrices $\{\Lambda^{(k,u)}, \Lambda_m^{(k,u)}\}$, similar to Huang et al. (2021). Let $\mathbb{1}_{imr}^{(k,u)} \triangleq \mathbb{1}\{\langle v_{m,r}^{(k,u)}, x_i\rangle \ge 0\}$. The $(i,j)$th entries of $\Lambda_m^{(k,u)}$ and $\Lambda^{(k,u)}$ are defined as
$$(\Lambda_m^{(k,u)})_{ij} \triangleq \frac{1}{n}\,x_i^\top x_j\sum_{r=1}^{n}\mathbb{1}_{imr}^{(k,0)}\mathbb{1}_{jmr}^{(k,u)}, \tag{54a}$$
$$(\Lambda^{(k,u)})_{ij} \triangleq (\Lambda_m^{(k,u)})_{ij}, \quad \text{if } (x_j, y_j) \in \mathcal{D}_m.$$
(54b)

Taking the difference between successive terms yields
$$f^{(k+1)}(x_i) - f^{(k)}(x_i) = \frac{1}{\sqrt{n}}\sum_{r=1}^{n}\Big[c_r\sigma\big((v_r^{(k+1)})^\top x_i\big) - c_r\sigma\big((v_r^{(k)})^\top x_i\big)\Big]. \tag{55}$$
We decompose the difference term into the sum of $d_i^{\mathrm{I}}$ and $d_i^{\mathrm{II}}$, based on the set $S_i^{(k)}$ and its complement:
$$d_i^{\mathrm{I}} = \frac{1}{\sqrt{n}}\sum_{r\notin S_i^{(k)}}\Big[c_r\sigma\big((v_r^{(k+1)})^\top x_i\big) - c_r\sigma\big((v_r^{(k)})^\top x_i\big)\Big], \tag{56a}$$
$$d_i^{\mathrm{II}} = \frac{1}{\sqrt{n}}\sum_{r\in S_i^{(k)}}\Big[c_r\sigma\big((v_r^{(k+1)})^\top x_i\big) - c_r\sigma\big((v_r^{(k)})^\top x_i\big)\Big]. \tag{56b}$$
Consider the residual term
$$\big\|f^{(k+1)}(X^{(k)}) - y^{(k)}\big\|_2^2 \tag{57a}$$
$$= \big\|f^{(k+1)}(X^{(k)}) - f^{(k)}(X^{(k)}) + f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2 \tag{57b}$$
$$= \big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2 + 2\big\langle d^{\mathrm{I}} + d^{\mathrm{II}},\, f^{(k)}(X^{(k)}) - y^{(k)}\big\rangle + \big\|f^{(k+1)}(X^{(k)}) - f^{(k)}(X^{(k)})\big\|_2^2. \tag{57c}$$
We will give upper bounds for the inner product terms $\langle d^{\mathrm{I}}, f^{(k)}(X^{(k)}) - y^{(k)}\rangle$ and $\langle d^{\mathrm{II}}, f^{(k)}(X^{(k)}) - y^{(k)}\rangle$ and the difference term $\|f^{(k+1)}(X^{(k)}) - f^{(k)}(X^{(k)})\|_2^2$ separately. For an input $x \in \mathbb{R}^{d_1}$, let $f_m^{(k,u)}(x) \triangleq \frac{1}{\sqrt{n}}\sum_{r=1}^{n}c_r\sigma\big(\langle v_{m,r}^{(k,u)}, x\rangle\big)$. By the update rule of FedAvg, the relation between the weight vector $v_r^{(k)}$ in successive communication rounds is
$$v_r^{(k+1)} = v_r^{(k)} - \frac{\eta}{M_k}\sum_{m\in\mathcal{C}_k}\sum_{u=0}^{\tau-1}\nabla_{v_{m,r}^{(k,u)}}L \tag{58a}$$
$$= v_r^{(k)} - \frac{\eta c_r}{\sqrt{n}\,N_kM_k}\sum_{m\in\mathcal{C}_k}\sum_{u=0}^{\tau-1}\sum_{j\in I_m}\big(f_m^{(k,u)}(x_j) - y_j\big)\,x_j\,\mathbb{1}_{jmr}^{(k,u)}. \tag{58b}$$
Based on the property of the set $S_i^{(k)}$, we have
$$d_i^{\mathrm{I}} = \frac{1}{\sqrt{n}}\sum_{r\notin S_i^{(k)}} c_r\big\langle v_r^{(k+1)} - v_r^{(k)},\, x_i\big\rangle\,\mathbb{1}_{ir}^{(k)} \tag{59a}$$
$$= -\frac{\eta}{nN_kM_k}\sum_{m\in\mathcal{C}_k}\sum_{u=0}^{\tau-1}\sum_{j\in I_m}\big(f_m^{(k,u)}(x_j) - y_j\big)\,x_i^\top x_j\sum_{r\notin S_i^{(k)}}\mathbb{1}_{ir}^{(k)}\mathbb{1}_{jmr}^{(k,u)} \tag{59b}$$
$$= -\frac{\eta}{N_kM_k}\sum_{m\in\mathcal{C}_k}\sum_{u=0}^{\tau-1}\sum_{j\in I_m}\big(f_m^{(k,u)}(x_j) - y_j\big)\Big[(\Lambda_m^{(k,u)})_{ij} - (\Lambda_m^{\perp(k,u)})_{ij}\Big]. \tag{59c}$$
For the inner product term $\langle d^{\mathrm{I}}, f^{(k)}(X^{(k)}) - y^{(k)}\rangle$, we have
$$\big\langle d^{\mathrm{I}}, f^{(k)}(X^{(k)}) - y^{(k)}\big\rangle = -\frac{\eta}{N_kM_k}\sum_{u=0}^{\tau-1}\big(f^{(k)}(X^{(k)}) - y^{(k)}\big)^\top\big(\Lambda^{(k,u)} - \Lambda^{\perp(k,u)}\big)\big(f_g^{(k,u)}(X^{(k)}) - y^{(k)}\big). \tag{60}$$
Let $T_1$ and $T_2$ denote the following terms:
$$T_1 \triangleq \big(f^{(k)}(X^{(k)}) - y^{(k)}\big)^\top\Lambda^{(k,u)}\big(f_g^{(k,u)}(X^{(k)}) - y^{(k)}\big), \tag{61a}$$
$$T_2 \triangleq \big(f^{(k)}(X^{(k)}) - y^{(k)}\big)^\top\Lambda^{\perp(k,u)}\big(f_g^{(k,u)}(X^{(k)}) - y^{(k)}\big), \tag{61b}$$
where $f_g^{(k,u)}(X^{(k)}) \triangleq \big[f_1^{(k,u)}(X_1)^\top, \ldots, f_{M_k}^{(k,u)}(X_{M_k})^\top\big]^\top$. We are going to bound $T_1$ and $T_2$ separately.
$T_1$ can be written as
$$T_1 = \big(f^{(k)}(X^{(k)}) - y^{(k)}\big)^\top\big(\Lambda^{(k,u)} - H^{(k)} + H^{(k)} - H^{(k)}_\infty + H^{(k)}_\infty\big)\big(f_g^{(k,u)}(X^{(k)}) - y^{(k)}\big) \tag{62a}$$
$$= \big(f^{(k)}(X^{(k)}) - y^{(k)}\big)^\top\big(\Lambda^{(k,u)} - H^{(k)}\big)\big(f_g^{(k,u)}(X^{(k)}) - y^{(k)}\big) + \big(f^{(k)}(X^{(k)}) - y^{(k)}\big)^\top\big(H^{(k)} - H^{(k)}_\infty\big)\big(f_g^{(k,u)}(X^{(k)}) - y^{(k)}\big) + \big(f^{(k)}(X^{(k)}) - y^{(k)}\big)^\top H^{(k)}_\infty\big(f^{(k)}(X^{(k)}) - y^{(k)}\big) + \big(f^{(k)}(X^{(k)}) - y^{(k)}\big)^\top H^{(k)}_\infty\big(f_g^{(k,u)}(X^{(k)}) - f^{(k)}(X^{(k)})\big). \tag{62b}$$
First, we bound the norm of $f_g^{(k,u)}(X^{(k)}) - y^{(k)}$. It can be shown that
$$\big\|f_m^{(k,u)}(X_m) - y_m\big\|_2 = \big\|f_m^{(k,u)}(X_m) - f_m^{(k,u-1)}(X_m) + f_m^{(k,u-1)}(X_m) - y_m\big\|_2 \tag{63a}$$
$$\le \big\|f_m^{(k,u)}(X_m) - f_m^{(k,u-1)}(X_m)\big\|_2 + \big\|f_m^{(k,u-1)}(X_m) - y_m\big\|_2 \tag{63b}$$
$$\le (1+\eta)\,\big\|f_m^{(k,u-1)}(X_m) - y_m\big\|_2, \tag{63c}$$
where (63c) holds based on the derivation of (50). Applying (63c) recursively yields
$$\big\|f_m^{(k,u)}(X_m) - y_m\big\|_2 \le (1+\eta)^u\big\|f^{(k)}(X_m) - y_m\big\|_2. \tag{64}$$
The bound for $\|f_g^{(k,u)}(X^{(k)}) - y^{(k)}\|_2^2$ can thus be derived as
$$\big\|f_g^{(k,u)}(X^{(k)}) - y^{(k)}\big\|_2^2 = \sum_{i=1}^{N_k}\big[f_g^{(k,u)}(x_i) - y_i\big]^2 \tag{65a}$$
$$= \sum_{m\in\mathcal{C}_k}\big\|f_m^{(k,u)}(X_m) - y_m\big\|_2^2 \tag{65b}$$
$$\le (1+\eta)^{2u}\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{65c}$$
Second, following the steps in Lemma 3, it can be shown that with probability at least $1-\delta$,
$$\big\|\Lambda^{(k,u)} - H^{(k)}\big\|_2 \le \frac{2\sqrt{2}\,N_kR}{\sqrt{\pi}\,\delta\alpha_k}. \tag{66}$$
We also bound the difference between $f_g^{(k,u)}(X^{(k)})$ and $f^{(k)}(X^{(k)})$ as follows:
$$\big\|f_g^{(k,u)}(X^{(k)}) - f^{(k)}(X^{(k)})\big\|_2 \le \sum_{v=1}^{u}\big\|f_g^{(k,v)}(X^{(k)}) - f_g^{(k,v-1)}(X^{(k)})\big\|_2 \tag{67a}$$
$$\le \sum_{v=1}^{u}\eta\,\big\|f_g^{(k,v-1)}(X^{(k)}) - y^{(k)}\big\|_2 \tag{67b}$$
$$\le \sum_{v=1}^{u}\eta(1+\eta)^{v-1}\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2 \tag{67c}$$
$$= \big[(1+\eta)^u - 1\big]\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2, \tag{67d}$$
where (67a) holds due to the triangle inequality, (67b) comes from (50), and (67c) comes from (65c). Plugging the results from (65c), (66), and (67d) into (62b), we have with probability at least $1-\delta$,
$$T_1 \ge \lambda_k\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2 - (1+\eta)^u\Bigg[\frac{2\sqrt{2}\,N_kR}{\sqrt{\pi}\,\delta\alpha_k} + N_k\sqrt{\frac{\ln(2N_k^2/\delta)}{2n}}\Bigg]\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2 - \kappa\lambda_k\big[(1+\eta)^u - 1\big]\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2, \tag{68}$$
where $\kappa$ is the condition number of the matrix $H^{(k)}_\infty$. Next, consider the bound for $T_2$. The ℓ2 norm of $\Lambda^{\perp(k,u)}$ can be bounded as
$$\big\|\Lambda^{\perp(k,u)}\big\|_2 \le \big\|\Lambda^{\perp(k,u)}\big\|_F \tag{69a}$$
$$= \Bigg(\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}\bigg[\frac{1}{n}\,x_i^\top x_j\sum_{r\in S_i^{(k)}}\mathbb{1}_{ir}^{(k)}\mathbb{1}_{jmr}^{(k,u)}\bigg]^2\Bigg)^{1/2} \le \Bigg(\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}\frac{|S_i^{(k)}|^2}{n^2}\Bigg)^{1/2} \le \sqrt{\frac{2}{\pi}}\,\frac{N_kR}{\delta\alpha_k}, \tag{69b}$$
where the last step comes from Lemma 1. Therefore, we have with probability at least $1-\delta$,
$$T_2 \le (1+\eta)^u\sqrt{\frac{2}{\pi}}\,\frac{N_kR}{\delta\alpha_k}\,\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2.$$
(70)

Combining the results of (68) and (70),
$$\big\langle d^{\mathrm{I}}, f^{(k)}(X^{(k)}) - y^{(k)}\big\rangle \le -\frac{\eta\tau\lambda_k}{N_kM_k}\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2 + \frac{\tau}{M_k}\Big[\eta + \frac{\tau-1}{2}\eta^2 + o(\eta^2)\Big]\Bigg[\frac{3\sqrt{2}\,R}{\sqrt{\pi}\,\delta\alpha_k} + \sqrt{\frac{\ln(2N_k^2/\delta)}{2n}}\Bigg]\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2 + \Big[\frac{\kappa\lambda_k\tau(\tau-1)}{2N_kM_k}\eta^2 + o(\eta^2)\Big]\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{71}$$
For the inner product term $\langle d^{\mathrm{II}}, f^{(k)}(X^{(k)}) - y^{(k)}\rangle$, we first bound $\|d^{\mathrm{II}}\|_2^2$ with probability at least $1-\delta$:
$$\big\|d^{\mathrm{II}}\big\|_2^2 = \sum_{i=1}^{N_k}\Bigg[\frac{1}{\sqrt{n}}\sum_{r\in S_i^{(k)}}\Big(c_r\sigma\big((v_r^{(k+1)})^\top x_i\big) - c_r\sigma\big((v_r^{(k)})^\top x_i\big)\Big)\Bigg]^2 \tag{72a}$$
$$\le \frac{1}{n}\sum_{i=1}^{N_k}\big|S_i^{(k)}\big|\sum_{r\in S_i^{(k)}}\big(c_r\big\langle v_r^{(k+1)} - v_r^{(k)},\, x_i\big\rangle\big)^2 \tag{72b}$$
$$\le \frac{1}{n}\sum_{i=1}^{N_k}\big|S_i^{(k)}\big|\sum_{r\in S_i^{(k)}}\Bigg(\frac{\eta c_r}{\sqrt{n}\,N_kM_k}\sum_{m\in\mathcal{C}_k}\sum_{u=0}^{\tau-1}\sum_{j\in I_m}\big(f_m^{(k,u)}(x_j) - y_j\big)\,\mathbb{1}_{jmr}^{(k,u)}\Bigg)^2 \tag{72c}$$
$$\le \frac{\eta^2}{N_k^2n^2M_k^2}\sum_{i=1}^{N_k}\big|S_i^{(k)}\big|\sum_{r\in S_i^{(k)}}\Bigg(\sum_{m\in\mathcal{C}_k}\sum_{u=0}^{\tau-1}\sum_{j\in I_m}\big|f_m^{(k,u)}(x_j) - y_j\big|\Bigg)^2 \tag{72d}$$
$$\le \frac{\eta^2}{N_k^2n^2M_k^2}\sum_{i=1}^{N_k}\big|S_i^{(k)}\big|\sum_{r\in S_i^{(k)}}\Bigg(\sum_{m\in\mathcal{C}_k}\sum_{u=0}^{\tau-1}\sqrt{|I_m|}\,\big\|f_m^{(k,u)}(X_m) - y_m\big\|_2\Bigg)^2. \tag{72e}$$
By applying the results from the previous steps, we have
$$\big\|d^{\mathrm{II}}\big\|_2^2 \le \frac{\eta^2}{N_k^2n^2M_k^2}\sum_{i=1}^{N_k}\big|S_i^{(k)}\big|\sum_{r\in S_i^{(k)}}\Bigg(\sum_{m\in\mathcal{C}_k}\sum_{u=0}^{\tau-1}(1+\eta)^u\sqrt{|I_m|}\,\big\|f^{(k)}(X_m) - y_m\big\|_2\Bigg)^2 \tag{73a}$$
$$\le \frac{2R^2}{\pi\delta^2\alpha_k^2M_k^2}\Big[\tau\eta + \frac{\tau(\tau-1)}{2}\eta^2 + o(\eta^2)\Big]^2\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2, \tag{73b}$$
where (73a) comes from (64), and (73b) follows from $\|a\|_1 \le \sqrt{\dim(a)}\,\|a\|_2$, Lemma 1, and the expansion $(1+\eta)^\tau - 1 = \tau\eta + \frac{\tau(\tau-1)}{2}\eta^2 + o(\eta^2)$. With probability at least $1-\delta$, the inner product term can thus be bounded as
$$\big\langle d^{\mathrm{II}}, f^{(k)}(X^{(k)}) - y^{(k)}\big\rangle \le \sqrt{\frac{2}{\pi}}\,\frac{R}{\delta\alpha_kM_k}\Big[\tau\eta + \frac{\tau(\tau-1)}{2}\eta^2 + o(\eta^2)\Big]\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{74}$$
The bound for the difference term is derived as
$$\big\|f^{(k+1)}(X^{(k)}) - f^{(k)}(X^{(k)})\big\|_2^2 = \sum_{i=1}^{N_k}\Bigg[\frac{1}{\sqrt{n}}\sum_{r=1}^{n}c_r\big\langle v_r^{(k+1)} - v_r^{(k)},\, x_i\big\rangle\,\mathbb{1}_{ir}\Bigg]^2 \tag{75a}$$
$$\le \frac{1}{M_k^2}\Big[\tau\eta + \frac{\tau(\tau-1)}{2}\eta^2 + o(\eta^2)\Big]^2\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{75b}$$
Combining the results of (71), (74), and (75b),
$$\big\|f^{(k+1)}(X^{(k)}) - y^{(k)}\big\|_2^2 \le \Bigg\{1 - \frac{2\eta\tau\lambda_k}{N_kM_k} + \frac{8\sqrt{2}\,\tau\eta R}{\sqrt{\pi}\,\delta\alpha_kM_k} + \frac{2\tau\eta}{M_k}\sqrt{\frac{\ln(2N_k^2/\delta)}{2n}} + \frac{\tau^2\eta^2}{M_k^2} + \frac{\kappa\lambda_k\tau(\tau-1)}{N_kM_k}\eta^2 + o(\eta^2)\Bigg\}\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2. \tag{76}$$
Letting $R = O\big(\frac{\delta\alpha_k\lambda_k}{N_k}\big)$, $n = \Omega\big(\frac{N_k^2}{\lambda_k^2}\ln\frac{N_k^2}{\delta}\big)$, and $\eta = O\big(\frac{\lambda_k}{\tau N_kM_k}\big)$, we have
$$\big\|f^{(k+1)}(X^{(k)}) - y^{(k)}\big\|_2^2 \le \Big(1 - \frac{\eta\tau\lambda_k}{2N_kM_k}\Big)\big\|f^{(k)}(X^{(k)}) - y^{(k)}\big\|_2^2, \tag{77}$$
which completes the proof.
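To make the trend behind Figure 9 (Appendix C) concrete, the sketch below trains the first layer of a two-layer ReLU network of the form (12), with fixed output weights, in two ways: plain gradient descent, which recomputes the Jacobian at every step, and the NTK evolution, which keeps the Jacobian frozen at the broadcast weights and evolves the linearized model. All sizes and names here are our own choices for illustration, not the paper's experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 20, 256, 32                        # input dim, width, sample count
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
y = rng.standard_normal(N)
c = rng.choice([-1.0, 1.0], size=n)          # fixed output weights c_r
V0 = rng.standard_normal((n, d))             # broadcast first-layer weights
eta = 0.05

def forward(V):
    """Two-layer ReLU network f(x) = (1/sqrt(n)) * sum_r c_r sigma(v_r^T x)."""
    return np.maximum(X @ V.T, 0.0) @ c / np.sqrt(n)

def jacobian(V):
    """Sample-wise Jacobian of f w.r.t. the first-layer weights, shape (N, n*d)."""
    act = (X @ V.T >= 0.0).astype(float)     # activation patterns 1{v_r^T x_i >= 0}
    return ((act * c)[:, :, None] * X[:, None, :] / np.sqrt(n)).reshape(N, -1)

J0 = jacobian(V0)                            # Jacobian frozen at the broadcast weights
V_gd = V0.copy()                             # gradient descent trajectory
w_ntk = V0.ravel().copy()                    # NTK-evolution trajectory
gaps = []
for t in range(200):
    # Gradient descent: Jacobian recomputed at the current weights.
    V_gd += eta * (jacobian(V_gd).T @ (y - forward(V_gd))).reshape(n, d)
    # NTK evolution: the residual follows the model linearized around V0,
    # and every update applies the frozen Jacobian J0.
    r_ntk = y - (forward(V0) + J0 @ (w_ntk - V0.ravel()))
    w_ntk += eta * J0.T @ r_ntk
    gaps.append(np.abs(w_ntk - V_gd.ravel()).sum())
```

In runs of this sketch the gap is exactly zero after the first step, since both methods apply the same Jacobian once, and it typically grows as activation patterns begin to flip, mirroring the trend in Figure 9.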