# Personalized Federated Learning with Gaussian Processes

Idan Achituve (Bar-Ilan University, Israel; idan.achituve@biu.ac.il), Aviv Shamsian (Bar-Ilan University, Israel; aviv.shamsian@biu.ac.il), Aviv Navon (Bar-Ilan University, Israel; aviv.navon@biu.ac.il), Gal Chechik (Bar-Ilan University and NVIDIA, Israel; gal.chechik@biu.ac.il), Ethan Fetaya (Bar-Ilan University, Israel; ethan.fetaya@biu.ac.il)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Federated learning aims to learn a global model that performs well on client devices with limited cross-client communication. Personalized federated learning (PFL) further extends this setup to handle data heterogeneity between clients by learning personalized models. A key challenge in this setting is to learn effectively across clients even though each client has unique data that is often limited in size. Here we present pFedGP, a solution to PFL that is based on Gaussian processes (GPs) with deep kernel learning. GPs are highly expressive models that work well in the low data regime due to their Bayesian nature. However, applying GPs to PFL raises multiple challenges. Mainly, GP performance depends heavily on access to a good kernel function, and learning a kernel requires a large training set. Therefore, we propose learning a shared kernel function across all clients, parameterized by a neural network, with a personal GP classifier for each client. We further extend pFedGP to include inducing points using two novel methods: the first improves generalization in the low data regime and the second reduces the computational cost. We derive a PAC-Bayes generalization bound on novel clients and empirically show that it gives non-vacuous guarantees. Extensive experiments on standard PFL benchmarks with CIFAR-10, CIFAR-100, and CINIC-10, and on a new setup of learning under input noise, show that pFedGP achieves well-calibrated predictions while significantly outperforming baseline methods, reaching an accuracy gain of up to 21%.

1 Introduction

In recent years, there is growing interest in applying learning in decentralized systems under the setup of federated learning (FL) [37, 51, 66]. In FL, a server node stores a global model and connects to multiple end-devices ("clients"), which have private data that cannot be shared. The goal is to learn the global model in a communication-efficient manner. However, learning a single shared model across all clients may perform poorly when the data distribution varies significantly across clients. Personalized federated learning (PFL) [67] addresses this challenge by jointly learning a personalized model for each client. While significant progress has been made in recent years, leading approaches still struggle in realistic scenarios. First, when the amount of data per client is limited, even though this is one of the original motivations behind federated learning [4, 51, 72]. Second, when the input distribution shifts between clients, which is often the case, as clients use different devices and sensors. Last, when we require well-calibrated predictions, an important demand in medical and other safety-critical applications.

Figure 1: pFedGP - learning a shared deep kernel function with client-specific GP models. Each client stores private data, possibly from a different distribution. The data is first mapped to an embedding space with a neural network shared across all clients.
Then, using common kernels, a GP is applied to the client's data for model learning and inference. We illustrate the per-client kernel matrix $k_\theta(x_i, x_j)$; bold cells indicate a stronger covariance.

Here, we show that Gaussian processes (GPs) with deep kernel learning (DKL) [80] are an effective alternative for handling these challenges. GPs have good predictive performance in a wide range of dataset sizes [2, 81], are robust to input noise [75], can adapt to shifts in the data distribution [48], and provide well-calibrated predictions [69]. While regression tasks are more natural for GPs, here we focus on classification tasks for consistency with common benchmarks and learning procedures in the field; our approach, however, is also applicable to regression.

Consider a naive approach that fits a separate GP classifier to each client based on its personal data. Its performance heavily depends on the quality of the kernel, and standard kernels tend to work poorly in domains such as images. A popular solution to this problem is deep kernel learning (DKL) [80], where a kernel is applied to features output by a neural network (NN). Unfortunately, GPs with DKL can overfit strongly, often even worse than standard NNs [56], and thus negate the main benefit of using a GP. We solve this issue by jointly learning a shared kernel function across clients. As the kernel captures similarities between inputs, a single kernel should work well across clients, while using a separate GP per client gives the flexibility required for personalization. We adapt a GP classifier recently proposed in [2], which uses the Pólya-Gamma augmentation [57] in a tree-structured model, to the federated setting. We term our method pFedGP.

We extend pFedGP by tailoring two inducing point (IP) methods [58, 70]. The first helps generalization in the low data regime and, unlike common inducing point methods, does not aim to reduce computational costs. The second focuses on reducing the computational cost to make our approach scalable and suitable for low-resource clients. We also adjust previous PAC-Bayes generalization bounds for GPs [60, 64] to include the Pólya-Gamma augmentation scheme. These bounds are suitable for cases where the kernel is not learned, such as when new clients arrive after the shared NN was already learned.

Therefore, this paper makes the following contributions: (i) we introduce pFedGP as a natural solution to PFL; (ii) we develop two IP methods to enhance GP classifiers that use the Pólya-Gamma augmentation scheme and integrate them with pFedGP; (iii) we derive a PAC-Bayes generalization bound on novel clients and show empirically that it gives meaningful guarantees; (iv) we achieve state-of-the-art results in a wide array of experiments, improving accuracy by up to 21%.¹

2 Related work

Federated learning. In FL, clients collaboratively solve a learning task while preserving data privacy and maintaining communication efficiency [1, 34, 42, 51, 54, 82, 85]. FedAvg [51] is an early but effective FL approach that updates models locally and averages them into a global model. Several optimization methods have been proposed for improving convergence in FL [41, 45, 71, 77]. Other approaches focus on preserving client privacy [3, 20, 52, 90], improving robustness to statistical diversity [26, 27, 31, 35, 87, 88], and reducing communication cost [13, 61]. These methods aim to learn a single global model across clients, limiting their ability to deal with heterogeneous datasets.
¹Our code is publicly available at https://github.com/IdanAchituve/pFedGP

Personalized federated learning. To overcome client heterogeneity, PFL aims to introduce some personalization for each client in the federation [39, 73]. Recent methods include adapting multi-task learning [18, 67], meta-learning approaches [6, 21, 22, 33, 43, 89], and model mixing, where clients learn a mixture of the global and local models [4, 17, 27, 44]. Other approaches utilize different regularization schemes to enforce soft parameter sharing [32, 72]. Personalization in FL has also been explored through clustering approaches in which similar clients within the federation have a greater effect on one another [49, 86]. Recently, [65] proposed learning a central hypernetwork that acts on client representation vectors to generate personalized models.

Bayesian FL. Some studies put forward a Bayesian treatment of the FL setup. [8, 12] used variational inference with Bayesian NNs. [76, 84] proposed a matching algorithm between local models based on the Beta-Bernoulli process to construct a global model. [14] extended Bayesian optimization to the FL setting via Thompson sampling; to scale the GP model they used random Fourier features, whereas we use inducing points. [83] proposed a federated learning framework that uses a global GP model for regression tasks, without DKL. Unlike that study, we focus on classification tasks with a personal GP classifier per client and advocate sharing information between clients through the kernel. [74] used GPs in a client selection strategy. In [36] an approach based on Stein variational gradient descent was suggested; this method does not scale beyond small-sized networks. [47] proposed a multivariate Gaussian product mechanism to aggregate local models. As we will show, this method is less suited when the data heterogeneity between clients is large.

Gaussian process classification. Unlike regression, in classification approximations must be used since the likelihood is not Gaussian [59]. Classic approaches include the Laplace approximation [79], expectation propagation [53], and least squares [62]. Recently, several methods were proposed based on the Pólya-Gamma augmentation [57] for modeling multinomial distributions [46], GP classification [23, 24, 78], few-shot learning [69], and incremental learning [2]. Here we build on the last approach. Classification with GPs is commonly done with variational inference techniques [30]; here we wish to exploit the conjugacy of the model and take Gibbs samples from the posterior. This approach yields well-calibrated [69] and more accurate [2] models.

3 Gaussian processes background

We first provide a brief introduction to the main components of our model. Detailed explanations are deferred to the Appendix. Scalars are denoted with lower-case letters (e.g., x), vectors with bold lower-case letters (e.g., x), and matrices with bold capital letters (e.g., X). In general, $y = [y_1, ..., y_N]^T$ is the vector of labels, and $X \in \mathbb{R}^{N \times d}$ is the design matrix with N data points whose i-th row is $x_i$.

Gaussian processes. GPs map input points to target output values via a random latent function f. f is assumed to follow a Gaussian process prior $f \sim GP(m(x), k(x, x'))$, where the evaluation vector of f on X, $f = [f(x_1), ..., f(x_N)]^T$, has a Gaussian distribution $f \sim N(\mu, K)$ with means $\mu_i = m(x_i)$ and covariance $K_{ij} = k(x_i, x_j)$. The mean m(x) is often set to the constant zero function, and the kernel k(x, x') is a positive semi-definite function.
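To make the notation concrete, the following minimal sketch (ours, not from the paper's code base; the RBF kernel and the toy inputs are illustrative choices) draws an evaluation vector $f \sim N(0, K)$ from a zero-mean GP prior:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2))
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

N, d = 20, 2
X = np.random.randn(N, d)                    # design matrix, one point per row
K = rbf_kernel(X, X) + 1e-6 * np.eye(N)      # small jitter for numerical stability
f = np.random.multivariate_normal(np.zeros(N), K)  # f ~ N(0, K), zero-mean prior
```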
The target values are assumed to be independent when conditioned on f. For Gaussian process regression the likelihood is Gaussian, $p(y|f) = N(f, \sigma^2 I)$. Therefore, the posterior $p(f|y, X)$ is also Gaussian, and both the marginal and the predictive distributions have known analytic expressions. This is one of the main motivations behind using GPs, as most other Bayesian models have intractable inference. Unfortunately, for Gaussian process classification (GPC) the likelihood $p(y|f)$ is not Gaussian and the posterior does not admit a closed-form expression. One approach for applying GPs to binary classification tasks is the Pólya-Gamma augmentation [57]. Using this approach, we can augment the GP model with random variables ω from a Pólya-Gamma distribution, one for each example. As a result, $p(f|y, X, \omega)$ is a Gaussian density and $p(\omega|y, X, f)$ is a Pólya-Gamma density. This allows using Gibbs sampling to efficiently sample from the posterior $p(f, \omega|y, X)$ for inference and prediction. A key advantage of the Pólya-Gamma augmentation is that it mixes fast, and even a single value of ω can capture much of the volume of the marginal distribution over function values [46]. Full equations and further details on the Pólya-Gamma augmentation scheme are given in Appendix A.1.

Deep kernel learning (DKL). The quality of the GP model heavily depends on the kernel function $k(x_i, x_j)$. For many data modalities, such as images, common kernels are not a good measure of semantic similarity. Therefore, in [10, 80] standard kernels are applied over features output by a neural network $g_\theta$; for example, the RBF kernel $k_\theta(x_i, x_j) = \exp\left(-\frac{\|g_\theta(x_i) - g_\theta(x_j)\|^2}{2\ell^2}\right)$. In regression, it is possible to backpropagate directly through the GP inference as it is given in closed form. In our case, we use Fisher's identity [19] to obtain stochastic gradients [69].

Inducing points. GPs require storing and inverting a kernel matrix on the entire training set, which often limits their usage. A common solution to this problem is inducing point methods [58, 70]. The key idea is to replace the exact kernel with an approximation for fast computation. Usually, $M \ll N$ pseudo-inputs are learned such that the main computational bottleneck is inverting M × M matrices.

GP-Tree. We build on GP-Tree [2], a recent GP classifier that was shown to scale well with dataset size and the number of classes. GP-Tree turns the multi-class classification problem into a sequence of binary decisions along the tree nodes. Each node in the tree fits a binary GP classifier based on the Pólya-Gamma augmentation scheme and the data associated with that node. The leaf nodes correspond to the classes in the dataset. The tree is constructed by first computing a prototype for each class and then recursively performing divisive hierarchical clustering of these prototypes into two clusters at each node. Further details are given in Appendix A.2.

4 pFedGP: federated learning with Gaussian processes

We now describe our approach for applying personalized federated learning (PFL) with Gaussian processes. First, we extend GP-Tree to the FL setup and show how to use Gibbs sampling to learn the NN parameters. Then, we present two alternatives to this method that use inducing points. The first is for extremely limited-size datasets, while the second allows controlling the computational resources. We name our method pFedGP. An illustration of our method is given in Figure 1, and a minimal sketch of the per-node Gibbs sampler underlying all variants follows below.
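The following is a minimal, self-contained sketch of the per-node machinery: Gibbs sampling for binary GP classification with the Pólya-Gamma augmentation (Section 3). It is our illustration, not the paper's implementation; it assumes the kernel matrix K is already computed (e.g., by a deep kernel), and it approximates PG(1, c) draws with a truncated version of the distribution's infinite sum-of-Gammas representation rather than an exact sampler.

```python
import numpy as np

def sample_pg_one(c, n_terms=200):
    """Approximate draws from PG(1, c), elementwise over the vector c, via the
    truncated representation PG(b, c) = (1/(2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + (c/(2 pi))^2),
    with g_k ~ Gamma(b, 1) and b = 1."""
    k = np.arange(1, n_terms + 1)[:, None]                       # (n_terms, 1)
    g = np.random.gamma(1.0, 1.0, size=(n_terms, c.shape[0]))
    return (g / ((k - 0.5) ** 2 + (c / (2.0 * np.pi)) ** 2)).sum(0) / (2.0 * np.pi ** 2)

def gibbs_gpc(K, y, n_steps=50):
    """Gibbs sampler for binary GPC: alternates omega | f ~ PG(1, f_i) and
    f | omega, y ~ N(Sigma @ kappa, Sigma), Sigma = (K^-1 + Omega)^-1, kappa_i = y_i - 1/2."""
    N = K.shape[0]
    kappa = y - 0.5
    K_inv = np.linalg.inv(K + 1e-6 * np.eye(N))
    f = np.zeros(N)
    for _ in range(n_steps):
        omega = sample_pg_one(f)                          # Polya-Gamma conditional
        Sigma = np.linalg.inv(K_inv + np.diag(omega))     # Gaussian conditional of f
        f = np.random.multivariate_normal(Sigma @ kappa, Sigma)
    return f, omega
```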
4.1 A full GP model

The training procedure follows the standard protocol in this field [4, 44, 51]. We assume the existence of a server that holds the shared parameters θ (a NN). Let C denote the set of clients. For each client c ∈ C we denote by $D_c$ its local dataset of size $N_c$. At each training iteration (round) the model is sent to S clients to perform kernel learning ($|S| \ll |C|$). Each client c ∈ S updates its copy of the global model and then sends the updated model to the server. The server then averages over the updates to obtain a new global model.

At each client c, we perform kernel learning in the following manner. We first compute the feature representation of the data samples associated with the client using the shared network. Then, we build the hierarchical classification tree as discussed in Section 3 and Appendix A.2. In [2] the tree was built only once after a pre-training stage and the model parameters were learned using a variational inference approach. Here, we re-build the tree at each round using the most recent features and use a Gibbs sampling procedure, as it allows this flexibility in building the tree and performs better when computational limitations are not prohibitive.

Learning the network parameters θ with the Gibbs sampling approach can be done with two common objectives: the marginal likelihood and the predictive distribution. We denote by $X_v$ the data associated with tree node v, i.e., the data points that have v on the path from the root node to their class leaf node. We denote by $y_v$ the binary labels of these points, i.e., whether their path goes left or right at this node, and by $\omega_v$ the Pólya-Gamma random variables associated with node v. The marginal likelihood term for the full hierarchical classification tree is the sum of the separate marginal likelihood terms of all the nodes v in the tree:

$$\mathcal{L}^{ML}_c(\theta; D_c) = \sum_v \log p_\theta(y_v|X_v) = \sum_v \log \int p_\theta(y_v|\omega_v, X_v)\, p(\omega_v)\, d\omega_v. \tag{1}$$

Similar to [69], we use a gradient estimator based on Fisher's identity [19]:

$$\nabla_\theta \mathcal{L}^{ML}_c(\theta; D_c) = \sum_v \int p_\theta(\omega_v|y_v, X_v)\, \nabla_\theta \log p_\theta(y_v|\omega_v, X_v)\, d\omega_v \approx \sum_v \frac{1}{L} \sum_{l=1}^{L} \nabla_\theta \log p_\theta(y_v|\omega_v^{(l)}, X_v). \tag{2}$$

Here, $\omega_v^{(1)}, ..., \omega_v^{(L)}$ are samples from the posterior at node v. Due to the Pólya-Gamma augmentation, $p_\theta(y_v|\omega_v^{(l)}, X_v)$ is proportional to a Gaussian density. The exact expression is given in Appendix A.2.

To use the predictive distribution as an objective, in each training iteration, after building the tree model, at each node we randomly draw a portion of the (node) training data and use it to predict the class labels of the remaining part. We denote by $X_v$ and $y_v$ the training portion, by $x_*$ and $y_*$ the input and label of the point we are predicting, and by $P_{y_*}$ the path from the root node to the $y_*$ leaf node (i.e., the original class). Here we also take advantage of the independence between nodes to maximize the predictive distribution per node individually. The predictive distribution for a single data point is:

$$\mathcal{L}^{PD}_c(\theta; x_*, y_*) = \sum_{v \in P_{y_*}} \log p_\theta(y_*^v|x_*, y_v, X_v) = \sum_{v \in P_{y_*}} \log \int p_\theta(y_*^v|\omega_v, x_*, y_v, X_v)\, p(\omega_v|y_v, X_v)\, d\omega_v. \tag{3}$$

We use an approximate gradient estimator based on posterior samples of ω:

$$\nabla_\theta \mathcal{L}^{PD}_c(\theta; x_*, y_*) \approx \sum_{v \in P_{y_*}} \frac{1}{L} \sum_{l=1}^{L} \nabla_\theta \log p_\theta(y_*^v|\omega_v^{(l)}, x_*, y_v, X_v). \tag{4}$$

Here $p_\theta(y_*^v|\omega_v^{(l)}, x_*, y_v, X_v) = \int p(y_*^v|f_*)\, p_\theta(f_*|\omega_v^{(l)}, x_*, y_v, X_v)\, df_*$ does not have an analytical expression, but $p_\theta(f_*|\omega_v^{(l)}, x_*, y_v, X_v) = \int p_\theta(f_*|f, x_*, X_v)\, p_\theta(f|\omega_v^{(l)}, y_v, X_v)\, df$ is Gaussian with known parameters. We then compute the predictive distribution by performing Gauss-Hermite integration over $f_*$. See the exact expression in Appendix A.2.
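As a concrete illustration of this last step, the sketch below (ours; it assumes a logistic likelihood, consistent with the Pólya-Gamma augmentation, and that the Gaussian posterior moments of $f_*$ have already been computed) approximates $p(y_* = 1) = \int \sigma(f_*)\, N(f_*|\mu_*, \sigma_*^2)\, df_*$ with Gauss-Hermite quadrature:

```python
import numpy as np

def predictive_prob(mu_star, var_star, n_points=20):
    """Gauss-Hermite approximation of E[sigmoid(f*)] under N(mu*, var*),
    via the change of variables f* = sqrt(2 * var*) * t + mu*."""
    t, w = np.polynomial.hermite.hermgauss(n_points)  # quadrature nodes and weights
    f = np.sqrt(2.0 * var_star) * t + mu_star
    return (w / np.sqrt(np.pi) * (1.0 / (1.0 + np.exp(-f)))).sum()
```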
4.2 Augmenting the model with inducing points: sample efficiency

The GP model described in Section 4.1 works well in most situations. However, when the number of data points per client is small, performance naturally degrades. To increase information sharing between clients and improve per-client performance, we suggest augmenting the model with global inducing points shared across clients. When sending the model from the server to a client, we also send the inducing inputs and their labels. To streamline optimization and reduce the communication burden, we define the inducing inputs in the feature space of the last embedding layer of the shared NN; therefore, their size will usually be negligible compared to the network size. We denote by $\bar{X}$ the learned inducing inputs and by $\bar{y}$ their fixed class labels, which are set evenly across classes. During training, we regard only the set of inducing inputs and labels $(\bar{X}, \bar{y})$ as the available (training) data and use them for posterior inference. More formally, we first compute $p_\theta(f|\bar{\omega}, \bar{y}, \bar{X}, X) = \int p_\theta(f|\bar{f}, \bar{X}, X)\, p_\theta(\bar{f}|\bar{\omega}, \bar{y}, \bar{X})\, d\bar{f}$ using its analytical expression for the actual training data, and then compute the probability of y using Gauss-Hermite integration. We then use Eqs. 3 & 4 to learn the network parameters and the inducing locations. At test time, to make full use of the training data, we combine the inducing inputs with the training data and use both to obtain the GP formulas and make predictions. We note that even when using only the inducing inputs at test time, the model performs remarkably well, despite having almost no personalization component; see Appendix E.8 for further discussion.

One potential issue with using IPs in this manner is that it distorts the true class distribution. As a result, the classifier may be more likely to predict a low-probability class at test time. We address this issue by adjusting the output distribution. In general, let p(y, x) and q(y, x) be two distributions that differ only in the class probabilities, i.e., p(x|y) = q(x|y). The predictive distribution then follows:

$$q(y_*|x_*) \propto q(x_*|y_*)\, q(y_*) = p(x_*|y_*)\, q(y_*) \propto \frac{q(y_*)}{p(y_*)}\, p(y_*|x_*). \tag{5}$$

We use this to correct the GP predictions to the original class ratios at each tree node. We found in our experiments that this correction generally improves the classifier performance on class-imbalanced data. As an example of this phenomenon, consider a binary classification problem with 90 examples from the first class and 10 examples from the second class (therefore, q(y = 0) = 0.9 and q(y = 1) = 0.1). Assume we defined 50 inducing inputs per class; at test time the model then sees 140 samples from the first class and 60 samples from the second class, which corresponds to probabilities p(y = 0) = 0.7 and p(y = 1) = 0.3.

4.3 Augmenting the model with inducing points: computational efficiency

Learning the full GP model described in Section 4.1 requires inverting a matrix of size $N_c$ in the worst case (at the root node), which has $O(N_c^3)$ run-time complexity and $O(N_c^2)$ memory complexity.

Algorithm 1: pFedGP. C - clients indexed by c; E - number of local epochs; |S| - number of sampled clients; M - number of inducing inputs per class.

Server executes:
    Initialize the shared network θ ← θ₀
    Initialize M inducing inputs per class for all classes in the system  # in pFedGP-IP variants only
    for each round t = 1, 2, ... do:
        Sample S clients uniformly at random
        for each client c ∈ S in parallel:
            θ_{t+1}^c, M_{t+1}^c ← ClientUpdate(θ_t, M_t)  # obtain updates from client c
        Update θ_{t+1}, M_{t+1} using the FedAvg [51] update rule

ClientUpdate(θ, M):
    for each local epoch e = 1, ..., E do:
        if e = 1: build the GP tree classifier using the personal dataset D_c
        Update θ, M with gradient-based optimization on D_c using L^{ML} or L^{PD}
    return θ, M
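A schematic Python rendering of Algorithm 1 may help fix the communication pattern. This is our sketch, not the released implementation: `build_gp_tree`, `tree_loss`, `gradient_step`, and `fed_average` are hypothetical placeholders for the tree construction, the L^{ML}/L^{PD} objective, a local optimizer step, and FedAvg aggregation, respectively.

```python
import copy
import random

def server_loop(clients, theta, inducing, n_rounds=1000, n_sampled=5):
    """Server side of Algorithm 1: sample clients, collect local updates,
    and aggregate them with a FedAvg-style average."""
    for _ in range(n_rounds):
        sampled = random.sample(clients, n_sampled)
        updates = [client_update(copy.deepcopy(theta), copy.deepcopy(inducing), c)
                   for c in sampled]
        theta = fed_average([u[0] for u in updates])     # hypothetical FedAvg helper
        inducing = fed_average([u[1] for u in updates])  # pFedGP-IP variants only
    return theta, inducing

def client_update(theta, inducing, client, n_epochs=1):
    """Client side: rebuild the GP tree from current features, then take
    gradient steps on the marginal-likelihood or predictive objective."""
    tree = build_gp_tree(theta, client.data)             # hypothetical; rebuilt each round
    for _ in range(n_epochs):
        for batch in client.batches():
            loss = tree_loss(theta, inducing, tree, batch)       # L^ML or L^PD
            theta, inducing = gradient_step(loss, theta, inducing)
    return theta, inducing
```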
Therefore, we propose an additional procedure based on inducing points that allows reduced complexity in low-resource environments and scalability to larger dataset sizes. This variant is based on the fully independent training conditional (FITC) method [70]. The key idea is to cast all the dependence on the inducing points and assume independence between the latent function values given the inducing points. For brevity, we omit here the subscripts denoting the client and the tree node; however, all quantities and data points are those that belong to a specific client and tree node. Let $\bar{X} \in \mathbb{R}^{M \times d}$ denote the pseudo-inputs (defined in the embedding space of the last layer of the NN), and $\bar{f} \in \mathbb{R}^M$ the corresponding latent function values. Here as well, the inducing inputs are defined globally at the server level and are set evenly across classes. We assume the following GP prior:

$$p(f, \bar{f}) = N\left(0, \begin{bmatrix} K_{NN} & K_{NM} \\ K_{MN} & K_{MM} \end{bmatrix}\right),$$

where $K_{MM}$ is the kernel between the inducing inputs, $K_{NN}$ is a diagonal matrix between the actual training data, $K_{NM}$ is the kernel between the data and the inducing inputs, and we placed a zero-mean prior. The likelihood of the dataset when factoring in the inducing variables and the Pólya-Gamma variables (one per training sample), and the posterior over $\bar{f}$, both have known analytical expressions. We can then obtain the posterior and marginal distributions by marginalizing over $\bar{f}$:

$$p(f|X, y, \omega, \bar{X}) = N\left(f \,\middle|\, K_{NM} B^{-1} K_{MN} \Lambda^{-1} \Omega^{-1} \kappa,\; K_{NN} - K_{NM}\left(K_{MM}^{-1} - B^{-1}\right) K_{MN}\right), \tag{6}$$

$$p(y|X, \omega, \bar{X}) \propto N\left(\Omega^{-1} \kappa \,\middle|\, 0,\; \Lambda + K_{NM} K_{MM}^{-1} K_{MN}\right), \tag{7}$$

where $\Omega = \mathrm{diag}(\omega)$, $\kappa_j = y_j - 1/2$, $\Lambda = \Omega^{-1} + \mathrm{diag}(K_{NN} - K_{NM} K_{MM}^{-1} K_{MN})$, and $B = K_{MM} + K_{MN} \Lambda^{-1} K_{NM}$. Importantly, we only need to invert M × M or diagonal matrices. See the full derivation in Appendix B. At test time, we use $\bar{f}$ to get the posterior of $f_*$ and compute the predictive distribution. We can then use either the marginal or the predictive distribution to learn the shared NN parameters and the inducing locations. The complexity of this procedure is reduced to $O(M^2 N_c + M^3)$ in run-time and $O(M N_c + M^2)$ in memory. While the (conditional) independence assumption between the latent function values may be restrictive, we found this method to be comparable with the full GP alternative in our experiments. Potentially, this can be attributed to the effect of sharing the inducing inputs among clients and to the information that ω stores on f.
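The quantities in Eqs. 6 and 7 translate directly into code. The sketch below is our illustration, assuming the kernel blocks ($K_{MM}$, $K_{NM}$, and the diagonal of $K_{NN}$) and the Pólya-Gamma draws ω are given; for clarity it uses plain inverses on M × M matrices, where a production implementation would use Cholesky solves.

```python
import numpy as np

def fitc_posterior(K_MM, K_NM, k_NN_diag, omega, y, jitter=1e-6):
    """Posterior of f (Eq. 6) and the Gaussian defining the marginal (Eq. 7).
    Only M x M or diagonal matrices are inverted."""
    M = K_MM.shape[0]
    kappa = y - 0.5                         # kappa_j = y_j - 1/2
    Omega_inv = 1.0 / omega                 # Omega = diag(omega)
    K_MM_inv = np.linalg.inv(K_MM + jitter * np.eye(M))
    Q_NN_diag = np.einsum('nm,mk,nk->n', K_NM, K_MM_inv, K_NM)  # diag(K_NM K_MM^-1 K_MN)
    Lam = Omega_inv + (k_NN_diag - Q_NN_diag)                   # Lambda, diagonal (vector)
    B = K_MM + K_NM.T @ (K_NM / Lam[:, None])                   # B = K_MM + K_MN Lam^-1 K_NM
    B_inv = np.linalg.inv(B + jitter * np.eye(M))
    # Eq. 6: mean and covariance of f | X, y, omega, X_bar
    mean = K_NM @ (B_inv @ (K_NM.T @ (Omega_inv * kappa / Lam)))
    cov = np.diag(k_NN_diag) - K_NM @ (K_MM_inv - B_inv) @ K_NM.T
    # Eq. 7: the marginal is proportional to N(Omega^-1 kappa | 0, marg_cov)
    marg_cov = np.diag(Lam) + K_NM @ K_MM_inv @ K_NM.T
    return mean, cov, Omega_inv * kappa, marg_cov
```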
5 Generalization bound

It is reasonable to expect that new clients will arrive after the system has been learned. In such cases, we would like to use pFedGP without re-training the kernel function. Under this scenario, we can derive generalization bounds concerning only the GP classifier, without taking into account the fixed neural network, using the PAC-Bayes bound [50]. Having meaningful guarantees can be very important in safety-critical applications. The PAC-Bayes bound for GPC [64] (with the Gibbs risk) is:

Theorem 1. Given i.i.d. samples $D_c = \{(x_i, y_i)\}_{i=1}^{N_c}$ of size $N_c$ drawn from any data distribution over $\mathcal{X} \times \{-1, 1\}$, a GP posterior Q, and a GP prior P, the following bound holds, where the probability is over random data samples:

$$\Pr_{D_c}\left\{ R(Q) > R_{D_c}(Q) + \mathrm{KL}^{-1}_{\mathrm{ber}}\left(R_{D_c}(Q),\, \epsilon(\delta, N_c, P, Q)\right) \right\} \le \delta. \tag{8}$$

Here, we have

$$R(Q) = \mathbb{E}_{(x_*, y_*)}\left[\Pr_{f_* \sim Q(f_*|x_*, D_c)}\{\mathrm{sign}\, f_* \neq y_*\}\right], \quad R_{D_c}(Q) = \frac{1}{N_c} \sum_{i=1}^{N_c} \Pr_{f_i \sim Q(f_i|D_c)}\{\mathrm{sign}\, f_i \neq y_i\},$$

$$\epsilon(\delta, N_c, P, Q) = \frac{1}{N_c}\left(\mathrm{KL}[Q \,\|\, P] + \log\frac{N_c + 1}{\delta}\right), \quad \mathrm{KL}^{-1}_{\mathrm{ber}}(q, \epsilon) = \max\left\{p \in [0, 1] : \mathrm{KL}_{\mathrm{ber}}[q \,\|\, p] \le \epsilon\right\}.$$

An important observation in [64] is that the KL-divergence between the posterior and prior Gaussian processes is equivalent to the KL-divergence between the posterior and prior distributions of their values on the $N_c$ training samples. While [64] assumed Q to be Gaussian, this observation still holds without that assumption. However, when Q is no longer Gaussian, as is the case here, $\mathrm{KL}[Q(f) \,\|\, P(f)]$ no longer has a closed-form expression. We can show that for the Pólya-Gamma augmentation:

$$\mathrm{KL}[Q(f) \,\|\, P(f)] = \mathbb{E}_{Q(\omega)}\{\mathrm{KL}[Q(f|\omega) \,\|\, P(f)]\} - \mathrm{MI}[f; \omega] = \mathbb{E}_{Q(\omega)}\{\mathrm{KL}[Q(f|\omega) \,\|\, P(f)]\} + \mathbb{E}_{Q(f, \omega)}\log\frac{Q(\omega)}{Q(\omega|f)},$$

where MI denotes the mutual information. Since $Q(f|\omega)$ and $P(f)$ are Gaussian, the term $\mathrm{KL}[Q(f|\omega) \,\|\, P(f)]$ has a closed-form expression, so we only need to perform a Monte Carlo approximation of the expectation over ω in the first term. In the second expectation, $Q(\omega)$ does not have a known expression. To estimate it, given samples $\{(\omega_i, f_i)\}_{i=1}^{N}$, we use $Q(\omega_i) \approx \frac{1}{N-1} \sum_{j \neq i} Q(\omega_i|f_j)$. Note that if the summation over j included $f_i$, it might result in a biased estimator. Further details on estimating $\mathrm{KL}[Q(f) \,\|\, P(f)]$ are given in Appendix C.

To assess the quality of the bound, we partitioned the CIFAR-10 dataset into 100 clients. We trained a shared network using our full-GP variant on 90 clients and then recorded the generalization bound and test error on the remaining 10 clients four times, each with a different training set size. Figure 2 shows the estimated generalization error bound (δ = 0.01) vs. the actual error on the novel clients with the Gibbs classifier.

Figure 2: Test error vs. an estimated upper bound over 10 clients with varying training set sizes. Each dot represents a combination of client and data size. In parentheses: the average difference between the empirical and the test error.

First, we observe that the bound is indeed greater than the actual test error for all points and that it is not vacuous; there is a strong correlation between the actual error and the bound. Second, unlike worst-case bounds (e.g., VC-dimension), this bound depends on the actual data and not only on the number of data points.

6 Experiments

We evaluated pFedGP against baseline methods in various learning setups. We present results for the following model variants: (i) pFedGP, the full GP model (Section 4.1); (ii) pFedGP-IP-data, the model with IPs described in Section 4.2; and (iii) pFedGP-IP-compute, the model with IPs described in Section 4.3. For pFedGP and pFedGP-IP-compute, the results obtained by maximizing the predictive and marginal likelihood were similar, with a slight advantage to the former. Therefore, we present here only the results for the predictive alternative and defer the results of the marginal alternative to the Appendix. Additional experiments, an ablation study, and further analyses are provided in Appendix E.
Table 1: Test accuracy (± SEM) over 50, 100, and 500 clients on CIFAR-10, CIFAR-100, and CINIC-10. The # samples/client row indicates the average number of training samples per client.

| Method | CIFAR-10 (50) | CIFAR-10 (100) | CIFAR-10 (500) | CIFAR-100 (50) | CIFAR-100 (100) | CIFAR-100 (500) | CINIC-10 (50) | CINIC-10 (100) | CINIC-10 (500) |
|---|---|---|---|---|---|---|---|---|---|
| # samples/client | 800 | 400 | 80 | 800 | 400 | 80 | 1800 | 900 | 180 |
| Local | 86.2 ± 0.2 | 82.9 ± 0.4 | 74.8 ± 0.5 | 52.1 ± 0.2 | 45.6 ± 0.3 | 30.9 ± 0.2 | 61.1 ± 0.3 | 56.9 ± 0.7 | 46.4 ± 0.1 |
| FedAvg [51] | 56.4 ± 0.5 | 59.7 ± 0.5 | 54.0 ± 0.5 | 23.6 ± 0.2 | 24.0 ± 0.2 | 20.4 ± 0.0 | 45.6 ± 0.4 | 44.7 ± 0.5 | 45.7 ± 0.5 |
| FOLA [47] | 55.9 ± 3.3 | 52.1 ± 3.1 | 45.9 ± 0.3 | 25.5 ± 1.5 | 22.4 ± 1.3 | 18.7 ± 0.1 | 45.2 ± 0.3 | 43.4 ± 0.3 | 38.3 ± 0.2 |
| FedPer [4] | 83.8 ± 0.8 | 81.5 ± 0.5 | 76.8 ± 1.2 | 48.3 ± 0.6 | 43.6 ± 0.2 | 25.6 ± 0.3 | 70.6 ± 0.2 | 68.4 ± 0.5 | 62.2 ± 0.05 |
| LG-FedAvg [44] | 87.9 ± 0.3 | 83.6 ± 0.7 | 64.7 ± 0.7 | 43.6 ± 0.2 | 37.5 ± 0.9 | 20.3 ± 0.5 | 59.5 ± 1.1 | 59.9 ± 2.1 | 52.5 ± 0.8 |
| pFedMe [72] | 86.4 ± 0.8 | 85.0 ± 0.3 | 80.3 ± 0.5 | 49.8 ± 0.5 | 47.7 ± 0.4 | 32.5 ± 0.8 | 69.9 ± 0.5 | 68.9 ± 0.7 | 58.8 ± 0.1 |
| FedU [18] | 80.6 ± 0.3 | 78.1 ± 0.5 | 65.6 ± 0.4 | 41.1 ± 0.2 | 36.0 ± 0.2 | 15.9 ± 0.4 | 59.3 ± 0.2 | 55.4 ± 0.6 | 41.6 ± 0.5 |
| pFedHN [65] | 90.2 ± 0.6 | 87.4 ± 0.2 | 83.2 ± 0.8 | 60.0 ± 1.0 | 52.3 ± 0.5 | 34.1 ± 0.1 | 70.4 ± 0.4 | 69.4 ± 0.5 | 64.2 ± 0.05 |
| pFedGP-IP-data (ours) | 88.6 ± 0.2 | 87.4 ± 0.2 | 86.9 ± 0.7 | 60.2 ± 0.3 | 58.5 ± 0.3 | 55.7 ± 0.4 | 69.8 ± 0.2 | 68.3 ± 0.6 | 67.6 ± 0.3 |
| pFedGP-IP-compute (ours) | 89.9 ± 0.6 | 88.8 ± 0.1 | 86.8 ± 0.4 | 61.2 ± 0.4 | 59.8 ± 0.3 | 49.2 ± 0.3 | 72.0 ± 0.3 | 71.5 ± 0.5 | 68.2 ± 0.2 |
| pFedGP (ours) | 89.2 ± 0.3 | 88.8 ± 0.2 | 87.6 ± 0.4 | 63.3 ± 0.1 | 61.3 ± 0.2 | 50.6 ± 0.2 | 71.8 ± 0.3 | 71.3 ± 0.4 | 68.1 ± 0.3 |

Unless stated otherwise, we report the average and the standard error of the mean (SEM) over three random seeds of the federated accuracy, defined as the average accuracy across all clients and samples.

Datasets. All methods were evaluated on the CIFAR-10, CIFAR-100 [38], and CINIC-10 [15] datasets. CINIC-10 is more diverse since it combines images from CIFAR-10 and ImageNet [16].

Compared methods. We compared our method against the following baselines: (1) Local, the pFedGP full model on each client with a private network and no collaboration with other clients; (2) FedAvg [51], a standard FL model with no personalization component; (3) FOLA [47], a Bayesian method that uses a multivariate Gaussian product mechanism to aggregate local models; (4) FedPer [4], a PFL approach that learns a personal classifier for each client on top of a shared feature extractor; (5) LG-FedAvg [44], a PFL method that uses a local feature extractor per client and global output layers; (6) pFedMe [72], a PFL method that adds a Moreau-envelope loss term; (7) FedU [18], a recent multi-task learning approach for PFL that learns a model per client; (8) pFedHN [65], a recent PFL approach that uses a hypernetwork to generate client-specific networks.

Training protocol. We follow the training strategy proposed in [65]. We limit the training process to 1000 communication rounds; in each, we sample five clients uniformly at random for model updates. The training procedure is different in the FOLA and pFedHN baselines, so we used an equivalent communication cost. In LG-FedAvg, we ran an extra 200 communication rounds after a pre-training stage with the FedAvg model for 1000 communication rounds. In the local model, we performed 100 epochs of training for each client. In all experiments, we used a LeNet-based network [40] having two convolution layers followed by two fully connected layers and an additional linear layer. We tuned the hyperparameters of all methods using a pre-allocated held-out validation set. Full experimental details are given in Appendix D.

6.1 Standard PFL setting

We first evaluated all methods in a standard PFL setting [65, 72]. We varied the total number of clients in the system from 50 to 500, and we set the number of classes per client to two/ten/four for CIFAR-10/CIFAR-100/CINIC-10, respectively.
Since the total number of samples in the system is fixed, the number of samples per client changed accordingly. For each client, the same classes appeared in the training and test sets. The results are presented in Table 1. They show that: (1) The performance of the local baseline is significantly impaired when the number of samples per client decreases, emphasizing the importance of federated learning in the presence of limited local data. (2) FedAvg and FOLA, which do not use personalization, perform poorly in this heterogeneous setup. (3) pFedGP outperforms or is on par with previous state-of-the-art approaches when local data is sufficient (e.g., 50 clients on all datasets). When the data per client becomes limited, pFedGP achieves significant improvements over competing methods; note the 9% and 21% differences on CIFAR-100 over 100 and 500 clients, respectively. (4) pFedGP-IP-compute often achieves comparable results to pFedGP and is often superior to pFedGP-IP-data. We believe this can be attributed to the fact that in pFedGP-IP-compute the training data take an active part in the GP inference formulas (Eq. 6), while in pFedGP-IP-data the data have only a weak impact, through the loss function. (5) pFedGP-IP-data is especially helpful when few samples per class are available, e.g., CIFAR-100 with 500 clients. That last point is further illustrated by decoupling the effect of the number of clients from that of the training set size: in Appendix E.2 we fixed the number of clients and varied the number of training samples per class. From this experiment, we deduced that both factors (individually) contribute to pFedGP's success.

Table 2: Test accuracy (± SEM) over 100 clients on noisy CIFAR-100. We also provide the relative accuracy decrease (%) w.r.t. the performance on the original CIFAR-100 data (see Table 1).

| Method | Accuracy | Decrease (%) |
|---|---|---|
| FedPer [4] | 28.1 ± 0.9 | -35.6 |
| LG-FedAvg [44] | 26.9 ± 0.9 | -28.3 |
| pFedMe [72] | 33.2 ± 0.6 | -30.4 |
| FedU [18] | 35.0 ± 0.2 | -2.8 |
| pFedHN [65] | 38.9 ± 0.5 | -25.7 |
| pFedGP-IP-data (ours) | 45.0 ± 0.3 | -23.1 |
| pFedGP-IP-compute (ours) | 47.1 ± 0.05 | -21.2 |
| pFedGP (ours) | 49.5 ± 0.1 | -19.2 |

A desired property of PFL classifiers is the ability to provide uncertainty estimates. For example, in decision support systems, such as healthcare applications, the decision-maker should have an accurate estimate of the classifier's confidence in its prediction. Here, we quantify uncertainty through calibration. Figure 3 compares all methods, both visually and using common metrics [7, 25, 55], on the CIFAR-100 dataset with 50 clients. The expected calibration error (ECE) measures the weighted average gap between the classifier's confidence and its accuracy; the maximum calibration error (MCE) takes the maximum instead of the average; and the Brier score (BRI) [7] measures the average squared error between the labels and the prediction probabilities.

Figure 3: Reliability diagrams on CIFAR-100 with 50 clients for FedAvg, FOLA, FedPer, LG-FedAvg, pFedMe, FedU, pFedHN, pFedGP-IP-data (ours), pFedGP-IP-compute (ours), and pFedGP (ours). The diagonal indicates perfect calibration. Each plot also shows the expected and maximum calibration errors (ECE & MCE) and the Brier score (BRI); lower is better.

The figure shows that the pFedGP classifiers are best calibrated across all metrics in almost all cases.
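For concreteness, these three metrics can be computed from the predicted class probabilities as in the following sketch (ours, using the standard binned definitions of ECE and MCE and the multi-class Brier score):

```python
import numpy as np

def calibration_metrics(probs, labels, n_bins=10):
    """ECE, MCE, and Brier score. probs: (N, C) predicted class probabilities;
    labels: (N,) integer class labels."""
    conf = probs.max(axis=1)                     # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap           # gap weighted by bin frequency
            mce = max(mce, gap)
    brier = ((probs - np.eye(probs.shape[1])[labels]) ** 2).sum(axis=1).mean()
    return ece, mce, brier
```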
We note that with temperature scaling the calibration of the baseline methods can be improved [25]; however, choosing the right temperature requires optimization over a separate validation set, which our model does not need. Additional calibration results, including temperature scaling, are presented in Appendix E.9.

6.2 PFL with input noise

In real-world federated systems, clients may employ different measurement devices for data collection (cameras, sensors, etc.), resulting in different input-noise characteristics per client. Here, we investigate pFedGP's performance under this type of personalization. To simulate it, we partitioned CIFAR-10/100 into 100 clients similarly to the protocol described in Section 6.1, defined 57 unique distributions of image corruption noise [29], and assigned a noise model to each client. Then, for each example in each client, we sampled a corruption noise according to the noise model allocated to that client. Table 2 shows the results for the noisy CIFAR-100 dataset. Further details on the perturbations performed and the results for noisy CIFAR-10 are given in the Appendix.² We observe a significant gap in favor of the pFedGP variants compared to the baseline methods. Note that using global inducing points is slightly less beneficial in this case, since they are defined globally and therefore are not tied to a specific noise type as the real client data is.

6.3 Generalization to out-of-distribution (OOD) novel clients

FL systems are dynamic. For example, novel clients may enter the system after the model was trained, possibly with a shifted data distribution. Adapting to a new OOD client is both challenging and important for real-world FL systems. To evaluate pFedGP in this scenario, we followed the learning protocol proposed in [65]. We partitioned the CIFAR-10 dataset into two groups. The data in the first group was distributed between 90 clients for model training. The remaining data, from the second group, was distributed between an additional 10 clients that were excluded during training. Within each group, we set the class probabilities in each client by sampling from a Dirichlet distribution with the same α parameter. For the training group, we set α = 0.1, trained the shared model using these clients, and froze it. Then, we evaluated the models on the second group, varying α ∈ {0.1, 0.25, 0.5, 0.75, 1} over the remaining 10 clients. As α moves away from 0.1, the distribution shift between the two groups increases, resulting in more challenging OOD clients.

Figure 4: Generalization to novel clients on CIFAR-10.

Figure 4 reports the generalization gap as a function of the Dirichlet parameter α. The generalization gap is computed as the difference between the average test accuracy of the (ten) novel clients and the average test accuracy of the (ninety) clients used for training. The figure shows that, here as well, pFedGP achieves the best generalization performance for all values of α. Moreover, unlike the baseline methods, pFedGP does not require any parameter tuning. Several baselines were excluded from the figure since they had a large generalization gap.

7 Conclusion

In this study, we proposed pFedGP, a novel method for PFL. pFedGP learns a kernel function, parameterized by a NN, that is shared between all clients, with a personal GP classifier on each client. We proposed three variants of pFedGP: a full-model approach that generally shows the best performance, and two extensions to it.
The first is most beneficial when the number of examples per class is small, while the second allows controlling the computational requirements of the model. We also derived a PAC-Bayes generalization bound on novel clients and empirically showed that it gives non-vacuous guarantees. pFedGP provides well-calibrated predictions, generalizes well to OOD novel clients, and consistently outperforms competing methods.

Broader impact: Our method shares the standard communication procedure of FL approaches, where no private data is directly communicated across the different nodes in the system. This protocol does not explicitly guarantee that no private information can be inferred at this time. As we show, pFedGP is particularly useful for clients with little data and for clients whose distributions differ strongly. This has great potential to improve client personalization in real-world systems and to do better at handling less common data. The latter is of great interest for decision support systems in sensitive domains such as health care or legal.

Acknowledgements

This study was funded by a grant to GC from the Israel Science Foundation (ISF 737/2018), and by an equipment grant to GC and Bar-Ilan University from the Israel Science Foundation (ISF 2332/18). IA was funded by a grant from the Israeli Innovation Authority, through the AVATAR consortium.

²The noisy CIFAR-10/100 datasets are available at: https://idanachituve.github.io/projects/pFedGP

References

[1] Sawsan Abdulrahman, Hanine Tout, Hakima Ould-Slimane, Azzam Mourad, Chamseddine Talhi, and Mohsen Guizani. A survey on federated learning: The journey from centralized to distributed on-site learning and beyond. IEEE Internet of Things Journal, 8(7):5476-5497, 2021.

[2] Idan Achituve, Aviv Navon, Yochai Yemini, Gal Chechik, and Ethan Fetaya. GP-Tree: A Gaussian process classifier for few-shot incremental learning. In Proceedings of the 38th International Conference on Machine Learning, pages 54-65. PMLR, 2021.

[3] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems, pages 7564-7575, 2018.

[4] Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.

[5] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007.

[6] Harkirat Singh Behl, Atılım Güneş Baydin, and Philip HS Torr. Alpha MAML: Adaptive model-agnostic meta-learning. arXiv preprint arXiv:1905.07435, 2019.

[7] Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1-3, 1950.

[8] Thang D Bui, Cuong V Nguyen, Siddharth Swaroop, and Richard E Turner. Partitioned variational inference: A unified framework encompassing federated and continual learning. arXiv preprint arXiv:1811.11206, 2018.

[9] Dongqi Cai, Qipeng Wang, Yuanqiang Liu, Yunxin Liu, Shangguang Wang, and Mengwei Xu. Towards ubiquitous learning: A first measurement of on-device training performance. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, pages 31-36, 2021.

[10] Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold Gaussian processes for regression. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 3338-3345. IEEE, 2016.
[11] Shuxiao Chen, Qinqing Zheng, Qi Long, and Weijie J Su. A theorem of the alternative for personalized federated learning. arXiv preprint arXiv:2103.01901, 2021.

[12] Luca Corinzia, Ami Beuret, and Joachim M Buhmann. Variational federated multi-task learning. arXiv preprint arXiv:1906.06268, 2019.

[13] Xinyan Dai, Xiao Yan, Kaiwen Zhou, Han Yang, Kelvin KW Ng, James Cheng, and Yu Fan. Hyper-sphere quantization: Communication-efficient SGD for federated learning. arXiv preprint arXiv:1911.04655, 2019.

[14] Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet. Federated Bayesian optimization via Thompson sampling. Advances in Neural Information Processing Systems, 33, 2020.

[15] Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505, 2018.

[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.

[17] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.

[18] Canh T Dinh, Tung T Vu, Nguyen H Tran, Minh N Dao, and Hongyu Zhang. FedU: A unified framework for federated multi-task learning with Laplacian regularization. arXiv preprint arXiv:2102.07148, 2021.

[19] Randal Douc, Eric Moulines, and David Stoffer. Nonlinear Time Series: Theory, Methods and Applications with R Examples. CRC Press, 2014.

[20] John C Duchi, Michael I Jordan, and Martin J Wainwright. Privacy aware learning. Journal of the ACM (JACM), 61(6):1-57, 2014.

[21] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning: A meta-learning approach. Advances in Neural Information Processing Systems, 2020.

[22] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1082-1092. PMLR, 2020.

[23] Théo Galy-Fajou, Florian Wenzel, Christian Donner, and Manfred Opper. Multi-class Gaussian process classification made conjugate: Efficient inference via data augmentation. In Uncertainty in Artificial Intelligence, pages 755-765. PMLR, 2020.

[24] Théo Galy-Fajou, Florian Wenzel, and Manfred Opper. Automated augmented conjugate inference for non-conjugate Gaussian process models. In International Conference on Artificial Intelligence and Statistics, pages 3025-3035. PMLR, 2020.

[25] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321-1330. PMLR, 2017.

[26] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.

[27] Filip Hanzely and Peter Richtárik. Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516, 2020.

[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[29] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019.
[30] James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Artificial Intelligence and Statistics, pages 351-360. PMLR, 2015.

[31] T. H. Hsu, Hang Qi, and M. Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

[32] Yutao Huang, Lingyang Chu, Zirui Zhou, Lanjun Wang, Jiangchuan Liu, Jian Pei, and Yong Zhang. Personalized cross-silo federated learning on non-IID data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

[33] Yihan Jiang, Jakub Konečný, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488, 2019.

[34] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[35] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132-5143. PMLR, 2020.

[36] Rahif Kassab and Osvaldo Simeone. Federated generalized Bayesian learning via distributed Stein variational gradient descent. arXiv preprint arXiv:2009.06419, 2020.

[37] Jakub Konečný, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.

[38] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[39] V. Kulkarni, Milind Kulkarni, and A. Pant. Survey of personalization techniques for federated learning. 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pages 794-797, 2020.

[40] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[41] Tian Li, Anit Kumar Sahu, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

[42] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50-60, 2020.

[43] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

[44] Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B Allen, Randy P Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 2020.

[45] Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don't use large mini-batches, use local SGD. In International Conference on Learning Representations, 2019.

[46] Scott W Linderman, Matthew J Johnson, and Ryan P Adams. Dependent multinomial models made easy: Stick breaking with the Pólya-Gamma augmentation. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 3456-3464, 2015.
[47] Liangxi Liu and Feng Zheng. A Bayesian federated learning framework with multivariate Gaussian product. arXiv preprint arXiv:2102.01936, 2021.

[48] Wesley Maddox, Shuai Tang, Pablo Moreno, Andrew Gordon Wilson, and Andreas Damianou. Fast adaptation with linearized neural networks. In International Conference on Artificial Intelligence and Statistics, pages 2737-2745. PMLR, 2021.

[49] Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619, 2020.

[50] David A McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5-21, 2003.

[51] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273-1282. PMLR, 2017.

[52] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.

[53] Thomas Peter Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.

[54] Viraaji Mothukuri, Reza M Parizi, Seyedamin Pouriyeh, Yan Huang, Ali Dehghantanha, and Gautam Srivastava. A survey on security and privacy of federated learning. Future Generation Computer Systems, 115:619-640, 2021.

[55] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

[56] Sebastian W. Ober, Carl E. Rasmussen, and Mark van der Wilk. The promises and pitfalls of deep kernel learning. CoRR, 2021.

[57] Nicholas G. Polson, James G. Scott, and Jesse Windle. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, pages 1339-1349, 2013.

[58] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939-1959, 2005.

[59] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[60] David Reeb, Andreas Doerr, Sebastian Gerwinn, and Barbara Rakitsch. Learning Gaussian processes by minimizing PAC-Bayesian generalization bounds. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 3341-3351, 2018.

[61] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pages 2021-2031. PMLR, 2020.

[62] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. The Journal of Machine Learning Research, 5:101-141, 2004.

[63] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510-4520, 2018.

[64] Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3(Oct):233-269, 2002.

[65] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In International Conference on Machine Learning. PMLR, 2021.
[66] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310-1321, 2015.

[67] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4427-4437, 2017.

[68] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4080-4090, 2017.

[69] Jake Snell and Richard Zemel. Bayesian few-shot classification with one-vs-each Pólya-Gamma augmented Gaussian processes. In International Conference on Learning Representations, 2021.

[70] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257-1264. MIT Press, 2006.

[71] Sebastian U Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2018.

[72] Canh T Dinh, Nguyen Tran, and Tuan Dung Nguyen. Personalized federated learning with Moreau envelopes. Advances in Neural Information Processing Systems, 33, 2020.

[73] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. arXiv preprint arXiv:2103.00710, 2021.

[74] Minxue Tang, Xuefei Ning, Yitu Wang, Yu Wang, and Yiran Chen. FedGP: Correlation-based active client selection for heterogeneous federated learning. arXiv preprint arXiv:2103.13822, 2021.

[75] Carlos Villacampa-Calvo, Bryan Zaldívar, Eduardo C Garrido-Merchán, and Daniel Hernández-Lobato. Multi-class Gaussian process classification with noisy inputs. Journal of Machine Learning Research, 22(36):1-52, 2021.

[76] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations, 2019.

[77] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. In ICML Workshop on Coding Theory for Machine Learning, 2019.

[78] Florian Wenzel, Théo Galy-Fajou, Christian Donner, Marius Kloft, and Manfred Opper. Efficient Gaussian process classification using Pólya-Gamma data augmentation. In The AAAI Conference on Artificial Intelligence, pages 5417-5424. AAAI Press, 2019.

[79] Christopher KI Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.

[80] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[81] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2594-2602, 2016.

[82] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1-19, 2019.

[83] Feng Yin, Zhidi Lin, Qinglei Kong, Yue Xu, Deshi Li, Sergios Theodoridis, and Shuguang Robert Cui. FedLoc: Federated learning framework for data-driven cooperative localization and location data processing. IEEE Open Journal of Signal Processing, 1:187-215, 2020.
[84] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pages 7252-7261. PMLR, 2019.

[85] Chen Zhang, Yu Xie, Hang Bai, Bin Yu, Weihong Li, and Yuan Gao. A survey on federated learning. Knowledge-Based Systems, 216:106775, 2021.

[86] Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M Alvarez. Personalized federated learning with first order model optimization. arXiv preprint arXiv:2012.08565, 2020.

[87] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.

[88] Fan Zhou and Guojing Cong. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3219-3227, 2018.

[89] Pan Zhou, Xiaotong Yuan, Huan Xu, Shuicheng Yan, and Jiashi Feng. Efficient meta learning via minibatch proximal update. Advances in Neural Information Processing Systems, 32:1534-1544, 2019.

[90] Wennan Zhu, Peter Kairouz, Brendan McMahan, Haicheng Sun, and Wei Li. Federated heavy hitters discovery with differential privacy. In International Conference on Artificial Intelligence and Statistics, pages 3837-3847. PMLR, 2020.