# personalized_federated_learning_using_hypernetworks__c7186e36.pdf Personalized Federated Learning using Hypernetworks Aviv Shamsian 1 Aviv Navon 1 Ethan Fetaya 1 Gal Chechik 1 2 Personalized federated learning is tasked with training machine learning models for multiple clients, each with its own data distribution. The goal is to train personalized models collaboratively while accounting for data disparities across clients and reducing communication costs. We propose a novel approach to this problem using hypernetworks, termed p Fed HN for personalized Federated Hyper Networks. In this approach, a central hypernetwork model is trained to generate a set of models, one model for each client. This architecture provides effective parameter sharing across clients while maintaining the capacity to generate unique and diverse personal models. Furthermore, since hypernetwork parameters are never transmitted, this approach decouples the communication cost from the trainable model size. We test p Fed HN empirically in several personalized federated learning challenges and find that it outperforms previous methods. Finally, since hypernetworks share information across clients, we show that p Fed HN can generalize better to new clients whose distributions differ from any client observed during training. 1. Introduction Federated learning (FL) is the task of learning a model over multiple disjoint local datasets (Mc Mahan et al., 2017a; Yang et al., 2019). It is particularly beneficial when local data cannot be shared due to privacy, storage, or communication constraints. For example, Io T applications may create data at edge devices that are too large to share and medical applications may be forbidden from sharing data due to privacy (Wu et al., 2020). In federated learning, all clients collectively train a shared model without sharing data *Equal contribution 1Bar-Ilan University, Ramat Gan, Israel 2Nvidia, Tel-Aviv, Israel. Correspondence to: Aviv Shamsian , Aviv Navon . Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s). while minimizing communication. Unfortunately, learning a single global model may fail when the data distribution varies across clients. For example, user data may originate from different devices or geographical locales and is potentially heterogeneous. In the extreme, each client may be required to solve another task. To handle such heterogeneity across clients, Personalized Federated Learning (PFL) (Smith et al., 2017) allows each client to use a personalized model instead of a shared global model. The key challenge in PFL is to benefit from joint training while allowing each client to keep its unique model and at the same time limit the communication cost. While several approaches were recently proposed for this challenge, these problems are far from being resolved. In this work, we describe a new approach named p Fed HN for personalized Federated Hyper Network, which aims to resolve these challenges using hypernetworks (Ha et al., 2017). A hypernetwork is a model that produces parameters for another neural networks. Our approach learns to smartly share parameters across clients by using a single joint hypernetwork to generate separate network parameters for every client. Each client has a unique embedding vector, passed as input to the hypernetwork to produce its personalized model weights. The vast majority of trainable parameters belong to the hypernetwork and are shared across clients. Despite that, using a hypernetwork achieves great flexibility and diversity across client models. Intuitively, since a hypernetwork provides a mapping from the embedding input space to the space of client network parameters, its image can be viewed as a low-dimensional manifold in that space. Thus, we can think of the hypernetwork as the coordinate map of this manifold. Each unique client model is restricted to lay on this manifold and is parametrized by the embedding vector. Another benefit of using hypernetworks is that the hypernetwork parameters, which generally are of a much larger size than the clients network parameters that it produces, are never transmitted. Instead, each client only needs to receive its own network parameters to make predictions and compute gradients. Furthermore, the hypernetwork only needs to receive gradients or update directions to optimize its parameters. The communication cost does not depend on the size of the hypernetwork, and as a result, we can train a large hypernetwork with the same communication costs Personalized Federated Learning using Hypernetworks as in previous models. Compared to previous parametersharing schemes, like Dinh et al. (2020); Mc Mahan et al. (2017a), hypernetworks open new options that were not directly possible before. For instance, consider a case where each client uses a cell phone or a wearable device, each with different computational resources. The hypernetwork can produce several networks per input, each with a different computational capacity, allowing each client to select its appropriate network. This paper makes the following contributions: (1) A new approach for personalized federated learning based on hypernetworks. (2) This approach generalizes better (a) to novel clients that differ from the ones seen during training; and (b) to clients with different computational resources, allowing clients to have different model sizes. (3) A new set of state-of-the-art results for the standard benchmarks in the field CIFAR10, CIFAR100, and Omniglot. The paper is organized as follows. Section 3 describes our model in detail. Section 4 establishes some theoretical results to provide insight into our model. Section 5 shows experimentally that p Fed HN achieves stateof-the-art results on several datasets and learning setups. We make our source code publicly available at: https: //github.com/Aviv Sham/p Fed HN. 2. Related Work 2.1. Federated Learning Federated learning (FL) (Mc Mahan et al., 2017a; Kairouz et al., 2019; Mothukuri et al., 2021; Li et al., 2019; 2020a) is a learning setup in machine learning in which multiple clients collaborate to solve a learning task while maintaining privacy and communication efficiency. Recently, numerous methods have been introduced for solving the various FL challenges. Duchi et al. (2014); Mc Mahan et al. (2017b); Agarwal et al. (2018); Zhu et al. (2019) proposed new methods for preserving privacy, and Reisizadeh et al. (2020); Dai et al. (2019); Basu et al. (2020); Li et al. (2020b); Stich (2018) focused on reducing communication cost. While some methods assume a homogeneous setup, in which all clients share a common data distribution (Wang & Joshi, 2018; Lin et al., 2018), others tackle the more challenging heterogeneous setup in which each client is equipped with its own data distribution (Zhou & Cong, 2017; Hanzely & Richt arik, 2020; Zhao et al., 2018; Sahu et al., 2018; Karimireddy et al., 2019; Haddadpour & Mahdavi, 2019; Hsu et al., 2019). Perhaps the most known and commonly used FL algorithm is Fed Avg (Mc Mahan et al., 2017a). It learns a global model by aggregating local models trained on IID data. However, the above methods learn a shared global model for all clients instead of personalized per-client solutions. 2.2. Personalized Federated Learning The federated learning setup presents numerous challenges, including data heterogeneity (differences in data distribution), device heterogeneity (in terms of computation capabilities, network connection, etc.), and communication efficiency (Kairouz et al., 2019). Especially data heterogeneity makes it hard to learn a single shared global model that applies to all clients. To overcome these issues, Personalized Federated Learning (PFL) aims to personalize the global model for each client in the federation (Kulkarni et al., 2020). Many papers proposed a decentralized version of the model agnostic meta-learning (MAML) problem (Fallah et al., 2020a; Li et al., 2017; Behl et al., 2019; Zhou et al., 2019; Fallah et al., 2020b). Since the MAML approach relies on the Hessian matrix, which is computationally costly, several works attempted to approximate the Hessian (Finn et al., 2017; Nichol et al., 2018). Multitask learning was also used for PFL by viewing each client as a learning task (Smith et al., 2017). Recently, Dinh et al. (2021) proposed Fed U, a federated multitask algorithm that uses Laplacian regularization to encourage similarities between clients. Another approach to PFL is model mixing, where clients learn a mixture of the global model and local models (Deng et al., 2020; Arivazhagan et al., 2019). Hanzely & Richt arik (2020) introduced a new neural network architecture divided into base and personalized layers. The central model trains the base layers by Fed Avg, and the personalized layers are trained locally. Liang et al. (2020) presented LG-Fed Avg, a mixing-model method where each client obtains local feature extractor and global output layers, which differ from the conventional mixing model approach. Using a shared output layers lowers the communication costs because the global model requires fewer parameters. Other approaches to PFL propose training the global and local models under different regularization schemes (Huang et al., 2020). Dinh et al. (2020) introduced p Fed Me, a method that uses Moreau envelops to regularize the client loss. This regularization helps to decouple the optimization of the personalized models from the optimization of the global one. Alternatively, clustering methods for federated learning assume that the local data of each client is partitioned by nature (Mansour et al., 2020). They aim to group similar clients and train a centralized model per group. In a heterogeneous setup, some clients are closer than others in terms of data distribution. Based on this assumption and inspired by Fed Avg, Zhang et al. (2020) proposed p Fed FOMO, an aggregation method where each client only federates with a subset of relevant clients. Personalized Federated Learning using Hypernetworks Figure 1. The Federated hypernetwork framework. (a) An HN h( ; ϕ) is located on the server and communicates a personal model θi for each client i. In turn, the client i sends back the update direction θi; (b) The HN acts on the client embedding vi to produce model weights θi. The client performs several local optimization steps to obtain θi, and sends back the update direction θi = θi θi. 2.3. Hypernetworks Hypernetworks (HNs) (Klein et al., 2015; Riegler et al., 2015; Ha et al., 2017) are deep neural networks that output the weights of another target network that performs the learning task. The idea is that the output weights vary depending on the input to the hypernetwork. HNs are widely used in various machine learning domains, including language modeling (Suarez, 2017), computer vision (Ha et al., 2017; Klocek et al., 2019), continual learning (von Oswald et al., 2019), hyperparameter optimization (Lorraine & Duvenaud, 2018; Mac Kay et al., 2019; Bae & Grosse, 2020), multi-objective optimization (Navon et al., 2021), and decoding block codes (Nachmani & Wolf, 2019). Most related to our approach is Zhao et al. (2020), which proposed using HNs for meta-learning with an inputtask embedding. That work differs from the current work in several key ways. It focused on few-shot learning with linear HNs, and it used an entirely different optimization algorithm based on a bi-level optimization with inner-outer loops. HNs are naturally suitable for learning a diverse set of personalized models, as HNs dynamically generate target networks conditioned on the input. In this section, we first formalize the personalized federated learning (PFL) problem, then we present our personalized Federated Hyper Networks (p Fed HN) approach. 3.1. Problem Formulation Personalized federated learning (PFL) aims to collaboratively train personalized models for a set of n clients, each with its personal private data. Unlike conventional FL, each client i is equipped with its own data distribution Pi on X Y. Assume each client has access to mi IID samples from Pi, Si = {(x(i) j , y(i) j )}mi i=1. Let ℓi : Y Y R+ denote the loss function corresponds to client i, and Li the average loss over the personal training data Li(θi) = 1 mi P j ℓi(xj, yj; θi). Here θi denotes the personal model of client i. The PFL goal is to optimize Θ = arg min Θ 1 n i=1 Ex,y Pi[ℓi(xj, yj; θi)] (1) and the training objective is given by arg min Θ 1 n i=1 Li(θi) = arg min Θ 1 n j=1 ℓi(xj, yj; θi) where Θ denotes the set of personal parameters {θi}n i=1. 3.2. Federated Hypernetworks In this section, we describe our proposed personalized Federated Hypernetworks (p Fed HN), a novel method for solving the PFL problem (Eq. 2) using hypernetworks. Hypernetworks are deep neural networks that output the weights of another network, conditioning on its input. Intuitively, HNs simultaneously learn a family of target networks. Let h( ; ϕ) denote the hypernetwork parametrized by ϕ and f( ; θ) the target network parametrized by θ. The hypernetwork is located at the server and acts on a client descriptor vi (see Figure 1). The descriptor can be a trainable embedding vector for the client or fixed, provided that a good client representation is known a-priori. Given vi the HN outputs the weights for the ith client θi = θi(ϕ) := h(vi; ϕ). Hence, the HN h learns a family of personalized models {h(vi; ϕ) | i [n]}. p Fed HN provides a natural way for sharing information across clients while maintaining the Personalized Federated Learning using Hypernetworks Algorithm 1 Personalized Federated Hypernetwork input: R number of rounds, K number of local steps, α HN learning rate, η client learning rate for r = 1, ..., R do sample client i [n] set θi = h(vi; ϕ) and θi = θi for k = 1, ..., K do sample mini-batch B Si θi = θi η θi Li(B) θi = θi θi ϕ = ϕ α ϕθT i θi vi = vi α viϕT ϕθT i θi return: ϕ flexibility of personalized models, by sharing the parameters ϕ. We adjust the PFL objective (Eq. 2) according to the above setup to obtain arg min ϕ,v1,...,vn 1 n i=1 Li(h(vi; ϕ)). (3) One crucial and attractive property of p Fed HN is that it decouples the size of h and the communication cost. The amount of transmitted data during the forward and backward communications is only determined by the size of the target network and is independent of h. Consequently, the hypernetwork can be arbitrarily large without impairing communication efficiency. Indeed, using the chain rule we have ϕLi = ( ϕθi)T θi Li so the client only needs to communicate θi Li back to the hypernetwork, which has the same size as the personal network parameters θi. In our work, we used a more general update rule ϕ = ( ϕθi)T θi, where θi is the change in local model parameters after several local update steps. Since the main limitation is the communication cost, we found it beneficial to perform several local update steps on the client side per each communication round. This aligns with prior work that highlighted the benefits of local optimization steps in terms of both convergence speed (hence communication cost) and final accuracy (Mc Mahan et al., 2017a; Huo et al., 2020). Given the current personalized parameters θi, we perform several local optimization steps on the personal data to obtain θi. We then return the personal model update direction θi := θi θi therefore, the update for ϕ is given by ( ϕθi)T ( θi θi). This update rule is inspired by Zhang et al. (2019). Intuitively, suppose we have access to the optimal solution of the personal problem θ i = arg minθi Li, then our update rule becomes the gradient of an approximation to the surrogate loss Li(vi, ϕ) = 1 2||θ i h(vi; ϕ)||2 2 by replacing θ i with θi. In Appendix C (Figure 5), we compare the results for a different number of local update steps and show considerable improvement over using the gradient, i.e., using a single step. We summarize our method in Alg. 1. Importantly, we note that in practice, all parameter updates are performed using efficient backpropagation and without multiplication of Jacobian matrices. 3.3. Personal Classifier In some cases, it is undesirable to learn the entire network end-to-end with a single hypernetwork. As an illustrative example, consider a case where clients differ only by the label ordering in their output vectors. In this case, having to learn the correct label ordering per client adds another unnecessary difficulty if they were to learn the classification layer as well using the hypernetwork. As another example, consider the case where each client solves an entirely separate task, similar to multitask learning, where the number of classes may differ between clients. It makes little sense to have the hypernetwork produce each unique task classification layer. In these cases, it would be preferable for the hypernetwork to produce the feature extraction part of the target network, which contains most of the trainable parameters while learning a local output layer for each client. Formally, let ωi denote the personal classifier parameters of client i. We modify the optimization problem (eq. 3) to obtain, arg min ϕ,v1,...,vn,ω1,...,ωn 1 n i=1 Li(θi, ωi), (4) where we define the feature extractor θi = h(vi; ϕ), as before. The parameters ϕ, v1, ..., vn are updated according to Alg. 1, while the personal parameters ωi are updated locally using ωi = ωi α ωi Li. 4. Analysis In this section, we theoretically analyze p Fed HN. First, we provide an insight regarding the solution for the p Fed HN (Eq. 3) using a simple linear version of our hypernetwork. Then, we describe generalization bounds of our framework. 4.1. A Linear Model Consider a linear version of the hypernetwork, where both the target model and the hypernetwork are linear models, θi = Wvi with ϕ := W Rd k and vi Rk is the ith clients embedding. Let V denote the k n matrix whose columns are the clients embedding vectors vi. We note that even for convex loss functions Li(θi) the objective L(W, V ) = P i Li(Wvi) might not be convex in (W, V ) but block multi-convex. In one setting, however, we get a nice analytical solution. Personalized Federated Learning using Hypernetworks Proposition 1. Let {Xi, yi} be the data for client i and let the loss for linear regressor θi be Li(θi) = Xiθi yi 2. Furthermore, assume that for all i, XT i Xi = Id. Define the empirical risk minimization (ERM) solution for client i to be θi = arg minθ Rd Xiθ yi 2. The optimal W, V minimizing P i Xi Wvi yi 2 are given by PCA on { θ1, ..., θn}, where W is the top-k principal components and vi is the coefficients for θi in these components. We provide the proof in Section A of the Appendix. The linear version of our p Fed HN performs dimensionality reduction by PCA. However, unlike classical dimensionality reduction, which is unaware of the learning task, p Fed HN uses multiple clients for reducing the dimensionality while preserving the optimal model as best as possible. This allows us to get solutions between the two extremes: A single shared model up to scaling (k = 1) and each client training locally (k n). We note that optimal reconstruction of the local models (k n) is generally suboptimal in terms of generalization performance, as no information is shared across clients. This dimensionality reduction can also be viewed as a denoising process. Consider a linear regression with a Gaussian noise model, i.e., for all clients p(y|x) = N(x T θ i , σ2), and assume that each client solves a maximum-likelihood objective. From the central limit theorem for maximumlikelihood estimators (White, 1982), we have that1 ni( θi θ i ) d N(0, Id), where θi is the maximum likelihood solution. This means that approximately θi = θ i + ϵ with ϵ N(0, σi I), i.e., our local solutions θi are a noisy version of the optimal model θ i with isotropic Gaussian noise. We can now view a linear hypernetwork as performing denoising on θi by PCA. PCA is a classic approach to denoising (Muresan & Parks, 2003) that is well suited for reducing isotropic noise when the energy of the original points is concentrated on a low-dimensional subspace. Intuitively we think of our standard hypernetwork as a nonlinear extension of this approach, which has a similar effect by forcing the models to lay on a low-dimensional manifold. 4.2. Generalization We now investigate how p Fed HN generalizes using the approach of Baxter (2000). The common approach for multitask learning with neural networks is to have a common feature extractor shared by all tasks and a per-task head operating on these features. This case was analyzed by Baxter (2000). Conversely, here the per-task parameters are the inputs to the hypernetwork. Next, we provide the generalization guarantee under this setting and discuss its implications. 1Note that the Fisher matrix is the identity from our assumption that XT i Xi = Id. Let Di = n (x(i) j , y(i) j ) om j=1 be the training set for the ith client, generated by a distribution Pi. We denote by ˆLD(ϕ, V ) the empirical loss of the hypernetwork ˆLD(ϕ, V ) = 1 n Pn i=1 1 m Pm j=1 ℓi x(i) j , y(i) j ; h(ϕ, vi) and by L(ϕ, V ) the expected loss L(ϕ, V ) = 1 n Pn i=1 EPi [ℓi(x, y; h(ϕ, vi))]. We assume weights of the hypernetwork and the embeddings are bounded in a ball of radius R, in which the following three Lipschitz conditions hold: 1. |ℓi(x, y, θ1) ℓi(x, y, θ2)| L θ1 θ2 2. h(ϕ, v) h(ϕ , v) Lh ϕ ϕ 3. h(ϕ, v) h(ϕ, v ) LV v v . Theorem 1. Let the hypernetwork parameter space be of dimension N and the embedding space be of dimension k. Under previously stated assumptions, there exists M = O k ϵ2 log RL(Lh+LV ) ϵ + N nϵ2 log RL(Lh+LV ) ϵ + 1 nϵ2 log 1 such that if the number of samples per client m is greater than M, we have with probability at least 1 δ for all ϕ, V that |L(ϕ, V ) ˆLD(ϕ, V )| ϵ. Theorem 1 provides insights on the parameter-sharing effect of p Fed HN. The first term for the number of required samples M depends on the dimension of the embedding vectors; as each client corresponds to its unique embedding vector (i.e., not being shared between clients), this part is independent of the number of clients n. However, the second term depends on the hypernetwork size N and is reduced by a factor of n because the hypernetwork weights are shared. Additionally, the generalization is affected by the Lipschitz constant of the hypernetwork, Lh (along with other Lipschitz constants), as it can affect the effective space that we can reach with our embedding. In essence, this characterizes the price that we pay, in terms of generalization, for the hypernetworks flexibility. It might also open new directions to improve performance. However, our initial investigation into bounding the Lipschitz constant by adding spectral normalization (Miyato et al., 2018) did not show any significant improvement, see Appendix C.4. 5. Experiments We evaluate p Fed HN in several learning setups using three common image-classification datasets: CIFAR10, CIFAR100, and Omniglot (Krizhevsky & Hinton, 2009; Lake et al., 2015). Unless stated otherwise, we report the Federated Accuracy, defined as j Acc fi x(i) j , y(i) j , averaged over three seeds. The experiments show that p Fed HN outperforms classical FL approaches and leading PFL models. Personalized Federated Learning using Hypernetworks Table 1. Heterogeneous data. Test accuracy over 50, 100 and 500 clients on the CIFAR10, CIFAR100, and Omniglot datasets. CIFAR10 CIFAR100 Omniglot # clients 50 100 500 50 100 500 50 Local 68.11 7.39 59.32 5.59 53.87 0.11 19.98 1.41 15.12 0.58 13.95 0.09 65.97 0.86 Fed Avg 47.79 4.48 44.12 3.10 54.04 0.87 15.96 0.55 15.71 0.35 20.40 0.11 41.61 3.59 Fed U 80.63 0.52 78.12 0.87 65.64 0.74 41.09 0.45 36.03 0.33 15.93 0.76 60.82 0.01 LG-Fed Avg 85.19 0.58 81.49 1.56 64.72 1.26 53.16 2.18 49.99 3.13 20.25 0.91 72.99 5.00 Fed Per 83.39 0.47 80.99 0.71 76.79 2.08 48.32 1.46 42.08 0.18 25.62 0.52 69.92 3.12 p Fed Me 86.09 0.32 85.23 0.58 80.28 0.9 49.09 1.10 45.57 1.02 32.53 1.32 69.98 0.28 p Fed HN (ours) 88.38 0.29 87.97 0.70 75.92 1.3 59.48 0.67 53.24 0.31 43.01 2.6 72.03 1.08 p Fed HN-PC (ours) 90.08 0.63 88.09 0.86 83.22 1.28 60.17 1.63 52.4 0.74 34.10 0.17 81.89 0.15 Compared Methods: We evaluate and compare the following approaches: (1) p Fed HN, Our proposed Federated Hyper Networks (2) p Fed HN-PC, p Fed HN with a personalized classifier per client ; (3) Local, Local training on each client, with no collaboration between clients; (4) Fed Avg (Mc Mahan et al., 2017a), one of the first and perhaps the most widely used FL algorithm; (5) Fed U (Dinh et al., 2021), a multitask learning approach for PFL; (6) p Fed Me (Dinh et al., 2020), a PFL approach that adds a Moreau-envelopes loss term; (7) LG-Fed Avg (Liang et al., 2020) PFL method with a local feature extractor and global output layers; (8) Fed Per (Arivazhagan et al., 2019) a PFL approach that learns per-client personal classifier on top of a shared feature extractor. Training Strategies: In all experiments, our target network has the same architecture as the baseline model. It is a simple fully-connected neural network, with three hidden layers and multiple linear heads per target-weight tensor. We limit training to have at most 5000 server-client communication steps for all methods except LG-Fed Avg. That method uses a pretrained Fed Avg model; hence it is trained with additional 1000 communication steps. The Local baseline is trained for 2000 optimization steps on each client. For p Fed HN, we set the number of local steps to K = 50, and the embedding dimension to 1 + n/4 , where n is the number of clients. We provide an extensive ablation study on design choices in Appendix C. We tune the hyperparameters of all methods using a pre-allocated held-out validation set. Full experimental details are provided in Appendix B. 5.1. Heterogeneous Data We evaluate the different approaches on a challenging heterogeneous setup. We adopt the learning setup and the evaluation protocol described in Dinh et al. (2020) for generating heterogeneous clients in terms of classes and size of local training data. First, we sample two/ten classes for each client for CIFAR10/CIFAR100; Next, for each client i and selected class c, we sample αi,c U(.4, .6), and assign it with αi,c P j αj,c of the samples for this class. We repeat the above using 50, 100 and 500 clients. This procedure produces clients with a varying number of samples and classes. For the target network, we use a Le Net-based (Le Cun et al., 1998) network with two convolutional layers and two fully connected layers. We also evaluate all methods with the Omniglot dataset (Lake et al., 2015). Omniglot contains 1623 different grayscale handwritten characters (with 20 samples each) from 50 different alphabets. Each alphabet obtains a varying number of characters. In this setup, we use 50 clients and assign an alphabet to each client. Therefore, clients receive different numbers of samples, and the distribution of labels is disjoint across clients. We use a Le Net-based model with four convolutional layers and two fully connected layers. The results are presented in Table 1. The two simple baselines, local and Fed Avg, perform poorly on most tasks, showing the importance of personalized federated learning. p Fed HN achieves a significant 2%-10% improvement over competing approaches. Furthermore, on the Omniglot dataset, where each client is allocated with a completely different learning task (different alphabet), we show significant improvement using p Fed HN-PC. We present additional results on MNIST and CIFAR10/100 datasets in Appendix C.2 and C.1. 5.2. Computational Budget We discussed above how the challenges of heterogeneous data can be handled using p Fed HN. Another major challenge presented by personalized FL is that clients communication, storage, and computational resources may differ significantly. These capacities may even change in time due to varying network and power conditions. In such a setup, the server should adjust to the communication and computational policies of each client. Unfortunately, previous works do not address this resource heterogeneity. p Fed HN can naturally adapt to this challenging learning setup by producing target networks of different sizes. In this section, we evaluate the capacity of p Fed HN to handle clients that differ in their computational and commu- Personalized Federated Learning using Hypernetworks Table 2. Computational budget. Test accuracy for CIFAR10/100 with 75 clients and varying computational capacities. CIFAR10 CIFAR100 Local model size S M L S M L Fed Avg 36.91 2.26 42.09 2.37 47.51 1.97 9.79 0.19 11.76 1.14 17.86 2.42 LG-Fed Avg 73.93 3.65 53.13 3.49 54.72 2.50 33.48 4.83 29.15 1.51 23.01 1.41 Fed Per 79.38 4.94 82.65 0.59 84.08 1.87 37.24 1.23 39.31 2.08 41.31 1.58 p Fed Me 81.21 1.23 84.08 1.63 83.15 2.45 39.91 0.81 41.99 0.55 44.93 1.63 p Fed HN (ours) 85.38 1.21 86.92 1.35 87.20 0.76 48.04 0.89 48.66 1.21 50.66 2.70 nication resource budget. We use the same split described in Section 5.1 with a total of 75 clients divided into the three equal-sized groups named S (small), M (medium), and L (large). The Models of clients within each group share the same architecture. The three architectures (of the three groups) have a different number of parameters. We train a single p Fed HN to output target client models of different sizes. Importantly, this allows p Fed HN to share parameters between all clients, even if those have different local model sizes. Baselines: For quantitative comparisons, and since the existing baseline methods cannot easily extend to this setup, we train three independent per-group models. Each group is trained for 5000 server-client communication steps. See Appendix B for further details. The results are presented in Table 2, showing that p Fed HN achieves 4% 8% improvement over competing methods. These results demonstrate the flexibility of our approach, which can adjust to different client settings while maintaining high accuracy. 5.3. Generalization to Novel Clients Next, we study an important learning setup where new clients join, and a new model has to be trained for their data. In the general case of sharing models across clients, this would require retraining (or finetuning) the shared model. While PFL methods like p Fed ME (Dinh et al., 2020) can adapt to this setting by finetuning the global model locally, p Fed HN architecture offers a significant benefit. Since the shared model learns a meta-model over the distribution of clients, it can, in principle, generalize to new clients without retraining. With p Fed HN, once the shared model ϕ has been trained on a set of clients, extending to a new set of novel clients requires little effort. We freeze the hypernetwork weights ϕ and optimize an embedding vector vnew. Since only a small number of parameters are being optimized, training is less prone to overfitting compared to other approaches. The success of this process depends on the hypernetwork s capacity to learn the distribution over clients Figure 2. Generalization to novel clients. The accuracy generalization gap between training and novel clients, defined as accnovel acctrain, where acc denotes the average accuracy. and generalize to clients with different data distributions. To evaluate p Fed HN in this setting, we use the CIFAR10 dataset, with a total of 100 clients. 90 clients are used for training, and 10 are held out as novel clients. To allocate data samples, for each client i we first draw a sample from a Dirichlet distribution with parameter α = (α, ..., α) R10, pi Dir(α). Next, we normalize the pi s so that P i pi,j = 1 for all j, to obtain the vector ˆpi. We now allocate samples according to the ˆpi s. For the training clients, we choose α = .1, whereas for the novel clients we vary α {.1, .25, .5, 1}. To estimate the distance between a novel client and the training clients, we use the total variation (TV) distance between the novel client and its nearest neighbor in the training set. The TV is computed over the empirical distributions ˆp. Figure 2 presents the accuracy generalization gap as a function of the total variation distance. p Fed HN achieves the best generalization performance for all levels of TV (corresponds to the different values for α). 5.4. Heterogeneity of personalized classifiers We further investigate the flexibility of p Fed HN-PC in terms of personalizing different networks for different clients. Po- Personalized Federated Learning using Hypernetworks Figure 3. Model personalization. Rows correspond to clients, each with their trained in a binary classification task and keeping their personalized classifier ωi. Columns correspond to the feature extractor θj of another client. The diagonal corresponds to the stanard training with θi and ωi. For better visualization, values denote accuracy normalized per row: norm-acci,j = (acci,j minℓacci,ℓ)/(maxℓacci,ℓ minℓacci,ℓ). tentially, since each client i has its own personalized classifier ωi, the feature extractor component θi generated by the HN may in principle become strongly similar across clients, making the HN redundant. The experiments suggest that this concern does not happen in practice because p Fed HN-PC out-performs Fed Per (Arivazhagan et al., 2019). However, this raises a fundamental question about the interplay between local and shared personalized components. We provide additional insight to this topic by answering the question: Do feature extraction layers, as generated by the p Fed HN-PC s hypernetwork, significantly differ from each other? To investigate the level of personalization in θi achieved by p Fed HN-PC, we first train it on the CIFAR10 dataset split among ten clients, with two classes assigned to each client. Next, for each client, we replace its feature extractor θi with that of another client θj while keeping its personal classifier ωi unaltered. Figure 3 depicts the normalized accuracy in this mix-and-match experiment. Rows correspond to a client, and columns correspond to the feature extractor of another client. Several effects in Figure 3 are of interest. First, p Fed HNPC produces personalized feature extractors for each client since the accuracy achieved when crossing classifiers, and feature extractors vary significantly. Second, some client pairs can be crossed without hurting the accuracy. Specifi- Figure 4. t-SNE visualization of the learned client representation v for the CIFAR100 dataset. Clients are tasked with classifying classes that belong to the same coarse class. Clients marked with the same color correspond to the same coarse class, see text for details. p Fed HN clustered together clients from the same group. cally, we had two clients learning to discriminate horse-vs.- dog. Interestingly, the client for ship-vs.-airplane performs well when presented with truck and bird, presumably because both their feature extractors learned to detect sky. 5.5. Learned Client Representation In our experiments, we learn to represent each client using a trainable embedding vector vi. These embedding vectors, therefore, learn a continuous semantic representation over the set of clients. The smooth nature of this representation gives the HN the power to share information across clients. We now wish to study the structure of that embedding space. To examine how the learned embedding vectors reflect a meaningful representation over the client space, we utilize the hierarchy in CIFAR100 for generating clients with similar data distribution of semantically similar labels. Concretely, we split the CIFAR100 into 100 clients, where each client is assigned with data from one out of the twenty coarse classes uniformly (i.e., each coarse class is assigned to five clients). In Figure 4 we project the learned embedding vectors into R2 using the t-SNE algorithm (Maaten & Hinton, 2008). A clear structure is presented, in which clients from the same group (in terms of coarse labels) are clustered together. 6. Conclusion In this work, we present a novel approach for personalized federated learning. Our method trains a central hypernetwork to output a unique personal model for each client. We show through extensive experiments significant improvement in accuracy on all datasets and learning setups. Sharing across clients through a central hypernetwork has several benefits compared to previous architectures. First, since it learns a unified model over the distribution of clients, Personalized Federated Learning using Hypernetworks the model generalizes better to novel clients without the need to retrain the central model. Second, it naturally extends to handle clients with different compute power, by generating client models of different sizes. Finally, this architecture decouples the training complexity from communication complexity, since the transmitted local models can be significantly more compact than the central model. We expect that the current framework can be further extended in several important ways. First, the architecture opens questions about the best way of allocating learning capacity to a central model vs. distributed locally trained components. Second, the question of generalization to clients with new distribution awaits further analysis. Acknowledgements This study was funded by a grant to GC from the Israel Science Foundation (ISF 737/2018), and by an equipment grant to GC and Bar-Ilan University from the Israel Science Foundation (ISF 2332/18). AS and AN were funded by a grant from the Israeli Innovation Authority, through the AVATAR consortium. Agarwal, N., Suresh, A. T., Yu, F. X. X., Kumar, S., and Mc Mahan, B. cpsgd: Communication-efficient and differentially-private distributed sgd. In Advances in Neural Information Processing Systems, pp. 7564 7575, 2018. Arivazhagan, M. G., Aggarwal, V., Singh, A. K., and Choudhary, S. Federated learning with personalization layers. ar Xiv preprint ar Xiv:1912.00818, 2019. Bae, J. and Grosse, R. B. Delta-stn: Efficient bilevel optimization for neural networks using structured response jacobians. Ar Xiv, abs/2010.13514, 2020. Basu, D., Data, D., Karakus, C., and Diggavi, S. N. Qsparselocal-sgd: Distributed sgd with quantization, sparsification, and local computations. IEEE Journal on Selected Areas in Information Theory, 1(1):217 226, 2020. Baxter, J. A model of inductive bias learning. Journal of artificial intelligence research, 12:149 198, 2000. Behl, H. S., Baydin, A. G., and Torr, P. H. Alpha maml: Adaptive model-agnostic meta-learning. ar Xiv preprint ar Xiv:1905.07435, 2019. Dai, X., Yan, X., Zhou, K., Yang, H., Ng, K. K., Cheng, J., and Fan, Y. Hyper-sphere quantization: Communicationefficient sgd for federated learning. ar Xiv preprint ar Xiv:1911.04655, 2019. Deng, Y., Kamani, M. M., and Mahdavi, M. Adaptive personalized federated learning. ar Xiv preprint ar Xiv:2003.13461, 2020. Dinh, C. T., Tran, N. H., and Nguyen, T. D. Personalized federated learning with moreau envelopes. Ar Xiv, abs/2006.08848, 2020. Dinh, C. T., Vu, T. T., Tran, N. H., Dao, M. N., and Zhang, H. Fedu: A unified framework for federated multi-task learning with laplacian regularization. ar Xiv preprint ar Xiv:2102.07148, 2021. Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Privacy aware learning. Journal of the ACM (JACM), 61(6):1 57, 2014. Fallah, A., Mokhtari, A., and Ozdaglar, A. Personalized federated learning: A meta-learning approach. Ar Xiv, abs/2002.07948, 2020a. Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1082 1092. PMLR, 2020b. Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. ar Xiv preprint ar Xiv:1703.03400, 2017. Ha, D., Dai, A. M., and Le, Q. V. Hypernetworks. Ar Xiv, abs/1609.09106, 2017. Haddadpour, F. and Mahdavi, M. On the convergence of local descent methods in federated learning. ar Xiv preprint ar Xiv:1910.14425, 2019. Hanzely, F. and Richt arik, P. Federated learning of a mixture of global and local models. ar Xiv preprint ar Xiv:2002.05516, 2020. Hsu, T. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. Ar Xiv, abs/1909.06335, 2019. Huang, Y., Chu, L., Zhou, Z., Wang, L., Liu, J., Pei, J., and Zhang, Y. Personalized cross-silo federated learning on non-iid data. 2020. Huo, Z., Yang, Q., Gu, B., Huang, L. C., et al. Faster ondevice training using new federated momentum algorithm. ar Xiv preprint ar Xiv:2002.02090, 2020. Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. ar Xiv preprint ar Xiv:1912.04977, 2019. Personalized Federated Learning using Hypernetworks Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. Scaffold: Stochastic controlled averaging for on-device federated learning. ar Xiv preprint ar Xiv:1910.06378, 2019. Klein, B., Wolf, L., and Afek, Y. A dynamic convolutional layer for short rangeweather prediction. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4840 4848, 2015. Klocek, S., Maziarka, Ł., Wołczyk, M., Tabor, J., Nowak, J., and Smieja, M. Hypernetwork functional image representation. In International Conference on Artificial Neural Networks, pp. 496 510. Springer, 2019. Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Kulkarni, V., Kulkarni, M., and Pant, A. Survey of personalization techniques for federated learning. 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (World S4), pp. 794 797, 2020. Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332 1338, 2015. Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Li, Q., Wen, Z., and He, B. Federated learning systems: Vision, hype and reality for data privacy and protection. Ar Xiv, abs/1907.09693, 2019. Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50 60, 2020a. Li, Z., Zhou, F., Chen, F., and Li, H. Meta-sgd: Learning to learn quickly for few-shot learning. ar Xiv preprint ar Xiv:1707.09835, 2017. Li, Z., Kovalev, D., Qian, X., and Richt arik, P. Acceleration for compressed gradient descent in distributed and federated optimization. ar Xiv preprint ar Xiv:2002.11364, 2020b. Liang, P. P., Liu, T., Ziyin, L., Allen, N. B., Auerbach, R. P., Brent, D., Salakhutdinov, R., and Morency, L.-P. Think locally, act globally: Federated learning with local and global representations. ar Xiv preprint ar Xiv:2001.01523, 2020. Lin, T., Stich, S. U., Patel, K. K., and Jaggi, M. Don t use large mini-batches, use local sgd. ar Xiv preprint ar Xiv:1808.07217, 2018. Lorraine, J. and Duvenaud, D. Stochastic hyperparameter optimization through hypernetworks. Ar Xiv, abs/1802.09419, 2018. Maaten, L. V. D. and Hinton, G. E. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579 2605, 2008. Mac Kay, M., Vicol, P., Lorraine, J., Duvenaud, D., and Grosse, R. B. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. Ar Xiv, abs/1903.03088, 2019. Mansour, Y., Mohri, M., Ro, J., and Suresh, A. T. Three approaches for personalization with applications to federated learning. Ar Xiv, abs/2002.10619, 2020. Mc Mahan, H., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017a. Mc Mahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. ar Xiv preprint ar Xiv:1710.06963, 2017b. Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, ICLR, 2018. Mothukuri, V., Parizi, R. M., Pouriyeh, S., Huang, Y., Dehghantanha, A., and Srivastava, G. A survey on security and privacy of federated learning. Future Generation Computer Systems, 115:619 640, 2021. Muresan, D. D. and Parks, T. W. Adaptive principal components and image denoising. In Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429), 2003. Nachmani, E. and Wolf, L. Hyper-graph-network decoders for block codes. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Navon, A., Shamsian, A., Chechik, G., and Fetaya, E. Learning the pareto front with hypernetworks. In International Conference on Learning Representations, 2021. Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. ar Xiv preprint ar Xiv:1803.02999, 2018. Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pp. 2021 2031. PMLR, 2020. Personalized Federated Learning using Hypernetworks Riegler, G., Schulter, S., R uther, M., and Bischof, H. Conditioned regression models for non-blind single image super-resolution. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 522 530, 2015. Sahu, A. K., Li, T., Sanjabi, M., Zaheer, M., Talwalkar, A., and Smith, V. On the convergence of federated optimization in heterogeneous networks. ar Xiv preprint ar Xiv:1812.06127, 3, 2018. Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in neural information processing systems, pp. 4424 4434, 2017. Stich, S. U. Local sgd converges fast and communicates little. ar Xiv preprint ar Xiv:1805.09767, 2018. Suarez, J. Character-level language modeling with recurrent highway hypernetworks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3269 3278, 2017. von Oswald, J., Henning, C., Sacramento, J., and Grewe, B. F. Continual learning with hypernetworks. ar Xiv preprint ar Xiv:1906.00695, 2019. Wang, J. and Joshi, G. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. ar Xiv preprint ar Xiv:1808.07576, 2018. White, H. Maximum likelihood estimation of misspecified models. Econometrica, 50:1 25, 1982. Wu, Q., He, K., and Chen, X. Personalized federated learning for intelligent iot applications: A cloud-edge based framework. IEEE Open Journal of the Computer Society, 1:35 44, 2020. Yang, Q., Liu, Y., Chen, T., and Tong, Y. Federated machine learning: Concept and applications. ar Xiv: Artificial Intelligence, 2019. Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. Advances in Neural Information Processing Systems, 32:9597 9608, 2019. Zhang, M., Sapra, K., Fidler, S., Yeung, S., and Alvarez, J. M. Personalized federated learning with first order model optimization. ar Xiv preprint ar Xiv:2012.08565, 2020. Zhao, D., von Oswald, J., Kobayashi, S., Sacramento, J., and Grewe, B. F. Meta-learning via hypernetworks. 2020. Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data. ar Xiv preprint ar Xiv:1806.00582, 2018. Zhou, F. and Cong, G. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. ar Xiv preprint ar Xiv:1708.01012, 2017. Zhou, P., Yuan, X., Xu, H., Yan, S., and Feng, J. Efficient meta learning via minibatch proximal update. Advances in Neural Information Processing Systems, 32: 1534 1544, 2019. Zhu, W., Kairouz, P., Sun, H., Mc Mahan, B., and Li, W. Federated heavy hitters discovery with differential privacy. ar Xiv preprint ar Xiv:1902.08534, 2019. Supplementary Material for Personalized Federated Learning by Hypernetworks A. Proof of Results Proof for Proposition 1. Let θi denote the optimal solution at client i, then θi = (XT i Xi) 1XT i yi = XT i yi. Denote θi = Wvi, we have arg min θi Xiθi yi 2 2 = arg min θi (Xiθi yi)T (Xiθi yi) = arg min θi θT i XT i Xiθi 2θT i XT i y + y T i yi = arg min θi θT i θi 2 θi, θi + y T i yi = arg min θi θT i θi 2 θi, θi + θi 2 2 = arg min θi θi θi 2 2 Thus, our optimization problem becomes arg min W,V P i Wvi θi 2 2. WLOG, we can optimize W over the set of all matrices with orthonormal columns, i.e. W T W = I. Since for each solution (W, V ) we can obtain the same loss for (WR, R 1V ), and select a R that performs Gram-Schmidt on the columns of W. In case of fixed W the optimal solution for vi is given by v i = (W T W) 1W T θi = W T θi. Hence, our optimization problem becomes, arg min W ;W T W =I i WW T θi θi 2 2, which is equivalent to PCA on { θi}i. Proof for Theorem 1. Using Theorem 4 from (Baxter, 2000) and the notation used in that paper, we get that M = O 1 nϵ2 log C(ϵ,Hn l ) δ where C(ϵ, Hn l ) is the covering number for Hn l . In our case each element of Hn l is parametrized by ϕ, v1, ..., vn and the distance is given by d((ϕ, v1, ..., vn), (ϕ , v 1, ..., v n)) = (5) X ℓ(h(ϕ, vi)(xi), yi) X ℓ(h(ϕ , v i)(xi), yi) From the triangle inequality and our Lipshitz assumptions d((ϕ, v1, ..., vn), (ϕ , v 1, ..., v n)) (6) X 1 n E xi,yi Pi [|ℓ(h(ϕ, vi)(xi), yi) ℓ(h(ϕ , v i)(xi), yi)|] L h(ϕ, vi) h(ϕ , v i) L h(ϕ, vi) h(ϕ, v i) + L h(ϕ, v i) h(ϕ , v i) L Lh ϕ ϕ + L LV v v Now if we select a covering of the parameter space such that each ϕ has a point ϕ that is ϵ 2L(Lh+LV ) away and each embedding vi has an embedding v i at the same distance we get an ϵ-covering in the d((ϕ, v1, ..., vn), (ϕ , v 1, ..., v n)) metric. From here we see that log(C(ϵ, Hn l )) = O (n k + N) log RL(LV +Lh) B. Experimental Details For all experiments presented in the main text, we use a fully-connected hypernetwork with 3 hidden layers of 100 hidden units each. For all relevant baselines, we aggregate over 5 clients at each round. We set K = 3 ,i.e., 60 local steps, for the p Fed Me algorithm, as it was reported to work well in the original paper (Dinh et al., 2020). Heterogeneous Data (Section 5.1). For the CIFAR experiments, we pre-allocate 10, 000 training examples for validation. For the Omniglot dataset, we use a 70%/15%/15% split for train/validation/test sets. The validation sets are used for hyperparameter tuning and early stopping. We search over learning-rate {1e 1, 5e 2, 1e 2, 5e 3}, and personal learning-rate {5e 2, 1e 2, 5e 3, 1e 3} for PFL methods using 50 clients. For the CIFAR datasets, the selected hyperparameters are used across all number of clients (i.e. 10, 50, 100). Computational Budget (Section 5.2) We use the same hyperparameters selected in Section 5.1. To align with previous works (Dinh et al., 2020; Liang et al., 2020; Fallah et al., 2020a), we use a Le Net-based (target) network with two convolution layers, where the second layer has twice the number of filters in comparison to the first. Following these layers are two fully connected layers that output logits vector. In this learning setup, we use three different sized target networks with different numbers of filters for the first Personalized Federated Learning using Hypernetworks convolution layer. Specifically, for S/M/L sized networks, the first convolution layer consists of 8/16/32 filters, respectively. p Fed HN s HN produces weights vector with size equal to the sum of the weights of the three sized networks combined. Then it sends the relevant weights according to the target network size of the client. C. Additional Experiments C.1. CIFAR10/CIFAR100 We provide additional experiment over CIFAR10/CIFAR100 datasets. Here, we compare p Fed HN to the baselines on small scale setup of 10 clients. The results are presented in Table 3. We show significant improvement using p Fed HN on small scale experiment in addition to the results presented in the main text. Table 3. Heterogeneous data. Test accuracy over 10 clients on the CIFAR10, CIFAR100 datasets. CIFAR10 CIFAR100 # clients 10 10 Local 86.46 4.02 58.98 1.38 Fed Avg 51.42 2.41 15.96 0.55 Fed Per 87.27 1.39 55.76 0.34 p Fed Me 87.69 1.93 51.97 1.29 LG-Fed Avg 89.11 2.66 53.69 1.42 p Fed HN (ours) 90.83 1.56 65.74 1.80 p Fed HN-PC (ours) 92.47 1.63 68.15 1.49 We provide additional experiment over MNIST dataset. We follow the same data partition procedure as in the CIFAR10/CIFAR100 heterogeneity experiment, described in Section 5.1. For this experiment we use a single hidden layer fullyconnected (FC) hypernetwork. The main network (or target network in the case of p Fed HN) is a single hidden layer FC NN. All FL/PFL methods achieve high classification accuracy on this dataset, which makes it difficult to attain meaningful comparisons. The results are presented in Table 4. p Fed HN achieves similar results to p Fed Me. Table 4. Comparison on the MNIST dataset. Fed Avg 96.22 0.65 97.12 0.07 96.99 0.19 p Fed Me 99.40 0.04 99.30 0.13 99.12 0.06 p Fed HN (ours) 99.53 0.16 99.28 0.11 99.16 0.19 C.3. Exploring Design Choices In this section we return to the experimental setup of Section 5.1, and evaluate p Fed HN using CIFAR10 dataset with 50 clients. First, we examine the effect of the local optimization steps. Next, we vary the capacity of the HN and observe the change in classification accuracy. Finally we vary the dimension of the client representation (embedding). C.3.1. EFFECT OF LOCAL OPTIMIZATION Figure 5. Effect of the number of local optimization steps on the test accuracy for the CIFAR10 dataset. First, we examine the effect of performing local optimization step and transmitting θ back to the hypernetwork. Figure 5 shows the test accuracy throughout the training process. It compares training using the standard chain rule (steps = 1) with the case of training locally for k steps, k {25, 50, 100, 200}. Using our proposed update rule, i.e., making multiple local update steps, yields large improvements in both convergence speed and final accuracy, compared to using the standard chain rule (i.e., k = 1). The results show that p Fed HN is relatively robust to the choice of local local optimization steps. As stated in the main text we set k = 50 for all experiments. C.3.2. CLIENT EMBEDDING DIMENSION Next, we investigate the effect of embedding vector dimension on p Fed HN performance. Specifically, we run an ablation study on set of different embedding dimensions {5, 15, 25, 35}. The results are presented in Figure 6 (a). We show p Fed HN robustness to the dimension of the client embedding vector; hence we fix the embedding dimension through all experiments to 1+n/4 , where n is the number of client. C.3.3. HYPERNETWORK CAPACITY Here we inspect the effect of the HN s capacity on the local networks performance. We conducted an experiment in which we change the depth of the HN by stacking fully connected layers. Personalized Federated Learning using Hypernetworks Figure 6. Test results on CIFAR10 showing the effect of (a) the dimension of the the client embedding vector, and; (b) the number of hypernetwork s hidden layers. We evaluate p Fed HN on CIFAR10 dataset using k {1, 2, 3, 4, 5} hidden layers. Figure 6 (b) presents the final test accuracy. p Fed HN achieves optimal performance with k = 3 and k = 4 hidden layers, with accuracies 88.38 and 88.42 respectively. We use a three hidden layers HN for all experiments in the main text. C.4. Spectral Normalization Table 5. p Fed HN with spectral-normalization. p Fed HN (ours) 90.94 2.18 87.02 0.22 85.3 1.81 We show in Theorem 1 that the generalization is affected by the hypernetworks Lipschitz constant Lh. This theoretical result suggests that we can benefit from bounding this constant. Here we empirically test this by applying spectral normalization (Miyato et al., 2018) for all layers of the HN. The results are presented in Table 5. We do not observe any significant improvement compared to the results without spectral normalization (presented in Table 1 of the main text). C.5. Generalization to Novel Clients Here we provide additional results on the generalization performance for novel clients, studied in Section 5.3 of the main text. Figure 7 shows the accuracy of individual clients as a function of the total variation distance. Each point represents a different client, where the total variation distance is calculated w.r.t to the nearest training set client. As expected, the results show (on average) that the test accuracy decreases with the increase in the total variation Figure 7. Accuracy for novel clients on the CIFAR10 test set. Each point represents a different client. Total variation is computed w.r.t the nearest training set client. C.6. Fixed Client Representation We wish to compare the performance of p Fed HN when trained with a fixed vs trainable client embedding vectors. We use CIFAR10 with the data split described in Section 5.3 of the main text and 50 clients. We use a client embedding dimension of 10. We set the fixed embedding vector for client i to the vector of class proportions, ˆpi, described in Section 5.3. p Fed HN achieves similar performance with both the trainable and fixed client embedding, 84.12 0.42 and 83.92 0.36 respectively.