# FEDERATED LEARNING FROM SMALL DATASETS

Published as a conference paper at ICLR 2023

Michael Kamp
Institute for AI in Medicine (IKIM), University Hospital Essen, Essen, Germany; Ruhr-University Bochum, Bochum, Germany; and Monash University, Melbourne, Australia
michael.kamp@uk-essen.de

Jonas Fischer
Harvard T.H. Chan School of Public Health, Department of Biostatistics, Boston, MA, United States
jfischer@hsph.harvard.edu

Jilles Vreeken
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
vreeken@cispa.de

ABSTRACT

Federated learning allows multiple parties to collaboratively train a joint model without having to share any local data. It enables applications of machine learning in settings where data is inherently distributed and undisclosable, such as the medical domain. Joint training is usually achieved by aggregating local models. When local datasets are small, locally trained models can vary greatly from a globally good model. Bad local models can arbitrarily deteriorate the aggregate model quality, causing federated learning to fail in these settings. We propose a novel approach that avoids this problem by interleaving model aggregation and permutation steps. During a permutation step we redistribute local models across clients through the server, while preserving data privacy, to allow each local model to train on a daisy chain of local datasets. This enables successful training in data-sparse domains. Combined with model aggregation, this approach enables effective learning even if the local datasets are extremely small, while retaining the privacy benefits of federated learning.

1 INTRODUCTION

How can we learn high-quality models when data is inherently distributed across sites and cannot be shared or pooled? In federated learning, the solution is to iteratively train models locally at each site and share these models with the server, where they are aggregated into a global model. As only models are shared, data usually remains undisclosed. This process, however, requires sufficient data to be available at each site in order for the locally trained models to achieve a minimum quality; even a single bad model can render the aggregate arbitrarily bad (Shamir and Srebro, 2014). In many relevant applications this requirement is not met: in healthcare settings, we often have as few as a few dozen samples (Granlund et al., 2020; Su et al., 2021; Painter et al., 2020). Also in domains where deep learning is generally regarded as highly successful, such as natural language processing and object detection, applications often suffer from a lack of data (Liu et al., 2020; Kang et al., 2019).

To tackle this problem, we propose a new building block for federated learning called daisy-chaining, in which models are trained on one local dataset after another, much like a daisy chain. In a nutshell, at each client a model is trained locally, sent to the server, and then, instead of aggregating the local models, sent as-is to a random other client (see Fig. 1). This way, each local model is exposed to a daisy chain of clients and their local datasets. This allows us to learn from small, distributed datasets simply by consecutively training the model with the data available at each site. Daisy-chaining alone, however, violates privacy, since a client can make inferences about the data of the client from which it received a model (Shokri et al., 2017).
Moreover, performing daisy-chaining naively would lead to overfitting, which can cause learning to diverge (Haddadpour and Mahdavi, 2019). In this paper, we propose to combine daisy-chaining of local datasets with aggregation of models, both orchestrated by the server, and term this method federated daisy-chaining (FEDDC).

Figure 1: Federated learning settings. A standard federated learning setting trains local models at the clients (middle) and has aggregation phases in which models are communicated to the server, aggregated, and sent back to each client (left). We propose to add daisy-chaining (right), where local models are sent to the server and then redistributed as-is to a random permutation of clients.

We show that our simple yet effective approach maintains the privacy of local datasets, while it provably converges and, with a suitable aggregation method, guarantees an improvement of model quality for convex problems. Formally, we show convergence of FEDDC on non-convex problems. We then show for convex problems that FEDDC succeeds on small datasets where standard federated learning fails. For that, we analyze FEDDC combined with aggregation via the Radon point from a PAC-learning perspective. We substantiate this theoretical analysis for convex problems by showing that FEDDC in practice matches the accuracy of a model trained on the full data of the SUSY binary classification dataset with only 2 samples per client, outperforming standard federated learning by a wide margin. For non-convex settings, we provide an extensive empirical evaluation showing that FEDDC outperforms naive daisy-chaining, vanilla federated learning FEDAVG (McMahan et al., 2017), FEDPROX (Li et al., 2020a), FEDADAGRAD, FEDADAM, and FEDYOGI (Reddi et al., 2020) on low-sample CIFAR10 (Krizhevsky, 2009), including non-iid settings, and, more importantly, on two real-world medical imaging datasets. Not only does FEDDC provide a wide margin of improvement over existing federated methods, it also comes close to the performance of a gold-standard (centralized) neural network of the same architecture trained on the pooled data. To achieve that, it requires a small communication overhead compared to standard federated learning for the additional daisy-chaining rounds. As often found in healthcare, we consider a cross-silo scenario where such a small communication overhead is negligible. Moreover, we show that with equal communication, standard federated averaging still underperforms in our considered settings.

In summary, our contributions are (i) FEDDC, a novel approach to federated learning from small datasets via a combination of model permutations across clients and aggregation, (ii) a formal proof of convergence for FEDDC, (iii) a theoretical guarantee that FEDDC improves models in terms of (ϵ, δ)-guarantees where standard federated learning cannot, (iv) a discussion of the privacy aspects and mitigations suitable for FEDDC, including an empirical evaluation of differentially private FEDDC, and (v) an extensive set of experiments showing that FEDDC substantially improves model quality on small datasets compared to standard federated learning approaches.

2 RELATED WORK

Learning from small datasets is a well-studied problem in machine learning.
In the literature, we find both general solutions, such as using simpler models and transfer learning (Torrey and Shavlik, 2010), and more specialized ones, such as data augmentation (Ibrahim et al., 2021) and few-shot learning (Vinyals et al., 2016; Prabhu et al., 2019). In our scenario, overall data is abundant, but it is distributed into small local datasets at each site, which we are not allowed to pool. Hao et al. (2021) propose local data augmentation for federated learning, but their method requires a sufficient quality of the local model for augmentation, which is the opposite of the scenario we consider. Huang et al. (2021) provide generalization bounds for federated averaging via the NTK framework, but require one-layer infinite-width NNs and infinitesimal learning rates. Federated learning and its variants have been shown to learn from incomplete local data sources, e.g., non-iid label distributions (Li et al., 2020a; Wang et al., 2019) and differing feature distributions (Li et al., 2020b; Reisizadeh et al., 2020a), but fail in case of large gradient diversity (Haddadpour and Mahdavi, 2019) and strongly dissimilar label distributions (Marfoq et al., 2021). For small datasets, local empirical distributions may vary greatly from the global distribution: the difference between the empirical and the true distribution decreases exponentially with the sample size (e.g., according to the Dvoretzky-Kiefer-Wolfowitz inequality), but for small samples the difference can be substantial, in particular if the distribution differs from a Normal distribution (Kwak and Kim, 2017). Shamir and Srebro (2014) have shown the adverse effect of bad local models on averaging, proving that a single bad model can render the average arbitrarily bad.

A different approach to dealing with biased local data is to learn personalized models at each client. Such personalized FL (Li et al., 2021) can reduce sample complexity, e.g., by using shared representations (Collins et al., 2021) for client-specific models, for instance in the medical domain (Yang et al., 2021), or by training sample-efficient personalized Bayesian methods (Achituve et al., 2021). It is not applicable, however, to settings where one is not allowed to learn the biases or batch effects of local clients, e.g., in many medical applications where this would expose sensitive client information. Kiss and Horvath (2021) propose a decentralized and communication-efficient variant of federated learning that migrates models over a decentralized network, storing incoming models locally at each client until sufficiently many models are collected for an averaging step, similar to gossip federated learning (Jelasity et al., 2005). The variant without averaging is similar to the simple daisy-chaining baseline we compare to in Section 7. FEDDC is compatible with any aggregation operator, including the Radon machine (Kamp et al., 2017), the geometric median (Pillutla et al., 2022), and neuron-clustering (Yurochkin et al., 2019), and can be straightforwardly combined with approaches that improve communication efficiency, such as dynamic averaging (Kamp et al., 2018) and model quantization (Reisizadeh et al., 2020b). We combine FEDDC with averaging, the Radon machine, and FEDPROX (Li et al., 2020a) in Sec. 7.
3 PRELIMINARIES

We assume iterative learning algorithms (cf. Chp. 2.1.4, Kamp, 2019) A : X × Y × H → H that update a model h ∈ H using a dataset D ⊂ X × Y from an input space X and output space Y, i.e., h_{t+1} = A(D, h_t). Given a set of m ∈ ℕ clients with local datasets D_1, ..., D_m ⊂ X × Y drawn iid from a data distribution 𝒟 and a loss function ℓ : Y × Y → ℝ, the goal is to find a single model h ∈ H that minimizes the risk ε(h) = E_{(x,y)∼𝒟}[ℓ(h(x), y)]. In centralized learning, the datasets are pooled as D = ⋃_{i∈[m]} D_i and A is applied to D until convergence. Note that applying A to D can mean applying it to any random subset, e.g., as in mini-batch training, and convergence is measured in terms of a low training loss, a small gradient, or a small deviation from the previous iterate. In standard federated learning (McMahan et al., 2017), A is applied in parallel for b ∈ ℕ rounds on each client locally to produce local models h_1, ..., h_m. These models are then centralized and aggregated using an aggregation operator agg : H^m → H, i.e., h̄ = agg(h_1, ..., h_m). The aggregated model h̄ is then redistributed to the local clients, which perform another b rounds of training using h̄ as a starting point. This is iterated until convergence of h̄. When aggregating by averaging, this method is known as federated averaging (FEDAVG). Next, we describe FEDDC.

4 FEDERATED DAISY-CHAINING

We propose federated daisy-chaining as an extension to federated learning in a setup with m clients and one designated server.¹ We provide the pseudocode of our approach as Algorithm 1.

The client: Each client trains its local model in each round on local data (line 4), and sends its model to the server every b rounds for aggregation, where b is the aggregation period, and every d rounds for daisy-chaining, where d is the daisy-chaining period (line 6). This redistribution of models results in each individual model conceptually following a daisy chain of clients, training on each local dataset. Such a daisy chain is interrupted by each aggregation round.

The server: Upon receiving models, in a daisy-chaining round (line 9) the server draws a random permutation π of clients (line 10) and redistributes the model of client i to client π(i) (line 11), while in an aggregation round (line 12) the server instead aggregates all local models and redistributes the aggregate to all clients (lines 13-14).

¹ This star topology can be extended to hierarchical networks in a straightforward manner. Federated learning can also be performed in a decentralized network via gossip algorithms (Jelasity et al., 2005).

Algorithm 1: Federated Daisy-Chaining (FEDDC)
Input: daisy-chaining period d, aggregation period b, learning algorithm A, aggregation operator agg, m clients with local datasets D_1, ..., D_m, total number of rounds T
Output: final model aggregate h̄_T
 1   initialize local models h^1_0, ..., h^m_0
 2   Locally at client i at time t do
 3       sample S from D_i
 4       h^i_t ← A(S, h^i_{t-1})
 5       if t mod d = d − 1 or t mod b = b − 1 then
 6           send h^i_t to server
 7           receive new h^i_t from server        // either the aggregate h̄_t or some h^j_t
 8   At server at time t do
 9       if t mod d = d − 1 then                  // daisy-chaining
10           draw a permutation π of [m] at random
11           for all i ∈ [m] send model h^i_t to client π(i)
12       else if t mod b = b − 1 then             // aggregation
13           h̄_t ← agg(h^1_t, ..., h^m_t)
14           send h̄_t to all clients
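To make the control flow of Algorithm 1 concrete, the following is a minimal, framework-agnostic Python sketch of a simulated FEDDC run. The names (`local_update`, `aggregate`, `clients`) are placeholders and not part of the paper's implementation; any learner and any aggregation operator matching the interfaces of Sec. 3 can be plugged in.

```python
import copy
import random

def feddc(clients, local_update, aggregate, d, b, T, seed=0):
    """Simulated run of Algorithm 1 (FEDDC).

    clients      -- list of (initial_model, local_dataset) pairs
    local_update -- one local training step, h <- A(S, h)
    aggregate    -- aggregation operator agg (e.g. parameter averaging)
    d, b, T      -- daisy-chaining period, aggregation period, total rounds
    """
    rng = random.Random(seed)
    models = [m for m, _ in clients]
    data = [D for _, D in clients]
    for t in range(T):
        # local training step at every client (line 4)
        models = [local_update(models[i], data[i]) for i in range(len(models))]
        if t % d == d - 1:
            # daisy-chaining round: redistribute models by a random permutation (lines 9-11)
            perm = list(range(len(models)))
            rng.shuffle(perm)
            models = [models[perm[i]] for i in range(len(models))]
        elif t % b == b - 1:
            # aggregation round: aggregate and broadcast to all clients (lines 12-14)
            agg = aggregate(models)
            models = [copy.deepcopy(agg) for _ in models]
    return aggregate(models)
```

With `aggregate` set to coordinate-wise parameter averaging this corresponds to the FEDDC variant evaluated in Sec. 7; with the (iterated) Radon point it corresponds to the variant analyzed in Sec. 5.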
Communication complexity: Note that we consider cross-silo settings, such as healthcare, where communication is not a bottleneck; hence, we restrict ourselves to a brief discussion in the interest of space. Communication between clients and server happens in O(T/b) many rounds, where T is the overall number of rounds. Since FEDDC communicates every d-th and every b-th round, the number of communication rounds is similar to FEDAVG with averaging period b_FedAvg = min{d, b}. That is, FEDDC increases communication over FEDAVG by a constant factor depending on the setting of b and d. The amount of communication per communication round is linear in the number of clients and the model size, similar to federated averaging. We investigate the performance of FEDAVG provided with the same communication capacity as FEDDC in our experiments and in App. A.3.6.

5 THEORETICAL GUARANTEES

In this section, we formally show that FEDDC converges for averaging. We further provide theoretical bounds on model quality in convex settings, showing that FEDDC has a favorable generalization error in low-sample settings compared to standard federated learning. More formally, we first show that, under standard assumptions on the empirical risk, it follows from a result of Yu et al. (2019) that FEDDC converges when using averaging as aggregation and SGD for learning, a standard setting in, e.g., federated learning of neural networks. We provide all proofs in the appendix.

Corollary 1. Let the empirical risks E^i_emp(h) = Σ_{(x,y)∈D_i} ℓ(h(x), y) at each client i ∈ [m] be L-smooth with σ²-bounded gradient variance and G²-bounded second moments. Then FEDDC with averaging and SGD has a convergence rate of O(1/√(mT)), where T is the number of local updates.

Since model quality in terms of generalization error does not necessarily depend on convergence of training (Haddadpour and Mahdavi, 2019; Kamp et al., 2018), we additionally analyze model quality in terms of probabilistic worst-case guarantees on the generalization error (Shalev-Shwartz and Ben-David, 2014). The average of local models can yield as bad a generalization error as the worst local model; hence, using averaging as the aggregation scheme in standard federated learning can yield arbitrarily bad results (cf. Shamir and Srebro, 2014). As the probability of bad local models increases starkly with smaller sample sizes, this trivial bound often carries over to our considered practical settings. The Radon machine (Kamp et al., 2017) is a federated learning approach that overcomes these issues for a wide range of learning algorithms and allows us to analyze (non-trivial) quality bounds of aggregated models under the assumption of convexity. Next, we show that FEDDC can improve model quality for small local datasets where standard federated learning fails to do so.

Figure 2: Results on SUSY. Train (green) and test (orange) error for (a) FEDDC with Radon-point aggregation and d = 1, b = 50, (b) standard federated learning with the Radon point and b = 1, i.e., the same amount of communication as FEDDC, and (c) standard federated learning with the Radon point and b = 50, i.e., the same aggregation period as FEDDC. The network has 441 clients with 2 data points per client; the performance of a central model trained on all data is indicated by the dashed line.

A Radon point (Radon, 1921) of a set of points S from a space X is, similar to the geometric median, a point in the convex hull of S with high centrality, i.e., a Tukey depth (Tukey, 1975; Gilad-Bachrach et al., 2004) of at least 2. For a Radon point to exist, S ⊆ X has to have a minimum size r ∈ ℕ, called the Radon number of X; for X = ℝ^d the Radon number is d + 2. Here, the set of points S are the local models, or, more precisely, their parameter vectors.
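For illustration, a Radon point of r = d + 2 points in ℝ^d can be computed from an affine dependence between the points (Radon, 1921), and the iterated variant simply applies this aggregation level by level to r^h points. The numpy sketch below is a reconstruction from these classical definitions, not the Radon machine implementation of Kamp et al. (2017) or App. A.2.

```python
import numpy as np

def radon_point(points):
    """Radon point of r = d + 2 points in R^d.

    Finds coefficients a with sum_i a_i x_i = 0 and sum_i a_i = 0 (an affine
    dependence) and returns the point shared by the convex hulls of the
    positively and non-positively weighted parts.
    """
    points = np.asarray(points, dtype=float)      # shape (d + 2, d)
    r, d = points.shape
    assert r == d + 2, "a Radon point needs exactly d + 2 points"
    # Coordinate equations plus the affine constraint sum_i a_i = 0.
    system = np.vstack([points.T, np.ones(r)])    # shape (d + 1, d + 2)
    a = np.linalg.svd(system)[2][-1]              # null-space vector
    pos = a > 0
    return points[pos].T @ a[pos] / a[pos].sum()

def iterated_radon_point(params, h):
    """Aggregate r**h parameter vectors by h levels of Radon-point folding."""
    pts = np.asarray(params, dtype=float)
    r = pts.shape[1] + 2
    for _ in range(h):
        pts = np.array([radon_point(pts[i:i + r]) for i in range(0, len(pts), r)])
    return pts[0]
```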
We make the following standard assumption (Von Luxburg and Schölkopf, 2011) on the local learning algorithm A.

Assumption 2 ((ϵ, δ)-guarantees). The learning algorithm A applied to a dataset of size n ≥ n_0 ∈ ℕ drawn iid from 𝒟 produces a model h ∈ H such that for ϵ > 0 and δ ∈ (0, 1] it holds that P(ε(h) > ϵ) < δ. The sample size n_0 is monotonically decreasing in δ and ϵ (note that typically n_0 is a polynomial in ϵ^{-1} and log(δ^{-1})). Here, ε(h) is the risk defined in Sec. 3.

Now let r ∈ ℕ be the Radon number of H, let A be a learning algorithm as in Assumption 2, and let the risk ε be convex. Assume m ≥ r^h many clients with h ∈ ℕ. For ϵ > 0 and δ ∈ (0, 1], assume local datasets D_1, ..., D_m of size larger than n_0(ϵ, δ) drawn iid from 𝒟, and let h_1, ..., h_m be local models trained on them using A. Let r_h be the iterated Radon point (Clarkson et al., 1996) with h iterations computed on the local models (for details, see App. A.2). Then it follows from Theorem 3 in Kamp et al. (2017) that for all i ∈ [m] it holds that

P(ε(r_h) > ϵ) ≤ (r · P(ε(h_i) > ϵ))^{2^h} ,   (1)

where the probability is over the random draws of local datasets. That is, the probability that the aggregate r_h is bad is doubly-exponentially smaller than the probability that a local model is bad. Note that in PAC-learning, the error bound and the probability of the bound to hold are typically linked, so that improving one can be translated into improving the other (Von Luxburg and Schölkopf, 2011). Eq. 1 implies that the iterated Radon point improves the guarantee on the confidence compared to that of the local models only if δ < r^{-1}, i.e., P(ε(r_h) > ϵ) ≤ (r · P(ε(h_i) > ϵ))^{2^h} < (rδ)^{2^h} < 1 only holds for rδ < 1. Consequently, local models need to achieve a minimum quality for the federated learning system to improve model quality.

Corollary 3. Let H be a model space with Radon number r ∈ ℕ, ε a convex risk, and A a learning algorithm with sample size n_0(ϵ, δ). Given ϵ > 0 and any h ∈ ℕ, if the local datasets D_1, ..., D_m with m ≥ r^h are smaller than n_0(ϵ, r^{-1}), then federated learning using the Radon point does not improve model quality in terms of (ϵ, δ)-guarantees.

In other words, when using aggregation by Radon points alone, an improvement in terms of (ϵ, δ)-guarantees strongly depends on large enough local datasets. Furthermore, given δ > r^{-1}, the guarantee can become arbitrarily bad by increasing the number of aggregation rounds. Federated daisy-chaining as given in Alg. 1 permutes local models at random, which is in theory equivalent to permuting local datasets. Since the permutation is drawn at random, the number of rounds T necessary for each model to observe a minimum number k of distinct datasets with probability 1 − ρ can be bounded via a variation of the coupon collector problem as T ≥ dmρ^{-1} · m(H_m − H_{m−k}), where H_m is the m-th harmonic number (see Lm. 5 in App. A.5 for details). It follows that when we perform daisy-chaining with m clients and local datasets of size n for at least dmρ^{-1} · m(H_m − H_{m−k}) rounds, then each local model will, with probability at least 1 − ρ, be trained on at least kn distinct samples.
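As a quick numerical illustration of this bound, the sketch below evaluates T ≥ dmρ⁻¹ · m(H_m − H_{m−k}) for hypothetical values of m, k, d, and ρ; these numbers are examples for illustration, not settings from the paper.

```python
from math import fsum

def harmonic(n: int) -> float:
    """n-th harmonic number H_n = sum_{i=1}^{n} 1/i."""
    return fsum(1.0 / i for i in range(1, n + 1))

def daisy_chaining_rounds(m: int, k: int, d: int, rho: float) -> float:
    """Rounds T after which, with probability >= 1 - rho, every model has
    been trained on at least k distinct local datasets:
    T >= d * m * rho**-1 * m * (H_m - H_{m-k})."""
    return d * m / rho * m * (harmonic(m) - harmonic(m - k))

# Hypothetical example: 150 clients, daisy-chaining every round (d = 1),
# every model should see k = 10 distinct datasets with probability 0.9.
print(round(daisy_chaining_rounds(m=150, k=10, d=1, rho=0.1)))
```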
For an (ϵ, δ)-guarantee, we thus need to set b large enough so that kn ≥ n_0(ϵ, δ) with probability at least 1 − δ. This way, the failure probability is composed of the probability that not all clients observe k distinct datasets and the probability that the model has a risk larger than ϵ.

Proposition 4. Let H be a model space with Radon number r ∈ ℕ, ε a convex risk, and A a learning algorithm with sample size n_0(ϵ, δ). Given ϵ > 0, δ ∈ (0, r^{-1}), and any h ∈ ℕ, and local datasets D_1, ..., D_m of size n ∈ ℕ with m ≥ r^h, Alg. 1 using the Radon point with an aggregation period of at least dmδ^{-1} · m(H_m − H_{m − n^{-1}n_0(ϵ,δ)}) improves model quality in terms of (ϵ, δ)-guarantees.

This result implies that if enough daisy-chaining rounds are performed in-between aggregation rounds, federated learning via the iterated Radon point improves model quality in terms of (ϵ, δ)-guarantees: the resulting model has a generalization error smaller than ϵ with probability at least 1 − δ. Note that the aggregation period cannot be increased arbitrarily without harming convergence. To illustrate the interplay between these variables, we provide a numerical analysis of Prop. 4 in App. A.5.1.

This theoretical result is also evident in practice, as we show in Fig. 2. There, we compare FEDDC with standard federated learning, equipping both with the iterated Radon point, on the SUSY binary classification dataset (Baldi et al., 2014). We train a linear model on 441 clients with only 2 samples per client. After 500 rounds, FEDDC, daisy-chaining every round (d = 1) and aggregating every fifty rounds (b = 50), reaches the test accuracy of a gold-standard model trained on the centralized dataset (ACC = 0.77). Standard federated learning with the same communication complexity, i.e., b = 1, is outperformed by a large margin (ACC = 0.68). We additionally provide results for standard federated learning with b = 50 (ACC = 0.64), which shows that while the aggregated models perform reasonably, the standard approach heavily overfits on local datasets if not pulled to a global average in every round. More details on this experiment can be found in App. A.3.2. In Sec. 7 we show that the empirical results for averaging as aggregation operator are similar to those for the Radon machine. First, we discuss the privacy aspects of FEDDC.

6 DATA PRIVACY

Figure 3: Differential privacy results. Comparison of FEDDC (top solid line) to FEDDC with clipped parameter updates and Gaussian noise (dashed lines; S = 2, σ = 0.01; S = 2, σ = 0.02; S = 4, σ = 0.05) on CIFAR10 with 250 clients.

A major advantage of federated over centralized learning is that local data remains undisclosed to anyone but the local client; only model parameters are exchanged. This provides a natural benefit to data privacy, which is the main concern in applications such as healthcare. However, an attacker can make inferences about local data from model parameters (Ma et al., 2020) and from model updates or gradients (Zhu and Han, 2020). In the daisy-chaining rounds of FEDDC, clients receive a model that was directly trained on the local data of another client, instead of a model aggregate, potentially facilitating membership inference attacks (Shokri et al., 2017). Reconstruction attacks (Zhu and Han, 2020) remain difficult, because model updates cannot be inferred since the server randomly permutes the order of clients in daisy-chaining rounds.
Should a malicious client obtain model updates through additional attacks, a common defense is applying appropriate clipping and noise before sending models. This guarantees (ϵ, δ)-differential privacy for local data (Wei et al., 2020) at the cost of a slight-to-moderate loss in model quality. This technique has also been proven to defend against backdoor and poisoning attacks (Sun et al., 2019). Moreover, FEDDC is compatible with standard defenses against such attacks, such as noisy or robust aggregation (Liu et al., 2022); FEDDC with the Radon machine is an example of robust aggregation. We illustrate the effectiveness of FEDDC with differential privacy in the following experiment. We train a small ResNet on 250 clients using FEDDC with d = 2 and b = 10, postponing the details of the experimental setup to App. A.1.1 and A.1.2. Differential privacy is achieved by clipping local model updates and adding Gaussian noise as proposed by Geyer et al. (2017). The results shown in Figure 3 indicate that the standard trade-off between model quality and privacy holds for FEDDC as well. Moreover, for mild privacy settings the model quality does not decrease. That is, FEDDC is able to robustly predict even under differential privacy. We provide an extended discussion on the privacy aspects of FEDDC in App. A.7.
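A minimal sketch of this defense, assuming the common recipe of clipping a flattened model update to an L2 norm S and adding Gaussian noise with standard deviation σ·S; the values below correspond to one of the settings shown in Figure 3, while the exact calibration used in the paper's experiments is described in its appendix.

```python
import numpy as np

def privatize_update(update, clip_norm=2.0, sigma=0.01, rng=None):
    """Clip a flattened model update to L2 norm `clip_norm` and add Gaussian
    noise with standard deviation sigma * clip_norm before sending it."""
    rng = rng or np.random.default_rng()
    update = np.asarray(update, dtype=float)
    scale = min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    noise = rng.normal(0.0, sigma * clip_norm, size=update.shape)
    return update * scale + noise
```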
7 EXPERIMENTS ON DEEP LEARNING

Our approach FEDDC, both provably and empirically, improves model quality when using Radon points for aggregation, which, however, requires convex problems. For non-convex problems, in particular deep learning, averaging is the state-of-the-art aggregation operator. We hence evaluate FEDDC with averaging against the state of the art in federated learning on synthetic and real-world data using neural networks. As baselines, we consider federated averaging (FEDAVG) (McMahan et al., 2017) with optimal communication, FEDAVG with equal communication as FEDDC, and simple daisy-chaining without aggregation. We further consider the four state-of-the-art methods FEDPROX (Li et al., 2020a), FEDADAGRAD, FEDYOGI, and FEDADAM (Reddi et al., 2020). As datasets, we consider a synthetic classification dataset, image classification on CIFAR10 (Krizhevsky, 2009), and two real medical datasets: MRI scans for brain tumors² and chest X-rays for pneumonia³. We provide additional results on MNIST in App. A.3.8. Details on the experimental setup are in App. A.1.1 and A.1.2; code is publicly available at https://github.com/kampmichael/FedDC.

² kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection
³ kaggle.com/praveengovi/coronahack-chest-xraydataset

Synthetic Data: We first investigate the potential of FEDDC on a synthetic binary classification dataset generated by the sklearn (Pedregosa et al., 2011) make_classification function with 100 features. On this dataset, we train a simple fully connected neural network with 3 hidden layers on m = 50 clients with n = 10 samples per client. We compare FEDDC with daisy-chaining period d = 1 and aggregation period b = 200 to FEDAVG with the same amount of communication (b = 1) and with the same averaging period (b = 200). The results presented in Fig. 4 show that FEDDC achieves a test accuracy of 0.89. This is comparable to centralized training on all data, which achieves a test accuracy of 0.88. It substantially outperforms both FEDAVG setups, which result in accuracies of 0.80 and 0.76. Investigating the training of local models between aggregation periods reveals that the main issue of FEDAVG is overfitting of local clients: the FEDAVG training accuracy reaches 1.0 quickly after each averaging step. With these promising results on vanilla neural networks, we next turn to real-world image classification problems typically solved with CNNs.

Figure 4: Synthetic data results. Comparison of (a) FEDDC with d = 1, b = 200, (b) FEDAVG with the same communication (b = 1), and (c) FEDAVG with the same averaging period (b = 200) for training fully connected NNs on synthetic data. We report mean accuracy and confidence per client in color and the accuracy of central learning as a dashed black line.

CIFAR10: As a first challenge for image classification, we consider the well-known CIFAR10 image benchmark. We first investigate the effect of the aggregation period b on FEDDC and FEDAVG, separately optimizing the period for both methods. We use a setting of 250 clients with a small version of ResNet and 64 local samples each, drawn at random without replacement, which simulates our small-sample setting (details in App. A.1.2). We report the results in Figure 5; for all subsequent experiments we set the period for FEDDC to b = 10 and consider federated averaging with periods of both b = 1 (equivalent communication to FEDDC with d = 1, b = 10) and b = 10 (less communication than FEDDC by a factor of 10).

Figure 5: Averaging periods on CIFAR10. For 150 clients with small ResNets and 64 samples per client, we visualize the test accuracy (higher is better) of FEDDC and FEDAVG for different aggregation periods b.

Next, we consider a subset of 9600 samples spread across 150 clients (i.e., 64 samples per client), which corresponds to our small-sample setting. Now, each client is equipped with a larger, untrained ResNet18.⁴ Note that the combined number of examples is only one fifth of the original training data; hence we cannot expect typical CIFAR10 performance. To obtain a gold standard for comparison, we run centralized learning (CENTRAL), separately optimizing its hyperparameters, yielding an accuracy of around 0.65. All results are reported in Table 1, where we report FEDAVG with b = 1 and b = 10, as these were the best-performing settings and b = 1 corresponds to an equal amount of communication as FEDDC. We use a daisy-chaining period of d = 1 for FEDDC throughout all experiments for consistency, and provide results for larger daisy-chaining periods, which, depending on the data distribution, might be favorable, in App. A.3.5. We observe that FEDDC achieves substantially higher accuracy than the baseline set by federated averaging. In App. A.3.7 we show that this also holds for client subsampling. Upon further inspection, we see that FEDAVG drastically overfits, achieving training accuracies of 0.97 (App. A.3.1), a similar trend as on the synthetic data before. Daisy-chaining alone, apart from its privacy issues, also performs worse than FEDDC. Intriguingly, the state of the art shows similar trends. FEDPROX, run with optimal b = 10 and µ = 0.1, only achieves an accuracy of 0.51, and FEDADAGRAD, FEDYOGI, and FEDADAM show even worse performance of around 0.22, 0.31, and 0.34, respectively. While applied successfully on large-scale data, these methods seem to have shortcomings when it comes to small-sample regimes.

⁴ Due to hardware restrictions we are limited to training 150 ResNets, hence 9600 samples across 150 clients.
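For reference, the following numpy sketch shows one plausible way to produce the small-sample iid split used above and the pathological non-IID split used in the next set of experiments (each client sees only 2 of the 10 classes); the paper's exact partitioning code may differ.

```python
import numpy as np

def split_iid(labels, n_clients=150, per_client=64, seed=0):
    """Assign per_client examples to each client at random without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))[: n_clients * per_client]
    return np.split(idx, n_clients)

def split_pathological(labels, n_clients=150, per_client=64, classes_per_client=2, seed=0):
    """Pathological non-IID split: each client only sees examples
    from classes_per_client of the available classes."""
    rng = np.random.default_rng(seed)
    pools = {c: list(rng.permutation(np.where(labels == c)[0])) for c in np.unique(labels)}
    classes = list(pools)
    clients = []
    for _ in range(n_clients):
        chosen = rng.choice(classes, size=classes_per_client, replace=False)
        per_class = per_client // classes_per_client
        idx = [pools[c].pop() for c in chosen for _ in range(per_class)]
        clients.append(np.array(idx))
    return clients
```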
To model different data distributions across clients, as could occur for example in our healthcare setting, we ran further experiments on simulated non-iid data and on gradually increasing locally available data, as well as on non-privacy-preserving decentralized learning. We investigate the effect of non-iid data on FEDDC by studying the pathological non-IID partition of the data (McMahan et al., 2017). Here, each client only sees examples from 2 out of the 10 classes of CIFAR10. We again use a subset of the dataset. The results in Tab. 2 show that FEDDC outperforms FEDAVG by a wide margin. It also outperforms FEDPROX, a method specialized for heterogeneous datasets, in our considered small-sample setting. For a similar training setup as before, we show results for gradually increasing local datasets in App. A.3.4. Most notably, FEDDC outperforms FEDAVG even with 150 samples locally. Only when the full CIFAR10 dataset is distributed across the clients is FEDAVG on par with FEDDC (see App. Fig. 7).

We also compare with distributed training through gradient sharing (App. A.3.3), which discards any privacy concerns, implemented by mini-batch SGD with parameter settings corresponding to our federated setup as well as a separately optimized version. The results show that such an approach is outperformed by both FEDAVG and FEDDC, which is in line with previous findings and emphasizes the importance of model aggregation.

As a final experiment on CIFAR10, we consider daisy-chaining in combination with different aggregation methods, and hence its ability to serve as a building block that can be combined with other federated learning approaches. In particular, we consider the same setting as before and combine FEDPROX with daisy-chaining. The results, reported in Tab. 2, show that this combination is not only successful, but also outperforms all others in terms of accuracy.

| Method | CIFAR10 | MRI | Pneumonia |
|---|---|---|---|
| FEDDC (ours) | 62.9 ±0.02 | 78.4 ±0.61 | 83.2 ±0.84 |
| DC (baseline) | 58.4 ±0.85 | 57.7 ±1.57 | 79.8 ±0.99 |
| FEDAVG (b=1) | 55.8 ±0.78 | 74.1 ±1.68 | 80.1 ±1.53 |
| FEDAVG (b=10) | 48.7 ±0.87 | 75.6 ±1.18 | 79.4 ±1.11 |
| FEDPROX | 51.1 ±0.80 | 76.5 ±0.50 | 80.0 ±0.36 |
| FEDADAGRAD | 21.8 ±0.01 | 45.7 ±1.25 | 62.5 ±0.01 |
| FEDYOGI | 31.4 ±4.37 | 71.3 ±1.62 | 77.6 ±0.64 |
| FEDADAM | 34.0 ±0.23 | 73.8 ±1.98 | 73.5 ±0.36 |
| CENTRAL | 65.1 ±1.44 | 82.1 ±1.00 | 84.1 ±3.31 |

Table 1: Results on image data. Reported is the average test accuracy of the final model over three runs (± denotes the maximum deviation from the average).

| Combination (CIFAR10) | Accuracy |
|---|---|
| FEDDC | 62.9 ±0.02 |
| FEDDC + FEDPROX | 63.2 ±0.38 |

| Non-iid CIFAR10 | Accuracy |
|---|---|
| FEDDC | 34.2 ±0.61 |
| FEDAVG (b=1) | 30.2 ±2.11 |
| FEDAVG (b=10) | 24.9 ±1.95 |
| FEDPROX | 32.8 ±0.00 |
| FEDADAGRAD | 11.7 ±0.00 |
| FEDADAM | 13.0 ±0.00 |
| FEDYOGI | 12.5 ±0.04 |

Table 2: Combination of FEDDC with FEDAVG and FEDPROX (top) and non-iid results (bottom) on CIFAR10.
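The FEDPROX combination above only changes the local objective: each client adds a proximal term (µ/2)·‖w − w_received‖² that anchors training to the last model it received from the server (the aggregate, or, in a daisy-chaining round, a permuted peer model). A minimal PyTorch-style sketch of such a local step, with hypothetical names and µ = 0.1 as in the CIFAR10 experiments; this is an illustration, not the paper's implementation.

```python
import torch

def fedprox_local_step(model, received_params, batch, loss_fn, optimizer, mu=0.1):
    """One FedProx-style local update: task loss plus the proximal term
    (mu / 2) * ||w - w_received||^2 over all model parameters.

    received_params: snapshot of the last model received from the server,
    e.g. [p.detach().clone() for p in model.parameters()].
    """
    x, y = batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    prox = sum(((p - p0.detach()) ** 2).sum()
               for p, p0 in zip(model.parameters(), received_params))
    (loss + 0.5 * mu * prox).backward()
    optimizer.step()
    return float(loss.detach())
```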
Medical image data: Finally, we consider two real medical image datasets representing actual health-related machine learning tasks, which are naturally of small sample size. For the brain MRI scans, we simulate 25 clients (e.g., hospitals) with 8 samples each. Each client is equipped with a CNN (see App. A.1.1). The results for brain tumor prediction, evaluated on a test set of 53 of these scans, are reported in Table 1. Overall, FEDDC performs best among the federated learning approaches and is close to the centralized model. Whereas FEDPROX performed comparably poorly on CIFAR10, it now outperforms FEDAVG. Similar to before, we observe a considerable margin between all competing methods and FEDDC. To investigate the effect of skewed distributions of sample sizes across clients, such as smaller hospitals having less data than larger ones, we provide additional experiments in App. A.3.5. The key insight is that also in these settings FEDDC outperforms FEDAVG considerably and is close to its performance on the unskewed datasets. For the pneumonia dataset, we simulate 150 clients training a ResNet18 (see App. A.1.1) with 8 samples per client; the held-out test set consists of 624 images. The results, reported in Table 1, show similar trends as for the other datasets, with FEDDC outperforming all baselines and the state of the art, and coming close to the performance of the centrally trained model. Moreover, it highlights that FEDDC enables us to train a ResNet18 to high accuracy with as few as 8 samples per client.

8 DISCUSSION AND CONCLUSION

We propose to combine daisy-chaining and aggregation to effectively learn high-quality models in a federated setting where only little data is available locally. We formally prove convergence of our approach FEDDC, and for convex settings we provide PAC-like generalization guarantees when aggregating by iterated Radon points. Empirical results on the SUSY benchmark underline these theoretical guarantees, with FEDDC matching the performance of centralized learning. An extensive empirical evaluation shows that the proposed combination of daisy-chaining and aggregation enables federated learning from small datasets in practice. When using averaging, we improve upon the state of the art for federated deep learning by a large margin in the considered small-sample settings. Last but not least, we show that daisy-chaining is not restricted to FEDDC, but can be straightforwardly included as a building block in FEDAVG, Radon machines, and FEDPROX, too. FEDDC permits differential privacy mechanisms that introduce noise on model parameters, offering protection against membership inference, poisoning, and backdoor attacks. Through the random permutations in daisy-chaining rounds, FEDDC is also robust against reconstruction attacks. Through the daisy-chaining rounds, we see a linear increase in communication. As we are primarily interested in healthcare applications, where communication is not a bottleneck, such an increase in communication is negligible. Importantly, FEDDC outperforms FEDAVG in practice also when both use the same amount of communication. Improving communication efficiency for settings where bandwidth is limited, e.g., model training on mobile devices, would make for engaging future work. We conclude that daisy-chaining lends itself as a simple yet effective building block to improve federated learning, complementing existing work to extend to settings where little data is available per client. FEDDC thus might offer a solution to the open problem of federated learning in healthcare, where very few, undisclosable samples are available at each site.

ACKNOWLEDGMENTS

The authors thank Sebastian U. Stich for his detailed comments on an earlier draft. Michael Kamp received support from the Cancer Research Center Cologne Essen (CCCE). Jonas Fischer is supported by a grant from the US National Cancer Institute (R35CA220523).

REFERENCES

Idan Achituve, Aviv Shamsian, Aviv Navon, Gal Chechik, and Ethan Fetaya. Personalized federated learning with Gaussian processes. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.
Giuseppe Ateniese, Luigi V Mancini, Angelo Spognardi, Antonio Villani, Domenico Vitali, and Giovanni Felici. Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. International Journal of Security and Networks, 10(3):137–150, 2015.

Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5(1):1–9, 2014.

Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. Analyzing federated learning through an adversarial lens. In International Conference on Machine Learning, pages 634–643. PMLR, 2019.

Kenneth L Clarkson, David Eppstein, Gary L Miller, Carl Sturtivant, and Shang-Hua Teng. Approximating center points with iterative Radon points. International Journal of Computational Geometry & Applications, 6(03):357–377, 1996.

Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2089–2099. PMLR, 2021.

Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.

Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Bayes and Tukey meet at the center point. In International Conference on Computational Learning Theory, pages 549–563. Springer, 2004.

Kristin L Granlund, Sui-Seng Tee, Hebert A Vargas, Serge K Lyashchenko, Ed Reznik, Samson Fine, Vincent Laudone, James A Eastham, Karim A Touijer, Victor E Reuter, et al. Hyperpolarized MRI of human prostate cancer reveals increased lactate with tumor grade driven by monocarboxylate transporter 1. Cell Metabolism, 31(1):105–114, 2020.

Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.

Weituo Hao, Mostafa El-Khamy, Jungwon Lee, Jianyi Zhang, Kevin J Liang, Changyou Chen, and Lawrence Carin Duke. Towards fair federated learning with zero-shot data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3310–3319, 2021.

Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. FL-NTK: A neural tangent kernel-based framework for federated learning analysis. In International Conference on Machine Learning, pages 4423–4434. PMLR, 2021.

Marwa Ibrahim, Mohammad Wedyan, Ryan Alturki, Muazzam A Khan, and Adel Al-Jumaily. Augmentation in healthcare: Augmented biosignal using deep learning and tensor representation. Journal of Healthcare Engineering, 2021, 2021.

Márk Jelasity, Alberto Montresor, and Ozalp Babaoglu. Gossip-based aggregation in large dynamic networks. ACM Transactions on Computer Systems (TOCS), 23(3):219–252, 2005.

Michael Kamp. Black-Box Parallelization for Machine Learning. PhD thesis, Rheinische Friedrich-Wilhelms-Universität Bonn, Universitäts- und Landesbibliothek Bonn, 2019.

Michael Kamp, Mario Boley, Olana Missura, and Thomas Gärtner. Effective parallelisation for machine learning. In Advances in Neural Information Processing Systems, volume 30, pages 6480–6491. Curran Associates, Inc., 2017.
Michael Kamp, Linara Adilova, Joachim Sicking, Fabian Hüger, Peter Schlicht, Tim Wirtz, and Stefan Wrobel. Efficient decentralized deep learning by dynamic model averaging. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 393–409. Springer, 2018.

Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8420–8429, 2019.

Péter Kiss and Tomas Horvath. Migrating models: A decentralized view on federated learning. In Proceedings of the Workshop on Parallel, Distributed, and Federated Learning. Springer, 2021.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, 2009.

Sang Gyu Kwak and Jong Hae Kim. Central limit theorem: the cornerstone of modern statistics. Korean Journal of Anesthesiology, 70(2):144, 2017.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10713–10722, 2021.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Conference on Machine Learning and Systems, 2020a.

Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated learning on non-iid features via local batch normalization. In International Conference on Learning Representations, 2020b.

Pei Liu, Xuemin Wang, Chao Xiang, and Weiye Meng. A survey of text data augmentation. In 2020 International Conference on Computer Communication and Network Security (CCNS), pages 191–195. IEEE, 2020.

Pengrui Liu, Xiangrui Xu, and Wei Wang. Threats, attacks and defenses to federated learning: issues, taxonomy and perspectives. Cybersecurity, 5(1):4, 2022.

Chuan Ma, Jun Li, Ming Ding, Howard H Yang, Feng Shu, Tony QS Quek, and H Vincent Poor. On safeguarding privacy and security in the framework of federated learning. IEEE Network, 34(4):242–248, 2020.

Othmane Marfoq, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2021.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.

Peter Neal. The generalised coupon collector problem. Journal of Applied Probability, 45(3):621–629, 2008.

Corrie A Painter, Esha Jain, Brett N Tomson, Michael Dunphy, Rachel E Stoddard, Beena S Thomas, Alyssa L Damon, Shahrayz Shah, Dewey Kim, Jorge Gómez Tejeda Zañudo, et al. The angiosarcoma project: enabling genomic and clinical discoveries in a rare cancer through patient-partnered research. Nature Medicine, 26(2):181–187, 2020.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. Robust aggregation for federated learning. IEEE Transactions on Signal Processing, 70:1142–1154, 2022.

Viraj Prabhu, Anitha Kannan, Murali Ravuri, Manish Chaplain, David Sontag, and Xavier Amatriain. Few-shot learning for dermatological disease diagnosis. In Machine Learning for Healthcare Conference, pages 532–552. PMLR, 2019.

Johann Radon. Mengen konvexer Körper, die einen gemeinsamen Punkt enthalten. Mathematische Annalen, 83(1):113–115, 1921.

Sashank J Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2020.

Amirhossein Reisizadeh, Farzan Farnia, Ramtin Pedarsani, and Ali Jadbabaie. Robust federated learning: The case of affine distribution shifts. In Advances in Neural Information Processing Systems, volume 33, pages 21554–21565. Curran Associates, Inc., 2020a.

Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pages 2021–2031. PMLR, 2020b.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017.

Xiaoping Su, Xiaofan Lu, Sehrish Khan Bazai, Eva Compérat, Roger Mouawad, Hui Yao, Morgan Rouprêt, Jean-Philippe Spano, David Khayat, Irwin Davidson, et al. Comprehensive integrative profiling of upper tract urothelial carcinomas. Genome Biology, 22(1):1–25, 2021.

Ziteng Sun, Peter Kairouz, Ananda Theertha Suresh, and H Brendan McMahan. Can you really backdoor federated learning? arXiv preprint arXiv:1911.07963, 2019.

Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 242–264. IGI Global, 2010.

John W Tukey. Mathematics and picturing data. In Proceedings of the International Congress of Mathematics, volume 2, pages 523–531, 1975.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, volume 29, pages 3630–3638. Curran Associates, Inc., 2016.

Ulrike Von Luxburg and Bernhard Schölkopf. Statistical learning theory: Models, concepts, and results. In Handbook of the History of Logic, volume 10, pages 651–706. Elsevier, 2011.

Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations, 2019.

Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H Vincent Poor. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020.
Qian Yang, Jianyi Zhang, Weituo Hao, Gregory P. Spell, and Lawrence Carin. FLOP: Federated learning on medical datasets using partial networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3845–3853. Association for Computing Machinery, 2021.

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5693–5700, 2019.

Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pages 7252–7261. PMLR, 2019.

Chengliang Zhang, Suyi Li, Junzhe Xia, Wei Wang, Feng Yan, and Yang Liu. BatchCrypt: Efficient homomorphic encryption for cross-silo federated learning. In USENIX Annual Technical Conference, pages 493–506, 2020.

Ligeng Zhu and Song Han. Deep leakage from gradients. In Federated Learning, pages 17–31. Springer, 2020.