# DISTREAL: Distributed Resource-Aware Learning in Heterogeneous Systems

Martin Rapp¹, Ramin Khalili², Kilian Pfeiffer¹, Jörg Henkel¹
¹ Karlsruhe Institute of Technology, Karlsruhe, Germany
² Huawei Research Center, Munich, Germany
martin.rapp@kit.edu, ramin.khalili@huawei.com, kilian.pfeiffer@kit.edu, henkel@kit.edu

**Abstract.** We study the problem of distributed training of neural networks (NNs) on devices with heterogeneous, limited, and time-varying availability of computational resources. We present an adaptive, resource-aware, on-device learning mechanism, DISTREAL, which is able to fully and efficiently utilize the available resources on devices in a distributed manner, increasing the convergence speed. This is achieved with a dropout mechanism that dynamically adjusts the computational complexity of training an NN by randomly dropping filters of convolutional layers of the model. Our main contribution is the introduction of a design space exploration (DSE) technique, which finds Pareto-optimal per-layer dropout vectors with respect to the resource requirements and convergence speed of the training. Applying this technique, each device is able to dynamically select the dropout vector that fits its available resources without requiring any assistance from the server. We implement our solution in a federated learning (FL) system, where the availability of computational resources varies both between devices and over time, and show through extensive evaluation that we are able to significantly increase the convergence speed over the state of the art without compromising on the final accuracy.

## Introduction

Deep learning has achieved impressive results in a number of diverse domains, such as image classification (Howard et al. 2017; Tan and Le 2019) and board and video games (Silver et al. 2016; Mnih et al. 2016), and is widely applied to distributed systems, such as mobile and sensor networks (Zhang, Patras, and Haddadi 2019), as we consider in this paper. Centralized deep learning, where the training is performed in a single location (e.g., a data center), is often costly, as data needs to be collected and sent over the network to that centralized entity (Shi et al. 2020), and might not be feasible or authorized if the training uses users' private data. FL (McMahan et al. 2017) emerged as an alternative to such techniques, performing distributed learning on each device with the locally available data. FL has proven effective in large-scale systems (Bonawitz et al. 2019; Liu, Wang, and Liu 2019; Chen et al. 2020b). However, training a deep NN model is resource-hungry in terms of computation, energy, time, etc. (You et al. 2018), and it is rather unrealistic to assume that all devices in an FL system can perform all types of training computations all the time, especially if the training is distributed to edge devices, e.g., as suggested for 6G systems.¹ This is because the computational capabilities of devices participating in an FL system may be heterogeneous, e.g., different hardware or different generations (Bonawitz et al. 2019). More importantly, the resources available on a device for training could change over time. This could, for instance, be due to shared resource contention (Dhar et al. 2019), where CPU time, cache memory, energy, etc. are shared between the learning task and parallel tasks.

¹ EU's Horizon Europe (European Commission 2021a,b) calls for proposals, as well as 5G IA (5G Infrastructure Association) and SRIA (Strategic Research and Innovation Agenda) reports (Bernardos and Uusitalo 2021; NetWorld2020), detail such a vision.

We illustrate this with the following two examples. 1) Edge computing has been employed in ML-based real-time video analytics, where each edge device processes images from several camera modules (Ananthanarayanan et al. 2017). Currently, edge devices mostly perform inference, but there is a clear trend towards additionally performing distributed learning via FL (Zhou et al. 2019). The learning task shares computational resources with the inference tasks. The inference workload depends on the activity in the video images and changes over time, as processing is skipped for subsequent similar images to save resources (Ananthanarayanan et al. 2017). These changes happen fast, i.e., in the order of seconds (Zhang et al. 2017), while FL round times may be minutes (Bonawitz et al. 2019). 2) Google's GBoard (Yang et al. 2018) trains a next-word-prediction model using FL on end users' mobile phones. To avoid slowing down user applications, and thereby degrading the user experience, training is performed only when the device is charging and idle, and is aborted when these conditions change. This introduces a bias towards certain devices and users, degrading the model accuracy (Yang et al. 2018). This could be resolved by allowing training also while the device is in use, but only using free resources. Smartphone workloads change within seconds (Tu et al. 2014), which is faster than the GBoard round time of several minutes. In both examples, the learning task is subject to fast-changing resource availability.

While several works study the problem of heterogeneity across devices (Li et al. 2020a; Imteaj et al. 2020), time-varying resource availability has so far been neglected. In this paper, we propose a distributed, resource-aware, adaptive, on-device learning technique, DISTREAL, which enables us to fully exploit and efficiently utilize the available resources on devices for training, dealing with all these types of heterogeneity. Our objective is to maximize the accuracy that is reached after a limited training time on devices, i.e., the convergence speed. To fulfill this goal, we must ensure that C1) the available resources on a device are fully exploited, which requires fine-grained adjustability of the training model on a device and a method to instantly react to changes; and C2) the available, limited resources on a device are used efficiently, to maximize the accuracy improvement and hence the overall convergence speed. Specifically, we provide the following novel contributions:

- We introduce and formulate the problem of heterogeneous, time-varying resource availability in FL.
- We propose a dropout technique to adjust the computational complexity (resource requirements) of training the model at any time. Thereby, each device locally decides on the dropout setting that fits its available resources, without requiring any assistance from the server, addressing C1. This is different from state-of-the-art techniques such as (Caldas et al. 2018; Horváth et al. 2021; Xu et al. 2019; Diao, Ding, and Tarokh 2021), where the server is responsible for regulating the resource requirements of training for each device at the beginning of each training round, which may take several minutes.
- We show that using different per-layer dropout rates achieves a much better trade-off between the resource requirements and the convergence speed than using the same rate at all layers, as done by the state of the art (Caldas et al. 2018; Diao, Ding, and Tarokh 2021), addressing C2. We present a DSE technique to automatically find the Pareto-optimal dropout vectors at design time.
- We implement our solution DISTREAL in an FL system in which the availability of computational resources varies both between devices and over time. We show through extensive evaluation that DISTREAL significantly increases the convergence speed over the state of the art and is robust to rapid changes in resource availability at the devices, without compromising on the final accuracy.

## System Model and Problem Definition

### System Model

We target a distributed system comprising one server and N distributed devices that act as clients. Each device i holds its own local training data X_i. The system uses FL for decentralized training of an NN model from the distributed data. We target a synchronous coordination scheme, which divides the training into many rounds. At the beginning of a round, the server selects n devices to participate in the training. Each selected device downloads the most recent model from the server, trains it with its local data, and sends weight updates back to the server. The server combines all received weight updates into a single update by weighted averaging. Updates from devices that take too long to perform the training (stragglers) are discarded.

**Device Resource Model.** The devices are subject to time-varying, limited computational resource availability for training. To which degree the availability of a certain resource affects the training time of an NN depends on the NN and its hyperparameters, but also on the deep learning library implementation and the underlying hardware (Chen et al. 2020c). We abstract from such specifics of the hardware and software implementation, and from the constrained physical resource, to keep this work applicable to many systems, by representing the resource availability as the number of multiply-accumulate operations (MACs) that a device can calculate per time given its specifications and available resources. MAC operations are the fundamental building block of NNs (e.g., fully-connected and convolutional layers) and account for the great majority of operations (Krizhevsky, Sutskever, and Hinton 2012). In the appendix, we also provide experimental evidence for the suitability of MACs/s as an abstract metric. Resource availability varies between devices and over time; therefore, the resource availability r_i(t) depends on the device i and the current time t. Resources may change at any time, i.e., also within an FL round, and are not required to be known ahead of time.

**Objective.** Our objective is to maximize the convergence speed of training, i.e., the accuracy reached after a number of rounds, under heterogeneous (between devices and over time) resource availability.

## Related Work

Many works on resource-aware machine learning focus on resource-aware inference (Tann et al. 2016; Yu et al. 2018; Amir and Givargis 2018; Li et al. 2021a). These techniques allow adapting the inference to dynamically changing availability of resources at run time, but they are not applicable to training. Resource-aware training has recently been getting increasing attention, mostly in the context of FL.

Most attention has so far been paid to limited communication resources, leading to solutions such as compression, quantization, and sketching (Shi et al. 2020; Thakker et al. 2019). Importantly, these works do not reduce the computational resources needed for training, as they are applied after local training has finished. They are complementary/orthogonal to our work and can be adopted in our solution (see the section on the run-time technique). Techniques for computation-resource-aware training can be categorized into two classes: techniques that always train the full NN on each device, but with fewer data or relaxed timing, and techniques that train subsets of the NN.

**Train Full NNs.** FedProx (Li et al. 2020b) allows devices participating in an FL system to deliver partial results to the server by dropping training examples that could not be processed with the available resources. Our previous work (Rapp, Khalili, and Henkel 2020) studied multi-head networks, where each device uses the head that fits its available resources. Devices only synchronize the weights of the first shared layers. However, this technique has low adaptability, as only a few resource levels can be supported. Asynchronous variants of FL have been proposed that allow devices to finish training at any time (Chen et al. 2020a; Xie, Koyejo, and Gupta 2020). However, asynchronous synchronization may reduce the convergence stability (McMahan et al. 2017; Xu et al. 2019). Techniques based on Federated Distillation (Li and Wang 2019; Chang et al. 2019; Lin et al. 2020) synchronize knowledge between devices by exchanging labels on a public dataset instead of exchanging NN weights. Therefore, each device has the design flexibility to use an NN model according to its constraints. However, Federated Distillation cannot cope with time-varying resources.

**Train NN Subsets.** Several techniques perform training only on a dynamic subset of the NN, in order to fit the resource requirements of training to the resource availability on each device. FjORD (Horváth et al. 2021), Yu and Li (2021), and HeteroFL (Diao, Ding, and Tarokh 2021) select subsets of the NN for each device at the beginning of each round. They select the subsets in a hierarchical way, where smaller subsets are fully contained in larger subsets. HeteroFL introduces a shrinkage ratio s that defines the ratio of removed hidden channels to reduce the resource requirements of the NN. The same parameter s is applied repeatedly to all layers to obtain several subsets with decreasing resource requirements. Using hierarchical subsets restricts the granularity of resource requirements, as increasing the number of supported subsets reduces the achievable accuracy (Tann et al. 2016). This limitation can be avoided by selecting subsets randomly, i.e., using different subsets in every round. ELFISH (Xu et al. 2019) randomly removes neurons before training on slow devices at the beginning of a round. Graham, Reizenstein, and Robinson (2015) study the suitability of dropout (Srivastava et al. 2014) to reduce resource requirements. They find that computations can only be saved if dropout is done in a structured way, i.e., the same neurons are dropped for all samples of a mini-batch. Federated Dropout (Caldas et al. 2018) was originally proposed to reduce the communication and computation overhead of FL. It performs dropout at the server and trains a repacked smaller network on the devices.
The dropout masks are changed randomly in each round, so that all parts of the NN are eventually trained. However, it uses the same dropout rate for all devices and a single dropout rate for all layers. All of these works select the trained subset at the server, which may reduce the communication volume, but importantly does not allow adapting to changing resource availability on the devices within a round.

## Resource-Aware Training of NNs

Our technique comprises two parts. At run time (online), we dynamically drop parts of the NN using an adapted version of dropout (Srivastava et al. 2014). The Pareto-optimal vectors of dropout rates w.r.t. convergence speed and resource requirements are obtained at design time (offline) using a DSE. Before going into the details of our contribution, we introduce dropout as the basis of our technique.

### Dropout to Reduce Computations in Training

Dropout was originally designed as a regularization method to mitigate overfitting (Srivastava et al. 2014). It randomly drops individual neurons during training with a certain probability (dropout rate). This results in an irregular, fine-grained pattern of dropped neurons. All major deep learning libraries perform dropout by calculating the output of all neurons and multiplying the dropped ones by 0 (Abadi et al. 2015; Paszke et al. 2019). This wastes computational resources; it would be more efficient to not calculate values that are going to be dropped. However, convolutional and fully-connected layers are implemented as matrix-vector or matrix-matrix operations that are heavily optimized with the help of vectorization (Abadi et al. 2015; Paszke et al. 2019). Skipping the calculation of individual values results in sparse matrix operations, which break vectorization, increasing the required resources instead of decreasing them (Song et al. 2019). To reduce the number of computations, the dropout pattern needs to show some regularity that still allows using vectorized dense matrix operations. This can be achieved by dropping contiguous parts of the computation (Graham, Reizenstein, and Robinson 2015).

Modern NNs consist of many different layer types, such as convolutional, pooling, fully-connected, activation, or normalization layers. Many of these layers are computationally lightweight (e.g., pooling), while some contain the majority of computations (convolutional and fully-connected layers). In state-of-the-art convolutional NNs, the convolutional part requires orders of magnitude more MACs than the fully-connected part (see the appendix for an experimental analysis). We, therefore, argue that a technique to save computations needs to target convolutional layers. Fig. 1 depicts filter-based structured dropout in a convolutional layer, as we apply it in this paper: instead of dropping individual pixels in the output, whole filters are dropped stochastically. This approach reduces the number of computations while still allowing the use of existing vectorization methods.

Figure 1: Filter-based structured dropout in a convolutional layer maintains regularity in the calculations while significantly reducing the required computations.

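For concreteness, the following is a minimal sketch (not the authors' released implementation) of how such filter-level structured dropout can be realized in PyTorch: the dense weight tensor is sliced before the convolution so that dropped filters are never computed, and the indices of the kept filters are forwarded so that the next layer can slice its input channels accordingly. The class name `StructuredDropoutConv2d` and the `in_keep`/`out_keep` convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class StructuredDropoutConv2d(torch.nn.Conv2d):
    """Conv2d that drops whole output filters, sampled once per mini-batch.

    Dropped filters are never computed: the weight tensor is sliced before the
    convolution, so vectorized dense kernels can still be used. Assumes groups=1
    and that the previous layer forwards the indices of its kept filters
    (`in_keep`), which are this layer's surviving input channels.
    """

    def __init__(self, *args, dropout_rate=0.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.dropout_rate = dropout_rate  # may be overwritten before every mini-batch

    def forward(self, x, in_keep=None):
        weight = self.weight if in_keep is None else self.weight[:, in_keep]
        if not self.training or self.dropout_rate == 0.0:
            return F.conv2d(x, weight, self.bias, self.stride,
                            self.padding, self.dilation, self.groups), None
        # Sample the filters to keep (same choice for all samples in the mini-batch).
        keep = torch.rand(self.out_channels, device=x.device) >= self.dropout_rate
        out_keep = keep.nonzero(as_tuple=True)[0]
        bias = self.bias[out_keep] if self.bias is not None else None
        out = F.conv2d(x, weight[out_keep], bias, self.stride,
                       self.padding, self.dilation, self.groups)
        return out, out_keep  # pass `out_keep` as `in_keep` of the next layer
```

Wiring such layers into a full model (e.g., handling batch normalization and DenseNet's concatenations) requires additional bookkeeping that this sketch omits.
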
Fig. 2 depicts how the number of MACs of the forward pass evolves when we apply different vectors of per-layer dropout rates for DenseNet-40 (details in the experimental evaluation). We apply the DSE technique introduced in this paper to determine these vectors and show on the x-axis the resulting ratio of dropped filters (multiple vectors may result in the same ratio of dropped filters while providing different convergence/resource-requirement trade-offs, which is why some ratios map to multiple MAC values). We observe that the number of MACs decreases almost quadratically with this ratio. We also report the training time of a single mini-batch on a Raspberry Pi 4, which serves as an example of an IoT device, using an implementation of this structured dropout in PyTorch (Paszke et al. 2019), which is publicly available at https://git.scc.kit.edu/CES/DISTREAL. A training step (forward pass, backpropagation, and weight updates) requires about 2× more MACs than the forward pass alone (Amodei et al. 2018). The mini-batch training time shows a similar trend as the number of MACs, but with an offset. This is because our implementation does not modify the backend of PyTorch to be aware of this dropout, which results in copy operations of weight tensors to repack them. As a consequence, the measured benefits are smaller than what is theoretically achievable. Changing the backend would bring the benefits closer to the optimum but is not easily doable because of closed sources (e.g., of CUDA). In summary, this experiment shows that structured dropout significantly reduces the required computational resources.

Figure 2: The number of MACs and the mini-batch training time decrease quadratically with the ratio of dropped filters.

The dropout rates also change the convergence speed. A higher dropout rate in a layer means that in each training update, a smaller fraction of the layer's weights is updated, thereby slowing down the training. As a consequence, the dropout rate determines a trade-off between the resource requirements and the convergence speed. The dropout rate should always be selected as low as the available resources allow. To cope with changing resource availability, we propose to dynamically change the dropout rates at run time. Inference always uses the whole model.

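To see why the reduction is roughly quadratic: a convolutional layer's MACs are proportional to the product of its surviving input channels and surviving output filters, so dropping a fraction p of filters in consecutive layers scales the cost by roughly (1 − p)². The sketch below illustrates this for a simple chain of convolutional layers; it is a deliberate simplification (the exact per-layer-type equations, e.g., for DenseNet's concatenations, are given in the paper's appendix), and the function names are ours.

```python
def expected_conv_macs(c_in, c_out, kernel_size, out_h, out_w, p_prev=0.0, p_this=0.0):
    """Expected forward-pass MACs of one convolutional layer under filter dropout.

    Simplified model: every output pixel of every surviving filter needs
    kernel_size^2 MACs per surviving input channel.
    """
    full = c_in * c_out * kernel_size * kernel_size * out_h * out_w
    # Input channels survive with probability (1 - p_prev), because they are the
    # previous layer's filters; this layer's filters survive with (1 - p_this).
    return full * (1.0 - p_prev) * (1.0 - p_this)

def expected_model_macs(layers, dropout_vector):
    """Expected MACs of a plain chain of conv layers for a per-layer dropout vector."""
    total, p_prev = 0.0, 0.0  # the input image is never dropped
    for (c_in, c_out, k, h, w), p in zip(layers, dropout_vector):
        total += expected_conv_macs(c_in, c_out, k, h, w, p_prev, p)
        p_prev = p  # filters dropped here become dropped input channels of the next layer
    return total
```

With equal rates p in adjacent layers, the per-layer cost scales with (1 − p)², which matches the roughly quadratic trend in Fig. 2; this kind of analytical computation also serves as the resource-requirements objective MACs(d) in the DSE described next.
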
### Design-Time DSE: Find Pareto-Optimal Dropout Vectors

The resource requirements (MACs) and the convergence speed both depend on the dropout rate of each layer. Prior works restrict themselves to choosing a common dropout rate for all layers (Caldas et al. 2018). Relaxing this restriction opens up a larger design space, in which the dropout rate of each layer is adjusted towards a better trade-off between resource requirements and training convergence. However, this design space may be too large to be explored manually. For example, DenseNet-100 has 99 convolutional layers that each need to be assigned a dropout rate. Some works apply simple parametric functions of the depth to similar problems (Huang et al. 2016). However, this only works for a homogeneous NN structure, where properties of layers (e.g., MACs) change monotonically. For instance, DenseNet layers alternate between computationally lightweight and complex, rendering a simple parametric function sub-optimal. This section describes the automated DSE technique required to efficiently explore such a large space. The DSE is executed only once at design time (offline).

Specifically, the design space contains all combinations of per-layer dropout values. We select dropout values from the continuous range [0, 0.5], because higher values reduce the final achievable accuracy, as we observe in our experiments and as indicated in previous studies (Srivastava et al. 2014). For an NN with k convolutional layers, the design space is therefore [0, 0.5]^k. We have two objectives, the resource requirements and the convergence speed.

**Resource Requirements:** As discussed in the device resource model, the number of MACs is an implementation-independent representation of the resource requirements. Dropout is a probabilistic process, i.e., the number of MACs varies between different update steps. The resource requirements of a certain dropout vector are therefore represented by the expected number of MACs of the forward pass. This number can be computed analytically from the layer topology and the dropout rates of this layer and the preceding layers. The appendix lists the equations for different layer types.

**Convergence Speed:** The convergence speed with a certain dropout vector is measured by observing the accuracy change during training. Exploring the search space would take too long if a full training were performed with every candidate dropout vector. Instead, we assess the accuracy change after a short training, similar to learning curve extrapolation in neural architecture search (Baker et al. 2017). We train for 64 mini-batches with batch size 64, which allows us to explore many candidate dropout vectors in a reasonable time. This corresponds to the amount of data collected by very few devices. To reduce the impact of random initialization, the NN is not trained from scratch but from a snapshot after partially training it on a distorted version of the dataset. For instance, we reduced the brightness, contrast, and saturation to 0.5 of the original values for the CIFAR-10/100 datasets. The DSE, therefore, does not require access to the devices' data, but only to a small amount of similar (or even synthetic) data. To further reduce the impact of random variations, we repeat this with three different random seeds. The convergence speed is represented by the average accuracy improvement.

Fig. 3 shows our DSE flow. The problem of finding Pareto-optimal dropout vectors is a multi-objective optimization. This is a well-studied class of problems with many established algorithms. Evolutionary algorithms have successfully been employed for neural architecture search (Elsken, Metzen, and Hutter 2019), which is related to the problem studied in this section; note, however, that we are not searching for an architecture, but tuning parameters of a given architecture. The output of the DSE is the Pareto-front of dropout vectors. To have a large variety of options to choose from at run time, but also to keep the number of stored vectors low, the Pareto-front should be represented approximately equidistantly.

Figure 3: Efficient resource-aware training comprises the DSE to find Pareto-optimal vectors of dropout rates per layer and resource-aware training on each device at run time.

We use the NSGA-II (Deb et al. 2002) genetic algorithm from the pygmo2 library (Biscani and Izzo 2020). NSGA-II explores the search space by crossover (combining parts of two dropout vectors) and mutation (random changes) of dropout vectors that are good w.r.t. the objective function, and is designed to obtain dropout vectors that are equidistantly distributed across the Pareto-front. Thereby, an individual is one dropout vector containing the per-layer dropout rates; for our largest studied NN, DenseNet-100, this is 99 float values between 0 and 0.5. A population is a set of individuals; we use a population size of 64. A generation performs one optimization step on the population with the goal of finding the Pareto-front. The optimization minimizes the following two-dimensional fitness function f(d) for a dropout vector d, which normalizes the resource requirements MACs(d) and the convergence speed Acc(d) to the range [0, 1]:

$$
f(d) = \left( \frac{\mathrm{MACs}(d) - \mathrm{MACs}(\{0.5,\dots,0.5\})}{\mathrm{MACs}(\{0,\dots,0\}) - \mathrm{MACs}(\{0.5,\dots,0.5\})},\;
\frac{\mathrm{Acc}(\{0,\dots,0\}) - \mathrm{Acc}(d)}{\mathrm{Acc}(\{0,\dots,0\}) - \mathrm{Acc}(\{0.5,\dots,0.5\})} \right)
$$

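As a rough illustration of how this search could be set up (a sketch under assumptions, not the released DISTREAL code), the two objectives can be wrapped in a user-defined pygmo problem and evolved with NSGA-II; `macs_fn` and `acc_fn` stand in for the analytical MAC computation and the short-training accuracy measurement described above.

```python
import pygmo as pg

class DropoutDSE:
    """User-defined pygmo problem: minimize (normalized MACs, normalized accuracy loss)."""

    def __init__(self, n_layers, macs_fn, acc_fn):
        # macs_fn(d): expected forward-pass MACs for dropout vector d (analytical)
        # acc_fn(d): average accuracy change after a short training (several seeds)
        self.k, self.macs_fn, self.acc_fn = n_layers, macs_fn, acc_fn
        self.macs_min, self.macs_max = macs_fn([0.5] * n_layers), macs_fn([0.0] * n_layers)
        self.acc_min, self.acc_max = acc_fn([0.5] * n_layers), acc_fn([0.0] * n_layers)

    def fitness(self, d):
        f1 = (self.macs_fn(d) - self.macs_min) / (self.macs_max - self.macs_min)
        f2 = (self.acc_max - self.acc_fn(d)) / (self.acc_max - self.acc_min)
        return [f1, f2]

    def get_nobj(self):
        return 2  # two objectives

    def get_bounds(self):
        return ([0.0] * self.k, [0.5] * self.k)

# Illustrative usage (placeholder objective functions):
# prob = pg.problem(DropoutDSE(n_layers=99, macs_fn=..., acc_fn=...))
# pop = pg.population(prob, size=62)
# pop.push_back([0.0] * 99); pop.push_back([0.5] * 99)   # seed the two extreme vectors
# algo = pg.algorithm(pg.nsga2(gen=1))
# for _ in range(50):                                     # 50 generations
#     pop = algo.evolve(pop)
# front = pop.get_x()[pg.non_dominated_front_2d(pop.get_f())]
```
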
Fig. 4 shows the evolving population of dropout vectors for DenseNet-40. The initial population comprises random dropout vectors. We add two samples to the initial population (one with all dropout values 0 and one with all values 0.5) to accelerate the exploration of the Pareto-front (leveraging the crossover operation). After 50 generations of NSGA-II, the Pareto-front has fully evolved and shows a continuous trade-off between resource requirements and convergence speed. Importantly, the Pareto-front found by the DSE provides a significantly better trade-off between resource requirements and convergence speed than using the same dropout rate for all layers.

Figure 4: The evolving Pareto-front for DenseNet-40 significantly outperforms setting the same rate for all layers.

### Run Time: Resource-Aware Training of NNs

After finding the Pareto-optimal dropout vectors, they are stored in a lookup table (LUT) D, along with the corresponding numbers of MACs. The LUT is small (e.g., 25 kB for DenseNet-100, storing 64 dropout vectors of 99 dropout values plus the number of MACs, each in 32-bit format) and stays constant across all rounds. At run time, a device selects the dropout vector d that best corresponds to its resource availability. If the resource availability changes at the device, the dropout vector can be adjusted to these changes at almost zero overhead before every mini-batch. No weight copies, recompilation, repacking of weights, etc. are required for adapting the resource requirements. In an FL setting, each device selects its dropout vector at run time according to its resource availability, as shown in Algorithm 1 below. This is done at the granularity of single mini-batches, i.e., devices can quickly react to changes. Additionally, the server does not need to know the resource availability at each device at the beginning of the round, which reduces signaling overhead and avoids the requirement to know resource availability ahead of time. This is important to maintain scalability with the number of devices.

```
Algorithm 1: Each Selected Device i (Client)
Require: D: LUT of Pareto-optimal dropout vectors (from DSE)
  receive θ_init from server
  θ ← θ_init, c ← 0
  for each b ∈ X_i do                   ▷ iterate over mini-batches from local data
      r ← r_i(t)                        ▷ current resource availability
      d ← D[r]                          ▷ resource-aware dropout vector
      update dropout values of the local NN with d
      θ ← θ − η ∇_θ L(b; θ)             ▷ update step
      c ← c + MACs(d)                   ▷ accumulate computations
  send (θ − θ_init, c) to server        ▷ weight update and computations
```
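In code, the per-mini-batch selection in Algorithm 1 amounts to looking up the least-dropout vector whose expected MACs fit the device's current budget. The following is an illustrative sketch; names such as `DropoutLUT`, `available_macs_per_second`, and `set_dropout_rates` are our own placeholders rather than the released API, and how the abstract availability r_i(t) is measured and converted into a MACs budget is device-specific.

```python
import bisect

class DropoutLUT:
    """Pareto-optimal dropout vectors from the DSE, sorted by expected MACs."""

    def __init__(self, entries):
        # entries: list of (expected_macs_per_mini_batch, dropout_vector) from the DSE
        self.entries = sorted(entries, key=lambda e: e[0])
        self._macs = [m for m, _ in self.entries]

    def select(self, macs_budget):
        """Return the least-dropout vector whose expected MACs still fit the budget."""
        idx = bisect.bisect_right(self._macs, macs_budget) - 1
        return self.entries[max(idx, 0)][1]  # fall back to the smallest vector

# Per-mini-batch use on a device (illustrative placeholders, cf. Algorithm 1):
# budget = available_macs_per_second() * target_mini_batch_time
# d = lut.select(budget)
# model.set_dropout_rates(d)     # hypothetical helper: update each layer's rate
# loss = train_one_mini_batch(model, batch)
```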
At the end of each round, the devices report back their weight updates and the computational resources they put into the training (the number of MACs, as stored in the LUT). The server (Algorithm 2) performs a weighted averaging of the received updates w.r.t. the devices' reported computational resources. Thereby, updates from devices that trained with lower dropout rates are weighted more strongly. This is an extension of FedAvg (McMahan et al. 2017), which performs weighted averaging based only on the number of mini-batches. In the case of constant and equal resource availability on all devices, our coordination technique behaves the same as FedAvg. As we do not change the type of data exchanged between the devices and the server compared to FedAvg, we can still apply and adopt techniques that reduce the communication overhead, such as compression and sketched updates (Shi et al. 2020).

```
Algorithm 2: Server
  θ_0 ← random initialization
  for each round t = 1, 2, ... do
      K ← select n devices
      broadcast θ_{t−1} to selected devices K
      receive (dθ_i, c_i) from devices i ∈ K
      C ← Σ_{i∈K} c_i                   ▷ total computations
      dΘ ← Σ_{i∈K} c_i · dθ_i           ▷ weighted sum
      dθ ← dΘ / C                       ▷ weighted average
      θ_t ← θ_{t−1} + dθ
```

## Experimental Results

This section demonstrates the benefits of DISTREAL under heterogeneous resource availability in an FL system.

### Experimental Setup

We study synchronous FL as described in the system model. We report the classification accuracy of the synchronized model at the end of each round. Our main performance metric is the convergence speed, i.e., the accuracy achieved after a certain number of rounds, but we also report the final accuracy after convergence. The three datasets used in our experiments are Federated Extended MNIST (FEMNIST) (Cohen et al. 2017) with a non-independently and identically distributed (non-iid) split, similar to LEAF (Caldas et al. 2019), and CIFAR-10/100 (Krizhevsky and Hinton 2009). FEMNIST consists of 641,828 training and 160,129 test examples, each a 28×28 grayscale image of one out of 62 classes (10 digits, 26 upper- and 26 lower-case letters). CIFAR-10 consists of 50,000 training and 10,000 test examples, each a 32×32 RGB image of one out of 10 classes such as airplane or frog. CIFAR-100 is similar to CIFAR-10 but uses 100 classes. Table 1 summarizes the configurations.

|                  | FEMNIST    | CIFAR-10 | CIFAR-100 |
|------------------|------------|----------|-----------|
| #Devices         | 3,550      | 100      | 100       |
| #Samples/device  | 181 ± 70.7 | 500      | 500       |
| Devices/round    | 35         | 10       | 10        |
| Resources var.   | 3          | 4        | 4         |

Table 1: System configuration for FL.

For FEMNIST, we use a network similar to that used in Federated Dropout (Caldas et al. 2018), with a depth of 4 layers, requiring 4.0 million MACs in the forward pass. We use DenseNet (Huang et al. 2017) for CIFAR-10 and CIFAR-100 with growth rate k = 12 and depths of 40 and 100, respectively. This results in 74 million MACs for CIFAR-10 and 291 million MACs for CIFAR-100 in the forward pass. The DSE for these NNs takes around 15, 270, and 330 compute-hours, respectively, on a system with an Intel Core i5-4570 and an NVIDIA GeForce GTX 980. More details about the NN configurations and the computational complexity of the DSE are presented in the appendix.

We compare DISTREAL to four baselines:

1. **Full resource availability.** All devices have the full resources to train the full NN in each round. This is a theoretical baseline, which serves as an upper bound.
2. **Small network.** The NN complexity is reduced to fit the weakest device. Thereby, each device can train the full (reduced) NN in each round with FedAvg. For CIFAR-10 and CIFAR-100, we reduce the depth of DenseNet to 19 and 40, respectively. Because the network for FEMNIST already has only a few layers, we instead reduce the number of filters of the convolutional layers.
3. **Federated Dropout** as in (Caldas et al. 2018). Similar to our technique, it uses dropout to reduce the computational complexity. However, the same dropout rate is used for all layers. To have a fair comparison, we extend the technique of Caldas et al. (2018) to allow for different dropout rates for different devices according to their resource availability. The rates are determined by the server at the start of each round, as in the original technique.
4. **HeteroFL** as in (Diao, Ding, and Tarokh 2021). It uses a shrinking ratio 0