Efficient Device Scheduling with Multi-Job Federated Learning

Chendi Zhou1, Ji Liu2*, Juncheng Jia1, Jingbo Zhou2, Yang Zhou3, Huaiyu Dai4, Dejing Dou2
1Soochow University, 2Baidu Inc., China, 3Auburn University, 4North Carolina State University, United States

Abstract

Recent years have witnessed a large amount of decentralized data in multiple (edge) devices of end-users, while the aggregation of the decentralized data remains difficult for machine learning jobs due to laws or regulations. Federated Learning (FL) emerges as an effective approach to handling decentralized data without sharing the sensitive raw data, while collaboratively training global machine learning models. The servers in FL need to select (and schedule) devices during the training process. However, the scheduling of devices for multiple jobs with FL remains a critical and open problem. In this paper, we propose a novel multi-job FL framework to enable the parallel training process of multiple jobs. The framework consists of a system model and two scheduling methods. In the system model, we propose a parallel training process of multiple jobs, and construct a cost model based on the training time and the data fairness of various devices during the training process of diverse jobs. We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost. We conduct extensive experimentation with multiple jobs and datasets. The experimental results show that our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).

Introduction

Recent years have witnessed a large amount of decentralized data over various Internet of Things (IoT) devices, mobile devices, etc. (Liu et al.
2021), which can be exploited to train machine learning models of high accuracy for diverse artificial intelligence applications. Since the data contain sensitive information of end-users, a few stringent legal restrictions (Official Journal of the European Union 2016; CCL 2018; CCP 2018; Chik 2013) have been put into practice to protect data security and privacy. In this case, it is difficult or even impossible to aggregate the decentralized data into a single server or a data center to train machine learning models. To enable collaborative training with distributed data, federated learning (FL) (McMahan et al. 2017a), which does not transfer raw data, emerges as an effective approach.

C. Zhou and J. Liu contributed equally to the paper. This work was done when C. Zhou was an intern at Baidu Inc. *Corresponding author (liuji04@baidu.com). Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

FL was first introduced to collaboratively train a global model with non-Independent and Identically Distributed (non-IID) data distributed on mobile devices (McMahan et al. 2017a). During the training process of FL, the raw data is kept decentralized without being moved to a single server or a single data center (Kairouz et al. 2019; Yang et al. 2019). FL only allows the intermediate data to be transferred from the distributed devices, which can be the weights or the gradients of a model. FL generally exploits a parameter server architecture (Smola and Narayanamurthy 2010), where a server (or a group of servers) coordinates the training process with numerous devices. To collaboratively train a global model, the server selects (schedules) a number of devices to perform local model updates based on their local data, and then it aggregates the local models to obtain a new global model. This process is repeated multiple times so as to generate a global model of high accuracy. While current FL solutions (McMahan et al.
2017a; Pilla 2021) focus on a single-task job or a multi-task job (Smith et al. 2017), FL with multiple jobs (Han et al. 2020) remains an open problem. The major difference between the multi-task job and multiple jobs is that the tasks of the multi-task job share some common parts of the model, while the multiple jobs have no interaction with each other in terms of the model. Multi-job FL deals with the simultaneous training process of multiple independent jobs. Each job corresponds to multiple updates during the training process of a global model with the corresponding decentralized data. While FL with a single job generally chooses a portion of devices to update the model, the other devices remain idle, and the efficiency is low. Multi-job FL can well exploit diverse devices for multiple jobs simultaneously, which brings high efficiency.

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

The available devices are generally heterogeneous (Li et al. 2020, 2021), i.e., the computing and communication capability of each device is different, and the data in each device may also differ. During the training process of multiple jobs, the devices need to be scheduled to each job. At a given time, a device can be scheduled to only one job. However, only a portion of the available devices is scheduled to one job in order to reduce the influence of stragglers (McMahan et al. 2017a). Powerful devices should be scheduled to jobs in order to accelerate the training process, while other eligible devices should also participate in the training process to increase the fairness of data so as to improve the accuracy of the final global models. The fairness of data refers to the fair participation of the data in the training process of FL, which can be indicated by the standard deviation of the number of times each device is scheduled to a job (Pitoura and Triantafillou 2007; Finkelstein et al. 2008).
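As a concrete illustration, the fairness indicator just described can be computed from the per-device participation counts. The helper name below is our own and not from the paper; this is only a sketch of the notion:

```python
import math

def participation_spread(counts):
    """Standard deviation of the number of times each device has been
    scheduled to a job: 0 means perfectly fair participation, and larger
    values mean a few devices dominate the training."""
    mean = sum(counts) / len(counts)
    return math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))
```

For example, four devices scheduled [3, 3, 3, 3] times give a spread of 0, while [12, 0, 0, 0] gives a large spread, signalling that three devices never contributed their data.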
While the scheduling problem of devices is typically NP-hard (Du and Leung 1989; Liu et al. 2020a), some solutions have already been proposed for the training process of FL (McMahan et al. 2017b; Nishio and Yonetani 2019; Li et al. 2021; Abdulrahman et al. 2021) or for distributed systems (Barika et al. 2019), which generally focus on a single job with FL. In addition, these methods either cannot address the heterogeneity of devices (McMahan et al. 2017b), or do not consider data fairness during the training process (Nishio and Yonetani 2019; Li et al. 2021; Abdulrahman et al. 2021), which may lead to low accuracy. In this paper, we propose a Multi-Job Federated Learning (MJ-FL) framework to enable the efficient training of multiple jobs with heterogeneous edge devices. The MJ-FL framework consists of a system model and two scheduling methods. The system model enables the parallel training process of multiple jobs. With the consideration of both the efficiency of the training process, i.e., the time to execute an iteration, and the data fairness of each job for the accuracy of final models, we propose a cost model based on the training time and the data fairness within the system model. We propose two scheduling methods, i.e., reinforcement learning-based and Bayesian optimization-based, to schedule the devices for each job. To the best of our knowledge, we are among the first to study FL with multiple jobs. We summarize our contributions as follows: We propose MJ-FL, a multi-job FL framework consisting of a parallel training process for multiple jobs and a cost model for the scheduling methods. We propose combining the capability and data fairness in the cost model to improve the efficiency of the training process and the accuracy of the global model. We propose two scheduling methods, i.e., Reinforcement Learning (RL)-based and Bayesian Optimization (BO)-based methods, to schedule the devices to diverse jobs.
Each method has advantages in a specific situation. The BO-based method performs better for simple jobs, while the RL-based method is more suitable for complex jobs. We carry out extensive experimentation to validate the proposed approach. We exploit multiple jobs, composed of ResNet-18, CNN, AlexNet, VGG, and LeNet, to demonstrate the advantages of our proposed approach using both IID and non-IID datasets. The rest of the paper is organized as follows. We first present the related work. Then, we explain the system model and formulate the problem with a cost model, after which we present the scheduling methods. The experimental results with diverse models and datasets are given next. Finally, we conclude the paper.

Related Work

In order to protect the security and privacy of decentralized raw data, FL emerges as a promising approach, which enables training a global model with decentralized data (McMahan et al. 2017a; Yang et al. 2019; Li et al. 2020; Liu et al. 2021). Based on the data distribution, FL can be classified into three types, i.e., horizontal, vertical, and hybrid (Yang et al. 2019; Liu et al. 2021). Horizontal FL addresses decentralized data with the same features but different identifications. Vertical FL handles decentralized data with the same identifications but different features. Hybrid FL deals with data of different identifications and different features. In addition, FL includes two variants: cross-device FL and cross-silo FL (Kairouz et al. 2019). Cross-device FL trains global machine learning models with a huge number of mobile or IoT devices, while cross-silo FL handles the collaborative training process with decentralized data from multiple organizations or geo-distributed datacenters. In this paper, we focus on horizontal and cross-device FL. Current FL approaches (Bonawitz et al. 2019; Liu et al. 2020b; Yurochkin et al. 2019; Wang et al.
2020) generally deal with a single job, i.e., with a single global model. While some FL approaches have been proposed to handle multiple tasks (Smith et al. 2017; Chen et al. 2021), the tasks share some common parts of a global model and deal with the same types of data. In addition, the devices are randomly selected (scheduled) in these approaches. A few scheduling approaches (McMahan et al. 2017b; Nishio and Yonetani 2019; Li et al. 2021; Abdulrahman et al. 2021; Barika et al. 2019; Sun et al. 2020) exist for single-job scheduling, while device scheduling with multi-job FL is rarely addressed. The scheduling methods in the above works are mainly based on heuristics. For instance, the greedy method (Shi, Zhou, and Niu 2020) and the random scheduling method (McMahan et al. 2017b) are proposed for FL, while genetic algorithms (Barika et al. 2019) are exploited for distributed systems. However, these methods do not consider the fairness of data, which may lead to low accuracy for multi-job FL. Black-box optimization-based methods, e.g., RL (Sun et al. 2020), BO (Kim, Kim, and Park 2020), and deep neural networks (Zang et al. 2019), have been proposed to improve the efficiency, i.e., to reduce the execution time, in distributed systems. They do not consider data fairness either, which may lead to low accuracy for multi-job FL. Different from all existing works, we propose a system model for multi-job FL with the consideration of both efficiency and accuracy. In addition, we propose two scheduling methods, one based on RL and the other based on BO, for multi-job FL, which are suitable for diverse models and for both IID and non-IID datasets.

System Model and Problem Formulation

In this section, we first explain the motivation for multi-job FL. Then, we propose our multi-job FL framework, consisting of a multi-job FL process and a cost model.
Afterward, we formally define the problem to address in this paper.

Motivation for Multi-Job Federated Learning

Let us assume a scenario where there are multiple FL jobs to be processed at the same time, e.g., image classification, speech recognition, and text generation. These jobs can be trained in parallel so as to efficiently exploit the available devices. However, as each device can only update the model of one job at a given time slot, it is critical to schedule devices to different jobs during the training process. As the devices are generally heterogeneous, some devices may possess high computation or communication capability while others may not. In addition, the data fairness of multiple devices may also impact the convergence speed of the training process. For instance, if only certain powerful devices are scheduled to a job, the model can only learn the knowledge from the data stored on these devices, while the knowledge from the data stored on other devices may be missed. In order to accelerate the training process of multiple jobs with high accuracy, it is critical to consider how to schedule devices while taking both the computing and communication capability and the data fairness into consideration. A straightforward approach is to train each job separately using the mechanism explained in (McMahan et al. 2017b), while exploiting an existing scheduling method of single-job FL, e.g., FedAvg (McMahan et al. 2017b). In this way, only simple parallelism is achieved: the devices are not fully utilized and the system is of low efficiency. In addition, a direct adaptation of existing scheduling methods to multi-job FL cannot address the efficiency and the accuracy at the same time. Thus, it is critical to propose a reasonable and effective approach for multi-job FL.

Multi-Job Federated Learning Framework

In this paper, we focus on an FL environment composed of a server module and multiple devices.
The server module (Server) may consist of a single parameter server or a group of parameter servers (Li et al. 2014). In this section, we present a multi-job FL framework, which is composed of a process for the multi-job execution and a cost model to estimate the cost of the execution.

Multi-Job FL Process

Within the multi-job FL process, we assume that $K$ devices, denoted by the set $\mathcal{K}$, collaboratively train machine learning models for $M$ jobs, denoted by the set $\mathcal{M}$. Without loss of generality, each device $k$ is assumed to have $M$ local datasets corresponding to the $M$ jobs, and the dataset of the $m$-th job on Device $k$ is expressed as $\mathcal{D}_k^m = \{x_{k,d}^m \in \mathbb{R}^{n_m}, y_{k,d}^m \in \mathbb{R}\}_{d=1}^{D_k^m}$, with $D_k^m = |\mathcal{D}_k^m|$ the number of data samples, $x_{k,d}^m$ the $d$-th $n_m$-dimensional input data vector of Job $m$ at Device $k$, and $y_{k,d}^m$ the labeled output of $x_{k,d}^m$. The whole dataset of Job $m$ is denoted by $\mathcal{D}^m = \bigcup_{k \in \mathcal{K}} \mathcal{D}_k^m$, with $D^m = \sum_{k \in \mathcal{K}} D_k^m$. The objective of multi-job FL is to learn the respective model parameters $\{w^m\}$ based on the decentralized datasets. The global learning problem of multi-job FL can be expressed by the following formulation:

$$\min_{\mathbf{W}} \sum_{m=1}^{M} L^m, \quad \text{with } L^m = \sum_{k \in \mathcal{K}} \frac{D_k^m}{D^m} F_k^m(w^m), \qquad (1)$$

where $L^m$ is the loss value of Job $m$, $F_k^m(w^m) = \frac{1}{D_k^m} \sum_{\{x_{k,d}^m, y_{k,d}^m\} \in \mathcal{D}_k^m} f^m(w^m; x_{k,d}^m, y_{k,d}^m)$ is the loss value of Job $m$ at Device $k$, $\mathbf{W} = \{w^1, w^2, \ldots, w^M\}$ is the set of weight vectors for all jobs, and $f^m(w^m; x_{k,d}^m, y_{k,d}^m)$ captures the error of the model parameter $w^m$ on the data pair $\{x_{k,d}^m, y_{k,d}^m\}$. In order to solve the problem defined in Formula 1, the Server needs to continuously schedule devices for different jobs to update the global models iteratively until the training processes of the corresponding jobs converge or achieve a target performance requirement (in terms of accuracy or loss value). We design a multi-job FL process as shown in Fig. 1. The Server first initializes a global model for each job.
The initialization can be realized randomly or from a pre-training process with public data. In order to know the current status of devices, the Server sends requests to available devices in Step 1. Then, in Step 2, the Server schedules devices to the current job according to a scheduling plan generated by a scheduling method (see details in the section on device scheduling). The scheduling plan is a set of devices that are selected to perform the local training process for the current job. Please note that the scheduling process generates a scheduling plan for each job during the training process of multiple jobs, i.e., with an online strategy, while the scheduling processes of multiple jobs are carried out in parallel. The Server distributes the latest global model of the current job to the scheduled devices in Step 3, and then the model is updated in each device based on the local data in Step 4. Afterward, each device uploads the updated model to the Server after its local training in Step 5. Finally, the Server aggregates the models of the scheduled devices to generate a new global model in Step 6. The combination of Steps 1-6 is denoted by a round, which is repeated for each job until the corresponding global model reaches the expected performance (accuracy, loss value, or convergence). Please note that multiple jobs are executed in parallel asynchronously, while a device can only be scheduled to one job at a given time. In addition, we assume that the importance of each job is the same.

Cost Model

In order to measure the performance of each round, we exploit a cost model defined in Formula 2, which is composed of a time cost and a data fairness cost. The data fairness has a significant impact on the convergence speed.
$$\text{Cost}_m^r(\mathcal{V}_m^r) = \alpha \cdot T_m^r(\mathcal{V}_m^r) + \beta \cdot F_m^r(\mathcal{V}_m^r), \qquad (2)$$

where $\alpha$ and $\beta$ are the weights of the time cost and the fairness cost respectively, $T_m^r(\cdot)$ represents the execution time of the training process in Round $r$ with the set of scheduled devices $\mathcal{V}_m^r$, and $F_m^r(\cdot)$ is the corresponding data fairness cost. As defined in Formula 3, the execution time of a round depends on the slowest device in the set of scheduled devices:

$$T_m^r(\mathcal{V}_m^r) = \max_{k \in \mathcal{V}_m^r} \{t_m^k\}, \qquad (3)$$

where $t_m^k$ is the execution time of Round $r$ in Device $k$ for Job $m$. $t_m^k$ is composed of the communication time and the computation time, which is complicated to estimate and differs across devices. In this study, we assume that the execution time of each device follows the shifted exponential distribution defined in Formula 4 (Shi et al. 2021; Lee et al. 2018):

$$\mathbb{P}[t_m^k \le t] = \begin{cases} 1 - e^{-\mu_k (t - a_k)}, & t \ge a_k, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$

where $a_k > 0$ and $\mu_k > 0$ are the maximum and the fluctuation of the computation and communication capability (combined into one quantity) of Device $k$, respectively. Moreover, we assume that the calculation time of model aggregation has little impact on the training process because of the strong computation capability of the Server and the low complexity of the model.

[Figure 1: The training process within the Multi-Job Federated Learning Framework. The Server schedules devices to jobs, distributes the global model of each job, and aggregates the uploaded local models.]

The data fairness of Round $r$ corresponding to Job $m$ is indicated by the deviation of the frequency with which each device is scheduled to Job $m$, as defined in Formula 5:

$$F_m^r(\mathcal{V}_m^r) = \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \Big( s_{k,m}^r - \frac{1}{|\mathcal{K}|} \sum_{k' \in \mathcal{K}} s_{k',m}^r \Big)^2, \qquad (5)$$

where $s_{k,m}^r$ is the frequency of Device $k$ being scheduled to Job $m$, and $\mathcal{K}$ and $|\mathcal{K}|$ are the set of all devices and its size, respectively.
$s_{k,m}^r$ is calculated by counting the total number of appearances of Device $k$ in the set of scheduling plans for Job $m$, i.e., $\{\mathcal{V}_m^1, \ldots, \mathcal{V}_m^r\}$.

Problem Formulation

The problem we address is how to reduce the training time needed to reach a given loss value for each job. As the execution of each job is carried out in parallel, the problem can be formulated as follows:

$$\min_{\{\mathcal{V}_m^r\}} \max_{m \in \mathcal{M}} \Big\{ \sum_{r=1}^{R_m^*} T_m^r(\mathcal{V}_m^r) \Big\} \qquad (6)$$
$$\text{s.t.} \quad L^m(R_m^*) \le l_m, \quad \mathcal{V}_m^r \subseteq \mathcal{K}, \quad \forall m \in \{1, 2, \ldots, M\}, \; \forall r \in \{1, 2, \ldots, R_m^*\},$$

where $l_m$ is the given loss value of Job $m$, $R_m^*$ represents the minimum number of rounds to achieve the given loss in the real execution, and $L^m(R_m^*)$ is the loss value of the trained model at Round $R_m^*$, as defined in Formula 1. Solving this problem requires the global information of the whole training process, which is hard to predict, so we transform it into the following problem, which can be solved with the limited local information of each round. In addition, in order to achieve the given loss value of Job $m$ within a short time (the first constraint in Formula 6), we consider the data fairness within the total cost in Formula 7, where the data fairness can help reduce $R_m^*$ so as to minimize the total training time:

$$\min_{\mathcal{V}_m^r} \Big\{ \text{Total\_Cost}(\mathcal{V}_m^r) \Big\}, \qquad (7)$$
$$\text{Total\_Cost}(\mathcal{V}_m^r) = \sum_{m'=1}^{M} \text{Cost}_{m'}^r(\mathcal{V}_{m'}^r), \qquad (8)$$
$$\text{s.t.} \quad \mathcal{V}_m^r \subseteq \mathcal{K}, \quad \forall m \in \{1, 2, \ldots, M\},$$

where $\text{Cost}_m^r(\mathcal{V}_m^r)$ can be calculated based on Formula 2 with a set of scheduled devices $\mathcal{V}_m^r$ to be generated using a scheduling method for Job $m$. Since the scheduling results of one job may have a potential influence on the scheduling of other jobs, we consider the cost of other jobs when scheduling devices to the current job. As the search space is $O(2^{|\mathcal{K}|})$, this scheduling problem is still a combinatorial optimization problem (Toth 2000) and NP-hard (Du and Leung 1989; Liu et al. 2020a).

Device Scheduling for Multi-Job FL

In this section, we propose two scheduling methods, i.e., BO-based and RL-based, to address the problem defined in Formula 7.
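Before presenting the two methods, the per-round cost they minimize (Formulas 2-5) can be sketched in a few lines of Python. The function names, the dictionary layout of the device parameters, and the use of `random.expovariate` to draw from the shifted exponential are our own assumptions for this sketch:

```python
import random

def device_time(a_k, mu_k, rng):
    """Formula 4: shifted-exponential round time of one device, where a_k
    is the shift (fastest possible time) and mu_k the fluctuation of its
    combined computation and communication capability."""
    return a_k + rng.expovariate(mu_k)

def round_cost(plan, freq, num_devices, alpha, beta, rng):
    """Formula 2 for one job in one round: Cost = alpha * T + beta * F.
    plan maps each scheduled device to its (a_k, mu_k) parameters, and
    freq[k] counts how often device k has been scheduled to this job."""
    # Formula 3: the round is as slow as its slowest scheduled device.
    t = max(device_time(a, mu, rng) for a, mu in plan.values())
    # Formula 5: variance of the per-device scheduling frequencies.
    mean = sum(freq) / num_devices
    f = sum((s - mean) ** 2 for s in freq) / num_devices
    return alpha * t + beta * f
```

Setting $\alpha = 0$ reduces the cost to the fairness term alone, which makes the trade-off controlled by the weights $\alpha$ and $\beta$ easy to inspect.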
The scheduling plan generated by a scheduling method is defined in Formula 9:

$$\mathcal{V}_m^{r*} = \operatorname*{argmin}_{\mathcal{V}_m^r \subseteq \{\mathcal{K} \setminus \mathcal{V}_o^r\}} \text{Total\_Cost}(\mathcal{V}_m^r), \qquad (9)$$

where $\mathcal{V}_m^{r*}$ is a scheduling plan, $\mathcal{K} \setminus \mathcal{V}_o^r$ represents the set of available devices to schedule, $\text{Total\_Cost}(\mathcal{V}_m^r)$ is defined in Formula 8, and $\mathcal{K}$ and $\mathcal{V}_o^r$ are the set of all devices and the set of occupied devices in Round $r$, respectively.

Bayesian Optimization-Based Scheduling

While the Gaussian Process (GP) (Srinivas et al. 2010) can well represent linear and non-linear functions, BO-based methods (Shahriari et al. 2016) can exploit a GP to find a near-optimal solution for the problem defined in Formula 9. In this section, we propose a Bayesian Optimization-based Device Scheduling method (BODS). We fit a GP to the cost function $\text{Total\_Cost}(\cdot)$. The GP is composed of a mean function $\mu$ defined in Formula 10 and a covariance function $\mathcal{K}$ defined in Formula 11 with a Matern kernel (Williams and Rasmussen 2006):

$$\mu(\mathcal{V}_m^r) = \mathbb{E}_{\mathcal{V}_m^r \subseteq \{\mathcal{K} \setminus \mathcal{V}_o^r\}}[\text{Total\_Cost}(\mathcal{V}_m^r)] \qquad (10)$$

$$\mathcal{K}(\mathcal{V}_m^r, \mathcal{V}_m^{r\prime}) = \mathbb{E}_{\mathcal{V}_m^r, \mathcal{V}_m^{r\prime} \subseteq \{\mathcal{K} \setminus \mathcal{V}_o^r\}}\big[(\text{Total\_Cost}(\mathcal{V}_m^r) - \mu(\mathcal{V}_m^r))(\text{Total\_Cost}(\mathcal{V}_m^{r\prime}) - \mu(\mathcal{V}_m^{r\prime}))\big] \qquad (11)$$

Algorithm 1: Bayesian Optimization-Based Scheduling
Input: $\mathcal{V}_o$: a set of occupied devices; $S_m$: a matrix of the frequency of each device scheduled to Job $m$; $R_m$: the maximum round of the current Job $m$; $l_m$: the desired loss value for Job $m$.
Output: $\mathcal{V}_m = \{\mathcal{V}_m^{1*}, \ldots, \mathcal{V}_m^{R_m *}\}$: a set of scheduling plans, each with the size $|\mathcal{K}|$; $C_m$
1: $\Pi_L \leftarrow$ randomly generate a set of observation points and calculate the cost
2: for $r \in \{1, \ldots, R_m\}$ and $l_m$ is not achieved do
3:   $\Pi' \leftarrow$ randomly generate a set of observation points with the devices within $\mathcal{K} \setminus \mathcal{V}_o$
4:   $\mathcal{V}_m^{r*} \leftarrow \arg\max_{\mathcal{V} \in \Pi'} \alpha_{EI}(\mathcal{V}; \Pi)$
5:   FL training of Job $m$ with $\mathcal{V}_m^{r*}$ and update $S_m$, $\mathcal{V}_o$
6:   $C_r \leftarrow \text{Total\_Cost}(\mathcal{V}_m^{r*})$
7:   $\Pi_{L+r} \leftarrow \Pi_{L+r-1} \cup (\mathcal{V}_m^{r*}, C_r)$
8: end for

BODS is explained in Algorithm 1. First, we randomly generate a set of observation points and calculate the cost based on Formula 2 (Line 1).
Each observation point is a pair of a scheduling plan and its cost, used for the estimation of the mean function and the covariance function. Then, within each round, we randomly sample a set of scheduling plans (Line 3), among which we select the one with the biggest reward using $\mu$ and $\mathcal{K}$ updated based on $\Pi_{L+r-1}$ (Line 4). Afterward, we perform the FL training for Job $m$ with the generated scheduling plan (Line 5), calculate the cost corresponding to the real execution according to Formula 8 (Line 6), and update the observation point set (Line 7).

Let $(\mathcal{V}_l^r, C_l)$ denote an observation point $l$ for Job $m$ in Round $r$, where $\mathcal{V}_l^r = \{\mathcal{V}_{l,1}^r, \ldots, \mathcal{V}_{l,M}^r\}$ and $C_l$ is the cost value of $\text{Total\_Cost}(\mathcal{V}_{l,m}^r)$ while the scheduling plans of the other jobs are updated with the ones in use in Round $r$. At a given time, we have a set of observations $\Pi_{L-1} = \{(\mathcal{V}_1^r, C_1), \ldots, (\mathcal{V}_{L-1}^r, C_{L-1})\}$ composed of $L-1$ observation points. We denote the minimum cost value within the $L-1$ observations by $C_{L-1}^+$. Then, we exploit Expected Improvement (EI) (Jones, Schonlau, and Welch 1998) as the utility function to choose a new scheduling plan $\mathcal{V}_m^{r*}$ in Round $r$ that improves $C_{L-1}^+$ the most. Please note that this is not an exhaustive search, as we randomly select several observation points (a subset of the whole search space) at the beginning and add new observation points using the EI method.

Reinforcement Learning-Based Scheduling

In order to learn more information about the near-optimal scheduling patterns for complex jobs, we further propose a Reinforcement Learning-based Device Scheduling (RLDS) method as shown in Fig. 2, which is inspired by (Mao et al. 2019; Sun et al. 2020).

[Figure 2: The architecture of the RLDS.]

The scheduler of RLDS consists of a policy network and a policy converter. In the process of device scheduling, RLDS collects the status information of jobs as the input to the policy network. Then, the policy network generates a list of probabilities over all devices as the output.
Finally, the policy converter converts the list into a scheduling plan.

Policy Network: The policy network is implemented using a Long Short-Term Memory (LSTM) network followed by a fully connected layer, which can learn the sharing relationship of devices among diverse jobs. We take the computation and communication capability of available devices used in Formula 4, and the data fairness of each job defined in Formula 5, as the input. The network calculates the probability of each available device being scheduled to a job.

Policy Converter: The policy converter generates a scheduling plan based on the probability of each available device calculated by the policy network, with the ϵ-greedy strategy (Xia and Zhao 2015).

Training: In the training process of RLDS, we define the reward as $R_m = \frac{1}{\text{Total\_Cost}(\mathcal{V}_m^r)}$. Inspired by (Williams 1992; Zoph and Le 2017), we exploit Formula 12 to update the policy network.

[Figure 3: The accuracy of different jobs in Group A over time with the non-IID distribution. The three panels plot the test accuracy of VGG, CNN, and LeNet against elapsed time (min) for Random, Genetic, FedCS, Greedy, BODS, and RLDS.]

Algorithm 2: Reinforcement Learning-Based Scheduling
Input: $\mathcal{V}_o$: a set of occupied devices; $S_m$: a vector of the frequency of each device scheduled to Job $m$; $R_m$: the maximum round of the current Job $m$; $l_m$: the desired loss value for Job $m$.
Output: $\mathcal{V}_m = \{\mathcal{V}_m^1, \ldots, \mathcal{V}_m^{R_m}\}$: a set of scheduling plans, each with the size $|\mathcal{K}|$; $C_m$
1: $\theta \leftarrow$ pre-trained policy network, $\theta' \leftarrow 0$, $b_m \leftarrow 0$
2: for $r \in \{1, 2, \ldots, R_m\}$ and $l_m$ is not achieved do
3:   $\mathcal{V}_m^r \leftarrow$ generate a scheduling plan using the policy network
4:   FL training of Job $m$ and update $S_m$, $\mathcal{V}_o$
5:   Compute $R_m$
6:   Update $\theta$ according to Formula 12
7:   $b_m \leftarrow (1 - \gamma) \cdot b_m + \gamma \cdot R_m^n$
8: end for

$$\theta' = \theta + \eta \sum_{n=1}^{N} \sum_{k \in \mathcal{V}_{n,m}^r} \nabla_\theta \log P(S_k^m \mid S_{(k-1):1}^m; \theta) (R_m^n - b_m), \qquad (12)$$

where $\theta'$ and $\theta$ represent the updated parameters and the current parameters of the policy network, respectively, $\eta$ is the learning rate, $N$ is the number of scheduling plans $\mathcal{V}_{n,m}^r \subseteq \mathcal{K} \setminus \mathcal{V}_o^r$ used to update the model in Round $r$ ($N > 1$ in the pre-training process and $N = 1$ during the execution of multiple jobs), $P$ represents the probability calculated based on the RL model, $S_k^m = 1$ represents that Device $k$ is scheduled to Job $m$, and $b_m$ is the baseline value for reducing the variance of the gradient. We exploit RLDS during the training process of multiple jobs within the MJ-FL framework as shown in Algorithm 2. We pre-train the policy network with randomly generated scheduling plans (see details in the Appendix (Zhou et al. 2021)) (Line 1). When generating a scheduling plan for Job $m$, the latest policy network is utilized (Line 3). We perform the FL training for Job $m$ with the generated scheduling plan and update the frequency matrix $S_m$ and the set of occupied devices $\mathcal{V}_o$ (Line 4). Afterward, we calculate the reward corresponding to the real execution (Line 5). The parameters are updated based on Formula 12 (Line 6), while the baseline value $b_m$ is updated considering the historical value (Line 7).

Experiments

In this section, we present the experimental results to show the efficiency of our proposed scheduling methods within MJ-FL. We compare the performance of RLDS and BODS with four baseline methods, i.e., Random (McMahan et al. 2017b), FedCS (Nishio and Yonetani 2019), Genetic (Barika et al. 2019), and Greedy (Shi, Zhou, and Niu 2020).
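The policy-gradient update of Formula 12 and the moving-average baseline of Algorithm 2 (Line 7) can be sketched as follows. The function names and the plain-list representation of the parameters and gradients are our own simplifications, not the paper's implementation:

```python
def reinforce_update(theta, log_prob_grads, reward, baseline, eta):
    """REINFORCE with a baseline (Williams 1992): move the parameters in
    the direction of the log-probability gradient of the sampled
    scheduling plan, scaled by the advantage (reward - baseline)."""
    advantage = reward - baseline
    return [t + eta * g * advantage for t, g in zip(theta, log_prob_grads)]

def update_baseline(b_m, reward, gamma=0.1):
    """Algorithm 2, Line 7: exponential moving average of the reward,
    used to reduce the variance of the gradient estimate."""
    return (1.0 - gamma) * b_m + gamma * reward
```

A reward above the baseline (positive advantage) makes the sampled plan more likely under the updated policy; a reward below it makes the plan less likely.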
Federated Learning Setups

In the experiments, we take three jobs as a group to be executed in parallel. We carry out the experiments with two groups, i.e., Group A with VGG-16 (VGG) (Simonyan and Zisserman 2015), CNN (CNN-A-IID and CNN-A-non-IID) (LeCun et al. 1998), and LeNet-5 (LeNet) (LeCun et al. 1998), and Group B with ResNet-18 (ResNet) (He et al. 2016), CNN (CNN-B) (LeCun et al. 1998), and AlexNet (Krizhevsky, Sutskever, and Hinton 2012), while each model corresponds to one job. The complexity of the models is as follows: AlexNet < CNN-B < ResNet, and LeNet < CNN (CNN-A-IID and CNN-A-non-IID) < VGG. We exploit the datasets of CIFAR-10 (Krizhevsky and Hinton 2009), emnist-letters (Cohen et al. 2017), emnist-digital (Cohen et al. 2017), Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), and MNIST (LeCun et al. 1998) in the training process. Please see the details of the models and datasets in the Appendix. For the non-IID setting of each dataset, the training set is classified by category, and the samples of each category are divided into 20 parts. Each device randomly selects two categories and then selects one part from each category to form its local training set. For the IID setting, each device randomly samples a specified number of images from each training set. In addition, we use 12 Tesla V100 GPUs to simulate an FL environment composed of a parameter server and 100 devices. We use Formula 4 to simulate the capabilities of devices in terms of training time with a uniform sampling strategy, while the accuracy results come from the real training processes. In the experimentation, we use a corresponding target accuracy (for ease of comparison) in place of a target loss value.
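The non-IID partitioning described above (20 shards per class, two classes per device) can be sketched as follows. The function name and the shard layout are our own assumptions for illustration:

```python
import random

def noniid_device_split(labels, num_parts=20, classes_per_device=2, rng=None):
    """Assign one device its local sample indices: pick classes_per_device
    random classes, then one of num_parts shards from each chosen class."""
    rng = rng or random.Random()
    by_class = {}
    for idx, y in enumerate(labels):           # group sample indices by class
        by_class.setdefault(y, []).append(idx)
    local = []
    for y in rng.sample(sorted(by_class), classes_per_device):
        shards = [by_class[y][i::num_parts] for i in range(num_parts)]
        local.extend(shards[rng.randrange(num_parts)])
    return local
```

Each device therefore sees samples from only two classes, a standard way to induce label-skewed non-IID data.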
Table 1: The convergence accuracy and the time (in minutes) required to achieve the target accuracy for different methods in Group A. The numbers in parentheses represent the target accuracy, and "/" represents that the target accuracy is not achieved.

Convergence accuracy:

| Setting | Model | Random | Genetic | FedCS | Greedy | BODS | RLDS |
|---------|-------|--------|---------|-------|--------|------|------|
| Non-IID | VGG | 0.55 | 0.54 | 0.55 | 0.43 | 0.57 | 0.57 |
| Non-IID | CNN | 0.90 | 0.80 | 0.80 | 0.83 | 0.90 | 0.88 |
| Non-IID | LeNet | 0.990 | 0.988 | 0.990 | 0.986 | 0.991 | 0.990 |
| IID | VGG | 0.614 | 0.558 | 0.603 | 0.522 | 0.603 | 0.614 |
| IID | CNN | 0.943 | 0.928 | 0.943 | 0.928 | 0.943 | 0.937 |
| IID | LeNet | 0.9945 | 0.9928 | 0.9934 | 0.99 | 0.9946 | 0.9933 |

Time (min) to reach the target accuracy:

| Setting | Model (target) | Random | Genetic | FedCS | Greedy | BODS | RLDS |
|---------|----------------|--------|---------|-------|--------|------|------|
| Non-IID | VGG (0.55) | 2486 | 1164.3 | 1498.5 | / | 455.1 | 406.8 |
| Non-IID | CNN (0.80) | 44.25 | 95.85 | 27.39 | 43.04 | 15.88 | 12.75 |
| Non-IID | LeNet (0.984) | 43.81 | 30.15 | 33.37 | 43.76 | 28.93 | 34.08 |
| IID | VGG (0.60) | 529.9 | / | 322.5 | / | 293.6 | 249.2 |
| IID | CNN (0.930) | 52.05 | 176.85 | 27.45 | 26.48 | 19.25 | 18.29 |
| IID | LeNet (0.993) | 43.15 | 57.53 | 27.31 | / | 16.73 | 23.31 |

Table 2: The convergence accuracy and the time (in minutes) required to achieve the target accuracy for different methods in Group B. The numbers in parentheses represent the target accuracy, and "/" represents that the target accuracy is not achieved.

Convergence accuracy:

| Setting | Model | Random | Genetic | FedCS | Greedy | BODS | RLDS |
|---------|-------|--------|---------|-------|--------|------|------|
| Non-IID | ResNet | 0.546 | 0.489 | 0.523 | 0.403 | 0.583 | 0.537 |
| Non-IID | CNN | 0.821 | 0.767 | 0.821 | 0.764 | 0.836 | 0.823 |
| Non-IID | AlexNet | 0.989 | 0.986 | 0.987 | 0.871 | 0.990 | 0.989 |
| IID | ResNet | 0.787 | 0.754 | 0.782 | 0.743 | 0.791 | 0.771 |
| IID | CNN | 0.867 | 0.867 | 0.868 | 0.868 | 0.869 | 0.869 |
| IID | AlexNet | 0.9938 | 0.9938 | 0.9939 | 0.9935 | 0.9939 | 0.9943 |

Time (min) to reach the target accuracy:

| Setting | Model (target) | Random | Genetic | FedCS | Greedy | BODS | RLDS |
|---------|----------------|--------|---------|-------|--------|------|------|
| Non-IID | ResNet (0.45) | 571.0 | 307.2 | 279.5 | 174.2 | 157.5 | 137.6 |
| Non-IID | CNN (0.73) | 47.1 | 22.0 | 18.5 | 70.8 | 13.8 | 4.8 |
| Non-IID | AlexNet (0.978) | 141.85 | 77.74 | 84.8 | / | 61.91 | 57.97 |
| IID | ResNet (0.740) | 65.93 | 32.51 | 31.4 | 52.93 | 15.9 | 11.96 |
| IID | CNN (0.865) | 88.81 | 23.89 | 26.06 | 21.42 | 23.99 | 9.3 |
| IID | AlexNet (0.9933) | 35.08 | 19.44 | 20.97 | / | 21.65 | 12.58 |

Evaluation on the non-IID setting: When the decentralized data is non-IID, the data fairness defined in Formula 5 has a significant influence on the accuracy. As shown in Fig.
3, the convergence speed of our proposed methods, i.e., RLDS and BODS, is significantly higher than that of the other methods. RLDS has a significant advantage for complex jobs (VGG in Fig. 3(a)), while BODS leads to good performance for relatively simple jobs in Groups A and B (please see the details of Group B in Fig. 6 in the Appendix). In addition, as shown in Tables 1 and 2, the final accuracy of RLDS and BODS outperforms that of the other methods as well (up to 44.6% higher for BODS and 33.3% higher for RLDS). Given a target accuracy, our proposed methods achieve it within a shorter time than the baseline methods, both in terms of the time for a single job, i.e., the training time of each job (up to 5.04 times shorter for BODS and 5.11 times shorter for RLDS), and in terms of the time for the whole training process, i.e., the total time calculated based on Formula 6 (up to 4.15 times shorter for BODS and 4.67 times shorter for RLDS), for Groups A and B. We have similar observations in the IID setting, while the advantage of RLDS is much more significant (up to 8.67 times shorter in terms of the time for a single job) than in the non-IID setting, as shown in Tables 1 and 2. We also find that MJ-FL outperforms the sequential execution of single-job FL (up to 5.36 times faster and 12.5% higher accuracy; see details in the Appendix). As RLDS can learn more information with a complex neural network, it outperforms BODS for complex jobs. BODS leads to high convergence accuracy and fast convergence speed thanks to its emphasis on the combination of data fairness and device capability, i.e., computation and communication capability. Both RLDS and BODS significantly outperform the baseline methods, while there are also differences among the four baseline methods. The Greedy method is inclined to schedule the devices with high capability, which leads to a significant decrease in the final convergence accuracy.
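The Greedy failure mode above can be illustrated with a toy selection rule. This is a hedged sketch, not the paper's scheduler: the linear weight `alpha` and the class-coverage term are illustrative stand-ins for the cost model of Formula 5, and the device records are invented for the example.

```python
def greedy_by_capability(devices, k):
    """Baseline behavior: pick the k fastest devices, ignoring what data they hold."""
    return sorted(devices, key=lambda d: d["time"])[:k]

def fairness_aware(devices, k, alpha=0.2):
    """Greedily pick k devices by a weighted mix of training time and how many
    not-yet-covered classes a device would contribute (lower score wins)."""
    chosen, covered, pool = [], set(), list(devices)
    for _ in range(k):
        best = min(pool, key=lambda d: alpha * d["time"]
                   - (1 - alpha) * len(set(d["classes"]) - covered))
        chosen.append(best)
        covered |= set(best["classes"])
        pool.remove(best)
    return chosen
```

With a pool in which the fastest devices all hold the same two classes, the capability-only rule covers just those two classes, while the fairness-aware rule trades some speed for much broader class coverage, which is why Greedy converges to a lower final accuracy under non-IID data.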
The Genetic method exploits randomness to achieve data fairness while generating scheduling plans, so its convergence performance is better than that of the Greedy method. The FedCS method refines a randomly selected set of devices when generating the scheduling plan, which improves the fairness of device selection to a certain extent, and its convergence speed is faster than that of the Random method.
Conclusion
In this work, we proposed a new Multi-Job Federated Learning framework, i.e., MJ-FL. The framework is composed of a system model and two device scheduling methods. The system model consists of a process for the parallel execution of multiple jobs and a cost model based on the capability of devices and data fairness. We proposed two device scheduling methods, i.e., RLDS for complex jobs and BODS for simple jobs, to efficiently select proper devices for each job based on the cost model. We carried out extensive experimentation with six real-life models and four datasets with IID and non-IID distributions. The experimental results show that MJ-FL outperforms single-job FL, and that our proposed scheduling methods significantly outperform the baseline methods (up to 44.6% in terms of accuracy, 8.67 times faster for a single job, and 4.67 times faster in terms of total time).
References
2018. California Consumer Privacy Act Home Page. https://www.caprivacy.org/. Online; accessed 14/02/2021.
2018. Cybersecurity Law of the People's Republic of China. https://www.newamerica.org/cybersecurity-initiative/digichina/blog/translation-cybersecurity-law-peoples-republic-china/. Online; accessed 22/02/2021.
Abdulrahman, S.; Tout, H.; Mourad, A.; and Talhi, C. 2021. FedMCCS: Multicriteria Client Selection Model for Optimal IoT Federated Learning. IEEE Internet of Things Journal, 8(6): 4723–4735.
Barika, M.; Garg, S.; Chan, A.; and Calheiros, R. 2019. Scheduling algorithms for efficient execution of stream workflow applications in multicloud environments. IEEE Trans. on Services Computing.
Bonawitz, K.; Eichner, H.; Grieskamp, W.; Huba, D.; Ingerman, A.; Ivanov, V.; Kiddon, C.; Konečný, J.; Mazzocchi, S.; McMahan, B.; Overveldt, T. V.; Petrou, D.; Ramage, D.; and Roselander, J. 2019. Towards Federated Learning at Scale: System Design. In Machine Learning and Systems (MLSys).
Chen, D.; Hong, C. S.; Wang, L.; Zha, Y.; Zhang, Y.; Liu, X.; and Han, Z. 2021. Matching theory based low-latency scheme for multi-task federated learning in MEC networks. IEEE Internet of Things Journal.
Chik, W. B. 2013. The Singapore Personal Data Protection Act and an assessment of future trends in data privacy reform. Computer Law & Security Review, 29(5): 554–575.
Cohen, G.; Afshar, S.; Tapson, J.; and Van Schaik, A. 2017. EMNIST: Extending MNIST to handwritten letters. In Int. Joint Conf. on Neural Networks (IJCNN), 2921–2926.
Du, J.; and Leung, J. Y.-T. 1989. Complexity of Scheduling Parallel Task Systems. SIAM Journal on Discrete Mathematics, 2(4): 473–487.
Finkelstein, A.; Harman, M.; Mansouri, S. A.; Ren, J.; and Zhang, Y. 2008. Fairness analysis in requirements assignments. In IEEE Int. Requirements Engineering Conf., 115–124.
Han, J.; Rafique, M. M.; Xu, L.; Butt, A. R.; Lim, S.-H.; and Vazhkudai, S. S. 2020. Marble: A multi-GPU aware job scheduler for deep learning on HPC systems. In IEEE/ACM Int. Symposium on Cluster, Cloud and Internet Computing (CCGRID), 272–281.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 770–778.
Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4): 455–492.
Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. 2019. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.
Kim, K.-r.; Kim, Y.; and Park, S. 2020.
A probabilistic machine learning approach to scheduling parallel loops with Bayesian optimization. IEEE Trans. on Parallel and Distributed Systems (TPDS), 32(7): 1815–1827.
Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical Report, University of Toronto.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Annual Conf. on Neural Information Processing Systems (NeurIPS), 1106–1114.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.
Lee, K.; Lam, M.; Pedarsani, R.; Papailiopoulos, D.; and Ramchandran, K. 2018. Speeding Up Distributed Machine Learning Using Codes. IEEE Trans. on Information Theory, 64(3): 1514–1529.
Li, L.; Shi, D.; Hou, R.; Li, H.; Pan, M.; and Han, Z. 2021. To Talk or to Work: Flexible Communication Compression for Energy Efficient Federated Learning over Heterogeneous Mobile Edge Devices. In IEEE Conf. on Computer Communications (INFOCOM), 1–10.
Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B.-Y. 2014. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 583–598.
Li, T.; Sahu, A. K.; Talwalkar, A.; and Smith, V. 2020. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3): 50–60.
Liu, J.; Huang, J.; Zhou, Y.; Li, X.; Ji, S.; Xiong, H.; and Dou, D. 2021. From Distributed Machine Learning to Federated Learning: A Survey. arXiv preprint arXiv:2104.14362.
Liu, L.; Yu, H.; Sun, G.; Luo, L.; Jin, Q.; and Luo, S. 2020a. Job scheduling for distributed machine learning in optical WAN. Future Generation Computer Systems (FGCS), 112: 549–560.
Liu, Y.; Huang, A.; Luo, Y.; Huang, H.; Liu, Y.; Chen, Y.; Feng, L.; Chen, T.; Yu, H.; and Yang, Q. 2020b. FedVision: An online visual object detection platform powered by federated learning. In AAAI Conf. on Artificial Intelligence, volume 34, 13172–13179.
Mao, H.; Schwarzkopf, M.; Venkatakrishnan, S. B.; Meng, Z.; and Alizadeh, M. 2019. Learning scheduling algorithms for data processing clusters. In Wu, J.; and Hall, W., eds., ACM Special Interest Group on Data Communication (SIGCOMM), 270–288. ACM.
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017a. Communication-efficient learning of deep networks from decentralized data. In Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 1273–1282.
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017b. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.
Nishio, T.; and Yonetani, R. 2019. Client selection for federated learning with heterogeneous resources in mobile edge. In IEEE Int. Conf. on Communications (ICC), 1–7.
Official Journal of the European Union. 2016. General data protection regulation. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679. Online; accessed 12/02/2021.
Pilla, L. L. 2021. Optimal Task Assignment for Heterogeneous Federated Learning Devices. In IEEE Int. Parallel and Distributed Processing Symposium (IPDPS), 661–670.
Pitoura, T.; and Triantafillou, P. 2007. Load distribution fairness in p2p data management systems. In IEEE Int. Conf. on Data Engineering (ICDE), 396–405.
Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and de Freitas, N. 2016. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1): 148–175.
Shi, W.; Zhou, S.; and Niu, Z. 2020. Device scheduling with fast convergence for wireless federated learning. In IEEE Int. Conf. on Communications (ICC), 1–6.
Shi, W.; Zhou, S.; Niu, Z.; Jiang, M.; and Geng, L. 2021. Joint Device Scheduling and Resource Allocation for Latency Constrained Wireless Federated Learning. IEEE Trans. on Wireless Communications, 20(1): 453–467.
Simonyan, K.; and Zisserman, A. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Int. Conf. on Learning Representations (ICLR).
Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. 2017. Federated Multi-Task Learning. In Annual Conf. on Neural Information Processing Systems (NeurIPS), 4427–4437.
Smola, A.; and Narayanamurthy, S. 2010. An architecture for parallel topic models. Very Large Data Bases Conference (VLDB) Endowment, 3(1-2): 703–710.
Srinivas, N.; Krause, A.; Kakade, S. M.; and Seeger, M. W. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Int. Conf. on Machine Learning (ICML), 1015–1022.
Sun, P.; Guo, Z.; Wang, J.; Li, J.; Lan, J.; and Hu, Y. 2020. DeepWeave: Accelerating Job Completion Time with Deep Reinforcement Learning-based Coflow Scheduling. In Int. Joint Conf. on Artificial Intelligence (IJCAI), 3314–3320.
Toth, P. 2000. Optimization engineering techniques for the exact solution of NP-hard combinatorial optimization problems. European Journal of Operational Research (EJOR), 125(2): 222–238.
Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D.; and Khazaeni, Y. 2020. Federated Learning with Matched Averaging. In Int. Conf. on Learning Representations (ICLR).
Williams, C. K.; and Rasmussen, C. E. 2006. Gaussian processes for machine learning, volume 2. MIT Press, Cambridge, MA.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4): 229–256.
Xia, Z.; and Zhao, D. 2015. Online reinforcement learning by Bayesian inference. In Int. Joint Conf. on Neural Networks (IJCNN), 1–6.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017.
Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Yang, Q.; Liu, Y.; Chen, T.; and Tong, Y. 2019. Federated machine learning: Concept and applications. ACM Trans. on Intelligent Systems and Technology (TIST), 10(2): 1–19.
Yurochkin, M.; Agarwal, M.; Ghosh, S.; Greenewald, K.; Hoang, N.; and Khazaeni, Y. 2019. Bayesian nonparametric federated learning of neural networks. In Int. Conf. on Machine Learning (ICML), 7252–7261.
Zang, Z.; Wang, W.; Song, Y.; Lu, L.; Li, W.; Wang, Y.; and Zhao, Y. 2019. Hybrid deep neural network scheduler for job-shop problem based on convolution two-dimensional transformation. Computational Intelligence and Neuroscience, 2019.
Zhou, C.; Liu, J.; Jia, J.; Zhou, J.; Zhou, Y.; Dai, H.; and Dou, D. 2021. Efficient Device Scheduling with Multi-Job Federated Learning. arXiv preprint arXiv:2112.05928.
Zoph, B.; and Le, Q. V. 2017. Neural Architecture Search with Reinforcement Learning. In Int. Conf. on Learning Representations (ICLR).