# SplitFed: When Federated Learning Meets Split Learning

Chandra Thapa¹*, Pathum Chamikara Mahawaga Arachchige¹, Seyit Camtepe¹, Lichao Sun²*
¹CSIRO Data61, Sydney, Australia
²Lehigh University, Bethlehem, Pennsylvania, USA
{chandra.thapa, chamikara.arachchige, seyit.camtepe}@data61.csiro.au, lis221@lehigh.edu

*Corresponding author. Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Federated learning (FL) and split learning (SL) are two popular distributed machine learning approaches. Both follow a model-to-data scenario; clients train and test machine learning models without sharing raw data. SL provides better model privacy than FL because the machine learning model architecture is split between clients and the server. Moreover, the split model makes SL a better option for resource-constrained environments. However, SL performs slower than FL due to the relay-based training across multiple clients. In this regard, this paper presents a novel approach, named splitfed learning (SFL), that amalgamates the two approaches and eliminates their inherent drawbacks, along with a refined architectural configuration incorporating differential privacy and PixelDP to enhance data privacy and model robustness. Our analysis and empirical results demonstrate that (pure) SFL provides similar test accuracy and communication efficiency as SL, while significantly reducing the computation time per global epoch compared to SL when multiple clients participate. Furthermore, as in SL, its communication efficiency over FL improves with the number of clients. Besides, the performance of SFL with privacy and robustness measures is further evaluated under extended experimental settings.

## Introduction

Distributed Collaborative Machine Learning (DCML) is popular due to its default data privacy benefits (Kairouz, McMahan, et al. 2019). Unlike the conventional approach, where the data is centrally pooled and accessed, DCML enables machine learning without transferring data from data custodians to any untrusted party. Moreover, analysts have no access to raw data; instead, the machine learning (ML) model is transferred to the data curator for processing. Besides, it enables computation on multiple systems or servers and distributed devices. The most popular DCML approaches are federated learning (Konečný, McMahan, and Ramage 2015; McMahan et al. 2017) and split learning (Gupta and Raskar 2018).

Federated learning (FL) trains a full ML model on the distributed clients with their local data and later aggregates the locally trained full ML models to form a global model in a server. The main advantage of FL is that it allows parallel, hence efficient, ML model training across many clients.

**Computational requirement at the client side and model privacy during ML training in FL.** The main disadvantage of FL is that each client needs to run the full ML model, and resource-constrained clients, such as those found in the Internet of Things, may not be able to run the full model. This case is prevalent if the ML models are deep learning models. Besides, there is a privacy concern from the model's privacy perspective during training because the server and clients have full access to the local and global models. To address these concerns, split learning (SL) was introduced.
SL splits the full ML model into multiple smaller network portions and trains them separately on a server and on distributed clients with their local data. Assigning only a part of the network to be trained at the client side reduces the processing load (compared to running the complete network as in FL), which is significant for ML computation on resource-constrained devices (Vepakomma et al. 2018). Besides, a client has no access to the server-side model and vice versa.

**Training time overhead in SL.** Despite the advantages of SL, there is a primary issue. The relay-based training in SL keeps client resources idle, because only one client engages with the server at any one instance, causing a significant increase in training overhead with many clients.

To address these issues in FL and SL, this paper proposes a novel architecture called splitfed learning (SFL). SFL retains the advantages of FL and SL while emphasizing data privacy and model robustness. Refer to Table 1 for an abstract comparison with FL and SL. Our contributions are mainly two-fold. Firstly, we are the first to propose SFL. Data privacy and model robustness are enhanced at the architectural level in SFL by differential privacy-based measures (Abadi et al. 2016) and PixelDP (Lecuyer et al. 2019). Secondly, to demonstrate the feasibility of SFL, we present comparative performance measurements of FL, SL, and SFL on four standard datasets and four popular models. Based on our analyses and empirical results, SFL provides an excellent solution that offers better model privacy than FL, and it is faster than SL with similar model accuracy and communication efficiency.

| | SL | FL | SFL |
|---|---|---|---|
| Model aggregation | No | Yes | Yes |
| Model privacy advantage from the split model | Yes | No | Yes |
| Client-side training | Sequential | Parallel | Parallel |
| Distributed computing | Yes | Yes | Yes |
| Access to raw data | No | No | No |

Table 1: An abstract comparison of split learning (SL), federated learning (FL), and splitfed learning (SFL).

Overall, SFL is beneficial for resource-constrained environments where full model training and deployment are not feasible, and where fast model training is required to periodically update the global model on a continually updating dataset over time (e.g., a data stream). These environments characterize various domains, including health, e.g., real-time anomaly detection in a network of multiple Internet of Medical Things devices¹ connected via gateways, and finance, e.g., privacy-preserving credit card fraud detection.

## Background and Related Works

Federated learning (Konečný, McMahan, and Ramage 2015; McMahan et al. 2017; Bonawitz et al. 2019) trains a complete ML network/algorithm at each client on its local data in parallel for a certain number of local epochs, after which the local updates are sent to the server for aggregation (McMahan et al. 2017). This way, the server forms a global model and completes one global epoch². The learned parameters of the global model are then sent back to all clients to train for the next round. This process continues until the algorithm converges. In this paper, we consider the federated averaging (FedAvg) algorithm (McMahan et al. 2017) for model aggregation in FL. FedAvg takes a weighted average of the gradients for the model updates.
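To make the aggregation step concrete, the following is a minimal sketch of FedAvg-style weighted averaging as performed by an aggregating server; the two-layer toy model, the `fedavg` helper, and the per-client sample counts are illustrative assumptions and not the authors' implementation.

```python
# A minimal FedAvg sketch: weighted averaging of locally trained model copies.
# The model and the per-client sample counts are illustrative only.
import copy
import torch
import torch.nn as nn

def fedavg(client_states, sample_counts):
    """Return the weighted average of client state_dicts, weighted by n_k / n."""
    n = sum(sample_counts)
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum((n_k / n) * state[key]
                       for state, n_k in zip(client_states, sample_counts))
    return avg

# Toy usage: three clients with different amounts of local data.
clients = [nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)) for _ in range(3)]
global_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
global_model.load_state_dict(fedavg([c.state_dict() for c in clients], [100, 50, 250]))
```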
Split learning (Vepakomma et al. 2018; Gupta and Raskar 2018) splits a deep learning network $W$ into multiple portions, and these portions are processed and computed on different devices. In a simple setting, $W$ is split into two portions, $W^C$ and $W^S$, called the client-side network and the server-side network, respectively. The clients, where the data reside, compute only the client-side portion of the network, and the server computes only the server-side portion. The communication involves sending the activations of the split layer (called the cut layer) of the client-side network, referred to as smashed data, to the server, and receiving the gradients of the smashed data from the server-side operations. The synchronization of the learning process with multiple clients is done either in centralized mode or peer-to-peer mode in SL (Gupta and Raskar 2018).

¹ Examples of the Internet of Medical Things include glucose monitoring devices, open artificial pancreas systems, wearable electrocardiogram (ECG) monitoring devices, and smart lenses.
² One global epoch is completed when forward propagation and back-propagation have been carried out over all available data across all participating clients for one cycle.

Figure 1: Overview of the splitfed learning (SFL) system model.

Differential privacy (DP) is a privacy model that defines privacy in terms of stochastic frameworks (Dwork and Roth 2014; Dwork et al. 2016). DP is formally defined as follows.

**Definition 1.** A mechanism $\mathcal{M}$ is considered to be $(\epsilon, \delta)$-differentially private if, for all adjacent datasets $x$ and $y$, and for all possible subsets of results $R$ of the mechanism, the following holds:
$$P[\mathcal{M}(x) \in R] \le e^{\epsilon} P[\mathcal{M}(y) \in R] + \delta.$$

Practically, the values of $\epsilon$ (privacy budget) and $\delta$ (probability of failure) should be kept as small as possible to maintain a high level of privacy. However, the smaller the values of $\epsilon$ and $\delta$, the higher the noise applied to the input data by the DP algorithm.

## The Proposed Framework

The SFL framework is presented in this section. We first give an overview of SFL. Then we detail three key modules: (1) the differentially private knowledge perturbation, (2) PixelDP for robust learning, and (3) the total cost analysis of SFL.

### Overall Structure

SFL combines the primary strength of FL, which is parallel processing among distributed clients, and the primary strength of SL, which is splitting the network into client-side and server-side sub-networks during training. Refer to Figure 1 for a representation of the SFL architecture. Unlike SL, all clients carry out their computations in parallel and engage with the main server and the fed server. A client can be a hospital or an Internet of Medical Things device with low computing resources, and the main server can be a cloud server or a researcher with high-performance computing resources. The fed server is introduced to conduct FedAvg on the client-side local updates. Moreover, the fed server synchronizes the client-side global model in each round of network training. The fed server's computations, which mainly consist of computing FedAvg, are not costly; hence, the fed server can be hosted within the local edge boundaries. Alternatively, if all operations at the fed server are implemented over encrypted information, i.e., homomorphic encryption-based client-side model aggregation, then the main server can perform the operations of the fed server.
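The cut-layer split and the smashed-data/gradient exchange described above are the core mechanics SFL inherits from SL. Below is a minimal PyTorch sketch of a single client-server training step under this split; the layer sizes, data, and optimizers are illustrative only, and the network transport between client and server is omitted.

```python
# A minimal sketch of the cut-layer split and smashed-data/gradient exchange.
# The model, layer sizes, and data are illustrative, not the authors' setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Full model W split into a client-side portion W^C and a server-side portion W^S.
client_model = nn.Sequential(nn.Linear(20, 32), nn.ReLU())                      # W^C
server_model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))    # W^S

opt_c = torch.optim.SGD(client_model.parameters(), lr=0.01)
opt_s = torch.optim.SGD(server_model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))      # one client's local batch

# Client: forward pass up to the cut layer; only the smashed data leaves the client.
smashed = client_model(x)
smashed_received = smashed.detach().requires_grad_()       # what the server receives

# Server: forward pass on W^S, loss computation, and back-propagation to the cut layer.
opt_s.zero_grad()
loss = loss_fn(server_model(smashed_received), y)
loss.backward()
opt_s.step()

# Server returns the gradient of the smashed data; client completes back-propagation.
opt_c.zero_grad()
smashed.backward(smashed_received.grad)
opt_c.step()
```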
**SFL workflow.** All clients perform forward propagation on their client-side model in parallel, including its noise layer, and pass their smashed data to the main server. The main server then processes the forward propagation and back-propagation on its server-side model with each client's smashed data separately, in (somewhat) parallel. It then sends the gradients of the smashed data to the respective clients for their back-propagation. Afterward, the server updates its model by FedAvg, i.e., a weighted average of the gradients that it computes during back-propagation on each client's smashed data. At the client side, after receiving the gradients of its smashed data, each client performs back-propagation on its client-side local model and computes its gradients. A DP mechanism is used to make these gradients private before they are sent to the fed server. The fed server conducts FedAvg over the client-side local updates and sends the result back to all participating clients. The workflow is given in Algorithm 1 (main server and fed server) and Algorithm 2 (client side).

Algorithm 1: Splitfed learning (SFL).
Notation: (1) $S_t$ is the set of $K$ clients at time instance $t$; (2) $A_{k,t}$ is the smashed data of client $k$ at $t$; (3) $Y_k$ and $\hat{Y}_k$ are the true and predicted labels, respectively, of client $k$; (4) $\nabla\ell_k$ is the gradient of the loss for client $k$; (5) $n$ and $n_k$ are the total sample size and the sample size at client $k$, respectively.

    Main server executes:
        if time instance t = 0 then
            initialize W^S_t (global server-side model)
        else
            for each client k in S_t in parallel do
                for each local epoch e from 1 to E do
                    (A_{k,t}, Y_k) <- ClientUpdate(W^C_{k,t})
                    forward propagation with A_{k,t} on W^S_t, compute Yhat_k
                    loss calculation with Y_k and Yhat_k
                    back-propagation: calculate grad l_k(W^S_t; A^S_t)
                    send dA_{k,t} := grad l_k(A^S_t; W^S_t) (i.e., the gradient of A_{k,t})
                        to client k for ClientBackprop(dA_{k,t})
            server-side model update: W^S_{t+1} <- W^S_t - eta * (n_k/n) * sum_{i=1..K} grad l_i(W^S_t; A^S_t)

    Fed server executes:
        if t = 0 then
            initialize W^C_t (global client-side model)
            send W^C_t to all K clients for ClientUpdate(W^C_{k,t})
        else
            for each client k in S_t in parallel do
                W^C_{k,t} <- ClientBackprop(dA_{k,t})
            client-side global model update: W^C_{t+1} <- sum_{k=1..K} (n_k/n) * W^C_{k,t}
            send W^C_{t+1} to all K clients for ClientUpdate(W^C_{k,t})

**Variants of Splitfed Learning.** There can be several variants of SFL. We broadly divide them into the following two categories.

**Based on Server-side Aggregation.** This paper proposes two variants of SFL. The first, called splitfedv1 (SFLV1), is depicted in Algorithms 1 and 2. The second, called splitfedv2 (SFLV2), is motivated by the intuition that the model accuracy might be increased by removing the model aggregation from the server-side computation module of Algorithm 1. In Algorithm 1, the server-side models of all clients are executed separately in parallel and then aggregated to obtain the global server-side model at each global epoch. In contrast, SFLV2 processes the forward-backward propagations of the server-side model sequentially with respect to the clients' smashed data (there is no FedAvg of the server-side models). The client order is chosen randomly in the server-side operations, and the model is updated in every single forward-backward propagation. Besides, the server receives the smashed data from all participating clients synchronously. The client-side operation remains the same as in SFLV1: the fed server conducts FedAvg over the client-side local models and sends the aggregated model back to all participating clients. These operations are not affected by the client order, as the local client-side models are aggregated by the weighted averaging method, i.e., FedAvg.
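To illustrate the difference between the two variants, the following sketch contrasts how the main server could handle one global epoch in SFLV1 (a server-side model copy per client, aggregated with FedAvg) versus SFLV2 (a single server-side model updated sequentially over clients in random order). The `server_step`, `fedavg`, `sflv1_round`, and `sflv2_round` helpers and the weights are illustrative assumptions, not the authors' implementation; returning the smashed-data gradients to clients is omitted for brevity.

```python
# A sketch of the server-side handling in SFLV1 vs. SFLV2 (illustrative only).
import copy
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def server_step(model, smashed, labels, lr=0.01):
    """One forward/backward pass of the server-side model W^S on smashed data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    F.cross_entropy(model(smashed), labels).backward()
    opt.step()   # (the gradient of the smashed data would also be sent back to the client)

def fedavg(states, weights):
    """Weighted average of model state_dicts."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = sum(w * s[key] for s, w in zip(states, weights))
    return avg

def sflv1_round(server_model, client_batches, weights):
    # SFLV1: each client's smashed data is processed on its own copy of W^S
    # (conceptually in parallel); the copies are then aggregated with FedAvg.
    local_states = []
    for smashed, labels in client_batches:
        local = copy.deepcopy(server_model)
        server_step(local, smashed, labels)
        local_states.append(local.state_dict())
    server_model.load_state_dict(fedavg(local_states, weights))

def sflv2_round(server_model, client_batches):
    # SFLV2: a single W^S is updated sequentially, one client at a time,
    # in random order; no server-side aggregation.
    for smashed, labels in random.sample(client_batches, len(client_batches)):
        server_step(server_model, smashed, labels)

# Toy usage with three clients sending 32-dimensional smashed data.
server = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))
batches = [(torch.randn(8, 32), torch.randint(0, 2, (8,))) for _ in range(3)]
sflv1_round(server, batches, weights=[1 / 3, 1 / 3, 1 / 3])
sflv2_round(server, batches)
```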
Some other SFL versions are available in the literature, but they were developed after and influenced by our approach (Han et al. 2021; Gao et al. 2021).

**Based on Data Label Sharing.** Due to the split of the ML model in SFL, we can carry out ML in two settings: (1) sharing the data labels with the server, and (2) without sharing any data labels with the server. Algorithm 1 considers SFL with data label sharing. Without label sharing, the ML model in SFL can be partitioned into three parts, assuming a simple setup. Each client processes two client-side model portions: one with the first few layers of $W$, and another with the last few layers of $W$ and the loss calculation. The remaining middle layers of $W$ are computed at the server side. All possible configurations of SL, including vertically partitioned data, extended vanilla, and multi-task SL (Vepakomma et al. 2018), can be carried out similarly in SFL as its variants.

### Privacy Protection

The inherent privacy-preservation capabilities of SFL are due to two reasons: firstly, it adopts the model-to-data approach, and secondly, SFL conducts ML over a split network. The network split enables the clients/fed server and the main server to maintain full model privacy, since the main server never obtains the client-side model updates and vice versa. The main server has access only to the smashed data (i.e., the activation vectors of the cut layer). A curious main server would need to invert all the client-side model parameters, i.e., weight vectors, to infer the data and the client-side model. Inferring the client-side model parameters and raw data is highly unlikely if the client-side networks are configured as fully connected layers with sufficiently large numbers of nodes (Gupta and Raskar 2018). However, for a smaller client-side network, this risk can be high; it can be controlled by modifying the loss function at the client side (Vepakomma et al. 2019). For the same reasons, the clients (having access only to the gradients of the smashed data from the main server) and the fed server (having access only to the client-side updates) cannot infer the server-side model parameters. Since FL has no network split and no separate training on the client side and server side, SFL provides a superior architectural configuration for protecting an ML model during training compared to FL.

Algorithm 2: ClientUpdate.

    ClientUpdate(W^C_{k,t}) (runs on client k):
        model update: W^C_{k,t} <- FedServer()
        set A_{k,t} = phi
        for each local epoch e from 1 to E do
            forward propagation with data X_k up to layer L-1 in W^C_{k,t}
            noise layer: perturb the outputs of layer L based on Equation (5)
            with the output of the noise layer, continue forward propagation through the
                remaining layers of W^C_{k,t} and obtain the activations of its final layer,
                A_{k,t} (smashed data)
            Y_k are the true labels of X_k
            send A_{k,t} and Y_k to the main server
            wait for the completion of ClientBackprop(dA_{k,t})

    ClientBackprop(dA_{k,t}) (runs on client k):
        for each local epoch e from 1 to E do
            dA_{k,t} <- MainServer()
            back-propagation: calculate the gradients grad l_k(W^C_{k,t}) with dA_{k,t}
            clip the l2-norm of each gradient and add calibrated noise based on
                Equations (2) and (3) to obtain g~_{k,t}
            update W^C_{k,t} <- W^C_{k,t} - eta * g~_{k,t}
        send W^C_{k,t} to the fed server
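For concreteness, the following is a rough PyTorch sketch of one client's side of Algorithm 2, under the same toy split as the earlier sketches. The `NoiseLayer` is a simplified stand-in for the PixelDP noise layer of Equation (5) (a fixed scale rather than a sensitivity-calibrated one), the DP perturbation of the gradients (Equations (2) and (3)) is only marked by a comment, and communication with the main/fed server is reduced to the `client_update`/`client_backprop` function arguments and return values.

```python
# A rough sketch of the client side of Algorithm 2 (illustrative placeholders).
import torch
import torch.nn as nn

class NoiseLayer(nn.Module):
    """Simplified stand-in for the PixelDP noise layer (Eq. (5)): Laplacian noise."""
    def __init__(self, scale=0.1):
        super().__init__()
        self.scale = scale
    def forward(self, x):
        return x + torch.distributions.Laplace(0.0, self.scale).sample(x.shape)

client_model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), NoiseLayer())  # W^C + noise layer
optimizer = torch.optim.SGD(client_model.parameters(), lr=0.01)

def client_update(x):
    """Forward pass through W^C (including the noise layer); returns the smashed data."""
    return client_model(x)

def client_backprop(smashed, d_smashed):
    """Back-propagation with the gradient of the smashed data from the main server."""
    optimizer.zero_grad()
    smashed.backward(d_smashed)
    # Here the gradients would be clipped and perturbed per Eqs. (2)-(3) before the update.
    optimizer.step()
    return client_model.state_dict()        # client-side update sent to the fed server

# Toy round: the gradient returned by the main server is faked with random values.
x = torch.randn(8, 20)
smashed = client_update(x)                  # sent (with labels) to the main server
client_state = client_backprop(smashed, torch.randn_like(smashed))
```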
**Privacy Protection at the Client-side.** We discussed the inherent privacy of the proposed model in the previous section. However, an advanced adversary could exploit the underlying information representations of the shared smashed data or parameters (weights) to violate data owners' privacy. This can happen if any server or client becomes curious while remaining honest. To avoid these possibilities, we apply two measures in our studies: (i) differential privacy for the client-side model training, and (ii) a PixelDP noise layer in the client-side model.

**Privacy Protection on Fed Server.** Considering Algorithm 2, we present the process for implementing differential privacy at a client $k$. We assume that $\sigma$ represents the noise scale and $C$ represents the gradient norm bound. Firstly, after time $t$, client $k$ receives the gradients $dA_{k,t}$ from the server and, with these, calculates the client-side gradients $\nabla\ell_k(W^C_{k,i,t})$ for each of its local samples $x_i$:
$$g_{k,t}(x_i) \leftarrow \nabla\ell_k(W^C_{k,i,t}). \qquad (1)$$
Secondly, the $\ell_2$-norm of each gradient is clipped according to the following equation:
$$\bar{g}_{k,t}(x_i) \leftarrow g_{k,t}(x_i) \Big/ \max\!\left(1, \frac{\|g_{k,t}(x_i)\|_2}{C}\right). \qquad (2)$$
Thirdly, calibrated noise is added to the average gradient:
$$\tilde{g}_{k,t} \leftarrow \frac{1}{n_k}\left(\sum_i \bar{g}_{k,t}(x_i) + \mathcal{N}\!\left(0, \sigma^2 C^2 \mathbf{I}\right)\right). \qquad (3)$$
Finally, the client-side model parameters of client $k$ are updated as $W^C_{k,t+1} \leftarrow W^C_{k,t} - \eta_t\, \tilde{g}_{k,t}$. Calibrated noise is applied iteratively until the model converges or a specified number of iterations is reached. As the iterations progress, the final convergence exhibits a privacy level of $(\epsilon, \delta)$-differential privacy, where $(\epsilon, \delta)$ is the overall privacy cost of the client-side model.

Differential privacy is used to enforce strict privacy on the client-side model training algorithm, following Abadi et al.'s approach (Abadi et al. 2016). Equation (2) (norm clipping) preserves $g_{k,t}(x_i)$ when $\|g_{k,t}(x_i)\|_2 \le C$, and scales it down to norm $C$ when $\|g_{k,t}(x_i)\|_2 > C$. This step also helps clip out the effect of Equation (5) on the gradients. Hence, the norm-clipping step bounds the influence of each individual example on $\tilde{g}_{k,t}$ in the process of guaranteeing differential privacy. It was shown that the corresponding noise addition (Equation (3)) provides $(\epsilon, \delta)$-DP for each of the $b$ steps ($b = n_k/\text{batch size}$) if the noise scale $\sigma$ is chosen to be $\sqrt{2\log(1.25/\delta)}/\epsilon$ (Dwork and Roth 2014). Hence, at the end of $b$ steps, this results in $(b\epsilon, b\delta)$-DP. As shown by Abadi et al., for any $\epsilon < c_1 b^2 T$ and $\delta > 0$, the privacy can be maintained at $(\epsilon, \delta)$-DP by choosing $\sigma \ge c_2\, b\sqrt{T \log(1/\delta)}/\epsilon$ (Abadi et al. 2016). The moments accountant (a privacy accountant) is used to track and maintain $(\epsilon, \delta)$. Hence, at the end of $b$ steps, a client model guarantees $(\epsilon, \delta)$-DP. Under the strict assumption that all clients work on IID data, all clients maintain and guarantee $(\epsilon, \delta)$-DP during client-side model training and synchronization.
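As a quick illustration of Equations (2) and (3), the sketch below clips a set of toy per-example gradients and adds Gaussian noise before averaging; the `dp_perturb` helper, the clipping bound `C`, the noise scale `sigma`, and the gradients themselves are arbitrary illustrative values, not the authors' settings.

```python
# A small sketch of the clipping and noising of Equations (2)-(3); values are illustrative.
import torch

torch.manual_seed(0)

def dp_perturb(per_example_grads, C=1.0, sigma=1.1):
    # Eq. (2): clip each per-example gradient to an l2-norm of at most C.
    clipped = [g / max(1.0, g.norm().item() / C) for g in per_example_grads]
    # Eq. (3): sum the clipped gradients, add Gaussian noise N(0, sigma^2 C^2 I), and average.
    summed = torch.stack(clipped).sum(dim=0)
    noised = summed + sigma * C * torch.randn_like(summed)
    return noised / len(per_example_grads)

# Toy per-example gradients of a 10-parameter client-side model.
grads = [torch.randn(10) for _ in range(16)]
g_tilde = dp_perturb(grads)
# Client-side update (Algorithm 2): W^C <- W^C - eta * g_tilde
```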
**Privacy Protection on Main Server.** The above DP measures do not stop potential leakage from the smashed data to the main server, although they have some effect on the smashed data after the first global epoch. Thus, to avoid privacy leakage and to further strengthen data privacy and model robustness against potential adversarial ML settings, we integrate a noise layer into the client-side model based on the concepts of PixelDP (Lecuyer et al. 2019). This extended measure utilizes the noise-application mechanism of differential privacy to add calibrated noise to the output (e.g., the activation vectors) of a layer of the client-side model while maintaining utility. In this process, we first calculate the sensitivity of the process. The sensitivity of a function $A$ is defined as the maximum change in output that can be produced by a change in the input, given some distance metrics for the input and output ($p$-norm and $q$-norm, respectively):
$$\mathcal{I}_{p,q} = \mathcal{I}^{A}_{p,q} = \max_{i,j,\, i \neq j} \frac{\|A_{k,i} - A_{k,j}\|_q}{\|x_i - x_j\|_p}. \qquad (4)$$
Secondly, Laplacian noise with scale $\mathcal{I}^{A}_{p,q}/\epsilon$ is applied to randomize any data as follows:
$$A^{P}_{k,i} = A_{k,i} + \mathrm{Lap}\!\left(\mathcal{I}^{A}_{p,q}/\epsilon\right), \qquad (5)$$
where $A^{P}_{k,i}$ represents a private version of $A_{k,i}$, and $\epsilon$ is the privacy budget used for the Laplacian noise. This method enables forwarding private versions of the smashed data to the main server, hence preserving the privacy of the smashed data. The smashed data remain private downstream due to the post-processing immunity of the DP mechanism applied at the noise layer in the client-side model, and the noisy smashed data are more private than the original data due to the calibrated noise. Moreover, PixelDP not only provides privacy for the smashed data but can also improve the robustness of the model against adversarial examples. However, detailed analysis and mathematical guarantees are left for future work to preserve the main focus of the proposed work.

**Robustness via PixelDP.** The primary intuition behind using a randomized DP mechanism to make ML robust against adversarial examples is to create a DP scoring function: when any data sample is fed through the DP scoring function, the outputs are DP with respect to the features of the input. Stability bounds for the expected output of the DP function are then given by the following lemma (Lecuyer et al. 2019).

**Lemma 1.** Suppose a randomized function $\mathcal{M}$, with bounded output $\mathcal{M}(x) \in [0, b]$, $b \in \mathbb{R}^{+}$, satisfies $(\epsilon, \delta)$-DP. Then the expected value of its output meets the following property:
$$\forall \alpha \in B_p(1): \quad \mathbb{E}(\mathcal{M}(x)) \le e^{\epsilon}\, \mathbb{E}(\mathcal{M}(x + \alpha)) + b\delta, \qquad (6)$$
where $B_p(r) := \{\alpha \in \mathbb{R}^n : \|\alpha\|_p \le r\}$ is the $p$-norm ball, and the expectation is taken over the randomness in $\mathcal{M}$.

Combined with the condition $\forall \alpha \in B_p(L),\ k = f(x):\ y_k(x + \alpha) > \max_{i: i \neq k} y_i(x + \alpha)$, these bounds provide a rigorous certification of robustness to adversarial examples.

### Total Cost Analysis

This section analyzes the total communication cost and model training time for FL, SL, and SFL under a uniform data distribution. Let $K$ be the number of clients, $p$ the total data size, $q$ the size of the smashed layer, $R$ the communication rate, $T$ the time taken for one forward and backward propagation on the full model with a dataset of size $p$ (for any architecture), $T_{\mathrm{fedavg}}$ the time required to perform the full model aggregation (with $T_{\mathrm{fedavg}}/2$ the aggregation time for an individual server), $|W|$ the size of the full model, and $\beta$ the fraction of the full model's size held by a client in SL/SFL, i.e., $|W^C| = \beta|W|$. The term $2\beta|W|$ in the communication per client is due to the download and upload of the client-side model updates before and after training, respectively, by a client. The results are presented in Table 2. As shown in the table, SL can become inefficient when there is a large number of clients. Besides, we see that when $K$ increases, the total training time cost increases in the order of SFLV2