Published as a conference paper at ICLR 2022

DIVERGENCE-AWARE FEDERATED SELF-SUPERVISED LEARNING

Weiming Zhuang1,3, Yonggang Wen2, Shuai Zhang3
1S-Lab, NTU, Singapore  2NTU, Singapore  3SenseTime Research
weiming001@e.ntu.edu.sg, ygwen@ntu.edu.sg, zhangshuai@sensetime.com

ABSTRACT

Self-supervised learning (SSL) is capable of learning remarkable representations from centrally available data. Recent works further implement federated learning with SSL to learn from rapidly growing decentralized unlabeled images (e.g., from cameras and phones), which often cannot be centralized due to privacy constraints. Extensive attention has been paid to SSL approaches based on Siamese networks. However, such efforts have not yet revealed deep insights into the various fundamental building blocks of the federated self-supervised learning (FedSSL) architecture. We aim to fill this gap via an in-depth empirical study and propose a new method to tackle the non-independently and identically distributed (non-IID) data problem of decentralized data. Firstly, we introduce a generalized FedSSL framework that embraces existing SSL methods based on Siamese networks and provides flexibility to accommodate future methods. In this framework, a server coordinates multiple clients to conduct SSL training and periodically updates local models of clients with the aggregated global model. Using the framework, our study uncovers unique insights of FedSSL: 1) the stop-gradient operation, previously reported to be essential, is not always necessary in FedSSL; 2) retaining local knowledge of clients in FedSSL is particularly beneficial for non-IID data. Inspired by these insights, we then propose a new approach for model update, Federated Divergence-aware Exponential Moving Average update (FedEMA). FedEMA updates local models of clients adaptively using an EMA of the global model, where the decay rate is dynamically measured by model divergence. Extensive experiments demonstrate that FedEMA outperforms existing methods by 3-4% on linear evaluation. We hope that this work will provide useful insights for future research.

1 INTRODUCTION

Self-supervised learning (SSL) has attracted extensive research interest for learning representations without relying on expensive data labels. In computer vision, the common practice is to design proxy tasks to facilitate visual representation learning from unlabeled images (Doersch et al., 2015; Noroozi & Favaro, 2016; Zhang et al., 2016; Gidaris et al., 2018). Among them, the state-of-the-art SSL methods employ contrastive learning, which uses Siamese networks to maximize the similarity of two augmented views of the same image (Wu et al., 2018; Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Chen & He, 2021). All these methods heavily rely on the assumption that images are centrally available in cloud servers, such as public data on the Internet. However, the rapidly growing amount of decentralized images may not be centralized due to increasingly stringent privacy protection regulations (Custers et al., 2019). A growing number of edge devices, such as street cameras and phones, generate a large number of unlabeled images, but these images may not be centralized as they could contain sensitive personal information like human faces. Besides, learning representations from these images could be more beneficial for downstream tasks deployed in the same scenarios (Yan et al., 2020).
A straightforward alternative is to run SSL independently on each edge, but this results in poor performance (Zhuang et al., 2021a) because decentralized data are mostly non-independently and identically distributed (non-IID) (Li et al., 2020a). Federated learning (FL) has emerged as a popular privacy-preserving method to train models from decentralized data (McMahan et al., 2017), where clients send training updates to the server instead of raw data. The majority of FL methods, however, are not applicable to unsupervised representation learning because they require fully labeled data (Caldas et al., 2018) or partially labeled data in either the server or the clients (Jin et al., 2020a; Jeong et al., 2021). Recent studies implement FL with SSL methods based on Siamese networks, but they each focus on a single SSL method. For example, FedCA (Zhang et al., 2020a) is based on SimCLR (Chen et al., 2020a) and FedU (Zhuang et al., 2021a) is based on BYOL (Grill et al., 2020). These efforts have not yet revealed deep insights into the fundamental building blocks of Siamese networks for federated self-supervised learning.

In this paper, we investigate the effects of fundamental components of federated self-supervised learning (FedSSL) via an in-depth empirical study. To facilitate fair comparison, we first introduce a generalized FedSSL framework that embraces existing SSL methods that differ in the building blocks of their Siamese networks. The framework comprises a server and multiple clients: clients conduct SSL training using Siamese networks consisting of an online network and a target network; the server aggregates the trained online networks to obtain a new global network and uses this global network to update the online networks of clients in the next round of training. FedSSL primarily focuses on cross-silo FL, where clients are stateful with high availability (Kairouz et al., 2019).

We conduct empirical studies based on the FedSSL framework and discover important insights of FedSSL. Among four popular SSL methods (SimCLR (Chen et al., 2020a), MoCo (He et al., 2020), BYOL (Grill et al., 2020), and SimSiam (Chen & He, 2021)), FedBYOL achieves the best performance, whereas FedSimSiam yields the worst performance. More detailed analysis uncovers the following unique insights: 1) the stop-gradient operation, essential for SimSiam and BYOL, is not always essential in FedSSL; 2) target networks of clients are essential to gain knowledge from online networks; 3) keeping local knowledge of clients is beneficial for performance on non-IID data.

Inspired by these insights, we propose a new approach, Federated Divergence-aware Exponential Moving Average update (FedEMA)1, to address the non-IID data problem. Specifically, instead of updating online networks of clients simply by replacing them with the global network, FedEMA updates them via an exponential moving average (EMA) of the global network, where the decay rate of the EMA is dynamically measured by the divergence between the global and online networks. Extensive experiments demonstrate that FedEMA outperforms existing methods in a wide range of settings. We believe that the insights from this study will shed light on future research. Our main contributions are threefold:

- We introduce a new generalized FedSSL framework that embraces existing SSL methods based on Siamese networks and provides flexibility to accommodate future methods.
- We conduct in-depth empirical studies of FedSSL based on the framework and discover deep insights into the fundamental building blocks of Siamese networks for FedSSL.
- Inspired by the insights, we further propose a new model update approach, FedEMA, that adaptively updates online networks of clients with an EMA of the global network. Extensive experiments show that FedEMA outperforms existing methods in a wide range of settings.

2 RELATED WORK

Self-supervised Learning  In computer vision, self-supervised learning (SSL) aims to learn visual representations without any labels. Discriminative SSL methods facilitate learning with proxy tasks (Pathak et al., 2016; Noroozi & Favaro, 2016; Zhang et al., 2016; Gidaris et al., 2018). Among them, contrastive learning (Oord et al., 2018; Bachman et al., 2019) has become a promising principle. It uses Siamese networks to maximize the similarity of two augmented views of the same image (positive pairs) and minimize the similarity of views from different images (negative pairs). These methods are either contrastive or non-contrastive: contrastive SSL methods require negative pairs (Chen et al., 2020a; He et al., 2020) to prevent training collapse; non-contrastive SSL methods (Grill et al., 2020; Chen & He, 2021) are generally more efficient, as they maintain remarkable performance using only positive pairs. However, these methods do not perform well on decentralized non-IID data (Zhuang et al., 2021a). We analyze their similarities and differences and propose a generalized FedSSL framework.

Federated Learning  Federated learning (FL) is a distributed training technique for learning from decentralized parties without transmitting raw data to a central server (McMahan et al., 2017).

1Intuitively, if FedSSL is analogous to a superclass in object-oriented programming (OOP), then FedEMA is a subclass that inherits FedSSL and overrides the model update method.

Figure 1: Overview of the federated self-supervised learning (FedSSL) framework. It comprises an end-to-end training pipeline with four steps: 1) Each client k conducts local training on unlabeled data D_k with Siamese networks consisting of an online network W_k^o and a target network W_k^t; 2) After training, client k uploads W_k^o to the server; 3) The server aggregates them to obtain a new global network W_g^o; 4) The server updates W_k^o of client k with W_g^o.

Among many studies that address the non-IID data challenge (Zhao et al., 2018; Li et al., 2020b; Wang et al., 2020; Zhuang et al., 2021c), personalized FL (PFL) aims to learn personalized models for clients (Tan et al., 2021). Although some PFL methods interpolate global and local models (Hanzely et al., 2020; Mansour et al., 2020; yuyang deng et al., 2021), our proposed FedEMA differs in motivation, application scenario, and the measurement of the decay rate. Besides, the majority of existing works only consider supervised learning, where clients have fully labeled data. Although recent works propose federated semi-supervised learning (Jin et al., 2020b; Zhang et al., 2020b; Jeong et al., 2021) or federated domain adaptation (Peng et al., 2020; Zhuang et al., 2021b), they still need labels in either the server or the clients. This paper focuses on purely unlabeled decentralized data.

Federated Unsupervised Learning  Learning representations from unlabeled decentralized data while preserving data privacy is still a nascent field.
Federated unsupervised representation learning was first proposed by van Berlo et al. (2020) based on autoencoders, but it neglects the non-IID data challenge. Zhang et al. (2020a) address the non-IID issue at the potential privacy risk of sharing features. Although Zhuang et al. (2020) address the issue based on BYOL, as our FedEMA does, they do not shed light on why BYOL works best. Since SSL methods are evolving rapidly and new methods keep emerging, we introduce a generalized FedSSL framework and deeply investigate its fundamental components to build up practical guidelines for the generic FedSSL framework.

3 AN EMPIRICAL STUDY OF FEDERATED SELF-SUPERVISED LEARNING

This section first defines the problem and introduces the generalized FedSSL framework. Using the framework, we then conduct empirical studies to reveal deep insights of FedSSL.

3.1 PROBLEM DEFINITION

FedSSL aims to learn a generalized representation W from multiple decentralized parties for downstream tasks in the same scenarios. Each party k contains unlabeled data D_k = {X_k} that cannot be transferred to the server or other parties due to privacy constraints. Data is normally non-IID among decentralized parties (Li et al., 2020a); each party could contain only limited data categories (e.g., two out of ten CIFAR-10 classes) (Luo et al., 2019). As a result, each party alone is unable to obtain a good representation (Zhuang et al., 2021a). The global objective for learning from multiple parties is

\min_w f(w) := \sum_{k=1}^{K} \frac{n_k}{n} f_k(w),

where K is the number of clients and n = \sum_{k=1}^{K} n_k is the total data amount. For client k, f_k(w) := \mathbb{E}_{x_k \sim P_k}[f_k(w; x_k)] is the expected loss over data distribution P_k, where x_k is the unlabeled data and f_k(w; x_k) is the loss function.

3.2 GENERALIZED FRAMEWORK

We introduce a generalized FedSSL framework that empowers existing SSL methods based on Siamese networks to learn from decentralized data under privacy constraints. Figure 1 depicts the end-to-end training pipeline of the framework. It comprises three key operations: 1) Local Training in clients; 2) Model Aggregation in the server; 3) Model Communication (upload and update) between the server and clients. We implement and analyze four popular SSL methods: SimCLR (Chen et al., 2020a), MoCo (V1 (He et al., 2020) and V2 (Chen et al., 2020b)), SimSiam (Chen & He, 2021), and BYOL (Grill et al., 2020). Differences in the Siamese networks of these methods lead to differences in how these three operations are executed2.

Local Training  Firstly, each client k conducts self-supervised training on unlabeled data D_k based on the same global model W_g^o downloaded from the server. Regardless of the SSL method, clients train Siamese networks consisting of an online network W_k^o and a target network W_k^t for E local epochs using the corresponding loss function L. We classify these SSL methods by two major differences (Figure 8 in Appendix A): 1) Only SimSiam and BYOL contain a predictor in the online network, so we denote their online network W_k^o = (W_k, W_k^p), where W_k is the online encoder and W_k^p is the predictor; for SimCLR and MoCo, W_k^o = W_k. 2) SimCLR and SimSiam share identical weights between the online encoder and the target encoder, so W_k^t = W_k. In contrast, MoCo and BYOL update the target encoder with an EMA of the online encoder in every mini-batch: W_k^t = m W_k^t + (1 - m) W_k, where m is the momentum value, normally set to 0.99.
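To make the local training operation concrete, below is a minimal PyTorch-style sketch of one BYOL-style client step, combining the predictor on the online branch, the stop-gradient on the target branch, and the per-mini-batch EMA of the target encoder described above. The names are illustrative assumptions rather than the authors' implementation; SimCLR and MoCo would drop the predictor and use a contrastive loss with negatives, and SimCLR/SimSiam would share weights between the two encoders instead of applying the EMA.

```python
# Minimal sketch of one BYOL-style local training step on a client
# (illustrative only; encoder/predictor definitions and data loading are assumed).
import torch
import torch.nn.functional as F

def negative_cosine(p, z):
    # BYOL/SimSiam-style loss; z carries no gradient (stop-gradient).
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

@torch.no_grad()
def ema_update(target_net, online_net, m=0.99):
    # Target encoder tracks the online encoder: W_t <- m * W_t + (1 - m) * W_o
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(m).add_(o, alpha=1 - m)

def local_training_step(online_enc, predictor, target_enc, x1, x2, optimizer, m=0.99):
    # x1, x2 are two augmented views of the same batch of unlabeled images;
    # the optimizer holds the parameters of online_enc and predictor only.
    p1, p2 = predictor(online_enc(x1)), predictor(online_enc(x2))
    with torch.no_grad():
        z1, z2 = target_enc(x1), target_enc(x2)
    loss = 0.5 * (negative_cosine(p1, z2) + negative_cosine(p2, z1))
    optimizer.zero_grad()
    loss.backward()                         # gradients flow only through the online network
    optimizer.step()
    ema_update(target_enc, online_enc, m)   # per-mini-batch EMA (MoCo/BYOL variant)
    return loss.item()
```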
Model Communication  After local training, client k uploads the trained online network W_k^o to the server and updates it with the global model W_g^o after aggregation. Considering the differences among SSL methods, we upload and update encoders and predictors separately: 1) we upload and update the predictor when it is present in local training; 2) we follow the communication protocol of Zhuang et al. (2021a) and upload and update only the online encoder W_k when the two encoders are different.

Model Aggregation  When the server receives the online networks from clients, it aggregates them to obtain a new global model W_g^o = \sum_{k=1}^{K} \frac{n_k}{n} W_k^o, where W_g^o = (W_g, W_g^p) if the predictor is present and W_g^o = W_g otherwise, with W_g denoting the global encoder. Then, the server sends W_g^o to clients to update their online networks. Training iterates these three operations until the stopping conditions are met. At the end of training, we use the parameters of W_g^o as the generic representation W for evaluation.

3.3 EXPERIMENTAL SETUP

We provide basic experimental setups in this section and describe more details in Appendix B.

Datasets  We conduct experiments using the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009). To simulate federated settings, we equally split a dataset into K clients. We simulate non-IID data with label heterogeneity, where each client contains a limited number of classes l = {2, 4, 6, 8, 10} for CIFAR-10 and l = {20, 40, 60, 80, 100} for CIFAR-100. The setting is IID when each client contains 10 (100) classes for CIFAR-10 (CIFAR-100).

Implementation Details  We implement FedSSL in Python using the popular deep learning framework PyTorch (Paszke et al., 2017). To simulate federated learning, we train each client on one NVIDIA V100 GPU. Clients communicate with the server through the NCCL backend. We use ResNet-18 (He et al., 2016) as the default network for the encoders and present results of ResNet-50 in Appendix C. The predictor is a two-layer multi-layer perceptron (MLP). By default, we train for R = 100 rounds with K = 5 clients, E = 5 local epochs, batch size B = 128, learning rate η = 0.032 with cosine decay, and non-IID data with l = 2 (l = 20) classes per client for CIFAR-10 (CIFAR-100).

Linear Evaluation  We evaluate the quality of representations following the linear evaluation protocol (Kolesnikov et al., 2019; Grill et al., 2020). We first learn representations with the FedSSL framework. Then, we train a new linear classifier on the frozen representations.

3.4 ALGORITHM COMPARISONS

We benchmark and compare the SSL methods using the FedSSL framework. To denote the implementation of an SSL method, we add the prefix "Fed" to the name of the SSL method. For example, FedBYOL denotes using BYOL in the FedSSL framework.

2Intuitively, if FedSSL is analogous to a superclass in OOP, then the implementation of each method is a subclass that inherits FedSSL and overrides the methods for local training, model communication, and aggregation.

Table 1: Top-1 accuracy comparison of SSL methods using the FedSSL framework on non-IID CIFAR datasets. FedBYOL performs the best, whereas FedSimSiam performs the worst.

Type             Method       CIFAR-10 (%)    CIFAR-100 (%)
Contrastive      FedSimCLR    78.09 ± 0.14    55.58 ± 0.13
                 FedMoCo V1   78.21 ± 0.04    56.98 ± 0.29
                 FedMoCo V2   79.14 ± 0.13    57.47 ± 0.65
Non-contrastive  FedSimSiam   76.27 ± 0.18    48.94 ± 0.22
                 FedBYOL      79.44 ± 0.99    57.51 ± 0.09
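Table 1 (and later comparisons) report top-1 accuracy under the linear evaluation protocol of Section 3.3: the learned encoder is frozen and only a linear classifier is trained on top of it. A minimal sketch of this protocol is given below; the function signature and data loader are illustrative assumptions, while the optimizer, learning rate, and number of epochs follow the evaluation details in Appendix B.

```python
# Minimal sketch of the linear evaluation protocol: freeze the learned encoder
# and train only a linear classifier on its features (illustrative code).
import torch
import torch.nn as nn

def linear_evaluation(encoder, train_loader, num_classes, feat_dim, epochs=200, lr=3e-3):
    encoder.eval()                                   # frozen representations
    for p in encoder.parameters():
        p.requires_grad_(False)
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                features = encoder(images)           # no gradient into the encoder
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```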
Figure 2: Comparison of non-contrastive FedSSL methods with and without (w/o) the predictor (pred) or stop-gradient (stop-grad) on the non-IID CIFAR-10 dataset. Without the predictor, both FedSimSiam and FedBYOL drop in kNN testing accuracy (left plot). Without stop-gradient, FedBYOL retains competitive results on kNN testing accuracy (middle plot) and linear evaluation (right table: FedBYOL reaches 75.73% w/o stop-grad vs. 79.44% w/ stop-grad).

Table 1 compares linear evaluation results of these methods under the non-IID setting of the CIFAR datasets. On the one hand, contrastive FedSSL methods obtain similar performance. As SimCLR was previously reported to need a large batch size (e.g., B = 4096) (Chen et al., 2020a), it is surprising that FedSimCLR obtains competitive results using the same batch size B = 128 as the others. On the other hand, the results of non-contrastive FedSSL methods vary widely: FedBYOL achieves the best performance, whereas FedSimSiam yields the worst. Since SimSiam is capable of learning representations as powerful as BYOL's (Chen & He, 2021), and since non-contrastive methods are conceptually simpler and more efficient (Tian et al., 2021), we focus on non-contrastive methods and further investigate the effects of their fundamental components.

3.5 IMPACT OF FACTORS OF NON-CONTRASTIVE METHODS

This section analyzes the impact of the fundamental components of non-contrastive FedSSL methods. From the empirical studies, we obtain the following insights: 1) the predictor is essential; 2) EMA and stop-gradient improve performance; 3) local encoders should retain local knowledge of the non-IID data; 4) the target encoder should gain knowledge from the online encoder. Details are as follows.

Predictor is essential.  Figure 2 (left plot) presents the kNN testing accuracy, used to monitor training, for FedBYOL and FedSimSiam with and without predictors. Without predictors, both methods can barely learn due to collapse in local training. This affirms the vital role of the predictor (Chen & He, 2021; Tian et al., 2021) even when learning from decentralized data.

Stop-gradient was previously indicated as an essential component for SimSiam and BYOL (Tian et al., 2021), but it is not essential for FedBYOL. Stop-gradient prevents stochastic gradient optimization of the target network. Figure 2 shows that FedSimSiam without stop-gradient collapses, whereas FedBYOL without stop-gradient still achieves competitive performance. This is because the online and target encoders are significantly different in FedBYOL, as the online encoder is updated by the global encoder every communication round. In contrast, SimSiam and FedSimSiam share weights between the online and target encoders, so removing stop-gradient leads to collapse.

Exponential moving average (EMA) is not essential, but it helps improve performance.  Table 2 (first and second rows) shows that FedBYOL outperforms FedSimSiam at different levels of non-IID data, represented by {2, 4, 6, 8, 10} classes per client of CIFAR-10. EMA is the main difference between SimSiam and BYOL, indicating that EMA is helpful for improving performance. Based on these results, we further analyze the underlying impact of EMA on the encoders below.
Table 2: Top-1 accuracy comparison at various non-IID levels (the number of classes per client) on the CIFAR-10 dataset. "Update-both" means updating both W_k and W_k^t with W_g.

Method                 2      4      6      8      10 (IID)
FedBYOL                79.44  82.82  83.02  84.57  84.20
FedSimSiam             76.27  79.34  80.17  80.92  80.50
FedBYOL, update-both   74.50  78.77  83.02  84.56  83.80

Figure 3: Comparison of FedBYOL without exponential moving average (EMA) and stop-gradient (sg) on the non-IID CIFAR-10 dataset. FedBYOL w/o EMA and sg can hardly learn, but updating both W_k and W_k^t with W_g (update-both) enables it to achieve comparable results (accuracy: w/o EMA 50.20%; w/o EMA and stop-grad 11.97%; w/o EMA and stop-grad, update-both 68.75%; original FedBYOL 79.44%).

Encoders that retain local knowledge of non-IID data help improve performance.  EMA in FedBYOL allows the parameters of the online encoder to differ from those of the target encoder. As a result, the global encoder only updates the online encoder, not the target encoder. We hypothesize that retaining such local knowledge of the data in the target encoder is beneficial, especially when the data distribution is highly skewed. For comparison, we remove such local knowledge by updating both the online and target encoders with the global model. Table 2 shows that FedBYOL with both encoders updated yields lower performance than FedBYOL; it achieves results close to FedSimSiam when the data distribution is more skewed (2 or 4 classes per client). These results demonstrate the importance of keeping local knowledge in the encoders. Besides, the results for {6, 8, 10} classes per client also indicate the benefit of EMA.

Target encoder is essential to gain knowledge from the online encoder.  Figure 3 shows that FedBYOL without EMA can barely learn, and FedBYOL without EMA and stop-gradient (sg) degrades further in performance. In both cases, the target encoder is either never updated (w/o EMA) or updated only through backpropagation (w/o EMA and sg), i.e., it is not updated by the online encoder. On the other hand, we also observe that the FedSSL methods in Table 1, which all achieve competitive results, update target encoders with knowledge of online encoders (the global encoder is the aggregation of the online encoders). We argue that the target encoder is crucial for gaining knowledge from the online encoder to provide contrastive targets. We further validate this by using the global encoder to update both the online and target encoders when removing both EMA and stop-gradient. Figure 3 shows that this variant improves performance and achieves comparable results.

4 DIVERGENCE-AWARE DYNAMIC MOVING AVERAGE UPDATE

Built on the FedSSL framework, we propose Federated Divergence-aware EMA update (FedEMA) to further mitigate the non-IID data challenge. Since FedBYOL contains all the components that help improve performance, we adopt it as the baseline and optimize its model update operation. Non-IID data causes the global model to diverge from centralized training (Zhuang et al., 2021a). Inspired by the insight that retaining local knowledge of non-IID data helps improve performance, we propose to update the online network via an EMA of the global network.
Compared with FedBYOL, which replaces the online network with the global network, FedEMA fuses local and global knowledge through an EMA update, where the decay rate of the EMA is dynamically measured by model divergence. Figure 4 depicts our proposed FedEMA method. The formulation is as follows:

W_k^r = \mu W_k^{r-1} + (1 - \mu) W_g^r,                  (1)
W_k^{p,r} = \mu W_k^{p,r-1} + (1 - \mu) W_g^{p,r},        (2)
\mu = \min(\lambda \, \| W_g^r - W_k^{r-1} \|, 1),        (3)

where W_k^r and W_k^{p,r} are the online encoder and predictor of client k at training round r; W_g^r and W_g^{p,r} are the global encoder and predictor; µ is the decay rate, measured by the divergence between the global and online encoders; and λ is a scaler that adjusts the level of model divergence, which is computed as the l2-norm of the difference between the global and online encoders. We summarize FedEMA in Algorithm 1. FedEMA can be regarded as a generalization of FedBYOL: they are identical when λ = 0.

Scaler λ plays a vital role in adapting FedEMA to different levels of divergence caused by the data. The divergence between the global and online encoders varies with the federated learning setting; for example, different degrees of non-IID data result in different divergences. Since the characteristics of the data are unknown before training (the data are unlabeled), we propose a practical autoscaler that calculates a personalized λ_k for each client k automatically. The formula is λ_k = τ / ||W_g^{r+1} - W_k^r||, where τ ∈ [0, 1] is the expected value of µ at round r. We calculate λ_k only once, at the earliest round r in which client k is sampled for training. When the same set of clients is sampled in every round, λ_k = τ / ||W_g^1 - W_k^0|| is calculated at round r = 1.

The intuition of FedEMA is to retain more local knowledge when divergence is large and incorporate more global knowledge when divergence is small. When model divergence is large, keeping more local knowledge is more beneficial for non-IID data. Since the global network is the aggregation of the online networks, it represents the global knowledge of clients; when divergence is small, incorporating more global knowledge helps improve model generalization. Since model divergence is larger at the start of training (Figure 6), it is practical to choose a larger τ ∈ [0.5, 1); τ = 1 is not considered because only local knowledge would be used. We use τ = 0.7 by default in experiments.
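Before the full procedure in Algorithm 1, the core client-side update of Eqns 1-3 and the one-time autoscaler can be sketched in PyTorch-style code as follows; the helper names are illustrative assumptions and not the released implementation. The decay rate µ is measured on the encoders (Eqn 3) and reused for the predictor update (Eqn 2).

```python
# Minimal sketch of FedEMA's divergence-aware update (Eqns 1-3) and the autoscaler;
# names are illustrative, not the released implementation.
import torch

def l2_divergence(params_a, params_b):
    # ||W_a - W_b|| over all (flattened) parameters.
    with torch.no_grad():
        return torch.norm(torch.cat([(a - b).flatten()
                                     for a, b in zip(params_a, params_b)]))

def compute_autoscaler(global_enc, online_enc, tau=0.7):
    # Server side, once per client: lambda_k = tau / ||W_g^{r+1} - W_k^r||.
    div = l2_divergence(global_enc.parameters(), online_enc.parameters())
    return tau / div.item()

def fedema_client_update(online_enc, predictor, global_enc, global_pred, lam):
    # Eqn 3: mu = min(lambda_k * ||W_g^r - W_k^{r-1}||, 1), measured on the encoders.
    mu = min(lam * l2_divergence(global_enc.parameters(),
                                 online_enc.parameters()).item(), 1.0)
    with torch.no_grad():
        # Eqn 1: W_k <- mu * W_k + (1 - mu) * W_g (online encoder).
        for w_k, w_g in zip(online_enc.parameters(), global_enc.parameters()):
            w_k.mul_(mu).add_(w_g, alpha=1.0 - mu)
        # Eqn 2: same interpolation for the predictor, reusing mu.
        for w_k, w_g in zip(predictor.parameters(), global_pred.parameters()):
            w_k.mul_(mu).add_(w_g, alpha=1.0 - mu)
    return mu
```

When λ_k has not yet been set (the client's first participation), the client simply copies the global networks, and the server then sets λ_k from the observed divergence, as in Algorithm 1 below.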
Algorithm 1 Our proposed FedEMA

1:  Server Execution:
2:    Initialize W_g^0 and W_g^{p,0}; initialize λ_k to null
3:    for each round r = 0, 1, ..., R do
4:      S_t <- (selection of K clients)
5:      for client k in S_t in parallel do
6:        W_k^r, W_k^{p,r} <- Client(W_g^r, W_g^{p,r}, r, λ_k)
7:      W_g^{r+1} <- Σ_{k in S_t} (n_k / n) W_k^r
8:      W_g^{p,r+1} <- Σ_{k in S_t} (n_k / n) W_k^{p,r}
9:      for client k in S_t do
10:       λ_k <- τ / ||W_g^{r+1} - W_k^r|| if λ_k is null
11:   Return W_g^R

12: Client(W_g^r, W_g^{p,r}, r, λ_k):
13:   if λ_k is null or client k was not selected in round r-1 then
14:     W_k, W_k^t, W_k^p <- W_g^r, W_g^r, W_g^{p,r}
15:   else
16:     µ <- min(λ_k ||W_g^r - W_k^{r-1}||, 1)
17:     W_k <- µ W_k^{r-1} + (1 - µ) W_g^r
18:     W_k^p <- µ W_k^{p,r-1} + (1 - µ) W_g^{p,r}
19:   for local epoch e = 0, 1, ..., E-1 do
20:     for each data batch b of size B do
21:       W_k^o <- W_k^o - η ∇ L_{W_k^o, W_k^t}(W_k^o; b)
22:       W_k^t <- m W_k^t + (1 - m) W_k
23:   Return W_k^r, W_k^{p,r}

Figure 4: Illustration of our proposed Federated Divergence-aware Exponential Moving Average update (FedEMA). Compared with FedBYOL, which simply updates the online network W_k^o of client k with the global network W_g^o, we propose to update it via an EMA of the global network following Eqns 1 and 2, where the decay rate µ is dynamically measured by the divergence between the online encoder W_k and the global encoder W_g (Eqn 3). The online network, W_k^o = (W_k, W_k^p), is the concatenation of the online encoder W_k and the predictor W_k^p.

Table 3: Top-1 accuracy comparison under linear probing on CIFAR datasets. Our proposed FedEMA outperforms all other methods. Full results are in Table 5.

Method                         CIFAR-10 (%)                   CIFAR-100 (%)
                               IID            Non-IID         IID            Non-IID
Standalone training            82.42 ± 0.32   74.95 ± 0.66    53.88 ± 2.24   52.37 ± 0.93
FedBYOL                        84.29 ± 0.18   79.44 ± 0.99    54.24 ± 0.24   57.51 ± 0.09
FedU (Zhuang et al., 2021a)    83.96 ± 0.18   80.52 ± 0.21    54.82 ± 0.67   57.21 ± 0.31
FedEMA (λ = 0.8)               85.59 ± 0.25   82.77 ± 0.08    57.86 ± 0.15   61.21 ± 0.54
FedEMA (autoscaler, τ = 0.7)   86.26 ± 0.26   83.34 ± 0.39    58.55 ± 0.34   61.78 ± 0.14
BYOL (centralized)             90.46 ± 0.34   -               65.54 ± 0.47   -

Table 4: Top-1 accuracy comparison on 1% and 10% of labeled data for semi-supervised learning on non-IID CIFAR datasets. FedEMA outperforms other methods. Full results are in Table 7.

Method                         CIFAR-10 (%)                   CIFAR-100 (%)
                               1%             10%             1%             10%
Standalone training            61.37 ± 0.13   69.06 ± 0.24    21.37 ± 0.73   39.99 ± 0.87
FedBYOL                        70.48 ± 0.30   76.95 ± 0.46    30.21 ± 0.40   47.07 ± 0.14
FedU (Zhuang et al., 2021a)    69.52 ± 0.73   77.06 ± 0.55    29.00 ± 0.27   46.67 ± 0.06
FedEMA (λ = 1)                 72.78 ± 0.66   79.01 ± 0.30    32.49 ± 0.22   49.82 ± 0.36
FedEMA (autoscaler, τ = 0.7)   73.44 ± 0.22   79.49 ± 0.34    33.04 ± 0.23   50.48 ± 0.11
BYOL (centralized)             87.67 ± 0.15   87.89 ± 0.05    40.96 ± 0.58   56.60 ± 0.33

5 EVALUATION

This section follows the experimental setup in Section 3.3 to evaluate FedEMA under linear evaluation and semi-supervised learning. We also provide ablation studies of important hyper-parameters.

5.1 ALGORITHM COMPARISONS

To demonstrate the effectiveness of FedEMA, we compare it with the following methods: 1) standalone training, where a client learns independently using BYOL; 2) FedCA, proposed by Zhang et al. (2020a); 3) FedBYOL as the baseline; 4) FedU, proposed by Zhuang et al. (2021a). Besides, we also present results of possible upper bounds that learn representations from centralized data using BYOL.

Linear Evaluation  Table 3 shows that FedEMA outperforms other methods on different settings of the CIFAR datasets.
Specifically, the performance is more than 3% higher than existing methods in most settings. Besides, our proposed autoscaler achieves similar results to λ = 0.8. More experiments on larger numbers of clients K and random selection of clients are provided in Table 6 in Appendix C.

Semi-supervised Learning  We also assess the quality of representations following the semi-supervised learning protocol (Zhai et al., 2019; Chen et al., 2020a): we add a new two-layer MLP on top of the encoder and fine-tune the whole model with limited (1% and 10%) labeled data for 100 epochs. Table 4 indicates that FedEMA consistently outperforms other methods on non-IID settings of the CIFAR datasets and that our autoscaler outperforms the manually selected λ = 1.

5.2 ABLATION STUDIES

Ablation on FedEMA  We analyze whether we need to update both the online encoder (Eqn 1) and the predictor (Eqn 2) in FedEMA. Figure 5 shows that updating only the encoder or only the predictor already leads to better performance than FedBYOL; updating only the predictor also leads to faster convergence. Their combination results in the best performance (top-1 accuracy on non-IID CIFAR-10: FedBYOL 79.44%, FedEMA predictor only 81.13%, FedEMA encoder only 82.39%, FedEMA 82.77%). These results demonstrate the effectiveness of updating both the predictor and the encoder in FedEMA. More results on other settings are provided in Table 5 in Appendix C.

Figure 5: Ablation studies of FedEMA: applying EMA on either the predictor or the encoder leads to better performance on CIFAR-10.

Figure 6: Changes of divergence throughout training.

Figure 7: Ablation study on scaler λ, decay rate µ, and non-IID levels of the CIFAR-10 dataset: (a) analyzes the impact of scaler λ on performance; (b) compares using a constant µ on the encoder, the predictor, or both; (c) studies the impact of different non-IID levels.

Changes of Divergence  Figure 6 illustrates that the divergence between the global encoder and the online encoder (Eqn 3) decreases gradually as training proceeds. It validates our intuition that more local knowledge is used at the start of training, when divergence is larger. Besides, clients can update at their own pace depending on the divergence caused by their local datasets.

Scaler λ  We study the impact of λ with values in [0, 2] at intervals of 0.2 in Figure 7a. λ > 1 leads to a significant performance drop because it results in µ = 1 at the start of training on the CIFAR datasets, implying that no aggregated global network is used. When λ ∈ (0, 1), the performance is consistently better than FedBYOL (λ = 0), as both local and global knowledge are effectively aggregated. These analyses are mainly applicable to our experimental setting; the suitable range of λ depends on the characteristics of the data and the hyper-parameters (e.g., local epochs) of the FL setting. A practical way to tune λ manually is to gauge the divergence by running the algorithm for several rounds and choosing the λ that scales µ into (0.5, 1).
Nevertheless, we recommend using autoscaler and provide ablation study of τ of autoscaler in Figure 10a in Appendix C. Constant Values of µ We further demonstrate the necessity of dynamic EMA by comparing with using constant values of µ in Eqn 1 and 2. Figure 7b shows that a good choice of constant µ can outperform Fed BYOL, but Fed EMA outperforms using constant µ for the online encoder, predictor, or applying both. We also provide results that encoder and predictor use different µ in Appendix C. Non-IID Level Figure 7c compares the performance of different non-IID levels, ranging from 2 to 10 classes per client on the CIFAR-10 dataset. We use autoscaler for these experiments. Fed EMA consistently outperforms Fed BYOL in these settings. 6 CONCLUSION We uncover important insights of federated self-supervised learning (Fed SSL) from in-depth empirical studies, using a newly introduced generalized Fed SSL framework. Inspired by the insights, we propose a new method, Federated Divergence-aware Exponential Moving Average update (Fed EMA), to further address the non-IID data challenge. Our experiments and ablations demonstrate that Fed EMA outperforms existing methods in a wide range of settings. In the future, we plan to implement Fed SSL and Fed EMA on larger-scale datasets. We hope that this study will provide useful insights for future research. Published as a conference paper at ICLR 2022 7 REPRODUCIBILITY STATEMENT To facilitate reproducibility of experiment results, we first provide basic experimental setups in Section 3.3, including datasets, implementation details, and evaluation protocols. Then, we describe more experimental details in Appendix B, including datasets, data transformation, network architecture, training details, and default settings. Also, we indicate the settings and hyper-parameters of experiments when their settings are different from the default. Moreover, we plan to open-source the codes in the future. ACKNOWLEDGMENTS We would like to thank reviewers of ICLR 2022 for their constructive and helpful feedback. This study is in part supported by the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s); the National Research Foundation, Singapore under its Energy Programme (EP Award NRF2017EWT-EP003-023) administrated by the Energy Market Authority of Singapore, and its Energy Research Test-Bed and Industry Partnership Funding Initiative, part of the Energy Grid (EG) 2.0 programme, and its Central Gap Fund ( Central Gap Award No. NRF2020NRF-CG001-027); Singapore MOE under its Tier 1 grant call, Reference number RG96/20. Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/ paper/2019/file/ddf354219aac374f1d40b7e760ee5bb7-Paper.pdf. Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Koneˇcn y, H Brendan Mc Mahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. ar Xiv preprint ar Xiv:1812.01097, 2018. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597 1607. PMLR, 2020a. 
Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15750 15758, June 2021. Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. ar Xiv preprint ar Xiv:2003.04297, 2020b. Bart Custers, Alan M. Sears, Francien Dechesne, Ilina Georgieva, Tommaso Tani, and Simone van der Hof. EU Personal Data Protection in Policy and Practice. Springer, 2019. Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422 1430, 2015. Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. ar Xiv preprint ar Xiv:1803.07728, 2018. Jean-Bastien Grill, Florian Strub, Florent Altch e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 21271 21284. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf. Published as a conference paper at ICLR 2022 Filip Hanzely, Slavom ır Hanzely, Samuel Horv ath, and Peter Richt arik. Lower bounds and optimal algorithms for personalized federated learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2020. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729 9738, 2020. Wonyong Jeong, Jaehong Yoon, Eunho Yang, and Sung Ju Hwang. Federated semi-supervised learning with inter-client consistency & disjoint learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=ce6CFXBh30h. Yilun Jin, Xiguang Wei, Yang Liu, and Qiang Yang. Towards utilizing unlabeled data in federated learning: A survey and prospective. ar Xiv e-prints, pp. ar Xiv 2002, 2020a. Yilun Jin, Xiguang Wei, Yang Liu, and Qiang Yang. A survey towards federated semi-supervised learning. ar Xiv preprint ar Xiv:2002.11545, 2020b. Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aur elien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. ar Xiv preprint ar Xiv:1912.04977, 2019. Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1920 1929, 2019. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37:50 60, 2020a. 
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429 450, 2020b. Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum? id=Skq89Scxx. Jiahuan Luo, Xueyang Wu, Yun Luo, Anbu Huang, Yunfeng Huang, Yang Liu, and Qiang Yang. Real-world image datasets for federated learning. ar Xiv preprint ar Xiv:1910.11089, 2019. Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three approaches for personalization with applications to federated learning. ar Xiv preprint ar Xiv:2002.10619, 2020. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Ag uera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Aarti Singh and Xiaojin (Jerry) Zhu (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, pp. 1273 1282. PMLR, 2017. URL http://proceedings.mlr.press/v54/mcmahan17a.html. Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69 84. Springer, 2016. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. Published as a conference paper at ICLR 2022 Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536 2544, 2016. Xingchao Peng, Zijun Huang, Yizhe Zhu, and Kate Saenko. Federated adversarial domain adaptation. In International Conference on Learning Representations, 2020. URL https:// openreview.net/forum?id=HJez F3VYPB. Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. ar Xiv preprint ar Xiv:2103.00710, 2021. Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10268 10278. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr. press/v139/tian21a.html. Bram van Berlo, Aaqib Saeed, and Tanir Ozcelebi. Towards federated unsupervised representation learning. In Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, pp. 31 36, 2020. Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Bkluql SFDS. Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via nonparametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
3733 3742, 2018. Xi Yan, David Acuna, and Sanja Fidler. Neural data server: A large-scale search engine for transfer learning data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3893 3902, 2020. yuyang deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning, 2021. URL https://openreview.net/forum?id=g0a-XYjp Q7r. Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semisupervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1476 1485, 2019. Fengda Zhang, Kun Kuang, Zhaoyang You, Tao Shen, Jun Xiao, Yin Zhang, Chao Wu, Yueting Zhuang, and Xiaolin Li. Federated unsupervised representation learning. ar Xiv preprint ar Xiv:2010.08982, 2020a. Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pp. 649 666. Springer, 2016. Zhengming Zhang, Zhewei Yao, Yaoqing Yang, Yujun Yan, Joseph E Gonzalez, and Michael W Mahoney. Benchmarking semi-supervised federated learning. ar Xiv preprint ar Xiv:2008.11364, 17, 2020b. Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. Co RR, abs/1806.00582, 2018. URL http://arxiv.org/abs/ 1806.00582. Weiming Zhuang, Yonggang Wen, Xuesen Zhang, Xin Gan, Daiying Yin, Dongzhan Zhou, Shuai Zhang, and Shuai Yi. Performance optimization of federated person re-identification via benchmark analysis. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 955 963, 2020. Published as a conference paper at ICLR 2022 Weiming Zhuang, Xin Gan, Yonggang Wen, Shuai Zhang, and Shuai Yi. Collaborative unsupervised visual representation learning from decentralized data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4912 4921, 2021a. Weiming Zhuang, Xin Gan, Yonggang Wen, Xuesen Zhang, Shuai Zhang, and Shuai Yi. Towards unsupervised domain adaptation for deep face recognition under privacy constraints via federated learning. ar Xiv preprint ar Xiv:2105.07606, 2021b. Weiming Zhuang, Yonggang Wen, and Shuai Zhang. Joint optimization in edge-cloud continuum for federated unsupervised person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 433 441, 2021c. Weiming Zhuang, Xin Gan, Yonggang Wen, and Shuai Zhang. Easyfl: A low-code federated learning platform for dummies. IEEE Internet of Things Journal, 2022. Published as a conference paper at ICLR 2022 A DIFFERENCES OF SELF-SUPERVISED LEARNING METHODS We study four SSL methods using the Fed SSL framework in Section 3. These four SSL methods have two major differences that impact the executions of local training, model communication, and model aggregation. Figure 8 depicts these differences: 1) BYOL and Sim Siam have predictors, whereas Mo Co and Sim CLR do not have them; 2) Sim Siam and Sim CLR share weights between two encoders, whereas BYOL and Mo Co have different parameters for the online and target encoders. Online Online similarity stop- Online Target similarity stop- BYOL Sim Siam Online Target Memory Bank Online Online similarity & dissimilarity Identical Encoders With Predictor Without Predictor Non-identical Figure 8: Illustration of differences among four Self-supervised Learning (SSL) methods. B EXPERIMENTAL DETAILS In this section, we provide more details about the dataset, network architecture, and training and evaluation setups. 
Datasets CIFAR-10 and CIFAR-100 are two popular image datasets (Krizhevsky et al., 2009). Both datasets consist of 50,000 training images and 10,000 testing images. CIFAR-10 contains 10 classes, where each class has 5,000 training images and 1,000 testing images. While CIFAR-100 contains 100 classes, where each class has 500 training images and 100 testing images. To simulate federated learning, we equally split the training set into K clients. We simulate non-IID data using label heterogeneity data among clients is more skewed when each client contains less number of classes. Hence, we simulate different levels of non-IID data with l number of classes per client, where l = {2, 4, 6, 8, 10} for CIFAR-10 and l = {20, 40, 60, 80, 100} for CIFAR-100. For example, when simulating 5 clients with l = 4 classes per client in CIFAR-10, we need 5 4 = 20 total sets of data over 10 classes. Thus, we split the training images of each class equally into two sets (2,500 images in each set) and assign random four sets without overlapping classes to a client. The setting is IID when each client contains all classes of a dataset. By default, we run experiments with K = 5 clients with non-IID setting l = 2 classes per client for CIFAR-10 dataset and l = 20 classes per client for CIFAR-100 dataset. Transformation In local training of the Fed SSL framework, we take two augmentations of each image as the inputs for online and target networks, respectively. We obtain the augmentations by transforming the images with a set of transformations: For Sim CLR, BYOL, Sim Siam, and Mo Co V2, we adopt the transformations from Chen et al. (2020a); For Mo Co V1, we use the transformation described in its paper (He et al., 2020). Published as a conference paper at ICLR 2022 Table 5: Top-1 accuracy comparison under linear evaluation protocol on CIFAR datasets. Our proposed Fed EMA outperforms all other methods on non-IID settings. Method Architecture Param. 
CIFAR-10 (%) CIFAR-100 (%) IID Non-IID IID Non-IID Standalone training Res Net-18 11M 82.42 74.95 53.88 52.37 Fed Sim CLR Res Net-18 11M 82.15 78.09 56.39 55.58 Fed Mo Co V1 Res Net-18 11M 83.63 78.21 59.58 56.98 Fed Mo Co V2 Res Net-18 11M 84.25 79.14 58.71 57.47 Fed Sim Siam Res Net-18 11M 81.46 76.27 49.92 48.94 Fed BYOL Res Net-18 11M 84.29 79.44 54.24 57.51 Fed U (Zhuang et al., 2021a) Res Net-18 11M 83.96 80.52 54.82 57.21 Fed EMA predictor only (ours) Res Net-18 11M 84.97 81.13 55.52 57.53 Fed EMA encoder only (ours) Res Net-18 11M 82.88 82.39 56.06 59.74 Fed EMA (λ = 0.8) Res Net-18 11M 85.59 82.77 57.86 61.21 Fed EMA (autoscaler, τ = 0.7) Res Net-18 11M 86.26 83.34 58.55 61.78 Standalone training Res Net-50 23M 83.16 77.84 57.21 55.16 Fed Sim CLR Res Net-50 23M 82.24 80.37 57.46 56.88 Fed Mo Co V1 Res Net-50 23M 87.19 82.18 64.74 59.73 Fed Mo Co V2 Res Net-50 23M 87.19 79.62 63.75 59.52 Fed Sim Siam Res Net-50 23M 79.64 76.7 46.28 48.8 Fed BYOL Res Net-50 23M 83.90 81.33 57.75 59.53 Fed CA (Zhang et al., 2020a) Res Net-50 23M 71.25 68.01 43.30 42.34 Fed U (Zhuang et al., 2021a) Res Net-50 23M 86.48 83.25 59.51 61.94 Fed EMA predictor only (ours) Res Net-50 23M 83.66 81.78 57.79 60.11 Fed EMA encoder only (ours) Res Net-50 23M 84.66 84.91 58.52 62.51 Fed EMA (λ = 0.8) Res Net-50 23M 86.12 85.29 60.96 62.53 Fed EMA (autoscaler, τ = 0.7) Res Net-50 23M 85.08 84.31 59.48 62.77 BYOL (Centralized) Res Net-18 11M 90.46 - 65.54 - BYOL (Centralized) Res Net-50 23M 91.85 - 66.51 - Table 6: Top-1 accuracy comparison on larger numbers of clients with client subsampling: 1) randomly selecting 5 out of 20 clients per round (5/20); 2) randomly selecting 8 out of 80 clients per round (8/80). Fed EMA, trained with autoscaler, consistently outperforms Fed BYOL in both settings. Method 5/20 clients (%) 8/80 clients (%) CIFAR-10 CIFAR-100 CIFAR-10 CIFAR-100 IID Non-IID IID Non-IID IID Non-IID IID Non-IID Fed BYOL 83.25 74.92 49.49 47.09 73.58 63.28 41.19 41.58 Fed EMA (ours) 84.98 75.77 55.41 52.78 73.96 64.19 41.97 43.05 B.2 NETWORK ARCHITECTURE Predictor The network architecture of the predictor is a two-layer multilayer perceptron (MLP). The two-layer MLP starts from a fully connected layer with 4096 neurons. Followed by onedimension batch normalization and a Re LU activation function, it ends with another fully connected layer with 2048 neurons. Encoder We use Res Net-18 He et al. (2016) as the default network architecture of the encoder in the majority of experiments. Besides, we also provide results of Res Net-50 in Table 5 and 7. Published as a conference paper at ICLR 2022 Table 7: Top-1 accuracy comparison on using 1% and 10% of labeled data for semi-supervised learning on the non-IID settings of CIFAR datasets. Fed EMA outperforms all other methods. Method Architecture Param. 
CIFAR-10 (%) CIFAR-100 (%) 1% 10% 1% 10% Standalone training Res Net-18 11M 61.37 69.06 21.37 39.99 Fed Sim CLR Res Net-18 11M 63.79 73.49 21.55 41.90 Fed Mo Co V1 Res Net-18 11M 60.57 73.95 21.83 43.49 Fed Mo Co V2 Res Net-18 11M 62.89 73.65 26.93 45.27 Fed Sim Siam Res Net-18 11M 67.57 74.96 25.13 41.96 Fed BYOL Res Net-18 11M 70.48 76.95 30.21 47.07 Fed U (Zhuang et al., 2021a) Res Net-18 11M 69.52 77.06 29.00 46.67 Fed EMA (λ = 1) Res Net-18 11M 72.78 79.01 32.49 49.82 Fed EMA (autoscaler, τ = 0.7) Res Net-18 11M 73.44 79.49 33.04 50.48 Standalone training Res Net-50 23M 63.65 74.30 23.18 41.43 Fed Sim CLR Res Net-50 23M 63.00 73.56 19.30 41.13 Fed Mo Co V1 Res Net-50 23M 61.85 75.53 22.12 46.43 Fed Mo Co V2 Res Net-50 23M 64.25 73.96 25.79 42.52 Fed Sim Siam Res Net-50 23M 61.46 15.25 16.03 29.76 Fed BYOL Res Net-50 23M 69.99 76.69 26.57 45.46 Fed CA (Zhang et al., 2020a) Res Net-50 23M 28.50 36.28 16.48 22.46 Fed U (Zhuang et al., 2021a) Res Net-50 23M 69.76 80.25 28.42 48.42 Fed EMA (λ = 1) Res Net-50 23M 74.64 81.48 31.42 49.92 Fed EMA (autoscaler, τ = 0.7) Res Net-50 23M 72.52 80.68 29.68 50.75 BYOL (Centralized) Res Net-18 11M 87.67 87.89 40.96 56.60 BYOL (Centralized) Res Net-50 23M 89.07 89.66 41.49 60.23 Table 8: Top-1 accuracy comparison on various non-IID levels the number of classes per client on the CIFAR-100 dataset. Update-both means updating both Wk and W t k with Wg. Method # of classes per client (%) 2 4 6 8 10 (iid) Fed BYOL 57.51 56.96 55.14 54.96 54.24 Fed Sim Siam 48.94 51.08 49.05 48.09 49.92 Fed BYOL, update-both 49.53 54.17 51.50 52.70 53.41 Our Res Net architecture differs from the implementation in Py Torch (Paszke et al., 2017) in three aspects: 1) We use kernel size 3 3 for the first convolution layer instead of 7 7; 2) We use an average pooling layer with kernel size 4 4 before the last linear layer instead of adaptive average pooling layer; 3) We replace the last linear layer with a two-layer MLP. The network architecture of the MLP is the same as the predictor. B.3 TRAINING AND EVALUATION DETAILS We implement Fed SSL in Python using Easy FL (Zhuang et al., 2022), an easy-to-use federated learning platform based on Py Torch (Paszke et al., 2017). The following are the details of training and evaluation. Training We use Stochastic Gradient Descent (SGD) as the optimizer in training. We use η = 0.032 as the initial learning rate and decay the learning with a cosine annealing (Loshchilov & Hutter, 2017), which is also used in Sim Siam. By default, we train R = 100 rounds with local Published as a conference paper at ICLR 2022 Table 9: Comparison of Fed BYOL without exponential moving average (EMA) and stop-gradient (sg) on the CIFAR datasets. Fed BYOL w/o EMA and sg can hardly learn, but updating both Wk and W t k with Wg (update-both) enables it to achieve comparable results. Method CIFAR-10 (%) CIFAR-100 (%) IID Non-IID IID Non-IID Fed BYOL w/o EMA 54.11 50.20 23.82 25.83 Fed BYOL w/o EMA and stop-grad 21.21 11.97 3.74 2.79 Fed BYOL w/o EMA and stop-grad, update-both 82.29 68.75 48.74 41.91 Fed BYOL 84.29 79.44 54.24 57.51 0 50 100 150 200 250 300 350 400 50 60 70 80 k NN Acc. (%) Fed BYOL Fed EMA Method Rounds R (%) 100 200 300 400 Fed BYOL 79.08 82.23 83.77 86.09 Fed EMA (ours) 80.78 82.41 84.08 86.51 Figure 9: Comparison of Fed BYOL and Fed EMA on various total training rounds R on the non-IID setting of the CIFAR-10 dataset. Fed EMA consistently outperforms Fed BYOL. epochs E = 5 and batch size B = 128 using K = 5 clients. 
We simulate training of K clients on K NVIDIA V100 GPUs and employ the Py Torch Paszke et al. (2017) communication backend (NCCL) for communications between clients and the server. If not specified, we use λ = 1 by default or autoscaler with τ = 0.7 for Fed EMA. As for experiments of Fed U, we follow the hyperparameters described in paper (Zhuang et al., 2021a). Cross-silo FL vs Cross-device FL This paper primarily focuses on cross-silo FL where clients are stateful with high availability. Clients can cache local models and carry these local states from round to round. Extensive experiments demonstrate that Fed EMA achieves the best performance under this setting. On the other hand, cross-device FL assumes there are millions of stateless clients that might participate in training just once. Due to the constraints of experimental settings, the majority of studies conduct experiments with at most hundreds of clients (Wang et al., 2020; Jeong et al., 2021). Fed EMA can work under such experimental settings by caching the states of clients in the server. When the number of clients scales to millions, Fed EMA degrades to Fed BYOL that updates both encoders without keeping any local states. Evalaution We assess the quality of learned representations using linear evaluation (Kolesnikov et al., 2019; Grill et al., 2020) and semi-supervised learning (Zhai et al., 2019; Chen et al., 2020a) protocols. We first obtain a trained encoder (or learned representations) using full training set for linear evaluation and 99% or 90% of the training set for semi-supervised learning (excluding the 1% or 10% for fine-tuning). Then, we conduct evaluations based on the trained encoder. For linear evaluation, we train a new fully connected layer on top of the frozen trained encoder (fixed parameters) for 200 epochs, using batch size 512 and Adam optimizer with learning rate 3e-3. For semi-supervised learning, we add a new two-layer MLP on top of the trained encoder and fine-tune the whole model using 1% or 10% of data for 100 epochs, using batch size 128 and Adam optimizer with learning rate 1e-3. In both evaluation protocols, we remove the two-layer MLP of the encoder by replacing it with an identity function. C ADDITIONAL EXPERIMENTAL RESULTS AND ANALYSIS In this section, we provide more experimental results of algorithm comparisons and further analyze Fed EMA in different data amounts, training rounds R, and batch sizes B. Published as a conference paper at ICLR 2022 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 τ for Autoscaler 50 60 70 80 Top-1 Acc. (%) CIFAR-10 CIFAR-100 (a) τ for autoscaler 0.0 0.2 0.4 0.6 0.8 1.0 0.0 µo=0.1 µp=0.1 µo=0.1 µp=0.3 µo=0.1 µp=0.5 µo=0.3 µp=0.3 µo=0.3 µp=0.5 µo=0.5 µp=0.3 µo=0.5 µp=0.5 µo=0.7 µp=0.7 Encoder and Predictor 78 79 80 81 82 83 Top-1 Accuracy (%) Ours Fed BYOL (b) Combinations of µo and µp Figure 10: Ablation study on τ for autoscaler and combinations of constant µ: (a) analyzes the impact of τ on performances; 2) presents top-1 accuracy of using different combinations of constant µo on the online encoder and constant µp on the predictor. Table 10: Top-1 accuracy comparison on various batch sizes B on the non-IID setting of CIFAR-10 dataset. The batch size should not be either too small or too large. Besides, Fed EMA outperforms Fed BYOL. 
C ADDITIONAL EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we provide more experimental results of algorithm comparisons and further analyze FedEMA under different data amounts, training rounds R, and batch sizes B.

Figure 10: Ablation study on τ for the autoscaler and on combinations of constant µ: (a) the impact of τ ∈ [0, 1] on top-1 accuracy (%) for CIFAR-10 and CIFAR-100; (b) top-1 accuracy (%) for different combinations of a constant µo on the online encoder and a constant µp on the predictor, compared with FedBYOL.

Table 10: Top-1 accuracy (%) comparison on various batch sizes B on the non-IID setting of the CIFAR-10 dataset. The batch size should be neither too small nor too large. Besides, FedEMA outperforms FedBYOL.
Method | B = 16 | B = 32 | B = 64 | B = 128 | B = 256 | B = 512
FedBYOL | 68.74 | 72.90 | 78.58 | 79.44 | 79.80 | 77.74
FedEMA (ours) | 74.03 | 79.06 | 82.18 | 83.34 | 82.19 | 80.51

C.1 MORE EXPERIMENTAL RESULTS

Table 5 presents the top-1 accuracy comparison under linear evaluation for a wide range of methods on the CIFAR datasets using both ResNet-18 and ResNet-50. It supplements the algorithm comparisons in Sections 3.4 and 5.1. Interestingly, FedMoCo V1 achieves good performance on the IID setting of the CIFAR-100 dataset. Since decentralized data are mostly non-IID, we focus more on the non-IID setting, where FedEMA outperforms all the other methods on the CIFAR datasets. We use λ = 0.8 with ResNet-18 and λ = 1 with ResNet-50.

Table 6 shows the results of scaling to larger numbers of clients K with client subsampling in each training round. We run two sets of experiments: 1) randomly selecting 5 out of 20 clients in each round with E = 5 local epochs and R = 400 total rounds; 2) randomly selecting 8 out of 80 clients in each round with E = 2 local epochs and R = 800 total rounds. We run FedEMA with the autoscaler. FedEMA consistently outperforms FedBYOL with both encoders updated. These experiments use ResNet-18.

Table 7 supplements the semi-supervised learning results in Table 4, providing additional results using ResNet-50 as the encoder architecture. FedEMA consistently outperforms all the other methods. Besides, Tables 8 and 9 compare FedSimSiam, FedBYOL, and variants of FedBYOL to further support the insights from our empirical studies; they supplement the results in Table 2 and Figure 3.

C.2 FURTHER ANALYSIS

τ for Autoscaler
We analyze the impact of τ on performance in Figure 10a. Generally, using the autoscaler with τ ∈ (0, 1) is better than FedBYOL (τ = 0). τ = 1 yields worse results because only local knowledge is used in the model update (the global knowledge is neglected), as discussed in Section 4. Besides, τ ∈ [0.5, 1) generally performs better than other values, which verifies our intuition discussed in Section 4. These results also show that we could achieve even higher performance on the CIFAR-100 dataset in Table 3 by tuning τ. We run these experiments on the non-IID settings using ResNet-18.

Table 11: Comparison of the communication rounds needed to reach a target accuracy using different local epochs E on the non-IID setting of the CIFAR-10 dataset. E = 1 is unable to reach 80% within 100 rounds. A larger E reduces communication costs at the price of a higher computation cost.
Target accuracy | Communication (rounds): E = 1 / 5 / 10 / 20 | Computation (epochs): E = 1 / 5 / 10 / 20
70% | 90 / 40 / 10 / 8 | 90 / 200 / 100 / 160
80% | - / 80 / 50 / 40 | - / 400 / 500 / 800

Table 12: Top-1 accuracy (%) comparison for various data amounts per client and different numbers of clients. Increasing the number of clients does not improve performance, whereas increasing the data amount per client improves performance.
# of clients | K = 5 | K = 5 | K = 5 | K = 5 | K = 10 | K = 20
Data amount | 10% | 25% | 50% | 100% | 50% | 25%
FedBYOL | 43.27 | 65.14 | 76.11 | 78.25 | 75.10 | 63.95
FedEMA (ours) | 44.33 | 67.46 | 79.49 | 82.54 | 79.20 | 66.61

Constant µ
To further illustrate the effectiveness of our dynamic EMA, we report results for different combinations of a constant µo on the online encoder and a constant µp on the predictor in Figure 10b. The result of µo = 0.9 and µp = 0.9 is only 54.52%, far lower than the others. Among these combinations, µo = 0.5 and µp = 0.3 achieves the best performance. This suggests that better performance might be achieved by constructing different dynamic µ for the encoder and the predictor; we leave this interesting direction for future exploration. Although good choices of constant µo and µp achieve better performance than FedBYOL, FedEMA consistently outperforms all of these constant settings. These results complement Figure 7b in the main manuscript.
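For reference, the following is a minimal sketch of how a divergence-aware EMA update could be implemented in PyTorch. It assumes the decay rate takes the form µ = min(λ·||W_g − W_k||, 1), that the local network is updated as W_k ← µ·W_k + (1 − µ)·W_g, and that the autoscaler calibrates λ from the divergence measured at the first round so that µ starts at τ; the predictor would be updated analogously with µp. The exact divergence measure and its normalization are defined in Section 4 of the main paper and are not restated in this appendix, so treat this as an illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def divergence(global_net: nn.Module, local_net: nn.Module) -> float:
    """L2 distance between the flattened parameters of the two networks.

    Whether and how this distance is normalized is an implementation
    detail not restated here.
    """
    total = 0.0
    for pg, pl in zip(global_net.parameters(), local_net.parameters()):
        total += torch.sum((pg - pl) ** 2).item()
    return total ** 0.5


@torch.no_grad()
def fedema_update(global_net: nn.Module, online_net: nn.Module, lam: float) -> float:
    """Divergence-aware EMA update of a client's online network (sketch).

    Assumed form: mu = min(lam * ||W_g - W_k||, 1), then
    W_k <- mu * W_k + (1 - mu) * W_g. With mu = 0 this reduces to FedBYOL
    (the global model overwrites the local one); with mu = 1 only local
    knowledge is kept, matching the tau = 1 discussion above.
    """
    mu = min(lam * divergence(global_net, online_net), 1.0)
    for pg, pl in zip(global_net.parameters(), online_net.parameters()):
        pl.mul_(mu).add_((1.0 - mu) * pg)
    return mu


def autoscaler_lambda(tau: float, first_round_divergence: float) -> float:
    # Assumption: calibrate lambda so that mu equals tau at the first round.
    return tau / max(first_round_divergence, 1e-12)
```

In this sketch, a client would call `fedema_update` right after receiving the aggregated global network at the start of each round, before running its local SSL training.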
Impact of Training Rounds R
Figure 9 compares FedBYOL and FedEMA with an increasing number of training (communication) rounds R. The performance of both FedBYOL and FedEMA increases as training proceeds, and FedEMA consistently outperforms FedBYOL. We run these experiments with λ = 0.5 for FedEMA on the non-IID setting of the CIFAR-10 dataset.

Impact of Batch Size B
We investigate the impact of the batch size in Table 10. The performance with batch sizes B = 128 and B = 256 is similar and outperforms the other batch sizes, indicating that the batch size should be neither too small nor too large. Besides, FedEMA outperforms FedBYOL for all batch sizes. We run these experiments with the autoscaler (τ = 0.7) on the non-IID setting of the CIFAR-10 dataset.

Communication vs Computation Cost
Table 11 shows the communication rounds and computation epochs needed to reach a target accuracy using different local epochs E with FedEMA. Increasing E reduces the communication cost because fewer rounds are needed, but it generally incurs a higher computation cost. For example, E = 5 needs 80 rounds (400 epochs of computation) to reach 80%, whereas E = 20 uses only 40 rounds but needs 800 epochs of computation. These results highlight the trade-off between communication cost and computation cost.

Data Amount
Table 12 shows that increasing the data amount improves performance significantly. By default, we split the CIFAR-10 dataset among 5 clients, so each client contains 10,000 training images, denoted as 100% data amount. Accordingly, a data amount of p% means that each client contains 10,000 × p% images; for example, a 25% data amount means that each client contains 2,500 images. With fewer data points per client, we can construct more clients for training because the total data amount is fixed. Table 12 shows that when the data amount per client is the same, increasing the number of clients in each training round does not improve performance. However, increasing the data amount in each client improves performance significantly. These results indicate that it is important for clients to have sufficient data when participating in FedSSL training.
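To illustrate how such a setting can be simulated, the sketch below partitions CIFAR-10 by label into K clients (as in the non-IID settings with a fixed number of classes per client) and then keeps only a fraction `data_amount` of each client's share, e.g., 0.25 for 2,500 images per client with 5 clients. The partitioning scheme and the `split_clients` helper are illustrative assumptions, not the authors' exact splitting code.

```python
import numpy as np
from torchvision.datasets import CIFAR10


def split_clients(labels, num_clients=5, classes_per_client=2,
                  data_amount=1.0, seed=0):
    """Label-based non-IID split with a `data_amount` subsampling knob.

    Illustrative sketch only. Assumes num_clients * classes_per_client is a
    multiple of the number of classes, so every class is held by the same
    number of clients and no client holds duplicate classes.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    num_classes = int(labels.max()) + 1

    # Assign classes to clients in a round-robin fashion.
    tiled = np.tile(np.arange(num_classes),
                    num_clients * classes_per_client // num_classes)
    client_classes = tiled.reshape(num_clients, classes_per_client)

    # Split each class' indices evenly among the clients that hold it.
    client_chunks = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        owners = [k for k in range(num_clients) if c in client_classes[k]]
        idx = rng.permutation(np.where(labels == c)[0])
        for owner, chunk in zip(owners, np.array_split(idx, len(owners))):
            client_chunks[owner].append(chunk)

    # Keep only `data_amount` of each client's share (e.g., 0.25 -> 2,500
    # images per client when splitting CIFAR-10 across 5 clients).
    partitions = []
    for chunks in client_chunks:
        idx = rng.permutation(np.concatenate(chunks))
        partitions.append(idx[: int(len(idx) * data_amount)])
    return partitions


# Example: the default setting in the paper (5 clients, 2 classes per client,
# 100% data amount, i.e., 10,000 images per client).
train_set = CIFAR10(root="./data", train=True, download=True)
partitions = split_clients(train_set.targets, num_clients=5,
                           classes_per_client=2, data_amount=1.0)
```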