# decentralized_unsupervised_learning_of_visual_representations__ee3bcbed.pdf

Decentralized Unsupervised Learning of Visual Representations

Yawen Wu¹, Zhepeng Wang², Dewen Zeng³, Meng Li⁴, Yiyu Shi³ and Jingtong Hu¹

¹University of Pittsburgh, ²George Mason University, ³University of Notre Dame, ⁴Facebook

{yawen.wu, jthu}@pitt.edu, zwang48@gmu.edu, {dzeng2, yshi4}@pitt.edu, meng.li@fb.com

Collaborative learning enables distributed clients to learn a shared model for prediction while keeping the training data local on each client. However, existing collaborative learning methods require fully labeled data for training, which is inconvenient or sometimes infeasible to obtain due to the high labeling cost and the requirement of expertise. The lack of labels makes collaborative learning impractical in many realistic settings. Self-supervised learning can address this challenge by learning from unlabeled data. Contrastive learning (CL), a self-supervised learning approach, can effectively learn visual representations from unlabeled image data. However, the distributed data collected on clients are usually not independent and identically distributed (non-IID) among clients, and each client may have only a few classes of data, which degrades the performance of CL and the learned representations. To tackle this problem, we propose a collaborative contrastive learning framework consisting of two approaches, feature fusion and neighborhood matching, by which a unified feature space among clients is learned for better data representations. Feature fusion provides remote features as accurate contrastive information to each client for better local learning. Neighborhood matching further aligns each client's local features to the remote features such that well-clustered features among clients can be learned. Extensive experiments show the effectiveness of the proposed framework. It outperforms other methods by 11% on IID data and matches the performance of centralized learning.

1 Introduction

Collaborative learning is an effective approach for multiple distributed clients to collaboratively learn a shared model from decentralized data. In the learning process, each client updates the local model by using local data, and then a central server aggregates the local models to obtain a shared model. In this way, collaborative learning enables learning from decentralized data [McMahan et al., 2017] while keeping data local for privacy. Collaborative learning can be applied to healthcare applications, where many personal devices such as mobile phones collaboratively learn to provide early warnings of cognitive diseases such as Parkinson's and to assess mental health [Chen et al., 2020b]. Collaborative learning can also be used for robotics, in which multiple robots learn a shared navigation scheme to adapt to new environments [Liu et al., 2019]. Compared with local learning, collaborative learning improves navigation accuracy by utilizing knowledge from other robots.

Existing collaborative learning approaches assume local data is fully labeled so that supervised learning can be used for the model update on each client. However, labeling all the data is usually unrealistic due to high labor costs and the requirement of expert knowledge. For example, in medical diagnosis, even if the patients are willing to spend time on labeling all the local data, their lack of expert knowledge will result in large label noise and thus an inaccurate learned model. The lack of labels makes supervised collaborative learning impractical. Self-supervised learning can address this challenge by pre-training a neural network encoder with unlabeled data, followed by fine-tuning for a downstream task with limited labels. Contrastive learning (CL), an effective self-supervised learning approach [Chen et al., 2020a], can learn data representations from unlabeled data to improve the model. By integrating CL into collaborative learning, clients can collaboratively learn data representations by using a large amount of data without labeling.

In collaborative learning, data collected on clients are inherently far from IID [Hsu et al., 2020], which results in two unique challenges when integrating collaborative learning with CL as collaborative contrastive learning (CCL) to learn high-quality representations. The first challenge is that each client only has a small amount of unlabeled data with limited diversity, which prevents effective contrastive learning. More specifically, compared with the global data (the concatenation of local data from all clients), each client only has a subset of the global data with a limited number of classes [McMahan et al., 2017; Zhao et al., 2018; Wu et al., 2021b]. For instance, in real-world datasets [Luo et al., 2019], each client only has one or two classes out of seven object classes. Since conventional contrastive learning frameworks [He et al., 2020; Chen et al., 2020a] are designed for centralized learning on large-scale datasets with sufficient data diversity, directly applying them to local learning on each client will result in low-quality learned representations.

The second challenge is that each client focuses on learning its local data without considering the data on the other clients. As a result, the features of data in the same class but from different clients may not be well-clustered even though they could have been clustered for improved representations. Data are decentralized in collaborative learning, and even if two clients have data of the same class, they are unaware of this fact and cannot leverage it to collaboratively learn to cluster these data. Besides, even if one client has knowledge of another client's data, since no labels are available, there is no easy way to identify the correct data clusters and perform clustering for better representations.

To address these challenges, we propose a collaborative contrastive learning (CCL) framework to learn visual representations from decentralized unlabeled data on distributed clients. The framework employs contrastive learning [He et al., 2020] for local learning on each client and consists of two approaches to learn high-quality representations. The first approach is feature fusion, which provides remote features as accurate contrastive information to each client for better local learning. To protect the privacy of remote features against malicious clients, we employ an encryption method [Huang et al., 2020] to encrypt the images before generating their features. The second approach is neighborhood matching, which further aligns each client's local features to the fused features such that well-clustered features among clients are learned.

In summary, the main contributions of the paper include:

Collaborative contrastive learning framework. We propose a framework with two approaches to learning visual representations from unlabeled data on distributed clients. The first approach improves the local representation learning on each client with limited data diversity, and the second approach further learns unified global representations among clients.

Feature fusion for better local representations. We propose a feature fusion approach to leverage remote features for better local learning while avoiding raw data sharing. The remote features serve as negatives in the local contrastive loss to achieve a more accurate contrast with fewer false negatives and more diverse negatives.

Neighborhood matching for improved global representations. We propose a neighborhood matching approach to cluster decentralized data across clients. During local learning, each client identifies the remote features to cluster local data with and performs clustering. In this way, well-clustered features among clients can be learned.

Corresponding author

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22)

2 Background and Related Work

Contrastive Learning. Contrastive learning is a self-supervised approach to learn an encoder (i.e. a convolutional neural network without the final classifier) for extracting visual representation vectors from unlabeled input images by performing a proxy task of instance discrimination [Chen et al., 2020a; He et al., 2020; Wu et al., 2018; Wu et al., 2021a]. For an input image $x$, its representation vector $z$ is obtained by $z = f(x)$, $z \in \mathbb{R}^d$, where $f(\cdot)$ is the encoder. Let the representation vectors query $q$ and key $k^+$ form a positive pair; they are the representation vectors of two transformations (e.g. cropping and flipping) of the same input image. Let $Q$ be the memory bank storing $K$ representation vectors, which serve as negatives. The positive pair of query $q$ and key $k^+$ is contrasted with each vector $n \in Q$ (i.e. the negatives) by the loss function:

$$\ell_q = -\log \frac{\exp(q \cdot k^+/\tau)}{\exp(q \cdot k^+/\tau) + \sum_{n \in Q} \exp(q \cdot n/\tau)} \tag{1}$$
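As a minimal sketch of Eq. (1), the loss for a single query can be computed in plain Python as shown here. The names `dot` and `contrastive_loss`, the list-of-floats vector format, and the temperature value `tau=0.2` are illustrative assumptions and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(q, k_pos, negatives, tau=0.2):
    """Loss of Eq. (1) for one query q, its positive key k_pos, and a bank of negatives.

    tau is the temperature; 0.2 is an illustrative value, not one from the paper.
    """
    # Logit 0 belongs to the positive pair, the remaining logits to the negatives.
    logits = [dot(q, k_pos) / tau] + [dot(q, n) / tau for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # -log(exp(l0) / sum_j exp(lj)) = log(sum_j exp(lj)) - l0
    return log_denominator - logits[0]


if __name__ == "__main__":
    q = [1.0, 0.0]
    aligned = contrastive_loss(q, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
    misaligned = contrastive_loss(q, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
    print("aligned positive:", aligned)
    print("misaligned positive:", misaligned)
```

The loss is small when the query agrees with its positive key and disagrees with the negatives, and it grows when a negative is closer to the query than the positive is.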

Minimizing the loss will learn an encoder to generate visual representations. Then a classifier can be trained on top of the encoder by using limited labeled data. However, existing contrastive learning approaches are designed for centralized learning [Chen et al., 2020a; He et al., 2020] and require sufficient data diversity for learning. When applied to each client in collaborative learning with limited data diversity, their performance will greatly degrade. Therefore, an approach to increase the local data diversity on each client while protecting the shared information is needed.

Collaborative Learning. The goal of collaborative learning is to learn a shared model by aggregating locally updated models from clients while keeping raw data on local clients [McMahan et al., 2017]. In collaborative learning, there are $C$ clients indexed by $c$. The training data $D$ is distributed among clients, and each client $c$ has a subset of the training data $D_c \subseteq D$. There are recent works aiming to optimize the aggregation process [Reisizadeh et al., 2020; Nguyen et al., 2020]. While our work can be combined with these works, for simplicity, we employ a typical collaborative learning algorithm [McMahan et al., 2017]. The learning is performed round-by-round. In communication round $t$, the server randomly selects $\beta \cdot C$ clients $C^t$ and sends them the global model with parameters $\theta^t$, where $\beta$ is the percentage of active clients per round. Each client $c \in C^t$ updates the local parameters $\theta_c^t$ on local dataset $D_c$ for $E$ epochs to get $\theta_c^{t+1}$ by minimizing the loss $\ell_c(D_c, \theta^t)$. Then the local models are aggregated into the global model by averaging the weights:

$$\theta^{t+1} \leftarrow \sum_{c \in C^t} \frac{|D_c|}{\sum_{i \in C^t} |D_i|}\, \theta_c^{t+1}$$

This learning process continues until the global model converges.

To improve the performance of collaborative learning on non-IID data, [Zhao et al., 2018; Jeong et al., 2018] share local raw data (e.g. images) among clients. However, sharing raw data among clients will cause privacy concerns [Li et al., 2020]. Besides, they need fully labeled data to perform collaborative learning, which requires expert knowledge and potentially high labeling costs. Therefore, an approach to performing collaborative learning with limited labels and avoiding sharing raw data is needed.
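The round structure described above (select a fraction of clients, then average their parameters weighted by local data size) can be sketched as follows. This is a simplified illustration: the function names `select_clients` and `aggregate`, the flat list-of-floats parameter format, and the rounding of the number of selected clients are assumptions made for the example, not details from the paper.

```python
import random


def select_clients(num_clients, beta, rng):
    """Pick round(beta * num_clients) distinct client indices (at least one).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    count = max(1, round(beta * num_clients))
    return sorted(rng.sample(range(num_clients), count))


def aggregate(local_params, data_sizes):
    """Average client parameter vectors with weights |D_c| / sum_i |D_i|."""
    total = sum(data_sizes)
    dim = len(local_params[0])
    return [
        sum(size / total * params[j] for params, size in zip(local_params, data_sizes))
        for j in range(dim)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    active = select_clients(num_clients=10, beta=0.3, rng=rng)
    print("active clients:", active)
    # Two clients with 1 and 3 samples: the second client dominates the average.
    print(aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Weighting by data size means that a client holding more samples pulls the global model further toward its local update.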

3 Framework Overview

We propose a collaborative contrastive learning (CCL) framework to learn representations from unlabeled data on distributed clients. These distributed data cannot be combined in a single location to construct a centralized dataset due to privacy and legal constraints [Kairouz et al., 2019]. The overview of the proposed framework is shown in Figure 1. There is a central server that coordinates multiple clients to learn representations. The local model on each client is based on MoCo [He et al., 2020]. CCL uses the proposed feature fusion technique to reduce the false negative ratio on each client for better local learning (Sect. 4). Besides, based on fused features, CCL further uses the proposed neighborhood matching to cluster representations of data from different clients to learn unified representations among clients (Sect. 5).

Figure 1: Overview of the proposed collaborative contrastive learning (CCL) framework. Four steps are performed in each learning round, including model and feature uploading from clients to the server, model aggregation on the server, model and feature downloading to clients, and local learning. Local learning is performed by the proposed feature fusion in Sect. 4 and neighborhood matching in Sect. 5.

Before introducing the details of feature fusion and neighborhood matching, we present the proposed CCL process. CCL is performed round-by-round and there are four steps in each round as shown in Figure 1. First, each client $c$ uploads its latest model (consisting of the main model $f_q^c$ and momentum model $f_k^c$) and latest features $\tilde{Q}_{l,c}$ of encrypted images to the server. Second, the server aggregates main models from clients by $\theta_q \leftarrow \sum_{c \in C} \frac{|D_c|}{|D|} \theta_q^c$ and momentum models by $\theta_k \leftarrow \sum_{c \in C} \frac{|D_c|}{|D|} \theta_k^c$ to get updated $f_q$ and $f_k$, where $|D_c|$ is the data size of client $c$ and $|D|$ is the total data size of the $|C|$ clients. The server also combines features as $Q = \{\tilde{Q}_{l,c}\}_{c \in C}$. Third, the server sends the aggregated models $f_q$ and $f_k$ and the combined features $Q$ excluding $\tilde{Q}_{l,c}$, denoted $Q_{s,c} = \{Q \setminus \tilde{Q}_{l,c}\}$, to each client $c$. Fourth, each client updates its local models with $f_q$ and $f_k$, and then performs local contrastive learning for multiple epochs with local features $Q_{l,c}$ and remote features $Q_{s,c}$ by using the loss in Eq.(12), which includes the contrastive loss with fused features in Eq.(6) and the neighborhood matching loss in Eq.(11).

During local contrastive learning, to generate features $\tilde{Q}_{l,c}$ for uploading in the next round, images $x$ are encrypted by InstaHide [Huang et al., 2020] as $\tilde{x}$ and fed into the momentum model $f_k^c$. In this way, even if a malicious client can ideally recover $\tilde{x}$ from the features $\tilde{Q}_{l,c}$, which is already very unlikely in practice, it still cannot reconstruct $x$ from $\tilde{x}$ since InstaHide effectively hides the information contained in $x$. Next, we present the details of local contrastive learning, including feature fusion to reduce the false negative ratio in Sect. 4 and neighborhood matching for unified representations in Sect. 5.
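The server-side feature handling in steps 2 and 3 (combine all uploaded features, then send each client everything except its own upload) can be sketched as follows. The function name `remote_banks` and the dictionary layout are hypothetical choices for this illustration, and the feature vectors are assumed to have already been computed from encrypted images on the clients.

```python
def remote_banks(uploaded):
    """Build the remote feature bank for every client.

    uploaded maps a client id to the list of feature vectors that client
    uploaded (features of its encrypted images). The result maps each client
    id to the concatenation of all other clients' uploads.
    """
    return {
        client: [
            feature
            for other, features in uploaded.items()
            if other != client
            for feature in features
        ]
        for client in uploaded
    }


if __name__ == "__main__":
    uploads = {
        0: [[0.1, 0.9]],
        1: [[0.8, 0.2], [0.7, 0.3]],
        2: [[0.5, 0.5]],
    }
    for client, bank in remote_banks(uploads).items():
        print(client, bank)
```

Each client thus receives only features, never raw images, and never gets its own upload back as a remote negative.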

4 Local Learning with Feature Fusion

Next, we focus on how to perform local CL in each round of CCL. We first present the key challenge of CL on each client, which does not exist in conventional centralized CL. Then we propose feature fusion to tackle this challenge and introduce how to perform local CL with fused features.

Key challenge: Limited data diversity causes a high false negative (FN) ratio on each client. A low FN ratio is crucial to achieving accurate CL [Kalantidis et al., 2020]. For one image sample $q$, FNs are features that we use as negative features but that actually correspond to images of the same class as $q$. In centralized CL, the percentage of FNs is inherently low since diverse data are available. The model has access to the whole dataset $D$ with data from all the classes instead of a subset $D_c$ as in collaborative learning. Thus, when we randomly sample negatives from $D$, the FN ratio is low. For instance, when dataset $D$ has 1000 balanced classes and the negatives $n$ are randomly sampled, for any image $q$ to be learned, only $\frac{1}{1000}$ of $n$ are from the same class as $q$ and are FNs.

However, in collaborative learning, the FN ratio is inherently high on each client due to the limited data diversity, which significantly degrades the performance of contrastive learning. For instance, in real-world datasets [Luo et al., 2019], one client can have only one or two classes out of seven classes. With limited data diversity on each client, when learning image sample $q$, many negatives $n$ to contrast with will be from the same class as $q$ and are FNs. When contrastive learning minimizes the contrastive loss in Eq.(1), the model scatters the FNs away from $q$, although they should have been clustered since they are from the same class. As a result, the representations of samples from the same class will be scattered instead of clustered, which degrades the learned representations.


4.1 Feature Fusion

To address this challenge, we propose feature fusion to share negatives in the feature space (i.e. the output vector of the encoder), which reduces FNs and improves the diversity of negatives while avoiding raw data sharing. Let $Q_{l,c}$ be the memory bank of size $K$ for local features of non-encrypted images on client $c$, and let $\tilde{Q}_{l,c}$ be the features of encrypted images. In one round $t$ of CCL, the features $\tilde{Q}_{l,c}$ of encrypted images on each client $c$ are uploaded to the server (i.e. step 1 in Figure 1). The server then sends the combined features $Q$ excluding $\tilde{Q}_{l,c}$ to each client $c$ (i.e. step 3 in Figure 1) to form its memory bank of remote negatives $Q_{s,c}$ as follows.

$$Q_{s,c} = \{\tilde{Q}_{l,i} \mid 1 \le i \le |C|,\ i \neq c\} \tag{2}$$

where $C$ is the set of all clients. On client $c$, with local negatives $Q_{l,c}$ and remote negatives $Q_{s,c}$, the loss for sample $q$ is defined as:

$$\ell_q = -\log \frac{\exp(q \cdot k^+/\tau)}{\exp(q \cdot k^+/\tau) + \sum_{n \in \{Q_l \cup Q_s\}} \exp(q \cdot n/\tau)} \tag{3}$$

where we leave out the client index $c$ in $Q_{l,c}$ and $Q_{s,c}$ for conciseness. $\ell_q$ is the negative log-likelihood of the probability distribution obtained by applying a softmax function to the query $q$, its positive $k^+$, and the negatives $n$ from both the local negatives $Q_l$ and the remote negatives $Q_s$.

Effectiveness of feature fusion. The remote negatives $Q_s$ reduce the FN ratio in local contrastive learning and improve the quality of learned representations on each client. More specifically, in collaborative learning with non-IID data, we assume the global dataset $D$ has $M$ classes of data, each class with the same number of samples. Each client $c \in C$ has a subset $D_c \subseteq D$ of the same size covering $m$ classes ($m \le M$) [Zhao et al., 2018; McMahan et al., 2017]. For a sample $q$ on client $c$, when only local negatives $Q_{l,c}$ are used, $\frac{1}{m}|Q_{l,c}|$ negatives will be in the same class as $q$, which results in an FN ratio $R_{FN} = \frac{1}{m}$. Since $m$ is usually small (e.g. 2) due to limited data diversity, the FN ratio $R_{FN}$ will be large (e.g. 50%) and degrade the quality of learned representations. In contrast, when remote negatives are used, the FN ratio is:

$$R_{FN}(q) = \frac{\frac{1}{m}|Q_{l,c}| + \sum_{i \in C, i \neq c} I(i, q)\,\frac{1}{m}|Q_{l,i}|}{|Q_{l,c}| + \sum_{i \in C, i \neq c} |Q_{l,i}|} \le \frac{1}{m} \tag{4}$$

where $I(i, q)$ is an indicator function that equals 1 when client $i$ has data of the same class as $q$, and 0 otherwise. In most cases, $R_{FN}(q)$ is effectively reduced by the remote negatives. First, in the extreme case of non-IID data distribution, where the classes on each client are mutually exclusive [Zhao et al., 2018], all $I(i, q)$ equal 0 and, since every memory bank has the same size, $R_{FN}(q) = \frac{1}{m} \frac{|Q_{l,c}|}{|Q_{l,c}| + \sum_{i \in C, i \neq c} |Q_{l,i}|} = \frac{1}{m|C|} \le \frac{1}{m}$. With the remote negatives, the FN ratio is effectively reduced by a factor $|C|$. Second, as long as not all clients have data of the same class as $q$, some elements in $\{I(i, q)\}_{i=1, i \neq c}^{|C|}$ will be 0, and $R_{FN}(q)$ in Eq.(4) will be smaller than $\frac{1}{m}$. In this case, the FN ratio is also reduced. Third, even if the data on each client is IID and all $I(i, q)$ equal 1, which is unlikely in realistic collaborative learning [Hsu et al., 2020], the FN ratio $R_{FN}(q)$ will be $\frac{1}{m}$. In this case, while $R_{FN}(q)$ is the same as that without remote negatives, the increased diversity of negatives from other clients can still benefit the local contrastive learning.

Further reducing the false negative ratio. To further reduce the FN ratio, we propose to exclude the local negatives by removing $Q_l$ in the denominator of Eq.(3) and only keeping the remote negatives $Q_s$. The corresponding FN ratio becomes:

$$R'_{FN}(q) = \frac{\sum_{i \in C, i \neq c} I(i, q)\,\frac{1}{m}|Q_{l,i}|}{\sum_{i \in C, i \neq c} |Q_{l,i}|} \le R_{FN}(q) \tag{5}$$

As long as not all other clients have data in the same class as $q$, some $I(i, q)$ will be 0. In this way, $R'_{FN}(q) < R_{FN}(q)$ and the FN ratio is further reduced. Based on the loss $\ell_q$ for one sample $q$ in Eq.(3), the contrastive loss for one mini-batch $B$ is:

$$\mathcal{L}_{contrast} = \frac{1}{|B|} \sum_{q \in B} \ell_q \tag{6}$$
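To make the false negative ratios of Eq.(4) and Eq.(5) concrete, the following sketch evaluates them for a few hypothetical client configurations. The function name `fn_ratio`, its argument layout, and the example numbers (5 clients, 2 classes per client, memory banks of 100 features) are assumptions chosen only for illustration.

```python
def fn_ratio(m, local_size, remote_sizes, shares_class, include_local=True):
    """False negative ratio for one sample q on client c.

    m:             number of classes held by each client
    local_size:    |Q_{l,c}|, size of the local memory bank
    remote_sizes:  |Q_{l,i}| for every other client i
    shares_class:  I(i, q) for every other client i (True if i holds q's class)
    include_local: True evaluates Eq. (4), False evaluates Eq. (5)
    """
    false_negatives = sum(
        size / m for size, shared in zip(remote_sizes, shares_class) if shared
    )
    total_negatives = sum(remote_sizes)
    if include_local:
        false_negatives += local_size / m
        total_negatives += local_size
    return false_negatives / total_negatives


if __name__ == "__main__":
    sizes = [100, 100, 100, 100]  # four other clients, so |C| = 5
    # Mutually exclusive classes: ratio drops from 1/m to 1/(m * |C|).
    print(fn_ratio(2, 100, sizes, [False] * 4))
    # IID clients: ratio stays at 1/m.
    print(fn_ratio(2, 100, sizes, [True] * 4))
    # One other client shares q's class: Eq. (5) is below Eq. (4).
    one_shared = [True, False, False, False]
    print(fn_ratio(2, 100, sizes, one_shared), fn_ratio(2, 100, sizes, one_shared, include_local=False))
```

With these numbers the local-only ratio of 1/2 falls to 1/10 in the mutually exclusive case, which matches the reduction by a factor of |C| stated above.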

5 Local Learning with Neighborhood Matching

Figure 2: Neighborhood matching aligns each client's local features to the remote features such that well-clustered features among clients are learned. Panels: (a) before neighborhood matching, (b) after neighborhood matching. Markers distinguish local features from remote features.

In local contrastive learning, each client focuses on learning its local data without considering data on the other clients. As a result, the features of data in the same class but from different clients may not be well-clustered even though they could be clustered for improved representations.

Challenge: To cluster local features to the correct remote features, one has to identify local data and remote features that are in the same class. However, since no labels are available for local data and remote features, there is no easy way to identify the correct clusters to push local features to.

To address this challenge, we propose a neighborhood matching approach to identify the remote features to cluster local data with and define an objective function to perform the clustering. First, during local learning on one client, as shown in Figure 2, for each local sample we find the $N$ nearest features from both the remote and local features as neighbors. Then the features of the local sample will be pushed to these neighbors by the proposed entropy-based loss. Since the model is synchronized from the server to clients in each communication round, the remote and local features are encoded by similar models on different clients. Therefore, the neighbors are likely to be in the same class as the local sample being learned, and clustering them will improve the learned representations of global data. In this way, the global model is also improved when aggregating local models.

Identifying neighbors. To push each local sample close to its neighbors, we minimize the entropy of one sample's matching probability distribution to either a remote feature or a local feature. To improve the robustness, we match one sample to its $N$ nearest features at the same time, instead of only one nearest feature. By minimizing the entropy for the $N$ nearest neighbors, the sample's matching probability to each of the nearest neighbors will be individually certain. For each local sample $q_i$, we regard the top-$N$ closest features, either from local or remote features, as neighbors. To find the neighbors, we first define the neighbor candidates as:

$$Q' = \{Q_{s+l,i} \mid i \sim U(|Q_s| + K, K)\} \tag{7}$$

where $i \sim U(|Q_s| + K, K)$ samples $K$ integer indices from $[|Q_s| + K]$ uniformly at random. $Q_{s+l,i}$ is the element with index $i$ in the union of remote and local features $Q_s \cup Q_l$. For one local sample $q_i$, the neighbors $P(q_i)$, which are the top-$N$ nearest of the neighbor candidates $Q'$, are given by:

$$P(q_i) = \{Q'_j \mid j \in \mathrm{topN}(S_{i,m}),\ 1 \le m \le K\} \tag{8}$$

where $S_{i,m} = \mathrm{sim}(q_i, n_m) = q_i^T n_m / (\|q_i\| \|n_m\|)$ is the cosine similarity between $q_i$ and one neighbor candidate $n_m \in Q'$.

Neighborhood matching loss. To make the probability of $q_i$ matching each $n_j \in P(q_i)$ individually certain, we consider the set:

$$L_j = \{n_j\} \cup \{Q' \setminus P(q_i)\} \in \mathbb{R}^{(K-N+1) \times d} \tag{9}$$

where $d$ is the dimension of one feature vector. $L_j$ contains one of the top-$N$ nearest neighbors, $n_j$, and the neighbor candidates excluding all other top-$N$ nearest neighbors. Given $L_j$, the probability that sample $q_i$ is matched to one of the neighbors $n_a \in L_j$ is:

$$p_{i,j,a} = \frac{\exp(q_i^T n_a/\tau_{nm})}{\sum_{n \in L_j} \exp(q_i^T n/\tau_{nm})}, \quad n_a \in L_j \tag{10}$$

The temperature $\tau_{nm}$ controls the softness of the probability distribution [Hinton et al., 2015]. Since $n_j$ has the largest cosine similarity with $q_i$ for $n \in L_j$, $p_{i,j,a}$ will have the largest value when $n_a = n_j$ for $n_a \in L_j$. In this way, when minimizing the entropy of the probability distribution $\{p_{i,j,a}\}_{n_a \in L_j}$, the matching probability of $q_i$ and $n_j$ will be maximized. For one mini-batch $B$, to match each sample to its $N$ nearest neighbors, the entropy for all samples in this mini-batch is calculated as:

$$\mathcal{L}_{neigh} = -\frac{1}{|B|} \sum_{i \in B} \frac{1}{N} \sum_{j=1}^{N} \sum_{a=1}^{K-N+1} p_{i,j,a} \log(p_{i,j,a}) \tag{11}$$

where $K - N + 1$ is the number of features in $L_j$, and $N$ is the number of nearest neighbors to match. By minimizing $\mathcal{L}_{neigh}$, each sample $i \in B$ will be aligned to its top-$N$ nearest neighbors.

Final loss. Based on the contrastive loss with fused features in Eq.(6) and the neighborhood matching loss in Eq.(11), the overall objective is formulated as:

$$\mathcal{L} = \mathcal{L}_{contrast} + \lambda \mathcal{L}_{neigh} \tag{12}$$

where $\lambda$ is a weight parameter.
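The neighbor selection and entropy computation for a single local sample can be sketched as follows. This is an illustrative reading of Eqs.(8) to (10), with the entropy averaged over the N neighbors; the function names, the value `tau_nm=0.1`, and the toy vectors are assumptions, and the neighbor candidates are passed in directly rather than sampled as in Eq.(7).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def neighborhood_matching_loss(q, candidates, n_neighbors, tau_nm=0.1):
    """Mean entropy of the matching distributions of q to its top-N neighbors.

    candidates plays the role of Q' (K vectors); tau_nm=0.1 is illustrative.
    """
    sims = [cosine(q, n) for n in candidates]
    order = sorted(range(len(candidates)), key=lambda m: sims[m], reverse=True)
    top = order[:n_neighbors]                             # indices of P(q)
    rest = [candidates[m] for m in order[n_neighbors:]]   # Q' without P(q)
    total_entropy = 0.0
    for j in top:
        l_j = [candidates[j]] + rest                      # K - N + 1 vectors
        probs = softmax([dot(q, n) / tau_nm for n in l_j])
        total_entropy -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total_entropy / n_neighbors


if __name__ == "__main__":
    q = [1.0, 0.0]
    bank = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0]]
    print(neighborhood_matching_loss(q, bank, n_neighbors=2, tau_nm=0.1))
    print(neighborhood_matching_loss(q, bank, n_neighbors=2, tau_nm=10.0))
```

A small temperature gives a sharp matching distribution with low entropy, while a large temperature pushes the distribution toward uniform and the entropy toward log(K - N + 1).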

6 Experimental Results

Datasets, model architecture, and distributed settings. We evaluate the proposed approaches on three datasets, including CIFAR-10 [Krizhevsky et al., 2009], CIFAR-100 [Krizhevsky et al., 2009], and Fashion-MNIST [Xiao et al., 2017]. We use ResNet-18 as the base encoder and use a 2-layer MLP to project the representations to a 128-dimensional feature space [Chen et al., 2020a; He et al., 2020]. For each of the three datasets, we consider one IID setting and two non-IID settings. The detailed collaborative learning settings and training details can be found in the Appendix.

Metrics. To evaluate the quality of learned representations, we use standard metrics for centralized self-supervised learning, including linear evaluation and semi-supervised learning [Chen et al., 2020a]. Besides, we evaluate by collaborative finetuning for realistic collaborative learning. In linear evaluation, a linear classifier is trained on top of the frozen base encoder, and the test accuracy represents the quality of learned representations. We first perform collaborative learning by the proposed approaches without labels to learn representations. Then we fix the encoder and train a linear classifier on the 100% labeled dataset on top of the encoder. The classifier is trained for 100 epochs by the SGD optimizer following the hyper-parameters from [He et al., 2020]. In semi-supervised learning, we first train the base encoder without labels in collaborative learning. Then we append a linear classifier to the encoder and finetune the whole model on 10% or 1% labeled data for 20 epochs with the SGD optimizer following the hyper-parameters from [Caron et al., 2020]. In collaborative finetuning, the encoder learned by the proposed approaches is used as the initialization for finetuning the whole model by supervised collaborative learning [McMahan et al., 2017] with few locally labeled data on clients. Detailed collaborative finetuning settings can be found in the Appendix.

Baselines. We compare the proposed methods with multiple approaches. Predicting Rotation is a self-supervised learning approach that predicts the rotation of images [Gidaris et al., 2018]. DeepCluster-v2 is the improved version of DeepCluster [Caron et al., 2020; Caron et al., 2018] and achieves SOTA performance. SwAV and SimCLR are SOTA approaches for self-supervised learning [Caron et al., 2020; Chen et al., 2020a]. We combine these approaches with FedAvg as FedRot, FedDC, FedSwAV, and FedSimCLR. FedCA is the SOTA collaborative unsupervised learning approach with a shared dictionary and online knowledge distillation [Zhang et al., 2020]. Besides, we compare with two methods as upper bounds. MoCo [He et al., 2020] is a centralized contrastive learning method assuming all data are combined in a single location. We compare with MoCo since the local model in the proposed methods is based on it. FedAvg [McMahan et al., 2017] is a fully supervised collaborative learning method.

6.1 Linear Evaluation

Linear evaluation on CIFAR-10. We evaluate the proposed approaches by linear evaluation with 100% data labeled for training the classifier on top of the fixed encoder learned with unlabeled data by different approaches. The proposed approaches significantly outperform other methods and even match the performance of the centralized upper bound method.

Figure 3: Linear evaluation accuracy (top-1, %) on CIFAR-10 in the IID setting. The classifier is trained with 100% labels on a fixed encoder learned by different approaches. Error bars denote the standard deviation over three independent runs. Centralized MoCo is the upper bound method.

The results on CIFAR-10 in the IID setting are shown in Figure 3. On CIFAR-10, the proposed approaches achieve 88.90% top-1 accuracy, only 0.38% below the upper bound method centralized MoCo. The proposed approaches also outperform the SOTA method FedCA by +10.92% top-1 accuracy and the best-performing baseline FedSwAV by +5.28%.

| Method | CIFAR-10 IID | CIFAR-10 Non-1 | CIFAR-10 Non-2 | CIFAR-100 IID | CIFAR-100 Non-1 | CIFAR-100 Non-2 | FMNIST IID | FMNIST Non-1 | FMNIST Non-2 |
|---|---|---|---|---|---|---|---|---|---|
| FedRot | 78.57 | 75.88 | 70.98 | 45.80 | 44.57 | 43.15 | 83.74 | 82.65 | 82.90 |
| FedDC | 78.49 | 69.97 | 69.34 | 49.27 | 49.06 | 47.21 | 88.41 | 85.92 | 88.35 |
| FedSwAV | 83.62 | 75.07 | 75.36 | 55.51 | 51.45 | 53.77 | 89.63 | 87.11 | 89.75 |
| FedSimCLR | 82.99 | 71.23 | 73.30 | 48.83 | 45.67 | 48.46 | 88.45 | 84.41 | 86.23 |
| FedCA | 77.98 | 75.57 | 75.50 | 48.93 | 47.70 | 48.22 | 86.98 | 86.22 | 86.46 |
| Proposed | 88.90 | 79.07 | 78.31 | 61.91 | 57.54 | 58.63 | 91.26 | 88.13 | 90.08 |
| Upper bounds | | | | | | | | | |
| MoCo (Centralized) | 89.28 | - | - | 63.72 | - | - | 91.97 | - | - |
| FedAvg (Supervised) | 92.88 | 60.60 | 59.03 | 73.08 | 67.59 | 66.90 | 94.12 | 77.08 | 69.92 |

Table 1: Linear evaluation on CIFAR-10, CIFAR-100, and FMNIST under the IID and two non-IID settings. 100% labeled data are used for learning the classifier on the fixed encoder and top-1 accuracy (%) is reported. Centralized MoCo and Supervised FedAvg are the upper bound methods.

Linear evaluation on various datasets and distributed settings. We evaluate the proposed approaches on different datasets and collaborative learning settings. The results under the IID setting and non-IID settings 1 and 2 on CIFAR-100 and FMNIST are shown in Table 1. The linear classifier is trained on top of the fixed encoder learned with unlabeled data by different approaches. Under all three collaborative learning settings and on both datasets, the proposed approaches significantly outperform the baselines. For example, on CIFAR-100 the proposed approaches outperform the best-performing baseline by 6.40%, 6.09%, and 4.86% under the three collaborative learning settings, respectively. Besides, compared with the two upper bound methods, the proposed methods match the performance of the upper bound centralized MoCo under the IID setting, and outperform supervised FedAvg on CIFAR-10 under the non-IID settings.
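The CIFAR-100 margins quoted above follow directly from the Table 1 entries, as the short check below shows. The dictionary simply copies the CIFAR-100 columns of Table 1 (IID, Non-1, Non-2); the function name `margins_over_best_baseline` is a hypothetical helper introduced for this illustration.

```python
# CIFAR-100 top-1 accuracy (%) from Table 1, ordered as (IID, Non-1, Non-2).
CIFAR100 = {
    "FedRot": (45.80, 44.57, 43.15),
    "FedDC": (49.27, 49.06, 47.21),
    "FedSwAV": (55.51, 51.45, 53.77),
    "FedSimCLR": (48.83, 45.67, 48.46),
    "FedCA": (48.93, 47.70, 48.22),
    "Proposed": (61.91, 57.54, 58.63),
}


def margins_over_best_baseline(table, ours="Proposed"):
    """Per-setting gap between our method and the strongest baseline."""
    margins = []
    for setting in range(3):
        best = max(acc[setting] for name, acc in table.items() if name != ours)
        margins.append(round(table[ours][setting] - best, 2))
    return margins


if __name__ == "__main__":
    print(margins_over_best_baseline(CIFAR100))
```

In all three settings the strongest baseline on CIFAR-100 is FedSwAV, so the margins are measured against it.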

IID setting:

| Method | CIFAR-10 10% | CIFAR-10 1% | CIFAR-100 10% | CIFAR-100 1% | FMNIST 10% | FMNIST 1% |
|---|---|---|---|---|---|---|
| FedRot | 85.38 | 71.62 | 43.78 | 19.84 | 91.23 | 47.94 |
| FedDC | 78.88 | 44.18 | 40.69 | 11.93 | 88.97 | 30.25 |
| FedSwAV | 84.51 | 48.96 | 50.23 | 13.82 | 90.48 | 62.08 |
| FedSimCLR | 86.05 | 75.36 | 49.54 | 27.45 | 91.28 | 84.46 |
| FedCA | 84.15 | 41.25 | 48.57 | 8.13 | 91.67 | 36.93 |
| Proposed | 89.27 | 84.79 | 58.49 | 40.71 | 92.18 | 85.63 |
| MoCo (Centralized) | 88.44 | 81.75 | 57.76 | 37.79 | 92.46 | 86.78 |

Non-IID setting 1:

| Method | CIFAR-10 10% | CIFAR-10 1% | CIFAR-100 10% | CIFAR-100 1% | FMNIST 10% | FMNIST 1% |
|---|---|---|---|---|---|---|
| FedRot | 77.82 | 58.48 | 43.50 | 18.80 | 90.80 | 60.72 |
| FedDC | 71.25 | 31.85 | 40.85 | 11.42 | 86.80 | 37.91 |
| FedSwAV | 78.25 | 39.87 | 46.58 | 14.11 | 88.77 | 35.85 |
| FedSimCLR | 78.49 | 58.13 | 46.89 | 23.86 | 90.41 | 80.72 |
| FedCA | 79.75 | 58.76 | 48.10 | 8.07 | 89.60 | 38.98 |
| Proposed | 84.01 | 67.87 | 54.85 | 31.29 | 91.29 | 82.13 |
| MoCo (Centralized) | 88.44 | 81.75 | 57.76 | 37.79 | 92.46 | 86.78 |

Table 2: Semi-supervised learning under the IID setting (top) and non-IID setting 1 (bottom). We finetune the encoder and classifier with different ratios of labeled data (10% or 1%) and report the top-1 accuracy (%).

6.2 Semi-Supervised Learning

We further evaluate the proposed approaches by semi-supervised learning, where both the encoder and classifier are finetuned with 10% or 1% labeled data after learning the encoder on unlabeled data by different approaches. We evaluate the approaches under the IID collaborative learning setting and two non-IID collaborative learning settings. Table 2 shows the comparison of our results against the baselines under the IID collaborative learning setting (top) and non-IID setting 1 (bottom). Our approach significantly outperforms the self-supervised baselines with 10% and 1% labels. Notably, the proposed methods even outperform the upper bound method centralized MoCo on the CIFAR-10 and CIFAR-100 datasets under the IID setting. Results under non-IID collaborative learning setting 2 are shown in the Appendix for conciseness.

6.3 Collaborative Finetuning

We evaluate the performance of the proposed approaches by collaboratively finetuning the learned encoder with a small amount of locally labeled data on the clients. The results under the IID setting and non-IID setting 1 are shown in Table 3 (top) and Table 3 (bottom), respectively. Under both collaborative learning settings and on all three datasets, the proposed approaches consistently outperform the baselines.
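As a rough illustration of what one round of collaborative finetuning might look like on the server side, the sketch below averages client parameters weighted by the number of locally labeled samples, in the style of FedAvg. This section does not spell out the aggregation rule, so the FedAvg-style weighting, the function name `fedavg`, and the flat parameter dictionaries are all assumptions made for illustration.

```python
from typing import Dict, List


def fedavg(
    client_params: List[Dict[str, List[float]]],
    client_sizes: List[int],
) -> Dict[str, List[float]]:
    """Sample-size-weighted average of per-client parameter vectors.

    client_params[k][name] is the flat parameter vector `name` after client k
    has finetuned locally; client_sizes[k] is its number of labeled samples.
    """
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    aggregated: Dict[str, List[float]] = {}
    for name in client_params[0]:
        dim = len(client_params[0][name])
        aggregated[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(dim)
        ]
    return aggregated


# Two hypothetical clients with 1 and 3 labeled samples, respectively.
clients = [
    {"encoder": [1.0, 2.0], "classifier": [0.0]},
    {"encoder": [3.0, 4.0], "classifier": [4.0]},
]
global_params = fedavg(clients, client_sizes=[1, 3])
print(global_params)
```

In a full round, the server would broadcast `global_params` back to the clients, which would then continue supervised finetuning on their local labeled data.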

6.4 Ablations

Effectiveness of feature fusion and neighborhood matching. We evaluate three approaches. Contrastive learning (CL) is the baseline without feature fusion (FF) or neighborhood matching (NM). CL+FF adds feature fusion, and CL+FF+NM further adds neighborhood matching. We evaluate the approaches by linear evaluation and semi-supervised learning (1% labels) under the non-IID collaborative learning

Proceedings of the Thirty-First International Joint Conference on Artiﬁcial Intelligence (IJCAI-22)

IID setting:

| Method | CIFAR-10, 10% | CIFAR-10, 1% | CIFAR-100, 10% | CIFAR-100, 1% | FMNIST, 10% | FMNIST, 1% |
|---|---|---|---|---|---|---|
| FedRot | 85.16 | 74.25 | 49.97 | 16.65 | 90.49 | 82.81 |
| FedDC | 79.98 | 71.17 | 42.81 | 21.47 | 90.17 | 84.22 |
| FedSwAV | 85.23 | 78.92 | 51.67 | 26.75 | 91.33 | 85.10 |
| FedSimCLR | 83.52 | 75.10 | 51.73 | 15.32 | 91.64 | 84.31 |
| FedCA | 82.32 | 72.77 | 50.78 | 21.10 | 91.57 | 84.29 |
| Proposed | 89.33 | 82.52 | 56.88 | 33.15 | 92.15 | 87.11 |
| FedAvg (Supervised) | 74.71 | 39.35 | 33.16 | 8.07 | 87.95 | 75.68 |

Non-IID setting 1:

| Method | CIFAR-10, 10% | CIFAR-10, 1% | CIFAR-100, 10% | CIFAR-100, 1% | FMNIST, 10% | FMNIST, 1% |
|---|---|---|---|---|---|---|
| FedRot | 57.34 | 56.80 | 46.93 | 17.12 | 75.02 | 72.02 |
| FedDC | 60.37 | 49.28 | 40.29 | 21.20 | 77.12 | 73.16 |
| FedSwAV | 57.34 | 51.93 | 46.65 | 22.06 | 75.45 | 73.17 |
| FedSimCLR | 63.05 | 51.63 | 47.69 | 11.19 | 76.49 | 73.25 |
| FedCA | 59.52 | 57.33 | 49.14 | 21.50 | 73.23 | 71.34 |
| Proposed | 65.80 | 59.30 | 50.75 | 28.25 | 78.81 | 76.88 |
| FedAvg (Supervised) | 48.41 | 32.33 | 33.26 | 8.42 | 67.42 | 66.03 |

Table 3: Collaborative finetuning under IID setting (top) and non-IID setting 1 (bottom). Column headers give the dataset and the labeled ratio. We finetune the encoder and classifier with different ratios of locally labeled data on clients by supervised collaborative learning and report the top-1 accuracy (%).

Figure 4: Ablations on the CIFAR-10 dataset under the non-IID setting, with panels (a) linear evaluation and (b) semi-supervised learning (1% labels). Each panel reports top-1 accuracy (%) for CL, CL+FF, and CL+FF+NM (proposed). CL is vanilla contrastive learning, FF is feature fusion, and NM is neighborhood matching. Error bars denote the standard deviation over three independent runs.

setting (non-IID setting 1). As shown in Figure 4, with linear evaluation, CL achieves 74.96% top-1 accuracy. Adding FF improves the accuracy by 3.68 percentage points, and adding NM further improves it by 0.43 points. With semi-supervised learning (1% labels), CL achieves 64.00% top-1 accuracy. Adding FF improves the accuracy by 2.12 points, and adding NM further improves it by 1.75 points. These results show the effectiveness of both feature fusion and neighborhood matching.
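The ablation figures quoted above can be accumulated into per-variant accuracies. The snippet below is a small worked calculation that treats the reported gains as absolute accuracy points, which is an assumption; it is supported by the fact that the final semi-supervised value equals the Proposed entry for CIFAR-10 with 1% labels in Table 2 (bottom).

```python
# Ablation numbers quoted in the text (CIFAR-10, non-IID setting 1).
# Gains are interpreted as absolute accuracy points (an assumption).
ABLATION = {
    "linear evaluation": {"CL": 74.96, "FF_gain": 3.68, "NM_gain": 0.43},
    "semi-supervised (1% labels)": {"CL": 64.00, "FF_gain": 2.12, "NM_gain": 1.75},
}


def cumulative_accuracy(base, ff_gain, nm_gain):
    """Return (CL, CL+FF, CL+FF+NM) top-1 accuracies in percent."""
    cl_ff = round(base + ff_gain, 2)
    cl_ff_nm = round(cl_ff + nm_gain, 2)
    return base, cl_ff, cl_ff_nm


results = {
    name: cumulative_accuracy(v["CL"], v["FF_gain"], v["NM_gain"])
    for name, v in ABLATION.items()
}
for name, (cl, cl_ff, cl_ff_nm) in results.items():
    print(f"{name}: CL={cl:.2f}  CL+FF={cl_ff:.2f}  CL+FF+NM={cl_ff_nm:.2f}")

# Proposed, CIFAR-10, 1% labels, non-IID setting 1 (Table 2, bottom).
TABLE2_PROPOSED = 67.87
print(results["semi-supervised (1% labels)"][2] == TABLE2_PROPOSED)
```

Under this reading, the full method reaches 79.07% with linear evaluation and 67.87% with semi-supervised learning.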

7 Conclusion

We propose a framework for collaborative contrastive representation learning. To improve representation learning on each client, we propose feature fusion to provide remote features as accurate contrastive data to each client. To achieve unified representations among clients, we propose neighborhood matching to align each client's local features to the remote ones. Experiments show superior accuracy of the proposed framework compared with the state-of-the-art.

Acknowledgments

This work was supported in part by NSF CNS-2122320, CNS212220, CNS-1822099, and CNS-2007302.

References

[Caron et al., 2018] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132-149, 2018.

[Caron et al., 2020] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2020.

[Chen et al., 2020a] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 13-18 Jul 2020.

[Chen et al., 2020b] Yiqiang Chen, Xin Qin, Jindong Wang, Chaohui Yu, and Wen Gao. Fedhealth: A federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems, 2020.

[Gidaris et al., 2018] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018.

[He et al., 2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729-9738, 2020.

[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. NIPS Deep Learning and Representation Learning Workshop, 2015.

[Hsu et al., 2020] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Federated visual classification with real-world data distribution. In ECCV 2020: 16th European Conference, pages 76-92. Springer, 2020.

[Huang et al., 2020] Yangsibo Huang, Zhao Song, Kai Li, and Sanjeev Arora. Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning, pages 4507-4518. PMLR, 2020.

[Jeong et al., 2018] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479, 2018.


[Kairouz et al., 2019] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[Kalantidis et al., 2020] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. In Neural Information Processing Systems (NeurIPS), 2020.

[Krizhevsky et al., 2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[Li et al., 2020] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429-450, 2020.

[Liu et al., 2019] Boyi Liu, Lujia Wang, and Ming Liu. Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems. IEEE Robotics and Automation Letters, 4(4):4555-4562, 2019.

[Luo et al., 2019] Jiahuan Luo, Xueyang Wu, Yun Luo, Anbu Huang, Yunfeng Huang, Yang Liu, and Qiang Yang. Real-world image datasets for federated learning. arXiv preprint arXiv:1910.11089, 2019.

[McMahan et al., 2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273-1282. PMLR, 2017.

[Nguyen et al., 2020] Hung T Nguyen, Vikash Sehwag, Seyyedali Hosseinalipour, Christopher G Brinton, Mung Chiang, and H Vincent Poor. Fast-convergent federated learning. IEEE Journal on Selected Areas in Communications, 39(1):201-218, 2020.

[Reisizadeh et al., 2020] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pages 2021-2031, 2020.

[Wu et al., 2018] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733-3742, 2018.

[Wu et al., 2021a] Yawen Wu, Zhepeng Wang, Dewen Zeng, Yiyu Shi, and Jingtong Hu. Enabling on-device self-supervised contrastive learning with selective data contrast. In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 655-660. IEEE, 2021.

[Wu et al., 2021b] Yawen Wu, Dewen Zeng, Zhepeng Wang, Yiyu Shi, and Jingtong Hu. Federated contrastive learning for volumetric medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 367-377. Springer, 2021.

[Xiao et al., 2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[Zhang et al., 2020] Fengda Zhang, Kun Kuang, Zhaoyang You, Tao Shen, Jun Xiao, Yin Zhang, Chao Wu, Yueting Zhuang, and Xiaolin Li. Federated unsupervised representation learning. arXiv preprint arXiv:2010.08982, 2020.

[Zhao et al., 2018] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
