Private Semi-Supervised Federated Learning

Chenyou Fan^1, Junjie Hu^2, Jianwei Huang^{2,3}
^1 School of Artificial Intelligence, South China Normal University, China
^2 Shenzhen Institute of Artificial Intelligence and Robotics for Society, China
^3 School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
fanchenyou@scnu.edu.cn, {hujunjie, jianweihuang}@cuhk.edu.cn

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22)

Abstract

We study a federated learning (FL) framework to effectively train models from scarce and unevenly distributed labeled data. We consider a challenging yet practical scenario: a few data sources own a small amount of labeled data, while the remaining majority of sources own purely unlabeled data. Classical FL requires each client to have enough labeled data for local training and is therefore not applicable in this scenario. In this work, we design an effective federated semi-supervised learning framework (FedSSL) to fully leverage both labeled and unlabeled data sources. We establish a unified data space across all participating agents, so that each agent can generate mixed data samples to boost semi-supervised learning (SSL) while keeping data locality. We further show that FedSSL can integrate differential privacy protection techniques to prevent labeled data leakage at the cost of minimal performance degradation. On SSL tasks with as little as 0.17% and 1% of the MNIST and CIFAR-10 datasets as labeled data, respectively, our approach achieves a 5-20% performance boost over state-of-the-art methods.

1 Introduction

Recently, federated learning (FL) [McMahan et al., 2017] has received substantial research interest, as it provides a practical way of training machine learning models with distributed data sources while preserving data privacy.
In the FL paradigm, each agent (participant) trains a local learning model with its own data only, while a central server regularly communicates with all agents to produce a better global model through aggregation of the local models. A key feature of FL is that no direct data exchange happens during the learning process, in contrast to centralized training. However, existing FL studies assume either that each agent owns sufficient training data [McMahan et al., 2017], or that the agents collectively have sufficient data for the tasks of interest [Fan and Huang, 2021]. For instance, image classification, the benchmark task of FL [McMahan et al., 2017; Zhao et al., 2018], requires each agent to prepare thousands of labeled training samples to fully train deep-learning-based models such as CNNs.

Figure 1: Overview of a distributed learning system with both labeled and unlabeled data sources, denoted with different colors. Our study establishes a shared global data space to boost SSL across all data sources, while keeping data privacy.

In reality, labeled data is scarce and distributed unevenly over only a few data sources. A typical scenario in mobile computing is that most end users create user data (e.g., tweets), while only a few users have the interest and time to annotate them (e.g., with sentiments). The huge gap between lab scenarios with abundant labeled data and real situations with scarce and unevenly distributed labeled data severely limits the practicality and scalability of FL. It motivates us to consider the following question: how can we make FL effective with scarce and unevenly distributed labeled data as well as abundant unlabeled data?

Semi-supervised learning (SSL) [Chapelle et al., 2006] is an important machine learning topic that aims to leverage unlabeled data to enhance model capacity.
The classical "manifold" assumption of centralized SSL [Zhou and Li, 2010] posits that the data space is composed of multiple lower-dimensional manifolds, and that data points lying on the same manifold should have the same label. A common practice [Sajjadi et al., 2016] to enforce this assumption is to first augment each data instance into multiple variants with label-invariant transformations, then tune the model to classify all variants as the same label. Recent studies [Laine and Aila, 2017; Zhang et al., 2018] further propose soft data-label augmentations that mix/blend both data features and labels. However, it is not clear how to perform effective SSL in the FL context, as data exchange is prohibited.

To tackle this issue, we design an effective federated semi-supervised learning framework (FedSSL) to fully leverage both labeled and unlabeled data sources while keeping data locality as in FL. We propose to learn a global generative model to establish a unified data space across all data sources, enabling each agent to generate labeled data instances for local model training. We jointly optimize the objective of training the local model F to estimate accurate labels of the generated samples and the objective of training a generator G to provide realistic data imputations conditioned on the labels inferred by F. To prevent training divergence with massive unlabeled data, we further regularize the model with self-reconstruction and realism-maximization targets.

In addition, we aim to prevent privacy leakage for the labeled data sources. We design a hybrid training strategy of sequential and parallel training steps on labeled and unlabeled data sources, respectively. This design smoothly integrates the differential privacy (DP) scheme into FedSSL to prohibit excessive access to the labeled data.
We show that our strategy can protect privacy with a strict theoretical guarantee, while causing only minimal degradation in model performance. In conclusion, our contributions include:

- We study a critical but under-explored task of effectively performing semi-supervised learning with distributed and mixed types of data sources. Our learning framework prevents the violation of data locality present in all previous studies of SSL in FL.
- We design a mixed-data generation strategy to utilize both labeled and unlabeled data sources by establishing a unified data space without direct data exchange.
- We are the first to propose a private SSL framework in FL that ensures strict privacy protection for labeled data sources.
- We outperform baselines by 5%-15% on vision and NLP tasks and prevent divergence with extremely scarce data.

2 Related Work

We briefly review recent related work in three categories: 1) semi-supervised learning (SSL), 2) federated learning (FL), and 3) settings involving both.

SSL, e.g., [Chapelle et al., 2006], is an important machine learning topic that aims to utilize unlabeled data to improve task learning. Classical SSL methods include pseudo-labeling [Lee and others, 2013] and entropy minimization [Grandvalet and Bengio, 2004]. Data augmentation approaches, such as MixUp [Zhang et al., 2018], MixMatch [Berthelot et al., 2019], and FixMatch [Sohn et al., 2020], have also been developed and integrated into deep learning models. They interpolate pairs of data and labels with random ratios to augment training data. However, existing approaches only apply to the centralized training paradigm, while our approach effectively applies to distributed data sources.

FL has become a rapidly developing topic in the research community, e.g., [McMahan et al., 2017; Zhao et al., 2018; Li and others, 2019; Fan and Liu, 2020], as it provides a new way of learning models over a collection of distributed devices while keeping data locality.
Recent studies have focused on FL's robustness against non-IID data [Zhao et al., 2018; Sattler et al., 2019], few-shot data [Wu et al., 2020; Fan and Huang, 2021], and differential privacy [Wei and others, 2020; Xin et al., 2020]. However, these FL studies assume that the agents have plenty of labeled training data, while we focus on a practical scenario in which labeled data is scarce.

Recently, several works have made initial attempts to consider SSL in FL settings. Surveys [Jin et al., 2020a; Jin et al., 2020b] discussed applying existing SSL methods in FL, without experimental validation. Zhang et al. [2020] assumed that all labeled data is available at the server. Jeong et al. [2020] assumed that labeled data is available at every client. Itahara et al. [2020] assumed that all unlabeled data is shared across all agents. However, these studies violate the data-locality property of FL. In contrast, we consider the scenario where labeled or unlabeled data is kept at each client, and we prohibit sharing data across the agents.

We briefly review FL and SSL first, then formally formulate the SSL objective in FL, and propose our FedSSL framework.

3.1 Review of Federated Learning

We consider a classical supervised FL system with K distributed agents. Each agent owns a local data source with which it trains a local learning model. A central server coordinates the agents by periodically collecting the local models, fusing them into a global model, and synchronizing it back to all agents for the next round of updates. Formally, let X_k be the data source of client k, n_k be the number of data samples in X_k, and n = Σ_k n_k be the total number of samples across all data sources. We consider a C-class data space D with label space Y = {0, ..., C-1}. Let F be the learning model with parameters w that maps data to the label space, F : D → Y, e.g., a CNN for image classification.
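The server-side fusion step described above can be sketched in a few lines. This is a toy illustration with each model represented as a flat parameter list (the function name and representation are ours, not from the paper), following the FedAvg rule of weighting each local model by its sample count:

```python
def fedavg(local_models, sample_counts):
    """Fuse local models w_k into a global model w = sum_k (n_k / n) * w_k,
    where n_k is client k's sample count and n is the total sample count."""
    n = sum(sample_counts)
    dim = len(local_models[0])
    return [sum(w[d] * nk / n for w, nk in zip(local_models, sample_counts))
            for d in range(dim)]
```

For example, fusing two 2-parameter models with sample counts 1 and 3 weights the second model three times as heavily as the first.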
The global FL target is to minimize the joint training objective L over all data sources:

    min_w L(w) = Σ_{k=1}^{K} (1/K) ℓ_k(X_k; w) ,

in which the local objective ℓ_k is the task-specific local training objective, e.g., the cross-entropy loss for classification. A widely used FL strategy called FedAvg [McMahan et al., 2017] fuses the global model w as the weighted average of the local models w_k, such that w = Σ_{k=1}^{K} (n_k / n) w_k.

3.2 SSL Objective in FL

We consider a distributed SSL scenario where each learning agent owns one of two possible types of data sources, as shown in Figure 1. The first type of source owns a few labeled data instances (with no unlabeled data), and we call agents with such sources support agents (S-agents). The other type owns only unlabeled data instances, and we call agents with such sources query agents (Q-agents). The mission of our study is to improve the collective capacity of both Q-agents and S-agents through SSL in the FL paradigm.

We consider a system of N_S S-agents and N_Q Q-agents. We denote the data collection of all S-agents as S = {X_1, X_2, ..., X_{N_S}}, with each labeled source X_i = {(x_t, p_t)}_{t=1}^{|X_i|}. Here we use p_t ∈ R^C as the one-hot encoding of the data label y_t. We denote the data collection of all Q-agents as Q = {U_1, U_2, ..., U_{N_Q}}, with each unlabeled data source U_j = {u_t}_{t=1}^{|U_j|}. The global FL target is the joint of all local objectives on all S- and Q-agents as follows:

    min_w L(w) = (1/N_S) Σ_{i=1}^{N_S} ℓ_s(X_i; w) + (1/N_Q) Σ_{j=1}^{N_Q} ℓ_q(U_j; w) ,    (1)

in which ℓ_s and ℓ_q are the task-specific loss functions on S-agents and Q-agents that we explore later.

3.3 Local MixUp (L-MixUp) Operations

We introduce the local MixUp (L-MixUp) [Zhang et al., 2018; Berthelot et al., 2019] of classical SSL, to motivate our design of the global MixUp (G-MixUp) in the next section.

Pseudo Label.
At Q-agent j with unlabeled data U_j, we can guess pseudo labels given a trained model w. Following [Berthelot et al., 2019], we augment a data instance u_t ∈ U_j into K variants {u_t^k}_{k=1}^{K} with label-invariant operations, e.g., image rotation and cropping, and compute their mean probabilistic prediction p̄_t = (1/K) Σ_{k=1}^{K} F(u_t^k; w) ∈ R^C. We can estimate the pseudo label by applying label sharpening [Berthelot et al., 2019] to create a low-entropy (sharp) label distribution p̂_t, with the likelihood of each category c given by

    p̂_t^(c) = Sharpen(p̄_t, Z)_c = (p̄_t^(c))^Z / Σ_{v=1}^{C} (p̄_t^(v))^Z ,    (2)

in which Z > 1 amplifies the dominating classes.

Local MixUp (L-MixUp) Operations. We now describe L-MixUp in detail, which directly applies MixUp at each S- and Q-agent locally and individually to synthesize mixed data features and labels for local training. Formally, for two data features x1 and x2 with label (or pseudo-label) distributions p1 and p2, L-MixUp produces

    λ ~ Beta(α, α), α ∈ (0, 1) ,
    x' = λ x1 + (1 - λ) x2 ,    (3)
    p' = λ p1 + (1 - λ) p2 ,

where (x', p') is the synthesized data and α is the mixup hyper-parameter that controls the strength of interpolation between the two inputs. L-MixUp augments the local data set and regularizes the model towards linearly-behaving boundaries between classes, thus making the model predict more accurately on unlabeled data [Zhang et al., 2018].

3.4 Global MixUp (G-MixUp) Operations

L-MixUp is limited in that it can only operate locally, as data exchange among agents is prohibited in FL. This leads to inferior performance in our challenging SSL setting of scarce and unevenly distributed labeled data. To tackle this issue, we propose the global MixUp (G-MixUp), which operates across the agents to allow data imputation without data exchange. G-MixUp utilizes a generative learning scheme to construct a global data space to generate and mix data of arbitrary classes.
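Before moving to the G-MixUp details, the two local building blocks of Section 3.3, label sharpening (Eq. 2) and L-MixUp (Eq. 3), can be sketched as follows. This is a minimal pure-Python sketch; the function names and the default α value are illustrative, not from the paper:

```python
import random

def sharpen(p, Z=2.0):
    """Label sharpening (Eq. 2): raise each class probability to the power Z
    (Z > 1 amplifies the dominating classes) and renormalize."""
    powered = [pc ** Z for pc in p]
    total = sum(powered)
    return [pc / total for pc in powered]

def local_mixup(x1, p1, x2, p2, alpha=0.75):
    """L-MixUp (Eq. 3): blend two (feature, label) pairs with lam ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    p = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
    return x, p
```

Sharpening keeps the distribution normalized while pushing mass toward the dominant class; the mixed label p' stays a valid distribution because it is a convex combination of two distributions.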
We design a conditional generator G to synthesize a mixed data sample, conditioned on a local data-label pair as well as an additional data class sampled from the global label space. We discuss the details for S-agents and Q-agents individually.

G-MixUp at an S-agent. Let the conditional generator G accept a labeled data instance with true label (x1, p1), a randomly sampled class c2 in one-hot form p2, a blending factor λ as in (3), and a noise vector z ∈ R^{D_n}. The goal of G is to produce a synthesized data sample x̂ ← G((x1, p1), p2, z, λ) with predicted label p̂ = F(x̂). Consistent with L-MixUp, the generated x̂ is supposed to appear λ-likely as x1's class and (1-λ)-likely as class c2, i.e., p' = λ p1 + (1-λ) p2, which we can evaluate with the cross-entropy loss. Formally, the joint training objective for F and G is

    min_{F,G} ℓ_S = CE(p̂, p') = -Σ_{c=1}^{C} p'^(c) log(p̂^(c))
    s.t. p1, p2 ∈ R^C, z ~ U(0, 1)^{D_n}, x1 ∈ X_i ,    (4)
         x̂ ← G((x1, p1), p2, z, λ), λ ~ Beta(α, α) ,
         p̂ ← F(x̂) , p' ← λ p1 + (1 - λ) p2 .

A better G produces more realistic imputations, and a better F produces more accurate estimates of the generated data labels. Hereby F and G improve each other towards optimality. The global G can establish a global data space to facilitate effective global MixUp, thus improving SSL capacity.

G-MixUp at a Q-agent. Next, we define G-MixUp at a Q-agent with unlabeled data. We first sample an unlabeled data instance u1 and produce its pseudo label q1 with data augmentation and sharpening as in (2). To perform mixed-data generation, we then sample a new class c2 with one-hot form p2, draw a noise vector z, and train G to produce a synthesized x̂ ← G((u1, q1), p2, z, λ) with predicted label p̂ = F(x̂). Ideally, the synthesized data x̂ is expected to appear λ-likely as q1 and (1-λ)-likely as p2, i.e., p' = λ q1 + (1-λ) p2, which we can evaluate with the cross-entropy loss. Formally, the joint training objective for F and G is

    min_{F,G} ℓ_Q = CE(p̂, p') = -Σ_{c=1}^{C} p'^(c) log(p̂^(c))
    s.t.
    p2 ∈ R^C, z ~ U(0, 1)^{D_n}, u1 ∈ U_j ,
         q1 = Sharpen(F(Aug(u1)), Z) ,    (5)
         x̂ ← G((u1, q1), p2, z, λ), λ ~ Beta(α, α) ,
         p̂ ← F(x̂) , p' ← λ q1 + (1 - λ) p2 .

We further regularize model training by ensuring the realism of generated samples and the faithful reconstruction of real samples.

Figure 2: Model design for model G.

Algorithm 1: FedSSL algorithm.
Input: A set of N_S support agents and N_Q query agents.
Output: A global model F for the SSL task.
 1 Server executes: Initialize global F, G, D; let t ← 1
 2 while t ≤ maximum rounds T do
 3   for each client i in support set S in parallel do
 4     [D, G]_i^t ← G-Reg([D, G]^t, X_i)
 5     [F, G]_i^t ← G-MixUp([F, G]^t, X_i)
 6   for each client j in query set Q in parallel do
 7     [D, G]_j^t ← G-Reg([D, G]^t, U_j)
 8     [F, G]_j^t ← G-MixUp([F, G]^t, U_j)
 9   [F, D, G]^{t+1} ← FedAvg({[F, D, G]_k^t : k ∈ S ∪ Q})
10   The server sends [F, D, G]^{t+1} back to the clients

Realistic loss. Equations (4) and (5) imply that if we set the blending factor λ = 0, G should generate a data sample 100% of class c2. We can utilize a conditional discriminator D to encourage realism of the generated image given its class. We extend to the general case λ ∈ [0, 1], encouraging the realism of G's output weighted by a factor g(λ) that depends on the blending factor λ. We design the dynamic realistic loss as

    min_G max_D ℓ_real = log D(x1, p1) + g(λ) log(1 - D(x̂, p2))
    s.t. p1, p2 ∈ R^C, z ~ U(0, 1)^{D_n}, x1 ∈ X_{c1} ,    (6)
         x̂ ← G((x1, p1), p2, z, λ), λ ∈ [0, 1] ,
         g(λ) := e^{max{λ, 1-λ} - 1} ,

in which max{λ, 1-λ} - 1 ∈ [-0.5, 0] and thus g(λ) ∈ [0.61, 1]. When blending from a single source, either a sampled image or noise (λ = 0 or 1), we encourage more visual realism with the large weight g(λ) = 1; when blending evenly from two sources (λ = 0.5), we tolerate unrealism with the smaller weight g(λ) ≈ 0.61. We adopt the training process of GANs [Goodfellow et al., 2014] to update D and G alternately.

Reconstruction loss.
When setting the blending factor λ = 1, Equations (4) and (5) imply that G should reproduce the sampled x1 (or u1). Thus we design an encoder-decoder structure for G to reconstruct the input, and we regularize its training with a reconstruction loss term such that

    min_G ℓ_rec = ||x̂ - x1||_2^2  s.t.  x̂ ← G((x1, p1), c2, z, λ = 1) .    (7)

FedSSL Algorithm Details. We show our proposed FedSSL with G-MixUp in Algorithm 1 and summarize it as follows. At the local training stage of FL, we adopt a two-step optimization procedure: 1) regularize the model by training D and G alternately to minimize ℓ_reg = ℓ_real + ℓ_rec, which we denote as G-Reg (lines 4, 7); 2) train F and G with G-MixUp to produce mixed data samples and improve accuracy (lines 5, 8). By federating the models (line 9), we unify the G-MixUp operations on S- and Q-agents to leverage both true labels and pseudo labels to enhance SSL, and establish a unified global data space over all agents to facilitate data augmentation over the entire data space with arbitrary classes, thus boosting SSL.

4 Privacy

In the realistic setting we consider, the small labeled dataset could be overly accessed by G during G-MixUp, which poses risks of information leakage (as the global G could reconstruct labeled samples at arbitrary agents). In the spirit of FL's privacy protection, it is critical to ensure privacy for the G-MixUp operations. In this section, we introduce (ϵ, δ)-DP to our framework and provide a practical algorithm that integrates a noise injection mechanism into FedSSL to provide a strict privacy guarantee for labeled data sources.

Algorithm 2: Pseudo code for FedSSL-DP.
 1 Let ϵ_tot ← 0, δ ← 10^-5.
 2 for each client i in support set S in sequence do
 3   if ϵ_tot > ϵ then
 4     Out of privacy budget, break;
 6   Perform DP update on G;
 7   ϵ_tot ← ϵ_tot + α_M(σ, δ);
 8   Perform normal update on D, F;
 9 for each client j in query set Q in parallel do
10   Perform normal update on D, F, G;

Definition 1 ((ϵ, δ)-DP [Dwork and Roth, 2014]).
A randomized function M : D → R, with domain D and range R, satisfies (ϵ, δ)-DP if for any two adjacent databases d, d' ∈ D and for any subset of outputs S ⊆ R,

    P[M(d) ∈ S] ≤ e^ϵ P[M(d') ∈ S] + δ .

A standard way to guarantee (ϵ, δ)-DP is to integrate a Gaussian Mechanism (GM) [Dwork and Roth, 2014] into the model learning process by adding Gaussian noise N(0, σ²). The Stochastic GM (SGM) [Abadi et al., 2016] was developed to work with SGD for training deep-learning models. SGM applies GM on sampled batches with a sampling ratio γ, and T update steps of SGM imply (O(√T γ ϵ), δ)-DP by using the Moments Accountant (MA) [Abadi et al., 2016] to track the overall privacy budget.

We propose a practical algorithm called FedSSL-DP that seamlessly integrates SGM and MA into FedSSL to ensure (ϵ, δ)-DP, with pseudo code shown in Algorithm 2. Overall, we design a hybrid training strategy to train on S-agents and Q-agents efficiently. Specifically, we first train on the S-agents with DP sequentially (line 2), i.e., train on S-agent i, then pass the updated model to S-agent i+1. After looping over all S-agents, we perform standard FL in parallel on the Q-agents (line 9).

We first fix the tolerance term δ at a small value, e.g., 10^-5; we then choose a target privacy budget upper bound ϵ, e.g., 4, 8, or 16, and initialize the accumulator ϵ_tot (line 1). We then perform a DP update on G with SGM, calculate the privacy loss with the MA term α_M(σ, δ), and accumulate it into ϵ_tot (lines 6-8). Once ϵ_tot reaches the budget ϵ, we stop accessing the labeled sources (lines 3-4).

Lemma 1. Algorithm 2 guarantees (ϵ, δ)-DP for labeled data sources.

Proof. Algorithm 2 guarantees that the cumulative privacy loss for updating G during the sequential S-agent updates satisfies ϵ_tot ≤ ϵ. Moreover, the parallel Q-agent updates do not access the labeled data on S-agents. Therefore, G is (ϵ, δ)-DP with respect to the labeled data sources.
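The budget-tracking logic of Algorithm 2 can be sketched as a simple accumulation loop. This is a toy sketch in which `per_round_cost` stands in for the moments-accountant term α_M(σ, δ), and the function and its arguments are illustrative rather than the paper's implementation:

```python
def dp_training_schedule(epsilon_budget, per_round_cost, max_rounds):
    """Mirror Algorithm 2's budget check: before each DP update on G, stop
    if the accumulated privacy loss already exceeds the budget epsilon."""
    eps_tot = 0.0
    dp_updates = 0
    for _ in range(max_rounds):
        if eps_tot > epsilon_budget:
            break  # out of privacy budget: stop accessing labeled S-agent data
        dp_updates += 1            # perform one DP update on G (omitted here)
        eps_tot += per_round_cost  # accumulate the cost alpha_M(sigma, delta)
    return dp_updates, eps_tot
```

With a budget of ϵ = 4 and a per-update cost of 1.5, three DP updates run before the check stops further access to the labeled sources.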
5 Experiments and Discussions

We describe the datasets, parameter choices, and models we experiment on, then analyze performance on visual and textual tasks with ablation studies and visualizations.

5.1 Datasets and Splits

Following recent works [Berthelot et al., 2019; Zhu et al., 2020; Chen et al., 2018] on evaluating FL and SSL, we use three widely adopted benchmark datasets.

CIFAR-10 [Krizhevsky, 2009] is a common image recognition dataset with 50,000 data instances of 10 categories (such as birds, cars, and horses). We try 3 settings with increasing difficulty, holding out 5,000 (10%), 2,500 (5%), and 500 (1%) instances as labeled, respectively, and keeping the rest as unlabeled.

MNIST [LeCun et al., 1998] is a digit recognition dataset with 60,000 data instances of 10 digit classes. We try 3 settings with increasing difficulty, holding out about 300 (0.5%), 150 (0.25%), and 100 (0.17%) of the total data as labeled instances, respectively, and using the rest as unlabeled.

Sent140 [Caldas et al., 2018] is an FL benchmark for sentiment analysis as a 2-way classification task (positive and negative). We sample 60,000 sentences and try 3 settings with increasing difficulty, holding out 3,000 (5%), 600 (1%), and 300 (0.5%) sentences as labeled, with the rest unlabeled. We tokenize each sentence to a maximum of 40 words.

5.2 FL Device Numbers

We try different numbers of S-agents (labeled sources) and Q-agents (unlabeled sources), denoted N_S and N_Q, respectively. We examine two settings, (N_S = 2, N_Q = 6) and (N_S = 3, N_Q = 9), representing scenarios with few S-agents and more Q-agents. We also try extremely challenging cases, e.g., 100 (0.17%) MNIST samples distributed over 3 S-agents so that each has about 34 labeled samples, to check whether FedSSL is robust to extreme data scarcity.
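The device settings above can be mimicked with a simple strided split of the labeled pool over S-agents and the unlabeled pool over Q-agents (a hypothetical helper for illustration, not the paper's code):

```python
def split_sources(labeled, unlabeled, n_s, n_q):
    """Distribute labeled samples over n_s S-agents and unlabeled samples
    over n_q Q-agents by striding through each pool."""
    s_sources = [labeled[i::n_s] for i in range(n_s)]
    q_sources = [unlabeled[j::n_q] for j in range(n_q)]
    return s_sources, q_sources
```

Splitting 100 labeled MNIST samples over 3 S-agents gives sources of sizes 34, 33, and 33, matching the roughly 34 labels per agent mentioned above.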
5.3 Details of Our Methods and Baselines

We compare our proposed methods (FedSSL and FedSSL-DP) with the most related baselines as follows.

FedSSL is our method, which learns a global data space to generate mixed samples of arbitrary classes to better augment local training data, as in Alg. 1. FedSSL-DP additionally integrates the Gaussian Mechanism (GM) into FedSSL to ensure DP of labeled data sources, as in Alg. 2.

Baselines. Supervise performs FL only on S-agents with labeled data; Q-agents (with unlabeled data) are not used. Pseudo performs additional SSL with pseudo labels (Sec. 3.3) at Q-agents on top of the Supervise approach. MixMatch [Berthelot et al., 2019] is the commonly adopted centralized SSL approach, described as L-MixUp in Section 3.3; we apply MixMatch on all agents to utilize both labeled and unlabeled data as a fair baseline. FixMatch [Sohn et al., 2020] is an augmentation-based extension of MixMatch; its FL version FedMatch [Jeong et al., 2020] performs both L-MixUp and divergence minimization across all agents, forming the strongest baseline.

5.4 Experimental Results

Results on CIFAR-10.

Method \ Setting      N_S = 2, N_Q = 6        N_S = 3, N_Q = 9
                      10%    5%     1%        10%    5%     1%
Supervise             0.754  0.654  0.344     0.759  0.633  0.302
Pseudo                0.792  0.727  0.549     0.782  0.743  0.537
MixMatch              0.822  0.769  0.558     0.811  0.757  0.548
FedMatch              0.803  0.747  0.568     0.785  0.753  0.547
FedSSL-DP (ours)      0.848  0.793  0.634     0.839  0.777  0.628
FedSSL (ours)         0.855  0.801  0.661     0.854  0.787  0.653

Table 1: CIFAR-10 results of 10-way classification accuracy.

We evaluate on CIFAR-10 with randomly sampled 10%, 5%, and 1% of the total training data as labeled data, which accounts for roughly 5,000, 2,500, and 500 training samples, respectively. We uniformly distribute the labeled data to S-agents and the unlabeled data to Q-agents. Table 1 summarizes the results for 8 and 12 devices: FedSSL consistently performs best in all settings of labeled-data portion and client number.
For the 10% labeled data setting, FedSSL achieves the best accuracies of 0.855 and 0.854, outperforming the best baseline (MixMatch or FedMatch) by 4.2%-5.3% relatively. For 1% labeled data, the relative performance gain reaches 14.1%-19.2%. This shows the effectiveness of FedSSL, especially in extreme data-scarce conditions.

Baselines are ineffective. Compared with Supervise, pseudo labels improve performance by 3%, 6%, and 15% in absolute value for 10%, 5%, and 1% labeled data, respectively. However, L-MixUp (FedMatch and MixMatch) only further increases performance by less than 3% in absolute value, indicating the ineffectiveness of conventional SSL techniques under unevenly distributed data labels.

FedSSL-DP achieves comparable performance with FedSSL. For the 10% and 5% labeled data cases, the performance gap is less than 2%. The biggest differences occur for 8 and 12 devices with 1% labeled data, at about 2.5%-2.8% in absolute value. DP affects model accuracy more with fewer labels and larger device numbers. FedSSL-DP still outperforms the baselines significantly, indicating that our approach is effective in distributed learning while preserving the privacy of the labeled sources.

Results on MNIST.

Method \ Setting      N_S = 2, N_Q = 6          N_S = 3, N_Q = 9
                      0.5%   0.25%  0.17%       0.5%   0.25%  0.17%
Supervise             0.951  0.889  -           0.951  -      -
Pseudo                0.966  0.939  -           0.969  -      -
MixMatch              0.979  0.957  -           0.977  -      -
FedMatch              0.985  0.962  -           0.981  -      -
FedSSL-DP (ours)      0.981  0.965  0.950       0.986  0.972  0.949
FedSSL (ours)         0.988  0.970  0.976       0.987  0.975  0.969

Table 2: MNIST results of 10-way classification accuracy.

We evaluate our methods on MNIST with 0.5%, 0.25%, and 0.17% of the total data as labeled data, which accounts for just 300, 150, and 100 training samples, respectively. We show the results in Table 2.

The baselines suffer from model divergence.
In the extreme cases of training with 0.25% and 0.17% labeled data on 8 (N_S = 2, N_Q = 6) and 12 (N_S = 3, N_Q = 9) devices, all baselines fail to converge and predict randomly (denoted as "-"). Due to the lack of a unified data space, each local model overfits to its local data, so the federated global model collapses.

FedSSL and FedSSL-DP prevent divergence. In contrast, our proposed FedSSL achieves reasonable accuracy of around 0.97. Thanks to the global data space, FedSSL augments local data and prevents overfitting to scarce labels (S-agents) and incorrect pseudo labels (Q-agents). FedSSL-DP achieves similar accuracy to FedSSL, with a performance gap generally under 1% in absolute value. Even in the extreme data-scarce (0.17%) case, the gap is below 3%, indicating the usability of FedSSL-DP.

Results on Text Classification. In Table 3, we show results on the Sent140 dataset with 5%, 1%, and 0.5% of the total tweets carrying sentiment labels, which accounts for roughly 3,000, 600, and 300 sentences, respectively. We implement the blending operation of two sentences as a simple weighted sum of the two words' BERT embeddings at the same positions. Table 3 shows the binary classification results. FedSSL consistently outperforms the baselines, leading the next-best FedMatch by 1.8%-5% relatively, and the weakest Supervise by 7%-12% relatively.

Setting           N_S = 1, N_Q = 3         N_S = 2, N_Q = 6
                  5%     1%     0.5%       5%     1%     0.5%
Supervise         0.700  0.648  0.628      0.690  0.639  0.622
Pseudo            0.733  0.672  0.660      0.721  0.671  0.638
FedMatch          0.741  0.691  0.670      0.724  0.683  0.661
FedSSL (ours)     0.754  0.722  0.699      0.749  0.703  0.689

Table 3: Sent140 results of 2-way classification accuracy.

5.5 Ablation Studies

Non-IID data partition. We consider a more challenging setting with a non-IID partition of data classes. We adopt a round-robin strategy of distributing non-overlapping MNIST digit classes to each agent.
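The round-robin class assignment can be sketched as follows (a hypothetical helper; the paper does not provide code):

```python
def round_robin_partition(classes, n_agents):
    """Assign class labels to agents round-robin, so the agents receive
    non-overlapping subsets of classes."""
    parts = [[] for _ in range(n_agents)]
    for idx, c in enumerate(classes):
        parts[idx % n_agents].append(c)
    return parts
```

For the ten digit classes and three S-agents, this yields the class subsets [0, 3, 6, 9], [1, 4, 7], and [2, 5, 8].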
As an example of (N_S = 3, N_Q = 9), we first distribute all labeled instances of digit classes [0, 3, 6, 9], [1, 4, 7], and [2, 5, 8] to the 3 S-agents, respectively; we then distribute all unlabeled instances of [0, 9], [1], [2], ..., [8] to the 9 Q-agents, respectively. This creates non-overlapping partitions of digit classes within the S-agent group and the Q-agent group, making local training harder.

Setting      N_S = 1/2/3, N_Q = 3/6/9
             0.25%                   0.17%
Supervise    0.854 / 0.868 / -       0.798 / - / -
Pseudo       0.907 / 0.917 / -       0.849 / - / -
MixMatch     0.927 / 0.925 / -       0.864 / - / -
FedMatch     0.932 / 0.927 / -       0.869 / - / -
FedSSL       0.964 / 0.949 / 0.665   0.952 / 0.801 / 0.551

Table 4: MNIST results in non-IID settings.

We examine the performance of the baselines and our models in this difficult situation and observe in Table 4 that the non-IID settings cause performance drops and divergence, especially for scarce labeled data (0.25% and 0.17%) with large device numbers (8 and 12). Only FedSSL performs reasonably well in the extreme settings (e.g., 0.17% labeled data), while all other baselines diverge due to collapsed local training on partial data classes.

General setting of partially labeled data. FedSSL readily extends to a general setting in which each client has both labeled and unlabeled data, as our proposed G-MixUp can flexibly sample data pairs with true labels as in Eq. (4) and/or with pseudo labels as in Eq. (5). We allocate 10% of the CIFAR-10 images as labeled and the rest as unlabeled, uniformly over 4 clients. FedSSL outperforms the baseline FedMatch by 5.9% (85.1% vs. 79.2%), as our G-MixUp can perform both local and global data imputation to better train the unified global model, while FedMatch can only perform local mixup.

# labeled    FedSSL (Tab. 4)    no ℓ_rec    no ℓ_real    w/o both
0.25%        96.4%              -0.83%      -0.93%       -1.8%
0.17%        95.2%              -1.3%       -2.8%        -29.3%

Table 5: Ablation study of the reconstruction and realistic losses.

Effects of ℓ_real (Eq. 6) and ℓ_rec (Eq. 7).
We evaluate FedSSL on non-IID MNIST with 3 ablation settings: no realistic loss ℓ_real, no reconstruction loss ℓ_rec, and without both. We find that both ℓ_rec and ℓ_real provide critical regularization, especially in the extreme data-scarce scenario (0.17%): removing both yields a 29.3% drop in accuracy.

6 Conclusion

We proposed a unified framework that makes FL effective in challenging SSL scenarios. We designed a generative learning strategy to establish a global data space across the agents while preserving data privacy with a theoretical guarantee. Our approach outperforms the baselines significantly and works robustly in extreme data-scarce and non-IID cases.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC 62106156), the Shenzhen Science and Technology Program (Project JCYJ20210324120011032), the Guangdong Basic and Applied Basic Research Foundation (Project 2021B1515120008), and the Shenzhen Institute of Artificial Intelligence and Robotics for Society.

References

[Abadi et al., 2016] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In SIGSAC, 2016.
[Berthelot et al., 2019] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
[Caldas et al., 2018] Sebastian Caldas, Peter Wu, Tian Li, Jakub Konecný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
[Chapelle et al., 2006] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. 2006.
[Chen et al., 2018] Fei Chen et al. Federated meta-learning with fast convergence and efficient communication. arXiv preprint arXiv:1802.07876, 2018.
[Dwork and Roth, 2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211-407, 2014.
[Fan and Huang, 2021] Chenyou Fan and Jianwei Huang. Federated few-shot learning with adversarial learning. arXiv preprint arXiv:2104.00365, 2021.
[Fan and Liu, 2020] Chenyou Fan and Ping Liu. Federated generative adversarial learning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2020.
[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[Grandvalet and Bengio, 2004] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.
[Itahara et al., 2020] Sohei Itahara, Takayuki Nishio, Yusuke Koda, Masahiro Morikura, and Koji Yamamoto. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. arXiv preprint arXiv:2008.06180, 2020.
[Jeong et al., 2020] Wonyong Jeong, Jaehong Yoon, Eunho Yang, and Sung Ju Hwang. Federated semi-supervised learning with inter-client consistency. arXiv preprint arXiv:2006.12097, 2020.
[Jin et al., 2020a] Yilun Jin, Xiguang Wei, Yang Liu, and Qiang Yang. A survey towards federated semi-supervised learning. arXiv preprint arXiv:2002.11545, 2020.
[Jin et al., 2020b] Yilun Jin, Xiguang Wei, Yang Liu, and Qiang Yang. Towards utilizing unlabeled data in federated learning: A survey and prospective. arXiv preprint arXiv:2002.11545, 2020.
[Krizhevsky, 2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
[Laine and Aila, 2017] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[Lee et al., 2013] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop, 2013.
[Li et al., 2019] Tian Li et al. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.
[McMahan et al., 2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
[Sajjadi et al., 2016] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS, 2016.
[Sattler et al., 2019] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-IID data. IEEE TNNLS, 2019.
[Sohn et al., 2020] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A. Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
[Wei et al., 2020] Kang Wei et al. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 2020.
[Wu et al., 2020] Qiong Wu, Kaiwen He, and Xu Chen. Personalized federated learning for intelligent IoT applications: A cloud-edge based framework. IEEE Computer Graphics and Applications, 2020.
[Xin et al., 2020] Bangzhou Xin, Wei Yang, Yangyang Geng, Sheng Chen, Shaowei Wang, and Liusheng Huang. Private FL-GAN: Differential privacy synthetic data generation based on federated learning. In ICASSP, 2020.
[Zhang et al., 2018] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[Zhang et al., 2020] Zhengming Zhang, Yaoqing Yang, Zhewei Yao, Yujun Yan, Joseph E. Gonzalez, and Michael W. Mahoney. Improving semi-supervised federated learning by reducing the gradient diversity of models. arXiv preprint arXiv:2008.11364, 2020.
[Zhao et al., 2018] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.
[Zhou and Li, 2010] Zhi-Hua Zhou and Ming Li. Semi-supervised learning by disagreement. Knowledge and Information Systems, 2010.
[Zhu et al., 2020] Jianchao Zhu, Liangliang Shi, Junchi Yan, and Hongyuan Zha. AutoMix: Mixup networks for sample interpolation via cooperative barycenter learning. In ECCV, 2020.