
Learning with Partial Labels from Semi-supervised Perspective

Ximing Li1,2, Yuanzhi Jiang1,2, Changchun Li1,2,*, Yiyuan Wang3,4, Jihong Ouyang1,2

1 College of Computer Science and Technology, Jilin University, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University, China
3 College of Information Science and Technology, Northeast Normal University, China
4 Key Laboratory of Applied Statistics of MOE, Northeast Normal University, China
liximing86@gmail.com, yzjiang20@mails.jlu.edu.cn, changchunli93@gmail.com, wangyy912@nenu.edu.cn, ouyj@jlu.edu.cn

Partial Label (PL) learning refers to the task of learning from partially labeled data, where each training instance is ambiguously equipped with a set of candidate labels but only one is valid. Advances in the recent deep PL learning literature have shown that deep learning paradigms, e.g., self-training, contrastive learning, or class activation values, can achieve promising performance. Inspired by the impressive success of deep Semi-Supervised (SS) learning, we transform the PL learning problem into an SS learning problem, and propose a novel PL learning method, namely Partial Label learning with Semi-supervised Perspective (PLSP). Specifically, we first form the pseudo-labeled dataset by selecting a small number of reliable pseudo-labeled instances with high-confidence prediction scores and treating the remaining instances as pseudo-unlabeled ones. Then we design an SS learning objective, consisting of a supervised loss for pseudo-labeled instances and a semantic consistency regularization for pseudo-unlabeled instances. We further introduce a complementary regularization for the non-candidate labels to constrain the model predictions on them to be as small as possible. Empirical results demonstrate that PLSP significantly outperforms the existing PL baseline methods, especially at high ambiguity levels. Code available: https://github.com/changchunli/PLSP.

Introduction

During the past decades, modern deep neural networks have achieved great success in various domains such as computer vision and natural language processing. Commonly, they are built on the paradigm of supervised learning, which often requires massive training instances with precise labels. However, in many real-world scenarios, high-quality training instances are hard to collect, because human annotation is costly and subject to label ambiguity and noise, potentially resulting in training data with various forms of noisy supervision (Li, Socher, and Hoi 2020; Li et al. 2022). Among them, one prevalent form is partially labeled data, where each training instance is equipped with a set of candidate labels but only one is valid (Cour, Sapp, and Taskar 2011). As illustrated in Fig.1, for a human annotator it could be difficult to correctly distinguish Alaskan Malamute and Huskie, so she/he

*Corresponding author. Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: An example of PL instances. An Alaskan Malamute is in this image, but annotators may also tag it with Huskie (in the example, Annotator 1 tags it as Huskie and Annotator 2 tags it as Alaskan Malamute).

may tend to retain both of them as candidate labels. Due to the prevalence of such noisy data in applications, e.g., web mining (Luo and Orabona 2010), multimedia context analysis (Zeng et al. 2013), and image classification (Chen, Patel, and Chellappa 2018), the paradigm of learning from partial labels, formally dubbed Partial Label (PL) learning, has recently attracted more attention from the machine learning community (Feng and An 2019b; Feng et al. 2020; Lv et al. 2020; Li, Li, and Ouyang 2020; Li et al. 2021; Wang et al. 2022a; Wu, Wang, and Zhang 2022).

Naturally, the main challenge of PL learning lies in the ambiguity of partial labels, because the ground-truth label is unknown and cannot be directly accessed by the learning method. Accordingly, the mainstream of PL learning methods concentrates on recovering precise supervised signals from the ambiguous candidate labels. Some two-stage methods refine the candidate labels by label propagation among the nearest neighbors of each instance (Zhang and Yu 2015; Zhang, Zhou, and Liu 2016; Xu, Lv, and Geng 2019); and most PL learning methods jointly train the classifier with the refined labels and refine the candidate labels with the classifier predictions (Wu and Zhang 2018; Zhang, Zhou, and Liu 2019; Feng and An 2019a; Li, Li, and Ouyang 2020; Ni et al. 2021). Besides them, some deep PL learning methods employ discriminators to recover precise supervision from the candidate labels under the frameworks of GAN (Zhang et al. 2020) and Triple-GAN (Li et al. 2021). Despite promising performance, the refined labels of these PL learning methods can still be ambiguous and inaccurate for most training instances, resulting in potential performance degradation.

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

In parallel with PL learning, Semi-Supervised (SS) learning has recently achieved great progress with strong deep neural backbones (Berthelot et al. 2019, 2020; Xie et al. 2020; Sohn et al. 2020; Zhang et al. 2021; Li, Li, and Ouyang 2021). The recent deep SS learning methods are mainly based on consistency regularization, with the assumption that the classifier tends to give consistent predictions on augmented variants of the same instance. Inspired by them, we revisit the problem that the refined labels of PL learning methods may still be ambiguous for most training instances, and raise the following question: can we efficiently select a small number of reliable instances with high-confidence pseudo-labels, and then resolve the PL learning task with strong SS learning techniques?

Motivated by this question, we develop a novel PL learning method, namely Partial Label learning with Semi-supervised Perspective (PLSP). Our PLSP consists of two stages. In the first stage, we efficiently pre-train the classifier by treating all candidate labels as equally important, and then select a small number of reliable pseudo-labeled instances with high-confidence predictions and treat the remaining instances as pseudo-unlabeled instances. In the second stage, we formulate an SS learning objective to induce the classifier over the pseudo-training dataset. To be specific, we incorporate a consistency regularization with respect to the weakly- and strongly-augmented instances, and draw semantic-transformed features for them to further achieve consistency at the semantic level. To efficiently optimize the objective with semantic-transformed features, we derive an approximation of its expectation form.
We conduct extensive experiments to evaluate the effectiveness of PLSP on benchmark datasets. Empirical results demonstrate that PLSP is superior to the existing PL learning baseline methods, especially on high ambiguity levels.

Related Work

Partial Label Learning

There are many PL learning studies based on shallow frameworks. The early disambiguation-free methods efficiently induce the classifiers by treating all candidate labels as equally important (Cour et al. 2009; Cour, Sapp, and Taskar 2011; Zhang, Yu, and Tang 2017); in PLSP we also adopt this idea to pre-train the classifier and initialize the pseudo-training dataset. Beyond them, the disambiguation methods aim to induce stronger classifiers by refining precise supervision from candidate labels (Wu and Zhang 2018; Feng and An 2018). Some two-stage methods first refine the candidate labels by label propagation among the nearest neighbors of each instance (Zhang and Yu 2015; Zhang, Zhou, and Liu 2016; Xu, Lv, and Geng 2019), but most disambiguation methods jointly train the classifier with the refined labels and refine the candidate labels with the classifier predictions (Feng and An 2019a; Li, Li, and Ouyang 2020). However, the refined labels may still be noisy for most training instances.

Inspired by the effectiveness of deep learning and the efficiency of stochastic optimization, a number of deep PL learning methods have been recently developed (Zhang et al. 2020; Lv et al. 2020; Feng et al. 2020; Wen et al. 2021; Yan and Guo 2020; Li et al. 2021; Xu et al. 2021; Zhang et al. 2022; Wang et al. 2022a). Some deep PL learning methods refine the candidate labels with adversarial training, such as the attempts based on GAN (Zhang et al. 2020) and Triple-GAN (Li et al. 2021). Most other methods design proper objectives for PL learning. For example, PRODEN (Lv et al. 2020) optimizes a classifier-consistent objective derived from the assumption that the ground-truth label achieves the minimal loss among candidate labels; Feng et al. (2020) propose risk-consistent and classifier-consistent methods with the assumption that the candidate labels are drawn from a uniform distribution; and Wen et al. (2021) propose a risk-consistent leveraged weighted loss with label-specific candidate label sampling. However, those methods rely heavily on their prior assumptions. Besides, the recent deep PL learning method PiCO (Wang et al. 2022a) borrows the idea of contrastive learning to keep the consistency between the augmented versions of each instance. Like PiCO, we employ a consistency regularization with respect to augmented instances; unlike PiCO, we further draw semantic-transformed features for them to achieve a new semantic consistency regularization.

Semi-Supervised Learning

The recent deep SS learning methods are mainly based on consistency regularization (Laine and Aila 2017; Tarvainen and Valpola 2017; Miyato et al. 2019; Berthelot et al. 2019, 2020; Xie et al. 2020; Sohn et al. 2020; Zhang et al. 2021). It is built on the simple concept that the classifier should give consistent predictions on an unlabeled instance and its perturbed version. To implement this idea, many perturbation methods have been adopted, such as the virtual adversarial training used in VAT (Miyato et al. 2019) and the mixup technique adopted by MixMatch (Berthelot et al. 2019). As data augmentation has become popular, later works (Berthelot et al. 2020; Xie et al. 2020; Sohn et al. 2020; Zhang et al. 2021) keep the classifier predictions on the weakly- and strongly-augmented variants of an unlabeled instance consistent, and empirically achieve impressive performance. The main data augmentation techniques used in these methods operate at the pixel level, such as flip-and-shift, Cutout (Devries and Taylor 2017), AutoAugment (Cubuk et al. 2019), CTAugment (Berthelot et al. 2020), and RandAugment (Cubuk et al. 2020). Complementary to these pixel-level augmentations, Wang et al. (2022b) design a semantic-level data augmentation method motivated by the linear characteristic of deep features. In PLSP, we employ both pixel-level and semantic-level data augmentations, and design a semantic consistency regularization for PL learning by performing semantic-level transformations on both weakly- and strongly-augmented variants of an instance.

The Proposed PLSP Approach

In this section, we introduce the proposed PL learning method, namely Partial Label learning with Semi-supervised Perspective (PLSP).

Problem formulation of PL learning. We now formulate the problem of PL learning. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a $d$-dimensional feature space and $\mathcal{Y} = \{1, \dots, l\}$, $l \geq 2$, be the label space. In the context of PL learning, we are given a training dataset consisting of $n$ instances, denoted by $\Omega = \{(x_i, C_i)\}_{i=1}^{n}$. For each instance, $x_i \in \mathcal{X}$ and $C_i \in \mathcal{C}$ are its feature vector and corresponding candidate label set, respectively, where $\mathcal{C} = \mathcal{P}(\mathcal{Y}) \setminus \{\emptyset, \mathcal{Y}\}$ is the power set of $\mathcal{Y}$ except for the empty set and the whole label set. In particular, the single ground-truth label of each instance is unknown, but it is always contained in the corresponding candidate label set. The goal of PL learning is to train a classifier $f(\cdot\,; \Theta)$, parameterized by $\Theta$, from such a noisy training dataset $\Omega$.

Overview of PLSP. The main idea of PLSP is to transform the PL learning problem into an SS learning problem, and then induce the classifier by leveraging the well-established SS learning paradigms. Specifically, we first select a small number of reliable partial-labeled instances (i.e., $m \ll n$) from $\Omega$ according to their predicted scores, e.g., class activation values (Zhang et al. 2022), and form a pseudo-labeled instance set $\Omega_l = \{(x_{p(i)}, y_{p(i)}) \in \mathcal{X} \times C_{p(i)}\}_{i=1}^{m}$, where the subscript $p(i)$ denotes the mapping function of the instance index and $y_{p(i)}$ is the corresponding high-confidence pseudo label. We treat the remaining instances as pseudo-unlabeled instances, denoted by $\Omega_u$. Accordingly, we can treat the pseudo-training dataset $\{\Omega_l, \Omega_u\}$ as a training dataset of SS learning, and train a classifier from it with the following well-established objective of SS learning:

$$L(\{\Omega_l, \Omega_u\}; \Theta) = L_l(\Omega_l; \Theta) + R_u(\Omega_u; \Theta), \tag{1}$$

where $L_l$ is the pseudo-supervised loss with respect to $\Omega_l$, and $R_u$ is the regularization with respect to $\Omega_u$ (e.g., consistency regularization). In the following, we introduce the details of forming the pseudo-training dataset $\{\Omega_l, \Omega_u\}$ and constructing the SS learning loss over $\{\Omega_l, \Omega_u\}$ for PL learning, and then present the specific objective of PLSP as well as the full training process.

Forming the Pseudo-Training Dataset $\{\Omega_l, \Omega_u\}$

To form the pseudo-labeled instance set $\Omega_l$, we pre-train the classifier $f(\cdot\,; \Theta)$ with a simple disambiguation-free PL learning loss, in which all candidate labels are treated equally:

$$L_{df}(\Omega; \Theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j \in C_i} \log p_{ij}, \tag{2}$$

where $p_i = [p_{ij}]^{\top}_{j \in \mathcal{Y}}$ is the classifier prediction of instance $x_i$, with $p_{ij} = e^{z_{ij}} / \sum_{j' \in \mathcal{Y}} e^{z_{ij'}}$ and $z_i = f(x_i; \Theta)$. With the pre-trained classifier $f(\cdot\,; \tilde{\Theta})$,$^1$ we select a small number of reliable pseudo-labeled instances to form $\Omega_l$. Specifically, for each instance $(x_i, C_i) \in \Omega$, we assign the candidate label with the highest class activation value (CAV) (Zhang et al. 2022) as its pseudo label:

$$y_i = \arg\max_{j \in C_i} v_{ij}, \quad v_{ij} = \tilde{z}_{ij}\,|\tilde{z}_{ij} - 1|, \quad \tilde{z}_i = f(x_i; \tilde{\Theta}).$$

1This pre-training process can be very efficient and converge within a few epochs.
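As a rough illustration of the disambiguation-free loss in Eq.(2), the following plain-Python sketch computes it for a small batch. The function names `softmax` and `disambiguation_free_loss`, and the use of Python lists and sets for logits and candidate label sets, are assumptions made for this example; an actual implementation would presumably use vectorized tensor operations.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def disambiguation_free_loss(batch_logits, batch_candidates):
    """Batch mean of -sum_{j in C_i} log p_ij, following Eq.(2).

    batch_logits: list of per-instance logit lists z_i.
    batch_candidates: list of candidate label sets C_i (0-based labels).
    """
    total = 0.0
    for logits, candidates in zip(batch_logits, batch_candidates):
        probs = softmax(logits)
        # Every candidate label contributes equally to the loss.
        total -= sum(math.log(probs[j]) for j in candidates)
    return total / len(batch_logits)
```

Because every candidate is weighted equally, the loss decreases whenever probability mass moves from non-candidate labels onto the candidate set, without committing to any single candidate.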

For each class $j \in \mathcal{Y}$, we construct its pseudo-labeled instance set $\Omega_l^j$ by choosing the instances with the top-$k$ CAVs of class $j$:

$$\Omega_l^j = \{(x_i, y_i) \mid i \in \mathrm{TopK}(\{v_{ij} \mid (x_i, C_i) \in \Omega,\ y_i = j\})\},$$

where, as its name suggests, $\mathrm{TopK}(\cdot)$ outputs the index set of instances with the top-$k$ CAVs. Accordingly, the pseudo-labeled set $\Omega_l$ is formed as follows:$^2$

$$\Omega_l = \bigcup_{j \in \mathcal{Y}} \Omega_l^j, \tag{3}$$

and the remaining instances constitute the pseudo-unlabeled instance set $\Omega_u$:

$$\Omega_u = \{(x_i, C_i) \mid (x_i, C_i) \in \Omega,\ (x_i, y_i) \notin \Omega_l\}. \tag{4}$$
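The selection procedure of Eqs.(3) and (4) can be sketched as follows. This is a minimal plain-Python illustration, not the released implementation: the names `cav` and `build_pseudo_sets`, the index-based return values, and the tie-breaking by smallest label are assumptions of this example.

```python
def cav(z):
    """Class activation value v = z * |z - 1| of a single logit."""
    return z * abs(z - 1.0)

def build_pseudo_sets(logits, candidates, k):
    """Return (labeled, unlabeled): a dict index -> pseudo label, and a list of indices.

    logits: per-instance logit lists from the pre-trained classifier.
    candidates: per-instance candidate label sets C_i (0-based labels).
    k: number of pseudo-labeled instances kept per class.
    """
    # Pseudo label of each instance: the candidate with the largest CAV
    # (ties broken by the smallest label, an assumption of this sketch).
    pseudo = [max(sorted(cand), key=lambda j: cav(z[j]))
              for z, cand in zip(logits, candidates)]
    labeled = {}
    for j in range(len(logits[0])):
        members = [i for i, y in enumerate(pseudo) if y == j]
        members.sort(key=lambda i: cav(logits[i][j]), reverse=True)
        for i in members[:k]:  # top-k CAVs of class j
            labeled[i] = j
    unlabeled = [i for i in range(len(logits)) if i not in labeled]
    return labeled, unlabeled
```

For instance, with four instances, three classes and `k = 1`, only the single most confident instance per pseudo-labeled class is kept, and everything else falls into the pseudo-unlabeled set while keeping its candidate labels.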

Forming the SS Learning Loss over $\{\Omega_l, \Omega_u\}$

Given $\{\Omega_l, \Omega_u\}$, we continue to optimize the pre-trained classifier $f(\cdot\,; \tilde{\Theta})$ with the SS learning loss of PLSP, which includes the pseudo-supervised loss $L_l$ for $\Omega_l$ and the regularization term $R_u$ for $\Omega_u$.

Pseudo-supervised loss. We can treat $\Omega_l$ as a labeled dataset, and directly formulate the pseudo-supervised loss as follows:

$$L_l(\Omega_l; \Theta) = -\frac{1}{|\Omega_l|} \sum_{(x_i, y_i) \in \Omega_l} \log p_{i y_i}. \tag{5}$$

Regularizing the pseudo-unlabeled instances. Inspired by the impressive success of consistency regularization in SS learning (Xie et al. 2020; Sohn et al. 2020; Zhang et al. 2021), we employ it to regularize the pseudo-unlabeled instances. Specifically, for each instance within $\Omega_u$, we first generate its weakly- and strongly-augmented variants with the widely used pixel-level data augmentation tricks,$^3$ and then constrain their corresponding prediction scores to be consistent. Formally, for each pseudo-unlabeled instance $(x_i, C_i) \in \Omega_u$, let its weakly- and strongly-augmented variants be denoted by $x_i^w = \alpha(x_i)$ and $x_i^s = \mathcal{A}(x_i)$, respectively. Its corresponding consistency regularization term can be written as follows:

$$R_u((x_i, C_i); \Theta) = h(\hat{p}_i^w)\,\mathrm{KL}(\hat{p}_i \,\|\, p_i^s), \tag{6}$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the KL-divergence. More specifically, $\hat{p}_i = [\hat{p}_{ij}]^{\top}_{j \in \mathcal{Y}}$ is the pseudo-target approximated on the weakly-augmented variant $x_i^w$:

$$\hat{p}_{ij} = \frac{\mathbb{1}(j \in C_i)\,\hat{p}^w_{ij}}{\sum_{j' \in \mathcal{Y}} \mathbb{1}(j' \in C_i)\,\hat{p}^w_{ij'}}, \quad \hat{p}^w_{ij} = \frac{e^{\hat{z}^w_{ij}}}{\sum_{j' \in \mathcal{Y}} e^{\hat{z}^w_{ij'}}}, \quad \hat{z}^w_i = f(x_i^w; \hat{\Theta}),$$

2 The total number of pseudo-labeled instances is $m = k \times l$.

3 These data augmentation tricks include flip-and-shift, Cutout (Devries and Taylor 2017), AutoAugment (Cubuk et al. 2019), CTAugment (Berthelot et al. 2020), and RandAugment (Cubuk et al. 2020). We introduce their implementation details in the experiment part.

where $\hat{\Theta}$ is the fixed copy of the current parameters $\Theta$; $p_i^s = [p^s_{ij}]^{\top}_{j \in \mathcal{Y}}$ is the classifier prediction on the strongly-augmented variant $x_i^s$, with $p^s_{ij} = e^{z^s_{ij}} / \sum_{j' \in \mathcal{Y}} e^{z^s_{ij'}}$ and $z^s_i = f(x_i^s; \Theta)$; and $h(\hat{p}_i^w)$ is an indicator function used to retain only high-confidence pseudo-unlabeled instances in this consistency regularization term, defined as follows:

$$h(\hat{p}_i^w) = \mathbb{1}\Big(\max_{j \in \mathcal{Y}} \hat{p}^w_{ij} \geq \tau \ \wedge\ \arg\max_{j \in \mathcal{Y}} \hat{p}^w_{ij} \in C_i\Big),$$

where $\tau \in (0.5, 1.0]$ is the confidence threshold. Accordingly, the overall consistency regularization over $\Omega_u$ is stated as:

$$R_u(\Omega_u; \Theta) = \frac{1}{|\Omega_u|} \sum_{(x_i, C_i) \in \Omega_u} R_u((x_i, C_i); \Theta). \tag{7}$$
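To make the per-instance term of Eq.(6) concrete, the sketch below computes it from the logits of the weakly- and strongly-augmented variants. It is a plain-Python illustration under assumed names (`softmax`, `consistency_term`); the weak-branch logits are treated as constants, mirroring the fixed parameter copy used for the pseudo-target.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def consistency_term(weak_logits, strong_logits, candidates, tau):
    """h(p_hat^w) * KL(p_hat || p^s) for one pseudo-unlabeled instance."""
    p_weak = softmax(weak_logits)
    top = max(range(len(p_weak)), key=lambda j: p_weak[j])
    # Indicator h: keep only confident predictions that fall inside C_i.
    if p_weak[top] < tau or top not in candidates:
        return 0.0
    # Pseudo-target: weak prediction masked to the candidates and renormalized.
    mass = sum(p_weak[j] for j in candidates)
    target = [p_weak[j] / mass if j in candidates else 0.0
              for j in range(len(p_weak))]
    p_strong = softmax(strong_logits)
    # KL divergence; zero-probability target entries contribute nothing.
    return sum(t * math.log(t / s) for t, s in zip(target, p_strong) if t > 0.0)
```

Note that the term vanishes both when the weak prediction is not confident enough and when its top label lies outside the candidate set, so such instances do not influence the regularizer.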

Besides, inspired by the observation that the deep feature space is usually linear and includes meaningful semantic directions (Wang et al. 2019, 2022b), we construct semantic-level transformations based on those semantic directions, which are complementary to the pixel-level transformations, and perform semantic consistency regularization on them, so as to further regularize the classifier $f(\cdot\,; \Theta)$ at the semantic level. Let the classifier be $f(\cdot\,; \Theta) = W^{\top} g(\cdot\,; \Phi)$, where $g(\cdot\,; \Phi)$ is the deep feature extractor and $W = [w_j]_{j \in \mathcal{Y}}$ are the parameters of the last fully-connected predictive layer. Specifically, we suppose that the semantic directions are drawn from a set of label-specific zero-mean Gaussian distributions $\{\mathcal{N}(0, \lambda\Sigma_j)\}_{j \in \mathcal{Y}}$, where $\lambda > 0$ controls the strength of semantic transformations. Given the known labels, we can apply the sampled label-specific semantic directions to the deep features of instances to construct semantic-level transformations. Thanks to the properties of the Gaussian distribution, we can draw the semantic-transformed feature $\bar{a}_i$ of any instance $(x_i, y_i)$ as:

$$\bar{a}_i \sim \mathcal{N}(a_i, \lambda\Sigma_{y_i}), \quad a_i = g(x_i; \Phi). \tag{8}$$

Nevertheless, the true labels of the pseudo-unlabeled instances within $\Omega_u$ are unknown. For each instance $(x_i, C_i) \in \Omega_u$, we approximate its pseudo label $\hat{y}_i$ with the CAVs on its weakly-augmented variant $x_i^w$ as:

$$\hat{y}_i = \arg\max_{j \in C_i} \hat{v}_{ij}, \quad \hat{v}_{ij} = \hat{z}_{ij}\,|\hat{z}_{ij} - 1|, \quad \hat{z}_i = f(x_i^w; \hat{\Theta}).$$

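We can then draw the semantic-transformed features of its weakly- and strongly-augmented variants by applying Eq.(8). Drawing $K$ semantic-transformed features for each augmented variant, the consistency regularization term in Eq.(6) can be rewritten as the following semantic consistency regularization term:

$$\begin{aligned}
R_u^K((x_i, C_i); \Theta) &= \frac{1}{K^2} \sum_{k_1, k_2 = 1}^{K} h(\hat{p}_i^{w,k_1})\,\mathrm{KL}(\hat{p}_i^{k_1} \,\|\, p_i^{s,k_2}), \\
\text{s.t.}\quad \hat{a}_i^{w,k_1} &\sim \mathcal{N}(\hat{a}_i^w, \lambda\Sigma_{\hat{y}_i}), \quad \hat{a}_i^w = g(x_i^w; \hat{\Phi}), \\
a_i^{s,k_2} &\sim \mathcal{N}(a_i^s, \lambda\Sigma_{\hat{y}_i}), \quad a_i^s = g(x_i^s; \Phi),
\end{aligned} \tag{9}$$

$$\hat{p}^{w,k_1}_{ij} = \frac{e^{\hat{z}^{w,k_1}_{ij}}}{\sum_{j' \in \mathcal{Y}} e^{\hat{z}^{w,k_1}_{ij'}}}, \quad \hat{z}^{w,k_1}_i = \hat{W}^{\top} \hat{a}_i^{w,k_1};$$

$$p^{s,k_2}_{ij} = \frac{e^{z^{s,k_2}_{ij}}}{\sum_{j' \in \mathcal{Y}} e^{z^{s,k_2}_{ij'}}}, \quad z^{s,k_2}_i = W^{\top} a_i^{s,k_2},$$

and further $\hat{p}_i^{k_1} = [\hat{p}^{k_1}_{ij}]^{\top}_{j \in \mathcal{Y}}$ is calculated as follows:

$$\hat{p}^{k_1}_{ij} = \frac{\mathbb{1}(j \in C_i)\,\hat{p}^{w,k_1}_{ij}}{\sum_{j' \in \mathcal{Y}} \mathbb{1}(j' \in C_i)\,\hat{p}^{w,k_1}_{ij'}}.$$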

To avoid inefficient sampling, we consider the expectation of Eq.(9) over all possible semantic-transformed features:

$$R_u^{\infty}((x_i, C_i); \Theta) = \mathbb{E}_{\hat{a}_i^{w,k_1},\, a_i^{s,k_2}}\big[h(\hat{p}_i^{w,k_1})\,\mathrm{KL}(\hat{p}_i^{k_1} \,\|\, p_i^{s,k_2})\big]. \tag{10}$$

Unfortunately, it is intractable to optimize Eq.(10) in its exact form. Instead, we derive an easy-to-compute upper bound $\overline{R}_u^{\infty}((x_i, C_i); \Theta)$, given in the following proposition. Finally, the consistency regularization over $\Omega_u$ in Eq.(7) is rewritten as:

$$\overline{R}_u(\Omega_u; \Theta) = \frac{1}{|\Omega_u|} \sum_{(x_i, C_i) \in \Omega_u} \overline{R}_u^{\infty}((x_i, C_i); \Theta). \tag{11}$$

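Proposition 1. Suppose that $\hat{a}_i^{w,k_1} \sim \mathcal{N}(\hat{a}_i^w, \lambda\Sigma_{\hat{y}_i})$ and $a_i^{s,k_2} \sim \mathcal{N}(a_i^s, \lambda\Sigma_{\hat{y}_i})$. Then we have an upper bound for $R_u^{\infty}((x_i, C_i); \Theta)$ given by

$$R_u^{\infty}((x_i, C_i); \Theta) \leq h(\hat{p}_i^w)\,\mathrm{KL}(\hat{p}_i \,\|\, p_i^s) \triangleq \overline{R}_u^{\infty}((x_i, C_i); \Theta)$$

$$\text{s.t.}\quad \hat{a}_i^w = g(x_i^w; \hat{\Phi}), \quad a_i^s = g(x_i^s; \Phi),$$

where

$$\hat{p}^w_{ij} = \frac{1}{\sum_{j' \in \mathcal{Y}} 1 \Big/ \Phi\!\left(\dfrac{\beta\,\hat{u}_{jj'}^{\top} \hat{a}_i^w}{(1 + \lambda\beta^2\,\hat{u}_{jj'}^{\top} \Sigma_{\hat{y}_i} \hat{u}_{jj'})^{1/2}}\right)}, \quad \hat{p}_{ij} = \frac{\mathbb{1}(j \in C_i)\,\hat{p}^w_{ij}}{\sum_{j' \in \mathcal{Y}} \mathbb{1}(j' \in C_i)\,\hat{p}^w_{ij'}},$$

$$p^s_{ij} = \frac{e^{w_j^{\top} a_i^s}}{\sum_{j' \in \mathcal{Y}} e^{w_{j'}^{\top} a_i^s + \frac{\lambda}{2} u_{j'j}^{\top} \Sigma_{\hat{y}_i} u_{j'j}}},$$

$\hat{u}_{jj'} = \hat{w}_j - \hat{w}_{j'}$, $u_{j'j} = w_{j'} - w_j$, and $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\,dt$ is the cumulative distribution function of the standard normal distribution $\mathcal{N}(0, 1)$.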

Objective of PLSP and Iterative Training Summary

We summarize the overall objective of PLSP, and clarify the training details in the following.

Objective of PLSP. Besides the aforementioned pseudo-supervised loss $L_l$ and regularization $R_u$, we also incorporate a complementary loss over $\Omega_u$ to minimize the predictions of non-candidate labels:

$$L_{cl}(\Omega_u; \Theta) = -\frac{1}{|\Omega_u|} \sum_{(x_i, C_i) \in \Omega_u} \sum_{j \notin C_i} \log(1 - p_{ij}). \tag{12}$$
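The complementary loss of Eq.(12) can be sketched in plain Python as below. The function names and the small clamp `eps` (added only to avoid `log(0)`) are assumptions of this illustration rather than details given in the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def complementary_loss(batch_logits, batch_candidates, eps=1e-12):
    """Batch mean of -sum_{j not in C_i} log(1 - p_ij), following Eq.(12)."""
    total = 0.0
    for logits, candidates in zip(batch_logits, batch_candidates):
        probs = softmax(logits)
        for j, p in enumerate(probs):
            if j not in candidates:
                # Penalize probability mass placed on a non-candidate label.
                total -= math.log(max(1.0 - p, eps))
    return total / len(batch_logits)
```

The loss is zero only when no probability mass is assigned to non-candidate labels, which is the behavior the complementary term is meant to encourage.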

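We also improve $L_l$ and $L_{cl}$ with the semantic-level transformation of Eq.(8), and obtain their corresponding upper bounds according to Proposition 1, given by:

$$\overline{L}_l(\Omega_l; \Theta) = -\frac{1}{|\Omega_l|} \sum_{(x_i, y_i) \in \Omega_l} \log p_{i y_i}, \tag{13}$$

$$\overline{L}_{cl}(\Omega_u; \Theta) = -\frac{1}{|\Omega_u|} \sum_{(x_i, C_i) \in \Omega_u} \sum_{j \notin C_i} \log(1 - p_{ij}), \tag{14}$$

$$p_{i y_i} = \frac{e^{w_{y_i}^{\top} a_i}}{\sum_{j' \in \mathcal{Y}} e^{w_{j'}^{\top} a_i + \frac{\lambda}{2} u_{j' y_i}^{\top} \Sigma_{y_i} u_{j' y_i}}}, \quad a_i = g(x_i; \Phi),$$

$$p_{ij} = \frac{e^{w_j^{\top} a_i}}{\sum_{j' \in \mathcal{Y}} e^{w_{j'}^{\top} a_i + \frac{\lambda}{2} u_{j'j}^{\top} \Sigma_{\hat{y}_i} u_{j'j}}}, \quad a_i = g(x_i; \Phi),$$

and $u_{j' y_i} = w_{j'} - w_{y_i}$. Accordingly, the final objective of PLSP can be reformulated as:

$$L(\{\Omega_l, \Omega_u\}; \Theta) = \gamma\big(\overline{L}_l(\Omega_l; \Theta) + \overline{R}_u(\Omega_u; \Theta)\big) + \overline{L}_{cl}(\Omega_u; \Theta), \tag{15}$$

where $\gamma > 0$ is the hyper-parameter to balance the SS learning loss and the complementary loss.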
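The closed form in Eq.(13) can be checked numerically on a toy example. A plausible reading, in the spirit of Wang et al. (2019), is that the bound follows from Jensen's inequality combined with the moment-generating function of a Gaussian; the sketch below compares the closed-form value with a Monte Carlo estimate of the expected cross-entropy over sampled features. The helper names are hypothetical, and a diagonal covariance is assumed here purely to keep the plain-Python example short.

```python
import math
import random

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def bound_neg_log_prob(W, a, y, sigma_diag, lam):
    """-log p_{iy} with p_{iy} as in Eq.(13); sigma_diag is the diagonal of Sigma_y."""
    terms = []
    for w_j in W:
        u = [wj - wy for wj, wy in zip(w_j, W[y])]                  # u_{j'y} = w_{j'} - w_y
        gap = sum(ud * ad for ud, ad in zip(u, a))                  # u^T a
        quad = sum(ud * ud * sd for ud, sd in zip(u, sigma_diag))   # u^T Sigma u
        terms.append(gap + 0.5 * lam * quad)
    return log_sum_exp(terms)

def monte_carlo_neg_log_prob(W, a, y, sigma_diag, lam, num_samples, rng):
    """Average cross-entropy over features sampled from N(a, lam * Sigma_y)."""
    total = 0.0
    for _ in range(num_samples):
        a_tf = [ad + rng.gauss(0.0, math.sqrt(lam * sd))
                for ad, sd in zip(a, sigma_diag)]
        logits = [sum(wd * xd for wd, xd in zip(w_j, a_tf)) for w_j in W]
        total += log_sum_exp(logits) - logits[y]
    return total / num_samples
```

With `lam = 0` the two quantities coincide with the ordinary cross-entropy, and for `lam > 0` the closed form should stay above the sampled average, which is what makes it usable as a surrogate that needs no explicit sampling.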

Update of label-specific covariance matrices $\{\Sigma_j\}_{j \in \mathcal{Y}}$. Following (Wang et al. 2019, 2022b), we approximate $\{\Sigma_j\}_{j \in \mathcal{Y}}$ with pseudo-labeled instances by incrementally accumulating statistics from all mini-batches. For each $\Sigma_j$ in the $c$-th iteration, it is updated as follows:

$$\Sigma_j^{(c)} = \frac{m_j^{(c-1)} \Sigma_j^{(c-1)} + m_j'^{(c)} \Sigma_j'^{(c)}}{m_j^{(c-1)} + m_j'^{(c)}} + \frac{m_j^{(c-1)} m_j'^{(c)} \big(\mu_j^{(c-1)} - \mu_j'^{(c)}\big)\big(\mu_j^{(c-1)} - \mu_j'^{(c)}\big)^{\top}}{\big(m_j^{(c-1)} + m_j'^{(c)}\big)^2}, \tag{16}$$

$$\mu_j^{(c)} = \frac{m_j^{(c-1)} \mu_j^{(c-1)} + m_j'^{(c)} \mu_j'^{(c)}}{m_j^{(c-1)} + m_j'^{(c)}}, \quad m_j^{(c)} = m_j^{(c-1)} + m_j'^{(c)},$$

where $\mu_j'^{(c)}$ and $\Sigma_j'^{(c)}$ are the mean and covariance matrix of the features within class $j$ in the $c$-th mini-batch, respectively; $m_j^{(c)}$ is the total number of pseudo-labeled instances belonging to class $j$ in all $c$ mini-batches; and $m_j'^{(c)}$ is the number of pseudo-labeled instances belonging to class $j$ in the $c$-th mini-batch.
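The incremental update of Eq.(16) can be sanity-checked against a one-shot computation over all features seen so far. The sketch below does this for a single class in plain Python; `mean_and_cov` and `update_stats` are illustrative names, and the covariance is assumed to be the biased (divide-by-count) estimate, for which the update is exact.

```python
def mean_and_cov(rows):
    """Mean vector and biased (1/m) covariance matrix of a list of feature vectors."""
    m, d = len(rows), len(rows[0])
    mu = [sum(r[k] for r in rows) / m for k in range(d)]
    cov = [[sum((r[p] - mu[p]) * (r[q] - mu[q]) for r in rows) / m
            for q in range(d)] for p in range(d)]
    return mu, cov

def update_stats(m_old, mu_old, cov_old, batch):
    """One step of Eq.(16) for a single class, given that class's mini-batch features."""
    m_new = len(batch)
    mu_b, cov_b = mean_and_cov(batch)
    total = m_old + m_new
    d = len(mu_b)
    diff = [mu_old[k] - mu_b[k] for k in range(d)]
    # Weighted average of the covariances plus a correction for the shift in means.
    cov = [[(m_old * cov_old[p][q] + m_new * cov_b[p][q]) / total
            + m_old * m_new * diff[p] * diff[q] / total ** 2
            for q in range(d)] for p in range(d)]
    mu = [(m_old * mu_old[k] + m_new * mu_b[k]) / total for k in range(d)]
    return total, mu, cov
```

Starting from `m_old = 0` with zero mean and covariance, feeding mini-batches one by one reproduces the statistics of the concatenated data, so no features need to be stored across iterations.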

Adjusting the SS learning loss weight $\gamma$. In the early training stage, the SS learning loss may be less accurate. To address this issue, we dynamically adjust the SS learning loss weight $\gamma$ by a non-decreasing function $\gamma = \min\{\frac{t}{T}\gamma_0, \gamma_0\}$ with respect to the epoch number $t$, where $\gamma_0$ is the maximum weight, and $T$ is the maximum number of SS training epochs.

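Adjusting the confidence threshold $\tau$. We employ curriculum pseudo labeling (Zhang et al. 2021) to adjust $\tau$. For each class $j$, its value at the $c$-th iteration is calculated by:

$$\tau_c(j) = \eta_c(j) \cdot \tau_0, \quad \eta_c(j) = \frac{\sigma_c(j)}{\max_{j' \in \mathcal{Y}} \sigma_c(j')}, \quad \sigma_c(j) = \sum_{(x_i, C_i) \in \Omega_u} h(\hat{p}_i^w) \cdot \mathbb{1}(\hat{y}_i = j),$$

where $\tau_0$ is the maximum confidence threshold.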
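A small sketch of the class-wise threshold computation is given below. The name `class_thresholds` and its input format are assumptions of this example, and the fallback to `tau0` when no instance has yet passed the indicator is a choice made here to avoid a division by zero, not something stated in the text.

```python
def class_thresholds(confident_pseudo_labels, num_classes, tau0):
    """Per-class thresholds tau_c(j) = tau0 * sigma_c(j) / max_j' sigma_c(j').

    confident_pseudo_labels: pseudo labels y_hat_i of the pseudo-unlabeled
    instances whose indicator h equals 1.
    """
    sigma = [0] * num_classes
    for y in confident_pseudo_labels:
        sigma[y] += 1
    top = max(sigma)
    if top == 0:
        # Assumption of this sketch: use tau0 everywhere until some instance is confident.
        return [tau0] * num_classes
    return [tau0 * s / top for s in sigma]
```

Classes with fewer confident instances receive a lower threshold, which lets harder classes contribute to the consistency term earlier in training.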

Adjusting the transformation strength $\lambda$. Following (Wang et al. 2019, 2022b), we dynamically adjust the transformation strength $\lambda$ with a non-decreasing function $\lambda = \min\{\frac{t}{T}\lambda_0, \lambda_0\}$ with respect to the epoch number $t$, where $\lambda_0$ is the maximum transformation strength, so as to reduce the negative impact of the low-quality estimations of covariance matrices in the early training stage.
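Both the loss weight and the transformation strength follow the same linear ramp, which can be written as a one-line helper. The name `linear_ramp` is hypothetical.

```python
def linear_ramp(t, T, max_value):
    """Value of min{(t / T) * max_value, max_value} at epoch t."""
    return min(t / T * max_value, max_value)
```

For example, with a maximum of 250 epochs the ramp starts at zero, reaches half of its maximum at epoch 125, and stays capped at the maximum afterwards.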

Algorithm 1: Training procedure of PLSP
Input: $\Omega$: PL training dataset $\Omega = \{(x_i, C_i)\}_{i=1}^{n}$; $m$: number of pseudo-labeled instances; $\gamma_0$: SS learning loss weight; $\tau_0$: confidence threshold; $\lambda_0$: semantic transformation strength
Output: $\Theta$: classifier parameters
1: Initialize the classifier parameters $\Theta = \{\Phi, W\}$;
2: for $t = 0$ to $T_0$ do {pre-training stage}
3:   for $c = 0$ to $I$ do
4:     Sample a mini-batch $\{(x_i, C_i)\}_{i=1}^{B}$ from $\Omega$;
5:     Compute $L_{df}$ according to Eq.(2);
6:     Update $\Theta$ with SGD;
7:   end for
8: end for
9: for $t = 0$ to $T$ do {SS training stage}
10:   Construct the pseudo-training dataset $\{\Omega_l, \Omega_u\}$ according to Eqs.(3) and (4);
11:   for $c = 0$ to $I$ do
12:     Sample a mini-batch $\{(x_i, y_i)\}_{i=1}^{B_l}$ from $\Omega_l$ and a mini-batch $\{(x_i^w, x_i^s, C_i)\}_{i=1}^{B_u}$ from $\Omega_u$ with $\alpha(\cdot)$ and $\mathcal{A}(\cdot)$;
13:     Compute $a_i = g(x_i; \Phi)$, $a_i^w = g(x_i^w; \Phi)$, $a_i^s = g(x_i^s; \Phi)$;
14:     Estimate the covariance matrices $\{\Sigma_j\}_{j \in \mathcal{Y}}$ according to Eq.(16);
15:     Compute $L$ according to Eq.(15);
16:     Update $\Theta$ with SGD;
17:   end for
18: end for

Iterative training summary. In practice, to prevent error memorization and reduce the time cost, we update $\{\Omega_l, \Omega_u\}$ with the current predictions once per epoch. The classifier parameters $\Theta$ are optimized with SGD. Overall, the iterative training procedure of PLSP is summarized in Algorithm 1.

Experiment

Experimental Setup

Datasets. We utilize 3 widely used benchmark image datasets, including Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), CIFAR-10 and CIFAR-100 (Krizhevsky 2016). We manually synthesize the partially labeled versions of these datasets by applying the Uniformly Sampling Strategy (USS) (Feng et al. 2020) and the Flipping Probability Strategy (FPS) (Lv et al. 2020). The former uniformly samples a candidate label set from the candidate label set space $\mathcal{C}$ for each instance, and the latter generates the candidate label set of each instance by selecting any irrelevant label as a candidate with a flipping probability $q$.4 In experiments, we employ $q \in \{0.3, 0.5, 0.7\}$ for Fashion-MNIST and CIFAR-10, and $q \in \{0.05, 0.1, 0.2\}$ for CIFAR-100 because it has more labels. We adopt 5-layer LeNet, 22-layer DenseNet and 18-layer ResNet as the backbones for Fashion-MNIST, CIFAR-10 and CIFAR-100, respectively.

4Note that the flipping probability strategy will uniformly flip a random irrelevant label into the candidate label set when none of irrelevant labels are flipped.
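The flipping probability strategy, including the fallback described in the footnote, can be sketched as follows. The function name `flip_candidates` and the use of a seeded `random.Random` generator are assumptions of this example; how the case where every irrelevant label is flipped (which would yield the whole label set) is treated is not specified in the text and is left untouched here.

```python
import random

def flip_candidates(true_label, num_classes, q, rng):
    """Candidate label set of one instance under FPS with flipping probability q."""
    candidates = {true_label}  # the ground-truth label is always a candidate
    others = [j for j in range(num_classes) if j != true_label]
    for j in others:
        if rng.random() < q:   # each irrelevant label is flipped independently
            candidates.add(j)
    if len(candidates) == 1:
        # Footnote rule: if nothing was flipped, add one random irrelevant label.
        candidates.add(rng.choice(others))
    return candidates
```

A call such as `flip_candidates(3, 10, 0.3, random.Random(0))` returns a set that always contains label 3 and at least one other label.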

| Setting | Metric | Dataset | CC | RC | PRODEN | LWS | CAVL | PiCO | PLSP |
|---|---|---|---|---|---|---|---|---|---|
| (I) USS | Macro-F1 | FMNIST | 0.879 ± 0.001 | 0.893 ± 0.001 | 0.891 ± 0.006 | 0.881 ± 0.007 | 0.882 ± 0.004 | **0.907 ± 0.001** | 0.897 ± 0.003 |
| (I) USS | Macro-F1 | CIFAR-10 | 0.745 ± 0.006 | 0.787 ± 0.003 | 0.796 ± 0.003 | 0.781 ± 0.007 | 0.748 ± 0.051 | 0.869 ± 0.001 | **0.889 ± 0.003** |
| (I) USS | Micro-F1 | FMNIST | 0.880 ± 0.001 | 0.893 ± 0.001 | 0.892 ± 0.003 | 0.883 ± 0.005 | 0.883 ± 0.004 | **0.907 ± 0.001** | 0.897 ± 0.002 |
| (I) USS | Micro-F1 | CIFAR-10 | 0.747 ± 0.005 | 0.788 ± 0.003 | 0.796 ± 0.003 | 0.782 ± 0.004 | 0.756 ± 0.038 | 0.870 ± 0.001 | **0.889 ± 0.003** |
| (II) FPS (q = 0.3) | Macro-F1 | FMNIST | 0.883 ± 0.003 | 0.893 ± 0.001 | 0.894 ± 0.003 | 0.888 ± 0.005 | 0.887 ± 0.003 | **0.909 ± 0.003** | 0.894 ± 0.002 |
| (II) FPS (q = 0.3) | Macro-F1 | CIFAR-10 | 0.769 ± 0.005 | 0.803 ± 0.003 | 0.801 ± 0.005 | 0.802 ± 0.005 | 0.796 ± 0.004 | 0.880 ± 0.003 | **0.898 ± 0.002** |
| (II) FPS (q = 0.3) | Micro-F1 | FMNIST | 0.883 ± 0.003 | 0.894 ± 0.001 | 0.895 ± 0.002 | 0.889 ± 0.004 | 0.887 ± 0.003 | **0.909 ± 0.003** | 0.894 ± 0.002 |
| (II) FPS (q = 0.3) | Micro-F1 | CIFAR-10 | 0.769 ± 0.004 | 0.803 ± 0.003 | 0.801 ± 0.005 | 0.803 ± 0.004 | 0.796 ± 0.004 | 0.880 ± 0.003 | **0.898 ± 0.002** |
| (III) FPS (q = 0.5) | Macro-F1 | FMNIST | 0.881 ± 0.001 | 0.890 ± 0.002 | 0.891 ± 0.004 | 0.884 ± 0.006 | 0.881 ± 0.003 | **0.903 ± 0.002** | 0.891 ± 0.003 |
| (III) FPS (q = 0.5) | Macro-F1 | CIFAR-10 | 0.734 ± 0.007 | 0.782 ± 0.004 | 0.791 ± 0.004 | 0.794 ± 0.003 | 0.767 ± 0.004 | 0.865 ± 0.002 | **0.887 ± 0.002** |
| (III) FPS (q = 0.5) | Micro-F1 | FMNIST | 0.881 ± 0.001 | 0.891 ± 0.002 | 0.892 ± 0.002 | 0.885 ± 0.004 | 0.882 ± 0.004 | **0.903 ± 0.002** | 0.892 ± 0.002 |
| (III) FPS (q = 0.5) | Micro-F1 | CIFAR-10 | 0.735 ± 0.008 | 0.783 ± 0.004 | 0.791 ± 0.004 | 0.795 ± 0.002 | 0.768 ± 0.004 | 0.866 ± 0.002 | **0.887 ± 0.002** |
| (IV) FPS (q = 0.7) | Macro-F1 | FMNIST | 0.874 ± 0.007 | **0.886 ± 0.006** | 0.884 ± 0.003 | 0.875 ± 0.002 | 0.863 ± 0.002 | 0.865 ± 0.035 | 0.878 ± 0.005 |
| (IV) FPS (q = 0.7) | Macro-F1 | CIFAR-10 | 0.678 ± 0.008 | 0.728 ± 0.002 | 0.745 ± 0.004 | 0.743 ± 0.005 | 0.673 ± 0.051 | 0.808 ± 0.045 | **0.869 ± 0.006** |
| (IV) FPS (q = 0.7) | Micro-F1 | FMNIST | 0.875 ± 0.007 | **0.886 ± 0.002** | 0.885 ± 0.001 | 0.875 ± 0.001 | 0.853 ± 0.002 | 0.872 ± 0.020 | 0.879 ± 0.004 |
| (IV) FPS (q = 0.7) | Micro-F1 | CIFAR-10 | 0.681 ± 0.008 | 0.730 ± 0.002 | 0.746 ± 0.004 | 0.744 ± 0.005 | 0.692 ± 0.035 | 0.816 ± 0.032 | **0.870 ± 0.006** |

Table 1: Empirical results (mean ± std) on Fashion-MNIST (FMNIST) and CIFAR-10 with different data generation strategies and ambiguity levels: (I) USS; (II) FPS (q = 0.3); (III) FPS (q = 0.5); (IV) FPS (q = 0.7). The highest scores are indicated in bold. The statistical significance of the performance gain of PLSP is assessed with paired-sample t-tests at the 0.01 level.

Baseline PL learning methods and training settings. We compare PLSP against the following 6 existing deep PL learning methods: RC (Feng et al. 2020), CC (Feng et al. 2020), PRODEN (Lv et al. 2020), LW (Wen et al. 2021) with the sigmoid loss function (LWS), CAVL (Zhang et al. 2022), and PiCO (Wang et al. 2022a). We train all methods with the SGD optimizer, and search the learning rate from $\{0.0001, 0.001, 0.01, 0.05, 0.1, 0.5\}$ and the weight decay from $\{10^{-6}, 10^{-5}, \dots, 10^{-1}\}$. For all baselines and the pre-training stage of PLSP, we set the batch size to 256 for Fashion-MNIST and CIFAR-10, and to 64 for CIFAR-100. For all baselines, we employ the default or suggested hyper-parameter settings in their papers and released code. For PLSP, we use the following hyper-parameter settings: $\gamma_0 = 1.0$, $\lambda_0 = 0.01$, $\tau_0 = 0.75$, number of pre-training epochs $T_0 = 10$, number of SS training epochs $T = 250$, number of inner loops $I = 200$, and batch sizes of pseudo-labeled and pseudo-unlabeled instances $B_l = 64$, $B_u = 256$. For CIFAR-100 specifically, we set $T_0 = 50$, $I = 800$, $B_l = 16$, $B_u = 64$. We set the number of pseudo-labeled instances per class to $k = 200$. Besides, we implement the weak augmentation function $\alpha(\cdot)$ with horizontal flipping and cropping for all datasets, and implement the strong augmentation function $\mathcal{A}(\cdot)$ with horizontal flipping, cropping and Cutout for Fashion-MNIST, and with horizontal flipping, cropping, Cutout and AutoAugment for CIFAR-10 and CIFAR-100.5 All experiments are carried out on a Linux server with one NVIDIA GeForce RTX 3090 GPU.

5 For AutoAugment, we simply utilize the augmentation policies released by (Cubuk et al. 2019).

Evaluation metrics. We employ Macro-F1 and Micro-F1 to evaluate the classification performance, and calculate them by using the Scikit-Learn tools (Pedregosa et al. 2011).

Main Results

We perform all experiments with five different random seeds, and report the average scores on Fashion-MNIST and CIFAR-10 in Table 1, and those on CIFAR-100 in Table 2. Overall, our PLSP significantly outperforms all compared methods in most cases, and achieves particularly large performance gains at high ambiguity levels. As shown in Tables 1 and 2: (1) Our PLSP consistently performs better than all baselines on CIFAR-10 and achieves competitive performance on Fashion-MNIST across the four partial label settings. For example, the Micro-F1 scores of PLSP are 0.018 to 0.054 higher than those of the recent state-of-the-art PiCO on the four partially labeled versions of CIFAR-10, with the largest improvement of 0.054 at the highest ambiguity level, i.e., q = 0.7. (2) Compared with all baselines, our PLSP achieves very significant performance gains on CIFAR-100 across q = 0.05, 0.1 and 0.2, and shows a larger margin than on the previous simpler datasets. (3) Besides, the performance of PiCO drops sharply on Fashion-MNIST and CIFAR-10 with q = 0.7, and especially on CIFAR-100 with q = 0.2. A possible reason is that, at high ambiguity levels, PiCO cannot identify the true labels with contrastive representation learning to disambiguate the candidate labels.

Ablation Study

In this section, we perform extensive experiments to examine the importance of different components of PLSP. We compare

| Metric | q | CC | RC | PRODEN | LWS | CAVL | PiCO | PLSP |
|---|---|---|---|---|---|---|---|---|
| Macro-F1 | 0.05 | 0.469 ± 0.003 | 0.461 ± 0.007 | 0.601 ± 0.004 | 0.567 ± 0.008 | 0.398 ± 0.008 | 0.744 ± 0.007 | **0.770 ± 0.002** |
| Macro-F1 | 0.1 | 0.431 ± 0.006 | 0.388 ± 0.006 | 0.512 ± 0.006 | 0.498 ± 0.005 | 0.229 ± 0.018 | 0.636 ± 0.021 | **0.733 ± 0.013** |
| Macro-F1 | 0.2 | 0.348 ± 0.008 | 0.230 ± 0.013 | 0.476 ± 0.010 | 0.401 ± 0.015 | 0.066 ± 0.010 | 0.190 ± 0.025 | **0.660 ± 0.008** |
| Micro-F1 | 0.05 | 0.470 ± 0.004 | 0.465 ± 0.006 | 0.607 ± 0.001 | 0.596 ± 0.003 | 0.402 ± 0.008 | 0.746 ± 0.006 | **0.770 ± 0.002** |
| Micro-F1 | 0.1 | 0.435 ± 0.005 | 0.400 ± 0.005 | 0.568 ± 0.003 | 0.535 ± 0.001 | 0.262 ± 0.015 | 0.660 ± 0.015 | **0.739 ± 0.008** |
| Micro-F1 | 0.2 | 0.357 ± 0.009 | 0.279 ± 0.012 | 0.496 ± 0.007 | 0.434 ± 0.011 | 0.104 ± 0.009 | 0.288 ± 0.022 | **0.687 ± 0.006** |

Table 2: Empirical results (mean ± std) on CIFAR-100 with FPS (q = 0.05, 0.1, 0.2). The highest scores are indicated in bold. The statistical significance of the performance gain of PLSP is assessed with paired-sample t-tests at the 0.01 level.

| Method | CIFAR-10 Macro-F1 | CIFAR-10 Micro-F1 |
|---|---|---|
| PLSP | **0.869 ± 0.006** | **0.870 ± 0.006** |
| PLSP w/o ST | 0.858 ± 0.008 | 0.861 ± 0.007 |
| DF | 0.413 ± 0.015 | 0.418 ± 0.012 |

Table 3: Ablation study results (mean ± std) on CIFAR-10 with FPS (q = 0.7). The highest scores are indicated in bold.

PLSP with two variants on CIFAR-10 generated by FPS with q = 0.7: PLSP without the semantic transformation (ST), and a version trained only with the disambiguation-free (DF) objective of Eq.(2). The experimental results are reported in Table 3. They clearly show that the proposed SS learning strategy significantly improves the classification performance of PL learning. Besides, the semantic transformation also improves the classification performance, which supports its effectiveness in capturing semantic consistency.

Sensitivity Analysis In this section, we examine the sensitivity to the number of pseudo-labeled instances per class k. We conduct the sensitivity experiments by varying k over {0, 50, 100, 200, 500, 1000, 5000} on CIFAR-10 with partial labels generated by FPS at q = 0.7, and illustrate the experimental results in Fig.2. As is shown: (1) The performance is relatively stable when k ≤ 200, achieves its highest value at k = 200, and drops sharply as k becomes larger. This is expected, since a smaller value of k may ignore some high-confidence instances while a larger value of k introduces many unreliable pseudo-labeled instances, leading to a poor classifier. (2) Moreover, the performance is poor at both k = 0 and k = 5000, especially at k = 5000. Notice that when k = 0 none of the instances within Ω are selected as pseudo-labeled instances, i.e., {Ωl = ∅, Ωu = Ω}, and when k = 5000 for CIFAR-10 all instances are selected as pseudo-labeled instances, i.e., {Ωl = Ω, Ωu = ∅}. This demonstrates the effectiveness of the proposed SS learning strategy for the PL learning task. In practice, we suggest tuning k over the set {50, 100, 200}.
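The role of k can be made concrete with a small sketch of a per-class top-k split. It is a simplified illustration, not the authors' implementation: it assumes that each instance is tentatively assigned its highest-scoring candidate label and that the k most confident instances of each class form the pseudo-labeled set; the function name `split_pseudo_labeled` and the toy scores are hypothetical.

```python
def split_pseudo_labeled(confidences, candidate_sets, k):
    """Split instance indices into (pseudo_labeled, pseudo_unlabeled).

    confidences[i][c] is the model score of class c for instance i;
    candidate_sets[i] is the set of candidate labels of instance i.
    """
    num_classes = len(confidences[0])
    per_class = {c: [] for c in range(num_classes)}
    for i, (scores, cands) in enumerate(zip(confidences, candidate_sets)):
        # Tentative label: best-scoring class among the candidate labels.
        label = max(cands, key=lambda c: scores[c])
        per_class[label].append((scores[label], i))

    pseudo_labeled = {}
    for c, items in per_class.items():
        items.sort(reverse=True)  # most confident first
        for _, i in items[:k]:    # keep at most k instances per class
            pseudo_labeled[i] = c
    pseudo_unlabeled = [i for i in range(len(confidences))
                        if i not in pseudo_labeled]
    return pseudo_labeled, pseudo_unlabeled

# Toy scores over 3 classes (hypothetical, for illustration only).
confidences = [[0.9, 0.1, 0.0], [0.6, 0.4, 0.0], [0.2, 0.7, 0.1],
               [0.1, 0.3, 0.6], [0.5, 0.45, 0.05]]
candidate_sets = [{0, 1}, {0, 1}, {1, 2}, {0, 2}, {0, 1, 2}]

labeled, unlabeled = split_pseudo_labeled(confidences, candidate_sets, k=1)
print(labeled, unlabeled)
```

With k = 0 the pseudo-labeled set is empty, and with k at least as large as the biggest class every instance becomes pseudo-labeled, which mirrors the two extreme settings discussed above.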

Efficiency Comparison To examine the efficiency of PLSP, we compare PLSP with PiCO on all benchmarks with USS. We compare the overall time costs of the pre-training and training stages respectively, and perform experiments

| Dataset | PiCO Pretrain | PiCO Train | PLSP Pretrain | PLSP Train |
|---|---|---|---|---|
| FMNIST | - | 24,400s | 60s | 10,920s |
| CIFAR-10 | - | 38,000s | 107s | 25,000s |
| CIFAR-100 | - | 55,200s | 299s | 44,000s |

Table 4: Time cost (second, s) of PLSP and PiCO on Fashion-MNIST (FMNIST), CIFAR-10 and CIFAR-100 with USS.

Figure 2: Sensitivity analysis of the number of pseudo-labeled instances per-class k on CIFAR-10 with FPS (q = 0.7).

with the suggested settings for all methods and benchmarks. Table 4 shows the running time results averaged over 10 runs. As is shown: (1) The additional disambiguation-free pre-training stage of PLSP is very efficient, taking at most 299s. (2) Moreover, in contrast to PiCO, PLSP empirically converges faster thanks to the more reliable supervision from the SS learning perspective (250 epochs for PLSP vs. 800 epochs for PiCO), and costs less time in practice during the training stage.
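The arithmetic behind these two observations can be checked directly from the Table 4 entries. The sketch below assumes that PiCO has no separate pre-training stage, so its total cost equals its training cost; the helper name `summarize` is illustrative.

```python
# Seconds reported in Table 4: (PiCO train, PLSP pretrain, PLSP train).
# Assumption: PiCO has no pre-training stage, so its total is its train time.
table4 = {
    "FMNIST": (24400, 60, 10920),
    "CIFAR-10": (38000, 107, 25000),
    "CIFAR-100": (55200, 299, 44000),
}

def summarize(pico_train, plsp_pretrain, plsp_train):
    plsp_total = plsp_pretrain + plsp_train
    return {
        "plsp_total_s": plsp_total,
        # Fraction of the PLSP budget spent on pre-training.
        "pretrain_share": plsp_pretrain / plsp_total,
        # PLSP total cost relative to PiCO (below 1 means PLSP is cheaper).
        "total_vs_pico": plsp_total / pico_train,
    }

summary = {name: summarize(*row) for name, row in table4.items()}
for name, s in summary.items():
    print(f"{name}: total {s['plsp_total_s']}s, "
          f"pretrain share {s['pretrain_share']:.2%}, "
          f"vs PiCO {s['total_vs_pico']:.2f}x")
```

On all three benchmarks the pre-training stage accounts for less than 1% of the total PLSP time, and the total remains below the PiCO training time.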

In this work, we develop a novel PL learning method named PLSP, which resolves the PL learning problem from the semi-supervised perspective with strong SS learning techniques. We implement the SS learning strategy by selecting high-confidence partially-labeled instances as pseudo-labeled instances and treating the remaining ones as pseudo-unlabeled. We design a semantic consistency regularization with respect to the semantic-transformed weakly- and strongly-augmented instances, and derive its approximation form for efficient optimization. Empirical results demonstrate the superior performance of PLSP compared with the existing PL learning baselines, especially on high ambiguity levels.

Acknowledgments We would like to acknowledge support for this project from the National Key R&D Program of China (No.2021ZD0112501, No.2021ZD0112502), the National Natural Science Foundation of China (NSFC) (No.62276113, No.62006094, No.61876071), the Key R&D Projects of Science and Technology Department of Jilin Province of China (No.20180201003SF, No.20190701031GH).

References
Berthelot, D.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Sohn, K.; Zhang, H.; and Raffel, C. 2020. ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring. In ICLR.
Berthelot, D.; Carlini, N.; Goodfellow, I. J.; Papernot, N.; Oliver, A.; and Raffel, C. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. In NeurIPS, 5050-5060.
Chen, C.; Patel, V. M.; and Chellappa, R. 2018. Learning from Ambiguously Labeled Face Images. IEEE TPAMI, 40(7): 1653-1667.
Cour, T.; Sapp, B.; Jordan, C.; and Taskar, B. 2009. Learning from Ambiguously Labeled Images. In IEEE CVPR, 919-926.
Cour, T.; Sapp, B.; and Taskar, B. 2011. Learning from Partial Labels. JMLR, 12(5): 1501-1536.
Cubuk, E. D.; Zoph, B.; Mané, D.; Vasudevan, V.; and Le, Q. V. 2019. AutoAugment: Learning Augmentation Strategies From Data. In IEEE CVPR, 113-123.
Cubuk, E. D.; Zoph, B.; Shlens, J.; and Le, Q. 2020. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In NeurIPS, 18613-18624.
Devries, T.; and Taylor, G. W. 2017. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv preprint arXiv:1708.04552.
Feng, L.; and An, B. 2018. Leveraging Latent Label Distributions for Partial Label Learning. In IJCAI, 2107-2113.
Feng, L.; and An, B. 2019a. Partial Label Learning by Semantic Difference Maximization. In IJCAI, 2294-2300.
Feng, L.; and An, B. 2019b. Partial Label Learning with Self-Guided Retraining. In AAAI, 3542-3549.
Feng, L.; Lv, J.; Han, B.; Xu, M.; Niu, G.; Geng, X.; An, B.; and Sugiyama, M. 2020. Provably Consistent Partial-Label Learning. In NeurIPS, 10948-10960.
Krizhevsky, A. 2016. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto.
Laine, S.; and Aila, T. 2017. Temporal Ensembling for Semi-Supervised Learning. In ICLR.
Li, C.; Li, X.; Feng, L.; and Ouyang, J. 2022. Who Is Your Right Mixup Partner in Positive and Unlabeled Learning. In ICLR.
Li, C.; Li, X.; and Ouyang, J. 2020. Learning with Noisy Partial Labels by Simultaneously Leveraging Global and Local Consistencies. In ACM CIKM, 725-734.

Li, C.; Li, X.; and Ouyang, J. 2021. Semi-Supervised Text Classification with Balanced Deep Representation Distributions. In ACL-IJCNLP, 5044-5053.
Li, C.; Li, X.; Ouyang, J.; and Wang, Y. 2021. Detecting the Fake Candidate Instances: Ambiguous Label Learning with Generative Adversarial Networks. In ACM CIKM, 903-912.
Li, J.; Socher, R.; and Hoi, S. C. H. 2020. DivideMix: Learning with Noisy Labels as Semi-supervised Learning. In ICLR.
Luo, J.; and Orabona, F. 2010. Learning from Candidate Labeling Sets. In NeurIPS, 1504-1512.
Lv, J.; Xu, M.; Feng, L.; Niu, G.; Geng, X.; and Sugiyama, M. 2020. Progressive Identification of True Labels for Partial Label Learning. In ICML, 6500-6510.
Miyato, T.; Maeda, S.; Koyama, M.; and Ishii, S. 2019. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE TPAMI, 41(8): 1979-1993.
Ni, P.; Zhao, S.; Dai, Z.; Chen, H.; and Li, C. 2021. Partial Label Learning via Conditional-Label-Aware Disambiguation. JCST, 36(3): 590-605.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; VanderPlas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. JMLR, 12: 2825-2830.
Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.; Cubuk, E. D.; Kurakin, A.; and Li, C. 2020. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. In NeurIPS, 596-608.
Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 1195-1204.
Wang, H.; Xiao, R.; Li, Y.; Feng, L.; Niu, G.; and Zhao, J. 2022a. PiCO: Contrastive Label Disambiguation for Partial Label Learning. In ICLR.
Wang, Y.; Huang, G.; Song, S.; Pan, X.; Xia, Y.; and Wu, C. 2022b. Regularizing Deep Networks With Semantic Data Augmentation. IEEE TPAMI, 44(7): 3733-3748.
Wang, Y.; Pan, X.; Song, S.; Zhang, H.; Huang, G.; and Wu, C. 2019. Implicit Semantic Data Augmentation for Deep Networks. In NeurIPS, 12614-12623.
Wen, H.; Cui, J.; Hang, H.; Liu, J.; Wang, Y.; and Lin, Z. 2021. Leveraged Weighted Loss for Partial Label Learning. In ICML, 11091-11100.
Wu, D.; Wang, D.; and Zhang, M. 2022. Revisiting Consistency Regularization for Deep Partial Label Learning. In ICML, 24212-24225.
Wu, X.; and Zhang, M. 2018. Towards Enabling Binary Decomposition for Partial Label Learning. In IJCAI, 2868-2874.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747.

Xie, Q.; Dai, Z.; Hovy, E. H.; Luong, T.; and Le, Q. 2020. Unsupervised Data Augmentation for Consistency Training. In NeurIPS, 6256-6268.
Xu, N.; Lv, J.; and Geng, X. 2019. Partial Label Learning via Label Enhancement. In AAAI, 5557-5564.
Xu, N.; Qiao, C.; Geng, X.; and Zhang, M. 2021. Instance-Dependent Partial Label Learning. In NeurIPS, 27119-27130.
Yan, Y.; and Guo, Y. 2020. Multi-Level Generative Models for Partial Label Learning with Non-random Label Noise. arXiv preprint arXiv:2005.05407.
Zeng, Z.; Xiao, S.; Jia, K.; Chan, T.; Gao, S.; Xu, D.; and Ma, Y. 2013. Learning by Associating Ambiguously Labeled Images. In IEEE CVPR, 708-715.
Zhang, B.; Wang, Y.; Hou, W.; Wu, H.; Wang, J.; Okumura, M.; and Shinozaki, T. 2021. FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. In NeurIPS, 18408-18419.
Zhang, F.; Feng, L.; Han, B.; Liu, T.; Niu, G.; Qin, T.; and Sugiyama, M. 2022. Exploiting Class Activation Value for Partial-Label Learning. In ICLR.
Zhang, M.; and Yu, F. 2015. Solving the Partial Label Learning Problem: An Instance-Based Approach. In IJCAI, 4048-4054.
Zhang, M.; Yu, F.; and Tang, C. 2017. Disambiguation-Free Partial Label Learning. IEEE TKDE, 29(10): 2155-2167.
Zhang, M.; Zhou, B.; and Liu, X. 2016. Partial Label Learning via Feature-Aware Disambiguation. In ACM SIGKDD, 1335-1344.
Zhang, M.; Zhou, B.; and Liu, X. 2019. Adaptive Graph Guided Disambiguation for Partial Label Learning. In ACM SIGKDD, 83-91.
Zhang, Y.; Yang, G.; Zhao, S.; Ni, P.; Lian, H.; Chen, H.; and Li, C. 2020. Partial Label Learning via Generative Adversarial Nets. In ECAI, 1674-1681.