# partial_label_learning_with_selfguided_retraining__86a59d16.pdf

The Thirty-Third AAAI Conference on Artiﬁcial Intelligence (AAAI-19)

Partial Label Learning with Self-Guided Retraining

Lei Feng,1,2 Bo An1

1School of Computer Science and Engineering, Nanyang Technological University, Singapore 2Alibaba-NTU Singapore Joint Research Institute, Singapore feng0093@e.ntu.edu.sg, boan@ntu.edu.sg

Partial label learning deals with the problem where each training instance is assigned a set of candidate labels, only one of which is correct. This paper provides the ﬁrst attempt to leverage the idea of self-training for dealing with partially labeled examples. Speciﬁcally, we propose a uniﬁed formulation with proper constraints to train the desired model and perform pseudo-labeling jointly. For pseudo-labeling, unlike traditional self-training that manually differentiates the ground-truth label with enough high conﬁdence, we introduce the maximum inﬁnity norm regularization on the modeling outputs to automatically achieve this consideratum, which results in a convex-concave optimization problem. We show that optimizing this convex-concave problem is equivalent to solving a set of quadratic programming (QP) problems. By proposing an upper-bound surrogate objective function, we turn to solving only one QP problem for improving the optimization efﬁciency. Extensive experiments on synthesized and real-world datasets demonstrate that the proposed approach signiﬁcantly outperforms the state-of-the-art partial label learning approaches.

Introduction In partial label (PL) learning, each training example is represented by a single instance (feature vector) while associated with a set of candidate labels, only one of which is the ground-truth label. This learning paradigm is also termed as superset label learning (Liu and Dietterich 2012; 2014; H ullermeier and Cheng 2015; Gong et al. 2018) or ambiguous label learning (H ullermeier and Beringer 2006; Zeng et al. 2013; Chen et al. 2014; Chen, Patel, and Chellappa 2017). Since manually labeling the ground-truth label of each instance could incur unaffordable monetary or time cost, partial label learning has various application domains, such as web mining (Luo and Orabona 2010), image annotation (Cour, Sapp, and Taskar 2011; Zeng et al. 2013), and ecoinformatics (Liu and Dietterich 2012). Formally, let X Rn be the n-dimensional feature space and Y = {1, 2, , l} be the corresponding label space with l labels. Suppose the PL training set is denoted by D = {xi, Si}m i=1 where xi X is an n-dimensional feature vector and Si denotes the candidate label set. The key

Copyright c 2019, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

assumption of partial label learning lies in that the groundtruth label for xi is concealed in its candidate label set Si. The task of partial label learning is to learn a function f : X Y from the PL training set D, to correctly predict the label of a test instance. Obviously, the available labeling information in the PL training set is ambiguous, as the ground-truth label is concealed in the candidate label set. Hence the key for accomplishing the task of learning from PL examples is how to disambiguate the candidate labels. Based on the employed strategy, existing approaches can be roughly grouped into two categories, including the average-based strategy and the identiﬁcation-based strategy. The average-based strategy assumes that each candidate label makes equal contributions to the model training, and the prediction is made by averaging their modeling outputs (H ullermeier and Beringer 2006; Cour, Sapp, and Taskar 2011; Zhang, Zhou, and Liu 2016). The identiﬁcation-based strategy considers the ground-truth label as a latent variable, which is identiﬁed by an iterative reﬁning procedure (Jin and Ghahramani 2003; Nguyen and Caruana 2008; Liu and Dietterich 2012; Yu and Zhang 2016). Although these approaches are able to extract the relative labeling conﬁdence of each candidate label, they fail to reﬂect the mutually exclusive relationships among different candidate labels. Motivated by self-training that takes into account such mutually exclusive relationships by directly labeling an unlabeled instance with enough high conﬁdence, this paper gives the ﬁrst attempt to leverage the similar idea to deal with PL instances. A straightforward method is to ﬁrst apply a multi-output model on the PL examples, then pick up the candidate label with enough high conﬁdence as the groundtruth label, ﬁnally retrain the model on the resulting data. This process is repeated until no PL examples exist, or no PL examples can be picked up as the ground-truth label. Although this method is intuitive, the model learned from PL examples are probably hard to directly identify the groundtruth label in accordance with the modeling outputs, as candidate label sets exist. Furthermore, the incorrectly identiﬁed labels could have contagiously negative impacts on the ﬁnal predictions. To address this problem, we propose a novel partial label learning approach named SURE (Self-g Uided REtraining). Speciﬁcally, we propose a novel uniﬁed formulation with

proper constraints to train the desired model and perform pseudo-labeling jointly. Unlike traditional self-training that manually differentiates the ground-truth label with enough high conﬁdence, we introduce the maximum inﬁnity norm regularization on the modeling outputs to automatically perform pseudo-labeling. In this way, the pseudo labels are decided by balancing the minimum approximation loss and the maximum inﬁnity norm. To optimize the objective function, a convex-concave problem is encountered, as a result of the maximum inﬁnity norm regularization. We show that solving this convex-concave problem is equivalent to solving a set of quadratic programming problems. By proposing an upper-bound surrogate objective function, we turn to solving only one quadratic programming problem for improving the optimization efﬁciency. Extensive experiments on a number of synthesized and real-world datasets clearly demonstrate the advantage of the proposed approach.

Related Work Due to the difﬁculty in dealing with ambiguous labeling information of PL examples, there are only two general strategies to disambiguate the candidate labels, including the average-based strategy and the identiﬁcation-based strategy. The average-based strategy treats each candidate label equally in the model training phase. Following this strategy, some instance-based approaches predict the label y of the test instance x by averaging the candidate labels of its neighbors (H ullermeier and Beringer 2006; Zhang and Yu 2015), i.e., arg maxy Y P

xi N(x) I(yi Si) where N(x) denotes the neighbors of instance x. Besides, some parametric approaches adopt a parametric model F(xi, y; θ) (Cour, Sapp, and Taskar 2011; Zhang, Zhou, and Liu 2016) that differentiates the average modeling output of the candidate labels, i.e., 1 |Si| P

y Si F(xi, y; θ) from that of the non-candidate

labels, i.e., F(xi, ˆy; θ) (ˆy ˆSi), where ˆSi denotes the noncandidate label set. Although this strategy is intuitive, an obvious drawback is that the modeling output of the groundtruth label may be overwhelmed by the distractive outputs of the false positive labels. The identiﬁcation-based strategy considers the groundtruth label as a latent variable, and assumes certain parametric model F(x, y; θ) where the ground-truth label is identiﬁed by arg maxy Si F(xi, y; θ). Generally, the speciﬁc objective function is optimized on the basis of the maximum likelihood criterion: max(Pm i=1 log(P

y Si 1 |Si|F(xi, y; θ))) (Jin and Ghahramani 2003; Liu and Dietterich 2012) or the maximum margin criterion: max(Pm i=1(maxy Si F(xi, y; θ) maxˆy ˆSi F(xi, ˆy; θ))) (Nguyen and Caruana 2008; Yu and Zhang 2016). One potential drawback of this strategy lies in that instead of recovering the ground-truth label, the differentiated label may turn out to be false positive, which could severely disrupt the subsequent model training. Self-training is a commonly used technique for semisupervised learning (Zhu and Goldberg 2009), which is characterized by the fact that the learning process uses its own predictions to teach itself. It has the advantage of taking into account the mutually exclusive relationships among la-

bels by directly labeling an unlabeled instance with enough high conﬁdence. Despite of its simplicity, the early mistakes could be exaggerated by further generating incorrectly labeled data. It is going to be even worse in the PL setting, as the ground-truth label is concealed in the candidate label set. By alleviating the negative effects of self-training with a uniﬁed formulation, a novel partial label learning approach following the identiﬁcation-based strategy will be introduced in the next section.

The Proposed Approach Following the notations in Introduction, we denote by X = [x1, , xm] Rm n the instance matrix and Y = [y1, , ym] {0, 1}m l the corresponding label matrix, where yij = 1 means that the j-th label is a candidate label of the i-th instance xi, otherwise the j-th label is a noncandidate label. By adopting the identiﬁcation-based strategy, we also regard the ground-truth label as latent variable, and denote by P = [p1, , pm] [0, 1]m l the conﬁdence matrix where pij represents the conﬁdence (probability) of the j-th label being the ground-truth label of the i-th instance. Unlike self-training that takes into account the mutually exclusive relationships among the candidate labels by performing deterministic pseudo-labeling, we introduce the maximum inﬁnity norm regularization to automatically achieve this consideratum. A uniﬁed formulation with proper constraints is proposed as follows:

i=1 (L(xi, pi, f) λ pi ) + βΩ(f)

s.t. 0 pij yij, i [m], j [l] (1) Xl

j=1 pij = 1, i [m]

where [m] := {1, 2, , m}, L is the employed loss function, Ωcontrols the complexity of model f, and λ, β are the tradeoff parameters. In this uniﬁed formulation, the model is learned from the pseudo label matrix P, rather than the original noisy label matrix Y. Besides, unlike the way of traditional self-training to perform deterministic pseudo-labeling by picking up the label with enough high conﬁdence, the conﬁdence of the ground-truth label is differentiated and enlarged by trading off the loss and the maximum inﬁnity norm. Intuitively, only within the allowable range of loss, the candidate label with enough high conﬁdence can be identiﬁed as the ground-truth label. In this way, the negative effects of self-training is alleviated by training the model and performing pseudo-labeling jointly. In addition, the ﬁrst constraint plays two roles: the conﬁdence of each candidate label should be larger than 0, but no more than 1; the conﬁdence of each non-candidate label should be strictly 0. The second constraint guarantees that each conﬁdence vector pi will always be in the probability simplex, i.e., {pi [0, 1]l : P j pij = 1|i [m]}. This constraint also implicitly takes into consideration the mutually exclusive relationships among the candidate labels, as the conﬁdence of certain one candidate label is enlarged by the maximum in-

ﬁnity norm regularization, the conﬁdences of other candidate labels will be naturally reduced. To instantiate the above formulation, we adopt the widely used squared loss, i.e., L(xi, pi, f) = f(xi) pi 2 2. Besides, we employ the simple linear model f(xi) = W xi + b where W, b are model parameters. A kernel extension for the general nonlinear case will be introduced in the later section. To control the model parameter, we simply adopt the common squared Frobenius norm of W, i.e., W 2 F . To sum up, the ﬁnal optimization problem is presented as follows:

i=1 ( W xi + b pi 2

2 λ pi ) + β W 2 F

s.t. 0 pij yij, i [m], j [l] (2) Xl

j=1 pij = 1, i [m]

Optimization Problem (2) can be solved by alternating minimization, which enable us to optimize one variable with other variables ﬁxed. This process is repeated until convergence or the maximum number of iterations is reached.

Updating W and b With P ﬁxed, problem (2) with respective to W and b can be compactly stated as follows:

XW + 1b P 2

F + β W 2 F (3)

where 1 denotes the vector with all components set to 1. Setting the gradient with respect to W and b to 0, the closedform solutions can be easily obtained:

W = (X X + βI X 11 X

m ) 1(X P X 11 P

m(P 1 W X 1) (4)

Kernel Extension To deal with the nonlinear case, the above linear learning model can be easily extended to a kernel-based nonlinear model. To achieve this, we utilize a feature mapping φ( ) : Rn RH to map the original feature space x Rn to some higher (maybe inﬁnite) dimensional Hilbert space φ(x) RH. By representor theorem (Sch olkopf, Smola, and others 2002), W can be represented by a linear combination of input variables, i.e., W = φ(X) A where A Rm l stores the combination weights of instances. Hence φ(X)W = KA where K φ(X)φ(X) Rm m is the kernel matrix with each element deﬁned by kij = φ(xi) φ(xj) = κ(xi, xj), and κ( , ) denotes the kernel function. In this paper, Gaussian kernel function κ(xi, xj) = exp( xi xj 2 2 /(2σ2)) is employed with σ set to the averaged pairwise distances of instances. By incorporating such kernel extension, problem (3) can be presented as follows:

KA + 1b P 2

F + β tr(A KA) (5)

where tr( ) is the trace operator. Setting the gradient with respect to A and b to 0, the closed-form solutions are reported as:

A = (K + βI 11 K

m ) 1(P 11 P

m(P 1 A K 1) (6)

By adopting this kernel extension, we choose to update the parameters A and b throughout this paper.

Updating P With A and b ﬁxed, the modeling output matrix Q = [q1, , qm] Rm l is denoted by Q = φ(X)W + 1b = KA + 1b , problem (2) reduces to:

i=1 ( pi qi 2 2 λ pi )

s.t. 0 pij yij, i [m], j [l] (7) Xl

j=1 pij = 1, i [m]

Obviously, we can solve problem (7) by solving m independent problems, one for each example. We further denote by OP the minimum loss of the problem for the i-th example:

OP = min pi pi qi 2 2 λ pi

s.t. 1 pi = 1 (8) 0 pi yi

Here, problem (8) is a constrained convex-concave problem, as the ﬁrst term is convex while the second term is concave. Instead of using traditional time-consuming convex-concave procedure (Yuille and Rangarajan 2003) to solve this problem, we show that optimizing this problem is equivalent to solving l independent QP problems, each for one label. We denoted by OPI(j) the minimum loss of the problem for the j-th label:

OPI(j) = min pi pi qi 2 2 λpij

s.t. pik pij, k [l] (9)

1 pi = 1 0 pi yi

Theorem 1. OP = minj [l] OPI(j).

Proof. It is obvious that there must exist j [l] such that pij = pi and the optimum loss OP of problem (8) can be obtained. In addition, if pij = pi coincidentally holds, then OPI(j) = OP, as in such case, problem (9) is equivalent to problem (8). While if pij = pi , then OPI(j) > OP. Hence OP = minj [l] OPI(j).

Theorem 1 gives us a motivation to solve problem (8) by selecting the minimum loss from l independent quadratic programming problems. However, this may be time-consuming, as the label space could be very large.

Thus, we propose a surrogate objective function to upper bound the loss incurred by problem (8). Speciﬁcally, we select the candidate label j with the maximal modeling output by j = arg maxj Si qij where Si is the candidate label set containing the indices of candidate labels of the instance xi. The proposed surrogate objective function is given as:

OPS = min pi pi qi 2 2 λpij

s.t. qik qij, k Si, j Si (10) pik pij, k [l]

1 pi = 1 0 pi yi

Note that the difference between problem (10) and problem (9) lies in that problem (10) adds a constraint to select the label j with the maximal modeling output. Unlike problem (9) that considers the possibility of each label being the ground-truth label, problem (10) directly assumes the candidate label j with the maximal modeling output j = arg maxj Si qij to be the label that is most likely the ground-truth label. This assumption coincides with selftraining, which also considers the label with the maximal modeling output as the ground-truth label. Different from self-training that manually performs deterministic pseudolabeling, our approach aims to automatically enlarge the conﬁdence of the label with the maximal modeling output as much as possible by balancing the two terms in problem (10). In this way, we not only avoid the opinionated mistakes by self-training, but also take into account the mutually exclusive relationships among candidate labels.

Theorem 2. OP OPS.

Proof. From the formulation of problem (10) and problem (9), it is easy to see that OPS {OPI(j)|j Si}. Since Si [l], OPS {OPI(j)|j [l]}. Which means, minj [l] OPI(j) OPS. Using Theorem (1), OP = minj [l] OPI(j) OPS.

Theorem (2) shows that OPS of problem (10) is an upper bound of the loss OP incurred by problem (8). Hence we can choose to optimize problem (10) for efﬁciency, as only one QP problem is involved. Such problem can be easily solved by any off-the-shelf QP tools. After the completion of the optimization process, the predicted label ey of the text instance ex is given as:

ey = arg max j [l]

i=1 aijκ(ex, xi) + bj (11)

The pseudo code of SURE is presented in Algorithm 1.

Experiments Comparing Algorithms To demonstrate the effectiveness of SURE, we conduct extensive experiments to compare SURE with six state-of-theart partial label learning algorithms, each conﬁgured with suggested parameters according to the respective literature:

Algorithm 1 The SURE Algorithm

D: the PL training set D = {(X, Y)} λ, β: the regularization parameters ex: the unseen test instance Output:

ey: the predicted label for the test instance ex

1: construct the kernel matrix K = [κ(xi, xj)]m m; 2: initialize P = Y; 3: repeat 4: update A = [aij]m l and b = [bj]l according to (6); 5: update Q = KA + 1b ; 6: calculate P by solving (10) with a general QP procedure for each training example; 7: until convergence or the maximum number of iterations. 8: return the predicted label ey according to (11).

PLKNN (H ullermeier and Beringer 2006): a k-nearest neighbor approach that makes predictions by averaging the labeling information of neighboring examples [suggested conﬁguration: k {5, 6, , 10}];

CLPL (Cour, Sapp, and Taskar 2011): a convex formulation that deals with PL examples by transforming partial label learning problem to binary learning problem via feature mapping [suggested conﬁguration: SVM with squared hinge loss];

IPAL (Zhang and Yu 2015): an instance-based approach that disambiguates candidate labels by an adapted label propagation scheme. [suggested conﬁguration: α {0, 0.1, , 1}, k {5, 6, , 10}];

PLSVM (Nguyen and Caruana 2008): a maximum margin approach that learns from PL examples by optimizing margin-based objective function [suggested conﬁguration: λ {10 3, 10 2, , 103}];

PALOC (Wu and Zhang 2018): an approach that adapts one-vs-one decomposition strategy to enable binary decomposition for learning from PL examples [suggested conﬁguration: µ = 10];

LSBCMM (Liu and Dietterich 2012): a maximum likelihood approach that learns from PL examples via mixture models [suggested conﬁguration: L = 10 log2(l) ].

The two parameters λ and β for SURE are chosen from {0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1}. Parameters for each algorithm are selected by ﬁve-fold cross-validation on the training set. For each dataset, ten-fold cross-validation is performed where mean prediction accuracies and the standard deviations are recorded. In addition, we use t-test at 0.05 signiﬁcance level for two independent samples to investigate whether SURE is signiﬁcantly superior/inferior (win/loss) to the comparing algorithms for all the experiments.

Table 1: Characteristics of the controlled UCI datasets.

Dataset deter ecoli glass usps Examples 358 336 214 9298 Features 23 7 9 256 Labels 6 8 6 10 Conﬁgurations: (I) p = 0.1, r = 1, ϵ {0.1, 0.2, , 0.7} (II) r = 1, p {0.1, 0.2, , 0.7} (III) r = 2, p {0.1, 0.2, , 0.7} (IV) r = 3, p {0.1, 0.2, , 0.7}

Controlled UCI Datasets

The characteristics of 4 controlled UCI datasets are reported in Table 1. Following the widely-used controlling protocol (Cour, Sapp, and Taskar 2011; Liu and Dietterich 2012; Zhang and Yu 2015; Wu and Zhang 2018; Feng and An 2018; Wang and Zhang 2018), each UCI dataset can be used to generate artiﬁcial partial label datasets. There are three controlling parameters p, r and ϵ where p controls the proportion of PL examples, r controls the number of false positive labels, and ϵ controls the probability of a speciﬁc false positive label occurring with the ground-truth label. As shown in Table 1, there are 4 conﬁgurations, each corresponding to 7 results. Hence we can totally generate 4 4 7 = 112 different artiﬁcial partial label datasets. Figure 1 shows the classiﬁcation accuracy of each algorithm as ϵ ranges from 0.1 to 0.7 when p = 0.1 and r = 1 (Conﬁguration (I)). In this setting, a speciﬁc label is selected as the coupled label that co-occurs with the ground-truth label with probability ϵ, and any other label would be randomly chosen to be a false positive label with probability 1 ϵ. Figures 2, 3, and 4 illustrate the classiﬁcation accuracy of each algorithm as p ranges from 0.1 to 0.7 when r = 1, 2, and 3 (Conﬁguration (II), (III), and (IV)), respectively. In these three settings, r extra labels are randomly chosen to be the false positive labels. That is, the number of candidate labels for each instance is r + 1. As shown in Figures 1, 2, 3, and 4, SURE outperforms other comparing algorithms in general. To further statistically compare SURE with other algorithms, the detailed win/tie/loss counts between SURE and the comparing algorithms are recorded in Table 2. Out of the 112 results, it is easy to observe that:

SURE achieves superior or at least comparable performance against PLKNN and PLSVM in all cases.

SURE achieves superior performance against CLPL and LSCMM in 72.3% and 58.9% cases while outperformed by them in only 4.5% and 1.8% cases, respectively.

SURE outperforms IPAL and PALOC in 50.9% and 63.4% cases while outperformed by them in only 5.4% and 2.7% cases, respectively.

In summary, the effectiveness of SURE on controlled UCI datasets is demonstrated.

Table 2: Win/tie/loss (t-test at 0.05 signiﬁcance level for two independent samples) counts on the controlled UCI datasets between SURE and the comparing algorithms.

PLKNN CLPL IPAL PLSVM PALOC LSBCMM (I) 24/4/0 21/6/1 14/11/3 21/7/0 20/8/0 14/14/0 (II) 24/4/0 19/7/2 15/13/0 24/4/0 17/10/1 17/10/1 (III) 24/4/0 21/6/1 14/13/1 25/3/0 17/10/1 19/9/0 (IV) 25/3/0 20/7/1 14/12/2 27/1/0 17/10/1 16/11/1 Total 97/15/0 81/26/5 57/49/6 97/15/0 71/38/3 66/44/2

Real-World Datasets

Table 3 reports the characteristics of real-world partial label datasets1 including Lost (Cour, Sapp, and Taskar 2011), MSRCv2 (Liu and Dietterich 2012), Soccer Player (Zeng et al. 2013), Yahoo! News (Guillaumin, Verbeek, and Schmid 2010), and FG-NET (Panis and Lanitis 2014). These realworld partial label datasets are from several task domains. For automatic face naming (Lost, Soccer Player, and Yahoo! News), each face (instance) is cropped from an image or a video frame, and the names appearing on the corresponding captions or subtitles are taken as candidate labels. For facial age estimation (FG-NET), human faces are regarded as instances while ages annotated by crowdsourcing labelers serve as candidate labels. For object classiﬁcation (MSRCv2), each image segment is considered as an instance, and objects appearing in the same image are taken as candidate labels. The average number of candidate labels (Avg. CLs) per instance is also reported in Table 3. Table 4 reports the mean classiﬁcation accuracy as well as the standard deviation of each algorithm on each realworld dataset. Note that the average number of candidate labels (Avg. CLs) of FG-NET dataset is quite large, which results in an extremely low classiﬁcation accuracy of each algorithm. For better evaluation of this facial age estimation task, we employ conventional mean absolute error (MAE) (Zhang, Zhou, and Liu 2016) to conduct two extra experiments. Speciﬁcally, for FG-NET (MAE3/MAE5), a test example is considered correctly classiﬁed if the MAE between the predicted age and the ground-truth age is no more than 3/5 years. As shown in Table 4, we can observe that:

SURE signiﬁcantly outperforms PLKNN on all the realworld datasets.

Out of the 42 cases (6 comparing algorithms and 7 datasets), SURE signiﬁcantly outperforms all the comparing algorithms in 78.6% cases, and achieves competitive performance in 21.4% cases.

It is worth noting that SURE is never signiﬁcantly outperformed by any comparing algorithms.

These experimental results on real-world datasets also demonstrate the effectiveness of SURE.

Further Analysis

1These datasets are publicly available at: http://cse.seu.edu.cn/ Personal Page/zhangml/

Figure 1: Classiﬁcation performance on controlled UCI datasets with ϵ ranging from 0.1 to 0.7 (p = 1, r = 1).

Figure 2: Classiﬁcation performance on controlled UCI datasets with p ranging from 0.1 to 0.7 (r = 1).

Figure 3: Classiﬁcation performance on controlled UCI datasets with p ranging from 0.1 to 0.7 (r = 2).

Figure 4: Classiﬁcation performance on controlled UCI datasets with p ranging from 0.1 to 0.7 (r = 3).

Table 3: Characteristics of real-world partial label datasets.

Dataset Examples Features Labels Avg. CLs Task Domain Lost 1122 108 16 2.23 automatic face naming (Panis and Lanitis 2014) MSRCv2 1758 48 23 3.16 object classiﬁcation (Liu and Dietterich 2012) Soccer Player 17472 279 171 2.09 automatic face naming (Zeng et al. 2013) Yahoo! News 22991 163 219 1.91 automatic face naming (Guillaumin, Verbeek, and Schmid 2010) FG-NET 1002 262 78 7.48 facial age estimation (Panis and Lanitis 2014)

Parameter Sensitivity Analysis There are two tradeoff parameters λ and β for SURE, which should be manually

searched in advance. Hence this section studies how λ and β inﬂuence the prediction accuracy produced by SURE. We

Table 4: Classiﬁcation accuracy of each algorithm on the real-world datasets. Furthermore, / indicates whether SURE is statistically superior/inferior to the comparing algorithm (t-test at 0.05 signiﬁcance level for two independent samples).

SURE PLKNN CLPL IPAL PLSVM PALOC LSBCMM Lost 0.781 0.039 0.432 0.051 0.742 0.038 0.678 0.053 0.729 0.042 0.629 0.056 0.693 0.035 MSRCv2 0.515 0.027 0.417 0.034 0.413 0.041 0.529 0.039 0.461 0.046 0.479 0.042 0.473 0.037 Soccer Player 0.533 0.017 0.495 0.018 0.368 0.010 0.541 0.016 0.464 0.011 0.537 0.015 0.498 0.017 Yahoo! News 0.644 0.015 0.483 0.011 0.462 0.009 0.609 0.011 0.629 0.010 0.625 0.005 0.645 0.005 FG-NET 0.078 0.021 0.039 0.018 0.063 0.027 0.054 0.030 0.063 0.029 0.065 0.019 0.059 0.025 FG-NET(MAE3) 0.458 0.024 0.269 0.045 0.458 0.022 0.362 0.034 0.356 0.022 0.435 0.018 0.382 0.029 FG-NET(MAE5) 0.615 0.019 0.438 0.053 0.596 0.017 0.540 0.033 0.479 0.016 0.609 0.043 0.532 0.038

(a) Varying λ on Lost

(b) Varying β on Lost

(c) Convergence on Lost

(d) Convergence on MSRCv2

Figure 5: Parameter sensitivity and convergence analysis for SURE. (a) Sensitivity analysis of λ on Lost; (b) Sensitivity analysis of β on MSRCv2; (c) Convergence analysis on Lost; (d) Convergence analysis on MSRCv2.

vary one parameter, while keeping the other ﬁxed at the best setting. Figures 5(a) and 5(b) show the performance of SURE on the Lost dataset given different values of λ and β respectively. Note that λ controls the importance of the maximum inﬁnity norm regularization. When λ is very small, the mutually exclusive relationships among labels are hardly considered, thus the classiﬁcation accuracy would be at a low level. As λ increases, we start to take into consideration such exclusive relationships, and the classiﬁcation accuracy increases. However, if λ is sufﬁciently large, the classiﬁcation accuracy will drop dramatically. This is because when we overly concentrate on the mutually exclusive relationships among labels, we will directly regard the candidate label that has the maximal modeling output as the ground-truth label. Since to maximize the inﬁnity norm p is overly important, the approximation loss will be totally ignored. From the above, we can draw a conclusion that it would be better to balance the approximation loss and the mutually exclusive relationships among labels. Such conclusion clearly comﬁrms the effectiveness of the SURE approach. Another

tradeoff parameter β aims to control the model complexity. The classiﬁcation accuracy curve of varying β obviously accords with our cognition that it is important to balance between overﬁtting and underﬁtting.

Illustration of Convergence We illustrate the convergece of SURE by using the difference of the optimization variable P between two successive iterations ( P = P(t+1) P(t) F ). Figure 5(c) and 5(d) show the convergence curves of SURE on Lost and MSRCv2 respectively. It is apparent that P gradually decreases to 0 as the number of iterations t increases. Hence the convergence of SURE is demonstrated.

Conclusion In this paper, we utilize the idea of self-training to exaggerate the mutually exclusive relationships among candidate labels for further enhancing partial label learning performance. Instead of manually performing pseudo-labeling after model training, we propose a uniﬁed formulation (named SURE) with the maximum inﬁnity norm regularization to train the desired model and perform pseudo-labeling jointly. Extensive experimental results demonstrate the effectiveness of SURE. Since self-training is a typical semi-supervised learning method, it would be interesting to extend SURE to the setting of semi-supervised learning. Besides, as mutually exclusive relationships exist in general multi-class problems, it would be valuable to explore other possible ways to incorporate such relationships into partial label learning.

Acknowledgements This work was supported by MOE, NRF, and NTU.

References Chen, Y.-C.; Patel, V. M.; Chellappa, R.; and Phillips, P. J. 2014. Ambiguously labeled learning using dictionaries. IEEE Transactions on Information Forensics and Security 9(12):2076 2088. Chen, C.-H.; Patel, V. M.; and Chellappa, R. 2017. Learning from ambiguously labeled face images. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cour, T.; Sapp, B.; and Taskar, B. 2011. Learning from partial labels. Journal of Machine Learning Research 12(5):1501 1536.

Feng, L., and An, B. 2018. Leveraging latent label distributions for partial label learning. In Proceedings of the 27th International Joint Conference on Artiﬁcial Intelligence, 2107 2113. Gong, C.; Liu, T.; Tang, Y.; Yang, J.; Yang, J.; and Tao, D. 2018. A regularization approach for instance-based superset label learning. IEEE Transactions on Cybernetics 48(3):967 978. Guillaumin, M.; Verbeek, J.; and Schmid, C. 2010. Multiple instance metric learning from automatically labeled bags of faces. Lecture Notes in Computer Science 63(11):634 647. H ullermeier, E., and Beringer, J. 2006. Learning from ambiguously labeled examples. Intelligent Data Analysis 10(5):419 439. H ullermeier, E., and Cheng, W. 2015. Superset learning based on generalized loss minimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 260 275. Jin, R., and Ghahramani, Z. 2003. Learning with multiple labels. In Advances in Neural Information Processing Systems, 921 928. Liu, L.-P., and Dietterich, T. G. 2012. A conditional multinomial mixture model for superset label learning. In Advances in Neural Information Processing Systems, 548 556. Liu, L.-P., and Dietterich, T. 2014. Learnability of the superset label learning problem. In International Conference on Machine Learning, 1629 1637. Luo, J., and Orabona, F. 2010. Learning from candidate labeling sets. In Advances in Neural Information Processing Systems, 1504 1512. Nguyen, N., and Caruana, R. 2008. Classiﬁcation with partial labels. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 551 559. Panis, G., and Lanitis, A. 2014. An overview of research activities in facial age estimation using the fg-net aging database. In European Conference on Computer Vision, 737 750. Sch olkopf, B.; Smola, A. J.; et al. 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press. Wang, J., and Zhang, M.-L. 2018. Towards mitigating the class-imbalance problem for partial label learning. In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2427 2436. Wu, X., and Zhang, M.-L. 2018. Towards enabling binary decomposition for partial label learning. In Proceedings of the 27th International Joint Conference on Artiﬁcial Intelligence, 2427 2436. Yu, F., and Zhang, M.-L. 2016. Maximum margin partial label learning. In Proceedings of Asian Conference on Machine Learning, 96 111. Yuille, A. L., and Rangarajan, A. 2003. The concave-convex procedure. Neural Computation 15(4):915 936.

Zeng, Z.-N.; Xiao, S.-J.; Jia, K.; Chan, T.-H.; Gao, S.-H.; Xu, D.; and Ma, Y. 2013. Learning by associating ambiguously labeled images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 708 715. Zhang, M.-L., and Yu, F. 2015. Solving the partial label learning problem: An instance-based approach. In Proceedings of the 24th International Joint Conference on Artiﬁcial Intelligence, 4048 4054. Zhang, M.-L.; Zhou, B.-B.; and Liu, X.-Y. 2016. Partial label learning via feature-aware disambiguation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1335 1344. Zhu, X., and Goldberg, A. B. 2009. Introduction to semisupervised learning. Synthesis Lectures on Artiﬁcial Intelligence and Machine Learning 3(1):1 130.