# Positive and Unlabeled Learning with Label Disambiguation

Chuang Zhang¹, Dexin Ren¹, Tongliang Liu³, Jian Yang¹,², and Chen Gong¹

¹PCA Lab, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China
²Jiangsu Key Lab of Image and Video Understanding for Social Security
³UBTECH Sydney AI Centre, SCS, FEIT, The University of Sydney, Australia

chen.gong@njust.edu.cn

## Abstract

Positive and Unlabeled (PU) learning aims to learn a binary classifier from only positive and unlabeled training data. State-of-the-art methods usually formulate PU learning as a cost-sensitive learning problem, in which every unlabeled example is simultaneously treated as positive and negative with different class weights. However, the ground-truth label of an unlabeled example should be unique, so the existing models inadvertently introduce label noise, which may lead to a biased classifier and deteriorated performance. To solve this problem, this paper proposes a novel algorithm dubbed Positive and Unlabeled learning with Label Disambiguation (PULD). We first regard all the unlabeled examples in PU learning as ambiguously labeled as positive and negative, and then employ a margin-based label disambiguation strategy, which enlarges the margin of classifier response between the most likely label and the less likely one, to find the unique ground-truth label of each unlabeled example. Theoretically, we derive the generalization error bound of the proposed method by analyzing its Rademacher complexity. Experimentally, we conduct intensive experiments on both benchmark and real-world datasets, and the results clearly demonstrate the superiority of the proposed PULD over existing PU learning approaches.

## 1 Introduction

Collecting a large amount of labeled data, especially labeled negative data, is a critical bottleneck in many real-world machine learning applications due to laborious manual annotation or the difficulty of collecting negative data. In contrast, positive and unlabeled data can often be collected easily. This has led to the development of Positive and Unlabeled (PU) learning [Denis et al., 2005], which aims at learning a binary classifier only from positive and unlabeled data without the assistance of negative data. Recently, PU learning has gained much popularity in tackling many real-world scenarios such as software clone detection [Wei and Li, 2018], protein function prediction [Youngs et al., 2014; Fu et al., 2016], and remote-sensed hyperspectral image classification [Li et al., 2011].

Given its broad applicability, PU learning has attracted a great deal of research attention in recent years, and the effective algorithms developed so far can be roughly divided into three categories. The first category [Liu et al., 2002; Li and Liu, 2003] first identifies some reliable negative examples from the unlabeled set, and then uses these reliable negative examples together with the original positive examples to train a traditional supervised classifier. The second category [Lee and Liu, 2003; Shi et al., 2018] treats the PU learning problem as a one-sided label noise learning problem, which directly regards the unlabeled examples as negative ones.
Particularly, the undiscovered positive examples in the unlabeled set are regarded as mislabeled, while no negative examples are mislabeled as positive. The last category [Elkan and Noto, 2008; Youngs et al., 2015; Du Plessis et al., 2014; Du Plessis et al., 2015] imposes different weights on the loss values incurred by simultaneously treating each unlabeled example as positive and negative, thus transforming PU learning into a cost-sensitive learning problem. However, the ground-truth label of each unlabeled example is definite and unique, so simultaneously evaluating every unlabeled example as positive and negative under the framework of empirical risk minimization inevitably introduces label noise, leading to imperfect performance.

To address this drawback, this paper proposes a novel PU learning algorithm dubbed Positive and Unlabeled learning with Label Disambiguation (PULD). To be specific, we first treat the PU learning problem as a Partial Label Learning (PLL) [Jin and Ghahramani, 2003] problem, where the unlabeled examples are regarded as ambiguously labeled as positive and negative, and then employ the disambiguation technique widely used in PLL to robustly determine the real label of every unlabeled example. PLL aims to learn from ambiguous labeling information where each training example is associated with a set of candidate labels, among which only one is valid [Cour et al., 2011; Gong et al., 2018; Feng and An, 2018; Chen et al., 2018]. Recent successful PLL methods have devised various disambiguation regularizers to identify the single ground-truth label from the candidate label set associated with each training example [Nguyen and Caruana, 2008; Wu and Zhang, 2018], which play an important role in boosting PLL performance.

In our PU learning case, we take each unlabeled example as a partially labeled example with the candidate label set {1, 2}, and then utilize a margin-based disambiguation strategy to enlarge the margin between the most likely label and the less likely one. As a result, the ground-truth label in the candidate label set can be effectively highlighted. Besides, a manifold regularizer encoding the geometric knowledge of the original data is applied [Zhou et al., 2004; Belkin et al., 2006; Gong et al., 2019b], so that similar examples in the feature space are assigned similar labels. Furthermore, we theoretically derive the generalization error bound of the proposed PULD by analyzing its Rademacher complexity, and intensive experiments on both benchmark and real-world datasets clearly demonstrate the superiority of the proposed method over state-of-the-art PU learning approaches.

## 2 Our Method

In this section, we first establish the proposed PULD model and then explain its iterative optimization process.

### 2.1 Model Establishment

Assume that there are $n$ training examples $S = \{(x_1, y_1), \dots, (x_p, y_p), (x_{p+1}, y_{p+1}), \dots, (x_n, y_n)\}$ with $n = p + u$, drawn identically and independently from some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{1, 2\}$ denote the feature space and label space, respectively. For the convenience of the subsequent derivations, we follow the setting of multi-class classification, where negative examples are labeled as 1 and positive examples are labeled as 2.
In $S$, the first $p$ elements with labels $y_i = 2$ ($i = 1, \dots, p$) constitute the positive set $P$, and the remaining $u$ elements form the unlabeled set $U$, in which the label of each example is either 1 or 2 but unknown. In the training stage, our target is to find a real-valued scoring function $h: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ based on the training set $S = P \cup U$, and classification is performed according to the score, i.e., $\arg\max_{y \in \mathcal{Y}} h(x, y)$.

Firstly, we build a KNN graph $G = \langle V, E \rangle$ over the training set $S$, where $V$ is the vertex set consisting of all $x_i \in S$, and $E$ is the edge set describing the similarity between these nodes. Further, $A$ is the adjacency matrix of $G$, whose $(i, j)$-th element encodes the similarity between $x_i$ and $x_j$: if $x_i$ and $x_j$ are linked, $A_{ij} = \exp\!\left(-\|x_i - x_j\|^2 / (2\delta^2)\right)$ with $\delta$ being the Gaussian kernel width, and $A_{ij} = 0$ otherwise. The normalized adjacency matrix adopted by us is $\bar{A} = D^{-1/2} A D^{-1/2}$, with $D$ being the diagonal degree matrix whose $(i, i)$-th element is $D_{ii} = \sum_j A_{ij}$. Moreover, we define the mapping $\Phi(x, y): \mathcal{X} \times \mathcal{Y} \mapsto \mathcal{F}$ which projects the example-label pair $(x, y)$ to $\Phi(x, y) \in \mathcal{F}$, namely

$$\Phi(x, y) = \big[x^\top \mathbb{1}(y = 1),\; x^\top \mathbb{1}(y = 2)\big]^\top \in \mathbb{R}^{2d},$$

with $d$ being the dimensionality of the input feature $x$, and $\mathbb{1}(a)$ being an indicator function which returns 1 if $a$ is true and 0 otherwise.

Let $y = (y_1, \dots, y_p, y_{p+1}, \dots, y_n)$ be the ground-truth labels of the training examples. Then, given the classifier $h(x, y) = \omega^\top \Phi(x, y)$, and $\xi = \{\xi_i\}_{i=1}^{p}$ and $\eta = \{\eta_i\}_{i=p+1}^{p+u}$ as slack variables, our model is formulated as

$$\min_{\omega, y, \xi, \eta}\ \frac{1}{2}\|\omega\|^2 + \frac{\alpha}{p}\sum_{i=1}^{p}\xi_i + \frac{\beta}{u}\sum_{i=p+1}^{p+u}\eta_i + \gamma\sum_{i,j=1}^{n}\bar{A}_{ij}(y_i - y_j)^2 \tag{2}$$

$$\text{s.t. for } x_i \in P:\quad \omega^\top\Phi(x_i, 2) - \omega^\top\Phi(x_i, 1) \ge 1 - \xi_i,\quad \xi_i \ge 0,\ i = 1, \dots, p; \tag{3}$$

$$\text{for } x_i \in U:\quad \max_{y_i' \in \mathcal{Y}} \omega^\top\Phi(x_i, y_i') - \max_{y_i'' \ne y_i'} \omega^\top\Phi(x_i, y_i'') \ge 1 - \eta_i,\quad \eta_i \ge 0,\ i = p+1, \dots, n; \tag{4}$$

$$\sum_{i=p+1}^{p+u} \mathbb{1}(y_i = 2) = uc, \tag{5}$$

where $\alpha$, $\beta$, and $\gamma$ are nonnegative trade-off parameters, and $c$ is the proportion of positive examples in the unlabeled set $U$. In (2), the first term is used to prevent overfitting. The second term, along with constraint (3) imposed on $P$, requires that for every $x_i \in P$ the score $\omega^\top\Phi(x_i, 2)$ with the correct label $y_i = 2$ be larger than the score $\omega^\top\Phi(x_i, 1)$ associated with the incorrect label $y_i = 1$ by at least $1 - \xi_i$. The third term, together with constraint (4), enlarges the margin between the most likely label and the less likely one for each unlabeled example $x_i \in U$: the score associated with the most likely label (i.e., $\max_{y_i'} \omega^\top\Phi(x_i, y_i')$) should be larger than the score associated with the less likely one (i.e., $\max_{y_i'' \ne y_i'} \omega^\top\Phi(x_i, y_i'')$) by at least $1 - \eta_i$. The last term ensures that similar examples in the feature space obtain similar labels. Constraint (5) requires the number of positive examples in $U$ to be exactly $uc$, where the estimation of $c$ is deferred to Section 4.1.

Since we regard the unlabeled examples in $U$ as ambiguously labeled, with candidate label set {1, 2}, we follow multi-class SVM [Crammer and Singer, 2001] and partial label learning [Yu and Zhang, 2017] to equip our model with good label discriminability, so that the unique ground-truth label of every unlabeled example can be precisely highlighted. After the model parameter $\omega$ has been learned, a test example $x$ is classified as $\hat{y} = \arg\max_{y \in \mathcal{Y}} \omega^\top\Phi(x, y)$.
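As a concrete illustration, the following is a minimal sketch, not the authors' implementation, of the graph construction and joint feature map defined above. The helper names `normalized_adjacency` and `phi` are hypothetical, and the symmetrization of the KNN graph is our own assumption:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def normalized_adjacency(X, k=10, delta=1.0):
    """Build the KNN graph with Gaussian similarities and return the
    normalized adjacency A_bar = D^{-1/2} A D^{-1/2} used in model (2)."""
    W = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    mask = W > 0                                   # linked pairs only
    A = np.zeros_like(W)
    A[mask] = np.exp(-W[mask] ** 2 / (2 * delta ** 2))
    A = np.maximum(A, A.T)                         # symmetrize (our assumption)
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def phi(x, y, num_classes=2):
    """Feature map Phi(x, y): copy x into the block of label y (1-based),
    leaving the other block zero, so that omega^T phi(x, y) = w_y^T x."""
    d = x.shape[0]
    out = np.zeros(num_classes * d)
    out[(y - 1) * d : y * d] = x
    return out
```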
Note that once the ground-truth labels $\{y_i\}_{i=p+1}^{p+u}$ of the unlabeled examples in $U$ are determined, our algorithm essentially maximizes the canonical multi-class margin $\omega^\top\Phi(x_i, y_i) - \max_{y_i' \ne y_i}\omega^\top\Phi(x_i, y_i')$ over each $x_i \in P \cup U$. Therefore, by introducing the slack variables $\nu = \{\nu_i\}_{i=1}^{n}$, the previous model can be transformed to

$$\min_{\omega, \nu, y}\ \frac{1}{2}\|\omega\|^2 + \mu\sum_{i=1}^{n}\nu_i + \gamma\sum_{i,j=1}^{n}\bar{A}_{ij}(y_i - y_j)^2 \tag{6}$$

$$\text{s.t.}\quad \omega^\top\Phi(x_i, y_i) - \max_{y_i' \ne y_i}\omega^\top\Phi(x_i, y_i') \ge 1 - \nu_i,\quad \nu_i \ge 0,\ i = 1, \dots, n; \tag{7}$$

$$y_i \in \{1, 2\},\ i = 1, \dots, n; \tag{8}$$

$$\sum_{i=1}^{n}\mathbb{1}(y_i = 2) = uc + p. \tag{9}$$

In the above model, constraint (7) enforces the maximum margin between different label responses for each training example. In addition, constraint (8) requires the ground-truth label $y_i$ to take a value in the candidate label set {1, 2}. Note that our model is different from LapSVM [Belkin et al., 2006] for semi-supervised learning, as LapSVM does not contain the label disambiguation regularizer of our PULD. Furthermore, if LapSVM were employed for PU learning, it would incorrectly classify all unlabeled examples as positive due to the absence of negative training data.

### 2.2 Optimization

Since our model (6)-(9) involves mixed-type variables (i.e., the integer variables $y$ and the continuous variables $\omega$ and $\nu$), its optimization is non-trivial. In this section, an alternating optimization procedure is devised to find the optimal solution. Note that the variable $y$ appears in both the second term and the third term of (6), so it is difficult to update all at once. Therefore, we adopt the variable splitting technique and introduce an auxiliary variable $z = \{z_1, \dots, z_n\}$. As a result, model (6)-(9) is transformed to

$$\min_{\omega, \nu, y, z}\ \frac{1}{2}\|\omega\|^2 + \mu\sum_{i=1}^{n}\nu_i + \gamma\sum_{i,j=1}^{n}\bar{A}_{ij}(z_i - z_j)^2 + \lambda\sum_{i=1}^{n}\mathbb{1}(y_i \ne z_i) \tag{10}$$

$$\text{s.t.}\quad \omega^\top\Phi(x_i, y_i) - \max_{y_i' \ne y_i}\omega^\top\Phi(x_i, y_i') \ge 1 - \nu_i,\ \nu_i \ge 0,\ i = 1, \dots, n;\quad y_i \in \{1, 2\};\quad \sum_{i=1}^{n}\mathbb{1}(y_i = 2) = uc + p,$$

where the last term in (10) keeps $y$ and $z$ consistent, and $\lambda$ is a trade-off parameter. We now solve the three subproblems associated with $\omega$, $y$, and $z$, respectively.

**Update ω.** With $y$ and $z$ fixed, the subproblem related to $\omega$ is

$$\min_{\omega, \nu}\ \frac{1}{2}\|\omega\|^2 + \mu\sum_{i=1}^{n}\nu_i \quad \text{s.t.}\ \omega^\top\Phi(x_i, y_i) - \max_{y_i' \ne y_i}\omega^\top\Phi(x_i, y_i') \ge 1 - \nu_i,\ \nu_i \ge 0,\ i = 1, \dots, n. \tag{11}$$

Note that (11) coincides with the well-studied multi-class maximum-margin formulation of [Crammer and Singer, 2001]. Therefore, (11) can be readily solved by any off-the-shelf implementation of multi-class SVM [Fan et al., 2008].
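For a linear model, the Crammer-Singer formulation in (11) with one weight block per label is equivalent to learning $\omega$ over the stacked features $\Phi(x, y)$. Below is a minimal sketch of this ω-update, assuming scikit-learn's `LinearSVC` stands in for the off-the-shelf solver; `update_omega` is a hypothetical helper, and the exact correspondence between the regularization constant `C` and the trade-off µ is our assumption:

```python
from sklearn.svm import LinearSVC

def update_omega(X, y_current, mu=1.0):
    """Solve subproblem (11): a Crammer-Singer multi-class SVM fitted on
    the current label estimates y_current (values in {1, 2}).
    C plays the role of the trade-off mu (up to scaling, our assumption)."""
    clf = LinearSVC(multi_class='crammer_singer', C=mu)
    clf.fit(X, y_current)
    # clf.decision_function(X)[:, j] approximates omega^T Phi(x, j + 1),
    # which is what the y-update needs to form the hinge terms in (14).
    return clf
```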
**Update y.** By dropping the terms unrelated to $y$ in (10), the subproblem regarding $y$ is

$$\min_{y, \nu}\ \mu\sum_{i=1}^{n}\nu_i + \lambda\sum_{i=1}^{n}\mathbb{1}(y_i \ne z_i) \quad \text{s.t. the same constraints as in (10)}. \tag{12}$$

Let $\zeta_i^{y_i} = \omega^\top\Phi(x_i, y_i) - \max_{y_i' \ne y_i}\omega^\top\Phi(x_i, y_i')$; then according to the first two constraints of (12), we have $\nu_i = \max(0, 1 - \zeta_i^{y_i})$, so the first term of the objective in (12) can be written as

$$\sum_{i=1}^{n}\max(0, 1 - \zeta_i^{y_i}). \tag{13}$$

Let $Y = [Y_{ij}]_{n \times 2}$ be the binary-valued representation of $y$, where $Y_{ij} = 1$ indicates that the label of example $x_i$ is $j$, and 0 otherwise; $Z = [Z_{ij}]_{n \times 2}$ is defined similarly. By further defining the $i$-th row of an $n \times 2$ coefficient matrix $C$ as

$$C_{i\cdot} = \begin{cases} \big(M,\ \max(0, 1 - \zeta_i^{2})\big) & \text{if } x_i \in P,\\[2pt] \big(\max(0, 1 - \zeta_i^{1}),\ \max(0, 1 - \zeta_i^{2})\big) & \text{if } x_i \in U, \end{cases} \tag{14}$$

where $M$ is a user-specified large number that keeps the label of each positive example fixed, formulation (13) can be transformed to

$$\min_Y\ \sum_{i=1}^{n}\sum_{j=1}^{2} Y_{ij} C_{ij}. \tag{15}$$

Since $Y$ and $Z$ are both binary-valued, we have $\sum_{i=1}^{n}\mathbb{1}(y_i \ne z_i) = n - \sum_{i,j} Y_{ij} Z_{ij}$, so minimizing the second term of the objective in (12) amounts to

$$\min_Y\ -\lambda\sum_{i=1}^{n}\sum_{j=1}^{2} Y_{ij} Z_{ij}. \tag{16}$$

By combining (12), (13), (15), and (16), the subproblem related to $Y$ (i.e., the binary-valued representation of $y$) can be finally transformed to

$$\min_Y\ \sum_{i=1}^{n}\sum_{j=1}^{2}\big(\mu C_{ij} - \lambda Z_{ij}\big) Y_{ij} \quad \text{s.t.}\ \sum_{j=1}^{2} Y_{ij} = 1,\ i = 1, \dots, n;\quad \sum_{i=1}^{n} Y_{i2} = uc + p;\quad Y_{ij} \in \{0, 1\}, \tag{17}$$

where the first constraint $\sum_{j=1}^{2} Y_{ij} = 1$ ensures that each training example has a unique ground-truth label, and the second constraint $\sum_{i=1}^{n} Y_{i2} = uc + p$ controls the number of examples classified as positive. Note that (17) is a binary integer programming (BIP) problem. By relaxing $Y_{ij} \in \{0, 1\}$ to $Y_{ij} \in [0, 1]$, it can be efficiently solved by standard Linear Programming (LP) solvers such as the simplex algorithm or the interior point algorithm [Boyd and Vandenberghe, 2004], and $y$ can then be recovered via $y_i = \arg\max_j Y_{ij}$.

**Update z.** The subproblem regarding $z$ is

$$\min_z\ \gamma\sum_{i,j=1}^{n}\bar{A}_{ij}(z_i - z_j)^2 + \lambda\sum_{i=1}^{n}\mathbb{1}(y_i \ne z_i). \tag{18}$$

Considering that the last term in (18) is usually difficult to optimize, we replace the indicator function with a surrogate $\ell_2$-norm term, so (18) is converted to

$$\min_Z\ \gamma\sum_{i,j=1}^{n}\bar{A}_{ij}\|Z_{i\cdot} - Z_{j\cdot}\|^2 + \lambda\sum_{i=1}^{n}\|Y_{i\cdot} - Z_{i\cdot}\|^2. \tag{19}$$

According to [Zhou et al., 2004], the closed-form solution of (19) can be expressed as

$$Z = \frac{\lambda/\gamma}{1 + \lambda/\gamma}\left(I - \frac{1}{1 + \lambda/\gamma}\bar{A}\right)^{-1} Y. \tag{20}$$

Therefore, we can get $z$ according to $z_i = \arg\max_j Z_{ij}$.

**Algorithm 1** The algorithm for solving PULD

**Input:** positive set $P$, unlabeled set $U$; trade-off parameters µ, γ, and λ; stopping criteria $\epsilon$, $iter_{max}$.
1. Initialize $\omega = 0$, $z = y = 0$;
2. Construct the K-NN graph and compute $\bar{A}$;
3. Initialize the coefficient matrix $C$, where $C_{i\cdot} = (0, 1)$ for $i = 1, \dots, p$ and $C_{i\cdot} = (1/2, 1/2)$ for $i = p+1, \dots, p+u$;
4. Initialize the ground-truth label $y$ by solving (17);
5. **while** not converged **do**
6. &nbsp;&nbsp;$Obj_{old} = Obj$;
7. &nbsp;&nbsp;Update $\omega$ by solving (11);
8. &nbsp;&nbsp;Update $C$ via (14);
9. &nbsp;&nbsp;Update $y$ by solving (17);
10. &nbsp;&nbsp;Update $z$ by solving (19);
11. &nbsp;&nbsp;Compute the objective value $Obj$ via (10);
12. &nbsp;&nbsp;$iter := iter + 1$;
13. &nbsp;&nbsp;Check the convergence conditions: $|Obj - Obj_{old}| < \epsilon$ or $iter > iter_{max}$;
14. **end while**

**Output:** optimized classifier parameter $\omega$.

The entire optimization procedure of our PULD is summarized in Algorithm 1, in which the alternating procedure is guaranteed to converge according to [Hong and Luo, 2017].
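For concreteness, here is a minimal numpy/scipy sketch, ours rather than the authors' release, of steps 9-10 of Algorithm 1: the y-update via the LP relaxation of (17) and the closed-form z-update (20). Labels are encoded as the two columns of $Y$, and the hypothetical argument `n_pos_total` stands for $uc + p$:

```python
import numpy as np
from scipy.optimize import linprog

def update_y(C, Z, n_pos_total, mu=1.0, lam=0.1):
    """LP relaxation of (17): min <mu*C - lam*Z, Y> subject to each row of Y
    summing to 1 and the second column (label 2) summing to uc + p."""
    n = C.shape[0]
    cost = (mu * C - lam * Z).ravel()        # Y flattened row-wise, 2 cols
    A_eq = np.zeros((n + 1, 2 * n))
    for i in range(n):                       # row-sum constraints Y_i1 + Y_i2 = 1
        A_eq[i, 2 * i:2 * i + 2] = 1.0
    A_eq[n, 1::2] = 1.0                      # column-sum constraint on label 2
    b_eq = np.r_[np.ones(n), n_pos_total]
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, 1.0), method='highs')
    Y = res.x.reshape(n, 2)
    return np.argmax(Y, axis=1) + 1, Y       # recovered labels in {1, 2}

def update_z(A_bar, Y, gamma=1.0, lam=0.1):
    """Closed-form solution (20): Z = (1 - a) (I - a * A_bar)^{-1} Y
    with a = 1 / (1 + lam / gamma)."""
    n = A_bar.shape[0]
    a = 1.0 / (1.0 + lam / gamma)
    return (1.0 - a) * np.linalg.solve(np.eye(n) - a * A_bar, Y)
```

The large entry $M$ in the rows of $C$ for positive examples makes assigning label 1 to them prohibitively expensive, so the LP keeps their labels fixed, as intended by (14).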
## 3 Theoretical Analysis

In this section, we study the generalizability of the proposed PULD algorithm. In our PU learning case, the hypothesis is a scoring function $h \in \mathcal{H}: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, with $\mathcal{H}$ denoting the hypothesis space. The label associated with an input $x$ is the one receiving the largest score $h(x, y)$ over $y \in \mathcal{Y}$, which defines the mapping

$$x \mapsto \arg\max_{y \in \mathcal{Y}} h(x, y), \tag{21}$$

where $h(x, y) = \omega^\top\Phi(x, y)$. We define the margins on positive examples and on unlabeled examples as

$$\rho_p(x, y) = h(x, y) - \max_{y' \ne y} h(x, y') \tag{22}$$

and

$$\rho_u(x, y) = \max_{y' \in \mathcal{Y}} h(x, y') - \max_{y'' \ne y'} h(x, y''). \tag{23}$$

To obtain the empirical margin loss for our PU learning case, we first introduce the following margin loss function [Gong et al., 2019a]:

$$\ell_\rho(x) = \begin{cases} 0, & x > \rho,\\ 1 - x/\rho, & 0 \le x \le \rho,\\ 1, & \text{otherwise}, \end{cases} \tag{24}$$

where $\rho$ is the margin. Let $\rho(p) > 0$ and $\rho(u) > 0$ denote the margins for positive and unlabeled examples, respectively; then the empirical margin loss is defined as

$$\hat{R}_\rho(h) = \frac{1}{p}\sum_{i=1}^{p}\ell_{\rho(p)}\big(\rho_p(x_i, y_i)\big) + \frac{1}{u}\sum_{i=p+1}^{n}\ell_{\rho(u)}\big(\rho_u(x_i, y_i)\big). \tag{25}$$

Let $\hat{h}$ be any learned classifier output by PULD and $R(\hat{h})$ be the corresponding expected loss; our target is to upper bound the generalization error $R(\hat{h}) - \hat{R}_\rho(\hat{h})$.

**Theorem 1.** *Assume that $\forall x \in \mathcal{X}$, $\|x\| \le b$. Let $p$ and $u$ be the sizes of the positive set and the unlabeled set respectively, $\alpha$ and $\beta$ be the trade-off parameters in (2), and $\hat{h}$ be the scoring function learned by the proposed algorithm. For any $\delta > 0$, with probability at least $1 - \delta$, we have*

$$R(\hat{h}) - \hat{R}_\rho(\hat{h}) \le \frac{8b\sqrt{2(\alpha+\beta)}}{\rho(p)\sqrt{p}} + \frac{8b\sqrt{2(\alpha+\beta)}}{\rho(u)\sqrt{u}} + \sqrt{\frac{\log(2/\delta)}{2p}} + \sqrt{\frac{\log(2/\delta)}{2u}}. \tag{26}$$

Before proving this theorem, we first present some useful definitions and lemmas.

**Definition 2** (Rademacher complexity [Bartlett and Mendelson, 2002]). *Let $\sigma = \{\sigma_1, \dots, \sigma_n\}$ be a set of independent Rademacher variables uniformly sampled from $\{-1, 1\}$, let $v_1, \dots, v_n$ be an i.i.d. sample, and let $F$ be a function class. The Rademacher complexity of $F$ is defined as*

$$\mathfrak{R}_n(F) = \mathbb{E}\left[\sup_{f \in F}\ \frac{1}{n}\sum_{i=1}^{n}\sigma_i f(v_i)\right]. \tag{27}$$

**Lemma 3** (Generalization bound [Bartlett and Mendelson, 2002]). *Let $F$ be a $[0, 1]$-valued function class on $\mathcal{X}$ and $f \in F$. Given i.i.d. variables $x_1, \dots, x_n \in \mathcal{X}$, for any $\delta > 0$, with probability at least $1 - \delta$ we have*

$$\mathbb{E}[f(x)] \le \frac{1}{n}\sum_{i=1}^{n} f(x_i) + 2\mathfrak{R}_n(F) + \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{28}$$

The Rademacher complexity appearing in (28) can be bounded by the following lemma.

**Lemma 4** (Talagrand's contraction lemma [Bartlett and Mendelson, 2002]). *If $L: \mathbb{R} \to \mathbb{R}$ is Lipschitz continuous with constant $\mathcal{L}$ and satisfies $L(0) = 0$, then*

$$\mathfrak{R}_n(L \circ F) \le \mathcal{L}\,\mathfrak{R}_n(F), \tag{29}$$

*where $\circ$ represents the composition of two functions.*

Now we present the formal proof of Theorem 1.

*Proof.* Let $H_1$ be the family of hypotheses mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$ defined by $H_1 = \{(x, y) \mapsto \rho_h(x, y): h \in \mathcal{H}\}$. By Lemma 3, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$ we have

$$\mathbb{E}\big[\ell \circ \rho_h(x, y)\big] \le \hat{R}_\rho(h) + 2\mathfrak{R}_p(\ell_{\rho(p)} \circ H_1) + 2\mathfrak{R}_u(\ell_{\rho(u)} \circ H_1) + \sqrt{\frac{\log(2/\delta)}{2p}} + \sqrt{\frac{\log(2/\delta)}{2u}}. \tag{30}$$

Since $\mathbb{1}_{t \le 0} \le \ell(t)$ for all $t \in \mathbb{R}$, the expected error $R(h) = \mathbb{E}\big[\mathbb{1}_{\rho_h(x, y) \le 0}\big]$ is a lower bound of $\mathbb{E}[\ell \circ \rho_h(x, y)]$, and thus we have

$$R(h) \le \hat{R}_\rho(h) + 2\mathfrak{R}_p(\ell_{\rho(p)} \circ H_1) + 2\mathfrak{R}_u(\ell_{\rho(u)} \circ H_1) + \sqrt{\frac{\log(2/\delta)}{2p}} + \sqrt{\frac{\log(2/\delta)}{2u}}, \tag{31}$$

where

$$\mathfrak{R}_p(\ell_{\rho(p)} \circ H_1) = \mathbb{E}\left[\sup_{h \in \mathcal{H}}\frac{1}{p}\sum_{i=1}^{p}\sigma_i\,\ell_{\rho(p)}\big(\rho_p(x_i, y_i)\big)\right], \tag{32}$$

$$\mathfrak{R}_u(\ell_{\rho(u)} \circ H_1) = \mathbb{E}\left[\sup_{h \in \mathcal{H}}\frac{1}{u}\sum_{i=p+1}^{n}\sigma_i\,\ell_{\rho(u)}\big(\rho_u(x_i, y_i)\big)\right]. \tag{33}$$

To upper bound the Rademacher complexities in (32) and (33), we first derive an upper bound for the classifiers in $\mathcal{H}$. By the optimality of the learned classifier parameter $\omega$, comparing the objective value of (2) at $\omega$ with its value at $\omega = 0$ gives

$$\frac{1}{2}\|\omega\|^2 + \frac{\alpha}{p}\sum_{i=1}^{p}\Big[1 - \big(\omega^\top\Phi(x_i, 2) - \omega^\top\Phi(x_i, 1)\big)\Big]_{+} + \frac{\beta}{u}\sum_{i=p+1}^{n}\Big[1 - \big(\max_{y_i'}\omega^\top\Phi(x_i, y_i') - \max_{y_i'' \ne y_i'}\omega^\top\Phi(x_i, y_i'')\big)\Big]_{+} + \gamma\sum_{i,j=1}^{n}\bar{A}_{ij}(y_i - y_j)^2 \le \alpha + \beta. \tag{34}$$

Since each term in (34) is nonnegative, we obtain an upper bound for $\|\omega\|^2$, which implies $\|\omega\|^2 \le 2\alpha + 2\beta$.

Now we bound $\mathfrak{R}_p(\ell_{\rho(p)} \circ H_1)$ and $\mathfrak{R}_u(\ell_{\rho(u)} \circ H_1)$. Since the loss $\ell_\rho$ is $1/\rho$-Lipschitz, Lemma 4 gives $\mathfrak{R}_p(\ell_{\rho(p)} \circ H_1) \le \mathfrak{R}_p(H_1)/\rho(p)$ and $\mathfrak{R}_u(\ell_{\rho(u)} \circ H_1) \le \mathfrak{R}_u(H_1)/\rho(u)$. For any family of hypotheses $h: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, define $\Pi_1(\mathcal{H}) = \{x \mapsto h(x, y): y \in \mathcal{Y}, h \in \mathcal{H}\}$. Based on Theorem 8.1 in [Mohri et al., 2018], we directly obtain $\mathfrak{R}_p(H_1) \le 4\mathfrak{R}_p(\Pi_1(\mathcal{H}))$ and $\mathfrak{R}_u(H_1) \le 4\mathfrak{R}_u(\Pi_1(\mathcal{H}))$.

Next we upper bound $\mathfrak{R}_p(\Pi_1(\mathcal{H}))$ and $\mathfrak{R}_u(\Pi_1(\mathcal{H}))$. Noting that $\mathbb{E}[\sigma_i\sigma_j] = 0$ for $i \ne j$, we have

$$\mathfrak{R}_p(\Pi_1(\mathcal{H})) = \mathbb{E}\left[\sup_{y \in \mathcal{Y},\, h \in \mathcal{H}}\frac{1}{p}\sum_{i=1}^{p}\sigma_i\,\omega^\top\Phi(x_i, y)\right] \overset{(1)}{\le} \frac{\sqrt{2\alpha + 2\beta}}{p}\sqrt{\sum_{i=1}^{p}\mathbb{E}\big[\|x_i\|^2\big]} \le \frac{b\sqrt{2\alpha + 2\beta}}{\sqrt{p}}, \tag{35}$$

where Inequality (1) holds because of Jensen's inequality and the concavity of the square-root function. Similarly, we have

$$\mathfrak{R}_u(\Pi_1(\mathcal{H})) \le \frac{b\sqrt{2\alpha + 2\beta}}{\sqrt{u}}. \tag{36}$$

By combining (31) with the upper bounds for (32) and (33), it is easy to see that for any learned classifier $\hat{h}$, with probability at least $1 - \delta$,

$$R(\hat{h}) - \hat{R}_\rho(\hat{h}) \le \frac{8b\sqrt{2(\alpha+\beta)}}{\rho(p)\sqrt{p}} + \frac{8b\sqrt{2(\alpha+\beta)}}{\rho(u)\sqrt{u}} + \sqrt{\frac{\log(2/\delta)}{2p}} + \sqrt{\frac{\log(2/\delta)}{2u}}, \tag{37}$$

which concludes the proof of Theorem 1. ∎
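To make the rate in (26) concrete, consider a small worked instance with illustrative numbers of our own choosing (the confidence terms follow the standard reconstruction used in (26)):

```latex
% Worked instance of the bound (26), assuming b = 1, \alpha = \beta = 1,
% \rho(p) = \rho(u) = 1, p = u = m, and \delta = 0.05:
\[
  R(\hat h) - \hat R_\rho(\hat h)
  \;\le\; \frac{16}{\sqrt{m}} + \frac{16}{\sqrt{m}} + 2\sqrt{\frac{\log 40}{2m}}
  \;=\; \frac{32}{\sqrt{m}} + 2\sqrt{\frac{\log 40}{2m}}
  \;=\; O\!\big(1/\sqrt{m}\big),
\]
% so quadrupling both sample sizes roughly halves the gap, consistent
% with the discussion of Theorem 1 that follows.
```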
Theorem 1 shows that increasing the sample size of either the positive or the unlabeled data decreases the generalization error bound of PULD, which justifies the usefulness of both positive and unlabeled data in PU learning. This also guarantees the generalization ability of the proposed learning algorithm.

## 4 Experiments

To demonstrate the superiority of the proposed PULD over existing PU methods, we conduct intensive experiments on both benchmark and real-world datasets in this section.

### 4.1 Benchmark Datasets

In this section, we compare our PULD with state-of-the-art PU learning algorithms, namely Weighted SVM (WSVM) [Elkan and Noto, 2008], Unbiased PU risk (UPU) [Du Plessis et al., 2015], Non-Negative PU risk (NNPU) [Kiryo et al., 2017], and Loss Decomposition and Centroid Estimation (LDCE) [Shi et al., 2018], on OpenML¹ benchmark datasets. Specifically, five binary datasets are adopted for algorithm evaluation, including vote, diabetes, wdbc, fri, and phishing, whose configurations are listed in Table 1. For each dataset, we randomly choose r = 20%, 30%, and 40% of the positive examples as well as all negative examples as unlabeled, and leave the remaining positive examples as labeled. Under each r, we conduct five-fold cross validation for every compared method and report the average accuracy over the five independent runs. Note that all data features are normalized in advance, and the selected positive examples and the dataset splits are kept identical for all compared methods.

¹https://www.openml.org/

In our experiments, the proportion c of positive examples in the unlabeled set is assumed to be known during training for all compared methods, as is also assumed by existing PU learning works such as [Du Plessis et al., 2015; Kiryo et al., 2017; Shi et al., 2018]. In practice, it can be efficiently estimated by methods such as [Liu and Tao, 2016; Christoffel et al., 2016]. The parameters of every algorithm are carefully tuned on the validation set to achieve the best performance. To be specific, the parameter β of NNPU is set to the default value 0 on all benchmark datasets. For LDCE, we choose the regularization parameter λ from {2⁻⁴, ..., 2⁴} and β from {0.1, 0.2, ..., 0.9} according to [Shi et al., 2018]. Moreover, for the proposed PULD, K is chosen from the candidate set {6, 8, 10, 12, 14}, δ is chosen from {10⁻², ..., 10¹}, and the trade-off parameters µ, γ, and λ in (10) are tuned by searching the grid {10⁻⁴, ..., 10¹}. Furthermore, we adopt the paired t-test with significance level 0.05 to investigate whether our PULD is significantly better than the other baselines.

The test accuracies of all methods on the five benchmark datasets are reported in Table 1. We can see that PULD generally achieves the best classification accuracy among the compared methods. Besides, the accuracies obtained by PULD on the five datasets are all above 75%, which suggests that PULD generates very impressive classification results even though it is trained without negative data.

| Dataset (n, d) | r | WSVM | UPU | NNPU | LDCE | PULD |
|---|---|---|---|---|---|---|
| vote (435, 16) | 0.2 | **0.961±0.017** | 0.943±0.014 | 0.868±0.032 | 0.902±0.013 | 0.959±0.019 |
| | 0.3 | 0.941±0.020 | 0.905±0.022 | 0.900±0.009 | 0.892±0.010 | **0.955±0.014** |
| | 0.4 | 0.918±0.044 | 0.911±0.012 | 0.891±0.033 | 0.907±0.032 | **0.948±0.023** |
| wdbc (569, 30) | 0.2 | 0.943±0.020 | 0.899±0.023 | 0.877±0.026 | 0.954±0.015 | **0.963±0.021** |
| | 0.3 | 0.939±0.018 | 0.911±0.036 | 0.933±0.016 | 0.950±0.014 | **0.974±0.006** |
| | 0.4 | 0.911±0.010 | 0.911±0.019 | 0.939±0.010 | 0.925±0.005 | **0.967±0.011** |
| diabetes (768, 8) | 0.2 | 0.751±0.040 | 0.651±0.003 | 0.730±0.014 | 0.749±0.020 | **0.787±0.039** |
| | 0.3 | 0.714±0.017 | 0.686±0.031 | 0.714±0.036 | 0.751±0.017 | **0.791±0.027** |
| | 0.4 | 0.730±0.035 | 0.697±0.037 | 0.651±0.003 | 0.734±0.019 | **0.781±0.040** |
| fri (1000, 25) | 0.2 | 0.675±0.015 | 0.563±0.002 | 0.732±0.013 | 0.521±0.008 | **0.779±0.006** |
| | 0.3 | 0.657±0.026 | 0.579±0.021 | 0.720±0.011 | 0.498±0.034 | **0.773±0.012** |
| | 0.4 | 0.630±0.025 | 0.571±0.020 | 0.720±0.011 | 0.485±0.025 | **0.757±0.029** |
| phishing (11055, 68) | 0.2 | 0.928±0.002 | 0.872±0.004 | 0.904±0.005 | 0.901±0.005 | **0.939±0.005** |
| | 0.3 | 0.928±0.006 | 0.897±0.007 | 0.921±0.004 | 0.792±0.006 | **0.941±0.006** |
| | 0.4 | 0.919±0.004 | 0.907±0.010 | 0.927±0.008 | 0.796±0.011 | **0.931±0.007** |

Table 1: The accuracies of various methods on five OpenML benchmark datasets when r = 20%, 30%, and 40%. The best record under each r is marked in bold; • indicates that PULD is significantly better than the corresponding method via the paired t-test.
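For reproducibility, the following is a small sketch, our own rather than the authors' code, of the PU split protocol described in this subsection: a fraction r of the positives plus all negatives become the unlabeled set, and the remaining positives stay labeled (`make_pu_split` is a hypothetical helper name):

```python
import numpy as np

def make_pu_split(y_true, r=0.2, seed=0):
    """Build a PU split from fully labeled binary data (y_true in {1, 2},
    with 2 = positive): hide a fraction r of the positives inside the
    unlabeled set U together with all negatives; the rest form P."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_true == 2)
    neg = np.flatnonzero(y_true == 1)
    hidden = rng.choice(pos, size=int(r * len(pos)), replace=False)
    P = np.setdiff1d(pos, hidden)          # labeled positive indices
    U = np.concatenate([hidden, neg])      # unlabeled indices
    c = len(hidden) / len(U)               # true positive proportion in U
    return P, U, c
```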
### 4.2 Real-world Datasets

Here we investigate the performance of WSVM, UPU, NNPU, LDCE, and PULD on a practical image classification task, where the CIFAR-10 [Krizhevsky and Hinton, 2009] and SVHN [Netzer et al., 2011] datasets are chosen to test their performance. CIFAR-10 consists of 60000 32×32 natural images in 10 classes, with each class containing 6000 images. We choose the images of "airplane", "automobile", "ship", and "truck" as negative, and regard the images of "bird", "cat", "deer", "dog", "frog", and "horse" as positive. Therefore, there are 36000 positive examples and 24000 negative examples. SVHN contains 99289 32×32 digit images belonging to 10 classes, i.e., the digits "0"-"9", where the negative set is formed by the digit images "1"-"5", and the rest of the digit images compose the positive set. As a result, we get 34699 positive examples and 64590 negative examples. In our experiment, we extract 512-dimensional GIST features for each image. Similar to the above experiments, for each image dataset the situations of r = 20%, 30%, and 40% are studied. Note that the training set and the test set are split in advance, with 50000 training examples and 10000 test examples for CIFAR-10, and 73257 training examples and 26032 test examples for SVHN.

The test accuracies achieved by the compared methods on the two datasets are presented in Table 2, where we can clearly see that our PULD achieves the highest classification accuracy among all compared methods on both CIFAR-10 and SVHN. Therefore, the proposed PULD is effective in handling real-world data.

| Dataset | r | WSVM | UPU | NNPU | LDCE | PULD |
|---|---|---|---|---|---|---|
| CIFAR-10 | 0.2 | 0.829 | 0.749 | 0.752 | 0.772 | **0.834** |
| | 0.3 | 0.820 | 0.810 | 0.771 | 0.761 | **0.861** |
| | 0.4 | 0.787 | 0.836 | 0.748 | 0.701 | **0.860** |
| SVHN | 0.2 | 0.794 | 0.728 | 0.779 | 0.785 | **0.851** |
| | 0.3 | 0.786 | 0.769 | 0.810 | 0.776 | **0.852** |
| | 0.4 | 0.776 | 0.790 | 0.828 | 0.748 | **0.850** |

Table 2: The test accuracies on the adopted real-world datasets. The best record under each r is marked in bold.

### 4.3 Parametric Sensitivity

The model (10) of our PULD contains three trade-off parameters µ, γ, and λ, so this section examines the sensitivity of our model to them. The two real-world datasets CIFAR-10 and SVHN are adopted here.

*Figure 1: The parametric sensitivity of (a) µ, (b) γ, and (c) λ of our PULD on the CIFAR-10 and SVHN datasets.*
Figure 1 shows the test accuracies of our method on these two datasets when r = 20%. We vary µ from 10⁻³ to 10⁰, and γ and λ from 10⁻⁴ to 10⁻¹. From Figure 1, we observe that the three parameters have only a slight effect on the accuracy, so they can be easily tuned in practical implementations.

## 5 Conclusion

This paper proposed a novel PU learning algorithm named Positive and Unlabeled learning with Label Disambiguation (PULD). Specifically, we convert PU learning into a partial label learning problem and use the disambiguation technique to efficiently identify the ground-truth labels of unlabeled examples. The proposed model can be easily solved via an alternating optimization process, and its generalization bound has also been theoretically derived. The experimental results on both benchmark and real-world datasets clearly show that PULD is superior to the state-of-the-art PU methods.

## Acknowledgments

This research is supported by NSF of China (Nos: 61602246, U1713208), NSF of Jiangsu Province (No: BK20171430), the Fundamental Research Funds for the Central Universities (No: 30918011319), the Summit of the Six Top Talents Program (No: DZXX-027), the Innovative and Entrepreneurial Doctor Program of Jiangsu Province, the Young Elite Scientists Sponsorship Program by Jiangsu Province, the Young Elite Scientists Sponsorship Program by CAST (No: 2018QNRC001), the Australian Research Council Projects (Nos: DP-180103424, DE-1901014738), and the Program for Changjiang Scholars.

## References

- [Bartlett and Mendelson, 2002] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3(Nov):463-482, 2002.
- [Belkin et al., 2006] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7(Nov):2399-2434, 2006.
- [Boyd and Vandenberghe, 2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- [Chen et al., 2018] Ching-Hui Chen, Vishal M. Patel, and Rama Chellappa. Learning from ambiguously labeled face images. IEEE T-PAMI, 40(7):1653-1667, 2018.
- [Christoffel et al., 2016] Marthinus Christoffel, Gang Niu, and Masashi Sugiyama. Class-prior estimation for learning from positive and unlabeled data. In ACML, pages 221-236, 2016.
- [Cour et al., 2011] Timothee Cour, Ben Sapp, and Ben Taskar. Learning from partial labels. JMLR, 12(May):1501-1536, 2011.
- [Crammer and Singer, 2001] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2(Dec):265-292, 2001.
- [Denis et al., 2005] François Denis, Rémi Gilleron, and Fabien Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, 348(1):70-83, 2005.
- [Du Plessis et al., 2014] Marthinus C. Du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, pages 703-711, 2014.
- [Du Plessis et al., 2015] Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, pages 1386-1394, 2015.
- [Elkan and Noto, 2008] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In SIGKDD, pages 213-220, 2008.
- [Fan et al., 2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. JMLR, 9(Aug):1871-1874, 2008.
- [Feng and An, 2018] Lei Feng and Bo An. Leveraging latent label distributions for partial label learning. In IJCAI, pages 2107-2113, 2018.
- [Fu et al., 2016] Guangyuan Fu, Jun Wang, Bo Yang, and Guoxian Yu. NegGOA: Negative GO annotations selection using ontology structure. Bioinformatics, 32(19):2996-3004, 2016.
- [Gong et al., 2018] Chen Gong, Tongliang Liu, Yuanyan Tang, Jian Yang, Jie Yang, and Dacheng Tao. A regularization approach for instance-based superset label learning. IEEE T-CYB, 48(3):967-978, 2018.
- [Gong et al., 2019a] Chen Gong, Tongliang Liu, Jian Yang, and Dacheng Tao. Large-margin label-calibrated support vector machines for positive and unlabeled learning. IEEE T-NNLS, 2019.
- [Gong et al., 2019b] Chen Gong, Hong Shi, Jie Yang, and Jian Yang. Multi-manifold positive and unlabeled learning for visual analysis. IEEE T-CSVT, 2019.
- [Hong and Luo, 2017] Mingyi Hong and Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1-2):165-199, 2017.
- [Jin and Ghahramani, 2003] Rong Jin and Zoubin Ghahramani. Learning with multiple labels. In NeurIPS, pages 921-928, 2003.
- [Kiryo et al., 2017] Ryuichi Kiryo, Gang Niu, Marthinus C. du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pages 1675-1685, 2017.
- [Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
- [Lee and Liu, 2003] Wee Sun Lee and Bing Liu. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, pages 448-455, 2003.
- [Li and Liu, 2003] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, pages 587-592, 2003.
- [Li et al., 2011] Wenkai Li, Qinghua Guo, and Charles Elkan. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE T-GRS, 49(2):717-725, 2011.
- [Liu and Tao, 2016] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE T-PAMI, 38(3):447-461, 2016.
- [Liu et al., 2002] Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. Partially supervised classification of text documents. In ICML, pages 387-394, 2002.
- [Mohri et al., 2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
- [Netzer et al., 2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- [Nguyen and Caruana, 2008] Nam Nguyen and Rich Caruana. Classification with partial labels. In SIGKDD, pages 551-559, 2008.
- [Shi et al., 2018] Hong Shi, Shaojun Pan, Jian Yang, and Chen Gong. Positive and unlabeled learning via loss decomposition and centroid estimation. In IJCAI, pages 2689-2695, 2018.
- [Wei and Li, 2018] Huihui Wei and Ming Li. Positive and unlabeled learning for detecting software functional clones with adversarial training. In IJCAI, pages 2840-2846, 2018.
- [Wu and Zhang, 2018] Xuan Wu and Min-Ling Zhang. Towards enabling binary decomposition for partial label learning. In IJCAI, pages 2868-2874, 2018.
- [Youngs et al., 2014] Noah Youngs, Duncan Penfold-Brown, Richard Bonneau, and Dennis Shasha. Negative example selection for protein function prediction: The NoGO database. PLoS Computational Biology, 10(6):e1003644, 2014.
- [Youngs et al., 2015] Noah Youngs, Dennis Shasha, and Richard Bonneau. Positive-unlabeled learning in the face of labeling bias. In ICDMW, pages 639-645, 2015.
- [Yu and Zhang, 2017] Fei Yu and Min-Ling Zhang. Maximum margin partial label learning. Machine Learning, 106(4):573-593, 2017.
- [Zhou et al., 2004] Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In NeurIPS, pages 321-328, 2004.