Positive Unlabeled Learning via Wrapper-Based Adaptive Sampling

Pengyi Yang1, Wei Liu2, Jean Yang1
1Charles Perkins Centre, School of Mathematics and Statistics, University of Sydney, Australia
2Advanced Analytics Institute, University of Technology Sydney, Australia
{pengyi.yang, jean.yang}@sydney.edu.au; wei.liu@uts.edu.au

Abstract

Learning from positive and unlabeled data frequently occurs in applications where only a subset of positive instances is available while the rest of the data are unlabeled. In such scenarios, the goal is often to create a discriminant model that can accurately classify both positive and negative data by modelling from labeled and unlabeled instances. In this study, we propose an adaptive sampling (AdaSampling) approach that utilises prediction probabilities from a model to iteratively update the training data. Starting with equal prior probabilities for all unlabeled data, our method wraps around a predictive model and iteratively updates these probabilities to distinguish positive and negative instances in the unlabeled data. Subsequently, one or more robust negative sets can be drawn from the unlabeled data, according to the likelihood of each instance being negative, to train a single classification model or an ensemble of models.

1 Introduction

Traditional supervised learning algorithms require labels of both positive and negative instances for building a binary classification model. In various applications, however, obtaining negative data can be difficult, especially in domains that lack precise knowledge and a clear definition of negative instances [Calvo et al., 2007]. For example, defining genes that are unrelated to a disease is difficult because genes not yet known to be related to the disease may contaminate the negative sample set. In such cases, positive unlabeled learning techniques have been proposed to model from labeled positive instances augmented with unlabeled instances that comprise both unknown positive and negative instances [Denis et al., 2005; Li et al., 2009].

Current techniques proposed for positive unlabeled learning can roughly be categorised into (1) heuristic, (2) bias-based, (3) one-class, and (4) bootstrap sampling approaches. Heuristic approaches often partition the learning process into two steps: negative instances are first identified using heuristic methods such as information retrieval techniques [Li and Liu, 2003], Bayesian methods [Liu et al., 2003], Expectation Maximization algorithms [Nigam et al., 1998; Liu et al., 2002], or domain knowledge [Yang et al., 2012], and a final classification model is then created using the labeled positive instances and the negative instances identified from the unlabeled data in the first step. One key disadvantage of most heuristic approaches is the requirement of a pre-defined threshold to determine whether to include or exclude a potential negative instance obtained from the unlabeled data for model training. The optimal threshold for negative instance selection can be data dependent and often greatly influences the accuracy of the final model. The lack of formality of many heuristic methods also limits their generality.
Bias-based approaches, on the other hand, treat all unlabeled data as negative instances and employ a traditional learning procedure, except that a bias is introduced to weight the classification model and/or the cost function towards the positive class: predictions of positive made on unlabeled data are penalised less, to account for unknown positive instances being labeled as negatives [Elkan and Noto, 2008]. This approach has been used to learn biased SVMs [Liu et al., 2003], logistic models [Lee and Liu, 2003] and Bayes classifiers [Nigam et al., 2000]. Elkan and Noto [Elkan and Noto, 2008] subsequently formulated the bias-based approach in a general framework that can be used with a large selection of classification models. However, bias-based approaches often rely on the training data for estimating the bias to be applied for model correction. Hence, part of the training data needs to be set aside for bias estimation, or a cross-validation procedure has to be conducted. This is unattractive especially when the training data are limited, because the estimate of the bias coefficient can deviate significantly, causing under- or over-correction and thus a poor classification model.

Alternatively, positive unlabeled learning can also be formulated as a one-class learning problem where only positive labels are used for training a classification model [Li et al., 2011]. This has given rise to a set of methods that adhere to the same principle of one-class learning but are tailored for positive unlabeled learning [Denis et al., 2002; Calvo et al., 2007]. Given the similarity between one-class learning and positive unlabeled learning, many more one-class learning algorithms can also be easily tuned for positive unlabeled learning [Khan and Madden, 2014]. The drawback of adjusting one-class learning methods for positive unlabeled learning, however, is that they generally rely on generative classification models and ignore unlabeled data. Therefore, more labeled positive instances may be required to achieve performance comparable to methods that effectively utilise both labeled and unlabeled instances.

Recently, methods based on bootstrap sampling have been proposed to create ensembles of models for positive unlabeled learning [Mordelet and Vert, 2014; Claesen et al., 2015; Yang et al., 2016]. In such settings, unlabeled instances are treated as negatives, and bootstrap samples drawn from the unlabeled instances are concatenated with the labeled positive instances to train base classifiers that form an ensemble. The key idea is to take advantage of the instability of predictions, caused potentially by the random inclusion of unlabeled positive instances, and aggregate a final stable prediction. These bootstrap sampling approaches adhere to and exploit the advantages of a bagging-like procedure [Breiman, 1996]. Nevertheless, since all unlabeled instances are treated as negative data, the random subsets sampled from unlabeled data still contain incorrect labels. The base classifiers, therefore, still suffer from unwanted label noise, which propagates and affects the performance of the final ensemble model. This study extends bootstrap sampling approaches by introducing a novel wrapper-based adaptive sampling (AdaSampling) procedure.
Similar to previously proposed methods [Mordelet and Vert, 2014; Claesen et al., 2015], initially all unlabeled instances are treated as negative examples and are equally likely to be selected for model training. AdaSampling then differs from bootstrap sampling approaches in that the procedure wraps around a classification model, and prediction uncertainties of unlabeled instances from the model are incorporated in each subsequent iteration of sampling to reduce the probability of selecting potential unknown positive instances as negative examples for model training (Figure 1). AdaSampling is generic and can be applied with any learning model that outputs classification probabilities. It requires neither additional training data for bias estimation nor a heuristic procedure for selecting negative instances. This allows both labeled and unlabeled data to be utilised for model training without introducing any prediction bias in the learning model. Furthermore, the AdaSampling approach can easily be extended for ensemble learning, where different negative instance sets are drawn and combined with the labeled positives to create diverse base classifiers. This enables ensemble models built with AdaSampling to make effective use of unlabeled data while also preventing the noise propagation that results from applying bootstrap sampling directly to all unlabeled data.

Figure 1: Schematic illustration of the AdaSampling procedure.

Our empirical studies suggest that AdaSampling requires very few iterations to accurately distinguish unlabeled positive and negative instances, even with a very high positive-to-negative instance ratio in the unlabeled data. We next compared AdaSampling-based single and ensemble models with the state-of-the-art bias-based and bootstrap sampling approaches, using Support Vector Machines (SVM) and k-Nearest Neighbours (kNN) and a panel of evaluation metrics, on several real-world datasets with different ratios of unlabeled positive instances. Our experimental results demonstrate that AdaSampling significantly improves classification for both SVM and kNN, and that their performance compares favourably to state-of-the-art methods. Together, this study offers a conceptually simple, flexible, yet powerful approach for positive unlabeled learning.

2 Methods

2.1 AdaSampling

It is helpful to view the positive unlabeled learning problem as discriminating positive and negative instances from a dataset in which the negative examples are contaminated by hidden positive instances. In this formulation, the problem of positive unlabeled learning is reduced to a traditional classification problem where all unlabeled instances are treated as negative instances. Let us denote the labeled instances as L (i.e. y = 1) and the unlabeled instances as U, and assume that there are m labeled and n unlabeled instances. In positive unlabeled learning, where the label information is not available for U, a traditional classifier can be trained by labeling all of U as negative (i.e. y = 0), sampling with equal probability from all instances in U and combining them with L:

$$[D^0, y] = [L, y = 1] \cup [S^0, y = 0] \qquad (1)$$

where $S^0 \subseteq U$ and the superscript 0 of $S^0$ and $D^0$ is the iteration index (cf. Alg. 1). A classification model can be fitted using this training dataset (Eq. 1):

$$p(y|x) = h_{\theta}(x; [D^0, y]) \qquad (2)$$
The above classification model (Eq. 2) is the starting point of AdaSampling, where an instance $s \in U$ will be selected as a negative example for subsequent training with a probability of $1 - p(y = 1|s)$. The training data can be updated after a set of instances $S^i \subseteq U$ is selected (with replacement):

$$[D^i, y] = [L, y = 1] \cup [S^i, y = 0] \qquad (3)$$

where i indexes the iteration of sampling. By utilising the prediction probabilities of the fitted model on instances from U, AdaSampling can update the training dataset $[D^i, y]$ (Eq. 3) to reduce the chance of selecting unlabeled positive instances as negative training examples. This leads to updated prediction probabilities $p^i(y|x_1), \dots, p^i(y|x_{m+n})$ of all instances, including the n instances of U, from which AdaSampling can repeatedly update the training dataset.

2.2 AdaSampling for Classification

Single model. AdaSampling can be used with various classification algorithms for positive unlabeled learning. Akin to wrapper-based feature selection [Kohavi and John, 1997], here the procedure wraps around a classification model to iteratively prioritise negative instances from the unlabeled data. The procedure therefore tunes the data with respect to a given predictive model. A criterion such as

$$\left|\, \overline{p^{i}(y|x_j)} - \overline{p^{i-1}(y|x_j)} \,\right| < \varepsilon \qquad (4)$$

can be utilised to terminate the iterative sampling, and the predicted probabilities from the final iteration can be used for classification of instances from both the labeled and unlabeled data. Here $j = 1, \dots, m+n$ indexes all instances in the data, and the bar denotes the mean over all instances. We set $\varepsilon$ to 0.01, requiring a smaller than 1% change in the mean prediction probability of all instances for the process to terminate. Algorithm 1 summarises this procedure in pseudocode.

Algorithm 1: AdaSampling for a single model
Data: Positive unlabeled data L and U
Result: Predicted label of all instances y
    p^0 ← 1                                   // initialise probability vector for all instances
    S^0 ← sampling(U, p^0_U)                  // select negative instances from U with probability p^0
    [D^0, y] ← [L, y = 1] ∪ [S^0, y = 0]      // label initial training data
    do
        // train a model and classify all instances
        p^i(y|x_1), ..., p^i(y|x_{m+n}) ← predict(h_θ(x; [D^{i-1}, y]), L ∪ U)
        // adaptive sampling from x ∈ U w.r.t. updated probabilities
        S^i ← sampling(U, p^i_U)
        [D^i, y] ← [L, y = 1] ∪ [S^i, y = 0]
    while Eq. 4 > ε
    y ← classify(h_θ(x; [D^i, y]), L ∪ U)

Ensemble of models. Alternatively, we can apply weighted sampling from U, using $p^i_U$ from the last AdaSampling iteration as weights, to create different negative subsets $S^k$ ($k = 1, \dots, K$). This allows the creation of base models $b_k$ ($k = 1, \dots, K$), each trained on a different training set $[L, y = 1] \cup [S^k, y = 0]$, for ensemble prediction. The key advantage of this procedure for ensemble learning is that the prediction uncertainties of U are exploited multiple times to make effective use of the instances in U, avoiding the potentially high variance introduced by training a single model for classification. Algorithm 2 summarises the AdaSampling-based ensemble learning procedure in pseudocode.

Algorithm 2: AdaSampling for an ensemble of models
Data: Positive unlabeled data L and U
Result: Predicted label of all instances y
    p^0 ← 1
    S^0 ← sampling(U, p^0_U)
    [D^0, y] ← [L, y = 1] ∪ [S^0, y = 0]
    do
        p^i(y|x_1), ..., p^i(y|x_{m+n}) ← predict(h_θ(x; [D^{i-1}, y]), L ∪ U)
        S^i ← sampling(U, p^i_U)
        [D^i, y] ← [L, y = 1] ∪ [S^i, y = 0]
    while Eq. 4 > ε
    // create an ensemble of models
    h^E_θ ← Null
    for k ← 1...K do
        S^k ← sampling(U, p^i_U)
        [D^k, y] ← [L, y = 1] ∪ [S^k, y = 0]
        h^E_θ ← h^E_θ ∪ h^{b_k}_θ(x; [D^k, y])
    end
    y ← classify(h^E_θ; L ∪ U)
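For concreteness, the following is a minimal Python sketch of the procedure in Algorithms 1 and 2, built on scikit-learn. It is an illustration rather than the authors' implementation (their code is available from the project repository cited below); the function names ada_sampling_single and ada_sampling_ensemble and the parameters eps, max_iter and n_models are assumptions introduced here. Any classifier exposing predict_proba could take the place of the SVM.

```python
import numpy as np
from sklearn.svm import SVC

def ada_sampling_single(X_pos, X_unl, base_model=None, eps=0.01,
                        max_iter=20, random_state=0):
    """Wrapper-based adaptive sampling around a probabilistic classifier (cf. Alg. 1)."""
    rng = np.random.default_rng(random_state)
    model = base_model or SVC(C=1.0, kernel="rbf", probability=True)
    X_all = np.vstack([X_pos, X_unl])
    n_pos, n_unl = len(X_pos), len(X_unl)

    p_neg = np.ones(n_unl)   # equal initial probability of each unlabeled instance being negative
    prev_mean = 0.0

    for _ in range(max_iter):
        # Draw a negative set from U (with replacement), weighted by negative likelihood.
        idx = rng.choice(n_unl, size=n_unl, replace=True, p=p_neg / p_neg.sum())
        X_train = np.vstack([X_pos, X_unl[idx]])
        y_train = np.concatenate([np.ones(n_pos), np.zeros(n_unl)])

        # Fit the wrapped classifier and re-score every instance in L and U.
        model.fit(X_train, y_train)
        p_pos = model.predict_proba(X_all)[:, 1]

        # Update negative-selection probabilities: 1 - p(y = 1 | x) for x in U.
        p_neg = 1.0 - p_pos[n_pos:]

        # Terminate when the mean prediction probability changes by less than eps (cf. Eq. 4).
        if abs(p_pos.mean() - prev_mean) < eps:
            break
        prev_mean = p_pos.mean()

    return model, p_neg

def ada_sampling_ensemble(X_pos, X_unl, n_models=10, random_state=0, **kwargs):
    """Ensemble variant (cf. Alg. 2): reuse the final negative likelihoods to train K base models."""
    rng = np.random.default_rng(random_state)
    _, p_neg = ada_sampling_single(X_pos, X_unl, random_state=random_state, **kwargs)
    X_all = np.vstack([X_pos, X_unl])
    n_pos, n_unl = len(X_pos), len(X_unl)

    scores = []
    for _ in range(n_models):
        idx = rng.choice(n_unl, size=n_unl, replace=True, p=p_neg / p_neg.sum())
        X_train = np.vstack([X_pos, X_unl[idx]])
        y_train = np.concatenate([np.ones(n_pos), np.zeros(n_unl)])
        base = SVC(C=1.0, kernel="rbf", probability=True).fit(X_train, y_train)
        scores.append(base.predict_proba(X_all)[:, 1])

    # Aggregate base-model probabilities into the final ensemble prediction.
    return np.mean(scores, axis=0)
```

A scikit-learn KNeighborsClassifier, which also exposes predict_proba, could be substituted for the SVC to mirror the kNN experiments reported below.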
3 Experimental Procedure

This section summarises the datasets used for evaluation and describes the performance evaluation strategy.

3.1 Synthetic Datasets

Synthetic datasets were used to analyse the behaviour of AdaSampling. In particular, we simulated 100 labeled positive instances (denoted as L+) from a normal distribution N(6, 1) and 300 unlabeled negative instances (denoted as U−) from a normal distribution N(4, 1). Then, 50 or 100 unlabeled positive instances (denoted as U+) were added to the data to simulate easy and hard scenarios, respectively. Together, this gives two synthetic datasets: in the easy scenario there are 100 labeled positive instances and 350 unlabeled instances (a ratio of 1:0.5:3 for L+, U+ and U−), and in the hard scenario there are 100 labeled positive instances and 400 unlabeled instances (a ratio of 1:1:3 for L+, U+ and U−).

3.2 Real-World Datasets and Cross-Validation

We utilised five benchmark datasets for performance evaluation. These include breast cancer diagnosis (Breast), prediction of free electrons in the ionosphere (Ionosphere), sonar discrimination of mines vs. rocks (Sonar), the Wisconsin database of breast cancer (WDBC), and the Pima Indians diabetes dataset (Pima). All these datasets were obtained from the UC Irvine Machine Learning Repository [Lichman, 2013]. To simulate positive unlabeled learning scenarios, we treated instances from the negative class as unlabeled and introduced 50% and 67% of unlabeled positive instances with respect to the positive class by randomly removing the label information of 1/2 or 2/3 of the instances from the positive class, creating an easy and a hard scenario. This gives 2 configurations of each dataset on which the evaluation experiments were performed (Table 1).

Table 1: Summary of real-world datasets and configurations used for positive unlabeled learning.

| Dataset | P | N | \|L+\| | \|U+\| | \|L+\|/\|U+\| |
|---|---|---|---|---|---|
| Breast (easy) | 239 | 444 | 119 | 120 | 1 |
| Breast (hard) | 239 | 444 | 80 | 159 | 0.5 |
| Ionosphere (easy) | 126 | 225 | 63 | 63 | 1 |
| Ionosphere (hard) | 126 | 225 | 42 | 84 | 0.5 |
| Sonar (easy) | 97 | 111 | 49 | 48 | 1 |
| Sonar (hard) | 97 | 111 | 32 | 65 | 0.5 |
| WDBC (easy) | 212 | 357 | 106 | 106 | 1 |
| WDBC (hard) | 212 | 357 | 71 | 141 | 0.5 |
| Pima (easy) | 268 | 500 | 134 | 134 | 1 |
| Pima (hard) | 268 | 500 | 89 | 179 | 0.5 |

We used a multi-layered repeated 5-fold cross-validation (CV) procedure to evaluate the performance of each method. Specifically, label information of instances from the positive class was randomly removed; this was repeated 5 times, each time with a different set of selected instances, and constitutes the first layer of randomisation. Subsequently, the data were split for 5-fold CV, and this was repeated 10 times, each with a different split, giving the second layer of randomisation nested within the first. The performance of each method is reported as the average over the trials plus and minus the mean standard error with respect to a given evaluation metric described below.
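As an illustration of this setup, the sketch below shows one way the label-removal step could be coded; make_pu_labels and its arguments are hypothetical names introduced here, with hidden_frac = 1/2 corresponding to the easy configuration and 2/3 to the hard one.

```python
import numpy as np

def make_pu_labels(y, hidden_frac, random_state=0):
    """Hide a fraction of positive labels so those instances join the unlabeled set."""
    rng = np.random.default_rng(random_state)
    y_pu = y.copy()
    pos_idx = np.flatnonzero(y == 1)
    n_hidden = int(round(hidden_frac * len(pos_idx)))
    hidden = rng.choice(pos_idx, size=n_hidden, replace=False)
    y_pu[hidden] = 0   # hidden positives are now indistinguishable from unlabeled data
    return y_pu, hidden

# In the evaluation above, this label removal is repeated 5 times (first layer of
# randomisation) and, for each repeat, 5-fold CV is run with 10 different splits
# (second layer).
```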
3.3 Classification Algorithms and Evaluation Metrics

We applied AdaSampling with the support vector machine (SVM) and k-nearest neighbour (kNN) classification algorithms. SVM and kNN are typical examples of eager and lazy learning algorithms, respectively, and therefore represent two different types of method that can be used together with AdaSampling. An SVM with a radial basis function kernel (C = 1) and a kNN with k = 3 were used across all positive unlabeled methods as well as the baseline, to provide an objective comparison between the positive unlabeled learning methods.

The evaluation metrics utilised for performance comparison are sensitivity (Se), specificity (Sp), F1 score, and geometric mean (GM). Area under the curve (AUC) is not included as a comparison metric because it is not effective for evaluating the bias-based approach, where the ranking of the predictions often remains the same [Elkan and Noto, 2008], leading to the same ROC curve but adjusted thresholds. Given that all benchmark datasets used in this study have a roughly balanced class distribution, the F1 score and geometric mean provide a good trade-off between sensitivity and specificity for method comparison. Specifically, each metric is defined as follows:

$$\mathrm{Se} = \frac{TP}{TP + FN}; \quad \mathrm{Sp} = \frac{TN}{FP + TN}; \quad F_1 = \frac{2TP}{2TP + FP + FN}; \quad \mathrm{GM} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TP}{TP + FP}};$$

where TP, TN, FP, and FN are true positives, true negatives, false positives, and false negatives, respectively.
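For reference, these four metrics can be computed from confusion-matrix counts as in the small sketch below; the function name pu_metrics is illustrative only.

```python
import numpy as np

def pu_metrics(tp, tn, fp, fn):
    se = tp / (tp + fn)                                 # sensitivity (recall)
    sp = tn / (fp + tn)                                 # specificity
    f1 = 2 * tp / (2 * tp + fp + fn)                    # F1 score
    gm = np.sqrt((tp / (tp + fn)) * (tp / (tp + fp)))   # geometric mean of recall and precision
    return se, sp, f1, gm
```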
4 Results

This section presents the experimental results obtained on the synthetic datasets and the performance comparison on the real-world datasets. All the data and code are available from the project repository1.

4.1 Analysis on Synthetic Datasets

We first evaluated whether AdaSampling would allow classification algorithms to recover unlabeled positive instances from the synthetic datasets. As can be seen from Figure 2, initially, unlabeled positive instances generally receive low classification probabilities with respect to (w.r.t.) the positive class. However, after only 2 iterations, classifiers coupled with the AdaSampling procedure are able to drastically increase the classification probabilities of most unlabeled positive instances with respect to the positive class. These results indicate that AdaSampling is highly effective in adaptive learning and converges in very few iterations.

Figure 2: Evaluation of AdaSampling iterations on synthetic datasets. (a) Predicted probabilities of unlabeled positive instances with respect to (w.r.t.) the positive class in the easy case (a ratio of 1:0.5:3 for L+, U+ and U−). (b) Predicted probabilities of unlabeled positive instances with respect to the positive class in the hard case (a ratio of 1:1:3 for L+, U+ and U−).

Figure 3 shows the decision boundaries created by each classification algorithm. Baseline results correspond to classification obtained by treating all unlabeled instances simply as negative class examples. Results from AdaSampling correspond to applying Alg. 1 to create the final classification models. It is apparent that the decision boundaries created by all classification models in the baseline setting significantly over-penalise positive instances (brown strips of Figure 3). Expectedly, such over-penalisation increases with the number of unlabeled positive instances (compare results under the brown strips in Figure 3 (a) and (b)). AdaSampling enables classification models to recover a large proportion of labeled as well as unlabeled positive instances from being over-penalised, by reducing the chance of selecting unlabeled positive instances and therefore extending the decision boundaries around positive instances (green strips of Figure 3).

Figure 3: Comparison of baseline (i.e. treating all unlabeled instances as negative examples) and AdaSampling-assisted classification on synthetic data. Decision boundaries of SVM and kNN on the (a) easy and (b) hard dataset.

1 https://github.com/PengyiYang/AdaSampling

Table 2: SVM prediction without or with positive unlabeled learning methods. Values are mean ± standard error (%) for the easy and hard configuration of each dataset.

Breast:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 97±0 | 96.6±0 | 95.5±0 | 95.5±0 | 97±0 | 96.6±0 | 95.5±0 | 95.5±0 |
| Baseline | 41.4±0.6 | 99.5±0 | 57.4±0.7 | 63.1±0.5 | 6.5±0.8 | 100±0 | 11.4±1.2 | 35±1.2 |
| Bias Model | 66.9±0.4 | 98.5±0.1 | 78.5±0.3 | 80±0.3 | 61.3±1.9 | 88.3±1.9 | 67.4±1 | 70.4±0.8 |
| Bag Model | 57.7±0.4 | 99.2±0.1 | 72.1±0.4 | 74.8±0.3 | 19.2±0.6 | 99.7±0.1 | 31.4±0.8 | 42.3±0.7 |
| Ada Single | 97.8±0.1 | 95.8±0.1 | 95.1±0.1 | 95.2±0.1 | 97.1±0.2 | 96.1±0.1 | 95±0.1 | 95.1±0.1 |
| Ada Ensemble | 97.9±0.1 | 95.8±0.1 | 95.2±0 | 95.3±0 | 97.5±0.1 | 96.1±0.1 | 95.2±0.1 | 95.3±0.1 |

Ionosphere:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 87±0.2 | 98.1±0 | 91.3±0.1 | 91.5±0.1 | 87±0.2 | 98.1±0 | 91.3±0.1 | 91.5±0.1 |
| Baseline | 15.7±1 | 99.9±0 | 25.5±1.4 | 40.2±1.2 | 0.7±0.1 | 100±0 | 1.3±0.2 | NA |
| Bias Model | 77.5±0.7 | 97.2±0.3 | 84.5±0.4 | 85.3±0.4 | 64.2±1.2 | 97.8±0.4 | 75.3±0.9 | 77.5±0.7 |
| Bag Model | 54.3±0.9 | 99.5±0.1 | 68.8±0.9 | 72.4±0.7 | 27.3±1.1 | 99.9±0 | 41.5±1.3 | 50.8±1 |
| Ada Single | 90.8±0.2 | 91.3±0.4 | 88.2±0.3 | 88.4±0.3 | 87.1±0.5 | 89.9±0.5 | 85.1±0.3 | 85.4±0.3 |
| Ada Ensemble | 91.2±0.2 | 92.6±0.3 | 89.3±0.2 | 89.5±0.2 | 87.9±0.5 | 91.4±0.5 | 86.6±0.4 | 86.8±0.4 |

Sonar:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 77.3±0.3 | 88.3±0.2 | 80.9±0.2 | 81.3±0.2 | 77.3±0.3 | 88.3±0.2 | 80.9±0.2 | 81.3±0.2 |
| Baseline | 23.8±1.4 | 99.8±0.1 | 36.8±1.9 | 49±1.3 | 7.9±1.1 | 100±0 | 13.4±1.8 | 38.1±0.9 |
| Bias Model | 55.3±0.8 | 87.9±1.2 | 64.7±0.5 | 66.8±0.5 | 45.3±2.4 | 81.3±3.3 | 52±0.9 | 56.7±0.6 |
| Bag Model | 40.8±0.8 | 96.3±0.3 | 55.5±0.9 | 60.3±0.8 | 20.2±1.5 | 98.7±0.2 | 31.3±2.2 | 47.2±1 |
| Ada Single | 63.7±0.5 | 77.4±1.1 | 67±0.4 | 67.5±0.4 | 54.8±1.3 | 75.9±2.2 | 59.5±0.6 | 61±0.6 |
| Ada Ensemble | 65.2±0.5 | 78.1±1.1 | 68.5±0.4 | 68.9±0.4 | 55.8±1.1 | 76.2±1.9 | 60.5±0.5 | 61.7±0.4 |

WDBC:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 95.6±0.1 | 98.7±0 | 96.6±0 | 96.7±0 | 95.6±0.1 | 98.7±0 | 96.6±0 | 96.7±0 |
| Baseline | 28.2±1.3 | 100±0 | 42.7±1.6 | 52.1±1.3 | 3.2±0.5 | 100±0 | 5.9±0.8 | 25.8±0.4 |
| Bias Model | 74.9±1.2 | 95.1±0.6 | 81.4±0.4 | 82.3±0.4 | 72±1.9 | 80.5±2.2 | 70.3±0.9 | 72.2±0.7 |
| Bag Model | 51.8±0.3 | 100±0 | 67.8±0.3 | 71.7±0.2 | 15.5±1 | 100±0 | 25.8±1.5 | 40.1±1.2 |
| Ada Single | 96.4±0.2 | 93.8±0.2 | 93.2±0.2 | 93.3±0.1 | 95.4±0.3 | 92.7±0.2 | 91.9±0.1 | 92.1±0.1 |
| Ada Ensemble | 96.6±0.1 | 93.5±0.1 | 93.2±0.1 | 93.3±0.1 | 95.5±0.3 | 93.1±0.2 | 92.3±0.1 | 92.4±0.1 |

Pima:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 55.1±0.1 | 87.2±0.1 | 61.5±0.1 | 62±0.1 | 55.1±0.1 | 87.2±0.1 | 61.5±0.1 | 62±0.1 |
| Baseline | 1.2±0.1 | 99.8±0 | 2.3±0.2 | NA | 0.2±0.1 | 99.9±0 | 0.3±0.1 | NA |
| Bias Model | 98.1±0.3 | 4.2±0.7 | 52.1±0.1 | 59±0.1 | 98.3±0.3 | 2.4±0.4 | 51.7±0.1 | 58.7±0.1 |
| Bag Model | 9.5±0.3 | 98.8±0.1 | 16.5±0.5 | 26.9±0.5 | 1.3±0.1 | 99.8±0 | 2.6±0.1 | 13.3±0 |
| Ada Single | 82.7±0.3 | 62.6±0.3 | 65.5±0.2 | 67±0.2 | 79±0.2 | 64.4±0.2 | 64.4±0.1 | 65.6±0.1 |
| Ada Ensemble | 83.3±0.2 | 62.9±0.2 | 66±0.1 | 67.5±0.1 | 80.4±0.2 | 64.6±0.2 | 65.3±0.1 | 66.5±0.1 |
4.2 Classification of Real-World Datasets

Tables 2 and 3 compare SVM and kNN classification on the five real-world datasets using the baseline approach (i.e. treating all unlabeled instances as negative examples), the bias-based approach (Bias Model) described in [Elkan and Noto, 2008], the bagging-like approach (Bag Model) described in [Mordelet and Vert, 2014], and the AdaSampling-based single model (Ada Single) and ensemble of models (Ada Ensemble) proposed here. The classification performance of SVM and kNN on the original datasets (i.e. with both positive and negative instances defined) is provided as a gold standard.

Direct application of both SVM and kNN to positive unlabeled data gives low predictive sensitivities (Baseline, Tables 2 and 3), and the sensitivity decreases as the number of unlabeled positive instances increases (hard cases). Bias Model improves predictive sensitivity in most cases but suffers from low specificity on the Pima dataset. It appears that Bias Model over-corrected towards the positive class on the Pima dataset. This demonstrates a potential problem with relying on correcting an initial classifier trained by treating all unlabeled instances as negatives: if too many unlabeled positive instances are used as negative examples, the classification model will be of poor quality, resulting in an invalid correction. Bag Model appears to improve moderately on predictive sensitivity, but its overall performance is lower than that of the other positive unlabeled learning approaches according to the F1 score and geometric mean. This is expected, as Bag Model applies no explicit mechanism to deal with unlabeled positive instances; while the bootstrap sampling on unlabeled instances may avoid selecting unlabeled positive instances, this is not enforced because the sampling is completely random.

In comparison, the AdaSampling-based approaches achieved the highest prediction accuracy in terms of F1 score and geometric mean on all tested datasets, in both easy and hard cases, regardless of the classification algorithm (i.e. SVM or kNN). Moreover, Ada Ensemble outperformed Ada Single in most cases, suggesting an added advantage of incorporating heterogeneous models using AdaSampling. It is worth noting that in a few cases the AdaSampling-based approaches even outperformed the gold standard, in which all original labels were used for learning. This suggests that AdaSampling can not only recover missing label information but may also identify and correct potential label noise in the original datasets.

Table 3: kNN prediction without or with positive unlabeled learning methods. Values are mean ± standard error (%) for the easy and hard configuration of each dataset.

Breast:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 95.7±0.1 | 97.5±0 | 95.5±0 | 95.6±0 | 95.7±0.1 | 97.5±0 | 95.5±0 | 95.6±0 |
| Baseline | 48.6±0.6 | 99.1±0.1 | 64.3±0.6 | 68.3±0.5 | 25.6±0.4 | 99.5±0.1 | 40±0.5 | 49.2±0.5 |
| Bias Model | 82.1±0.3 | 97.7±0.1 | 88±0.2 | 88.3±0.2 | 67.8±0.5 | 98.2±0.1 | 79±0.4 | 80.3±0.3 |
| Bag Model | 64.3±0.5 | 98.8±0.1 | 77.1±0.4 | 78.8±0.4 | 38.6±0.6 | 99.2±0.1 | 54.7±0.6 | 60.7±0.6 |
| Ada Single | 97.2±0.1 | 97.2±0.1 | 96±0.1 | 96±0.1 | 95.2±0.3 | 97.4±0.1 | 95.2±0.1 | 95.2±0.1 |
| Ada Ensemble | 97.4±0.1 | 97±0.1 | 96±0.1 | 96±0.1 | 96.2±0.3 | 97.3±0.1 | 95.6±0.1 | 95.7±0.1 |

Ionosphere:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 60.4±0.3 | 98.1±0.1 | 73.2±0.2 | 75.4±0.2 | 60.4±0.3 | 98.1±0.1 | 73.2±0.2 | 75.4±0.2 |
| Baseline | 32±1.5 | 98.7±0.1 | 45.7±1.8 | 54.4±1.2 | 17.7±1.6 | 99.3±0.1 | 27.5±2.3 | 44.4±1.2 |
| Bias Model | 61±1.3 | 96.5±0.2 | 71.9±0.9 | 74±0.7 | 45.1±1.9 | 97.9±0.3 | 58.6±1.7 | 63.5±1.3 |
| Bag Model | 42.1±1.9 | 98.3±0.1 | 55.9±2 | 61.3±1.6 | 24.9±2.2 | 99.1±0.2 | 36.2±2.8 | 51.6±1.6 |
| Ada Single | 65.9±1.6 | 96.1±0.3 | 75±1.3 | 76.7±1 | 58.3±2 | 94.4±0.4 | 67.5±1.6 | 69.8±1.3 |
| Ada Ensemble | 67.2±1.5 | 97.1±0.2 | 76.9±1.1 | 78.5±0.9 | 56.4±2.4 | 96.4±0.3 | 66.8±2 | 70.1±1.6 |

Sonar:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 73.1±0.4 | 87.7±0.2 | 77.9±0.2 | 78.3±0.2 | 73.1±0.4 | 87.7±0.2 | 77.9±0.2 | 78.3±0.2 |
| Baseline | 40.6±2 | 93.8±0.2 | 53±2.2 | 57.1±2 | 22.7±1.9 | 97.2±0.3 | 33.5±2.5 | 46.5±1.9 |
| Bias Model | 55.3±1.1 | 85.9±0.6 | 63.6±1 | 65.2±1 | 37.4±1 | 91.7±0.6 | 49.7±1 | 54.1±0.9 |
| Bag Model | 48.7±2 | 90.4±0.4 | 59.4±1.9 | 61.9±1.7 | 27.6±2 | 95.5±0.5 | 39±2.5 | 48±2 |
| Ada Single | 66.1±1.6 | 60.7±0.9 | 62±0.9 | 62.5±0.9 | 54.3±1.6 | 60.8±1.6 | 53.7±0.7 | 54.3±0.7 |
| Ada Ensemble | 68±1.6 | 60.4±1.1 | 63.2±0.8 | 63.7±0.8 | 55.6±1.6 | 60.4±1.6 | 54.6±0.7 | 55.2±0.7 |

WDBC:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 87.8±0.1 | 95.9±0.1 | 90.1±0.1 | 90.2±0.1 | 87.8±0.1 | 95.9±0.1 | 90.1±0.1 | 90.2±0.1 |
| Baseline | 41.6±0.6 | 98.5±0.1 | 57.3±0.6 | 62.4±0.5 | 24±0.6 | 99.2±0 | 37.7±0.8 | 47.1±0.7 |
| Bias Model | 74.5±0.6 | 93.1±0.1 | 79.9±0.4 | 80.3±0.4 | 59±0.4 | 95.2±0.2 | 70.3±0.4 | 71.9±0.3 |
| Bag Model | 55.2±0.6 | 97.8±0.1 | 69.1±0.5 | 71.7±0.4 | 34.2±0.6 | 98.7±0.1 | 49.6±0.7 | 56.3±0.6 |
| Ada Single | 90.2±0.3 | 91.3±0.3 | 88.1±0.1 | 88.2±0.1 | 88.6±0.5 | 91.9±0.4 | 87.6±0.2 | 87.8±0.2 |
| Ada Ensemble | 90.7±0.3 | 90.9±0.3 | 88.1±0.1 | 88.2±0.1 | 89.8±0.5 | 92±0.4 | 88.4±0.2 | 88.5±0.2 |

Pima:
| Method | Se (easy) | Sp (easy) | F1 (easy) | GM (easy) | Se (hard) | Sp (hard) | F1 (hard) | GM (hard) |
|---|---|---|---|---|---|---|---|---|
| Original | 53.7±0.2 | 78.4±0.2 | 55.2±0.1 | 55.3±0.1 | 53.7±0.2 | 78.4±0.2 | 55.2±0.1 | 55.3±0.1 |
| Baseline | 19.2±0.2 | 93.4±0.2 | 28.9±0.3 | 33.9±0.3 | 8.6±0.3 | 97±0.1 | 14.8±0.4 | 22.3±0.5 |
| Bias Model | 59.2±0.3 | 67.1±0.3 | 53.6±0.2 | 53.9±0.2 | 43.5±0.4 | 77.1±0.2 | 46.6±0.3 | 46.8±0.3 |
| Bag Model | 29.6±0.3 | 88.7±0.2 | 39±0.4 | 41.4±0.4 | 14.5±0.3 | 94.6±0.2 | 23±0.5 | 28.9±0.5 |
| Ada Single | 77.2±0.4 | 59.4±0.4 | 61.1±0.2 | 62.5±0.2 | 74.2±0.4 | 60.4±0.5 | 59.8±0.3 | 61.1±0.3 |
| Ada Ensemble | 79.7±0.3 | 59.4±0.4 | 62.4±0.2 | 64±0.2 | 77±0.3 | 59.9±0.5 | 61.2±0.2 | 62.5±0.2 |

5 Conclusion

In this study, we proposed an adaptive sampling approach, called AdaSampling, for positive unlabeled learning. The proposed approach inherits the spirit of wrapper methods, in which a classification model is used iteratively to assess the likelihood of each instance with respect to each class. AdaSampling is a flexible framework and can be utilised to optimise the training data for an individual classification model as well as to construct more complex ensemble models. Our experimental results demonstrated that both the single classification model and the ensemble of models derived from AdaSampling perform significantly better than those without AdaSampling, and in most cases they also outperform other state-of-the-art approaches for positive unlabeled learning. We note that, with minor modifications, AdaSampling can easily be extended for (1) multi-class classification and (2) class label noise identification and correction. The current study forms the basis of our future work in these directions.

Acknowledgements

This work is supported by an Australian Research Council (ARC) Discovery Early Career Researcher Award (DE170100759) to Pengyi Yang and a National Health and Medical Research Council (NHMRC) Career Development Fellowship (1105271) to Jean Yang.

References

[Breiman, 1996] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[Calvo et al., 2007] Borja Calvo, Pedro Larrañaga, and José A Lozano. Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters, 28(16):2375–2384, 2007.
[Claesen et al., 2015] Marc Claesen, Frank De Smet, Johan A.K. Suykens, and Bart De Moor. A robust ensemble approach to learn from positive and unlabeled data using SVM base models. Neurocomputing, 160:73–84, 2015.

[Denis et al., 2002] François Denis, Rémi Gilleron, and Marc Tommasi. Text classification from positive and unlabeled examples. In Proceedings of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU'02, pages 1927–1934, 2002.

[Denis et al., 2005] François Denis, Rémi Gilleron, and Fabien Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, 348(1):70–83, 2005.

[Elkan and Noto, 2008] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–220. ACM, 2008.

[Khan and Madden, 2014] Shehroz S Khan and Michael G Madden. One-class classification: taxonomy of study and review of techniques. The Knowledge Engineering Review, 29(03):345–374, 2014.

[Kohavi and John, 1997] Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial Intelligence, 97(1):273–324, 1997.

[Lee and Liu, 2003] Wee S Lee and Bing Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 448–455, 2003.

[Li and Liu, 2003] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 587–592. Morgan Kaufmann Publishers Inc., 2003.

[Li et al., 2009] Xiaoli Li, Philip S Yu, Bing Liu, and See-Kiong Ng. Positive unlabeled learning for data stream classification. In SDM, volume 9, pages 257–268. SIAM, 2009.

[Li et al., 2011] Wenkai Li, Qinghua Guo, and Charles Elkan. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing, 49(2):717–725, 2011.

[Lichman, 2013] M. Lichman. UCI Machine Learning Repository, 2013.

[Liu et al., 2002] Bing Liu, Wee Sun Lee, Philip S Yu, and Xiaoli Li. Partially supervised classification of text documents. In ICML, volume 2, pages 387–394. Citeseer, 2002.

[Liu et al., 2003] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S Yu. Building text classifiers using positive and unlabeled examples. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 179–186. IEEE, 2003.

[Mordelet and Vert, 2014] Fantine Mordelet and J-P Vert. A bagging SVM to learn from positive and unlabeled examples. Pattern Recognition Letters, 37:201–209, 2014.

[Nigam et al., 1998] Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom Mitchell. Learning to classify text from labeled and unlabeled documents. In Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, pages 792–799. American Association for Artificial Intelligence, 1998.

[Nigam et al., 2000] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–134, 2000.

[Yang et al., 2012] Peng Yang, Xiao-Li Li, Jian-Ping Mei, Chee-Keong Kwoh, and See-Kiong Ng. Positive-unlabeled learning for disease gene identification. Bioinformatics, 28(20):2640–2647, 2012.
[Yang et al., 2016] Pengyi Yang, Sean J Humphrey, David E James, Yee Hwa Yang, and Raja Jothi. Positive-unlabeled ensemble learning for kinase substrate prediction from dynamic phosphoproteomics data. Bioinformatics, 32(2):252–259, 2016.