# selfpu_self_boosted_and_calibrated_positiveunlabeled_training__b52e23a3.pdf

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

Xuxi Chen* 1 Wuyang Chen* 2 Tianlong Chen 2 Ye Yuan 2 Chen Gong 3 Kewei Chen 4 Zhangyang Wang 2

Many real-world applications have to tackle the Positive-Unlabeled (PU) learning problem, i.e., learning binary classiﬁers from a large amount of unlabeled data and a few labeled positive examples. While current state-of-the-art methods employ importance reweighting to design various risk estimators, they ignored the learning capability of the model itself, which could have provided reliable supervision. This motivates us to propose a novel Self-PU1 learning framework, which seamlessly integrates PU learning and self-training. Self-PU highlights three self - oriented building blocks: a self-paced training algorithm that adaptively discovers and augments conﬁdent positive/negative examples as the training proceeds; a self-calibrated instance-aware loss; and a self-distillation scheme that introduces teacher-students learning as an effective regularization for PU learning. We demonstrate the state-of-the-art performance of Self-PU on common PU learning benchmarks (MNIST and CIFAR-10), which compare favorably against the latest competitors. Moreover, we study a realworld application of PU learning, i.e., classifying brain images of Alzheimer s Disease. Self PU obtains signiﬁcantly improved results on the renowned Alzheimer s Disease Neuroimaging Initiative (ADNI) database over existing methods.

1. Introduction

For standard supervised learning of binary classiﬁers, both positive and negative classes need to be collected for train-

*Equal contribution The work was done when Xuxi Chen was mentored by Zhangyang Wang. 1University of Science and Technology of China 2Texas A&M University 3Nanjing University of Science and Technology 4Banner Alzheimer s Institute. Correspondence to: Zhangyang Wang <atlaswang@tamu.edu>.

Proceedings of the 37 th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s). 1The code is publicly available at: https://github.com/ TAMU-VITA/Self-PU.

ing purposes. However, this is not always a realistic setting in many applications, where one certain class of data could be difﬁcult to be collected or annotated. For example, in chronic disease diagnosis, while we might safely consider a diagnosed patient to be positive , the much larger population of undiagnosed individuals are practically mixed with both positive (patient) and negative (healthy) examples, since people might be undergoing the disease s incubation period (Armenian & Lilienfeld, 1974) or might just have not seen doctors. Roughly labeling the undiagnosed examples all as negative will hence lead to biased classiﬁers that inevitably underestimate the risk of chronic disease.

Given those practical demands, Positive-Unlabeled (PU) Learning has been increasingly studied in recent years, where a binary classiﬁer is to be learned from a part of positive examples, plus an unlabeled sample pool of mixed and unspeciﬁed positive and negative examples. Because of this weak supervision, PU learning is more challenging than standard supervised or semi-supervised classiﬁcation problems. Early works tried to identify reliable negative examples from the unlabeled data by hand-crafted heuristics or standard semi-supervised learning methods (Liu et al., 2002; Li & Liu, 2003). Recently, importance reweighting methods such as unbiased PU (u PU) (Du Plessis et al., 2014; 2015) and non-negative PU (nn PU) (Kiryo et al., 2017) treat unlabeled data as weighted negative ones.

Despite these successes, self-supervision via auxiliary or surrogate tasks was never considered, which could potentially supply another means of reliable supervision. This motivates us to explore the learning capability of the model itself. Our proposed Self-PU learning framework exploits three aspects of such self-boosts : (a) we design a selfpaced training strategy to progressively select unlabeled examples and update the trust set of conﬁdent examples; (b) we explore a ﬁne-grained calibration of the functions for unconﬁdent examples in a meta-learning fashion; and (c) we construct a collaborative self-supervision between teacher and student models, and enforce their consistency as a new regularization, against the weak supervision in PU learning. Our main contributions are outlined as follows:

A novel self-paced learning pipeline is ﬁrst introduced to adaptively mine conﬁdent examples from unlabeled data, that will be labeled into trusted positive/negative classes. A hybrid loss is applied to both the augmented

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

labeled examples and remaining unlabeled data for supervision. The procedure is repeated progressively, with more unlabeled examples selected each time. A self-calibration strategy is leveraged to further explore the ﬁne-grained treatment of loss functions over unconﬁdent examples, in a meta-learning fashion. A self-distillation scheme is designed via the collaborative training between several teacher networks and student networks, providing a consistency regularization as another fold of self-supervision. In addition to standard benchmarks (MNIST, CIFAR10), a new real-world testbed of PU learning, i.e., Alzheimer s Disease neuroimage classiﬁcation, is evaluated for the ﬁrst time. On the Alzheimer s Disease Neuroimaging Initiative (ADNI) database, Self-PU achieves superior results over existing solutions.

2. Related Work

2.1. PU Learning

Let X Rd and Y {+1, 1} (d N) be the input and output random variables. In PU learning, the training dataset D is composed of a positive set DP and an unlabeled set DU, where we have D = DP DU. DP contains np positive examples xp sampled from P(x|Y = +1) and DU contains nu unlabeled examples xu sampled from P(x). Denote the class prior probability πp = P(Y = +1) and πn = P(Y = 1), where we follow the convention (Kiryo et al., 2017) to assume πp as known throughout the paper. Let g : Rd R be the binary classiﬁer and θ be its parameter, and the L : R {+1, 1} R be loss function. The risk of classiﬁer g, ˆRP U(g) can be approximated by

ˆRP U(g) = πp

i=1 L(g(xp i ), 1)+

i=1 L(g(xu i ), 1) πp

i=1 L(g(xp i ), 1),

which has been known as the unbiased risk estimator for u PU (Du Plessis et al., 2014; 2015; Xu et al., 2017; Elkan & Noto, 2008; Xu et al., 2019b). It was later pointed out that the second line in Eq. (1) would become negative due to overﬁtting complex models (Kiryo et al., 2017).

A non-negative version (nn PU) of Eq. (1) was therefore suggested:

ˆRP U(g) = πp

i=1 L(g(xp i ), 1)+

i=1 L(g(xu i ), 1) πp

i=1 L(g(xp i ), 1))

Importance reweighting methods (e.g. u PU, nn PU) achieve the state of the arts, although treating unlabeled data as weighted negative examples still brings in unreliable supervision. Generative adversarial networks were introduced by (Hou et al., 2018), where the conditional generator produced both negative and positive examples resembling the unlabeled real data. DAN (Liu et al., 2019) tried to recover the positive and negative distributions from the unlabeled data without requiring the class prior.

2.2. Self-Paced Learning

Self-paced learning (Kumar et al., 2010) was presented as a special case of curriculum learning (Bengio et al., 2009), where the feed of training examples was dynamically generated by the model based on its learning history, aiming to simulate the learning principle of starting by learning easy instances and then gradually taking more challenging cases (Khan et al., 2011). Early PU works designed heuristics for sample selection. In (Xu et al., 2019a), positive instead of negative examples are permanently selected by analyzing the distribution of sample loss. Unlike previous PU learning works which rely on crafted sample selection heuristics, we are the ﬁrst to leverage the data-driven self-paced learning to progressively turn unlabeled data into labeled ones.

2.3. Self-Supervised Learning

In many supervision-starved ﬁelds, it is generally difﬁcult to obtain accurate annotations, despite the vast number of unlabeled data available. Self-supervised learning aims to form pseudo supervision for learning informative/discriminative features from the data, where models are required to predict on proxy tasks formed to be relevant to the target goal. It is known to beneﬁt data-efﬁcient learning (Trinh et al., 2019; Jing & Tian, 2020), adversarial robustness (Chen et al., 2020), and outlier detection (Mohseni et al., 2020).

For example, (Laine & Aila, 2016) augmented each unlabeled example with random noises, and forced consistency between the two predictions. In (Tarvainen & Valpola, 2017), two identical models were used during training: the student learned as usual while the teacher model generated labels and updated its weights through a moving average of the student, forcing consistency between two models. (Zhang et al., 2018) further suggested that, instead of exchanging examples, mutual feature distillation between peer networks can form another strong source of supervision, and can enable the collaborative learning of an ensemble of students. To our best knowledge, we are the ﬁrst to consider such self-supervision to improve PU learning.

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

3. The Self-PU Framework

Our proposed Self-PU framework exploits the learning capability of the model itself (Figure 1). We ﬁrst design a self-paced learning pipeline to progressively select and label conﬁdent examples from unlabeled data for supervised learning. On top of that, we calibrate the loss functions over the unconﬁdent examples via meta-learning. Moreover, a consistency loss is introduced between peer networks with different learning paces, which collaboratively teach each other. We further extend our consistency from peer networks to their moving-averages (Tarvainen & Valpola, 2017; Laine & Aila, 2016), as another form of supervision.

self-paced ( 3.1) self-calibrated ( 3.2)

self-supervised ( 3.3)

Teacher1 Teacher2

Student1 Student2

Figure 1. Illustration of the proposed Self-PU framework. After a short warm-up period, the classiﬁer is ﬁrst trained with selfpaced learning, where conﬁdent examples in DU are progressively selected and labeled (positive/negative) into a trusted Dtrust subset for supervised learning, with the loss functions over unconﬁdent examples carefully calibrated. After collecting enough conﬁdent examples, we start the self-supervised learning via distillation between two collaborative students and their teacher networks.

3.1. Self-Paced PU Learning

Despite the success of unbiased PU risk estimators, they still rely on the estimated class prior and reduced weights on unlabeled data. As shown in (Arpit et al., 2017), during gradient descent, deep neural networks tend not to memorize all training data at the same time but tend to memorize frequent or easy patterns ﬁrst and later irregular patterns. If we could ﬁrst ﬁnd out easy examples and label them with conﬁdence, and then augmenting this labeled pool for the training progress, then we can enjoy progressively increased conﬁdent full supervision along with the training, in addition to the weak supervision from the PU risk estimators.

Given the model g, an input example x and the corresponding label y, we may compute the output g(x) and then calculate the probability of x being positive as p(x) = P(Y = +1|x) = f(g(x)), where f is a monotonic function of mapping R [0, 1] (e.g. sigmoid function). A greater p(x) suggests higher conﬁdence that x belongs to positive class as predicted by g, and vice versa. By descending sort of p(x) each time, we can select n most conﬁdent positive and

n most conﬁdent negative examples from the current unlabeled data pool DU. They will be removed from DU and added to our trusted subset Dtrust, considered as labeled training examples hereinafter.

Let LCE(x, y) be the cross entropy loss:

LCE(x, y) = log f(g(x))Iy=1 + log(1 f(g(x))Iy= 1

, Lnn PU(x, y) be the nn PU risk with Sigmoid loss, and together with the given positive subset DP , our hybrid loss for self-paced learning becomes:

(x,y) Dtrust LCE(x, y) + X

x D Dtrust Lnn PU(x) (3)

Note that previous works select either only conﬁdent positive examples (Xu et al., 2019a) or negative examples (Li & Liu, 2003), while our self-paced learning selects both. Since the cross entropy is used as our supervised loss, one advantage is that the trusted sets of positive/negative samples are balanced in size at each sampling step, avoiding the potential pitfall of extreme class imbalance caused by incrementally sampling only one class.

Besides, previous sample selection (Xu et al., 2019a) often sticks to a pre-ﬁxed learning schedule. In contrast, we unleash more ﬂexibility for the model to automatically and adaptively adjust its own learning pace, via the following techniques. Later on, we will experimentally verify their effectiveness via a step-by-step ablation study.

3.1.1. DYNAMIC RATE SAMPLING

As the learning progresses, training examples with easy/frequent patterns and those with harder/irregular patterns are memorized in different training stages (Arpit et al., 2017). It is important to make our self-paced learning compatible with the memorizing process of the model. A small number of easy examples should be selected ﬁrst, and then intermediate to hard examples can be labeled after the model is well-trained. Instead of ﬁxing the number of selected conﬁdent examples, we propose to dynamically choose the number of conﬁdent examples during the self-paced learning. Speciﬁcally, as the self-paced learning proceeds, we linearly increase the size of Dtrust from 0 to r|DU|, where the sampling ratio r could range from 10% to 40% in our experiments (see section 4.3.2). Empirically, we ﬁrst warmup the model by training 10 epochs before starting the self-paced learning, in order to keep the selected conﬁdent examples as accurate as possible.

3.1.2. IN-AND-OUT TRUSTED SET

In previous sample selection approaches, once selected, a trusted example will never be deprived of its label during the subsequent training. In contrast, we allow our training to regret on earlier selections in Dtrust. Especially at the

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

early training stage, the intermediate model might not be well-trained enough and not always reliable for predicting labels, which could mislead training if continuing acting as supervision. To this end, we adaptively update Dtrust by also re-examining its current examples each time when we augment new conﬁdent ones. The previously selected examples will be removed from Dtrust if their predictions by the current model become of low-conﬁdence, and will be treated as unlabeled again.

3.1.3. SOFT LABELS

Instead of giving the selected conﬁdent examples hard labels, we directly use the prediction f(g(x)) as soft labels: [1 f(g(x)), f(g(x))] as [P(Y = 1|x), P(Y = +1|x)], as the practice of label smoothing (Szegedy et al., 2016) appears to beneﬁt learning robustness against label noise.

3.2. Self-Calibrated Loss Reweighting

Only leveraging nn PU risk on DU Dtrust may not be optimal, as some examples in this set can still provide meaningful supervision. To exploit more supervision from this noisy sets, we introduce a learning-to-reweight paradigm (Ren et al., 2018) to the PU learning ﬁeld for the ﬁrst time. Letting LCES(x) = f(g(x)) log f(g(x)) + (1 f(g(x))) log(1 f(g(x))) be the cross entropy loss function with soft labels in Sec. 3.1.3, we adaptively combine LCES and Lnn PU for each example xi in a batch from DU Dtrust, namely:

l(xi, wi) = wi,1LCES(xi) + wi,2Lnn PU(xi)

Let n be the mini-batch size. To learn the optimal w = [w1, w2, . . . , wn]T together in training, we update the model g for a single gradient descent step on l with wi very small (i.e. a perturbation) of a mini-batch of training examples w.r.t. parameters of models θ, followed by a gradient descent step on the cross entropy loss of a mini-batch of validation examples w.r.t w, and then rectify the output to be non-negative. The procedure is described as follows:

i=1 l(xi, yi, wi) (4)

j=1 L CE(xv j, yv j )|wi=0 (5)

wi = max(ui, 0), wi,1 = wi,1 P

i wi,1 , wi,2 = wi,2 P

where δ denotes the step size, m the mini-batch size on the validation set which contains clean positive and negative examples, and (xv j, yv j ), j = 1, 2, . . . , m an example from the validation set with the ground-truth label. L CE(x, y) calculates loss using the updated parameters θ .

Meanwhile, on DU Dtrust, weighting the cross-entropy

loss too much might not be beneﬁcial to the classiﬁer, especially when the soft labels are not accurate enough.

Therefore, we restrict the total weights of the cross-entropy loss via a balancing factor γ:

T = sup{k :

i=1 w2,i < γn} (7)

w i,1 = wi,11{i<T }, w i,2 = wi,21{i<T } + 1{i T } (8)

The corresponding hybrid loss becomes:

LSP+Reweight = X

(x,y) Dtrust LCE(x, y)+ (9)

x (DU Dtrust) l (x) + X

x DP Lnn PU(x), (10)

where l (x) =

Pn i=1 l(xi,w i ) n .

3.3. Self-Supervised Consistency via Distillation

To explore additional sources of supervision, we encourage two forms of self-supervised consistency: among different learning paces of the model, and along the model s own moving averaged trajectory. The two goals are altogether achieved by an innovative distillation scheme, with a pair of collaborative student models and their teacher models.

3.3.1. CONSISTENCY FOR DIFFERENT LEARNING PACES: MAKING A PAIR OF STUDENTS

The consistency between two self-paced models trained with different paces (i.e. sampling ratio in self-paced learning) makes the trained model more resilient to perturbations caused by training stochasticity. To form this selfsupervision, we simultaneously train two networks that share the identical architecture, with the same DP and DU to start on. However, they are set with different conﬁdence thresholds and select different amounts from DU to Dtrust each time, making their learning paces un-synchronized and resulting in two different trusted sets Dtrust1 and Dtrust2. Since two students estimations of the probabilities of classes may differ, we force the consistency between two students as a source of distillation, via a mean-square-error (MSE) loss on two models predictions.

Denote two networks g1 and g2, the MSE from g1 to g2 is deﬁned over D Dtrust1:

LMSE(g1, g2, x) = ||f(g1(x)) f(g2(x))||2, x D Dtrust1 (11) The MSE from g2 to g1 is deﬁned over D Dtrust2:

LMSE(g2, g1, x) = ||f(g2(x)) f(g1(x))||2, x D Dtrust2 (12)

A hard sample mining strategy is further adopted on top, where we only calculate MSE over those challenging unlabeled examples whose nn PU risks (2) are large.

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

Later on, the pair of networks will become two student models for distillation. We therefore have our ﬁrst part of self-supervised consistency loss as:

Lstudents = X

x D Dtrust1 lstu(g1, g2, x)+

x D Dtrust2 lstu(g2, g1, x), (13)

lstu(g1, g2, x) = ( LMSE(g1, g2, x), Lnn PU(x) > αLMSE(g1, g2, x) 0, Lnn PU(x) αLMSE(g1, g2, x) (14)

We study the effect of choosing α in section 4.3.3. The mean squared error between the two students is only calculated on D Dtrust1 and D Dtrust2. One reason why we choose such design is that the accuracy on Dtrust1 and Dtrust2 discovered by the self-paced learning is much higher than the accuracy on D. In addition, on the Dtrust1 and Dtrust2 the prediction entropy is 0.005, while on the unlabeled set it is 0.074, which indicates much lower conﬁdence.

3.3.2. CONSISTENCY FOR MOVING AVERAGED WEIGHTS: ADDING TEACHERS TO DISTILL

Inspired by (Tarvainen & Valpola, 2017), in addition to the consistency between two students, we also encourage them to be consistent with their moving averaged trajectory of weights. Assume that g1 and g2 are parameterized by θ1 and θ2. For each student we introduce a new teacher model, G1 and G2, parameterized by Θ1 and Θ2 with the same structure as g1 and g2. The weights of G1, G2 are updated via the following moving average:

Θ1,t = βθ1,t 1 + (1 β)θ1,t Θ2,t = βθ2,t 1 + (1 β)θ2,t (15)

where θ1,t denotes the instance of θ1 at time t, and similarly for others. We study the effect of β in section 4.3.4.

An MSE loss is next enforced for G1 and G2 to distill from g1 and g2, namely:

Lteachers = X

x D ||f(G1(x)) f(g1(x))||2

x D ||f(G2(x)) f(g2(x))||2 (16)

The above constitutes the second part of our self-supervised consistency cost.

In summary, the beneﬁts of self-supervised learning for PU learning come from two folds: 1) the enlarged labeled examples (Dtrust) introduces stronger supervision into PU learning and brings high accuracy; 2) the consistency cost

between diverse student and teacher models introduces the learning stability (low variance). Eventually, our overall loss function2 of Self-PU is:

L = LSP+Reweight + Lstudents + Lteachers. (17)

In all experiments and as shown in Figure 1, we ﬁrst apply self-paced learning and self-calibrated loss reweighting from the 10th epoch to the 50th epoch, followed by a selfdistillation period from 50th to 200th epoch. That allows for the models to learn sufﬁcient meaningful information before being distilled. After training, we compare the validation accuracy of two teacher models and select the better performer to be applied on the testing set3.

4. Experiments

4.1. Datasets

In order to evaluate our proposed Self-PU learning framework, we conducted experiments on two common testbeds for PU learning: MNIST, and CIFAR-10; plus a new realworld benchmark, i.e. ADNI (Jack Jr et al., 2008), for the application of Alzheimer s Disease diagnosis.

4.1.1. INTRODUCTION TO THE ADNI DATABASE

The Alzheimer s Disease Neuroimaging Initiative (ADNI) database4 was constructed to test whether brain scans, e.g. magnetic resonance imaging (MRI) and other biological markers, can be utilized to predict the early-stage Alzheimer s disease (AD), in order for more timely prevention and treatment. The dataset, especially its MRI image collection, has been widely adopted and studied for the classiﬁcation of Alzheimer s disease (Khvostikov et al., 2018; Li et al., 2015). Fig. 2 shows visual examples.

Traditionally, the machine learning community considers the AD diagnosis task as a binary, fully supervised classiﬁcation task, between the patient and the healthy classes. It has never been connected to PU learning. Yet, we advocate that this task could become a new suitable, realistic and challenging application benchmark for PU learning.

The early-stage AD prediction/diagnosis is highly nontrivial for multi-fold, ﬁeld-speciﬁc reasons. First, many nuance factors can heavily affect the feature effectiveness, ranging from individual patient variability to (mechanical/optical) equipment functional ﬂuctuations, to manual operation and sensor/environment noise. Second, within the whole-brain scans, only some (not fully-speciﬁed) local

2Since here we have two students of different learning paces, our LSP+Reweight is also extended to both Dtrust1 and Dtrust2. 3Note that, here we only select one and discard the other, only for simplicity purpose. Other approaches, such as average or weighted-fusion of the two teachers models, are applicable too. 4http://adni.loni.usc.edu

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

Table 1. Speciﬁcation of benchmark datasets and models. Dataset #Train #Test Input Size πp Positive/Negative Model

MNIST 60,000 10,000 28 28 0.49 Odd/Even 6-layer MLP CIFAR-10 50,000 10,000 3 32 32 0.40 Trafﬁc tools/Animals 13-layer CNN ADNI 822 113 104 128 112 0.43 AD Positive/Negative 3-branch 2-layer CNN

Sagittal Coronal Axial

Figure 2. Cross-sectional imaging of a 104 128 112 MRI example from the ADNI dataset. Images are from the 52nd, 64th, 56th slice of sagittal, coronal, and axial plane, respectively. An MRI image is of gray scale with a value from 0 to 255 for each voxel, and was processed by the intensity inhomogeneity correction, skull-stripping and cerebellum removing.

brain regions are found to be indicative of AD symptoms. Third and most importantly, in contrast to the diagnosed patients, the remaining population, who are not yet clinically diagnosed with AD, cannot be simply treated as all healthy: on one hand, the above challenges of AD diagnosis inevitably lead to incorrectly missed patient cases; on the other hand, and more notably, the AD patients go through a stage called mild cognitive impairment (MCI) (Larson et al., 2004; Duyckaerts & Hauw, 1997), a critical transition period between the expected cognitive decline of normal aging, and the severe decline of true dementia. During the MCI stage, those people were clinically considered as AD patients already (if diagnosed with more intrusive biochemical means); however, no symptom is known to be observable in current MRI images or other bio-markers.

In other words, the MCI examples have deﬁnitely been included in the currently healthy -labeled samples in ADNI, while they should have belonged to the patient class. In training, we label patients as positive class, the healthy - labeled examples can then be considered as unlabeled class, which mixes the true healthy people (i.e., from the actual negative class) and the MCI people (i.e., from the positive class). We communicated with several seasoned medical doctors practicing in AD ﬁelds, and they unanimously agreed that AD diagnosis should be described as a PU learning problem rather than (the traditional treatment as) a binary classiﬁcation problem. In this paper, we study the speciﬁc setting of MRI image classiﬁcation task on the ADNI dataset, while other bio-marker classiﬁcation can be

similarly studied in PU settings.

4.1.2. DATASET SETTING

We report our dataset protocols towards PU learning. More metadata are summarized in Table 1.

MNIST: odd numbers 1, 3, 5, 7, 9 form the positive class while even numbers 0, 2, 4, 6, 8 form the negative.

CIFAR-10: four vehicles classes ( airplane , automobile , ship , truck ) constitute the positive class, and six animal classes ( bird , cat , deer , dog , frog horse ) constitute the negative.

ADNI: We utilized the public ADNI data set as (Li et al., 2015; Yuan et al., 2018) suggested: The T1weighted MRI images were processed by ﬁrst correcting the intensity inhomogeneity, followed by skullstripping and cerebellum removing. We consider the subjects as positive class if they: 1) have positive clinical diagnosis records on ﬁle; or 2) have their standardized uptake value ratio (SUVR) values 5 no less than 1.08 (Villeneuve et al., 2015; Ott et al., 2017; Yuan et al., 2018). While this estimate can be treated as golden rule in clinical practice and is shown to work well in our experiments (Table 8), it can be further adjusted ﬂexibly and used in our framework with ease.

Following the convention of nn PU (Kiryo et al., 2017), we use np = |DP | = 1000 in MNIST and CIFAR-10.In ADNI, we end up with np = 113. nu = |DU| equals the size of remaining training data on all three datasets. πp is the proportion of true positive examples in the dataset.

4.2. Baselines and Implementations

Following nn PU (Kiryo et al., 2017), we used a 6-layer multilayer perceptron (MLP) with Re LU on MNIST. On CIFAR-10, we use a 13-layer CNN with Re LU. We design a multi-scale network for ADNI, which is used as the backbone for all compared baselines: please see supplementary materials for details. We use Adam optimizer with a cosine annealing learning rate scheduler for training. The batch size is 256 for MNIST and CIFAR-10, and 64 for ADNI.

5SUVR is a therapy monitoring or response, considered as an important indicator of Alzheimer s Disease.

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

The γ is set to 1 16. The batch size of validation examples equals to the batch size of the training examples.

For a fair comparison, each experiment runs ﬁve times, and the mean and standard deviations of accuracy are reported.

4.3. Ablation Study

In this section, we carry out a thorough ablation study, on the key components introduced during the self-paced learning stage (i.e., the selection of trusted set) and the selfsupervised distillation stage (i.e., the diversity of students, the effect of hard sample mining when training students, and the effect of weight averaging further by teachers). All experiments are conducted on the CIFAR-10 dataset6. We will study the effect of γ in the supplementary materials.

4.3.1. SELECTION OF TRUSTED SET DTRUST

Since self-paced learning aims to mine more conﬁdent positive/negative examples, it is crucial to ensure the trustworthiness of the selected Dtrust. We therefore calculate the accuracy of assigned labels for Dtrust along the self-paced training, as an indicator of the sampling strategy reliability.

We compare three settings: 1) Fixed sampling size : each time, the model selects a ﬁxed number of samples (e.g. 25% of examples in DU), assigning soft labels and adding them into Dtrust. Meanwhile, low-conﬁdence samples in Dtrust will also be removed in next round of selection. 2) Sampling without replacement : each example selected by model will permanently reside in Dtrust. Here the size of Dtrust is linearly increased along the training progress. 3) Our default approach in Self-PU: both Dynamic Rate Sampling and In-and-Out Trusted Set are enabled. All three settings end up with |Dtrust| = 0.25|DU|.

From Figure 3, we clearly see that sampling either with a ﬁxed size or without replacement results in a less reliable selection of Dtrust, compared to our strategy. Moreover, the inaccurately selected examples in Dtrust will further cause much unstable training (dash line). We demonstrate that both Dynamic Rate Sampling and In-and-Out Trusted Set are vital to achieving an accurate and stable self-paced learning (solid line). Table 2 shows the ﬁnal test accuracy of three settings, where our proposed self-paced learning pipeline (LSP) signiﬁcantly outperforms the other two settings (LSP of ﬁxed sampling size and sampling without replacement). The better accuracy and lower variance show the advantage of our strategy.

6To conduct controlled experiments we disable the selfcalibration strategy in Table 3, 4, 5

Table 2. Classiﬁcation comparison on CIFAR-10: we report both means and standard deviations (in parentheses) from ﬁve runs. LSP: self-paced training. LSPS: self-paced training with soft label in Sec. 3.1.3. LSP + Reweighting: self-paced training with selfcalibrated loss reweighting in section 3.2. LSPS + Reweighting: selfpaced training with soft label and self-calibrated loss reweighting in section 3.2. Lstudents: self-distillation from a pair of students. Lteacher: self-distillation from teacher networks. Self-PU: LSPS + Reweighting + Lstudents + Lteacher

Method CIFAR-10 %

nn PU (baseline) 88.60 (0.40)

LSP (ﬁxed size) 88.05 (0.59) LSP (w.o. replacement) 88.27 (0.43) LSP 88.66 (0.40) LSPS 88.75 (0.27) LSP + Reweighting 89.25 (0.42) LSPS + Reweighting 89.39 (0.36)

LSP + Lstudents 88.84 (0.36) LSPS + Lstudents 88.93 (0.28) LSP + Lstudents + Lteacher 89.43 (0.42) LSPS + Lstudents + Lteacher 89.65 (0.33)

Self-PU 89.68 (0.22)

Table 3. Study of student diversity (learning paces) for two-student distillation on CIFAR-10 dataset. Pace1/Pace2 denotes the ﬁnal ratio of |Dtrust| over |DU|.

Pace1 Pace2 Test Accuracy %

10% 40% 89.32 (0.36) 15% 35% 89.55 (0.46) 20% 30% 89.65 (0.33) 25% 25% 89.64 (0.47)

Table 4. Study of hard sample mining threshold α for two-student distillation on CIFAR-10 dataset. Smaller α indicates stronger distillation (Eq. (14))

α Test Accuracy %

5 89.59 (0.39) 10 89.65 (0.33) 20 89.38 (0.51)

4.3.2. EFFECTS OF STUDENT DIVERSITY

Different learning paces enable the diversity of two students and thus make the collaborative teaching between two students effective. Therefore we study how the student diversity, i.e. combination of their different learning paces, can affect the ﬁnal results. Table 3 considers three different pace pairs. For example, Pace1 10% means that the

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

10 20 30 40 50 60 70 80 90 Epoch

Self paced Sampling Accuracy on CIFAR10

Ours Fixed sampling size Sampling w.o. replacement

Figure 3. Accuracy of selected conﬁdent examples during selfpaced learning. We compared self-paced learning with three different sampling settings: ﬁxed sampling size (dot line), sampling without replacement (dash line), and our proposed dynamic inand-out sampling (solid line). It is clear that self-paced learning with ﬁxed sampling size or without replacement suffers from low sampling accuracy, and no-replacement is even jeopardized by the inaccurate examples remain in the Dtrust.

Table 5. Study of smoothing coefﬁcient β for teacher networks on CIFAR-10 dataset. Greater β indicates slower updates of teachers from the students (Eq. (16))

β Test Accuracy %

0.2 89.37 (0.39) 0.3 89.65 (0.33) 0.4 89.47 (0.41)

Table 6. Study of γ for self-calibrated loss reweighting on CIFAR10. Greater γ indicates larger weight on cross-entropy (Eq. (10))

0.125 89.29% 0.100 89.42% 0.075 89.55% 0.063 89.68% 0.050 89.67% 0.000 89.65%

self-paced learning of the ﬁrst student model will end up with |Dtrust| = 0.1|DU|, and all students will complete the sampling for self-paced learning within the same number of training epochs. Table 3 shows that, while student diversity helps ( 20% + 30% > 25% + 25% ), too large student pace discrepancy will hurt the learning too ( 20% + 30% > 10% + 40% ). Students with very different paces are harmful because a large gap in two learning paces results in a smaller intersection-over-union of Dtrust1 and Dtrust2. It is difﬁcult to keep consistency between two mod-

els trained with different amounts of labeled data. Therefore, it is important to keep diversity, while not too extreme.

4.3.3. EFFECTS OF SAMPLE MINING THRESHOLD

Lstudents takes the hard sample mining threshold α as an important hyperparameter: the smaller α is, the more examples are counted in computing the mean squared error, which implies stronger self-supervision consistency between two students. Table 4 shows that a moderate α = 10 leads to the best performance. Understandably, either under-mining (α = 20) or over-mining (α = 5) hurts the performance: the former is not sufﬁciently regularized, while the latter starts to eliminate the emphasis over hard examples.

4.3.4. EFFECTS OF SMOOTHING COEFFICIENT β

The smoothing coefﬁcient β controls how conservative we distill the teachers from the students: the larger β is, the more reluctant the teacher models get updated from the students. Table 5 investigate three different β values: similar to the previous experiment on α, β also favors a reasonably moderate value, while either overly large or small β, corresponding to over-smoothening and under-smoothening during the distillation from students to teachers respectively, degrades the ﬁnal performance.

4.3.5. EFFECTS OF γ FOR SELF-CALIBRATED LOSS REWEIGHTING

The balancing factor γ restricts the total weights of the crossentropy loss. In Table 6, we report the validation accuracy with different γ, and we can see that the optimal choice of γ is 0.063. Table 6 indicates that mining examples with our calibrated loss contribute to better supervision than using only Lnn PU, while too much weight on LXE may lead to worse validation accuracy.

4.3.6. EFFECT OF TEACHER AND STUDENTS

We verify the effect of two types of distillation in our selfsupervised learning: mutually between Lstudents, and by Lteachers. In Table 2, distillation from two students with different learning paces (Lstudents) improve the accuracy of nn PU baseline from 88.60% to 88.84% on CIFAR-10. By adding two teachers for self-distillation, the performance is further boosted to 89.43%, which endorses the complementary power of two types of self-distillation.

4.4. Comparison to State-of-the-Art Methods

4.4.1. RESULTS ON THE MNIST AND CIFAR-10 BENCHMARKS

We compare the performance of the proposed Self-PU with several popular baselines: the unbiased PU learning (u PU)

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

Table 7. Classiﬁcation comparison on MNIST and CIFAR-10. * indicates that 3,000 positive examples were initialized for training, while others used 1,000.

Method MNIST % CIFAR-10 %

u PU (Du Plessis et al., 2014) 92.52 (0.39) 88.00 (0.62) nn PU (Kiryo et al., 2017) 93.41 (0.20) 88.60 (0.40) DAN* (Liu et al., 2019) - 89.7 (0.40)

Self-PU 94.21 (0.54) 89.68 (0.22) Self-PU 96.00 (0.29) 90.77 (0.21)

Table 8. Classiﬁcation accuracy of different methods on ADNI. naive means that we treat the entire unlabeled class as negative.

Method ADNI %

naive 73.27 (1.45) u PU 73.45 (1.77) nn PU 75.96 (1.42)

Self-PU 79.50 (1.80)

(Du Plessis et al., 2014); the non-negative PU learning (nn PU) (Kiryo et al., 2017)7, and DAN (a latest GAN-based PU method) (Liu et al., 2019).

Table 7 summarizes the comparison results on MNIST and CIFAR-10. On MNIST, Self-PU outperforms u PU and nn PU by over 0.5%, setting new performance records. On CIFAR-10, Self-PU surpasses nn PU by over 1% (a considerable gap). More importantly, by only leveraging 1000 positive examples, Self-PU achieves comparable performance as DAN where 3000 positive samples were used. Training with 3000 positive examples further boosts our performance which outperforms DAN by 1%.

0 20 40 60 80 100 120 140 160 180 200 Epoch

Validation Accuracy on CIFAR10

Self PU nn PU u PU

Figure 4. Validation accuracy during training on CIFAR-10. Our Self-PU framework9achieved a more stable training compared with u PU and nn PU methods.

Our Self-PU achieved not only high accuracy, but more importantly a much more stable PU learning process (Figure 4). As noted in (Kiryo et al., 2017), u PU suffers from overﬁtting complex models. We also empirically found a similar phenomenon in PU learning with nn PU risk estimator, where the validation accuracy remains unstable and even drops in the late training stage. However, the training process of Self-PU is signiﬁcantly more stable than u PU and nn PU. This training stability beneﬁts from both accurately

7We reproduced the u PU and nn PU baselines using the ofﬁcial codebase from: https://github.com/kiryor/ nn PUlearning

9Since in Self-PU we use the teacher model G for the ﬁnal prediction, the solid line shows the accuracy of G starts from epoch 50 when the self-paced training ends.

identiﬁed examples in self-paced training and prediction consistency forced by our self-supervised distillations.

4.4.2. RESULTS ON THE NEW ADNI TESTBED

Finally, we demonstrate the promise of our method on the more complex real-word ADNI data in Table 8. We ﬁrst run a naive fully supervised classiﬁcation baseline, by treating the entire unlabeled class as negative. Its accuracy is much inferior to our PU learning results, validating our PU formulation of the ADNI task. Next, Self-PU gains signiﬁcantly over u PU and nn PU, showing highly promising performance on ADNI and setting new state-of-the-arts. Our sophisticated building blocks seem to add robustness to handling the real-world data variations and challenges.

Furthermore, our above results seems to suggest that conventional PU benchmarks like CIFAR-10 and MNIST may have been saturated (as they already did in image classiﬁcation). We recommend the community to pay more attention to more challenging and realistic new PU learning testbeds, and suggest ADNI as an effective, illustrative and practically important option.

5. Conclusion

We proposed Self-PU, that bridges self-training strategy into PU learning for the ﬁrst time. It leverages both the selfpaced selected set of trusted samples and the consistency supervision via self-distillation and self-calibration. Experiments report state-of-the-art performance of Self-PU on two conventional (and potentially oversimpliﬁed) benchmarks, plus our newly introduced real-world PU testbed of ADNI classiﬁcation. Our future work will explore more realistic PU learning setting, which we believe will motivate new algorithmic ﬁndings.

Armenian, H. and Lilienfeld, A. The distribution of incubation periods of neoplastic diseases. American journal of epidemiology, 1974.

Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M., Maharaj, T., Fischer, A., Courville, A.,

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

and Bengio, Y. A closer look at memorization in deep networks. In ICML, pp. 233 242, 2017.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In ICML, pp. 41 48, 2009.

Chen, T., Liu, S., Chang, S., Cheng, Y., Amini, L., and Wang, Z. Adversarial robustness: From selfsupervised pre-training to ﬁne-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 699 708, 2020.

Du Plessis, M., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386 1394, 2015.

Du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In Neur IPS, 2014.

Duyckaerts, C. and Hauw, J.-J. Diagnosis and staging of alzheimer disease. Neurobiology of aging, 18(4):S33 S42, 1997.

Elkan, C. and Noto, K. Learning classiﬁers from only positive and unlabeled data. In ACM SIGKDD, 2008.

Hou, M., Chaib-draa, B., Li, C., and Zhao, Q. Generative adversarial positive-unlabelled learning. IJCAI, Jul 2018. doi: 10.24963/ijcai.2018/312.

Jack Jr, C., Bernstein, M., Fox, N., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P., Whitwell, J., and Ward, C. The alzheimer s disease neuroimaging initiative (adni): Mri methods. Journal of Magnetic Resonance Imaging, pp. 685 691, 2008.

Jing, L. and Tian, Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Khan, F., Mutlu, B., and Zhu, J. How do humans teach: On curriculum learning and teaching dimension. In Neur IPS, pp. 1449 1457, 2011.

Khvostikov, A., Aderghal, K., Benois-Pineau, J., Krylov, A., and Catheline, G. 3d cnn-based classiﬁcation using smri and md-dti images for alzheimer disease studies. ar Xiv preprint ar Xiv:1801.05968, 2018.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In Neur IPS, pp. 1675 1685, 2017.

Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In Neur IPS. 2010.

Laine, S. and Aila, T. Temporal ensembling for semisupervised learning. ar Xiv:1610.02242, 2016.

Larson, E. B., Shadlen, M.-F., Wang, L., Mc Cormick, W. C., Bowen, J. D., Teri, L., and Kukull, W. A. Survival after initial diagnosis of alzheimer disease. Annals of internal medicine, 140(7):501 509, 2004.

Li, F., Tran, L., Thung, K.-H., Ji, S., Shen, D., and Li, J. A robust deep model for improved classiﬁcation of ad/mci patients. Biomedical Health Informatics, 2015.

Li, X. and Liu, B. Learning to classify texts using positive and unlabeled data. In IJCAI, 2003.

Liu, B., Lee, W. S., Yu, P. S., and Li, X. Partially supervised classiﬁcation of text documents. In ICML, 2002.

Liu, F., Chen, H., and Wu, H. Discriminative adversarial networks for positive-unlabeled learning. ar Xiv:1906.00642, 2019.

Mohseni, S., Pitale, M., Yadawa, J., and Wang, Z. Selfsupervised learning for generalizable out-of-distribution detection. AAAI, 2020.

Ott, B. R., Jones, R. N., Noto, R. B., Yoo, D. C., Snyder, P. J., Bernier, J. N., Carr, D. B., and Roe, C. M. Brain amyloid in preclinical alzheimer s disease is associated with increased driving risk. Alzheimer s & Dementia: Diagnosis, Assessment & Disease Monitoring, 6:136 142, 2017.

Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. ar Xiv preprint ar Xiv:1803.09050, 2018.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. CVPR, Jun 2016. doi: 10.1109/cvpr.2016.308.

Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Neur IPS, 2017.

Trinh, T. H., Luong, M.-T., and Le, Q. V. Selﬁe: Selfsupervised pretraining for image embedding. ar Xiv preprint ar Xiv:1906.02940, 2019.

Villeneuve, S., Rabinovici, G. D., Cohn-Sheehy, B. I., Madison, C., Ayakta, N., Ghosh, P. M., La Joie, R., Arthur Bentil, S. K., Vogel, J. W., Marks, S. M., et al. Existing pittsburgh compound-b positron emission tomography thresholds are too high: statistical and pathological evaluation. Brain, 138(7):2020 2033, 2015.

Xu, M., Li, B., Niu, G., Han, B., and Sugiyama, M. Revisiting sample selection approach to positive-unlabeled learning: Turning unlabeled data into positive rather than negative. 2019a.

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

Xu, Y., Xu, C., Xu, C., and Tao, D. Multi-positive and unlabeled learning. In Proceedings of the 26th International Joint Conference on Artiﬁcial Intelligence, pp. 3182 3188, 2017.

Xu, Y., Wang, Y., Chen, H., Han, K., Chunjing, X., Tao, D., and Xu, C. Positive-unlabeled compression on the cloud. In Advances in Neural Information Processing Systems, pp. 2561 2570, 2019b.

Yuan, Y., Wang, Z., Lee, W., Thiyyagura, P., Reiman, E. M., and Chen, K. Feasibility of quantifying amyloid burden using volumetric mri data: Preliminary ﬁndings based on the deep learning 3d convolutional neural network approach. Alzheimer s & Dementia: The Journal of the Alzheimer s Association, 14(7):P695, 2018.

Zhang, Y., Xiang, T., Hospedales, T. M., and Lu, H. Deep mutual learning. In ICCV, 2018.