# Positive-Unlabeled Learning with Non-Negative Risk Estimator

Ryuichi Kiryo¹٬²  Gang Niu¹٬²  Marthinus C. du Plessis  Masashi Sugiyama²٬¹
¹The University of Tokyo, 7-3-1 Hongo, Tokyo 113-0033, Japan
²RIKEN, 1-4-1 Nihonbashi, Tokyo 103-0027, Japan
{kiryo@ms., gang@ms., sugi@}k.u-tokyo.ac.jp

**Abstract**

From only positive (P) and unlabeled (U) data, a binary classifier can be trained with PU learning, in which the state of the art is unbiased PU learning. However, if its model is very flexible, empirical risks on training data will go negative, and we will suffer from serious overfitting. In this paper, we propose a non-negative risk estimator for PU learning: when it is minimized, it is more robust against overfitting, and thus we are able to use very flexible models (such as deep neural networks) given limited P data. Moreover, we analyze the bias, consistency, and mean-squared-error reduction of the proposed risk estimator, and bound the estimation error of the resulting empirical risk minimizer. Experiments demonstrate that our risk estimator fixes the overfitting problem of its unbiased counterparts.

## 1 Introduction

Positive-unlabeled (PU) learning can be dated back to [1, 2, 3] and has been well studied since then. It mainly focuses on binary classification applied to retrieval and novelty or outlier detection tasks [4, 5, 6, 7], while it also has applications in matrix completion [8] and sequential data [9, 10].

Existing PU methods can be divided into two categories based on how U data is handled. The first category (e.g., [11, 12]) identifies possible negative (N) data in U data and then performs ordinary supervised (PN) learning; the second (e.g., [13, 14]) regards U data as N data with smaller weights. The former relies heavily on heuristics for identifying N data; the latter relies heavily on good choices of the weights of U data, which are computationally expensive to tune.

In order to avoid tuning the weights, unbiased PU learning comes into play as a subcategory of the second category. The milestone is [4], which regards each U data point as weighted P and N data simultaneously. This may lead to unbiased risk estimators if we unrealistically assume that the class-posterior probability is one for all P data.¹ A breakthrough in this direction is [15], which proposed the first unbiased risk estimator; a more general estimator was then suggested in [16] as a common foundation. The former is unbiased but non-convex for loss functions satisfying a certain symmetric condition; the latter is always unbiased, and it is further convex for loss functions meeting a certain linear-odd condition [17, 18]. PU learning based on these unbiased risk estimators is the current state of the art.

However, the unbiased risk estimators will give negative empirical risks if the model being trained is very flexible. For the general estimator in [16], there exist three partial risks in the total risk (see Eq. (2) defined later); in particular, it contains a negative term that regards P data as N data in order to cancel the bias caused by regarding U data as N data. The worst case is when the model can realize any measurable function and the loss function is not upper bounded, so that the empirical risk is not lower bounded. This needs to be fixed, since the original risk, which is the target to be estimated, is non-negative.
¹ It implies that the P and N class-conditional densities have disjoint support sets, and then any P and N data (as the test data) can be perfectly separated by a fixed classifier that is sufficiently flexible.

To this end, we propose a novel non-negative risk estimator that follows and improves on the state-of-the-art unbiased risk estimators mentioned above. This estimator can be used for two purposes. First, given some validation data (which are also PU data), we can use our estimator to evaluate the risk; in this case it is biased yet optimal, and for some symmetric losses, a mean-squared-error reduction is guaranteed. Second, given some training data, we can use our estimator to train binary classifiers; in this case its risk minimizer possesses an estimation error bound of the same order as the risk minimizers corresponding to its unbiased counterparts [15, 16, 19].

In addition, we propose a large-scale PU learning algorithm for minimizing the unbiased and non-negative risk estimators. This algorithm accepts any surrogate loss and is based on stochastic optimization, e.g., [20]. Note that [21] is the only existing large-scale PU algorithm, but it accepts only a single surrogate loss from [16] and is based on sequential minimal optimization [22].

The rest of this paper is organized as follows. In Section 2 we review unbiased PU learning, and in Section 3 we propose non-negative PU learning. Theoretical analyses are carried out in Section 4, and experimental results are discussed in Section 5. Conclusions are given in Section 6.

## 2 Unbiased PU learning

In this section, we review unbiased PU learning [15, 16].

**Problem settings** Let $X \in \mathbb{R}^d$ and $Y \in \{\pm 1\}$ ($d \in \mathbb{N}$) be the input and output random variables. Let $p(x, y)$ be the underlying joint density of $(X, Y)$, $p_p(x) = p(x \mid Y = +1)$ and $p_n(x) = p(x \mid Y = -1)$ be the P and N marginals (a.k.a. the P and N class-conditional densities), $p(x)$ be the U marginal, $\pi_p = p(Y = +1)$ be the class-prior probability, and $\pi_n = p(Y = -1) = 1 - \pi_p$. $\pi_p$ is assumed known throughout the paper; in practice it can be estimated from P and U data [23, 24, 25, 26].

Consider the two-sample problem setting of PU learning [5]: two sets of data are sampled independently from $p_p(x)$ and $p(x)$ as $\mathcal{X}_p = \{x_i^p\}_{i=1}^{n_p} \sim p_p(x)$ and $\mathcal{X}_u = \{x_i^u\}_{i=1}^{n_u} \sim p(x)$, and a classifier needs to be trained from $\mathcal{X}_p$ and $\mathcal{X}_u$.² In ordinary PN learning, $\mathcal{X}_n = \{x_i^n\}_{i=1}^{n_n} \sim p_n(x)$ rather than $\mathcal{X}_u$ would be available, and a classifier could be trained from $\mathcal{X}_p$ and $\mathcal{X}_n$.

² $\mathcal{X}_p$ is a set of independent data and so is $\mathcal{X}_u$, but $\mathcal{X}_p \cup \mathcal{X}_u$ does not need to be such a set.

**Risk estimators** Unbiased PU learning relies on unbiased risk estimators. Let $g : \mathbb{R}^d \to \mathbb{R}$ be an arbitrary decision function, and $\ell : \mathbb{R} \times \{\pm 1\} \to \mathbb{R}$ be the loss function, such that the value $\ell(t, y)$ is the loss incurred by predicting an output $t$ when the ground truth is $y$. Denote by $R_p^+(g) = \mathbb{E}_p[\ell(g(X), +1)]$ and $R_n^-(g) = \mathbb{E}_n[\ell(g(X), -1)]$, where $\mathbb{E}_p[\cdot] = \mathbb{E}_{X \sim p_p}[\cdot]$ and $\mathbb{E}_n[\cdot] = \mathbb{E}_{X \sim p_n}[\cdot]$. Then, the risk of $g$ is $R(g) = \mathbb{E}_{(X,Y) \sim p(x,y)}[\ell(g(X), Y)] = \pi_p R_p^+(g) + \pi_n R_n^-(g)$. In PN learning, thanks to the availability of $\mathcal{X}_p$ and $\mathcal{X}_n$, $R(g)$ can be approximated directly by

$\hat{R}_{pn}(g) = \pi_p \hat{R}_p^+(g) + \pi_n \hat{R}_n^-(g)$,  (1)

where $\hat{R}_p^+(g) = (1/n_p)\sum_{i=1}^{n_p} \ell(g(x_i^p), +1)$ and $\hat{R}_n^-(g) = (1/n_n)\sum_{i=1}^{n_n} \ell(g(x_i^n), -1)$.
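To make these definitions concrete, here is a minimal NumPy sketch of the empirical PN risk in Eq. (1); the helper names (`zero_one_loss`, `risk_pn`) and the callable `g` are our illustrative choices, not code from the paper.

```python
import numpy as np

def zero_one_loss(t, y):
    """Zero-one loss l01(t, y) = (1 - sign(t*y)) / 2."""
    return (1.0 - np.sign(t * y)) / 2.0

def risk_pn(g, x_p, x_n, pi_p, loss=zero_one_loss):
    """Empirical PN risk of Eq. (1): pi_p * R_p^+(g) + pi_n * R_n^-(g)."""
    pi_n = 1.0 - pi_p
    r_p_pos = np.mean(loss(g(x_p), +1))  # \hat{R}_p^+(g)
    r_n_neg = np.mean(loss(g(x_n), -1))  # \hat{R}_n^-(g)
    return pi_p * r_p_pos + pi_n * r_n_neg
```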
In PU learning, $\mathcal{X}_n$ is unavailable, but $R_n^-(g)$ can be approximated indirectly, as shown in [15, 16]. Denote by $R_p^-(g) = \mathbb{E}_p[\ell(g(X), -1)]$ and $R_u^-(g) = \mathbb{E}_{X \sim p(x)}[\ell(g(X), -1)]$. As $\pi_n p_n(x) = p(x) - \pi_p p_p(x)$, we obtain $\pi_n R_n^-(g) = R_u^-(g) - \pi_p R_p^-(g)$, and $R(g)$ can be approximated indirectly by

$\hat{R}_{pu}(g) = \pi_p \hat{R}_p^+(g) - \pi_p \hat{R}_p^-(g) + \hat{R}_u^-(g)$,  (2)

where $\hat{R}_p^-(g) = (1/n_p)\sum_{i=1}^{n_p} \ell(g(x_i^p), -1)$ and $\hat{R}_u^-(g) = (1/n_u)\sum_{i=1}^{n_u} \ell(g(x_i^u), -1)$.

The empirical risk estimators in Eqs. (1) and (2) are unbiased and consistent w.r.t. all popular loss functions.³ When they are used for evaluating the risk (e.g., in cross-validation), $\ell$ is by default the zero-one loss, namely $\ell_{01}(t, y) = (1 - \mathrm{sign}(ty))/2$; when used for training, $\ell_{01}$ is replaced with a surrogate loss [27]. In particular, [15] showed that if $\ell$ satisfies a symmetric condition

$\ell(t, +1) + \ell(t, -1) = 1$,  (3)

we will have

$\hat{R}_{pu}(g) = 2\pi_p \hat{R}_p^+(g) + \hat{R}_u^-(g) - \pi_p$,  (4)

which can be minimized by separating $\mathcal{X}_p$ and $\mathcal{X}_u$ with ordinary cost-sensitive learning. An issue is that $\hat{R}_{pu}(g)$ in (4) must be non-convex in $g$, since no $\ell(t, y)$ satisfying (3) can be convex in $t$. [16] showed that $\hat{R}_{pu}(g)$ in (2) is convex in $g$ if $\ell(t, y)$ is convex in $t$ and meets a linear-odd condition [17, 18]:

$\ell(t, +1) - \ell(t, -1) = -t$.  (5)

Let $g$ be parameterized by $\theta$; then (5) leads to a convex optimization problem so long as $g$ is linear in $\theta$, for which the globally optimal solution can be obtained. Eq. (5) is not only sufficient but also necessary for the convexity if $\ell$ is unary, i.e., $\ell(t, -1) = \ell(-t, +1)$.

³ The consistency here means that for fixed $g$, $\hat{R}_{pn}(g) \to R(g)$ and $\hat{R}_{pu}(g) \to R(g)$ as $n_p, n_n, n_u \to \infty$.

**Justification** Thanks to the unbiasedness, we can study estimation error bounds (EEB). Let $\mathcal{G}$ be the function class, and $\hat{g}_{pn}$ and $\hat{g}_{pu}$ be the empirical risk minimizers of $\hat{R}_{pn}(g)$ and $\hat{R}_{pu}(g)$. [19] proved that the EEB of $\hat{g}_{pu}$ is tighter than the EEB of $\hat{g}_{pn}$ when $\pi_p/\sqrt{n_p} + 1/\sqrt{n_u} < \pi_n/\sqrt{n_n}$, if (a) $\ell$ satisfies (3) and is Lipschitz continuous; (b) the Rademacher complexity of $\mathcal{G}$ decays in $\mathcal{O}(1/\sqrt{n})$ for data of size $n$ drawn from $p(x)$, $p_p(x)$ or $p_n(x)$.⁴ In other words, under mild conditions, PU learning is likely to outperform PN learning when $\pi_p/\sqrt{n_p} + 1/\sqrt{n_u} < \pi_n/\sqrt{n_n}$. This phenomenon has been observed in experiments [19] and is illustrated in Figure 1(a).

⁴ Let $\sigma_1, \ldots, \sigma_n$ be $n$ Rademacher variables; the Rademacher complexity of $\mathcal{G}$ for $\mathcal{X}$ of size $n$ drawn from $q(x)$ is defined by $\mathfrak{R}_{n,q}(\mathcal{G}) = \mathbb{E}_{\mathcal{X}} \mathbb{E}_{\sigma_1, \ldots, \sigma_n}[\sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{x_i \in \mathcal{X}} \sigma_i g(x_i)]$ [28]. For any fixed $\mathcal{G}$ and $q$, $\mathfrak{R}_{n,q}(\mathcal{G})$ still depends on $n$ and should decrease with $n$.
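Before moving on, the unbiased PU risk of Eq. (2) admits a sketch analogous to the PN one above; the function names are again ours, and the sigmoid loss here (which satisfies (3)) anticipates the surrogate adopted later in Section 3.3.

```python
import numpy as np

def sigmoid_loss(t, y):
    """Sigmoid loss l_sig(t, y) = 1 / (1 + exp(t*y)); satisfies Eq. (3)."""
    return 1.0 / (1.0 + np.exp(t * y))

def risk_pu_unbiased(g, x_p, x_u, pi_p, loss=sigmoid_loss):
    """Unbiased PU risk of Eq. (2):
    pi_p * R_p^+(g) - pi_p * R_p^-(g) + R_u^-(g)."""
    r_p_pos = np.mean(loss(g(x_p), +1))  # \hat{R}_p^+(g)
    r_p_neg = np.mean(loss(g(x_p), -1))  # \hat{R}_p^-(g)
    r_u_neg = np.mean(loss(g(x_u), -1))  # \hat{R}_u^-(g)
    return pi_p * r_p_pos - pi_p * r_p_neg + r_u_neg
```

Note that the middle term enters with a negative sign; this is exactly the term that can drive the whole estimate below zero for a flexible model.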
## 3 Non-negative PU learning

In this section, we propose the non-negative risk estimator and the large-scale PU algorithm.

### 3.1 Motivation

Let us look inside the aforementioned justification of unbiased PU (uPU) learning. Intuitively, the advantage comes from the transformation $\pi_n R_n^-(g) = R_u^-(g) - \pi_p R_p^-(g)$. When we approximate $\pi_n R_n^-(g)$ from N data $\{x_i^n\}_{i=1}^{n_n}$, the convergence rate is $\mathcal{O}_p(\pi_n/\sqrt{n_n})$, where $\mathcal{O}_p$ denotes the order in probability; when we approximate $R_u^-(g) - \pi_p R_p^-(g)$ from P data $\{x_i^p\}_{i=1}^{n_p}$ and U data $\{x_i^u\}_{i=1}^{n_u}$, the convergence rate becomes $\mathcal{O}_p(\pi_p/\sqrt{n_p} + 1/\sqrt{n_u})$. As a result, we might benefit from a tighter uniform deviation bound when $\pi_p/\sqrt{n_p} + 1/\sqrt{n_u} < \pi_n/\sqrt{n_n}$.

However, the critical assumption on the Rademacher complexity is indispensable; otherwise it will be difficult for the EEB of $\hat{g}_{pu}$ to be tighter than the EEB of $\hat{g}_{pn}$. If $\mathcal{G} = \{g \mid \|g\|_\infty \le C_g\}$ where $C_g > 0$ is a constant, i.e., $\mathcal{G}$ contains all measurable functions with some bounded norm, then $\mathfrak{R}_{n,q}(\mathcal{G}) = \mathcal{O}(1)$ for any $n$ and $q(x)$ and all bounds become trivial; moreover, if $\ell$ is not bounded from above, $\hat{R}_{pu}(g)$ is not bounded from below, i.e., it may diverge to $-\infty$. Thus, in order to obtain a high-quality $\hat{g}_{pu}$, $\mathcal{G}$ cannot be too complex, or equivalently, the model of $g$ cannot be too flexible.

This argument is supported by the experiment illustrated in Figure 1(b): a multilayer perceptron was trained to separate the even and odd digits of the MNIST hand-written digits [29]. This model is so flexible that the number of parameters is 500 times more than the total number of P and N data. From Figure 1(b) we can see: (A) on training data, the risks of uPU and PN both decrease, and uPU decreases faster than PN; (B) on test data, the risk of PN decreases, whereas the risk of uPU does not; the risk of uPU is lower at the beginning but higher at the end than that of PN. To sum up, the overfitting problem of uPU is serious, which evidences that in order to obtain a high-quality $\hat{g}_{pu}$, the model of $g$ cannot be too flexible.

Figure 1: Illustrative experimental results. (a) Plain linear model; (b) multilayer perceptron (MLP). [Each panel plots training and test risks w.r.t. the surrogate loss against the training epoch.] The dataset is MNIST; even/odd digits are regarded as the P/N class, and $\pi_p \approx 1/2$; $n_p = 100$ and $n_n = 50$ for PN learning; $n_p = 100$ and $n_u = 59{,}900$ for unbiased PU (uPU) and non-negative PU (nnPU) learning. The model is a plain linear model (784-1) in (a) and an MLP (784-100-1) with ReLU in (b); it was trained by Algorithm 1, where the loss $\ell$ is $\ell_{\mathrm{sig}}$, the optimization algorithm $\mathcal{A}$ is [20], with $\beta = 1/2$ for uPU, and $\beta = 0$ and $\gamma = 1$ for nnPU. Solid curves are $\hat{R}_{pn}(g)$ on test data where $g \in \{\hat{g}_{pn}, \hat{g}_{pu}, \tilde{g}_{pu}\}$, and dashed curves are $\hat{R}_{pn}(\hat{g}_{pn})$, $\hat{R}_{pu}(\hat{g}_{pu})$ and $\tilde{R}_{pu}(\tilde{g}_{pu})$ on training data. Note that nnPU is identical to uPU in (a).

### 3.2 Non-negative risk estimator

Nevertheless, sometimes we have no choice: we are interested in using flexible models, while labeling more data is out of our control. Can we alleviate the overfitting problem without changing the model or labeling more data?

The answer is affirmative. Note that $\hat{R}_{pu}(\hat{g}_{pu})$ keeps decreasing and goes negative. This should be fixed, since $R(g) \ge 0$ for any $g$. Specifically, it holds that $R_u^-(g) - \pi_p R_p^-(g) = \pi_n R_n^-(g) \ge 0$, but $\hat{R}_u^-(g) - \pi_p \hat{R}_p^-(g) \ge 0$ is not always true, which is a potential reason for uPU to overfit. Based on this key observation, we propose a non-negative risk estimator for PU learning:

$\tilde{R}_{pu}(g) = \pi_p \hat{R}_p^+(g) + \max\{0,\, \hat{R}_u^-(g) - \pi_p \hat{R}_p^-(g)\}$.  (6)

Let $\tilde{g}_{pu} = \arg\min_{g \in \mathcal{G}} \tilde{R}_{pu}(g)$ be the empirical risk minimizer of $\tilde{R}_{pu}(g)$. We refer to the process of obtaining $\tilde{g}_{pu}$ as non-negative PU (nnPU) learning. The implementation of nnPU is given in Section 3.3, and theoretical analyses of $\tilde{R}_{pu}(g)$ and $\tilde{g}_{pu}$ are given in Section 4.

Again, from Figure 1(b) we can see: (A) on training data, the risk of nnPU first decreases and then becomes more and more flat, so that the risk of nnPU is closer to the risk of PN and farther from that of uPU; in short, the risk of nnPU does not go down with uPU after a certain epoch; (B) on test data, the tendency is similar, but the risk of nnPU does not go up with uPU; (C) at the end, nnPU achieves the lowest risk on test data. In summary, nnPU works by explicitly constraining the training risk of uPU to be non-negative.
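A minimal sketch of Eq. (6), mirroring the unbiased version given after Eq. (2): only the $\max\{0, \cdot\}$ clamp on the U-minus-P term differs. Names remain illustrative.

```python
import numpy as np

def sigmoid_loss(t, y):
    return 1.0 / (1.0 + np.exp(t * y))  # same l_sig as in the earlier sketch

def risk_pu_nonneg(g, x_p, x_u, pi_p, loss=sigmoid_loss):
    """Non-negative PU risk of Eq. (6):
    pi_p * R_p^+(g) + max{0, R_u^-(g) - pi_p * R_p^-(g)}."""
    r_p_pos = np.mean(loss(g(x_p), +1))  # \hat{R}_p^+(g)
    r_p_neg = np.mean(loss(g(x_p), -1))  # \hat{R}_p^-(g)
    r_u_neg = np.mean(loss(g(x_u), -1))  # \hat{R}_u^-(g)
    return pi_p * r_p_pos + max(0.0, r_u_neg - pi_p * r_p_neg)
```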
### 3.3 Implementation

A list of popular loss functions and their properties is shown in Table 1. Let $g$ be parameterized by $\theta$. If $g$ is linear in $\theta$, the losses satisfying (5) result in convex optimization problems. However, if $g$ needs to be flexible, it will be highly nonlinear in $\theta$; then the losses satisfying (5) are no longer advantageous over others, since the optimizations are anyway non-convex. In [15], the ramp loss was used and $\hat{R}_{pu}(g)$ was minimized by the concave-convex procedure [30]. This solver is fairly sophisticated, and if we replace $\hat{R}_{pu}(g)$ with $\tilde{R}_{pu}(g)$, it becomes even more difficult to implement. We thus propose to use the sigmoid loss $\ell_{\mathrm{sig}}(t, y) = 1/(1 + \exp(ty))$: its gradient is everywhere non-zero, and $\tilde{R}_{pu}(g)$ can be minimized by off-the-shelf gradient methods.

In the face of big data, we should scale PU learning up by stochastic optimization. Minimizing $\hat{R}_{pu}(g)$ is embarrassingly parallel, while minimizing $\tilde{R}_{pu}(g)$ is not, since $\hat{R}_{pu}(g)$ is point-wise but $\tilde{R}_{pu}(g)$ is not, due to the max operator. That being said, $\max\{0, \hat{R}_u^-(g; \mathcal{X}_u) - \pi_p \hat{R}_p^-(g; \mathcal{X}_p)\}$ is no greater than $(1/N)\sum_{i=1}^N \max\{0, \hat{R}_u^-(g; \mathcal{X}_u^i) - \pi_p \hat{R}_p^-(g; \mathcal{X}_p^i)\}$, where $(\mathcal{X}_p^i, \mathcal{X}_u^i)$ is the $i$-th mini-batch, and hence the corresponding upper bound of $\tilde{R}_{pu}(g)$ can easily be minimized in parallel.

Table 1: Loss functions for PU learning and their properties.

| Name | Definition | (3) | (5) | Bounded | Lipschitz | $\ell'(z) \neq 0$ |
|---|---|---|---|---|---|---|
| Zero-one loss | $(1 - \mathrm{sign}(z))/2$ | ✓ | ✗ | ✓ | ✗ | $z = 0$ |
| Ramp loss | $\max\{0, \min\{1, (1 - z)/2\}\}$ | ✓ | ✗ | ✓ | ✓ | $z \in [-1, +1]$ |
| Squared loss | $(z - 1)^2/4$ | ✗ | ✓ | ✗ | ✗ | $z \in \mathbb{R}$ |
| Logistic loss | $\ln(1 + \exp(-z))$ | ✗ | ✓ | ✗ | ✓ | $z \in \mathbb{R}$ |
| Hinge loss | $\max\{0, 1 - z\}$ | ✗ | ✗ | ✗ | ✓ | $z \in (-\infty, +1]$ |
| Double hinge loss | $\max\{0, (1 - z)/2, -z\}$ | ✗ | ✓ | ✗ | ✓ | $z \in (-\infty, +1]$ |
| Sigmoid loss | $1/(1 + \exp(z))$ | ✓ | ✗ | ✓ | ✓ | $z \in \mathbb{R}$ |

All loss functions are unary, such that $\ell(t, y) = \ell(z)$ with $z = ty$. The ramp loss comes from [15]; the double hinge loss is from [16], in which the squared, logistic and hinge losses were discussed as well. The ramp and squared losses are scaled to satisfy (3) or (5). The sigmoid loss is a horizontally mirrored logistic function; the logistic loss is the negative logarithm of the logistic function.

Algorithm 1 Large-scale PU learning based on stochastic optimization
Input: training data $(\mathcal{X}_p, \mathcal{X}_u)$; hyperparameters $0 \le \beta \le \pi_p \sup_t \max_y \ell(t, y)$ and $0 \le \gamma \le 1$
Output: model parameter $\theta$ for $\hat{g}_{pu}(x; \theta)$ or $\tilde{g}_{pu}(x; \theta)$
1: Let $\mathcal{A}$ be an external SGD-like stochastic optimization algorithm such as [20] or [31]
2: while no stopping criterion has been met:
3: &nbsp;&nbsp;Shuffle $(\mathcal{X}_p, \mathcal{X}_u)$ into $N$ mini-batches, and denote by $(\mathcal{X}_p^i, \mathcal{X}_u^i)$ the $i$-th mini-batch
4: &nbsp;&nbsp;for $i = 1$ to $N$:
5: &nbsp;&nbsp;&nbsp;&nbsp;if $\hat{R}_u^-(g; \mathcal{X}_u^i) - \pi_p \hat{R}_p^-(g; \mathcal{X}_p^i) \ge -\beta$:
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Set gradient $\nabla_\theta \hat{R}_{pu}(g; \mathcal{X}_p^i, \mathcal{X}_u^i)$
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Update $\theta$ by $\mathcal{A}$ with its current step size $\eta$
8: &nbsp;&nbsp;&nbsp;&nbsp;else:
9: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Set gradient $\nabla_\theta (\pi_p \hat{R}_p^-(g; \mathcal{X}_p^i) - \hat{R}_u^-(g; \mathcal{X}_u^i))$
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Update $\theta$ by $\mathcal{A}$ with a discounted step size $\gamma\eta$

The large-scale PU algorithm is described in Algorithm 1. Let $r_i = \hat{R}_u^-(g; \mathcal{X}_u^i) - \pi_p \hat{R}_p^-(g; \mathcal{X}_p^i)$. In practice, we may tolerate $r_i \ge -\beta$ where $0 \le \beta \le \pi_p \sup_t \max_y \ell(t, y)$, as $r_i$ comes from a single mini-batch. The degree of tolerance is controlled by $\beta$: there is zero tolerance if $\beta = 0$, and we are minimizing $\hat{R}_{pu}(g)$ if $\beta = \pi_p \sup_t \max_y \ell(t, y)$. Otherwise, if $r_i < -\beta$, we go along $-\nabla_\theta r_i$ with a step size discounted by $\gamma$ where $0 \le \gamma \le 1$, to make this mini-batch less overfitted. Algorithm 1 is insensitive to the choice of $\gamma$ if the optimization algorithm $\mathcal{A}$ is adaptive, such as [20] or [31].
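Below is a hedged PyTorch sketch of one mini-batch update in the spirit of Algorithm 1 (the authors' own implementation uses Chainer); `nnpu_step` and the temporary learning-rate rescaling are our illustrative choices, the latter being only a crude stand-in for the discounted step size $\gamma\eta$.

```python
import torch

def sigmoid_loss(t, y):
    # l_sig(t, y) = 1 / (1 + exp(t*y)) = sigmoid(-t*y)
    return torch.sigmoid(-t * y)

def nnpu_step(model, optimizer, x_p, x_u, pi_p, beta=0.0, gamma=1.0):
    """One mini-batch update sketching lines 5-10 of Algorithm 1."""
    r_p_pos = sigmoid_loss(model(x_p), +1.0).mean()
    r_p_neg = sigmoid_loss(model(x_p), -1.0).mean()
    r_u_neg = sigmoid_loss(model(x_u), -1.0).mean()
    r_i = r_u_neg - pi_p * r_p_neg                 # the term that may go negative
    optimizer.zero_grad()
    if r_i.item() >= -beta:
        loss = pi_p * r_p_pos + r_i                # gradient of \hat{R}_pu on this batch
        loss.backward()
        optimizer.step()
    else:
        loss = pi_p * r_p_neg - r_u_neg            # equals -r_i; descending it pushes r_i back up
        loss.backward()
        saved = [grp['lr'] for grp in optimizer.param_groups]
        for grp in optimizer.param_groups:
            grp['lr'] *= gamma                     # crude stand-in for the discounted step gamma*eta
        optimizer.step()
        for grp, lr in zip(optimizer.param_groups, saved):
            grp['lr'] = lr
    return loss.item()
```

As a sanity check on the sketch, setting $\beta = \pi_p \sup_t \max_y \ell(t, y)$ makes the branch condition always hold, so the loop degenerates to plain uPU minimization, consistent with the remark about $\beta$ above.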
## 4 Theoretical analyses

In this section, we analyze the risk estimator (6) and its minimizer (all proofs are given in Appendix B).

### 4.1 Bias and consistency

For fixed $g$, $\tilde{R}_{pu}(g) \ge \hat{R}_{pu}(g)$ for any $(\mathcal{X}_p, \mathcal{X}_u)$, but $\hat{R}_{pu}(g)$ is unbiased, which implies $\tilde{R}_{pu}(g)$ is biased in general. A fundamental question is then whether $\tilde{R}_{pu}(g)$ is consistent. From now on, we prove this consistency. To begin with, partition all possible $(\mathcal{X}_p, \mathcal{X}_u)$ into $\mathcal{D}^+(g) = \{(\mathcal{X}_p, \mathcal{X}_u) \mid \hat{R}_u^-(g) - \pi_p \hat{R}_p^-(g) \ge 0\}$ and $\mathcal{D}^-(g) = \{(\mathcal{X}_p, \mathcal{X}_u) \mid \hat{R}_u^-(g) - \pi_p \hat{R}_p^-(g) < 0\}$. Assume there are $C_g > 0$ and $C_\ell > 0$ such that $\sup_{g \in \mathcal{G}} \|g\|_\infty \le C_g$ and $\sup_{|t| \le C_g} \max_y \ell(t, y) \le C_\ell$.

Lemma 1. The following three conditions are equivalent: (A) the probability measure of $\mathcal{D}^-(g)$ is non-zero; (B) $\tilde{R}_{pu}(g)$ differs from $\hat{R}_{pu}(g)$ with a non-zero probability over repeated sampling of $(\mathcal{X}_p, \mathcal{X}_u)$; (C) the bias of $\tilde{R}_{pu}(g)$ is positive. In addition, by assuming that there is $\alpha > 0$ such that $R_n^-(g) \ge \alpha$, the probability measure of $\mathcal{D}^-(g)$ can be bounded by

$\Pr(\mathcal{D}^-(g)) \le \exp\left(-2(\alpha/C_\ell)^2 / (\pi_p^2/n_p + 1/n_u)\right)$.  (7)

Based on Lemma 1, we can show the exponential decay of the bias and also the consistency. For convenience, denote by $\chi_{n_p, n_u} = 2\pi_p/\sqrt{n_p} + 1/\sqrt{n_u}$.

Theorem 2 (Bias and consistency). Assume that $R_n^-(g) \ge \alpha > 0$ and denote by $\Delta_g$ the right-hand side of Eq. (7). As $n_p, n_u \to \infty$, the bias of $\tilde{R}_{pu}(g)$ decays exponentially:

$0 \le \mathbb{E}_{\mathcal{X}_p, \mathcal{X}_u}[\tilde{R}_{pu}(g)] - R(g) \le C_\ell \pi_p \Delta_g$.  (8)

Moreover, for any $\delta > 0$, let $C_\delta = C_\ell \sqrt{\ln(2/\delta)/2}$; then we have with probability at least $1 - \delta$,

$|\tilde{R}_{pu}(g) - R(g)| \le C_\delta \chi_{n_p, n_u} + C_\ell \pi_p \Delta_g$,  (9)

and with probability at least $1 - \delta - \Delta_g$,

$|\tilde{R}_{pu}(g) - R(g)| \le C_\delta \chi_{n_p, n_u}$.  (10)

Either (9) or (10) in Theorem 2 indicates that for fixed $g$, $\tilde{R}_{pu}(g) \to R(g)$ in $\mathcal{O}_p(\pi_p/\sqrt{n_p} + 1/\sqrt{n_u})$. This convergence rate is optimal according to the central limit theorem [32], which means the proposed estimator is a biased yet optimal estimator of the risk.

### 4.2 Mean squared error

After introducing the bias, $\tilde{R}_{pu}(g)$ tends to overestimate $R(g)$. It is not a shrinkage estimator [33, 34], so its mean squared error (MSE) is not necessarily smaller than that of $\hat{R}_{pu}(g)$. However, we can still characterize the reduction in MSE.

Theorem 3 (MSE reduction). It holds that $\mathrm{MSE}(\tilde{R}_{pu}(g)) < \mathrm{MSE}(\hat{R}_{pu}(g))$,⁵ if and only if

$\int_{(\mathcal{X}_p, \mathcal{X}_u) \in \mathcal{D}^-(g)} (\hat{R}_{pu}(g) + \tilde{R}_{pu}(g) - 2R(g))(\hat{R}_u^-(g) - \pi_p \hat{R}_p^-(g)) \, \mathrm{d}F(\mathcal{X}_p, \mathcal{X}_u) > 0$,  (11)

where $\mathrm{d}F(\mathcal{X}_p, \mathcal{X}_u) = \prod_{i=1}^{n_p} p_p(x_i^p)\mathrm{d}x_i^p \prod_{i=1}^{n_u} p(x_i^u)\mathrm{d}x_i^u$. Eq. (11) is valid if the following conditions are met: (a) $\Pr(\mathcal{D}^-(g)) > 0$; (b) $\ell$ satisfies Eq. (3); (c) $R_n^-(g) \ge \alpha > 0$; (d) $n_u \gg n_p$, such that we have $R_u^-(g) - \hat{R}_u^-(g) \le 2\alpha$ almost surely on $\mathcal{D}^-(g)$. In fact, given these four conditions, we have for any $0 \le \beta \le C_\ell \pi_p$,

$\mathrm{MSE}(\hat{R}_{pu}(g)) - \mathrm{MSE}(\tilde{R}_{pu}(g)) \ge 3\beta^2 \Pr\{\tilde{R}_{pu}(g) - \hat{R}_{pu}(g) > \beta\}$.  (12)

⁵ Here, $\mathrm{MSE}(\cdot)$ is over repeated sampling of $(\mathcal{X}_p, \mathcal{X}_u)$.

The assumption (d) in Theorem 3 is explained as follows. Since U data can be much cheaper than P data in practice, it is natural to assume that $n_u$ is much larger and grows much faster than $n_p$; hence $\Pr\{R_u^-(g) - \hat{R}_u^-(g) \ge \alpha\} / \Pr\{\hat{R}_p^-(g) - R_p^-(g) \ge \alpha/\pi_p\}$ decays in $\exp(n_p - n_u)$ asymptotically.⁶ This means the contribution of $\mathcal{X}_u$ is negligible for making $(\mathcal{X}_p, \mathcal{X}_u) \in \mathcal{D}^-(g)$, so that $\Pr(\mathcal{D}^-(g))$ exhibits exponential decay mainly in $n_p$. As $\Pr\{R_u^-(g) - \hat{R}_u^-(g) \ge 2\alpha\}$ has stronger exponential decay in $n_u$ than $\Pr\{R_u^-(g) - \hat{R}_u^-(g) \ge \alpha\}$, and $n_u \gg n_p$, we made the assumption (d).

⁶ This can be derived as $n_p, n_u \to \infty$ by applying the central limit theorem to the two differences and then L'Hôpital's rule to the ratio of complementary error functions [32].
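Lemma 1 and Theorem 3 can be visualized with a small Monte Carlo simulation. The sketch below is entirely illustrative (1-D Gaussian classes, a fixed scoring function $g(x) = 5x$, the sigmoid loss, and sample sizes of our choosing, none of which come from the paper): it estimates $\Pr(\mathcal{D}^-(g))$ and compares the empirical MSEs of the two estimators against a large-sample approximation of $R(g)$.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_p, n_p, n_u, trials = 0.5, 20, 200, 10000
g = lambda x: 5.0 * x                               # a fixed score; large scale mimics an overfitted g
loss = lambda t, y: 1.0 / (1.0 + np.exp(t * y))     # sigmoid loss, satisfies (3)

sample_p = lambda n: rng.normal(+3.0, 1.0, n)       # p_p(x)
sample_n = lambda n: rng.normal(-3.0, 1.0, n)       # p_n(x)
def sample_u(n):                                    # p(x) = pi_p p_p + pi_n p_n
    m = rng.binomial(n, pi_p)
    return np.concatenate([sample_p(m), sample_n(n - m)])

# approximate the true risk R(g) with a huge sample
big_p, big_n = sample_p(10**6), sample_n(10**6)
R = pi_p * loss(g(big_p), +1).mean() + (1 - pi_p) * loss(g(big_n), -1).mean()

hat, tilde, neg = [], [], 0
for _ in range(trials):
    x_p, x_u = sample_p(n_p), sample_u(n_u)
    r_pp = loss(g(x_p), +1).mean()
    r_pn = loss(g(x_p), -1).mean()
    r_un = loss(g(x_u), -1).mean()
    d = r_un - pi_p * r_pn                          # the term defining D^+/D^-
    neg += d < 0                                    # event D^-(g) of Lemma 1
    hat.append(pi_p * r_pp + d)                     # unbiased estimator, Eq. (2)
    tilde.append(pi_p * r_pp + max(0.0, d))         # non-negative estimator, Eq. (6)

print("Pr(D^-)  :", neg / trials)
print("MSE hat  :", np.mean((np.array(hat) - R) ** 2))
print("MSE tilde:", np.mean((np.array(tilde) - R) ** 2))
```

In this regime $R_n^-(g)$ is tiny, so sampling noise flips the clamped term negative about half the time, and clipping it tends to reduce the MSE; the toy setting does not verify the exact conditions of Theorem 3, only its qualitative message.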
### 4.3 Estimation error

While Theorems 2 and 3 addressed the use of (6) for evaluating the risk, we are likewise interested in its use for training classifiers. In what follows, we analyze the estimation error $R(\tilde{g}_{pu}) - R(g^*)$, where $g^*$ is the true risk minimizer in $\mathcal{G}$, i.e., $g^* = \arg\min_{g \in \mathcal{G}} R(g)$. As is common practice [28], assume that $\ell(t, y)$ is Lipschitz continuous in $t$ for all $|t| \le C_g$ with a Lipschitz constant $L_\ell$.

Theorem 4 (Estimation error bound). Assume that (a) $\inf_{g \in \mathcal{G}} R_n^-(g) \ge \alpha > 0$ and denote by $\Delta$ the right-hand side of Eq. (7); (b) $\mathcal{G}$ is closed under negation, i.e., $g \in \mathcal{G}$ if and only if $-g \in \mathcal{G}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

$R(\tilde{g}_{pu}) - R(g^*) \le 16 L_\ell \pi_p \mathfrak{R}_{n_p, p_p}(\mathcal{G}) + 8 L_\ell \mathfrak{R}_{n_u, p}(\mathcal{G}) + 2C'_\delta \chi_{n_p, n_u} + 2C_\ell \pi_p \Delta$,  (13)

where $C'_\delta = C_\ell \sqrt{\ln(1/\delta)/2}$, and $\mathfrak{R}_{n_p, p_p}(\mathcal{G})$ and $\mathfrak{R}_{n_u, p}(\mathcal{G})$ are the Rademacher complexities of $\mathcal{G}$ for the sampling of size $n_p$ from $p_p(x)$ and of size $n_u$ from $p(x)$, respectively.

Theorem 4 ensures that learning with (6) is also consistent: as $n_p, n_u \to \infty$, $R(\tilde{g}_{pu}) \to R(g^*)$, and if $\ell$ satisfies (5), all optimizations are convex and $\tilde{g}_{pu} \to g^*$. For linear-in-parameter models with a bounded norm, $\mathfrak{R}_{n_p, p_p}(\mathcal{G}) = \mathcal{O}(1/\sqrt{n_p})$ and $\mathfrak{R}_{n_u, p}(\mathcal{G}) = \mathcal{O}(1/\sqrt{n_u})$, and thus $R(\tilde{g}_{pu}) \to R(g^*)$ in $\mathcal{O}_p(\pi_p/\sqrt{n_p} + 1/\sqrt{n_u})$. For comparison, $R(\hat{g}_{pu}) - R(g^*)$ can be bounded using a different proof technique [19]:

$R(\hat{g}_{pu}) - R(g^*) \le 8 L_\ell \pi_p \mathfrak{R}_{n_p, p_p}(\mathcal{G}) + 4 L_\ell \mathfrak{R}_{n_u, p}(\mathcal{G}) + 2C_\delta \chi_{n_p, n_u}$,  (14)

where $C_\delta = C_\ell \sqrt{\ln(2/\delta)/2}$. The differences between (13) and (14) come entirely from the differences between the corresponding uniform deviation bounds, i.e., the following lemma and Lemma 8 of [19].

Lemma 5. Under the assumptions of Theorem 4, for any $\delta > 0$, with probability at least $1 - \delta$,

$\sup_{g \in \mathcal{G}} |\tilde{R}_{pu}(g) - R(g)| \le 8 L_\ell \pi_p \mathfrak{R}_{n_p, p_p}(\mathcal{G}) + 4 L_\ell \mathfrak{R}_{n_u, p}(\mathcal{G}) + C'_\delta \chi_{n_p, n_u} + C_\ell \pi_p \Delta$.  (15)

Notice that $\hat{R}_{pu}(g)$ is point-wise while $\tilde{R}_{pu}(g)$ is not, due to the max operator, which makes Lemma 5 much more difficult to prove than Lemma 8 of [19]. The key trick is that, after symmetrization, we employ $|\max\{0, z\} - \max\{0, z'\}| \le |z - z'|$, making three differences of partial risks point-wise (see (18) in the proof). As a consequence, we have to use a different Rademacher complexity with the absolute value inside the supremum [35, 36], whose contraction doubles the coefficients of (15) compared with Lemma 8 of [19]; moreover, we have to assume $\mathcal{G}$ is closed under negation in order to change back to the standard Rademacher complexity without the absolute value [28]. Therefore, the differences between (13) and (14) are mainly due to different proof techniques and do not reflect intrinsic differences between the empirical risk minimizers.

## 5 Experiments

In this section, we compare PN, unbiased PU (uPU) and non-negative PU (nnPU) learning experimentally. We focus on training deep neural networks, since uPU learning usually does not overfit if a linear-in-parameter model is used [19], in which case nothing needs to be fixed.

Table 2 describes the specification of the benchmark datasets. MNIST, 20News and CIFAR-10 have 10, 7 and 10 classes originally, and we constructed the P and N classes from them as follows: MNIST was preprocessed so that 0, 2, 4, 6, 8 constitute the P class, while 1, 3, 5, 7, 9 constitute the N class; for 20News, 'alt.', 'comp.', 'misc.' and 'rec.' make up the P class, and 'sci.', 'soc.' and 'talk.' make up the N class; for CIFAR-10, the P class is formed by 'airplane', 'automobile', 'ship' and 'truck', and the N class is formed by 'bird', 'cat', 'deer', 'dog', 'frog' and 'horse'. The dataset epsilon has 2 classes, and such a construction is unnecessary.

Three learning methods were set up as follows: (A) for PN, $n_p = 1{,}000$ and $n_n = (\pi_n/2\pi_p)^2 n_p$; (B) for uPU, $n_p = 1{,}000$ and $n_u$ is the total number of training data; (C) for nnPU, $n_p$ and $n_u$ are exactly the same as for uPU. For uPU and nnPU, the P and U data were dependent, because neither $\hat{R}_{pu}(g)$ in Eq. (2) nor $\tilde{R}_{pu}(g)$ in Eq. (6) requires them to be independent. The choice of $n_n$ was motivated by [19] and may make nnPU potentially better than PN as $n_u \to \infty$ (whether $n_p < \infty$ or $n_p \ll n_u$).
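For concreteness, a hypothetical sketch of how such a PU training set can be carved out of a fully labeled benchmark under this setup ($n_p$ labeled P data, the whole training set as U data, so P and U overlap); `make_pu_dataset` is our name, not the authors' script.

```python
import numpy as np

def make_pu_dataset(x, y, n_p, rng):
    """Two-sample PU setting as used in the experiments:
    n_p P data drawn from the P class, and all training data as U data.
    x: feature array, y: labels in {+1, -1}."""
    x_pos = x[y == +1]
    idx = rng.choice(len(x_pos), size=n_p, replace=False)
    x_p = x_pos[idx]   # labeled P sample ~ p_p(x)
    x_u = x            # unlabeled sample ~ p(x); may overlap with x_p
    # pi_p is taken as known (cf. Table 2) or estimated from (x_p, x_u) [23-26]
    pi_p = np.mean(y == +1)
    return x_p, x_u, pi_p
```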
The model for MNIST was a 6-layer multilayer perceptron (MLP) with ReLU [40] (more specifically, d-300-300-300-300-1). For epsilon, the model was similar, while the activation was replaced with Softsign [41] for better performance. For 20News, we borrowed the pre-trained word embeddings from GloVe [42], and the model can be written as d-avg_pool(word_emb(d,300))-300-300-1, where word_emb(d,300) retrieves 300-dimensional word embeddings for all words in a document, avg_pool executes average pooling, and the resulting vector is fed to a 4-layer MLP with Softsign. The model for CIFAR-10 was an all convolutional net [43]: (32*32*3)-[C(3*3,96)]*2-C(3*3,96,2)-[C(3*3,192)]*2-C(3*3,192,2)-C(3*3,192)-C(1*1,192)-C(1*1,10)-1000-1000-1, where the input is a 32*32 RGB image, C(3*3,96) means 96 channels of 3*3 convolutions followed by ReLU, [·]*2 means there are two such layers, C(3*3,96,2) means a similar layer but with stride 2, etc.; it is one of the best architectures for CIFAR-10. Batch normalization [44] was applied before hidden layers. Furthermore, the sigmoid loss $\ell_{\mathrm{sig}}$ was used as the surrogate loss and an $\ell_2$-regularization was also added. The resulting objectives were minimized by Adam [20] on MNIST, epsilon and CIFAR-10, and by AdaGrad [31] on 20News; we fixed $\beta = 0$ and $\gamma = 1$ for simplicity.

Table 2: Specification of benchmark datasets, models, and optimization algorithms.

| Name | # Train | # Test | # Feature | $\pi_p$ | Model $g(x; \theta)$ | Opt. alg. $\mathcal{A}$ |
|---|---|---|---|---|---|---|
| MNIST [29] | 60,000 | 10,000 | 784 | 0.49 | 6-layer MLP with ReLU | Adam [20] |
| epsilon [37] | 400,000 | 100,000 | 2,000 | 0.50 | 6-layer MLP with Softsign | Adam [20] |
| 20News [38] | 11,314 | 7,532 | 61,188 | 0.44 | 5-layer MLP with Softsign | AdaGrad [31] |
| CIFAR-10 [39] | 50,000 | 10,000 | 3,072 | 0.40 | 13-layer CNN with ReLU | Adam [20] |

See http://yann.lecun.com/exdb/mnist/ for MNIST, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html for epsilon, http://qwone.com/~jason/20Newsgroups/ for 20Newsgroups, and https://www.cs.toronto.edu/~kriz/cifar.html for CIFAR-10.

Figure 2: Experimental results of training deep neural networks on (a) MNIST, (b) epsilon, (c) 20News, and (d) CIFAR-10. [Each panel plots training and test risks w.r.t. the zero-one loss against the training epoch for PN, uPU, and nnPU.]

The experimental results are reported in Figure 2, where means and standard deviations of training and test risks based on the same 10 random samplings are shown. We can see that uPU overfitted the training data and nnPU fixed this problem. Additionally, given limited N data, nnPU outperformed PN on MNIST, epsilon and CIFAR-10 and was comparable to it on 20News. In summary, with the proposed non-negative risk estimator, we are able to use very flexible models given limited P data.

We further tried some cases where $\pi_p$ is misspecified, in order to simulate PU learning in the wild, where we must suffer from errors in estimating $\pi_p$. More specifically, we tested nnPU learning by replacing $\pi_p$ with $\pi'_p \in \{0.8\pi_p, 0.9\pi_p, \ldots, 1.2\pi_p\}$ and giving $\pi'_p$ to the learning method, so that it regards $\pi'_p$ as $\pi_p$ during the entire training process. The experimental setup was exactly the same as before except for the replacement of $\pi_p$.
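Concretely, the misspecification experiment only changes the prior handed to the risk estimator; a hypothetical usage line, reusing `risk_pu_nonneg` from the sketch after Eq. (6):

```python
# pi_p_prime plays the role of pi'_p in {0.8, 0.9, 1.0, 1.1, 1.2} * pi_p;
# g, x_p, x_u, pi_p are from the earlier sketches.
for scale in (0.8, 0.9, 1.0, 1.1, 1.2):
    pi_p_prime = scale * pi_p                       # the prior the learner believes in
    r = risk_pu_nonneg(g, x_p, x_u, pi_p_prime)     # training/evaluation sees pi_p_prime only
```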
Figure 3: Experimental results given $\pi'_p \in \{0.8\pi_p, 0.9\pi_p, \ldots, 1.2\pi_p\}$ on (a) MNIST, (b) epsilon, (c) 20News, and (d) CIFAR-10. [Each panel plots the test risk w.r.t. the zero-one loss against the training epoch for the five values of $\pi'_p/\pi_p$; horizontal lines mark the best mean test risks and vertical lines mark the epochs at which they were achieved.]

The experimental results are reported in Figure 3, where means of test risks of nnPU based on the same 10 random samplings are shown, and the best test risks are identified (horizontal lines are the best mean test risks and vertical lines are the epochs when they were achieved). We can see that on MNIST, the larger the misspecification was, the worse nnPU performed, while under-misspecification hurt more than over-misspecification; on epsilon, the cases where $\pi'_p$ equals $\pi_p$, $1.1\pi_p$ and $1.2\pi_p$ were comparable, but the best was $\pi'_p = 1.1\pi_p$ rather than $\pi'_p = \pi_p$; on 20News, these three cases became different, such that $\pi'_p = \pi_p$ was superior to $\pi'_p = 1.2\pi_p$ but inferior to $\pi'_p = 1.1\pi_p$; finally, on CIFAR-10, $\pi'_p = \pi_p$ and $\pi'_p = 1.1\pi_p$ were comparable again, and $\pi'_p = 1.2\pi_p$ was the winner.

In all the experiments we fixed $\beta = 0$, which may explain this phenomenon. Recall that uPU overfitted seriously on all the benchmark datasets, and note that the larger $\pi'_p$ is, the more different nnPU is from uPU. Therefore, the replacement of $\pi_p$ with some $\pi'_p > \pi_p$ introduces additional bias of $\tilde{R}_{pu}(g)$ in estimating $R(g)$, but it also pushes $\tilde{R}_{pu}(g)$ away from $\hat{R}_{pu}(g)$ and then pushes nnPU away from uPU. This may result in lower test risks given some $\pi'_p$ slightly larger than $\pi_p$, as shown in Figure 3. This is also why under-misspecified $\pi'_p$ hurt more than over-misspecified $\pi'_p$.

All the experiments were done with Chainer [45], and our implementation based on it is available at https://github.com/kiryor/nnPUlearning.

## 6 Conclusions

We proposed a non-negative risk estimator for PU learning that follows and improves on the state-of-the-art unbiased risk estimators. No matter how flexible the model is, the proposed estimator will not go negative, unlike its unbiased counterparts. It is more robust against overfitting when being minimized, and training very flexible models such as deep neural networks given limited P data becomes possible. We also developed a large-scale PU learning algorithm. Extensive theoretical analyses were presented, and the usefulness of our non-negative PU learning was verified by intensive experiments. A promising future direction is extending the current work to semi-supervised learning along the lines of [46].

Acknowledgments

GN and MS were supported by JST CREST JPMJCR1403, and GN was also partially supported by Microsoft Research Asia.

References

[1] F. Denis. PAC learning from positive statistical queries. In ALT, 1998.
[2] F. De Comité, F. Denis, R. Gilleron, and F. Letouzey. Positive and unlabeled examples help learning. In ALT, 1999.
[3] F. Letouzey, F. Denis, and R. Gilleron. Learning from positive and unlabeled examples. In ALT, 2000.
[4] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In KDD, 2008.
[5] G. Ward, T. Hastie, S. Barry, J. Elith, and J. Leathwick. Presence-only data and the EM algorithm. Biometrics, 65(2):554-563, 2009.
[6] C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In AISTATS, 2009.
[7] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973-3009, 2010.
[8] C.-J. Hsieh, N. Natarajan, and I. S. Dhillon. PU learning for matrix completion. In ICML, 2015.
[9] X. Li, P. S. Yu, B. Liu, and S.-K. Ng. Positive unlabeled learning for data stream classification. In SDM, 2009.
[10] M. N. Nguyen, X. Li, and S.-K. Ng. Positive unlabeled learning for time series classification. In IJCAI, 2011.
[11] B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In ICML, 2002.
[12] X. Li and B. Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, 2003.
[13] W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, 2003.
[14] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In ICDM, 2003.
[15] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In NIPS, 2014.
[16] M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
[17] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS, 2013.
[18] G. Patrini, F. Nielsen, R. Nock, and M. Carioni. Loss factorization, weakly supervised learning and label noise robustness. In ICML, 2016.
[19] G. Niu, M. C. du Plessis, T. Sakai, Y. Ma, and M. Sugiyama. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In NIPS, 2016.
[20] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[21] E. Sansone, F. G. B. De Natale, and Z.-H. Zhou. Efficient training for positive unlabeled learning. arXiv preprint arXiv:1608.06807, 2016.
[22] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 185-208. MIT Press, 1999.
[23] A. Menon, B. Van Rooyen, C. S. Ong, and B. Williamson. Learning from corrupted binary labels via class-probability estimation. In ICML, 2015.
[24] H. G. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embedding of distributions. In ICML, 2016.
[25] S. Jain, M. White, and P. Radivojac. Estimating the class prior and posterior from noisy positives and unlabeled data. In NIPS, 2016.
[26] M. C. du Plessis, G. Niu, and M. Sugiyama. Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106(4):463-492, 2017.
[27] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[28] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[30] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In NIPS, 2001.
[31] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
[32] K.-L. Chung. A Course in Probability Theory. Academic Press, 1968.
[33] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. 3rd Berkeley Symposium on Mathematical Statistics and Probability, 1956.
[34] W. James and C. Stein. Estimation with quadratic loss. In Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, 1961.
[35] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902-1914, 2001.
[36] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
[37] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13:1999-2030, 2012.
[38] K. Lang. NewsWeeder: Learning to filter netnews. In ICML, 1995.
[39] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[40] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[41] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[42] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[43] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLR, 2015.
[44] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[45] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: A next-generation open source framework for deep learning. In Machine Learning Systems Workshop at NIPS, 2015.
[46] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, 2017.
[47] C. McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, pages 148-188. Cambridge University Press, 1989.
[48] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
[49] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[50] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.