# Online Positive and Unlabeled Learning

Chuang Zhang¹, Chen Gong¹,³, Tengfei Liu⁴, Xun Lu⁴, Weiqiang Wang⁴ and Jian Yang¹,²

¹PCA Lab, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China
²Jiangsu Key Lab of Image and Video Understanding for Social Security
³The Department of Computing, Hong Kong Polytechnic University
⁴Ant Financial Services Group

chen.gong@njust.edu.cn

## Abstract

Positive and Unlabeled learning (PU learning) aims to build a binary classifier when only positive and unlabeled data are available for classifier training. However, existing PU learning methods all work in a batch learning mode and therefore cannot handle online learning scenarios with sequential data. This paper proposes a novel positive and unlabeled learning algorithm that operates in an online training mode, training a classifier solely on positive and unlabeled data arriving in sequential order. Specifically, we adopt an unbiased estimate of the loss induced by the positive or unlabeled examples arriving at each time step. We then show that, for any newly arriving datum, the model can be updated independently and incrementally by a gradient-based online learning method. Furthermore, we extend our method to the case where more than one example is received at a time. Theoretically, we show that the proposed online PU learning method achieves low regret even though it receives sequential positive and unlabeled data. Empirically, we conduct intensive experiments on both benchmark and real-world datasets, and the results clearly demonstrate the effectiveness of the proposed method.

## 1 Introduction

Traditional supervised learning usually employs labeled positive and negative examples for model training. However, labeled negative data are not always available in real-world applications. This has led to the development of Positive and Unlabeled learning (PU learning) [Denis et al., 2005], which aims to train a binary classifier from only positive and unlabeled data, without any negative data. Here the unlabeled data could be either positive or negative, but their ground-truth labels remain unknown to the learning algorithm throughout the training stage. Due to its usefulness and effectiveness, PU learning has been widely applied to real-world problems such as software clone detection [Wei and Li, 2018] and remote-sensed hyperspectral image classification [Li et al., 2011].

Given its broad applicability, PU learning has attracted intensive research attention in recent years, and a variety of effective algorithms have been proposed, such as [Liu et al., 2002; Kiryo et al., 2017; Gong et al., 2019b]. Although these existing methods achieve encouraging performance on various datasets and tasks, they all work in a batch (offline) learning mode, which cannot deal with online learning scenarios involving sequential data. Unfortunately, in many practical applications the data are presented in sequence, so traditional batch learning algorithms, which require all training data to be observed simultaneously, will not work.
Therefore, online learning [Rosenblatt, 1958; Crammer and Singer, 2003] was proposed to process training examples that arrive in sequential order, where a learner aims to learn and update the optimal classifier for processing future data at each step. In other words, the classifier is updated incrementally for each new training example, so online learning algorithms do not need to observe all training examples before classifier training.

To make PU models applicable to sequential data, this paper proposes a novel Online Positive and Unlabeled (OPU for short) learning algorithm. The key challenge in designing our online model is to avoid the bias incurred by the absence of negative data while updating the model incrementally on the sequential positive and unlabeled examples. We overcome this challenge by exploiting the unbiased estimator for positive and unlabeled learning [Du Plessis et al., 2015] and utilizing Online Gradient Descent (OGD) [Zinkevich, 2003] to optimize our model. Specifically, we cast OPU learning as a sequential Empirical Risk Minimization problem, in which different unbiased loss functions are elaborately designed for arriving positive and unlabeled data, respectively. We then show that, for any newly arriving data point, the model can be updated independently and incrementally by OGD. Besides, we present a theoretical regret bound on the difference between the solution computed by our OPU learning and the optimal solution learned in hindsight. Moreover, experimental results on both benchmark and real-world datasets confirm the effectiveness of our OPU algorithms.

## 2 Related Work

This section briefly reviews typical works on PU learning and online learning, as both are related to this paper.

### 2.1 PU Learning

PU learning has been a popular research topic in the machine learning community. Up to now, a variety of PU learning models have been proposed, and they can typically be categorized into three types according to how the unlabeled data are handled.

Methods in the first category deploy a two-step strategy which first identifies some reliable negative examples from the unlabeled set, and then employs these reliable negative examples together with the original positive examples to train a traditional supervised classifier. In this strategy, the first stage is critical, and various methods have been developed to find potential negative examples, such as the Spy technique [Liu et al., 2002], the 1-DNF method [Yu et al., 2002], and the K-means based prototype method [Xiao et al., 2011].

Methods in the second category convert the PU learning problem into a one-sided label noise learning problem, directly treating the unlabeled examples as noisy negative ones. Concretely, the original positive examples in the unlabeled set are regarded as mislabeled, while no negative examples are mislabeled as positive, so effective label noise learning algorithms can be deployed to solve the PU learning problem. Representative works are [Lee and Liu, 2003], [Shi et al., 2018], and [Gong et al., 2019b].
Methods in the last category impose different weights on the loss values incurred by the positive examples and by the pseudo-labeling that treats each unlabeled example as both a positive and a negative example, thereby converting the PU learning problem into a cost-sensitive learning problem. Typical works include [Elkan and Noto, 2008; Du Plessis et al., 2014; Du Plessis et al., 2015; Kiryo et al., 2017]. Apart from the above-mentioned methods, other representative PU models rely on large-margin theory [Gong et al., 2019a], multi-manifold data structure [Gong et al., 2019c], label disambiguation [Zhang et al., 2019], and sequential minimal optimization [Sansone et al., 2018].

However, as mentioned in the introduction, all the above methods are designed for the batch learning mode and cannot be applied to sequential data. In fact, [Li et al., 2009] studied PU learning for data on the fly, but both their setting and their target differ greatly from ours: they treat only the current data as positive while regarding all historical data as unlabeled, in order to model the interest drift of online customers. In contrast, we require that the labels of positive and unlabeled data be specified in advance and remain unchanged during the entire online learning process, so that our model coincides with the standard PU learning requirements.

### 2.2 Online Learning

Online learning [Rosenblatt, 1958; Kivinen et al., 2004; Finn et al., 2019] aims to learn from data arriving in sequential order. In general, existing online learning algorithms mainly fall into two categories according to the model updating strategy, namely first-order-based and second-order-based online learning. First-order-based algorithms exploit only first-order feature information; representative works include the Perceptron [Rosenblatt, 1958; Freund and Schapire, 1999], the Relaxed Online Maximum Margin Algorithm [Li and Long, 2000], and the Passive-Aggressive algorithm [Crammer et al., 2006]. The performance of these algorithms is limited, since only first-order information is deployed. In recent years, second-order-based online learning algorithms have been elaborately designed to improve upon the first-order-based ones. They generally outperform first-order algorithms significantly by exploiting second-order information, such as the covariance matrix of the features; representative works are the second-order Perceptron [Cesa-Bianchi et al., 2005], Confidence-Weighted learning [Dredze et al., 2008], and Adaptive Regularization of Weights learning [Crammer et al., 2009]. However, none of the aforementioned algorithms are applicable to the PU learning problem studied in this paper, which motivates us to seek a novel online training strategy for tackling the OPU learning problem.

## 3 Preliminaries on Batch-mode PU Learning

In this section, we review the formal setting of traditional batch-mode PU learning, which helps to explain the online algorithm designed in Section 4. Consider a binary classification problem, where $x \in \mathbb{R}^d$ denotes a $d$-dimensional pattern and $y \in \{+1, -1\}$ is the corresponding class label.
Let $p(x, y)$ be the underlying joint density of $(x, y)$. The class-conditional densities of the positive and negative classes can be written as $p_{\mathrm{p}}(x) = p(x \mid y=+1)$ and $p_{\mathrm{n}}(x) = p(x \mid y=-1)$, respectively, and $p(x)$ denotes the marginal density of the unlabeled data. Furthermore, given $\pi = p(y=+1)$ as the positive class-prior probability, we have $p(y=-1) = 1-\pi$. Since an unlabeled dataset consists of positive and negative examples, we know that $p(x) = \pi\, p(x \mid y=+1) + (1-\pi)\, p(x \mid y=-1)$.

Assume that we have a positive dataset $\mathcal{P}$ and an unlabeled dataset $\mathcal{U}$ drawn independently and identically from an unknown distribution $\mathcal{D}$ as

$$\mathcal{P} := \{x_i^{\mathrm{p}}\}_{i=1}^{n_{\mathrm{p}}} \sim p(x \mid y=+1), \qquad \mathcal{U} := \{x_j^{\mathrm{u}}\}_{j=1}^{n_{\mathrm{u}}} \sim p(x),$$

where $n_{\mathrm{p}}$ and $n_{\mathrm{u}}$ are the sizes of the positive and unlabeled datasets, respectively. The goal of PU learning is to learn a classifier $g(x)$ that assigns a label $\hat{y}$ to a new pattern $x$ by $\hat{y} = \mathrm{sign}(g(x))$. It is widely acknowledged that the Bayes optimal classifier $g^*(x)$ can be obtained by minimizing the following classification risk:

$$R(g) = \mathbb{E}_{(X,Y) \sim p(x,y)}\big[\ell_{0\text{-}1}(Y g(X))\big] = \pi\, \mathbb{E}_{\mathrm{p}}\big[\ell_{0\text{-}1}(g(X))\big] + (1-\pi)\, \mathbb{E}_{\mathrm{n}}\big[\ell_{0\text{-}1}(-g(X))\big], \tag{1}$$

where $(X, Y)$ are the random variables corresponding to $(x, y)$, $\mathbb{E}_{\mathrm{p}}[\cdot] = \mathbb{E}_{X \sim p_{\mathrm{p}}(x)}[\cdot]$, $\mathbb{E}_{\mathrm{n}}[\cdot] = \mathbb{E}_{X \sim p_{\mathrm{n}}(x)}[\cdot]$, and $\ell_{0\text{-}1}(\cdot)$ is the zero-one loss formulated as $\ell_{0\text{-}1}(z) = \frac{1}{2} - \frac{1}{2}\,\mathrm{sign}(z)$ with $z$ being the variable.

Thanks to the availability of both positive and negative examples in traditional fully supervised learning, the expectations $\mathbb{E}_{\mathrm{p}}[\ell_{0\text{-}1}(g(X))]$ and $\mathbb{E}_{\mathrm{n}}[\ell_{0\text{-}1}(-g(X))]$ in Eq. (1) can be estimated by the corresponding sample averages. However, no negative examples are available in PU learning, so $\mathbb{E}_{\mathrm{n}}[\ell_{0\text{-}1}(-g(X))]$ cannot be estimated directly. To solve this problem, [Du Plessis et al., 2015] proposed to estimate $\mathbb{E}_{\mathrm{n}}[\ell_{0\text{-}1}(-g(X))]$ indirectly from the weighted $\mathbb{E}_{\mathrm{p}}[\ell_{0\text{-}1}(-g(X))]$ and $\mathbb{E}_{\mathrm{u}}[\ell_{0\text{-}1}(-g(X))]$. Based on the fact that $p(x) = \pi\, p(x \mid y=+1) + (1-\pi)\, p(x \mid y=-1)$, we obtain

$$\mathbb{E}_{\mathrm{u}}\big[\ell_{0\text{-}1}(-g(X))\big] = \pi\, \mathbb{E}_{\mathrm{p}}\big[\ell_{0\text{-}1}(-g(X))\big] + (1-\pi)\, \mathbb{E}_{\mathrm{n}}\big[\ell_{0\text{-}1}(-g(X))\big], \tag{2}$$

where $\mathbb{E}_{\mathrm{u}}[\cdot]$ denotes the expectation over $p(x)$. Therefore, the risk $R(g)$ in PU learning can be approximated by

$$R(g) = \pi\, \mathbb{E}_{\mathrm{p}}\big[\ell_{0\text{-}1}(g(X))\big] + \mathbb{E}_{\mathrm{u}}\big[\ell_{0\text{-}1}(-g(X))\big] - \pi\, \mathbb{E}_{\mathrm{p}}\big[\ell_{0\text{-}1}(-g(X))\big] = \pi\, \mathbb{E}_{\mathrm{p}}\big[\ell_{0\text{-}1}(g(X)) - \ell_{0\text{-}1}(-g(X))\big] + \mathbb{E}_{\mathrm{u}}\big[\ell_{0\text{-}1}(-g(X))\big]. \tag{3}$$

From the equation above, we can see that the risk for PU learning consists of a composite loss for positive examples, i.e., $\ell_{0\text{-}1}(g(X)) - \ell_{0\text{-}1}(-g(X))$, and an ordinary loss for unlabeled examples, i.e., $\ell_{0\text{-}1}(-g(X))$. [Du Plessis et al., 2015] showed that a loss function $\ell(z)$ leads to a convex PU learning model if it satisfies the linear-odd property, namely $\ell(z) - \ell(-z) = -z$; feasible choices of $\ell(z)$ include the double hinge loss, the square loss, and the logistic loss. In our method, the performance of different loss functions with the linear-odd property is discussed in Section 6.

## 4 The Proposed OPU Method

Let us consider the problem setting of the online positive and unlabeled classification task. Let $\{(x_t, y_t) \mid t = 1, \ldots, T\}$ be a sequence of input examples from a potential distribution $\mathcal{D}$, where $x_t \in \mathbb{R}^d$ is a $d$-dimensional pattern received at time $t$, $y_t \in \{1, 0\}$ is the observed class label of the corresponding pattern, and $T$ is the number of training rounds across the whole training stage. Here, we take $y = 1$ for positive examples and $y = 0$ for unlabeled examples. The goal of OPU classification is to learn a linear classifier $g(x_t) = w_t^\top x_t$, where $w_t \in \mathbb{R}^d$ is the weight vector at time $t$ during the training stage.
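To make the estimator concrete before moving to the online setting, the following is a minimal NumPy sketch of the unbiased risk of Eq. (3) for a linear scorer, using the double hinge loss discussed later in Section 6.1. The names `double_hinge` and `pu_risk` are ours, not from the paper.

```python
import numpy as np

def double_hinge(z):
    """Double hinge loss l_DH(z) = max(-z, max(0, 1/2 - z/2));
    it satisfies the linear-odd property l(z) - l(-z) = -z."""
    return np.maximum(-z, np.maximum(0.0, 0.5 - 0.5 * z))

def pu_risk(w, X_p, X_u, prior):
    """Unbiased empirical PU risk of Eq. (3) for a linear scorer g(x) = w^T x:
    a composite loss on the positives plus an ordinary loss on the unlabeled."""
    g_p, g_u = X_p @ w, X_u @ w
    composite = double_hinge(g_p) - double_hinge(-g_p)  # equals -g_p for linear-odd losses
    return prior * composite.mean() + double_hinge(-g_u).mean()
```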
### 4.1 Basic OPU with a Single Coming Datum

We cast OPU learning as a sequential Empirical Risk Minimization problem, whose main idea is to propose an unbiased estimate of the loss induced by the training examples received at each time step. Specifically, our goal is to develop an OPU learning algorithm that approximates the risk presented in Eq. (3). Given a surrogate loss function $\ell(z)$ with the linear-odd property, Eq. (3) turns into the convex optimization problem

$$R(g) = \pi\, \mathbb{E}_{\mathrm{p}}\big[\ell(g(X)) - \ell(-g(X))\big] + \mathbb{E}_{\mathrm{u}}\big[\ell(-g(X))\big] = -\pi\, \mathbb{E}_{\mathrm{p}}[g(X)] + \mathbb{E}_{\mathrm{u}}\big[\ell(-g(X))\big], \tag{4}$$

which can be rewritten as an average over a set of training examples, namely

$$\widehat{R}(g) = -\frac{\pi}{n_{\mathrm{p}}} \sum_{i=1}^{n_{\mathrm{p}}} g(x_i^{\mathrm{p}}) + \frac{1}{n_{\mathrm{u}}} \sum_{j=1}^{n_{\mathrm{u}}} \ell\big(-g(x_j^{\mathrm{u}})\big). \tag{5}$$

Now we reformulate the risk estimate for the individual training example received at time $t$, i.e.,

$$R_t(g) = -\pi\, \mathbf{1}_{y_t=1}\, g(x_t) + \mathbf{1}_{y_t=0}\, \ell(-g(x_t)), \tag{6}$$

where $\mathbf{1}_{(\cdot)}$ is an indicator function that equals one if its argument is true and zero otherwise. Using this representation, we can directly apply gradient-descent-based online learning to solve the OPU learning problem. Formally, given a linear-in-parameter classifier $g(x_t) = w_t^\top x_t$, the regularized objective function of Eq. (6) can be written as

$$f(w_t) = R_t(g) + \frac{\lambda}{2}\|w_t\|_2^2 = -\pi\, \mathbf{1}_{y_t=1}\, g(x_t) + \mathbf{1}_{y_t=0}\, \ell(-g(x_t)) + \frac{\lambda}{2}\|w_t\|_2^2, \tag{7}$$

where the first and second terms are the losses induced by the received positive and unlabeled examples, respectively, and the last term is an $L_2$ regularizer to avoid overfitting, with $\lambda$ being a nonnegative penalty parameter. We then consider the unbiased gradient $\nabla_t$ of the above objective in the presence of the $t$-th coming datum, which is given by

$$\nabla_t = -\pi\, \mathbf{1}_{y_t=1}\, g'(x_t) + \mathbf{1}_{y_t=0}\, \ell'(-g(x_t)) + \lambda w_t, \tag{8}$$

where $g'(\cdot)$ and $\ell'(\cdot)$ are the corresponding derivatives. Therefore, we can update $w_t \leftarrow \Pi_{\mathcal{W}}(w_{t-1} - \eta_t \nabla_t)$ with step size $\eta_t$, where $\Pi_{\mathcal{W}}(w) = \arg\min_{w' \in \mathcal{W}} \|w - w'\|$ is the projection onto the feasible set $\mathcal{W}$ of $w$. After a predetermined number $T$ of iterations, we output the $w_T$ of the last iteration. Algorithm 1 summarizes the overall OPU algorithm with a single coming datum.

**Algorithm 1: Basic OPU with a single coming datum**

Input: the penalty parameter $\lambda$.
1. Initialize $w_0 \leftarrow 0$.
2. For $t = 1, 2, \ldots, T$:
   - Receive a training example $(x_t, y_t)$ and set the learning rate $\eta_t = \frac{1}{\lambda t}$.
   - If $y_t = 1$, calculate the gradient $\nabla_t = -\pi\, g'(x_t) + \lambda w_t$; otherwise, calculate the gradient $\nabla_t = \ell'(-g(x_t)) + \lambda w_t$.
   - Update $w_t \leftarrow \Pi_{\mathcal{W}}(w_{t-1} - \eta_t \nabla_t)$.
3. Output the latest classifier parameter $w_T$.
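For illustration, here is a minimal Python sketch of Algorithm 1 under the double hinge loss, with its subgradient taken piecewise. The feasible set $\mathcal{W}$ is assumed to be an $L_2$ ball; the radius and the helper names are our choices, not details fixed by the paper.

```python
import numpy as np

def project_l2(w, radius):
    """Euclidean projection of w onto the ball W = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def opu_single(stream, dim, prior, lam, radius=10.0):
    """Sketch of Algorithm 1: one (x_t, y_t) arrives per round, y_t in {1, 0}."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / (lam * t)              # step size eta_t = 1 / (lambda * t)
        if y == 1:                         # positive example: loss is -pi * w^T x
            grad = -prior * x + lam * w
        else:                              # unlabeled example: loss is l_DH(-w^T x)
            z = -w @ x                     # argument of the double hinge loss
            if z <= -1:
                dl = -1.0                  # l_DH'(z) = -1 where -z is the active branch
            elif z < 1:
                dl = -0.5                  # l_DH'(z) = -1/2 on the middle segment
            else:
                dl = 0.0                   # flat region for z >= 1
            grad = dl * (-x) + lam * w     # chain rule through z = -w^T x
        w = project_l2(w - eta * grad, radius)
    return w
```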
### 4.2 Extended OPU with Multiple Coming Data

In practice, we may also meet cases where the data arrive in groups, so we extend the basic OPU algorithm to process a set of examples at a time. To be exact, given a small training set $I_t = \{(x_i, y_i)\}_{i=1}^{b}$ of size $b$ ($b > 1$) at time $t$ ($t = 1, 2, \ldots, T$), the regularized objective function can be formulated as

$$f^{I_t}(w_t) = R^{I_t}(g) + \frac{\lambda}{2}\|w_t\|_2^2, \tag{9}$$

where $R^{I_t}(g)$ is the average risk over the corresponding positive and unlabeled examples, defined as

$$R^{I_t}(g) = \frac{1}{b}\sum_{i=1}^{b}\Big(-\pi\, \mathbf{1}_{y_i=1}\, g(x_i) + \mathbf{1}_{y_i=0}\, \ell(-g(x_i))\Big).$$

Now we rewrite the objective function $f^{I_t}(w_t)$ by adding a conservative constraint $\frac{\gamma_t}{2}\|w_t - w_{t-1}\|_2^2$ [Li et al., 2014], and obtain

$$\varphi^{I_t}(w_t) = f^{I_t}(w_t) + \frac{\gamma_t}{2}\|w_t - w_{t-1}\|_2^2 = \frac{1}{b}\sum_{i=1}^{b}\Big(-\pi\, \mathbf{1}_{y_i=1}\, g(x_i) + \mathbf{1}_{y_i=0}\, \ell(-g(x_i))\Big) + \frac{\lambda}{2}\|w_t\|_2^2 + \frac{\gamma_t}{2}\|w_t - w_{t-1}\|_2^2, \tag{10}$$

where $\lambda$ is a nonnegative trade-off parameter and $\gamma_t$ is a conservative coefficient related to $\lambda$. The first term is the unbiased loss estimator for PU learning, which aims at full utilization of the currently coming set; the second term is an $L_2$ regularizer to avoid overfitting; and the last term is a conservative constraint that limits dramatic changes of the weight vector to avoid over-utilization.

For $(x_i, y_i) \in I_t$, $i = 1, \ldots, b$, the unbiased gradient of the above objective Eq. (10) is

$$\nabla^{I_t} = \frac{1}{b}\sum_{i=1}^{b}\Big(-\pi\, \mathbf{1}_{y_i=1}\, g'(x_i) + \mathbf{1}_{y_i=0}\, \ell'(-g(x_i))\Big) + \lambda w_t + \gamma_t(w_t - w_{t-1}). \tag{11}$$

Following the gradient-descent-based online learning method, the learning parameter can be effectively updated as $w_t \leftarrow \Pi_{\mathcal{W}}(w_{t-1} - \eta_t \nabla^{I_t})$, with $\eta_t$ being the learning rate. Furthermore, compared with the basic OPU algorithm with a single coming datum, another benefit of the extended OPU algorithm is that an $O(1/b)$ variance reduction is achieved when $b$ ($b > 1$) training examples are presented each time [Dekel et al., 2012].
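To make the extended step concrete, below is a hedged NumPy sketch. One way to realize the conservative constraint, following the mini-batch scheme of [Li et al., 2014] with the schedule $\gamma_t = \gamma_0 + \lambda t$, is to minimize the loss linearized at $w_{t-1}$ plus the conservative term, which yields an effective step size of $1/\gamma_t$. This reading, the constant `gamma0`, the ball radius, and the helper names are our assumptions rather than details fixed by the paper; `project_l2` is reused from the previous sketch.

```python
import numpy as np

def dh_grad_z(z):
    """Piecewise derivative of the double hinge loss in z (vectorized)."""
    return np.where(z <= -1, -1.0, np.where(z < 1, -0.5, 0.0))

def opu_minibatch(batches, dim, prior, lam, gamma0, radius=10.0):
    """Sketch of the extended OPU update of Eq. (11): a set I_t of b examples
    (X: (b, dim) array, y: {1, 0}^b array) arrives per round, and the
    conservative term gives the effective step size 1 / gamma_t."""
    w = np.zeros(dim)
    for t, (X, y) in enumerate(batches, start=1):
        b = len(y)
        pos, unl = (y == 1), (y == 0)
        grad = lam * w                                 # gradient of the L2 regularizer
        if pos.any():                                  # positive part of Eq. (11)
            grad -= prior * X[pos].sum(axis=0) / b
        if unl.any():                                  # unlabeled part, chain rule dz/dw = -x
            z = -(X[unl] @ w)
            grad -= (dh_grad_z(z)[:, None] * X[unl]).sum(axis=0) / b
        gamma_t = gamma0 + lam * t                     # conservative coefficient schedule
        w = project_l2(w - grad / gamma_t, radius)     # effective step size 1 / gamma_t
    return w
```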
## 5 Theoretical Analysis

In this section, we analyze the regret bound of the proposed OPU learning method, which measures the difference between the cumulative loss of our predictions and the cumulative loss of the optimal predictor $w^*$. To be specific, we first bound the regret of the basic OPU algorithm, which updates the model with one example per time step, and then bound the regret of the extended OPU algorithm, which updates the model with $b$ ($b > 1$) arriving examples.

### 5.1 Regret for the Basic OPU Algorithm

When only one example is processed at each time step, we have the following theorem for the regret bound.

**Theorem 1.** Let $f_1, \ldots, f_T$ be a sequence of $\lambda$-strongly convex functions. Let $\mathcal{W}$ be a closed convex set and define $\Pi_{\mathcal{W}}(w) = \arg\min_{w' \in \mathcal{W}} \|w - w'\|$. Let $w_1, \ldots, w_{T+1}$ be a sequence of vectors such that $w_1 \in \mathcal{W}$ and, for $t \geq 1$, $w_{t+1} = \Pi_{\mathcal{W}}(w_t - \eta_t \nabla_t)$, with $\nabla_t$ being the unbiased gradient of $f_t$ at $w_t$. Assume that $\|\nabla_t\| \leq G$ for all $t$. Then

$$\mathrm{Regret}_T(f(w)) \leq \frac{G^2}{2\lambda}(1 + \log T). \tag{12}$$

*Proof.* Let $w^* \in \arg\min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w)$. By recalling the definition of regret, we have

$$\mathrm{Regret}_T(f(w)) = \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(w^*). \tag{13}$$

Since $f_t$ is $\lambda$-strongly convex, we get the inequality

$$f_t(w^*) \geq f_t(w_t) + \nabla_t^\top (w^* - w_t) + \frac{\lambda}{2}\|w_t - w^*\|^2. \tag{14}$$

Rearranging Eq. (14), we get $f_t(w_t) - f_t(w^*) \leq \nabla_t^\top (w_t - w^*) - \frac{\lambda}{2}\|w_t - w^*\|^2$. Following Zinkevich's analysis [Zinkevich, 2003], we now upper-bound $\nabla_t^\top (w_t - w^*)$. According to the properties of projections [Hazan et al., 2007], we have

$$\|w_{t+1} - w^*\|^2 = \|\Pi_{\mathcal{W}}(w_t - \eta_t \nabla_t) - w^*\|^2 \leq \|w_t - \eta_t \nabla_t - w^*\|^2, \tag{15}$$

and hence

$$\|w_{t+1} - w^*\|^2 \leq \|w_t - w^*\|^2 + \eta_t^2 \|\nabla_t\|^2 - 2\eta_t \nabla_t^\top (w_t - w^*). \tag{16}$$

After rearranging the above inequality and using the assumption $\|\nabla_t\| \leq G$, we obtain

$$2\nabla_t^\top (w_t - w^*) \leq \frac{1}{\eta_t}\big(\|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2\big) + \eta_t G^2. \tag{17}$$

Combining Eq. (14) and Eq. (17), and summing Eq. (17) from $t = 1$ to $T$ with $\eta_t = 1/(\lambda t)$ (so that $\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \lambda = 0$, where we set $\frac{1}{\eta_0} := 0$), we have

$$\sum_{t=1}^{T}\big(f_t(w_t) - f_t(w^*)\big) \leq \frac{1}{2}\sum_{t=1}^{T} \|w_t - w^*\|^2\Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \lambda\Big) + \frac{G^2}{2}\sum_{t=1}^{T}\frac{1}{\lambda t} \leq \frac{G^2}{2\lambda}(1 + \log T), \tag{18}$$

which concludes the proof. ∎

### 5.2 Regret for the Extended OPU Algorithm

When a set of examples is processed at each time step, we have the following theorem for the regret bound.

**Theorem 2.** Let $\{I_t\}_{t=1}^{T}$ be the received training sets of the learning algorithm, with each training set consisting of $b$ ($b > 1$) training examples. Let $f^{I_1}, \ldots, f^{I_T}$ be a sequence of $\lambda$-strongly convex functions. Let $\mathcal{W}$ be a feasible set of $w$, let $w_0, \ldots, w_T$ be a sequence of vectors such that $w_0 \in \mathcal{W}$, and let $w^* \in \arg\min_{w \in \mathcal{W}} \sum_{t=1}^{T} f^{I_t}(w)$. Assume that for all $t$, $\sup_{w \in \mathcal{W}} \|\nabla f^{I_t}(w_t) - \nabla f(w)\|_2^2 \leq A^2$. Then

$$\mathrm{Regret}_T(f^{I}(w)) \leq \frac{\lambda \|w^* - w_0\|_2^2}{2\sqrt{b}} + \frac{A^2(1 + \log T)}{2\lambda b}. \tag{19}$$

*Proof.* By using the definition of regret, in this case we have

$$\mathrm{Regret}_T(f^{I}(w)) = \sum_{t=1}^{T} f^{I_t}(w_t) - \sum_{t=1}^{T} f^{I_t}(w^*). \tag{20}$$

According to Theorem 1 in [Li et al., 2014], assuming that $f^{I_t}$ is $\lambda$-strongly convex for all $t$ and choosing the conservative coefficient as $\gamma_t = \gamma + \lambda t$, we have that for all $w \in \mathcal{W}$,

$$\sum_{t=1}^{T}\big(f^{I_t}(w_t) - f^{I_t}(w^*)\big) \leq \frac{\gamma}{2}\|w^* - w_0\|_2^2 + \frac{A^2}{2}\sum_{t=1}^{T}\frac{1}{\gamma + \lambda t}. \tag{21}$$

For a strongly convex objective function, one may choose $\gamma = \lambda A / (\sqrt{b}\,\|w^* - w_0\|_2)$ to minimize the upper bound in Eq. (21). Since the variance decreases as $O(1/b)$ for each set of coming data, the choice of $\gamma$ in Eq. (21) with $\gamma = O(1/\sqrt{b})$ is appropriate. Substituting $\gamma = O(\lambda/\sqrt{b})$ and using the $O(1/b)$ variance reduction of the mini-batch gradient, simple algebra gives the aggregated regret bound

$$\sum_{t=1}^{T}\big(f^{I_t}(w_t) - f^{I_t}(w^*)\big) \leq \frac{\gamma}{2}\|w^* - w_0\|_2^2 + \frac{A^2}{2}\sum_{t=1}^{T}\frac{1}{\gamma + \lambda t} \leq \frac{\lambda \|w^* - w_0\|_2^2}{2\sqrt{b}} + \frac{A^2(1 + \log T)}{2\lambda b},$$

which completes the proof. ∎

From Theorems 1 and 2, we see that the regret bound of the basic OPU algorithm is of order $O(\log T)$, while that of the extended OPU algorithm is $O((\log T)/b)$, improving the basic bound by a factor of $O(1/b)$. Therefore, the models finally obtained by our method are very close to the ideal solution $w^*$.

## 6 Experiments

In this section, we first study the performance of the proposed method on benchmark and real-world datasets, and then investigate the sensitivity of the pre-tuned parameters in our model.

### 6.1 Compared Algorithms

In our experiments, two loss functions with the linear-odd property are considered: the double hinge loss $\ell_{\mathrm{DH}}(z) = \max\big(-z, \max(0, \tfrac{1}{2} - \tfrac{1}{2}z)\big)$ and the square loss $\ell_{\mathrm{SL}}(z) = \tfrac{1}{4}(z-1)^2 - \tfrac{1}{4}$ (a numerical check of this property appears after the following list). The following methods are compared:

- **UPU**: Unbiased PU learning, a batch-mode PU learning algorithm [Du Plessis et al., 2015], which can be regarded as the performance upper bound of our OPU algorithms.
- **OPU-DH**: the proposed basic OPU algorithm with a single coming datum (Section 4.1) using the double hinge loss.
- **OPU-SL**: the proposed basic OPU algorithm with a single coming datum (Section 4.1) using the square loss.
- **eOPU-DH**: the proposed extended OPU algorithm with multiple coming data (Section 4.2) using the double hinge loss.
- **eOPU-SL**: the proposed extended OPU algorithm with multiple coming data (Section 4.2) using the square loss.
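As the sanity check referenced above, the following short snippet verifies numerically that both losses satisfy the linear-odd property $\ell(z) - \ell(-z) = -z$ required by the analysis in Section 3; the helper names are ours.

```python
import numpy as np

def l_dh(z):   # double hinge loss
    return np.maximum(-z, np.maximum(0.0, 0.5 - 0.5 * z))

def l_sl(z):   # square loss
    return 0.25 * (z - 1.0) ** 2 - 0.25

# Numerical check of the linear-odd property l(z) - l(-z) = -z,
# which guarantees a convex unbiased PU objective.
z = np.linspace(-5.0, 5.0, 11)
assert np.allclose(l_dh(z) - l_dh(-z), -z)
assert np.allclose(l_sl(z) - l_sl(-z), -z)
```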
### 6.2 Benchmark Datasets

We conduct experiments on a variety of benchmark datasets from the OpenML machine learning repository (https://www.openml.org/). To be specific, four binary datasets are adopted for algorithm evaluation, namely Vote, Australian, Mushroom, and Phishing; their brief information is presented in Table 1. For each dataset, we randomly choose $r = 20\%$, $30\%$, or $40\%$ of the positive examples, together with all negative examples, as unlabeled, and leave the remaining positive examples as labeled. Under each $r$, we conduct five-fold cross-validation for every compared method and report the average accuracy and standard deviation over the five independent runs; as a result, each model in a given run is trained with 80% of the examples and tested on the remaining 20%. Moreover, all data features are normalized to $[-1, 1]$ in advance, and the formation of the training set is kept identical across all compared methods to ensure a fair comparison. In our experiments, the class prior of positive examples $\pi$ is assumed to be known during training for all compared methods, as is also assumed by existing PU learning works such as [Du Plessis et al., 2015; Kiryo et al., 2017; Gong et al., 2019b]. In practice, $\pi$ can be obtained from experience or some domain-specific prior knowledge.

Besides, the parameters of every algorithm have been carefully tuned on the validation set to achieve the best performance. In our OPU algorithms, the regularization parameter $\lambda$ is chosen from $\{10^{-6}, \ldots, 10^{2}\}$; for UPU, $\lambda$ is chosen from $\{10^{-3}, \ldots, 10^{1}\}$. For the extended OPU algorithms eOPU-DH and eOPU-SL, 5% of the whole training set is selected to form the set of input data at each time step.

The obtained test accuracies are reported in Table 1. We find that the extended algorithms eOPU-DH and eOPU-SL usually achieve higher accuracies than the basic algorithms OPU-DH and OPU-SL, and that the OPU algorithms using the double hinge loss often outperform those using the square loss. Besides, the performance of our OPU algorithms is very close to that of the batch-mode PU learning method (i.e., UPU), and in some cases our OPU algorithms perform even better than UPU, which suggests the effectiveness of the proposed method.

| Dataset (n, d) | r | UPU | OPU-DH | OPU-SL | eOPU-DH | eOPU-SL |
|---|---|---|---|---|---|---|
| Vote (435, 16) | 20% | 0.920±0.087 | 0.902±0.029 | 0.887±0.024 | 0.908±0.008 | **0.911±0.022** |
| | 30% | 0.940±0.012 | 0.894±0.025 | 0.864±0.040 | **0.908±0.026** | 0.903±0.029 |
| | 40% | 0.959±0.006 | 0.889±0.026 | 0.877±0.041 | **0.901±0.024** | 0.898±0.038 |
| Australian (690, 14) | 20% | 0.772±0.045 | 0.808±0.012 | 0.758±0.102 | **0.864±0.023** | 0.844±0.016 |
| | 30% | 0.827±0.025 | 0.832±0.047 | 0.720±0.104 | **0.864±0.037** | 0.850±0.027 |
| | 40% | 0.843±0.027 | 0.743±0.125 | 0.621±0.120 | **0.861±0.031** | 0.724±0.110 |
| Mushroom (8124, 112) | 20% | 0.857±0.013 | 0.810±0.004 | 0.787±0.105 | **0.836±0.078** | 0.816±0.067 |
| | 30% | 0.800±0.006 | 0.766±0.056 | 0.768±0.028 | **0.825±0.036** | 0.789±0.046 |
| | 40% | 0.753±0.008 | 0.740±0.012 | 0.751±0.014 | **0.840±0.067** | 0.762±0.037 |
| Phishing (11055, 68) | 20% | 0.867±0.010 | 0.827±0.007 | 0.842±0.012 | **0.856±0.005** | **0.856±0.017** |
| | 30% | 0.893±0.013 | 0.838±0.003 | 0.835±0.004 | **0.881±0.004** | 0.871±0.012 |
| | 40% | 0.910±0.009 | 0.826±0.004 | 0.812±0.007 | **0.890±0.004** | 0.867±0.007 |

Table 1: The accuracies (mean±std) of the compared methods on four OpenML benchmark datasets when r = 20%, 30%, and 40%. The best record among the proposed OPU algorithms under each r is marked in bold. n and d denote the amount of training data and the feature dimension, respectively.

### 6.3 Real-world Datasets

Here we investigate the performance of the compared methods on image classification tasks. Concretely, CIFAR-10 [Krizhevsky and Hinton, 2009] and SVHN [Netzer et al., 2011] are chosen for evaluation. CIFAR-10 consists of 60000 32×32 natural images in 10 classes, with each class containing 6000 images. We take the images of transportation tools ("airplane", "automobile", "ship", and "truck") as negative, and regard the images of animals ("bird", "cat", "deer", "dog", "frog", and "horse") as positive; therefore, there are 36000 positive examples and 24000 negative examples in total. SVHN consists of 99289 32×32 digit images in 10 classes, i.e., the digits "0" through "9", where the negative set is formed by the digit images "1" through "5" and the remaining digit images compose the positive set. As a result, we obtain 34699 positive examples and 64590 negative examples.
Note that the training and test sets are split in advance, with 50000 training and 10000 test examples for CIFAR-10, and 73257 training and 26032 test examples for SVHN. In our experiments, we extract 512-dimensional GIST features for each image. As in the previous experiments, the cases r = 20%, 30%, and 40% are studied for each dataset, and the parameters of every method have been carefully tuned to achieve the best performance. The test accuracies of the compared methods are presented in Table 2: although our OPU algorithms receive the data sequentially, their performance is very similar to that of the UPU algorithm, which observes all data for model training. Therefore, the proposed OPU algorithms are effective in handling real-world data.

| Dataset | r | UPU | OPU-DH | OPU-SL | eOPU-DH | eOPU-SL |
|---|---|---|---|---|---|---|
| CIFAR-10 | 20% | 0.848 | 0.810 | 0.840 | 0.824 | **0.844** |
| | 30% | 0.842 | 0.827 | 0.792 | **0.832** | 0.803 |
| | 40% | 0.836 | **0.827** | 0.799 | **0.827** | 0.815 |
| SVHN | 20% | 0.838 | 0.833 | 0.783 | **0.835** | 0.812 |
| | 30% | 0.846 | 0.808 | 0.798 | **0.818** | 0.805 |
| | 40% | 0.847 | 0.820 | 0.813 | **0.828** | 0.819 |

Table 2: The test accuracies on the adopted real-world datasets. The best record among the proposed OPU algorithms is marked in bold.

### 6.4 Parametric Sensitivity

This section investigates the sensitivity of our OPU algorithms to the tuning parameter λ and the positive class prior π appearing in the objective functions Eq. (7) and Eq. (10) of our OPU learning model. Specifically, we examine the test accuracy of the extended OPU algorithm with the double hinge loss (i.e., eOPU-DH) on the two real-world datasets when r = 30%. The proposed OPU-DH, OPU-SL, and eOPU-SL all exhibit results similar to eOPU-DH, so we omit their parametric sensitivity curves due to lack of space.

**Influence of λ:** The experimental results with λ varying from 10⁻⁶ to 10⁰ are reported in Figure 1(a). We see that λ is critical for our OPU algorithm to obtain satisfactory performance; specifically, the choice λ = 10⁻² usually leads to high classification accuracy.

**Influence of π:** As mentioned previously, the positive class prior π might be unknown in practice, so it has to be estimated from experience or some domain-specific prior knowledge. However, such estimation can be inaccurate, so we investigate how the classification accuracy is influenced by an inaccurate π. More specifically, we test eOPU-DH by replacing π with an inaccurate π̂ ∈ {0.6π, 0.8π, …, 1.4π} and feeding π̂ to the learning method. The experimental results are presented in Figure 1(b), which suggests that the performance of the proposed method does not degrade severely when the estimate deviates slightly from the true π.

Figure 1: The parametric sensitivity of λ (a) and π (b) for eOPU-DH on the CIFAR-10 and SVHN datasets.
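For readers who want to reproduce the spirit of this robustness check, here is a hedged toy sketch on synthetic Gaussian data, reusing `opu_minibatch` and `project_l2` from the sketches in Section 4.2. The data generator, the 50% labeling rate, and all constants are our assumptions, not the paper's experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, pi = 2, 0.4

def sample(n):
    """Toy PU data: positives ~ N(+1, I), negatives ~ N(-1, I)."""
    y = (rng.random(n) < pi).astype(int)
    X = rng.normal(0.0, 1.0, (n, dim)) + np.where(y[:, None] == 1, 1.0, -1.0)
    return X, y

X_tr, y_tr = sample(2000)
obs = np.where((y_tr == 1) & (rng.random(2000) < 0.5), 1, 0)   # observed PU labels
X_te, y_te = sample(1000)

# True prior of positives within the unlabeled portion of the stream.
pi_u = ((y_tr == 1) & (obs == 0)).sum() / (obs == 0).sum()

for scale in (0.6, 0.8, 1.0, 1.2, 1.4):
    batches = [(X_tr[i:i + 50], obs[i:i + 50]) for i in range(0, 2000, 50)]
    w = opu_minibatch(batches, dim, scale * pi_u, lam=1e-2, gamma0=1.0)
    acc = ((X_te @ w > 0).astype(int) == y_te).mean()
    print(f"pi_hat = {scale:.1f} * pi -> test accuracy {acc:.3f}")
```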
## 7 Conclusion

This paper proposed a novel OPU learning paradigm to deal with sequential positive and unlabeled data. We show that, with an unbiased estimate of the loss induced by the positive or unlabeled example received at each time step, our OPU model can be effectively solved by gradient-based online learning methods. Furthermore, two OPU algorithms with different formations of input data are developed, and their regret bounds have been theoretically proved. The experimental results on both benchmark and real-world datasets clearly demonstrate the effectiveness of the proposed method. Since our method relies on the class prior π, we plan to design an online strategy that estimates π incrementally in future work.

## Acknowledgements

This research is supported by NSF of China (Nos: 61973162, U1713208), NSF of Jiangsu Province (No: BK20171430), the Fundamental Research Funds for the Central Universities (No: 30918011319), the Summit of the Six Top Talents Program (No: DZXX027), the Young Elite Scientists Sponsorship Program by CAST (No: 2018QNRC001), the Ant Financial Science Funds for Security Research of Ant Financial, and the Program for Changjiang Scholars.

## References

[Cesa-Bianchi et al., 2005] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

[Crammer and Singer, 2003] Koby Crammer and Yoram Singer. Ultraconservative online algorithms for multiclass problems. JMLR, 3(Jan):951–991, 2003.

[Crammer et al., 2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. JMLR, 7(Mar):551–585, 2006.

[Crammer et al., 2009] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. In NeurIPS, pages 414–422, 2009.

[Dekel et al., 2012] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. JMLR, 13(Jan):165–202, 2012.

[Denis et al., 2005] François Denis, Rémi Gilleron, and Fabien Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, 348(1):70–83, 2005.

[Dredze et al., 2008] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In ICML, pages 264–271. ACM, 2008.

[Du Plessis et al., 2014] Marthinus C. Du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, pages 703–711, 2014.

[Du Plessis et al., 2015] Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, pages 1386–1394, 2015.

[Elkan and Noto, 2008] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In SIGKDD, pages 213–220. ACM, 2008.

[Finn et al., 2019] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.

[Freund and Schapire, 1999] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.

[Gong et al., 2019a] Chen Gong, Tongliang Liu, Jian Yang, and Dacheng Tao. Large-margin label-calibrated support vector machines for positive and unlabeled learning. IEEE T-NNLS, 2019.

[Gong et al., 2019b] Chen Gong, Hong Shi, Tongliang Liu, Chuang Zhang, Jian Yang, and Dacheng Tao. Loss decomposition and centroid estimation for positive and unlabeled learning. IEEE T-PAMI, 2019.

[Gong et al., 2019c] Chen Gong, Hong Shi, Jie Yang, and Jian Yang. Multi-manifold positive and unlabeled learning for visual analysis. IEEE T-CSVT, 2019.

[Hazan et al., 2007] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[Kiryo et al., 2017] Ryuichi Kiryo, Gang Niu, Marthinus C. du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pages 1675–1685, 2017.

[Kivinen et al., 2004] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels. IEEE T-SP, 52(8):2165–2176, 2004.
[Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.

[Lee and Liu, 2003] Wee Sun Lee and Bing Liu. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, volume 3, pages 448–455, 2003.

[Li and Long, 2000] Yi Li and Philip M. Long. The relaxed online maximum margin algorithm. In NeurIPS, pages 498–504, 2000.

[Li et al., 2009] Xiao-Li Li, Philip S. Yu, Bing Liu, and See-Kiong Ng. Positive unlabeled learning for data stream classification. In Proceedings of the 2009 SIAM International Conference on Data Mining, pages 259–270. SIAM, 2009.

[Li et al., 2011] Wenkai Li, Qinghua Guo, and Charles Elkan. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE T-GRS, 49(2):717–725, 2011.

[Li et al., 2014] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient mini-batch training for stochastic optimization. In SIGKDD, pages 661–670. ACM, 2014.

[Liu et al., 2002] Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. Partially supervised classification of text documents. In ICML, volume 2, pages 387–394, 2002.

[Netzer et al., 2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[Rosenblatt, 1958] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

[Sansone et al., 2018] Emanuele Sansone, Francesco G. B. De Natale, and Zhi-Hua Zhou. Efficient training for positive unlabeled learning. IEEE T-PAMI, 2018.

[Shi et al., 2018] Hong Shi, Shaojun Pan, Jian Yang, and Chen Gong. Positive and unlabeled learning via loss decomposition and centroid estimation. In IJCAI, pages 2689–2695, 2018.

[Wei and Li, 2018] Huihui Wei and Ming Li. Positive and unlabeled learning for detecting software functional clones with adversarial training. In IJCAI, pages 2840–2846, 2018.

[Xiao et al., 2011] Yanshan Xiao, Bo Liu, Jie Yin, Longbing Cao, Chengqi Zhang, and Zhifeng Hao. Similarity-based approach for positive and unlabelled learning. In IJCAI, 2011.

[Yu et al., 2002] Hwanjo Yu, Jiawei Han, and Kevin Chen-Chuan Chang. PEBL: Positive example based learning for web page classification using SVM. In SIGKDD, pages 239–248. ACM, 2002.

[Zhang et al., 2019] Chuang Zhang, Dexin Ren, Tongliang Liu, Jian Yang, and Chen Gong. Positive and unlabeled learning with label disambiguation. In IJCAI, pages 4250–4256, 2019.

[Zinkevich, 2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.