# Generative Adversarial Positive-Unlabeled Learning

Ming Hou¹, Brahim Chaib-draa², Chao Li³, Qibin Zhao¹,⁴

¹ Tensor Learning Unit, Center for Advanced Intelligence Project, RIKEN, Japan
² Department of Computer Science and Software Engineering, Laval University, Canada
³ Causal Inference Team, Center for Advanced Intelligence Project, RIKEN, Japan
⁴ School of Automation, Guangdong University of Technology, China

ming.hou@riken.jp, brahim.chaib-draa@ift.ulaval.ca, chao.li.hf@riken.jp, qibin.zhao@riken.jp

In this work, we consider the task of classifying binary positive-unlabeled (PU) data. Existing PU models based on discriminative learning seek an optimal reweighting strategy for unlabeled (U) data, so that a decent decision boundary can be found. However, given limited positive (P) data, the conventional PU models tend to suffer from overfitting when adapted to very flexible deep neural networks. In contrast, we are, to our knowledge, the first to propose a new paradigm that attacks the binary PU task from the perspective of generative learning by leveraging generative adversarial networks (GAN). Our generative positive-unlabeled (GenPU) framework incorporates an array of discriminators and generators that are endowed with different roles in simultaneously producing realistic positive and negative samples. We also provide theoretical analysis to justify that, at equilibrium, GenPU is capable of recovering both positive and negative data distributions. Moreover, we show that GenPU is generalizable and closely related to semi-supervised classification. Given rather limited P data, experiments on both synthetic and real-world datasets demonstrate the effectiveness of our proposed framework. With an unlimited supply of realistic and diverse samples generated by GenPU, a very flexible classifier can then be trained using deep neural networks.
1 Introduction

Positive-unlabeled (PU) classification [Denis et al., 2005] has gained great popularity in dealing with limited partially labeled data and has succeeded in a broad range of applications such as automatic label identification. PU can also be used for the detection of outliers in an unlabeled dataset with knowledge only from a collection of inlier data [Hido et al., 2008]. PU is likewise useful in one-vs-rest classification tasks such as land-cover classification (urban vs non-urban), where non-urban data are too diverse to be labeled compared with urban data [Li et al., 2011].

∗ The corresponding author.

The most commonly used PU approaches for binary classification can typically be categorized, in terms of the way they handle U data, into two types [Kiryo et al., 2017]. One type, such as [Liu et al., 2002; Li and Liu, 2003], attempts to recognize negative samples in the U data and then feeds them to classical positive-negative (PN) models. However, these approaches depend heavily on heuristic strategies and often yield a poor solution. The other type, including [Liu et al., 2003; Lee and Liu, 2003], offers a better solution by treating U data as N data with a decayed weight. Nevertheless, finding an optimal weight turns out to be quite costly. Most importantly, the classifiers trained with the above approaches suffer from a systematic estimation bias [Du Plessis et al., 2015; Kiryo et al., 2017].

Seeking an unbiased PU classifier, [Du Plessis et al., 2014] investigated the strategy of viewing U data as a weighted mixture of P and N data [Elkan and Noto, 2008], and introduced an unbiased risk estimator by exploiting non-convex symmetric losses, e.g., the ramp loss. Although it cancels the bias, the non-convex loss is undesirable for PU due to the difficulty of non-convex optimization.
To this end, [Du Plessis et al., 2015] proposed a more general risk estimator which is always unbiased and convex if the convex loss satisfies a linear-odd condition [Patrini et al., 2016]. Theoretically, these authors argue that the estimator yields a globally optimal solution, with more appealing learning properties than the non-convex counterpart. More recently, [Kiryo et al., 2017] observed that the aforementioned unbiased risk estimators can go negative, being unbounded from below, which leads to serious overfitting when the classifier becomes too flexible. To fix this, they presented a non-negative, biased risk estimator that nevertheless has favorable theoretical guarantees in terms of consistency, mean-squared-error reduction and estimation error. The proposed estimator is shown to be more robust against overfitting than previous unbiased ones. However, given limited P data, the overfitting issue still exists, especially when a very flexible deep neural network is applied.

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Generative models, on the other hand, have the advantage of expressing complex data distributions. Apart from density estimation, generative models are often applied to learn a function that is able to create more samples from the approximate distribution. Lately, a large body of successful deep generative models has emerged, especially generative adversarial networks (GAN) [Goodfellow et al., 2014; Salimans et al., 2016]. GAN solves the task of generative modeling by making two agents play a game against each other. One agent, named the generator, synthesizes fake data from random noise; the other agent, termed the discriminator, examines both real and fake data and determines whether each sample is real or not. Both agents keep evolving over time and get better and better at their jobs.
Eventually, the generator is forced to create synthetic data that are as realistic as possible compared with those from the training dataset.

Inspired by the tremendous success and expressive power of GAN, we tackle the binary PU classification task by resorting to generative modeling, and propose our generative positive-unlabeled (GenPU) learning framework. Building upon GAN, our GenPU model includes an array of generators and discriminators as agents in the game. These agents are devised to play different parts in simultaneously generating realistic positive and negative samples, after which a standard PN classifier can be trained on those synthetic samples. Given a small portion of labeled P data as seeds, GenPU is able to capture the underlying P and N data distributions, with the capability to create an unlimited number of diverse P and N samples. In this way, the overfitting problem of conventional PU can be greatly mitigated. Furthermore, our GenPU is generalizable in the sense that it can be established by switching to different underlying GAN variants with distance measurements (e.g., Wasserstein GAN [Arjovsky et al., 2017]) other than the Jensen-Shannon divergence (JSD). As long as those variants are sophisticated enough to produce high-quality diverse samples, the optimal accuracy could be achieved by training very deep neural networks.

Our main contributions are: (i) we are the first (to our knowledge) to propose a new paradigm that effectively solves the PU task through deep generative models; (ii) we provide theoretical analysis to prove that, at equilibrium, our model is capable of learning both positive and negative data distributions; (iii) we experimentally show the effectiveness of the proposed model given limited P data on both synthetic and real-world datasets; (iv) our method can be easily extended to solve semi-supervised classification, and also opens a door to new solutions of many other weakly supervised learning tasks from the perspective of generative learning.
2 Preliminaries

2.1 Positive-Unlabeled (PU) Classification

Given as input a $d$-dimensional random variable $x \in \mathbb{R}^d$ and a scalar random variable $y \in \{\pm 1\}$ as class label, and letting $p(x, y)$ be the joint density, the class-conditional densities are

$$p_p(x) = p(x \mid y = +1), \qquad p_n(x) = p(x \mid y = -1),$$

while $p(x)$ refers to the unlabeled marginal density. The standard PU classification task [Ward et al., 2009] consists of a positive dataset $X_p$ and an unlabeled dataset $X_u$ with i.i.d. samples drawn from $p_p(x)$ and $p(x)$, respectively:

$$X_p = \{x_p^i\}_{i=1}^{n_p} \sim p_p(x), \qquad X_u = \{x_u^i\}_{i=1}^{n_u} \sim p(x).$$

Because the unlabeled data can be regarded as a mixture of both positive and negative samples, the marginal density is

$$p(x) = \pi_p\, p(x \mid y = +1) + \pi_n\, p(x \mid y = -1), \tag{1}$$

where $\pi_p = p(y = +1)$ and $\pi_n = 1 - \pi_p$ denote the class-prior probabilities, which are usually unknown in advance and can be estimated from the given data [Jain et al., 2016]. The objective of the PU task is to train a classifier on $X_p$ and $X_u$ so as to classify a new unseen pattern $x^{\text{new}}$.

In contrast to PU classification, positive-negative (PN) classification assumes all negative samples,

$$X_n = \{x_n^i\}_{i=1}^{n_n} \sim p_n(x),$$

are labeled, so that the classifier can be trained in an ordinary supervised learning fashion.

2.2 Generative Adversarial Networks (GAN)

GAN, originated in [Goodfellow et al., 2014], is one of the most successful recent generative models and is equipped with the power of producing distributional outputs. GAN obtains this capability through an adversarial competition between a generator $G$ and a discriminator $D$ that involves optimizing the following minimax objective function:

$$\min_G \max_D V(G, D) = \min_G \max_D\; \mathbb{E}_{x \sim p_x(x)} \log D(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D(G(z))\big), \tag{2}$$

where $p_x(x)$ represents the true data distribution; $p_z(z)$ is typically a simple prior distribution (e.g., $\mathcal{N}(0, 1)$) for the latent code $z$, while the generator distribution $p_g(x)$ associated with $G$ is induced by the transformation $G(z): z \to x$.
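To make the GAN objective (2) concrete, the following minimal sketch evaluates it on a finite support, where expectations become sums. It relies on the well-known result of [Goodfellow et al., 2014] that the optimal discriminator for a fixed generator is $D^\star(x) = p_x(x) / (p_x(x) + p_g(x))$, at which the value equals $2\,\mathrm{JSD}(p_x \,\|\, p_g) - \log 4$. The toy distributions and all function names are illustrative assumptions, not part of the paper.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def optimal_discriminator(p_x, p_g):
    """D*(x) = p_x(x) / (p_x(x) + p_g(x)); 0.5 is an arbitrary choice where both are zero."""
    return [px / (px + pg) if px + pg > 0 else 0.5 for px, pg in zip(p_x, p_g)]


def gan_value(p_x, p_g, d):
    """V(G, D) from (2), with expectations written as sums over a finite support."""
    real = sum(px * math.log(di) for px, di in zip(p_x, d) if px > 0)
    fake = sum(pg * math.log(1 - di) for pg, di in zip(p_g, d) if pg > 0)
    return real + fake


p_x = [0.1, 0.2, 0.3, 0.4]  # toy "data" distribution (assumed for illustration)
p_g = [0.4, 0.3, 0.2, 0.1]  # toy generator distribution (assumed for illustration)

# Value at the optimal discriminator for a mismatched and a perfect generator.
v_star = gan_value(p_x, p_g, optimal_discriminator(p_x, p_g))
v_equal = gan_value(p_x, p_x, optimal_discriminator(p_x, p_x))
```

Under these assumptions, `v_star` agrees with `2 * jsd(p_x, p_g) - math.log(4)`, and `v_equal` is the smallest achievable value, `-math.log(4)`, reached when the generator distribution matches the data distribution.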
To find the optimal solution, [Goodfellow et al., 2014] employed simultaneous stochastic gradient descent (SGD) for alternately updating $D$ and $G$. The authors argued that, given the optimal $D$, minimizing over $G$ is equivalent to minimizing the distribution distance between $p_x(x)$ and $p_g(x)$. At convergence, GAN has $p_x(x) = p_g(x)$.

3 Generative PU Classification

3.1 Notations

Throughout the paper, $\{p_p(x), p_n(x), p(x)\}$ denote the positive data distribution, the negative data distribution and the entire data distribution, respectively. $\{D_p, D_u, D_n\}$ are referred to as the positive, unlabeled and negative discriminators, while $\{G_p, G_n\}$ stand for the positive and negative generators, which aim to produce realistic positive and negative samples. Correspondingly, $\{p_{g_p}(x), p_{g_n}(x)\}$ describe the positive and negative distributions induced by the generator functions $G_p(z)$ and $G_n(z)$.

3.2 Proposed GenPU Model

We build our GenPU model upon GAN by leveraging its potential for producing realistic data, with the goal of identifying both positive and negative distributions from P and U data. Then, a decision boundary can be obtained by training a standard PN classifier on the generated samples. Figure 1 illustrates the architecture of the proposed framework.

Figure 1: Our GenPU framework (components: positive generator, negative generator, positive discriminator, negative discriminator, unlabeled discriminator). $D_p$ receives as inputs the real positive examples from $X_p$ and the synthetic positive examples from $G_p$; $D_n$ receives as inputs the real positive examples from $X_p$ and the synthetic negative examples from $G_n$; $D_u$ receives as inputs real unlabeled examples from $X_u$, synthetic positive examples from $G_p$ as well as synthetic negative examples from $G_n$ at the same time. Associated with different loss functions, $G_p$ and $G_n$ are designated to generate positive and negative examples, respectively.
In brief, the GenPU framework is analogous to a minimax game comprising two generators $\{G_p, G_n\}$ and three discriminators $\{D_p, D_u, D_n\}$. Guided by the adversarial supervision of $\{D_p, D_u, D_n\}$, $\{G_p, G_n\}$ are tasked with synthesizing positive and negative samples that are indistinguishable from the real ones drawn from $\{p_p(x), p_n(x)\}$, respectively. As their competitive opponents, $\{D_p, D_u, D_n\}$ are devised to play distinct roles in instructing the learning process of $\{G_p, G_n\}$.

More formally, the overall GenPU objective function can be decomposed, in view of $G_p$ and $G_n$, as follows:

$$\Psi(G_p, G_n, D_p, D_u, D_n) = \pi_p\, \Phi_{G_p, D_p, D_u} + \pi_n\, \Phi_{G_n, D_u, D_n}, \tag{3}$$

where $\pi_p$ and $\pi_n$, corresponding to $G_p$ and $G_n$, are the priors for the positive class and the negative class, satisfying $\pi_p + \pi_n = 1$. Here, we assume $\pi_p$ and $\pi_n$ are predetermined and fixed.

The first term linked with $G_p$ in (3) can be further split into two standard GAN components $\mathrm{GAN}_{G_p, D_p}$ and $\mathrm{GAN}_{G_p, D_u}$:

$$\Phi_{G_p, D_p, D_u} = \lambda_p \min_{G_p} \max_{D_p} V_{G_p, D_p}(G, D) + \lambda_u \min_{G_p} \max_{D_u} V_{G_p, D_u}(G, D), \tag{4}$$

where $\lambda_p$ and $\lambda_u$ are the weights balancing the relative importance of the effects of $D_p$ and $D_u$. In particular, the value functions of $\mathrm{GAN}_{G_p, D_p}$ and $\mathrm{GAN}_{G_p, D_u}$ are

$$V_{G_p, D_p}(G, D) = \mathbb{E}_{x \sim p_p(x)} \log D_p(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_p(G_p(z))\big), \tag{5}$$

$$V_{G_p, D_u}(G, D) = \mathbb{E}_{x \sim p_u(x)} \log D_u(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_u(G_p(z))\big), \tag{6}$$

where $p_u(x)$ is the distribution of the unlabeled data, i.e., the marginal $p(x)$.

On the other hand, the second term linked with $G_n$ in (3) can also be split into GAN components, namely $\mathrm{GAN}_{G_n, D_u}$ and $\mathrm{GAN}_{G_n, D_n}$:

$$\Phi_{G_n, D_u, D_n} = \lambda_u \min_{G_n} \max_{D_u} V_{G_n, D_u}(G, D) + \lambda_n \max_{G_n} \max_{D_n} V_{G_n, D_n}(G, D), \tag{7}$$

whose weights $\lambda_u$ and $\lambda_n$ control the trade-off between $D_u$ and $D_n$. $\mathrm{GAN}_{G_n, D_u}$ also takes the form of the standard GAN with the value function

$$V_{G_n, D_u}(G, D) = \mathbb{E}_{x \sim p_u(x)} \log D_u(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_u(G_n(z))\big). \tag{8}$$

The value function of $\mathrm{GAN}_{G_n, D_n}$ is given by

$$V_{G_n, D_n}(G, D) = \mathbb{E}_{x \sim p_p(x)} \log D_n(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_n(G_n(z))\big). \tag{9}$$

In contrast to the zero-sum loss applied elsewhere, the optimization of $\mathrm{GAN}_{G_n, D_n}$ is given by first maximizing (9) to obtain the optimal $D_n^\star$ as

$$D_n^\star = \arg\max_{D_n}\; \mathbb{E}_{x \sim p_p(x)} \log D_n(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_n(G_n(z))\big), \tag{10}$$

then plugging $D_n^\star$ into the value function (9), and finally minimizing $-V_{G_n, D_n^\star}(G, D_n^\star)$ instead of $V_{G_n, D_n^\star}(G, D_n^\star)$ to get the optimal $G_n^\star$ as

$$G_n^\star = \arg\min_{G_n}\; -V_{G_n, D_n^\star}(G, D_n^\star). \tag{11}$$

This modification makes $\mathrm{GAN}_{G_n, D_n}$ different from the standard GAN, and it is reflected by the second term of (7), where $G_n$ maximizes rather than minimizes.

Intuitively, (5)-(6) indicate that $G_p$, co-supervised by both $D_p$ and $D_u$, endeavours to minimize the distance between the induced distribution $p_{g_p}(x)$ and the positive data distribution $p_p(x)$, while striving to stay within the whole data distribution $p(x)$. In fact, $G_p$ tries to deceive both discriminators by simultaneously maximizing $D_p$'s and $D_u$'s outputs on fake positive samples. As a result, the loss terms in (5) and (6) jointly guide $p_{g_p}(x)$ to move gradually towards, and finally settle on, the $p_p(x)$ component of $p(x)$.

Equations (8)-(11) suggest that $G_n$, when facing both $D_u$ and $D_n$, struggles to make the induced $p_{g_n}(x)$ stay away from $p_p(x)$, and also strives to force $p_{g_n}(x)$ to lie within $p(x)$. To achieve this, the objective in (11) encourages $G_n$ to produce negative examples; this in turn helps $D_n$ to maximize the objective in (10), separating positive training samples from fake negative samples, rather than confusing $D_n$. Notice that, in the objective (11), $G_n$ is designed to minimize $D_n$'s output instead of maximizing it when feeding $D_n$ with fake negative samples. Consequently, $D_n$ will send uniformly negative feedback to $G_n$. In this way, the gradient information derived from the negative feedback decreases $p_{g_n}(x)$ where the positive data density $p_p(x)$ is large. In the meantime, the gradient signals from $D_u$ increase $p_{g_n}(x)$ outside the positive region while still restricting $p_{g_n}(x)$ to the true data distribution $p(x)$.
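The roles described above can be sketched as batch-level generator losses. This is a minimal illustration under stated assumptions: discriminator outputs are taken to be probabilities in (0, 1), the dictionary keys and function names are hypothetical, and the sign flip for the $D_n$ term follows the description that $G_n$ minimizes $D_n$'s output on fake negatives. A practical implementation would compute these quantities on network outputs inside a deep learning framework and alternate updates with the discriminators.

```python
import math


def value(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))] from discriminator outputs in (0, 1)."""
    real = sum(math.log(d) for d in d_real) / len(d_real)
    fake = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real + fake


def generator_losses(out, pi_p, lam_p=1.0, lam_u=1.0, lam_n=1.0):
    """Quantities that G_p and G_n would each minimize; `out` holds discriminator outputs."""
    pi_n = 1.0 - pi_p
    v_gp_dp = value(out["Dp_on_Xp"], out["Dp_on_Gp"])  # value function (5)
    v_gp_du = value(out["Du_on_Xu"], out["Du_on_Gp"])  # value function (6)
    v_gn_du = value(out["Du_on_Xu"], out["Du_on_Gn"])  # value function (8)
    v_gn_dn = value(out["Dn_on_Xp"], out["Dn_on_Gn"])  # value function (9)
    loss_gp = pi_p * (lam_p * v_gp_dp + lam_u * v_gp_du)
    # G_n maximizes (9) rather than minimizing it, hence the minus sign.
    loss_gn = pi_n * (lam_u * v_gn_du - lam_n * v_gn_dn)
    return loss_gp, loss_gn


# Hypothetical discriminator outputs on one mini-batch (two samples per entry).
base = {
    "Dp_on_Xp": [0.8, 0.7], "Dp_on_Gp": [0.3, 0.4],
    "Du_on_Xu": [0.6, 0.7], "Du_on_Gp": [0.4, 0.5], "Du_on_Gn": [0.4, 0.5],
    "Dn_on_Xp": [0.8, 0.9], "Dn_on_Gn": [0.6, 0.7],
}
# Same batch, but D_n now scores the fake negatives as clearly non-positive.
pushed_away = dict(base, Dn_on_Gn=[0.1, 0.2])
# Same batch, but D_p is now fooled by the fake positives.
closer = dict(base, Dp_on_Gp=[0.8, 0.9])

loss_gp_base, loss_gn_base = generator_losses(base, pi_p=0.5)
_, loss_gn_away = generator_losses(pushed_away, pi_p=0.5)
loss_gp_closer, _ = generator_losses(closer, pi_p=0.5)
```

In this sketch, `loss_gn_away` is lower than `loss_gn_base`, so $G_n$ is rewarded when its samples are recognized by $D_n$ as unlike the positives, whereas `loss_gp_closer` is lower than `loss_gp_base`, so $G_p$ is rewarded for fooling $D_p$ in the usual GAN manner.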
This crucial effect will eventually push $p_{g_n}(x)$ away from $p_p(x)$ and towards $p_n(x)$.

3.3 Theoretical Analysis

Suppose all of $\{G_p, G_n\}$ and $\{D_p, D_u, D_n\}$ have enough capacity. Then the following results show that, at the Nash equilibrium point of (3), the minimal JSD between the distributions induced by $\{G_p, G_n\}$ and the data distributions $\{p_p(x), p_n(x)\}$ is achieved, respectively, i.e., $p_{g_p}(x) = p_p(x)$ and $p_{g_n}(x) = p_n(x)$. Meanwhile, the JSD between the distribution induced by $G_n$ and the data distribution $p_p(x)$ is maximized, i.e., $p_{g_n}(x)$ almost never overlaps with $p_p(x)$.

Proposition 1. Given fixed generators $G_p$, $G_n$ and known class prior $\pi_p$, the optimal discriminators $D_p$, $D_u$ and $D_n$ for the objective in equation (3) have the following forms:

$$D_p^\star(x) = \frac{p_p(x)}{p_p(x) + p_{g_p}(x)}, \qquad D_u^\star(x) = \frac{p(x)}{p(x) + \pi_p\, p_{g_p}(x) + \pi_n\, p_{g_n}(x)}, \qquad D_n^\star(x) = \frac{p_p(x)}{p_p(x) + p_{g_n}(x)}.$$

Proof. Assume that all the discriminators $D_p$, $D_u$ and $D_n$ can be optimized in functional space. Differentiating the objective $V(G, D)$ in (3) w.r.t. $D_p$, $D_u$ and $D_n$ and equating the functional derivatives to zero, we obtain the optimal $D_p^\star$, $D_u^\star$ and $D_n^\star$ as described above.

Theorem 2. Suppose the data distribution $p(x)$ in the standard PU learning setting takes the form $p(x) = \pi_p\, p_p(x) + \pi_n\, p_n(x)$, where $p_p(x)$ and $p_n(x)$ are well-separated. Given the optimal $D_p^\star$, $D_u^\star$ and $D_n^\star$, the minimax optimization problem with the objective function in (3) obtains its optimal solution if

$$p_{g_p}(x) = p_p(x) \quad \text{and} \quad p_{g_n}(x) = p_n(x), \tag{12}$$

with the objective value of $-(\pi_p \lambda_p + \lambda_u) \log 4$.

Proof.
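Substituting the optimal $D_p^\star$, $D_u^\star$ and $D_n^\star$ into (3), the objective can be rewritten as follows:

$$
\begin{aligned}
V(G, D^\star) ={}& \pi_p \Big\{ \lambda_p \Big[ \mathbb{E}_{x \sim p_p(x)} \log \frac{p_p(x)}{p_p(x) + p_{g_p}(x)} + \mathbb{E}_{x \sim p_{g_p}(x)} \log \frac{p_{g_p}(x)}{p_p(x) + p_{g_p}(x)} \Big] \\
&\quad + \lambda_u \Big[ \mathbb{E}_{x \sim p_u(x)} \log \frac{p(x)}{p(x) + \pi_p p_{g_p}(x) + \pi_n p_{g_n}(x)} + \mathbb{E}_{x \sim p_{g_p}(x)} \log \frac{\pi_p p_{g_p}(x) + \pi_n p_{g_n}(x)}{p(x) + \pi_p p_{g_p}(x) + \pi_n p_{g_n}(x)} \Big] \Big\} \\
&+ \pi_n \Big\{ \lambda_u \Big[ \mathbb{E}_{x \sim p_u(x)} \log \frac{p(x)}{p(x) + \pi_p p_{g_p}(x) + \pi_n p_{g_n}(x)} + \mathbb{E}_{x \sim p_{g_n}(x)} \log \frac{\pi_p p_{g_p}(x) + \pi_n p_{g_n}(x)}{p(x) + \pi_p p_{g_p}(x) + \pi_n p_{g_n}(x)} \Big] \\
&\quad - \lambda_n \Big[ \mathbb{E}_{x \sim p_p(x)} \log \frac{p_p(x)}{p_p(x) + p_{g_n}(x)} + \mathbb{E}_{x \sim p_{g_n}(x)} \log \frac{p_{g_n}(x)}{p_p(x) + p_{g_n}(x)} \Big] \Big\}.
\end{aligned}
\tag{13}
$$

Combining the intermediate terms associated with $\lambda_u$ using the fact $\pi_p + \pi_n = 1$, we reorganize (13) and arrive at

$$
\begin{aligned}
G^\star = \arg\min_G V(G, D^\star) = \arg\min_G\; & \pi_p \lambda_p \big[ 2\,\mathrm{JSD}(p_p \,\|\, p_{g_p}) - \log 4 \big] \\
&+ \lambda_u \big[ 2\,\mathrm{JSD}(p \,\|\, \pi_p p_{g_p} + \pi_n p_{g_n}) - \log 4 \big] \\
&- \pi_n \lambda_n \big[ 2\,\mathrm{JSD}(p_p \,\|\, p_{g_n}) - \log 4 \big],
\end{aligned}
\tag{14}
$$

which attains its minimum if

$$p_{g_p}(x) = p_p(x), \tag{15}$$

$$\pi_p\, p_{g_p}(x) + \pi_n\, p_{g_n}(x) = p(x) \tag{16}$$

and, for almost every $x$ except for those in a zero-measure set,

$$p_p(x) > 0 \Rightarrow p_{g_n}(x) = 0, \qquad p_{g_n}(x) > 0 \Rightarrow p_p(x) = 0. \tag{17}$$

The solution $G^\star = \{G_p^\star, G_n^\star\}$ must jointly satisfy the conditions described in (15)-(17), which implies (12) and leads to the minimum objective value of $-(\pi_p \lambda_p + \lambda_u) \log 4$: the first two brackets in (14) each contribute $-\log 4$ when their JSD vanishes, and the last bracket vanishes because the JSD between non-overlapping distributions equals $\log 2$.

The theorem reveals that approaching the Nash equilibrium is equivalent to jointly minimizing $\mathrm{JSD}(p \,\|\, \pi_p p_{g_p} + \pi_n p_{g_n})$ and $\mathrm{JSD}(p_p \,\|\, p_{g_p})$ while maximizing $\mathrm{JSD}(p_p \,\|\, p_{g_n})$ at the same time, thus exactly capturing $p_p$ and $p_n$.

3.4 Connection to Semi-Supervised Classification

The goal of semi-supervised classification is to learn a classifier from positive, negative and unlabeled data. In this context, besides the training sets $X_p$ and $X_u$, a partially labeled negative set $X_n$ is also available, with samples drawn from the negative data distribution $p_n(x)$.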
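As a numerical sanity check of Theorem 2 in Section 3.3, the sketch below evaluates the JSD form of the objective (14) on a small discrete support. It assumes that the $\lambda_n$ term enters (14) with a negative sign (so that minimizing the objective maximizes the JSD between $p_p$ and $p_{g_n}$), and it uses toy distributions, weights and function names chosen purely for illustration.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def objective(p_p, p_n, pi_p, pg_p, pg_n, lam_p=1.0, lam_u=1.0, lam_n=1.0):
    """JSD form of the generator objective, under the sign convention stated above."""
    pi_n = 1.0 - pi_p
    p = [pi_p * a + pi_n * b for a, b in zip(p_p, p_n)]      # true marginal p(x)
    mix = [pi_p * a + pi_n * b for a, b in zip(pg_p, pg_n)]  # generated mixture
    log4 = math.log(4)
    return (pi_p * lam_p * (2 * jsd(p_p, pg_p) - log4)
            + lam_u * (2 * jsd(p, mix) - log4)
            - pi_n * lam_n * (2 * jsd(p_p, pg_n) - log4))


# Well-separated toy class-conditionals on four points (assumed for illustration).
p_p = [0.5, 0.5, 0.0, 0.0]
p_n = [0.0, 0.0, 0.25, 0.75]
pi_p, lam_p, lam_u, lam_n = 0.4, 2.0, 1.0, 0.5

# Generators that recover both class-conditionals exactly.
at_optimum = objective(p_p, p_n, pi_p, p_p, p_n, lam_p, lam_u, lam_n)
claimed = -(pi_p * lam_p + lam_u) * math.log(4)

# A negative generator that leaks mass onto the positive region.
overlapping = objective(p_p, p_n, pi_p, p_p, [0.25] * 4, lam_p, lam_u, lam_n)
```

With these toy values, `at_optimum` matches `claimed` up to floating-point error, while `overlapping` is strictly larger. This checks one alternative only and is not a proof of global optimality.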
In fact, the very same architecture of GenPU can be applied to the semi-supervised classification task by just adapting the standard GAN value function to $G_n$; the total value function then becomes

$$
\begin{aligned}
V(G, D) ={}& \pi_p \Big\{ \lambda_p \big[ \mathbb{E}_{x \sim p_p(x)} \log D_p(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_p(G_p(z))\big) \big] \\
&\quad + \lambda_u \big[ \mathbb{E}_{x \sim p_u(x)} \log D_u(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_u(G_p(z))\big) \big] \Big\} \\
&+ \pi_n \Big\{ \lambda_u \big[ \mathbb{E}_{x \sim p_u(x)} \log D_u(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_u(G_n(z))\big) \big] \\
&\quad + \lambda_n \big[ \mathbb{E}_{x \sim p_n(x)} \log D_n(x) + \mathbb{E}_{z \sim p_z(z)} \log\big(1 - D_n(G_n(z))\big) \big] \Big\}.
\end{aligned}
$$

With the above formulation, $D_n$ discriminates the negative training samples from the synthetic negative samples produced by $G_n$. Now $G_n$ intends to fool $D_n$ and $D_u$ simultaneously by outputting realistic examples, just as $G_p$ does for $D_p$ and $D_u$. Being attracted by both $p(x)$ and $p_n(x)$, the induced distribution $p_{g_n}(x)$ gradually approaches $p_n(x)$ and finally recovers the true negative component $p_n(x)$ of $p(x)$. Theoretically, it is not hard to show that the optimal $G_p$ and $G_n$, at convergence, give rise to $p_{g_p}(x) = p_p(x)$ and $p_{g_n}(x) = p_n(x)$.

4 Experimental Results

We show the efficacy of our framework by conducting experiments on synthetic and real-world image datasets. For real data, the approaches including oracle PN, unbiased PU (UPU) [Du Plessis et al., 2015] and non-negative PU (NNPU) [Kiryo et al., 2017]¹ are selected for comparison. Specifically, Oracle PN means that training labels are available for all the P and N data; its performance is used only as a reference for the other approaches. For all the methods, the true class prior $\pi_p$ is assumed to be known in advance. Regarding the weights of GenPU, for simplicity, we set $\lambda_u = 1$ and freely tune $\lambda_p$ and $\lambda_n$ on the validation set via grid search over a range like [..., 0.01, 0.02, ..., 0.1, 0.2, ..., 1, 2, ...].

Figure 2: Evolution of the positive samples (in green) and negative samples (in blue) produced by GenPU. The true positive samples (in orange) and true negative samples (in red) are also illustrated.

4.1 Synthetic Simulation

We begin our test with a toy example to visualize the learning behaviors of our GenPU. The training samples are synthesized using concentric circles, Gaussian mixtures and half moons functions with Gaussian noise added to the data (standard deviation 0.1414). The training set contains 5000 positive and 5000 negative samples, which are then partitioned into 500 positively labeled and 9500 unlabeled samples. We establish the generators with two fully connected hidden layers and the discriminators with one hidden layer. All hidden layers contain 128 ReLU units. The dimensionality of the input latent code is set to 256. Figure 2 depicts the evolution of positive and negative samples produced by GenPU over time. As expected, in all the scenarios, the induced generator distributions successfully converge to the respective true data distributions given limited P data. Notice that the Gaussian mixture cases demonstrate the capability of our GenPU to learn a distribution with multiple submodes.

4.2 MNIST and USPS Dataset

Next, the evaluation is carried out on the MNIST [LeCun et al., 1998] and USPS [LeCun et al., 1990] datasets. For MNIST, we each time select a pair of digits to construct the P and N sets, each of which consists of 5,000 training points.

¹ The software codes for UPU and NNPU are downloaded from https://github.com/kiryor/nnPUlearning

| Operation | Feature Maps | Nonlinearity |
|---|---|---|
| $G_p(z)$, $G_n(z)$: $z \sim \mathcal{N}(0, I)$ | 100 | |
| fully connected | 256 | leaky relu |
| fully connected | 256 | leaky relu |
| fully connected | 256/784 | tanh |
| $D_p(x)$, $D_n(x)$ | 256/784 | |
| fully connected | 1 | sigmoid |
| $D_u(x)$ | 256/784 | |
| fully connected | 256 | leaky relu |
| fully connected | 256 | leaky relu |
| fully connected | 1 | sigmoid |
| leaky relu slope | 0.2 | |
| mini-batch size for $X_p$, $X_u$ | 50, 100 | |
| learning rate | 0.0003 | |
| optimizer | Adam(0.9, 0.999) | |
| weight, bias initialization | 0, 0 | |

Table 1: Specifications of network architecture and hyperparameters for USPS/MNIST dataset.
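The partition described in Section 4.1 (5000 positive and 5000 negative samples split into 500 labeled positives and 9500 unlabeled samples) can be sketched as follows. This is only an illustrative construction: the function name, the random selection of labeled positives and the tagged tuples standing in for real feature vectors are assumptions, and the tags are kept solely so the split can be checked, since a PU learner would never see them for the unlabeled set.

```python
import random


def make_pu_split(positives, negatives, n_labeled, seed=0):
    """Split fully labeled PN data into a labeled positive set and an unlabeled set."""
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    x_p = pos[:n_labeled]                     # labeled positive set X_p
    x_u = pos[n_labeled:] + list(negatives)   # unlabeled set X_u: leftover P plus all N
    rng.shuffle(x_u)
    frac_pos_in_u = (len(pos) - n_labeled) / len(x_u)  # share of positives hidden in X_u
    return x_p, x_u, frac_pos_in_u


# Tagged placeholders instead of real samples (assumed for illustration).
positives = [("P", i) for i in range(5000)]
negatives = [("N", i) for i in range(5000)]
x_p, x_u, frac = make_pu_split(positives, negatives, n_labeled=500)
```

Here `x_p` holds 500 labeled positives and `x_u` holds the remaining 4500 positives mixed with all 5000 negatives. The same pattern yields the labeled-to-unlabeled ratios used for the digit pairs, for example 100 labeled versus 9900 unlabeled points out of 10,000.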
The specifics for architecture and hyperparameters are described in Table 1. To make the task challenging, the results for the most visually similar digit pairs, such as 3 vs 5 and 8 vs 3, are recorded in Table 2. The best accuracies are shown with the number of labeled positive examples $N_l$ ranging from 100 to 1. Our method outperforms UPU in all the cases. We also observe that GenPU achieves better than or comparable accuracy to NNPU when the number of labeled samples is relatively large (i.e., $N_l = 100$). However, when the labeled samples are insufficient, for instance $N_l = 5$ in the 3 vs 5 scenario, the accuracy of GenPU decreases only slightly from 0.983 to 0.979, whereas that of NNPU drops significantly from 0.969 to 0.843. Remarkably, GenPU still remains highly accurate even if only one labeled sample is provided, whereas NNPU fails in this situation.

Figure 3 reports the training and test errors of the classifiers for distinct settings of $N_l$. When $N_l$ is 100, UPU suffers from serious overfitting to the training data, whilst both NNPU and GenPU perform fairly well. As $N_l$ becomes small (i.e., 5), NNPU also starts to overfit. It should be mentioned that the negative training curve of UPU (in blue) arises because the unbiased risk estimators [Du Plessis et al., 2014; 2015] contain a negative loss term which is unbounded from below. When the classifier becomes very flexible, the risk can be arbitrarily negative [Kiryo et al., 2017]. Additionally, the rather limited $N_l$ causes the training processes of both UPU and NNPU to behave unstably. In contrast, GenPU avoids overfitting to the small training P data by restricting the models of $D_p$ and $D_n$ from being too complex when $N_l$ becomes small (see Table 1). For visualization, Figure 4 shows the digits generated with only one labeled 3, together with the projected distributions induced by $G_p$ and $G_n$. In Table 3, similar results can be obtained on USPS data.

Figure 3: Training error and test error of deep PN classifiers on MNIST for the pair 3 vs 5 with distinct $N_l$. Top: (a) and (b) for $N_l = 100$. Bottom: (c) and (d) for $N_l = 5$. (Panels plot error against epoch, from 0 to 200, for Oracle, UPU, NNPU and GenPU.)

| MNIST $N_l : N_u$ | 3 vs 5: Oracle PN | 3 vs 5: UPU | 3 vs 5: NNPU | 3 vs 5: GenPU | 8 vs 3: Oracle PN | 8 vs 3: UPU | 8 vs 3: NNPU | 8 vs 3: GenPU |
|---|---|---|---|---|---|---|---|---|
| 100 : 9900 | 0.993 | 0.914 | 0.969 | 0.983 | 0.994 | 0.932 | 0.974 | 0.982 |
| 50 : 9950 | 0.993 | 0.854 | 0.966 | 0.982 | 0.994 | 0.873 | 0.965 | 0.979 |
| 10 : 9990 | 0.993 | 0.711 | 0.866 | 0.980 | 0.994 | 0.733 | 0.907 | 0.978 |
| 5 : 9995 | 0.993 | 0.660 | 0.843 | 0.979 | 0.994 | 0.684 | 0.840 | 0.976 |
| 1 : 9999 | 0.993 | 0.557 | 0.563 | 0.976 | 0.994 | 0.550 | 0.573 | 0.972 |

Table 2: The accuracy comparison on MNIST for $N_l \in \{100, 50, 10, 5, 1\}$.

| USPS $N_l : N_u$ | 3 vs 5: UPU | 3 vs 5: NNPU | 3 vs 5: GenPU | 8 vs 3: UPU | 8 vs 3: NNPU | 8 vs 3: GenPU |
|---|---|---|---|---|---|---|
| 50 : 1950 | 0.890 | 0.965 | 0.965 | 0.900 | 0.965 | 0.945 |
| 10 : 1990 | 0.735 | 0.880 | 0.955 | 0.725 | 0.920 | 0.935 |
| 5 : 1995 | 0.670 | 0.830 | 0.950 | 0.630 | 0.865 | 0.925 |
| 1 : 1999 | 0.540 | 0.610 | 0.940 | 0.555 | 0.635 | 0.920 |

Table 3: The accuracy comparison on USPS for $N_l \in \{50, 10, 5, 1\}$.

Figure 4: Top: visualization of positive (left) and negative (right) digits generated using one positive 3 label. Bottom: projected distributions of 3 vs 5, with ground truth (left) and generated (right).

4.3 Celeb-A Dataset

We are also interested in how GenPU performs on a real-life image set. Here, the data is taken from the CelebA dataset [Liu et al., 2015] and resized to 64 × 64. We aim at classifying female from male faces using partially labeled male face images.
To this end, the first 20,000 male and 20,000 female faces in CelebA are chosen as the training set and the last 1,000 faces are used as the test set. Then, 2,000 out of the 20,000 male faces are randomly selected as positively labeled data. The architectures for generators and discriminators follow the design of the improved WGAN [Gulrajani et al., 2017]. Figure 5 illustrates the generated male and female faces, indicating that GenPU is capable of producing visually appealing and diverse images that belong to the correct categories. A deep PN classifier is then trained on the synthetic images and achieves an accuracy of 87.9%, which is better than the 86.8% of NNPU and the 62.5% of UPU.

Figure 5: Visualization of male (left) and female (right) faces generated by GenPU on CelebA 64 × 64 data.

5 Discussion

One key factor in the success of GenPU is the capability of the underlying GAN to generate diverse, high-quality samples. Only in this way can the ideal performance be achieved by training a flexible classifier on those samples. However, it is widely known that perfect training of the original GAN is quite challenging. GAN suffers from mode collapse and mode oscillation, especially when a high-dimensional data distribution has a large number of output modes. For this reason, a similar issue may affect our GenPU when many output submodes exist, since it has been shown empirically that the original GAN equipped with JSD tends to exhibit mode-seeking behavior as it converges. Fortunately, our framework is very flexible in the sense that it can be established by switching to different underlying GAN variants with more effective distance metrics (e.g., integral probability metrics (IPM)) other than JSD (or f-divergence). By doing so, the possible issue of missing modes can be greatly reduced.
Another possible solution is to extend the single generator $G_p$ ($G_n$) to multiple generators $\{G_p^i\}_{i=1}^{I}$ ($\{G_n^j\}_{j=1}^{J}$) for the positive (negative) class, while utilizing a parameter sharing scheme to leverage common information and reduce the computational load.

Regarding other application domains, one direction of future work is to use GenPU for text classification. However, applying GAN to generate sequential, semantically meaningful text is challenging, since GAN has difficulty in directly generating sequences of discrete tokens [Goodfellow et al., 2016]. More sophisticated underlying GANs need to be developed for this purpose.

Acknowledgments

This work was partially supported by JSPS KAKENHI (Grant No. 17K00326), NSFC China (Grant No. 61773129) and JST CREST (Grant No. JPMJCR1784).

References

[Arjovsky et al., 2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[Denis et al., 2005] François Denis, Rémi Gilleron, and Fabien Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, 348(1):70–83, 2005.

[Du Plessis et al., 2014] Marthinus C Du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In Advances in Neural Information Processing Systems, pages 703–711, 2014.

[Du Plessis et al., 2015] Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In International Conference on Machine Learning, pages 1386–1394, 2015.

[Elkan and Noto, 2008] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–220. ACM, 2008.
[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, Cambridge, 2016.

[Gulrajani et al., 2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[Hido et al., 2008] Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Inlier-based outlier detection via direct density ratio estimation. In International Conference on Data Mining, pages 223–232. IEEE, 2008.

[Jain et al., 2016] Shantanu Jain, Martha White, and Predrag Radivojac. Estimating the class prior and posterior from noisy positives and unlabeled data. In Advances in Neural Information Processing Systems, pages 2693–2701, 2016.

[Kiryo et al., 2017] Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, pages 1674–1684, 2017.

[LeCun et al., 1990] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.

[LeCun et al., 1998] Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[Lee and Liu, 2003] Wee Sun Lee and Bing Liu. Learning with positive and unlabeled examples using weighted logistic regression. In International Conference on Machine Learning, pages 448–455, 2003.
[Li and Liu, 2003] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In International Joint Conference on Artificial Intelligence, pages 587–592, 2003.

[Li et al., 2011] Wenkai Li, Qinghua Guo, and Charles Elkan. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing, 49(2):717–725, 2011.

[Liu et al., 2002] Bing Liu, Wee Sun Lee, Philip S Yu, and Xiaoli Li. Partially supervised classification of text documents. In International Conference on Machine Learning, pages 387–394, 2002.

[Liu et al., 2003] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S Yu. Building text classifiers using positive and unlabeled examples. In International Conference on Data Mining, pages 179–186. IEEE, 2003.

[Liu et al., 2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision, pages 3730–3738, 2015.

[Patrini et al., 2016] Giorgio Patrini, Frank Nielsen, Richard Nock, and Marcello Carioni. Loss factorization, weakly supervised learning and label noise robustness. In International Conference on Machine Learning, pages 708–717, 2016.

[Salimans et al., 2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[Ward et al., 2009] Gill Ward, Trevor Hastie, Simon Barry, Jane Elith, and John R Leathwick. Presence-only data and the EM algorithm. Biometrics, 65(2):554–563, 2009.