Published as a conference paper at ICLR 2020

DIFFERENCE-SEEKING GENERATIVE ADVERSARIAL NETWORK: UNSEEN SAMPLE GENERATION

Yi-Lin Sung
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan, ROC
Institute of Information Science, Academia Sinica
r06942076@ntu.edu.tw

Sung-Hsien Hsieh
Institute of Information Science and Research Center for Information Technology Innovation, Academia Sinica, Taiwan, ROC
parvaty316@hotmail.com

Soo-Chang Pei
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan, ROC
peisc@ntu.edu.tw

Chun-Shien Lu
Institute of Information Science and Research Center for Information Technology Innovation, Academia Sinica, Taiwan, ROC
lcs@iis.sinica.edu.tw

ABSTRACT

Unseen data, which are not samples from the distribution of the training data and are difficult to collect, have exhibited importance in numerous applications (e.g., novelty detection, semi-supervised learning, and adversarial training). In this paper, we introduce a general framework, called the difference-seeking generative adversarial network (DSGAN), to generate various types of unseen data. Its novelty is to consider the probability density of the unseen data distribution as the difference between the densities of two distributions, p̄_d and p_d, whose samples are relatively easy to collect. The DSGAN can learn the target distribution, p_t (the unseen data distribution), from only the samples of the two distributions p̄_d and p_d. In our scenario, p_d is the distribution of the seen data, and p̄_d can be obtained from p_d via simple operations, so that only samples of p_d are needed during training. Two key applications, semi-supervised learning and novelty detection, are taken as case studies to illustrate that the DSGAN enables the production of various unseen data. We also provide a theoretical analysis of the convergence of the DSGAN.
1 INTRODUCTION

Unseen data¹ are not samples from the distribution of the training data and are difficult to collect. It has been demonstrated that unseen samples can be applied in several applications. Dai et al. (2017) proposed how to create complement data and theoretically showed that complement data, considered as unseen data, could improve semi-supervised learning. In novelty detection, Yu et al. (2017) proposed a method to generate unseen data and used them to train an anomaly detector. Another related area is adversarial training (Goodfellow et al., 2015), where classifiers are trained to resist adversarial examples, which are unseen during the training phase. However, the aforementioned methods focus only on producing specific types of unseen data, instead of enabling the generation of general types of unseen data.

In this paper, we propose a general framework, called the difference-seeking generative adversarial network (DSGAN), to generate a variety of unseen data. The DSGAN is a generative approach. Traditionally, generative approaches, which are usually conducted in an unsupervised learning manner, are developed to learn the data distribution from its samples and subsequently produce novel, high-dimensional samples, such as synthesized images (Saito et al., 2018). A state-of-the-art approach is the so-called generative adversarial network (GAN) (Goodfellow et al., 2014). GAN produces sharp images based on a game-theoretic framework, but it can be difficult and unstable to train owing to its multiple interacting losses. Specifically, a GAN consists of two functions, a generator and a discriminator, both represented as parameterized neural networks. The discriminator network is trained to determine whether its inputs come from the real dataset or from the fake dataset created by the generator. The generator learns to map samples from a latent space to the data space so as to increase the classification error of the discriminator.
¹ In traditional machine learning scenarios, "unseen" data correspond to data that are not used or seen during the training stage but appear in the testing stage. The distribution of "unseen" data could be the same as or different from that of the "seen" data, depending on the application. In this paper, we focus on the scenario in which the two distributions are different.

Nevertheless, to make a generator create unseen data, a traditional GAN would require numerous training samples of the unseen classes, which contradicts the very definition of unseen data. This fact motivates us to present the DSGAN, which can generate unseen data while adopting only seen data as training samples (see Fig. 9 in Appendix A, which illustrates the difference between GAN and the DSGAN). The key concept is to consider the distribution of the unseen data as the difference between two distributions that are relatively easy to obtain. For example, the out-of-distribution examples for the MNIST dataset are, from another perspective, the difference between the universal set and the set of MNIST examples. It should be noted that in a traditional GAN, the target distribution is identical to the training data distribution; in the DSGAN, these two distributions are deliberately different.

This paper makes the following contributions: (1) We propose the DSGAN, which can generate any unseen data as long as the density of the target (unseen data) distribution is the difference between those of two distributions, p̄_d and p_d. (2) We show that the DSGAN possesses the flexibility to learn different target (unseen data) distributions in two key applications, semi-supervised learning and novelty detection. Specifically, for novelty detection, the DSGAN can produce boundary points around the seen data, because this type of unseen data is easily misclassified. For semi-supervised learning, the unseen data are linear combinations of labeled and unlabeled data, excluding the labeled and unlabeled data themselves².

² A linear combination of labeled and unlabeled data may itself fall into the set of seen data (the labeled and unlabeled data), which would contradict the definition of unseen data. Thus, the samples generated by the DSGAN should not include the seen data themselves.
(3) The DSGAN yields results comparable to state-of-the-art semi-supervised learning methods, but with a shorter training time and lower memory consumption. In novelty detection, combining the DSGAN with a variational auto-encoder (VAE, Kingma & Welling (2014b)) achieves state-of-the-art results.

2 PROPOSED METHOD: DSGAN

2.1 FORMULATION

We denote the generator distribution as p_g and the training data distribution as p_d, both in an N-dimensional space. Let p̄_d be a distribution decided by the user; for example, p̄_d can be the convolution of p_d with a normal distribution. Let p_t be the target distribution that the user is interested in. It can be expressed via

(1 − α) p_t(x) + α p_d(x) = p̄_d(x),   (1)

where α ∈ [0, 1]. Our method, the DSGAN, aims to learn p_g such that p_g = p_t. Note that if the support set of p_d belongs to that of p̄_d, then there exists at least one α such that the equality in (1) holds. Even when the equality cannot hold exactly, the DSGAN intuitively attempts to learn p_g such that

p_g(x) ≈ (p̄_d(x) − α p_d(x)) / (1 − α),

subject to the constraint p_g(x) ≥ 0. Specifically, the generator will output samples located in the high-density areas of p̄_d − α p_d. Furthermore, Theorem 1 shows that the DSGAN can learn a p_g whose support set is the difference between those of p̄_d and p_d.

First, we formulate the generator and discriminator as in GANs. The inputs z of the generator are drawn from p_z(z) in an M-dimensional space. The generator function G(z; θ_g): R^M → R^N represents a mapping to the data space, where G is a differentiable function with parameters θ_g. The discriminator is defined as D(x; θ_d): R^N → [0, 1], which outputs a single scalar; D(x) can be interpreted as the probability that x belongs to the class of real data. As in a traditional GAN, we train D to distinguish the real data from the fake data sampled from G, while G is trained to produce realistic data that can mislead D. However, in the DSGAN, the definitions of real data and fake data differ from those in a traditional GAN.
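To make Eq. (1) concrete, the following one-dimensional sketch checks numerically that whenever α p_d(x) ≤ p̄_d(x) everywhere, the implied target density p_t = (p̄_d − α p_d)/(1 − α) is non-negative and integrates to 1. The Gaussian mixture chosen for p̄_d is our illustrative example, not from the paper:

```python
import numpy as np

# 1-D sanity check of Eq. (1): with alpha * p_d <= pbar_d everywhere,
# p_t = (pbar_d - alpha * p_d) / (1 - alpha) is a valid density.
alpha = 0.5
x = np.linspace(-6.0, 9.0, 3001)
dx = x[1] - x[0]

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p_d = gauss(x, 0.0, 1.0)                        # seen-data density
pbar_d = 0.5 * p_d + 0.5 * gauss(x, 3.0, 1.0)   # user-designed pbar_d
p_t = (pbar_d - alpha * p_d) / (1 - alpha)      # implied unseen density

assert p_t.min() >= 0.0                         # support condition holds
assert abs(np.sum(p_t) * dx - 1.0) < 1e-3      # integrates to 1
print("p_t integrates to about", round(float(np.sum(p_t) * dx), 3))
```

Here the difference density recovers the second Gaussian mode, i.e., exactly the part of p̄_d that lies outside the seen data.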
The samples from p̄_d are considered real, whereas those from the mixture of p_d and p_g are considered fake. The objective function is defined as follows:

V(G, D) := E_{x∼p̄_d(x)}[log D(x)] + (1 − α) E_{z∼p_z(z)}[log(1 − D(G(z)))] + α E_{x∼p_d(x)}[log(1 − D(x))].   (2)

We optimize (2) via a min-max game between G and D, i.e., min_G max_D V(G, D). During training, an iterative approach, as in a traditional GAN, alternates between k steps of training D and one step of training G. In practice, minibatch stochastic gradient descent via backpropagation is used to update θ_d and θ_g. Thus, for each of p_g, p_d, and p̄_d, m samples are required to compute the gradients, where m is the minibatch size. The training procedure is given in Algorithm 1 in Appendix A.

The DSGAN suffers from the same drawbacks as a traditional GAN (e.g., mode collapse, overfitting, and an overly strong discriminator that makes the generator gradient vanish). A body of literature (Salimans et al., 2016; Arjovsky & Bottou, 2017; Miyato et al., 2018) focuses on these problems, and such techniques can be readily combined with the DSGAN. Li et al. (2017) and Reed et al. (2016) proposed objective functions similar to (2); however, their goal was to learn the conditional distribution of the training data, whereas we aim to learn the target distribution p_t in Eq. (1), not the training data distribution.
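A minibatch estimate of the objective (2) can be sketched as follows. This is a hedged NumPy sketch with placeholder discriminator outputs; the function and variable names (`discriminator_loss`, `d_bar`, etc.) are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.8

def discriminator_loss(d_bar, d_fake, d_seen, alpha):
    """Minibatch estimate of -V(G, D) from Eq. (2).

    d_bar  : D(x) on samples from pbar_d (treated as real)
    d_fake : D(G(z)) on generated samples (fake)
    d_seen : D(x) on samples from p_d (also treated as fake)
    """
    eps = 1e-8  # numerical guard for log
    return -(np.mean(np.log(d_bar + eps))
             + (1 - alpha) * np.mean(np.log(1 - d_fake + eps))
             + alpha * np.mean(np.log(1 - d_seen + eps)))

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    eps = 1e-8
    return -np.mean(np.log(d_fake + eps))

# Stand-in sigmoid discriminator outputs in (0, 1); in practice these
# come from a neural network evaluated on the three minibatches.
d_bar, d_fake, d_seen = rng.uniform(0.1, 0.9, (3, 64))
print(discriminator_loss(d_bar, d_fake, d_seen, alpha) > 0)  # True
print(generator_loss(d_fake) > 0)  # True
```

The only difference from a standard GAN loss is that the fake term is itself a mixture: generated samples are weighted by (1 − α) and seen-data samples by α.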
2.2 CASE STUDY ON VARIOUS UNSEEN DATA GENERATION

To achieve a more intuitive understanding of the DSGAN, we conduct several case studies on two-dimensional (2D) synthetic datasets and MNIST. In Eq. (1), α = 0.8 is used.

Figure 1: Complement points (in green) between two circles (in orange).
Figure 2: Boundary points (in green) between four circles (in orange).
Figure 3: Illustration of the generation of unseen data on the boundary around the training data. First, the convolution of p_d with a normal distribution ensures that the density on the boundary is no longer zero. Second, we seek p_g such that Eq. (1) holds, where the support set of p_g approximates the difference between those of p̄_d and p_d.
Figure 4: Illustration of difference-set seeking in MNIST.
Figure 5: The DSGAN learns the difference between two sets.

Complement sample generation. Fig. 1 illustrates that the DSGAN can generate complement samples between two circles. Denoting the density function of the two circles as p_d, we assign the samples drawn from p̄_d to be linear combinations of samples from the two circles. Then, by applying the DSGAN, we achieve our goal of generating complement samples. This type of unseen data is used in semi-supervised learning.

Boundary sample generation. Fig. 2 illustrates that the DSGAN generates boundary points between four circles; this type of unseen data is used in novelty detection. In this case, we assign p_d and p̄_d as the density function of the four circles and the convolution of p_d with a normal distribution, respectively. The underlying concept is also illustrated by a one-dimensional (1D) example in Fig. 3.

Difference-set generation. We also validate the DSGAN on a high-dimensional dataset, MNIST. In this example, we define p_d as the distribution of the digit "1" and p̄_d as the distribution containing the two digits "1" and "7".
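The two constructions of p̄_d used in these case studies can be sampled directly from seen data. The sketch below is illustrative; the helper names `pbar_boundary` and `pbar_complement` and the parameter values are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def pbar_boundary(x_seen, sigma=0.1):
    """Samples from pbar_d = p_d convolved with N(0, sigma^2 I):
    jittering seen samples makes the density nonzero on the boundary."""
    return x_seen + sigma * rng.normal(size=x_seen.shape)

def pbar_complement(x_seen):
    """Samples from a pbar_d supported on convex combinations of
    random pairs of seen samples (used for complement generation)."""
    idx = rng.permutation(len(x_seen))
    beta = rng.uniform(0.0, 1.0, size=(len(x_seen), 1))
    return beta * x_seen + (1 - beta) * x_seen[idx]

x = rng.normal(size=(128, 2))        # stand-in for 2-D seen data
print(pbar_boundary(x).shape)        # (128, 2)
print(pbar_complement(x).shape)      # (128, 2)
```

In both cases the DSGAN then carves out the part of p̄_d not covered by p_d, per Eq. (1).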
Because the density p_d(x) is high when x is the digit "1", the generator is prone to output the digit "7" with high probability. More samples generated by the DSGAN on CelebA can be found in Appendix G. From the above results, we observe two properties of the generator distribution p_g: (i) the higher the density p_d(x), the lower the density p_g(x); (ii) p_g prefers to output samples from the high-density areas of p̄_d(x) − α p_d(x).

2.3 DESIGNING p̄_d

Thus far, we have demonstrated how the DSGAN can produce various types of unseen data by choosing a specific p̄_d. In this section, we introduce a standard procedure to design p̄_d and illustrate each step with pictures.

Step 1. Collect the training data, whose distribution is p_d (Fig. 6(a)).
Step 2. Based on the application, define the desired unseen data distribution (e.g., complement samples for semi-supervised learning) (Fig. 6(b)).
Step 3. Define p̄_d as the mixture distribution (1 − α) p_g + α p_d (Fig. 6(c)).
Step 4. Design a suitable mapping function that transforms p_d into p̄_d (e.g., a linear combination of any two samples of p_d).

Figure 6: Illustration of designing p̄_d.

In the above procedure, the most important step is to determine which types of unseen data are suitable for a specific problem (Step 2). In this paper, we show two types of unseen data, which are useful in semi-supervised learning and novelty detection, respectively. Determining all types of unseen data for all applications is beyond the scope of this study, and we leave it for future work. Furthermore, we provide a method (see Appendix B in the supplementary materials) that reformulates the objective function (2) so that training the DSGAN is more stable.

3 THEORETICAL RESULTS

In this section, we show that by choosing an appropriate α, the support set of p_g belongs to the difference between the support sets of p̄_d and p_d, so that the samples from p_g are unseen from the perspective of p_d. We start our proofs from two assumptions.
First, in a non-parametric setting, we assume that both the generator and the discriminator have infinite capacity. Second, p_g is defined as the distribution of the samples drawn from G(z) with z ∼ p_z. In the following, we show that, at the global minimum, the support set of p_g is contained within the difference between the support sets of p̄_d and p_d, so that we can generate the desired p_g by designing an appropriate p̄_d.

Theorem 1. Suppose α p_d(x) ≤ p̄_d(x) for all x ∈ Supp(p_d), and that the density functions p_d(x), p̄_d(x), and p_g(x) are continuous. If the global minimum of C(G) is achieved, then Supp(p_g) ⊆ Supp(p̄_d) \ Supp(p_d), where

C(G) = max_D V(G, D)
     = E_{x∼p̄_d}[ log( p̄_d(x) / (p̄_d(x) + (1 − α) p_g(x) + α p_d(x)) ) ]
       + E_{x∼p_mix}[ log( p_mix(x) / (p̄_d(x) + p_mix(x)) ) ],

with the mixture density p_mix(x) = (1 − α) p_g(x) + α p_d(x).

Proof. See Appendix C for details.

In summary, the generator is prone to output samples located in the high-density areas of p̄_d − α p_d.

4 APPLICATIONS

We apply the DSGAN to two problems: semi-supervised learning and novelty detection. In semi-supervised learning, the DSGAN acts as a "bad" generator, which creates complement samples (unseen data) in the feature space of the training data. In novelty detection, the DSGAN generates samples (unseen data) that are boundary points around the training data.

4.1 SEMI-SUPERVISED LEARNING

Semi-supervised learning (SSL) uses a few labeled data and numerous unlabeled data. Existing SSL methods based on generative models (e.g., the VAE (Kingma et al., 2014) and GAN (Salimans et al., 2016)) yield good empirical results. Dai et al. (2017) theoretically showed that good semi-supervised learning requires a bad GAN with the following objective function:

max_D E_{(x,y)∼L}[ log P_D(y | x, y ≤ K) ] + E_{x∼p_d(x)}[ log P_D(y ≤ K | x) ] + E_{x∼p_g(x)}[ log P_D(K + 1 | x) ],   (3)

where (x, y) denotes a data point and its corresponding label, {1, 2, . . .
, K} denotes the label space for classification, and L = {(x, y)} is the labeled dataset. Moreover, under the semi-supervised setting, p_d in (3) is the distribution of the unlabeled data. Note that the discriminator D in the GAN also plays the role of a classifier. If the generator distribution exactly matches the real data distribution (i.e., p_g = p_d), then the classifier trained with objective (3) and the unlabeled data cannot perform better than one trained by supervised learning with the objective

max_D E_{(x,y)∼L}[ log P_D(y | x, y ≤ K) ].   (4)

In contrast, the generator is preferred to generate complement samples, which lie in the low-density areas of p_d. Under some mild assumptions, these complement samples help D learn the correct decision boundaries in the low-density areas, because the probabilities of the true classes are forced to be low in out-of-distribution areas. The complement samples in Dai et al. (2017) are complex to produce; in Sec. 5.2, we demonstrate that complement samples can be generated easily with the DSGAN.

4.2 NOVELTY DETECTION

Novelty detection determines whether a query example belongs to a seen class. If the samples of the seen class are considered positive data, the difficulty is the absence of negative data in the training phase, so that supervised learning cannot function. Recently, novelty detection has made significant progress with the advent of deep learning. Pidhorskyi et al. (2018) and Sakurada & Yairi (2014) focused on learning a representative latent space for a seen class. At test time, the query image is projected onto the learned latent space, and the difference between the query image and its inverse image (reconstruction) is measured. Thus, only an encoder needs to be trained for the projection and a decoder for the reconstruction. Under this circumstance, an autoencoder (AE) is generally adopted to learn both the encoder and the decoder Pidhorskyi et al.
(2018); Perera et al. (2019). Let Enc(·) be the encoder and Dec(·) be the decoder. The loss function of the AE is defined as

min_{Enc,Dec} E_{x∼p_pos(x)} ‖x − Dec(Enc(x))‖₂²,   (5)

where p_pos is the distribution of the seen class. After training, a query example x_test is classified as the seen class if

‖x_test − Dec(Enc(x_test))‖₂² ≤ τ,   (6)

where τ ∈ R⁺ controls the trade-off between the true positive rate and the false positive rate. However, (6) relies on two assumptions: (1) positive samples from the seen class have a small reconstruction error; (2) the AE (or latent space) cannot describe negative examples from the unseen classes well, leading to a relatively large reconstruction error. In general, the first assumption inherently holds when both the testing and training data originate from the same seen class. However, Pidhorskyi et al. (2018) and Perera et al. (2019) observed that assumption (2) does not always hold, because the loss function in (5) contains no term that enforces a large reconstruction error on negative data. To make assumption (2) hold, given positive data as the training inputs, we propose using the DSGAN to generate negative examples in the latent space, as discussed in Sec. 5.3. The loss function of the AE is then modified to enforce a large reconstruction error on the negative data.

5 EXPERIMENTS

Our experiments are divided into three parts. The first examines how the hyperparameter α influences the learned generator distribution p_g. The second and third parts present empirical results on semi-supervised learning and novelty detection in Sec. 5.2 and Sec. 5.3, respectively. Note that the training procedure of the DSGAN can be improved by extensions of GANs such as WGAN (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), EBGAN (Zhao et al., 2017), and LSGAN (Mao et al., 2017).
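As a toy illustration of the decision rule in (6), consider an AE whose latent space spans only the "seen" direction. The linear encoder/decoder, the data points, and the threshold value below are our illustrative choices, not from the paper:

```python
import numpy as np

def novelty_score(x, enc, dec):
    """Reconstruction error used as the novelty score, as in Eq. (6)."""
    return np.sum((x - dec(enc(x))) ** 2, axis=-1)

# Toy linear AE that can only reconstruct the first coordinate.
enc = lambda x: x[..., :1]
dec = lambda z: np.concatenate([z, np.zeros_like(z)], axis=-1)

x_pos = np.array([[1.0, 0.0]])   # lies in the "seen" subspace
x_neg = np.array([[0.0, 1.0]])   # points in an unseen direction
tau = 0.5

print(novelty_score(x_pos, enc, dec)[0] <= tau)  # True: accepted as seen
print(novelty_score(x_neg, enc, dec)[0] <= tau)  # False: flagged as novel
```

Assumption (2) above fails exactly when an unseen example nevertheless lies close to the learned subspace, which is what the DSGAN-generated negatives are meant to prevent.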
In our method, WGAN-GP was adopted for training stability and to reduce mode collapse.

5.1 DSGAN WITH DIFFERENT α

The impact of different α values on the DSGAN is illustrated in Fig. 7. In this example, the support of p_d is the area bounded by the red dotted line, and the orange points are samples from p_d. We shift p_d to the right by 1 unit to create the distribution p̄_d, whose support is bounded by the blue dotted lines. The overlapping area between p̄_d and p_d is 0.5 units (taking the area of the support of p_d as 1 unit). Based on our theoretical results, α = 0.5 is the smallest value for which p_g can be disjoint from p_d. Accordingly, with α = 0.3, some generated samples in Fig. 7(a) still belong to the support set of p_d. Fig. 7(b) shows perfect agreement between our theoretical and experimental results at α = 0.5. When α = 0.8, there is a remarkable gap between the generated (green) points and the orange points, as shown in Fig. 7(c). In theory, the result at α = 0.8 should be the same as that at α = 0.5, because a discriminator with infinite capacity should assign the same score to the entire region given by the intersection of the complement of the support of p_d with the support of p̄_d. In practice, however, the capacity of the discriminator is limited; when α is large, the score of the area near p_d is lower than that of the area far from it. Therefore, p_g tends to repel p_d to achieve a high score (to deceive the discriminator).

5.2 DSGAN IN SEMI-SUPERVISED LEARNING

We first introduce how the DSGAN generates complement samples in the feature space. Dai et al.
(2017) proved that if the complement samples generated by G satisfy the following two assumptions, namely

∀x ∼ p_g(x): 0 > max_{1≤i≤K} w_i^T f(x)   and   ∀x ∼ p_d(x): 0 < max_{1≤i≤K} w_i^T f(x),   (7)

where f is the feature extractor and w_i is the linear classifier for the i-th class, and

∀x₁ ∈ L, x₂ ∼ p_d(x), ∃ x_g ∼ p_g(x) s.t. f(x_g) = β f(x₁) + (1 − β) f(x₂) for some β ∈ [0, 1],   (8)

then all the unlabeled data are correctly classified under the objective function (3). Specifically, (7) ensures that the classifier can discriminate the generated data from the unlabeled data, and (8) causes the decision boundary to lie in the low-density areas of p_d.

(a) α = 0.30  (b) α = 0.50  (c) α = 0.80
Figure 7: Influence of α on the synthetic dataset. The samples of p_g (green points) move farther away from p_d as α increases; however, they remain bounded by the support of p̄_d. When α is 0.5, the support set of p_g is disjoint from that of p_d, satisfying the theoretical results. When α is 0.8, p_g generates the rightmost points of p̄_d. The level curves of the discriminator show that the generator is more prone to producing samples in regions with higher scores. Note that the outputs of the discriminator are not restricted to [0, 1], because we use the WGAN formulation in this experiment.

The assumption in (8) implies that the complement samples must lie in the space created by linear combinations of the labeled and unlabeled data; in addition, they cannot fall into the real data distribution p_d, owing to assumption (7). To let the DSGAN generate such samples, we let the samples of p̄_d be linear combinations of those from L and p_d. Since p_g(x) ≈ (p̄_d(x) − α p_d(x)) / (1 − α), p_g tends to match p̄_d, whereas the term α p_d ensures that the samples from p_g do not belong to p_d. Thus, p_g satisfies the assumption in (8).
Moreover, (7) is also satisfied by training the classifier with (3), substituting the learned p_g for the generator distribution in (3). Following previous works, we apply the proposed DSGAN to semi-supervised learning on three benchmark datasets: MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky, 2009). The details of the experiments can be found in Appendix D.

5.2.1 SIMULATION RESULTS

The selected hyperparameters are listed in Table 5 in Appendix D.1. The results obtained by the DSGAN and by state-of-the-art methods on the three benchmark datasets are summarized in Table 1. Our method is competitive with the state of the art on all three datasets. Note that we report the results of bad GAN both from the original paper and from our reproduction using the authors' released code; we present both because we could not reproduce parts of the reported results, a problem also observed in the experiments of Li et al. (2019). In comparison with Dai et al. (2017), our method does not rely on an additional density estimation network, PixelCNN++ (Salimans et al., 2017). Although PixelCNN++ is one of the best density estimation networks, learning such a deep architecture requires heavy computation and high memory consumption. In Table 2, we list the training time and memory consumption of our method and bad GAN: compared to bad GAN, our method takes 15.8% less training time and saves about 9000 MB of memory during training. Moreover, Table 1 shows that our results are comparable to the best records of bad GAN and CAGAN, and better than those of the other approaches on the MNIST and SVHN datasets. On CIFAR-10, our method is inferior only to CT-GAN.
However, this might not be a fair comparison, because CT-GAN uses extra techniques, including temporal ensembling and data augmentation, which the other methods do not use.

5.3 DSGAN IN NOVELTY DETECTION

In this section, we study how to use the DSGAN to assist novelty detection. As mentioned in Sec. 4.2, we need to train the autoencoder (AE) such that (i) positive samples from the seen class have a small reconstruction error, and (ii) negative samples from the unseen classes incur relatively larger reconstruction errors.

Table 1: Comparison of our DSGAN with state-of-the-art methods in semi-supervised learning. For a fair comparison, we only consider GAN-based methods. Some compared methods use the same classifier architecture as ours, some use a larger classifier, and some use data augmentation (on CIFAR-10). The results for MNIST are reported as the number of errors; the others are reported as error percentages.

Methods | MNIST | SVHN | CIFAR-10
FM (Salimans et al., 2016) | 93 ± 6.5 | 8.11 ± 1.3 | 18.63 ± 1.32
Triple GAN (Li et al., 2017) | 91 ± 58 | 5.77 ± 0.17 | 16.99 ± 0.36
bad GAN (Dai et al., 2017) | 79.5 ± 9.8 | 4.25 ± 0.03 | 14.41 ± 0.30
CAGAN (Ni et al., 2018) | 81.9 ± 4.5 | 4.83 ± 0.09 | 12.61 ± 0.12
CT-GAN (Wei et al., 2018) | 89 ± 13 | - | 9.98 ± 0.21
bad GAN (reproduced) | 86.2 ± 13.2 | 4.48 ± 0.16 | 16.25 ± 0.33
Our method | 82.7 ± 4.6 | 4.38 ± 0.10 | 14.52 ± 0.14

Table 2: Training times and memory consumption of our method and bad GAN. We only report the training time on MNIST, on which the authors of bad GAN applied PixelCNN++. The experiments run on an NVIDIA 1080 Ti.

Methods | Training time | Memory consumption
bad GAN | 38 s / epoch | 9763 MB
Our method | 32 s / epoch | 711 MB

The fundamental concept is to use the DSGAN to generate negative samples, which do not originally exist under the novelty detection scenario, and then add a new loss term that penalizes small reconstruction errors on these negative samples (see the third stage below).
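The modified decoder objective used in the third training stage, described in the procedure that follows, combines the usual reconstruction term with a hinge penalty on generated negatives. This is a minimal per-minibatch NumPy sketch; the function name, the margin m, and the example values are illustrative assumptions:

```python
import numpy as np

def decoder_loss(x, x_rec, x_neg_rec, w=1.0, m=4.0):
    """Keep positives well reconstructed while the hinge term pushes
    the reconstruction error of generated negatives above margin m."""
    pos = np.sum((x - x_rec) ** 2, axis=-1)
    hinge = np.maximum(0.0, m - np.sum((x - x_neg_rec) ** 2, axis=-1))
    return float(np.mean(pos + w * hinge))

x = np.array([[0.0, 0.0]])
x_rec = np.array([[0.1, 0.0]])       # good positive reconstruction
x_neg_rec = np.array([[1.0, 0.0]])   # negative reconstruction: error 1 < m
print(decoder_loss(x, x_rec, x_neg_rec))  # about 0.01 + 1.0 * (4 - 1)
```

Once a negative's reconstruction error exceeds m, its hinge term vanishes and the decoder is free to focus on the positives.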
Three stages are required to train our model (the AE):

1. The encoder Enc(·) and decoder Dec(·) are trained using the loss function (5).
2. Given x ∼ p_pos, the codes Enc(x) are collected as samples drawn from p_d, and p̄_d is the convolution of p_d with a zero-mean normal distribution of variance σ. We then train the DSGAN to generate negative samples, which are drawn from p̄_d(x) − p_d(x) and are boundary points around the positive samples in the latent space. Note two variations of the DSGAN here: the input of the generator G is Enc(x) instead of a random vector z in the latent space, and we add the term ‖Enc(x) − G(Enc(x))‖₂², explained in the next step, to the generator loss.
3. Fixing the encoder, we retrain the decoder with the modified loss function

min_{Dec} E_{x∼p_pos(x)}[ ‖x − Dec(Enc(x))‖₂² + w · max(0, m − ‖x − Dec(G(Enc(x)))‖₂²) ],

where w trades off the reconstruction errors of the positive samples Enc(x) and the negative samples G(Enc(x)). The term ‖Enc(x) − G(Enc(x))‖₂² added in the previous step ensures that the outputs of the generator stay around their input; the second term then charges a penalty so that the negative samples, even when close to the corresponding positive sample, still exhibit a high reconstruction error, bounded below by the margin m (Zhao et al., 2017).

The above algorithm, called VAE+DSGAN, can be used to strengthen existing AE-based methods by plugging them into the first stage. In our simulations, we use a variational autoencoder (VAE) (Kingma & Welling, 2014a) because it performs better than the plain AE in novelty detection.

5.3.1 SIMULATION RESULTS

Following Perera et al. (2019), the performance is evaluated using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Given a dataset, one class is chosen as the seen class for training, and all classes are used for testing. Several benchmarks exist for novelty detection, such as MNIST, COIL100 Nene et al.
(1996), and CIFAR-10. The state-of-the-art method of Perera et al. (2019) achieves high AUC (larger than 0.97) on MNIST and COIL100, but only 0.656 on CIFAR-10. We therefore chose the challenging CIFAR-10 dataset as the benchmark to evaluate our method. The detailed network architecture can be found in Appendix E.

Figure 8: Comparison of the reconstruction results of the VAE and our method (VAE+DSGAN). The seen class, shown at the bottom of the images, is "car"; the other rows are images from the unseen classes. Our method exhibits a larger gap than the VAE in the reconstruction error between seen and unseen data.

Table 3: Comparison of our method (VAE+DSGAN) with the state-of-the-art methods VAE (Kingma & Welling, 2014a), AND (Abati et al., 2019), DSVDD (Ruff et al., 2018), and OCGAN (Perera et al., 2019) on CIFAR-10, in terms of AUC. The number in the top row denotes the seen class: 0: Plane, 1: Car, 2: Bird, 3: Cat, 4: Deer, 5: Dog, 6: Frog, 7: Horse, 8: Ship, 9: Truck.

Method | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | MEAN
VAE | .700 | .386 | .679 | .535 | .748 | .523 | .687 | .493 | .696 | .386 | .583
AND | .735 | .580 | .690 | .542 | .761 | .546 | .751 | .535 | .717 | .548 | .641
DSVDD | .617 | .659 | .508 | .591 | .609 | .657 | .677 | .673 | .759 | .731 | .648
OCGAN | .757 | .531 | .640 | .620 | .723 | .620 | .723 | .575 | .820 | .554 | .657
Our method | .737 | .614 | .676 | .644 | .759 | .562 | .660 | .646 | .769 | .633 | .670

Because VAE+DSGAN can be considered a fine-tuned VAE (Kingma & Welling, 2014a), we first illustrate the key difference between the VAE and VAE+DSGAN, shown in Fig. 8. One can see that the VAE's reconstructed images are reasonably good even for the unseen classes.
By contrast, our method forces the reconstructed images of the unseen classes to be blurred while preserving the reconstruction quality of the seen class. Thus, our method achieves a larger gap than the VAE in the reconstruction error between seen and unseen data. In Table 3, we compare the proposed method with the VAE (Kingma & Welling, 2014a), AND (Abati et al., 2019), DSVDD (Ruff et al., 2018), and OCGAN (Perera et al., 2019) in terms of AUC. Our method outperforms the VAE in most cases, and the mean AUC of our method is larger than those of the state-of-the-art methods. It is worth mentioning that, beyond the VAE, the DSGAN has the potential to be combined with other AE-based methods.

6 RELATED WORK ON UNSEEN DATA GENERATION

Yu et al. (2017) proposed a method to generate samples of unseen classes in an unsupervised manner via an adversarial learning strategy. However, it requires solving an optimization problem for each sample, which leads to a high computation cost; by contrast, the DSGAN can create an unlimited variety of unseen samples. Hou et al. (2018) presented a GAN architecture that learns the distribution of unseen data from a part of the seen data and the unlabeled data. However, their unlabeled data must be a mixture of seen and unseen samples, whereas the DSGAN does not require any unseen data. Kliger & Fleishman (2018) also applied a GAN to novelty detection. Their objective was to learn a generator whose distribution is a mixture of the novelty data distribution and the training data distribution. To this end, they used feature matching (FM) to train the generator and expected p_g to learn the mixture of distributions. However, the ultimate goal of FM is still to learn p_g = p_d; therefore, their method may fail precisely when the GAN learns well. Dai et al.
(2017) aimed to generate complementary samples (or out-of-distribution samples), but assumed that the in-distribution could be estimated by a pre-trained model, such as PixelCNN++, which might be difficult and expensive to train. Lee et al. (2018) used a simple classifier to replace the role of PixelCNN++ in Dai et al. (2017), so that the training was comparatively much easier and more suitable. Nevertheless, their method only focuses on generating unseen data surrounding the low-density area of the seen data. In comparison, the DSGAN has more flexibility to generate different types of unseen data (e.g., a linear combination of seen data, as described in Sec. 5.2). In addition, their method needs the label information of the data, whereas our method is fully unsupervised.

7 CONCLUSIONS

We propose the DSGAN, which can produce any unseen data based on the assumption that the density of the unseen data distribution is the difference between the densities of any two distributions. The DSGAN is useful in an environment where the samples from the unseen data distribution are more difficult to collect than those from the two known distributions. Empirical and theoretical results are provided to validate the effectiveness of the DSGAN. Finally, because the DSGAN is developed based on GAN, it is easy to apply any improved version of GAN to the DSGAN.

8 ACKNOWLEDGEMENT

This work was partially supported by grants MOST 107-2221-E-001-015-MY2 and MOST 108-2634-F-007-010 from the Ministry of Science and Technology, Taiwan, ROC.

REFERENCES

D. Abati, A. Porrello, S. Calderara, and R. Cucchiara. AND: Autoregressive novelty detectors. In IEEE CVPR, 2019.
M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, volume 70, pp. 214–223, 2017.
Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, and Ruslan R. Salakhutdinov.
Good semi-supervised learning that requires a bad GAN. In NIPS, pp. 6510–6520, 2017.
I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.
Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
M. Hou, B. Chaib-draa, C. Li, and Q. Zhao. Generative adversarial positive-unlabelled learning. In IJCAI, pp. 2255–2261, 2018.
D. P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014a.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, abs/1312.6114, 2014b.
Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, pp. 3581–3589, 2014.
Mark Kliger and Shachar Fleishman. Novelty detection with GAN. arXiv, abs/1802.10560, 2018.
A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits. 1998.
Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018.
Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. In NIPS, 2017.
Wenyuan Li, Zichen Wang, Jiayun Li, Jennifer S. Polson, William Speier, and Corey Conkling Arnold. Semi-supervised learning based on generative adversarial network: a comparison between good GAN and bad GAN approach. arXiv, abs/1905.06484, 2019.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In IEEE ICCV, pp. 2813–2821, 2017.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia Object Image Library (COIL-20). 1996.
Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, 2011.
Yao Ni, Dandan Song, Xi Zhang, Hank Wu, and Lejian Liao. CAGAN: Consistent adversarial training enhanced GANs. In IJCAI, 2018.
Pramuditha Perera, Ramesh Nallapati, and Bing Xiang. OCGAN: One-class novelty detection using GANs with constrained latent representations. In IEEE CVPR, 2019.
Stanislav Pidhorskyi, Ranya Almohsen, Donald A. Adjeroh, and Gianfranco Doretto. Generative probabilistic novelty detection with adversarial autoencoders. In NIPS, pp. 6823–6834, 2018.
Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In ICML, pp. 4393–4402, 2018.
Y. Saito, S. Takamichi, and H. Saruwatari. Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1):84–96, 2018.
Mayu Sakurada and Takehisa Yairi.
Anomaly detection using autoencoders with nonlinear dimensionality reduction. In MLSDA, pp. 4–11, 2014.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, pp. 2234–2242, 2016.
Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015.
Yaxing Wang, Chenshen Wu, Luis Herranz, Joost van de Weijer, Abel Gonzalez-Garcia, and Bogdan Raducanu. Transferring GANs: generating images from limited data. CoRR, abs/1805.01677, 2018. URL http://arxiv.org/abs/1805.01677.
Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, and Liqiang Wang. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. In ICLR, 2018.
Y. Yu, W.-Y. Qu, N. Li, and Z. Guo. Open-category classification by adversarial sample generation. In IJCAI, pp. 3357–3363, 2017.
J. J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.

A FLOWCHART AND ALGORITHM OF DSGAN

Figure 9: Illustration of the differences between the traditional GAN and the DSGAN.

Algorithm 1 The training procedure of the DSGAN using minibatch stochastic gradient descent. k is the number of steps applied to the discriminator. α is the ratio between p_g and p_d in the mixture distribution. We used k = 1 and α = 0.8 in the experiments.
01. for number of training iterations do
02.   for k steps do
03.     Sample a minibatch of m noise samples z^(1), ..., z^(m) from p_g(z).
04.     Sample a minibatch of m samples x_d^(1), ..., x_d^(m) from p_d(x).
05.     Sample a minibatch of m samples x̄_d^(1), ..., x̄_d^(m) from p̄_d(x).
06.
Update the discriminator by ascending its stochastic gradient:
        ∇_{θ_d} (1/m) Σ_{i=1}^{m} [ log D(x̄_d^(i)) + (1−α) log(1 − D(G(z^(i)))) + α log(1 − D(x_d^(i))) ].
07.   end for
08.   Sample a minibatch of m noise samples z^(1), ..., z^(m) from p_g(z).
09.   Update the generator by descending its stochastic gradient:
        ∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i)))).
10. end for

B TRICKS FOR STABLE TRAINING

We provide a trick to stabilize the training procedure by reformulating the objective function. Specifically, V(G, D) in (2) is reformulated as:

V(G, D) = ∫_x p̄_d(x) log(D(x)) + ((1−α)p_g(x) + αp_d(x)) log(1 − D(x)) dx
        = E_{x∼p̄_d(x)}[log D(x)] + E_{x∼(1−α)p_g(x)+αp_d(x)}[log(1 − D(x))].   (9)

Instead of sampling a minibatch of m samples from each of p_z and p_d as in Algorithm 1, only (1−α)m and αm samples from the two distributions are required, respectively. The computation cost in training can be reduced owing to the fewer samples. Furthermore, although (9) is equivalent to (2) in theory, we find via the empirical validation in Table 4 that training with (9) achieves better performance than training with (2). We conjecture that the equivalence between (9) and (2) relies on the linearity of expectation, whereas minibatch stochastic gradient descent in practical training may lead to different outcomes.

Table 4: Semi-supervised learning results on MNIST with and without the use of the sampling trick.

Methods                  MNIST (# errors)
Our method w/o tricks    91.0 ± 7.0
Our method w/ tricks     82.7 ± 4.6

C PROOF OF THEOREM 1

In this section, we prove Theorem 1. The proof includes two parts: the first part shows that the objective function is equivalent to minimizing the Jensen-Shannon divergence between the mixture distribution (of p_d and p_g) and p̄_d if G and D are given sufficient capacity; the second part shows that, by choosing an appropriate α, the support set of p_g belongs to the difference set between p̄_d and p_d, so that the samples from p_g are unseen from the perspective of p_d.
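The minibatch objective of Algorithm 1, combined with the sampling trick of Appendix B, can be sketched numerically: draw (1−α)m samples from p_g and αm samples from p_d to form the mixture batch, and evaluate the empirical value of (9). The 1-D Gaussians standing in for p_d, p̄_d, and p_g below, and the fixed sigmoid discriminator, are hypothetical illustrations only:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, m = 0.8, 1000

def D(x):
    # A fixed, hypothetical sigmoid discriminator on 1-D inputs.
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 1-D stand-ins for the three distributions:
# p_d (seen data), p'_d (a broadened version of it), and the generator p_g.
x_bar = rng.normal(0.0, 2.0, size=m)                  # samples from p'_d
n_g = int(round((1 - alpha) * m))                     # (1-alpha)*m from p_g
n_d = m - n_g                                         # alpha*m from p_d
x_mix = np.concatenate([rng.normal(-1.0, 1.0, n_g),   # generator samples
                        rng.normal(0.0, 1.0, n_d)])   # seen-data samples

# Empirical value of (9): E_{p'_d}[log D] + E_{(1-a)p_g + a p_d}[log(1-D)]
V = np.mean(np.log(D(x_bar))) + np.mean(np.log(1.0 - D(x_mix)))
print(V)
```

With the trick, only one mixture batch of size m is needed per discriminator step instead of two separate batches of size m, which is where the reduced computation cost comes from.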
For the first part, we derive the optimal discriminator given G, and then show that minimizing V(G, D) over G, given the optimal discriminator, is equivalent to minimizing the Jensen-Shannon divergence between (1−α)p_g + αp_d and p̄_d.

Proposition 1. If G is fixed, the optimal discriminator, D, is

D*_G(x) = p̄_d(x) / (p̄_d(x) + (1−α)p_g(x) + αp_d(x)).

Proof. Given any generator G, the training criterion for the discriminator D is to maximize the quantity V(G, D):

V(G, D) = ∫_x p̄_d(x) log(D(x)) dx + (1−α) ∫_z p_z(z) log(1 − D(G(z))) dz + α ∫_x p_d(x) log(1 − D(x)) dx
        = ∫_x p̄_d(x) log(D(x)) dx + (1−α) ∫_x p_g(x) log(1 − D(x)) dx + α ∫_x p_d(x) log(1 − D(x)) dx
        = ∫_x p̄_d(x) log(D(x)) + ((1−α)p_g(x) + αp_d(x)) log(1 − D(x)) dx.

For any (a, b) ∈ R²\{(0, 0)}, the function a log(y) + b log(1 − y) achieves its maximum in [0, 1] at y = a/(a+b). The discriminator only needs to be defined within Supp(p̄_d) ∪ Supp(p_d) ∪ Supp(p_g). This completes the proof.

Moreover, D can be considered to discriminate between samples from p̄_d and (1−α)p_g + αp_d. By substituting the optimal discriminator into V(G, D), we obtain

C(G) = max_D V(G, D)
     = E_{x∼p̄_d(x)}[ log( p̄_d(x) / (p̄_d(x) + (1−α)p_g(x) + αp_d(x)) ) ]
     + E_{x∼(1−α)p_g(x)+αp_d(x)}[ log( ((1−α)p_g(x) + αp_d(x)) / (p̄_d(x) + (1−α)p_g(x) + αp_d(x)) ) ].

The results thus far yield the optimal solution of D in (1), given that G is fixed. The next step is to determine the optimal G with D*_G fixed.

Theorem 2. The global minimum of C(G) is achieved if and only if (1−α)p_g(x) + αp_d(x) = p̄_d(x) for all x. At that point, C(G) achieves the value −log 4.

Proof. We start from C(G) and rewrite it as

C(G) = −log 4 + E_{x∼p̄_d(x)}[ log( 2p̄_d(x) / (p̄_d(x) + (1−α)p_g(x) + αp_d(x)) ) ]
     + E_{x∼p̃(x)}[ log( 2((1−α)p_g(x) + αp_d(x)) / (p̄_d(x) + (1−α)p_g(x) + αp_d(x)) ) ]
     = −log 4 + KL( p̄_d ‖ (p̄_d + (1−α)p_g + αp_d)/2 ) + KL( (1−α)p_g + αp_d ‖ (p̄_d + (1−α)p_g + αp_d)/2 )
     = −log 4 + 2 · JSD( p̄_d ‖ (1−α)p_g + αp_d ),

where p̃(x) = (1−α)p_g(x) + αp_d(x), KL is the Kullback-Leibler divergence, and JSD is the Jensen-Shannon divergence.
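The condition in Theorem 2 can be checked numerically on a discrete toy example: when αp_d ≤ p̄_d pointwise, choosing p_g = (p̄_d − αp_d)/(1−α) makes the mixture equal p̄_d, and C(G) attains −log 4. The three-point support and the particular densities below are hypothetical:

```python
import numpy as np

alpha = 0.5
p_d   = np.array([0.5, 0.5, 0.0])    # seen-data distribution p_d
p_bar = np.array([0.25, 0.25, 0.5])  # p'_d, chosen so alpha*p_d <= p'_d
p_g   = (p_bar - alpha * p_d) / (1 - alpha)  # the minimizing generator

mix = (1 - alpha) * p_g + alpha * p_d        # mixture equals p'_d

def C(p_bar, mix):
    # C(G) with the optimal discriminator plugged in; 0*log(0) taken as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(p_bar > 0, p_bar * np.log(p_bar / (p_bar + mix)), 0.0)
        t2 = np.where(mix > 0, mix * np.log(mix / (p_bar + mix)), 0.0)
    return (t1 + t2).sum()

print(p_g)            # [0. 0. 1.]: supported only on Supp(p'_d) \ Supp(p_d)
print(C(p_bar, mix))  # approximately -log(4)
```

Note that the minimizing p_g is concentrated entirely on the point where p̄_d > 0 but p_d = 0, which also illustrates the support-set claim of Theorem 1.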
The JSD attains its minimal value, 0, iff both distributions are the same, namely p̄_d = (1−α)p_g + αp_d. Because the values p_g(x) are always non-negative, it should be noted that both distributions can be the same only if αp_d(x) ≤ p̄_d(x) for all x. This completes the proof.

Note that (1−α)p_g(x) + αp_d(x) = p̄_d(x) may not hold if αp_d(x) > p̄_d(x) for some x. However, the DSGAN still works, based on two facts: i) given D, V(G, D) is a convex function of p_g, and ii) because ∫_x p_g(x)dx = 1, the set collecting all the feasible solutions of p_g is convex. Thus, there always exists a global minimum of V(G, D) given D, but its value may not be −log 4.

Now, we return to the proof of Theorem 1. We show that, while achieving the global minimum, the support set of p_g is contained within the difference between the support sets of p̄_d and p_d, so that we can generate the desired p_g by designing an appropriate p̄_d.

Proof. Recall that

C(G) = ∫_x p̄_d(x) log( p̄_d(x) / (p̄_d(x) + (1−α)p_g(x) + αp_d(x)) )
     + ((1−α)p_g(x) + αp_d(x)) log( ((1−α)p_g(x) + αp_d(x)) / (p̄_d(x) + (1−α)p_g(x) + αp_d(x)) ) dx
     = ∫_x S(p_g; x) dx
     = ∫_{x∈Supp(p̄_d)\Supp(p_d)} S(p_g; x) dx + ∫_{x∈Supp(p_d)} S(p_g; x) dx,

where S(p_g; x) is used to simplify the notation inside the integral. For any x, S(p_g; x) is non-increasing in p_g(x), and S(p_g; x) ≤ 0 always holds. Specifically, S(p_g; x) is decreasing along the increase of p_g(x) if p̄_d(x) > 0, and S(p_g; x) attains the maximum value, zero, for any p_g(x) if p̄_d(x) = 0. Since the DSGAN aims to minimize C(G) under the constraint ∫_x p_g(x)dx = 1, the solution attaining the global minimum must satisfy p_g(x) = 0 if p̄_d(x) = 0; otherwise, there would exist another solution with a smaller value of C(G). Thus, Supp(p_g) ⊆ Supp(p̄_d). Furthermore,

T(p_g; x) = ∂S(p_g; x)/∂p_g(x) = log( ((1−α)p_g(x) + αp_d(x)) / (p̄_d(x) + (1−α)p_g(x) + αp_d(x)) ),

which is expected to be as small as possible to minimize C(G), is increasing in p_g(x) and converges to 0. We next show that T(p_g; x) for x ∈ Supp(p̄_d) ∩ Supp(p_d) is always larger than T(p_g; x) for x ∈ Supp(p̄_d)\Supp(p_d), for all p_g. Specifically,
1. When x ∈ Supp(p̄_d) ∩ Supp(p_d), T(p_g; x) ≥ log(1/2) always holds, owing to the assumption αp_d(x) ≥ p̄_d(x).
2. When x ∈ Supp(p̄_d)\Supp(p_d), T(p_g; x) < log(1/2) for all p_g(x) satisfying (1−α)p_g(x) < p̄_d(x).

Thus, the minimizer prefers p_g(x) > 0 for x ∈ Supp(p̄_d)\Supp(p_d) with (1−α)p_g(x) ≤ p̄_d(x). We check whether there exists a solution p_g such that (1−α)p_g(x) ≤ p̄_d(x) and ∫_{x∈Supp(p̄_d)\Supp(p_d)} p_g(x)dx = 1, implying p_g(x) = 0 for x ∈ Supp(p̄_d) ∩ Supp(p_d). Based on the expression

∫_{x∈Supp(p̄_d)\Supp(p_d)} p̄_d(x)dx + ∫_{x∈Supp(p_d)} p̄_d(x)dx = 1,

we have

∫_{x∈Supp(p̄_d)\Supp(p_d)} p̄_d(x)dx ≥ 1 − ∫_{x∈Supp(p_d)} αp_d(x)dx ≥ 1 − α = ∫_{x∈Supp(p̄_d)\Supp(p_d)} (1−α)p_g(x)dx,

and the last equality implies that there must exist a feasible solution. This completes the proof.

Another concern is the convergence of Algorithm 1.

Proposition 2. Suppose that, in Algorithm 1, the discriminator reaches its optimal value given G, and p_g is updated by minimizing

E_{x∼p̄_d(x)}[log D*_G(x)] + E_{x∼p̃(x)}[log(1 − D*_G(x))].

If G and D have sufficient capacities, then p_g converges to argmin_{p_g} JSD( p̄_d ‖ (1−α)p_g + αp_d ).

Proof. Consider V(G, D) = U(p_g, D) as a function of p_g. By the proof idea of Theorem 2 in Goodfellow et al. (2014), if f(x) = sup_{α∈A} f_α(x) and f_α(x) is convex in x for every α, then ∂f_β(x) ⊆ ∂f(x) if β = argsup_{α∈A} f_α(x). In other words, because sup_D V(G, D) is convex in p_g, the subderivatives of sup_D V(G, D) include the derivative of the function at the point where the maximum is attained, implying convergence with sufficiently small updates of p_g. This completes the proof.

D EXPERIMENTAL DETAILS FOR SEMI-SUPERVISED LEARNING

D.0.1 DATASETS: MNIST, SVHN, AND CIFAR-10

For evaluating the semi-supervised learning task, we used 60000/73257/50000 samples and 10000/26032/10000 samples from the MNIST/SVHN/CIFAR-10 datasets for the training and testing, respectively.
Under the semi-supervised setting, we randomly chose 100/1000/4000 samples from the training samples to form the labeled datasets for MNIST/SVHN/CIFAR-10, respectively, and the amounts of labeled data for all the classes are equal. Furthermore, our criterion for determining the hyperparameters is introduced in Appendix D.1, and the network architectures are described in Appendix D.2. We performed testing with 10/5/5 runs on MNIST/SVHN/CIFAR-10 based on the selected hyperparameters, with the labeled dataset randomly selected in each run. The results are recorded as the mean and standard deviation of the number of errors over the runs.

D.1 HYPERPARAMETERS

The hyperparameters were chosen to make our generated samples consistent with the assumptions in (7) and (8). In practice, however, if we force all the samples produced by the generator to follow the assumption in (8), then the generated distribution is not close to the true distribution; there even exists a large margin between them, which is not what we desire. Therefore, in our experiments, we make a concession: we tune the hyperparameters so that the percentage of generated samples that accord with the assumption is approximately 90%. Table 5 shows our settings of the hyperparameters, where β is defined in (8).

Table 5: Hyperparameters in semi-supervised learning.

Hyperparameters   MNIST   SVHN   CIFAR-10
α                 0.8     0.8    0.5
β                 0.3     0.1    0.1

D.2 ARCHITECTURE

In order to compare fairly with other methods, our generators and classifiers for MNIST, SVHN, and CIFAR-10 are the same as those in Salimans et al. (2016) and Dai et al. (2017). However, different from previous works, which have only a generator and a discriminator, we design an additional discriminator in the feature space; its architecture is similar across all datasets, differing only in the input dimensions. Following Dai et al. (2017), we define the feature space as the input space of the output layer of the discriminators.
Compared to SVHN and CIFAR-10, MNIST is a simple dataset, so its networks are composed of only fully connected layers. Batch normalization (BN) or weight normalization (WN) is used in every layer to stabilize training. Moreover, Gaussian noise is added before each layer in the classifier, as proposed in Rasmus et al. (2015). We find that the added Gaussian noise has a positive effect on semi-supervised learning. The architecture is shown in Table 6. Tables 7 and 8 show the models for SVHN and CIFAR-10, respectively; these models are almost the same except for some implicit differences, e.g., the number of convolutional filters and the types of dropout. In these tables, given a dropping rate, Dropout denotes normal dropout, in which the elements of the input tensor are randomly set to zero, whereas Dropout2d is a dropout applied only on the channels, randomly zeroing all the elements of entire channels.

Table 6: Network architectures for semi-supervised learning on MNIST. (GN: Gaussian noise)

Generator G:
  Input: z ∈ R^100 from unif(0, 1)
  100→500 FC layer with BN, Softplus
  500→500 FC layer with BN, Softplus
  500→784 FC layer with WN, Sigmoid

Discriminator D:
  Input: 250-dimension feature
  250→400 FC layer, ReLU
  400→200 FC layer, ReLU
  200→100 FC layer, ReLU
  100→1 FC layer

Classifier C:
  Input: 28×28 gray image
  GN, std = 0.3; 784→1000 FC layer with WN, ReLU
  GN, std = 0.5; 1000→500 FC layer with WN, ReLU
  GN, std = 0.5; 500→250 FC layer with WN, ReLU
  GN, std = 0.5; 250→250 FC layer with WN, ReLU
  GN, std = 0.5; 250→250 FC layer with WN, ReLU
  250→10 FC layer with WN

Furthermore, the training procedure alternates between k steps of optimizing D and one step of optimizing G. We find that k in Algorithm 1 plays a key role in the problem of mode collapse across different applications. For semi-supervised learning, we set k = 1 for all datasets.

E EXPERIMENTAL DETAILS FOR NOVELTY DETECTION

The architectures of the GAN and VAE are depicted in Tables 9 and 10, respectively.
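The MNIST generator in Table 6 can be sketched as a shape-only forward pass. This is a minimal numpy illustration with random stand-in weights; batch normalization and weight normalization are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random weights standing in for trained parameters (BN/WN omitted).
W1, W2, W3 = (rng.normal(0, 0.05, s) for s in [(100, 500), (500, 500), (500, 784)])

z = rng.uniform(0.0, 1.0, size=(16, 100))  # minibatch of latent codes, unif(0, 1)
h = softplus(z @ W1)                       # 100 -> 500 FC, Softplus
h = softplus(h @ W2)                       # 500 -> 500 FC, Softplus
x = sigmoid(h @ W3)                        # 500 -> 784 FC, Sigmoid
print(x.shape)                             # (16, 784): 16 flattened 28x28 images
```

The sigmoid output keeps pixel values in (0, 1), matching the gray-image range expected by the discriminator and classifier.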
In the experiment, we first trained the VAE for 500 epochs; then, we trained the DSGAN for 500 epochs with m = 1.5 and w = 0.5. Third, we fixed the encoder and tuned the decoder with both positive and negative samples (generated by the DSGAN) for 600 epochs.

Table 7: The architectures of the generator and discriminator for semi-supervised learning on SVHN and CIFAR-10. N was set to 128 and 192 for SVHN and CIFAR-10, respectively.

Generator G:
  Input: z ∈ R^100 from unif(0, 1)
  100→8192 FC layer with BN, ReLU
  Reshape to 4×4×512
  5×5 conv. transpose 256, stride = 2, with BN, ReLU
  5×5 conv. transpose 128, stride = 2, with BN, ReLU
  5×5 conv. transpose 3, stride = 2, with WN, Tanh

Discriminator D:
  Input: N-dimension feature
  N→400 FC layer, ReLU
  400→200 FC layer, ReLU
  200→100 FC layer, ReLU
  100→1 FC layer

Table 8: The architectures of the classifiers for semi-supervised learning on SVHN and CIFAR-10. (GN: Gaussian noise; lReLU(leak rate): Leaky ReLU with the given leak rate)

Classifier C for SVHN:
  Input: 32×32 RGB image
  GN, std = 0.05
  Dropout2d, dropping rate = 0.15
  3×3 conv. 64, stride = 1, with WN, lReLU(0.2)
  3×3 conv. 64, stride = 1, with WN, lReLU(0.2)
  3×3 conv. 64, stride = 2, with WN, lReLU(0.2)
  Dropout2d, dropping rate = 0.5
  3×3 conv. 128, stride = 1, with WN, lReLU(0.2)
  3×3 conv. 128, stride = 1, with WN, lReLU(0.2)
  3×3 conv. 128, stride = 2, with WN, lReLU(0.2)
  Dropout2d, dropping rate = 0.5
  3×3 conv. 128, stride = 1, with WN, lReLU(0.2)
  1×1 conv. 128, stride = 1, with WN, lReLU(0.2)
  1×1 conv. 128, stride = 1, with WN, lReLU(0.2)
  Global average pooling
  128→10 FC layer with WN

Classifier C for CIFAR-10:
  Input: 32×32 RGB image
  GN, std = 0.05
  Dropout2d, dropping rate = 0.2
  3×3 conv. 96, stride = 1, with WN, lReLU(0.2)
  3×3 conv. 96, stride = 1, with WN, lReLU(0.2)
  3×3 conv. 96, stride = 2, with WN, lReLU(0.2)
  Dropout, dropping rate = 0.5
  3×3 conv. 192, stride = 1, with WN, lReLU(0.2)
  3×3 conv. 192, stride = 1, with WN, lReLU(0.2)
  3×3 conv. 192, stride = 2, with WN, lReLU(0.2)
  Dropout, dropping rate = 0.5
  3×3 conv. 192, stride = 1, with WN, lReLU(0.2)
  1×1 conv. 192, stride = 1, with WN, lReLU(0.2)
  1×1 conv. 192, stride = 1, with WN, lReLU(0.2)
  Global average pooling
  192→10 FC layer with WN

Table 9: The architectures of the generator and discriminator in the DSGAN for novelty detection.

Generator G:
  Input: 128-dimension feature
  128→1024 FC layer with BN, ReLU
  1024→512 FC layer with BN, ReLU
  512→256 FC layer with BN, ReLU
  256→128 FC layer

Discriminator D:
  Input: 128-dimension feature
  128→400 FC layer, ReLU
  400→200 FC layer, ReLU
  200→100 FC layer, ReLU
  100→1 FC layer

F ABLATION STUDY ON DIFFERENT α VALUES FOR SEMI-SUPERVISED LEARNING

Fig. 7 shows how different α values influence the DSGAN. The optimal α for the DSGAN to generate unseen data depends on p̄_d and p_d. According to Fig. 7, the DSGAN is prone to generating unseen data under a larger α. Recall that Theorem 1 suggests that α should be as large as possible if both networks G and D have infinite capacity. Although the networks never have infinite capacity in real applications, a general rule is to pick a large α and force the complement data to be far from p_d, which is similar to the results in Sec. 5.1.

Table 10: The architectures of the VAE for novelty detection.

Encoder:
  5×5 conv. 32, stride = 2, with BN, lReLU(0.2)
  5×5 conv. 64, stride = 2, with BN, lReLU(0.2)
  5×5 conv. 128, stride = 2, with BN, lReLU(0.2)
  (For mean) 4×4 conv. 128, stride = 1
  (For std) 4×4 conv. 128, stride = 1

Decoder:
  5×5 conv. transpose 128, stride = 2, with BN, lReLU(0.2)
  5×5 conv. transpose 64, stride = 2, with BN, lReLU(0.2)
  5×5 conv. transpose 32, stride = 2, with BN, lReLU(0.2)
  5×5 conv. transpose 3, stride = 2, Tanh

Here, we conduct experiments on different α under the semi-supervised learning settings. From Sec.
4.1 and 5.2, bad GAN has already shown that, if the desired unseen data can be generated, then the classifier will place the correct decision boundary in the low-density area. In Table 11, we report the classification results for α = 0.5 and α = 0.8. We observe that the results at α = 0.8 are better than those at α = 0.5, which meets the above discussion. From our empirical observations, the DSGAN is prone to generating unseen data at α = 0.8, leading to a better classifier.

Table 11: Ablation study of different α values for the DSGAN in semi-supervised learning, where the result for MNIST is represented in terms of the number of errors and the percentage of errors is used for the other datasets.

Methods            MNIST        SVHN          CIFAR-10
DSGAN (α = 0.5)    91.5 ± 5.6   4.59 ± 0.15   14.52 ± 0.14
DSGAN (α = 0.8)    82.7 ± 4.6   4.38 ± 0.10   14.47 ± 0.15

G SAMPLE QUALITY OF DSGAN ON CELEBA

We show one more experiment on CelebA (Liu et al. (2015)) to demonstrate that the DSGAN can work well even for complicated images. In this experiment, we generate color images of size 64×64. Similar to our 1/7 experiments on the MNIST dataset, we let p̄_d be the distribution of face images with and without glasses and let p_d be the distribution of images without glasses. We validate the DSGAN with α = 0.5 and α = 0.8, respectively. For α = 0.5, we sample 10000 images with glasses and 10000 images without glasses from CelebA. When α is 0.8, we sample 40000 instead of 10000 images without glasses. We also train a GAN to verify the generated image quality of the DSGAN. For a fair comparison, the GAN is trained under two settings. In the first, the GAN is trained only with the images with glasses. In the second, it is pretrained with all the images and finetuned with the images with glasses, namely the transferring GANs in Wang et al. (2018). It should be noted that the transferring GAN uses the same amount of training data as the DSGAN and serves as a stronger baseline than the GAN under the first setting.
The Fréchet Inception Distance (FID) (Heusel et al. (2017)) is used to evaluate the quality of the generated images. The FID calculates the Wasserstein-2 distance between the generated images and the real images (images with glasses) in the feature space of the Inception-v3 network (Szegedy et al. (2015)). We train both networks for 600 epochs, and use WGAN-GP as the backbone for both the GAN and the DSGAN. In addition, the transferring GANs are pretrained for 500 epochs and then finetuned for 600 epochs. Fig. 10 and Table 12 show the generated images and the FIDs for all methods, respectively. We can see that our DSGAN can generate images with glasses from the given p_d and p̄_d, and the FIDs of the DSGAN are comparable to those of the GAN. This experiment validates that the DSGAN still works well in creating complement data for complicated images.

Figure 10: Sampled generated images of GAN and DSGAN on CelebA. (a) Transferring GANs (pretrained with 20000 images and finetuned with 10000 samples with glasses). (b) DSGAN (α = 0.5). (c) Transferring GANs (pretrained with 50000 images and finetuned with 10000 samples with glasses). (d) DSGAN (α = 0.8).

Table 12: FIDs of GAN and DSGAN on CelebA. A smaller FID means that the generated distribution is closer to the distribution of images with glasses.

Method              Training samples   FID
GAN                 10000              22.37
transferring GAN    20000              18.34
DSGAN (α = 0.5)     20000              18.05
transferring GAN    50000              16.45
DSGAN (α = 0.8)     50000              15.39
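The quantity behind FID is the Fréchet (Wasserstein-2) distance between two Gaussians fitted to the feature activations, d² = ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A minimal sketch with hypothetical 2-D feature statistics (a real FID computation would use Inception-v3 features):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians, the quantity behind FID."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Identical Gaussians give distance ~0; shifting the mean by (1, 0)
# adds exactly 1 to the squared distance.
mu, cov = np.zeros(2), np.eye(2)
print(round(frechet_distance(mu, cov, mu, cov), 6))                       # 0.0
print(round(frechet_distance(mu, cov, mu + np.array([1.0, 0.0]), cov), 6))  # 1.0
```

In practice, μ and Σ are the empirical mean and covariance of the network features over the generated and real image sets, so lower values indicate that the generated distribution is closer to the real one, as in Table 12.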