# Structured Generative Adversarial Networks

1Zhijie Deng*, 2,3Hao Zhang*, 2Xiaodan Liang, 2Luona Yang, 1,2Shizhen Xu, 1Jun Zhu†, 3Eric P. Xing
1Tsinghua University, 2Carnegie Mellon University, 3Petuum Inc.
{dzj17,xsz12}@mails.tsinghua.edu.cn, {hao,xiaodan1,luonay1}@cs.cmu.edu, dcszj@mail.tsinghua.edu.cn, epxing@cs.cmu.edu

*Equal contribution. †Corresponding author.

## Abstract

We study the problem of conditional generative modeling based on designated semantics or structures. Existing models that build conditional generators either require massive labeled instances as supervision or are unable to accurately control the semantics of generated samples. We propose structured generative adversarial networks (SGANs) for semi-supervised conditional generative modeling. SGAN assumes the data x is generated conditioned on two independent latent variables: y, which encodes the designated semantics, and z, which contains other factors of variation. To ensure disentangled semantics in y and z, SGAN builds two collaborative games in the hidden space to minimize the reconstruction error of y and z, respectively. Training SGAN also involves solving two adversarial games whose equilibria concentrate at the true joint data distributions p(x, z) and p(x, y), avoiding the diffuse spread of probability mass over the data space from which MLE-based methods may suffer. We assess SGAN by evaluating its trained networks and its performance on downstream tasks. We show that SGAN delivers a highly controllable generator and disentangled representations; it also establishes state-of-the-art results across multiple datasets when applied to semi-supervised image classification (1.27%, 5.73% and 17.26% error rates on MNIST, SVHN and CIFAR-10 using 50, 1000 and 4000 labels, respectively). Benefiting from the separate modeling of y and z, SGAN can generate images of high visual quality that strictly follow the designated semantics, and can be extended to a wide spectrum of applications, such as style transfer.

## 1 Introduction

Deep generative models (DGMs) [12, 8, 26] have gained considerable research interest recently because of their high capacity for modeling complex data distributions and their ease of training or inference. Among various DGMs, variational autoencoders (VAEs) and generative adversarial networks (GANs) can be trained without supervision to map a random noise $z \sim \mathcal{N}(0, 1)$ to the data distribution p(x), and have reported remarkable successes in many domains including image/text generation [17, 9, 3, 27], representation learning [27, 4], and posterior inference [12, 5]. They have also been extended to model the conditional distribution p(x|y), which involves training a neural network generator G that takes as inputs both the random noise z and a condition y, and generates samples with the desired properties specified by y. Such a conditional generator would be quite helpful for a wide spectrum of downstream applications, such as classification, where synthetic data from G can be used to augment the training set. However, training a conditional generator is inherently difficult, because it requires not only a holistic characterization of the data distribution, but also fine-grained alignments between different modes of the distribution and different conditions. Previous works have tackled this problem by using a large amount of labeled data to guide the generator's learning [32, 23, 25], which compromises the generator's usefulness because obtaining the label information might be expensive.
In this paper, we investigate the problem of building conditional generative models under semi-supervised settings, where we have access to only a small set of labeled data. Existing works [11, 15] have explored this direction based on DGMs, but the resulting conditional generators exhibit inadequate controllability, which we define as the generator's ability to conditionally generate samples whose structures strictly agree with those specified by the condition: a more controllable generator can better capture and respect the semantics of the condition. When supervision from labeled data is scarce, the controllability of a generative model is usually influenced by its ability to disentangle the designated semantics from other factors of variation (which we term disentanglability in the following text). In other words, the model has to first learn from a small set of labeled data what semantics or structures the condition y essentially represents, by trying to recognize y in the latent space. As a second step, when performing conditional generation, the semantics shall be exclusively captured and governed by y, not interwoven with other factors.

Following this intuition, we build the structured generative adversarial network (SGAN) with enhanced controllability and disentanglability for semi-supervised generative modeling. SGAN separates the hidden space into two parts, y and z, and learns a more structured generator distribution p(x|y, z), where the data are generated conditioned on two latent variables: y, which encodes the designated semantics, and z, which contains other factors of variation. To impose the aforementioned exclusiveness constraint, SGAN first introduces two dedicated inference networks C and I to map x back to the hidden space as C : x → y and I : x → z, respectively. Then, SGAN enforces G to generate samples that, when mapped back to the hidden space using C (or I), always yield an inferred latent code that matches the generator condition, regardless of the variations of the other variable z (or y).

To train SGAN, we draw inspiration from the recently proposed adversarially learned inference framework (ALI) [5], and build two adversarial games to drive I, G to match the true joint distribution p(x, z), and C, G to match the true joint distribution p(x, y). Thus, SGAN can be seen as a combination of two adversarial games and two collaborative games, where I, C, G combat the critic networks to match joint distributions in the visible space, but collaborate with each other to minimize reconstruction errors in the hidden space. We theoretically show that SGAN will converge to the desired equilibria if trained properly.

To empirically evaluate SGAN, we first define a mutual predictability (MP) measure to evaluate the disentanglability of various DGMs, and show that in terms of MP, SGAN outperforms all existing models that are able to infer the latent code z, across multiple image datasets. When classifying the generated images using a golden classifier, SGAN achieves the highest accuracy, confirming its improved controllability for conditional generation under semi-supervised settings. In the semi-supervised image classification task, SGAN outperforms strong baselines, and establishes new state-of-the-art results on the MNIST, SVHN and CIFAR-10 datasets.
For controllable generation, SGAN can generate images with high visual quality in terms of both visual comparison and inception score, thanks to the disentangled latent space modeling. As SGAN is able to infer the unstructured code z, we further apply SGAN to style transfer, and obtain impressive results.

## 2 Related Work

DGMs have drawn increasing interest from the community, and have been developed mainly in two directions: VAE-based models [12, 11, 32] that learn the data distribution via maximum likelihood estimation (MLE), and GAN-based methods [19, 27, 21] that train a generator via adversarial learning. SGAN combines the best of MLE-based methods and GAN-based methods, which we discuss in detail in the next section. DGMs have also been applied to conditional generation, such as CGAN [19] and CVAE [11]. disCVAE [32] is a successful extension of CVAE that generates images conditioned on text attributes. In parallel, CGAN has been developed to generate images conditioned on text [24, 23], bounding boxes and key points [25], locations [24], or other images [10, 6, 31], or to generate text conditioned on images [17]. All these models are trained using fully labeled data.

A variety of techniques have been developed toward learning disentangled representations for generative modeling [3, 29]. InfoGAN [3] disentangles hidden dimensions on unlabeled data by mutual information regularization. However, the semantics of each disentangled dimension are uncontrollable, because they are discovered after training rather than designated by the user. We establish some connections between SGAN and InfoGAN in the next section. There is also interest in developing DGMs for semi-supervised conditional generation, such as semi-supervised CVAE [11], its many variants [16, 9, 18], ALI [5] and Triple-GAN [15], among which the closest to ours are [15, 9]. In [9], a VAE is enhanced with a discriminator loss and an independency constraint, and trained via joint MLE and discriminator loss minimization. By contrast, SGAN is an adversarial framework that is trained to match two joint distributions in the visible space, and thus avoids MLE for visible variables. Triple-GAN builds a three-player adversarial game to drive the generator to match the conditional distribution p(x|y), while SGAN models the conditional distribution p(x|y, z) instead. Triple-GAN therefore lacks constraints to ensure that the semantics of interest are exclusively captured by y, and lacks a mechanism to perform posterior inference for z.

## 3 Structured Generative Adversarial Networks (SGAN)

We build our model based on generative adversarial networks (GANs) [8], a framework for learning DGMs using a two-player adversarial game. Specifically, given observed data $\{x_i\}_{i=1}^N$, GANs try to estimate a generator distribution $p_g(x)$ to match the true data distribution $p_{data}(x)$, where $p_g(x)$ is modeled as a neural network G that transforms a noise variable $z \sim \mathcal{N}(0, 1)$ into generated data $\hat{x} = G(z)$. GANs assess the quality of $\hat{x}$ by introducing a neural network discriminator D to judge whether a sample comes from $p_{data}(x)$ or the generator distribution $p_g(x)$. D is trained to distinguish generated samples from true samples while G is trained to fool D:

$$\min_G \max_D \mathcal{L}(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].$$

Goodfellow et al. [8] show that the global optimum of the above problem is attained at $p_g = p_{data}$.
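To make the two-player game concrete, the following is a minimal training-step sketch in PyTorch (chosen here for illustration; the paper's own implementation uses TensorFlow and Theano). `G` and `D` are assumed to be pre-built modules, with `D` outputting a probability in (0, 1); all names and dimensions are illustrative rather than the authors' code.

```python
import torch

# Minimal sketch of one GAN training step for the minimax objective above.
# G maps noise z to data space; D outputs a probability in (0, 1).
def gan_step(G, D, opt_g, opt_d, x_real, z_dim=64):
    z = torch.randn(x_real.size(0), z_dim)            # z ~ N(0, 1)
    x_fake = G(z)

    # D ascends on log D(x) + log(1 - D(G(z))): minimize the negation.
    d_loss = -(torch.log(D(x_real)).mean()
               + torch.log(1.0 - D(x_fake.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G should descend on log(1 - D(G(z))); the non-saturating variant
    # below instead maximizes log D(G(z)) for stronger early gradients [8].
    g_loss = -torch.log(D(x_fake)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```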
It is noted that the original GAN models the latent space using a single unstructured noise variable z. The semantics and structures that may be of interest to us are entangled in z, and the generator transforms z into $\hat{x}$ in a highly uncontrollable way: it lacks both disentanglability and controllability. We next describe SGAN, a generic extension to GANs with improved disentanglability and controllability for semi-supervised conditional generative modeling.

**Overview.** We consider a semi-supervised setting, where we observe a large set of unlabeled data $X = \{x_i\}_{i=1}^N$. We are interested in both the observed sample x and some hidden structures y of x, and want to build a conditional generator that can generate data $\hat{x}$ that matches the true data distribution of x while obeying the structures specified in y (e.g. generating pictures of digits given 0-9). Besides the unlabeled X, we also have access to a small chunk of data $X_l = \{x_j^l, y_j^l\}_{j=1}^M$ where the structure y is jointly observed. Therefore, our model needs to characterize the joint distribution p(x, y) instead of the marginal p(x), for both fully and partially observed x. As the data generation process is intrinsically complex and usually determined by many factors beyond y, it is necessary to consider other factors that are irrelevant to y, and to separate the hidden space into two parts (y, z), of which y encodes the designated semantics and z includes any other factors of variation [3]. We make a mild assumption that y and z are independent of each other, so that y can be disentangled from z. Our model thus needs to take into consideration the uncertainty of both (x, y) and z, i.e. to characterize the joint distribution p(x, y, z) while being able to disentangle y from z.

Directly estimating p(x, y, z) is difficult, as (1) we have never observed z, and have observed y only for part of x; (2) y and z might become entangled at any time as training proceeds. As an alternative, SGAN builds two inference networks I and C. The two inference networks define two distributions $p_i(z|x)$ and $p_c(y|x)$ that are trained to approximate the true posteriors p(z|x) and p(y|x) using two different adversarial games. The two games are unified via a shared generator $x \sim p_g(x|y, z)$. Marginalizing out z or y yields $p_g(x|z)$ and $p_g(x|y)$:

$$p_g(x|z) = \int_y p(y)\, p_g(x|y, z)\, dy, \qquad p_g(x|y) = \int_z p(z)\, p_g(x|y, z)\, dz, \qquad (1)$$

where p(y) and p(z) are appropriate known priors for y and z. As SGAN is able to perform posterior inference for both z and y given x (even for unlabeled data), we can directly impose constraints [13] that enforce the structures of interest being exclusively captured by y, while the irrelevant factors are encoded in z (as we will show later). Fig. 1 illustrates the key components of SGAN, which we elaborate below.

**Generator $p_g(x|y, z)$.** We assume the following generative process from y, z to x: $z \sim p(z)$, $y \sim p(y)$, $x \sim p(x|y, z)$, where p(z) is chosen as a non-informative prior, and p(y) as an appropriate prior that meets our modeling needs (e.g. a categorical distribution for digit class). We parametrize p(x|y, z) using a neural network generator G, which takes y and z as inputs, and outputs generated samples $x \sim p_g(x|y, z) = G(y, z)$. G can be seen as a decoder in VAE parlance, and its architecture depends on the specific application, such as a deconvolutional neural network for generating images [25, 21].

Figure 1: An overview of the SGAN model: (a) the generator $p_g(x|y, z)$; (b) the adversarial game $\mathcal{L}_{xz}$; (c) the adversarial game $\mathcal{L}_{xy}$; (d) the collaborative game $\mathcal{R}_z$; (e) the collaborative game $\mathcal{R}_y$.
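As a concrete illustration of one such parametrization, the sketch below builds a small deconvolutional G(y, z) in PyTorch that conditions on a one-hot y by concatenating it with z at the input. The layer sizes target a 32x32 single-channel image and are placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative parametrization of p_g(x|y, z): a deconvolutional generator
# that conditions on a categorical y (one-hot) and a noise vector z by
# concatenating them at the input.
class Generator(nn.Module):
    def __init__(self, y_dim=10, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(y_dim + z_dim, 256 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Tanh(),     # 16 -> 32
        )

    def forward(self, y_onehot, z):
        return self.net(torch.cat([y_onehot, z], dim=1))  # x_hat = G(y, z)
```

A generated batch is then simply G(one_hot(y), z) with y and z drawn from their priors.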
**Adversarial game $\mathcal{L}_{xz}$.** Following the adversarially learned inference (ALI) framework, we construct an adversarial game to match the distributions of joint pairs (x, z) drawn from two different factorizations: $p_g(x, z) = p(z) p_g(x|z)$ and $p_i(x, z) = p(x) p_i(z|x)$. Specifically, to draw samples from $p_g(x, z)$, we note that we can first draw the tuple (x, y, z) following $y \sim p(y)$, $z \sim p(z)$, $x \sim p_g(x|y, z)$, and then keep only (x, z); this implicitly performs the marginalization in Eq. 1. On the other hand, we introduce an inference network I : x → z to approximate the true posterior p(z|x). Obtaining $(x, z) \sim p(x) p_i(z|x)$ with I is straightforward: $x \sim p(x)$, $z \sim p_i(z|x) = I(x)$. Training G and I involves finding the Nash equilibrium of the following minimax game $\mathcal{L}_{xz}$ (we slightly abuse $\mathcal{L}_{xz}$ for both the minimax objective and the name of this adversarial game):

$$\min_{I,G} \max_{D_{xz}} \mathcal{L}_{xz} = \mathbb{E}_{x \sim p(x)}[\log D_{xz}(x, I(x))] + \mathbb{E}_{z \sim p(z), y \sim p(y)}[\log(1 - D_{xz}(G(y, z), z))], \qquad (2)$$

where we introduce $D_{xz}$ as a critic network that is trained to distinguish pairs $(x, z) \sim p_g(x, z)$ from those coming from $p_i(x, z)$. This minimax objective reaches its optimum if and only if the conditional distribution $p_g(x|z)$ characterized by G inverts the approximate posterior $p_i(z|x)$, implying $p_g(x, z) = p_i(x, z)$ [4, 5]. As we have never observed z for x, as long as z is assumed to be independent of y, it is reasonable to just set the true joint distribution $p(x, z) = p_g^*(x, z) = p_i^*(x, z)$, where $p_g^*$ and $p_i^*$ denote the optimal distributions when $\mathcal{L}_{xz}$ reaches its equilibrium.

**Adversarial game $\mathcal{L}_{xy}$.** The second adversarial game is built to match the true joint data distribution p(x, y) that has been observed on $X_l$. We introduce another critic network $D_{xy}$ to discriminate $(x, y) \sim p(x, y)$ from $(x, y) \sim p_g(x, y) = p(y) p_g(x|y)$, and build the game $\mathcal{L}_{xy}$ as:

$$\min_G \max_{D_{xy}} \mathcal{L}_{xy} = \mathbb{E}_{(x,y) \sim p(x,y)}[\log D_{xy}(x, y)] + \mathbb{E}_{y \sim p(y), z \sim p(z)}[\log(1 - D_{xy}(G(y, z), y))]. \qquad (3)$$

**Collaborative game $\mathcal{R}_y$.** Although training the adversarial game $\mathcal{L}_{xy}$ theoretically drives $p_g(x, y)$ to concentrate on the true data distribution p(x, y), it turns out to be very difficult to train $\mathcal{L}_{xy}$ to the desired convergence, as (1) the joint distribution p(x, y) characterized by $X_l$ might be biased due to its small size; (2) there is little supervision from $X_l$ to tell G what y essentially represents, and how to generate samples conditioned on y. As a result, G might lack controllability: it might generate low-fidelity samples that are not aligned with their conditions, which will always be rejected by $D_{xy}$. A natural solution to these issues is to allow (learned) posterior inference of y, to reconstruct y from generated x [5]. By minimizing this reconstruction error, we can backpropagate the gradient to G to enhance its controllability. Once $p_g(x|y)$ can generate high-fidelity samples that respect the structures y, we can reuse the generated samples $(x, y) \sim p_g(x, y)$ as true samples in the first term of $\mathcal{L}_{xy}$, to prevent $D_{xy}$ from collapsing onto the biased p(x, y) characterized by $X_l$. Intuitively, we introduce the second inference network C : x → y, which approximates the posterior p(y|x) as $y \sim p_c(y|x) = C(x)$; e.g. C reduces to an N-way classifier if y is categorical. To train $p_c(y|x)$, we define a collaborative (reconstruction) game $\mathcal{R}_y$ in the hidden space of y:

$$\min_{C,G} \mathcal{R}_y = -\mathbb{E}_{(x,y) \sim p(x,y)}[\log p_c(y|x)] - \mathbb{E}_{(x,y) \sim p_g(x,y)}[\log p_c(y|x)], \qquad (4)$$

which aims to minimize the reconstruction error of y in terms of C and G, on both labeled data $X_l$ and generated data $(x, y) \sim p_g(x, y)$. On the one hand, minimizing the first term of $\mathcal{R}_y$ w.r.t. C guides C toward the true posterior p(y|x). On the other hand, minimizing the second term w.r.t. G enhances G with extra controllability: it minimizes the chance that G generates samples that would otherwise be falsely predicted by C. Note that we also minimize the second term w.r.t. C, which proves effective in semi-supervised learning settings that use synthetic samples to augment the predictive power of C. In summary, minimizing $\mathcal{R}_y$ can be seen as a collaborative game between two players C and G that drives $p_g(x|y)$ to match p(x|y) and $p_c(y|x)$ to match the posterior p(y|x).
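Since C is an N-way classifier here, both expectations in Eq. 4 become standard cross-entropy terms. A minimal sketch, assuming PyTorch, a classifier C that outputs logits, and labels given as class indices (all names illustrative):

```python
import torch.nn.functional as F

# Sketch of the collaborative game R_y (Eq. 4). Both terms are minimized
# w.r.t. C; the second term is also minimized w.r.t. G, whose gradients
# flow through the generated batch x_g = G(y_g, z).
def r_y_loss(C, x_l, y_l, x_g, y_g):
    loss_labeled = F.cross_entropy(C(x_l), y_l)    # (x, y) ~ p(x, y), i.e. X_l
    loss_generated = F.cross_entropy(C(x_g), y_g)  # (x, y) ~ p_g(x, y)
    return loss_labeled + loss_generated
```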
**Collaborative game $\mathcal{R}_z$.** As SGAN allows posterior inference for both y and z, we can explicitly impose the constraints $\mathcal{R}_y$ and $\mathcal{R}_z$ to separate y from z during training. To explain, we first note that optimizing the second term of $\mathcal{R}_y$ w.r.t. G actually enforces the structure information to be fully preserved in y, because C is asked to recover the structure y from G(y, z), which is generated conditioned on y, regardless of the uncertainty of z (as z is marginalized out during sampling). Therefore, minimizing $\mathcal{R}_y$ implies the following constraint:

$$\min_{C,G} \mathbb{E}_{y \sim p(y)}\, d\big(p_c(y|G(y, z_1)),\, p_c(y|G(y, z_2))\big), \quad \forall z_1, z_2 \sim p(z),$$

where d(a, b) is some distance function between a and b (e.g. cross entropy if C is an N-way classifier). Conversely, we also want any other unstructured information that is not of our interest to be fully captured in z, without being entangled with y. So we build the second collaborative game $\mathcal{R}_z$ as:

$$\min_{I,G} \mathcal{R}_z = -\mathbb{E}_{(x,z) \sim p_g(x,z)}[\log p_i(z|x)], \qquad (5)$$

where I is required to recover z from samples generated by G conditioned on z, i.e. to reconstruct z in the hidden space. Similar to $\mathcal{R}_y$, minimizing $\mathcal{R}_z$ implies:

$$\min_{I,G} \mathbb{E}_{z \sim p(z)}\, d\big(p_i(z|G(y_1, z)),\, p_i(z|G(y_2, z))\big), \quad \forall y_1, y_2 \sim p(y),$$

and when we model I as a deterministic mapping [4], the distance between distributions is equal to the ℓ2 distance between the outputs of I.

**Theoretical guarantees.** We provide some theoretical results about the SGAN framework under the nonparametric assumption. The proofs of the theorems are deferred to the supplementary materials.

Theorem 3.1. The global minimum of $\max_{D_{xz}} \mathcal{L}_{xz}$ is achieved if and only if $p(x) p_i(z|x) = p(z) p_g(x|z)$. At that point, $D_{xz}^* = \frac{1}{2}$. Similarly, the global minimum of $\max_{D_{xy}} \mathcal{L}_{xy}$ is achieved if and only if $p(x, y) = p(y) p_g(x|y)$. At that point, $D_{xy}^* = \frac{1}{2}$.

Theorem 3.2. There exists a generator $G^*(y, z)$ whose conditional distributions $p_g(x|y)$ and $p_g(x|z)$ can both achieve equilibrium in their own minimax games $\mathcal{L}_{xy}$ and $\mathcal{L}_{xz}$.

Theorem 3.3. Minimizing $\mathcal{R}_z$ w.r.t. I keeps the equilibrium of the adversarial game $\mathcal{L}_{xz}$ unchanged. Similarly, minimizing $\mathcal{R}_y$ w.r.t. C keeps the equilibrium of the adversarial game $\mathcal{L}_{xy}$ unchanged.

Algorithm 1: Training Structured Generative Adversarial Networks (SGAN).

1: Pretrain C by minimizing the first term of Eq. 4 w.r.t. C using $X_l$.
2: repeat
3:   Sample a batch of x: $x_u \sim p(x)$.
4:   Sample batches of pairs (x, y): $(x_l, y_l) \sim p(x, y)$, $(x_g, y_g) \sim p_g(x, y)$, $(x_c, y_c) \sim p_c(x, y)$.
5:   Obtain a batch $(x_m, y_m)$ by mixing data from $(x_l, y_l)$, $(x_g, y_g)$, $(x_c, y_c)$ with a proper mixing proportion.
6:   for k = 1 … K do
7:     Train $D_{xz}$ by maximizing the first term of $\mathcal{L}_{xz}$ using $x_u$ and the second using $x_g$.
8:     Train $D_{xy}$ by maximizing the first term of $\mathcal{L}_{xy}$ using $(x_m, y_m)$ and the second using $(x_g, y_g)$.
9:   end for
10:  Train I by minimizing $\mathcal{L}_{xz}$ using $x_u$ and $\mathcal{R}_z$ using $x_g$.
11:  Train C by minimizing $\mathcal{R}_y$ using $(x_m, y_m)$ (see text).
12:  Train G by minimizing $\mathcal{L}_{xy} + \mathcal{L}_{xz} + \mathcal{R}_y + \mathcal{R}_z$ using $(x_g, y_g)$.
13: until convergence
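The sketch below mirrors one iteration of Algorithm 1 in PyTorch. It assumes y is handled as class indices that G, $D_{xy}$ and C embed internally, and that the critics output probabilities; the priors, the `step` helper and the simple batch concatenation are illustrative stand-ins for the paper's procedure (the mixing proportions and pretraining schedule are omitted).

```python
import torch
import torch.nn.functional as F

def step(opt, loss):
    # One optimizer update on the given loss.
    opt.zero_grad(); loss.backward(); opt.step()

def sample_prior(n, y_dim=10, z_dim=64):
    # y ~ Cat(10), z ~ N(0, I); dimensions follow the MNIST setup.
    return torch.randint(0, y_dim, (n,)), torch.randn(n, z_dim)

def sgan_iteration(G, I, C, D_xz, D_xy, opts, x_u, x_l, y_l, K=1):
    y_g, z_g = sample_prior(x_u.size(0))
    x_g = G(y_g, z_g)                                 # (x_g, y_g) ~ p_g(x, y)
    y_c = C(x_u).argmax(dim=1)                        # (x_u, y_c) ~ p_c(x, y)
    # Line 5: mixed batch from labeled, generated and self-labeled data.
    x_m = torch.cat([x_l, x_g.detach(), x_u])
    y_m = torch.cat([y_l, y_g, y_c])

    for _ in range(K):                                # lines 7-8: train critics
        d_loss = -(torch.log(D_xz(x_u, I(x_u).detach())).mean()
                   + torch.log(1 - D_xz(x_g.detach(), z_g)).mean()
                   + torch.log(D_xy(x_m, y_m)).mean()
                   + torch.log(1 - D_xy(x_g.detach(), y_g)).mean())
        step(opts['D'], d_loss)

    # Line 10: train I on L_xz and on R_z (reconstruct z from x_g).
    i_loss = torch.log(D_xz(x_u, I(x_u))).mean() + F.mse_loss(I(x_g.detach()), z_g)
    step(opts['I'], i_loss)

    # Line 11: train C on R_y using the mixed batch.
    step(opts['C'], F.cross_entropy(C(x_m), y_m))

    # Line 12: train G on L_xy + L_xz + R_y + R_z.
    g_loss = (torch.log(1 - D_xy(x_g, y_g)).mean()
              + torch.log(1 - D_xz(x_g, z_g)).mean()
              + F.cross_entropy(C(x_g), y_g)
              + F.mse_loss(I(x_g), z_g))
    step(opts['G'], g_loss)
```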
**Training.** SGAN is fully differentiable and can be trained end-to-end using stochastic gradient descent, following the strategy in [8] that alternately trains the two critic networks $D_{xy}$, $D_{xz}$ and the other networks G, I and C. Though minimizing $\mathcal{R}_y$ and $\mathcal{R}_z$ w.r.t. G introduces a slight bias, we find empirically that it works well and contributes to disentangling y and z. The training procedure is summarized in Algorithm 1. Moreover, to guarantee that C can be properly trained without bias, we pretrain C by minimizing the first term of $\mathcal{R}_y$ until convergence, and do not minimize $\mathcal{R}_y$ w.r.t. C until G has started generating meaningful samples (usually after several epochs of training). As training proceeds, we gradually increase the proportion of synthetic samples $(x, y) \sim p_g(x, y)$ and $(x, y) \sim p_c(x, y)$ in the stochastic batch, to help the training of $D_{xy}$ and C (see Algorithm 1); please refer to our code on GitHub for details of the mixing proportion. We empirically find that this mutual bootstrapping trick yields improved C and G.

**Discussion and connections.** SGAN is essentially a combination of two adversarial games $\mathcal{L}_{xy}$ and $\mathcal{L}_{xz}$, and two collaborative games $\mathcal{R}_y$ and $\mathcal{R}_z$, where $\mathcal{L}_{xy}$ and $\mathcal{L}_{xz}$ are optimized to match the data distributions in the visible space, while $\mathcal{R}_y$ and $\mathcal{R}_z$ are trained to match the posteriors in the hidden space. It combines the best of GAN-based methods and MLE-based methods: on the one hand, estimating density in the visible space using the GAN-based formulation avoids distributing the probability mass diffusely over the data space [5], from which MLE-based frameworks (e.g. VAE) suffer. On the other hand, incorporating reconstruction-based constraints in the latent space helps enforce the disentanglement between the structured information in y and the unstructured information in z, as argued above.

We also establish some connections between SGAN and existing works [15, 27, 3]. We note that the $\mathcal{L}_{xy}$ game in SGAN is connected to the Triple-GAN framework [15] when its trade-off parameter α = 0. We will empirically show that SGAN yields better controllability of G, and also improved performance on downstream tasks, due to the separate modeling of y and z. SGAN also connects to InfoGAN in the sense that the second term of $\mathcal{R}_y$ (Eq. 4) reduces to the mutual information penalty in InfoGAN under unsupervised settings. However, SGAN and InfoGAN have totally different aims and modeling techniques. SGAN builds a conditional generator that has the semantics of interest y as a fully controllable input (known before training); InfoGAN, in contrast, aims to disentangle some latent variables whose semantics are interpreted after training (by observation). Though extending InfoGAN to semi-supervised settings seems straightforward, successfully learning the joint distribution p(x, y) with very few labels is non-trivial: InfoGAN only maximizes the mutual information between y and G(y, z), bypassing p(y|x) and p(x, y), so its direct extension to semi-supervised settings may fail due to the lack of p(x, y). Moreover, SGAN has dedicated inference networks I and C, while the network Q(x) in InfoGAN shares parameters with the discriminator, which has been argued to be problematic [15, 9], as it may compete with the discriminator and prevent its success in semi-supervised settings. See our ablation study in Section 4.2 and Fig. 3.
Finally, the first term in $\mathcal{R}_y$ is similar to the way Improved-GAN models the conditional p(y|x) for labeled data, but SGAN treats the generated data very differently: Improved-GAN labels $x_g = G(y, z)$ as a new class y = K + 1, whereas SGAN reuses $x_g$ and $x_c$ to mutually boost I, C and G, which is key to the success of semi-supervised learning (see Section 4.2).

## 4 Evaluation

We empirically evaluate SGAN through experiments on different datasets. We show that separately modeling z and y in the hidden space helps better disentangle the semantics of our interest from other irrelevant attributes, and thus yields improved performance for both generative modeling (G) and posterior inference (C, I) (Sections 4.1-4.3). Under the SGAN framework, the learned inference networks and generators can further benefit many downstream applications, such as semi-supervised classification, controllable image generation and style transfer (Sections 4.2-4.3).

**Dataset and configurations.** We evaluate SGAN on three image datasets: (1) MNIST [14]: we use the 60K training images as unlabeled data, sample n ∈ {20, 50, 100} labels for semi-supervised learning following [12, 27], and evaluate on the 10K test images. (2) SVHN [20]: a standard train/test split is provided, and we sample n = 1000 labels from the training set for semi-supervised learning [27, 15, 5]. (3) CIFAR-10: a challenging dataset for conditional image generation that consists of 50K training and 10K test images from 10 object classes; we randomly sample n = 4000 labels [27, 28, 15] for semi-supervised learning. For all datasets, our semantic of interest is the digit/object class, so y is a 10-dim categorical variable. We use a 64-dim Gaussian noise as z on MNIST and a 100-dim uniform noise as z on SVHN and CIFAR-10.

**Implementation.** We implement SGAN using TensorFlow [1] and Theano [2] with distributed acceleration provided by Poseidon [33], which parallelizes lines 7-8 and 10-12 of Algorithm 1. The neural network architectures of C, G and $D_{xy}$ mostly follow those used in Triple-GAN [15], and we design I and $D_{xz}$ according to [5] but with shallower structures to alleviate the training cost. Empirically, SGAN needs 1.3-1.5x more training time than Triple-GAN [15] without parallelization. It is noted that properly weighting the losses of the four games in SGAN during training may lead to performance improvement; however, we simply set them equal without heavy tuning.¹

¹The code is publicly available at https://github.com/thudzj/StructuredGAN.

### 4.1 Controllability and Disentanglability

We evaluate the controllability and disentanglability of SGAN by assessing its generator network G and inference network I, respectively. Specifically, as SGAN is able to perform posterior inference for z, we define a novel quantitative measure based on z to compare its disentanglability to that of other DGMs: we first use the trained I (or the recognition network in VAE-based models) to infer z for unseen x from the test sets. Ideally, as z and y are modeled as independent, when I is trained to approach the true posterior of z, its output, when used as features, shall have weak predictability for y. Accordingly, we use z as features to train a linear SVM classifier to predict the true y, and define the converged accuracy of this classifier as the mutual predictability (MP) measure; we expect lower MP for models that better disentangle y from z. We conduct this experiment on all three datasets, and report the averaged MP measure of five runs in Fig. 2.
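A sketch of how the MP measure could be computed, assuming scikit-learn and the trained inference network wrapped as an `infer_z` callable (illustrative names; the paper does not publish this exact routine):

```python
import numpy as np
from sklearn.svm import LinearSVC

def mutual_predictability(infer_z, x_test, y_test):
    # Infer the unstructured code z = I(x) for each held-out image,
    # then fit a linear SVM predicting the true class y from z.
    z = np.stack([infer_z(x) for x in x_test])
    svm = LinearSVC().fit(z, y_test)
    # The converged accuracy is the MP score: lower accuracy means y
    # is harder to predict from z, i.e. better disentanglement.
    return svm.score(z, y_test)
```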
We compare the following DGMs (that are able to infer z): (1) ALI [5] and (2) VAE [12], trained without label information; (3) CVAE-full²: the M2 model in [11], trained under the fully supervised setting; (4) SGAN, trained under semi-supervised settings. We use 50, 1000 and 4000 labels for the MNIST, SVHN and CIFAR-10 datasets under semi-supervised settings, respectively.

Figure 2: Comparisons of the MP measure for different DGMs on MNIST, SVHN and CIFAR-10 (lower is better).

Clearly, SGAN demonstrates low MP when predicting y using z on all three datasets. Using only 50 labels, SGAN exhibits reasonable MP. In fact, on MNIST with only 20 labels as supervision, SGAN achieves 0.65 MP, outperforming the other baselines by a large margin. These results clearly demonstrate SGAN's ability to disentangle y and z, even when the supervision is very scarce.

On the other hand, better disentanglability also implies improved controllability of G: when y and z are less entangled, it is easier for G to recognize the designated semantics, so G should be able to generate samples that deviate less from y during conditional generation. To verify this, following [9], we use a pretrained gold-standard classifier (0.56% error on the MNIST test set) to classify generated images, and use the condition y as ground truth to calculate the accuracy. In Table 1 we compare SGAN to CVAE-semi and Triple-GAN [15], another strong baseline that is also designed for conditional generation under semi-supervised settings. We use n = 20, 50, 100 labels on MNIST, and observe significantly higher accuracy for both Triple-GAN and SGAN. For comparison, a generator trained by CVAE-full achieves 0.6% error. When fewer labels are available, SGAN outperforms Triple-GAN. The generator in SGAN can generate samples that consistently obey the conditions specified in y, even when there are only two images per class (n = 20) as supervision. These results verify our statement that disentangled semantics further enhance the controllability of the conditional generator G.

| Model | n = 20 | n = 50 | n = 100 |
|---|---|---|---|
| CVAE-semi | 33.05 | 10.72 | 5.66 |
| Triple-GAN | 3.06 | 1.80 | 1.29 |
| SGAN | 1.68 | 1.23 | 0.93 |

Table 1: Errors (%) of generated samples classified by a classifier with 0.56% test error.

### 4.2 Semi-supervised Classification

It is natural to use SGAN for semi-supervised prediction. With a little supervision, SGAN can deliver a conditional generator with reasonably good controllability, with which one can synthesize samples from $p_g(x, y)$ to augment the training of C when minimizing $\mathcal{R}_y$. Once C becomes more accurate, it tends to make fewer mistakes when inferring y from x. Moreover, as we sample $(x, y) \sim p_c(x, y)$ to train $D_{xy}$ during the maximization of $\mathcal{L}_{xy}$, a more accurate C means more available labeled samples (by predicting y from unlabeled x using C) to lower the bias introduced by the small set $X_l$, which in turn can enhance G in the minimization phase of $\mathcal{L}_{xy}$. Consequently, a mutual boosting cycle between G and C is formed. To empirically validate this, we deploy SGAN for semi-supervised classification on MNIST, SVHN and CIFAR-10, and compare the test errors of C to strong baselines in Table 2. To keep the comparisons fair, we adopt the same neural network architectures and hyper-parameter settings as [15], and report the averaged results of 10 runs with randomly sampled labels (every class has an equal number of labels). We note that SGAN outperforms the current state-of-the-art methods across all datasets and settings.
In particular, on MNIST when labeled instances are very scarce (n = 20), SGAN attains the best accuracy (4.0% test error) with significantly lower variance, benefiting from the mutual boosting effects explained above. This is very important for applications under low-shot or even one-shot settings, where the small set $X_l$ might not be a good representative of the data distribution p(x, y).

²For CVAE-full, we use test images and ground-truth labels together to infer z when calculating MP. We are unable to compare to semi-supervised CVAE because, in CVAE, inferring z for test images requires image labels as input, which would be unfair to the other methods.

| Method | MNIST (n = 20) | MNIST (n = 50) | MNIST (n = 100) | SVHN (n = 1000) | CIFAR-10 (n = 4000) |
|---|---|---|---|---|---|
| Ladder [22] | - | - | 0.89 (±0.50) | - | 20.40 (±0.47) |
| VAE [12] | - | - | 3.33 (±0.14) | 36.02 (±0.10) | - |
| CatGAN [28] | - | - | 1.39 (±0.28) | - | 19.58 (±0.58) |
| ALI [5] | - | - | - | 7.3 | 18.3 |
| Improved-GAN [27] | 16.77 (±4.52) | 2.21 (±1.36) | 0.93 (±0.07) | 8.11 (±1.3) | 18.63 (±2.32) |
| Triple-GAN [15] | 5.40 (±6.53) | 1.59 (±0.69) | 0.92 (±0.58) | 5.83 (±0.20) | 18.82 (±0.32) |
| SGAN | 4.0 (±4.14) | 1.29 (±0.47) | 0.89 (±0.11) | 5.73 (±0.12) | 17.26 (±0.69) |

Table 2: Comparisons of semi-supervised classification errors (%) on the MNIST, SVHN and CIFAR-10 test sets.

### 4.3 Qualitative Results

In this section we present qualitative results produced by SGAN's generator under semi-supervised settings. Unless otherwise specified, we use 50, 1000 and 4000 labels on MNIST, SVHN and CIFAR-10, respectively. The shown results are randomly selected without cherry-picking, and more results can be found in the supplementary materials.

**Controllable generation.** To figure out how each module of SGAN contributes to the final results, we conduct an ablation study in Fig. 3, where we plot images generated by SGAN with or without the terms $\mathcal{R}_y$ and $\mathcal{R}_z$ during training. As observed, our full model accurately disentangles y and z. When no collaborative game is involved, the generator easily collapses to a biased conditional distribution defined by the classifier C, which is trained only on a very small set of labeled data with insufficient supervision; for example, the generator cannot clearly distinguish the digits 0, 2, 3, 5 and 8. Incorporating $\mathcal{R}_y$ into training significantly alleviates this issue: an augmented C resolves G's confusion. However, the model still makes mistakes on some confusing classes, such as 3 and 5. $\mathcal{R}_y$ and $\mathcal{R}_z$ connect the two adversarial games to form a mutual boosting cycle. The absence of either of them would break this cycle; consequently, SGAN would be under-constrained and might collapse to some local minima, resulting in both a less accurate classifier C and a less controlled G.

Figure 3: Ablation study: conditional generation results by SGAN (a) without $\mathcal{R}_y$, $\mathcal{R}_z$; (b) without $\mathcal{R}_z$; (c) full model. Each row has the same y while each column shares the same z.

**Visual quality.** Next, we investigate whether a more disentangled (y, z) results in higher visual quality of the generated samples, as it stands to reason that the conditional generator G is much easier to learn when its inputs y and z carry more orthogonal information. We conduct this experiment on CIFAR-10, which consists of natural images with more uncertainty beyond the object categories. In Fig. 4 we compare several state-of-the-art generators to SGAN, without any advanced GAN training strategies (e.g. WGAN, gradient penalties) that are reported to possibly improve visual quality.
We find that SGAN's conditional generator does generate less blurred images with the main objects more salient, compared to Triple-GAN and Improved-GAN w/o minibatch discrimination (see supplementary). For a quantitative measure, we generate 50K images and compute the inception score [27] as 6.91 (±0.07), compared to Triple-GAN's 5.08 (±0.09) and Improved-GAN's 3.87 (±0.03) w/o minibatch discrimination, confirming the advantage of the structured modeling of y and z.

Figure 4: Visual comparison of generated images on CIFAR-10: (a) CIFAR-10 data; (b) Triple-GAN; (c) SGAN. For (b) and (c), each row shares the same y.

Figure 5: (a)-(c): image progression; (d)-(f): style transfer using SGAN.

**Image progression.** To demonstrate that SGAN generalizes well instead of just memorizing the data, we generate images with interpolated z in Fig. 5(a)-(c) [32]. Clearly, the images generated with progression are semantically consistent with y, and change smoothly from left to right. This verifies that SGAN correctly disentangles semantics, and learns accurate class-conditional distributions.

**Style transfer.** We apply SGAN to style transfer [7, 30]. Specifically, as y is modeled as the digit/object category on all three datasets, we suppose z shall encode any other information that is orthogonal to y (presumably style information). To see whether I behaves properly, we use SGAN to transfer the unstructured information in z in Fig. 5(d)-(f): given an image x (the leftmost image), we infer its unstructured code z, and then generate images conditioned on this z but with different y. It is interesting to see that z encodes various aspects of the images, such as shape, texture, orientation and background information, as expected. Moreover, G can correctly transfer this information to other classes.

## 5 Conclusion

We have presented SGAN for semi-supervised conditional generative modeling, which learns from a small set of labeled instances to disentangle the semantics of our interest from other elements in the latent space. We show that SGAN has improved disentanglability and controllability compared to baseline frameworks. SGAN's design is beneficial to many downstream applications: it establishes new state-of-the-art results on semi-supervised classification, and outperforms strong baselines in terms of visual quality and inception score on controllable image generation.

## Acknowledgements

Zhijie Deng and Jun Zhu are supported by NSF China (Nos. 61620106010, 61621136008, 61332007), the MIIT Grant of Int. Man. Comp. Stan (No. 2016ZXFB00001), the Tsinghua Tiangong Institute for Intelligent Computing and the NVIDIA NVAIL Program. Hao Zhang is supported by the AFRL/DARPA project FA872105C0003. Xiaodan Liang is supported by award FA870215D0002.

## References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation, 2016.
[2] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. pages 3-10, 2010.
[3] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172-2180, 2016.
[4] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[5] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[6] Tzu-Chien Fu, Yen-Cheng Liu, Wei-Chen Chiu, Sheng-De Wang, and Yu-Chiang Frank Wang. Learning cross-domain disentangled deep representation with supervision from a single domain. arXiv preprint arXiv:1705.01314, 2017.
[7] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[9] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.
[10] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[11] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581-3589, 2014.
[12] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[13] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[15] Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. In Advances in Neural Information Processing Systems, 2017.
[16] Chongxuan Li, Jun Zhu, Tianlin Shi, and Bo Zhang. Max-margin deep generative models. In Advances in Neural Information Processing Systems, pages 1837-1845, 2015.
[17] Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P Xing. Recurrent topic-transition GAN for visual paragraph generation. arXiv preprint arXiv:1703.07022, 2017.
[18] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
[19] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[20] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[22] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546-3554, 2015.
[23] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060-1069, 2016.
[24] Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Victor Bapst, Matt Botvinick, and Nando de Freitas. Generating interpretable images with controllable structure. In International Conference on Learning Representations, 2017.
[25] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217-225, 2016.
[26] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 448-455, 2009.
[27] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226-2234, 2016.
[28] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
[29] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning GAN for pose-invariant face recognition. In Conference on Computer Vision and Pattern Recognition, 2017.
[30] Hao Wang, Xiaodan Liang, Hao Zhang, Dit-Yan Yeung, and Eric P Xing. ZM-Net: Real-time zero-shot image manipulation network. arXiv preprint arXiv:1703.07255, 2017.
[31] Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318-335. Springer, 2016.
[32] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776-791. Springer, 2016.
[33] Hao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, and Eric Xing. Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arXiv preprint arXiv:1512.06216, 2015.