# Unsupervised Disentangled Representation Learning with Analogical Relations

Zejian Li, Yongchuan Tang†, Yongxing He
College of Computer Science, Zhejiang University, Hangzhou 310027, China
{zejianlee, yctang, heyongxing}@zju.edu.cn

† Corresponding author

## Abstract

Learning the disentangled representation of interpretable generative factors of data is one of the foundations that allow artificial intelligence to think like people. In this paper, we propose an analogical training strategy for unsupervised disentangled representation learning in generative models. Analogy is one of the typical cognitive processes, and our proposed strategy is based on the observation that sample pairs in which one sample differs from the other in one specific generative factor show the same analogical relation. Thus, the generator is trained to generate sample pairs from which a designed classifier can identify the underlying analogical relation. In addition, we propose a disentanglement metric called the subspace score, which is inspired by subspace learning methods and does not require supervised information. Experiments show that our proposed training strategy allows the generative models to find the disentangled factors, and that our methods give competitive performance compared with state-of-the-art methods.

## 1 Introduction

This paper is concerned with the problem of unsupervised disentangled representation learning in generative models. A disentangled representation is a kind of distributed feature representation in which disjoint dimensions of a latent code reflect different high-level generative factors of the data. Specifically, the disentangled representation can separate the explanatory factors that interact nonlinearly in real-world data, such as the object shape, the material property, and the light source. Thus, the disentangled representation is helpful for a large variety of AI tasks [Bengio et al., 2013].

The generative model is helpful in learning the disentangled representation. It is a methodology to learn a probability distribution, and it generates a new sample according to a code in the hidden space. By learning appropriate parameters, it can gradually learn to generate new data from the same distribution as the target one [Goodfellow et al., 2014].

When a disentangled representation is learned in a generative model, disjoint dimensions of a hidden code can model the data generative factors separately. These underlying factors can explain the major variation in the data. When only one factor varies while all others are fixed, the generated sequence of samples shows a change that is interpretable to human beings. For example, when we generate an image of a hand-written digit, one component of the code may be associated with the stroke width; when its value is changed, the stroke width of the generated digit becomes smaller.

A large body of work has been devoted to this problem. When the generative factors are predefined, the disentangled representation can be learned by reconstructing the data after the codes are swapped [Peng et al., 2017; Denton and Birodkar, 2017]. Moreover, when the data are labeled with attributes, the representation can be learned via the mapping between the data and the attributes [Wang et al., 2017] or via the consistency between the variation of the data and the transformation of the latent code [Kulkarni et al., 2015; Worrall et al., 2017].
On the other hand, unsupervised learning of a disentangled representation has been a major challenge. β-VAE [Higgins et al., 2017] and DIP-VAE [Kumar et al., 2017] learn the disentanglement of the latent code by encouraging the latent distribution to be close to the standard normal distribution, in which the random variables are independent. InfoGAN [Chen et al., 2016] also uses statistical independence, and the method is motivated by the principle of the maximization of the mutual information. In summary, most of the existing works disentangle the factors by taking advantage of supervised signals or by using the statistical independence of the prior distribution.

Different from the existing methods, our solution is motivated by analogy. A key observation is that each interpretable disentangled factor is associated with an analogical relation of sample pairs. In the example of generating hand-written digits, disjoint components of a hidden code can be associated with factors such as rotation, stroke thickness, and width. Figure 1 gives an illustrative example of analogical pairs, which manifests the factor of stroke thickness. Conversely, this factor of variation can be learned from the analogical sample pairs.

Figure 1: An illustrative example of the analogical relation. These hand-written digits are generated by our proposed AnaGAN method trained on the MNIST dataset. The stroke becomes thinner from left to right. A pair of 1s and a pair of 8s are selected, which show the analogical relation of stroke thickness. The relation of the first bolder 1 to the second thinner 1 is similar to that of the first bolder 8 to the second thinner 8. Furthermore, the 1s here are rotated while the 8s are not. Given this difference in rotation, the variation of stroke width shows the same pattern.

The discussed relation is known as the proportional analogical relation, which has the general form $(a : b :: c : d)$, where the pairs $(a, b)$ and $(c, d)$ have relational similarity, such as mammals : lungs :: fish : gills [Gust et al., 2008]. The analogical relation can be extended to a batch of $n$ sample pairs, which have the form $(a_1 : b_1 :: a_2 : b_2 :: \ldots :: a_n : b_n)$. Namely, the pairs of $a_i$ and $b_i$ for all $i = 1, \ldots, n$ share the same analogical relation. The analogy has been a central part of human intelligence and cognition, so from this perspective, learning the interpretable factors via the analogical relation is close to the human-like cognition process.

Based on the observation above, we propose our analogical training strategy on top of the generative model. To begin a training step, the generator produces an analogical sample pair from a code pair in which a predefined component of the latent code differs while all other components are fixed. Next, an extra classifier tries to identify the predefined analogical relation behind the generated analogical pair. Then, the classifier and the generator are trained together so that the classifier makes the correct decision. This is a cooperative game: the generator should learn to generate sample pairs characterized by analogical relations which the classifier can capture. Figure 2 gives an illustration of the analogical training process. This is our main contribution.

Figure 2: The analogical training process. We assume each code component represents a generative factor, and each generative factor has a unique analogical relation. The analogical relation $r_i$ of the $i$th factor is implied by a pair of codes that differ only in the $i$th component. On the left of the figure, the horizontal axis is the $i$th dimension, and the vertical axis stands for all the other dimensions. The code pair $c_1$ and $c_2$ differ only in the $i$th component. To start the training process, the code pair is passed through the generator $G$, which generates the analogical sample pair $x_1$ and $x_2$. Then, the classifier $R$ tries to identify the underlying relation $r_i$. Finally, the generator and the classifier both update their parameters so that the analogical relation can be classified more accurately, and thus the $i$th component gradually learns to reflect a meaningful factor. The training process can be applied to a batch of code pairs, whose analogical sample pairs all share the relation $r_i$.
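To make this cooperative game concrete, the following is a minimal PyTorch sketch of one analogical training step. The module interfaces `G(c, z)` and `R(x1, x2)`, the code dimension `k`, the noise dimension, and the exact perturbation scheme are our own illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def sample_code_pair(batch, k):
    """Sample latent code pairs that differ only in one component r,
    realizing the analogical relation of factor r (illustrative sketch)."""
    r = torch.randint(0, k, (batch,))              # which factor varies
    c1 = torch.randn(batch, k)
    c2 = c1.clone()
    offset = 1.0 + torch.rand(batch)               # magnitude of the change
    sign = torch.randint(0, 2, (batch,)) * 2 - 1   # vary in both directions
    c2[torch.arange(batch), r] += sign * offset
    return c1, c2, r

def analogical_step(G, R, optimizer, k=10, noise_dim=62, batch=32):
    """One step of the cooperative game: G generates an analogical pair,
    R identifies the varied factor, and both are updated together."""
    c1, c2, r = sample_code_pair(batch, k)
    z = torch.randn(batch, noise_dim)              # noise shared by the pair
    x1, x2 = G(c1, z), G(c2, z)                    # analogical sample pair
    loss = F.cross_entropy(R(x1, x2), r)           # R's classification loss
    optimizer.zero_grad()                          # optimizer holds the
    loss.backward()                                # parameters of G and R
    optimizer.step()
    return loss.item()
```

Note that the optimizer holds the parameters of both `G` and `R`, so minimizing the classification loss updates the two networks cooperatively rather than adversarially.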
The other contribution is our proposal of a disentanglement metric for the learned representation, called the subspace score. This metric is based on two assumptions inspired by the subspace learning methodology. The first assumption is that each kind of variation of the generated samples forms an affine subspace, so these subspaces are expected to be recovered by a subspace clustering algorithm; the disentanglement of the factors is therefore approximated by the clustering performance. The second is that the union of these subspaces should be close to the majority of the observed samples; this closeness is measured by the distance between the observed samples and the affine space spanned by the generated samples. Our proposed subspace score is the combination of these two measures. The subspace score does not require supervised information, so it can be applied to real-world unlabeled datasets. To the best of our knowledge, this is the first time an unsupervised disentanglement metric has been proposed.

The rest of the paper is organized as follows. We introduce our proposed analogical training strategy in Section 2 and the disentanglement metric in Section 3. Next, we review related works in Section 4. In Section 5 we demonstrate the experimental results and compare our method with other methods using the subspace score. We conclude the paper in Section 6. Our source code will be available at https://github.com/ZejianLi/analogical-training.

## 2 Method

### 2.1 Preliminaries

In this part, we briefly review two generative models, the Variational Auto-Encoder (VAE) [Kingma and Welling, 2013] and the Generative Adversarial Network (GAN) [Goodfellow et al., 2014]. A generative model learns the target probability distribution by generating new samples whose distribution is close to the target one. Formally, given a latent representation $z \in \mathcal{Z}$ sampled from a predefined distribution $P_z$, we can generate a new sample $x \in \mathcal{X}$ by sampling from $P_\theta(x \mid z)$. Here $P_\theta(x \mid z)$ is assumed to be deterministic and described by a function $G: \mathcal{Z} \to \mathcal{X}$ parameterized by $\theta$, so that $x = G(z)$. By learning an appropriate $\theta$, the generated distribution $P_\theta$ gets closer to the ground-truth distribution $P_r$.

VAE has emerged as a popular deep generative model. A key step in VAE is to reinterpret the log-likelihood of an observed sample $x$ under the generated distribution $P_\theta$ as

$$\log p_\theta(x) = \mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) + \mathcal{L}(\theta, \phi; x). \tag{1}$$

Here $q_\phi(z \mid x)$ is the variational posterior distribution with parameter $\phi$, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence between two distributions.
$\mathcal{L}(\theta, \phi; x)$ is the evidence lower bound, defined as

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)} \left[ \log p_\theta(x \mid z) \right] - \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z)). \tag{2}$$

The first term is the expected log-likelihood of recovering $x$, and the second term is the KL-divergence between the variational posterior distribution and the prior distribution. Since we always have $\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$, we can increase the log-likelihood of the data by maximizing the evidence lower bound. Hence, the optimization problem of the model is

$$\max_{G, Q} L(G, Q) \quad \text{s.t.} \quad L(G, Q) = \mathbb{E}_{x \sim P_r} \mathcal{L}(\theta, \phi; x). \tag{3}$$

Here we use the function $Q$, parameterized by $\phi$, to describe $q_\phi(z \mid x)$.

GAN is another framework to train generative models. It learns the real distribution by training the generator $G$ to confuse an adversarial discriminator $D$. The discriminator $D$ tries to distinguish whether the data are generated from $P_\theta$ or sampled from $P_r$, while the generator $G$ gradually learns to generate data that $D$ cannot correctly classify. Formally, given a sample $x$, $D(x)$ approximates the probability that $x$ is sampled from $P_r$. The formulation is given as follows:

$$\min_G \max_D V(D, G) \quad \text{s.t.} \quad V(D, G) = \mathbb{E}_{x \sim P_r} \log D(x) + \mathbb{E}_{z \sim P_z} \log(1 - D(G(z))). \tag{4}$$

To learn the disentangled representation in GAN, we divide the latent representation into the continuous code $c$ and the noise $z$. The latent code $c$ tries to capture the disentangled representation, while the noise $z$ fits other details. Thus the generated sample is given by $x = G(c, z)$, where $c \sim P_c$ and $z \sim P_z$. We do not adopt this division in VAE, since we find it leads to a trivial solution. For consistency, we write $x = G(c, z)$ in the rest of the paper.
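To make Eqs. (1)–(3) concrete, here is a minimal sketch of the evidence lower bound for a VAE with a diagonal Gaussian posterior and a standard normal prior. The Bernoulli decoder (with sigmoid output) and the encoder outputs `mu`, `log_var` are illustrative assumptions.

```python
import torch

def elbo(x, decoder, mu, log_var):
    """Evidence lower bound L(theta, phi; x) of Eq. (2), assuming the
    encoder has produced q(z|x) = N(mu, diag(exp(log_var))) and the
    prior p(z) is the standard normal N(0, I)."""
    # One-sample Monte Carlo estimate via the reparameterization trick.
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    x_recon = decoder(z)          # assumed to output values in [0, 1]
    # Expected log-likelihood term (Bernoulli decoder assumed here).
    log_px = -torch.nn.functional.binary_cross_entropy(
        x_recon, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return log_px - kl            # maximize this quantity
```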
### 2.2 Analogical Training Strategy

In this part, we describe our proposed analogical training strategy, which learns the disentangled representation of interpretable factors. Our method is based on the observation that a disentangled generative factor is related to a unique analogical relation of sample pairs. As shown in Figure 1, we have two different digits, and by reducing their stroke width we obtain their twins. The relation of the first digit to its twin is similar to the relation of the other digit to its twin. Even though the configurations of the other factors are different, the change of the given factor shows a variation of the same pattern. Thus, the stroke-width factor is exhibited by the analogical relation. Conversely, given a generator, if the analogical relation behind different generated pairs can always be recognized, we believe the learned factor is disentangled.

More concretely, each component of the hidden code represents a generative factor, and the change of the generative factor is reflected by the variation in samples. Sample pairs that differ only in a single generative factor are defined as analogical pairs, and together they show the analogical relation unique to that generative factor. The model learns the disentangled factor by trying to generate analogical pairs.

Formally, we define a classifier $R$ to recognize the analogical relations. Given an analogical pair of samples $x_1$ and $x_2$, $R$ identifies the generative factor $r$ in which the two samples differ. This is to maximize the log-likelihood

$$\mathbb{E}_r \mathbb{E}_{x_1, x_2 \mid r} \log R(r \mid x_1, x_2), \tag{5}$$

in which $r$ is the generative factor, a random variable with a categorical distribution over $\{1, \ldots, k\}$ when we have $k$ factors. $\mathbb{E}_{x_1, x_2 \mid r}$ is short for $\mathbb{E}_{(x_1, x_2) \sim P(x_1, x_2 \mid r)}$. In particular, $R(r \mid x_1, x_2)$ is the probability that $R$ believes $x_1$ and $x_2$ differ in the factor $r$, which means $x_1$ and $x_2$ show the analogical relation unique to $r$.

Since samples in an analogical pair differ in one factor, we can generate the pair from two latent codes that differ in one component. We denote the code pair as $c_1$ and $c_2$, so that $x_1 = G(c_1, z)$ and $x_2 = G(c_2, z)$. Thus, (5) can be rewritten as

$$K(G, R) = \mathbb{E}_r \mathbb{E}_{c_1, c_2 \mid r} \mathbb{E}_{x_1 \sim G(c_1, z),\, x_2 \sim G(c_2, z)} \log R(r \mid x_1, x_2). \tag{6}$$

The process of the analogical training is visualized in Figure 2. The expectation can be approximated with the Monte Carlo method.

Theoretically, $K(G, R)$ gives a lower bound of the mutual information $I(r; x_1, x_2)$ between the generative factor and the sample pair. Formally, $I(r; x_1, x_2) = H(r) - H(r \mid x_1, x_2)$, where the first term is the entropy of $r$, and the second term is the conditional entropy of $r$ given $x_1$ and $x_2$. We have $H(r) = \log k$ and

$$H(r \mid x_1, x_2) = -\mathbb{E}_{x_1, x_2} \mathbb{E}_{r \mid x_1, x_2} \log P(r \mid x_1, x_2). \tag{7}$$

The computation of the posterior distribution $P(r \mid x_1, x_2)$ is intractable, so we introduce the auxiliary distribution $R$ to obtain a variational lower bound:

$$\begin{aligned} -H(r \mid x_1, x_2) &= \mathbb{E}_{x_1, x_2} \mathbb{E}_{r \mid x_1, x_2} \left[ \log \frac{P(r \mid x_1, x_2)}{R(r \mid x_1, x_2)} + \log R(r \mid x_1, x_2) \right] \\ &= \mathbb{E}_{x_1, x_2} \, \mathrm{KL}(P(r \mid x_1, x_2) \,\|\, R(r \mid x_1, x_2)) + \mathbb{E}_r \mathbb{E}_{x_1, x_2 \mid r} \log R(r \mid x_1, x_2). \end{aligned} \tag{8}$$

Since the KL-divergence is non-negative, we have

$$-H(r \mid x_1, x_2) \ge \mathbb{E}_r \mathbb{E}_{x_1, x_2 \mid r} \log R(r \mid x_1, x_2) = \mathbb{E}_r \mathbb{E}_{c_1, c_2 \mid r} \mathbb{E}_{x_1 \sim G(c_1, z),\, x_2 \sim G(c_2, z)} \log R(r \mid x_1, x_2). \tag{9}$$

Therefore, $K(G, R) + \log k$ is a lower bound of $I(r; x_1, x_2)$; since $\log k$ is a constant, maximizing $K(G, R)$ maximizes this bound. The bound is tight when $R(r \mid x_1, x_2)$ is close to $P(r \mid x_1, x_2)$, in which case maximizing $K(G, R)$ is equivalent to maximizing the mutual information $I(r; x_1, x_2)$. Intuitively, this means the generator should make an effort to produce sample pairs whose inner difference shows the variation of the generative factor.

Combining (3) and (6), we have our proposed Analogical VAE (AnaVAE), learned via the optimization problem

$$\max_{G, Q, R} \; L(G, Q) + \lambda K(G, R), \tag{10}$$

where $\lambda$ is a hyperparameter controlling the effect of $K(G, R)$. Similarly, the optimization problem of our Analogical GAN (AnaGAN) is

$$\min_{G, R} \max_D \; V(D, G) - \lambda K(G, R). \tag{11}$$

Empirically, we set $\lambda$ to 1 by default. All the functions $G$, $Q$, $D$ and $R$ are modeled by neural networks.
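Putting the pieces together, the following is a minimal sketch of one AnaVAE update implementing Eq. (10), reusing the `elbo` and `sample_code_pair` sketches above. All interfaces, including the lack of a code/noise split in the VAE latent, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ana_vae_step(x, encoder, decoder, R, optimizer, lam=1.0, k=10):
    """One update of Eq. (10): maximize L(G, Q) + lambda * K(G, R),
    written as a loss to minimize (illustrative sketch)."""
    mu, log_var = encoder(x)
    neg_elbo = -elbo(x, decoder, mu, log_var)     # -L(G, Q), Eq. (2)

    # Monte Carlo estimate of -K(G, R) from Eq. (6): decode a code pair
    # differing only in component r, and let R classify the pair. The VAE
    # latent code itself plays the role of c here (no code/noise split).
    c1, c2, r = sample_code_pair(x.size(0), k)
    x1, x2 = decoder(c1), decoder(c2)
    neg_k = F.cross_entropy(R(x1, x2), r)

    loss = neg_elbo + lam * neg_k
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For AnaGAN (Eq. (11)), the same `neg_k` term would be added to the generator's and classifier's losses alongside the adversarial value function, while `D` keeps its usual discriminator loss; we omit that variant here.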
## 3 Disentanglement Metric

In this part, we introduce our disentanglement metric. The metric is inspired by the subspace clustering algorithm [Elhamifar and Vidal, 2013]. We assume that sequences of samples showing the same variation lie in a low-dimensional affine subspace. Then, if the factors in the representation are disentangled, the subspaces of different variations are independent.¹ In this case, samples of the same variation can be grouped by the subspace clustering method. An illustrative example of subspaces of variations is given in Figure 3.

¹ A set of $k$ linear subspaces $\{S_i \subset \mathbb{R}^D\}_{i=1}^k$ are independent if $\dim(\oplus_{i=1}^k S_i) = \sum_{i=1}^k \dim(S_i)$, where $\oplus$ is the direct sum.

Figure 3: An illustration of the subspaces of variations. The samples of the digits 2 and 9 vary in stroke width, and those of 5 vary in writing style. We assume that sequences of the same variation lie on an affine subspace. Here, the sequences of 2 and 9 lie on the subspace of stroke variation, and the sequence of 5 lies on the subspace of writing-style variation. For visualization, the two affine planes are not independent here.

The other assumption is that the observed samples should be close to the affine space spanned by the generated samples, because the generated samples should be indistinguishable from the real ones. Our metric is designed according to these two assumptions. Since neither assumption involves supervised information, the subspace score can be applied to unlabeled datasets.

The subspace clustering algorithm separates data according to their linear-combination relations and thus recovers the low-dimensional affine subspaces. Given our assumption above, the disentanglement of factors can be approximated by how well the generated clusters of different variations are correctly separated by the subspace clustering method. To be more specific, given a learned generative factor $r_i$, we generate a sequence of samples $\{x_j \mid j = 1, \ldots, m\}$ by varying the $i$th component of the code while all other configurations are fixed; this sequence shows the variation of $r_i$. A sample cluster consists of several different sequences of the same variation, and $k$ clusters of different variations are sampled and given to the subspace clustering algorithm. Formally, suppose we have $Y = [Y_1, \ldots, Y_k]$. For $i \in \{1, \ldots, k\}$, $Y_i$ is a group of sample sequences, and the samples in each sequence differ only in the generative factor $r_i$. The coefficient matrix $U$ is then learned by

$$\hat{U} = \arg\min_U \|YU - Y\|_F^2 + \lambda' R(U), \quad \text{s.t.} \quad \mathrm{diag}(U) = 0. \tag{12}$$

Here $\|\cdot\|_F$ is the Frobenius norm, $R(U)$ is the regularization term with parameter $\lambda'$, and $\mathrm{diag}(\cdot)$ denotes the diagonal elements of the matrix. We use the Orthogonal Matching Pursuit (OMP) method [Tropp and Gilbert, 2007] to learn the coefficients, since we have prior knowledge of the number of samples in every cluster, which is the product of the number of sample sequences in each cluster and the size of each sequence. The affinity matrix is then defined as $|U| + |U|^T$, where $|\cdot|$ takes the absolute value. Ideally, this matrix is block-diagonal, given that samples in different $Y_i$'s lie in independent affine subspaces. Finally, with the affinity matrix, the spectral clustering method is used to infer the clustering assignment $\hat{C}$. The clustering performance is measured by the normalized mutual information $\mathrm{NMI}(\hat{C}, C)$, where $C$ is the ground-truth division of the $Y_i$'s in $Y$. This performance measures how well clusters of different variations can be separated, and thus it evaluates the disentanglement of factors. This is the first part of our measure.

The other part is the mean distance from the observed samples to the affine space spanned by the generated samples, which measures the closeness described in the second assumption. Formally, given an observed sample $x$, the distance from $x$ to its projection on the affine space spanned by $Y$ is

$$d(x, Y) = \min_u \|Yu - x\|_2. \tag{13}$$

When all the data are normalized to unit length, we have $d(x, Y) \in [0, 1]$. Given the observed set $X$ with $n$ samples, the averaged distance is $d = \frac{1}{n} \sum_{x \in X} d(x, Y)$, and we use $1 - d$ to estimate the closeness.

Finally, our proposed subspace score is defined as

$$\alpha \cdot \mathrm{NMI}(\hat{C}, C) + (1 - \alpha)(1 - d). \tag{14}$$

We set $\alpha$ to 0.5 by default. The higher the subspace score, the better the model learns the disentangled representation and generates new samples.
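A minimal sketch of the metric computation with NumPy and scikit-learn follows. It mirrors Eqs. (12)–(14): per-sample OMP regressions build the coefficient matrix with a zero diagonal, spectral clustering on $|U| + |U|^T$ gives the NMI term, and a least-squares projection gives the distance term. The exact regularization and normalization details are our simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

def subspace_score(Y, labels, X_real, n_clusters, n_nonzero, alpha=0.5):
    """Y: (N, D) generated samples, rows unit-normalized; labels: which
    factor's variation each row belongs to (ground truth C); X_real:
    (n, D) observed samples, rows unit-normalized."""
    N = Y.shape[0]
    U = np.zeros((N, N))
    for j in range(N):                    # express each sample by the others,
        others = np.delete(Y, j, axis=0)  # which enforces diag(U) = 0
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                        fit_intercept=False)
        omp.fit(others.T, Y[j])           # columns of others.T are samples
        U[j, np.arange(N) != j] = omp.coef_
    A = np.abs(U) + np.abs(U).T           # affinity matrix |U| + |U|^T
    pred = SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(A)
    nmi = normalized_mutual_info_score(labels, pred)
    # Mean distance from each real sample to the span of the generated ones.
    u, *_ = np.linalg.lstsq(Y.T, X_real.T, rcond=None)
    d = np.linalg.norm(Y.T @ u - X_real.T, axis=0).mean()
    return alpha * nmi + (1 - alpha) * (1 - d)
```

Here `n_nonzero` would be set from the known cluster size, following the paper's use of OMP with that prior knowledge.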
One may argue that the subspaces given by principal component analysis (PCA) would get the highest score, because the assumptions of the subspace score are intrinsically compatible with PCA. However, the subspace score is meant to measure the variations of the generated samples. PCA lacks the ability to generate new samples, and it is not guaranteed to learn a concordant variation in each independent subspace. Thus, we do not think it is appropriate to apply the proposed metric to PCA.

A novel disentanglement metric is also proposed in [Higgins et al., 2017], which measures the independence and interpretability of factors simultaneously. However, that metric is applied to a synthetic dataset of 2D shapes with predefined factors, not to real-world datasets.

## 4 Related Works

Several methods have been proposed for unsupervised disentangled representation learning. InfoGAN [Chen et al., 2016] designs an extra network that recovers the code values from the generated samples to learn disentanglement. This is based on the principle of maximizing the mutual information, and thus InfoGAN shares its theoretical foundation with our strategy. The difference between our strategy and InfoGAN is that our methods recover the relative relation between samples. We believe that the relation between samples is more important than the code values, because the code values have little physical meaning. For example, we cannot infer how many pixels wide the stroke of a generated digit is from the code; however, the comparison between two sample codes leads to the conclusion that one digit is bolder than the other. Thus, the analogical training of sample pairs is conceptually more straightforward.

Different from InfoGAN, β-VAE [Higgins et al., 2017] is a variant of VAE. It learns the factorized factors by putting more emphasis on the KL-divergence between the variational posterior and the standard normal prior, at the cost of reconstruction accuracy. To avoid this cost, DIP-VAE [Kumar et al., 2017] designs an extra regularization term which minimizes the KL-divergence between the expectation of the variational posterior over the data and the prior. The main difference between these methods and ours is the way disentanglement is encouraged. The existing methods disentangle the factors by leveraging the statistical independence of the standard normal prior distribution. In contrast, our analogical training strategy directly requires that the change of one factor does not result in variations of the other factors in the generated samples. This constraint allows the model to learn the disentanglement of factors directly.

## 5 Experiments

In this section, we present the experimental results on five image datasets: MNIST [LeCun et al., 1998], CelebA [Liu et al., 2015], Flower [Nilsback and Zisserman, 2008], CUB [Wah et al., 2011] and Chairs [Aubry et al., 2014]. Specifically, we compare our methods with other state-of-the-art methods using the subspace score.

### 5.1 Implementation Details

Implementation details of the experiments are summarized here.
For AnaGAN, we use the network architecture of DCGAN [Radford et al., 2015]. We use the WGAN-GP loss [Gulrajani et al., 2017] instead of the original GAN loss, and the parameter `num_critic` is set to 3. For the first 100 epochs, we only optimize $V(D, G)$, because at the beginning of training $G$ has not yet captured the distribution and fails to generate sufficiently good samples, so $R$ cannot learn useful relations from $G$. Both the noise and the code are sampled from the standard normal distribution. We use the Adam optimizer [Kingma and Ba, 2014] with a learning rate of 0.00002 and a momentum of 0.5. The batch size is 32. AnaVAE shares most of the configuration of AnaGAN. The encoder network $Q$ borrows the major structure of $D$ in AnaGAN. The learning rate for the Adam optimizer is 0.0001. In the experiment on MNIST with labels, we combine all methods in the comparison with AC-GAN [Odena et al., 2016] to incorporate the label information.

In both AnaVAE and AnaGAN, $R$ has a structure similar to $D$, but the numbers of feature maps in the convolutional layers are halved. When learning the analogical relation of the factor $r_i$, we first sample two identical codes $c_1$ and $c_2$. Then we set $c_{2i}$ (the $i$th component of $c_2$) to $c_{1i} - v$ or $c_{1i} + v$, where $v$ is sampled from the uniform distribution over $[1, 2]$. If we took $c_{2i} = c_{1i} - v$ only, $R$ might separate a generative factor into two symmetric ones whose variations are in opposite directions. The noise vectors $z_1$ and $z_2$ are identical. We add a dropout layer [Srivastava et al., 2014] after each nonlinear activation layer in $R$ to avoid overfitting. The proposed algorithms are implemented with PyTorch.

To compute the subspace score, a cluster of ten sample sequences is generated for each factor, and each sequence has five samples. Each sequence is generated by varying the corresponding component of the code from -2 to 2 with an interval of 1 while keeping the other components fixed. We compute the subspace score over five different sets of generated samples and report the average. We implement it with scikit-learn.
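As a sketch of this sampling protocol, under the same assumed generator interface as in the earlier sketches:

```python
import torch

@torch.no_grad()
def factor_cluster(G, factor, k=10, noise_dim=62,
                   n_sequences=10, values=(-2, -1, 0, 1, 2)):
    """Generate one cluster for the subspace score: n_sequences sequences
    that vary only the given code component over a fixed grid of values,
    with the other components and the noise held fixed per sequence
    (illustrative assumptions)."""
    samples = []
    for _ in range(n_sequences):
        c = torch.randn(1, k).repeat(len(values), 1)   # shared configuration
        c[:, factor] = torch.tensor(values, dtype=torch.float)
        z = torch.randn(1, noise_dim).repeat(len(values), 1)
        samples.append(G(c, z).flatten(1))             # one sample sequence
    return torch.cat(samples)   # (n_sequences * len(values), D) matrix
```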
### 5.2 Comparison

Figures 4 and 5 provide a qualitative comparison of our proposed methods with β-VAE and DIP-VAE. The pictures are generated by varying a specific latent variable from -3 to 3 while the others are fixed to zero, so the samples in the central column of each figure grid are the same. This guarantees that the effect of only one factor is investigated at a time.

Figure 4: Latent factors learned on the Flower dataset. The pictures are generated by varying a specific latent variable from -3 to 3 while the others are zero. Each figure grid shows the variation of similar factors, and each row shows the samples of the same method. The models learn to disentangle factors including the color temperature of the flower (a), the variation of the color from yellow to magenta (b), and the number of flowers (c). The pictures are best viewed magnified on screen.

Figure 4 shows the factors learned on the Flower dataset. The models automatically learn factors including the color temperature of the flower, the variation of the color from yellow to magenta, and the number of flowers. The first two factors concern the colors, while the last concerns the structure of the picture. β-VAE and DIP-VAE tend to generate blurry pictures in which the petals and pistils are not distinguishable, while AnaVAE generates flowers in a clearer fashion and AnaGAN can even generate flowers with detailed textures. In particular, although all the methods learn the variation of the number of flowers in the data, β-VAE and DIP-VAE fail to represent two flowers in one picture in a clean way, while our proposed methods can clearly separate the two flowers. The color of the flowers here is entangled with the flower shapes, since we can seldom have a sunflower in magenta in reality.

Figure 5: Latent factors learned on the CelebA dataset. The pictures are generated by varying a specific latent variable from -3 to 3 while the others are zero. Each figure grid shows the variation of similar factors, and each row shows the samples generated by the same method. The models learn to disentangle factors including the emotion (a), the transformation from female to male (b), and the azimuth (c). The pictures are best viewed magnified on screen.

Figure 5 shows the comparison on the CelebA dataset. The models are able to disentangle factors of emotion, gender, and azimuth of the faces. β-VAE tends to give faces in a blurry pattern, but it disentangles the factors in a relatively explicit way. DIP-VAE and AnaVAE have similar clarity here, but the factors they learn are slightly entangled with other variations. Specifically, the factors of emotion they learn are related to the width of the face: while the faces become smiling from left to right, they also become fatter. Lastly, AnaGAN gives faces with much more detail than the other methods, but it tends to entangle the factors with other variations; the first two factors are entangled with the hair color. Overall, our proposed methods show performance comparable with the other methods.

Table 1: Subspace score. The best performances are highlighted in bold. MNIST scores are reported without / with labels.

| Method | MNIST (w/o / w/ labels) | CelebA | Flower | CUB | Chairs |
| --- | --- | --- | --- | --- | --- |
| VAE (untrained) [Kingma and Welling, 2013] | 0.208 / 0.417 | 0.095 | 0.132 | 0.145 | 0.085 |
| VAE [Kingma and Welling, 2013] | **0.552** / **0.608** | 0.546 | 0.551 | 0.555 | 0.565 |
| InfoGAN [Chen et al., 2016] | 0.389 / 0.482 | 0.484 | 0.445 | 0.452 | 0.198 |
| β-VAE (β = 20) [Higgins et al., 2017] | 0.508 / 0.582 | 0.512 | 0.510 | 0.508 | 0.538 |
| DIP-VAE (λ = 10) [Kumar et al., 2017] | **0.552** / **0.608** | 0.542 | 0.550 | 0.552 | 0.561 |
| AnaGAN (ours) | 0.403 / 0.495 | 0.541 | 0.478 | 0.491 | 0.536 |
| AnaVAE (ours) | 0.504 / 0.587 | **0.586** | **0.603** | **0.599** | **0.580** |

Table 1 reports the subspace scores of the models. AnaVAE outperforms the other methods on most datasets, while DIP-VAE has the highest score on the MNIST dataset. VAE has a performance similar to DIP-VAE; it has been discussed in [Kumar et al., 2017] that VAE also has some ability to disentangle factors in the data. AnaGAN does not achieve high scores here: since the subspace score uses the mean squared error to estimate reconstructions, it is hard for AnaGAN's samples with detailed textures to attain high reconstruction accuracy. InfoGAN suffers from the same difficulty. In summary, AnaVAE shows competitive performance compared with the other methods in terms of the subspace score.

## 6 Conclusion

In this paper, we propose the analogical training strategy to learn the disentangled representation without supervision. Based on the observation that there exists an analogical relation unique to each generative factor, the proposed strategy learns the disentangled factors from the generated analogical sample pairs. The strategy designs a classifier to help the generator produce sample pairs with consistent analogical relations. We also propose the subspace score to measure the disentanglement of the factors; it does not require label information in the dataset. Experiments show that our proposed methods can uncover interpretable factors in the data and give competitive performance compared with other state-of-the-art algorithms. In future work, we will explore different measures of disentanglement and the application of the disentangled representation in foundational AI problems, such as conceptual space learning.
## Acknowledgments

This work is funded by the National Natural Science Foundation of China (NSFC) under Grants No. 61773336 and No. 91748127. We would like to extend our gratitude to the data providers. Zejian Li would like to thank Wei Li, Chao Wang and Ting He of Zhejiang University for helpful comments.

## References

[Aubry et al., 2014] Mathieu Aubry, Daniel Maturana, Alexei A. Efros, Bryan C. Russell, and Josef Sivic. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3762–3769. IEEE, 2014.

[Bengio et al., 2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[Denton and Birodkar, 2017] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4417–4426. Curran Associates, Inc., 2017.

[Elhamifar and Vidal, 2013] Ehsan Elhamifar and René Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.

[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[Gulrajani et al., 2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[Gust et al., 2008] Helmar Gust, Ulf Krumnack, Kai-Uwe Kühnberger, and Angela Schwering. Analogical reasoning: A core of cognition. In Proceedings of the 31st Annual German Conference on AI, KI 2008, volume 22, pages 8–12. Springer, 2008.

[Higgins et al., 2017] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Kingma and Welling, 2013] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[Kulkarni et al., 2015] Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Joshua B. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547. Curran Associates, Inc., 2015.

[Kumar et al., 2017] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.

[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Liu et al., 2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), pages 3730–3738. IEEE, 2015.

[Nilsback and Zisserman, 2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.

[Odena et al., 2016] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.

[Peng et al., 2017] Xi Peng, Xiang Yu, Kihyuk Sohn, Dimitris N. Metaxas, and Manmohan Chandraker. Reconstruction-based disentanglement for pose-invariant face recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1632–1641, 2017.

[Radford et al., 2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[Tropp and Gilbert, 2007] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

[Wah et al., 2011] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.

[Wang et al., 2017] Chaoyue Wang, Chaohui Wang, Chang Xu, and Dacheng Tao. Tag disentangled generative adversarial network for object image re-rendering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI, pages 2901–2907, 2017.

[Worrall et al., 2017] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Interpretable transformations with encoder-decoder networks. In The IEEE International Conference on Computer Vision (ICCV), volume 4, pages 5726–5735, 2017.