Published as a conference paper at ICLR 2021

# DISENTANGLED RECURRENT WASSERSTEIN AUTOENCODER

Jun Han (PCG, Tencent; junhanjh@tencent.com)
Martin Renqiang Min (NEC Laboratories America; renqiang@nec-labs.com)
Ligong Han (Rutgers University; hanligong@gmail.com)
Li Erran Li (Alexa AI, Amazon; erranlli@gmail.com)
Xuan Zhang (Texas A&M University; floatlazer@gmail.com)

*Equal contribution. Part of Jun Han's work was done before he joined Tencent. Li Erran Li's work was done before he joined Amazon. Xuan Zhang's work was done before he joined Texas A&M University.*

## ABSTRACT

Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning, due to the challenges of generating sequential data. In this paper, we propose the recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between the model distribution and the sequential data distribution, and simultaneously maximizes the mutual information between the input data and each of the disentangled latent factors. This is superior to (recurrent) VAE, which does not explicitly enforce mutual information maximization between the input data and the disentangled latent representations. When the number of actions in sequential data is available as weak supervision, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines with the same settings in terms of disentanglement and unconditional video generation, both quantitatively and qualitatively.

## 1 INTRODUCTION

Unsupervised representation learning is an important research topic in machine learning. It embeds high-dimensional sensory data such as images and videos into a low-dimensional latent space in an unsupervised learning framework, aiming to extract the essential variation factors of the data to help downstream tasks such as classification and prediction (Bengio et al., 2013). In the last several years, disentangled representation learning, which further separates the latent embedding space into exclusive, explainable factors such that each factor captures only one semantic attribute of the sensory data, has received a lot of interest and achieved many empirical successes on static data such as images (Chen et al., 2016; Higgins et al., 2017; Dupont, 2018; Chen et al., 2018; Rubenstein et al., 2018b;a; Kim & Mnih, 2018). For example, the latent representation of handwritten digits can be disentangled into a content factor encoding digit identity and a style factor encoding handwriting style.

In spite of these successes on static data, only a few works have explored unsupervised representation disentanglement of sequential data, due to the challenges of developing generative models of sequential data. Learning disentangled representations of sequential data is important and has many applications.
For example, the latent representation of a smiling-face video can be disentangled into a static part encoding the identity of the person (content factor) and a dynamic part encoding the smiling motion of the face (motion factor). The disentangled representation of the video can potentially be used for many downstream tasks such as classification, retrieval, and synthetic video generation with style transfer.

Most previous unsupervised representation disentanglement models for static data rely heavily on the KL-divergence regularization in a VAE framework (Higgins et al., 2017; Dupont, 2018; Chen et al., 2018; Kim & Mnih, 2018), which has been shown to be problematic because it matches the individual, rather than the aggregated, posterior distribution of the latent code to the same prior (Tolstikhin et al., 2018; Rubenstein et al., 2018b;a). Therefore, extending VAE or recurrent VAE (Chung et al., 2015) to disentangle sequential data in a generative model framework (Hsu et al., 2017; Yingzhen & Mandt, 2018) is not ideal. In addition, recent research (Locatello et al., 2019) has theoretically shown that it is impossible to perform unsupervised disentangled representation learning without inductive biases on both models and data, especially for static data. Fortunately, sequential data such as videos often have clear inductive biases for the disentanglement of content and motion factors, as mentioned in (Locatello et al., 2019). Unlike in the static case, the learned static and dynamic factors of sequential data are not exchangeable.

In this paper, we propose a recurrent Wasserstein Autoencoder (R-WAE) to learn disentangled representations of sequential data. We employ a Wasserstein metric (Arjovsky et al., 2018; Gulrajani et al., 2017; Bellemare et al., 2017) induced from the optimal transport between the model distribution and the underlying data distribution, which has nicer properties (e.g., sum invariance, scale sensitivity, applicability to distributions with non-overlapping supports, and better out-of-sample performance in the worst-case expectation (Esfahani & Kuhn, 2018)) than the KL divergence in VAE (Kingma & Welling, 2014) and β-VAE (Higgins et al., 2017). Leveraging explicit inductive biases in both the sequential data and the model, we encode an input sequence into two parts, a shared static latent code and a dynamic latent code, and sequentially decode each element of the sequence by combining both codes (see the code sketch after the contribution list below). We enforce a fixed prior distribution for the static code and learn a prior for the dynamic code to ensure the consistency of the sequence. The disentangled representations are learned by separately regularizing the posteriors of the latent codes with their corresponding priors.

Our main contributions are summarized as follows: (1) we draw the first connection between minimizing a Wasserstein distance and maximizing mutual information for unsupervised representation disentanglement of sequential data from an information theory perspective; (2) we propose two sets of effective regularizers to learn the disentangled representation in a completely unsupervised manner with explicit inductive biases in both sequential data and models; (3) we incorporate a relaxed discrete latent variable to improve the disentangled learning of actions on real data.
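To make this architectural inductive bias concrete, the following is a minimal PyTorch sketch of the encoder-decoder split described above. All module names, layer choices, and dimensions (e.g., `StaticDynamicEncoder`, `zc_dim`, `zm_dim`) are our own illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class StaticDynamicEncoder(nn.Module):
    """Encode a sequence into one shared static code z_c (per sequence)
    and one dynamic code z_m_t per time step (illustrative sketch)."""
    def __init__(self, x_dim=256, h_dim=128, zc_dim=32, zm_dim=16):
        super().__init__()
        self.frame_feat = nn.Linear(x_dim, h_dim)          # per-frame features
        self.static_rnn = nn.LSTM(h_dim, h_dim, batch_first=True)
        self.to_zc = nn.Linear(h_dim, zc_dim)              # one code per sequence
        self.dynamic_rnn = nn.LSTM(h_dim, h_dim, batch_first=True)
        self.to_zm = nn.Linear(h_dim, zm_dim)              # one code per frame

    def forward(self, x):                                  # x: (B, T, x_dim)
        h = torch.relu(self.frame_feat(x))                 # (B, T, h_dim)
        _, (hn, _) = self.static_rnn(h)
        z_c = self.to_zc(hn[-1])                           # (B, zc_dim), time-invariant
        out, _ = self.dynamic_rnn(h)
        z_m = self.to_zm(out)                              # (B, T, zm_dim), time-varying
        return z_c, z_m

class FrameDecoder(nn.Module):
    """Decode each frame from the concatenation (z_c, z_m_t)."""
    def __init__(self, x_dim=256, zc_dim=32, zm_dim=16, h_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(zc_dim + zm_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim))

    def forward(self, z_c, z_m):                           # (B, zc), (B, T, zm)
        T = z_m.size(1)
        z_c_rep = z_c.unsqueeze(1).expand(-1, T, -1)       # share z_c across time
        return self.net(torch.cat([z_c_rep, z_m], dim=-1))
```

The point of the split is structural: z_c is computed once per sequence and broadcast across time, while z_m_t is produced per frame by a recurrent network, so the two codes cannot trivially swap roles.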
Experiments show that our models achieve state-of-the-art performance in both the disentanglement of static and dynamic latent representations and unconditional video generation, under the same settings as the baselines (Yingzhen & Mandt, 2018; Tulyakov et al., 2018).

## 2 BACKGROUND AND RELATED WORK

**Notation** Let calligraphic letters (e.g., $\mathcal{X}$) denote sets, capital letters (e.g., $X$) denote random variables, and lowercase letters denote their values. Let $D(P_X, P_G)$ be the divergence between the true (but unknown) data distribution $P_X$ (with density $p(x)$) and the latent-variable generative model distribution $P_G$ specified by a prior distribution $P_Z$ (with density $p(z)$) of the latent variable $Z$. Let $D_{\mathrm{KL}}$ denote the KL divergence, $D_{\mathrm{JS}}$ the Jensen-Shannon divergence, and MMD the Maximum Mean Discrepancy (Gretton et al., 2007).

**Optimal Transport Between Distributions** The optimal transport cost, which induces a rich class of divergences between the distributions $P_X$ and $P_G$, is defined as follows:

$$
W(P_X, P_G) := \inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, Y \sim P_G)} \mathbb{E}_{(X, Y) \sim \Gamma}\,[c(X, Y)], \qquad (1)
$$

where $c(X, Y)$ is any measurable cost function and $\mathcal{P}(X \sim P_X, Y \sim P_G)$ is the set of joint distributions of $(X, Y)$ with respective marginals $P_X$ and $P_G$.

**Comparison between WAE (Tolstikhin et al., 2018) and VAE (Kingma & Welling, 2014)** Instead of optimizing over all couplings $\Gamma$ between two random variables in $\mathcal{X}$, Bousquet et al. (2017) and Tolstikhin et al. (2018) show that it is sufficient to find $Q(Z|X)$ such that the marginal $Q_Z := \mathbb{E}_{X \sim P_X}[Q(Z|X)]$ is identical to the prior $P_Z$, as given in the following definition.

**Definition 1.** For any deterministic $P_G(X|Z)$ and any function $G : \mathcal{Z} \to \mathcal{X}$,

$$
W(P_X, P_G) = \inf_{Q:\, Q_Z = P_Z} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))]. \qquad (2)
$$

Definition 1 leads to the following WAE loss $D_{\mathrm{WAE}}$ based on a Wasserstein distance:

$$
\inf_{Q(Z|X)} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))] + \beta\, D(Q_Z, P_Z), \qquad (3)
$$

where the first term is the data reconstruction loss, and the second term is a regularizer that forces the aggregated posterior $Q_Z = \int Q(Z|X)\, dP_X$ to match the prior $P_Z$ (the adversarial autoencoder (AAE) (Makhzani et al., 2015) shares a similar idea with WAE). In contrast, VAE has a different regularizer, $\mathbb{E}_X[D_{\mathrm{KL}}(Q(Z|X), P_Z)]$, which forces the latent posterior distribution of each input to match $P_Z$. Rubenstein et al. (2018a;b) show that WAE achieves better disentanglement than β-VAE (Higgins et al., 2017) on images, which inspires us to design a new representation disentanglement framework for sequential data with several innovations.

**Unsupervised disentangled representation learning** Several generative models have been proposed to learn disentangled representations of sequential data (Denton et al., 2017; Hsu et al., 2017; Yingzhen & Mandt, 2018; Hsieh et al., 2018; Sun et al., 2018; Tulyakov et al., 2018). FHVAE (Hsu et al., 2017) is a VAE-based hierarchical graphical model with factorized Gaussian priors that focuses only on speech and audio data; our R-WAE employs a more powerful recurrent prior and can be applied to both speech and video data. The models in (Sun et al., 2018; Denton et al., 2017; Hsieh et al., 2018) use the first several elements of a sequence to design disentanglement architectures for future sequence prediction.
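As a concrete instantiation of Eq. (3), the sketch below uses MMD as the divergence $D(Q_Z, P_Z)$ with an inverse multiquadric kernel, a common choice in WAE-style models, and a standard Gaussian prior $P_Z$. The bandwidth heuristic, the value of β, and the function names are illustrative assumptions:

```python
import torch

def imq_kernel(a, b, scale=1.0):
    """Inverse multiquadric kernel k(x, y) = C / (C + ||x - y||^2)."""
    C = 2.0 * a.size(1) * scale               # heuristic bandwidth ~ 2 * dim
    sq = torch.cdist(a, b) ** 2               # pairwise squared distances
    return C / (C + sq)

def mmd(z_q, z_p):
    """Unbiased MMD^2 estimate between encoded codes z_q ~ Q_Z and
    prior samples z_p ~ P_Z; both have shape (n, d)."""
    n = z_q.size(0)
    off = 1.0 - torch.eye(n, device=z_q.device)      # drop diagonal terms
    k_qq = (imq_kernel(z_q, z_q) * off).sum() / (n * (n - 1))
    k_pp = (imq_kernel(z_p, z_p) * off).sum() / (n * (n - 1))
    k_qp = imq_kernel(z_q, z_p).mean()
    return k_qq + k_pp - 2.0 * k_qp

def wae_loss(x, x_rec, z_q, beta=10.0):
    """Eq. (3): squared-error reconstruction cost c(X, G(Z)) plus
    beta * MMD(Q_Z, P_Z) with P_Z = N(0, I) (illustrative choices)."""
    rec = ((x - x_rec) ** 2).sum(dim=-1).mean()
    z_p = torch.randn_like(z_q)                      # samples from the prior
    return rec + beta * mmd(z_q, z_p)
```

Note that the regularizer compares a batch of encoded codes against prior samples, matching the aggregated posterior $Q_Z$ to $P_Z$, rather than pushing every individual posterior $Q(Z|X)$ toward the prior as the VAE regularizer does.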
In terms of representation learning by mutual information maximization, our work empirically demonstrates that explicit inductive biases in the data and the model architecture are necessary for successfully learning meaningful disentangled representations of sequential data, whereas the works in (Locatello et al., 2019; Poole et al., 2019; Tschannen et al., 2020; Ozair et al., 2019) address general representation learning, especially on static data.

The works most closely related to ours are MoCoGAN (Tulyakov et al., 2018) and DS-VAE (Yingzhen & Mandt, 2018), which are able to disentangle the variant and invariant parts of sequential data and to perform unconditional sequence generation. MoCoGAN (Tulyakov et al., 2018) is a GAN-based model that can only be applied to settings in which the number of motions is finite, and it cannot encode the latent representation of a sequence. DS-VAE (Yingzhen & Mandt, 2018) is a disentangled sequential autoencoder based on VAE (Kingma & Welling, 2014). Training a VAE is equivalent to minimizing an upper bound of the KL divergence between the empirical data distribution and the generated data distribution, an approach that has been shown to produce inferior disentangled representations of static data compared to generative models employing the Wasserstein metric (Rubenstein et al., 2018a;b).

## 3 PROPOSED APPROACH: DISENTANGLED RECURRENT WASSERSTEIN AUTOENCODER (R-WAE)

Given a high-dimensional sequence $x_{1:T}$, our goal is to learn a disentangled representation consisting of a time-invariant latent code $z^c$ and time-varying latent codes $z^m_t$ along the sequence. Let $z_t = (z^c, z^m_t)$ be the latent code of $x_t$. Let $X_t$, $Z_t$, $Z^c$ and $Z^m_t$ be random variables with realizations $x_t$, $z_t$, $z^c$ and $z^m_t$, respectively, and denote $D = X_{1:T}$. To achieve this goal, we define the following probabilistic generative model by assuming $Z^m_t$ and $Z^c$ are independent:

$$
P(X_{1:T}, Z_{1:T}) = P(Z^c) \prod_{t=1}^{T} P_\psi(Z^m_t \mid Z^m_{<t}) \prod_{t=1}^{T} P_\theta(X_t \mid Z^c, Z^m_t).
$$
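As a sketch of how this factorization supports unconditional generation, the code below draws $z^c$ once from a fixed Gaussian prior, rolls out the dynamic codes from a learned recurrent prior $P_\psi(z^m_t \mid z^m_{<t})$, and decodes each frame from $(z^c, z^m_t)$. The LSTM parameterization, Gaussian output head, and all dimensions are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class RecurrentPrior(nn.Module):
    """Learned recurrent prior P_psi(z_m_t | z_m_{<t}): an LSTM cell emits
    the mean and log-variance of a Gaussian over the next dynamic code."""
    def __init__(self, zm_dim=16, h_dim=64):
        super().__init__()
        self.zm_dim = zm_dim
        self.rnn = nn.LSTMCell(zm_dim, h_dim)
        self.head = nn.Linear(h_dim, 2 * zm_dim)

    def sample(self, batch, T):
        h = torch.zeros(batch, self.rnn.hidden_size)
        c = torch.zeros_like(h)
        z_m = torch.zeros(batch, self.zm_dim)            # z_m_0 convention
        codes = []
        for _ in range(T):
            h, c = self.rnn(z_m, (h, c))                 # condition on z_m_{<t}
            mu, logvar = self.head(h).chunk(2, dim=-1)
            z_m = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            codes.append(z_m)
        return torch.stack(codes, dim=1)                 # (batch, T, zm_dim)

# Unconditional generation: z_c is drawn once from the fixed static prior
# and shared across all T frames; z_m_{1:T} comes from the recurrent prior.
zc_dim, zm_dim, x_dim, B, T = 32, 16, 256, 8, 20
prior = RecurrentPrior(zm_dim=zm_dim)
decoder = nn.Sequential(                                 # toy frame decoder
    nn.Linear(zc_dim + zm_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

z_c = torch.randn(B, zc_dim)                             # z_c ~ N(0, I)
z_m = prior.sample(B, T)                                 # (B, T, zm_dim)
z_c_rep = z_c.unsqueeze(1).expand(-1, T, -1)             # share z_c over time
video = decoder(torch.cat([z_c_rep, z_m], dim=-1))       # (B, T, x_dim)
```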