Published as a conference paper at ICLR 2021

# DISENTANGLED RECURRENT WASSERSTEIN AUTOENCODER

Jun Han (PCG, Tencent; junhanjh@tencent.com)
Martin Renqiang Min (NEC Laboratories America; renqiang@nec-labs.com)
Ligong Han (Rutgers University; hanligong@gmail.com)
Li Erran Li (Alexa AI, Amazon; erranlli@gmail.com)
Xuan Zhang (Texas A&M University; floatlazer@gmail.com)

*Equal contribution. Part of Jun Han's work was done before he joined Tencent. Li Erran Li's work was done before he joined Amazon. Xuan Zhang's work was done before he joined Texas A&M University.*

## ABSTRACT

Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning, due to the challenges of generating sequential data. In this paper, we propose the recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between the model distribution and the sequential data distribution, and simultaneously maximizes the mutual information between the input data and each of the disentangled latent factors. This is superior to (recurrent) VAE, which does not explicitly enforce mutual information maximization between the input data and the disentangled latent representations. When the number of actions in sequential data is available as weak supervision, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines with the same settings in terms of disentanglement and unconditional video generation, both quantitatively and qualitatively.

## 1 INTRODUCTION

Unsupervised representation learning is an important research topic in machine learning. It embeds high-dimensional sensory data such as images and videos into a low-dimensional latent space in an unsupervised learning framework, aiming to extract the essential variation factors of the data to help downstream tasks such as classification and prediction (Bengio et al., 2013). In the last several years, disentangled representation learning, which further separates the latent embedding space into exclusive, explainable factors such that each factor captures only one semantic attribute of the sensory data, has received a lot of interest and achieved many empirical successes on static data such as images (Chen et al., 2016; Higgins et al., 2017; Dupont, 2018; Chen et al., 2018; Rubenstein et al., 2018b;a; Kim & Mnih, 2018). For example, the latent representation of handwritten digits can be disentangled into a content factor encoding digit identity and a style factor encoding handwriting style.

In spite of these successes on static data, only a few works have explored unsupervised representation disentanglement of sequential data, due to the challenges of developing generative models of sequential data. Learning disentangled representations of sequential data is important and has many applications.
For example, the latent representation of a smiling-face video can be disentangled into a static part encoding the identity of the person (content factor) and a dynamic part encoding the smiling motion of the face (motion factor). The disentangled representation of the video can potentially be used for many downstream tasks such as classification, retrieval, and synthetic video generation with style transfer.

Most previous unsupervised representation disentanglement models for static data rely heavily on the KL-divergence regularization in a VAE framework (Higgins et al., 2017; Dupont, 2018; Chen et al., 2018; Kim & Mnih, 2018), which has been shown to be problematic because it matches the individual, rather than the aggregated, posterior distribution of the latent code to the same prior (Tolstikhin et al., 2018; Rubenstein et al., 2018b;a). Therefore, extending VAE or recurrent VAE (Chung et al., 2015) to disentangle sequential data in a generative model framework (Hsu et al., 2017; Yingzhen & Mandt, 2018) is not ideal. In addition, recent research (Locatello et al., 2019) has theoretically shown that it is impossible to perform unsupervised disentangled representation learning without inductive biases on both models and data, especially for static data. Fortunately, sequential data such as videos often have clear inductive biases for the disentanglement of content and motion factors, as mentioned in (Locatello et al., 2019). Unlike in the static case, the learned static and dynamic factors of sequential data are not exchangeable.

In this paper, we propose a recurrent Wasserstein Autoencoder (R-WAE) to learn disentangled representations of sequential data. We employ a Wasserstein metric (Arjovsky et al., 2018; Gulrajani et al., 2017; Bellemare et al., 2017) induced from the optimal transport between the model distribution and the underlying data distribution, which has nicer properties (e.g., sum invariance, scale sensitivity, applicability to distributions with non-overlapping supports, and better out-of-sample performance in the worst-case expectation (Esfahani & Kuhn, 2018)) than the KL divergence in VAE (Kingma & Welling, 2014) and β-VAE (Higgins et al., 2017). Leveraging explicit inductive biases in both the sequential data and the model, we encode an input sequence into two parts, a shared static latent code and a dynamic latent code, and sequentially decode each element of the sequence by combining both codes (see the code sketch after the contribution list below). We enforce a fixed prior distribution for the static code and learn a prior for the dynamic code to ensure the consistency of the sequence. The disentangled representations are learned by separately regularizing the posteriors of the latent codes with their corresponding priors.

Our main contributions are summarized as follows: (1) we draw the first connection between minimizing a Wasserstein distance and maximizing mutual information for unsupervised representation disentanglement of sequential data from an information theory perspective; (2) we propose two sets of effective regularizers to learn the disentangled representation in a completely unsupervised manner with explicit inductive biases in both sequential data and models; (3) we incorporate a relaxed discrete latent variable to improve the disentangled learning of actions on real data.
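To make this architectural inductive bias concrete, the following is a minimal PyTorch sketch of the encoder-decoder split described above. All module names, layer choices, and dimensions (e.g., `StaticDynamicEncoder`, `zc_dim`, `zm_dim`) are our own illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class StaticDynamicEncoder(nn.Module):
    """Encode a sequence into one shared static code z_c (per sequence)
    and one dynamic code z_m_t per time step (illustrative sketch)."""
    def __init__(self, x_dim=256, h_dim=128, zc_dim=32, zm_dim=16):
        super().__init__()
        self.frame_feat = nn.Linear(x_dim, h_dim)          # per-frame features
        self.static_rnn = nn.LSTM(h_dim, h_dim, batch_first=True)
        self.to_zc = nn.Linear(h_dim, zc_dim)              # one code per sequence
        self.dynamic_rnn = nn.LSTM(h_dim, h_dim, batch_first=True)
        self.to_zm = nn.Linear(h_dim, zm_dim)              # one code per frame

    def forward(self, x):                                  # x: (B, T, x_dim)
        h = torch.relu(self.frame_feat(x))                 # (B, T, h_dim)
        _, (hn, _) = self.static_rnn(h)
        z_c = self.to_zc(hn[-1])                           # (B, zc_dim), time-invariant
        out, _ = self.dynamic_rnn(h)
        z_m = self.to_zm(out)                              # (B, T, zm_dim), time-varying
        return z_c, z_m

class FrameDecoder(nn.Module):
    """Decode each frame from the concatenation (z_c, z_m_t)."""
    def __init__(self, x_dim=256, zc_dim=32, zm_dim=16, h_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(zc_dim + zm_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim))

    def forward(self, z_c, z_m):                           # (B, zc), (B, T, zm)
        T = z_m.size(1)
        z_c_rep = z_c.unsqueeze(1).expand(-1, T, -1)       # share z_c across time
        return self.net(torch.cat([z_c_rep, z_m], dim=-1))
```

The point of the split is structural: z_c is computed once per sequence and broadcast across time, while z_m_t is produced per frame by a recurrent network, so the two codes cannot trivially swap roles.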
Experiments show that our models achieve state-of-the-art performance in both the disentanglement of static and dynamic latent representations and unconditional video generation, under the same settings as the baselines (Yingzhen & Mandt, 2018; Tulyakov et al., 2018).

## 2 BACKGROUND AND RELATED WORK

**Notation** Let calligraphic letters (e.g., $\mathcal{X}$) denote sets, capital letters (e.g., $X$) denote random variables, and lowercase letters denote their values. Let $D(P_X, P_G)$ be the divergence between the true (but unknown) data distribution $P_X$ (with density $p(x)$) and the latent-variable generative model distribution $P_G$ specified by a prior distribution $P_Z$ (with density $p(z)$) of the latent variable $Z$. Let $D_{\mathrm{KL}}$ denote the KL divergence, $D_{\mathrm{JS}}$ the Jensen-Shannon divergence, and MMD the Maximum Mean Discrepancy (Gretton et al., 2007).

**Optimal Transport Between Distributions** The optimal transport cost, which induces a rich class of divergences between the distributions $P_X$ and $P_G$, is defined as follows:

$$
W(P_X, P_G) := \inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, Y \sim P_G)} \mathbb{E}_{(X, Y) \sim \Gamma}\,[c(X, Y)], \qquad (1)
$$

where $c(X, Y)$ is any measurable cost function and $\mathcal{P}(X \sim P_X, Y \sim P_G)$ is the set of joint distributions of $(X, Y)$ with respective marginals $P_X$ and $P_G$.

**Comparison between WAE (Tolstikhin et al., 2018) and VAE (Kingma & Welling, 2014)** Instead of optimizing over all couplings $\Gamma$ between two random variables in $\mathcal{X}$, Bousquet et al. (2017) and Tolstikhin et al. (2018) show that it is sufficient to find $Q(Z|X)$ such that the marginal $Q_Z := \mathbb{E}_{X \sim P_X}[Q(Z|X)]$ is identical to the prior $P_Z$, as given in the following definition.

**Definition 1.** For any deterministic $P_G(X|Z)$ and any function $G : \mathcal{Z} \to \mathcal{X}$,

$$
W(P_X, P_G) = \inf_{Q:\, Q_Z = P_Z} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))]. \qquad (2)
$$

Definition 1 leads to the following WAE loss $D_{\mathrm{WAE}}$ based on a Wasserstein distance:

$$
\inf_{Q(Z|X)} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))] + \beta\, D(Q_Z, P_Z), \qquad (3)
$$

where the first term is the data reconstruction loss, and the second term is a regularizer that forces the aggregated posterior $Q_Z = \int Q(Z|X)\, dP_X$ to match the prior $P_Z$ (the adversarial autoencoder (AAE) (Makhzani et al., 2015) shares a similar idea with WAE). In contrast, VAE has a different regularizer, $\mathbb{E}_X[D_{\mathrm{KL}}(Q(Z|X), P_Z)]$, which forces the latent posterior distribution of each input to match $P_Z$. Rubenstein et al. (2018a;b) show that WAE achieves better disentanglement than β-VAE (Higgins et al., 2017) on images, which inspires us to design a new representation disentanglement framework for sequential data with several innovations.

**Unsupervised disentangled representation learning** Several generative models have been proposed to learn disentangled representations of sequential data (Denton et al., 2017; Hsu et al., 2017; Yingzhen & Mandt, 2018; Hsieh et al., 2018; Sun et al., 2018; Tulyakov et al., 2018). FHVAE (Hsu et al., 2017) is a VAE-based hierarchical graphical model with factorized Gaussian priors that focuses only on speech and audio data; our R-WAE employs a more powerful recurrent prior and can be applied to both speech and video data. The models in (Sun et al., 2018; Denton et al., 2017; Hsieh et al., 2018) use the first several elements of a sequence to design disentanglement architectures for future sequence prediction.
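As a concrete instantiation of Eq. (3), the sketch below uses MMD as the divergence $D(Q_Z, P_Z)$ with an inverse multiquadric kernel, a common choice in WAE-style models, and a standard Gaussian prior $P_Z$. The bandwidth heuristic, the value of β, and the function names are illustrative assumptions:

```python
import torch

def imq_kernel(a, b, scale=1.0):
    """Inverse multiquadric kernel k(x, y) = C / (C + ||x - y||^2)."""
    C = 2.0 * a.size(1) * scale               # heuristic bandwidth ~ 2 * dim
    sq = torch.cdist(a, b) ** 2               # pairwise squared distances
    return C / (C + sq)

def mmd(z_q, z_p):
    """Unbiased MMD^2 estimate between encoded codes z_q ~ Q_Z and
    prior samples z_p ~ P_Z; both have shape (n, d)."""
    n = z_q.size(0)
    off = 1.0 - torch.eye(n, device=z_q.device)      # drop diagonal terms
    k_qq = (imq_kernel(z_q, z_q) * off).sum() / (n * (n - 1))
    k_pp = (imq_kernel(z_p, z_p) * off).sum() / (n * (n - 1))
    k_qp = imq_kernel(z_q, z_p).mean()
    return k_qq + k_pp - 2.0 * k_qp

def wae_loss(x, x_rec, z_q, beta=10.0):
    """Eq. (3): squared-error reconstruction cost c(X, G(Z)) plus
    beta * MMD(Q_Z, P_Z) with P_Z = N(0, I) (illustrative choices)."""
    rec = ((x - x_rec) ** 2).sum(dim=-1).mean()
    z_p = torch.randn_like(z_q)                      # samples from the prior
    return rec + beta * mmd(z_q, z_p)
```

Note that the regularizer compares a batch of encoded codes against prior samples, matching the aggregated posterior $Q_Z$ to $P_Z$, rather than pushing every individual posterior $Q(Z|X)$ toward the prior as the VAE regularizer does.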
In terms of representation learning by mutual information maximization, our work empirically demonstrates that explicit inductive biases in the data and the model architecture are necessary for successfully learning meaningful disentangled representations of sequential data, whereas the works in (Locatello et al., 2019; Poole et al., 2019; Tschannen et al., 2020; Ozair et al., 2019) address general representation learning, especially on static data.

The works most closely related to ours are MoCoGAN (Tulyakov et al., 2018) and DS-VAE (Yingzhen & Mandt, 2018), which are able to disentangle the variant and invariant parts of sequential data and to perform unconditional sequence generation. MoCoGAN (Tulyakov et al., 2018) is a GAN-based model that can only be applied to settings in which the number of motions is finite, and it cannot encode the latent representation of a sequence. DS-VAE (Yingzhen & Mandt, 2018) is a disentangled sequential autoencoder based on VAE (Kingma & Welling, 2014). Training a VAE is equivalent to minimizing an upper bound of the KL divergence between the empirical data distribution and the generated data distribution, an approach that has been shown to produce inferior disentangled representations of static data compared to generative models employing the Wasserstein metric (Rubenstein et al., 2018a;b).

## 3 PROPOSED APPROACH: DISENTANGLED RECURRENT WASSERSTEIN AUTOENCODER (R-WAE)

Given a high-dimensional sequence $x_{1:T}$, our goal is to learn a disentangled representation consisting of a time-invariant latent code $z^c$ and time-varying latent codes $z^m_t$ along the sequence. Let $z_t = (z^c, z^m_t)$ be the latent code of $x_t$. Let $X_t$, $Z_t$, $Z^c$ and $Z^m_t$ be random variables with realizations $x_t$, $z_t$, $z^c$ and $z^m_t$, respectively, and denote $D = X_{1:T}$. To achieve this goal, we define the following probabilistic generative model by assuming $Z^m_t$ and $Z^c$ are independent:

$$
P(X_{1:T}, Z_{1:T}) = P(Z^c) \prod_{t=1}^{T} P_\psi(Z^m_t \mid Z^m_{<t}) \prod_{t=1}^{T} P_\theta(X_t \mid Z^c, Z^m_t).
$$
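As a sketch of how this factorization supports unconditional generation, the code below draws $z^c$ once from a fixed Gaussian prior, rolls out the dynamic codes from a learned recurrent prior $P_\psi(z^m_t \mid z^m_{<t})$, and decodes each frame from $(z^c, z^m_t)$. The LSTM parameterization, Gaussian output head, and all dimensions are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class RecurrentPrior(nn.Module):
    """Learned recurrent prior P_psi(z_m_t | z_m_{<t}): an LSTM cell emits
    the mean and log-variance of a Gaussian over the next dynamic code."""
    def __init__(self, zm_dim=16, h_dim=64):
        super().__init__()
        self.zm_dim = zm_dim
        self.rnn = nn.LSTMCell(zm_dim, h_dim)
        self.head = nn.Linear(h_dim, 2 * zm_dim)

    def sample(self, batch, T):
        h = torch.zeros(batch, self.rnn.hidden_size)
        c = torch.zeros_like(h)
        z_m = torch.zeros(batch, self.zm_dim)            # z_m_0 convention
        codes = []
        for _ in range(T):
            h, c = self.rnn(z_m, (h, c))                 # condition on z_m_{<t}
            mu, logvar = self.head(h).chunk(2, dim=-1)
            z_m = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            codes.append(z_m)
        return torch.stack(codes, dim=1)                 # (batch, T, zm_dim)

# Unconditional generation: z_c is drawn once from the fixed static prior
# and shared across all T frames; z_m_{1:T} comes from the recurrent prior.
zc_dim, zm_dim, x_dim, B, T = 32, 16, 256, 8, 20
prior = RecurrentPrior(zm_dim=zm_dim)
decoder = nn.Sequential(                                 # toy frame decoder
    nn.Linear(zc_dim + zm_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

z_c = torch.randn(B, zc_dim)                             # z_c ~ N(0, I)
z_m = prior.sample(B, T)                                 # (B, T, zm_dim)
z_c_rep = z_c.unsqueeze(1).expand(-1, T, -1)             # share z_c over time
video = decoder(torch.cat([z_c_rep, z_m], dim=-1))       # (B, T, x_dim)
```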