# Disentangled Sequential Autoencoder

Yingzhen Li¹, Stephan Mandt²

¹University of Cambridge, UK. ²Disney Research, Los Angeles, CA, USA.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

## Abstract

We present a VAE architecture for encoding and generating high-dimensional sequential data, such as video or audio. Our deep generative model learns a latent representation of the data which is split into a static and a dynamic part, allowing us to approximately disentangle latent time-dependent features (dynamics) from features which are preserved over time (content). This architecture gives us partial control over generating content and dynamics by conditioning on either one of these sets of features. In our experiments on artificially generated cartoon video clips and voice recordings, we show that we can convert the content of a given sequence into another one by such content swapping. For audio, this allows us to convert a male speaker into a female speaker and vice versa, while for video we can separately manipulate shapes and dynamics. Furthermore, we give empirical evidence for the hypothesis that stochastic RNNs as latent state models are more efficient at compressing and generating long sequences than deterministic ones, which may be relevant for applications in video compression.

## 1. Introduction

Representation learning remains an outstanding research problem in machine learning and computer vision. Recently, there has been rising interest in disentangled representations, in which each component of the learned features refers to a semantically meaningful concept. In video sequence modelling, for example, an ideal disentangled representation would separate time-independent concepts (e.g. the identity of the object in the scene) from dynamical information (e.g. the time-varying position and the orientation or pose of that object). Such disentangled representations would open up new, efficient ways of compression and style manipulation, among other applications.

Recent work has investigated disentangled representation learning for images within the framework of variational auto-encoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) and generative adversarial networks (GANs) (Goodfellow et al., 2014). Some of these approaches, e.g. the β-VAE method (Higgins et al., 2016), proposed new objective functions and training techniques that encourage disentanglement. On the other hand, network architecture designs that directly enforce factored representations have also been explored, e.g. by Siddharth et al. (2017) and Bouchacourt et al. (2017). These two types of approaches are often mixed together; e.g. the InfoGAN approach (Chen et al., 2016) partitioned the latent space and proposed adding a mutual information regularisation term to the vanilla GAN loss. Mathieu et al. (2016) also partitioned the encoding space into style and content components, and performed adversarial training to encourage datapoints from the same class to have similar content representations but diverse style features.
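For reference (this is not part of the present paper's method), the β-VAE objective mentioned above modifies the standard evidence lower bound by re-weighting the KL term. With encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$, it maximises

$$
\mathcal{L}_{\beta\text{-VAE}}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \beta \, D_{\mathrm{KL}}\big(q_\phi(z|x) \,\|\, p(z)\big),
$$

where setting $\beta > 1$ penalises the capacity of the latent code more strongly and empirically encourages disentangled factors.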
Less research has been conducted on unsupervised learning of disentangled representations of sequences. For video sequence modelling, Villegas et al. (2017) and Denton & Birodkar (2017) utilised different networks to encode the content and dynamics information separately, and trained the auto-encoders with a combination of reconstruction loss and GAN loss. Structured VAEs (Johnson et al., 2016) and Factorised VAEs (Deng et al., 2017) used hierarchical priors to learn more interpretable latent variables. Hsu et al. (2017) designed a structured VAE in the context of speech recognition; their VAE architecture is trained using a combination of the standard variational lower bound and a discriminative regulariser to further encourage disentanglement. More related work is discussed in Section 3.

In this paper, we propose a generative model for unsupervised structured sequence modelling, such as video or audio. We show that, in contrast to previous approaches, a disentangled representation can be achieved by a careful design of the probabilistic graphical model. In the proposed architecture, we explicitly use a latent variable to represent content, i.e. information that is invariant throughout the sequence, and a set of latent variables associated with each frame to represent dynamical information, such as pose and position. Compared to the previous models mentioned above, which usually predict future frames conditioned on the observed sequences, we focus on learning the distribution of the video/audio content and dynamics to enable sequence generation without conditioning. Therefore our model can also generalise to unseen sequences, which is confirmed by our experiments.

In more detail, our contributions are as follows:

- **Controlled generation.** Our architecture allows us to approximately control for content and dynamics when generating videos. We can generate random dynamics for fixed content, and random content for fixed dynamics. This gives us a controlled way of manipulating a video/audio sequence, such as swapping the identity of moving objects or the voice of a speaker.
- **Efficient encoding.** Our representation is more data-efficient than encoding a video frame by frame. By factoring out a separate variable that encodes content, our dynamical latent variables can have smaller dimensions. This may be promising when it comes to end-to-end neural video encoding methods.
- We design a new metric that allows us to verify disentanglement of the latent variables, by investigating the stability of an object classifier over time.
- We give empirical evidence, based on video data from a physics simulator, that for long sequences a stochastic transition model generates more realistic dynamics.

The paper is structured as follows. Section 2 introduces the generative model and the problem setting. Section 3 discusses related work. Section 4 presents three experiments on video and speech data. Finally, Section 5 concludes the paper and discusses future research directions.

## 2. The model

Let $x_{1:T} = (x_1, x_2, \dots, x_T)$ denote a high-dimensional sequence, such as a video with $T$ consecutive frames. Also, assume the data distribution of the training sequences is $p(x_{1:T})$. In this paper, we model the observed data with a latent variable model that separates the representation of time-invariant concepts (e.g. object identities) from those of time-varying concepts (e.g. pose information).

**Generative model.** Consider the following probabilistic model, which is also visualised in Figure 1:

$$
p_\theta(x_{1:T}, z_{1:T}, f) = p_\theta(f) \prod_{t=1}^{T} p_\theta(z_t \mid z_{<t}) \, p_\theta(x_t \mid z_t, f),
$$

where $f$ is the time-invariant (content) latent variable and $z_t$ is the dynamic latent variable associated with frame $t$.
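To make this factorisation concrete, below is a minimal PyTorch sketch of ancestral sampling from such a model. The LSTM transition for $p_\theta(z_t \mid z_{<t})$, the MLP decoder, and all layer sizes are illustrative assumptions for this sketch, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class DisentangledSeqGenerator(nn.Module):
    """Sketch of p(f) * prod_t p(z_t | z_<t) p(x_t | z_t, f)."""

    def __init__(self, f_dim=256, z_dim=32, hidden=512, x_dim=64 * 64):
        super().__init__()
        # Dynamic prior p(z_t | z_<t): an LSTM summarises z_<t and emits
        # the Gaussian parameters for the next dynamic latent.
        self.z_rnn = nn.LSTMCell(z_dim, hidden)
        self.z_prior = nn.Linear(hidden, 2 * z_dim)           # mean, log-variance
        # Frame decoder p(x_t | z_t, f): conditions on both latents.
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + f_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim), nn.Sigmoid(),           # Bernoulli pixel means
        )
        self.f_dim, self.z_dim, self.hidden = f_dim, z_dim, hidden

    @torch.no_grad()
    def sample(self, T, batch=1):
        f = torch.randn(batch, self.f_dim)                    # static latent ~ p(f) = N(0, I)
        h = torch.zeros(batch, self.hidden)
        c = torch.zeros(batch, self.hidden)
        z_prev = torch.zeros(batch, self.z_dim)
        frames = []
        for _ in range(T):
            h, c = self.z_rnn(z_prev, (h, c))
            mu, logvar = self.z_prior(h).chunk(2, dim=-1)
            z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # sample z_t
            frames.append(self.decoder(torch.cat([z_t, f], dim=-1)))  # decode frame x_t
            z_prev = z_t
        return f, torch.stack(frames, dim=1)                  # f and x_{1:T}


# Example: generate a batch of 4 sequences with 8 frames each.
model = DisentangledSeqGenerator()
f, x = model.sample(T=8, batch=4)   # x has shape (4, 8, 4096)
```

Under this factorisation, content swapping amounts to reusing the $f$ inferred from one sequence while taking the $z_{1:T}$ from (or sampling them for) another, which is what enables the controlled generation described above.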