# Contrastively Disentangled Sequential Variational Autoencoder

Junwen Bai (Cornell University, jb2467@cornell.edu), Weiran Wang (Google, weiranwang@google.com), Carla Gomes (Cornell University, gomes@cs.cornell.edu)

## Abstract

Self-supervised disentangled representation learning is a critical task in sequence modeling. The learnt representations contribute to better model interpretability and data generation, and improve sample efficiency for downstream tasks. We propose a novel sequence representation learning method, named Contrastively Disentangled Sequential Variational Autoencoder (C-DSVAE), to extract and separate the static (time-invariant) and dynamic (time-variant) factors in the latent space. Unlike previous sequential variational autoencoder methods, we use a novel evidence lower bound which maximizes the mutual information between the input and the latent factors, while penalizing the mutual information between the static and dynamic factors. We leverage contrastive estimation of the mutual information terms during training, together with simple yet effective augmentation techniques, to introduce additional inductive biases. Our experiments show that C-DSVAE significantly outperforms the previous state-of-the-art methods on multiple metrics.

## 1 Introduction

The goal of self-supervised learning methods is to extract useful and general representations without any supervision, and to further facilitate downstream tasks such as generation and prediction [1]. Despite the difficulty of this task, many existing works have shed light on this field across different domains such as computer vision [2, 3, 4, 5], natural language processing [6, 7, 8] and speech processing [9, 10, 11, 12, 13] (see also the many references therein).

As the quality of learnt representations has gradually improved, recent research has put more emphasis on learning disentangled representations. Disentangled latent variables may capture separate variations of the data generation process, which can carry semantic meaning, allow unwanted variations to be removed to lower the sample complexity of downstream learning [14, 15], and enable more controllable generation [16, 17, 18]. These advantages have led to a rapidly growing research area studying various principles and algorithmic techniques for disentangled representation learning [19, 20, 21, 22, 23, 24, 25, 26].

One concern raised in [24] is that without any inductive bias, it is extremely hard to learn meaningful disentangled representations. On the other hand, this concern is much alleviated in scenarios where the known structure of the data can be exploited. In this work, we are concerned with representation learning for sequence data, whose unique structure can be utilized for disentanglement learning. More specifically, for many sequence data, the variations can be explained by a dichotomy of a static (time-invariant) factor and dynamic (time-variant) factors, each varying independently of the other. For example, representations of a video recording the movements of a cartoon character could be disentangled into the character identity (static) and the actions (dynamic). For audio data, the representations should separate the speaker information (static) from the linguistic content (dynamic).
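Concretely, the learning objective described in the abstract can be written schematically as follows. This is a hedged sketch only: the mutual-information terms are those stated in the abstract, but the weights $\alpha$, $\beta$, $\gamma$ and the exact grouping of terms are illustrative assumptions, not the paper's derived ELBO.

```latex
% Schematic objective: standard sequential-VAE ELBO terms, plus MI terms
% that reward informative latents and penalize static-dynamic dependence.
% The weights \alpha, \beta, \gamma are illustrative assumptions.
\mathcal{L} =
    \underbrace{\mathbb{E}_{q(s, z_{1:T} \mid x_{1:T})}\!\left[\log p(x_{1:T} \mid s, z_{1:T})\right]
    - \mathrm{KL}\!\left(q(s \mid x_{1:T}) \,\|\, p(s)\right)
    - \mathrm{KL}\!\left(q(z_{1:T} \mid x_{1:T}) \,\|\, p(z_{1:T})\right)}_{\text{sequential VAE ELBO}}
    + \alpha\, I(x_{1:T}; s)
    + \beta\, I(x_{1:T}; z_{1:T})
    - \gamma\, I(s; z_{1:T})
```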
Figure 1: Illustration of our C-DSVAE model. The left panel shows the general structure of the sequence-to-sequence auto-encoding process: each frame is passed to an LSTM cell; dynamic factors $z_{1:T}$ are extracted at each time step; the static factor $s$ is extracted by summarizing the full sequence; the generation/reconstruction of frame $i$ depends on $s$ and $z_i$. The right panel depicts the contrastive learning module of C-DSVAE: $x^m_{1:T}$ is the motion augmentation of $x_{1:T}$ and $x^c_{1:T}$ is the content augmentation of $x_{1:T}$; the dynamic factors $z^m_{1:T}$ of $x^m_{1:T}$ serve as the positive sample for the anchor $z_{1:T}$ in the contrastive estimation w.r.t. the motion, and similarly the static factor $s^c$ of $x^c_{1:T}$ serves as the positive sample for $s$ w.r.t. the content.

We propose the Contrastively Disentangled Sequential Variational Autoencoder (C-DSVAE), a method that seeks a clean separation of the static and dynamic factors of sequence data. Our method extends the previously proposed sequential variational autoencoder (VAE) framework, and performs learning with a different evidence lower bound (ELBO) which naturally contains mutual information (MI) terms that encourage disentanglement. Because high-dimensional complex distributions (e.g., those of the dynamic factors) are difficult to estimate, we further incorporate contrastive estimation of the MI terms, with systematic data augmentation techniques that modify either the static or the dynamic factors of the input sequence. This estimation method turns out to be more effective than minibatch-sampling-based estimates, and introduces additional inductive biases towards the desired invariances. To our knowledge, we are the first to synergistically combine contrastive estimation and sequential generative models in a principled manner for learning disentangled representations.

We validate C-DSVAE on four datasets from the video and audio domains. The experimental results show that our method consistently outperforms the previous state-of-the-art (SOTA) methods, both quantitatively and qualitatively.

## 2 C-DSVAE

We denote the observed input sequence as $x_{1:T} = \{x_1, x_2, \dots, x_T\}$, where $x_i$ represents the input feature at time step $i$ (e.g., $x_i$ could be a video frame or the spectrogram feature of a short audio segment), and $T$ is the sequence length. The latent representations are divided into the static factor $s$ and the dynamic factors $z_{1:T}$, where $z_i$ is the learnt dynamic representation at time step $i$.

### 2.1 Probabilistic Model

We assume that in the ground-truth generation process, $z_i$ depends on $z_{<i}$.
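Combining this assumption with the dependencies shown in Figure 1 (the reconstruction of frame $i$ depends on $s$ and $z_i$, and the static and dynamic factors vary independently), the joint distribution factorizes as below. This is a sketch consistent with those stated assumptions; the exact parameterization of each conditional is left open here.

```latex
% Generative factorization implied by the assumptions above: a sequence-level
% static prior p(s), an autoregressive dynamic prior p(z_i | z_{<i}), and a
% per-frame decoder p(x_i | s, z_i). Parameterizations are unspecified.
p(x_{1:T}, s, z_{1:T}) = p(s) \prod_{i=1}^{T} p(z_i \mid z_{<i})\, p(x_i \mid s, z_i)
```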
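To make the encoder structure of Figure 1 and the contrastive MI estimation concrete, here is a minimal PyTorch sketch. All names and dimensions are hypothetical, the mean-pooled summary for $s$ is an illustrative choice, and the identity "augmentations" stand in for the paper's real content/motion augmentations; this is not the authors' implementation.

```python
# Minimal PyTorch sketch of the Figure 1 structure plus an InfoNCE-style
# contrastive term. Names, dimensions, and the mean-pooled static summary
# are hypothetical; codes are deterministic for brevity (a VAE encoder
# would output posterior means and variances instead).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequenceEncoder(nn.Module):
    """Encodes x_{1:T} into a static code s and dynamic codes z_{1:T}."""
    def __init__(self, x_dim, h_dim=256, s_dim=64, z_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(x_dim, h_dim, batch_first=True)
        self.to_z = nn.Linear(h_dim, z_dim)  # one dynamic factor per time step
        self.to_s = nn.Linear(h_dim, s_dim)  # static factor from a sequence summary

    def forward(self, x):                    # x: (B, T, x_dim)
        h, _ = self.lstm(x)                  # per-step hidden states, (B, T, h_dim)
        z = self.to_z(h)                     # dynamic factors z_{1:T}, (B, T, z_dim)
        s = self.to_s(h.mean(dim=1))         # summarize the full sequence, (B, s_dim)
        return s, z

def info_nce(anchor, positive, temperature=0.1):
    """Contrastive (InfoNCE) loss: each anchor's positive is the matching row;
    the other sequences in the minibatch serve as negatives."""
    a = F.normalize(anchor, dim=-1)          # (B, D)
    p = F.normalize(positive, dim=-1)        # (B, D)
    logits = a @ p.t() / temperature         # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

# Usage: a content augmentation x^c preserves the static factor, so (s, s^c)
# is a positive pair; a motion augmentation x^m preserves the dynamics, so
# (z_{1:T}, z^m_{1:T}) is a positive pair. Identity copies stand in for the
# real augmentations here.
enc = SequenceEncoder(x_dim=128)
x = torch.randn(8, 20, 128)                  # batch of B=8 sequences, T=20 steps
x_c, x_m = x.clone(), x.clone()              # placeholders for real augmentations
s, z = enc(x)
s_c, _ = enc(x_c)
_, z_m = enc(x_m)
loss = info_nce(s, s_c) + info_nce(z.flatten(1), z_m.flatten(1))
```

Minimizing the InfoNCE loss tightens a lower bound on the corresponding MI term, with the rest of the minibatch acting as negatives, matching the roles of $s^c$ and $z^m_{1:T}$ as positives described in the figure caption.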