# Unsupervised Learning of Disentangled Representations from Video

Emily Denton, Department of Computer Science, New York University, denton@cs.nyu.edu
Vighnesh Birodkar, Department of Computer Science, New York University, vighneshbirodkar@nyu.edu

## Abstract

We present a new model, DRNET, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos, demonstrating the ability to coherently generate hundreds of steps into the future.

## 1 Introduction

Unsupervised learning from video is a long-standing problem in computer vision and machine learning. The goal is to learn, without explicit labels, a representation that generalizes effectively to a previously unseen range of tasks, such as semantic classification of the objects present, predicting future frames of the video or classifying the dynamic activity taking place. There are several prevailing paradigms: the first, known as self-supervision, uses domain knowledge to implicitly provide labels (e.g. predicting the relative position of patches on an object [4] or using feature tracks [36]). This allows the problem to be posed as a classification task with self-generated labels. The second general approach relies on auxiliary action labels, available in real or simulated robotic environments. These can either be used to train action-conditional predictive models of future frames [2, 20] or inverse-kinematics models [1] which attempt to predict actions from current and future frame pairs. The third and most general approach is predictive auto-encoders (e.g. [11, 12, 18, 31]) which attempt to predict future frames from current ones. To learn effective representations, some kind of constraint on the latent representation is required.

In this paper, we introduce a form of predictive auto-encoder which uses a novel adversarial loss to factor the latent representation for each video frame into two components: one that is roughly time-independent (i.e. approximately constant throughout the clip) and another that captures the dynamic aspects of the sequence, thus varying over time. We refer to these as content and pose components, respectively. The adversarial loss relies on the intuition that while the content features should be distinctive of a given clip, individual pose features should not. Thus the loss encourages pose features to carry no information about clip identity. Empirically, we find training with this loss to be crucial to inducing the desired factorization.

We explore the disentangled representation produced by our model, which we call Disentangled Representation Net (DRNET), on a variety of tasks. The first of these is predicting future video frames, something that is straightforward to do using our representation. We apply a standard LSTM model to the pose features, conditioning on the content features from the last observed frame. Despite the simplicity of our model relative to other video generation techniques, we are able to generate convincing long-range frame predictions, out to hundreds of time steps in some instances.
This is significantly further than existing approaches that use real video data. We also show that DRNET can be used for classification. The content features capture the semantic content of the video and thus can be used to predict object identity. Alternately, the pose features can be used for action prediction.

## 2 Related work

On account of its natural invariances, image data naturally lends itself to an explicit what and where representation. The capsule model of Hinton et al. [10] performed this separation via an explicit auto-encoder structure. Zhao et al. [40] proposed a multi-layered version, which has similarities to ladder networks [23]. Several weakly supervised approaches have been proposed to factor images into style and content (e.g. [19, 24]). These methods all operate on static images, whereas our approach uses temporal structure to separate the components.

Factoring video into time-varying and time-independent components has been explored in many settings. Classic structure-from-motion methods use an explicit affine projection model to extract a 3D point cloud and camera homography matrices [8]. In contrast, Slow Feature Analysis [38] has no model, instead simply penalizing the rate of change in time-independent components and encouraging their decorrelation. Most closely related to ours is Villegas et al. [33], which uses an unsupervised approach to factoring video into content and motion. Their architecture is also broadly similar to ours, but the loss functions differ in important ways. They rely on pixel/gradient space ℓp-norm reconstructions, plus a GAN term [6] that encourages the generated frames to be sharp. We also use an ℓ2 pixel-space reconstruction. However, this pixel-space loss is only applied, in combination with a novel adversarial term on the pose features, to learn the disentangled representation. In contrast to [33], our forward model acts on latent pose vectors rather than predicting pixels directly.

Other approaches explore general methods for learning disentangled representations from video. Kulkarni et al. [14] show how explicit graphics code can be learned from datasets with systematic dimensions of variation. Whitney et al. [37] use a gating principle to encourage each dimension of the latent representation to capture a distinct mode of variation. Grathwohl et al. [7] propose a deep variational model to disentangle space and time in video sequences.

A range of generative video models, based on deep nets, have recently been proposed. Ranzato et al. [22] adopt a discrete vector quantization approach inspired by text models. Srivastava et al. [31] use LSTMs to generate entire frames. Video Pixel Networks [12] use these models in a conditional manner, generating one pixel at a time in raster-scan order (similar image models include [27, 32]). Finn et al. [5] use an LSTM framework to model motion via transformations of groups of pixels. Cricri et al. [3] use a ladder of stacked auto-encoders. Other works predict optical flow fields that can be used to extrapolate motion beyond the current frame, e.g. [17, 39, 35]. In contrast, a single pose vector is predicted in our model, rather than a spatial field. Chiappa et al. [2] and Oh et al. [20] focus on prediction in video game environments, where known actions at each frame permit action-conditional generative models that can give accurate long-range predictions.
In contrast to the above works, whose latent representations combine both content and motion, our approach relies on a factorization of the two, with a predictive model only being applied to the latter. Furthermore, we do not attempt to predict pixels directly, instead applying the forward model in the latent space. Chiappa et al. [2], like our approach, produce convincing long-range generations. However, the video game environment is somewhat more constrained than the real-world video we consider, since actions are provided during generation.

Several video prediction approaches have been proposed that focus on handling the inherent uncertainty in predicting the future. Mathieu et al. [18] demonstrate that a loss based on GANs can produce sharper generations than traditional ℓ2-based losses. [34] train a series of models, which aim to span possible outcomes and select the most likely one at any given instant. While we considered GAN-based losses, the more constrained nature of our model, and the fact that our forward model does not directly generate in pixel-space, meant that standard deterministic losses worked satisfactorily.

## 3 Approach

In our model, two separate encoders produce distinct feature representations of content and pose for each frame. They are trained by requiring that the content representation of frame $x^t$ and the pose representation of future frame $x^{t+k}$ can be combined (via concatenation) and decoded to predict the pixels of future frame $x^{t+k}$. However, this reconstruction constraint alone is insufficient to induce the desired factorization between the two encoders. We thus introduce a novel adversarial loss on the pose features that prevents them from being discriminable from one video to another, thus ensuring that they cannot contain content information. A further constraint, motivated by the notion that content information should vary slowly over time, encourages temporally close content vectors to be similar to one another.

More formally, let $x_i = (x_i^1, \ldots, x_i^T)$ denote a sequence of $T$ images from video $i$. We subsequently drop index $i$ for brevity. Let $E_c$ denote a neural network that maps an image $x^t$ to the content representation $h_c^t$, which captures structure shared across time. Let $E_p$ denote a neural network that maps an image $x^t$ to the pose representation $h_p^t$, capturing content that varies over time. Let $D$ denote a decoder network that maps a content representation from a frame, $h_c^t$, and a pose representation $h_p^{t+k}$ from future time step $t+k$ to a prediction of the future frame $\tilde{x}^{t+k}$. Finally, $C$ is the scene discriminator network that takes pairs of pose vectors $(h_p^1, h_p^2)$ and outputs a scalar probability that they came from the same video or not.

The loss function used during training has several terms:

Reconstruction loss: We use a standard per-pixel ℓ2 loss between the predicted future frame $\tilde{x}^{t+k}$ and the actual future frame $x^{t+k}$, for some random frame offset $k \in [0, K]$:
$$\mathcal{L}_{\text{reconstruction}}(D) = \| D(h_c^t, h_p^{t+k}) - x^{t+k} \|_2^2 \qquad (1)$$
Note that many recent works on video prediction rely on more complex losses that can capture uncertainty, such as GANs [19, 6].

Similarity loss: To ensure the content encoder extracts mostly time-invariant representations, we penalize the squared error between the content features $h_c^t$, $h_c^{t+k}$ of neighboring frames $k \in [0, K]$:
$$\mathcal{L}_{\text{similarity}}(E_c) = \| E_c(x^t) - E_c(x^{t+k}) \|_2^2 \qquad (2)$$
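Before turning to the adversarial term, the following is a minimal sketch of how Eq. (1) and Eq. (2) could be computed in PyTorch. This is an illustration only, not the released implementation: `content_encoder`, `pose_encoder` and `decoder` are placeholder modules, and the sampling of the offset $k$ is assumed to happen elsewhere.

```python
import torch
import torch.nn.functional as F

def reconstruction_and_similarity_losses(content_encoder, pose_encoder, decoder,
                                         x_t, x_tk):
    """Eq. (1) and Eq. (2) for a batch of frame pairs.

    x_t, x_tk: tensors of shape (B, C, H, W) holding frames x^t and x^{t+k}
    drawn from the same videos, with k sampled uniformly in [0, K].
    """
    h_c_t = content_encoder(x_t)     # h_c^t
    h_c_tk = content_encoder(x_tk)   # h_c^{t+k}
    h_p_tk = pose_encoder(x_tk)      # h_p^{t+k}

    # Eq. (1): decode content of x^t with pose of x^{t+k} to predict x^{t+k}.
    # mse_loss averages rather than sums, which matches the l2 objective up to scale.
    x_tk_pred = decoder(h_c_t, h_p_tk)
    rec_loss = F.mse_loss(x_tk_pred, x_tk)

    # Eq. (2): neighbouring content vectors should be similar.
    sim_loss = F.mse_loss(h_c_t, h_c_tk)
    return rec_loss, sim_loss
```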
Adversarial loss: We now introduce a novel adversarial loss that exploits the fact that the objects present do not typically change within a video, but they do between different videos. Our desired disentanglement would thus have the content features be (roughly) constant within a clip, but distinct between clips. This implies that the pose features should not carry any information about the identity of objects within a clip. We impose this via an adversarial framework between the scene discriminator network $C$ and pose encoder $E_p$, shown in Fig. 1. The latter provides pairs of pose vectors, either computed from the same video $(h_{p,i}^t, h_{p,i}^{t+k})$ or from different ones $(h_{p,i}^t, h_{p,j}^{t+k})$, for some other video $j$. The discriminator then attempts to classify the pair as being from the same/different video using a cross-entropy loss:
$$\mathcal{L}_{\text{adversarial}}(C) = -\log\big(C(E_p(x_i^t), E_p(x_i^{t+k}))\big) - \log\big(1 - C(E_p(x_i^t), E_p(x_j^{t+k}))\big) \qquad (3)$$
The other half of the adversarial framework imposes a loss function on the pose encoder $E_p$ that tries to maximize the uncertainty (entropy) of the discriminator output on pairs of frames from the same clip:
$$\mathcal{L}_{\text{adversarial}}(E_p) = -\tfrac{1}{2}\log\big(C(E_p(x_i^t), E_p(x_i^{t+k}))\big) - \tfrac{1}{2}\log\big(1 - C(E_p(x_i^t), E_p(x_i^{t+k}))\big) \qquad (4)$$
Thus the pose encoder is encouraged to produce features that the discriminator is unable to classify as coming from the same clip or not. In so doing, the pose features cannot carry information about object content, yielding the desired factorization. Note that this does assume that the object's pose is not distinctive to a particular clip. While adversarial training is also used by GANs, our setup purely considers classification; there is no generator network, for example.

Overall training objective: During training we minimize the sum of the above losses, with respect to $E_c$, $E_p$, $D$ and $C$:
$$\mathcal{L} = \mathcal{L}_{\text{reconstruction}}(E_c, E_p, D) + \alpha \mathcal{L}_{\text{similarity}}(E_c) + \beta \big(\mathcal{L}_{\text{adversarial}}(E_p) + \mathcal{L}_{\text{adversarial}}(C)\big) \qquad (5)$$
where $\alpha$ and $\beta$ are hyper-parameters. The first three terms can be jointly optimized, but the discriminator $C$ is updated while the other parts of the model ($E_c$, $E_p$, $D$) are held constant. The overall model is shown in Fig. 1. Details of the training procedure and model architectures for $E_c$, $E_p$, $D$ and $C$ are given in Section 4.1.

Figure 1: Left: The discriminator $C$ is trained with binary cross-entropy (BCE) loss to predict if a pair of pose vectors comes from the same (top portion) or different (lower portion) scenes. $x_i$ and $x_j$ denote frames from different sequences $i$ and $j$. The frame offset $k$ is sampled uniformly in the range $[0, K]$. Note that when $C$ is trained, the pose encoder $E_p$ is fixed. Right: The overall model, showing all terms in the loss function. Note that when the pose encoder $E_p$ is updated, the scene discriminator is held fixed.

Figure 2: Generating future frames by recurrently predicting $h_p$, the latent pose vector.
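To make the adversarial terms concrete, the sketch below implements Eq. (3) and Eq. (4) under stated assumptions: `C` and `Ep` are placeholder modules, and `C` is assumed to end in a sigmoid so its output is a probability. This is a hedged illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

def scene_discriminator_loss(C, Ep, x_t, x_tk_same, x_tk_other):
    """Eq. (3): train C to separate same-video from different-video pose pairs.
    The pose encoder is held fixed for this update (no gradient through Ep)."""
    with torch.no_grad():
        anchor = Ep(x_t)
        same = Ep(x_tk_same)    # frame x^{t+k} from the same video
        other = Ep(x_tk_other)  # frame x^{t+k} from a different video
    pred_same = C(anchor, same)
    pred_other = C(anchor, other)
    return F.binary_cross_entropy(pred_same, torch.ones_like(pred_same)) + \
           F.binary_cross_entropy(pred_other, torch.zeros_like(pred_other))

def pose_adversarial_loss(C, Ep, x_t, x_tk_same):
    """Eq. (4): push the discriminator towards maximal uncertainty on same-video
    pairs. C's parameters are excluded from this optimizer, but gradients still
    flow through C into Ep."""
    pred = C(Ep(x_t), Ep(x_tk_same))
    return 0.5 * F.binary_cross_entropy(pred, torch.ones_like(pred)) + \
           0.5 * F.binary_cross_entropy(pred, torch.zeros_like(pred))
```

In an alternating scheme matching Eq. (5), one optimizer (over $C$ only) would step on the first loss, while a second optimizer (over $E_c$, $E_p$, $D$) would step on the reconstruction and similarity terms from the earlier sketch plus $\beta$ times the second loss.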
### 3.1 Forward Prediction

After training, the pose and content encoders $E_p$ and $E_c$ provide a representation which enables video prediction in a straightforward manner. Given a frame $x^t$, the encoders produce $h_p^t$ and $h_c^t$ respectively. To generate the next frame, we use these as input to an LSTM model to predict the next pose features $\tilde{h}_p^{t+1}$. These are then passed (along with the content features) to the decoder, which generates a pixel-space prediction $\tilde{x}^{t+1}$:
$$\tilde{h}_p^{t+1} = \text{LSTM}(E_p(x^t), h_c^t) \qquad \tilde{x}^{t+1} = D(\tilde{h}_p^{t+1}, h_c^t) \qquad (6)$$
$$\tilde{h}_p^{t+2} = \text{LSTM}(\tilde{h}_p^{t+1}, h_c^t) \qquad \tilde{x}^{t+2} = D(\tilde{h}_p^{t+2}, h_c^t) \qquad (7)$$
Note that while pose estimates are generated in a recurrent fashion, the content features $h_c^t$ remain fixed from the last observed real frame. This relies on the nature of $\mathcal{L}_{\text{reconstruction}}$, which ensured that content features can be combined with future pose vectors to give valid reconstructions. The LSTM is trained separately from the main model using a standard ℓ2 loss between $\tilde{h}_p^{t+1}$ and $h_p^{t+1}$. Note that this generative model is far simpler than many other recent approaches, e.g. [12]. This is largely due to the forward model being applied within our disentangled representation, rather than directly on raw pixels.

### 3.2 Classification

Another application of our disentangled representation is to use it for classification tasks. Content features, which are trained to be invariant to local temporal changes, can be used to classify the semantic content of an image. Conversely, a sequence of pose features can be used to classify actions in a video sequence. In either case, we train a two-layer classifier network $S$ on top of either $h_c$ or $h_p$, with its output predicting the class label $y$.

## 4 Experiments

We evaluate our model on both synthetic (MNIST, NORB, SUNCG) and real (KTH Actions) video datasets. We explore several tasks with our model: (i) the ability to cleanly factorize into content and pose components; (ii) forward prediction of video frames using the approach from Section 3.1; (iii) using the pose/content features for classification tasks.

### 4.1 Model details

We explored a variety of convolutional architectures for the content encoder $E_c$, pose encoder $E_p$ and decoder $D$. For MNIST, $E_c$, $E_p$ and $D$ all use a DCGAN architecture [21] with $|h_p| = 5$ and $|h_c| = 128$. The encoders consist of 5 convolutional layers with subsampling. Batch normalization and leaky ReLUs follow each convolutional layer except the final layer, which normalizes the pose/content vectors to have unit norm. The decoder is a mirrored version of the encoder with 5 deconvolutional layers and a sigmoid output layer.

For both NORB and SUNCG, $D$ is a DCGAN architecture while $E_c$ and $E_p$ use a ResNet-18 architecture [9] up until the final pooling layer, with $|h_p| = 10$ and $|h_c| = 128$. For KTH, $E_p$ uses a ResNet-18 architecture with $|h_p| = 24$. $E_c$ uses the same architecture as VGG16 [29] up until the final pooling layer, with $|h_c| = 128$. The decoder is a mirrored version of the content encoder with pooling layers replaced with spatial up-sampling. In the style of U-Net [25], we add skip connections from the content encoder to the decoder, enabling the model to easily generate static background features.

In all experiments the scene discriminator $C$ is a fully connected neural network with 2 hidden layers of 100 units. We trained all our models with the ADAM optimizer [13] and learning rate $\eta = 0.002$. We used $\beta = 0.1$ for MNIST, NORB and SUNCG and $\beta = 0.0001$ for KTH experiments. We used $\alpha = 1$ for all datasets. For future prediction experiments we train a two-layer LSTM with 256 cells using the ADAM optimizer. On MNIST, we train the model by observing 5 frames and predicting 10 frames. On KTH, we train the model by observing 10 frames and predicting 10 frames.
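Combining the recurrence of Section 3.1 with the LSTM just described, a minimal prediction loop might look like the sketch below. It makes several assumptions: `Ec`, `Ep`, `D` are trained placeholder modules, `lstm` is a hypothetical single-step wrapper (an LSTM plus a linear output head) returning the predicted pose and its recurrent state, and conditioning on content is done here by simple concatenation.

```python
import torch

@torch.no_grad()
def predict_future(Ec, Ep, D, lstm, observed_frames, n_future):
    """Roll out Eqs. (6)-(7): recur on pose, keep content fixed.

    observed_frames: list of (B, C, H, W) tensors, the conditioning frames.
    Returns a list of n_future predicted frames.
    """
    # Content is taken from the last observed frame and never updated.
    h_c = Ec(observed_frames[-1])

    # Warm up the LSTM state on the observed frames' pose vectors.
    state = None
    for x in observed_frames:
        h_p_pred, state = lstm(torch.cat([Ep(x), h_c], dim=1), state)

    # Recurrently predict pose and decode each step with the fixed content.
    predictions = []
    for _ in range(n_future):
        predictions.append(D(h_c, h_p_pred))
        h_p_pred, state = lstm(torch.cat([h_p_pred, h_c], dim=1), state)
    return predictions
```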
### 4.2 Synthetic datasets

MNIST: We start with a toy dataset consisting of two MNIST digits bouncing around a 64x64 image. Each video sequence consists of a different pair of digits with independent trajectories. Fig. 3(left) shows how the content vector from one frame and the pose vector from another generate new examples that transfer the content and pose from the original frames. This demonstrates the clean disentanglement produced by our model. Interestingly, for this data we found it necessary to use a different color for the two digits. Our adversarial term is so aggressive that it prevents the pose vector from capturing any content information, thus without a color cue the model is unable to determine which pose information to associate with which digit. In Fig. 3(right) we perform forward modeling using our representation, demonstrating the ability to generate crisp digits 500 time steps into the future.

Figure 3: Left: Demonstration of content/pose factorization on held out MNIST examples. Each image in the grid is generated using the pose and content vectors $h_p$ and $h_c$ taken from the corresponding images in the top row and first column respectively. The model has clearly learned to disentangle content and pose. Right: Each row shows forward modeling up to 500 time steps into the future, given 5 initial frames. For each generation, note that only the pose part of the representation is being predicted from the previous time step (using an LSTM), with the content vector being fixed from the 5th frame. The generations remain crisp despite the long-range nature of the predictions.

Figure 4: Left: Factorization examples using our DRNET model on held out NORB images. Each image in the grid is generated using the pose and content vectors $h_p$ and $h_c$ taken from the corresponding images in the top row and first column respectively. Center: Examples where DRNET was trained without the adversarial loss term. Note how content and pose are no longer factorized cleanly: the pose vector now contains content information which ends up dominating the generation. Right: Factorization examples from Mathieu et al. [19].

Figure 5: Left: Examples of linear interpolation in pose space between the examples $x_1$ and $x_2$. Right: Factorization examples on held out images from the SUNCG dataset. Each image in the grid is generated using the pose and content vectors $h_p$ and $h_c$ taken from the corresponding images in the top row and first column respectively. Note how, even for complex objects, the model is able to rotate them accurately.

NORB: We apply our model to the NORB dataset [16], converted into videos by taking sequences of different azimuths, while holding object identity, lighting and elevation constant. Fig. 4(left) shows that our model is able to factor content and pose cleanly on held out data. In Fig. 4(center) we train a version of our model without the adversarial loss term, which results in a significant degradation: the pose vectors are no longer isolated from content. For comparison, we also show the factorizations produced by Mathieu et al. [19], which are less clean, both in terms of disentanglement and generation quality, than our approach.

Table 1 shows classification results on NORB, following the training of a classifier on pose features and also content features. When the adversarial term is used ($\beta = 0.1$) the content features perform well. Without the term, content features become less effective for classification.
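As a rough illustration of the classification protocol from Section 3.2 used for Table 1, the sketch below trains a small two-layer classifier on frozen features ($h_c$ or $h_p$). The hidden size, feature dimension and data loader here are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_probe(encoder, loader, feat_dim=128, n_classes=5, epochs=10, lr=1e-3):
    """Train a two-layer classifier on frozen encoder features (E_c or E_p)."""
    probe = nn.Sequential(
        nn.Linear(feat_dim, 100), nn.ReLU(),
        nn.Linear(100, n_classes),
    )
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    encoder.eval()
    for _ in range(epochs):
        for frames, labels in loader:        # frames: (B, C, H, W)
            with torch.no_grad():
                feats = encoder(frames)      # frozen features, no encoder update
            loss = F.cross_entropy(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```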
SUNCG: We use the rendering engine from the SUNCG dataset [30] to generate sequences where the camera rotates around a range of 3D chair models. The dataset consists of 324 different chair models of varying size, shape and color. DRNET learns a clean factorization of content and pose and is able to generate high quality examples of this dataset, as shown in Fig. 5(right).

### 4.3 KTH Action Dataset

Finally, we apply DRNET to the KTH dataset [28]. This is a simple dataset of real-world videos of people performing one of six actions (walking, jogging, running, boxing, hand-waving, hand-clapping) against fairly uniform backgrounds. In Fig. 6 we show forward generations of different held out examples, comparing against two baselines: (i) the MCNet of Villegas et al. [33], which, to the best of our knowledge, produces the current best quality generations on real-world video, and (ii) a baseline auto-encoder LSTM model (AE-LSTM). The latter is essentially the same as ours, but with a single encoder whose features thus combine content and pose (as opposed to factoring them, as in DRNET). It is also similar to [31]. Fig. 7 shows more examples, with generations out to 100 time steps. For most actions this is sufficient time for the person to have left the frame, thus further generations would be of a fixed background.

In Fig. 9 we attempt to quantify the fidelity of the generations by comparing our approach to MCNet [33] using a metric derived from the Inception score [26]. The Inception score is used for assessing generations from GANs and is more appropriate for our scenario than traditional metrics such as PSNR or SSIM (see appendix B for further discussion). The curves show the mean scores of our generations decaying more gracefully than those of MCNet [33]. Further examples and generated movies may be viewed in appendix A and also at https://sites.google.com/view/drnet-paper//.

A natural concern with high capacity models is that they might be memorizing the training examples. We probe this in Fig. 10, where we show the nearest neighbors to our generated frames from the training set. Fig. 8 uses the pose representation produced by DRNET to train an action classifier from very few examples. We extract pose vectors from video sequences of length 24 and train a fully connected classifier on these vectors to predict the action class. We compare against an auto-encoder baseline, which is the same as ours but with a single encoder whose features thus combine content and pose. We find the factorization significantly boosts performance.

Figure 6: Qualitative comparison between our DRNET model, MCNet [33] and the AE-LSTM baseline. All models are conditioned on the first 10 video frames and generate 20 frames. We display predictions of every 3rd frame. Video sequences are taken from held out examples of the KTH dataset for the classes of walking (top) and running (bottom).

Figure 7: Four additional examples of generations on held out examples of the KTH dataset, rolled out to 100 timesteps.

| Model | Features | Accuracy (%) |
| --- | --- | --- |
| DRNET ($\beta = 0.1$) | $h_c$ | 93.3 |
| DRNET ($\beta = 0.1$) | $h_p$ | 60.9 |
| DRNET ($\beta = 0$) | $h_c$ | 72.6 |
| DRNET ($\beta = 0$) | $h_p$ | 80.8 |
| Mathieu et al. [19] | | 86.5 |

Table 1: Classification results on the NORB dataset, with/without the adversarial loss ($\beta = 0.1$ / $\beta = 0$), using content or pose representations ($h_c$, $h_p$ respectively). The adversarial term is crucial for forcing semantic information into the content vectors; without it, performance drops significantly.
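For reference on the evaluation in Fig. 9: the metric used is derived from the Inception score, whose standard form is sketched below. This is only an approximation of the protocol described in the paper's appendix, and the class logits are assumed to come from some pretrained classifier rather than the exact network used there.

```python
import torch
import torch.nn.functional as F

def inception_score(logits):
    """Standard Inception score: exp(E_x[ KL(p(y|x) || p(y)) ]).

    logits: (N, num_classes) class logits for N generated frames, produced by
    a pretrained classifier (a placeholder here, not the paper's exact setup).
    """
    p_yx = F.softmax(logits, dim=1)          # p(y|x) for each generated frame
    p_y = p_yx.mean(dim=0, keepdim=True)     # marginal class distribution p(y)
    kl = (p_yx * (p_yx.log() - p_y.log())).sum(dim=1)
    return kl.mean().exp().item()
```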
Figure 8: Classification of KTH actions from pose vectors with few labeled examples, with an auto-encoder baseline. N.B. the state of the art (fully supervised) is 93.9% [15].

Figure 9: Comparison of KTH video generation quality using the Inception score. The x-axis indicates how far from the conditioned input the start of the generated sequence is.

Figure 10: For each frame generated by DRNET (top row in each set), we show nearest-neighbor images from the training set, based on pose vectors (middle row) and both content and pose vectors (bottom row). It is evident that our model is not simply copying examples from the training data. Furthermore, the middle row shows that the pose vector generalizes well, and is independent of background and clothing.

## 5 Discussion

In this paper we introduced a model based on a pair of encoders that factor video into content and pose. This separation is achieved during training through a novel adversarial loss term. The resulting representation is versatile, in particular allowing for stable and coherent long-range prediction through nothing more than a standard LSTM. Our generations compare favorably with leading approaches, despite being a simple model, e.g. lacking the GAN losses or probabilistic formulations of other video generation approaches. Source code is available at https://github.com/edenton/drnet.

## Acknowledgments

We thank Rob Fergus, Will Whitney and Jordan Ash for helpful comments and advice. Emily Denton is grateful for the support of a Google Fellowship.

## References

[1] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. arXiv preprint arXiv:1606.07419, 2016.
[2] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed. Recurrent environment simulators. In ICLR, 2017.
[3] F. Cricri, M. Honkala, X. Ni, E. Aksu, and M. Gabbouj. Video ladder networks. arXiv preprint arXiv:1612.01756, 2016.
[4] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In CVPR, pages 1422-1430, 2015.
[5] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157, 2016.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[7] W. Grathwohl and A. Wilson. Disentangling space and time in video with hierarchical variational auto-encoders. arXiv preprint arXiv:1612.04440, 2016.
[8] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. 2000.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, June 2016.
[10] G. E. Hinton, A. Krizhevsky, and S. Wang. Transforming auto-encoders. In ICANN, 2011.
[11] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[12] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[14] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, pages 2539-2547, 2015.
[15] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[16] Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
[17] C. Liu. Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology, 2009.
[18] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
[19] M. Mathieu, J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun. Disentangling factors of variation in deep representations using adversarial training. In NIPS, 2016.
[20] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.
[21] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[22] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[23] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
[24] S. Reed, Z. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In NIPS, 2015.
[25] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234-241. Springer, 2015.
[26] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[27] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[28] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, volume 3, pages 32-36. IEEE, 2004.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[30] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
[31] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[32] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[33] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
[34] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. arXiv preprint arXiv:1609.02612, 2016.
[35] J. Walker, A. Gupta, and M. Hebert. Dense optical flow prediction from a static image. In ICCV, 2015.
[36] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In CVPR, pages 2794-2802, 2015.
[37] W. F. Whitney, M. Chang, T. Kulkarni, and J. B. Tenenbaum. Understanding visual concepts with continuation learning. arXiv preprint arXiv:1502.04623, 2016.
[38] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715-770, 2002.
[39] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
[40] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where auto-encoders. In ICLR, 2016.