Published as a conference paper at ICLR 2019

TEMPORAL DIFFERENCE VARIATIONAL AUTO-ENCODER

Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, Théophane Weber
DeepMind
{karolg, gpapamak, fbesse, lbuesing, theophane}@google.com

ABSTRACT

To act and plan in complex environments, we posit that agents should have a mental simulator of the world with three characteristics: (a) it should build an abstract state representing the condition of the world; (b) it should form a belief which represents uncertainty about the world; (c) it should go beyond simple step-by-step simulation, and exhibit temporal abstraction. Motivated by the absence of a model satisfying all these requirements, we propose TD-VAE, a generative sequence model that learns representations containing explicit beliefs about states several steps into the future, and that can be rolled out directly without single-step transitions. TD-VAE is trained on pairs of temporally separated time points, using an analogue of temporal difference learning from reinforcement learning.

1 INTRODUCTION

Generative models of sequential data have received a lot of attention, due to their wide applicability in domains such as speech synthesis (van den Oord et al., 2016a; 2017), neural translation (Bahdanau et al., 2014), image captioning (Xu et al., 2015), and many others. Different application domains will often have different requirements (e.g. long-term coherence, sample quality, abstraction learning), which in turn will drive the choice of the architecture and training algorithm.

Of particular interest to this paper is the problem of reinforcement learning in partially observed environments, where, in order to act and explore optimally, agents need to build a representation of the uncertainty about the world, computed from the information they have gathered so far. While an agent endowed with memory could in principle learn such a representation implicitly through model-free reinforcement learning, in many situations the reinforcement signal may be too weak to quickly learn such a representation in a way which would generalize to a collection of tasks. Furthermore, in order to plan in a model-based fashion, an agent needs to be able to imagine distant futures which are consistent with the agent's past. In many situations, however, planning step-by-step is not a cognitively or computationally realistic approach.

To successfully address an application such as the above, we argue that a model of the agent's experience should exhibit the following properties:

- The model should learn an abstract state representation of the data and be capable of making predictions at the state level, not just the observation level.
- The model should learn a belief state, i.e. a deterministic, coded representation of the filtering posterior of the state given all the observations up to a given time. A belief state contains all the information an agent has about the state of the world and thus about how to act optimally.
- The model should exhibit temporal abstraction, both by making jumpy predictions (predictions several time steps into the future), and by being able to learn from temporally separated time points without backpropagating through the entire time interval.

To our knowledge, no model in the literature meets these requirements.
In this paper, we develop a new model and associated training algorithm, called Temporal Difference Variational Auto-Encoder (TD-VAE), which meets all of the above requirements. We first develop TD-VAE in the sequential, non-jumpy case, by using a modified evidence lower bound (ELBO) for stochastic state-space models (Krishnan et al., 2015; Fraccaro et al., 2016; Buesing et al., 2018) which relies on jointly training a filtering posterior and a local smoothing posterior. We demonstrate that, on a simple task, this new inference network and associated lower bound lead to improved likelihood compared to methods classically used to train deep state-space models. Following the intuition given by the sequential TD-VAE, we develop the full TD-VAE model, which learns from temporally extended data by making jumpy predictions into the future. We show it can be used to train consistent jumpy simulators of complex 3D environments. Finally, we illustrate how training a filtering posterior leads to the computation of a neural belief state with a good representation of the uncertainty about the state of the environment.

2 MODEL DESIDERATA

2.1 CONSTRUCTION OF A LATENT STATE-SPACE

Autoregressive models. One of the simplest ways to model sequential data $(x_1, \ldots, x_T)$ is to use the chain rule to decompose the joint sequence likelihood as a product of conditional probabilities, i.e. $\log p(x_1, \ldots, x_T) = \sum_t \log p(x_t \mid x_1, \ldots, x_{t-1})$. This formula can be used to train an autoregressive model of data, by combining an RNN which aggregates information from the past (recursively computing an internal state $h_t = f(h_{t-1}, x_t)$) with a conditional generative model which can score the data $x_t$ given the context $h_t$. This idea is used in handwriting synthesis (Graves, 2013), density estimation (Uria et al., 2016), image synthesis (van den Oord et al., 2016b), audio synthesis (van den Oord et al., 2017), video synthesis (Kalchbrenner et al., 2016), generative recall tasks (Gemici et al., 2017), and environment modeling (Oh et al., 2015; Chiappa et al., 2017). While these models are conceptually simple and easy to train, one potential weakness is that they only make predictions in the original observation space, and don't learn a compressed representation of the data. As a result, these models tend to be computationally heavy (for video prediction, they constantly decode and re-encode single video frames). Furthermore, the model can be unstable at test time: it is trained as a next-step model (the RNN encoding real data), but at test time it feeds its own predictions back into the RNN. Various methods have been used to alleviate this issue (Bengio et al., 2015; Lamb et al., 2016; Goyal et al., 2017; Amos et al., 2018).
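To make the chain-rule factorization concrete, the sketch below combines a GRU cell that recursively computes $h_t = f(h_{t-1}, x_t)$ with a conditional Gaussian output that scores $x_t$ given the previous context. This is a minimal sketch, assuming continuous observations and a diagonal-Gaussian output; the layer sizes, class name, and distribution choice are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class AutoregressiveRNN(nn.Module):
    """Chain-rule model: log p(x_1..T) = sum_t log p(x_t | h_{t-1}),
    where h_t = f(h_{t-1}, x_t) is computed by a GRU cell."""

    def __init__(self, x_dim, h_dim):
        super().__init__()
        self.rnn = nn.GRUCell(x_dim, h_dim)
        self.out = nn.Linear(h_dim, 2 * x_dim)  # predicts a mean and a log-scale per dimension
        self.h0 = nn.Parameter(torch.zeros(h_dim))

    def log_likelihood(self, x):  # x: (T, batch, x_dim)
        h = self.h0.expand(x.size(1), -1)
        total = torch.zeros(x.size(1), device=x.device)
        for t in range(x.size(0)):
            mean, log_scale = self.out(h).chunk(2, dim=-1)
            total = total + Normal(mean, log_scale.exp()).log_prob(x[t]).sum(-1)
            h = self.rnn(x[t], h)  # during training the recurrence always sees the real observation
        return total  # (batch,) log-likelihoods
```

At sampling time, $x_t$ in the recurrence would be replaced by the model's own sample, which is exactly the train/test mismatch discussed above.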
State-space models. An alternative to autoregressive models are models which operate on a higher level of abstraction, and use latent variables to model stochastic transitions between states (grounded by observation-level predictions). This makes it possible to sample state-to-state transitions only, without needing to render the observations, which can be faster and more conceptually appealing. These models generally consist of decoder or prior networks, which detail the generative process of states and observations, and encoder or posterior networks, which estimate the distribution of latents given the observed data. There is a large amount of recent work on this type of model, differing in the precise wiring of the model components (Bayer & Osendorfer, 2014; Chung et al., 2015; Krishnan et al., 2015; Archer et al., 2015; Fraccaro et al., 2016; Liu et al., 2017; Serban et al., 2017; Buesing et al., 2018; Lee et al., 2018; Ha & Schmidhuber, 2018).

Let $z = (z_1, \ldots, z_T)$ be a state sequence and $x = (x_1, \ldots, x_T)$ an observation sequence. We assume a general form of state-space model, where the joint state and observation likelihood can be written as $p(x, z) = \prod_t p(z_t \mid z_{t-1})\, p(x_t \mid z_t)$.¹ These models are commonly trained with a VAE-inspired bound, by computing a posterior $q(z \mid x)$ over the states given the observations. Often, the posterior is decomposed autoregressively: $q(z \mid x) = \prod_t q(z_t \mid z_{t-1}, \phi_t(x))$, where $\phi_t$ is a function of $(x_1, \ldots, x_t)$ for filtering posteriors, or of the entire sequence $x$ for smoothing posteriors. This leads to the following lower bound:

$\log p(x) \ge \mathbb{E}_{z \sim q(z \mid x)} \Big[ \sum_t \log p(x_t \mid z_t) + \log p(z_t \mid z_{t-1}) - \log q(z_t \mid z_{t-1}, \phi_t(x)) \Big].$  (1)

¹ For notational simplicity, $p(z_1 \mid z_0) = p(z_1)$. Also note that the conditional distributions could be very complex, using additional latent variables, flow models, or implicit models (for instance, if a deterministic RNN with stochastic inputs is used in the decoder).

2.2 ONLINE CREATION OF A BELIEF STATE

A key feature of sequential models of data is that they allow us to reason about the conditional distribution of the future given the past: $p(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t)$. For reinforcement learning in partially observed environments, this distribution governs the distribution of returns given past observations, and as such, it is sufficient to derive the optimal policy. For generative sequence modeling, it enables conditional generation of data given a context sequence. For this reason, it is desirable to compute sufficient statistics $b_t = b_t(x_1, \ldots, x_t)$ of the future given the past, which allow us to rewrite the conditional distribution as $p(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t) \approx p(x_{t+1}, \ldots, x_T \mid b_t)$. For an autoregressive model as described in section 2.1, the internal RNN state $h_t$ can immediately be identified as the desired sufficient statistic $b_t$. However, for the reasons mentioned in the previous section, we would like to identify an equivalent quantity for a state-space model.

For a state-space model, the filtering distribution $p(z_t \mid x_1, \ldots, x_t)$, also known as the belief state in reinforcement learning, is sufficient to compute the conditional future distribution, due to the Markov assumption underlying the state-space model and the following derivation:

$p(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t) = \int p(z_t \mid x_1, \ldots, x_t)\, p(x_{t+1}, \ldots, x_T \mid z_t)\, \mathrm{d}z_t.$  (2)

Thus, if we train a network that extracts a code $b_t$ from $(x_1, \ldots, x_t)$ so that $p(z_t \mid x_1, \ldots, x_t) \approx p(z_t \mid b_t)$, then $b_t$ would contain all the information about the state of the world available to the agent, and would effectively form a neural belief state, i.e. a code fully characterizing the filtering distribution. Classical training of state-space models does not compute a belief state: by computing a joint, autoregressive posterior $q(z \mid x) = \prod_t q(z_t \mid z_{t-1}, x)$, some of the uncertainty about the marginal posterior of $z_t$ may leak into the sample $z_{t-1}$. Since that sample is stochastic, to obtain all the information from $(x_1, \ldots, x_t)$ about $z_t$, we would need to re-sample $z_{t-1}$, which would in turn require re-sampling $z_{t-2}$, and so on all the way back to $z_1$.
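A minimal sketch of how the bound in equation (1) can be estimated with a filtering posterior: a forward RNN summarizes $x_{1 \ldots t}$ into a code $b_t$ (playing the role of $\phi_t(x)$), the posterior $q(z_t \mid z_{t-1}, b_t)$ is sampled with the reparameterization trick, and the three log-probability terms are accumulated. The diagonal-Gaussian parameterizations, layer shapes, and the zero initial state standing in for the $p(z_1 \mid z_0) = p(z_1)$ convention are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

def diag_gaussian(net, *inputs):
    """Map the concatenated inputs through `net` to a diagonal Gaussian."""
    mean, pre_scale = net(torch.cat(inputs, dim=-1)).chunk(2, dim=-1)
    return Normal(mean, F.softplus(pre_scale) + 1e-4)

class StateSpaceELBO(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim):
        super().__init__()
        self.belief_rnn = nn.GRU(x_dim, h_dim)                 # b_t summarizes x_1..t
        self.posterior = nn.Linear(z_dim + h_dim, 2 * z_dim)   # q(z_t | z_{t-1}, b_t)
        self.prior = nn.Linear(z_dim, 2 * z_dim)               # p(z_t | z_{t-1})
        self.decoder = nn.Linear(z_dim, 2 * x_dim)             # p(x_t | z_t)
        self.z_dim = z_dim

    def forward(self, x):  # x: (T, batch, x_dim)
        b, _ = self.belief_rnn(x)  # filtering codes, shape (T, batch, h_dim)
        z_prev = torch.zeros(x.size(1), self.z_dim, device=x.device)  # crude stand-in for z_0
        elbo = torch.zeros(x.size(1), device=x.device)
        for t in range(x.size(0)):
            q_t = diag_gaussian(self.posterior, z_prev, b[t])
            z_t = q_t.rsample()  # reparameterized sample
            elbo = elbo + (diag_gaussian(self.decoder, z_t).log_prob(x[t]).sum(-1)
                           + diag_gaussian(self.prior, z_prev).log_prob(z_t).sum(-1)
                           - q_t.log_prob(z_t).sum(-1))
            z_prev = z_t
        return elbo  # one-sample Monte Carlo estimate of the bound in equation (1)
```

Because the posterior at each step conditions only on $b_t$ and the previous sampled state, $b_t$ here is a filtering code; a smoothing posterior would instead condition on a summary of the whole sequence.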
While the notion of a belief state itself and its connection to optimal policies in POMDPs is well known (Astrom, 1965; Kaelbling et al., 1998; Hauskrecht, 2000), it has often been restricted to the tabular case (Markov chains), and little work investigates computing belief states for learned deep models. A notable exception is Igl et al. (2018), who use a neural form of particle filtering and represent the belief state more explicitly as a weighted collection of particles. Related to our definition of belief states as sufficient statistics is the notion of predictive state representations (PSRs) (Littman & Sutton, 2002); see also Venkatraman et al. (2017) for a model that learns PSRs which, combined with a decoder, can predict future observations.

Our last requirement for the model is that of temporal abstraction. We postpone the discussion of this aspect until section 4.

3 BELIEF-STATE-BASED ELBO FOR SEQUENTIAL TD-VAE

In this section, we develop a sequential model that satisfies the requirements given in the previous section, namely (a) it constructs a latent state-space, and (b) it creates an online belief state. We consider an arbitrary state-space model with joint latent and observable likelihood given by $p(x, z) = \prod_t p(z_t \mid z_{t-1})\, p(x_t \mid z_t)$, and we aim to optimize the data likelihood $\log p(x)$. We begin by autoregressively decomposing the data likelihood as $\log p(x) = \sum_t \log p(x_t \mid x_{<t})$.

5.2 MOVING MNIST

In this experiment, the model is trained on sequences of frames showing an MNIST digit moving either left or right. Given observations up to time $t_1$, we would like the model to predict the state of the world at a later time $t_2 > t_1$, without considering the inputs in between these time points. Note that it is not sufficient to predict the future inputs themselves, as they do not contain information about whether the digit moves left or right; we need to sample a state that contains this information. We roll out a sequence from the model as follows (see the sketch below): (a) $b_t$ is computed by the aggregation recurrent network from observations up to time $t$; (b) a state $z_t$ is sampled from $p_B(z_t \mid b_t)$; (c) a sequence of states is rolled out by repeatedly sampling $z' \sim p(z' \mid z)$, starting from $z = z_t$; (d) each $z$ is decoded by $p(x \mid z)$, producing a sequence of frames. The resulting sequences are shown in Figure 3. We see that the model can indeed roll the samples forward in steps of more than one elementary time step (the sampled digits move by more than one pixel) and that it preserves the direction of motion, demonstrating that it rolls forward a state.
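The rollout procedure (a)-(d) can be sketched as follows, assuming a trained aggregation RNN `belief_rnn`, a callable `p_belief` for $p_B(z \mid b)$, a callable `p_transition` for the state-to-state distribution $p(z' \mid z)$, and a `decoder` for $p(x \mid z)$. All four are placeholders returning `torch.distributions` objects; the names and interfaces are assumptions for illustration, not the paper's implementation.

```python
import torch

@torch.no_grad()
def jumpy_rollout(belief_rnn, p_belief, p_transition, decoder, context, n_steps):
    """(a) aggregate the context into b_t, (b) sample z_t ~ p_B(z | b_t),
    (c) repeatedly sample z' ~ p(z' | z), (d) decode every state into a frame."""
    b, _ = belief_rnn(context)          # context: (t, batch, x_dim) observations up to time t
    z = p_belief(b[-1]).sample()        # one hypothesis about the current state
    frames = []
    for _ in range(n_steps):
        z = p_transition(z).sample()    # state-to-state transition; no frame is rendered here
        frames.append(decoder(z).mean)  # decoding is only needed for visualization
    return torch.stack(frames)          # (n_steps, batch, x_dim)
```

Because the loop operates purely on states, the per-step cost is independent of frame rendering, which is the computational advantage of state-space rollouts discussed in section 2.1.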
5.3 NOISY HARMONIC OSCILLATOR

We would like to demonstrate that the model can build a state even when little information is present in each observation, and that it can sample states far into the future. For this, we consider a 1D sequence obtained from a noisy harmonic oscillator, as shown in Figure 4 (first and fourth rows). The frequencies, initial positions and initial velocities are chosen at random from some range. At every update, noise is added to the position and the velocity of the oscillator, but the energy is approximately preserved. The model observes a noisy version of the current position. Attempting to predict the input, which consists of a single value, 100 time steps into the future would be uninformative; such a prediction would not reveal what the frequency or the magnitude of the signal is, and, because the oscillator updates are noisy, the phase information would be nearly lost. Instead, we should try to predict as much as possible about the state, which consists of frequency, magnitude and position; it is only the position that cannot be accurately predicted.

Figure 4: Skip-state prediction for a 1D signal. The input is generated by a noisy harmonic oscillator. Rollouts consist of a jumpy state transition with either $\delta t = 20$ or $\delta t = 100$, followed by 20 state transitions with $\delta t = 1$. The model is able to create a state and predict it into the future, correctly predicting the frequency and magnitude of the signal.

The aggregation RNN is an LSTM; we use a hierarchical TD-VAE with two layers, where the latent variables in the higher layer are sampled first and their results are passed to the lower layer. The belief, smoothing and state-transition distributions are feed-forward networks, and the decoder simply extracts the first component of the $z$ of the first layer. We also feed the time interval $t_2 - t_1$ into the smoothing and state-transition distributions. We train on sequences of length 200, with $t_2 - t_1$ taking values chosen at random from $[1, 10]$ with probability 0.8 and from $[1, 120]$ with probability 0.2.

We analyze what the model has learned as follows. We pick time $t_1 = 60$ and sample $z_{t_1} \sim p_B(z_{t_1} \mid b_{t_1})$. Then, we choose a time interval $\delta t \in \{20, 100\}$ to skip, and sample from the forward model $p(z_{t_2} \mid z_{t_1}, \delta t)$ to obtain $z_{t_2}$ at $t_2 = t_1 + \delta t$. To see the content of this state, we roll forward 20 times with time step $\delta t = 1$ and plot the result, shown in Figure 4. We see that the state $z_{t_2}$ is indeed predicted correctly, containing the correct frequency and magnitude of the signal. We also see that the position (phase) is predicted well for $\delta t = 20$ and less accurately for $\delta t = 100$ (at which point the noisiness of the system makes it unpredictable).

Finally, we show that TD-VAE training can improve the quality of the belief state. For this experiment, the harmonic oscillator has a different frequency in each of the intervals $[0, 10)$, $[10, 20)$, $[20, 120)$, $[120, 140)$. The first three frequencies $f_1, f_2, f_3$ are chosen at random. The final frequency $f_4$ is set to one fixed value $f_a$ if $f_1 > f_2$ and to another fixed value $f_b$ otherwise ($f_a$ and $f_b$ are constants). In order to correctly model the signal in the final time interval, the model needs to learn the relation between $f_1$ and $f_2$, store it over a length of 100 steps, and apply it over a number of time steps (due to the noise) in the final interval. To test whether the belief state contains the information about this relationship, we train a binary classifier from the belief state to the final frequency $f_4$ at points just before the final interval. We compare two models with the same recurrent architecture (an LSTM) but trained with different objectives: next-step prediction versus the TD-VAE loss. The figure on the right shows the classification accuracy for the two methods, averaged over 20 runs. We found that the longer the separating time interval (containing frequency $f_3$) and the smaller the size of the LSTM, the better TD-VAE performs compared to the next-step predictor.
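This comparison can be read as a simple probing experiment: freeze the recurrent model (trained either with the next-step objective or with the TD-VAE loss), collect its belief codes $b_t$ just before the final interval, and fit a binary classifier predicting whether $f_4 = f_a$ or $f_4 = f_b$. The sketch below uses a logistic-regression probe; the probe architecture, optimizer settings, and variable names are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

def probe_accuracy(belief_states, labels, epochs=200):
    """belief_states: (N, h_dim) codes b_t read out just before the final interval,
    from a frozen LSTM trained with either the next-step or the TD-VAE objective.
    labels: (N,) in {0, 1}, encoding whether f4 equals fa or fb."""
    probe = nn.Linear(belief_states.size(1), 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(belief_states).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    preds = (probe(belief_states).squeeze(-1) > 0).long()
    return (preds == labels).float().mean()  # fraction of correctly classified belief states
```

A real evaluation would measure accuracy on held-out sequences; this sketch reuses the probe's training set only for brevity.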
5.4 DEEPMIND LAB ENVIRONMENT

In the final experiment, we analyze the model on a more visually complex domain. We use sequences of frames seen by an agent solving tasks in the DeepMind Lab environment (Beattie et al., 2016). We aim to demonstrate that the model holds explicit beliefs about various possible futures, and that it can roll out in jumps. We suggest functional forms inspired by convolutional DRAW: we use convolutional LSTMs for all the circles in Figure 8 and make the model 16 layers deep (except for the forward-updating LSTMs, which are fully connected with depth 4).

We use time skips $t_2 - t_1$ sampled uniformly from $[1, 40]$ and analyze the content of the belief state $b$. We take three samples $z_1, z_2, z_3$ from $p_B(z \mid b)$, which should represent three instances of possible futures. Figure 5 (left) shows that they decode to roughly the same frame. To see what they represent about the future, we draw 5 samples $\hat{z}_i^k \sim p(\hat{z} \mid z_i)$, $k = 1, \ldots, 5$, and decode them, as shown in Figure 5 (right). We see that for a given $i$, the predicted samples decode to similar frames (images in the same row). However, the $z$'s for different $i$'s decode to different frames. This means that $b$ represents a belief about several different possible futures, while each $z_i$ represents a single possible future.

Figure 5: Beliefs of the model. Left: Independent samples $z_1, z_2, z_3$ from the current belief; all three decode to roughly the same frame. Right: Multiple predicted futures for each sample. The frames are similar for each $z_i$, but differ across the $z_i$'s.
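The belief analysis behind Figure 5 can be sketched as the following sampling procedure, again with placeholder callables for the belief distribution $p_B(z \mid b)$, the jumpy transition $p(\hat{z} \mid z)$, and the decoder; the function and argument names are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_futures(p_belief, p_transition, decoder, b, n_states=3, n_futures=5):
    """One row per hypothesis z_i ~ p_B(z | b), one column per future z_hat ~ p(z_hat | z_i)."""
    rows = []
    for _ in range(n_states):
        z_i = p_belief(b).sample()                # one possible state consistent with the belief
        futures = [decoder(p_transition(z_i).sample()).mean for _ in range(n_futures)]
        rows.append(torch.stack(futures))
    return torch.stack(rows)                      # (n_states, n_futures, *frame_shape)
```

Frames within a row should look alike (they share a state hypothesis), while frames across rows can differ (the belief covers several hypotheses), which is the pattern shown in Figure 5.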
Finally, we show what rollouts look like. We train on time separations $t_2 - t_1$ chosen uniformly from $[1, 5]$ on a task where the agent tends to move forward and rotate. Figure 6 shows 4 rollouts from the model. We see that the motion appears to go forward and into corridors, and that it skips several time steps (real single-step motion is slower).

Figure 6: Rollout from the model. The model was trained on steps uniformly distributed in $[1, 5]$. The model is able to create forward motion that skips several time steps.

6 CONCLUSIONS

In this paper, we argued that an agent needs a model that is different from an accurate step-by-step environment simulator. We discussed the requirements for such a model, and presented TD-VAE, a sequence model that satisfies all of them. TD-VAE builds states from observations by bridging time points separated by random intervals. This allows the states to relate to each other directly over longer time stretches and to explicitly encode the future. Further, it allows rolling out in state-space and in time steps larger than, and potentially independent of, the underlying temporal environment/data step size. In the future, we aim to apply TD-VAE to more complex settings, and to investigate a number of possible uses in reinforcement learning, such as representation learning and planning.

REFERENCES

Brandon Amos, Laurent Dinh, Serkan Cabi, Thomas Rothörl, Sergio Gómez Colmenarejo, Alistair Muldal, Tom Erez, Yuval Tassa, Nando de Freitas, and Misha Denil. Learning awareness models. arXiv preprint arXiv:1804.06318, 2018.
Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.
Karl J Astrom. Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.
Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.
Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980–2988, 2015.
Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199–2207, 2016.
Mevlana Gemici, Chia-Chun Hung, Adam Santoro, Greg Wayne, Shakir Mohamed, Danilo J Rezende, David Amos, and Timothy Lillicrap. Generative temporal models with memory. arXiv preprint arXiv:1702.04649, 2017.
Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pp. 6713–6723, 2017.
Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances in Neural Information Processing Systems, pp. 3549–3557, 2016.
David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
Milos Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–94, 2000.
Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for POMDPs. arXiv preprint arXiv:1806.02426, 2018.
Dinesh Jayaraman, Frederik Ebert, Alexei A Efros, and Sergey Levine. Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784, 2018.
Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.
Rahul G Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.
Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pp. 4601–4609, 2016.
Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
Michael L Littman and Richard S Sutton. Predictive representations of state. In Advances in Neural Information Processing Systems, pp. 1555–1561, 2002.
Hao Liu, Lirong He, Haoli Bai, and Zenglin Xu. Efficient structured inference for stochastic recurrent neural networks. 2017.
Alexander Neitz, Giambattista Parascandolo, Stefan Bauer, and Bernhard Schölkopf. Adaptive skip intervals: Temporal abstraction for recurrent dynamical models. arXiv preprint arXiv:1808.04768, 2018.
Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871, 2015.
Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5694–5705, 2017.
Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pp. 3295–3301, 2017.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016b.
Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
Arun Venkatraman, Nicholas Rhinehart, Wen Sun, Lerrel Pinto, Martial Hebert, Byron Boots, Kris Kitani, and J Bagnell. Predictive-state decoders: Encoding the future into recurrent networks. In Advances in Neural Information Processing Systems, pp. 1172–1183, 2017.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057, 2015.

A TD-VAE AS A MODEL OF JUMPY OBSERVATIONS

In section 3, we derive an approximate ELBO which forms the basis of the training loss of the one-step TD-VAE. One may wonder whether a similar idea may underpin the training loss of the jumpy TD-VAE.
Here we show how to modify the derivation to provide an approximate ELBO for a slightly different training regime. Assume a sequence $(x_1, \ldots, x_T)$ and an arbitrary distribution $S$ over subsequences $x_s = (x_{t_1}, \ldots, x_{t_n})$ of $x$. For each time index $t_i$, we suppose a state $z_{t_i}$, and model the subsequence $x_s$ with a jumpy state-space model $p(x_s) = \prod_i p(z_{t_i} \mid z_{t_{i-1}})\, p(x_{t_i} \mid z_{t_i})$; denote by $z_s = (z_{t_1}, \ldots, z_{t_n})$ the state subsequence. We use the exact same machinery as for the next-step ELBO, except that we enrich the posterior distribution over $z_s$ by making it depend not only on the observation subsequence $x_s$, but on the entire sequence $x$. This is possible because posterior distributions can have arbitrary contexts; the observations which are part of $x$ but not of $x_s$ effectively serve as auxiliary variables for a stronger posterior. We use the full sequence $x$ to form a sequence of belief states $b_t$ at all time steps, and in particular the ones computed at the subsampled times $t_i$. By following the same derivation as for the one-step TD-VAE, we obtain:

$\mathbb{E}_S[\log p(x_{t_1}, \ldots, x_{t_n})] \ge \mathbb{E}_S\Big[\sum_i \mathbb{E}_{(z_{t_{i-1}}, z_{t_i}) \sim q}\big[\log p(x_{t_i} \mid z_{t_i}) + \log p(z_{t_{i-1}} \mid x$
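To make the pairwise training regime concrete, below is a minimal sketch of a per-pair loss assembled from the components named in the text: a belief distribution $p_B(z \mid b)$, a smoothing posterior $q_S(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$, a jumpy transition $p_T(z_{t_2} \mid z_{t_1}, \delta t)$, and a decoder $p_D(x \mid z)$. The particular combination of terms follows the usual ELBO pattern of equation (1) applied to a single pair of time points; treat it as an illustrative assumption rather than the exact objective, and treat all callables and argument names as placeholders.

```python
import torch

def tdvae_pair_loss(p_belief, q_smooth, p_transition, decoder, b1, b2, x2, dt):
    """One-sample estimate of a per-pair loss for times t1 < t2, given belief codes
    b1 = b_{t1} and b2 = b_{t2}, the observation x2 = x_{t2}, and the gap dt = t2 - t1
    (the gap is fed to the smoothing and transition networks, as in section 5.3)."""
    q2 = p_belief(b2)
    z2 = q2.rsample()                                    # guess the state at the later time
    q1 = q_smooth(z2, b1, b2, dt)                        # smooth backwards to the earlier state
    z1 = q1.rsample()
    loss = (q2.log_prob(z2).sum(-1)                      # log p_B(z_{t2} | b_{t2})
            + q1.log_prob(z1).sum(-1)                    # log q_S(z_{t1} | z_{t2}, b_{t1}, b_{t2})
            - p_belief(b1).log_prob(z1).sum(-1)          # - log p_B(z_{t1} | b_{t1})
            - p_transition(z1, dt).log_prob(z2).sum(-1)  # - log p_T(z_{t2} | z_{t1})
            - decoder(z2).log_prob(x2).sum(-1))          # - log p_D(x_{t2} | z_{t2})
    return loss.mean()  # minimized over pairs (t1, t2) drawn at random from each sequence
```

The key property this sketch illustrates is that every term depends only on quantities at the two sampled time points (plus the belief codes), so nothing needs to be backpropagated through the intervening steps.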