# hamiltonian_generative_networks__fe372e5f.pdf

Published as a conference paper at ICLR 2020

HAMILTONIAN GENERATIVE NETWORKS

Peter Toth (DeepMind, petertoth@google.com), Danilo J. Rezende (DeepMind, danilor@google.com), Andrew Jaegle (DeepMind, drewjaegle@google.com), Sébastien Racanière (DeepMind, sracaniere@google.com), Aleksandar Botev (DeepMind, botev@google.com), Irina Higgins (DeepMind, irinah@google.com)

The Hamiltonian formalism plays a central role in classical and quantum physics. Hamiltonians are the main tool for modelling the continuous time evolution of systems with conserved quantities, and they come equipped with many useful properties, like time reversibility and smooth interpolation in time. These properties are important for many machine learning problems, from sequence prediction to reinforcement learning and density modelling, but are not typically provided out of the box by standard tools such as recurrent neural networks. In this paper, we introduce the Hamiltonian Generative Network (HGN), the first approach capable of consistently learning Hamiltonian dynamics from high-dimensional observations (such as images) without restrictive domain assumptions. Once trained, we can use HGN to sample new trajectories, perform rollouts both forward and backward in time, and even speed up or slow down the learned dynamics.¹ We demonstrate how a simple modification of the network architecture turns HGN into a powerful normalising flow model, called Neural Hamiltonian Flow (NHF), that uses Hamiltonian dynamics to model expressive densities. We hope that our work serves as a first practical demonstration of the value that the Hamiltonian formalism can bring to deep learning.

1 INTRODUCTION

Figure 1: The Hamiltonian manifold hypothesis: natural images lie on a low-dimensional manifold in pixel space, and natural image sequences (such as one produced by watching a two-body system, as shown in red) correspond to movement on the manifold according to Hamiltonian dynamics.

Any system capable of a wide range of intelligent behaviours within a dynamic environment requires a good predictive model of the environment's dynamics. This is true for intelligence in both biological (Friston, 2009; 2010; Clark, 2013) and artificial (Hafner et al., 2019; Battaglia et al., 2013; Watter et al., 2015; Watters et al., 2019) systems. Predicting environmental dynamics is also of fundamental importance in physics, where Hamiltonian dynamics and the structure-preserving transformations it provides have been used to unify, categorise and discover new physical entities (Noether, 1915; Livio, 2012). Hamilton's fundamental result was a system of two first-order differential equations that, in a stroke, unified the predictions made by prior Newtonian and Lagrangian mechanics (Hamilton, 1834). After well over a century of development, it has proven to be essential for parsimonious descriptions of nearly all of physics.

Equal contribution.
¹ More results and video evaluations are available at: http://tiny.cc/hgn

Figure 2: Hamiltonian Generative Network schematic. The encoder takes a stacked sequence of images and infers the posterior over the initial state. The state is rolled out using the learnt Hamiltonian. Note that we depict Euler updates of the state for schematic simplicity, while in practice this is done using a leapfrog integrator.
For each unroll step we reconstruct the image from the position q state variables only and calculate the reconstruction error.

Hamilton's equations provide a way to predict a system's future behavior from its current state in phase space (that is, its position and momentum for classical Newtonian systems, and its generalized position and momentum more broadly). Hamiltonian mechanics induce dynamics with several nice properties: they are smooth, they include paths along which certain physical quantities are conserved (symmetries) and their time evolution is fully reversible. These properties are also useful for machine learning systems. For example, capturing the time-reversible dynamics of the world state might be useful for agents attempting to account for how their actions led to effects in the world; recovering an abstract low-dimensional manifold with paths that conserve various properties is tightly connected to outstanding problems in representation learning (see e.g. Higgins et al. (2018) for more discussion); and the ability to conserve energy is related to expressive density modelling in generative approaches (Rezende & Mohamed, 2015). Hence, we propose a reformulation of the well-known image manifold hypothesis by extending it with a Hamiltonian assumption (illustrated in Fig. 1): natural images lie on a low-dimensional manifold embedded within a high-dimensional pixel space, and natural sequences of images trace out paths on this manifold that follow the equations of an abstract Hamiltonian.

Given the rich set of established tools provided by Hamiltonian descriptions of system dynamics, can we adapt these to solve outstanding machine learning problems? When it comes to adapting the Hamiltonian formalism to contemporary machine learning, two questions need to be addressed: 1) how should a system's Hamiltonian be learned from data; and 2) how should a system's abstract phase space be inferred from the high-dimensional observations typically available to machine learning systems? Note that the inferred state may need to include information about properties that play no physical role in classical mechanics but which can still affect an object's behavior or function, like its colour or shape. The first question was recently addressed by the Hamiltonian Neural Network (HNN) (Greydanus et al., 2019) approach, which was able to learn the Hamiltonian of three simple physical systems from noisy phase space observations. However, to address the second question, HNN makes assumptions that restrict it to Newtonian systems and appear to limit its ability to scale to more challenging video datasets.

In this paper we introduce the first model that answers both of these questions without relying on restrictive domain assumptions. Our model, the Hamiltonian Generative Network (HGN), is a generative model that infers the abstract state from pixels and then unrolls the learned Hamiltonian following the Hamiltonian equations (Goldstein, 1980). We demonstrate that HGN is able to reliably learn the Hamiltonian dynamics from noisy pixel observations on four simulated physical systems: a pendulum, a mass-spring system, and two- and three-body systems. Our approach outperforms HNN by a significant margin. After training, we demonstrate that HGN produces meaningful samples with reversible dynamics and that the speed of rollouts can be controlled by changing the time derivative of the integrator at test time.
Finally, we show that a small modification of our architecture yields a flexible, normalising flow-based generative model that respects Hamiltonian dynamics. We show that this model, which we call Neural Hamiltonian Flow (NHF), inherits the beneficial properties of the Hamiltonian formalism (including volume preservation) and is capable of expressive density modelling, while offering computational benefits over standard flow-based models.

2 RELATED WORK

Most machine learning approaches to modeling dynamics use discrete time steps, which often results in an accumulation of approximation errors when producing rollouts and, therefore, in a fast drop in accuracy. Our approach, on the other hand, does not discretise continuous dynamics and models them directly using the Hamiltonian differential equations, which leads to slower divergence for longer rollouts. The density model version of HGN (NHF) uses the Hamiltonian dynamics as normalising flows along with a numerical integrator, making our approach somewhat related to the recently published neural ODE work (Chen et al., 2018; Grathwohl et al., 2018). What makes our approach different is that Hamiltonian dynamics are both invertible and volume-preserving (as discussed in Sec. 3.3), which makes our approach computationally cheaper than the alternatives and more suitable as a model of physical systems and other processes that have these properties. Also related is recent work attempting to learn a model of physical system dynamics end-to-end from image sequences using an autoencoder (de Avila Belbute-Peres et al., 2018). Unlike our work, this model does not exploit Hamiltonian dynamics and is trained in a supervised or semi-supervised regime.

2.1 HAMILTONIAN NEURAL NETWORK

One of the most comparable approaches to ours is the Hamiltonian Neural Network (HNN) (Greydanus et al., 2019). This work, done concurrently to ours, proposes a way to learn Hamiltonian dynamics from data by training the gradients of a neural network (obtained by backpropagation) to match the time derivative of a target system in a supervised fashion. In particular, HNN learns a differentiable function $\mathcal{H}(q, p)$ that maps a system's state (its position, $q$, and momentum, $p$) to a scalar quantity interpreted as the system's Hamiltonian. This model is trained so that $\mathcal{H}(q, p)$ satisfies the Hamiltonian equations by minimizing

$$\mathcal{L}_{\mathrm{HNN}} = \mathbb{E}\left[\left(\frac{\partial \mathcal{H}}{\partial p} - \frac{dq}{dt}\right)^2 + \left(\frac{\partial \mathcal{H}}{\partial q} + \frac{dp}{dt}\right)^2\right], \quad (1)$$

where the derivatives $\frac{\partial \mathcal{H}}{\partial q}$ and $\frac{\partial \mathcal{H}}{\partial p}$ are computed by backpropagation. Hence, this learning procedure is most directly applicable when the true state space (in canonical coordinates) and its time derivatives are known. Accordingly, in the majority of the experiments presented by the authors, the Hamiltonian was learned from the ground truth state space directly, rather than from pixel observations. The single experiment with pixel observations required a modification of the model. First, the input to the model became a concatenated, flattened pair of images $o_t = [x_t, x_{t+1}]$, which was then mapped to a low-dimensional embedding space $z_t = [q_t, p_t]$ using an encoder neural network. Note that the dimensionality of this embedding ($z \in \mathbb{R}^2$ in the case of the pendulum system presented in the paper) was chosen to perfectly match the ground truth dimensionality of the phase space, which was assumed to be known a priori. This, however, is not always possible.
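To make the HNN objective in Eq. 1 concrete, a minimal sketch in JAX could look as follows. The function name `hnn_loss` is ours, and `hamiltonian` stands for any differentiable network mapping $(q, p)$ to a scalar, so this should be read as an illustration under those assumptions rather than the authors' implementation.

```python
import jax
import jax.numpy as jnp

def hnn_loss(hamiltonian, q, p, dq_dt, dp_dt):
    """Sketch of the HNN objective (Eq. 1): the symplectic gradient of the learned
    Hamiltonian should match the observed time derivatives of the state."""
    # hamiltonian: a differentiable function (q, p) -> scalar energy.
    dH_dq, dH_dp = jax.grad(hamiltonian, argnums=(0, 1))(q, p)
    # Hamilton's equations: dq/dt = dH/dp and dp/dt = -dH/dq.
    return jnp.sum((dH_dp - dq_dt) ** 2) + jnp.sum((dH_dq + dp_dt) ** 2)
```

In practice this loss would be averaged over a batch of observed states and their time derivatives, which is exactly why the method is most natural when the canonical coordinates are observed directly.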
The latent embedding was then treated as an estimate of the position and the momentum of the system depicted in the images, where the momentum was assumed to be equal to the velocity of the system. This assumption was enforced by an additional constraint, found necessary to encourage learning, which encouraged the finite-difference time derivative of the position latent to equal the momentum latent:

$$\mathcal{L}_{CC} = \left(p_t - (q_{t+1} - q_t)\right)^2. \quad (2)$$

This assumption is appropriate in the case of the simple pendulum system presented in the paper; however, it does not hold more generally. Note that our approach does not make any assumptions on the dimensionality of the learned phase space, or the form of the momenta coordinates, which makes our approach more general and allows it to perform well on a wider range of image domains, as presented in Sec. 4.

3.1 THE HAMILTONIAN FORMALISM

The Hamiltonian formalism describes the continuous time evolution of a system in an abstract phase space $s = (q, p) \in \mathbb{R}^{2n}$, where $q \in \mathbb{R}^n$ is a vector of position coordinates, and $p \in \mathbb{R}^n$ is the corresponding vector of momenta. The time evolution of the system in phase space is given by the Hamiltonian equations:

$$\frac{dq}{dt} = \frac{\partial \mathcal{H}}{\partial p}, \qquad \frac{dp}{dt} = -\frac{\partial \mathcal{H}}{\partial q}, \quad (3)$$

where the Hamiltonian $\mathcal{H}: \mathbb{R}^{2n} \to \mathbb{R}$ maps the state $s = (q, p)$ to a scalar representing the energy of the system. The Hamiltonian specifies a vector field over the phase space that describes all possible dynamics of the system. For example, the Hamiltonian for an undamped mass-spring system is $\mathcal{H}(q, p) = \frac{kq^2}{2} + \frac{p^2}{2m}$, where $m$ is the mass, $q \in \mathbb{R}$ is its position, $p \in \mathbb{R}$ is its momentum and $k$ is the spring stiffness coefficient. The Hamiltonian can often be expressed as the sum of the kinetic $T$ and potential $V$ energies, $\mathcal{H} = T(p) + V(q)$, as is the case for the mass-spring example. Identifying a system's Hamiltonian is in general a very difficult problem, requiring carefully instrumented experiments and researcher insight produced by years of training. In what follows, we describe a method for modeling a system's Hamiltonian from raw observations (such as pixels) by inferring a system's state with a generative model and rolling it out with the Hamiltonian equations.

3.2 LEARNING HAMILTONIANS WITH THE HAMILTONIAN GENERATIVE NETWORK

Our goal is to build a model that can learn a Hamiltonian from observations. We assume that the data $X = \{(x^1_0, \ldots, x^1_T), \ldots, (x^K_0, \ldots, x^K_T)\}$ comes in the form of high-dimensional noisy observations, where each $x^i_t = G(s^i_t) = G(q^i_t)$ is a non-deterministic function of the generalised position in the phase space, and the full state is a non-deterministic function of a sequence of images, $s^i_t = (q^i_t, p^i_t) = D(x^i_0, \ldots, x^i_t)$, since the momentum (and hence the full state) cannot in general be recovered from a single observation. Our goal is to infer the abstract state and learn the Hamiltonian dynamics in phase space by observing $K$ motion sequences, discretised into $T+1$ time steps each. In the process, we also want to learn an approximation to the generative process $G(s)$ in order to be able to move in both directions between the high-dimensional observations and the low-dimensional abstract phase space. Although the Hamiltonian formalism is general and does not depend on the form of the observations, we present our model in terms of visual observations, since many known physical Hamiltonian systems, like a mass-spring system, can be easily observed visually.
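As a concrete illustration of Eq. 3 (our sketch, not code from the paper), the following JAX snippet evaluates the Hamiltonian vector field for the mass-spring Hamiltonian given above; the default constants are chosen to match those used later for data generation, and all names are ours.

```python
import jax
import jax.numpy as jnp

def mass_spring_hamiltonian(q, p, m=0.5, k=2.0):
    """H(q, p) = k q^2 / 2 + p^2 / (2 m) for the undamped mass-spring system."""
    return 0.5 * k * q ** 2 + p ** 2 / (2.0 * m)

def hamiltonian_vector_field(hamiltonian, q, p):
    """Evaluate Hamilton's equations (Eq. 3): dq/dt = dH/dp, dp/dt = -dH/dq."""
    dH_dq, dH_dp = jax.grad(hamiltonian, argnums=(0, 1))(q, p)
    return dH_dp, -dH_dq

# Example: at q = 1, p = 0 the position is momentarily stationary while the
# momentum decreases, i.e. the spring pulls the mass back towards the origin.
dq_dt, dp_dt = hamiltonian_vector_field(mass_spring_hamiltonian, jnp.array(1.0), jnp.array(0.0))
```

The same pattern applies to a learned Hamiltonian: once a differentiable scalar function of the state is available, the dynamics follow from automatic differentiation alone.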
In this section we introduce the Hamiltonian Generative Network (HGN), a generative model that is trained to behave according to the Hamiltonian dynamics in an abstract phase space learned from raw observations of image sequences. HGN consists of three parts (see Fig. 2): an inference network, a Hamiltonian network and a decoder network, which are discussed next.

The inference network takes in a sequence of images $(x^i_0, \ldots, x^i_T)$, concatenated along the channel dimension, and outputs a posterior over the initial state, $z \sim q_\phi(\cdot \mid x_0, \ldots, x_T)$, corresponding to the system's coordinates in phase space at the first frame of the sequence. We parametrise $q_\phi(z)$ as a diagonal Gaussian with a unit Gaussian prior $p(z) = \mathcal{N}(0, I)$ and optimise it using the usual reparametrisation trick (Kingma & Welling, 2014). To increase the expressivity of the abstract phase space $s_0$, we map samples from the posterior with another function $s_0 = f_\psi(z)$ to obtain the system's initial state. As mentioned in Sec. 3.1, the Hamiltonian function expects the state to be of the form $s = (q, p)$, hence we initialise $s_0 \in \mathbb{R}^{2n}$ and arbitrarily assign the first half of the units to represent abstract position $q$ and the other half to represent abstract momentum $p$.

Figure 3: A: standard normalising flow, where the invertible function $f_i$ is implemented by a neural network. B: Hamiltonian flows, where the initial density is transformed using the learned Hamiltonian dynamics. Note that we depict Euler updates of the state for schematic simplicity, while in practice this is done using a leapfrog integrator.

Figure 4: A schematic representation of NHF, which can perform expressive density modelling by using the learned Hamiltonians as normalising flows. Note that we depict Euler updates of the state for schematic simplicity, while in practice this is done using a leapfrog integrator.

The Hamiltonian network is parametrised as a neural network with parameters $\gamma$ that takes in the inferred abstract state and maps it to a scalar $\mathcal{H}_\gamma(s_t) \in \mathbb{R}$. We can use this function to do rollouts in the abstract state space using the Hamiltonian equations (Eq. 3), for example by Euler integration: $s_{t+1} = (q_{t+1}, p_{t+1}) = \left(q_t + \frac{\partial \mathcal{H}}{\partial p_t}\, dt,\; p_t - \frac{\partial \mathcal{H}}{\partial q_t}\, dt\right)$. In this work we assume a separable Hamiltonian, so in practice we use a more sophisticated leapfrog integrator to roll out the system, since it has better theoretical properties and results in better performance in practice (see Sec. A.6 in the Supplementary Materials for more details).

The decoder network is a standard deconvolutional network (we use the architecture from Karras et al. (2018)) that takes in a low-dimensional representation vector and produces a high-dimensional pixel reconstruction. Given that each instantaneous image does not depend on the momentum information, we restrict the decoder to take only the position coordinates of the abstract state as input: $p_\theta(x_t) = d_\theta(q_t)$.

The objective function. Given a sequence of $T+1$ images, HGN is trained to optimise the following objective:

$$\mathcal{L}(\phi, \psi, \gamma, \theta;\, x_0, \ldots, x_T) = \frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}_{q_\phi(z \mid x_0, \ldots, x_T)}\left[ \log p_{\psi,\gamma,\theta}(x_t \mid q_t) \right] - \mathrm{KL}\left( q_\phi(z) \,\|\, p(z) \right), \quad (4)$$

which can be seen as a temporally extended variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) objective, consisting of a reconstruction term for each frame, and an additional term that encourages the inferred posterior to match a prior.
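The paper states that a leapfrog integrator is used in practice, with details deferred to its Sec. A.6. As an illustration only, the sketch below shows one generic leapfrog update under the separability assumption stated above, together with a rollout loop; the function names are ours and `hamiltonian` stands in for the learned network $\mathcal{H}_\gamma$.

```python
import jax

def leapfrog_step(hamiltonian, q, p, dt):
    """One leapfrog update for a separable Hamiltonian H(q, p) = T(p) + V(q).
    The map is symplectic (volume-preserving) and exactly invertible via dt -> -dt."""
    dH_dq = jax.grad(hamiltonian, argnums=0)   # equals dV/dq for a separable Hamiltonian
    dH_dp = jax.grad(hamiltonian, argnums=1)   # equals dT/dp for a separable Hamiltonian
    p_half = p - 0.5 * dt * dH_dq(q, p)        # half step for the momentum
    q_next = q + dt * dH_dp(q, p_half)         # full step for the position
    p_next = p_half - 0.5 * dt * dH_dq(q_next, p_half)   # half step for the momentum
    return q_next, p_next

def rollout(hamiltonian, q0, p0, dt, num_steps):
    """Unroll the dynamics from an initial state s0 = (q0, p0).
    In HGN only the position q_t would be passed to the decoder at each step."""
    trajectory = [(q0, p0)]
    q, p = q0, p0
    for _ in range(num_steps):
        q, p = leapfrog_step(hamiltonian, q, p, dt)
        trajectory.append((q, p))
    return trajectory
```

Because the update is exactly reversible under $dt \to -dt$, the same routine supports the backward rollouts and the test-time speed changes described in the introduction.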
The key difference from a standard VAE lies in how we generate rollouts: these are produced using the Hamiltonian equations of motion in the learned Hamiltonian phase space.

3.3 LEARNING HAMILTONIAN FLOWS

In this section, we describe how the architecture described above can be modified to produce a model for flexible density estimation. Learning computationally feasible and accurate estimates of complex densities is an open problem in generative modelling.

Figure 5: Ground truth Hamiltonians and samples from generated datasets for the ideal pendulum, mass-spring, and two- and three-body systems used to train HGN.

A common idea to address this problem is to start with a simple prior distribution $\pi(u)$ and then transform it into a more expressive form $p(x)$ through a series of composable invertible transformations $f_i(u)$ called normalising flows (Rezende & Mohamed, 2015) (see Fig. 3A). Sampling can then be done according to $x = f_T \circ \cdots \circ f_1(u)$, where $u \sim \pi(\cdot)$. Density evaluation, however, requires more expensive computations of both inverting the flows and calculating the determinants of their Jacobians. For a single flow step, this equates to the following: $p(x) = \pi\left(f^{-1}(x)\right) \left|\det \frac{\partial f^{-1}}{\partial x}\right|$. While a lot of work has recently gone into proposing better alternatives for flow-based generative models in machine learning (Rezende & Mohamed, 2015; Kingma et al., 2016; Papamakarios et al., 2017; Dinh et al., 2017; Huang et al., 2018; Kingma & Dhariwal, 2018; Hoogeboom et al., 2019; Chen et al., 2019; Grathwohl et al., 2018), none of the approaches manage to produce both sampling and density evaluation steps that are computationally scalable.

The two requirements for normalising flows are that they are invertible and volume preserving, which are exactly the two properties that Hamiltonian dynamics possess. This can be seen by computing the determinant of the Jacobian of the infinitesimal transformation induced by the Hamiltonian $\mathcal{H}$. By Jacobi's formula, the derivative of the determinant at the identity is the trace, and:

$$\det\left(I + dt \begin{pmatrix} \frac{\partial^2 \mathcal{H}}{\partial q_i \partial p_j} & \frac{\partial^2 \mathcal{H}}{\partial p_i \partial p_j} \\ -\frac{\partial^2 \mathcal{H}}{\partial q_i \partial q_j} & -\frac{\partial^2 \mathcal{H}}{\partial p_i \partial q_j} \end{pmatrix}\right) = 1 + dt\, \mathrm{tr}\begin{pmatrix} \frac{\partial^2 \mathcal{H}}{\partial q_i \partial p_j} & \frac{\partial^2 \mathcal{H}}{\partial p_i \partial p_j} \\ -\frac{\partial^2 \mathcal{H}}{\partial q_i \partial q_j} & -\frac{\partial^2 \mathcal{H}}{\partial p_i \partial q_j} \end{pmatrix} + O(dt^2) = 1 + dt\left(\frac{\partial^2 \mathcal{H}}{\partial q_i \partial p_i} - \frac{\partial^2 \mathcal{H}}{\partial p_i \partial q_i}\right) + O(dt^2) = 1 + O(dt^2), \quad (5)$$

where the $i \neq j$ terms are the off-diagonal entries of the Jacobian and do not contribute to the trace. Hence, in this section we describe a simple modification of HGN that allows it to act as a normalising flow. We will refer to this modification as the Neural Hamiltonian Flow (NHF) model. First, we assume that the initial state $s_0$ is a sample from a simple prior $s_0 \sim \pi_0(\cdot)$. We then chain several Hamiltonians $\mathcal{H}_i$ to transform the sample to a new state $s_T = \mathcal{H}_T \circ \cdots \circ \mathcal{H}_1(s_0)$, which corresponds to a sample from the more expressive final density $s_T \sim p(x)$ (see Fig. 3B for an illustration of a single Hamiltonian flow). Note that unlike HGN, where the Hamiltonian dynamics are shared across time steps (a single Hamiltonian is learned and its parameters are shared across time steps of a rollout), in NHF each step of the flow (corresponding to a single time step of a rollout) can be parametrised by a different Hamiltonian. The inverse of such a Hamiltonian flow can be easily obtained by replacing $dt$ by $-dt$ in the Hamiltonian equations and reversing the order of the transformations, $s_0 = \mathcal{H}_1^{-dt} \circ \cdots \circ \mathcal{H}_T^{-dt}(s_T)$ (we will use the appropriate $dt$ or $-dt$ superscript from now on to make the direction of integration of the Hamiltonian dynamics more explicit). The resulting density $p(s_T)$ is given by the following equation:

$$\ln p(s_T) = \ln \pi(s_0) = \ln \pi\left(\mathcal{H}_1^{-dt} \circ \cdots \circ \mathcal{H}_T^{-dt}(s_T)\right) + O(dt^2).$$
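Reusing the `leapfrog_step` sketch above, a Hamiltonian flow, its exact inverse, and the resulting log-density could look as follows. This is our illustration of the idea with hypothetical names: `hamiltonians` is a list of learned Hamiltonian functions (one per flow step) and `log_prior` evaluates $\ln \pi$.

```python
def nhf_forward(hamiltonians, q0, p0, dt):
    """Transform a prior sample s0 = (q0, p0) with a chain of Hamiltonian flow steps,
    one (possibly distinct) learned Hamiltonian per step."""
    q, p = q0, p0
    for h in hamiltonians:
        q, p = leapfrog_step(h, q, p, dt)      # integrator sketched in Sec. 3.2
    return q, p

def nhf_inverse(hamiltonians, q_T, p_T, dt):
    """Invert the flow exactly by integrating backwards (dt -> -dt) in reverse order."""
    q, p = q_T, p_T
    for h in reversed(hamiltonians):
        q, p = leapfrog_step(h, q, p, -dt)
    return q, p

def nhf_log_density(hamiltonians, log_prior, q_T, p_T, dt):
    """Because each step is volume-preserving, ln p(s_T) = ln pi(s_0):
    no Jacobian log-determinant needs to be computed."""
    q0, p0 = nhf_inverse(hamiltonians, q_T, p_T, dt)
    return log_prior(q0, p0)
```

The absence of any log-determinant term in `nhf_log_density` is precisely the computational saving that Eq. 5 guarantees.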
Our proposed NHF is more computationally efficient than many other flow-based approaches, because it does not require the expensive step of calculating the trace of the Jacobian. Hence, the NHF model constitutes a more structured form of a Neural ODE flow (Chen et al., 2018), but with a few notable differences: (i) the Hamiltonian ODE is volume-preserving, which makes the computation of log-likelihood cheaper than for a general ODE flow; (ii) general ODE flows are only invertible in the limit $dt \to 0$, whereas for some Hamiltonians we can use more complex integrators (like the symplectic leapfrog integrator described in Sec. A.6) that are both invertible and volume-preserving for any $dt > 0$.

The structure $s = (q, p)$ on the state space imposed by the Hamiltonian dynamics can be constraining from the point of view of density estimation. We choose to use the trick proposed in the Hamiltonian Monte Carlo (HMC) literature (Neal et al., 2011; Salimans et al., 2015; Levy et al., 2017), which treats the momentum $p$ as a latent variable (see Fig. 4). This is an elegant solution which avoids having to artificially split the density into two disjoint sets. As a result, the data density that our Hamiltonian flows are modelling becomes exclusively parametrised by $p(q_T)$, which takes the following form:

$$p(q_T) = \int p(q_T, p_T)\, dp_T = \int \pi\left(\mathcal{H}_1^{-dt} \circ \cdots \circ \mathcal{H}_T^{-dt}(q_T, p_T)\right) dp_T.$$

This integral is intractable, but the model can still be trained via variational methods, where we introduce a variational density $f_\psi(p_T \mid q_T)$ with parameters $\psi$ and instead optimise the following ELBO:

$$\mathrm{ELBO}(q_T) = \mathbb{E}_{f_\psi(p_T \mid q_T)}\left[ \ln \pi\left(\mathcal{H}_1^{-dt} \circ \cdots \circ \mathcal{H}_T^{-dt}(q_T, p_T)\right) - \ln f_\psi(p_T \mid q_T) \right] \leq \ln p(q_T). \quad (6)$$

Note that, in contrast to VAEs (Kingma & Welling, 2014; Rezende et al., 2014), the ELBO in Eq. 6 is not explicitly in the form of a reconstruction error term plus a KL term.

Figure 6: Average pixel MSE for each step of a single train and test unroll on four physical systems (pendulum, mass-spring, two-body, three-body; curves shown for HNN, HGN (leapfrog), HGN (Euler), HGN (determ) and HGN (no f)). All versions of HGN outperform HNN, which often learned to reconstruct an average image. Note the different scales of the plots comparing HGN to HNN and different versions of HGN.

In order to directly compare the performance of HGN to that of its closest baseline, HNN, we generated four datasets analogous to the data used in Greydanus et al. (2019). The datasets contained observations of the time evolution of four physical systems: mass-spring, pendulum, and two- and three-body systems (see Fig. 5). In order to generate each trajectory, we first randomly sampled an initial state, then produced a 30-step rollout following the ground truth Hamiltonian dynamics, before adding Gaussian noise with σ² = 0.1 to each phase-space coordinate, and rendering a corresponding 64x64 pixel observation. We generated 50 000 train and 10 000 test trajectories for each dataset. When sampling initial states, we start by first sampling the total energy of the system, denoted as a radius $r$ in the phase space, before sampling the initial state $(q, p)$ uniformly on the circle of radius $r$. Note that our pendulum dataset is more challenging than the one described in Greydanus et al. (2019), where the pendulum had a fixed radius and was initialized at a maximum angle of 30° from the central axis.
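To make the data-generation procedure above concrete, here is a sketch of how one trajectory for a one-dimensional system (pendulum or mass-spring) could be produced. It is our illustration, not the authors' pipeline: the integrator choice, the step size `dt`, the `sample_radius` callable, and the reading of σ² = 0.1 as a variance are all assumptions, and the 64x64 renderer is left unspecified.

```python
import jax
import jax.numpy as jnp

def generate_trajectory(rng, hamiltonian, sample_radius, dt=0.05, num_steps=30):
    """Sample a radius (total energy), roll out the ground-truth Hamiltonian with
    leapfrog (sketched in Sec. 3.2), then corrupt each phase-space coordinate
    with Gaussian noise of variance 0.1 (assumed interpretation of sigma^2 = 0.1)."""
    rng_r, rng_angle, rng_noise = jax.random.split(rng, 3)
    r = sample_radius(rng_r)                              # e.g. U(0.1, 1.0) for mass-spring
    angle = jax.random.uniform(rng_angle, (), minval=0.0, maxval=2.0 * jnp.pi)
    q, p = r * jnp.cos(angle), r * jnp.sin(angle)         # initial state on the circle of radius r
    states = [(q, p)]
    for _ in range(num_steps - 1):
        q, p = leapfrog_step(hamiltonian, q, p, dt)
        states.append((q, p))
    states = jnp.asarray(states)
    noisy = states + jnp.sqrt(0.1) * jax.random.normal(rng_noise, states.shape)
    return noisy   # each noisy (q, p) would then be rendered as a 64x64 image
```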
Mass-spring. The dynamics of a frictionless mass-spring system are modeled by the Hamiltonian $\mathcal{H} = \frac{kq^2}{2} + \frac{p^2}{2m}$, where $k$ is the spring constant and $m$ is the mass. We fix $k = 2$ and $m = 0.5$, then sample a radius from a uniform distribution $r \sim U(0.1, 1.0)$.

Pendulum. The dynamics of a frictionless pendulum are modeled by the Hamiltonian $\mathcal{H} = 2mgl\left(1 - \cos q\right) + \frac{p^2}{2ml^2}$, where $g$ is the gravitational constant and $l$ is the length of the pendulum. We fix $g = 3$, $m = 0.5$, $l = 1$, then sample a radius from a uniform distribution $r \sim U(1.3, 2.3)$.

Two- and three-body problems. In an n-body problem, particles interact with each other through an attractive force, like gravity. The dynamics are represented by the following Hamiltonian: $\mathcal{H} = \sum_{i=1}^{n} \frac{\|p_i\|^2}{2m_i} - \sum_{1 \le i < j \le n} \frac{g\, m_i m_j}{\|q_i - q_j\|}$.
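For completeness, the ground-truth pendulum and n-body Hamiltonians above could be written down as follows, in the same JAX conventions as the earlier sketches (`mass_spring_hamiltonian` was sketched in Sec. 3.1); the value of the n-body coupling constant `g` and the small distance regulariser are our assumptions, added for illustration and numerical safety rather than taken from the paper.

```python
import jax.numpy as jnp

def pendulum_hamiltonian(q, p, m=0.5, g=3.0, l=1.0):
    """H(q, p) = 2 m g l (1 - cos q) + p^2 / (2 m l^2), with the constants used in the text."""
    return 2.0 * m * g * l * (1.0 - jnp.cos(q)) + p ** 2 / (2.0 * m * l ** 2)

def nbody_hamiltonian(q, p, m, g=1.0):
    """H = sum_i ||p_i||^2 / (2 m_i) - sum_{i<j} g m_i m_j / ||q_i - q_j||.
    q and p have shape (n, 2); m has shape (n,). The coupling constant g is
    not specified in this section and is set to 1 here for illustration."""
    kinetic = jnp.sum(jnp.sum(p ** 2, axis=-1) / (2.0 * m))
    diff = q[:, None, :] - q[None, :, :]                   # pairwise displacement vectors
    dist = jnp.sqrt(jnp.sum(diff ** 2, axis=-1) + 1e-12)   # pairwise distances (regularised)
    i, j = jnp.triu_indices(q.shape[0], k=1)               # unordered pairs with i < j
    potential = -jnp.sum(g * m[i] * m[j] / dist[i, j])
    return kinetic + potential
```

Together with `hamiltonian_vector_field` or `leapfrog_step` from the earlier sketches, these functions are sufficient to reproduce the kind of ground-truth rollouts described in this section.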