# hamiltonian_generative_networks__fe372e5f.pdf

Published as a conference paper at ICLR 2020

HAMILTONIAN GENERATIVE NETWORKS

Peter Toth (DeepMind, petertoth@google.com), Danilo J. Rezende (DeepMind, danilor@google.com), Andrew Jaegle (DeepMind, drewjaegle@google.com), Sébastien Racanière (DeepMind, sracaniere@google.com), Aleksandar Botev (DeepMind, botev@google.com), Irina Higgins (DeepMind, irinah@google.com)

The Hamiltonian formalism plays a central role in classical and quantum physics. Hamiltonians are the main tool for modelling the continuous time evolution of systems with conserved quantities, and they come equipped with many useful properties, like time reversibility and smooth interpolation in time. These properties are important for many machine learning problems, from sequence prediction to reinforcement learning and density modelling, but are not typically provided out of the box by standard tools such as recurrent neural networks. In this paper, we introduce the Hamiltonian Generative Network (HGN), the first approach capable of consistently learning Hamiltonian dynamics from high-dimensional observations (such as images) without restrictive domain assumptions. Once trained, we can use HGN to sample new trajectories, perform rollouts both forward and backward in time, and even speed up or slow down the learned dynamics.¹ We demonstrate how a simple modification of the network architecture turns HGN into a powerful normalising flow model, called Neural Hamiltonian Flow (NHF), that uses Hamiltonian dynamics to model expressive densities. We hope that our work serves as a first practical demonstration of the value that the Hamiltonian formalism can bring to deep learning.

1 INTRODUCTION

Figure 1: The Hamiltonian manifold hypothesis: natural images lie on a low-dimensional manifold in pixel space, and natural image sequences (such as one produced by watching a two-body system, as shown in red) correspond to movement on the manifold according to Hamiltonian dynamics.

Any system capable of a wide range of intelligent behaviours within a dynamic environment requires a good predictive model of the environment's dynamics. This is true for intelligence in both biological (Friston, 2009; 2010; Clark, 2013) and artificial (Hafner et al., 2019; Battaglia et al., 2013; Watter et al., 2015; Watters et al., 2019) systems. Predicting environmental dynamics is also of fundamental importance in physics, where Hamiltonian dynamics and the structure-preserving transformations it provides have been used to unify, categorise and discover new physical entities (Noether, 1915; Livio, 2012). Hamilton's fundamental result was a system of two first-order differential equations that, in a stroke, unified the predictions made by prior Newtonian and Lagrangian mechanics (Hamilton, 1834). After well over a century of development, it has proven to be essential for parsimonious descriptions of nearly all of physics.

Equal contribution.
¹ More results and video evaluations are available at: http://tiny.cc/hgn

Figure 2: Hamiltonian Generative Network schematic. The encoder takes a stacked sequence of images and infers the posterior over the initial state. The state is rolled out using the learnt Hamiltonian. Note that we depict Euler updates of the state for schematic simplicity, while in practice this is done using a leapfrog integrator.
For each unroll step we reconstruct the image from the position q state variables only and calculate the reconstruction error.

Hamilton's equations provide a way to predict a system's future behavior from its current state in phase space (that is, its position and momentum for classical Newtonian systems, and its generalized position and momentum more broadly). Hamiltonian mechanics induce dynamics with several nice properties: they are smooth, they include paths along which certain physical quantities are conserved (symmetries) and their time evolution is fully reversible. These properties are also useful for machine learning systems. For example, capturing the time-reversible dynamics of the world state might be useful for agents attempting to account for how their actions led to effects in the world; recovering an abstract low-dimensional manifold with paths that conserve various properties is tightly connected to outstanding problems in representation learning (see e.g. Higgins et al. (2018) for more discussion); and the ability to conserve energy is related to expressive density modelling in generative approaches (Rezende & Mohamed, 2015). Hence, we propose a reformulation of the well-known image manifold hypothesis by extending it with a Hamiltonian assumption (illustrated in Fig. 1): natural images lie on a low-dimensional manifold embedded within a high-dimensional pixel space, and natural sequences of images trace out paths on this manifold that follow the equations of an abstract Hamiltonian.

Given the rich set of established tools provided by Hamiltonian descriptions of system dynamics, can we adapt these to solve outstanding machine learning problems? When it comes to adapting the Hamiltonian formalism to contemporary machine learning, two questions need to be addressed: 1) how should a system's Hamiltonian be learned from data; and 2) how should a system's abstract phase space be inferred from the high-dimensional observations typically available to machine learning systems? Note that the inferred state may need to include information about properties that play no physical role in classical mechanics but which can still affect an object's behavior or function, like its colour or shape. The first question was recently addressed by the Hamiltonian Neural Network (HNN) (Greydanus et al., 2019) approach, which was able to learn the Hamiltonian of three simple physical systems from noisy phase space observations. However, to address the second question, HNN makes assumptions that restrict it to Newtonian systems and appear to limit its ability to scale to more challenging video datasets.

In this paper we introduce the first model that answers both of these questions without relying on restrictive domain assumptions. Our model, the Hamiltonian Generative Network (HGN), is a generative model that infers the abstract state from pixels and then unrolls the learned Hamiltonian following the Hamiltonian equations (Goldstein, 1980). We demonstrate that HGN is able to reliably learn the Hamiltonian dynamics from noisy pixel observations on four simulated physical systems: a pendulum, a mass-spring system, and two- and three-body systems. Our approach outperforms HNN by a significant margin. After training, we demonstrate that HGN produces meaningful samples with reversible dynamics and that the speed of rollouts can be controlled by changing the time derivative of the integrator at test time.
Finally, we show that a small modification of our architecture yields a flexible, normalising flow-based generative model that respects Hamiltonian dynamics. We show that this model, which we call Neural Hamiltonian Flow (NHF), inherits the beneficial properties of the Hamiltonian formalism (including volume preservation) and is capable of expressive density modelling, while offering computational benefits over standard flow-based models.

2 RELATED WORK

Most machine learning approaches to modeling dynamics use discrete time steps, which often results in an accumulation of approximation errors when producing rollouts and, therefore, in a fast drop in accuracy. Our approach, on the other hand, does not discretise continuous dynamics and models them directly using the Hamiltonian differential equations, which leads to slower divergence for longer rollouts. The density model version of HGN (NHF) uses the Hamiltonian dynamics as normalising flows along with a numerical integrator, making our approach somewhat related to the recently published neural ODE work (Chen et al., 2018; Grathwohl et al., 2018). What makes our approach different is that Hamiltonian dynamics are both invertible and volume-preserving (as discussed in Sec. 3.3), which makes our approach computationally cheaper than the alternatives and more suitable as a model of physical systems and other processes that have these properties. Also related is recent work attempting to learn a model of physical system dynamics end-to-end from image sequences using an autoencoder (de Avila Belbute-Peres et al., 2018). Unlike our work, this model does not exploit Hamiltonian dynamics and is trained in a supervised or semi-supervised regime.

2.1 HAMILTONIAN NEURAL NETWORK

One of the most comparable approaches to ours is the Hamiltonian Neural Network (HNN) (Greydanus et al., 2019). This work, done concurrently to ours, proposes a way to learn Hamiltonian dynamics from data by training the gradients of a neural network (obtained by backpropagation) to match the time derivative of a target system in a supervised fashion. In particular, HNN learns a differentiable function $\mathcal{H}(q, p)$ that maps a system's state (its position, $q$, and momentum, $p$) to a scalar quantity interpreted as the system's Hamiltonian. This model is trained so that $\mathcal{H}(q, p)$ satisfies the Hamiltonian equations by minimizing

$$\mathcal{L}_{\mathrm{HNN}} = \mathbb{E}\left[\left(\frac{\partial \mathcal{H}}{\partial p} - \frac{dq}{dt}\right)^2 + \left(\frac{\partial \mathcal{H}}{\partial q} + \frac{dp}{dt}\right)^2\right], \quad (1)$$

where the derivatives $\frac{\partial \mathcal{H}}{\partial q}$ and $\frac{\partial \mathcal{H}}{\partial p}$ are computed by backpropagation. Hence, this learning procedure is most directly applicable when the true state space (in canonical coordinates) and its time derivatives are known. Accordingly, in the majority of the experiments presented by the authors, the Hamiltonian was learned from the ground truth state space directly, rather than from pixel observations. The single experiment with pixel observations required a modification of the model. First, the input to the model became a concatenated, flattened pair of images $o_t = [x_t, x_{t+1}]$, which was then mapped to a low-dimensional embedding space $z_t = [q_t, p_t]$ using an encoder neural network. Note that the dimensionality of this embedding ($z \in \mathbb{R}^2$ in the case of the pendulum system presented in the paper) was chosen to perfectly match the ground truth dimensionality of the phase space, which was assumed to be known a priori. This, however, is not always possible.
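To make the HNN objective in Eq. 1 concrete, a minimal sketch in JAX could look as follows. The function name `hnn_loss` is ours, and `hamiltonian` stands for any differentiable network mapping $(q, p)$ to a scalar, so this should be read as an illustration under those assumptions rather than the authors' implementation.

```python
import jax
import jax.numpy as jnp

def hnn_loss(hamiltonian, q, p, dq_dt, dp_dt):
    """Sketch of the HNN objective (Eq. 1): the symplectic gradient of the learned
    Hamiltonian should match the observed time derivatives of the state."""
    # hamiltonian: a differentiable function (q, p) -> scalar energy.
    dH_dq, dH_dp = jax.grad(hamiltonian, argnums=(0, 1))(q, p)
    # Hamilton's equations: dq/dt = dH/dp and dp/dt = -dH/dq.
    return jnp.sum((dH_dp - dq_dt) ** 2) + jnp.sum((dH_dq + dp_dt) ** 2)
```

In practice this loss would be averaged over a batch of observed states and their time derivatives, which is exactly why the method is most natural when the canonical coordinates are observed directly.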
The latent embedding was then treated as an estimate of the position and the momentum of the system depicted in the images, where the momentum was assumed to be equal to the velocity of the system. This assumption was enforced by an additional constraint, found necessary to encourage learning, which encouraged the finite-difference time derivative of the position latent to equal the momentum latent:

$$\mathcal{L}_{CC} = \left(p_t - (q_{t+1} - q_t)\right)^2. \quad (2)$$

This assumption is appropriate in the case of the simple pendulum system presented in the paper; however, it does not hold more generally. Note that our approach does not make any assumptions on the dimensionality of the learned phase space, or the form of the momenta coordinates, which makes our approach more general and allows it to perform well on a wider range of image domains, as presented in Sec. 4.

3.1 THE HAMILTONIAN FORMALISM

The Hamiltonian formalism describes the continuous time evolution of a system in an abstract phase space $s = (q, p) \in \mathbb{R}^{2n}$, where $q \in \mathbb{R}^n$ is a vector of position coordinates, and $p \in \mathbb{R}^n$ is the corresponding vector of momenta. The time evolution of the system in phase space is given by the Hamiltonian equations:

$$\frac{dq}{dt} = \frac{\partial \mathcal{H}}{\partial p}, \qquad \frac{dp}{dt} = -\frac{\partial \mathcal{H}}{\partial q}, \quad (3)$$

where the Hamiltonian $\mathcal{H}: \mathbb{R}^{2n} \to \mathbb{R}$ maps the state $s = (q, p)$ to a scalar representing the energy of the system. The Hamiltonian specifies a vector field over the phase space that describes all possible dynamics of the system. For example, the Hamiltonian for an undamped mass-spring system is $\mathcal{H}(q, p) = \frac{kq^2}{2} + \frac{p^2}{2m}$, where $m$ is the mass, $q \in \mathbb{R}$ is its position, $p \in \mathbb{R}$ is its momentum and $k$ is the spring stiffness coefficient. The Hamiltonian can often be expressed as the sum of the kinetic $T$ and potential $V$ energies, $\mathcal{H} = T(p) + V(q)$, as is the case for the mass-spring example. Identifying a system's Hamiltonian is in general a very difficult problem, requiring carefully instrumented experiments and researcher insight produced by years of training. In what follows, we describe a method for modeling a system's Hamiltonian from raw observations (such as pixels) by inferring a system's state with a generative model and rolling it out with the Hamiltonian equations.

3.2 LEARNING HAMILTONIANS WITH THE HAMILTONIAN GENERATIVE NETWORK

Our goal is to build a model that can learn a Hamiltonian from observations. We assume that the data $X = \{(x^1_0, \ldots, x^1_T), \ldots, (x^K_0, \ldots, x^K_T)\}$ comes in the form of high-dimensional noisy observations, where each $x^i_t = G(s^i_t) = G(q^i_t)$ is a non-deterministic function of the generalised position in the phase space, and the full state is a non-deterministic function of a sequence of images, $s^i_t = (q^i_t, p^i_t) = D(x^i_0, \ldots, x^i_t)$, since the momentum (and hence the full state) cannot in general be recovered from a single observation. Our goal is to infer the abstract state and learn the Hamiltonian dynamics in phase space by observing $K$ motion sequences, discretised into $T+1$ time steps each. In the process, we also want to learn an approximation to the generative process $G(s)$ in order to be able to move in both directions between the high-dimensional observations and the low-dimensional abstract phase space. Although the Hamiltonian formalism is general and does not depend on the form of the observations, we present our model in terms of visual observations, since many known physical Hamiltonian systems, like a mass-spring system, can be easily observed visually.
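As a concrete illustration of Eq. 3 (our sketch, not code from the paper), the following JAX snippet evaluates the Hamiltonian vector field for the mass-spring Hamiltonian given above; the default constants are chosen to match those used later for data generation, and all names are ours.

```python
import jax
import jax.numpy as jnp

def mass_spring_hamiltonian(q, p, m=0.5, k=2.0):
    """H(q, p) = k q^2 / 2 + p^2 / (2 m) for the undamped mass-spring system."""
    return 0.5 * k * q ** 2 + p ** 2 / (2.0 * m)

def hamiltonian_vector_field(hamiltonian, q, p):
    """Evaluate Hamilton's equations (Eq. 3): dq/dt = dH/dp, dp/dt = -dH/dq."""
    dH_dq, dH_dp = jax.grad(hamiltonian, argnums=(0, 1))(q, p)
    return dH_dp, -dH_dq

# Example: at q = 1, p = 0 the position is momentarily stationary while the
# momentum decreases, i.e. the spring pulls the mass back towards the origin.
dq_dt, dp_dt = hamiltonian_vector_field(mass_spring_hamiltonian, jnp.array(1.0), jnp.array(0.0))
```

The same pattern applies to a learned Hamiltonian: once a differentiable scalar function of the state is available, the dynamics follow from automatic differentiation alone.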
In this section we introduce the Hamiltonian Generative Network (HGN), a generative model that is trained to behave according to the Hamiltonian dynamics in an abstract phase space learned from raw observations of image sequences. HGN consists of three parts (see Fig. 2): an inference network, a Hamiltonian network and a decoder network, which are discussed next.

The inference network takes in a sequence of images $(x^i_0, \ldots, x^i_T)$, concatenated along the channel dimension, and outputs a posterior over the initial state, $z \sim q_\phi(\cdot \mid x_0, \ldots, x_T)$, corresponding to the system's coordinates in phase space at the first frame of the sequence. We parametrise $q_\phi(z)$ as a diagonal Gaussian with a unit Gaussian prior $p(z) = \mathcal{N}(0, I)$ and optimise it using the usual reparametrisation trick (Kingma & Welling, 2014). To increase the expressivity of the abstract phase space $s_0$, we map samples from the posterior with another function $s_0 = f_\psi(z)$ to obtain the system's initial state. As mentioned in Sec. 3.1, the Hamiltonian function expects the state to be of the form $s = (q, p)$, hence we initialise $s_0 \in \mathbb{R}^{2n}$ and arbitrarily assign the first half of the units to represent abstract position $q$ and the other half to represent abstract momentum $p$.

Figure 3: A: standard normalising flow, where the invertible function $f_i$ is implemented by a neural network. B: Hamiltonian flows, where the initial density is transformed using the learned Hamiltonian dynamics. Note that we depict Euler updates of the state for schematic simplicity, while in practice this is done using a leapfrog integrator.

Figure 4: A schematic representation of NHF, which can perform expressive density modelling by using the learned Hamiltonians as normalising flows. Note that we depict Euler updates of the state for schematic simplicity, while in practice this is done using a leapfrog integrator.

The Hamiltonian network is parametrised as a neural network with parameters $\gamma$ that takes in the inferred abstract state and maps it to a scalar $\mathcal{H}_\gamma(s_t) \in \mathbb{R}$. We can use this function to do rollouts in the abstract state space using the Hamiltonian equations (Eq. 3), for example by Euler integration: $s_{t+1} = (q_{t+1}, p_{t+1}) = \left(q_t + \frac{\partial \mathcal{H}}{\partial p_t}\, dt,\; p_t - \frac{\partial \mathcal{H}}{\partial q_t}\, dt\right)$. In this work we assume a separable Hamiltonian, so in practice we use a more sophisticated leapfrog integrator to roll out the system, since it has better theoretical properties and results in better performance in practice (see Sec. A.6 in the Supplementary Materials for more details).

The decoder network is a standard deconvolutional network (we use the architecture from Karras et al. (2018)) that takes in a low-dimensional representation vector and produces a high-dimensional pixel reconstruction. Given that each instantaneous image does not depend on the momentum information, we restrict the decoder to take only the position coordinates of the abstract state as input: $p_\theta(x_t) = d_\theta(q_t)$.

The objective function. Given a sequence of $T+1$ images, HGN is trained to optimise the following objective:

$$\mathcal{L}(\phi, \psi, \gamma, \theta;\, x_0, \ldots, x_T) = \frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}_{q_\phi(z \mid x_0, \ldots, x_T)}\left[ \log p_{\psi,\gamma,\theta}(x_t \mid q_t) \right] - \mathrm{KL}\left( q_\phi(z) \,\|\, p(z) \right), \quad (4)$$

which can be seen as a temporally extended variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) objective, consisting of a reconstruction term for each frame, and an additional term that encourages the inferred posterior to match a prior.
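The paper states that a leapfrog integrator is used in practice, with details deferred to its Sec. A.6. As an illustration only, the sketch below shows one generic leapfrog update under the separability assumption stated above, together with a rollout loop; the function names are ours and `hamiltonian` stands in for the learned network $\mathcal{H}_\gamma$.

```python
import jax

def leapfrog_step(hamiltonian, q, p, dt):
    """One leapfrog update for a separable Hamiltonian H(q, p) = T(p) + V(q).
    The map is symplectic (volume-preserving) and exactly invertible via dt -> -dt."""
    dH_dq = jax.grad(hamiltonian, argnums=0)   # equals dV/dq for a separable Hamiltonian
    dH_dp = jax.grad(hamiltonian, argnums=1)   # equals dT/dp for a separable Hamiltonian
    p_half = p - 0.5 * dt * dH_dq(q, p)        # half step for the momentum
    q_next = q + dt * dH_dp(q, p_half)         # full step for the position
    p_next = p_half - 0.5 * dt * dH_dq(q_next, p_half)   # half step for the momentum
    return q_next, p_next

def rollout(hamiltonian, q0, p0, dt, num_steps):
    """Unroll the dynamics from an initial state s0 = (q0, p0).
    In HGN only the position q_t would be passed to the decoder at each step."""
    trajectory = [(q0, p0)]
    q, p = q0, p0
    for _ in range(num_steps):
        q, p = leapfrog_step(hamiltonian, q, p, dt)
        trajectory.append((q, p))
    return trajectory
```

Because the update is exactly reversible under $dt \to -dt$, the same routine supports the backward rollouts and the test-time speed changes described in the introduction.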
The key difference from a standard VAE lies in how we generate rollouts: these are produced using the Hamiltonian equations of motion in the learned Hamiltonian phase space.

3.3 LEARNING HAMILTONIAN FLOWS

In this section, we describe how the architecture described above can be modified to produce a model for flexible density estimation. Learning computationally feasible and accurate estimates of complex densities is an open problem in generative modelling.

Figure 5: Ground truth Hamiltonians and samples from generated datasets for the ideal pendulum, mass-spring, and two- and three-body systems used to train HGN.

A common idea to address this problem is to start with a simple prior distribution $\pi(u)$ and then transform it into a more expressive form $p(x)$ through a series of composable invertible transformations $f_i(u)$ called normalising flows (Rezende & Mohamed, 2015) (see Fig. 3A). Sampling can then be done according to $x = f_T \circ \cdots \circ f_1(u)$, where $u \sim \pi(\cdot)$. Density evaluation, however, requires more expensive computations of both inverting the flows and calculating the determinants of their Jacobians. For a single flow step, this equates to the following: $p(x) = \pi\left(f^{-1}(x)\right) \left|\det \frac{\partial f^{-1}}{\partial x}\right|$. While a lot of work has recently gone into proposing better alternatives for flow-based generative models in machine learning (Rezende & Mohamed, 2015; Kingma et al., 2016; Papamakarios et al., 2017; Dinh et al., 2017; Huang et al., 2018; Kingma & Dhariwal, 2018; Hoogeboom et al., 2019; Chen et al., 2019; Grathwohl et al., 2018), none of the approaches manage to produce both sampling and density evaluation steps that are computationally scalable.

The two requirements for normalising flows are that they are invertible and volume preserving, which are exactly the two properties that Hamiltonian dynamics possess. This can be seen by computing the determinant of the Jacobian of the infinitesimal transformation induced by the Hamiltonian $\mathcal{H}$. By Jacobi's formula, the derivative of the determinant at the identity is the trace, and:

$$\det\left(I + dt \begin{pmatrix} \frac{\partial^2 \mathcal{H}}{\partial q_i \partial p_j} & \frac{\partial^2 \mathcal{H}}{\partial p_i \partial p_j} \\ -\frac{\partial^2 \mathcal{H}}{\partial q_i \partial q_j} & -\frac{\partial^2 \mathcal{H}}{\partial p_i \partial q_j} \end{pmatrix}\right) = 1 + dt\, \mathrm{tr}\begin{pmatrix} \frac{\partial^2 \mathcal{H}}{\partial q_i \partial p_j} & \frac{\partial^2 \mathcal{H}}{\partial p_i \partial p_j} \\ -\frac{\partial^2 \mathcal{H}}{\partial q_i \partial q_j} & -\frac{\partial^2 \mathcal{H}}{\partial p_i \partial q_j} \end{pmatrix} + O(dt^2) = 1 + dt\left(\frac{\partial^2 \mathcal{H}}{\partial q_i \partial p_i} - \frac{\partial^2 \mathcal{H}}{\partial p_i \partial q_i}\right) + O(dt^2) = 1 + O(dt^2), \quad (5)$$

where the $i \neq j$ terms are the off-diagonal entries of the Jacobian and do not contribute to the trace. Hence, in this section we describe a simple modification of HGN that allows it to act as a normalising flow. We will refer to this modification as the Neural Hamiltonian Flow (NHF) model. First, we assume that the initial state $s_0$ is a sample from a simple prior $s_0 \sim \pi_0(\cdot)$. We then chain several Hamiltonians $\mathcal{H}_i$ to transform the sample to a new state $s_T = \mathcal{H}_T \circ \cdots \circ \mathcal{H}_1(s_0)$, which corresponds to a sample from the more expressive final density $s_T \sim p(x)$ (see Fig. 3B for an illustration of a single Hamiltonian flow). Note that unlike HGN, where the Hamiltonian dynamics are shared across time steps (a single Hamiltonian is learned and its parameters are shared across time steps of a rollout), in NHF each step of the flow (corresponding to a single time step of a rollout) can be parametrised by a different Hamiltonian. The inverse of such a Hamiltonian flow can be easily obtained by replacing $dt$ by $-dt$ in the Hamiltonian equations and reversing the order of the transformations, $s_0 = \mathcal{H}_1^{-dt} \circ \cdots \circ \mathcal{H}_T^{-dt}(s_T)$ (we will use the appropriate $dt$ or $-dt$ superscript from now on to make the direction of integration of the Hamiltonian dynamics more explicit). The resulting density $p(s_T)$ is given by the following equation:

$$\ln p(s_T) = \ln \pi(s_0) = \ln \pi\left(\mathcal{H}_1^{-dt} \circ \cdots \circ \mathcal{H}_T^{-dt}(s_T)\right) + O(dt^2).$$
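Reusing the `leapfrog_step` sketch above, a Hamiltonian flow, its exact inverse, and the resulting log-density could look as follows. This is our illustration of the idea with hypothetical names: `hamiltonians` is a list of learned Hamiltonian functions (one per flow step) and `log_prior` evaluates $\ln \pi$.

```python
def nhf_forward(hamiltonians, q0, p0, dt):
    """Transform a prior sample s0 = (q0, p0) with a chain of Hamiltonian flow steps,
    one (possibly distinct) learned Hamiltonian per step."""
    q, p = q0, p0
    for h in hamiltonians:
        q, p = leapfrog_step(h, q, p, dt)      # integrator sketched in Sec. 3.2
    return q, p

def nhf_inverse(hamiltonians, q_T, p_T, dt):
    """Invert the flow exactly by integrating backwards (dt -> -dt) in reverse order."""
    q, p = q_T, p_T
    for h in reversed(hamiltonians):
        q, p = leapfrog_step(h, q, p, -dt)
    return q, p

def nhf_log_density(hamiltonians, log_prior, q_T, p_T, dt):
    """Because each step is volume-preserving, ln p(s_T) = ln pi(s_0):
    no Jacobian log-determinant needs to be computed."""
    q0, p0 = nhf_inverse(hamiltonians, q_T, p_T, dt)
    return log_prior(q0, p0)
```

The absence of any log-determinant term in `nhf_log_density` is precisely the computational saving that Eq. 5 guarantees.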
Our proposed NHF is more computationally efficient than many other flow-based approaches, because it does not require the expensive step of calculating the trace of the Jacobian. Hence, the NHF model constitutes a more structured form of a Neural ODE flow (Chen et al., 2018), but with a few notable differences: (i) the Hamiltonian ODE is volume-preserving, which makes the computation of log-likelihood cheaper than for a general ODE flow; (ii) general ODE flows are only invertible in the limit $dt \to 0$, whereas for some Hamiltonians we can use more complex integrators (like the symplectic leapfrog integrator described in Sec. A.6) that are both invertible and volume-preserving for any $dt > 0$.

The structure $s = (q, p)$ on the state space imposed by the Hamiltonian dynamics can be constraining from the point of view of density estimation. We choose to use the trick proposed in the Hamiltonian Monte Carlo (HMC) literature (Neal et al., 2011; Salimans et al., 2015; Levy et al., 2017), which treats the momentum $p$ as a latent variable (see Fig. 4). This is an elegant solution which avoids having to artificially split the density into two disjoint sets. As a result, the data density that our Hamiltonian flows are modelling becomes exclusively parametrised by $p(q_T)$, which takes the following form:

$$p(q_T) = \int p(q_T, p_T)\, dp_T = \int \pi\left(\mathcal{H}_1^{-dt} \circ \cdots \circ \mathcal{H}_T^{-dt}(q_T, p_T)\right) dp_T.$$

This integral is intractable, but the model can still be trained via variational methods, where we introduce a variational density $f_\psi(p_T \mid q_T)$ with parameters $\psi$ and instead optimise the following ELBO:

$$\mathrm{ELBO}(q_T) = \mathbb{E}_{f_\psi(p_T \mid q_T)}\left[ \ln \pi\left(\mathcal{H}_1^{-dt} \circ \cdots \circ \mathcal{H}_T^{-dt}(q_T, p_T)\right) - \ln f_\psi(p_T \mid q_T) \right] \leq \ln p(q_T). \quad (6)$$

Note that, in contrast to VAEs (Kingma & Welling, 2014; Rezende et al., 2014), the ELBO in Eq. 6 is not explicitly in the form of a reconstruction error term plus a KL term.

Figure 6: Average pixel MSE for each step of a single train and test unroll on four physical systems (pendulum, mass-spring, two-body, three-body; curves shown for HNN, HGN (leapfrog), HGN (Euler), HGN (determ) and HGN (no f)). All versions of HGN outperform HNN, which often learned to reconstruct an average image. Note the different scales of the plots comparing HGN to HNN and different versions of HGN.

In order to directly compare the performance of HGN to that of its closest baseline, HNN, we generated four datasets analogous to the data used in Greydanus et al. (2019). The datasets contained observations of the time evolution of four physical systems: mass-spring, pendulum, and two- and three-body systems (see Fig. 5). In order to generate each trajectory, we first randomly sampled an initial state, then produced a 30-step rollout following the ground truth Hamiltonian dynamics, before adding Gaussian noise with σ² = 0.1 to each phase-space coordinate, and rendering a corresponding 64x64 pixel observation. We generated 50 000 train and 10 000 test trajectories for each dataset. When sampling initial states, we start by first sampling the total energy of the system, denoted as a radius $r$ in the phase space, before sampling the initial state $(q, p)$ uniformly on the circle of radius $r$. Note that our pendulum dataset is more challenging than the one described in Greydanus et al. (2019), where the pendulum had a fixed radius and was initialized at a maximum angle of 30° from the central axis.
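To make the data-generation procedure above concrete, here is a sketch of how one trajectory for a one-dimensional system (pendulum or mass-spring) could be produced. It is our illustration, not the authors' pipeline: the integrator choice, the step size `dt`, the `sample_radius` callable, and the reading of σ² = 0.1 as a variance are all assumptions, and the 64x64 renderer is left unspecified.

```python
import jax
import jax.numpy as jnp

def generate_trajectory(rng, hamiltonian, sample_radius, dt=0.05, num_steps=30):
    """Sample a radius (total energy), roll out the ground-truth Hamiltonian with
    leapfrog (sketched in Sec. 3.2), then corrupt each phase-space coordinate
    with Gaussian noise of variance 0.1 (assumed interpretation of sigma^2 = 0.1)."""
    rng_r, rng_angle, rng_noise = jax.random.split(rng, 3)
    r = sample_radius(rng_r)                              # e.g. U(0.1, 1.0) for mass-spring
    angle = jax.random.uniform(rng_angle, (), minval=0.0, maxval=2.0 * jnp.pi)
    q, p = r * jnp.cos(angle), r * jnp.sin(angle)         # initial state on the circle of radius r
    states = [(q, p)]
    for _ in range(num_steps - 1):
        q, p = leapfrog_step(hamiltonian, q, p, dt)
        states.append((q, p))
    states = jnp.asarray(states)
    noisy = states + jnp.sqrt(0.1) * jax.random.normal(rng_noise, states.shape)
    return noisy   # each noisy (q, p) would then be rendered as a 64x64 image
```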
Mass-spring. The dynamics of a frictionless mass-spring system are modeled by the Hamiltonian $\mathcal{H} = \frac{kq^2}{2} + \frac{p^2}{2m}$, where $k$ is the spring constant and $m$ is the mass. We fix $k = 2$ and $m = 0.5$, then sample a radius from a uniform distribution $r \sim U(0.1, 1.0)$.

Pendulum. The dynamics of a frictionless pendulum are modeled by the Hamiltonian $\mathcal{H} = 2mgl\left(1 - \cos q\right) + \frac{p^2}{2ml^2}$, where $g$ is the gravitational constant and $l$ is the length of the pendulum. We fix $g = 3$, $m = 0.5$, $l = 1$, then sample a radius from a uniform distribution $r \sim U(1.3, 2.3)$.

Two- and three-body problems. In an n-body problem, particles interact with each other through an attractive force, like gravity. The dynamics are represented by the following Hamiltonian: $\mathcal{H} = \sum_{i=1}^{n} \frac{\|p_i\|^2}{2m_i} - \sum_{1 \le i < j \le n} \frac{g\, m_i m_j}{\|q_i - q_j\|}$.
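For completeness, the ground-truth pendulum and n-body Hamiltonians above could be written down as follows, in the same JAX conventions as the earlier sketches (`mass_spring_hamiltonian` was sketched in Sec. 3.1); the value of the n-body coupling constant `g` and the small distance regulariser are our assumptions, added for illustration and numerical safety rather than taken from the paper.

```python
import jax.numpy as jnp

def pendulum_hamiltonian(q, p, m=0.5, g=3.0, l=1.0):
    """H(q, p) = 2 m g l (1 - cos q) + p^2 / (2 m l^2), with the constants used in the text."""
    return 2.0 * m * g * l * (1.0 - jnp.cos(q)) + p ** 2 / (2.0 * m * l ** 2)

def nbody_hamiltonian(q, p, m, g=1.0):
    """H = sum_i ||p_i||^2 / (2 m_i) - sum_{i<j} g m_i m_j / ||q_i - q_j||.
    q and p have shape (n, 2); m has shape (n,). The coupling constant g is
    not specified in this section and is set to 1 here for illustration."""
    kinetic = jnp.sum(jnp.sum(p ** 2, axis=-1) / (2.0 * m))
    diff = q[:, None, :] - q[None, :, :]                   # pairwise displacement vectors
    dist = jnp.sqrt(jnp.sum(diff ** 2, axis=-1) + 1e-12)   # pairwise distances (regularised)
    i, j = jnp.triu_indices(q.shape[0], k=1)               # unordered pairs with i < j
    potential = -jnp.sum(g * m[i] * m[j] / dist[i, j])
    return kinetic + potential
```

Together with `hamiltonian_vector_field` or `leapfrog_step` from the earlier sketches, these functions are sufficient to reproduce the kind of ground-truth rollouts described in this section.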