# Universal Approximation for Log-concave Distributions Using Well-conditioned Normalizing Flows

Holden Lee, Mathematics Department, Duke University, Durham, NC 27708. holden.lee@duke.edu
Chirag Pabbaraju, Computer Science Department, Stanford University, Stanford, CA 94305. cpabbara@cs.stanford.edu
Anish Sevekari, Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA 15213. asevekar@andrew.cmu.edu
Andrej Risteski, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213. aristesk@andrew.cmu.edu

Normalizing flows are a widely used class of latent-variable generative models with a tractable likelihood. Affine-coupling models [Dinh et al., 2014, 2016] are a particularly common type of normalizing flow, for which the Jacobian of the latent-to-observable-variable transformation is triangular, allowing the likelihood to be computed in linear time. Despite the widespread usage of affine couplings, the special structure of the architecture makes understanding their representational power challenging. The question of universal approximation was only recently resolved by three parallel papers [Huang et al., 2020, Zhang et al., 2020, Koehler et al., 2020], which showed that reasonably regular distributions can be approximated arbitrarily well using affine couplings, albeit with networks whose Jacobian is nearly singular. As ill-conditioned Jacobians are an obstacle for likelihood-based training, the fundamental question remains: which distributions can be approximated using well-conditioned affine coupling flows? In this paper, we show that any log-concave distribution can be approximated using well-conditioned affine-coupling flows. In terms of proof techniques, we uncover and leverage deep connections between affine coupling architectures, underdamped Langevin dynamics (a stochastic differential equation often used to sample from Gibbs measures) and Hénon maps (a structured dynamical system that appears in the study of symplectic diffeomorphisms). Our results also inform the practice of training affine couplings: we approximate a version of the input distribution padded with iid Gaussians, a strategy which Koehler et al. [2020] empirically observed to result in better-conditioned flows but which had hitherto no theoretical grounding. Our proof can thus be seen as providing theoretical evidence for the benefits of Gaussian padding when training normalizing flows.

1 Introduction

Normalizing flows [Dinh et al., 2014, Rezende and Mohamed, 2015] are a class of generative models parametrizing a distribution in $\mathbb{R}^d$ as the pushforward of a simple distribution (e.g. Gaussian) through an invertible map $g_\theta : \mathbb{R}^d \to \mathbb{R}^d$ with trainable parameters $\theta$. The fact that $g_\theta$ is invertible allows us to write down an explicit expression for the density of a point $x$ through the change-of-variables formula, namely $p_\theta(x) = \phi(g_\theta^{-1}(x))\,\left|\det Dg_\theta^{-1}(x)\right|$, where $\phi$ denotes the density of the standard Gaussian. For different choices of parametric families for $g_\theta$, one gets different families of normalizing flows, e.g. affine coupling flows [Dinh et al., 2014, 2016, Kingma and Dhariwal, 2018], Gaussianization flows [Meng et al., 2020], and sum-of-squares polynomial flows [Jaini et al., 2019]. In this paper we focus on affine coupling flows, arguably the family that has been most successfully scaled up to high-resolution datasets [Kingma and Dhariwal, 2018].
The parametrization of $g_\theta$ is chosen to be a composition of so-called affine coupling blocks, which are maps $f : \mathbb{R}^d \to \mathbb{R}^d$ such that $f(x_S, x_{[d]\setminus S}) = (x_S,\ x_{[d]\setminus S} \odot s(x_S) + t(x_S))$, where $\odot$ denotes entrywise multiplication and $s, t$ are (typically simple) neural networks. The choice of parametrization is motivated by the fact that the Jacobian of each affine block is triangular, so that the determinant can be calculated in linear time.

Despite the empirical success of this architecture, theoretical understanding remains elusive. The most basic questions revolve around the representational power of such models. Even the question of universal approximation was only recently answered by three concurrent papers [Huang et al., 2020, Zhang et al., 2020, Koehler et al., 2020], though in a less-than-satisfactory manner in light of how normalizing flows are trained. Namely, Huang et al. [2020], Zhang et al. [2020] show that any (reasonably well-behaved) distribution $p$, once padded with zeros and treated as a distribution in $\mathbb{R}^{d+d'}$, can be arbitrarily closely approximated by an affine coupling flow. While such padding can be operationalized as an algorithm by padding the training image with zeros, it is never done in practice, as it results in an ill-conditioned Jacobian. This is expected, as the map that always sends the last $d'$ coordinates to 0 is not injective. Koehler et al. [2020] prove universal approximation without padding; however, their construction also gives rise to a poorly conditioned Jacobian: namely, to approximate a distribution $p$ to within accuracy $\epsilon$ in the Wasserstein-1 distance, the Jacobian of the network they construct will have smallest singular value on the order of $\epsilon$. Importantly, for all these constructions, the condition number of the resulting affine coupling map is poor no matter how nice the underlying distribution it is trying to approximate is. In other words, the source of this phenomenon is not that the underlying distribution is low-dimensional or otherwise degenerate. Thus the question arises:

Question: Can well-behaved distributions be approximated by an affine coupling flow with a well-conditioned Jacobian?

In this paper, we answer the above question in the affirmative for a broad class of distributions, namely log-concave distributions, if we pad the input distribution not with zeros but with independent Gaussians. This gives theoretical grounding to an empirical observation in Koehler et al. [2020] that Gaussian padding works better than both zero-padding and no padding. The practical relevance of this question is in providing guidance on the type of distributions we can hope to fit via training using an affine coupling flow. Theoretically, our techniques uncover some deep connections between affine coupling flows and two other (seemingly unrelated) areas of mathematics: stochastic differential equations (more precisely, underdamped Langevin dynamics, a momentum variant of the standard overdamped Langevin dynamics) and dynamical systems (more precisely, a family of dynamical systems called Hénon-like maps).

2 Overview of results

In order to state our main result, we introduce some notation and definitions.

2.1 Notation

Definition 1. An affine coupling block is a map $f : \mathbb{R}^d \to \mathbb{R}^d$ such that $f(x_S, x_{[d]\setminus S}) = (x_S,\ x_{[d]\setminus S} \odot s(x_S) + t(x_S))$ for some set of coordinates $S$, where $\odot$ denotes entrywise multiplication and $s, t$ are trainable (generally non-linear) functions. An affine coupling network is a finite sequence of affine coupling blocks.
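To make Definition 1 concrete, the following is a minimal NumPy sketch (not from the paper) of an affine coupling block and a small affine coupling network. The particular choice of $s$ and $t$ below (one-hidden-layer networks with fixed random weights) is purely illustrative; the point is that the triangular Jacobian makes the log-determinant, and hence the change-of-variables likelihood from Section 1, computable in linear time.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCouplingBlock:
    """One affine coupling block: y_S = x_S, y_{~S} = x_{~S} * s(x_S) + t(x_S)."""

    def __init__(self, d, S):
        self.S = np.asarray(S)                       # conditioning coordinates
        self.T = np.setdiff1d(np.arange(d), self.S)  # transformed coordinates
        h = 16
        # Illustrative fixed random one-hidden-layer networks for s and t.
        self.W1 = rng.normal(size=(h, len(self.S))) / np.sqrt(len(self.S))
        self.Ws = rng.normal(size=(len(self.T), h)) / np.sqrt(h)
        self.Wt = rng.normal(size=(len(self.T), h)) / np.sqrt(h)

    def _s_t(self, x_S):
        hidden = np.tanh(self.W1 @ x_S)
        s = np.exp(0.1 * np.tanh(self.Ws @ hidden))  # keep s bounded away from 0
        t = self.Wt @ hidden
        return s, t

    def forward(self, x):
        y = x.copy()
        s, t = self._s_t(x[self.S])
        y[self.T] = x[self.T] * s + t
        # The Jacobian is triangular with diagonal entries (1, ..., 1, s), so
        # log|det Df| is just a sum of log s -- a linear-time computation.
        return y, np.sum(np.log(s))

    def inverse(self, y):
        x = y.copy()
        s, t = self._s_t(y[self.S])
        x[self.T] = (y[self.T] - t) / s
        return x

def log_likelihood(blocks, x):
    """log p(x) via change of variables, pulling x back to the Gaussian base."""
    z, log_det_inv = x.copy(), 0.0
    for blk in reversed(blocks):
        z = blk.inverse(z)
        _, log_det_fwd = blk.forward(z)
        log_det_inv -= log_det_fwd        # log|det Dg^{-1}| = -log|det Dg|
    d = len(x)
    log_phi = -0.5 * np.sum(z**2) - 0.5 * d * np.log(2 * np.pi)
    return log_phi + log_det_inv

d = 4
blocks = [AffineCouplingBlock(d, S=[0, 1]), AffineCouplingBlock(d, S=[2, 3])]
x = rng.normal(size=d)
print("log p(x) =", log_likelihood(blocks, x))
```

Composing blocks with complementary partitions transforms all coordinates while keeping each block's Jacobian triangular, which is why the per-block log-determinants simply add up.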
Note that the partition $(S, [d]\setminus S)$, as well as $s, t$, may be different between blocks. We say that the non-linearities are in a class $\mathcal{F}$ (e.g., neural networks, polynomials, etc.) if $s, t \in \mathcal{F}$. The appeal of affine coupling networks comes from the fact that the Jacobian of each affine block is triangular, so calculating the determinant is a linear-time operation.

We will be interested in the conditioning of $f$, that is, an upper bound on the largest singular value $\sigma_{\max}(Df)$ and a lower bound on the smallest singular value $\sigma_{\min}(Df)$ of the Jacobian $Df$ of $f$. Note that this is a slight abuse of nomenclature: most of the time, "condition number" refers to the ratio of the largest and smallest singular values. As training a normalizing flow involves evaluating $\det(Df)$, we in fact want to ensure that neither the smallest nor the largest singular value is extreme.

The class of distributions we will focus on approximating via affine coupling flows is log-concave distributions:

Definition 2. A distribution $p : \mathbb{R}^d \to \mathbb{R}^+$, $p(x) \propto e^{-U(x)}$, is log-concave if $\nabla^2 U(x) = -\nabla^2 \ln p(x) \succeq 0$.

Log-concave distributions are typically used to model distributions with Gaussian-like tail behavior. What we will leverage about this class of distributions is that a special stochastic differential equation (SDE), called underdamped Langevin dynamics, is well-behaved in an analytic sense.

Finally, we recall the definitions of positive semidefinite matrices and Wasserstein distance, and introduce notation for truncated distributions.

Definition 3. We say that a symmetric matrix is positive semidefinite (PSD) if all of its eigenvalues are non-negative. For symmetric matrices $A, B$, we write $A \succeq B$ if and only if $A - B$ is PSD.

Definition 4. Given two probability measures $\mu, \nu$ over a metric space $(M, d)$, the Wasserstein-1 distance between them, denoted $W_1(\mu, \nu)$, is defined as
$$W_1(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y)\, d\gamma(x, y)$$
where $\Gamma(\mu, \nu)$ is the set of couplings, i.e. measures on $M \times M$ with marginals $\mu, \nu$ respectively. For two probability distributions $p, q$, we denote by $W_1(p, q)$ the Wasserstein-1 distance between their associated measures. In this paper, we set $M = \mathbb{R}^d$ and $d(x, y) = \|x - y\|_2$.

Definition 5. Given a distribution $q$ and a compact set $C$, we denote by $q|_C$ the distribution $q$ truncated to the set $C$. The truncated measure is defined as $q|_C(A) = \frac{1}{q(C)}\, q(A \cap C)$.

2.2 Main result

Our main result states that we can approximate any log-concave distribution in Wasserstein-1 distance by a well-conditioned affine-coupling flow network. Precisely, we show:

Theorem 1. Let $p : \mathbb{R}^d \to \mathbb{R}^+$ be of the form $p(x) \propto e^{-U(x)}$, such that:
1. $U \in C^2$, i.e., $\nabla^2 U(x)$ exists and is continuous.
2. $\ln p$ satisfies $-\kappa I_d \preceq \nabla^2 \ln p(x) \preceq -\frac{1}{\kappa} I_d$.
Furthermore, let $p_0 := p \otimes N(0, I_d)$. Then, for every $\epsilon > 0$, there exists a compact set $C \subseteq \mathbb{R}^{2d}$ and an invertible affine-coupling network $f : \mathbb{R}^{2d} \to \mathbb{R}^{2d}$ with polynomial non-linearities, such that $W_1(f_\#(N(0, I_{2d})|_C), p_0) \le \epsilon$. Furthermore, the map defined by this affine-coupling network $f$ is well-conditioned over $C$; that is, there are positive constants $A(\epsilon), B(\epsilon) = \kappa^{O(1)}$ such that for any unit vector $w$, $A(\epsilon) \le \|D_w f(x, v)\| \le B(\epsilon)$ for all $(x, v) \in C$, where $D_w$ is the directional derivative in the direction $w$. In particular, the condition number of $Df(x, v)$ is bounded by $\frac{B(\epsilon)}{A(\epsilon)} = \kappa^{O(1)}$ for all $(x, v) \in C$.

We make several remarks regarding the statement of the theorem:

Remark 1. The Gaussian padding (i.e. setting $p_0 = p \otimes N(0, I_d)$) is essential for our proofs.
All the other prior works on the universal approximation properties of normalizing flows (with or without padding) result in ill-conditioned affine coupling networks. This gives theoretical backing to the empirical observations on the benefits of Gaussian padding in Koehler et al. [2020].

Remark 2. The choice of the non-linearities $s, t$ being polynomials is for the sake of convenience in our proofs. Using standard universal approximation results, they can also be chosen to be neural networks with a smooth activation function.

Remark 3. The Jacobian $Df$ has both an upper-bounded largest singular value and a lower-bounded smallest singular value, which of course bounds the determinant $\det(Df)$. As remarked in Section 2.1, merely bounding the ratio of the two quantities would not suffice for this. Moreover, the bound we prove only depends on properties of the distribution (i.e., $\kappa$), and does not worsen as $\epsilon \to 0$, in contrast to Koehler et al. [2020].

Remark 4. The region $C$ on which the pushforward of the Gaussian through $f$ and $p_0$ are close is introduced solely for technical reasons: essentially, standard results in analysis for approximating smooth functions by polynomials only apply when the approximation is required to hold over a compact set. Note that $C$ can be made arbitrarily large by making $\epsilon$ arbitrarily small.

Remark 5. We do not provide an explicit computation of the number of affine coupling blocks in the constructed network, although a bound of $\mathrm{polylog}(1/\epsilon)/\epsilon^{O(k)}$ can be extracted from our proofs.

Remark 6. Our proof also implies a well-conditioned universal approximation result for other related normalizing flow models. Lemma 1 proves that the flow map of underdamped Langevin dynamics is well-conditioned for all $t \in [0, T]$. Moreover, as indicated in Chen et al. [2018], underdamped Langevin dynamics is a continuous normalizing flow, thus the claim applies to such flows as well. Similarly, the particular affine coupling layers we construct in eq. (13) also form a residual block, so the claim also holds for residual flows [Behrmann et al., 2018].

3 Related Work

The landscape of normalizing flow models is rather rich. The inception of the ideas was in Rezende and Mohamed [2015] and Dinh et al. [2014], and in recent years an immense amount of research has been dedicated to developing different architectures of normalizing flows. The focus of this paper is affine coupling flows, introduced in Dinh et al. [2014], which proposed the idea of using pushforward maps with triangular Jacobians for computational efficiency. This was further developed in Dinh et al. [2016] and culminated in Kingma and Dhariwal [2018], who introduced 1x1 convolutions in the affine coupling framework to allow for trainable choices of partitions. We note that there have been variants of normalizing flows in which the Jacobian is non-triangular, e.g. [Grathwohl et al., 2018, Dupont et al., 2019, Behrmann et al., 2018], but these models still do not scale beyond datasets the size of CIFAR-10.

In terms of theoretical results, the most closely related works are Huang et al. [2020], Zhang et al. [2020], Koehler et al. [2020]. The former two show universal approximation of affine couplings, albeit if the input is padded with zeros. This of course results in maps with singular Jacobians, which is why this strategy is not used in practice.
Koehler et al. [2020] show universal approximation without padding, though their construction results in a flow model with condition number $1/\epsilon$ to achieve an $\epsilon$-approximation in the Wasserstein sense, regardless of how well-behaved the distribution to be approximated is. Furthermore, Koehler et al. [2020] provide some empirical evidence that padding with iid Gaussians (as in our paper) is better than both zero padding (as in Huang et al. [2020], Zhang et al. [2020]) and no padding on small-scale data.

4 Preliminaries

Our techniques leverage tools from stochastic differential equations and dynamical systems. We briefly survey the relevant results.

4.1 Langevin Dynamics

Broadly, Langevin diffusions are families of stochastic differential equations (SDEs) which are most frequently used as algorithmic tools for sampling from distributions specified up to a constant of proportionality. They have also recently received a lot of attention as tools for designing generative models [Song and Ermon, 2019, Song et al., 2020]. In this paper, we will only make use of underdamped Langevin dynamics, a momentum-like analogue of the more familiar overdamped Langevin dynamics, defined below. Our construction will involve simulating underdamped Langevin dynamics using affine coupling blocks.

Definition 6 (Underdamped Langevin Dynamics). Underdamped Langevin dynamics with potential $U$ and parameters $\xi, \gamma$ is the pair of SDEs
$$dx_t = v_t\, dt, \qquad dv_t = -\gamma v_t\, dt - \xi \nabla U(x_t)\, dt + \sqrt{2\gamma\xi}\, dB_t. \tag{1}$$
The stationary distribution of the SDEs (the limiting distribution as $t \to \infty$) is given by $p^*(x, v) \propto e^{-U(x) - \frac{\|v\|_2^2}{2\xi}}$.

The variable $v_t$ can be viewed as a velocity variable and $x_t$ as a position variable; in that sense, the above SDE is an analogue of momentum methods in optimization. The convergence of (1) can be bounded when the distribution $p(x) \propto \exp(-U(x))$ satisfies an analytic condition, namely a bounded log-Sobolev constant. Though we do not use the log-Sobolev constant in any substantive manner in this paper, we include the definition for completeness.

Definition 7. A distribution $p : \mathbb{R}^d \to \mathbb{R}^+$ satisfies a log-Sobolev inequality with constant $C > 0$ if for all $g : \mathbb{R}^d \to \mathbb{R}$ such that $g^2,\ g^2 |\log g^2| \in L^1(p)$, we have
$$\mathbb{E}_p[g^2 \log g^2] - \mathbb{E}_p[g^2] \log \mathbb{E}_p[g^2] \le 2C\, \mathbb{E}_p \|\nabla g\|^2. \tag{2}$$

In the context of Markov diffusions (and in particular, designing sampling algorithms using diffusions), the interest in this quantity comes from the fact that it governs the convergence rate of the overdamped Langevin diffusion in the KL divergence sense. Namely, if $p_t$ is the distribution of overdamped Langevin dynamics after time $t$, one can show $\mathrm{KL}(p_t \| p) \le e^{-2t/C}\, \mathrm{KL}(p_0 \| p)$. We will only need the following fact about the log-Sobolev constant:

Fact 1 (Bakry and Émery [1985], Bakry et al. [2013]). Let the distribution $p(x) \propto \exp(-U(x))$ be such that $\nabla^2 U(x) \succeq \lambda I$. Then $p$ has log-Sobolev constant bounded by $1/\lambda$.

We will also need the following result characterizing the convergence time of underdamped Langevin dynamics in terms of the log-Sobolev constant, as shown in Ma et al. [2019]:

Theorem 2 (Ma et al. [2019]). Let $p^*(x) \propto \exp(-U(x))$ satisfy a log-Sobolev inequality with constant $C_{\mathrm{LS}}$. Furthermore, for a distribution $p : \mathbb{R}^{2d} \to \mathbb{R}^+$, let
$$\mathcal{L}[p] := \mathrm{KL}(p \| p^*) + \mathbb{E}_p \left\langle \nabla \frac{\delta\, \mathrm{KL}(p \| p^*)}{\delta p},\ S\, \nabla \frac{\delta\, \mathrm{KL}(p \| p^*)}{\delta p} \right\rangle,$$
where $S$ is the positive definite matrix
$$S = \begin{pmatrix} \frac{1}{4} I_{d \times d} & \frac{1}{2} I_{d \times d} \\ \frac{1}{2} I_{d \times d} & 2 I_{d \times d} \end{pmatrix}.$$
If $p_t$ is the distribution of $(x_t, v_t)$ which evolve according to (1), we have
$$\frac{d}{dt}\, \mathcal{L}[p_t] \le -\frac{1}{10\, C_{\mathrm{LS}}}\, \mathcal{L}[p_t]. \tag{3}$$
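As a concrete illustration of Definition 6 (not part of the paper's construction), the following sketch simulates the SDE (1) with $\xi = 1$ via Euler-Maruyama for a simple log-concave potential, and checks that the samples approach the stationary distribution $p^*(x, v) \propto e^{-U(x) - \|v\|^2/2}$. The quadratic potential, step size, and horizon are arbitrary choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Log-concave potential U(x) = 0.5 * x^T A x with eigenvalues in [1/kappa, kappa].
d, kappa = 5, 4.0
A = np.diag(np.linspace(1.0 / kappa, kappa, d))
grad_U = lambda x: x @ A

def underdamped_langevin(n_samples=2000, T=20.0, dt=1e-2, gamma=1.0, xi=1.0):
    """Euler-Maruyama discretization of eq. (1):
       dx = v dt,  dv = -gamma*v dt - xi*grad U(x) dt + sqrt(2*gamma*xi) dB."""
    x = rng.normal(size=(n_samples, d)) * 3.0      # some spread-out start p_0
    v = rng.normal(size=(n_samples, d))
    for _ in range(int(T / dt)):
        noise = rng.normal(size=v.shape) * np.sqrt(2 * gamma * xi * dt)
        x, v = x + dt * v, v - dt * (gamma * v + xi * grad_U(x)) + noise
    return x, v

x, v = underdamped_langevin()
# At stationarity, Cov(x) should be close to A^{-1} and Cov(v) close to I.
print("max |Cov(x) - A^{-1}|:", np.abs(np.cov(x.T) - np.linalg.inv(A)).max())
print("max |Cov(v) - I|     :", np.abs(np.cov(v.T) - np.eye(d)).max())
```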
We note that Theorem 2 uses a non-standard Lyapunov function $\mathcal{L}$, which combines the KL divergence with an extra term; since the generator of underdamped Langevin dynamics is not self-adjoint, analyzing the drop in KL divergence directly is difficult. As $\mathcal{L}$ is clearly an upper bound on $\mathrm{KL}(p \| p^*)$, it suffices to show that $\mathcal{L}$ decreases rapidly.

We will also need a less-well-known deterministic form of the updates which is equivalent to (1). Precisely, we convert (1) into an equivalent ODE (with time-dependent coefficients). The proof of this fact (via a straightforward comparison of the Fokker-Planck equations) can be found in Ma et al. [2019].

Theorem 3. Let $p_t(x_t, v_t)$ be the probability distribution of running (1) for time $t$. If started from $(x_0, v_0) \sim p_0$, the probability distribution of the solution $(x_t, v_t)$ to the ODE
$$\frac{d}{dt}\begin{pmatrix} x_t \\ v_t \end{pmatrix} = \begin{pmatrix} O & I_d \\ -I_d & -\gamma I_d \end{pmatrix} \left( \nabla \ln p_t(x_t, v_t) - \nabla \ln p^*(x_t, v_t) \right) \tag{4}$$
is also $p_t(x_t, v_t)$.

4.2 Dynamical systems and Hénon maps

We also build on work from dynamical systems, more precisely on a family of maps called Hénon-like maps [Hénon, 1976].

Definition 8 ([Turaev, 2002]). A pair of ODEs forms a Hénon-like map if it has the form
$$\frac{dx}{dt} = v, \qquad \frac{dv}{dt} = -x + \nabla J(x) \tag{5}$$
for a smooth function $J : \mathbb{R}^d \to \mathbb{R}$.

This special family of ODEs is a continuous-time generalization of a classical discrete dynamical system of the same name [Hénon, 1976]. The property that is useful for us is that the Euler discretization of this map can be written as a sequence of affine coupling blocks. In Turaev [2002], it was proven that these ODEs are universal approximators in some sense: namely, iterations of this ODE can approximate any symplectic diffeomorphism, a continuous map which preserves volumes (i.e. the determinant of the Jacobian of the map is 1). These kinds of diffeomorphisms have their genesis in Hamiltonian formulations of classical mechanics [Abraham and Marsden, 2008].

At first blush, symplectic diffeomorphisms and underdamped Langevin dynamics seem to have nothing to do with each other. The connection comes through the so-called Hamiltonian representation theorem [Polterovich, 2012], which states that any symplectic diffeomorphism from $C \subseteq \mathbb{R}^{2d}$ to $\mathbb{R}^{2d}$ can be written as the iteration of the following Hamiltonian system of ODEs for some (time-dependent) Hamiltonian $H(x, v, t)$:
$$\frac{dx}{dt} = \frac{\partial}{\partial v} H(x, v, t), \qquad \frac{dv}{dt} = -\frac{\partial}{\partial x} H(x, v, t) \tag{6}$$

In fact, in our theorem we will use techniques inspired by those in Turaev [2002], who shows:

Theorem 4 (Turaev [2002]). For any function $H(x, v, t) : \mathbb{R}^{2d} \times \mathbb{R}_{\ge 0} \to \mathbb{R}$ which is polynomial in $(x, v)$, there exists a polynomial $V(x, t)$ such that the time-$\epsilon$ map of the system
$$\frac{dx}{dt} = \frac{\partial}{\partial v} H(x, v, t), \qquad \frac{dv}{dt} = -\frac{\partial}{\partial x} H(x, v, t) \tag{7}$$
is uniformly $O(\epsilon^2)$-close to the time-$2\pi$ map of the system
$$\frac{dx_j}{dt} = v_j, \qquad \frac{dv_j}{dt} = -\omega_j^2 x_j - \frac{\partial}{\partial x_j} V(x, t) \tag{8}$$
for some integers $\{\omega_i\}_{i=1}^d$.

We will prove a generalization of this theorem that applies to underdamped Langevin dynamics.

5 Proof Sketch of Theorem 1

5.1 Overview of strategy

We wish to construct an affine coupling network that (approximately) pushes forward a Gaussian $p^* = N(0, I_{2d})$ to the distribution we wish to model with Gaussian padding, i.e. $p_0 = p \otimes N(0, I_d)$. Because the inverse of an affine coupling network is an affine coupling network, we can invert the problem, and instead attempt to map $p_0$ to $N(0, I_{2d})$.¹ There is a natural map that takes $p_0$ to $p^* = N(0, I_{2d})$, namely underdamped Langevin dynamics (1). Hence, our proof strategy involves understanding and simulating underdamped Langevin dynamics with the initial distribution $p_0 = p \otimes N(0, I_d)$ and the target distribution $p^* = N(0, I_{2d})$, and comprises two important steps.
¹ As an aside, a similar strategy is taken in practice by recent SDE-based generative models (Song et al. [2020]).

First, we show that the flow-map for Langevin dynamics is well-conditioned (Lemma 1 below). Here, by flow-map we mean the map which assigns to each point its evolution over a certain amount of time $t$ according to the equations specified by (1).

Second, we break the simulation of underdamped Langevin dynamics for a certain time $t$ into intervals of size $\epsilon$, and show that the inverse flow-map over each $\epsilon$-sized interval of time can be approximated well by a composition of affine-coupling maps (Lemma 5 below). To show this, we consider a more general system of ODEs than the one in Turaev [2002] (in particular, a non-Hamiltonian system), which can be applied to underdamped Langevin dynamics. We then show that the inverse flow-map of this system of ODEs can be approximated by a sequence of affine-coupling blocks. We note that for this argument, it is critical that we use underdamped rather than overdamped Langevin dynamics, as overdamped Langevin dynamics does not have the required form for affine-coupling blocks.

5.2 Underdamped Langevin is well-conditioned

Consider running underdamped Langevin dynamics with stationary distribution $p^*$ equal to the standard Gaussian, started at a log-concave distribution with bounded condition number $\kappa$. The following lemma says that the flow map is well-conditioned, with condition number depending polynomially on $\kappa$.

Lemma 1. Consider underdamped Langevin dynamics (1) with $\xi = 1$, friction coefficient $\gamma < 2$, and starting distribution $p$ which satisfies all the assumptions in Theorem 1. Let $T_t$ denote the flow-map from time 0 to time $t$ induced by (4). Then for any $x_0, v_0 \in \mathbb{R}^d$ and unit vector $w$, the directional derivative of $T_t$ at $(x_0, v_0)$ in direction $w$ satisfies
$$\frac{1}{\kappa^{O(1)}} \le \|D_w T_t(x_0, v_0)\| \le \kappa^{O(1)}.$$
Therefore, the condition number of $T_t$ is bounded by $\kappa^{O(1)}$, uniformly in $t$.

We sketch the proof below and include a complete proof in Section A. First, using (4) and the chain rule, the Jacobian of the flow map at $(x_0, v_0)$, $D_t = DT_t(x_0, v_0)$, satisfies
$$\frac{d}{dt} D_t = \begin{pmatrix} O & I_d \\ -I_d & -\gamma I_d \end{pmatrix} \nabla^2 (\ln p_t - \ln p^*)\, D_t, \tag{9}$$
i.e., its evolution is governed by the difference of the Hessians of the log-pdfs of the current distribution and the stationary distribution. We will show that $\nabla^2 \ln p_t$ decays exponentially towards $\nabla^2 \ln p^* = -I_{2d}$.

To accomplish this, consider how $\nabla^2 \ln p_t$ evolves if we replace (1) by its discretization with step size $\eta$:
$$\tilde{x}_{t+\eta} = \tilde{x}_t + \eta \tilde{v}_t, \qquad \tilde{v}_{t+\eta} = (1 - \gamma\eta)\tilde{v}_t - \eta \tilde{x}_t + \zeta_t, \qquad \zeta_t \sim N(0, 2\gamma\eta\, I_d).$$
Note that because the stationary distribution is a Gaussian, $\nabla U(x_t) = x_t$ in (1), and the above equations take a particularly simple form: we apply a linear transformation and add Gaussian noise, which corresponds to convolving the current distribution with a Gaussian. We keep track of upper and lower bounds for $\nabla^2 \ln p_t$, and compute how they evolve under this linear transformation and convolution with a Gaussian. Taking $\eta \to 0$, we obtain differential equations for the upper and lower bounds for $\nabla^2 \ln p_t$, which we can solve. A Grönwall argument shows that these bounds decay exponentially towards $\nabla^2 \ln p^* = -I_{2d}$; the decay rate can be bounded below by $1/\kappa^{O(1)}$.

From (9), we then obtain that the condition number of $D_t$ is bounded by the integral of an exponentially decaying function, and hence is bounded independently of $t$. In particular, we may take $t$ large enough so that $p_t$ is $\epsilon$-close to the stationary distribution. Because the decay rate of the exponential is $1/\kappa^{O(1)}$, the bound is $\kappa^{O(1)}$.
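To illustrate the key decay step above (not a substitute for the proof in Section A), the following sketch takes a Gaussian starting distribution, for which $p_t$ stays Gaussian under (1) with $\nabla U(x) = x$, and integrates the covariance ODE of the resulting linear (Ornstein-Uhlenbeck) dynamics. It then checks numerically that $\nabla^2 \ln p_t = -\Sigma_t^{-1}$ decays towards $-I_{2d}$. The dimension, $\gamma$, and initial covariance are arbitrary choices for the demonstration.

```python
import numpy as np

d, gamma = 3, 1.0
# Drift matrix of (1) with U(x) = |x|^2 / 2: d(x, v)/dt has linear part A @ (x, v).
A = np.block([[np.zeros((d, d)), np.eye(d)],
              [-np.eye(d), -gamma * np.eye(d)]])
# Diffusion enters only in the v block.
D = np.block([[np.zeros((d, d)), np.zeros((d, d))],
              [np.zeros((d, d)), 2 * gamma * np.eye(d)]])

# A Gaussian p_0 with condition number kappa stays Gaussian along (1);
# its covariance solves the Lyapunov ODE  dSigma/dt = A Sigma + Sigma A^T + D.
kappa = 5.0
Sigma = np.diag(np.concatenate([np.linspace(1.0 / kappa, kappa, d), np.ones(d)]))

dt, T = 1e-3, 10.0
for step in range(int(T / dt)):
    Sigma = Sigma + dt * (A @ Sigma + Sigma @ A.T + D)
    if step % int(1.0 / dt) == 0:
        # Hessian of ln p_t is -Sigma^{-1}; measure its distance to -I_{2d}.
        dist = np.linalg.norm(np.linalg.inv(Sigma) - np.eye(2 * d), ord=2)
        print(f"t = {step * dt:4.1f}   ||grad^2 ln p_t + I|| = {dist:.3e}")
```

The printed distances shrink geometrically in $t$, matching the claim that $\nabla^2 \ln p_t$ approaches $-I_{2d}$ exponentially fast in this Gaussian special case.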
Note that we vitally used the fact that the stationary distribution $p^*$ is a standard Gaussian, as our argument requires that $\nabla^2 \ln p^*$ be constant everywhere.

5.3 ODE approximation by affine-coupling blocks

Next, we analyze a more general version of the Hamiltonian system of ODEs considered in Turaev [2002], which we recalled in (7). In particular, the system of ODEs we will be considering is:
$$\frac{dx}{dt} = \frac{\partial}{\partial v} H(x, v, t), \qquad \frac{dv}{dt} = -\frac{\partial}{\partial x} H(x, v, t) - \gamma \frac{\partial}{\partial v} H(x, v, t) \tag{10}$$
Note that substituting $H(x, v, t) = \ln p_t(x, v) - \ln p^*(x, v)$ above gives us the underdamped Langevin dynamics (4).

The first step is to restrict our considerations to $H$ being a polynomial in $(x, v)$, rather than a general smooth function. Towards this, we recall the notion of closeness in the $C^1$ topology:

Definition 9. Let $C \subseteq \mathbb{R}^d$ be a compact set, and let $f, g : C \to \mathbb{R}$ be two continuously differentiable functions. We say that $f, g$ are uniformly $\epsilon$-close over $C$ in the $C^1$ topology if
$$\sup_{x \in C} \left( \|f(x) - g(x)\| + \|Df(x) - Dg(x)\| \right) \le \epsilon.$$

The following lemma (a generalization of the Stone-Weierstrass Theorem) then establishes that it suffices to focus on $H$ being polynomial in $(x, v)$:

Lemma 2 (Theorem 5, Peet [2007]). Let $C \subseteq \mathbb{R}^d$ be a compact set. For any $C^2$ function $H : \mathbb{R}^d \to \mathbb{R}$ and any $\epsilon > 0$, there is a multivariate polynomial $P : \mathbb{R}^d \to \mathbb{R}$ such that $P, H$ are uniformly $\epsilon$-close over $C$ in the $C^1$ topology.

Focusing on the case of polynomials, Lemma 3 below shows that instead of flowing the pair of ODEs given by (10) over an interval of time $\epsilon$, we can instead run a different ODE for time $2\pi$, such that the flow-maps corresponding to these two ODEs are $O(\epsilon^2)$-close.

Lemma 3. Let $C \subseteq \mathbb{R}^{2d}$ be a compact set. For any function $H(x, v, t) : \mathbb{R}^{2d} \to \mathbb{R}$ which is polynomial in $(x, v)$, there exist polynomial functions $J, F, G$ such that the time-$(t_0 + \epsilon, t_0)$ flow map of the system
$$\frac{dx}{dt} = \frac{\partial}{\partial v} H(x, v, t), \qquad \frac{dv}{dt} = -\frac{\partial}{\partial x} H(x, v, t) - \gamma \frac{\partial}{\partial v} H(x, v, t) \tag{11}$$
is uniformly $O(\epsilon^2)$-close over $C$ in the $C^1$ topology to the time-$2\pi$ map of the system
$$\frac{dx}{dt} = v - F(v, t) \odot x, \qquad \frac{dv_j}{dt} = -\omega_j^2 x_j - J_j(x, t) - v_j\, G_j(x, t) \tag{12}$$
Here, $\odot$ denotes the component-wise product, and the constants inside the $O(\cdot)$ depend on $C$ and the coefficients of $H$.

The complete proof of this lemma is included in Appendix B; we provide a brief sketch here. First, we consider the first-order ($O(\epsilon^2)$) approximation of the flow map of a standard ODE of the form $\dot{y} = Dy$ (where $D$ is diagonal), and observe that for small $\epsilon$, we can think of (12) as a perturbed version of such an ODE with an appropriate choice of $D$. Using standard ODE perturbation techniques, we can approximately express the time-$t$ evolution of (12), up to first order in $\epsilon$, in terms of the polynomials $F, G, J$ and trigonometric functions. Then, we compare this map to the first-order approximation of flowing the pair of ODEs (11) for time $\epsilon$, obtained via Taylor's theorem. Furthermore, this approximation is a polynomial in $(x, v)$, since $H$ is a polynomial in $(x, v)$. The crucial step involves choosing the functional form of $F(z, t), J(z, t), G(z, t)$ suitably, so that they are polynomials in $z$ with coefficients in terms of $\sin(\omega t), \cos(\omega t)$. After simplification, both expressions can be written as polynomials in $(x, v)$ whose coefficients involve integrals of the form $\int_0^{2\pi} \sin^p(\omega s) \cos^q(\omega s)\, ds$, which evaluate either to 0 or to a constant. Thus, to ensure that the two approximations match, we are left with the problem of making two multivariate polynomials in $(x, v)$ equal. This final step can of course be written as a linear system of equations. We identify a special structure in this system, which helps us show that the system is full-rank and hence has a solution.
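Anticipating the discretization in the next part of the argument, the following minimal sketch (not from the paper, with arbitrary placeholder polynomials F, G, J and parameters) illustrates the property that makes (12) useful: a single Euler step of (12) factors exactly into two affine coupling blocks of the form in Definition 1, one updating v conditioned on x and one updating x conditioned on v.

```python
import numpy as np

rng = np.random.default_rng(2)
d, eta, n = 3, 0.05, 7                 # dimension, step size, step index (time = eta * n)
omega2 = np.array([1.0, 4.0, 9.0])     # squares of the integers omega_j

# Illustrative stand-ins for the polynomials F, G, J with trigonometric coefficients.
F = lambda v, t: 0.1 * v**2 + 0.05 * np.cos(t)
G = lambda x, t: 0.2 * x + 0.1 * np.sin(t)
J = lambda x, t: 0.3 * x**3

def coupling_block_v(x, v, t):
    # (x, v) -> (x, v'):  v' = v * s(x) + t(x), with x held fixed.
    s = 1.0 - eta * G(x, t)
    b = -eta * (omega2 * x + J(x, t))
    return x, v * s + b

def coupling_block_x(x, v, t):
    # (x, v) -> (x', v):  x' = x * s(v) + t(v), with v held fixed.
    s = 1.0 - eta * F(v, t)
    b = eta * v
    return x * s + b, v

x, v = rng.normal(size=d), rng.normal(size=d)
t = eta * n

# Composition of the two coupling blocks ...
x1, v1 = coupling_block_v(x, v, t)
x2, v2 = coupling_block_x(x1, v1, t)

# ... agrees with one (semi-implicit) Euler step of (12), i.e. eq. (13) below.
v_next = v - eta * (omega2 * x + J(x, t) + v * G(x, t))
x_next = x + eta * (v_next - F(v_next, t) * x)
print(np.allclose(x2, x_next), np.allclose(v2, v_next))   # True True
```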
Finally, consider discretizing the newly constructed ODE (12) into small steps of size $\eta$ by a simple Euler scheme, i.e.,
$$x_{n+1} = x_n + \eta\left(v_{n+1} - F(v_{n+1}, \eta n) \odot x_n\right), \qquad v_{n+1,j} = v_{n,j} - \eta\left(\omega_j^2 x_{n,j} + J_j(x_n, \eta n) + v_{n,j}\, G_j(x_n, \eta n)\right) \tag{13}$$
We note that each step above can be written as a composition of two affine coupling blocks given by $(x_n, v_n) \mapsto (x_n, v_{n+1}) \mapsto (x_{n+1}, v_{n+1})$. Namely, the map $(x_n, v_n) \mapsto (x_n, v_{n+1})$ can be written as
$$v_{n+1} = v_n \odot \left(1 - \eta\, G(x_n, \eta n)\right) - \eta\left(\omega^2 \odot x_n + J(x_n, \eta n)\right).$$
This map is an affine coupling block with $s(x_n) = 1 - \eta\, G(x_n, \eta n)$ and $t(x_n) = -\eta\left(\omega^2 \odot x_n + J(x_n, \eta n)\right)$. The map $(x_n, v_{n+1}) \mapsto (x_{n+1}, v_{n+1})$ can be written as
$$x_{n+1} = x_n + \eta\left(v_{n+1} - F(v_{n+1}, \eta n) \odot x_n\right),$$
which is an affine coupling block with $s(v_{n+1}) = 1 - \eta\, F(v_{n+1}, \eta n)$ and $t(v_{n+1}) = \eta\, v_{n+1}$. The composition of the two maps above yields an affine coupling network $(x_n, v_n) \mapsto (x_{n+1}, v_{n+1})$, precisely as given by Equation (13), with the non-linearities $s, t$ in each of the blocks given by polynomials. The following lemma bounds the error resulting from this discretization:

Lemma 4 (Euler's discretization method).² Let $C \subseteq \mathbb{R}^{2d}$ be a compact set. Consider discretizing the time from 0 to $t$ into $t/\eta$ steps and performing the update given by (13) at each of these steps. Let the map obtained as a result of this discretization be denoted by $T'_t$, and let the original flow map be denoted by $T_t$. Then $T_t$ and $T'_t$ are uniformly $O(\eta)$-close over $C$ in the $C^1$ topology, and the constants inside the $O(\cdot)$ depend on $C$ and on bounds on the derivatives of $T_t$ over $C$.

5.4 Simulating by breaking into $\epsilon$-sized intervals

Let $T_{s,t}$ denote the flow-map of (10) from time $s$ to time $t$. Since the flow maps are invertible, $T_{s,t}$ and $T_{t,s}$ are inverses. We are now ready to state the following lemma, which says that the underdamped Langevin flow-map $T_{\phi,0}$ can be written as a composition of affine-coupling maps:

Lemma 5. Let $C \subseteq \mathbb{R}^{2d}$ be a compact set. Suppose that $T_{\phi,0}(x, v)$ is the time-$(\phi, 0)$ flow-map of the ODEs
$$\frac{dx}{dt} = \frac{\partial}{\partial v} H(x, v, t), \qquad \frac{dv}{dt} = -\frac{\partial}{\partial x} H(x, v, t) - \gamma \frac{\partial}{\partial v} H(x, v, t) \tag{14}$$
where $H$ is $C^1$. Then for any $\epsilon_1, \phi \in \mathbb{R}^+$, there exists an integer $N = N(\epsilon_1, \phi, C)$ and affine-coupling blocks $f_1, \ldots, f_N$ such that the composition $f = f_N \circ \cdots \circ f_1$ is $\epsilon_1$-close to $T_{\phi,0}$ in the $C^1$ topology over $C$.

The proof of Lemma 5 is in Appendix C. We provide a brief sketch here: from Lemma 2, we know that it suffices to show the result for a polynomial $H$. Thereafter, we break the time for which we want to flow the ODE given by (14) into small chunks of length $\epsilon$. Lemmas 3 and 4 then show that the flow map over each chunk can be written as an affine coupling network. Composing the affine coupling networks over all the chunks of time gives us the result.

5.5 Putting the components together

The previous sections established that for any $t$ and any compact set $C$, there is an affine-coupling network $f$ with polynomial non-linearities such that $T_{t,0}$ and $f$ are uniformly close over $C$. We will now pick an appropriate value of $t$ and a set $C$ such that $W_1(f_\#(p^*|_C), p_0) \le \epsilon$ where $p^* = N(0, I_{2d})$, which is the required result of Theorem 1.

First, using Theorem 2, for $\phi > 10\, C_{\mathrm{LS}}\left(2\log\frac{1}{\epsilon_1} + \log 2 + \log \mathcal{L}[p_0]\right)$ we have that $\mathrm{KL}\left(T_{0,\phi\,\#}(p_0) \,\|\, p^*\right) \le \frac{\epsilon_1^2}{2}$. We use the following transportation cost inequality to convert this to a Wasserstein bound.

² This result is well known in the $C^0$ topology; we provide an analysis for the $C^1$ bound in Appendix D.1.

Theorem 5 (Talagrand [1996]). The standard Gaussian $p$ on $\mathbb{R}^d$ satisfies a transportation cost inequality: for every distribution $q$ on $\mathbb{R}^d$ with finite second moment, $W_1(p, q)^2 \le 2\, \mathrm{KL}(q \| p)$.

This gives us that $W_1\left(T_{0,\phi\,\#}(p_0), p^*\right) \le \epsilon_1$.
A simple argument in Lemma 10 (Appendix D.2) then gives
$$W_1\left(p_0, T_{\phi,0\,\#}(p^*)\right) = W_1\left(T_{\phi,0\,\#}(T_{0,\phi\,\#}(p_0)),\ T_{\phi,0\,\#}(p^*)\right) \le \mathrm{Lip}(T_{\phi,0})\, \epsilon_1 \tag{15}$$
A subsequent argument, stated as Lemma 11 in Appendix D.2, shows that if $f$ and $T_{\phi,0}$ are uniformly $\epsilon_1$-close in the $C^0$ topology on some $C$, then their pushforwards of $p^*|_C$ are indeed close, i.e.,
$$W_1\left(T_{\phi,0\,\#}(p^*|_C),\ f_\#(p^*|_C)\right) \le \epsilon_1. \tag{16}$$
Next, we establish a bound on the Wasserstein distance between the standard Gaussian and its truncation to a compact set, proved in Appendix D.3.

Lemma 6. Let $p^* = N(0, I_{2d})$. Then for every $\delta \in \mathbb{R}^+$, there exists a compact set $C = B(0, R)$ such that $W_1(p^*, p^*|_C) \le \delta$, where $B(0, R)$ denotes the ball of radius $R$ centered at the origin.

We now choose a compact set $C$ such that Lemma 6 holds for $\delta = \epsilon_1$. Then Lemma 10 again implies that
$$W_1\left(T_{\phi,0\,\#}(p^*),\ T_{\phi,0\,\#}(p^*|_C)\right) \le \mathrm{Lip}(T_{\phi,0})\, \epsilon_1 \tag{17}$$
Equations (15), (16), (17) and the triangle inequality together imply
$$W_1\left(f_\#(p^*|_C), p_0\right) \le \left(2\, \mathrm{Lip}(T_{\phi,0}) + 1\right) \epsilon_1 \le \epsilon$$
for small enough $\epsilon_1$. We can indeed set $\epsilon_1$ small enough so as to satisfy the last inequality above, because of the global bound on $\mathrm{Lip}(T_{\phi,0})$ established in Lemma 1. This gives us the statement of Theorem 1. Note that the final value of $\phi$ depends on $\epsilon$, $\kappa$, $\gamma$ and $\mathcal{L}[p_0]$.

6 Conclusion

In this paper, we provide the first guarantees on universal approximation with well-conditioned affine coupling networks. The conditioning of the network is crucial when the networks are trained using gradient-based optimization of the likelihood. Mathematically, we uncover connections between stochastic differential equations, dynamical systems and affine coupling flows. Our construction uses Gaussian padding, which lends support to the empirical observation that this strategy tends to result in better-conditioned flows [Koehler et al., 2020]. We leave it as an open problem to generalize beyond log-concave distributions.

References

Ralph Abraham and Jerrold E Marsden. Foundations of mechanics. Number 364. American Mathematical Soc., 2008.
Uri M Ascher and Chen Greif. A first course on numerical methods. SIAM, 2011.
Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, pages 177-206. Springer, 1985.
Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators, volume 348. Springer Science & Business Media, 2013.
Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. nov 2018. URL http://arxiv.org/abs/1811.00995v3.
Jens Behrmann, Paul Vicol, Kuan-Chieh Wang, Roger Grosse, and Jörn-Henrik Jacobsen. Understanding and mitigating exploding inverses in invertible neural networks. jun 2020. URL http://arxiv.org/abs/2006.09347v1.
S. G. Bobkov and F. Götze. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1-28, 1999. ISSN 0022-1236. doi: https://doi.org/10.1006/jfan.1998.3326. URL https://www.sciencedirect.com/science/article/pii/S0022123698933262.
Herm Jan Brascamp and Elliott H Lieb. On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. In Inequalities, pages 441-464. Springer, 2002.
Jianfei Chen, Cheng Lu, Biqi Chenli, Jun Zhu, and Tian Tian. VFlow: More expressive generative flows with variational data augmentation. feb 2020. URL http://arxiv.org/abs/2002.09741v2.
Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud.
Neural ordinary differential equations. jun 2018. URL http://arxiv.org/abs/1806.07366v5.
Carmen Chicone. Ordinary differential equations with applications. Springer, New York, Berlin, 2006. ISBN 9780387307695.
Rob Cornish, Anthony L. Caterini, George Deligiannidis, and Arnaud Doucet. Relaxing bijectivity constraints with continuously indexed normalising flows. sep 2019. URL http://arxiv.org/abs/1909.13833v5.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In Advances in Neural Information Processing Systems, pages 3134-3144, 2019.
Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
Gilles Hargé. A convex/log-concave correlation inequality for Gaussian measure and an application to abstract Wiener spaces. Probability Theory and Related Fields, 130(3):415-440, 2004.
Michel Hénon. A two-dimensional mapping with a strange attractor. In The Theory of Chaotic Attractors, pages 94-102. Springer, 1976.
Chin-Wei Huang, Laurent Dinh, and Aaron Courville. Augmented normalizing flows: Bridging the gap between generative flows and latent variable models. feb 2020. URL http://arxiv.org/abs/2002.07101v1.
Priyank Jaini, Kira A. Selby, and Yaoliang Yu. Sum-of-squares polynomial flow, 2019.
Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215-10224, 2018.
Frederic Koehler, Viraj Mehta, and Andrej Risteski. Representational aspects of depth and conditioning in normalizing flows, 2020.
Alexander V Kolesnikov. Mass transportation and contractions. arXiv preprint arXiv:1103.1479, 2011.
Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and Michael I. Jordan. Is there an analog of Nesterov acceleration for MCMC? feb 2019. URL http://arxiv.org/abs/1902.00996v2.
Chenlin Meng, Yang Song, Jiaming Song, and Stefano Ermon. Gaussianization flows, 2020.
Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361-400, 2000.
Grigorios A Pavliotis. Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, volume 60. Springer, 2014.
Matthew M. Peet. Exponentially stable nonlinear systems have polynomial Lyapunov functions on bounded regions, 2007.
Leonid Polterovich. The geometry of the group of symplectic diffeomorphisms. Birkhäuser, 2012.
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530-1538. PMLR, 2015.
Adrien Saumard and Jon A Wellner. Log-concavity and strong log-concavity: a review. Statistics Surveys, 8:45, 2014.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, 2019.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.
arXiv preprint arXiv:2011.13456, 2020.
E. G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145-164, sep 2012. doi: 10.1002/cpa.21423. URL https://doi.org/10.1002%2Fcpa.21423.
Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217-233, 2010. doi: 10.4310/cms.2010.v8.n1.a11. URL https://doi.org/10.4310%2Fcms.2010.v8.n1.a11.
Amirhossein Taghvaei and Prashant Mehta. Accelerated flow for probability distributions. In International Conference on Machine Learning, pages 6076-6085. PMLR, 2019.
Michel Talagrand. Transportation cost for Gaussian and other product measures. Geometric & Functional Analysis GAFA, 6(3):587-600, 1996.
Takeshi Teshima, Isao Ishikawa, Koichi Tojo, Kenta Oono, Masahiro Ikeda, and Masashi Sugiyama. Coupling-based invertible neural networks are universal diffeomorphism approximators. jun 2020. URL http://arxiv.org/abs/2006.11469v2.
Dmitry Turaev. Polynomial approximations of symplectic dynamics and richness of chaos in non-hyperbolic area-preserving maps. Nonlinearity, 16(1):123-135, nov 2002. doi: 10.1088/0951-7715/16/1/308. URL https://doi.org/10.1088%2F0951-7715%2F16%2F1%2F308.
Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
Han Zhang, Xi Gao, Jacob Unterman, and Tom Arodz. Approximation capabilities of neural ODEs and invertible residual networks. 2020.