Tails of Lipschitz Triangular Flows
Priyank Jaini* ¹ ², Ivan Kobyzev ³, Yaoliang Yu ¹ ², Marcus A. Brubaker ³ ⁴
We investigate the ability of popular flow-based methods to capture tail properties of a target density by studying the increasing triangular maps used in these flow methods acting on a tractable source density. We show that the density quantile functions of the source and target density provide a precise characterization of the slope of the transformation required to capture tails in a target density. We further show that any Lipschitz-continuous transport map acting on a source density will result in a density with tail properties similar to those of the source, highlighting the trade-off between a complex source density and a sufficiently expressive transformation to capture desirable properties of a target density. Subsequently, we illustrate that flow models like Real-NVP, MAF, and Glow as implemented originally lack the ability to capture a distribution with non-Gaussian tails. We circumvent this problem by proposing tail-adaptive flows consisting of a source distribution that can be learned simultaneously with the triangular map to capture tail properties of a target density. We perform several synthetic and real-world experiments to complement our theoretical findings.
1. Introduction
Increasing triangular maps are a recent construct in probability theory that can transform any source density to any target density (Bogachev et al., 2005). The Knothe-Rosenblatt transformation (Rosenblatt, 1952; Knothe et al., 1957) gives an explicit version of an increasing triangular map that does the transformation. These triangular maps provide a unified framework (Jaini et al., 2019) to study popular neural density estimation methods like normalizing flows (Tabak & Vanden-Eijnden, 2010; Tabak & Turner, 2013; Rezende & Mohamed, 2015) and autoregressive models (Papamakarios et al., 2017; Huang et al., 2018; Kingma et al., 2016; Uria et al., 2016; Larochelle & Murray, 2011), which are tractable methods for explicitly modelling densities for high-dimensional datasets. Indeed, these methods have been applied successfully in several domains including natural images, videos, speech and audio synthesis, novelty detection, and natural language.
*Equal contribution. ¹University of Waterloo, Waterloo, Canada ²Vector Institute, Toronto, Canada ³Borealis AI ⁴York University, Toronto, Canada. Correspondence to: Priyank Jaini.
Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).
This work studies the tail properties of a target density by characterizing the properties of the corresponding increasing triangular map required to push a tractable source density with known tails to the desired target density. We begin in §3 by showing that, in one dimension, the density quantile functions of the source and target density characterize the slope of a (unique) increasing transformation. Furthermore, the asymptotic properties of the density quantile function allow us to give a granular characterisation of the degree of heaviness of a distribution. We show that the degree-of-heaviness parameters of the source and target densities characterize the properties of the corresponding triangular map completely. We then give a precise rate at which an increasing transformation must grow in order to capture the tail behaviour of the target density by drawing connections between the degree-of-heaviness parameter and the existence of higher-order moments of the densities.
We generalize these results to higher dimensions in §4 by showing that a Lipschitz-continuous transport map will always result in a target density with the same tail properties as the source, highlighting the trade-off between choosing an appropriate source density and a sufficiently complex transport map to capture tails in a target density. Additionally, when the source and target densities are from the elliptical family, we show that the increasing triangular map from a light-tailed distribution to a heavy-tailed distribution must have all diagonal entries of the Jacobian unbounded.
In §5, we discuss the implications of these results for a class of flow-based models that we call affine triangular flows, which include NICE (Dinh et al., 2015), Real-NVP (Dinh et al., 2017), MAF (Papamakarios et al., 2017), IAF (Kingma et al., 2016), and Glow (Kingma & Dhariwal, 2018). We show both theoretically and empirically that these models as originally implemented lack the ability to push a fixed source density to a target density with heavier tails. To circumvent these drawbacks of affine flows, we subsequently propose tail-adaptive flows in §6, where the source density, instead of being fixed, is endowed with a learnable parameter that controls its tail behaviour and allows affine flows to capture tail properties of the target density. We illustrate these properties of tail-adaptive flows empirically and demonstrate their performance on benchmark datasets.
Contributions. We summarize our main contributions as follows:
- We show that density quantiles precisely capture the properties of a push-forward transformation. We use these to provide asymptotic rates for the slope of maps required to capture heavy-tailed behaviour.
- We show that Lipschitz push-forward maps cannot change the tails of the source density qualitatively. We thus reveal a trade-off between choosing a complex source density and an expressive transformation for representing heavy-tailed target densities.
- As a consequence, we show that several popular flow models as originally implemented lack the ability to capture a density with heavier tails than the fixed source.
- We propose tail-adaptive flows that can be deployed easily in any existing flow-based and autoregressive model to better capture tail properties of a target density. We also demonstrate the importance of choosing an appropriate source density.
Due to space constraints, proofs are deferred to Appendix A.
2. Preliminaries and Set-Up
In this section we set up our main problem, introduce key definitions and notations, and formulate the framework of characterizing tail properties of a target probability density through the unique triangular push-forward map.
We call a mapping $T : \mathbb{R}^d \to \mathbb{R}^d$ triangular if its $j$-th component $T_j$ only depends on the first $j$ variables $z_1, \ldots, z_j$. The name triangular comes from the fact that the Jacobian $\nabla T$ is a triangular matrix function. Further, we call $T$ increasing if for all $j \in [d]$, $T_j$ is an increasing function of $z_j$. Triangular transformations are appealing due to the following result by Bogachev et al. (2005):
Theorem 1 (Bogachev et al. 2005). For any two densities $p$ and $q$ over $\mathsf{Z} = \mathsf{X} = \mathbb{R}^d$, there exists a unique (up to null sets of $p$) increasing triangular map $T : \mathsf{Z} \to \mathsf{X}$ so that if $Z \sim p$ then $T(Z) \sim q$, i.e. $q$ is the push-forward of $p$, or in symbols $q = T_\# p$.
Let us give an example to help understand Theorem 1.
Example 1 (Increasing Rearrangement). Let $p$ and $q$ be univariate probability densities with distribution functions $F$ and $G$, respectively. One can define the increasing map $T = G^{-1} \circ F$ such that $q = T_\# p$, where $G^{-1} : [0, 1] \to \mathbb{R}$ is the quantile function of $q$:
$$G^{-1}(u) := \inf\{t : G(t) \ge u\}. \tag{1}$$
Indeed, if $Z \sim p$, one has that $F(Z) \sim$ uniform. Also, if $U \sim$ uniform, then $G^{-1}(U) \sim q$. Theorem 1 is a rigorous iteration of this univariate argument by repeatedly conditioning (a construction popularly known as the Knothe-Rosenblatt transformation (Rosenblatt, 1952; Knothe et al., 1957)). Specifically, the $j$-th component $T_j$ of $T$ for the Knothe-Rosenblatt transformation is given by $x_j = T_j(z_1, \ldots, z_{j-1}, z_j) = F^{-1}_{q,j|}$ … $\lambda > 0$, $m_p(\lambda) := \mathbb{E}_{Z \sim p}[e^{\lambda Z}] = \infty$;
otherwise, it is light-tailed, i.e. $p \in \mathcal{L}$, if all its higher-order moments are finite.² We show that any diffeomorphic transformation $T$ that pushes a source density $p \in \mathcal{L}$ to a target density $q \in \mathcal{H}$ cannot have a bounded slope globally.
Theorem 2. Let $p \in \mathcal{L}$ and $q \in \mathcal{H}$ such that $q = T_\# p$, where $T$ is a diffeomorphism. Then, for all $M > 0$ and all $z_0 > 0$ there is $z > z_0$ such that $T'(z) > M$. Conversely, if $T$ is a Lipschitz-continuous map and $p \in \mathcal{L}$, then $T_\# p \in \mathcal{L}$.
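The first part of Theorem 2 can be seen on a closed-form instance of the increasing rearrangement from Example 1. The sketch below is illustrative and not taken from the paper: it assumes an Exponential(1) source and a Pareto target with tail index `a` (both choices, and all function names, are assumptions made here), for which $T(z) = e^{z/a}$ and the slope grows without bound.

```python
import math

def source_tail(z: float) -> float:
    """Tail mass 1 - F(z) of the Exponential(1) source (light-tailed)."""
    return math.exp(-z)

def target_quantile_from_tail(s: float, a: float = 2.0) -> float:
    """G^{-1}(1 - s) for a Pareto target with P(X > x) = x**(-a), x >= 1."""
    return s ** (-1.0 / a)

def transport(z: float, a: float = 2.0) -> float:
    """Increasing rearrangement T = G^{-1} o F, written via the tail mass
    s = 1 - F(z) to avoid cancellation for large z."""
    return target_quantile_from_tail(source_tail(z), a)

def slope(z: float, a: float = 2.0, h: float = 1e-5) -> float:
    """Central finite-difference estimate of T'(z)."""
    return (transport(z + h, a) - transport(z - h, a)) / (2.0 * h)

if __name__ == "__main__":
    for z in (1.0, 5.0, 10.0, 20.0):
        print(f"z={z:5.1f}  T(z)={transport(z):14.4f}  T'(z)={slope(z):14.4f}")
```

For any bound $M$ one can find $z$ with `slope(z) > M`, so no Lipschitz constant exists for this map.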
Theorem 2 is mostly a qualitative result, and it provides little knowledge about the map $T$ required to capture a heavy-tailed distribution $q$ given a source density $p$. Moreover, we would ideally like to characterize the properties of $T$ in terms of the degree of heaviness of $p$ and $q$ respectively. We will address this problem by proposing a refined definition of tails of a density function in terms of the asymptotic behaviour of the density quantile function as formulated by Parzen (1979) and Andrews et al. (1973).
For a probability density $p$ over a domain $\mathsf{Z} \subseteq \mathbb{R}$, let $F_p : \mathsf{Z} \to [0, 1]$ denote the cumulative distribution function of $p$, and $Q_p : [0, 1] \to \mathsf{Z}$ be the quantile function given by $Q_p = F_p^{-1}$. Then, $fQ_p : [0, 1] \to \mathbb{R}_+$ is called the density quantile function and is given by $fQ_p = 1/Q_p'$. Parzen (1979) proved that the limiting behaviour of any density quantile function as $u \to 1^-$ is given by:
$$fQ(u) \sim (1-u)^{\alpha}, \quad \alpha > 0 \tag{2}$$
where $g(u) \sim h(u)$ means that $\lim_{u \to 1^-} g(u)/h(u)$ is a finite constant. We can additionally define the limiting behaviour of the quantile function $Q(u)$ when $u \to 1^-$ as:
$$Q(u) \sim (1-u)^{-\gamma}, \quad \gamma = \alpha - 1. \tag{3}$$
The parameter $\alpha$ is called the tail-exponent and defines the tail-area of a distribution and acts as a measure of degree of heaviness. Indeed, for two distributions with tail exponents $\alpha_1$ and $\alpha_2$, if $\alpha_1 > \alpha_2$, the former has heavier tails relative to the latter. Thus, the tail exponent $\alpha$ allows us to classify distributions based on their degree of heaviness.
² We note that this definition is restricted to only right-tails. For the sake of simplicity we develop our results for right-tails, but they generalise to left-tails naturally.
Define $\mathcal{H}_\alpha := \{\, p : fQ_p \sim (1-u)^{\alpha} \text{ as } u \to 1^- \,\}$.
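To make the tail exponent concrete, the following sketch reads $\alpha$ off as the slope of $\log fQ(u)$ against $\log(1-u)$ near $u = 1$. The three example distributions, their closed-form density quantiles, and the helper names are assumptions chosen for illustration; each density quantile is written as a function of the tail mass $s = 1 - u$.

```python
import math

def exponential_fq(s):
    """Density quantile of Exponential(1) at tail mass s = 1 - u: fQ(u) = 1 - u."""
    return s

def pareto_fq(s, a=2.0):
    """Density quantile of a Pareto law with P(Z > z) = z**(-a), z >= 1."""
    return a * s ** (1.0 + 1.0 / a)

def triangular_fq(s):
    """Density quantile of p(z) = 2(1 - z) on [0, 1], a bounded-support example."""
    return 2.0 * math.sqrt(s)

def tail_exponent(fq, s_far=1e-4, s_near=1e-8):
    """Slope of log fQ against log(1 - u) between two tail masses close to u = 1."""
    rise = math.log(fq(s_near)) - math.log(fq(s_far))
    run = math.log(s_near) - math.log(s_far)
    return rise / run

if __name__ == "__main__":
    print("exponential:", tail_exponent(exponential_fq))
    print("pareto(a=2):", tail_exponent(pareto_fq))
    print("triangular :", tail_exponent(triangular_fq))
```

Under these assumptions the exponential gives $\alpha = 1$, the Pareto law gives $\alpha = 1 + 1/a > 1$, and the bounded-support density gives $\alpha = 1/2$.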
Following Parzen (1979), if 0 < α < 1 the distributions are light-tailed, e.g. the Uniform distribution. Here, we further show that a distribution has support bounded from above if and only if the right density quantile function has tail-exponent 0 < α < 1.
Proposition 1. Let $p$ be a density with $fQ_p \sim (1-u)^{\alpha}$ as $u \to 1^-$. Then, $0 < \alpha < 1$ iff $\mathrm{supp}(p) = [a, b]$ where $b < \infty$, i.e. $p$ has a support bounded from above.
$\mathcal{H}_1$ corresponds to a family of distributions for which all higher-order moments exist. However, these distributions are relatively heavier tailed than short-tailed distributions and were termed medium-tailed distributions by Parzen (1979), e.g. the normal and exponential distributions. Additionally, for $\alpha = 1$, a more refined description of the asymptotic behaviour of the quantile function can be given in terms of the shape parameter $\beta$:
$$fQ(u) \sim (1-u)\Big(\log\frac{1}{1-u}\Big)^{1-\beta} \quad \text{and} \quad Q(u) \sim \Big(\log\frac{1}{1-u}\Big)^{\beta}.$$
$\beta$ determines the degree of heaviness in medium-tailed distributions; the larger the value of $\beta$, the heavier the tails of the distribution, e.g. the exponential distribution has $\beta = 1$, and the normal distribution has $\beta = 0.5$. We thus define
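$$\mathcal{H}_{1,\beta} = \Big\{\, p : fQ_p \sim (1-u)\Big(\log\frac{1}{1-u}\Big)^{1-\beta},\ 0 \le \beta \le 1 \,\Big\}$$
and we have $\mathcal{H}_1 = \cup_{0 \le \beta \le 1} \mathcal{H}_{1,\beta}$. Further, the class of light-tailed distributions defined in the beginning of the section is $\mathcal{L} = \cup_{0 < \alpha \le 1} \mathcal{H}_\alpha$. Finally, the class of heavy-tailed distributions has $\alpha > 1$, i.e. $\mathcal{H} = \cup_{\alpha > 1} \mathcal{H}_\alpha$, e.g. the Student-t distribution $t_\nu$ with $\nu$ degrees of freedom.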
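The two medium-tailed examples can be checked by hand. The derivation below is a sketch that is not part of the paper; it assumes only the standard Gaussian tail approximation $1 - \Phi(z) \sim \phi(z)/z$ and uses $\sim$ up to constants, as in the text.

```latex
% Exponential(1): F(z) = 1 - e^{-z}
\[
Q(u) = \log\frac{1}{1-u}, \qquad
fQ(u) = e^{-Q(u)} = 1-u
\quad\Longrightarrow\quad \alpha = 1,\ \beta = 1 .
\]
% Standard normal: 1 - \Phi(z) \sim \phi(z)/z as z \to \infty
\[
Q(u) \sim \Big(2\log\frac{1}{1-u}\Big)^{1/2}, \qquad
fQ(u) = \phi\big(Q(u)\big) \sim (1-u)\,Q(u)
\sim (1-u)\Big(\log\frac{1}{1-u}\Big)^{1/2}
\quad\Longrightarrow\quad \alpha = 1,\ \beta = \tfrac{1}{2} .
\]
```

In both cases the exponent of the logarithm in $fQ$ is $1 - \beta$ and the exponent in $Q$ is $\beta$, and the exponential quantile grows faster than the normal one.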
We are now in a position to characterize the map $T$ based on the degree of heaviness of the source and target densities. Following Example 1, the slope of $T$ is given by the ratio of the density quantile functions of the source and the target distribution respectively, i.e.
$$T'(z) = \frac{p(z)}{q\big(T(z)\big)} = \frac{p\big(F_p^{-1}(F_p(z))\big)}{q\big(F_q^{-1}(F_p(z))\big)}, \quad \text{i.e.} \quad T'(z) = \frac{fQ_p(u)}{fQ_q(u)}, \quad \text{where } u = F_p(z).$$
Figure 1. Results for Real-NVP illustrating the inability to capture tails. The second and third columns show the quantile and log-quantile plots for the source, target, and estimated target density. The quantile functions of the source and the estimated target density are identical, depicting the inability to capture heavier tails. This is further explained by the estimated tail-coefficients $\gamma_{\text{source}} = 0.15$, $\gamma_{\text{target}} = 0.81$, and $\gamma_{\text{estimated target}} = 0.15$. Best viewed in color. More details in Section 5.
Clearly, the density quantile functions precisely characterize the slope of an increasing map needed to push a source density $p$ to a target density $q$.
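The slope identity can be checked numerically. The sketch below is an illustration under assumed choices (a standard normal source, an Exponential(1) target, and hypothetical function names): it compares a finite-difference estimate of $T'(z)$ with the ratio of density quantiles evaluated at $u = F_p(z)$.

```python
import math

def normal_pdf(z):
    """Density of the N(0, 1) source."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_tail(z):
    """Tail mass 1 - F_p(z) of N(0, 1), computed with erfc for accuracy."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def transport(z):
    """T = Q_q o F_p for an Exponential(1) target: Q_q(u) = -log(1 - u)."""
    return -math.log(normal_tail(z))

def slope_from_density_quantiles(z):
    """fQ_p(u) / fQ_q(u) at u = F_p(z).

    fQ_p(u) = p(Q_p(u)) = p(z), and for Exponential(1), fQ_q(u) = 1 - u.
    """
    return normal_pdf(z) / normal_tail(z)

def slope_finite_difference(z, h=1e-5):
    """Central finite-difference estimate of T'(z)."""
    return (transport(z + h) - transport(z - h)) / (2.0 * h)

if __name__ == "__main__":
    for z in (0.0, 1.0, 2.0, 4.0):
        print(z, slope_from_density_quantiles(z), slope_finite_difference(z))
```

Both targets here are medium-tailed, but the exponential is the heavier of the two, and the slope increases with $z$ accordingly.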
Proposition 2. Let $p$ and $q$ be two square-integrable univariate densities such that $q := T_\# p$. If the density quantile $fQ_p$ of $p$ shrinks to 0 at a rate slower than the density quantile $fQ_q$ of $q$, then $T'(z)$ is asymptotically unbounded.
Example 2 in Appendix B helps to illustrate Proposition 2 and the next corollary provides a precise characterization of asymptotic properties of a diffeomorphic transformation between densities with varying tail behaviour.
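Corollary 1. Let $p \in \mathcal{H}_{\alpha_p}$ be a source density, $q \in \mathcal{H}_{\alpha_q}$ be a target density and $T$ be an increasing transformation such that $q = T_\# p$. Then, $\lim_{z \to \infty} T'(z) = \lim_{u \to 1^-} (1-u)^{\alpha_p - \alpha_q}$. Further, if $\alpha_p = \alpha_q = 1$, then $\lim_{z \to \infty} T'(z) = \lim_{u \to 1^-} \big(\log \tfrac{1}{1-u}\big)^{\beta_q - \beta_p}$, where $u = F_p(z)$.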
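A worked instance may help fix the signs. The example below is an assumption made for illustration (two Pareto laws with tail indices $a_p, a_q > 0$ on $[1, \infty)$) and follows directly from the ratio $T'(z) = fQ_p(u)/fQ_q(u)$.

```latex
% Pareto with tail index a: 1 - F(z) = z^{-a} for z >= 1
\[
Q(u) = (1-u)^{-1/a}, \qquad fQ(u) = a\,(1-u)^{1 + 1/a}
\quad\Longrightarrow\quad \alpha = 1 + \tfrac{1}{a} .
\]
% Source index a_p, target index a_q, with u = F_p(z), i.e. 1 - u = z^{-a_p}
\[
T(z) = Q_q\big(F_p(z)\big) = z^{a_p/a_q}, \qquad
T'(z) = \frac{fQ_p(u)}{fQ_q(u)}
      = \frac{a_p}{a_q}\,(1-u)^{\frac{1}{a_p} - \frac{1}{a_q}}
      = \frac{a_p}{a_q}\, z^{\frac{a_p}{a_q} - 1} .
\]
```

The exponent $1/a_p - 1/a_q$ equals $\alpha_p - \alpha_q$, so the slope is unbounded exactly when the target is heavier ($a_q < a_p$) and vanishes when it is lighter.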
Example 3 in Appendix B further underlines the importance of density quantile functions to study tails of increasing transformations. We now connect the tail-exponent parameter $\alpha$ to the existence of higher-order moments of a random variable. Given a random variable $X \sim p$, the expected value of a function $g(x)$ can be written in terms of the quantile function as: $\mathbb{E}_p[g(x)] = \int_0^1 g\big(Q_p(u)\big)\, du$. This allows us to draw a precise connection between the degree of heaviness of a distribution as given by the density quantile functions (and tail exponent $\alpha$) and the existence of the number of its higher-order moments ($\omega$).
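Proposition 3. Let $p$ be a distribution with $Q_p(u) \sim (1-u)^{-\gamma}$ as $u \to 1^-$. Then, $\int_{z_0}^{\infty} z^{\omega} p(z)\, dz$ exists and is finite for some $z_0$ iff $\omega < \frac{1}{\gamma}$.
Corollary 2. If $p$ is a distribution with $Q_p(u) \sim (1-u)^{-\gamma}$ as $u \to 1^-$ and $Q_p(u) \sim u^{-\gamma}$ as $u \to 0^+$,³ then $\mathbb{E}_p[|z|^{\omega}]$ exists and is finite iff $\omega < \frac{1}{\gamma}$.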
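The moment threshold can be illustrated with the quantile form of the expectation, $\mathbb{E}_p[g(x)] = \int_0^1 g(Q_p(u))\,du$. The sketch below assumes the exact power law $Q(u) = (1-u)^{-\gamma}$ (an idealisation; the function names are hypothetical) and evaluates the integral truncated at $u = 1 - \delta$ in closed form, so that convergence or divergence as $\delta \to 0$ is visible.

```python
import math

def truncated_moment(omega, gamma, delta):
    """Integral of Q(u)**omega over [0, 1 - delta] for Q(u) = (1 - u)**(-gamma).

    With s = 1 - u this is the integral of s**(-gamma * omega) over [delta, 1].
    """
    k = 1.0 - gamma * omega
    if abs(k) < 1e-12:
        return math.log(1.0 / delta)  # boundary case omega = 1 / gamma
    return (1.0 - delta ** k) / k

def moment_is_finite(omega, gamma):
    """Threshold for a finite omega-th right-tail moment: omega < 1 / gamma."""
    return omega < 1.0 / gamma

if __name__ == "__main__":
    gamma = 0.5  # threshold at omega = 2
    for omega in (1.0, 2.0, 3.0):
        values = [truncated_moment(omega, gamma, d) for d in (1e-4, 1e-8, 1e-12)]
        print(omega, moment_is_finite(omega, gamma), values)
```

Below the threshold the truncated integrals settle at $1/(1 - \gamma\omega)$; at or above it they keep growing as the truncation point approaches 1.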
Based on these observations, we can equivalently define heavy-tailed distributions as follows:
Definition 1. A distribution $p(z)$ with compact support, i.e. $\mathrm{supp}(p) = [a, b]$ where $|a| < \infty$ and $|b| < \infty$, is said to be $\omega$-heavy tailed if for all $0 < \mu < \omega$, $\mathbb{E}_p\big[|z - b|^{-1/\mu}\big]$ exists and is finite, but for $\mu \ge \omega$, $\mathbb{E}_p\big[|z - b|^{-1/\mu}\big]$ is infinite or does not exist.
Definition 2. A distribution $p(z)$ with tail exponent $\alpha = 1$ is said to be $\omega$-heavy tailed if for all $0 < \mu < \omega$, $\mathbb{E}_p\big[e^{|z|^{\mu}}\big]$ exists and is finite, but for $\mu \ge \omega$, $\mathbb{E}_p\big[e^{|z|^{\mu}}\big]$ is infinite or does not exist.
³ This condition takes the left-tail into account as well. Note that it is not necessary for both tails to have the same behaviour and our analysis extends to such cases.
Definition 3 ($\omega^{-1}$-heavy tailed distributions). A distribution $p(z)$ with tail-exponent $\alpha > 1$ is heavy tailed with degree $\omega^{-1}$, with $\omega \in \mathbb{R}_+$, if for all $0 < \mu < \omega$, $\mathbb{E}_p[|z|^{\mu}]$ exists and is finite, but for all $\mu \ge \omega$, $\mathbb{E}_p[|z|^{\mu}]$ is infinite or does not exist.
These definitions allow us to finally give the rate an increasing transformation must emulate to exactly represent tail-properties of a target density given some source density.
Proposition 4. Let $p$ be an $\omega_p^{-1}$-heavy distribution, $q$ be an $\omega_q^{-1}$-heavy distribution and $T$ be a diffeomorphism such that $q := T_\# p$. Then for small $\epsilon > 0$, $T(z) = o\big(|z|^{\omega_p/\omega_q - \epsilon}\big)$.
4. Properties of Multivariate Transformations
We now generalize our results to higher dimensions by first fixing the definition of a heavy-tailed distribution in higher dimensions.⁴ We say that a random variable $X \in \mathbb{R}^d$ admits a heavy-tailed density function if the univariate random variable $\|X\|$ has a heavy-tailed density, where $\|\cdot\|$ is some norm function. The granular definitions from Section 3 can be extended to the multivariate case through the density function of $\|X\|$.
Theorem 3. Let $Z \in \mathbb{R}^d$ be a random variable with density function $p$ that is light-tailed and $X \in \mathbb{R}^d$ be a target random variable with density function $q$ that is heavy-tailed. Let $T : Z \to X$ be such that $q = T_\# p$; then $T$ cannot be a Lipschitz function.
Corollary 3. Under the same set-up as in Theorem 3, there exists an index $i \in [d]$ such that $\nabla_z T_i$ is unbounded.
Theorem 3 is a general result for any diffeomorphic transformation between two densities, and we discuss the implication of this result for flow-based models in §5. However, before proceeding further, we also characterize the properties of the triangular map $T$ such that $q = T_\# p$ by studying the properties of the univariate maps $T_j$, $j \in [d]$, obtained by repeated conditioning when the source and target densities are from the class of elliptical distributions.
Definition 4 (Elliptical distribution (Cambanis et al., 1981)). A random vector $X \in \mathbb{R}^d$ is said to be elliptically distributed, denoted by $X \sim \varepsilon_d(\mu, \Sigma, F_R)$ with $\mathrm{rank}(\Sigma) = r$, if and only if there exists a $\mu \in \mathbb{R}^d$, a matrix $A \in \mathbb{R}^{d \times r}$ with maximal rank $r$, and a non-negative random variable $R$, such that $X \overset{d}{=} \mu + R A U^{(d)}$, where the random $r$-vector $U$ is independent of $R$ and is uniformly distributed over the unit sphere $B^{d-1}$, $\Sigma = A^T A$ and $F_R$ is the cumulative distribution function of the variate $R$.
⁴ Note that due to the lack of total ordering there is no standard definition of multivariate heavy-tailed distributions.
For ease in developing our results, we consider only full-rank elliptical distributions, i.e. $\mathrm{rank}(\Sigma) = d$, but the results can be easily extended to the general case. The spherical random vector $U^{(d)}$ produces elliptically contoured density surfaces due to the transformation $A$. The density function of an elliptical distribution as defined above is given by $f(x) = |\det \Sigma|^{-\frac{1}{2}}\, g_R\big((x - \mu)^T \Sigma^{-1} (x - \mu)\big)$, where the function $g_R(t) : [0, \infty) \to [0, \infty)$ is related to $f_R$, the density function of $R$, by the equation $f_R(r) = s_d\, r^{d-1} g_R(r^2)$, $r \ge 0$; here $s_d = \frac{2\pi^{d/2}}{\Gamma(d/2)}$ is the area of a unit sphere. Thus, the tail properties of a random variable $X$ with an elliptical distribution $\varepsilon_d(\mu, \Sigma, F_R)$ are determined by the generating random variable $R$. Indeed, $X$ is heavy-tailed in all directions if the univariate generating random variable $R$ is heavy-tailed.
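The stochastic representation $X = \mu + R A U$ translates directly into a sampler, which makes it easy to see that the radius $R$ alone controls the tails of $\|X\|$. The sketch below is illustrative only: the Pareto radius, the helper names, and the use of plain Python lists are assumptions made here, not part of the paper.

```python
import math
import random

def sample_unit_sphere(dim, rng):
    """Uniform point on the unit sphere in R^dim via a normalised Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def sample_elliptical(mu, A, sample_radius, rng):
    """Draw X = mu + R * A @ U, with U uniform on the sphere and R = sample_radius(rng).

    A is a list of rows; U has as many entries as A has columns.
    """
    u = sample_unit_sphere(len(A[0]), rng)
    r = sample_radius(rng)
    return [m + r * sum(a * x for a, x in zip(row, u)) for m, row in zip(mu, A)]

def pareto_radius(rng, a=2.0):
    """Heavy-tailed radius with P(R > r) = r**(-a) for r >= 1 (inverse-CDF sampling)."""
    return (1.0 - rng.random()) ** (-1.0 / a)

if __name__ == "__main__":
    rng = random.Random(0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    norms = []
    for _ in range(10000):
        x = sample_elliptical([0.0, 0.0, 0.0], identity, pareto_radius, rng)
        norms.append(math.sqrt(sum(v * v for v in x)))
    print("largest norm among 10000 draws:", max(norms))
```

With $\mu = 0$ and $A = I$ the norm of each draw equals the sampled radius, so swapping `pareto_radius` for a light-tailed radius changes the tails of $\|X\|$ and nothing else.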
Define $m_{f_R}(k) = \frac{1}{s_d} \int_0^{\infty} r^k f_R(r)\, dr$, $k \in \mathbb{R}_+$. Intuitively, $m_{f_R}(k)$ is the $k$-th order moment of $f_R$ when $k$ is integer-valued. This allows us to generalize the granular definition of heavy-tailed distributions (§3, Definition 3) to the multivariate elliptical case: the distribution $\varepsilon_d(\mu, \Sigma, F_R)$ is $\omega^{-1}$-heavy iff $\mu_k$ is finite for all $k < \omega$, i.e. iff $F_R$ is $\omega^{-1}$-heavy. Similarly, from Definition 2 one has that $\varepsilon_d(\mu, \Sigma, F_R)$ is $\omega$-heavy iff $F_R$ is $\omega$-heavy. Elliptical distributions have certain convenient properties: marginals, conditionals and linear transformations of an elliptical distribution are also elliptical (see Appendix B). Furthermore, we derive the degree-of-heaviness parameter of the conditional distributions of an elliptical distribution.
Proposition 5. Under the same assumptions as in Lemma 2 (App. B), if $X \sim \varepsilon_d(0, I, F_R)$ is $\omega^{-1}$-heavy, then the conditional distribution of $X_2 \mid (X_1 = x_1)$ is $(\omega + d_1)^{-1}$-heavy, where $X_1 \in \mathbb{R}^{d_1}$.
Equipped with all the necessary results, we now show that an increasing triangular map $T$ between a light-tailed and a heavy-tailed elliptical distribution has all diagonal entries of $\nabla T$ unbounded.
Proposition 6. Let $Z \sim \varepsilon_d(0, I, F_S)$ and $X \sim \varepsilon_d(0, I, F_R)$ have densities $p$ and $q$ respectively, where $F_R$ is heavier tailed than $F_S$. If $T : Z \to X$ is an increasing triangular map such that $q := T_\# p$, then all diagonal entries of $\nabla T$ and $\det|\nabla T|$ are unbounded.
Remark 1. Our analysis naturally extends to the case when the target density is lighter tailed by studying the corresponding inverse transformation $T^{-1}$. In particular, such a transformation should have a vanishing asymptotic slope to capture lighter-tailed distributions.
Table 1. Affine triangular flows
Model coefficients Tj zj ; z1, . . . , zj 1
NICE µj(z