# Marginal Tail-Adaptive Normalizing Flows

Mike Laszkiewicz¹ ² Johannes Lederer¹ Asja Fischer²

¹Faculty of Mathematics, Ruhr University, Bochum, Germany. ²Center of Computer Science, Bochum, Germany. Correspondence to: Mike Laszkiewicz.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

**Abstract** Learning the tail behavior of a distribution is a notoriously difficult problem. By definition, the number of samples from the tail is small, and deep generative models, such as normalizing flows, tend to concentrate on learning the body of the distribution. In this paper, we focus on improving the ability of normalizing flows to correctly capture the tail behavior and, thus, form more accurate models. We prove that the marginal tailedness of an autoregressive flow can be controlled via the tailedness of the marginals of its base distribution. This theoretical insight leads us to a novel type of flow based on flexible base distributions and data-driven linear layers. An empirical analysis shows that the proposed method improves accuracy especially in the tails of the distribution and is able to generate heavy-tailed data. We demonstrate its application on a weather and climate example, in which capturing the tail behavior is essential.

## 1. Introduction

Heavy-tailed distributions are known to occur in various applications in biology, finance, climate, and many other fields. Quantities with a heavy-tailed distribution are, for example, the length of protein sequences in genomes (Koonin et al., 2006), returns of stocks (Gabaix et al., 2003), or the occurrence and impacts of weather and climate events (Katz, 2002). In these applications, heavy-tailed events are often the most substantial samples, and ignoring them, for instance by underestimating a maximum flood level or the loss of a financial crisis, would lead to crucial model failures. From a theoretical point of view, heavy-tailed distributions emerge from several circumstances, including as the limiting distribution in the generalized central limit theorem, of a multiplicative process, or as the limit of an extremal process (Nair et al., 2013). Further, many distributions that arise from a functional relationship, such as the ratio of two standard normally distributed random variables, are heavy-tailed, highlighting their importance for models that incorporate known physical relationships among quantities. Given their frequency of occurrence and their potential impact, developing generative models that are able to learn heavy-tailed distributions is an urgent task.

We approach this task by providing important theoretical groundings regarding the expressiveness of Normalizing Flows (NFs) for heavy-tailed data. Normalizing Flows (Rippel & Adams, 2013; Tabak & Turner, 2013; Dinh et al., 2015; Rezende & Mohamed, 2015) are a popular class of deep generative models. Despite their success in learning tractable distributions, where both sampling and density evaluation can be efficient and exact, their ability to model heavy-tailed distributions is known to be limited. Jaini et al. (2020) identified the problem that autoregressive affine NFs are unable to map a light-tailed distribution to a heavy-tailed distribution.
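To make this contrast concrete, the following self-contained Python sketch (our illustration, not from the paper) compares the tail decay of the heavy-tailed quantity mentioned above, the ratio of two standard normals, with an affine transformation of a Gaussian, which remains light-tailed no matter how large the scale and shift are:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Ratio of two independent standard normals: a standard Cauchy,
# the classic heavy-tailed example mentioned above.
ratio = rng.standard_normal(n) / rng.standard_normal(n)

# An affine transformation of a Gaussian stays Gaussian, hence light-tailed.
affine = 5.0 * rng.standard_normal(n) + 2.0

for name, x in [("ratio of normals", ratio), ("affine Gaussian", affine)]:
    # Exceedance probabilities decay polynomially for heavy tails and
    # super-exponentially for Gaussian tails.
    print(name, [float(np.mean(np.abs(x) > t)) for t in (10, 50, 250)])
```

The exceedance probabilities of the ratio shrink only polynomially with the threshold, while those of the affine Gaussian vanish beyond a few standard deviations.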
They propose to solve this issue by replacing the Gaussian base distribution with a multivariate t-distribution with one learnable degree of freedom, leading to a model referred to as Tail-Adaptive Flows (TAF).

**Contributions** In this paper, we extend the work of Jaini et al. (2020) in multiple ways: First, while TAF allows to model distributions with a heavy-tailed Euclidean norm, we show that modeling multivariate distributions where some of the marginals are heavy- and some are light-tailed still poses a problem. More precisely, we identify the problem that an autoregressive affine NF using a base distribution with solely heavy-tailed marginals (such as TAF) can only produce a target distribution with solely heavy-tailed marginals as well. Consequently, such an NF is not capable of learning distributions with mixed marginal tail behavior. Second, to solve the problem, we derive a theoretical result that states conditions under which the marginal tailedness of the base distribution is preserved. Third, we turn these theoretical insights into a novel modification of autoregressive NFs, which allows to model the marginal tail behavior. Since the proposed model preserves marginal tailedness, we call it marginal tail-adaptive flow (mTAF). The proposed method combines estimators from extreme value theory, a flexible base distribution, and a novel data-driven type of linearities, as illustrated in Figure 1.

*Figure 1. An overview of mTAF. In a first step, we apply estimators from extreme value theory to classify the marginals as heavy- or light-tailed. This classification defines a flexible base distribution consisting of marginal Gaussians and marginal t-distributions with flexible degrees of freedom, as illustrated by Step 2 of the figure. Further, we rearrange the marginals such that the first $d_l$ marginals are light-tailed, whereas the remaining marginals are heavy-tailed. mTAF is then constructed using several flow layers as visualized in Step 3: we employ a triangular mapping, followed by a 2-group permutation scheme, which can be generalized to general 2-group linearities (Section 3.3). At the end, we restore the original ordering using the inverse of the permutation employed in Step 2. Using Theorem 3.4, we prove that mTAFs are marginally tail-adaptive (Corollary 3.5).*

Furthermore, we notice that the autoregressive layers in Neural Spline Flows (NSFs; Durkan et al., 2019), which are a SOTA architecture, are linear in their tails, and therefore all theory derived by Jaini et al. (2020) and presented in this paper also applies to neural spline layers. We introduce a simple modification of the LU-layers that ensures that NSFs preserve the full marginal tailedness structure of the base distribution. Lastly, we present a generalized and more flexible version of mTAF, which we call generalized Tail-Adaptive Flow (gTAF). Our theory is backed up by an experimental analysis demonstrating the superior performance of the proposed methods in learning the tails, especially when it comes to generating synthetic tail samples.
Finally, we apply mTAF and gTAF to a climate example to generate heavy-tailed weather data.

**Notational Conventions** In the following, we denote random variables by bold letters, such as $\mathbf{x}$, and their realisations by non-bold letters, $x$. We use this notation for multivariate as well as univariate random variables. Further, we denote the $j$th component of $\mathbf{x}$ by $\mathbf{x}_j$, and we write $x_{<j} := (x_1, \dots, x_{j-1})$ and $x_{\le j} := (x_1, \dots, x_j)$ for all $x, z \in \mathbb{R}^D$, where $D$ is the dimensionality of $x$ and $z$.

## 2. Background

In this section, we give a brief introduction to heavy-tailed distributions and present the needed background knowledge about normalizing flows.

### 2.1. Heavy-tailed Distributions

Heavy-tailed distributions are distributions that have heavier tails (i.e., decay slower) than the exponential distribution. Loosely speaking, slowly decaying tails allow to model distributions that generate samples which differ by a large magnitude from the rest of the samples. For a univariate random variable $\mathbf{x}$ we define heavy-tailedness via its moment-generating function:[^1]

**Definition 2.1 (Heavy-Tailed Random Variables).** Consider a random variable $\mathbf{x} \in \mathbb{R}$ with PDF $p$. We say that $\mathbf{x}$ is heavy-tailed if and only if

$$\forall \lambda > 0: \quad \mathbb{E}_{\mathbf{x}}\big[e^{\lambda \mathbf{x}}\big] = \infty\,.$$

The function $m_p(\lambda) := \mathbb{E}_{\mathbf{x}}[\exp(\lambda \mathbf{x})]$ is known as the moment-generating function of $\mathbf{x}$. Random variables that are not heavy-tailed are said to be light-tailed.

Note that this definition is, strictly speaking, merely a definition of heavy right tails. We say a random variable $\mathbf{x} \in \mathbb{R}$ has heavy left tails if $-\mathbf{x}$ has heavy tails according to Definition 2.1. For simplicity of derivations and w.l.o.g., we proceed with this definition, but the derived results can analogously be applied to left tails.

We can assess the degree of tailedness of a distribution. While there are many equivalent notions of the so-called tail index, the most straightforward definition is via the existence of moments:

**Definition 2.2 (Tail Index).** A random variable $\mathbf{x} \in \mathbb{R}$ with PDF $p$ is said to have tail index[^2] $\alpha$ if it holds that

$$\mathbb{E}_{\mathbf{x}}\big[|\mathbf{x}|^\beta\big] \begin{cases} < \infty\,, & \text{if } \beta < \alpha\,, \\ = \infty\,, & \text{if } \beta > \alpha\,. \end{cases}$$

Since the tail index is tightly related to the decay rate of the PDF, it enables us to assess the degree of heavy-tailedness of a random variable. Therefore, estimation of the tail index became an important objective in extreme value theory and statistical risk assessment (see e.g. Embrechts et al., 2013). Since the existence of a moment does not depend on the body of $\mathbf{x}$ but only on the tails of $\mathbf{x}$ (see Proposition A.1 in Section A.1 of the Appendix), estimating the tail index by fitting a full parametric model to all data, e.g. via likelihood maximization, leads to a biased estimator. Instead, semiparametric estimators have been developed, which aim to fit a distribution only on the tails. Popular methods for tail estimation include the Hill estimator (Hill, 1975), the moment estimator (Dekkers et al., 1989), and kernel-based estimators (Csörgő et al., 1985). In Section C.1 of the Appendix, we discuss these tail estimators and review some practical issues with them.

[^1]: One can readily show that this definition is equivalent to the definition that compares the tails of $\mathbf{x}$ to the tails of an exponential distribution. See Section 1 in Nair et al. (2013).
[^2]: Notice that the notion of a tail index is only valid for regularly varying random variables, which are a subclass of heavy-tailed random variables. For the purpose of this work, it is sufficient to consider regularly varying random variables. More details can be found in Nair et al. (2013).
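As a concrete reference point for the discussion above, here is a minimal NumPy sketch of the Hill estimator (the function name and the choice of $k$ are ours; the double-bootstrap procedures used later in Section C.1 automate the choice of $k$):

```python
import numpy as np

def hill_estimator(samples: np.ndarray, k: int) -> float:
    """Hill estimate of the tail index alpha from the k largest samples.

    Assumes positive samples (apply to |x| for two-sided data). The choice
    of k, i.e., how many order statistics count as 'tail', is exactly the
    practical difficulty discussed above.
    """
    x = np.sort(samples)
    tail, threshold = x[-k:], x[-k - 1]
    gamma = np.mean(np.log(tail) - np.log(threshold))  # Hill estimate of 1/alpha
    return 1.0 / gamma

rng = np.random.default_rng(0)
data = np.abs(rng.standard_t(df=2.0, size=100_000))  # true tail index: 2
for k in (100, 1_000, 10_000):
    print(k, hill_estimator(data, k))  # the estimate varies notably with k
```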
An example of a heavy-tailed distribution is the standardized t-distribution, which has a parameter $\nu > 0$ referred to as the degrees of freedom and a density function given by

$$p(x) := \frac{\Gamma\!\big(\tfrac{\nu+1}{2}\big)}{\sqrt{\nu\pi}\,\Gamma\!\big(\tfrac{\nu}{2}\big)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}},$$

where $\Gamma$ is the Gamma function. It is known that the t-distribution has tail index $\nu$ (see e.g. Kirkby et al. (2019) for a detailed reference).

In the multivariate setting, there exist various definitions of heavy-tailedness. For instance, Resnick (2004) makes use of a definition based on multivariate regular variation. Jaini et al. (2020) define a multivariate random variable $\mathbf{x}$ to be heavy-tailed if its $\ell_2$-norm is heavy-tailed, a property which we refer to as $\ell_2$-heavy-tailed and which is formally defined as follows:

**Definition 2.3 ($\ell_2$-Heavy-Tailed).** Let $\mathbf{x} \in \mathbb{R}^D$ be a multivariate random variable. Then, we call $\mathbf{x}$ $\ell_2$-heavy-tailed if $\lVert \mathbf{x} \rVert$ is univariately heavy-tailed according to Definition 2.1, where $\lVert \cdot \rVert$ denotes the $\ell_2$-norm. Otherwise, we call $\mathbf{x}$ $\ell_2$-light-tailed.

### 2.2. Normalizing Flows

The fundamental idea behind NFs is based on the change-of-variables formula for probability density functions (PDFs) given in the following theorem.

**Theorem 2.4 (Change-of-Variables).** Consider random variables $\mathbf{x}, \mathbf{z} \in \mathbb{R}^D$ and a diffeomorphic map $T: \mathbb{R}^D \to \mathbb{R}^D$ such that $\mathbf{x} = T(\mathbf{z})$. Then, the PDF of $\mathbf{x}$ satisfies

$$p(x) = q\big(T^{-1}(x)\big)\,\big|\det J_{T^{-1}}(x)\big| \quad \text{for all } x \in \mathbb{R}^D\,, \tag{1}$$

where $J_{T^{-1}}(x)$ is the Jacobian of $T^{-1}$ evaluated at $x \in \mathbb{R}^D$.

This formula allows us to evaluate the possibly intractable PDF of $\mathbf{x}$ if we can evaluate both the PDF of $\mathbf{z}$ and $T^{-1}(x)$, and efficiently calculate the Jacobian determinant $\det J_{T^{-1}}(x)$. As $T$ maps $\mathbf{z}$ to $\mathbf{x}$, we denote the distribution of $\mathbf{z}$ and $\mathbf{x}$ as the base and the target distribution, respectively. To model the PDF of $\mathbf{x}$ using NFs, it is common to set the base distribution to a standard normal distribution (i.e., $\mathbf{z} \sim \mathcal{N}(0, I)$) and to employ likelihood maximization to learn a parameterized transformation $T_\theta := T_\theta^{(L)} \circ \dots \circ T_\theta^{(1)}$, which, yet, remains tractable and diffeomorphic. Masked autoregressive flows (MAFs; Papamakarios et al., 2017) are one popular architecture, which employ transformations $T = (T_1, \dots, T_D)$ of the form

$$T_j(z) := \mu_j(z_{<j}) + \sigma_j(z_{<j})\, z_j \quad \text{with } \sigma_j(z_{<j}) > 0\,,$$

where $\mu_j$ and $\sigma_j$ are modeled by neural networks. For such flows, Jaini et al. (2020) showed that light tails cannot be turned into heavy tails:

**Theorem 2.5 (Jaini et al., 2020).** Let $\mathbf{z}$ be light-tailed and let $T$ be a Lipschitz-continuous triangular map. Then $T(\mathbf{z})$ is light-tailed.

## 3. Marginal Tail-Adaptive Flows

We study tailedness at the level of individual marginals. We call a multivariate random variable $\mathbf{x}$ *$j$-heavy-tailed* if its $j$th marginal $\mathbf{x}_j$ is heavy-tailed, and *$j$-light-tailed* otherwise; if $\mathbf{x}$ is $j$-heavy-tailed for all $j \in \{1, \dots, D\}$, we call it *fully heavy-tailed*. Two random variables have the *same tail behavior* if they are $j$-heavy-tailed for the same set of indices $j$.

**Proposition 3.2.** If $\mathbf{x} \in \mathbb{R}^D$ is $j$-heavy-tailed for some $j \in \{1, \dots, D\}$, then $\mathbf{x}$ is $\ell_2$-heavy-tailed.

**Proposition 3.3.** Let $\mathbf{z}$ be fully heavy-tailed and satisfy Assumption A.6,[^4] and let $T$ be an affine triangular transformation as above with $\sigma_j(z_{<j}) > 0$. Then, it holds that $T(\mathbf{z})$ is also fully heavy-tailed.

We provide a formal proof in Section A.2 of the Appendix. Assumption A.6 is a mild condition on the decay rate of the copula density of $\mathbf{z}$. In Section A.3 of the Appendix, we explain this condition in more detail and give various examples.

It is clear that permuting the marginals does not change the heavy-tailedness. Hence, by iterative application of Proposition 3.3, we deduce that affine triangular flows that employ permutation layers and a fully heavy-tailed base distribution are unable to model mixed-tailed distributions. Implicitly, Proposition 3.3 states that a Lipschitz normalizing flow as proposed by Jaini et al. (2020) is not able to model mixed-tailed distributions.

[^4]: This assumption can be found in Section A.2 in the Appendix.
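The following sketch illustrates Proposition 3.3 numerically with a hand-crafted affine triangular map (our toy construction, not a trained flow): starting from a fully heavy-tailed base, every marginal of the output remains heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_t(df=2.0, size=(500_000, 3))  # fully heavy-tailed base

# Affine triangular map x_j = mu_j(z_<j) + sigma_j(z_<j) * z_j with
# bounded conditioners and sigma_j > 0 (a toy instance of a MAF layer).
x = np.empty_like(z)
x[:, 0] = 1.0 + 2.0 * z[:, 0]
x[:, 1] = np.tanh(z[:, 0]) + (1.0 + np.abs(np.sin(z[:, 0]))) * z[:, 1]
x[:, 2] = np.sin(z[:, 0] + z[:, 1]) + np.exp(np.cos(z[:, 1])) * z[:, 2]

for j in range(3):
    # All three marginals keep polynomially decaying exceedance probabilities.
    print(j, [float(np.mean(np.abs(x[:, j]) > t)) for t in (10, 100, 1000)])
```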
The following theorem provides sufficient conditions under which a flow is able to model mixed-tailed distributions, which guides us towards a marginally tail-adaptive flow architecture.

**Theorem 3.4 (Learning the correct Tail Behavior).** Consider a random variable $\mathbf{z}$ that is $j$-light-tailed for $j \in \{1, \dots, d_l\}$ for some $d_l < D$ and $j$-heavy-tailed for $j \in \{d_l+1, \dots, D\}$. Then, under the same conditions as in Theorem 2.5 and Proposition 3.3, it holds that $\mathbf{z}$ and $T(\mathbf{z})$ have the same tail behavior.

*Proof.* Since the result combines Theorem 2.5 and Proposition 3.3 in an evident fashion, we only present a sketch of the proof. First, let us consider $j \le d_l$. Then it holds for the moment-generating function of $\mathbf{x}_j$ that

$$m_{\mathbf{x}_j}(\lambda) = \int_{\mathbb{R}^D} e^{\lambda T_j(z_{\le j})} q(z)\,dz = \int_{\mathbb{R}^j} e^{\lambda T_j(z_{\le j})}\, q_{\le j}(z_{\le j})\,dz_{\le j}\,,$$

which has been shown to be bounded for some $\lambda > 0$ (see the proof of Theorem 2.5 in Jaini et al. (2020)). Therefore, $\mathbf{x}$ is $j$-light-tailed for all $j \le d_l$. In the case $j > d_l$, we notice that the proof of the heavy-tailedness of $T_j(\mathbf{z}_{\le j})$ involves only the heavy-tailedness of $\mathbf{z}_j$ and not of any other component of $\mathbf{z}$, which completes the proof. ∎

Note that in general we cannot deduce the latter conclusion for light-tailed marginals, i.e., if $\mathbf{z}_j$ is light-tailed, this does not mean that $\mathbf{x}_j$ is also light-tailed. This is only the case if all components $\mathbf{z}_i$ with $i \le j$ are light-tailed, i.e., if $j \le d_l$.

### 3.2. Marginal Tail-Adaptive Flows (mTAF)

Guided by Theorem 3.4, we construct mTAF in three steps (see Figure 1).

**Step 1: Estimating the marginal tail indices.** Using estimators from extreme value theory, we classify each marginal of the data as heavy- or light-tailed and obtain tail index estimates $\hat\nu_j$ for the heavy-tailed marginals; the full procedure is described in Section C.1.

**Step 2: Defining the base distribution.** We assign a standard normal base marginal to each light-tailed component and a standardized t-distribution with $\hat\nu_j$ degrees of freedom to each heavy-tailed component, and we permute the base marginals such that the first $d_l$ marginals are light-tailed. We apply the same permutation to restructure our data according to the base components. To account for tail index estimation errors and for more flexible learning, one can make the tail indices (i.e., the degrees of freedom of each t-distribution) learnable. That is, we initialize the degrees of freedom of the $j$th marginal with $\hat\nu_j$ but adapt the parameter together with the network parameters throughout training.

**Step 3: A data-driven permutation scheme.** Recall that vanilla autoregressive flows employ a permutation step after each transformation to enhance the mixing of variables. However, purely random permutations might violate the ordering of the marginals, which is necessary to ensure Theorem 3.4. Therefore, we permute only within the set of heavy-tailed marginals and within the set of light-tailed marginals, which ensures the validity of Theorem 3.4. Within these groups one can choose any permutation scheme (a short sketch of such a permutation is given at the end of this subsection). We generalize this result to LU-layers in Section 3.3.

Without loss of generality, we assume that the first $d_l$ components of $\mathbf{z}$ are light-tailed and the remaining $D - d_l$ components are heavy-tailed.[^6] Then, the training objective is to optimize the flow parameters $\hat\theta$ and degrees of freedom $\hat\nu = [\hat\nu_{d_l+1}, \dots, \hat\nu_D]$ to maximize the log-likelihood

$$\mathcal{L}(\hat\theta, \hat\nu; X) = \sum_{i=1}^{N} \Bigg[ \sum_{j=1}^{d_l} \log \pi\Big(T_{\hat\theta}^{-1}\big(x^{(i)}\big)_j\Big) + \sum_{j=d_l+1}^{D} \log t_{\hat\nu_j}\Big(T_{\hat\theta}^{-1}\big(x^{(i)}\big)_j\Big) + \log \Big|\det J_{T_{\hat\theta}^{-1}}\big(x^{(i)}\big)\Big| \Bigg]\,,$$

where $X := (x^{(1)}, \dots, x^{(N)})$ is the data, and $\pi$ and $t_{\hat\nu}$ are the PDFs of the standard normal distribution and the standardized t-distribution with $\hat\nu$ degrees of freedom, respectively.

When applying our theoretical results from the previous section to the proposed mTAF, we can show that it fulfills the desired tail-preserving property, as formalized by the following corollary:

**Corollary 3.5 (Marginally tail-adaptive).** Under the same assumptions as in Theorem 2.5 and in Proposition 3.3, mTAFs are marginally tail-adaptive, that is, $\mathbf{z}$ and $\mathbf{x} = T(\mathbf{z})$ have the same tail behavior.

[^6]: Otherwise we permute the marginals as described in Step 2.
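A minimal sketch of the 2-group permutation scheme from Step 3 (the function name is ours): indices are shuffled only within the light-tailed block and within the heavy-tailed block, so the ordering required by Theorem 3.4 is never violated.

```python
import numpy as np

def two_group_permutation(d_l: int, d: int, rng) -> np.ndarray:
    """Sample a permutation of {0, ..., d-1} that only mixes indices within
    the light-tailed block [0, d_l) and within the heavy-tailed block
    [d_l, d), as required by Step 3."""
    perm = np.arange(d)
    perm[:d_l] = rng.permutation(d_l)
    perm[d_l:] = d_l + rng.permutation(d - d_l)
    return perm

rng = np.random.default_rng(0)
print(two_group_permutation(d_l=5, d=8, rng=rng))  # light and heavy blocks stay separate
```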
### 3.3. Marginal Tail-Adaptive Neural Spline Flows

Recent findings on NFs have led to significant improvements of their performance, such as employing LU-layers instead of permutations (Kingma & Dhariwal, 2018; Oliva et al., 2018) and more expressive autoregressive layers. One of the current SOTA NFs, which combines both improvements, is the NSF (Durkan et al., 2019). In this section we apply the theory from the previous sections to NSFs with a modified version of the LU-layers, while retaining their computational benefits. First, let us provide sufficient conditions under which linear layers preserve the marginal tail behavior:[^7]

**Theorem 3.6.** Let $\mathbf{z}$ be a random variable that is $j$-light-tailed for $j \in \{1, \dots, d_l\}$ and $j$-heavy-tailed for $j \in \{d_l+1, \dots, D\}$. Further, consider an invertible lower block-triangular matrix

$$W = \begin{pmatrix} A & 0 \\ B & C \end{pmatrix} \tag{3}$$

with $A \in \mathbb{R}^{d_l \times d_l}$, $B \in \mathbb{R}^{(D-d_l) \times d_l}$, $C \in \mathbb{R}^{(D-d_l) \times (D-d_l)}$, and $0$ the zero matrix of size $d_l \times (D-d_l)$. Then, it follows that $W\mathbf{z}$ and $\mathbf{z}$ have equal tail behavior.

[^7]: The proof is presented in Section B.1.

As a special case of Lemma B.3, we can invert such a block-triangular matrix:

$$\begin{pmatrix} A & 0 \\ B & C \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ -C^{-1}BA^{-1} & C^{-1} \end{pmatrix}. \tag{4}$$

Further, the log-determinant of $W$ is given by

$$\log\det(W) = \log\det(A) + \log\det(C)\,. \tag{5}$$

Therefore, as long as log-determinant computation and inversion of $A$ and $C$ are efficient, we can efficiently employ a block-triangular matrix as a flow layer. Luckily, both are given if we parameterize $A$ and $C$ using the LU-decomposition, whereas $B$ can be an arbitrary matrix.

Moreover, NSFs make use of monotonic rational-quadratic splines to define the autoregressive layers. These splines are defined within some interval $[-b, b]$ and are linearly extended outside this interval. Hence, the autoregressive NSF layers are in fact affine linear in their tails, which in turn means that we can apply all the theory from the previous sections to NSFs as well (compare with Lemma A.2). In summary, NSFs with linear layers according to the block form (3) preserve the marginal tailedness of the base distribution. We want to highlight that even though each autoregressive NSF layer is linear outside $[-b, b]$, this does not mean that the whole flow is linear outside $[-b, b]$: the (modified) LU-layers in between can map a component into and out of $[-b, b]$, leading to a non-trivial mapping outside that interval.

### 3.4. Generalized Tail-Adaptive Flows

Even though mTAF is theoretically founded, we introduce a more flexible relaxation of it, which we call generalized TAF (gTAF): we drop the structural restrictions on the linearities and set the base distribution to a mean-field combination of t-distributions with different trainable degrees of freedom. Therefore, gTAF is a compromise between the theoretically stronger mTAF and TAF. Note that since $t_\nu \xrightarrow{\nu \to \infty} \mathcal{N}(0, 1)$, we are able to approximately model heavy- as well as light-tailed marginals. Further, since the LU-layers are trainable as well, gTAF is able to approximate mTAF arbitrarily well by learning a structure as in (3).

**Theorem 3.7.** Let $T: \mathbb{R}^D \to \mathbb{R}^D$ be almost surely continuous. Further, let $\mathbf{z} = (\mathbf{z}_1, \dots, \mathbf{z}_D)$ be the mTAF base distribution with $d_l$ Gaussian marginals, and let $\mathbf{z}_\nu$ be the gTAF base distribution with marginals

$$\mathbf{z}_{\nu, j} \sim \begin{cases} t_\nu(0, 1)\,, & \text{for } j \le d_l\,, \\ \mathbf{z}_j\,, & \text{for } j > d_l\,. \end{cases}$$

Then, it holds that $T(\mathbf{z}_\nu) \xrightarrow{\mathcal{D}} T(\mathbf{z})$ as $\nu \to \infty$, where $\xrightarrow{\mathcal{D}}$ denotes convergence in distribution.

Hence, when fixing the flow $T$, gTAF converges to the mTAF solution as the light-tailed degrees of freedom tend to $\infty$. We provide a proof in Section A.4.
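The limit underlying Theorem 3.7 can be checked empirically; the following sketch (our illustration) measures the Kolmogorov-Smirnov distance between $t_\nu$ samples and $\mathcal{N}(0,1)$ for growing $\nu$:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
for nu in (1, 5, 50, 500):
    samples = rng.standard_t(df=nu, size=50_000)
    stat, _ = kstest(samples, "norm")
    print(nu, round(float(stat), 4))  # KS distance to N(0, 1) shrinks with nu
```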
## 4. Experimental Analysis

To investigate the benefits of the derived methods, we perform an empirical analysis on synthetic data (Section 4.1) to understand the behavior and benefits of mTAF and gTAF in comparison to other flow models. In Section 4.2, we demonstrate how we can exploit the heavy-tailed behavior of the model to sample new extremes in a weather and climate example. We provide a PyTorch implementation and the code for all experiments, which can be accessed through our public git repository.[^8]

[^8]: https://github.com/MikeLasz/marginalTailAdaptiveFlow

*Table 1. Average test loss $\mathcal{L}$, area under the log-log plot (Area), and tVaR (lower is better for each metric) in the settings $\nu = 2$ and $d_h = 1$ (left block) and $d_h = 4$ (right block). The copula model serves as an oracle baseline.*

|         | $\mathcal{L}$ | Area$_l$ | Area$_h$ | tVaR$_l$ | tVaR$_h$ | $\mathcal{L}$ | Area$_l$ | Area$_h$ | tVaR$_l$ | tVaR$_h$ |
|---------|-------|------|------|------|-------|------|------|------|------|-------|
| vanilla | 10.25 | 0.25 | 4.19 | 0.60 | 29.56 | 8.98 | 0.25 | 4.30 | 0.54 | 28.55 |
| TAF     | 10.15 | 0.37 | 3.55 | 0.78 | 4.21  | 8.69 | 0.42 | 4.05 | 0.89 | 4.36  |
| gTAF    | 10.12 | 0.55 | 3.24 | 1.16 | 2.67  | 8.57 | 0.50 | 3.38 | 0.98 | 5.55  |
| mTAF    | 10.11 | 0.26 | 2.22 | 0.59 | 2.94  | 8.55 | 0.25 | 2.60 | 0.57 | 6.74  |
| copula  | 9.75  | 0.20 | 1.23 | 0.45 | 2.22  | 9.75 | 0.19 | 1.43 | 0.46 | 3.49  |

### 4.1. Synthetic Experiments

In this series of experiments, we compare the performance of 4 different NSFs: a vanilla flow (i.e., $\mathbf{z} \sim \mathcal{N}(0, I)$), TAF (i.e., $\mathbf{z} \sim t_{\hat\nu}(0, I)$), gTAF, and mTAF with fixed degrees of freedom. The data is generated by an 8-dimensional Gaussian copula with $d_h \in \{1, 4\}$ heavy-tailed marginals with tail index $\nu = 2$. Details about the data generation can be found in Section C.2. As an oracle baseline, we fit a Gaussian copula to the data. To measure the overall fit of the model, we track the negative log-likelihood loss $\mathcal{L}$. Since it is well known that a good likelihood fit is not equivalent to high sampling quality (Theis et al., 2015), we take a closer look at the samples in the tails of the distribution by considering the following three metrics.[^9]

1. **Tail Value at Risk** (tVaR$_\alpha$), also known as expected shortfall, is the expectation of the quantile function given that we consider a quantile level larger than $\alpha$. By calculating the absolute difference between tVaR$_\alpha$ based on the data distribution and based on synthetic samples generated by the flow, we obtain the tVaR-difference (a sampling-based sketch of this metric is given at the end of this subsection).
2. **Area under the log-log plot** (Area) can be interpreted as a reweighted version of tVaR that puts more weight on the extremes.
3. **Synthetic tail estimates** are the marginal tail estimators based on synthetic data generated by the flow. Ideally, we expect the NF to generate samples according to the true tail index.

We measure tVaR-differences componentwise and average over all heavy-tailed and all light-tailed components to obtain tVaR$_h$ and tVaR$_l$, respectively. The same applies to Area.

[^9]: All metrics are formally defined in Section C.3.
[^10]: Standard deviations for one synthetic distribution are shown in Table 3.

We fit each model 25 times to 3 different synthetic distributions and summarize the numeric results in Table 1.[^10] It is no surprise that vanilla performs well for light-tailed components but struggles to capture the tails of the heavy-tailed distributions. In the setting $d_h = 1$, we observe the same behavior for TAF, which could be attributed to having only one joint tail parameter $\nu$ to model the seven light-tailed and one heavy-tailed marginal. mTAF does not always perform best, but it manages to find a good balance between the fit on the light-tailed as well as the heavy-tailed components. This is demonstrated more clearly when considering the tail estimation indices of the marginals (Figure 2). In this figure, we summarize the estimated marginal tail behavior of the learned models in confusion matrices. While most of the generated marginals are classified as light-tailed for vanilla and TAF, mTAF is able to recover the marginal tail behavior almost perfectly. We make similar observations in the other settings and when using MAFs instead of NSFs (Section C.3).
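For concreteness, here is a minimal Monte Carlo sketch of the tVaR metric from item 1 above (names and the quantile level are ours; the exact definitions used in the experiments are given in Section C.3):

```python
import numpy as np

def tail_value_at_risk(samples: np.ndarray, alpha: float = 0.95) -> float:
    """Monte Carlo estimate of tVaR_alpha (expected shortfall): the mean of
    all samples above the alpha-quantile."""
    threshold = np.quantile(samples, alpha)
    return float(samples[samples > threshold].mean())

def tvar_difference(data: np.ndarray, flow_samples: np.ndarray,
                    alpha: float = 0.95) -> float:
    """Absolute tVaR difference between data and flow samples for one
    marginal; averaging over light- and heavy-tailed components yields
    tVaR_l and tVaR_h."""
    return abs(tail_value_at_risk(data, alpha)
               - tail_value_at_risk(flow_samples, alpha))

rng = np.random.default_rng(0)
# Heavy-tailed data vs. light-tailed samples: a large tVaR difference.
print(tvar_difference(rng.standard_t(2.0, 100_000), rng.standard_normal(100_000)))
```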
We extend our analysis by providing a 50-dimensional example in Section C.3.

*Figure 2. Marginal tail estimation based on synthetic flow samples for $d_h = 4$, $\nu = 2$ of vanilla, TAF, gTAF, and mTAF (from left to right), shown as confusion matrices of true versus generated light (L) and heavy (H) tail classes. We classify marginals whose tail estimate is less than 10 as heavy-tailed; otherwise they are classified as light-tailed.*

### 4.2. Modeling Climate Data

*Figure 3. Synthetic flow samples using mTAF for the three quantities (dry-bulb air temperature in K, atmospheric pressure in Pa, and cloud optical depth) across the atmospheric levels, where we clipped the lower values of the cloud optical depth at 0. The corresponding negative log-likelihood is −2121.46. The profiles are ordered using band depth statistics (Pintado & Romo, 2009) into 0–25%, 25–50%, and 50–100% groups, and the shaded areas represent standard deviations.*

We demonstrate the benefits of the proposed methods on an example in which tail modeling is crucial: we apply mTAF to generate new data following the distribution of the EUMETSAT Numerical Weather Prediction Satellite Application Facility (NWP-SAF) dataset (Eresmaa & McNally, 2014). The dataset consists of 25,000 measurements of 3 meteorological quantities, measured at different atmospheric levels. Considering each measurement at each atmospheric level, we obtain a 412-dimensional dataset, which we visualize in Figure 9 in the Appendix. We train a vanilla, TAF, gTAF, and mTAF model using NSFs to fit the weather data. All architectural details are listed in Section C.4.

Qualitatively, all generated profile lines appear reasonable; see Figure 3 and Section C.4. From a quantitative perspective, we observe that mTAF achieves the smallest negative log-likelihood loss, as displayed in Table 2. We extend the quantitative analysis by investigating 1-dimensional random projections of the data and flow samples.[^11] To do so, we follow the same procedure as Meyer et al. (2021), i.e., we generate random weights $w_1, \dots, w_{100} \sim U([0,1]^{412})$ and generate 100 random 1-dimensional datasets $X^{(j)}_{\text{flow}} := \{\langle w_j, T(z_i)\rangle : z_i \sim \mathbf{z}\}$ and $X^{(j)}_{\text{data}} := \{\langle w_j, x_i\rangle : x_i \in X\}$. For each $j$, we can compare statistics such as the means, standard deviations, and quantiles of $X^{(j)}_{\text{flow}}$ and $X^{(j)}_{\text{data}}$, which we visualize in Figure 4. While all methods are able to fit the mean very well, only mTAF generates data that obeys the same standard deviation and extreme quantiles. More details and further comparisons are provided in Section C.4.

[^11]: Note that in contrast to the synthetic experiments, this data has many more dimensions. Quantities such as the area under the log-log plot depend on the rare tail samples, and hence their estimation requires many samples. This requirement is further emphasized in the high-dimensional setting, which is why we resort to these 1-dimensional projections.

## 5. Limitations and Extensions

Our theoretical findings from Section 3 provide a solid foundation for learning heavy-tailed generative models. Nonetheless, there exist limitations that are worth being addressed in future work.

**Asymmetric tail behavior** is a widespread property of real distributions in which only one of the tails (lower or upper) is heavy whereas the other is light. For instance, the cloud optical depth in the climate example (Section 4.2) has only heavy upper tails, since cloud optical depth cannot drop below 0. One way to potentially solve this issue would be to allow for even more flexible base distributions using composite models (see for instance Abu Bakar et al. (2015) and the references therein).
Recently, COMET Flows (McDonald et al., 2022) employed such composite models to learn the lower tail, the upper tail, and the body of each marginal separately, improving on the performance of a variant of copula flows (Wiese et al., 2019).

*Table 2. Average test loss on the NWP-SAF dataset. We average over 25 trials per model and show the standard deviations in brackets.*

|               | vanilla          | TAF              | gTAF             | mTAF              |
|---------------|------------------|------------------|------------------|-------------------|
| $\mathcal{L}$ | −2101.91 (±9.44) | −2110.56 (±7.87) | −2113.48 (±7.93) | −2121.38 (±10.91) |

*Figure 4. Statistics from 1-dimensional random projections of the data and flow samples for vanilla, TAF, gTAF, and mTAF: (a) mean, (b) standard deviation, (c) 1%-quantile. Each point represents a statistic pair $(S(X^{(j)}_{\text{data}}), S(X^{(j)}_{\text{flow}}))$, calculated using the random weight $w_j \sim U([0,1]^{412})$. In the ideal case, all points lie on the dotted diagonal line.*

**Tail dependencies** are a popular property investigated in areas such as financial risk analysis, and they quantify the dependency of two marginals given that a tail event occurred. More precisely, we define the tail dependency between the marginals $\mathbf{x}_i, \mathbf{x}_j$ as

$$\lambda_{i,j} = \lim_{u \to 1} P\big(\mathbf{x}_j > F_j^{-1}(u) \,\big|\, \mathbf{x}_i > F_i^{-1}(u)\big)\,,$$

where $F_j^{-1}$ and $F_i^{-1}$ are the quantile functions of $\mathbf{x}_j$ and $\mathbf{x}_i$, respectively. There is a large body of theoretical work revolving around copulae to model tail dependencies (Joe, 2014). Hence, a possible extension might be to replace the mean-field assumption in the base distribution by a copula distribution, i.e., by dependent marginals with PDF

$$q(z) = c\big(F_1(z_1), \dots, F_D(z_D)\big) \prod_{j=1}^{D} q_j(z_j)\,,$$

where $F_j$ are the CDFs and $q_j$ are the marginal PDFs of $\mathbf{z}_j$ for each $j$.
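An empirical finite-sample counterpart of $\lambda_{i,j}$ can be computed as follows (a sketch under our naming; the limit $u \to 1$ can only be approximated at a fixed quantile level):

```python
import numpy as np

def upper_tail_dependence(xi: np.ndarray, xj: np.ndarray, u: float = 0.99) -> float:
    """Empirical counterpart of lambda_{i,j} at a finite quantile level u:
    P(x_j > F_j^{-1}(u) | x_i > F_i^{-1}(u))."""
    ti, tj = np.quantile(xi, u), np.quantile(xj, u)
    exceed_i = xi > ti
    return float(np.mean(xj[exceed_i] > tj))

rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 2))
x = z @ np.array([[1.0, 0.8], [0.0, 0.6]])  # correlated Gaussian pair
# Gaussian copulae have zero tail dependence: this tends to 0 as u -> 1.
print(upper_tail_dependence(x[:, 0], x[:, 1]))
```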
## 6. Conclusion

In this work, we deepen the mathematical understanding of the tail behavior of autoregressive normalizing flows. We note that the distribution we want to model may have heavy- as well as light-tailed marginals, and we prove that standard normalizing flows are not able to learn such distributions. Our developed theory shows how the marginal tail behavior of the target distribution of the flow relates to the tail behavior of its base distribution. Based on these theoretical findings, we propose a new algorithm, which we refer to as mTAF. In particular, we initialize a base distribution based on statistical tail estimates of the target and employ structured linearities that guarantee the correct tail behavior of the target distribution. Importantly, we extend our theory to Neural Spline Flows and provide a modification that renders them marginally tail-adaptive. Lastly, as a trade-off between the theoretically founded mTAF and the less restricted linearities in standard normalizing flows, we present a more flexible relaxation of mTAF called gTAF, which converges to mTAF as a special case. An in-depth empirical analysis with heavy- and light-tailed marginals shows that mTAF and gTAF are superior in terms of estimation performance, especially when it comes to learning a heavy tail of a distribution. In contrast to standard normalizing flows, only gTAF and mTAF are able to reliably generate samples that follow the desired tail behavior. In summary, we believe that the theoretical soundness and the ability to faithfully generate extreme samples are major strengths of gTAF and mTAF, both crucial properties in various applications such as finance, climate, and related areas.

**Acknowledgements** This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy, EXC 2092 CASA, 390781972. We also thank the anonymous reviewers for their careful reading and their useful suggestions.

## References

Abu Bakar, S., Hamzah, N., Maghsoudi, M., and Nadarajah, S. Modeling loss data using composite models. Insurance: Mathematics and Economics, 61:146–154, 2015.

Alexanderson, S. and Henter, G. E. Robust model training and generalisation with Studentising flows. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2020.

Behrmann, J., Grathwohl, W., Chen, R. T., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. In International Conference on Machine Learning, pp. 573–582. PMLR, 2019.

Behrmann, J., Vicol, P., Wang, K.-C., Grosse, R., and Jacobsen, J.-H. Understanding and mitigating exploding inverses in invertible neural networks. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pp. 1792–1800, Virtual Event, 2021.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, Montreal, Canada, 2018.

Chen, R. T. Q., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, volume 32, pp. 9913–9923, Vancouver, BC, Canada, 2019.

Csörgő, S., Deheuvels, P., and Mason, D. Kernel estimates of the tail index of a distribution. The Annals of Statistics, 13(3):1050–1077, 1985.

Danielsson, J., de Haan, L., Peng, L., and de Vries, C. G. Using a bootstrap method to choose the sample fraction in tail index estimation. Journal of Multivariate Analysis, 76(2):226–248, 2001.

Dekkers, A. L. M., Einmahl, J. H. J., and de Haan, L. A moment estimator for the index of an extreme-value distribution. The Annals of Statistics, 17(4):1833–1855, 1989.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. In 3rd International Conference on Learning Representations, ICLR Workshop Track Proceedings, San Diego, CA, USA, 2015.

Draisma, G., de Haan, L., Peng, L., and Pereira, T. T. A bootstrap-based method to achieve optimality in estimating the extreme-value index. Extremes, 2(4):367–404, 1999.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. Advances in Neural Information Processing Systems, 32:7511–7522, 2019.

Embrechts, P., Klüppelberg, C., and Mikosch, T. Modelling extremal events: for insurance and finance, volume 33. Springer Science & Business Media, 2013.

Eresmaa, R. and McNally, A. Diverse profile datasets from the ECMWF 137-level short-range forecasts. October 2014. doi: 10.13140/2.1.4476.8963.

Gabaix, X., Gopikrishnan, P., Plerou, V., and Stanley, H. E. A theory of power-law distributions in financial market fluctuations. Nature, 423(6937):267–270, 2003.

Gallier, J. Schur complements and applications. May 2011. doi: 10.1007/978-1-4419-9961-0_16.
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, New Orleans, LA, USA, 2019.

Groeneboom, P., Lopuhaä, H. P., and de Wolf, P.-P. Kernel-type estimators for the extreme value index. The Annals of Statistics, 31(6):1956–1995, 2003.

Hill, B. M. A simple general approach to inference about the tail of a distribution. The Annals of Statistics, pp. 1163–1174, 1975.

Hofert, M., Kojadinovic, I., Maechler, M., and Yan, J. Elements of Copula Modeling with R. Springer Use R! Series, 2018.

Maxim (https://math.stackexchange.com/users/491644/maxim). Asymptotics of inverse of normal CDF. Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2966269 (version: 2018-10-22).

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In International Conference on Machine Learning, pp. 2078–2087. PMLR, 2018.

Jacobsen, J.-H., Smeulders, A., and Oyallon, E. i-RevNet: Deep invertible networks. In International Conference on Learning Representations, Vancouver, Canada, 2018.

Jaini, P., Kobyzev, I., Yu, Y., and Brubaker, M. Tails of Lipschitz triangular flows. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 4673–4681, Virtual Event, 2020.

Joe, H. Dependence modeling with copulas. CRC Press, 2014.

Katz, R. W. Do weather or climate variables and their impacts have heavy-tailed distributions? In Climate Variations and Forecasting (Joint with the 16th Conference Probability and Statistics and the 13th Symposium on Global Change and Climate Variations), 2002.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, volume 31, 2018.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 29:4743–4751, 2016.

Kirkby, J. L., Nguyen, D. H., and Nguyen, D. Moments of Student's t-distribution: A unified approach. Computation Theory eJournal, 2019.

Kobyzev, I., Prince, S., and Brubaker, M. Normalizing flows: An introduction and review of current methods. In IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE, 2020.

Koonin, E., Wolf, Y., and Karev, G. Power Laws, Scale-Free Networks and Genome Biology. January 2006.

Laszkiewicz, M., Lederer, J., and Fischer, A. Copula-based normalizing flows. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.

Liu, J., Lipschutz, S., and Spiegel, M. Schaum's Outline of Mathematical Handbook of Formulas and Tables, 4th Edition. McGraw-Hill, 2012.

McDonald, A., Tan, P.-N., and Luo, L. COMET flows: Towards generative modeling of multivariate extremes and tail dependence. arXiv preprint arXiv:2205.01224, 2022.

McNeil, A. J., Frey, R., and Embrechts, P. Quantitative risk management: concepts, techniques and tools, revised edition. Princeton University Press, 2015.

Meyer, D., Nagler, T., and Hogan, R. J. Copula-based synthetic data generation for machine learning emulators in weather and climate: application to a simple radiation model. Geoscientific Model Development Discussions, pp. 1–21, 2021.

Nair, J., Wierman, A., and Zwart, B. The fundamentals of heavy-tails: Properties, emergence, and identification.
In Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pp. 387–388, New York, NY, USA, 2013.

Oliva, J., Dubey, A., Zaheer, M., Poczos, B., Salakhutdinov, R., Xing, E., and Schneider, J. Transformation autoregressive networks. In International Conference on Machine Learning, pp. 3898–3907. PMLR, 2018.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, volume 30, pp. 2338–2347, Long Beach, CA, USA, 2017.

Patki, N., Wedge, R., and Veeramachaneni, K. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399–410, October 2016. doi: 10.1109/DSAA.2016.49.

Pintado, S. and Romo, J. On the concept of depth for functional data. Journal of the American Statistical Association, 104, June 2009. doi: 10.1198/jasa.2009.0108.

Qi, Y. Bootstrap and empirical likelihood methods in extremes. Extremes, 11:81–97, March 2008.

Resnick, S. On the foundations of multivariate heavy-tail analysis. Journal of Applied Probability, 41:191–212, 2004.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. PMLR, 2015.

Rippel, O. and Adams, R. P. High-dimensional probability estimation with deep density models. CoRR, abs/1302.5125, 2013.

Tabak, E. G. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge University Press, 2000.

Voitalov, I., van der Hoorn, P., van der Hofstad, R., and Krioukov, D. Scale-free networks well done. Physical Review Research, 1:033034, October 2019.

Wiese, M., Knobloch, R., and Korn, R. Copula & marginal flows: Disentangling the marginal from its joint. arXiv preprint arXiv:1907.03361, 2019.

## A. Theory and Proofs

In this section, we derive and prove our theoretical results. We start by presenting some preliminary results in Section A.1, which help us provide the proof of our main result in Section A.2. In Section A.3, we illuminate the technical Assumption A.6 and provide examples and intuitions.

### A.1. Preliminary Theoretical Results

**Proposition A.1.** Let $\mathbf{x} \in \mathbb{R}$ be a random variable. Then, $\mathbf{x}$ has tail index less than or equal to $\alpha$ if and only if for any $\beta > \alpha$ and any $C > 0$ it holds that

$$\int_{|x|>C} |x|^\beta p(x)\,dx = \infty\,.$$

*Proof.* Let us first assume that $\mathbf{x}$ has tail index at most $\alpha$. Then, according to Definition 2.2, we know that for any $\beta > \alpha$ and $C > 0$

$$\infty = \mathbb{E}_{\mathbf{x}}\big[|\mathbf{x}|^\beta\big] = \int_{\mathbb{R}} |x|^\beta p(x)\,dx = \int_{|x|\le C} |x|^\beta p(x)\,dx + \int_{|x|>C} |x|^\beta p(x)\,dx \le C^\beta \int_{|x|\le C} p(x)\,dx + \int_{|x|>C} |x|^\beta p(x)\,dx\,.$$

Since we assume $p$ to be continuous, we can bound $p$ on the compact interval $[-C, C]$, and hence the first integral above must be bounded. Therefore, it is

$$\int_{|x|>C} |x|^\beta p(x)\,dx = \infty\,.$$

To prove the backward direction, let us consider $\beta > \alpha$ and $C > 0$. Very similarly to the forward proof, we can now see that

$$\mathbb{E}_{\mathbf{x}}\big[|\mathbf{x}|^\beta\big] = \int_{|x|\le C} |x|^\beta p(x)\,dx + \int_{|x|>C} |x|^\beta p(x)\,dx = \infty\,,$$

where the last equality follows from the unboundedness of the second integral. Since this holds for any $\beta > \alpha$, $\mathbf{x}$ must have tail index $\alpha$ or less. ∎

This simple result demonstrates that the tail index, as indicated by the name, depends only on the tail of the distribution, i.e., on the PDF's behavior for large values $|x| > C$.
This fact motivates why maximum likelihood estimators of the tail index, which depend on the whole distribution, are biased. For more elaborate details, we refer to Section 9 in Nair et al. (2013).

In a similar fashion to the previous result, the next technical lemma states that unboundedness of the moment-generating function is due to the unboundedness of the integrand for tail events, i.e., for $z > z^*$. This little lemma turns out to be useful in the proof of Proposition 3.3.

**Lemma A.2.** Let $\mathbf{z} \in \mathbb{R}$ be heavy-tailed. Then it holds for any $z^* \in \mathbb{R}$ and $\lambda > 0$ that

$$\int_{z > z^*} e^{\lambda z} p_{\mathbf{z}}(z)\,dz = \infty\,.$$

*Proof.* Since $\mathbf{z} \in \mathbb{R}$ is heavy-tailed, we know that for all $\lambda > 0$

$$\infty = m_{\mathbf{z}}(\lambda) = \int_{\mathbb{R}} e^{\lambda z} p_{\mathbf{z}}(z)\,dz = \int_{z \le z^*} e^{\lambda z} p_{\mathbf{z}}(z)\,dz + \int_{z > z^*} e^{\lambda z} p_{\mathbf{z}}(z)\,dz \le F_{\mathbf{z}}(z^*)\, e^{\lambda z^*} + \int_{z > z^*} e^{\lambda z} p_{\mathbf{z}}(z)\,dz\,,$$

where $F_{\mathbf{z}}$ is the CDF of $\mathbf{z}$. The last inequality follows from the fact that $\exp(\lambda z)$ is monotonically increasing in $z$. Since the first summand is bounded, the second summand must be unbounded. This completes the proof. ∎

Recall that in Proposition 3.2 we state that $j$-heavy-tailedness induces $\ell_2$-heavy-tailedness. In the following, we provide a formal proof of this result.

*Proof of Proposition 3.2.* For this proof, we employ the equivalent definition of heavy-tailedness of $\mathbf{x}_j$ via the decay rate of its distribution function (see e.g. Lemma 1.1 in Nair et al. (2013)), i.e.,

$$\limsup_{x_j \to \infty} \frac{1 - F_j(x_j)}{e^{-\lambda x_j}} = \infty \quad \text{for all } \lambda > 0\,, \tag{6}$$

where $F_j$ is the CDF of $\mathbf{x}_j$. Since $\mathbf{x}_j \le \lVert \mathbf{x} \rVert$ for all $x \in \mathbb{R}^D$, we can conclude that $F_j(a) \ge F_{\lVert\mathbf{x}\rVert}(a)$ for $a \in \mathbb{R}$. Therefore,

$$\frac{1 - F_{\lVert\mathbf{x}\rVert}(a)}{e^{-\lambda a}} \ge \frac{1 - F_j(a)}{e^{-\lambda a}} \longrightarrow \infty \quad \text{for } a \to \infty\,.$$

According to the equivalent definition in (6), $\lVert\mathbf{x}\rVert$ is heavy-tailed, which proves that $\mathbf{x}$ is $\ell_2$-heavy-tailed. ∎

The following is a well-known implication of the change-of-variables formula and the integration rule by substitution, which we are going to apply in the subsequent proofs.

**Lemma A.3 (Substitution in the Moment-Generating Function).** Let $T$ be a diffeomorphism such that $T(\mathbf{z}) = \mathbf{x}$ for some random variables $\mathbf{x}, \mathbf{z} \in \mathbb{R}^D$. Then, we can rewrite

$$\int_{\mathbb{R}^D} e^{\lambda x} p(x)\,dx = \int_{\mathbb{R}^D} e^{\lambda T(z)} q(z)\,dz\,.$$

For completeness, we give a brief proof of this result.

*Proof.* Using the change-of-variables formula, see (1), we can write

$$\int_{\mathbb{R}^D} e^{\lambda x} p(x)\,dx = \int_{\mathbb{R}^D} e^{\lambda x}\, q\big(T^{-1}(x)\big)\, \big|\det J_{T^{-1}}(x)\big|\,dx\,.$$

Now, we can rewrite $\exp(\lambda x) = \exp\big(\lambda T(T^{-1}(x))\big)$ and substitute $z = T^{-1}(x)$. Integration by substitution completes the proof. ∎

Next, we present how we can use copulae to reformulate a multivariate PDF.

**Definition (Copula).** A copula is a multivariate distribution with cumulative distribution function (CDF) $C: [0,1]^D \to [0,1]$ that has standard uniform marginals, i.e., the marginals $C_j$ of $C$ satisfy $C_j \sim U[0,1]$.

**Theorem A.4 (Sklar's Theorem).** Taken from Hofert et al. (2018).

1. For any $D$-dimensional CDF $F$ with marginal CDFs $F_1, \dots, F_D$, there exists a copula $C$ such that

$$F(z) = C\big(F_1(z_1), \dots, F_D(z_D)\big) \tag{7}$$

for all $z \in \mathbb{R}^D$. The copula is uniquely defined on $U := \prod_{j=1}^D \operatorname{Im}(F_j)$, where $\operatorname{Im}(F_j)$ is the image of $F_j$. For all $u \in U$ it is given by

$$C(u) = F\big(F_1^{-}(u_1), \dots, F_D^{-}(u_D)\big)\,,$$

where $F_j^{-}$ are the right-inverses of $F_j$.

2. Conversely, given any $D$-dimensional copula $C$ and marginal CDFs $F_1, \dots, F_D$, a function $F$ as defined in (7) is a $D$-dimensional CDF with marginals $F_1, \dots, F_D$.

Therefore, if $F$ is absolutely continuous, we can differentiate (7) to obtain the PDF of $\mathbf{z}$:

$$q(z) = c\big(F_1(z_1), \dots, F_D(z_D)\big) \prod_{j=1}^D q_j(z_j)\,,$$

where $c$ denotes the PDF of the copula $C$.
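Part 2 of Sklar's theorem is constructive and directly yields a sampler; the following sketch (our illustration) couples a light-tailed Gaussian and a heavy-tailed $t_2$ marginal through a Gaussian copula:

```python
import numpy as np
from scipy.stats import norm, t, multivariate_normal

rho = 0.7
R = np.array([[1.0, rho], [rho, 1.0]])

# Step 1: correlated Gaussians -> uniforms via Phi (the copula samples).
g = multivariate_normal(mean=np.zeros(2), cov=R).rvs(size=10_000, random_state=0)
u = norm.cdf(g)

# Step 2: push the uniforms through the marginal quantile functions.
x1 = norm.ppf(u[:, 0])        # light-tailed marginal
x2 = t(df=2.0).ppf(u[:, 1])   # heavy-tailed t_2 marginal

print(np.corrcoef(x1, x2)[0, 1], np.abs(x2).max())  # dependent; extremes in x2
```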
Lastly, we present the following asymptotic behavior of the inverse CDF of a standard Gaussian distribution, which we use in Section A.3 to explain Assumption A.6.

**Lemma A.5 (Asymptotic Behavior[^12] of $\Phi^{-1}(1-y)$).** Denote by $\Phi$ the CDF of a standard Gaussian distribution. Then, it holds for the inverse of $\Phi$ that

$$\Phi^{-1}(1-y) \sim \sqrt{-2\log(y)} \quad \text{for } y \to 0\,.$$

*Proof.* First, we note that

$$1 - \Phi(x) = \frac{1}{2}\operatorname{erfc}\!\Big(\frac{x}{\sqrt{2}}\Big) \sim \frac{e^{-x^2/2}}{x\sqrt{2\pi}} \quad \text{for } x \to \infty\,,$$

which is a well-known asymptotic (Liu et al., 2012). Here, erfc denotes the complementary error function. Rearranging terms gives

$$\log\big(1 - \Phi(x)\big) \sim -\frac{x^2}{2} - \log x - \frac{1}{2}\log(2\pi) \sim -\frac{x^2}{2}\,.$$

Finally, we can invert the above asymptotic equation to obtain

$$\Phi^{-1}(y) \sim \sqrt{-2\log(1-y)} \quad \text{for } y \to 1\,,$$

or equivalently $\Phi^{-1}(1-y) \sim \sqrt{-2\log(y)}$ for $y \to 0$. ∎

[^12]: See the Mathematics Stack Exchange discussion by Maxim (https://math.stackexchange.com/q/2966269) for a derivation of this asymptotic.

### A.2. Proof of the Main Result

The proof of Proposition 3.3 relies on lower-bounding the moment-generating function of each marginal $\mathbf{x}_j$. In order to derive such a bound on a multivariate integral, we rewrite the joint distributions $q_{\le j}$ using their copula densities:

$$q_{\le j}(z_{\le j}) = c_j\big(F_1(z_1), \dots, F_j(z_j)\big) \prod_{i=1}^{j} q_i(z_i)$$

for any $j \in \{1, \dots, D\}$ and corresponding copula density $c_j$. Our proof relies on the following technical condition on the decay rate of the copula densities.

**Assumption A.6 (Bounding the Marginal Decay Rate of the Copula Densities).** For all $j \in \{1, \dots, D\}$ and $\lambda > 0$ there exist a compact set $S \subseteq \mathbb{R}^{j-1}$ with positive (Lebesgue) mass, a constant $z_j^* > 0$, a scaling constant $s > 0$, and a function $f(z_{<j})$ with $f(z_{<j}) < \lambda\sigma_j(z_{<j})$ on $S$, such that

$$c_j\big(F_1(z_1), \dots, F_j(z_j)\big) \ge s\, e^{-f(z_{<j})\, z_j} \quad \text{for all } z_j > z_j^* \text{ and } z_{<j} \in S\,.$$

*Proof of Proposition 3.3.* We first consider $j = 1$. Using Lemma A.3, we can write the moment-generating function of $\mathbf{x}_1$ as

$$m_{\mathbf{x}_1}(\lambda) = \int_{\mathbb{R}^D} e^{\lambda T_1(z_1)} q(z)\,dz = \int_{\mathbb{R}} e^{\lambda(\mu_1 + \sigma_1 z_1)} q_1(z_1)\,dz_1 \ge e^{\lambda\mu_1} \int_{z_1 > z_1^*} e^{\lambda\sigma_1 z_1} q_1(z_1)\,dz_1\,.$$

Since $\lambda\sigma_1 > 0$, we can see that the last integral is unbounded due to the heavy-tailedness of $\mathbf{z}_1$ (Lemma A.2). Therefore, $m_{\mathbf{x}_1}(\lambda) = \infty$ for all $\lambda > 0$, which proves the heavy-tailedness of $\mathbf{x}_1$.

Next, we consider the case $j > 1$. Again, we examine the moment-generating function of $\mathbf{x}_j$. Define the $j$th canonical basis vector $v_j := (0, \dots, 0, 1, 0, \dots, 0)^\top$. Then,

$$m_{\mathbf{x}_j}(\lambda) = m_{v_j^\top \mathbf{x}}(\lambda) = \int_{\mathbb{R}^D} e^{\lambda v_j^\top x} p(x)\,dx \overset{\text{(LOTUS)}}{=} \int_{\mathbb{R}^D} e^{\lambda T_j(z_{\le j})} q(z)\,dz \ge \int_S \int_{z_j > z_j^*} e^{\lambda\left(\mu_j(z_{<j}) + \sigma_j(z_{<j}) z_j\right)}\, c_j\big(F_1(z_1), \dots, F_j(z_j)\big) \prod_{i=1}^j q_i(z_i)\,dz_j\,dz_{<j}\,.$$

Applying Assumption A.6 and bounding $e^{\lambda\mu_j(z_{<j})}$ from below on the compact set $S$, we obtain for some constant $s' > 0$

$$m_{\mathbf{x}_j}(\lambda) \ge s' \int_S \int_{z_j > z_j^*} e^{\left(\lambda\sigma_j(z_{<j}) - f(z_{<j})\right) z_j}\, q_j(z_j)\,dz_j \prod_{i<j} q_i(z_i)\,dz_{<j} \ge s' \int_S \Big(\int_{z_j > z_j^*} e^{\tilde\lambda z_j}\, q_j(z_j)\,dz_j\Big) \prod_{i<j} q_i(z_i)\,dz_{<j}$$

(define $\tilde\lambda := \inf_{z_{<j} \in S}\, \lambda\sigma_j(z_{<j}) - f(z_{<j}) > 0$). The inner integral is unbounded by Lemma A.2, and since $S$ has positive mass, it follows that $m_{\mathbf{x}_j}(\lambda) = \infty$ for all $\lambda > 0$. ∎

### A.3. On Assumption A.6

In the following, we give a simplified sufficient condition for Assumption A.6 and illustrate it with examples.

**Assumption A.7 (Simplified Sufficient Condition).** For all $j \in \{1, \dots, D\}$ and $\lambda > 0$ it holds for $S := [a, b]^{j-1}$ that there exist constants $z_j^*$ and $s > 0$ such that

$$c\big(F_1(z_1), \dots, F_j(z_j)\big) \ge s\, e^{-(\lambda\sigma^* - \varepsilon)\, z_j} \quad \text{for } z_j > z_j^* \text{ and } z_{<j} \in S\,, \tag{12}$$

where $\sigma^* := \min_{z_{<j} \in S} \sigma_j(z_{<j})$ and $\varepsilon > 0$ is small such that $\lambda\sigma^* - \varepsilon > 0$.

Let us summarize the simplifications that we make in Assumption A.7. First of all, we restricted $S$ to be a closed cube $[a, b]^{j-1}$, which is obviously a specific instance of a compact set with positive mass. Further, we assumed $\sigma_j$ to be continuous, and thus $\lambda\sigma_j(z_{<j})$ attains its minimum $\lambda\sigma^* > 0$ on $S$. After giving this simplified sufficient condition, we provide some intuition by presenting some examples where Assumption A.7 holds true.

**Example (Independent Variables).** Consider a random variable $\mathbf{z}$ with independent components, i.e., $q(z) = \prod_{j=1}^D q_j(z_j)$. Then, the associated copula is the independence copula (Figure 5), which is a uniform distribution on $[0,1]^D$. Therefore, $c(F_1(z_1), \dots, F_D(z_D)) = 1$ for all $z \in \mathbb{R}^D$, and Assumption A.7 follows immediately since $s\, e^{-(\lambda\sigma^* - \varepsilon) z_j} \to 0$ for $z_j \to \infty$.

**Example (Bounded Copula Density).** Consider a lower-bounded copula density, i.e., there exists a lower bound $a > 0$ such that $c(u_1, \dots, u_D) \ge a$ for all $u \in [0,1]^D$. Again, the validity of Assumption A.7 in this setting is clear. Furthermore, this assumption is obviously not limited to bounded copula densities but also holds for copula densities that converge to 0 but whose decay rate in $z_j$ is lower-bounded by (12). To visualize the intuition, consider the 2-dimensional copula density of a Gaussian copula in Figure 5. Imagine fixing $S$ such that $F_1(z_1) \in [0.5, 0.75]$, which is compact for continuous $F_1$. Then (12) bounds the decay rate within the tube $[0.5, 0.75]$ if we consider $z_2 \to \infty$, i.e., if $F_2(z_2) \to 1$.

*Figure 5. Two 2-dimensional copula densities: (a) the independence copula, $c(z_1, z_2) = 1$; (b) a Gaussian copula with correlation $\rho_{12} = 0.7$.*
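This intuition can also be checked numerically; the following sketch (our illustration, with a $t_2$ second marginal) evaluates the log Gaussian copula density of (13) along such a tube and shows that it decays only logarithmically in $z_2$, i.e., far slower than the exponential lower bound in (12):

```python
import numpy as np
from scipy.stats import norm, t

rho = 0.7
Rinv_minus_I = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]])) - np.eye(2)

def log_gaussian_copula_density(u):
    # log c_R(u) = -0.5 log det(R) - 0.5 q^T (R^{-1} - I) q with q = Phi^{-1}(u)
    q = norm.ppf(u)
    return -0.5 * np.log(1.0 - rho**2) - 0.5 * q @ Rinv_minus_I @ q

for z2 in (5.0, 50.0, 500.0):
    u = np.array([0.6, t(df=2.0).cdf(z2)])       # F1(z1) = 0.6 inside the tube
    print(z2, log_gaussian_copula_density(u))    # decays like -log(z2), not -z2
```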
Next, we show how we can formally prove the assumption for Gaussian copulae. Recall that the Gaussian copula with correlation matrix $R \in \mathbb{R}^{D \times D}$ has density function

$$c_R(u) = \frac{1}{\sqrt{\det R}} \exp\!\left(-\frac{1}{2}\, \big(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_D)\big)\, \big(R^{-1} - I\big)\, \big(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_D)\big)^{\!\top}\right), \tag{13}$$

where $I \in \mathbb{R}^{D\times D}$ is the identity matrix, $\Phi^{-1}$ is the inverse CDF of the univariate standard Gaussian distribution, and $u \in [0,1]^D$.

In the following, we consider Assumption A.7 for $j = D$. Note that $S$ is assumed to be a compact set; therefore, the $F_i(z_i)$ with $i < D$ are all upper- and lower-bounded for $z_{<D} \in S$. Hence, we can bound the quadratic form in (13) from below such that

$$c\big(F_1(z_1), \dots, F_D(z_D)\big) \ge \exp\!\Big(a\, \Phi^{-1}\big(F_D(z_D)\big)^2\Big) \tag{14}$$

for some constant $a < 0$. By Lemma A.5, for every $\varepsilon > 0$ there exists a $z^*$ large enough[^14] such that

$$\left|\frac{\Phi^{-1}\big(F_D(z_D)\big)}{\sqrt{-2\log\big(1 - F_D(z_D)\big)}} - 1\right| < \varepsilon \quad \text{for } z_D > z^*\,,$$

which can be rearranged to

$$\Phi^{-1}\big(F_D(z_D)\big) < \sqrt{-2\log\big(1 - F_D(z_D)\big)}\,(1 + \varepsilon)\,.$$

Plugging the above into (14), we obtain

$$c\big(F_1(z_1), \dots, F_D(z_D)\big) \ge \exp\!\Big(2|a|(1+\varepsilon)^2 \log\big(1 - F_D(z_D)\big)\Big) = \big(1 - F_D(z_D)\big)^{\tilde a}\,, \tag{15}$$

where we define $\tilde a := 2|a|(1+\varepsilon)^2$. Hence, we are left to lower-bound (15), which we can do for a range of heavy-tailed marginal distributions such as:

1. **Pareto distribution:** The Pareto distribution with shape parameter $\alpha$ has CDF $F(z) = 1 - z^{-\alpha}$ for $z \ge 1$. Hence,

$$\big(1 - F_D(z_D)\big)^{\tilde a} = z_D^{-\tilde a \alpha} = \exp\!\big(-\tilde a \alpha \log(z_D)\big) \ge \exp\!\big(-\tilde a \alpha z_D\big)\,.$$

2. **Scale-invariant distributions:** Following the same argument as above, each distribution with CDF

$$F(z) = 1 - b z^{-\alpha} \quad \text{for } z > z^* \tag{16}$$

for constants $b, \alpha, z^* > 0$ satisfies the bound

$$\big(1 - F_D(z_D)\big)^{\tilde a} \ge b^{\tilde a} \exp\!\big(-\tilde a \alpha z_D\big)\,.$$

Each distribution with CDF as in (16) is a scale-invariant distribution (see e.g. Theorem 2.1 in Nair et al. (2013)).

3. **Exponentially decaying distributions:** Every distribution that satisfies the bound $F(z) \le 1 - \exp(-z)^{\alpha}$ for $z > z^*$ and $\alpha > 0$. In this case, we can again show that Assumption A.6 is valid:

$$\big(1 - F_D(z_D)\big)^{\tilde a} \ge \exp\!\big(-\tilde a \alpha z_D\big)\,.$$

Lastly, we want to emphasize that Corollary 3.5 is derived by an iterative application of Theorem 3.4. Therefore, Assumption A.6 must hold for all flow *steps*, i.e., if $T = T^{(L)} \circ \dots \circ T^{(1)}$, we need to ensure the validity of Assumption A.6 for $z^{(0)} := z$, $z^{(1)} := T^{(1)}(z)$, $z^{(2)} := T^{(2)} \circ T^{(1)}(z)$, ..., $z^{(L-1)} := T^{(L-1)} \circ \dots \circ T^{(1)}(z)$. In the example on independent variables in Section A.3, we showed that this assumption holds true for $z^{(0)}$ since we define our base distribution under the mean-field assumption. Furthermore, we conjecture that if we apply a Lipschitz-continuous diffeomorphism to a random variable with bounded copula density, then the transformed random variable must also have a bounded copula density. Hence, Assumption A.6 would be valid for all flow steps (see the bounded-copula-density example in Section A.3) when starting with a mean-field base distribution $q(z) = \prod_{j=1}^D q_j(z_j)$. However, this conjecture needs to be studied in further research.

[^14]: For instance, if $z^*$ is larger than the median of $\mathbf{z}_D$.

### A.4. Proof of Theorem 3.7

*Proof.* We proceed in two steps: First, we show that $\mathbf{z}_\nu \xrightarrow{\mathcal{D}} \mathbf{z}$ as $\nu \to \infty$ by applying the Cramér-Wold theorem. Second, we use the continuous mapping theorem to finish the proof. All used theorems can be found, for instance, in Van der Vaart (2000).

**Step 1.** By the Cramér-Wold theorem, it is sufficient to show that for every $v \in \mathbb{R}^D$

$$\sum_{j=1}^D v_j \mathbf{z}_{\nu,j} \xrightarrow{\mathcal{D}} \sum_{j=1}^D v_j \mathbf{z}_j = v^\top \mathbf{z}\,.$$

We prove this convergence via the convergence of the characteristic function. Since the $\mathbf{z}_{\nu,j}$ are independent for all $j \in \{1, \dots, D\}$, it is

$$\varphi_{v^\top \mathbf{z}_\nu}(t) = \varphi_{\mathbf{z}_{\nu,1}}(v_1 t) \cdots \varphi_{\mathbf{z}_{\nu,d_l}}(v_{d_l} t)\, \varphi_{\mathbf{z}_{d_l+1}}(v_{d_l+1} t) \cdots \varphi_{\mathbf{z}_D}(v_D t) \longrightarrow \varphi_{\mathbf{z}_1}(v_1 t) \cdots \varphi_{\mathbf{z}_{d_l}}(v_{d_l} t)\, \varphi_{\mathbf{z}_{d_l+1}}(v_{d_l+1} t) \cdots \varphi_{\mathbf{z}_D}(v_D t) = \varphi_{v^\top \mathbf{z}}(t) \quad \text{for all } t \in \mathbb{R}\,,$$

where the convergence follows from $t_\nu \xrightarrow{\mathcal{D}} \mathcal{N}(0,1)$ as $\nu \to \infty$ and Lévy's continuity theorem. The last equality follows since the marginals are independent. Using Lévy's continuity theorem once more, we conclude that $v^\top \mathbf{z}_\nu \xrightarrow{\mathcal{D}} v^\top \mathbf{z}$. Hence, we can apply Cramér-Wold to obtain the convergence $\mathbf{z}_\nu \xrightarrow{\mathcal{D}} \mathbf{z}$.
**Step 2.** We apply the continuous mapping theorem to obtain the convergence result. ∎

## B. Data-Driven LU-Layers

One of the major reasons for recent improvements of NFs is the generalization of permutation layers to more general invertible linear layers (Oliva et al., 2018). One type of these general invertible linear layers is based on the LU-decomposition; such layers were essential building blocks of many SOTA NFs, as in Kingma & Dhariwal (2018); Durkan et al. (2019). Therefore, it is important to address whether we can generalize our theory from Section 3.2 of the main paper to the more expressive LU-layers. In this section we provide supplementary material to Section 3.3, showing that we are indeed able to generalize to LU-layers while retaining their computational benefits. First, in Section B.1 we derive sufficient conditions under which linear layers preserve the marginal tail behavior. In Section B.2 we present an efficient way to implement a tail-preserving LU-type invertible layer.

### B.1. Marginal Tail Behavior under Linear Transformations

In this section, we assume without loss of generality that the components of $\mathbf{z}$ are ordered such that $\mathbf{z}$ is $j$-light-tailed for $j \in \{1, \dots, d_l\}$ and $j$-heavy-tailed for $j \in \{d_l+1, \dots, D\}$. Our goal is to find conditions on a matrix $W$ under which $\mathbf{z}$ and $W\mathbf{z}$ have equal tail behavior.

**Theorem B.1.** Let $\mathbf{z}$ be a random variable that is $j$-light-tailed for $j \in \{1, \dots, d_l\}$ and $j$-heavy-tailed for $j \in \{d_l+1, \dots, D\}$. Further, consider an invertible lower block-triangular matrix

$$W = \begin{pmatrix} A & 0 \\ B & C \end{pmatrix} \tag{17}$$

with $A \in \mathbb{R}^{d_l \times d_l}$, $B \in \mathbb{R}^{(D-d_l) \times d_l}$, $C \in \mathbb{R}^{(D-d_l)\times(D-d_l)}$, and $0$ the zero matrix of size $d_l \times (D-d_l)$. Then, it follows that $W\mathbf{z}$ and $\mathbf{z}$ have equal tail behavior.

Before heading into the proof, we introduce the following useful lemma:

**Lemma B.2.** Let $\lambda^* > 0$ be a scalar such that $\mathbb{E}_{\mathbf{z}}[\exp(\lambda^* \mathbf{z})] < \infty$ for some univariate random variable $\mathbf{z}$. Then, it holds that $\mathbb{E}_{\mathbf{z}}[\exp(\lambda \mathbf{z})] < \infty$ for all $0 < \lambda \le \lambda^*$.

*Proof.* Consider $0 < \lambda \le \lambda^*$. Then, we can split the expectation

$$\mathbb{E}_{\mathbf{z}}\big[e^{\lambda \mathbf{z}}\big] = \int_{(-\infty, 0]} q(z)\, e^{\lambda z}\,dz + \int_{(0, \infty)} q(z)\, e^{\lambda z}\,dz \le \int_{(-\infty, 0]} q(z)\, e^{0}\,dz + \int_{(0, \infty)} q(z)\, e^{\lambda^* z}\,dz\,.$$

The first integral is upper-bounded by 1, and the second integral can be upper-bounded by the integral of the same integrand over $\mathbb{R}$, which is bounded per assumption. Hence, $\mathbb{E}_{\mathbf{z}}[\exp(\lambda \mathbf{z})] < \infty$ for all $0 < \lambda \le \lambda^*$. ∎

Now we are ready to prove Theorem B.1.

*Proof.* The idea of the proof is to show that a linear combination of $k$ (non-degenerate) random variables $\mathbf{z}_1, \dots, \mathbf{z}_k$ is light-tailed if and only if all $\mathbf{z}_1, \dots, \mathbf{z}_k$ are light-tailed. We do this via algebraic induction, i.e., we show that

1. $a\mathbf{z}_j$ is light-tailed iff. $\mathbf{z}_j$ is light-tailed, for a scalar $a \in \mathbb{R}$, $a \neq 0$;
2. $\mathbf{z}_1 + \mathbf{z}_2$ is light-tailed iff. $\mathbf{z}_1$ and $\mathbf{z}_2$ are both light-tailed.

For an arbitrary linear combination of random variables, we can iterate through 1. and 2. to prove that the linear combination is light-tailed if and only if each component is light-tailed.

1. Let us assume that $\mathbf{z}_j$ is light-tailed. Therefore there exists some $\lambda^* \in \mathbb{R}$ such that $\mathbb{E}_{\mathbf{z}_j}[\exp(\lambda^* \mathbf{z}_j)] < \infty$. Further, for the moment-generating function of $a\mathbf{z}_j$ it is

$$\mathbb{E}_{a\mathbf{z}_j}\big[e^{\lambda x}\big] = \int_{\mathbb{R}} p_{\mathbf{z}_j}(z_j)\, e^{\lambda a z_j}\,dz_j < \infty$$

for $\lambda := \lambda^*/a$, where the first equality follows from the LOTUS. The other direction of the equivalence follows analogously.

2. Consider $\mathbf{z}_1$ and $\mathbf{z}_2$ with joint PDF $q(z_1, z_2)$. Since both random variables are light-tailed, there exist $\lambda_1$ and $\lambda_2$ such that their moment-generating functions are bounded.
Then, for $\lambda := \tfrac{1}{2}\min\{\lambda_1, \lambda_2\}$ it is

$$\mathbb{E}_{\mathbf{z}_1+\mathbf{z}_2}\big[e^{\lambda \mathbf{z}}\big] \overset{\text{(LOTUS)}}{=} \int_{\mathbb{R}^2} e^{\lambda(z_1+z_2)}\, q(z_1, z_2)\,dz_1\,dz_2 = \iint_{z_1 \le 0} e^{\lambda(z_1+z_2)}\, q(z_1, z_2)\,dz_1\,dz_2 + \iint_{z_1 > 0} e^{\lambda(z_1+z_2)}\, q(z_1, z_2)\,dz_1\,dz_2 \le \int_{\mathbb{R}} e^{\lambda z_2}\, q(z_2)\,dz_2 + \iint_{z_1 > 0} e^{\lambda(z_1+z_2)}\, q(z_1, z_2)\,dz_1\,dz_2\,,$$

where the last line follows by replacing $z_1$ in the first integral by 0, which is an upper bound for $z_1$ there. Using the definition of $\lambda$ and Lemma B.2, we see that the first integral is bounded by some constant $c_1$. Using the monotonicity of the integrands once more, we split the remaining region at $z_2 \le z_1$ and $z_2 > z_1$ and get

$$\mathbb{E}_{\mathbf{z}_1+\mathbf{z}_2}\big[e^{\lambda \mathbf{z}}\big] \le c_1 + \iint_{z_1 > 0,\, z_2 \le z_1} e^{2\lambda z_1}\, q(z_1, z_2)\,dz_1\,dz_2 + \iint_{z_2 > z_1} e^{2\lambda z_2}\, q(z_1, z_2)\,dz_1\,dz_2 \le c_1 + \int_{\mathbb{R}} e^{2\lambda z_1}\, q(z_1)\,dz_1 + \int_{\mathbb{R}} e^{2\lambda z_2}\, q(z_2)\,dz_2\,,$$

which follows by upper-bounding the two double integrals by the corresponding marginal integrals over $\mathbb{R}$. Using Lemma B.2 and the definition of $\lambda$ once more, we can find constants $c_2$ and $c_3$ that bound the remaining integrals. Hence, we conclude that $\mathbb{E}_{\mathbf{z}_1+\mathbf{z}_2}[e^{\lambda \mathbf{z}}] < \infty$, which proves the backward direction of 2.

To prove the forward direction, we show that if $\mathbf{z}_1$ and $\mathbf{z}_2$ are not both light-tailed (i.e., at least one of them is heavy-tailed), then $\mathbf{z}_1 + \mathbf{z}_2$ is not light-tailed. Without loss of generality, we assume $\mathbf{z}_2$ to be heavy-tailed. Then, we can write

$$\mathbb{E}_{\mathbf{z}_1+\mathbf{z}_2}\big[e^{\lambda \mathbf{z}}\big] = \int_{\mathbb{R}^2} e^{\lambda(z_1+z_2)}\, q(z_1, z_2)\,dz_1\,dz_2 \ge \int_{\mathbb{R}} \int_{z_1 \ge 0} e^{\lambda(z_1+z_2)}\, q(z_1, z_2)\,dz_1\,dz_2\,. \tag{18}$$

Note that we can lower-bound the inner integral by

$$\int_{z_1 \ge 0} e^{\lambda(z_1+z_2)}\, q(z_1, z_2)\,dz_1 \ge e^{\lambda z_2} \int_{z_1 \ge 0} q(z_1, z_2)\,dz_1\,.$$

Plugging this bound into (18) gives us

$$\mathbb{E}_{\mathbf{z}_1+\mathbf{z}_2}\big[e^{\lambda \mathbf{z}}\big] \ge \int_{\mathbb{R}} e^{\lambda z_2}\, q(z_2)\,dz_2 = \infty \quad \text{for all } \lambda > 0\,,$$

which is due to the heavy-tailedness of $\mathbf{z}_2$. Therefore, $\mathbf{z}_1 + \mathbf{z}_2$ must also be heavy-tailed. This proves the equivalence in 2.

Finally, we note that due to the block-triangular form of $W$, the first $d_l$ components of $\mathbf{x} := W\mathbf{z}$, i.e., $\mathbf{x}_1, \dots, \mathbf{x}_{d_l}$, are linear combinations of the light-tailed components $\mathbf{z}_1, \dots, \mathbf{z}_{d_l}$, which implies the light-tailedness of $\mathbf{x}_1, \dots, \mathbf{x}_{d_l}$. The remaining $D - d_l$ components of $\mathbf{x}$ are linear combinations containing at least one heavy-tailed component $\mathbf{z}_j$ with $j \in \{d_l+1, \dots, D\}$ and are therefore again heavy-tailed.[^15] This completes the proof. ∎

[^15]: Note that this argument assumes that $C$ has no empty rows, which is implicitly assumed due to the invertibility of $W$; compare with (20).
[^16]: Invertibility can be guaranteed by restricting the diagonal elements of the upper triangular matrix in the LU-decomposition.

### B.2. Implementation of Data-Driven LU-Layers

In the previous section, we have seen that the block-matrix form given in (17) is sufficient to retain the tail behavior. In this section, we give more details on an efficient parameterization leading to fast inversion and log-determinant computations. It is well known that under mild conditions the inversion of block matrices is efficiently solvable.

**Lemma B.3 (Inversion of Block Matrices).** Consider invertible square matrices $A \in \mathbb{R}^{d_l \times d_l}$, $D \in \mathbb{R}^{d_h \times d_h}$ and arbitrary matrices $B \in \mathbb{R}^{d_l \times d_h}$, $C \in \mathbb{R}^{d_h \times d_l}$ for some $d_l, d_h \in \mathbb{N}$ such that the Schur complement $S := A - BD^{-1}C$ is invertible. Then it holds that

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} S^{-1} & -S^{-1}BD^{-1} \\ -D^{-1}CS^{-1} & D^{-1} + D^{-1}CS^{-1}BD^{-1} \end{pmatrix}. \tag{19}$$

As a special case, it is

$$\begin{pmatrix} A & 0 \\ B & C \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ -C^{-1}BA^{-1} & C^{-1} \end{pmatrix}$$

for invertible square matrices $A \in \mathbb{R}^{d_l \times d_l}$, $C \in \mathbb{R}^{d_h \times d_h}$ and an arbitrary matrix $B \in \mathbb{R}^{d_h \times d_l}$.

Lemma B.3 is a standard result that can be found in many linear algebra textbooks (see e.g. Gallier (2011)). Furthermore, we can compute the determinant of a block-triangular matrix $W$ as defined in (17) as

$$\det(W) = \det(A)\det(C)\,. \tag{20}$$

Now, let us summarize the expensive computations in (19) and (20) that both need to be efficient in NFs. The computations involve inversions of $A$ and $C$, only forward computations of $B$, and the computation of $\det(A)$ and $\det(C)$. Hence, we propose to parameterize $A$ and $C$ using an LU-decomposition for both matrices, which leads to efficient inverse[^16] and log-determinant computations.
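A compact PyTorch sketch of such a layer (class and parameter names are ours, not the authors' reference implementation): $A$ and $C$ are assembled from LU factors with a positive diagonal, $B$ is unconstrained, and the log-determinant follows (20).

```python
import torch
import torch.nn as nn

class BlockLULinear(nn.Module):
    """Sketch of the tail-preserving linear layer (17): W = [[A, 0], [B, C]],
    with A and C built from LU factors (unit lower-triangular L, upper
    triangular U with positive diagonal) and B unconstrained."""

    def __init__(self, d_l: int, d_h: int):
        super().__init__()
        self.d_l = d_l
        self.raw_a = nn.Parameter(torch.zeros(d_l, d_l))  # strict L/U parts of A
        self.raw_c = nn.Parameter(torch.zeros(d_h, d_h))  # strict L/U parts of C
        self.log_diag_a = nn.Parameter(torch.zeros(d_l))  # positive diag -> invertible
        self.log_diag_c = nn.Parameter(torch.zeros(d_h))
        self.b = nn.Parameter(torch.zeros(d_h, d_l))      # arbitrary mixing block B

    @staticmethod
    def _assemble(raw, log_diag):
        eye = torch.eye(raw.shape[0], device=raw.device)
        lower = torch.tril(raw, -1) + eye                              # unit lower L
        upper = torch.triu(raw, 1) + torch.diag(torch.exp(log_diag))   # upper U
        return lower @ upper                                           # A (or C) = L U

    def forward(self, z):
        a = self._assemble(self.raw_a, self.log_diag_a)
        c = self._assemble(self.raw_c, self.log_diag_c)
        x_light = z[:, :self.d_l] @ a.T                                # A z_l
        x_heavy = z[:, :self.d_l] @ self.b.T + z[:, self.d_l:] @ c.T   # B z_l + C z_h
        # log|det W| = log|det A| + log|det C|, cf. (20); only U diagonals count.
        logdet = self.log_diag_a.sum() + self.log_diag_c.sum()
        return torch.cat([x_light, x_heavy], dim=1), logdet
```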
We provide a PyTorch implementation of this modified tail-preserving LU-layer, which can be accessed through our public git repository.

C. Algorithms and Computational Details

C.1. Tail Estimation

Many heavy-tailed distributions can be characterized by their tail index; this includes the set of regularly varying distributions,¹⁷ such as the t-distribution, the Pareto distribution, and many more. However, as already shown in Section 2.1, the tail index does not depend on the body of the distribution, and hence, non-tail samples must typically be discarded for tail index estimation. Although a variety of estimators for the tail index exist, such as the Hill estimator (Hill, 1975), the moment estimator (Dekkers et al., 1989), and kernel-based estimators (Csorgo et al., 1985), none of them is considered superior in all settings. A major issue of all mentioned estimators is that they are based on a threshold defining the tail, i.e., the user needs to input statements of the form "the k largest samples are considered to be tail events". Even though there exist some strategies to find k, none of them works robustly in all settings. In fact, one can construct simple counterexamples for all estimators that lead to failures of tail estimation. This is due to undesired properties of the estimators, such as the lack of translation invariance of the Hill estimator (while the tail index clearly is translation invariant). We refer to Section 9 in Nair et al. (2013) for a detailed textbook treatment of tail index estimation. In summary, robust tail estimation is still considered an unsolved problem, which forces practitioners to consider multiple estimators to make a well-founded decision. Furthermore, we note that the Hill estimator can only be applied to regularly varying distributions, which excludes its application for classifying light-tailed distributions. In contrast, the moment and the kernel estimator can both be applied to identify heavy-tailed marginals and to assess a tail index.

¹⁷See Section 2 in (Nair et al., 2013) for further details.

Algorithm 1 Marginal Tail Estimation
  tail_est <- [ ]
  for j in {1, ..., D} do
    marginal <- Data_val[:, j]
    moments <- moments_est(|marginal|)   {0 if |marginal| is estimated to be light-tailed}
    kernel <- kernel_est(|marginal|)     {0 if |marginal| is estimated to be light-tailed}
    if moments == kernel == 0 then
      tail_est.append(0)                 {light-tailed if both the moments and the kernel estimator predict a light-tailed marginal}
    else
      hill <- hill_est(|marginal|)
      if hill > 10 then
        tail_est.append(0)               {light-tailed if the Hill estimator predicts a high tail index}
      else
        tail_est.append(hill)
      end if
    end if
  end for
  return tail_est

To implement the tail-assessment scheme (see Step 1 of the proposed method in Section 3.2), we found that Algorithm 1 works well in classifying the correct tail behavior and gives a decent initialization for the tail indices. That is, we use the moments double-bootstrap estimator (Draisma et al., 1999) and the kernel-type double-bootstrap estimator (Groeneboom et al., 2003) to assess the heavy-tailedness of the data distribution; a simplified sketch of this procedure is given below.
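For illustration, here is a deliberately simplified NumPy sketch of Algorithm 1 in which a plain Hill estimator with a fixed tail fraction stands in for all three double-bootstrap estimators (which, in our pipeline, come from the code of Voitalov et al. (2019)); all function names and the tail_frac choice are our own assumptions.

```python
import numpy as np

def hill_estimator(x: np.ndarray, k: int) -> float:
    """Plain Hill estimate of the tail index from the k largest |samples|.

    Stand-in for the double-bootstrap estimators used in practice; the
    choice of k is exactly the delicate step that those procedures automate.
    """
    order = np.sort(np.abs(x))
    tail, threshold = order[-k:], order[-k - 1]
    gamma = np.mean(np.log(tail) - np.log(threshold))  # Hill's gamma = 1/alpha
    return 1.0 / gamma                                 # tail index alpha

def estimate_marginal_tails(data: np.ndarray, clip: float = 10.0,
                            tail_frac: float = 0.05) -> list:
    """Simplified Algorithm 1: 0 marks a light-tailed marginal, otherwise
    the estimated tail index is returned (clipped at `clip`)."""
    tail_est = []
    k = max(int(tail_frac * data.shape[0]), 10)
    for j in range(data.shape[1]):
        alpha = hill_estimator(data[:, j], k)
        # Tail indices above the clipping threshold are treated as light tails.
        tail_est.append(0.0 if alpha > clip else alpha)
    return tail_est

# Example: first marginal Gaussian (light), second Student-t with 2 dof (heavy).
rng = np.random.default_rng(0)
samples = np.column_stack([rng.normal(size=100_000),
                           rng.standard_t(df=2, size=100_000)])
# The t-marginal yields a tail index near 2; the Gaussian marginal illustrates
# why Hill alone is unreliable for light tails, and why Algorithm 1 consults
# the moments and kernel estimators first.
print(estimate_marginal_tails(samples))
```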
If both estimators predict a light-tailed distribution, we set the corresponding marginal base distribution to be standard normally distributed; otherwise, we set it to the standardized t-distribution, i.e., $z_j \sim t_{\hat\nu_j}$, where $\hat\nu_j$ is the Hill double-bootstrap estimate (Danielsson et al., 2001; Qi, 2008). We reused the code by Voitalov et al. (2019), which implements all tail estimation procedures¹⁸ from our algorithm. Notice that we clip the tail index at 10, i.e., the algorithm classifies marginals with a tail index larger than 10 as light-tailed, which prevents a too restrictive set of allowed permutations (see Step 3 in Section 3.2). To illustrate this argument, consider the following simple example: assume that we estimate all except one marginal to be heavy-tailed. Then, the first component of the flow is never allowed to permute with the other components, since they are classified as heavy-tailed. Hence, the mixing of the first component would be severely restricted. Further, since large tail indices indicate a less heavy-tailed distribution, it is reasonable to clip the tail index at some threshold.

¹⁸Including the hyperparameter selection (Danielsson et al., 2001; Qi, 2008; Draisma et al., 1999; Groeneboom et al., 2003).

C.2. Synthetic Data Generation

The generation of the synthetic distribution consists of 3 steps: 1. generating the marginal distributions, 2. defining a copula distribution, 3. combining the marginals and the copula to obtain a multivariate joint distribution. A sketch of the full procedure follows at the end of this subsection.

Generating the marginal distributions. The first two marginals are defined to be Gaussians. The following two marginals are a 2-mixture of Gaussians and a mixture of three Gaussians. The last $d_h \in \{1, 4\}$ components are mixtures of two t-distributions, and the remaining marginals are again mixtures of two Gaussians. All mixtures have equal weight for each mixture component, and all means and standard deviations are randomized: means are constructed by uniformly sampling from $[-4, 4]$, whereas standard deviations are sampled from $[1, 2]$.

Defining a copula distribution. Recall that a Gaussian copula (13) is parameterized by a correlation matrix $R$. To generate $R$, we randomly sample 16 different pairs $(i, j) \in \{1, \dots, 8\}^2$ with $i \neq j$ and set the corresponding entry of the correlation matrix to $R_{i,j} := 0.25$. The diagonal of $R$ is set to 1.

Obtaining a joint distribution. Lastly, we combine the marginals with the Gaussian copula using Sklar's Theorem A.4. This gives us a multivariate distribution with specified and complex marginals and a dependency structure given by the copula; see Joe (2014) for more details on the induced dependencies. To construct the training, test, and validation sets, 15 000, 75 000, and 10 000 samples from this distribution are drawn, respectively.

In the setting D = 50, we apply a similar procedure, but with $d_h = 10$ heavy-tailed components, with the first 40 marginals being 2-mixtures of Gaussians and the remaining 10 marginals being 2-mixtures of t-distributions. We define the dependency structure by randomly selecting 200 pairs $(i, j) \in \{1, \dots, 50\}^2$ with $i \neq j$ and again set $R_{i,j} = 0.25$. Training, validation, and test sets consist of 50 000, 10 000, and 75 000 samples, respectively.
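The following is a minimal sketch of this generation procedure (D = 8, one heavy-tailed component) using NumPy and SciPy. The mixture layout and all constants follow the description above, while the helper names, the grid-based CDF inversion, and the degrees of freedom of the t-components are our own illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D, d_h, n_pairs, rho, n_samples = 8, 1, 16, 0.25, 100_000

# Step 1: randomized equal-weight mixture marginals, represented by their CDFs.
def make_marginal(components):
    """components: list of frozen scipy distributions, mixed with equal weights."""
    def cdf(x):
        return np.mean([c.cdf(x) for c in components], axis=0)
    return cdf

def random_components(n_comp, dist=stats.norm, **extra):
    return [dist(loc=rng.uniform(-4, 4), scale=rng.uniform(1, 2), **extra)
            for _ in range(n_comp)]

marginals = [make_marginal(random_components(1)),      # two Gaussian marginals
             make_marginal(random_components(1)),
             make_marginal(random_components(2)),      # 2-mixture of Gaussians
             make_marginal(random_components(3))]      # 3-mixture of Gaussians
while len(marginals) < D - d_h:                        # further 2-mixtures of Gaussians
    marginals.append(make_marginal(random_components(2)))
for _ in range(d_h):                                   # heavy-tailed 2-mixtures of t's
    marginals.append(make_marginal(random_components(2, dist=stats.t, df=2)))

# Step 2: Gaussian copula from a sparse random correlation matrix.
# (In practice one should verify that R is positive definite and resample otherwise.)
R, pairs = np.eye(D), set()
while len(pairs) < n_pairs:
    i, j = rng.integers(D, size=2)
    if i != j and (i, j) not in pairs:
        pairs.add((i, j))
        R[i, j] = R[j, i] = rho
u = stats.norm.cdf(rng.multivariate_normal(np.zeros(D), R, size=n_samples))

# Step 3: Sklar's theorem -- push the copula samples through the inverse marginal
# CDFs, inverted numerically on a grid. The grid truncates the most extreme tail
# values; a production implementation would invert the CDFs exactly.
grid = np.linspace(-1000, 1000, 2_000_001)
data = np.column_stack([np.interp(u[:, j], m(grid), grid)
                        for j, m in enumerate(marginals)])
```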
C.3. Synthetic Experiments

In all synthetic experiments, we used an NSF with 5 layers and corresponding LU-linearities; mTAF employs the modified LU-linearities from Section 3.3. In the NSF layers, we used conditioner ResNets with 2 hidden layers of 30 or 200 hidden neurons in the cases D = 8 and D = 50, respectively, and ReLU activations. Further, we used NSF layers with 3 bins and set the tail bound to 2. We optimized the network using Adam with 5 000 or 20 000 training steps in the cases D = 8 and D = 50, respectively, with a learning rate of 1e-5 and a weight decay of 1e-6. To fit the Gaussian copula baseline, we use the default settings of the copulas (Patki et al., 2016) library.

To assess the sample quality in the tail of the distribution, we consider 3 metrics (a sketch of the first two follows after Table 3):

1. Tail Value at Risk for some level $\alpha$, which is defined by
$$\mathrm{tVaR}_\alpha := \mathrm{tVaR}_\alpha(F) := \frac{1}{1-\alpha}\int_\alpha^1 F^{-1}(u)\,du$$
for some CDF $F$. $\mathrm{tVaR}_\alpha$ is a well-known metric and is widely used in finance (McNeil et al., 2015). We plug in the marginal empirical CDFs $\hat F_{\mathrm{data}}$ and $\hat F_{\mathrm{flow}}$, i.e., the empirical CDFs based on the data and on synthetic samples, respectively, and calculate the absolute difference between the resulting quantities. We call this metric the marginal tVaR-difference for the level $\alpha$ and set $\alpha = 0.95$ in all our experiments.

2. Area under the log-log plot, defined by
$$\sum_i \left| \log \bar F_{\mathrm{data}}^{-1}(i) - \log \bar F_{\mathrm{flow}}^{-1}(i) \right|\,,$$
where $\bar F_{\mathrm{data}}^{-1}$ and $\bar F_{\mathrm{flow}}^{-1}$ denote the inverse empirical complementary CDFs given by the test data and the flow samples, respectively.

3. Synthetic Tail Estimates, where the tail assessment is similar to the one described in Algorithm 1. We can then assess the similarity between the tail estimates based on the data and the tail estimates based on synthetically generated flow samples.

Setting D=8. In our experimental study, we generated 3 synthetic distributions per setting as explained in Section C.2 and fit each model 25 times to each synthetic distribution. While it is reasonable to compare the averaged metrics over all 75 runs, investigating the standard deviation over all runs might be misleading, since the metrics could be centered around different values for each synthetic distribution. For this reason, it is more insightful to compare standard deviations over runs where the target distribution is fixed, which we present in Table 3.

Table 3. Standard deviations corresponding to the experiments shown in Table 1, i.e., in the setting ν = 2 and dh ∈ {1, 4}, for one target distribution.

            dh = 1                                    dh = 4
          L      Area_l  Area_h  tVaR_l  tVaR_h     L      Area_l  Area_h  tVaR_l  tVaR_h
 vanilla  0.02   0.02    0.12    0.16    3.74       0.03   0.03    0.14    0.13    1.75
 TAF      0.00   0.03    0.20    0.12    2.38       0.01   0.06    0.12    0.22    1.15
 gTAF     0.01   0.06    0.26    0.17    1.30       0.01   0.06    0.14    0.20    0.64
 mTAF     0.01   0.03    0.21    0.10    1.41       0.01   0.03    0.20    0.18    1.32
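As referenced above, here is a minimal NumPy sketch of the two quantile-based metrics, computed from empirical samples. The discretization choices (quantile grid, empirical inverse CDFs via np.quantile) are our own simplifications, not the exact evaluation code.

```python
import numpy as np

def tvar_difference(x_data, x_flow, alpha=0.95, n_grid=1000):
    """Absolute difference of empirical tail values at risk,
    tVaR_alpha(F) = 1/(1-alpha) * int_alpha^1 F^{-1}(u) du,
    approximated by averaging empirical quantiles above the level alpha."""
    u = np.linspace(alpha, 1.0, n_grid, endpoint=False)
    tvar_data = np.quantile(x_data, u).mean()
    tvar_flow = np.quantile(x_flow, u).mean()
    return abs(tvar_data - tvar_flow)

def loglog_area(x_data, x_flow, n_grid=1000, eps=1e-6):
    """Sum of absolute differences between the log inverse empirical
    complementary CDFs of data and flow samples on a shared grid of
    survival levels (cf. metric 2)."""
    s = np.linspace(eps, 1.0 - eps, n_grid)      # survival probabilities
    q_data = np.quantile(x_data, 1.0 - s)        # inverse complementary CDF
    q_flow = np.quantile(x_flow, 1.0 - s)
    valid = (q_data > 0) & (q_flow > 0)          # log requires positive quantiles
    return np.abs(np.log(q_data[valid]) - np.log(q_flow[valid])).sum()

# Example on one heavy-tailed marginal:
rng = np.random.default_rng(0)
x_data, x_flow = rng.standard_t(2, 75_000), rng.standard_t(2, 75_000)
print(tvar_difference(x_data, x_flow), loglog_area(x_data, x_flow))
```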
Table 4. Average test loss, Area under the log-log plot, and tVaR (lower is better for each metric) in the setting ν = 3 and dh ∈ {1, 4}. The copula model serves as an oracle baseline.

            dh = 1                                    dh = 4
          L      Area_l  Area_h  tVaR_l  tVaR_h     L      Area_l  Area_h  tVaR_l  tVaR_h
 vanilla  10.82  0.25    2.77    0.55    12.06      10.55  0.23    2.64    0.58    9.98
 TAF      10.79  0.36    2.78    0.79    1.49       10.46  0.38    2.95    0.93    2.36
 gTAF     10.77  0.55    1.11    1.27    2.82       10.38  0.50    1.13    1.00    2.56
 mTAF     10.76  0.33    2.03    1.05    7.36       10.38  0.34    1.54    1.09    5.02
 copula   9.76   0.20    0.79    0.45    1.82       9.76   0.19    0.91    0.46    1.62

In Table 4 we present the numeric results of the synthetic experiments for a larger tail index, i.e., a less extreme setting. In this setting, we observe that gTAF tends to perform slightly worse for light-tailed components, while achieving good results for heavy-tailed components. Note that in this case, the performance of mTAF degrades, which might be attributed to the less flexible structure of its linearities. When replacing the NSF-layers by MAF-layers (see Table 5), we observe that the MAF fails to converge for a vanilla base distribution. Again, mTAF strikes a balance between fitting heavy- and light-tailed marginals, but the overall performance is better in the case of NSF-layers, especially for the heavy-tailed marginals. We conjecture that this is due to the linearity of the tails of each NSF-layer,¹⁹ which leads to a better generalization for those low-sample regions. Therefore, we continue the following experiments using the NSF architecture.

¹⁹This does not mean that the whole flow is linear in its tails, since tail samples can be linearly mapped in and out of the bins, leading to a non-linear mapping.

Table 5. Average test loss, Area under the log-log plot, and tVaR (lower is better for each metric) in the setting ν = 2 and dh ∈ {1, 4} when using a MAF.

            dh = 1                                    dh = 4
          L      Area_l  Area_h  tVaR_l  tVaR_h     L      Area_l  Area_h  tVaR_l  tVaR_h
 vanilla  >1e6   0.82    4.88    2.76    8.67       >1e6   0.95    5.97    3.78    9.93
 TAF      10.63  1.09    4.68    3.88    8.00       9.97   1.12    5.90    4.28    9.42
 gTAF     10.60  1.26    3.78    4.71    2.91       9.84   1.32    4.89    5.29    4.22
 mTAF     10.55  0.77    4.00    2.51    3.08       9.81   0.48    5.01    1.32    4.49

Furthermore, we investigate the tail indices of the generated samples by constructing confusion matrices similar to those in Figure 2, which we present in Figure 6. We observe a similar behavior; that is, while vanilla and TAF mainly produce light-tailed marginals, gTAF is able to produce much better samples with heavy-tailed marginals. Again, mTAF produces marginals whose marginal tail behavior almost perfectly fits the true tail behavior. For further visual inspection of the generated marginals, we consider QQ-plots of the heavy-tailed components in Figure 7 and Figure 8. For vanilla and TAF, and sometimes for gTAF, we observe humps in the tails, which indicate a bad sampling performance in their tails. mTAF does not exhibit these humps, which is in accordance with our findings derived from Figures 2 and 6.

Setting D=50. Table 6 compares the quantitative metrics for our synthetic experiments in the case D = 50. In this case, we generate 3 synthetic distributions as explained in Section C.2 and fit each model 5 times. We observe that mTAF clearly outperforms vanilla and TAF in terms of Area, while the flexibility of gTAF allows it to perform almost on par with the oracle copula baseline. Considering the metrics that account for the tail fit of the heavy-tailed components, we surprisingly observe that gTAF even outperforms the oracle copula model. However, considering the light-tailed components, gTAF performs a bit worse, which is not surprising since gTAF models each marginal distribution by a t-distribution.

Table 6. Average test loss, Area under the log-log plot, and tVaR (lower is better for each metric) in the setting D = 50, ν ∈ {2, 3}, and dh = 10. The copula model serves as an oracle baseline.

            ν = 2                                     ν = 3
          L      Area_l  Area_h  tVaR_l  tVaR_h     L      Area_l  Area_h  tVaR_l  tVaR_h
 vanilla  58.59  0.29    3.38    0.45    24.61      62.59  0.30    1.71    0.43    6.38
 TAF      58.12  0.67    3.02    1.20    3.19       62.47  0.57    2.02    0.96    2.15
 gTAF     58.05  0.94    0.58    1.23    1.16       62.42  0.61    0.51    0.96    0.99
 mTAF     58.17  0.39    2.30    0.84    4.20       62.60  0.30    1.32    0.46    2.86
 copula   57.23  0.21    1.04    0.42    2.50       56.73  0.20    0.66    0.41    1.33

C.4. Climate Data

In this section, we provide more details on the architectures employed by mTAF on the NWP-SAF dataset, which we visualize in Figure 9, and we present a more in-depth discussion of the results. We consider each quantity (i.e., dry-bulb air temperature, atmospheric pressure, and cloud optical depth) at each atmospheric level as one component, leading to a 412-dimensional dataset. Recall that mTAF requires a classification into light- and heavy-tailed marginals that leads to a reordering of the initial marginals, which we do as follows. This dataset, similar to other time-series-like datasets, gives us a natural autoregressive ordering, which we make use of in our permutation step. This leads to the initial permutation (compare with Step 2 in Figure 1 and Section 3.2), in which the first components are given by the light-tailed components of the dry-bulb air temperature, followed by the reversed atmospheric pressure and the cloud optical depth.
However, in contrast to our synthetic experiments, where the selection of the heavy-tailed components was more or less trivial, this task is more complicated in a time series with highly dependent features. What we found works best in practice is to deliberately choose a large set of heavy-tailed components according to Table 7, while making the degrees of freedom learnable. Furthermore, we implemented all NFs (vanilla, TAF, gTAF, and mTAF) using 5 autoregressive NSF layers with LU-linearities and their modified versions from Section 3.3. The conditioner networks in the NSF-layers have 2 hidden layers with 100 hidden neurons each, we set the tail bounds to 2.5, and each spline uses 3 bins. We apply Batch-Norm after each NSF-layer. We optimize for 20 000 steps using the Adam optimizer with a learning rate of 1e-4 for the network weights and a learning rate of 0.01 for the tail indices, and we schedule both rates using cosine annealing; a sketch of this optimizer setup follows after Figure 7. We plot synthetic samples from the remaining flows in Figure 10.

[Figure 6: confusion-matrix panels of estimated marginal tail behavior (L/H) for the settings ν = 2, dh = 1; ν = 3, dh = 4; and ν = 3, dh = 1.]

Figure 6. Marginal tail estimation based on synthetic flow samples of vanilla, TAF, gTAF, and mTAF (from left to right) for varying tail index ν and number of heavy-tailed components dh. We classify marginals whose tail estimate is less than 10 as heavy-tailed; otherwise, they are classified as light-tailed.

[Figure 7: four QQ-plot panels of true samples vs. flow samples.]

Figure 7. QQ-plots for the 8th (heavy-tailed) marginal in the setting ν = 2 and dh = 1. The QQ-plots correspond to samples generated by vanilla, TAF, gTAF, and mTAF, respectively.
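As referenced in Section C.4 above, here is a minimal PyTorch sketch of the two-group optimizer setup. The ToyFlow module, its parameter names, and the batch source are placeholders standing in for the actual NSF model; only the learning rates, step count, and cosine annealing follow the configuration described in the text.

```python
import torch
import torch.nn as nn

# Stand-in for the actual NSF model: any nn.Module whose t-distribution
# degrees of freedom are registered under a name containing "tail_index".
class ToyFlow(nn.Module):
    def __init__(self, dim=8, d_heavy=4):
        super().__init__()
        self.net = nn.Linear(dim, dim)
        self.tail_index = nn.Parameter(torch.full((d_heavy,), 5.0))
    def log_prob(self, x):
        # Placeholder density; a real flow returns its exact log-likelihood.
        return -0.5 * (self.net(x) ** 2).sum(dim=1)

flow = ToyFlow()
tail_params = [p for n, p in flow.named_parameters() if "tail_index" in n]
net_params = [p for n, p in flow.named_parameters() if "tail_index" not in n]

# Two parameter groups: 1e-4 for the network weights, 0.01 for the tail
# indices; cosine annealing schedules both group learning rates jointly.
optimizer = torch.optim.Adam([
    {"params": net_params, "lr": 1e-4},
    {"params": tail_params, "lr": 1e-2},
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20_000)

for step in range(20_000):
    x = torch.randn(256, 8)            # placeholder training batch
    loss = -flow.log_prob(x).mean()    # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```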
[Figure 8: a 4 x 4 grid of QQ-plot panels of true samples vs. flow samples.]

Figure 8. QQ-plots for the last 4 heavy-tailed marginals in the setting ν = 2 and dh = 4. The rows of QQ-plots correspond to samples generated by vanilla, TAF, gTAF, and mTAF, respectively. Each marginal is depicted in one column.

Table 7. Components (i.e., measurements at a specific atmospheric level) which we manually select as heavy-tailed based on Figure 9.

 MEASUREMENT                     LIGHT-TAILED   HEAVY-TAILED
 Dry-bulb air temperature in K   1 - 79         80 - 137
 Atmospheric pressure in hPa     1 - 99         100 - 137
 Cloud optical depth             1 - 57         58 - 137

[Figure 9: real profiles of dry-bulb air temperature in K, atmospheric pressure in hPa, and cloud optical depth over the atmospheric levels, showing the median and the 0-25%, 25-50%, and 50-100% bands.]

Figure 9. Real profiles from the NWP-SAF dataset. We used the implementation by Meyer et al. (2021) to generate the figure. The profiles are ordered using band depth statistics (Pintado & Romo, 2009).

[Figure 10: synthetic profiles of dry-bulb air temperature, atmospheric pressure, and cloud optical depth over the atmospheric levels for three flows.]

Figure 10. Synthetic flow samples using vanilla, TAF, and gTAF (from left to right), where we clipped the lower values of the cloud optical depth at 0. The corresponding negative log-likelihoods are 2094.35, 2117.48, and 2121.65, respectively. We used the implementation by Meyer et al. (2021) to generate the figure. The profiles are ordered using band depth statistics (Pintado & Romo, 2009), and the shaded areas represent standard deviations.