# Sum-of-Squares Polynomial Flow

Priyank Jaini 1 2 3, Kira A. Selby 1 2, Yaoliang Yu 1 2

1 University of Waterloo, Waterloo, Canada; 2 Waterloo AI Institute, Waterloo, Canada; 3 Vector Institute, Toronto, Canada. Correspondence to: Yaoliang Yu.

Abstract. The triangular map is a recent construct in probability theory that allows one to transform any source probability density function into any target density function. Based on triangular maps, we propose a general framework for high-dimensional density estimation, by specifying one-dimensional transformations (equivalently, conditional densities) and appropriate conditioner networks. This framework (a) reveals the commonalities and differences of existing autoregressive and flow-based methods, (b) allows a unified understanding of the limitations and representation power of these recent approaches, and (c) motivates us to uncover a new Sum-of-Squares (SOS) flow that is interpretable, universal, and easy to train. We perform several synthetic experiments on various density geometries to demonstrate the benefits (and shortcomings) of such transformations. SOS flows achieve competitive results in simulations and on several real-world datasets.

## 1. Introduction

Neural density estimation methods are gaining popularity for the task of multivariate density estimation in machine learning (Kingma et al., 2016; Dinh et al., 2015; 2017; Papamakarios et al., 2017; Uria et al., 2016; Huang et al., 2018). These generative models provide a tractable way to evaluate the exact density, unlike generative adversarial nets (Goodfellow et al., 2014) or variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014). Popular methods for neural density estimation are autoregressive models (Neal, 1992; Bengio & Bengio, 1999; Larochelle & Murray, 2011; Uria et al., 2016) and normalizing flows (Rezende & Mohamed, 2015; Tabak & Vanden-Eijnden, 2010; Tabak & Turner, 2013). These models aim to learn an invertible, bijective, and increasing transformation T that pushes forward a (simple) source probability density (or measure, in general) to a target density, such that computing the inverse T^{-1} and the Jacobian |T'| is easy.

In probability theory, it has been rigorously proven that increasing triangular maps (Bogachev et al., 2005) are universal, i.e., any source density can be transformed into any target density using an increasing triangular map. Indeed, the Knothe-Rosenblatt transformation (Ch. 1, Villani, 2008) gives a (heuristic) construction of such a map, which is unique up to null sets (Bogachev et al., 2005). Furthermore, by definition, the inverse and the Jacobian of a triangular map can be computed very efficiently through univariate operations. However, for multivariate densities, computing the exact Knothe-Rosenblatt transform itself is not possible in practice. Thus, a natural question is: given a pair of densities, how can we efficiently estimate this unique increasing triangular map?

This work is devoted to studying these increasing, bijective triangular maps, in particular how to estimate them in practice. In Section 2, we precisely formulate the density estimation problem and propose a general maximum likelihood framework for estimating densities using triangular maps. We also explore the properties of the triangular map required to push a source density to a target density.
Subsequently, in Section 3, we trace back the origins of the triangular map and connect it to many recent works on generative modelling. We relate our study of increasing, bijective, triangular maps to works on iterative Gaussianization (Chen & Gopinath, 2001; Laparra et al., 2011) and normalizing flows (Tabak & Vanden-Eijnden, 2010; Tabak & Turner, 2013; Rezende & Mohamed, 2015). We show that a triangular map can be decomposed into compositions of one-dimensional transformations, or equivalently univariate conditional densities, allowing us to demonstrate that all autoregressive models and normalizing flows are subsumed in our general density estimation framework. As a by-product, this framework also reveals that autoregressive models and normalizing flows are in fact equivalent. Using this unified framework, we study the commonalities and differences of the various aforementioned models. Most importantly, this framework allows us to study universality in a much cleaner and more streamlined way. We present a unified understanding of the limitations and representation power of these approaches, summarized concisely in Table 1 below.

In Section 4, by understanding the pivotal properties of triangular maps and using our proposed framework, we uncover a new neural density estimation procedure called Sum-of-Squares polynomial flows (SOS flows). We show that SOS flows are akin to higher-order approximations of T, with the order depending on the degree of the polynomials used. Subsequently, we show that SOS flows are universal, i.e., given enough model complexity, they can approximate any target density. We further show that (a) SOS flows are a strict generalization of the inverse autoregressive flow (IAF) of Kingma et al. (2016); (b) they are interpretable: their coefficients directly control the higher-order moments of the target density; and (c) SOS flows are easy to train: unlike NAFs (Huang et al., 2018), which require non-negative weights, there are no constraints on the parameters of SOS flows.

In Section 5, we report our empirical analysis. We performed holistic synthetic experiments to gain an intuitive understanding of triangular maps and SOS flows in particular. Additionally, we compare SOS flows to previous neural density estimation methods on real-world datasets, where they achieve competitive performance.

We summarize our main contributions as follows:
- We study and propose a rigorous framework for using triangular maps for density estimation.
- Using this framework, we study the similarities and differences of existing flow-based and autoregressive models.
- We provide a unified understanding of the limitations and representational power of these methods.
- We propose SOS flows, which are universal, interpretable, and easy to train.
- We perform several synthetic and real-world experiments to demonstrate the efficacy of SOS flows.

## 2. Density estimation through triangular map

In this section we set up our main problem, introduce key definitions and notation, and formulate the general approach to estimating density functions using triangular maps. Let p, q be two probability density¹ functions (w.r.t. the Lebesgue measure) over the source domain Z ⊆ R^d and the target domain X ⊆ R^d, respectively. Our main goal is to find a deterministic transformation T : Z → X such that for all (measurable) sets B ⊆ X,

∫_B q(x) dx = ∫_{T^{-1}(B)} p(z) dz.

¹ All of our results can be extended to two probability measures satisfying mild regularity conditions. For simplicity and concreteness we restrict to probability densities here.
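To make this defining property concrete, here is a minimal one-dimensional check; it is entirely our own illustration rather than anything from the paper. We take p uniform on [0, 1], q standard normal, and T the normal quantile function, and verify that the mass q assigns to an interval B equals the mass p assigns to T^{-1}(B), both in closed form and by Monte Carlo.

```python
# Minimal numerical check of the push-forward property (our own illustration):
# with p uniform on [0, 1], q standard normal, and T the normal quantile function,
# the mass q assigns to a set B equals the mass p assigns to T^{-1}(B).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

T = norm.ppf          # T : (0, 1) -> R, the quantile function of q
T_inv = norm.cdf      # T^{-1} : R -> (0, 1)

# Target-side mass of B = [a, b] under q = N(0, 1).
a, b = -1.0, 0.5
mass_q = norm.cdf(b) - norm.cdf(a)

# Source-side mass of T^{-1}(B) under p = Uniform[0, 1]: since T is increasing,
# T^{-1}([a, b]) = [T^{-1}(a), T^{-1}(b)], whose uniform mass is just its length.
mass_p = T_inv(b) - T_inv(a)

# Monte Carlo version: push uniform samples through T and count hits in B.
z = rng.uniform(size=1_000_000)
x = T(z)
mass_mc = np.mean((x >= a) & (x <= b))

print(mass_q, mass_p, mass_mc)   # all three agree up to Monte Carlo error
```

The same check works for any increasing T; only the closed-form source-side computation uses the fact that p is uniform.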
In particular, when T is bijective and differentiable (e.g. Rudin, 1987), we have the change-of-variable formula for x = T(z):

q(x) = p(z)/|T'(z)| = p(T^{-1}x)/|T'(T^{-1}x)| =: T_#p,    (1)

where |T'(z)| is the absolute value of the Jacobian (the determinant of the derivative) of T. In other words, by pushing the source random variable z ~ p through the map T we obtain a new random variable x ~ q. This push-forward idea has played an important role in optimal transport theory (Villani, 2008) and in recent Monte Carlo simulations (Marzouk et al., 2016; Parno & Marzouk, 2018; Peherstorfer & Marzouk, 2018). Here, our interest is to learn the target density q through the map T.

Let F be a class of mappings, and use the KL divergence² to measure closeness between densities. We can formulate the density estimation problem as:

min_{T ∈ F} KL(q ‖ T_#p) ≡ −∫ q(x) log [ p(T^{-1}x) / |T'(T^{-1}x)| ] dx,    (2)

where ≡ denotes equality up to an additive constant (the entropy of q) that does not depend on T. When we only have access to an i.i.d. sample {x_1, ..., x_n} ~ q, we can replace the integral above with an empirical average, which amounts to maximum likelihood estimation:

max_{T ∈ F} (1/n) Σ_{i=1}^{n} [ log p(T^{-1}x_i) − log |T'(T^{-1}x_i)| ].    (3)

Conveniently, we can choose any source density p to facilitate estimation. Typical choices include the standard normal density on Z = R^d (with zero mean and identity covariance) and the uniform density over the cube Z = [0, 1]^d.

² Other statistical divergences can be used as well.

Computationally, being able to solve (3) efficiently relies on choosing a map T whose
- inverse T^{-1} is cheap to compute;
- Jacobian |T'| is cheap to compute.

Fortunately, this is always possible. Following Bogachev et al. (2005), we call a (vector-valued) mapping T : R^d → R^d triangular if for all j, its j-th component T_j depends only on the first j variables x_1, ..., x_j. The name triangular is derived from the fact that the derivative of T is a triangular matrix function³. We call T (strictly) increasing if for all j ∈ [d], T_j is (strictly) increasing w.r.t. the j-th variable x_j when the other variables are fixed.

³ The converse is clearly also true if our domain is connected.

Theorem 1 (Bogachev et al., 2005). For any two densities p and q over Z = X = R^d, there exists a unique (up to null sets of p) increasing triangular map T : Z → X such that q = T_#p. The same⁴ holds over Z = X = [0, 1]^d.

⁴ More generally, on any open or closed subset of R^d if we interpret the monotonicity of T appropriately (Alexandrova, 2006).

Conveniently, to compute the Jacobian of an increasing triangular map we need only multiply d univariate partial derivatives:

|T'(x)| = ∏_{j=1}^{d} ∂T_j/∂x_j.

Similarly, inverting an increasing triangular map requires inverting d univariate functions sequentially, each of which can be done efficiently through, say, bisection. Bogachev et al. (2005) further proved that the change-of-variable formula (1) holds for any increasing triangular map T (without any additional assumption, but using the right-side derivative). Thus, triangular mappings form a very appealing function class with which to learn a target density as formulated in (2) and (3). Indeed, Moselhy & Marzouk (2012) already promoted a similar idea for Bayesian posterior inference, and Spantini et al. (2018) related the sparsity of a triangular map to (conditional) independencies of the target density. Moreover, many recent generative models in machine learning are precisely special cases of this approach.
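The following toy sketch illustrates how these two properties make the likelihood in (3) cheap to evaluate for an increasing triangular map. The particular two-dimensional map, the conditioner functions s and m, and the bisection bounds are our own illustrative choices and are not part of the paper; the point is only that the inverse is obtained coordinate-wise and the log-Jacobian is a sum of log diagonal partial derivatives.

```python
# A minimal sketch (our own illustration, not the paper's implementation) of the
# likelihood in eq. (3) for a toy increasing triangular map in 2-D:
#   T1(z1) = z1 + tanh(z1)                     (strictly increasing in z1)
#   T2(z1, z2) = exp(s(z1)) * z2 + m(z1)       (strictly increasing in z2)
# The inverse is computed coordinate-wise (bisection for T1, analytic for T2),
# and log |T'| is the sum of the log diagonal partial derivatives.
import numpy as np

def s(z1):  # toy "conditioner" producing the log-scale of the second coordinate
    return 0.5 * np.sin(z1)

def m(z1):  # toy "conditioner" producing the shift of the second coordinate
    return z1 ** 2 / 4.0

def T(z):
    z1, z2 = z
    return np.array([z1 + np.tanh(z1), np.exp(s(z1)) * z2 + m(z1)])

def T_inverse(x, lo=-50.0, hi=50.0, iters=80):
    x1, x2 = x
    # Invert the first coordinate by bisection (T1 is increasing).
    a, b = lo, hi
    for _ in range(iters):
        mid = 0.5 * (a + b)
        if mid + np.tanh(mid) < x1:
            a = mid
        else:
            b = mid
    z1 = 0.5 * (a + b)
    # The second coordinate is affine in z2 given z1, so invert analytically.
    z2 = (x2 - m(z1)) / np.exp(s(z1))
    return np.array([z1, z2])

def log_likelihood(x):
    """log q(x) = log p(T^{-1} x) - log |T'(T^{-1} x)| with source p = N(0, I)."""
    z = T_inverse(x)
    log_p = -0.5 * np.sum(z ** 2) - np.log(2 * np.pi)
    # Diagonal partials: dT1/dz1 = 1 + (1 - tanh(z1)^2), dT2/dz2 = exp(s(z1)).
    log_det = np.log(2.0 - np.tanh(z[0]) ** 2) + s(z[0])
    return log_p - log_det

x = np.array([0.7, -1.2])
print(T(T_inverse(x)))      # round-trip check: recovers x up to bisection error
print(log_likelihood(x))
```

Averaging log_likelihood over a sample and maximizing over the map's parameters is exactly the program in (3).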
Before we discuss these connections, let us give some examples to help understand Theorem 1.

Example 1. Consider two probability densities p and q on the real line R, with distribution functions F and G, respectively. Then, we can define the increasing map T = G^{-1} ∘ F such that q = T_#p, where G^{-1} : [0, 1] → R is the quantile function of q: G^{-1}(u) := inf{t : G(t) ≥ u}. Indeed, it is well known that F(Z) is uniform if Z ~ p and G^{-1}(U) ~ q if U is uniform. Theorem 1 is a rigorous iteration of this univariate argument by repeatedly conditioning. Note that the increasing property is essential for claiming the uniqueness of T. Indeed, for instance, let p be standard normal; then both T = id and T = −id push p to the same target normal density.

Example 2 (Pushing uniform to normal). Let p be uniform over [0, 1] and let q = N(μ, σ²) be normal. The unique increasing transformation is

T(z) = G^{-1} ∘ F (z) = μ + √2 σ erf^{-1}(2z − 1) = μ + √2 σ Σ_{k=0}^{∞} [c_k / (2k + 1)] (√π (z − 1/2))^{2k+1},

where erf(t) = (2/√π) ∫_0^t e^{−s²} ds is the error function, which was Taylor expanded in the last equality. The coefficients satisfy c_0 = 1 and c_k = Σ_{m=0}^{k−1} c_m c_{k−1−m} / ((m + 1)(2m + 1)). We observe that the derivative of T is an infinite sum of squares of polynomials. In particular, if we truncate at k = 0, we obtain

T(z) ≈ μ + √(2π) σ (z − 1/2) + O((z − 1/2)³).    (4)

(A numerical check of this expansion is sketched at the end of this section.)

Example 3 (Pushing normal to uniform). Similarly to the above, we now find a map S that pushes q to p:

S(x) = F^{-1} ∘ G (x) = Φ((x − μ)/σ),

where Φ is the cdf of the standard normal. As shown by Medvedev (2008), S must be the inverse of the map T in Example 2. We observe that the derivative of S is no longer a sum of squares of polynomials, but we prove later that it is approximately so. If we truncate at k = 0, we obtain

S(x) ≈ 1/2 + (x − μ)/(√(2π) σ) + O((x − μ)³),

where the leading term is also the inverse of the leading term of T in (4).

We end this section with two important remarks.

Remark 1. If the target density has disjoint support, e.g. a mixture of Gaussians (MoG) with well-separated components, then the resulting transformation needs to admit sharp jumps over regions of near-zero mass. This follows by analyzing the transformation T(z) = G^{-1} ∘ F: its slope is T'(z) = p(z)/q(T(z)), the ratio of the source density at z to the target density at T(z). Therefore, in regions where the target density has near-zero mass, the transformation has near-infinite slope. In Appendix B, we demonstrate this phenomenon specifically for well-separated MoGs and show that a piecewise-linear function transforms a standard Gaussian into a MoG. This also opens up the possibility of using the number of jumps of an estimated transformation as an indication of the number of components in the data density.

Remark 2. So far we have employed the (increasing) triangular map T explicitly to represent our estimate of the target density. This is advantageous since it allows us to easily draw samples from the estimated density and, if needed, it yields the estimated density formula (1) immediately. An alternative would be to parameterize the estimated density directly and explicitly, as in mixture models, probabilistic graphical models, and sigmoid belief networks. The two approaches are conceptually equivalent: thanks to Theorem 1, choosing a family of triangular maps fixes a family of target densities that we can represent, and conversely, choosing a family of target densities fixes a family of triangular maps that we implicitly learn. The advantage of the former approach is that, given a sample from the target density, we can infer its pre-image in the source domain, while this information is lost in the latter approach.
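As a sanity check of Example 2, the short script below, our own and relying on scipy.special.erfinv, builds the coefficients c_k from the stated recurrence and compares the truncated expansion against the exact map T(z) = μ + √2 σ erf^{-1}(2z − 1); the truncation at k = 0 is the linear approximation (4), and higher truncation orders converge to the exact value.

```python
# A small numerical check (ours, not from the paper) of the series in Example 2:
# the coefficients satisfy c_0 = 1, c_k = sum_{m<k} c_m c_{k-1-m} / ((m+1)(2m+1)),
# and the truncated map approaches T(z) = mu + sqrt(2)*sigma*erfinv(2z - 1).
import numpy as np
from scipy.special import erfinv

def erfinv_series_coeffs(K):
    c = [1.0]
    for k in range(1, K + 1):
        c.append(sum(c[m] * c[k - 1 - m] / ((m + 1) * (2 * m + 1)) for m in range(k)))
    return np.array(c)

def T_truncated(z, mu, sigma, K):
    c = erfinv_series_coeffs(K)
    w = np.sqrt(np.pi) * (z - 0.5)          # expansion variable sqrt(pi)(z - 1/2)
    ks = np.arange(K + 1)
    return mu + np.sqrt(2) * sigma * np.sum(c / (2 * ks + 1) * w ** (2 * ks + 1))

def T_exact(z, mu, sigma):
    return mu + np.sqrt(2) * sigma * erfinv(2 * z - 1)

mu, sigma, z = 1.0, 2.0, 0.7
for K in (0, 2, 8, 32):
    print(K, T_truncated(z, mu, sigma, K), T_exact(z, mu, sigma))
```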
## 3. Connection to existing works

The results in Section 2 suggest using (3), with F being a class of triangular maps, for estimating a probability density q. In this section we put this general approach into historical perspective and connect it to the many recent works on generative modelling. Due to space constraints, we limit our discussion to works that are directly relevant to ours.

Origins of the triangular map: Rosenblatt (1952), among his contemporary peers, used the triangular map to transform a continuous multivariate distribution into the uniform distribution over the cube. Independently, Knothe (1957) devised the triangular map to transform uniform distributions over convex bodies and to prove generalizations of the Brunn-Minkowski inequality. Talagrand (1996), unaware of the previous two results and in the process of proving a sharp Gaussian concentration inequality, effectively discovered the triangular map that transforms the Gaussian distribution into any continuous distribution. The work of Bogachev et al. (2005) rigorously established the existence and uniqueness of the triangular map and systematically studied some of its key properties. Carlier et al. (2010) showed, surprisingly, that the triangular map is the limit of solutions to a class of Monge-Kantorovich mass transportation problems under quadratic costs with diminishing weights. None of these pioneering works considered using triangular maps for density estimation.

Iterative Gaussianization and Normalizing Flow: In his seminal work, Huber (1985) developed the important notion of non-Gaussianity to explain the projection pursuit algorithm of Friedman et al. (1984). Later, Chen & Gopinath (2001), based on a heuristic argument, discovered the triangular map approach for density estimation but deemed it impractical because of the seemingly impossible task of estimating too many conditional densities. Instead, Chen & Gopinath (2001) proposed the iterative Gaussianization technique, which essentially decomposes⁵ the triangular map into the composition of a sequence of alternating diagonal maps D_t and linear maps L_t. The diagonal maps are estimated using the univariate transform in Example 1, where G is standard normal and F is a mixture of standard normals. Later, Laparra et al. (2011) simplified the linear map to random rotations. Both approaches, however, suffer cubic complexity w.r.t. the dimension due to generating or evaluating the linear map. The recent work of Tabak & Vanden-Eijnden (2010) and Tabak & Turner (2013) coined the name normalizing flow and further exploited the straightforward but crucial observation that we can approximate the triangular map through a sequence of simple maps, such as radial basis functions or rotations composed with diagonal maps. Similar simple maps have also been explored in Ballé et al. (2016). Rezende & Mohamed (2015) designed a rank-1 (or radial) normalizing flow and applied it to variational inference, largely popularizing the idea in generative modelling. These approaches do not estimate a triangular map per se, but the main ideas are nevertheless similar.

⁵ This can be made precise, much in the same way as decomposing a triangular matrix into the product of two rotation matrices and a diagonal matrix, i.e. the so-called Schur decomposition.
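To make the alternating structure concrete, here is a rough sketch of one Gaussianization iteration. It is our own simplification, using empirical CDFs in place of the mixture-of-Gaussians marginals of Chen & Gopinath (2001) and random rotations as in Laparra et al. (2011), and is only meant to show the respective roles of the diagonal map D_t and the linear map L_t.

```python
# A rough sketch (our own simplification) of one iterative-Gaussianization step:
# a linear map L_t (random rotation), followed by a diagonal map D_t that
# Gaussianizes each coordinate via the univariate transform of Example 1, with F
# approximated by the empirical CDF rather than a mixture of Gaussians.
import numpy as np
from scipy.stats import norm

def random_rotation(d, rng):
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))      # column sign fix for a proper random rotation

def gaussianize_step(X, rng):
    Y = X @ random_rotation(X.shape[1], rng).T      # linear map L_t
    n = Y.shape[0]
    out = np.empty_like(Y)
    for j in range(Y.shape[1]):                     # diagonal map D_t, coordinate-wise
        ranks = np.argsort(np.argsort(Y[:, j]))     # empirical-CDF ranks 0 .. n-1
        out[:, j] = norm.ppf((ranks + 0.5) / n)     # G^{-1}(F(y)) with G standard normal
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])  # correlated data
for _ in range(5):
    X = gaussianize_step(X, rng)
print(np.round(np.cov(X.T), 2))  # should approach the identity as iterations proceed
```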
(Bona fide) Triangular Approach: Deco & Brauer (1995) (see also Redlich (1993)) is, to our best knowledge, among the first to mention the name triangular explicitly in tasks (nonlinear independent component analysis) related to density estimation. More recently, Dinh et al. (2015) recognized the promise of even simple triangular maps in density estimation. The (increasing) triangular map in (Dinh et al., 2015) consists of two simple (block) components: T_1(x_1) = x_1 and T_2(x_1, x_2) = x_2 + m(x_1), where x = (x_1, x_2) is a two-block partition and m is a map parameterized by a neural net. The advantage of this triangular map is its computational convenience: its Jacobian is trivially 1 and its inversion only requires evaluating m. Dinh et al. (2015) applied different partitions of variables, iteratively composed several such simple triangular maps, and combined them with a diagonal linear map⁶. However, these triangular maps appear to be too simple, and it is not clear whether through composition they can approximate any increasing triangular map. In subsequent work, Dinh et al. (2017) proposed the extension where T_1(x_1) = x_1 but T_2(x_1, x_2) = x_2 ⊙ exp(s(x_1)) + m(x_1), where ⊙ denotes the element-wise product. This map is again increasing triangular. Moselhy & Marzouk (2012) employed triangular maps for Bayesian posterior inference, which was further extended in (Marzouk et al., 2016) for sampling from an (unknown) target density. One of their formulations is essentially the same as our eq. (2).

⁶ They also considered a more general coupling that may no longer be triangular.

Autoregressive Neural Models: A joint probability density function can be factorized into the product of a marginal and conditionals:

q(x_1, ..., x_d) = q(x_1) ∏_{j=2}^{d} q(x_j | x_{j−1}, ..., x_1).

In his seminal work, Neal (1992) proposed to model each (discrete) conditional density by a simple linear logistic function (with the conditioned variables as inputs). This was later extended by Bengio & Bengio (1999) using a two-layer nonlinear neural net. The recent work of Uria et al. (2016) proposed to decouple the hidden layers in Bengio & Bengio (1999) and to introduce heavy weight sharing to reduce overfitting and computational complexity. Already in (Bengio & Bengio, 1999), univariate mixture models were mentioned as a possibility for modelling each conditional density, which was further substantiated in (Uria et al., 2016). More precisely, they model the j-th conditional density as:

q(x_j | x_{j−1}, ..., x_1) = Σ_{κ=1}^{k} w_{j,κ} N(x_j; μ_{j,κ}, σ_{j,κ}),    (5)

where the parameters θ_j := (w_{j,κ}, μ_{j,κ}, σ_{j,κ})_{κ=1}^{k} = C_j(x_{j−1}, ..., x_1) are produced by a conditioner C_j (a small illustrative sketch of this parameterization follows Table 1).

Table 1. Various autoregressive and flow-based methods expressed under a unified framework. All the conditioners can take inputs x instead of z. The symbols in the table indicate weight sharing, the use of masks for efficient implementation, universality of the method, and whether the method learns a triangular transformation explicitly (E) or implicitly (I). A question mark (?) means that universality of the method has neither been proved nor disproved, although it can now be analyzed with ease using our framework. S_j(z_j; θ_j) is defined in eq. (6) and P_{2r+1}(z_j; a_j) is defined in eq. (8).

| Model | Conditioner C_j | Output T_j(z_j; C_j(z_1, ..., z_{j−1})) | E/I |
| --- | --- | --- | --- |
| Mixture (e.g. McLachlan & Peel, 2004) | θ_j | S_j(z_j; θ_j) | I |
| (Bengio & Bengio, 1999) | θ_j(z | | |
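Returning to eq. (5), the sketch below shows, purely as an illustration, how a conditioner C_j can map the preceding variables to the mixture parameters θ_j. The tiny one-hidden-layer network, its untrained weights, and the function names are our own assumptions and not the architecture used by Bengio & Bengio (1999) or Uria et al. (2016).

```python
# A minimal sketch (ours, under simplifying assumptions) of the autoregressive
# mixture conditional in eq. (5): the j-th conditional q(x_j | x_{j-1}, ..., x_1)
# is a k-component Gaussian mixture whose parameters theta_j are produced by a
# conditioner C_j applied to the preceding variables.
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def conditioner(x_prev, W1, b1, W2, b2, k):
    """Map x_{<j} to theta_j = (weights w, means mu, stds sigma) of a k-mixture."""
    h = np.tanh(W1 @ x_prev + b1)
    out = W2 @ h + b2                      # 3k raw outputs
    w = softmax(out[:k])                   # mixture weights sum to 1
    mu = out[k:2 * k]                      # unconstrained means
    sigma = np.exp(out[2 * k:])            # positive standard deviations
    return w, mu, sigma

def conditional_density(x_j, x_prev, params, k=3):
    w, mu, sigma = conditioner(x_prev, *params, k=k)
    comp = np.exp(-0.5 * ((x_j - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(w * comp)                # eq. (5)

rng = np.random.default_rng(0)
d_prev, hidden, k = 2, 8, 3
params = (rng.standard_normal((hidden, d_prev)), np.zeros(hidden),
          rng.standard_normal((3 * k, hidden)), np.zeros(3 * k))
print(conditional_density(0.3, np.array([0.1, -0.4]), params))
```

Multiplying such conditionals over j = 1, ..., d gives the joint density in the factorization above; training would fit the conditioner weights by maximum likelihood.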