# Exchangeable Generative Models with Flow Scans

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Christopher M. Bender,*1 Kevin O'Connor,*2 Yang Li,1 Juan Jose Garcia,1 Junier Oliva,1 Manzil Zaheer3

1 Department of Computer Science, UNC Chapel Hill
2 Department of Statistics and Operations Research, UNC Chapel Hill
3 Google Research

{bender, yangli95, jjgarcia, joliva}@cs.unc.edu, koconn@live.unc.edu, manzilz@google.com

In this work, we develop a new approach to generative density estimation for exchangeable, non-i.i.d. data. The proposed framework, Flow Scan, combines invertible flow transformations with a sorted scan to flexibly model the data while preserving exchangeability. Unlike most existing methods, Flow Scan exploits the intradependencies within sets to learn both global and local structure. Flow Scan represents the first approach that is able to apply sequential methods to exchangeable density estimation without resorting to averaging over all possible permutations. We achieve new state-of-the-art performance on point cloud and image set modeling.

## Introduction

Modeling unordered, non-i.i.d. data is an important problem in machine learning and data science. Collections of data objects with complicated intrinsic relationships are ubiquitous. These collections include sets of 3d points sampled from the surface of complicated shapes like human organs, sets of images shared within the same web page, or LiDAR point cloud data observed by driverless cars. In any of these cases, the collections of data objects do not possess any inherent ordering of their elements. Thus, any generative model which takes these data as input should not depend on the order in which the elements are presented and must be flexible enough to capture the dependencies between co-occurring elements.
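The order-invariance requirement described above can be made concrete with a small numerical check. The sketch below uses the trivially exchangeable case, a hypothetical i.i.d. standard-normal density (our own illustrative choice, not the paper's model), and evaluates one set of points under every ordering:

```python
import numpy as np
from itertools import permutations

def iid_gaussian_loglik(x):
    # log p(x_1, ..., x_n) under i.i.d. standard normals;
    # the sum over all covariates cannot depend on row order
    return float(np.sum(-0.5 * (x**2 + np.log(2 * np.pi))))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))  # a set of n = 4 points in R^2

# evaluate the same set under all 4! = 24 orderings of its points
lls = [iid_gaussian_loglik(x[list(pi)]) for pi in permutations(range(4))]
assert np.allclose(lls, lls[0])  # identical likelihood for every ordering
```

A sequential model with no such constraint would generally assign 24 different values here, which is exactly the failure mode the paper targets.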
The unorderedness of these kinds of collections is captured probabilistically by the notion of exchangeability. Formally, a set of points $\{x_j\}_{j=1}^n \subset \mathbb{R}^d$ with cardinality $n$, dimension $d$, and probability density $p(\cdot)$ is called exchangeable if

$$p(x_1, \ldots, x_n) = p(x_{\pi_1}, \ldots, x_{\pi_n}) \qquad (1)$$

for every permutation $\pi$. In practice $\{x_j\}_{j=1}^n$ often represent 2d or 3d spatial points (see Fig. 1), in which case we refer to the set as a point cloud. In other settings, the points of interest may be more complex, like images represented as very high-dimensional vectors.

As a simple example, one may trivially generate a set of exchangeable points by drawing them i.i.d. from some distribution. More commonly, elements within an exchangeable set share information with one another, providing structure. Despite the abundance of such data, the bulk of existing approaches either ignore the relation between points (i.i.d. methods) or model dependencies in a manner that depends on inherent orderings (sequential methods) (Rezatofighi et al. 2017; You et al. 2018). In order to accurately learn the structure of a set whilst preserving the exchangeability of its likelihood, one cannot rely solely on either approach.

*Equal contribution.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: A training dataset of sets. Each instance $X_i$ is a set of points $X_i = \{x_{i,j} \in \mathbb{R}^d\}_{j=1}^{n_i}$ ($d = 2$ shown). We estimate $p(X_i)$, from which we can sample distinct sets.

In this work, we focus on the task of tractable, non-i.i.d. density estimation for exchangeable sets. We explore both low-cardinality sets of high dimension (10-20 points with many hundreds of dimensions each, e.g. collections of images) and high-cardinality sets of low dimension (hundreds of points with 2-7 dimensions each, e.g. point clouds). We develop a generative model suitable for exchangeable sets in either regime, called Flow Scan, which does not rely on i.i.d.
assumptions and is provably exchangeable. Contrary to intuition, we show that one can preserve exchangeability while scanning over the data in a sorted manner. Flow Scan is the first method to achieve a tractable, non-i.i.d., exchangeable likelihood by leveraging traditional (e.g. sequential), non-exchangeable density estimators.

Main Contributions. 1) We show that transforming points with an equivariant change of variables allows for modeling sets in a different space. 2) We introduce a scanning-based technique for modeling exchangeable data, relating the underlying exchangeable likelihood to that of the sorted covariates. 3) We demonstrate how traditional density estimators may be used for the task of principled and feasible exchangeable density estimation via a scanning-based approach. 4) We show empirically that Flow Scan achieves the state of the art on density estimation tasks for both synthetic and real-world point cloud and image set datasets.

## Motivation and Challenges

We motivate our problem with a simple, yet common, set generative process that requires a non-i.i.d., exchangeable density estimator. Consider the following generative process for a set: 1) generate latent parameters $\phi \sim p_\Phi(\cdot)$, and then 2) generate a set $X \sim p(\cdot \mid \phi)$. Here $p(\cdot \mid \phi)$ may be as simple as a Gaussian model (where $\phi$ is the mean and covariance parameters) or as complex as a nonparametric model (where $\phi$ may be infinite-dimensional). This simple set generative process requires a non-i.i.d. approach, even when the ground-truth conditional set likelihood, $p(X \mid \phi)$, is conditionally i.i.d. We show this by first noting that with conditionally i.i.d. $p(X \mid \phi) = \prod_{j=1}^n p(x_j \mid \phi)$, the complete set likelihood is:

$$p(X) = \int p_\Phi(\phi) \prod_{j=1}^n p(x_j \mid \phi) \, d\phi. \qquad (2)$$

(Note that Eq. 2 is in the same vein as De Finetti's theorem (Bernardo and Smith 2009).) One can show dependency (non-i.i.d.)
with the conditional likelihood of a single point $x_k$ given a disjoint subset $S \subseteq X \setminus \{x_k\}$:

$$p(x_k \mid S) = \int p_\Phi(\phi \mid S) \, p(x_k \mid S, \phi) \, d\phi = \int p_\Phi(\phi \mid S) \, p(x_k \mid \phi) \, d\phi \neq \int p_\Phi(\phi) \, p(x_k \mid \phi) \, d\phi = p(x_k).$$

That is, the conditional likelihood $p(x_k \mid S)$ depends on other points in $X$ via the posterior $p_\Phi(\phi \mid S)$, which accounts for what $\phi$ was likely to have generated $S$. As a consequence, the complete generative process (2) is not marginally i.i.d., notwithstanding the conditionally i.i.d. $p(X \mid \phi)$. Thus, any model built on an i.i.d. assumption may be severely biased.

The generative process in Eq. (2) is especially applicable to surface point cloud data. For such sets, $X_i$, points are drawn i.i.d. from (conditioned on) the surface of a shape with (unknown) parameters $\phi_i$ (e.g. object class, length, orientation, noise, etc.), resulting in the dataset $D = \{X_i \sim p(\cdot \mid \phi_i)\}_{i=1}^N$ of $N$ sets. As shown above, modeling such point cloud set data requires a non-i.i.d. approach even though points may be drawn independently given the surface parameters. Flow Scan will not only yield an exchangeable, non-i.i.d. generative model, but will also directly model elements in sets without latent parameters. In effect, Flow Scan will automatically marginalize out dependence on the latent parameters of a given set, and is thus capable of handling complicated $p(\cdot \mid \phi)$.

Broadly, the primary challenge in direct exchangeable density estimation is designing a flexible, invariant architecture which yields a valid likelihood. As explained above, using an i.i.d. assumption to enforce this property will severely hamper the performance of a model. To avoid this simplification, techniques often shoehorn invariance to observed orderings by feeding randomly permuted data into sequential models (Rezatofighi et al. 2017; You et al. 2018). Such approaches attempt to average out the likelihood of the model over all permutations:

$$\frac{1}{n!} \sum_\pi p_s(x_{\pi_1}, \ldots, x_{\pi_n}), \qquad (3)$$

where $p_s$ is some sequential model.
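The marginal dependence derived above can be seen in a concrete Gaussian instance of the generative process: take $\phi \sim \mathcal{N}(0, \tau^2)$ and $x_j \mid \phi \sim \mathcal{N}(\phi, \sigma^2)$ i.i.d., so that marginally $\mathrm{Cov}(x_j, x_k) = \tau^2$ for $j \neq k$. The Monte Carlo sketch below (parameter values are our own illustrative choices) checks this:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma, n, n_sets = 2.0, 1.0, 5, 200_000

# 1) draw latent parameters phi ~ p_Phi, one per set
phi = rng.normal(0.0, tau, size=n_sets)
# 2) draw points i.i.d. given phi: x_j | phi ~ N(phi, sigma^2)
x = rng.normal(phi[:, None], sigma, size=(n_sets, n))

# conditionally i.i.d., yet marginally dependent once phi is
# integrated out: Cov(x_j, x_k) = tau^2 for j != k
emp_cov = np.cov(x[:, 0], x[:, 1])[0, 1]
print(emp_cov)  # close to tau^2 = 4.0
```

An i.i.d. model of the same data would force this covariance to zero, which is the bias the argument above warns about.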
Of course, the observation of all potential orderings for even a modest collection of points is infeasible. Furthermore, there are often no guarantees that the sequential model $p_s$ will learn to ignore orderings, especially for unseen test data (Vinyals, Bengio, and Kudlur 2015).

Given that an i.i.d. assumption is not robust and averaging over all permutations is infeasible, what operation should be used to ensure permutation invariance of the architecture? Instead of attempting to wash out the effect of order in an architecture as in Eq. 3, we propose to enforce invariance by adopting a prespecified ordering and scanning over elements in this order. As will be discussed in the Methods section, the benefit of estimating a likelihood over sorted data is that it frees us from the restriction of exchangeability. Given the sorted data, we can apply any number of traditional density estimators. However, such an approach presents its own challenges:

- Determining a suitable way to scan through an exchangeable sequence. That is, one must map the set $X = \{x_j\}_{j=1}^n$ to a sequence $X \mapsto (x_{[1]}, \ldots, x_{[n]})$, where $x_{[j]}$ denotes the $j$th point in the sorted order.
- Relating the likelihood of the scanned sequence to the likelihood of the exchangeable set. Modeling the exchangeable likelihood through a scanned likelihood is not immediately obvious; a simple equality of the two does not hold, $p(X) \neq p(x_{[1]}, \ldots, x_{[n]})$.
- Scanning in a space that is beneficial for modeling. The native input space may not be best suited for modeling or scanning, hence it would be constructive to transform the exchangeable input prior to the scan.
- Developing an architecture that exploits the structure gained in the scan. The scanning operation will introduce sequential correlations among elements which need to be modeled successfully.

Next, we develop the Flow Scan model while addressing each of these challenges.
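The first challenge, mapping a set to a well-defined sequence, can be sketched as a lexicographic sort. Sorting on the first dimension follows the paper's running example; breaking ties with the remaining dimensions is our own choice to make the map deterministic:

```python
import numpy as np

def scan(x):
    """Map a set (rows of x, in arbitrary order) to its sorted sequence.

    Primary sort key is dimension 0; later dimensions break ties so the
    output sequence is the same for every ordering of the input rows.
    """
    order = np.lexsort(x.T[::-1])  # np.lexsort treats the LAST key as primary
    return x[order]

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))       # a set of 5 points in R^3
perm = rng.permutation(5)

# every ordering of the same set scans to one and the same sequence
assert np.allclose(scan(x), scan(x[perm]))
```

This is exactly the many-to-one behavior that forces the correction factor relating the set likelihood to the scanned likelihood, discussed in the Methods section.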
## Methods

Flow Scan consists of three components: 1) a sequence of equivariant flow transformations ($\hat{q}_e$), to map the data to a space that is easier to model; 2) a sort with a correction factor, to allow for the use of non-exchangeable density estimators; 3) a density estimator ($\hat{p}_s$) (e.g. an autoregressive model, which may utilize sequential flow transformations, $\hat{q}_c$), to estimate the likelihood while accounting for correlations induced by sorting (see Fig. 2). In this section, we motivate each piece of the architecture and detail how they combine to yield a highly flexible, exchangeable density estimator.

Figure 2: Illustration of our proposed method. First, input sets are scanned (in a possibly transformed space). After, the scanned covariates are modeled (possibly in an autoregressive fashion, as shown).

### Equivariant Flow Transformations

Flow Scan first utilizes a sequence of equivariant flow transformations. So-called flow models rely on the change of variables formula to build highly effective models for traditional non-exchangeable generative tasks (like image modeling) (Kingma and Dhariwal 2018). Using the change of variables formula, flow models approximate the likelihood of a $d$-dimensional distribution over real-valued covariates $x = (x^{(1)}, \ldots, x^{(d)}) \in \mathbb{R}^d$ by applying an invertible (flow) transformation $\hat{q}(x)$ to an estimated base distribution $\hat{f}$:

$$\hat{p}(x^{(1)}, \ldots, x^{(d)}) = \left| \det \frac{d\hat{q}}{dx} \right| \hat{f}(\hat{q}(x)), \qquad (4)$$

where $|\det \frac{d\hat{q}}{dx}|$ is the Jacobian determinant of the transformation $\hat{q}$. Often, the base distribution is a standard Gaussian. However, Oliva et al. (2018) recently showed that performance may be improved with a more flexible base distribution on transformed covariates, such as an autoregressive density (Germain et al. 2015; Gregor et al. 2014; Larochelle and Murray 2011; Uria et al. 2016; Uria, Murray, and Larochelle 2013).
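Eq. 4 can be verified numerically with the simplest possible flow. The sketch below uses a hypothetical affine transformation $q(x) = a x + b$ (our illustrative stand-in for a learned flow) with a standard-normal base, so the modeled density must match the $\mathcal{N}(-b/a, 1/a^2)$ density in closed form:

```python
import numpy as np

a, b = 2.0, -1.0  # illustrative affine flow q(x) = a*x + b (invertible for a != 0)

def q(x):
    return a * x + b

def log_f(z):
    # log density of the standard-normal base distribution, summed over dims
    return float(np.sum(-0.5 * (z**2 + np.log(2 * np.pi))))

def log_p(x):
    # Eq. 4 in log form: log p(x) = log|det dq/dx| + log f(q(x));
    # here dq/dx = a*I, so log|det| = d * log|a|
    return x.size * np.log(abs(a)) + log_f(q(x))

# q pushes x ~ N(-b/a, 1/a^2) forward to z ~ N(0, 1): compare densities
x = np.array([0.3, 1.2])
mu, var = -b / a, 1.0 / a**2
closed_form = float(np.sum(-0.5 * np.log(2 * np.pi * var)
                           - (x - mu) ** 2 / (2 * var)))
assert np.isclose(log_p(x), closed_form)
```

The same bookkeeping (base log-density plus log-determinant) carries over unchanged when $q$ is a deep invertible network rather than an affine map.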
There are a myriad of possible invertible transformations, $\hat{q}$, that one may apply to inputs $x \in \mathbb{R}^{n \times d}$ in order to model elements in a more expressive space, where $x$ is the set $X$ represented as a matrix. However, in our case one must take care to preserve exchangeability of the inputs when transforming the data. For example, a simple affine change of variables will be sensitive to the order in which the elements of $x$ were observed, resulting in a space which is no longer exchangeable. One can circumvent this problem by requiring that any transformation, $\hat{q}$, used be equivariant. That is, for all permutation operators, $\Gamma$, we have that $\hat{q}(\Gamma x) = \Gamma \hat{q}(x)$. Proposition 1 states that equivariance of the transformations, in conjunction with invariance of the base distribution, is enough to ensure that exchangeability is preserved, allowing one to model set data in a transformed space. The proof is straightforward and relegated to the Appendix.

Proposition 1. Let $\hat{q} : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ be a permutation equivariant, invertible transformation and the base distribution, $\hat{f}$, be exchangeable. Then the likelihood, $\hat{p}(x) = \left| \det \frac{d\hat{q}}{dx} \right| \hat{f}(\hat{q}(x))$, is exchangeable.

Given an invertible transformation, $q : \mathbb{R}^d \to \mathbb{R}^d$, one may construct a simple permutation equivariant transformation by applying it to each point in a set independently: $(x_1, \ldots, x_n) \mapsto (q(x_1), \ldots, q(x_n))$. However, it is possible to engineer equivariant transformations which utilize information from other points in the set while still preserving equivariance. Proposition 1 shows that Flow Scan is compatible with any combination of these transformations.

### Set-Coupling

Among others, we propose a novel set-level scaling and shifting coupling transformation (Dinh, Sohl-Dickstein, and Bengio 2016). For $d$-dimensional points, the coupling transformation scales and shifts one subset, $S \subseteq \{1, \ldots$
$, d\}$, of the $d$ covariates given the rest, $S^c$, as (letting superscripts index point dimensions):

$$x^{(S)} \mapsto \exp\!\left(f\!\left(x^{(S^c)}\right)\right) \odot x^{(S)} + g\!\left(x^{(S^c)}\right), \qquad x^{(S^c)} \mapsto x^{(S^c)}, \qquad (5)$$

for learned functions $f, g : \mathbb{R}^{|S^c|} \to \mathbb{R}^{|S|}$. We propose a set-coupling transformation as follows:

$$x_i^{(S)} \mapsto \exp\!\left(f\!\left(\varphi(x^{(S^c)}),\, x_i^{(S^c)}\right)\right) \odot x_i^{(S)} + g\!\left(\varphi(x^{(S^c)}),\, x_i^{(S^c)}\right), \qquad x_i^{(S^c)} \mapsto x_i^{(S^c)}, \qquad (6)$$

where $x^{(S^c)} \in \mathbb{R}^{n \times |S^c|}$ is the set of unchanged covariates and $\varphi(x^{(S^c)}) \in \mathbb{R}^r$ are general, learnable permutation invariant set embeddings. We utilize embeddings from Deep Sets architectures (Zaheer et al. 2017); however, other embeddings are possible, e.g. prescribed statistics (Jebara, Kondor, and Howard 2004). Here $f, g : \mathbb{R}^{r+|S^c|} \to \mathbb{R}^{|S|}$ are learned functions. The embedding $\varphi$ is responsible for capturing set-level information from other covariates. This is combined with each point $x_i^{(S^c)}$ to yield shifts and scales with both point- and set-level dependence (see Fig. 3). The log-determinant and inverse are detailed in the Appendix, along with several other examples of flexible, equivariant transformations.

Figure 3: An illustration of how set-coupling transformations act on a set. The first plot shows the input data to be transformed. In the subsequent plots, the set is transformed in an invertible, equivariant fashion by stacking set-coupling transformations. Iteratively transforming dimensions of a set in this way yields a set with simpler structure that may be modeled more easily, as shown in the last plot.

### Invariance Through Sorting

After applying a series of equivariant flow transformations, Flow Scan performs a sort operation and corrects the likelihood with a factor of $1/n!$. Sorting in a prespecified fashion ensures that different permutations of the input map to the same output. In this section, we prove that this yields an analytically correct likelihood and comment on the advantages of such an approach. Specifically, we show that the exchangeable (unordered) likelihood of a set of $n$ points $p_e(x_1, \ldots$
$, x_n)$ (where $x_j \in \mathbb{R}^d$) can be written in terms of the non-exchangeable (ordered) likelihood of the points in a sorted order, $p_s(x_{[1]}, \ldots, x_{[n]})$, as stated in Prop. 2 below.

Proposition 2. Let $p_e$ be an exchangeable likelihood which is continuous and non-degenerate (i.e. $\exists\, j \in \{1, \ldots, d\}$ such that $\Pr[x_1^{(j)} \neq x_2^{(j)} \neq \ldots \neq x_n^{(j)}] = 1$). Then,

$$p_e(x_1, \ldots, x_n) = \frac{1}{n!}\, p_s(x_{[1]}, \ldots, x_{[n]}), \qquad (7)$$

where $x_{[j]}$ is the $j$th point in the sorted order.

Proof. We derive Eq. 7 from a variant of the change of variables formula (Casella and Berger 2002). It states that if we have a partition of our input space, $\{A_j\}_{j=1}^M$, such that a transformation of variables $q$ is invertible on each partition element $A_j$ with inverse $q_j^{-1}$, then we may write the likelihood $f$ of $z = q(u)$ in terms of the likelihood $p$ of the input data $u$ as:

$$f(z) = \sum_{j=1}^M \left| \det \frac{dq_j^{-1}}{dz} \right| p(q_j^{-1}(z)). \qquad (8)$$

For the moment, suppose that the points $\{x_j\}_{j=1}^n$ are sorted according to the first dimension. That is, $x_{[1]}, \ldots, x_{[n]}$ in Eq. 7 are such that $x_{[1]}^{(1)} < \ldots < x_{[n]}^{(1)}$. The act of sorting these points amounts to a transformation of variables $s : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$, $s(x_1, \ldots, x_n) = (x_{[1]}, \ldots, x_{[n]})$. The transformation $s$ is one-to-one on the partitions of the input space $\mathbb{R}^{n \times d}$ defined by the relative order of points. In other words, we may partition the input space according to the permutation that would sort the data: $A_\pi = \{x \in \mathbb{R}^{n \times d} \mid x_{\pi_1}^{(1)} < x_{\pi_2}^{(1)} < \ldots < x_{\pi_n}^{(1)}\}$. We may invert $s$ on $A_\pi$ via the inverse permutation matrix of $\pi$, $\Gamma_\pi^{-1}$. Letting $\Pi$ be the set of all permutations, Eq. 8 yields:

$$p_s(s(x)) \overset{(*)}{=} \sum_{\pi \in \Pi} \left| \det \Gamma_\pi^{-1} \right| p_e(\Gamma_\pi^{-1} s(x)) \overset{(**)}{=} n!\, p_e(x), \qquad (9)$$

where (*) follows from Eq. 8 and (**) follows from the exchangeability of $p_e$ together with $|\det \Gamma_\pi^{-1}| = 1$. Thus, we may compute the exchangeable likelihood $p_e(x)$ using the likelihood of the sorted points, as in Eq. 7. Trivially, similar arguments also hold when sorting according to a dimension other than the first.
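Prop. 2 can be sanity-checked on the simplest exchangeable density we can pick: $n$ i.i.d. Uniform(0, 1) points in one dimension, whose sorted scan has the classical order-statistics density $n! \cdot p_e$ on the sorted region. The sketch below (an illustrative choice of $p_e$, not the learned model) verifies Eq. 7 and the many-to-one nature of the sort:

```python
import numpy as np
from itertools import permutations
from math import factorial

def p_e(x):
    # exchangeable density: n i.i.d. Uniform(0, 1) points
    return float(np.all((x >= 0) & (x <= 1)))

def p_s(x_sorted):
    # density of the sorted scan: n! * p_e on the sorted region, else 0
    n = x_sorted.size
    if np.all(np.diff(x_sorted) > 0):
        return factorial(n) * p_e(x_sorted)
    return 0.0

n = 3
rng = np.random.default_rng(0)
x = rng.uniform(size=n)

# Eq. 7: p_e(x_1, ..., x_n) = (1/n!) * p_s(x_[1], ..., x_[n])
assert np.isclose(p_e(x), p_s(np.sort(x)) / factorial(n))

# the sort s(.) maps all n! orderings of the same set to one sequence,
# which is where the n! correction factor comes from
for pi in permutations(range(n)):
    assert np.array_equal(np.sort(x[list(pi)]), np.sort(x))
```

Because all $n!$ orderings collapse onto one sorted sequence, the sorted density concentrates $n!$ times the mass of the exchangeable density, matching Eq. 9.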
Furthermore, it is possible to sort according to any appropriately transformed space of the $x_j$, rather than a native dimension itself (as this is equivalent to applying a transformation, sorting, and inverting said transformation). Consequently, the exchangeable likelihood may be estimated via an approximation over the scanned covariates: $p_e(x) \approx \frac{1}{n!} \hat{p}_s(s(x))$. Since the density of the sorted scan is not exchangeable, we may estimate $\hat{p}_s$ using traditional density estimation techniques. This gives a principled approach to reduce the problem of exchangeable likelihood estimation to a flat vector (or sequence) likelihood estimation task.

### Autoregressive Scan Likelihood

After performing equivariant flow transformations and sorting, Flow Scan applies a non-exchangeable density estimator to model the transformed and sorted data. Let $z = s(\hat{q}(x)) \in \mathbb{R}^{n \times d}$ be the sorted covariates. Since $z$ is not exchangeable, one can apply any traditional likelihood estimator to its covariates; e.g. one may treat $z$ as a vector and model $\hat{p}_s(\mathrm{vec}(z))$ using a flat density estimator. However, flattening in this way suffers from several disadvantages. First, it is inflexible to varying cardinalities. Furthermore, the total number of covariates, $nd$, may be large for sets with large cardinality or dimensionality. Finally, a general flat model loses the context that covariates are from multiple points in some shared set. To address these challenges, we use an autoregressive likelihood: $\prod_{k=1} \hat{p}(z_k \mid h$