# Exchangeable Generative Models with Flow Scans

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Christopher M. Bender,*1 Kevin O'Connor,*2 Yang Li,1 Juan Jose Garcia,1 Junier Oliva,1 Manzil Zaheer3

1 Department of Computer Science, UNC Chapel Hill
2 Department of Statistics and Operations Research, UNC Chapel Hill
3 Google Research

{bender, yangli95, jjgarcia, joliva}@cs.unc.edu, koconn@live.unc.edu, manzilz@google.com

In this work, we develop a new approach to generative density estimation for exchangeable, non-i.i.d. data. The proposed framework, Flow Scan, combines invertible flow transformations with a sorted scan to flexibly model the data while preserving exchangeability. Unlike most existing methods, Flow Scan exploits the intradependencies within sets to learn both global and local structure. Flow Scan represents the first approach that is able to apply sequential methods to exchangeable density estimation without resorting to averaging over all possible permutations. We achieve new state-of-the-art performance on point cloud and image set modeling.

## Introduction

Modeling unordered, non-i.i.d. data is an important problem in machine learning and data science. Collections of data objects with complicated intrinsic relationships are ubiquitous. These collections include sets of 3d points sampled from the surface of complicated shapes like human organs, sets of images shared within the same web page, or LiDAR point cloud data observed by driverless cars. In any of these cases, the collections of data objects do not possess any inherent ordering of their elements. Thus, any generative model which takes these data as input should not depend on the order in which the elements are presented and must be flexible enough to capture the dependencies between co-occurring elements.
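The order-invariance requirement described above can be made concrete with a small numerical check. The sketch below uses the trivially exchangeable case, a hypothetical i.i.d. standard-normal density (our own illustrative choice, not the paper's model), and evaluates one set of points under every ordering:

```python
import numpy as np
from itertools import permutations

def iid_gaussian_loglik(x):
    # log p(x_1, ..., x_n) under i.i.d. standard normals;
    # the sum over all covariates cannot depend on row order
    return float(np.sum(-0.5 * (x**2 + np.log(2 * np.pi))))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))  # a set of n = 4 points in R^2

# evaluate the same set under all 4! = 24 orderings of its points
lls = [iid_gaussian_loglik(x[list(pi)]) for pi in permutations(range(4))]
assert np.allclose(lls, lls[0])  # identical likelihood for every ordering
```

A sequential model with no such constraint would generally assign 24 different values here, which is exactly the failure mode the paper targets.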
The unorderedness of these kinds of collections is captured probabilistically by the notion of exchangeability. Formally, a set of points $\{x_j\}_{j=1}^n \subset \mathbb{R}^d$ with cardinality $n$, dimension $d$, and probability density $p(\cdot)$ is called exchangeable if

$$p(x_1, \ldots, x_n) = p(x_{\pi_1}, \ldots, x_{\pi_n}) \qquad (1)$$

for every permutation $\pi$. In practice $\{x_j\}_{j=1}^n$ often represent 2d or 3d spatial points (see Fig. 1), in which case we refer to the set as a point cloud. In other settings, the points of interest may be more complex, like images represented as very high-dimensional vectors.

As a simple example, one may trivially generate a set of exchangeable points by drawing them i.i.d. from some distribution. More commonly, elements within an exchangeable set share information with one another, providing structure. Despite the abundance of such data, the bulk of existing approaches either ignore the relation between points (i.i.d. methods) or model dependencies in a manner that depends on inherent orderings (sequential methods) (Rezatofighi et al. 2017; You et al. 2018). In order to accurately learn the structure of a set whilst preserving the exchangeability of its likelihood, one cannot rely solely on either approach.

*Equal contribution.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: A training dataset of sets. Each instance $X_i$ is a set of points $X_i = \{x_{i,j} \in \mathbb{R}^d\}_{j=1}^{n_i}$ ($d = 2$ shown). We estimate $p(X_i)$, from which we can sample distinct sets.

In this work, we focus on the task of tractable, non-i.i.d. density estimation for exchangeable sets. We explore both low-cardinality sets of high dimension (10-20 points with many hundreds of dimensions each, e.g. collections of images) and high-cardinality sets of low dimension (hundreds of points with 2-7 dimensions each, e.g. point clouds). We develop a generative model suitable for exchangeable sets in either regime, called Flow Scan, which does not rely on i.i.d.
assumptions and is provably exchangeable. Contrary to intuition, we show that one can preserve exchangeability while scanning over the data in a sorted manner. Flow Scan is the first method to achieve a tractable, non-i.i.d., exchangeable likelihood by leveraging traditional (e.g. sequential), non-exchangeable density estimators.

Main Contributions. 1) We show that transforming points with an equivariant change of variables allows for modeling sets in a different space. 2) We introduce a scanning-based technique for modeling exchangeable data, relating the underlying exchangeable likelihood to that of the sorted covariates. 3) We demonstrate how traditional density estimators may be used for the task of principled and feasible exchangeable density estimation via a scanning-based approach. 4) We show empirically that Flow Scan achieves the state of the art on density estimation tasks for both synthetic and real-world point cloud and image set datasets.

## Motivation and Challenges

We motivate our problem with a simple, yet common, set generative process that requires a non-i.i.d., exchangeable density estimator. Consider the following generative process for a set: 1) generate latent parameters $\phi \sim p_\Phi(\cdot)$, and then 2) generate a set $X \sim p(\cdot \mid \phi)$. Here $p(\cdot \mid \phi)$ may be as simple as a Gaussian model (where $\phi$ is the mean and covariance parameters) or as complex as a nonparametric model (where $\phi$ may be infinite-dimensional). This simple set generative process requires a non-i.i.d. approach, even when the ground-truth conditional set likelihood, $p(X \mid \phi)$, is conditionally i.i.d. We show this by first noting that with conditionally i.i.d. $p(X \mid \phi) = \prod_{j=1}^n p(x_j \mid \phi)$, the complete set likelihood is:

$$p(X) = \int p_\Phi(\phi) \prod_{j=1}^n p(x_j \mid \phi) \, d\phi. \qquad (2)$$

(Note that Eq. 2 is in the same vein as De Finetti's theorem (Bernardo and Smith 2009).) One can show dependency (non-i.i.d.)
with the conditional likelihood of a single point $x_k$ given a disjoint subset $S \subseteq X \setminus \{x_k\}$:

$$p(x_k \mid S) = \int p_\Phi(\phi \mid S) \, p(x_k \mid S, \phi) \, d\phi = \int p_\Phi(\phi \mid S) \, p(x_k \mid \phi) \, d\phi \neq \int p_\Phi(\phi) \, p(x_k \mid \phi) \, d\phi = p(x_k).$$

That is, the conditional likelihood $p(x_k \mid S)$ depends on other points in $X$ via the posterior $p_\Phi(\phi \mid S)$, which accounts for what $\phi$ was likely to have generated $S$. As a consequence, the complete generative process (2) is not marginally i.i.d., notwithstanding the conditionally i.i.d. $p(X \mid \phi)$. Thus, any model built on an i.i.d. assumption may be severely biased.

The generative process in Eq. (2) is especially applicable to surface point cloud data. For such sets, $X_i$, points are drawn i.i.d. from (conditioned on) the surface of a shape with (unknown) parameters $\phi_i$ (e.g. object class, length, orientation, noise, etc.), resulting in the dataset $D = \{X_i \sim p(\cdot \mid \phi_i)\}_{i=1}^N$ of $N$ sets. As shown above, modeling such point cloud set data requires a non-i.i.d. approach even though points may be drawn independently given the surface parameters. Flow Scan will not only yield an exchangeable, non-i.i.d. generative model, but will also directly model elements in sets without latent parameters. In effect, Flow Scan will automatically marginalize out dependence on the latent parameters of a given set, and is thus capable of handling complicated $p(\cdot \mid \phi)$.

Broadly, the primary challenge in direct exchangeable density estimation is designing a flexible, invariant architecture which yields a valid likelihood. As explained above, using an i.i.d. assumption to enforce this property will severely hamper the performance of a model. To avoid this simplification, techniques often shoehorn invariance to observed orderings by feeding randomly permuted data into sequential models (Rezatofighi et al. 2017; You et al. 2018). Such approaches attempt to average out the likelihood of the model over all permutations:

$$\frac{1}{n!} \sum_\pi p_s(x_{\pi_1}, \ldots, x_{\pi_n}), \qquad (3)$$

where $p_s$ is some sequential model.
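The marginal dependence derived above can be seen in a concrete Gaussian instance of the generative process: take $\phi \sim \mathcal{N}(0, \tau^2)$ and $x_j \mid \phi \sim \mathcal{N}(\phi, \sigma^2)$ i.i.d., so that marginally $\mathrm{Cov}(x_j, x_k) = \tau^2$ for $j \neq k$. The Monte Carlo sketch below (parameter values are our own illustrative choices) checks this:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma, n, n_sets = 2.0, 1.0, 5, 200_000

# 1) draw latent parameters phi ~ p_Phi, one per set
phi = rng.normal(0.0, tau, size=n_sets)
# 2) draw points i.i.d. given phi: x_j | phi ~ N(phi, sigma^2)
x = rng.normal(phi[:, None], sigma, size=(n_sets, n))

# conditionally i.i.d., yet marginally dependent once phi is
# integrated out: Cov(x_j, x_k) = tau^2 for j != k
emp_cov = np.cov(x[:, 0], x[:, 1])[0, 1]
print(emp_cov)  # close to tau^2 = 4.0
```

An i.i.d. model of the same data would force this covariance to zero, which is the bias the argument above warns about.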
Of course, the observation of all potential orderings for even a modest collection of points is infeasible. Furthermore, there are often no guarantees that the sequential model $p_s$ will learn to ignore orderings, especially for unseen test data (Vinyals, Bengio, and Kudlur 2015).

Given that an i.i.d. assumption is not robust and averaging over all permutations is infeasible, what operation should be used to ensure permutation invariance of the architecture? Instead of attempting to wash out the effect of order in an architecture as in Eq. 3, we propose to enforce invariance by adopting a prespecified ordering and scanning over elements in this order. As will be discussed in the Methods section, the benefit of estimating a likelihood over sorted data is that it frees us from the restriction of exchangeability. Given the sorted data, we can apply any number of traditional density estimators. However, such an approach presents its own challenges:

- Determining a suitable way to scan through an exchangeable sequence. That is, one must map the set $X = \{x_j\}_{j=1}^n$ to a sequence $X \mapsto (x_{[1]}, \ldots, x_{[n]})$, where $x_{[j]}$ denotes the $j$th point in the sorted order.
- Relating the likelihood of the scanned sequence to the likelihood of the exchangeable set. Modeling the exchangeable likelihood through a scanned likelihood is not immediately obvious; a simple equality of the two does not hold, $p(X) \neq p(x_{[1]}, \ldots, x_{[n]})$.
- Scanning in a space that is beneficial for modeling. The native input space may not be best suited for modeling or scanning, hence it would be constructive to transform the exchangeable input prior to the scan.
- Developing an architecture that exploits the structure gained in the scan. The scanning operation will introduce sequential correlations among elements which need to be modeled successfully.

Next, we develop the Flow Scan model while addressing each of these challenges.
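The first challenge, mapping a set to a well-defined sequence, can be sketched as a lexicographic sort. Sorting on the first dimension follows the paper's running example; breaking ties with the remaining dimensions is our own choice to make the map deterministic:

```python
import numpy as np

def scan(x):
    """Map a set (rows of x, in arbitrary order) to its sorted sequence.

    Primary sort key is dimension 0; later dimensions break ties so the
    output sequence is the same for every ordering of the input rows.
    """
    order = np.lexsort(x.T[::-1])  # np.lexsort treats the LAST key as primary
    return x[order]

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))       # a set of 5 points in R^3
perm = rng.permutation(5)

# every ordering of the same set scans to one and the same sequence
assert np.allclose(scan(x), scan(x[perm]))
```

This is exactly the many-to-one behavior that forces the correction factor relating the set likelihood to the scanned likelihood, discussed in the Methods section.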
## Methods

Flow Scan consists of three components: 1) a sequence of equivariant flow transformations ($\hat{q}_e$), to map the data to a space that is easier to model; 2) a sort with a correction factor, to allow for the use of non-exchangeable density estimators; 3) a density estimator ($\hat{p}_s$) (e.g. an autoregressive model, which may utilize sequential flow transformations, $\hat{q}_c$), to estimate the likelihood while accounting for correlations induced by sorting (see Fig. 2). In this section, we motivate each piece of the architecture and detail how they combine to yield a highly flexible, exchangeable density estimator.

Figure 2: Illustration of our proposed method. First, input sets are scanned (in a possibly transformed space). After, the scanned covariates are modeled (possibly in an autoregressive fashion, as shown).

### Equivariant Flow Transformations

Flow Scan first utilizes a sequence of equivariant flow transformations. So-called flow models rely on the change of variables formula to build highly effective models for traditional non-exchangeable generative tasks (like image modeling) (Kingma and Dhariwal 2018). Using the change of variables formula, flow models approximate the likelihood of a $d$-dimensional distribution over real-valued covariates $x = (x^{(1)}, \ldots, x^{(d)}) \in \mathbb{R}^d$ by applying an invertible (flow) transformation $\hat{q}(x)$ to an estimated base distribution $\hat{f}$:

$$\hat{p}(x^{(1)}, \ldots, x^{(d)}) = \left| \det \frac{d\hat{q}}{dx} \right| \hat{f}(\hat{q}(x)), \qquad (4)$$

where $|\det \frac{d\hat{q}}{dx}|$ is the Jacobian determinant of the transformation $\hat{q}$. Often, the base distribution is a standard Gaussian. However, Oliva et al. (2018) recently showed that performance may be improved with a more flexible base distribution on transformed covariates, such as an autoregressive density (Germain et al. 2015; Gregor et al. 2014; Larochelle and Murray 2011; Uria et al. 2016; Uria, Murray, and Larochelle 2013).
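Eq. 4 can be verified numerically with the simplest possible flow. The sketch below uses a hypothetical affine transformation $q(x) = a x + b$ (our illustrative stand-in for a learned flow) with a standard-normal base, so the modeled density must match the $\mathcal{N}(-b/a, 1/a^2)$ density in closed form:

```python
import numpy as np

a, b = 2.0, -1.0  # illustrative affine flow q(x) = a*x + b (invertible for a != 0)

def q(x):
    return a * x + b

def log_f(z):
    # log density of the standard-normal base distribution, summed over dims
    return float(np.sum(-0.5 * (z**2 + np.log(2 * np.pi))))

def log_p(x):
    # Eq. 4 in log form: log p(x) = log|det dq/dx| + log f(q(x));
    # here dq/dx = a*I, so log|det| = d * log|a|
    return x.size * np.log(abs(a)) + log_f(q(x))

# q pushes x ~ N(-b/a, 1/a^2) forward to z ~ N(0, 1): compare densities
x = np.array([0.3, 1.2])
mu, var = -b / a, 1.0 / a**2
closed_form = float(np.sum(-0.5 * np.log(2 * np.pi * var)
                           - (x - mu) ** 2 / (2 * var)))
assert np.isclose(log_p(x), closed_form)
```

The same bookkeeping (base log-density plus log-determinant) carries over unchanged when $q$ is a deep invertible network rather than an affine map.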
There are a myriad of possible invertible transformations, $\hat{q}$, that one may apply to inputs $x \in \mathbb{R}^{n \times d}$ in order to model elements in a more expressive space, where $x$ is the set $X$ represented as a matrix. However, in our case one must take care to preserve exchangeability of the inputs when transforming the data. For example, a simple affine change of variables will be sensitive to the order in which the elements of $x$ were observed, resulting in a space which is no longer exchangeable. One can circumvent this problem by requiring that any transformation, $\hat{q}$, used be equivariant. That is, for all permutation operators, $\Gamma$, we have that $\hat{q}(\Gamma x) = \Gamma \hat{q}(x)$. Proposition 1 states that equivariance of the transformations, in conjunction with invariance of the base distribution, is enough to ensure that exchangeability is preserved, allowing one to model set data in a transformed space. The proof is straightforward and relegated to the Appendix.

Proposition 1. Let $\hat{q} : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ be a permutation equivariant, invertible transformation and the base distribution, $\hat{f}$, be exchangeable. Then the likelihood, $\hat{p}(x) = \left| \det \frac{d\hat{q}}{dx} \right| \hat{f}(\hat{q}(x))$, is exchangeable.

Given an invertible transformation, $q : \mathbb{R}^d \to \mathbb{R}^d$, one may construct a simple permutation equivariant transformation by applying it to each point in a set independently: $(x_1, \ldots, x_n) \mapsto (q(x_1), \ldots, q(x_n))$. However, it is possible to engineer equivariant transformations which utilize information from other points in the set while still preserving equivariance. Proposition 1 shows that Flow Scan is compatible with any combination of these transformations.

### Set-Coupling

Among others, we propose a novel set-level scaling and shifting coupling transformation (Dinh, Sohl-Dickstein, and Bengio 2016). For $d$-dimensional points, the coupling transformation scales and shifts one subset, $S \subseteq \{1, \ldots$
$, d\}$, of the $d$ covariates given the rest, $S^c$, as (letting superscripts index point dimensions):

$$x^{(S)} \mapsto \exp\!\left(f\!\left(x^{(S^c)}\right)\right) \odot x^{(S)} + g\!\left(x^{(S^c)}\right), \qquad x^{(S^c)} \mapsto x^{(S^c)}, \qquad (5)$$

for learned functions $f, g : \mathbb{R}^{|S^c|} \to \mathbb{R}^{|S|}$. We propose a set-coupling transformation as follows:

$$x_i^{(S)} \mapsto \exp\!\left(f\!\left(\varphi(x^{(S^c)}),\, x_i^{(S^c)}\right)\right) \odot x_i^{(S)} + g\!\left(\varphi(x^{(S^c)}),\, x_i^{(S^c)}\right), \qquad x_i^{(S^c)} \mapsto x_i^{(S^c)}, \qquad (6)$$

where $x^{(S^c)} \in \mathbb{R}^{n \times |S^c|}$ is the set of unchanged covariates and $\varphi(x^{(S^c)}) \in \mathbb{R}^r$ are general, learnable permutation invariant set embeddings. We utilize embeddings from Deep Sets architectures (Zaheer et al. 2017); however, other embeddings are possible, e.g. prescribed statistics (Jebara, Kondor, and Howard 2004). Here $f, g : \mathbb{R}^{r+|S^c|} \to \mathbb{R}^{|S|}$ are learned functions. The embedding $\varphi$ is responsible for capturing set-level information from other covariates. This is combined with each point $x_i^{(S^c)}$ to yield shifts and scales with both point- and set-level dependence (see Fig. 3). The log-determinant and inverse are detailed in the Appendix, along with several other examples of flexible, equivariant transformations.

Figure 3: An illustration of how set-coupling transformations act on a set. The first plot shows the input data to be transformed. In the subsequent plots, the set is transformed in an invertible, equivariant fashion by stacking set-coupling transformations. Iteratively transforming dimensions of a set in this way yields a set with simpler structure that may be modeled more easily, as shown in the last plot.

### Invariance Through Sorting

After applying a series of equivariant flow transformations, Flow Scan performs a sort operation and corrects the likelihood with a factor of $1/n!$. Sorting in a prespecified fashion ensures that different permutations of the input map to the same output. In this section, we prove that this yields an analytically correct likelihood and comment on the advantages of such an approach. Specifically, we show that the exchangeable (unordered) likelihood of a set of $n$ points $p_e(x_1, \ldots$
$, x_n)$ (where $x_j \in \mathbb{R}^d$) can be written in terms of the non-exchangeable (ordered) likelihood of the points in a sorted order, $p_s(x_{[1]}, \ldots, x_{[n]})$, as stated in Prop. 2 below.

Proposition 2. Let $p_e$ be an exchangeable likelihood which is continuous and non-degenerate (i.e. $\exists\, j \in \{1, \ldots, d\}$ such that $\Pr[x_1^{(j)} \neq x_2^{(j)} \neq \ldots \neq x_n^{(j)}] = 1$). Then,

$$p_e(x_1, \ldots, x_n) = \frac{1}{n!}\, p_s(x_{[1]}, \ldots, x_{[n]}), \qquad (7)$$

where $x_{[j]}$ is the $j$th point in the sorted order.

Proof. We derive Eq. 7 from a variant of the change of variables formula (Casella and Berger 2002). It states that if we have a partition of our input space, $\{A_j\}_{j=1}^M$, such that a transformation of variables $q$ is invertible on each partition element $A_j$ with inverse $q_j^{-1}$, then we may write the likelihood $f$ of $z = q(u)$ in terms of the likelihood $p$ of the input data $u$ as:

$$f(z) = \sum_{j=1}^M \left| \det \frac{dq_j^{-1}}{dz} \right| p(q_j^{-1}(z)). \qquad (8)$$

For the moment, suppose that the points $\{x_j\}_{j=1}^n$ are sorted according to the first dimension. That is, $x_{[1]}, \ldots, x_{[n]}$ in Eq. 7 are such that $x_{[1]}^{(1)} < \ldots < x_{[n]}^{(1)}$. The act of sorting these points amounts to a transformation of variables $s : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$, $s(x_1, \ldots, x_n) = (x_{[1]}, \ldots, x_{[n]})$. The transformation $s$ is one-to-one on the partitions of the input space $\mathbb{R}^{n \times d}$ defined by the relative order of points. In other words, we may partition the input space according to the permutation that would sort the data: $A_\pi = \{x \in \mathbb{R}^{n \times d} \mid x_{\pi_1}^{(1)} < x_{\pi_2}^{(1)} < \ldots < x_{\pi_n}^{(1)}\}$. We may invert $s$ on $A_\pi$ via the inverse permutation matrix of $\pi$, $\Gamma_\pi^{-1}$. Letting $\Pi$ be the set of all permutations, Eq. 8 yields:

$$p_s(s(x)) \overset{(*)}{=} \sum_{\pi \in \Pi} \left| \det \Gamma_\pi^{-1} \right| p_e(\Gamma_\pi^{-1} s(x)) \overset{(**)}{=} n!\, p_e(x), \qquad (9)$$

where (*) follows from Eq. 8 and (**) follows from the exchangeability of $p_e$ together with $|\det \Gamma_\pi^{-1}| = 1$. Thus, we may compute the exchangeable likelihood $p_e(x)$ using the likelihood of the sorted points, as in Eq. 7. Trivially, similar arguments also hold when sorting according to a dimension other than the first.
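Prop. 2 can be sanity-checked on the simplest exchangeable density we can pick: $n$ i.i.d. Uniform(0, 1) points in one dimension, whose sorted scan has the classical order-statistics density $n! \cdot p_e$ on the sorted region. The sketch below (an illustrative choice of $p_e$, not the learned model) verifies Eq. 7 and the many-to-one nature of the sort:

```python
import numpy as np
from itertools import permutations
from math import factorial

def p_e(x):
    # exchangeable density: n i.i.d. Uniform(0, 1) points
    return float(np.all((x >= 0) & (x <= 1)))

def p_s(x_sorted):
    # density of the sorted scan: n! * p_e on the sorted region, else 0
    n = x_sorted.size
    if np.all(np.diff(x_sorted) > 0):
        return factorial(n) * p_e(x_sorted)
    return 0.0

n = 3
rng = np.random.default_rng(0)
x = rng.uniform(size=n)

# Eq. 7: p_e(x_1, ..., x_n) = (1/n!) * p_s(x_[1], ..., x_[n])
assert np.isclose(p_e(x), p_s(np.sort(x)) / factorial(n))

# the sort s(.) maps all n! orderings of the same set to one sequence,
# which is where the n! correction factor comes from
for pi in permutations(range(n)):
    assert np.array_equal(np.sort(x[list(pi)]), np.sort(x))
```

Because all $n!$ orderings collapse onto one sorted sequence, the sorted density concentrates $n!$ times the mass of the exchangeable density, matching Eq. 9.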
Furthermore, it is possible to sort according to any appropriately transformed space of the $x_j$, rather than a native dimension itself (as this is equivalent to applying a transformation, sorting, and inverting said transformation). Consequently, the exchangeable likelihood may be estimated via an approximation over the scanned covariates: $p_e(x) \approx \frac{1}{n!} \hat{p}_s(s(x))$. Since the density of the sorted scan is not exchangeable, we may estimate $\hat{p}_s$ using traditional density estimation techniques. This gives a principled approach to reduce the problem of exchangeable likelihood estimation to a flat vector (or sequence) likelihood estimation task.

### Autoregressive Scan Likelihood

After performing equivariant flow transformations and sorting, Flow Scan applies a non-exchangeable density estimator to model the transformed and sorted data. Let $z = s(\hat{q}(x)) \in \mathbb{R}^{n \times d}$ be the sorted covariates. Since $z$ is not exchangeable, one can apply any traditional likelihood estimator to its covariates; e.g. one may treat $z$ as a vector and model $\hat{p}_s(\mathrm{vec}(z))$ using a flat density estimator. However, flattening in this way suffers from several disadvantages. First, it is inflexible to varying cardinalities. Furthermore, the total number of covariates, $nd$, may be large for sets with large cardinality or dimensionality. Finally, a general flat model loses the context that covariates are from multiple points in some shared set. To address these challenges, we use an autoregressive likelihood: $\prod_{k=1} \hat{p}(z_k \mid h$