# Semi-Discrete Normalizing Flows through Differentiable Tessellation

Ricky T. Q. Chen (Meta AI, rtqichen@meta.com), Brandon Amos (Meta AI, bamos@meta.com), Maximilian Nickel (Meta AI, maxn@meta.com)

Mapping between discrete and continuous distributions is a difficult task, and many have had to resort to heuristic approaches. We propose a tessellation-based approach that directly learns quantization boundaries in a continuous space, complete with exact likelihood evaluations. This is done by constructing normalizing flows on convex polytopes parameterized using a simple homeomorphism with an efficient log determinant Jacobian. We explore this approach in two application settings, mapping from discrete to continuous and vice versa. Firstly, a Voronoi dequantization allows automatically learning quantization boundaries in a multidimensional space. The location of boundaries and distances between regions can encode useful structural relations between the quantized discrete values. Secondly, a Voronoi mixture model has near-constant computation cost for likelihood evaluation regardless of the number of mixture components. Empirically, we show improvements over existing methods across a range of structured data modalities.

## 1 Introduction

Likelihood-based models have seen increasing usage across multiple data modalities. Across a variety of modeling approaches, the family of normalizing flows stands out because a large amount of structure can be incorporated into the model, aiding its usage in modeling a wide variety of domains such as images [7, 23], graphs [29], invariant distributions [2, 25], and molecular structures [42]. However, the majority of works focus only on continuous functions and continuous random variables. This restriction can make it difficult to apply such models to distributions with implicit discrete structure, e.g. distributions with discrete symmetries, multimodal distributions, or distributions with holes.

In this work, we incorporate discrete structure into standard normalizing flows, while remaining entirely composable with any other normalizing flow. Specifically, we propose a homeomorphism between the unbounded domain and convex polytopes, which are defined through a learnable tessellation of the domain, i.e. a set of disjoint subsets that together fully cover the domain. This homeomorphism is cheap to compute, has a cheap inverse, and results in an efficient formulation of the resulting change in density. In other words, this transformation is highly scalable, at least computationally.

Our method has the potential to be useful for a variety of applications. Firstly, the learned tessellation naturally allows defining a dequantization method for discrete variables. This allows likelihood-based models that normally only act on continuous spaces, such as normalizing flows, to work directly on discrete data in a flexible choice of embedding space, with the ability to capture relational structure. Secondly, if we take each convex polytope as the support of a single mixture component, then the full tessellation defines a mixture model with disjoint components. This allows us to scale to mixtures of normalizing flows while retaining the compute cost of a single model.
Following semi-discrete optimal transport [37], which defines couplings between discrete and continuous measures, we refer to our models as semi-discrete normalizing flows that learn transformations between discrete and continuous random variables.

Figure 1: We propose an invertible mapping $f$ between $\mathbb{R}^D$ and a convex polytope, which is parameterized based on a differentiable Voronoi tessellation of $\mathbb{R}^D$. This mapping adds discrete structure into normalizing flows, and its inverse $f^{-1}$ and log determinant Jacobian can both be efficiently computed.

## 2 Preliminaries

**Normalizing flows.** This family of generative models typically includes any model that makes use of invertible transformations $f$ to map samples between distributions. The relationship between the original and transformed density functions has a closed-form expression,

$$p_z(f(x)) = p_x(x) \left|\det \frac{\partial f(x)}{\partial x}\right|^{-1}. \tag{1}$$

For instance, $p_z$ can be a simple base distribution while $f$ is learned so that the resulting $p_x$ is close to some target distribution. Many different choices of $f$ have been discussed in the literature, leading to different trade-offs and use cases [24, 36]. Furthermore, if the domain and codomain of $f$ are different, then $p_x$ and $p_z$ can have different supports, i.e. regions where probability is non-zero. Currently, existing designs of $f$ that have this property act independently on each dimension, such as the logit transform [7]. To the best of our knowledge, the use of general support-modifying invertible transformations has not been discussed extensively in the literature.

**Dequantization.** In order to model discrete data with density models, a standard approach is to combine them with dequantization methods. These methods provide a correspondence between discrete values and convex subsets, where a discrete variable $y \in \mathcal{Y}$ is randomly placed within a disjoint subset of $\mathbb{R}^D$. There is also a correspondence between the likelihood values of the discrete random variable and the dequantized random variable [44]. Let $A_y$ denote the subset corresponding to $y$ and let $q(x|y)$ be the dequantization model, which has bounded support in $A_y$. Then any density model $p(x)$ with support over $\mathbb{R}^D$ satisfies

$$\log p(y) \geq \mathbb{E}_{x \sim q(x|y)}\big[\log\left(\mathbb{1}[x \in A_y]\, p(x)\right) - \log q(x|y)\big] = \mathbb{E}_{x \sim q(x|y)}\big[\log p(x) - \log q(x|y)\big]. \tag{2}$$

Thus, with an appropriate choice of dequantization, maximizing the likelihood under the density model $p(x)$ is equivalent to maximizing the likelihood under the discrete model $p(y)$. In existing dequantization methods, the value of $D$ and the choice of subsets $A_y$ are entirely dependent on the type of discrete data (ordinal vs. non-ordinal) and, in the non-ordinal case, the number of discrete values $|\mathcal{Y}|$. Furthermore, the subsets $A_y$ do not interact with one another and are fixed during training. In contrast, we conjecture that important relations between discrete values should be modeled as part of the parameterization of $A_y$, and that it would be useful to automatically learn the boundaries of $A_y$ based on gradients.
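To make the bound in Equation (2) concrete, below is a minimal NumPy sketch for the simplest case of uniform dequantization of an ordinal variable, with $A_y = [y, y+1)$; the continuous density used here is a stand-in Gaussian rather than a learned flow, so this is only an illustration of the estimator, not the paper's model.

```python
# Minimal sketch of the dequantization bound in Eq. (2), assuming uniform
# dequantization of an ordinal variable with A_y = [y, y+1). The continuous
# density p(x) is a stand-in unit Gaussian, not a learned normalizing flow.
import numpy as np

rng = np.random.default_rng(0)

def log_p_x(x):
    # Stand-in continuous density over R: a unit-variance Gaussian centered at 2.
    return -0.5 * (x - 2.0) ** 2 - 0.5 * np.log(2.0 * np.pi)

def dequantization_bound(y, num_samples=100_000):
    # q(x|y) = Uniform[y, y+1), so log q(x|y) = 0 on its support and the bound
    # reduces to a Monte Carlo estimate of E_q[log p(x)] <= log p(y).
    x = y + rng.random(num_samples)
    return np.mean(log_p_x(x))

for y in range(4):
    print(f"y={y}  lower bound on log p(y): {dequantization_bound(y):.3f}")
```

With a learned dequantizer $q(x|y)$, the $-\log q(x|y)$ term reappears, and training $q$ jointly with $p$ tightens the bound.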
**Disjoint mixture models.** Building mixture models is one of the simplest methods for creating more flexible distributions from simpler ones, and mixture models can typically be shown to be universal density estimators [12]. However, in practice this often requires a large number of mixture components, which quickly becomes computationally expensive, because evaluating the likelihood under a mixture model requires evaluating the likelihood of each component. To alleviate this issue, Dinh et al. [8] recently proposed to use components with disjoint support. In particular, let $\{A_k\}_{k=1}^K$ be disjoint subsets of $\mathbb{R}^D$, such that each mixture component is defined on one subset and has support restricted to that particular subset. The likelihood of the mixture model then simplifies to

$$p(x) = \sum_{k=1}^{K} p(x|k)\,p(k) = \sum_{k=1}^{K} \mathbb{1}[x \in A_k]\, p(x|k)\,p(k) = p(x \mid k = g(x))\, p(k = g(x)), \tag{3}$$

where $g : \mathbb{R}^D \to \{1, \dots, K\}$ is a set identification function that satisfies $x \in A_{g(x)}$. This framework allows building large mixture models while no longer having a compute cost that scales with $K$. In contrast to variational approaches with discrete latent variables, the use of disjoint subsets provides an exact (log-)likelihood and not a lower bound.

## 3 Voronoi Tessellation for Normalizing Flows

We first discuss how we parameterize each subset as a convex polytope, and then discuss constructing distributions on each convex polytope using a differentiable homeomorphism.

**Parameterizing disjoint subsets.** We separate the domain into subsets through a Voronoi tessellation [48]. This induces a correspondence between each subset and a corresponding anchor point, which provides a differentiable parameterization of the tessellation for gradient-based optimization. Let $\mathcal{X} = \{x_1, \dots, x_K\}$ be a set of anchor points in $\mathbb{R}^D$. The Voronoi cell for each anchor point is

$$V_k \triangleq \{x \in \mathbb{R}^D : \|x_k - x\| < \|x_i - x\|\ \ \forall i \neq k,\ i = 1, \dots, K\}, \tag{4}$$

i.e. it defines the subset containing all points which have the anchor point as their nearest neighbor. Together, these subsets $\{V_k\}_{k=1}^K$ form a Voronoi tessellation of $\mathbb{R}^D$. Each subset can equivalently be expressed in the form of a convex polytope,

$$V_k = \{x \in \mathbb{R}^D : a_i^T x < b_i\ \ \forall i \neq k\}, \quad \text{where } a_i = 2(x_i - x_k),\ \ b_i = \|x_i\|^2 - \|x_k\|^2. \tag{5}$$

For simplicity, we also include box constraints so that all Voronoi cells are bounded in all directions:

$$V_k = \{x \in \mathbb{R}^D : \underbrace{a_i^T x < b_i\ \ \forall i \neq k}_{\text{Voronoi cell}},\ \underbrace{c^{(l)} < x < c^{(r)}}_{\text{box constraints}}\}. \tag{6}$$

Thus, the learnable parameters of this Voronoi tessellation are the anchor points $\{x_1, \dots, x_K\}$ and the box constraints $c^{(l)}, c^{(r)} \in \mathbb{R}^D$. These will be trained using gradients from an invertible transformation and the resulting distribution defined in each Voronoi cell, which we discuss next.
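As a quick sanity check on the half-space form, the following NumPy sketch (not from the paper's released code) verifies that the linear constraints of Equation (5) select exactly the nearest-neighbor region of Equation (4) for a random query point.

```python
# Sketch: the half-space form of a Voronoi cell (Eq. 5) agrees with the
# nearest-neighbor definition (Eq. 4). Anchors and the query point are random.
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 3
anchors = rng.normal(size=(K, D))   # x_1, ..., x_K

def in_cell_halfspaces(x, k):
    # a_i^T x < b_i for all i != k, with a_i = 2(x_i - x_k), b_i = ||x_i||^2 - ||x_k||^2.
    a = 2.0 * (anchors - anchors[k])                      # shape (K, D)
    b = (anchors ** 2).sum(-1) - (anchors[k] ** 2).sum()  # shape (K,)
    mask = np.arange(K) != k
    return bool(np.all(a[mask] @ x < b[mask]))

def in_cell_nearest(x, k):
    return int(np.argmin(np.linalg.norm(anchors - x, axis=-1))) == k

x = rng.normal(size=D)
for k in range(K):
    assert in_cell_halfspaces(x, k) == in_cell_nearest(x, k)
print("nearest cell:", int(np.argmin(np.linalg.norm(anchors - x, axis=-1))))
```

This equivalence is also what lets the set identification function used later in the paper reduce to a nearest-neighbor lookup.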
Figure 2: An illustration.

**Invertible mapping onto subset.** Within the normalizing flow framework, defining a probability distribution on a bounded support is equivalent to defining an invertible transformation that maps from the unbounded support onto the appropriate support. Figure 2 illustrates the transformation of mapping onto a shaded region. Given a cell $V_k$ for some $k \in \{1, \dots, K\}$, we construct an invertible mapping $f_k : \mathbb{R}^D \to V_k$ in two steps. Let $x \in \mathbb{R}^D$. First, if $x = x_k$, the anchor point of $V_k$, we simply set $f_k(x) = x$. Otherwise:

1. Determine where the ray starting from $x_k$ in the direction of $x$ intersects the boundary of $V_k$. First define the direction $\delta_k(x) \triangleq \frac{x - x_k}{\|x - x_k\|}$ and the ray $x(\lambda) \triangleq x_k + \lambda \delta_k(x)$, with $\lambda > 0$. Since $V_k$ is convex, we can frame this as a linear programming problem,

$$\max_{\lambda}\ \lambda \quad \text{s.t.}\ \ x(\lambda) \in \bar{V}_k,\ \ \lambda \geq 0, \tag{7}$$

where $\bar{V}_k$ is the closure of $V_k$. We discuss solving this efficiently in Section 3.1. Let $\lambda^*$ be the solution. This solution exists if $V_k$ is bounded, which is always true due to the use of box constraints. Then $x(\lambda^*)$ will be the point of intersection. This first step solves for the farthest point in the Voronoi cell in the direction of $x$. Using this knowledge, we can now map all points that lie on this ray onto the Voronoi cell.

2. Apply an invertible transformation such that any point on the ray $\{x(\lambda) : \lambda > 0\}$ is mapped onto the line segment $\{x_k + \alpha(x(\lambda^*) - x_k) : \alpha \in (0, 1)\}$. There are many possible choices for designing this transformation. An intuitive choice is to use a monotonic transformation of the relative distance from $x$ to the anchor point $x_k$,

$$f_k(x) \triangleq x_k + \sigma_k\!\left(\frac{\|x - x_k\|}{\|x(\lambda^*) - x_k\|}\right)\big(x(\lambda^*) - x_k\big), \tag{8}$$

where $\sigma_k$ is an appropriate invertible squashing function from $\mathbb{R}_+$ to $(0, 1)$. In our experiments, we use $\sigma_k(h) = \mathrm{softsign}(\gamma_k h)$, where $\gamma_k$ is a learned cell-dependent scale. Other choices should also work, such as a monotonic neural network; however, depending on the application, we may need to compute $\sigma_k^{-1}$ for the inverse mapping, so it is preferable to have an analytical inverse.

### 3.1 Remarks and Propositions

**Box constraints.** There can be continuity problems if a Voronoi cell is unbounded, as the solution to Equation (8) does not exist if $x(\lambda^*)$ diverges. Furthermore, when solving Equation (7), it can be difficult to numerically distinguish between an unbounded cell and one whose boundaries are very far away. It is for these reasons that we introduce box constraints (Equation 6) in the formulation of Voronoi cells, which allows us to sidestep these issues for now.

**Solving for $\lambda^*$.** Equation (7) can be solved numerically, but this approach is prone to numerical errors and requires implicit differentiation through convex optimization solutions [1]. We instead note that the solution of Equation (7) can be expressed in closed form, since it is always the intersection of the ray $x(\lambda)$ with the nearest linear constraint. Let $a_i^T x = b_i$ be the hyperplane representing one of the linear constraints in Equation (6). Let $\lambda_i$ be the intersection of this hyperplane with the ray, i.e. the solution to $a_i^T x(\lambda_i) = b_i$; then the solution is simply the smallest positive $\lambda_i$, which satisfies all the linear constraints:

$$\lambda^* = \min\{\lambda_i : \lambda_i > 0\}, \quad \text{where } \lambda_i = \frac{b_i - a_i^T x_k}{a_i^T \delta_k(x)}. \tag{9}$$

There are a total of $K + 2D - 1$ linear constraints, including the Voronoi cell boundaries and box constraints, which can be computed fully in parallel. This also allows end-to-end differentiation of the mapping $f_k$ via automatic differentiation, providing the ability to learn the parameters of $f_k$, i.e. the parameters of $\sigma_k$ and the Voronoi tessellation.
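Below is a runnable NumPy sketch of the two-step construction: the closed-form $\lambda^*$ of Equation (9) computed over all Voronoi and box constraints, followed by the softsign squashing of Equation (8). The anchors, box, and scale $\gamma_k$ are arbitrary illustrative choices, not learned values.

```python
# Sketch of the forward map f_k (Eqs. 7-9): find where the ray from the anchor
# through x hits the cell boundary in closed form, then squash onto the segment.
# Anchors, box constraints, and the scale gamma are arbitrary choices here.
import numpy as np

rng = np.random.default_rng(0)
K, D = 6, 2
anchors = rng.uniform(-1.5, 1.5, size=(K, D))    # anchors inside the box
c_l, c_r = -2.0 * np.ones(D), 2.0 * np.ones(D)   # box constraints c^(l) < x < c^(r)
gamma = 1.0                                      # cell-dependent scale (fixed here)

def f_k(x, k):
    if np.allclose(x, anchors[k]):
        return x.copy()
    delta = (x - anchors[k]) / np.linalg.norm(x - anchors[k])     # ray direction

    # All linear constraints a_i^T z < b_i: K-1 Voronoi half-spaces + 2D box faces.
    a = np.concatenate([2.0 * (np.delete(anchors, k, 0) - anchors[k]),
                        np.eye(D), -np.eye(D)])
    b = np.concatenate([(np.delete(anchors, k, 0) ** 2).sum(-1) - (anchors[k] ** 2).sum(),
                        c_r, -c_l])

    # Eq. (9): lambda_i solves a_i^T (x_k + lambda_i * delta) = b_i;
    # take the smallest positive intersection.
    denom = a @ delta
    lam = (b - a @ anchors[k]) / np.where(denom == 0.0, np.nan, denom)
    lam_star = np.nanmin(np.where(lam > 0, lam, np.nan))
    boundary = anchors[k] + lam_star * delta                      # x(lambda*)

    # Eq. (8): squash the relative distance through softsign into (0, 1).
    h = np.linalg.norm(x - anchors[k]) / np.linalg.norm(boundary - anchors[k])
    s = gamma * h / (1.0 + gamma * h)
    return anchors[k] + s * (boundary - anchors[k])

x = rng.normal(size=D) * 5.0     # an arbitrary unbounded input
z = f_k(x, k=0)
# The image lands strictly inside cell 0 and inside the box.
assert int(np.argmin(np.linalg.norm(anchors - z, axis=-1))) == 0
assert np.all(z > c_l) and np.all(z < c_r)
print("f_0(x) =", z)
```

In the paper's setting, these operations are written with array primitives so that automatic differentiation can backpropagate through $\lambda^*$, the anchor points, and $\gamma_k$.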
**Homeomorphism with continuous density.** We can show that $f_k$ is a bijection, and that both $f_k$ and $f_k^{-1}$ are continuous. This allows us to use $f_k$ within the normalizing flows framework, as a mapping between a distribution $p_x$ defined on $\mathbb{R}^D$ and the transformed distribution $p_z$ on $V_k$. Furthermore, the Jacobian is continuous almost everywhere. Proofs are in Appendix A.

**Proposition 1.** The mapping $f_k : \mathbb{R}^D \to V_k$ as defined in the 2-step procedure is a homeomorphism.

**Proposition 2.** If $p_x(x)$ is continuous, then the transformed density $p_z(f_k(x))$ is continuous almost everywhere.

**Efficient computation of the log det Jacobian.** As $f_k$ is a mapping in $D$ dimensions, computing the log determinant Jacobian for likelihood computation can be costly and will scale poorly with $D$ if computed naïvely. Instead, we note that the Jacobian of $f_k$ is highly structured. Intuitively, because $f_k$ depends only on the direction $\delta_k(x)$ and the distance away from $x_k$, it only has two degrees of freedom regardless of $D$. In fact, the Jacobian of $f_k$ can be represented as a rank-2 update on a scaled identity matrix. This allows us to use the matrix determinant lemma to reformulate the log determinant Jacobian in a compute- and memory-efficient form. We summarize this in a proposition.

**Proposition 3.** Let the transformation $f_k(x)$ and all intermediate quantities be as defined in Section 3 for some given input $x$. Then the Jacobian factorizes as

$$\frac{\partial f_k(x)}{\partial x} = c\, I + u_1 v_1^T + u_2 v_2^T$$

for some $c \in \mathbb{R}$, $u_i \in \mathbb{R}^D$, $v_i \in \mathbb{R}^D$, and its log determinant has the form

$$\log\left|\det \frac{\partial f_k(x)}{\partial x}\right| = \log|1 + w_{11}| + \log\left|1 + w_{22} - \frac{w_{12} w_{21}}{1 + w_{11}}\right| + D \log c, \tag{11}$$

where $w_{ij} = c^{-1} v_i^T u_j$. This expression for the log determinant only requires inner products between vectors in $\mathbb{R}^D$. To reduce notational clutter, exact formulas for $c, u_1, u_2, v_1, v_2$ are given in Appendix A. All vectors used in computing Equation (11) are either readily available as intermediate quantities after computing $f_k(x)$ or are gradients of scalar functions and can be efficiently computed through reverse-mode automatic differentiation. The only operations on these vectors are dot products, and no large $D$-by-$D$ matrices are ever constructed. Compared to explicitly constructing the Jacobian matrix, this is more efficient in both compute and memory, and can readily scale to large values of $D$.
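The exact formulas for $c$, $u_i$, $v_i$ live in Appendix A and are not reproduced here, but the rank-2 determinant identity underlying Equation (11) can be checked numerically with arbitrary vectors, as in the NumPy sketch below.

```python
# Numerical check of the rank-2 determinant identity behind Eq. (11):
# log|det(c*I + u1 v1^T + u2 v2^T)| computed with dot products only, compared
# against a dense determinant. The vectors are arbitrary, not the Appendix A
# quantities.
import numpy as np

rng = np.random.default_rng(0)
D = 50
c = 0.7
u1, u2, v1, v2 = rng.normal(size=(4, D))

J = c * np.eye(D) + np.outer(u1, v1) + np.outer(u2, v2)
logdet_dense = np.linalg.slogdet(J)[1]

w = lambda vi, uj: (vi @ uj) / c                 # w_ij = c^{-1} v_i^T u_j
w11, w12, w21, w22 = w(v1, u1), w(v1, u2), w(v2, u1), w(v2, u2)
logdet_lemma = (np.log(abs(1 + w11))
                + np.log(abs(1 + w22 - w12 * w21 / (1 + w11)))
                + D * np.log(c))

assert np.allclose(logdet_dense, logdet_lemma)
print(logdet_dense, logdet_lemma)
```

With $c$, $u_i$, $v_i$ obtained as intermediate quantities or gradients of scalar functions, the same three dot products give the exact log determinant without ever forming the $D$-by-$D$ Jacobian.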
Figure 3 (panels: Ordinal, Simplex, Binary Argmax, Voronoi): Existing dequantization methods can be seen as special cases of Voronoi dequantization with fixed anchor points. In addition to decoupling the dimension of the embedding space from the number of discrete values, the Voronoi dequantization can learn to model the similarities of discrete values through the positions and boundaries of their Voronoi cells.

## 4 Application settings

We discuss two applications of our method to likelihood-based modeling of discrete and continuous data. The first is a new approach to dequantization which allows training a model of discrete data using density models that normally only act on continuous variables, such as normalizing flows. Compared to existing dequantization methods [7, 18, 38], the Voronoi dequantization is not restricted to fixed dimensions and can benefit from learning similarities between discrete values. In fact, existing approaches can be seen as special cases of a Voronoi dequantization. The second is a formulation of disjoint mixture models, where each component lies strictly within each subset of a Voronoi tessellation. To the best of our knowledge, disjoint mixture models have not been explored significantly in the literature and have not been successfully applied in more than a couple of dimensions. Compared to an existing method [8], the Voronoi mixture model is not restricted to acting individually on each dimension.

**Voronoi dequantization.** Let $y$ be a discrete random variable with finite support $\mathcal{Y}$. Then we can use tessellation to define subsets for dequantizing $y$ as long as there are at least as many Voronoi cells (equivalently, anchor points) as the number of discrete values, i.e. $K \geq |\mathcal{Y}|$. By assigning each discrete value to a Voronoi cell, we can then define a dequantization model $q(x|y)$ by first sampling from a base distribution $z \sim q(z|y)$ in $\mathbb{R}^D$ and then applying the mapping $x = f_k(z)$ from Section 3 to construct a distribution over $V_k$. We can then obtain probabilities $q(x|y)$ efficiently and train the dequantization alongside the density model $p(x)$ on $\mathbb{R}^D$. Sampling from the model $p(y|x)$ is straightforward and deterministic after sampling $x$: we can write $y = g(x)$, where $g$ is the set identification function satisfying $x \in V_{g(x)}$. From Equation (4), it is easy to see that $g(x)$ is the nearest neighbor operation, $g(x) = \arg\min_k \|x - x_k\|$.

We are free to choose the number of dimensions $D$, where a smaller $D$ assigns less space to each Voronoi cell and induces dequantized distributions that are easier to fit, while a larger $D$ allows the anchor points and Voronoi cells more room to change over the course of training. When the anchor points are fixed to specific positions, we can recover the disjoint subsets used by prior methods as special cases. We illustrate this in Figure 3.

**Voronoi mixture models.** Let $\{V_k\}$ be a Voronoi tessellation of $\mathbb{R}^D$. Then a disjoint mixture model can be constructed by defining distributions on each Voronoi cell. Here we make use of the inverse mapping $f_k^{-1} : V_k \to \mathbb{R}^D$ (discussed in Appendix B) so that we only need to parameterize distributions over $\mathbb{R}^D$. Let $x$ be a point in $\mathbb{R}^D$; our Voronoi mixture model assigns the density

$$\log p_{\text{mix}}(x) = \log p_{\text{comp}}\big(f_{k=g(x)}^{-1}(x) \mid k = g(x)\big) + \log\left|\det \frac{\partial f_{k=g(x)}^{-1}(x)}{\partial x}\right| + \log p(k = g(x)), \tag{12}$$

where $p_{\text{comp}}$ can be any distribution over $\mathbb{R}^D$, including another disjoint mixture model. This can easily be composed with normalizing flow layers, where, in addition to the change of variables due to $f_k^{-1}$, we also apply the change in density resulting from choosing one out of $K$ components.
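Since Appendix B is not reproduced here, the following NumPy sketch inverts Equation (8) directly from the construction in Section 3: the ray direction is preserved by $f_k$, so $\lambda^*$ can be recomputed from the output point and the softsign undone analytically. Treat it as an assumed reconstruction of $f_k^{-1}$; the anchors, box, and $\gamma_k$ are arbitrary.

```python
# Sketch of an inverse map f_k^{-1}, reconstructed by inverting Eq. (8): the ray
# direction is unchanged by f_k, so lambda* is recomputed from the output point
# and the softsign is inverted analytically. Not the paper's Appendix B code.
import numpy as np

rng = np.random.default_rng(0)
K, D = 6, 2
anchors = rng.uniform(-1.5, 1.5, size=(K, D))
c_l, c_r = -2.0 * np.ones(D), 2.0 * np.ones(D)
gamma = 1.0

def lambda_star(k, delta):
    # Closed-form distance from the anchor to the cell boundary along delta (Eq. 9).
    a = np.concatenate([2.0 * (np.delete(anchors, k, 0) - anchors[k]),
                        np.eye(D), -np.eye(D)])
    b = np.concatenate([(np.delete(anchors, k, 0) ** 2).sum(-1) - (anchors[k] ** 2).sum(),
                        c_r, -c_l])
    denom = a @ delta
    lam = (b - a @ anchors[k]) / np.where(denom == 0.0, np.nan, denom)
    return np.nanmin(np.where(lam > 0, lam, np.nan))

def f(x, k):
    # Forward map of Eq. (8) with softsign squashing.
    r = np.linalg.norm(x - anchors[k])
    delta = (x - anchors[k]) / r
    lam = lambda_star(k, delta)
    s = gamma * (r / lam) / (1.0 + gamma * (r / lam))
    return anchors[k] + s * lam * delta

def f_inv(z, k):
    # Inverse: same ray, undo the softsign, rescale back to the original radius.
    rho = np.linalg.norm(z - anchors[k])
    delta = (z - anchors[k]) / rho
    lam = lambda_star(k, delta)
    s = rho / lam                                   # value of the squashing function
    h = s / (gamma * (1.0 - s))                     # softsign^{-1}
    return anchors[k] + h * lam * delta

x = rng.normal(size=D) * 3.0
assert np.allclose(f_inv(f(x, k=2), k=2), x)
print("round trip ok")
```

With this inverse and the log determinant from Proposition 3, Equation (12) reduces to a single nearest-neighbor lookup followed by one component evaluation.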
## 5 Related Work

**Normalizing flows for discrete data.** Invertible mappings have been proposed for discrete data, where discrete values are effectively rearranged from a factorized distribution. In order to parameterize the transformation, Hoogeboom et al. [17] and van den Berg et al. [47] use quantized ordinal transformations, while Tran et al. [45] take a more general approach of using modulo addition on one-hot vectors. These approaches suffer from gradient bias due to the need to use discontinuous operations, and do not have universality guarantees since it is unclear whether simple rearrangement is sufficient to transform any joint discrete distribution into a factorized distribution. In contrast, the dequantization approach provides a universality guarantee, since the lower bound in Equation (2) is tight when $p(x)\,\mathbb{1}[x \in A_y] \propto q(x|y)$, with a proportionality constant equal to $p(y)$.

**Dequantization methods.** Within the normalizing flows literature, the act of adding noise was originally used for ordinal data as a way to combat numerical issues [7, 46]. Later on, appropriate dequantization approaches were shown to lower bound the log-likelihood of a discrete model [16, 44]. For non-ordinal data, many works have proposed simplex-based approaches. Early works on relaxations [20, 32] proposed continuous distributions on the simplex that mimic the behaviors of discrete random variables; however, these were only designed for use with a Gumbel base distribution. Potapczynski et al. [38] extend this to a Gaussian distribution, although it is not hard to see this can work with any base distribution by designing invertible transformations between $\mathbb{R}^D$ and the probability simplex with $K$ vertices, with $D = K - 1$, where $K$ is the number of classes of a discrete random variable. Intuitively, after fixing one of the logits, the softmax operation is an invertible transformation onto the bounded domain $\Delta^{K-1} = \{x \in \mathbb{R}^K : \sum_{i=1}^{K} x_i = 1,\ x_i > 0\}$. The $(K-1)$-simplex can then be broken into $K$ subsets, each corresponding to a particular discrete value,

$$A_k = \{x \in \Delta^{K-1} : x_k > x_i\ \ \forall i \neq k\}. \tag{13}$$

More recently, Hoogeboom et al. [18] proposed ignoring the simplex constraint and simply using

$$A_k = \{x \in \mathbb{R}^K : x_k > x_i\ \ \forall i \neq k\}, \tag{14}$$

which effectively increases the number of dimensions by one compared to the simplex approach. However, both approaches force the dimension of the continuous space $D$ to scale with $K$. In order to make their approach work when $K$ is large, Hoogeboom et al. [18] propose binarizing the data as a preprocessing step. In contrast, Voronoi dequantization has full flexibility in choosing $D$ regardless of $K$.

**Related models for discrete data.** Among other related methods, a recent work proposed normalizing flows on truncated supports [43], but had to resort to rejection sampling for training; furthermore, their subsets are not disjoint by construction. Prior works [28, 31] also proposed removing the constraint that subsets are disjoint, and instead work with general mixture models with unbounded support, relying on the conditional model $p(y|x)$ being sufficiently weak so that the task of modeling is forced onto a flow-based prior. They have achieved good performance on a number of tasks, similar to general approaches that combine normalizing flows with variational inference [19, 33, 50]. However, they lose the computational savings and the deterministic decoder $p(y|x)$ gained from using disjoint subsets. On the other hand, quantization based on nearest neighbors has been used for learning discrete latent variables [34, 40], but no likelihoods are constructed, the boundaries are not explicitly differentiated through, and the model relies on training with heuristic updates.

**Disjoint mixture models.** The computational savings from using disjoint subsets were pointed out by Dinh et al. [8]. However, their method only works on each dimension individually. They transform samples using a linear spline, which is equivalent to creating subsets based on the knots of the spline and applying a linear transformation within each subset. Furthermore, certain parameterizations of the spline can lead to discontinuous density functions, whereas our disjoint mixture has a density function that is continuous almost everywhere (albeit exactly zero on the boundaries; see Section 7 for more discussion). The use of monotonic splines has previously been combined with normalizing flows [10], but interestingly, since the splines in Dinh et al. [8] are not enforced to be monotonic, the full transformation is not bijective and acts like a disjoint mixture model. Ultimately, their experiments were restricted to two-dimensional synthetic data sets, and it remained an interesting research question whether disjoint mixture models can be successfully applied in high dimensions.

Figure 4 (panels: Discrete Flow, Voronoi Flow): Quantized 2D data. Voronoi dequantization can model complex relations between discrete values. No knowledge of ordering is given to the models.

Figure 5 (dequantized samples): The model learns to cluster discrete values with similar probability values.

## 6 Experiments

We experimentally validate our semi-discrete approach of combining Voronoi tessellation with likelihood-based modeling on a variety of data domains: discrete-valued UCI data sets, itemset modeling, language modeling, and disjoint mixture modeling for continuous UCI data. The goal of these experiments is not to show state-of-the-art results in these domains but to showcase the relative merit of Voronoi tessellation compared to existing methods. For this reason, we use simple coupling blocks with affine transformations [6, 7] as the base flow model in our experiments, unless stated otherwise. When comparing to Discrete Flows [45], we use their bipartite layers.
Details regarding preprocessing of the data sets can be found in Appendix C, and the detailed experimental setup is in Appendix D. All tables, when shown, display standard deviations across three random seeds. Open source code is available at https://github.com/facebookresearch/semi-discrete-flow.

### 6.1 Quantized 2D data

We start by considering synthetic distributions over two discrete variables, to qualitatively compare against Discrete Flows [45]. Figure 4 shows the target probability mass functions, which are created by a 2D distribution where each dimension is quantized into 91 discrete values. Though the data is ordinal, no knowledge of ordering is given to the models. We see that Discrete Flows has trouble fitting these, because it is difficult to rearrange the target distribution into a factorized distribution. For our model, we dequantize each discrete variable with a Voronoi tessellation in $\mathbb{R}^2$. We then learn a flow model on the combined $\mathbb{R}^4$, parameterized by multilayer perceptrons (MLPs). In Figure 5 we visualize the learned Voronoi tessellation and samples from our model. The learned tessellation seems to group some of the discrete values that occur frequently together, reducing the number of modes.

### 6.2 Discrete-valued UCI data

We experiment with complex data sets where each discrete variable can have a varying number of classes. Furthermore, these discrete variables may have hidden semantics. To this end, we use unprocessed data sets from the UCI database [9]. The only pre-processing we perform is to remove variables that have only one discrete value. We then take 80% as train, 10% as validation, and 10% as the test set. Most of these data sets have a combination of both ordinal and non-ordinal variables, and we expect the non-ordinal variables to exhibit relations that are unknown (e.g. spatial correlations).

Table 1: Discrete UCI data sets. Negative log-likelihood results on the test sets in nats.

| Method | Connect4 | Forests | Mushroom | Nursery | Poker Hands | USCensus90 |
|---|---|---|---|---|---|---|
| Voronoi Deq. | 12.92 ± 0.07 | 14.20 ± 0.05 | 9.06 ± 0.05 | 9.27 ± 0.04 | 19.86 ± 0.04 | 24.19 ± 0.12 |
| Simplex Deq. | 13.46 ± 0.01 | 16.58 ± 0.01 | 9.26 ± 0.01 | 9.50 ± 0.00 | 19.90 ± 0.00 | 28.09 ± 0.08 |
| Binary Argmax Deq. | 13.71 ± 0.04 | 16.73 ± 0.17 | 9.53 ± 0.01 | 9.49 ± 0.00 | 19.90 ± 0.01 | 27.23 ± 0.02 |
| Discrete Flow | 19.80 ± 0.01 | 21.91 ± 0.01 | 22.06 ± 0.01 | 9.53 ± 0.01 | 19.82 ± 0.03 | 55.62 ± 0.35 |

Table 2: Permutation-invariant discrete itemset modeling.

| Model (Dequantization) | Retail (nats) | Accidents (nats) |
|---|---|---|
| CNF (Voronoi) | 9.44 ± 2.34 | 7.81 ± 2.84 |
| CNF (Simplex) | 24.16 ± 0.21 | 19.19 ± 0.01 |
| CNF (Binary Argmax) | 10.47 ± 0.42 | 6.72 ± 0.23 |
| Determinantal Point Process | 20.35 ± 0.05 | 15.78 ± 0.04 |

Table 3: Language modeling.

| Dequantization | text8 (bpc) | enwik8 (bpc) |
|---|---|---|
| Voronoi (D=2) | 1.39 ± 0.01 | 1.46 ± 0.01 |
| Voronoi (D=4) | 1.37 ± 0.00 | 1.41 ± 0.00 |
| Voronoi (D=6) | 1.37 ± 0.00 | 1.40 ± 0.00 |
| Voronoi (D=8) | 1.36 ± 0.00 | 1.39 ± 0.01 |
| Binary Argmax [18] | 1.38 | 1.42 |
| Ordinal [18] | 1.43 | 1.44 |

We see that Discrete Flows can be competitive with dequantization approaches, but can also fall short on more challenging data sets such as USCensus90, the largest data set we consider, with 2.7 million examples and 68 different discrete variables of varying types. Among the dequantization baselines, the simplex [38] and binary argmax [18] approaches are mostly on par. We do see a non-negligible gap in performance between these baselines and Voronoi dequantization for most of the data sets, likely due to the ability to learn semantically useful relations between the values of each discrete variable.
For instance, the Connect4 data set contains positions of a two-player board game of the same name, which exhibit complex spatial dependencies between the board pieces, and the USCensus90 data set contains highly correlated variables and is often used to test clustering algorithms.

### 6.3 Permutation-invariant itemset modeling

An appeal of using normalizing flows in continuous space is the ability to incorporate invariance to specific group symmetries into the density model. For instance, this can be done by ensuring the ordinary differential equation is equivariant [2, 25, 42] in a continuous normalizing flow [4, 13] framework. Here we focus on invariance with respect to permutations, i.e. sets of discrete variables. This invariance cannot be explicitly modeled by Discrete Flows, as they require an ordering or a bipartition of discrete variables. We preprocessed a data set of retail market basket data from an anonymous Belgian retail store [3] and a data set of anonymized traffic accident data [11], which contain sets of discrete variables with 765 and 269 possible values, respectively. Without the binarization trick, simplex dequantization must use a large embedding dimension and performs worse than a determinantal point process baseline [26]. Meanwhile, our Voronoi dequantization has no problem staying competitive, as its embedding space is freely decoupled from the number of discrete values.

### 6.4 Language modeling

Language modeling is a widely used benchmark for discrete models [18, 28, 45]. Here we used the open source code provided by Hoogeboom et al. [18] with the exact same autoregressive flow model and optimizer setups. The only difference is replacing their binary argmax dequantization with our Voronoi dequantization. Results are shown in Table 3, where we tried multiple embedding dimensions $D$. Generally, we find $D = 2$ to be too low, which can stagnate training since the Voronoi cells are more constrained, while larger values of $D$ improve upon argmax dequantization.

### 6.5 Voronoi mixture models

We test disjoint mixture modeling with Voronoi tessellation on continuous data sets. Figure 6 shows results of training a simple normalizing flow on 2D data, which has trouble completely separating all the modes. Adding in a Voronoi mixture allows us to better fit these multimodal densities.

Figure 6: Tessellation is done in a transformed space; nonlinear boundaries are shown.

Table 4: Disjoint mixture modeling. Negative log-likelihood on the test sets in nats. Baseline results from [13, 35].

| Method | POWER | GAS | HEPMASS | MINIBOONE | BSDS300 |
|---|---|---|---|---|---|
| Real NVP | -0.17 ± 0.01 | -8.33 ± 0.07 | 18.71 ± 0.01 | 13.55 ± 0.26 | -153.28 ± 0.89 |
| MAF | -0.24 ± 0.01 | -10.08 ± 0.01 | 17.73 ± 0.01 | 12.24 ± 0.22 | -154.93 ± 0.14 |
| FFJORD | -0.46 ± 0.01 | -8.59 ± 0.12 | 14.92 ± 0.08 | 10.43 ± 0.22 | -157.40 ± 0.19 |
| Base Coupling Flow | -0.44 ± 0.01 | -11.75 ± 0.02 | 16.78 ± 0.08 | 10.87 ± 0.06 | -155.14 ± 0.04 |
| Voronoi Disjoint Mixture | -0.52 ± 0.01 | -12.63 ± 0.05 | 16.16 ± 0.01 | 10.24 ± 0.14 | -156.59 ± 0.14 |

To test Voronoi mixture models in higher dimensions, we apply our approach to data sets preprocessed by Papamakarios et al. [35], which range from 6 to 63 dimensions. We add a disjoint mixture component after a few layers of normalizing flows, with each component modeled by a normalizing flow conditioned on an embedding of the component index. The number of mixture components ranges from 16 to 64, with complete sweeps and hyperparameters detailed in Appendix D.
For comparison, we also show results from the baseline coupling flow, which uses the same number of total layers as the disjoint mixture, as well as a strong but computationally costly density model, FFJORD [13]. From Table 4, we see that the disjoint mixture model approach allows us to increase complexity quite a bit, with almost no additional cost compared to the baseline flow model.

## 7 Conclusion and Discussion

We combine Voronoi tessellation with normalizing flows to construct a new invertible transformation that has learnable discrete structure. This acts as a learnable mapping between discrete and continuous distributions. We propose two applications of this method: a Voronoi dequantization that maps discrete values into a learnable convex polytope, and a Voronoi mixture modeling approach that has around the same compute cost as a single component. We showcased the relative merit of our approach across a range of data modalities with varying hidden structure. Below, we discuss some limitations and possible future directions.

**Diminishing density on boundaries.** The distributions within each Voronoi cell necessarily go to zero on the boundary between cells, due to the use of a homeomorphism from an unbounded domain. This is a different and undesirable property compared to the partitioning method used by Dinh et al. [8]. However, alternatives could require more compute cost in the form of solving for normalization constants. Balancing expressiveness and compute cost is a delicate problem that sits at the core of probabilistic machine learning.

**Design of homeomorphisms and tessellations.** Our proposed homeomorphism is simple and computationally scalable, but this comes at the cost of smoothness and expressiveness. As depicted in Figure 1, the transformed density has a non-smooth probability density function. This non-smoothness appears where the ray intersects with multiple boundaries. This may make optimization more difficult, as the gradients of the log-likelihood objective can be discontinuous. Additionally, our use of Euclidean distance can become problematic in high dimensions, as this metric can cause almost all points to be nearly equidistant, resulting in a behavior where all points lie seemingly very close to the boundary. Improvements to the design of homeomorphisms onto bounded domains could help alleviate these problems. Additionally, more flexible tessellations such as the Laguerre tessellation and additional concepts from semi-discrete optimal transport [14, 27, 37] may be adapted to improve semi-discrete normalizing flows.

## References

[1] Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, Z. Differentiable convex optimization layers. Advances in Neural Information Processing Systems, 2019.

[2] Biloš, M. and Günnemann, S. Scalable normalizing flows for permutation invariant densities. In International Conference on Machine Learning, pp. 957-967. PMLR, 2021.

[3] Brijs, T., Swinnen, G., Vanhoof, K., and Wets, G. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pp. 254-260, 1999.

[4] Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018.

[5] Chen, R. T. Q. torchdiffeq, 2018. URL https://github.com/rtqichen/torchdiffeq.

[6] Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[7] Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. International Conference on Learning Representations, 2017.

[8] Dinh, L., Sohl-Dickstein, J., Larochelle, H., and Pascanu, R. A RAD approach to deep mixture models. arXiv preprint arXiv:1903.07714, 2019.

[9] Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[10] Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. Advances in Neural Information Processing Systems, 32:7511-7522, 2019.

[11] Geurts, K., Wets, G., Brijs, T., and Vanhoof, K. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, Washington DC (USA), January 12-16, 18 pp., 2003.

[12] Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT Press, 2016.

[13] Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

[14] Gruber, P. M. Optimum quantization and its applications. Advances in Mathematics, 186(2):456-497, 2004.

[15] Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

[16] Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pp. 2722-2730. PMLR, 2019.

[17] Hoogeboom, E., Peters, J. W., Berg, R. v. d., and Welling, M. Integer discrete flows and lossless compression. Advances in Neural Information Processing Systems, 2019.

[18] Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. Advances in Neural Information Processing Systems, 2021.

[19] Huang, C.-W., Dinh, L., and Courville, A. Augmented normalizing flows: Bridging the gap between generative flows and latent variable models. arXiv preprint arXiv:2002.07101, 2020.

[20] Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations, 2017.

[21] Kim, H., Papamakarios, G., and Mnih, A. The Lipschitz constant of self-attention. In International Conference on Machine Learning, pp. 5562-5571. PMLR, 2021.

[22] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

[23] Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 2018.

[24] Kobyzev, I., Prince, S., and Brubaker, M. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[25] Köhler, J., Klein, L., and Noé, F. Equivariant flows: Exact likelihood generative learning for symmetric densities. In International Conference on Machine Learning, pp. 5361-5370. PMLR, 2020.

[26] Kulesza, A. and Taskar, B. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123-286, 2012. ISSN 1935-8237. doi: 10.1561/2200000044.

[27] Lévy, B. A numerical algorithm for L2 semi-discrete optimal transport in 3D. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1693-1715, 2015.

[28] Lippe, P. and Gavves, E. Categorical normalizing flows via continuous transformations. International Conference on Learning Representations, 2021.
[29] Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. Advances in Neural Information Processing Systems, 2019.

[30] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. International Conference on Learning Representations, 2019.

[31] Ma, X., Zhou, C., Li, X., Neubig, G., and Hovy, E. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. arXiv preprint arXiv:1909.02480, 2019.

[32] Maddison, C. J., Mnih, A., and Teh, Y. W. The Concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations, 2017.

[33] Nielsen, D., Jaini, P., Hoogeboom, E., Winther, O., and Welling, M. SurVAE flows: Surjections to bridge the gap between VAEs and flows. Advances in Neural Information Processing Systems, 33, 2020.

[34] Oord, A. v. d., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.

[35] Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 30, 2017.

[36] Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

[37] Peyré, G., Cuturi, M., et al. Computational optimal transport. Center for Research in Economics and Statistics Working Papers, (2017-86), 2017.

[38] Potapczynski, A., Loaiza-Ganem, G., and Cunningham, J. P. Invertible Gaussian reparameterization: Revisiting the Gumbel-softmax. arXiv preprint arXiv:1912.09588, 2019.

[39] Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

[40] Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pp. 14866-14876, 2019.

[41] Rudin, W. et al. Principles of mathematical analysis, volume 3. McGraw-Hill, New York, 1964.

[42] Satorras, V. G., Hoogeboom, E., Fuchs, F. B., Posner, I., and Welling, M. E(n) equivariant normalizing flows. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021.

[43] Tan, S., Huang, C.-W., Sordoni, A., and Courville, A. Learning to dequantise with truncated flows. In International Conference on Learning Representations, 2022.

[44] Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. International Conference on Learning Representations, 2016.

[45] Tran, D., Vafa, K., Agrawal, K., Dinh, L., and Poole, B. Discrete flows: Invertible generative models of discrete data. Advances in Neural Information Processing Systems, 32:14719-14728, 2019.

[46] Uria, B., Murray, I., and Larochelle, H. RNADE: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems, 26, 2013.

[47] van den Berg, R., Gritsenko, A. A., Dehghani, M., Sønderby, C. K., and Salimans, T. IDF++: Analyzing and improving integer discrete flows for lossless compression. In International Conference on Learning Representations, 2020.

[48] Voronoi, G. Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Premier mémoire. Sur quelques propriétés des formes quadratiques positives parfaites. Journal für die reine und angewandte Mathematik (Crelles Journal), 1908(133):97-102, 1908.
[49] Zhang, G., Wang, C., Xu, B., and Grosse, R. Three mechanisms of weight decay regularization. International Conference on Learning Representations, 2019.

[50] Ziegler, Z. and Rush, A. Latent normalizing flows for discrete sequences. In International Conference on Machine Learning, pp. 7673-7682. PMLR, 2019.