# Invertible Convolutional Flow

Mahdi Karami, Department of Computer Science, University of Alberta, karami1@ualberta.ca
Jascha Sohl-Dickstein, Dale Schuurmans, Laurent Dinh, Daniel Duckworth, Google Brain

## Abstract

Normalizing flows can be used to construct high quality generative probabilistic models, but training and sample generation require repeated evaluation of Jacobian determinants and function inverses. To make such computations feasible, current approaches employ highly constrained architectures that produce diagonal, triangular, or low rank Jacobian matrices. As an alternative, we investigate a set of novel normalizing flows based on circular and symmetric convolutions. We show that these transforms admit efficient Jacobian determinant computation and inverse mapping (deconvolution) in $O(N \log N)$ time. Additionally, element-wise multiplication, widely used in normalizing flow architectures, can be combined with these transforms to increase modeling flexibility. We further propose an analytic approach to designing nonlinear elementwise bijectors that induce special properties in the intermediate layers, by implicitly introducing specific regularizers in the loss. We show that these transforms allow more effective normalizing flow models to be developed for generative image models.

## 1 Introduction

Flow-based generative networks have shown tremendous promise for modeling complex observations in high dimensional datasets. In flow-based models, a complex probability density is constructed by transforming a simple base density, such as a standard normal distribution, via a chain of smooth, invertible mappings (bijections), to yield a normalizing flow. Such models are employed in various contexts, including approximating a complex posterior distribution in variational inference [Rezende and Mohamed, 2015], or for density estimation with generative models [Dinh et al., 2016].

Using a complex transformation (bijective function) to define a normalized density requires the computation of a Jacobian determinant, which is generally impractical for arbitrary neural network transformations. To overcome this difficulty and enable fast computation, previous work has carefully designed architectures that produce simple Jacobian forms. For example, [Rezende and Mohamed, 2015, Berg et al., 2018] consider transformations with a Jacobian that corresponds to low rank perturbations of a diagonal matrix, enabling the use of Sylvester's determinant lemma. Other works, such as [Dinh et al., 2014, 2016, Kingma et al., 2016, Papamakarios et al., 2017], use a constrained transformation where the Jacobian has a triangular structure. The latter approach has proved particularly successful, since this constraint is easy to enforce without major sacrifices in expressiveness or computational efficiency. More recently, Kingma and Dhariwal [2018] propose the use of 1×1 convolutions for cross channel mixing in a multi-channel signal, achieving tractability via a block diagonal Jacobian. Nevertheless, these models have overlooked some opportunities for formulating tractable normalizing flows that can enhance expressiveness and better capture the structure of natural data, such as images and audio. Also, a new line of work based on ordinary differential equations has emerged recently that offers promising continuous dynamics based flows [Grathwohl et al., 2019].
In this work, we propose an alternative nonlinear convolution layer, the nonlinear adaptive convolution filter, where expressiveness is increased by allowing a layer's kernel to adapt to the layer's input. The idea is to partition the input of a layer $x$ into $\{x_1, x_2\}$, where the convolution updates $x_2$ as $w(x_1) \circledast x_2$, while the kernel $w(x_1)$ is a function of $x_1$ that can be expressed by a deep neural network. We present invertible convolution operators whose Jacobian can be computed efficiently, making this approach practical for normalizing flows. Unlike the causal convolution employed in [van den Oord et al., 2016] to generate audio waveforms, or in [Zheng et al., 2017] to approximate the posterior in a variational autoencoder, the proposed transformations are not constrained to depend only on the preceding input variables, and they also offer an analytic, efficient inverse mapping, also known as deconvolution. Circular convolution has also recently been adopted in [Karami et al., 2018] as a normalizing flow for density estimation and in [Hoogeboom et al., 2019] to design invertible periodic convolutions for (almost) periodic data. Furthermore, we propose an analytic approach to adding invertible pointwise nonlinearities to the flow that implicitly induce specific regularizers on the intermediate layers.

## 2 Background

Given a random variable $z \sim p(z)$ and an invertible and differentiable mapping $g : \mathbb{R}^n \to \mathbb{R}^n$, with inverse mapping $f = g^{-1}$, the probability density function of the transformed variable $x = g(z)$ can be recovered by the change of variable rule as $p(x) = p(z)\,|\det J_g|^{-1} = p(f(x))\,|\det J_f|$. Here $J_g = \frac{\partial g}{\partial z}$ and $J_f = \frac{\partial f}{\partial x}$ are the Jacobian matrices of the functions $g$ and $f$, respectively. One can use these to build a complex mapping $g$ by composing a chain of simple bijective maps, $g = g^{(1)} \circ g^{(2)} \circ \dots \circ g^{(K)}$, that preserve invertibility, with the inverse mapping being $f = f^{(K)} \circ f^{(K-1)} \circ \dots \circ f^{(1)}$. By applying the chain rule to the Jacobian of the composition, and using the fact that $\det AB = \det A \det B$, the log-likelihood equality (LLE) can be written as

$$\log p(x) = \log p(z) + \sum_{k=1}^{K} \log |\det J_{f^{(k)}}| . \qquad (1)$$

Evaluating the Jacobian determinant is the main computational bottleneck in (1) since, in general, its cost scales cubically in the size of the input. It is therefore natural to seek structured transformations that mitigate this cost while retaining useful modeling flexibility.¹

### 2.1 Toeplitz structure and circular convolution

Although available methods have typically considered bijections whose Jacobians have block-diagonal or triangular forms, these are not the only useful possibilities. In fact, various other transformations exist whose Jacobian has sufficient structure to allow computationally efficient determinant calculation. One such structure is the Toeplitz property, where all the elements along each diagonal of a square matrix are identical (Figure 1(a)). The calculation of the determinant can then be simplified significantly. Let $J_T$ be a Toeplitz matrix of size $N \times N$; its determinant can be evaluated in $O(N^2)$ time in general [Monahan, 2011]. More specifically, if $J_T$ has a limited bandwidth of size $K = r + s$, as depicted in Figure 1(a), then the determinant computation can be reduced to $O(K^2 \log N + K^3)$ time [Cinkir, 2011]. Moreover, Toeplitz matrices can be inverted efficiently [Martinsson et al., 2005].
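As a concrete illustration of Eq. (1), the following sketch composes two simple bijections in the normalizing direction, one of them a linear map whose Jacobian is a banded Toeplitz matrix, and accumulates their log-determinants under a standard normal base density. This is only a toy illustration; the kernel values, sizes, and the use of a dense determinant routine here are arbitrary choices, not the paper's construction.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 8

# Flow step 1: linear bijection h = T x with a banded Toeplitz Jacobian
# (bandwidth K = r + s = 2); its determinant is computed naively here.
col = np.zeros(N); col[:2] = [1.0, 0.3]    # main diagonal and first sub-diagonal
row = np.zeros(N); row[:2] = [1.0, -0.2]   # main diagonal and first super-diagonal
T = toeplitz(col, row)

# Flow step 2: elementwise affine map z = a * h + b with a diagonal Jacobian.
a = np.exp(0.1 * rng.normal(size=N))
b = rng.normal(size=N)

x = rng.normal(size=N)                      # a data sample
h = T @ x
z = a * h + b

# Eq. (1): log p(x) = log p(z) + sum_k log|det J_{f^(k)}|
log_det = np.linalg.slogdet(T)[1] + np.log(np.abs(a)).sum()
log_px = norm.logpdf(z).sum() + log_det     # standard normal base density
print(f"log p(x) = {log_px:.3f}")
```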
The fact that the discrete convolution can be expressed as a product of a Toeplitz matrix and the input [Gray et al., 2006] highlights that the Toeplitz property is of particular interest in convolutional neural networks (CNNs).

¹ Notation: Throughout the paper, invertible flows are denoted by $f$, while $f(x)$ is used for unconditional flows, and conditional (data-parameterized) flows are identified by $f(x_2; x_1)$ or $f(x_2; \theta(x_1))$, where the flow warps $x_2$ conditioned on $x_1$. Subscripts specify the type of flow or its parameters, while superscripts enumerate the order of flows in the chain. For example, $f_\ast$ denotes the convolutional flow in general and $\sigma_\alpha$ is used to specify the pointwise nonlinear bijector, with its inverse being $\phi_\alpha$. Also, in general, $y$ and $x$ indicate the output and input of a flow, respectively, and when referring to the $k$th flow in the chain, we use $y^{(k)}$ and $x^{(k)}$ where $x^{(k)} = y^{(k-1)}$. Moreover, circular convolution and symmetric convolution are denoted by $\circledast$ and $\circledast_s$, respectively, while $\ast$ denotes an invertible convolution in general, and $x_F$, $x_C$ and $x_T$ denote the DFT, DCT and trigonometric transform of a sample $x$, respectively.

*Figure 1: (a) $J_T$ is a Toeplitz matrix with limited bandwidth size $K = r + s$, (b) $J_C$ is the Jacobian of circular convolution, which is a circulant matrix, and (c) $J_S$ is the Jacobian of symmetric convolution, which can be expressed as the sum of a Toeplitz matrix and an upside-down Toeplitz matrix (also called a Hankel matrix, whose skew-diagonal elements are identical).*

In this paper, we consider a particular transformation whose Jacobian is a circulant matrix, a special form of Toeplitz structure where the rows (columns) are cyclic permutations of the first row (column), i.e. $J_{l,m} = J_{1,(l-m) \bmod N}$. See Figure 1(b) for an illustration. This structure allows certain computationally expensive algebraic operations, such as determinant calculation, inversion and eigenvalue decomposition, to be performed efficiently in $O(N \log N)$ time by exploiting the fact that a square circulant matrix can be diagonalized by a discrete Fourier transform (DFT) [Gray et al., 2006]. Define the circular convolution as $y := w \circledast x$ where $y(i) := \sum_{n=0}^{N-1} x(n)\, w((i - n) \bmod N)$, which is equivalent to the linear convolution of two sequences when one is padded cyclically, also known as periodic padding, as illustrated in Figure 2(a). The key property we exploit in developing an efficient normalizing layer is that the Jacobian of this convolution forms a circulant matrix, hence its determinant and inverse mapping (deconvolution) can be computed efficiently. Some useful properties of this operation are needed:

**Proposition 1** Let $y := w \circledast x$ be a circular convolution on the input vector $x$ with its DFT transform $x_F := \mathcal{F}_{DFT}\{x\}$. Then:
a) The circular convolution operation can be expressed as a vector-matrix multiplication $y = C_w x$ where $C_w$ is a circulant square matrix having the convolution kernel $w$ as its first row.
b) The Jacobian of the mapping is $J_y = C_w$.
c) The matrix $C_w$ can be diagonalized using the DFT basis, with its eigenvalues being equal to the DFT of $w$; hence $\log |\det J_y| = \sum_{n=0}^{N-1} \log |w_F(n)|$.
d) The circular convolution can be expressed by element-wise multiplication in the frequency domain, $y_F(k) = w_F(k)\, x_F(k)$, a.k.a. the circular convolution-multiplication property.
e) If $w_F(n) \neq 0 \ \forall n$, this linear operation is invertible with inverse $x_F(n) = w_F^{-1}(n)\, y_F(n)$. Moreover, its inverse mapping (deconvolution) is also a circular convolution operation with kernel $w_{inv} := \mathcal{F}_N^{-1}\{w_F^{-1}\}$. On the other hand, the log determinant of the Jacobian also acts as a log-barrier in the objective function, which in turn prevents the $w_F(n)$ from becoming zero and hence enforces the invertibility of the convolution filter.
f) The circular convolution, its inverse, and its Jacobian determinant can all be computed efficiently in $O(N \log N)$ time in the frequency domain, exploiting Fast Fourier Transform (FFT) algorithms.
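The following NumPy sketch mirrors Proposition 1 numerically: it applies a circular convolution through the frequency domain, computes $\log|\det J_y|$ from the DFT of the kernel, and recovers the input by deconvolution. It is a minimal check rather than the implementation used in the experiments; the kernel here is an arbitrary perturbation of the identity filter.

```python
import numpy as np
from scipy.linalg import circulant

def circ_conv(w, x):
    """Circular convolution y = w (circ) x via the convolution-multiplication
    property (Proposition 1d), together with log|det J_y| (Proposition 1c)."""
    w_F = np.fft.fft(w)
    y = np.fft.ifft(w_F * np.fft.fft(x)).real
    log_det = np.log(np.abs(w_F)).sum()
    return y, log_det

def circ_deconv(w, y):
    """Inverse mapping (deconvolution), Proposition 1e: divide in the frequency
    domain, assuming w_F(n) != 0 for all n."""
    return np.fft.ifft(np.fft.fft(y) / np.fft.fft(w)).real

rng = np.random.default_rng(0)
N = 16
w = 0.1 * rng.normal(size=N)
w[0] += 1.0                       # a kernel close to the identity filter
x = rng.normal(size=N)

y, log_det = circ_conv(w, x)

# Check against the explicit circulant-matrix view (Proposition 1a/b).
C_w = circulant(w)                # C_w[i, n] = w[(i - n) mod N]
assert np.allclose(C_w @ x, y)
assert np.allclose(np.linalg.slogdet(C_w)[1], log_det)
assert np.allclose(circ_deconv(w, y), x)
```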
### 2.2 Symmetric convolution

Circular convolution is not the only operation with such properties; symmetric convolution is another form of structured filtering operation that can be adopted to achieve desirable properties.

*Figure 2: (a) Cyclic (periodic) extension and (b) even-symmetric extension of the base sequence, where the base sequence is specified by dark solid lines. (c) Nonlinear gates corresponding to the $\ell_1$ and $\ell_2$ regularizers.*

A family of symmetric extension (padding) patterns and their corresponding discrete trigonometric transforms (DTT) are outlined in Martucci [1994], based on which alternative symmetric convolution filters can be defined that satisfy the convolution-multiplication property. Among this family, we choose an even-symmetric extension that can be readily interpreted. Define an even-symmetric extension of a base sequence of length $N$ around $N - \tfrac{1}{2}$ as

$$\hat{x}(n) = \varepsilon\{x(n)\} := \begin{cases} x(n) & n = 0, 1, \dots, N-1 \\ x(-n-1) & n = -N, \dots, -1 \end{cases} \qquad (2)$$

This even-symmetric extension is illustrated in Figure 2(b). The symmetric convolution of two sequences, denoted by $\circledast_s$, can then be defined by the circular convolution of their corresponding even-symmetric extensions, as $y = w \circledast_s x := \mathcal{R}\{\hat{x} \circledast \hat{w}\}$, where $\mathcal{R}\{\cdot\}$ is a rectangular window operation that retains the base sequence of interest in an extended sequence; that is, it inverts the symmetric extension operation (2). Now, since the sequences are extended by an even-symmetric pattern, the cosine functions provide the appropriate basis for the Fourier transform, giving rise to the discrete cosine transform of type two (DCT-II):

$$x_C(k) = \mathcal{F}_{dct}\{x\}_k = \sum_{n=0}^{N-1} x(n) \cos\left(\frac{\pi k (2n + 1)}{2N}\right).$$

The convolution-multiplication property holds for this convolution, which implies that the symmetric convolution of two sequences in the spatial domain can be expressed as a pointwise multiplication in the transform domain, after a forward DCT of its operands, i.e. $y_C = w_C \odot x_C$. This property also offers an alternative definition for the symmetric convolution: the inverse DCT of the pointwise multiplication of the forward DCT of its operands [Martucci, 1994]. One can also show that the symmetric convolution has a structured Jacobian that can be specified by Toeplitz matrices; see Figure 1(c) for an illustration. Analogous to the results presented in Proposition 1 for circular convolution, the symmetric convolution-multiplication property implies that the Jacobian of the symmetric convolution can be diagonalized by a DCT basis, with eigenvalues being the DCT of the convolution kernel.

Similarly, the inverse filter (deconvolution) can be obtained by inverting the kernel coefficients in the transform domain, i.e. $w_{inv} := \mathcal{F}_{dct}^{-1}\{1./w_C\}$, where, again, the invertibility of the convolution is guaranteed by the fact that its log determinant Jacobian term in the objective function keeps the elements of $w_C$ away from zero (acting as a log-barrier). On the other hand, since the DCT can be defined in terms of a DFT of the symmetric extension of the original sequences, the symmetric convolution, its inverse, and its Jacobian determinant can exploit available fast Fourier algorithms with $O(N \log N)$ complexity.²

² All bijective convolutions in the experiments were performed in the transform domain using a fast Fourier transform algorithm.

## 3 Convolutional normalizing flow

### 3.1 Data adaptive convolution layer

The special convolutional forms introduced above appear to be particularly well suited to capturing structure in images and audio signals; we therefore seek to design more expressive normalizing flows using these convolution bijections as building blocks. To increase flexibility, we propose a data-adaptive convolution filter whose kernel is a function of the input of the layer. Inspired by the idea of the coupling layer in [Dinh et al., 2016], a modular bijection can be formed by splitting the input $x \in \mathbb{R}^d$ into two disjoint parts $\{x_1 \in \mathbb{R}^{d_1}, x_2 \in \mathbb{R}^{d_2} : d_1 + d_2 = d\}$, referred to as the base input and update input, respectively, and only updating $x_2$ by an invertible convolution operation with a data-parameterized kernel that depends on $x_1$. The data-adaptive convolution sub-flow can then be expressed as

$$f_\ast(x_2; x_1) = w(x_1) \ast x_2. \qquad (4)$$

In the above transformation, $\ast$ is an invertible convolution operation and can be one of the invertible convolutions introduced in the last section. Here, the kernel $w(x_1)$ can be any nonlinear function, which leads to a nonlinear adaptive convolution filtering scheme.

### 3.2 Pointwise nonlinear bijections

Adding pointwise nonlinear bijections to the chain of normalizing flows can further enhance expressiveness. More specifically, focusing on the Jacobian determinant introduced by the nonlinearities in the log-likelihood equation (1), one can observe that these terms can be interpreted as regularizers on the latent representation. In other words, specific structures on intermediate activations can be encouraged by designing customized pointwise nonlinear gates; these structures encode various forms of prior knowledge into the design of the model. Let $\sigma^{(k)}$ denote the $k$th bijection in the chain of normalizing flows, assumed to be a pointwise nonlinear operation, i.e. $y_i^{(k)} = \sigma^{(k)}(x_i^{(k)})$. Dropping the indices, this mapping can be written simply as $y = \sigma(x)$ with inverse $x = \phi(y) = \sigma^{-1}(y)$. Since the nonlinearity operates elementwise, its Jacobian is diagonal, hence the log determinant reduces to $\log |\det J_y| = \sum_{i=1}^{d} \log |\sigma'(x_i)|$. An analytic approach to designing nonlinear invertible gates is then derived in the following.

**Proposition 2** Assume we want to induce a specific structure, formulated by a regularizer $\gamma(y)$, on the intermediate activation $y := y_i^{(k)}$. Then the elementwise bijection can be defined as the solution to the differential equation $\left| \frac{\partial \sigma^{-1}}{\partial y} \right| = e^{\gamma(y)}$. In other words, the contribution of this unit to the negative log-likelihood, through the $\log |\det J_\sigma|$ term, then reduces to $\log \left| \frac{\partial \phi}{\partial y} \right| = \gamma(y)$.

Solving the above equation and deriving the nonlinear bijection for the two well established $\ell_1$ and $\ell_2$ regularizers leads to the following.
**$\ell_1$ regularization:** $\gamma(y) = \alpha |y|$, which corresponds to a Laplace distribution assumption on $y$:

$$\phi_\alpha(y) = \frac{\operatorname{sign}(y)}{\alpha}\left(e^{\alpha |y|} - 1\right), \qquad \sigma_\alpha(x) = \frac{\operatorname{sign}(x)}{\alpha}\ln(\alpha |x| + 1). \qquad (5)$$

Due to its symmetric logarithmic shape, we call the forward function $\sigma_\alpha(x)$ an S-Log gate, parameterized by a positive-valued $\alpha$.

**$\ell_2$ regularization:** $\gamma(y) = \alpha y^2$, which corresponds to a Gaussian distribution assumption on $y$:

$$\phi_\alpha(y) = \sqrt{\tfrac{\pi}{4\alpha}}\, \operatorname{erfi}\!\left(\sqrt{\alpha}\, y\right), \qquad \sigma_\alpha(x) = \tfrac{1}{\sqrt{\alpha}}\, \operatorname{erfi}^{-1}\!\left(\sqrt{\tfrac{4\alpha}{\pi}}\, x\right).$$

The proposed nonlinear gates, plotted in Figure 2(c), are not only differentiable by construction but also have unbounded domain and range, making them suitable choices for designing normalizing flows in many settings such as density estimation. Due to its simple analytical form and closed-form inverse, the S-Log gate (5) is adopted as the nonlinear bijection in our model architecture. For multichannel inputs, we assume that the gates share the same parameter $\alpha$ over all spatial locations of a channel (feature map).

### 3.3 Combined convolution-multiplication layer

The convolution operation spatially slides a filter and applies the same weighted summation at every location of its input, resulting in location-invariant filtering. To achieve a more flexible and richer filtering scheme, we can combine an element-wise multiplication, indicated by $f_\odot$, with an invertible convolution, indicated by $f_\ast$, so that the filtering scheme varies over space and frequency. The product of a diagonal matrix with a circulant matrix was also proposed in [Cheng et al., 2015] as a structured approximation for dense (fully connected) linear layers, while [Moczulski et al., 2015] showed that any $N \times N$ linear operator can be approximated to arbitrary precision by composing order $N$ of such products.

*Figure 3: The diagram of one step of the flow (CONF), which is composed of M combined convolutional flows defined in (6). In density estimation, the input to the conditioning neural network is the base input, $x_1$, and the flow updates $x_2$. In variational inference applications, the neural network is conditioned on the data points $x$ while warping the latent random variable $z$.*

Overall, the aforementioned components can be deployed to compose a combined convolutional flow as

$$f_{w,s}(x_2; x_1) = (\sigma_{\alpha'} \circ f_\odot \circ \sigma_\alpha \circ f_\ast)(x_2; x_1) = \sigma_{\alpha'}\big(s(x_1) \odot \sigma_\alpha(w(x_1) \ast x_2)\big). \qquad (6)$$

We found that a more expressive network can be achieved by stacking $M$ iterates of the combined convolutional flow and an additive coupling transform in each step of the network. Therefore, the convolutional coupling flow (CONF) can be written as

$$\begin{cases} y_1 = x_1 \\ y_2 = (f_{w,s}^{(M)} \circ \dots \circ f_{w,s}^{(1)})(x_2; x_1) + t(x_1). \end{cases} \qquad (7)$$

The parameters of the flow $\{w_1, s_1, \dots, w_M, s_M, b\}$ can be any nonlinear functions of the base input $x_1$ and are not required to be invertible; hence they can be modeled by deep neural networks with an arbitrary number of hidden units, offering flexibility and rich representational capacity while preserving an efficient learning algorithm. These are also called conditioning networks in the context of normalizing flows. The model complexity can be significantly reduced by using one conditioning neural network for all parameters of a coupling flow, so that all layers are shared except the last one, which generates the parameters of the flow. Consequently, we achieve a more expressive flow with the stack of bijectors in (7) without introducing too many extra NN layers into the model. The modular structure of the coupling CONF module (7) implies that its Jacobian determinant can be expressed in terms of those of its sub-flows.
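To make Eqs. (5) and (6) concrete, the sketch below implements the S-Log gate with its inverse and log-determinant contribution, and chains it with an element-wise scaling and a circular convolution into one combined sub-flow. The fixed vectors `w` and `s` and the scalars `alpha1`, `alpha2` stand in for outputs of a conditioning network $w(x_1), s(x_1)$; they are placeholders, not the paper's parameterization.

```python
import numpy as np

def circ_conv(w, x):
    """Circular convolution via FFT with its log|det J| (see Proposition 1)."""
    w_F = np.fft.fft(w)
    return np.fft.ifft(w_F * np.fft.fft(x)).real, np.log(np.abs(w_F)).sum()

def slog_forward(x, alpha):
    """S-Log gate of Eq. (5): sigma_alpha(x) = sign(x)/alpha * ln(alpha|x| + 1).
    Its elementwise derivative is 1 / (alpha|x| + 1), so the log-det is
    -sum log(alpha|x| + 1)."""
    y = np.sign(x) / alpha * np.log1p(alpha * np.abs(x))
    log_det = -np.log1p(alpha * np.abs(x)).sum()
    return y, log_det

def slog_inverse(y, alpha):
    """Inverse gate phi_alpha(y) = sign(y)/alpha * (exp(alpha|y|) - 1)."""
    return np.sign(y) / alpha * np.expm1(alpha * np.abs(y))

def combined_subflow(x2, w, s, alpha1, alpha2):
    """One combined sub-flow in the spirit of Eq. (6):
    sigma_{alpha2}(s * sigma_{alpha1}(w circ-conv x2)),
    accumulating the log|det| of every stage."""
    h, ld0 = circ_conv(w, x2)           # invertible circular convolution
    h, ld1 = slog_forward(h, alpha1)    # first nonlinear gate
    h = s * h                           # element-wise multiplication
    ld2 = np.log(np.abs(s)).sum()
    y2, ld3 = slog_forward(h, alpha2)   # second nonlinear gate
    return y2, ld0 + ld1 + ld2 + ld3

# Quick invertibility check of the gate alone.
rng = np.random.default_rng(1)
x = rng.normal(size=5)
y, _ = slog_forward(x, alpha=0.5)
assert np.allclose(slog_inverse(y, alpha=0.5), x)
```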
More details on the Jacobian determinant, invertibility condition and inverse of this transformation can be found in Appendix A.

**Initialization of the parameters:** Better data propagation is expected in very deep normalizing flows if the combined flow (6) acts (approximately) as an identity mapping at initialization. Accordingly, the parameters of the nonlinear bijector pair, $\{\sigma_\alpha, \sigma_{\alpha'}\}$, are initialized sufficiently close to zero so that they behave approximately as linear functions at the outset. Furthermore, the conditioning networks are initialized such that the scaling filters, $s$, and the convolution kernels in the frequency domain, $\mathcal{F}\{w\}$, are all identity filters initially.

**Multi-dimensional extension:** The multi-dimensional discrete Fourier transform can be expressed in separable form, meaning that the operations can be performed by successively applying 1-dimensional transforms along each dimension [Gonzalez and Woods, 1992]. The separability property ensures the results mentioned so far can be extended to multi-dimensional settings. In this work, we are particularly interested in 2-D operations for image data. Based on the 2-D circular convolution definition, its equivalent block-circulant matrix form, and the diagonalization method by 2-D DFT [Gonzalez and Woods, 1992, Ch. 5], the results for the circular convolution in Proposition 1 can be readily generalized to the 2-D case.³ The same properties apply to the 2-D symmetric convolution, since the symmetric convolution-multiplication property can be generalized naturally to the 2-D setting [Foltz and Welsh, 1998].

³ Due to the separability property, the 2-D DFT of matrices of size $N_1 \times N_2$ can be computed in $O(N_1 N_2 (\log N_1 + \log N_2))$ time.

Table 1: Average test negative log-likelihood (in nats) for tabular datasets and (in bits/dim) for MNIST and CIFAR using fully connected conditioning networks (lower is better). C-CONF and S-CONF stand for the circular and symmetric convolutional coupling flows presented in (7), respectively. Error bars correspond to 2 standard deviations. The results of the benchmark methods are from Grathwohl et al. [2019].

| Model | POWER | GAS | BSDS300 | MNIST | CIFAR10 |
|---|---|---|---|---|---|
| MADE | 3.08 ± .03 | -3.56 ± .04 | -148.85 ± .28 | 2.04 ± .01 | 5.67 ± .01 |
| MAF | -0.24 ± .01 | -10.08 ± .02 | -155.69 ± .28 | 1.89 ± .01 | 4.31 ± .01 |
| Real NVP | -0.17 ± .01 | -8.33 ± .14 | -153.28 ± 1.78 | 1.93 ± .01 | 4.53 ± .01 |
| Glow | -0.17 ± .01 | -8.15 ± .40 | -155.07 ± .03 | – | – |
| FFJORD | -0.46 ± .01 | -8.59 ± .12 | -157.40 ± .19 | – | – |
| S-CONF | -0.48 ± .01 | -10.98 ± .13 | -163.23 ± .13 | 1.26 ± .01 | 3.78 ± .03 |
| C-CONF | -0.47 ± .01 | -10.84 ± .06 | -163.23 ± .34 | 1.25 ± .01 | 3.82 ± .00 |

Table 2: Results in bits per dimension for MNIST and CIFAR10 using CNN-based conditioning networks. The results of the benchmark methods are from [Kingma and Dhariwal, 2018] and [Grathwohl et al., 2019].

| | Real NVP | Glow | FFJORD | S-CONF |
|---|---|---|---|---|
| MNIST | 1.06 | 1.05 | 0.99 | 1.00 |
| CIFAR10 | 3.49 | 3.35 | 3.40 | 3.34 |

## 4 Model architecture

A highly flexible and complex density approximation can be formed by composing a chain of the convolutional coupling layers introduced in this work. As explained in Section 2, the determinant of the Jacobian and inverse of the composition can then be obtained readily. In addition to the invertible transformations introduced in this work, we use the following bijections in the final architecture of the normalizing flow.

**Cross-channel mapping (mixing):** For the multi-channel setting, the invertible convolution operation is performed in a depthwise fashion, i.e. each input channel is filtered by a separate convolution kernel.
Cross-channel information flow can then be complemented by channel shuffling or by a 1×1 convolution. The latter offered significant improvement with small computational overhead in normalizing flows [Kingma and Dhariwal, 2018] and hence is applied after each convolutional coupling layer in our architecture. Also, for single-channel inputs, assuming equal size splits $\{x_1, x_2\}$ (base input and update input), these can be treated as two separate channels of the input and the same technique can be applied to mix them after each coupling layer.

**Multiscale architecture:** To achieve latent representations at multiple scales and obtain more fine-grained features, a subset of latent variables can be factored out at the intermediate layers. This technique is very useful for large image datasets and can significantly reduce the computational cost in very deep models [Dinh et al., 2016].

**Normalization:** To improve training in very deep normalizing flows, batch normalization was employed as a bijection after each coupling layer in [Dinh et al., 2016]. To overcome the adverse effect of small minibatch sizes in batch normalization, Kingma and Dhariwal [2018] proposed actnorm, which applies an affine transformation that normalizes the activations per channel, similar to batch normalization; it is initialized on a larger minibatch while its parameters are then updated freely during training with a smaller minibatch size, a technique called data dependent initialization. Thus, in the density estimation experiments, we employed actnorm layers as bijections in the chain of normalizing flows and also in the deep conditioning neural networks.

## 5 Experiments

### 5.1 Density estimation

We first conduct experiments to evaluate the benefits of the proposed flow model (CONF). As observed in [Huang et al., 2018], the expressiveness of affine coupling flows and affine autoregressive flows stems from the complexity of the conditioning neural network that models the flow parameters, and from successive application of the flows. Therefore, for a fair comparison, we follow [Papamakarios et al., 2017] and use a general-purpose neural network composed of fully connected layers in the design of the conditioning networks. In this way we highlight the capacity of the flow itself, without relying on complex data-dependent neural networks such as the deep residual convolutional networks used in [Dinh et al., 2016, Kingma and Dhariwal, 2018, Ho et al., 2019].

First we evaluate the proposed flow for density estimation on tabular datasets, considering two UCI datasets (POWER, GAS) and the natural image patches dataset (BSDS300) used in Papamakarios et al. [2017]. Descriptions of these datasets and the preprocessing procedure applied can be found therein. We also perform unconditional density estimation on two image datasets: MNIST, consisting of handwritten digits [LeCun, 1998], and CIFAR-10, consisting of natural images [Krizhevsky, 2009]. In BSDS300, the value of the bottom-right pixel is replaced with the average of its immediate neighbors, resulting in monochrome patches of size 8×8. For image data, the 2-D invertible convolution is used as the flow. All datasets are dequantized by adding uniformly distributed noise to each dimension, and they are then scaled to [0, 1]. Variational dequantization has been proposed as an alternative method offering a better variational lower bound on the log-likelihood [Ho et al., 2019], but it is beyond the scope of this paper.
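A minimal sketch of the preprocessing and evaluation conventions described above is given below: uniform dequantization of 8-bit images, scaling to [0, 1], and conversion of a negative log-likelihood in nats to bits per dimension under the common convention that the 1/256 rescaling contributes 8 bits per dimension. The exact pipeline used in the experiments may differ.

```python
import numpy as np

def dequantize(images_uint8, rng):
    """Uniform dequantization of 8-bit images followed by scaling to [0, 1)."""
    noisy = images_uint8.astype(np.float64) + rng.uniform(size=images_uint8.shape)
    return noisy / 256.0

def bits_per_dim(nll_nats, num_dims):
    """Convert a negative log-likelihood (in nats) of the [0, 1)-scaled model
    density into bits per dimension; the Jacobian of the 1/256 rescaling adds
    log2(256) = 8 bits per dimension."""
    return nll_nats / (num_dims * np.log(2.0)) + 8.0

rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 28, 28, 1), dtype=np.uint8)  # toy images
x = dequantize(batch, rng)
assert x.min() >= 0.0 and x.max() < 1.0
```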
We compare the density estimation performance of CONF to the affine coupling flow models Real NVP [Dinh et al., 2016] and Glow [Kingma and Dhariwal, 2018], and to the recent continuous-time invertible generative model FFJORD [Grathwohl et al., 2019]. These reversible models admit efficient sampling with a single pass of the generative model. We also compare the density estimation capacity of the proposed model against the autoregressive methods MADE [Germain et al., 2015] and MAF [Papamakarios et al., 2017]. This family of autoregressive normalizing flows requires O(D) evaluations of the generative function to sample from the model, making them prohibitively expensive for high dimensional applications. The results, summarized in Table 1, highlight that the circular convolution-based (FFT-based) CONF (C-CONF) and the symmetric convolution-based (DCT-based) CONF (S-CONF) offer significant performance gains over the other models. Since S-CONF outperforms C-CONF in most of the experiments, we use it as the main convolutional flow in the subsequent experiments, referring to it simply as CONF. The significant performance improvement of CONF on image datasets suggests that the feedforward conditioning networks were able to capture 2-D local structure. To make a fair comparison, we used a feedforward neural network architecture similar to the one used for MAF [Papamakarios et al., 2017], except that we simplified the architecture by using a single network for all parameters of a flow layer, while MAF used separate networks for the scaling and shift parameters. Each coupling flow is composed of a maximum of M = 2 iterates of the combined convolutional flow. The parameters of the network and the number of layers are selected to be comparable to those used in [Papamakarios et al., 2017]. Details of the model architecture and experimental setup, together with more empirical results, are presented in the appendix.

### 5.2 Density estimation using CNN-based conditioning networks

We further assess the performance of CONF when the conditioning networks are based on convolutional neural networks, which are specifically designed for image data. A shallow convolutional NN, similar to the one used in Glow, is employed to generate the parameters of the flow, except that we use one NN to generate all the parameters of a layer, reducing the number of model parameters. The results of the experiments on MNIST and CIFAR10 are presented in Table 2. The experimental setup and generated samples from the model can be found in Appendix C.1 and D, respectively.

### 5.3 Variational inference

We also evaluate the proposed normalizing flow as a flexible inference network for a variational auto-encoder (VAE) [Rezende and Mohamed, 2015]. Here the flows are only conditioned on encoded data points, produced by the encoder, and transform the posterior distribution of the latent variable without a coupling connection, resulting in $z^{(t)} = (f_{w,s}^{(M)} \circ \dots \circ f_{w,s}^{(1)})(z^{(t-1)}; x) + t(x)$. We compare the performance of the VAE trained with this convolutional flow against other approaches, including a non-flow-based VAE with factorized Gaussian distributions, and flow-based VAEs using the inverse autoregressive flow (IAF), planar flow [Rezende and Mohamed, 2015, Kingma et al., 2016], and
Sylvester normalizing flows (SNF) as the building blocks of the normalizing flow. We used the encoder/decoder architecture of Berg et al. [2018], and the results of the available methods are adopted from that paper. The details of the training procedure are summarized in Appendix C.2. Although the proposed flow is slower than an SNF of the same size, the results in Table 3 show that CONF outperforms the Sylvester flow in most cases, and even smaller CONF models show similar or better capacity than larger SNFs. Also, we observe that CONF with M = 1 outperforms planar flow by a wide margin on all datasets, except for Frey Faces, which is a challenging dataset and prone to overfitting for large SNFs; here the large CONF (F = 16, M = 16) performs best among all methods, demonstrating less sensitivity to overfitting on the Frey Faces dataset.

Table 3: Average test negative log-likelihood (in nats) and negative evidence lower bound (ELBO) on four benchmark datasets (lower is better). Reported error bars correspond to 2 standard deviations calculated over 3 trials. The combination of the number of flow steps F and M of each model is reported in the format (F-M).

| Model | MNIST −ELBO | MNIST NLL | Omniglot −ELBO | Omniglot NLL | Caltech Silhouettes −ELBO | Caltech Silhouettes NLL | Frey Faces −ELBO | Frey Faces NLL |
|---|---|---|---|---|---|---|---|---|
| VAE | 86.55 ± .06 | 82.14 ± .07 | 104.28 ± .39 | 97.25 ± .23 | 110.80 ± .46 | 99.62 ± .74 | 4.53 ± .02 | 4.40 ± .03 |
| IAF | 84.20 ± .17 | 80.79 ± .12 | 102.41 ± .04 | 96.08 ± .16 | 111.58 ± .38 | 99.92 ± .30 | 4.47 ± .05 | 4.38 ± .04 |
| Planar | 86.06 ± .31 | 81.91 ± .22 | 102.65 ± .42 | 96.04 ± .28 | 109.66 ± .42 | 98.53 ± .68 | 4.40 ± .06 | 4.31 ± .06 |
| CONF(16-1) | 83.89 ± .03 | 80.86 ± .05 | 98.35 ± .27 | 94.54 ± .12 | 108.64 ± 1.71 | 97.29 ± .91 | 4.43 ± .01 | 4.34 ± .02 |
| O-SNF(4-8) | 84.74 | 81.04 ± .15 | 101.41 ± .08 | 95.25 ± .09 | 109.37 ± .94 | 97.78 ± .47 | 4.50 ± .00 | 4.39 ± .01 |
| CONF(4-8) | 83.22 ± .05 | 80.64 ± .06 | 97.17 ± .08 | 94.19 ± .03 | 104.09 ± 1.03 | 94.56 ± .29 | 4.41 ± .01 | 4.31 ± .00 |
| O-SNF(16-32) | 83.32 ± .06 | 80.22 ± .03 | 99.00 ± .29 | 93.82 ± .21 | 106.08 ± .39 | 94.61 ± .83 | 4.51 ± .04 | 4.39 ± .05 |
| CONF(16-16) | – | – | 96.35 ± .05 | 93.66 ± .03 | 101.10 ± .49 | 92.37 ± .40 | 4.39 ± .02 | 4.29 ± .00 |

**Number of parameters:** Let the stochastic latent variable be a D-dimensional vector $z \in \mathbb{R}^D$ and the encoder's output be $e(x) \in \mathbb{R}^E$; then each step of CONF requires an additional $E \times (2MD + D) + 2M$ parameters to produce the flow parameters based on $e(x)$, which is comparable to the number of parameters in a step of planar flow if M = 1. This is of the same order as the number of parameters of a Sylvester flow with a bottleneck of size M, which is $E \times (2MD + 2M^2 + M)$.

## 6 Conclusion

In this work we showed that circular and symmetric convolutions can be used as invertible transformations with fast and efficient inversion, deconvolution, and Jacobian determinant evaluation. These features make them well suited for designing flexible normalizing flows. Using these invertible convolutions, we introduced a family of data-adaptive coupling layers, which consist of convolutions whose kernels are themselves a function of the coupling layer input. We also analytically derived invertible pointwise nonlinearities that implicitly induce specific regularizers on intermediate activations in deep flow models. These results also help to better understand the role of nonlinear gates through the lens of their contribution to the distributions of the latent variables. Using these new architectural components, we achieved state-of-the-art performance on several datasets for invertible normalizing flows with fast sampling.

## References

Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

Yu Cheng, Felix X Yu, Rogerio S Feris, Sanjiv Kumar, Alok Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections.
In Proceedings of the IEEE International Conference on Computer Vision, pages 2857–2865, 2015.

Z. Cinkir. A fast elementary algorithm for computing the determinant of Toeplitz matrices. arXiv e-prints, January 2011.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Thomas M Foltz and BM Welsh. Image reconstruction using symmetric convolution and discrete trigonometric transforms. JOSA A, 15(11):2827–2840, 1998.

Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.

Rafael C Gonzalez and Richard E Woods. Digital image processing, 1992.

Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

Robert M Gray et al. Toeplitz and circulant matrices: A review. Foundations and Trends in Communications and Information Theory, 2(3):155–239, 2006.

Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design, 2019.

Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. arXiv preprint arXiv:1901.11137, 2019.

Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2083–2092, 2018.

Mahdi Karami, Laurent Dinh, Daniel Duckworth, Jascha Sohl-Dickstein, and Dale Schuurmans. Generative convolutional flow for density estimation. In Workshop on Bayesian Deep Learning, NeurIPS 2018, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. 2016.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. A fast algorithm for the inversion of general Toeplitz matrices. Computers & Mathematics with Applications, 50(5-6):741–752, 2005.

Stephen A Martucci. Symmetric convolution and the discrete sine and cosine transforms. IEEE Transactions on Signal Processing, 42(5):1038–1051, 1994.

Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. ACDC: A structured efficient linear layer, 2015.

John F Monahan. Numerical methods of statistics. Cambridge University Press, 2011.

George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pages 1530–1538, 2015.

Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders.
In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.

Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

Y. LeCun and C. Cortes. The MNIST database of handwritten digits. 1998.

Guoqing Zheng, Yiming Yang, and Jaime Carbonell. Convolutional normalizing flows. arXiv preprint arXiv:1711.02255, 2017.