# Compression with Flows via Local Bits-Back Coding

Jonathan Ho, UC Berkeley (jonathanho@berkeley.edu) · Evan Lohn, UC Berkeley (evan.lohn@berkeley.edu) · Pieter Abbeel, UC Berkeley, covariant.ai (pabbeel@cs.berkeley.edu)

## Abstract

Likelihood-based generative models are the backbones of lossless compression due to the guaranteed existence of codes with lengths close to negative log likelihood. However, there is no guaranteed existence of computationally efficient codes that achieve these lengths, and coding algorithms must be hand-tailored to specific types of generative models to ensure computational efficiency. Such coding algorithms are known for autoregressive models and variational autoencoders, but not for general types of flow models. To fill in this gap, we introduce local bits-back coding, a new compression technique for flow models. We present efficient algorithms that instantiate our technique for many popular types of flows, and we demonstrate that our algorithms closely achieve theoretical codelengths for state-of-the-art flow models on high-dimensional data.

## 1 Introduction

To devise a lossless compression algorithm means to devise a uniquely decodable code whose expected length is as close as possible to the entropy of the data. A general recipe for this is to first train a generative model by minimizing cross entropy to the data distribution, and then construct a code that achieves lengths close to the negative log likelihood of the model. This recipe is justified by classic results in information theory which ensure that the second step is possible: in other words, optimizing cross entropy optimizes the performance of some hypothetical compressor. And, thanks to recent advances in deep likelihood-based generative models, these hypothetical compressors are quite good. Deep autoregressive models, latent variable models, and flow models are now achieving state-of-the-art cross entropy scores on a wide variety of real-world datasets in speech, video, text, images, and other domains [45, 46, 39, 34, 5, 31, 6, 22, 10, 11, 23, 35, 19, 44, 25, 29].

But we are not interested in hypothetical compressors. We are interested in practical, computationally efficient compressors that scale to high-dimensional data and harness the excellent cross entropy scores of modern deep generative models. Unfortunately, naively applying existing codes, like Huffman coding [21], requires computing the model likelihood for all possible values of the data, which expends computational resources scaling exponentially with the data dimension. This inefficiency stems from the lack of assumptions about the generative model's structure: coding algorithms must be tailored to specific types of generative models if we want them to be efficient. There is already a rich literature of tailored coding algorithms for autoregressive models and for variational autoencoders built from conditional distributions that are already tractable for coding [38, 12, 18, 13, 42], but there are currently no such algorithms for general types of flow models [32]. This lack of efficient coding algorithms is a shortcoming of flow models that stands at odds with their many advantages, like fast and realistic sampling, interpretable latent spaces, fast likelihood evaluation, competitive cross entropy scores, and ease of training with unbiased log likelihood gradients [10, 11, 23, 19].
To rectify this situation, we introduce local bits-back coding, a new technique for turning a general, pretrained, off-the-shelf flow model into an efficient coding algorithm suitable for continuous data discretized to high precision. We show how to implement local bits-back coding without assumptions on the flow structure, leading to an algorithm that runs in polynomial time and space with respect to the data dimension. Going further, we show how to tailor our implementation to various specific types of flows, culminating in an algorithm for RealNVP-type flows that runs in linear time and space with respect to the data dimension and is fully parallelizable for both encoding and decoding. We then show how to adapt local bits-back coding to losslessly code data discretized to arbitrarily low precision, and in doing so, we obtain a new compression interpretation of dequantization, a method commonly used to train flow models on discrete data. We test our algorithms on recently proposed flow models trained on real-world image datasets, and we find that they are computationally efficient and attain codelengths in close agreement with theoretical predictions. Open-source code is available at https://github.com/hojonathanho/localbitsback.

## 2 Preliminaries

**Lossless compression.** We begin by defining lossless compression of $d$-dimensional discrete data $x$ using a probability mass function $p(x)$ represented by a generative model. It means to construct a uniquely decodable code $C$, which is an injective map from data sequences to binary strings, whose lengths $|C(x)|$ are close to $-\log p(x)$ [7].¹ The rationale is that if the generative model is expressive and trained well, its cross entropy will be close to the entropy of the data distribution. So, if the lengths of $C$ match the model's negative log probabilities, the expected length of $C$ will be small, and hence $C$ will be a good compression algorithm. Constructing such a code is always possible in theory because the Kraft-McMillan inequality [27, 30] ensures that there always exists some code with lengths $|C(x)| = \lceil -\log p(x) \rceil \approx -\log p(x)$.

¹ We always use base 2 logarithms.

**Flow models.** We wish to construct a computationally efficient code specialized to a flow model $f$, which is a differentiable bijection between continuous data $x \in \mathbb{R}^d$ and latents $z = f(x) \in \mathbb{R}^d$ [9-11]. A flow model comes with a density $p(z)$ on the latent space and thus has an associated sampling process $x = f^{-1}(z)$ for $z \sim p(z)$, under which it defines a probability density function via the change-of-variables formula for densities:

$$\log p(x) = \log p(z) + \log |\det J(x)| \tag{1}$$

where $J(x)$ denotes the Jacobian of $f$ at $x$. Flow models are straightforward to train with maximum likelihood, as Eq. (1) allows unbiased exact log likelihood gradients to be computed efficiently.

**Dequantization.** Standard datasets such as CIFAR10 and ImageNet consist of discrete data $x \in \mathbb{Z}^d$. To make a flow model suitable for such discrete data, it is standard practice to define a derived discrete model $P(x) := \int_{[0,1)^d} p(x + u)\, du$, to be trained by minimizing a dequantization objective, which is a variational bound on the codelength of $P(x)$:

$$\mathbb{E}_{u \sim q(u|x)} \left[ -\log \frac{p(x + u)}{q(u|x)} \right] \;\geq\; -\log \int_{[0,1)^d} p(x + u)\, du \;=\; -\log P(x) \tag{2}$$

Here, $q(u|x)$ proposes dequantization noise $u \in [0, 1)^d$ that transforms discrete data $x$ into continuous data $x + u$; it can be fixed to a uniform distribution [43, 41, 39] or parameterized as another flow to be trained jointly with $f$ [19].
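To make Eqs. (1) and (2) concrete, here is a minimal numerical sketch (not the paper's released code): a made-up elementwise flow with a standard normal prior, its exact change-of-variables log density, and a Monte Carlo estimate of the uniform-dequantization bound for a made-up discrete data point. With uniform $q(u|x)$, the $-\log q$ term in Eq. (2) vanishes. The flow, prior, and data here are illustrative assumptions only.

```python
# A minimal sketch of Eq. (1) and Eq. (2), assuming a toy elementwise flow
# z_i = arcsinh(x_i) with a standard normal prior. Not the paper's code.
import numpy as np
from scipy.stats import norm

d = 4
rng = np.random.default_rng(0)

def f(x):                      # elementwise flow: z_i = arcsinh(x_i)
    return np.arcsinh(x)

def log_det_jacobian(x):       # log |det J(x)| for this elementwise flow
    return np.sum(-0.5 * np.log(1.0 + x**2))

def log_prior(z):              # standard normal prior on latents
    return np.sum(norm.logpdf(z))

def log_density(x):            # Eq. (1): log p(x) = log p(f(x)) + log |det J(x)|
    return log_prior(f(x)) + log_det_jacobian(x)

# Uniform dequantization bound (Eq. 2) for discrete x' in Z^d:
# -E_{u ~ U[0,1)^d}[log p(x' + u)] >= -log P(x'), estimated by Monte Carlo.
x_discrete = rng.integers(-3, 4, size=d).astype(np.float64)
bound_bits = -np.mean([log_density(x_discrete + rng.uniform(size=d))
                       for _ in range(1000)]) / np.log(2)   # nats -> bits
print(f"dequantization bound: {bound_bits:.2f} bits")
```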
This dequantization objective serves as a theoretical codelength for flow models trained on discrete data, just as negative log probability mass serves as a theoretical codelength for discrete generative models [41].

## 3 Local bits-back coding

Our goal is to develop computationally efficient coding algorithms for flows trained with dequantization (2). In Sections 3.1 to 3.4, we develop algorithms that use flows to code continuous data discretized to high precision. In Section 3.5, we adapt these algorithms to losslessly code data discretized to low precision, attaining our desired codelength (2) for discrete data.

### 3.1 Coding continuous data using discretization

We first address the problem of developing coding algorithms that attain codelengths given by negative log densities of flow models, such as Eq. (1). Probability density functions do not directly map to codelengths, unlike probability mass functions, which enjoy the result of the Kraft-McMillan inequality. So, following standard procedure [7, Section 8.3], we discretize the data to a high precision $k$ and code this discretized data with a certain probability mass function derived from the density model. Specifically, we tile $\mathbb{R}^d$ with hypercubes of volume $\delta x := 2^{-kd}$; we call each hypercube a bin. For $x \in \mathbb{R}^d$, let $B(x)$ be the unique bin that contains $x$, and let $\bar{x}$ be the center of the bin $B(x)$. We call $\bar{x}$ the discretized version of $x$. For a sufficiently smooth probability density function $p(x)$, such as a density coming from a neural network flow model, the probability mass function $P(\bar{x}) := \int_{B(\bar{x})} p(x)\, dx$ takes on the pleasingly simple form $P(\bar{x}) \approx p(\bar{x})\,\delta x$ when the precision $k$ is large. Now we invoke the Kraft-McMillan inequality, so the theoretical codelength for $\bar{x}$ using $P$ is

$$-\log P(\bar{x}) \approx -\log p(\bar{x})\,\delta x \tag{3}$$

bits. This is the compression interpretation of the negative log density: it is a codelength for data discretized to high precision, once the total number of bits of discretization precision, $-\log \delta x = kd$, is added. It is this codelength, Eq. (3), that we will try to achieve with an efficient algorithm for flow models. We defer the problem of coding data discretized to low precision to Section 3.5.

### 3.2 Background on bits-back coding

The main tool we will employ is bits-back coding [47, 18, 13, 20], a coding technique originally designed for latent variable models (the connection to flow models is presented in Section 3.3 and is new to our work). Bits-back coding codes $x$ using a distribution of the form $p(x) = \sum_z p(x, z)$, where $p(x, z) = p(x|z)\,p(z)$ includes a latent variable $z$; it is relevant when $z$ ranges over an exponentially large set, which makes it intractable to code with $p(x)$ even though coding with $p(x|z)$ and $p(z)$ may be tractable individually. Bits-back coding introduces a new distribution $q(z|x)$ that is tractable for coding, and the encoder jointly encodes $x$ along with $z \sim q(z|x)$ via these steps:

1. Decode $z \sim q(z|x)$ from an auxiliary source of random bits
2. Encode $x$ using $p(x|z)$
3. Encode $z$ using $p(z)$

The first step, which decodes $z$ from random bits, produces a sample $z \sim q(z|x)$. The second and third steps transmit $z$ along with $x$. At decoding time, the decoder recovers $(x, z)$, then recovers the bits the encoder used to sample $z$ using $q$. So, the encoder will have transmitted extra information in addition to $x$: precisely $\mathbb{E}_{z \sim q(z|x)}[-\log q(z|x)]$ bits on average.
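This accounting can be checked numerically before stating the net codelength in general. The following minimal sketch (illustrative only) uses an idealized coder that merely tracks codelengths rather than an actual bitstream, and a made-up toy latent variable model; because $q(z|x)$ is set to the exact posterior here, the net bits-back length equals $-\log_2 p(x)$ exactly for every sample.

```python
# A minimal accounting sketch of bits-back coding (Section 3.2), assuming a
# made-up toy model: binary latent z, binomial p(x|z). The "coder" only adds
# and subtracts ideal codelengths; no bitstream is produced.
import numpy as np
from scipy.stats import binom

p_z = np.array([0.5, 0.5])          # prior over latent z in {0, 1}
theta = np.array([0.2, 0.7])        # p(x|z) = Binomial(9, theta[z]), x in {0..9}

def p_x_given_z(x, z): return binom.pmf(x, 9, theta[z])
def p_x(x): return sum(p_z[z] * p_x_given_z(x, z) for z in (0, 1))
def q_z_given_x(x):                 # exact posterior p(z|x), so no redundancy
    w = np.array([p_z[z] * p_x_given_z(x, z) for z in (0, 1)])
    return w / w.sum()

rng = np.random.default_rng(0)
x = 6
q = q_z_given_x(x)
z = rng.choice(2, p=q)              # 1. decode z ~ q(z|x) from auxiliary bits
bits = np.log2(q[z])                #    (the decoder gets these bits back later)
bits -= np.log2(p_x_given_z(x, z))  # 2. encode x using p(x|z)
bits -= np.log2(p_z[z])             # 3. encode z using p(z)
print(f"net bits-back length: {bits:.4f}  vs  -log2 p(x) = {-np.log2(p_x(x)):.4f}")
```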
Consequently, the net number of bits transmitted regarding $x$ only will be $\mathbb{E}_{z \sim q(z|x)}[\log q(z|x) - \log p(x, z)]$, which is redundant compared to the desired length $-\log p(x)$ by an amount equal to the KL divergence $D_{\mathrm{KL}}(q(z|x) \,\|\, p(z|x))$ from $q$ to the true posterior.

Bits-back coding also works with continuous $z$ discretized to high precision, with negligible change in codelength [18, 42]. In this case, $q(z|x)$ and $p(z)$ are probability density functions. Discretizing $z$ into bins $\bar{z}$ of small volume $\delta z$ and defining the probability mass functions $Q(\bar{z}|x)$ and $P(\bar{z})$ by the method in Section 3.1, we see that the bits-back codelength remains approximately unchanged:

$$\mathbb{E}_{\bar{z} \sim Q(\bar{z}|x)} \left[ \log \frac{Q(\bar{z}|x)}{p(x|\bar{z})\,P(\bar{z})} \right] \approx \mathbb{E}_{\bar{z} \sim Q(\bar{z}|x)} \left[ \log \frac{q(\bar{z}|x)\,\delta z}{p(x|\bar{z})\,p(\bar{z})\,\delta z} \right] \approx \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{q(z|x)}{p(x|z)\,p(z)} \right] \tag{4}$$

When bits-back coding is applied to a particular latent variable model, such as a VAE, the distributions involved may take on a certain meaning: $p(z)$ would be the prior, $p(x|z)$ would be the decoder network, and $q(z|x)$ would be the encoder network [25, 36, 8, 4, 13, 42, 26]. However, it is important to note that these distributions do not need to correspond explicitly to parts of the model at hand. Any will do for coding data losslessly, though some choices result in better codelengths. We exploit this fact in Section 3.3, where we apply bits-back coding to flow models by constructing artificial distributions $p(x|z)$ and $q(z|x)$, which do not come with a flow model by default.

### 3.3 Local bits-back coding

We now present local bits-back coding, our new high-level principle for using a flow model $f$ to code data discretized to high precision. Following Section 3.1, we discretize continuous data $x$ into $\bar{x}$, which is the center of a bin of volume $\delta x$. The codelength we desire for $\bar{x}$ is the negative log density of $f$ (1), plus a constant depending on the discretization precision:

$$-\log p(\bar{x})\,\delta x = -\log p(f(\bar{x})) - \log |\det J(\bar{x})| - \log \delta x \tag{5}$$

where $J(\bar{x})$ is the Jacobian of $f$ at $\bar{x}$. We will construct two densities $p(z|x)$ and $p(x|z)$ such that bits-back coding attains Eq. (5). We need a small scalar parameter $\sigma > 0$, with which we define

$$p(z|x) := \mathcal{N}\!\left(z;\, f(x),\, \sigma^2 J(x) J(x)^\top\right) \quad \text{and} \quad p(x|z) := \mathcal{N}\!\left(x;\, f^{-1}(z),\, \sigma^2 I\right) \tag{6}$$

To encode $\bar{x}$, local bits-back coding follows the method described in Section 3.2 with continuous $z$:

1. Decode $\bar{z} \sim P(\bar{z}|\bar{x}) = \int_{B(\bar{z})} p(z|\bar{x})\, dz \approx p(\bar{z}|\bar{x})\,\delta z$ from an auxiliary source of random bits
2. Encode $\bar{x}$ using $P(\bar{x}|\bar{z}) = \int_{B(\bar{x})} p(x|\bar{z})\, dx \approx p(\bar{x}|\bar{z})\,\delta x$
3. Encode $\bar{z}$ using $P(\bar{z}) = \int_{B(\bar{z})} p(z)\, dz \approx p(\bar{z})\,\delta z$

The conditional density $p(z|x)$ (6) is artificially injected noise, scaled by $\sigma$ (the flow model $f$ itself remains unmodified). It describes how a local linear approximation of $f$ would behave if it were to act on a small Gaussian around $x$. To justify local bits-back coding, we simply calculate its expected codelength. First, our choices of $p(z|x)$ and $p(x|z)$ (6) satisfy the following equation:

$$\mathbb{E}_{z \sim p(z|x)} \left[ \log p(z|x) - \log p(x|z) \right] = -\log |\det J(x)| + O(\sigma^2) \tag{7}$$

Next, just like standard bits-back coding (4), local bits-back coding attains an expected codelength close to $\mathbb{E}_{z \sim p(z|\bar{x})} L(\bar{x}, z)$, where

$$L(x, z) := \log p(z|x)\,\delta z - \log p(x|z)\,\delta x - \log p(z)\,\delta z \tag{8}$$

Equations (6) to (8) imply that the expected codelength matches our desired codelength (5), up to first order in $\sigma$ (see Appendix A for details):

$$\mathbb{E}_{z \sim p(z|x)} L(x, z) = -\log p(x)\,\delta x + O(\sigma^2) \tag{9}$$

Note that local bits-back coding achieves the desired codelength for flows (5) up to first order in $\sigma$.
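As a numerical sanity check of Eqs. (6) to (9), the following sketch (again illustrative, not the paper's implementation) uses a made-up two-dimensional flow with a hand-coded Jacobian and a standard normal prior, draws $z \sim p(z|x)$, and averages $L(x, z)$ with the $\log \delta$ terms dropped, since $-\log \delta x$ appears identically on both sides of Eq. (9). For small $\sigma$ the average should agree with $-\log p(x)$ up to $O(\sigma^2)$ and Monte Carlo error.

```python
# A minimal numerical check of the local bits-back codelength (Eqs. 6-9),
# assuming a made-up 2D flow f(x) = (exp(x1), x2 + x1^2) and a standard
# normal prior. The log(delta) constants are omitted (they cancel).
import numpy as np
from scipy.stats import multivariate_normal as mvn

def f(x):      return np.array([np.exp(x[0]), x[1] + x[0]**2])
def f_inv(z):  return np.array([np.log(z[0]), z[1] - np.log(z[0])**2])
def jac(x):    return np.array([[np.exp(x[0]), 0.0], [2 * x[0], 1.0]])

def log_prior(z): return mvn.logpdf(z, mean=np.zeros(2))
def log_p_x(x):   # exact flow density, Eq. (1)
    return log_prior(f(x)) + np.log(abs(np.linalg.det(jac(x))))

rng = np.random.default_rng(0)
x, sigma = np.array([0.3, -0.8]), 1e-3
J = jac(x)

def L(z):  # Eq. (8) with the log(delta) terms dropped
    log_p_z_given_x = mvn.logpdf(z, mean=f(x), cov=sigma**2 * J @ J.T)   # Eq. (6)
    log_p_x_given_z = mvn.logpdf(x, mean=f_inv(z), cov=sigma**2 * np.eye(2))
    return log_p_z_given_x - log_p_x_given_z - log_prior(z)

zs = rng.multivariate_normal(f(x), sigma**2 * J @ J.T, size=2000)  # z ~ p(z|x)
print(f"E[L(x,z)] = {np.mean([L(z) for z in zs]):.4f}")
print(f"-log p(x) = {-log_p_x(x):.4f}")
```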
This exactness is in stark contrast to bits-back coding with latent variable models like VAEs, for which the bits-back codelength is the negative evidence lower bound, which is redundant by an amount equal to the KL divergence from the approximate posterior to the true posterior [25].

Local bits-back coding always codes $\bar{x}$ losslessly, no matter the setting of $\sigma$, $\delta x$, and $\delta z$. However, $\sigma$ must be small for the $O(\sigma^2)$ inaccuracy in Eq. (9) to be negligible. But for $\sigma$ to be small, the discretization volumes $\delta z$ and $\delta x$ must be small too; otherwise the discretized Gaussians $p(\bar{z}|x)\,\delta z$ and $p(\bar{x}|z)\,\delta x$ will be poor approximations of the original Gaussians $p(z|x)$ and $p(x|z)$. So, because $\delta x$ must be small, the data must be discretized to high precision. And, because $\delta z$ must be small, a relatively large number of auxiliary bits must be available to decode $\bar{z} \sim p(\bar{z}|x)\,\delta z$. We will resolve the high-precision requirement for the data with another application of bits-back coding in Section 3.5, and we will explore the impact of varying $\sigma$, $\delta x$, and $\delta z$ on real-world data in the experiments in Section 4.

### 3.4 Concrete local bits-back coding algorithms

We have shown that local bits-back coding attains the desired codelength (5) for data discretized to high precision. Now, we instantiate local bits-back coding with concrete algorithms.

#### 3.4.1 Black box flows

Algorithm 1 is the most straightforward implementation of local bits-back coding. It directly implements the steps in Section 3.3 by invoking an external procedure, such as automatic differentiation, to explicitly compute the Jacobian of the flow. It therefore makes no assumptions on the structure of the flow, and hence we call it the black box algorithm.

**Algorithm 1** Local bits-back encoding for black box flows (decoding in Appendix B)

Require: data $\bar{x}$, flow $f$, discretization volumes $\delta x$, $\delta z$, noise level $\sigma$

1. $J \leftarrow J_f(\bar{x})$ — compute the Jacobian of $f$ at $\bar{x}$
2. Decode $\bar{z} \sim \mathcal{N}(f(\bar{x}),\, \sigma^2 J J^\top)\,\delta z$ — by converting to an autoregressive model (Section 3.4.1)
3. Encode $\bar{x}$ using $\mathcal{N}(f^{-1}(\bar{z}),\, \sigma^2 I)\,\delta x$
4. Encode $\bar{z}$ using $p(\bar{z})\,\delta z$

Coding with $p(x|z)$ (6) is efficient because its coordinates are independent [42]. The same applies to the prior $p(z)$ if its coordinates are independent or if another efficient coding algorithm already exists for it (see Section 3.4.3). Coding efficiently with $p(z|x)$ relies on the fact that any multivariate Gaussian can be converted into a linear autoregressive model, which can be coded efficiently, one coordinate at a time, using arithmetic coding or asymmetric numeral systems. To see how, suppose $y = J\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $J$ is a full-rank matrix (such as a Jacobian of a flow model). Let $L$ be the Cholesky decomposition of $JJ^\top$. Since $LL^\top = JJ^\top$, the distribution of $L\epsilon$ is equal to the distribution of $J\epsilon = y$, and so solutions $\hat{y}$ of the linear system $L^{-1}\hat{y} = \epsilon$ have the same distribution as $y$. Because $L$ is triangular, $L^{-1}$ is easily computable and also triangular, and thus $\hat{y}$ can be determined with back substitution: $\hat{y}_i = \big(\epsilon_i - \sum_{j < i} (L^{-1})_{ij}\,\hat{y}_j\big) / (L^{-1})_{ii}$, so each coordinate is conditionally Gaussian given the previous ones and can be coded one at a time (a small numerical sketch of this conversion appears below).

Coupling layers, as in RealNVP-type flows [10, 11], split the data into two halves $x_{\leq d/2}$ and $x_{>d/2}$. The first half is passed through unchanged as $z_{\leq d/2} = x_{\leq d/2}$, and the second half is passed through an elementwise transformation $z_{>d/2} = f(x_{>d/2};\, x_{\leq d/2})$ which is conditioned on the first half. Specializing Algorithm 2 to this kind of flow allows both encoding and decoding to be parallelized over coordinates, resembling how the forward and inverse directions for inference and sampling can be parallelized for these flows [10, 11]. See Appendix B for the complete algorithm listing.
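The following minimal sketch illustrates the Gaussian-to-autoregressive conversion described in Section 3.4.1 (used for step 2 of Algorithm 1). The stand-in Jacobian is made up, and the actual entropy coder (arithmetic coding or ANS) is omitted; the check is that the per-coordinate conditional Gaussians obtained from the Cholesky factor reproduce the joint Gaussian log density, which is what allows coding one coordinate at a time. The recursion below uses the forward-substitution form $y_i = \sum_{j \le i} L_{ij}\epsilon_j$, which is equivalent to the back-substitution view in the text.

```python
# A minimal sketch of converting N(0, J J^T) into a linear autoregressive
# model via the Cholesky factor (Section 3.4.1). The Jacobian J is a made-up
# stand-in; no actual entropy coder is run.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
d = 5
J = rng.normal(size=(d, d)) + 3 * np.eye(d)   # stand-in for a flow Jacobian
Lc = np.linalg.cholesky(J @ J.T)              # Lc Lc^T = J J^T

y = J @ rng.normal(size=d)                    # a sample y = J eps, eps ~ N(0, I)

# Autoregressive decomposition: y_i | y_{<i} ~ N(sum_{j<i} Lc_ij eps_j, Lc_ii^2),
# where eps_{<i} is recovered from y_{<i} coordinate by coordinate.
eps = np.zeros(d)
total = 0.0
for i in range(d):
    mean_i = Lc[i, :i] @ eps[:i]              # conditional mean from earlier coords
    std_i = Lc[i, i]                          # conditional std
    total += norm.logpdf(y[i], mean_i, std_i) # y_i would be coded with this 1-D Gaussian
    eps[i] = (y[i] - mean_i) / std_i          # substitute for the next coordinate

print(f"sum of conditional logpdfs: {total:.6f}")
print(f"joint Gaussian logpdf:      "
      f"{multivariate_normal.logpdf(y, np.zeros(d), J @ J.T):.6f}")
```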
Algorithm 2 is not the only known efficient coding algorithm for autoregressive flows. For example, if $f$ is an autoregressive flow whose prior $p(z) = \prod_i p_i(z_i)$ is independent over coordinates, then $f$ can be rewritten as a continuous autoregressive model $p(x_i \mid x_{<i})$