# Compression with Flows via Local Bits-Back Coding

Jonathan Ho, UC Berkeley (jonathanho@berkeley.edu) · Evan Lohn, UC Berkeley (evan.lohn@berkeley.edu) · Pieter Abbeel, UC Berkeley, covariant.ai (pabbeel@cs.berkeley.edu)

## Abstract

Likelihood-based generative models are the backbones of lossless compression due to the guaranteed existence of codes with lengths close to negative log likelihood. However, there is no guaranteed existence of computationally efficient codes that achieve these lengths, and coding algorithms must be hand-tailored to specific types of generative models to ensure computational efficiency. Such coding algorithms are known for autoregressive models and variational autoencoders, but not for general types of flow models. To fill in this gap, we introduce local bits-back coding, a new compression technique for flow models. We present efficient algorithms that instantiate our technique for many popular types of flows, and we demonstrate that our algorithms closely achieve theoretical codelengths for state-of-the-art flow models on high-dimensional data.

## 1 Introduction

To devise a lossless compression algorithm means to devise a uniquely decodable code whose expected length is as close as possible to the entropy of the data. A general recipe for this is to first train a generative model by minimizing cross entropy to the data distribution, and then construct a code that achieves lengths close to the negative log likelihood of the model. This recipe is justified by classic results in information theory which ensure that the second step is possible: in other words, optimizing cross entropy optimizes the performance of some hypothetical compressor. And, thanks to recent advances in deep likelihood-based generative models, these hypothetical compressors are quite good. Deep autoregressive models, latent variable models, and flow models are now achieving state-of-the-art cross entropy scores on a wide variety of real-world datasets in speech, video, text, images, and other domains [45, 46, 39, 34, 5, 31, 6, 22, 10, 11, 23, 35, 19, 44, 25, 29].

But we are not interested in hypothetical compressors. We are interested in practical, computationally efficient compressors that scale to high-dimensional data and harness the excellent cross entropy scores of modern deep generative models. Unfortunately, naively applying existing codes, like Huffman coding [21], requires computing the model likelihood for all possible values of the data, which expends computational resources scaling exponentially with the data dimension. This inefficiency stems from the lack of assumptions about the generative model's structure: coding algorithms must be tailored to specific types of generative models if we want them to be efficient. There is already a rich literature of tailored coding algorithms for autoregressive models and for variational autoencoders built from conditional distributions that are already tractable for coding [38, 12, 18, 13, 42], but there are currently no such algorithms for general types of flow models [32]. This lack of efficient coding algorithms is a shortcoming of flow models that stands at odds with their many advantages, like fast and realistic sampling, interpretable latent spaces, fast likelihood evaluation, competitive cross entropy scores, and ease of training with unbiased log likelihood gradients [10, 11, 23, 19].
To rectify this situation, we introduce local bits-back coding, a new technique for turning a general, pretrained, off-the-shelf flow model into an efficient coding algorithm suitable for continuous data discretized to high precision. We show how to implement local bits-back coding without assumptions on the flow structure, leading to an algorithm that runs in polynomial time and space with respect to the data dimension. Going further, we show how to tailor our implementation to various specific types of flows, culminating in an algorithm for RealNVP-type flows that runs in linear time and space with respect to the data dimension and is fully parallelizable for both encoding and decoding. We then show how to adapt local bits-back coding to losslessly code data discretized to arbitrarily low precision, and in doing so, we obtain a new compression interpretation of dequantization, a method commonly used to train flow models on discrete data. We test our algorithms on recently proposed flow models trained on real-world image datasets, and we find that they are computationally efficient and attain codelengths in close agreement with theoretical predictions. Open-source code is available at https://github.com/hojonathanho/localbitsback.

## 2 Preliminaries

**Lossless compression.** We begin by defining lossless compression of $d$-dimensional discrete data $x$ using a probability mass function $p(x)$ represented by a generative model. It means to construct a uniquely decodable code $C$, which is an injective map from data sequences to binary strings, whose lengths $|C(x)|$ are close to $-\log p(x)$ [7].¹ The rationale is that if the generative model is expressive and trained well, its cross entropy will be close to the entropy of the data distribution. So, if the lengths of $C$ match the model's negative log probabilities, the expected length of $C$ will be small, and hence $C$ will be a good compression algorithm. Constructing such a code is always possible in theory because the Kraft-McMillan inequality [27, 30] ensures that there always exists some code with lengths $|C(x)| = \lceil -\log p(x) \rceil \approx -\log p(x)$.

¹ We always use base 2 logarithms.

**Flow models.** We wish to construct a computationally efficient code specialized to a flow model $f$, which is a differentiable bijection between continuous data $x \in \mathbb{R}^d$ and latents $z = f(x) \in \mathbb{R}^d$ [9-11]. A flow model comes with a density $p(z)$ on the latent space and thus has an associated sampling process $x = f^{-1}(z)$ for $z \sim p(z)$, under which it defines a probability density function via the change-of-variables formula for densities:

$$\log p(x) = \log p(z) + \log |\det J(x)| \tag{1}$$

where $J(x)$ denotes the Jacobian of $f$ at $x$. Flow models are straightforward to train with maximum likelihood, as Eq. (1) allows unbiased exact log likelihood gradients to be computed efficiently.

**Dequantization.** Standard datasets such as CIFAR10 and ImageNet consist of discrete data $x \in \mathbb{Z}^d$. To make a flow model suitable for such discrete data, it is standard practice to define a derived discrete model $P(x) := \int_{[0,1)^d} p(x + u)\, du$, to be trained by minimizing a dequantization objective, which is a variational bound on the codelength of $P(x)$:

$$\mathbb{E}_{u \sim q(u|x)} \left[ -\log \frac{p(x + u)}{q(u|x)} \right] \;\geq\; -\log \int_{[0,1)^d} p(x + u)\, du \;=\; -\log P(x) \tag{2}$$

Here, $q(u|x)$ proposes dequantization noise $u \in [0, 1)^d$ that transforms discrete data $x$ into continuous data $x + u$; it can be fixed to a uniform distribution [43, 41, 39] or parameterized as another flow to be trained jointly with $f$ [19].
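To make Eqs. (1) and (2) concrete, here is a minimal numerical sketch (not the paper's released code): a made-up elementwise flow with a standard normal prior, its exact change-of-variables log density, and a Monte Carlo estimate of the uniform-dequantization bound for a made-up discrete data point. With uniform $q(u|x)$, the $-\log q$ term in Eq. (2) vanishes. The flow, prior, and data here are illustrative assumptions only.

```python
# A minimal sketch of Eq. (1) and Eq. (2), assuming a toy elementwise flow
# z_i = arcsinh(x_i) with a standard normal prior. Not the paper's code.
import numpy as np
from scipy.stats import norm

d = 4
rng = np.random.default_rng(0)

def f(x):                      # elementwise flow: z_i = arcsinh(x_i)
    return np.arcsinh(x)

def log_det_jacobian(x):       # log |det J(x)| for this elementwise flow
    return np.sum(-0.5 * np.log(1.0 + x**2))

def log_prior(z):              # standard normal prior on latents
    return np.sum(norm.logpdf(z))

def log_density(x):            # Eq. (1): log p(x) = log p(f(x)) + log |det J(x)|
    return log_prior(f(x)) + log_det_jacobian(x)

# Uniform dequantization bound (Eq. 2) for discrete x' in Z^d:
# -E_{u ~ U[0,1)^d}[log p(x' + u)] >= -log P(x'), estimated by Monte Carlo.
x_discrete = rng.integers(-3, 4, size=d).astype(np.float64)
bound_bits = -np.mean([log_density(x_discrete + rng.uniform(size=d))
                       for _ in range(1000)]) / np.log(2)   # nats -> bits
print(f"dequantization bound: {bound_bits:.2f} bits")
```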
This dequantization objective serves as a theoretical codelength for flow models trained on discrete data, just as negative log probability mass serves as a theoretical codelength for discrete generative models [41].

## 3 Local bits-back coding

Our goal is to develop computationally efficient coding algorithms for flows trained with dequantization (2). In Sections 3.1 to 3.4, we develop algorithms that use flows to code continuous data discretized to high precision. In Section 3.5, we adapt these algorithms to losslessly code data discretized to low precision, attaining our desired codelength (2) for discrete data.

### 3.1 Coding continuous data using discretization

We first address the problem of developing coding algorithms that attain codelengths given by negative log densities of flow models, such as Eq. (1). Probability density functions do not directly map to codelengths, unlike probability mass functions, which enjoy the result of the Kraft-McMillan inequality. So, following standard procedure [7, Section 8.3], we discretize the data to a high precision $k$ and code this discretized data with a certain probability mass function derived from the density model. Specifically, we tile $\mathbb{R}^d$ with hypercubes of volume $\delta x := 2^{-kd}$; we call each hypercube a bin. For $x \in \mathbb{R}^d$, let $B(x)$ be the unique bin that contains $x$, and let $\bar{x}$ be the center of the bin $B(x)$. We call $\bar{x}$ the discretized version of $x$. For a sufficiently smooth probability density function $p(x)$, such as a density coming from a neural network flow model, the probability mass function $P(\bar{x}) := \int_{B(\bar{x})} p(x)\, dx$ takes on the pleasingly simple form $P(\bar{x}) \approx p(\bar{x})\,\delta x$ when the precision $k$ is large. Now we invoke the Kraft-McMillan inequality, so the theoretical codelength for $\bar{x}$ using $P$ is

$$-\log P(\bar{x}) \approx -\log p(\bar{x})\,\delta x \tag{3}$$

bits. This is the compression interpretation of the negative log density: it is a codelength for data discretized to high precision, once the total number of bits of discretization precision, $-\log \delta x = kd$, is added. It is this codelength, Eq. (3), that we will try to achieve with an efficient algorithm for flow models. We defer the problem of coding data discretized to low precision to Section 3.5.

### 3.2 Background on bits-back coding

The main tool we will employ is bits-back coding [47, 18, 13, 20], a coding technique originally designed for latent variable models (the connection to flow models is presented in Section 3.3 and is new to our work). Bits-back coding codes $x$ using a distribution of the form $p(x) = \sum_z p(x, z)$, where $p(x, z) = p(x|z)\,p(z)$ includes a latent variable $z$; it is relevant when $z$ ranges over an exponentially large set, which makes it intractable to code with $p(x)$ even though coding with $p(x|z)$ and $p(z)$ may be tractable individually. Bits-back coding introduces a new distribution $q(z|x)$ that is tractable for coding, and the encoder jointly encodes $x$ along with $z \sim q(z|x)$ via these steps:

1. Decode $z \sim q(z|x)$ from an auxiliary source of random bits
2. Encode $x$ using $p(x|z)$
3. Encode $z$ using $p(z)$

The first step, which decodes $z$ from random bits, produces a sample $z \sim q(z|x)$. The second and third steps transmit $z$ along with $x$. At decoding time, the decoder recovers $(x, z)$, then recovers the bits the encoder used to sample $z$ using $q$. So, the encoder will have transmitted extra information in addition to $x$: precisely $\mathbb{E}_{z \sim q(z|x)}[-\log q(z|x)]$ bits on average.
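This accounting can be checked numerically before stating the net codelength in general. The following minimal sketch (illustrative only) uses an idealized coder that merely tracks codelengths rather than an actual bitstream, and a made-up toy latent variable model; because $q(z|x)$ is set to the exact posterior here, the net bits-back length equals $-\log_2 p(x)$ exactly for every sample.

```python
# A minimal accounting sketch of bits-back coding (Section 3.2), assuming a
# made-up toy model: binary latent z, binomial p(x|z). The "coder" only adds
# and subtracts ideal codelengths; no bitstream is produced.
import numpy as np
from scipy.stats import binom

p_z = np.array([0.5, 0.5])          # prior over latent z in {0, 1}
theta = np.array([0.2, 0.7])        # p(x|z) = Binomial(9, theta[z]), x in {0..9}

def p_x_given_z(x, z): return binom.pmf(x, 9, theta[z])
def p_x(x): return sum(p_z[z] * p_x_given_z(x, z) for z in (0, 1))
def q_z_given_x(x):                 # exact posterior p(z|x), so no redundancy
    w = np.array([p_z[z] * p_x_given_z(x, z) for z in (0, 1)])
    return w / w.sum()

rng = np.random.default_rng(0)
x = 6
q = q_z_given_x(x)
z = rng.choice(2, p=q)              # 1. decode z ~ q(z|x) from auxiliary bits
bits = np.log2(q[z])                #    (the decoder gets these bits back later)
bits -= np.log2(p_x_given_z(x, z))  # 2. encode x using p(x|z)
bits -= np.log2(p_z[z])             # 3. encode z using p(z)
print(f"net bits-back length: {bits:.4f}  vs  -log2 p(x) = {-np.log2(p_x(x)):.4f}")
```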
Consequently, the net number of bits transmitted regarding $x$ only will be $\mathbb{E}_{z \sim q(z|x)}[\log q(z|x) - \log p(x, z)]$, which is redundant compared to the desired length $-\log p(x)$ by an amount equal to the KL divergence $D_{\mathrm{KL}}(q(z|x) \,\|\, p(z|x))$ from $q$ to the true posterior.

Bits-back coding also works with continuous $z$ discretized to high precision, with negligible change in codelength [18, 42]. In this case, $q(z|x)$ and $p(z)$ are probability density functions. Discretizing $z$ into bins $\bar{z}$ of small volume $\delta z$ and defining the probability mass functions $Q(\bar{z}|x)$ and $P(\bar{z})$ by the method in Section 3.1, we see that the bits-back codelength remains approximately unchanged:

$$\mathbb{E}_{\bar{z} \sim Q(\bar{z}|x)} \left[ \log \frac{Q(\bar{z}|x)}{p(x|\bar{z})\,P(\bar{z})} \right] \approx \mathbb{E}_{\bar{z} \sim Q(\bar{z}|x)} \left[ \log \frac{q(\bar{z}|x)\,\delta z}{p(x|\bar{z})\,p(\bar{z})\,\delta z} \right] \approx \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{q(z|x)}{p(x|z)\,p(z)} \right] \tag{4}$$

When bits-back coding is applied to a particular latent variable model, such as a VAE, the distributions involved may take on a certain meaning: $p(z)$ would be the prior, $p(x|z)$ would be the decoder network, and $q(z|x)$ would be the encoder network [25, 36, 8, 4, 13, 42, 26]. However, it is important to note that these distributions do not need to correspond explicitly to parts of the model at hand. Any will do for coding data losslessly, though some choices result in better codelengths. We exploit this fact in Section 3.3, where we apply bits-back coding to flow models by constructing artificial distributions $p(x|z)$ and $q(z|x)$, which do not come with a flow model by default.

### 3.3 Local bits-back coding

We now present local bits-back coding, our new high-level principle for using a flow model $f$ to code data discretized to high precision. Following Section 3.1, we discretize continuous data $x$ into $\bar{x}$, which is the center of a bin of volume $\delta x$. The codelength we desire for $\bar{x}$ is the negative log density of $f$ (1), plus a constant depending on the discretization precision:

$$-\log p(\bar{x})\,\delta x = -\log p(f(\bar{x})) - \log |\det J(\bar{x})| - \log \delta x \tag{5}$$

where $J(\bar{x})$ is the Jacobian of $f$ at $\bar{x}$. We will construct two densities $p(z|x)$ and $p(x|z)$ such that bits-back coding attains Eq. (5). We need a small scalar parameter $\sigma > 0$, with which we define

$$p(z|x) := \mathcal{N}\!\left(z;\, f(x),\, \sigma^2 J(x) J(x)^\top\right) \quad \text{and} \quad p(x|z) := \mathcal{N}\!\left(x;\, f^{-1}(z),\, \sigma^2 I\right) \tag{6}$$

To encode $\bar{x}$, local bits-back coding follows the method described in Section 3.2 with continuous $z$:

1. Decode $\bar{z} \sim P(\bar{z}|\bar{x}) = \int_{B(\bar{z})} p(z|\bar{x})\, dz \approx p(\bar{z}|\bar{x})\,\delta z$ from an auxiliary source of random bits
2. Encode $\bar{x}$ using $P(\bar{x}|\bar{z}) = \int_{B(\bar{x})} p(x|\bar{z})\, dx \approx p(\bar{x}|\bar{z})\,\delta x$
3. Encode $\bar{z}$ using $P(\bar{z}) = \int_{B(\bar{z})} p(z)\, dz \approx p(\bar{z})\,\delta z$

The conditional density $p(z|x)$ (6) is artificially injected noise, scaled by $\sigma$ (the flow model $f$ itself remains unmodified). It describes how a local linear approximation of $f$ would behave if it were to act on a small Gaussian around $x$. To justify local bits-back coding, we simply calculate its expected codelength. First, our choices of $p(z|x)$ and $p(x|z)$ (6) satisfy the following equation:

$$\mathbb{E}_{z \sim p(z|x)} \left[ \log p(z|x) - \log p(x|z) \right] = -\log |\det J(x)| + O(\sigma^2) \tag{7}$$

Next, just like standard bits-back coding (4), local bits-back coding attains an expected codelength close to $\mathbb{E}_{z \sim p(z|\bar{x})} L(\bar{x}, z)$, where

$$L(x, z) := \log p(z|x)\,\delta z - \log p(x|z)\,\delta x - \log p(z)\,\delta z \tag{8}$$

Equations (6) to (8) imply that the expected codelength matches our desired codelength (5), up to first order in $\sigma$ (see Appendix A for details):

$$\mathbb{E}_{z \sim p(z|x)} L(x, z) = -\log p(x)\,\delta x + O(\sigma^2) \tag{9}$$

Note that local bits-back coding achieves the desired codelength for flows (5) up to first order in $\sigma$.
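As a numerical sanity check of Eqs. (6) to (9), the following sketch (again illustrative, not the paper's implementation) uses a made-up two-dimensional flow with a hand-coded Jacobian and a standard normal prior, draws $z \sim p(z|x)$, and averages $L(x, z)$ with the $\log \delta$ terms dropped, since $-\log \delta x$ appears identically on both sides of Eq. (9). For small $\sigma$ the average should agree with $-\log p(x)$ up to $O(\sigma^2)$ and Monte Carlo error.

```python
# A minimal numerical check of the local bits-back codelength (Eqs. 6-9),
# assuming a made-up 2D flow f(x) = (exp(x1), x2 + x1^2) and a standard
# normal prior. The log(delta) constants are omitted (they cancel).
import numpy as np
from scipy.stats import multivariate_normal as mvn

def f(x):      return np.array([np.exp(x[0]), x[1] + x[0]**2])
def f_inv(z):  return np.array([np.log(z[0]), z[1] - np.log(z[0])**2])
def jac(x):    return np.array([[np.exp(x[0]), 0.0], [2 * x[0], 1.0]])

def log_prior(z): return mvn.logpdf(z, mean=np.zeros(2))
def log_p_x(x):   # exact flow density, Eq. (1)
    return log_prior(f(x)) + np.log(abs(np.linalg.det(jac(x))))

rng = np.random.default_rng(0)
x, sigma = np.array([0.3, -0.8]), 1e-3
J = jac(x)

def L(z):  # Eq. (8) with the log(delta) terms dropped
    log_p_z_given_x = mvn.logpdf(z, mean=f(x), cov=sigma**2 * J @ J.T)   # Eq. (6)
    log_p_x_given_z = mvn.logpdf(x, mean=f_inv(z), cov=sigma**2 * np.eye(2))
    return log_p_z_given_x - log_p_x_given_z - log_prior(z)

zs = rng.multivariate_normal(f(x), sigma**2 * J @ J.T, size=2000)  # z ~ p(z|x)
print(f"E[L(x,z)] = {np.mean([L(z) for z in zs]):.4f}")
print(f"-log p(x) = {-log_p_x(x):.4f}")
```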
This exactness is in stark contrast to bits-back coding with latent variable models like VAEs, for which the bits-back codelength is the negative evidence lower bound, which is redundant by an amount equal to the KL divergence from the approximate posterior to the true posterior [25].

Local bits-back coding always codes $\bar{x}$ losslessly, no matter the setting of $\sigma$, $\delta x$, and $\delta z$. However, $\sigma$ must be small for the $O(\sigma^2)$ inaccuracy in Eq. (9) to be negligible. But for $\sigma$ to be small, the discretization volumes $\delta z$ and $\delta x$ must be small too; otherwise the discretized Gaussians $p(\bar{z}|x)\,\delta z$ and $p(\bar{x}|z)\,\delta x$ will be poor approximations of the original Gaussians $p(z|x)$ and $p(x|z)$. So, because $\delta x$ must be small, the data must be discretized to high precision. And, because $\delta z$ must be small, a relatively large number of auxiliary bits must be available to decode $\bar{z} \sim p(\bar{z}|x)\,\delta z$. We will resolve the high-precision requirement for the data with another application of bits-back coding in Section 3.5, and we will explore the impact of varying $\sigma$, $\delta x$, and $\delta z$ on real-world data in the experiments in Section 4.

### 3.4 Concrete local bits-back coding algorithms

We have shown that local bits-back coding attains the desired codelength (5) for data discretized to high precision. Now, we instantiate local bits-back coding with concrete algorithms.

#### 3.4.1 Black box flows

Algorithm 1 is the most straightforward implementation of local bits-back coding. It directly implements the steps in Section 3.3 by invoking an external procedure, such as automatic differentiation, to explicitly compute the Jacobian of the flow. It therefore makes no assumptions on the structure of the flow, and hence we call it the black box algorithm.

**Algorithm 1** Local bits-back encoding for black box flows (decoding in Appendix B)

Require: data $\bar{x}$, flow $f$, discretization volumes $\delta x$, $\delta z$, noise level $\sigma$

1. $J \leftarrow J_f(\bar{x})$ — compute the Jacobian of $f$ at $\bar{x}$
2. Decode $\bar{z} \sim \mathcal{N}(f(\bar{x}),\, \sigma^2 J J^\top)\,\delta z$ — by converting to an autoregressive model (Section 3.4.1)
3. Encode $\bar{x}$ using $\mathcal{N}(f^{-1}(\bar{z}),\, \sigma^2 I)\,\delta x$
4. Encode $\bar{z}$ using $p(\bar{z})\,\delta z$

Coding with $p(x|z)$ (6) is efficient because its coordinates are independent [42]. The same applies to the prior $p(z)$ if its coordinates are independent or if another efficient coding algorithm already exists for it (see Section 3.4.3). Coding efficiently with $p(z|x)$ relies on the fact that any multivariate Gaussian can be converted into a linear autoregressive model, which can be coded efficiently, one coordinate at a time, using arithmetic coding or asymmetric numeral systems. To see how, suppose $y = J\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $J$ is a full-rank matrix (such as a Jacobian of a flow model). Let $L$ be the Cholesky decomposition of $JJ^\top$. Since $LL^\top = JJ^\top$, the distribution of $L\epsilon$ is equal to the distribution of $J\epsilon = y$, and so solutions $\hat{y}$ of the linear system $L^{-1}\hat{y} = \epsilon$ have the same distribution as $y$. Because $L$ is triangular, $L^{-1}$ is easily computable and also triangular, and thus $\hat{y}$ can be determined with back substitution: $\hat{y}_i = \big(\epsilon_i - \sum_{j < i} (L^{-1})_{ij}\,\hat{y}_j\big) / (L^{-1})_{ii}$, so each coordinate is conditionally Gaussian given the previous ones and can be coded one at a time (a small numerical sketch of this conversion appears below).

Coupling layers, as in RealNVP-type flows [10, 11], split the data into two halves $x_{\leq d/2}$ and $x_{>d/2}$. The first half is passed through unchanged as $z_{\leq d/2} = x_{\leq d/2}$, and the second half is passed through an elementwise transformation $z_{>d/2} = f(x_{>d/2};\, x_{\leq d/2})$ which is conditioned on the first half. Specializing Algorithm 2 to this kind of flow allows both encoding and decoding to be parallelized over coordinates, resembling how the forward and inverse directions for inference and sampling can be parallelized for these flows [10, 11]. See Appendix B for the complete algorithm listing.
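The following minimal sketch illustrates the Gaussian-to-autoregressive conversion described in Section 3.4.1 (used for step 2 of Algorithm 1). The stand-in Jacobian is made up, and the actual entropy coder (arithmetic coding or ANS) is omitted; the check is that the per-coordinate conditional Gaussians obtained from the Cholesky factor reproduce the joint Gaussian log density, which is what allows coding one coordinate at a time. The recursion below uses the forward-substitution form $y_i = \sum_{j \le i} L_{ij}\epsilon_j$, which is equivalent to the back-substitution view in the text.

```python
# A minimal sketch of converting N(0, J J^T) into a linear autoregressive
# model via the Cholesky factor (Section 3.4.1). The Jacobian J is a made-up
# stand-in; no actual entropy coder is run.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
d = 5
J = rng.normal(size=(d, d)) + 3 * np.eye(d)   # stand-in for a flow Jacobian
Lc = np.linalg.cholesky(J @ J.T)              # Lc Lc^T = J J^T

y = J @ rng.normal(size=d)                    # a sample y = J eps, eps ~ N(0, I)

# Autoregressive decomposition: y_i | y_{<i} ~ N(sum_{j<i} Lc_ij eps_j, Lc_ii^2),
# where eps_{<i} is recovered from y_{<i} coordinate by coordinate.
eps = np.zeros(d)
total = 0.0
for i in range(d):
    mean_i = Lc[i, :i] @ eps[:i]              # conditional mean from earlier coords
    std_i = Lc[i, i]                          # conditional std
    total += norm.logpdf(y[i], mean_i, std_i) # y_i would be coded with this 1-D Gaussian
    eps[i] = (y[i] - mean_i) / std_i          # substitute for the next coordinate

print(f"sum of conditional logpdfs: {total:.6f}")
print(f"joint Gaussian logpdf:      "
      f"{multivariate_normal.logpdf(y, np.zeros(d), J @ J.T):.6f}")
```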
Algorithm 2 is not the only known efficient coding algorithm for autoregressive flows. For example, if $f$ is an autoregressive flow whose prior $p(z) = \prod_i p_i(z_i)$ is independent over coordinates, then $f$ can be rewritten as a continuous autoregressive model $p(x_i \mid x_{<i})$