Published as a conference paper at ICLR 2022

LOSSLESS COMPRESSION WITH PROBABILISTIC CIRCUITS

Anji Liu, CS Department, UCLA, liuanji@cs.ucla.edu
Stephan Mandt, CS Department, University of California, Irvine, mandt@uci.edu
Guy Van den Broeck, CS Department, UCLA, guyvdb@cs.ucla.edu

ABSTRACT

Despite extensive progress on image generation, common deep generative model architectures are not easily applied to lossless compression. For example, VAEs suffer from a compression cost overhead due to their latent variables. This overhead can only be partially eliminated with elaborate schemes such as bits-back coding, often resulting in poor single-sample compression rates. To overcome such problems, we establish a new class of tractable lossless compression models that permit efficient encoding and decoding: Probabilistic Circuits (PCs). These are a class of neural networks involving |p| computational units that support efficient marginalization over arbitrary subsets of the D feature dimensions, enabling efficient arithmetic coding. We derive efficient encoding and decoding schemes that both have time complexity O(log(D) · |p|), where a naive scheme would have linear costs in D and |p|, making the approach highly scalable. Empirically, our PC-based (de)compression algorithm runs 5-40 times faster than neural compression algorithms that achieve similar bitrates. By scaling up the traditional PC structure learning pipeline, we achieve state-of-the-art results on image datasets such as MNIST. Furthermore, PCs can be naturally integrated with existing neural compression algorithms to improve the performance of these base models on natural image datasets. Our results highlight the potential impact that non-standard learning architectures may have on neural data compression.

1 INTRODUCTION

Thanks to their expressiveness, modern Deep Generative Models (DGMs) such as Flow-based models (Dinh et al., 2014), Variational Autoencoders (VAEs) (Kingma & Welling, 2013), and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have achieved state-of-the-art results on generative tasks such as creating high-quality samples (Vahdat & Kautz, 2020) and learning low-dimensional representations of data (Zheng & Sun, 2019). However, these successes have not been fully transferred to neural lossless compression; see (Yang et al., 2022) for a recent survey. Specifically, GANs cannot be used for lossless compression due to their inability to assign likelihoods to observations. Latent variable models such as VAEs rely on rate estimates obtained by lower-bounding the likelihood of the data, i.e., the quantity which is theoretically optimal for lossless compression; they furthermore rely on sophisticated schemes such as bits-back coding (Hinton & Van Camp, 1993) to realize these rates, oftentimes resulting in poor single-sample compression ratios (Kingma et al., 2019). Therefore, good generative performance does not imply good lossless compression performance, as the model needs to support efficient algorithms to encode and decode close to the model's theoretical rate estimate. While both Flow- and VAE-based compression algorithms (Hoogeboom et al., 2019; Kingma et al., 2019) support efficient and near-optimal compression under certain assumptions (e.g., the existence of an additional source of random bits), we show that Probabilistic Circuits (PCs) (Choi et al., 2020) are also suitable for lossless compression tasks.
This class of tractable models has a particular structure that allows efficient marginalization of its random variables, a property that, as we show, enables efficient conditional entropy coding. Therefore, we introduce PCs as backbone models and develop (de)compression algorithms that achieve high compression ratios and high computational efficiency.

Similar to other neural compression methods, the proposed lossless compression approach operates in two main phases: (i) learning good PC models that approximate the data distribution, and (ii) compressing and decompressing samples x with computationally efficient algorithms. The proposed lossless compression approach makes four main contributions:

- A new class of entropy models. This is the first paper that uses PCs for data compression. In contrast to other neural compression algorithms, we leverage recent innovations in PCs to automatically learn good model architectures from data. With customized GPU implementations and better training pipelines, we are the first to train PC models with competitive performance compared to deep learning models on datasets such as raw MNIST.

- A new coding scheme. We develop a provably efficient (Thm. 1) lossless compression algorithm for PCs that takes advantage of their ability to efficiently compute arbitrary marginal probabilities. Specifically, we first show which kinds of marginal probabilities are required for (de)compression. The proposed algorithm combines an inference algorithm that computes these marginals efficiently given a learned PC and SoTA streaming codes that use the marginals for en- and decoding.

- Competitive compression rates. Our experiments show that on MNIST and EMNIST, the PC-based compression algorithm achieved SoTA bitrates. On more complex data such as subsampled ImageNet, we hybridize PCs with normalizing flows and show that PCs can significantly improve the bitrates of the base normalizing flow models.

- Competitive runtimes. Our (de)compressor runs 5-40x faster compared to available implementations of neural lossless compressors with near-SoTA performance on datasets such as MNIST.1

1 Note that there exist compression algorithms optimized particularly for speed by using simple entropy models (Townsend et al., 2019), though that also leads to worse bitrates. See Sec. 3.3 for a detailed discussion.

Our open-source implementation of the PC-based (de)compression algorithm can be found at https://github.com/Juice-jl/PressedJuice.jl.

Notation We denote random variables by uppercase letters (e.g., X) and their assignments by lowercase letters (e.g., x). Analogously, we use bold uppercase (e.g., X) and lowercase (e.g., x) letters to denote sets of variables and their joint assignments, respectively. The set of all possible joint assignments to variables X is denoted val(X).

2 TRACTABILITY MATTERS IN LOSSLESS COMPRESSION

The goal of lossless compression is to map every input sample to an output codeword such that (i) the original input can be reconstructed from the codeword, and (ii) the expected length of the codewords is minimized. Practical (neural) lossless compression algorithms operate in two main phases: learning and compression (Yang et al., 2022). In the learning phase, a generative model p(X) is learned from a dataset D := {x^(i)}_{i=1}^N. According to Shannon's source coding theorem (Shannon, 1948), the expected codeword length is lower-bounded by the negative cross-entropy between the data distribution D and the model distribution p(X) (i.e., −E_{x~D}[log p(x)]), rendering it a natural and widely used objective to optimize the model (Hoogeboom et al., 2019; Mentzer et al., 2019).
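To make this objective concrete, here is a minimal sketch (ours, not from the paper; the function name and the example log-likelihood are illustrative) that converts a model's log-likelihoods into the bits-per-dimension (bpd) figure that serves as the theoretical rate in the experiments below.

```python
import numpy as np

def bits_per_dimension(log_probs_nats, num_dims):
    """Theoretical bitrate implied by a model: average negative log-likelihood,
    converted from nats to bits and normalized per feature dimension.

    log_probs_nats: array of log p(x) values (natural log), one per sample.
    num_dims: number of feature dimensions D of each sample.
    """
    nll_bits = -np.asarray(log_probs_nats) / np.log(2.0)  # nats -> bits
    return float(np.mean(nll_bits) / num_dims)

# Example: a hypothetical model assigning log p(x) = -730 nats to a 28x28 image
# corresponds to a theoretical rate of about 1.34 bits per dimension (bpd).
print(bits_per_dimension([-730.0], 28 * 28))
```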
In the compression phase, compression algorithms take the learned model p and samples x as input and generate codewords whose expected length approaches the theoretical limit (i.e., the negative cross-entropy between D and p). Although there exist various close-to-optimal compression schemes (e.g., Huffman Coding (Huffman, 1952) and Arithmetic Coding (Rissanen, 1976)), a natural question to ask is: what are the requirements on the model p such that compression algorithms can utilize it for encoding/decoding in a computationally efficient manner? In this paper, we highlight the advantages of tractable probabilistic models for lossless compression by introducing a concrete class of models that are expressive and support efficient encoding and decoding.

To encode a sample x, a standard streaming code operates by sequentially encoding every symbol xi into a bitstream b, such that xi occupies approximately −log p(xi | x1, ..., xi−1) bits in b. As a result, the length of b is approximately −log p(x). For example, Arithmetic Coding (AC) encodes the symbols {xi}_{i=1}^D (define D := |X| as the number of features) sequentially by successively refining an interval that represents the sample, starting from the initial interval [0, 1). To encode xi, the algorithm partitions the current interval [a, b) using the left and right side cumulative probabilities of xi:

    li(xi) := p(Xi < xi | x1, ..., xi−1),    hi(xi) := p(Xi ≤ xi | x1, ..., xi−1).    (1)

Despite implementation differences, computing the cumulative probabilities li(x) and hi(x) is also required for many other streaming codes (e.g., rANS). Therefore, for most streaming codes, the main computation cost of both the encoding and decoding process comes from calculating li(x) and hi(x).
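To illustrate how a streaming code consumes the cumulative probabilities of Eq. (1), the sketch below (ours; an idealized, infinite-precision version, not the paper's rANS-based implementation) performs the interval refinement of Arithmetic Coding with Python floats. Practical coders use fixed-precision integer arithmetic with renormalization, or stream codes such as rANS, to avoid the precision loss of floats.

```python
from typing import Callable, List, Tuple

def ac_interval_encode(symbols: List[int],
                       low_cdf: Callable[[int, int], float],
                       high_cdf: Callable[[int, int], float]) -> Tuple[float, float]:
    """Illustrative Arithmetic Coding interval refinement.

    low_cdf(i, x_i)  ~ l_i(x_i) = p(X_i <  x_i | x_{<i})
    high_cdf(i, x_i) ~ h_i(x_i) = p(X_i <= x_i | x_{<i})
    Returns the final interval [a, b); any number inside it identifies the sample.
    """
    a, b = 0.0, 1.0
    for i, x_i in enumerate(symbols):
        width = b - a
        a, b = a + width * low_cdf(i, x_i), a + width * high_cdf(i, x_i)
    return a, b

# Toy usage with a fully-factorized Bernoulli(0.8) model over binary symbols:
l = lambda i, x: 0.0 if x == 0 else 0.2   # p(X_i < x)
h = lambda i, x: 0.2 if x == 0 else 1.0   # p(X_i <= x)
print(ac_interval_encode([1, 1, 0, 1], l, h))  # final width = p(x) = 0.8*0.8*0.2*0.8
```

The final interval has width p(x), so roughly −log2 p(x) bits suffice to identify a point inside it; decoding reverses the recursion by searching, at each step, for the symbol whose [li(xi), hi(xi)) bracket contains the received code point.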
The main challenge for the above (de)compression algorithm is to balance the expressiveness of p and the computation cost of {li(xi), hi(xi)}_{i=1}^D. On the one hand, highly expressive probability models such as energy-based models (LeCun et al., 2006; Ranzato et al., 2007) can potentially achieve high compression ratios at the cost of slow runtime, which is due to the requirement of estimating the model's normalizing constant. On the other hand, models that make strong independence assumptions (e.g., n-gram, fully-factorized) are cheap to evaluate but lack the expressiveness to model complex distributions over structured data such as images.2

2 Flow-model-based neural compression algorithms adopt a p defined on mutually independent latent variables (denoted Z), and improve expressiveness by learning bijective functions between Z and X (i.e., the input space). This is orthogonal to our approach of directly learning a better p. Furthermore, we can naturally integrate the proposed expressive p with bijective functions and achieve better performance, as demonstrated in Sec. 5.

This paper explores the middle ground between the above two extremes. Specifically, we ask: are there probabilistic models that are both expressive and permit efficient computation of the conditional probabilities in Eq. (1)? This question can be answered in the affirmative by establishing a new class of tractable lossless compression algorithms using Probabilistic Circuits (PCs) (Choi et al., 2020), which are neural networks that can compute various probabilistic queries efficiently.

In the following, we overview the empirical and theoretical results of the proposed (de)compression algorithm. We start with theoretical findings: the proposed encoding and decoding algorithms enjoy time complexity O(log(D) · |p|), where |p| (≥ D) is the PC model size. The backbone of both algorithms, formally introduced in Sec. 3, is an algorithm that computes the 2D conditional probabilities {li(xi), hi(xi)}_{i=1}^D given any x efficiently, as justified by the following theorem.

Theorem 1 (informal). Let x be a D-dimensional sample, and let p be a PC model of size |p|, as proposed in this paper. We then have that computing all quantities {li(xi), hi(xi)}_{i=1}^D takes O(log(D) · |p|) time. Therefore, en- or decoding x with a streaming code (e.g., Arithmetic Coding) takes O(log(D) · |p| + D) = O(log(D) · |p|) time.

The properties of PCs that enable this efficient lossless compression algorithm will be described in Sec. 3.1, and the backbone inference algorithm with O(log(D) · |p|) time complexity will later be shown as Alg. 1.

Table 1 provides an (incomplete) summary of our empirical results. First, the PC-based lossless compression algorithm is fast and competitive. As shown in Table 1, the small PC model achieved a near-SoTA bitrate while being 15x faster than other neural compression algorithms with a similar bitrate. Next, PCs can be integrated with Flow-/VAE-based compression methods. As illustrated in Table 1 (right), the integrated model significantly improved performance on sub-sampled ImageNet compared to the base IDF model.

Table 1: An (incomplete) summary of our empirical results. Comp. stands for compression.

MNIST (10,000 test images):

| Method     | Theoretical bpd | Comp. bpd | En- & decoding time |
|------------|-----------------|-----------|---------------------|
| PC (small) | 1.26            | 1.30      | 53                  |
| PC (large) | 1.20            | 1.24      | 168                 |
| IDF        | 1.90            | 1.96      | 880                 |
| BitSwap    | 1.27            | 1.31      | 904                 |

Sub-sampled ImageNet (theoretical bpd):

| Method  | ImageNet32 | ImageNet64 |
|---------|------------|------------|
| PC+IDF  | 3.99       | 3.71       |
| IDF     | 4.15       | 3.90       |
| RealNVP | 4.28       | 3.98       |
| Glow    | 4.09       | 3.81       |

3 COMPUTATIONALLY EFFICIENT (DE)COMPRESSION WITH PCS

In the previous section, we boiled down the task of lossless compression to calculating the conditional probabilities {li(xi), hi(xi)}_{i=1}^D given p and x. This section shows how these queries can be computed efficiently with PCs. In the following, we first introduce relevant background on PCs (Sec. 3.1), then introduce the PC-based (de)compression algorithm (Sec. 3.2), and finally empirically evaluate the optimality and speed of the proposed compressor and decompressor (Sec. 3.3).

3.1 BACKGROUND: PROBABILISTIC CIRCUITS

Figure 1: An example structured-decomposable PC. The feedforward order is from left to right; inputs are assumed to be boolean variables; parameters are labeled on the corresponding edges. The probability of each unit given input assignment x1x2x4 is labeled in blue next to the corresponding unit.

Probabilistic Circuits (PCs) are an umbrella term for a wide variety of Tractable Probabilistic Models (TPMs). They provide a set of succinct definitions for popular TPMs such as Sum-Product Networks (Poon & Domingos, 2011), Arithmetic Circuits (Shen et al., 2016), and Probabilistic Sentential Decision Diagrams (Kisa et al., 2014). The syntax and semantics of a PC are defined as follows.

Definition 1 (Probabilistic Circuits). A PC p(X) represents a probability distribution over X via a parametrized directed acyclic graph (DAG) with a single root node nr. Similar to neural networks, every node of the DAG defines a computational unit. Specifically, each leaf node corresponds to an input unit; each inner node n represents either a sum or a product unit that receives inputs from its children, denoted in(n). Each node n encodes a probability distribution pn, defined as follows:

    pn(x) :=  fn(x)                           if n is an input unit,
              Σ_{c ∈ in(n)} θn,c · pc(x)      if n is a sum unit,
              Π_{c ∈ in(n)} pc(x)             if n is a product unit,      (2)

where fn(·) is a univariate input distribution (e.g., Gaussian, Categorical), and θn,c denotes the parameter that corresponds to edge (n, c). Intuitively, sum and product units encode weighted mixtures and factorized distributions of their children's distributions, respectively. To ensure that a PC models a valid distribution, we assume the parameters associated with any sum unit n are normalized: ∀n, Σ_{c ∈ in(n)} θn,c = 1. We further assume w.l.o.g. that a PC alternates between sum and product units before reaching an input unit. The size of a PC p, denoted |p|, is the number of edges in its DAG.
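To make the semantics of Eq. (2) concrete, here is a minimal sketch (ours; the paper's actual implementation is the Julia package linked above) that evaluates a toy PC bottom-up on a complete assignment. The `Unit` class and helper names are illustrative.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Unit:
    kind: str                                                 # "input", "sum", or "product"
    children: List["Unit"] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)        # theta_{n,c} for sum units
    f: Optional[Callable[[Dict[str, int]], float]] = None     # f_n(.) for input units

def evaluate(n: Unit, x: Dict[str, int]) -> float:
    """Feedforward evaluation of Eq. (2) on a complete assignment x, returning p_n(x)."""
    if n.kind == "input":
        return n.f(x)                                          # f_n(x)
    vals = [evaluate(c, x) for c in n.children]
    if n.kind == "sum":
        return sum(w * v for w, v in zip(n.weights, vals))     # weighted mixture
    return math.prod(vals)                                     # factorized product

def bern(var: str, p1: float) -> Unit:
    """Input unit with a Bernoulli distribution over the boolean variable `var`."""
    return Unit("input", f=lambda x, p1=p1, var=var: p1 if x[var] == 1 else 1.0 - p1)

# Toy PC over {X1, X2}: a mixture of two factorized components.
comp_a = Unit("product", children=[bern("X1", 0.9), bern("X2", 0.2)])
comp_b = Unit("product", children=[bern("X1", 0.3), bern("X2", 0.7)])
root = Unit("sum", children=[comp_a, comp_b], weights=[0.6, 0.4])

print(evaluate(root, {"X1": 1, "X2": 0}))   # 0.6*(0.9*0.8) + 0.4*(0.3*0.3) = 0.468
```

A single bottom-up pass costs O(|p|); the decomposability properties introduced next are what additionally make marginal queries computable with a pass of the same form.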
This paper focuses on PCs that can compute arbitrary marginal queries in time linear in their size, since this is necessary to unlock the efficient (de)compression algorithm. In order to support efficient marginalization, PCs need to be decomposable (Def. 2),3 which is a property of the (variable) scope φ(n) of a PC unit n, that is, the collection of variables defined by all its descendant input units.

Definition 2 (Decomposability). A PC is decomposable if for every product unit n, its children have disjoint scopes: ∀c1, c2 ∈ in(n) (c1 ≠ c2), φ(c1) ∩ φ(c2) = ∅.

All product units in Fig. 1 are decomposable. For example, each purple product unit (whose scope is {X1, X2}) has two children with disjoint scopes {X1} and {X2}, respectively. In addition to Def. 2, we make use of another property, structured decomposability, which is the key to guaranteeing computational efficiency of the proposed (de)compression algorithm.

Definition 3 (Structured decomposability). A PC is structured-decomposable if (i) it is decomposable and (ii) for every pair of product units (m, n) with identical scope (i.e., φ(m) = φ(n)), we have that |in(m)| = |in(n)| and the scopes of their children are pairwise identical: ∀i ∈ {1, ..., |in(m)|}, φ(cm,i) = φ(cn,i), where cm,i and cn,i are the i-th child units of m and n, respectively.

3 Another property called smoothness is also required to compute marginals efficiently. However, since enforcing smoothness on any structured-decomposable PC imposes at most an almost-linear increase in its size (Shih et al., 2019), we omit introducing it here (all PCs used in this paper are structured-decomposable).

Figure 2: Overview of the PC-based (de)compressor. The encoder side sequentially compresses the variables one by one using the conditional probabilities given all previously sent variables; these probabilities are computed efficiently using Alg. 1. Finally, a streaming code uses the conditional probabilities to compress the variables into a bitstream. On the decoder side, a streaming code decodes the bitstream and reconstructs the image using the conditional probabilities computed by the PC.

The PC shown in Fig.
1 is structured-decomposable because for all three groups of product units with the same scope (grouped by their colors), their children divide the variable scope in the same way. For example, the children of both orange units decompose the scope {X1, X2, X3, X4} into {X1, X2} and {X3, X4}.

As a key sub-routine in the proposed algorithm, we describe how to compute marginal queries given a smooth and (structured-)decomposable PC in O(|p|) time. First, we assign probabilities to every input unit: for an input unit n defined on variable X, if evidence is provided for X in the query (e.g., X = x or X
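In decomposable (and smooth) PCs, the standard recipe the passage above begins to lay out is to assign 1 to every input unit whose variable is marginalized out and the usual fn value otherwise, after which a single feedforward pass of Eq. (2) returns the marginal. The sketch below (ours, with an illustrative tuple encoding of the circuit; it is not the paper's Alg. 1) demonstrates this on the same toy mixture as before.

```python
import math

# Node encoding (illustrative): ("input", var, p1) | ("sum", weights, children)
# | ("product", children). Input units over marginalized-out variables output 1.

def marginal(node, evidence):
    """evidence maps a variable name to 0/1; missing variables are marginalized out."""
    kind = node[0]
    if kind == "input":
        _, var, p1 = node
        if var not in evidence:              # marginalized out -> contributes 1
            return 1.0
        return p1 if evidence[var] == 1 else 1.0 - p1
    if kind == "sum":
        _, weights, children = node
        return sum(w * marginal(c, evidence) for w, c in zip(weights, children))
    _, children = node                        # product unit
    return math.prod(marginal(c, evidence) for c in children)

# Toy mixture: 0.6 * Bern(X1;0.9)Bern(X2;0.2) + 0.4 * Bern(X1;0.3)Bern(X2;0.7)
pc = ("sum", [0.6, 0.4],
      [("product", [("input", "X1", 0.9), ("input", "X2", 0.2)]),
       ("product", [("input", "X1", 0.3), ("input", "X2", 0.7)])])

p_x1 = marginal(pc, {"X1": 1})                                   # p(X1 = 1)
check = sum(marginal(pc, {"X1": 1, "X2": v}) for v in (0, 1))    # brute-force sum
print(p_x1, check)                                               # both 0.66
```

Conditional probabilities such as the li and hi needed for (de)compression are then obtained as ratios of two such marginal evaluations.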