Compressing Tabular Data via Latent Variable Estimation

Andrea Montanari 1   Eric Weiner 1

1 Project N, Mountain View, CA, United States. Correspondence to: Andrea Montanari. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) Estimate latent variables associated to rows and columns; (ii) Partition the table in blocks according to the row/column latents; (iii) Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; (iv) Append a compressed encoding of the latents. We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.

1. Introduction

The classical theory of lossless compression (Cover & Thomas, 2006; Salomon, 2004) assumes that data take the form of a random vector $X^N = (X_1, X_2, \dots, X_N)$ of length $N$ with entries in a finite alphabet $\mathcal{X}$. Under suitable ergodicity assumptions, the entropy per letter converges to a limit $h := \lim_{N\to\infty} H(X^N)/N$ (Shannon-McMillan-Breiman theorem). Universal coding schemes (e.g. Lempel-Ziv coding) do not require knowledge of the distribution of $X^N$, and can encode such a sequence without information loss using (asymptotically) $h$ bits per symbol.

While this theory is mathematically satisfying, its modeling assumptions (stationarity, ergodicity) are unlikely to be satisfied in many applications. This has long been recognized by practitioners. The main objective of this paper is to investigate this fact mathematically in the context of tabular data, characterize the gap to optimality of classical schemes, and describe an asymptotically optimal algorithm that overcomes their limitations.

We consider a data table with $m$ rows and $n$ columns and entries in $\mathcal{X}$, namely $X^{m,n} \in \mathcal{X}^{m\times n}$, $X^{m,n} := (X_{ij})_{i\le m,\, j\le n}$. The standard approach to such data is: (i) Serialize, e.g. in row-first order, to form a vector of length $N = mn$, $X^N = (X_{11}, X_{12}, \dots, X_{1n}, X_{21}, \dots, X_{mn})$; (ii) Apply a standard compressor (e.g., Lempel-Ziv) to this vector. We will show, both empirically and mathematically, that this standard approach can be suboptimal in the sense of not achieving the optimal compression rate. This happens even in the limit of large tables, as long as the number of columns and rows are polynomially related (i.e. $n^{\varepsilon} \le m \le n^{M}$ for some small constant $\varepsilon$ and large constant $M$).

We advocate an alternative approach:

1. Estimate row/column latents $u^m = (u_1, \dots, u_m) \in \mathcal{L}^m$, $v^n = (v_1, \dots, v_n) \in \mathcal{L}^n$, with $\mathcal{L}$ a finite alphabet.

2. Partition the table in blocks according to the row/column latents. Namely, for $u, v \in \mathcal{L}$, define
$X(u, v) = \mathrm{vec}\big( X_{ij} : u_i = u,\ v_j = v \big)$,   (1.1)
where $\mathrm{vec}(M)$ denotes the serialization of matrix $M$ (either row-wise or column-wise).

3.
Apply a base compressor (generically denoted by ZX : X {0, 1} ) to each block X(u, v) z(u, v) = ZX (X(u, v)) , u, v L . (1.2) 4. Encode the row latents and column latents using a possibly different compressor ZL : X {0, 1} , to get zrow = ZL(u), zcol = ZL(v). Finally output the concatenation (denoted by ) Enc(Xm,n) = header zrow zcol M u,v L z(u, v) . Here header is a header that contains encodings of the lengths of subsequent segments. Compressing Tabular Data via Latent Variable Estimation Note that encoding the latents can in general lead to a suboptimal compression rate. While this can be remedied with techniques such as bits-back coding, we observed in our applications that using such techniques yields limited improvement. Our analysis shows that the rate improvement afforded by bits-back coding is only significant in certain special regimes. We refer to Sections 5 and 6 for further discussion. The above description leaves several design choices undefined, namely: (a) The latents estimation procedure at point 1; (b) The base compressor ZX for the blocks X(u, v); (c) The base compressor ZL for the latents. We will provide details for a specific implementation in Section 2, alongside empirical evaluation in Section 3. Section 4 introduces a probabilistic model for the data Xm,n, and Section 5 establishes our main theoretical results: standard compression schemes are suboptimal on this model, while the above latents-based approach is asymptotically optimal. Finally we discuss extensions in Section 6. 1.1. Related work The use of latent variables is quite prevalent in compression methods based on machine learning and probabilistic modeling. Hinton and Zemel (1993) introduced the idea that stochastically generated codewords (e.g., random latents) can lead to minimum description lengths via bits back coding. This idea was explicitly applied to lossless compression using arithmetic coding in (Frey & Hinton, 1996), and ANS coding in (Townsend et al., 2019a;b). Compression via low-rank approximation is closely-related to our latents-based approach and has been studied in the past. An incomplete list of contributions includes (Cheng et al., 2005) (numerical analysis), (Li & Li, 2010) (hyperspectral imaging), (Yuan & Oja, 2005; Hou et al., 2015) (image processing), (Taylor, 2013) (quantum chemistry), (Phan et al., 2020) (compressing the gradient for distributed optimization), (Chen et al., 2021) (large language models compression). The present paper contributes to this line of work, but departs from it in a number of ways. (i) We study lossless compression while earlier work is mainly centered on lossy compression. (ii) Most of the papers in this literature do not precisely quantify compression rate: they do not count bits. (iii) We show empirically an improvement in terms of lossless compression rate over state of the art. Another related area is network compression: simple graphs can be viewed as matrices with entries in {0, 1}. In the case of graph compression, one is interested only in such matrices up to graph isomorphisms. The idea of reordering the nodes of the network and exploiting similarity between nodes has been investigated in this context, see e.g. 
(Boldi & Vigna, 2004; Chierichetti et al., 2009; Lim et al., 2014; Algorithm 1 Latent-based Tabular Compressor Input: Data matrix Xm,n X m n, range k = |L| Output: Compressed data Enc(Xm,n) {0, 1} Estimate latents um [k]m, vn [k]n using Algorithm 2, with inputs Xm,n, k for u, v [k] do X(u, v) = vec Xij : ui = u, vj = v z(u, v) = ZX (X(u, v)) end for Compute zrow = ZL(u), zcol = ZL(v) return concatenation of {z(u, v) : u, v [k]}, zrow,zcol, metadata Besta & Hoefler, 2018) However, we are not aware of results analogous to ours in this literature. To the best of our knowledge, our work is the first to prove that classical lossless compression techniques do not achieve the ideal compression rate under a probabilistic model for tabular data. We characterize this ideal rate as well as the one achieved by classical compressors, and prove that latents estimation can be used to close this gap. 1.2. Notations We generally use boldface for vectors and uppercase boldface for matrices, without making any typographic distinction between numbers and random variables. When useful, we indicate by superscripts the dimensions of a matrix or a vector: um is a vector of length m, and Xm,n is a matrix of dimensions m n. For a string v and a b, we use vb a = (va, . . . , vb) to denote the substring of v. If X, Y are random variables on a common probability space (Ω, F, P), we denote by H(X), H(Y ) their entropies, H(X, Y ) their joint entropy, H(X|Y ) the conditional entropy of X given Y . We will overload this notation: if p is a discrete probability distribution, we denote by H(p) its entropy. Unless stated otherwise, all entropies will be measured in bits. For ε [0, 1], h(ε) := ε log2 ε (1 ε) log2(1 ε). 2. Implementation The overall structure of the compression algorithm was already described in the introduction. Algorithm 1 summarizes it. In this section we provide further details about the two basic components it relies on: the base compressor Z , and the latents estimation algorithm. In both cases, the choice is in no-way unique and we only describe what we used in our implementation. Compressing Tabular Data via Latent Variable Estimation 2.1. Base compressors We implemented the following two options for the base compressors ZX (for data blocks) and ZL (for latents). Dictionary-based compression (Lempel-Ziv, LZ). For this we used Zstandard (ZSTD) Python bindings to the C implementation using the library zstd, with level 12. While ZSTD can use run-length encoding schemes or literal encoding schemes, we verified that in in this case ZSTD always use its LZ algorithm. The LZ algorithm in ZSTD is somewhat more sophisticated than the plain LZ algorithm used in our proofs. In particular it includes (Collet & Kucherawy, 2018) Huffman coding of literals 0-255 and entropy coding of the LZ stream. Experiments with other (simpler) LZ implementations yielded similar results. We focus on ZSTD because of its broad adoption in industry. Frequency-based entropy coding (ANS). For each data portion (i.e each block X(u, v) and each of the row latents u and column latents v) compute empirical frequencies of the corresponding symbols. Namely for all u, v L, x X, we compute b Q(x|u, v) := 1 N(u, v) j:vj=v 1xij=x , ˆqr(u) := 1 i=1 1ui=u , ˆqc(v) := 1 i=1 1vi=v , where N(u, v) is the number of i m, j n such that ui = u, vj = v. We then apply ANS coding (Duda, 2009) to each block X(u, v) modeling its entries as independent with distribution b Q( |u, v), and to the latents um, vn using the distributions ˆqr( ), ˆqc( ). 
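To fix ideas, the following is a minimal Python sketch of the compression path of Algorithm 1, using the zstandard bindings described in Section 2.1 as the base compressor for both the blocks and the latents. The header layout and helper structure shown here are illustrative assumptions, not necessarily the exact on-disk format used in the experiments of Section 3; the latent estimation step is assumed to be supplied by Algorithm 2.

```python
# Illustrative sketch of Algorithm 1 (compression path only); header layout and
# framing are assumptions for the example, not the reference implementation.
import struct
import numpy as np
import zstandard as zstd  # base compressor Z_X = Z_L (the LZ/ZSTD option of Section 2.1)

def encode_table(X, u, v, k, level=12):
    """X: (m, n) array of small non-negative ints; u, v: latents in {0, ..., k-1}."""
    cctx = zstd.ZstdCompressor(level=level)
    segments = []
    for a in range(k):                        # blocks X(a, b), serialized row-wise
        for b in range(k):
            block = X[np.ix_(u == a, v == b)].astype(np.uint8)
            segments.append(cctx.compress(block.tobytes()))
    segments.append(cctx.compress(u.astype(np.uint8).tobytes()))  # z_row
    segments.append(cctx.compress(v.astype(np.uint8).tobytes()))  # z_col
    # Header: table shape, k, and one length per segment, so the decoder can split them.
    header = struct.pack("<3I", X.shape[0], X.shape[1], k)
    header += b"".join(struct.pack("<I", len(s)) for s in segments)
    return header + b"".join(segments)
```

Decoding reverses the steps: the latents segments are decompressed first, after which the block shapes, and hence the original row/column positions of each entry, are determined.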
We separately encode these counts as long integers. Since our main objective was to study the impact of learning latents, we did not try to optimize these base compressors. 2.2. Latent estimation We implemented latents estimation using a spectral clustering algorithm outlined in Algorithm 2. In words, the algorithm encodes the data matrix Xm,n as an m n real-valued matrix M m,n Rm n using a map ψ : X R. It then computes the top k 1 left and right singular vectors of M, and stores them as matrices A Rm (k 1), B Rm (k 1). The rows ai, bj Rk 1 of these matrices are used as embedding of the rows and columns indices in k 1 dimensions. Finally, we run KMeans on these vectors to construct k clusters of rows/columns. A few remarks are in order. The algorithm encodes the data matrix Xm,n as an m n real-valued matrix M m,n Algorithm 2 Spectral latents estimation Input: Data matrix Xm,n X m n latents range k = |L|; map ψ : X R Output: Factors um Lm, vn Ln Compute top (k 1) singular vectors of M m,n = ψ(Xm,n), ( ai)i k 1, ( bi)i k 1 Stack singular vectors in matrices A = [ a1| . . . | ak 1] Rm (k 1), B = [ b1| | bk 1] Rn (k 1); Let (ai)i m, ai Rk 1 be the rows of A; (bi)i n, b Rk 1 the rows of B Apply KMeans to (ai)i m; store the cluster labels as vector um Apply KMeans to (bi)i n; store the cluster labels as vector vn return um, vn Rm n using a map ψ : X R. In our experiments we did not optimize this map and encoded the elements of X as 0, 1, . . . , |X| 1 arbitrarily, cf. also Section 5.3 The singular vector calculation turns out to be the most time consuming part of the algorithm. Computing approximate singular vectors via power iteration requires in this case of the order of log(m n) matrix vector multiplications for each of k vectors. This amounts to mnk log(m n) operations, which is larger than the time needed to compress the blocks or to run KMeans. A substantial speed-up is obtained via row subsampling, cf. Section 6 For the clustering step we use the scikit-learn implementation via sklearn.cluster.KMeans, with random initialization. 3. Empirical evaluation We evaluated our approach on tabular datasets with different origins. Our objective is to assess the impact of using latents in reordering columns and rows, so we will not attempt to achieve the best possible data reduction rate (DRR) on each dataset, but rather to compare compression with latents and without in as-uniform-as-possible fashion. Since our focus is on categorical variables, we preprocess the data to fit in this setting as described in Section A.2. This preprocessing step might involve dropping some of the columns of the original table. We denote the number of columns after preprocessing by n. We point out two simple improvements we introduce in the implementation: (i) We use different sizes for rows latent alphabet and column latent alphabet |Lr| = |Lc|; (ii) We choose |Lr|, |Lc| by optimizing the compressed size . 3.1. Datasets More details on these data can be found in Appendix A.1: Compressing Tabular Data via Latent Variable Estimation Taxicab. A table with m = 62, 495, n = 18 (NYC.gov, 2022). LZ: |Lr| = 9, |Lc| = 15. ANS: |Lr| = 5, |Lc| = 14. Network. Four social networks from (Leskovec & Krevl, 2014) with m = n {333, 747, 786, 1187}. LZ and ANS: |Lr| = 5, |Lc| = 5. Card transactions. A table with m = 24, 386, 900 and n = 12 (Altman, 2019). LZ and ANS: |Lr| = 3, |Lc| = n. Business price index. A table with m = 72, 750 and n = 10 (stats.govt.nz, 2022). LZ: |Lr| = 6, |Lc| = 7. ANS: |Lr| = 2, |Lc| = 6. Forest. 
A table from the UCI data repository with m = 581, 011, n = 55 (Dua & Graff, 2017). LZ and ANS: |Lr| = 6, |Lc| = 17. US Census. Another table from (Dua & Graff, 2017) with m = 2, 458, 285 and n = 68. LZ and ANS: |Lr| = 9, |Lc| = 68. Jokes. A collaborative filtering dataset with m = 23, 983 rows and n = 101 (Goldberg et al., 2001; Goldberg et al.). LZ: |Lr| = 2, |Lc| = 101. ANS: |Lr| = 8, |Lc| = 8. 3.2. Results Given a lossless encoder φ : X m n {0, 1} , we define its compression rate and data reduction rate (DRR) as Rφ(Xm,n) := len(φ(Xm,n)) mn log2 |X| , DRRφ(Xm,n) := 1 Rφ(Xm,n) . (3.1) (Larger DRR means better compression.) The DRR of each algorithm is reported in Table 1. For the table of results, LZ refers to row-major order ZSTD, LZ (c) refers to column-major order ZSTD. We run KMeans on the data 5 times, with random initializations finding the DRR each time and reporting the average. We make the following observations on the empirical results of Table 1. First, Latent + ANS encoder achieves systematically the best DRR. Second, the use of latent in several cases yields a DRR improvement of 5% (of the uncompressed size) or more. Third, as intuitively natural, this improvement appears to be larger for data with a large number of columns (e.g. the network data). The analysis of the next section provides further support for these findings. 4. A probabilistic model In order to better understand the limitations of classical approaches, and the optimality of latent-based compression, we introduce a probabilistic model for the table Xm,n X m n. We assume the true latents (ui)i m, (vj)j n to be independent random variables with P(ui = u) = qr(u) , P(vi = v) = qc(v) . (4.1) We assume that the entries (Xij)i m,j n are conditionally independent given um = (ui)i m vn = (vj)j n, with P Xij = x um, vn = Q(x|ui, vj) . (4.2) The distributions qr, qc, and conditional distribution Q are parameters of the model (a total of 2(|L| 1) + |L|2(|X| 1) real parameters). We will write (Xm,n, um, vn) T (Q, qr, qc; m, n) to indicate that the triple (Xm,n, um, vn) is distributed according to the model. Remark 4.1. Some of our statements will be nonasymptotic, in which case m, n, X, L, Q, qr, qc are fixed. Others will be of asymptotic. In the latter case, we have in mind a sequence of problems indexed by n. In principle, we could write mn, Xn, Ln, Qn, qr,n, qc,n to emphasize the fact that these quantities depend on n. However, we will typically omit these subscripts. Example 4.2 (Symmetric Binary Model). As a toy example, we will use the following Symmetric Binary Model (SBM) which parallels the symmetric stochastic block model for community detection (Holland et al., 1983). We take L = [k] := {1, . . . , k}, X = {0, 1}, qr = qc = Unif([k]) (the uniform distribution over [k]) and Q(1|u, v) = p1 if u = v, Q(1|u, v) = p0 if u = v. (4.3) We will write (Xm,n, um, vn) TSBM(p0, p1, k; m, n) when this distribution is used. Figure 1 reports the results of simulations within the SBM, for ZSTD and ANS base compressors. In this case m = n = 1000, k = 3, and we average DRR values over 4 realizations. Appendix B reports additional simulations under the same model for k {5, 7}: the results are very similar to the ones of Figure 1. As expected, the use of latents is irrelevant along the line p1 p0 (in this case, the latents do not impact the distribution of Xij). However, it becomes important when p1 and p0 are significantly different. 
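As a concrete reference for these simulations, the following sketch samples a table from the model of Eqs. (4.1)-(4.2), specialized to the Symmetric Binary Model of Example 4.2. The particular values of p0, p1 in the example call are illustrative only (in Figure 1 they range over a grid).

```python
# Sketch of a sampler for T(Q, q_r, q_c; m, n), specialized to T_SBM(p0, p1, k; m, n).
import numpy as np

def sample_sbm(m, n, k, p0, p1, seed=0):
    """Uniform row/column latents over [k]; X_ij ~ Bernoulli(p1) if u_i == v_j, else Bernoulli(p0)."""
    rng = np.random.default_rng(seed)
    u = rng.integers(k, size=m)                          # row latents, q_r = Unif([k])
    v = rng.integers(k, size=n)                          # column latents, q_c = Unif([k])
    probs = np.where(u[:, None] == v[None, :], p1, p0)   # Q(1 | u_i, v_j), Eq. (4.3)
    X = (rng.random((m, n)) < probs).astype(np.uint8)
    return X, u, v

# Setting of Figure 1: m = n = 1000, k = 3 (p0, p1 vary over a grid there).
X, u, v = sample_sbm(1000, 1000, k=3, p0=0.2, p1=0.8)
```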
The figures also report contour lines of the theoretical predictions for the asymptotic DRR of various compression algorithms (cf. Example 5.4). The agreement is excellent.

Table 1: Data reduction rate (DRR) achieved by classical and latent-based compressors on real tabular data.

Data | Size | LZ | LZ (c) | ANS | Latent + LZ | Latent + ANS
Taxicab | 380 KB | 0.41 | 0.44 | 0.43 | 0.48 | 0.54
FB Network 1 | 13.6 KB | 0.63 | 0.63 | 0.76 | 0.58 | 0.78
FB Network 2 | 68.1 KB | 0.44 | 0.44 | 0.57 | 0.64 | 0.75
FB Network 3 | 75.4 KB | 0.59 | 0.59 | 0.75 | 0.69 | 0.80
GP Network 1 | 172 KB | 0.46 | 0.46 | 0.65 | 0.58 | 0.70
Forest (s) | 6.10 MB | 0.29 | 0.38 | 0.47 | 0.41 | 0.49
Card Transactions (s) | 123 MB | 0.03 | 0.21 | 0.29 | 0.20 | 0.30
Business price index (s) | 153 KB | 0.03 | 0.20 | 0.28 | 0.25 | 0.32
US Census | 43.9 MB | 0.38 | 0.31 | 0.47 | 0.52 | 0.62
Jokes | 515 KB | 0.21 | 0.15 | 0.07 | 0.03 | 0.14

[Figure 1: four heatmap panels (Naive ZSTD, Kmeans ZSTD, Naive ANS, Kmeans ANS) showing % DRR on a 0.2-0.77 color scale.]
Figure 1: Comparing data reduction rate of naive coding and latent-based coding for synthetically generated data. Top: ZSTD base compressor. Bottom: ANS base compressor. Contour lines correspond to the compression rate predicted by the theorems of Section 5 (coinciding with optimal rate for latent-based encoders).

5. Theoretical analysis

In this section we present our theoretical results on compression rates under the model $T(Q, q_r, q_c; m, n)$ introduced above. We first characterize the optimal compression rate in Section 5.1, then prove that standard compression methods fail to attain this goal in Section 5.2, and finally show that latent-based compression does in Section 5.3. Proofs are deferred to Appendices D, E, F, G. Throughout, we denote by $(X, U, V)$ a triple with joint distribution $P(X = x, U = u, V = v) = Q(x|u, v)\, q_r(u)\, q_c(v)$ (this is the same as the joint distribution of $(X_{ij}, u_i, v_j)$ for fixed $i, j$).

5.1. Ideal compression

Our first lemma provides upper and lower bounds on the entropy per symbol $H(X^{m,n})/mn$.

Lemma 5.1. Defining $H^+_{m,n}(X|U, V) := H(X|U, V) + \frac{1}{n} H(U) + \frac{1}{m} H(V)$, we have
$$ H(X|U, V) \;\le\; \frac{1}{mn} H(X^{m,n}) \;\le\; H^+_{m,n}(X|U, V)\,. \qquad (5.1) $$
(5.3) can be achieved by Huffmann or arithmetic coding, and requires knowledge of the probability distribution of Xm,n. Under the these schemes, the length of the codeword associated to Xm,n is within constant number of bits from log2 P(Xm,n), where P(X0) := P(Xm,n = X0) is the probability mass function of the random table Xm,n (Cover & Thomas, 2006; Salomon, 2004). The next lemma implies that the length concentrates tightly around the entropy. Lemma 5.3 (Asymptotic Equipartition Property). For X0 X m n, let P(X0) = PQ,qr,qc;m,n(X0) the probability of Xm,n = X0 under model Xm,n T (Q, qr, qc; m, n). Assume there exists a constant c > 0 such that minx X minu,v L Q(x|u, v) c. Then there exists a constant C (depending on c) such that the following happens. For Xm,n T (Q, qr, qc; m, n) and any t 0 with probability at least 1 2 e t: log P(Xm,n) H(Xm,n) C p mn(m + n) t . (5.4) For the sake of simplicity, in the last statement we assume a uniform lower bound on Q(x|u, v). While such a lower bound holds without loss of generality when Q is independent of m, n (symbols with zero probability can be dropped), it might not hold in the n-dependent case. Appendix D gives a more general statement. 5.2. Failure of classical compression schemes We analyze two types of codes: finite-state encoders and Lempel-Ziv codes. Both operate on the serialized data XN = vec(Xm,n), N = mn, obtained by scanning the table in row-first order (obviously column-first yields symmetric results). 5.2.1. FINITE STATE ENCODERS A finite state (FS) encoder takes the form of a triple (Σ, f, g) with Σ a finite set of cardinality M = |Σ| and f : X Σ {0, 1} , g : X Σ Σ. We assume that Σ contains a special initialization symbol sinit. Starting from state s0 = sinit, the encoder scans the input XN sequentially. Assume after the first ℓinput symbols it is in state sℓ, and produced encoding zk(ℓ) 1 . Given input symbol Xℓ+1, it appends f(Xℓ+1, sℓ) to the codeword, and updates its state to sℓ+1 = g(Xℓ+1, sℓ). With an abuse of notation, denote by fℓ(Xℓ, sinit) {0, 1} the binary sequence obtained by applying the finite state encoder to Xℓ= (X1, . . . , Xℓ) We say that the FS encoder is information lossless if for any ℓ N, Xℓ7 fℓ(Xℓ, sinit) is injective. Theorem 5.4. Let X = Xm,n T (Q, qr, qc; m, n) and φ := (Σ, f, g) be an information lossless finite state encoder. Define the corresponding compression rate Rφ(X), as per Eq. (3.1). Assuming m > 10, |Σ| |X|, and log2 |Σ| n log2 |X|/9, E Rφ(X) H(X|U) log2 |X| 10 log |Σ| n log |X| log(n log |Σ|) . The asymmetry between U and V in the last statement (and below) arises because we assume that the table is serialized in row-major order. Of course the roles of U and V are exchanged if we use column major. Remark 5.2. The leading term of the above lower bound is H(X|U)/ log2 |X|. Since conditioning reduces entropy, this is strictly larger than the ideal rate which is roughly H(X|U, V )/ log2 |X|, cf. Eq. (5.3). The next term is negligible provided log |Σ| n log |X|. This condition is easy to interpret: it amounts to say that the finite state machine does not have enough states to memorize a row of the table Xm,n. The gap between H(X|U, V ) (appearing in the ideal rate of Lemma 5.1) and H(X|U) (appearing in the last statement Compressing Tabular Data via Latent Variable Estimation and in the analysis of LZ encoders) can be illustrated by a toy example. Assume ui, vj are uniform in {0, 1} and Xij = ui + vi( mod 2). Then H(X|U, V ) = 0 while H(X|U) = H(X|V ) = log 2. 
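For completeness, the short computation behind these two values (reading the toy example as $X_{ij} = u_i + v_j \bmod 2$, with entropies in bits as per Section 1.2, so that $\log 2$ nats equals 1 bit):

```latex
% Worked computation for the toy example X_{ij} = u_i + v_j (mod 2),
% with u_i, v_j independent and uniform on {0,1}.
\begin{align*}
  H(X \mid U, V) &= 0
    && \text{given $(U,V)=(u,v)$, $X = u \oplus v$ is deterministic;}\\
  \Pr\{X = 1 \mid U = u\} &= \Pr\{V = 1 \oplus u\} = \tfrac12
    && \text{for either value of $u$, hence}\\
  H(X \mid U) &= h\!\big(\tfrac12\big) = 1 \text{ bit} \;(= \log 2 \text{ nats}),
    && \text{and by symmetry } H(X \mid V) = 1 \text{ bit}.
\end{align*}
```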
It is easy to compress very well by identifying the latents (just store the latents), but if we scan a single row (or column), we will only see a random sequence of bits. 5.2.2. LEMPEL-ZIV The pseudocode of the Lempel-Ziv algorithm that we will analyze is given in Appendix F. In words, after the first k characters of the input have been parsed, the encoder finds the longest string Xk+ℓ 1 k which appears in the past. It then encodes a pointer to the position of the earlier appearance of the string Tk, and its length Lk. If a simbol Xk never appeared in the past, we use a special encoding, cf. Appendix F. We encode the pointer Tk in plain binary using log2(N + |X|) bits (note that Tk { |X| + 1, . . . , 1, . . . , N}), and Lk using an instantaneous prefix-free code, e.g. Elias δcoding, taking 2 log2 Lk + 1 bits. Assumption 5.5. There exist a constant c0 > 0 such that max x X max u,v L Q(x|u, v) 1 c0 . Further Q, qr, co, X, L are fixed and m, n with m = nα+o(1), i.e. lim n log m log n = α (0, ) . (5.6) Theorem 5.6. Under Assumption 5.5, the asymptotic Lempel-Ziv rate is lim m,n E RLZ(Xm,n) = R LZ := X qr(u)R LZ(u) log2 |X| , R LZ(u) := H(X|U = u) 1 + α H(X|U = u, V ) . Remark 5.3. The asymptotics of the Lempel-Ziv rate is given by the minimum of two expressions, which correspond to different behaviors of the encoder. For u L, define α (u) := H(X|U = u, V )/(H(X|U = u) H(X|U = u, V )) (with α (u) = if H(X|U = u) = H(X|U = u, V )). Then: If α < α (u), then we are a skinny table regime. The algorithm mostly deduplicates segments in rows with latent u by using strings in different rows but aligned in the same columns. If α > α (u), then we are a fat table regime. The algorithm mostly deduplicates segments on rows with latent u by using rows and columns that are not the same as the current segment. Example 5.4 (Symmetric Binary Model, dense regime). Under the Symmetric Binary Model TSBM(p0, p1, k; m, n) of Example 4.2, we can compute the optimal compression rate of Corollary 5.2, the finite state compression rate of Theorem 5.4, the Lempel-Ziv rate of Theorem 5.6. If p0, p1 are of order one, and m = nα+on(1) as m, n , letting p := ((k 1)/k)p0 + (1/k) p1, h(p0, p1) := ((k 1)/k)h(p0) + (1/k) h(p1), we obtain: E Ropt(X) = 1 1 k h(p1) + on(1) , E Rfin. st.(X) h(p) + on(1) , E RLZ(X) = h(p) 1 + α h(p0, p1) + on(1) . These theoretical predictions are used to trace the contour lines in Figure 1. (ANS coding is implemented as a finite state code here.) 5.3. Practical latent-based compression Achieving the ideal rate of Corollary 5.2 via arithmetic or Huffmann coding requires to compute the probability P(Xm,n), which is intractable. We will next show that we can achieve a compression rate that is close to the ideal rate via latents estimation. We begin by considering general latents estimators bu : X m n Lm, bv : X m n Ln. We measure their accuracy by the error (cf. Lemma 5.1) Err U(X; bu) := 1 n 1ˆui(X) =π(ui) o and the analogous Err V (X; bv). Here the minimization is over the set SL of permutations of the latents alphabet L. We can use any estimators bu, bv to reorder rows and columns and compress the table Xm,n according to the algorithm described in the introduction. We denote by Rlat(X) the compression rate achieved by such a procedure. Our first result implies that, if the latent estimators are consistent (namely, they recover the true latents with high probability, up to permutations), then the resulting rate is close to the ideal one. Lemma 5.7. 
Assume data distributed according to model Xm,n T (Q, qr, qc; m, n), with m, n log2 |L|. Further assume there exists c0 > 0 such that qr(u), qc(v) c0 for all u, v L. Let Rlat(X) be the rate achieved by the latent-based scheme with latents estimators bu, bv, and base encoders ZX = ZL = Z. Then E Rlat(X) H(Xm,n) mn log2 |X| + 2Perr(m, n) + 4 log(mn) + |L|2 Z(c mn; Q) + 2 Z(m n; {qr, qc}) . (5.8) Compressing Tabular Data via Latent Variable Estimation Here Perr(m, n) := P( Err U(Xm,n; bu) > 0) + P( Err V (Xm,n; bv) > 0), Z(N; P ) is the worst-case redundancy of encoder Z over i.i.d. sources with distributions in P (see comments below), Q := {Q( |u, v)}u,v L. The redundancies of Lempel-Ziv, frequency-based arithmetic coding and ANS coding can be upper bounded as (in the last bound Q, qr, qc need to be be independent of N) LZ(N; P ) 40c (P ) log log N 1/2 , (5.9) AC(N; P ) 2|X| log |X| log N ANS(N; P ) 2|X| log N + C|X| Here Eq. (5.9) holds for N exp{supq P (4 log(2/H(q)))2}, and c (P ) := supq P P x X (log q(x))2/|X|. The proof of this lemma is given in Appendix G.1. The main content of the lemma is in the general bound (5.8) which is proven in Appendix G.1.1. Remark 5.5. We define the worst case redundancy Z(N0; P ) := max N N0 b Z(N; P ), where b Z(N; P ) := max q P Eqlen(Z(Y N)) H(Y N) where P P([k]) := {(pi)i k Rk : pi 0 i and P i k pi = 1} is a set of probability distributions over [k] and Y N is a vector with i.i.d. entries Yi q. While Eqs. (5.9) (5.11) are closely related to well known facts, there are nevertheless differences with respect to statements in the literature. We address them in Section G.1.2. Perhaps the most noteworthy difference is in the bound (5.9) for the LZ algorithm. Existing results, e.g. Theorem 2 in (Savari, 1998), assume a single, N-independent, distribution q and are asymptotic in nature. Equation (5.9) is a non-asymptotic statement and applies to a collection of distributions P that could depend on N. Lemma 5.7 can be used in conjunction with any latent estimation algorithm, as we next demonstrate by considering the spectral algorithm of Section 2.2. Recall that the algorithm makes use of a map ψ : X R. For (X, U, V ) Q( | , )qr( )qc( ), we define ψ(u, v) := E[ψ(X)|U = u, V = v], Ψ := u,v L and the parameters: µn := σmin(Ψ) , νn := max u,v L |ψ(u, v)| , (5.13) σ2 n := max u,v L Var ψ(X)|U = u, V = v . (5.14) We further will assume, without loss of generality maxx X |ψ(x)| 1. Finally, we need to formally specify the version of the KMeans primitive in the spectral clustering algorithm. In fact, we establish correctness for a simpler thresholding procedure. Considering to be definite the row latents, and for a given threshold θ > 0, we construct a graph Gθ = ([m], Eθ) by letting (for distinct i, j [m]) ai aj 2 ( ai 2 + aj 2)/2 θ (i, j) Eθ . (5.15) The algorithm then output the connected components of Gθ. Theorem 5.8. Assume data Xm,n T (Q, qr, qc; m, n), with m, n log2 |L| and minu L(qr(u) qc(u)) c0 for a constant c0 > 0. Let Rlat(X) be the rate achieved by the latent-based scheme with spectral latents estimators bu, bv, base compressors ZX = ZL = Z, and thresholding algorithm as described above. Then, assuming σn c p (log n)/n, νn/σn c log n µn C(σn p (log n)/m (log n)/ mn), θ c0/100, we have E Rlat(X) H(Xm,n) mn log2 |X| + 10 log(mn) + |L|2 Z(c mn; Q) + 2 Z(m n; {qr, qc}) . We focus on the simpler thresholding algorithm of Eq. (5.15) instead of KMeans in order to avoid technical complications that are not the main focus of this paper. 
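As an illustration, the thresholding step can be sketched as follows, reading Eq. (5.15) as connecting distinct rows $i, j$ whenever $\|a_i - a_j\|^2 \le \theta\,(\|a_i\|^2 + \|a_j\|^2)/2$; the embedding step mirrors Algorithm 2 with the arbitrary encoding $\psi(x) = x$. This is a sketch for intuition (it forms a dense $m \times m$ distance matrix), not the implementation evaluated in Section 3, which uses KMeans.

```python
# Sketch of spectral embedding (Algorithm 2) followed by the thresholding rule of
# Eq. (5.15), which replaces KMeans in the analysis of Theorem 5.8.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.sparse.linalg import svds

def spectral_threshold_latents(X, k, theta):
    """Return estimated row latents for data matrix X (entries encoded as reals via psi(x) = x)."""
    M = X.astype(float)
    A, _, _ = svds(M, k=k - 1)                             # rows a_i: top (k-1) left singular vectors
    sq = (A ** 2).sum(axis=1)                              # ||a_i||^2
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)       # pairwise ||a_i - a_j||^2
    adj = d2 <= theta * (sq[:, None] + sq[None, :]) / 2.0  # edge set E_theta
    np.fill_diagonal(adj, False)
    _, labels = connected_components(csr_matrix(adj), directed=False)
    return labels                                          # connected components = estimated clusters

# Column latents: apply the same routine to X.T (i.e. use the right singular vectors).
```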
We expect it to be relatively easy to generalize this result, e.g. using the results of (Makarychev et al., 2020) for KMeans++.

Example 5.6. Consider the Symmetric Binary Model TSBM(p0, p1, k; m, n) of Example 4.2, with p0 = p0,n, p1 = p1,n potentially dependent on n. Since in this case X = {0, 1}, the choice of the map ψ has little impact and we set ψ(x) = x. We assume, to simplify formulas, |p1,n − p0,n| kp0,n, p1,n p0,n 9/10. It is easy to compute µn = |p1,n − p0,n|, νn = p0,n p1,n, σ2n p1,n p0,n: Theorem 5.8 implies a nearly optimal compression rate under the following conditions on the model parameters: n , |p1,n − p0,n| log n mn , |p1,n − p0,n|. Here the suppressed factors depend on k, c0. The last of these conditions amounts to requiring that the signal-to-noise ratio is large enough to consistently reconstruct the latents. In the special case of square symmetric matrices (m = n), sharp constants in these bounds can be derived from (Abbe, 2017).

Table 2: Left: DRR of the latent-based compressor in the SBM with misspecified latent-space size kalg. Right: average runtime (seconds) of latent-based compressors on several datasets, compared with the state of the art.

Method | DRR (k = 4) | DRR (k = 6) | DRR (k = 10)
Oracle | 0.27 | 0.27 | 0.27
No Latent | 0.06 | 0.11 | 0.17
kalg = k − 2 | 0.14 | 0.21 | 0.25
kalg = k − 1 | 0.21 | 0.25 | 0.26
kalg = k | 0.27 | 0.27 | 0.26
kalg = k + 1 | 0.27 | 0.27 | 0.26
kalg = k + 2 | 0.25 | 0.27 | 0.23

Method | Taxi | Census | Forest | Jokes
ZSTD 22 | 0.53 sec | 146 sec | 17.2 sec | 2.8 sec
Latents + LZ | 1.2 sec | 96 sec | 5.2 sec | 3.8 sec
Subsampling | 0.7 sec | 30 sec | 3.7 sec | 1.5 sec

6. Discussion and extensions

We proved that classical lossless compression schemes, which serialize the data and then apply a finite-state encoder or a Lempel-Ziv encoder to the resulting sequence, are suboptimal when applied to tabular data. Namely, we introduced a simple model for tabular data, and made the following novel contributions:

1. We characterized the optimal compression rate under this model.
2. We rigorously quantified the gap in compression rate suffered by classical compressors.
3. We showed that a compression scheme that estimates the latents performs well in practice and provably achieves the optimal rate on our model.

The present work naturally suggests several questions and directions for future work.

Model misspecification. The numerical simulations of Section 4 were carried out within the Symmetric Binary Model, under the assumption that the correct cardinality of the latent space, k, is known at the encoder. While the rigorous guarantees of Section 5.3 are more general, it is important to understand the effect of model misspecification. Table 2 presents initial evidence of the robustness of the proposed approach. We generate tables Xm,n ∼ TSBM(p0, p1, k; m, n), with p0 = 0.2, p1 = 0.8, and dimensions m = n = 1000. We vary k ∈ {4, 6, 8} and run the latent-based compressor in a misspecified setting by using a latent-space size kalg ≠ k. We observe that the resulting DRR is fairly robust to misspecification in the number of latents. It is also worth pointing out that parameters of the latent model, such as kalg, can be chosen at compression time to optimize DRR. However, this implies severe performance losses, and therefore further investigation into model misspecification is warranted.

Computational efficiency. While we did not attempt to develop a highly optimized compressor, we believe that the present approach is amenable to such a development.
The largest overhead over standard compressors is in the clustering step, and in particular computing the singular value decomposition (SVD) of the data. This can be accelerated using methods from randomized linear algebra. We implemented row subsampling SVD (Drineas et al., 2006), and observed essentially no loss in DRR by using 10% of the rows as compared to full SVD. Table 2 reports running time experiments to compare the original latent based compressor, its subsampling version and ZSTD at its strongest compression level. Runtimes were averaged over 5 runs on a Macbook Pro single-threaded with a 2 GHz 4-core Intel i5 chip. We observe that row-subsampling yields significant performance improvements, and runtimes that compare well with the industry state of the art. Bits back coding. As mentioned several times, encoding the latents is sub-optimal, unless these can be estimated accurately in the sense of E Err U(X; bu) 0, E Err V (X; bv) 0, cf. Lemma 5.1. If this is not the case, then optimal rates can be achieved using bits-back coding. Continuous latents. Of course, using discrete latents is somewhat un-natural, and it would be interesting to consider continuous ones, in which case bits-back coding is required. Multi-way tables (tensors). A natural extension of the current work is to order-s multiway tables or (equivalently for our purposes) tensors X X n1 ns. The basic scheme would essentially be the same: estimate s vector of latents vi Lni, i s and use them to partition rows/columns. Acknowledgements This work was carried out while Andrea Montanari was on leave from Stanford and a Chief Scientist at Ndata Inc dba Project N. The present research is unrelated to AMs Stanford activity. Compressing Tabular Data via Latent Variable Estimation Abbe, E. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446 6531, 2017. Abbe, E., Fan, J., Wang, K., and Zhong, Y. Entrywise eigenvector analysis of random matrices with low expected rank. Annals of statistics, 48(3):1452, 2020. Altman, E. R. Synthesizing credit card transactions, 2019. https://arxiv.org/abs/1910.03033. Besta, M. and Hoefler, T. Survey and taxonomy of lossless graph compression and space-efficient graph representations. ar Xiv preprint ar Xiv:1806.01799, 2018. Boldi, P. and Vigna, S. The webgraph framework i: compression techniques. In Proceedings of the 13th international conference on World Wide Web, pp. 595 602, 2004. Chen, P., Yu, H.-F., Dhillon, I., and Hsieh, C.-J. Drone: Data-aware low-rank compression for large nlp models. Advances in neural information processing systems, 34: 29321 29334, 2021. Cheng, H., Gimbutas, Z., Martinsson, P.-G., and Rokhlin, V. On the compression of low rank matrices. SIAM Journal on Scientific Computing, 26(4):1389 1404, 2005. Chierichetti, F., Kumar, R., Lattanzi, S., Mitzenmacher, M., Panconesi, A., and Raghavan, P. On compressing social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 219 228, 2009. Collet, Y. and Kucherawy, M. Zstandard compression and the application/zstd media type. Technical report, 2018. Cover, T. M. and Thomas, J. A. Elements of information theory. Wiley, 2006. Drineas, P., Kannan, R., and Mahoney, M. W. Fast monte carlo algorithms for matrices ii: Computing a low-rank approximation to a matrix. SIAM Journal on computing, 36(1):158 183, 2006. Dua, D. and Graff, C. UCI machine learning repository, 2017. http://archive.ics.uci.edu/ml. 
Duda, J. Asymmetric numeral systems. ar Xiv:0902.0271, 2009. Duda, J. Asymmetric numeral systems: entropy coding combining speed of huffman coding with compression rate of arithmetic coding. ar Xiv preprint ar Xiv:1311.2540, 2013. Frey, B. J. and Hinton, G. E. Free energy coding. In Proceedings of Data Compression Conference-DCC 96, pp. 73 81. IEEE, 1996. Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Jester Datasets for Recommender Systems and Collaborative Filtering Research. https://eigentaste.berkeley.edu/dataset/. Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133 151, 2001. https://doi.org/10.1023/A:1011419012209. Hinton, G. E. and Zemel, R. Autoencoders, minimum description length and helmholtz free energy. Advances in neural information processing systems, 6, 1993. Holland, P. W., Laskey, K. B., and Leinhardt, S. Stochastic blockmodels: First steps. Social networks, 5(2):109 137, 1983. Hou, J., Chau, L.-P., Magnenat-Thalmann, N., and He, Y. Sparse low-rank matrix approximation for data compression. IEEE Transactions on Circuits and Systems for Video Technology, 27(5):1043 1054, 2015. IBM. Stats NZ Business Price indexes: March 2022 quarter. https://www.stats.govt.nz/information-releases/businessprice-indexes-march-2022-quarter/, March 2022. Kosolobov, D. The efficiency of the ans entropy encoding. ar Xiv preprint ar Xiv:2201.02514, 2022. Leskovec, J. and Krevl, A. SNAP Datasets: Stanford large network dataset collection, June 2014. http://snap.stanford.edu/data. Li, N. and Li, B. Tensor completion for on-board compression of hyperspectral images. In 2010 IEEE International Conference on Image Processing, pp. 517 520. IEEE, 2010. Lim, Y., Kang, U., and Faloutsos, C. Slashburn: Graph compression and mining beyond caveman communities. IEEE Transactions on Knowledge and Data Engineering, 26(12):3077 3089, 2014. Makarychev, K., Reddy, A., and Shan, L. Improved guarantees for k-means++ and k-means++ parallel. Advances in Neural Information Processing Systems, 33:16142 16152, 2020. NYC.gov. TLC Trip Record Data: NYC Taxi & Limousine Commission, January 2022. Phan, A.-H., Sobolev, K., Sozykin, K., Ermilov, D., Gusak, J., Tichavsk y, P., Glukhov, V., Oseledets, I., and Cichocki, A. Stable low-rank tensor decomposition for compression Compressing Tabular Data via Latent Variable Estimation of convolutional neural network. In European Conference on Computer Vision, pp. 522 539. Springer, 2020. Salomon, D. Data compression: the complete reference. Springer Science & Business Media, 2004. Savari, S. A. Redundancy of the lempel-ziv string matching code. IEEE Transactions on Information Theory, 44(2): 787 791, 1998. stats.govt.nz. Stats NZ Business Price indexes: March 2022 quarter. https://www.stats.govt.nz/informationreleases/business-price-indexes-march-2022-quarter/, March 2022. Taylor, P. R. Lossless compression of wave function information using matrix factorization: A gzip for quantum chemistry. The Journal of Chemical Physics, 139(7): 074113, 2013. Townsend, J., Bird, T., and Barber, D. Practical lossless compression with latent variables using bits back coding. In 7th International Conference on Learning Representations, ICLR 2019, volume 7. International Conference on Learning Representations (ICLR), 2019a. Townsend, J., Bird, T., Kunze, J., and Barber, D. Hilloc: lossless image compression with hierarchical latent variable models. 
In International Conference on Learning Representations, 2019b. Yuan, Z. and Oja, E. Projective nonnegative matrix factorization for image compression and feature extraction. In Scandinavian Conference on Image Analysis, pp. 333 342. Springer, 2005. Compressing Tabular Data via Latent Variable Estimation A. Details on the empirical evaluation A.1. Datasets We used the following datasets: Taxicab. A table with m = 62, 495 rows, n0 = 20 columns comprising data for taxi rides in NYC during January 2022 (NYC.gov, 2022). After preprocessing this table has n = 18 columns. For the LZ (ZSTD) compressor we used |Lr| = 9 row latents and 15 column latents, for the ANS compressor we used |Lr| = 5 row latents and 14 column latents. Network. Four social networks from SNAP Datasets, representing either friends as undirected edges for Facebook or directed following relationships on Google Plus (Leskovec & Krevl, 2014). We regard these as four distinct tables with 0 1 entries, with dimensions, respectively m = n {333, 747, 786, 1187}. For each table we used 5 row latents and 5 column latents. Card transactions. A table of simulated credit card transactions containing information like card ID, merchant city, zip, etc. This table has m = 24, 386, 900 rows and n0 = 15 columns and was generated as described in (Altman, 2019) and downloaded from (IBM, 2022). After preprocessing the table has n = 12 columns. For this table we used 3 row latents and n column latents. Business price index. A table of the values of the consumer price index of various goods in New Zealand between 1996 and 2022. This is a table with m = 72, 750 rows and n0 = 12 columns from the Business price indexes: March 2022 quarter - CSV file from (stats.govt.nz, 2022). After preprocessing this table has n = 10 columns. Due to the highly correlated nature of consecutive rows, we first shuffle them before compressing. For the LZ method we used 6 row latents and 7 column latents, for the ANS method we used 2 row latents and 6 column latents. Forest. A table from the UCI data repository comprising m = 581, 011 cartographic measurements with n0 = 55 attributes, to predict forest cover type based on information gathered from US Geological Survey (Dua & Graff, 2017). It contains binary qualitative variables, and some continuous values like elevation and slope. After preprocessing this data has n = 55 columns. For the LZ method we used 6 row latents and 17 column latents, for the ANS method we used 6 row latents and 17 column latents. US Census. Another table from the UCI Machine Learning Repository (Dua & Graff, 2017) with m = 2, 458, 285 and n0 = 68 categorical attributes related to demographic information, income, and occupation information. After preprocessing this data has n = 68 columns. For this data we used 9 row latents and n column latents. Jokes. A table containing ratings of a series of jokes by 24,983 users collected between April 1999 and May 2003 (Goldberg et al., 2001; Goldberg et al.). These ratings are real numbers on a scale from 10 to 10, and a value of 99 is given to jokes that were not rated. There are m = 23, 983 rows and n0 = 101. The first column identifies how many jokes were rated by a user, and the rest of the columns contain the ratings. After preprocessing this data has n = 101 columns, all quantized. For the LZ method we used 2 row latents and n column latents, for the ANS method we used 8 row latents and 8 column latents. A.2. 
Preprocessing We preprocessed different columns as follows: If a column comprises K 256 unique values, then we map the values to {0, . . . , K 1}. If a column is numerical and comprises more than 256 unique values, we calculate the quartiles for the data and map each entry to its quartile membership (0 for the lowest quartile, 1 for the next largest, 2 for the next and 3 for the largest). If a column does not meet either of the above criteria, we discard it. Finally, in some experiments we randomly permuted before compression. The rationale is that some of the above datasets have rows already ordered in a way that makes nearby rows highly correlated. In these cases, row reordering is obviously of limited use. Compressing Tabular Data via Latent Variable Estimation 0.2 0.26 0.33 0.39 0.45 0.52 0.58 0.64 0.71 0.77 Naive ZSTD % DRR 0.2 0.26 0.33 0.39 0.45 0.52 0.58 0.64 0.71 0.77 Kmeans ZSTD % DRR 0.2 0.26 0.33 0.39 0.45 0.52 0.58 0.64 0.71 0.77 Naive ANS % DRR 0.2 0.26 0.33 0.39 0.45 0.52 0.58 0.64 0.71 0.77 Kmeans ANS % DRR Figure 2: Comparing data reduction rate of naive coding and latent-based coding for data from SBM with k = 5 latents. Top row: ZSTD base compressor. Bottom row: ANS based compressor. Contour lines correspond to the theoretical predictions for various compression algorithms (cf. Example 5.4). B. Further simulations Figures 2 and 3 report empirical DRR values for ZSTD and ANS coding, for data generated according to the symmetric model TSBM(p0, p1, k; m, n) of Section 4. We use m = n = 1000 as before, but now k {5, 7}. Results confirm the conclusions of Section 4. C. A basic fact Lemma C.1. Let A be a finite set and F : A {0, 1} be an injective map. Then, for any probability distribution p over A, a A p(a)len(F(a)) H(p) log2 log2(|A| + 2) . (C.1) Proof. Assume without loss of generality that A = {1, . . . , M}, with |A| = K, and that the elements of A have all nonvanishing probability and are ordered by decreasing probability p1 p2 p K > 0. Let Nℓ:= 2 + 4 + + 2ℓ= 2ℓ+ 1 2. Then the expected length is minimized any map F such that len(F(a)) = ℓfor Nℓ 1 a Nℓwith the Compressing Tabular Data via Latent Variable Estimation 0.2 0.26 0.33 0.39 0.45 0.52 0.58 0.64 0.71 0.77 Naive ZSTD % DRR 0.2 0.26 0.33 0.39 0.45 0.52 0.58 0.64 0.71 0.77 Kmeans ZSTD % DRR 0.2 0.26 0.33 0.39 0.45 0.52 0.58 0.64 0.71 0.77 Naive ANS % DRR 0.2 0.26 0.33 0.39 0.45 0.52 0.58 0.64 0.71 0.77 Kmeans ANS % DRR Figure 3: Same as Figure 2 with k = 7. maximum length ℓK being defined by NℓK 1 < K NℓK. For A p, L := len(F(A)), we have H(p) := H(A) (a) H(L) + H(A|L) ℓ=1 P(L = ℓ)H(A|L = ℓ) (b) log2 ℓK + ℓ=1 P(L = ℓ) ℓ log2 log2(K + 2) + X a A p(a)len(F(a)) , where (a) is the chain rule of entropy and (b) follows because by injectivity, given len(F(A)) = ℓ, A can take at most 2ℓ D. Proofs of results on ideal compression D.1. Proof of Lemma 5.1 We begin by claiming that 1 mn H(Xm,n) = H(X|U, V ) + 1 mn I(Xm,n; U m, V n) . (D.1) Compressing Tabular Data via Latent Variable Estimation Indeed, by the definition of mutual information, we have H(Xm,n) = H(Xm,n|U m, V n)+I(Xm,n; U m, V n). 
Equation (D.1) follows by noting that H(Xm,n|U m, V n) = X v Ln P(U m = u, V n = v) H(Xm,n|U m = u, V n = v) v Ln P(U m = u, V n = v) H(Xi,j|Ui = u, Vj = v) v L P(Ui = ui, Vj = vj)H(Xi,j|Ui = ui, Vj = vj) j=1 H(Xi,j|Ui, Vj) (b) = mn H(X1,1|U1, V1) , where (a) follows from the fact that the (Xi,j) are conditionally independent given U m, V n, ad since the conditional distribution of Xi,j only depends on U m, V n via Ui, Vj; (b) holds because the triples (Xi,j, Ui, Vj) are identically distributed. The lower bound in Eq. (5.1) holds because mutual information is non-negative, and the upper bound because I(Xm,n; U m, V n) H(U m, V n) = m H(U1) + n H(V1). Finally, to prove Eq. (5.2), define πU,X := arg min π SL 1 m i=1 1ˆui =π(ui) , πV ,X := arg min π SL 1 n i=1 1ˆvi =π(vi) , (D.2) If the minimizer is not unique, one can be chosen arbitrarily. We then have holds because I(Xm,n; U m, V n) = H(U m, V n) H(U m, V n|Xm,n) m H(U1) + n H(V1) H(U m|Xm,n) H(V n|Xm,n) m H(U1) + n H(V1) H(U m|Xm,n, πU,X) + I(U m; πU,X|Xm,n) (D.3) H(V |Xm,n, πV ,X) + I(V n; πV ,X|Xm,n) m H(U1) + n H(V1) H(U m|Xm,n, πU,X) H(V |Xm,n, πV ,X) 2|L| log2(|L|) , (D.4) where in the last inequality we used the fact that I(U m; πU,X|Xm,n) H(πU,X) log2(|L|!). Now, consider the term H(U m|Xm,n, πU,X). Letting Y := (Xm,n, πU,X) the stated accuracy assumption implies that there exists an estimators bu+ = bu+(Y ) such that i=1 εU,i , εU,i := P ˆu+ i (Y ) = Ui . By Fano s inequality H(U m|Xm,n, πU,X) i=1 H(Ui|Xm,n, πU,X) h(εU,i) + εU,i log(|L| 1) m h(εU) + εU log(|L| 1) , where the last step follows by Jensen s inequality. The claim (5.2) follows by substituting this bound in Eq. (D.4) and using a similar bound for H(V n|Xm,n, πV ,X). Compressing Tabular Data via Latent Variable Estimation D.2. Proof of Lemma 5.3 We begin with a technical fact. Lemma D.1. Let ξ = (ξij)i m,j n, σ = (σi)i m, τ = (τj) n be collections of mutually independent random variables taking values in a measurable space Z. x : Z3 Z, F : Zm n R. Define x(ξ, σ, τ) Rm n via x(ξ, σ, τ)ij = x(ξij, σi, τj). Given a vector of independent random variables z, we let Varzi(f(z)) := Ezi[(f(z) Ezif(z))2]. Define the quantities B := max x,x Zm n F(x) F(x ) , (D.5) B1 := max σ,σ Zm max τ Zn EξF(x(ξ, σ, τ)) EξF(x(ξ, σ , τ)) , (D.6) B2 := max σ Zm max τ,τ Zn EξF(x(ξ, σ, τ)) EξF(x(ξ, σ, τ )) , (D.7) V := sup ξ,τ,σ i m,j n Varξij F(x(ξ, σ, τ)) , (D.8) V1 := sup τ,σ i m Varσi EξF(x(ξ, σ, τ) , (D.9) V2 := sup τ,σ j n Varσj EξF(x(ξ, σ, τ)) . (D.10) Then, for any t 0, the following holds with probability at least 1 8e t: F(x(ξ, σ, τ)) EF(x(ξ, σ, τ)) 2 max( p 2V2t; (B + B1 + B2)t) . (D.11) Proof. Let z ZN be a vector of independent random variables and f : ZN R. Define the martingale Xk := E[f(z)|Fk] (where Fk := σ(z1, . . . , zk)). Then we have ess sup |Xk Xk 1| B0 := sup d(z,z ) 1 |f(z) f(z )| , (D.12) k=1 E[(Xk Xk 1)2|Fk 1] = k=1 E (E[f|z 10, |Σ| |X|, and log2 |Σ| n log2 |X|/9, the expected compression rate is lower bounded as follows E R(Xm,n) H(X|U) log2 |X| 10 log |Σ| n log |X| log(n log |Σ|) . (E.10) Compressing Tabular Data via Latent Variable Estimation Proof. We let N := mn, N = mn 2ℓwhere we ℓ n/3 will be selected later. We write XN := vec(Xm,n) for the vectorization Xm,n, XN for the vector comprising its first N entries. Recall the definition of empirical distribution. For any fixed w X ℓ ˆpℓ XN (w) := 1 N ℓ+ 1 i=1 1Xi+ℓ 1 i =w . Let S := {i [N ℓ+ 1] : [i, i + ℓ 2] n N = }. In words, these are the subset of blocks of length ℓthat do not cross the end of a line in the table. 
Since for each line break there are at most ℓ 1 such blocks, we have |S| N ℓ+ 1 (m 1)(ℓ 1). We will consider the following modified empirical distribution pℓ XN (w) := 1 i S 1Xi+ℓ 1 i =w . Then by construction ˆpℓ XN (w) = (1 ηℓ)pℓ XN (w) + ηℓqℓ XN (w) , ηℓ:= 1 |S| N ℓ+ 1 = (m 1)(ℓ 1) where qℓ XN is the empirical distribution of blocks that do cross the line. By concavity of the entropy, we have H(ˆpℓ XN ) (1 ηℓ)H(pℓ XN ) + ηℓH(qℓ XN ) (1 ηℓ)H(pℓ XN ) . (E.11) Further, since ℓ n/3, ηℓ= (m 1)(ℓ 1) mn 3ℓ (m 1)ℓ Now let the row latents u := (ui)i m be fixed, and denote by ˆr S u their weighted empirical distribution, defined as follows: ˆr S u(s) := |S [(i 1)n + 1, in]| |S| 1ui=s . In words, ˆr S u is the empirical distribution of the latents (ui)i m where row i is weighted by its contribution to S. Note that all the weights are equal to (n 2(ℓ 1))/|S| except, potentially, for the last one. pℓ (w) := E[pℓ XN (w)] = X u L ˆr S u(u) i=1 Qx|u(wi|u) , Qx|u(w|u) := X v L Q(w|u, v) qc(v) . Using Eq. (E.11), (E.12) and concavity of the entropy, we get E H(ˆpℓ XN )|u 1 ℓ H(pℓ ) . (E.13) By Proposition E.1, we get E R(Xm,n)|u mn 2ℓ mnℓlog2 |X| H(pℓ ) 1 ℓlog2 |X| log2(|Σ|ℓ) + log2 log2 |X| 1 ℓlog2 |X|H(pℓ ) 2ℓ n 1 ℓlog2 |X| log2(|Σ|ℓ) + log2 log2 |X| , Compressing Tabular Data via Latent Variable Estimation Algorithm 3 Lempel-Ziv Input: Data XN X N = {0, , |X| 1}N Output: Binary string Z {0, 1} for k = 1 to N do if j < k : Xj = Xk then Lk max{ℓ 1 : j {1, . . . , k 1} s.t. Xj+ℓ 1 j = Xk+ℓ 1 k } Tk max{j {1, . . . , k 1} s.t. Xj+Lk 1 j = Xk+Lk 1 k } else Lk 1 Tk ( Xk) end if Z Z plain(Tk) elias(Lk) k k + Lk if len(Z) len(plain(XN)) then return Z else return plain(XN) end if end for where in the last inequality we used the fact that H(pℓ ) ℓlog2 |X|. We choose Substituting and simplifying, we get E R(Xm,n)|u H(pℓ ) ℓlog2 |X| 10 n log |Σ| log |X| log(n log |Σ|) . (E.15) Finally, letting (W1, . . . , Wℓ, U) X ℓ L be random variables with joint distribution ˆr S u(u) Qℓ i=1 Qx|u(wi|u). Then u L ˆr S u(u)H Q ℓ x|u( |u) (E.16) u L ˆr S u(u)H(X|U = u) , (E.17) and therefore EH(pℓ ) H(X|U), finishing the proof. F. Proofs for Lempel-Ziv coding The pseudocode of the Lempel-Ziv algorithm that we will analyze is given here. For ease of presentation, we identify X with a set of integers. Note that if a simbol Xk never appeared in the past, we point to Tk = Xk and set Lk = 1. This is essentially equivalent to prepending a sequence of distinct |X| symbols to XN. It is useful to define for each k N, Lk(XN) := max ℓ 1 : j {1, . . . , k 1} s.t. Xj+ℓ 1 j = Xk+ℓ 1 k , (F.1) Tk(XN) := min j {1, . . . , k 1} s.t. Xj+Lk 1 j = Xk+Lk 1 k . (F.2) Compressing Tabular Data via Latent Variable Estimation F.1. Proof of Theorem 5.6 Lemma F.1. Under Assumption 5.5, there exists a constant C such that the following holds with probability at least 1 N 10: max k N Lk(XN) C log N . (F.3) Proof. We begin by considering a slightly different setting, and will then show that our question reduces to this setting. Let (Zi)i 1 be independent random variables with Zi qi a probability distribution over X. Further assume maxx X qi(x) 1 c for all i 1. Then we claim that, for any t, ℓ 1, we have P Zℓ 1 = Zt+ℓ t+1 (1 c)ℓ. (F.4) Indeed, condition on the event Zt 1 = xt 1 for some x1, . . . , xt X. Then the event Zℓ 1 = Zt+ℓ t+1 implies that, for i {t + 1, . . . , t + ℓ}, Zi = xπ(i) where π(i) = i mod t, π(i) [1, t]. Then P Zℓ 1 = Zt+ℓ t+1 max xt 1 X t P Zℓ 1 = Zt+ℓ t+1|Zt 1 = xt 1 max xt 1 X t P Zi = xπ(i) i {t + 1, . . . 
, t + ℓ}|Zt 1 = xt 1 max xt 1 X t i=t+1 P Zi = xπ(i) (1 c)ℓ. This proves claim (F.4). Let us now reconsider our original setting: P max k N Lk(XN) ℓ = P i < j N : Xi+ℓ 1 i = Xj+ℓ 1 j N 2 max i 0, there exist constants C, ε > 0 independent of u Lm, such that the following hold max i m,j n P Ei,j(ℓ(δ, ui)) C N ε , (F.8) min m0 i m,j n P Ei,j(ℓ( δ, ui)) 1 C N ε . (F.9) Compressing Tabular Data via Latent Variable Estimation Lemma F.3. Let ℓc(δ, u) := (1 + δ)(log m)/H(X|U = u, V ) , n c = n maxu L ℓc(δ, u), and m0 = m1 on(1). Under Assumption 5.5, for any δ > 0, there exist constants C, ε > 0 independent of u Lm, such that the following hold max i m,j n c P Fi,j(ℓc(δ, ui)) C m ε , (F.10) min m0 i m,j n c P Fi,j(ℓc( δ, ui)) 1 C m ε . (F.11) We are now in position to prove Theorem 5.6. Proof of Theorem 5.6. We denote by (k(1), . . . , k(M)) the values taken by k in the while loop of the Lempel-Ziv pseudocode. In particular k(1) = 1 , (F.12) k(ℓ+ 1) = k(ℓ) + Lk(ℓ)(XN) , (F.13) k(M) = N . (F.14) Therefore the total length of the code is len(LZ(Xm,n)) = M log2(N + |X|) + ℓ=1 len(elias(Lk(ℓ))) (F.15) By Lemma F.1 (and recalling that len(elias(L)) 2 log2 L+1) we have, with high probability, maxℓ m len(elias(Lk(ℓ))) 2 log2(C log N). Letting G denote the good event that this bound holds, we have, on G M log2 N len(LZ(Xm,n)) M log2(N + |X|) + 2M log2(C log N) (F.16) Since |X| is a constant, this means that for any η > 0, there exists N0(η) such that, for all N N0(η), with probability at least 1 η: M 1G log2 N len(LZ(Xm,n)) (1 + η)M 1G log2 N + N 1Gc log2 |X| , (F.17) where on the right len(LZXm,n)) N log2 |X| by construction. We thus have E M 1G log2 N N log2 |X| E RLZ(Xm,n) (1 + η)E M 1G log2 N N log2 |X| + η , (F.18) lim inf m,n E RLZ(Xm,n) lim inf m,n E M 1G log2 N N log2 |X| , (F.19) lim sup m,n E RLZ(Xm,n) lim sup m,n E M 1G log2 N N log2 |X| . (F.20) We are therefore left with the task of bounding E M 1G We begin by the lower bound. Define the set of bad indices B(Xm,n, δ) [m] [n], B(Xm,n, δ) := n (i, j) [m] [n] : Ei,j(ℓ(δ, ui)) or Fi,j(ℓc(δ, ui)) o (F.21) We will drop the arguments Xm,n, δ for economy of notation, and write B := B(Xm,n, δ). We further define S(u) = S(u; Xm,n) := {(i, j) [m] [n] : ui = u and ℓ M : ij = k(ℓ)} . (F.22) In words, S(u) is the set of positions (i, j) of the table Xm,n where words in the LZ parsing begin. Compressing Tabular Data via Latent Variable Estimation We also write N(u) = n |{i [m] : ui = u}| for the total number of rows in Xm,n with row latent equal to ui and L i for the length of the first segment in row i initiated in row i 1: (i,j) S(u) L ij + X i m:ui=u L i (i,j) S(u) Bc L ij + X (i,j) B L ij + X i m:ui=u L i (i,j) S(u) ℓ(u; δ) ℓc(u; δ) + (|B| + m) C log N |S(u)|ℓ(u; δ) ℓc(u; δ) + (|B| + m) C log N , where the last inequality holds on event G. By taking expectation on this event, we get E{N(u) 1G} E{|S(u)| 1G}} ℓ(u; δ) ℓc(u; δ) + (E|B| + m) C log N . By Lemmas F.2 and F.2, E(|B|) m0n + X m0 i m,j n P Ei,j(ℓ(δ, ui)) Fi,j(ℓc(δ, ui)) + Cm log N m0n + Cm1 εn + Cm log N CN (log N)2 . E(|B|) Cm1 εn + Cm log n. Further E{N(u)} = Nqr(u) and E{N(u) 1G} E{N(u)} N P(Gc), whence lim inf m,n 1 N E{|S(u)| 1G}} ℓ(u; δ) ℓc(u; δ) qr(u) . (F.23) Recalling the definition of ℓ(u; δ), ℓc(u; δ) and the fact that δ is arbitrary,n the last inequality yields lim inf m,n E{|S(u)| 1G}log2 N N qr(u) h H(X|U = u) 1 + α H(X|U = u, V ) i . (F.24) Summing over u, noting that P u L |S(u)| = M, and substituting in Eq. (F.19) yields the lower bound on the rate in Eq. (5.7). 
Finally, the upper bound is proved by a similar strategy as for the lower bound. Define the set of bad indices B = B (Xm,n, δ) [m] [n], B (Xm,n, δ) := n (i, j) [m] [n] : Ec i,j(ℓ( δ, ui)) or Fc i,j(ℓc( δ, ui)) o (F.25) We also denote by L+ i the length of the last segment in row i. We then have (i,j) S(u) L ij X i m:ui=u L+ i (i,j) S(u) Bc L ij X i m:ui=u L+ i (i,j) S(u) Bc ℓ(u; δ) ℓc(u; δ) X i m:ui=u L+ i |S(u)|ℓ(u; δ) ℓc(u; δ) (|B | + m) C log N , where the last inequality holds on event G. By taking expectation on this event, we get E{N(u) 1G} E{|S(u)| 1G}} ℓ( δ, u) ℓc( δ, u) (E|B | + m) C log N . Compressing Tabular Data via Latent Variable Estimation By Lemmas F.2 and F.2, E(|B |) m0n + X m0 i m,j n P Ec i,j(ℓ( δ, ui)) Fc i,j(ℓc( δ, ui)) + Cm log N m0n + Cm1 εn + Cm log N CN (log N)2 . The proof is completed exactly as for the lower bound. F.2. Proof of Lemma F.2 We will use the following standard lemmas. Lemma F.4. Let X be a centered random variable with P(X x0) = 1, x0 > 0. Then, letting c(x0) = (ex0 1 x0)/x2 0, we have E(e X) 1 + c(x0)E(X2) . (F.26) Proof. This simply follows from exp(x) 1 + x + c(x0)x2 for x x0. Lemma F.5. Let (pi)i 1, (qi)i 1, be probability distributions on X, with supi 1 maxx X pi(x) 1 c, and supi 1 P x X pi(x)(log pi(x))2 C for constants c, C. Let (Xi)i ℓbe independent random variables with Xi pi, and set X = (X1, . . . , Xℓ). Let Y (j) X ℓ, j 1 be a sequence of i.i.d. random vectors, with (Yi(j))i ℓindependent and Yi(j) qi. Finally, let T := min{t 1 : Y (t) = X}. Then, for any ε > 0, there exists δ = δ(ε, c, C) > 0 such that (letting H(p) := ℓ 1 Pℓ i=1 H(pi)) P(T eℓ[H(p) ε]) e δℓ. (F.27) Further, the same bound holds (with a different δ(ε, c, C)) (Y (j))j 1 are independent not identically distributed, if there exist a finite set (qa i )i 1,a [K], K ℓC0, and a map b : N [K] such that Y (j) qb(j) 1 qb(j) ℓ . Proof. We denote by Y a vector distributed as Y (i). Conditional on X = x, T is a geometric random variables with mean 1/(1 P(Y = x)). Hence, for tℓ(ε) := eℓ[H(p) ε], P(T tℓ(ε)|X = x) = 1 (1 P(Y = x))tℓ(ε) tℓ(ε)P(Y = x) . P T tℓ(ε) e ℓε/2 + P P(Y = X|X) tℓ(ε/2) 1 (F.28) = e ℓε/2 + Pℓ(ε/2) , (F.29) Pℓ(u) := P ℓ X i=1 log 1 qi(Xi) < i=1 H(pi) ℓu . (F.30) By Chernoff bound, for any λ 0, Pℓ(u) exp{ ℓφ(λ, u)}, where φ(λ, u) := λu 1 λH(pi) + log E qi(Xi)λ . (F.31) By H older inequality, for λ [0, 1] we have E qi(Xi)λ (P x p(x)β)1/β where β = 1/(1 λ). Therefore ψ(λ; p) := λH(p) + (1 λ) log X x X p(x)1/(1 λ) = (1 λ) log EX p exp λ 1 λ log p(X) + H(p) . Compressing Tabular Data via Latent Variable Estimation Consider the random variable Zi := λ 1 λ log pi(Xi) + H(p) where Xi pi. Under the assumptions of the lemma, for λ [0, 1/2] we have E(Zi) = 0 and Zi log(1 c) + H(p) log |X|(1 c) , (F.32) E[Z2 i ] λ 1 λ x X pi(x)(log pi(x))2 4Cλ2 , (F.33) Using Lemma F.4, we get ψ(λ; pi) = (1 λ) log Ee Zi (F.34) (1 λ) log 1 + c0E(Z2 i ) (F.35) log(1 + c λ2) , (F.36) φ(λ, u) λu log(1 + c λ2) . By maximizing this expression over λ, we find that Pℓ(ε/2) exp( δ0(ε)ℓ) which completes the proof for the case of i.i.d. vectors Y (j). The case of non-identically distributed vectors follows by union bound over a [K]. Lemma F.6. Let (pi)i 1, be probability distributions on X, with supi 1 maxx X pi(x) 1 c, and supi 1 P x X pi(x)(log pi(x))2 C for constants c, C. Let (Xi)i ℓbe independent random variables with Xi pi, X = (X1, . . . , Xℓ). Let Y (j) X ℓ, j 1 be a sequence of i.i.d. copies of X. Finally, let T := min{t 1 : Y (t) = X}. 
Then, for any ε > 0, there exists δ = δ(ε, c, C) > 0, such that (letting H(p) := ℓ 1 Pℓ i=1 H(pi)) P(T eℓ[H(p)+ε]) e δℓ. (F.37) Proof. The proof follows the same argument as for Lemma F.5. Denote by Y a vector distributed as Y (i). and define tℓ(ε) := eℓ[H(p)+ε], P(T tℓ(ε)|X = x) = (1 P(Y = x))tℓ(ε) exp tℓ(ε)P(Y = x) . P T tℓ(ε) exp n eℓε/2o + P P(Y = X|X) tℓ(ε/2) 1 (F.38) e ℓε/2 + Pℓ(ε/2) , (F.39) Pℓ(u) := P ℓ X i=1 log 1 pi(Xi) i=1 H(pi) + ℓu . (F.40) We claim that, for each u > 0, Pℓ(u) e δ0(u)ℓfor some δ0(u) > 0. Indeed, using again Chernoff s bound, we get, for any λ 0, Pℓ(u) e ℓ φ(λ,u), where φ(λ, u) := λu 1 i=1 ψ(λ; pi) , (F.41) ψ(λ; pi) := log E exp(λWi) , Wi := log 1 pi(Xi) H(pi) . (F.42) where in the last line Xi pi. Under the assumptions of the lemma, Wi C almost surely and applying again Lemma F.4, we get ψ(λ; pi) log(1 + c λ2) for λ 1. The proof is completed by selecting for each u > 0, λ > 0 so that λu log(1 + c λ2) > 0. Compressing Tabular Data via Latent Variable Estimation We are now in position to prove Lemma F.2. Proof of Lemma F.2. We begin by proving the bound (F.8). Fix i m, j n , u Lm, δ > 0, and write ℓ= ℓ(δ, ui). Define Rij := {(i, j ) : max(1, j ℓ) j j 1} and Sij := {(i , j ) : i < i or i = i, j < j ℓ}. Finally, for t {0, . . . , ℓ 1}, let Sij(t) := Sij {(i , j ) : i j = t mod ℓ}. By union bound P Ei,j(ℓ) u A + (rs) Rij P X rs +ℓ 1 rs = X ij +ℓ 1 ij u , B(t) := P (r, s) Sij(t) : X rs +ℓ 1 rs = X ij +ℓ 1 ij u . Now, by the bound of Eq. (F.4), A ℓ (1 c)ℓ C N ε , (F.43) for suitable constants C, ε. Next, for any t {0, . . . , ℓ 1}, the vectors {X rs +ℓ 1 rs } are mutually independent and independent of {X ij +ℓ 1 ij }. Conditional on u, the coordinates of X rs +ℓ 1 rs = (X rs , , X rs +ℓ 1) are independent with marginal distributions X r s Qx|u( |ur ) (note that independence of the coordinates holds because ℓ< m/2 and therefore X rs +ℓ 1 rs does not include two entries in the same column). Note that the collection of marginal distributions Qx|u( |u), u L satisfies the conditions of Lemma F.5 by assumption. Further, the vector X rs +ℓ 1 rs can have at most one of K = |L|2(ℓ+ 1) distributions (depending on the latents value and the occurrence of a line break in the block.) Applying Lemma F.5, we obtain: B(t) e ε0ℓ C N ε (F.44) Summing over t {0, . . . , ℓ 1} and adjusting the constants yields the claim (F.8). Next consider the bound (F.9). Fix u Lm, i m,j n , and write ℓ= ℓ( δ, ui) for brevity below P Ec i,j(ℓ)|u P (i , j ) Sij(t) s.t. ui = ui, j < n : X i j +ℓ 1 i j = X ij +ℓ 1 ij u . (F.45) Here t {0, . . . , ℓ 1} can be chosen arbitrarily. Let Sij(t; u) := {(i , j ) Sij(t) s.t. ui = ui, j < n }. Conditional on u, the vectors (X i j +ℓ 1 i j )(i ,j ) Sij(t;u) are i.i.d. and independent of X ij +ℓ 1 ij . Further, they are distributed as X ij +ℓ 1 ij . Finally, Nij(u) := Sij(t; u) n mi(u) C log N C log N C n . where mi(u) is the number rows i < i such that ui = u. Since i i and P(ui = u) minu qr(u ) > 0, by Chernoff bound there exist constants C, c0 such that, for all m, n large enough (since i m0) P Nij(u) c0m0n 1 Ce m0/C . (F.46) Further, for any δ > 0 we can choose positive constants ε0, ε1 > 0 such that the following holds for all m, n, large enough log N N 1 ε1 eℓ[H(X|U=ui)+ε0] (F.47) Compressing Tabular Data via Latent Variable Estimation Let Tij be the rank of the first (i , j ) (moving backward)in the set defined above such that X i j +ℓ 1 i j = X ij +ℓ 1 ij , and Tij = if no such vector exists. We can continue from Eq. 
(F.45) to get P Ec i,j(ℓ) P(Tij Nij(u)) P Tij Nij(u); Nij(u) eℓ[H(X|U=ui)+ε0] + Ce m0/C (a) exp δ0 min u L ℓ( δ; u) + Ce m0/C CN ε , where in (a) we used Lemma F.6. This completes the proof of Eq. (F.9). F.3. Proof of Lemma F.3 We begin by considering the bound (F.10). Fix i m, j n c, u Lm, v Ln, δ > 0, and write ℓ= ℓc(δ, ui), n = n c. By union bound: P Fi,j(ℓ) u, v = P s [n],|j s|<ℓB(s) u, v , B(s) := n r < i : X rs +ℓ 1 rs = X ij +ℓ 1 ij o . Note that for a fixed s, and conditional on u, v, the vectors (X rs +ℓ 1 rs )1 s i 1 are mutually independent and independent of X+ℓ 1 . Further, X rs +ℓ 1 rs has independent coordinates with marginals X r s Qx|u( |ur , vs ) (recall that we are conditioning both on u and v). In particular, the marginal distributions satisfy the assumption of Lemma F.5 and the law of X rs +ℓ 1 rs can take one of K = |L|2(ℓ+ 1) possible values. Letting i T(s) the last row at which X rs +ℓ 1 rs = X ij +ℓ 1 ij (with T(s) i if no such row exists), we have, for some constants C, c0 > 0, P Fi,j(ℓ) u, v P s [n],|j s|<ℓ{T(s) i 1} {i 1 eℓ[H ε]} u, v + 1(i 1 > eℓ[H ε]) (a) 2ℓe ℓε + 1(m > eℓ[H ε]) Cm c0ε + 1(m > eℓ[H ε]) , where in (a) we used Lemma F.5, and we defined H := ℓ 1 Pj+ℓ 1 k=j H(X|U = ui, V = vk). Taking expectation with respect to v, we get P Fi,j(ℓ) u Cm c0ε + P 1 k=j H(X|U = ui, V = vk) < 1 1 + δ (H(X|U = ui, V ) + ε) (a) Cm c0ε + e ℓε C m c0ε , where in (a) we used Chernoff bound. This completes the proof of Eq. (F.10). Finally, the proof Eq. (F.11) is similar to the one of Eq. (F.9). We fix u Lm, i m, j n c, and write ℓ= ℓc( δ, ui). P Fc i,j(ℓ)|u P i < iui = ui : X i j +ℓ 1 i j = X ij +ℓ 1 ij u . (F.48) Let Sc ij(u) := {(i , j) s.t. ui = ui, i < i}. Conditional on u, v, the vectors (X i j +ℓ 1 i j )(i ,j ) Sc ij(u) are i.i.d. and independent copies of X ij +ℓ 1 ij . Finally, N c i (u) := Sc ij(u) is the number rows i < i such that ui = u. By Chernoff bound there exist constants C, c0 such that, for all m, n large enough (recalling that we need only to consider i m0) P N c i (u) c0m0 1 Ce m0/C . (F.49) Since m0 m1 on(1), for any δ > 0 we can choose constants ε0, ε1 > 0 so that c0m0 m1 ε1 eℓ[H(X|U=ui,V )+2ε0] . (F.50) Compressing Tabular Data via Latent Variable Estimation Recall the definition H := ℓ 1 Pj+ℓ 1 k=j H(X|U = ui, V = vk). By an an application of Chernoff bound P N c i (u) eℓ[H(X|U=ui,V )+ε0] 1 Cm ε Ce m0/C . (F.51) Let Ti be the rank of the first i (moving backward)in the set defined above such that X i j +ℓ 1 i j = X ij +ℓ 1 ij , and Ti = if no such vector exists. From Eq. (F.48) we get P Fc i,j(ℓ) P(Ti Ni(u)) P Ti Ni(u); Ni(u) eℓ[H(X|U=ui,V )+ε0] + Cm ε (a) exp δ0 min u L ℓc( δ; u) + Cm ε 2Cm ε , where in (a) we used Lemma F.6. G. Proofs for latent-based encoders G.1. Proof of Lemma 5.7 G.1.1. GENERAL BOUND (5.8) From Eq. (1.3), we get Rlat(X) = 1 mn log2 |X| n len(header) + len(ZL(bu)) + len(ZL(bv)) + X u,v L len(ZX ( ˆX(u, v))) o 1 , (G.1) where ˆX(u, v) := vec Xij : ˆui(X) = u, ˆvj(X) = v are the estimated blocks of X. Note that this rate depends on the base compressors ZL, ZX but we will omit these from our notations. Define the ideal expected compression rate (i.e. the rate achieved by a compressor that is given the latents): R# := 1 mn log2 |X| n E[len(header)] + E[len(Z(u))] + E[len(Z(v))] + X u,v L E[len(Z(X(u, v)))] o . 
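To make the bookkeeping behind Rlat in Eq. (G.1) and the ideal rate R# concrete, the sketch below partitions a table according to (estimated or true) row/column latents, compresses each block with an off-the-shelf base compressor, and totals the bits. Here zlib stands in for the generic base compressors ZX and ZL, the helper names are ours, and the header is accounted as one length field of roughly log2(mn) bits per segment, in the spirit of the |L|^2 + 2 integers discussed next.

```python
import zlib
import numpy as np

def compress_block(symbols: np.ndarray) -> bytes:
    """Base compressor stand-in: serialize the symbols and deflate them.
    Assumes symbols/latents fit in one byte, which is enough for this sketch."""
    return zlib.compress(symbols.astype(np.uint8).tobytes(), 9)

def latent_code_length_bits(X: np.ndarray, u: np.ndarray, v: np.ndarray, n_latents: int) -> int:
    """Total code length in bits: header + encoded latents + one compressed
    segment per (row-latent, column-latent) block, mirroring Eq. (G.1)."""
    m, n = X.shape
    segments = [compress_block(u), compress_block(v)]
    for a in range(n_latents):
        for b in range(n_latents):
            block = X[np.ix_(u == a, v == b)].ravel()   # vec(X(a, b)), row-wise
            segments.append(compress_block(block))
    # Header: one length field per segment, each value at most mn.
    header_bits = len(segments) * max(1, int(np.ceil(np.log2(m * n))))
    return header_bits + 8 * sum(len(s) for s in segments)

def latent_rate(X, u, v, n_latents, alphabet_size):
    """Compression rate Rlat: coded bits divided by the mn*log2|X| bits of plain coding."""
    return latent_code_length_bits(X, u, v, n_latents) / (X.size * np.log2(alphabet_size))
```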
Since Rlat(X) 1 by construction, we have E Rlat(X) E Rlat(X) 1 Err U(X;bu)=11 Err V (X;bv)=1 + P Err U(X; bu) < 1 + P Err V (X; bv) < 1 ( ) R# + P Err U(X; bu) < 1 + P Err V (X; bv) < 1 , where in step ( ) we bounded E[len(Z(bu))1 Err U(X;bu)=1] = E[len(Z(u))1 Err U(X;bu)=1] E[len(Z(u))], because, on the event { Err U(X; bu) = 1}, bu coincides with u up to relabelings, and the compressed length is invariant under relabelings. Similar arguments were applied to len(Z(v)) and len(Z(X(u, v))). We now have, by the definition of Z(N; k) in Eq. (5.12), E[len(Z(u))] mn log2 |X| H(U) n log2 |X| + + 1 n Z(m n; {r, c}) , (G.2) E[len(Z(v))] mn log2 |X| H(V ) m log2 |X| + + 1 m Z(m n; {r, c}) , (G.3) E[len(Z(X(u, v)))|u, v] mn log2 |X| ˆqr(u)ˆqc(v)H(X|U = u, V = v) log2 |X| + Z(c mn; {Q( |u, v)}i,v L) , (G.4) where in the last line ˆr is the empirical distribution of the row latents and ˆc is the empirical distribution of the column latents. By taking expectation in the last expression, we get E[len(Z(X(u, v)))|u, v] mn log2 |X| H(X|U, V ) log2 |X| + |L|2 Z(c mn; {Q( |u, v)}u,v L) . (G.5) Compressing Tabular Data via Latent Variable Estimation Finally, the header contains |L|2 + 2 integers of maximum size mn, whence len(header) 4 log2(mn). We conclude that R# 1 log2 |X| n H(X|U, V ) + 1 n H(V ) o + 2 log2(mn) + |L|2 Z(c mn; {Q( |u, v)}u,v L) + 2 Z(m n; {r, c}) . The claim (5.8) follows from the first bound in Eq. (5.2) noticing that, under the stated assumptions on m, n, 1 n h(εU) + εu log(|L| 1) εU P Err U(Xm,n; bu) < 1 . (G.6) G.1.2. REDUNDANCY BOUNDS FOR SPECIFIC ENCODERS: EQS. (5.9) (5.11) LZ coding. Let XN = (X1, . . . , XN) be a vector with i.i.d. symbols Xi q with q a probability distribution over X. The analysis is similar to the one in Appendix F, and we will adopt the same notations here. There are two important differences: data are i.i.d. (not matrix-structured) and we want to derive a sharper estimate (not just the entropy term, but bounding the overhead as well). We define Lk(XN), Tk(XN) as per Eqs. (F.1), (F.2). We let (k(1), . . . , k(MN)) be the values taken by k in the while loop of the Lempel-Ziv pseudocode of Section 5.2.2. In particular k(1) = 1 , (G.7) k(ℓ+ 1) = k(ℓ) + Lk(ℓ)(XN) , (G.8) k(MN) = N . (G.9) (We set k(0) = 0 by convention.) Therefore the total length of the code is len(LZ(XN)) = MN log2(N + |X|) + ℓ=1 len(elias(Lk(ℓ))) MN log2(N + |X|) + 2 ℓ=1 log2(Lk(ℓ)) MN log2(N + |X|) + 2MN log2(N/MN) , where the last step follows by Jensen s inequality. By one more application of Jensen, we obtain E RLZ(XN) 1 log2 |X| EMN N log2(N + |X|) + 2 log2(N/EMN) . (G.10) Define the set of break points and bad positions as SN := k(1), k(2), . . . , k(MN) , (G.11) BN(ℓ) := k [N/2, N] : Lk(XN) < ℓ . (G.12) Note that SN = S N S> N where: S N := n k(j) : j MN, k(j 1) N/2 o , S> N := n k(j) : j MN, k(j 1) > N/2 o . (G.13) Further |S N| d= M N/2 and, for any ℓ N, Lk |S> N| |BN(ℓ)| ℓ. (G.14) Compressing Tabular Data via Latent Variable Estimation 2ℓ+ E |BN(ℓ)| k= N/2 P Lk(XN) < ℓ . (G.15) We claim that this implies, for C0 = 20c log |X| and log N (2 log(2/H(q)))2, 1 N E |S> N| H(q) 2 log2 N + C0 (log log2 N)1/2 (log2 N)3/2 =: ψ(log2 N) . (G.16) Before proving this claim, let us show that it implies the thesis. Recall that MN = |SN| and SN = S N S> N where |S N| d= M N/2 . Therefore, we have proven EMN Nψ(log2 N) + EM N/2 ℓ=0 Nℓψ(log2 Nℓ) + EMNK , (G.17) where we defined recursively N0 = N, Nℓ+1 = N/2 , and K := min{ℓ: log2 Nℓ< (2 log(2/H(q)))2}. Of course, MNK NK exp((2 log 2/H(q))2). 
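As a rough numerical illustration of the phrase-count estimate just derived (E[MN] of order N H(q)/log2 N, up to the log log N corrections above), one can parse a moderately long i.i.d. sample and compare the empirical phrase density with the prediction. The snippet reuses the lz_parse sketch given earlier in this appendix; it is illustrative only, the parser is quadratic so N should be kept modest, and at such N the agreement is only approximate because of the lower-order terms.

```python
import math
import random

def entropy_bits(q):
    """Shannon entropy in bits of a probability vector q."""
    return -sum(p * math.log2(p) for p in q if p > 0)

def phrase_density_check(q, n_symbols=20_000, seed=0):
    """Compare the empirical LZ phrase density M/N with H(q)/log2(N)."""
    rng = random.Random(seed)
    alphabet = list(range(len(q)))
    x = rng.choices(alphabet, weights=q, k=n_symbols)
    m_phrases = len(lz_parse(x))              # lz_parse: sketch given earlier in this appendix
    empirical = m_phrases / n_symbols
    predicted = entropy_bits(q) / math.log2(n_symbols)
    return empirical, predicted

# Example: a biased ternary source.
# emp, pred = phrase_density_check([0.6, 0.3, 0.1])
```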
Further N ℓ Nℓ N ℓ, where N 0 = N 0 = N and N ℓ+1 = N ℓ/2, N ℓ+1 = (N ℓ 1)/2 for ℓ 0. We thus get N ℓ= (N + 1)2 ℓ 1, N ℓ= N 2 ℓand therefore ℓ=0 Nℓψ(log2 Nℓ) 1 N ℓψ(log2 N ℓ) ℓ=0 2 ℓ 1 log2(N2 ℓ 1) + C0 ℓ=0 2 ℓ (log log2 N)1/2 (log2(N2 ℓ 1))3/2 log2 N + 2C0 (log log2 N)1/2 (log2 N)3/2 . Substituting in Eq. (G.17), we get 1 N EMN H(q) log2 N + 2C0 (log log2 N)1/2 (log2 N)3/2 + 1 N exp 2 log(2/H(q)) 2 log2 N + 3C0 (log log2 N)1/2 (log2 N)3/2 , where the last inequality follows for N exp 4 log(2/H(q)) 2 (noting that C0 > 1). Finally, the desired bound (5.9) follows by substituting the last estimate in Eq. (G.10). We are left with the task of proving claim (G.16). Fix any k, N/2 k N and write qℓfor the product distribution q q (ℓtimes). Setting H = Hnats(q) (measuring here entropy in nats), for any δ > 0, P Lk(XN) < ℓ = P Xi+ℓ 1 i = Xk+ℓ 1 k i < k (G.18) xℓ X ℓ P Zk(xℓ) = 0 P Xk+ℓ 1 k = xℓ xℓ X ℓ qℓ(xℓ) 1 qℓ(xℓ) N/2ℓ xℓ X ℓ qℓ(xℓ) 1 qℓ(xℓ) e ℓ[H+δ] + exp n N 2ℓ e ℓ[H+δ]o =: P (ℓ; δ) + P>(ℓ, N; δ) . (G.19) Compressing Tabular Data via Latent Variable Estimation By Chernoff bound P (ℓ; δ) e ℓmaxλ>0 ψδ(λ) , ψδ(λ) := λ[H + δ] log n X x X q(x)1 λo . Note that λ 7 ψδ(λ) is continuous, concave, with ψ δ(0) = δ, ψδ(0) = 0, ψδ(1) = δ + H log |X|. Hence (assuming H < log |X| because otherwise there is nothing to prove) for all δ small enough ψ is maximized for λ (0, 1). Further, defining the random variable Q = q(x) for x Unif(X), ψ δ (λ) = E[Q1 λ(log Q)] E[Q1 λ] + h E[Q1 λ(log Q)2] E[Q1 λ(log Q)2] E[Q1 λ] (G.21) E[(log Q)2] =: c (q) . (G.22) Here the last inequality holds because Q 7 Q1 λ is monotone increasing (for λ [0, 1]) and Q 7 (log Q)2 is monotone decreasing over Q [0, 1], and therefore E[Q1 λ(log Q)2] E[Q1 λ] E[(log Q)2]. In what follows, we set c := c (q) 1. Hence ψδ(λ) δλ c λ2/2 for λ [0, 1] and therefore using Eq. (G.20), P (ℓ; δ) exp n ℓmin δ2 Substituting in Eq. (G.19), and using this in Eq. (G.15), we get, for δ [0, c ]: 1 N E|S> N| 1 2ℓ+ exp n ℓδ2 o + exp n N 2ℓ e ℓ[H+δ]o H (1 ε) , δ = 1 for ε (2c /H) (1/2). Substituting in the previous bound, we get 1 N E|S> N| H 2 log N (1 + 2ε) + exp n Hε2 log N o + exp n HN (ε+ε2)/2 We finally select ε = c0(c log |X|/H)(log log N/ log N)1/2, with c0 a sufficiently small absolute constant. Substituting above, 1 N E|S> N| H 2 log N c0 c log |X|(log log N)1/2 (log N)3/2 + exp n c2 0(log |X|)2c 16H log log N o + exp n H 2 log N e(c0c log |X|/2H) log No c0 c log |X|(log log N)1/2 (log N)3/2 + (log N) c2 0/16 + exp n H 2 log N e(c0/2) log No . Setting c0 = 5, we get 1 N E|S> N| H 2 log N 6 c log |X|(log log N)1/2 (log N)3/2 + exp n H 2 log N e2 log No , whence, the claim (G.16) follows for log N (log(2/H))2. Compressing Tabular Data via Latent Variable Estimation Arithmetic Coding. In Arithmetic Coding (AC) we encode the empirical distribution of XN bq N(x) := N 1 P i N 1Xi=x, and then encode XN in at most log2 bq N(XN) + 1 bits. The encoding of bq N amounts to encoding the |X| 1 integers N bq N(x), x X \ {0} (assuming that 0 X, one of the counts can be obtained by difference). We thus have len(AC(XN)) log2 bq N N (XN) + 1 + X x X len(elias(N bq N(x))) log2 bq N N (XN) + 2|X| log2 N i=1 log2 bq N(Xi) + 2|X| log2 N = N H(bq N) + 2|X| log2 N . Taking expectations ERAC(XN) E H(bq N) log2 |X| + 2|X| log2 N H(q) log2 |X| + 2|X| log2 N N log2 |X| . ANS Coding. The bound (5.11) follows for range ANS coding from the analysis of (Duda, 2009; 2013; Kosolobov, 2022), where encoding of empirical distributions are analyzed as for AC coding. G.2. 
Proof of Theorem 5.8
The proof consists in applying Lemma 5.7 and showing that $\mathbb{P}(\mathrm{Err}_U(X^{m,n};\hat u)>0)\le \log(mn)/mn$ and $\mathbb{P}(\mathrm{Err}_V(X^{m,n};\hat v)>0)\le \log(mn)/mn$. In what follows we assume, without loss of generality, $m\le n$, and recall that $|L|=k$, identifying $L=\{1,\dots,k\}$. We treat $k$ as fixed. We use $C,c,c',\dots$ for constants that may depend on $k$, as well as on the constant $c_0$ in the statement, in ways that we do not track. We will show that these bounds hold conditionally on $u,v$, on the event $\{\min_u\hat q_r(u)\ge c/2,\ \min_v\hat q_c(v)\ge c/2\}$, which holds with probability at least $1-\exp(-c'm)\ge 1-\log(mn)/mn$. Hence, hereafter we treat $u,v$ as deterministic.
Recall that $M\in\mathbb{R}^{m\times n}$ is the matrix with entries $M_{ij}=\psi(X_{ij})$, and let $\bar M=\mathbb{E}\{M\}$. We collect a few facts about $M$ and its expectation.
Singular values. Note that $\bar M$ takes the form
$\bar M = L\,\Psi\,R^{\mathsf T}$ ,   (G.23)
where $\Psi\in\mathbb{R}^{r\times r}$ is the matrix with entries $\Psi_{u,v}=\psi(u,v)$, $L\in\{0,1\}^{m\times r}$ with $L_{ij}=1_{u_i=j}$, and $R\in\{0,1\}^{n\times r}$ with $R_{ij}=1_{v_i=j}$. Define $L=L_0D_L^{1/2}$, where $D_L$ is the diagonal matrix with $(D_L)_{ii}=m\,\hat q_r(i)$, and analogously $R=R_0D_R^{1/2}$, and introduce the singular value decomposition $D_L^{1/2}\Psi D_R^{1/2}=\tilde A\Sigma\tilde B^{\mathsf T}$. We then have the singular value decomposition
$\bar M = A\Sigma B^{\mathsf T}$ ,  $A=L_0\tilde A$ ,  $B=R_0\tilde B$ .   (G.24)
Therefore $\sigma_k(\bar M)\ge\sigma_{\min}(D_L)^{1/2}\,\sigma_{\min}(\Psi)\,\sigma_{\min}(D_R)^{1/2}$ (here and below $\sigma_k$ denotes the $k$-th largest singular value) and, using the assumptions on $\hat q_r,\hat q_c$,
$\sigma_k(\bar M)\ge c\,\mu\sqrt{mn}$ .   (G.25)
Concentration. $M-\bar M$ is a centered matrix with independent entries, with variance bounded by $\sigma^2$ and entries bounded by $1$ (by the assumption $|\psi(x)|\le 1$). By the matrix Bernstein inequality, there exists a universal constant $C$ such that the following holds with probability at least $1-(100n)^{-2}$:
$\|M-\bar M\|_{\mathrm{op}}\le C\max\{\sigma\sqrt{n\log n};\,\log n\}$ .   (G.26)
Incoherence. Since all the entries of $\bar M$ are bounded by $1$, we get
$\|\bar M\|_{2,\infty}\vee\|\bar M^{\mathsf T}\|_{2,\infty}\le\nu\sqrt{n}$ .   (G.27)
Row concentration. For any $i\le m$, and any fixed $W\in\mathbb{R}^{n\times k}$, with probability at least $1-(100n)^{-5}$:
$\|(M-\bar M)_{i,\cdot}\,W\|_2\le C\max\{\sigma\|W\|_F\sqrt{\log n};\,\|W\|_{2,\infty}\log n\}$ .
Defining $\Delta:=\sigma_k(\bar M)\ge c\,\mu\sqrt{mn}$, this implies
$\|(M-\bar M)_{i,\cdot}\,W\|_2\le \Delta\,\|W\|_{2,\infty}\,\varphi\big(\|W\|_F/(\sqrt{n}\,\|W\|_{2,\infty})\big)$ ,  where $\varphi(x):=\frac{C}{\mu\sqrt{mn}}\max\{x\,\sigma\sqrt{n\log n};\,\log n\}$ .
Given these facts, we apply Corollary 2.1 of (Abbe et al., 2020), with the following estimates of the various parameters (see (Abbe et al., 2020) for the definitions): $\|\bar M\|_{2,\infty}\vee\|\bar M^{\mathsf T}\|_{2,\infty}\le\nu\sqrt{n}$, $\gamma\le\dots$, $\varphi(\gamma)\le\max\{\sigma^2,\dots\}$. Corollary 2.1 of (Abbe et al., 2020) then implies that there exists a $k\times k$ orthogonal matrix $Q$ such that
$\|\hat A - A\,Q\|_{2,\infty}\le\|A\|_{2,\infty}\big[(1+\varphi(1))(\gamma+\varphi(\gamma))+\varphi(1)\big]$ .   (G.28)
Recall that $A=L_0\tilde A$ and $\tilde A$ is an orthogonal matrix. Therefore, there exists an orthogonal matrix $Q$ such that (with the desired probability):
Further, the $i$-th row of $L_0$ is
$(L_0)_{i,\cdot}=\frac{1}{\sqrt{m\,\hat q_r(u_i)}}\,e_{u_i}^{\mathsf T}=:z(u_i)\,e_{u_i}^{\mathsf T}$ .   (G.31)
Hence, for any $i$, $c_0\|L_0\|_{2,\infty}\le\|(L_0)_{i,\cdot}\|_2\le\|L_0\|_{2,\infty}$. Denoting by $q_j$ the $j$-th row of $Q$, we thus get, for all $i$ (with $a_i$ the $i$-th row of $\hat A$),
$\|a_i - z(u_i)\,q_{u_i}\|_2\le\dots$
The claim follows immediately using the fact that $c_0\max_u z(u)\le\min_u z(u)\le\max_u z(u)$.
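The estimator analyzed in this proof admits a compact summary: form $M_{ij}=\psi(X_{ij})$, compute the top-$k$ singular subspaces of $M$, and cluster the rows of the resulting factors, which (as shown above) concentrate around the $k$ points $z(u)\,q_u$. Below is a minimal sketch of this pipeline; KMeans is used as a generic clustering step and is our choice of convenience, not necessarily the rounding procedure specified in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_latents(X: np.ndarray, k: int, psi=None, seed: int = 0):
    """Spectral estimate of row/column latents from a categorical table X.

    X   : (m, n) integer array of symbols
    k   : number of latent values (assumed known)
    psi : optional map from symbols to [-1, 1]; defaults to a centered rescaling
    Returns (u_hat, v_hat), each defined only up to a relabeling of {0, ..., k-1}.
    """
    if psi is None:
        s = X.max() if X.max() > 0 else 1
        M = 2.0 * X / s - 1.0                     # entries in [-1, 1]
    else:
        M = np.vectorize(psi)(X).astype(float)
    # Top-k singular subspaces of M.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    A_hat, B_hat = U[:, :k], Vt[:k, :].T
    # Rows of A_hat (resp. B_hat) concentrate around k centers of the form
    # z(u) q_u, so a clustering step recovers the latents up to relabeling.
    u_hat = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(A_hat)
    v_hat = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(B_hat)
    return u_hat, v_hat
```

For large tables one would replace the full SVD with a truncated decomposition (e.g. scipy.sparse.linalg.svds or sklearn.decomposition.TruncatedSVD) before the clustering step.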