Published as a conference paper at ICLR 2024

ENTROPY CODING OF UNORDERED DATA STRUCTURES

Julius Kunze, University College London, juliuskunze@gmail.com
Daniel Severo, University of Toronto and Vector Institute, d.severo@mail.utoronto.ca
Giulio Zani, University of Amsterdam, g.zani@uva.nl
Jan-Willem van de Meent, University of Amsterdam, j.w.vandemeent@uva.nl
James Townsend, University of Amsterdam, j.h.n.townsend@uva.nl

ABSTRACT

We present shuffle coding, a general method for optimal compression of sequences of unordered objects using bits-back coding. Data structures that can be compressed using shuffle coding include multisets, graphs, hypergraphs, and others. We release an implementation that can easily be adapted to different data types and statistical models, and demonstrate that our implementation achieves state-of-the-art compression rates on a range of graph datasets including molecular data.

1 INTRODUCTION

The information stored and communicated by computer hardware, in the form of strings of bits and bytes, is inherently ordered. A string has a first and last element, and may be indexed by numbers in $\mathbb{N}$, a totally ordered set. For data like text, audio, or video, this ordering carries meaning. However, there are also numerous data structures in which the elements have no meaningful order. Common examples include graphs, sets and multisets, and map-like datatypes such as JSON. Recent applications of machine learning to molecular data benefit from large datasets of molecules, which are graphs with vertex and edge labels representing atom and bond types (some examples are shown in Table 1 below). All of these data are necessarily stored in an ordered manner on a computer, but the order then represents redundant information. This work concerns optimal lossless compression of unordered data, and we seek to eliminate this redundancy.

Table 1: Examples of molecules and their order information. The Discount column shows the saving achieved by shuffle coding by removing order information (see eq. 14). For each molecule m, n is the number of atoms and |Aut(m)| is the size of the automorphism group. All values are in bits, and log denotes the binary logarithm. (Structural diagrams from the original table are omitted.)

    Molecular structure        | Permutation log n! | Symmetry log|Aut(m)| | Discount log n! - log|Aut(m)|
    Nitric oxide (NO)          |  1.00              | 0.00                 | 1.00
    Water (H2O)                |  2.58              | 1.00                 | 1.58
    Hydrogen peroxide (H2O2)   |  4.58              | 1.00                 | 3.58
    Ethylene (C2H4)            |  9.49              | 3.00                 | 6.49
    Boric acid (B(OH)3)        | 12.30              | 2.58                 | 9.71

Recent work by Severo et al. (2023a) showed how to construct an optimal lossless codec for (unordered) multisets from a codec for (ordered) vectors, by storing information in an ordering. Their method depends on the simple structure of multisets' automorphism groups, and does not extend to other unordered objects such as unlabeled graphs. In this paper we overcome this issue and develop shuffle coding, a method for constructing codecs for general unordered objects from codecs for ordered objects. Our definitions of ordered and unordered objects are based on the concept of combinatorial species (Bergeron et al., 1997; Joyal, 1981), originally developed to assist with the enumeration of combinatorial structures. They include multisets, as well as all of the other unordered data structures mentioned above, and many more.
Although the method is applicable to any unordered object, we focus our experiments on unordered (usually referred to as "unlabeled") graphs, as these are a widely used data type, and the improvements in compression rate from removing order information are large (as summarized in Table 2). We show that shuffle coding achieves significant improvements relative to existing methods when compressing unordered graphs under the Erdős–Rényi G(n, p) model (Erdős and Rényi, 1960), as well as under the recently proposed Pólya urn-based model of Severo et al. (2023b). Shuffle coding extends to graphs with vertex and edge attributes, such as the molecular and social network datasets of TUDatasets (Morris et al., 2020), which are compressed in Section 5. We release source code[1] with straightforward interfaces to enable future applications of shuffle coding with more sophisticated models and to classes of unordered objects other than graphs.

[1] Source code, data and results are available at https://github.com/juliuskunze/shuffle-coding.

2 BACKGROUND

The definitions for ordered and unordered objects are given in Section 2.1. Entropy coding is reviewed in Section 2.2. Examples are given throughout the section for clarification.

2.1 PERMUTABLE CLASSES

For $n \in \mathbb{N}$, we let $[n] := \{0, 1, \dots, n-1\}$, with $[0] = \emptyset$. The symmetric group of permutations on [n], i.e. bijections from [n] to [n], will be denoted by $S_n$. Permutations compose on the left, like functions, i.e., for $s, t \in S_n$, the product $st$ denotes the permutation formed by performing t and then s. Permutations are represented as follows:

$$(\text{arrow diagram omitted}) = (2, 0, 1) \in S_3. \tag{1}$$

The diagram on the left-hand side of eq. (1) represents the permutation that maps 0 to 2, 1 to 0 and 2 to 1. This permutation can also be represented concretely by the vector (2, 0, 1). Concepts from group theory, including subgroups, cosets, actions, orbits, and stabilizers, are used throughout. We provide a brief introduction in Appendix A.

We will be compressing objects which can be re-ordered by applying permutations. This is formalized in the following definition:

Definition 2.1 (Permutable class[2]). For $n \in \mathbb{N}$, a permutable class of order n is a set F, equipped with a left group action of the permutation group $S_n$ on F, which we denote with the binary operator $\cdot$. We refer to elements of F as ordered objects.

[2] This definition is very close to that of a "combinatorial species", the main difference being that we fix a specific n. See the discussion in Yorgey (2014, pp. 66–67).

Example 2.1 (Length-n strings). For a fixed set X, let $F_n = X^n$, that is, length-n strings of elements of X, and let $S_n$ act on a string in $F_n$ by rearranging its elements. Taking X to be the set of ASCII characters, we can define a permutable class of ASCII strings, with action by rearrangement, for example

$$(\text{arrow diagram omitted}) \cdot \text{"Team"} = \text{"eaTm"}. \tag{2}$$

Example 2.2 (Simple graphs $G_n$). Let $G_n$ be the set of simple graphs with vertex set [n]. Specifically, an element $g \in G_n$ is a set of edges, which are unordered pairs of elements of [n]. We define the action of $S_n$ on a graph by moving the endpoints of each edge in the direction of the arrows, for example (figure omitted, eq. 3).
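To make the two examples concrete, the actions from examples 2.1 and 2.2 can be sketched in a few lines of Python. This is our illustration, not part of the released implementation; the convention $(s \cdot x)[s[i]] = x[i]$ is one way to realize a left action:

    def apply_to_string(s, x):
        # Left action of a permutation s (a tuple of indices) on a string x:
        # position s[i] of the result receives character x[i], so that
        # applying t and then s agrees with applying the product st.
        out = [None] * len(x)
        for i, c in enumerate(x):
            out[s[i]] = c
        return ''.join(out)

    def apply_to_graph(s, edges):
        # Action on a simple graph, given as a set of frozenset edges:
        # relabel each endpoint i as s[i].
        return {frozenset({s[i], s[j]}) for i, j in edges}

    assert apply_to_string((2, 0, 1), "the") == "het"
    # Under this convention, the permutation (2, 0, 1, 3) realizes eq. (2):
    assert apply_to_string((2, 0, 1, 3), "Team") == "eaTm"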
Our main contribution in this paper is a general method for compressing unordered objects. These may be defined formally in terms of equivalence classes, known as orbits, which comprise objects that are identical up to re-ordering (see Appendix A for background):

Definition 2.2 (Isomorphism, unordered objects). For two objects f and g in a permutable class F, we say that f is isomorphic to g, and write $f \cong g$, if there exists $s \in S_n$ such that $g = s \cdot f$ (i.e. if f and g are in the same orbit under the action of $S_n$). Note that $\cong$ is an equivalence relation. For $f \in F$ we use $\tilde{f}$ to denote the equivalence class containing f, and $\tilde{F}$ to denote the quotient set of equivalence classes. We refer to elements $\tilde{f} \in \tilde{F}$ as unordered objects.

For the case of strings in example 2.1, an object's isomorphism class is characterized by the multiset of elements contained in the string. For the simple graphs in example 2.2, the generalized isomorphism in definition 2.2 reduces to the usual notion of graph isomorphism. We can define a shorthand notation for unordered graphs, with points at the nodes instead of numbers (figure omitted, eq. 4). Using this notation, the unordered simple graphs on three vertices, for example, can be written out explicitly (figure omitted, eq. 5).

Finally, we define the subgroup of $S_n$ which contains the symmetries of a given object f:

Definition 2.3 (Automorphism group). For an element f of a permutable class F, we let Aut(f) denote the automorphism group of f, defined by

$$\mathrm{Aut}(f) := \{ s \in S_n \mid s \cdot f = f \}. \tag{6}$$

This is the stabilizer subgroup of f under the action of $S_n$. The elements of the automorphism group of the simple graph from example 2.2 are shown in a figure, omitted here (eq. 7).

2.1.1 CANONICAL ORDERINGS

To define a codec for unordered objects, we will introduce the notion of a canonical representative of each equivalence class in $\tilde{F}$. This allows us, for example, to check whether two ordered objects are isomorphic, by mapping both to the canonical representative and comparing.

Definition 2.4 (Canonical ordering). A canonical ordering is an operator $\mathrm{canon} : F \to F$, such that

1. For $f \in F$, we have $\mathrm{canon}(f) \cong f$.
2. For $f, g \in F$, $\mathrm{canon}(f) = \mathrm{canon}(g)$ if and only if $f \cong g$.

For strings, any sorting function satisfies properties 1 and 2 and is therefore a valid canonical ordering. For graphs, the canonical orderings we use are computed using the nauty and Traces libraries (McKay and Piperno, 2014). The libraries provide a function, which we call canon_perm, which, given a graph g, returns a permutation s such that $s \cdot g = \mathrm{canon}(g)$.
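For intuition, a canonical ordering for tiny graphs can be sketched by brute force, reusing apply_to_graph from the sketch in Section 2.1 (our illustration; it is exponential in n, which is exactly why nauty and Traces are used in practice):

    from itertools import permutations

    def canon(edges, n):
        # Among all n! relabelings of the edge set, return the
        # lexicographically smallest one. This satisfies both properties
        # of definition 2.4.
        def key(g):
            return sorted(tuple(sorted(e)) for e in g)
        return min((apply_to_graph(s, edges) for s in permutations(range(n))),
                   key=key)

    def isomorphic(g1, g2, n):
        # Property 2 of definition 2.4: equal canonical forms <=> isomorphic.
        return canon(g1, n) == canon(g2, n)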
In addition to canon_perm, nauty and Traces can compute the automorphism group of a given graph, via a function which we refer to as aut.[3] Our method critically depends on the availability of such a function for a given permutable class. While permutable objects other than graphs cannot be directly canonized by nauty and Traces, it is often possible to embed objects into graphs in such a way that the structure is preserved and the canonization remains valid (see Anders and Schweitzer (2021)). We use an embedding of edge-colored graphs into vertex-colored graphs in order to canonize and compress graphs with edge attributes (which are not directly supported by nauty/Traces). We leave more systematic approaches to canonizing objects from permutable classes as an interesting direction for future work.[4]

[3] In fact, a list of generators for the group is computed, rather than the entire group, which may be very large.
[4] Schweitzer and Wiebking (2019) describe generic methods for canonization starting from a constructive definition of permutable objects using hereditarily finite sets (i.e. not using the species definition).

2.2 ENTROPY CODING

We fix a set M of prefix-free binary messages, and a length function $l : M \to [0, \infty)$, which measures the number of physical bits required to represent values in M. Our method requires stack-like (LIFO) codecs, such as those based on the range variant of asymmetric numeral systems (rANS), to save the bits corresponding to the redundant order using bits-back (Townsend et al., 2019).

Definition 2.5 ((Stack-like) codec). A stack-like codec (or simply codec) for a set X is an invertible function

$$\mathrm{encode} : M \times X \to M. \tag{8}$$

We call a codec optimal for a probability distribution over X with mass function P if for any $m \in M$ and $x \in X$,

$$l(\mathrm{encode}(m, x)) \approx l(m) + \log \frac{1}{P(x)}. \tag{9}$$

We refer to $\log \frac{1}{P(x)}$ as the optimal rate, and to the inverse of encode as decode. Since decode has to be implemented in practice, we treat it as an explicit part of a codec below.[5]

[5] This condition, with a suitable definition of ≈, is equivalent to rate-optimality in the usual Shannon sense, see Townsend (2020).

The encode function requires a pre-existing message as its first input. Therefore, at the beginning of encoding we set m equal to some fixed, short initial message m0, with length less than 64 bits. As in other entropy coding methods, which invariably have some small constant overhead, this initial bit cost is amortized as we compress more data.

We will assume access to three primitive codecs provided by rANS:

- Uniform(n), optimal for a uniform distribution on {0, 1, ..., n−1}.
- Bernoulli(p), optimal for a Bernoulli distribution with probability p.
- Categorical(ps), optimal for a categorical distribution with probability vector ps.

These primitive codecs can be composed to implement codecs for strings and simple graphs. In Appendix B, we show such a string codec, optimal for a distribution where each character is drawn i.i.d. from a categorical with known probabilities, and a codec for simple graphs, optimal for the Erdős–Rényi G(n, p) model, where each edge's existence is decided by an independent draw from a Bernoulli with known probability parameter. We will use these codecs for ordered objects as a component of shuffle coding.

There is an implementation-dependent limit on the parameter n of Uniform and on the number of categories for Categorical. In the 64-bit rANS implementation which we wrote for our experiments, this limit is 2^48. This is not large enough to, for example, cover $S_n$ for large n, and therefore permutations must be encoded and decoded sequentially, see Appendix C. For details on the implementation of the primitive rANS codecs listed above, see Duda (2009) and Townsend (2021).
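As a mental model for stack-like codecs, the Uniform primitive can be sketched with an unbounded integer state. This is our simplification, not the paper's rANS implementation, which keeps a bounded state and emits bits on renormalization:

    def uniform_encode(m, x, n):
        # Push a symbol x in {0, ..., n-1} onto the integer state m;
        # the state grows by a factor of n, i.e. l(m) grows by ~log n bits.
        return m * n + x

    def uniform_decode(m, n):
        # Pop the most recently pushed symbol (last in, first out).
        return m // n, m % n

    m = 1                           # initial message
    for x in [3, 1, 4]:
        m = uniform_encode(m, x, 5)
    m, x = uniform_decode(m, 5)
    assert x == 4                   # symbols come back in reverse order

The LIFO behavior shown in the last line is what makes bits-back possible: a decode call can be issued against whatever is currently at the top of the message stack.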
Figure 1: Visualization of lemma 3.2. For a fixed graph g, the six elements $s \in S_3$ can be partitioned according to the value of $s \cdot g$. The three sets in the partition are the left cosets of Aut(g). (Figure omitted.)

3 CODECS FOR UNORDERED OBJECTS

Our main contribution in this paper is a generic codec for unordered objects, i.e. a codec respecting a given probability distribution on $\tilde{F}$. We first derive an expression for the optimal rate that this codec should achieve, then in Section 3.1 we describe the codec itself. To simplify the presentation, we will use the following generalization of exchangeability from sequences of random variables to arbitrary permutable classes:

Definition 3.1 (Exchangeability). For a probability distribution P defined on a permutable class F, we say that P is exchangeable if isomorphic objects have equal probability under P, i.e. if

$$f \cong g \implies P(f) = P(g). \tag{10}$$

We can assume, without loss of modeling power, that unordered objects are generated by first generating an ordered object f from an exchangeable distribution and then forgetting the order by projecting f onto its isomorphism class $\tilde{f}$:

Lemma 3.1 (Symmetrization). For any distribution Q on a class of unordered objects $\tilde{F}$, there exists a unique exchangeable distribution P on ordered objects F for which

$$Q(\tilde{f}) = \sum_{g \in \tilde{f}} P(g). \tag{11}$$

Proof. For existence, set $P(f) := Q(\tilde{f}) / |\tilde{f}|$ for $f \in F$, and note that $g \in \tilde{f} \implies \tilde{g} = \tilde{f}$. For uniqueness, note that definition 3.1 implies that the restriction of P to any particular class must be uniform, which completely determines P.

We will model real-world permutable objects using an exchangeable model, which will play the role of P in eq. (11). To further simplify our rate expression we will also need the following application of the orbit-stabilizer theorem (see Appendix A for more detail), which is visualized in Figure 1:

Lemma 3.2. Given a permutable class F, for each object $f \in F$, there is a fixed bijection between the left cosets of Aut(f) in $S_n$ and the isomorphism class $\tilde{f}$. This is induced by the function $\theta_f : S_n \to F$ defined by $\theta_f(s) := s \cdot f$. This implies that

$$|\tilde{f}| = \frac{|S_n|}{|\mathrm{Aut}(f)|} = \frac{n!}{|\mathrm{Aut}(f)|}. \tag{12}$$

Proof. Follows directly from the orbit-stabilizer theorem (theorem A.1) and the definitions of Aut, $\cong$ and $\tilde{f}$.

For any $f \in F$, this allows us to express the right-hand side of eq. (11) as

$$\sum_{g \in \tilde{f}} P(g) = |\tilde{f}| \, P(f) = \frac{n!}{|\mathrm{Aut}(f)|} P(f), \tag{13}$$

where the first equality follows from exchangeability of P, and the second from eq. (12). Finally, from eqs. (11) and (13), we can immediately write down the following optimal rate expression, which a codec on unordered objects should achieve:

$$\log \frac{1}{Q(\tilde{f})} = \underbrace{\log \frac{1}{P(f)}}_{\text{Ordered rate}} - \underbrace{\log \frac{n!}{|\mathrm{Aut}(f)|}}_{\text{Discount}}. \tag{14}$$

Note that only the $\log 1/P(f)$ term depends on the choice of model. The $\log(n!/|\mathrm{Aut}(f)|)$ term can be computed directly from the data, and is the discount that we get for compressing an unordered object vs. compressing an ordered one. The discount is larger for objects which have a smaller automorphism group, i.e. objects which lack symmetry. It can be shown that almost all simple graphs have a trivial automorphism group for large enough n, see e.g. Bollobás (2001, Chapter 9), and thus in practice the discount is usually equal to or close to log n!.
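The discount in eq. (14) is easy to compute by brute force for tiny graphs, reusing apply_to_graph from Section 2.1 (our illustration; real implementations get |Aut(f)| from nauty generators):

    import math
    from itertools import permutations

    def discount_bits(edges, n):
        # Brute-force discount log(n!/|Aut(g)|) from eq. (14), in bits.
        # Feasible only for tiny n.
        aut_size = sum(1 for s in permutations(range(n))
                       if apply_to_graph(s, edges) == edges)
        return math.log2(math.factorial(n)) - math.log2(aut_size)

    # A triangle on {0, 1, 2} plus the isolated vertex 3: |Aut| = 3! = 6,
    # so the discount is log(4!/6) = 2 bits.
    triangle = {frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 2})}
    assert abs(discount_bits(triangle, 4) - 2.0) < 1e-9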
3.1 ACHIEVING THE TARGET RATE FOR UNORDERED OBJECTS

How can we achieve the optimal rate in eq. (14)? In Appendix B we give examples of codecs for ordered strings and simple graphs which achieve the ordered rate. To operationalize the negative discount term, we can use the bits-back with ANS method introduced by Townsend et al. (2019), the key idea being to decode an ordering as part of an encode function (see line 3 in the code below). The value of the negative term in the rate provides a hint at how exactly to decode an ordering: the discount is equal to the logarithm of the number of cosets of Aut(f) in $S_n$, so a uniform codec for those cosets will consume exactly that many bits. Lemma 3.2 tells us that there is a direct correspondence between the cosets of Aut(f) and the set $\tilde{f}$, so if we uniformly decode a choice of coset, we can reversibly map that to an ordering of f. The following is an implementation of shuffle coding, showing, on the right, the effect of each step on the message length:

    1   def encode(m, f):                                # effect on message length:
    2       f_canon = action_apply(canon_perm(f), f)
    3       m, s = UniformLCoset(f_canon.aut).decode(m)  # - log(n!/|Aut(f)|)
    4       g = action_apply(s, f_canon)
    5       m = P.encode(m, g)                           # + log(1/P(f))
    6       return m

    8   def decode(m):
    9       m, g = P.decode(m)
    10      s_ = inv_canon_perm(g)
    11      f_canon = action_unapply(s_, g)
    12      m = UniformLCoset(f_canon.aut).encode(m, s_)
    13      return m, f_canon

The encode function accepts a pair (m, f), and reversibly decodes a random choice g from the isomorphism class of f. This is done using a uniform codec for left cosets, UniformLCoset, which we discuss in detail in Appendix C. The canonization on line 2 is necessary so that the decoder can recover the chosen coset and encode it on line 12. While the codec technically maps between $M \times \tilde{F}$ and M, we avoid representing equivalence classes explicitly as sets, and instead use a single element of the class as a representative. Thus the encoder accepts any f in the isomorphism class being encoded, and the decoder returns the canonization of f. Similarly, UniformLCoset.encode accepts any element of the coset, and UniformLCoset.decode returns a canonical coset element.

3.2 INITIAL BITS

The increase in message length from shuffle coding is equal to the optimal rate in eq. (14). However, the decode step on line 3 of the encode function assumes that there is already some information in the message which can be decoded. At the very beginning of encoding, these initial bits can be generated at random, but they are unavoidably encoded into the message, meaning that for the first object, the discount is not realized. This constant initialization overhead means that the rate, when compressing only one or a few objects, is not optimal, but it tends to the optimal rate as more objects are compressed and the overhead is amortized.
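To make the construction concrete, the following self-contained sketch applies the same recipe to tiny multisets (strings up to reordering), using the integer-state Uniform codec sketched in Section 2.2. Here the left cosets of Aut(f) correspond to the distinct orderings of the multiset, which we simply enumerate; this is exponential in the length and is our illustration only:

    from itertools import permutations

    def distinct_orderings(f):
        # The distinct orderings of a multiset are in bijection with the
        # left cosets of Aut(f) in S_n (lemma 3.2).
        return sorted({''.join(p) for p in permutations(f)})

    def shuffle_encode(m, f, alphabet):
        orderings = distinct_orderings(f)
        m, idx = uniform_decode(m, len(orderings))  # bits-back: -log(n!/|Aut(f)|)
        g = orderings[idx]                          # a random ordering of f
        for c in reversed(g):                       # ordered model P: i.i.d. uniform
            m = uniform_encode(m, alphabet.index(c), len(alphabet))
        return m

    def shuffle_decode(m, length, alphabet):
        chars = []
        for _ in range(length):
            m, i = uniform_decode(m, len(alphabet))
            chars.append(alphabet[i])
        g = ''.join(chars)
        orderings = distinct_orderings(g)
        m = uniform_encode(m, orderings.index(g), len(orderings))
        return m, ''.join(sorted(g))                # canonical representative

    m0 = 123456789                                  # pre-existing message
    m1 = shuffle_encode(m0, 'abca', 'abc')
    assert shuffle_decode(m1, 4, 'abc') == (m0, 'aabc')

The roundtrip restores m0 exactly, and the net message growth is the length times log 3 minus log 12, i.e. the ordered rate minus the discount of eq. (14).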
4 RELATED WORK

To date, there has been a significant amount of work on compression of what we refer to as ordered graphs; see Besta and Hoefler (2019) for a comprehensive survey. Compression of unordered graphs, and of unordered objects in general, has been less well studied, despite the significant potential benefits of removing order information (see Table 2).

The work of Varshney and Goyal (2006) is the earliest we are aware of to discuss the theoretical bounds for compression of sets and multisets, which are unordered strings. Choi and Szpankowski (2012) discuss the optimal rate for unordered graphs (a special case of our eq. 14), and present a compression method called structural ZIP (SZIP), which asymptotically achieves the rate

$$\log \frac{1}{P_{\mathrm{ER}}(g)} - n \log n + O(n), \tag{15}$$

where $P_{\mathrm{ER}}$ is the Erdős–Rényi G(n, p) model. Compared to our method, SZIP is less flexible in the sense that it only applies to simple graphs (without vertex or edge attributes), and it is not an entropy coding method, so the model $P_{\mathrm{ER}}$ cannot be changed easily. On the other hand, SZIP can achieve good rates on single graphs, whereas, because of the initial bits issue (see Section 3.2), our method only achieves the optimal rate on sequences of objects. We discuss this issue further and provide a quantitative comparison in Section 5.

Steinruecken (2014, 2015, 2016) provides a range of specialized methods for compression of various ordered and unordered permutable objects, including multisets, permutations, combinations and compositions. Steinruecken's approach is similar to ours in that explicit probabilistic modeling is used, although different methods are devised for each kind of object rather than attempting a unifying treatment as we have done.

Our method can be viewed as a generalization of the framework for multiset compression presented in Severo et al. (2023a), which also used bits-back with ANS (BB-ANS; Townsend, 2021; Townsend et al., 2019). Severo et al. (2023a) use interleaving to reduce the initial bits overhead and achieve an optimal rate when compressing a single multiset (which can also be applied to a sequence of multisets), whereas the method presented in this paper is optimal only for sequences of unordered objects (including sequences of multisets). However, as mentioned in Section 1, their method only works for multisets and not for more general unordered objects.

There are a number of recent works on deep generative modeling of graphs (see Zhu et al. (2022) for a survey), which could be applied to entropy coding to improve compression rates. Particularly relevant is Chen et al. (2021), who optimize an evidence lower bound (ELBO), equivalent to an upper bound on the rate in eq. (14), when P is not exchangeable. Finally, the Partition and Code (PnC; Bouritsas et al., 2021) method uses neural networks to compress unordered graphs. We compare to PnC empirically in Table 3. PnC is also specialized to graphs, although it does employ probabilistic modeling to some extent.

5 EXPERIMENTS

To demonstrate the method experimentally, we first applied it to the TUDatasets graphs (Morris et al., 2020), with a very simple Erdős–Rényi G(n, p) model for P. Table 2 shows a summary, highlighting the significance of the discount achieved by shuffle coding. We compressed a dataset at a time (note that for each high-level graph type there are multiple datasets in TUDatasets). To handle graphs with discrete vertex and edge attributes, we treated all attributes as independent and identically distributed (i.i.d.) within each dataset. For each dataset, the codec computes and encodes a separate empirical probability vector for vertices and edges, as well as an empirical p parameter, and the size n of each graph (a simple fitting routine is sketched below Table 2). We use run-length encoding for these meta-data, described in detail in Appendix E. Some datasets in TUDatasets contain graphs with continuous attributes. We did not encode these attributes, since for these values lossy compression would usually be more appropriate, and the focus of this work is on lossless compression.

Table 2: For the TUDatasets, this table shows the significance of the discount term in eq. (14). With an Erdős–Rényi (ER) model, with edge probability adapted to each dataset, the percentage improvement (Discount) is the difference between treating the graph as ordered (Ordered ER) and using shuffle coding to forget the order (Shuffle coding ER). Rates are measured in bits per edge.

    Graph type          | Ordered ER | Shuffle coding ER | Discount
    Small molecules     |  2.11      |  1.14             |  46%
    Bioinformatics      |  9.20      |  6.50             |  29%
    Computer vision     |  6.63      |  4.49             |  32%
    Social networks [6] |  3.98      |  2.97             |  26%
    Synthetic           |  5.66      |  2.99             |  47%

[6] Three of the 24 social network datasets, REDDIT-BINARY, REDDIT-MULTI-5K and REDDIT-MULTI-12K, were excluded because compression running time was too long.
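The per-dataset parameter fitting described above amounts to maximum-likelihood estimation from empirical counts. A hypothetical sketch follows (our illustration; the field names n, edges and vertex_attrs are assumptions, and the released code stores the empirical counts themselves, as described in Appendix E):

    from collections import Counter

    def fit_dataset_params(graphs):
        # Empirical Erdos-Renyi edge probability, pooled over the dataset.
        num_edges = sum(len(g.edges) for g in graphs)
        num_pairs = sum(g.n * (g.n - 1) // 2 for g in graphs)
        p = num_edges / num_pairs
        # Maximum-likelihood categorical distribution over vertex attributes.
        counts = Counter(a for g in graphs for a in g.vertex_attrs)
        total = sum(counts.values())
        ps = {a: c / total for a, c in counts.items()}
        return p, ps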
We also compared directly to Bouritsas et al. (2021), who used a more sophisticated neural method to compress graphs (upper part of Table 3). They reported results for six of the datasets from the TUDatasets with vertex and edge attributes removed, and for two of the six they also reported results which included vertex and edge attributes. Because PnC requires training, it was evaluated on a random test subset of each dataset, whereas shuffle coding was evaluated on entire datasets.

We found that for some types of graphs, such as the bioinformatics and social network graphs, performance was significantly improved by using the Pólya urn (PU) preferential attachment model for ordered graphs introduced by Severo et al. (2023b). In this model, a sequence of edges is sampled, where the probability of an edge being connected to a specific node is approximately proportional to the number of edges already connected to that node. Such a rich-get-richer dynamic is plausibly present in the formation of many real-world graphs, explaining the urn model's good performance. It treats edges as a set, and we were able to use an inner shuffle codec for sets to encode the edges, demonstrating the straightforward compositionality of shuffle coding. See Appendix D for details.
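The rich-get-richer dynamic can be illustrated with a toy sampler (our simplification; the model actually coded, including the handling of redraws and self-loops, follows Severo et al. (2023b) and Appendix D):

    import random

    def sample_polya_urn_graph(n, num_edges, seed=0):
        # Toy Polya-urn preferential attachment sampler; assumes
        # num_edges <= n * (n - 1) / 2.
        rng = random.Random(seed)
        urn = list(range(n))                  # one ball per vertex initially
        edges = set()
        while len(edges) < num_edges:
            u, v = rng.choice(urn), rng.choice(urn)
            if u != v and frozenset({u, v}) not in edges:
                edges.add(frozenset({u, v}))
                urn += [u, v]                 # endpoints become more likely
        return edges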
The average initial bit cost per TU dataset in Table 3 is 0.01 bits per edge for both ER and PU, demonstrating good amortization.

As mentioned in Section 4, SZIP achieves a good rate for single graphs, whereas shuffle coding is only optimal for sequences of graphs. In the lower part of Table 3, we therefore compare the "net rate", which is the increase in message length from shuffle coding the graphs, assuming some existing data is already encoded into the message.

The fact that shuffle coding "just works" with any statistical model for ordered graphs is a major advantage of the method, as demonstrated by the fact that we were easily able to improve on the Erdős–Rényi results by swapping in a recently proposed model.

We report speeds in Appendix F. Our implementation has not yet been optimized. One thing that will not be easy to speed up is canonical ordering, since for this we use the nauty and Traces libraries, which have already been heavily optimized. Fortunately, those calls currently account for only 10 percent of the overall time, and we believe there is significant scope for optimization of the rest.

Table 3: Comparison between shuffle coding, with Erdős–Rényi (ER) and our Pólya urn (PU) models, and the best results obtained by PnC (Bouritsas et al., 2021) and SZIP (Choi and Szpankowski, 2012) for each dataset. We also show the discount realized by shuffle coding. Each SZIP comparison is on a single graph, and thus for shuffle coding we report the optimal net compression rate, that is, the additional cost of compressing that graph assuming there is already some compressed data to append to. All measurements are in bits per edge.

    Shuffle coding vs. PnC
    Dataset                      | Discount | ER   | PU   | PnC
    Small molecules
    MUTAG                        | 2.77     | 1.88 | 2.66 | 2.45 ± 0.02
    MUTAG (with attributes)      | 2.70     | 4.20 | 4.97 | 4.45
    PTC_MR                       | 2.90     | 2.00 | 2.53 | 2.97 ± 0.14
    PTC_MR (with attributes)     | 2.87     | 4.88 | 5.40 | 6.49 ± 0.54
    ZINC_full                    | 3.11     | 1.82 | 2.63 | 1.99
    Bioinformatics
    PROTEINS                     | 2.48     | 3.68 | 3.50 | 3.51 ± 0.23
    Social networks
    IMDB-BINARY                  | 0.97     | 2.06 | 1.50 | 0.54
    IMDB-MULTI                   | 0.88     | 1.52 | 1.14 | 0.38

    Shuffle coding (net) vs. SZIP
    Dataset                        | Discount | ER   | PU   | SZIP
    Airports (USAir97)             | 1.12     | 5.09 | 2.90 | 3.81
    Protein interaction (YeastS)   | 3.55     | 6.84 | 5.70 | 7.05
    Collaboration (geom)           | 3.45     | 8.30 | 4.41 | 5.28
    Collaboration (Erdos)          | 7.80     | 7.00 | 4.37 | 5.08
    Genetic interaction (homo)     | 3.97     | 8.22 | 6.77 | 8.49
    Internet (as)                  | 7.34     | 8.37 | 4.47 | 5.75

6 LIMITATIONS AND FUTURE WORK

Time complexity. Shuffle coding relies on computing an object's canonical ordering and automorphism group, for which no polynomial-time algorithm is known for graphs. In consequence, while nauty and Traces solve this problem efficiently for various graph classes, the computation is impractical in the worst case. This limitation can be overcome by approximating an object's canonical ordering instead of calculating it exactly. This introduces a trade-off between speed and compression rate, and lowers the runtime complexity to polynomial time. We leave a detailed description of that more sophisticated method to future work.

Initial bits. A limitation of the version of shuffle coding presented in this paper is that it only achieves an optimal rate for sequences; the rate discount cannot be realized in the one-shot case, as explained in Section 3.2. However, it is possible to overcome this by interleaving encoding and decoding steps, as done in the bit-swap method of Kingma et al. (2019). Information can be eagerly encoded during the progressive decoding of the coset, reducing the initial bits needed by shuffle coding from O(log n!) to O(log n). This is a generalization of the multiset coding method described by Severo et al. (2023a). We again defer a detailed description to future work.

Models. Unlike PnC, we do not rely on compute-intensive learning or hyperparameter tuning. Shuffle coding achieves state-of-the-art compression rates when using simple models with minimal parameters. There is currently active research on deep generative models for graphs, see Zhu et al. (2022) for a survey. We expect improved rates for shuffle coding when combined with such neural models.

7 CONCLUSION

A significant proportion of the data which needs to be communicated and stored is fundamentally unordered. We have presented shuffle coding, the first general method which achieves an optimal rate when compressing sequences of unordered objects. We have also implemented experiments which demonstrate the practical effectiveness of shuffle coding for compressing many kinds of graphs, including molecules and social network data. We look forward to future work applying the method to other forms of unordered data, and applying more sophisticated probabilistic generative models to gain improvements in compression rate.

ACKNOWLEDGMENTS

James Townsend acknowledges funding under the project VI.Veni.212.106, financed by the Dutch Research Council (NWO). We thank Ashish Khisti for discussions and encouragement, and Heiko Zimmermann for feedback on the paper.

REFERENCES

Anders, Markus and Schweitzer, Pascal (2021). Parallel Computation of Combinatorial Symmetries. arXiv:2108.04590 [cs].
Bergeron, François, Labelle, Gilbert, and Leroux, Pierre (1997). Combinatorial Species and Tree-like Structures. Trans. by Margaret Readdy. Encyclopedia of Mathematics and Its Applications. Cambridge.
Besta, Maciej and Hoefler, Torsten (2019). Survey and Taxonomy of Lossless Graph Compression and Space-Efficient Graph Representations. arXiv:1806.01799 [cs, math].
Bollobás, Béla (2001). Random Graphs. 2nd ed. Cambridge Studies in Advanced Mathematics. Cambridge.
Bouritsas, Giorgos, Loukas, Andreas, Karalias, Nikolaos, and Bronstein, Michael (2021). Partition and Code: Learning How to Compress Graphs. In Advances in Neural Information Processing Systems. Vol. 34, pp. 18603–18619.
Chen, Xiaohui, Han, Xu, Hu, Jiajing, Ruiz, Francisco, and Liu, Liping (2021). Order Matters: Probabilistic Modeling of Node Sequence for Graph Generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 1630–1639.
Choi, Yongwook and Szpankowski, Wojciech (2012). Compression of Graphical Structures: Fundamental Limits, Algorithms, and Experiments. In IEEE Transactions on Information Theory 58.2, pp. 620–638.
Duda, Jarek (2009). Asymmetric Numeral Systems. arXiv:0902.0271 [cs, math].
Erdős, Paul and Rényi, Alfréd (1960). On the Evolution of Random Graphs. In Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5.1, pp. 17–60.
Holt, Derek F. (2005). Handbook of Computational Group Theory. In collab. with Bettina Eick and Eamonn A. O'Brien. Discrete Mathematics and Its Applications. Boca Raton, Florida.
Joyal, André (1981). Une théorie combinatoire des séries formelles. In Advances in Mathematics 42.1, pp. 1–82.
Kingma, Friso, Abbeel, Pieter, and Ho, Jonathan (2019). Bit-Swap: Recursive Bits-Back Coding for Lossless Compression with Hierarchical Latent Variables. In Proceedings of the 36th International Conference on Machine Learning, pp. 3408–3417.
Knuth, Donald E. (1981). The Art of Computer Programming. 2nd ed. Vol. 2. Reading, Mass.
McKay, Brendan D. and Piperno, Adolfo (2014). Practical Graph Isomorphism, II. In Journal of Symbolic Computation 60, pp. 94–112.
Morris, Christopher, Kriege, Nils M., Bause, Franka, Kersting, Kristian, Mutzel, Petra, and Neumann, Marion (2020). TUDataset: A Collection of Benchmark Datasets for Learning with Graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020). arXiv:2007.08663.
Schweitzer, Pascal and Wiebking, Daniel (2019). A Unifying Method for the Design of Algorithms Canonizing Combinatorial Objects. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. STOC 2019. New York, NY, USA, pp. 1247–1258.
Seress, Ákos (2003). Permutation Group Algorithms. Cambridge Tracts in Mathematics 152. Cambridge.
Severo, Daniel, Townsend, James, Khisti, Ashish, Makhzani, Alireza, and Ullrich, Karen (2023a). Compressing Multisets with Large Alphabets. In IEEE Journal on Selected Areas in Information Theory.
Severo, Daniel, Townsend, James, Khisti, Ashish J., and Makhzani, Alireza (2023b). Random Edge Coding: One-Shot Bits-Back Coding of Large Labeled Graphs. In Proceedings of the 40th International Conference on Machine Learning.
Sims, Charles C. (1970). Computational Methods in the Study of Permutation Groups. In Computational Problems in Abstract Algebra, pp. 169–183.
Steinruecken, Christian (2014). Lossless Data Compression. PhD thesis. University of Cambridge.
Steinruecken, Christian (2015). Compressing Sets and Multisets of Sequences. In IEEE Transactions on Information Theory 61.3, pp. 1485–1490.
Steinruecken, Christian (2016). Compressing Combinatorial Objects. In 2016 Data Compression Conference (DCC), pp. 389–396.
Townsend, James (2020). A Tutorial on the Range Variant of Asymmetric Numeral Systems. arXiv:2001.09186 [cs, math, stat].
Townsend, James (2021). Lossless Compression with Latent Variable Models. arXiv:2104.10544.
Townsend, James, Bird, Thomas, and Barber, David (2019). Practical Lossless Compression with Latent Variables Using Bits Back Coding. In International Conference on Learning Representations (ICLR).
Varshney, L. R. and Goyal, V. K. (2006). Toward a Source Coding Theory for Sets. In Data Compression Conference (DCC '06), pp. 13–22.
Yorgey, Brent (2014). Combinatorial Species and Labelled Structures. PhD thesis. University of Pennsylvania.
Zhu, Yanqiao, Du, Yuanqi, Wang, Yinkai, Xu, Yichen, Zhang, Jieyu, Liu, Qiang, and Wu, Shu (2022). A Survey on Deep Graph Generation: Methods and Applications. In Proceedings of the First Learning on Graphs Conference, 47:1–47:21.
A GROUP ACTIONS, ORBITS AND STABILIZERS

This appendix gives the definitions of group actions, orbits and stabilizers, as well as a statement and proof of the orbit-stabilizer theorem, which we make use of in Section 3. We use the shorthand $H \leq G$ to mean that H is a subgroup of G, and for $g \in G$, we use the usual notation $gH := \{gh \mid h \in H\}$ and $Hg := \{hg \mid h \in H\}$ for left and right cosets, respectively.

Definition A.1 (Group action). For a set X and a group G, a group action, or simply action, is a binary operator

$$\cdot_G : G \times X \to X \tag{16}$$

which respects the structure of G in the following sense:

1. The identity element $e \in G$ is neutral, that is, $e \cdot_G x = x$.
2. The operator $\cdot_G$ respects composition. That is, for $g, h \in G$,

$$g \cdot_G (h \cdot_G x) = (gh) \cdot_G x. \tag{17}$$

We will often drop the subscript G and use infix $\cdot$ alone where the action is clear from the context.

Definition A.2 (Orbit). An action of a group G on a set X induces an equivalence relation $\sim_G$ on X, defined by

$$x \sim_G y \iff \text{there exists } g \in G \text{ such that } y = g \cdot x. \tag{18}$$

We refer to the equivalence classes induced by $\sim_G$ as orbits, and use $\mathrm{Orb}_G(x)$ to denote the orbit containing an element $x \in X$. We use X/G to denote the set of orbits, so for each $x \in X$, $\mathrm{Orb}_G(x) \in X/G$.

Definition A.3 (Stabilizer subgroup). For an action of a group G on a set X, for each $x \in X$, the stabilizer

$$\mathrm{Stab}_G(x) := \{ g \in G \mid g \cdot x = x \} \tag{19}$$

forms a subgroup of G.

We make use of the orbit-stabilizer theorem in Section 3. Here we give a statement and brief proof of this well-known theorem.

Theorem A.1 (Orbit-stabilizer theorem). For an action of a finite group G on a set X, for each $x \in X$, the function $\theta_x : G \to X$ defined by

$$\theta_x(g) := g \cdot x \tag{20}$$

induces a bijection from the left cosets of $\mathrm{Stab}_G(x)$ to $\mathrm{Orb}_G(x)$. This implies that the orbit $\mathrm{Orb}_G(x)$ is finite and

$$|\mathrm{Orb}_G(x)| = \frac{|G|}{|\mathrm{Stab}_G(x)|}. \tag{21}$$

Proof. We show that $\theta_x$ induces a well-defined function on the left cosets of $\mathrm{Stab}_G(x)$, which we call $\bar{\theta}_x$. Specifically, we define

$$\bar{\theta}_x(g \,\mathrm{Stab}_G(x)) := g \cdot x, \tag{22}$$

and show that $\bar{\theta}_x$ is injective and surjective. To see that $\bar{\theta}_x$ is well defined and injective, note that

$$h \in g \,\mathrm{Stab}_G(x) \iff g^{-1} h \in \mathrm{Stab}_G(x) \tag{23}$$
$$\iff g^{-1} h \cdot x = x \tag{24}$$
$$\iff g \cdot x = h \cdot x, \tag{25}$$

using the definition of $\mathrm{Stab}_G$. For surjectivity, we have

$$y \in \mathrm{Orb}_G(x) \implies \exists\, g \in G \text{ s.t. } y = g \cdot x \tag{26}$$
$$\implies y = \bar{\theta}_x(g \,\mathrm{Stab}_G(x)), \tag{27}$$

using the definition of $\mathrm{Orb}_G$.
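The theorem is easy to verify numerically on tiny examples, reusing apply_to_graph from Section 2.1 (our illustration):

    from itertools import permutations

    def orbit_and_stabilizer(x, n, apply_fn):
        # Brute-force enumeration of Orb(x) and Stab(x) under S_n,
        # acting via apply_fn.
        orbit, stabilizer = set(), []
        for s in permutations(range(n)):
            y = apply_fn(s, x)
            if y == x:
                stabilizer.append(s)
            orbit.add(frozenset(y))
        return orbit, stabilizer

    path = {frozenset({0, 1}), frozenset({1, 2})}    # the path graph 0-1-2
    orbit, stab = orbit_and_stabilizer(path, 3, apply_to_graph)
    assert len(orbit) * len(stab) == 6               # |Orb(x)| * |Stab(x)| = |S_3|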
In Appendix C.2, it will be helpful to have an explicit bijection between G and the Cartesian product $\mathrm{Orb}_G(x) \times \mathrm{Stab}_G(x)$. This requires a way of selecting a canonical element from each left coset of $\mathrm{Stab}_G(x)$ in G, similar to the canonical ordering of definition 2.4:

Definition A.4 (Transversal). For a group G with subgroup $H \leq G$, a transversal of the left cosets of H in G is a mapping $t : G \to G$ such that

1. For all $g \in G$, we have $t(g) \in gH$.
2. For all $f, g \in G$, if $f \in gH$, then $t(f) = t(g)$.

Given such a transversal, we can set up the bijection mentioned above:

Lemma A.1. Let G be a group acting on a set X. If, for $x \in X$, we have a transversal $t_x$ of the left cosets of $\mathrm{Stab}_G(x)$ in G, then we can form an explicit bijection between G and $\mathrm{Orb}_G(x) \times \mathrm{Stab}_G(x)$.

Proof. For $g \in G$, let

$$o_x(g) := g \cdot x \tag{28}$$
$$s_x(g) := t_x(g)^{-1} g, \tag{29}$$

so that $o_x(g) \in \mathrm{Orb}_G(x)$. By condition 1 in definition A.4, there exists $h \in \mathrm{Stab}_G(x)$ such that $t_x(g) = gh$, and in particular $s_x(g) = h^{-1} \in \mathrm{Stab}_G(x)$. So there is a well-defined function $\phi_x(g) := (o_x(g), s_x(g))$ with $\phi_x : G \to \mathrm{Orb}_G(x) \times \mathrm{Stab}_G(x)$.

To see that $\phi_x$ is injective, suppose that $\phi_x(f) = \phi_x(g)$. Then $o_x(f) = o_x(g)$, so $f \cdot x = g \cdot x$, and therefore $f \in g \,\mathrm{Stab}_G(x)$. Condition 2 in definition A.4 then implies that $t_x(f) = t_x(g)$, and since $s_x(f) = s_x(g)$ we have $t_x(f)^{-1} f = t_x(g)^{-1} g$, so $f = g$. The orbit-stabilizer theorem implies that $|G| = |\mathrm{Orb}_G(x)| \, |\mathrm{Stab}_G(x)|$, and therefore if $\phi_x$ is injective it must also be bijective.

B CODECS FOR ORDERED OBJECTS

Codecs for strings and graphs can be composed from the primitive codecs introduced in Section 2.2. The first codec below, for fixed-length strings, is implemented by applying a Categorical codec to each character; the second, for graphs respecting an Erdős–Rényi distribution G(n, p), is implemented by applying the Bernoulli codec to each edge.

    def String(ps, length):
        def encode(m, string):
            assert len(string) == length
            for c in reversed(string):
                m = Categorical(ps).encode(m, c)
            return m

        def decode(m):
            string = []
            for _ in range(length):
                m, c = Categorical(ps).decode(m)
                string.append(c)
            return m, ''.join(string)
        return Codec(encode, decode)

    def ErdosRenyi(n, p):
        def encode(m, g):
            assert len(g) == n
            for i in reversed(range(n)):
                for j in reversed(range(i)):
                    e = g[i][j]
                    m = Bernoulli(p).encode(m, e)
            return m

        def decode(m):
            g = []
            for i in range(n):
                inner = []
                for j in range(i):
                    m, e = Bernoulli(p).decode(m)
                    inner.append(e)
                g.append(inner)
            return m, g
        return Codec(encode, decode)
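For completeness, the Categorical primitive (and Bernoulli as its two-symbol special case) can be sketched with exact big-integer arithmetic, extending the integer-state model from Section 2.2. This is our sketch with integer frequencies rather than the probability vector ps used in the main text; real rANS implementations renormalize a bounded state, but the encode/decode arithmetic is the same:

    from collections import namedtuple

    Codec = namedtuple('Codec', ['encode', 'decode'])

    def Categorical(freqs):
        # ANS codec for a categorical distribution with positive integer
        # frequencies, i.e. P(x) = freqs[x] / total.
        total = sum(freqs)
        cum = [sum(freqs[:x]) for x in range(len(freqs))]

        def encode(m, x):
            return (m // freqs[x]) * total + cum[x] + m % freqs[x]

        def decode(m):
            slot = m % total
            x = max(i for i in range(len(freqs)) if cum[i] <= slot)
            return freqs[x] * (m // total) + slot - cum[x], x

        return Codec(encode, decode)

    def Bernoulli(freq0, freq1):
        return Categorical([freq0, freq1])

Each encode multiplies the state by roughly total/freqs[x], so l(m) grows by about log(1/P(x)) bits, matching the optimality condition in eq. (9).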
Routines for constructing and working with stabilizer chains are standard in computational group theory, and are implemented in Sym Py (https://www.sympy.org/), as well as the GAP system (https://www.gap-system.org/), see Holt (2005, Chapter 4) for theory and description of the algorithms. The method we use for coset_canon is implemented in the function Minimal Element Coset Stab Chain in the GAP system. C.1 A UNIFORM CODEC FOR PERMUTATIONS IN THE SYMMETRIC GROUP We use a method for encoding and decoding permutations based on the Fisher-Yates shuffle (Knuth, 1981, pp. 139 140). The following is a Python implementation: 1 def Uniform S(n): 2 def swap(s, i, j): 3 si_old = s[i] 4 s[i] = s[j] 5 s[j] = si_old 7 def encode(m, s): 8 p = list(range(n)) 9 p_inv = list(range(n)) 10 to_encode = [] Published as a conference paper at ICLR 2024 11 for j in reversed(range(2, n + 1)): 12 i = p_inv[x[j - 1]] 13 swap(p_inv, p[j - 1], x[j - 1]) 14 swap(p, i, j - 1) 15 to_encode.append(i) 17 for j, i in zip(range(2, n + 1), reversed(to_encode)): 18 m = Uniform(j).encode(m) 19 return m 21 def decode(m): 22 s = list(range(n)) 23 for j in reversed(range(2, n + 1)): 24 m, i = Uniform(j).decode(m) 25 swap(s, i, j - 1) 26 return m, s 27 return Codec(encode, decode) The decoder closely resembles the usual Fisher-Yates sampling method, and the encoder has been carefully implemented to exactly invert this process. Both encoder and decoder have time complexity in O(n). C.2 A UNIFORM CODEC FOR PERMUTATIONS IN AN ARBITRARY PERMUTATION GROUP For coding permutations from an arbitrary permutation group, we use the following construction, which is a standard tool in computational group theory (see Holt (2005) and Seress (2003)): Definition C.1 (Base, stabilizer chain). Let H Sn be a permutation group, and B = (b0, . . . , b K 1) a list of elements of [n]. Let H0 := H, and Hk := Stab Hk 1(bk 1) for k = 1, . . . , K. If HK is the trivial group containing only the identity, then we say that B is a base for H, and the sequence of groups H0, . . . , HK is a stabilizer chain of H relative to B. Bases and stabilizer chains are guaranteed to exist for all permutation groups, and can be efficiently computed using the Schreier-Sims algorithm (Sims, 1970). The algorithm also produces a transversal for the left cosets of each Hk+1 in Hk for each k = 0, . . . , K 1, in a form known as a Schreier tree (Holt, 2005). If we define Ok := Orb Hk(bk), for k = 0, . . . , K 1, then by applying the orbit-stabilizer theorem recursively, we have |H| = QK 1 k=0 |Ok|, which gives us a decomposition of the optimal rate that a uniform codec on H should achieve: k=0 log|Ok|. (31) Furthermore, by applying lemma A.1 recursively, using the transversals produced by Schreier-Sims, we can construct an explicit bijection between H and the Cartesian product QK 1 k=0 Ok. We use this bijection, along with a sequence of uniform codecs on O0, . . . , OK 1 for coding automorphisms at the optimal rate in eq. (31). For further details refer to the implementation. D PÓLYA URN MODEL DETAILS We implemented Pólya urn models mostly as described in Severo et al. (2023b), with few modifications. Differently to the original implementation, we apply shuffle coding to the list of edges, resulting in a codec for the set of edges. We also disallow edge redraws and self-loops, leading to an improved rate, as shown in appendix G. This change breaks edge-exchangeability, leading to a stochastic codec, meaning that the code length depends on the initial message. 
D PÓLYA URN MODEL DETAILS

We implemented Pólya urn models mostly as described in Severo et al. (2023b), with a few modifications. Unlike the original implementation, we apply shuffle coding to the list of edges, resulting in a codec for the set of edges. We also disallow edge redraws and self-loops, leading to an improved rate, as shown in Appendix G. This change breaks edge-exchangeability, leading to a stochastic codec, meaning that the code length depends on the initial message.

Shuffle coding is compatible with such models. In this more general setting, the ordered log-likelihood term in the optimal rate (eq. 14) is replaced with a variational evidence lower bound (ELBO). The discount term is unaffected. The derivations in the main text are based, for simplicity, on the special case of exchangeable models, where log-likelihoods are exact. They can be generalized with little effort and no new insight.

E PARAMETER CODING DETAILS

All bit rates reported for our experiments include model parameters. Once per dataset, we code the following lists of natural numbers, by coding the list length with a 46-bit uniform codec, the bit count ⌈log m⌉ of the maximum element m with a 5-bit uniform codec, and each element of the list with a codec respecting a log-uniform distribution in [0, ⌈log m⌉] (a sketch of this bit-length-then-bits scheme is given at the end of this appendix):

- A list resulting from sorting the graphs' numbers of vertices and applying run-length coding, encoding run lengths and differences between consecutive numbers of vertices.
- For datasets with vertex attributes: a list of all vertex attribute counts within the dataset.
- For datasets with edge attributes: a list of all edge attribute counts within the dataset.
- For Erdős–Rényi models: a list consisting of the following two numbers: the total number of edges in all graphs, and the number of vertex pairs that do not share an edge.

Coding these empirical count parameters allows coding the data according to maximum-likelihood categorical distributions. For Pólya urn models, we additionally code the edge count of each graph using a uniform codec over $[0, \frac{1}{2} n (n-1)]$, exploiting the fact that the vertex count n is already coded as described above. For each dataset, we use a single bit to code whether or not self-loops are present, and adapt the codec accordingly.
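The following sketch illustrates one natural reading of the log-uniform scheme (our interpretation, built on the integer-state primitives from Section 2.2; details in the released code may differ): first code the trailing bits of the element uniformly, then its bit length uniformly, so that the total cost of an element x is roughly log(max_bits) + log(x) bits.

    def log_uniform_encode(m, x, max_bits):
        # Code a natural number x with bit length at most max_bits.
        b = x.bit_length()
        if b > 1:
            m = uniform_encode(m, x - (1 << (b - 1)), 1 << (b - 1))  # trailing bits
        m = uniform_encode(m, b, max_bits + 1)                       # bit length
        return m

    def log_uniform_decode(m, max_bits):
        m, b = uniform_decode(m, max_bits + 1)
        if b <= 1:
            return m, b
        m, rest = uniform_decode(m, 1 << (b - 1))
        return m, (1 << (b - 1)) + rest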
F COMPRESSION SPEED

We show compression and decompression speeds for our experiments in Table 4. These speeds include the time needed for gathering dataset statistics and parameter coding. The results show that for our implementation, only a small fraction of runtime is spent on finding automorphism groups and canonical orderings with nauty.

Table 4: Compression and decompression speeds in kilobytes per second (kB/s) of shuffle coding with the Erdős–Rényi (ER) and Pólya urn (PU) models, for all previously reported TU and SZIP datasets. We show SZIP compression speeds calculated from the runtimes reported in Choi and Szpankowski (2012) for comparison. All results are based on the ordered ER rate as the reference uncompressed size. All shuffle coding speeds are for a single thread on a MacBook Pro 2018 with a 2.7 GHz Intel Core i7 CPU. We also report the share of time spent on nauty calls, which determine the canonical ordering and generators of the automorphism group of all graphs.

    Dataset                       | nauty | ER encode | ER decode | PU encode | PU decode | SZIP encode
    TU by type
    Small molecules               | 15%   | 54        | 56        |           |           |
    Bioinformatics                | 2%    | 51        | 66        |           |           |
    Computer vision               | 3%    | 25        | 28        |           |           |
    Social networks               | <1%   | 0.440     | 0.467     |           |           |
    Synthetic                     | 7%    | 98        | 110       |           |           |
    Small molecules
    MUTAG                         | 7%    | 115       | 122       | 51        | 40        |
    MUTAG (with attributes)       | 16%   | 135       | 141       | 67        | 62        |
    PTC_MR                        | 8%    | 107       | 103       | 50        | 45        |
    PTC_MR (with attributes)      | 18%   | 117       | 125       | 67        | 62        |
    ZINC_full                     | 7%    | 105       | 105       | 50        | 47        |
    Bioinformatics
    PROTEINS                      | 3%    | 88        | 94        | 30        | 30        |
    Social networks
    IMDB-BINARY                   | 4%    | 17        | 18        | 8         | 8         |
    IMDB-MULTI                    | 3%    | 11        | 12        | 6         | 5         |
    SZIP
    Airports (USAir97)            | 1%    | 82        | 78        | 5         | 5         | 164
    Protein interaction (YeastS)  | 8%    | 2.442     | 2.391     | 1.238     | 0.859     | 77
    Collaboration (geom)          | <1%   | 0.004     | 0.005     | 0.005     | 0.005     | 64
    Collaboration (Erdos)         | 15%   | 0.025     | 0.025     | 0.024     | 0.024     | 18
    Genetic interaction (homo)    | 7%    | 0.180     | 0.154     | 0.117     | 0.141     | 32
    Internet (as)                 | 13%   | 0.002     | 0.003     | 0.002     | 0.002     | 7

G MODEL ABLATIONS

We present the results of additional ablation experiments on the PnC datasets in Table 5. One ablation uses a uniform distribution for vertex and edge attributes with an Erdős–Rényi model (unif. ER). There is a clear advantage to coding maximum-likelihood categorical parameters (ER), justifying this as the approach used throughout this paper. We also show the rates obtained by the original method proposed in Severo et al. (2023b) (PU redr.), demonstrating a clear rate advantage of our approach of disallowing edge redraws and self-loops (PU) in the model.

Table 5: Model ablations compared to PnC. All results are in bits per edge. (The unif. ER ablation only differs from ER on datasets with attributes.)

    Dataset                    | unif. ER | ER   | PU   | PU redr. | PnC
    Small molecules
    MUTAG                      | –        | 1.88 | 2.66 | 2.81     | 2.45 ± 0.02
    MUTAG (with attributes)    | 6.37     | 4.20 | 4.97 | 5.13     | 4.45
    PTC_MR                     | –        | 2.00 | 2.53 | 2.74     | 2.97 ± 0.14
    PTC_MR (with attributes)   | 8.04     | 4.88 | 5.40 | 5.61     | 6.49 ± 0.54
    ZINC_full                  | –        | 1.82 | 2.63 | 2.75     | 1.99
    Bioinformatics
    PROTEINS                   | –        | 3.68 | 3.50 | 3.62     | 3.51 ± 0.23
    Social networks
    IMDB-BINARY                | –        | 2.06 | 1.50 | 2.36     | 0.54
    IMDB-MULTI                 | –        | 1.52 | 1.14 | 2.17     | 0.38