# HYTREL: Hypergraph-enhanced Tabular Data Representation Learning

Pei Chen1, Soumajyoti Sarkar2, Leonard Lausen2, Balasubramaniam Srinivasan2, Sheng Zha2, Ruihong Huang1, George Karypis2
1Texas A&M University, 2Amazon Web Services
{chenpei,huangrh}@tamu.edu, {soumajs,lausen,srbalasu,zhasheng,gkarypis}@amazon.com

Language models pretrained on large collections of tabular data have demonstrated their effectiveness in several downstream tasks. However, many of these models do not take into account the row/column permutation invariances, hierarchical structure, etc. that exist in tabular data. To alleviate these limitations, we propose HYTREL, a tabular language model that captures the permutation invariances and three more structural properties of tabular data by using hypergraphs, where the table cells make up the nodes and the cells occurring jointly in each row, each column, and the entire table are used to form three different types of hyperedges. We show that HYTREL is maximally invariant under certain conditions for tabular data, i.e., two tables obtain the same representations via HYTREL iff the two tables are identical up to permutations. Our empirical results demonstrate that HYTREL consistently outperforms other competitive baselines on four downstream tasks with minimal pretraining, illustrating the advantages of incorporating the inductive biases associated with tabular data into the representations. Finally, our qualitative analyses showcase that HYTREL can assimilate the table structures to generate robust representations for the cells, rows, columns, and the entire table.1

*Work done as an intern at Amazon Web Services.
1Code is available at: https://github.com/awslabs/hypergraph-tabular-lm
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

1 Introduction

Tabular data that is organized in bi-dimensional matrices is widespread in webpages, documents, and databases. Understanding tables can benefit many tasks such as table type classification, table similarity matching, and knowledge extraction from tables (e.g., column annotations), among others. Inspired by the success of pretrained language models in natural language tasks, recent studies [Yin et al., 2020, Yang et al., 2022] proposed Tabular Language Models (TaLMs) that perform pretraining on tables via self-supervision to generate expressive representations of tables for downstream tasks. Among the TaLMs, many works [Herzig et al., 2020, Yin et al., 2020, Deng et al., 2020, Iida et al., 2021] serialize tables to a sequence of tokens in order to leverage existing pretrained language model checkpoints and textual self-supervised objectives such as Masked Language Modeling. However, due to the linearization of tables into strings, these models do not explicitly incorporate the structural properties of a table, e.g., the invariance to arbitrary (independent) permutations of rows and columns. Our work focuses on obtaining representations of tables that take table structures into account. We hypothesize that incorporating such properties into the table representations will benefit many downstream table understanding tasks.

[Figure 1 depicts a "Soccer Clubs" table (columns Club, Country, Founded; rows Barcelona/ESP/1899, Real Madrid/ESP/1902, Bayern München/GER/1900) and the hypergraph obtained from it under row and column permutations.]

Figure 1: An example of modeling a table as a hypergraph.
Cells make up the nodes, and the cells in each row, each column, and the entire table form hyperedges. The table caption and the header names are used as the names of the table and column hyperedges. The hypergraph keeps the four structural properties of tables, e.g., the invariance property of the table, since row/column permutations result in the same hypergraph.

Motivation: Tabular data is structurally different from other data modalities such as images, audio, and plain text. We summarize four structural properties present in tables below:

- Most tables are invariant to row/column permutations. In general, if we arbitrarily (and independently) permute the rows or columns of a table, it is still an equivalent table. For tables with an explicit ordering of rows or columns, we can make them permutation invariant by appropriately adding a ranking index as a new column or row.
- Data from a single column are semantically similar; for example, they oftentimes have the same semantic types. Similarly, the cells within a single row together describe the attributes of a sample within the table and cannot be treated in silos.
- The interactions within cells/rows/columns are not necessarily pairwise, i.e., the cells within the same row/column, and the rows/columns from the same table, can have high-order multilateral relations.
- Information in tables is generally organized in a hierarchical fashion, where the information at the table level can be aggregated from the column/row level, and further from the cell level.

However, the linearization-based approaches are not designed to explicitly capture most of the above properties. We aim to address this limitation by modeling all of the aforementioned structural properties as inductive biases while learning the table representations.

Our Approach: In line with recent studies [Deng et al., 2020, Yang et al., 2022, Wang et al., 2021a] that have elucidated the importance of the structure of a table, we propose HYTREL, which uses hypergraphs to model tabular data. We propose a modeling paradigm that aims to capture all four properties directly. Figure 1 provides an example of how a hypergraph is constructed from a table. As observed, converting a table into a hypergraph allows us to incorporate the first two properties inherent to the nature of hypergraphs. Hypergraphs seamlessly allow the model to incorporate row/column permutation invariances, as well as interactions among the cells within the same column or row. Moreover, the proposed hypergraph structure can capture the high-order (not just pairwise) interactions for the cells in a column or a row, as well as from the whole table, and an aggregation of hyperedges can also help preserve the hierarchical structure of a table.

Contributions: Our theoretical analysis and empirical results demonstrate the advantages of modeling the four structural properties. We first show that HYTREL is maximally invariant when modeling tabular data (under certain conditions), i.e., if two tables get the same representations via the hypergraph table learning function, then the tables differ only by (independent) row/column permutation actions, and vice versa. Empirically, we pretrain HYTREL on publicly available tables using two self-supervised objectives: a table-content-based ELECTRA2 objective [Clark et al., 2020, Iida et al., 2021] and a table-structure-dependent contrastive objective [Wei et al., 2022].
The evaluation of the pretrained HYTREL model on four downstream tasks (two knowledge extraction tasks, a table type detection task, and a table similarity prediction task) shows that HYTREL can achieve state-of-the-art performance. We also provide an extensive qualitative analysis of HYTREL, including visualizations that showcase that (a) HYTREL representations are robust to arbitrary permutations of rows and columns (independently), (b) HYTREL can incorporate the hierarchical table structure into the representations, and (c) HYTREL can achieve close to state-of-the-art performance even without pretraining, and the model is extremely efficient with respect to the number of pretraining epochs in comparison to prior works, further demonstrating the advantages of HYTREL in modeling the structural properties of tabular data. In Appendix B, we provide additional analysis that demonstrates HYTREL's ability to handle input tables of arbitrary size and underscores the importance of independent row/column permutations.

2 HYTREL Model

Formally, a table in our work is represented as T = [M, H, R], where M is the caption, H = [h_1, h_2, h_3, ..., h_m] are the m column headers, and R = [R_1, R_2, R_3, ..., R_n] are the n rows. Each row R_i has m cells [c_{i1}, c_{i2}, c_{i3}, ..., c_{im}]. The caption, headers, and cell values can be regarded as sentences that contain several words. We note that each cell c_{ij} or header h_j also belongs to the corresponding column C_j. We use C = [C_1, C_2, C_3, ..., C_m] to represent all the columns including the headers, so a table can also be defined as T = [M, C].

Figure 2: The overall framework of HYTREL. We first turn a table T into a hypergraph G and then initialize the embeddings of the nodes V and hyperedges E. After that, we encode the hypergraph using stacked multi-layer hypergraph-structure-aware transformers (HyperTrans). Each HyperTrans layer has two attention blocks that work on the hypergraph (HyperAtt, i.e., Node2Hyperedge and Hyperedge2Node) and one Hyperedge Fusion block. Lastly, we use the node and hyperedge representations from the final layer for pretraining and fine-tuning.

2.1 Formatter & Embedding Layer

The formatter transforms a table into a hypergraph. As shown in Figure 1, given a table T, we construct a corresponding hypergraph G = (V, E), where V and E denote the set of nodes and hyperedges, respectively. We treat each cell $c_{ij}$ as a node $v_{ij} \in V$, and each row $R_i$, each column $C_j$, and the entire table T as hyperedges $e^r_i, e^c_j, e^t \in E$ ($1 \le i \le n$, $1 \le j \le m$), respectively. As a part of our hypergraph construction, each cell node $v_{ij}$ is connected to 3 hyperedges: its row hyperedge $e^r_i$, its column hyperedge $e^c_j$, and the table hyperedge $e^t$. The hypergraph can be conveniently represented as an incidence matrix $B \in \{0, 1\}^{nm \times (m+n+1)}$, where $B_{ij} = 1$ when node $i$ belongs to hyperedge $j$ and $B_{ij} = 0$ otherwise.

An embedding layer is then employed over the nodes and hyperedges. Each node $v_{ij}$ corresponds to a cell $c_{ij}$ that has several tokens, and we obtain the feature vector $X_{v_{ij},:} \in \mathbb{R}^F$ for a given node by feeding its constituent cell tokens into the embedding layer and then averaging the embeddings over the tokens. After obtaining the node embeddings $X \in \mathbb{R}^{nm \times F}$ for all nodes, a similar transformation is applied over the hyperedges.
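The hypergraph topology is determined entirely by the table's shape: every cell node joins exactly three hyperedges. Below is a minimal sketch of this incidence-matrix construction (not taken from the released code); the function name, the row-major node ordering, and the column/row/table hyperedge ordering are illustrative choices.

```python
import numpy as np

def table_to_incidence(n_rows: int, n_cols: int) -> np.ndarray:
    """Build the hypergraph incidence matrix B of shape (n*m, m+n+1).

    Nodes are cells in row-major order; hyperedges are ordered as
    [m column hyperedges, n row hyperedges, 1 table hyperedge].
    """
    n, m = n_rows, n_cols
    B = np.zeros((n * m, m + n + 1), dtype=np.int8)
    for i in range(n):          # row index
        for j in range(m):      # column index
            node = i * m + j
            B[node, j] = 1      # column hyperedge e^c_j
            B[node, m + i] = 1  # row hyperedge e^r_i
            B[node, m + n] = 1  # table hyperedge e^t
    return B

# Example: a 3x3 table (as in Figure 1) gives a 9 x 7 incidence matrix,
# and every cell node belongs to exactly 3 hyperedges.
B = table_to_incidence(3, 3)
assert B.shape == (9, 7) and (B.sum(axis=1) == 3).all()
```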
For the different hyperedge types, we use different table content for their initialization: for a column hyperedge $e^c_j$, we use all the tokens from the corresponding header $h_j$; for the table hyperedge $e^t$, we use the entire caption M; for a row hyperedge $e^r_i$, where no semantic information is available, we randomly initialize its embedding $S_{e^r_i,:} \in \mathbb{R}^F$. Performing the above operations yields an initialization for all the hyperedge embeddings $S \in \mathbb{R}^{(m+n+1) \times F}$.

2We use the name ELECTRA following Iida et al. [2021], but it is different from the ELECTRA objective used in language modeling [Clark et al., 2020]: we do not utilize an auxiliary model to generate corruptions.

2.2 Hypergraph Encoder

After embedding, we propose to use a structure-aware transformer module (HyperTrans) to encode the hypergraphs. The HyperTrans encoder can encode the table content, the structure, and the relations among the table elements (including cells, headers, captions, etc.). As shown in Figure 2, one layer of the HyperTrans module is composed of two hypergraph attention blocks (HyperAtt, $f$) [Chien et al., 2022] that let the node and hyperedge representations interact, and one Hyperedge Fusion block.

The first HyperAtt block is the Node2Hyperedge attention block, defined as:

$$\tilde{S}^{(t+1)}_{e,:} = f_{V \to E}\big(K_{e, X^{(t)}}\big) \qquad (1)$$

where $f_{V \to E}$ is a hypergraph attention function defined from nodes to hyperedges, and $K_{e,X} = \{X_{v,:} : v \in e\}$ denotes the set of hidden representations of the nodes contained in the hyperedge $e$. The Node2Hyperedge block aggregates information to a hyperedge $e$ from its constituent nodes $v \in e$. We then use a Hyperedge Fusion module (a multilayer perceptron, MLP) to propagate the hyperedge information from the last step:

$$S^{(t+1)}_{e,:} = \mathrm{MLP}\big(S^{(t)}_{e,:};\ \tilde{S}^{(t+1)}_{e,:}\big) \qquad (2)$$

A second HyperAtt block, Hyperedge2Node, then aggregates information from a hyperedge to its constituent nodes:

$$X^{(t+1)}_{v,:} = f_{E \to V}\big(L_{v, S^{(t+1)}}\big) \qquad (3)$$

where $f_{E \to V}$ is another hypergraph attention function defined from hyperedges to nodes, and $L_{v,S} = \{S_{e,:} : v \in e\}$ is the set of hidden representations of the hyperedges that contain the node $v$.

The HyperAtt block $f$, similar to the transformer [Vaswani et al., 2017], is composed of one multi-head attention, one position-wise feed-forward network (FFN), two layer normalizations (LN) [Ba et al., 2016], and two skip connections [He et al., 2016], as shown in Figure 2. However, we do not use the self-attention mechanism [Vaswani et al., 2017] from the transformer model, because it is not designed to keep the invariance structure of tables or hypergraphs. Inspired by deep set models [Zaheer et al., 2017, Lee et al., 2019], we instead use a set attention mechanism that keeps the permutation invariance of a table. We define HyperAtt $f$ as:

$$f_{V \to E \text{ or } E \to V}(I) := \mathrm{LN}\big(Y + \mathrm{FFN}(Y)\big) \qquad (4)$$

where $I$ is the input node or hyperedge representations. The intermediate representation $Y$ is obtained by:

$$Y = \mathrm{LN}\big(\omega + \mathrm{SetMH}(\omega, I, I)\big) \qquad (5)$$

where $\mathrm{SetMH}$ is the multi-head set attention mechanism defined as:

$$\mathrm{SetMH}(\omega, I, I) = \big\Vert_{i=1}^{h} O_i \qquad (6)$$

and

$$O_i = \mathrm{Softmax}\big(\omega_i (I W^K_i)^\top\big)\, I W^V_i \qquad (7)$$

where $\omega$ is a learnable weight vector serving as the query, $\omega := \Vert_{i=1}^{h} \omega_i$, $W^K_i$ and $W^V_i$ are the weights for the key and value projections, and $\Vert$ denotes concatenation.

The HyperTrans module thus updates the node and hyperedge representations alternately. This mechanism forces the table cells to interact with the columns, the rows, and the table itself. Similar to BERTbase [Devlin et al., 2019] and TaBERTbase [Yin et al., 2020], we stack 12 layers of HyperTrans.
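To make the layer above concrete, the following is a compact, illustrative PyTorch sketch of one HyperTrans layer (Eqns (1)-(7)). It is not the released implementation: it uses a single attention head, a dense incidence matrix, a Python loop over sets instead of batched scatter operations, and omits dropout for readability. The class names `SetAttention` and `HyperTransLayer` are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SetAttention(nn.Module):
    """HyperAtt block: a learnable query vector attends over a set (Eqns 4-7, single head)."""
    def __init__(self, d: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d))        # omega
        self.key = nn.Linear(d, d, bias=False)           # W^K
        self.value = nn.Linear(d, d, bias=False)          # W^V
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, members: torch.Tensor) -> torch.Tensor:
        # members: (set_size, d) hidden states of the nodes (or hyperedges) in one set
        scores = self.key(members) @ self.query            # (set_size,)
        attn = F.softmax(scores, dim=0)
        pooled = attn @ self.value(members)                # (d,)   Eqns 6-7
        y = self.ln1(self.query + pooled)                  # Eqn 5
        return self.ln2(y + self.ffn(y))                   # Eqn 4

class HyperTransLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.node2edge = SetAttention(d)    # f_{V->E}
        self.edge2node = SetAttention(d)    # f_{E->V}
        self.fusion = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, X: torch.Tensor, S: torch.Tensor, B: torch.Tensor):
        # X: (num_nodes, d) cell states; S: (num_edges, d) hyperedge states;
        # B: (num_nodes, num_edges) binary incidence matrix.
        # Node2Hyperedge (Eqn 1): pool each hyperedge's member nodes.
        S_tilde = torch.stack([self.node2edge(X[B[:, e].bool()]) for e in range(S.size(0))])
        # Hyperedge Fusion (Eqn 2).
        S_new = self.fusion(torch.cat([S, S_tilde], dim=-1))
        # Hyperedge2Node (Eqn 3): pool, for each node, the hyperedges containing it.
        X_new = torch.stack([self.edge2node(S_new[B[v].bool()]) for v in range(X.size(0))])
        return X_new, S_new
```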
2.3 Invariances of the HYTREL Model

Let $\phi : T \mapsto z \in \mathbb{R}^d$ be our target function, which captures the desired row/column permutation invariances of tables (say, for tables of size $n \times m$). Rather than working on the table T directly, the proposed HYTREL model works on a hypergraph (via Eqns (1-5)) that has an incidence matrix B of size $mn \times (m+n+1)$. Correspondingly, we shall refer to HYTREL as a function $g : B \mapsto y \in \mathbb{R}^k$. In this section we make the connections between the properties of the two functions $\phi$ and $g$, demonstrating a maximal invariance between the two, as a result of which we prove that HYTREL also preserves the permutation invariances of tables. First, we list our assumptions and the resultant properties of tabular data. Subsequently, we present the maximal invariance property of $\phi$ and of our hypergraph-based learning framework $g$. As a part of our notation, we use $[n]$ to denote $\{1, 2, \ldots, n\}$. Preliminaries and all detailed proofs are presented in Appendix A.1 and A.2, respectively.

Assumption 2.1. For any table $(T_{ij})_{i \in [n], j \in [m]}$ (where $i, j$ index the rows and columns), an arbitrary group action $a \in S_n \times S_m$ acting appropriately on the rows and columns leaves the target random variables associated with tasks on the entire table unchanged.

This assumption is valid for most real-world tables, as reordering the columns and the rows of a table oftentimes doesn't alter the properties associated with the entire table (e.g., the name of the table). As noted earlier, for tables with an explicit ordering of rows or columns, we can make them permutation invariant by adding a ranking index as a new column or row appropriately. To model this assumption, we next state a property required of functions acting on tables.

Property 1. A function $\phi : T \mapsto z \in \mathbb{R}^d$ which satisfies Assumption 2.1 and is defined over tabular data must be invariant to actions from the (direct) product group $S_n \times S_m$ acting appropriately on the table, i.e., $\phi(a \cdot T) = \phi(T)\ \forall a \in S_n \times S_m$.

HYTREL (the function $g$ via hypergraph modeling, through Eqns (1-5)), on the other hand, models invariances of the associated incidence matrix to the product group $S_{mn} \times S_{m+n+1}$ (proof presented in the appendix). To make the connection between the two, we present the maximal invariance property of our proposed HYTREL model.

Theorem 2.2. A continuous function $\phi : T \mapsto z \in \mathbb{R}^d$ over tables is maximally invariant when modeled as a function $g : B \mapsto y \in \mathbb{R}^k$ over the incidence matrix of a hypergraph G constructed per Section 2.1 (where $g$ is defined via Eqns (1-5)), if there exists a bijective map between the space of tables and incidence matrices (defined over appropriate sizes of tables and incidence matrices). That is, $\phi(T_1) = \phi(T_2)$ iff $T_2$ is some combination of row and/or column permutations of $T_1$ and $g(B_1) = g(B_2)$, where $B_1, B_2$ are the corresponding (hypergraph) incidence matrices of tables $T_1, T_2$.

Proof Sketch: The detailed proof is provided in Appendix A.2. The proof uses Lemma 1 from [Tyshkevich and Zverovich, 1996] and applies the Weisfeiler-Lehman test of isomorphism over the star expansion graphs of the hypergraphs. As a consequence of Theorem 2.2, two tables identical up to permutations will obtain the same representation, which has been shown to improve generalization performance [Lyle et al., 2020].
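As a quick numerical sanity check of this invariance claim, the snippet below reuses the `table_to_incidence` and `HyperTransLayer` sketches from Sections 2.1 and 2.2 (both illustrative, not the released code): relabeling the nodes and hyperedges of the incidence matrix leaves a permutation-invariant readout of the layer unchanged up to floating-point error.

```python
import torch

torch.manual_seed(0)
n, m, d = 3, 4, 16                               # rows, columns, hidden size
B = torch.tensor(table_to_incidence(n, m), dtype=torch.float32)
X = torch.randn(n * m, d)                        # node (cell) states
S = torch.randn(m + n + 1, d)                    # hyperedge states
layer = HyperTransLayer(d)

p_v = torch.randperm(n * m)                      # node relabeling, an element of S_{mn}
p_e = torch.randperm(m + n + 1)                  # hyperedge relabeling, an element of S_{m+n+1}

_, S_out = layer(X, S, B)
_, S_out_perm = layer(X[p_v], S[p_e], B[p_v][:, p_e])

# The mean over hyperedge states (a permutation-invariant readout) is unchanged.
assert torch.allclose(S_out.mean(dim=0), S_out_perm.mean(dim=0), atol=1e-5)
```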
2.4 Pretraining Heads

ELECTRA Head: In the ELECTRA pretraining setting, we first corrupt a part of the cells and headers of a table and then predict whether a given cell or header has been corrupted or not [Iida et al., 2021]. Cross-entropy loss is used together with the binary classification head.

Contrastive Head: In the contrastive pretraining setting, we randomly corrupt a table-transformed hypergraph by masking a portion of the connections between nodes and hyperedges, as inspired by hypergraph contrastive learning [Wei et al., 2022]. For each hypergraph, we create two corrupted (augmented) views and use them as the positive pair, and use the remaining in-batch pairs as negative pairs. We then contrast the table and column representations from the corresponding hyperedges. The InfoNCE [van den Oord et al., 2018] objective is used for optimization, as in Eq. (8):

$$\mathrm{loss} = -\log \frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i/\tau)} \qquad (8)$$

where $(q, k_{+})$ is the positive pair and $\tau$ is a temperature hyperparameter.

3 Experiments

3.1 Pre-training

Data: In line with previous TaLMs [Yin et al., 2020, Iida et al., 2021], we use tables from Wikipedia and Common Crawl for pretraining. We utilize the preprocessing tools provided by Yin et al. [2020] and collect a total of 27 million tables (1% are sampled and used for validation).3 During pretraining, we truncate large tables and retain a maximum of 30 rows and 20 columns for each table, with a maximum of 64 tokens for captions, column names, and cell values. It is important to note that the truncation is solely for efficiency purposes and does not affect HYTREL's ability to deal with large tables, as elaborated in Appendix B.1.

Settings: With the ELECTRA pretraining objective, we randomly replace 15% of the cells or headers of an input table with values that are sampled from all the pretraining tables based on their frequency, as recommended by Iida et al. [2021]. With the contrastive pretraining objective, we corrupt 30% of the connections between nodes and hyperedges for each table to create one augmented view. The temperature τ is set to 0.007. For both objectives, we pretrain the HYTREL models for 5 epochs. More details can be found in Appendix C.1.

3.2 Fine-tuning4

After pretraining, we use the HYTREL model as a table encoder and fine-tune it on downstream table-related tasks. In order to demonstrate that our model does not heavily rely on pretraining or on previous pretrained language models, we also fine-tune a randomly initialized HYTREL model for comparison. In this section, we introduce the evaluation tasks and the datasets. We choose the following four tasks that rely solely on the table representations, since we want to test the task-agnostic representation power of our model and avoid training separate encoders for texts (e.g., questions in table QA tasks) or decoders for generation. As mentioned, our encoder can be used in all these scenarios, and we leave its evaluation on other table-related tasks as future work.

Column Type Annotation (CTA) aims to annotate the semantic types of a column and is an important task in table understanding that can help many knowledge discovery tasks such as entity recognition and entity linking. We use the column representations (i.e., their corresponding hyperedge representations) from the final layer of HYTREL for making predictions. We evaluate HYTREL on the TURL-CTA dataset constructed by Deng et al. [2020].

Column Property Annotation (CPA) aims to map column pairs from a table to relations in knowledge graphs. It is an important task aimed at extracting structured knowledge from tables. We use the TURL-CPA dataset constructed by Deng et al. [2020] for evaluation.
Table Type Detection (TTD) aims to annotate the semantic type of a table based on its content. We construct a dataset using a subset of the public WDC Schema.org Table Corpus.

Table Similarity Prediction (TSP) aims at predicting the semantic similarity between tables and then classifying a table pair as similar or dissimilar. We use the PMC dataset proposed by Habibi et al. [2020] for evaluation.

3.3 Baselines

TaBERT [Yin et al., 2020] is a representative TaLM that flattens tables into sequences and jointly learns representations for sentences and tables by pretraining the model from BERT checkpoints. K=1 and K=3 are its two variants, based on the number of rows used.

TURL [Deng et al., 2020] is another representative TaLM that also flattens tables into sequences and pretrains from TinyBERT [Jiao et al., 2020] checkpoints. It introduces a visibility matrix to incorporate table structure into the representations.

Doduo [Suhara et al., 2022] is a state-of-the-art column annotation system that fine-tunes BERT and uses table serialization to incorporate table content.

3As the version of Wikipedia used by [Yin et al., 2020] is no longer available, we use an updated version, so we collect slightly more tables than previous TaLMs.
4More details about the experimental settings, the datasets, and the baselines can be found in Appendix C.2.

| Systems | Column Type Annotation (P / R / F1) | Column Property Annotation (P / R / F1) |
| --- | --- | --- |
| Sherlock | 88.40 / 70.55 / 78.47 | - |
| BERTbase | - | 91.18 / 90.69 / 90.94 |
| TURL + metadata | 92.75 / 92.63 / 92.69 | 92.90 / 93.80 / 93.35 |
| Doduo + metadata | 93.25 / 92.34 / 92.79 | 91.20 / 94.50 / 92.82 |
| TaBERTbase (K=1) | 91.40±0.06 / 89.49±0.21 / 90.43±0.11 | 92.31±0.24 / 90.42±0.53 / 91.36±0.30 |
| — w/o Pretrain | 90.00±0.14 / 85.50±0.09 / 87.70±0.10 | 89.74±0.40 / 68.74±0.93 / 77.84±0.64 |
| TaBERTbase (K=3) | 91.63±0.21 / 91.12±0.25 / 91.37±0.08 | 92.49±0.18 / 92.49±0.22 / 92.49±0.10 |
| — w/o Pretrain | 90.77±0.11 / 87.23±0.22 / 88.97±0.12 | 90.10±0.17 / 84.83±0.89 / 87.38±0.48 |
| HYTREL w/o Pretrain | 92.92±0.11 / 92.50±0.10 / 92.71±0.08 | 92.85±0.35 / 91.50±0.54 / 92.17±0.38 |
| HYTREL w/ ELECTRA | 92.85±0.21 / 94.21±0.09 / 93.53±0.10 | 92.88±0.24 / 94.07±0.27 / 93.48±0.12 |
| HYTREL w/ Contrastive | 92.71±0.20 / 93.24±0.08 / 92.97±0.13 | 93.01±0.40 / 93.16±0.40 / 93.09±0.17 |

Table 1: Test results on the CTA and CPA tasks (Precision / Recall / F1 scores, %). The results of TaBERT and HYTREL are the average of 5 system runs with different random seeds (± standard deviation). For fair comparisons, we use the results of TURL and Doduo with metadata, i.e., captions and headers.

| Systems | TTD: Accuracy | TSP: Precision / Recall / F1 | TSP: Accuracy |
| --- | --- | --- | --- |
| TFIDF+Glove+MLP | - | 87.36 / 83.81 / 84.47 | 85.06 |
| TabSim | - | 88.65 / 85.45 / 86.13 | 87.05 |
| TaBERTbase (K=1) | 93.11±0.31 | 87.04±0.64 / 85.34±0.93 / 86.18±1.13 | 87.35±1.42 |
| — w/o Pretrain | 85.04±0.41 | 33.61±12.70 / 50.31±12.75 / 40.30±12.03 | 63.45±10.11 |
| TaBERTbase (K=3) | 95.15±0.14 | 87.76±0.64 / 86.97±0.59 / 87.36±0.95 | 88.29±0.98 |
| — w/o Pretrain | 89.88±0.26 | 82.96±1.84 / 81.16±1.45 / 82.05±1.02 | 82.57±1.20 |
| HYTREL w/o Pretrain | 93.84±0.17 | 88.94±1.83 / 85.72±1.52 / 87.30±1.02 | 88.38±1.43 |
| HYTREL w/ ELECTRA | 95.81±0.19 | 87.35±0.42 / 87.29±0.84 / 87.32±0.50 | 88.29±0.49 |
| HYTREL w/ Contrastive | 94.52±0.30 | 89.41±0.58 / 89.10±0.90 / 89.26±0.53 | 90.12±0.49 |

Table 2: Test results on the TTD (accuracy, %) and TSP (Precision / Recall / F1 scores and accuracy, %) tasks. The results of TaBERT and HYTREL are the average of 5 system runs with different random seeds (± standard deviation).

3.4 Main Results

The results are presented in Tables 1 and 2.
Overall, HYTREL consistently outperforms the baselines and achieves superior performance. A salient observation is that our model, even without pretraining, can achieve close to state-of-the-art performance. In comparison, the performance of TaBERT slumps significantly without pretraining. This phenomenon empirically demonstrates the advantages of modeling table structures as hypergraphs over the other methods we compare against. Additionally, we observe that the two pretraining objectives help different tasks in different ways. For the CTA, CPA, and TTD tasks, both objectives help HYTREL further improve its performance, and in general the ELECTRA objective performs better than the contrastive objective. These results are in line with the representation analysis in Section 4.2, where we observe that the ELECTRA objective tends to learn table structure better than the contrastive objective. However, for the TSP task, we observe that the contrastive objective helps the HYTREL model while the ELECTRA objective fails to bring any improvement. One possible reason for the ineffectiveness of the ELECTRA objective could be its inability to transfer well across domains: HYTREL pretrained with tables from Wikipedia and Common Crawl may not transfer well to the medical-domain PMC dataset. As for the improvement observed from the contrastive objective, the reason could be that contrastive learning, which uses similarity metrics in the objective function, naturally helps the similarity prediction task.

Scalability: As stated in Section 3, we truncate large tables during pretraining. However, this truncation does not hinder the ability of HYTREL to handle large table inputs in downstream tasks. In Appendix B.1, we present additional experiments demonstrating that (a) HYTREL can effectively process tables of any size as inputs, and (b) down-sampling can be a favorable strategy when working with large input tables, significantly improving efficiency without compromising performance.

4 Qualitative Analysis

4.1 HYTREL Learns Permutation-Robust Representations

Figure 3: Average distance (L2 norm) between table element representations before and after row, column, and combined permutations, for TaBERT (K=1), TaBERT (K=3), and HYTREL. HYTREL is immune to the permutations while TaBERT is sensitive to them.

We sample 5,000 tables from the validation data for analysis. We analyze the impact of applying different permutations to a table: permuting rows, permuting columns, and permuting both rows and columns independently. For this analysis, we measure the Euclidean distance (L2 norm) between the representations (of cells, rows, columns, and tables) before and after permutation. As shown in Figure 3, the distance is almost always 0 for the HYTREL model because of its explicit invariance-preserving property. For the TaBERT model, on the other hand, the distance is non-trivial. We observe that when more rows are used (K=3), the L2 norm increases as we introduce different permutations. Moreover, permuting the columns has a greater impact on the L2 norm than permuting the rows, and a combination of row and column permutations has the largest impact among all three actions. Note that with K=1 (only one row), row permutations have no effect for TaBERT.
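A minimal sketch of the measurement protocol behind Figure 3 follows. `encode_table` stands for any function that maps a table to a fixed-size vector (e.g., the table-hyperedge state from a HYTREL forward pass); the toy bag-of-cells encoder at the end is there only to make the snippet runnable and to show the zero-distance behaviour expected of a permutation-invariant encoder — it is not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def permute(table: np.ndarray, rows: bool, cols: bool) -> np.ndarray:
    """Independently permute the rows and/or columns of a table."""
    out = table
    if rows:
        out = out[rng.permutation(out.shape[0]), :]
    if cols:
        out = out[:, rng.permutation(out.shape[1])]
    return out

def l2_shift(encode_table, table: np.ndarray, rows: bool, cols: bool) -> float:
    """L2 distance between representations before and after permutation."""
    return float(np.linalg.norm(encode_table(table) - encode_table(permute(table, rows, cols))))

# Toy permutation-invariant encoder: average a per-cell feature over all cells.
def bag_of_cells(table: np.ndarray) -> np.ndarray:
    feats = [np.array([len(str(c)), sum(map(ord, str(c)))], dtype=float) for c in table.ravel()]
    return np.mean(feats, axis=0)

toy = np.array([["Barcelona", "ESP", 1899], ["Real Madrid", "ESP", 1902]], dtype=object)
print(l2_shift(bag_of_cells, toy, rows=True, cols=True))   # -> 0.0
```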
4.2 HYTREL Learns the Underlying Hierarchical Table Structure

Next, we demonstrate that the HYTREL model has incorporated the hierarchical table structure into its representations, as intended. We use t-SNE [Van der Maaten and Hinton, 2008] to visualize the representations of different table elements from the same 5,000 validation tables, as shown in Figure 4.

Figure 4: t-SNE visualization of the representations learned by HYTREL. "tab", "col", "row", and "cell" are the representations of the different table elements: tables, columns, rows, and cells.

We observe that with random initializations, the different table elements cannot be distinguished properly. After encoding by the randomly initialized HYTREL model, we start to observe noticeable differences between the table elements in the visualization space. Notably, the individual cell representations start to concentrate together and can be distinguished from the higher-level table elements (tables, columns, and rows), which occupy their own separate regions of the representation space. We also notice that, by pretraining HYTREL with the ELECTRA objective, all table elements can be well separated, showing that it incorporates the hierarchical table structure into its latent representations. As for the contrastive pretraining, it can distinguish columns from rows compared with the randomly initialized HYTREL, but it does not separate the table representations as well as the ELECTRA pretraining. This also partially explains the better performance of the ELECTRA pretraining on the CTA, CPA, and TTD tasks in contrast to the contrastive pretraining.

4.3 HYTREL Demonstrates Effective Pretraining by Capturing the Table Structures

Figure 5: Performance (F1 score / accuracy, %) with different pretraining checkpoints (0 to 10 epochs) on the validation sets of the four tasks, for HyTrel (ELECTRA), HyTrel (Contrastive), TaBERT (K=3), and TaBERT (K=1). For the TaBERT models, we can only access the randomly initialized and fully pretrained (10 epochs) checkpoints. All results are from 5 system runs with different random seeds.

Our evaluation shows that the HYTREL model can perform well without pretraining, which demonstrates its training efficiency from modeling table structures. Here we further analyze how much pretraining is required for HYTREL to further enhance its performance, as compared with the baseline model. We plot the validation performance of the tasks evaluated at different pretraining checkpoints in Figure 5. Overall, we observe that the performance of HYTREL improves drastically during the first several pretraining epochs and then saturates at about 5 epochs. With this minimal pretraining, HYTREL can outperform the fully pretrained TaBERT models. This demonstrates that our model does not require extensive pretraining to further improve its performance, in contrast with previous TaLMs (e.g., TURL for 100 epochs, TaBERT for 10 epochs). Besides, we also observe from the curves that the ELECTRA objective consistently outperforms the contrastive objective for the CTA, CPA, and TTD tasks, but under-performs on the TSP task. Also, the ELECTRA objective has a negative impact on the TSP task when pretrained for a longer duration, which is in line with our previous findings.

5 Related Work

There are two avenues of research on tabular data representation learning.
The first group of studies focuses on predicting labels (essentially one row) for classification and regression problems, using row information and column schema as input [Huang et al., 2020, Arik and Pfister, 2021, Somepalli et al., 2021, Gorishniy et al., 2021, Grinsztajn et al., 2022, Wang and Sun, 2022, Du et al., 2022, Wydmanski et al., 2023, Liu et al., 2023]. These studies use gradient-descent-based end-to-end learning and aim to outperform tree-based models through task-specific model pretraining and fine-tuning. The second group of studies proposes TaLMs to retrieve task-agnostic tabular data representations for different downstream table understanding tasks. Drawing inspiration from textual language models like BERT [Devlin et al., 2019], many works [Herzig et al., 2020, Yin et al., 2020, Deng et al., 2020, Iida et al., 2021, Eisenschlos et al., 2021] serialize tables to a sequence of tokens, leveraging existing checkpoints and textual self-supervised objectives. However, the representations of tables can be learned not only from the table content but also by utilizing the table structures, similar to other forms of semi-structured data like code and HTML. Some contemporary works have noticed the importance of the table structures and introduce techniques to learn certain aspects of them, such as masking [Deng et al., 2020], coordinates [Wang et al., 2021a, Dash et al., 2022], and attention bias [Yang et al., 2022]. Our work belongs to this second group of studies, and we propose to use hypergraphs to comprehensively model the rich table structures; this is close to previous graph-based neural networks [Mueller et al., 2019, Wang et al., 2021b,c] where tables have been structured as graphs to incorporate row and column order information.

Table representation learning that focuses on joint text and table understanding is a separate field of research that partially overlaps with our work. Some works [Herzig et al., 2020, Shi et al., 2022, Herzig et al., 2021, Glass et al., 2021, Yang et al., 2022] specialize in question answering (QA) on tables: they jointly train a model that takes the question and the table structure as input together, allowing the pretraining to attend to the interactions of both texts and tables and boosting table-based QA tasks. Another branch of joint text and table understanding work focuses on text generation from tables [Parikh et al., 2020, Yoran et al., 2021, Wang et al., 2022, Andrejczuk et al., 2022], relying on an encoder-decoder model like T5 [Raffel et al., 2020] that can encode tables and decode free-form texts. In contrast to these studies, our work centers on the importance of structures in tables for table representation only, without extra text encoding or decoding.

Learning on hypergraphs has gone through a series of evolutions [Agarwal et al., 2006, Yadati et al., 2019, Arya et al., 2020] in the way the hypergraph structure is modeled using neural network layers. However, many of these approaches collapse the hypergraph into a fully connected graph by clique expansion and cannot preserve the high-order interactions among the nodes. The recent development of permutation-invariant networks [Zaheer et al., 2017, Lee et al., 2019] has enabled high-order interactions on hypergraphs [Chien et al., 2022], using parameterized multiset functions to model the dual relations from nodes to hyperedges and vice versa.
Closely related to the latest advancement, our HYTREL model adopts a similar neural message passing on hypergraphs to preserve the invariance property and high-order interactions of tables. 6 Limitations The proposed HYTREL is a table encoder, and by itself cannot handle joint text and table understanding tasks like table QA and table-to-text generation. While it s possible to add text encoders or decoders for these tasks, it can potentially introduce additional factors that may complicate evaluating our hypothesis about the usefulness of modeling structural table properties. Moreover, the current model structure is designed for tables with simple column structures, like prior Ta LMs, and cannot handle tables with complex hierarchical column structures. Additionally, HYTREL does not consider cross-table relations. Although we believe the hypergraph can generalize to model complicated column structures and cross-table interactions, we leave these aspects for future research. 7 Conclusion In this work, we propose a tabular language model HYTREL that models tables as hypergraphs. It can incorporate the permutation invariances and table structures into the table representations. The evaluation on four table-related tasks demonstrates the advantages of learning these table properties and show that it can consistently achieve superior performance over the competing baselines. Our theoretical and qualitative analyses also support the effectiveness of learning the structural properties. Pengcheng Yin, Graham Neubig, Wen tau Yih, and Sebastian Riedel. Ta BERT: Pretraining for joint understanding of textual and tabular data. In Annual Conference of the Association for Computational Linguistics (ACL), July 2020. Jingfeng Yang, Aditya Gupta, Shyam Upadhyay, Luheng He, Rahul Goel, and Shachi Paul. Table Former: Robust transformer modeling for table-text encoding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 528 537, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.40. URL https://aclanthology.org/2022.acl-long.40. Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. Ta Pas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320 4333, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.398. URL https://aclanthology.org/2020.acl-main.398. Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. Turl: table understanding through representation learning. Proceedings of the VLDB Endowment, 14(3):307 319, 2020. Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. TABBIE: Pretrained representations of tabular data. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3446 3456, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main. 270. URL https://aclanthology.org/2021.naacl-main.270. Zhiruo Wang, Haoyu Dong, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, and Dongmei Zhang. Tuta: Tree-based transformers for generally structured table pre-training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD 21, page 1780 1790, New York, NY, USA, 2021a. Association for Computing Machinery. ISBN 9781450383325. doi: 10.1145/3447548.3467434. 
URL https://doi.org/10.1145/3447548.3467434. Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pretraining text encoders as discriminators rather than generators. In ICLR, 2020. URL https: //openreview.net/pdf?id=r1x MH1Btv B. Tianxin Wei, Yuning You, Tianlong Chen, Yang Shen, Jingrui He, and Zhangyang Wang. Augmentations in hypergraph contrastive learning: Fabricated and generative. ar Xiv preprint ar Xiv:2210.03801, 2022. Eli Chien, Chao Pan, Jianhao Peng, and Olgica Milenkovic. You are allset: A multiset function framework for hypergraph neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=hp BTIv2uy_E. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings. neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. Ar Xiv, abs/1607.06450, 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770 778, 2016. doi: 10.1109/CVPR.2016.90. Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/ 2017/file/f22e4747da1aa27e363d86d40ff442fe-Paper.pdf. Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 3744 3753, 2019. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171 4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https: //aclanthology.org/N19-1423. RI Tyshkevich and Vadim E Zverovich. Line hypergraphs. Discrete Mathematics, 161(1-3):265 283, 1996. Clare Lyle, Mark van der Wilk, Marta Kwiatkowska, Yarin Gal, and Benjamin Bloem-Reddy. On the benefits of invariance in neural networks. ar Xiv preprint ar Xiv:2005.00178, 2020. Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. Co RR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748. Maryam Habibi, Johannes Starlinger, and Ulf Leser. Tabsim: A siamese neural network for accurate estimation of table similarity. pages 930 937, 12 2020. doi: 10.1109/Big Data50022.2020.9378077. Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tiny BERT: Distilling BERT for natural language understanding. 
In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163 4174, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.372. URL https:// aclanthology.org/2020.findings-emnlp.372. Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, Ça gatay Demiralp, Chen Chen, and Wang Chiew Tan. Annotating Columns with Pre-trained Language Models. Association for Computing Machinery, 2022. ISBN 9781450392495. URL https://doi.org/10.1145/3514221. 3517906. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. Tabtransformer: Tabular data modeling using contextual embeddings. ar Xiv preprint ar Xiv:2012.06678, 2020. Sercan Ö Arik and Tomas Pfister. Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6679 6687, 2021. Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C Bayan Bruss, and Tom Goldstein. Saint: Improved neural networks for tabular data via row attention and contrastive pre-training. ar Xiv preprint ar Xiv:2106.01342, 2021. Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. In Neur IPS, 2021. Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id= Fp7__ph Qszn. Zifeng Wang and Jimeng Sun. Transtab: Learning transferable tabular transformers across tables. In Advances in Neural Information Processing Systems, 2022. Kounianhua Du, Weinan Zhang, Ruiwen Zhou, Yangkun Wang, Xilong Zhao, Jiarui Jin, Quan Gan, Zheng Zhang, and David Paul Wipf. Learning enhanced representations for tabular data via neighborhood propagation. In Neur IPS 2022, 2022. URL https://www.amazon.science/publications/ learning-enhanced-representations-for-tabular-data-via-neighborhood-propagation. Witold Wydma nski, Oleksii Bulenok, and Marek Smieja. Hypertab: Hypernetwork approach for deep learning on small tabular datasets. ar Xiv preprint ar Xiv:2304.03543, 2023. Tennison Liu, Jeroen Berrevoets, Zhaozhi Qian, and Mihaela Van Der Schaar. Learning representations without compositional assumptions. In Proceedings of the 40th International Conference on Machine Learning, ICML 23. JMLR.org, 2023. Julian Eisenschlos, Maharshi Gor, Thomas Müller, and William Cohen. MATE: Multi-view attention for table transformer efficiency. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7606 7619, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main. 600. URL https://aclanthology.org/2021.emnlp-main.600. Sarthak Dash, Sugato Bagchi, Nandana Mihindukulasooriya, and Alfio Gliozzo. Permutation invariant strategy using transformer encoders for table understanding. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 788 800, 2022. Thomas Mueller, Francesco Piccinno, Peter Shaw, Massimo Nicosia, and Yasemin Altun. Answering conversational questions on structured data without logical forms. 
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5902 5910, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1603. URL https://aclanthology.org/D19-1603. Fei Wang, Kexuan Sun, Muhao Chen, Jay Pujara, and Pedro Szekely. Retrieving complex tables with multi-granular graph representation learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 21, page 1472 1482, New York, NY, USA, 2021b. Association for Computing Machinery. ISBN 9781450380379. doi: 10.1145/3404835.3462909. URL https://doi.org/10.1145/3404835.3462909. Daheng Wang, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Xin Luna Dong, and Meng Jiang. Tcn: Table convolutional network for web table interpretation. In Proceedings of the Web Conference 2021, pages 4020 4032, 2021c. Peng Shi, Patrick Ng, Feng Nan, Henghui Zhu, Jun Wang, Jiarong Jiang, Alexander Hanbo Li, Rishav Chakravarti, Donald Weidner, Bing Xiang, et al. Generation-focused table-based intermediate pre-training for free-form question answering. 2022. Jonathan Herzig, Thomas Müller, Syrine Krichene, and Julian Martin Eisenschlos. Open domain question answering over tables via dense retrieval. ar Xiv preprint ar Xiv:2103.12011, 2021. Michael Glass, Mustafa Canim, Alfio Gliozzo, Saneem Chemmengath, Vishwajeet Kumar, Rishav Chakravarti, Avi Sil, Feifei Pan, Samarth Bharadwaj, and Nicolas Rodolfo Fauceglia. Capturing row and column semantics in transformer based question answering over tables. ar Xiv preprint ar Xiv:2104.08303, 2021. Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. Totto: A controlled table-to-text generation dataset. ar Xiv preprint ar Xiv:2004.14373, 2020. Ori Yoran, Alon Talmor, and Jonathan Berant. Turning tables: Generating examples from semi-structured tables for endowing language models with reasoning skills. ar Xiv preprint ar Xiv:2107.07261, 2021. Fei Wang, Zhewei Xu, Pedro Szekely, and Muhao Chen. Robust (controlled) table-to-text generation with structure-aware equivariance learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5037 5048, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.371. URL https://aclanthology.org/2022. naacl-main.371. Ewa Andrejczuk, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, and Yasemin Altun. Table-to-text generation and pre-training with tabt5. ar Xiv preprint ar Xiv:2210.09162, 2022. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1 67, 2020. URL http://jmlr.org/papers/v21/20-074.html. Sameer Agarwal, Kristin Branson, and Serge Belongie. Higher order learning with graphs. In Proceedings of the 23rd international conference on Machine learning, pages 17 24, 2006. Naganand Yadati, Madhav Nimishakavi, Prateek Yadav, Vikram Nitin, Anand Louis, and Partha Talukdar. Hypergcn: A new method for training graph convolutional networks on hypergraphs. Advances in neural information processing systems, 32, 2019. 
Devanshu Arya, Deepak K Gupta, Stevan Rudinac, and Marcel Worring. Hypersage: Generalizing inductive representation learning on hypergraphs. ar Xiv preprint ar Xiv:2010.04558, 2020. László Babai, Paul Erdo s, and Stanley M. Selkow. Random graph isomorphism. SIAM Journal on Computing, 9(3):628 635, 1980. doi: 10.1137/0209047. URL https://doi.org/10.1137/ 0209047. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? ar Xiv preprint ar Xiv:1810.00826, 2018. Balasubramaniam Srinivasan, Da Zheng, and George Karypis. Learning over families of setshypergraph representation learning for higher order tasks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 756 764. SIAM, 2021. Mohamed Trabelsi, Zhiyu Chen, Shuo Zhang, Brian D Davison, and Jeff Heflin. Stru BERT: Structureaware bert for table search and matching. In Proceedings of the ACM Web Conference, WWW 22, 2022. Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Ça gatay Demiralp, and César Hidalgo. Sherlock: A deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glo Ve: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532 1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162. A Details of the Proofs A.1 Preliminaries Definition A.1 (Hypergraph). Let H = (V, E, X, E) denote a hypergraph H with a finite vertex set V = {v1, . . . , vn}, corresponding vertex features with X Rn d; d > 0, a finite hyperedge set E = {e1, . . . , em}, where E P (V ) \ { } and m S i=1 ei = V , and P (V ) denotes the power set on the vertices, the corresponding hyperedge features E Rm d; d > 0. The hypergraph connections can be conveniently represented as an incidence matrix B {0, 1}n m, where Bij = 1 when node i belong to hyperedge j and Bij = 0 otherwise. Definition A.2 (1-Wesifeiler Lehman Test). Let G = (V, E) be a graph, with a finite vertex set V and let s : V be a node coloring function with an arbitrary co-domain and s(v), v V denotes the color of a node in the graph. Correspondingly, we say a labeled graph (G, s) is a graph G with a complete node coloring s : V . The 1-WL Algorithm [Babai et al., 1980] can then be described as follows: let (G, s) be a labeled graph and in each iteration, t 0, the 1-WL computes a node coloring c(t) s : V , which depends on the coloring from the previous iteration. The coloring of a node in iteration t > 0 is then computed as c(t) s (v) = HASH c(t 1) s (v), n c(t 1) s (u) | u N(v) o where HASH is bijective map between the vertex set and , and N(v) denotes the 1-hop neighborhood of node v in the graph. The 1-WL algorithm terminates if the number of colors across two iterations does not change, i.e., the cardinalities of the images of c(t 1) s and c(t) s are equal. The 1-WL isomorphism test, is an isomorphism test, where the 1-WL algorithm is run in parallel on two graphs G1, G2 and the two graphs are deemed non-isomorphic if a different number of nodes are colored as κ in . Definition A.3 (Group). A group is a set G equipped with a binary operation : G G G obeying the following axioms: for all g1, g2 G, g1 g2 G (closure). 
for all g1, g2, g3 G, g1 (g2 g3) = (g1 g2) g3 (associativity). there is a unique e G such that e g = g e = g for all g G (identity). for all g G there exists g 1 G such that g g 1 = g 1 g = e (inverse). Definition A.4 (Permutation Group). A permutation group (a.k.a. finite symmetric group) Sm is a discrete group G defined over a finite set of size m symbols (w.l.o.g. we shall consider the set {1, 2, . . . , m}) and consists of all the permutation operations that can be performed on the m symbols. Since the total number of such permutation operations is m! the order of Sm is m!. Definition A.5 (Product Group). Given two groups G (with operation +) and H (with operation ), the direct product group of G H is defined as follows: The underlying set is the Cartesian product, G H. That is, the ordered pairs (g, h), where g G and h H The binary operation on G H is defined component-wise: (g1, h1) (g2, h2) = (g1+g2, h1 h2) Definition A.6 (Group invariant functions). Let G be a group acting on vector space V . We say that a function f : V R is G-invariant if f(g x) = f(x) x V, g G. Definition A.7 (Orbit of an element). For an element x X and given a group G, the orbit of the element x is given by Orb(x) = {g x | g G}. In the section, we restate (for the sake of convenience) and prove the statements in the paper. First, we restate the assumption and the corresponding property and then provide the proof. Assumption A.8. For any table (Tij)i [n],j [m] (where i, j are indexes of the rows, and columns appropriately), an arbitrary group action g Sn Sm acting appropriately on the rows and columns leaves the target random variables associated with tasks on the entire table unchanged. Property 2. A function ϕ : T 7 z Rd which satisfies Assumption A.8 and defined over tabular data must be invariant to actions from the (direct) product group Sn Sm acting appropriately on the table i.e. ϕ(g T ) = ϕ(T ) g Sn Sm. Proof. Proof by contradiction. Let y R be some target value associated with the entire table which is invariant up to permutations of the table given by Assumption A.8. Assume the case where ϕ is not an invariant function acting over tables i.e. ϕ(g T ) = ϕ(T ) g Sm Sn. This means different group actions will result in different output representations-and hence the representation is dependent on the action itself. When d = 1 and the value associated with y is given by ϕ acting over the table, equality between the two (for all permutation actions of a given) is only possible if we can unlearn the group action (or learn the same representation for the entire orbit) which would make it invariant and hence a contradiction. Next, we look at the function g defined via Eqns (1-5) to see that the representation obtained by a table via message passing is invariant to row/column permutations of the incidence matrix B. Eqn 1, which is the Node2hyperedge block, is a permutation invariant set aggregator which aggregates information from cells in the table to get the vectorial representation of a row/column/table hyperedge. That is, any permutation action drawn from the Smn group doesn t impact the representation of a given hyperedge as long as the membership is consistent. Similarly, the Hyperedge2Node block (Eqn 2) is an aggregation function that aggregates information from the hyperedges to the nodes. This ensures that a given node s (in the hypergraph) representation is invariant to permutation actions from the Sm+n+1 group as long as the membership is consistent. 
Now, since the Node2Hyperedge and Hyperedge2Node blocks act sequentially, its easy to see that (1-5) jointly encode invariance to the product group Smn Sm+n+1 when we obtain representations of an incidence matrix as a set (permutation invariant) aggregation of representations of all the mn nodes and the m + n + 1 hyperedges. Next, we restate and prove the theorem about the maximal invariance property. Theorem A.9. A continuous function ϕ : T 7 z Rd over tables is maximally invariant if it is modeled as a function g : B 7 y Rk over the incidence matrix of a hypergraph G constructed per Section 2.1 (Where g is defined via Eqns (1-5)) if a bijective map between the space of tables and incidence matrices (defined over appropriate sizes of tables and incidence matrices). That is, ϕ(T1) = ϕ(T2) iff T2 is some combination of row and/or column permutation of T1 and g(B1) = g(B2) where B1, B2 are the corresponding (hypergraph) incidence matrices of tables T1, T2. Proof. Proof by construction. We will assume the existence of a bijective map between the space of tables and incidence matrices (with appropriate indexing by column names and some canonical row ordering). Without loss of generality, we assume the number of rows and columns in the table to be n and m respectively, and some canonical ordering of the rows and columns in the table for the sake of convenience. Hence the number of nodes, and hyperedges in the hypergraph based on the formulation in Section 2.1 is mn and m + n + 1 respectively. An invariant function over the incidence matrices means that function is invariant to any consistent relabelling of nodes and hyperedges in the hypergraph hypergraph incidence matrix of size mn (m + n + 1) and hence (upon node/hyperedge relabeling) invariant to the Smn Sm+n+1 group. It is important to note that any row/column permutation of the table affects the ordering of both the rows and columns of the incidence matrix. The existence of a bijective map between ϕ and g guarantees that we know exactly which elements (cells, rows, columns, table) participate in which nodes/hyperedges and which don t. This is evident as a different permutation of the rows/columns would result in different incidence matrices and hence the above information can be deconstructed by obtaining the mappings between all permutations of a table and corresponding incidence matrices. Now, as this is proof of maximal invariance we have to prove both directions i.e. when the function ϕ maps two tables to the same representation-the tables are just row/column permutations of each other, and the g over their corresponding incidence matrices yield identical representations and vice versa. Direction 1: Assume a continuous function ϕ over two tables T1, T2 yields the same representation ϕ(T1) = ϕ(T2). Now, we need to show that the function g over the corresponding incidence matrices B1, B2 yield the same representation. Let ρ be the bijective map between the space of tables and incidence matrices. The guarantee of a bijective map between tables and incidence matrices and the injective (one-to-one) nature of All Set [Chien et al., 2022] which is a multiset function that ensures that the hyperedges with the same elements get the same representation and non-identical hyperedges obtain different representations (via Eqns(1-5). Hence, when the outputs of ϕ match, the outputs of g also do - and this is possible when the tables are identical up to permutations of each other. 
The universality property of AllSet, combined with the existence of the bijective map, ensures the existence of a neural network $g$ for a given continuous function $\phi$.

Direction 2: Assume a function $g$ that maps the incidence matrices $B_1, B_2$ of two hypergraphs to the same representation. Again, since the bijective map exists, we can check whether the same nodes and hyperedges map across the two incidence matrices. This confirms the existence of a bijective map between the families of stars [Tyshkevich and Zverovich, 1996] of the two hypergraphs (i). The 1-Weisfeiler-Lehman message-passing equations of the HyperTrans module (along with the inherited injectivity property of AllSet), applied along the star expansion of the hypergraph, ensure that the star expansions of the two hypergraphs are 1-WL isomorphic [Xu et al., 2018] (ii). Combining (i) and (ii) via Theorem 2 from [Srinivasan et al., 2021], we get that the hypergraphs are isomorphic, and therefore the tables are identical up to permutations and $\phi$ yields the same representation due to Property 1. It is easy to see that when $g$ yields different representations, the tables are not identical, and hence $\phi$ yields a different representation as well.

B Further Analysis of HYTREL

B.1 Effect of Input Table Size

As mentioned in Section 3, we truncate large tables and keep at most 30 rows and 20 columns during pretraining for efficiency. However, this truncation does not affect HYTREL's ability to handle large tables in downstream tasks: our model can process tables of arbitrary size. This differs from the BERT-based TaLMs (e.g., TaBERT, TURL), whose positional encodings in the pretrained checkpoints limit the input sequence length to 512. With HYTREL, we can always build larger hypergraphs for larger tables by adding more nodes under each hyperedge, without sacrificing information.

To demonstrate this further, we run experiments on the table type detection (TTD) dataset, as it has relatively large tables (157 rows on average). Table 3 reports the results of using different maximum row/column limits with HYTREL (ELECTRA); we fine-tune on an NVIDIA A10 Tensor Core GPU (24 GB).

Size Limit (#rows, #columns) | Dev Acc (%) | Test Acc (%) | Training Time (min/epoch) | GPU Memory (GB/sample)
(3, 2) | 78.47 | 78.00 | 3 | 0.14
(15, 10) | 95.36 | 95.96 | 9 | 0.48
(30, 20) | 96.23 | 95.81 | 14 | 0.58
(60, 40) | 96.38 | 95.80 | 23 | 1.41
(120, 80) | 96.02 | 95.71 | 51 | 5.13
(240, 160) | 96.06 | 95.67 | 90 | 16.00
Table 3: Effect of input table size. GPU memory is reported as memory used divided by batch size.

We can see that HYTREL has no difficulty dealing with input tables of arbitrary size. However, once the size limit is raised beyond a certain point, the prediction performance stays almost the same while the time and hardware resource consumption increase dramatically. We therefore recommend downsampling large tables, which is much more efficient without hurting performance.
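As an illustration of this recommendation, the sketch below caps a table at a given number of rows and columns before the hypergraph is built; the function name and the sampling strategy (uniform sampling without replacement, keeping the header) are our assumptions and not necessarily the exact preprocessing in the released code.

```python
# Minimal sketch (illustrative, not the released preprocessing): cap a large
# table at max_rows x max_cols before building the hypergraph, by sampling
# data rows and columns without replacement and keeping the header.
import random

def downsample_table(header, rows, max_rows=30, max_cols=20, seed=0):
    """header: list of column names; rows: list of equal-length cell lists."""
    rng = random.Random(seed)
    n_cols = len(header)
    col_idx = sorted(rng.sample(range(n_cols), min(max_cols, n_cols)))
    row_idx = sorted(rng.sample(range(len(rows)), min(max_rows, len(rows))))
    new_header = [header[j] for j in col_idx]
    new_rows = [[rows[i][j] for j in col_idx] for i in row_idx]
    return new_header, new_rows

# Example: a 157-row table (the TTD average) reduced to at most 30 x 20.
header = [f"col{j}" for j in range(25)]
rows = [[f"cell{i}_{j}" for j in range(25)] for i in range(157)]
small_header, small_rows = downsample_table(header, rows)
print(len(small_rows), len(small_header))  # 30 20
```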
B.2 Effect of Excessive Invariance

As we have emphasized, the permutation invariance property of a table requires the permutations of rows and columns to be independent. Arbitrary permutations of the cells, beyond independent row/column permutations, can lead to an excessive-invariance problem and ultimately compromise the table's semantics.

Take Figure 6 as an example: permuting the rows and columns independently results in a semantically identical table (upper), whereas permuting cells arbitrarily breaks the semantics of the table and causes excessive invariance (lower). We use the TaBERT model to demonstrate the effect of excessive invariance. As introduced before, the TaBERT models use a linearization approach with positional encodings to model a table, so they cannot preserve the table invariance property. However, we cannot obtain this property by simply removing the positions, as doing so causes excessive invariance: it is equivalent to shuffling all the cells in a table, whereas we only assume independent row/column permutation invariance. We empirically show the effect of disabling the positional encodings for the TaBERT model on the Table Type Detection dataset. As shown in Table 4, the excessive invariance causes a significant performance decrease.

Figure 6: An example of table permutations (the original table, an independent row/column permutation that preserves the semantics, and an arbitrary cell permutation that breaks the semantics).

Models | Dev Acc (%) | Test Acc (%)
TaBERT (K=1), w/ positions | 94.11 | 93.44
TaBERT (K=1), w/o positions | 88.62 | 88.91
TaBERT (K=3), w/ positions | 95.82 | 95.22
TaBERT (K=3), w/o positions | 92.80 | 92.44
Table 4: Excessive invariance causes a significant performance drop.

B.3 Inference Complexity of the Model

Theoretical Analysis: Given a table with $m$ rows and $n$ columns, assume that each cell contains only one token and that we use dot-product attention with a single head. The inference complexity of one attention layer is compared in Table 5.

Complexity | BERT-based Methods | HYTREL
w/o linear transformation | $O((mn)^2 d)$ | $O(mnd)$
w/ linear transformation | $O((mn)^2 d + (mn) d^2)$ | $O(mnd + (mn) d^2)$
Table 5: Inference complexity.

For the BERT-based methods, the input sequence length is $mn$ and $d$ is the hidden dimension. We can see that HYTREL has lower complexity, and the difference comes from the dot product used to compute the attention scores. The query in the set transformer of HYTREL is a learnable vector of dimension $d$, which reduces the computation of attention from quadratic to linear in the number of elements in the set [Lee et al., 2019], giving a complexity of $O(mnd)$. In standard self-attention, the query is a matrix derived from the whole input sequence, of size $(mn) \times d$, which gives a complexity of $O((mn)^2 d)$ for the attention-score calculation.
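The following minimal PyTorch sketch contrasts the two attention-score computations; it is our own illustration of the single learnable-query attention of Lee et al. [2019], not the HyperTrans implementation itself, and the sizes are arbitrary.

```python
# Minimal sketch contrasting the two attention-score computations for a set
# of N = m*n cell embeddings (our illustration, not the HyTrel code itself).
import torch

N, d = 600, 64                      # e.g., a 30 x 20 table, hidden size 64
cells = torch.randn(N, d)           # cell embeddings (the "set")

# BERT-style self-attention: queries come from all N tokens,
# so the score matrix is N x N  ->  O((mn)^2 * d).
q_full = cells                      # (N, d)
scores_full = q_full @ cells.T      # (N, N)

# Set-transformer-style attention with a single learnable query vector:
# the score matrix is 1 x N  ->  O(mn * d).
q_learn = torch.nn.Parameter(torch.randn(1, d))
scores_set = q_learn @ cells.T      # (1, N)
pooled = torch.softmax(scores_set, dim=-1) @ cells  # (1, d) hyperedge summary

print(scores_full.shape, scores_set.shape)  # [600, 600] vs. [1, 600]
```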
Empirical Analysis: We compare the inference time of our approach with the TaBERT baseline, including data processing (graph-building time for HYTREL). As shown in Table 6, the total inference time of our model is lower than that of TaBERT, and the graph-building time is not a bottleneck.

Datasets | CTA | CPA | TTD | TS
Total samples | 4844 | 1560 | 4500 | 1391
Batch size | 8 | 8 | 8 | 8
TaBERT (K=1) - data preparation | 13 | 5 | 51 | 13
TaBERT (K=1) - inference | 27 | 5 | 25 | 28
TaBERT (K=1) - total inference | 40 | 10 | 76 | 41
TaBERT (K=3) - data preparation | 14 | 5 | 52 | 13
TaBERT (K=3) - inference | 85 | 16 | 79 | 80
TaBERT (K=3) - total inference | 99 | 21 | 131 | 93
HYTREL - graph building | 5 | 1 | 16 | 5
HYTREL - inference | 16 | 6 | 17 | 13
HYTREL - total inference | 21 | 7 | 33 | 18
Table 6: Inference time comparison (seconds).

We keep a maximum of 3 rows for HYTREL for a fair comparison with the TaBERT (K=3) models. The data preparation of TaBERT formats the input tables (.json) into tensors, and the graph building of HYTREL formats the input tables (.json) into feature tensors and the hypergraph incidence matrix. All experiments are conducted on a single A10 GPU, and the inference batch size is 8 for all models and datasets. We use the validation sets of CTA, CPA, and TTD for these experiments. For TS, which has a small number of tables and is evaluated with five-fold cross-validation in the paper, we use the whole dataset.

C Details of the Experiments

C.1 Pretraining

C.1.1 Model Size

The parameter sizes of HYTREL and the baseline TaBERT are listed in Table 7. HYTREL has more parameters mainly because of the Hyperedge Fusion block (Figure 2), which is absent from the TaBERT model.

Models | # Parameters (million)
TaBERT (K=1) | 110.6
TaBERT (K=3) | 131.9
HYTREL | 179.4
Table 7: Model size.

C.1.2 Hyperparameters

The hyperparameter choices and resource consumption details of the pretraining are presented in Table 8 and Table 9.

Models | Batch Size | Learning Rate | Warm-up Ratio | Max Epochs | Weight Decay | Optimizer
HYTREL w/ ELECTRA | 8192 | 5e-4 | 0.05 | 5 | 0.02 | Adam
HYTREL w/ Contrastive | 4096 | 1e-4 | 0.05 | 5 | 0.02 | Adam
Table 8: Pretraining hyperparameters.

Models | GPUs | Accelerator | Precision | Training Time
HYTREL w/ ELECTRA | 16 NVIDIA A100 | DeepSpeed (ZeRO Stage 1) | bf16 | 6 h/epoch
HYTREL w/ Contrastive | 16 NVIDIA A100 | DeepSpeed (ZeRO Stage 1) | bf16 | 4 h/epoch
Table 9: Pretraining cost.

C.2 Fine-tuning

C.2.1 Details of the Datasets

TURL-CTA is a multi-label classification task (i.e., one column can be annotated with multiple labels) covering 255 semantic types. In total, it has 628,254 columns from 397,098 tables for training, and 13,025 (13,391) columns from 4,764 (4,844) tables for testing (validation).

The TURL-CPA dataset is also a multi-label classification task in which each pair of columns may belong to multiple relations. It covers a total of 121 relations, with 62,954 column pairs from 52,943 tables for training, and 2,072 (2,175) column pairs from 1,467 (1,560) tables for testing (validation). Similar to the CTA task, we use the pairwise column representations from HYTREL (their corresponding hyperedge representations) for fine-tuning.

The WDC Schema.org Table Corpus consists of 4.2 million relational tables covering 43 schema.org entity classes. We sample tables from the top 10 classes and construct a table type detection dataset that contains 36,000 training tables and 4,500 (4,500) tables for validation (testing).

PMC is a table corpus formed from the PubMed Central (PMC) Open Access subset, a collection about biomedicine and life sciences. In total, PMC contains 1,391 table pairs, of which 542 pairs are similar and 849 pairs are dissimilar. For each pair of tables, we use their table representations together with the column representations from HYTREL for fine-tuning. As for evaluation, we follow previous work [Habibi et al., 2020, Trabelsi et al., 2022] and report the macro-average of the 5-fold cross-validation performance.
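For the multi-label CTA/CPA setups, the sketch below shows a generic multi-label head over column (hyperedge) representations trained with binary cross-entropy; the appendix does not spell out the exact head, so the linear layer and loss here are our illustration of a standard multi-label setup rather than the released fine-tuning code.

```python
# Minimal sketch (our illustration of a standard multi-label setup, not the
# exact fine-tuning head): score 255 CTA semantic types from a column
# hyperedge representation and train with binary cross-entropy.
import torch
import torch.nn as nn

d, num_types = 768, 255              # hidden size and number of CTA types
head = nn.Linear(d, num_types)       # one logit per semantic type
criterion = nn.BCEWithLogitsLoss()   # multi-label: independent sigmoid per type

col_repr = torch.randn(16, d)        # batch of column hyperedge representations
labels = torch.zeros(16, num_types)  # multi-hot targets (a column can have >1 type)
labels[:, 0] = 1.0                   # toy example: every column has type 0

loss = criterion(head(col_repr), labels)
loss.backward()
print(float(loss))
```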
C.2.2 Hyperparameters

The hyperparameter choices for fine-tuning both the TaBERT and HYTREL models are as follows. For the primary hyperparameters, batch size and learning rate, we choose the batch size to be as large as possible within the GPU memory constraint and then use grid search to find the best learning rate. Other hyperparameters are kept the same as in previous work (see the sketch at the end of this appendix for how the warm-up ratio translates into an optimizer schedule).

Tasks | Batch Size | Learning Rate | Warm-up Ratio | Max Epochs | Weight Decay | Optimizer
Column Type Annotation | 256 | 1e-3 | 0.05 | 50 | 0.02 | Adam
Column Property Annotation | 256 | 1e-3 | 0.05 | 30 | 0.02 | Adam
Table Type Detection | 32 | 5e-5 | 0.05 | 10 | 0.02 | Adam
Table Similarity Prediction | 16 | 5e-5 | 0.05 | 10 | 0.02 | Adam
Table 10: Hyperparameters for fine-tuning HYTREL. We use the same settings for all variants of HYTREL (no pretraining, pretraining with the ELECTRA objective, and pretraining with the Contrastive objective).

Tasks | Batch Size | Learning Rate | Warm-up Ratio | Max Epochs | Optimizer
Column Type Annotation | 64 (K=3), 128 (K=1) | 5e-5 | 0.1 | 10 | Adam
Column Property Annotation | 48 (K=3), 128 (K=1) | 1e-5 (K=3), 2e-5 (K=1) | 0.1 | 10 | Adam
Table Type Detection | 64 | 5e-5 | 0.1 | 10 | Adam
Table Similarity Prediction | 8 | 5e-5 | 0.1 | 10 | Adam
Table 11: Hyperparameters for fine-tuning TaBERT.

C.2.3 Details of the Baselines

TaBERT [Yin et al., 2020] is a representative TaLM that jointly learns representations for sentences and tables. It flattens the tables into sequences and pretrains the model from BERT checkpoints. Since we use the same source and preprocessing for the pretraining tables, we use it as our main baseline. As TaBERT has not been evaluated on the table-only tasks, we fine-tune the base version with both the K=1 and K=3 row-number settings on all the tasks. We also evaluate a randomly initialized TaBERT model to see how much it relies on the pretraining.

TURL [Deng et al., 2020] is another representative TaLM that focuses on web tables whose cell values are entities linked to a knowledge base. It also flattens the tables into sequences and pretrains from TinyBERT [Jiao et al., 2020] checkpoints. It introduces a visibility matrix to incorporate table structure into the representations. We copy the reported results for the CTA and CPA tasks. For a fair comparison, we use the fine-tuning settings with metadata (captions and headers) but without external pretrained entity embeddings.

Doduo [Suhara et al., 2022] is a state-of-the-art column annotation system that fine-tunes BERT and uses table serialization to incorporate table content. We copy the reported results for the CTA and CPA tasks.

For each specific task and dataset, we also report the state-of-the-art baseline results from previous work. For the TURL-CTA dataset, we use the reported results of Sherlock [Hulsebos et al., 2019]. For the TURL-CPA dataset, we use the reported results of fine-tuning the BERT-base [Devlin et al., 2019] model. For the PMC dataset, we use the reported results from two baselines: TFIDF+GloVe+MLP, which uses TF-IDF and GloVe [Pennington et al., 2014] features of a table with a fully connected MLP model, and TabSim [Habibi et al., 2020], which uses a Siamese neural network to encode a pair of tables for prediction.
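To make the warm-up ratio in Tables 10 and 11 concrete, the sketch below pairs Adam with a warm-up over the first 5% of fine-tuning steps, using the Table Type Detection setting from Table 10; the linear shape of the ramp and the constant rate after warm-up are our assumptions, since the tables only specify the ratio.

```python
# Minimal sketch (assumptions noted above): Adam with weight decay and a
# linear warm-up over the first 5% of steps, matching the warm-up ratio 0.05
# and the TTD learning rate / weight decay in Table 10.
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 10)                 # stand-in for the full model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=0.02)

total_steps = 10_000                             # max_epochs * steps_per_epoch
warmup_steps = int(0.05 * total_steps)           # warm-up ratio 0.05

def lr_lambda(step):
    # Linear ramp to the peak learning rate, then constant (decay is optional).
    return min(1.0, (step + 1) / max(1, warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(3):                            # toy training steps
    optimizer.zero_grad()
    loss = model(torch.randn(4, 768)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()
```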