# HYPEBOY: Generative Self-Supervised Representation Learning on Hypergraphs

Published as a conference paper at ICLR 2024

Sunwoo Kim, Shinhwan Kang, Fanchen Bu, Soo Yong Lee, Jaemin Yoo, Kijung Shin
Kim Jaechul Graduate School of AI, School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
{kswoo97, shinhwan.kang, boqvezen97, syleetolow, jaemin, kijungs}@kaist.ac.kr

ABSTRACT

Hypergraphs are marked by complex topology, expressing higher-order interactions among multiple nodes with hyperedges, and better capturing the topology is essential for effective representation learning. Recent advances in generative self-supervised learning (SSL) suggest that hypergraph neural networks learned from generative self-supervision have the potential to effectively encode the complex hypergraph topology. Designing a generative SSL strategy for hypergraphs, however, is not straightforward. Questions remain with regard to its generative SSL task, connection to downstream tasks, and empirical properties of learned representations. In light of the promises and challenges, we propose a novel generative SSL strategy for hypergraphs. We first formulate a generative SSL task on hypergraphs, hyperedge filling, and highlight its theoretical connection to node classification. Based on the generative SSL task, we propose a hypergraph SSL method, HYPEBOY. HYPEBOY learns effective general-purpose hypergraph representations, outperforming 16 baseline methods across 11 benchmark datasets. Code and datasets are available at https://github.com/kswoo97/hypeboy.

1 INTRODUCTION

Many real-world interactions occur among multiple entities, such as online group discussions on social media, academic collaboration of researchers, and joint item purchases (Benson et al., 2018; Kim et al., 2023a; Lee et al., 2024). Representing such higher-order interactions with an ordinary graph can cause topological information loss (Dong et al., 2020; Yoon et al., 2020). Thus, hypergraphs have emerged as an effective tool for representing high-order interactions in various domains, including recommender systems (Xia et al., 2022; Yu et al., 2021), financial analysis (Sawhney et al., 2021; 2020), and drug analysis (Ruan et al., 2021; Saifuddin et al., 2023).

For representation learning on such a complex topology, hypergraph neural networks (HNNs) have been developed. Training HNNs has primarily relied on task-related label supervision, such as external node labels. However, simply learning from external supervision may limit HNNs in capturing more complex patterns in hypergraph topology. Incorporating self-supervision related to topology, hence, can substantially improve HNN representation learning.

Particularly, generative self-supervised learning (SSL) holds promise for effective hypergraph representation learning. Generative SSL has recently shown remarkable success in encoding complex patterns in multiple domains. GPT (OpenAI, 2023) in natural language processing and Masked Autoencoder (He et al., 2022) in computer vision are notable examples. With generative self-supervision, HNNs may encode complex topology more effectively, leading to improved representation learning.

However, designing a generative SSL strategy for hypergraphs is not straightforward. First, questions remain unanswered for generative SSL tasks: (1.a) What should be the target generative SSL task for HNNs?
(1.b) How does the generative SSL task relate to downstream tasks (e.g., node classification with external labels)? Second, even after determining the task, details of the SSL method (based on the task) remain unspecified. (2.a) What empirical properties should the method aim to satisfy? (2.b) How can the method achieve effective general-purpose hypergraph representations? Moreover, if not carefully designed, the generative SSL strategy can suffer from severe computational burden, as the number of potential hyperedges grows exponentially with the number of nodes.

In light of these promises and challenges, we systematically investigate and propose a hypergraph generative SSL strategy. Our contributions and the rest of the paper are organized as follows:

- SSL Task: We formulate a generative SSL task, hyperedge filling, for hypergraph representation learning. Notably, we establish its theoretical connections to node classification (Section 3).
- SSL Method: Based on the hyperedge filling task, we propose HYPEBOY, a novel hypergraph SSL method. HYPEBOY is designed to satisfy desirable properties of hypergraph SSL, mitigating (1) over-emphasized proximity, (2) dimensional collapse, and (3) non-uniformity/-alignment of learned representations (Section 4).
- Experiments: We demonstrate that HYPEBOY learns effective general-purpose hypergraph representations. It significantly outperforms SSL-based HNNs in both node classification and hyperedge prediction across 11 benchmark hypergraph datasets (Section 5; code and datasets are available at https://github.com/kswoo97/hypeboy).

2 RELATED WORK

In this section, we review the literature on hypergraph neural networks and self-supervised learning.

Hypergraph neural networks (HNNs). HNNs learn hypergraph representations. Converting hyperedges into cliques (fully connected subgraphs) allows graph neural networks to be applied to hypergraphs (Feng et al., 2019; Yadati et al., 2019). Such conversion, however, may result in topological information loss, since high-order interactions (hyperedges) are reduced to pair-wise interactions (edges). As such, most HNNs pass messages through hyperedges to encode hypergraphs. Some notable examples include HNHN (Dong et al., 2020) with hyperedge encoders, UniGNN (Huang & Yang, 2021) with generalized message passing functions for graphs and hypergraphs, AllSet (Chien et al., 2022) with set encoders, ED-HNN (Wang et al., 2023a) with permutation-equivariant diffusion operators, and PhenomNN (Wang et al., 2023b) with hypergraph-regularized energy functions.

Self-supervised learning (SSL). SSL strategies aim to learn representations from the input data itself, without relying on external labels. They can largely be categorized into contrastive or generative types. Contrastive SSL aims to maximize the agreement between data obtained from diverse views (Chen et al., 2020; Grill et al., 2020; You et al., 2020). Generative SSL, on the other hand, predicts or reconstructs parts of the input data. The success of generative SSL demonstrates its strengths in learning complex input data, in domains including natural language processing (Devlin et al., 2019; OpenAI, 2023) and computer vision (He et al., 2022; Tong et al., 2022). Recently, generative SSL for graphs has gained significant attention, with their main focuses on reconstructing edges (Tan et al., 2023; Li et al., 2023) or node features (Hou et al., 2022; 2023).
Self-supervised learning on hypergraphs. The interest in SSL for hypergraphs is on the rise. Early hypergraph SSL strategies mainly targeted specific downstream tasks, such as group recommendation (Zhang et al., 2021) and session-based recommendation (Xia et al., 2022). Recent ones aim to obtain general-purpose representations. TriCL (Lee & Shin, 2023a) utilizes a tri-directional contrastive loss, which consists of node-, hyperedge-, and membership-level contrast. Kim et al. (2023c) enhances the scalability of TriCL with a partitioning technique. HyperGCL (Wei et al., 2022) utilizes neural networks to generate views for contrast and empirically demonstrates its superiority in node classification, outperforming rule-based view generation methods. HyperGRL (Du et al., 2022) is a generative SSL method sharing similar spirits with our approach; however, the underlying motivations and methodological designs are markedly distinct. We provide a detailed comparison in Appendix C.1.

3 PROPOSED TASK AND THEORETICAL ANALYSIS

In this section, after providing some preliminaries, we formulate hyperedge filling, our generative SSL task on hypergraphs. Then, we establish a theoretical connection between hyperedge filling and node classification, which is a widely-considered important downstream task.

Preliminaries. A hypergraph $G = (\mathcal{V}, \mathcal{E})$ is defined by a node set $\mathcal{V}$ and a hyperedge set $\mathcal{E}$. Each hyperedge $e_j \in \mathcal{E}$ is a non-empty set of nodes (i.e., $\emptyset \neq e_j \subseteq \mathcal{V}$, $\forall e_j \in \mathcal{E}$). Each node $v_i \in \mathcal{V}$ is equipped with a feature vector $\mathbf{x}_i \in \mathbb{R}^d$, and $\mathbf{X} \in \mathbb{R}^{|\mathcal{V}| \times d}$ denotes the node feature matrix, where the $i$-th row $\mathbf{X}_i$ corresponds to $\mathbf{x}_i$.

Figure 1: Overview of (a) the hyperedge filling task and (b) HYPEBOY, our proposed SSL method based on the task. The goal of the task is to find the missing node for a given query subset (i.e., the other nodes in a hyperedge). HYPEBOY trains HNNs aiming to correctly predict the missing node.

A hypergraph neural network (HNN) $f_\theta$ is a function that receives a node feature matrix $\mathbf{X}$ and a set of hyperedges $\mathcal{E}$ as inputs and returns node embeddings $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times k}$ (i.e., $\mathbf{Z} = f_\theta(\mathbf{X}, \mathcal{E})$, where $\theta$ is a set of learnable parameters).¹

¹ In this paper, we assume an HNN returns only vector representations of nodes unless otherwise stated, while we acknowledge that some HNNs return embeddings of hyperedges as well.

3.1 PROPOSED TASK: HYPEREDGE FILLING

We formulate hyperedge filling, a generative SSL task for hypergraph representation learning. We first define the task and then discuss the superiority of the hyperedge filling task over alternatives. An illustration of the hyperedge filling task is provided in Figure 1(a).

Task definition. Given a set of nodes, hyperedge filling aims to predict a node that is likely to form a hyperedge together with them. Specifically, for each hyperedge $e_j \in \mathcal{E}$, we divide it into a (missing) node $v_i \in e_j$ and a (query) subset $q_{ij} = e_j \setminus \{v_i\}$. The target of the task is to correctly fill the missing node $v_i$ for each given subset $q_{ij}$. This can be formalized as maximizing the probability of $v_i$ correctly completing $q_{ij}$, which is denoted as $p_{(\mathbf{X},\mathcal{E},\Theta)}(v_i \mid q_{ij})$, where $\Theta$ is the set of parameters we aim to optimize in this task. We will further elaborate on our design of $p_{(\mathbf{X},\mathcal{E},\Theta)}(\cdot)$ in Section 4.3.
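To make the task concrete, the sketch below enumerates every (query subset, missing node) pair induced by a small hyperedge set; the helper name and the toy hyperedges are illustrative choices, not part of the paper's released code.

```python
def hyperedge_filling_pairs(hyperedges):
    """Enumerate all (query subset, missing node) pairs used as SSL signals.

    For each hyperedge e_j, every member v_i in turn plays the missing node,
    yielding the pair (q_ij, v_i) where q_ij is e_j without v_i.
    """
    pairs = []
    for e in hyperedges:
        for v in e:
            pairs.append((frozenset(e) - {v}, v))
    return pairs

# toy example: three hyperedges over six nodes give 3 + 2 + 4 = 9 filling cases
E = [{0, 1, 2}, {2, 3}, {1, 3, 4, 5}]
for q, v in hyperedge_filling_pairs(E):
    print(sorted(q), "->", v)
```

Each hyperedge of size $|e_j|$ thus contributes $|e_j|$ training signals, which is the source of the richer supervision discussed next.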
Advantage over alternatives. Potential alternatives include naive extensions of generative SSL tasks for ordinary graphs: (a) generating hyperedges from scratch and (b) classifying given sets of nodes into hyperedges and non-hyperedges. Compared to (a), by shifting the focus of prediction from the set level (the hyperedge itself) to the node level, the hyperedge filling task reduces the prediction space from a computationally prohibitive $O(2^{|\mathcal{V}|})$ to an affordable $O(|\mathcal{V}|)$. Compared to (b), the hyperedge filling task provides richer and more diversified generative SSL signals. Specifically, for each hyperedge $e_j$, our task offers $|e_j|$ distinct node-subset combinations that can serve as SSL signals. In contrast, classifying the mere existence of $e_j$ yields a singular and, thus, limited signal.

3.2 THEORETICAL RESULTS ON HYPEREDGE FILLING

To demonstrate the effectiveness of hyperedge filling as a general SSL task, we present its theoretical connection to node classification. In essence, we demonstrate that node representations optimized for the hyperedge filling task can improve node classification accuracy.

3.2.1 BASIC SETTING

First, we assume a data model of a hypergraph $G = (\mathcal{V}, \mathcal{E})$ where (1) each node belongs to a single class, (2) the features of each node are generated from a Gaussian distribution, and (3) each hyperedge is generated according to a given hyperedge affinity parameter $P \in [0, 1]$.

Assumption 1 (Node classes and features). Assume that there are $2N$ nodes and node classes $C_1$ and $C_0$ such that $C_1 \cup C_0 = \mathcal{V}$, $C_1 \cap C_0 = \emptyset$, and $|C_1| = |C_0| = N$. Each node feature vector $\mathbf{x}_i$ is independently generated from $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_1, \boldsymbol{\Sigma})$ if $v_i \in C_1$, and from $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ if $v_i \in C_0$. For simplicity, we assume $\boldsymbol{\mu}_1 = (0.5)_{i=1}^{d}$, $\boldsymbol{\mu}_0 = (-0.5)_{i=1}^{d}$, and $\boldsymbol{\Sigma} = \mathbf{I}$, where $\mathbf{I}$ is the $d$-by-$d$ identity matrix.

Assumption 2 (Hypergraph topology). Assume that the number of hyperedges and the size of each hyperedge are given, and there is no singleton hyperedge (i.e., $|e_j| \geq 2$, $\forall e_j \in \mathcal{E}$). Let $\mathcal{B}$ denote the binomial distribution. Each hyperedge $e_j \in \mathcal{E}$ has a class membership $c_j \in \{0, 1\}$, where $c_j \sim \mathcal{B}(1, 0.5)$. Given the number $|e_j|$ of nodes and the class $c_j \in \{0, 1\}$ of a hyperedge, the number of its members belonging to $C_1$ satisfies $|e_j \cap C_1| \sim \mathcal{B}\big(|e_j|, P^{c_j}(1-P)^{1-c_j}\big)$.

Note that the tendency of each hyperedge to contain nodes of the same class is symmetric about $P = 0.5$ under the binary class setting. In addition, when $P$ approaches 1, each hyperedge $e_j$ is more likely to contain nodes of the same class (spec., node class $c_j$); as $P$ approaches 0, each hyperedge $e_j$ is again more likely to contain nodes of the same class (spec., node class $1 - c_j$).

Second, we describe how node representations are updated via the hyperedge filling task. In this theoretical analysis, we define the updating process of node representations as follows:

(F1) Filling probability $p_{(\mathbf{X},\mathcal{E},\Theta)}(\cdot)$ is defined on each node-subset pair as follows:

$$ p_{(\mathbf{X},\mathcal{E},\Theta)}(v_i \mid q_{ij}) := \frac{\exp\big(\mathbf{x}_i^{\top} \sum_{v_k \in q_{ij}} \mathbf{x}_k\big)}{\sum_{v_t \in \mathcal{V}} \exp\big(\mathbf{x}_t^{\top} \sum_{v_k \in q_{ij}} \mathbf{x}_k\big)}. \tag{1} $$

(F2) Node representation $\mathbf{z}_i$ is obtained via gradient descent with respect to $\mathbf{x}_i$ from $\mathcal{L}$, which is the negative log-likelihood of Eq. (1) (i.e., $\mathcal{L} = -\log p_{(\mathbf{X},\mathcal{E},\Theta)}(v_i \mid q_{ij})$). For ease of analysis, we assume $\mathbf{z}_i = \mathbf{x}_i - \gamma \nabla_{\mathbf{x}_i} \mathcal{L}$, where $\gamma \in \mathbb{R}_{+}$ is a fixed constant.

At last, we assume a Gaussian naive Bayes classifier $F$ (Bishop, 2006), which is defined as:

$$ F(\mathbf{x}_i) = \arg\max_{k \in \{0, 1\}} f(\mathbf{x}_i; \boldsymbol{\mu}_k, \mathbf{I}), \tag{2} $$

where $f$ is the probability density function of $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \mathbf{I})$.
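For intuition on (F2), the gradient of $\mathcal{L}$ with respect to $\mathbf{x}_i$ has a simple closed form; the short derivation below is consistent with Eq. (1) and with the fuller argument in Appendix A.1, where $\mathbf{x}_q$ is shorthand introduced here for the summed query features.

```latex
% Let x_q := \sum_{v_k \in q_{ij}} x_k and p_i := p_{(X,E,\Theta)}(v_i | q_{ij}).
% Since v_i \notin q_{ij}, x_q does not depend on x_i, so
\nabla_{\mathbf{x}_i} \mathcal{L}
  = \nabla_{\mathbf{x}_i}\!\left[-\mathbf{x}_i^{\top}\mathbf{x}_q
      + \log \textstyle\sum_{v_t \in \mathcal{V}} \exp(\mathbf{x}_t^{\top}\mathbf{x}_q)\right]
  = -\mathbf{x}_q + p_i\,\mathbf{x}_q
  = -(1 - p_i)\,\mathbf{x}_q,
\qquad
\mathbf{z}_i = \mathbf{x}_i - \gamma\,\nabla_{\mathbf{x}_i}\mathcal{L}
             = \mathbf{x}_i + \gamma\,(1 - p_i)\,\mathbf{x}_q .
```

The update thus pulls $\mathbf{x}_i$ toward the summed features of its query subset, with a step that shrinks as the filling probability $p_i$ approaches 1; this is the representation analyzed in the theorems below.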
3.2.2 HYPEREDGE FILLING HELPS NODE CLASSIFICATION

Our goal is to show that for accurate classification of $v_i$, the representation $\mathbf{z}_i$, which is obtained for hyperedge filling as described in (F1) and (F2), is more effective than the original feature $\mathbf{x}_i$. First, we assume a node $v_i$ belonging to the class $C_1$ (i.e., $v_i \in C_1$), and we later generalize the result to $C_0$. The effectiveness of an original feature is defined as the expected accuracy of the classifier $F$ with $\mathbf{x}_i$ (i.e., $\mathbb{E}_x[\mathbb{1}_{F(\mathbf{x}_i)=1}] := P_x\big(f(\mathbf{x}_i; \boldsymbol{\mu}_1, \mathbf{I}) > f(\mathbf{x}_i; \boldsymbol{\mu}_0, \mathbf{I})\big)$). Similarly, that with a derived representation is defined as $\mathbb{E}_z[\mathbb{1}_{F(\mathbf{z}_i)=1}] = P_z\big(f(\mathbf{z}_i; \boldsymbol{\mu}_1, \mathbf{I}) > f(\mathbf{z}_i; \boldsymbol{\mu}_0, \mathbf{I})\big)$. Below, we show that the effectiveness of a derived representation $\mathbf{z}_i$ is greater than that of an original feature $\mathbf{x}_i$ under a certain condition (Theorem 1).

Theorem 1 (Improvement in effectiveness). Assume a hyperedge $e_j$ s.t. $e_j \cap C_1 \neq \emptyset$ and node features $\mathbf{X}$ that are generated under Assumption 1. For any node $v_i \in e_j \cap C_1$, the following holds:

$$ \mathbf{1}^{\top} \sum_{v_k \in q_{ij}} \mathbf{x}_k > 0 \;\;\Longrightarrow\;\; \mathbb{E}_z[\mathbb{1}_{F(\mathbf{z}_i)=1}] > \mathbb{E}_x[\mathbb{1}_{F(\mathbf{x}_i)=1}], \quad \text{where } \mathbf{1} \text{ denotes } (1)_{k=1}^{d}. \tag{3} $$

Proof. A full proof is provided in Appendix A.1.

Theorem 1 states that when a certain condition (the left-hand side of Eq. (3)) is met, the effectiveness of $\mathbf{z}_i$ is greater than that of $\mathbf{x}_i$. This result implies that node representations, when refined using the objective function associated with the hyperedge filling task, are more proficient in performing accurate node classification compared to the original node features.

While the finding in Theorem 1 demonstrates the usefulness of the hyperedge filling task in node classification, its validity relies on the condition on the left-hand side of Eq. (3). We further analyze the probability that the condition is met by a stochastic $G$ under Assumptions 1 and 2 for a given $P$.

Theorem 2 (Realization of condition). Assume node features $\mathbf{X}$ and a hyperedge $e_j$ s.t. (i) they are generated under Assumptions 1 and 2, respectively, and (ii) $e_j \cap C_1 \neq \emptyset$. For any $q_{ij}$ where $v_i \in e_j \cap C_1$, the following holds:

1. $P_{\mathbf{x},e}\big(\mathbf{1}^{\top} \sum_{v_k \in q_{ij}} \mathbf{x}_k > 0 \,\big|\, P\big) \geq 0.5$, $\forall P \in [0, 1]$.

2. $P_{\mathbf{x},e}\big(\mathbf{1}^{\top} \sum_{v_k \in q_{ij}} \mathbf{x}_k > 0 \,\big|\, P\big)$ is a strictly decreasing function w.r.t. $P \in [0, 0.5]$ and a strictly increasing function w.r.t. $P \in [0.5, 1]$.

Proof. A full proof is provided in Appendix A.2.

Figure 2: Analysis regarding Property 2 (prevention of dimensional collapse) and Property 3 (representation uniformity and alignment) of HYPEBOY. Panel (a) shows a representation spectrum analysis, where a sudden singular-value drop implies dimensional collapse; panel (b) shows representations on the unit hypersphere, where uniformly distributed representations achieve uniformity, and representations of nodes of the same class located close to each other achieve alignment. As shown in (a), while HYPEBOY without projection heads (red) suffers from dimensional collapse, HYPEBOY (blue) does not, demonstrating the necessity of the projection head. Furthermore, as shown in (b), representations from an HNN trained by HYPEBOY meet both uniformity and alignment, justifying our design choice of the loss function. Experiments are conducted on the Cora dataset.

Theorem 2 states that the probability of the condition being satisfied is at least 0.5 for any affinity parameter $P \in [0, 1]$. Moreover, the probability strictly increases w.r.t. $P \in [0.5, 1]$ and strictly decreases w.r.t. $P \in [0, 0.5]$. Thus, it can be inferred that as the probability of each hyperedge containing nodes of the same class increases, so does the likelihood of the condition being met. Notably, many real-world group interactions exhibit homophilic traits (Laakasuo et al., 2020; Khanam et al., 2023). Therefore, the hyperedge filling task can improve node classification in many real-world scenarios, as evidenced by our theoretical findings and real-world characteristics.
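As a quick numerical sanity check of Theorem 2 (separate from the empirical demonstration in Appendix A.3), the sketch below Monte Carlo-estimates the condition probability under Assumptions 1 and 2; the hyperedge size S, feature dimension d, and the size-biased weighting used to condition on $v_i \in e_j \cap C_1$ are modeling choices made here for illustration, not part of the paper's code.

```python
import numpy as np

def condition_prob(P, S=6, d=8, trials=50_000, seed=0):
    """Estimate Pr[ 1^T sum_{v_k in q_ij} x_k > 0 | P ] for v_i in e_j ∩ C1."""
    rng = np.random.default_rng(seed)
    c = rng.integers(0, 2, size=trials)              # hyperedge class c_j ~ B(1, 0.5)
    p_c = np.where(c == 1, P, 1.0 - P)               # per-hyperedge affinity toward C1
    s = rng.binomial(S, p_c)                         # s = |e_j ∩ C1|  (Assumption 2)
    w = s.astype(float)                              # Pr[v_i ∈ e_j | s] grows with s
    mu = 0.5 * d * (2 * s - 1 - S)                   # E[1^T x_q | s] under Assumption 1
    sd = np.sqrt((S - 1) * d)                        # Std[1^T x_q]   (Sigma = I)
    draw = rng.normal(mu, sd)                        # one sample of 1^T x_q per trial
    return float((w * (draw > 0)).sum() / w.sum())   # weighted over trials with s >= 1

for P in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"P = {P:.1f}:  estimated probability ≈ {condition_prob(P):.3f}")
```

Consistent with the theorem, the estimate stays above 0.5 and grows as P moves away from 0.5 (by the symmetry noted above, the same behavior holds as P decreases toward 0).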
Generalization to the class $C_0$. The above results can be easily generalized to the class $C_0$ due to symmetry. Specifically, for each node $v_i \in C_0$, the effectiveness (spec., the expected accuracy of the classifier $F$) of a derived representation $\mathbf{z}_i$ is greater than that of an original feature $\mathbf{x}_i$ under a certain condition. The probability for such a condition to hold strictly increases w.r.t. $P \in [0.5, 1]$ and strictly decreases w.r.t. $P \in [0, 0.5]$. We show this theoretically in Appendices A.1 and A.2.

4 PROPOSED METHOD FOR HYPEREDGE FILLING

In this section, we present HYPEBOY (Hypergraphs, build own hyperedges), a hypergraph generative SSL method based on the hyperedge filling task (Section 3). HYPEBOY exhibits the following desirable properties of hypergraph generative SSL, as empirically demonstrated later:

Property 1. Does not over-emphasize proximity information (Veličković et al., 2019).
Property 2. Avoids dimensional collapse (Jing et al., 2022) in node representations.
Property 3. Learns node representations to be aligned and uniform (Wang & Isola, 2020).

Our proposed method, HYPEBOY, is illustrated in Figure 1(b). After presenting each of its steps, we discuss its role in satisfying the above properties. Lastly, we introduce a two-stage training scheme for further enhancing HYPEBOY.

4.1 STEP 1: HYPERGRAPH AUGMENTATION

HYPEBOY first obtains an augmented feature matrix $\mathbf{X}'$ and hyperedge set $\mathcal{E}'$ by using augmentation functions $\tau_x$ and $\tau_E$, respectively. The feature augmentation function $\tau_x : (\mathbf{X}, p_v) \mapsto \mathbf{X}'$ masks certain entries of $\mathbf{X}$ based on Bernoulli sampling (spec., $\mathbf{X}' = \mathbf{X} \odot \mathbf{M}$, where $\odot$ is an element-wise product and $\mathbf{M}_{ij} \sim \mathcal{B}(1, 1 - p_v)$, $\forall i \in [|\mathcal{V}|]$, $\forall j \in [d]$). The topology augmentation function $\tau_E : (\mathcal{E}, p_e) \mapsto \mathcal{E}'$ samples $|\mathcal{E}| \cdot (1 - p_e)$ hyperedges uniformly at random from $\mathcal{E}$. Note that the magnitudes of the feature and topology augmentations are proportional to $p_v, p_e \in [0, 1]$, respectively.

Role. For hypergraph SSL, augmentations are crucial for mitigating over-emphasized proximity information. That is, using all hyperedges and/or features for both message passing and objective-function construction may cause HNNs to rely heavily on direct neighborhood information (Tan et al., 2023). Many graph SSL strategies mask certain input edges and/or node features, preventing their encoder models from overfitting to the neighbor distribution and features (Hou et al., 2022; Li et al., 2023; Tan et al., 2023). Motivated by their findings, we use the augmentation step. In Appendix F.1, we demonstrate that augmentation enhances the node classification performance of HYPEBOY.
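The following is a minimal sketch of the two augmentation operators of Step 1, feature masking and hyperedge sampling; the dense mask, helper names, and toy inputs are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def augment_features(X, p_v, rng):
    """tau_x: zero out each entry of X independently with probability p_v."""
    M = rng.binomial(1, 1.0 - p_v, size=X.shape)    # M_ij ~ B(1, 1 - p_v)
    return X * M                                    # X' = X ⊙ M

def augment_hyperedges(hyperedges, p_e, rng):
    """tau_E: keep |E| * (1 - p_e) hyperedges chosen uniformly at random."""
    n_keep = int(round(len(hyperedges) * (1.0 - p_e)))
    keep = rng.choice(len(hyperedges), size=n_keep, replace=False)
    return [hyperedges[i] for i in sorted(keep)]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                         # 6 nodes with 4-dim features
E = [[0, 1, 2], [2, 3], [1, 3, 4, 5]]               # toy hyperedge set
X_aug = augment_features(X, p_v=0.3, rng=rng)
E_aug = augment_hyperedges(E, p_e=0.3, rng=rng)
```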
4.2 STEP 2: HYPERGRAPH ENCODING

After augmentation, HYPEBOY obtains node and query-subset representations. First, HYPEBOY obtains node embeddings $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times d'}$ by feeding the augmented hypergraph into an encoder HNN: $\mathbf{Z} = f_\theta(\mathbf{X}', \mathcal{E}')$. Then, HYPEBOY obtains projected representations of query subsets (i.e., $q_{ij}$, $\forall v_i \in e_j$, $\forall e_j \in \mathcal{E}'$) and nodes. To this end, we utilize a node projection head $f_\phi$ and a set projection head $f_\rho$. Specifically, the projected embeddings of node $v_i$ and query subset $q_{ij}$ are $\mathbf{h}_i = f_\phi(\mathbf{z}_i) \in \mathbb{R}^k$ and $\mathbf{q}_{ij} = f_\rho\big(\sum_{v_t \in q_{ij}} \mathbf{z}_t\big) \in \mathbb{R}^k$, respectively. Here, the design choice of the set projection head is motivated by DeepSets (Zaheer et al., 2017).

Role. We investigate the role of the projection heads, which are non-trivial components, in the context of the dimensional collapse of embeddings. Dimensional collapse is a phenomenon in which embedding vectors occupy only a lower-dimensional sub-space of their full dimension (Jing et al., 2022). It is identified by observing whether or not certain singular values of the embedding covariance matrix drop to zero. To prevent dimensional collapse, we employ projection heads in HYPEBOY, and this is in line with the prior findings of Jing et al. (2022) and Song et al. (2023). Figure 2(a) illustrates that an HNN trained using HYPEBOY avoids dimensional collapse, whereas an HNN trained with a HYPEBOY variant (without projection heads) does not. Results on more datasets are in Appendix F.2. Furthermore, we provide a theoretical analysis of why HYPEBOY without projection heads may suffer from dimensional collapse in Appendix B.2. Note that this distinction leads to a performance discrepancy in node classification (Section 5.3).

4.3 STEP 3: HYPEREDGE FILLING LOSS

The last step is to compute the SSL loss based on the hyperedge filling probability. We design $p_{(\mathbf{X},\mathcal{E},\Theta)}(v_k \mid q_{ij})$ to be normalized over $\mathcal{V}$ (i.e., $\sum_{v_k \in \mathcal{V}} p_{(\mathbf{X},\mathcal{E},\Theta)}(v_k \mid q_{ij}) = 1$). To this end, we utilize a softmax function to model the probabilities. In sum, with projected embeddings $\mathbf{h}_i$ and $\mathbf{q}_{ij}$ (Section 4.2), the probability of a node $v_i$ completing a query subset $q_{ij}$ is defined as follows:

$$ p_{(\mathbf{X},\mathcal{E},\Theta)}(v_i \mid q_{ij}) := \frac{\exp\big(\mathrm{sim}(\mathbf{h}_i, \mathbf{q}_{ij})\big)}{\sum_{v_k \in \mathcal{V}} \exp\big(\mathrm{sim}(\mathbf{h}_k, \mathbf{q}_{ij})\big)}, \tag{4} $$

where $\mathrm{sim}$ is the cosine similarity function (other similarity functions are also applicable). Lastly, HYPEBOY is optimized for all possible hyperedge filling cases (i.e., $\prod_{e_j \in \mathcal{E}} \prod_{v_i \in e_j} p_{(\mathbf{X},\mathcal{E},\Theta)}(v_i \mid q_{ij})$). To sum up, HYPEBOY minimizes the negative log-likelihood of all possible cases as follows:

$$ \mathcal{L} = - \sum_{e_j \in \mathcal{E}} \sum_{v_i \in e_j} \log \frac{\exp\big(\mathrm{sim}(\mathbf{h}_i, \mathbf{q}_{ij})\big)}{\sum_{v_k \in \mathcal{V}} \exp\big(\mathrm{sim}(\mathbf{h}_k, \mathbf{q}_{ij})\big)}. \tag{5} $$

Note that the set $\Theta$ of all parameters consists of the parameters of the encoder HNN $f_\theta$, the node projection head $f_\phi$, and the set projection head $f_\rho$ (i.e., $\Theta = (\theta, \phi, \rho)$). They are updated by gradient descent, aiming to minimize the loss $\mathcal{L}$ defined in Eq. (5).

Role. Our design choice of $p_{(\mathbf{X},\mathcal{E},\Theta)}(\cdot)$ ensures that representations learned by HYPEBOY achieve both alignment and uniformity (Wang & Isola, 2020). In our case, alignment indicates that node embeddings belonging to the same class are closely located to each other, and uniformity indicates that embeddings are uniformly distributed over the embedding space. Through the numerator of Eq. (5), HYPEBOY pulls together the representations of a missing node and a query subset, promoting alignment, as discussed in our theoretical analysis (Section 3.2). At the same time, the denominator of Eq. (5) pushes every node representation away from each query subset representation, encouraging representations to be uniformly distributed. This intuition is supported by the findings of Wang & Isola (2020): the denominators of their contrastive loss, which push representations away from each other, encourage uniformity. As shown in Figure 2(b), we verify that node representations from HYPEBOY achieve both alignment and uniformity. Results on more datasets are in Appendix F.3.
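A compact sketch of Steps 2-3, covering the set pooling, the projection heads, and the loss in Eq. (5). The linear heads, toy dimensions, and the omission of the encoder HNN (the sketch starts from already-computed embeddings Z) are simplifications for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def hyperedge_filling_loss(Z, hyperedges, node_head, set_head):
    """Eq. (5): negative log-likelihood over all (query subset, missing node) pairs.

    Z          : (|V|, d') node embeddings from the encoder HNN.
    hyperedges : list of lists of node indices (the augmented hyperedge set E').
    node_head  : projection f_phi mapping node embeddings to R^k.
    set_head   : projection f_rho applied to the summed subset embedding (DeepSets-style).
    """
    H = F.normalize(node_head(Z), dim=-1)            # h_k, unit-normalized for cosine sim
    loss = Z.new_zeros(())
    for e in hyperedges:
        for v in e:                                  # each member plays the missing node
            q = [u for u in e if u != v]             # query subset q_ij = e_j without v_i
            q_emb = F.normalize(set_head(Z[q].sum(dim=0)), dim=-1)
            logits = H @ q_emb                       # sim(h_k, q_ij) for every node v_k
            loss = loss + F.cross_entropy(logits.unsqueeze(0), torch.tensor([v]))
    return loss

# toy usage with random embeddings and linear projection heads
Z = torch.randn(6, 16)
node_head, set_head = torch.nn.Linear(16, 8), torch.nn.Linear(16, 8)
loss = hyperedge_filling_loss(Z, [[0, 1, 2], [2, 3], [1, 3, 4, 5]], node_head, set_head)
loss.backward()
```

Since each hyperedge of size $|e_j|$ contributes $|e_j|$ cross-entropy terms over all of $\mathcal{V}$, a practical implementation would batch these terms; the loop above only mirrors Eq. (5) literally.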
4.4 TWO-STAGE TRAINING SCHEME FOR FURTHER ENHANCEMENT

In our preliminary study, we observed that with a randomly initialized encoder, HYPEBOY tends to rely too heavily on the projection heads rather than on the encoder parameters, leading to sub-optimal training. Thus, we propose a two-stage training scheme to enhance the effectiveness of HYPEBOY. In a nutshell, we first train an HNN $f_\theta$ to reconstruct masked node features (inspired by Hou et al. (2022)). Then, we use the trained $f_\theta$ parameters to initialize HYPEBOY's encoder, which then learns hyperedge filling (Sections 4.1-4.3). This feature warm-up for the HNN encoder repeats the following three steps for a fixed number of epochs, which we set to 300 in all our experiments.

Step 1: Encoding. For each feature-reconstruction training epoch, we sample a certain number of nodes uniformly at random from $\mathcal{V}$, which we denote as $\mathcal{V}' \subseteq \mathcal{V}$. Then, we mask the features of the sampled nodes by using a learnable input token $\mathbf{m}^{(I)} \in \mathbb{R}^d$ and use the resulting masked node feature matrix $\mathbf{X}' \in \mathbb{R}^{|\mathcal{V}| \times d}$ as the input node feature matrix (i.e., $\mathbf{X}'_i = \mathbf{m}^{(I)}$, $\forall v_i \in \mathcal{V}'$ and $\mathbf{X}'_j = \mathbf{X}_j$, $\forall v_j \in \mathcal{V} \setminus \mathcal{V}'$). We also obtain augmented hyperedges $\mathcal{E}'$ by using the strategy described in Section 4.1. Then, we obtain node embeddings $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times d'}$ as follows: $\mathbf{Z} = f_\theta(\mathbf{X}', \mathcal{E}')$.

Step 2: Decoding. We again mask the embeddings of the nodes that we sampled earlier with a learnable embedding token $\mathbf{m}^{(E)} \in \mathbb{R}^{d'}$, obtaining the masked node embedding matrix $\mathbf{Z}'$ (i.e., $\mathbf{Z}'_i = \mathbf{m}^{(E)}$, $\forall v_i \in \mathcal{V}'$ and $\mathbf{Z}'_j = \mathbf{Z}_j$, $\forall v_j \in \mathcal{V} \setminus \mathcal{V}'$). Then, we acquire reconstructed node features $\hat{\mathbf{X}} \in \mathbb{R}^{|\mathcal{V}| \times d}$ by using a decoder HNN $g_\psi$ as follows: $\hat{\mathbf{X}} = g_\psi(\mathbf{Z}', \mathcal{E}')$.

Step 3: Updating. After obtaining the reconstructed node features $\hat{\mathbf{X}}$, we maximize the similarity between the original and reconstructed features of the masked nodes. To this end, we minimize the following loss function, which is proposed by Hou et al. (2022):

$$ \mathcal{L}' = \frac{1}{|\mathcal{V}'|} \sum_{v_i \in \mathcal{V}'} \left( 1 - \frac{\hat{\mathbf{X}}_i^{\top} \mathbf{X}_i}{\|\hat{\mathbf{X}}_i\| \, \|\mathbf{X}_i\|} \right). $$

Using gradient descent to minimize $\mathcal{L}'$, we update the parameters of the encoder HNN $f_\theta$, the decoder HNN $g_\psi$, the input token $\mathbf{m}^{(I)}$, and the embedding token $\mathbf{m}^{(E)}$.
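A schematic of one warm-up epoch (Steps 1-3 above), treating the encoder and decoder HNNs as black-box callables. The function name, the mask ratio, and the unscaled cosine loss are illustrative assumptions, not the released implementation; m_in and m_emb are assumed to be learnable token parameters (shapes (d,) and (d',)) registered in the optimizer.

```python
import torch

def warmup_epoch(X, hyperedges, encoder, decoder, m_in, m_emb, optimizer,
                 mask_ratio=0.5, p_e=0.3):
    """One epoch of the feature-reconstruction warm-up of Section 4.4."""
    n = X.size(0)
    masked = torch.randperm(n)[: int(n * mask_ratio)]         # V' ⊂ V, sampled uniformly

    # Step 1: encode masked input features with augmented hyperedges
    X_in = X.clone()
    X_in[masked] = m_in                                        # replace rows with token m^(I)
    keep = torch.randperm(len(hyperedges))[: int(len(hyperedges) * (1 - p_e))].tolist()
    E_aug = [hyperedges[i] for i in keep]
    Z = encoder(X_in, E_aug)

    # Step 2: re-mask the embeddings of V' with token m^(E), then decode
    Z_in = Z.clone()
    Z_in[masked] = m_emb
    X_hat = decoder(Z_in, E_aug)

    # Step 3: cosine-similarity reconstruction loss on the masked nodes only
    cos = torch.nn.functional.cosine_similarity(X_hat[masked], X[masked], dim=-1)
    loss = (1 - cos).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```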
5 EXPERIMENTAL RESULTS

We now evaluate the efficacy of HYPEBOY as a technique for (1) pre-training hypergraph neural networks (HNNs) for node classification (Section 5.1) and (2) learning general-purpose representations (Section 5.2). Then, we justify each of its components through an ablation study (Section 5.3).

Datasets. For experiments, we use 11 benchmark hypergraph datasets. The hypergraph datasets are from diverse domains, expressing co-citation, co-authorship, computer graphics, movie-actor, news, and political membership relations. In Appendix D, we detail their statistics and descriptions.

Baseline methods. We utilize 16 baseline methods. They include (a) 10 (semi-)supervised HNNs, including 2 state-of-the-art HNNs (ED-HNN (Wang et al., 2023a) and PhenomNN (Wang et al., 2023b)), (b) 2 generative SSL strategies for ordinary graphs (GraphMAE2 (Hou et al., 2023) and MaskGAE (Li et al., 2023)), and (c) 4 SSL strategies for hypergraphs (TriCL (Lee & Shin, 2023a), HyperGCL (Wei et al., 2022), HyperGRL (Du et al., 2022), and H-GD, which is a direct extension of an ordinary graph SSL method (Zheng et al., 2022) to hypergraphs). We use UniGCNII (Huang & Yang, 2021)² and GCN (Kipf & Welling, 2017) as the backbone encoders for hypergraph SSL methods and ordinary graph SSL methods, respectively³. In Appendix E, we provide their details, including their implementations, training, and hyperparameters.

HYPEBOY. We utilize UniGCNII as the encoder of HYPEBOY, which is the same as that of the other hypergraph SSL methods. For both node and set projection heads, we use an MLP. Further details of HYPEBOY, including its implementation and hyperparameters, are provided in Appendix E.4.

² HyperGRL utilizes a GCN as its backbone encoder since its input is an ordinary graph.
³ Results for alternative encoders can be found in Appendix F.4.

Table 1: Efficacy as pre-training techniques: AVG and STD of accuracy values in node classification under the fine-tuning protocol. The best and second-best performances are colored green and yellow, respectively. R.G. and A.R. denote the random guess and average ranking among all methods, respectively. O.O.T. means that training is not completed within 24 hours. HYPEBOY outperforms all the baseline methods in 8 datasets, and overall, it obtains the best average ranking.

| Method | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House | A.R. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **(Semi-)Supervised** | | | | | | | | | | | | |
| R.G. | 18.1 (0.9) | 17.4 (1.0) | 36.0 (0.7) | 17.8 (0.7) | 18.9 (0.2) | 25.6 (1.0) | 10.2 (0.2) | 33.9 (0.7) | 3.7 (0.2) | 50.1 (1.3) | 26.7 (0.3) | 17.8 |
| MLP | 32.5 (7.0) | 27.9 (7.0) | 62.1 (3.7) | 34.8 (5.1) | 73.5 (1.0) | 56.0 (4.9) | 22.3 (1.7) | 39.1 (2.4) | 89.4 (1.5) | 73.1 (1.4) | 72.2 (3.9) | 14.2 |
| HGNN | 41.9 (7.8) | 50.0 (7.2) | 72.9 (5.0) | 50.2 (5.7) | 85.3 (0.8) | 67.1 (6.0) | 30.3 (2.5) | 42.2 (2.9) | 88.0 (1.4) | 76.4 (1.9) | 52.7 (3.8) | 10.7 |
| HyperGCN | 31.4 (9.5) | 33.1 (10.2) | 63.5 (14.4) | 37.1 (9.1) | 53.5 (11.6) | 68.2 (14.4) | 26.4 (3.6) | 37.9 (4.5) | 55.1 (7.8) | 67.0 (9.4) | 49.8 (3.5) | 15.6 |
| HNHN | 43.1 (8.7) | 50.0 (7.9) | 72.1 (5.4) | 48.3 (6.2) | 84.6 (0.9) | 62.6 (4.8) | 30.0 (2.4) | 42.3 (3.4) | 86.1 (1.6) | 74.2 (1.5) | 49.7 (2.2) | 12.6 |
| UniGCN | 44.2 (8.1) | 49.1 (8.4) | 74.4 (3.9) | 51.3 (6.3) | 86.9 (0.6) | 65.1 (4.7) | 32.7 (1.8) | 41.6 (3.5) | 89.1 (1.0) | 77.2 (1.2) | 51.1 (2.4) | 9.6 |
| UniGIN | 40.4 (9.1) | 47.8 (7.7) | 69.8 (5.6) | 48.3 (6.1) | 83.4 (0.8) | 63.4 (5.1) | 30.2 (1.4) | 41.4 (2.7) | 88.2 (1.8) | 70.6 (1.8) | 51.1 (3.0) | 13.9 |
| UniGCNII | 44.2 (9.0) | 48.5 (7.4) | 74.1 (3.9) | 54.8 (7.5) | 87.4 (0.6) | 65.8 (3.9) | 32.5 (1.7) | 42.5 (3.9) | 90.8 (1.1) | 70.9 (1.0) | 50.8 (4.3) | 9.4 |
| AllSet | 43.5 (8.0) | 47.6 (4.2) | 72.4 (4.5) | 57.5 (5.7) | 85.9 (0.6) | 65.3 (3.9) | 29.3 (1.2) | 42.3 (2.4) | 92.1 (0.6) | 71.9 (2.5) | 54.1 (3.4) | 10.4 |
| ED-HNN | 40.3 (8.0) | 47.6 (7.7) | 72.7 (4.7) | 54.8 (5.4) | 86.2 (0.8) | 65.8 (4.8) | 30.0 (2.1) | 41.4 (3.0) | 90.7 (0.9) | 76.2 (1.2) | 71.3 (3.7) | 10.4 |
| PhenomNN | 49.8 (9.6) | 56.4 (9.6) | 76.1 (3.5) | 60.8 (6.2) | 88.1 (0.4) | 72.3 (4.1) | 33.8 (2.0) | 44.1 (3.7) | 95.9 (0.8) | 74.0 (1.5) | 70.4 (5.6) | 3.9 |
| **Self-supervised** | | | | | | | | | | | | |
| GraphMAE2 | 41.1 (10.0) | 49.3 (8.3) | 72.9 (4.2) | 55.4 (8.4) | 86.6 (0.6) | 69.5 (4.4) | 32.8 (1.9) | 43.3 (2.7) | 90.1 (0.7) | 71.9 (1.3) | 52.8 (3.5) | 8.9 |
| MaskGAE | 49.6 (10.1) | 57.1 (8.8) | 72.8 (4.3) | 57.8 (5.9) | 86.3 (0.5) | 74.8 (3.1) | 33.7 (1.6) | 44.5 (2.5) | 90.0 (0.9) | O.O.T. | 51.8 (3.3) | 7.7 |
| TriCL | 51.7 (9.8) | 60.2 (7.9) | 76.2 (3.6) | 64.3 (5.5) | 88.0 (0.4) | 79.7 (2.9) | 33.1 (2.2) | 46.9 (2.9) | 90.3 (1.0) | 77.2 (1.0) | 69.7 (4.9) | 3.4 |
| HyperGCL | 47.0 (9.2) | 60.3 (7.4) | 76.8 (3.7) | 62.0 (5.1) | 87.6 (0.5) | 79.7 (3.8) | 33.2 (1.6) | 43.9 (3.6) | 91.2 (0.8) | 77.8 (0.8) | 69.2 (4.9) | 3.5 |
| H-GD | 45.4 (9.9) | 50.6 (8.2) | 74.5 (3.5) | 58.8 (6.2) | 87.3 (0.5) | 75.1 (3.6) | 32.6 (2.2) | 43.0 (3.3) | 90.0 (1.0) | 77.2 (1.0) | 69.7 (5.1) | 6.1 |
| HyperGRL | 42.3 (9.3) | 49.1 (8.8) | 73.0 (4.3) | 55.8 (8.0) | 86.7 (0.6) | 70.8 (3.7) | 33.0 (1.8) | 43.1 (2.7) | 90.1 (0.8) | O.O.T. | 52.5 (3.3) | 9.2 |
| HYPEBOY | 56.7 (9.8) | 62.3 (7.7) | 77.0 (3.4) | 66.3 (4.6) | 88.2 (0.4) | 80.6 (2.3) | 34.1 (2.2) | 47.6 (2.5) | 90.4 (0.9) | 77.6 (0.9) | 70.4 (4.8) | 1.7 |

5.1 EFFICACY AS A PRE-TRAINING TECHNIQUE (FINE-TUNED EVALUATION)
Setup. Following Wei et al. (2022), we randomly split the nodes into training/validation/test sets with the ratio of 1%/1%/98%, respectively.⁴ For reliability, we assess each method on 20 data splits across 5 random model initializations, following Lee & Shin (2023a). We report the average (AVG) and standard deviation (STD) of test accuracy values on each dataset. Specifically, for each SSL strategy, including HYPEBOY, we pre-train a backbone encoder with the corresponding SSL scheme and then fine-tune the encoder in a (semi-)supervised manner.

Results. As shown in Table 1, HYPEBOY shows the best average ranking among all 18 methods. Two points stand out. First, pre-training an HNN with HYPEBOY generally improves node classification. Compared to the performance of UniGCNII (HYPEBOY's backbone encoder), HYPEBOY obtains performance gains up to 12.5 points in 10 out of 11 datasets. Second, HYPEBOY outperforms all other SSL strategies. Specifically, compared to the second-best method (TriCL), the accuracy gap is up to 5.0 points. In addition, the suboptimal performance of the state-of-the-art generative SSL strategies for graphs (i.e., GraphMAE2 and MaskGAE) implies the importance of preserving higher-order interactions in learning hypergraph representations. In summary, HYPEBOY serves as an effective SSL strategy to pre-train HNNs for node classification.

5.2 EFFICACY AS A GENERAL-PURPOSE EMBEDDING TECHNIQUE (LINEAR EVALUATION)

Setup. We assess the generalizability of learned representations from HYPEBOY in two downstream tasks: node classification and hyperedge prediction. Considering this objective, we limit the baseline methods to SSL strategies, which yield embeddings independent of downstream tasks, and the original node features (i.e., naive X). We use the linear evaluation protocol (i.e., the embeddings are used as fixed inputs to the classifiers for each task). For node classification, we use the same settings described in Section 5.1. For hyperedge prediction, we split hyperedges into training/validation/test sets by the ratio of 60%/20%/20%. For its evaluation, we obtain the same number of negative hyperedge samples as that of the ground-truth hyperedges. We report the average (AVG) and standard deviation (STD) of test AUROC values on each dataset. Further experimental details about hyperedge prediction, including negative hyperedge sampling, are provided in Appendix E.3.

Results. As shown in Table 2, HYPEBOY has the best average ranking in both node classification and hyperedge prediction. Specifically, in node classification, compared to the second-best method (TriCL), the accuracy gap is up to 6.3 points. This demonstrates that HYPEBOY is more effective in learning general-purpose hypergraph representations than the other SSL strategies on hypergraphs.

⁴ We ensure that a training set includes at least one node from each class.
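A sketch of the linear evaluation protocol described in the setup above: frozen embeddings are fed to simple downstream predictors. The logistic-regression classifier, the centroid-similarity hyperedge scorer, and the size-matched random negative sampler are illustrative stand-ins; the paper's actual evaluation models and negative sampling scheme are described in Appendix E.3.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def linear_node_classification(Z, y, train_idx, test_idx):
    """Node classification: a linear classifier on frozen embeddings Z."""
    clf = LogisticRegression(max_iter=1000).fit(Z[train_idx], y[train_idx])
    return accuracy_score(y[test_idx], clf.predict(Z[test_idx]))

def hyperedge_auroc(Z, pos_edges, n_nodes, rng):
    """Hyperedge prediction: score each candidate node set from frozen embeddings,
    against size-matched random negatives (one negative per ground-truth hyperedge)."""
    def score(e):
        members = Z[list(e)]
        centroid = members.mean(axis=0)
        return float((members @ centroid).mean())   # illustrative set-cohesion score
    neg_edges = [rng.choice(n_nodes, size=len(e), replace=False) for e in pos_edges]
    scores = [score(e) for e in pos_edges] + [score(e) for e in neg_edges]
    labels = [1] * len(pos_edges) + [0] * len(neg_edges)
    return roc_auc_score(labels, scores)
```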
Table 2: Efficacy as general-purpose embedding techniques: AVG and STD of accuracy/AUROC values in node classification and hyperedge prediction tasks under the linear evaluation protocol. In each downstream task, the best and second-best performances are colored green and yellow, respectively. A.R. denotes the average ranking among all methods. O.O.T. means that training is not completed within 24 hours. HYPEBOY obtains the best average ranking in both downstream tasks.

| Method | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House | A.R. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Node classification** | | | | | | | | | | | | |
| Naive X | 27.8 (7.0) | 32.4 (4.6) | 62.8 (2.8) | 31.9 (5.5) | 69.4 (0.7) | 54.7 (4.7) | 21.4 (1.2) | 38.1 (1.9) | 91.9 (1.1) | 70.6 (1.9) | 71.3 (5.4) | 5.5 |
| GraphMAE2 | 29.2 (6.5) | 37.5 (7.0) | 55.5 (9.5) | 38.2 (9.1) | 75.6 (1.7) | 57.5 (5.6) | 27.3 (2.7) | 36.6 (3.5) | 89.1 (1.8) | 62.3 (2.3) | 51.7 (3.5) | 6.3 |
| MaskGAE | 47.2 (11.1) | 56.8 (9.3) | 62.6 (5.5) | 56.0 (4.8) | 84.8 (0.7) | 75.1 (3.5) | 33.2 (2.0) | 44.1 (3.9) | 90.5 (0.9) | O.O.T. | 50.0 (2.8) | 4.5 |
| TriCL | 53.3 (10.0) | 62.1 (8.8) | 74.5 (4.1) | 63.6 (5.2) | 87.1 (0.7) | 80.9 (3.2) | 35.0 (3.6) | 48.0 (3.2) | 80.0 (5.1) | 67.2 (4.0) | 69.1 (5.5) | 2.6 |
| HyperGCL | 42.6 (8.6) | 61.8 (8.3) | 67.6 (8.0) | 58.1 (6.3) | 56.6 (5.2) | 79.8 (3.8) | 33.3 (2.2) | 47.5 (2.8) | 84.1 (2.8) | 71.2 (3.4) | 67.1 (5.4) | 4 |
| H-GD | 35.6 (7.8) | 37.6 (6.8) | 58.0 (8.2) | 48.6 (7.4) | 73.3 (1.3) | 74.0 (3.3) | 33.8 (5.0) | 35.2 (2.9) | 76.6 (4.4) | 54.8 (7.4) | 68.3 (5.7) | 5.5 |
| HyperGRL | 35.3 (8.2) | 35.4 (8.8) | 50.2 (8.7) | 39.4 (8.1) | 78.7 (1.2) | 62.7 (5.1) | 28.0 (2.8) | 34.8 (3.0) | 89.4 (1.5) | O.O.T. | 52.0 (3.7) | 6.1 |
| HYPEBOY | 59.6 (9.9) | 63.5 (9.4) | 75.0 (3.4) | 66.0 (4.6) | 87.9 (0.5) | 81.2 (2.7) | 34.3 (3.2) | 48.8 (1.8) | 89.2 (2.2) | 75.7 (2.1) | 69.4 (5.4) | 1.5 |
| **Hyperedge prediction** | | | | | | | | | | | | |
| Naive X | 63.3 (2.1) | 75.5 (1.6) | 88.3 (0.6) | 55.0 (1.9) | 90.0 (0.4) | 72.1 (1.3) | 80.0 (1.1) | 39.5 (1.9) | 99.5 (0.1) | 97.7 (2.9) | 54.8 (5.0) | 6.4 |
| GraphMAE2 | 73.3 (2.7) | 76.4 (1.7) | 81.6 (1.1) | 76.3 (3.1) | 85.2 (0.4) | 68.3 (1.8) | 80.7 (0.9) | 53.7 (2.6) | 99.5 (0.1) | 90.1 (5.7) | 62.9 (3.8) | 6.1 |
| MaskGAE | 86.1 (1.6) | 88.5 (1.4) | 92.9 (0.5) | 81.8 (2.7) | 93.2 (0.5) | 79.3 (2.0) | 84.6 (0.1) | 58.1 (2.5) | 99.3 (0.1) | O.O.T. | 87.0 (3.4) | 4.1 |
| TriCL | 90.5 (1.2) | 90.7 (1.3) | 91.9 (0.5) | 87.8 (1.5) | 94.8 (0.2) | 87.9 (1.4) | 90.4 (0.6) | 58.9 (2.1) | 99.6 (0.1) | 98.2 (3.0) | 90.0 (2.6) | 1.8 |
| HyperGCL | 73.9 (2.6) | 85.4 (1.5) | 89.6 (0.5) | 81.1 (1.9) | 83.6 (0.6) | 83.5 (1.0) | 82.1 (7.6) | 53.8 (2.4) | 99.4 (0.1) | 96.7 (7.0) | 76.3 (6.3) | 5 |
| H-GD | 72.2 (5.0) | 71.9 (3.1) | 87.2 (0.7) | 73.2 (4.0) | 91.6 (1.0) | 81.4 (1.9) | 84.9 (2.1) | 53.1 (1.8) | 99.5 (0.1) | 83.9 (2.1) | 87.9 (3.1) | 5.3 |
| HyperGRL | 83.2 (8.0) | 84.0 (1.7) | 78.3 (1.8) | 80.0 (3.3) | 88.2 (0.3) | 80.8 (1.4) | 81.5 (0.7) | 54.7 (2.8) | 98.8 (0.4) | O.O.T. | 88.2 (3.3) | 5.5 |
| HYPEBOY | 91.1 (1.1) | 91.9 (1.1) | 95.1 (0.3) | 88.1 (1.4) | 95.5 (0.1) | 87.3 (1.3) | 89.8 (0.5) | 59.4 (2.1) | 99.7 (0.1) | 99.0 (1.6) | 87.0 (2.8) | 1.5 |
Table 3: The ablation study with four variants of HYPEBOY on node classification under the fine-tuning protocol. The best and second-best performances are colored green and yellow, respectively. F.R., H.F., and P.H. denote Feature Reconstruction, Hyperedge Filling, and Projection Heads, respectively. A.R. denotes the average ranking among all methods. NA denotes no pre-training. HYPEBOY outperforms the others in most datasets, justifying each of its components.

| Variant | F.R. | H.F. | P.H. | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House | A.R. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | | | | 44.2 (9.0) | 48.5 (7.4) | 74.1 (3.9) | 54.8 (7.5) | 87.4 (0.6) | 65.8 (3.9) | 32.5 (1.7) | 42.5 (3.9) | 90.8 (1.1) | 70.9 (1.0) | 50.8 (4.3) | 5.5 |
| V1 | | ✓ | | 51.6 (11.2) | 60.7 (8.2) | 76.2 (3.6) | 63.5 (6.0) | 88.1 (0.5) | 78.5 (2.9) | 33.5 (2.8) | 46.8 (3.1) | 90.0 (1.1) | 77.4 (0.9) | 68.5 (4.5) | 4.3 |
| V2 | | ✓ | ✓ | 52.7 (9.6) | 59.7 (9.2) | 76.7 (3.2) | 63.5 (6.0) | 88.2 (0.5) | 79.1 (2.5) | 33.8 (2.2) | 46.9 (3.3) | 90.6 (1.0) | 77.0 (0.9) | 69.6 (4.9) | 3.3 |
| V3 | ✓ | | | 52.0 (9.3) | 58.9 (8.2) | 74.1 (3.9) | 61.2 (6.6) | 87.8 (0.4) | 79.9 (2.3) | 33.9 (2.1) | 46.3 (2.7) | 91.4 (0.9) | 77.5 (0.9) | 70.1 (4.8) | 3.6 |
| V4 | ✓ | ✓ | | 56.0 (9.9) | 61.8 (8.5) | 76.5 (3.1) | 65.3 (4.3) | 88.0 (0.4) | 80.3 (2.4) | 34.0 (2.0) | 47.5 (2.3) | 90.8 (1.0) | 77.4 (1.0) | 69.3 (5.0) | 2.5 |
| Ours | ✓ | ✓ | ✓ | 56.7 (9.8) | 62.3 (7.7) | 77.0 (3.4) | 66.3 (4.6) | 88.2 (0.4) | 80.6 (2.3) | 34.1 (2.2) | 47.6 (2.5) | 90.4 (0.9) | 77.6 (0.9) | 70.4 (4.8) | 1.4 |

5.3 ABLATION STUDY

We analyze the necessity of each component of HYPEBOY, specifically, (a) the hyperedge filling task (Section 3.1), (b) projection heads (Section 4.2), and (c) feature reconstruction warm-up (Section 4.4). To this end, we utilize four variants of HYPEBOY: (V1) without feature reconstruction warm-up and projection heads, (V2) without feature reconstruction warm-up⁵, (V3) without the hyperedge filling process, and (V4) without projection heads. Here, projection heads are used only for methods with the hyperedge filling process. Note that V3 is a method that only utilizes the feature reconstruction process, which is described in Section 4.4.

As shown in Table 3, HYPEBOY, equipped with all of its components, outperforms the others in most datasets, demonstrating the effectiveness of our design choices. There are two other notable results. First, the necessity of projection heads is evidenced by the superior performance of V2 (compared to V1) and ours (compared to V4). Second, the advantage of the hyperedge filling task over feature reconstruction is manifested by the better average rank of V2 compared to V3.

⁵ To mitigate the issue of over-relying on projection heads (Section 4.4), we train the encoder without projection heads at the beginning, and after some epochs, we train the encoder with projection heads.

6 CONCLUSION

In this work, we conduct a comprehensive analysis of generative self-supervised learning on hypergraphs. Our contribution is three-fold. First, we formulate the hyperedge filling task, a generative self-supervised learning task on hypergraphs, and investigate the theoretical connection between the task and node classification (Section 3). Second, we present a generative SSL method, HYPEBOY, to solve the task (Section 4). Third, we demonstrate the superiority of HYPEBOY over existing SSL methods on hypergraphs through extensive experiments (Section 5).

ACKNOWLEDGEMENTS

This work was supported by Samsung Electronics Co., Ltd. and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

REFERENCES

Ali Behrouz, Farnoosh Hashemi, Sadaf Sadeghian, and Margo Seltzer. Cat-walk: Inductive hypergraph learning via set walks. In NeurIPS, 2023.

Austin R Benson, Rediet Abebe, Michael T Schaub, Ali Jadbabaie, and Jon Kleinberg. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences, 115(48):E11221-E11230, 2018.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. Eli Chien, Chao Pan, Jianhao Peng, and Olgica Milenkovic. You are allset: A multiset function framework for hypergraph neural networks. In ICLR, 2022. Minyoung Choe, Sunwoo Kim, Jaemin Yoo, and Kijung Shin. Classification of edge-dependent labels of nodes in hypergraphs. In KDD, 2023. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NACCL, 2019. Yihe Dong, Will Sawin, and Yoshua Bengio. Hnhn: Hypergraph networks with hyperedge neurons. In ICML Workshop on Graph Representation Learning and Beyond (GRL+), 2020. Boxin Du, Changhe Yuan, Robert Barton, Tal Neiman, and Hanghang Tong. Self-supervised hypergraph representation learning. In Big Data, 2022. Dheeru Dua, Casey Graff, et al. Uci machine learning repository. 2017. Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, and Yue Gao. Gvcnn: Group-view convolutional neural networks for 3d shape recognition. In CVPR, 2018. Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. In AAAI, 2019. Jean-Bastien Grill, Florian Strub, Florent Altch e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. In Neur IPS, 2020. Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. Graphmae: Self-supervised masked graph autoencoders. In KDD, 2022. Zhenyu Hou, Yufei He, Yukuo Cen, Xiao Liu, Yuxiao Dong, Evgeny Kharlamov, and Jie Tang. Graphmae2: A decoding-enhanced masked self-supervised graph learner. In WWW, 2023. Jing Huang and Jie Yang. Unignn: a unified framework for graph and hypergraph neural networks. In IJCAI, 2021. Li Jing, Pascal Vincent, Yann Le Cun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In ICLR, 2022. Published as a conference paper at ICLR 2024 Kazi Zainab Khanam, Gautam Srivastava, and Vijay Mago. The homophily principle in social network analysis: A survey. Multimedia Tools and Applications, 82(6):8811 8854, 2023. Sunwoo Kim, Fanchen Bu, Minyoung Choe, Jaemin Yoo, and Kijung Shin. How transitive are real-world group interactions? measurement and reproduction. In KDD, 2023a. Sunwoo Kim, Minyoung Choe, Jaemin Yoo, and Kijung Shin. Reciprocity in directed hypergraphs: measures, findings, and generators. Data Mining and Knowledge Discovery, 37(6):2330 2388, 2023b. Sunwoo Kim, Dongjin Lee, Yul Kim, Jungho Park, Taeho Hwang, and Kijung Shin. Datasets, tasks, and training methods for large-scale hypergraph learning. Data Mining and Knowledge Discovery, 37(6):2216 2254, 2023c. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. Michael Laakasuo, Anna Rotkirch, Max Van Duijn, Venla Berg, Markus Jokela, Tamas David Barrett, Anneli Miettinen, Eiluned Pearce, and Robin Dunbar. Homophily in personality enhances group success among real-life friends. Frontiers in Psychology, 11:710, 2020. Dongjin Lee and Kijung Shin. 
I m me, we re us, and i m us: Tri-directional contrastive learning on hypergraphs. In AAAI, 2023a. Geon Lee and Kijung Shin. Temporal hypergraph motifs. Knowledge and Information Systems, 65 (4):1549 1586, 2023b. Geon Lee, Fanchen Bu, Tina Eliassi-Rad, and Kijung Shin. A survey on hypergraph mining: Patterns, tools, and generators. ar Xiv preprint ar Xiv:2401.08878, 2024. Jintang Li, Ruofan Wu, Wangbin Sun, Liang Chen, Sheng Tian, Liang Zhu, Changhua Meng, Zibin Zheng, and Weiqiang Wang. What s behind the mask: Understanding masked graph modeling for graph autoencoders. In KDD, 2023. Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010. Open AI. Gpt-4 technical report. 2023. Prasanna Patil, Govind Sharma, and M Narasimha Murty. Negative sampling for hyperlink prediction in networks. In PAKDD, 2020. Ding Ruan, Shuyi Ji, Chenggang Yan, Junjie Zhu, Xibin Zhao, Yuedong Yang, Yue Gao, Changqing Zou, and Qionghai Dai. Exploring complex and heterogeneous correlations on hypergraph for the prediction of drug-target interactions. Patterns, 2(12):100390, 2021. Khaled Mohammed Saifuddin, Briana Bumgardner, Farhan Tanvir, and Esra Akbas. Hygnn: Drugdrug interaction prediction via hypergraph neural network. In ICDE, 2023. Ramit Sawhney, Shivam Agarwal, Arnav Wadhwa, and Rajiv Ratn Shah. Spatiotemporal hypergraph convolution network for stock movement forecasting. In ICDM, 2020. Ramit Sawhney, Shivam Agarwal, Arnav Wadhwa, Tyler Derr, and Rajiv Ratn Shah. Stock selection via spatiotemporal hypergraph attention network: A learning to rank approach. In AAAI, 2021. Zeen Song, Xingzhe Su, Jingyao Wang, Wenwen Qiang, Changwen Zheng, and Fuchun Sun. Towards the sparseness of projection head in self-supervised learning. ar Xiv preprint ar Xiv:2307.08913, 2023. Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In CVPR, 2015. Published as a conference paper at ICLR 2024 Qiaoyu Tan, Ninghao Liu, Xiao Huang, Soo-Hyun Choi, Li Li, Rui Chen, and Xia Hu. S2gae: Self-supervised graph autoencoders are generalizable learners with graph masking. In WSDM, 2023. Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW, 2015. Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are dataefficient learners for self-supervised video pre-training. In Neur IPS, 2022. Nate Veldt, Austin R Benson, and Jon Kleinberg. Combinatorial characterizations and impossibilities for higher-order homophily. Science Advances, 9(1):eabq3200, 2023. Petar Veliˇckovi c, William Fedus, William L Hamilton, Pietro Li o, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In ICLR, 2019. Peihao Wang, Shenghao Yang, Yunyu Liu, Zhangyang Wang, and Pan Li. Equivariant hypergraph diffusion neural operators. In ICLR, 2023a. Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020. Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Heterogeneous graph attention network. In WWW, 2019. Yuxin Wang, Quan Gan, Xipeng Qiu, Xuanjing Huang, and David Wipf. From hypergraph energy functions to hypergraph neural networks. In ICML, 2023b. Tianxin Wei, Yuning You, Tianlong Chen, Yang Shen, Jingrui He, and Zhangyang Wang. Augmentations in hypergraph contrastive learning: Fabricated and generative. 
In Neur IPS, 2022. Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015. Lianghao Xia, Chao Huang, and Chuxu Zhang. Self-supervised hypergraph transformer for recommender systems. In KDD, 2022. Naganand Yadati, Madhav Nimishakavi, Prateek Yadav, Vikram Nitin, Anand Louis, and Partha Talukdar. Hypergcn: A new method for training graph convolutional networks on hypergraphs. In Neur IPS, 2019. Naganand Yadati, Vikram Nitin, Madhav Nimishakavi, Prateek Yadav, Anand Louis, and Partha Talukdar. Nhp: Neural hypergraph link prediction. In CIKM, 2020. Se-eun Yoon, Hyungseok Song, Kijung Shin, and Yung Yi. How much and when do we need higher-order information in hypergraphs? a case study on hyperedge prediction. In WWW, 2020. Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. In Neur IPS, 2020. Junliang Yu, Hongzhi Yin, Jundong Li, Qinyong Wang, Nguyen Quoc Viet Hung, and Xiangliang Zhang. Self-supervised multi-channel hypergraph convolutional network for social recommendation. In WWW, 2021. Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Neur IPS, 2017. Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. Heterogeneous graph neural network. In KDD, 2019. Junwei Zhang, Min Gao, Junliang Yu, Lei Guo, Jundong Li, and Hongzhi Yin. Double-scale selfsupervised hypergraph learning for group recommendation. In CIKM, 2021. Yizhen Zheng, Shirui Pan, Vincent Lee, Yu Zheng, and Philip S Yu. Rethinking and scaling up graph contrastive learning: An extremely efficient approach with group discrimination. In Neur IPS, 2022. Published as a conference paper at ICLR 2024 In this section, we provide a proof of each theorem. In addition, we extend the results from Section 3.2, which assume vi C1 is assumed, to include cases where vi C0. A.1 PROOF OF THEOREM 1 Proof. We first analyze a node vi that belongs to C1 (i.e., vi C1). By definition, Ex[1F(xi)=1] = Px (f(xi; µ1, I) > f(xi; µ0, I)) holds. Thus, Ex[1F(xi)=1] is determined by the distribution of f(xi; µ1, I) > f(xi; µ0, I), where a random variable is xi. We provide the exact formula for this inequality: f(xi; µ1, I) > f(xi; µ0, I), exp( (xi µ1)T (xi µ1)) > exp( (xi µ0)T (xi µ0)), (xi µ1)T (xi µ1) < (xi µ0)T (xi µ0), =x T i xi 2µT 1 xi + µT 1 µ1 < x T i xi 2µT 0 xi + µT 0 µ0, (µ1 µ0)T xi + µT 0 µ0 µT 1 µ1 > 0 1T xi > 0, definition of µ1 and µ0 in Assumption 1. Thus, Ex[1F(xi)=1] is equivalent to Px( 1T xi > 0). In a similar sense, Ex[1F(zi)=1] is equivalent to Px( 1T zi > 0), since as mentioned in Section 3.2, zi is a function of xk, vk V. Now, we formalize zi. Note that zi = xi γ xi L holds by (F2). To further elaborate on this equation, we first detail xi L: xi L = log(p(X,E,Θ) (vi | qij)) xi = log exp(x T i (P vk qij xk)) P vt V exp(x T t (P vk qij xk)) vt V exp(x T t (P vk qij xk)) + exp(x T i (P vk qij xk)) P vt V exp(x T t (P vk qij xk)) 1 exp(x T i (P vk qij xk)) P vt V exp(x T t (P vk qij xk)) By Eq (9), zi can be expressed as a function of xk, vk V, as follows: zi = xi + γ 1 exp(x T i (P vk q xk)) P vt V exp(x T t (P | {z } (Term 1) Let x q denotes P vk qij xk. Furthermore, denote (Term 1) in Eq (10) as f(x) (0, 1). Then, we rewrite Eq (10) as zi = xi + γ (1 f(x)) x q. We finally rewrite 1T zi as follows: 1T zi = 1T xi + γ(1 f(x)) 1T x q. 
(11) Let β denotes 1T x q. Note that by the statement of Theorem 1, 1T x q > 0 β > 0 holds. Thus, our main interest, which is Px( 1T zi > 0), can be rewritten as follows: Px 1T zi > 0 = Px 1T xi + γ(1 f(x))β > 0 . (12) Published as a conference paper at ICLR 2024 Note that when 1T xi > 0 holds, 1T zi > 0 holds as well, since γ(1 f(x))β > 0 holds. Thus, Eq (12) is split as follows: Px 1T zi > 0 = Px 1T xi > 0 + Px γ(1 f(x))β < 1T xi < 0 , = Ex[1F(zi)=1] = Ex[1F(xi)=1] | {z } (a) Expected accuracy of naive xi γβ < 1T xi (1 f(x)) < 0 | {z } (b) Additional gain via hyperedge filling Note that the gain term, which is the (b) term of Eq (13), is always greater than zero. Generalization to vi C0. Now, we analyze a node vi that belongs to C0 (i.e., vi C0). In this case, the previous condition 1T x q > 0 β > 0 becomes 1T x q < 0 β < 0. In a similar sense, for the expected accuracy: Px 1T zi > 0 is changed as Px 1T zi < 0 . In this setting, we can directly extend the result of Eq (12) as follows: Px 1T zi < 0 = Px 1T xi + γ(1 f(x))β < 0 . (14) By employing the above proof, we can obtain the following result (note that β < 0 in this case): Px 1T zi < 0 = Px 1T xi < 0 + Px 0 < 1T xi < γ(1 f(x))β , = Ex[1F(zi)=0] = Ex[1F(xi)=0] | {z } (a) Expected accuracy of naive xi γβ < 1T xi (1 f(x)) < 0 | {z } (b) Additional gain via hyperedge filling Thus, we can also derive the same result for vi C0. A.2 PROOF OF THEOREM 2. Remark. We have denoted the affinity parameter in Assumption 2 as P to distinguish it from the hyperedge filling probability p(X,E,Θ). In this proof, by allowing a slight duplication in notation, we let P := p, since we do not use p(X,E,Θ) in this proof. In addition, we let ej := e and qij := q. Proof. We first derive the functional form of Px,e 1T P vk q xk > 0 p . By vi ej C1, which is the condition of the theorem, the following holds: = Px,e 1T P vk q xk > 0, vi e p Pe (vi e | p) . (16) It is important to note that if the number of nodes that belong to both C1 and ej is decided, denoted by s (i.e., s := |e C1|), the distribution of 1T P vk q xk is automatically decided, since xt, vt V is independently generated from Gaussian distribution, whose mean vector is decided according to the class of vt. This is because of P vk q xk N((2s 1 S)µ1, (S 1)I), where S is a size of e for a given s (i.e., |e| = S). From this result, the following two results are induced: N (2s 1 S)d 2 , (S 1)d , (17) Published as a conference paper at ICLR 2024 where Φ( ) is a C.D.F. of the standard Gaussian distribution. Then, we rewrite Eq (16) as follows: vk q xk > 0, s, vi e Pe (vi e | p) , (19) vk q xk > 0 s Pe(s, vi e) Pe (vi e | p) . (20) By Assumption 2, each Pe(s, vi e) is derived as follows 6: Pe(s, vi e | p) = S s 2 | {z } prob. of s at c = 1 + (1 p)sp S s 2 | {z } prob. of s at c = 0 | {z } prob. of vi C1 e By using Eq (21), we further detail Pe(vi e | p) as follows: Pe(vi e | p) = 2 + (1 p)sp S s 2 + (1 p)sp S s sps(1 p)S s 2 | {z } s B(S,p) + s(1 p)sp S s 2 | {z } s B(S,1 p) , expectation of Binomial distribution, (25) With the result of Eq (26) and Eq (18), we rewrite Eq (20) as follows: 2 + (1 p)sp S s s ps(1 p)S s + (1 p)sp S s Φ Note that our main function, which is Px,e 1T P vk q xk > 0 p , is equal to Eq (28). It is important to note that Eq (28) is symmetric w.r.t. 0.5 in p [0, 1]. Thus, if we prove Eq (28) is strictly increasing on p [0.5, 1], the proof of the statement that Eq (28) is strictly decreasing on p [0, 0.5] is straight forward. 
Hence, we prove the second statement of the theorem by showing that Eq (28) is strictly increasing on p [0.5, 1]. Now, we show the first and second statements of the theorem. Here, we first show the second statement, and from the result, we further prove the first statement. Second statement. We first rewrite Eq (28) by adequately mixing two binomial functions. For simplicity, we denote Φ((2s 1 S) q d 4(S 1)) as ϕ(s), since the value that changes in Φ( ) is only 6The term prob. indicates probability. Published as a conference paper at ICLR 2024 s and other terms are fixed constants. From this, Eq (28) is rewritten as follows: s ps(1 p)S s + (1 p)sp S s ϕ(s), (29) sps(1 p)S sϕ(s) | {z } (Term 1) s(1 p)s(p)S sϕ(s) | {z } (Term 2) We define s := (S s). Due to the symmetric characteristic of the binomial function and standard Gaussian distribution, we can rewrite (Term 2) in Eq (30) as follows: ps (1 p)S s (S s )Φ Since we can regard s and s as equivalent terms in our framework, we add (Term 1) of Eq (30) and Eq (31) as follows: | {z } Term (a) We aim to show that Eq (32) is a strictly increasing function w.r.t. p [0.5, 1]. For simplicity, we let K := p d/(4(S 1)). We first discard a constant term 1/S from Eq (32) for simplicity. Note that without Eq (32) Term (a), Eq (32) is equivalent to 1, and this is a sum of binomial probability mass functions. That is, we can think of Eq (32) as a weighted average of the following values: (S s)Φ ((S 2s 1)K) + sΦ ((2s S 1)K) , s {0, 1, , S}. (33) Specifically, we let S s ps(1 p)S s as a weight of (S s)Φ ((S 2s 1)K) + sΦ ((2s S 1)K) . Note that values of Eq (33) and their weights are both symmetric about s = S/2: which means s and S s have the same weight value. Thus, we can rewrite Eq (32) for an odd number S as: (ps(1 p)S s + (1 p)sp S s) (S s)Φ ((S 2s 1)K) + sΦ ((2s S 1)K) . (34) While this formulation is for an odd number S, we can extend further results to an even number S. We first show that Eq (33) is an increasing function w.r.t. s { S/2 , S/2 + 1, , S}. For two cases where s = k and s = k + 1, their Eq (33) is expressed as follows: (S k 1)Φ (S 2k 3) + (k + 1)Φ (2k S + 1) (35) > (S k)Φ (S 2k 1) + kΦ (2k S 1) , (36) k (Φ (2k S + 1) Φ (2k S 1)) + Φ (2k S + 1) (37) >(S k) (Φ (S 2k 1) Φ (S 2k 3)) + Φ (S 2k 3) , (38) By the condition of s S/2 , k > S k holds. In addition, by the characteristic of the CDF of the standard normal distribution, (Φ (2k S + 1) Φ (2k S 1)) > (Φ (S 2k 1) Φ (S 2k 3)) also holds. Moreover, since s S/2 , Φ (2k S + 1) > Φ (S 2k 3) also holds. In sum, Eq (35) > Eq (36) holds, and this result implies that Eq (33) is an increasing function w.r.t. s { S/2 , S/2 + 1, , S}. Since greater s has a higher value of Eq (33), we can naturally conclude that Eq (34), which is our goal function, is an increasing function w.r.t. p if weights are more assigned to the terms Published as a conference paper at ICLR 2024 related to the higher s as p increases. From this rationale, we show the following statement: s { S/2 , , S}, s.t. w(p, s)/ p > 0, s s and w(p, s)/ p < 0, s s , where w(p, s) = S s ps(1 p)S s + (1 p)sp S s . We first derive the first derivative of w(p, s): ( sps 1(1 p)S s | {z } (a) (S s)ps(1 p)S s 1 | {z } (b) s(1 p)s 1p S s | {z } (c) + (S s)(1 p)sp S s 1 | {z } (d) Then, we show that: s { S/2 , , S}, s.t. ((a) (b)) > ((c) (d)), s s and ((a) (b)) < ((c) (d)), s s . 
We elaborate on (a) − (b) and (c) − (d) as follows:

(a) − (b) = p^{s−1}(1 − p)^{S−s−1}(s(1 − p) − (S − s)p)   (41)
 = p^{s−1}(1 − p)^{S−s−1}(s − Sp),   (42)
(c) − (d) = (1 − p)^{s−1}p^{S−s−1}(sp − (S − s)(1 − p))   (43)
 = (1 − p)^{s−1}p^{S−s−1}(s − S + Sp).   (44)

Remark. When p = 1/2, ((a) − (b)) − ((c) − (d)) = 0 holds, which indicates that if Eq (32) turns out to be an increasing function w.r.t. p, then Eq (32) attains its minimum value at p = 1/2. Thus, we narrow down the range of p to p > 1/2.

We first formalize the equation (a) − (b) = (c) − (d). To this end, we utilize the logarithm to simplify the computation (note that the common factor C(S, s) is omitted):

p^{s−1}(1 − p)^{S−s−1}(s − Sp) = (1 − p)^{s−1}p^{S−s−1}(s − S + Sp)   (45)
⟺ (s − 1)log p + (S − s − 1)log(1 − p) + log(s − Sp) = (s − 1)log(1 − p) + (S − s − 1)log p + log(s − S + Sp)
⟺ (s − 1)log(p/(1 − p)) − (S − s − 1)log(p/(1 − p)) + log((s − Sp)/(s − S(1 − p))) = 0
⟺ (2s − S)·log(p/(1 − p)) + log((s − Sp)/(s − S(1 − p))) = 0,   (46)

where we refer to (2s − S)·log(p/(1 − p)) as Term (a) and to log((s − Sp)/(s − S(1 − p))) as Term (b). To show the required conditions, it is enough to show that (A) Eq (46) < 0 holds at s = S/2 and (B) Eq (46) > 0 holds at s = S. This is because, since Eq (46) is an increasing function w.r.t. s, the above two conditions imply that there exists s*, which is our interest, between S/2 and S. When plugging s = S/2, the factor 2s − S becomes zero. Since p > 1/2 is given (see the above Remark), log((s − Sp)/(s − S(1 − p))) becomes negative. Thus, (A) is proven. We now plug s = S and obtain the following result:

(2S − S)·log(p/(1 − p)) + log((S − Sp)/(S − S(1 − p)))   (47)
 = S·log(p/(1 − p)) + log(S(1 − p)/(Sp)) = S·log(p/(1 − p)) − log(p/(1 − p))   (48)
 = (S − 1)·log(p/(1 − p)) > 0, ∀ p > 1/2.   (49)

Thus, we can conclude that Eq (32), which is our interest, is an increasing function w.r.t. p > 1/2 and attains its minimum value at p = 1/2.

Now, we show that the above proof can also be applied to an even number S. Note that for an even number S, Eq (32) is equal to the sum of Eq (34) starting from s = S/2 + 1 and the middle term at s = S/2, whose weight is

w(p, S/2) = C(S, S/2)·p^{S/2}(1 − p)^{S/2}.   (50)

We derive ∂w(p, S/2)/∂p as follows:

∂w(p, S/2)/∂p = C(S, S/2)·(S/2)·p^{S/2 − 1}(1 − p)^{S/2 − 1}(1 − 2p) ≤ 0, ∀ p ∈ [0.5, 1].   (51)

Since ∂w(p, s)/∂p ≤ 0 holds for s = S/2, our proof remains valid for an even number S. This finalizes the proof of the second statement.

First statement. From the above proof, we now show the lower bound of Eq (27), which is the first statement. Note that the value of Eq (27) (equivalently, of Eq (32)) is minimized at p = 0.5. Thus, we prove the statement by finding the lower bound of Eq (28) at p = 0.5. For simplicity, we again use the notation K := √(d/(4(S − 1))). At p = 0.5, the weights satisfy p^s(1 − p)^{S−s} + (1 − p)^s p^{S−s} = 2·(1/2)^S, so Eq (28) becomes

(1/S)·Σ_{s=0}^{S} s·C(S, s)·[p^s(1 − p)^{S−s} + (1 − p)^s p^{S−s}]·Φ((2s − S − 1)K)   (52)
 = (1/S)·Σ_{s=0}^{S} s·C(S, s)·2^{1−S}·Φ((2s − S − 1)K).   (53)

First, consider an even number S. Then, for the values Φ((2s − S − 1)K), s ∈ {1, …, S}, consider the following relations (the remaining constant factors are omitted):

1·Φ((−S + 1)K) + 2·Φ((−S + 3)K) + ⋯ + (S − 1)·Φ((S − 3)K) + S·Φ((S − 1)K).   (54)

In Eq (54), the first and last CDF values sum to one, i.e., Φ((−S + 1)K) + Φ((S − 1)K) = 1, and likewise Φ((−S + 3)K) + Φ((S − 3)K) = 1. In this manner, we can pair all the terms in Eq (54). In a general sense, each pair has the form k_{11}k_{12} + k_{21}k_{22}, where k_{12} + k_{22} = 1, k_{11} < k_{21}, and k_{12} < k_{22} hold. Note that k_{11}k_{12} + k_{21}k_{22} is lower-bounded by (k_{11} + k_{21})/2, since redistributing weight from the larger CDF value to the smaller one only decreases the overall value. Summing these per-pair lower bounds, we obtain a lower bound of Eq (53):

(1/S)·Σ_{s=0}^{S} s·C(S, s)·2^{1−S}·Φ((2s − S − 1)K) ≥ 1/2 = 0.5.   (55)

Now, we consider an odd number S. We can extend the proof of the even-number case while additionally considering the term at s = (S + 1)/2, whose CDF value Φ(0) has no pair. Since Φ(0) = 0.5, we still ensure that Eq (55) holds in this case as well.
Thus, for both even- and odd-numbered S, the value of Eq (28) at p = 0.5 is lower-bounded by 0.5, which is a global lower bound of Eq (28). This finalizes the proof of the first statement. Moreover, as described, Eq (28) is symmetric w.r.t. p = 0.5 on p ∈ [0, 1]. In conclusion, we have verified that Eq (28) is strictly decreasing on P ∈ [0, 0.5] by extending the result on P ∈ [0.5, 1.0].

Generalization to v_i ∈ C_0. Note that the required condition for the improvement for v_i ∈ C_0 is 1^T Σ_{v_k∈q} x_k < 0. Thus, we further investigate P_{x,e}(1^T Σ_{v_k∈q} x_k < 0 | P), where v_i ∈ e and q = e \ {v_i} (refer to the Remark). First, let s denote the number of nodes that belong to both e_j and C_0. Note that we aim to investigate the following probability:

P_{x,e}(1^T Σ_{v_k∈q} x_k < 0 | p) = P_{x,e}(1^T Σ_{v_k∈q} x_k < 0, v_i ∈ e | p) / P_e(v_i ∈ e | p).   (56)

Here, the following holds:

Σ_{v_k∈q} x_k ~ N((2s − 1 − S)μ_0, (S − 1)I),   (57)
1^T Σ_{v_k∈q} x_k ~ N(−(2s − 1 − S)d/2, (S − 1)d),   (58)
P_x(1^T Σ_{v_k∈q} x_k < 0 | s) = Φ((2s − 1 − S)·√(d/(4(S − 1)))).   (59)

Note that Eq (59) is equivalent to Eq (18). Moreover, the following also holds:

P_e(s, v_i ∈ e | p) = C(S, s)·[ p^s(1 − p)^{S−s}/2 + (1 − p)^s p^{S−s}/2 ]·P_e(v_i ∈ e | s),   (60)

where the first and second terms in the bracket are the probabilities of s when the latent hyperedge class is c = 0 and c = 1, respectively. Note that Eq (60) is equivalent to Eq (21). Finally, we can express Eq (56) as follows:

P_{x,e}(1^T Σ_{v_k∈q} x_k < 0 | p) = Σ_s P_x(1^T Σ_{v_k∈q} x_k < 0 | s)·P_e(s, v_i ∈ e | p) / P_e(v_i ∈ e | p)   (61)
 = [ Σ_s s·C(S, s)(p^s(1 − p)^{S−s} + (1 − p)^s p^{S−s})/2 · Φ((2s − 1 − S)√(d/(4(S − 1)))) ] / (S/2)   (62)
 = (1/S)·Σ_{s=0}^{S} s·C(S, s)·[p^s(1 − p)^{S−s} + (1 − p)^s p^{S−s}]·Φ((2s − 1 − S)√(d/(4(S − 1)))).   (63)

Importantly, Eq (63) is equal to Eq (28). This result implies that Theorem 2 and its proof also generalize to the case of v_i ∈ C_0.

A.3 EMPIRICAL DEMONSTRATION OF THE PROOF

To empirically validate the proof provided above, we conduct toy experiments. We consider every combination of S ∈ {2, 3, 4, 5, 6, 7, 8}, which is the size of e_j (i.e., |e_j|), and d ∈ {2, 3, 4, 5, 6, 7, 8}, which is the dimension of the input node features x_i ∈ R^d, ∀ v_i ∈ V. Then, we visualize how Eq (27) varies w.r.t. S, d, and P ∈ [0, 1]. Note that the X-axis of each plot corresponds to P. As shown in Figure 3, in all cases, the value of P_{x,e}(1^T Σ_{v_k∈q} x_k > 0 | P) is strictly increasing w.r.t. P ∈ [0.5, 1] and strictly decreasing w.r.t. P ∈ [0, 0.5] (statement 2), and the value is lower-bounded by 0.5 (statement 1). Thus, we empirically verify that Theorem 2 is valid.

B ADDITIONAL THEORETICAL ANALYSIS

B.1 EXISTENCE OF REASONABLE SOLUTIONS IN THE HYPEREDGE FILLING TASK

In this section, we investigate the existence of a reasonable solution to the proposed hyperedge filling task. To this end, we first formalize the notion of the hyperedge filling task being solved.

Definition 1 (Solved hyperedge filling task). For a hypergraph G = (V, E), the hyperedge filling task is solved when the following holds for a probabilistic model p_(X,E,Θ)(·):

p_(X,E,Θ)(v_i | e_j \ {v_i}) > p_(X,E,Θ)(v_k | e_j \ {v_i}), ∀ v_i ∈ e_j, ∀ v_k ∈ {v_s : {v_s} ∪ (e_j \ {v_i}) ∉ E, v_s ∈ V}, ∀ e_j ∈ E.

Roughly, Definition 1 indicates that, for every possible hyperedge filling case in a hypergraph G, the probabilistic model p always assigns a greater probability to filling a correct node into a given query subset than to filling a wrong node. It is important to note that our goal is to identify whether there exists a reasonable solution for p(·), not a trivial one. Thus, we formalize the concept of a reasonable solution as follows:

Definition 2 (Reasonable solution). Let Q := {e_j \ {v_i} : v_i ∈ e_j, e_j ∈ E}. For a hypergraph G = (V, E), a reasonable solution of the hyperedge filling task is a pair of node representations Z ∈ R^{|V|×d} and query-subset representations Q ∈ R^{|Q|×d} such that:

z_i^T q_j > z_k^T q_j, ∀ v_i ∈ S_j, ∀ v_k ∈ V \ S_j, ∀ q_j ∈ Q, where S_j = {v_s : ({v_s} ∪ q_j ∈ E) ∧ (v_s ∉ q_j), v_s ∈ V}.
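To make Definition 2 concrete, below is a minimal sketch (not part of the original analysis) that checks whether a candidate pair of node and query-subset representations satisfies the condition of Definition 2; the names check_reasonable_solution, Z, Q_emb, and queries are hypothetical.

```python
import numpy as np

def check_reasonable_solution(Z, Q_emb, queries, hyperedges):
    """Check Definition 2: for every query subset q_j, every correct node v_i in S_j
    must score strictly higher than every node v_k outside S_j.
    Z: (|V|, d) node representations; Q_emb: (|Q|, d) query-subset representations;
    queries: list of frozensets (the query subsets q_j); hyperedges: set of frozensets (E)."""
    n = Z.shape[0]
    for j, q in enumerate(queries):
        # S_j: nodes not in q_j that complete q_j into an existing hyperedge
        S_j = {v for v in range(n) if v not in q and frozenset(q | {v}) in hyperedges}
        if not S_j:
            continue
        scores = Z @ Q_emb[j]  # z_i^T q_j for every node v_i
        wrong = [v for v in range(n) if v not in S_j]
        if wrong and min(scores[v] for v in S_j) <= max(scores[v] for v in wrong):
            return False  # some wrong node ties with or beats a correct node
    return True

# Toy usage on a 4-node hypergraph with E = {{0, 1, 2}, {1, 2, 3}}
E = {frozenset({0, 1, 2}), frozenset({1, 2, 3})}
queries = [frozenset(e - {v}) for e in E for v in e]
rng = np.random.default_rng(0)
Z, Q_emb = rng.normal(size=(4, 8)), rng.normal(size=(len(queries), 8))
print(check_reasonable_solution(Z, Q_emb, queries, E))  # random embeddings rarely qualify
```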
Figure 3: Empirical demonstration of Theorem 2. Each panel corresponds to one combination of S ∈ {2, …, 8} and d ∈ {2, …, 8}; the X-axis is P, the Y-axis is the probability P_{x,e}(1^T Σ_{v_k∈q} x_k > 0 | P), and the theoretical lower bound is marked. Note that S denotes the size of a hyperedge, and d denotes the dimension of features. As stated, P_{x,e}(1^T Σ_{v_k∈q} x_k > 0 | P) is strictly increasing in P ∈ [0.5, 1] (statement 2) and lower-bounded by 0.5 (statement 1).

Now, we analyze whether there exists a reasonable solution for any hypergraph G = (V, E).

Theorem 3 (Existence of a reasonable solution). For any hypergraph G = (V, E), there exists a reasonable solution for some embedding dimension d.

Proof. Consider a binary matrix B ∈ {0, 1}^{|V|×|Q|}, where each column j ∈ {1, …, |Q|} is the one-hot encoding of S_j over V (i.e., B_ij = 1[v_i ∈ S_j], where 1[·] is the indicator function). Here, we consider a matrix decomposition of B, specifically the singular value decomposition B = UΣV^T, where U ∈ R^{|V|×rank(B)}, Σ ∈ R^{rank(B)×rank(B)}, and V ∈ R^{|Q|×rank(B)}. Let Z := UΣ and Q := V. In such a case, the following holds:

z_i^T q_j = 1 if v_i ∈ S_j, and 0 otherwise.   (64)

Thus, z_i^T q_j > z_k^T q_j, ∀ v_i ∈ S_j, ∀ v_k ∈ V \ S_j, ∀ q_j ∈ Q always holds, which implies that Z := UΣ and Q := V constitute a reasonable solution. From Theorem 3, we demonstrate that there exists a reasonable solution for any hypergraph G.

B.2 DIMENSIONAL COLLAPSE OF HYPEBOY WITHOUT PROJECTION HEADS

In this subsection, we provide a theoretical analysis of why HYPEBOY without projection heads may suffer from dimensional collapse. Jing et al. (2022) suggested that one of the reasons for dimensional collapse in contrastive learning lies in the gradient flow (specifically, the mechanism by which representations are updated via the contrastive loss). Specifically, consider a linear model that yields representations of n data points by Z = XW, where X ∈ R^{n×d} are node features and W ∈ R^{d×k} is a learnable parameter matrix. When the parameters W are updated using contrastive losses, the k-th biggest singular value σ_k of the weight matrix evolves proportionally to its own value (e.g., σ_k′ ≈ c·σ_k, where σ_k′ is the k-th biggest singular value of the updated W and c is a positive constant). Due to this mechanism, small singular values grow slowly or diminish, causing dimensional collapse. In this analysis, motivated by the analysis of Jing et al. (2022), we track how the singular values of node representations change while learning the hyperedge filling task.

Following the setting of Jing et al. (2022), we consider a linear model: Z = XW. Note that X are the given features of the data points. Thus, the change of the singular values of the embeddings Z depends on W. We first define the loss of hyperedge filling as follows:

L = −Σ_{e_j∈E} Σ_{v_i∈e_j} log [ exp(z_i^T Σ_{v_s∈e_j \ {v_i}} z_s) / Σ_{v_t∈V} exp(z_t^T Σ_{v_s∈e_j \ {v_i}} z_s) ].   (65)

Now, we obtain the closed form of the following update equation, which is based on gradient descent: W ← W − γ∇_W L, where γ is a fixed learning rate. We first derive ∂L/∂Z and then utilize the chain rule together with ∂Z/∂W. The derivative of the numerator terms of Eq (65) can be written as follows:

∂/∂Z [ Σ_{e_j∈E} Σ_{v_i∈e_j} z_i^T Σ_{v_s∈e_j \ {v_i}} z_s ] = EZ,   (66)

where E ∈ R^{|V|×|V|} with E_ij = 2·|{e_k : {v_i, v_j} ⊆ e_k, e_k ∈ E}|. We now derive the derivative of the denominator terms of Eq (65), i.e., of Σ_{e_j∈E} Σ_{v_i∈e_j} log Σ_{v_t∈V} exp(z_t^T Σ_{v_s∈e_j \ {v_i}} z_s). We denote the multiset of every possible query subset as Q (i.e., Q = {e_j \ {v_i} : v_i ∈ e_j, e_j ∈ E}).
The reason for using a multiset is that the resulting subsets can be duplicated due to the nature of overlapping hyperedges (e.g., when we omit the node v_1 from {v_1, v_2, v_3} and the node v_4 from {v_4, v_2, v_3}, the remaining sets become identical). We split the derivation to the node level. In addition, we let Q_k denote the subsets in Q that include a node v_k. Then, for a node v_k, the gradient of the denominator terms is computed as follows:

Σ_{q_j∈Q} [ exp(z_k^T Σ_{v_s∈q_j} z_s) / Σ_{v_t∈V} exp(z_t^T Σ_{v_s∈q_j} z_s) ] · Σ_{v_s∈q_j} z_s   (67)
 + Σ_{q_l∈Q_k} Σ_{v_i∈V} [ exp(z_i^T Σ_{v_s∈q_l} z_s) / Σ_{v_t∈V} exp(z_t^T Σ_{v_s∈q_l} z_s) ] · z_i.   (68)

We define two more matrices to express Eqs (67) and (68) in a simple matrix form. By assigning an index j ∈ {1, …, |Q|} to each q_j ∈ Q, we define a binary matrix Q ∈ R^{|Q|×|V|} that represents Q: Q_ji = 1 if v_i ∈ q_j, and 0 otherwise. In addition, we define a score matrix A ∈ R^{|Q|×|V|} that models the probability of filling each node v_i ∈ V into each query subset q_j ∈ Q, defined as follows:

A_ji = exp(z_i^T Σ_{v_s∈q_j} z_s) / Σ_{v_t∈V} exp(z_t^T Σ_{v_s∈q_j} z_s).   (69)

By using A and Q, we can express Eqs (67) and (68) for all nodes jointly as follows:

(Q^T A + A^T Q)Z.   (70)

Finally, we express ∂L/∂Z as follows:

∂L/∂Z = (−E + Q^T A + A^T Q)Z.   (71)

In addition, by using the chain rule and matrix derivative rules, ∂L/∂W is derived as follows:

∂L/∂W = X^T(−E + Q^T A + A^T Q)XW.   (72)

We denote the updated W as W′. To sum up, the update rule W′ := W − γ∇_W L is expressed as follows:

W′ := ( I + γX^T(E − Q^T A − A^T Q)X ) W.   (73)

Here, we denote the k-th biggest singular values of W′ and W as σ_k′ and σ_k, respectively. Note that I + γX^T(E − Q^T A − A^T Q)X is a function of X, E, and W. Thus, we denote this term as g(X, E, W) ∈ R^{d×d}. If we take a closer look at g(X, E, W)W, W′ is a near-linear transformation of W itself.⁷ In such a case, from the min-max characterization of singular values, the following inequality holds:

σ_k′ ≤ σ̃_1·σ_k, where σ̃_1 is the biggest singular value of the matrix g(X, E, W).   (74)

Eq (74) indicates that σ_k′ is upper-bounded by the product of the biggest singular value of g(X, E, W) and σ_k. While it is not an exact equality, this result implies that the updated singular value is likely to be correlated with its previous value. Hence, small singular values are likely to grow slowly, similarly to the analysis of Jing et al. (2022), which may cause dimensional collapse.

⁷ It is not a perfect linear transformation, since the function g(·) is itself a function of W.

C DISCUSSIONS

C.1 COMPARISON WITH HYPERGRL

In this section, we compare HYPEBOY against HyperGRL (Du et al., 2022). Both methodologies utilize a strategy that leverages a node and a subset of a hyperedge for self-supervised learning. Specifically, this involves using a pair that consists of a node v_i belonging to a hyperedge e_j and the subset of e_j excluding v_i (i.e., e_j \ {v_i}) for training. However, HYPEBOY differs significantly from HyperGRL in various aspects.

Motivation. The motivation of HYPEBOY is markedly distinct from that of HyperGRL. The hyperedge filling task aims to correctly fill the missing node for a given (query) subset. The underlying motivation is to train a machine learning model to generate hyperedges, thereby identifying the entities that form group interactions. This may enable the model to understand the complex, higher-order relationships within a given hypergraph. This objective aligns with the goals of question-answering approaches in NLP (Devlin et al., 2019). Conversely, the key motivation of HyperGRL, which performs the node-level pretext task introduced by Du et al.
(2022), is described as follows: a node inside one specific hyperedge should be distinguished from nodes outside this hyperedge, given the context of the node. Here, the motivations behind Du et al. (2022) are akin to those of proximity-based approaches, including LINE (Tang et al., 2015).

Method design. The design choices of HYPEBOY and HyperGRL differ significantly. Below, we outline these technical distinctions.
- Input: HYPEBOY receives the original hypergraph structure, while HyperGRL receives a clique-expanded graph.
- Backbone encoder: HYPEBOY utilizes hypergraph neural networks, while HyperGRL employs graph neural networks.
- Augmentation: HYPEBOY augments node features and hyperedges, while HyperGRL does not utilize an augmentation strategy.
- Projection head: HYPEBOY utilizes different projection-head parameters for node projection and query-subset projection, while HyperGRL utilizes the same parameters for both.
- Negative samples: HYPEBOY utilizes all nodes as negative samples for each hyperedge filling case, while HyperGRL samples several nodes.
- SSL signal: HYPEBOY utilizes all possible (node, query subset) pairs of a given hypergraph, while HyperGRL samples a single pair for each hyperedge.
- Feature reconstruction: HYPEBOY utilizes a node feature reconstruction warm-up, while HyperGRL utilizes a hyperedge prediction loss as another SSL signal.

Performance comparison. It is important to note that HYPEBOY is superior to HyperGRL as an SSL strategy for hypergraphs. We provide empirical evidence below:
- Under the fine-tuning protocol, HYPEBOY outperforms HyperGRL on every dataset (refer to Section 5.1).
- Under the linear-evaluation protocol with the node classification task, HYPEBOY outperforms HyperGRL on every dataset (refer to Section 5.2).
- Under the linear-evaluation protocol with the hyperedge prediction task, HYPEBOY outperforms HyperGRL on 10 out of 11 datasets (refer to Section 5.2).

C.2 POTENTIAL EXTENSIONS OF HYPEBOY

While this work mainly assumes static and undirected hypergraphs, many real-world group interactions have directions and temporal dynamics. Directed hypergraphs (Kim et al., 2023b) and temporal hypergraphs (Lee & Shin, 2023b) are utilized to incorporate directional and temporal information, respectively. HYPEBOY can be extended to such hypergraphs by adequately considering their unique characteristics. Moreover, while we assume a transductive hypergraph setting, there is a recent effort to perform representation learning on hypergraphs in an inductive setting (Behrouz et al., 2023). Considering this, HYPEBOY can also be extended to the inductive case by incorporating the idea of Behrouz et al. (2023). Lastly, HYPEBOY can be enhanced to capture more complex characteristics of a hypergraph (e.g., hyperedge-dependency of nodes (Choe et al., 2023)).

D DATASET DETAILS

In this section, we provide details of the datasets used. In our experiments, we utilize 11 hypergraph benchmark datasets. Following Lee & Shin (2023a), we remove nodes that do not belong to any hyperedge. Data statistics of each dataset are reported in Table 4.

Table 4: Statistics of the 11 datasets we utilize in our experiments.
             Citeseer  Cora   Pubmed  Cora-CA  DBLP-P  DBLP-A  AMiner  IMDB   MN-40   20News  House
|V|          1,458     1,434  3,840   2,388    41,302  2,591   20,201  3,939  12,311  16,242  1,290
|E|          1,079     1,579  7,963   1,072    22,263  2,690   8,052   2,015  12,311  100     341
Σ_{e∈E} |e|  3,453     4,786  34,629  4,585    99,561  6,201   32,028  9,560  61,555  65,451  11,843
# Classes    6         7      3       7        6       4       12      3      40      4       2
# Features   3,703     1,433  500     1,433    1,425   334     500     3,066  100     100     2

Source of each dataset. The sources of the datasets used in this work are as follows:
- The Cora, Citeseer, Pubmed, Cora-CA, and DBLP-P datasets are from the work of Yadati et al. (2019).
- The DBLP-A and IMDB datasets are from the work of Wang et al. (2019).
- The AMiner dataset is from the work of Zhang et al. (2019).
- The ModelNet-40 (MN-40) dataset is from the work of Wu et al. (2015).
- The 20Newsgroups (20News) dataset is from the work of Dua et al. (2017).
- The House dataset is from the work of Chien et al. (2022).

Co-citation datasets. We utilize three co-citation datasets: Cora, Citeseer, and Pubmed. In these datasets, each node represents a publication, and each hyperedge represents a set of publications co-cited by a publication. For example, if a publication has cited three publications that correspond to v_i, v_j, and v_k, respectively, these publications (nodes) are grouped as a hyperedge {v_i, v_j, v_k} ∈ E. Node features are bag-of-words representations of the corresponding publication (node). The node class indicates the academic category of the corresponding publication.

Co-authorship datasets. We utilize four co-authorship datasets: Cora-CA, DBLP-P, DBLP-A, and AMiner. In Cora-CA, DBLP-P, and AMiner, each node represents a publication, and a set of publications written by the same author is grouped as a hyperedge. Node features are bag-of-words representations of the corresponding publication (node). The node class indicates the academic category of the corresponding publication. Conversely, in DBLP-A, each node represents an author, and the co-authors of a publication are grouped as a hyperedge. Node features are bag-of-words representations of the research keywords of the corresponding author.⁸ The node class indicates the primary research area of the corresponding author.

Computer graphics dataset. We utilize one computer graphics dataset: ModelNet-40 (MN-40). In this dataset, each node represents a visual object, and hyperedges are synthetic hyperedges created from a k-NN graph constructed according to the features of each data point, following Feng et al. (2019) and Chien et al. (2022). Node features are embeddings (obtained via GVCNN (Feng et al., 2018) and MVCNN (Su et al., 2015)) of the corresponding visual object. The node class indicates the category of the corresponding visual object.

Movie-actor dataset. We utilize one movie-actor relational dataset: IMDB. In this dataset, each node indicates a movie, and the filmography of an actor is grouped as a hyperedge. Node features are bag-of-words representations of the plot of the corresponding movie. The node class indicates the genre of the movie.

News dataset. We utilize one news dataset: 20Newsgroups (20News). In this dataset, each node represents a document of the 20 newsgroups, and a set of documents containing a particular word is grouped as a hyperedge. Node features are TF-IDF representations of the news messages of the corresponding news document. The node class indicates the category of the corresponding news document.

Political membership dataset. We utilize one political membership dataset: House.
In this dataset, each node represents a member of the US House of Representatives, and members belonging to the same committee are grouped as a hyperedge. Node features are generated by adding noise to the one-hot encoded label vector, following Chien et al. (2022). The node class indicates the political party of the corresponding member.

⁸ Here, "corresponding author" refers to the author that the node corresponds to; it should not be confused with the academic corresponding author of a publication.

E EXPERIMENTAL DETAILS

In this section, we provide our experimental details. We first describe the machines used in our experiments. Then, we provide details of HYPEBOY and the baseline methods, including their implementation and hyperparameter settings.

E.1 MACHINES

All experiments are conducted on a machine with NVIDIA RTX 8000 D6 GPUs (48GB memory) and two Intel Xeon Silver 4214R processors.

E.2 DETAILS OF BASELINE METHODS

Neural networks. We utilize 10 (semi-)supervised baseline models: MLP, HGNN (Feng et al., 2019), HyperGCN (Yadati et al., 2019), HNHN (Dong et al., 2020), three unified hypergraph networks (UniGCN, UniGIN, UniGCNII) (Huang & Yang, 2021), AllSet (Chien et al., 2022), ED-HNN (Wang et al., 2023a), and PhenomNN (Wang et al., 2023b). For all the methods, we fix the hidden dimension to 128 and the number of layers to 2.

Graph SSL. To utilize two generative graph SSL methods, GraphMAE2 (Hou et al., 2023) and MaskGAE (Li et al., 2023), and one graph-based generative SSL method for hypergraphs, HyperGRL (Du et al., 2022), we transform the input hypergraph into a graph by using clique expansion, which converts each hyperedge into a clique of a graph (Dong et al., 2020). Moreover, we utilize a 2-layer GCN (Kipf & Welling, 2017) with hidden dimension 128 as the backbone encoder of these SSL methods.

H-GD. One of our baseline methods, H-GD, directly extends the group discrimination approach, an SSL technique designed for ordinary graphs (Zheng et al., 2022). We strictly follow the overall structure suggested by Zheng et al. (2022), while the feature augmentation and topology augmentation are replaced with τ_x and τ_e, respectively, which are described in Section 4.1.

E.3 DETAILS OF SETTINGS

Hyperedge prediction setting. Note that we need negative hyperedge samples to evaluate a model on the hyperedge prediction task. In our experiments, we utilize negative hyperedge samples generated by the Size Negative Sample strategy (Patil et al., 2020). Specifically, we first sample the size of the current hyperedge from the real-world hyperedge size distribution of the corresponding dataset. Then, we sample nodes uniformly at random to fill the corresponding hyperedge. Using this strategy, we obtain the same number of negative hyperedges as ground-truth hyperedges and utilize them for training/validation/testing.

Hyperedge embedding. To perform hyperedge prediction, we require a representation of each hyperedge. However, since the outputs of the utilized HNNs and GNNs are node embeddings, we need an additional technique to acquire hyperedge representations from node embeddings. To this end, we utilize a max-min pooling aggregation function (Yadati et al., 2020). For example, for a hyperedge e_i = {v_j, v_k, v_ℓ}, whose constituent node embeddings are z_j, z_k, z_ℓ ∈ R^d, respectively, its hyperedge embedding h_i is obtained as follows:

h_i = ( max_{t∈{j,k,ℓ}} z_{ts} − min_{t∈{j,k,ℓ}} z_{ts} )_{s=1}^{d}.   (75)

Finally, we utilize h_i as the representation of e_i.
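To make the aggregation in Eq (75) concrete, the following is a minimal sketch, assuming node embeddings are stored row-wise in a PyTorch tensor; the function name maxmin_hyperedge_embedding is hypothetical and not part of the released code.

```python
import torch

def maxmin_hyperedge_embedding(Z: torch.Tensor, hyperedge: list) -> torch.Tensor:
    """Eq (75): per dimension s, take max_t z_{ts} - min_t z_{ts} over the member nodes.
    Z: (|V|, d) node embeddings; hyperedge: indices of the constituent nodes."""
    members = Z[hyperedge]                                          # (|e_i|, d)
    return members.max(dim=0).values - members.min(dim=0).values   # (d,)

# Toy usage: a hyperedge e_i = {v_0, v_2, v_3} over 5 nodes with 4-dimensional embeddings
Z = torch.randn(5, 4)
h_i = maxmin_hyperedge_embedding(Z, [0, 2, 3])
print(h_i.shape)  # torch.Size([4])
```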
E.4 DETAILS OF IMPLEMENTATION

Feature reconstruction warm-up of HYPEBOY. As described in Section 4.4, before training an encoder HNN with HYPEBOY, we perform a feature reconstruction process, which directly extends GraphMAE (Hou et al., 2022) to the hypergraph structure. This process mitigates the encoder's over-reliance on the projection heads, improving performance in downstream tasks (Section 5.3). For the decoder HNN g_ψ, we utilize a 2-layer UniGCNII (Huang & Yang, 2021), which has the same architecture as the encoder HNN. We mask 50% of the input nodes and set the hyperedge augmentation ratio p_e to 0.2.

Table 5: The hyperparameter combination of HYPEBOY that shows the best validation accuracy on each dataset. p_v ∈ [0, 1] indicates the magnitude of feature augmentation, p_e indicates the magnitude of topological augmentation, and SSL epochs indicates the number of training epochs of the hyperedge filling task.

            Citeseer  Cora  Pubmed  Cora-CA  DBLP-P  DBLP-A  AMiner  IMDB  MN-40  20News  House
p_v         0.2       0.4   0.0     0.4      0.0     0.1     0.2     0.3   0.4    0.2     0.4
p_e         0.9       0.9   0.5     0.8      0.6     0.9     0.6     0.9   0.5    0.5     0.8
SSL epochs  120       200   180     200      180     120     100     180   180    80      140

Projection heads of HYPEBOY. We utilize a two-layer MLP with a ReLU (Nair & Hinton, 2010) activation function, without dropout and normalization layers, for both the node projection head f_ϕ and the set projection head f_ρ (Section 4.2). Note that these projection heads are updated via the hyperedge filling task together with the encoder HNN.

Training details and hyperparameters. We train all models using the Adam optimizer (Kingma & Ba, 2015) with a fixed weight decay rate of 10^{-6}. We fix the hidden dimension and dropout rate of all models to 128 and 0.5, respectively. When training any neural network for downstream tasks, we train the model for 200 epochs, and every 10 epochs, we evaluate its validation accuracy. Then, we use the model checkpoint that yields the best validation accuracy for the final evaluation on the test dataset. For the linear evaluation protocol of node classification, we utilize a logistic classifier with a learning rate of 0.001. For the linear evaluation protocol of hyperedge classification, we utilize a two-layer MLP with a learning rate of 0.001 and a dropout rate of 0.5. Regarding hyperparameter tuning, for each model and dataset, we find the hyperparameter combination that gives the best validation performance. Then, with the selected hyperparameters, we train the model on the training dataset and evaluate the trained model on the test dataset. For all the supervised models, we tune the learning rate as a hyperparameter within {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}. Moreover, for PhenomNN, which is a state-of-the-art hypergraph neural network, we additionally tune its other hyperparameters. Specifically, along with the learning rate, we tune λ_0 and λ_1 within {0, 1, 10}, which are the weights of different message-passing functions, and α within {0, 1, 10}, which is the weight of a node's own features. For all the SSL baseline methods, we set broader search spaces, as follows:
- TriCL (Lee & Shin, 2023a): We tune the feature augmentation magnitude within {0.2, 0.3, 0.4}, the hyperedge augmentation magnitude within {0.2, 0.3, 0.4}, and the learning rate of all its components within {0.01, 0.001, 0.0001}.
- HyperGCL (Wei et al., 2022): We tune the learning rate of the encoder within {0.01, 0.001, 0.0001}, that of the view generator within {0.01, 0.001, 0.0001}, and the weight of the contrastive loss β within {0.5, 1, 2}.
- H-GD, a variant of GD (Zheng et al., 2022): We tune the feature augmentation magnitude within {0.2, 0.3, 0.4}, the hyperedge augmentation magnitude within {0.2, 0.3, 0.4}, and the learning rate of all its components within {0.01, 0.001, 0.0001}.
- GraphMAE2 (Hou et al., 2023): We tune the mask ratio within {0.25, 0.5, 0.75} and the learning rate of all its components within {0.01, 0.001, 0.0001}.
- MaskGAE (Li et al., 2023): We tune the degree loss weight α within {0.001, 0.002, 0.003} and the learning rate of all its components within {0.01, 0.001, 0.0001}.
- HyperGRL (Du et al., 2022): We tune the negative-sampling adjacency matrix order k within {1, 2, 3} and the learning rate of all its components within {0.01, 0.001, 0.0001}.

In addition, we treat the number of self-supervised learning epochs of the above methods as a hyperparameter, searched within {50, 100, 150, …, 450, 500}.

Table 6: Comparison between HYPEBOY (denoted by w/ Aug.) and a variant of HYPEBOY that does not use the augmentation strategy (denoted by w/o Aug.) in the node classification task under two protocols. The best performance is colored green. In most of the settings, HYPEBOY outperforms its variant.

Protocol           Method    Citeseer    Cora        Pubmed      Cora-CA     DBLP-P      DBLP-A      AMiner      IMDB        MN-40       20News      House
Fine-tuning        w/o Aug.  53.5 (8.7)  58.6 (8.3)  75.5 (4.0)  62.9 (6.0)  87.8 (0.4)  79.7 (2.3)  33.9 (2.1)  47.2 (2.5)  90.7 (0.7)  77.5 (0.9)  69.9 (4.9)
Fine-tuning        w/ Aug.   56.7 (9.8)  62.3 (7.7)  77.0 (3.4)  66.3 (4.6)  88.2 (0.4)  80.6 (2.3)  34.1 (2.2)  47.6 (2.5)  90.4 (0.9)  77.6 (0.9)  70.4 (4.8)
Linear evaluation  w/o Aug.  55.8 (9.0)  60.6 (7.3)  72.7 (3.4)  63.3 (4.6)  87.6 (0.5)  81.4 (2.5)  34.2 (2.7)  47.8 (2.0)  89.7 (1.9)  75.1 (1.6)  67.5 (1.6)
Linear evaluation  w/ Aug.   59.6 (9.9)  63.5 (9.4)  75.0 (3.4)  66.0 (4.6)  87.9 (0.5)  81.2 (2.7)  34.3 (3.2)  48.8 (1.8)  89.2 (2.2)  75.7 (2.1)  69.4 (5.4)

Figure 4: Analyzing dimensional collapse of HYPEBOY with/without projection heads on five benchmark datasets (panels: Citeseer, Pubmed, Cora-CA, DBLP-P, and DBLP-A; legend: with vs. without projection heads; X-axis: singular value rank index; Y-axis: log of singular values). While HYPEBOY does not suffer from dimensional collapse, its variant without projection heads suffers from this issue.

For HYPEBOY, we tune the feature augmentation magnitude p_x within {0.0, 0.1, 0.2, 0.3, 0.4} and the hyperedge augmentation magnitude p_e within {0.5, 0.6, 0.7, 0.8, 0.9}. We fix the learning rate and the number of training epochs of the feature reconstruction warm-up to 0.001 and 300, respectively. Regarding the overall training process of HYPEBOY, we first train an encoder HNN with the feature reconstruction process for 300 epochs. Then, we train the encoder HNN with the hyperedge filling task. Specifically, the number of hyperedge filling training epochs is a hyperparameter, which we search within {20, 40, …, 180, 200}. We report the hyperparameter combination of each dataset that shows the best validation accuracy in our fine-tuning protocol experiment (Section 5.1) in Table 5.

F ADDITIONAL EXPERIMENTAL RESULTS

In this section, we present additional experimental results that were excluded from the main paper due to space constraints.
F.1 COMPARISON AGAINST HYPEBOY WITHOUT AUGMENTATION

As described in Section 4.1, augmentation (specifically, masking) has played a key role in various generative SSL methods for obtaining effective representations. Motivated by these findings, we incorporate the augmentation processes τ_x and τ_e into HYPEBOY. To assess the impact of augmentation, we analyze a variant of HYPEBOY that omits the augmentation step. In this variant, the input to HYPEBOY's encoder HNN consists of the ground-truth features and topology, (X, E). We then compare the performance of this variant against the proposed HYPEBOY on node classification, under both fine-tuning and linear evaluation protocols. As shown in Table 6, HYPEBOY consistently outperforms the variant across most of the settings, demonstrating the necessity of augmentation for the model's performance.

F.2 DIMENSIONAL COLLAPSE ANALYSIS

In Section 4.2, we discuss the role of projection heads in terms of dimensional collapse: projection heads encourage an encoder HNN to avoid dimensional collapse. We further analyze this phenomenon on five benchmark hypergraph datasets: Citeseer, Pubmed, Cora-CA, DBLP-P, and DBLP-A. As shown in Figure 4, across all these datasets, we verify that certain singular values of the embeddings obtained via HYPEBOY without projection heads drop to zero. This empirically demonstrates that dimensional collapse occurs in such cases (refer to the red lines). Notably, such issues are not observed in HYPEBOY (refer to the blue lines).

Table 7: Performance of SSL methods with various HNN encoders under the fine-tuning protocol. The best and second-best performances are colored green and yellow, respectively. NA indicates the corresponding encoder without pre-training. HYPEBOY outperforms other hypergraph SSL methods in most of the settings.

Encoder            Method    Citeseer     Cora         Pubmed      Cora-CA     DBLP-P
HGNN               NA        41.9 (7.8)   50.0 (7.2)   72.9 (5.0)  50.2 (5.7)  85.3 (0.8)
HGNN               TriCL     49.3 (10.3)  56.9 (10.0)  74.3 (4.1)  57.2 (6.1)  87.4 (5.0)
HGNN               HyperGCL  47.2 (9.3)   59.5 (7.6)   76.3 (4.3)  53.3 (6.9)  85.8 (0.4)
HGNN               HYPEBOY   52.1 (9.1)   61.1 (9.8)   76.8 (4.3)  60.0 (6.0)  87.4 (0.5)
Mean Pooling Conv  NA        39.9 (10.0)  48.5 (8.9)   72.4 (4.0)  50.6 (7.5)  85.7 (0.7)
Mean Pooling Conv  TriCL     46.6 (8.3)   59.8 (8.9)   74.0 (3.8)  57.3 (4.8)  87.0 (5.0)
Mean Pooling Conv  HyperGCL  43.2 (9.4)   59.6 (7.6)   74.2 (4.3)  55.0 (6.7)  85.7 (0.9)
Mean Pooling Conv  HYPEBOY   54.5 (8.3)   61.2 (7.8)   77.3 (3.4)  63.0 (4.4)  86.8 (0.5)
SetGNN             NA        43.6 (6.3)   48.9 (9.2)   69.8 (4.5)  50.7 (6.7)  82.8 (1.0)
SetGNN             TriCL     48.9 (7.6)   57.7 (8.4)   72.0 (4.5)  57.8 (4.9)  85.1 (0.5)
SetGNN             HyperGCL  46.8 (8.5)   54.6 (7.5)   73.1 (3.9)  56.9 (5.3)  83.6 (0.7)
SetGNN             HYPEBOY   53.8 (9.1)   62.3 (6.4)   73.4 (3.4)  61.0 (3.9)  84.2 (0.7)

Figure 5: Analyzing alignment and uniformity of representations obtained via HYPEBOY (panels: Citeseer, Pubmed, Cora-CA, DBLP-A, and DBLP-P). In most cases, representations obtained by HYPEBOY achieve both alignment and uniformity.

F.3 ALIGNMENT AND UNIFORMITY ANALYSIS

In Section 4.3, we discuss the role of our p_(X,E,Θ)(·) design choice in promoting alignment and uniformity of node representations. This characteristic is further analyzed across five datasets: Citeseer, Pubmed, Cora-CA, DBLP-P, and DBLP-A. As shown in Figure 5, node representations obtained from HYPEBOY achieve both uniformity (first column) and class alignment (other columns) in most of the cases.

Table 8: Efficacy of pretext tasks under the fine-tuning protocol. The best performances are colored green. Among the three tasks, the hyperedge filling task shows superior performance.
Method                        Citeseer    Cora         Pubmed      Cora-CA     DBLP-P
Naive contrastive learning    59.2 (6.7)  44.5 (10.2)  75.2 (4.0)  62.1 (5.3)  86.7 (0.5)
Naive feature reconstruction  58.6 (8.0)  51.5 (9.2)   74.7 (5.1)  61.9 (7.3)  87.3 (0.6)
Naive hyperedge filling       60.7 (8.2)  51.6 (11.8)  76.2 (3.6)  63.5 (6.0)  88.1 (0.5)

Table 9: Comparing HYPEBOY against other topological SSL methods under the fine-tuning protocol. The best performances are colored green. Among all four methods, HYPEBOY shows the best performance.

Data        Method                       Citeseer     Cora        Pubmed      Cora-CA     DBLP-P
Graph       Edge filling                 44.1 (11.3)  55.7 (8.2)  73.6 (3.8)  55.5 (8.2)  86.9 (0.5)
Graph       Hyperedge filling            42.8 (9.2)   56.4 (6.0)  73.6 (4.0)  55.0 (8.6)  86.7 (0.5)
Hypergraph  Hyperedge prediction         43.9 (7.6)   56.9 (9.7)  73.0 (4.9)  59.1 (6.5)  85.8 (0.6)
Hypergraph  Hyperedge filling (HYPEBOY)  56.7 (9.8)   62.3 (7.7)  77.0 (3.4)  66.3 (4.6)  88.2 (0.4)

F.4 ENCODER ANALYSIS

In this section, we experimentally verify that the efficacy of HYPEBOY extends beyond UniGCNII (Huang & Yang, 2021). For this purpose, we employ three HNNs: HGNN (Feng et al., 2019), Mean Pooling Conv (Lee & Shin, 2023a), and SetGNN (Chien et al., 2022). Notably, the latter two encoders, Mean Pooling Conv and SetGNN, serve as the backbone encoders of TriCL (Lee & Shin, 2023a) and HyperGCL (Wei et al., 2022), respectively. The experimental setting mirrors that of the fine-tuning experiments detailed in Section 5.1. As shown in Table 7, HYPEBOY outperforms the other hypergraph SSL methods in most of the settings, validating its broad effectiveness across a diverse range of HNNs.

F.5 PRETEXT TASK ANALYSIS

In this section, we evaluate the efficacy of different SSL pretext tasks. Specifically, we assess our hyperedge filling task by comparing it with two other SSL tasks: contrastive learning and feature reconstruction. To ensure a fair comparison and minimize the impact of the underlying SSL method design choices, we streamline each method as described below:
- Naive contrastive learning: This method is a variant of TriCL (Lee & Shin, 2023a) that does not use projection-head parameters. Other settings are the same as those of TriCL (e.g., contrastive losses and the view generation process).
- Naive feature reconstruction: This method is a variant of the hypergraph feature reconstruction method (Section 4.4) that does not use a decoder HNN.
- Naive hyperedge filling: This method is a variant of HYPEBOY that uses neither the feature reconstruction warm-up nor projection heads. Specifically, this method is equivalent to V1, which is described in Section 5.3.

As shown in Table 8, the naive hyperedge filling method outperforms the other two methods across all settings, highlighting the superior effectiveness of the hyperedge filling task.

F.6 TOPOLOGICAL GENERATIVE PRETEXT TASK ANALYSIS

In this section, we compare HYPEBOY with other topology-generative SSL methods. To this end, we adopt the following three baseline methods:
- Graph edge filling: Utilizes edge filling on a clique-expanded graph (Appendix E.2), employing the same method HYPEBOY uses to fill size-2 hyperedges.
- Graph hyperedge filling: Applies hyperedge filling on a clique-expanded graph (Appendix E.2), using the same approach as HYPEBOY for filling hyperedges.
- Hyperedge prediction: Undertakes the hyperedge prediction task as an SSL task, leveraging size negative samples (Patil et al., 2020).

Figure 6: Homophily analysis of three benchmark datasets: Cora-Cite, AMiner, and IMDB.
Each color represents the class (label) of nodes. An upward trend in a class indicates homophily, whereas a downward trend indicates non-homophily. Cora-Cite demonstrates homophily, IMDB shows non-homophily, and AMiner falls in between, with some classes exhibiting homophily and others not.

As shown in Table 9, HYPEBOY outperforms the other three topological generative SSL methods in all settings, highlighting the effectiveness of the hyperedge filling task.

F.7 HOMOPHILY ANALYSIS

In this section, we demonstrate that HYPEBOY exhibits prominent node classification performance across various homophily scenarios, including both homophilic and non-homophilic hypergraphs. To this end, we analyze both real-world hypergraphs and semi-synthetic hypergraphs.

Real-world hypergraphs. We analyze the homophily characteristics of three hypergraph benchmark datasets used in this work: Cora-Cite, AMiner, and IMDB. Note that HYPEBOY shows the best performance among all the methods on these datasets (refer to Table 1). For the homophily analysis, we utilize the visualization-based method suggested by Veldt et al. (2023). The analysis is conducted for each hyperedge size, with the results for each hyperedge size presented in separate plots. In these plots, the line color represents the node class. A brief interpretation of the resulting plots is as follows:
- An upward trend line indicates that nodes of the corresponding class demonstrate higher-order homophilic characteristics within hyperedges of the same size.
- A downward trend line indicates that nodes of the corresponding class display higher-order non-homophilic characteristics within hyperedges of the same size.

As shown in Figure 6, the three datasets (Cora-Cite, AMiner, and IMDB) exhibit different homophily patterns. Specifically, the Cora-Cite dataset demonstrates homophilic tendencies, the IMDB dataset shows a non-homophilic pattern, and the AMiner dataset falls in between, with some classes being homophilic and others not. Given that HYPEBOY outperforms other methods on these datasets, we can conclude that its efficacy holds for both homophilic and non-homophilic hypergraphs.

Semi-synthetic hypergraphs. We analyze how the performance of HYPEBOY varies as the homophilic extent of a hypergraph varies. To this end, we corrupt the hypergraph topology of two homophilic hypergraph datasets: Cora and Citeseer. Specifically, we utilize a node-swapping method; details are provided in Algorithm 1. Note that the more we swap nodes in a hypergraph, the more non-homophilic the hypergraph becomes.

Table 10: Analysis of performance changes under variations in the homophilic extent. The smallest drop from the original performance is highlighted in green. NA indicates the UniGCNII encoder without pre-training. As shown, HYPEBOY is the most robust method against homophily violation among the compared hypergraph SSL methods.
Dataset   Method    Original  T = 50       T = 100      T = 150      T = 200
Cora      NA        48.5 (-)  46.7 (-1.8)  44.7 (-3.8)  43.5 (-5.0)  40.9 (-7.6)
Cora      TriCL     60.2 (-)  59.1 (-1.1)  56.1 (-4.1)  55.7 (-4.5)  53.6 (-6.5)
Cora      HyperGCL  60.3 (-)  57.4 (-2.9)  53.7 (-6.6)  52.0 (-8.3)  50.0 (-10.3)
Cora      HYPEBOY   62.3 (-)  61.7 (-0.6)  59.2 (-3.1)  58.0 (-4.3)  56.7 (-5.6)
Citeseer  NA        44.2 (-)  41.4 (-2.8)  39.0 (-5.2)  35.9 (-8.3)  33.8 (-10.4)
Citeseer  TriCL     51.7 (-)  51.1 (-0.6)  47.5 (-4.2)  45.2 (-6.5)  41.8 (-9.9)
Citeseer  HyperGCL  47.0 (-)  45.8 (-1.2)  43.3 (-3.7)  40.9 (-6.1)  37.7 (-9.3)
Citeseer  HYPEBOY   56.7 (-)  56.5 (-0.2)  54.2 (-2.5)  50.8 (-5.9)  48.6 (-8.1)

Algorithm 1: Node swapping algorithm
Input: A set of hyperedges E, number of iterations T
Output: Swapped hyperedges E′
1: E′ ← E
2: for k ← 1 to T do
3:   Sample two hyperedges e′_i and e′_j from E′ uniformly at random
4:   Sample a node v′_i from e′_i and a node v′_j from e′_j uniformly at random
5:   e″_i ← (e′_i \ {v′_i}) ∪ {v′_j}
6:   e″_j ← (e′_j \ {v′_j}) ∪ {v′_i}
7:   E′ ← (E′ \ {e′_i} \ {e′_j}) ∪ {e″_i} ∪ {e″_j}
Return: Swapped hyperedges E′

As shown in Table 10, HYPEBOY is the most robust among the hypergraph SSL methods evaluated, exhibiting the smallest performance drop in scenarios where homophily is violated.

F.8 ADDITIONAL DATA SPLITS

In this section, we investigate the efficacy of HYPEBOY under various node-label-split settings. To this end, we sample a certain number of nodes from each class as training and validation nodes. Specifically, we utilize the following two settings:
- For each node class, use 5 nodes / 5 nodes / the rest as train / validation / test nodes, respectively.
- For each node class, use 10 nodes / 10 nodes / the rest as train / validation / test nodes, respectively.

Other experimental settings are the same as those of the fine-tuning experiments (Section 5.1). As shown in Table 11, HYPEBOY outperforms other hypergraph SSL methods in all the settings. Thus, we demonstrate that the efficacy of HYPEBOY is not restricted to a particular label-split setting.

F.9 HETEROGENEOUS HYPERGRAPH EXPERIMENT

In this section, we investigate the efficacy of HYPEBOY as a pre-training strategy on a heterogeneous hypergraph dataset, which contains hyperedges of various semantics. To this end, we utilize the ACM dataset, where nodes correspond to publications. There are two types of hyperedges: authorship and subject relations.
- Authorship: Publications authored by the same person are represented as a hyperedge.
- Subject relations: Publications sharing a subject are represented as a hyperedge.

Other experimental settings are the same as those of the fine-tuning experiments (Section 5.1). As shown in Table 12, HYPEBOY outperforms other hypergraph SSL methods on the ACM dataset, demonstrating its efficacy beyond homogeneous hypergraphs.

Table 11: Results in the various label-split settings under the fine-tuning protocol. The best and second-best performances are colored green and yellow, respectively. NA indicates the UniGCNII encoder without pre-training. As shown, HYPEBOY outperforms other hypergraph SSL methods in all the settings.
Split         Method    Citeseer    Cora        Pubmed      Cora-CA     DBLP-P
5 per class   NA        53.7 (6.8)  63.7 (3.8)  71.6 (4.6)  60.7 (4.9)  77.8 (4.7)
5 per class   TriCL     60.1 (5.2)  71.0 (3.1)  71.9 (3.8)  67.6 (2.9)  80.5 (2.5)
5 per class   HyperGCL  57.0 (5.6)  70.3 (2.9)  72.2 (4.4)  61.5 (4.4)  79.5 (3.7)
5 per class   HYPEBOY   63.1 (4.6)  72.0 (2.4)  72.9 (3.2)  67.8 (2.4)  81.5 (2.4)
10 per class  NA        62.7 (3.8)  72.8 (2.2)  73.7 (3.0)  67.7 (3.0)  82.4 (1.9)
10 per class  TriCL     66.3 (2.6)  75.7 (2.3)  74.5 (3.1)  69.7 (2.2)  83.6 (1.9)
10 per class  HyperGCL  64.4 (3.9)  74.9 (2.1)  74.3 (3.1)  69.0 (3.4)  83.6 (1.6)
10 per class  HYPEBOY   67.7 (2.6)  75.8 (7.8)  74.9 (2.7)  70.9 (1.7)  84.6 (1.5)

Table 12: Results on the ACM dataset under the fine-tuning protocol. The best performance is colored green. Among the four methods, HYPEBOY shows the best performance.

Method    UniGCNII    TriCL       HyperGCL    HYPEBOY
Accuracy  40.0 (3.1)  40.3 (2.4)  40.8 (3.4)  41.7 (2.6)

G POTENTIAL APPLICATIONS OF HYPEREDGE FILLING

In this section, we describe two potential applications of the hyperedge filling task.
- Email recipient recommendation: Given the recipients of an email, recommend a user likely to be added as a recipient of the email. Nodes in the hypergraph indicate users, and each hyperedge indicates an email, consisting of a sender, recipients, and CCs.
- Item recommendation: Given the items in a shopping cart, recommend an item to be co-purchased with them. Nodes in the hypergraph indicate items, and each hyperedge contains a set of co-purchased items.
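To illustrate how a trained hyperedge filling model could serve such applications, the following is a minimal sketch (not from the paper's codebase) that ranks candidate nodes for a given query subset, e.g., the current recipients of an email or the items already in a cart. The names Z and recommend are hypothetical, and a simple sum of member embeddings is used as the query-subset representation purely for illustration.

```python
import numpy as np

def recommend(Z: np.ndarray, query_subset: list, top_k: int = 5) -> list:
    """Rank candidate nodes for a query subset by a hyperedge-filling-style score.
    Z: (|V|, d) node representations from a trained encoder;
    query_subset: indices of the nodes already in the group (e.g., current recipients)."""
    q = Z[query_subset].sum(axis=0)   # query-subset representation (illustrative choice)
    scores = Z @ q                    # z_i^T q for every candidate node v_i
    scores[query_subset] = -np.inf    # exclude nodes already in the group
    return np.argsort(-scores)[:top_k].tolist()

# Toy usage: 100 users with 16-dimensional representations; 3 current recipients
rng = np.random.default_rng(42)
Z = rng.normal(size=(100, 16))
print(recommend(Z, query_subset=[3, 17, 42], top_k=5))
```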