# Link Prediction with Relational Hypergraphs

Published in Transactions on Machine Learning Research (05/2025)

Xingyue Huang (xingyue.huang@cs.ox.ac.uk), Department of Computer Science, University of Oxford, UK
Miguel Romero (mgromero@uc.cl), Department of Computer Science, Universidad Católica de Chile, Chile
Pablo Barceló (pbarcelo@uc.cl), Institute for Mathematics and Comp. Engineering, Universidad Católica de Chile & IMFD, Chile
Michael M. Bronstein (michael.bronstein@cs.ox.ac.uk), Department of Computer Science, University of Oxford, UK
İsmail İlkan Ceylan (ismail.ceylan@cs.ox.ac.uk), Department of Computer Science, University of Oxford, UK

Reviewed on OpenReview: https://openreview.net/forum?id=S6fe4aH6YA

Link prediction with knowledge graphs has been thoroughly studied in graph machine learning, leading to a rich landscape of graph neural network architectures with successful applications. Nonetheless, it remains challenging to transfer the success of these architectures to inductive link prediction with relational hypergraphs, where the task is over $k$-ary relations and is substantially harder than link prediction on knowledge graphs with binary relations only. In this paper, we propose a framework for link prediction with relational hypergraphs, empowering applications of graph neural networks on fully relational structures. Theoretically, we conduct a thorough analysis of the expressive power of the resulting model architectures via corresponding relational Weisfeiler-Leman algorithms and logical expressiveness. Empirically, we validate the power of the proposed architectures on various relational hypergraph benchmarks. The resulting model architectures substantially outperform every baseline for inductive link prediction and also lead to competitive results for transductive link prediction.

1 Introduction

Figure 1: A relational hypergraph over the relations StudyDegree and Awarded. The facts StudyDegree(Hawking, Oxford, Physics, BA) and Awarded(Physics, Nobel, Oxford) are ordered hyperedges, where the order of entities in each fact is denoted by dashed arrows.

Knowledge graphs consist of facts (or edges) representing different relations between pairs of nodes. Knowledge graphs are inherently incomplete (Ji et al., 2020; Wang et al., 2017), which has motivated a large literature on link prediction with knowledge graphs (Wang et al., 2014; Schlichtkrull et al., 2018; Sun et al., 2019; Teru et al., 2020; Vashishth et al., 2020; Liu et al., 2021a; Zhu et al., 2021). This task amounts to predicting missing facts in the knowledge graph and has led to a rich landscape of graph neural network architectures (Schlichtkrull et al., 2018; Teru et al., 2020; Vashishth et al., 2020; Zhu et al., 2021). Our understanding of these architectures is supported by theoretical studies quantifying their expressive power (Barceló et al., 2022; Zhang et al., 2021; Huang et al., 2023; Qiu et al., 2024).

In this work, we are interested in link prediction on fully relational data, where every relation holds between $k$ nodes, for a relation-specific choice of $k$. Relational data can encode rich relationships between entities; e.g., consider a relationship between four entities: "Hawking went to Oxford to study Physics and received a BA degree." This can be represented with the fact StudyDegree(Hawking, Oxford, Physics, BA).
Clearly, relational data can be represented via relational hypergraphs, where each ordered, relational hyperedge in the hypergraph corresponds to a relational fact (see Figure 1).

Motivation. Given the prevalence of relational data, link prediction with relational hypergraphs has been widely studied in the context of shallow embedding models (Wen et al., 2016; Abboud et al., 2020; Fatemi et al., 2020; 2023), where the idea is to generalize knowledge graph embedding methods to relational hypergraphs. The key limitation of these approaches is that they are all transductive: they cannot be directly used to make predictions between nodes that are not seen during training. The same limitation motivated the development of graph neural network architectures for inductive link prediction on knowledge graphs, enabling predictions between nodes that are not seen during training (Teru et al., 2020), which eventually led to very strong architectures such as NBFNets (Zhu et al., 2021).

Figure 2: Unary encoders cannot distinguish the query facts $r(u, v, t)$ and $r(u, w, t)$, drawn in green.

In the same spirit, graph neural networks have been extended for inductive link prediction on relational hypergraphs (Yadati, 2020; Zhou et al., 2023), but these approaches do not enjoy the same level of success. This can be attributed to multiple related factors. In essence, link prediction with relational hypergraphs is a $k$-ary task (with $k$ varying depending on the relation), which is much more challenging than a binary prediction task and requires dedicated approaches. On the other hand, existing proposals are simple extensions of relational graph neural networks (such as RGCNs (Schlichtkrull et al., 2018)), which cannot adequately capture $k$-ary tasks. In fact, these architectures are unary encoders used for $k$-ary predictions, which is known to be a fundamental limitation (Zhang et al., 2021; Huang et al., 2023). To make these points concrete, consider the example shown in Figure 2. In this example, regardless of the choice of the unary encoder, it is not possible to distinguish between the query facts $r(u, w, t)$ and $r(u, v, t)$, because the nodes $w$ and $v$ are isomorphic in the hypergraph. However, an appropriate ternary encoder can easily differentiate these facts using the information that the distance between the nodes $u$ and $w$ differs from the distance between the nodes $u$ and $v$. These limitations, along with the lack of an established theory, motivate our study.

Approach. We first investigate the expressive power of existing graph neural networks proposed for relational hypergraphs, such as G-MPNNs (Yadati, 2020) and RD-MPNNs (Zhou et al., 2023), to rigorously identify their limitations. This is achieved by studying the framework of hypergraph relational message passing neural networks (HR-MPNNs), which subsumes these architectures. To address the limitations of HR-MPNNs, we introduce hypergraph conditional message passing neural networks (HC-MPNNs) as a framework for inductive link prediction, inspired by the conditional message passing paradigm studied for knowledge graphs (Zhu et al., 2021; Huang et al., 2023). We conduct a systematic expressiveness study showing that HC-MPNNs can compute richer properties of nodes, dependent on $k$ other nodes, when compared to HR-MPNNs. Specifically, our study of expressive power answers the following questions:

1. Which nodes can be distinguished by an architecture?
To answer this question, we generalize existing results given for graph neural networks on knowledge graphs (Barceló et al., 2022; Huang et al., 2023) using Weisfeiler-Leman algorithms tailored to relational hypergraphs.

2. What properties of nodes can be uniformly expressed by an architecture?

To answer this question, we investigate logical expressiveness, which situates the class of node properties that can be expressed by an architecture within an appropriate logical fragment.

Contributions. Our main contributions can be summarized as follows:

- We rigorously identify the expressive power and limitations of HR-MPNNs, which encompass most of the existing architectures for link prediction with relational hypergraphs (Section 4).
- We introduce the novel framework of HC-MPNNs, which includes more expressive architectures, such as HCNets, and addresses the core limitations of HR-MPNNs (Section 5).
- We present a detailed empirical analysis to validate our theoretical findings (Section 6). Experiments for inductive and transductive link prediction with relational hypergraphs show that a simple HC-MPNN architecture surpasses all existing baselines and leads to competitive results. Our ablation studies on different model components justify the importance of our model design choices. We supplement the real-world experiments with a synthetic experiment inspired by the example from Figure 2 to validate the expressive power of HC-MPNNs (Section 6.5).

2 Related work

Knowledge graphs. Link prediction with knowledge graphs has been studied extensively in the literature. Early literature is dominated by knowledge graph embedding models including TransE (Bordes et al., 2013), RotatE (Sun et al., 2019), ComplEx (Trouillon et al., 2016), TuckER (Balazevic et al., 2019), and BoxE (Abboud et al., 2020), which are all restricted to the transductive regime. In the space of graph neural networks, RGCN (Schlichtkrull et al., 2018) and CompGCN (Vashishth et al., 2020) emerged as architectures extending standard message passing neural networks (Gilmer et al., 2017) to knowledge graphs using a relational message passing scheme. GraIL (Teru et al., 2020) is the first architecture explicitly designed to operate in the inductive regime, but it suffers from high computational complexity. Zhu et al. (2021) proposed NBFNets as an architecture that subsumes previous methods such as Neural LP (Yang et al., 2017) and DRUM (Sadeghian et al., 2019). NBFNets perform strongly and have better computational complexity thanks to their high parallelizability (Zhu et al., 2021). Recently, A*Net (Zhu et al., 2023) was proposed to scale NBFNets further through the use of a neural priority function. NBFNets are shown to fall under the framework of conditional message passing neural networks (C-MPNNs) (Huang et al., 2023), as they compute node representations conditioned pairwise on other node representations, making these architectures suitable for link prediction tasks and explaining their superior performance. The success of conditional message passing on knowledge graphs serves as a motivation for our work on relational hypergraphs.

Relational hypergraphs. Link prediction with relational hypergraphs has been widely studied in the context of shallow embedding models (Wen et al., 2016; Abboud et al., 2020; Fatemi et al., 2020; 2023).
To score facts of the form $r(u_1, \ldots, u_k)$, some methods extend the scoring function (i.e., decoder) of existing knowledge graph embedding methods to consider multiple entities. For example, m-TransH (Wen et al., 2016) is an extension of TransH (Wang et al., 2014) designed to handle multiple entities jointly. Similarly, GETD (Liu et al., 2020) builds on the bilinear embedding method TuckER (Balazevic et al., 2019) to handle higher-arity relations. Fatemi et al. (2020) proposed HSimplE and HypE, which disentangle position and relation embeddings. BoxE (Abboud et al., 2020) is an embedding model that encodes each relation using box embeddings, and naturally applies to $k$-ary relations while achieving strong results on transductive benchmarks. Fatemi et al. (2023) explore the connection between relational algebra and relational hypergraph embeddings and propose ReAlE. Recently, Li et al. (2024a) proposed HJE, which jointly considers 3D convolution and relation-aware 2D convolution, whereas Li et al. (2024b) utilize a 3D circular convolutional neural network and an alternate mask stack strategy to enhance feature extraction.

In the space of graph neural networks, Feng et al. (2018) and Yadati et al. (2019) leverage message-passing methods on undirected hypergraphs. Subsequently, StarE (Galkin et al., 2020) was introduced as a message-passing neural network specifically designed for hyper-relational knowledge graphs, providing a different approach to representing high-arity facts, while Ali et al. (2021) further studied different inductive settings for hyper-relational knowledge graph completion. Georgiev et al. (2022) also incorporate a transformer into message passing for hyper-relational knowledge graphs. The first approach that is tailored to relational hypergraphs is G-MPNN (Yadati, 2020), which operates by relational message passing. RD-MPNNs (Zhou et al., 2023) build on this approach and additionally incorporate the positional information of entities in their respective relations during message passing, which is critical for relational facts since the order of nodes in each edge clearly matters. G-MPNNs and RD-MPNNs represent the closest related works to the present study, and we show that these architectures are instances of HR-MPNNs and hence are subject to the same limitations.

3 Link prediction with relational hypergraphs

Relational hypergraphs. A relational hypergraph $G = (V, E, R, c)$ consists of a set $V$ of nodes, a set $E$ of hyperedges (or simply edges or facts) of the form $e = r(u_1, \ldots, u_k) \in E$, where $r \in R$ is a relation type, $u_1, \ldots, u_k \in V$ are nodes, and $k = \mathrm{ar}(r)$ is the arity of the relation $r$. We consider labeled hypergraphs, where the labels are given by a coloring function on nodes $c : V \to D$. If the range of this coloring satisfies $D = \mathbb{R}^d$, we say $c$ is a $d$-dimensional feature map and use the notation $x$. We write $\rho(e)$ for the relation $r \in R$ of the hyperedge $e \in E$, and $e(i)$ for the node in the $i$-th arity position of the hyperedge $e$. We define $E(v) = \{(e, i) \mid e(i) = v,\ e \in E,\ 1 \le i \le \mathrm{ar}(\rho(e))\}$ as the set of edge-position pairs of a node $v$. Intuitively, this set captures all occurrences of node $v$ in different hyperedges and arity positions. We also define the positional neighborhood of a hyperedge $e$ with respect to a position $i$ as $N_i(e) = \{(e(j), j) \mid j \neq i,\ 1 \le j \le \mathrm{ar}(\rho(e))\}$. This set represents all nodes that co-occur with the node at position $i$ in a hyperedge $e$, along with their positions.
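To make these incidence structures concrete, the following is a minimal sketch in plain Python (our own illustration; names such as `RelationalHypergraph` are assumptions, not taken from the paper's codebase) of an ordered relational hypergraph together with $E(v)$ and $N_i(e)$:

```python
from collections import defaultdict

class RelationalHypergraph:
    """A relational hypergraph G = (V, E, R, c) with ordered hyperedges."""

    def __init__(self, num_nodes, edges):
        # Each edge is a pair (relation, nodes), where `nodes` is an ordered
        # tuple (u_1, ..., u_k) and k = ar(relation) is the edge's arity.
        self.num_nodes = num_nodes
        self.edges = edges
        # E(v): all (edge_index, position) pairs in which node v occurs.
        self.incidence = defaultdict(list)
        for e_idx, (_, nodes) in enumerate(edges):
            for i, v in enumerate(nodes, start=1):  # positions are 1-indexed
                self.incidence[v].append((e_idx, i))

    def E(self, v):
        """Edge-position pairs of node v."""
        return self.incidence[v]

    def N(self, e_idx, i):
        """Positional neighborhood N_i(e): co-occurring nodes with positions."""
        _, nodes = self.edges[e_idx]
        return [(w, j) for j, w in enumerate(nodes, start=1) if j != i]

# The running example from Figure 1: StudyDegree(Hawking, Oxford, Physics, BA)
# and Awarded(Physics, Nobel, Oxford), with entities mapped to integer ids.
ids = {"Hawking": 0, "Oxford": 1, "Physics": 2, "BA": 3, "Nobel": 4}
G = RelationalHypergraph(5, [("StudyDegree", (0, 1, 2, 3)), ("Awarded", (2, 4, 1))])
print(G.E(ids["Oxford"]))   # [(0, 2), (1, 3)]
print(G.N(0, 1))            # [(1, 2), (2, 3), (3, 4)]
```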
A knowledge graph is a relational hypergraph where $\mathrm{ar}(r) = 2$ for all $r \in R$.

Link prediction on hyperedges. Given a relational hypergraph $G = (V, E, R, c)$ and a query $q(u_1, \ldots, u_{t-1}, ?, u_{t+1}, \ldots, u_k)$, where $q \in R$ is the query relation and $?$ is the querying position, link prediction is the problem of scoring all the hyperedges obtained by substituting nodes $v \in V$ in place of $?$. We denote a $k$-tuple $(u_1, \ldots, u_k)$ by $\boldsymbol{u}$ and the tuple $(u_1, \ldots, u_{t-1}, u_{t+1}, \ldots, u_k)$ by $\tilde{\boldsymbol{u}}$. For convenience, we commonly write a query as a tuple $\boldsymbol{q} = (q, \tilde{\boldsymbol{u}}, t)$.

Isomorphisms. An isomorphism from a relational hypergraph $G = (V, E, R, c)$ to a relational hypergraph $G' = (V', E', R, c')$ is a bijection $f : V \to V'$ such that $c(v) = c'(f(v))$ for all $v \in V$, and $r(u_1, \ldots, u_k) \in E$ if and only if $r(f(u_1), \ldots, f(u_k)) \in E'$, for all $r \in R$ and $u_1, \ldots, u_k \in V$.

Invariants. For $k \ge 1$, we define a $k$-ary relational hypergraph invariant as a function $\xi$ associating with each relational hypergraph $G = (V, E, R, c)$ a function $\xi(G)$ with domain $V^k$, such that for all relational hypergraphs $G, G'$, all isomorphisms $f$ from $G$ to $G'$, and all $k$-tuples of nodes $\boldsymbol{u} \in V^k$, we have $\xi(G)(\boldsymbol{u}) = \xi(G')(f(\boldsymbol{u}))$.

Refinements. Given two relational hypergraph invariants $\xi$ and $\xi'$, we say a function $\xi(G) : V^k \to D$ refines a function $\xi'(G) : V^k \to D'$, denoted $\xi(G) \preceq \xi'(G)$, if for all $\boldsymbol{u}, \boldsymbol{u}' \in V^k$, $\xi(G)(\boldsymbol{u}) = \xi(G)(\boldsymbol{u}')$ implies $\xi'(G)(\boldsymbol{u}) = \xi'(G)(\boldsymbol{u}')$. In addition, we call such functions equivalent, denoted $\xi(G) \equiv \xi'(G)$, if $\xi(G) \preceq \xi'(G)$ and $\xi'(G) \preceq \xi(G)$. A $k$-ary relational hypergraph invariant $\xi$ refines a $k$-ary relational hypergraph invariant $\xi'$ if $\xi(G)$ refines $\xi'(G)$ for all relational hypergraphs $G$; similarly for equivalence.

4 Hypergraph relational MPNNs

We first introduce HR-MPNNs, which capture existing architectures tailored for relational hypergraphs, such as G-MPNN (Yadati, 2020) and RD-MPNN (Zhou et al., 2023) (Appendix B.1). Let $G = (V, E, R, x)$ be a relational hypergraph, where $x$ is a feature map that yields the initial features $x_v = x(v)$ for all nodes $v \in V$. For $\ell \ge 0$, an HR-MPNN iteratively computes a sequence of feature maps $h^{(\ell)} : V \to \mathbb{R}^{d(\ell)}$, where the representations $h^{(\ell)}_v := h^{(\ell)}(v)$ are given by:

$$h^{(0)}_v = x_v,$$
$$h^{(\ell+1)}_v = \mathrm{Up}\Big(h^{(\ell)}_v,\ \mathrm{Agg}\big(h^{(\ell)}_v,\ \{\!\{\mathrm{Msg}_{\rho(e)}\big(\{(h^{(\ell)}_w, j) \mid (w, j) \in N_i(e)\}\big) \mid (e, i) \in E(v)\}\!\}\big)\Big),$$

where Up, Agg, and $\mathrm{Msg}_{\rho(e)}$ are differentiable update, aggregation, and relation-specific message functions, respectively. These functions are layer-specific, but we omit the superscript $(\ell)$ for brevity. An HR-MPNN has a fixed number of layers $L \ge 0$, and the final representations of nodes are given by the function $h^{(L)} : V \to \mathbb{R}^{d(L)}$. We can then use a $k$-ary decoder $\mathrm{dec}_q : (\mathbb{R}^{d(L)})^k \to \mathbb{R}$ to produce a score for the likelihood of $q(\boldsymbol{u})$ for $q \in R$, $\boldsymbol{u} \in V^k$.

HR-MPNNs contain architectures designed for single-relational, undirected hypergraphs, such as HGNN (Feng et al., 2018) and HyperGCN (Yadati et al., 2019) (see Appendix B.2). Furthermore, HR-MPNNs capture relational message passing neural networks (Huang et al., 2023) as a special case (see Appendix B.3). MPNNs are well understood both in terms of their ability to distinguish graph nodes (Morris et al., 2019; Xu et al., 2019) and in terms of their capacity to capture logical node properties (Barceló et al., 2020). This line of work has been extended to relational architectures (Barceló et al., 2022; Huang et al., 2023). In the next subsections, we provide similar characterizations for HR-MPNNs.
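As an illustration of the framework, here is a sketch of one HR-MPNN layer in PyTorch, reusing the `RelationalHypergraph` sketch above. It instantiates Up, Agg, and Msg with a simple sum-based choice; this is one possible instance under our own assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class HRMPNNLayer(nn.Module):
    """One HR-MPNN layer with sum aggregation and relation-specific
    linear message functions (a simple instance of the framework)."""

    def __init__(self, dim, num_relations, max_arity):
        super().__init__()
        self.msg = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_relations))
        self.pos = nn.Embedding(max_arity + 1, dim)  # positional encodings p_j
        self.up = nn.Linear(2 * dim, dim)

    def forward(self, h, G, rel_ids):
        # h: (num_nodes, dim) node states; G: a RelationalHypergraph;
        # rel_ids: dict mapping relation names to integer ids.
        agg = torch.zeros_like(h)
        for e_idx, (rel, nodes) in enumerate(G.edges):
            for i, v in enumerate(nodes, start=1):
                # Msg_{rho(e)} over the positional neighborhood N_i(e):
                # sum position-aware neighbor states, then a relation-specific map.
                neigh = sum((h[w] + self.pos.weight[j] for (w, j) in G.N(e_idx, i)),
                            torch.zeros(h.shape[1]))
                agg[v] = agg[v] + self.msg[rel_ids[rel]](neigh)
        # Up combines the previous state with the aggregated messages.
        return torch.relu(self.up(torch.cat([h, agg], dim=-1)))

layer = HRMPNNLayer(dim=16, num_relations=2, max_arity=4)
h = torch.zeros(5, 16)  # initial features x_v for the Figure 1 example graph G
h = layer(h, G, {"StudyDegree": 0, "Awarded": 1})
```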
4.1 A Weisfeiler-Leman test for HR-MPNNs

We formally characterize the ability of HR-MPNNs to distinguish nodes in relational hypergraphs via a variant of the 1-dimensional Weisfeiler-Leman test, namely the hypergraph relational 1-WL test, denoted by $\mathrm{hrwl}_1$. The test $\mathrm{hrwl}_1$ is a natural generalization of $\mathrm{rwl}_1$ (Barceló et al., 2022) to relational hypergraphs. Given a relational hypergraph $G = (V, E, R, c)$, for $\ell \ge 0$, $\mathrm{hrwl}_1$ updates the node colorings as:

$$\mathrm{hrwl}_1^{(0)}(v) = c(v),$$
$$\mathrm{hrwl}_1^{(\ell+1)}(v) = \tau\Big(\mathrm{hrwl}_1^{(\ell)}(v),\ \{\!\{\big(\{(\mathrm{hrwl}_1^{(\ell)}(w), j) \mid (w, j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E(v)\}\!\}\Big).$$

The function $\tau$ is an injective mapping that maps the above pair to a unique color that has not been used in previous iterations: $\mathrm{hrwl}_1^{(\ell)}$ defines a valid node invariant on relational hypergraphs for all $\ell \ge 0$. As it turns out, $\mathrm{hrwl}_1$ has the same expressive power as HR-MPNNs in terms of distinguishing nodes over relational hypergraphs:

Theorem 4.1. Let $G = (V, E, R, c)$ be a relational hypergraph. Then the following statements hold:

1. For all initial feature maps $x$ with $c \equiv x$, all HR-MPNNs with $L$ layers, and all $0 \le \ell \le L$, it holds that $\mathrm{hrwl}_1^{(\ell)} \preceq h^{(\ell)}$.
2. For all $L \ge 0$, there is an initial feature map $x$ with $c \equiv x$ and an HR-MPNN with $L$ layers such that for all $0 \le \ell \le L$, we have $\mathrm{hrwl}_1^{(\ell)} \equiv h^{(\ell)}$.

Intuitively, item (1) states that $\mathrm{hrwl}_1$ upper bounds the power of any HR-MPNN: if the test cannot distinguish two nodes, then HR-MPNNs cannot either. On the other hand, item (2) states that HR-MPNNs can be as expressive as $\mathrm{hrwl}_1$: for any $L$, there is an HR-MPNN that simulates $L$ iterations of the test. In our proof, we explicitly construct this HR-MPNN using a simple architecture: the proof requires a very delicate construction to ensure the HR-MPNN synthesizes the information around a node $v$ (given by its neighborhood $E(v)$) in the same way $\mathrm{hrwl}_1$ does (see Appendix C).
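The test is straightforward to implement on top of the `RelationalHypergraph` sketch above; below is our own compact reference implementation, in which sorting canonicalizes the inner sets and outer multisets, and re-indexing the resulting colors plays the role of the injective map $\tau$:

```python
def hrwl1(G, colors, num_iters):
    """Hypergraph relational 1-WL: refine node colors for num_iters rounds.
    `colors` maps each node to an initial integer color c(v)."""
    for _ in range(num_iters):
        new_colors = {}
        for v in range(G.num_nodes):
            # Multiset over (e, i) in E(v) of the pairs
            # ({(color(w), j) | (w, j) in N_i(e)}, rho(e)).
            multiset = sorted(
                (tuple(sorted((colors[w], j) for (w, j) in G.N(e_idx, i))),
                 G.edges[e_idx][0])
                for (e_idx, i) in G.E(v)
            )
            new_colors[v] = (colors[v], tuple(multiset))
        # Re-index the refined colors so they stay small integers (the role of tau).
        palette = {c: k for k, c in enumerate(sorted(set(new_colors.values())))}
        colors = {v: palette[c] for v, c in new_colors.items()}
    return colors

# Uniform initial coloring on the Figure 1 example graph G from above.
colors = hrwl1(G, {v: 0 for v in range(G.num_nodes)}, num_iters=3)
```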
4.2 Logical expressiveness of HR-MPNNs

The previous WL characterization of HR-MPNNs is non-uniform, in the sense that it holds for a given relational hypergraph $G$. We now turn our attention to a uniform analysis of the power of HR-MPNNs and study the problem of which (node) properties can be expressed as HR-MPNNs, which is well suited for the inductive setup. Following Barceló et al. (2020), we investigate logical classifiers, i.e., those that can be defined in the formalism of first-order logic (FO). Briefly, a first-order formula $\phi(x)$ with one free variable $x$ defines a logical classifier that assigns value true to node $u$ in relational hypergraph $G$ whenever $G \models \phi(u)$. A logical classifier $\phi(x)$ is captured by an HR-MPNN $\mathcal{A}$ if for every relational hypergraph $G$ the nodes $u$ that are classified as true by $\phi$ and by $\mathcal{A}$ are the same.

Graded modal logic on hypergraphs. Barceló et al. (2020) showed that a logical classifier is captured by an MPNN over single-relational undirected graphs if and only if it can be expressed in graded modal logic (de Rijke, 2000; Lutz et al., 2001). This result was extended to knowledge graphs by Huang et al. (2023). We consider a variant of graded modal logic for hypergraphs. Fix a set of relation types $R$ and a set of node colors $C$. The hypergraph graded modal logic (HGML) is the fragment of FO containing the following unary formulas. Firstly, $a(x)$ for $a \in C$ is a formula. Secondly, if $\varphi(x)$ and $\varphi'(x)$ are HGML formulas, then $\neg\varphi(x)$ and $\varphi(x) \wedge \varphi'(x)$ also are. Thirdly, for $r \in R$, $1 \le i \le \mathrm{ar}(r)$, and $N \ge 1$:

$$\exists^{\ge N} \boldsymbol{y}\, \big(r(y_1, \ldots, y_{i-1}, x, y_{i+1}, \ldots, y_{\mathrm{ar}(r)}) \wedge \Psi(\boldsymbol{y})\big)$$

is an HGML formula, where $\boldsymbol{y} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_{\mathrm{ar}(r)})$ and $\Psi(\boldsymbol{y})$ is a Boolean combination of HGML formulas having free variables from $\boldsymbol{y}$. Intuitively, the formula expresses that $x$ participates in at least $N$ edges $e$ at position $i$, where the remaining nodes in $e$ satisfy condition $\Psi$.

Example 4.2. Consider the set of relations from Figure 1 and the property: "$x$ is a person who obtained a degree $y$ of a subject $z$ at a university $m$ that has been awarded less than two prizes $p$ of some subject $w$." This can be expressed as the following formula:

$$\phi(x) = \mathrm{Person}(x) \wedge \exists y, z, m\, \big(\mathrm{StudyDegree}(x, y, z, m) \wedge \neg\exists^{\ge 2} p, w\, (\mathrm{Awarded}(w, p, m))\big).$$

It is easy to verify that $\phi(x)$ is indeed an HGML formula. For any property expressed in HGML, such as $\phi(x)$, does there exist an HR-MPNN that captures this property on all relational hypergraphs with a shared relational vocabulary $R$ and node colors $C$? Indeed, we show that HR-MPNNs are as powerful as HGML:

Theorem 4.3. Each hypergraph graded modal logic classifier is captured by an HR-MPNN.

For the proof, we first show a simple normal form for HGML formulas, and then carefully translate formulas of this form into HR-MPNNs. See Appendix D for further discussion regarding HGML.

5 Hypergraph conditional MPNNs

In this section, we propose hypergraph conditional message passing neural networks (HC-MPNNs), a generalization of C-MPNNs (Huang et al., 2023) to relational hypergraphs. Let $G = (V, E, R, x)$ be a relational hypergraph, where $x$ is a feature map. Given a query $\boldsymbol{q} = (q, \tilde{\boldsymbol{u}}, t)$, for $\ell \ge 0$, an HC-MPNN computes a sequence of feature maps $h^{(\ell)}_{v|\boldsymbol{q}}$ as follows:

$$h^{(0)}_{v|\boldsymbol{q}} = \mathrm{Init}(v, \boldsymbol{q}),$$
$$h^{(\ell+1)}_{v|\boldsymbol{q}} = \mathrm{Up}\Big(h^{(\ell)}_{v|\boldsymbol{q}},\ \mathrm{Agg}\big(h^{(\ell)}_{v|\boldsymbol{q}},\ \{\!\{\mathrm{Msg}_{\rho(e)}\big(\{(h^{(\ell)}_{w|\boldsymbol{q}}, j) \mid (w, j) \in N_i(e)\},\ \boldsymbol{q}\big) \mid (e, i) \in E(v)\}\!\}\big)\Big),$$

where Init, Up, Agg, and $\mathrm{Msg}_{\rho(e)}$ are differentiable initialization, update, aggregation, and relation-specific message functions, respectively. An HC-MPNN has a fixed number of layers $L \ge 0$, and the final conditional node representations are given by $h^{(L)}_{v|\boldsymbol{q}}$. We denote by $h^{(\ell)}_{\boldsymbol{q}} : V \to \mathbb{R}^{d(\ell)}$ the function $h^{(\ell)}_{\boldsymbol{q}}(v) := h^{(\ell)}_{v|\boldsymbol{q}}$.

To ensure that HC-MPNNs compute $k$-ary representations (see Appendix H), we impose a generalized version of the target node distinguishability condition proposed by Huang et al. (2023). An initialization function satisfies generalized target node distinguishability if for all $\boldsymbol{q} = (q, \tilde{\boldsymbol{u}}, t)$:

$$\mathrm{Init}(u, \boldsymbol{q}) \neq \mathrm{Init}(v, \boldsymbol{q}) \quad \forall u \in \tilde{\boldsymbol{u}},\ v \notin \tilde{\boldsymbol{u}}, \qquad \text{and} \qquad \mathrm{Init}(u_i, \boldsymbol{q}) \neq \mathrm{Init}(u_j, \boldsymbol{q}) \quad \forall u_i, u_j \in \tilde{\boldsymbol{u}},\ u_i \neq u_j.$$

(HC-MPNNs can also take into account node features with slight modifications; see the detailed discussion in Appendix L.) Differently from message passing on simple hypergraphs, on relational hypergraphs we need to consider the relation type of each edge (multi-relational) and the relative position of each node in the edges (directed). Hence, the message function $\mathrm{Msg}_{\rho(e)}$ needs to be relation-specific while also keeping track of the positions $j$ of nodes $w$ in their respective neighborhoods $N_i(e)$. We can then obtain the score of a query $\boldsymbol{q}$ by applying a unary decoder $\mathrm{dec}$ on $h^{(L)}_{v|\boldsymbol{q}}$.
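A minimal sketch of an initialization satisfying this property, essentially the HCNet initialization defined in Section 5.1 below, where the query vector $z_q$ and the positional encodings are assumed given and `hc_init` is a hypothetical name of our own:

```python
import torch

def hc_init(num_nodes, u_tilde, t, z_q, pos):
    """Init(v, q) for a query q = (q, u_tilde, t): node u_i gets p_i + z_q for
    its query position i; every other node gets the zero vector. Generalized
    target node distinguishability holds as long as the resulting vectors are
    pairwise distinct and nonzero (true generically for random encodings).
    z_q: (dim,) query vector; pos: (max_arity + 1, dim) positional encodings."""
    h0 = torch.zeros(num_nodes, z_q.shape[0])
    positions = [i for i in range(1, len(u_tilde) + 2) if i != t]
    for i, u in zip(positions, u_tilde):
        h0[u] += pos[i] + z_q  # += handles a node repeated across positions
    return h0

# Query q(b, ?, a) as in Figure 3: u_tilde = (b, a), querying position t = 2,
# so b receives z_q + p_1 and a receives z_q + p_3.
dim = 8
z_q, pos = torch.randn(dim), torch.randn(5, dim)
h0 = hc_init(num_nodes=5, u_tilde=(1, 0), t=2, z_q=z_q, pos=pos)
```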
5.1 Hypergraph conditional networks

We now define a basic HC-MPNN, which we call hypergraph conditional networks (HCNets). For a query $\boldsymbol{q} = (q, \tilde{\boldsymbol{u}}, t)$, an HCNet computes the following representations for all $\ell \ge 0$:

$$h^{(0)}_{v|\boldsymbol{q}} = \sum_{i \neq t} \mathbb{1}_{v = u_i} \cdot (p_i + z_q),$$
$$h^{(\ell+1)}_{v|\boldsymbol{q}} = \sigma\Big(W^{(\ell)}\Big[h^{(\ell)}_{v|\boldsymbol{q}} + \sum_{(e,i) \in E(v)} g^{(\ell)}_{\rho(e),q}\Big(\bigodot_{j \neq i}\big(\alpha^{(\ell)} h^{(\ell)}_{e(j)|\boldsymbol{q}} + (1 - \alpha^{(\ell)})\, p_j\big)\Big)\Big] + b^{(\ell)}\Big),$$

where $g^{(\ell)}_{\rho(e),q}$ is a learnable message function, $\sigma$ is an activation function, $W^{(\ell)}$ is a learnable weight matrix, $b^{(\ell)}$ is a learnable bias term per layer, $z_q$ is the learnable query vector for $q \in R$, and $\mathbb{1}_C$ is the indicator function that returns 1 if condition $C$ is true, and 0 otherwise. As usual, $\cdot$ is scalar multiplication and $\odot$ is element-wise multiplication of vectors. We write $\alpha^{(\ell)}$ for a learnable scalar and $p_i$ for the positional encoding at position $i$, which is the sinusoidal positional encoding (Vaswani et al., 2017). In particular, we set $g^{(\ell)}_{\rho(e),q}$ to be a query-dependent diagonal linear map $\mathrm{Diag}(W_r z_q)$, where $W_r$ is a learnable matrix for each relation $r$. Alternatively, we can adopt a query-independent map by replacing $W_r z_q$ with a learnable vector $w_r$ for each relation $r$. Intuitively, the model initialization ensures that all source nodes (i.e., nodes that appear in $\tilde{\boldsymbol{u}}$) are initialized to their respective positions in the query edge, and all other nodes are initialized as the zero vector $\mathbf{0}$, satisfying generalized target node distinguishability, as shown in Figure 3.

Figure 3: Given a relational hypergraph $G$ with $V = \{a, b, c, d, e\}$, $E = \{r_1(a, b, d, c), r_2(d, e, b)\}$, $R = \{r_1, r_2\}$ and a query $q(b, ?, a)$, HCNet conditions on the nodes $a$ and $b$ and then applies message passing to compute the score for $q(b, e, a)$. Here, $z_q$ is the learnable relation vector for the query relation, and $p_i$ is the positional encoding of the $i$-th arity position.
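The following PyTorch sketch implements one HCNet layer following the displayed equation, using the query-independent message variant ($g = \mathrm{Diag}(w_r)$) and reusing the `RelationalHypergraph` and `hc_init` sketches above. It is an illustration under these assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class HCNetLayer(nn.Module):
    """One HCNet layer with the query-independent message map g = Diag(w_r)."""

    def __init__(self, dim, num_relations, max_arity):
        super().__init__()
        self.w_rel = nn.Embedding(num_relations, dim)   # w_r per relation
        self.pos = nn.Embedding(max_arity + 1, dim)     # positional encodings p_j
        self.lin = nn.Linear(dim, dim)                  # W^{(l)} and bias b^{(l)}
        self.alpha = nn.Parameter(torch.tensor(0.5))    # learnable scalar alpha

    def forward(self, h, G, rel_ids):
        agg = torch.zeros_like(h)
        for e_idx, (rel, nodes) in enumerate(G.edges):
            for i, v in enumerate(nodes, start=1):
                # Element-wise product over neighbor positions j != i of
                # alpha * h_{e(j)} + (1 - alpha) * p_j.
                msg = torch.ones(h.shape[1])
                for (w, j) in G.N(e_idx, i):
                    msg = msg * (self.alpha * h[w]
                                 + (1 - self.alpha) * self.pos.weight[j])
                agg[v] = agg[v] + self.w_rel.weight[rel_ids[rel]] * msg
        return torch.relu(self.lin(h + agg))

layer = HCNetLayer(dim=8, num_relations=2, max_arity=4)
h = layer(h0, G, {"StudyDegree": 0, "Awarded": 1})  # h0, G from the sketches above
```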
5.2 A Weisfeiler-Leman test for HC-MPNNs

To analyze the expressive power of HC-MPNNs in distinguishing nodes, we can still use the $\mathrm{hrwl}_1$ test, provided we restrict ourselves to initial colorings $c$ that respect the given query $\boldsymbol{q}$. Formally, given a query $\boldsymbol{q} = (q, \tilde{\boldsymbol{u}}, t)$ on a relational hypergraph $G = (V, E, R, c)$, we say that the coloring $c$ satisfies generalized target node distinguishability with respect to $\boldsymbol{q}$ if:

$$c(u) \neq c(v) \quad \forall u \in \tilde{\boldsymbol{u}},\ v \notin \tilde{\boldsymbol{u}}, \qquad \text{and} \qquad c(u_i) \neq c(u_j) \quad \forall u_i, u_j \in \tilde{\boldsymbol{u}},\ u_i \neq u_j.$$

Note that initial colorings satisfying this property are equivalent to the initializations of HC-MPNNs. As a direct consequence of Theorem 4.1, we obtain:

Theorem 5.1. Let $G = (V, E, R, c)$ be a relational hypergraph and $\boldsymbol{q} = (q, \tilde{\boldsymbol{u}}, t)$ be a query such that $c$ satisfies generalized target node distinguishability with respect to $\boldsymbol{q}$. Then the following statements hold:

1. For all HC-MPNNs with $L$ layers and initialization Init with $\mathrm{Init} \equiv c$, and all $0 \le \ell \le L$, we have $\mathrm{hrwl}_1^{(\ell)} \preceq h^{(\ell)}_{\boldsymbol{q}}$.
2. For all $L \ge 0$, there is an HC-MPNN with $L$ layers such that for all $0 \le \ell \le L$, $\mathrm{hrwl}_1^{(\ell)} \equiv h^{(\ell)}_{\boldsymbol{q}}$ holds.

Theorem 5.1 tells us that HC-MPNNs are stronger than HR-MPNNs due to the initialization: HC-MPNNs can initialize nodes differently based on the query $\boldsymbol{q}$, whereas HR-MPNNs always assign the same initialization for all queries. In fact, the ternary edges from Figure 2 cannot be distinguished by HR-MPNNs, but they can be distinguished by HC-MPNNs.

5.3 Logical expressiveness of HC-MPNNs

We remark that Theorem 4.3 can be translated to HC-MPNNs by slightly modifying the logic. We consider symbolic queries $\boldsymbol{q} = (q, \boldsymbol{b}, t)$, where now each $b \in \boldsymbol{b}$ is a constant symbol. Our vocabulary contains relation types $r \in R$ and node colors $C$, as before, and additionally the constants $b \in \boldsymbol{b}$. We define hypergraph graded modal logic with constants (HGMLc) as HGML where, as atomic cases, we additionally allow formulas of the form $\varphi(x) = (x = b)$ for some constant $b$ (see Appendix F for details). This allows us to identify variables with individual constants.

Example 5.2. With this richer vocabulary with constants, we can represent more formulas conditioned on the constants appearing in the query. For instance, given a symbolic query with $\boldsymbol{b} = (\mathrm{Physics}, \mathrm{BA})$, we can express a more complex formula $\psi(x)$ that represents "$x$ is a person with a BA degree of Physics at some university $m$, where less than two prizes $p$ in total have been awarded in Physics", as follows:

$$\psi(x) = \mathrm{Person}(x) \wedge \exists y, z, m\, \big(\mathrm{StudyDegree}(x, y, z, m) \wedge (z = \mathrm{Physics}) \wedge (y = \mathrm{BA}) \wedge \neg\exists^{\ge 2} p, w\, (\mathrm{Awarded}(w, p, m) \wedge (w = \mathrm{Physics}))\big).$$

Note that this formula $\psi(x)$ cannot be expressed as an HGML formula, but it can be expressed as an HGMLc formula, thanks to the additional introduction of constants. We prove the following result showing that HC-MPNNs can capture richer $k$-ary node properties:

Theorem 5.3. Each HGMLc classifier can be captured by an HC-MPNN over valid relational hypergraphs.

6 Experimental evaluation

We evaluate HCNet on a broad range of experiments:

- Inductive experiments (Section 6.1): We evaluate HCNet for inductive link prediction with relational hypergraphs and report very substantial improvements, reflecting our theoretical findings.
- Transductive experiments (Section 6.2): We evaluate HCNet for transductive link prediction with relational hypergraphs and report competitive results.
- Knowledge graph experiments (Section 6.3): HCNet can handle knowledge graphs as a special case, and our evaluation shows that it can match the performance of models such as NBFNets.
- Ablation on initialization and positional encoding (Section 6.4): We conduct ablation studies on the choice of initialization and positional encoding in HCNets.
- Expressiveness evaluation (Section 6.5): We conduct a synthetic experiment on the HyperCycle dataset, building on the counter-example in Figure 2, to showcase the expressivity differences between HR-MPNNs and HC-MPNNs.

Experimental setups. In all experiments, we use a 2-layer MLP as the decoder and adopt layer normalization and dropout in all layers before applying the ReLU activation and skip-connections. During training, for each batch we remove the edges that are currently treated as positive tuples to prevent overfitting. We choose the best checkpoint based on its performance on the validation set. For evaluation, we adopt the filtered ranking protocol: for each test edge $q(u_1, \ldots, u_k)$, where $k = \mathrm{ar}(q)$, and for each position $t \in \{1, \ldots, k\}$, we replace the $t$-th entity by all other possible entities such that the query after replacement is not in the graph. We consider the query-independent message function for all datasets except WikiPeople. We report Mean Reciprocal Rank (MRR), Hits@1, and Hits@3 for inductive experiments, and MRR, Hits@1, and Hits@10 for transductive experiments, as evaluation metrics, and provide results averaged over five runs with different seeds. We report standard deviations, execution time, and memory usage, along with all other experiment details, in Appendix N. Furthermore, we provide a detailed discussion of the computational complexity of HR-MPNNs and HC-MPNNs in Appendix I, and a discussion of the impact of relational hypergraph density on performance in Appendix M. We ran all experiments on a single NVIDIA V100 GPU. The code for the experiments is provided at https://github.com/HxyScotthuang/HC-MPNN.
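For concreteness, the filtered ranking protocol described above can be sketched as follows. This is a simplified illustration: `score` and `known_facts` are hypothetical stand-ins for the trained model and the set of all true facts, and score ties are ignored:

```python
def filtered_metrics(test_edges, num_nodes, known_facts, score, ks=(1, 3, 10)):
    """Filtered MRR and Hits@k. `score(rel, nodes)` returns a real-valued
    score; `known_facts` is the set of all true (rel, nodes) facts."""
    ranks = []
    for rel, nodes in test_edges:
        for t in range(len(nodes)):  # corrupt each arity position in turn
            true_score = score(rel, nodes)
            rank = 1
            for v in range(num_nodes):
                cand = nodes[:t] + (v,) + nodes[t + 1:]
                # Filter: skip the test edge itself and any known true fact.
                if cand == nodes or (rel, cand) in known_facts:
                    continue
                if score(rel, cand) > true_score:
                    rank += 1
            ranks.append(rank)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits
```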
Table 1: Results of inductive link prediction experiments on WP-IND, JF-IND, and MFB-IND.

| Model | WP-IND MRR | Hits@1 | Hits@3 | JF-IND MRR | Hits@1 | Hits@3 | MFB-IND MRR | Hits@1 | Hits@3 |
|---|---|---|---|---|---|---|---|---|---|
| HGNN | 0.072 | 0.045 | 0.112 | 0.102 | 0.086 | 0.128 | 0.121 | 0.076 | 0.114 |
| HyperGCN | 0.075 | 0.049 | 0.111 | 0.099 | 0.088 | 0.133 | 0.118 | 0.074 | 0.117 |
| G-MPNN-sum | 0.177 | 0.108 | 0.191 | 0.219 | 0.155 | 0.236 | 0.124 | 0.071 | 0.123 |
| G-MPNN-mean | 0.153 | 0.096 | 0.145 | 0.112 | 0.039 | 0.116 | 0.241 | 0.162 | 0.257 |
| G-MPNN-max | 0.200 | 0.125 | 0.214 | 0.216 | 0.147 | 0.240 | 0.268 | 0.191 | 0.283 |
| RD-MPNN | 0.304 | 0.238 | 0.328 | 0.402 | 0.308 | 0.453 | 0.122 | 0.082 | 0.125 |
| HCNet | 0.414 | 0.352 | 0.451 | 0.435 | 0.357 | 0.495 | 0.368 | 0.223 | 0.417 |

6.1 Inductive experiments

Datasets. Yadati (2020) constructed three inductive datasets, WP-IND, JF-IND, and MFB-IND, from existing transductive datasets on relational hypergraphs: WikiPeople (Guan et al., 2019), JF17K (Wen et al., 2016), and M-FB15K (Fatemi et al., 2020); their statistics are in Table 8.

Baselines. We compare with the baseline models HGNN (Feng et al., 2018), HyperGCN (Yadati et al., 2019), and three variants of G-MPNN (Yadati, 2020) with different aggregation functions. Since HGNN and HyperGCN are designed for simple hypergraphs, Yadati (2020) tested them on transformed relational hypergraphs where the relations are ignored. In addition, Yadati (2020) initialized nodes with given node features, whereas we ignore the node features and initialize each node with the respective initialization defined for HCNets. We modify RD-MPNNs (Zhou et al., 2023) by replacing learned entity embeddings with the all-ones vector $\mathbf{1}_d$ to enable inductive link prediction. We adopt the batching trick (Zhu et al., 2021) on MFB-IND. Hyper-parameters are reported in Table 10.

Results. We report the inductive experiment results in Table 1 and observe that HCNet outperforms all existing baseline methods by a large margin, doubling the metrics on WP-IND and JF-IND and substantially improving on MFB-IND. Notably, we emphasize that HCNet does not utilize the provided node features, whereas the baseline models do, further highlighting the effectiveness of HCNet in generalizing to entirely new graphs in the absence of node features. This is because HCNet is more expressive: it computes query-dependent $k$-ary invariants instead of the query-agnostic unary invariants computed by HR-MPNNs such as RD-MPNNs and the G-MPNN variants. Overall, these results align closely with the main theoretical findings presented in this paper.
6.2 Transductive experiments

Datasets & Baselines. We evaluate HCNets on the transductive link prediction task with relational hypergraphs, namely on the publicly available FB-AUTO and MFB15K (Fatemi et al., 2020), WikiPeople (Guan et al., 2021), and JF17K (Wen et al., 2016) datasets. (Note that we include JF17K for comparison with the literature, even though it suffers from redundant entries and severe test leakage (Galkin et al., 2020).) These datasets include facts of arities up to 9. We take the results of the embedding methods m-DistMult, m-CP, and m-TransH from Wen et al. (2016), RAE from Zhang et al. (2018), NaLP from Guan et al. (2019), tNaLP+ from Guan et al. (2021), HINGE from Rosso et al. (2020), NeuInfer from Guan et al. (2020), HypE from Fatemi et al. (2020), BoxE from Abboud et al. (2020), RAM from Liu et al. (2021b), HyperMLN from Chen et al. (2022), HyConvE from Wang et al. (2023), ReAlE from Fatemi et al. (2023), HJE from Li et al. (2024a), HyCubE and HyPlanE from Li et al. (2024b), and the GNN-based method RD-MPNN from Zhou et al. (2023). The statistics of the datasets are reported in Table 9, and the hyper-parameter choices in Table 11. The full tables are in Tables 15 and 16, which additionally report Hits@3.

Table 2: Transductive link prediction experiments on FB-AUTO, WikiPeople, JF17K, and MFB15K.

| Model | FB-AUTO MRR | Hits@1 | Hits@10 | WikiPeople MRR | Hits@1 | Hits@10 | JF17K MRR | Hits@1 | Hits@10 | MFB15K MRR | Hits@1 | Hits@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| m-DistMult | 0.784 | 0.745 | 0.845 | - | - | - | 0.463 | 0.372 | 0.634 | 0.705 | 0.633 | 0.844 |
| m-CP | 0.752 | 0.704 | 0.837 | - | - | - | 0.392 | 0.303 | 0.560 | 0.680 | 0.605 | 0.828 |
| m-TransH | 0.728 | 0.727 | 0.728 | - | - | - | 0.444 | 0.370 | 0.581 | 0.623 | 0.531 | 0.809 |
| RAE | 0.703 | 0.614 | 0.854 | 0.253 | 0.118 | 0.463 | 0.396 | 0.312 | 0.561 | - | - | - |
| NaLP | 0.672 | 0.611 | 0.774 | 0.338 | 0.272 | 0.466 | 0.366 | 0.290 | 0.516 | - | - | - |
| tNaLP+ | 0.729 | 0.645 | 0.826 | 0.339 | 0.269 | 0.473 | 0.449 | 0.370 | 0.598 | - | - | - |
| HINGE | 0.678 | 0.630 | 0.765 | 0.333 | 0.259 | 0.477 | 0.473 | 0.397 | 0.618 | - | - | - |
| NeuInfer | 0.737 | 0.700 | 0.805 | 0.350 | 0.282 | 0.467 | 0.451 | 0.373 | 0.604 | - | - | - |
| HypE | 0.804 | 0.774 | 0.856 | 0.263 | 0.127 | 0.486 | 0.494 | 0.408 | 0.656 | 0.777 | 0.725 | 0.881 |
| RAM | 0.830 | 0.803 | 0.876 | 0.363 | 0.271 | 0.500 | 0.539 | 0.463 | 0.690 | - | - | - |
| BoxE | 0.844 | 0.814 | 0.898 | - | - | - | 0.560 | 0.472 | 0.722 | 0.761 | 0.702 | 0.877 |
| HyperMLN | 0.831 | 0.803 | 0.877 | - | - | - | 0.556 | 0.482 | 0.717 | - | - | - |
| HyConvE | 0.847 | 0.820 | 0.901 | 0.362 | 0.275 | 0.501 | 0.590 | 0.478 | 0.729 | - | - | - |
| ReAlE | 0.873 | 0.852 | 0.909 | - | - | - | 0.559 | 0.482 | 0.705 | 0.801 | 0.755 | 0.901 |
| RD-MPNN | 0.810 | 0.714 | 0.888 | - | - | - | 0.512 | 0.445 | 0.685 | - | - | - |
| HJE | 0.872 | 0.848 | 0.903 | 0.450 | 0.375 | 0.582 | 0.590 | 0.507 | 0.729 | - | - | - |
| HyCubE | 0.881 | 0.860 | 0.918 | 0.448 | 0.368 | 0.592 | 0.584 | 0.508 | 0.730 | - | - | - |
| HyPlanE | 0.866 | 0.843 | 0.909 | 0.402 | 0.323 | 0.549 | 0.569 | 0.496 | 0.708 | - | - | - |
| HCNet | 0.871 | 0.842 | 0.922 | 0.421 | 0.344 | 0.565 | 0.540 | 0.440 | 0.730 | 0.759 | 0.693 | 0.884 |

Results. We summarize the results for the transductive link prediction tasks in Table 2. HCNet is competitive even compared with existing embedding methods specifically designed for transductive link prediction. This demonstrates the effectiveness of HCNets also on transductive datasets.

6.3 Knowledge graph experiments

In addition, to evaluate the model's performance on knowledge graphs, a specific instance of relational hypergraphs, we conduct inductive experiments on knowledge graphs, where every edge has arity 2, and compare the outcomes with those of current state-of-the-art models.

Setup. We evaluate HCNet on the 4 standard inductive splits of WN18RR (Bordes et al., 2013) and FB15k-237 (Dettmers et al., 2018) proposed in Teru et al. (2020). We provide the details of the datasets in Table 17. Contrary to the standard experimental setting (Zhu et al., 2021; 2023) on a knowledge graph $G = (V, E, R, x)$, where for each relation $r(u, v) \in E$ an inverse relation $r^{-1}$ is introduced as a fresh relation symbol and $r^{-1}(v, u)$ is added to the knowledge graph, in our setup we do not augment inverse edges for HCNet. This makes the task more challenging. We compare HCNet with models designed only for inductive binary link prediction with knowledge graphs, namely GraIL (Teru et al., 2020), Neural LP (Yang et al., 2017), DRUM (Sadeghian et al., 2019), NBFNet (Zhu et al., 2021), RED-GNN (Zhang & Yao, 2022), and A*Net (Zhu et al., 2023), and we take the results provided in Zhu et al. (2023) for comparison.
Results. We report the results in Table 3. We observe that HCNet is highly competitive even compared with state-of-the-art models specifically designed for link prediction with knowledge graphs. HCNet reaches the top 3 on 7 out of 8 datasets and obtains a very close result on the remaining one. Note that the top 2 models are NBFNet (Zhu et al., 2021) and A*Net (Zhu et al., 2023), which share a similar design with HCNet, as all are based on conditional message passing. The difference in results lies in the different message functions, as further supported by Table 1 of Huang et al. (2023). However, we highlight that HCNet does not augment the graph with inverse-relation edges, as described in the experimental setup. HCNet can recognize the directionality of relational edges and accounts for both incoming and outgoing edges during message passing. No current link prediction model based on message passing can explicitly take care of this without edge augmentation. In fact, Theorem G.4 implies that all current models based on conditional message passing, including NBFNets, need inverse-relation augmentation to match the expressive power of HCNet. Theoretically speaking, this allows us to claim that HCNet is strictly more powerful than all other baseline models based on conditional message passing, assuming all considered models are expressive enough to match their corresponding relational WL test.

Table 3: Binary inductive experiments on knowledge graphs (Hits@10). The best result per split is in bold.

| Method | FB15k-237 v1 | v2 | v3 | v4 | WN18RR v1 | v2 | v3 | v4 |
|---|---|---|---|---|---|---|---|---|
| GraIL | 0.429 | 0.424 | 0.424 | 0.389 | 0.760 | 0.776 | 0.409 | 0.687 |
| Neural LP | 0.468 | 0.586 | 0.571 | 0.593 | 0.772 | 0.749 | 0.476 | 0.706 |
| DRUM | 0.474 | 0.595 | 0.571 | 0.593 | 0.777 | 0.747 | 0.477 | 0.702 |
| NBFNet | 0.574 | **0.685** | **0.637** | 0.627 | **0.826** | 0.798 | **0.568** | 0.694 |
| RED-GNN | 0.483 | 0.629 | 0.603 | 0.621 | 0.799 | 0.780 | 0.524 | 0.721 |
| A*Net | **0.589** | 0.672 | 0.629 | **0.645** | 0.810 | **0.803** | 0.544 | **0.743** |
| HCNet | 0.566 | 0.646 | 0.614 | 0.610 | 0.822 | 0.790 | 0.536 | 0.724 |

Table 4: Ablation on initialization.

| $z_q$ | $p_i$ | WP-IND MRR | WP-IND Hits@3 | JF-IND MRR | JF-IND Hits@3 |
|---|---|---|---|---|---|
| - | - | 0.388 | 0.421 | 0.390 | 0.451 |
| yes | - | 0.387 | 0.421 | 0.392 | 0.447 |
| - | yes | 0.394 | 0.430 | 0.393 | 0.456 |
| yes | yes | 0.414 | 0.451 | 0.435 | 0.495 |

Table 5: Ablation on positional encoding.

| PE | WP-IND MRR | WP-IND Hits@3 | JF-IND MRR | JF-IND Hits@3 |
|---|---|---|---|---|
| Constant | 0.393 | 0.426 | 0.356 | 0.428 |
| One-hot | 0.395 | 0.428 | 0.368 | 0.432 |
| Learnable | 0.396 | 0.425 | 0.416 | 0.480 |
| Sinusoidal | 0.414 | 0.451 | 0.435 | 0.495 |

6.4 Ablation studies on the impact of initialization and positional encoding

To assess the contribution of each model component, we conduct ablation studies on different choices of positional encodings and initialization functions on the WP-IND and JF-IND datasets, with the same empirical setup described in Section 6.1. Complete results are reported in Appendix N.

Initialization. We conduct experiments to validate the impact of different initializations by evaluating all combinations of including or excluding the positional encoding $p_i$ and the learnable query vector $z_q$. From Table 4, we observe that both the positional encoding $p_i$ and the relation vector $z_q$ are essential in the initialization, as removing either of them worsens the overall performance of HCNet. A closer look reveals that removing the positional encoding is more detrimental than removing the relational embedding, since the model can deduce the relation type from implicit information such as the arity of the query relation.

Positional encoding. We also examine the importance of the choice of positional encoding, which indicates the position at which a given entity lies in a hyperedge. We experiment with multiple choices of positional encodings and report the results in Table 5. Empirically, the sinusoidal positional encoding produces the best results due to its ability to capture sequential dependencies between neighboring entities, compared with the one-hot positional encoding, which assumes orthogonality between positions. We also notice that learnable embeddings do not produce better results, since it is generally hard to learn a suitable embedding that respects the order of the nodes in a relation from random initialization. Finally, the constant embedding evidently performs the worst, as it ignores positional information entirely and treats all hyperedges with the same set of nodes in the same way, regardless of the order of the nodes in these edges.
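For reference, the sinusoidal encoding used for $p_i$ follows Vaswani et al. (2017); a standard construction of it (assuming an even dimension) looks as follows:

```python
import torch

def sinusoidal_pe(max_arity, dim):
    """Sinusoidal positional encodings p_1, ..., p_{max_arity}
    (Vaswani et al., 2017); assumes `dim` is even."""
    position = torch.arange(1, max_arity + 1, dtype=torch.float).unsqueeze(1)
    freq = torch.exp(torch.arange(0, dim, 2, dtype=torch.float)
                     * (-torch.log(torch.tensor(10000.0)) / dim))
    pe = torch.zeros(max_arity, dim)
    pe[:, 0::2] = torch.sin(position * freq)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * freq)  # odd dimensions: cosine
    return pe
```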
6.5 Expressiveness evaluation

We carry out a synthetic experiment on a custom-built dataset, HyperCycle, to showcase that HC-MPNNs are more expressive than HR-MPNNs in the task of link prediction with relational hypergraphs.

Dataset. We construct HyperCycle, a synthetic dataset that consists of multiple relational hypergraphs with relations $R = \{r_0, r_1, r_2\}$. Each relational hypergraph $G$ is parameterized by two hyper-parameters: the number of nodes $n$, which is always a multiple of 4, and the arity $k$ of each edge. The $n$ nodes are arranged in a cycle, and we construct a directed hyperedge of arity $k$, with relations alternating between $r_1$ and $r_2$, over every $k$ consecutive nodes in this cycle. We present one example of such a relational hypergraph in Figure 4, where $n = 8$ and $k = 3$. We generate the dataset by choosing $n \in \{8, 12, 16, 20\}$ and $k \in \{3, 4, 5, 6, 7\}$. We then randomly pick 70% of the generated graphs as the training set and the remaining 30% as the test set (see details in Appendix J).

Objective. The objective of this task is, for each node, to identify the node located at the opposite point of the cycle as true. Formally speaking, for a relational hypergraph $G(n, k)$, we want to predict a binary (hyper-)edge of relation $r_0$ between any node $x_i$ and its opposite node $x_{(i + n/2) \bmod n}$ for all $1 \le i \le n$, i.e., classify $r_0(x_i, x_{(i + n/2) \bmod n})$ as true. The negative samples are generated by considering $r_0$ (hyper-)edges that connect 2-hop neighboring nodes, i.e., we classify $r_0(x_i, x_{(i + 2) \bmod n})$ as false. Note that since $n \neq 4$, we never have $(i + n/2) \bmod n = (i + 2) \bmod n$.

Figure 4: Example relational hypergraph from the HyperCycle dataset with $n = 8$ and $k = 3$: $r_1$ is blue, $r_2$ is red, and the objective is to classify the black edge as true and the gray edge as false.
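The construction of a single HyperCycle graph $G(n, k)$ can be sketched as follows; this is our own reconstruction from the description above, not the paper's generation script, and in particular the alternation of $r_1$/$r_2$ by window parity is an assumption:

```python
def hyper_cycle(n, k):
    """Build HyperCycle G(n, k): n nodes on a cycle; one hyperedge of arity k
    over each window of k consecutive nodes, alternating relations r1/r2."""
    assert n % 4 == 0
    edges = [("r1" if s % 2 == 0 else "r2",
              tuple((s + j) % n for j in range(k)))
             for s in range(n)]
    positives = [("r0", (i, (i + n // 2) % n)) for i in range(n)]  # opposite node
    negatives = [("r0", (i, (i + 2) % n)) for i in range(n)]       # 2-hop node
    return edges, positives, negatives

edges, pos, neg = hyper_cycle(n=8, k=3)  # the instance shown in Figure 4
```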
Design & Result. We claim that HC-MPNNs can correctly predict all the test triplets, whereas HR-MPNNs fail to learn this pattern and only achieve 50% accuracy, which is no better than random guessing. This is exactly due to the lack of expressive power of HR-MPNNs, which compute query-agnostic unary invariants and rely on a $k$-ary decoder for link prediction. Theoretically, all nodes of the relational hypergraphs in HyperCycle can be partitioned into two sets, due to the rotational symmetry introduced by the alternating relation types $r_1, r_2$. Since the nodes within each set are isomorphic to each other, it is impossible for any HR-MPNN to distinguish between these nodes by computing only unary invariants. Thus, HR-MPNNs cannot possibly solve this task: whenever they classify the target opposite node as true, they also have to classify the 2-hop node as true, and vice versa. Empirically, we observe that HR-MPNNs always fail to learn anything meaningful, reaching an accuracy of 50%. However, HC-MPNNs bridge this gap by introducing the relevant notion of "distance". As an HC-MPNN carries out message passing after identifying the source node, the relative distance from the source node to the target opposite node differs from the distance to the 2-hop node. Thus, by keeping track of the distance from the source node, HCNet computes different embeddings for the positive triplet and the negative triplet, effectively solving this task. Empirically, we observe that HCNet easily reaches 100% accuracy, consistent with our theory.

7 Summary, discussions, and limitations

We investigated two frameworks of relational message-passing neural networks for the task of link prediction with relational hypergraphs, namely HR-MPNNs and HC-MPNNs. Furthermore, we studied the expressive power of these two frameworks in terms of relational WL and logical expressiveness. We then proposed a simple yet powerful model instance of HC-MPNNs, called HCNet, and presented its superior performance on inductive link prediction tasks, further supported by additional transductive link prediction and synthetic experiments. One limitation lies in the potentially high computational complexity of our approach when applied to large relational hypergraphs. Our approach is also limited to link prediction, and a potential future avenue is to investigate complex query answering on fully relational data. Our study extends the success of link prediction with knowledge graphs to relational hypergraphs, where higher-arity relations can be effectively modeled with GNNs, advancing their applications to fully relational structures.

Broader impact

This work mainly focused on link prediction with relational hypergraphs, which has a wide range of applications and thus many potential societal impacts. One potential negative impact is the enhancement of malicious network activities, like phishing or pharming, through the use of powerful link prediction models. We encourage further studies to mitigate these issues.

References

Ralph Abboud, İsmail İlkan Ceylan, Thomas Lukasiewicz, and Tommaso Salvatori. BoxE: A box embedding model for knowledge base completion. In NeurIPS, 2020.

Mehdi Ali, Max Berrendorf, Mikhail Galkin, Veronika Thost, Tengfei Ma, Volker Tresp, and Jens Lehmann. Improving inductive link prediction using hyper-relational facts, 2021.

Ivana Balazevic, Carl Allen, and Timothy Hospedales. TuckER: Tensor factorization for knowledge graph completion. In EMNLP-IJCNLP, 2019.

Pablo Barceló, Egor V. Kostylev, Mikaël Monet, Jorge Pérez, Juan L. Reutter, and Juan Pablo Silva. The logical expressiveness of graph neural networks. In ICLR, 2020.

Pablo Barceló, Mikhail Galkin, Christopher Morris, and Miguel Romero. Weisfeiler and Leman go relational. In LoG, 2022.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.

Zirui Chen, Xin Wang, Chenxu Wang, and Jianxin Li. Explainable link prediction in knowledge hypergraphs. In CIKM, 2022.

Maarten de Rijke. A note on graded modal logic. Studia Logica, 2000.
Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In AAAI, 2018.

Bahare Fatemi, Perouz Taslakian, David Vazquez, and David Poole. Knowledge hypergraphs: Prediction beyond binary relations. In IJCAI, 2020.

Bahare Fatemi, Perouz Taslakian, David Vazquez, and David Poole. Knowledge hypergraph embedding meets relational algebra. JMLR, 2023.

Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. In AAAI, 2018.

Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR RLGM Workshop, 2019.

Luis Antonio Galárraga, Christina Teflioudi, Katja Hose, and Fabian Suchanek. AMIE: Association rule mining under incomplete evidence in ontological knowledge bases. In WWW, 2013.

Mikhail Galkin, Priyansh Trivedi, Gaurav Maheshwari, Ricardo Usbeck, and Jens Lehmann. Message passing for hyper-relational knowledge graphs. In EMNLP, 2020.

Mikhail Galkin, Xinyu Yuan, Hesham Mostafa, Jian Tang, and Zhaocheng Zhu. Towards foundation models for knowledge graph reasoning. In ICLR, 2024.

Dobrik Georgiev Georgiev, Marc Brockschmidt, and Miltiadis Allamanis. HEAT: Hyperedge attention networks. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.

Martin Grohe. The logic of graph neural networks. In LICS, 2021.

Saiping Guan, Xiaolong Jin, Yuanzhuo Wang, and Xueqi Cheng. Link prediction on n-ary relational data. In WWW, 2019.

Saiping Guan, Xiaolong Jin, Jiafeng Guo, Yuanzhuo Wang, and Xueqi Cheng. NeuInfer: Knowledge inference on n-ary facts. In ACL, 2020.

Saiping Guan, Xiaolong Jin, Jiafeng Guo, Yuanzhuo Wang, and Xueqi Cheng. Link prediction on n-ary relational data based on relatedness evaluation. IEEE Transactions on Knowledge and Data Engineering, 2021.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.

Xingyue Huang, Miguel Romero Orth, İsmail İlkan Ceylan, and Pablo Barceló. A theory of link prediction via relational Weisfeiler-Leman on knowledge graphs. In NeurIPS, 2023.

Shaoxiong Ji, Shirui Pan, E. Cambria, Pekka Marttinen, and Philip S. Yu. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 2020.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

Zhao Li, Chenxu Wang, Xin Wang, Zirui Chen, and Jianxin Li. HJE: Joint convolutional representation learning for knowledge hypergraph completion. IEEE Transactions on Knowledge and Data Engineering, 2024a.

Zhao Li, Xin Wang, Jianxin Li, Wenbin Guo, and Jun Zhao. HyCubE: Efficient knowledge hypergraph 3D circular convolutional embedding. CoRR, 2024b.

Shuwen Liu, Bernardo Grau, Ian Horrocks, and Egor Kostylev. INDIGO: GNN-based inductive knowledge graph completion using pair-wise encoding. In NeurIPS, 2021a.

Yu Liu, Quanming Yao, and Yong Li. Generalizing tensor decomposition for n-ary relational knowledge bases. In WWW, 2020.

Yu Liu, Quanming Yao, and Yong Li. Role-aware modeling for n-ary relational knowledge bases. In WWW, 2021b.

Carsten Lutz, Ulrike Sattler, and Frank Wolter. Modal logic and the two-variable fragment. In CSL, 2001.
Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, 2019.

Haiquan Qiu, Yongqi Zhang, Yong Li, and Quanming Yao. Understanding expressivity of GNN in rule learning. In ICLR, 2024.

Paolo Rosso, Dingqi Yang, and Philippe Cudré-Mauroux. Beyond triplets: Hyper-relational knowledge graph embedding for link prediction. In WWW, 2020.

Ali Sadeghian, Mohammadreza Armandpour, Patrick Ding, and Daisy Zhe Wang. DRUM: End-to-end differentiable rule mining on knowledge graphs. In NIPS, 2019.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In ESWC, 2018.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR, 2019.

Komal K. Teru, Etienne G. Denis, and William L. Hamilton. Inductive relation prediction by subgraph reasoning. In ICML, 2020.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.

Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. Composition-based multi-relational graph convolutional networks. In ICLR, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

Chenxu Wang, Xin Wang, Zhao Li, Zirui Chen, and Jianxin Li. HyConvE: A novel embedding model for knowledge hypergraph link prediction with convolutional neural networks. In WWW, 2023.

Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 2017.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, 2014.

Jianfeng Wen, Jianxin Li, Yongyi Mao, Shini Chen, and Richong Zhang. On the representation and embedding of knowledge bases beyond binary relations. In IJCAI, 2016.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.

Naganand Yadati. Neural message passing for multi-relational ordered and recursive hypergraphs. In NeurIPS, 2020.

Naganand Yadati, Madhav Nimishakavi, Prateek Yadav, Vikram Nitin, Anand Louis, and Partha Talukdar. HyperGCN: A new method for training graph convolutional networks on hypergraphs. In NeurIPS, 2019.

Fan Yang, Zhilin Yang, and William W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NeurIPS, 2017.

Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph neural networks for multi-node representation learning. In NeurIPS, 2021.

Richong Zhang, Junpeng Li, Jiajie Mei, and Yongyi Mao. Scalable instance reconstruction in knowledge bases via relatedness affiliated embedding. In WWW, 2018.

Yongqi Zhang and Quanming Yao. Knowledge graph reasoning with relational digraph. In WebConf, 2022.

Xue Zhou, Bei Hui, Ilana Zeira, Hao Wu, and Ling Tian. Dynamic relation learning for link prediction in knowledge hypergraphs. Applied Intelligence, 2023.

Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. Neural Bellman-Ford networks: A general graph neural network framework for link prediction. In NeurIPS, 2021.
Zhaocheng Zhu, Xinyu Yuan, Mikhail Galkin, Sophie Xhonneux, Ming Zhang, Maxime Gazeau, and Jian Tang. A*Net: A scalable path-based reasoning approach for knowledge graphs. In NeurIPS, 2023.

A R-MPNNs and C-MPNNs

In this section, we follow Huang et al. (2023) and define relational message passing neural networks (R-MPNNs) and conditional message passing neural networks (C-MPNNs). For ease of presentation, we omit the discussion regarding history functions and readout functions from Huang et al. (2023).

R-MPNNs. Let $G = (V, E, R, x)$ be a knowledge graph, where $x$ is a feature map. A relational message passing neural network (R-MPNN) computes a sequence of feature maps $h^{(\ell)} : V \to \mathbb{R}^{d(\ell)}$, for $\ell \ge 0$. For simplicity, we write $h^{(\ell)}_v$ instead of $h^{(\ell)}(v)$. For each node $v \in V$, the representations $h^{(\ell)}_v$ are iteratively computed as:

$$h^{(0)}_v = x_v,$$
$$h^{(\ell+1)}_v = \mathrm{Up}\big(h^{(\ell)}_v,\ \mathrm{Agg}(\{\!\{\mathrm{Msg}_r(h^{(\ell)}_w) \mid w \in N_r(v),\ r \in R\}\!\})\big),$$

where Up, Agg, and $\mathrm{Msg}_r$ are differentiable update, aggregation, and relation-specific message functions, respectively, and $N_r(v) := \{u \mid r(u, v) \in E\}$ is the neighborhood of a node $v \in V$ relative to a relation $r \in R$. An R-MPNN has a fixed number of layers $L \ge 0$, and then the final node representations are given by the map $h^{(L)} : V \to \mathbb{R}^{d(L)}$. The final representations can be used for node-level predictions. For link-level tasks, we use a binary decoder $\mathrm{dec}_q : \mathbb{R}^{d(L)} \times \mathbb{R}^{d(L)} \to \mathbb{R}$, which produces a score for the likelihood of the fact $q(u, v)$, for $q \in R$.

C-MPNNs. Let $G = (V, E, R, x)$ be a knowledge graph, where $x$ is a feature map. A conditional message passing neural network (C-MPNN) iteratively computes pairwise representations, relative to a fixed query $q \in R$ and a fixed node $u \in V$, as follows:

$$h^{(0)}_{v|u,q} = \mathrm{Init}(u, v, q),$$
$$h^{(\ell+1)}_{v|u,q} = \mathrm{Up}\big(h^{(\ell)}_{v|u,q},\ \mathrm{Agg}(\{\!\{\mathrm{Msg}_r(h^{(\ell)}_{w|u,q}, z_q) \mid w \in N_r(v),\ r \in R\}\!\})\big),$$

where Init, Up, Agg, and $\mathrm{Msg}_r$ are differentiable initialization, update, aggregation, and relation-specific message functions, respectively. We denote by $h^{(\ell)}_q : V \times V \to \mathbb{R}^{d(\ell)}$ the function $h^{(\ell)}_q(u, v) := h^{(\ell)}_{v|u,q}$, and denote by $z_q$ a learnable vector representing the query $q \in R$. A C-MPNN has a fixed number of layers $L \ge 0$, and the final pair representations are given by $h^{(L)}_q$. To decode the likelihood of the fact $q(u, v)$ for some $q \in R$, we simply use a unary decoder $\mathrm{dec} : \mathbb{R}^{d(L)} \to \mathbb{R}$, parameterized by a 2-layer MLP. In addition, we require $\mathrm{Init}(u, v, q)$ to satisfy target node distinguishability: for all $q \in R$ and all $v \neq u \in V$, it holds that $\mathrm{Init}(u, u, q) \neq \mathrm{Init}(u, v, q)$. Nevertheless, we provide a full literature review on link prediction with hyper-relational knowledge graphs and n-ary relational graphs for the sake of completeness.

B HR-MPNNs subsume existing models

In this section, we provide further details on how the proposed HR-MPNN framework subsumes existing models, as claimed.

B.1 HR-MPNNs subsume G-MPNNs and RD-MPNNs

To see why HR-MPNNs subsume RD-MPNNs (Zhou et al., 2023) and G-MPNNs (Yadati, 2020), which are prominent examples of message passing models on relational hypergraphs in the literature, it suffices to instantiate some components of HR-MPNNs with particular functions. An RD-MPNN can be seen as an instance of an HR-MPNN that uses summation as Agg, and a relation-specific message function $\mathrm{Msg}_r$ which, for each relation $r$, applies summation followed by a linear map with non-linearity. The update function Up is a one-layer multi-layer perceptron (MLP).
Similarly, a G-MPNN instance can be seen as an HR-MPNN that uses either summation, mean, or max as Agg, and a message function $\mathrm{Msg}_r$ which, for each relation $r$, applies a Hadamard product with the relational embedding.

B.2 HR-MPNNs subsume HGNNs and HyperGCNs

To see why HR-MPNNs generalize HGNNs (Feng et al., 2018) and HyperGCNs (Yadati et al., 2019) on simple, undirected hypergraphs, first note that (i) these models are single-relational (there are no relation types), so they are a special case in this sense, and (ii) the hyperedges in these undirected hypergraphs are unordered.

To recover HGNN, we can set the message function $\mathrm{Msg}_r$ to be mean, ignoring the relation type $r$ and the relative position in the formula (as there is no ordering in a simple, undirected hypergraph). Then, we can choose the Agg function to be the symmetrically normalized mean, similar to the aggregation in GCN (Kipf & Welling, 2017). To recover HyperGCN, we set Agg to be the symmetrically normalized mean, and the $\mathrm{Msg}_r$ function to be $w_{i,j} \arg\max_{h_j} |h_i - h_j|$, with some weight $w_{i,j}$ (again ignoring the relation $r$ and position $i$), provided that the message function has access to the feature $h_i$ of the node under consideration.

B.3 HR-MPNNs subsume R-MPNNs

We formally show that the R-MPNN framework is subsumed by the HR-MPNN framework when applied to knowledge graphs.

Theorem B.1. Let $G = (V, E, R, x)$ be a knowledge graph. Given any R-MPNN instance $\mathcal{A}$ with $L$ layers, parameterized by $\mathrm{Agg}^{(\ell)}_A$, $\mathrm{Up}^{(\ell)}_A$, and $\mathrm{Msg}^A_r$ for $0 < \ell \leq L$, $r \in R$, there exists an HR-MPNN instance $\mathcal{B}$ with $L$ layers, parameterized by $\mathrm{Agg}^{(\ell)}_B$, $\mathrm{Up}^{(\ell)}_B$, and $\mathrm{Msg}^B_r$, such that for all $v \in V$, we have $h^{(\ell)}_{A,G}(v) = h^{(\ell)}_{B,G}(v)$ for all $0 \leq \ell \leq L$.

Proof. Given an R-MPNN instance $\mathcal{A}$ with $L$ layers, we have for $0 \leq \ell < L$:
\[
h^{(0)}_{A,G}(v) = x(v), \qquad
h^{(\ell+1)}_{A,G}(v) = \mathrm{Up}^{(\ell)}_A\Big(h^{(\ell)}_{A,G}(v),\ \mathrm{Agg}^{(\ell)}_A\big(\{\!\!\{\mathrm{Msg}^A_r(h^{(\ell)}_{A,G}(w)) \mid w \in N_r(v),\, r \in R\}\!\!\}\big)\Big).
\]
Note that we can now rewrite the update formula in the following form:
\[
h^{(\ell+1)}_{A,G}(v) = \mathrm{Up}^{(\ell)}_A\Big(h^{(\ell)}_{A,G}(v),\ \mathrm{Agg}^{(\ell)}_A\big(\{\!\!\{\mathrm{Msg}^A_{\rho(e)}\big(\{(h^{(\ell)}_{A,G}(w), j) \mid (w, j) \in N_i(e)\}\big) \mid (e, i) \in E(v),\, i = 2\}\!\!\}\big)\Big).
\]
We then parameterize an HR-MPNN instance $\mathcal{B}$ with $L$ layers of the following form:
\[
h^{(0)}_{B,G}(v) = x(v), \qquad
h^{(\ell+1)}_{B,G}(v) = \mathrm{Up}^{(\ell)}_B\Big(h^{(\ell)}_{B,G}(v),\ \mathrm{Agg}^{(\ell)}_B\big(h^{(\ell)}_{B,G}(v), \{\!\!\{\mathrm{Msg}^B_{\rho(e)}\big(\{(h^{(\ell)}_{B,G}(w), j) \mid (w, j) \in N_i(e)\}\big) \mid (e, i) \in E(v)\}\!\!\}\big)\Big),
\]
where for all $0 < \ell \leq L$ and $r \in R$ we set $\mathrm{Up}^{(\ell)}_B := \mathrm{Up}^{(\ell)}_A$, $\mathrm{Agg}^{(\ell)}_B(x, S) := \mathrm{Agg}^{(\ell)}_A(S)$ for any vector $x$ and (multi)set $S$, and
\[
\mathrm{Msg}^B_{\rho(e)}\big(\{(h^{(\ell)}(w), j) \mid (w, j) \in N_i(e)\}\big) := \mathrm{Msg}^A_{\rho(e)}\big(\{(h^{(\ell)}(w), j) \mid (w, j) \in N_i(e),\, j = 1\}\big).
\]
We argue that $\mathrm{Msg}^B_{\rho(e)}$ can be realized by applying a filtering function to each element of the set, checking whether the second argument of the tuple is 1 or not.

Now we are ready to prove the theorem by induction. First notice that the base case $\ell = 0$ trivially holds. For the inductive case, assume that for all $v \in V$, we have $h^{(\ell)}_{A,G}(v) = h^{(\ell)}_{B,G}(v)$. Then, notice that for $0 \leq \ell < L$:
\[
\begin{aligned}
h^{(\ell+1)}_{A,G}(v) &= \mathrm{Up}^{(\ell)}_A\Big(h^{(\ell)}_{A,G}(v),\ \mathrm{Agg}^{(\ell)}_A\big(\{\!\!\{\mathrm{Msg}^A_{\rho(e)}(\{(h^{(\ell)}_{A,G}(w), j) \mid (w, j) \in N_i(e)\}) \mid (e, i) \in E(v),\, i = 2\}\!\!\}\big)\Big) \\
&= \mathrm{Up}^{(\ell)}_A\Big(h^{(\ell)}_{A,G}(v),\ \mathrm{Agg}^{(\ell)}_A\big(\{\!\!\{\mathrm{Msg}^A_{\rho(e)}(\{(h^{(\ell)}_{A,G}(w), j) \mid (w, j) \in N_i(e),\, j = 1\}) \mid (e, i) \in E(v),\, i = 2\}\!\!\}\big)\Big) \\
&= \mathrm{Up}^{(\ell)}_B\Big(h^{(\ell)}_{B,G}(v),\ \mathrm{Agg}^{(\ell)}_B\big(h^{(\ell)}_{B,G}(v), \{\!\!\{\mathrm{Msg}^B_{\rho(e)}(\{(h^{(\ell)}_{B,G}(w), j) \mid (w, j) \in N_i(e)\}) \mid (e, i) \in E(v)\}\!\!\}\big)\Big) \\
&= h^{(\ell+1)}_{B,G}(v).
\end{aligned}
\]

Remark B.2.
Note that, analogously, we can show that HC-MPNNs subsume C-MPNNs, by noticing that generalized target node distinguishability in HC-MPNNs degrades to target node distinguishability in the context of knowledge graphs. See the detailed discussion in Appendix G.

C Proof of Theorem 4.1

Theorem 4.1. Let $G = (V, E, R, c)$ be a relational hypergraph. Then the following statements hold:
1. For all initial feature maps $x$ with $c \equiv x$, all HR-MPNNs with $L$ layers, and for all $0 \leq \ell \leq L$, it holds that $\mathrm{hrwl}^{(\ell)}_1 \preceq h^{(\ell)}$.
2. For all $L \geq 0$, there is an initial feature map $x$ with $c \equiv x$ and an HR-MPNN with $L$ layers, such that for all $0 \leq \ell \leq L$, we have $\mathrm{hrwl}^{(\ell)}_1 \equiv h^{(\ell)}$.

Proof. First, for simplicity of notation, we define
\[
m^{(\ell)}_{e,i} = \mathrm{Msg}_{\rho(e)}\big(\{(h^{(\ell)}_w, j) \mid (w, j) \in N_i(e)\}\big)
\]
for an edge $e$, a position $1 \leq i \leq \mathrm{ar}(\rho(e))$, and $\ell \geq 0$.

To prove item (1), we first take an initial feature map $x$ with $c \equiv x$ and an HR-MPNN with $L$ layers. We apply induction on $\ell$. The base case $\ell = 0$ follows directly, as $\mathrm{hrwl}^{(0)}_1 \equiv c \equiv x \equiv h^{(0)}$. For the inductive case, assume $\mathrm{hrwl}^{(\ell+1)}_1(u) = \mathrm{hrwl}^{(\ell+1)}_1(v)$ for some node pair $u, v \in V$ and some $\ell \geq 0$. By injectivity of $\tau$, it follows that $\mathrm{hrwl}^{(\ell)}_1(u) = \mathrm{hrwl}^{(\ell)}_1(v)$ and
\[
\{\!\!\{(\{(\mathrm{hrwl}^{(\ell)}_1(w), j) \mid (w, j) \in N_i(e)\}, \rho(e)) \mid (e, i) \in E(u)\}\!\!\} = \{\!\!\{(\{(\mathrm{hrwl}^{(\ell)}_1(w), j) \mid (w, j) \in N_{i'}(e')\}, \rho(e')) \mid (e', i') \in E(v)\}\!\!\}.
\]
By the inductive hypothesis, we have $h^{(\ell)}_u = h^{(\ell)}_v$ and
\[
\{\!\!\{(\{(h^{(\ell)}_w, j) \mid (w, j) \in N_i(e)\}, \rho(e)) \mid (e, i) \in E(u)\}\!\!\} = \{\!\!\{(\{(h^{(\ell)}_w, j) \mid (w, j) \in N_{i'}(e')\}, \rho(e')) \mid (e', i') \in E(v)\}\!\!\}.
\]
Thus we have
\[
\{\!\!\{\mathrm{Msg}^{(\ell)}_{\rho(e)}(\{(h^{(\ell)}_w, j) \mid (w, j) \in N_i(e)\}) \mid (e, i) \in E(u)\}\!\!\} = \{\!\!\{\mathrm{Msg}^{(\ell)}_{\rho(e')}(\{(h^{(\ell)}_w, j) \mid (w, j) \in N_{i'}(e')\}) \mid (e', i') \in E(v)\}\!\!\},
\]
that is, $\{\!\!\{m^{(\ell)}_{e,i} \mid (e, i) \in E(u)\}\!\!\} = \{\!\!\{m^{(\ell)}_{e',i'} \mid (e', i') \in E(v)\}\!\!\}$. We thus conclude that
\[
h^{(\ell+1)}_u = \mathrm{Up}^{(\ell)}\Big(h^{(\ell)}_u,\ \mathrm{Agg}\big(h^{(\ell)}_u, \{\!\!\{m^{(\ell)}_{e,i} \mid (e, i) \in E(u)\}\!\!\}\big)\Big) = \mathrm{Up}^{(\ell)}\Big(h^{(\ell)}_v,\ \mathrm{Agg}\big(h^{(\ell)}_v, \{\!\!\{m^{(\ell)}_{e',i'} \mid (e', i') \in E(v)\}\!\!\}\big)\Big) = h^{(\ell+1)}_v.
\]

Now we proceed to show item (2). We use a model of HR-MPNN of the following form and show that any iteration of $\mathrm{hrwl}_1$ can be simulated by a specific layer of such an instance:
\[
h^{(0)}_v = x_v, \qquad
h^{(\ell+1)}_v = f^{(\ell)}\Big(\Big[h^{(\ell)}_v \,\Big\|\, \sum_{(e,i) \in E(v)} g^{(\ell)}_{\rho(e)}\Big(\bigodot_{j \neq i}\big(h^{(\ell)}_{e(j)} + p_j\big)\Big)\Big]\Big).
\]
Here, $f^{(\ell)}(z) = \mathrm{sign}(W^{(\ell)} z - b)$, where $W^{(\ell)}$ is a parameter matrix, $b$ is the bias term, in this case the all-ones vector $b = (1, \dots, 1)^T$, and as non-linearity we use the sign function. For a relation type $r \in R$, the function $g^{(\ell)}_r$ has the form $g^{(\ell)}_r(z) = Y^{(\ell)}_r \mathrm{sign}(W^{(\ell)}_r z - b)$, where $W^{(\ell)}_r$ and $Y^{(\ell)}_r$ are parameter matrices and $b$ is the all-ones bias vector. Recall that $\odot$ denotes element-wise multiplication, and $p_j$ is the positional encoding at position $j$, which in this case is a parameter vector.

We shall use the following lemma, shown in Morris et al. (2019, Lemma 9). The matrix $J$ denotes the all-ones matrix (with appropriate dimensions).

Lemma C.1 (Morris et al., 2019). Let $B \in \mathbb{N}^{s \times t}$ be a matrix whose columns are pairwise distinct. Then there is a matrix $X \in \mathbb{R}^{t \times s}$ such that the matrix $\mathrm{sign}(XB - J) \in \{-1, 1\}^{t \times t}$ is non-singular.

For a matrix $B$, we denote by $B_i$ its $i$-th column. Let $n = |V|$ and, without loss of generality, assume $V = \{1, \dots, n\}$. Let $m$ be the maximum arity over all edges of $G$. We will write feature maps $h: V \to \mathbb{R}^d$ for $G = (V, E, R, c)$ also as matrices $H \in \mathbb{R}^{d \times n}$, where the column $H_v$ corresponds to the $d$-dimensional feature vector of node $v$. Let $F^{ts}$ be the $nm \times n$ matrix with entries
\[
(F^{ts})_{ij} = \begin{cases} 1 & \text{if } i > mj, \\ -1 & \text{otherwise,} \end{cases}
\]
that is, a staircase pattern of $\pm 1$ blocks in which the $j$-th column starts with $mj$ entries equal to $-1$, followed by entries equal to $1$.
We shall use the columns of $F^{ts}$ as node features in our simulation. The following lemma is a simple variation of Lemma A.5 from Huang et al. (2023), which in turn is a variation of Lemma C.1 above.

Lemma C.2. Let $B \in \mathbb{N}^{s \times t}$ be a matrix such that $t \leq n$, and all the columns are pairwise distinct and different from the all-zeros column. Then there is a matrix $X \in \mathbb{R}^{nm \times s}$ such that the matrix $\mathrm{sign}(XB - J) \in \{-1, 1\}^{nm \times t}$ is precisely the sub-matrix of $F^{ts}$ given by its first $t$ columns.

Proof. Let $z = (1, k+1, (k+1)^2, \dots, (k+1)^{s-1}) \in \mathbb{N}^{1 \times s}$, where $k$ is the largest entry in $B$, and $b = zB \in \mathbb{N}^{1 \times t}$. By construction, the entries of $b$ are positive and pairwise distinct. Without loss of generality, we assume that $b = (b_1, b_2, \dots, b_t)$ with $b_1 > b_2 > \dots > b_t > 0$. As the $b_i$ are ordered, we can choose numbers $x_1, \dots, x_t \in \mathbb{R}$ such that $b_i x_j < 1$ if $i \geq j$, and $b_i x_j > 1$ if $i < j$, for all $i, j \in \{1, \dots, t\}$. Let $x = (x_1, \dots, x_t, 2/b_t, \dots, 2/b_t)^T \in \mathbb{R}^{n \times 1}$. Note that $(2/b_t) \cdot b_i > 1$ for all $i \in \{1, \dots, t\}$. Let $\bar{x} = (x_1, \dots, x_1, x_2, \dots, x_2, \dots, x_n, \dots, x_n)^T \in \mathbb{R}^{nm \times 1}$ be the vector obtained from $x$ by replacing each entry $x_i$ with $m$ consecutive copies of $x_i$. Then $\mathrm{sign}(\bar{x} b - J)$ is precisely the sub-matrix of $F^{ts}$ given by its first $t$ columns. We can choose $X = \bar{x} z \in \mathbb{R}^{nm \times s}$.

We conclude item (2) by showing the following lemma:

Lemma C.3. There exist a family of feature maps $\{h^{(\ell)}: V \to \mathbb{R}^{nm} \mid 0 \leq \ell \leq L\}$, families of matrices $\{W^{(\ell)} \mid 0 \leq \ell < L\}$ and $\{\{W^{(\ell)}_r, Y^{(\ell)}_r\} \mid 0 \leq \ell < L, r \in R\}$, and positional encodings $\{p_j \mid 1 \leq j \leq m\}$ such that:
- $h^{(\ell)} \equiv \mathrm{hrwl}^{(\ell)}_1$ for all $0 \leq \ell \leq L$;
- $h^{(\ell)}_v \in \mathbb{R}^{nm}$ is a column of $F^{ts}$ for all $0 \leq \ell \leq L$ and $v \in V$;
- $h^{(\ell+1)}_v = f^{(\ell)}\big(\big[h^{(\ell)}_v \,\|\, \sum_{(e,i) \in E(v)} g^{(\ell)}_{\rho(e)}\big(\bigodot_{j \neq i}(h^{(\ell)}_{e(j)} + p_j)\big)\big]\big)$ for all $0 \leq \ell < L$ and $v \in V$, where $f^{(\ell)}$ and $g^{(\ell)}_r$ are defined as above, i.e., $f^{(\ell)}(z) = \mathrm{sign}(W^{(\ell)} z - b)$ and $g^{(\ell)}_r(z) = Y^{(\ell)}_r \mathrm{sign}(W^{(\ell)}_r z - b)$ (the vector $b$ is the all-ones vector).

Proof. We proceed by induction on $\ell$. Suppose that the node coloring $\mathrm{hrwl}^{(0)}_1 \equiv c$ uses colors $1, \dots, p$, for $p \leq n$. Then we choose $h^{(0)}$ such that $h^{(0)}_v = F^{ts}_{c(v)}$, i.e., $h^{(0)}_v$ is the $c(v)$-th column of $F^{ts}$. Thus, $h^{(0)}$ satisfies the required conditions.

For the inductive case, assume that $h^{(\ell)} \equiv \mathrm{hrwl}^{(\ell)}_1$ for $0 \leq \ell < L$ and that $h^{(\ell)}_v$ is a column of $F^{ts}$ for all $v \in V$. We shall define parameter matrices $W^{(\ell)}$ and $\{\{W^{(\ell)}_r, Y^{(\ell)}_r\} \mid r \in R\}$ and positional encodings $\{p_j \mid 1 \leq j \leq m\}$ such that the conditions of the lemma are satisfied.

For $1 \leq j \leq m$, the positional encoding $p_j$ is independent of $\ell$. Let $\tilde{p}_j = 4b + 8e_j \in \mathbb{R}^m$, where $b$ is the $m$-dimensional all-ones vector and $e_j$ is the $m$-dimensional one-hot encoding of $j$. In other words, all entries of $\tilde{p}_j$ are 4, except for the $j$-th entry, which is 12. We define $p_j = (\tilde{p}_j, \dots, \tilde{p}_j) \in \mathbb{R}^{nm}$ to be the concatenation of $n$ copies of $\tilde{p}_j$.

Let $r \in R$ and define $E^{pos}_r = \{(e, i) \mid e \in E, \rho(e) = r, 1 \leq i \leq \mathrm{ar}(r)\}$. For $(e, i) \in E^{pos}_r$, define
\[
o^{(\ell)}_{e,i} = \bigodot_{j \neq i}\big(h^{(\ell)}_{e(j)} + p_j\big), \qquad
\mathrm{fcol}^{(\ell)}_{e,i} = \{(\mathrm{hrwl}^{(\ell)}_1(w), j) \mid (w, j) \in N_i(e)\}.
\]
We claim that for $(e, i), (e', i') \in E^{pos}_r$, we have $o^{(\ell)}_{e,i} = o^{(\ell)}_{e',i'}$ if and only if $\mathrm{fcol}^{(\ell)}_{e,i} = \mathrm{fcol}^{(\ell)}_{e',i'}$. Suppose first that $\mathrm{fcol}^{(\ell)}_{e,i} = \mathrm{fcol}^{(\ell)}_{e',i'}$. By the inductive hypothesis, we have $\{(h^{(\ell)}_w, j) \mid (w, j) \in N_i(e)\} = \{(h^{(\ell)}_w, j) \mid (w, j) \in N_{i'}(e')\}$. It follows that $o^{(\ell)}_{e,i} = o^{(\ell)}_{e',i'}$. Suppose now that $\mathrm{fcol}^{(\ell)}_{e,i} \neq \mathrm{fcol}^{(\ell)}_{e',i'}$. We consider two cases. Assume first $i \neq i'$. Then $o^{(\ell)}_{e,i}$ and $o^{(\ell)}_{e',i'}$ differ on the $i$-th coordinate, that is, $(o^{(\ell)}_{e,i})_i \neq (o^{(\ell)}_{e',i'})_i$.
Indeed, note that the entries of vectors of the form $h^{(\ell)}_w + p_j$ are always prime numbers in $\{3, 5, 11, 13\}$ (the entries of $h^{(\ell)}_w$ are always in $\{-1, 1\}$ by the inductive hypothesis). The $i$-th coordinate of all the vector factors in the product $o^{(\ell)}_{e,i} = \bigodot_{j \neq i}(h^{(\ell)}_{e(j)} + p_j)$ has value 3, and hence $(o^{(\ell)}_{e,i})_i = 3^{\mathrm{ar}(r)-1}$. On the other hand, there exists a vector factor in the product $o^{(\ell)}_{e',i'} = \bigodot_{j \neq i'}(h^{(\ell)}_{e'(j)} + p_j)$ (the factor $h^{(\ell)}_{e'(i)} + p_i$) whose $i$-th coordinate is 11. Hence $(o^{(\ell)}_{e,i})_i$ and $(o^{(\ell)}_{e',i'})_i$ have different prime factorizations, and therefore they are distinct.

Now assume $i = i'$. Since $\mathrm{fcol}^{(\ell)}_{e,i} \neq \mathrm{fcol}^{(\ell)}_{e',i'}$, there must be a position $j'$ such that $\mathrm{hrwl}^{(\ell)}_1(e(j')) \neq \mathrm{hrwl}^{(\ell)}_1(e'(j'))$. By the inductive hypothesis, $h^{(\ell)}_{e(j')} \neq h^{(\ell)}_{e'(j')}$. Again by the inductive hypothesis, we know that $h^{(\ell)}_{e(j')}$ and $h^{(\ell)}_{e'(j')}$ are columns of $F^{ts}$, say w.l.o.g. the $k$-th and $k'$-th columns, respectively, for $1 \leq k < k' \leq n$. By construction of $F^{ts}$, all the $m$ entries of $h^{(\ell)}_{e(j')}$ at coordinates $\{km+1, \dots, km+m\}$ are 1, while these are $-1$ for $h^{(\ell)}_{e'(j')}$. We claim that $o^{(\ell)}_{e,i}$ and $o^{(\ell)}_{e',i}$ differ on the $(km+j')$-th coordinate. Consider the product $o^{(\ell)}_{e,i} = \bigodot_{j \neq i}(h^{(\ell)}_{e(j)} + p_j)$. The $(km+j')$-th coordinate of the factor $h^{(\ell)}_{e(j')} + p_{j'}$ is 13, while it is in $\{3, 5\}$ for the remaining factors. For the product $o^{(\ell)}_{e',i} = \bigodot_{j \neq i}(h^{(\ell)}_{e'(j)} + p_j)$, the $(km+j')$-th coordinate of the factor $h^{(\ell)}_{e'(j')} + p_{j'}$ is 11, while it is in $\{3, 5\}$ for the remaining factors. Hence $(o^{(\ell)}_{e,i})_{km+j'}$ and $(o^{(\ell)}_{e',i})_{km+j'}$ have different prime factorizations, and therefore they are distinct.

Let $r \in R$. It follows from the previous claim that if we interpret $o^{(\ell)}$ and $\mathrm{fcol}^{(\ell)}$ as colorings of $E^{pos}_r$, then these two colorings are equivalent (i.e., they produce the same partition). Let $s_r$ be the number of colors involved in these colorings, and let $o_1, \dots, o_{s_r} \in \mathbb{R}^{nm}$ be an enumeration of the distinct vectors appearing in $\{o^{(\ell)}_{e,i} \mid (e, i) \in E^{pos}_r\}$. Let $S_r$ be the $(nm \times s_r)$-matrix whose columns are $o_1, \dots, o_{s_r}$. Fix an enumeration $r_1, \dots, r_{|R|}$ of $R$ and define $s = \sum_{r \in R} s_r$.

Now we are ready to define our sought matrices $W^{(\ell)}_r$ and $Y^{(\ell)}_r$, for $r \in R$. We define $W^{(\ell)}_r$ to be the $(s_r \times nm)$-matrix obtained from applying Lemma C.1 to the matrix $S_r$. Let $\tilde{Y}^{(\ell)}_r \in \mathbb{R}^{s_r \times s_r}$ be the inverse matrix of $\mathrm{sign}(W^{(\ell)}_r S_r - J)$. Suppose $r = r_k$ for $1 \leq k \leq |R|$. Then, the matrix $Y^{(\ell)}_r$ is the $(s \times s_r)$-matrix defined as the vertical concatenation of the following $|R|$ matrices: $N_{r_1}, \dots, N_{r_{k-1}}, \tilde{Y}^{(\ell)}_r, N_{r_{k+1}}, \dots, N_{r_{|R|}}$, where $N_{r'}$ is the all-zeros $(s_{r'} \times s_r)$-matrix. By construction, $Y^{(\ell)}_r \mathrm{sign}(W^{(\ell)}_r S_r - J)$ is the vertical concatenation of $N_{r_1}, \dots, N_{r_{k-1}}, I_r, N_{r_{k+1}}, \dots, N_{r_{|R|}}$, where $I_r$ is the $s_r \times s_r$ identity matrix. In particular, if we consider $g^{(\ell)}_r(z) = Y^{(\ell)}_r \mathrm{sign}(W^{(\ell)}_r z - b)$ as in the statement of the lemma, then for each $(e, i) \in E^{pos}_r$, the vector $m^{(\ell)}_{e,i} = g^{(\ell)}_r(o^{(\ell)}_{e,i})$ has the form
\[
m^{(\ell)}_{e,i} = (0_{r_1}, \dots, 0_{r_{k-1}}, c^{(\ell)}_{e,i}, 0_{r_{k+1}}, \dots, 0_{r_{|R|}})^T \in \{0, 1\}^s,
\]
where $0_{r'}$ is the all-zeros vector of dimension $s_{r'}$ and $c^{(\ell)}_{e,i} \in \{0, 1\}^{s_r}$ is a one-hot encoding of the edge color $o^{(\ell)}_{e,i}$, or equivalently, of the edge color $\mathrm{fcol}^{(\ell)}_{e,i}$. It follows that the vector
\[
f^{(\ell)}_v = \sum_{(e,i) \in E(v)} g^{(\ell)}_{\rho(e)}(o^{(\ell)}_{e,i}) = \sum_{r \in R} \sum_{(e,i) \in E(v) \cap E^{pos}_r} g^{(\ell)}_r(o^{(\ell)}_{e,i})
\]
has the form $f^{(\ell)}_v = (a_{r_1}, \dots, a_{r_{|R|}})^T \in \mathbb{N}^s$, where $a_r$ is the $s_r$-dimensional vector whose entry $(a_r)_j$, for $1 \leq j \leq s_r$, is the number of elements $(e, i)$ in $E(v) \cap E^{pos}_r$ with color $j$, that is, such that $o^{(\ell)}_{e,i} = o_j$.
In particular, $a_r$ is an encoding of the multiset $\{\!\!\{\mathrm{fcol}^{(\ell)}_{e,i} \mid (e, i) \in E(v) \cap E^{pos}_r\}\!\!\}$, and hence $f^{(\ell)}_v$ is an encoding of the multiset $\{\!\!\{(\mathrm{fcol}^{(\ell)}_{e,i}, \rho(e)) \mid (e, i) \in E(v)\}\!\!\}$. Note that this multiset is precisely the multiset $\{\!\!\{\mathrm{col}^{(\ell)}(e, i) \mid (e, i) \in E(v)\}\!\!\}$ from the definition of the update rule of the hypergraph relational 1-WL test. Hence, the feature map given by the concatenation $[h^{(\ell)}_v \,\|\, f^{(\ell)}_v]$, for all $v \in V$, is equivalent to $\mathrm{hrwl}^{(\ell+1)}_1$.

It remains to define the function $f^{(\ell)}$, given by the parameter matrix $W^{(\ell)}$, so that the feature map $h^{(\ell+1)}$ satisfies the conditions of the lemma. Since the columns of $F^{ts}$ are independent, there exists a matrix $M \in \mathbb{R}^{n \times nm}$ such that $MF^{ts}$ is the $n \times n$ identity matrix. Since each $h^{(\ell)}_v$, with $v \in V$, is a column of $F^{ts}$, $Mh^{(\ell)}_v \in \{0, 1\}^n$ corresponds to a one-hot encoding of the column, or color, $h^{(\ell)}_v$. Let $\bar{M}$ be the $(n+s) \times (nm+s)$ matrix with all entries 0, except for the upper-left $(n \times nm)$-submatrix, which is $M$, and the lower-right $(s \times s)$-submatrix, which is the $(s \times s)$ identity matrix. By construction, we have $\bar{M}[h^{(\ell)}_v \,\|\, f^{(\ell)}_v] = [Mh^{(\ell)}_v \,\|\, f^{(\ell)}_v] \in \mathbb{N}^{n+s}$. Let $z_1, \dots, z_q$, with $q \leq n$, be the distinct vectors of the form $[Mh^{(\ell)}_v \,\|\, f^{(\ell)}_v]$ and let $B$ be the $((n+s) \times q)$-matrix whose columns are precisely $z_1, \dots, z_q$. We can apply Lemma C.2 to $B$ to obtain a matrix $X \in \mathbb{R}^{nm \times (n+s)}$ such that $\mathrm{sign}(XB - J)$ is the matrix given by the first $q$ columns of $F^{ts}$. We define our sought matrix $W^{(\ell)}$ to be $W^{(\ell)} = X\bar{M}$.

D HGML and proof of Theorem 4.3

D.1 HGML formulas

Fix a set of relation types $R$ and a set of node colors $C$. The hypergraph graded modal logic (HGML) is the fragment of FO containing the following unary formulas. Firstly, $a(x)$ for $a \in C$ is a formula. Secondly, if $\varphi(x)$ and $\varphi'(x)$ are HGML formulas, then $\neg\varphi(x)$ and $\varphi(x) \land \varphi'(x)$ also are. Thirdly, for $r \in R$, $1 \leq i \leq \mathrm{ar}(r)$ and $N \geq 1$,
\[
\exists^{\geq N} \bar{y}\, \big(r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \Psi(\bar{y})\big)
\]
is an HGML formula, where $\bar{y} = (y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_{\mathrm{ar}(r)})$ and $\Psi(\bar{y})$ is a boolean combination of HGML formulas having free variables from $\bar{y}$. Intuitively, the formula expresses that $x$ participates in at least $N$ edges $e$ at position $i$ such that the remaining nodes in $e$ satisfy $\Psi$.

Let $G = (V, E, R, c)$ be a relational hypergraph where the range of the node coloring $c$ is $C$. Next, we define the semantics of HGML. We define when a node $v$ of $G$ satisfies an HGML formula $\varphi(x)$, denoted by $G \models \varphi(v)$, recursively as follows:
- If $\varphi(x) = a(x)$ for $a \in C$, then $G \models \varphi(v)$ iff $a$ is the color of $v$ in $G$, i.e., $c(v) = a$.
- If $\varphi(x) = \neg\varphi'(x)$, then $G \models \varphi(v)$ iff $G \not\models \varphi'(v)$.
- If $\varphi(x) = \varphi'(x) \land \varphi''(x)$, then $G \models \varphi(v)$ iff $G \models \varphi'(v)$ and $G \models \varphi''(v)$.
- If $\varphi(x) = \exists^{\geq N} \bar{y}\, (r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \Psi(\bar{y}))$, then $G \models \varphi(v)$ iff there exist at least $N$ tuples $(w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_{\mathrm{ar}(r)})$ of nodes of $G$ such that $r(w_1, \dots, w_{i-1}, v, w_{i+1}, \dots, w_{\mathrm{ar}(r)})$ holds in $G$ and the boolean combination $\Psi(w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_{\mathrm{ar}(r)})$ evaluates to true.
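To make the counting semantics concrete, here is a small, self-contained Python sketch that evaluates the quantifier case by brute force on an edge list. The fact tuples follow the entity order of Figure 1, the color atom (Person) is omitted, and all names are illustrative rather than part of any formal definition.

```python
from itertools import product

# Facts as (relation, ordered tuple of entities), following Figure 1.
facts = {
    ("StudyDegree", ("Hawking", "Oxford", "Physics", "BA")),
    ("Awarded", ("Physics", "Nobel", "Oxford")),
}
nodes = {"Hawking", "Oxford", "Physics", "BA", "Nobel"}

def exists_at_least(N, rel, arity, i, x, psi):
    """Evaluate the HGML quantifier: x participates in at least N
    rel-edges at (0-based) position i whose remaining entries satisfy psi."""
    count = 0
    for rest in product(nodes, repeat=arity - 1):
        args = rest[:i] + (x,) + rest[i:]
        if (rel, args) in facts and psi(rest):
            count += 1
    return count >= N

# alpha(m): m has been awarded fewer than two prizes of any subject,
# i.e. the negation of an "at least 2" quantifier with m at position 3.
alpha = lambda m: not exists_at_least(2, "Awarded", 3, 2, m, lambda rest: True)

# phi(x): x studied some degree at a university m satisfying alpha(m);
# the university is the second entry of a StudyDegree fact here.
phi = lambda x: exists_at_least(1, "StudyDegree", 4, 0, x,
                                lambda rest: alpha(rest[0]))
print(phi("Hawking"))  # True on the two facts above
```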
As an example, consider the set of relations from Figure 1, that is, relations $\{\mathrm{Person}(x), \mathrm{StudyDegree}(x, y, z, m), \mathrm{Awarded}(w, p, m)\}$. Consider the property: $x$ is a person who obtained a degree $y$ of a subject $z$ at a university $m$ that has been awarded less than two prizes $p$ of some subject $w$. This can be expressed as the following HGML formula:
\[
\phi(x) = \mathrm{Person}(x) \land \exists y, z, m\, \big(\mathrm{StudyDegree}(x, y, z, m) \land \neg\exists^{\geq 2} p, w\, (\mathrm{Awarded}(w, p, m))\big).
\]
Observe that HGML formulas have a restricted form, and hence they are not able to represent all logical queries, which hints at the fundamental limitations of the models we study. For instance, formulas in HGML can only express local properties of nodes, that is, properties of the form "a node is connected (via hyperedges) to other nodes satisfying other (local) properties". This is illustrated in the example above, as the variables $y, z, m$ are forced to appear together with $x$ in the hyperedge $\mathrm{StudyDegree}(x, y, z, m)$. Another limitation of HGML is that once we quantify over the neighboring variables of $x$ (in the example, $y, z, m$), we can only check (local) HGML properties separately for the neighboring variables and combine them via boolean combinations. In the example above, we check the property "$m$ has been awarded less than two prizes $p$ of some subject $w$" for the university $m$ via the HGML formula $\alpha(m) = \neg\exists^{\geq 2} p, w\, (\mathrm{Awarded}(w, p, m))$. In particular, we cannot check properties that involve two or more neighboring variables simultaneously, as these properties would not be HGML properties (they would not even be unary). As an example, consider the property: $x$ is a person who obtained a degree $y$ of a subject $z$ at a university $m$ that has been awarded less than two prizes $p$ in subject $z$. Now we do not impose that $m$ has less than two prizes in any subject, but less than two prizes in the particular subject $z$ (the same related to person $x$). This can be expressed as:
\[
\phi(x) = \mathrm{Person}(x) \land \exists y, z, m\, \big(\mathrm{StudyDegree}(x, y, z, m) \land \neg\exists^{\geq 2} p\, (\mathrm{Awarded}(z, p, m))\big).
\]
Note that this is not an HGML formula, as $\beta(m, z) = \neg\exists^{\geq 2} p\, (\mathrm{Awarded}(z, p, m))$ checks a condition that involves two neighboring variables ($m$ and $z$). This violates exactly the requirement discussed above.

D.2 Proof of Theorem 4.3

Before showing Theorem 4.3, we need to prove an auxiliary result. We define a restriction of HGML, denoted by HGMLr, as follows. HGMLr is defined as HGML, except for the inductive case
\[
\exists^{\geq N} \bar{y}\, \big(r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \Psi(\bar{y})\big),
\]
where now we impose $\Psi(\bar{y})$ to be a conjunction of HGML formulas with different free variables, that is, $\Psi(\bar{y}) = \varphi_1(y_1) \land \dots \land \varphi_{i-1}(y_{i-1}) \land \varphi_{i+1}(y_{i+1}) \land \dots \land \varphi_{\mathrm{ar}(r)}(y_{\mathrm{ar}(r)})$. We have that HGML is actually equivalent to HGMLr.

Proposition D.1. Every HGML formula can be translated into an equivalent HGMLr formula.

Proof. We apply induction on the formulas in HGML. The only interesting case is when the formula has the form $\exists^{\geq N} \bar{y}\, (r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \Psi(\bar{y}))$ for $r \in R$, $1 \leq i \leq \mathrm{ar}(r)$, $N \geq 1$, and a boolean combination $\Psi(\bar{y})$ of HGML formulas. We can write $\Psi(\bar{y})$ in disjunctive normal form and, since negation and conjunction are part of HGML, we can assume that $\Psi(\bar{y})$ has the form
\[
\bigvee_{1 \leq k \leq d} \varphi^k_1(y_1) \land \dots \land \varphi^k_{i-1}(y_{i-1}) \land \varphi^k_{i+1}(y_{i+1}) \land \dots \land \varphi^k_{\mathrm{ar}(r)}(y_{\mathrm{ar}(r)}).
\]
For $1 \leq k \leq d$ and a subset $T \subseteq \{1, \dots, i-1, i+1, \dots, \mathrm{ar}(r)\}$, we denote by $\phi^k_T$ the formula
\[
\phi^k_T(y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) = \bigwedge_{a \in T} \neg\varphi^k_a(y_a) \land \bigwedge_{a \notin T} \varphi^k_a(y_a).
\]
Note that $\phi^k_T$ expresses that, for the $k$-th disjunct of $\Psi$, the conjuncts $\varphi^k_a(y_a)$ that are false are precisely those for which $a \in T$. In particular, the $k$-th disjunct of $\Psi$ corresponds to $\phi^k_\emptyset$. For $S \subseteq \{1, \dots, d\}$ and a vector $T = (T_k \subseteq \{1, \dots, i-1, i+1, \dots, \mathrm{ar}(r)\} : T_k \neq \emptyset,\, k \notin S)$, we denote by $\Psi_{S,T}$ the formula:
\[
\Psi_{S,T}(y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) = \bigwedge_{k \in S} \phi^k_\emptyset \land \bigwedge_{k \notin S} \phi^k_{T_k}.
\]
$\Psi_{S,T}$ expresses that exactly the disjuncts with $k \in S$ are true, and each of the remaining false disjuncts with $k \notin S$ is falsified by making false precisely the conjuncts $\varphi^k_a(y_a)$ with $a \in T_k$. Since HGML contains negation and conjunction, we can write $\Psi_{S,T}$ as a conjunction of HGML formulas with different free variables, that is:
\[
\Psi_{S,T}(y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) = \alpha_1(y_1) \land \dots \land \alpha_{i-1}(y_{i-1}) \land \alpha_{i+1}(y_{i+1}) \land \dots \land \alpha_{\mathrm{ar}(r)}(y_{\mathrm{ar}(r)}).
\]
Define
\[
F := \{\Psi_{S,T} \mid S \subseteq \{1, \dots, d\},\, S \neq \emptyset,\, T = (T_k \subseteq \{1, \dots, i-1, i+1, \dots, \mathrm{ar}(r)\} \mid T_k \neq \emptyset,\, k \notin S)\}.
\]
Then, by construction, $\Psi$ is true iff exactly one of the formulas in $F$ is true. It follows that $\exists^{\geq N} \bar{y}\, (r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \Psi(\bar{y}))$ is equivalent to the HGMLr formula
\[
\bigvee_{\substack{(N_{S,T} \leq N \,\mid\, \Psi_{S,T} \in F) \\ \sum_{S,T} N_{S,T} = N}} \;\; \bigwedge_{\Psi_{S,T} \in F} \exists^{\geq N_{S,T}} \bar{y}\, \big(r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \widetilde{\Psi}_{S,T}(\bar{y})\big),
\]
where
\[
\widetilde{\Psi}_{S,T}(y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) = \widetilde{\alpha}_1(y_1) \land \dots \land \widetilde{\alpha}_{i-1}(y_{i-1}) \land \widetilde{\alpha}_{i+1}(y_{i+1}) \land \dots \land \widetilde{\alpha}_{\mathrm{ar}(r)}(y_{\mathrm{ar}(r)}),
\]
and $\widetilde{\alpha}_a(y_a)$ is the translation to HGMLr of the formula $\alpha_a(y_a)$, which we already have by induction.

Now we are ready to prove Theorem 4.3.

Theorem 4.3. Each hypergraph graded modal logic classifier is captured by an HR-MPNN.

Proof. We follow a strategy similar to the logical characterizations of Barceló et al. (2020) and Huang et al. (2023). Let $\varphi(x)$ be a formula in HGML, where the vocabulary contains relation types $R$ and node colors $C$. By Proposition D.1, we can assume that $\varphi(x)$ belongs to HGMLr. Let $\varphi_1, \dots, \varphi_L$ be an enumeration of the subformulas of $\varphi$ such that if $\varphi_i$ is a subformula of $\varphi_j$, then $i \leq j$. In particular, $\varphi_L = \varphi$. We shall define an HR-MPNN $\mathcal{B}_\varphi$ with $L$ layers computing $L$-dimensional features in each layer. The idea is that at layer $\ell \in \{1, \dots, L\}$, the $\ell$-th component of the feature $h^{(\ell)}_v$ is computed correctly and corresponds to 1 if $\varphi_\ell$ is satisfied at node $v$, and 0 otherwise. We add an additional final layer that simply outputs the last component of the feature vector. We use models of HR-MPNNs of the following form:
\[
h^{(\ell+1)}_v = f^{(\ell)}\Big(\Big[h^{(\ell)}_v \,\Big\|\, \sum_{(e,i) \in E(v)} g^{(\ell)}_{\rho(e)}\Big(\bigodot_{j \neq i}\big(p_j - h^{(\ell)}_{e(j)}\big)\Big)\Big]\Big).
\]
Here, $f^{(\ell)}(z) = \sigma(W^{(\ell)} z + b)$, where $W^{(\ell)}$ is a parameter matrix, $b$ is the bias term, and $\sigma$ is a non-linearity. For a relation type $r \in R$, the function $g^{(\ell)}_r$ has the form $g^{(\ell)}_r(z) = a_r \odot \sigma(W^{(\ell)}_r z + b)$, where $W^{(\ell)}_r$ is a parameter matrix, $a_r$ is a parameter vector, and $b$ is the all-ones bias vector. Recall that $\odot$ denotes element-wise multiplication, and $p_j$ is the positional encoding at position $j$, which in this case is a parameter vector. The parameter matrix $W^{(\ell)}$ will be an $(L \times 2L)$-matrix of the form $W^{(\ell)} = [W^{(\ell)}_0 \; I]$, where $W^{(\ell)}_0$ is an $(L \times L)$ parameter matrix and $I$ is the $(L \times L)$ identity matrix. The parameter matrices $W^{(\ell)}_0$ and $W^{(\ell)}_r$ are actually layer-independent, and hence we omit the superscripts. Therefore, our models are of the following form:
\[
h^{(\ell+1)}_v = \sigma\Big(W_0 h^{(\ell)}_v + \sum_{\substack{(e,i) \in E(v) \\ \rho(e) = r}} a_r \odot \sigma\Big(W_r \bigodot_{j \neq i}\big(p_j - h^{(\ell)}_{e(j)}\big) + b\Big) + b\Big).
\]
For the non-linearity $\sigma$, we use the truncated ReLU function $\sigma(x) = \min(\max(0, x), 1)$.

Let $m$ be the maximum arity of the relations in $R$. For $1 \leq j \leq m$, the positional encoding $p_j$ is defined as follows. The dimension of $p_j$ must be $L$ (the same as for feature vectors). We define a set of positions $I_j \subseteq \{1, \dots, L\}$ as follows: $k \in I_j$ iff there exists a subformula of $\varphi$ of the form
\[
\exists^{\geq N} \bar{y}\, \big(r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \alpha_1(y_1) \land \dots \land \alpha_{i-1}(y_{i-1}) \land \alpha_{i+1}(y_{i+1}) \land \dots \land \alpha_{\mathrm{ar}(r)}(y_{\mathrm{ar}(r)})\big)
\]
such that $j \in \{1, \dots, i-1, i+1, \dots, \mathrm{ar}(r)\}$ and $\alpha_j$ is the $k$-th subformula in the enumeration $\varphi_1, \dots, \varphi_L$.
Then we define $p_j$ such that $(p_j)_k = 1$ if $k \in I_j$ and $(p_j)_k = 3$ otherwise.

Now we define the parameter matrices $W_0 \in \mathbb{R}^{L \times L}$ and $W_r \in \mathbb{R}^{L \times L}$, for $r \in R$, together with the bias vector $b$. For $1 \leq \ell \leq L$, the $\ell$-th row of $W_0$ and $W_r$, and the $\ell$-th entries of $a_r$ and $b$, are defined as follows (omitted entries are 0):
1. If $\varphi_\ell(x) = a(x)$ for a color $a \in C$, then $(W_0)_{\ell\ell} = 1$.
2. If $\varphi_\ell(x) = \neg\varphi_k(x)$, then $(W_0)_{\ell k} = -1$ and $b_\ell = 1$.
3. If $\varphi_\ell(x) = \varphi_j(x) \land \varphi_k(x)$, then $(W_0)_{\ell j} = 1$, $(W_0)_{\ell k} = 1$, and $b_\ell = -1$.
4. If $\varphi_\ell(x) = \exists^{\geq N} \bar{y}\, \big(r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \varphi_{k_1}(y_1) \land \dots \land \varphi_{k_{i-1}}(y_{i-1}) \land \varphi_{k_{i+1}}(y_{i+1}) \land \dots \land \varphi_{k_{\mathrm{ar}(r)}}(y_{\mathrm{ar}(r)})\big)$, then $(W_r)_{\ell k_j} = -1$ for $j \in \{1, \dots, i-1, i+1, \dots, \mathrm{ar}(r)\}$, $(a_r)_\ell = 1$, and $b_\ell = -N + 1$.

Let $G = (V, E, R, c)$ be a relational hypergraph with node colors from $C$. In order to apply $\mathcal{B}_\varphi$ to $G$, we choose initial $L$-dimensional features $h^{(0)}_v$ such that $(h^{(0)}_v)_\ell = 1$ if $\varphi_\ell = a(x)$ and $a$ is the color of $v$, and $(h^{(0)}_v)_\ell = 0$ otherwise. In other words, the $L$-dimensional initial feature $h^{(0)}_v$ is a one-hot encoding of the color of $v$. To conclude the theorem, we show by induction the following statement:

($\ast$) For all $1 \leq \ell \leq L$, all $1 \leq p \leq \ell$, and all $v \in V$, we have $(h^{(\ell)}_v)_p = 1$ if and only if $G \models \varphi_p(v)$.

We start by showing the following:

($\ast\ast$) For all $1 \leq \ell \leq L$, all $v \in V$, and all $1 \leq p \leq L$ such that $\varphi_p(x) = a(x)$ for some $a \in C$, we have $(h^{(\ell)}_v)_p = 1$ if and only if $G \models \varphi_p(v)$.

We apply induction on $\ell$. For the base case, assume $\ell = 1$. Take $v \in V$ and $1 \leq p \leq L$ such that $\varphi_p(x) = a(x)$ for some $a \in C$. By construction, we have that $(h^{(1)}_v)_p = \sigma((h^{(0)}_v)_p) = (h^{(0)}_v)_p$. By definition of $h^{(0)}$, we obtain that $(h^{(1)}_v)_p = 1$ if and only if $G \models \varphi_p(v)$. For the inductive case, suppose $\ell > 1$ and take $v \in V$ and $1 \leq p \leq L$ such that $\varphi_p(x) = a(x)$ for some $a \in C$. We have that $(h^{(\ell)}_v)_p = \sigma((h^{(\ell-1)}_v)_p) = (h^{(\ell-1)}_v)_p$. By the inductive hypothesis, we know that $(h^{(\ell-1)}_v)_p = 1$ if and only if $G \models \varphi_p(v)$. It follows that $(h^{(\ell)}_v)_p = 1$ if and only if $G \models \varphi_p(v)$.

We now prove statement ($\ast$). We start with the base case $\ell = 1$. Take $v \in V$. It must be the case that $p = 1$, and hence $\varphi_p(x) = a(x)$ for some $a \in C$. The result follows from ($\ast\ast$). For the inductive case, take $\ell > 1$, $v \in V$, and $1 \leq p \leq \ell$. We consider several cases:

- Suppose $\varphi_p(x) = a(x)$ for some color $a \in C$. Then the result follows from ($\ast\ast$).
- Suppose that $\varphi_p(x) = \neg\varphi_k(x)$. We have that $(h^{(\ell)}_v)_p = \sigma(-(h^{(\ell-1)}_v)_k + 1)$. We obtain that $(h^{(\ell)}_v)_p = 1$ iff $(h^{(\ell-1)}_v)_k = 0$. Since $k \leq \ell - 1$, we have by the inductive hypothesis that $(h^{(\ell-1)}_v)_k = 1$ iff $G \models \varphi_k(v)$. It follows that $(h^{(\ell)}_v)_p = 1$ iff $G \models \varphi_p(v)$.
- Suppose that $\varphi_p(x) = \varphi_j(x) \land \varphi_k(x)$. Then $(h^{(\ell)}_v)_p = \sigma((h^{(\ell-1)}_v)_j + (h^{(\ell-1)}_v)_k - 1)$. We obtain that $(h^{(\ell)}_v)_p = 1$ iff $(h^{(\ell-1)}_v)_j = 1$ and $(h^{(\ell-1)}_v)_k = 1$. Since $j, k \leq \ell - 1$, we have by the inductive hypothesis that $(h^{(\ell-1)}_v)_j = 1$ iff $G \models \varphi_j(v)$ and $(h^{(\ell-1)}_v)_k = 1$ iff $G \models \varphi_k(v)$. It follows that $(h^{(\ell)}_v)_p = 1$ iff $G \models \varphi_p(v)$.
- Suppose that $\varphi_p(x) = \exists^{\geq N} \bar{y}\, \big(r(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_{\mathrm{ar}(r)}) \land \varphi_{k_1}(y_1) \land \dots \land \varphi_{k_{i-1}}(y_{i-1}) \land \varphi_{k_{i+1}}(y_{i+1}) \land \dots \land \varphi_{k_{\mathrm{ar}(r)}}(y_{\mathrm{ar}(r)})\big)$. Then
\[
(h^{(\ell)}_v)_p = \sigma\Big(\sum_{\substack{(e,q) \in E(v) \\ \rho(e) = r}} \sigma\Big(-\sum_{j \neq i}\Big(\bigodot_{t \neq q}\big(p_t - h^{(\ell-1)}_{e(t)}\big)\Big)_{k_j} + 1\Big) - N + 1\Big).
\]
We say that a pair $(e, q) \in E(v)$, with $\rho(e) = r$, is good if $q = i$ and $G \models \varphi_{k_j}(e(j))$ for all $j \in \{1, \dots, i-1, i+1, \dots, \mathrm{ar}(r)\}$. We claim that $\sum_{j \neq i}(\bigodot_{t \neq q}(p_t - h^{(\ell-1)}_{e(t)}))_{k_j} = 0$ if $(e, q)$ is good, and $\sum_{j \neq i}(\bigodot_{t \neq q}(p_t - h^{(\ell-1)}_{e(t)}))_{k_j} \geq 1$ otherwise.

Suppose $(e, q)$ is good. Then $q = i$. Take $j \neq i$. We have that $(\bigodot_{t \neq i}(p_t - h^{(\ell-1)}_{e(t)}))_{k_j} = 0$, since the factor $(p_t - h^{(\ell-1)}_{e(t)})_{k_j} = 0$ when $t = j$. Indeed, by construction, $(p_j)_{k_j} = 1$.
Also, since $k_j \leq \ell - 1$, we have by the inductive hypothesis that $(h^{(\ell-1)}_{e(j)})_{k_j} = 1$ iff $G \models \varphi_{k_j}(e(j))$. Since $(e, q)$ is good, it follows that $(h^{(\ell-1)}_{e(j)})_{k_j} = 1$. Hence $(p_j - h^{(\ell-1)}_{e(j)})_{k_j} = 0$.

Suppose now that $(e, q)$ is not good. Assume first that $q = i$. Then there exists $j \neq i$ such that $G \not\models \varphi_{k_j}(e(j))$. We have that $(\bigodot_{t \neq i}(p_t - h^{(\ell-1)}_{e(t)}))_{k_j} \geq 1$. If $t = j$, then we have $(p_t)_{k_j} = 1$. Since $k_j \leq \ell - 1$, by the inductive hypothesis we have that $(h^{(\ell-1)}_{e(j)})_{k_j} = 1$ iff $G \models \varphi_{k_j}(e(j))$. It follows that $(p_t - h^{(\ell-1)}_{e(t)})_{k_j} = 1$ when $t = j$. If $t \notin \{i, j\}$, then $(p_t)_{k_j} = 3$, and then $(p_t - h^{(\ell-1)}_{e(t)})_{k_j} > 1$. Hence $(\bigodot_{t \neq i}(p_t - h^{(\ell-1)}_{e(t)}))_{k_j} \geq 1$. Suppose now that $q \neq i$. Then we can choose $j = q$ and obtain that $(\bigodot_{t \neq q}(p_t - h^{(\ell-1)}_{e(t)}))_{k_q} > 1$. Indeed, we have $(p_t)_{k_q} = 3$ for all $t \neq q$. Hence all the factors of $(\bigodot_{t \neq q}(p_t - h^{(\ell-1)}_{e(t)}))_{k_q}$ are $> 1$, and then the product is $> 1$.

As a consequence of the previous claim, we have that
\[
(h^{(\ell)}_v)_p = \sigma\big(|\{(e, i) \in E(v) \mid \rho(e) = r,\, (e, i) \text{ is good}\}| - N + 1\big).
\]
By definition, $G \models \varphi_p(v)$ iff $|\{(e, i) \in E(v) \mid \rho(e) = r,\, (e, i) \text{ is good}\}| \geq N$. Hence $G \models \varphi_p(v)$ iff $(h^{(\ell)}_v)_p = 1$.

E Proof of Theorem 5.1

Theorem 5.1. Let $G = (V, E, R, c)$ be a relational hypergraph and $q = (q, \tilde{u}, t)$ be a query such that $c$ satisfies target node distinguishability with respect to $q$. Then the following statements hold:
1. For all HC-MPNNs with $L$ layers and initialization Init with $\mathrm{Init} \equiv c$, and all $0 \leq \ell \leq L$, we have $\mathrm{hrwl}^{(\ell)}_1 \preceq h^{(\ell)}_q$.
2. For all $L \geq 0$, there is an HC-MPNN with $L$ layers such that for all $0 \leq \ell \leq L$, $\mathrm{hrwl}^{(\ell)}_1 \equiv h^{(\ell)}_q$ holds.

Proof. Note that, given $G$ and $q$, each HC-MPNN $\mathcal{A}$ with $L$ layers can be translated into an HR-MPNN $\mathcal{B}$ with $L$ layers that produces the same node features in each layer: for $\mathcal{B}$, we choose as initial features the features obtained from the initialization function of $\mathcal{A}$, and use the same architecture as $\mathcal{A}$ (functions Up, Agg, Msg). On the other hand, each HR-MPNN $\mathcal{B}$ with $L$ layers whose initial features define a coloring that satisfies generalized target node distinguishability with respect to $q$ can be translated into an HC-MPNN $\mathcal{A}$ with $L$ layers that computes the same node features in each layer: we can define the initialization function of $\mathcal{A}$ so that we obtain the initial features of $\mathcal{B}$, and then use the same architecture as $\mathcal{B}$. Item (1) is obtained by translating the given HC-MPNN into its corresponding HR-MPNN and then invoking Theorem 4.1. Similarly, item (2) is obtained by applying Theorem 4.1 to obtain an equivalent HR-MPNN and then translating it into an HC-MPNN.

F Proof of Theorem 5.3

We consider symbolic queries $q = (q, \bar{b}, t)$, where each $b \in \bar{b}$ is a constant symbol. We consider vocabularies containing relation types $r \in R$, node colors $C$, and the constants $b \in \bar{b}$. In this case, we work with relational hypergraphs $G = (V, E, R, c, (v_b)_{b \in \bar{b}})$, where the range of the coloring $c$ is $C$ and $v_b$ is the interpretation of the constant $b$. We only focus on valid relational hypergraphs, that is, $G = (V, E, R, c, (v_b)_{b \in \bar{b}})$ such that for all $b, b' \in \bar{b}$, $b \neq b'$ implies $v_b \neq v_{b'}$. We define hypergraph graded modal logic with constants (HGMLc) as HGML but, as atomic cases, we additionally allow formulas of the form $\varphi(x) = (x = b)$ for some constant $b$. As expected, we have that HC-MPNNs can capture HGMLc classifiers.

Theorem 5.3. Each HGMLc classifier can be captured by an HC-MPNN over valid relational hypergraphs.

Proof. The theorem follows by applying the same construction as in the proof of Theorem 4.3. Now we have extra base cases of the form $\varphi(x) = (x = b)$, but the same arguments apply.
Note that now we need to define the initial features $h^{(0)}$ via the initialization function of the HC-MPNN. Since we are focusing on valid relational hypergraphs, this can easily be done while satisfying generalized target node distinguishability.

G Link prediction with knowledge graphs

An interesting observation is that when we restrict relational hypergraphs to have hyperedges of arity exactly 2, we recover the class of knowledge graphs. C-MPNNs (Huang et al., 2023) are tailored to knowledge graphs, and their expressive power has recently been studied extensively, with a focus on their capability of distinguishing pairs of nodes (for a formal definition, see Appendix A). In this section, we compare HC-MPNNs and C-MPNNs, and hence we are interested in the expressive power of HC-MPNNs in terms of distinguishing pairs of nodes. Note, however, that in principle HC-MPNNs do not compute binary invariants. Indeed, for $q \in R$ and a pair of nodes $u, v$, we can obtain two final features depending on whether we pose the query $q(u, ?)$ or $q(?, v)$. As a convention, we shall define the final feature of the pair $u, v$ as the result of the query $q(u, ?)$. When an HC-MPNN computes binary invariants under this convention, we say the HC-MPNN is restricted to tail predictions.

We proceed to show that HC-MPNNs restricted to tail predictions have the same expressive power, in terms of distinguishing pairs of nodes, as the $\mathrm{rawl}^+_2$ test proposed in Huang et al. (2023). This test is an extension of $\mathrm{rawl}_2$, which in turn matches the expressive power of C-MPNNs. It follows that HC-MPNNs are strictly more powerful than C-MPNNs over knowledge graphs. We show this by first defining a variant of the relational WL test which upper-bounds the expressive power of HC-MPNNs restricted to tail predictions. Given a knowledge graph $G = (V, E, R, c, \eta)$, where $\eta: V \times V \to D$ is a pairwise coloring satisfying target node distinguishability, i.e., $\eta(u, u) \neq \eta(u, v)$ for all $u \neq v$, we define the relational hypergraph conditioned local 2-WL test, denoted $\mathrm{hcwl}_2$. $\mathrm{hcwl}_2$ iteratively updates the binary coloring $\eta$ as follows, for all $\ell \geq 0$:
\[
\mathrm{hcwl}^{(0)}_2(u, v) = \eta(u, v), \qquad
\mathrm{hcwl}^{(\ell+1)}_2(u, v) = \tau\Big(\mathrm{hcwl}^{(\ell)}_2(u, v),\ \{\!\!\{(\{(\mathrm{hcwl}^{(\ell)}_2(u, w), j) \mid (w, j) \in N_i(e)\}, \rho(e)) \mid (e, i) \in E(v)\}\!\!\}\Big).
\]
Note that, indeed, $\mathrm{hcwl}^{(\ell)}_2$ computes a binary invariant for all $\ell \geq 0$. First, we show that HC-MPNNs restricted to tail predictions are indeed characterized by $\mathrm{hcwl}_2$. The proof idea is very similar to Theorem 5.1 of Huang et al. (2023).

Theorem G.1. Let $G = (V, E, R, x, \eta)$ be a knowledge graph, where $x$ is a feature map and $\eta$ is a pairwise node coloring satisfying target node distinguishability. Given a query $q = (q, u, 2)$, we have:
1. For all HC-MPNNs restricted to tail predictions with $L$ layers and initialization Init with $\mathrm{Init} \equiv \eta$, and all $0 \leq \ell \leq L$, we have $\mathrm{hcwl}^{(\ell)}_2 \preceq h^{(\ell)}_q$.
2. For all $L \geq 0$, there is an HC-MPNN restricted to tail predictions with $L$ layers such that for all $0 \leq \ell \leq L$, we have $\mathrm{hcwl}^{(\ell)}_2 \equiv h^{(\ell)}_q$.

Proof. We first rewrite the HC-MPNN restricted to tail predictions in the following form. Given a query $q = (q, \tilde{u}, t)$, we know that, since $G$ is a knowledge graph, $\tilde{u}$ consists of a single node, which we denote by $u$. In addition, since we only consider the case of tail prediction, we always have $t = 2$.
With this restriction, we restate the HC-MPNN restricted to tail predictions on the knowledge graph as follows:
\[
h^{(0)}_{v|q} = \mathrm{Init}(v, q), \qquad
h^{(\ell+1)}_{v|q} = \mathrm{Up}\Big(h^{(\ell)}_{v|q},\ \mathrm{Agg}\big(h^{(\ell)}_{v|q}, \{\!\!\{\mathrm{Msg}_{\rho(e)}(\{(h^{(\ell)}_{w|q}, j) \mid (w, j) \in N_i(e)\}) \mid (e, i) \in E(v)\}\!\!\}\big)\Big).
\]
Now, we follow an idea similar to the proof for C-MPNNs and binary invariants (Huang et al., 2023). Let $G = (V, E, R, c, \eta)$ be a knowledge graph where $\eta$ is a pairwise coloring. Construct the auxiliary knowledge graph $G^2 = (V \times V, E', R, c_\eta)$, where $E' = \{r((u, w), (u, v)) \mid r(w, v) \in E, r \in R\}$ and $c_\eta$ is the node coloring $c_\eta((u, v)) = \eta(u, v)$. Similarly to Theorem 5.1, if $\mathcal{A}$ is an HC-MPNN and $\mathcal{B}$ is an HR-MPNN, we write $h^{(\ell)}_{A,G}(u, v) := h^{(\ell)}_{(q,(u),2)}(v)$ and $h^{(\ell)}_{B,G^2}((u, v)) := h^{(\ell)}((u, v))$ for the features computed by $\mathcal{A}$ and $\mathcal{B}$ over $G$ and $G^2$, respectively. We sometimes write $N^G_i(e)$ and $E^G(v)$ to emphasize that the positional neighborhood within a hyperedge, and the set of hyperedges containing node $v$, are taken over the knowledge graph $G$. Finally, we say that an initial feature map $y$ for $G^2$ satisfies generalized target node distinguishability if $y((u, u)) \neq y((u, v))$ for all $u \neq v$. Note here that generalized target node distinguishability naturally reduces to the target node distinguishability proposed in Huang et al. (2023), since $\tilde{u}$ is a singleton. Thus, we have the following equivalence between HR-MPNNs and HC-MPNNs restricted to tail predictions on knowledge graphs.

Proposition G.2. Let $G = (V, E, R, x, \eta)$ be a knowledge graph, where $x$ is a feature map and $\eta$ is a pairwise coloring. Let $q \in R$. Then:
1. For every HC-MPNN $\mathcal{A}$ with $L$ layers, there are an initial feature map $y$ for $G^2$ and an HR-MPNN $\mathcal{B}$ with $L$ layers such that for all $0 \leq \ell \leq L$ and $u, v \in V$, we have $h^{(\ell)}_{A,G}(u, v) = h^{(\ell)}_{B,G^2}((u, v))$.
2. For every initial feature map $y$ for $G^2$ satisfying generalized target node distinguishability and every HR-MPNN $\mathcal{B}$ with $L$ layers, there is an HC-MPNN $\mathcal{A}$ with $L$ layers such that for all $0 \leq \ell \leq L$ and $u, v \in V$, we have $h^{(\ell)}_{A,G}(u, v) = h^{(\ell)}_{B,G^2}((u, v))$.

Proof. We proceed to show item (1) first. Consider the HR-MPNN $\mathcal{B}$ with the same relation-specific message $\mathrm{Msg}_r$, aggregation Agg, and update functions Up as $\mathcal{A}$ for all $L$ layers. The initial feature map $y$ is defined as $y((u, v)) = \mathrm{Init}(v, (q, (u), 2))$, where Init is the initialization function of $\mathcal{A}$. Then, by induction on the number of layers $\ell$, we have for the base case $\ell = 0$ that $h^{(0)}_A(u, v) = \mathrm{Init}(v, (q, (u), 2)) = y((u, v)) = h^{(0)}_B((u, v))$. For the inductive case, assume $h^{(\ell)}_A(u, v) = h^{(\ell)}_B((u, v))$. Then
\[
\begin{aligned}
h^{(\ell+1)}_A(u, v) &= \mathrm{Up}\Big(h^{(\ell)}_A(u, v),\ \mathrm{Agg}\big(h^{(\ell)}_A(u, v), \{\!\!\{\mathrm{Msg}_{\rho(e)}(\{(h^{(\ell)}_A(u, w), j) \mid (w, j) \in N^G_i(e)\}) \mid (e, i) \in E^G(v)\}\!\!\}\big)\Big) \\
&= \mathrm{Up}\Big(h^{(\ell)}_B((u, v)),\ \mathrm{Agg}\big(h^{(\ell)}_B((u, v)), \{\!\!\{\mathrm{Msg}_{\rho(e)}(\{(h^{(\ell)}_B((u, w)), j) \mid (w, j) \in N^{G^2}_i(e)\}) \mid (e, i) \in E^{G^2}((u, v))\}\!\!\}\big)\Big) \\
&= h^{(\ell+1)}_B((u, v)).
\end{aligned}
\]
To show item (2), we consider $\mathcal{A}$ with the same relation-specific message $\mathrm{Msg}_r$, aggregation Agg, and update functions Up as $\mathcal{B}$ for all $L$ layers. We also take an initialization function Init such that $\mathrm{Init}(v, (q, (u), 2)) = y((u, v))$. Then, we can follow the same argument as for item (1).

We then show the equivalence in terms of the relational WL algorithms:

Proposition G.3. Let $G = (V, E, R, c, \eta)$ be a knowledge graph where $\eta$ is a pairwise coloring. For all $\ell \geq 0$ and $u, v \in V$, we have that $\mathrm{hcwl}^{(\ell)}_2(u, v)$ computed over $G$ coincides with $\mathrm{hrwl}^{(\ell)}_1((u, v))$ computed over $G^2 = (V \times V, E', R, c_\eta)$.

Proof. For $\ell = 0$, we have $\mathrm{hcwl}^{(0)}_2(G, u, v) = \eta(u, v) = c_\eta((u, v)) = \mathrm{hrwl}^{(0)}_1(G^2, (u, v))$.
For the inductive case, we have that
\[
\begin{aligned}
\mathrm{hcwl}^{(\ell+1)}_2(G, u, v) &= \tau\Big(\mathrm{hcwl}^{(\ell)}_2(G, u, v),\ \{\!\!\{(\{(\mathrm{hcwl}^{(\ell)}_2(G, u, w), j) \mid (w, j) \in N^G_i(e)\}, \rho(e)) \mid (e, i) \in E^G(v)\}\!\!\}\Big) \\
&= \tau\Big(\mathrm{hrwl}^{(\ell)}_1(G^2, (u, v)),\ \{\!\!\{(\{(\mathrm{hrwl}^{(\ell)}_1(G^2, (u, w)), j) \mid (w, j) \in N^{G^2}_i(e)\}, \rho(e)) \mid (e, i) \in E^{G^2}((u, v))\}\!\!\}\Big) \\
&= \mathrm{hrwl}^{(\ell+1)}_1(G^2, (u, v)).
\end{aligned}
\]

Now we are ready to prove Theorem G.1. For $G = (V, E, R, x, \eta)$, we consider $G^2 = (V \times V, E', R, c_\eta)$. We start with item (1). Let $\mathcal{A}$ be an HC-MPNN with $L$ layers and initialization Init satisfying $\mathrm{Init} \equiv \eta$, and let $0 \leq \ell \leq L$. Let $y$ be the initial feature map for $G^2$ and $\mathcal{B}$ the HR-MPNN with $L$ layers from Proposition G.2, item (1). For the initialization, we have $y \equiv c_\eta$ since $y((u, v)) = \mathrm{Init}(v, (q, (u), 2))$. Thus, we can apply Theorem 4.1, item (1), to $G^2$, $y$, and $\mathcal{B}$, and show that $\mathrm{hrwl}^{(\ell)}_1 \preceq h^{(\ell)}_{B,G^2}$, which in turn shows that $\mathrm{hcwl}^{(\ell)}_2 \preceq h^{(\ell)}_{A,G}$.

We then proceed to show item (2). Let $L \geq 0$ be the total number of layers. We apply Theorem 4.1, item (2), to $G^2$ and obtain an initial feature map $y$ with $y \equiv c_\eta$ and an HR-MPNN $\mathcal{B}$ with $L$ layers such that $\mathrm{hrwl}^{(\ell)}_1 \equiv h^{(\ell)}_{B,G^2}$ for all $0 \leq \ell \leq L$. We stress again that $y$ and $\eta$ both satisfy generalized target node distinguishability. Now, let $\mathcal{A}$ be the HC-MPNN from Proposition G.2, item (2). We finally have that $\mathrm{hcwl}^{(\ell)}_2 \equiv h^{(\ell)}_{A,G}$, as required. Note that item (2) again holds for HCNet.

We are now ready to prove the claim that HC-MPNNs are more powerful than C-MPNNs, by showing the strict containment of their corresponding relational WL tests, $\mathrm{hcwl}_2$ and $\mathrm{rawl}_2$. In particular, we show that the defined $\mathrm{hcwl}_2$ is equivalent to the $\mathrm{rawl}^+_2$ test defined in Huang et al. (2023), via Theorem G.4. Then, by Proposition A.17 in Huang et al. (2023), we have that $\mathrm{rawl}^+_2 \preceq \mathrm{rawl}_2$. The intuition behind Theorem G.4 is that, in each update step, $\mathrm{hcwl}_2$ aggregates over all neighboring edges, which contain both incoming and outgoing edges. In addition, $\mathrm{hcwl}_2$ can differentiate between them via the positions of the entities in the edge. This is equivalent to aggregating incoming relations and outgoing inverse relations in $\mathrm{rawl}^+_2$.

Theorem G.4. For all knowledge graphs $G = (V, E, R, c)$, if $\mathrm{hcwl}^{(0)}_2(G) \equiv \mathrm{rawl}^{+(0)}_2(G)$, then $\mathrm{hcwl}^{(\ell)}_2(G) \equiv \mathrm{rawl}^{+(\ell)}_2(G)$ for all $\ell \geq 0$.

Proof. First, we restate the definitions of $\mathrm{hcwl}_2(G)$ and $\mathrm{rawl}^+_2(G)$ for convenience. Given that the query is always a tail query, i.e., $t = 2$, and given a knowledge graph $G = (V, E, R, c)$, the update formula for $\mathrm{hcwl}_2(G)$ is
\[
\begin{aligned}
\mathrm{hcwl}^{(\ell+1)}_2(G, (u, v)) &= \tau\big(\mathrm{hcwl}^{(\ell)}_2(G, (u, v)),\ \{\!\!\{(\{(\mathrm{hcwl}^{(\ell)}_2(G, (u, w)), j) \mid (w, j) \in N_i(e)\}, \rho(e)) \mid (e, i) \in E(v)\}\!\!\}\big) \\
&= \tau\big(\mathrm{hcwl}^{(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{hcwl}^{(\ell)}_2(G, (u, w)), j, \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v)\}\!\!\}\big).
\end{aligned}
\]
Note that the second equation follows from the fact that the maximum arity is always 2. Then, recall the definition of $\mathrm{rawl}_2$. Given a knowledge graph $G = (V, E, R, c, \eta)$, where $\eta$ is a pairwise coloring, we have
\[
\mathrm{rawl}^{(\ell+1)}_2(G, (u, v)) = \tau\big(\mathrm{rawl}^{(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{rawl}^{(\ell)}_2(G, (u, w)), r) \mid w \in N_r(v),\, r \in R\}\!\!\}\big),
\]
where $N_r(v)$ is the relational neighborhood with respect to relation $r \in R$, i.e., $w \in N_r(v)$ if and only if $r(v, w) \in E$. Equivalently, we can rewrite $\mathrm{rawl}_2$ in the following form:
\[
\mathrm{rawl}^{(\ell+1)}_2(G, (u, v)) = \tau\big(\mathrm{rawl}^{(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{rawl}^{(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 1\}\!\!\}\big),
\]
since we only want to obtain the node $w$ as the tail entity of an edge, and thus the second argument of the (only) element in $N_i(e)$ will always be 2.
For a test $T$, we sometimes write $T(G, u)$, or $T(G, u, v)$ in the case of binary tests, to emphasize that the test is applied over $G$, and $T(G)$ for the pairwise/$k$-ary coloring given by the test. Let $G = (V, E, R, c, \eta)$ be a knowledge graph. Then, $G^+ = (V, E^+, R^+)$ is the augmented knowledge graph, where $R^+$ is the disjoint union of $R$ and $\{r^{-1} \mid r \in R\}$, $E^- = \{r^{-1}(v, u) \mid r(u, v) \in E,\, u \neq v\}$, and $E^+ = E \cup E^-$. We can then define
\[
E(v) = \{(e, i) \mid e(i) = v, e \in E\}, \qquad E^+(v) = \{(e, i) \mid e(i) = v, e \in E^+\}, \qquad E^-(v) = \{(e, i) \mid e(i) = v, e \in E^-\}.
\]
Finally, recall the definition $\mathrm{rawl}^+_2(G, u, v) = \mathrm{rawl}_2(G^+, u, v)$. We can write this in the equivalent form:
\[
\begin{aligned}
\mathrm{rawl}^{+(\ell+1)}_2(G, (u, v)) &= \tau\big(\mathrm{rawl}^{+(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E^+(v),\, i = 1\}\!\!\}\big) \\
&= \tau\big(\mathrm{rawl}^{+(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 1\}\!\!\} \\
&\qquad\qquad \uplus \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E^-(v),\, i = 1\}\!\!\}\big).
\end{aligned}
\]
Now we are ready to show the proof. First, we show that $\mathrm{hcwl}^{(\ell)}_2(G) \preceq \mathrm{rawl}^{+(\ell)}_2(G)$. We prove by induction on the number of layers $\ell$ that, for all node pairs $(u, v), (u', v') \in V^2$,
\[
\mathrm{hcwl}^{(\ell+1)}_2(G, (u, v)) = \mathrm{hcwl}^{(\ell+1)}_2(G, (u', v')) \implies \mathrm{rawl}^{+(\ell+1)}_2(G, (u, v)) = \mathrm{rawl}^{+(\ell+1)}_2(G, (u', v')).
\]
By assumption, the base case holds. Assume that $\mathrm{hcwl}^{(\ell)}_2(G) \preceq \mathrm{rawl}^{+(\ell)}_2(G)$ for some $\ell \geq 0$, and take a pair of node pairs $(u, v), (u', v') \in V^2$ such that $\mathrm{hcwl}^{(\ell+1)}_2(G, (u, v)) = \mathrm{hcwl}^{(\ell+1)}_2(G, (u', v'))$. By definition, we have that
\[
\begin{aligned}
&\tau\big(\mathrm{hcwl}^{(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{hcwl}^{(\ell)}_2(G, (u, w)), j, \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v)\}\!\!\}\big) \\
&= \tau\big(\mathrm{hcwl}^{(\ell)}_2(G, (u', v')),\ \{\!\!\{(\mathrm{hcwl}^{(\ell)}_2(G, (u', w)), j, \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E(v')\}\!\!\}\big).
\end{aligned}
\]
Conditioning on $i \in \{1, 2\}$, we can further decompose the multisets:
\[
\begin{aligned}
&\tau\big(\mathrm{hcwl}^{(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{hcwl}^{(\ell)}_2(G, (u, w)), j, \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 1\}\!\!\}, \\
&\qquad\qquad \{\!\!\{(\mathrm{hcwl}^{(\ell)}_2(G, (u, w)), j, \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 2\}\!\!\}\big) \\
&= \tau\big(\mathrm{hcwl}^{(\ell)}_2(G, (u', v')),\ \{\!\!\{(\mathrm{hcwl}^{(\ell)}_2(G, (u', w)), j, \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E(v'),\, i = 1\}\!\!\}, \\
&\qquad\qquad \{\!\!\{(\mathrm{hcwl}^{(\ell)}_2(G, (u', w)), j, \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E(v'),\, i = 2\}\!\!\}\big).
\end{aligned}
\]
Since $\tau$ is injective, the three arguments of $\tau$ must match, i.e., $\mathrm{hcwl}^{(\ell)}_2(G, (u, v)) = \mathrm{hcwl}^{(\ell)}_2(G, (u', v'))$, and the $i = 1$ multisets, as well as the $i = 2$ multisets, are equal on both sides. By the inductive hypothesis, we have that $\mathrm{rawl}^{+(\ell)}_2(G, (u, v)) = \mathrm{rawl}^{+(\ell)}_2(G, (u', v'))$. Thus, we have that
\[
\{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), j, \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 1\}\!\!\} = \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), j, \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E(v'),\, i = 1\}\!\!\}
\]
and
\[
\{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), j, \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 2\}\!\!\} = \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), j, \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E(v'),\, i = 2\}\!\!\}.
\]
For the first equation, we notice that it holds if and only if
\[
\{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 1\}\!\!\} = \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E(v'),\, i = 1\}\!\!\},
\]
since the filtered sets of pairs $(w, j)$ are the same, and $(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e))$ and $(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), \rho(e'))$ match if and only if $(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), 2, \rho(e))$ and $(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), 2, \rho(e'))$ match.
This is because we simply augment each tuple with the additional position indicator 2, as we fixed $i = 1$, which does not break the equivalence of the statements. Then, for the second equation, we note that
\[
\{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), j, \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 2\}\!\!\} = \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), j, \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E(v'),\, i = 2\}\!\!\}
\]
if and only if
\[
\{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E^-(v),\, i = 1\}\!\!\} = \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E^-(v'),\, i = 1\}\!\!\},
\]
since this time the filtered sets of pairs $(w, j)$ also match, but for the inverse relations. Indeed, for any edge $e \in E$ with $(e, 2) \in E(v)$ and $(w, 1) \in N_2(e)$, the edge has the form $\rho(e)(w, v)$, as $w$ is placed in the first position; thus, there is a corresponding reversed edge $\rho(e)^{-1}(v, w) \in E^-$ by definition. Then, by the same argument as above, adding the additional position indicator to every tuple does not break the equivalence of the statement. An important observation is that, since the inverse relations are freshly created, we will never mix up these inverse edges in either test: in $\mathrm{rawl}^+_2$, we can distinguish these edges by checking the freshly created relation symbols $r^{-1} \in R^+ \setminus R$, whereas in $\mathrm{hcwl}_2$, the neighboring nodes from these edges are identified by the position indicator in the tuple. Thus, we have that both the $E(v)$-multisets and the $E^-(v)$-multisets (with $i = 1$) agree across $(u, v)$ and $(u', v')$. Since $\tau$ is injective, this is equivalent to
\[
\begin{aligned}
&\tau\big(\mathrm{rawl}^{+(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E(v),\, i = 1\}\!\!\} \\
&\qquad\qquad \uplus \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E^-(v),\, i = 1\}\!\!\}\big) \\
&= \tau\big(\mathrm{rawl}^{+(\ell)}_2(G, (u', v')),\ \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E(v'),\, i = 1\}\!\!\} \\
&\qquad\qquad \uplus \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E^-(v'),\, i = 1\}\!\!\}\big),
\end{aligned}
\]
and thus we have
\[
\tau\big(\mathrm{rawl}^{+(\ell)}_2(G, (u, v)),\ \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u, w)), \rho(e)) \mid (w, j) \in N_i(e),\, (e, i) \in E^+(v),\, i = 1\}\!\!\}\big) = \tau\big(\mathrm{rawl}^{+(\ell)}_2(G, (u', v')),\ \{\!\!\{(\mathrm{rawl}^{+(\ell)}_2(G, (u', w)), \rho(e')) \mid (w, j) \in N_i(e'),\, (e', i) \in E^+(v'),\, i = 1\}\!\!\}\big),
\]
and finally $\mathrm{rawl}^{+(\ell+1)}_2(G, (u, v)) = \mathrm{rawl}^{+(\ell+1)}_2(G, (u', v'))$. Since all arguments apply in both directions, the converse also holds.

Remark G.5. We remark that the idea of HC-MPNNs restricted to tail predictions can be extended to arbitrary relational hypergraphs in order to compute $k$-ary invariants for any $k$. See Appendix H for a discussion.

H Computing k-ary invariants

In this section, we present a canonical way to construct valid $k$-ary invariants. We start by introducing a construction of a valid $k$-ary invariant termed atomic types, following the convention of Grohe (2021).

H.1 Atomic types

Given a relational hypergraph $G = (V, E, R, c)$ with $l$ labels and a tuple $\bar{u} = (u_1, \dots, u_k) \in V^k$, where $k > 1$, we define the atomic type of $\bar{u}$ in $G$ as a vector
\[
\mathrm{atp}_k(G)(\bar{u}) \in \{0, 1\}^{lk + \binom{k}{2} + m^2 + |R|k^m},
\]
where $l$ is the number of colors and $m$ is the maximum arity over all relations. We use the first $lk$ bits to represent the colors of the $k$ nodes in $\bar{u}$, and another $\binom{k}{2}$ bits to indicate whether node $u_i$ is identical to $u_j$. We then represent the order of these nodes using $m^2$ bits and, finally, represent the relations with an additional $|R|k^m$ bits.

Atomic types are $k$-ary relational hypergraph invariants, as they satisfy the property that $\mathrm{atp}_k(G)(\bar{u}) = \mathrm{atp}_k(G')(\bar{u}')$ if and only if the mapping $u_1 \mapsto u'_1, \dots, u_k \mapsto u'_k$ is an isomorphism from the induced subgraph $G[\{u_1, \dots, u_k\}]$ to $G'[\{u'_1, \dots, u'_k\}]$.
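As an illustration, the following Python sketch assembles a simplified atomic-type vector: one-hot colors, pairwise-equality bits, and one bit per relation and index tuple recording which facts hold on the selected nodes. The $m^2$ order-encoding component is omitted for brevity, and the exact bit layout here is our own, not the paper's.

```python
from itertools import product, combinations

def atomic_type(u, color, facts, relations, num_colors):
    """Simplified atp_k sketch for a tuple u = (u_1, ..., u_k):
    l*k color bits, C(k, 2) equality bits, and at most |R|*k^m bits
    recording which relations hold on index tuples over u."""
    k = len(u)
    bits = []
    for v in u:                                  # l*k one-hot color bits
        bits += [int(color[v] == c) for c in range(num_colors)]
    for i, j in combinations(range(k), 2):       # equality indicator bits
        bits.append(int(u[i] == u[j]))
    for rel, arity in relations:                 # relation incidence bits
        for idx in product(range(k), repeat=arity):
            bits.append(int((rel, tuple(u[i] for i in idx)) in facts))
    return tuple(bits)

# Two nodes colored 0 and 1, and a single binary fact r(0, 1).
print(atomic_type((0, 1), {0: 0, 1: 1}, {("r", (0, 1))}, [("r", 2)], 2))
```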
H.2 Relational hypergraph conditioned local k-WL test

Now we are ready to define the $k$-ary invariants. Similarly to $\mathrm{hcwl}_2$, we can restrict an HC-MPNN to only carry out tail prediction on relational hypergraphs to make sure it directly computes $k$-ary invariants. Here, we introduce the relational hypergraph conditioned local $k$-WL test, dubbed $\mathrm{hcwl}_k$, which naturally generalizes $\mathrm{hcwl}_2$ to relational hypergraphs. Take $\bar{u} \in V^{k-1}$ and a relational hypergraph $G = (V, E, R, c, \zeta)$, where $\zeta: V^k \to D$ is a $k$-ary coloring satisfying generalized target node distinguishability, i.e.,
\[
\zeta(\bar{u}, u) \neq \zeta(\bar{u}, v) \quad \forall u \in \bar{u},\, v \notin \bar{u}, \qquad
\zeta(\bar{u}, u_i) \neq \zeta(\bar{u}, u_j) \quad \forall u_i, u_j \in \bar{u},\, u_i \neq u_j.
\]
$\mathrm{hcwl}_k$ updates the $k$-ary coloring $\zeta$ for $\ell \geq 0$:
\[
\mathrm{hcwl}^{(0)}_k(\bar{u}, v) = \zeta(\bar{u}, v), \qquad
\mathrm{hcwl}^{(\ell+1)}_k(\bar{u}, v) = \tau\Big(\mathrm{hcwl}^{(\ell)}_k(\bar{u}, v),\ \{\!\!\{(\{(\mathrm{hcwl}^{(\ell)}_k(\bar{u}, w), j) \mid (w, j) \in N_i(e)\}, \rho(e)) \mid (e, i) \in E(v)\}\!\!\}\Big).
\]
Again, we note that $\mathrm{hcwl}^{(\ell)}_k$ computes a valid $k$-ary invariant. We can also show that HC-MPNNs restricted to tail predictions, i.e., with queries $q = (q, \bar{u}, j)$ where $j = k$, are characterized by $\mathrm{hcwl}_k$.

Theorem H.1. Let $G = (V, E, R, x, \zeta)$ be a relational hypergraph, where $x$ is a feature map and $\zeta$ is a $k$-ary node coloring satisfying generalized target node distinguishability. Given a query $q = (q, \bar{u}, k)$, we have:
1. For all HC-MPNNs restricted to tail predictions with $L$ layers and initialization Init with $\mathrm{Init} \equiv \zeta$, and all $0 \leq \ell \leq L$, we have $\mathrm{hcwl}^{(\ell)}_k \preceq h^{(\ell)}_q$.
2. For all $L \geq 0$, there is an HC-MPNN restricted to tail predictions with $L$ layers such that for all $0 \leq \ell \leq L$, we have $\mathrm{hcwl}^{(\ell)}_k \equiv h^{(\ell)}_q$.

Proof. The proof is very similar to that of Theorem G.1. Note that we sometimes write a $k$-ary tuple $\bar{v} = (u_1, \dots, u_k) \in V^k$ as $(\bar{u}, u_k)$, where $\bar{u} = (u_1, \dots, u_{k-1})$, with a slight abuse of notation. We build an auxiliary relational hypergraph $G^k = (V^k, E', R, c_\zeta)$, where $E' = \{r((\bar{u}, v_1), \dots, (\bar{u}, v_m)) \mid r(v_1, \dots, v_m) \in E,\, r \in R\}$ and $c_\zeta$ is the node coloring $c_\zeta((\bar{u}, v)) = \zeta(\bar{u}, v)$. If $\mathcal{A}$ is an HC-MPNN and $\mathcal{B}$ is an HR-MPNN, we write $h^{(\ell)}_{A,G}(\bar{u}, v) := h^{(\ell)}_q(v)$ and $h^{(\ell)}_{B,G^k}((\bar{u}, v)) := h^{(\ell)}((\bar{u}, v))$ for the features computed by $\mathcal{A}$ and $\mathcal{B}$ over $G$ and $G^k$, respectively. Again, we write $N^G_i(e)$ and $E^G(v)$ to emphasize that the positional neighborhood, as well as the set of hyperedges containing node $v$, are taken over the relational hypergraph $G$. Finally, we say that an initial feature map $y$ for $G^k$ satisfies generalized target node distinguishability if
\[
y((\bar{u}, u)) \neq y((\bar{u}, v)) \quad \forall u \in \bar{u},\, v \notin \bar{u}, \qquad
y((\bar{u}, u_i)) \neq y((\bar{u}, u_j)) \quad \forall u_i, u_j \in \bar{u},\, u_i \neq u_j.
\]
As a result, we have the following equivalence between HR-MPNNs and HC-MPNNs restricted to tail predictions on relational hypergraphs.

Proposition H.2. Let $G = (V, E, R, x, \zeta)$ be a relational hypergraph, where $x$ is a feature map and $\zeta$ is a $k$-ary node coloring. Let $q \in R$. Then:
1. For every HC-MPNN $\mathcal{A}$ with $L$ layers, there are an initial feature map $y$ for $G^k$ and an HR-MPNN $\mathcal{B}$ with $L$ layers such that for all $0 \leq \ell \leq L$ and $(\bar{u}, v) \in V^k$, we have $h^{(\ell)}_{A,G}(\bar{u}, v) = h^{(\ell)}_{B,G^k}((\bar{u}, v))$.
2. For every initial feature map $y$ for $G^k$ satisfying generalized target node distinguishability and every HR-MPNN $\mathcal{B}$ with $L$ layers, there is an HC-MPNN $\mathcal{A}$ with $L$ layers such that for all $0 \leq \ell \leq L$ and $(\bar{u}, v) \in V^k$, we have $h^{(\ell)}_{A,G}(\bar{u}, v) = h^{(\ell)}_{B,G^k}((\bar{u}, v))$.

Proof. We first show item (1).
Consider the HR-MPNN $\mathcal{B}$ with the same relation-specific message $\mathrm{Msg}_r$, aggregation Agg, and update functions Up as $\mathcal{A}$ for all $L$ layers. The initial feature map $y$ is defined as $y((\bar{u}, v)) = \mathrm{Init}(v, q)$, where Init is the initialization function of $\mathcal{A}$. Then, by induction on the number of layers $\ell$, we have for the base case $\ell = 0$ that $h^{(0)}_A(\bar{u}, v) = \mathrm{Init}(v, q) = y((\bar{u}, v)) = h^{(0)}_B((\bar{u}, v))$. For the inductive case, assume $h^{(\ell)}_A(\bar{u}, v) = h^{(\ell)}_B((\bar{u}, v))$. Then
\[
\begin{aligned}
h^{(\ell+1)}_A(\bar{u}, v) &= \mathrm{Up}\Big(h^{(\ell)}_A(\bar{u}, v),\ \mathrm{Agg}\big(h^{(\ell)}_A(\bar{u}, v), \{\!\!\{\mathrm{Msg}_{\rho(e)}(\{(h^{(\ell)}_A(\bar{u}, w), j) \mid (w, j) \in N^G_i(e)\}) \mid (e, i) \in E^G(v)\}\!\!\}\big)\Big) \\
&= \mathrm{Up}\Big(h^{(\ell)}_B((\bar{u}, v)),\ \mathrm{Agg}\big(h^{(\ell)}_B((\bar{u}, v)), \{\!\!\{\mathrm{Msg}_{\rho(e)}(\{(h^{(\ell)}_B((\bar{u}, w)), j) \mid (w, j) \in N^{G^k}_i(e)\}) \mid (e, i) \in E^{G^k}((\bar{u}, v))\}\!\!\}\big)\Big) \\
&= h^{(\ell+1)}_B((\bar{u}, v)).
\end{aligned}
\]
To show item (2), we consider $\mathcal{A}$ with the same relation-specific message $\mathrm{Msg}_r$, aggregation Agg, and update functions Up as $\mathcal{B}$ for all $L$ layers. We also take an initialization function Init such that $\mathrm{Init}(v, q) = y((\bar{u}, v))$. Then, we can follow the same argument as for item (1).

Similarly, we can show the equivalence in terms of the relational WL algorithms with $\mathrm{hcwl}_k$:

Proposition H.3. Let $G = (V, E, R, c, \zeta)$ be a relational hypergraph where $\zeta$ is a $k$-ary node coloring. For all $\ell \geq 0$ and $(\bar{u}, v) \in V^k$, we have that $\mathrm{hcwl}^{(\ell)}_k(\bar{u}, v)$ computed over $G$ coincides with $\mathrm{hrwl}^{(\ell)}_1((\bar{u}, v))$ computed over $G^k = (V^k, E', R, c_\zeta)$.

Proof. For $\ell = 0$, we have $\mathrm{hcwl}^{(0)}_k(G, \bar{u}, v) = \zeta(\bar{u}, v) = c_\zeta((\bar{u}, v)) = \mathrm{hrwl}^{(0)}_1(G^k, (\bar{u}, v))$. For the inductive case, we have that
\[
\begin{aligned}
\mathrm{hcwl}^{(\ell+1)}_k(G, \bar{u}, v) &= \tau\Big(\mathrm{hcwl}^{(\ell)}_k(G, \bar{u}, v),\ \{\!\!\{(\{(\mathrm{hcwl}^{(\ell)}_k(G, \bar{u}, w), j) \mid (w, j) \in N^G_i(e)\}, \rho(e)) \mid (e, i) \in E^G(v)\}\!\!\}\Big) \\
&= \tau\Big(\mathrm{hrwl}^{(\ell)}_1(G^k, (\bar{u}, v)),\ \{\!\!\{(\{(\mathrm{hrwl}^{(\ell)}_1(G^k, (\bar{u}, w)), j) \mid (w, j) \in N^{G^k}_i(e)\}, \rho(e)) \mid (e, i) \in E^{G^k}((\bar{u}, v))\}\!\!\}\Big) \\
&= \mathrm{hrwl}^{(\ell+1)}_1(G^k, (\bar{u}, v)).
\end{aligned}
\]

Now we are ready to prove Theorem H.1. For a relational hypergraph $G = (V, E, R, x, \zeta)$, we consider $G^k = (V^k, E', R, c_\zeta)$ as defined earlier. We start with item (1). Let $\mathcal{A}$ be an HC-MPNN with $L$ layers and initialization Init satisfying $\mathrm{Init} \equiv \zeta$, and let $0 \leq \ell \leq L$. Let $y$ be the initial feature map for $G^k$ and $\mathcal{B}$ the HR-MPNN with $L$ layers from Proposition H.2, item (1). For the initialization, we have $y \equiv c_\zeta$ since $y((\bar{u}, v)) = \mathrm{Init}(v, q)$. Thus, we can apply Theorem 4.1, item (1), to $G^k$, $y$, and $\mathcal{B}$, and show that $\mathrm{hrwl}^{(\ell)}_1 \preceq h^{(\ell)}_{B,G^k}$, which in turn shows that $\mathrm{hcwl}^{(\ell)}_k \preceq h^{(\ell)}_{A,G}$.

We then proceed to show item (2). Let $L \geq 0$ be the total number of layers. We apply Theorem 4.1, item (2), to $G^k$ and obtain an initial feature map $y$ with $y \equiv c_\zeta$ and an HR-MPNN $\mathcal{B}$ with $L$ layers such that $\mathrm{hrwl}^{(\ell)}_1 \equiv h^{(\ell)}_{B,G^k}$ for all $0 \leq \ell \leq L$. We stress again that $y$ and $\zeta$ both satisfy generalized target node distinguishability. Now, let $\mathcal{A}$ be the HC-MPNN from Proposition H.2, item (2). Thus, $\mathrm{hcwl}^{(\ell)}_k \equiv h^{(\ell)}_{A,G}$, as required. Again, we note that item (2) holds for HCNet.

I Complexity analysis

In this section, we discuss the asymptotic time complexity of HR-MPNNs and HC-MPNNs. For HC-MPNNs, we consider the model instance HCNet with $g^{(\ell)}_r$ being a query-independent diagonal linear map. For HR-MPNNs, we consider the model instance with the same update function Up and relation-specific message function $\mathrm{Msg}_r$ as the considered HCNet instance, referred to as HRNet:
\[
h^{(0)}_v = 1^d, \qquad
h^{(\ell+1)}_v = \sigma\Big(W^{(\ell)}\Big[h^{(\ell)}_v \,\Big\|\, \sum_{(e,i) \in E(v)} \Big(\bigodot_{j \neq i}\big(\alpha^{(\ell)} h^{(\ell)}_{e(j)} + (1 - \alpha^{(\ell)}) p_j\big)\Big) \odot w^{(\ell)}_{\rho(e)}\Big] + b^{(\ell)}\Big).
\]
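For concreteness, here is a minimal PyTorch sketch of this shared HCNet/HRNet layer, with $g_r$ realized as a query-independent diagonal linear map (the per-relation weight vector $w_r$). It is a readability-oriented sketch under our own naming, not the released implementation, and it leaves out the conditional initialization of HCNet.

```python
import torch
import torch.nn as nn

class HCNetLayer(nn.Module):
    """Shared HCNet/HRNet update: mix co-occurring node states with
    positional encodings, combine them via an element-wise product over
    positions j != i, scale by a diagonal per-relation weight, sum over
    all positional neighborhoods, then apply a linear map W, b."""

    def __init__(self, num_relations: int, max_arity: int, dim: int):
        super().__init__()
        self.w_rel = nn.Parameter(torch.ones(num_relations, dim))  # diagonal g_r
        self.pos = nn.Parameter(torch.randn(max_arity, dim))       # p_j
        self.alpha = nn.Parameter(torch.tensor(0.5))               # mixing weight
        self.lin = nn.Linear(2 * dim, dim)                         # W^(l), b^(l)

    def forward(self, h, hyperedges):
        # h: (num_nodes, dim); hyperedges: list of (relation_id, node_ids).
        agg = torch.zeros_like(h)
        for rel, nodes in hyperedges:
            mixed = [self.alpha * h[w] + (1 - self.alpha) * self.pos[j]
                     for j, w in enumerate(nodes)]
            for i, v in enumerate(nodes):
                msg = torch.ones(h.size(1))
                for j in range(len(nodes)):
                    if j != i:
                        msg = msg * mixed[j]     # element-wise product, j != i
                agg[v] = agg[v] + msg * self.w_rel[rel]
        return torch.relu(self.lin(torch.cat([h, agg], dim=-1)))
```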
Notation. Given a relational hypergraph $G = (V, E, R, c)$, we denote by $|V|$, $|E|$, and $|R|$ the number of vertices, edges, and relation types, respectively. $d$ is the hidden dimension and $m$ is the maximum arity of the edges. Additionally, we denote by $L$ the total number of layers, and by $k$ the arity of the query relation $q \in R$ in the query $q = (q, \bar{u}, t)$.

Analysis. Given a query $q = (q, \bar{u}, t)$, the runtime complexity of a single forward pass of HCNet is $O(L(m|E|d + |V|d^2))$: each message requires $O(d)$ for the relation-specific transformation, and there are $m|E|$ messages in total in each layer. In the update function, we additionally need a linear transformation of each aggregated message as well as a self-transformation, which costs $O(d^2)$ per node. Adding these up, we have a cost of $O(m|E|d + |V|d^2)$ per layer, and thus $O(L(m|E|d + |V|d^2))$ in total. Note that this is the same as the complexity of HRNet, since the only difference lies in the initialization, which costs $O(|V|d)$ for HCNet. In terms of computing a single query, the amortized complexity of HCNet is $O(L(m|E|d/|V| + d^2))$, since in each forward pass, $|V|$ queries are computed at the same time. In contrast, HRNet amortizes over many more queries: once it has representations for all nodes in the relational hypergraph, it can score all possible hyperedges by permuting the nodes and feeding them into the $k$-ary decoder. We summarize the complexity analysis in Table 6.

Table 6: Model asymptotic runtime complexities.

Model | Complexity of a forward pass | Amortized complexity of a query
HR-MPNNs | $O(L(m|E|d + |V|d^2))$ | $O\big(L\big(\frac{m|E|d}{|R||V|^2} + \frac{d^2}{|R||V|} + d\big)\big)$
HC-MPNNs | $O(L(m|E|d + |V|d^2))$ | $O\big(L\big(\frac{m|E|d}{|V|} + d^2\big)\big)$

Discussion of the space complexity of positional encodings. Note that models designed for inductive inference over knowledge graphs, such as NBFNet (Zhu et al., 2021), typically augment the original knowledge graph edges with their inverses by introducing inverse relations (i.e., for each $r(a, b)$, an edge of the form $r^{-1}(b, a)$ is added). This ensures that messages flow in both directions between two nodes. To extend this idea directly to relational hyperedges, we would need the full set of directional interactions in a hyperedge of arity $k$, which requires enumerating all $k!$ permutations of the node ordering, resulting in $k! - 1$ additional hyperedges per original one. If each permutation is treated as a distinct augmented relation, the model must store a separate relation embedding for each of these permutations. Assuming uniform arity $k$ and an embedding dimension $d$, this leads to a space complexity of $O(|R| \cdot k! \cdot d)$. Such an approach is only practical when $k = 2$, as in traditional knowledge graphs, where it merely doubles the number of embeddings. For relational hypergraphs with larger $k$, however, it introduces an exponential increase in storage and quickly becomes infeasible. In contrast, our method avoids this explosion by employing positional encodings to distinguish among node positions within a hyperedge. This allows us to retain a single relation embedding per relation, with an additional $k$-length positional embedding shared across all relations, resulting in a total space complexity of $O(|R|d + kd)$. This scales linearly with $k$, enabling efficient modeling of high-arity hyperedges without sacrificing expressive power.

J Details on synthetic experiments

Dataset construction. We construct HyperCycle, a synthetic dataset that consists of multiple relational hypergraphs with relations $R = \{r_0, r_1, r_2\}$. Each relational hypergraph $G$ is parameterized by two hyperparameters: the number of nodes $n$, which is always a multiple of 4, and the arity $k$ of each edge. Given such a pair $(n, k)$, we generate the relational hypergraph $G(n, k) = (V(n, k), E(n, k), R(n, k))$, where
\[
V(n, k) = \{x_1, \dots, x_n\}, \qquad
E(n, k) = \big\{r_{(i \bmod 2) + 1}\big(x_{(i+j) \bmod n} \mid 0 \leq j < k\big) \mid 1 \leq i \leq n\big\}, \qquad
R(n, k) = \{r_0, r_1, r_2\}.
\]
We generate the dataset by choosing $n \in \{8, 12, 16, 20\}$ and $k \in \{3, 4, 5, 6, 7\}$. We then randomly pick 70% of the generated graphs as the training set and the remaining 30% as the testing set. A minimal generation sketch is given below.
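The following Python sketch mirrors the construction above, with 0-based node indices in place of $x_1, \dots, x_n$ and the 70/30 split mirroring the described procedure.

```python
import random

def hypercycle(n: int, k: int):
    """Generate G(n, k): for each i in 1..n, one ordered hyperedge over k
    consecutive nodes on a cycle, labeled r_1 or r_2 by the parity of i
    (r_0 belongs to R but is not used by the generated hyperedges)."""
    assert n % 4 == 0
    edges = [((i % 2) + 1, tuple((i + j) % n for j in range(k)))
             for i in range(1, n + 1)]
    return list(range(n)), edges, (0, 1, 2)

graphs = [hypercycle(n, k) for n in (8, 12, 16, 20) for k in (3, 4, 5, 6, 7)]
random.shuffle(graphs)
split = int(0.7 * len(graphs))
train, test = graphs[:split], graphs[split:]
print(len(train), len(test))  # 14 6
```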
J Details on synthetic experiments

Dataset construction. We construct HyperCycle, a synthetic dataset consisting of multiple relational hypergraphs over the relation set $R = \{r_0, r_1, r_2\}$. Each relational hypergraph is parameterized by two hyperparameters: the number of nodes $n$, which is always a multiple of 4, and the arity $k$ of each hyperedge. Given such a pair $(n, k)$, we generate the relational hypergraph $G(n, k) = (V(n, k), E(n, k), R(n, k))$ where

$$\begin{aligned} V(n, k) &= \{x_1, \ldots, x_n\}, \\ E(n, k) &= \big\{r_{(i \bmod 2) + 1}\big(x_{(i+j) \bmod n} \mid 0 \leq j < k\big) \mid 1 \leq i \leq n\big\}, \\ R(n, k) &= \{r_0, r_1, r_2\}. \end{aligned}$$

We generate the dataset by choosing $n \in \{8, 12, 16, 20\}$ and $k \in \{3, 4, 5, 6, 7\}$. We then randomly pick 70% of the generated graphs as the training set and the remaining 30% as the test set; a generator is sketched at the end of this appendix.

Model architectures. We consider two model architectures, namely the HC-MPNN instance HCNet:

$$h_{v|\mathbf{q}}^{(0)} = \sum_{i \neq t} \mathbb{1}_{v = u_i}\,(p_i + z_q), \qquad h_{v|\mathbf{q}}^{(\ell+1)} = \sigma\Big(W^{(\ell)}\Big[h_{v|\mathbf{q}}^{(\ell)} \,\Big\|\, \sum_{(e,i) \in E(v)} \bigodot_{j \neq i}\big(\alpha^{(\ell)} h_{e(j)|\mathbf{q}}^{(\ell)} + (1-\alpha^{(\ell)})\, p_j\big) \odot w_{\rho(e)}^{(\ell)}\Big] + b^{(\ell)}\Big),$$

and a corresponding HR-MPNN instance called HRNet, which shares the same update, aggregation, and relation-specific message functions as HCNet and is defined as follows:

$$h_v^{(0)} = \mathbf{1}^d, \qquad h_v^{(\ell+1)} = \sigma\Big(W^{(\ell)}\Big[h_v^{(\ell)} \,\Big\|\, \sum_{(e,i) \in E(v)} \bigodot_{j \neq i}\big(\alpha^{(\ell)} h_{e(j)}^{(\ell)} + (1-\alpha^{(\ell)})\, p_j\big) \odot w_{\rho(e)}^{(\ell)}\Big] + b^{(\ell)}\Big).$$

Here, $\sigma$ stands for the ReLU activation function in both models. We additionally use a binary MLP decoder for HRNet, which takes the concatenation of the final representations of the entities in the query, together with the learnable query vector $z_q$, to produce the final probability.

Experimental details. For both models, we use 7 layers, each with 32 hidden dimensions. We set the learning rate to 1e-3 for both models and train them for 100 epochs.
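For concreteness, the construction above can be reproduced in a few lines of Python. The sketch below encodes hyperedges as (relation, node-tuple) pairs and indexes nodes from 0; both choices are illustrative and not the paper's data format.

```python
import random

def hyper_cycle(n: int, k: int):
    """Generate G(n, k): n nodes on a cycle and n ordered hyperedges of
    arity k, where hyperedge i carries relation r_{(i mod 2) + 1} and
    contains nodes x_{(i+j) mod n} for 0 <= j < k, in that order."""
    nodes = [f"x{i}" for i in range(n)]
    edges = [
        (f"r{(i % 2) + 1}", tuple(nodes[(i + j) % n] for j in range(k)))
        for i in range(1, n + 1)
    ]
    return nodes, edges

# All (n, k) combinations, split 70/30 into train and test.
graphs = [hyper_cycle(n, k) for n in (8, 12, 16, 20) for k in (3, 4, 5, 6, 7)]
random.shuffle(graphs)
cut = int(0.7 * len(graphs))
train, test = graphs[:cut], graphs[cut:]
```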
K Scalability and custom Triton kernel

Scalability is generally a concern for inductive link prediction, since link prediction between a given pair of nodes relies heavily on the structural properties of these nodes (due to the lack of node features), which necessitates strong encoders that go beyond the power of 1-WL. This is more dramatic for relational hypergraphs, since the prediction now relies on the structural properties of $k$ nodes, and any model will suffer from scalability issues as $k$ becomes large. That being said, our approach remains feasible for the benchmark datasets, but we think it is important for future work to scale these models up to larger datasets, much like it has been done for classical GNNs (Hamilton et al., 2017; Zhu et al., 2023). To address this empirically, we include a custom implementation via a Triton kernel³ in our codebase for the message passing process on relational hypergraphs, which on average halved training times and dramatically reduced the memory usage of the algorithm (a 5× reduction on average). The idea is to not materialize all the messages explicitly, as in PyTorch Geometric (Fey & Lenssen, 2019), but to directly write the neighboring features into the corresponding memory addresses. Compared with materializing all hyperedge messages, which takes $O(k|E|)$ memory where $k$ is the maximum arity, computing with the Triton kernel requires only $O(|V|)$ memory. This enables fast and scalable message passing on relational hypergraphs, both for HR-MPNNs and for HC-MPNNs.

³ https://github.com/triton-lang/triton

L On adding node features

On the surface, it seems that HC-MPNNs do not directly take node features into account. This is because in the task of link prediction on relational hypergraphs, no node features are explicitly provided to begin with, and thus we did not assume the presence of node features in this particular task setting. However, it is relatively straightforward to account for node features by simply concatenating the node feature $x_v$ onto the current representation $h_v$ to obtain $h_v' = [h_v \,\|\, x_v]$, as sketched below. Indeed, the only requirement on the initialization of HC-MPNNs is to satisfy generalized target node distinguishability, and concatenating node features preserves this property. As a result, all theoretical results directly apply to HC-MPNNs with node features. It is worth noting that this concatenation technique has already been applied to knowledge graphs with node features by Zhang et al. (2021) and has proven successful; it is also mentioned by Galkin et al. (2024) for link prediction with knowledge graphs using conditional message passing.
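A minimal sketch of this concatenation, assuming node features `x` of shape (|V|, d_x) and conditioned initializations `h0` of shape (|V|, d); the function name and shapes are illustrative.

```python
import torch

def init_with_features(h0: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Concatenate raw node features onto the conditioned initialization.

    h0: (|V|, d)   initializations satisfying generalized target node
                   distinguishability (e.g., sums of p_i + z_q in HCNet).
    x:  (|V|, d_x) raw node features.
    """
    # Distinct rows of h0 remain distinct after concatenation, so
    # generalized target node distinguishability is preserved.
    return torch.cat([h0, x], dim=-1)
```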
M Impact of the density of the relational hypergraphs

To further analyze the impact of structure and density, we present the average degree of the (training) datasets in Table 7. Observe that even though MFB-IND is a very dense hypergraph, HCNets still manage to double the metrics of existing models. Furthermore, we highlight the performance of HCNets in sparse hypergraph settings, which are more representative of many real-world scenarios. Remarkably, HCNets maintain competitive performance even under these challenging conditions, underscoring their adaptability and effectiveness across a wide range of graph density regimes. These findings highlight the versatility of HCNets in handling diverse hypergraph structures.

Table 7: Average degree of the relational hypergraphs in the experiments.

                  WP-IND   JF-IND   MFB-IND   FB-AUTO   WikiPeople
Average degree    1.03     1.36     104.5     2.16      6.06

N Further experiment details

In this section, we report the details of the experiments carried out in the body of the paper. In particular, we report the dataset statistics of the inductive link prediction task in Table 8 and of the transductive link prediction task in Table 9, as well as the hyperparameters used by HCNet for the inductive and transductive link prediction tasks in Table 10 and Table 11, respectively. We present the dataset statistics and hyperparameter choices for the binary inductive experiments in Table 17 and Table 18, respectively. We also show the complete tables of the ablation studies in Table 13 and Table 14, with the detailed definitions of the initializations and positional encodings considered in Table 19 and Table 20, respectively. Finally, we report the execution time and GPU usage of one epoch of HCNet on all datasets considered in the paper, with the corresponding hyperparameters, in Table 21; see Appendix K for a further discussion of scalability. For RD-MPNN training, we use a learning rate of 0.1, a dimension of 200, and 10 negative samples on all inductive datasets. None of the relational hypergraphs in the experiments contain node features; see Appendix L for a detailed discussion and a strategy for applying HC-MPNNs to relational hypergraphs with node features.

Table 8: Dataset statistics of the inductive link prediction task with relational hypergraphs.

Dataset    # seen vertices   # train hyperedges   # unseen vertices   # relations   # features   # max arity
WP-IND     4,463             4,139                100                 32            37           4
JF-IND     4,685             6,167                100                 31            46           4
MFB-IND    3,283             336,733              500                 12            25           3

Table 9: Dataset statistics of the transductive link prediction task with relational hypergraphs on FB-AUTO, WikiPeople, JF17K, and MFB15K, with the respective arities.

              FB-AUTO   WikiPeople   JF17K    MFB15K
|V|           3,410     47,765       29,177   10,314
|R|           8         707          327      71
# train       6,778     305,725      61,104   415,375
# valid       2,255     38,223       15,275   39,348
# test        2,180     38,281       24,915   38,797
# arity = 2   3,786     337,914      56,322   82,247
# arity = 3   0         25,820       34,550   400,027
# arity = 4   215       15,188       9,509    26
# arity ≥ 5   7,212     3,307        2,267    11,220

Table 10: Hyperparameters for the inductive experiments of HCNet.

Hyperparameter                     WP-IND   JF-IND   MFB-IND
GNN      Layer Depth (L)           5        5        4
         Hidden Dimension          128      256      32
Decoder  Layer Depth               2        2        2
         Hidden Dimension          128      256      32
Optim.   Optimizer                 Adam     Adam     Adam
         Learning Rate             5e-3     1e-2     5e-3
         Batch Size                32       32       1
         # Negative Samples        10       10       10
         Epochs                    20       20       10
         # Batches Per Epoch       -        -        10,000
         Adversarial Temperature   0.5      0.5      0.5
         Dropout                   0.2      0.2      0
         Accumulation Iterations   1        1        32

Table 11: Hyperparameters for the transductive experiments of HCNet.

Hyperparameter                     FB-AUTO   WikiPeople   JF17K   MFB15K
GNN      Layer Depth (L)           4         5            6       6
         Hidden Dimension          128       64           64      64
Decoder  Layer Depth               2         2            2       2
         Hidden Dimension          128       64           64      64
Optim.   Optimizer                 Adam      Adam         Adam    Adam
         Learning Rate             1e-3      1e-3         5e-3    1e-3
         Batch Size                32        16           1       32
         # Negative Samples        32        32           50      32
         Epochs                    20        6            6       4
         # Batches Per Epoch       -         -            5,000   -
         Adversarial Temperature   0.5       0.5          0.5     0.5
         Dropout                   0.2       0.2          0.2     0.2
         Accumulation Iterations   1         1            32      1

Table 12: Results of the inductive link prediction experiments. We report averaged MRR, Hits@1, and Hits@3 (higher is better) on the test sets, together with standard deviations.

               WP-IND                       JF-IND                       MFB-IND
               MRR     Hits@1   Hits@3     MRR     Hits@1   Hits@3      MRR     Hits@1   Hits@3
HGNN           0.072   0.045    0.112      0.102   0.086    0.128       0.121   0.076    0.114
HyperGCN       0.075   0.049    0.111      0.099   0.088    0.133       0.118   0.074    0.117
G-MPNN-sum     0.177   0.108    0.191      0.219   0.155    0.236       0.124   0.071    0.123
G-MPNN-mean    0.153   0.096    0.145      0.112   0.039    0.116       0.241   0.162    0.257
G-MPNN-max     0.200   0.125    0.214      0.216   0.147    0.240       0.268   0.191    0.283
RD-MPNN        0.304   0.238    0.328      0.402   0.308    0.453       0.122   0.082    0.125
HCNet          0.414 ±0.005   0.352 ±0.004   0.451 ±0.005     0.435 ±0.017   0.357 ±0.023   0.495 ±0.014     0.368 ±0.015   0.223 ±0.014   0.417 ±0.022

We adopt the partial completeness assumption (Galárraga et al., 2013) on relational hypergraphs, randomly corrupting the $t$-th position of a $k$-ary fact $q(u_1, \ldots, u_k)$ for each $1 \leq t \leq k$. HCNet minimizes the negative log-likelihood of the positive facts present in the training graph and of the negative facts obtained by corruption. We represent the fact $q(u_1, \ldots, u_k)$ with its $t$-th position corrupted as the query $\mathbf{q} = (q, \mathbf{u}, t)$, and write its conditional probability as $p(v \mid \mathbf{q}) = \sigma(f(h_{v|\mathbf{q}}^{(L)}))$, where $v \in V$ is the candidate entity for the $t$-th position, $L$ is the total number of layers, $\sigma$ is the sigmoid function, and $f$ is a 2-layer MLP. We then adopt self-adversarial negative sampling (Sun et al., 2019) and minimize the loss

$$\mathcal{L}(v \mid \mathbf{q}) = -\log p(v \mid \mathbf{q}) - \sum_{i=1}^{n} w_{i,\alpha} \log\big(1 - p(v_i' \mid \mathbf{q})\big),$$

where $\alpha$ is the adversarial temperature (a hyperparameter), $n$ is the number of negative samples per positive sample, and $v_i'$ is the corrupted vertex of the $i$-th negative sample. Finally, $w_{i,\alpha}$ is the weight of the $i$-th negative sample, given by

$$w_{i,\alpha} := \mathrm{Softmax}\Big(\frac{\log\big(1 - p(v_i' \mid \mathbf{q})\big)}{\alpha}\Big).$$
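The objective above is easy to implement; below is a minimal PyTorch sketch, where `pos` holds $p(v \mid \mathbf{q})$ for a batch of positive facts, `neg` holds $p(v_i' \mid \mathbf{q})$ for their $n$ negatives, and, following Sun et al. (2019), the adversarial weights are treated as constants. All names are illustrative.

```python
import torch

def self_adversarial_nll(pos: torch.Tensor, neg: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    """pos: (batch,) probabilities of positive facts.
    neg: (batch, n) probabilities of the corrupted (negative) facts."""
    log_one_minus = torch.log1p(-neg)                 # log(1 - p(v'_i | q))
    w = torch.softmax(log_one_minus / alpha, dim=-1)  # w_{i, alpha}
    w = w.detach()                                    # weights carry no gradient
    return -(torch.log(pos) + (w * log_one_minus).sum(dim=-1)).mean()
```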
Table 13: Results of the ablation study on initialization. We report MRR, Hits@1, and Hits@3 (higher is better) on the test sets; check marks indicate which components are included in the initialization (cf. Table 19).

Init           WP-IND                      JF-IND
z_q   p_i      MRR     Hits@1   Hits@3     MRR     Hits@1   Hits@3
-     -        0.388   0.324    0.421      0.390   0.295    0.451
-     ✓        0.387   0.321    0.421      0.392   0.302    0.447
✓     -        0.394   0.329    0.430      0.393   0.300    0.456
✓     ✓        0.414   0.352    0.451      0.435   0.357    0.495

Table 14: Results of the ablation study on positional encoding. We report MRR, Hits@1, and Hits@3 (higher is better) on the test sets.

PE             WP-IND                      JF-IND
               MRR     Hits@1   Hits@3     MRR     Hits@1   Hits@3
Constant       0.393   0.328    0.426      0.356   0.247    0.428
One-hot        0.395   0.334    0.428      0.368   0.275    0.432
Learnable      0.396   0.335    0.425      0.416   0.335    0.480
Sinusoidal     0.414   0.352    0.451      0.435   0.357    0.495

Table 15: Transductive link prediction experiments on FB-AUTO and WikiPeople.

               FB-AUTO                               WikiPeople
               MRR     Hits@1   Hits@3   Hits@10     MRR     Hits@1   Hits@3   Hits@10
m-DistMult     0.784   0.745    0.815    0.845       -       -        -        -
m-CP           0.752   0.704    0.785    0.837       -       -        -        -
m-TransH       0.728   0.727    0.728    0.728       -       -        -        -
RAE            0.703   0.614    0.764    0.854       0.253   0.118    0.343    0.463
NaLP           0.672   0.611    0.712    0.774       0.338   0.272    0.362    0.466
tNaLP+         0.729   0.645    0.748    0.826       0.339   0.269    0.369    0.473
HINGE          0.678   0.630    0.706    0.765       0.333   0.259    0.361    0.477
NeuInfer       0.737   0.700    0.755    0.805       0.350   0.282    0.381    0.467
HypE           0.804   0.774    0.823    0.856       0.263   0.127    0.355    0.486
RAM            0.830   0.803    0.851    0.876       0.363   0.271    0.455    0.500
BoxE           0.844   0.814    0.863    0.898       -       -        -        -
HyperMLN       0.831   0.803    0.851    0.877       -       -        -        -
HyConvE        0.847   0.820    0.872    0.901       0.362   0.275    0.388    0.501
ReAlE          0.873   0.852    0.886    0.909       -       -        -        -
RD-MPNN        0.810   0.714    0.870    0.888       -       -        -        -
HJE            0.872   0.848    0.886    0.903       0.450   0.375    0.487    0.582
HyCubE         0.881   0.860    0.894    0.918       0.448   0.368    0.490    0.592
HyPlanE        0.866   0.843    0.880    0.909       0.402   0.323    0.443    0.549
HCNet          0.871 ±0.005   0.842 ±0.007   0.892 ±0.003   0.922 ±0.004     0.421 ±0.004   0.344 ±0.005   0.457 ±0.005   0.565 ±0.007

Table 16: Transductive link prediction experiments on JF17K and MFB15K.

               JF17K                                 MFB15K
               MRR     Hits@1   Hits@3   Hits@10     MRR     Hits@1   Hits@3   Hits@10
m-DistMult     0.463   0.372    0.510    0.634       0.705   0.633    0.740    0.844
m-CP           0.392   0.303    0.441    0.560       0.680   0.605    0.715    0.828
m-TransH       0.444   0.370    0.475    0.581       0.623   0.531    0.669    0.809
RAE            0.396   0.312    0.433    0.561       -       -        -        -
NaLP           0.366   0.290    0.334    0.516       -       -        -        -
tNaLP+         0.449   0.370    0.484    0.598       -       -        -        -
HINGE          0.473   0.397    0.490    0.618       -       -        -        -
NeuInfer       0.451   0.373    0.484    0.604       -       -        -        -
HypE           0.494   0.408    0.538    0.656       0.777   0.725    0.800    0.881
RAM            0.539   0.463    0.573    0.690       -       -        -        -
BoxE           0.560   0.472    0.604    0.722       0.761   0.702    0.791    0.877
HyperMLN       0.556   0.482    0.597    0.717       -       -        -        -
HyConvE        0.590   0.478    0.610    0.729       -       -        -        -
ReAlE          0.559   0.482    0.594    0.705       0.801   0.755    0.823    0.901
RD-MPNN        0.512   0.445    0.573    0.685       -       -        -        -
HJE            0.590   0.507    0.613    0.729       -       -        -        -
HyCubE         0.584   0.508    0.616    0.730       -       -        -        -
HyPlanE        0.569   0.496    0.600    0.708       -       -        -        -
HCNet          0.540 ±0.002   0.440 ±0.001   0.595 ±0.007   0.730 ±0.006     0.759 ±0.003   0.693 ±0.002   0.796 ±0.005   0.884 ±0.007
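The results above (and throughout the paper) are reported as MRR and Hits@k. For completeness, here is a minimal sketch of how these ranking metrics are computed from the 1-based rank of the true entity for each test query; the ranking protocol itself (e.g., filtering) follows the respective benchmarks.

```python
def mrr(ranks: list[int]) -> float:
    """Mean reciprocal rank over the 1-based ranks of the true entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks: list[int], k: int) -> float:
    """Fraction of queries whose true entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Example: ranks of the true entity over five queries.
ranks = [1, 3, 2, 10, 1]
print(round(mrr(ranks), 3), hits_at(ranks, 1), hits_at(ranks, 3))  # 0.587 0.4 0.8
```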
Table 17: Dataset statistics for the inductive relation prediction experiments. #Query* is the number of queries used in the validation set; in the training set, all triplets are used as queries.

Dataset         # Relations   Train & Validation                 Test
                              # Nodes   # Triplets   # Query*    # Nodes   # Triplets   # Query
WN18RR v1       9             2,746     5,410        630         922       1,618        188
WN18RR v2       10            6,954     15,262       1,838       2,757     4,011        441
WN18RR v3       11            12,078    25,901       3,097       5,084     6,327        605
WN18RR v4       9             3,861     7,940        934         7,084     12,334       1,429
FB15k-237 v1    180           1,594     4,245        489         1,093     1,993        205
FB15k-237 v2    200           2,608     9,739        1,166       1,660     4,145        478
FB15k-237 v3    215           3,668     17,986       2,194       2,501     7,406        865
FB15k-237 v4    219           4,707     27,203       3,352       3,051     11,714       1,424

Table 18: Hyperparameters for the binary inductive experiments with HCNet.

Hyperparameter                     WN18RR   FB15k-237
GNN      Layer Depth (L)           6        6
         Hidden Dimension          32       32
Decoder  Layer Depth               2        2
         Hidden Dimension          64       64
Optim.   Optimizer                 Adam     Adam
         Learning Rate             5e-3     5e-3
         Batch Size                32       32
         # Negative Samples        32       32
         Epochs                    30       30
         # Batches Per Epoch       -        -
         Adversarial Temperature   0.5      0.5
         Dropout                   0.2      0.2
         Accumulation Iterations   1        1

Table 19: Definition of Init in the ablation study of initialization. Here, $\mathbf{q} = (q, \mathbf{u}, t)$, and $d$ is the hidden dimension before passing to the first layer.

z_q   p_i   $h^{(0)}_{v|\mathbf{q}}$
-     -     $\sum_{i \neq t} \mathbb{1}_{v=u_i}\, \mathbf{1}^d$
-     ✓     $\sum_{i \neq t} \mathbb{1}_{v=u_i}\, p_i$
✓     -     $\sum_{i \neq t} \mathbb{1}_{v=u_i}\, z_q$
✓     ✓     $\sum_{i \neq t} \mathbb{1}_{v=u_i}\, (p_i + z_q)$

Table 20: Definition of $p_i$ in the ablation study of positional encoding. Here, $I_i^d$ is the one-hot vector of dimension $d$ whose $i$-th entry is 1 and all other entries are 0, $\hat{p}_i$ is a $d$-dimensional learnable vector, and $p_{i,j}$ is the $j$-th entry of the positional encoding $p_i$, whose dimension is $d$ (the hidden dimension before the first layer). A sketch of the sinusoidal encoding is given at the end of this appendix.

Constant     $p_i = \mathbf{1}^d$
One-hot      $p_i = I_i^d$
Learnable    $p_i = \hat{p}_i$
Sinusoidal   $p_{i,2j} = \sin\big(i / 10000^{2j/d}\big)$,  $p_{i,2j+1} = \cos\big(i / 10000^{2j/d}\big)$

Table 21: Comparison of the execution time of one epoch for the inductive and transductive link prediction tasks with relational hypergraphs on a single A10 GPU. Note that we use batch size 1 during testing for all models, and 10k steps per epoch for MFB-IND during the training of HCNet.

           WP-IND             JF-IND            MFB-IND            FB-AUTO            WikiPeople
           Train    Test      Train   Test      Train    Test      Train     Test     Train   Test
RD-MPNN    2 sec    3.5 min   2 sec   3 min     14 min   38 min    3 sec     35 min   -       -
HCNet      3.5 min  18 sec    8 min   10 sec    80 min   3.5 min   4.5 min   4 min    3 hr    2 hr
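As a concrete illustration of the sinusoidal positional encoding defined in Table 20, here is a minimal NumPy sketch (the function name is ours; $d$ is assumed even):

```python
import numpy as np

def sinusoidal_pe(k: int, d: int) -> np.ndarray:
    """Stack the positional encodings p_1, ..., p_k of dimension d."""
    pos = np.arange(1, k + 1)[:, None]          # argument positions i
    freq = 10000.0 ** (np.arange(0, d, 2) / d)  # 10000^(2j/d) for j = 0..d/2-1
    pe = np.empty((k, d))
    pe[:, 0::2] = np.sin(pos / freq)            # p_{i, 2j}
    pe[:, 1::2] = np.cos(pos / freq)            # p_{i, 2j+1}
    return pe  # pe[i - 1] is p_i
```

Unlike one-hot or learnable encodings, these vectors are defined for any position, which is convenient when hyperedge arities vary across relations.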