# How Expressive are Knowledge Graph Foundation Models?

Xingyue Huang¹, Pablo Barceló²³⁴, Michael M. Bronstein¹⁵, Ismail Ilkan Ceylan¹, Mikhail Galkin⁶, Juan L. Reutter²³⁴, Miguel Romero Orth²⁴

¹University of Oxford ²Universidad Católica de Chile ³IMFD ⁴CENIA ⁵AITHYRA ⁶Google. Correspondence to: Xingyue Huang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Knowledge Graph Foundation Models (KGFMs) are at the frontier for deep learning on knowledge graphs (KGs), as they can generalize to completely novel knowledge graphs with different relational vocabularies. Despite their empirical success, our theoretical understanding of KGFMs remains very limited. In this paper, we conduct a rigorous study of the expressive power of KGFMs. Specifically, we show that the expressive power of KGFMs directly depends on the motifs that are used to learn the relation representations. We then observe that the most typical motifs used in the existing literature are binary, as the representations are learned based on how pairs of relations interact, which limits the models' expressiveness. As part of our study, we design more expressive KGFMs using richer motifs, which necessitate learning relation representations based on, e.g., how triples of relations interact with each other. Finally, we empirically validate our theoretical findings, showing that the use of richer motifs results in better performance on a wide range of datasets drawn from different domains.

1. Introduction

Knowledge Graph Foundation Models (KGFMs) gained significant attention (Galkin et al., 2024; Mao et al., 2024) for their success in link prediction tasks involving unseen entities and relations on knowledge graphs (KGs). These models aim to generalize across different KGs by effectively learning relation invariants: properties of relations that are transferable across KGs with different relational vocabularies (Gao et al., 2023). KGFMs learn representations of relations by relying on their structural roles in the KG, which can be transferred to novel relations based on their structural similarity to the observed relations.

[Figure 1. Training is over relations provide and research, and testing is over structurally similar relations supply and produce. The two KGs use disjoint vocabularies: $G_{\text{train}}$ relates HSBC, Finance, Bloomberg, and Oxford via provide and research; $G_{\text{test}}$ relates Tokyo Electron, Intel, Micron Tech, and Semiconductors via supply and produce.]

Example 1.1. Consider the two KGs from Figure 1, which use disjoint sets of relations. Suppose that the model is trained on $G_{\text{train}}$ and the goal is to predict the missing link produce(Intel, Semiconductors) in $G_{\text{test}}$. Ideally, the model will predict the link by learning relation invariants that map produce $\mapsto$ research and supply $\mapsto$ provide, as the structural roles of these relations are similar in the respective graphs, even if the relations themselves differ.

The dominant approach (Geng et al., 2023; Lee et al., 2023; Galkin et al., 2024) for computing relation invariants is to learn relational embeddings over a graph of relations. In this graph, each node represents a relation type from the original KG and each edge represents a connection between two relations. One prominent choice of these connections is to identify whether two relations in the original KG match a set of shared patterns known as graph motifs: small, recurring subgraphs that capture essential relational structures within the KG.

[Figure 2. A simple motif of a path of length two.]
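To make motif matching concrete before the formal development, the following minimal sketch (our illustration, not the authors' code) evaluates the path-of-length-two motif of Figure 2 on a toy fragment of $G_{\text{test}}$ from Figure 1; the entity names and the helper function are ours.

```python
from itertools import product

# Toy fragment of G_test from Figure 1: facts stored as (head, relation, tail).
facts = [
    ("TokyoElectron", "supply", "Intel"),
    ("Intel", "produce", "Semiconductors"),
    ("MicronTech", "produce", "Semiconductors"),
]

def eval_path2_motif(facts):
    """All relation pairs (alpha, beta) matching the binary motif
    alpha(x, y), beta(y, z), i.e. connected through a shared entity y."""
    matches = set()
    for (h1, r1, t1), (h2, r2, t2) in product(facts, repeat=2):
        if t1 == h2:  # the tail of the first fact is the head of the second
            matches.add((r1, r2))
    return matches

print(eval_path2_motif(facts))  # {('supply', 'produce')}
```

On $G_{\text{train}}$, the same motif fires on (provide, research) via the entity Oxford, which is exactly the structural correspondence exploited in Example 1.1.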
Figure 2 depicts a simple motif: a path of length two, connecting three entities via two relations (i.e., a binary motif). For instance, the relations provide and research are connected via the entity Oxford in Figure 1, and hence match this motif with the map $\alpha \mapsto$ provide, $\beta \mapsto$ research. We argue that existing models limit themselves to binary motifs, which restricts their ability to capture complex interactions involving more relations. We investigate the expressive power of these models; specifically, how the choice of motifs impacts the model's ability to capture relation invariants and helps distinguish between different relational structures. Prior works have extensively studied relational message-passing neural networks on KGs (Barceló et al., 2022; Huang et al., 2023), but these results do not apply to KGFMs, which are the focus of our work.

Approach. We introduce the MOTif-Induced Framework for KGFMs (MOTIF), based on learning relation invariants from arbitrary (not necessarily binary) graph motifs. MOTIF first constructs a relational hypergraph, where each node corresponds to a relation in the original KG, and each hyperedge represents a match of relations to a motif in the original KG. Then, it performs message passing over the relational hypergraph to generate relation representations, which are used in message passing over the original KG to enable link prediction over novel relations. Importantly, MOTIF is a general framework that strictly subsumes existing KGFMs such as ULTRA (Galkin et al., 2024) and INGRAM (Lee et al., 2023), as detailed in Section 5, and provides a principled approach for enhancing the expressive power of other KGFMs such as TRIX (Zhang et al., 2024) and RMPI (Geng et al., 2023).

Contributions. Our contributions can be summarized as follows:

- We introduce MOTIF as a general framework capable of integrating arbitrary graph motifs into KGFMs, subsuming existing KGFMs such as ULTRA and INGRAM.
- To the best of our knowledge, we provide the first rigorous analysis of the expressive power of KGFMs. Our study explains the capabilities and limitations of existing KGFMs in capturing relational invariants.
- We identify a sufficient condition under which a newly introduced motif enhances the expressiveness of a KGFM within the framework of MOTIF. Using this theoretical recipe, we derive a new KGFM that is provably more expressive than ULTRA.
- Empirically, we conduct an extensive analysis on 54 KGs from diverse domains and validate our theoretical results through synthetic and real-world experiments.

All proofs of the technical results can be found in the appendix of the paper, along with additional experiments.

2. Related work

Transductive and inductive (on node) link prediction. Link prediction on KGs has been extensively studied in the literature. Early approaches like TransE (Bordes et al., 2013), RotatE (Sun et al., 2019), and BoxE (Abboud et al., 2020) focus on the transductive setting, where the learned entity embeddings are fixed, and thus inapplicable to unseen entities at test time. Multi-relational GNNs such as RGCN (Schlichtkrull et al., 2018) and CompGCN (Vashishth et al., 2020) remain transductive as they store entity and relation embeddings as parameters. To overcome this limitation, Teru et al. (2020) introduce GraIL, which enables inductive link prediction via the labeling trick.
NBFNet (Zhu et al., 2021), A*Net (Zhu et al., 2023), RED-GNN (Zhang & Yao, 2022), and AdaProp (Zhang et al., 2023) provide improvements by leveraging conditional message passing, which is provably more expressive (Huang et al., 2023). These models, once trained, can only be applied to KGs with the same relational vocabulary, limiting their applicability to graphs with unseen relations.

Inductive (on node and relation) link prediction. INGRAM (Lee et al., 2023) was one of the first approaches to study inductive link prediction over both new nodes and unseen relations by constructing a weighted relation graph to learn new relation representations. Galkin et al. (2024) extended this idea with the ULTRA architecture, which constructs a multi-relational graph of fundamental relations and leverages conditional message passing to enhance performance. ULTRA was among the first KGFMs to inspire an entire field of research (Mao et al., 2024). Concurrently, RMPI (Geng et al., 2023) explored generating multi-relational graphs through local subgraph extraction while also incorporating ontological schema. Gao et al. (2023) introduced the concept of double-equivariant GNNs, which establish invariants on nodes and relations by leveraging subgraph GNNs in the proposed ISDEA framework to enforce double equivariance precisely. MTDEA (Zhou et al., 2023a) expands this framework with an adaptation procedure for multi-task generalization. Further, TRIX (Zhang et al., 2024) expands on ULTRA with recursive updates of relation and entity embeddings, and is the first work to compare expressivity among KGFMs. Finally, KG-ICL (Cui et al., 2024) introduced a new KGFM utilizing in-context learning with a unified tokenizer for entities and relations.

Link prediction on relational hypergraphs. Relational hypergraphs are a generalization of KGs used to represent higher-arity relational data. Work on link prediction in relational hypergraphs first focused on shallow embeddings (Wen et al., 2016; Liu et al., 2020; Fatemi et al., 2020); later, G-MPNN (Yadati, 2020) and RD-MPNNs (Zhou et al., 2023b) advanced the area by incorporating message passing. Recently, Huang et al. (2024) conducted an in-depth expressivity study on these models and proposed HCNets, extending conditional message passing to relational hypergraphs and achieving strong results on inductive link prediction.

3. Preliminaries

Knowledge graphs. A knowledge graph (KG) is a tuple $G = (V, E, R)$, where $V$ is a set of nodes, $R$ is the set of relation types, and $E \subseteq V \times R \times V$ is a set of labeled edges, or facts. We write $r(u, v)$ to denote a labeled edge, or a fact, where $r \in R$ and $u, v \in V$. A (potential) link in $G$ is a triple $r(u, v)$ in $V \times R \times V$ that may or may not be a fact in $G$. The neighborhood of a node $v \in V$ relative to a relation $r \in R$ is $N_r(v) := \{u \mid r(u, v) \in E\}$.

Relational hypergraphs. A relational hypergraph is a tuple $G = (V, E, R)$, where $V$ is a set of nodes, $R$ is a set of relation types, and $E$ is a set of hyperedges (or facts) of the form $e = r(u_1, \ldots, u_k) \in E$, where $r \in R$ is a relation type, $u_1, \ldots, u_k \in V$ are nodes, and $k = \text{ar}(r)$ is the arity of the relation $r$. We write $\rho(e)$ for the relation type $r \in R$ of the hyperedge $e \in E$, and $e(i)$ to refer to the node in the $i$-th arity position of the hyperedge $e$.
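A minimal sketch of these two structures may help fix notation before the formal treatment of homomorphisms; the class and method names below are our own illustration, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class KG:
    """A knowledge graph G = (V, E, R), with facts r(u, v) stored as triples."""
    facts: list  # list of (u, r, v) triples

    def neighbors(self, r, v):
        # N_r(v) = {u | r(u, v) in E}
        return {u for (u, rel, w) in self.facts if rel == r and w == v}

@dataclass
class RelHypergraph:
    """A relational hypergraph: each hyperedge is rho(e)(u_1, ..., u_k)."""
    edges: list  # list of (rho, (u_1, ..., u_k)) pairs

    def positional_edges(self, v):
        # E(v): all (edge, position) pairs in which node v occurs (1-based)
        return [(e, i + 1) for e in self.edges
                for i, u in enumerate(e[1]) if u == v]
```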
Homomorphisms. A (node-relation) homomorphism from a KG $G = (V, E, R)$ to a KG $G' = (V', E', R')$ is a pair of mappings $h = (\pi, \phi)$, where $\pi : V \to V'$ and $\phi : R \to R'$, such that for every fact $r(u, v)$ in $G$, the fact $\phi(r)(\pi(u), \pi(v))$ belongs to $G'$. The image $h(G)$ of $h$ is the knowledge graph $(\pi(V), h(E), \phi(R))$, where $h(E)$ contains the fact $\phi(r)(\pi(u), \pi(v))$ for each $r(u, v)$ in $G$.

Isomorphism, relation invariants, link invariants. An isomorphism from a KG $G = (V, E, R)$ to a KG $G' = (V', E', R')$ is a pair of bijections $(\pi, \phi)$, where $\pi : V \to V'$ and $\phi : R \to R'$, such that every fact $r(u, v)$ is in $G$ if and only if the fact $\phi(r)(\pi(u), \pi(v))$ is in $G'$. Two graphs are isomorphic if there is an isomorphism between them. For $k \geq 1$, a $k$-ary relation invariant is a function $\xi$ associating to each KG $G = (V, E, R)$ a function $\xi(G)$ with domain $R^k$ such that for all isomorphic KGs $G$ and $G'$, every isomorphism $(\pi, \phi)$, and every tuple $\bar r \in R^k$ of relations in $G$, we have $\xi(G)(\bar r) = \xi(G')(\phi(\bar r))$. A link invariant is a function $\omega$ assigning to each KG $G = (V, E, R)$ a function $\omega(G)$ with domain $V \times R \times V$ such that for all isomorphic KGs $G$ and $G'$, every isomorphism $(\pi, \phi)$, and every link $q(u, v)$ in $G$, we have $\omega(G)(q(u, v)) = \omega(G')(\phi(q)(\pi(u), \pi(v)))$. For instance, a unary relation invariant $\xi$ evaluated on $G_{\text{train}}$ and $G_{\text{test}}$ in Figure 1 may satisfy $\xi(G_{\text{train}})(\text{research}) = \xi(G_{\text{test}})(\text{produce})$ if the two relations play analogous structural roles. Similarly, a link invariant $\omega$ may assign equal values to research(Oxford, Finance) in $G_{\text{train}}$ and produce(Intel, Semiconductors) in $G_{\text{test}}$ under a structure-preserving mapping.

4. MOTIF: A general framework for KGFMs

We present MOTIF, a very general framework for KGFMs. Given a KG $G = (V, E, R)$, MOTIF computes an encoding for each potential link $q(u, v)$ in $G$ following three steps:

1. LIFT: Use a set $F$ of motifs to compute a relational hypergraph $\text{LIFT}_F(G) = (V_{\text{LIFT}}, E_{\text{LIFT}}, R_{\text{LIFT}})$, using a procedure called LIFT.
2. RELATION ENCODER: Apply a relation encoder on the hypergraph $\text{LIFT}_F(G)$ to obtain relation representations.
3. ENTITY ENCODER: Use the relation representations from Step 2 and apply an entity encoder on the KG $G$ to obtain final link encodings.

The overall process is illustrated in Figure 3; we now detail each of the steps.

A graph motif is a pair $P = (G_M, \bar r)$, where $G_M = (V_M, E_M, R_M)$ is a connected KG and $\bar r$ is a tuple defining an order on the relation set $R_M$. Specifically, a $k$-ary motif is a motif whose motif graph contains exactly $k$ relation types, i.e., $|R_M| = k$. When $k = 2$, we call it a binary motif. The information extracted by motifs is defined via homomorphisms.

Definition 4.1. Let $G$ be a KG and $P = (G_M, \bar r)$ be a motif with $\bar r = (r_1, \ldots, r_n)$. The evaluation $\text{Eval}(P, G)$ of $P$ over $G$ is the set of tuples $(\phi(r_1), \ldots, \phi(r_n))$, for each homomorphism $h = (\pi, \phi)$ from $G_M$ to $G$.

The LIFT operation. Given a KG $G = (V, E, R)$ and a set of motifs $F$, the operation $\text{LIFT}_F(G)$ constructs a relational hypergraph $(V_{\text{LIFT}}, E_{\text{LIFT}}, R_{\text{LIFT}})$ as follows:

- The node set is $V_{\text{LIFT}} = R$, where each node corresponds to a relation type from the original KG.
- The relation set is $R_{\text{LIFT}} = F$, where each motif $P \in F$ is treated as a distinct relation type in the relational hypergraph.
- The hyperedge set is $E_{\text{LIFT}} = \{P(r_1, \ldots, r_n) \mid P \in F,\ (r_1, \ldots, r_n) \in \text{Eval}(P, G)\}$, where $\text{Eval}(P, G)$ denotes the set of all $n$-tuples of relations in $G$ that match the motif $P$.
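The LIFT construction can be read directly as code. The following brute-force sketch (ours; real implementations index motif matches far more efficiently) enumerates the homomorphisms of Definition 4.1 and builds the hyperedges of $\text{LIFT}_F(G)$. The 2-path motif used as the example is only illustrative; the exact orientation conventions of ULTRA's motifs are those of Figure 4.

```python
from itertools import product

def homomorphisms(motif_facts, kg_facts):
    """Brute-force node-relation homomorphisms from a motif graph to a KG.
    Facts are (u, r, v) triples; yields the relation maps phi."""
    m_nodes = sorted({x for (u, _, v) in motif_facts for x in (u, v)})
    m_rels = sorted({r for (_, r, _) in motif_facts})
    g_nodes = {x for (u, _, v) in kg_facts for x in (u, v)}
    g_rels = {r for (_, r, _) in kg_facts}
    kg_set = set(kg_facts)
    for nodes in product(g_nodes, repeat=len(m_nodes)):
        pi = dict(zip(m_nodes, nodes))
        for rels in product(g_rels, repeat=len(m_rels)):
            phi = dict(zip(m_rels, rels))
            if all((pi[u], phi[r], pi[v]) in kg_set for (u, r, v) in motif_facts):
                yield phi

def lift(kg_facts, motifs):
    """LIFT_F(G): one hyperedge P(phi(r_1), ..., phi(r_n)) per match of P."""
    edges = set()
    for name, (m_facts, order) in motifs.items():
        for phi in homomorphisms(m_facts, kg_facts):
            edges.add((name, tuple(phi[r] for r in order)))
    return edges

# A 2-path motif alpha(x, y), beta(y, z), with relation order (alpha, beta):
motifs = {"path2": ([("x", "a", "y"), ("y", "b", "z")], ("a", "b"))}
kg = [("u1", "r1", "v1"), ("v1", "r2", "w1")]
print(lift(kg, motifs))  # {('path2', ('r1', 'r2'))}
```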
4.2. Relation encoder

A relation encoder is a tuple $(F, \text{Enc}_{\text{LIFT}})$, where $F$ is a set of motifs and $\text{Enc}_{\text{LIFT}}$ computes relation representations for the set $R = V_{\text{LIFT}}$ of nodes of $\text{LIFT}_F(G) = (V_{\text{LIFT}}, E_{\text{LIFT}}, R_{\text{LIFT}})$.

[Figure 3. Overall framework of MOTIF(F) over the motif set $F$: given a query $r_1(u, v)$, MOTIF(F) first applies the $\text{LIFT}_F$ operation to the knowledge graph $G$ to generate the relational hypergraph $(V_{\text{LIFT}}, E_{\text{LIFT}}, R_{\text{LIFT}})$ over the set of motifs $F$. Then a relation encoder is applied over $\text{LIFT}_F(G)$ to generate relation representations (indicated by color) conditioned on the query $r_1$, followed by an entity encoder over $G$ that conducts conditional message passing.]

Given that our aim is to encode links of the form $q(u, v)$, the encoding computed by $\text{Enc}_{\text{LIFT}}$ for a relation $r \in R$ is conditioned on the relation $q \in R$. These encodings are determined by a sequence of features $h^{(t)}_{r|q} \in \mathbb{R}^{d(t)}$, for $0 \leq t \leq T$, defined iteratively as follows:

$$h^{(0)}_{r|q} = \text{INIT}_1(q, r), \qquad h^{(t+1)}_{r|q} = \text{UP}_1\big(h^{(t)}_{r|q},\ \text{AGG}_1(M)\big),$$

where $\text{INIT}_1$, $\text{UP}_1$, $\text{AGG}_1$ are differentiable initialization, update, and aggregation functions, respectively, and:

$$M = \{\{\ \text{MSG}_{\rho(e)}\big(\{(h^{(t)}_{r'|q}, j) \mid (r', j) \in N_i(e)\}\big) \mid (e, i) \in E_{\text{LIFT}}(r)\ \}\}.$$

In this setting, $\text{MSG}_{\rho(e)}$ is a motif-specific message function, and we further define $E_{\text{LIFT}}(r) = \{(e, i) \mid e(i) = r,\ e \in E_{\text{LIFT}},\ 1 \leq i \leq \text{ar}(\rho(e))\}$ as the set of edge-position pairs of a node $r$, and $N_i(e)$ as the positional neighborhood of a hyperedge $e$ with respect to a position $i$:

$$N_i(e) = \{(e(j), j) \mid j \neq i,\ 1 \leq j \leq \text{ar}(\rho(e))\}.$$

We use $\text{Enc}_{\text{LIFT},q}[r]$ to denote the final relation encoding of a relation $r$ in $G$ when conditioned on $q$, that is, the last-layer vector $h^{(T)}_{r|q} \in \mathbb{R}^{d(T)}$.

Remark 1. Observe that the relation encoder computes binary relation invariants provided $\text{INIT}_1$ is an invariant.

4.3. Entity encoder

MOTIF uses an entity encoder to compute a link-level representation of a KG, regardless of its relation vocabulary. This works as follows. MOTIF is defined as a tuple $(\text{Enc}_{\text{KG}}, (F, \text{Enc}_{\text{LIFT}}))$, where $(F, \text{Enc}_{\text{LIFT}})$ is a relation encoder and $\text{Enc}_{\text{KG}}$ is an entity encoder that computes node representations over KGs of the form $G = (V, E, R)$, conditioned on a node $u \in V$ and the relation $q$, according to the following iterative scheme:

$$h^{(0)}_{v|u,q} = \text{INIT}_2(u, v, q), \qquad h^{(\ell+1)}_{v|u,q} = \text{UP}_2\big(h^{(\ell)}_{v|u,q},\ \text{AGG}_2(N)\big),$$

where $\text{INIT}_2$, $\text{UP}_2$, $\text{AGG}_2$, and $\text{MSG}$ are differentiable initialization, update, aggregation, and message functions, respectively, $v$ is an arbitrary node in $V$, and:

$$N = \{\{\ \text{MSG}\big(h^{(\ell)}_{w|u,q},\ \text{Enc}_{\text{LIFT},q}[r]\big) \mid w \in N_r(v),\ r \in R\ \}\}.$$

We use $\text{Enc}_{\text{KG},u,q}[v]$ to denote the encoding of $v$, conditioned on relation $q$ and source node $u$, which corresponds to $h^{(L)}_{v|u,q} \in \mathbb{R}^{d(L)}$ ($L$ being the number of layers). Finally, a unary decoder $\text{Dec} : \mathbb{R}^{d(L)} \to [0, 1]$ is applied to $\text{Enc}_{\text{KG},u,q}[v]$ to obtain the probability score for the existence of the potential link $q(u, v)$.

Remark 2. Observe that the entity encoder computes link invariants provided $\text{INIT}_2$ is an invariant.
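To summarize the two encoders operationally, here is a schematic NumPy sketch of the iterative schemes above, with sum aggregation, a per-motif linear message, and a tanh update. These concrete choices, the omitted positional encodings, and all names are our simplifications, not the paper's architecture (which instantiates HCNet and a modified NBFNet; see Section 7).

```python
import numpy as np

def relation_encoder(lift_edges, relations, q, dim=8, T=2):
    """Stage 2: conditional message passing over LIFT_F(G).
    lift_edges: set of (motif_name, (r_1, ..., r_k)) hyperedges."""
    rng = np.random.default_rng(0)
    W = {name: rng.normal(size=(dim, dim)) for name, _ in lift_edges}
    h = {r: (np.ones(dim) if r == q else np.zeros(dim)) for r in relations}  # INIT_1
    for _ in range(T):
        msg = {r: np.zeros(dim) for r in relations}
        for name, rels in lift_edges:
            for i, r in enumerate(rels):  # message to position i from the others
                others = sum((h[s] for j, s in enumerate(rels) if j != i),
                             np.zeros(dim))
                msg[r] = msg[r] + W[name] @ others  # motif-specific MSG
        h = {r: np.tanh(h[r] + msg[r]) for r in relations}  # UP_1 with sum AGG_1
    return h  # h[r] plays the role of Enc_LIFT,q[r]

def entity_encoder(kg_facts, rel_emb, u, q, nodes, dim=8, L=2):
    """Stage 3: conditional message passing over G, modulated by Enc_LIFT,q[r]."""
    h = {v: (rel_emb[q] if v == u else np.zeros(dim)) for v in nodes}  # INIT_2
    for _ in range(L):
        msg = {v: np.zeros(dim) for v in nodes}
        for (w, r, v) in kg_facts:          # w is in N_r(v)
            msg[v] = msg[v] + h[w] * rel_emb[r]
        h = {v: np.tanh(h[v] + msg[v]) for v in nodes}
    return h  # h[v] encodes the link q(u, v); a unary decoder scores it
```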
5. KGFMs captured by MOTIF

We revisit the ULTRA architecture (Galkin et al., 2024) (see details in Appendix F). The four fundamental relations used in ULTRA can be interpreted as the graph motifs illustrated in Figure 4.

[Figure 4. The four motifs used by ULTRA: h2t, t2h, h2h, and t2t, each a pattern over two relations $\alpha$ and $\beta$. h2t represents "head-to-tail", with the others following similarly. Note that h2t and t2h are isomorphic but with different relation orderings.]

Hence, any ULTRA architecture can be represented as a MOTIF architecture, with $\text{Enc}_{\text{LIFT}}$ and $\text{Enc}_{\text{KG}}$ being specific variants of NBFNets (Zhu et al., 2021). MOTIF can also represent a slight variant of the INGRAM architecture (Lee et al., 2023) when we replace the constructed weighted relation graph with a KG. Here, the construction of LIFT uses the graph motif set $F = \{\text{h2h}, \text{t2t}\}$, with $\text{Enc}_{\text{LIFT}}$ and $\text{Enc}_{\text{KG}}$ being variants of GATv2 (Brody et al., 2022), and the random initialization in $\text{INIT}_1$ is replaced by an invariant function. Note that the definition of MOTIF allows for unconditional message passing (Huang et al., 2023) when both $\text{INIT}_1$ and $\text{INIT}_2$ are agnostic to the query $q$.

We also note that, while other KGFMs like RMPI (Geng et al., 2023) and TRIX (Zhang et al., 2024) do not technically fall under the framework of MOTIF (see Appendix B for a detailed comparison of the expressive powers of TRIX and MOTIF), these models do rely on a LIFT operation to construct a graph of relations based on a predefined set of motifs, followed by relation and entity encoders. In particular, RMPI constructs an alternative relation graph inspired by the line graph, incorporating the motifs used in ULTRA along with two additional motifs, PARA and LOOP. In turn, TRIX considers the same set of motifs as ULTRA but employs alternating updates on entity and relation representations.

6. Expressive power

MOTIF computes encodings for links in KGs. In this section, we aim to understand which links MOTIF can separate and whether equipping MOTIF with more graph motifs increases its separation power. Hence, our focus is on the potential power obtained when using a particular set $F$ of motifs. To that end, we define MOTIF(F) as the set of all MOTIF instances that use a relation encoder built over $F$, that is, the set of all MOTIF instances of the form $(\text{Enc}_{\text{KG}}, (F, \text{Enc}_{\text{LIFT}}))$.

Definition 6.1. Let $F$ and $F'$ be sets of graph motifs. Then $F$ refines $F'$, denoted $F \preceq F'$, if for every KG $G$ and potential links $q(u, v)$, $q'(u', v')$ over $G$, whenever every MOTIF instance in MOTIF(F) encodes $q(u, v)$ and $q'(u', v')$ in the same way, then every MOTIF instance in MOTIF(F') also encodes $q(u, v)$ and $q'(u', v')$ in the same way. Furthermore, $F$ and $F'$ are said to have the same separation power if both $F \preceq F'$ and $F' \preceq F$ hold. We denote this as $F \equiv F'$.

Definition 6.1 captures the idea of the potential to separate, or distinguish, pairs of links in a KG using different sets of graph motifs. Specifically, whenever $F \preceq F'$, we know that using $F'$ will not result in architectures that can separate links that cannot be separated by any MOTIF instance in MOTIF(F). On the other hand, if $F \equiv F'$, then we know that $F$ and $F'$ can be used interchangeably without altering the potential for separating links in our architectures.

We start by observing a simple fact: isomorphic graph motifs make no difference in terms of separation power.

Proposition 6.2. Let $P = (G_M, \bar r)$ and $P' = (G'_M, \bar r')$ be two graph motifs such that $G_M$ is isomorphic to $G'_M$. Then, for any set $F$ of graph motifs:

$$(F \cup \{P\}) \equiv (F \cup \{P'\}) \equiv (F \cup \{P, P'\}).$$

Returning to Figure 4, it is easy to see that the motifs h2t and t2h are isomorphic. Hence, $F = \{\text{h2t}, \text{t2h}, \text{h2h}, \text{t2t}\}$ has the same separation power as $\{\text{h2t}, \text{h2h}, \text{t2t}\}$ or $\{\text{t2h}, \text{h2h}, \text{t2t}\}$. But although these sets of motifs have the same separation power, this does not imply that the ULTRA architecture can drop either h2t or t2h.
Indeed, while the set of links distinguished by any instance in MOTIF(F) is the same as those in MOTIF({h2t, h2h, t2t}) or MOTIF({t2h, h2h, t2t}), the ULTRA framework is a strict subset of MOTIF(F), so we cannot directly transfer these results to the ULTRA framework. From a practical standpoint, the difference lies in the support for inverse relations: ULTRA bases its relation encoding on NBFNets (Zhu et al., 2021), while MOTIF uses HCNets (Huang et al., 2024), which support inverse relations by design (see Appendices A and F).

6.1. When does a new motif lead to more links separated?

As a tool to assert when adding a new motif $P$ increases the number of links that can be separated using a set of graph motifs $F$, we provide a necessary condition for the refinement of sets of motifs $F$ and $F'$. We need some terminology. A homomorphism $h = (\pi, \phi)$ from a KG $G = (V, E, R)$ to a KG $G' = (V', E', R')$ is relation-preserving if $R \subseteq R'$ and $\phi(R) = R$. Further, a graph $H$ is a relation-preserving core if every relation-preserving homomorphism $h$ from $H$ to $H$ is onto, that is, $h(H) = H$. It is possible to show the following:

Proposition 6.3. For every KG $G$, up to isomorphism, there is a unique KG $H$ such that: $H$ is a relation-preserving core, and there are relation-preserving homomorphisms from $G$ to $H$ and from $H$ to $G$. The KG $H$ is called the relation-preserving core of $G$.

Our condition is laid out in terms of a specific notion of homomorphism that we call core-onto homomorphisms. Formally, a homomorphism $h = (\pi, \phi)$ from $G$ to $G'$ is core-onto if the KG $h(G)$ is isomorphic to the relation-preserving core of $G'$. If such a homomorphism exists, we write $G \to_{\text{co}} G'$. Let us also say that a motif is trivial if its relation-preserving core is isomorphic to the graph with a single fact $r(u, v)$. We are now ready to state the main result of this section.

Theorem 6.4. Let $F, F'$ be two sets of motifs, and assume that $F \preceq F'$. Then, for every non-trivial motif $P' \in F'$ there is a motif $P \in F$ such that $P \to_{\text{co}} P'$.

To prove this result, we define a coloring scheme analogous to the WL test for relational hypergraphs presented in Huang et al. (2024), which simulates the relation encoding defined by $\text{Enc}_{\text{LIFT}}$. We then combine this scheme with techniques from Huang et al. (2023) to produce a coloring for each link in the KG, effectively simulating the encoding provided by $\text{Enc}_{\text{KG}}$ (see Appendix C). We show that this scheme is as powerful as MOTIF in its ability to distinguish links over KGs. With this tool in hand, we prove the converse of Theorem 6.4: we show how to construct a KG $G$ and links $q(u, v)$, $q'(u', v')$ such that every MOTIF instance in MOTIF(F) encodes them in the same way, but there is a MOTIF instance in MOTIF(F') that encodes them differently, whenever $F'$ contains a motif that cannot be covered with a core-onto homomorphism from any motif in $F$.

6.2. Consequences for the design of MOTIF models

Consider MOTIF models in MOTIF(F). Theorem 6.4 implies that if $P \notin F$ cannot be covered via a core-onto homomorphism from any motif already in $F$, then incorporating $P$ into the set of graph motifs used by the architecture should lead to more expressive encodings. This has consequences for the design of MOTIF architectures.

$k$-path motif. Take a motif representing a path of length $k$: $r_1(u_1, u_2), r_2(u_2, u_3), \ldots, r_k(u_k, u_{k+1})$. For $n \geq 1$, let $F^{\text{path}}_n$ denote the set of motifs formed by changing the orientation of any number of edges $r_j$ in the path of length $k$, for $k \leq n$, up to isomorphism (a sketch of this construction is given below).
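The following sketch (ours) enumerates $F^{\text{path}}_n$ literally from this definition, skipping the trivial length-1 motif and leaving the deduplication up to isomorphism (which, by Proposition 6.2, does not affect separation power) aside.

```python
from itertools import product

def path_motifs(n):
    """All k-path motifs with 2 <= k <= n: a path u_1, ..., u_{k+1} whose j-th
    edge carries relation r_j and points either forward or backward.
    Deduplication up to isomorphism (Proposition 6.2) is omitted."""
    motifs = []
    for k in range(2, n + 1):
        for dirs in product((0, 1), repeat=k):  # 0: forward, i.e. r_j(u_j, u_{j+1})
            facts = []
            for j, d in enumerate(dirs, start=1):
                u, v = f"u{j}", f"u{j+1}"
                facts.append((u, f"r{j}", v) if d == 0 else (v, f"r{j}", u))
            motifs.append((facts, tuple(f"r{j}" for j in range(1, k + 1))))
    return motifs

print(len(path_motifs(3)))  # 4 two-path + 8 three-path orientations = 12
```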
Theorem 6.4 tells us that $F^{\text{path}}_n \not\preceq F^{\text{path}}_m$ whenever $n < m$, since there is no core-onto homomorphism from any motif with $k \leq n$ edges to a motif with $m$ edges; as $F^{\text{path}}_m$ also contains $F^{\text{path}}_n$, it can separate links that $F^{\text{path}}_n$ cannot. Hence, every longer path motif adds power for more expressive encodings. Further, the set of motifs used in ULTRA (Figure 4) is $F = \{\text{h2t}, \text{t2h}, \text{h2h}, \text{t2t}\}$, which, by Proposition 6.2, has the same separation power as $F^{\text{path}}_2 = \{\text{h2t}, \text{h2h}, \text{t2t}\}$. We can leverage this observation to prove:

Theorem 6.5. ULTRA has the same expressive power as MOTIF($F^{\text{path}}_2$).

$k$-star motif. As another example, consider the $k$-star motif: $r_1(v_1, u), r_2(v_2, u), \ldots, r_k(v_k, u)$. Figure 5 shows an example of a 4-star motif.

[Figure 5. The 4-star motif, over relations $(\alpha, \beta, \gamma, \delta)$.]

Once again, the set $F^{\text{star}}_n$ of motifs containing all $k$-stars with $k \leq n$ gives us a hierarchy in terms of separation power, where $F^{\text{star}}_n \not\preceq F^{\text{star}}_m$ whenever $n < m$. Hence, every wider star motif adds power for more expressive encodings. (The evaluation of a $k$-star motif admits a simple characterization, sketched at the end of this section.)

A more expressive KGFM. We also present new ways to improve the ULTRA architecture. Adding any motif with 3 edges (shown in Figure 6) to the set of four motifs used in ULTRA enhances its expressive power. As an example task, shown in Figure 7, ULTRA with its motifs generates a complete relation graph and fails to distinguish between $r_1$ and $r_2$. In contrast, MOTIF($F^{\text{path}}_3$) constructs a hyperedge $(r_2, r_3, r_1)$ but not $(r_1, r_3, r_2)$, thereby differentiating between $r_1$ and $r_2$ and solving the task. (See Appendix K for a detailed proof.) Furthermore, the core idea of augmenting existing KGFMs with more expressive motifs to provably enhance their expressivity fits seamlessly with the large body of KGFMs in the literature, such as INGRAM (Lee et al., 2023), RMPI (Geng et al., 2023), and TRIX (Zhang et al., 2024), providing an orthogonal avenue for boosting their expressive power.

[Figure 6. The 3-path motifs ($F^{\text{path}}_3$): tfh, tft, hfh, and hft, each a pattern over three relations $\alpha, \beta, \gamma$. tfh stands for tail-forward-head, and the rest follow similarly.]

[Figure 7. ULTRA cannot distinguish between $r_3(u, v_1)$ and $r_3(u, v_2)$, whereas MOTIF($F^{\text{path}}_3$) can.]

6.3. Comparison with existing link prediction models

Existing inductive link prediction models, such as C-MPNNs and R-MPNNs (Huang et al., 2023), can be viewed as special instances of MOTIF with an empty motif set and $\text{INIT}_1$ being a one-hot encoding of relations. However, this configuration breaks relation invariance on relational hypergraphs. Somewhat counter-intuitively, we note that an instance of MOTIF that preserves relation invariance is inherently less expressive than its corresponding entity encoder $\text{Enc}_{\text{KG}}$, which acts as a node invariant (but not a relation invariant). This implies, for example, that NBFNet (Zhu et al., 2021) is strictly more expressive than ULTRA when measured only over graphs with the same relation types. This increased expressive power comes at the cost of limited generalization to unseen relations.
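As referenced in the $k$-star discussion above, the evaluation of a $k$-star motif admits a particularly simple characterization: a homomorphism only constrains each relation to have an incoming edge at a common center node. The sketch below (ours) computes it directly.

```python
from collections import defaultdict
from itertools import product

def eval_star_motif(kg_facts, k):
    """Matches of the k-star motif r_1(v_1, u), ..., r_k(v_k, u): ordered
    k-tuples of relations that all have an incoming edge at a common node u.
    Homomorphisms need not be injective, so relations may repeat."""
    incoming = defaultdict(set)          # u -> {r | r(., u) is a fact}
    for (v, r, u) in kg_facts:
        incoming[u].add(r)
    matches = set()
    for rels in incoming.values():
        matches |= set(product(rels, repeat=k))
    return matches

kg = [("a", "p1", "hub"), ("b", "p2", "hub"), ("c", "p3", "hub")]
print(("p1", "p2", "p3") in eval_star_motif(kg, 3))  # True
```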
7. Experimental analysis

We evaluate MOTIF over a broad range of knowledge graphs to answer the following questions:

Q1. Does MOTIF equipped with more expressive graph motifs exhibit stronger performance?
Q2. Does ULTRA have the same expressive power as MOTIF($F^{\text{path}}_2$)?
Q3. How does a more refined relation invariant help link prediction tasks?
Q4. What is the trade-off between expressiveness and scalability for MOTIF? (See Appendices H and I.)

In the experiments, we consider a basic architecture in which the relation encoder $\text{Enc}_{\text{LIFT}}$ is an HCNet (Huang et al., 2024) and the entity encoder $\text{Enc}_{\text{KG}}$ is a slight modification of NBFNet (Zhu et al., 2021) (see Appendix E).

7.1. Synthetic experiments: ConnectHub

We construct synthetic datasets ConnectHub(k) to validate MOTIF's enhanced expressiveness with richer motifs (Q1).

Task & Setup. The dataset ConnectHub(k) consists of multiple synthetic KGs, where in each KG the relations are partitioned into a positive class $P$ and a negative class $N$, both of size $k + 1$, plus a query relation $q$. Each KG contains a $(k+1)$-star hub with positive relations, and for each possible size-$k$ subset of $P$ and of $N$, the KG contains a positive and a negative $k$-star community, respectively. An example of such a KG is shown in Figure 8 with $k = 2$. We consider the following classification task: predict whether there exists a link of relation $q$ from the hub center to positive communities but NOT to negative communities. The key to solving this task is to successfully differentiate between relations from the positive and negative classes. We consider ULTRA and MOTIF($F^{\text{star}}_m$) for varying $m$, and show the empirical results on these datasets in Table 1. (Further details in Appendix L.)

[Figure 8. An example of a KG in one of the synthetic datasets ConnectHub(2); relations in $P$ are shown in red and relations in $N$ in blue.]

Table 1. Accuracy on the ConnectHub(k) datasets with ULTRA and MOTIF($F^{\text{star}}_m$) for varying $m$ and $k$ (blank cells were not evaluated).

| Model | k = 2 | k = 3 | k = 4 | k = 5 | k = 6 |
|---|---|---|---|---|---|
| ULTRA | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
| $F^{\text{star}}_2$ | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
| $F^{\text{star}}_3$ | 1.00 | 0.50 | 0.50 | 0.50 | 0.50 |
| $F^{\text{star}}_4$ | | 1.00 | 0.50 | 0.50 | 0.50 |
| $F^{\text{star}}_5$ | | | 1.00 | 0.50 | 0.50 |
| $F^{\text{star}}_6$ | | | | 1.00 | 0.50 |
| $F^{\text{star}}_7$ | | | | | 1.00 |

ULTRA & MOTIF with inexpressive motifs. We claim that neither ULTRA nor MOTIF($F^{\text{star}}_m$) with $m \leq k$ can solve the task on ConnectHub(k). Since the only difference between $P$ and $N$ is the presence of the $(k+1)$-star hub of relations in $P$, these models cannot detect this higher-order motif. Thus, $\text{LIFT}_F(G)$ will contain two disjoint isomorphic hypergraphs: one with positive relations and the other with negative relations. This results in indistinguishability between relations in $P$ and $N$, failing the task. Empirically, we find that MOTIF($F^{\text{star}}_m$) with $m \leq k$ and ULTRA only reach 50% accuracy, no better than a random guess, aligning with our theoretical studies.

MOTIF with expressive motifs. We claim that MOTIF($F^{\text{star}}_m$) with $m > k$ can detect the presence of the positive hub, precisely due to the additional $(k+1)$-star motifs. This allows the model to construct a hyperedge in $\text{LIFT}_F(G)$ across all relations in $P$, while no such hyperedge exists for $N$, thus distinguishing $P$ from $N$. Empirically, MOTIF($F^{\text{star}}_m$) with $m = k+1$ reaches 100% accuracy on ConnectHub(k), validating our theoretical results.
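For reference, the following is our loose reconstruction of a single ConnectHub(k) KG from the description above; the released datasets may differ in detail (in particular, we omit the training facts for the query relation $q$).

```python
from itertools import combinations

def connect_hub(k):
    """Loose reconstruction of one ConnectHub(k) KG: a (k+1)-star hub over
    the positive relations, plus one k-star community per size-k subset of
    the positive class P and of the negative class N."""
    P = [f"p{i}" for i in range(k + 1)]
    N = [f"n{i}" for i in range(k + 1)]
    facts = [(f"src_{r}", r, "hub") for r in P]   # the hub exists only for P
    for tag, rels in (("pos", P), ("neg", N)):
        for c, subset in enumerate(combinations(rels, k)):
            centre = f"{tag}_community_{c}"
            facts += [(f"{centre}_{r}", r, centre) for r in subset]
    return facts, P, N

facts, P, N = connect_hub(2)  # the KG sketched in Figure 8
```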
Following convention (Zhu et al., 2021), on each knowledge graph and for each triplet r(u, v), we augment the correspond- How Expressive are Knowledge Graph Foundation Models? ing inverse triplet r 1(v, u) where r 1 is a fresh relation symbol. For evaluation, we follow filtered ranking protocol (Bordes et al., 2013) and report Mean Reciprocal Rank (MRR), and Hits@10 for each dataset. We report both head and tails prediction results for each triplet on all datasets except three from Lv et al. (2020), where only tails prediction results are reported. The code is available at https://github.com/Hxy Scotthuang/MOTIF/. Setup. To evaluate the impact of graph motifs, we construct four variants of MOTIF models with different F: 3-path (Fpath 3 ), 2-path (Fpath 2 ), {h2t} only, and no motifs ( ), defined in Section 6. We present the average zero-shot and fine-tuned results of MOTIF over 51 KGs in Table 2, along with the corresponding results for ULTRA taken from Galkin et al. (2024). The detailed per-dataset results of pretrained, zero-shot, and finetuned inference are shown in Tables 7 to 10, respectively. Note that in the zero-shot inference setting, Inductive e, r, Inductive e, and Transductive are in principle indifferent, as all models encounter unseen relations in the inference graph. Results. First of all, note that MOTIF(Fpath 3 ) outperforms MOTIF(Fpath 2 ) over all 51 datasets on average in zero-shot inference. In the case of fine-tuned inference, a similar trend is present: MOTIF(Fpath 3 ) consistently outperforms MOTIF(Fpath 2 ). It is also worth noting that both MOTIF(Fpath 3 ) and MOTIF(Fpath 2 ) achieve further performance improvements after fine-tuning on the training set compared to zero-shot inference. These findings demonstrate that the additionally introduced motifs help the model to learn richer relation invariants (Q1), which are then leveraged for better predictions on new KGs with unseen nodes and relations. This observation is true on both zero-shot and fine-tuned inference. Secondly, recall that MOTIF(Fpath 2 ) and ULTRA have the same expressive power as shown in Theorem 6.5. This theoretical finding is supported in our empirical study, as we see that ULTRA performs similarly with MOTIF(Fpath 2 ) in these experiments (Q2). Lastly, when it comes to comparing MOTIF(Fpath 2 ) and MOTIF({h2t}), we observe a gradual decrease in all metrics on all datasets as the motifs are removed. This aligns with Theorem 6.4: adding 3-paths on top of 2-paths yields more expressive power on MOTIF since there is no core-onto homomorphism from any 2-path to 3-path motifs. Similarly, adding h2h or t2t on top of h2t also yields more expressive power as there is no homomorphism from h2t to h2h or t2t. In the extreme case, we observe that removing all of the motifs in MOTIF( ) prevents the model from learning any non-trivial relation invariants, thus exhibiting a complete failure in generalization to unseen KGs. Results on datasets with complex relational patterns (Q1). Our findings on domain-specific benchmarks further validate the benefits of increased expressiveness. On Metafam (Zhou et al., 2023a), a dataset designed to challenge models with conflicting and compositional relational patterns, MOTIF achieves a 45% improvement in MRR over ULTRA in the zero-shot setting. This underscores the model s ability to capture non-trivial relational invariants when higher-order motifs are available. 
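For concreteness, here is a minimal sketch of the filtered ranking metrics used throughout this section (our illustration; the helper name and the strict-comparison tie-breaking are assumptions, not the paper's exact evaluation code).

```python
def filtered_metrics(scores, target, known_true):
    """Filtered ranking (Bordes et al., 2013): rank the target entity's score
    against all candidates, masking out the other known true answers.
    scores: dict entity -> score; known_true: other true answers for the query."""
    t = scores[target]
    rank = 1 + sum(s > t for e, s in scores.items()
                   if e != target and e not in known_true)
    return 1.0 / rank, float(rank <= 10)   # reciprocal rank, Hits@10

# MRR and Hits@10 are the means of these per-query values.
rr, h10 = filtered_metrics({"a": 0.9, "b": 0.4, "c": 0.7}, "c", {"a"})
print(rr, h10)  # 1.0 1.0  ("a" is filtered out, so "c" ranks first)
```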
Results on datasets with complex relational patterns (Q1). Our findings on domain-specific benchmarks further validate the benefits of increased expressiveness. On Metafam (Zhou et al., 2023a), a dataset designed to challenge models with conflicting and compositional relational patterns, MOTIF achieves a 45% improvement in MRR over ULTRA in the zero-shot setting. This underscores the model's ability to capture non-trivial relational invariants when higher-order motifs are available. Similarly, on WIKITOPICS-MT (Zhou et al., 2023a), where multiple KGs from diverse topics are combined to form a multi-task setting, MOTIF($F^{\text{path}}_3$) consistently outperforms baselines. Averaged across 8 datasets, MOTIF($F^{\text{path}}_3$) improves MRR from 0.331 to 0.358 and Hits@10 from 0.442 to 0.481; gains are even more pronounced on certain subsets (e.g., a 58% relative improvement on MT1-tax). These results confirm that expressiveness improvements translate into practical gains when the structure of the data aligns with the capabilities of the underlying motifs.

Table 2. Average zero-shot and fine-tuned link prediction MRR and Hits@10 over 51 KGs.

| Model (Motif $F$) | Inductive e, r (23 graphs) MRR / H@10 | Inductive e (18 graphs) MRR / H@10 | Transductive (10 graphs) MRR / H@10 | Total Avg (51 graphs) MRR / H@10 | Pretrained (3 graphs) MRR / H@10 |
|---|---|---|---|---|---|
| Zero-shot: ULTRA | 0.345 / 0.513 | 0.431 / 0.566 | 0.339 / 0.494 | 0.374 / 0.529 | - |
| Zero-shot: MOTIF ($F^{\text{path}}_3$) | 0.349 / 0.525 | 0.436 / 0.577 | 0.343 / 0.496 | 0.378 / 0.537 | - |
| Zero-shot: MOTIF ($F^{\text{path}}_2$) | 0.337 / 0.509 | 0.431 / 0.570 | 0.335 / 0.492 | 0.370 / 0.527 | - |
| Zero-shot: MOTIF ({h2t}) | 0.330 / 0.503 | 0.422 / 0.559 | 0.325 / 0.482 | 0.361 / 0.518 | - |
| Zero-shot: MOTIF ($\emptyset$) | 0.074 / 0.121 | 0.107 / 0.169 | 0.054 / 0.101 | 0.082 / 0.134 | - |
| Fine-tuned: ULTRA | 0.397 / 0.556 | 0.442 / 0.582 | 0.384 / 0.543 | 0.410 / 0.563 | 0.407 / 0.568 |
| Fine-tuned: MOTIF ($F^{\text{path}}_3$) | 0.401 / 0.558 | 0.455 / 0.594 | 0.394 / 0.549 | 0.419 / 0.569 | 0.415 / 0.565 |

7.3. End-to-end experiments

We also conduct end-to-end experiments where, for each dataset, we train a MOTIF($F^{\text{path}}_3$) on the training set from scratch and evaluate it on the corresponding validation and test sets. We present the average results over all 54 KGs in Table 3 (full results in Tables 11 and 12). We continue to see improved performance of MOTIF($F^{\text{path}}_3$) over ULTRA due to the additional motifs, which allow the model to capture a richer set of information over relations (Q1).

Table 3. Average end-to-end MRR and Hits@10 results over 54 KGs.

| Model | MRR | H@10 |
|---|---|---|
| ULTRA | 0.394 | 0.552 |
| MOTIF($F^{\text{path}}_3$) | 0.409 | 0.561 |

Degradation of relation learning (Q3). As a case study, we focus on the end-to-end experiments with WN-v2 (Teru et al., 2020), an inductive (on node) dataset with only 20 relations after inverse augmentation. The MRR for ULTRA is 0.296, severely lower than that of MOTIF($F^{\text{path}}_3$) (0.684). To investigate, we plot in Figure 9 the cosine similarity among the computed relation embeddings when the queried relations are $r_0$ = derivationally_related_form and $r_2$ = hypernym. (The block structure in Figure 9 is due to the concatenation of augmented inverse relations.) ULTRA produces highly similar relation representations, whereas MOTIF with richer motifs generates distinguishing relation representations, aiding the entity encoder in link prediction. Nevertheless, we notice that such degradation does not appear in the pretraining experiments, suggesting that either enriching relation learning with diverse motifs or training over multiple KGs could help avoid such degradation.

[Figure 9. Cosine similarities between generated relation embeddings when the queried relation $q$ is $r_0$ and $r_2$, on the test set of WN-v2.]
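The similarity analysis behind Figure 9 amounts to a cosine-similarity matrix over the conditional relation embeddings; a minimal sketch (ours, with illustrative names and a random placeholder for the embeddings):

```python
import numpy as np

def cosine_similarity_matrix(emb):
    """Pairwise cosine similarities of relation embeddings, as in Figure 9.
    emb: (num_relations, dim) array of conditional relation representations."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    return unit @ unit.T

emb = np.random.default_rng(0).normal(size=(20, 64))  # e.g. 20 relations in WN-v2
S = cosine_similarity_matrix(emb)
# A near-constant S (as ULTRA produces on WN-v2) means the relation
# representations carry little discriminative signal for the entity encoder.
```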
8. Conclusions

We introduced MOTIF, a general framework that extends existing KGFMs by incorporating arbitrary motifs. In the context of this framework, we studied the expressive power of KGFMs and identified sufficient conditions under which adding new motifs improves MOTIF's ability to compute finer relation invariants. We thoroughly discussed the implications of our results for existing KGFMs. The theoretical findings are empirically validated on a diverse set of datasets. Our work provides key insights and concrete recipes for designing more expressive KGFMs. However, the typical trade-off between expressive power and computational complexity clearly also applies to our study. While using richer motifs provably increases MOTIF's expressive power, this may come at the cost of computational efficiency in terms of both time and memory (see the discussion in Appendices H and I). A promising avenue for future work is to enhance MOTIF's computational efficiency, thus enabling scalability to real-world large-scale graphs.

Acknowledgment

Bronstein is supported by EPSRC Turing AI World-Leading Research Fellowship No. EP/X040062/1 and EPSRC AI Hub on Mathematical Foundations of Intelligence: An Erlangen Programme for AI No. EP/Y028872/1. Reutter is funded by Fondecyt grant 1221799. Barceló and Reutter are funded by ANID Millennium Science Initiative Program - Code ICN17002. Barceló and Romero are funded by the National Center for Artificial Intelligence CENIA FB210017, Basal ANID.

Impact Statement

This paper presents work aimed at advancing the field of Machine Learning. Our methods have potential applications in recommendation systems and knowledge graph completion, among others. While these applications can have significant societal impacts, including improving information retrieval and personalization, as well as potential risks such as bias amplification and misinformation propagation, we do not identify any specific, immediate concerns requiring special attention in this work.

References

Abboud, R., Ceylan, I. I., Lukasiewicz, T., and Salvatori, T. BoxE: A box embedding model for knowledge base completion. In NeurIPS, 2020.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv, 2016.

Barceló, P., Galkin, M., Morris, C., and Romero, M. Weisfeiler and Leman go relational. In LoG, 2022.

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. Translating embeddings for modeling multi-relational data. In NIPS, 2013.

Brody, S., Alon, U., and Yahav, E. How attentive are graph attention networks? In ICLR, 2022.

Cui, Y., Sun, Z., and Hu, W. A prompt-based knowledge graph foundation model for universal in-context reasoning. In NeurIPS, 2024.

Dettmers, T., Pasquale, M., Pontus, S., and Riedel, S. Convolutional 2D knowledge graph embeddings. In AAAI, 2018.

Fatemi, B., Taslakian, P., Vazquez, D., and Poole, D. Knowledge hypergraphs: Prediction beyond binary relations. In IJCAI, 2020.

Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Galárraga, L. A., Teflioudi, C., Hose, K., and Suchanek, F. AMIE: Association rule mining under incomplete evidence in ontological knowledge bases. In WWW, 2013.

Galkin, M., Denis, E., Wu, J., and Hamilton, W. L. NodePiece: Compositional and parameter-efficient representations of large knowledge graphs. In ICLR, 2022.

Galkin, M., Yuan, X., Mostafa, H., Tang, J., and Zhu, Z. Towards foundation models for knowledge graph reasoning. In ICLR, 2024.

Gao, J., Zhou, Y., Zhou, J., and Ribeiro, B. Double equivariance for inductive link prediction for both new nodes and new relation types. arXiv, 2023.

Geng, Y., Chen, J., Pan, J. Z., Chen, M., Jiang, S., Zhang, W., and Chen, H. Relational message passing for fully inductive knowledge graph completion. In ICDE, 2023.

Grohe, M. The logic of graph neural networks. In LICS, 2021.
Himmelstein, D. S., Lizee, A., Hessler, C., Brueggeman, L., Chen, S. L., Hadley, D., Green, A., Khankhanian, P., and Baranzini, S. E. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 2017.

Huang, X., Orth, M. R., Ceylan, I. I., and Barceló, P. A theory of link prediction via relational Weisfeiler-Leman on knowledge graphs. In NeurIPS, 2023.

Huang, X., Orth, M. R., Barceló, P., Bronstein, M. M., and Ceylan, I. I. Link prediction with relational hypergraphs. arXiv, 2024.

Lee, J., Chung, C., and Whang, J. J. InGram: Inductive knowledge graph embedding via relation graphs. In ICML, 2023.

Liu, S., Grau, B., Horrocks, I., and Kostylev, E. INDIGO: GNN-based inductive knowledge graph completion using pair-wise encoding. In NeurIPS, 2021.

Liu, Y., Yao, Q., and Li, Y. Generalizing tensor decomposition for n-ary relational knowledge bases. In WWW, 2020.

Lv, X., Han, X., Hou, L., Li, J., Liu, Z., Zhang, W., Zhang, Y., Kong, H., and Wu, S. Dynamic anticipation and completion for multi-hop reasoning over sparse knowledge graph. In EMNLP, 2020.

Mahdisoltani, F., Biega, J. A., and Suchanek, F. M. YAGO3: A knowledge base from multilingual Wikipedias. In CIDR, 2015.

Mao, H., Chen, Z., Tang, W., Zhao, J., Ma, Y., Zhao, T., Shah, N., Galkin, M., and Tang, J. Position: Graph foundation models are already here. In ICML, 2024.

Safavi, T. and Koutra, D. CoDEx: A comprehensive knowledge graph completion benchmark. In EMNLP, 2020.

Schlichtkrull, M. S., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In ESWC, 2018.

Sun, Z., Deng, Z.-H., Nie, J.-Y., and Tang, J. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR, 2019.

Teru, K. K., Denis, E. G., and Hamilton, W. L. Inductive relation prediction by subgraph reasoning. In ICML, 2020.

Toutanova, K. and Chen, D. Observed versus latent features for knowledge base and text inference. In Workshop on Continuous Vector Space Models and their Compositionality, 2015.

Vashishth, S., Sanyal, S., Nitin, V., and Talukdar, P. Composition-based multi-relational graph convolutional networks. In ICLR, 2020.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NIPS, 2017.

Wen, J., Li, J., Mao, Y., Chen, S., and Zhang, R. On the representation and embedding of knowledge bases beyond binary relations. In IJCAI, 2016.

Xiong, W., Hoang, T., and Wang, W. Y. DeepPath: A reinforcement learning method for knowledge graph reasoning. In EMNLP, 2017.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In ICLR, 2019.

Yadati, N. Neural message passing for multi-relational ordered and recursive hypergraphs. In NeurIPS, 2020.

Zhang, M., Li, P., Xia, Y., Wang, K., and Jin, L. Labeling trick: A theory of using graph neural networks for multi-node representation learning. In NeurIPS, 2021.

Zhang, Y. and Yao, Q. Knowledge graph reasoning with relational digraph. In WebConf, 2022.

Zhang, Y., Zhou, Z., Yao, Q., Chu, X., and Han, B. AdaProp: Learning adaptive propagation for graph neural network based knowledge graph reasoning. In KDD, 2023.

Zhang, Y., Bevilacqua, B., Galkin, M., and Ribeiro, B. TRIX: A more expressive model for zero-shot domain transfer in knowledge graphs. In LoG, 2024.
Zhou, J., Bevilacqua, B., and Ribeiro, B. A multi-task perspective for link prediction with new relation types and nodes. In NeurIPS GLFrontiers, 2023a.

Zhou, X., Hui, B., Zeira, I., Wu, H., and Tian, L. Dynamic relation learning for link prediction in knowledge hypergraphs. Appl Intell, 2023b.

Zhu, Z., Zhang, Z., Xhonneux, L.-P., and Tang, J. Neural Bellman-Ford networks: A general graph neural network framework for link prediction. In NeurIPS, 2021.

Zhu, Z., Yuan, X., Galkin, M., Xhonneux, S., Zhang, M., Gazeau, M., and Tang, J. A*Net: A scalable path-based reasoning approach for knowledge graphs. In NeurIPS, 2023.

A. C-MPNNs and HC-MPNNs

In this section, we follow Huang et al. (2023) and Huang et al. (2024) to define conditional message-passing neural networks (C-MPNNs) and hypergraph conditional message-passing neural networks (HC-MPNNs). We then follow Huang et al. (2024) to define hypergraph conditional networks (HCNets) as an architecture within the HC-MPNN framework. We also present the corresponding relational Weisfeiler-Leman tests used to rigorously study their expressive power. For ease of presentation, we omit the discussion of history functions and readout functions from Huang et al. (2023). Some notations are also adapted to simplify the exposition.

A.1. Model definitions

C-MPNNs. Let $G = (V, E, R)$ be a KG. A conditional message-passing neural network (C-MPNN) iteratively computes node representations, relative to a fixed query $q \in R$ and a fixed source node $u \in V$, as follows:

$$h^{(0)}_{v|u,q} = \text{INIT}(u, v, q),$$
$$h^{(\ell+1)}_{v|u,q} = \text{UP}\Big(h^{(\ell)}_{v|u,q},\ \text{AGG}\big(\{\{\text{MSG}_r(h^{(\ell)}_{w|u,q}, z_q) \mid w \in N_r(v),\ r \in R\}\}\big)\Big),$$

where INIT, UP, AGG, and $\text{MSG}_r$ are differentiable initialization, update, aggregation, and relation-specific message functions, respectively. We denote by $h^{(\ell)}_q : V \times V \to \mathbb{R}^{d(\ell)}$ the function $h^{(\ell)}_q(u, v) := h^{(\ell)}_{v|u,q}$, and denote by $z_q$ a learnable vector representing the query $q \in R$. We can then interpret a C-MPNN as computing representations of pairs of nodes. Given a fixed number of layers $L \geq 0$, a C-MPNN computes the final pair representations as $h^{(L)}_q$. In order to decode the likelihood of the fact $q(u, v)$ for some $q \in R$, we consider a unary decoder $\text{Dec} : \mathbb{R}^{d(L)} \to \mathbb{R}$. We additionally require $\text{INIT}(u, v, q)$ to satisfy the property of target node distinguishability: for all $q \in R$ and $v \neq u \in V$, it holds that $\text{INIT}(u, u, q) \neq \text{INIT}(u, v, q)$.

HC-MPNNs. Given a relational hypergraph $H = (V, E, R)$ and a query $\mathbf{q} = (q, \tilde u, t) := q(u_1, \ldots, u_{t-1}, ?, u_{t+1}, \ldots, u_m)$, for $\ell \geq 0$, an HC-MPNN computes a sequence of feature maps $h^{(\ell)}_{v|\mathbf{q}}$ as follows:

$$h^{(0)}_{v|\mathbf{q}} = \text{INIT}(v, \mathbf{q}),$$
$$h^{(\ell+1)}_{v|\mathbf{q}} = \text{UP}\Big(h^{(\ell)}_{v|\mathbf{q}},\ \text{AGG}\big(\{\{\text{MSG}_{\rho(e)}\big(\{(h^{(\ell)}_{w|\mathbf{q}}, j) \mid (w, j) \in N_i(e)\},\ \mathbf{q}\big) \mid (e, i) \in E(v)\}\}\big)\Big),$$

where INIT, UP, AGG, and $\text{MSG}_{\rho(e)}$ are differentiable initialization, update, aggregation, and relation-specific message functions, respectively. An HC-MPNN has a fixed number of layers $L \geq 0$, and the final conditional node representations are given by $h^{(L)}_{v|\mathbf{q}}$. We denote by $h^{(\ell)}_{\mathbf{q}} : V \to \mathbb{R}^{d(\ell)}$ the function $h^{(\ell)}_{\mathbf{q}}(v) := h^{(\ell)}_{v|\mathbf{q}}$.

HCNets. HCNets can be seen as an architecture within the HC-MPNN framework, obtained by choosing appropriate initialization, aggregation, and message functions. Let $H = (V, E, R)$ be a relational hypergraph.
For a query $\mathbf{q} = (q, \tilde u, t) := q(u_1, \ldots, u_{t-1}, ?, u_{t+1}, \ldots, u_m)$, an HCNet computes the following representations for all $\ell \geq 0$:

$$h^{(0)}_{v|\mathbf{q}} = \sum_{i \neq t} \mathbb{1}_{v = u_i}\,(p_i + z_q),$$
$$h^{(\ell+1)}_{v|\mathbf{q}} = \sigma\Big(W^{(\ell)}\Big[h^{(\ell)}_{v|\mathbf{q}} \,\Big\|\, \sum_{(e,i) \in E(v)} g^{(\ell)}_{\rho(e),q}\Big(\bigodot_{j \neq i}\big(\alpha^{(\ell)} h^{(\ell)}_{e(j)|\mathbf{q}} + (1 - \alpha^{(\ell)})\, p_j\big)\Big)\Big] + b^{(\ell)}\Big),$$

where $g^{(\ell)}_{\rho(e),q}$ is a learnable message function (often a diagonal matrix), $\sigma$ is an activation function, $W^{(\ell)}$ is a learnable weight matrix, $b^{(\ell)}$ is a learnable bias term per layer, $z_q$ is the learnable query vector for $q \in R$, and $\mathbb{1}_C$ is the indicator function that returns 1 if condition $C$ is true and 0 otherwise. $\alpha^{(\ell)}$ is a learnable scalar, and $p_i$ refers to the sinusoidal positional encoding (Vaswani et al., 2017) at position $i$.

A.2. Relational WL tests

To characterize the expressive power of these models, Huang et al. (2023) and Huang et al. (2024) proposed corresponding relational Weisfeiler-Leman algorithms that capture these models exactly.

Relational asymmetric local 2-WL test (rawl₂). Following Huang et al. (2023), we define the relational asymmetric local 2-WL test (rawl₂) as follows. Let $G = (V, E, R)$ be a KG and $\eta : V \times V \to D$ an initial pairwise node coloring. We say $\eta$ satisfies target node distinguishability if $\eta(u, u) \neq \eta(u, v)$ for all $u \neq v \in V$. Then, for each $\ell \geq 0$, we update the coloring as:

$$\text{rawl}_2^{(0)}(u, v) = \eta(u, v),$$
$$\text{rawl}_2^{(\ell+1)}(u, v) = \text{HASH}\big(\text{rawl}_2^{(\ell)}(u, v),\ \{\{(\text{rawl}_2^{(\ell)}(u, w),\ r) \mid w \in N_r(v),\ r \in R\}\}\big),$$

where HASH injectively maps the above pair to a fresh unique color. Theorem 5.1 of Huang et al. (2023) shows that rawl₂ is a relational WL test that characterizes the expressive power of C-MPNNs: for every C-MPNN, every KG $G$, and every $\ell \geq 0$, the coloring $\text{rawl}_2^{(\ell)}$ refines the coloring computed by the $\ell$-th layer of the C-MPNN; moreover, for every KG $G$ there exists a C-MPNN with expressive power equivalent to $\text{rawl}_2^{(\ell)}$ for all $\ell \geq 0$. For any KG $G$, we sometimes write $\text{rawl}_2^+(G)$ as shorthand for $\text{rawl}_2(G^+)$, i.e., rawl₂ applied to the augmented knowledge graph $G^+$. Formally, let $G = (V, E, R)$ be a knowledge graph. Following Huang et al. (2023), we define its augmented knowledge graph as $G^+ = (V, E^+, R^+)$, where $R^+$ is the disjoint union of $R$ and $R^- := \{r^- \mid r \in R\}$, and $E^+ = E \cup E^-$ with $E^- = \{r^-(v, u) \mid r(u, v) \in E,\ u \neq v\}$.

Hypergraph relational local 1-WL test (hrwl₁). Similarly, following Huang et al. (2024), we define the hypergraph relational local 1-WL test (hrwl₁) as follows. Let $H = (V, E, R)$ be a relational hypergraph and $c : V \to D$ an initial node coloring. Then, for each $\ell \geq 0$, we update the coloring as:

$$\text{hrwl}_1^{(0)}(v) = c(v),$$
$$\text{hrwl}_1^{(\ell+1)}(v) = \text{HASH}\big(\text{hrwl}_1^{(\ell)}(v),\ \{\{\big(\{(\text{hrwl}_1^{(\ell)}(w), j) \mid (w, j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E(v)\}\}\big),$$

where HASH is an injective mapping that maps the above pair to a unique color that has not been used in previous iterations. When we restrict hrwl₁ to initial colorings $c$ that respect a given query $\mathbf{q} = (q, \tilde u, t) = q(u_1, \ldots, u_{t-1}, ?, u_{t+1}, \ldots, u_k)$, we say that the coloring $c$ satisfies generalized target node distinguishability with respect to $\mathbf{q}$ if $c(u) \neq c(v)$ for all $u \in \tilde u$ and $v \notin \tilde u$, and $c(u_i) \neq c(u_j)$ for all $u_i, u_j \in \tilde u$ with $u_i \neq u_j$. Then, Theorem 5.1 of Huang et al. (2024) shows that hrwl₁ with an initial coloring $c$ satisfying generalized target node distinguishability characterizes the expressive power of HC-MPNNs.
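Both tests above are purely combinatorial and easy to prototype. The following sketch (ours) implements rawl₂, modeling the injective HASH with Python's built-in hash and ignoring the (negligible) possibility of collisions.

```python
def rawl2(kg_facts, nodes, steps):
    """Sketch of the relational asymmetric local 2-WL test: colors pairs
    (u, v), conditioning on the source node u. Facts are (w, r, v) triples,
    so w is in N_r(v)."""
    col = {(u, v): int(u == v) for u in nodes for v in nodes}  # target node dist.
    for _ in range(steps):
        nbrs = {p: [] for p in col}
        for (w, r, v) in kg_facts:
            for u in nodes:
                nbrs[(u, v)].append((col[(u, w)], r))
        # Encode the multiset as a sorted tuple before hashing.
        col = {p: hash((col[p], tuple(sorted(nbrs[p])))) for p in col}
    return col
```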
Hypergraph relational local 2-WL test (hcwl₂). Following Huang et al. (2024), if we further restrict the test and apply it only to knowledge graphs where all edges have arity 2, and additionally consider only tail prediction, we can instead assign a pairwise coloring $\eta : V \times V \to D$ satisfying target node distinguishability, i.e., $\eta(u, u) \neq \eta(u, v)$ for all $u \neq v$. We then define a relational hypergraph conditioned local 2-WL test, denoted hcwl₂, which iteratively updates the binary coloring $\eta$ as follows for all $\ell \geq 0$:

$$\text{hcwl}_2^{(0)}(u, v) = \eta(u, v),$$
$$\text{hcwl}_2^{(\ell+1)}(u, v) = \text{HASH}\big(\text{hcwl}_2^{(\ell)}(u, v),\ \{\{\big(\{(\text{hcwl}_2^{(\ell)}(u, w), j) \mid (w, j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E(v)\}\}\big).$$

B. Differences between TRIX and MOTIF

TRIX (Zhang et al., 2024) is one of the first works to study the expressivity of KGFMs, comparing TRIX to ULTRA (Galkin et al., 2024). However, our proposed framework MOTIF is inherently incomparable to TRIX in terms of theoretical expressivity. TRIX enhances expressivity by introducing a model capable of implicitly counting the number of homomorphism matches in a knowledge graph. This capability allows TRIX to distinguish between relation pairs based on their frequency of occurrence within specific structural patterns. For instance, consider a knowledge graph with edges

$$E = \{r_1(u_1, v_1),\ r_2(v_1, w_1),\ r_3(u_2, v_2),\ r_4(v_2, w_2),\ r_3(u_3, v_3),\ r_4(v_3, w_3)\}.$$

Here, the pair $(r_3, r_4)$ appears twice in 2-hop paths, while $(r_1, r_2)$ appears only once. TRIX can distinguish between these pairs due to its ability to count such occurrences.

In contrast, MOTIF focuses on the existence of specific higher-order motifs within the knowledge graph, rather than their frequency. This design enables MOTIF to capture complex relational structures that may not be distinguishable through frequency counts alone. For example, consider a knowledge graph with edges

$$E = \{r_1(x_1, x_2),\ r_2(x_1, x_2),\ r_1(x_3, x_4),\ r_2(x_3, x_4),\ r_3(y_1, y_2),\ r_4(y_1, y_4),\ r_3(y_3, y_2),\ r_4(y_3, y_4)\}.$$

In this case, the pairs $(r_1, r_2)$ and $(r_3, r_4)$ may be indistinguishable under TRIX due to isomorphic relation graphs under head-to-head and tail-to-tail mappings. However, MOTIF can distinguish between these pairs by identifying specific motifs, such as the PARA motif with edges $\{\alpha(x, y), \beta(x, y)\}$, which maps only to $(r_1, r_2)$.

Therefore, while TRIX excels at distinguishing relation pairs based on frequency counts of specific patterns, MOTIF provides a complementary approach by focusing on the structural existence of complex motifs. This fundamental difference renders the two frameworks incomparable in terms of expressivity, as each can capture relational distinctions that the other cannot. Note, however, that incorporating higher-order motifs into TRIX could similarly boost its expressivity, highlighting our framework as a complementary way of improving KGFMs.
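The frequency-versus-existence distinction can be replayed on the two example graphs above; the sketch below (ours) computes 2-hop relation-pair counts for the first graph and the PARA-motif existence test for the second.

```python
from collections import Counter

E1 = [("u1", "r1", "v1"), ("v1", "r2", "w1"),
      ("u2", "r3", "v2"), ("v2", "r4", "w2"),
      ("u3", "r3", "v3"), ("v3", "r4", "w3")]

def two_hop_pair_counts(facts):
    """Frequencies of relation pairs on 2-hop paths r(u, v), s(v, w):
    the kind of homomorphism count TRIX can implicitly exploit."""
    return Counter((r, s) for (_, r, v) in facts
                          for (v2, s, _) in facts if v == v2)

print(two_hop_pair_counts(E1))  # (r3, r4) occurs twice, (r1, r2) once

E2 = [("x1", "r1", "x2"), ("x1", "r2", "x2"), ("x3", "r1", "x4"), ("x3", "r2", "x4"),
      ("y1", "r3", "y2"), ("y1", "r4", "y4"), ("y3", "r3", "y2"), ("y3", "r4", "y4")]

def para_exists(facts, r, s):
    """Existence of the PARA motif alpha(x, y), beta(x, y) for (r, s):
    the kind of structural test MOTIF's LIFT performs."""
    pairs = {(u, v) for (u, rel, v) in facts if rel == r}
    return any((u, v) in pairs for (u, rel, v) in facts if rel == s)

print(para_exists(E2, "r1", "r2"), para_exists(E2, "r3", "r4"))  # True False
```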
C. A WL test for MOTIF

We start by introducing a two-stage WL test that matches the separating power of MOTIF(F), for a set of motifs $F$. Using this equivalence, we show Proposition 6.2 and Theorem 6.4 in Sections D.1 and D.3, respectively. Characterizing the separating power of MOTIF(F) via a suitable test allows us to reason about simpler coloring schemes, which, in turn, makes the proofs more manageable. The equivalence between MOTIF(F) and our two-stage WL test may also be of independent interest.

We assume a given KG $G = (V, E, R)$. Our test assigns one color to each link $q(u, v)$, for $q \in R$ and $u, v \in V$ (recall that links are not necessarily facts in $G$). In the first stage, we color the nodes of $\text{LIFT}_F(G) = (V_{\text{LIFT}}, E_{\text{LIFT}}, R_{\text{LIFT}})$, that is, the set $V_{\text{LIFT}} = R$. When coloring a relation $r \in R$, we always condition on another relation $q \in R$. We denote by $\text{hcwl}^{(t)}_F(q, r)$ the color assigned to $r$, conditioned on $q$, at iteration $t \geq 0$. We assume a predefined number of iterations $T \geq 0$, which is a parameter of our test. In the second stage, we color the nodes $v$ of the KG $G$, conditioned on a relation $q \in R$ and a source node $u \in V$. This determines the final color $\text{col}^{(\ell)}_{F,T}(q(u, v))$ for the link $q(u, v)$. Crucially, the second stage relies on the first stage in that it uses the colors $\text{hcwl}^{(T)}_F(q, r)$ internally. Roughly speaking, the first-stage coloring follows the coloring strategy of hrwl₁ applied to $\text{LIFT}_F(G)$, while the second-stage coloring follows the strategy of rawl₂, after incorporating the relation encodings obtained in the first stage (see Section A for the definitions of hrwl₁ and rawl₂).

Formally, the first-stage colors $\text{hcwl}^{(t)}_F$ are defined via the following rules:

$$\text{hcwl}^{(0)}_F(q, r) = \mathbb{1}_{r=q},$$
$$\text{hcwl}^{(t+1)}_F(q, r) = \text{HASH}\big(\text{hcwl}^{(t)}_F(q, r),\ \{\{\big(\{(\text{hcwl}^{(t)}_F(q, s), j) \mid (s, j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E_{\text{LIFT}}(r)\}\}\big).$$

Recall that $E_{\text{LIFT}}(r) = \{(e, i) \mid e(i) = r,\ e \in E_{\text{LIFT}},\ 1 \leq i \leq \text{ar}(\rho(e))\}$ is the set of edge-position pairs of a node $r$, and $N_i(e)$ is the positional neighborhood of a hyperedge $e$ with respect to a position $i$: $N_i(e) = \{(e(j), j) \mid j \neq i,\ 1 \leq j \leq \text{ar}(\rho(e))\}$. The indicator $\mathbb{1}_{r=q}$ takes value 1 if $r = q$, and 0 otherwise. The function HASH is a fixed injective function taking values in the natural numbers.

The second-stage colors $\text{col}^{(\ell)}_{F,T}$, where $T \geq 0$ is a parameter, are defined via the following rules:

$$\text{col}^{(0)}_{F,T}(q(u, v)) = \mathbb{1}_{v=u} \cdot \text{hcwl}^{(T)}_F(q, q),$$
$$\text{col}^{(\ell+1)}_{F,T}(q(u, v)) = \text{HASH}\big(\text{col}^{(\ell)}_{F,T}(q(u, v)),\ \{\{\big(\text{col}^{(\ell)}_{F,T}(q(u, w)),\ \text{hcwl}^{(T)}_F(q, r)\big) \mid w \in N_r(v),\ r \in R\}\}\big).$$

We start by showing that our coloring scheme upper bounds the separation power of any model in MOTIF(F). For simplicity, we assume that any model in MOTIF(F) has the form $(\text{Enc}_{\text{KG}}, (F, \text{Enc}_{\text{LIFT}}))$, where:

1. the initialization function $\text{INIT}_1$ used by the relation encoder $\text{Enc}_{\text{LIFT}}$ is defined as $\text{INIT}_1(q, r) = \mathbb{1}_{r=q} \cdot \mathbf{1}^d$ (with $\mathbf{1}^d$ the all-ones vector in $\mathbb{R}^d$), and
2. the initialization function $\text{INIT}_2$ used by the entity encoder is defined as $\text{INIT}_2(u, v, q) = \mathbb{1}_{v=u} \cdot h^{(T)}_{q|q}$.

This is a common choice, and in particular it is the choice we take in our basic architecture for the experiments (see Section E).

Proposition C.1. Let $G = (V, E, R)$ be a KG and $(\text{Enc}_{\text{KG}}, (F, \text{Enc}_{\text{LIFT}}))$ be a MOTIF(F) instance, where $\text{Enc}_{\text{LIFT}}$ and $\text{Enc}_{\text{KG}}$ involve $T \geq 0$ and $L \geq 0$ layers, respectively. Then for every pair of links $q(u, v)$ and $q'(u', v')$, we have that

$$\text{col}^{(L)}_{F,T}(q(u, v)) = \text{col}^{(L)}_{F,T}(q'(u', v')) \implies \text{Enc}_{\text{KG},u,q}[v] = \text{Enc}_{\text{KG},u',q'}[v'].$$

Proof. We first show the following claim: (∗) for every quadruple of relations $q, r, q', r' \in R$, we have that

$$\text{hcwl}^{(T)}_F(q, r) = \text{hcwl}^{(T)}_F(q', r') \implies \text{Enc}_{\text{LIFT},q}[r] = \text{Enc}_{\text{LIFT},q'}[r'].$$

Recall that $\text{Enc}_{\text{LIFT},p}[s]$ is determined by the sequence of vectors $h^{(t)}_{s|p}$ and that $\text{Enc}_{\text{LIFT},p}[s] = h^{(T)}_{s|p}$. Take a quadruple $q, r, q', r' \in R$. We prove by induction that for every $0 \leq t \leq T$, we have that

$$\text{hcwl}^{(t)}_F(q, r) = \text{hcwl}^{(t)}_F(q', r') \implies h^{(t)}_{r|q} = h^{(t)}_{r'|q'}.$$

For the base case, take $t = 0$ and assume that $\text{hcwl}^{(0)}_F(q, r) = \text{hcwl}^{(0)}_F(q', r')$. By the definition of $\text{hcwl}^{(0)}_F$, either $q = r$ and $q' = r'$, or $q \neq r$ and $q' \neq r'$. In either case, we obtain $h^{(0)}_{r|q} = \mathbb{1}_{r=q} \cdot \mathbf{1}^d = \mathbb{1}_{r'=q'} \cdot \mathbf{1}^d = h^{(0)}_{r'|q'}$.

For the inductive case, suppose that $\text{hcwl}^{(t)}_F(q, r) = \text{hcwl}^{(t)}_F(q', r') \implies h^{(t)}_{r|q} = h^{(t)}_{r'|q'}$, and assume that $\text{hcwl}^{(t+1)}_F(q, r) = \text{hcwl}^{(t+1)}_F(q', r')$.
By the definition of hcwl_F, it follows that hcwl(t)_F(q, r) = hcwl(t)_F(q′, r′) and

$$\{\!\!\{\big(\{(\mathsf{hcwl}^{(t)}_{\mathcal{F}}(q, s), j) \mid (s, j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E_{\mathrm{LIFT}}(r)\}\!\!\} = \{\!\!\{\big(\{(\mathsf{hcwl}^{(t)}_{\mathcal{F}}(q', s'), j) \mid (s', j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E_{\mathrm{LIFT}}(r')\}\!\!\}.$$

We can apply the inductive hypothesis to the first equation and obtain that h(t)_{r|q} = h(t)_{r′|q′}. The second equation together with the inductive hypothesis implies that

$$\{\!\!\{\big(\{(h^{(t)}_{s|q}, j) \mid (s, j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E_{\mathrm{LIFT}}(r)\}\!\!\} = \{\!\!\{\big(\{(h^{(t)}_{s'|q'}, j) \mid (s', j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E_{\mathrm{LIFT}}(r')\}\!\!\}.$$

As a consequence, we have that

$$M = \{\!\!\{\mathrm{MSG}_{\rho(e)}\big(\{(h^{(t)}_{s|q}, j) \mid (s, j) \in N_i(e)\}\big) \mid (e, i) \in E_{\mathrm{LIFT}}(r)\}\!\!\} = \{\!\!\{\mathrm{MSG}_{\rho(e)}\big(\{(h^{(t)}_{s'|q'}, j) \mid (s', j) \in N_i(e)\}\big) \mid (e, i) \in E_{\mathrm{LIFT}}(r')\}\!\!\} = M',$$

where MSG_{ρ(e)} is the motif-specific message function used in the (t + 1)-th layer. It follows that

$$h^{(t+1)}_{r|q} = \mathrm{UP}_1\big(h^{(t)}_{r|q},\ \mathrm{AGG}_1(M)\big) = \mathrm{UP}_1\big(h^{(t)}_{r'|q'},\ \mathrm{AGG}_1(M')\big) = h^{(t+1)}_{r'|q'},$$

as required. (UP1 and AGG1 are the update and aggregate functions in the (t + 1)-th layer.)

Take a pair of links q(u, v) and q′(u′, v′). Now we are ready to show that col(L)_{F,T}(q(u, v)) = col(L)_{F,T}(q′(u′, v′)) implies Enc_{KG,u,q}[v] = Enc_{KG,u′,q′}[v′]. Recall that Enc_{KG,x,p}[y] is determined by the sequence of vectors h(ℓ)_{y|x,p} and that Enc_{KG,x,p}[y] = h(L)_{y|x,p}. We show by induction that for every 0 ≤ ℓ ≤ L, it holds that col(ℓ)_{F,T}(q(u, v)) = col(ℓ)_{F,T}(q′(u′, v′)) implies h(ℓ)_{v|u,q} = h(ℓ)_{v′|u′,q′}.

In the base case ℓ = 0, suppose col(0)_{F,T}(q(u, v)) = col(0)_{F,T}(q′(u′, v′)), i.e., 1_{v=u} · hcwl(T)_F(q, q) = 1_{v′=u′} · hcwl(T)_F(q′, q′). Note that without loss of generality, we can assume that HASH never takes the value 0, and hence hcwl(T)_F(q, q) ≠ 0 and hcwl(T)_F(q′, q′) ≠ 0. We then have two cases:

1. v ≠ u and v′ ≠ u′. We obtain h(0)_{v|u,q} = h(0)_{v′|u′,q′} as required.
2. v = u, v′ = u′ and hcwl(T)_F(q, q) = hcwl(T)_F(q′, q′). By the claim (∗), we have that h(T)_{q|q} = h(T)_{q′|q′}. We conclude that h(0)_{v|u,q} = h(0)_{v′|u′,q′} as desired.

For the inductive step, suppose that col(ℓ)_{F,T}(q(u, v)) = col(ℓ)_{F,T}(q′(u′, v′)) implies h(ℓ)_{v|u,q} = h(ℓ)_{v′|u′,q′}, and assume that col(ℓ+1)_{F,T}(q(u, v)) = col(ℓ+1)_{F,T}(q′(u′, v′)). It follows that col(ℓ)_{F,T}(q(u, v)) = col(ℓ)_{F,T}(q′(u′, v′)) and

$$\{\!\!\{\big(\mathsf{col}^{(\ell)}_{\mathcal{F},T}(q(u, w)),\ \mathsf{hcwl}^{(T)}_{\mathcal{F}}(q, r)\big) \mid w \in N_r(v),\ r \in R\}\!\!\} = \{\!\!\{\big(\mathsf{col}^{(\ell)}_{\mathcal{F},T}(q'(u', w)),\ \mathsf{hcwl}^{(T)}_{\mathcal{F}}(q', r)\big) \mid w \in N_r(v'),\ r \in R\}\!\!\}.$$

Using the inductive hypothesis and claim (∗), we obtain that h(ℓ)_{v|u,q} = h(ℓ)_{v′|u′,q′} and

$$\{\!\!\{\big(h^{(\ell)}_{w|u,q},\ h^{(T)}_{r|q}\big) \mid w \in N_r(v),\ r \in R\}\!\!\} = \{\!\!\{\big(h^{(\ell)}_{w|u',q'},\ h^{(T)}_{r|q'}\big) \mid w \in N_r(v'),\ r \in R\}\!\!\}.$$

It follows that

$$N = \{\!\!\{\mathrm{MSG}\big(h^{(\ell)}_{w|u,q},\ \mathrm{Enc}_{\mathrm{LIFT},q}[r]\big) \mid w \in N_r(v),\ r \in R\}\!\!\} = \{\!\!\{\mathrm{MSG}\big(h^{(\ell)}_{w|u',q'},\ \mathrm{Enc}_{\mathrm{LIFT},q'}[r]\big) \mid w \in N_r(v'),\ r \in R\}\!\!\} = N'.$$

We conclude that

$$h^{(\ell+1)}_{v|u,q} = \mathrm{UP}_2\big(h^{(\ell)}_{v|u,q},\ \mathrm{AGG}_2(N)\big) = \mathrm{UP}_2\big(h^{(\ell)}_{v'|u',q'},\ \mathrm{AGG}_2(N')\big) = h^{(\ell+1)}_{v'|u',q'},$$

as required. (UP2 and AGG2 are the update and aggregate functions in the (ℓ + 1)-th layer.)

We now show that MOTIF(F) can be as powerful as our test with regard to separating links.

Proposition C.2. Let G = (V, E, R) be a KG, and let T, L ≥ 0 be natural numbers. Then there exists an instance (Enc_KG, (F, Enc_LIFT)) of MOTIF(F) such that for every pair of links q(u, v) and q′(u′, v′), we have that

$$\mathsf{col}^{(L)}_{\mathcal{F},T}(q(u, v)) = \mathsf{col}^{(L)}_{\mathcal{F},T}(q'(u', v')) \iff \mathrm{Enc}_{\mathrm{KG},u,q}[v] = \mathrm{Enc}_{\mathrm{KG},u',q'}[v'].$$

Proof. It suffices to choose (Enc_KG, (F, Enc_LIFT)) such that the functions UP(t)_1, AGG(t)_1 and MSG(t)_{ρ(e)} defining Enc_LIFT, and the functions UP(ℓ)_2, AGG(ℓ)_2 and MSG(ℓ) defining Enc_KG, are injective. Indeed, suppose this condition holds.
We prove first by induction that for every q, r, q′, r′ ∈ R and 0 ≤ t ≤ T,

$$h^{(t)}_{r|q} = h^{(t)}_{r'|q'} \implies \mathsf{hcwl}^{(t)}_{\mathcal{F}}(q, r) = \mathsf{hcwl}^{(t)}_{\mathcal{F}}(q', r').$$

For t = 0, the result follows by our assumption on INIT1. Suppose that h(t+1)_{r|q} = h(t+1)_{r′|q′}. As UP(t)_1 is injective, we have that h(t)_{r|q} = h(t)_{r′|q′} and AGG(t)_1(M) = AGG(t)_1(M′). From the former and by induction, we obtain hcwl(t)_F(q, r) = hcwl(t)_F(q′, r′). From the latter, as AGG(t)_1 is injective, we obtain M = M′. Using the fact that MSG(t)_{ρ(e)} is injective, we conclude that

$$\{\!\!\{\{(h^{(t)}_{s|q}, j) \mid (s, j) \in N_i(e)\} \mid (e, i) \in E_{\mathrm{LIFT}}(r)\}\!\!\} = \{\!\!\{\{(h^{(t)}_{s'|q'}, j) \mid (s', j) \in N_i(e)\} \mid (e, i) \in E_{\mathrm{LIFT}}(r')\}\!\!\}.$$

By induction, we have

$$\{\!\!\{\{(\mathsf{hcwl}^{(t)}_{\mathcal{F}}(q, s), j) \mid (s, j) \in N_i(e)\} \mid (e, i) \in E_{\mathrm{LIFT}}(r)\}\!\!\} = \{\!\!\{\{(\mathsf{hcwl}^{(t)}_{\mathcal{F}}(q', s'), j) \mid (s', j) \in N_i(e)\} \mid (e, i) \in E_{\mathrm{LIFT}}(r')\}\!\!\}.$$

This implies that hcwl(t+1)_F(q, r) = hcwl(t+1)_F(q′, r′) as required.

Now we prove by induction that for every pair of links q(u, v) and q′(u′, v′), and 0 ≤ ℓ ≤ L, it holds that

$$h^{(\ell)}_{v|u,q} = h^{(\ell)}_{v'|u',q'} \implies \mathsf{col}^{(\ell)}_{\mathcal{F},T}(q(u, v)) = \mathsf{col}^{(\ell)}_{\mathcal{F},T}(q'(u', v')).$$

Note that this implies that col(L)_{F,T}(q(u, v)) = col(L)_{F,T}(q′(u′, v′)) if and only if Enc_{KG,u,q}[v] = Enc_{KG,u′,q′}[v′]. The implication holds for ℓ = 0 by the assumption on INIT2 and the fact that h(T)_{q|q} = h(T)_{q′|q′} implies hcwl(T)_F(q, q) = hcwl(T)_F(q′, q′). Assume that h(ℓ+1)_{v|u,q} = h(ℓ+1)_{v′|u′,q′}. As UP(ℓ)_2 is injective, we have that h(ℓ)_{v|u,q} = h(ℓ)_{v′|u′,q′} and AGG(ℓ)_2(N) = AGG(ℓ)_2(N′). From the former and by induction, we obtain col(ℓ)_{F,T}(q(u, v)) = col(ℓ)_{F,T}(q′(u′, v′)). From the latter, since AGG(ℓ)_2 is injective, we have N = N′. As MSG(ℓ) is injective, we obtain that

$$\{\!\!\{\big(h^{(\ell)}_{w|u,q},\ h^{(T)}_{r|q}\big) \mid w \in N_r(v),\ r \in R\}\!\!\} = \{\!\!\{\big(h^{(\ell)}_{w|u',q'},\ h^{(T)}_{r|q'}\big) \mid w \in N_r(v'),\ r \in R\}\!\!\}.$$

By induction, we have

$$\{\!\!\{\big(\mathsf{col}^{(\ell)}_{\mathcal{F},T}(q(u, w)),\ \mathsf{hcwl}^{(T)}_{\mathcal{F}}(q, r)\big) \mid w \in N_r(v),\ r \in R\}\!\!\} = \{\!\!\{\big(\mathsf{col}^{(\ell)}_{\mathcal{F},T}(q'(u', w)),\ \mathsf{hcwl}^{(T)}_{\mathcal{F}}(q', r)\big) \mid w \in N_r(v'),\ r \in R\}\!\!\}.$$

This implies that col(ℓ+1)_{F,T}(q(u, v)) = col(ℓ+1)_{F,T}(q′(u′, v′)) as desired.

It is easy to find injective functions UP(t)_1, AGG(t)_1, MSG(t)_{ρ(e)}, UP(ℓ)_2, AGG(ℓ)_2, MSG(ℓ). For instance, UP(t)_1 and UP(ℓ)_2 can be vector concatenation, and AGG(t)_1, MSG(t)_{ρ(e)}, AGG(ℓ)_2, MSG(ℓ) can be chosen using Lemma 5 from Xu et al. (2019) (see also Lemma VIII.5 from Grohe (2021)). Another alternative is to choose UP(t)_1, AGG(t)_1, MSG(t)_{ρ(e)} according to Theorem 4.1 in Huang et al. (2024), and UP(ℓ)_2, AGG(ℓ)_2, MSG(ℓ) according to Theorem 5.1 in Huang et al. (2023).

As a direct corollary of Propositions C.1 and C.2, we obtain:

Corollary C.3. Let F and F′ be two sets of motifs. Then F ⪯ F′ if and only if for every KG G = (V, E, R) and pair of links q(u, v), q′(u′, v′), whenever col(L)_{F′,T}(q(u, v)) = col(L)_{F′,T}(q′(u′, v′)) for every T, L ≥ 0, then col(L)_{F,T}(q(u, v)) = col(L)_{F,T}(q′(u′, v′)) for every T, L ≥ 0.

D. Missing proofs in the paper

D.1. Proof of Proposition 6.2

Let P = (G_M, r̄) and P′ = (G′_M, r̄′) be two graph motifs such that G_M is isomorphic to G′_M. Then, for any set F of graph motifs:

$$(\mathcal{F} \cup \{P\}) \equiv (\mathcal{F} \cup \{P'\}) \equiv (\mathcal{F} \cup \{P, P'\}).$$

Proof. We only show that (F ∪ {P}) ≡ (F ∪ {P, P′}); the part that (F ∪ {P}) ≡ (F ∪ {P′}) is analogous. For readability, let us write F1 = F ∪ {P} and F2 = F ∪ {P, P′}. Further, E1 and E2 correspond to the edges of LIFT_{F1}(G) and LIFT_{F2}(G), respectively. We make use of Corollary C.3, and show that for every KG G = (V, E, R), every pair of links q(u, v), q′(u′, v′) and every T, L ≥ 0,

$$\mathsf{col}^{(L)}_{\mathcal{F}_1,T}(q(u, v)) = \mathsf{col}^{(L)}_{\mathcal{F}_1,T}(q'(u', v')) \iff \mathsf{col}^{(L)}_{\mathcal{F}_2,T}(q(u, v)) = \mathsf{col}^{(L)}_{\mathcal{F}_2,T}(q'(u', v')).$$
By definition of the second-stage colors, all we need to show is that for every s, s′ ∈ R,

$$\mathsf{hcwl}^{(T)}_{\mathcal{F}_1}(q, s) = \mathsf{hcwl}^{(T)}_{\mathcal{F}_1}(q', s') \iff \mathsf{hcwl}^{(T)}_{\mathcal{F}_2}(q, s) = \mathsf{hcwl}^{(T)}_{\mathcal{F}_2}(q', s').$$

Once again, we only need to show one direction; the other one is analogous. We prove the forward direction by induction on 0 ≤ t ≤ T. The base case is immediate because the initialization does not depend on F1 or F2. Assume then that the condition holds below step t, and let (q, s) and (q′, s′) be pairs satisfying the left-hand side. According to the coloring step, hcwl(t)_{F1}(q, s) corresponds to the hash of a pair given by the previous color hcwl(t−1)_{F1}(q, s) and the multiset of the colors of all neighbours of s in any tuple of any relation in F1. For F2, the color is computed in the same way, except that we additionally include

$$\{\!\!\{\big(\{(\mathsf{hcwl}^{(t-1)}_{\mathcal{F}_2}(q, x), j) \mid (x, j) \in N_i(e)\},\ \rho(e)\big) \mid (e, i) \in E_2(s),\ \rho(e) = R_{P'}\}\!\!\},$$

where E2(s) is the set of pairs (e, i) such that e is an edge in E2 and e(i) = s. Hence, since this is the only term that differs from the coloring according to F1, to prove the right-hand side it suffices to show that the above multiset is the same as

$$\{\!\!\{\big(\{(\mathsf{hcwl}^{(t-1)}_{\mathcal{F}_2}(q', x'), j') \mid (x', j') \in N_{i'}(e')\},\ \rho(e')\big) \mid (e', i') \in E_2(s'),\ \rho(e') = R_{P'}\}\!\!\}.$$

Now, because of the isomorphism between G_M and G′_M, there is a bijection f that maps each index position i of r̄ to a position f(i) of r̄′, in such a way that a pair (e, i) with ρ(e) = R_P is in E1(s) if and only if the corresponding pair (e′, f(i)) with ρ(e′) = R_{P′} is in E2(s). Further, for every fact R_P(r_1, …, r_n) in LIFT_{F1}(G), there is a fact R_{P′}(r′_1, …, r′_n) such that r_i = r′_{f(i)}. Hence, for every pair (x, j) ∈ N_k(e′) with (e′, k) ∈ E2(s), there is a pair (x, f⁻¹(j)) ∈ N_{f⁻¹(k)}(e) with (e, f⁻¹(k)) ∈ E1(s). By induction, since hcwl(t)_{F1}(q, s) = hcwl(t)_{F1}(q′, s′), the color of (q, x) under hcwl(t−1)_{F1} must then correspond to the color of some node (q′, x′) in N_{f⁻¹(k)}(ē) with (ē, f⁻¹(k)) ∈ E1(s′). Once again, by the isomorphism, we find the pair (x′, j) ∈ N_k(ē′) with (ē′, k) ∈ E2(s′), where by induction the color of (q′, x′) under hcwl(t−1)_{F2} corresponds to that of (q, x). This establishes a bijection between the elements of the two multisets, from which we directly obtain that hcwl(t)_{F2}(q, s) = hcwl(t)_{F2}(q′, s′).

D.2. Proof of Proposition 6.3

For every KG G, up to isomorphism, there is a unique KG H such that: H is a relation-preserving core, and there are relation-preserving homomorphisms from G to H and from H to G.

Proof. Let us assume there are two such cores H1 and H2. Since there are relation-preserving homomorphisms from G to H1, from G to H2, from H1 to G and from H2 to G, composing these homomorphisms establishes that there are relation-preserving homomorphisms f from H1 to H2 and g from H2 to H1; note that the composition of two relation-preserving homomorphisms is also a relation-preserving homomorphism. But now g ∘ f is a relation-preserving homomorphism from H1 to H1 and f ∘ g is a relation-preserving homomorphism from H2 to H2. Since H1 and H2 are cores, these endomorphisms must be automorphisms; in particular, f ∘ g and g ∘ f are onto. Hence, f and g are both onto, which means that H1 and H2 are isomorphic.

D.3. Proof of Theorem 6.4

Let F, F′ be two sets of motifs, and assume that F ⪯ F′. Then, for every non-trivial motif P ∈ F there is a motif P′ ∈ F′ such that P′ →co P.

Proof. We prove the contrapositive of this statement: assume that there is a non-trivial motif P ∈ F that does not satisfy the condition of the theorem, that is, that there is no motif P′ ∈ F′ such that P′ →co P.
We show how to construct a KG G and a pair of links q(u, v) and q′(u′, v′) such that col(L)_{F′,T}(q(u, v)) = col(L)_{F′,T}(q′(u′, v′)) for every T, L ≥ 0, but col(L)_{F,T}(q(u, v)) ≠ col(L)_{F,T}(q′(u′, v′)) for some T, L ≥ 0. This, by Corollary C.3, suffices for the proof. Let us write P = (G_P, r̄). We split the proof into two cases, depending on whether there exists a node-relation homomorphism from some motif of F′ to P.

Case 1. Assume that there is no node-relation homomorphism from any motif P′ in F′ to P. Notice that this implies that no motif in F′ can be mapped to the trivial motif with just the fact Y(x, z); otherwise, we would have motifs in F′ that can be mapped to any edge of P, resulting in a node-relation homomorphism to the motif P.

Let y be an arbitrary relation in P and s a fresh relation. Construct G by taking G_P plus two triples y(a, b) and s(c, d), with a, b, c, d fresh nodes. We claim that col_{F′}(G, y(a, b)) = col_{F′}(G, s(c, d)) but col_F(G, y(a, b)) ≠ col_F(G, s(c, d)), which is what we were looking for.

By our assumptions it follows that LIFT_{F′}(G) is a graph without edges, as there are no homomorphisms from any motif in F′ to G. Therefore, we have that hcwl_{F′}(y, y) = hcwl_{F′}(s, s) over LIFT_{F′}(G): in both cases the hcwl coloring involves a single node with the same color, not connected to any other node, so we stop after just one iteration. In turn, this entails that col_{F′}(y(a, b)) = col_{F′}(s(c, d)), as once again the nodes a, b and c, d are not connected to any other part of G.

On the other hand, LIFT_F(G) contains at least one tuple with distinct elements in the relation R_P corresponding to the motif P, consisting of r̄: this comes from the identity homomorphism from P to G_P. But since P is non-trivial, there cannot be a relation-preserving homomorphism h from P to the graph formed just by the fact s(u, v), as this would entail that the relation-preserving core of P is indeed s(u, v), and hence that P is trivial, which we assumed otherwise. Then, there cannot be a tuple of distinct relations that includes s in the relation R_P in LIFT_F(G). This entails that the colors hcwl_F(y, y) and hcwl_F(s, s) over LIFT_F(G) are already different after the first step. Hence, col_F(y(a, b)) ≠ col_F(s(c, d)) as promised.

Case 2. Next assume that there are node-relation homomorphisms from motifs of F′ to P, but, by our assumption, no core-onto homomorphism from any motif P′ in F′ to P. Let then H contain all different motifs (up to isomorphism) h(P′) = (h(G_{P′}), h(r̄′))³ such that P′ ∈ F′ and h is a homomorphism from P′ to the relation-preserving core of P. By our assumptions, none of these homomorphisms is core-onto.

Define two functions f1 and f2 that map each relation name from a motif in H to a fresh relation, and that are the identity on nodes. Further, construct a graph G1 as follows: for every motif h(P′) in H, replace every node in f1(h(P′)) by a fresh node, and add the result to G1. Construct the graph G2 in the same fashion, except we use f2(h(P′)). Finally, extend f1 to be the identity on all other relations in r̄ (recall that P = (G_P, r̄)) that do not occur in any motif in H, so that f1(P) is a well-defined motif and its corresponding graph G_{f1(P)} is a well-defined graph. Our graph G corresponds to the union of G1, G2 and G_{f1(P)}.

Clearly, G1 and G2 are isomorphic; let g be one such isomorphism. By construction, we also have that LIFT_F(G1) is isomorphic to LIFT_F(G2) and LIFT_{F′}(G1) is isomorphic to LIFT_{F′}(G2).
For an arbitrary motif in H and a fact y(x, z) of this motif, consider the facts f1(y)(a, b) and f2(y)(g(a), g(b)) in G1 and G2, respectively, that result from processing the triple y(x, z) in this motif according to our construction (here all of a, b, g(a), g(b) are fresh nodes). As in Case 1, we show that these triples are separated by col_F but not by col_{F′}.

³Slightly abusing notation, we use h(G_{P′}) to denote the motif in which every fact b(a, c) is replaced by h(b)(h(a), h(c)).

First, note that LIFT_{F′}(G) = LIFT_{F′}(G1) ∪ LIFT_{F′}(G2), as (1) G1 and G2 are not connected, and (2) any homomorphism h from a motif P′ in F′ to G1 ∪ G_{f1(P)} has a corresponding homomorphism h′ from P′ to G1 such that h(P′) and h′(P′) are isomorphic, and therefore any tuple in the relation R_{P′} in LIFT_{F′}(G1 ∪ G_{f1(P)}) is also in the relation R_{P′} of LIFT_{F′}(G1). The fact that LIFT_{F′}(G) = LIFT_{F′}(G1) ∪ LIFT_{F′}(G2), that LIFT_{F′}(G1) and LIFT_{F′}(G2) are isomorphic and that they are not connected establishes that hcwl_{f1(Y)}(f1(X)) = hcwl_{f2(Y)}(f2(X)) for any relation name X in G. Then, again since both connected components of LIFT_{F′}(G) are isomorphic, and our test is invariant under isomorphisms (assuming the same coloring of relations), it must be the case that col_{F′}(f1(Y)(a, b)) = col_{F′}(f2(Y)(g(a), g(b))).

Let us now look at LIFT_F(G). Notice first that any tuple t in R_P in LIFT_F(G2) has a corresponding tuple f1 ∘ f2⁻¹(t) in LIFT_F(G1). In other words, f1 ∘ f2⁻¹ is an injection from LIFT_F(G2) into LIFT_F(G1 ∪ G_{f1(P)}) with inverse f2 ∘ f1⁻¹. This is because these tuples arise from homomorphisms that start from P and end in some motif in H, and both G1 and G2 contain their own copy of each such motif. To finish, we claim that there is one tuple t in the relation R_P in the graph LIFT_F(G1 ∪ G_{f1(P)}) such that f2 ∘ f1⁻¹(t) is not in the relation R_P in the graph LIFT_F(G2). This means that for any element r of t, the number of edges in R_P in which r participates is strictly higher than the number of edges in which f2 ∘ f1⁻¹(r) participates, which implies that hcwl(r, s) ≠ hcwl(f2 ∘ f1⁻¹(r), f2 ∘ f1⁻¹(s)) for any relation name s that belongs to t, because the colors already differ after the first step. As in Case 1, this can be used to show that col_F(f1(Y)(a, b)) ≠ col_F(f2(Y)(g(a), g(b))).

The tuple t corresponds to f1(r̄). This tuple exists in the relation R_P in LIFT_F(G1 ∪ G_{f1(P)}) because f1 is a homomorphism from P to G_{f1(P)}. We show that LIFT_F(G) does not contain any tuple consisting of all-distinct elements from f2(r̄) in the relation R_P, which suffices for our claim. Note that this requires all relation variables from P to be in the domain of f2 (i.e., to be part of a motif in H); otherwise the claim is obvious, as some element of r̄ would lie outside the range of f2. Suppose for the sake of contradiction that such a tuple f2(r̄) is in R_P. By construction, any tuple of elements in f2(r̄) must come from LIFT_F(G2). Hence, the tuple f2(r̄) is added to LIFT_F(G2) due to a homomorphism ĥ from P to a motif h(P′) in H (for a motif P′ in F′ and a homomorphism h from P′ to the core of P), and since the tuple contains all elements of f2(r̄), this homomorphism ĥ is relation-preserving. By composition of homomorphisms, this also entails a relation-preserving homomorphism from the (relation-preserving) core of P to h(P′).
Further, since h is a homomorphism from P′ to the relation-preserving core of P, the identity is a homomorphism from h(P′) to said core of P, and since h(P′) contains all relations of the core of P, this homomorphism is also relation-preserving. This is a contradiction, because we have found that h(P′) is the relation-preserving core of P, and therefore h is a core-onto homomorphism from P′ to P.

D.4. Proof of Theorem 6.5

ULTRA has the same expressive power as MOTIF(F^path_2).

Proof. By Proposition 6.2, MOTIF({h2t, h2h, t2t}) and MOTIF({t2h, h2h, t2t}) have the same expressive power as MOTIF({h2t, t2h, h2h, t2t}). We also note that {t2h, h2h, t2t} and {h2t, h2h, t2t} are the same as F^path_2. Thus, it is enough to show that ULTRA, in its general form, has the same separating power as MOTIF(F), where F = {h2t, t2h, h2h, t2t}. Recall that the separating power of MOTIF(F) can be characterized by the coloring scheme col_F defined in Section C. Following the same strategy as in Propositions C.1 and C.2, it is straightforward to show that the separating power of ULTRA matches the following test. Let G = (V, E, R) be a KG. The first-stage colors rcol(t)_ultra are defined via the following rules:

$$\mathsf{rcol}^{(0)}_{\mathrm{ultra}}(q, r) = \mathbb{1}_{r=q},$$
$$\mathsf{rcol}^{(t+1)}_{\mathrm{ultra}}(q, r) = \mathrm{HASH}\Big(\mathsf{rcol}^{(t)}_{\mathrm{ultra}}(q, r),\ \{\!\!\{\big(\mathsf{rcol}^{(t)}_{\mathrm{ultra}}(q, r'),\ r_{\mathrm{LIFT}}\big) \mid r' \in N_{r_{\mathrm{LIFT}}}(r),\ r_{\mathrm{LIFT}} \in R_{\mathrm{LIFT}}\}\!\!\}\Big).$$

Again, 1_{r=q} takes value 1 if r = q, and 0 otherwise, and HASH is a fixed injective function taking values in the natural numbers. The second-stage colors ecol(ℓ)_{ultra,T}, where T ≥ 0 is a parameter, are defined via the following rules:

$$\mathsf{ecol}^{(0)}_{\mathrm{ultra},T}(q(u, v)) = \mathbb{1}_{v=u} \cdot \mathsf{rcol}^{(T)}_{\mathrm{ultra}}(q, q),$$
$$\mathsf{ecol}^{(\ell+1)}_{\mathrm{ultra},T}(q(u, v)) = \mathrm{HASH}\Big(\mathsf{ecol}^{(\ell)}_{\mathrm{ultra},T}(q(u, v)),\ \{\!\!\{\big(\mathsf{ecol}^{(\ell)}_{\mathrm{ultra},T}(q(u, w)),\ \mathsf{rcol}^{(T)}_{\mathrm{ultra}}(q, r)\big) \mid w \in N_r(v),\ r \in R\}\!\!\}\Big).$$

We aim to show that for all links q(u, v), q′(u′, v′) and T, L ≥ 0, we have

$$\mathsf{ecol}^{(L)}_{\mathrm{ultra},T}(q(u, v)) = \mathsf{ecol}^{(L)}_{\mathrm{ultra},T}(q'(u', v')) \iff \mathsf{col}^{(L)}_{\mathcal{F},T}(q(u, v)) = \mathsf{col}^{(L)}_{\mathcal{F},T}(q'(u', v')).$$

Since the definitions of the second-stage colorings ecol_{ultra,T} and col_{F,T} are identical, it suffices to show that for every q, r, q′, r′ ∈ R, and every 0 ≤ t ≤ T, it holds that

$$\mathsf{rcol}^{(t)}_{\mathrm{ultra}}(q, r) = \mathsf{rcol}^{(t)}_{\mathrm{ultra}}(q', r') \iff \mathsf{hcwl}^{(t)}_{\mathcal{F}}(q, r) = \mathsf{hcwl}^{(t)}_{\mathcal{F}}(q', r').$$

This proposition follows from the following observations. The update rule of rcol(t)_ultra coincides with that of rawl2(t) (defined in Section A) applied to LIFT_F(G). On the other hand, the update rule of hcwl(t)_F coincides with that of hcwl2(t) (defined in Section A) applied to LIFT_F(G). It follows from Theorem H.4 in Huang et al. (2024) that rawl2+(t) is equivalent to hcwl2(t). Here rawl2+(t) corresponds to the test rawl2(t) applied to the augmented graph G+, which extends each fact r(u, v) in G with its inverse fact r⁻(v, u) (see the definition in Section A). The key observation is that over LIFT_F(G), the tests rawl2(t) and rawl2+(t) are also equivalent. This is because of the structure of F = {h2t, t2h, h2h, t2t}, as observed in Appendix F: the relation h2h is symmetric, and hence h2h⁻(q, r) ∈ LIFT_F(G)+ if and only if h2h(q, r) ∈ LIFT_F(G)+; similarly for t2t. On the other hand, h2t(q, r) ∈ LIFT_F(G) if and only if t2h(r, q) ∈ LIFT_F(G); hence, h2t⁻(q, r) ∈ LIFT_F(G)+ if and only if t2h(q, r) ∈ LIFT_F(G)+, and similarly for t2h. Therefore, adding inverses over LIFT_F(G) is superfluous, and rawl2(t) and rawl2+(t) are equivalent. We conclude that rawl2(t) and hcwl2(t) are equivalent, and hence rcol(t)_ultra and hcwl(t)_F are equivalent, as required.
E. MOTIF architecture

In this section, we introduce the architecture of MOTIF used in the experimental sections. We adopt HCNets (Huang et al., 2024) as the relation encoder for the relational hypergraphs and, following ULTRA (Galkin et al., 2024), we adopt a slight modification of NBFNet (Zhu et al., 2021) as the entity encoder.

Model architectures. Specifically, for the relation encoder, we have, for 0 ≤ t < T:

$$h^{(0)}_{r|q} = \mathbb{1}_{r=q}\,\mathbf{1}_d,$$
$$h^{(t+1)}_{r|q} = \sigma\Big(W^{(t)}\Big[h^{(t)}_{r|q} \,\Big\|\, \sum_{(e,i) \in E_{\mathrm{LIFT}}(r)} z^{(t)}_{\rho(e)} \odot \bigodot_{j \neq i}\big(\alpha^{(t)} h^{(t)}_{e(j)|q} + (1 - \alpha^{(t)})\, p_j\big)\Big] + b^{(t)}\Big).$$

For the entity encoder, we follow ULTRA (Galkin et al., 2024) and adopt the same variation of NBFNet (Zhu et al., 2021), which computes the pairwise representations as follows for 0 ≤ ℓ < L:

$$h^{(0)}_{v|u,q} = \mathbb{1}_{v=u} \cdot h^{(T)}_{q|q},$$
$$h^{(\ell+1)}_{v|u,q} = \sigma\Big(W^{(\ell)}\Big[h^{(0)}_{v|u,q} \,\Big\|\, \sum_{r \in R}\sum_{w \in N_r(v)} h^{(\ell)}_{w|u,q} \odot \mathrm{MLP}^{(\ell)}\big(h^{(T)}_{r|q}\big)\Big] + b^{(\ell)}\Big),$$

where σ is the ReLU activation, W and b are a learnable weight matrix and bias term per layer, 1_d is the all-ones vector of dimension d, and 1_C is the indicator function that returns 1 if the condition C is true, and 0 otherwise. We denote by · scalar multiplication, by ⊙ element-wise multiplication of vectors, and by ‖ concatenation. In the relation encoder, α(t) is a learnable scalar and p_i is a learnable positional encoding for the i-th position. Finally, we decode the probability of a query q(u, v) by passing the representation h(L)_{v|u,q} through a multi-layer perceptron (MLP).

F. ULTRA

In this section, we follow Galkin et al. (2024) and define ULTRA, a KGFM that computes invariants on nodes and relations of knowledge graphs.

Model architectures. Given a knowledge graph G = (V, E, R, c) and a query q(u, ?), ULTRA first constructs a knowledge graph LIFT_F(G) = (V_LIFT, E_LIFT, R_LIFT) with each node representing a relation r, and with the following 4 fundamental relations in R_LIFT:

- tail-to-head (t2h) edges: t2h(r1, r2) ∈ E_LIFT if and only if ∃u, v, w ∈ V with r1(u, v) ∈ E and r2(v, w) ∈ E;
- head-to-head (h2h) edges: h2h(r1, r2) ∈ E_LIFT if and only if ∃u, v, w ∈ V with r1(v, u) ∈ E and r2(v, w) ∈ E;
- head-to-tail (h2t) edges: h2t(r1, r2) ∈ E_LIFT if and only if ∃u, v, w ∈ V with r1(v, u) ∈ E and r2(w, v) ∈ E;
- tail-to-tail (t2t) edges: t2t(r1, r2) ∈ E_LIFT if and only if ∃u, v, w ∈ V with r1(u, v) ∈ E and r2(w, v) ∈ E.

Then, ULTRA applies a relation encoder to generate representations h_{r|q} via a variant of NBFNet (Zhu et al., 2021) that is initialized with the all-ones vector 1_d on the queried relation q ∈ V_LIFT. For 0 ≤ t < T, the relation encoder iteratively computes h(t)_{r|q} as follows:

$$h^{(0)}_{r|q} = \mathbb{1}_{r=q}\,\mathbf{1}_d,$$
$$h^{(t+1)}_{r|q} = \sigma\Big(W^{(t)}\Big[h^{(0)}_{r|q} \,\Big\|\, \sum_{r_{\mathrm{LIFT}} \in R_{\mathrm{LIFT}}}\sum_{r' \in N_{r_{\mathrm{LIFT}}}(r)} h^{(t)}_{r'|q} \odot z_{r_{\mathrm{LIFT}}}\Big] + b^{(t)}\Big),$$

where z_{r_LIFT} are learnable vectors, one for each of the fundamental relations, σ is the ReLU activation, ‖ again stands for concatenation, and W(t) and b(t) are the learnable weight matrix and bias vector of layer t, correspondingly. Finally, after the relational embeddings h(T)_{r|q} are obtained for the query q(u, ?), ULTRA proceeds to compute the final pairwise embeddings for 0 ≤ ℓ < L:

$$h^{(0)}_{v|u,q} = \mathbb{1}_{v=u} \cdot h^{(T)}_{q|q},$$
$$h^{(\ell+1)}_{v|u,q} = \sigma\Big(W^{(\ell)}\Big[h^{(0)}_{v|u,q} \,\Big\|\, \sum_{r \in R}\sum_{w \in N_r(v)} h^{(\ell)}_{w|u,q} \odot \mathrm{MLP}^{(\ell)}\big(h^{(T)}_{r|q}\big)\Big] + b^{(\ell)}\Big).$$

Frameworks. We follow the notation of MOTIF and write ULTRA in its most general form. Later, we will also use ULTRA to refer to this general form when the context is clear. For the relational encoder, we have the following form:

$$h^{(0)}_{r|q} = \mathrm{INIT}_1(q, r),$$
$$h^{(t+1)}_{r|q} = \mathrm{UP}_1\Big(h^{(t)}_{r|q},\ \mathrm{AGG}_1\big(\{\!\!\{\mathrm{MSG}_{r_{\mathrm{LIFT}}}(h^{(t)}_{r'|q}) \mid r' \in N_{r_{\mathrm{LIFT}}}(r),\ r_{\mathrm{LIFT}} \in R_{\mathrm{LIFT}}\}\!\!\}\big)\Big),$$

where INIT1, UP1, AGG1, and MSG_{r_LIFT} are differentiable initialization, update, aggregation, and fundamental-relation-specific message functions, respectively.
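As an illustration, the following is a minimal PyTorch sketch of one layer of this relational encoder, instantiated with the concrete ULTRA update above (sum aggregation and per-fundamental-relation messages h ⊙ z). The class name and tensor layout are our own assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class UltraRelationLayer(nn.Module):
    """One relation-encoder layer: MSG_{r_LIFT}(h) = h * z[r_LIFT], AGG1 is a
    sum over relation-graph neighbors, and the update concatenates with the
    boundary condition h^(0) before a linear map and ReLU."""
    def __init__(self, dim: int, num_fundamental: int = 4):
        super().__init__()
        self.z = nn.Parameter(torch.randn(num_fundamental, dim))  # z_{r_LIFT}
        self.linear = nn.Linear(2 * dim, dim)                     # W^(t), b^(t)

    def forward(self, h0, h, edge_index, edge_type):
        # h0, h: [num_relations, dim]; edge_index: [2, num_edges] over the
        # relation graph; edge_type: [num_edges] in {0..3} (t2h, h2h, h2t, t2t).
        src, dst = edge_index
        msg = h[src] * self.z[edge_type]                    # per-edge messages
        agg = torch.zeros_like(h).index_add_(0, dst, msg)   # sum aggregation
        return torch.relu(self.linear(torch.cat([h0, agg], dim=-1)))
```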
Similarly, for the entity encoder, we have the following form:

$$h^{(0)}_{v|u,q} = \mathrm{INIT}_2(u, v, q),$$
$$h^{(\ell+1)}_{v|u,q} = \mathrm{UP}_2\Big(h^{(\ell)}_{v|u,q},\ \mathrm{AGG}_2\big(\{\!\!\{\mathrm{MSG}(h^{(\ell)}_{w|u,q},\ h^{(T)}_{r|q}) \mid w \in N_r(v),\ r \in R\}\!\!\}\big)\Big),$$

where INIT2, UP2, AGG2, and MSG are differentiable initialization, update, aggregation and message functions, respectively.

Characteristics of the fundamental relations. We also identify a few characteristics of the fundamental relations of ULTRA when constructing the relation graphs:

1. Self-loops. h2h and t2t always exist for every relation, i.e., for all r ∈ R, h2h(r, r) ∈ E_LIFT and t2t(r, r) ∈ E_LIFT. This induces two self-loops for every relation.
2. Symmetry. The relations h2h and t2t are symmetric: if h2h(r1, r2) ∈ E_LIFT then we must have h2h(r2, r1) ∈ E_LIFT, as they satisfy the same homomorphism condition. The same applies to t2t.
3. Inverses. The relations h2t and t2h are always inverse to each other, i.e., if h2t(r1, r2) ∈ E_LIFT then we must have t2h(r2, r1) ∈ E_LIFT, as they satisfy the same condition, and vice versa.

As a standard practice, current link prediction models augment the knowledge graph with inverse relations. However, adding inverse relations implies many symmetric edges in the constructed relation graphs. First, note that for every r(u, v) ∈ E, we have r⁻¹(v, u) ∈ E. Thus, by default, we have that t2h(r, r⁻¹), h2t(r, r⁻¹), h2t(r⁻¹, r), t2h(r⁻¹, r) ∈ E_LIFT. Then we consider what happens in the constructed relation graph:

- t2h relations. Assume t2h(r1, r2) ∈ E_LIFT, i.e., ∃u, v, w ∈ V with r1(u, v), r2(v, w) ∈ E. Then t2t(r1, r2⁻¹), h2h(r1⁻¹, r2), h2t(r1⁻¹, r2⁻¹) ∈ E_LIFT.
- h2h relations. Assume h2h(r1, r2) ∈ E_LIFT, i.e., ∃u, v, w ∈ V with r1(v, u), r2(v, w) ∈ E. Then h2t(r1, r2⁻¹), t2h(r1⁻¹, r2), t2t(r1⁻¹, r2⁻¹) ∈ E_LIFT.
- h2t relations. Assume h2t(r1, r2) ∈ E_LIFT, i.e., ∃u, v, w ∈ V with r1(v, u), r2(w, v) ∈ E. Then h2h(r1, r2⁻¹), t2t(r1⁻¹, r2), t2h(r1⁻¹, r2⁻¹) ∈ E_LIFT.
- t2t relations. Assume t2t(r1, r2) ∈ E_LIFT, i.e., ∃u, v, w ∈ V with r1(u, v), r2(w, v) ∈ E. Then t2h(r1, r2⁻¹), h2t(r1⁻¹, r2), h2h(r1⁻¹, r2⁻¹) ∈ E_LIFT.

G. Sparse matrix multiplication for relational hypergraph construction

To compute the adjacency structure of the relational hypergraph, we describe here the implementation details of the lift operation LIFT. We follow a similar strategy to Galkin et al. (2024) and compute the four motifs of ULTRA, h2t, t2h, t2t, h2h, via sparse matrix multiplication. We then proceed to show the higher-order motif computation for the MOTIF instances used in the experiments.

G.1. 2-path motifs (F^path_2)

Given a knowledge graph G = (V, E, R, c) with n := |V| nodes and m := |R| relations, we represent the (multi-relational) adjacency matrix A ∈ R^{n×m×n} in sparse format. We can then gather from the adjacency matrix via a scatter-max over the first node dimension and over the last node dimension, obtaining two (sparse) matrices E_h, E_t ∈ R^{n×m}, indicating for each node and relation whether the node is the head (respectively, the tail) of some edge of that relation. Note that since the adjacency matrix is binary, i.e., all of its coordinates are either 0 or 1, taking a scatter-max is equivalent to taking a logical OR. Note that these binary interactions are exactly all possible 2-path motifs.
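A minimal dense sketch of this gathering step (our illustration; the actual implementation operates on sparse tensors, and names such as `gather_endpoints` are hypothetical):

```python
import torch

def gather_endpoints(edge_index, edge_type, n, m):
    """Build binary endpoint matrices from an edge list.

    edge_index: [2, num_edges] (head, tail) node ids; edge_type: [num_edges].
    Returns E_h, E_t of shape [n, m], where E_h[v, r] = 1 iff v is the head of
    some r-edge and E_t[v, r] = 1 iff v is the tail of some r-edge.
    """
    heads, tails = edge_index
    E_h = torch.zeros(n, m)
    E_t = torch.zeros(n, m)
    # Binary adjacency, so scatter-max == logical OR: just set entries to 1.
    E_h[heads, edge_type] = 1.0
    E_t[tails, edge_type] = 1.0
    return E_h, E_t

# The four 2-path adjacencies described next then follow as matrix products:
#   A_h2h = E_h.T @ E_h    A_t2t = E_t.T @ E_t
#   A_h2t = E_h.T @ E_t    A_t2h = E_t.T @ E_h
```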
The motif computation for interactions between relations is then equivalent to one sparse matrix multiplication (spmm) between these matrices:

$$A_{\mathrm{h2h}} = \mathrm{spmm}(E_h^{\top}, E_h) \in \mathbb{R}^{m \times m}, \qquad A_{\mathrm{t2t}} = \mathrm{spmm}(E_t^{\top}, E_t) \in \mathbb{R}^{m \times m},$$
$$A_{\mathrm{h2t}} = \mathrm{spmm}(E_h^{\top}, E_t) \in \mathbb{R}^{m \times m}, \qquad A_{\mathrm{t2h}} = \mathrm{spmm}(E_t^{\top}, E_h) \in \mathbb{R}^{m \times m}.$$

Note that we only track the existence of edges; thus, the respective edge index is extracted from all non-zero values. Finally, we concatenate all the considered relation types to obtain the final adjacency structure of LIFT_F(G) through the LIFT operation.

[Figure 10. The 3-path motifs used by MOTIF(F^path_3): tfh, tft, hfh, and hft, each relating three relations (α, β, γ) along a path of length three.]

G.2. 3-path motifs (F^path_3)

Note that it is also possible to compute higher-order k-path (F^path_k) motifs via a similar strategy, i.e., by iteratively carrying out spmm with the adjacency matrix A (k − 2 times). As an example, we first define the motifs considered in the experiments of MOTIF with all the 3-path motifs, namely tfh, tft, hfh, hft, shown in Figure 10:

- tail-forward-head (tfh) edges: tfh(r1, r2, r3) ∈ E_LIFT if and only if ∃u, v, w, x ∈ V with r1(u, v), r2(v, w), r3(w, x) ∈ E;
- tail-forward-tail (tft) edges: tft(r1, r2, r3) ∈ E_LIFT if and only if ∃u, v, w, x ∈ V with r1(u, v), r2(v, w), r3(x, w) ∈ E;
- head-forward-head (hfh) edges: hfh(r1, r2, r3) ∈ E_LIFT if and only if ∃u, v, w, x ∈ V with r1(v, u), r2(v, w), r3(w, x) ∈ E;
- head-forward-tail (hft) edges: hft(r1, r2, r3) ∈ E_LIFT if and only if ∃u, v, w, x ∈ V with r1(v, u), r2(v, w), r3(x, w) ∈ E.

To compute these efficiently via spmm, we follow the same strategy, generating the multi-relational adjacency matrix A ∈ R^{n×m×n} together with E_h and E_t as before. The computation of these interactions among three relations is then equivalent to one sparse three-way product of size R^{m×m×m}:

$$A_{\mathrm{hfh}} = \mathrm{spmm}(E_h^{\top}, A, E_h), \qquad A_{\mathrm{tft}} = \mathrm{spmm}(E_t^{\top}, A, E_t),$$
$$A_{\mathrm{hft}} = \mathrm{spmm}(E_h^{\top}, A, E_t), \qquad A_{\mathrm{tfh}} = \mathrm{spmm}(E_t^{\top}, A, E_h).$$

The final adjacency structure of LIFT_F(G) can be generated analogously by concatenating all the considered relation types. Note that it is also possible to construct even higher-order path-based motifs or other types of motifs with spmm or alternative efficient computation methods. We leave this as future work.
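Assuming the same dense representation as in the sketch above, these three-way products can be written as Einstein summations (again an illustration only; the released code uses sparse kernels):

```python
import torch

def three_path_motifs(A, E_h, E_t):
    """Compute the 3-path motif tensors of Section G.2.

    A: [n, m, n] with A[v, r, w] = 1 iff r(v, w) is a fact;
    E_h, E_t: [n, m] head/tail incidence from the 2-path construction.
    Returns four [m, m, m] tensors; nonzero (r1, r2, r3) entries are hyperedges.
    """
    # v is shared between r1 and r2; w is shared between r2 and r3.
    A_tfh = torch.einsum("va,vbw,wc->abc", E_t, A, E_h)  # r1(u,v), r2(v,w), r3(w,x)
    A_tft = torch.einsum("va,vbw,wc->abc", E_t, A, E_t)  # r1(u,v), r2(v,w), r3(x,w)
    A_hfh = torch.einsum("va,vbw,wc->abc", E_h, A, E_h)  # r1(v,u), r2(v,w), r3(w,x)
    A_hft = torch.einsum("va,vbw,wc->abc", E_h, A, E_t)  # r1(v,u), r2(v,w), r3(x,w)
    # Existence only: clamp the match counts back to {0, 1}.
    return [(t > 0).float() for t in (A_tfh, A_tft, A_hfh, A_hft)]
```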
H. Scalability analysis

Scalability is generally a concern for KGFMs, since they carry out inductive (on node and relation) link prediction between a given pair of nodes, which relies heavily on the structural properties of these nodes as well as of the relations (due to the lack of node and relation features). While incorporating more expressive motifs enhances the model's predictive power, it also introduces additional computational costs in terms of time and space complexity. This trade-off between expressiveness and scalability is critical in practical applications and warrants careful consideration. Specifically, we aim to answer the question: What is the trade-off between expressiveness and scalability for MOTIF? (Q4).

To explore this trade-off, we conducted a scalability analysis, as reported in Table 4. All experiments were performed on the FB15k-237 dataset using a batch size of 64 on a single NVIDIA H100 GPU. We report the number of parameters, the number of edges in the constructed relational hypergraphs, training time per batch, inference time per batch, and GPU memory usage.

Table 4. Scalability analysis of link prediction on FB15k-237 (batch size = 64): number of edges in the constructed relational hypergraph, number of parameters, training time per batch, inference time per batch, and GPU memory used on a single H100.

Model   Motif (F)   #Parameters   #Edges in LIFT_F(G)   Training time (s)   Inference time (s)   GPU memory (GB)
ULTRA   -           168,705       108,240               1.19                0.066                12.87
MOTIF   F^path_3    181,185       4,288,092             3.66                2.613                17.85
MOTIF   F^path_2    179,649       81,180                1.80                0.100                13.14
MOTIF   {h2t}       178,881       27,060                1.56                0.087                13.10
MOTIF   (none)      178,497       0                     1.48                0.065                12.88

H.1. Impact of motifs on scalability

The number of parameters in the models remains relatively stable as richer motifs are introduced, as the additional learnable parameters grow linearly with |R| (the number of relations) and d (the embedding dimension) due to the inclusion of additional fundamental relation embeddings. However, the difference between MOTIF and ULTRA is notable: MOTIF includes additional positional encodings in the HCNet to capture edge directionality, leading to slightly more parameters.

The key scalability challenge arises from the rapid increase in the number of edges in the constructed relational hypergraphs LIFT_F(G) as more complex motifs are introduced. For example, binary edges scale with O(|V|²|R|), while higher-order motifs such as F^path_3 introduce edges that scale with O(|V|³|R|). This results in an exponential increase in the number of induced hyperedges as we consider higher-order motifs such as k-path or k-star, leading to significantly higher computational and memory requirements.

In terms of training and inference time, higher-order motifs (e.g., F^path_3) dramatically increase both: for instance, the training time per batch for F^path_3 is 3.66 seconds, compared to 1.80 seconds for F^path_2. GPU memory consumption also increases with more complex motifs: while F^path_3 consumes 17.85 GB, simpler motif sets such as F^path_2 and {h2t} use approximately 13 GB.

H.2. Optimizing scalability with Triton kernels

To mitigate these scalability bottlenecks, we include a custom implementation via Triton kernels⁴ for the message-passing process on both knowledge graphs and relational hypergraphs, which on average halves the training time and dramatically reduces the space usage of the algorithm (a 5x reduction on average). The idea is to not materialize all messages explicitly, as in PyTorch Geometric (Fey & Lenssen, 2019), but to write the neighboring features directly into the corresponding memory addresses. Compared with materializing all hyperedge messages, which takes O(k|E|) memory where k is the maximum arity, computing with the Triton kernel only requires O(|V|) memory. This enables fast and scalable message passing on both knowledge graphs and relational hypergraphs in MOTIF. As a result, we observe that every instance of MOTIF can be run on a 20GB GPU. We provide our implementation in the accompanying codebase.

⁴https://github.com/triton-lang/triton

While adding higher-order motifs significantly enhances performance, it also introduces scalability challenges, particularly in terms of memory and computational requirements. Our Triton kernel implementation alleviates some of these issues, but further research is needed to explore more efficient ways of scaling such models to extremely large knowledge graphs and relational hypergraphs.
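The memory saving can be illustrated in plain PyTorch (a reference sketch only; how the kernel parallelizes the work is simplified here):

```python
import torch

def aggregate(h: torch.Tensor, src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Sum neighbor features h[src] into rows dst of the output.

    In eager PyTorch, h[src] materializes a [num_edges, d] message buffer,
    i.e., O(k|E|) memory. A fused kernel instead loads each neighbor row and
    atomically accumulates it into out[dst], so only the O(|V|) output buffer
    is ever allocated.
    """
    out = torch.zeros_like(h)
    out.index_add_(0, dst, h[src])  # the kernel fuses this gather + add
    return out
```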
I. Computational complexity

In this section, we present the theoretical computational complexity of the model variants considered in the experiments (Q4). Given a knowledge graph G = (V, E, R, c), |V|, |E| and |R| denote the number of vertices, edges, and relation types, respectively; d is the hidden dimension of the model, and L is the number of layers in the model. m denotes the maximum arity of the hyperedges, which in the experiments is either 2 or 3.

I.1. Generating the relational hypergraph

2-path (F^path_2). The complexity of generating the relation graph for ULTRA is dominated by the initial scatter-max over the node dimension, which has a time complexity of O(|V|²|R|), followed by sparse matrix multiplications over E_h^T or E_t^T and E_h or E_t. The time complexity of this operation is O(nnz(X) · nnz(Y)), where nnz(·) counts the number of non-zero elements of a sparse matrix, X ∈ {E_h^T, E_t^T} and Y ∈ {E_h, E_t}. Note that if we did not use spmm, this time complexity would blow up to O(|V||R|²). Thus, the total complexity of constructing the relation graph in ULTRA is O(|V||R|(|V| + |R|)).

3-path (F^path_3). Similarly, for the construction of the relational hypergraph in MOTIF, i.e., with all the 3-path and 2-path motifs, we need another spmm to additionally account for the forward adjacency matrix A. The time complexity of this operation under sparse matrix multiplication is then O(nnz(X) · nnz(A) · nnz(Y)), with X ∈ {E_h^T, E_t^T} and Y ∈ {E_h, E_t}. Without spmm, this time complexity becomes O(|V|²|R|² + |V||R|³), which dominates the construction of E_h and E_t together with all the 2-path motifs, and thus also serves as the total complexity.

Additionally, it is important to note that the relational hypergraph can be generated as a preprocessing step during inference. During training, we typically exclude the positive triples currently being used for training from the training knowledge graph to prevent overfitting. This exclusion often changes the relational hypergraph structure, requiring us to rebuild the relational hypergraph for each batch. Therefore, the complexity of constructing these relational hypergraphs needs to be taken into account.

m-star (F^star_m). In the synthetic experiments, we consider m-star motifs for a variety of m, ranging from 2 to 8. It is computationally more expensive to generate m-star motifs, since there is no spmm formulation for fast computation. To compute m-star, we need to go through each node v, consider each of the $\binom{\deg(v)}{m}$ combinations, and check existence individually to fill up the table. Thus, the complexity is $O\big(|V| \cdot \binom{|R|}{m}\big)$, which is exponential in the number of relations |R| and thus hard to realize on relation-rich real-world datasets. As a result, we exclude the m-star motifs from the real-world experiments.

I.2. C-MPNNs and HC-MPNNs

We take the complexity analysis results from Huang et al. (2023) and Huang et al. (2024), shown in Table 5. In this table, we report both the complexity of a single forward pass and the amortized complexity of a query q(u, v).

I.3. ULTRA and MOTIF

The complexity of ULTRA is thus the combination of relation graph generation with 2-path motifs, applying a C-MPNN on the constructed relation graph LIFT_F(G), and then another C-MPNN on the knowledge graph. (Note that NBFNet is an instance of C-MPNN, as shown in Huang et al. (2023).)
Thus, given a ULTRA model with T layers in its relation encoder and L layers in its entity encoder, the overall complexity of a forward pass is

$$O\big(|V||R|(|V| + |R|) + T(|R|^2 d + |R| d^2) + L(|E| d + |V| d^2)\big),$$

as the number of edges in the constructed relation graph is bounded by O(|R|²), and the number of nodes in the relation graph is O(|R|). The complexity of MOTIF equipped with 3-path motifs, with T layers in its relation encoder and L layers in its entity encoder, can be derived in a similar manner:

$$O\big(|V|^2|R|^2 + |V||R|^3 + T(|R|^3 d + |R| d^2) + L(|E| d + |V| d^2)\big).$$

However, note that due to spmm, the construction of the relation (hyper)graph is very efficient in practical scenarios. Additionally, we note that the amortized complexity of a query for ULTRA is

$$O\Big(\frac{|R|(|V| + |R|) + T(|R|^2 d + |R| d^2)}{|V|} + L\Big(\frac{|E| d}{|V|} + d^2\Big)\Big),$$

and that for MOTIF equipped with 3-path motifs is

$$O\Big(\frac{|V||R|^2 + |R|^3 + T(|R|^3 d + |R| d^2)}{|V|} + L\Big(\frac{|E| d}{|V|} + d^2\Big)\Big).$$

Table 5. Asymptotic runtime complexities of the relation and entity encoders.

Model       Complexity of a forward pass    Amortized complexity of a query
C-MPNNs     O(L(|E|d + |V|d²))              O(L(|E|d/|V| + d²))
HC-MPNNs    O(L(m|E|d + |V|d²))             O(L(m|E|d/|V| + d²))

J. On adding node and relation features

In general, the task of link prediction on knowledge graphs does not consider any node features. Thus, on the surface, it seems that MOTIF does not take node features into account, as is the case for many models (Teru et al., 2020) that utilize labeling tricks (Zhang et al., 2021) and conditional message passing (Zhu et al., 2021; 2023; Galkin et al., 2024; Huang et al., 2023; 2024).

Adding node features. Incorporating node features, such as text embeddings, is relatively simple and can be done by concatenating the node feature vector x_v with the current representation h_v to produce h′_v = [h_v ‖ x_v]. Since the initialization of MOTIF only requires satisfying target node distinguishability, concatenating the node features preserves this property. As a result, all theoretical results remain applicable to MOTIF when node features are included. It is also important to note that this concatenation method has already been successfully employed in knowledge graphs with node features, as demonstrated in Zhang et al. (2021); Galkin et al. (2024).

Adding relation features. The same principle applies, since MOTIF explicitly computes the (conditioned) relation representations. Specifically, given any feature vector x_r of a relation, such as a text embedding, we can consider the augmented relation embedding h′_r = [h_r ‖ x_r] in place of the originally computed relation embedding h_r. We leave the question of how adding relation features could boost the zero-shot and fine-tuned performance of KGFMs as important future work.

K. Proof of the counter-example in Section 6.2

We present a counter-example for link prediction in Figure 7, showing that ULTRA cannot distinguish between the links r3(u, v1) and r3(u, v2), whereas MOTIF(F^path_3) can correctly distinguish them. We first define the knowledge graph of interest G = (V, E, R, c), where

V = {u, x, y, v1, v2},   (1)
R = {r1, r2, r3},   (2)
E = {r1(y, v1), r1(v1, y), r2(u, x), r2(y, v2), r2(v2, y), r3(x, y), r3(y, x)}.   (3)

We first show that ULTRA cannot distinguish between r3(u, v1) and r3(u, v2). First we note that the LIFT operation defined on the motifs of ULTRA, i.e., {h2t, t2h, h2h, t2t}, generates a complete relation graph LIFT_F(G) for G. Formally, it holds that for all P ∈ F and all r, r′ ∈ R, P(r, r′) ∈ E_LIFT.
Now, it is enough to show that ULTRA will assign the same coloring to (u, v1) and (u, v2). We first claim that h(T)_{r1|r3} = h(T)_{r2|r3} for all T ≥ 0. The base case holds since h(0)_{r1|r3} = 1_{r1=r3} 1_d = 1_{r2=r3} 1_d = h(0)_{r2|r3}. For the inductive case, assume h(t)_{r1|r3} = h(t)_{r2|r3}; then, since the relation graph is complete, N_{r_LIFT}(r1) = N_{r_LIFT}(r2) = R for every fundamental relation, and it holds that

$$
\begin{aligned}
h^{(t+1)}_{r_1|r_3} &= \sigma\Big(W^{(t)}\Big[h^{(0)}_{r_1|r_3} \,\Big\|\, \sum_{r_{\mathrm{LIFT}} \in R_{\mathrm{LIFT}}}\ \sum_{r' \in N_{r_{\mathrm{LIFT}}}(r_1)} h^{(t)}_{r'|r_3} \odot z_{r_{\mathrm{LIFT}}}\Big] + b^{(t)}\Big)\\
&= \sigma\Big(W^{(t)}\Big[h^{(0)}_{r_1|r_3} \,\Big\|\, \sum_{r_{\mathrm{LIFT}} \in R_{\mathrm{LIFT}}}\ \sum_{r' \in R} h^{(t)}_{r'|r_3} \odot z_{r_{\mathrm{LIFT}}}\Big] + b^{(t)}\Big)\\
&= \sigma\Big(W^{(t)}\Big[h^{(0)}_{r_2|r_3} \,\Big\|\, \sum_{r_{\mathrm{LIFT}} \in R_{\mathrm{LIFT}}}\ \sum_{r' \in R} h^{(t)}_{r'|r_3} \odot z_{r_{\mathrm{LIFT}}}\Big] + b^{(t)}\Big)\\
&= \sigma\Big(W^{(t)}\Big[h^{(0)}_{r_2|r_3} \,\Big\|\, \sum_{r_{\mathrm{LIFT}} \in R_{\mathrm{LIFT}}}\ \sum_{r' \in N_{r_{\mathrm{LIFT}}}(r_2)} h^{(t)}_{r'|r_3} \odot z_{r_{\mathrm{LIFT}}}\Big] + b^{(t)}\Big)\\
&= h^{(t+1)}_{r_2|r_3}.
\end{aligned}
$$

Now it is straightforward to see that the representations of (u, v1) and (u, v2) will be identical, since the entity encoder of ULTRA cannot distinguish between the relations r1 and r2, and thus views the nodes v1 and v2 as isomorphic to each other when conditioned on u. However, MOTIF(F^path_3) can distinguish between r1 and r2 by computing h(t)_{r1|r3} ≠ h(t)_{r2|r3}: the generated relational hypergraph contains the hyperedge tfh(r2, r3, r1), witnessed by the homomorphism onto the path u →r2 x →r3 y →r1 v1, but it does not contain tfh(r1, r3, r2).

L. Details of the synthetic experiments

Dataset construction. Given 2 ≤ k ≤ 7, and for each knowledge graph G_i = (V_i, E_i, R_i) in ConnectHub(k), we generate and partition the relations into two classes and a query relation, R_i = P_i ∪ N_i ∪ {q}, where P_i = {p_1, …, p_l} and N_i = {n_1, …, n_l} for some l > k. Then, we construct each component as follows:

- A hub H_i with a center node u such that, for every relation p ∈ P_i, there exists a distinct v ∈ V_i such that the edge p(v, u) exists. Note that the center node u will have degree larger than k by construction.
- Multiple positive communities over the relations P_i: for each subset of relations P̃_i ⊆ P_i satisfying |P̃_i| ≤ k, we have a center node x and distinct nodes y_1, …, y_{|P̃_i|} such that p_j(y_j, x) exists for all 1 ≤ j ≤ |P̃_i|.
- Multiple negative communities over the relations N_i: for each subset of relations Ñ_i ⊆ N_i satisfying |Ñ_i| ≤ k, we have a center node x and distinct nodes y_1, …, y_{|Ñ_i|} such that n_j(y_j, x) exists for all 1 ≤ j ≤ |Ñ_i|.

Finally, the graph G_i is a disjoint union of all these components. The detailed dataset statistics are reported in Table 6.

Experiment details. We use ULTRA with sum aggregation on both the relational and entity levels, and similarly in all variants of MOTIF(F^star_m). The message function on the entity level is chosen to be element-wise summation to avoid loss of information from relation types during the first message-passing step. We use 2 layers for both ULTRA and all variants of MOTIF(F^star_m), each with dimension 32 in both the relation and entity models, and the Adam optimizer with a learning rate of 0.001, trained for 500 epochs in all experiments.

M. Further experimental details

Datasets. In this section, we report the details of all experiments reported in the body of this paper. For the pretraining experiments, we train the MOTIF instance on three transductive knowledge graph completion datasets, following Galkin et al. (2024): FB15k237 (Toutanova & Chen, 2015), WN18RR (Dettmers et al., 2018), and CoDEx Medium (Safavi & Koutra, 2020). We then carry out zero-shot and fine-tuning experiments on the following datasets, categorized into three groups:

Inductive e, r. Inductive link prediction with both new nodes and new relations.
Including 13 datasets from INGRAM (Lee et al., 2023): FB-25, FB-50, FB-75, FB-100, WK-25, WK-50, WK-75, WK-100, NL-0, NL-25, NL-50, NL-75, NL-100; and 10 datasets from MTDEA (Zhou et al., 2023a): MT1 tax, MT1 health, MT2 org, MT2 sci, MT3 art, MT3 infra, MT4 sci, MT4 health, Metafam, FBNELL.

Table 6. Dataset construction parameters for ConnectHub.

ConnectHub(k)   #Training/Testing graphs   #Training/Testing triples   #Relations
k = 2           7/3                        360/210                     [3,13]
k = 3           7/3                        1642/1078                   [4,14]
k = 4           4/2                        824/1058                    [5,11]
k = 5           1/1                        112/224                     [6,8]
k = 6           1/1                        238/476                     [7,9]

Inductive e. Inductive link prediction with new nodes only. Including 12 datasets from GraIL (Teru et al., 2020): WN-v1, WN-v2, WN-v3, WN-v4, FB-v1, FB-v2, FB-v3, FB-v4, NL-v1, NL-v2, NL-v3, NL-v4; 4 datasets from INDIGO (Liu et al., 2021): HM 1k, HM 3k, HM 5k, HM Indigo; and 2 datasets from NodePiece (Galkin et al., 2022): ILPC Small, ILPC Large.

Transductive. Transductive link prediction over seen nodes and seen relations. Including CoDEx Small, CoDEx Large (Safavi & Koutra, 2020), NELL-995 (Xiong et al., 2017), YAGO 310 (Mahdisoltani et al., 2015), WDsinger, NELL23k, FB15k237(10), FB15k237(20), FB15k237(50) (Lv et al., 2020), and Hetionet (Himmelstein et al., 2017).

Full tables of the zero-shot performance of MOTIF(F^path_3) are presented in Table 7 and Table 8, and full tables of the fine-tuned performance of MOTIF(F^path_3), averaged over 3 runs, are presented in Table 9 and Table 10. We also carry out end-to-end experiments on all considered datasets, with full tables averaging 3 runs presented in Table 11 and Table 12. We report the statistics of these datasets in Table 13, Table 14, and Table 15, respectively. We finally report the detailed hyperparameters in Table 16, and the number of epochs used for fine-tuning and end-to-end training in Table 17. Note that we do not include all 13 datasets shown in Galkin et al. (2024), due to the size of the constructed relational hypergraphs when equipped with 3-path motifs.

Training. Following convention, on each knowledge graph and for each triple r(u, v), we add the corresponding inverse triple r⁻¹(v, u), where r⁻¹ is a fresh relation symbol. All trained MOTIF variants minimize the negative log-likelihood of both positive and negative facts. In line with the partial completeness assumption (Galárraga et al., 2013), we generate negative samples by randomly corrupting either the head or the tail entity. The conditional probability of a fact q(u, v) is parameterized as p(v | u, q) = σ(f(h(L)_{v|u,q})), where σ denotes the sigmoid function and f is a two-layer MLP. We adopt layer normalization (Ba et al., 2016) and shortcut connections after each aggregation and before applying ReLU in both encoders, and a dropout rate of 0.2 in the relation encoder only. We discard the edges that directly connect query node pairs to prevent overfitting. The best checkpoint for each model instance is selected based on its performance on the validation set. Inspired by Sun et al. (2019), we employ self-adversarial negative sampling, drawing negative triples from the following distribution, with α serving as the adversarial temperature:

$$\mathcal{L}(v \mid u, q) = -\log p(v \mid u, q) - \sum_{i=1}^{k} w_{i,\alpha} \log\big(1 - p(v'_i \mid u'_i, q)\big).$$

Here, k represents the number of negative samples per positive sample, and (u′_i, q, v′_i) is the i-th negative sample. Finally, w_{i,α} is the weight of the i-th negative sample, defined as

$$w_{i,\alpha} := \mathrm{Softmax}\Big(\frac{\log\big(1 - p(v'_i \mid u'_i, q)\big)}{\alpha}\Big).$$
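A sketch of this objective in PyTorch, following the equations above (function and argument names are ours; the adversarial weights are detached so they act as sampling probabilities rather than trained quantities):

```python
import torch
import torch.nn.functional as F

def self_adversarial_nll(pos_logit, neg_logits, alpha: float = 1.0):
    """Negative log-likelihood with self-adversarially weighted negatives.

    pos_logit: [batch] raw score f(h) of the positive fact;
    neg_logits: [batch, k] raw scores of the k corrupted facts.
    """
    pos_loss = F.softplus(-pos_logit)              # -log p for the positive
    log_one_minus_p = F.logsigmoid(-neg_logits)    # log(1 - p) per negative
    # Weights w_{i, alpha}: softmax over the k negatives, no gradient flow.
    w = torch.softmax(log_one_minus_p.detach() / alpha, dim=-1)
    neg_loss = -(w * log_one_minus_p).sum(dim=-1)
    return (pos_loss + neg_loss).mean()
```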
Table 7. Detailed zero-shot inductive link prediction MRR and Hits@10 for ULTRA and MOTIF(F^path_3).

Dataset       ULTRA (MRR / H@10)   MOTIF (MRR / H@10)

Inductive e, r
FB-100        0.449 / 0.642        0.428 / 0.628
FB-75         0.403 / 0.604        0.399 / 0.614
FB-50         0.338 / 0.543        0.338 / 0.546
FB-25         0.388 / 0.640        0.384 / 0.640
WK-100        0.164 / 0.286        0.164 / 0.282
WK-75         0.365 / 0.537        0.366 / 0.540
WK-50         0.166 / 0.324        0.163 / 0.314
WK-25         0.316 / 0.532        0.311 / 0.493
NL-100        0.471 / 0.651        0.438 / 0.647
NL-75         0.368 / 0.547        0.314 / 0.512
NL-50         0.407 / 0.570        0.373 / 0.532
NL-25         0.395 / 0.569        0.348 / 0.498
NL-0          0.342 / 0.523        0.324 / 0.497
MT1 tax       0.224 / 0.305        0.325 / 0.448
MT1 health    0.298 / 0.374        0.326 / 0.398
MT2 org       0.095 / 0.159        0.092 / 0.154
MT2 sci       0.258 / 0.354        0.286 / 0.435
MT3 art       0.259 / 0.402        0.269 / 0.414
MT3 infra     0.619 / 0.755        0.658 / 0.786
MT4 sci       0.274 / 0.449        0.283 / 0.451
MT4 health    0.624 / 0.737        0.627 / 0.762
Metafam       0.238 / 0.644        0.344 / 0.829
FBNELL        0.485 / 0.652        0.468 / 0.664

Inductive e
WN-v1         0.648 / 0.768        0.682 / 0.778
WN-v2         0.663 / 0.765        0.663 / 0.771
WN-v3         0.376 / 0.476        0.420 / 0.538
WN-v4         0.611 / 0.705        0.640 / 0.718
FB-v1         0.498 / 0.656        0.503 / 0.692
FB-v2         0.512 / 0.700        0.511 / 0.716
FB-v3         0.491 / 0.654        0.500 / 0.692
FB-v4         0.486 / 0.677        0.487 / 0.677
NL-v1         0.785 / 0.913        0.674 / 0.871
NL-v2         0.526 / 0.707        0.564 / 0.769
NL-v3         0.515 / 0.702        0.533 / 0.724
NL-v4         0.479 / 0.712        0.503 / 0.711
ILPC Small    0.302 / 0.443        0.295 / 0.444
ILPC Large    0.290 / 0.424        0.285 / 0.415
HM 1k         0.059 / 0.092        0.063 / 0.097
HM 3k         0.037 / 0.077        0.055 / 0.084
HM 5k         0.034 / 0.071        0.050 / 0.073
HM Indigo     0.440 / 0.648        0.426 / 0.637

Table 8. Detailed zero-shot transductive link prediction MRR and Hits@10 for ULTRA and MOTIF(F^path_3).

Dataset        ULTRA (MRR / H@10)   MOTIF (MRR / H@10)

Pretrained
WN18RR         0.480 / 0.614        0.507 / 0.609
FB15k237       0.368 / 0.564        0.336 / 0.532
CoDEx Medium   0.372 / 0.525        0.357 / 0.509

Transductive
CoDEx Small    0.472 / 0.667        0.474 / 0.670
CoDEx Large    0.338 / 0.469        0.339 / 0.469
NELL-995       0.406 / 0.543        0.491 / 0.623
YAGO 310       0.451 / 0.615        0.441 / 0.607
WDsinger       0.382 / 0.498        0.397 / 0.514
NELL23k        0.239 / 0.408        0.220 / 0.384
FB15k237(10)   0.248 / 0.398        0.236 / 0.384
FB15k237(20)   0.272 / 0.436        0.259 / 0.422
FB15k237(50)   0.324 / 0.526        0.312 / 0.508
Hetionet       0.257 / 0.379        0.256 / 0.383

Table 9. Detailed fine-tuned inductive link prediction MRR and Hits@10 for ULTRA and MOTIF(F^path_3); ± denotes the standard deviation over 3 runs.
Dataset       ULTRA (MRR / H@10)           MOTIF (MRR / H@10)

Inductive e, r
FB-100        0.444±.003 / 0.643±.004      0.439±.001 / 0.642±.002
FB-75         0.400±.003 / 0.598±.004      0.399±.002 / 0.607±.002
FB-50         0.334±.002 / 0.538±.004      0.340±.002 / 0.544±.002
FB-25         0.383±.001 / 0.635±.002      0.388±.000 / 0.635±.002
WK-100        0.168±.005 / 0.286±.003      0.173±.003 / 0.284±.009
WK-75         0.380±.001 / 0.530±.001      0.371±.008 / 0.535±.012
WK-50         0.140±.010 / 0.280±.012      0.160±.006 / 0.304±.002
WK-25         0.321±.003 / 0.535±.007      0.317±.005 / 0.505±.004
NL-100        0.458±.012 / 0.684±.011      0.464±.001 / 0.682±.001
NL-75         0.374±.007 / 0.570±.005      0.360±.001 / 0.548±.004
NL-50         0.418±.005 / 0.595±.005      0.414±.004 / 0.573±.005
NL-25         0.407±.009 / 0.596±.012      0.390±.008 / 0.580±.027
NL-0          0.329±.010 / 0.551±.012      0.328±.003 / 0.556±.007
MT1 tax       0.330±.046 / 0.459±.056      0.416±.067 / 0.522±.005
MT1 health    0.380±.0002 / 0.467±.006     0.385±.002 / 0.473±.003
MT2 org       0.104±.007 / 0.170±.001      0.106±.001 / 0.170±.003
MT2 sci       0.311±.010 / 0.451±.042      0.326±.002 / 0.520±.023
MT3 art       0.306±.003 / 0.473±.003      0.315±.013 / 0.469±.004
MT3 infra     0.657±.008 / 0.807±.007      0.683±.007 / 0.827±.001
MT4 sci       0.303±.007 / 0.478±.003      0.309±.001 / 0.483±.005
MT4 health    0.704±.002 / 0.785±.002      0.703±.001 / 0.787±.002
Metafam       0.997±.003 / 1.000±.000      1.000±.000 / 1.000±.000
FBNELL        0.481±.004 / 0.661±.011      0.481±.004 / 0.664±.008

Inductive e
WN-v1         0.685±.003 / 0.793±.003      0.703±.005 / 0.806±.005
WN-v2         0.679±.002 / 0.779±.003      0.680±.002 / 0.781±.009
WN-v3         0.411±.008 / 0.546±.006      0.466±.004 / 0.590±.002
WN-v4         0.614±.003 / 0.720±.001      0.659±.003 / 0.733±.004
FB-v1         0.509±.002 / 0.670±.004      0.530±.003 / 0.702±.001
FB-v2         0.524±.003 / 0.710±.004      0.557±.004 / 0.744±.003
FB-v3         0.504±.001 / 0.663±.003      0.519±.001 / 0.684±.002
FB-v4         0.496±.001 / 0.684±.001      0.508±.002 / 0.695±.002
NL-v1         0.757±.021 / 0.878±.035      0.712±.067 / 0.873±.012
NL-v2         0.575±.004 / 0.761±.007      0.566±.003 / 0.765±.003
NL-v3         0.563±.004 / 0.755±.006      0.580±.001 / 0.764±.001
NL-v4         0.469±.020 / 0.733±.011      0.507±.062 / 0.740±.007
ILPC Small    0.303±.001 / 0.453±.002      0.302±.001 / 0.449±.001
ILPC Large    0.308±.002 / 0.431±.001      0.307±.001 / 0.432±.001
HM 1k         0.042±.002 / 0.100±.007      0.067±.009 / 0.107±.010
HM 3k         0.030±.002 / 0.090±.003      0.054±.006 / 0.103±.002
HM 5k         0.025±.001 / 0.068±.003      0.049±.001 / 0.091±.001
HM Indigo     0.432±.001 / 0.639±.002      0.426±.001 / 0.635±.001

Table 10. Detailed fine-tuned transductive link prediction MRR and Hits@10 for ULTRA and MOTIF(F^path_3); ± denotes the standard deviation over 3 runs.

Dataset        ULTRA (MRR / H@10)           MOTIF (MRR / H@10)

Pretrained
WN18RR         0.480±.000 / 0.614±.000      0.529±.002 / 0.628±.001
FB15k237       0.368±.000 / 0.564±.000      0.357±.004 / 0.550±.005
CoDEx Medium   0.372±.000 / 0.525±.000      0.361±.001 / 0.517±.001

Transductive
CoDEx Small    0.490±.003 / 0.686±.003      0.490±.001 / 0.680±.003
CoDEx Large    0.343±.002 / 0.478±.002      0.355±.002 / 0.489±.003
NELL-995       0.509±.013 / 0.660±.006      0.514±.006 / 0.655±.002
YAGO 310       0.557±.009 / 0.710±.003      0.603±.010 / 0.735±.003
WDsinger       0.417±.002 / 0.526±.002      0.423±.001 / 0.532±.001
NELL23k        0.268±.001 / 0.450±.001      0.256±.001 / 0.441±.001
FB15k237(10)   0.254±.001 / 0.411±.001      0.254±.001 / 0.411±.001
FB15k237(20)   0.274±.001 / 0.445±.002      0.273±.001 / 0.444±.001
FB15k237(50)   0.325±.002 / 0.528±.002      0.323±.002 / 0.523±.005
Hetionet       0.399±.005 / 0.538±.004      0.446±.002 / 0.575±.004

Table 11. Detailed end-to-end inductive link prediction MRR and Hits@10 for ULTRA and MOTIF(F^path_3); ± denotes the standard deviation over 3 runs.
Dataset       ULTRA (MRR / H@10)           MOTIF (MRR / H@10)

Inductive e, r
FB-100        0.425±.004 / 0.628±.003      0.418±.002 / 0.614±.001
FB-75         0.385±.001 / 0.592±.003      0.389±.004 / 0.592±.007
FB-50         0.323±.005 / 0.523±.006      0.334±.001 / 0.532±.001
FB-25         0.365±.004 / 0.612±.005      0.376±.001 / 0.621±.001
WK-100        0.159±.002 / 0.269±.007      0.163±.016 / 0.273±.013
WK-75         0.365±.006 / 0.501±.020      0.354±.006 / 0.504±.012
WK-50         0.149±.007 / 0.290±.006      0.150±.008 / 0.297±.022
WK-25         0.305±.008 / 0.486±.006      0.299±.004 / 0.489±.011
NL-100        0.410±.010 / 0.648±.008      0.458±.021 / 0.663±.004
NL-75         0.364±.006 / 0.562±.004      0.363±.001 / 0.568±.010
NL-50         0.416±.001 / 0.597±.003      0.408±.002 / 0.591±.008
NL-25         0.405±.006 / 0.612±.018      0.387±.007 / 0.601±.023
NL-0          0.323±.020 / 0.520±.028      0.329±.001 / 0.555±.004
MT1 tax       0.392±.040 / 0.517±.005      0.431±.045 / 0.521±.005
MT1 health    0.378±.004 / 0.466±.006      0.378±.001 / 0.473±.004
MT2 org       0.100±.001 / 0.166±.001      0.100±.001 / 0.168±.001
MT2 sci       0.323±.002 / 0.524±.006      0.358±.013 / 0.535±.006
MT3 art       0.302±.001 / 0.459±.001      0.313±.003 / 0.470±.006
MT3 infra     0.671±.005 / 0.815±.006      0.687±.003 / 0.823±.002
MT4 sci       0.290±.003 / 0.457±.005      0.290±.001 / 0.457±.001
MT4 health    0.684±.007 / 0.771±.002      0.701±.001 / 0.781±.001
Metafam       0.989±.010 / 1.000±.000      0.885±.003 / 1.000±.000
FBNELL        0.489±.012 / 0.678±.006      0.469±.028 / 0.634±.013

Inductive e
WN-v1         0.668±.007 / 0.788±.005      0.679±.002 / 0.792±.006
WN-v2         0.296±.074 / 0.597±.040      0.684±.013 / 0.795±.006
WN-v3         0.332±.010 / 0.530±.012      0.428±.017 / 0.562±.013
WN-v4         0.622±.018 / 0.709±.003      0.635±.031 / 0.720±.016
FB-v1         0.506±.004 / 0.672±.004      0.524±.014 / 0.689±.013
FB-v2         0.524±.003 / 0.723±.004      0.537±.002 / 0.737±.002
FB-v3         0.502±.002 / 0.670±.002      0.509±.003 / 0.679±.001
FB-v4         0.487±.003 / 0.678±.003      0.494±.002 / 0.685±.005
NL-v1         0.671±.015 / 0.799±.036      0.777±.007 / 0.866±.013
NL-v2         0.526±.017 / 0.745±.007      0.561±.008 / 0.775±.010
NL-v3         0.563±.011 / 0.758±.005      0.565±.009 / 0.752±.009
NL-v4         0.463±.038 / 0.703±.030      0.512±.008 / 0.737±.008
ILPC Small    0.296±.002 / 0.445±.004      0.291±.001 / 0.438±.002
ILPC Large    0.290±.006 / 0.417±.013      0.284±.001 / 0.417±.002
HM 1k         0.029±.004 / 0.072±.006      0.017±.001 / 0.040±.002
HM 3k         0.027±.001 / 0.075±.003      0.030±.001 / 0.089±.001
HM 5k         0.027±.003 / 0.065±.001      0.028±.001 / 0.079±.001
HM Indigo     0.399±.001 / 0.614±.001      0.369±.006 / 0.587±.006

Table 12. Detailed end-to-end transductive link prediction MRR and Hits@10 for ULTRA and MOTIF(F^path_3); ± denotes the standard deviation over 3 runs.

Dataset        ULTRA (MRR / H@10)           MOTIF (MRR / H@10)

Transductive
WN18RR         0.489±.005 / 0.620±.001      0.531±.004 / 0.636±.005
FB15k237       0.352±.001 / 0.546±.002      0.327±.001 / 0.520±.001
CoDEx Medium   0.367±.003 / 0.521±.001      0.364±.068 / 0.518±.087
CoDEx Small    0.488±.001 / 0.679±.001      0.490±.003 / 0.678±.001
CoDEx Large    0.332±.002 / 0.473±.001      0.332±.001 / 0.467±.001
NELL-995       0.509±.009 / 0.646±.016      0.510±.004 / 0.655±.009
YAGO 310       0.567±.005 / 0.710±.001      0.583±.001 / 0.724±.001
WDsinger       0.413±.001 / 0.525±.001      0.421±.053 / 0.530±.051
NELL23k        0.261±.001 / 0.445±.002      0.263±.003 / 0.447±.001
FB15k237(10)   0.248±.001 / 0.398±.001      0.255±.001 / 0.411±.001
FB15k237(20)   0.272±.001 / 0.436±.002      0.271±.001 / 0.442±.001
FB15k237(50)   0.320±.002 / 0.523±.003      0.316±.002 / 0.511±.001
Hetionet       0.424±.003 / 0.553±.002      0.419±.003 / 0.548±.001

Table 13. Dataset statistics for inductive-e,r link prediction datasets. Triples are the number of edges given in the training, validation, or test graphs, respectively, whereas Valid and Test denote the triples to be predicted in the validation and test graphs.
Table 13. Dataset statistics for inductive (e, r) link prediction datasets. Triples is the number of edges given in the training, validation, or test graph, respectively, whereas Valid and Test denote the triples to be predicted in the validation and test graphs.

| Dataset | Train Ent | Train Rel | Train Tri | Valid Ent | Valid Rel | Valid Tri | Valid | Test Ent | Test Rel | Test Tri | Test |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FB-25 | 5190 | 163 | 91571 | 4097 | 216 | 17147 | 5716 | 4097 | 216 | 17147 | 5716 |
| FB-50 | 5190 | 153 | 85375 | 4445 | 205 | 11636 | 3879 | 4445 | 205 | 11636 | 3879 |
| FB-75 | 4659 | 134 | 62809 | 2792 | 186 | 9316 | 3106 | 2792 | 186 | 9316 | 3106 |
| FB-100 | 4659 | 134 | 62809 | 2624 | 77 | 6987 | 2329 | 2624 | 77 | 6987 | 2329 |
| WK-25 | 12659 | 47 | 41873 | 3228 | 74 | 3391 | 1130 | 3228 | 74 | 3391 | 1131 |
| WK-50 | 12022 | 72 | 82481 | 9328 | 93 | 9672 | 3224 | 9328 | 93 | 9672 | 3225 |
| WK-75 | 6853 | 52 | 28741 | 2722 | 65 | 3430 | 1143 | 2722 | 65 | 3430 | 1144 |
| WK-100 | 9784 | 67 | 49875 | 12136 | 37 | 13487 | 4496 | 12136 | 37 | 13487 | 4496 |
| NL-0 | 1814 | 134 | 7796 | 2026 | 112 | 2287 | 763 | 2026 | 112 | 2287 | 763 |
| NL-25 | 4396 | 106 | 17578 | 2146 | 120 | 2230 | 743 | 2146 | 120 | 2230 | 744 |
| NL-50 | 4396 | 106 | 17578 | 2335 | 119 | 2576 | 859 | 2335 | 119 | 2576 | 859 |
| NL-75 | 2607 | 96 | 11058 | 1578 | 116 | 1818 | 606 | 1578 | 116 | 1818 | 607 |
| NL-100 | 1258 | 55 | 7832 | 1709 | 53 | 2378 | 793 | 1709 | 53 | 2378 | 793 |
| Metafam | 1316 | 28 | 13821 | 1316 | 28 | 13821 | 590 | 656 | 28 | 7257 | 184 |
| FBNELL | 4636 | 100 | 10275 | 4636 | 100 | 10275 | 1055 | 4752 | 183 | 10685 | 597 |
| Wiki MT1 tax | 10000 | 10 | 17178 | 10000 | 10 | 17178 | 1908 | 10000 | 9 | 16526 | 1834 |
| Wiki MT1 health | 10000 | 7 | 14371 | 10000 | 7 | 14371 | 1596 | 10000 | 7 | 14110 | 1566 |
| Wiki MT2 org | 10000 | 10 | 23233 | 10000 | 10 | 23233 | 2581 | 10000 | 11 | 21976 | 2441 |
| Wiki MT2 sci | 10000 | 16 | 16471 | 10000 | 16 | 16471 | 1830 | 10000 | 16 | 14852 | 1650 |
| Wiki MT3 art | 10000 | 45 | 27262 | 10000 | 45 | 27262 | 3026 | 10000 | 45 | 28023 | 3113 |
| Wiki MT3 infra | 10000 | 24 | 21990 | 10000 | 24 | 21990 | 2443 | 10000 | 27 | 21646 | 2405 |
| Wiki MT4 sci | 10000 | 42 | 12576 | 10000 | 42 | 12576 | 1397 | 10000 | 42 | 12516 | 1388 |
| Wiki MT4 health | 10000 | 21 | 15539 | 10000 | 21 | 15539 | 1725 | 10000 | 20 | 15337 | 1703 |

Table 14. Dataset statistics for inductive (e) link prediction datasets. Triples is the number of edges given in the training, validation, or test graph, respectively, whereas Valid and Test denote the triples to be predicted in the validation and test graphs.

| Dataset | Rels | Train Ent | Train Tri | Valid Ent | Valid Tri | Valid | Test Ent | Test Tri | Test |
|---|---|---|---|---|---|---|---|---|---|
| FB-v1 | 180 | 1594 | 4245 | 1594 | 4245 | 489 | 1093 | 1993 | 411 |
| FB-v2 | 200 | 2608 | 9739 | 2608 | 9739 | 1166 | 1660 | 4145 | 947 |
| FB-v3 | 215 | 3668 | 17986 | 3668 | 17986 | 2194 | 2501 | 7406 | 1731 |
| FB-v4 | 219 | 4707 | 27203 | 4707 | 27203 | 3352 | 3051 | 11714 | 2840 |
| WN-v1 | 9 | 2746 | 5410 | 2746 | 5410 | 630 | 922 | 1618 | 373 |
| WN-v2 | 10 | 6954 | 15262 | 6954 | 15262 | 1838 | 2757 | 4011 | 852 |
| WN-v3 | 11 | 12078 | 25901 | 12078 | 25901 | 3097 | 5084 | 6327 | 1143 |
| WN-v4 | 9 | 3861 | 7940 | 3861 | 7940 | 934 | 7084 | 12334 | 2823 |
| NL-v1 | 14 | 3103 | 4687 | 3103 | 4687 | 414 | 225 | 833 | 201 |
| NL-v2 | 88 | 2564 | 8219 | 2564 | 8219 | 922 | 2086 | 4586 | 935 |
| NL-v3 | 142 | 4647 | 16393 | 4647 | 16393 | 1851 | 3566 | 8048 | 1620 |
| NL-v4 | 76 | 2092 | 7546 | 2092 | 7546 | 876 | 2795 | 7073 | 1447 |
| ILPC Small | 48 | 10230 | 78616 | 6653 | 20960 | 2908 | 6653 | 20960 | 2902 |
| ILPC Large | 65 | 46626 | 202446 | 29246 | 77044 | 10179 | 29246 | 77044 | 10184 |
| HM 1k | 11 | 36237 | 93364 | 36311 | 93364 | 1771 | 9899 | 18638 | 476 |
| HM 3k | 11 | 32118 | 71097 | 32250 | 71097 | 1201 | 19218 | 38285 | 1349 |
| HM 5k | 11 | 28601 | 57601 | 28744 | 57601 | 900 | 23792 | 48425 | 2124 |
| HM Indigo | 229 | 12721 | 121601 | 12797 | 121601 | 14121 | 14775 | 250195 | 14904 |
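To make the inductive (e, r) setting in Table 13 concrete: the test graph may contain entities and relations that never appear in the training graph (e.g., FB-100 uses an entirely disjoint relation vocabulary, while FB-25 replaces roughly a quarter of it). A hedged sketch of how one might verify this on a loaded split is given below; the (head, relation, tail) triple format is an assumption, and this is an illustrative check rather than the authors' data-loading code.

```python
def unseen_fractions(train_triples, test_triples):
    """Fraction of test relations and entities absent from the training graph.

    Triples are assumed to be (head, relation, tail) tuples.
    """
    train_rels = {r for _, r, _ in train_triples}
    train_ents = {e for h, _, t in train_triples for e in (h, t)}
    test_rels = {r for _, r, _ in test_triples}
    test_ents = {e for h, _, t in test_triples for e in (h, t)}
    rel_frac = len(test_rels - train_rels) / len(test_rels)
    ent_frac = len(test_ents - train_ents) / len(test_ents)
    return rel_frac, ent_frac

# On an inductive (e, r) split such as FB-100, rel_frac should be close
# to 1.0 (all test relations unseen); on FB-25, roughly 0.25.
```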
Table 15. Dataset statistics for transductive link prediction datasets. Task denotes the prediction task: h/t is predicting both heads and tails, and t is predicting only tails.

| Dataset | Entities | Rels | Train | Valid | Test | Task |
|---|---|---|---|---|---|---|
| FB15k237 | 14541 | 237 | 272115 | 17535 | 20466 | h/t |
| WN18RR | 40943 | 11 | 86835 | 3034 | 3134 | h/t |
| CoDEx Small | 2034 | 42 | 32888 | 1827 | 1828 | h/t |
| CoDEx Medium | 17050 | 51 | 185584 | 10310 | 10311 | h/t |
| CoDEx Large | 77951 | 69 | 551193 | 30622 | 30622 | h/t |
| NELL-995 | 74536 | 200 | 149678 | 543 | 2818 | h/t |
| YAGO3-10 | 123182 | 37 | 1079040 | 5000 | 5000 | h/t |
| WDsinger | 10282 | 135 | 16142 | 2163 | 2203 | h/t |
| NELL23k | 22925 | 200 | 25445 | 4961 | 4952 | h/t |
| FB15k237(10) | 11512 | 237 | 27211 | 15624 | 18150 | t |
| FB15k237(20) | 13166 | 237 | 54423 | 16963 | 19776 | t |
| FB15k237(50) | 14149 | 237 | 136057 | 17449 | 20324 | t |
| Hetionet | 45158 | 24 | 2025177 | 112510 | 112510 | h/t |

Table 16. MOTIF hyperparameters for pretraining, fine-tuning, and end-to-end training.

| | Hyperparameter | MOTIF |
|---|---|---|
| Enc | # Layers T | 6 |
| | Hidden dimension | 64 |
| | Dropout | 0.2 |
| | # Layers L | 6 |
| | Hidden dimension | 64 |
| Dec | | 2-layer MLP |
| | Dropout | 0 |
| Pre-training | Optimizer | AdamW |
| | Learning rate | 0.0005 |
| | Training steps | 40,000 |
| | Adversarial temperature | 1 |
| | # Negatives | 512 |
| | Batch size | 32 |
| | Training graph mixture | FB15k237, WN18RR, CoDEx Medium |
| Fine-tuning | Optimizer | AdamW |
| | Learning rate | 0.0005 |
| | Adversarial temperature | 1 |
| | # Negatives | 256 |
| | Batch size | 16 |
| End-to-end | Optimizer | AdamW |
| | Learning rate | 0.0005 |
| | Adversarial temperature | 1 |
| | # Negatives | 256 |
| | Batch size | 16 |

Table 17. Hyperparameters for fine-tuning and end-to-end training of MOTIF. "full" denotes a whole epoch with all batches used.

| Datasets | Finetune Epochs | Finetune Batches/Epoch | End-to-End Epochs | End-to-End Batches/Epoch |
|---|---|---|---|---|
| FB 25-100 | 3 | full | 10 | full |
| WK 25-100 | 3 | full | 10 | full |
| NL 0-100 | 3 | full | 10 | full |
| MT1-MT4 | 3 | full | 10 | full |
| Metafam, FBNELL | 3 | full | 10 | full |
| FB v1-v4 | 1 | full | 10 | full |
| WN v1-v4 | 1 | full | 10 | full |
| NL v1-v4 | 3 | full | 10 | full |
| ILPC Small | 3 | full | 10 | full |
| ILPC Large | 1 | 1000 | 10 | 1000 |
| HM 1k-5k, Indigo | 1 | 100 | 10 | 1000 |
| FB15k237 | 1 | full | 10 | 2000 |
| WN18RR | 1 | full | 10 | 2000 |
| CoDEx Small | 1 | 4000 | 10 | 4000 |
| CoDEx Medium | 1 | 4000 | 10 | 8000 |
| CoDEx Large | 1 | 2000 | 10 | 4000 |
| NELL-995 | 1 | full | 10 | 2000 |
| YAGO3-10 | 1 | 2000 | 10 | 4000 |
| WDsinger | 3 | full | 10 | 2000 |
| NELL23k | 3 | full | 10 | 4000 |
| FB15k237(10) | 1 | full | 10 | 2000 |
| FB15k237(20) | 1 | full | 10 | 2000 |
| FB15k237(50) | 1 | 1000 | 10 | 2000 |
| Hetionet | 1 | 4000 | 10 | 2000 |
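The hyperparameters in Table 16 translate naturally into a training configuration. The sketch below mirrors the reported values in a plain Python dataclass; the field names, the `MotifConfig` class, and the `finetune_overrides` dictionary are illustrative assumptions, not the released training script.

```python
from dataclasses import dataclass

@dataclass
class MotifConfig:
    # Encoder and decoder settings (Table 16).
    num_layers_t: int = 6        # message-passing layers T
    num_layers_l: int = 6        # message-passing layers L
    hidden_dim: int = 64
    dropout: float = 0.2
    decoder: str = "2-layer MLP"  # decoder dropout is 0

    # Pretraining settings.
    optimizer: str = "AdamW"
    learning_rate: float = 5e-4
    training_steps: int = 40_000
    adversarial_temperature: float = 1.0
    num_negatives: int = 512
    batch_size: int = 32
    pretrain_mixture: tuple = ("FB15k237", "WN18RR", "CoDEx Medium")

# Fine-tuning and end-to-end training keep the same optimizer and
# learning rate but use 256 negatives and batch size 16 (Table 16);
# per-dataset epochs and batches per epoch follow Table 17.
finetune_overrides = {"num_negatives": 256, "batch_size": 16}

config = MotifConfig()
print(config.hidden_dim, finetune_overrides["batch_size"])  # 64 16
```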