# holographic_node_representations_pretraining_taskagnostic_node_embeddings__cbeef5f1.pdf

Published as a conference paper at ICLR 2025

HOLOGRAPHIC NODE REPRESENTATIONS: PRE-TRAINING TASK-AGNOSTIC NODE EMBEDDINGS

Beatrice Bevilacqua Purdue University bbevilac@purdue.edu

Joshua Robinson Stanford University joshrob@cs.stanford.edu

Jure Leskovec Stanford University jure@stanford.edu

Bruno Ribeiro Purdue University ribeirob@purdue.edu

Large general purpose pre-trained models have revolutionized computer vision and natural language understanding. However, the development of general purpose pre-trained Graph Neural Networks (GNNs) lags behind other domains due to the lack of suitable generalist node representations. Existing GNN architectures are often tailored to specific task orders, such as node-level, link-level, or higher-order tasks, because different tasks require distinct permutation symmetries, which are difficult to reconcile within a single model. In this paper, we propose holographic node representations, a new blueprint for node representations capable of solving tasks of any order. Holographic node representations have two key components: (1) a task-agnostic expansion map, which produces highly expressive, high-dimensional embeddings, free from node-permutation symmetries, to be fed into (2) a reduction map that carefully reintroduces the relevant permutation symmetries to produce low-dimensional, task-specific embeddings. We show that well-constructed expansion maps enable simple and efficient reduction maps, which can be adapted for any task order. Empirical results show that holographic node representations can be effectively pre-trained and reused across tasks of varying orders, yielding up to 100% relative performance improvement, including in cases where prior methods fail entirely.

1 INTRODUCTION

The development of node embeddings capable of effectively solving multiple tasks remains an unsolved problem in graph learning (Mao et al., 2024). Just as LLMs have demonstrated the ability to generate versatile representations that can be adapted for a variety of downstream tasks (Radford et al., 2019; Gemini Team et al., 2023; Open AI, 2023), there is a growing need for graph models that can produce embeddings versatile enough to solve new tasks not foreseen at train time. Despite significant advancements in Graph Neural Networks (GNNs), most existing approaches remain highly specialized, which limits their broader applicability across diverse graph tasks.

Existing embeddings are typically designed for specific types of tasks, or task orders , and often struggle with others. For example, embeddings optimized for node classification (an order-1 task) may not work well for link prediction (an order-2 task) (You et al., 2019; Srinivasan & Ribeiro, 2020; Zhang et al., 2021), and vice-versa. This challenge extends to higher-order tasks, and embeddings intended to handle various tasks generally fail to perform well on any task. A crucial yet often overlooked reason for this failure is that different task orders possess different permutation symmetries, which are difficult to program into a single model. This order-specificity of existing models hinders the development of general pre-trained node representations that can be adapted to any graph task.

In this work, we revisit the notion of node embeddings to create holographic node representations, a new type of node representations specifically designed to be pre-trained and then adapted to any task order. The key innovation is a new treatment of node permutation symmetries. Instead of immediately introducing potentially undesirable symmetries, as is commonly done with standard permuta-

Published as a conference paper at ICLR 2025

tion equivariant GNNs (Bronstein et al., 2021), holographic node representations use the majority of their computation to learn an expansion map which produces highly-descriptive symmetry-free representations. These are then fed into a reduction map, which introduces task-specific symmetries to produce embeddings tailored for the task. The reduction map is lightweight, allowing it to be efficiently adapted and learned for each new task order. The core idea is that, regardless of the pretraining task order, a suitable reduction map can always be constructed and learned anew for new tasks of any order, leveraging the output of the expansion map as the pre-trained node embeddings.

In addition to introducing a suitable notion of representations for order-agnostic embeddings, we present the first practical instantiation of this concept, which we call Holo GNN. The key technical challenge we address is how to add and remove permutation symmetries. For the symmetry-free expansion map, we combine a standard message passing model with repeated symmetry breaking by perturbing node features, producing a sequence of flat embeddings for each node, each able to capture different information fragments. We demonstrate that this design is provably able to express eigenvectors without sign or basis ambiguities, which ensures Holo GNN is able to capture rich graph features (Belkin & Niyogi, 2003). The reduction map fuses together the sequence of representations for task-specific groups of nodes to output any-order (node-, link-, and higher-order) representations. We also prove that our combination of expansion and reduction maps produces maximally-expressive representations from a permutation symmetry perspective.

We experimentally validate our approach, showing that Holo GNN is able to learn representations that transfer between task orders consistently and significantly outperforming the baselines. For instance, on the CORA dataset, the performance of Holo GNN decreases by only 1.5% when pretrained on link prediction and adapted to node classification, while standard GNNs show a 7% performance drop, and link-prediction models such as NBFNet (Zhu et al., 2021) and SEAL (Zhang & Chen, 2018), experience a significantly larger drop of up to 41%.

In summary, our key contributions are:

(1) Introducing holographic node representations, a novel approach that allows node embeddings to be pre-trained on tasks of any order and then be efficiently adapted to solve new tasks of new orders by respecting the task-specific permutation symmetries;

(2) Presenting Holo GNN, the first practical architecture for learning holographic node representations, addressing the technical challenge of adding and removing permutation symmetries;

(3) Theoretically and empirically demonstrating that holographic node representations offer significant improvements in generalization to new tasks, providing highly expressive representations that consistently outperform existing methods, irrespective of the pre-training task order.

Notation. Let G = (A, X) be a graph, where A {0, 1}n n is the adjacency matrix and X Rn dx the node features. We write v [n] = {1, . . . , n} to denote a node in G, or equivalently v V . For node embeddings f(A, X) Rn d, f(A, X)v denotes the embedding in Rd of node v [n]. Similarly, for f(A, X) Rn T d, f(A, X)v denotes the RT d slice. We denote the permutation group on [n] by Sn, which is the collection of all bijective maps [n] [n]. In general, for tensors X R( n r) dx, we write XS Rdx for a set S [n] of cardinality r. For S, we define π S = {π(i) | i S}, and write π X to denote the permutation of the rows of X by applying π to set S indexing them, with the permutation action. Finally, 2[n] denotes the power set of [n].

2 PERMUTATION SYMMETRIES AND TASK GENERALIZATION

An often overlooked aspect of graph learning is the task-specific nature of existing embeddings, typically designed for a particular task order , which is the size of the set of nodes involved in the predictions. For example, node classification involves predictions for individual nodes (order-1), while link prediction for pairs of nodes (order-2). Since there are many important graph tasks of different orders, there is a need for powerful models producing embeddings suitable for all orders.

However current embeddings are task-order specific. One reason for this is that many existing embeddings are structural, meaning that the embeddings do not depend on the IDs of each node. Crucially, the exact permutation group describing this invariance is not universal, but is inherently dependent on the task order. This is clearly reflected in the fact that the cardinality of the required

Published as a conference paper at ICLR 2025

embeddings for structural representations is determined by the task order itself, as can be seen in the definition of structural representations. Definition 2.1 (Structural Representations). Let S be a set of r nodes of a graph G = (A, X). A function f : {0, 1}n n Rn dx R( n r) d returns structural representations of order r if f is equivariant to permutations of the nodes, that is if f(A, X)S = f(π A, π X)π S, for any π Sn.

In other words, a function returns structural representations if it produces embeddings that are equivariant under any permutation of the node ordering. Within this class, there is a clean notion of mostexpressive representation, which occurs when isomorphic node sets share the same representation (necessitated by Definition 2.1), and all non-isomorphic sets have distinct representations. Definition 2.2 (Most-Expressive Structural Representations). A function f : {0, 1}n n Rn dx R( n r) d returns most-expressive structural representations of order r if for every graphs G = (A, X), G = (A , X ) and sets of nodes S, S of cardinality r, f(A, X)S = f(A , X )S iff π Sn such that A = π A , X = π X , S = π S .

Figure 1: Nodes v2 and v3 are isomorphic; links (v1, v2) and (v4, v3) are isomorphic; links (v1, v2) and (v1, v3) are not isomorphic. However, if we aggregate most-expressive node representations as the link representation, (v1, v2) and (v1, v3) will get the same prediction. Example from Zhang et al. (2021).

A most-expressive structural representation is sufficient to solve any task of that order (Srinivasan & Ribeiro, 2020). However, in general, the most expressive structural representation of one order is not also the most expressive structural representation of another order. For instance, Srinivasan & Ribeiro (2020) and Zhang et al. (2021) prove that a most-expressive structural representation of order-1 tasks cannot serve as a most-expressive structural representation for order-2 tasks, as show in Figure 1. Because of this, it is not possible to pre-train a single structural representation that is most-expressive and therefore able to solve tasks of any order.

In addition to structural representations, positional ones, such as those derived from eigenvectors (Belkin & Niyogi, 2003; Dwivedi et al., 2023), play an important role in graph learning. Unlike structural embeddings, positional embeddings assign distinct embeddings to isomorphic node sets. While positional embeddings can be highly effective for tasks requiring the unique placement of nodes within a graph, they tend to underperform on others (Srinivasan & Ribeiro, 2020). For instance, positional embeddings often perform well in link prediction tasks, where the relative positions of the node endpoints are critical (Lim et al., 2023). However, they are less effective for node tasks, where node similarity should reflect structural similarity rather than positional proximity. In other words, current positional representations are also task specific.

Challenges for learning a single embedding for different tasks. Because structural and positional representations are task specific, they are unsuitable output candidates for pre-training a single model that produces representations capable of solving tasks of different orders. This claim can be formulated precisely for node embeddings, which are perhaps the most important class of embeddings due to their linear scalability in the graph size, which is essential when scaling to large graphs. Proposition 2.3 (Informal). For any node embedding model f, there exists two tasks of different orders for which at least one is not solvable using f.

The formal version of Proposition 2.3 is presented in Appendix B.1 as Proposition B.1. This impossibility result highlights why graph machine learning has continued to rely on specialized architectures tailored for specific tasks, unlike the more generalized pre-trained models seen in NLP and computer vision. It also clarifies that there is a need for novel node representations that extend beyond conventional node embeddings. In the next section, we present our approach bypassing this impossibility result by representing nodes with sequences of (flat) node embeddings.

3 HOLOGRAPHIC NODE REPRESENTATIONS

The previous section demonstrated the need for a new notion of node representations beyond the typical single flat embedding vector per node. Our proposal is to use holographic node representations. Holographic representations have existed outside of the graph learning context since the

Published as a conference paper at ICLR 2025

Head Link Task

Link Reduction

Node Reduction

Order-specific models Pre-trained node embeddings Pre-trained holographic node representations

Fails to generalize to higher-order tasks Trains from scratch representations for each task

Figure 2: While order-specific models must be retrained for each new task, holographic node representations can be adapted to new tasks. Pre-trained (structural) node embeddings cannot solve high-order tasks since isomorphic nodes get the same representation, making it impossible to differentiate non-isomorphic links. Instead, holographic node representations generate 2D node representations (T de), where the additional T dimension captures multiple symmetry-free graph views (so isomorphic nodes can be distinguished in at least one view). The reduction map reduces these representations back to 1D, using Λ to re-introduce the correct permutation symmetries for the task.

work of Feldman (1986) and Hinton (1990), which studied how to store complex concepts in neural networks. At one extreme was the one-concept, one-neuron idea, and at the other was holographic representations which distributed individual concepts across neurons. These holographic representations worked by producing representations conditional on different views of the data, rather than relying on a single absolute representation. In this section, we adapt this principle to graph data, proposing holographic node representations, which produce representations conditioned on different nodes. These conditional representations overcome the impossibility result in Proposition 2.3.

Holographic node representations are defined through an expansion map and a reduction map (Figure 2). The expansion map produces task-agnostic high-dimensional representations of the individual units of interest (in our case each node) that capture different permutation-sensitive projections of the object. The permutation-sensitivity of these projections is designed in a way that we can always create a reduction map (a light-weight task-specific map) that reintroduces the relevant symmetries for the task at hand, producing structural representations that match the task order. In this way, regardless of the pre-training task order, a suitable reduction map can always be learned for any new task of any order, using the output of the expansion map as the pre-trained node embeddings.

Holographic node representations start with structural node representations,

V1 = gstruc(A, X) Rn d1, (1)

where gstruc is a learnable function returning structural node representations. This function can be as simple as a single-layer GNN or even the identity function (V1 = X). Importantly, the expressiveness of our approach is not limited by that of gstruc, since lost distinctions between nodes can be reintroduced in the expansion map. But formulating V1 as a general structural representation has the benefit of allowing the forthcoming expansion map to use prior efforts to obtain rich representations. Definition 3.1 (Holographic Node Representations). Holographic node representations consist of two learnable, parameterized maps:

(1) Expansion Map:

Eθ : {0, 1}n n Rn d1 Rn T de (2[n])L NT (2)

The expansion map, parameterized by θ, takes as input the adjacency matrix and the initial structural representations, and it outputs: (a) A (T de)-dimensional representation for each of the n nodes (2D representation), denoted by Vθ(A, V1); (b) A sequence of L lists of node IDs, where nodes within each list share the same role, and nodes in different lists have distinct roles; (c) A sequence of integers, indicating how the T node representations should be grouped.

Published as a conference paper at ICLR 2025

(2) Reduction Map: Rψ : Rn T de (2[n])L NT R( n r) dr (3) The reduction map, parameterized by ψ, takes the output of the expansion map and produces 1D representations for any set of r nodes.

The mappings satisfy the following properties:

Property (1): The composition Rψ Eθ produces n r structural representations (one for each set of r nodes), i.e., π Rψ(Eθ(A, V1)) = Rψ(Eθ(π A, π V1)) for any π Sn.

Property (2): For any undirected graph G = (A, X) and isomorphic nodes u, v G, with u = v and having different neighborhoods, there exists a θ such that, Vθ(A, V1)v = Vθ(A, V1)u. That is, the expansion map Eθ can distinguish isomorphic nodes.

Note that both Eθ and Rψ are neural networks, with parameters θ, ψ, for which we introduce explicit designs in Section 4. From now on, we will write E and R instead of Eθ and Rψ for simplicity when it is unambiguous. Additionally, we denote the three outputs of the expansion map as Eθ(A, V1) = V (A, V1), (hλ(A, V1))L λ=1, Λ Rn T de (2[n])L NT .

3.1 INTERPRETING THE DEFINITION

A visual overview of holographic node representations is in Figure 2. The expansion map introduces a dimension T, yielding node-level representations V (A, V1)v RT de for each node v, instead of the typical flat embeddings in Rde. This T dimension encodes positional information (Property (2)) through T distinct views of the graph, as shown by different colored vectors in Figure 2, addressing the impossibility result for flat embeddings in Proposition 2.3.

Whilst expressive, the expansion map alone does not give suitable node representations since information is distributed along the T dimension, and does not respect any permutation symmetries. Property (1) ensures that the reduction map R combines the information across the T views, to output structural representations for any set of r nodes. Since the reduction map depends on r it is order specific (r = 1 for node, r = 2 for link, and so on), and it is learned anew for each task we aim to adapt to. The partition Λ is crucial in this process, as it specifies how the T representations should be grouped to obtain suitable structural representations.

While any expansion map E can trivially lead to valid holographic node representations by choosing a constant reduction map (e.g., R returning 0), the challenge lies in designing a reduction map that generates expressive representations. We address this in Section 4, proving the expressivity of our approach by leveraging the link between positional and structural representations, building on the existence result from Srinivasan & Ribeiro (2020). A key contribution of our work is advancing this theoretical result by designing practical architectures that efficiently perform this operation.

4 AN ARCHITECTURE FOR HOLOGRAPHIC NODE REPRESENTATIONS

This section presents Holo GNN, an architecture for learning holographic node representations. We begin by outlining how the symmetry-free expansion map is obtained through symmetry breakings, achieved by perturbing node features for selected nodes. Then, we propose two breaking methods. In Section 4.2, we describe sequential breaking, a general design inspired by numerical linear algebra methods, such as the Lanczos algorithm for computing eigenvectors (Lanczos, 1950). We formalize this connection, showing that sequential breaking generalizes eigenvector methods in a learnable and canonical way. In Section 4.3, we propose a faster, special case called parallel breaking.

4.1 OVERVIEW OF SYMMETRY BREAKINGS

The intuition behind Holo GNN is to obtain the symmetry-free expansion map E through symmetry breakings . This process breaks isomorphisms between nodes by perturbing the features of selected nodes, and then passing these modified features through a structural model. The perturbed features act as node identifiers, enabling the model to learn distinct representations for isomorphic nodes, which a structural model cannot do.

Published as a conference paper at ICLR 2025

The idea of injecting unique features to obtain positional representations is well-established in previous work (You et al., 2019; Murphy et al., 2019; Abboud et al., 2021; Sato et al., 2021; Puny et al., 2020; Eliasof et al., 2023). Holo GNN distinguishes itself by carefully injecting identifiers in a way that a reduction map can later be designed to reintroduce order-specific permutation symmetries.

Precisely, step t = 1 initializes V1 as in Equation (1). At step t > 1, new node representations Vt Rn d are produced by injecting a unique identifier, such as the standard basis vector 1v Rn, for the breaking node v, into a structural model ft (e.g., a permutation equivariant GNN). The role of the breaking node v is to obtain an embedding for v that distinguishes it from other nodes, which structural models cannot achieve. The breaking nodes are selected based on predefined functions known as node-breaking selectors, which deterministically return lists of nodes. As we will see, the partition Λ is uniquely determined by the breaking nodes, and the final structural representations are produced by combining all Vt generated by breaking nodes returned by the same breaking selector.

The node-breaking selectors are defined as (hλ)L λ=1. Each selector hλ returns a list of breaking nodes and must satisfy several properties: Definition 4.1 (Breaking selector hλ). For λ = 1, . . . , L,

(vλ,1, . . . , vλ,kλ) = hλ(A, V1) (4)

denote the λ-th list of breaking nodes, sorted by node ID. The selector hλ is permutation-equivariant, meaning that if a node v is in hλ(A, V1), then all nodes isomorphic to v in G = (A, V1) are also included. Additionally, (hλ)L λ=1 are disjoint, ensuring that no node is included in more than one list.

The breaking selectors can be either pre-defined or learnable functions of the input graphs. Valid choices include having each hλ return all nodes with the same degree, or expressive choices where each hλ returns a list of all (and only) isomorphic nodes. Appendix A.1 expands on practical choices.

Validity of Symmetry Breakings. It may seem unclear if symmetry breaking generates E(A, V1) satisfying Definition 3.1. In particular, one might question whether perturbing features for a single node v at a time is enough to produce different embeddings for all nodes in the graph. In principle it may be necessary to perturb multiple nodes at once, which would require at least quadratic number of forward passes through the structural models. Fortunately, we prove that this is not the case, and symmetry breaking with single-node perturbations produces valid expansion maps. Theorem 4.2 (Sufficiency of single-node breakings for holographic representations). For any G = (A, X) and isomorphic nodes u, v G, with u = v and having different neighborhoods, there exists E obtained by single-node symmetry breakings such that V (A, V1)v = V (A, V1)u.

4.2 SEQUENTIAL BREAKING ALGORITHM

Algorithm 1 Sequential Breaking

1: Λ1 = 1 2: λ = 2 3: t = 2 4: while t T do 5: for v hλ(A, V1) do 6: Vt = ft(A, Vt 1, . . . , V1, 1v) 7: t = t + 1 8: Λt = λ 9: end for 10: λ = λ + 1 11: end while 12: T = t

The sequential breaking algorithm introduces symmetry-breaking perturbations iteratively, where each new symmetry is broken on the output of previous symmetry breakings. At each step t > 1, new node embeddings Vt Rn de are generated by a structural model ft which takes as input: (1) the adjacency A, (2) the embeddings produced in earlier steps Vt 1, . . . , V1, and (3) a perturbation 1v on the breaking node v. The breaking node v is the next node in the sequence returned by the current breaking selector hλ, with the index λ incremented when the end of the list is reached. The pseudocode is detailed in Algorithm 1.

The algorithm runs until a specified maximum number of iterations, T . However, the total number of steps T may exceed T to ensure that the current breaking selector s list is fully exhausted. This guarantees that all nodes having the same role, and therefore in the same list, are used as breaking nodes, preventing any improper selection that could differ for isomorphic graphs, and hinder the reduction map from constructing structural representations. The expansion map is defined as:

E(A, V1) = Concat(VT , . . . , V1), (hλ(A, V1))L λ=1, Λ Rn T de (2[n])L NT , (5)

Published as a conference paper at ICLR 2025

with Concat concatenating along a new dimension, and (hλ(A, V1))L λ=1 and Λ saved for use in the reduction map. The reduction map uses this information to appropriately combine the embeddings Vt to reintroduce permutation symmetries, adapting the representations to the task order at hand.

Expressivity of sequential breaking. Algorithm 1 describes a very general class of representations, as evidenced by several well-known models that are special cases. First, all structural models, such as GIN (Xu et al., 2019) and GCN (Kipf & Welling, 2017), are special cases by setting f1 = GIN/GCN, ft = Identity for t > 1, and ignoring breaking nodes entirely. Another important special case, which we prove next, is that sequential breaking can compute eigenvectors (and functions thereof). Even more importantly, sequential breaking resolves sign and basis ambiguities inherent in eigenvectors, eliminating the need for additional techniques like sign flipping (Dwivedi et al., 2023; Kreuzer et al., 2021; Kim et al., 2022; Dwivedi et al., 2021) or invariant (Wang et al., 2022; Lim et al., 2022; Huang et al., 2024b) and equivariant (Lim et al., 2023) networks to account for them. Theorem 4.3 (Expansion Map can express canonical eigenvectors). For any symmetric matrix A, there exists an expansion map E(A, V1) obtained by sequential breaking which can express eigenvectors with no sign or basis ambiguity, and identical for isomorphic graphs up to permutations.

The proof is constructive, manually instantiating expansion maps that exactly correspond to popular algorithms for computing eigenvectors, such as the Lanczos algorithm. Because the proof is constructive, we are also able to prove that sequential breaking removes sign/basis ambiguities, which has practical advantages over explicit canonicalization (Kaba et al., 2023; Ma et al., 2023; 2024).

Having constructed an expressive expansion map, we are left to design appropriate reduction maps. Given two isomorphic graphs G1 and G2, there exists a permutation matrix BG1 G2 such that, V G2 t = BG1 G2V G1 t , t [T] (c.f., Remark B.8). A natural way to construct structural representations of order r is to treat the representations of the r nodes across these isomorphic graphs as part of a set. Specifically, the structural representation of {v1, . . . , vr} is obtained by first computing an intermediate representation at each step t, using a learnable set function ϕt, which takes as input the representations of {v1, . . . , vr} at step t in graph Gk, i.e.,

Uk,t = ϕt n BG1 Gk Vt

u {v1,...,vr}

Then, another learnable set function ρt aggregates the intermediate representations Uk,t for all Gk,

Ut,{v1,...,vr} = ρt Uk,t

Finally, the reduction map can be obtained as the list of such representations for each t,

R(E(A, V1)){v1,...,vr} = Ut,{v1,...,vr}

t [T ]. (8)

A key limitation of this approach is that it requires iterating over all possible n! isomorphic graphs to construct BG1 Gk. Appendix A.2 discusses practical approaches that use the breaking selectors to construct structural representations of order r, instead of iterating over isomorphic graphs.

4.3 PARALLEL BREAKING ALGORITHM

Algorithm 2 Parallel Breaking

1: Λ1 = λ 2: t = 2 3: λ = 2 4: while t T do 5: for v hλ(A, V1) do 6: Vt = fλ(A, V1 1v) 7: t = t + 1 8: Λt = λ 9: end for 10: λ = λ + 1 11: end while 12: T = t

In practice, breaking symmetries sequentially can be slow. To address this, we introduce parallel breaking, a faster variant of the sequential approach. Parallel breaking generates embeddings Vt using a structural model fλ that, in addition to A, takes only a perturbed version of V1 for all t, rather than all prior embeddings Vt 1, . . . , V1. In this case, the embeddings Vt for different t are only differentiated by breaking different nodes, with perturbations obtained as V1 1v, where denotes concatenation. We remark here that: (1) the structural model fλ is the same for all breaking nodes belonging to the same hλ (as opposed to differing for each t in

Published as a conference paper at ICLR 2025

the sequential breaking, because it does not see previous breakings); (2) since there is no dependency on the step t, all Vt can be computed in parallel. The pseudocode is shown in Algorithm 2.

Identically to sequential breaking, the maximum number of iterations is T , but the total number of steps T may exceed T to ensure that the current breaking selector s list hλ(A, V1) is exhausted. The expansion map is defined by concatenating the embeddings VT , . . . , V1 along a new dimension, while also collecting the breaking selectors (hλ(A, V1))L λ=1 and the partition Λ (c.f., Equation (5)).

For parallel breaking, the reduction map can be constructed in a simpler way, by aggregating representations Vt obtained from breaking nodes returned by the same hλ. Specifically, for the structural representation of {v1, . . . , vr} (i.e., of order r), we first compute an intermediate representation for each step t, using a learnable set function ϕλ, taking the representations of {v1, . . . , vr} at step t,

Uλ,t = ϕλ Vt,u

u {v1,...,vr}

Then, another learnable set function ρλ aggregates the intermediate representations Uλ,t for all t obtained by breaking nodes returned by the same breaking selector, and therefore such that Λt = λ,

Uλ,{v1,...,vr} = ρλ Uλ,t

t [T ] s.t. Λt=λ

Finally, the reduction map can be obtained as the list of such representations for each λ,

R(E(A, V1)){v1,...,vr} = Uλ,{v1,...,vr}

λ [L]. (11)

The subtlety of this formulation is in which groups of representations the reduction map must be invariant to the ordering, and which groups it is beneficial to be sensitive to the order. Specifically, the within-breaking-set representations being order-invariant (Equation (10)) is essential for R E to be structural. Distinguishing the ordering of between-breaking-set representations (Equation (11)) is necessary for the representations to be expressive. Next, we show that these choices return holographic node representations, and are maximally expressive in the sense of Definition 2.2.

4.4 HOLOGNN GIVES EXPRESSIVE HOLOGRAPHIC NODE REPRESENTATIONS

In the previous subsections, we introduced two designs for expansion and reduction maps: sequential breaking (Section 4.2) and parallel breaking (Section 4.3). Sequential breaking enables the computation of canonical eigenvectors, while parallel breaking provides a more scalable solution. Importantly, both choices define holographic node representations, as the next result shows. Theorem 4.4 (Sufficiency of single-node symmetry breakings for structural of any order). Let R be the reduction map as defined in Equations (8) and (11). When combined with its corresponding expansion map E, then R(E(A, V1)){v1,...,vr} is a structural representation of {v1, . . . , vr}.

The significance of Theorem 4.4 lies in the sufficiency of single-node symmetry breakings, which are highly efficient, to generate structural representations. Moreover, we show that these breakings produce most-expressive link structural representations (per Definition 2.2) for the parallel breaking algorithm, with a corresponding result for the sequential breaking algorithm in Appendix B.3. Theorem 4.5 (Sufficiency of single-node breakings for most-expressive link structural). Let E be the parallel breaking expansion map in Equation (5) with T = n and let R be the reduction map as defined in Equation (11) with r = 2. Assume ρλ and ϕλ injective, and fλ node-most-expressive, for every λ [L]. Then, R(E(A, V1)){v1,v2} is a most-expressive structural representation of {v1, v2}.

While maximal expressivity can be achieved for links with single-node breakings, this result can be extended by considering multi-node symmetry breakings. Theorem B.4 in Appendix B.3 shows that breaking r 1 nodes is sufficient to obtain most-expressive structural representations of order r.

5 EXPERIMENTS

We empirically study the generalization of Holo GNN between tasks of different orders performed on the same dataset. We consider a typical pre-train-adapt protocol, where Holo GNN is pre-trained on a task of one order, e.g., link-prediction, and adapted to solve a new task of a potentially different

Published as a conference paper at ICLR 2025

Link Node Link Node Link Node Worst Pretraining Task

Node Link Node Link Node Link Movie Movie User Movie User User Movie Movie

CORA Citeseer Pubmed Movie Lens

Test Metric

Pretrain on (for pretrained)

Trained Pretrained

Figure 3: Performance (the higher the better) of each model when trained on one task (yellow bar) and when pre-trained on a different task and adapted to the task (blue bar). For each model, the more yellow visible, the larger the performance loss from using pre-trained embeddings. Holo GNN shows more consistent performance and thus smaller performance loss across all datasets and tasks.

order, e.g., node-classification. During adaptation, the expansion map remains fixed, with the pretrained node embeddings retrieved from storage, while only the light-weight reduction map and a task-specific output head are trained on the new task. The protocol for other models is the same, with an MLP output head re-trained on the downstream test task on fixed pre-trained node embeddings.

We implement Holo GNN using the parallel breaking algorithm with number of symmetry breakings T 10 n. Next we present our main results and refer to Appendix E for additional experiments.1

Planetoid and Movie Lens (Sen et al., 2008; Fey & Lenssen, 2019). We compare Holo GNN against SEAL (Zhang & Chen, 2018; Li et al., 2020), NBFNet (Zhu et al., 2021), and a standard GNN with SAGE (Hamilton et al., 2017) or GCN (Kipf & Welling, 2017) layers, depending on the dataset. Notably, while both standard GNNs and Holo GNN produce pre-training node embeddings regardless of the pre-training task order, SEAL and NBFNet generate pre-training embeddings that match the order of the pre-training task, leading to significantly higher storage requirements for high-order tasks. Figure 3 (and Tables 3 and 4) shows the performance of each model when trained on a particular task (yellow bar) and when pre-trained on a different task and adapted to it (blue bar). For each model, the more yellow visible, the larger the performance loss from using pre-trained embeddings.

Holo GNN demonstrates significantly more consistent performance when compared to other methods. For instance, in the CORA dataset, the difference in performance between Holo GNN when trained on Node Classification (81.6%) and its performance when pre-trained on Link Prediction and adapted to Node Classification (80.2%) results in a performance drop of less than 1.5%. On the contrary, SAGE suffers from a loss of 7%, while NBFNet and SEAL show a more dramatic drop of 41%. Similarly, in the Movie Lens dataset, when considering the order-3 task User Movie User, the difference in performance between Holo GNN pre-trained on the task (81.1%) and its worst performance when pre-trained in any other task (72.1%, obtained pre-training on the order-2 Movie Movie) results in a performance drop of 9%. In comparison, SAGE suffers a more pronounced 22% drop, SEAL runs out of memory (OOM), and NBFNet is unable to manage tasks of order-3 (NA).

Rel Bench. We evaluate Holo GNN on Rel Bench (Robinson et al., 2024), a recently proposed benchmark of graph datasets derived from relational databases. We use the rel-stack dataset, which is obtained from several years of activity on the Stack Exchange Q&A website. We evaluate Holo GNN on (i) two node level tasks: user-engagement and user-badge, where the task is to predict if a user will be active and if a user will earn a badge (community aware) in the next 3 months, and (ii) one link-level task: user-post-comment, where the goal is to predict if a user will comment on a given post in the next 3 months. Following Robinson et al. (2024) our implementation is based on SAGE (Hamilton et al., 2017), and we include task-agnostic non-learnable baselines.

Results are shown in Table 1. Notably, when pre-training on node-level tasks (user-engagement and user-badge) and adapting to link-level (user-post-comment), SAGE completely fails

1Our code is available at https://github.com/beabevi/holognn

Published as a conference paper at ICLR 2025

Table 1: Evaluation on the rel-stack dataset from Rel Bench. The user-post-comment task is link prediction (order-2), and the other two are node classification (order-1) tasks. Holo GNN achieves superior performance when adapted to tasks of different order than the pre-training one. Results in blue indicate cases where pre-train and test tasks are the same (i.e., supervised learning).

user-post-comment user-engagement user-badge (MAP ) (AUROC ) (AUROC )

TASK-AGNOSTIC

Global Popularity 0.03 0.00 NA NA Past Visit 2.05 0.00 NA NA Random NA 50.7 0.20 50.1 0.00 Majority NA 50.1 0.20 50.0 0.00

PRE-TRAINING TASK

user-post-comment SAGE 0.11 0.05 88.2 0.05 84.6 0.05 Holo GNN 3.40 0.29 88.8 0.10 85.5 0.00

user-engagement SAGE 0.00 0.00 90.7 0.01 86.0 0.04 Holo GNN 2.38 0.33 90.6 0.01 87.0 0.03

user-badge SAGE 0.00 0.00 89.2 0.01 89.0 0.00 Holo GNN 1.02 0.21 89.4 0.05 89.0 0.05

Table 2: Link prediction AUC and runtime per epoch in the tasks of Lim et al. (2023). We show that eigenvectors obtained as output of the expansion map (Algorithm 1) results in comparable or better performance than the best performing Sign Equivariant, while being significantly faster.

Erd os-R enyi Barab asi-Albert

Model Test AUC Runtime (s) Test AUC Runtime (s)

GCN (constant input) .497 .06 .058 .00 .705 .01 .048 .00 Sign Net (Lim et al., 2022) .498 .00 .120 .00 .707 .00 .095 .00 U i,:Uj,: .570 .01 .010 .01 .597 .01 .008 .00 MLP(Ui,: Uj,:) .614 .02 .050 .00 .651 .03 .040 .00 Sign Equivariant (Lim et al., 2023) .751 .00 .063 .00 .773 .01 .054 .00

Eigenvectors via Expansion Map .750 .00 .050 .00 .816 .00 .040 .00

to generalize, achieving 0.0 MAP. Holo GNN, on the other hand, successfully transfers from nodeto link-level tasks. Similarly, when pre-training on the link-level and adapting to node-level tasks, Holo GNN suffers a smaller drop in performance than SAGE. For instance, when transferring to user-badge from user-engagement, SAGE suffers a 3% drop in performance, whilst Holo GNN suffers only 2%. Note that the improved performance of Holo GNN is not simply due to being a stronger base model. When directly training on a task, Holo GNN achieves identical performance to SAGE. This shows that the key differentiator between them is in their transfer properties.

Eigenvectors. To demonstrate the generality of holographic node representations, we employ the sequential breaking algorithm (c.f., Algorithm 1) to obtain eigenvectors, which we use in the link prediction task of Lim et al. (2022). Table 2 shows that the eigenvectors returned by holographic node representations outperform the best performing methods in a fraction of the time. Our approach surpasses the performance of eigenvectors obtained by conventional eigensolvers (MLP(Ui,: Uj,:), with Ui,: the eigenvector embedding of node i). We attribute this improvement to our eigenvectors being derived from V1 (Equation (1)) which is learned on the task. That is, holographic node representations use structural features to select the combination of sign and basis, resulting in representations that facilitate learning rather than relying on arbitrary choices of conventional eigensolvers.

6 CONCLUSION

In this work, we study the ability of node embeddings to solve different task orders, such as predictions about single nodes, pairs of nodes, or larger sets. Our first insight is that no existing model is able to solve different task orders simultaneously. We trace the source of this limitation back to the permutations symmetries possessed by different embeddings, which are task-order dependent, and therefore no single choice of symmetries suffices. To address this challenge, we propose holographic node representations, which learn symmetry-free node representations, but done carefully so that the appropriate task-order symmetries can be re-introduced when adapting to new task orders. We propose Holo GNN, the first practical model for learning holographic node representations, and show that Holo GNN leads to significant improvements when adapting to new task orders.

Published as a conference paper at ICLR 2025

Reproducibility Statement. The assumptions for our theoretical results can be found in Appendix B. The experimental details are outlined in Appendix E. Detailed steps of the method are shown in Algorithms 1 and 2, with remarks in Appendix A.

ACKNOWLEDGMENTS

BR acknowledges support from the National Science Foundation (NSF) awards CCF-1918483, CAREER IIS-1943364 and CNS-2212160, an Amazon Research Award, and Analyti XIN, Wabash Heartland Innovation Network (WHIN), Ford, NVidia, CISCO, and Amazon. Computing infrastructure was supported in part by CNS-1925001 (Cloud Bank). This work was supported in part by AMD under the AMD HPC Fund program.

Ralph Abboud, Ismail Ilkan Ceylan, Martin Grohe, and Thomas Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Proceedings of the Thirtieth International Joint Conference on Artifical Intelligence (IJCAI), 2021. 6, 30

Michael O Albertson and Karen L Collins. Symmetry breaking in graphs. the electronic journal of combinatorics, 3(1):R18, 1996. 30

Walter Edwin Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quarterly of applied mathematics, 9(1):17 29, 1951. 30

Guy Bar-Shalom, Beatrice Bevilacqua, and Haggai Maron. Subgraphormer: Unifying subgraph gnns and graph transformers via graph products. In Forty-first International Conference on Machine Learning, 2024a. 30

Guy Bar-Shalom, Yam Eitan, Fabrizio Frasca, and Haggai Maron. A flexible, equivariant framework for subgraph gnns via graph products and graph coarsening. ar Xiv preprint ar Xiv:2406.09291, 2024b. 30

Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Zhiyi Li, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum Mc Lean, Frederik Wenkel, Luis M uller, Jama Hussein Mohamud, Ali Parviz, Michael Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu, Cristian Gabellini, Kerstin Klaser, Josef Dean, Cas Wognum, Maciej Sypetkowski, Guillaume Rabusseau, Reihaneh Rabbany, Jian Tang, Christopher Morris, Mirco Ravanelli, Guy Wolf, Prudencio Tossou, Hadrien Mary, Therence Bois, Andrew W Fitzgibbon, Blazej Banaszewski, Chad Martin, and Dominic Masters. Towards foundational models for molecular learning on large-scale multi-task datasets. In The Twelfth International Conference on Learning Representations, 2024. 29

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373 1396, 2003. 2, 3

Beatrice Bevilacqua, Fabrizio Frasca, Derek Lim, Balasubramaniam Srinivasan, Chen Cai, Gopinath Balamurugan, Michael M Bronstein, and Haggai Maron. Equivariant subgraph aggregation networks. In International Conference on Learning Representations, 2022. 30

Beatrice Bevilacqua, Moshe Eliasof, Eli Meirom, Bruno Ribeiro, and Haggai Maron. Efficient subgraph gnns by learning effective selection policies. In The Twelfth International Conference on Learning Representations, 2024. 30

Lukas Biewald. Experiment tracking with weights and biases, 2020. Software available from wandb.com. 34

Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ar Xiv preprint ar Xiv:2104.13478, 2021. 2

Runjin Chen, Tong Zhao, Ajay Kumar Jaiswal, Neil Shah, and Zhangyang Wang. LLa GA: Large language and graph assistant. In Forty-first International Conference on Machine Learning, 2024a. 29

Published as a conference paper at ICLR 2025

Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. Exploring the potential of large language models (llms) in learning on graphs. ACM SIGKDD Explorations Newsletter, 25(2):42 61, 2024b. 29

Leonardo Cotta, Christopher Morris, and Bruno Ribeiro. Reconstruction for powerful graph representations. In Advances in Neural Information Processing Systems, volume 34, 2021. 30

Leonardo Cotta, Beatrice Bevilacqua, Nesreen Ahmed, and Bruno Ribeiro. Causal lifting and link prediction. In Proceedings of the Royal Society A: Mathematical, Physical, and Engineering Sciences, 2023. 17

Hejie Cui, Zijie Lu, Pan Li, and Carl Yang. On positional and structural node features for graph neural networks on non-attributed graphs. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 3898 3902, 2022. 30

Jan JM Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numerische Mathematik, 36:177 195, 1980. 22, 23

George Dasoulas, Ludovic Dos Santos, Kevin Scaman, and Aladin Virmaux. Coloring graph neural networks for node disambiguation. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 2126 2132, 2021. 18

Inderjit Singh Dhillon. A new O(n2) algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. University of California, Berkeley, 1997. 23

Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021. 30

Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Graph neural networks with learnable structural and positional representations. In International Conference on Learning Representations, 2021. 7

Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1 48, 2023. 3, 7, 30, 31

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. ar Xiv preprint ar Xiv:2209.10652, 2022. 29

Moshe Eliasof, Fabrizio Frasca, Beatrice Bevilacqua, Eran Treister, Gal Chechik, and Haggai Maron. Graph positional encoding via random feature propagation. In International Conference on Machine Learning, pp. 9202 9223. PMLR, 2023. 6

Torsten Fahle, Stefan Schamberger, and Meinolf Sellmann. Symmetry breaking. In Principles and Practice of Constraint Programming CP 2001: 7th International Conference, CP 2001 Paphos, Cyprus, November 26 December 1, 2001 Proceedings 7, pp. 93 107. Springer, 2001. 30

Jerome Feldman. Neural representation of conceptual knowledge. 1986. 4

Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. ar Xiv preprint ar Xiv:1903.02428, 2019. 9, 34

Fabrizio Frasca, Beatrice Bevilacqua, Michael M Bronstein, and Haggai Maron. Understanding and extending subgraph gnns by rethinking their symmetries. In Advances in Neural Information Processing Systems, 2022. 30

Mikhail Galkin, Xinyu Yuan, Hesham Mostafa, Jian Tang, and Zhaocheng Zhu. Towards foundation models for knowledge graph reasoning. In The Twelfth International Conference on Learning Representations, 2024. 29

Jianfei Gao, Yangze Zhou, Jincheng Zhou, and Bruno Ribeiro. Double equivariance for inductive link prediction for both new nodes and new relation types. ar Xiv preprint ar Xiv:2302.01313, 2023. 29

Published as a conference paper at ICLR 2025

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. ar Xiv preprint ar Xiv:2312.11805, 2023. 1

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. 9, 35

Yufei He and Bryan Hooi. Unigraph: Learning a cross-domain graph foundation model from natural language. ar Xiv preprint ar Xiv:2402.13630, 2024. 29

Geoffrey E Hinton. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46(1-2):47 75, 1990. 4

Zhenyu Hou, Haozhan Li, Yukuo Cen, Jie Tang, and Yuxiao Dong. Graphalign: Pretraining one graph neural network on multiple graphs via feature alignment. ar Xiv preprint ar Xiv:2406.02953, 2024. 29

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. ar Xiv preprint ar Xiv:2005.00687, 2020. 33, 36

Qian Huang, Hongyu Ren, Peng Chen, Gregor Krˇzmanc, Daniel Zeng, Percy S Liang, and Jure Leskovec. Prodigy: Enabling in-context learning over graphs. Advances in Neural Information Processing Systems, 36, 2024a. 29

Yinan Huang, William Lu, Joshua Robinson, Yu Yang, Muhan Zhang, Stefanie Jegelka, and Pan Li. On the stability of expressive positional encodings for graphs. In The Twelfth International Conference on Learning Representations, 2024b. 7

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2017. 18

S ekou-Oumar Kaba, Arnab Kumar Mondal, Yan Zhang, Yoshua Bengio, and Siamak Ravanbakhsh. Equivariance with learned canonicalization functions. In International Conference on Machine Learning, pp. 15546 15566. PMLR, 2023. 7

Jinwoo Kim, Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, and Seunghoon Hong. Pure transformers are powerful graph learners. Advances in Neural Information Processing Systems, 35:14582 14595, 2022. 7, 30

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. 7, 9, 35, 36

Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent L etourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618 21629, 2021. 7, 30

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012. 29

Divyansha Lachi, Mehdi Azabou, Vinam Arora, and Eva Dyer. Graphfm: A scalable framework for multi-graph pretraining. ar Xiv preprint ar Xiv:2407.11907, 2024. 29

Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. 1950. 5, 30

Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding design provably more powerful gnns for structural representation learning. ar Xiv preprint ar Xiv:2009.00142, pp. 61, 2020. 9, 30

Derek Lim, Joshua David Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, and Stefanie Jegelka. Sign and basis invariant networks for spectral graph representation learning. In The Eleventh International Conference on Learning Representations, 2022. 7, 10

Published as a conference paper at ICLR 2025

Derek Lim, Joshua Robinson, Stefanie Jegelka, and Haggai Maron. Expressive sign equivariant networks for spectral geometric learning. In Advances in Neural Information Processing Systems, 2023. 3, 7, 10, 35

Hao Liu, Jiarui Feng, Lecheng Kong, Ningyue Liang, Dacheng Tao, Yixin Chen, and Muhan Zhang. One for all: Towards training one graph model for all classification tasks. In The Twelfth International Conference on Learning Representations, 2024. 29

George Ma, Yifei Wang, and Yisen Wang. Laplacian canonization: A minimalist approach to sign and basis invariant spectral embedding. Advances in Neural Information Processing Systems, 36, 2023. 7

George Ma, Yifei Wang, Derek Lim, Stefanie Jegelka, and Yisen Wang. A canonization perspective on invariant and equivariant learning. ar Xiv preprint ar Xiv:2405.18378, 2024. 7

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017. 18

Haitao Mao, Zhikai Chen, Wenzhuo Tang, Jianan Zhao, Yao Ma, Tong Zhao, Neil Shah, Mikhail Galkin, and Jiliang Tang. Position: Graph foundation models are already here. In Forty-first International Conference on Machine Learning, 2024. 1, 29

Christopher Morris, Yaron Lipman, Haggai Maron, Bastian Rieck, Nils M Kriege, Martin Grohe, Matthias Fey, and Karsten Borgwardt. Weisfeiler and leman go machine learning: The story so far. The Journal of Machine Learning Research, 24(1), 2023. 18, 29

Luis M uller, Mikhail Galkin, Christopher Morris, and Ladislav Ramp aˇsek. Attending to graph transformers. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. 30

Ryan Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Relational pooling for graph representations. In International Conference on Machine Learning, 2019. 6, 18, 30

Open AI. Gpt-4 technical report. ar Xiv preprint ar Xiv:2303.08774, 2023. 1

P al Andr as Papp and Roger Wattenhofer. A theoretical comparison of graph neural network extensions. In International Conference on Machine Learning, 2022. 30

P al Andr as Papp, Karolis Martinkus, Lukas Faber, and Roger Wattenhofer. Dropgnn: Random dropouts increase the expressiveness of graph neural networks. Advances in Neural Information Processing Systems, 34:21997 22009, 2021. 30

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, 2019. 34

Omri Puny, Heli Ben-Hamu, and Yaron Lipman. Global attention improves graph networks generalization. ar Xiv preprint ar Xiv:2006.07846, 2020. 6, 30

Chendi Qian, Gaurav Rattan, Floris Geerts, Christopher Morris, and Mathias Niepert. Ordered subgraph aggregation networks. In Advances in Neural Information Processing Systems, volume 35, 2022. 30

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. 1, 29

Ladislav Ramp aˇsek, Mikhail Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a General, Powerful, Scalable Graph Transformer. Advances in Neural Information Processing Systems, 35, 2022. 30, 34

Joshua Robinson, Rishabh Ranjan, Weihua Hu, Kexin Huang, Jiaqi Han, Alejandro Dobles, Matthias Fey, Jan E Lenssen, Yiwen Yuan, Zecheng Zhang, et al. Relbench: A benchmark for deep learning on relational databases. ar Xiv preprint ar Xiv:2407.20060, 2024. 9

Published as a conference paper at ICLR 2025

Yousef Saad. Numerical methods for large eigenvalue problems: revised edition. SIAM, 2011. 35

Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural networks. In Proceedings of the 2021 SIAM international conference on data mining (SDM), pp. 333 341. SIAM, 2021. 6, 30

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93 93, 2008. 9

Yangyi Shen, Beatrice Bevilacqua, Joshua Robinson, Charilaos Kanatsoulis, Jure Leskovec, and Bruno Ribeiro. Zero-shot generalization of gnns over distinct attribute domains. In ICML 2024 Workshop on Theoretical Foundations of Foundation Models, 2024. 29

Balasubramaniam Srinivasan and Bruno Ribeiro. On the equivalence between positional node embeddings and structural graph representations. In Proceedings of the International Conference on Learning Representations, 2020. 1, 3, 5, 17, 20, 25

Maciej Sypetkowski, Frederik Wenkel, Farimah Poursafaei, Nia Dickson, Karush Suri, Philip Fradkin, and Dominique Beaini. On the scalability of gnns for molecular graphs. In ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models, 2024. 29

Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. Graphgpt: Graph instruction tuning for large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 491 500, 2024. 29

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, et al. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. 29

Haorui Wang, Haoteng Yin, Muhan Zhang, and Pan Li. Equivariant and stable positional encoding for more powerful graph neural networks. In International Conference on Learning Representations, 2022. 7

Lianghao Xia and Chao Huang. Anygraph: Graph foundation model in the wild. ar Xiv preprint ar Xiv:2408.10700, 2024. 29

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. 7

Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pp. 40 48. PMLR, 2016. 35

Haoteng Yin, Yanbang Wang, and Pan Li. Revisiting graph neural networks and distance encoding from a practical view. ar Xiv preprint ar Xiv:2011.12228, 2020. 30

Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in neural information processing systems, 34:28877 28888, 2021. 30

Jiaxuan You, Rex Ying, and Jure Leskovec. Position-aware graph neural networks. In International conference on machine learning, pp. 7134 7143. PMLR, 2019. 1, 6, 30

Bohang Zhang, Guhao Feng, Yiheng Du, Di He, and Liwei Wang. A complete expressiveness hierarchy for subgraph gnns via subgraph weisfeiler-lehman tests. In International Conference on Machine Learning, 2023. 30

Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information processing systems, 31, 2018. 2, 9

Muhan Zhang and Pan Li. Nested graph neural networks. In Advances in Neural Information Processing Systems, volume 34, 2021. 30

Published as a conference paper at ICLR 2025

Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems, 34:9061 9073, 2021. 1, 3, 20

Jianan Zhao, Le Zhuo, Yikang Shen, Meng Qu, Kai Liu, Michael Bronstein, Zhaocheng Zhu, and Jian Tang. Graphtext: Graph reasoning in text space. ar Xiv preprint ar Xiv:2310.01089, 2023. 29

Jianan Zhao, Hesham Mostafa, Michael Galkin, Michael Bronstein, Zhaocheng Zhu, and Jian Tang. Graphany: A foundation model for node classification on any graph. ar Xiv preprint ar Xiv:2405.20445, 2024. 29

Lingxiao Zhao, Wei Jin, Leman Akoglu, and Neil Shah. From stars to subgraphs: Uplifting any GNN with local structure awareness. In International Conference on Learning Representations, 2022. 30

Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, et al. Predicting equilibrium distributions for molecular systems with deep learning. Nature Machine Intelligence, pp. 1 10, 2024. 29

Jincheng Zhou, Beatrice Bevilacqua, and Bruno Ribeiro. A multi-task perspective for link prediction with new relation types and nodes. In Neur IPS 2023 Workshop: New Frontiers in Graph Learning, 2023. 29

Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. Neural bellman-ford networks: A general graph neural network framework for link prediction. Advances in Neural Information Processing Systems, 34:29476 29490, 2021. 2, 9, 35

Published as a conference paper at ICLR 2025

A ADDITIONAL REMARKS AND TECHNICAL DETAILS ON HOLOGRAPHIC NODE REPRESENTATIONS

In this section, we provide supplementary information, technical clarifications, and remarks on holographic node representations and its practical instantiation with Holo GNN. We start by discussing in detail the definition of holographic node representations (c.f.. Definition 3.1).

Properties (1) and (2) are essential to ensuring that holographic node representations can be trained in a task-order agnostic way, whilst still being able to map representations back to order-specific representations. Property (1) guarantees final task-specific representations are structural of order r; and Property (2) ensures that the output of the expansion map consists of T positional, and thus symmetry-free, representations, which are necessary for task-agnostic node representations due to the impossibility result from Proposition 2.3.

We further remark that Property (1) in Definition 3.1 ensures that the reduction map returns structural representations for sets rather than tuples of r nodes, aligning with the observation made by Srinivasan & Ribeiro (2020) that structural representations should be defined for sets of nodes. The rationale behind this is that it is not possible to distinguish between isomorphic nodes2 that occupy different positions within a tuple, i.e. (v1, v2, . . . vr) and (v2, v1, . . . vr) if v1 and v2 are isomorphic. Therefore all tuples containing the same nodes should be considered as the same set.

Property (2) in Definition 3.1 ensures that the expansion map E returns T positional representations for each node, where each positional representation correspond to a view of the graph. Importantly, in the case of sequential breaking, these positional representations can be T (canonical) eigenvectors of any symmetric matrix of the graph (Theorem 4.3). Notably, encompassing eigenvectors has an additional implication, which is made clear in Definition 3.1 (Property 2). Specifically, since returning eigenvectors of AAT and AT A represents a valid expansion map output, for any undirected graph, if u and v, u = v, are isomorphic nodes with the same neighborhood, then V (A, V1)v = V (A, V1)u (Cotta et al., 2023, Theorem 4). Therefore, we exclude this case by requiring Property (2) to hold for isomorphic nodes that do not have the same neighborhood.

While Definition 3.1 provides a general definition, Holo GNN represents a practical instantiation, which obtains the symmetry-free expansion map by breaking symmetries for selected nodes chosen by breaking selectors. Intuitively, the role of the node breaking selectors is to determine the breaking nodes to be used in the expansion map, ensuring that positional representations are constructed in a way that allows the reduction map to generate structural representations. In other words, the final structural representations are produced by combining positional representations generated by breaking nodes returned by the same breaking selector (see e.g., Equation (10)). This ensures that the final representations are the same for isomorphic node sets (and therefore they are structural). In contrast, randomly selecting breaking nodes without node breaking selectors would not ensure the same final representations for isomorphic sets, nor would it provide the same representations across different runs or isomorphic graphs.

Definition 4.1 outlines the properties of the breaking selectors, which are necessary for designing an appropriate reduction map capable of returning structural representations. Importantly, Definition 4.1 can be specialized to define most-expressive selectors, as we present next. These mostexpressive selectors are important because are those employed in the sequential breaking algorithm for obtaining eigenvectors, as we shall see in Appendix B.2.

Definition A.1 (Most-expressive breaking selector h λ). A most-expressive selector is a selector, as defined in Definition 4.1, that is most-expressive based on the graph topology. That is, for any two nodes u, v G = (A, V1), then u, v h λ(A, V1) u, u isomorphic in G based only on the graph topology A.

In other words, when using h λ then vλ,1, . . . , vλ,k λ are isomorphic nodes in A. Importantly, each hλ returns a different list (a different group of isomorphic nodes). This is always possible as nonisomorphic nodes can be distinguished by structural properties (e.g., differing degrees) as otherwise they would be isomorphic which a most-expressive selector is able to capture by definition.

2Two nodes u and v are isomorphic if there exists a permutation of nodes that swaps u and v such that the permuted adjacency and feature matrices remain identical.

Published as a conference paper at ICLR 2025

A.1 NODE BREAKING SELECTORS

Having established that the expansion map produces highly expressive symmetry-free representations, we now specify how the breaking nodes are chosen. In principle any structural rule may be used to construct the breaking selectors (Definition 4.1), as long as the choice is appropriately coupled with the definition of Λ to recover from these choices when constructing structural representations (Definition 3.1). In the following, we discuss the breaking rules we consider:

(1) Node degree. Each breaking selector hλ returns all nodes that share the same degree3. That is, we consider h1 to return all nodes with the highest degree, h2 to return nodes with the secondhighest degree, and so forth (inverse sorting is also a possibility). Nodes returned by the same breaking selector are sorted by node ids, as per Definition 4.1. In this case, Λ ensures that Λt remains identical across all t for breaking nodes with the same degree.

(2) Learned structural distribution. Given V1, define a node-wise score s = V1w Rn with learnable weights w Rd. Then, sample k n nodes without replacement according to the distribution Softmax(s). For the learned distribution we need to compute gradients through the random sample. For this we use the straight-through Gumbel-Softmax estimator (Jang et al., 2017; Maddison et al., 2017), to allow differentiable sampling (of k nodes) from the distribution. This rule results in a single breaking selector h1 containing all k sampled nodes. While this rule is not permutation-equivariant in general, it can still be useful for large graphs, where there usually are no isomorphic nodes. During inference, instead of sampling, we directly select the top k nodes based on their scores. If there is a tie at the end (i.e., the k-th and the k + 1-th nodes have the same score), we include all nodes with the tied score, resulting in a list of at least k nodes returned by h1. In this way, h1 is permutation equivariant in test.

(3) Node ids, when T = n. The simplest rule involves a single breaking selector, h1, that returns all nodes, sorted by their IDs. In this case, T = n because the algorithm requires iterating through the entire list of nodes before termination, making this choice less scalable than others. Furthermore, Λt = 1 for all t, reflecting the uniform treatment of all nodes.

(4) Most expressive. Each breaking selector hλ returns a list of nodes that are isomorphic to each other (Definition A.1). Recall that each hλ must return a different list (a different group of isomorphic nodes), which is always possible as non-isomorphic nodes can be distinguished by structural properties. Practical choices include employing most-expressive GNNs, such as Murphy et al. (2019); Dasoulas et al. (2021), or also expressive GNNs whose expressivity level is sufficient to distinguish all graphs in the family of graphs considered in the tasks of interest. We refer the reader to Morris et al. (2023) for a review of expressive methods.

Finally, we remark here that there are many other valid rules for selecting breaking nodes, whose exploration we leave for future work.

A.2 A PRACTICAL SEQUENTIAL BREAKING REDUCTION MAP

Section 4.2 in the main paper presents an instantiation of the reduction map for the sequential breaking algorithm, which leverages the representations of {v1, . . . , vr} in all graphs isomorphic to the input one. A key limitation of this approach is that it requires iterating over each possible isomorphic graph Gk to construct BGk, which is then used in Equation (6) to get the representation of {v1, . . . , vr} in Gk. An alternative strategy is to consider the representations of all r sets isomorphic to {v1, . . . , vr}, instead of the representations of {v1, . . . , vr} for all possible Gk. However, identifying all isomorphic sets can be computationally intensive. A reasonable compromise is to consider all sets of r nodes where each node is isomorphic to a different node in {v1, . . . , vr}. These include the sets isomorphic to {v1, . . . , vr}, as well as those that are not, even if their nodes are in a one-to-one mapping.

Specifically, these sets can be obtained from our breaking selectors (hλ(A, V1))L λ=1. Each set svi,w is generated by replacing node vi in {v1, . . . , vr} with another node w taken from the same breaking selector list, i.e.,

svi,w = {v1, . . . , vi 1, w, vi 1, . . . vr}.

3Any other structural property can be used in this case, e.g., betweenness centrality.

Published as a conference paper at ICLR 2025

where vi {v1, . . . , vr} and vi, w hλ(A, V1) for some λ [L]. We denote the collection of these sets as S{v1,...,vr}, that is,

S{v1,...,vr} = {svi,w} vi {v1,...,vr}, w G s.t. w,vi hλ(A,V1).

Since (hλ(A, V1))L λ=1 is structural by definition (c.f., Definition 4.1), the above strategy allows us to construct the set of sets still containing all isomorphic sets to {v1, . . . , vr}, although S{v1,...,vr} also contains more sets not isomorphic to {v1, . . . , vr}. To obtain a structural representation of {v1, . . . , vr}, we first obtain an intermediate representation for each set svi,w,

Uλ,svi,w = ϕλ Vt,u

Then, we aggregate these intermediate representations for all sets in S{v1,...,vr}, that is

Uλ,t = ψλ Uλ,svi,w

svi,w S{v1,...,vr}

Next, we aggregate these intermediate representations for all t generated by breaking nodes returned by the same breaking selector,

Uλ,{v1,...,vr} = ρλ Uλ,t

t [T ] s.t. Λt=λ

Finally, the reduction map can then be expressed as

R(E(A, X)){v1,...,vr} = (Uλ,{v1,...,vr}) λ [L]. (15)

This appendix includes the proofs for the theoretical results presented in Sections 2 and 4. We start by formally proving the impossibility of having a single embedding capable of solving different task orders, particularly node and link predictions. Then, we discuss the results related to the expansion map, including its connection to canonical eigenvectors. Lastly, we present the proofs concerning the reduction map.

B.1 SECTION 2 PROOFS

We begin by formalizing the challenges related to node embeddings and their use for solving different tasks. Let f(A, X) Rn d denote node embeddings. We define a node-level task predictor and a link-level task predictor, respectively, as

b Y (A, X)u = MLPnode(f(A, X)u), b Y (A, X)uv = MLPlink(f(A, X)u, f(A, X)v).

In practice, MLPnode and MLPlink correspond to different heads on top of the single embeddings f(A, X). This implies that f(A, X) is used for solving two different tasks, a node-level task, which we denote by Tnode, and link-level task, which we denote by Tlink. We define the respective test errors as:

LTnode(Dnode) = E(A,X),Yu Dnode[ℓTnode(b Y (A, X)u, Yu)],

LTlink(Dlink) = E(A,X),Yuv Dlink[ℓTlink(b Y (A, X)uv, Yuv)],

with ℓTnode, ℓTlink appropriate loss functions, and denoting the test minima as L Tnode and L Tlink. In the following, we formally prove that f(A, X) Rn d is not sufficient to accurately solve both tasks. Proposition B.1 (Impossibility of accurate any-order task learning from node embeddings). Consider simultaneously performing two tasks, Tnode and Tlink, using node embeddings f(A, X) Rn d. There exist Tnode and Tlink such that, for any MLPnode and MLPlink achieving the training minima, no f that produces either positional or structural representations can simultaneously satisfy the following two conditions: (1) LTnode(Dnode) = L Tnode; (2) LTlink(Dlink) = L Tlink. That is, when using standard (flat) node embeddings, the predictions cannot be simultaneously accurate (in test) for both tasks.

Published as a conference paper at ICLR 2025

Proof. The proof proceeds by showing that, for any f, it is always possible to construct either Tnode or Tlink, such that one of the conditions does not hold. Note that f returns node embeddings, which can either be equivariant to node permutations, i.e., structural, or not, i.e., positional. In the following, we will prove the theorem for these two cases separately.

Assume that f is equivariant to node permutations, and therefore structural. Then, by definition, b Yu and b Yuv are also invariant to node permutations. It follows from Srinivasan & Ribeiro (2020); Zhang et al. (2021) that there exists a link-level task Tlink such that LTlink(Dlink) = L Tlink. Specifically, consider the link-level task consisting in predicting in test the two non-isomorphic links (v1, v2) and (v1, v3) in Figure 1, and assume that those have different labels. Since f is equivariant to node permutations, the isomorphic nodes v2 and v3 will have the same representation, and therefore (v1, v2) and (v1, v3) will receive the same prediction, despite having different labels. This implies that LTlink(Dlink) = L Tlink, since the test error can be reduced with correct predictions for these two links.

Assume that f is not equivariant to node permutations, and therefore positional. Consider a nodelevel task Tnode on a training graph and consider a node u. Now, assume that the test graph is composed of two disconnected copies to the training graph. This means that there are two nodes isomorphic to u in the test graph. However their representations will be different by definition of positional encoding. Since the representation seen in training for u will correspond to only one of two representations seen in test, the representation of one of the test nodes may lead to incorrect predictions. Therefore LTnode(Dnode) = L Tnode.

Therefore, we have proved that for any f returning node embeddings, there exists a task such that one condition does not hold, which concludes our proof.

The above result justifies the need for our holographic node representations (Definition 3.1), as simply using node embeddings, coupled with different heads for different task orders, cannot result in accurate predictions for different task orders.

B.2 SECTION 4 EXPANSION MAP PROOFS

In the following, we present the theoretical results related to the expansion map, as introduced in Section 4.

We start by proving that single-node breakings are sufficient for obtaining the expansion mappings within holographic node representations.

Theorem 4.2 (Sufficiency of single-node breakings for holographic representations). For any G = (A, X) and isomorphic nodes u, v G, with u = v and having different neighborhoods, there exists E obtained by single-node symmetry breakings such that V (A, V1)v = V (A, V1)u.

Proof. To prove the sufficiency of single-node breakings for a valid expansion map, we consider E to be the expansion map of the sequential breaking algorithm. Since the parallel breaking algorithm represents a special case of the sequential algorithm, the result will naturally extend to the parallel case as well.

By assumption, u and v have different neighborhoods, and therefore there exists at least one node w G that is neighbor of u but not of v (or vice-versa). Additionally, since the breaking selectors are non-overlapping (Definition 4.1) and assuming T = n, there exists a step t such that the breaking node is w. Then, it is sufficient to consider the structurally invariant model ft to perform one message passing step (with identity weights) using the perturbation 1w, in which case u will obtain a different representation than v, as u is connected to w, that had perturbed features. Therefore, there exists a t in which ft(A, Vt 1, . . . , V1, 1w)u = ft(A, Vt 1, . . . , V1, 1w)v. Since E is obtained by concatenating the outputs of the different ft (Equation (5)), then V (A, V1)v = V (A, V1)u, which concludes our proof.

Next, we discuss one of the main results of this section, which is to show that there exists a sequential-breaking expansion map that can express eigenvectors with no sign or basis ambiguities. We consider symmetric matrices, which includes the normalized Laplacian matrix and the undirected adjacency matrix.

Published as a conference paper at ICLR 2025

Theorem 4.3 (Expansion Map can express canonical eigenvectors). For any symmetric matrix A, there exists an expansion map E(A, V1) obtained by sequential breaking which can express eigenvectors with no sign or basis ambiguity, and identical for isomorphic graphs up to permutations.

We prove this result by constructing an explicit expansion map with T = n for which the embeddings V1, . . . , Vn Rn 1 correspond to full set of n Lanczos vectors of the symmetric matrix A (which can be the Laplacian). We then show how these Lanczos vectors can be used to obtain the full set of n eigenvectors. Crucially, we demonstrate that both the Lanczos vectors and the eigenvectors are free of sign and basis ambiguities and are permutation equivariant.

It is important to note that, while the expansion map may return V1, . . . , Vn+1, the set of Lanczos vectors corresponds to V1, . . . , Vn. This is because the initial V1 is also one of the vectors (although obtained without breaking any node), and Vn+1 will be the zero vector (as there are only n orthogonal vectors in Rn).

A key intermediate step is to prove that sequential breaking can express the Lanczos algorithm, which computes the Lanczos vectors. We then demonstrate how to derive the symmetric tridiagonal matrix, that is one of the terms of the Lanczos algorithm, from the Lanczos vectors and the original matrix A using a simple permutation-equivariant function. This step is crucial, as the tridiagonal matrix shares the same eigenvalues as the original matrix A, and the eigenvectors of the original matrix can be obtained from those of the tridiagonal matrix through a simple matrix multiplication. Once the tridiagonal matrix is constructed, many efficient eigensolvers can be applied to produce its eigenvectors, and consequently those of the original matrix.

Before stating the intermediate Lanczos result, we first introduce Lanczos algorithm in detail.

B.2.1 LANCZOS ALGORITHM

Lanczos Algorithm takes as input a symmetric matrix A Rn n, and returns a matrix V of orthogonal columns (the Lanczos vectors) and a symmetric tridiagonal matrix T with the same eigenvalues as A. Lanczos has four main steps:

1. Select an initial arbitrary vector v1 Rn with unit Euclidean norm.

2. Initial iteration:

(a) Let w 1 = Av1. (b) Let α1 = w T 1 v1. (c) Let w1 = w 1 α1v1.

3. For t = 2, . . . , n do:

(a) Let βt = wt 1 (also Euclidean norm). (b) If βt = 0, then let vt = wt 1/βt, else pick as vt an arbitrary vector with Euclidean norm 1 that is orthogonal to all of v1, . . . , vt 1. (c) Let w t = Avt. (d) Let αt = w t vt. (e) Let wt = w t αtvt βtvt 1.

4. Let V be the matrix with columns v1, . . . , vn. Let

α1 β2 0 β2 α2 β3

β3 α3 ... ... ... βn 1 βn 1 αn 1 βn 0 βn αn

Note that it is easy to obtain the tridiagonal matrix T from the matrix of Lanczos vectors V , as T = V T AV .

Published as a conference paper at ICLR 2025

Because it is easy to obtain the tridiagonal matrix T (from which it is easy to obtain eigenvectors, see Cuppen (1980)) from the Lanczos vectors, the precise statement of our intermediate result is that sequential breaking can return the Lanczos vectors. The proof of the full result (Theorem 4.3) then simply extends this by constructing a function of the Lanczos vectors and the matrix A to additionally compute T , compute the eigenvectors of T , and derive those of A. We will show that all these steps are free from sign and basis ambiguities, and additionally prove that the Lanczos vectors (and consequently the tridiagonal and the eigenvectors) are permutation equivariant. Theorem B.2 (Sequential breaking can implement Lanczos Algorithm). For any symmetric matrix A {0, 1}n n there exists an n-step expansion map E(A, X) = [Vn, . . . , V1], where Vj is the jth Lanczos vector of A.

Proof. To prove that the sequential breaking algorithm can implement the Lanczos algorithm, we need to define the functions (hλ)L λ=1 and ft for t [n] in a way that matches the steps of the Lanczos algorithm. Let s break this down step by step:

For (hλ)L λ=1 we consider most-expressive breaking selectors (Definition A.1), which partition the set of nodes into isomorphism equivalence classes. We denote the breaking node of the t-step as vt.

Additionally, we take the embedding dimension d = 1, as we want each Vt to correspond to a Lanczos vector.

Now, we are ready to define Vt = ft(A, Vt 1, . . . , V1, 1v), as follows:

1. For t = 1: V1 = gstruc(A, X) Rn, the (normalized) vector obtained from a structural function (Equation (1)). In this particular case, we assume that V1 has norm 1 (as required by the Lanczos algorithm), and we further assume that gstruc discards node features X, because eigenvector methods do not use node features but only the matrix A.

2. For t = 2:

(a) w 1 = AV1 (b) α1 = V T 1 w 1 (c) w1 = w 1 α1V1 (d) β2 = w1

(e) If β2 = 0: V2 = w1

β2 (f) Else: V2 = 1v1 proj V1(1v1)

1v1 proj V1 Rn where proj denotes the Gram-Schmidt projection operator of 1vt onto the span of V1. Since V1 is a single vector, this is simply V2 = 1vt (1 vt V1)V1 1vt (1 vt V1)V1 .

3. If t > 2:

(a) w t 1 = AVt 1 (b) αt 1 = V T t 1w t 1 (c) wt 1 = w t 1 αt 1Vt 1 βt 1Vt 2 (d) βt = wt 1

(e) If βt = 0: Vt = wt 1

(f) Else: Vt = 1vt proj V1,...Vt 1(1vt)

1vt proj V1,...Vt 1(1vt) Rn, the orthonormalization of 1vt with respect

to the span of V1, . . . Vt 1.

Note that in each case this definition of ft is a composition of the following permutation equivariant functions: linear transformations, dot products, norms, linear combinations, and conditional statements. Because of this, ft is also permutation equivariant as required in Section 4.2.

It is not at first clear that this choice of ft is well defined. The danger comes from the Gram-Schmidt orthogonal projection 1vt proj V1,...Vt 1(1vt), which must be non-zero so that it can be normalized. For now let us assume that this ft is well defined in order to show that the rest of the proof goes through. Then, at the end of the proof we will return to this matter.

Published as a conference paper at ICLR 2025

With these choices the sequential breaking algorithm (Algorithm 1) becomes step-by-step identical to the Lanczos algorithm, with a particular instantiation of the arbitrary breaking vectors, which in the sequential breaking algorithm are not random vectors as in Lanczos, but are instead defined as V1 = gstruc(A, X) Rn and as 1vt, with vt the breaking node.

This formulation of the sequential breaking algorithm matches the Lanczos algorithm step by step. The key points to note are:

1. The function ft is defined to perform the exact same operations as the Lanczos algorithm for each t.

2. When βt = 0, we choose Vt = 1vt proj V1,...Vt 1(1vt)

1vt proj V1,...Vt 1(1vt) , which for appropriate choices of

most-expressive breaking selectors is guaranteed to be a unit vector orthogonal to previous vectors due to the Gram-Schmidt orthonormalization projection map, as we show in Theorem B.6. This underscores that this choice of ft is well defined.

3. The output V is constructed in the same way as in the Lanczos algorithm.

This demonstrates that with the given definitions of (hλ)L λ=1 and ft for t [n], the sequential breaking algorithm exactly replicates the Lanczos algorithm.

We have seen that the sequential breaking algorithm can implement the Lanczos algorithm. Importantly, since all choices are deterministic, with both the initial vector V1 and the arbitrary vectors determined by the most-expressive breaking selectors and therefore by the graph structure, the obtained Lanczos vectors do not have any ambiguity, with both sign and basis exactly determined by gstruc and (hλ)L λ=1. Furthermore, Theorem B.7 shows that the Lanczos vectors are identical up to permutation for isomorphic graphs.

Given these results we are now ready to prove Theorem 4.3.

Proof of Theorem 4.3. Let f1, . . . , fn and (hλ)L λ=1 be as in Theorem B.2. We construct an additional function Γ that, given the original matrix A and the Lanczos vectors, returns the eigenvectors of A, U = Γ(A, V ) Rn n

with V = [V1, . . . , Vn] and U denoting the full set of n eigenvectors of A.

Specifically we define Γ to be factorized into two steps Γ = Γ2 Γ1, corresponding to the two remaining steps required to compute eigenvectors from A and V1, . . . , Vn, which we have seen do not have sign/basis ambiguities and are identical up to permutations for isomorphic graphs.

First, Γ1 represents the map (A, V ) 7 (T , V ) Rn n, returning the tridiagonal matrix of the Lanczos algorithm, as well as the Lanczos vectors (this last output is included for later usage in Γ2, but simply corresponds to the input). Note that this Γ1 can be written as

Γ1(A, V ) = (V T AV , V )

since T = V T AV . Since Γ1 simply consists of two matrix multiplications, it is permutation equivariant.

Then, Γ2 : Rn n Rn n Rn n can be any of the (fast) specialized eigensolvers for symmetric tridiagonal matrices, followed by a matrix multiplication to recover the eigenvectors of A from those of T . For concreteness, we may take Γ2 to first perform the divide-and-conquer algorithm proposed by Cuppen (1980), which computes the eigendecomposition of T in O(n2) steps. Importantly, it was shown by Dhillon (1997) that this divide-and-conquer can be done deterministically using a twisted factorization (see Section 3.1 therein) to initialize, and so the resulting eigenvectors do not have sign/basis ambiguities. Then, given the eigenvectors UTri of T , the eigenvectors U of A can be obtained as U = V UTri. Importantly, since it only involves a decomposition (without ambiguities) and a matrix multiplication, Γ2 is also permutation equivariant.

Thus, we have shown that it is possible to recover the eigenvectors of A from the Lanczos vectors. Since both the Lanczos vectors and the mappings used to obtain the eigenvectors are free of sign

Published as a conference paper at ICLR 2025

and basis ambiguities, and are permutation equivariant, the resulting eigenvectors will also be free of these ambiguities and permutation equivariant, concluding our proof.

B.3 SECTION 4 REDUCTION MAP PROOFS

In the following, we present the theoretical results related to the reduction map, as introduced in Section 4. We start by showing that the reduction map designs introduced for the sequential and parallel breaking algorithms are valid, as they return structural representations of order r (Property (1) in Definition 3.1).

Theorem 4.4 (Sufficiency of single-node symmetry breakings for structural of any order). Let R be the reduction map as defined in Equations (8) and (11). When combined with its corresponding expansion map E, then R(E(A, V1)){v1,...,vr} is a structural representation of {v1, . . . , vr}.

Proof. We will show this result for the sequential breaking algorithm and for the parallel breaking algorithm separately.

Sequential Breaking. Equation (8) trivially satisfies the definition of structural representation (Definition 2.1), as it returns the representation of the set {v1, . . . , vr} by averaging its representations across all isomorphic graphs.

Parallel Breaking. We need to show that Equation (11) satisfies the definition of structural representation (Definition 2.1). Specifically, we show that if {v1, . . . , vr} {u1, . . . , ur}, A A , V1 V 1 then R(E(A, V1)){v1,...,vr} = R(E(A , V 1)){u1,...,ur}. Since the output of R is a concatenation of intermediate representations, it is sufficient to show that Uλ,{v1,...,vr} = U λ,{u1,...,ur}. This is trivially true because: (1) the breaking selectors are structural, and therefore breaking nodes belonging to the same selector in (A1, V1) are isomorphic to breaking nodes belonging to the same selector in (A 1, V 1), (2) ft for t [T] is structural. Therefore the reduction map in Equation (11) returns structural representations.

Note that while we have proven the above result for Equation (8) for the sequential breaking algorithm, it is easy to see that it extends also to Equation (15), by following the same proof steps used for parallel breaking. Indeed, Equation (15) similarly considers the concatenation of intermediate representations, each obtained by aggregating representations obtained by structural models and structural selectors.

Next, we prove that the representations returned by the reduction map are most-expressive for links. We start by showing that this is case for Equation (11).

Theorem 4.5 (Sufficiency of single-node breakings for most-expressive link structural). Let E be the parallel breaking expansion map in Equation (5) with T = n and let R be the reduction map as defined in Equation (11) with r = 2. Assume ρλ and ϕλ injective, and fλ node-most-expressive, for every λ [L]. Then, R(E(A, V1)){v1,v2} is a most-expressive structural representation of {v1, v2}.

Proof. Let {u1, u2} be another set of two nodes, which is different from {v1, v2}. We will show that R(E(A, V1)){v1,v2} = R(E(A, V1)){u1,u2} iff {v1, v2} {u1, u2}, therefore satisfying the definition of most-expressive structural representations (Definition 2.2).

Recall that, since T = n we have T = n+1 because we need to break all the n nodes, and therefore the equations can be rewritten as

Uλ,t = ϕλ {Vt,v1, Vt,v2}

Uλ,{v1,v2} = ρλ {Uλ,t} t [n+1] s.t. Λt=λ

R(E(A, V1)){v1,v2} = Uλ,{v1,v2}

For simplicity, let us further denote by St,{v1,v2} the set of the representations of {v1, v2} at time t, i.e., St,{v1,v2} = {Vt,v1, Vt,v2}.

Published as a conference paper at ICLR 2025

( ) We will show that R(E(A, V1)){v1,v2} = R(E(A, V1)){u1,u2} {v1, v2} {u1, u2}. Since ρλ and ϕλ are injective, we have

R(E(A, V1)){v1,v2} = R(E(A, V1)){u1,u2} λ [L] {St,{v1,v2}} t [n+1] s.t. Λt=λ = {St,{v1,v2}} t [n+1] s.t. Λt=λ j [n + 1], k [n + 1] such that {Vj,v1, Vj,v2} = {Vk,u1, Vk,u2}

which means that j [n + 1], k [n + 1] such that

{fλ(A, V1 1vj)v1, fλ(A, V1 1vj)v2} = {fλ(A, V1 1vk)u1, fλ(A, V1 1vk)u2}

Since the above means that j [n + 1], k [n + 1] where the sets are the same, then without loss of generality we have that4

fλ(A, V1 1vj)v1 = fλ(A, V1 1vk)u1 fλ(A, V1 1vj)v2 = fλ(A, V1 1vk)u2

Consider the first equation. Since fλ is node-most-expressive, then v1 u1, which implies P permuting vj with vk and v1 with u15 such that A = P AP T and V1 1vk = P (V1 1vj). This implies that (vj, v1) (vk, u1). Similarly, for the second equation we have that (vj, v2) (vk, u2). Consequently, it must be (v1, v2) (u1, u2) (cause if they were not, then either (vj, v1) (vk, u1) or (vj, v2) (vk, u2), reaching a contradiction).

( ) We will show that {v1, v2} {u1, u2} R(E(A, V1)){v1,v2} = R(E(A, V1)){u1,u2}. Suppose by contradiction that R(E(A, V1)){v1,v2} = R(E(A, V1)){u1,u2} which means that there exist λ [L] such that Uλ,{v1,v2} = Uλ,{u1,u2}. Since ϕλ, ρλ are injective, this implies that j [n + 1] such that6

fλ(A, V1 1vj)v1 = fλ(A, V1 1vk)u1 k [n + 1] such that Λj = Λk = λ

However, since {v1, v2} {u1, u2}, k [n + 1] with Λk = λ such that considering P permuting vj with vk and v1 with u1 is such that A = P AP T and V1 1vk = P (V1 1vj). Since fλ is structural, then

fλ(A, V1 1vj)v1 = fλ(A, V1 1vk)u1

reaching a contradiction.

Next, we show that maximal expressivity can also be achieved by Equation (8).

Proposition B.3 (Sufficiency of single-node sequential breakings for most-expressive link structural). Let E be the sequential breaking expansion map in Equation (5) with T = n and let R be the reduction map as defined in Equation (8) with r = 2. Assume ρt and ϕt are injective, and ft defined as in Theorem B.2 (that is, returning the Lanczos vectors). Then, R(E(A, V1)){v1,v2} is a most-expressive structural representation of {v1, v2}.

Proof. From the proof of Theorem 4.4 we already know that R(E(A, V1)){v1,v2} is structural. We now only need to show it is most expressive, that is R(E(A, V1)){v1,v2} = R(E(A , V 1)){v 1,v 2} π Sn A = π A, V1 = π V1 and {v1, v2} = π {v 1, v 2}. This follows from Srinivasan & Ribeiro (2020, Theorem 2) as the Lanczos vectors are most expressive positional representations.

While maximal expressivity can be achieved only for links in the case of single-node breakings, for the parallel breaking algorithm case, it is possible to extend the result when considering multi-node breakings. Specifically, we next show that breaking r 1 nodes at a time is sufficient to obtain a most-expressive structural representation of order r.

4An alternative equivalent relation would be to swap u1 with u2 in the equation. 5Note that vj can also be equal to vk, and in such case P permutes only v1 with u1. 6We can alternatively swap v1 with v2 and/or u1 with u2, without loss of generality.

Published as a conference paper at ICLR 2025

Theorem B.4 (Sufficiency of (r 1)-node joint breakings for most-expressive r structural). Let E be the parallel breaking expansion mapping obtained by perturbing (r 1)-nodes at a time, and let T = n r 1 . Let R be the reduction mapping as defined in Equation (11). Assume ρλ and ϕλ injective, and fλ node-most-expressive, for every λ [L]. Then, R(E(A, V1)){v1,...vr} is a most-expressive structural representation of {v1, . . . vr}.

Proof. The proof is step-by-step identical to the proof of Theorem 4.5, with the only difference being that we break r 1 nodes at a time to obtain most-expressive structural representations of order r. Let {u1, . . . , ur} be another set of r nodes, which is different from {v1, . . . , vr}. We will show that R(E(A, V1)){v1,...,vr} = R(E(A, V1)){u1,...,ur} iff {v1, . . . , vr} {u1, . . . , ur}, therefore satisfying the definition of most-expressive structural representations (Definition 2.2).

Recall that, since T = n r 1 we have T = n r 1 + 1 because we need to break all the n r 1 set of r 1 nodes, and therefore the equations can be rewritten as

Uλ,t = ϕλ {Vt,v1, . . . , Vt,vr}

Uλ,{v1,v2} = ρλ {Uλ,t} t [T ] s.t. Λt=λ

R(E(A, V1)){v1,...,vr} = Uλ,{v1,...,vr}

For simplicity, let us further denote by St,{v1,...,vr} the set of the representations of {v1, . . . , vr} at time t, i.e., St,{v1,...,vr} = {Vt,v1, . . . , Vt,vr}.

( ) We will show that R(E(A, V1)){v1,...,vr} = R(E(A, V1)){u1,...,ur} {v1, . . . , vr} {u1, . . . , ur}. Since ρλ and ϕλ are injective, we have R(E(A, V1)){v1,...,vr} = R(E(A, V1)){u1,...,ur} λ [L] {St,{v1,...,vr}} t [T ] s.t. Λt=λ = {St,{u1,...,ur}} t [T ] s.t. Λt=λ j [T], k [T] such that {Vj,v1, . . . , Vj,vr} = {Vk,u1, . . . , Vk,ur} which means that j [T], k [T] such that {fλ(A, V1 1sj)v1, . . . , fλ(A, V1 1sj)vr} = {fλ(A, V1 1sk)u1, . . . , fλ(A, V1 1sk)ur} where 1sj (and similarly 1sk) is the vector with exactly r 1 ones (and n (r 1) zeros), at the r 1 positions specified by sj = {j1, . . . , jr}, with j1, . . . , jr node ids.

Since the above means that j [T], k [T] where the sets are the same, then without loss of generality we have that fλ(A, V1 1sj)v1 = fλ(A, V1 1sk)u1 ... fλ(A, V1 1sj)vr = fλ(A, V1 1sk)ur Consider the first equation. Since fλ is node-most-expressive, then v1 u1, which implies P permuting nodes in sj with nodes in sk and v1 with u1 such that A = P AP T and V1 1sk = P (V1 1sj). This implies that (sj, v1) (sk, u1) or, equivalently, (j1, . . . , jr, v1) (k1, . . . , kr, u1). Similarly, for all equations until the r-th equation we have that (j1, . . . , jr, vr) (k1, . . . , kr, ur). Consequently, it must be (v1, . . . , vr) (u1, . . . , ur).

( ) We will show that {v1, . . . , vr} {u1, . . . , ur} R(E(A, V1)){v1,...,vr} = R(E(A, V1)){u1,...,ur}. Suppose by contradiction that R(E(A, V1)){v1,...,vr} = R(E(A, V1)){u1,...,ur} which means that there exist λ [L] such that Uλ,{v1,...,v2} = Uλ,{u1,...,ur}. Since ϕλ, ρλ are injective, without loss of generality this implies that j [T] such that fλ(A, V1 1sj)v1 = fλ(A, V1 1sk)u1 k [n + 1] such that Λj = Λk = λ However, since {v1, . . . , vr} {u1, . . . , ur}, k [T] with Λk = λ such that considering P permuting sj with sk and v1 with u1 is such that A = P AP T and V1 1sk = P (V1 1sj). Since fλ is structural, then fλ(A, V1 1sj)v1 = fλ(A, V1 1sk)u1 reaching a contradiction.

Published as a conference paper at ICLR 2025

B.4 AUXILIARY THEOREMS AND PROOFS

Lemma B.5. Consider the specialization of our sequential breaking algorithm to the Lanczos algorithm (c.f., Appendix B.2.1). For every step t such that βt = 0, if vt is such that it exists another

node u with Vi,vt = Vi,u for all i < t, then Vt = 1vt proj V1,...Vt 1(1vt)

1vt proj V1,...Vt 1(1vt) Rn is a unit vector

orthogonal to previous vectors.

Proof. First, we note that for every step t, if βt = 0, then7

wt 1 = w t 1 αt 1Vt 1 βt 1Vt 2 = 0,

which implies

w t 1 = AVt 1 span(Vt 1, Vt 2).

Now we only need to show that 1vt span(Vt 1, . . . , V1). Assume by contradiction that instead 1vt span(Vt 1, . . . , V1), and therefore it can be obtained as a linear combination of Vt 1, . . . , V1. However, since by assumption u such that Vi,v = Vi,u for all i < t, then the linear combination of Vt 1, . . . , V1 will necessarily result in a vector having the same entry for vt and u, and therefore cannot be 1vt which is 1 in vt and 0 in u, leading to our contradiction.

Theorem B.6. Consider the specialization of our sequential breaking algorithm to the Lanczos algorithm (c.f., Appendix B.2.1). There exist most-expressive breaking selectors (h λ)L λ=1 (c.f., Definition A.1) which are sufficient to always obtain breaking nodes vt such that, when βt = 0, then

Vt = 1vt proj V1,...Vt 1(1vt)

1vt proj V1,...Vt 1(1vt) Rn is a unit vector orthogonal to previous vectors.

Proof. The proof follows directly from Lemma B.5 by showing that there exist most-expressive (h λ)L λ=1 satisfying its assumptions. These breaking selectors can be easily constructed in the following way:

(1) For the first t such that βt = 0, we need the corresponding h λ to return an isomorphism class that contains more than one node. In other words, the length of the list must be greater than 1 (k λ > 1). Since this is the first time we break symmetries, this ensure that there exists another u satisfying Lemma B.5.

(2) For any other t such that βt = 0, we also need h λ to be returning a non-singleton isomorphism class. This class must be such that it exists another node u isomorphic to vt such that Vi,vt = Vi,u for all i < t, so that the assumptions of Lemma B.5 are met. Note that there always exist such h λ, as we show next.

Assume by contradiction that there isn t any vt such that u with Vi,vt = Vi,u for all i < t. Then it is always possible to obtain a new orthogonal vector using the Lanczos algorithm (by propagating the unique features), which implies that βt = 0 reaching a contradiction.

Finally, note that such (h λ)L λ=1 are non overlapping: whenever an h λ is chosen to obtain a corresponding vt, then the symmetry is broken and therefore using again the isomorphism class as output of another h λ , with λ > λ, will not result in additional symmetry breakings, implying we need to use a different non-singleton isomorphism class for later t > t such that βt = 0.

Theorem B.7. Let G1 = (A1, X1), G2 = (A2, X2) be two isomorphic graphs, i.e., A2 = P A1P T , X2 = P X1, for some permutation matrix P . Denote the Lanczos vectors of G1 as V G1 1 , . . . , V G1 n , and the Lanczos vectors of G2 as V G2 1 , . . . , V G2 n , obtained by our sequential breaking algorithm with V G1 1 and V G2 1 the normalized outputs of the structural function gstruc (Equation (1)) and (hλ)L λ=1 the most-expressive breaking selectors (Definition A.1). There exists a permutation matrix B such that

V G2 t = BV G1 t , t n. (16)

7Note that if t = 2, then w1 = w 1 α1V1 and the equation is valid by assuming Vt 2 = 0, β1 = 0.

Published as a conference paper at ICLR 2025

Proof. For simplicity of notation, let b1 = Concat((hλ(A1, V1))L λ=1) and b2 = Concat((hλ(A2, V2))L λ=1) be the two lists of breakings. Since b1, b2 are simply orderings of node ids, they can equivalently be represented by permutation matrices, which we will denote by Pb1 and Pb2. Let B the permutation matrix transforming b1 into b2, that is:

B = P T b2Pb1. (17) We will prove the result by induction by showing that Equation (16) is true with B as in Equation (17).

Base case. Consider t = 1. Since (hλ)L λ=1 are most-expressive and structural, the breaking nodes in the same position in b1 and b2 are isomorphic. Therefore, B maps each node in G1 to a node in G2 isomorphic to it, which implies

BV G1 1 = V G2 1 .

Consider t = 2. We distinguish two cases: (1) βG1 2 = 0 and βG2 2 = 0 or (2) βG1 2 = 0 and βG2 2 = 0. Indeed, it is not possible to have βG1 2 = βG2 2 , as they are obtained from V G1 1 and V G2 1 using a structural function.

If β2 = 0, since the steps of the Lanczos algorithm outside the else-statement are all structural, since V G2 1 is simply a permutation of V G1 1 , and since B maps each node in G1 to a node in G2 isomorphic to it, then again

BV G1 2 = V G2 2 .

If, instead, βG1 2 = βG2 2 = 0, then, for each graph Gi, i {1, 2}, given a breaking node v Gi 2 we first construct the breaking vector as U Gi 2 = 1v Gi 2 and then orthonormalize it using Gram-Schmidt. Note

that since U G1 2 , U G2 2 are standard basis vectors, which differ in that they are obtained by potentially different breaking nodes v G1 2 and v G2 2 , we have BU G1 2 = U G2 2 by definition of B (which indeed maps v G1 2 into v G2 2 ). Therefore, denoting by ˆV Gi 2 the unnormalized version of V Gi 2 , we have

B ˆV G1 2 = B(U G1 2 proj V G1 1 (U G1 2 ))

= BU G1 2 Bproj V G1 1 (U G1 2 )

= BU G1 2 B(U G1 2 T V G1 1 )V G1 1

= BU G1 2 (U G1 2 T BT BV G1 1 )BV G1 1 = BU G1 2 proj BV G1 1 (BU G1 2 )

= U G2 2 proj V G2 1 (U G2 2 )

= ˆV G2 2 . Since we have just shown that Equation (16) is true for the unnormalized vectors, then dividing both sides by the norm preserves the relationship, and therefore

BV G1 2 = V G2 2 .

Inductive step. Consider a step t 1 < n, and suppose

V G2 i = BV G1 i , i t 1. (18)

We next show that V G2 t = BV G1 t . We distinguish again two cases, βt = 0 and βt = 0.

If βt = 0, then the step to obtain the next Lanczos vector is a structural function, and since B maps each node in G1 to a node in G2 isomorphic to it, then again

BV G1 t = V G2 t .

If βt = 0, then U G1 t , U G2 t are standard basis vectors, which differ in that they are obtained by potentially different breaking nodes v G1 t and v G2 t , and again we have BU G1 t = U G2 t by definition of B. Therefore, denoting by ˆV Gi t the unnormalized version of V Gi t ,

B ˆV G1 t = B(U G1 t

i=1 proj V G1 i (U G1 t ))

Published as a conference paper at ICLR 2025

= BU G1 t B

i=1 proj V G1 i (U G1 t )

= BU G1 t B

i=1 (U G1 t T V G1 i )V G1 i

i=1 (U G1 t T BT BV G1 i )BV G1 i

i=1 proj BV G1 i (BU G1 t )

i=1 proj V G2 i (U G2 t )

= ˆV G2 t which implies that

BV G1 t = V G2 t , since we are simply dividing both sides by the norm.

Remark B.8 (Equivariance of VT , . . . , V1). Theorem B.7 demonstrates that the vectors VT , . . . , V1 produced by the expansion map permute with the input graph. In other words, for any two isomorphic graphs G1, G2, there exists a permutation matrix B such that V G2 t = BV G1 t , for all t T. This property holds for both the Lanczos vectors and any other structural function ft. In the latter case, the proof follows from Theorem B.7 by simply disregarding the β = 0 case.

C RELATED WORK

Towards Graph Foundation Models. Unlike vision and language, graph machine learning does not yet have a standard recipe for training general purpose models. Because of this, there is an active debate on what ingredients are needed (Mao et al., 2024). Some requirements are clear, such as the need for a single input feature space common to all graphs, akin to a language model tokenizer (Radford et al., 2019). Initial approaches include textifying all node features and using pre-trained text encoders to produce feature vectors (Chen et al., 2024b; Liu et al., 2024), while recent works have proposed different solutions for addressing the problem of differing features across graphs (Zhao et al., 2023; Lachi et al., 2024; Zhao et al., 2024; Xia & Huang, 2024; He & Hooi, 2024; Shen et al., 2024; Hou et al., 2024), including for cases where the differences are only the relation identifiers (Gao et al., 2023; Zhou et al., 2023; Galkin et al., 2024). Our work focuses on another crucial need: developing general-purpose representations capable of performing multiple tasks. To best of our knowledge, there is no existing GNN-based model capable of learning representations that are suitable for different tasks. Existing effort are either domain-specific (Zheng et al., 2024; Beaini et al., 2024; Sypetkowski et al., 2024), or heavily rely on LLMs (Huang et al., 2024a; Liu et al., 2024; Chen et al., 2024a; Tang et al., 2024), which can present efficiency challenges (Mao et al., 2024). Perhaps more crucially, no existing method offers guarantees on the expressiveness of the learned representations. Our work builds on this growing understanding of GNN expressive power (Morris et al., 2023), moving from expressivity for a single task to expressivity across many tasks.

Associative Memories and Distributed Representations. Earlier neural network blueprints broadly fell into one of two extreme camps: one concept one neuron , or distributed (holographic) representations, where an individual concept is spread across many neurons. Arguments for either side were advanced partially through appeal neuroscience s understanding the animal brains. The deep learning era triggered by Alex Net (Krizhevsky et al., 2012) largely set this debate to one side, as modern neural networks tended to exhibit properties somewhere in the middle, with neuron-level concepts recoverable in startling detail in some cases (Templeton et al., 2024) and superimposed in other cases (Elhage et al., 2022). Our effort to explicitly reintroduce holographic representations for graphs is in response to the different structural properties (namely symmetries) of graph data compared to vision, language and other modalities.

Published as a conference paper at ICLR 2025

Symmetry Breaking Algorithms. The expansion map in Holo GNN uses symmetry breakings as a key subroutine to produce general and expressive representations. Symmetry breakings is often used in GNNs that color or mark individual nodes in order to distinguish them, a common form of positional encoding (You et al., 2019; Cui et al., 2022), oftentimes also obtained by sampling random features for each node (You et al., 2019; Murphy et al., 2019; Abboud et al., 2021; Sato et al., 2021; Puny et al., 2020). Symmetry breakings has also been used as a form of distance encoding, which can be specialized for predictions of different tasks (Li et al., 2020; Yin et al., 2020). In contrast, our approach uses the same symmetry breaking for all tasks, with task specialization occurring only in the (lightway) reduction map. But symmetry breaking has been used more broadly in the analysis of graphs. Examples include in algebraic combinatorics to study the automorphism groups of different graph classes (Albertson & Collins, 1996), and in combinatorial optimization, where symmetry breaking is used to simplify complex combinatorial constraint sets (Fahle et al., 2001). Moreover, symmetry breaking is also used in gold-standard linear algebra algorithms, such as Lanczos and Arnoldi iterations for computing eigendecompositions (Lanczos, 1950; Arnoldi, 1951). In this case, symmetry breaking is used to (randomly) resolve ambiguities in eigenspaces for eigenvalues with repeated roots. Stepping back, although symmetry breaking is a known concept in graph learning, our approach distinguishes itself by ensuring that symmetry breaking is performed in such a way that it is possible to re-introduce the lost permutation symmetries and produce structural representations.

Subgraph GNNs. The concept of symmetry breaking, particularly through our parallel breaking algorithm, shares similarities with the recent works on Subgraph GNNs (Zhang & Li, 2021; Cotta et al., 2021; Papp et al., 2021; Bevilacqua et al., 2022; Zhao et al., 2022; Papp & Wattenhofer, 2022; Frasca et al., 2022; Qian et al., 2022; Zhang et al., 2023; Bevilacqua et al., 2024; Bar-Shalom et al., 2024a;b), which also break symmetries by marking nodes to enhance the expressive power of GNNs, as originally proposed in Li et al. (2020). However, our approach significantly differs from Subgraph GNNs in that our aim is not to produce expressive representations for a single task, but rather to pre-train models returning representations that can be used for any task. To this end, we introduce the concept of a reduction map, which generates structural representations suitable for any task order. In contrast, Subgraph GNNs use their representations to revert to structural node or graph-level representation. Additionally, while Subgraph GNNs generally break symmetries for all nodes in the graph, our expansion map can leverage T n, significantly improving efficiency.

Graph Transformers. Finally, we acknowledge the growing interest in graph transformer models (Kreuzer et al., 2021; Dwivedi & Bresson, 2021; Ying et al., 2021; Ramp aˇsek et al., 2022; Kim et al., 2022; M uller et al., 2024), driven by their strong performance especially on graph classification tasks. However, graph transformers rely on positional encodings (Dwivedi et al., 2023), which make their learned representations inherently positional. As discussed in Section 2, positional representations are task-order specific. This task-order specificity makes graph transformers unsuitable for pre-training a single model that can generate flexible representations capable of adapting to tasks of varying orders. In contrast, Holo GNN focuses on producing task-agnostic, symmetry-free representations that can efficiently adapt to different task orders.

D COMPLEXITY ANALYSIS

In this section, we analyze the complexity of the symmetry-breaking algorithms (Algorithm 1 and Algorithm 2), which represent the foundation of our Holo GNN.

The complexity of the algorithms is determined by the complexity of ft (for sequential breaking) and fλ (for parallel breaking). Suppose ft and fλ are Message Passing Neural Networks (MPNNs), which have linear complexity in the number of nodes n and edges m, O(n + m), where the feature dimension d is considered constant. Since both sequential and parallel breaking perform T calls to the corresponding MPNNs, each time by passing a new breaking node, the complexity is O(T(n + m)) in both cases. Note that in our experiments T n and T m. For instance, in the Pubmed dataset, T = 8, n = 19717, m = 88648. Finally, we note that in the parallel breaking algorithm, the additional T factor in the complexity can effectively be mitigated, as all T calls can be done in parallel. This is because each breaking is applied to V1, and fλ does not take all prior embeddings.

Published as a conference paper at ICLR 2025

Table 3: Performance on the Movie Lens dataset in each task (column), when pre-trained in each of the four tasks (row block). The diagonal (light blue) contains the performance of the models when trained and tested on the same task. OOM indicates out of RAM, and NA indicates that either the task could not be performed due to out-of-memory issues or because the model cannot perform that task order. Each column illustrates the performance drop when comparing the results of pretraining on the target task with pre-training on all other tasks, with Holo GNN variants exhibiting significantly smaller drops.

User Movie Movie Movie User Movie User User Movie Movie (RMSE ) (Acc. ) (Acc. ) (Acc. )

PRE-TRAINING TASK

SEAL 0.960 0.010 0.815 0.002 OOM OOM NBFNet (no feat) 1.044 0.004 0.981 0.001 NA NA NBFNet 0.967 0.005 0.892 0.001 NA NA SAGE 0.943 0.006 0.657 0.008 0.595 0.002 0.589 0.001 SAGE & Lap PE 0.949 0.019 0.765 0.004 0.593 0.002 0.592 0.001 Holo GNN (MP) 0.949 0.002 0.921 0.001 0.708 0.001 0.808 0.002 Holo GNN (GNN) 0.943 0.001 0.918 0.002 0.741 0.003 0.810 0.001

Movie Movie

SEAL 1.044 0.007 0.981 0.001 OOM OOM NBFNet (no feat) 1.053 0.007 0.989 0.001 NA NA NBFNet 1.050 0.007 0.989 0.001 NA NA SAGE 1.052 0.004 0.977 0.002 0.635 0.006 0.667 0.002 SAGE & Lap PE 1.045 0.003 0.975 0.003 0.646 0.016 0.675 0.001 Holo GNN (MP) 1.019 0.005 0.984 0.001 0.694 0.001 0.810 0.001 Holo GNN (GNN) 1.020 0.006 0.986 0.001 0.721 0.001 0.810 0.004

User Movie User

SEAL NA NA OOM NA NBFNet (no feat) NA NA NA NA NBFNet NA NA NA NA SAGE 1.014 0.005 0.733 0.002 0.816 0.003 0.646 0.001 SAGE & Lap PE 1.012 0.008 0.793 0.002 0.812 0.006 0.647 0.001 Holo GNN (MP) 1.005 0.002 0.936 0.003 0.794 0.004 0.779 0.001 Holo GNN (GNN) 0.991 0.005 0.972 0.001 0.811 0.002 0.795 0.001

User Movie Movie

SEAL NA NA NA OOM NBFNet (no feat) NA NA NA NA NBFNet NA NA NA NA SAGE 1.007 0.014 0.930 0.003 0.721 0.001 0.890 0.007 SAGE & Lap PE 0.995 0.004 0.900 0.002 0.719 0.001 0.885 0.008 Holo GNN (MP) 0.982 0.005 0.974 0.002 0.756 0.002 0.904 0.009 Holo GNN (GNN) 0.989 0.004 0.974 0.002 0.769 0.001 0.905 0.008

E ADDITIONAL EXPERIMENTS AND DETAILS

E.1 ADDITIONAL EXPERIMENTS

In this section, we provide additional insights and results on the experiments. We also introduce an additional baseline, created by concatenating structural and positional encodings referred to as SAGE & Lap PE for Movie Lens and GCN & Lap PE for Planetoid, using SAGE or GCN for structural encodings and Laplacian eigenvectors for positional encodings (Dwivedi et al., 2023). This baseline demonstrates that simple concatenation does not achieve the performance of Holo GNN.

E.1.1 MOVIELENS

Table 3 compares the performance of Holo GNN in the Movie Lens dataset. Each column represents a different task we test on, namely User Movie, Movie Movie, User Movie User, User Movie Movie. Each row block corresponds to a pre-training task, with each row being one model. This implies that each entry in the table corresponds to the performance of a model (row) when pre-trained on one task (row block) and adapted to a potentially different task (column). The diagonals, colored in light blue, contain the performance of each model when pre-trained and tested on the same task; in other words, each diagonal shows the performance of the supervised version of each model.

We consider four tasks, User Movie, Movie Movie which are order-2 tasks, and User Movie User, User Movie Movie, which are order-3 tasks. In particular, the User Movie task, the standard task for this dataset, predicts a user s rating of a movie, Movie Movie predicts the existence of a metapath between two movies, User Movie User predicts if two users will both watch a particular movie,

Published as a conference paper at ICLR 2025

Table 4: Performance on the Planetoid datasets in each of the two tasks (column), when pre-trained in either of the two tasks (row block). For each dataset, the diagonal (light blue) shows the performance of the models when trained and tested on the same task. Each column illustrates the performance drop when comparing the results of pre-training on the target task with pre-training on the other task. Holo GNN variants demonstrate more consistent performance, with significantly smaller drops than all other methods especially when pre-trained on link prediction and tested on node classification.

CORA CITESEER PUBMED

Node Class. Link Pred. Node Class. Link Pred. Node Class. Link Pred. (Acc. ) (AUROC ) (Acc. ) (AUROC ) (Acc. ) (AUROC )

PRE-TRAINING TASK

Node Class.

SEAL 77.2 2.1 85.9 1.6 64.5 0.5 88.8 1.1 75.0 1.1 93.3 0.2 NBFNet (no feat) 31.3 1.3 82.1 0.9 25.7 1.6 89.6 1.1 40.5 0.5 92.5 0.3 NBFNet 68.8 1.3 87.5 1.6 50.4 2.8 88.8 1.2 71.4 2.1 95.7 0.3 GCN 81.4 1.0 89.9 0.2 70.1 0.7 89.5 0.8 79.0 0.2 93.2 0.1 GCN & Lap PE 81.5 0.8 90.2 0.8 70.0 0.4 88.5 0.8 78.9 0.3 93.5 0.1 Holo GNN (MP) 81.5 0.8 91.0 0.3 70.0 0.2 93.7 0.1 78.7 0.2 93.5 0.2 Holo GNN (GNN) 81.6 0.7 91.1 0.4 70.1 0.8 94.0 0.6 78.7 0.8 93.8 0.1

SEAL 35.6 2.6 93.5 0.5 31.6 0.8 90.6 1.1 44.4 1.1 97.8 0.2 NBFNet (no feat) 26.7 3.3 94.8 0.7 22.8 2.8 92.4 1.6 40.5 0.3 98.3 0.2 NBFNet 27.8 6.1 89.0 2.7 25.0 1.0 85.2 1.2 56.7 0.2 98.2 0.2 GCN 74.2 0.2 89.6 0.5 68.3 0.2 91.4 0.2 76.8 0.1 96.0 0.7 GCN & Lap PE 73.4 0.6 89.3 1.8 68.1 0.5 91.1 0.7 75.3 0.6 95.8 0.4 Holo GNN (MP) 80.5 0.8 92.5 0.5 69.0 0.8 94.9 0.5 79.7 0.4 96.7 0.4 Holo GNN (GNN) 80.2 0.5 93.9 0.1 69.0 0.9 95.0 0.5 79.8 0.5 97.2 0.8

User Movie Movie, if a user will watch two particular movies. For SEAL, we use OOM to denote that the model went out of RAM memory in the preprocessing phase, when constructing the subgraphs. We then use NA in the other tasks as the embeddings cannot be adapted to any other tasks, since they could not be learned. For NBFNet and NFBNet (no feat) (the variant that does not use node features) we use NA in the order-3 tasks as NBFNet cannot handle tasks of order greater than 2.

As can be seen from Table 3, Holo GNN (GNN) and its variant Holo GNN (MP), which uses simple message passing propagations as fλ instead of a GNN network, exhibit a significantly smaller gap in the performance when pre-trained on the task and when adapted from a different one. For example, when considering the order-3 task User Movie User, the difference in performance between Holo GNN (GNN) when pre-trained on the task (0.811) and its worst performance when pre-trained in any other task (0.721, obtained pre-training on the order-2 Movie Movie) results in a performance drop of around 9%. SAGE shows a much larger drop of around 22%, with its accuracy dropping from 0.816 when pre-trained on User Movie User to 0.595 when pre-trained on the order-2 User Movie (worst performance when pre-trained in any other task). For the order-2 task Movie Movie, Holo GNN exhibits a drop of 7%, while SAGE s drop becomes 32%. Note that this trend also holds true for all other methods, with SEAL and NBFNet consistently showing larger performance gaps than Holo GNN. The only exception is NBFNet (no feat) in the Movie Movie dataset when pretrained on User Movie; however, this is due to NBFNet underperforming on the pre-training task, suggesting it is merely capturing node distances rather than learning the task, which happens to suffice for Movie Movie.

E.1.2 PLANETOID

Table 4 compares the performance of Holo GNN in the Planetoid datasets Cora, Citeseer and Pubmed. Each column block represents a different dataset, and each column corresponds to a different task we test on, namely Node Classification and Link Prediction. Each row block corresponds to a pre-training task, with each row being one of the models. This implies that each entry in the table corresponds to the performance of a model (row) when pre-trained on one task (row block) and adapted to a potentially different task (column), with both tasks from the same dataset. The diagonal for each dataset, which is colored in light blue, contains the performance of each model when pre-trained and tested on the same task; in other words, it contains to the performance of the supervised version of each model in that dataset.

For each dataset, we consider the order-1 task Node Classification and the order-2 task Link Prediction. As can be seen from Table 4, Holo GNN (GNN) and its variant Holo GNN (MP), which uses

Published as a conference paper at ICLR 2025

simple message passing propagations as fλ instead than a shallow GNN network, demonstrate significantly more consistent performance compared to other methods. While all other models exhibit a significant performance loss when pre-trained on Link Prediction and adapted to Node Classification, this is not the case for Holo GNN and Holo GNN (MP). For instance, in the CORA dataset, the difference in performance between Holo GNN (GNN) when pre-trained on Node Classification (81.6%) and its performance when pre-trained on Link Prediction and adapted to Node Classification (80.2%) results in a performance drop of around 1%. On the contrary, SAGE suffers from a performance drop of 7%, while NBFNet and SEAL show a more dramatic drop of 41%. Pretraining on Node Classification results in less dramatic drops, especially for SAGE which obtains node representation (structural of order 1) both for nodeand link-level tasks, which therefore tend to yield similar performances. Nonetheless, Holo GNN still demonstrate smaller performance drops, and overall better performances.

E.1.3 MOLHIV

Table 5: Performance on molhiv on the graph classification task, when pre-trained in either graph classification or link prediction. Holo GNN demonstrates more consistent performance, with significantly smaller drops than GCN.

MOLHIV Graph Class. (ROC-AUC )

GCN (Pretrain on Graph Class) 76.06 0.97 Holo GNN (GNN) (Pretrain on Graph Class) 78.19 0.91

GCN (Pretrain on Link Pred) 70.29 1.33 Holo GNN (GNN) (Pretrain on Link Pred) 76.43 1.20

We conduct an additional experiment to evaluate the performance of Holo GNN when adapted to a different task order, namely the order-n task graph classification. Since there is currently no standard dataset that includes both graph classification and another type of task, we considered the molhiv dataset from the OGB benchmark (Hu et al., 2020), which is designed for graph classification (an order-n task) and we created a link prediction task (an order-2 task) on the dataset. We evaluated the performance of both GCN and Holo GNN when pre-trained on link prediction and adapted to graph classification and compare them to their performance when directly trained and evaluated on the graph classification task.

Results are reported in Table 5. The key observation is that the difference in performance between Holo GNN trained on graph classification (78.19%) vs pre-trained on link prediction and adapted to graph classification (76.43%) results in a performance drop of only 1.76%, while GCN suffers a more pronounced drop of 5.77%. These results further showcase the ability of Holo GNN to pre-train on one task order and adapt effectively to other task orders.

E.1.4 ABLATION STUDY ON THE NUMBER OF BREAKINGS

Table 6: Impact of the number of breakings T in Holo GNN. Regardless of the value of T, the performance drop of Holo GNN when trained on node classification and when trained on link prediction and adapted to node classification is significantly smaller than the performance drop suffered by GCN, with larger values of T yielding marginal improvements.

CORA Node Class. (Acc. )

GCN (Pretrain on Node Class) 81.4 1.0 Holo GNN (GNN) T = 4 (Pretrain on Node Class) 80.9 0.8 Holo GNN (GNN) T = 8 (Pretrain on Node Class) 81.6 0.7 Holo GNN (GNN) T = 64 (Pretrain on Node Class) 82.1 0.1

GCN (Pretrain on Link Pred) 74.2 0.2 Holo GNN (GNN) T = 4 (Pretrain on Link Pred) 79.4 0.7 Holo GNN (GNN) T = 8 (Pretrain on Link Pred) 80.2 0.5 Holo GNN (GNN) T = 64 (Pretrain on Link Pred) 80.5 0.7

We additionally evaluate the impact of the number of breakings T. We consider the Cora dataset and report results of Holo GNN for different values of T on the node classification task, both when trained on it and when pre-trained on link prediction and adapted to node classification. Table 6 shows that small T values are sufficient for obtaining representations that can effectively adapt to different tasks, and increasing T results in marginal improvements. Indeed, regardless of the value of T, the performance drop of Holo GNN when trained on node classification and when trained on link prediction and adapted to node classification is significantly smaller than the performance drop suffered by GCN.

Published as a conference paper at ICLR 2025

Table 7: Comparison with GPS (Ramp aˇsek et al., 2022) on CORA in each of the two tasks (column), when pre-trained in either of the two tasks. Holo GNN demonstrates more consistent performance, with significantly smaller drops than GPS when pre-trained on link prediction and adapted to node classification.

Node Class. Link Pred. (Acc. ) (AUROC )

GCN (Pretrain on Node Class) 81.4 1.0 89.9 0.2 GPS (Pretrain on Node Class) 78.3 1.0 93.0 0.1 Holo GNN (GNN) (Pretrain on Node Class) 81.6 0.7 91.1 0.4

GCN (Pretrain on Link Pred) 74.2 0.2 89.6 0.5 GPS (Pretrain on Link Pred) 36.8 2.0 92.7 0.4 Holo GNN (GNN) (Pretrain on Link Pred) 80.2 0.5 93.9 0.1

E.1.5 COMPARISON WITH GPS GRAPH TRANSFORMER

We additionally compare the performance of the GPS graph transformer (Ramp aˇsek et al., 2022) baseline with Holo GNN when adapted to a different task order than the one used for pre-training, using the CORA dataset.

Results are reported in Table 7, where each column represents a different task (Node Classification or Link Prediction). For each task, we evaluate the performance of the models when trained directly on that task (shown in light blue) and when pre-trained on the other task and adapted to the target task. While GPS performs well when adapted to Link Prediction, it suffers from a dramatic performance drop of 41.5% when pre-trained on Link Prediction and adapted to Node Classification, while Holo GNN results in a performance drop of only around 1%.

E.2 EXPERIMENTAL DETAILS

We implemented Holo GNN using Pytorch (Paszke et al., 2019) and Pytorch Geometric (Fey & Lenssen, 2019), and performed hyperparameter tuning using the Weight and Biases framework (Biewald, 2020). We ran our experiments on NVIDIA A100 and Ge Force RTX 4090 GPUs.

In all our experiments except the eigenvector task, we implemented the parallel breaking algorithm. We use a feature-marking technique to break nodes, adding an additional dimension to the embeddings with value 1 if that node is selected to be broken, and 0 otherwise. Note that we break nodes one at a time, so there are T distinct forward passes, as described in the parallel breaking algorithm (Algorithm 2). We use a shallow structural model as fλ. Specifically, for Holo GNN (MP), fλ simply performs Ak(V1 1v), for k set to 2. Note that this variant does not have any extra parameters with respect to a standard GNN. For Holo GNN (GNN), fλ consists of a shallow GNN network, with 2 layers and hidden dimension equal to its input dimension. The GNN layer type depends on the dataset, as we will see next. In both cases, we set T = 8 for Planetoid and Movie Lens datasets, and T = 10 for Rel Bench.

We further defined ϕλ to be elementwise multiplication and implemented ρλ to average over all steps t. This means we do not concatenate the Uλ,{v1,...,vr} for different λ values; instead, we treat all Ut,{v1,...,vr} as part of the same partition. This decision aims to reduce the dimension dr of R(E(A, V1)){v1,...,vr} (Equation (11)) to match d, aligning it to the dimension of other methods.

For NBFNet, we included initial node features by concatenating them in the initial (boundary) condition. We denote the variant that does not include node features as NBFNet (no feat).

Notably, while a standard GNN returns node embeddings, SEAL and NBFNet do not return node embeddings when pre-trained on tasks of order greater than 1. SEAL produces pre-trained embeddings corresponding to the order of the task, which can scale up to O(n3) in our experiments when pre-trained on order-3 tasks. Similarly, NBFNet returns order-2 embeddings when pre-trained on tasks of that order and is unable to extend to higher-order tasks. In contrast, Holo GNN remains linear in the number of nodes and can handle tasks of any order, providing a scalable and flexible solution across task complexities.

Published as a conference paper at ICLR 2025

We train all models using the Adam optimizer, and report the test for the configuration achieving the best validation metric. Each experiment is repeated for 3 different seeds. Details of hyperparameter grid for each dataset can be found in the following paragraphs.

Movie Lens. For all models except NBFNet, we use a Bipartite-SAGE GNN Network (Hamilton et al., 2017) with hidden dimension 64 and 3 SAGE layers. For SAGE & Lap PE we tuned the number of eigenvectors in {8, 16}. We tuned the learning rate in {0.01, 0.001}. For Holo GNN (MP) and Holo GNN (GNN), we use the Bipartite-SAGE GNN Network as gstruc (Equation (1)) and use node degree as the breaking rule to construct breaking selectors (Appendix A.1). We employed GCN layers in the GNN used as fλ for Holo GNN (GNN) for all datasets except User Movie User, where we instead used SAGE layers. We generate train validation and test splits for each task by randomly splitting the edges with a ratio of 80:10:10. We run our experiments for 1k epochs, and return the test at the epoch achieving the best validation results. For NBFNet and SEAL we run for 50 epochs, due to the significant and intractable increase in running time. We remark that the number of epochs for these models is similar to the number of epochs used in the corresponding original papers on their datasets. Notably, on User Movie User and User Movie Movie, preprocessing for SEAL to construct the subgraphs exceeds the RAM capacity of our A100 GPUs (528 GB), resulting in out-of-memory (OOM) errors as reported in the table. Finally, since NBFNet is not designed for triplet predictions, we report NA in the corresponding entries for User Movie User and User Movie Movie.

Planetoid. For all models except NBFNet, we use a GNN Network with hidden dimension 64 and 2 GCN layers (Kipf & Welling, 2017). For GCN & Lap PE we tuned the number of eigenvectors in {8, 16}. For the node-level tasks, we employ the widely used choices, including the final classifier being a linear layer, dropout of 0.5 and weight decay of 5e 4. For Holo GNN (MP) and Holo GNN (GNN), we use the GCN network as gstruc (Equation (1)), node degree as the breaking rule to construct breaking selectors (Appendix A.1), and we further employed GCN layers in the GNN used as fλ for Holo GNN (GNN). For node classification, we employ the standard splits (Yang et al., 2016), while for link prediction, following Zhu et al. (2021), we generate train validation and test splits by randomly splitting the edges with a ratio of 85:5:10. We run our experiments for 500 epochs, and return the test at the epoch achieving the best validation results. Due to the significant increase in running times, for NBFNet we run for 100 epochs for the node-level tasks, and for 20 epochs for the link-level tasks, while for SEAL we run for 500 and 50, respectively.

Rel Bench. The rel-stack graph is a large and realistic graph containing over 38M nodes. Because of this, our models use subgraph-based sampling as proposed by Hamilton et al. (2017). This means that the symmetry breaking approaches based on global structural properties, such as choosing nodes of highest-degree, are not possible in this case. Instead, we use a mini-batch based breaking method, which takes the embeddings of each node given by the initial structural model gstruc and passes them through a learnable linear layer to produce a single scalar logic for each node. These are then passed as input to a Gumbel-Softmax computation block as described in Appendix A.1, which selects T = 10 38M distinct nodes to be used as breaking nodes in a fully differentiable way. We adapt the GNN training code from the Rel Bench repository found at https://github.com/snap-stanford/relbench. We adopt many default hyperparameters from the repository, including 10 epochs training for node-level tasks, and 20 for link prediction, 2 GNN message passing layers for ft with sum aggregation, and Adam optimizer with learning rate 0.005 for nodelevel and 0.001 for link-level tasks. The exception is that we shrunk the network so as to fit on a 24G GPU. Namely we used batch size 128, hidden dimension 64, and sampled 64 neighbors per node.

Eigenvector Task. We use the GCN model implemented in Lim et al. (2023), denoted by GCN (constant input) in Table 2, as gstruc (Equation (1)). Rather than training gstruc together with our expansion map and the classifier, we utilize the pre-trained gstruc to obtain V1. Then, we use V1 to derive the eigenvectors as per Algorithm 1, thereby only training the final classifier. We obtained the eigenvectors using the Subspace Iteration (Saad, 2011), a generalization of the power iteration method for computing several dominant eigenvectors simultaneously, using node degrees to obtain breaking selectors (Appendix A.1). Given the eigenvectors matrix U Rn k, we obtain the final prediction as MLP(Ui,: Uj,:), with denoting element-wise product. We set k to be the same as the number of eigenvectors used by the other methods, namely 16.

Published as a conference paper at ICLR 2025

Molhiv. We consider a GNN Network with hidden dimension 64 and 2 GCN layers (Kipf & Welling, 2017) as the baseline. For Holo GNN, We use the identity as gstruc (Equation (1)), and choose T = 2 breakings for each graph in the dataset. We further employ 2 GCN layers with hidden dimension 64 in the GNN used as fλ. For both models, we tuned the learning rate in {0.01, 0.001}. For graph classification, we employ the standard splits (Hu et al., 2020), while for link prediction, we generate train validation and test splits by randomly splitting the edges with a ratio of 85:5:10.