Published as a conference paper at ICLR 2025

LEARNING GRAPH INVARIANCE BY HARNESSING SPURIOSITY

Tianjun Yao1 Yongqiang Chen1,2 Kai Hu2 Tongliang Liu3,1 Kun Zhang1,2 Zhiqiang Shen1
1Mohamed bin Zayed University of Artificial Intelligence 2Carnegie Mellon University 3The University of Sydney
{tianjun.yao, yongqiang.chen}@mbzuai.ac.ae, kaihu@cs.cmu.edu
tongliang.liu@sydney.edu.au, {kun.zhang, zhiqiang.shen}@mbzuai.ac.ae

ABSTRACT

Recently, graph invariant learning has become the de facto approach to tackling the Out-of-Distribution (OOD) generalization failure in graph representation learning. These methods generally follow the framework of invariant risk minimization to capture the invariance of graph data across different environments. Despite some success, it remains unclear to what extent existing approaches have captured invariant features for OOD generalization on graphs. In this work, we find that representative OOD methods such as IRM and VRex, and their variants on graph invariant learning, may have captured a limited set of invariant features. To tackle this challenge, we propose LIRS, a novel learning framework designed to Learn graph Invariance by Removing Spurious features. Different from most existing approaches that directly learn the invariant features, LIRS takes an indirect approach: it first learns the spurious features and then removes them from the ERM-learned features. We demonstrate that learning the invariant graph features in this indirect way enables the model to capture a more comprehensive set of invariant features, leading to better OOD generalization performance in novel environments. Notably, LIRS surpasses the second-best method by as much as 25.50% across all competitive baselines, underscoring its efficacy in OOD generalization.
1 INTRODUCTION

Graph representation learning with graph neural networks (GNNs) (Kipf & Welling, 2017; Xu et al., 2019; Veličković et al., 2017) has proven to be highly successful in tasks involving relational information (Qiu et al., 2018; Wu et al., 2022b; Yu et al., 2018; Zhang et al., 2022c). Despite their achievements, GNNs often assume that the training and test data share the same distribution. This assumption often does not hold in real-world applications (Hu et al., 2020; Huang et al., 2021; Ji et al., 2022; Koh et al., 2021) due to changes in the underlying data generation process, leading to distribution shifts. Such shifts can severely degrade the performance of GNN models, which presents a critical challenge for their deployment in practical scenarios (De Grave et al., 2020). To address the OOD generalization challenge, graph invariant learning has become the de facto approach. Inspired by the success of the invariance principle (Arjovsky et al., 2020; Krueger et al., 2021), graph invariant learning aims to identify invariant subgraphs that are causally related to the targets (Yang et al., 2022; Li et al., 2022b; Liu et al., 2022; Zhuang et al., 2023; Chen et al., 2022; Li et al., 2022a). Typically, these methods follow the framework of Invariant Risk Minimization (IRM) (Arjovsky et al., 2020) or Variance Risk Extrapolation (VRex) (Krueger et al., 2021), which aim to capture invariant features by learning an equipredictive classifier across different environments. Despite some success, many graph invariant learning methods perform only on par with traditional Empirical Risk Minimization (ERM) (Vapnik, 1995) on graph OOD benchmarks (Gui et al., 2022), which raises a critical yet overlooked question:

To what extent do graph invariant learning algorithms capture graph invariance?
1Code is available at https://github.com/tianyao-aka/LIRS-ICLR2025

In this work, we address this question by investigating two representative OOD algorithms, namely IRMv1 (Arjovsky et al., 2020) and VRex (Krueger et al., 2021), in graph-level OOD classification scenarios. These algorithms are widely used, and many graph invariance learning algorithms are based on their variants or extensions (Li et al., 2022b; Liu et al., 2022; Zhuang et al., 2023; Wu et al., 2022c). Surprisingly, we find that IRM, VRex, and their variants on graph invariant learning may only learn a subset of invariant features (Sec. 3). In contrast, the learning paradigm that learns invariant features indirectly, by removing spurious features from the ERM-learned features, may render learning a more comprehensive set of invariant features (Figure 1 and Sec. 3).

Figure 1: Illustration of the two graph invariant learning paradigms: (a) invariant learning directly; (b) invariant learning indirectly. Compared with invariant learning methods $\mathcal{A}_{Inv}$ that adopt the direct learning paradigm, our proposed method first adopts a spuriosity learner $\mathcal{A}_{Spu}$ to learn spurious features, then learns invariant features by removing the spurious features using the feature disentanglement module $\mathcal{A}_{Disent}$.

Motivated by this observation, we propose a novel learning framework LIRS: Learning graph Invariance by Removing Spuriosity. Different from most existing approaches that directly learn the invariant features, LIRS adopts the indirect learning paradigm. Specifically, LIRS first leverages the biased infomax principle (Def. 3) to effectively learn graph spuriosity (Theorem 4.1), then employs a class-conditioned cross-entropy loss to effectively learn graph invariance (Theorem 4.2).
Extensive experiments on synthetic datasets demonstrate that LIRS is able to learn more invariant features than state-of-the-art graph invariant learning methods that adopt the direct invariant learning paradigm. Furthermore, LIRS shows superior OOD performance on real-world datasets with various types of distribution shifts, highlighting its effectiveness in learning graph invariant features. Our contributions can be summarized as follows:

- We reveal that learning invariant features indirectly, i.e., by first learning and then removing spurious features, can be more effective in capturing invariant features than existing (graph) invariant learning algorithms that learn invariant features directly.
- We propose LIRS, a novel framework that adopts the indirect learning paradigm for learning invariant features, which consists of: a) the biased infomax principle, a contrastive learning algorithm that provably learns graph spuriosity; b) a class-conditioned cross-entropy loss, a learning objective that effectively learns graph invariance by removing spurious features from ERM-learned features.
- Extensive experiments demonstrate that LIRS outperforms the second-best baseline method by up to 25.50% across 17 competitive baselines on both synthetic and real-world datasets with various distribution shifts.

2 PRELIMINARY

Notations. An undirected graph $G$ with $n$ nodes and $m$ edges is denoted by $G := \{V, E\}$, where $V$ is the node set and $E$ is the edge set. $G$ is also represented by the adjacency matrix $A$ and the node feature matrix $X \in \mathbb{R}^{n \times D}$ with $D$ feature dimensions. We use $G_c$ and $G_s$ to denote the invariant and spurious subgraphs of graph $G$ respectively, and $h_c$ and $h_s$ are the invariant and spurious features in the latent space. $\hat{h}_i$ and $\hat{h}_G$ denote the estimated representations for node $v_i$ and graph $G$ respectively. Finally, we use [K] := {1, 2, . . .
, K} to denote an index set, $w$ to denote a scalar, $\mathbf{w}$ a vector, $\mathbf{W}$ a matrix, $W$ a random variable, and $\mathcal{W}$ a set. A more complete set of notations is presented in Appendix A.

Problem Definition. We focus on OOD generalization for graph classification under hidden environments. Given a set of graph datasets $\mathcal{D} = \{\mathcal{G}^e\}_{e \in \mathcal{E}_{tr}}$ with $\mathcal{E}_{tr} \subseteq \mathcal{E}_{all}$, a GNN model $f$, denoted as $\rho \circ h$, comprises an encoder $h: \mathcal{G} \to \mathbb{R}^F$ that learns a representation $\hat{h}_G$ for each graph $G$, followed by a downstream classifier $\rho: \mathbb{R}^F \to \mathcal{Y}$ to predict the label $\hat{Y}_G = \rho(\hat{h}_G)$. The objective of OOD generalization on graphs is to learn an optimal GNN model $f^*(\cdot): \mathcal{G} \to \mathcal{Y}$ using data from the training environments $\mathcal{D}_{tr} = \{\mathcal{G}^e\}_{e \in \mathcal{E}_{tr}}$ that effectively generalizes across all (unseen) environments:

$f^*(\cdot) = \arg\min_f \sup_{e \in \mathcal{E}_{all}} R(f \mid e)$,

where $R(f \mid e) = \mathbb{E}^e_{G,Y}[l(f(G), Y)]$ is the risk of the predictor $f$ on environment $e$, and $l(\cdot, \cdot): \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ denotes a loss function.

Assumption 1. (Stable and predictive subgraphs) Given a graph $G \in \mathcal{D}$, there exists a set of stable (causal) substructure patterns $\mathcal{G}_c$ for every class label $Y = y$, satisfying: a) $\forall e, e' \in \mathcal{E}_{tr}, \forall G_c \in \mathcal{G}_c$, $P^e(Y \mid G_c) = P^{e'}(Y \mid G_c)$; and b) the target $Y$ can be expressed as $Y = f^*(G_c) + \epsilon$, where $\epsilon \perp G$ represents random noise, and $\perp$ indicates statistical independence.

Assumption 1 posits that one or more substructure patterns in $\mathcal{G}_c$ are not only stably associated with the target label $Y$ across different environments but also possess sufficient predictive power to accurately determine $Y$. This assumption is well-aligned with real-world scenarios. For instance, in the GOODHIV dataset (Gui et al., 2022; Hu et al., 2020; Wu et al., 2018), a molecule's ability to inhibit HIV may depend on the presence of several functional groups interacting with various parts of the virus.
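The minimax objective above can be made concrete with a small sketch (pure Python; the per-environment risk values are hypothetical, not from the paper): each candidate predictor is scored by its worst-case risk over environments, and the OOD-optimal predictor minimizes that score.

```python
# Sketch (our own toy numbers): the OOD objective picks the predictor that
# minimises the worst-case risk sup_{e in E_all} R(f | e). Here a candidate
# predictor is summarised only by its per-environment risks.

def worst_case_risk(per_env_risks):
    """R_worst(f) = sup_e R(f | e), estimated over the observed environments."""
    return max(per_env_risks)

def select_ood_predictor(candidates):
    """argmin_f sup_e R(f | e), given {name: [risk in e1, risk in e2, ...]}."""
    return min(candidates, key=lambda name: worst_case_risk(candidates[name]))

candidates = {
    "erm": [0.05, 0.06, 0.60],        # exploits a spurious feature: fails in e3
    "invariant": [0.12, 0.11, 0.13],  # relies only on G_c: uniform risk
}
print(select_ood_predictor(candidates))  # -> "invariant"
```

The ERM-style candidate has a better average risk, yet the invariance-driven objective prefers the uniformly mediocre predictor because its worst environment is far safer.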
Moreover, a recent study also provides empirical evidence for graph applications, suggesting that multiple substructures remain stable and predictive of the targets (see Appendix E.2 in Bui et al. (2024)), thereby supporting the validity of Assumption 1. Therefore, when OOD algorithms are able to learn a broader set of invariant substructures (features), they will generalize more effectively across different environments.

3 WHY LEARN INVARIANT FEATURES INDIRECTLY?

Figure 2: Comparison of the distribution $P_{\Delta\text{prob}}$ across different OOD algorithms on SPMotif-0.40 and SPMotif-0.60. Wasserstein distances to the null distribution: SPMotif-0.40: IRMv1 0.36, VRex 0.32, GREA 0.26, GIL 0.22, Ours 0.15; SPMotif-0.60: IRMv1 0.32, VRex 0.31, GREA 0.25, GIL 0.21, Ours 0.16.

In this section, we motivate our study by addressing the question: Why might learning invariant features indirectly be more effective than learning them directly? We begin by presenting an intuitive hypothesis related to our research question, followed by an empirical study on synthetic datasets to support this hypothesis. Consider a dataset $\mathcal{D}$ where each sample $G^{(i)}$ comprises invariant features $\{h_{c,j}^{(i)}\}_{j=1}^{n_c}$ and spurious features $\{h_{s,j}^{(i)}\}_{j=1}^{n_s}$, with $n_c$ and $n_s$ denoting the number of invariant and spurious features respectively. Any subset of the invariant features $\{h_{c,j}^{(i)}\}_{j=1}^{n_c}$ maintains a stable relationship with $Y$; consequently, OOD objectives that aim to directly learn invariant features (e.g., IRMv1 and VRex) may capture only a subset of these invariant features, since such a subset can already achieve accurate predictions on the training set. In contrast, an OOD objective that seeks to remove the spurious features $\{h_{s,j}^{(i)}\}_{j=1}^{n_s}$ from the ERM-learned features may result in learning a more complete set of invariant features from $\{h_{c,j}^{(i)}\}_{j=1}^{n_c}$, thus facilitating OOD generalization in unseen environments.

Empirical Study.
To validate our hypothesis that learning invariant features by removing spurious features may facilitate learning more invariant features, we perform experiments using a variant of the SPMotif datasets (Wu et al., 2022c; Ying et al., 2019). Specifically, for each sample, three invariant subgraphs are attached to the base spurious subgraph in the training and validation sets. On the test set, we perform model inference and record the estimated probability for correctly predicted samples, denoted as $\hat{y}_j^{(i)}$. Subsequently, we randomly remove two invariant subgraphs from these samples and compute the new estimated probability score, denoted as $y_j^{(i)}$. We then fit the distribution $P_{\Delta\text{prob}}$ using Kernel Density Estimation (KDE) (Terrell & Scott, 1992), based on the changes in probability $\Delta\text{prob} := \hat{y}_j^{(i)} - y_j^{(i)}, \forall i, j$, as illustrated in Figure 2. If the encoder $h(\cdot)$ can indeed learn all invariant features in the test set, then removing two invariant subgraphs should not change $\hat{y}_j^{(i)}$. Therefore, the more closely $P_{\Delta\text{prob}}$ concentrates around 0, the more completely the encoder has learned the full set of invariant features. In addition to the density of $\Delta\text{prob}$, we also calculate the Wasserstein distance of each algorithm to a null distribution whose probability mass is entirely centered at 0. As demonstrated in Figure 2, IRMv1 (Arjovsky et al., 2020) and VRex (Krueger et al., 2021) exhibit significant changes in $\Delta\text{prob}$, implying that for some samples, IRMv1 and VRex learn only a subset of the invariant patterns. GIL (Li et al., 2022b) and GREA (Liu et al., 2022) are two variants of VRex and IRM that learn invariant features directly by performing environment inference and environment augmentation respectively. As shown in Figure 2, the Wasserstein distances for GIL and GREA are also greater than that of our proposed method.
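The diagnostic above can be sketched in a few lines (our own synthetic numbers, not the paper's): for a one-dimensional distribution, the Wasserstein-1 distance to a point mass at 0 reduces to the mean absolute value of the samples.

```python
import numpy as np

# Sketch of the Delta_prob diagnostic: the drop in predicted probability after
# deleting two invariant subgraphs, and its Wasserstein-1 distance to a null
# distribution concentrated at 0 (which equals E|Delta_prob| in 1-D).

def delta_prob(p_full, p_ablated):
    """Per-sample probability change after removing invariant subgraphs."""
    return np.asarray(p_full) - np.asarray(p_ablated)

def w1_to_null(deltas):
    """W1(P, delta_0) = E_{x~P}|x| for a 1-D distribution vs a point mass at 0."""
    return float(np.mean(np.abs(deltas)))

full = np.array([0.95, 0.90, 0.88, 0.93])     # probabilities with all subgraphs
ablated = np.array([0.94, 0.60, 0.85, 0.55])  # two samples relied on the removed ones
print(w1_to_null(delta_prob(full, ablated)))
```

A smaller value indicates the encoder's predictions barely depend on any single invariant subgraph, i.e., it has learned a more complete set of invariant features.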
This highlights the advantage of the learning paradigm that indirectly learns invariant features by first learning and then removing spuriosity.

Figure 3: Test accuracy under varying spurious correlation strengths (0.4 to 0.9) on the variant of the SPMotif datasets, comparing ERM, IRM, VRex, GIL, GREA, and Ours.

However, one may raise the concern that while learning invariant features by first removing spurious features may indeed lead to capturing a more complete set of invariant features for the correctly predicted samples, it may also introduce spurious patterns if the spurious features are not effectively removed, which could render such a learning paradigm suboptimal in practice. To address this concern, we report test accuracy under varying strengths of spurious correlations, following the previous experiments. As the strength of the spurious correlation increases, it becomes more challenging to eliminate the spurious features. From Figure 3, we make three observations: a) our method consistently achieves superior OOD performance across different correlation strengths; b) our method is less sensitive to increased spurious correlations than other representative OOD methods, as evidenced by its flatter curve in Figure 3; c) the higher test accuracy, combined with the distribution of probability changes from the previous experiment, suggests that our method enables a larger number of test samples to benefit from a more complete set of learned invariant features. Based on these observations and analysis, we conclude that indirectly learning invariant features is more effective than learning them directly.

4 METHODOLOGY

Motivated by the observations in Sec. 3, we now introduce our proposed LIRS framework, which learns invariant features on graphs indirectly.

4.1 LEARNING GRAPH SPURIOSITY VIA BIASED INFOMAX

The first step of learning invariant features in the LIRS framework is to effectively learn graph spuriosity.
In this section, we first propose a variant of the infomax principle (Hjelm et al., 2019; Veličković et al., 2019; Linsker, 1988), namely biased infomax, which encourages spuriosity learning by suppressing invariant signals. We then present a more practical version of biased infomax suitable for real-world datasets. We first formally define spuriosity learning below:

Definition 1. (Spuriosity Learning) Assuming each data instance $G^{(i)}$ in a dataset $\mathcal{D}$ is composed of two parts $h_c^{(i)}$ and $h_s^{(i)}$, an algorithm $\mathcal{A}$ is said to be a spuriosity learner if $h_s^{(i)} = \mathcal{A}(G^{(i)}), \forall i$, i.e., $\mathcal{A}$ learns only the spurious features for all instances in $\mathcal{D}$.

One such algorithm that encourages spuriosity learning is the global-local infomax principle (Hjelm et al., 2019; Veličković et al., 2019; Linsker, 1988), which maximizes the mutual information (MI) between the global representation and the local node representations, as defined in the following:

Definition 2. (The Infomax Principle) The infomax principle optimizes the objective in Eqn. 1 w.r.t. the GNN encoder $h_\theta(\cdot)$:

$\max_\theta \; \mathbb{E}_{G \sim \mathcal{G}} \, \frac{1}{|G|} \sum_{v_i \in G} I(\hat{h}_i, \hat{h}_G), \quad \text{s.t.} \;\; \hat{h}_i = h_\theta(G), \; \hat{h}_G = \text{READOUT}(\hat{h}_i). \quad (1)$

The encoder $h_\theta(\cdot)$ will primarily learn spurious features (Yao et al., 2024). However, it is still constrained by $|G_c|$, the size of the invariant subgraph: as $|G_c|$ increases, the learned global representation $\hat{h}_G$ will increasingly capture invariant representations (Yao et al., 2024). To mitigate this constraint and further facilitate spuriosity learning, we propose the biased infomax principle.

Definition 3. (The Biased Infomax Principle) The biased infomax principle optimizes the objective in Eqn. 2 w.r.t. the GNN encoder $h_\theta(\cdot)$:

$\max_\theta \; \mathbb{E}_{G \sim \mathcal{G}} \, \frac{1}{|G|} \left( \sum_{v_i \in G_s} I(\hat{h}_i, \hat{h}_G) - \sum_{v_i \in G_c} I(\hat{h}_i, \hat{h}_G) \right), \quad \text{s.t.} \;\; \hat{h}_i = h_\theta(G), \; \hat{h}_G = \text{READOUT}(\hat{h}_i). \quad (2)$
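As a minimal numerical sketch (function names are ours; `mi[i]` stands in for an estimate of $I(\hat{h}_i, \hat{h}_G)$ produced by any MI estimator, e.g., a DGI-style discriminator), Eqns. 1 and 2 differ only in the sign applied to nodes in $G_c$:

```python
import numpy as np

# Sketch of the two objectives, assuming per-node MI estimates are given.
# is_spurious marks v_i in G_s; invariant nodes (v_i in G_c) get a negative sign
# under biased infomax, suppressing invariant signal in h_G.

def vanilla_infomax_objective(mi):
    """Eqn. 1: all nodes are treated alike."""
    return float(np.mean(mi))

def biased_infomax_objective(mi, is_spurious):
    """Eqn. 2: maximise MI over spurious nodes, minimise it over invariant ones."""
    mi = np.asarray(mi, dtype=float)
    mask = np.asarray(is_spurious, dtype=bool)
    return float((mi[mask].sum() - mi[~mask].sum()) / len(mi))

mi = [0.9, 0.8, 0.2, 0.1]              # per-node MI with the global readout
is_spurious = [True, True, False, False]
print(biased_infomax_objective(mi, is_spurious))  # (1.7 - 0.3) / 4 = 0.35
```

Maximizing the biased objective rewards an encoder whose global readout shares information with spurious nodes and penalizes any dependence on invariant ones, which is exactly what Theorem 4.1 below formalizes.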
By maximizing MI for nodes $v_i \in G_s$ and minimizing MI for nodes $v_i \in G_c$, the biased infomax principle yields an encoder $h_{\theta^*}(\cdot)$ that encourages the learning of spurious patterns while suppressing invariant ones. Formally, we can demonstrate that the biased infomax principle achieves spuriosity learning.

Theorem 4.1. Given that the invariant subgraph $G_c$ contains invariant patterns that are causally related to the target labels, and $G_s$ contains only spurious patterns, the biased infomax principle achieves spuriosity learning, i.e., the encoder $h_{\theta^*}(\cdot)$ learns solely the spurious features for each data sample $G$ in $\mathcal{D}$.

To achieve invariant learning by removing spuriosity, it is crucial to effectively learn the spurious features first. Theorem 4.1 demonstrates that the biased infomax principle can learn graph spuriosity more effectively than the vanilla infomax defined in Def. 2. However, in real-world scenarios, $G_c$ is often unobservable, which hinders the practicality of the biased infomax principle. To make the proposed algorithm more practical, we propose an instance-level adaptive biased infomax, which first incorporates a GNN explainer to identify the important nodes, and then performs instance-wise counterfactual inference to determine whether the selected nodes should be biased or not.

Instance-level adaptive biased infomax. At first glance, the biased infomax principle appears impractical due to the necessity of knowing $G_c$ and $G_s$. To address this limitation, we employ an approximation algorithm to identify $\hat{G}_c$, which serves as an estimate of $G_c$. Subsequently, we bias the nodes $v_i \in \hat{G}_c$ to realize the biased infomax principle. In practice, we utilize GSAT (Miao et al., 2022), a GNN explainer $e(\cdot)$ that is robust under distribution shifts, to identify $\hat{G}_c$ and $\hat{G}_s = G \setminus \hat{G}_c$. Nevertheless, the approximation error may inadvertently bias the algorithm by amplifying invariant signals and suppressing spurious ones, as demonstrated in the following proposition.

Proposition 1.
Given an error rate of $p\%$ in the approximation algorithm for $G_c$, i.e., $\hat{G}_c$ contains $(1-p)\%$ nodes from $G_c$ and $p\%$ nodes from $G_s$, let the learning objectives for the biased infomax with the ground-truth subgraphs $G_c$ and $G_s$, and with the approximated subgraphs $\hat{G}_c$ and $\hat{G}_s$, be denoted as $\mathcal{L}(\theta^*; \mathcal{D})$ and $\mathcal{L}(\tilde{\theta}; \mathcal{D})$ respectively. The difference between $\mathcal{L}(\theta^*; \mathcal{D})$ and $\mathcal{L}(\tilde{\theta}; \mathcal{D})$ can be expressed as:

$\sum_{v_i \in p\% G_s} I(\hat{h}_i; \hat{h}_G) - \sum_{v_i \in p\% G_c} I(\hat{h}_i; \hat{h}_G),$

where $p\% G$ denotes $p\%$ of the nodes in graph $G$. When $\hat{h}_G$ learns primarily spurious features, $I(\hat{h}_i; \hat{h}_G) > I(\hat{h}_j; \hat{h}_G), \forall v_i \in G_s, v_j \in G_c$; Prop. 1 then implies that as $p\%$ increases, the gap between the ground-truth learning objective and the learning objective with approximation also increases, leading to inaccurate graph representations. To reduce the approximation error and mitigate this issue, we conduct counterfactual inference for a graph $G$ by evaluating the change in predicted probability with and without the $\hat{G}_c$ derived from the GNN explainer. If $\hat{G}_c = e(G, Y)$ closely approximates the ground-truth $G_c$, the removal of $\hat{G}_c$ should result in a significant change in the predicted probability score. Under this condition, for graph $G$, we bias the nodes $v_i \in \hat{G}_c$; otherwise, we employ vanilla infomax without the bias operation. Specifically, we use a pre-defined threshold $\tau$. For each graph instance $G^{(i)}$ with target label $j$, let $\hat{y}_j^{(i)}$ and $y_j^{(i)}$ denote the predicted probability scores before and after the removal of $\hat{G}_c$, respectively. We then bias the infomax for graph $G^{(i)}$ using Eqn. 2 if $\hat{y}_j^{(i)} - y_j^{(i)} > \tau$; otherwise, we use Eqn. 1 to learn the graph representation.

4.2 LEARNING GRAPH INVARIANCE VIA CLASS-CONDITIONED CROSS-ENTROPY LOSS

Class-conditioned cross-entropy loss. So far we have obtained the spurious graph representations $h_s$ for every instance $G$ in the dataset. Next we consider how to learn graph invariance by removing the spurious features from the ERM-learned representations.
To solve this challenge, a previous study (Yao et al., 2024) takes the spurious features $h_s$ and target label $Y$ as inputs to train a linear classifier with cross-entropy loss, obtains logits that capture the spurious correlation for each sample $G$, and uses these (spurious) logits for subsequent spurious feature removal. Although the input contains only spurious features, we argue that the generated logits may fail to accurately capture the spurious correlation strengths due to the overlap of spurious patterns across different classes. We propose Prop. 2 to demonstrate this issue:

Proposition 2. Given a linear regression model with parameters $\{\theta_1, \theta_2\}$ and spurious features $\{x_1, x_2\}$, the correlation strength for feature $x_1$ is $p$ and for $x_2$ is $1-p$ when $Y = 0$. Similarly, the correlation strength for feature $x_1$ is $q$ and for $x_2$ is $1-q$ when $Y = 1$. Assuming the spurious features $x_1$ and $x_2$ can each take values in $\{0, 1\}$, we obtain the following parameter estimates using the Mean Squared Error (MSE) loss: $\theta_1 = \frac{q}{p+q}$, $\theta_2 = \frac{1-q}{2-p-q}$.

Prop. 2 demonstrates the issue brought about by the interference of overlapping spurious features across different classes. For instance, when $p = q$, $\theta_1 = \theta_2 = \frac{1}{2}$, and the generated logits fail to distinguish different spurious patterns. To mitigate this issue, we propose the class-conditioned cross-entropy loss. Specifically, we first perform clustering within each class based on the spurious features $\{h_s^{(i)}\}_{i=1}^{n}$. The resulting clusters accurately reflect the different spurious patterns and environments, given the spurious features learned by the biased infomax objective. However, employing hard labels can result in a loss of information about the relative positions of samples within a cluster. To address this, we refit a linear classifier using the cluster labels as targets.
This approach ensures that samples near the cluster boundaries exhibit higher entropy in their spurious logits, thereby preserving information about their position within the cluster. Given the estimated spurious logits $s^{(i)} \in \mathbb{R}^K$ with $K$ clusters in each class, we propose the following learning objective for spurious feature removal:

$\mathcal{L}_{inv} := -\mathbb{E}_{y \sim Y} \, \mathbb{E}_{G^{(i)} \mid Y=y} \left[ \sum_{j=1}^{K} w^{(i)} s_j^{(i)} \log \sigma(\hat{s}_j^{(i)}) \right], \quad \text{s.t.} \;\; \hat{s}^{(i)} = \rho'(\hat{h}_G^{(i)}) \in \mathbb{R}^K. \quad (3)$

Here, $\sigma(\cdot)$ denotes the softmax function, and $\rho'(\cdot): \mathbb{R}^F \to \mathbb{R}^K$ represents the clustering classification head, which takes $\hat{h}_G$ from the GNN encoder $h(\cdot)$ as input. The reweighting coefficient for each sample $G^{(i)}$ is denoted as $w^{(i)}$; it reduces the weight of samples from the majority group and increases the weight of samples from the minority group. The generalized cross-entropy (GCE) method (Zhang & Sabuncu, 2018) is employed to calculate $w^{(i)}$, i.e., $w^{(i)} = \frac{1 - (s_j^{(i)})^\gamma}{\gamma}$, where $j$ is the ground-truth clustering label for sample $G^{(i)}$, and $\gamma$ is a hyperparameter. Finally, the loss objective for feature disentanglement is:

$\mathcal{L} = \mathcal{L}_{GT} + \lambda \mathcal{L}_{inv}, \quad (4)$

where $\mathcal{L}_{GT}$ denotes the ERM loss. $\mathcal{L}_{inv}$, serving as a regularization term, guides the learning process to learn invariant features by removing spurious features from the ERM-learned features. Formally, we present the following theorem.

Theorem 4.2. There exist a suitable $\gamma$ and clustering number $K$ such that minimizing the loss objective $\mathcal{L} = \mathcal{L}_{GT} + \lambda \mathcal{L}_{inv}$ leads to the optimal encoder $h^*(\cdot)$, which elicits invariant features for any graph $G$, i.e., $\hat{h}_G = h^*(G) = h_c$.

The proof is included in Appendix D. In the proof of Theorem 4.2, we show that $\mathcal{L}_{inv}$ guides the ERM-learned features to focus on invariant features by removing or unlearning the spurious ones.
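The per-sample computation in Eqn. 3 can be sketched numerically (synthetic numbers and function names are ours; $s$ is the soft spurious logit vector from the refit linear classifier, $\hat{s}$ the clustering head's output):

```python
import numpy as np

# Sketch of the class-conditioned objective for a single sample with K clusters.

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

def gce_weight(s, j, gamma=0.7):
    """GCE-style weight w = (1 - s_j^gamma) / gamma: down-weights confident
    (majority-group) samples, up-weights uncertain (minority-group) ones."""
    return (1.0 - s[j] ** gamma) / gamma

def l_inv_sample(s, shat_logits, j, gamma=0.7):
    """One sample's contribution to L_inv: weighted soft cross-entropy between
    the spurious logits s and the clustering head's prediction sigma(shat)."""
    s = np.asarray(s, dtype=float)
    w = gce_weight(s, j, gamma)
    log_p = np.log(softmax(shat_logits))
    return -w * float(np.dot(s, log_p))

s = [0.7, 0.2, 0.1]       # soft cluster assignment (spurious logits)
shat = [2.0, 0.5, -1.0]   # clustering head output for this graph
print(l_inv_sample(s, shat, j=0))
```

Note how a sample whose spurious logit is near 1 for its own cluster (a clear majority-group sample) receives a much smaller weight than a boundary sample, which is the mechanism the text describes for rebalancing groups.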
More concretely, given that ERM learns both invariant and spurious features (Kirichenko et al., 2023; Chen et al., 2023b), i.e., $\hat{h}_G = \kappa(h_c, h_s)$, where $\kappa(\cdot)$ is a functional mapping to the graph representation $\hat{h}_G$ given a set of invariant features $h_c$ and a set of spurious features $h_s$, $\mathcal{L}_{inv}$ regularizes the learning process to optimize $\kappa^*(\cdot)$ such that $\hat{h}_G$ contains only $h_c$. While previous OOD methods achieve a similar goal (to learn only invariant features), our approach differs by first learning the spurious features and then leveraging them to identify the causal features from the ERM-learned features, a process we refer to as spurious feature removing or unlearning.

Table 1: Performance on synthetic and real-world datasets. Numbers in bold indicate the best performance, while underlined numbers indicate the second-best performance.

| Method | GOOD-Motif base | GOOD-Motif size | GOOD-HIV scaffold | GOOD-HIV size | OGBG-Molbace scaffold | OGBG-Molbace size | OGBG-Molbbbp scaffold | OGBG-Molbbbp size |
|---|---|---|---|---|---|---|---|---|
| ERM | 68.66±4.25 | 51.74±2.88 | 69.58±2.51 | 59.94±2.37 | 75.11±3.03 | 83.60±3.47 | 68.10±1.68 | 78.29±3.76 |
| IRM | 70.65±4.17 | 51.41±3.78 | 67.97±1.84 | 59.00±2.92 | 75.47±2.22 | 83.12±2.58 | 67.22±1.15 | 77.56±2.48 |
| GroupDRO | 68.24±8.92 | 51.95±5.86 | 70.64±2.57 | 58.98±2.16 | - | - | 66.47±2.39 | 79.27±2.43 |
| VRex | 71.47±6.69 | 52.67±5.54 | 70.77±2.84 | 58.53±2.88 | 72.81±4.29 | 82.55±2.51 | 68.74±1.03 | 78.76±2.37 |
| RSC | 46.12±3.76 | 51.70±5.47 | 69.16±3.23 | 61.17±0.74 | 74.59±3.65 | 84.34±2.65 | 69.01±2.84 | 78.07±3.89 |
| Diverse Model | 54.24±8.22 | 41.01±1.98 | 69.17±3.62 | 61.59±2.23 | 73.48±3.56 | 79.40±1.70 | 68.04±3.27 | 77.62±1.90 |
| DropEdge | 45.08±4.46 | 45.63±4.61 | 70.78±1.38 | 58.53±1.26 | 70.81±2.12 | 76.39±2.29 | 66.49±1.55 | 78.32±3.44 |
| FLAG | 61.12±5.39 | 51.66±4.14 | 68.45±2.30 | 60.59±2.95 | 80.37±1.58 | 84.72±0.88 | 67.69±2.36 | 79.26±2.26 |
| LiSA | 54.59±4.81 | 53.46±3.41 | 70.38±1.45 | 52.36±3.73 | 78.05±5.01 | 83.92±2.52 | 68.11±0.52 | 78.62±3.74 |
| DIR | 62.07±8.75 | 52.27±4.56 | 68.07±2.29 | 58.08±2.31 | 75.49±2.80 | 77.42±7.43 | 66.86±2.25 | 76.40±4.43 |
| DisC | 51.08±3.08 | 50.39±1.15 | 68.07±1.75 | 58.76±0.91 | 57.78±3.60 | 71.13±8.86 | 67.12±2.11 | 56.59±10.09 |
| CAL | 65.63±4.29 | 51.18±5.60 | 67.37±3.61 | 57.95±2.24 | 76.29±1.60 | 79.68±4.06 | 68.06±2.60 | 79.50±4.81 |
| GREA | 56.74±9.23 | 54.13±10.02 | 67.79±2.56 | 60.71±2.20 | 77.16±1.37 | 83.15±9.07 | 69.72±1.66 | 77.34±3.52 |
| GSAT | 62.80±11.41 | 53.20±8.35 | 68.66±1.35 | 58.06±1.98 | 72.32±5.66 | 82.45±2.73 | 66.78±1.45 | 75.63±3.83 |
| CIGA | 66.43±11.31 | 49.14±8.34 | 69.40±2.39 | 59.55±2.56 | 76.44±1.72 | 83.95±2.75 | 64.92±2.09 | 65.98±3.31 |
| AIA | 73.64±5.15 | 55.85±7.98 | 71.15±1.81 | 61.64±3.37 | 79.42±2.01 | 85.11±0.74 | 70.79±1.53 | 81.03±5.15 |
| OOD-GCL | 56.46±4.61 | 60.23±8.49 | 70.85±2.07 | 58.48±2.94 | 75.96±2.21 | 85.34±1.77 | 67.28±3.09 | 78.11±3.32 |
| EQuAD | 67.11±10.11 | 59.72±3.69 | 72.24±0.64 | 64.19±0.56 | 79.15±2.32 | 86.41±5.63 | 70.22±2.36 | 80.82±5.28 |
| LIRS | 75.51±2.19 | 74.95±7.69 | 72.82±1.61 | 66.64±1.44 | 81.91±1.98 | 88.77±1.64 | 71.04±0.76 | 82.19±1.57 |

5 EXPERIMENTS

In this section, we perform an empirical study on both synthetic and real-world datasets. More details about the datasets and experimental setup are included in Appendix H.

5.1 EXPERIMENTAL SETUP

Datasets. We adopt the GOODMotif and GOODHIV datasets (Gui et al., 2022), as well as the OGBG-Molbace and OGBG-Molbbbp datasets (Hu et al., 2020; Wu et al., 2018), to comprehensively evaluate the OOD generalization performance of our proposed framework. For the GOODMotif datasets, we adopt the base shift and size shift; for the OGBG datasets, in addition to the scaffold shift, we also create a size shift, following previous studies (Gui et al., 2022; Sui et al., 2023). More details on the datasets used in our work are included in Appendix H.

Baselines. Besides ERM (Vapnik, 1995), we compare our method against three lines of OOD baselines: (1) OOD algorithms on Euclidean data, including IRM (Arjovsky et al., 2020), VRex (Krueger et al., 2021), and GroupDRO (Sagawa et al., 2019); (2) diverse feature learning methods, including RSC (Huang et al., 2020) and Diverse Model (Teney et al., 2022);
(3) graph-specific OOD and data augmentation algorithms that do not require environment labels, including DIR (Wu et al., 2022c), GSAT (Miao et al., 2022), GREA (Liu et al., 2022), DisC (Fan et al., 2022), CIGA (Chen et al., 2022), AIA (Sui et al., 2023), OOD-GCL (Li et al., 2024a), EQuAD (Yao et al., 2024), DropEdge (Rong et al., 2019), FLAG (Kong et al., 2022), and LiSA (Yu et al., 2023). Details of the baselines and their implementation are provided in Appendix H.

Evaluation. We report the ROC-AUC score for the GOOD-HIV, OGBG-Molbbbp, and OGBG-Molbace datasets, where the tasks are binary classification. For the GOOD-Motif and SPMotif datasets, we use accuracy as the evaluation metric. We run experiments 4 times with different random seeds, select models based on validation performance, and report the mean and standard deviation on the test set.

5.2 EXPERIMENTAL RESULTS

In this section, we report the main results on both synthetic and real-world datasets.

Synthetic datasets. LIRS achieves superior performance on GOOD-Motif, surpassing the second-best method by 12.51% (base split) and 25.50% (size split). This highlights its effectiveness in capturing domain-invariant features. While EQuAD also removes spurious features, its infomax-based learning may inadvertently retain them, whereas LIRS's biased infomax effectively suppresses spurious signals. OOD-GCL (Li et al., 2024a) employs contrastive learning without labeled data, making invariant feature learning more challenging. As shown in Table 1, LIRS outperforms OOD-GCL across all datasets, benefiting from labeled data during training. We also compare LIRS with RSC (Huang et al., 2020) and Diverse Model (Teney et al., 2022), which aim to learn diverse features via ERM. As illustrated in Table 1, although RSC and Diverse Model perform well on real-world datasets, they achieve suboptimal results on the GOOD-Motif datasets.
This discrepancy may arise from the fact that these methods attempt to capture diverse features, while in the GOOD-Motif datasets only one invariant subgraph is causally related to the target. As a result, these methods may mistakenly capture non-generalizable patterns in $G_s$. In contrast, LIRS leverages biased infomax to effectively learn spurious patterns and subsequently identifies the invariant subgraph by unlearning the spurious patterns. Lastly, LIRS significantly outperforms the other OOD methods that aim to directly learn invariant features, showcasing the efficacy of the proposed learning paradigm.

Real-world datasets. Real-world datasets present more complex and realistic distribution shifts, and many graph OOD algorithms exhibit instability on them, occasionally underperforming ERM. In contrast, LIRS adopts an indirect learning paradigm, which may enable the learning of more invariant features compared to previous graph invariant learning methods, leading to more effective OOD generalization. Similar to the results on the synthetic datasets, LIRS also demonstrates superior OOD performance under both the scaffold shift and the size shift compared to other methods. EQuAD, which adopts a similar learning paradigm to LIRS, shows a comparable trend, with a more significant advantage under the size shift relative to other methods. Furthermore, the improvements of LIRS over EQuAD across various datasets highlight the effectiveness of the biased infomax principle and the intra-class cross-entropy loss for learning spurious features and invariant features respectively. Compared to other baselines that learn invariant features directly, LIRS outperforms the best of them (AIA) by 4.62% under the size shift and 1.95% under the scaffold shift.

5.3 IN-DEPTH ANALYSIS

Table 2: Performance on the SPMotif-binary datasets under varying spurious correlations.
Method | b = 0.40 | b = 0.60 | b = 0.90
ERM | 74.93 ± 3.94 | 72.78 ± 3.15 | 63.78 ± 4.18
IRM | 76.01 ± 4.12 | 70.85 ± 4.73 | 66.55 ± 4.80
VREx | 79.03 ± 1.02 | 73.78 ± 1.75 | 65.27 ± 6.78
GIL | 77.15 ± 3.18 | 73.85 ± 2.76 | 68.90 ± 7.28
GREA | 79.65 ± 6.36 | 73.01 ± 7.99 | 69.85 ± 0.35
CIGA | 76.93 ± 3.94 | 71.70 ± 1.55 | 66.80 ± 5.35
AIA | 78.46 ± 3.19 | 71.83 ± 0.69 | 64.37 ± 4.14
EQuAD | 80.82 ± 0.65 | 74.20 ± 4.10 | 69.79 ± 7.81
LIRS | 82.17 ± 0.91 | 75.32 ± 1.65 | 71.29 ± 2.12
Can LIRS learn more invariant features? To further validate the quality of the invariant features learned by LIRS, we curated a dataset derived from the SPMotif dataset. Specifically, we constructed a binary classification dataset where the motifs House and Crane correspond to label 0, and Diamond and Cycle correspond to label 1. When constructing each class's samples, we attached both invariant subgraphs to the base subgraph with 50% probability; in the remaining 50%, we randomly attached one invariant subgraph to the base spurious subgraph. For the test set, we randomly attached a single invariant subgraph to the base subgraph. As in the SPMotif dataset, the base spurious subgraph was correlated with the target labels, maintaining an equal correlation strength with the labels in the test set. In addition to evaluating LIRS against traditional methods such as IRM, VREx, and their variants, we also included strong graph-specific OOD baselines, namely CIGA, AIA, and EQuAD, for comparison. The experimental results in Table 2 demonstrate that LIRS consistently outperforms the other baselines across varying spurious correlation strengths, indicating its superior ability to learn more invariant features. Similarly, EQuAD, which adopts a similar learning paradigm, also achieves competitive performance on the SPMotif-binary datasets. This demonstrates that learning invariant features indirectly can lead to learning a more comprehensive set of invariant features and achieving better OOD generalization.
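The SPMotif-binary construction described above can be sketched as follows. Graphs are represented abstractly (the actual SPMotif graph-generation code is not reproduced here), and all function and variable names are illustrative assumptions.

```python
import random

# Motif-to-label assignment from the construction described in the text.
MOTIF_TO_LABEL = {"House": 0, "Crane": 0, "Diamond": 1, "Cycle": 1}
CLASS_MOTIFS = {0: ["House", "Crane"], 1: ["Diamond", "Cycle"]}

def make_train_sample(label, rng):
    """One training sample: with 50% probability attach both of the
    class's invariant motifs to the spurious base subgraph, otherwise
    attach a single randomly chosen invariant motif."""
    base = f"base_for_class_{label}"  # spurious base correlated with label
    if rng.random() < 0.5:
        motifs = list(CLASS_MOTIFS[label])          # both invariant motifs
    else:
        motifs = [rng.choice(CLASS_MOTIFS[label])]  # one invariant motif
    return {"base": base, "motifs": motifs, "label": label}

def make_test_sample(label, rng):
    """Test samples always carry exactly one invariant motif."""
    base = f"base_for_class_{label}"
    return {"base": base, "motifs": [rng.choice(CLASS_MOTIFS[label])], "label": label}
```

A model that captures only one of the two invariant motifs per class would be sufficient for many training samples but would fail on test samples carrying the other motif, which is what makes this split diagnostic of how comprehensively invariant features are learned.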
Figure 4: Ablation and hyperparameter sensitivity studies for LIRS. (a) Ablation study on biased infomax and the intra-class cross-entropy loss (LIRS vs. w/o biased_infomax vs. w/o intra_class_loss, on GOODMotif-Base, GOODMotif-Size, GOODHIV-Sca, and GOODHIV-Size). (b) Hyperparameter sensitivity analysis on the GOODHIV datasets.
Ablation study. We conducted an ablation study on biased infomax and the class-conditioned (intra-class) cross-entropy loss to evaluate their effectiveness in LIRS. As illustrated in Figure 4a, replacing biased infomax with vanilla infomax to generate spurious features leads to performance degradation across all datasets. This decline is primarily due to the enhanced capability of biased infomax in learning spurious features, which is crucial for the subsequent steps of learning graph invariance. Furthermore, replacing the class-conditioned cross-entropy loss with the standard cross-entropy loss also has a negative effect, mainly due to the interference of spurious patterns across different classes, as discussed in Sec. 4. One typical piece of empirical support arises from the Motif-size dataset, where each spurious pattern is correlated with different target labels with nearly equal strength. As a result, even though biased infomax can generate accurate spurious features, the logits generated using the standard cross-entropy loss are not sufficiently precise, which aligns with Prop. 2. This limitation significantly constrains the test performance, as illustrated in Figure 4a. In contrast, the class-conditioned cross-entropy loss generates more accurate spurious logits, thus enhancing the OOD performance. Table 3: Performance on OGBG datasets using a different size split procedure, where the graphs in the training set have a smaller number of nodes compared to those in the test set.
Method | OGBG-Molbace | OGBG-Molbbbp
ERM | 76.31 ± 0.58 | 87.31 ± 2.37
IRM | 79.44 ± 3.05 | 88.77 ± 2.45
VREx | 76.38 ± 1.75 | 84.20 ± 2.80
GREA | 77.46 ± 5.57 | 86.18 ± 2.54
GSAT | 72.29 ± 4.45 | 87.46 ± 2.67
CIGA | 77.89 ± 3.68 | 87.94 ± 0.86
AIA | 78.62 ± 2.88 | 89.18 ± 1.77
EQuAD | 82.34 ± 3.42 | 88.65 ± 3.83
LIRS | 84.62 ± 0.84 | 90.24 ± 4.14
Hyperparameter sensitivity. We investigate the impact of hyperparameters in LIRS, including the reweighting coefficient γ, the penalty weight λ, and the hyperparameters related to biased infomax. Specifically, we analyze the effect of the epoch E at which embeddings are derived from biased infomax and of the threshold used to determine whether a sample should be considered biased. As illustrated in Figure 4b, on both the GOODHIV-Sca and GOODHIV-Size datasets, LIRS demonstrates stable performance, highlighting its robustness to variations in hyperparameter settings. This consistency suggests that LIRS is largely insensitive to hyperparameter tuning, which further underscores its applicability in real-world scenarios. How do different size split methods affect the OOD performance? For the real-world datasets, the size split was performed in descending order, i.e., the graphs in the training set have a larger number of nodes than those in the test set. To evaluate the impact of varying graph sizes on the generalization ability of various OOD methods, we also performed size splits in ascending order for the OGBG-Molbace and OGBG-Molbbbp datasets. The results of these experiments are presented in Table 3. As shown, LIRS continues to outperform all competitive baselines under this alternative size split. Specifically, LIRS achieves the highest ROC-AUC scores on both the OGBG-Molbace and OGBG-Molbbbp datasets. This highlights the superiority of LIRS in generalizing to larger graphs when trained on smaller ones, further confirming its potential for real-world applications. 6 RELATED WORK Learning invariant features.
OOD generalization is a critical challenge in machine learning, where models trained on a specific data distribution often fail to generalize to unseen distributions. Recently, invariance learning has been proposed to tackle this issue; it builds upon the theory of causality (Peters et al., 2016; Pearl, 2009) to learn causally related representations that remain stable across different environments (Arjovsky et al., 2020; Parascandolo et al., 2020; Mahajan et al., 2021; Wald et al., 2021; Ahuja et al., 2020; 2021). Inspired by IRM (Arjovsky et al., 2020), several invariant learning methods on graphs have been proposed (Yang et al., 2022; Li et al., 2022b; Fan et al., 2022; Liu et al., 2022; Wu et al., 2022c; Chen et al., 2022; 2023a; Gui et al., 2023; Sui et al., 2023; Li et al., 2024b) to learn graph-level representations that are robust to distribution shifts. Most of these methods aim to learn invariant features directly by training an equipredictive classifier. A recent study (Yao et al., 2024) proposed a new learning paradigm that learns graph invariance indirectly by first learning spurious features and then disentangling them from the ERM-learned representations. In this work, we adopt this learning paradigm for learning graph invariance. Self-supervised learning induces spuriosity. Self-supervised learning (SSL) has emerged as a potent paradigm for learning representations from unlabeled datasets. The fundamental idea is to devise pretext tasks that maximize the similarity between two augmented views of the same data point, often using the InfoNCE loss (Oord et al., 2018). This approach has been widely applied in domains such as images (Chen et al., 2020a;b) and graphs (Veličković et al., 2019; Sun et al., 2019; Zhu et al., 2020; You et al., 2020). Recent works have shown that SSL may tend to learn spurious features in both the image domain (Hamidieh et al., 2024; Meehan et al., 2023) and the graph domain (Yao et al., 2024). Hamidieh et al.
(2024) attempted to enhance SSL model performance by addressing this issue and promoting the learning of invariant features. In contrast, our goal in this work is to improve an SSL model's ability to capture spuriosity, thereby facilitating better feature disentanglement for invariant learning. Learning diverse features. Recent studies have shown that ERM tends to encourage models to learn the simplest predictive features (Hermann & Lampinen, 2020; Kalimeris et al., 2019; Neyshabur et al., 2014; Pezeshki et al., 2021). This simplicity bias causes models to rely on simple (spurious) but non-causal features, ignoring more complex patterns that might be equally predictive. To address this challenge, Huang et al. (2020) employ a self-challenging mechanism that forces the model to learn diverse patterns by discarding dominant features, while Teney et al. (2022) construct diverse features by training a collection of classifiers with diversity regularization. Additionally, Zhang et al. (2022a) and Chen et al. (2023b) adopt DRO to iteratively explore new features. While previous works aim to encourage models to learn a diverse set of predictive patterns, our method takes a different approach: we aim to remove, or unlearn, spurious features from the ERM-learned features, thereby allowing the model to capture more invariant features. 7 CONCLUSIONS In this study, we identified a critical limitation of IRM, VREx, and their variants on graph invariant learning: these methods may capture only a subset of the invariant features, thereby limiting their OOD generalization performance. To address this issue, we investigated the effectiveness of learning invariant features indirectly by first learning and then removing spurious features.
Our theoretical and empirical analyses demonstrate that this approach facilitates the learning of a more comprehensive set of invariant features than traditional (graph) invariant learning methods, thereby improving generalization to unseen environments. We then proposed LIRS, which adopts this learning paradigm and consists of: (a) the biased infomax principle and (b) the class-conditioned cross-entropy loss, which elicit effective spuriosity learning and invariant learning, respectively, with the aim of learning more invariant features. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of our proposed method over existing state-of-the-art OOD algorithms. ACKNOWLEDGEMENT We would like to acknowledge the support from NSF Award No. 2229881, the AI Institute for Societal Decision Making (AI-SDM), the National Institutes of Health (NIH) under Contract R01HL159805, as well as grants from Quris AI, Florin Court Capital, and the MBZUAI-WIS Joint Program. REFERENCES Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games. In International Conference on Machine Learning, pp. 145–155. PMLR, 2020. Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:3438–3450, 2021. Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization, 2020. Steve Azzolin, Antonio Longa, Pietro Barbiero, Pietro Liò, and Andrea Passerini. Global explainability of gnns via logic combination of learned concepts. arXiv preprint arXiv:2210.07147, 2022. Guy W Bemis and Mark A Murcko. The properties of known drugs. 1. molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, 1996.
Ngoc Bui, Hieu Trung Nguyen, Viet Anh Nguyen, and Rex Ying. Explaining graph neural networks via structure-aware interaction index. arXiv preprint arXiv:2405.14352, 2024. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020a. Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b. Yongqiang Chen, Yonggang Zhang, Yatao Bian, Han Yang, MA Kaili, Binghui Xie, Tongliang Liu, Bo Han, and James Cheng. Learning causally invariant representations for out-of-distribution generalization on graphs. Advances in Neural Information Processing Systems, 35:22131–22148, 2022. Yongqiang Chen, Yatao Bian, Kaiwen Zhou, Binghui Xie, Bo Han, and James Cheng. Does invariant graph learning via environment augmentation learn invariance? In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=EqpR9Vtt13. Yongqiang Chen, Wei Huang, Kaiwen Zhou, Yatao Bian, Bo Han, and James Cheng. Understanding and improving feature learning for out-of-distribution generalization. Advances in Neural Information Processing Systems, 36, 2023b. Alex J. DeGrave, Joseph D. Janizek, and Su-In Lee. AI for radiographic covid-19 detection selects shortcuts over signal. medRxiv, 2020. URL https://api.semanticscholar.org/CorpusID:221653623. Qi Dou, Daniel Coelho de Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. Advances in Neural Information Processing Systems, 32, 2019. Cian Eastwood, Shashank Singh, Andrei L Nicolicioiu, Marin Vlastelica Pogančić, Julius von Kügelgen, and Bernhard Schölkopf. Spuriosity didn't kill the classifier: Using invariant predictions to harness spurious features.
Advances in Neural Information Processing Systems, 36, 2023. Shaohua Fan, Xiao Wang, Yanhu Mo, Chuan Shi, and Jian Tang. Debiasing graph neural networks via learning disentangled causal substructure. Advances in Neural Information Processing Systems, 35:24934–24946, 2022. Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric, 2019. Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016. Shurui Gui, Xiner Li, Limei Wang, and Shuiwang Ji. GOOD: A graph out-of-distribution benchmark. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=8hHg-zs_p-h. Shurui Gui, Meng Liu, Xiner Li, Youzhi Luo, and Shuiwang Ji. Joint learning of label and environment causal independence for graph out-of-distribution generalization. Advances in Neural Information Processing Systems, 36, 2023. Kimia Hamidieh, Haoran Zhang, Swami Sankaranarayanan, and Marzyeh Ghassemi. Views can be deceiving: Improved SSL through feature space augmentation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mutJBk3ILg. Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning, pp. 4116–4126. PMLR, 2020. Katherine Hermann and Andrew Lampinen. What shapes feature representations? exploring datasets, architectures, and training. Advances in Neural Information Processing Systems, 33:9995–10006, 2020. R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization, 2019.
Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning, pp. 2029–2037. PMLR, 2018. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133, 2020. Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548, 2021. Zeyi Huang, Haohan Wang, Eric P Xing, and Dong Huang. Self-challenging improves cross-domain generalization. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pp. 124–140. Springer, 2020. Yuanfeng Ji, Lu Zhang, Jiaxiang Wu, Bingzhe Wu, Long-Kai Huang, Tingyang Xu, Yu Rong, Lanqing Li, Jie Ren, Ding Xue, et al. DrugOOD: Out-of-distribution (ood) dataset curator and benchmark for ai-aided drug discovery – a focus on affinity prediction problems with noise annotations. arXiv preprint arXiv:2201.09637, 2022. Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32, 2019. Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Thomas N. Kipf and Max Welling.
Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=SJU4ayYgl. Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations, 2023. Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637–5664. PMLR, 2021. Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, Gavin Taylor, and Tom Goldstein. Robust optimization as data augmentation for large-scale graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 60–69, 2022. Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021. David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). In International Conference on Machine Learning, pp. 5815–5826. PMLR, 2021. Haoyang Li, Xin Wang, Ziwei Zhang, and Wenwu Zhu. OOD-GNN: Out-of-distribution generalized graph neural network. IEEE Transactions on Knowledge and Data Engineering, 2022a. Haoyang Li, Ziwei Zhang, Xin Wang, and Wenwu Zhu. Learning invariant graph representations for out-of-distribution generalization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=acKK8MQe2xc. Haoyang Li, Ziwei Zhang, Xin Wang, and Wenwu Zhu.
Invariant node representation learning under distribution shifts with multiple latent environments. ACM Transactions on Information Systems, 42(1):1–30, 2023a. Haoyang Li, Xin Wang, Zeyang Zhang, Haibo Chen, Ziwei Zhang, and Wenwu Zhu. Disentangled graph self-supervised learning for out-of-distribution generalization. In Forty-first International Conference on Machine Learning, 2024a. Xiner Li, Shurui Gui, Youzhi Luo, and Shuiwang Ji. Graph structure and feature extrapolation for out-of-distribution generalization. arXiv preprint arXiv:2306.08076, 2023b. Xiner Li, Shurui Gui, Youzhi Luo, and Shuiwang Ji. Graph structure extrapolation for out-of-distribution generalization. In Forty-first International Conference on Machine Learning, 2024b. URL https://openreview.net/forum?id=Xgrey8uQhr. Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639, 2018. R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988. doi: 10.1109/2.36. Gang Liu, Tong Zhao, Jiaxin Xu, Tengfei Luo, and Meng Jiang. Graph rationalization with environment-based augmentations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22. ACM, August 2022. doi: 10.1145/3534678.3539347. URL http://dx.doi.org/10.1145/3534678.3539347. Yang Liu, Xiang Ao, Fuli Feng, Yunshan Ma, Kuan Li, Tat-Seng Chua, and Qing He. FLOOD: A flexible invariant learning framework for out-of-distribution generalization on graphs. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1548–1558, 2023. Dongsheng Luo, Wei Cheng, Dongkuan Xu, Wenchao Yu, Bo Zong, Haifeng Chen, and Xiang Zhang. Parameterized explainer for graph neural network.
Advances in Neural Information Processing Systems, 33:19620–19631, 2020. Divyat Mahajan, Shruti Tople, and Amit Sharma. Domain generalization using causal matching. In International Conference on Machine Learning, pp. 7313–7324. PMLR, 2021. Casey Meehan, Florian Bordes, Pascal Vincent, Kamalika Chaudhuri, and Chuan Guo. Do SSL models have déjà vu? A case of unintended memorization in self-supervised learning. Advances in Neural Information Processing Systems, 36, 2023. Siqi Miao, Mia Liu, and Pan Li. Interpretable and generalizable graph learning via stochastic attention mechanism. In International Conference on Machine Learning, pp. 15524–15543. PMLR, 2022. Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS '16, pp. 2216–2224, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. Giambattista Parascandolo, Alexander Neitz, Antonio Orvieto, Luigi Gresele, and Bernhard Schölkopf. Learning explanations that are hard to vary. arXiv preprint arXiv:2009.00329, 2020. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019. Judea Pearl. Causality. Cambridge University Press, Cambridge, UK, 2nd edition, 2009.
ISBN 978-0-521-89560-6. doi: 10.1017/CBO9780511803161. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947–1012, 2016. Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34:1256–1272, 2021. Phillip E Pope, Soheil Kolouri, Mohammad Rostami, Charles E Martin, and Heiko Hoffmann. Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10772–10781, 2019. Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang. DeepInf: Social influence prediction with deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18. ACM, July 2018. doi: 10.1145/3219819.3220077. URL http://dx.doi.org/10.1145/3219819.3220077. Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. DropEdge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903, 2019. Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
Thomas Schnake, Oliver Eberle, Jonas Lederer, Shinichi Nakajima, Kristof T Schütt, Klaus-Robert Müller, and Grégoire Montavon. Higher-order explanations of graph neural networks via relevant walks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7581–7596, 2021. Yongduo Sui, Qitian Wu, Jiancan Wu, Qing Cui, Longfei Li, Jun Zhou, Xiang Wang, and Xiangnan He. Unleashing the power of graph data augmentation on covariate distribution shift. Advances in Neural Information Processing Systems, 36, 2023. Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation, 2016. Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019. Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton Van den Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior ood generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16761–16772, 2022. George R Terrell and David W Scott. Variable kernel density estimation. The Annals of Statistics, pp. 1236–1265, 1992. Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle, 2015. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008. Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., 1995. ISBN 0-387-94559-8. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017. Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018. Petar Veličković, William Fedus, William L.
Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rklz9iAcKQ. Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=XWYJ25-yTRS. Xiaoqi Wang and Han-Wei Shen. GNNInterpreter: A probabilistic generative model-level explanation for graph neural networks. arXiv preprint arXiv:2209.07924, 2022. Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. Handling distribution shifts on graphs: An invariance perspective, 2022a. Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. Graph neural networks in recommender systems: A survey. ACM Computing Surveys, 55(5), December 2022b. ISSN 0360-0300. doi: 10.1145/3535101. URL https://doi.org/10.1145/3535101. Yingxin Wu, Xiang Wang, An Zhang, Xiangnan He, and Tat-Seng Chua. Discovering invariant rationales for graph neural networks. In International Conference on Learning Representations, 2022c. URL https://openreview.net/forum?id=hGXij5rfiHw. Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: A benchmark for molecular machine learning, 2018. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km. Nianzu Yang, Kaipeng Zeng, Qitian Wu, Xiaosong Jia, and Junchi Yan. Learning substructure invariance for out-of-distribution molecular representations. In Alice H.
Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=2nWUNTnFijm. Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, pp. 25407–25437. PMLR, 2022. Tianjun Yao, Yongqiang Chen, Zhenhao Chen, Kai Hu, Zhiqiang Shen, and Kun Zhang. Empowering graph invariance learning with deep spurious infomax. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=u9oSQtujCF. Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. GNNExplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems, 32, 2019. Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems, 33:5812–5823, 2020. Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-2018. International Joint Conferences on Artificial Intelligence Organization, July 2018. doi: 10.24963/ijcai.2018/505. URL http://dx.doi.org/10.24963/ijcai.2018/505. Junchi Yu, Jian Liang, and Ran He. Mind the label shift of augmentation-based graph ood generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11620–11630, 2023. Hao Yuan, Jiliang Tang, Xia Hu, and Shuiwang Ji. XGNN: Towards model-level explanations of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 430–438, 2020. Jianyu Zhang, David Lopez-Paz, and Léon Bottou.
Rich feature construction for the optimization-generalization dilemma. In International Conference on Machine Learning, pp. 26397–26411. PMLR, 2022a. Michael Zhang, Nimit S Sohoni, Hongyang R Zhang, Chelsea Finn, and Christopher Ré. Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. arXiv preprint arXiv:2203.01517, 2022b. Zehong Zhang, Lifan Chen, Feisheng Zhong, Dingyan Wang, Jiaxin Jiang, Sulin Zhang, Hualiang Jiang, Mingyue Zheng, and Xutong Li. Graph neural network approaches for drug-target interactions. Current Opinion in Structural Biology, 73:102327, 2022c. ISSN 0959-440X. doi: https://doi.org/10.1016/j.sbi.2021.102327. URL https://www.sciencedirect.com/science/article/pii/S0959440X2100169X. Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels, 2018. Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020. Yanqiao Zhu, Yichen Xu, Qiang Liu, and Shu Wu. An empirical study of graph contrastive learning. arXiv.org, September 2021. Xiang Zhuang, Qiang Zhang, Keyan Ding, Yatao Bian, Xiao Wang, Jingsong Lv, Hongyang Chen, and Huajun Chen. Learning invariant molecular representation in latent discrete space. Advances in Neural Information Processing Systems, 36, 2023. A NOTATIONS We present a set of notations used throughout our paper for clarity. Below are the main notations along with their definitions.
Table 4: Notation Table

G : a set of graphs
G = (A, X) : a graph with adjacency matrix A ∈ {0, 1}^{n×n} and node feature matrix X ∈ R^{n×d}
Y : random variable for labels
C : content factor
S : style factor
E : environment
h_c : invariant representations (features)
h_s : spurious representations (features)
G_c : the invariant subgraph with respect to G
G_s : the spurious subgraph with respect to G
Ĝ_c : the estimated invariant subgraph
Ĝ_s : the estimated spurious subgraph
ĥ_G : the estimated graph representation for graph G
ĥ_i : the estimated node representation for node i ∈ G
ĥ_c : the estimated invariant graph representation, used interchangeably with Ĉ in our paper
ĥ_s : the estimated spurious graph representation, used interchangeably with Ŝ in our paper
s_j^{(i)} : the j-th entry of the spurious logits derived from h_s^{(i)} for sample G^{(i)}
ŝ_j^{(i)} : the j-th entry of the estimated spurious logits for sample G^{(i)}
h(·) : encoder
ρ(·) : classification head for L_GT
ρ′(·) : classification head for L_Inv
w^{(i)} : reweighting coefficient for sample G^{(i)}
[K] := {1, 2, ..., K} : index set with K elements
1_K : all-one (column) vector with K entries
w (bold lowercase) : a vector
W (bold uppercase) : a matrix
W (italic) : a random variable
𝒲 (calligraphic) : a set

B MORE BACKGROUND AND PRELIMINARIES

Graph Neural Networks. In this work, we adopt message-passing GNNs for graph classification due to their expressiveness. Given a simple and undirected graph G = (A, X) with n nodes and m edges, where A ∈ {0, 1}^{n×n} is the adjacency matrix and X ∈ R^{n×d} is the node feature matrix with d feature dimensions, the graph encoder h : G → R^h aims to learn a meaningful graph-level representation h_G, and the classifier ρ : R^h → Y is used to predict the graph label Ŷ_G = ρ(h_G). To obtain the graph representation h_G, the representation h_v^{(l)} of each node v in a graph G is iteratively updated by aggregating information from its neighbors N(v).
For the l-th layer, the updated representation is obtained via an AGGREGATE operation followed by an UPDATE operation:

m_v^{(l)} = AGGREGATE^{(l)}({h_u^{(l-1)} : u ∈ N(v)}),   (5)
h_v^{(l)} = UPDATE^{(l)}(h_v^{(l-1)}, m_v^{(l)}),   (6)

where h_v^{(0)} = x_v is the initial node feature of node v in graph G. Then GNNs employ a READOUT function to aggregate the final-layer node features {h_v^{(L)} : v ∈ V} into a graph-level representation h_G:

h_G = READOUT({h_v^{(L)} : v ∈ V}).   (7)

Data Generating Process. The presumed data generating process adopted in this work is similar to that of previous studies (Arjovsky et al., 2020; Ahuja et al., 2020; Chen et al., 2022; 2023a; Wu et al., 2022a). The dynamics of graph generation are grounded in Structural Causal Models (SCMs) (Peters et al., 2016), capturing the interplay between latent variables and their influence on observable graph characteristics. This model treats graph generation as a function f_gen : Z → G, where Z ⊆ R^n denotes the latent space and G is the graph space. Through the lens of the SCM, the generation process is decomposed into the invariant subgraph G_c, the spurious subgraph G_s, and the observed graph G, as formulated in the following equation:

G_c := f_gen^{G_c}(C),  G_s := f_gen^{G_s}(S),  G := f_gen^{G}(G_c, G_s).

Figure 5: Structural causal models for graph data generation. (a) PIIF SCM; (b) FIIF SCM.

Furthermore, the latent interactions among C, S, and Y can exhibit two types of conditional independence: (i) Y ⊥ S | C and (ii) Y ⊥̸ S | C. In case (i), the invariant factor C is fully informative (FIIF) about the target label Y, and the latent spurious factor S provides no further information. In case (ii), the invariant factor C is only partially informative (PIIF) about Y, and the spurious factor S can provide additional information to aid the prediction of Y. However, as S is directly affected by E, it is not stable across different environments.
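To make the FIIF/PIIF distinction concrete, the following toy simulation (a sketch with scalar latent factors and assumed noise scales, not the paper's graph-structured generation) draws data from both SCMs and shows that the correlation between X_s and Y shifts as the environment variable ε_e changes, while the relationship between X_c and Y stays stable:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, sigma_e, scm):
    """Draw n samples from scalar versions of the PIIF / FIIF SCMs."""
    xc = rng.normal(0.0, 1.0, n)           # invariant (content) factor X_c
    n1 = rng.normal(0.0, 0.1, n)           # noise on Y
    n2 = rng.normal(0.0, 0.1, n)           # noise on X_s
    eps_e = rng.normal(0.0, sigma_e, n)    # environment variable eps_e
    y = xc + n1                            # Y = X_c + n_1
    # PIIF: X_s = Y + n_2 + eps_e;  FIIF: X_s = n_2 + eps_e
    xs = y + n2 + eps_e if scm == "PIIF" else n2 + eps_e
    return xc, xs, y

corr = {}
for scm in ("PIIF", "FIIF"):
    for sigma_e in (0.5, 2.0):             # two environments with different eps_e scale
        xc, xs, y = sample(50_000, sigma_e, scm)
        corr[scm, sigma_e] = (np.corrcoef(xc, y)[0, 1], np.corrcoef(xs, y)[0, 1])
        print(f"{scm}, sigma_e={sigma_e}: corr(Xc,Y)={corr[scm, sigma_e][0]:.3f}, "
              f"corr(Xs,Y)={corr[scm, sigma_e][1]:.3f}")
```

Under PIIF the spurious correlation weakens as the environment noise grows; under FIIF it stays near zero; the invariant correlation stays near one in all cases.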
The SCMs for the two scenarios are illustrated in Figure 5. For the PIIF SCM, the structural equation is:

Y = X_c + n_1,  X_s = Y + n_2 + ε_e,   (8)

while for the FIIF SCM, the structural equation is:

Y = X_c + n_1,  X_s = n_2 + ε_e.   (9)

Here X_c and X_s denote the random variables for the invariant factor and the spurious factor, respectively; n_1 and n_2 represent random Gaussian noise, and ε_e stands for an environmental variable, which causes the spurious correlation between X_s and Y.

C ADDITIONAL RELATED WORK

OOD Generalization. OOD generalization is a critical challenge in machine learning, where models trained on a specific data distribution often fail to generalize well to unseen distributions. Several approaches have been proposed to address this issue, including domain generalization, distributionally robust optimization (DRO), and invariance learning. Domain generalization aims to learn features that are invariant across different domains or environments; previous studies, such as Ganin et al. (2016); Sun & Saenko (2016); Li et al. (2018); Dou et al. (2019), regularize the learned features to be domain-invariant. DRO methods train models to perform robustly under worst-case scenarios among diverse data groups. Namkoong & Duchi (2016); Hu et al. (2018); Sagawa et al. (2019) regularize models to be robust to mild distributional perturbations of the training distributions, expecting the models to perform well in unseen test environments. Building upon this, Liu et al. (2022), Zhang et al. (2022b), and Yao et al. (2022) propose advanced strategies to improve robustness by assuming that models trained with ERM rely strongly on spurious features. Invariance learning leverages the theory of causality (Peters et al., 2016; Pearl, 2009) and introduces causal invariance to the learned representations.
The Independent Causal Mechanism (ICM) assumption in causality states that the conditional distribution of each variable given its causes does not inform or influence the other conditional distributions. Consequently, under interventions on other variables, the conditional distribution of the target variable given its direct causes remains invariant. Arjovsky et al. (2020) propose the framework of Invariant Risk Minimization (IRM), which brings causal invariance to deep neural networks, inspiring various invariant learning works such as Parascandolo et al. (2020); Mahajan et al. (2021); Wald et al. (2021); Ahuja et al. (2020; 2021); Krueger et al. (2021).

Graph-Level OOD Generalization. Recently, there has been growing interest in learning graph-level representations that are robust under distribution shifts, particularly from the perspective of invariant learning. MoleOOD (Yang et al., 2022) and GIL (Li et al., 2022b) propose to infer environment labels to assist in identifying invariant substructures within graphs. DIR (Wu et al., 2022c), GREA (Liu et al., 2022), and iMoLD (Zhuang et al., 2023) employ environment augmentation techniques to facilitate the learning of invariant graph-level representations. These methods typically rely on the explicit manipulation of unobserved environmental variables to achieve generalization across unseen distributions. AIA (Sui et al., 2023) employs an adversarial augmenter to explore OOD data by generating new environments while maintaining stable feature consistency. To circumvent the need for environment inference or augmentation, CIGA (Chen et al., 2022) and GALA (Chen et al., 2023a) utilize supervised contrastive learning to identify invariant subgraphs, based on the assumption that samples sharing the same label exhibit similar invariant subgraphs. EQuAD (Yao et al., 2024) adopts self-supervised learning to first learn spurious features and then learns invariant features by unlearning the spurious ones.
LECI (Gui et al., 2023) and G-Splice (Li et al., 2023b) assume the availability of environment labels and study environment-exploitation strategies for graph OOD generalization. LECI proposes to learn a causal subgraph selector by jointly optimizing label and environment causal independence, while G-Splice studies graph- and feature-space extrapolation for environment augmentation that maintains causal validity. On the other hand, some works do not rely on the invariance principle for graph OOD generalization. DisC (Fan et al., 2022) first learns a biased graph representation and subsequently focuses on unbiased graphs to discover invariant subgraphs. GSAT (Miao et al., 2022) utilizes the information bottleneck principle (Tishby & Zaslavsky, 2015) to learn a minimal sufficient subgraph for GNN explainability, which is shown to be generalizable under distribution shifts. OOD-GNN (Li et al., 2022a) proposes to learn disentangled graph representations by computing global weights over all data. Parallel to all previous studies, we propose ELiSD, which utilizes spurious subgraph diversification to provably identify G_c for OOD generalization, and simultaneously uplifts the lower bound of I(G; Y) to enhance feature learning. In this study, we adopt the same learning paradigm as EQuAD, and we propose instance-level adaptive biased infomax and an intra-class cross-entropy loss to enhance the efficacy of learning spurious and invariant features, respectively.

Node-Level OOD Generalization. There has also been a substantial amount of work on OOD generalization for node-level classification tasks. Most of these studies (Wu et al., 2022a; Liu et al., 2023; Li et al., 2023a; Yu et al., 2023) also focus on environment generation, which facilitates the subsequent invariant learning. These methods are likewise likely to learn only a subset of invariant features, limiting their ability to generalize to OOD data.

GNN Explainability.
GNN explanation methods can be broadly categorized into instance-level (Luo et al., 2020; Ying et al., 2019; Pope et al., 2019; Schnake et al., 2021) and model-level (Azzolin et al., 2022; Yuan et al., 2020; Wang & Shen, 2022) explanations. The main goal of instance-level explanation methods is to explain why a certain prediction is made for a particular instance; specifically, the important input subgraph and features that contribute most to the model prediction are identified for a given instance. In this work, we employ instance-level GNN explanation methods to approximate the invariant subgraph G_c. Concretely, we utilize the off-the-shelf method GSAT (Miao et al., 2022), which is robust under distribution shifts, to identify the important nodes in a graph, thereby facilitating the learning of graph spuriosity through adaptive biased infomax on real-world datasets.

D THEORETICAL RESULTS

D.1 THEORETICAL DISCUSSION ON VREX IN LEARNING INVARIANT FEATURES

We first outline the objective function of VRex as follows:

L_VRex := min_{w,φ} E_e[L(w ∘ φ(X), Y)] + β Var_e[L(w ∘ φ(X), Y)],   (10)

where w and φ denote the classifier and feature extractor, respectively, and β is a hyperparameter controlling the regularization strength. Next we prove the following proposition.

Proposition 3. Let w*, φ* = argmin_{w,φ} L_VRex; then ĥ_G = φ*(X) learns invariant features. However, ĥ_G may contain only a subset of the invariant features, given our PIIF data generating assumption:

Y = X_c + n_1,  X_s = Y + n_2 + ε_e,  where X_c := Σ_i X_{c,i},  X_s := Σ_i X_{s,i},   (11)

and FIIF data generating assumption:

Y = X_c + n_1,  X_s = n_2 + ε_e,  where X_c := Σ_i X_{c,i},  X_s := Σ_i X_{s,i},   (12)

when using a linear model φ(X) = w_1 X_c + w_2 X_s and L(w ∘ φ(X), Y) := R(e) = (φ(X) − Y)². Here, we assume an additive model for X_c, i.e., a set of invariant features {X_{c,i}} is combined through summation to obtain the final X_c, and similarly for X_s.

Proof. To prove Prop.
3, we first expand Var_e(R(e)) = E_e[R²(e)] − (E_e[R(e)])², then take derivatives w.r.t. w_1 and w_2, and show that Eqn. 10 elicits a w* that learns invariant features, though possibly only a subset of them.

Given R(e) = (φ(X) − Y)² and Var_e(R(e)) = E_e[R²(e)] − (E_e[R(e)])², we have:

∂Var_e(R(e))/∂w_1 = ∂E_e[(φ(X) − Y)⁴]/∂w_1 − ∂(E_e[(φ(X) − Y)²])²/∂w_1
= 4 E_e[(φ(X) − Y)³ X_c] − 4 E_e[R(e)] E_e[(φ(X) − Y) X_c].   (13)

Similarly, for ∂Var_e(R(e))/∂w_2, we have:

∂Var_e(R(e))/∂w_2 = 4 E_e[(φ(X) − Y)³ (X_c + n_1 + n_2 + ε_e)] − 4 E_e[R(e)] E_e[(φ(X) − Y)(X_c + n_1 + n_2 + ε_e)].   (14)

Now we expand φ(X) − Y:

φ(X) = w_1 X_c + w_2 X_s,  X_s = X_c + n_1 + n_2 + ε_e
⟹ φ(X) = (w_1 + w_2) X_c + w_2 n_1 + w_2 n_2 + w_2 ε_e,
Y = X_c + n_1 ⟹ φ(X) − Y = (w_1 + w_2 − 1) X_c + (w_2 − 1) n_1 + w_2 n_2 + w_2 ε_e.

Next, we show that w_1 = 1, w_2 = 0 is an optimal solution for minimizing Eqn. 10. Setting w_1 = 1, w_2 = 0 and plugging into φ(X) and φ(X) − Y, we get:

φ(X) = 1 · X_c + 0 · X_s = X_c,   (16)
φ(X) − Y = X_c − (X_c + n_1) = −n_1.   (17)

Substituting Eqns. 16–17 into Eqn. 13, we get:

4 E_e[(φ(X) − Y)³ X_c] = 4 E_e[E_{n_1,n_2}[(−n_1)³ X_c]] = 0   (since E[n_1³] = 0),
E_e[R(e)] = E_e[n_1²] = σ²_{n_1},  and  E_e[E_{n_1,n_2}[(−n_1) X_c]] = 0,

therefore ∂Var_e(R(e))/∂w_1 = 0. For ∂Var_e(R(e))/∂w_2 (Eqn. 14), the first term can be expanded as:

E_e[E_{n_1,n_2}[(−n_1)³ X_c]] = 0,
E_e[E_{n_1,n_2}[(−n_1)³ n_1]] = −E_e[E_{n_1,n_2}[n_1⁴]] = −3σ⁴_{n_1},
E_e[E_{n_1,n_2}[(−n_1)³ n_2]] = 0,
E_e[E_{n_1,n_2}[(−n_1)³ ε_e]] = 0,

so 4 E_e[E_{n_1,n_2}[(−n_1)³ (X_c + n_1 + n_2 + ε_e)]] = −4 · 3σ⁴_{n_1} = −12σ⁴_{n_1}.   (20)

For the second term in Eqn. 14:

−4 E_e[R(e)] E_e[E_{n_1,n_2}[(−n_1)(X_c + n_1 + n_2 + ε_e)]] = −4 σ²_{n_1} · (−σ²_{n_1}) = 4σ⁴_{n_1}.   (21)

Substituting Eqns. 20 and 21 into Eqn. 14, we get:

∂Var_e(R(e))/∂w_2 = −8σ⁴_{n_1}.   (22)

As σ_{n_1} is the standard deviation of n_1, and X_c is causally related to Y, we assume σ_{n_1} is small; hence the high-order term O(σ⁴_{n_1}) ≈ 0. Therefore, w_1 = 1, w_2 = 0 is an (approximately) optimal solution for Eqn. 10, indicating that only the invariant features are used in the classifier φ*(X). Now we have:

Ŷ = φ*(X) = X_c = Σ_i X_{c,i}.
(23)

Given a fixed value X_c = x, there may be multiple combinations of {X_{c,i}} that result in X_c = x. The set of invariant features {X_{c,i}} naturally arises from the learning process of deep neural networks; some invariant features might be easier to learn due to their salience or higher frequency in the dataset. Consequently, although the VRex objective promotes the learning of invariant encoders (as proved above), it does not guarantee the identification of all invariant features {X_{c,i}}.

For the FIIF data generating process, we similarly get:

φ(X) = w_1 X_c + w_2 X_s,  X_s = n_2 + ε_e,  Y = X_c + n_1,
φ(X) − Y = (w_1 − 1) X_c + w_2 n_2 + w_2 ε_e − n_1.   (24)

The variance of the loss across environments is:

Var_e[(φ(X) − Y)²] = E_e[(φ(X) − Y)⁴] − (E_e[(φ(X) − Y)²])².   (25)

Recalling that φ(X) = w_1 X_c + w_2 X_s and X_s = n_2 + ε_e, we have:

φ(X) − Y = (w_1 − 1) X_c + w_2 n_2 + w_2 ε_e − n_1.   (26)

The expected squared loss can be expanded as:

E_e[(φ(X) − Y)²] = (w_1 − 1)² E_e[X_c²] + w_2² E_e[n_2²] + w_2² E_e[ε_e²] + E_e[n_1²] + 2(w_1 − 1) w_2 E_e[X_c n_2] + ...   (27)

Given the independence between X_c, n_1, n_2, and ε_e, the cross terms vanish in expectation:

E_e[(φ(X) − Y)²] = (w_1 − 1)² E_e[X_c²] + w_2² E_e[n_2²] + w_2² E_e[ε_e²] + E_e[n_1²].   (28)

For E_e[(φ(X) − Y)⁴], we have:

E_e[(φ(X) − Y)⁴] = (w_1 − 1)⁴ E_e[X_c⁴] + w_2⁴ E_e[n_2⁴] + w_2⁴ E_e[ε_e⁴] + E_e[n_1⁴] + ...,   (29)

where the remaining even cross terms all carry factors of (w_1 − 1) or w_2 and therefore do not affect the stationarity analysis below. Taking derivatives with respect to w_1 and w_2:

∂E_e[(φ(X) − Y)²]/∂w_1 = 2(w_1 − 1) E_e[X_c²],
∂E_e[(φ(X) − Y)⁴]/∂w_1 = 4(w_1 − 1)³ E_e[X_c⁴],
∂E_e[(φ(X) − Y)²]/∂w_2 = 2w_2 (E_e[n_2²] + E_e[ε_e²]),
∂E_e[(φ(X) − Y)⁴]/∂w_2 = 4w_2³ (E_e[n_2⁴] + E_e[ε_e⁴]).   (30)

Setting Eqn. 30 to zero, we get w_1 = 1, w_2 = 0. In conclusion, for both the PIIF and FIIF data generating scenarios, VRex learns only invariant features; however, given an additive model, it may learn only the subset of invariant features that are salient or appear frequently in the training set.
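The VRex analysis above, i.e., that (w_1, w_2) = (1, 0) is (near-)optimal under the PIIF model, can be sanity-checked numerically. The following Monte Carlo sketch (with assumed noise scales and environment means, not values from the paper) evaluates the VRex objective of Eqn. 10 on a small grid of (w_1, w_2) and confirms the grid minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
ENVS = [-2.0, 0.0, 2.0]                  # environment-specific mean of eps_e
N, S_N1, S_N2, BETA = 100_000, 0.1, 0.1, 10.0

def risk(w1, w2, mu_e):
    """Per-environment squared risk R(e) under the PIIF SCM (Eqn. 11)."""
    xc = rng.normal(0.0, 1.0, N)
    n1 = rng.normal(0.0, S_N1, N)
    n2 = rng.normal(0.0, S_N2, N)
    eps = rng.normal(mu_e, 1.0, N)       # environment variable
    y = xc + n1                          # Y = X_c + n_1
    xs = y + n2 + eps                    # X_s = Y + n_2 + eps_e
    return np.mean((w1 * xc + w2 * xs - y) ** 2)

def vrex(w1, w2):
    """Mean risk plus beta times the variance of risks across environments."""
    rs = np.array([risk(w1, w2, mu) for mu in ENVS])
    return rs.mean() + BETA * rs.var()

grid = [(w1, w2) for w1 in (0.0, 0.5, 1.0) for w2 in (-0.5, 0.0, 0.5)]
best = min(grid, key=lambda w: vrex(*w))
print("grid minimizer:", best)           # the invariant solution (1.0, 0.0)
```

Any nonzero w_2 makes R(e) depend on the environment mean of ε_e, so the variance penalty pushes the minimizer to w_2 = 0, matching the derivation.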
D.2 THEORETICAL DISCUSSION ON IRMV1 IN LEARNING INVARIANT FEATURES

Using a similar derivation, we can prove that IRMv1 may also learn only a subset of invariant features. We first outline the objective of IRMv1 as follows:

L_IRMv1 := min_{w,φ} E_e[ L(w ∘ φ(X), Y) + β ‖∇_{w|w=1.0} L(w ∘ φ(X), Y)‖²₂ ],   (31)

where w and φ denote the classifier and feature extractor, respectively.

Proposition 4. Let w*, φ* = argmin_{w,φ} L_IRMv1; then ĥ_G = φ*(X) learns invariant features. However, ĥ_G may contain only a subset of the invariant features, given our PIIF data generating assumption:

Y = X_c + n_1,  X_s = Y + n_2 + ε_e,  where X_c := Σ_i X_{c,i},  X_s := Σ_i X_{s,i},   (32)

and FIIF data generating assumption:

Y = X_c + n_1,  X_s = n_2 + ε_e,  where X_c := Σ_i X_{c,i},  X_s := Σ_i X_{s,i},   (33)

when using a linear model φ(X) = w_1 X_c + w_2 X_s and L(w ∘ φ(X), Y) := R(e) = (φ(X) − Y)².

Proof. The IRMv1 penalty can be simplified to:

‖∇_{w|w=1.0} L(w ∘ φ(X), Y)‖²₂ = (φ(X) − Y)⁴.   (34)

For the PIIF case:

φ(X) = w_1 X_c + w_2 X_s,  X_s = X_c + n_1 + n_2 + ε_e,  Y = X_c + n_1,   (35)
φ(X) = w_1 X_c + w_2 (X_c + n_1 + n_2 + ε_e) = (w_1 + w_2) X_c + w_2 n_1 + w_2 n_2 + w_2 ε_e,
φ(X) − Y = (w_1 + w_2 − 1) X_c + (w_2 − 1) n_1 + w_2 n_2 + w_2 ε_e.   (36)

Taking gradients w.r.t. w_1 and w_2, respectively:

∂L_IRMv1/∂w_1 = 4(φ(X) − Y)³ ∂(φ(X) − Y)/∂w_1 = 4(φ(X) − Y)³ X_c,   (37)
∂L_IRMv1/∂w_2 = 4(φ(X) − Y)³ ∂(φ(X) − Y)/∂w_2 = 4(φ(X) − Y)³ (X_c + n_1 + n_2 + ε_e).   (38)

Plugging in w_1 = 1, w_2 = 0, we get:

φ(X) = X_c,  φ(X) − Y = −n_1.   (39)

Substituting Eqn. 39 into Eqn. 37 and Eqn. 38, we get:

∂L_IRMv1/∂w_1 = E_e[E_{n_1,n_2}[4(−n_1)³ X_c]] = 0,   (40)

∂L_IRMv1/∂w_2 = E_e[E_{n_1,n_2}[4(−n_1)³ (X_c + n_1 + n_2 + ε_e)]]
= 4 E_e[E_{n_1,n_2}[(−n_1)³ X_c]] + 4 E_e[E_{n_1,n_2}[(−n_1)³ n_1]] + 4 E_e[E_{n_1,n_2}[(−n_1)³ n_2]] + 4 E_e[E_{n_1,n_2}[(−n_1)³ ε_e]],   (41)

given that:

E_{n_1,n_2}[(−n_1)³ X_c] = 0,
E_{n_1,n_2}[(−n_1)³ n_1] = −E_{n_1,n_2}[n_1⁴] = −3σ⁴_{n_1},
E_{n_1,n_2}[(−n_1)³ n_2] = 0,
E_{n_1,n_2}[(−n_1)³ ε_e] = 0,

we have ∂L_IRMv1/∂w_2 = −12σ⁴_{n_1}. With a similar assumption that the standard deviation σ_{n_1} of the Gaussian noise n_1 on the causal variable X_c is small, we conclude that ∂L_IRMv1/∂w_2 = O(σ⁴_{n_1}) ≈ 0.
Hence IRMv1 learns invariant features under the data generating process. Now we have:

Ŷ = φ*(X) = X_c = Σ_i X_{c,i}.   (43)

For the FIIF generating process, applying the same technique, we have:

φ(X) − Y = w_1 X_c + w_2 (n_2 + ε_e) − (X_c + n_1) = (w_1 − 1) X_c + w_2 n_2 + w_2 ε_e − n_1.   (44)

Since ∂(φ(X) − Y)/∂w_1 = X_c and ∂(φ(X) − Y)/∂w_2 = n_2 + ε_e:

∂L_IRMv1/∂w_1 = 4(φ(X) − Y)³ X_c = 4((w_1 − 1) X_c + w_2 n_2 + w_2 ε_e − n_1)³ X_c,
∂L_IRMv1/∂w_2 = 4(φ(X) − Y)³ (n_2 + ε_e).   (45)

Substituting w_1 = 1, w_2 = 0 into φ(X) − Y:

φ(X) − Y = X_c − (X_c + n_1) = −n_1.   (46)

Taking derivatives w.r.t. w_1 and w_2:

∂L_IRMv1/∂w_1 = −4 E_e[n_1³ X_c] = 0,   (47)
∂L_IRMv1/∂w_2 = −4 E_e[n_1³ (n_2 + ε_e)] = 0.   (48)

Therefore, we have shown that under the widely used FIIF and PIIF data generating processes, IRMv1 is able to learn invariant features. However, by the same reasoning as for VRex, there may be multiple combinations of assignments for {X_{c,i}} that lead to a fixed value X_c = x, and the objective of IRMv1 (Eqn. 31) will not ensure the learning of all invariant features.

D.3 PROOF OF THEOREM 4.1

Theorem D.1. [Restatement of Theorem 4.1] Given that the invariant subgraph G_c contains invariant patterns causally related to the target labels, and G_s contains only spurious patterns, the biased infomax principle achieves spuriosity learning. Specifically, the encoder h_θ(·) learns solely the spurious features for each data sample in D.

We begin by stating the assumption underlying our proof, followed by a proof by contradiction.

Assumption 2. (Existence of spuriosity learner) There exists a learning algorithm capable of achieving spuriosity learning as defined in Def. 2.

Most invariant learning algorithms aim to learn invariant features, and a large body of literature has demonstrated the existence of such algorithms (Arjovsky et al., 2020; Kreuzer et al., 2021; Parascandolo et al., 2020; Mahajan et al., 2021; Wald et al., 2021; Ahuja et al., 2020; 2021; Chen et al., 2022; Gui et al., 2023; Liu et al., 2022).
In contrast, we assume the existence of algorithms that learn spurious features. One such example is provided by Eastwood et al. (2023), where the method learns both invariant and spurious features, supporting the plausibility of our assumption. Recent studies have also shown that self-supervised contrastive learning tends to learn spurious features (Yao et al., 2024; Hamidieh et al., 2024; Meehan et al., 2023); therefore, self-supervised contrastive learning is one form of spuriosity learner. Based on the above, we first sketch our proof.

Proof sketch. As biased infomax is a more general form of the infomax principle, with exponentially many node configurations (Def. 4), the encoder h_{θ*}(·) under the optimal node configuration must be a spuriosity learner that maximizes the learning objective. We then employ proof by contradiction to demonstrate that the node configuration corresponding to biased infomax in Def. 3 leads to a larger mutual information objective than any other configuration, thereby eliciting a contradiction and leading to the conclusion that the node configuration in Def. 3 elicits the spuriosity learner.

Definition 4. (Node configurations) Let the general form of biased infomax be:

max_θ E_{G∼𝒢} (1/|G|) [ Σ_{v_i ∈ G̃} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ G\G̃} I(ĥ_i; ĥ_G) ]
s.t. ĥ_i = h_θ(G),  ĥ_G = READOUT({ĥ_i}).   (49)

A node configuration c = [c_1, c_2, ..., c_{|V|}]^T, with c_i ∈ {−1, 1}, corresponds to one specific instantiation of Eqn. 49, where c_i = 1 denotes v_i ∈ G̃ and c_i = −1 denotes v_i ∈ G \ G̃. The set of all possible node configurations is H := {c : c_i ∈ {−1, 1}}.

Given Definition 4, the infomax principle (Eqn. 1) is a special case of the biased infomax principle with c = 1_{|V|}. For notational simplicity, let c := [c_1, c_2]^T, with c_1 ∈ R^{|G_s|} and c_2 ∈ R^{|G_c|} being the node configurations of G_s and G_c, respectively.
Assume that for the optimal node configuration c′, there are k elements in c_1 equal to 1 and |G_s| − k elements equal to −1 (denoted V_s^1 and V_s^{−1}); similarly, there are t elements in c_2 equal to 1 and |G_c| − t elements equal to −1, denoted V_c^1 and V_c^{−1}. We then compare such a node configuration c′ with c* = [1_{|G_s|}, −1_{|G_c|}]^T.

Proof. First, under the node configuration c′, the optimal parameter for the learning objective is:

θ′ = argmax_θ E_{G∼𝒢} (1/|G|) [ Σ_{v_i ∈ V_s^1} I(ĥ_i; ĥ_G) + Σ_{v_i ∈ V_c^1} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ V_s^{−1}} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ V_c^{−1}} I(ĥ_i; ĥ_G) ],   (50)

and under the node configuration c*, the optimal parameter is:

θ* = argmax_θ E_{G∼𝒢} (1/|G|) [ Σ_{v_i ∈ G_s} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ G_c} I(ĥ_i; ĥ_G) ].   (51)

Define

f(θ′; D, c′) := E_{G∼𝒢} (1/|G|) [ Σ_{v_i ∈ V_s^1} I(ĥ_i; ĥ_G) + Σ_{v_i ∈ V_c^1} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ V_s^{−1}} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ V_c^{−1}} I(ĥ_i; ĥ_G) ],

and similarly

f(θ*; D, c*) := E_{G∼𝒢} (1/|G|) [ Σ_{v_i ∈ G_s} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ G_c} I(ĥ_i; ĥ_G) ].

By the assumption that c′ is optimal, we would have f(θ′; D, c′) > f(θ*; D, c*). We also have that h_{θ*}(·) learns only spurious features, as θ* is the maximizer of the objective under c*, while h_{θ′}(·) learns both invariant and spurious features, as θ′ is the maximizer of the objective under c′. Let ĥ_G^{θ*} = h_{θ*}(G) and ĥ_G^{θ′} = h_{θ′}(G); therefore:

I(ĥ_i; ĥ_G^{θ*}) = 0, ∀ v_i ∈ G_c;   (52)
I(ĥ_i; ĥ_G^{θ′}) = c > 0, ∀ v_i ∈ G_c;   (53)
I(ĥ_i; ĥ_G^{θ*}) = q, ∀ v_i ∈ G_s;   (54)
I(ĥ_i; ĥ_G^{θ′}) = q′, ∀ v_i ∈ G_s.   (55)

Eqn. 52 holds because h_{θ*}(·) is obtained under c* and hence learns only spuriosity; Eqn. 53 holds because h_{θ′}(·) learns both spurious and invariant features. We also have q > q′ for the same reason. Now we compare f(θ*; D, c*) with f(θ′; D, c′):

f(θ*; D, c*) − f(θ′; D, c′) = E_{G∼𝒢} [(1/|G|) |G_s| q] − E_{G∼𝒢} [(1/|G|) (|G_s| q′ − |G_c| c)]
= E_{G∼𝒢} [(1/|G|) (|G_s|(q − q′) + |G_c| c)] > 0,

leading to a contradiction with f(θ′; D, c′) > f(θ*; D, c*) for any node configuration c′ ≠ c*. Therefore, we conclude that c* is the optimal configuration eliciting the spuriosity learner, which corresponds to the biased infomax objective in Eqn. 2.
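The configuration argument can be illustrated with a brute-force toy check. The per-node mutual information values q, q′, and c below are hypothetical constants chosen for illustration (they are not derived from the paper); under this simplified model, enumerating all 2^{|V|} configurations confirms that the all-spurious/anti-invariant configuration c* attains the largest objective:

```python
from itertools import product

# Toy instantiation of the node-configuration argument (Def. 4).
# Assumed (hypothetical) per-node MI values: under the spuriosity learner
# (config c*), spurious nodes carry MI q and invariant nodes 0; under any
# other config the learner also captures invariant signal (MI c_val on
# invariant nodes) and the spurious MI drops to q_prime < q.
N_S, N_C = 4, 2                      # |G_s|, |G_c|
Q, Q_PRIME, C_VAL = 0.9, 0.4, 0.3    # q > q_prime, c > 0

C_STAR = (1,) * N_S + (-1,) * N_C    # c* = [1_{|Gs|}, -1_{|Gc|}]

def objective(cfg):
    """f(theta_c; D, c): signed average of per-node MI for configuration cfg."""
    if cfg == C_STAR:
        mi = [Q] * N_S + [0.0] * N_C
    else:
        mi = [Q_PRIME] * N_S + [C_VAL] * N_C
    return sum(s * m for s, m in zip(cfg, mi)) / (N_S + N_C)

best = max(product((1, -1), repeat=N_S + N_C), key=objective)
print(best)   # the all-spurious/anti-invariant configuration wins
```

The margin of c* over the runner-up is exactly (|G_s|(q − q′) + |G_c|·c)/|G|, matching the final inequality of the proof.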
D.4 PROOF OF PROPOSITION 1

Proposition 5. [Restatement of Proposition 1] Given an error rate p% in the approximation algorithm for G_c, let the learning objectives for the biased infomax with the ground-truth subgraphs G_c and G_s, and with the approximated subgraphs Ĝ_c and Ĝ_s, be denoted as L(θ*; D) and L(θ′; D), respectively. The difference between L(θ*; D) and L(θ′; D) can be expressed as:

ΔL = E_{G∼𝒢} (2/|G|) [ Σ_{v_i ∈ p·G_s} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ p·G_c} I(ĥ_i; ĥ_G) ].

Proof. We first expand L(θ*; D) and L(θ′; D) as follows:

L(θ*; D) = E_{G∼𝒢} (1/|G|) [ Σ_{v_i ∈ G_s} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ G_c} I(ĥ_i; ĥ_G) ]
= E_{G∼𝒢} (1/|G|) [ Σ_{v_i ∈ (1−p)G_s} I(ĥ_i; ĥ_G) + Σ_{v_i ∈ p·G_s} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ (1−p)G_c} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ p·G_c} I(ĥ_i; ĥ_G) ],

L(θ′; D) = E_{G∼𝒢} (1/|G|) [ Σ_{v_i ∈ (1−p)G_s ∪ p·G_c} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ (1−p)G_c ∪ p·G_s} I(ĥ_i; ĥ_G) ].

The gap between the two learning objectives, ΔL, is:

ΔL := L(θ*; D) − L(θ′; D)
= E_{G∼𝒢} (1/|G|) [ Σ_{(1−p)G_s} I + Σ_{p·G_s} I − Σ_{(1−p)G_c} I − Σ_{p·G_c} I ]
− E_{G∼𝒢} (1/|G|) [ Σ_{(1−p)G_s ∪ p·G_c} I − Σ_{(1−p)G_c ∪ p·G_s} I ]
= E_{G∼𝒢} (2/|G|) [ Σ_{v_i ∈ p·G_s} I(ĥ_i; ĥ_G) − Σ_{v_i ∈ p·G_c} I(ĥ_i; ĥ_G) ].

This concludes the proof.

D.5 PROOF OF PROPOSITION 2

Proposition 6. [Restatement of Proposition 2] Given a linear regression model with parameters {θ_1, θ_2} and spurious features {x_1, x_2}, the correlation strength for feature x_1 is p and for x_2 is 1 − p when Y = 0; similarly, the correlation strength for x_1 is q and for x_2 is 1 − q when Y = 1. Assuming the spurious features x_1 and x_2 each take values in {0, 1}, we obtain the following parameter estimates using the Mean Squared Error (MSE) loss: θ_1 = q/(p + q), θ_2 = (1 − q)/(2 − p − q).

Proof. The linear model for predicting Y is Ŷ = θ_1 x_1 + θ_2 x_2. With the MSE loss: for Y = 0, pn samples have x_1 = 1, x_2 = 0 and (1 − p)n samples have x_1 = 0, x_2 = 1; for Y = 1, qn samples have x_1 = 1, x_2 = 0 and (1 − q)n samples have x_1 = 0, x_2 = 1.
The MSE loss L_MSE consists of four terms:

L_MSE ∝ Σ_{i=1}^{pn} (θ_1 − 0)² + Σ_{i=1}^{(1−p)n} (θ_2 − 0)² + Σ_{i=1}^{qn} (θ_1 − 1)² + Σ_{i=1}^{(1−q)n} (θ_2 − 1)²
= pn θ_1² + (1 − p)n θ_2² + qn (θ_1 − 1)² + (1 − q)n (θ_2 − 1)²
∝ p θ_1² + (1 − p) θ_2² + q (θ_1 − 1)² + (1 − q)(θ_2 − 1)².

Taking the partial derivative w.r.t. θ_1 and setting it to zero:

2p θ_1 + 2q (θ_1 − 1) = 0 ⟹ (p + q) θ_1 − q = 0 ⟹ θ_1 = q/(p + q).

Similarly for θ_2:

2(1 − p) θ_2 + 2(1 − q)(θ_2 − 1) = 0 ⟹ (2 − p − q) θ_2 = 1 − q ⟹ θ_2 = (1 − q)/(2 − p − q).

As demonstrated in Proposition 2, the presence of spurious-feature overlap across different classes hinders the linear classifier's ability to generate distinguishable logits. For instance, when p = q, both θ_1 = 0.5 and θ_2 = 0.5, resulting in a loss of distinguishability among different spurious patterns. To mitigate this issue, we propose a class-conditioned (intra-class) cross-entropy loss, which reduces interference across different classes. To show that the class-conditioned (intra-class) cross-entropy loss maximizes the conditional entropy H(s^{(i)} | ĥ_G^{(i)}), we propose the following proposition.

Proposition 7. Given the spurious features h_s^{(i)}, G^{(i)} ∈ D, and a suitable cluster number K that corresponds to the number of environmental groups in each class Y = y, the loss objective L_Inv (Eqn. 3) maximizes the conditional entropy H(s^{(i)} | ĥ_G^{(i)}) for all i ∈ D.

To prove that optimizing L_Inv maximizes the conditional entropy H(s^{(i)} | ĥ_G^{(i)}), we show that P(s^{(i)} | ĥ_c^{(i)}) converges to Cat(1/K) with K clusters, which maximizes H(s^{(i)} | ĥ_G^{(i)}).

Proof. First, we have:

max_θ H(s^{(i)} | ĥ_G^{(i)}) = H(s^{(i)} | ρ′(ĥ_G^{(i)})) = H(s^{(i)} | ŝ^{(i)}).

Here, ρ′(·) is the classification head for the cluster labels s^{(i)}. For notational simplicity, we omit the superscript (i) and denote s := s^{(i)} and ŝ := ŝ^{(i)}. Using the softmax loss, we have:

L = − Σ_{j=1}^{K} s_j log(σ(ŝ)_j),  where σ(ŝ)_j = e^{ŝ_j} / Σ_{k=1}^{K} e^{ŝ_k},   (60)

∂σ(ŝ)_j/∂ŝ_i = σ(ŝ)_j (δ_{ij} − σ(ŝ)_i),   (61)

and therefore:

∂L/∂ŝ_i = −(s_i − σ(ŝ)_i).
(63)

Taking expectations on both sides of Eqn. 63, at the optimum we get:

E[∂L/∂ŝ_i] = −E[s_i − σ(ŝ)_i],   (64)
0 = −(E[s_i] − E[σ(ŝ)_i]),   (65)
E[σ(ŝ)_i] = E[s_i] = 1/K,   (66)
E[σ(ŝ)_i] = 1/K.   (67)

Eqn. 66 holds because E[s_i] = 1/K: each cluster derived from the spurious embedding represents an environmental group, and with appropriate reweighting within each cluster, the samples are weighted equally across clusters, i.e., samples from majority and minority groups are weighted equally. Given that this expectation holds over all training samples, one optimal solution for Eqn. 67 is σ(ŝ)_i = 1/K, ∀ i ∈ [K], for K cluster labels. This aligns with the assumption that within each class there exists a stable pattern G_c; if the encoder effectively captures the invariant patterns G_c while discarding spurious correlations, σ(ŝ)_i will be a fixed, constant value. With this constraint, the solution σ(ŝ^{(i)}) = (1/K) 1_K implies that the model learns invariant features. Furthermore, this stable solution maximizes the conditional entropy H(s^{(i)} | ĥ_G^{(i)}), as σ(ŝ^{(i)}) is independent of s^{(i)}.

Remark. While σ(ŝ^{(i)}) = (1/K) 1_K is a sufficient condition to minimize L_Inv, there exist other solutions that exploit spurious features to achieve high training accuracy. To guide the model toward the stable solution that maximizes H(s^{(i)} | ĥ_G^{(i)}), an explicit regularization term, such as ‖σ(ŝ^{(i)}) − (1/K) 1_K‖²₂, can be added to the objective in Eqn. 4. However, our empirical results show that such explicit regularization does not significantly affect performance. This may be attributed to the limited number of class labels and node features in graph-level OOD datasets, which inherently enforce σ(ŝ^{(i)}) to be similar across all training samples, facilitating convergence to the stable solution.

D.6 PROOF OF THEOREM 4.2

Theorem D.2.
[Restatement of Theorem 4.2] There exist a suitable γ and cluster number K such that minimizing the loss objective L = L_GT + λ L_Inv leads to the optimal encoder h*(·), which elicits invariant features for any graph G, i.e., ĥ_G = h*(G) = h_c.

Proof. We aim to show that minimizing the objective L encourages the encoder h(·) to learn representations ĥ_G that contain only the invariant features h_c. First, under ERM, i.e., minimizing L_GT alone, the encoder h(·) learns both invariant and spurious features, given the empirical and theoretical evidence in previous studies (Kirichenko et al., 2023; Chen et al., 2023b). Now we introduce the invariant loss L_Inv, which, according to Proposition 7, maximizes the conditional entropy H(s^{(i)} | ĥ_G^{(i)}) for each sample G^{(i)}. This encourages the learned representation ĥ_G^{(i)} to be uninformative about the spurious features s^{(i)}. In the following, we drop the superscript (i) for simplicity. Next, our goal is to show that minimizing L leads to ĥ_G = h_c. We consider the following three cases:

Case 1: The encoder learns only spurious features (ĥ_G = h_s). In this case, the representation ĥ_G is informative about s but not about h_c. The conditional entropy H(s | ĥ_G) is minimized rather than maximized.

Case 2: The encoder learns both invariant and spurious features (ĥ_G = κ(h_c, h_s)). Here, ĥ_G is informative about both h_c and h_s. While this may minimize L_GT in the training environments, ĥ_G remains informative about h_s, and thus does not maximize H(s | ĥ_G).

Case 3: The encoder learns only invariant features (ĥ_G = h_c). In this case, ĥ_G is uninformative about h_s; therefore, H(s | ĥ_G) is greater than in the previous two cases. Furthermore, ĥ_G also minimizes L_GT, given that the invariant features hold sufficient predictive power for the target labels (Assumption 1).
Therefore, we conclude that the encoder h*(·) will learn only invariant features, as retaining the spurious features h_s would not maximize H(s | ĥ_G). Additionally, it is important to note that there exist uninformative features which may also maximize H(s | ĥ_G) without including h_c. This phenomenon has been demonstrated in EQuAD (Yao et al., 2024), where model performance deteriorates when λ > 1. However, when λ < 1, L_GT emphasizes useful features that correlate with the target labels, while L_Inv acts as a regularization term that encourages the model to remove spurious features and focus on learning invariant features.

E COMPLEXITY ANALYSIS

We provide time and space complexity analyses for LIRS, followed by an empirical running-time analysis on the GOODMotif-base and OGBG-Molbbbp datasets.

Space complexity. The space complexity of LIRS is O(|B|Hm), where |B| denotes the batch size, H the hidden feature dimension, and m the average number of edges. The memory overhead is therefore on par with ERM, and may be lower than that of graph data augmentation methods such as DIR (Wu et al., 2022c) and GREA (Liu et al., 2022), which generate spurious subgraphs for each sample using the |B| samples in the same minibatch, leading to O((|B|² + |B|)Hm).

Time complexity. The time complexity of LIRS is O(kCmF), where k is the number of layers in the GNN encoder, C > 1 is a constant reflecting that LIRS runs the GNN encoder multiple times, and F is the hidden dimension size. Notably, most other graph invariant learning algorithms also exhibit a complexity of O(CkmF), as they require multiple GNN encoders for subgraph extraction and feature encoding, respectively. Therefore, the time-cost gap between LIRS and other methods is not significant. We provide a detailed running-time analysis on the Motif-base and OGBG-Molbbbp datasets, as shown in Table 5. On the Motif-base dataset, the biased infomax requires only 20 epochs of training, making its time cost comparable to ERM.
On the OGBG-Molbbbp dataset, the time cost of LIRS exceeds that of other methods, since the biased infomax requires training for 100 epochs, and GSAT must be run to annotate node labels to enable the adaptive biased infomax. For a better understanding, we provide a breakdown of the time cost for each stage of LIRS in Table 6. The first three components account for the majority of the time cost in LIRS; however, they only need to be run once. Retraining the GNN encoder is both fast and stable, with low variance, as L_Inv is merely a cross-entropy loss. This presents a key advantage of LIRS in terms of hyperparameter selection. Specifically, most OOD methods require a hyperparameter search, and for most of them the process must be restarted entirely for each run, incurring significant time costs. In contrast, for LIRS, only the final GNN retraining step needs to be run multiple times during the hyperparameter search, leading to a significantly reduced time cost when conducting multiple runs compared to other methods.

Table 5: Running time analysis (in seconds) of various OOD methods

Method      Motif-base          OGBG-Molbbbp
ERM         494.34 ± 117.86     92.42 ± 0.42
IRM         968.94 ± 164.09     151.84 ± 7.53
VREx        819.94 ± 124.54     129.13 ± 12.93
GSAT        1233 ± 396.19       142.47 ± 25.71
GREA        1612.43 ± 177.36    262.47 ± 45.71
CIGA        1729.14 ± 355.62    352.14 ± 93.32
AIA         1422.34 ± 69.33     217.36 ± 11.04
OOD-GCL     10813 ± 28.12       8455.51 ± 68.61
EQuAD       747.87 ± 34.71      278.85 ± 16.64
LIRS        504.87 ± 24.04      421.32 ± 19.86

Table 6: Running time analysis (in seconds) of each stage of LIRS

Stage                    Motif-base        OGBG-Molbbbp
Biased Infomax           302.12            214.24
Minibatch KMeans + SVM   6.63              3.04 ± 0.35
Mark Nodes               -                 142.47 ± 25.71
GNN Retraining           196.12 ± 24.04    61.57 ± 6.86

F MORE DISCUSSION ON LIRS

Comparison with EQuAD (Yao et al., 2024). While LIRS and EQuAD share similarities in their learning paradigm, our study makes several distinctions from Yao et al. (2024).
First, while EQuAD primarily addresses the limitation that existing OOD methods are sensitive to varying spurious correlation strengths, it does not fully explain why the proposed learning paradigm is effective when the spurious correlation strength remains stable. In contrast, our work answers this important question and demonstrates that this learning paradigm enables learning a broader set of invariant features. Second, we curate the SPMotif-binary dataset based on the SPMotif datasets, which can serve as a benchmark for future studies to evaluate the effectiveness of methods in learning a broader set of invariant features. Third, while EQuAD uses a standard infomax objective to learn spurious features, we propose a new algorithm that addresses the limitations of the vanilla infomax approach. Specifically, we introduce the biased infomax to overcome size constraints and further incorporate an additional procedure (i.e., a GNN explainer) to annotate critical nodes with adaptive thresholding, realizing the biased infomax on real-world datasets. Finally, we identify the limitations of the cross-entropy loss in disentangling spurious features and propose a novel loss objective that is more effective in learning invariant features. Additionally, our proposed loss does not compromise computational efficiency.

Comparison with Supervised Contrastive Learning (Khosla et al., 2020) and Deep Graph Infomax (Veličković et al., 2018). The biased infomax principle also shares similarities with Supervised Contrastive Learning (SCL) and Deep Graph Infomax (DGI). However, SCL and DGI primarily target in-distribution (ID) data and are designed to improve performance on predictive tasks. In contrast, our proposed biased infomax is specifically designed for OOD data, with the goal of learning environment-related features rather than features that directly benefit classification tasks.
This makes our approach conceptually distinct from SCL and DGI. Second, in SCL the contrastive loss operates at the inter-sample level, where explicit labels are available for each image. In our graph-level OOD setting, however, the biased infomax operates at the node level, where such labels are unavailable. Due to the absence of node labels, we approximate critical nodes using a GNN explainer with adaptive thresholding. This additional procedure also distinguishes the biased infomax from SCL and DGI.

Comparison with OOD-GCL (Li et al., 2024a). LIRS and OOD-GCL both utilize contrastive learning to learn graph invariant features. However, LIRS relies on labeled data in the subsequent stage to effectively learn invariant features; in contrast, OOD-GCL aims to learn invariant features without labeled data, followed by fine-tuning a linear classifier on downstream tasks. This distinction highlights the different assumptions and goals in the design of the two methods. Due to the unavailability of labeled data, OOD-GCL underperforms LIRS across all the datasets. In terms of running time, OOD-GCL is also significantly slower than LIRS and all other methods, as it requires multiple rounds of clustering in every epoch and an additional invariance regularization, which must be applied for each channel and cluster.

G ALGORITHMIC PSEUDOCODE

We provide pseudocode for the proposed LIRS framework, which consists of three stages for learning spurious features and invariant features, respectively. The code will be made publicly available upon acceptance of our work.

H MORE DETAILS ABOUT EXPERIMENTS

H.1 DATASETS DETAILS

Table 7: Details about the datasets used in our experiments.
DATASETS           Split        # TRAINING   # VALIDATION   # TESTING   # CLASSES   METRICS
GOOD-HIV           Scaffold     24682        4113           4108        2           ROC-AUC
GOOD-HIV           Size         26169        4112           3961        2           ROC-AUC
GOOD-Motif         Base         18000        3000           3000        3           ACC
GOOD-Motif         Size         18000        3000           3000        3           ACC
OGBG-Molbbbp       Scaffold     1631         204            204         2           ROC-AUC
OGBG-Molbbbp       Size         1633         203            203         2           ROC-AUC
OGBG-Molbace       Scaffold     1210         152            151         2           ROC-AUC
OGBG-Molbace       Size         1211         151            151         2           ROC-AUC
SPMotif-binary     Correlation  6000         2000           2000        2           ACC
SPMotif (#Gc = 3)  Correlation  9000         3000           3000        3           ACC

In this subsection, we provide a detailed introduction to the datasets used in this work; the dataset statistics are listed in Table 7.

SPMotif dataset. Following Wu et al. (2022c), we generate 3-class synthetic datasets based on BAMotif (Ying et al., 2019). In these datasets, each graph comprises a combination of an invariant and a spurious subgraph, denoted by Gc and Gs. The spurious subgraphs include three structures (Tree, Ladder, and Wheel), while the invariant subgraphs consist of Cycle, House, and Crane. The task for a model is to determine which one of the three motifs (Cycle, House, and Crane) is present in a graph. A controllable distribution shift can be achieved via a predefined parameter b. This parameter manipulates the spurious correlation between the spurious subgraph Gs and the ground-truth label Y, which depends solely on the invariant subgraph Gc. Specifically, given the predefined bias b, the probability that a specific motif (e.g., House) and a specific base graph (e.g., Tree) co-occur is b, while for the others it is (1 - b)/2 (e.g., House-Ladder, House-Wheel). When b = 1/3, the invariant subgraph is equally correlated with the three spurious subgraphs in the dataset. In the SPMotif datasets, S is directly influenced by C, and C is causally related to Y. For the variant of the SPMotif datasets used in Section 3, we attach 3 invariant subgraphs Gc to a base spurious subgraph, and in the test dataset, only 1 invariant subgraph is attached to Gs.

SPMotif-binary dataset.
To evaluate the feature learning quality of various OOD methods, we curate a binary classification dataset based on the SPMotif dataset, which is utilized in Section 5.3. Specifically, the motifs House and Crane are assigned label 0, and Diamond and Cycle are assigned label 1.

Algorithm 1 The LIRS framework
Require: Graph dataset D with labels Y; training epochs E, E'; threshold τ; hyperparameter γ; regularization weight λ
Ensure: Trained GNN model f = ρ ∘ h
1:  Initialize GNN encoder h(·)
    // Step 1: Learning spurious features via biased infomax
2:  for epoch t = 1 to E do
3:    for each graph G^(i) in D do
4:      Obtain graph representation h_G^(i) ← h(G^(i))
5:      Use GNN explainer e(·) to identify important nodes Ĝ_c^(i)
6:      Remove Ĝ_c^(i) from G^(i) to get G̃^(i)
7:      Compute predicted probabilities ŷ^(i) ← ρ(h_G^(i)) and ỹ^(i) ← ρ(h(G̃^(i)))
8:      if |ŷ^(i) − ỹ^(i)| > τ then
9:        Obtain h_G^(i) using Eqn. (2) (biased infomax)
10:     else
11:       Obtain h_G^(i) using Eqn. (1) (standard infomax)
12:     end if
13:   end for
14: end for
15: Obtain spurious embeddings {h_s^(i)} from the trained encoder h(·)
    // Step 2: Clustering and re-fitting a classifier
16: for each class y in Y do
17:   Collect spurious embeddings {h_s^(i) | y^(i) = y}
18:   Perform clustering using Minibatch KMeans on {h_s^(i)} to get cluster labels {c^(i)}
19:   Train a linear classifier (e.g., SVM) on {(h_s^(i), c^(i))} to obtain spurious logits s^(i)
20: end for
21: Re-initialize GNN encoder h(·) and classification heads ρ(·), ρ'(·)
    // Step 3: Learning invariant features
22: for epoch t = 1 to E' do
23:   for each graph G^(i) in D do
24:     Obtain graph representation ĥ_G^(i) ← h(G^(i))
25:     Compute estimated spurious logits ŝ^(i) ← ρ'(ĥ_G^(i))
26:     Compute reweighting coefficient w^(i) = (1 − (s_j^(i))^γ)/γ, where j is the cluster label
27:     Compute the invariant loss using Eqn. (3)
28:     Update model parameters by minimizing:
29:       L = L_ERM + λL_Inv
30:   end for
31: end for
32: return Trained model f = ρ ∘ h
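The reweighting in line 26 of Algorithm 1 can be illustrated with a small stdlib-only sketch (the per-sample loss form below is a simplified cross-entropy stand-in for Eqn. (3); s_j denotes the calibrated spurious probability of the sample's cluster):

```python
import math

def reweight(s_j, gamma):
    """Line 26 of Algorithm 1: w = (1 - s_j**gamma) / gamma.
    Samples whose cluster is confidently predicted from spurious
    features (large s_j) receive a smaller weight."""
    return (1.0 - s_j ** gamma) / gamma

def weighted_ce(p_true, s_j, gamma):
    """Simplified per-sample stand-in for Eqn. (3): w * (-log p_true)."""
    return reweight(s_j, gamma) * -math.log(p_true)

# A sample dominated by spurious features (s_j = 0.95) contributes less
# to the invariant loss than one that is not (s_j = 0.05).
hi_spurious = weighted_ce(0.7, s_j=0.95, gamma=0.5)
lo_spurious = weighted_ce(0.7, s_j=0.05, gamma=0.5)
print(hi_spurious < lo_spurious)  # True
```

Down-weighting samples that are well explained by spurious features is what steers the retrained encoder toward the invariant ones.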
During the construction of each class's samples, we attach both invariant subgraphs to the base spurious subgraph with 50% probability, while in the remaining 50% we randomly attach one invariant subgraph to the base spurious subgraph. For the test set, we randomly attach a single invariant subgraph to the base subgraph. Similar to the SPMotif dataset, the base spurious subgraph is correlated with the target labels, where b controls the correlation strength; in the test set, an equal correlation strength is assigned to the samples in the same class.

GOOD-HIV is a molecular dataset derived from the MoleculeNet (Wu et al., 2018) benchmark, where the primary task is to predict the ability of molecules to inhibit HIV replication. The molecular structures are represented as graphs, with nodes as atoms and edges as chemical bonds. Following Gui et al. (2022), we adopt the covariate shift split, which refers to changes in the input distribution between training and testing datasets while maintaining the same conditional distribution of labels given inputs. This setup ensures that the model must generalize to unseen molecular structures that differ in these domain features from those seen during training. We focus on the Bemis-Murcko scaffold (Bemis & Murcko, 1996) and the number of nodes in the molecular graph as two domain features to evaluate our method.

GOOD-Motif is a synthetic dataset designed to test structure shifts. Each graph in this dataset is created by combining a base graph and a motif, with the motif solely determining the label. The base graph type and the graph size are selected as domain features to introduce covariate shifts. By generating different base graphs such as wheels, trees, or ladders, the dataset challenges the model's ability to generalize to new graph structures not seen during training.
We employ the covariate shift split, where these domain features vary between training and testing datasets, reflecting real-world scenarios in which the underlying graph structures may change.

Open Graph Benchmark. We also use two molecule datasets from the graph property prediction task of the Open Graph Benchmark (Hu et al., 2020), also known as OGBG. They were originally collected by MoleculeNet (Wu et al., 2018) and are used to predict the properties of molecules, including (1) blood-brain barrier permeability in MolBBBP and (2) inhibition of human β-secretase 1 in MolBACE. For all molecule datasets, we use the scaffold splitting procedure adopted by OGBG (Hu et al., 2020). It attempts to separate structurally different molecules into different subsets, which provides a more realistic estimate of model performance in experiments (Wu et al., 2018). In addition, we also adopt the size split to evaluate the OOD generalization ability of various OOD methods, following Sui et al. (2023) and Gui et al. (2022).

H.2 EXPERIMENTAL SETUP

Encoding spurious features with biased infomax. We adopt MVGRL (Hassani & Khasahmadi, 2020) as the contrastive learning method for learning spurious features. To realize instance-level adaptive biased infomax, we first utilize GSAT (Miao et al., 2022) as the GNN explainability framework to identify important nodes in a graph. The biased infomax principle is realized using contrastive learning, with the InfoNCE (Oord et al., 2018) loss as the neural mutual information estimator. The nodes estimated by GSAT in a graph G are treated as negative samples rather than positive ones. Consequently, for a graph G, the graph representation is optimized to align closely with nodes from Gs while diverging from the representations of nodes in Gc. The GNN backbone for GSAT is a 5-layer GIN (Xu et al., 2018); the hidden dimension is set to 64 for all the datasets except MolHIV, where the hidden dimension is 128.
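As a toy illustration of this objective (a pure-Python sketch with hypothetical 2-D embeddings, not the actual MVGRL/PyGCL implementation), treating the explainer-identified invariant nodes as negatives pulls the graph summary toward the spurious part:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(graph_repr, pos_nodes, neg_nodes, tau=0.5):
    """InfoNCE loss: the graph summary should agree with node embeddings
    kept as positives and disagree with those treated as negatives."""
    loss = 0.0
    for p in pos_nodes:
        num = math.exp(dot(graph_repr, p) / tau)
        den = num + sum(math.exp(dot(graph_repr, n) / tau) for n in neg_nodes)
        loss += -math.log(num / den)
    return loss / len(pos_nodes)

g = [1.0, 0.0]                        # graph-level summary
spurious = [[0.9, 0.1], [0.8, -0.2]]  # nodes from the estimated G_s (positives)
invariant = [[-0.9, 0.3]]             # GSAT-flagged nodes from G_c (negatives)

biased = info_nce(g, pos_nodes=spurious, neg_nodes=invariant)
flipped = info_nce(g, pos_nodes=invariant, neg_nodes=spurious)
print(biased < flipped)  # True: aligning with G_s nodes gives the lower loss
```

Minimizing the biased variant therefore drives the encoder to summarize the spurious subgraph, which is exactly what Step 1 of LIRS needs before those features are removed.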
During the inference stage, we obtain the top-K edges with the highest probabilities in each graph instance and perform counterfactual inference: after the removal of the subgraph induced by the top-K edges, we record the change in the prediction score, Δprob, and compare it with a predefined threshold τ to decide whether the instance should be biased or not. The nodes in a graph are treated as negative examples during MVGRL training if Δprob > τ. In all the experiments, K is searched over {8, 12} and τ is searched over {0.2, 0.3}. For the SPMotif datasets, we directly use the ground-truth nodes from Gc and do not employ GSAT for approximation.

Generating logits from spurious features and target labels. We use the Minibatch KMeans algorithm to obtain the clustering labels. To generate (soft) logits as targets from the spurious features learned via biased infomax in each cluster, we use a linear SVM followed by a probability calibrator.

GNN encoder. For the SPMotif and SPMotif-binary datasets, we adopt a 5-layer GIN (Xu et al., 2018) as the backbone GNN encoder with mean pooling. For the GOOD-Motif datasets, we utilize a 4-layer GIN with sum pooling and a hidden dimension of 300. For the GOOD-HIV datasets, we employ a 4-layer GIN with sum pooling and a hidden dimension of 128. For the OGBG-Molbbbp and OGBG-Molbace datasets, we adopt a 4-layer GIN with sum pooling and a hidden dimension of 64.

Optimization and evaluation. By default, we use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-3 and a batch size of 64 for all experiments. We also employ early stopping with a patience of 10 epochs according to the validation performance for all datasets. Test accuracy or ROC-AUC is reported according to the best validation performance for all experiments. All experiments are run with 4 different random seeds, and the mean and standard deviation over the 4 runs are reported.

Baseline setup and hyperparameters.
In our experiments, for the GOOD and OGBG-Molbbbp datasets, the results of ERM, IRM, GroupDRO, and VREx are reported from Gui et al. (2022), while the results for DropEdge, DIR, GSAT, CIGA, GIL, GREA, FLAG, G-Mixup, and AIA on the GOOD and OGBG datasets are reported from Sui et al. (2023). To ensure fairness, we adopt the same GIN backbone architecture as reported in Sui et al. (2023). For the remaining datasets and methods, we conduct experiments using the source codes provided with the baseline methods. The hyperparameter search is detailed as follows. For IRM and VREx, the weight of the penalty loss is searched over {1e-1, 1, 1e1, 1e2}. The causal subgraph ratio for DIR is searched over {1e-2, 1e-1, 0.2, 0.4, 0.6}. For RSC, the masking ratio is searched over {0.2, 0.3, 0.4}. For Diverse Model, the number of classification heads is searched over {5, 10, 20}, and the penalty weight of the diversification loss is searched over {1e-1, 1e-2, 1e-3}. For DropEdge, the edge masking ratio is searched over {0.1, 0.2, 0.3}. For GREA, the weight of the penalty loss is tuned over {1e-2, 1e-1, 1.0}, and the causal subgraph size ratio is tuned over {0.05, 0.1, 0.2, 0.3, 0.5}. For GIL, the penalty weight is searched over {1e-5, 1e-3, 1e-1, 1.0}, and the number of environments is searched over {3, 5, 10}. For GSAT, the causal graph size ratio is searched over {0.3, 0.5, 0.7}. For CIGA, the contrastive loss and hinge loss weights are searched over {0.5, 1.0, 2.0, 4.0, 8.0}. For DisC, we search over q in the GCE loss: {0.5, 0.7, 0.9}. For LiSA, the loss penalty weights are searched over {1, 1e-1, 1e-2, 1e-3}. For FLAG, the number of ascending steps is set to 3 as recommended in the paper, and the step size is searched over {1e-3, 1e-2, 1e-1}. For AIA, the stable feature ratio is searched over {0.1, 0.3, 0.5, 0.7, 0.9}, and the adversarial penalty weight is searched over {0.01, 0.1, 0.2, 0.5, 1.0, 3.0, 5.0}.
For EQuAD, the penalty weight is searched over {1e-1, 1e-2, 1e-3}, and the reweighting coefficient is searched over {0.1, 0.3, 0.5, 0.7, 0.9}.

Hyperparameter search for LIRS. The penalty weight for L_Inv in LIRS is searched over {1e-1, 1e-2, 1e-3}. The reweighting coefficient γ is searched over {0.1, 0.3, 0.5, 0.7, 0.9}. The cluster number C is searched over {3, 5, 10}. The training epoch E at which the spurious embedding is derived from the biased infomax is searched over {50, 60, 70, 80, 90} for the real-world datasets; for the synthetic datasets, the training epoch E is searched over {5, 6, 7, 8, 9}.

Implementation of LIRS. We extend the code from GSAT (Miao et al., 2022) to annotate the nodes to be biased in each graph of the real-world datasets. We adopt the PyGCL (Zhu et al., 2021) package and modify the source code of DualBranchContrast to implement the biased infomax for generating spurious embeddings. To generate logits from the spurious embeddings, we use MiniBatchKMeans, LinearSVC, and CalibratedClassifierCV from the scikit-learn package (Pedregosa et al., 2011). We use PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey & Lenssen, 2019) for the remaining implementation.

H.3 ADDITIONAL EXPERIMENTAL RESULTS

Visualization of latent features derived from the biased infomax. We provide a visualization of the 2D latent embeddings derived from the biased infomax. Specifically, we utilize the SPMotif and GOODMotif datasets to assess the correlation between the learned embeddings after dimensionality reduction and the environment labels within each class. We annotate each data sample with the environment label corresponding to the three spurious patterns (Tree, Ladder, and Wheel). We then take the spurious features obtained from the biased infomax objective, apply t-SNE (Van der Maaten & Hinton, 2008) for dimensionality reduction, and use KMeans for clustering and visualization.
The epoch at which the latent embedding is obtained is selected according to the best hyperparameter setting. In Figure 6, points of different colors correspond to different environment labels. It can be observed that within each class, the clusters highly correlate with the environments, indicating that the biased infomax effectively captures different spurious patterns.

H.4 SOFTWARE AND HARDWARE

All the experiments are run with PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey & Lenssen, 2019) on Linux servers with RTX 4090 GPUs and CUDA 12.2.

Figure 6: Clustering visualizations using latent embeddings derived from the biased infomax: (a) GOODMotif-base, (b) SPMotif-0.60 (#Gc = 3), (c) SPMotif-0.90 (#Gc = 3), (d) SPMotif-binary-0.40, (e) SPMotif-binary-0.60. The clusters resulting from the latent embeddings obtained via the biased infomax are highly correlated with the environment labels.
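The visualization pipeline of Appendix H.3 can be sketched as follows (synthetic Gaussian blobs stand in for the spurious embeddings of one class under three environments; scikit-learn's t-SNE and KMeans as named in the text, with plotting omitted):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for spurious embeddings of one class under three environments
# (Tree / Ladder / Wheel): three well-separated Gaussian blobs in 16-D.
emb = np.vstack([rng.normal(loc=4.0 * k, scale=0.5, size=(60, 16)) for k in range(3)])
env = np.repeat(np.arange(3), 60)  # ground-truth environment labels

# Reduce to 2-D with t-SNE, then cluster the 2-D embedding with KMeans.
z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(z)

# If the embeddings capture the spurious patterns, each environment should
# map (up to a permutation of cluster indices) to a single cluster.
purity = np.mean([np.bincount(clusters[env == k]).max() / 60 for k in range(3)])
print(purity)  # close to 1.0 for well-separated environments
```

A purity near 1.0 is the quantitative counterpart of the visual alignment between clusters and environment labels seen in Figure 6.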