# Graph Structure Extrapolation for Out-of-Distribution Generalization

Xiner Li 1, Shurui Gui 1, Youzhi Luo 1, Shuiwang Ji 1

1 Department of Computer Science and Engineering, Texas A&M University, Texas, USA. Correspondence to: Shuiwang Ji.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## Abstract

Out-of-distribution (OOD) generalization deals with the prevalent learning scenario where test distribution shifts from training distribution. With rising application demands and inherent complexity, graph OOD problems call for specialized solutions. While data-centric methods exhibit performance enhancements on many generic machine learning tasks, there is a notable absence of data augmentation methods tailored for graph OOD generalization. In this work, we propose to achieve graph OOD generalization with a novel design of non-Euclidean-space linear extrapolation. The proposed augmentation strategy extrapolates structure spaces to generate OOD graph data. Our design tailors OOD samples for specific shifts without corrupting underlying causal mechanisms. Theoretical analysis and empirical results evidence the effectiveness of our method in solving target shifts, showing substantial and consistent improvements across various graph OOD tasks.

## 1. Introduction

Machine learning algorithms typically assume training and test data are independently and identically distributed (i.i.d.). However, distribution shift is a common problem in real-world applications, and it substantially degrades model performance. The out-of-distribution (OOD) generalization problem deals with learning scenarios where test distributions shift from training distributions and remain unknown during the training phase. The area of OOD generalization has gained increasing interest over the years, and multiple OOD methods have been proposed (Ganin et al., 2016; Arjovsky et al., 2019; Krueger et al., 2021; Gui et al., 2024). Although both general OOD problems and graph analysis (Veličković et al., 2018; Liu et al., 2021b; Gui et al., 2022b; Sun et al., 2024) have been intensively studied, graph OOD research is only in its early stage (Wu et al., 2022b;a; Bevilacqua et al., 2021; Gui et al., 2023). With various applications and unique complexity, graph OOD problems call for specific solutions.

Data augmentation (DA) methods have shown a significant boost in generalization capability and performance improvement across multiple fields (Shorten & Khoshgoftaar, 2019; Yao et al., 2022; Park et al., 2022; Han et al., 2022), creating a promising possibility for graph OOD studies. Currently, environment-aware DA methods for graph OOD are under-explored. Conventional data augmentations increase the amount of data and act as regularizers to reduce over-fitting, which empirically enhances model performance in previous studies (Rong et al., 2019; You et al., 2020). Many DA techniques (Zhang et al., 2017; Yao et al., 2022; Wang et al., 2021), including graph data augmentations (GDA), exclusively interpolate data samples to generate new ones. Interpolation-based DA boosts model performance by making overall progress in learning (Xu et al., 2020), with Mixup (Zhang et al., 2017) being a typical example. However, practical tasks are often out-of-distribution instead of in-distribution (ID). Thus, models are expected to extrapolate instead of interpolate.
Currently, few augmentation studies focus on extrapolation, especially for graphs. The distribution regions where models fail to generalize are also hardly reachable when generating augmentation samples with traditional techniques, which is a substantial obstacle to OOD generalization. Although graphon calculation (Han et al., 2022) is a potential avenue for extrapolation, its lack of consideration of environment information and causal design means that causal correlations are easily breached. Thus, the performance gain from DA appears limited in OOD tasks.

In this work, we propose to solve OOD generalization in graph classification tasks from a data-centric perspective. To stimulate the potential improvement of DA in OOD tasks, we study graph-space extrapolation, essentially, generating OOD data samples. Graph data has the complex nature of topological irregularity and connectivity, with unique types of distribution shifts in structure. Theoretically, it is impossible to solve unknown shifts without auxiliary information (Lin et al., 2022). Thus, injecting environment information (Gulrajani & Lopez-Paz, 2020) into training to convey the types of shifts is a necessary and promising solution for OOD. We propose an environment-aware graph structure extrapolation framework for out-of-distribution generalization, with non-Euclidean-space linear extrapolation designed in both graph structural and feature spaces.

Our contributions are three-fold. Firstly, we establish non-Euclidean-space linear extrapolation with definitions, analyses, and guarantees. We theoretically justify that samples generated from linear extrapolation follow common causal assumptions and are tailored for specific OOD shifts. Secondly, we instantiate graph extrapolation as an efficacious generalization method. Structural linear extrapolation is enabled with novel graph splicing and label-environment-aware pair learning techniques. The proposed techniques cover complete structural global/local extrapolation in both distribution directions. Thirdly, our extensive experiments validate the superiority of our method over both graph invariant learning and data augmentation baselines for complex shifts.

Comparison with prior methods. While previous graph OOD methods (Wu et al., 2022b; Miao et al., 2022; Li et al., 2022; Chen et al., 2022b) focus on invariant learning regularization, we target OOD generalization from an explicit data-centric perspective. The major distinction between our method and the augmentation-based learning strategy DIR (Wu et al., 2022b) is the ability to generate new graphs. Specifically, DIR mixes the embeddings of subgraphs instead of generating new data in input space, limiting its augmentation to a single training run. In contrast, our method produces explicit OOD graphs, which can be easily transferred and jointly used with other graph datasets. In the context of DA, current graph OOD augmentation methods either augment at the embedding level (Kwon et al., 2022) or limit the augmented data to subgraphs of the original data (Yu et al., 2022; Sui et al., 2022). On the contrary, we rigorously define innovative extrapolation operators at the input level and generate diverse unseen graphs with new sizes, backbones, and features. A more detailed discussion of related works is provided in Appendix A.

## 2. Problem Setting

Graph notations. We denote a graph as $G = (A, X, E)$, where $A \in \mathbb{R}^{n \times n}$, $X \in \mathbb{R}^{p \times n}$, and $E \in \mathbb{R}^{q \times m}$ are the adjacency, node feature, and edge feature matrices, respectively.
We assume $n$, $m$, $p$, and $q$ are the numbers of nodes, edges, node features, and edge features, respectively. Additionally, we assume a set of latent variables $\{z_i \in \mathbb{R}^f\}_{i=1}^{n}$ forms a matrix $Z = [z_1, z_2, \ldots, z_n] \in \mathbb{R}^{f \times n}$. For graph-level tasks, each graph has a label $Y \in \mathcal{Y}$, and for node-level tasks there is a label per node.

OOD settings. The environment formalism following invariant risk minimization (IRM) is a common setting for OOD studies (Peters et al., 2016; Arjovsky et al., 2019; Krueger et al., 2021). This framework assumes that training data form groups, known as environments. Data are similar within the same environment but dissimilar across different environments. Since multiple shifts can exist between training and test data, models are usually not expected to solve all shifts. Instead, the target shift type is conveyed using environments. Specifically, the target shift between training and test data, though more significant, should be similarly reflected among different training environments. In this case, OOD methods can potentially grasp the shift by learning among different environments. In this work, we follow this formalism and use environment information in augmentation strategies to benefit OOD generalization. Environments are given as environment labels $\varepsilon_i \in \mathcal{E}$ for each data sample.

Graph structure and feature distribution shifts. Graph data contain complex topologies. Graph distribution shifts can happen on both features and structures, which possess different properties and should be handled separately. Common feature distribution shifts happen on node or edge features, and we consider node features in this work. In this case, shifts are solely on the node feature distribution, i.e., $P^{tr}(X) \neq P^{te}(X)$ while $P^{tr}(A) = P^{te}(A)$, where $P^{tr}(\cdot)$ and $P^{te}(\cdot)$ denote training and test distributions, respectively. In contrast, structure distribution shift is the more distinctive and complex case in graph OOD. Structural shifts can happen in the distribution of $A$ or the conditional distribution between $X$ and $A$, resulting in $P^{tr}(X, A) \neq P^{te}(X, A)$. In graph-level tasks, structural shifts exist both globally and locally. Common global/local domain examples are graph size and graph base (Hu et al., 2020; Gui et al., 2022a), the latter also known as scaffold (Bemis & Murcko, 1996) in molecule data. Specifically, graph size refers to the number of nodes, and graph base refers to the non-functional backbone substructure irrelevant to targets. In this work, we primarily focus on structure distribution shifts.
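As a concrete illustration of this setting, the sketch below shows one way to hold an environment-labeled graph sample $G = (A, X, E)$ together with its label; the `GraphSample` container and its field names are our own illustrative choices and not part of the paper or any benchmark API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraphSample:
    """One environment-labeled graph sample G = (A, X, E) as in Section 2."""
    A: np.ndarray   # adjacency matrix, shape (n, n)
    X: np.ndarray   # node feature matrix, shape (p, n)
    E: np.ndarray   # edge feature matrix, shape (q, m)
    y: int          # graph-level label Y
    env: int        # environment label, available only for training data

# A toy 4-node graph with 3 edges, 2 node features per node, 1 edge feature per edge.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(2, 4)   # p = 2, n = 4
E = np.random.randn(1, 3)   # q = 1, m = 3
sample = GraphSample(A=A, X=X, E=E, y=0, env=1)
```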
## 3. Linear Extrapolation in Graph Space

We propose the GDA strategy of input-space linear extrapolation, inspired by the philosophy of linear interpolation in Mixup (Wang et al., 2021). Linear extrapolation of data distributions essentially guides the model to behave outside the original range by introducing reachable OOD samples. In this section, we define linear extrapolation to extend beyond training distributions in both structure and feature spaces for graph data, effectively teaching the model to anticipate and handle OOD scenarios.

### 3.1. Causal Analysis

We first establish causal analyses following prior invariant learning works (Ahuja et al., 2021; Rosenfeld et al., 2020; Lu et al., 2021). As shown in Figure 1, $C, S_1, S_2 \in Z$ are the latent variables in high-dimensional space that are causally associated with the target $Y$, non-causally associated with $Y$, and independent of $Y$, respectively. The environment $E$ is target-irrelevant and observable.

[Figure 1. Illustration of the causal graph. Unobservable variables lie in the latent space; partially observable variables can be learned and selected; the environment is observable. Arrows denote injective and causal mappings.]

In the non-Euclidean space, we posit that information from the latent space is wholly reflected in the graph structure, so that $C$ and $E$ determine respective subgraphs of a graph (Chen et al., 2022b;c). Formally, we define subgraphs caused by $C$ as causal subgraphs $G_{inv}$, and subgraphs caused by $E$ as environmental subgraphs $G_{env}$. Since graphs with the same label should contain invariant causal subgraphs, causal subgraphs are potentially extractable from label-invariant graphs, and environmental subgraphs from environment-invariant graphs.

Let us also consider simple distribution shifts in the feature space, where we assume $C$ and $E$ determine respective elements of feature vectors. For a single-node feature $x \in \mathbb{R}^p$, where $x = X$ for node-level tasks and $x \in X$ for graph-level tasks, let $p = i + v$. We define node features determined by $C$ as invariant node features $x_{inv} \in \mathbb{R}^i$, while other node features are variant features $x_{var*} \in \mathbb{R}^v$. In practice, with environment information, it is realistic to assume we can learn to select a subset of the variant features $x_{var} \in \mathbb{R}^j$ substantially determined by $E$, where $j \le v$.

### 3.2. Linear Extrapolation Formulation

Linear extrapolation, which constructs samples beyond the known range while maintaining the same direction and magnitude of known sample differences, is a central concept in our approach.

Definition 3.1. [Feature Linear Extrapolation (FLE)] Given two feature vector points $(x_i, y_i)$ and $(x_j, y_j)$, feature linear extrapolation is defined as
$$x_{fle} = x_i + a(x_j - x_i), \quad y_{fle} = y_i + a(y_j - y_i), \quad \text{s.t. } a \in \mathbb{R},\ a > 1 \text{ or } a < 0. \tag{1}$$

The extension of linear extrapolation to graph structure requires the definition of structural linear calculations. We define graph addition, $G_1 + G_2$, as the splicing of two graphs, resulting in the union of their vertex and edge sets. Graph subtraction, $G_2 - G_1$, is defined as subtracting the largest isomorphic subgraph of $G_1$ and $G_2$ from $G_2$. Let $\mathcal{D}\{G^{tr}\} = \{(G_1, y_1), (G_2, y_2), \ldots, (G_N, y_N)\}$ be the $N$-sample graph training set. Given the discrete nature of graph operations, we formulate the linear extrapolation of graph structures below.

Definition 3.2. [Structural Linear Extrapolation (SLE)] Given graphs $G_i, G_j \in \mathcal{D}\{G^{tr}\}$, we define 1-dimension structural linear extrapolation on $\mathcal{D}\{G^{tr}\}$ as $G^1_{sle} = a_i G_i + b_{ij}(G_j - G_i)$, where $a_i, b_{ij} \in \{0, 1\}$. We extend this to define $N$-dimension structural linear extrapolation:
$$G^N_{sle} = \sum_{i=1}^{N} a_i G_i + \sum_{i,j=1}^{N} b_{ij}(G_j - G_i) = a^\top G + \langle B,\ \mathbf{1}G^\top - G\mathbf{1}^\top \rangle_F, \tag{2}$$
where $a = [a_1, \ldots, a_N]^\top$, $B = \{b_{ij}\}_{N \times N}$, $G = [G_1, \ldots, G_N]^\top$, $\mathbf{1}$ is an $N$-element vector of ones, and $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product. Let $c_{ij} \in \{0, 1\}$ indicate the existence of causal subgraphs in $(G_j - G_i)$. Then the label for $G^N_{sle}$ is defined as
$$y^N_{sle} = \frac{\sum_{i=1}^{N} a_i y_i + \sum_{i,j=1}^{N} c_{ij} b_{ij} y_j}{\sum_{i=1}^{N} a_i + \sum_{i,j=1}^{N} c_{ij} b_{ij}} = \frac{a^\top y + \langle C \odot B,\ \mathbf{1}y^\top \rangle_F}{a^\top \mathbf{1} + \langle C, B \rangle_F},$$
where $\odot$ denotes the Hadamard product, $y = [y_1, \ldots, y_N]^\top$, and $C = \{c_{ij}\}_{N \times N}$.

Note that we do not need to avoid splicing multiple graphs to ensure linearity in Eq. 2, due to the high dimensionality of graph structure. In this context, $a^\top G$ denotes splicing multiple graphs together, while $\langle B, \mathbf{1}G^\top - G\mathbf{1}^\top \rangle_F$ denotes splicing together multiple subtracted subgraphs.
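To make the Euclidean case (Definition 3.1) concrete before its structural counterpart is instantiated in Section 4, here is a minimal NumPy sketch of feature linear extrapolation; the function name `fle` and the particular value of $a$ are our own illustrative assumptions.

```python
import numpy as np

def fle(x_i, y_i, x_j, y_j, a):
    """Feature linear extrapolation (Definition 3.1, Eq. 1).

    Requires a > 1 or a < 0, so the new point lies beyond the segment
    between (x_i, y_i) and (x_j, y_j) rather than between them.
    """
    assert a > 1 or a < 0, "a in [0, 1] would be interpolation, not extrapolation"
    x_new = x_i + a * (x_j - x_i)
    y_new = y_i + a * (y_j - y_i)
    return x_new, y_new

# Two training points with soft labels; a = 1.5 extrapolates past (x2, y2).
x1, y1 = np.array([0.0, 1.0]), np.array([1.0, 0.0])
x2, y2 = np.array([1.0, 3.0]), np.array([0.0, 1.0])
x_ood, y_ood = fle(x1, y1, x2, y2, a=1.5)
```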
These definitions enable structural linear extrapolation in the non-Euclidean graph space.

### 3.3. Linear Extrapolation for OOD Generalization

In this subsection, we justify that linear extrapolation can generate OOD samples respecting specific shifts while maintaining causal validity, i.e., preserving underlying causal mechanisms. First, we establish an assumption for structural linear extrapolation, namely that combining causal structures yields a sample with a mixed label.

Assumption 3.3 (Causal Additivity). Let $(G_1, y_1)$, $(G_2, y_2)$, and $(G_3, y_3)$ be graph-label pairs. If $G_3 = G_{1,inv} + G_{2,inv} + G_{env}$, then $y_3 = a y_1 + (1 - a) y_2$, where $G_{env}$ is any combination of environmental subgraphs, and $a \in (0, 1)$.

This causal assumption holds valid in a wide range of graph classification tasks, and we discuss its scope of application in Appendix D. Next, we provide two definitions to establish the conditions under which structural linear extrapolation can cover certain global and local environment values.

Definition 3.4. [Global SLE Achievability] Given an $N$-sample set of graphs $\mathcal{D}\{G\}$, we say that a graph global property $gp(A, X, E)$ is achievable by global SLE if there exists an $N$-dimension structural linear extrapolation $G^N_{sle}$ on $\mathcal{D}\{G\}$ s.t. $G^N_{sle} = (A, X, E)$.

Definition 3.5. [Local SLE Achievability] Given an $N$-sample set of graphs $\mathcal{D}\{G\}$, we say that a substructure $B$ is achievable by local SLE if there exists an $N$-dimension structural linear extrapolation $G^N_{sle}$ on $\mathcal{D}\{G\}$ s.t. $(G^N_{sle})_{env} = B$.

The following theorems assert that structural linear extrapolation can create OOD samples covering at least two environments in opposite directions of the distribution, respecting global/local shifts in size and base, and that these samples are causally valid.

Theorem 3.6. Given an $N$-sample training dataset $\mathcal{D}\{G^{tr}\}$, its $N$-dimension structural linear extrapolation can generate sets $\mathcal{D}\{G^1\}$ and $\mathcal{D}\{G^2\}$ such that for all $G^{tr} \in \mathcal{D}\{G^{tr}\}$, $G^1 \in \mathcal{D}\{G^1\}$, $G^2 \in \mathcal{D}\{G^2\}$, $(G^1)_{env} < (G^{tr})_{env} < (G^2)_{env}$, where $<$ denotes smaller size for size extrapolation and lower base complexity for base extrapolation.

Theorem 3.7. Given an $N$-sample training dataset $\mathcal{D}\{G^{tr}\}$ and the true labeling function $f(G)$ for the target classification task, if $\mathcal{D}\{G^N_{sle}\}$ is a graph set sampled from $N$-dimension structural linear extrapolation of $\mathcal{D}\{G^{tr}\}$ and Assumption 3.3 holds, then $\forall (G^N_{sle}, y) \in \mathcal{D}\{G^N_{sle}\}$, $y = f(G^N_{sle})$.

Proofs and further analysis are in Appendix G. These theorems show that structural linear extrapolation has the capability to generate OOD samples that are both plausible and diverse. The justification of feature linear extrapolation is relatively straightforward and provided in Section 4.4. Thus far, we have provided theoretical bases for the applicability of linear extrapolation to graph OOD tasks.

## 4. G-Splice for Structural Linear Extrapolation

In this section, we specify structural linear extrapolation as a feasible augmentation method with detailed implementations, termed G-Splice. Using environment information, the method extrapolates global structural features while preserving the structural information that causes the label. The approach is underpinned by theoretical analysis for structural linear extrapolation, providing causally-valid OOD samples that are tailored for specific shifts. The overall model constructs diverse augmentation samples, as shown in Figure 2. In the following subsections, we describe the technical modules of splicing, component graph selection, and post-sampling procedures separately; a small sketch of the splicing operation is given below.
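The sketch below illustrates the graph addition operator $G_1 + G_2$ from Section 3.2 with networkx, joining two component graphs with bridge edges. It is a conceptual toy only: the bridge count and endpoints are sampled at random here, whereas G-Splice predicts them with trained generators (Section 4.1), and the `splice` helper is our own name.

```python
import random
import networkx as nx

def splice(g1: nx.Graph, g2: nx.Graph, num_bridges: int = 1, seed: int = 0) -> nx.Graph:
    """Graph addition G1 + G2: disjoint union of vertex/edge sets joined by bridges.

    Bridges are sampled uniformly at random for illustration; in G-Splice their
    number and endpoints come from the bridge-number predictor and cVAE.
    """
    rng = random.Random(seed)
    spliced = nx.disjoint_union(g1, g2)            # g2's nodes are relabeled after g1's
    left = list(range(g1.number_of_nodes()))
    right = list(range(g1.number_of_nodes(), spliced.number_of_nodes()))
    for _ in range(num_bridges):
        spliced.add_edge(rng.choice(left), rng.choice(right))
    return spliced

# Example: splice a 4-cycle onto a 3-node path with two bridges.
g = splice(nx.cycle_graph(4), nx.path_graph(3), num_bridges=2)
print(g.number_of_nodes(), g.number_of_edges())    # 7 nodes; 6 original edges plus the sampled bridges
```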
### 4.1. Graph Splice

The action of splicing a group of component graphs is essentially a conditional edge generation task, which we refer to as bridge generation. We generate a predicted number of bridges, along with corresponding edge attributes, between given component graphs to join multiple components into a single graph. In this work, we use conditional variational autoencoders (cVAE) (Kingma & Welling, 2013; Kipf & Welling, 2016), though other generative models may also be used. We adopt the cVAE as the major bridge generator for its adequate capability and high efficiency, as compared with diffusion models (Ho et al., 2020) in Appendix J.

The bridge generator takes as input a group of component graphs, denoted as $G_1, \ldots, G_f = (X_1, A_1, E_1), \ldots, (X_f, A_f, E_f)$. The cVAE encoder produces a latent variable distribution. Specifically, we construct the inference model as
$$q_\phi(Z \mid G_1, \ldots, G_f) = \prod_{i=1,\ v_i \in G_j}^{n} q_\phi(z_i \mid X_j, A_j, E_j) = \prod_{i=1}^{n} \mathcal{N}\!\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\right),$$
where $v_i \in G_j$ denotes that the $i$-th node $v_i$ belongs to component graph $G_j$, $\mu_i$ and $\sigma_i$ are the generated mean and standard deviation vectors of the $i$-th latent distribution, and $n$ is the total number of nodes in all component graphs. The encoder $q_\phi$ is parameterized by three-layer graph isomorphism networks (GIN) (Xu et al., 2019a). The generative model produces the probability distribution for bridges $A^b$ and corresponding attributes $E^b$:
$$p_\theta(A^b, E^b \mid Z) = \prod_{i,j=1}^{n} p_\theta(A^b_{ij}, e^b_{ij} \mid z_i, z_j), \qquad
p_\theta(A^b_{ij}, e^b_{ij} \mid z_i, z_j) = \begin{cases} \mathrm{MLP}_\theta(z_i, z_j), & \text{if } v_i \in G_s,\ v_j \in G_t \text{ s.t. } s \neq t, \\ (0, \text{None}), & \text{otherwise,} \end{cases}$$
where $A^b_{ij}$ is the $ij$-th element of $A^b$ and $e^b_{ij} \in E^b$ is the corresponding edge attribute vector. By sampling $B$ times from $p_\theta(A^b, E^b \mid Z)$, we obtain $B$ pairs of bridge-attribute vectors to complete bridge generation.

[Figure 2. Architecture overview. Causal features are selected and then preserved along with graph topological structure, while extrapolation on non-causal features spans their feature space to solve feature shifts. For structure shifts, diverse component graphs are extracted with pair-learning approaches, then selected and spliced with extrapolation strategies to achieve causally-valid structural OOD samples. (Legend: X: node feature; ε: environment label; (A, E): structure excluding X; M: invariant mask; X_A: augmented node feature.)]

To train the bridge generator, we optimize the variational lower bound $\mathcal{L}$ w.r.t. the variational parameters by
$$\mathcal{L}_{\theta,\phi} = \mathbb{E}_{q_\phi}\!\left[\log p_\theta(A^b \mid Z)\right] + \alpha\, \mathbb{E}_{q_\phi}\!\left[\log p_\theta(E^b \mid Z)\right] - \beta\, \mathrm{KL}\!\left[q_\phi \,\|\, p(Z)\right], \tag{5}$$
where $\mathrm{KL}[q(\cdot)\|p(\cdot)]$ is the Kullback-Leibler divergence between $q(\cdot)$ and $p(\cdot)$. We take the Gaussian prior $p(Z) = \prod_i p(z_i) = \prod_i \mathcal{N}(z_i \mid 0, I)$. Here $\alpha$ and $\beta$ are hyperparameters regularizing the bridge attribute term and the KL divergence, respectively. Note that we do not include new nodes as part of the bridge, since we aim at preserving the local structures of the component graphs and extrapolating certain global features. Manually adding further graph structures provides no extrapolation significance, and their interpolation influence is not shown to be beneficial, as reflected in Appendix J.
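A minimal PyTorch sketch of the bridge-generation objective in Eq. 5 is given below. It assumes the GIN encoder has already produced per-node means and log-variances, treats edge attributes as continuous with an MSE term (the paper does not pin down the attribute likelihood), and uses our own names (`BridgeDecoder`, `bridge_cvae_loss`); it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeDecoder(nn.Module):
    """Scores candidate bridges (i, j) across components and predicts their
    edge attributes from the latent node vectors z_i, z_j (the MLP_theta of Eq. 4)."""
    def __init__(self, latent_dim: int, edge_attr_dim: int):
        super().__init__()
        self.edge_logit = nn.Sequential(nn.Linear(2 * latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.edge_attr = nn.Sequential(nn.Linear(2 * latent_dim, 64), nn.ReLU(), nn.Linear(64, edge_attr_dim))

    def forward(self, z, pairs):
        h = torch.cat([z[pairs[:, 0]], z[pairs[:, 1]]], dim=-1)
        return self.edge_logit(h).squeeze(-1), self.edge_attr(h)

def bridge_cvae_loss(mu, logvar, decoder, pairs, bridge_target, attr_target, alpha=1.0, beta=1e-3):
    """Negative ELBO sketch of Eq. 5: bridge reconstruction + alpha * attribute
    reconstruction + beta * KL(q(Z | G_1..G_f) || N(0, I))."""
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)                 # reparameterization trick
    logits, attr_pred = decoder(z, pairs)
    rec_a = F.binary_cross_entropy_with_logits(logits, bridge_target)
    rec_e = F.mse_loss(attr_pred, attr_target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_a + alpha * rec_e + beta * kl

# Toy usage: 7 latent nodes from two components, 5 cross-component candidate pairs.
mu, logvar = torch.randn(7, 16), torch.zeros(7, 16)
decoder = BridgeDecoder(latent_dim=16, edge_attr_dim=3)
pairs = torch.tensor([[0, 4], [1, 4], [2, 5], [3, 6], [0, 6]])
loss = bridge_cvae_loss(mu, logvar, decoder, pairs,
                        bridge_target=torch.tensor([1., 0., 1., 0., 0.]),
                        attr_target=torch.randn(5, 3))
loss.backward()
```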
Bridge number prediction. To predict the number of prospective bridges between a set of component graphs, a pre-trained GNN parameterized by $\eta$ produces probabilities for the bridge number $B$, $p_\eta(B) = \mathrm{GNN}_\eta(X_1, A_1, E_1, \ldots, X_f, A_f, E_f)$. When generating bridges, we first sample the number $B$ from the categorical distribution with the predicted probabilities.

### 4.2. Component Graph Selection

Whole graphs. Corresponding to $a^\top G$ in Eq. 2, we use whole graphs from the training data as one category of component graphs, which offers computational simplicity and enables extrapolation.

Causal subgraphs and environmental subgraphs. The term $\langle B, \mathbf{1}G^\top - G\mathbf{1}^\top \rangle_F$ in Eq. 2 requires the operation of subtracting the largest isomorphic subgraph, which is practically infeasible. In this case, using target and environment label information, we approximate $(G_j - G_i)$ for particular graph pairs. Specifically, $(G_j - G_i) \approx G_{j,inv}$ for $G_j, G_i$ with different labels but the same environment, and $(G_j - G_i) \approx G_{j,env}$ for $G_j, G_i$ with the same label but different environments. Therefore, we can use extracted causal/environmental subgraphs as $(G_j - G_i)$. We perform pair-wise similarity matching on label-invariant graphs to extract causal subgraphs, and on environment-invariant graphs to extract environmental subgraphs, following Sec. 3.1. Since the environment of test data remains unknown and these subgraphs are inaccessible at test time, using these subgraphs in data augmentation for training can be an optimal strategy.

Let $G$ and $G'$ be two graphs with the same label $y_1$ but different environments $\varepsilon_1$ and $\varepsilon_2$. We identify $G_{inv}$ as the subgraph shared by both graphs that exhibits the highest similarity. The theorem below establishes that, under certain causal assumptions, this identification approach recovers the causal subgraph $G_{inv}$ by definition.

Theorem 4.1. Given the causal graph (Figure 1) and assuming a bijective causal mapping between $C$ and $Y$, for $G$ and $G'$, let $G_s$ represent potential subgraphs. If $I(\cdot, \cdot)$ measures cosine similarity in the graph embedding space and $f_z: \mathcal{G} \to \mathbb{R}^f$ is a feature mapping that reversely infers hidden features, then: (1) the similarity of the invariant subgraphs of $G$ and $G'$, $I(f_z(G_{inv}), f_z(G'_{inv}))$, reaches its maximum of 1; (2) given the subgraph set $\mathcal{G}_s = \{G_s \mid \exists\, G'_s\ \text{s.t.}\ I(f_z(G_s), f_z(G'_s)) = 1\}$, the invariant subgraph can be obtained as $G_{inv} = \arg\max_{G_s \in \mathcal{G}_s} |G_s|$.

Discussions and proofs are provided in Appendix G. Formally, the algorithmic procedure to identify $G_{inv}$ has the optimization objective $G_{inv} = \arg\max_{G_s \in \mathcal{G}_s} |G_s|$, where the set of subgraphs $\mathcal{G}_s = \{G_s \mid G_s \in \arg\max_{G_s} I(f(G_s), f(G'_s))\}$. Note that, considering the noise injected into $G_{inv}$ in the real world, which is generally much smaller than $S_1$ and $S_2$, we adopt a relaxed version of $\mathcal{G}_s$. We use label-invariant, environment-variant graph pairs to pre-train a causal subgraph extraction network, which is optimized so that the sampled causal subgraphs alone predict the label $Y$. Therefore, the outer training objective of $f_z$ is $f_z^* = \arg\min_{f_z} \ell(y; f_c(f_z(G_{inv})))$, where $f_c$ is the classifier. More algorithmic details are in Appendix C. Similarly, environment-invariant, label-variant graph pairs are used to pre-train the environmental subgraph extraction network using the environment label. A simplified sketch of the pairwise matching step is given below.
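The sketch below illustrates the relaxed matching rule of Theorem 4.1 only at a conceptual level: candidate subgraphs are assumed to come from some external sampler, the graph encoder is replaced by mean pooling, and the threshold `tau`, function names, and data layout are all our own assumptions; the actual method trains a subgraph extraction network end-to-end as described above.

```python
import torch
import torch.nn.functional as F

def embed(x_sub):
    """Placeholder graph encoder: mean-pool node features of a candidate subgraph.
    In practice f_z is a pre-trained GIN producing graph embeddings."""
    return x_sub.mean(dim=0)

def extract_invariant_candidate(cands_g, cands_g_prime, tau=0.95):
    """Among candidate subgraph pairs from two label-invariant, environment-variant
    graphs, keep pairs whose embeddings have cosine similarity >= tau and return
    the largest such candidate of G (relaxed version of Theorem 4.1).

    cands_g / cands_g_prime: lists of (node_feature_tensor, num_nodes) tuples,
    one entry per candidate subgraph from a subgraph sampler.
    """
    best, best_size = None, -1
    for x_s, n_s in cands_g:
        for x_t, _ in cands_g_prime:
            sim = F.cosine_similarity(embed(x_s), embed(x_t), dim=0)
            if sim >= tau and n_s > best_size:
                best, best_size = x_s, n_s
    return best

# Toy usage with random candidate subgraphs of different sizes.
g_cands = [(torch.randn(k, 8), k) for k in (3, 5, 7)]
gp_cands = [(torch.randn(k, 8), k) for k in (4, 6)]
inv = extract_invariant_candidate(g_cands, gp_cands, tau=0.0)  # tau=0 so the toy example returns a candidate
```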
### 4.3. Post-sampling Procedure

To enable structural linear extrapolation, we need to select component graphs and sample bridges accordingly, as well as assign labels and environments. Since linear extrapolation has infinite choices of $a$ and $b$, to enable training, we simplify the augmentation into three options: (1) single causal subgraphs, $G_{inv}$; (2) causal and environmental subgraphs spliced, $G_{inv} + f\, G_{env}$; (3) whole graphs spliced, $f\, G$. With the pre-trained component graph extractors and bridge generator, we apply these strategies to augment graph data in OOD classification tasks. The actual number of component graphs $f$ is set as a hyperparameter, tuned and determined through OOD validation during training. The augmentation selections are also tuned as a hyperparameter, with at least one option applied. We label the generated graphs following the definition of structural linear extrapolation.

Considering extrapolation in environments, for size OOD tasks, we create up to three new environments with the three options according to the size distribution of the augmented graphs. For base/scaffold OOD tasks, we create up to two new environments, one for $G_{inv}$ and the other for $G_{inv} + f\, G_{env}$ and $f\, G$, since the former option constructs graphs without bases/scaffolds, while the latter options construct graphs with multiple bases/scaffolds combined. All original graphs and augmented graphs then form the augmented training distribution with the add-on environments. To make adequate use of the augmented environment information for OOD generalization, we optionally apply an invariance regularization (Krueger et al., 2021) during learning, reaching our final objective
$$\psi^* := \arg\min_\psi\ \mathbb{E}_{\varepsilon \in \{\mathcal{E} \cup \mathcal{E}_A\}}\, \mathbb{E}_{(G,y) \sim P_\varepsilon}\big[\ell(y; f_\psi(G))\big] + \gamma\, \mathrm{Var}_{\varepsilon \in \{\mathcal{E} \cup \mathcal{E}_A\}}\big[\mathbb{E}_{(G,y) \sim P_\varepsilon}\, \ell(y; f_\psi(G))\big], \tag{6}$$
where $\mathcal{E}_A$ denotes the augmented environments, $f_\psi$ is the prediction network for OOD tasks, $\ell(\cdot)$ calculates the cross-entropy loss, and $\mathrm{Var}[\cdot]$ calculates variance. A sketch of this environment-variance regularizer is given below.
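A minimal sketch of the Eq. 6 objective is shown here, assuming graph embeddings are already pooled so the classifier is a plain linear layer; the function and environment names are illustrative, and the variance is computed explicitly rather than via a library call.

```python
import torch
import torch.nn.functional as F

def env_variance_objective(model, batches_by_env, gamma=1.0):
    """Eq. 6 sketch: mean cross-entropy over all (original + augmented)
    environments plus gamma times the variance of per-environment risks."""
    env_risks = []
    for env, (inputs, labels) in batches_by_env.items():
        logits = model(inputs)
        env_risks.append(F.cross_entropy(logits, labels))
    risks = torch.stack(env_risks)
    variance = ((risks - risks.mean()) ** 2).mean()      # Var over environments
    return risks.mean() + gamma * variance

# Toy usage with a linear "classifier" on pre-pooled graph embeddings.
model = torch.nn.Linear(16, 3)
batches = {
    "train_env_0":    (torch.randn(8, 16), torch.randint(0, 3, (8,))),
    "train_env_1":    (torch.randn(8, 16), torch.randint(0, 3, (8,))),
    "aug_env_splice": (torch.randn(8, 16), torch.randint(0, 3, (8,))),  # G-Splice samples
}
loss = env_variance_objective(model, batches, gamma=0.5)
loss.backward()
```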
### 4.4. FeatX for Feature Linear Extrapolation

As an additional distribution shift consideration, we implement feature linear extrapolation as FeatX, a simple label-invariant data augmentation strategy designed to further improve OOD generalization for graph feature shifts, applicable to both graph-level and node-level tasks, as shown in Figure 2. Environment information is used to selectively perturb non-causal features while causal features determining the label are preserved. During the augmentation process, with knowledge of the domain range of the node features, we span the feature space with extrapolation for non-causal features. In contrast to Mixup, which exclusively interpolates, FeatX uses knowledge of non-causal features and their domain to cover extrapolation as well as interpolation, advancing OOD generalization. We introduce the technical details in the following paragraphs and end with theoretical support for the effectiveness of feature linear extrapolation.

Non-Causal Feature Selection. Unlike graph structure, features reside in a relatively low-dimensional space. Direct extrapolation of node features that have causal relationships with the target may not yield beneficial outcomes (Appendix J). Therefore, we only perturb the selected variant features $x_{var}$. The selection of $x_{var}$ is implemented by learning an invariance mask $M \in \mathbb{R}^p$ based on a variance score vector $S_V \in \mathbb{R}^p$ and a threshold $T$. The variance score vector $S_V$ measures the variance of each feature element w.r.t. the target; high scores represent large variances and therefore features of $x_{var}$. Variance scores are learned using label and environment information. Label-invariant and environment-variant samples should have similar invariant features $x_{inv}$ while their variant features $x_{var}$ vary greatly; thus their feature variances increase the variance score vector $S_V$. Conversely, feature variances of environment-invariant, label-variant samples reduce the variance scores. The invariance mask $M$ selects $x_{var}$ and masks out other features by applying the threshold $T$ to the variance score vector $S_V$. Formally,
$$S_V = k_1\, \mathbb{E}_{y \in \mathcal{Y}}\, \mathrm{Var}_{P_y}[x] - k_2\, \mathbb{E}_{\varepsilon \in \mathcal{E}}\, \mathrm{Var}_{P_\varepsilon}[x], \qquad M = [S_V > T], \tag{7}$$
where $k = [k_1, k_2]$ and $T$ are trainable parameters, and $\mathrm{Var}_P[x]$ calculates the variance of node features for samples in distribution $P$.

Node Feature Extrapolation. We apply the mask $M$ and perturb the non-causal node features to achieve extrapolation w.r.t. $x_{var}$, without altering the topological structure of the graph. Let the domain of $x$ be denoted as $\mathcal{D}$, which is assumed to be accessible. Valid extrapolations must generate augmented samples with node features $X_A \in \mathcal{D}$ while $X_A \notin P^{train}(X)$. We ensure the validity of extrapolation with the generalized modulo operation (Appendix G), $\forall X \in \mathbb{R}^p$, $(X \bmod \mathcal{D}) \in \mathcal{D}$. Given each pair of samples $D_{\varepsilon_1}, D_{\varepsilon_2}$ with the same label $y$ but different environments $\varepsilon_1$ and $\varepsilon_2$, FeatX produces
$$X_A = M \odot \big( ((1 + \lambda) X_{\varepsilon_1} - \lambda' x_{\varepsilon_2}) \bmod \mathcal{D} \big) + \overline{M} \odot X_{\varepsilon_1}, \qquad (A, E) = (A_{\varepsilon_1}, E_{\varepsilon_1}),$$
where $\lambda, \lambda' \sim \mathcal{N}(a, b)$ are sampled for each data pair. Empirically, we achieve favorable extrapolation performance and faster convergence with $\lambda = \lambda' \in \mathbb{R}^+$ sampled from $\Gamma(a, b)$, which we use in experiments, with the shape parameter $a$ and scale parameter $b$ of the gamma distribution as hyperparameters. Note that $D_{\varepsilon_i}, D_{\varepsilon_j}$ can be either graph-level or node-level data samples, and in the graph-level case $x_{\varepsilon_2} \in X_{\varepsilon_2}$ is the feature vector of a random node. The augmented samples form a new environment. Replacing the original node features in training samples by the augmented ones, our optimization process is formulated as
$$\psi^* := \arg\min_{\psi, k, T}\ \mathbb{E}_{(D_{\varepsilon_i}, D_{\varepsilon_j}, y) \sim P^{train}}\big[\ell(f_\psi(X_A, A, E), y)\big],$$
where $f_\psi$ is the prediction network for OOD tasks and $\ell(\cdot)$ calculates the cross-entropy loss.

Solving Feature-based Graph Distribution Shifts. FeatX enables extrapolation w.r.t. the selected variant features, while causal features are preserved, thereby transforming OOD areas into ID. Theoretical analysis evidences that our extrapolation spans the feature space outside $P^{train}(X)$ for $x_{var}$. We present the following theorem showing that, under certain conditions, FeatX substantially solves feature shifts on the selected variant features for node-level tasks. Let $n_A$ be the number of samples FeatX generates and $f_\psi$ be the well-trained network with FeatX applied.

Theorem 4.2. If (1) there exist $(X_1, \ldots, X_j) \sim P^{train}$ from at least 2 environments such that $(X_{1,var}, \ldots, X_{j,var})$ span $\mathbb{R}^j$, and (2) for any $X_1 \neq X_2$, the GNN encoder of $f_\psi$ maps $G_1 = (X_1, A, E)$ and $G_2 = (X_2, A, E)$ to different embeddings, then with $\hat{y} = f_\psi(X, A, E)$, $\hat{y} \perp X_{var}$ as $n_A \to \infty$.

The proof is provided in Appendix G. Theorem 4.2 states that, given sufficient diversity in environment information and expressiveness of the GNN, FeatX can achieve invariant prediction regarding the selected variant features. Therefore, FeatX possesses the capability to generalize over distribution shifts on the selected variant features. Further, depending on the accuracy of non-causal selection, if $x_{var*} = x_{var}$, we achieve causally-invariant prediction in feature-OOD tasks. A simplified sketch of the FeatX perturbation is given below.
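This NumPy sketch mimics the $X_A$ formula above under simplifying assumptions: the mask is given rather than learned via Eq. 7, the domain $\mathcal{D}$ is a per-feature box so the generalized modulo reduces to a simple wrap-around, a single $\lambda = \lambda'$ is drawn from a gamma distribution, and all names are our own.

```python
import numpy as np

def featx_augment(x1, x2_node, mask, lam, domain_low, domain_high):
    """FeatX-style node feature perturbation (sketch of the X_A formula).

    x1:        node features of the first sample, shape (n_nodes, p)
    x2_node:   feature vector of one node from the paired sample, shape (p,)
    mask:      boolean vector selecting the variant features x_var, shape (p,)
    lam:       extrapolation strength (lambda = lambda')
    domain_*:  per-feature bounds defining the valid domain D
    """
    extrapolated = (1.0 + lam) * x1 - lam * x2_node                 # linear extrapolation
    span = domain_high - domain_low
    folded = domain_low + np.mod(extrapolated - domain_low, span)   # keep X_A inside D
    return np.where(mask, folded, x1)                               # preserve invariant features

# Toy usage: 5 nodes, 4 features, the last two features treated as variant.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, size=(5, 4))
x2_node = rng.uniform(0, 1, size=4)
mask = np.array([False, False, True, True])
x_aug = featx_augment(x1, x2_node, mask, lam=rng.gamma(2.0, 1.0),
                      domain_low=np.zeros(4), domain_high=np.ones(4))
```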
## 5. Experimental Studies

We evaluate the effectiveness of our method on multiple graph OOD classification tasks.

Setup. For all experiments, we select the best checkpoints for OOD tests according to results on OOD validation sets; ID validation and ID test results are also used for comparison when available. For fair comparisons, we use unified GNN backbones for all methods on each dataset, specifically GIN-Virtual (Xu et al., 2019a; Gilmer et al., 2017) for graph-level tasks and GCN (Kipf & Welling, 2017) for node-level tasks. Experimental details and hyperparameters are provided in Appendix H. Computational complexity analysis is provided in Appendix B. ID validation and test results, as well as additional experiments and results, are provided in Appendix I.

Baselines. We compare our method with both OOD learning algorithms and graph data augmentation methods, as well as empirical risk minimization (ERM). OOD algorithms include IRM (Arjovsky et al., 2019), VREx (Krueger et al., 2021), DANN (Ganin et al., 2016), Deep Coral (Sun & Saenko, 2016), GroupDRO (Sagawa et al., 2019), and the graph OOD methods DIR (Wu et al., 2022b), EERM (Wu et al., 2022a), SRGNN (Zhu et al., 2021), GIL (Li et al., 2022), CIGA (Chen et al., 2022b), and LiSA (Yu et al., 2023). GDA methods include DropNode (Feng et al., 2020), DropEdge (Rong et al., 2019), Feature Masking (Thakoor et al., 2021), FLAG (Kong et al., 2022), Graph Mixup (Wang et al., 2021), LISA (Yao et al., 2022), and G-Mixup (Han et al., 2022). Note that DIR, DropNode, GIL, CIGA, and G-Mixup apply only to graph-level tasks, while EERM and SRGNN apply only to node-level tasks.
Table 1. Performances of 17 baselines and G-Splice on 8 datasets with structure shift. Values are out-of-distribution test performance; higher is better. All numerical results are averages across 3 random runs. The best result in each column is shown in bold.

| Method | GOOD-HIV size | GOOD-HIV scaffold | GOOD-SST2 length | Twitter length | GOOD-Motif size | GOOD-Motif base | DD size | NCI1 size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERM | 59.94 ± 2.86 | 69.58 ± 1.99 | 81.30 ± 0.35 | 57.04 ± 1.70 | 51.74 ± 2.27 | 68.66 ± 3.43 | 0.15 ± 0.11 | 0.16 ± 0.10 |
| IRM | 59.94 ± 1.59 | 67.97 ± 2.46 | 79.91 ± 1.97 | 57.72 ± 1.03 | 53.68 ± 4.11 | 70.65 ± 3.18 | 0.06 ± 0.08 | 0.11 ± 0.07 |
| VREx | 58.49 ± 2.22 | 70.77 ± 1.35 | 80.64 ± 0.35 | 56.37 ± 0.76 | 54.47 ± 3.42 | 71.47 ± 2.75 | 0.14 ± 0.10 | 0.22 ± 0.12 |
| GroupDRO | 58.98 ± 1.84 | 70.64 ± 1.72 | 79.21 ± 1.02 | 56.84 ± 0.63 | 51.95 ± 2.80 | 68.24 ± 1.94 | 0.02 ± 0.02 | 0.01 ± 0.04 |
| DANN | 62.38 ± 2.65 | 70.63 ± 1.82 | 79.71 ± 1.35 | 55.71 ± 1.23 | 51.46 ± 3.41 | 65.47 ± 5.35 | 0.12 ± 0.09 | 0.15 ± 0.07 |
| Deep Coral | 60.11 ± 3.53 | 68.61 ± 1.70 | 79.81 ± 0.22 | 56.14 ± 1.76 | 53.71 ± 2.75 | 68.88 ± 3.61 | 0.11 ± 0.15 | 0.09 ± 0.02 |
| DIR | 57.67 ± 1.43 | 67.47 ± 2.61 | 77.65 ± 1.93 | 56.81 ± 0.91 | 50.41 ± 5.66 | 61.50 ± 15.69 | 0.42 ± 0.03 | 0.15 ± 0.03 |
| GIL | 62.05 ± 1.55 | 66.18 ± 2.87 | 78.81 ± 1.35 | 55.46 ± 1.48 | 33.20 ± 2.87 | 38.60 ± 10.58 | 0.14 ± 0.02 | 0.01 ± 0.03 |
| CIGA | 62.56 ± 1.76 | 71.47 ± 1.29 | 81.20 ± 0.75 | 57.19 ± 1.15 | 45.36 ± 4.35 | 45.59 ± 6.44 | 0.30 ± 0.05 | 0.23 ± 0.06 |
| LiSA | 49.28 ± 2.89 | 69.71 ± 1.05 | 79.05 ± 1.09 | 56.55 ± 1.67 | 52.67 ± 2.77 | 52.33 ± 3.50 | 0.28 ± 0.03 | 0.02 ± 0.02 |
| DropNode | 58.52 ± 0.49 | 71.18 ± 1.16 | 81.14 ± 1.73 | 56.76 ± 0.24 | 54.14 ± 3.11 | 74.55 ± 5.56 | -0.02 ± 0.02 | 0.08 ± 0.06 |
| DropEdge | 59.01 ± 1.90 | 71.46 ± 1.63 | 78.93 ± 1.34 | 57.42 ± 0.48 | 45.42 ± 1.90 | 37.69 ± 1.05 | -0.02 ± 0.03 | 0.12 ± 0.05 |
| Mask Feature | 62.30 ± 3.17 | 65.90 ± 3.68 | 82.00 ± 0.73 | 57.67 ± 1.11 | 52.24 ± 3.75 | 64.98 ± 6.95 | -0.02 ± 0.02 | 0.03 ± 0.02 |
| FLAG | 60.84 ± 2.99 | 69.11 ± 0.83 | 77.05 ± 1.27 | 56.56 ± 0.56 | 50.85 ± 0.53 | 66.17 ± 2.87 | 0.02 ± 0.06 | 0.06 ± 0.02 |
| Graph Mixup | 59.03 ± 3.07 | 68.88 ± 2.40 | 80.88 ± 0.60 | 55.97 ± 1.67 | 51.48 ± 3.35 | 70.08 ± 2.06 | -0.08 ± 0.06 | -0.02 ± 0.06 |
| LISA | 61.50 ± 2.05 | 71.94 ± 1.31 | 80.67 ± 0.43 | 56.14 ± 0.88 | 53.68 ± 2.65 | 75.58 ± 0.75 | 0.20 ± 0.02 | 0.05 ± 0.03 |
| G-Mixup | 61.95 ± 3.15 | 70.13 ± 2.40 | 80.28 ± 1.49 | 56.05 ± 1.76 | 53.93 ± 3.03 | 49.27 ± 4.84 | -0.02 ± 0.02 | 0.07 ± 0.04 |
| G-Splice | 64.46 ± 1.38 | 72.82 ± 1.16 | 82.31 ± 0.59 | 58.02 ± 0.40 | **86.53 ± 2.66** | 79.86 ± 13.00 | **0.45 ± 0.09** | 0.37 ± 0.08 |
| G-Splice+R | **65.56 ± 0.34** | **73.28 ± 0.16** | **82.34 ± 0.24** | **58.34 ± 0.58** | 85.07 ± 4.50 | **83.96 ± 7.38** | **0.45 ± 0.04** | **0.40 ± 0.03** |

Table 2. OOD performances of all methods on FSMotif. The shifts are size-color and base-color.

| Method | FSMotif-S-C | FSMotif-B-C |
| --- | --- | --- |
| ERM | 35.00 ± 0.98 | 32.67 ± 2.83 |
| IRM | 33.00 ± 0.98 | 36.78 ± 5.67 |
| VREx | 35.33 ± 0.72 | 32.22 ± 2.20 |
| GroupDRO | 32.67 ± 0.82 | 32.22 ± 2.20 |
| DANN | 33.22 ± 0.42 | 31.89 ± 1.73 |
| Coral | 32.66 ± 0.94 | 30.89 ± 0.31 |
| DIR | 33.33 ± 2.16 | 32.11 ± 2.04 |
| GIL | 37.00 ± 0.62 | 23.89 ± 4.50 |
| DropNode | 30.44 ± 1.64 | 43.44 ± 9.75 |
| DropEdge | 32.11 ± 1.40 | 32.22 ± 2.20 |
| FLAG | 34.89 ± 1.23 | 30.67 ± 0.88 |
| G-Mixup | 31.56 ± 1.25 | 32.45 ± 2.28 |
| Mixup | 31.67 ± 1.41 | 34.78 ± 5.81 |
| MaskFeat | 33.78 ± 0.88 | 37.00 ± 8.96 |
| LISA | 33.89 ± 1.91 | 38.22 ± 10.68 |
| FeatX | 35.89 ± 4.69 | 40.83 ± 6.22 |
| G-Splice | 72.11 ± 6.18 | 50.67 ± 6.40 |
| G-Splice+FeatX | 75.11 ± 2.99 | 54.11 ± 11.64 |

### 5.1. OOD Performance on Structure Shifts

Datasets & Metrics. To evidence the generalization improvements of structure extrapolation, we evaluate G-Splice on 8 graph-level OOD datasets with structure shifts. We adopt 5 datasets from the GOOD benchmark (Gui et al., 2022a), HIV-size, HIV-scaffold, SST2-length, Motif-size, and Motif-base, where "-" denotes the shift domain. We construct another natural language dataset, Twitter-length (Yuan et al., 2020), following the OOD split of GOOD. Additionally, we adopt the protein dataset DD-size and the molecular dataset NCI1-size following Bevilacqua et al. (2021). All datasets possess structure shifts and thus serve as proper benchmarks for structural OOD generalization. For evaluation, we report ROC-AUC for GOOD-HIV, Matthews correlation coefficient (MCC) for DD and NCI1, and accuracy in percentage for all other datasets.

Results. OOD performances of G-Splice and all baselines under structure distribution shifts are shown in Table 1.
As can be observed, G-Splice consistently outperforms all other methods in OOD test results, showing effectiveness across various structural OOD tasks. On the synthetic dataset GOOD-Motif, G-Splice substantially outperforms most baselines, by 60% in accuracy for the size domain and 20% for the base domain, approaching ID performance, which evidences the generalization improvements achieved by our structural extrapolation. With a VREx-like regularization applied, G-Splice+R achieves further performance gains on most datasets, implying that the combined use of augmented environment information through both data extrapolation and invariance regularization is beneficial. Furthermore, in contrast to its OOD performance, G-Splice does not always perform best in ID settings. G-Splice also shows significant performance gains compared with other graph data augmentation methods. This reveals that G-Splice enhances generalization abilities through extrapolation strategies rather than overall progress in learning or simple data augmentation. By guiding the model to extrapolate with OOD samples, G-Splice extends the data distribution and improves generalization for specific structure shifts. As ablation studies, we evidence that certain extrapolation procedures specifically benefit size or base shifts, supporting our theoretical analysis; details are in Appendix J.

### 5.2. OOD Evaluation for Structure and Feature Extrapolation

Structural and feature linear extrapolation can be performed separately, targeting specific types of shifts, as well as combined to solve comprehensive shifts. We demonstrate the superiority of our methods when used concurrently on graph tasks with multiple shifts of structure and feature. Generally, existing OOD datasets construct splits based on selected single shifts. To evaluate on complex graph shifts covering both structure and feature, we create a new dataset, FSMotif. FSMotif is a synthetic dataset where each graph is generated by connecting a base graph and a motif, with the label determined solely by the motif and all nodes assigned color features. We design two complex shift domains: the base graph type combined with the color feature, and the graph size combined with the color feature. Dataset details are in Appendix H. We report the OOD test accuracy (with OOD validation) of all baselines, as well as G-Splice and FeatX applied both separately and concurrently, across 3 random runs in Table 2. As shown, the combination of G-Splice and FeatX performs evidently better than their separate use and significantly better than other baselines, with nearly doubled accuracy on FSMotif-S-C. This strongly evidences that our two strategies can be combined to extrapolate over comprehensive shifts, adding to the practicality of our proposed method. To further demonstrate the effectiveness of feature extrapolation in graph tasks, we provide comprehensive evaluations against other typical baselines in Appendix I.

## 6. Discussion and Conclusion

Our work introduces an innovative GDA approach that solves graph OOD generalization using linear extrapolation in graph space. Our environment-aware framework, featuring G-Splice and FeatX, improves over existing methods by generating causally-valid OOD samples that enhance model performance. Overall, our data-centric approach opens a new direction in graph OOD studies. Currently, our method depends on the quality of environment information and GNN expressiveness.
Also, our theoretical analysis focuses on linear extrapolation. Future work could explore optimizing environment information and incorporating non-linear extrapolation.

## Acknowledgments

This work was supported in part by National Science Foundation grants IIS-2006861 and IIS-1908220, Cisco Research, a Presidential Impact Fellowship, and institutional supports of Texas A&M University.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Ahuja, K., Caballero, E., Zhang, D., Gagnon-Audet, J.-C., Bengio, Y., Mitliagkas, I., and Rish, I. Invariance principle meets information bottleneck for out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:3438-3450, 2021.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Barrett, D., Hill, F., Santoro, A., Morcos, A., and Lillicrap, T. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pp. 511-520. PMLR, 2018.

Battaglia, P., Pascanu, R., Lai, M., Jimenez Rezende, D., et al. Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems, 29, 2016.

Bemis, G. W. and Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887-2893, 1996.

Bevilacqua, B., Zhou, Y., and Ribeiro, B. Size-invariant graph representations for graph classification extrapolations. In International Conference on Machine Learning, pp. 837-851. PMLR, 2021.

Buffelli, D., Liò, P., and Vandin, F. SizeShiftReg: a regularization method for improving size-generalization in graph neural networks. Advances in Neural Information Processing Systems, 35:31871-31885, 2022.

Chen, Y., Xiong, R., Ma, Z.-M., and Lan, Y. When does group invariant learning survive spurious correlations? Advances in Neural Information Processing Systems, 35:7038-7051, 2022a.

Chen, Y., Zhang, Y., Bian, Y., Yang, H., Ma, K., Xie, B., Liu, T., Han, B., and Cheng, J. Learning causally invariant representations for out-of-distribution generalization on graphs. In Advances in Neural Information Processing Systems, 2022b.

Chen, Y., Zhang, Y., Yang, H., Ma, K., Xie, B., Liu, T., Han, B., and Cheng, J. Invariance principle meets out-of-distribution generalization on graphs. arXiv preprint arXiv:2202.05441, 2022c.

Deshmukh, A. A., Lei, Y., Sharma, S., Dogan, U., Cutler, J. W., and Scott, C. A generalization error bound for multi-class domain generalization. arXiv preprint arXiv:1905.10392, 2019.

Duchi, J. C. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378-1406, 2021.

Fan, S., Wang, X., Mo, Y., Shi, C., and Tang, J. Debiasing graph neural networks via learning disentangled causal substructure. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=ex60CCi5GS.

Feng, W., Zhang, J., Dong, Y., Han, Y., Luan, H., Xu, Q., Yang, Q., Kharlamov, E., and Tang, J. Graph random neural networks for semi-supervised learning on graphs. Advances in Neural Information Processing Systems, 33:22092-22103, 2020.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263-1272. PMLR, 2017.

Gui, S., Wang, C., Chen, Q., and Tao, D. FeatureFlow: Robust video interpolation via structure-to-texture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14004-14013, 2020.

Gui, S., Li, X., Wang, L., and Ji, S. GOOD: A graph out-of-distribution benchmark. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022a.

Gui, S., Yuan, H., Wang, J., Lao, Q., Li, K., and Ji, S. FlowX: Towards explainable graph neural networks via message flows. arXiv preprint arXiv:2206.12987, 2022b.

Gui, S., Liu, M., Li, X., Luo, Y., and Ji, S. Joint learning of label and environment causal independence for graph out-of-distribution generalization. arXiv preprint arXiv:2306.01103, 2023.

Gui, S., Li, X., and Ji, S. Active test-time adaptation: Theoretical analyses and an algorithm. arXiv preprint arXiv:2404.05094, 2024.

Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.

Guo, H. and Mao, Y. Intrusion-free graph mixup. arXiv, abs/2110.09344, 2021.

Han, X., Jiang, Z., Liu, N., and Hu, X. G-Mixup: Graph data augmentation for graph classification. arXiv preprint arXiv:2202.07179, 2022.

Heckerman, D. A tutorial on learning with Bayesian networks. Springer, 1998.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840-6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118-22133, 2020.

Huang, Y., Peng, X., Ma, J., and Zhang, M. 3DLinker: An E(3) equivariant variational autoencoder for molecular linker design. arXiv preprint arXiv:2205.07309, 2022.

Igashov, I., Stärk, H., Vignac, C., Satorras, V. G., Frossard, P., Welling, M., Bronstein, M., and Correia, B. Equivariant 3D-conditional diffusion models for molecular linker design. arXiv preprint arXiv:2210.05274, 2022.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

Knyazev, B., Taylor, G. W., and Amer, M. Understanding attention and generalization in graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Kong, K., Li, G., Ding, M., Wu, Z., Zhu, C., Ghanem, B., Taylor, G., and Goldstein, T. Robust optimization as data augmentation for large-scale graphs.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 60-69, 2022.

Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Le Priol, R., and Courville, A. Out-of-distribution generalization via risk extrapolation (REx). In International Conference on Machine Learning, pp. 5815-5826. PMLR, 2021.

Kwon, K., Jeong, K., Park, S., Park, S., Lee, H., Kwak, S.-Y., Kim, S., and Cho, K. ExtraMix: Extrapolatable data augmentation for regression using generative models. 2022.

Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5542-5550, 2017.

Li, H., Zhang, Z., Wang, X., and Zhu, W. Learning invariant graph representations for out-of-distribution generalization. In Advances in Neural Information Processing Systems, 2022.

Li, X., Zhao, J., Zhang, W.-Q., Lv, Z., and Huang, S. Keyword search based on unsupervised pre-trained acoustic models. International Journal of Asian Language Processing, 31(03n04):2250005, 2021.

Lin, Y., Zhu, S., Tan, L., and Cui, P. ZIN: When and how to learn invariance without environment partition? Advances in Neural Information Processing Systems, 35:24529-24542, 2022.

Liu, G., Zhao, T., Xu, J., Luo, T., and Jiang, M. Graph rationalization with environment-based augmentations. arXiv preprint arXiv:2206.02886, 2022.

Liu, J., Hu, Z., Cui, P., Li, B., and Shen, Z. Heterogeneous risk minimization. In International Conference on Machine Learning, pp. 6804-6814. PMLR, 2021a.

Liu, M., Luo, Y., Wang, L., Xie, Y., Yuan, H., Gui, S., Yu, H., Xu, Z., Zhang, J., Liu, Y., et al. DIG: A turnkey library for diving into graph deep learning research. The Journal of Machine Learning Research, 22(1):10873-10881, 2021b.

Lu, C., Wu, Y., Hernández-Lobato, J. M., and Schölkopf, B. Invariant causal representation learning for out-of-distribution generalization. In International Conference on Learning Representations, 2021.

Miao, S., Liu, M., and Li, P. Interpretable and generalizable graph learning via stochastic attention mechanism. In International Conference on Machine Learning, pp. 15524-15543. PMLR, 2022.

Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., and Herrera, F. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521-530, 2012. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2011.06.019. URL https://www.sciencedirect.com/science/article/pii/S0031320311002901.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10-18. PMLR, 2013.

Park, J., Shim, H., and Yang, E. Graph Transplant: Node saliency-guided graph mixup with local structure preservation. In Proceedings of the First MiniCon Conference, 2022.

Pearl, J. Causality. Cambridge University Press, 2009.

Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947-1012, 2016.

Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. MIT Press, 2008.

Rong, Y., Huang, W., Xu, T., and Huang, J.
DropEdge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903, 2019.

Rosenfeld, E., Ravikumar, P., and Risteski, A. The risks of invariant risk minimization. arXiv preprint arXiv:2010.05761, 2020.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Sanchez-Gonzalez, A., Heess, N., Springenberg, J. T., Merel, J., Riedmiller, M., Hadsell, R., and Battaglia, P. Graph networks as learnable physics engines for inference and control. In International Conference on Machine Learning, pp. 4470-4479. PMLR, 2018.

Saxton, D., Grefenstette, E., Hill, F., and Kohli, P. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.

Shen, Z., Cui, P., Zhang, T., and Kuang, K. Stable learning via sample reweighting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5692-5699, 2020.

Shen, Z., Liu, J., He, Y., Zhang, X., Xu, R., Yu, H., and Cui, P. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624, 2021.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.

Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1-48, 2019.

Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. Advances in Neural Information Processing Systems, 28, 2015.

Sui, Y., Wang, X., Wu, J., Zhang, A., and He, X. Adversarial causal augmentation for graph covariate shift. arXiv preprint arXiv:2211.02843, 2022.

Sun, B. and Saenko, K. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pp. 443-450. Springer, 2016.

Sun, L., Huang, Y., Wang, H., Wu, S., Zhang, Q., Gao, C., Huang, Y., Lyu, W., Zhang, Y., Li, X., et al. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024.

Tang, H., Huang, Z., Gu, J., Lu, B.-L., and Su, H. Towards scale-invariant graph-related problem solving by iterative homogeneous GNNs. Advances in Neural Information Processing Systems, 33:15811-15822, 2020.

Thakoor, S., Tallec, C., Azar, M. G., Azabou, M., Dyer, E. L., Munos, R., Veličković, P., and Valko, M. Large-scale representation learning on graphs via bootstrapping. arXiv preprint arXiv:2102.06514, 2021.

Torrey, L. and Shavlik, J. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pp. 242-264. IGI Global, 2010.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.

Veličković, P., Ying, R., Padovano, M., Hadsell, R., and Blundell, C. Neural execution of graph algorithms. arXiv preprint arXiv:1910.10593, 2019.

Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., and Yu, P. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 2022.

Wang, M. and Deng, W. Deep visual domain adaptation: A survey.
Neurocomputing, 312:135-153, 2018.

Wang, Y., Wang, W., Liang, Y., Cai, Y., and Hooi, B. Mixup for node and graph classification. In Proceedings of the Web Conference 2021, pp. 3663-3674, 2021.

Weiss, K., Khoshgoftaar, T. M., and Wang, D. A survey of transfer learning. Journal of Big Data, 3(1):1-40, 2016.

Widmer, G. and Kubat, M. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69-101, 1996.

Wu, C.-H., Li, X., Rajesh, R., Ooi, W. T., and Hsu, C.-H. Dynamic 3D point cloud streaming: Distortion and concealment. In Proceedings of the 31st ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, pp. 98-105, 2021.

Wu, Q., Zhang, H., Yan, J., and Wipf, D. Handling distribution shifts on graphs: An invariance perspective. arXiv preprint arXiv:2202.02466, 2022a.

Wu, Y.-X., Wang, X., Zhang, A., He, X., and Chua, T.-S. Discovering invariant rationales for graph neural networks. In ICLR, 2022b.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=ryGs6iA5Km.

Xu, K., Li, J., Zhang, M., Du, S. S., Kawarabayashi, K.-i., and Jegelka, S. What can neural networks reason about? arXiv preprint arXiv:1905.13211, 2019b.

Xu, K., Zhang, M., Li, J., Du, S. S., Kawarabayashi, K.-i., and Jegelka, S. How neural networks extrapolate: From feedforward to graph neural networks. arXiv preprint arXiv:2009.11848, 2020.

Yang, C., Wu, Q., Wang, J., and Yan, J. Graph neural networks are inherently good generalizers: Insights by bridging GNNs and MLPs. arXiv preprint arXiv:2212.09034, 2022a.

Yang, N., Zeng, K., Wu, Q., Jia, X., and Yan, J. Learning substructure invariance for out-of-distribution molecular representations. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=2nWUNTnFijm.

Yao, H., Wang, Y., Li, S., Zhang, L., Liang, W., Zou, J., and Finn, C. Improving out-of-distribution robustness via selective augmentation. arXiv preprint arXiv:2201.00299, 2022.

Yehudai, G., Fetaya, E., Meirom, E., Chechik, G., and Maron, H. From local structures to size generalization in graph neural networks. In International Conference on Machine Learning, pp. 11975-11986. PMLR, 2021.

You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., and Shen, Y. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems, 33:5812-5823, 2020.

Yu, J., Liang, J., and He, R. Finding diverse and predictable subgraphs for graph domain generalization. arXiv preprint arXiv:2206.09345, 2022.

Yu, J., Liang, J., and He, R. Mind the label shift of augmentation-based graph OOD generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11620-11630, 2023.

Yuan, H., Yu, H., Gui, S., and Ji, S. Explainability in graph neural networks: A taxonomic survey. arXiv preprint arXiv:2012.15445, 2020.

Zeng, H., Zhou, H., Srivastava, A., Kannan, R., and Prasanna, V. GraphSAINT: Graph sampling based inductive learning method. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJe8pkHFwS.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Zhang, X., Zhou, L., Xu, R., Cui, P., Shen, Z., and Liu, H. NICO++: Towards better benchmarking for domain generalization. ar Xiv preprint ar Xiv:2204.08040, 2022. Zhang, X., Wang, L., Helwig, J., Luo, Y., Fu, C., Xie, Y., Liu, M., Lin, Y., Xu, Z., Yan, K., et al. Artificial intelligence for science in quantum, atomistic, and continuum systems. ar Xiv preprint ar Xiv:2307.08423, 2023. Zhu, Q., Ponomareva, N., Han, J., and Perozzi, B. Shiftrobust GNNs: Overcoming the limitations of localized graph training data. Advances in Neural Information Processing Systems, 34, 2021. Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43 76, 2020. Graph Structure Extrapolation for Out-of-Distribution Generalization A. Background and Related Works Out-of-Distribution (OOD) Generalization. Out-of-Distribution (OOD) Generalization (Shen et al., 2021; Duchi & Namkoong, 2021; Shen et al., 2020; Liu et al., 2021a) addresses the challenge of adapting a model, trained on one distribution (source), to effectively process unseen data from a potentially different distribution (target). It shares strong ties with various areas such as transfer learning (Weiss et al., 2016; Torrey & Shavlik, 2010; Zhang et al., 2023; Zhuang et al., 2020), domain adaptation (Wang & Deng, 2018), domain generalization (Wang et al., 2022), causality (Pearl, 2009; Peters et al., 2017), and invariant learning (Arjovsky et al., 2019). As a form of transfer learning, OOD generalization is especially challenging when the target distribution substantially differs from the source distribution. OOD generalization, also known as distribution or dataset shift (Quiñonero-Candela et al., 2008; Moreno-Torres et al., 2012), encapsulates several concepts including covariate shift (Shimodaira, 2000), concept shift (Widmer & Kubat, 1996), and prior shift (Quiñonero-Candela et al., 2008). Both Domain Adaptation (DA) and Domain Generalization (DG) can be viewed as specific instances of OOD, each with its own unique assumptions and challenges. Domain Generalization (DG). DG (Wang et al., 2022; Li et al., 2017; Muandet et al., 2013; Deshmukh et al., 2019; Gui et al., 2020) strives to predict samples from unseen domains without the need for pre-collected target samples, making it more practical than DA in many circumstances. However, generalizing without additional information is logically implausible, a conclusion also supported by the principles of causality (Pearl, 2009; Peters et al., 2017). As a result, contemporary DG methods have proposed the use of domain partitions (Ganin et al., 2016; Zhang et al., 2022) to generate models that are domain-invariant. Yet, due to the ambiguous definition of domain partitions, many DG methods lack robust theoretical underpinning. Causality & Invariant Learning. Causality (Peters et al., 2016; Pearl, 2009; Peters et al., 2017) and invariant learning (Arjovsky et al., 2019; Rosenfeld et al., 2020; Ahuja et al., 2021) provide a theoretical foundation for the above concepts, offering a framework to model various distribution shift scenarios as structural causal models (SCMs). SCMs, which bear resemblance to Bayesian networks (Heckerman, 1998), are underpinned by the assumption of independent causal mechanisms, a fundamental premise in causality. 
Intuitively, this supposition holds that causal correlations in SCMs are stable, independent mechanisms akin to unchanging physical laws, rendering these causal mechanisms generalizable. An assumption of a data-generating SCM equates to the presumption that data samples are generated through these universal mechanisms. Hence, constructing a model with generalization ability requires the model to approximate these invariant causal mechanisms. Given such a model, its performance is ensured when data obeys the underlying data generation assumption. Peters et al. (2016) initially proposed optimal predictors invariant across all environments (or interventions). Motivated by this work, Arjovsky et al. (2019) proposed framing this invariant prediction concept as an optimization process, considering one of the most popular data generation assumptions, PIIF. Consequently, numerous subsequent works (Rosenfeld et al., 2020; Ahuja et al., 2021; Chen et al., 2022a; Lu et al., 2021) referred to as invariant learning considered the initial intervention-based environments (Peters et al., 2016) as an environment variable in SCMs. When these environment variables are viewed as domain indicators, it becomes evident that this SCM also provides theoretical support for DG, thereby aligning many invariant works with the DG setting. Besides PIIF, many works have considered FIIF and anti-causal assumptions (Rosenfeld et al., 2020; Ahuja et al., 2021; Chen et al., 2022a), which makes these assumptions as popular basics of causal theoretical analyses. OOD generalization for graph. Extrapolating on non-Euclidean data has garnered increased attention, leading to a variety of applications (Sanchez-Gonzalez et al., 2018; Barrett et al., 2018; Saxton et al., 2019; Battaglia et al., 2016; Tang et al., 2020; Veliˇckovi c et al., 2019; Li et al., 2021; Xu et al., 2019b; Wu et al., 2021). Inspired by Xu et al. (2020), Yang et al. (2022a) proposed that GNNs intrinsically possess superior generalization capability. Several prior works (Knyazev et al., 2019; Yehudai et al., 2021; Bevilacqua et al., 2021) explored graph generalization in terms of graph sizes, with Bevilacqua et al. (2021) being the first to study this issue using causal models. Recently, causality modeling-based methods have been proposed for both graph-level tasks (Wu et al., 2022b; Miao et al., 2022; Chen et al., 2022b; Fan et al., 2022; Yang et al., 2022b) and node-level tasks (Wu et al., 2022a). To solve OOD problems in graph, DIR (Wu et al., 2022b) selects graph representations as causal rationales and conducts causal intervention to create multiple distributions. EERM (Wu et al., 2022a) generates environments with REINFORCE algorithm to maximize loss variance between environments while adversarially minimizing the loss. SRGNN (Zhu et al., 2021) aims at pushing biased training data to the given unbiased distribution, performed through central moment discrepancy and kernel matching. To improve interpretation and prediction, GSAT (Miao et al., 2022) learns task-relevant subgraphs by constraining information with stochasticity in attention weights. CIGA (Chen et al., 2022b) models the graph generation process and learns subgraphs to maximally preserve invariant intra-class information. GREA (Liu et al., 2022) performs rationale identification and environment replacement to augment Graph Structure Extrapolation for Out-of-Distribution Generalization virtual data examples. 
GIL (Li et al., 2022) proposes to identify invariant subgraphs and infer latent environment labels for variant subgraphs through joint learning. However, except for CIGA (Chen et al., 2022b), their data assumptions are less comprehensive compared to traditional OOD generalization. CIGA, while recognizing the importance of diverse data generation assumptions (SCMs), attempts to fill the gap through non-trivial extra assumptions without environment information. Additionally, environment inference methods have gained traction in graph tasks, including EERM (Wu et al., 2022a), MRL (Yang et al., 2022b), and GIL (Li et al., 2022). However, these methods face two undeniable challenges. First, their environment inference results require environment exploit methods for evaluation, but there are no such methods that perform adequately on graph tasks according to the synthetic dataset results in GOOD benchmark (Gui et al., 2022a). Second, environment inference is essentially a process of injecting human assumptions to generate environment partitions, but these assumptions are not well compared. Graph data augmentation for generalization. Some data augmentation methods, not limited to graph methods, empirically show improvements in OOD generalization tasks. Mixup (Zhang et al., 2017), which augments samples by interpolating two labeled training samples, is reported to benefit generalization. LISA (Yao et al., 2022) selectively interpolates intra-label or intra-domain samples to further improve OOD robustness. In the graph area, following Mixup, Graph Mixup (Wang et al., 2021) mixes the hidden representations in each GNN layer, while if Mixup (Guo & Mao, 2021) directly applies Mixup on the graph data instead of the latent space. Graph Transplant (Park et al., 2022) employs node saliency information to select a substructure from each graph as units to mix. G-Mixup (Han et al., 2022) interpolates the graph generator of each class and mixes on class-level to improve GNN robustness. DPS (Yu et al., 2022) extracts multiple label-invariant subgraphs with a set of subgraph generators to train an invariant GNN predictor. However, few works target OOD problems, and no prior work generates OOD samples that can provably generalize over graph distribution shifts. In contrast, we offer a graph augmentation method to extrapolate in structure and feature for OOD generalization. Technical comparisons with prior methods. We discuss in detail the technical differences between existing works and ours. DIR and GREA algorithms are much alike by design, identifying causal subgraphs and switching non-causal subgraphs between graphs. With this localized strategy, their augmented environments can only cover local base shifts, leaving the global structural extrapolation unexplored. EERM exclusively considers node-level tasks, and only performs edge addition/deletion to cover minor shifts on graph base. GDA methods GMixup and Graph Transplant provide no guarantee for solving OOD related tasks, and can not deal with global structure shifts such as size. Li SA (Yu et al., 2023) extracts multiple subgraphs and Adv CA (Sui et al., 2022) masks certain nodes/edges from given graphs to generate graph augmentations. Size Shift Reg (Buffelli et al., 2022) uses coarsening to extracts multiple subgraphs from given graphs, obtaining slightly smaller augmented graphs (80% or 90% of the original graph in actual implementation). 
These strategies result in augmented graphs that only contain smaller substructures, restricting their potential extrapolation to one instead of both distribution directions. In this case, a common test scenario where test graphs are larger than the training graphs is not covered. Mixup and Extra Mix (Kwon et al., 2022) apply strategies on feature levels without designs for graph structure. In contrast, we study non-Euclidean space extrapolation in a far more systematic way. Our method considers the completeness of achieving both feature and structural extrapolation, and further cover structural global/local extrapolation (or size/base shifts by example) in both distribution directions. This substantially sets the difference between our method and the existing works. Moreover, our novel theoretical contributions include proposing non-Euclidean space linear extrapolation with definitions, analyses, and guarantees. Considering techniques, our design of graph splice serves global extrapolation and avoids add-on nodes to preserve graph structures, divergent from linker design approaches for molecules (Huang et al., 2022; Igashov et al., 2022). In addition, we design subgraph extraction by label-environment-aware pair learning, a novel technique over previous studies. B. Computational Complexity Analysis We provide the theoretical analysis regarding the time and space computational complexity as follows. For computation, we generally use one NVIDIA Ge Force RTX 2080 Ti for each single experiment. The time complexity of our G-Splice is O((|V |2d + |V |d2)|B|), where |V | denotes the number of nodes, |B| is the batch size, and d is the dimensionality of the representations. The time complexity of our Feat X is O((|E|d + |V |d2)|B|), where |E| denotes the number of edges. Specifically, message-passing GNNs has a complexity of O((|E|d + |V |d2)|B|), which we adopt to instantiate our GNN components. For G-Splice, the time complexity of obtaining GNN representations is O((|E|d + |V |d2)|B|), and that of pair-wise similarity matching and bridge generation are both O(|V |2d|B|). Since O(|E|) O(|V |2), the overall time complexity of G-Splice is O((|E|d + |V |2d + |V |d2)|B|)=O((|V |2d + |V |d2)|B|). Graph Structure Extrapolation for Out-of-Distribution Generalization For Feat X, the time complexity of non-causal feature selection and feature extrapolation are both O(|V |d V |B|), where d V is the node feature dimension, which is far smaller than the time complexity of GNN representations. In comparison, the time complexity of most GNN-based graph representation methods are O(|E|d + |V |d2), including simple algorithms like ERM, IRM, and VRex implemented with GNNs. More complicated graph methods such as CIGA has time complexity O((|V |2d + |V |d2)|B|), which is also the case for G-Splice. Therefore, the time complexity of our proposed methods is on par with the existing methods. The space complexity of G-Splice and Feat X is O(|V |d L + |E|d), where L is the number of GNN layers. This is the space complexity of the message-passing GNNs we use, as well as the space complexity of most GNN-based methods. C. Technical Details We complete the technical details of causal and environmental subgraph extractions here. Let Gε1 and Gε2 be two graphs with the same label y1 but different environments ε1 and ε2. As we have discussed, Ginv should be the subgraph both graphs contain and have most in common. 
Since graph neural networks aggregate information of an ego graph, i.e., the local subgraph within the k-hop neighborhood of a node, into the embedding of that node through message passing, nodes with similar ego graphs should have similar embeddings. Therefore, in the node embedding space, nodes from $G_{\varepsilon_1}$ and $G_{\varepsilon_2}$ with similar representations should be part of $G_{inv}$. We encode both graphs into node embeddings with a GNN and calculate their weighted similarity matrix $S^w$, each element of which is the weighted cosine similarity of a pair of nodes from $G_{\varepsilon_1}$ and $G_{\varepsilon_2}$, i.e.,
$$S^w_{ij} = w \cdot S_c(z_i, z_j), \quad \text{for } v_i \in G_{\varepsilon_1},\ v_j \in G_{\varepsilon_2}, \qquad (8)$$
where $w$ is a trainable parameter and $S_c(\cdot, \cdot)$ is the cosine similarity. The scores in the weighted similarity matrix $S^w$ are used as probabilities to sample the causal subgraph from either $G_{\varepsilon_1}$ or $G_{\varepsilon_2}$. We use label-invariant, environment-variant graph pairs to pre-train a causal subgraph searching network, which is optimized so that the sampled causal subgraphs alone can predict the label $Y$. Similarly, we perform similarity matching for environment-invariant graph pairs to extract environmental subgraphs $G_{env}$ that are determined by the environment $E$. Graphs from the same environment should contain similar subgraphs, and we aim at extracting these environmental subgraphs. Environment-invariant, label-variant graph pairs are used to pre-train the environmental subgraph searching network: we calculate the weighted similarity scores from the embeddings, sample subgraphs with these probabilities, and optimize the network using the environment label $\varepsilon$ so that the sampled subgraphs predict the environment (see the illustrative sketch below).

D. Further Discussions

D.1. Connections between SLE and FLE

The connection between feature extrapolation and structural extrapolation is implicitly described in the "Graph structure and feature distribution shifts" part of Sec. 2 and in Sec. 3. Firstly, Linear Extrapolation, which constructs samples beyond the known range while maintaining the same direction and magnitude of known sample differences, is a central concept in our approach. Graph data are complex in that they contain features as well as topological structures. We propose to define Linear Extrapolation for both feature and structure (Sec. 3.2), and together they form the complete definition of Graph Linear Extrapolation. Though their definitions involve different variables, the formulas share the common mathematical form of linear extrapolation. Secondly, graph distribution shifts can happen on both features and structures, which possess different properties and should be handled separately. While feature extrapolation and structural extrapolation can be used to solve the respective shifts, when combined they can address complex feature-structure shifts. Therefore, they complement each other in a systematic solution of graph OOD problems. Thirdly, feature and structural extrapolation share similar logic in causal analyses and theoretical justification. Combined causal analyses are given in Sec. 3.1. In Sec. 3.3, we provide theoretical guarantees that structural linear extrapolation has the capability to generate OOD samples that are both plausible and diverse, while the justification of feature linear extrapolation is relatively straightforward and provided in Sec. 4.4.

D.2. Intuitive Explanations of SLE in G-Splice

Structural linear extrapolation defines the linear calculations, addition and subtraction, for graph structures, and enables linear extrapolation as Eq. 2.
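To make the pair-wise similarity matching of Appendix C concrete, the following is a minimal sketch of the weighted-similarity scoring of Eq. (8) together with a simple subgraph-selection step. The helper names, the dense tensors, and the top-k selection (used here in place of probabilistic sampling and the pre-trained searching network) are illustrative assumptions, not the authors' implementation; node embeddings are assumed to come from any GNN encoder.

```python
import torch

def weighted_similarity(z1: torch.Tensor, z2: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Weighted cosine similarity S^w_ij = w * cos(z_i, z_j) over all cross-graph
    node pairs, as in Eq. (8). z1: [n1, d], z2: [n2, d], w: scalar parameter."""
    z1n = torch.nn.functional.normalize(z1, dim=-1)
    z2n = torch.nn.functional.normalize(z2, dim=-1)
    return w * (z1n @ z2n.t())                       # [n1, n2]

def select_candidate_subgraph(z1, z2, w, keep_ratio=0.5):
    """Score each node of graph 1 by its best cross-graph match and keep the
    top-scoring nodes as a candidate causal subgraph (illustrative selection)."""
    sim = weighted_similarity(z1, z2, w)             # [n1, n2]
    scores = sim.max(dim=1).values                   # best match for every node of G_eps1
    probs = torch.sigmoid(scores)                    # map scores to (0, 1) sampling probabilities
    k = max(1, int(keep_ratio * z1.size(0)))
    keep = torch.topk(probs, k).indices              # node indices forming the candidate subgraph
    return keep, probs

if __name__ == "__main__":
    torch.manual_seed(0)
    z1, z2 = torch.randn(6, 16), torch.randn(5, 16)  # embeddings of a same-label, different-environment pair
    w = torch.nn.Parameter(torch.tensor(1.0))        # trainable weight of Eq. (8)
    keep, probs = select_candidate_subgraph(z1, z2, w)
    print(keep.tolist(), [round(p, 2) for p in probs.detach().tolist()])
```

In the actual method, these scores are treated as sampling probabilities and the searching network is pre-trained so that the sampled subgraphs predict the label (or, for environmental subgraphs, the environment); the top-k step above is only a stand-in for that learned sampling.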
Specifically, $\langle \mathbf{a}, \mathbf{G}\rangle$ denotes splicing multiple graphs together, while $\langle \mathbf{B},\, \mathbf{1}\mathbf{G}^\top - \mathbf{G}\mathbf{1}^\top\rangle_F$ denotes splicing multiple subtracted subgraphs. G-Splice closely follows this, and essentially implements graph "addition" and "subtraction". Addition and subtraction, defined as union and intersection operations on graph edge sets, essentially splice graphs and extract subgraphs. Intuitively, structural addition splices multiple graphs by generating edges among them (conditional generation). Structural subtraction extracts a causal (or environmental) subgraph by performing node-wise similarity matching between a pair of graphs with the same label (or environment) but different environments (or labels). Please refer to Appendix C for label-environment-aware pair learning details. With these two operations, G-Splice can generate OOD samples by extrapolating outside the training distribution. We reason in Sec. 3.3 that linear extrapolation can reach graph samples outside the distribution with respect to both size and base/scaffold, the two common graph structural distribution shifts. By splicing whole graphs and extracting single causal subgraphs, we create graphs larger and smaller than the training graphs, respectively. Also, we can construct graphs without a base/scaffold, and graphs with multiple bases/scaffolds, achieving extrapolation with new bases/scaffolds introduced.

D.3. Applicability of the Causal Additivity assumption

Our work builds upon the principle of Causal Additivity, a causal assumption widely applicable in graph classification tasks. This assumption can be verified through subject-matter logic for common natural language sentiment analysis datasets such as SST2 and Twitter, as well as the synthetic dataset GOOD-Motif, where labels are determined by certain structures. The spliced graph contains the combined causal structures; therefore, it forms a causally valid sample when given the mixed label of all component graphs. For molecule/protein datasets with chemical property tasks, the assumption is strongly underpinned by experimental results, as evidenced by the improved or comparable results even when whole samples are randomly combined in Appendix J.1.1. Although we acknowledge that our assumption may not encompass all cases, it does make headway in addressing a substantial class of problems. As graph OOD generalization is a complex issue in practice, different techniques are required for varying domains and problems. No single method can be expected to resolve all unknown cases, and our future work aims to expand the scope of tasks addressed.

D.4. Strengths of linear extrapolation over interpolation

The techniques are very different for graph interpolation and extrapolation. Graph interpolation methods like G-Mixup or Graph Mixup use the weighted sum of two graphs' features/representations as new samples. Our linear extrapolation method is distinct from those, containing structural addition and subtraction operations. Structural addition splices multiple graphs by generating edges among them, while structural subtraction extracts causal/environmental subgraphs by label-environment-aware pair learning. Graph interpolation and extrapolation also have different objectives. As previously described, we can extrapolate to graphs that are guaranteed to be larger/smaller/more complex/simpler, thus ensuring OOD behavior regarding global/local structures.
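As a concrete illustration of these two operators, the sketch below splices two toy graphs with a bridge edge (structural addition) and extracts an induced subgraph over a selected node set (structural subtraction). The edge-index layout and the hand-picked bridge are illustrative simplifications; in the actual method the bridges are generated by a conditional VAE (Appendix J.1) and the node set is selected by the pair learning of Appendix C.

```python
import torch

def splice(edge_index_1, num_nodes_1, edge_index_2, num_nodes_2, bridges):
    """Structural addition G1 + G2: union of the two edge sets (G2's nodes
    re-indexed) connected by generated bridge edges between the components."""
    shifted = edge_index_2 + num_nodes_1                      # re-index G2's nodes
    bridge = torch.tensor(bridges, dtype=torch.long).t()      # [2, num_bridges], pairs (i in G1, j in G2)
    bridge[1] += num_nodes_1
    edge_index = torch.cat([edge_index_1, shifted, bridge], dim=1)
    return edge_index, num_nodes_1 + num_nodes_2

def extract(edge_index, keep_nodes):
    """Structural subtraction: keep only edges whose endpoints both lie in the
    selected (e.g., causal) node set, i.e., the induced subgraph."""
    keep = torch.zeros(int(edge_index.max()) + 1, dtype=torch.bool)
    keep[keep_nodes] = True
    mask = keep[edge_index[0]] & keep[edge_index[1]]
    return edge_index[:, mask]

if __name__ == "__main__":
    # two toy graphs: a triangle and a 2-node path (undirected edges listed once)
    g1 = torch.tensor([[0, 1, 2], [1, 2, 0]])
    g2 = torch.tensor([[0], [1]])
    spliced, n = splice(g1, 3, g2, 2, bridges=[(2, 0)])       # bridge node 2 of G1 to node 0 of G2
    print(spliced.tolist(), n)                                # larger than either component graph
    print(extract(spliced, keep_nodes=[0, 1, 2]).tolist())    # back to (a subgraph of) G1
```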
In contrast, graph interpolation cannot generate graphs that are beyond certain structure boundaries, like the size range of training graphs. Data augmentation methods can introduce additional samples not covered by the training database to benefit model learning. Since we focus on tasks that are out-of-distribution instead of in-distribution, models are expected to extrapolate instead of interpolate to make predictions outside the training range. However, the distribution area where models cannot generalize to is also hardly reachable when generating augmentation samples using traditional interpolation techniques. Interpolation methods cannot provide any guarantees regarding solving graph distribution shifts. In contrast, theoretical and empirical analyses show that linear extrapolation can generalize over certain shifts. Specifically, it can be reasoned based on our theoretical studies. Theorem 4.2 establishes that feature linear extrapolation can achieve invariant prediction and generalize over distribution shifts regarding selected variant features under certain environment conditions. Theorem 3.6 establishes that structural linear extrapolation can create OOD samples covering at least two environments in opposite directions of the distribution, respecting size and base shifts each. Contrarily, in the case of interpolation, the constructions in the proofs would not hold, thus failing to build these guarantees. D.5. Method applicability when the distribution shift type is unknown Firstly, as we have evidenced, our two strategies for feature and structure can be combined to solve comprehensive shifts. Also, our proposed method is applicable when OOD knowledge of a dataset is not fully known, since we are able to choose the techniques as hyperparameter selection. This allows the framework to automatically decide on using either method separately or combined, thus covering OOD tasks with structural, feature or complex shifts. Secondly, although we use specific expressions of base and size shifts in theoretical studies of structural shifts, discussions for base and Graph Structure Extrapolation for Out-of-Distribution Generalization size extrapolation are actually applicable to general local and global structural extrapolation, respectively, in the field of graph. Therefore, G-Splice can cover both local and global structural shifts, making it applicable towards various unknown distribution shifts. One potential evidence is that it performs favorably on multiple real-world datasets, which inevitably contain natural and unknown distribution shifts. E. Broader Impacts Addressing out-of-distribution (OOD) generalization presents a formidable challenge, particularly in the realm of graph learning. This issue is acutely exacerbated when conducting scientific experiments becomes cost-prohibitive or impractical. In many real-world scenarios, data collection is confined to certain domains, yet extrapolating this knowledge to broader areas, where experiment conduction proves difficult, is crucial. In focusing on a data-centric approach to the OOD generalization problem, we pave the way for the integration of graph data augmentation with graph OOD, a strategy with substantial potential for broad societal and scientific impact. Our research adheres strictly to ethical guidelines and does not raise any ethical issues. It neither involves human subjects nor gives rise to potential negative social impacts or privacy and fairness issues. 
Furthermore, we foresee no potential for malicious or unintended usage of our work. Nonetheless, we acknowledge that all technological progress inherently carries risks. Consequently, we advocate for ongoing evaluation of the broader implications of our methodology across a range of contexts.

F. Limitations

Our work builds upon the principle of Causal Additivity, a causal assumption widely applicable in graph classification tasks. This assumption can be verified theoretically and experimentally for a variety of graph classification tasks. However, we acknowledge that our assumption may not encompass all cases. As graph OOD generalization is a complex issue in practice, different domains and problems may require varying techniques, and our method might not resolve all unknown cases. Our future work aims to expand the scope of tasks addressed. In addition, our work discusses shifts on both graph structure and feature. Considered separately, while G-Splice can solve structure shifts, it augments structural OOD samples, which creates an additional shift when facing feature-OOD situations. FeatX is in a similar situation, introducing extra shifts for structural OOD problems. When combined, G-Splice and FeatX can solve both types of shifts and are suitable for addressing complex OOD situations. As "all medicine has its side effects", their concurrent use would create extra shifts if the problem does not involve both types of shift. If an OOD dataset contains only one of the two shift types (structure or feature), the performance gain may be less than ideal when the two methods are used concurrently. However, this does not impair their applicability when OOD knowledge of the dataset is not fully known, since the choice of techniques can be made as part of hyperparameter selection. Moreover, our current work does not discuss link prediction, which is an important task in graph learning; our future work aims to expand the scope of tasks addressed. Finally, the proposed methods are designed to cover complex shifts of multiple types, so the hyperparameter selections, including the selection of techniques, require a certain amount of pre-computation, which sets prerequisites on computational resources.

G. Theoretical Proofs

This section presents comprehensive proofs for all the theorems mentioned in this paper, along with the derivation of key intermediate results and necessary discussions.

Theorem 3.6 Given an N-sample training dataset $\mathcal{D}\{G_{tr}\}$, its N-dimension structural linear extrapolation can generate sets $\mathcal{D}\{G_1\}$ and $\mathcal{D}\{G_2\}$ s.t. $(G_1)_{env} < (G_{tr})_{env} < (G_2)_{env}$ for all $G_{tr}$, $G_1$, $G_2$, where $<$ denotes smaller size for size extrapolation and lower base complexity for base extrapolation.

Proof. Considering size extrapolation, we prove that (1) the sets $\mathcal{D}\{G_1\}$ and $\mathcal{D}\{G_2\}$ contain graph sizes achievable by N-dimension structural linear extrapolation, and (2) $|X|_{G_1} < |X|_{G_{tr}} < |X|_{G_2}$ holds for all $G_{tr}$, $G_1$, $G_2$. For N-dimension structural linear extrapolation on training data $\mathcal{D}\{G_{tr}\}$, we have Eq. 2:
$$\sum_{i=1}^{N} a_i G_i + \sum_{i=1}^{N}\sum_{j=1}^{N} b_{ij}\,(G_j - G_i) = \langle \mathbf{a}, \mathbf{G}\rangle + \langle \mathbf{B},\, \mathbf{1}\mathbf{G}^\top - \mathbf{G}\mathbf{1}^\top\rangle_F.$$
Let the largest and smallest graphs $G_{ma}$ and $G_{mi}$ in $\mathcal{D}\{G_{tr}\}$ be indexed $i = ma$ and $i = mi$. We generate $\mathcal{D}\{G_2\}$ using Eq. 2 with the condition that $a_{ma} = 1$ and $\sum_{i=1}^{N} a_i \geq 2$. We generate $\mathcal{D}\{G_1\}$ with the condition that $\sum_{i=1}^{N} a_i = 0$, $\sum_{i=1}^{N}\sum_{j=1}^{N} b_{ij} = 1$ and $b_{(mi)j} = 1$. By Definition 3.4, $\mathcal{D}\{G_1\}$ and $\mathcal{D}\{G_2\}$ contain graph sizes achievable by N-dimension structural linear extrapolation. For $G_2 \in \mathcal{D}\{G_2\}$, since $a_{ma} = 1$ and $\sum_{i=1}^{N} a_i \geq 2$, $G_2$ contains multiple graphs spliced together; then we have
$$|X|_{G_2} > |X|_{G_{ma}} \geq |X|_{G_{tr}} \qquad (9)$$
for all $G_{tr} \in \mathcal{D}\{G_{tr}\}$. For $G_1 \in \mathcal{D}\{G_1\}$, since $\sum_{i=1}^{N} a_i = 0$, $\sum_{i=1}^{N}\sum_{j=1}^{N} b_{ij} = 1$ and $b_{(mi)j} = 1$, $G_1$ contains only a single subgraph extracted from $G_{mi}$ and another graph; then we have
$$|X|_{G_1} < |X|_{G_{mi}} \leq |X|_{G_{tr}} \qquad (10)$$
for all $G_{tr} \in \mathcal{D}\{G_{tr}\}$. Therefore, $|X|_{G_1} < |X|_{G_{tr}} < |X|_{G_2}$ holds for all $G_{tr}$, $G_1$, $G_2$.

Considering base extrapolation, we prove that (1) the sets $\mathcal{D}\{G_1\}$ and $\mathcal{D}\{G_2\}$ contain graph bases achievable by N-dimension structural linear extrapolation, and (2) $B_{G_1} < B_{G_{tr}} < B_{G_2}$ holds for all $G_{tr}$, $G_1$, $G_2$, where $B$ denotes the base graph and $<$ denotes a less complex graph base. Note that graph bases can be numerically indexed for ordering and comparison, e.g., via the Bemis-Murcko scaffold algorithm (Bemis & Murcko, 1996). For N-dimension structural linear extrapolation on training data $\mathcal{D}\{G_{tr}\}$, following Eq. 2, let the graphs with the most and least complex graph bases, $G_{mo}$ and $G_{le}$, in $\mathcal{D}\{G_{tr}\}$ be indexed $i = mo$ and $i = le$. We generate $\mathcal{D}\{G_2\}$ using Eq. 2 with the condition that $a_{mo} = 1$ and $\sum_{i=1}^{N} a_i \geq 2$. We generate $\mathcal{D}\{G_1\}$ with the condition that $\sum_{i=1}^{N} a_i = 0$, $\sum_{i=1}^{N}\sum_{j=1}^{N} b_{ij} = 1$ and $b_{(le)j} = 1$, with $(G_j - G_i)$ being a causal graph extraction. By Definition 3.5, $\mathcal{D}\{G_1\}$ and $\mathcal{D}\{G_2\}$ contain graph bases achievable by N-dimension structural linear extrapolation. For $G_2 \in \mathcal{D}\{G_2\}$, since $a_{mo} = 1$ and $\sum_{i=1}^{N} a_i \geq 2$, $G_2$ contains multiple graphs spliced together, including the most complex base; then we have
$$B_{G_2} > B_{G_{mo}} \geq B_{G_{tr}} \qquad (11)$$
for all $G_{tr} \in \mathcal{D}\{G_{tr}\}$, adding upon $B_{G_{mo}}$ to create more complex base graphs. For $G_1 \in \mathcal{D}\{G_1\}$, since $\sum_{i=1}^{N} a_i = 0$, $\sum_{i=1}^{N}\sum_{j=1}^{N} b_{ij} = 1$ and $b_{(le)j} = 1$ with $(G_j - G_i)$ being a causal graph extraction, $G_1$ contains only a single causal subgraph extracted from $G_{le}$; then we have
$$B_{G_1} < B_{G_{le}} \leq B_{G_{tr}} \qquad (12)$$
for all $G_{tr} \in \mathcal{D}\{G_{tr}\}$, essentially creating structural linear extrapolations containing no base graphs. Therefore, $B_{G_1} < B_{G_{tr}} < B_{G_2}$ holds for all $G_{tr}$, $G_1$, $G_2$. This completes the proof.

Theorem 3.7 Given an N-sample training dataset $\mathcal{D}\{G_{tr}\}$ and its true labeling function for the target classification task $f(G)$, if $\mathcal{D}\{G^N_{sle}\}$ is a graph set sampled from the N-dimension structural linear extrapolation of $\mathcal{D}\{G_{tr}\}$ and Assumption 3.3 holds, then for every $(G^N_{sle}, y) \in \mathcal{D}\{G^N_{sle}\}$, $y = f(G^N_{sle})$.

Proof. By Definition 3.2, for N-dimension structural linear extrapolation on training data $\mathcal{D}\{G_{tr}\}$, for $(G^N_{sle}, y)$ we have
$$G^N_{sle} = \sum_{i=1}^{N} a_i G_i + \sum_{i=1}^{N}\sum_{j=1}^{N} b_{ij}\,(G_j - G_i) = \langle \mathbf{a}, \mathbf{G}\rangle + \langle \mathbf{B},\, \mathbf{1}\mathbf{G}^\top - \mathbf{G}\mathbf{1}^\top\rangle_F,$$
and the label $y$ for $G^N_{sle}$ is
$$y = \Big(\sum_{i=1}^{N} a_i y_i + \sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}\, y_j\Big) \Big/ \Big(\sum_{i=1}^{N} a_i + \sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}\Big) = \big(\langle \mathbf{a}, \mathbf{y}\rangle + \langle \mathbf{C}\odot\mathbf{B},\, \mathbf{1}\mathbf{y}^\top\rangle_F\big) \big/ \big(\langle \mathbf{a}, \mathbf{1}\rangle + \langle \mathbf{C}, \mathbf{B}\rangle_F\big).$$
$\langle \mathbf{a}, \mathbf{G}\rangle$ splices $\sum_{i=1}^{N} a_i$ graphs together, and since $[G_1, \dots, G_N]$ are the N graphs from $\mathcal{D}\{G_{tr}\}$, each $G_i$ contains one and only one causal graph. Under the causal additivity of Assumption 3.3, given $G' = G_1 + G_2$, we have $f(G') = a y_1 + (1 - a) y_2$. With the fair approximation $a = 1 - a = 1/2$, we can feasibly obtain $f(G') = (y_1 + y_2)/2$. Recursively, for $\langle \mathbf{a}, \mathbf{G}\rangle$ we can derive
$$f(\langle \mathbf{a}, \mathbf{G}\rangle) = \Big(\sum_{i=1}^{N} a_i y_i\Big) \Big/ \Big(\sum_{i=1}^{N} a_i\Big). \qquad (13)$$
$\langle \mathbf{B},\, \mathbf{1}\mathbf{G}^\top - \mathbf{G}\mathbf{1}^\top\rangle_F$ splices $\sum_{i=1}^{N}\sum_{j=1}^{N} b_{ij}$ extracted subgraphs together. Among them, $\sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}$ are causal subgraphs, while the others are environmental subgraphs. Similarly, using the causal additivity of Assumption 3.3 in a recursive manner, we can derive
$$f\big(\langle \mathbf{B},\, \mathbf{1}\mathbf{G}^\top - \mathbf{G}\mathbf{1}^\top\rangle_F\big) = \Big(\sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}\, y_j\Big) \Big/ \Big(\sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}\Big). \qquad (14)$$
Combining the results of Eq. 13 and Eq. 14, and using Assumption 3.3 in a recursive manner, we can derive for $\langle \mathbf{a}, \mathbf{G}\rangle + \langle \mathbf{B},\, \mathbf{1}\mathbf{G}^\top - \mathbf{G}\mathbf{1}^\top\rangle_F$:
$$f\big(\langle \mathbf{a}, \mathbf{G}\rangle + \langle \mathbf{B},\, \mathbf{1}\mathbf{G}^\top - \mathbf{G}\mathbf{1}^\top\rangle_F\big) = \Big(\sum_{i=1}^{N} a_i y_i + \sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}\, y_j\Big) \Big/ \Big(\sum_{i=1}^{N} a_i + \sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}\Big). \qquad (15)$$
By Definition 3.2, we have
$$f(G^N_{sle}) = f\big(\langle \mathbf{a}, \mathbf{G}\rangle + \langle \mathbf{B},\, \mathbf{1}\mathbf{G}^\top - \mathbf{G}\mathbf{1}^\top\rangle_F\big) = \Big(\sum_{i=1}^{N} a_i y_i + \sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}\, y_j\Big) \Big/ \Big(\sum_{i=1}^{N} a_i + \sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} b_{ij}\Big) = y.$$
Therefore, for every $(G^N_{sle}, y) \in \mathcal{D}\{G^N_{sle}\}$, we have $y = f(G^N_{sle})$. This completes the proof.

Theorem 4.1 Given the causal graph (Figure 1) and assuming a bijective causal mapping between $C$ and $Y$, for two same-class, different-environment graphs $G = G_{y,\epsilon}$ and $G' = G_{y,\epsilon'}$, let $G_s \subseteq G$ and $G'_s \subseteq G'$ represent subgraphs of $G$ and $G'$. Let $I(\cdot, \cdot) \in [0, 1]$ be a similarity function in the graph hidden feature space and $f_z: \mathcal{G} \to \mathbb{R}^f$ be a feature mapping that reversely infers hidden features, e.g., $C$ and $S_1$. Then: (1) Given the invariant subgraphs $G_{inv}$ and $G'_{inv}$ of $G$ and $G'$, the similarity of the subgraphs defined in the corresponding graph feature space can be represented as $I(f_z(G_{inv}), f_z(G'_{inv}))$, and the value of this similarity reaches its maximum 1; (2) Given a subgraph set $\mathcal{G}_s = \{G_s \mid \exists\, G'_s,\ I(f_z(G_s), f_z(G'_s)) = 1\}$, the invariant subgraph $G_{inv}$ of $G$ can be obtained by optimizing the objective $G_{inv} = \arg\max_{G_s \in \mathcal{G}_s} |G_s|$.

Proof. The invariant subgraphs $G_{inv}$ in the same class $Y$ share the same features inferred by $f_z$ because of the bijective causal mapping assumption. This implies that the similarity of invariant subgraphs in the same class is able to reach the maximal value of 1. In contrast, subgraphs affected by $S_1$ and $S_2$ do not have this property, since $S_1$ and $S_2$ are noises that fluctuate frequently, leading to diverse $G_{s_1}$ and $G_{s_2}$ that are assumed not to be matched in different graphs. Concretely, this mismatching is caused by the differences in the feature dimensions corresponding to $S_1$ and $S_2$, which leads to similarities strictly lower than 1, i.e., for any $G_s$ containing a non-empty $G_n \subseteq G_{s_1} \cup G_{s_2}$, we have $\forall G'_s,\ I(f_z(G_s), f_z(G'_s)) < 1$. Note that our causal graph is the general case covering the common SCMs of covariate shift, FIIF, and PIIF assumptions when the latent variable $S_2$ is not considered.

Proof of (1): According to the aforementioned assumption, since $G$ and $G'$ have the same label, it follows that $f_z(G_{inv}) = f_z(G'_{inv})$, which directly results in $I(f_z(G_{inv}), f_z(G'_{inv})) = 1$.

Proof of (2): First, given the subgraph set $\mathcal{G}_s$, it follows that $\forall G_s \in \mathcal{G}_s,\ G_s \subseteq G_{inv}$. Otherwise, if $G_s$ contained subgraphs of $G_{env}$, $G_{s_1}$, or $G_{s_2}$ (Figure 1), the different $G_{env}$ caused by $\epsilon, \epsilon'$ and the fluctuations in $S_1, S_2$ would lead to $I(f_z(G_s), f_z(G'_s)) < 1$ as discussed above. Therefore, $I(f_z(G_s), f_z(G'_s)) = 1 \Rightarrow G_s \subseteq G_{inv}$. In addition, according to (1), $G_{inv}$ is also included in the set $\mathcal{G}_s$, since $I(f_z(G_{inv}), f_z(G'_{inv})) = 1$ satisfies the definition of $\mathcal{G}_s$. Therefore, combining "$\forall G_s \in \mathcal{G}_s,\ G_s \subseteq G_{inv}$" and "$G_{inv} \in \mathcal{G}_s$" reveals that $G_{inv}$ is the largest subgraph in $\mathcal{G}_s$, i.e., $G_{inv} = \arg\max_{G_s \in \mathcal{G}_s} |G_s|$.

Theorem 4.2 If (1) $(X^1, \dots, X^j) \sim P^{train}$ come from at least 2 environments, s.t. $(X^1_{var}, \dots, X^j_{var})$ span $\mathbb{R}^j$, and (2) for any $X^1 \neq X^2$, the GNN encoder of $f_\psi$ maps $G^1 = (X^1, A, E)$ and $G^2 = (X^2, A, E)$ to different embeddings, then with $\hat{y} = f_\psi(X, A, E)$, $\hat{y} \perp X_{var}$ as $n_A \to \infty$.

We theoretically prove the statements and Theorem 4.2 for FeatX. We propose to learn and apply a mask $M$ and perturb the non-causal node features to achieve extrapolation w.r.t. $x_{var}$, without altering the topological structure of the graph. Let the domain of $x$ be denoted as $D$, which is assumed to be accessible. Valid extrapolations must generate augmented samples with node features $X_A \in D$ while $X_A \notin P^{train}(X)$. Since $x$ is a vector, $D$ is also a vector, in which each element gives the domain of an element of $x$. We ensure the validity of extrapolation with the generalized modulo operation $\bmod$, which we define as
$$X \bmod D = X + i \cdot \mathrm{abs}(D), \quad \text{s.t. } X \bmod D \in D, \qquad (16)$$
where $i$ is any integer and $\mathrm{abs}(D)$ is the range length of $D$. Therefore, $\forall X \in \mathbb{R}^p$, $(X \bmod D) \in D$. Given each pair of samples $D_{\varepsilon_1}, D_{\varepsilon_2}$ with the same label $y$ but different environments $\varepsilon_1$ and $\varepsilon_2$, FeatX produces
$$X_A = M \odot \big((1 + \lambda) X_{\varepsilon_1} - \lambda' X_{\varepsilon_2}\big) \bmod D + (1 - M) \odot X_{\varepsilon_1}, \quad (A, E) = (A_{\varepsilon_1}, E_{\varepsilon_1}),$$
where $\lambda, \lambda' \sim N(a, b)$ are sampled for each data pair. During this process, the augmented samples form a new environment. We prove Theorem 4.2, showing that, under certain conditions, FeatX substantially solves feature shifts on the selected variant features for node-level tasks. The proof also evidences that our extrapolation spans the feature space outside $P^{train}(X)$ for $x_{var}$, transforming OOD areas into ID. Let $n_A$ be the number of samples FeatX generates and $f_\psi$ be the well-trained network with FeatX applied.

Proof. The condition is given that
$$(X^1, \dots, X^j) \sim P^{train}, \quad (X^1_{var}, \dots, X^j_{var}) \text{ span } \mathbb{R}^j. \qquad (17)$$
Therefore, by definition,
$$\forall u \in \mathbb{R}^j,\ \exists\, t = (t_1, t_2, \dots, t_j) \in \mathbb{R}^j \ \text{s.t.}\ u = t_1 X^1_{var} + \dots + t_j X^j_{var}. \qquad (18)$$
The operation to generate $X_A$ gives
$$X_A = M \odot \big((1 + \lambda) X_{\varepsilon_1} - \lambda' X_{\varepsilon_2}\big) \bmod D + (1 - M) \odot X_{\varepsilon_1}, \qquad (19)$$
so we have
$$X_{A,var} = \big((1 + \lambda) X_{\varepsilon_1, var} - \lambda' X_{\varepsilon_2, var}\big) \bmod D. \qquad (20)$$
Consider any $u$ with $t = (t_1, t_2, \dots, t_j)$, and $(X^1, \dots, X^j) \sim P^{train}$ from at least 2 environments. Without loss of generality, we assume that $X^1$ and $X^2$ are from different environments $\varepsilon_1$ and $\varepsilon_2$. With $n_A \to \infty$, there will exist an augmentation sampled between $X^1$ and $X^2$, and since $\lambda, \lambda' \sim N(a, b)$ range over $\mathbb{R}$, there exists $X^1_A$ with $1 + \lambda = t_1$ and $\lambda' = -t_2$ s.t.
$$X^1_{A,var} = \big((1 + \lambda) X^1_{var} - \lambda' X^2_{var}\big) \bmod D = (1 + \lambda) X^1_{var} - \lambda' X^2_{var} + n_1 \cdot \mathrm{abs}(D),$$
where $n_1$ is an integer. Equivalently,
$$X^1_{A,var} = t_1 X^1_{var} + t_2 X^2_{var} + n_1 \cdot \mathrm{abs}(D). \qquad (21)$$
The augmentation sample $X^1_A$ belongs to a new environment, thus a different environment from $X^1, \dots, X^j$. Similarly, with $n_A \to \infty$, there will exist an augmentation sampled between $X^1_A$ and $X^3$, i.e., $X^2_A$ with $\lambda = 0$ and $\lambda' = -t_3$ s.t.
$$X^2_{A,var} = \big((1 + \lambda) X^1_{A,var} - \lambda' X^3_{var}\big) \bmod D = (1 + \lambda) X^1_{A,var} - \lambda' X^3_{var} + n_2 \cdot \mathrm{abs}(D),$$
where $n_2$ is an integer. Equivalently,
$$X^2_{A,var} = X^1_{A,var} + t_3 X^3_{var} + n_2 \cdot \mathrm{abs}(D) = t_1 X^1_{var} + t_2 X^2_{var} + t_3 X^3_{var} + (n_1 + n_2) \cdot \mathrm{abs}(D). \qquad (22)$$
The augmentation sample $X^2_A$ also belongs to the new environment. Recursively, with $n_A \to \infty$, there will exist an augmentation $X^{j-1}_A$ s.t.
$$X^{j-1}_{A,var} = t_1 X^1_{var} + t_2 X^2_{var} + \dots + t_j X^j_{var} + (n_1 + n_2 + \dots + n_{j-1}) \cdot \mathrm{abs}(D). \qquad (23)$$
Since for $u$ we have $u = t_1 X^1_{var} + \dots + t_j X^j_{var}$, it follows that $X^{j-1}_{A,var} = u + (n_1 + n_2 + \dots + n_{j-1}) \cdot \mathrm{abs}(D)$. With $u \in \mathbb{R}^j$ and $X^{j-1}_{A,var} = \big((1 + \lambda) X^{j-2}_{A,var} - \lambda' X^j_{var}\big) \bmod D \in D$ by the definition of $\bmod D$, we have
$$(n_1 + n_2 + \dots + n_{j-1}) \cdot \mathrm{abs}(D) = 0 \quad \text{and} \quad X^{j-1}_{A,var} = u. \qquad (24)$$
Therefore, we prove that, with $n_A \to \infty$, for every $u \in \mathbb{R}^j$ there exists an augmentation sample $X^{j-1}_A$ s.t.
$$X^{j-1}_{A,var} = u. \qquad (25)$$
That is, the extrapolation strategy of FeatX spans the feature space of $x_{var}$. With the above result, every data point in the feature space of $x_{var}$ is reachable, and as $n_A \to \infty$, every data point of $x_{var}$ is reached at least once. Let a group of samples with selected and preserved causal features $X_{inv^*}$ be $M \odot X_{var} + (1 - M) \odot X_{inv^*}$, where $X_{var} \in \mathbb{R}^j$ is the value of the variant features $x_{var}$. Since for $X^1 \neq X^2$ the GNN encoder maps $G^1 = (X^1, A, E)$ and $G^2 = (X^2, A, E)$ to different embeddings, all different samples of the form $M \odot X_{var} + (1 - M) \odot X_{inv^*}$ are encoded into different embeddings, while all having the same label $y$. For the well-trained network $f_\psi$, the group of embeddings $Z \mid (M \odot X_{var} + (1 - M) \odot X_{inv^*})$ are all predicted into class $\hat{y} = y$. In this case,
$$\forall X_{var} \in \mathbb{R}^j, \quad \hat{y} = f_\psi(M \odot X_{var} + (1 - M) \odot X_{inv^*}, A, E) = y, \qquad (26)$$
therefore $\hat{y} \perp X_{var}$ as $n_A \to \infty$.
(27) This completes the proof. Theorem 4.2 states that, given sufficient diversity in environment information and expressiveness of GNN, Feat X can achieve invariant prediction regarding the selected variant features. Therefore, Feat X possesses the capability to generalize over distribution shifts on the selected variant features. Extending on the accuracy of non-causal selection, if xvar* = xvar, we achieve causally-invariant prediction in feature-based OOD tasks. Thus, Feat X possesses the potential to solve feature distribution shifts. H. Experimental Details We further describe experimental details in the following sections. Graph Structure Extrapolation for Out-of-Distribution Generalization H.1. Dataset Details To evidence the generalization improvements of structure extrapolation, we evaluate G-Splice on 8 graph-level OOD datasets with structure shifts. We adopt 5 datasets from the GOOD benchmark (Gui et al., 2022a), GOODHIV-size, GOODHIV-scaffold, GOODSST2-length, GOODMotif-size, and GOODMotif-base, using the covariate shift split from GOOD. GOOD-HIV is a real-world molecular dataset with shift domains scaffold and size. The first one is Bemis-Murcko scaffold (Bemis & Murcko, 1996) which is the two-dimensional structural base of a molecule. The second one is the number of nodes in a molecular graph. GOOD-SST2 is a real-world natural language sentimental analysis dataset with sentence lengths as domain, which is equivalent to the graph size. GOOD-Motif is a synthetic dataset specifically designed for structure shifts. Each graph is generated by connecting a base graph and a motif, with the label determined by the motif solely. The shift domains are the base graph type and the graph size. We construct another natural language dataset Twitter (Yuan et al., 2020) following the OOD splitting process of GOOD, with length as the shift domain. In addition, we adopt protein dataset DD and molecular dataset NCI1 following Bevilacqua et al. (2021), both with size as the shift domain. All datasets possess structure shifts as we have discussed, thus proper benchmarks for structural OOD generalization. To show the OOD the generalization improvements of feature extrapolation, we evaluate Feat X on 5 graph OOD datasets with feature shifts. We adopt 5 datasets of the covariate shift split from the GOOD benchmark. GOOD-CMNIST is a semi-artificial dataset designed for node feature shifts. It contains image-transformed graphs with color features manually applied, thus the shift domain color is structure-irrelevant. The other 4 datasets are node-level. GOOD-Cora is a citation network dataset with word" shift, referring to the word diversity feature of a node. The input is a small-scale citation network graph, in which nodes represent scientific publications and edges are citation links. The shift domain is word, the word diversity defined by the selected-word-count of a publication. GOOD-Twitch is a gamer network dataset, with the node feature language" as shift domain. The nodes represent gamers and the edge represents the friendship connection of gamers. The binary classification task is to predict whether a user streams mature content. The shift domain of GOOD-Twitch is user language. GOOD-Web KB is a university webpage network dataset. A node in the network represents a webpage, with words appearing in the webpage as node features. Its 5-class prediction task is to predict the owner occupation of webpages, and the shift domain is university, which is implied in the node features. 
GOOD-CBAS is a synthetic dataset. The input is a graph created by attaching 80 house-like motifs to a 300-node Barabási Albert base graph, and the task is to predict the role of nodes. It includes colored features as in GOOD-CMNIST so that OOD algorithms need to tackle node color differences, which is also typical as feature shift. All shift domains are structure-irrelevant and provide specific evaluation for feature extrapolation. Following prior works (Wu et al., 2022b; Gui et al., 2022a), we also create another synthetic dataset FSMotif. The GOOD benchmark we use for major evaluation in the paper does not contain OOD datasets with shifts on both structure and feature, which cannot provide evaluation for the combined effectiveness of FLE and SLE. We create FSMotif, with complex shifts on both structure and feature, to prove the superiority of our methods when used concurrently. FSMotif is a synthetic dataset where each graph is generated by connecting a base graph and a motif, with the label determined by the motif solely and all nodes given color features. The shift domains are 1.the base graph type and the color feature, and 2.the graph size and the color feature. Specifically, we generate graphs using seven colors, five label irrelevant base graphs (wheel, tree, ladder, star, and path), and three label determining motifs (house, cycle, and crane). H.2. Setup Details We conduct experiments on 8 datasets with 17 baseline methods to evaluate G-Splice, and on 5 datasets with 16 baselines for Feat X. As a common evaluation protocol, datasets for OOD tasks provides OOD validation/test sets (Gui et al., 2022a; Bevilacqua et al., 2021) to evaluate the model s OOD generalization abilities. Some datasets also provide ID validation/test sets for comparison (Gui et al., 2022a). For all experiments, we select the best checkpoints for OOD tests according to results on OOD validation sets; ID validation and ID test are also used for comparison if available. For graph prediction and node prediction tasks, we respectively select strong and commonly acknowledged GNN backbones. For each dataset, we use the same GNN backbone for all baseline methods for fair comparison. For graph prediction tasks, we use GIN-Virtual Node (Xu et al., 2019a; Gilmer et al., 2017) as the GNN backbone. As an exception, for GOOD-Motif we adopt GIN (Xu et al., 2019a) as the GNN backbone, since we observe from experiments that the global information provided by virtual nodes would interrupt the training process here. For node prediction tasks, we adopt Graph SAINT (Zeng et al., 2020) and use GCN (Kipf & Welling, 2017) as the GNN backbone. For all the experiments, we use the Adam optimizer, with a weight decay tuned from the set {0, 1e-2, 1e-3, 1e-4} and a dropout rate of 0.5. The number of convolutional layers in GNN models for each dataset is tuned from the set {3, 5}. We use mean global pooling and the RELU activation function, and the Graph Structure Extrapolation for Out-of-Distribution Generalization dimension of the hidden layer is 300. We select the maximum number of epochs from {100, 200, 500}, the initial learning rate from {1e-3, 3e-3, 5e-3, 1e-4}, and the batch size from {32, 64, 128} for graph-level and {1024, 4096} for node-level tasks. All models are trained to converge in the training process. For computation, we generally use one NVIDIA Ge Force RTX 2080 Ti for each single experiment. H.3. 
Hyperparameter Selection In all experiments, we perform hyperparameter search to obtain experimental results that can well-reflect the performance potential of models. For each dataset and method, we search from a hyperparameter set and select the optimal one based on OOD validation metric scores. For each baseline method, we tune one or two algorithm-specific hyperparameters. For IRM and Deep Coral, we tune the weight for penalty loss from {1e-1, 1, 1e1, 1e2} and {1, 1e-1, 1e-2, 1e-3}, respectively. For VREx, we tune the weight for VREx s loss variance penalty from {1, 1e1, 1e2, 1e3}. For Group DRO, we tune the step size from {1e-1, 1e-2, 1e-3}. For DANN, we tune the weight for domain classification penalty loss from {1, 1e-1, 1e-2, 1e-3}. For Graph Mixup, we tune the alpha value of its Beta function from {0.4, 1, 2}. The Beta function is used to randomize the lamda weight, which is the weight for mixing two instances up. For DIR, we tune the causal ratio for selecting causal edges from {0.2, 0.4, 0.6, 0.8} and loss control from {1e1, 1, 1e-1, 1e-2}. For EERM, we tune the learning rate for reinforcement learning from {1e-2, 1e-3, 5e-3, 1e-4} and the beta value to trade off between mean and variance from {1, 2, 3}. For SRGNN, we tune the weight for shift-robust loss calculated by central moment discrepancy from {1e-4, 1e-5, 1e-6}. For Drop Node, Drop Edge and Mask Feature, we tune the drop/mask percentage rate from {0.05, 0.1, 0.15, 0.2, 0.3}. For FLAG, we set the number of ascending steps M = 3 and tune the ascent step size from the set {1e-2, 1e-3, 5e-3, 1e-4}. For LISA, we tune the parameters of the Beta function in the same way as Graph Mixup. For G-Mixup, we set the augmentation number to 10, tune the augmentation ratio from {0.1, 0.2, 0.3} and the lambda range from {[0.1,0.2], [0.2,0.3]}. For GIL, we tune IGA lambda value from {1e-2, 1e-3, 1e-4} and set top ratio of subgraphs and number of environments by its originally reported optimum. For CIGA, we tune the size ratio of the causal subgraphs from {0.4, 0.6, 0.8}, while contrastive loss and hinge loss weights from {0.5, 1, 2}. For G-Splice, we tune the percentage of augmentation from {0.6, 0.8, 1.0}. The actual number of component graphs f is tuned from {2, 3, 4}, and the augmentation selection is tuned as a 3-digit binary code representing the 3 options, with at least one option applied. For the pre-training of the bridge generation, hyperparameters regularizing the bridge attribute and KL divergence α and β are tuned from {1.5, 1, 0.5, 0.1}. When the additional VREx-like regularization is applied, we tune the weight of loss variance penalty from {1, 1e1, 1e2}. For Feat X, we tune the the shape parameter a and scale parameter b of the gamma function Γ(a, b) from {2, 3, 5, 7, 9} and {0.5, 1.0, 2.0}, respectively. I. Supplimentary Experiment First we provide additional results comparing ID and OOD performances of G-Splice and baseline methods as Table 3. Note that R means invariant regularization with G-Splice. Since we can generate add-on environments, to make adequate use of the augmented environment information for OOD generalization, we optionally apply an invariant regularization. The rationale behind incorporating it is to show our advantage as a data augmentation strategy, i.e., can be combined with invariant regularizations in parallel to further boost performance. To show the OOD effectiveness of feature extrapolation, we evaluate Feat X on 5 graph OOD datasets with feature shifts. 
We adopt 5 datasets from the GOOD benchmark, CMNIST-color, Cora-word, Twitch-language, Web KB-university, and CBAS-color, with more details in Appendix H. All shift domains are structure-irrelevant and provide specific evaluation on features. We report accuracy in percentage for all 5 datasets. OOD performances of Feat X and all baselines on feature shifts are shown in Table 4. We can observe that Feat X consistently outperforms all other methods in OOD test results, showing its effectiveness in various feature OOD tasks. On GOODWeb KB, Feat X substantially outperforms most baselines by 100% in accuracy. On synthetic dataset GOODCBAS, Feat X outperforms most baselines by 14% and achieves OOD results close to ID results, substantially solving the feature shift with feature extrapolation. Feat X does not always outperform in ID settings; also, Feat X shows significant performance gain compared with other graph data augmentation methods. This reveals that Feat X specifically enhances generalization abilities in feature rather than making overall progress in learning with simple data augmentation. Feat X succeeds in selecting non-causal features and lead the model to extrapolate with OOD samples spanning the selected feature space. Graph Structure Extrapolation for Out-of-Distribution Generalization Table 3. Performances of 16 baselines and G-Splice on 8 datasets with structure shift. indicates higher values correspond to better performance. IDID denotes ID test results with ID validations, while OODOOD denotes OOD test results with OOD validations. All numerical results are averages across 3+ random runs. Numbers in bold represent the best results and underline the second best. structure GOOD-HIV-size GOOD-HIV-scaffold GOOD-SST2-length Twitter-length GOOD-Motif-size GOOD-Motif-base DD-size NCI1-size IDID OODOOD IDID OODOOD IDID OODOOD IDID OODOOD IDID OODOOD IDID OODOOD OODOOD OODOOD ERM 83.72 1.06 59.94 2.86 82.79 1.10 69.58 1.99 89.82 0.01 81.30 0.35 65.70 1.21 57.04 1.70 92.28 0.10 51.74 2.27 92.60 0.03 68.66 3.43 0.15 0.11 0.16 0.10 IRM 81.33 1.13 59.94 1.59 81.35 0.83 67.97 2.46 89.41 0.11 79.91 1.97 64.02 0.56 57.72 1.03 92.18 0.09 53.68 4.11 92.60 0.02 70.65 3.18 0.06 0.08 0.11 0.07 VREx 83.47 1.11 58.49 2.22 82.11 1.48 70.77 1.35 89.51 0.03 80.64 0.35 65.34 1.70 56.37 0.76 92.25 0.08 54.47 3.42 92.60 0.03 71.47 2.75 0.14 0.10 0.22 0.12 Group DRO 83.79 0.68 58.98 1.84 82.60 1.25 70.64 1.72 89.59 0.09 79.21 1.02 65.10 1.04 56.84 0.63 92.29 0.09 51.95 2.80 92.61 0.03 68.24 1.94 0.02 0.02 0.01 0.04 DANN 83.90 0.68 62.38 2.65 81.18 1.37 70.63 1.82 89.60 0.19 79.71 1.35 65.22 1.48 55.71 1.23 92.23 0.08 51.46 3.41 92.60 0.03 65.47 5.35 0.12 0.09 0.15 0.07 Deep Coral 84.70 1.17 60.11 3.53 82.53 1.01 68.61 1.70 89.68 0.06 79.81 0.22 64.98 1.03 56.14 1.76 92.22 0.13 53.71 2.75 92.61 0.03 68.88 3.61 0.11 0.15 0.09 0.02 DIR 80.46 0.55 57.67 1.43 82.54 0.17 67.47 2.61 84.30 0.46 77.65 1.93 64.98 1.06 56.81 0.91 84.53 1.99 50.41 5.66 87.73 2.60 61.50 15.69 0.42 0.03 0.15 0.03 GIL 80.02 0.78 62.05 1.55 82.12 0.52 66.18 2.87 86.77 1.06 78.81 1.35 62.05 1.03 55.46 1.48 83.67 1.54 33.20 2.87 87.73 2.60 38.60 10.58 0.14 0.02 0.01 0.03 CIGA 81.65 1.85 62.56 1.76 81.76 0.35 71.47 1.29 89.00 0.15 81.20 0.75 64.10 1.67 57.19 1.15 90.33 0.97 45.36 4.35 87.73 2.60 45.59 6.44 0.30 0.05 0.23 0.06 Drop Node 84.09 0.36 58.52 0.49 83.55 1.07 71.18 1.16 90.19 0.21 81.14 1.73 64.56 0.17 56.76 0.24 91.22 0.11 54.14 3.11 92.41 0.09 74.55 5.56 -0.02 0.02 0.08 0.06 Drop Edge 83.73 0.41 59.01 1.90 81.63 0.92 71.46 1.63 
90.30 0.18 78.93 1.34 63.72 0.51 57.42 0.48 90.07 0.14 45.42 1.90 82.98 0.99 37.69 1.05 -0.02 0.03 0.12 0.05 Mask Feature 83.44 2.58 62.3 3.17 83.30 0.45 65.90 3.68 89.83 0.24 82.00 0.73 64.92 1.45 57.67 1.11 92.18 0.01 52.24 3.75 92.60 0.02 64.98 6.95 -0.02 0.02 0.03 0.02 FLAG 85.37 0.24 60.84 2.99 82.02 0.67 69.11 0.83 89.87 0.26 77.05 1.27 64.74 1.10 56.56 0.56 92.39 0.04 50.85 0.53 92.70 0.07 66.17 2.87 0.02 0.06 0.06 0.02 Graph Mixup 83.16 1.12 59.03 3.07 82.29 1.34 68.88 2.40 89.78 0.20 80.88 0.60 63.54 1.35 55.97 1.67 92.02 0.10 51.48 3.35 92.68 0.05 70.08 2.06 -0.08 0.06 -0.02 0.06 LISA 83.79 1.04 61.50 2.05 82.82 1.00 71.94 1.31 89.36 0.16 80.67 0.43 63.84 2.15 56.14 0.88 92.27 0.11 53.68 2.65 92.71 0.01 75.58 0.75 0.20 0.02 0.05 0.03 G-Mixup 84.21 1.53 61.95 3.15 82.83 0.53 70.13 2.40 89.75 0.17 80.28 1.49 65.10 1.90 56.05 1.76 92.19 0.07 53.93 3.03 92.60 0.00 49.27 4.84 -0.02 0.02 0.07 0.04 G-Splice 84.75 0.18 64.46 1.38 83.23 0.97 72.82 1.16 89.71 0.67 82.31 0.59 64.80 0.92 58.02 0.40 92.15 1.02 86.53 2.66 91.92 0.09 79.86 13.00 0.45 0.09 0.37 0.08 G-Splice+R 84.85 0.19 65.56 0.34 83.36 0.40 73.28 0.16 89.10 0.81 82.34 0.24 64.68 1.15 58.34 0.58 91.93 0.21 85.07 4.50 92.14 0.29 83.96 7.38 0.45 0.04 0.40 0.03 Table 4. Performances of 16 baselines and Feat X on 5 datasets with feature shift. GOODCMNIST is graph-level while others are node-level datasets. " denotes the method does not apply on the task. feature GOOD-CMNIST-color GOOD-Cora-word GOOD-Twitch-language GOOD-Web KB-university GOOD-CBAS-color IDID OODOOD IDID OODOOD IDID OODOOD IDID OODOOD IDID OODOOD ERM 77.96 0.34 28.60 2.01 70.43 0.47 64.86 0.38 70.66 0.17 48.95 3.19 38.25 0.68 14.29 3.24 89.29 3.16 76.00 3.00 IRM 77.92 0.30 27.83 1.84 70.27 0.33 64.77 0.36 69.75 0.80 47.21 0.98 39.34 2.04 13.49 0.75 91.00 1.28 76.00 3.39 VREx 77.98 0.32 28.48 2.08 70.47 0.40 64.80 0.28 70.66 0.18 48.99 3.20 39.34 1.34 14.29 3.24 91.14 2.72 77.14 1.43 Group DRO 77.98 0.38 29.07 2.62 70.41 0.46 64.72 0.34 70.84 0.51 47.20 0.44 39.89 1.57 17.20 0.76 90.86 2.92 76.14 1.78 DANN 78.00 0.43 29.14 2.93 70.66 0.36 64.77 0.42 70.67 0.18 48.98 3.22 39.89 1.03 15.08 0.37 90.14 3.16 77.57 2.86 Deep Coral 78.64 0.48 29.05 2.19 70.47 0.37 64.72 0.36 70.67 0.28 49.64 2.44 38.25 1.43 13.76 1.30 91.14 2.02 75.86 3.06 DIR 31.09 5.92 20.60 4.26 EERM 68.79 0.34 61.98 0.10 73.87 0.07 51.34 1.41 46.99 1.69 24.61 4.86 67.62 4.08 52.86 13.75 SRGNN 70.27 0.23 64.66 0.21 70.58 0.53 47.30 1.43 39.89 1.36 13.23 2.93 77.62 1.84 74.29 4.10 Drop Node 83.51 0.13 33.01 0.12 Drop Edge 79.51 0.22 26.83 0.81 71.03 0.15 65.55 0.29 70.66 0.10 47.94 3.42 38.79 0.77 14.28 2.34 84.29 1.16 77.62 3.37 Mask Feature 78.32 0.63 44.85 2.42 70.99 0.22 64.42 0.35 70.58 0.80 48.61 3.95 39.89 0.77 15.08 1.30 90.48 7.50 77.62 1.35 FLAG 79.05 0.41 37.74 7.88 70.59 0.11 64.95 0.41 70.54 0.62 45.66 1.17 37.70 4.01 12.70 2.33 91.90 2.69 81.43 1.17 Graph Mixup 77.40 0.22 26.47 1.73 71.54 0.63 65.23 0.56 71.30 0.14 52.27 0.78 54.65 3.41 17.46 1.94 73.57 8.72 70.57 7.41 LISA 76.75 0.46 29.63 2.82 70.15 0.20 64.96 0.17 71.09 0.53 45.55 0.55 37.70 0.00 16.40 2.62 94.29 1.16 83.34 1.35 G-Mixup 77.58 0.29 26.40 1.47 Feat X 69.54 1.51 62.49 2.12 70.39 0.36 66.12 0.54 70.94 0.37 52.76 0.23 50.82 0.00 32.54 8.98 92.86 1.17 87.62 2.43 J. Ablation Studies J.1. Bridge Generation Studies for G-Splice J.1.1. 
VAE AS BRIDGE GENERATOR In this work, we adopt conditional VAE (Kingma & Welling, 2013; Kipf & Welling, 2016; Sohn et al., 2015) as the major bridge generator for G-Splice due to its adequate capability and high efficiency. We show empirically that VAEs are well suitable for our task. We reconstruct the generation process with diffusion model (Ho et al., 2020), a generative model with high capability and favorable performances across multiple tasks. Diffusion models consist of a diffusion process which progressively distorts a data point to noise, and a generative denoising process which approximates the reverse of the diffusion process. In our case, the diffusion process adds Gaussian noise independently on each node and edge features encoded into one-hot vectors at each time step. Then the denoising network is trained to predict the noises, and we minimize the error between the predicted noise and the true noise computed in closed-form. During sampling, we iteratively sample bridge indexes and attribute values, and then map them back to categorical values in order that we obtain a valid graph. We compare performances and computational efficiency of the two generative models. As a baseline for bridge generation, we also present the results of random bridges, where bridges of predicted number and corresponding attributes are randomly sampled from the group of component graphs. Note that we do not apply the regularization in these experiments. As can be observed in Table 5, OOD test results from the two generative models are comparable, both significantly improving Graph Structure Extrapolation for Out-of-Distribution Generalization Table 5. Comparison on bridge generation methods. G-Splice-Rand, G-Splice-VAE, and G-Splice-Diff show the performance of G-Splice on GOODHIV with no bridge, random bridge, VAE generated bridge, diffusion model generated bridge, respectively. The train time ratio presents the entire training duration of a method, including module pre-training time, divided by the training duration of G-Splice-Rand in average. Method GOOD-HIV-size GOOD-HIV-scaffold Train time ratio IDID OODOOD IDID OODOOD ERM 83.72 1.06 59.94 2.86 82.79 1.10 69.58 1.99 0.57 G-Splice-No-Bridge 84.06 0.09 60.37 0.20 83.50 0.55 69.36 1.87 1 G-Splice-Rand 83.25 0.96 62.36 2.25 84.33 0.69 71.89 2.80 1 G-Splice-VAE 84.75 0.18 64.46 1.38 83.23 0.97 72.82 1.16 1.52 G-Splice-Diff 84.35 0.35 64.09 0.82 83.45 0.97 72.95 1.80 20.25 over G-Splice-Rand. The diffusion model may be slightly limited in performance gain due to the discreteness approximations during sampling. The results implies the necessity of generative models in the splicing operation for overall structural extrapolation. Meanwhile, this shows that VAE is capable of the bridge generation task. In contract, the training duration of diffusion model is 13 times that of VAE due to the sampling processes through massive time steps. Overall, we obtain comparable performances from the two generative models, while VAEs are much less expensive computationally. Therefore, empirical results demonstrate that adopting VAE as our major bridge generator is well suitable. J.1.2. BRIDGE GENERATION DESIGN As we have introduced in Sec 4.1, we generate bridges of predicted number along with corresponding edge attributes between given component graphs to splice graphs. We do not include new nodes as a part of the bridge, since we aim at preserving the local structures of the component graphs and extrapolating certain global features. 
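To ground the bridge-generation step discussed above, here is a minimal conditional-VAE-style sketch that scores candidate cross-graph node pairs as bridges, conditioned on the component graphs' pooled embeddings. The class name, layer sizes, and the restriction to edge-existence prediction (omitting edge-attribute and bridge-count prediction) are illustrative assumptions and not the authors' architecture.

```python
import torch
import torch.nn as nn

class BridgeCVAE(nn.Module):
    """Encodes a candidate cross-graph node pair (plus a graph-level condition) into a
    latent code and decodes the probability that a bridge edge connects the pair."""
    def __init__(self, d_node: int, d_cond: int, d_latent: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2 * d_node + d_cond, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, d_latent), nn.Linear(64, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent + d_cond, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z_i, z_j, cond):
        h = self.enc(torch.cat([z_i, z_j, cond], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        logit = self.dec(torch.cat([z, cond], dim=-1)).squeeze(-1)
        return logit, mu, logvar

def loss_fn(logit, target, mu, logvar, beta=0.5):
    """Reconstruction term (is this pair a bridge?) plus a beta-weighted KL to the prior."""
    rec = nn.functional.binary_cross_entropy_with_logits(logit, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl

if __name__ == "__main__":
    torch.manual_seed(0)
    model = BridgeCVAE(d_node=16, d_cond=32)
    z_i, z_j = torch.randn(10, 16), torch.randn(10, 16)           # candidate node pairs across G1/G2
    cond = torch.randn(10, 32)                                    # pooled embeddings of both component graphs
    target = torch.randint(0, 2, (10,)).float()                   # toy labels: which pairs carry a bridge
    logit, mu, logvar = model(z_i, z_j, cond)
    print(loss_fn(logit, target, mu, logvar).item())
```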
Additional manually added graph structures provide no extrapolation benefit, while their interpolation effects are not proven beneficial. We evidence the effectiveness of our design with experiments. We additionally build a module to generate nodes in the bridges: the number of nodes is predicted with a pre-trained predictive model, and a generative model then generates the node features. Moreover, we evaluate the results with fixed instead of generated bridge attributes. The performances with our original bridge generator, with node generation applied, and with edge attribute generation removed are summarized below. Note that we do not apply the regularization in these experiments.

Table 6. Comparison of bridge generation designs. G-Splice orig, G-Splice + node, and G-Splice - attr show the performance of G-Splice on GOOD-HIV with the original bridge generator, with node generation applied, and with edge attribute generation removed, respectively. For each dataset, values are ID-ID / OOD-OOD.

Method | GOOD-HIV-size | GOOD-HIV-scaffold
ERM | 83.72±1.06 / 59.94±2.86 | 82.79±1.10 / 69.58±1.99
G-Splice orig | 84.75±0.18 / 64.46±1.38 | 83.23±0.97 / 72.82±1.16
G-Splice + node | 83.14±0.82 / 62.65±2.67 | 84.67±0.48 / 71.76±1.76
G-Splice - attr | 84.50±0.44 / 64.13±0.62 | 83.41±1.10 / 72.07±1.52

As can be observed in Table 6, the OOD test performance of the original bridge generator remains the highest. Without attribute generation, fixed bridge attributes degrade the overall performance, since the manually set bridge features may mislead the model with spurious information. When we include nodes as part of the bridge, the manually added graph structure may similarly inject spurious information and perturb the preservation of local structures, leading to limited improvements. This evidences the effectiveness of our design for bridge generation.

J.2. Comparison of Extrapolation Procedures for G-Splice

We show that certain extrapolation procedures specifically benefit size or base/scaffold shifts, in line with our theoretical analysis in Sec. 4 and 4.4. For size and base/scaffold shifts on GOOD-HIV and GOOD-Motif, we extrapolate with each of the three augmentation options, $G_{inv}$, $G_{inv} + \widetilde{G}_{env}$, and $\widetilde{G}$, individually and together, and compare the OOD performances. Note that we apply the VREx-like regularization in these experiments.

Table 7. Comparison of extrapolation procedures for G-Splice. Performances of G-Splice on GOOD-HIV and GOOD-Motif with the augmentation options single causal subgraph ($G_{inv}$), causal and environmental subgraphs spliced ($G_{inv} + \widetilde{G}_{env}$), whole graphs spliced ($\widetilde{G}$), and all three options applied. "Optimal" shows the performance with options selected after hyperparameter tuning. For each dataset, values are ID-ID / OOD-OOD.

G-Splice | GOOD-HIV-size | GOOD-HIV-scaffold | GOOD-Motif-size | GOOD-Motif-base
$G_{inv}$ | 83.90±0.40 / 63.04±2.40 | 83.19±0.69 / 72.04±0.96 | 91.10±0.10 / 76.95±4.52 | 92.12±0.14 / 69.59±3.67
$G_{inv} + \widetilde{G}_{env}$ | 85.40±0.82 / 62.65±2.16 | 83.97±0.57 / 72.83±1.86 | 91.08±0.63 / 72.10±5.43 | 92.01±0.23 / 73.92±5.44
$\widetilde{G}$ | 84.75±0.18 / 63.94±1.46 | 85.10±0.67 / 73.14±1.05 | 91.93±0.21 / 85.07±4.50 | 92.12±0.15 / 76.19±10.99
All | 85.09±0.61 / 63.16±1.38 | 83.47±0.45 / 72.39±0.52 | 92.00±0.30 / 82.86±2.53 | 92.01±0.16 / 80.09±12.10
Optimal | 84.85±0.19 / 65.56±0.34 | 83.36±0.40 / 73.28±0.16 | 91.93±0.21 / 85.07±4.50 | 92.14±0.29 / 83.96±7.38
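For reference, the VREx-like regularization applied in these experiments follows the spirit of VREx (Krueger et al., 2021): it penalizes the variance of empirical risks across environments, e.g., the original and augmented environments. A minimal sketch, assuming the per-environment losses are computed elsewhere and with beta as an illustrative weight rather than the paper's setting, is:

```python
import torch

def vrex_objective(env_losses, beta=1.0):
    """VREx-style objective: mean risk plus the variance of risks across environments.

    env_losses: iterable of scalar loss tensors, one empirical risk per environment
        (e.g., original training environments and extrapolated ones).
    beta: weight on the variance penalty (illustrative value, not from the paper).
    """
    risks = torch.stack(list(env_losses))
    return risks.mean() + beta * risks.var()

# Toy usage with three per-environment risks.
total = vrex_objective([torch.tensor(0.52), torch.tensor(0.61), torch.tensor(0.48)], beta=10.0)
```

Driving the risk variance down encourages behavior that remains stable across the original and extrapolated environments.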
Let the three augmentation options, $G_{inv}$, $G_{inv} + \widetilde{G}_{env}$, and $\widetilde{G}$, be numbered 1, 2, and 3. The optimal augmentation options for GOOD-HIV-size, GOOD-HIV-scaffold, GOOD-Motif-size, and GOOD-Motif-base after hyperparameter tuning are 1+3, 2+3, 3, and 2+3, respectively. As can be observed from Table 7, $G_{inv}$ and $\widetilde{G}$ have advantages on size shifts, while $G_{inv} + \widetilde{G}_{env}$ and $\widetilde{G}$ are better for base/scaffold shifts. This matches our theoretical analysis of the augmentation procedures. For size distribution shifts, the $G_{inv}$ and $\widetilde{G}$ environments enable size extrapolation by creating smaller and larger graphs outside the training distribution, respectively. For base/scaffold distribution shifts, the two new environments respectively construct graphs without the base/scaffold and graphs with spliced base/scaffolds, achieving base extrapolation by introducing new base/scaffolds. Splicing whole graphs has the advantages of extrapolating to larger graphs, simplicity of operation, and little loss of local structural information. Extracting subgraphs gives G-Splice additional flexibility, making graphs smaller than the training sizes accessible. In addition, the performance gain from $\widetilde{G}$ alone shows the effectiveness of the simple splicing strategy by itself.

J.3. Ablation Studies on FeatX

FeatX enables extrapolation w.r.t. the selected variant features. By generating causally valid samples with OOD node features, FeatX essentially expands the training distribution range. Theoretical analysis evidences that our extrapolation spans the feature space outside $P^{train}(X)$ for $x_{var}$, thereby transforming OOD areas into ID ones. We further show experimentally that extrapolation substantially benefits feature-shift OOD tasks compared with interpolation, which can also improve generalization by boosting the learning process. In addition, we show that our invariance mask and variance score vectors succeed in selecting non-causal features, by comparing perturbation of the selected features with perturbation of all features.

Table 8. Performances with/without feature selection and extrapolation for FeatX. We compare interpolating or extrapolating the selected features or all features on GOOD-CMNIST, GOOD-WebKB, and GOOD-CBAS. For each dataset, values are ID-ID / OOD-OOD.

Feature / Perturbation | GOOD-CMNIST-color | GOOD-WebKB-university | GOOD-CBAS-color
ERM | 77.96±0.34 / 28.60±2.01 | 38.25±0.68 / 14.29±3.24 | 89.29±3.16 / 76.00±3.00
All / Interpolation | 76.62±0.46 / 29.61±2.54 | 37.56±3.67 / 15.59±2.48 | 92.00±0.15 / 81.76±1.75
All / Extrapolation | 75.14±0.38 / 54.13±5.08 | 38.26±5.18 / 17.10±3.45 | 93.25±0.43 / 84.28±3.67
Selected / Interpolation | 78.15±0.30 / 31.65±5.02 | 43.15±1.42 / 26.16±2.50 | 90.10±2.14 / 80.37±1.35
Selected / Extrapolation | 69.54±1.51 / 62.49±2.12 | 50.82±0.00 / 32.54±8.98 | 92.86±1.17 / 87.62±2.43

As can be observed from Table 8, both whether non-causal features are selected and whether interpolation or extrapolation is used significantly influence generalization performance. On all three datasets, extrapolation exceeds the corresponding interpolation by a clear gap, demonstrating the benefit of extrapolation: it generates samples in OOD areas that interpolation cannot reach. On GOOD-WebKB, perturbing the selected non-causal features achieves significant improvements over perturbing all features, for both interpolation and extrapolation. This evidences the effectiveness of non-causal feature selection using variance score vectors, empirically supporting our design. On GOOD-CMNIST and GOOD-CBAS, since the features are manually added colors, the effect of feature selection is less pronounced than on GOOD-WebKB, the real-world dataset.
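To make the interpolation/extrapolation distinction in Table 8 concrete, here is a minimal sketch. The mask var_mask, the reference sample X_ref, and the fixed mixing coefficients are illustrative stand-ins for the invariance mask, variance score vectors, and sampling scheme that FeatX actually learns and uses.

```python
import numpy as np

def perturb_features(X, X_ref, var_mask, lam):
    """Mix node features X toward reference features X_ref on selected dimensions.

    X, X_ref: (num_nodes, num_features) feature matrices from two samples.
    var_mask: boolean vector marking the selected non-causal (variant) dimensions;
        the remaining (causal) dimensions are left untouched.
    lam: mixing coefficient; lam in [0, 1] interpolates within the training
        distribution, while lam outside [0, 1] linearly extrapolates beyond it.
    """
    X_new = X.copy()
    X_new[:, var_mask] = lam * X[:, var_mask] + (1.0 - lam) * X_ref[:, var_mask]
    return X_new

rng = np.random.default_rng(0)
X, X_ref = rng.random((5, 4)), rng.random((5, 4))
var_mask = np.array([True, False, True, False])           # pretend dims 0 and 2 are non-causal
X_sel_interp = perturb_features(X, X_ref, var_mask, 0.5)  # "Selected, Interpolation"
X_sel_extrap = perturb_features(X, X_ref, var_mask, 1.8)  # "Selected, Extrapolation"
X_all_extrap = perturb_features(X, X_ref, np.ones(4, dtype=bool), 1.8)  # "All, Extrapolation"
```

A coefficient inside [0, 1] keeps the perturbed features within the convex hull of the two samples, whereas a coefficient outside [0, 1] pushes them beyond it, which is the extrapolation regime evaluated in Table 8.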
Experimental results evidence the effectiveness of the strategies designed in FeatX.

K. Metric Score and Loss Curves

We report the metric score curves and loss curves for part of the datasets in Figures 3-6. As can be observed from each pair of curves, our proposed methods, G-Splice and FeatX, consistently achieve better metric scores and lower losses than the other baselines during the learning process. This evidences the substantial improvements achieved by structure and feature extrapolation, which benefits OOD generalization in essence.

Figure 3. ROC-AUC score curve and loss curve for GOOD-HIV-size.
Figure 4. Accuracy score curve and loss curve for GOOD-Motif-size.
Figure 5. Accuracy score curve and loss curve for GOOD-CMNIST-color.
Figure 6. Accuracy score curve and loss curve for GOOD-CBAS-color.