Nonparametric Teaching for Graph Property Learners

Chen Zhang 1* Weixin Bu 2* Zeyi Ren 1 Zhengwu Liu 1 Yik-Chung Wu 1 Ngai Wong 1

Abstract

Inferring properties of graph-structured data, e.g., the solubility of molecules, essentially involves learning the implicit mapping from graphs to their properties. This learning process is often costly for graph property learners like Graph Convolutional Networks (GCNs). To address this, we propose a paradigm called Graph Neural Teaching (GraNT) that reinterprets the learning process through a novel nonparametric teaching perspective. Specifically, the latter offers a theoretical framework for teaching implicitly defined (i.e., nonparametric) mappings via example selection. Such an implicit mapping is realized by a dense set of graph-property pairs, with the GraNT teacher selecting a subset of them to promote faster convergence in GCN training. By analytically examining the impact of graph structure on parameter-based gradient descent during training, and recasting the evolution of GCNs shaped by parameter updates through functional gradient descent in nonparametric teaching, we show for the first time that teaching graph property learners (i.e., GCNs) is consistent with teaching structure-aware nonparametric learners. These new findings readily commit GraNT to enhancing the learning efficiency of the graph property learner, showing significant reductions in training time for graph-level regression (-36.62%), graph-level classification (-38.19%), node-level regression (-30.97%), and node-level classification (-47.30%), all while maintaining its generalization performance.

*Equal contribution. 1Department of Electrical and Electronic Engineering, The University of Hong Kong, HKSAR, China. 2Reversible Inc.

1. Introduction

Graph-structured data, commonly referred to as graphs, are typically represented by vertices and edges (Hamilton et al.,
2017; Chami et al., 2022).

Correspondence to: Chen Zhang, Ngai Wong. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1: An illustration of the implicit mapping $f$ between a graph $G$ and its property $f(G)$, where $f_0$ denotes the mapping of the initial graph property learner, e.g., an initialized GCN.

The vertices, or nodes, contain individual features, while the edges link these nodes and capture the structural information, collectively forming a complete graph. Graph properties can be categorized as either node-level or graph-level1. For example, the node category is a node-level property in social network graphs (Fan et al., 2019), while the solubility of molecules is a graph-level property in molecular graphs (Ramakrishnan et al., 2014). Inferring these graph properties essentially involves learning the implicit mapping from graphs to these properties (Hamilton et al., 2017). An intuitive illustration of this mapping is provided in Figure 1. As a representative graph property learner, the Graph Convolutional Network (GCN) (Defferrard et al., 2016; Kipf & Welling, 2017) has shown strong generalizability, delivering impressive performance across various fields such as social networks (Min et al., 2021; Li et al., 2023), quantum chemistry (Gilmer et al., 2017; Mansimov et al., 2019), and biology (Stark et al., 2006; Burkhart et al., 2023). However, the learning process of the implicit mapping, i.e., the training, can be quite expensive for GCNs, particularly when dealing with large-scale graphs (Liu et al., 2022a). For example, learning node-level properties in real-world e-commerce relational networks involves millions of

1This paper adopts a graph-level focus in its discussion unless otherwise specified, with node-level considered as a multi-dimensional generalization. Our project page is available at https://chen2hang.
github.io/_publications/nonparametric_teaching_for_graph_proerty_learners/grant.html.

nodes (Robinson et al., 2024). In the case of graph-level property learning tasks, the scale can become prohibitively large (Hu et al., 2021). As a result, there is a pressing need to reduce training costs and improve learning efficiency.

Recent studies on nonparametric teaching (Zhang et al., 2023b;a; 2024a) offer a promising solution to the above problem. Specifically, nonparametric teaching provides a theoretical framework for efficiently selecting examples when the target mapping (i.e., either a function or a model) being taught is nonparametric, i.e., implicitly defined. It builds on the idea of machine teaching (Zhu, 2015; Zhu et al., 2018), which involves designing a training set (dubbed the teaching set) to help the learner rapidly converge to the target function, but relaxes the assumption that target functions are parametric (Liu et al., 2017; 2018), allowing for the teaching of nonparametric (viz. non-closed-form) target functions, with a focus on function space. Unfortunately, these studies focus solely on regular feature data and overlook the structural aspects of the inputs, resulting in difficulty when the inputs are irregular graphs, universal data structures that include both features and structure (Chami et al., 2022). Moreover, the update of a GCN is generally carried out through gradient descent in parameter space, leading to a gap compared to the functional gradient descent used in nonparametric teaching within function space (Zhang et al., 2023b;a; 2024a). These call for more examination prior to the adoption of nonparametric teaching theory in the context of graph property learning. To this end, we systematically explore the impact of graph structure on GCN gradient-based training in both parameter and function spaces.
Specifically, we analytically examine the impact of the adjacency matrix, which encodes the graph structure, on parameter-based gradient descent within parameter space, and explicitly show that the parameter gradient maintains the same form when the graph size is scaled. The structure-aware update in parameter space drives the evolution of the GCN, which can be expressed using the dynamic graph neural tangent kernel (GNTK) (Du et al., 2019; Krishnagopal & Ruiz, 2023), and is then cast into function space. We prove that this dynamic GNTK converges to the structure-aware canonical kernel utilized in functional gradient descent, suggesting that the evolution of the GCN under parameter gradient descent is consistent with that under functional gradient descent. Therefore, it is natural to interpret the learning process of graph properties through the theoretical lens of nonparametric teaching: the target mapping is realized by a dense set of graph-property pairs, and the teacher selects a subset of these pairs to provide to the GCN, ensuring rapid convergence of this graph property learner. Consequently, to enhance the learning efficiency of the GCN, we introduce a novel paradigm called GraNT, where the teacher adopts a counterpart of the greedy teaching algorithm from nonparametric teaching for graph property learning, specifically by selecting graphs with the largest discrepancy between their true property values and the GCN outputs. Lastly, we conduct extensive experiments to validate the effectiveness of GraNT in a range of scenarios, covering both graph-level and node-level tasks.

Our key contributions are listed as follows:

- We propose GraNT, a novel paradigm that interprets graph property learning within the theoretical context of nonparametric teaching. This enables the use of greedy algorithms from the latter to effectively enhance the learning efficiency of the graph property learner, GCN.
- We analytically examine the impact of graph structure on parameter-based gradient descent within parameter space, and reveal the consistency between the evolution of the GCN driven by parameter updates and that under functional gradient descent in nonparametric teaching. We further show that the dynamic GNTK, stemming from gradient descent on the parameters, converges to the structure-aware canonical kernel of functional gradient descent. These connect nonparametric teaching theory to graph property learning, thus expanding the applicability of nonparametric teaching in this context.

- We demonstrate the effectiveness of GraNT through extensive experiments in graph property learning, covering regression and classification at both graph and node levels. Specifically, GraNT saves training time for graph-level regression (-36.62%), graph-level classification (-38.19%), node-level regression (-30.97%), and node-level classification (-47.30%), while upkeeping its generalization performance.

2. Related Works

Graph property learning. Due to the versatility of graphs in modeling diverse data types (Chami et al., 2022), there has been a recent surge of research interest in graphs (Xia et al., 2021), especially attempts to learn the implicit mapping from graph data to properties of interest (Guo et al., 2021; Zhuang et al., 2023; Cao et al., 2023) for diverse downstream tasks, such as those related to proteins (Fout et al., 2017; Gligorijević et al., 2021) and molecular fingerprints (Duvenaud et al., 2015).
There have been various efforts in learner design for better mapping learning, such as the GCN learner (Defferrard et al., 2016; Kipf & Welling, 2017), which borrows the idea of convolutional neural networks used in image tasks (LeCun et al., 2015), and the graph attention network, which applies the attention operation (Velickovic et al., 2018), as well as in learning efficiency (Chen et al., 2018; Liu et al., 2022b; Zhang et al., 2023c), such as normalization (Cai et al., 2021), graph decomposition (Xue et al., 2023), and lazy update (Narayanan et al., 2022). Differently, we approach graph property learning from a fresh perspective of nonparametric teaching (Zhang et al., 2023b;a) and adopt a corresponding version of the greedy algorithm to enhance the training efficiency of GCN.

Nonparametric teaching. Machine teaching (Zhu, 2015; Zhu et al., 2018) focuses on designing a teaching set that enables the learner to quickly converge to a target model function. It can be viewed as the reverse of machine learning: while machine learning seeks to learn a mapping from a given training set, machine teaching aims to construct the set based on a desired mapping. Its effectiveness has been demonstrated across various domains, including crowdsourcing (Singla et al., 2014; Zhou et al., 2018), robustness (Alfeld et al., 2017; Ma et al., 2019; Rakhsha et al., 2020), and computer vision (Wang et al., 2021; Wang & Vasconcelos, 2021). Nonparametric teaching (Zhang et al., 2023b;a) advances iterative machine teaching (Liu et al., 2017; 2018) by broadening the parameterized family of target mappings to a more general nonparametric framework.
In addition, the practical effectiveness of this theoretical framework has been confirmed in improving the efficiency of multilayer perceptrons (MLPs) when learning implicit mappings from signal coordinates to their corresponding values (Sitzmann et al., 2020; Tancik et al., 2020; Luo et al., 2023; 2024; 2025; Zhang et al., 2024a). Nevertheless, the limited focus on the structural aspects of the input in these studies makes it difficult to directly apply their findings to general tasks involving graph-structured data (Hamilton et al., 2017; Chami et al., 2022). This work systematically examines the impact of graph structure and highlights the alignment between the evolution of the GCN driven by parameter updates and that guided by functional gradient descent in nonparametric teaching. These insights, for the first time, broaden the scope of nonparametric teaching theory in graph property learning and position our GraNT as a means to improve the learning efficiency of GCN.

3. Background

Notation.2 Let $G(n) = (V, E)$ be a graph, where $V$ denotes the set of $n$ vertices (nodes) and $E$ denotes the set of edges. The $d$-dimensional feature vector is denoted as $[x_i]_d = (x_1, \dots, x_d)^\top \in \mathbb{R}^d$, where the entries $x_i$ are indexed by $i \in \mathbb{N}_d$ ($\mathbb{N}_d := \{1, \dots, d\}$). For simplicity, this feature vector may be denoted as $x$. The collection of feature vectors for all nodes is represented by an $n \times d$ feature matrix $X_{n \times d}$ (abbreviated as $X$). The $i$-th row and $j$-th column of this matrix, corresponding to the $i$-th node and $j$-th feature, are denoted by $X_{(i,:)}$ and $X_{(:,j)}$, respectively. Equivalently, these can be expressed as $e_i^\top X$ and $X e_j$, where $e_i$ is a basis vector with its $i$-th entry equal to 1 and all other entries equal to 0. The structure of the graph $G(n)$ is captured by its adjacency matrix $A \in \mathbb{R}^{n \times n}$, allowing the graph to be concisely represented as $G = (X, A) \in \mathcal{G}$.

2A notation table is provided in Appendix A.1.
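As a concrete toy instance of this notation (the values below are our own illustrative assumptions, not from the paper), the following numpy sketch builds a small graph $G = (X, A)$ and checks the row/column selections $e_i^\top X$ and $X e_j$:

```python
import numpy as np

# A four-node graph G = (X, A) with d = 3 features per node (toy values).
n, d = 4, 3
X = np.arange(n * d, dtype=float).reshape(n, d)   # n x d feature matrix
A = np.array([[0., 1., 0., 0.],                    # symmetric adjacency matrix
              [1., 0., 1., 1.],
              [0., 1., 0., 0.],
              [0., 1., 0., 0.]])

# X_(i,:) = e_i^T X selects the i-th node's features.
e1 = np.eye(n)[1]
node_features = e1 @ X             # same as X[1]

# X_(:,j) = X e_j selects the j-th feature across all nodes.
ej = np.eye(d)[0]
feature_column = X @ ej            # same as X[:, 0]
```

The basis-vector products are of course never materialized in practice; they are simply the linear-algebra reading of row and column indexing.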
The property of the graph is denoted by $y \in \mathcal{Y}$, where $y$ is a scalar for graph-level properties (i.e., $\mathcal{Y} \subseteq \mathbb{R}$) and a vector for node-level properties (i.e., $\mathcal{Y} \subseteq \mathbb{R}^n$). A set containing $m$ elements is written as $\{a_i\}_m$. If $\{a_i\}_m \subseteq \{a_i\}_n$ holds, then $\{a_i\}_m$ represents a subset of $\{a_i\}_n$ of size $m$, with indices $i \in \mathbb{N}_n$. A diagonal matrix with elements $a_1, \dots, a_m$ is denoted by $\mathrm{diag}(a_1, \dots, a_m)$, and if all $m$ entries are identical, it is simplified as $\mathrm{diag}(a; m)$. Consider $K(G, G'): \mathcal{G} \times \mathcal{G} \to \mathbb{R}$ as a symmetric and positive definite graph kernel (Vishwanathan et al., 2010). It can also be written as $K(G, G') = K_G(G') = K_{G'}(G)$, and for simplicity, $K_G(\cdot)$ can be abbreviated to $K_G$. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated with $K(G, G')$ is defined as the closure of the linear span $\{f : f(\cdot) = \sum_{i=1}^{r} a_i K(G_i, \cdot),\ a_i \in \mathbb{R},\ r \in \mathbb{N},\ G_i \in \mathcal{G}\}$, equipped with the inner product $\langle f, g \rangle_{\mathcal{H}} = \sum_{ij} a_i b_j K(G_i, G_j)$, where $g = \sum_j b_j K_{G_j}$ (Liu & Wang, 2016; Zhang et al., 2023b;a). Rather than assuming the ideal case with a closed-form solution $f^*$, we consider the more practical scenario where the realization of $f^*$ is given (Zhang et al., 2023b;a; 2024a). For simplicity, we assume the function is scalar-valued, aligning with the focus on the graph level in this discussion3. Given the target mapping $f^*: \mathcal{G} \to \mathcal{Y}$, it uniquely returns $y^*$ for the corresponding graph $G^*$ such that $y^* = f^*(G^*)$. With the Riesz-Frechet representation theorem (Lax, 2002; Scholkopf et al., 2002; Zhang et al., 2023b), the evaluation functional is defined as follows:

Definition 1. Let $\mathcal{H}$ be a reproducing kernel Hilbert space with a positive definite graph kernel $K_G \in \mathcal{H}$, where $G \in \mathcal{G}$. The evaluation functional $E_G(\cdot): \mathcal{H} \to \mathbb{R}$ is defined with the reproducing property as follows:

$$E_G(f) = \langle f, K_G(\cdot) \rangle_{\mathcal{H}} = f(G), \quad \forall f \in \mathcal{H}. \quad (1)$$

Additionally, for a functional $F: \mathcal{H} \to \mathbb{R}$, the Frechet derivative (Coleman, 2012; Liu, 2017; Zhang et al., 2023b) of $F$ is given as follows:

Definition 2.
(Frechet derivative in RKHS) The Frechet derivative of a functional $F: \mathcal{H} \to \mathbb{R}$ at $f \in \mathcal{H}$, denoted by $\nabla_f F(f)$, is implicitly defined by $F(f + \epsilon g) = F(f) + \langle \nabla_f F(f), \epsilon g \rangle_{\mathcal{H}} + o(\epsilon)$ for any $g \in \mathcal{H}$ and $\epsilon \in \mathbb{R}$. This derivative is also a function in $\mathcal{H}$.

Graph convolutional network (GCN) is proposed to learn the implicit mapping between graphs and their properties (Kipf & Welling, 2017; Xu et al., 2018). Specifically, an $L$-layer GCN $f_\theta(G) \equiv X^{(L)}$ resembles an $L$-layer MLP, with the key difference being the feature aggregation at the start of each layer, which is based on the adjacency matrix $A$ (Wu et al., 2019; Krishnagopal & Ruiz, 2023):

$$X^{(\ell)} = \sigma\!\left((A + I)^{\kappa_\ell} X^{(\ell-1)} W^{(\ell)}\right), \quad \ell \in \mathbb{N}_{L-1},$$
$$X^{(L)} = \rho\!\left((A + I)^{\kappa_L} X^{(L-1)}\right) W^{(L)}, \quad (2)$$

where $W^{(\ell)}$ is the weight matrix at layer $\ell$, with dimensions $h_{\ell-1} \times h_\ell$, $h_\ell$ denotes the width of layer $\ell$ with $h_0 = d$, and $\kappa_\ell$ denotes the convolutional order at the $\ell$-th layer. $X^{(0)}$ is the input feature matrix, $\sigma$ represents the activation function (e.g., ReLU), and $\rho$ refers to the pooling operation (e.g., $\rho(a) = \mathbf{1}^\top a$ for summation pooling).

3In nonparametric teaching, extending from scalar-valued functions to vector-valued ones, which pertains to node-level properties, is a well-established generalization (Zhang et al., 2023a).

Nonparametric teaching (Zhang et al., 2023b) is defined as a functional minimization over a teaching sequence $\mathcal{D} = \{(x_1, y_1), \dots, (x_T, y_T)\}$, where the input $x \in \mathbb{R}^d$ represents regular feature data without considering structure, with the set of all possible teaching sequences denoted as $\mathbb{D}$:

$$\mathcal{D}^* = \arg\min_{\mathcal{D} \in \mathbb{D}} \mathcal{M}(\hat{f}, f^*) + \lambda\, \mathrm{card}(\mathcal{D}) \quad \text{s.t.} \quad \hat{f} = \mathcal{A}(\mathcal{D}).$$
(3)

The formulation above involves three key components: $\mathcal{M}$, which quantifies the disagreement between $\hat{f}$ and $f^*$ (e.g., the $L^2$ distance in the RKHS, $\mathcal{M}(\hat{f}, f^*) = \|\hat{f} - f^*\|_{\mathcal{H}}$); $\mathrm{card}(\cdot)$, which represents the cardinality of the teaching sequence $\mathcal{D}$, regularized by a constant $\lambda$; and $\mathcal{A}(\mathcal{D})$, which refers to the learning algorithm used by the learners, typically employing empirical risk minimization:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \mathbb{E}_{x \sim P(x)} \left[ \mathcal{L}(f(x), f^*(x)) \right] \quad (4)$$

with a convex (w.r.t. $f$) loss $\mathcal{L}$, which is optimized using functional gradient descent:

$$f^{t+1} \leftarrow f^t - \eta\, \mathcal{G}(\mathcal{L}, f^*; f^t, x^t), \quad (5)$$

where $t = 0, 1, \dots, T$ is the time step, $\eta > 0$ represents the learning rate, and $\mathcal{G}$ denotes the functional gradient computed at time $t$. To derive the functional gradient, given by

$$\mathcal{G}(\mathcal{L}, f^*; f, x) = \mathbb{E}_x \left[ \frac{\partial \mathcal{L}(f(x), f^*(x))}{\partial f(x)}\, K(x, \cdot) \right], \quad (6)$$

Zhang et al., 2023b;a introduce the chain rule for functional gradients (Gelfand et al., 2000) (see Lemma 3) and use the Frechet derivative to compute the derivative of the evaluation functional in RKHS (Coleman, 2012) (cf. Lemma 4).

Lemma 3. (Chain rule for functional gradients) For differentiable functions $G(F): \mathbb{R} \to \mathbb{R}$ that depend on functionals $F(f): \mathcal{H} \to \mathbb{R}$, the expression

$$\nabla_f G(F(f)) = \frac{\partial G(F(f))}{\partial F(f)} \nabla_f F(f) \quad (7)$$

is typically referred to as the chain rule.

Lemma 4. The gradient of the evaluation functional at the feature $x$, defined as $E_x(f) = f(x): \mathcal{H} \to \mathbb{R}$, is given by $\nabla_f E_x(f) = K(x, \cdot)$, where $K(x, x'): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ represents a feature-based kernel.

We begin by analyzing the effect of the adjacency matrix on parameter-based gradient descent. Then, by translating the evolution of the GCN driven by structure-aware updates in parameter space into function space, we show that the evolution of the GCN under parameter gradient descent aligns with that under functional gradient descent. Lastly, we present the greedy GraNT algorithm, which efficiently selects graphs with steeper gradients to improve the learning efficiency of GCN.

4.1.
Structure-aware update in the parameter space

In GCNs, the structural information of graphs is captured through feature aggregation, expressed as $(A + I)^{\kappa} X$, as shown in Equation 2. The use of $(A + I)^{\kappa}$ limits the flexibility in learning aggregated hidden features, $\sigma((A + I)^{\kappa} X W)$, because it applies the same weights to features aggregated from different convolutional orders within a single layer. This paper considers more flexible GCNs, where the weights for features aggregated from different convolutional orders within a single layer are handled independently (Krishnagopal & Ruiz, 2023). Before presenting the detailed formulation, we introduce the concatenation operation $\oplus$ and define $A[\kappa] := \bigoplus_{i=0}^{\kappa-1} A^i = [I\ \ A\ \ \cdots\ \ A^{\kappa-1}]$, an $n \times \kappa n$ matrix. By unfolding the aggregated features at different orders and assigning them distinct weights (Krishnagopal & Ruiz, 2023), the flexible GCN can be expressed as

$$X^{(\ell)} = \sigma\!\left(A[\kappa_\ell]\, \mathrm{diag}(X^{(\ell-1)}; \kappa_\ell)\, W^{(\ell)}\right), \quad \ell \in \mathbb{N}_{L-1},$$
$$X^{(L)} = \rho\!\left(A[\kappa_L]\, \mathrm{diag}(X^{(L-1)}; \kappa_L)\right) W^{(L)}. \quad (8)$$

Here, the notations are consistent with those in Equation 2, with the exception that $W^{(\ell)}$ is the weight matrix of size $\kappa_\ell h_{\ell-1} \times h_\ell$. Figure 2 presents an example that illustrates the workflow of this flexible GCN.

Let the column vector $\theta \in \mathbb{R}^m$ represent the weights of all layers in a flattened form, where $m$ denotes the total number of parameters in the GCN. Given a training set of size $N$,
$\{(G_i, y_i)\,|\,G_i \in \mathcal{G}, y_i \in \mathcal{Y}\}_N$, the parameter update using gradient descent (Ruder, 2016) is expressed as follows:

$$\theta^{t+1} \leftarrow \theta^t - \frac{\eta}{N} \sum_{i=1}^{N} \nabla_\theta \mathcal{L}(f_{\theta^t}(G_i), y_i). \quad (9)$$

Due to the sufficiently small learning rate $\eta$, the updates are tiny over several iterations, which allows them to be treated as a time derivative and then converted into a differential equation (Du et al., 2019):

$$\frac{\partial \theta^t}{\partial t} = -\frac{\eta}{N} \sum_{i=1}^{N} \nabla_\theta \mathcal{L}(f_{\theta^t}(G_i), y_i). \quad (10)$$

The term $\frac{\partial f_\theta(G)}{\partial \theta}$ (with the indexes omitted for simplicity), which indicates the direction for updating the parameters, can be written more specifically as

$$\left[ \frac{\partial X^{(L)}}{\partial W^{(L)}}, \underbrace{\frac{\partial X^{(L)}}{\partial W^{(L-1)}_{(:,1)}}, \dots, \frac{\partial X^{(L)}}{\partial W^{(L-1)}_{(:,h_{L-1})}}}_{\text{w.r.t. the } (L-1)\text{-th layer}}, \dots, \underbrace{\frac{\partial X^{(L)}}{\partial W^{(1)}_{(:,1)}}, \dots, \frac{\partial X^{(L)}}{\partial W^{(1)}_{(:,h_1)}}}_{\text{w.r.t. the first layer}} \right]. \quad (11)$$

Here, each term represents the derivative of the output $X^{(L)}$ w.r.t. the weight column vectors. In contrast to derivatives with regular features as inputs, where the derivatives are independent across features, the adjacency matrix $A$ dictates feature aggregation in these derivatives in a batch manner, where each feature of a single node is treated individually (Du et al., 2019). To clearly demonstrate, in an analytical and explicit manner, how $A$ directs structure-aware updates in the parameter space, we present an example involving the derivative of a two-layer GCN with summation pooling:

$$\frac{\partial f_\theta}{\partial \theta} = \left[ \frac{\partial X^{(2)}}{\partial W^{(2)}}, \frac{\partial X^{(2)}}{\partial W^{(1)}_{(:,1)}}, \dots, \frac{\partial X^{(2)}}{\partial W^{(1)}_{(:,h_1)}} \right], \quad (12)$$

where the term $\frac{\partial X^{(2)}}{\partial W^{(2)}}$ is given by

$$\frac{1}{n} \mathbf{1}^\top \underbrace{A[\kappa_2]}_{n \times \kappa_2 n} \underbrace{\mathrm{diag}\!\left(\sigma\!\left(A[\kappa_1]\, \mathrm{diag}(X^{(0)}; \kappa_1)\, W^{(1)}\right); \kappa_2\right)}_{\kappa_2 n \times \kappa_2 h_1} \quad (\text{overall size: } 1 \times \kappa_2 h_1), \quad (13)$$

and for $i \in \mathbb{N}_{h_1}$, the term $\frac{\partial X^{(2)}}{\partial W^{(1)}_{(:,i)}}$ is

$$\frac{1}{n} \mathbf{1}^\top \underbrace{A[\kappa_2]}_{n \times \kappa_2 n} \underbrace{\begin{bmatrix} \bar{\sigma} \odot A[\kappa_1]\, \mathrm{diag}(X^{(0)}; \kappa_1)\; W^{(2)}_{(i - h_1 + h_1)} \\ \vdots \\ \bar{\sigma} \odot A[\kappa_1]\, \mathrm{diag}(X^{(0)}; \kappa_1)\; W^{(2)}_{(i - h_1 + \kappa_2 h_1)} \end{bmatrix}}_{\kappa_2 n \times \kappa_1 h_0} \quad (\text{overall size: } 1 \times \kappa_1 h_0), \quad (14)$$

Figure 2: A workflow illustration of a two-layer flexible GCN with a four-node graph $G$ as input.

with $\bar{\sigma} = \frac{\partial\, \sigma\!\left(A[\kappa_1]\, \mathrm{diag}(X^{(0)}; \kappa_1)\, W^{(1)}_{(:,i)}\right)}{\partial\, A[\kappa_1]\, \mathrm{diag}(X^{(0)}; \kappa_1)\, W^{(1)}_{(:,i)}}$.
Orange indicates the first layer, green denotes the second layer, and $\frac{1}{n}\mathbf{1}^\top$ in purple corresponds to the summation pooling4 (the colors refer to the typeset version of Equations 13 and 14). When using ReLU as the activation function, these expressions simplify further, since $\sigma'(z)\, z = \sigma(z)$ holds elementwise for ReLU. The derivation can be found in Appendix A.2.

When the convolutional order $\kappa$ is reduced to 1 for all layers, meaning structural information is excluded, the GCN gradient computed for a single input graph exactly matches that of the MLP when applied to a batch composed of the node features. This suggests that the structure-aware, parameter-based gradient is more general than the one used in MLPs, indicating that this work can be seen as a generalization of Zhang et al., 2024a. Furthermore, from the explicit expressions in Equations 13 and 14, it can be observed that the gradient of the GCN does not depend on the size of the input graph (i.e., the number of nodes). Instead, it depends on the feature dimension and convolutional order. In other words, the parameter gradient retains the same form even when the input graph size $n$ is scaled.

4.2. The functional evolution of GCN

The structure-aware update in the parameter space drives the functional evolution of $f_\theta \in \mathcal{H}$. This variation of $f_\theta$, which captures how $f_\theta$ changes in response to updates in $\theta$, can be derived using Taylor's theorem as follows:

$$f(\theta^{t+1}) - f(\theta^t) = \langle \nabla_\theta f(\theta^t), \theta^{t+1} - \theta^t \rangle + o(\theta^{t+1} - \theta^t), \quad (15)$$

where $f(\theta^t) \equiv f_{\theta^t}$.

4Without the pooling operation, it corresponds to the scenario where the graph property is considered at the node level.

In a manner similar to the transformation of parameter updates, it can be rewritten in a differential form (Zhang et al., 2024a):

$$\frac{\partial f(\theta^t)}{\partial t} = \left\langle \nabla_\theta f(\theta^t), \frac{\partial \theta^t}{\partial t} \right\rangle + o\!\left(\frac{\partial \theta^t}{\partial t}\right). \quad (16)$$

By plugging the specific parameter updates, i.e., Equation 10, into the first-order approximation term of this variation, we get

$$\frac{\partial f(\theta^t)}{\partial t} = -\frac{\eta}{N} \sum_{i=1}^{N} \frac{\partial \mathcal{L}(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\, K_{\theta^t}(G_i, \cdot) + o\!\left(\frac{\partial \theta^t}{\partial t}\right), \quad (17)$$

where the symmetric and positive definite $K_{\theta^t}(G_i, \cdot) := \left\langle \frac{\partial f_{\theta^t}(G_i)}{\partial \theta^t}, \frac{\partial f_{\theta^t}(\cdot)}{\partial \theta^t} \right\rangle$ (see the detailed derivation in Appendix A.3).
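To make the kernel $K_{\theta^t}(G_i, \cdot)$ concrete, here is a minimal numpy sketch that computes this inner product of parameter gradients for a hypothetical one-layer linear flexible GCN with summation pooling, $f(G) = \frac{1}{n}\mathbf{1}^\top A[\kappa]\,\mathrm{diag}(X; \kappa)\, w$. The function names and the linear, single-layer simplification are our own assumptions, not the paper's setup:

```python
import numpy as np

def a_kappa(A, kappa):
    # A[kappa] = [I | A | A^2 | ... | A^(kappa-1)], an n x (kappa*n) matrix.
    n = A.shape[0]
    blocks, P = [], np.eye(n)
    for _ in range(kappa):
        blocks.append(P)
        P = P @ A
    return np.hstack(blocks)

def param_grad(G, kappa):
    # For the linear model f(G) = (1/n) 1^T A[kappa] diag(X; kappa) w,
    # the gradient w.r.t. w is ((1/n) 1^T A[kappa] diag(X; kappa))^T,
    # independent of w; diag(X; kappa) is block-diagonal with kappa copies of X.
    A, X = G
    n = A.shape[0]
    Z = a_kappa(A, kappa) @ np.kron(np.eye(kappa), X)
    return (np.ones(n) / n) @ Z          # shape: (kappa * d,)

def empirical_gntk(G1, G2, kappa=2):
    # K_theta(G1, G2) = <df/dtheta(G1), df/dtheta(G2)>
    return float(param_grad(G1, kappa) @ param_grad(G2, kappa))
```

For this linear sketch the kernel is constant in $w$; with nonlinear activations the parameter gradients, and hence the kernel, change along the training trajectory, which is exactly the dynamic behavior discussed in the text.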
Due to the inclusion of nonlinear activation functions in $f(\theta)$, the nonlinearity of $f(\theta)$ with respect to $\theta$ causes the remainder $o(\theta^{t+1} - \theta^t)$ to be nonzero. In a subtle difference, Jacot et al., 2018; Du et al., 2019; and Krishnagopal & Ruiz, 2023 apply the chain rule directly, giving less attention to the convexity of $\mathcal{L}$ with respect to $\theta$. This leads to the first-order approximation being taken as the variation, with $K_\theta$ being referred to as the graph neural tangent kernel (GNTK). It has been shown that the GNTK stays constant during training when the GCN width is assumed to be infinite (Du et al., 2019; Krishnagopal & Ruiz, 2023). However, in practical applications, there is no need for the GCN width to be infinitely large, which leads us to investigate the dynamic GNTK (Figure 5 in Appendix A.3 provides an example of how the GNTK is computed).

Consider describing the variation of $f_\theta \in \mathcal{H}$ from a high-level, functional perspective (Zhang et al., 2024a). Using functional gradient descent, it can be expressed as:

$$\frac{\partial f_{\theta^t}}{\partial t} = -\eta\, \mathcal{G}(\mathcal{L}, f^*; f_{\theta^t}, \{G_i\}_N), \quad (18)$$

where the functional gradient is given by:

$$\mathcal{G}(\mathcal{L}, f^*; f_{\theta^t}, \{G_i\}_N) = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \mathcal{L}(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\, K(G_i, \cdot). \quad (19)$$

The asymptotic relationship between the GNTK and the structure-aware canonical kernel (Vishwanathan et al., 2010; Zhang et al., 2024a) in the context of the functional gradient is given in Theorem 5 below, with the proof in Appendix B.

Theorem 5. For a convex loss $\mathcal{L}$ and a given training set $\{(G_i, y_i)\,|\,G_i \in \mathcal{G}, y_i \in \mathcal{Y}\}_N$, the dynamic GNTK, derived from gradient descent on the parameters of a GCN, converges pointwise to the structure-aware canonical kernel in the dual functional gradient with respect to the input graphs. Specifically, the following holds:

$$\lim_{t \to \infty} K_{\theta^t}(G_i, \cdot) = K(G_i, \cdot), \quad \forall i \in \mathbb{N}_N.$$
(20)

This suggests that the GNTK, which incorporates structural information, serves as a dynamic alternative to the structure-aware canonical kernel in functional gradient descent with graph inputs, making the GCN's evolution through parameter gradient descent align with that in functional gradient descent (Kuk, 1995; Du et al., 2019; Geifman et al., 2020). This functional insight bridges the teaching of the graph property learner, GCN, with that of structure-aware nonparametric learners, while also simplifying further analysis (e.g., a convex functional $\mathcal{L}$ preserves its convexity with respect to $f_\theta$ from a functional perspective, but is typically nonconvex when considering $\theta$). By leveraging the functional insight and employing the canonical kernel (Dou & Liang, 2021) instead of the GNTK (which should be considered alongside the remainder), it aids in deriving a sufficient reduction concerning $\mathcal{L}$ in Proposition 6, with the proof deferred to Appendix B.

Proposition 6. (Sufficient Loss Reduction) Suppose the convex loss $\mathcal{L}$ is Lipschitz smooth with a constant $\tau > 0$, and the structure-aware canonical kernel is bounded above by a constant $\gamma > 0$. If the learning rate $\eta$ satisfies $\eta \leq 1/(2\tau\gamma)$, then a sufficient reduction in $\mathcal{L}$ is guaranteed, in the sense that the time derivative of $\frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(f_{\theta^t}(G_i), y_i)$ is bounded above by a strictly negative quantity. This demonstrates that the variation of $\mathcal{L}$ over time is capped by a negative value, meaning it decreases by at least the magnitude of this upper bound as time progresses, guaranteeing convergence.

4.3. GraNT algorithm

Building on the insights regarding the effect of the adjacency matrix, which captures the graph structure, on parameter-based gradient descent, as well as the consistency between teaching a GCN and a nonparametric learner, we introduce the GraNT algorithm. This algorithm seeks to increase the steepness of gradients to improve the learning efficiency of GCN.
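As a minimal sketch of this greedy idea (selecting the graphs whose current predictions deviate most from the target properties, as described in Section 1; the function name and toy values are illustrative, not from the paper's implementation):

```python
import numpy as np

def grant_select(preds, targets, m):
    """Return indices of the m graphs with the largest discrepancy
    |f_theta(G_i) - f*(G_i)| (graph-level case)."""
    gap = np.abs(np.asarray(preds) - np.asarray(targets))
    return np.argsort(gap)[::-1][:m]   # indices sorted by descending gap

# Example: with predictions [0.1, 2.0, 0.5] against targets [0, 0, 0],
# the two largest gaps belong to graphs 1 and 2.
picked = grant_select([0.1, 2.0, 0.5], [0.0, 0.0, 0.0], m=2)
```

Only the selected subset is then fed to the learner's gradient step, which is what steepens the effective gradient per unit of training cost.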
By interpreting the gradient as a sum of the projections of $\frac{\partial \mathcal{L}(f_\theta, f^*)}{\partial f_\theta}$ onto the basis $\{K(G_i, \cdot)\}_N$, increasing the gradient can be achieved by simply maximizing the projection coefficient $\frac{\partial \mathcal{L}(f_\theta(G_i), y_i)}{\partial f_\theta(G_i)}$, without the need to calculate the norm of the basis $\|K(G_i, \cdot)\|_{\mathcal{H}}$ (Wright, 2015; Zhang et al., 2024a). This suggests that selecting graphs that either maximize $\frac{\partial \mathcal{L}(f_\theta(G_i), y_i)}{\partial f_\theta(G_i)}$ or correspond to larger components of $\frac{\partial \mathcal{L}(f_\theta, f^*)}{\partial f_\theta}$ can effectively increase the gradient, which implies

$$\{G_i\}_m^* = \arg\max_{\{G_i\}_m \subseteq \{G_i\}_N} \left\| \left[ \frac{\partial \mathcal{L}(f_\theta(G_i), y_i)}{\partial f_\theta(G_i)} \right]_m \right\|_2. \quad (22)$$

Algorithm 1 GraNT Algorithm
Input: Target mapping $f^*$ realized by a dense set of graph-property pairs, initial GCN $f_{\theta^0}$, the size of the selected training set $m \leq N$, small constant $\epsilon > 0$, and maximal iteration number $T$.
Set $f_{\theta^t} \leftarrow f_{\theta^0}$, $t = 0$.
while $t \leq T$ and $\|[f_{\theta^t}(G_i) - f^*(G_i)]_N\|_2 \geq \epsilon$ do
  The teacher selects $m$ teaching graphs:
  /* (Graph-level) Graphs corresponding to the $m$ largest $|f_{\theta^t}(G_i) - f^*(G_i)|$: */
  $\{G_i\}_m^* = \arg\max_{\{G_i\}_m \subseteq \{G_i\}_N} \|[f_{\theta^t}(G_i) - f^*(G_i)]_m\|_2$.
  /* (Node-level) Graphs associated with the $m$ largest $\|f_{\theta^t}(G_i) - f^*(G_i)\|_F$, with Frobenius norm $\|\cdot\|_F$: */
  $\{G_i\}_m^* = \arg\max_{\{G_i\}_m \subseteq \{G_i\}_N} \big\|\big[\|f_{\theta^t}(G_i) - f^*(G_i)\|_F\big]_m\big\|_2$.
  Provide $\{G_i\}_m^*$ to the GCN learner.
  The learner updates $f_{\theta^t}$ based on the received $\{G_i\}_m^*$:
  // Parameter-based gradient descent.
  $\theta^{t+1} \leftarrow \theta^t - \eta \sum_{G_i \in \{G_i\}_m^*} \nabla_\theta \mathcal{L}(f_{\theta^t}(G_i), f^*(G_i))$.
  Set $t \leftarrow t + 1$.
end while

From a functional standpoint, when handling a convex loss functional $\mathcal{L}$, the norm of the partial derivative of $\mathcal{L}$ with respect to $f_\theta$, represented as $\|\frac{\partial \mathcal{L}(f_\theta)}{\partial f_\theta}\|_{\mathcal{H}}$, is positively correlated with $\|f_\theta - f^*\|_{\mathcal{H}}$. As $f_\theta$ progressively converges to $f^*$, $\|\frac{\partial \mathcal{L}(f_\theta)}{\partial f_\theta}\|_{\mathcal{H}}$ diminishes (Boyd et al., 2004; Coleman, 2012). This relationship becomes especially noteworthy when $\mathcal{L}$ is strongly convex with a larger convexity constant (Kakade & Tewari, 2008; Arjevani et al., 2016). Leveraging these insights, the GraNT algorithm selects graphs by

$$\{G_i\}_m^* = \arg\max_{\{G_i\}_m \subseteq \{G_i\}_N} \left\| [f_\theta(G_i) - f^*(G_i)]_m \right\|_2. \quad (23)$$

The pseudocode, including the node-level version, is provided in Algorithm 1.

5.
Experiments and Results

We start by evaluating GraNT on graph-level regression and classification tasks, then proceed to validate it on node-level tasks. The overall results on the test set are shown in Table 1, which clearly highlights the effectiveness of GraNT in graph property learning: it reduces training time by 36.62% for graph-level regression, 38.19% for graph-level classification, 30.97% for node-level regression, and 47.30% for node-level classification, all while maintaining comparable testing performance. Detailed settings are given in Appendix C.

Figure 3: Validation set performance for graph-level tasks: ZINC (regression) and ogbg-molhiv (classification). (a) ZINC Loss; (b) ZINC MAE; (c) ogbg-molhiv Loss; (d) ogbg-molhiv ROC-AUC. Each panel plots the validation loss/metric against wallclock time (s) for "without GraNT", "with GraNT (B)", and "with GraNT (S)".

Given the common practice of training GCNs in batches, i.e., graphs are fed in batches, it is both natural and intuitive to implement GraNT at the batch level. This involves selecting batches that exhibit the largest average discrepancy between the actual properties and the corresponding GCN outputs, referred to as GraNT (B). Meanwhile, another variant, called GraNT (S), selects individual graphs with the largest discrepancies within each batch in proportion, then reorganizes the selected graphs into new batches.

Graph-level tasks.
We evaluate GraNT using several widely recognized benchmark datasets as follows:

- QM9 (Wu et al., 2018): 130k organic molecule graphs with quantum chemical properties (regression task);
- ZINC (Gomez-Bombarelli et al., 2018): 250k molecular graphs with bioactivity and solubility chemical properties (regression task);
- ogbg-molhiv (Hu et al., 2020): 41k molecular graphs with HIV inhibitory activity properties (binary classification task);
- ogbg-molpcba (Hu et al., 2020): 438k molecular graphs with bioactivity properties (multi-task binary classification task).

To clearly illustrate the practical efficiency of GraNT, we plot the wallclock time versus loss/metric curves. This is done by conducting a validation after each training epoch, i.e., performing an evaluation on the validation dataset after each training process. Specifically, we display the validation set loss and the typical Mean Absolute Error (MAE) for ZINC in Figure 3 (a) and (b), respectively.

| GraNT | Dataset | Time (s) | Loss | MAE | ROC-AUC | AP |
|---|---|---|---|---|---|---|
| without | QM9 | 9654.81 | 2.0444 | 0.0051 ± 0.0009 | - | - |
| without | ZINC | 33033.82 | 3.1160 | 0.0048 ± 0.0004 | - | - |
| without | ogbg-molhiv | 2163.50 | 0.1266 | - | 0.7572 ± 0.0005 | - |
| without | ogbg-molpcba | 130191.26 | 0.0577 | - | - | 0.3270 ± 0.0000 |
| without | gen-reg | 3344.78 | 0.0086 | 0.0007 ± 0.0001 | - | - |
| without | gen-cls | 11662.25 | 0.1314 | - | 0.9150 ± 0.0024 | - |
| (B) | QM9 | 6392.26 (-33.79%) | 2.0436 | 0.0051 ± 0.0009 | - | - |
| (B) | ZINC | 20935.24 (-36.62%) | 3.1165 | 0.0048 ± 0.0004 | - | - |
| (B) | ogbg-molhiv | 1457.39 (-32.64%) | 0.1238 | - | 0.7676 ± 0.0036 | - |
| (B) | ogbg-molpcba | 80465.06 (-38.19%) | 0.0577 | - | - | 0.3358 ± 0.0001 |
| (B) | gen-reg | 2308.97 (-30.97%) | 0.0086 | 0.0007 ± 0.0001 | - | - |
| (B) | gen-cls | 6145.72 (-47.30%) | 0.1314 | - | 0.9157 ± 0.0013 | - |
| (S) | QM9 | 7076.37 (-26.71%) | 2.0443 | 0.0051 ± 0.0009 | - | - |
| (S) | ZINC | 22265.83 (-32.60%) | 3.1170 | 0.0048 ± 0.0004 | - | - |
| (S) | ogbg-molhiv | 1597.69 (-26.15%) | 0.1421 | - | 0.7705 ± 0.0027 | - |
| (S) | ogbg-molpcba | 89858.65 (-30.98%) | 0.0575 | - | - | 0.3351 ± 0.0025 |
| (S) | gen-reg | 2337.46 (-30.12%) | 0.0086 | 0.0007 ± 0.0001 | - | - |
| (S) | gen-cls | 8171.21 (-29.93%) | 0.1313 | - | 0.9157 ± 0.0014 | - |

Table 1: Training time and testing results across
different benchmarks. GraNT (B) and GraNT (S) achieve testing performance similar to training without GraNT while significantly reducing training time, across graph-level (QM9, ZINC, ogbg-molhiv, ogbg-molpcba) and node-level (gen-reg, gen-cls) datasets, for both regression and classification tasks.

Method                               Time (s)    MAE
AL-3DGraph (Subedi et al., 2024)¹    9200.27     0.7991
AL-3DGraph (Subedi et al., 2024)²    9364.74     0.4719
AL-3DGraph (Subedi et al., 2024)³    12601.77    0.1682
GraNT (B)                            6392.26     0.0051
GraNT (S)                            7076.37     0.0051
¹ lr=5e-5, batch_size=256, matching the GraNT settings.
² lr=5e-4, batch_size=256.
³ lr=5e-4, batch_size=32, the default settings in that paper's released code.

Table 2: Comparison of GraNT with active-learning-based methods on the QM9 dataset.

one can see that the curves for GraNT (B) and GraNT (S) span about two-thirds of the width of the curve for training without GraNT, with both the loss and MAE for GraNT decreasing at a faster rate. Moreover, GraNT (B) terminates slightly earlier than GraNT (S): GraNT (S) selects teaching graphs from within each batch, which adds operational time compared with the direct batch selection of GraNT (B). Figures 3 (c) and (d) show the validation loss and the commonly used ROC-AUC curves, respectively, for ogbg-molhiv. Both plots clearly highlight the superiority of GraNT. In addition, Figure 3 (d) shows that the ROC-AUC values of GraNT (B) and GraNT (S) consistently exceed those of the case without GraNT

Method                                    Time (s)    ROC-AUC
GCN (Kipf & Welling, 2017)                2888.80     0.7385
GCN+Virtual Node (Kipf & Welling, 2017)   3083.16     0.7608
GMoE-GCN (Wang et al., 2023)              3970.16     0.7536
GMoE-GIN (Wang et al., 2023)              3932.06     0.7468
GDeR-GCN (Zhang et al., 2024b)¹           1772.23     0.7261
GDeR-PNA (Zhang et al., 2024b)¹           5088.88     0.7616
GraNT (B)                                 1457.39     0.7676
GraNT (S)                                 1597.69     0.7705
¹ batch_size=500, retain_ratio=0.7.
Table 3: Comparison of GraNT with recent efficient methods on the ogbg-molhiv dataset.

once the wallclock time reaches approximately 500 s. However, the curves appear relatively jagged, which can be attributed to label imbalance in this benchmark dataset. This imbalance also explains why, even as the validation loss decreases significantly, the ROC-AUC curve does not rise into a higher range. The detailed numerical results for training time and testing performance are provided in Table 1. Comparisons between GraNT and recent SOTA methods are shown in Table 2 for QM9 and Table 3 for ogbg-molhiv.

Node-level tasks. We also assess GraNT on node-level property learning using synthetic data. Specifically, we utilize the graphon, a typical limit object of a convergent sequence of graphs (Xu et al., 2021; Xia et al., 2023), to generate two synthetic datasets: gen-reg (containing 50k graphs) for regression and gen-cls (containing 50k graphs) for classification.

Figure 4: Validation set performance for node-level tasks, gen-reg (regression) and gen-cls (classification), over wallclock time: (a) gen-reg loss, (b) gen-reg MAE, (c) gen-cls loss, (d) gen-cls ROC-AUC, each comparing without GraNT, GraNT (B), and GraNT (S).

Figures 4 (a) and (b) illustrate the validation loss and MAE curves for gen-reg, respectively. Both plots make clear that GraNT reaches convergence more quickly in wallclock time than training without GraNT, highlighting its efficiency.
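The graphon-based generation just described admits a compact sketch. The standard sampling scheme draws latent node positions u_i ~ Uniform[0,1] and edges as independent Bernoulli(W(u_i, u_j)) variables. The particular graphon `W` passed below is only a stand-in, since the exact graphons and labelings behind gen-reg and gen-cls are not specified in this section:

```python
import numpy as np

def sample_graph_from_graphon(W, n, rng):
    """Sample an n-node simple graph from a graphon W: [0,1]^2 -> [0,1].

    Latent positions u_i ~ Uniform[0,1]; edge (i,j) ~ Bernoulli(W(u_i, u_j))
    drawn independently for i < j, then symmetrized with no self-loops.
    """
    u = rng.uniform(size=n)
    P = W(u[:, None], u[None, :])                    # n x n edge probabilities
    U = np.triu(rng.uniform(size=(n, n)) < P, k=1)   # sample upper triangle only
    A = U.astype(int)
    return A + A.T                                   # symmetric adjacency

rng = np.random.default_rng(0)
# W(u, v) = u * v is a hypothetical stand-in graphon, not the one from the paper.
A = sample_graph_from_graphon(lambda x, y: x * y, 20, rng)
```

Repeating this for 50k sampled graphs, with node-level labels derived from the latent positions, yields datasets of the gen-reg/gen-cls flavor.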
Figures 4 (c) and (d) show the validation loss and ROC-AUC for gen-cls, respectively. Both plots demonstrate that GraNT requires less wallclock time to converge than training without GraNT. Furthermore, although this dataset is generated with imbalanced labels, similar to ogbg-molhiv, there is a notable difference: when the validation loss is low, the corresponding ROC-AUC exceeds 0.9, a relatively high value. This underscores the effectiveness of GraNT. The detailed numerical results for training time and testing performance are shown in Table 1. All experimental results demonstrate that GraNT (B) and GraNT (S) offer significant time savings while maintaining comparable generalization performance, in some cases even outperforming training without GraNT. Further experimental results, including result plots for QM9 and ogbg-molpcba, training curves for the aforementioned datasets, and additional validations on the AMD device, can be found in Appendix C.

6. Concluding Remarks and Future Work

This paper proposes GraNT, a novel paradigm that enhances the learning efficiency of graph property learners (GCNs) through nonparametric teaching theory. Specifically, GraNT reduces the wallclock time needed to learn the implicit mapping from graphs to properties of interest by over 30%, while maintaining comparable test performance, as shown through extensive experiments. Furthermore, GraNT establishes a theoretical connection between the evolution of a GCN under parameter-based gradient descent and that of a function under functional gradient descent in nonparametric teaching. This connection between nonparametric teaching theory and GCN training broadens the potential applications of nonparametric teaching in graph property learning. In future work, it would be interesting to investigate other variations of GraNT for different graph property learners, such as graph attention networks (Veličković et al., 2018).
Moreover, exploring the practical applications of GraNT to improve the efficiency of data-driven methods (Henaff, 2020; Touvron et al., 2021; Müller et al., 2022) within the field of graph property learning offers promising opportunities for future progress, particularly in fields like molecular biology and protein research.

Impact Statement

Recent interest in learning implicit mappings from graph data to specific properties has grown significantly, especially in science-related fields, driven by the ability of graphs to model diverse types of data. This work focuses on improving the learning efficiency of implicit graph property mappings through a novel nonparametric teaching perspective, which has the potential to positively impact graph-related fields and society. Furthermore, it connects nonparametric teaching theory with GCN training, expanding its application in graph property learning. As a result, it also holds promise for valuable contributions to the nonparametric teaching community.

Acknowledgements

We thank all anonymous reviewers for their constructive feedback, which helped improve our paper. We also thank Yikun Wang for his helpful discussions. This work is supported in part by the Theme-based Research Scheme (TRS) project T45-701/22-R of the Research Grants Council (RGC), Hong Kong SAR, and in part by the AVNET-HKU Emerging Microelectronics & Ubiquitous Systems (EMUS) Lab.

References

Alfeld, S., Zhu, X., and Barford, P. Explicit defense actions against test-set attacks. In AAAI, 2017.
Arjevani, Y., Shalev-Shwartz, S., and Shamir, O. On lower and upper bounds in smooth and strongly convex optimization. The Journal of Machine Learning Research, 17(1):4303–4353, 2016.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In ICML, 2009.
Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Burkhart, J. G., Wu, G., Song, X., Raimondi, F., McWeeney, S., Wong, M. H., and Deng, Y. Biology-inspired graph neural network encodes reactome and reveals biochemical reactions of disease. Patterns, 4(7), 2023.
Cai, T., Luo, S., Xu, K., He, D., Liu, T.-y., and Wang, L. GraphNorm: A principled approach to accelerating graph neural network training. In ICML, 2021.
Cao, K., Phothilimthana, M., Abu-El-Haija, S., Zelle, D., Zhou, Y., Mendis, C., Leskovec, J., and Perozzi, B. Learning large graph property prediction via graph segment training. In NeurIPS, 2023.
Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., and Murphy, K. Machine learning on graphs: A model and comprehensive taxonomy. Journal of Machine Learning Research, 23(89):1–64, 2022.
Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
Coleman, R. Calculus on Normed Vector Spaces. Springer Science & Business Media, 2012.
Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 2016.
Dou, X. and Liang, T. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. Journal of the American Statistical Association, 116(535):1507–1520, 2021.
Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In NeurIPS, 2019.
Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS, 2015.
Fan, W., Ma, Y., Li, Q., He, Y., Zhao, E., Tang, J., and Yin, D. Graph neural networks for social recommendation. In WWW, 2019.
Fout, A., Byrd, J., Shariat, B., and Ben-Hur, A. Protein interface prediction using graph convolutional networks. In NeurIPS, 2017.
Geifman, A., Yadav, A., Kasten, Y., Galun, M., Jacobs, D., and Ronen, B. On the similarity between the Laplace and neural tangent kernels. In NeurIPS, 2020.
Gelfand, I. M., Silverman, R. A., et al. Calculus of Variations. Courier Corporation, 2000.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In ICML, 2017.
Gligorijević, V., Renfrew, P. D., Kosciolek, T., Leman, J. K., Berenberg, D., Vatanen, T., Chandler, C., Taylor, B. C., Fisk, I. M., Vlamakis, H., et al. Structure-based protein function prediction using graph convolutional networks. Nature Communications, 12(1):3168, 2021.
Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018. URL https://zinc.docking.org.
Guo, Z., Zhang, C., Yu, W., Herr, J., Wiest, O., Jiang, M., and Chawla, N. V. Few-shot graph learning for molecular property prediction. In WWW, pp. 2559–2567, 2021.
Hamilton, W. L., Ying, R., and Leskovec, J. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 40:52–74, 2017. URL https://api.semanticscholar.org/CorpusID:3215337.
Henaff, O. Data-efficient image recognition with contrastive predictive coding. In ICML, 2020.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. In NeurIPS, 2020. URL https://ogb.stanford.edu/docs/graphprop.
Hu, W., Fey, M., Ren, H., Nakata, M., Dong, Y., and Leskovec, J. OGB-LSC: A large-scale challenge for machine learning on graphs. In NeurIPS Datasets and Benchmarks Track, 2021.
Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, 2018.
Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. In NeurIPS, 2008.
Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
Krishnagopal, S. and Ruiz, L. Graph neural tangent kernel: Convergence on large graphs. In ICML, 2023.
Kuk, A. Y. Asymptotically unbiased estimation in generalized linear models with random effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 57(2):395–407, 1995.
Lax, P. D. Functional Analysis, volume 55. John Wiley & Sons, 2002.
LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.
Li, X., Sun, L., Ling, M., and Peng, Y. A survey of graph neural network based recommendation in social networks. Neurocomputing, 549:126441, 2023.
Liu, Q. Stein variational gradient descent as gradient flow. In NeurIPS, 2017.
Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NeurIPS, 2016.
Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L. B., Rehg, J. M., and Song, L. Iterative machine teaching. In ICML, 2017.
Liu, W., Dai, B., Li, X., Liu, Z., Rehg, J., and Song, L. Towards black-box iterative machine teaching. In ICML, 2018.
Liu, X., Yan, M., Deng, L., Li, G., Ye, X., Fan, D., Pan, S., and Xie, Y. Survey on graph neural network acceleration: An algorithmic perspective. In IJCAI, 2022a. Survey Track.
Liu, X., Yan, M., Deng, L., Li, G., Ye, X., Fan, D., Pan, S., and Xie, Y. Survey on graph neural network acceleration: An algorithmic perspective. In IJCAI, 2022b. Survey Track.
Luo, Z., Guo, Q., Cheung, K. C., See, S., and Wan, R. CopyRNeRF: Protecting the copyright of neural radiance fields. In ICCV, 2023.
Luo, Z., Shi, B., Li, H., and Wan, R. Imaging interiors: An implicit solution to electromagnetic inverse scattering problems. In ECCV, 2024.
Luo, Z., Rocha, A., Shi, B., Guo, Q., Li, H., and Wan, R. The NeRF signature: Codebook-aided watermarking for neural radiance fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
Ma, Y., Zhang, X., Sun, W., and Zhu, J. Policy poisoning in batch reinforcement learning and control. In NeurIPS, 2019.
Mansimov, E., Mahmood, O., Kang, S., and Cho, K. Molecular geometry prediction using a deep generative graph neural network. Scientific Reports, 9(1):20381, 2019.
Min, S., Gao, Z., Peng, J., Wang, L., Qin, K., and Fang, B. STGSN: A spatial-temporal graph neural network framework for time-evolving social networks. Knowledge-Based Systems, 214:106746, 2021.
Müller, T., Evans, A., Schied, C., and Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 41(4):1–15, 2022.
Narayanan, S. D., Sinha, A., Jain, P., Kar, P., and Sellamanickam, S. IGLU: Efficient GCN training via lazy updates. In ICLR, 2022.
Rakhsha, A., Radanovic, G., Devidze, R., Zhu, X., and Singla, A. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In ICML, 2020.
Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1–7, 2014.
Robinson, J., Ranjan, R., Hu, W., Huang, K., Han, J., Dobles, A., Fey, M., Lenssen, J. E., Yuan, Y., Zhang, Z., et al. RelBench: A benchmark for deep learning on relational databases. In NeurIPS Datasets and Benchmarks Track, 2024.
Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
Schölkopf, B., Smola, A. J., Bach, F., et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. Near-optimally teaching the crowd to classify. In ICML, 2014.
Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. Implicit neural representations with periodic activation functions. In NeurIPS, 2020.
Stark, C., Breitkreutz, B.-J., Reguly, T., Boucher, L., Breitkreutz, A., and Tyers, M. BioGRID: A general repository for interaction datasets. Nucleic Acids Research, 34(suppl_1):D535–D539, 2006.
Subedi, R., Wei, L., Gao, W., Chakraborty, S., and Liu, Y. Empowering active learning for 3D molecular graphs with geometric graph isomorphism. In NeurIPS, 2024.
Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. In NeurIPS, 2020.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.
Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. Graph kernels. The Journal of Machine Learning Research, 11:1201–1242, 2010.
Wang, H., Jiang, Z., You, Y., Han, Y., Liu, G., Srinivasa, J., Kompella, R., Wang, Z., et al. Graph mixture of experts: Learning on large-scale graphs with explicit diversity modeling. In NeurIPS, 2023.
Wang, P. and Vasconcelos, N. A machine teaching framework for scalable recognition. In ICCV, 2021.
Wang, P., Nagrecha, K., and Vasconcelos, N. Gradient-based algorithms for machine teaching. In CVPR, 2021.
Wright, S. J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In ICML, 2019.
Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018. URL http://quantum-machine.org/datasets.
Xia, F., Sun, K., Yu, S., Aziz, A., Wan, L., Pan, S., and Liu, H. Graph learning: A survey. IEEE Transactions on Artificial Intelligence, 2(2):109–127, 2021.
Xia, X., Mishne, G., and Wang, Y. Implicit graphon neural representation. In AISTATS, 2023.
Xu, H., Luo, D., Carin, L., and Zha, H. Learning graphons via structured Gromov-Wasserstein barycenters. In AAAI, 2021.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In ICLR, 2018.
Xue, Z., Yang, Y., and Marculescu, R. SUGAR: Efficient subgraph-level training via resource-aware graph partitioning. IEEE Transactions on Computers, 2023.
Zhang, C., Cao, X., Liu, W., Tsang, I., and Kwok, J. Nonparametric teaching for multiple learners. In NeurIPS, 2023a.
Zhang, C., Cao, X., Liu, W., Tsang, I., and Kwok, J. Nonparametric iterative machine teaching. In ICML, 2023b.
Zhang, C., Luo, S. T. S., Li, J. C. L., Wu, Y.-C., and Wong, N. Nonparametric teaching of implicit neural representations. In ICML, 2024a.
Zhang, G., Dong, H., Li, Z., Chen, D., Wang, K., Chen, T., Liang, Y., Cheng, D., Wang, K., et al. GDeR: Safeguarding efficiency, balancing, and robustness via prototypical graph pruning. In NeurIPS, 2024b.
Zhang, S., Sohrabizadeh, A., Wan, C., Huang, Z., Hu, Z., Wang, Y., Cong, J., Sun, Y., et al. A survey on graph neural network acceleration: Algorithms, systems, and customized hardware. arXiv preprint arXiv:2306.14052, 2023c.
Zhou, Y., Nelakurthi, A. R., and He, J. Unlearn what you have learned: Adaptive crowd teaching with exponentially decayed memory learners. In SIGKDD, 2018.
Zhu, X. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, 2015.
Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. An overview of machine teaching. arXiv preprint arXiv:1801.05927, 2018.
Zhuang, X., Zhang, Q., Wu, B., Ding, K., Fang, Y., and Chen, H. Graph sampling-based meta-learning for molecular property prediction. In IJCAI, 2023.

A. Additional Discussions

A.1. Notation Overview

Notation                 Description
G(n) = (V, E)            Graph with n vertices and edge set E
V                        Set of n vertices (nodes) in the graph
E                        Set of edges in the graph
[x_i]_d                  d-dimensional feature vector with entries x_i
x                        Simplified notation for [x_i]_d
X (n x d)                Feature matrix of all nodes
X_(i,:)                  i-th row of X (feature vector of node i)
X_(:,j)                  j-th column of X (feature j across nodes)
e_i                      i-th basis vector (1 at the i-th position, 0 elsewhere)
A                        Adjacency matrix of graph G(n)
G = (X, A)               Representation of a graph by its feature and adjacency matrices
𝔾                        Collection of all graphs
y                        Property of the graph (scalar or vector)
Y                        Space of graph properties (R or R^n)
{a_i}_m                  Set of m elements
diag(a_1, ..., a_m)      Diagonal matrix with elements a_1, ..., a_m
diag(a; m)               Diagonal matrix with m repeated entries a
N_d := {1, ..., d}       Set of natural numbers from 1 to d
K(G, G')                 Positive definite graph kernel function
H                        Reproducing kernel Hilbert space (RKHS) defined by K
f*                       Target mapping from 𝔾 to Y
y*                       Property f*(G*) of graph G*

Table 4: Summary of Key Notations.

A.2. The derivation of structure-aware updates in the parameter space.
Let us focus on the derivative of a two-layer GCN with summation pooling,
\[ f_\theta(G) = \frac{1}{n}\mathbf{1}_n^\top \underbrace{A[\kappa_2]\,\mathrm{diag}(X^{(1)};\kappa_2)\,W^{(2)}}_{\text{the second layer}}, \qquad X^{(1)} = \underbrace{\big[\sigma(A[\kappa_1]\,\mathrm{diag}(X^{(0)};\kappa_1)\,W^{(1)}_{(:,1)}),\,\cdots,\,\sigma(A[\kappa_1]\,\mathrm{diag}(X^{(0)};\kappa_1)\,W^{(1)}_{(:,h_1)})\big]}_{\text{the first layer}}. \]
Using the chain rule, we can calculate the derivative of $f_\theta(G)$ w.r.t. the second-layer weights $W^{(2)}$, a vector of size $\kappa_2 h_1$, as
\[ \frac{\partial f_\theta(G)}{\partial W^{(2)}} = \frac{1}{n}\mathbf{1}_n^\top A[\kappa_2]\,\mathrm{diag}(X^{(1)};\kappa_2) = \frac{1}{n}\mathbf{1}_n^\top \underbrace{A[\kappa_2]}_{n\times\kappa_2 n}\, \underbrace{\mathrm{diag}\big(\sigma(A[\kappa_1]\,\mathrm{diag}(X^{(0)};\kappa_1)\,W^{(1)});\,\kappa_2\big)}_{\kappa_2 n\times\kappa_2 h_1}, \tag{25} \]
which is of size $1\times\kappa_2 h_1$. The derivative of $f_\theta(G)$ w.r.t. the first-layer weights is more complex: only the $i$-th column $X^{(1)}_{(:,i)} = \sigma(A[\kappa_1]\,\mathrm{diag}(X^{(0)};\kappa_1)\,W^{(1)}_{(:,i)})$ depends on $W^{(1)}_{(:,i)}$, so for $i\in\mathbb{N}_{h_1}$, applying the chain rule block by block gives
\[ \frac{\partial f_\theta(G)}{\partial W^{(1)}_{(:,i)}} = \frac{1}{n}\mathbf{1}_n^\top \underbrace{A[\kappa_2]}_{n\times\kappa_2 n}\, \underbrace{\begin{bmatrix} W^{(2)}_{(i)}\,\mathrm{diag}(\sigma')\,A[\kappa_1]\,\mathrm{diag}(X^{(0)};\kappa_1) \\ \vdots \\ W^{(2)}_{(i+(\kappa_2-1)h_1)}\,\mathrm{diag}(\sigma')\,A[\kappa_1]\,\mathrm{diag}(X^{(0)};\kappa_1) \end{bmatrix}}_{\kappa_2 n\times\kappa_1 h_0}, \]
which is of size $1\times\kappa_1 h_0$, with $\sigma' = \partial\sigma\big(A[\kappa_1]\,\mathrm{diag}(X^{(0)};\kappa_1)\,W^{(1)}_{(:,i)}\big)\,/\,\partial\big(A[\kappa_1]\,\mathrm{diag}(X^{(0)};\kappa_1)\,W^{(1)}_{(:,i)}\big)$. (In the paper's color coding, orange marks the first-layer elements, green the second-layer elements, and the purple $\frac{1}{n}$ refers to the summation pooling.) If we use ReLU as the activation function, $\sigma'(z)\odot z = \sigma(z)$ elementwise, so these blocks can be expressed through $\sigma$ itself. For the GCN shown in Figure 2, the derivative, i.e., Equation 25, is specified with $(\kappa_1,\kappa_2)=(3,2)$ as
\[ \frac{\partial f_\theta(G)}{\partial W^{(2)}} = \frac{1}{n}\mathbf{1}_n^\top \underbrace{A[2]}_{n\times 2n}\, \underbrace{\mathrm{diag}\big(\sigma(A[3]\,\mathrm{diag}(X^{(0)};3)\,W^{(1)});\,2\big)}_{2n\times 2h_1}, \qquad \frac{\partial f_\theta(G)}{\partial W^{(1)}_{(:,i)}} = \frac{1}{n}\mathbf{1}_n^\top \underbrace{A[2]}_{n\times 2n}\, \underbrace{\begin{bmatrix} W^{(2)}_{(i)}\,\mathrm{diag}(\sigma')\,A[3]\,\mathrm{diag}(X^{(0)};3) \\ W^{(2)}_{(i+h_1)}\,\mathrm{diag}(\sigma')\,A[3]\,\mathrm{diag}(X^{(0)};3) \end{bmatrix}}_{2n\times 3h_0}. \]
When a graph is input into a GCN, the adjacency matrix A governs the operations between nodes, ensuring the update is structure-aware by performing row-wise operations on the feature matrix.
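The second-layer derivative above is easy to sanity-check numerically. The sketch below does so for a simplified two-layer GCN with mean pooling, dropping the multi-hop stacking (A[κ], diag(·; κ)) from the paper's formulation (an illustrative simplification, not the paper's exact model): the analytic ∂f/∂W⁽²⁾ = (1/n) 1ᵀ σ(A X W⁽¹⁾) is compared against central finite differences.

```python
import numpy as np

def gcn(A, X, W1, W2):
    """Simplified two-layer GCN with ReLU and mean pooling:
    f(G) = (1/n) 1^T sigma(A X W1) W2."""
    n = A.shape[0]
    H = np.maximum(A @ X @ W1, 0.0)   # first layer with ReLU
    return float(np.ones(n) @ H @ W2) / n

rng = np.random.default_rng(0)
n, d, h1 = 5, 3, 4
A = rng.integers(0, 2, (n, n))        # toy adjacency
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h1))
W2 = rng.normal(size=(h1,))

# Analytic gradient w.r.t. the second-layer weights: (1/n) 1^T sigma(A X W1).
analytic = np.ones(n) @ np.maximum(A @ X @ W1, 0.0) / n

# Central finite differences, one coordinate of W2 at a time.
eps = 1e-6
numeric = np.zeros(h1)
for k in range(h1):
    e = np.zeros(h1); e[k] = eps
    numeric[k] = (gcn(A, X, W1, W2 + e) - gcn(A, X, W1, W2 - e)) / (2 * eps)
```

Since f is linear in W⁽²⁾, the two gradients agree up to floating-point error; the same check extends to the first-layer derivative, column by column.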
Meanwhile, the weight matrix $W$ controls how the features are processed, by performing column-wise operations on the feature matrix.

Figure 5: Graphical illustration of the GNTK computation between graphs $G^{(3)} = (X, A)$ and $G'^{(4)} = (X', A')$, via feature aggregation over each graph followed by concatenation of the per-parameter products:
\[ K_\theta(G^{(3)}, G'^{(4)}) = \Big\langle \frac{\partial f_\theta(G)}{\partial\theta}, \frac{\partial f_\theta(G')}{\partial\theta} \Big\rangle = \frac{\partial f_\theta(G)}{\partial W^{(1)}_{(1,1)}}\frac{\partial f_\theta(G')}{\partial W^{(1)}_{(1,1)}} + \cdots + \frac{\partial f_\theta(G)}{\partial W^{(1)}_{(\kappa_1 d,\,h_1)}}\frac{\partial f_\theta(G')}{\partial W^{(1)}_{(\kappa_1 d,\,h_1)}} + \frac{\partial f_\theta(G)}{\partial W^{(2)}_{(1)}}\frac{\partial f_\theta(G')}{\partial W^{(2)}_{(1)}} + \cdots + \frac{\partial f_\theta(G)}{\partial W^{(2)}_{(\kappa_2 h_1)}}\frac{\partial f_\theta(G')}{\partial W^{(2)}_{(\kappa_2 h_1)}}. \]

A.3. Graph Neural Tangent Kernel (GNTK)

By substituting the parameter evolution (Equation 10), $\theta^{t+1} = \theta^t - \eta\,\frac{1}{N}\sum_i \frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\,\frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t}$, into the first-order approximation term of Equation 16, one obtains
\[ \Big\langle \theta^{t+1}-\theta^t,\, \frac{\partial f_{\theta^t}(\cdot)}{\partial\theta^t} \Big\rangle = -\eta\Big\langle \frac{1}{N}\sum_i \frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\,\frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t},\, \frac{\partial f_{\theta^t}(\cdot)}{\partial\theta^t} \Big\rangle = -\frac{\eta}{N}\Big\langle \Big[\frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\Big]_N,\, [K_{\theta^t}(G_i,\cdot)]_N \Big\rangle, \tag{30} \]
which derives Equation 17 as
\[ f_{\theta^{t+1}}(\cdot) = f_{\theta^t}(\cdot) - \frac{\eta}{N}\Big\langle \Big[\frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\Big]_N,\, [K_{\theta^t}(G_i,\cdot)]_N \Big\rangle + o\big(\|\theta^{t+1}-\theta^t\|\big), \tag{31} \]
where the symmetric and positive definite $K_{\theta^t}(G_i,\cdot) := \big\langle \frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t}, \frac{\partial f_{\theta^t}(\cdot)}{\partial\theta^t} \big\rangle$ is referred to as the graph neural tangent kernel (GNTK) (Du et al., 2019; Krishnagopal & Ruiz, 2023). Figure 5 illustrates the GNTK calculation process. In simple terms, examining a model's behavior by focusing on the model itself, rather than its parameters, often involves the use of kernel functions. The quantity $\frac{\partial f_{\theta^t}(\cdot)}{\partial\theta^t}$ present in $K_{\theta^t}(G_i,\cdot)$, representing the partial derivative of the GCN with respect to its parameters, is determined by both the GCN structure and the specific parameters $\theta^t$, but does not depend on the input graphs. The other term, $\frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t}$, relies not only on the GCN structure and the specific $\theta^t$, but also on the input graph. If the input of $\frac{\partial f_{\theta^t}(\cdot)}{\partial\theta^t}$ is not specified, the GNTK takes the general form $K_{\theta^t}(\cdot,\cdot)$; when a specific graph $G_j$ is given as this input, the GNTK becomes the scalar $K_{\theta^t}(G_i, G_j) = \big\langle \frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t}, \frac{\partial f_{\theta^t}(G_j)}{\partial\theta^t} \big\rangle$.
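The scalar form of the GNTK is straightforward to compute for a toy model. The sketch below uses a one-layer linear GCN with mean pooling (an assumption made for illustration, so that the gradient is analytic; the paper's GCN is two-layer with a nonlinearity) and forms the Gram matrix K_θ(G_i, G_j) = ⟨∂f(G_i)/∂θ, ∂f(G_j)/∂θ⟩ over two graphs:

```python
import numpy as np

def gcn_output_grad(A, X, w):
    """Gradient of f(G) = (1/n) 1^T A X w w.r.t. the weight vector w.
    For this linear toy model the gradient does not depend on w itself."""
    n = A.shape[0]
    return (np.ones(n) @ A @ X) / n       # shape (d,)

def empirical_gntk(G1, G2, w):
    """K_theta(G, G') = < df(G)/dtheta, df(G')/dtheta >."""
    g1 = gcn_output_grad(*G1, w)
    g2 = gcn_output_grad(*G2, w)
    return float(g1 @ g2)

rng = np.random.default_rng(1)
A1, X1 = rng.integers(0, 2, (4, 4)), rng.normal(size=(4, 3))   # 4-node graph
A2, X2 = rng.integers(0, 2, (5, 5)), rng.normal(size=(5, 3))   # 5-node graph
w = rng.normal(size=3)
graphs = [(A1, X1), (A2, X2)]
K = np.array([[empirical_gntk(a, b, w) for b in graphs] for a in graphs])
```

The resulting 2x2 matrix is symmetric with non-negative diagonal, as a Gram matrix of parameter-gradients must be; note that the adjacency A enters the gradient directly, which is exactly what makes the kernel structure-aware.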
These are in line with the kernel used in functional gradient descent. By providing the input graph $G_i$, one coordinate of $K_{\theta^t}$ is fixed, causing the GCN to update along $K_{\theta^t}(G_i,\cdot)$, based on the magnitude of $\frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t}$. This process aligns with the core principle of functional gradient descent. In summary, both the GNTK and the canonical kernel are consistent in their mathematical formulation and show alignment in how they affect the evolution of the corresponding GCN. Furthermore, Theorem 5 highlights the asymptotic connection between the GNTK and the canonical kernel used in functional gradient descent.

B. Detailed Proofs

Before providing the detailed proofs, we first introduce the gradient of an evaluation functional $E_G(f)$.

Lemma 7. The gradient of an evaluation functional $E_G(f) = f(G): \mathcal{H} \mapsto \mathbb{R}$ is $\nabla_f E_G(f) = K_G$.

Proof of Lemma 7. Let us define a function $\phi$ by adding a small perturbation $\epsilon g$ ($\epsilon\in\mathbb{R}$, $g\in\mathcal{H}$) to $f\in\mathcal{H}$, so that $\phi = f + \epsilon g$; $\phi\in\mathcal{H}$, since an RKHS is closed under addition and scalar multiplication. Therefore, for the evaluation functional $E_G[f] = f(G): \mathcal{H}\mapsto\mathbb{R}$, we can evaluate $\phi$ at $G$, using the reproducing property, as
\[ E_G[\phi] = E_G[f+\epsilon g] = E_G[f] + \epsilon E_G[g] + 0 = E_G[f] + \epsilon\,\langle K(G,\cdot),\, g\rangle_{\mathcal{H}} + 0. \tag{32} \]
Recalling the implicit definition of the Fréchet derivative in an RKHS (see Definition 2),
\[ E_G[f+\epsilon g] = E_G[f] + \epsilon\,\langle \nabla_f E_G[f],\, g\rangle_{\mathcal{H}} + o(\epsilon), \]
it follows from Equation 32 that the gradient of the evaluation functional is $\nabla_f E_G[f] = K_G$. $\square$
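Lemma 7 is what makes functional gradient descent computable: since ∇_f E_G[f] = K(G, ·), a gradient step on a squared loss only shifts the expansion coefficients of f in the span of {K(G_i, ·)}. A minimal sketch, with plain feature vectors and an RBF kernel standing in for graphs and a proper graph kernel:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Stand-in kernel; a graph kernel K(G, G') would take its place.
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

G = [np.array([0.0]), np.array([1.0]), np.array([2.0])]   # "graphs"
y = np.array([0.0, 1.0, 0.0])                              # their properties
alpha = np.zeros(3)   # f = sum_i alpha_i K(G_i, .)
eta = 0.5

def f(x):
    return sum(a * rbf(g, x) for a, g in zip(alpha, G))

# By Lemma 7, for the squared loss the functional step
#   f <- f - eta * sum_i (f(G_i) - y_i) K(G_i, .)
# only moves the coefficients alpha_i.
for _ in range(200):
    residuals = np.array([f(g) for g in G]) - y
    alpha -= eta * residuals

fit = np.array([f(g) for g in G])   # converges toward the targets y
```

With this step size the iteration contracts, and the fitted values match the targets to high accuracy; a teacher that selects which G_i enter the sum controls which kernel directions the function moves along.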
Proof of Theorem 5. By describing the evolution of a GCN both in terms of parameter variations and, from a high-level perspective, within the function space, we obtain
\[ -\frac{\eta}{N}\Big\langle \Big[\frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\Big]_N,\, [K(G_i,\cdot)]_N \Big\rangle = -\eta\Big\langle \frac{1}{N}\sum_i \frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\,\frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t},\, \frac{\partial f_{\theta^t}(\cdot)}{\partial\theta^t} \Big\rangle + o\big(\|\theta^{t+1}-\theta^t\|\big). \]
After the reorganization, we get
\[ \frac{\eta}{N}\Big\langle \Big[\frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\Big]_N,\, [K(G_i,\cdot) - K_{\theta^t}(G_i,\cdot)]_N \Big\rangle = o\big(\|\theta^{t+1}-\theta^t\|\big). \]
By inserting the evolution of the parameters, $\theta^{t+1}-\theta^t = -\eta\,\frac{1}{N}\sum_i \frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\,\frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t}$, into the remainder, we have
\[ \frac{\eta}{N}\Big\langle \Big[\frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\Big]_N,\, [K(G_i,\cdot) - K_{\theta^t}(G_i,\cdot)]_N \Big\rangle = o\Big(\eta\,\Big\|\frac{1}{N}\sum_i \frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\,\frac{\partial f_{\theta^t}(G_i)}{\partial\theta^t}\Big\|\Big). \]
When training a GCN with a loss $L$ that is convex with respect to $f_\theta$ (though often not with respect to $\theta$), we have the vector limit $\lim_{t\to\infty} \big[\frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\big]_N = 0$. Since the right side of the equation is a higher-order infinitesimal than the left, preserving this equality leads us to the conclusion that
\[ \lim_{t\to\infty}\, [K(G_i,\cdot) - K_{\theta^t}(G_i,\cdot)]_N = 0. \tag{37} \]
This means that for each $G\in\{G_i\}_N$, the GNTK converges pointwise to the canonical kernel. $\square$

Proof of Proposition 6. By recalling the definition of the Fréchet derivative in Definition 2, the convexity of $L$ implies that
\[ L(f_{\theta^{t+1}}) - L(f_{\theta^t}) \le \big\langle \nabla_f L(f_{\theta^{t+1}}),\, f_{\theta^{t+1}} - f_{\theta^t} \big\rangle_{\mathcal{H}} =: \Xi. \]
By identifying the Fréchet derivative of $L(f_{\theta^{t+1}})$ and the evolution of $f_{\theta^t}$, the right-hand side term $\Xi$ can be represented as
\[ \Xi = \Big\langle \frac{1}{N}\Big[\frac{\partial L(f_{\theta^{t+1}}(G_i), y_i)}{\partial f_{\theta^{t+1}}(G_i)}\Big]_N^\top [K_{G_i}]_N,\; -\frac{\eta}{N}\,[K_{G_i}]_N^\top \Big[\frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\Big]_N \Big\rangle_{\mathcal{H}} = -\frac{\eta}{N}\Big[\frac{\partial L(f_{\theta^{t+1}}(G_i), y_i)}{\partial f_{\theta^{t+1}}(G_i)}\Big]_N^\top \tilde{K}\, \Big[\frac{\partial L(f_{\theta^t}(G_i), y_i)}{\partial f_{\theta^t}(G_i)}\Big]_N, \tag{39} \]
where $\tilde{K} = K/N$, and $K$ is an $N\times N$ symmetric, positive definite matrix with elements $K(G_i, G_j)$ positioned at the $i$-th row and $j$-th column. For convenience, we adopt the simplified notation $\nabla_{f_\theta}L(f_\theta; G_i) := \frac{\partial L(f_\theta(G_i), y_i)}{\partial f_\theta(G_i)}$.
The final term in Equation 39 can be rewritten by adding and subtracting $[\nabla_{f_{\theta^t}}L(f_{\theta^t}; G_i)]_N$ (arguments suppressed for brevity):
\[ \big[\nabla_{f_{\theta^{t+1}}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N = \big[\nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N + \big[\nabla_{f_{\theta^{t+1}}}L - \nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N. \]
Since $\tilde{K}$ is positive definite, it is apparent that
\[ \big[\nabla_{f_{\theta^{t+1}}}L - \tfrac{1}{2}\nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^{t+1}}}L - \tfrac{1}{2}\nabla_{f_{\theta^t}}L\big]_N \]
is a non-negative term; expanding it yields
\[ \big[\nabla_{f_{\theta^{t+1}}}L - \nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N \ge -\underbrace{\big[\nabla_{f_{\theta^{t+1}}}L - \nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^{t+1}}}L - \nabla_{f_{\theta^t}}L\big]_N}_{=:\,\Delta_t} - \frac{1}{4}\big[\nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N. \]
Hence, combining the two displays with Equation 39,
\[ \Xi \le -\frac{3\eta}{4N}\big[\nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N + \frac{\eta}{N}\,\Delta_t. \]
Based on the definition of the evaluation functional and the assumption that $L$ is Lipschitz smooth with a constant $\tau > 0$, so that $\big|\nabla_{f_{\theta^{t+1}}}L(f_{\theta^{t+1}}; G_i) - \nabla_{f_{\theta^t}}L(f_{\theta^t}; G_i)\big| \le \tau\,\big|E_{G_i}(f_{\theta^{t+1}} - f_{\theta^t})\big|$, and using the evolution $f_{\theta^{t+1}} - f_{\theta^t} = -\frac{\eta}{N}\sum_j \nabla_{f_{\theta^t}}L(f_{\theta^t}; G_j)\,K_{G_j}$, which gives $\big[E_{G_i}(f_{\theta^{t+1}} - f_{\theta^t})\big]_N = -\eta\,\tilde{K}\,\big[\nabla_{f_{\theta^t}}L\big]_N$, the term $\Delta_t$ is bounded above as
\[ \Delta_t \le \lambda_{\max}(\tilde{K})\,\tau^2\,\big\|\big[E_{G_i}(f_{\theta^{t+1}} - f_{\theta^t})\big]_N\big\|^2 = \lambda_{\max}(\tilde{K})\,\eta^2\tau^2\,\big[\nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}^2\, \big[\nabla_{f_{\theta^t}}L\big]_N. \]
Under the assumption that the canonical kernel is bounded above by a constant $\gamma > 0$, i.e., $\langle [K_{G_i}]_N, [K_{G_i}]_N^\top\rangle_{\mathcal{H}} \le \gamma\,[1]_N[1]_N^\top$ entrywise, the Gershgorin bound gives $\lambda_{\max}(K)\le N\gamma$, hence $\lambda_{\max}(\tilde{K})\le\gamma$ and $\tilde{K}^2 \preceq \gamma\tilde{K}$. As a result,
\[ \Delta_t \le \eta^2\tau^2\gamma^2\,\big[\nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N, \qquad\text{so that}\qquad \Xi \le -\frac{\eta}{N}\Big(\frac{3}{4} - \eta^2\tau^2\gamma^2\Big)\big[\nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N. \]
Therefore, if $\eta \le \frac{1}{2\tau\gamma}$, then $\eta^2\tau^2\gamma^2 \le \frac{1}{4}$ and
\[ L(f_{\theta^{t+1}}) \le L(f_{\theta^t}) - \frac{\eta}{2N}\big[\nabla_{f_{\theta^t}}L\big]_N^\top \tilde{K}\, \big[\nabla_{f_{\theta^t}}L\big]_N \le L(f_{\theta^t}), \]
i.e., the loss is non-increasing along the functional updates, which completes the proof. $\square$

C. Experiment Details

This section outlines the experiment details, covering the experimental setup, supplementary results, and a brief analysis of graph-level and node-level tasks on benchmark datasets.

C.1. Experimental Setup

Device Setup. We mainly conduct experiments using an NVIDIA GeForce RTX 3090 (24 GB).

Dataset Splitting. The train/val/test split configurations for the benchmark datasets are provided in Table 5.

Dataset        train    validation   test
QM9            110000   10000        10831
ZINC           220011   24445        5000
ogbg-molhiv    32901    4113         4113
ogbg-molpcba   350343   43793        43793
gen-reg        30000    10000        10000
gen-cls        30000    10000        10000

Table 5: Dataset splitting for the benchmark datasets.

Hyperparameter Settings. The key hyperparameter settings for all benchmark datasets are listed in Table 6. The analysis of the start-ratio hyperparameter under GraNT (B) is provided in Table 7.

Dataset        lr        κ-list            batch-size   start-ratio (GraNT)   epochs
QM9            0.00005   [3, 2]            256          0.05                  750
ZINC           0.0004    [5, 4, 2, 2]      256          0.05                  1000
ogbg-molhiv    0.01      [4, 3, 2, 2]      500          0.1                   600
ogbg-molpcba   0.015     [5, 4, 3, 2, 2]   128          0.1                   800
gen-reg        0.0002    [3, 2]            100          0.05                  250
gen-cls        0.0002    [4, 3]            200          0.05                  500

Table 6: Key hyperparameter settings for the benchmark datasets, with the start-ratio specified for GraNT.
Dataset        Metric              0.05     0.1      0.2      0.4       0.8       full
QM9            MAE                 0.0051   0.0053   0.0053   0.0053    0.0053    0.0051
               Training time (s)   6392.26  6974.30  7918.51  10828.18  14081.66  9654.81
ogbg-molhiv    ROC-AUC             0.7546   0.7676   0.7652   0.7618    0.7592    0.7572
               Training time (s)   1362.09  1457.39  1719.43  2157.05   3173.92   2163.50
gen-cls        ROC-AUC             0.9157   0.9156   0.9156   0.9156    0.9156    0.9150
               Training time (s)   6145.72  6237.92  6939.95  9459.22   13153.81  11662.25

Table 7: Performance comparison w.r.t. start-ratio for different datasets.

C.2. Graph-level Tasks

We train the GCN using GraNT (B), GraNT (S), and the "without GraNT" baseline, all under a common experimental setup for graph-level tasks. For GraNT (B) and GraNT (S), we adopt a curriculum learning strategy (Bengio et al., 2009; Zhang et al., 2024a). Intuitively, at the start of training the model is undertrained and changes rapidly, which calls for more frequent selection by the teacher, but with small subsets so the learner can better digest the provided graphs; by the end of training, the model has stabilized and can digest large subsets. Specifically, the selection interval progressively widens over 50 stages, beginning from the first epoch and gradually extending to the maximum interval. The initial selection ratios are predefined (i.e., the start-ratio hyperparameter) for each benchmark dataset. Additionally, for ogbg-molhiv and ogbg-molpcba, we use ReduceLROnPlateau (Hinton et al., 2012) as the learning rate scheduler, with the lr values in Table 6 representing the initial learning rate. We also activate the learning-rate restarting scheme whenever a new selection action is initiated. More experimental results and a brief analysis are provided below.
Figure 6: Validation set performance of graph-level tasks on QM9 (regression): (a) validation loss and (b) validation MAE vs. wallclock time (s).

Figure 7: Training set performance of graph-level tasks on QM9 (regression): (a) training loss and (b) training MAE vs. wallclock time (s).

First, we show the validation and training performance for the regression task on QM9 in Figures 6 and 7, along with the training performance for the regression task on ZINC in Figure 8. For the classification tasks, the training performance on ogbg-molhiv is displayed in Figure 9, and the validation/training performance on ogbg-molpcba is provided in Figures 10 and 11, respectively. For QM9, as shown in Figure 6, GraNT (B) and GraNT (S) require less time than "without GraNT" and show a quicker decline in the validation loss and MAE curves. In the early stages, the curves follow a nearly linear trend. This occurs because the learning rate is small, allowing the model to learn simple features and adjust its parameters gradually at the beginning. As the optimizer moves through the parameter space with steady steps, the model begins to capture more complex features, producing a shift to a nonlinear phase in the validation loss and MAE. In Figure 7, the training loss and MAE curves fluctuate significantly early on, which can be attributed to the frequent selection process: the model must rapidly learn and adapt to new teaching graphs. Over time, the curves take on a step-like shape, suggesting that the selection process has stabilized and the model has become adequately adaptive.
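The progressively widening selection interval described in Section C.2 can be sketched as a schedule over training epochs. The linear widening below is an illustrative assumption: the paper specifies only that the interval grows over 50 stages, from the first epoch up to a maximum interval, so the exact growth rule here is our own choice:

```python
def selection_epochs(total_epochs, n_stages=50, max_interval=None):
    """Epochs at which the teacher re-selects a subset of teaching graphs.

    The gap between consecutive selections widens linearly from 1 up to
    max_interval over n_stages, so selection is frequent early in training
    (small, fast-changing model) and sparse once the learner stabilizes.
    This growth rule is an illustrative assumption, not the paper's exact one.
    """
    if max_interval is None:
        max_interval = max(1, total_epochs // n_stages)
    epochs, epoch = [0], 0
    for stage in range(n_stages):
        # Interval grows from 1 (first stage) to max_interval (last stage).
        interval = 1 + round((max_interval - 1) * stage / max(1, n_stages - 1))
        epoch += interval
        if epoch >= total_epochs:
            break
        epochs.append(epoch)
    return epochs

# e.g. for QM9's 750 epochs: early selections are close together,
# later ones are progressively farther apart.
schedule = selection_epochs(total_epochs=750)
```

After the last scheduled selection, training simply continues on the most recently selected subset until the epoch budget is exhausted.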
The training loss and MAE curves for ZINC in Figure 8 display similar trends. It is worth noting that GraNT (B) and GraNT (S) cut the training time by around 3.5 hours for ZINC compared to "without GraNT", while maintaining similar performance. For ogbg-molhiv, as shown in Figure 9, the training ROC-AUC curves for GraNT (B) and GraNT (S) fluctuate due to the underlying data imbalance; however, the overall trend and final results align with expectations, even surpassing the "without GraNT" curve. To showcase the generalizability and scalability of GraNT, we further evaluate it on the large-scale multi-task classification benchmark ogbg-molpcba, which is 10 times larger than ogbg-molhiv. Figure 10 shows that the validation AP (Average Precision) curves for GraNT (B) and GraNT (S) consistently outperform the "without GraNT" curve in the mid-to-late stages. Notably, GraNT (B) and GraNT (S) reduce the training time by approximately 13.8 hours compared to the "without GraNT" setup. These results further emphasize GraNT's ability to enhance training-time efficiency on large-scale and complex datasets, particularly in chemistry and the biomedical sciences.

Figure 8: Training set performance of graph-level tasks on ZINC (regression): (a) training loss and (b) training MAE vs. wallclock time (s).

Figure 9: Training set performance of graph-level tasks on ogbg-molhiv (classification): (a) training loss and (b) training ROC-AUC vs. wallclock time (s).
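ROC-AUC, the metric tracked for ogbg-molhiv, can be computed directly from its rank interpretation: the probability that a randomly chosen positive graph is scored above a randomly chosen negative one, with ties counting one half. A minimal reference implementation (ours, not from the paper's codebase):

```python
def roc_auc(labels, scores):
    """ROC-AUC via the pairwise-ranking definition: P(score+ > score-), ties 0.5.

    O(P*N) pair counting; fine as a sanity check, too slow for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One misranked positive/negative pair out of four gives 3/4:
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This rank-based view also explains why the metric is robust to the class imbalance mentioned above: it depends only on the ordering of scores across classes, not on their absolute values or on the class frequencies.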
Additionally, the training loss and AP curves are shown in Figure 11. Furthermore, on an AMD Instinct MI210 (64GB) device, we also validate the effectiveness of GraNT for graph-level tasks on the QM9 dataset, highlighting its cross-device effectiveness. As illustrated in Figure 12, GraNT (B) and GraNT (S) converge more quickly than "without GraNT", showing a faster decline in validation loss and MAE. Roughly speaking, GraNT (B) and GraNT (S) save more than 40% of the time compared to "without GraNT" while achieving comparable results. The training curves are displayed in Figure 13.

C.3. Node-level Tasks

We train the GCN with GraNT (B), GraNT (S), and "without GraNT" using a standard experimental setup for node-level tasks, while applying the same curriculum learning strategy used in graph-level tasks. To evaluate GraNT on synthetic data, we utilize graphons (Xu et al., 2021; Xia et al., 2023) to create the gen-reg and gen-cls datasets. Specifically, we set the resolution of the synthetic graphon to 1000 × 1000, which produces graphs that can be easily aligned by arranging the node degrees in strictly increasing order. Each graph contains approximately 100 nodes, with a total of 50k graphs across both datasets. Lastly, we use a 2-layer GCN with a specific initialization to assign regression properties or classification labels to each node, and the dimension of the node features in all graphs is set to 40. The training performance results for gen-reg and gen-cls are shown in Figures 14 and 15. In particular, Figure 14 clearly demonstrates that GraNT converges more quickly in terms of wallclock time than "without GraNT", which initially drops faster.
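The graphon-based construction of gen-reg and gen-cls can be sketched as follows: a graphon is a symmetric function W: [0,1]² → [0,1], and a graph is sampled by drawing one latent coordinate per node and connecting each pair independently with probability given by W. The snippet below is a simplified sketch; the particular graphon, its 1000 × 1000 resolution grid, and the label-assigning GCN from the paper are not reproduced here:

```python
import random

def sample_graph_from_graphon(W, n_nodes, rng):
    """Sample an undirected simple graph from graphon W: node i gets a latent
    coordinate u_i ~ U[0,1], and edge {i, j} is drawn with prob. W(u_i, u_j).
    Sorting the latents orders nodes consistently across sampled graphs."""
    u = sorted(rng.random() for _ in range(n_nodes))
    edges = set()
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < W(u[i], u[j]):
                edges.add((i, j))
    return u, edges

# An illustrative smooth graphon (NOT the one used in the paper):
# edge probability grows with the latent positions of both endpoints.
W = lambda x, y: 0.25 * (x + y) + 0.1

rng = random.Random(0)
latents, edges = sample_graph_from_graphon(W, n_nodes=100, rng=rng)
```

Because nodes with larger latent coordinates tend to have higher degrees under such a monotone graphon, sorting by degree (as the paper does) approximately recovers the latent ordering, which is what makes the sampled graphs easy to align.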
Figure 10: Validation set performance of graph-level tasks on ogbg-molpcba (multi-task classification): (a) validation loss and (b) validation AP vs. wallclock time (s).

Figure 11: Training set performance of graph-level tasks on ogbg-molpcba (multi-task classification): (a) training loss and (b) training AP vs. wallclock time (s).

This faster initial drop may occur because, for gen-reg, training on all the graphs provides sufficient information about the implicit mapping from graphs to their properties, resulting in better validation performance in the early epochs, even though it requires more computational time. Over time, GraNT selects more representative teaching graphs (those with larger gradients), gradually accumulating enough information about the implicit mapping. In Figure 15, GraNT (B) takes the least time and shows a faster decline in the training loss curve than GraNT (S), while also reaching the final ROC-AUC results more quickly. Additionally, the training ROC-AUC curve exhibits significant oscillations early on, which can be attributed to the frequent selection of a diverse set of teaching graphs and the label imbalance. In addition, we conduct experiments on node-level tasks with the gen-cls dataset using an AMD Instinct MI210 (64GB) device, further showcasing the cross-device generalizability of GraNT. Specifically, GraNT (B) and GraNT (S) save about one-third of the time compared to "without GraNT", while achieving even better performance.
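The selection principle mentioned above, preferring teaching graphs with larger gradients, can be sketched as a greedy top-k pick by per-graph gradient magnitude. This is a simplified stand-in for the GraNT teacher's actual selection criterion, with hypothetical names:

```python
def select_teaching_graphs(per_graph_grad_norms, ratio):
    """Return indices of the top `ratio` fraction of graphs by gradient norm.

    per_graph_grad_norms: one gradient-magnitude value per training graph.
    Larger-gradient graphs are treated as the most informative examples
    for the learner's next update (illustrative criterion).
    """
    k = max(1, int(len(per_graph_grad_norms) * ratio))
    order = sorted(range(len(per_graph_grad_norms)),
                   key=lambda i: per_graph_grad_norms[i], reverse=True)
    return sorted(order[:k])

# e.g. with ratio 0.5, the two largest-gradient graphs are kept:
print(select_teaching_graphs([0.2, 1.5, 0.1, 0.9], ratio=0.5))  # [1, 3]
```

In a full training loop, such a routine would be invoked at each scheduled selection epoch, with `ratio` starting from the start-ratio of Table 6 and the chosen subset used until the next selection.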
The validation and training performance results are shown in Figures 16 and 17, respectively. As observed, once the loss curves flatten at their lowest points, the final validation and training ROC-AUC curves for GraNT (B) and GraNT (S) surpass those of the "without GraNT" baseline, highlighting GraNT's effectiveness in improving performance while reducing time consumption.

Figure 12: Validation set performance of graph-level tasks for QM9 (regression) on the AMD device: (a) validation loss and (b) validation MAE vs. wallclock time (s).

Figure 13: Training set performance of graph-level tasks for QM9 (regression) on the AMD device: (a) training loss and (b) training MAE vs. wallclock time (s).

Figure 14: Training set performance of node-level tasks on gen-reg (regression): (a) training loss and (b) training MAE vs. wallclock time (s).

Figure 15: Training set performance of node-level tasks on gen-cls (classification): (a) training loss and (b) training ROC-AUC vs. wallclock time (s).
Figure 16: Validation set performance of node-level tasks for gen-cls (classification) on the AMD device: (a) validation loss and (b) validation ROC-AUC vs. wallclock time (s).

Figure 17: Training set performance of node-level tasks for gen-cls (classification) on the AMD device: (a) training loss and (b) training ROC-AUC vs. wallclock time (s).