# Robust Offline Active Learning on Graphs

Yuanchen Wu, Department of Statistics, The Pennsylvania State University, yqw5734@psu.edu
Yubai Yuan, Department of Statistics, The Pennsylvania State University, yvy5509@psu.edu

We consider the problem of active learning on graphs for node-level tasks, which has crucial applications in many real-world networks where labeling node responses is expensive. In this paper, we propose an offline active learning method that selects nodes to query by explicitly incorporating information from both the network structure and node covariates. Building on graph signal recovery theories and the random spectral sparsification technique, the proposed method adopts a two-stage biased sampling strategy that takes both informativeness and representativeness into consideration for node querying. Informativeness refers to the complexity of graph signals that are learnable from the responses of queried nodes, while representativeness refers to the capacity of queried nodes to control generalization errors given noisy node-level information. We establish a theoretical relationship between generalization error and the number of nodes selected by the proposed method. Our theoretical results demonstrate the trade-off between informativeness and representativeness in active learning. Extensive numerical experiments show that the proposed method is competitive with existing graph-based active learning methods, especially when node covariates and responses contain noise. Additionally, the proposed method is applicable to both regression and classification tasks on graphs.

1 Introduction

In many graph-based semi-supervised learning tasks for node-level prediction, labeled nodes are scarce, and the labeling process often incurs high costs in real-world applications. Randomly sampling nodes for labeling can be inefficient, as it overlooks label dependencies across the network. Active learning [29] addresses this issue by selecting informative nodes for labeling by human annotators, thereby improving the performance of downstream prediction algorithms.

Active learning is closely related to the optimal experimental design principle [36] in statistics. Traditional optimal experimental design methods select samples to maximize a specific statistical criterion [26, 19]. However, these methods are often not designed to incorporate network structure and are therefore inefficient for graph-based learning tasks. On the other hand, selecting informative nodes on a network has been studied extensively in the graph signal sampling literature [11, 18, 10, 28]. These strategies are typically based on the principle of network homophily, which assumes that connected nodes tend to have similar labels. However, a node's label often also depends on its individual covariates. Therefore, signal-sampling strategies that focus solely on network information may miss critical insights provided by covariates.

Recently, inspired by the great success of graph neural networks (GNNs) [16, 34] in graph-based machine learning tasks, many GNN-based active learning strategies have been proposed. Existing methods select nodes to query by maximizing information gain under different criteria, including information entropy [4], the number of influenced nodes [39], prediction uncertainty [22], expected error reduction [8], and expected model change [30]. Most of these information gain measurements
are defined in the spatial domain, leveraging the message-passing framework of GNNs to incorporate both network structure and covariate information. However, their effectiveness in maximizing learning outcomes is not guaranteed and can be difficult to evaluate. This challenge arises from the difficulty of quantifying node labeling complexity in the spatial domain due to intractable network topologies. While complexity measures exist for binary classification over networks [9], their extension to more complex graph signals incorporating node covariates remains unclear. This lack of well-defined complexity measures complicates performance analysis and creates a misalignment between graph-based information measurements and the gradient used to search the labeling function space, potentially leading to sub-optimal node selection.

Moreover, from a practical perspective, most of the previously discussed methods operate in an online setting, requiring prompt labeling feedback from an external annotator. However, this online framework is not always feasible when computational resources are limited [24] or when recurrent interaction between the algorithm and the annotator is impractical, such as in remote sensing or online marketing tasks [32, 35]. Additionally, both network data and annotator-provided labels may contain measurement errors. These methods often fail to account for noise in the training data [23], which can significantly degrade the prediction performance of models on unlabeled nodes [7, 21].

To address these challenges, we propose an offline active learning framework on graphs for node-level prediction tasks. Inspired by the theory of graph signal recovery [11, 18, 28] and GNNs, we first introduce a graph function space that integrates both node covariate information and network topology. The complexity of the node labeling function within this space is well-defined in the graph spectral domain. Accordingly, we propose a query information gain measurement aligned with the spectral-based complexity, allowing our strategy to achieve theoretically optimal sample complexity. Building on this, we develop a greedy node query strategy. The labels of the queried nodes help identify orthogonal components of the target labeling function, each with varying levels of smoothness across the network. To address data noise, the query procedure considers both informativeness, the contribution of queried nodes to recovering non-smooth components of the signal, and representativeness, the robustness of predictions against noise in the training data. Compared to existing methods, the proposed approach provides a provably effective strategy under general network structures and achieves higher query efficiency by incorporating both network and node covariate information. The proposed method identifies the labeling function via a bottom-up strategy: it first identifies the smoother components of the labeling function and then proceeds to more oscillatory components. Therefore, the proposed method is naturally robust to high-frequency noise in node covariates. We provide a theoretical guarantee for the effectiveness of the proposed method in semi-supervised learning tasks. The generalization error bound holds even when the node labels are noisy. Our theoretical results also highlight an interesting trade-off between informativeness and representativeness in graph-based active learning.
2 Preliminaries

We consider an undirected, weighted, connected graph $G = \{V, A\}$, where $V = \{1, 2, \ldots, n\}$ is the set of $n$ nodes and $A \in \mathbb{R}^{n \times n}$ is the symmetric adjacency matrix, with element $a_{ij} \ge 0$ denoting the edge weight between nodes $i$ and $j$. The degree matrix is defined as $D = \mathrm{diag}\{d_1, d_2, \ldots, d_n\}$, where $d_i = \sum_{1 \le j \le n} a_{ij}$ denotes the degree of node $i$. Additionally, we observe the node response vector $Y \in \mathbb{R}^{n \times 1}$ and the node covariate matrix $X = (X_1, \ldots, X_p) \in \mathbb{R}^{n \times p}$, where the $i$th row, $X_{i\cdot}$, is the $p$-dimensional covariate vector for node $i$. The linear space of all linear combinations of $\{X_1, \ldots, X_p\}$ is denoted as $\mathrm{Span}\{X_1, \ldots, X_p\}$. The normalized graph Laplacian matrix is defined as $L = I - D^{-1/2} A D^{-1/2}$, where $I$ is the $n \times n$ identity matrix. The matrix $L$ is symmetric and positive semi-definite, with $n$ real eigenvalues satisfying $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n \le 2$, and a corresponding set of eigenvectors denoted by $U = \{U_1, U_2, \ldots, U_n\}$. We use $b = O(a)$ to indicate $|b| \le M|a|$ for some $M > 0$. For a set of nodes $S \subseteq V$, $|S|$ indicates its cardinality, and $S^c = V \setminus S$ denotes the complement of $S$.

2.1 Graph signal representation

Consider a graph signal $f \in \mathbb{R}^n$, where $f(i)$ denotes the signal value at node $i$. For a set of nodes $S$, we define the subspace $\mathcal{L}_S := \{f \in \mathbb{R}^n \mid f(S^c) = 0\}$, where $f(S) \in \mathbb{R}^{|S|}$ represents the values of $f$ on nodes in $S$. In this paper, we consider both regression tasks, where $f(i)$ is a continuous response, and classification tasks, where $f(i)$ is a multi-class label. Since $U$ serves as a set of bases for $\mathbb{R}^n$, we can decompose $f$ in the graph spectral domain as $f = \sum_{j=1}^n \alpha_f(\lambda_j) U_j$, where $\alpha_f(\lambda_j) = \langle f, U_j \rangle$ is defined as the graph Fourier transform (GFT) coefficient corresponding to frequency $\lambda_j$. From a graph signal processing perspective, a smaller eigenvalue $\lambda_k$ indicates lower variation in the associated eigenvector $U_k$, reflecting smoother transitions between neighboring nodes. Therefore, the smoothness of $f$ over the network can be characterized by the magnitude of $\alpha_f(\lambda_j)$ at each frequency $\lambda_j$. More formally, we measure the signal complexity of $f$ using the bandwidth frequency $\omega_f = \sup\{\lambda_j \mid \alpha_f(\lambda_j) > 0\}$. Accordingly, we define the subspace of graph signals with a bandwidth frequency less than or equal to $\omega$ as $\mathcal{L}_\omega := \{f \in \mathbb{R}^n \mid \omega_f \le \omega\}$. It follows directly that for $\omega_1 < \omega_2$, $\mathcal{L}_{\omega_1} \subseteq \mathcal{L}_{\omega_2}$.

2.2 Active semi-supervised learning on graphs

The key idea in graph-based semi-supervised learning is to reconstruct the graph signal $f$ within a function space $\mathcal{H}_\omega(X, A)$ that depends on both the network structure and node-wise covariates, where the frequency parameter $\omega$ controls the size of the space to mitigate overfitting. Assume that $Y_i$ is the observed noisy realization of the true signal $f(i)$ at node $i$; active learning operates in a scenario where we have access to $Y_i$ on only a subset of nodes $S$, with $|S| \ll n$. The objective is to estimate $f$ within $\mathcal{H}_\omega(X, A)$ using $\{Y_i\}_{i \in S}$ by considering the empirical estimator of $f$ as
$$ f^*_S = \arg\min_{g \in \mathcal{H}_\omega(X, A)} \sum_{i \in S} l\big(Y_i, g(i)\big), \qquad (1) $$
where $l(\cdot)$ is a task-specific loss function. We denote $f^*$ as the minimizer of (1) when responses on all nodes are available, i.e., $f^* = f^*_V$. The goal of active semi-supervised learning is to design an appropriate function space $\mathcal{H}_\omega(X, A)$ and select an informative subset of nodes $S$ for querying responses, under the query budget $|S| \le B$, such that the estimation error is bounded as follows:
$$ \|f^*_S - f\|_2^2 \le \rho\, \|f^* - f\|_2^2. $$
For a fixed $B$, we wish to minimize the parameter $\rho > 0$, which converges to 0 as the query budget $B$ approaches $n$.
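To make the notation of Section 2 concrete, the following sketch (not part of the paper; a minimal NumPy illustration with a toy path graph) computes the normalized Laplacian, the GFT coefficients $\alpha_f(\lambda_j)$, and the bandwidth frequency $\omega_f$ of a signal.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Toy graph: a 5-node path.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

L = normalized_laplacian(A)
lam, U = np.linalg.eigh(L)            # eigenvalues 0 = lam[0] <= ... <= lam[-1] <= 2

# A signal supported on the two smoothest modes, and its GFT coefficients <f, U_j>.
f = U[:, :2] @ np.array([1.0, 0.5])
alpha = U.T @ f

# Bandwidth frequency: largest frequency carrying non-negligible energy.
tol = 1e-10
omega_f = lam[np.abs(alpha) > tol].max()
print("GFT coefficients:", np.round(alpha, 4))
print("bandwidth frequency omega_f:", round(omega_f, 4))
```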
3 Biased Sequential Sampling

In this section, we introduce a function space for recovering the graph signal. Leveraging this function space, we propose an offline node query strategy that integrates criteria of both node informativeness and representativeness to infer the labels of unannotated nodes in the network.

3.1 Graph signal function space

In semi-supervised learning tasks on networks, both the network topology and node-wise covariates are crucial for inferring the graph signal. To effectively incorporate this information, we propose a function class for reconstructing the graph signal that lies at the intersection of the graph spectral domain and the space of node covariates. Motivated by the graph Fourier transform, we define the following function class:
$$ \mathcal{H}_\omega(X, A) = \mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{Span}(X) := \mathrm{Span}\{\mathrm{Proj}_{\mathcal{L}_\omega} X_1, \ldots, \mathrm{Proj}_{\mathcal{L}_\omega} X_p\}, \quad \text{where } \mathrm{Proj}_{\mathcal{L}_\omega} X_i = \sum_{j: \lambda_j \le \omega} \langle X_i, U_j \rangle U_j. $$
Here, the choice of $\omega$ balances the information from node covariates and network structure. When $\omega = 2$, $\mathcal{H}_\omega(X, A)$ spans the full column space of the covariates, i.e., $\mathrm{Span}\{X_1, \ldots, X_p\}$, allowing full utilization of the original covariate space to estimate the graph signal, but without incorporating any network information. On the other hand, when $\omega$ is close to zero (consider, for example, the extreme case where $|\{U_j \mid \lambda_j \le \omega\}| = 2$ and $p \ge 2$), $\mathrm{Proj}_{\mathcal{L}_\omega} X_i$ reduces to $\mathrm{Span}\{U_1, U_2\}$, resulting in a loss of critical information provided by the original $X$. By carefully choosing $\omega$, however, this function space can offer two key advantages for estimating the graph signal. From a signal recovery perspective, $\mathcal{H}_\omega(X, A)$ imposes graph-based regularization over node covariates, enhancing generalizability when the dimension of covariates $p$ exceeds the query budget or even the network size, conditions commonly encountered in real applications. Additionally, covariate smoothing filters out signals in the covariates that are irrelevant to network-based prediction, thereby increasing robustness against potential noise in the covariates. From an active learning perspective, using $\mathcal{H}_\omega(X, A)$ enables a bottom-up query strategy that begins with a small $\omega$ to capture the smoothest global trends in the graph signal. As the labeling budget increases, $\omega$ is adaptively increased to capture more complex graph signals within the larger space $\mathcal{H}_\omega(X, A)$.

The graph signal $f$ can be approximated by its projection onto the space $\mathcal{H}_\omega(X, A)$. Specifically, let $U_d = \{U_1, U_2, \ldots, U_d\} \in \mathbb{R}^{n \times d}$ stack the $d$ leading eigenvectors of $L$, where $d = \max\{1 \le j \le n : \lambda_j \le \omega\}$. The graph signal estimation is then given by $U_d U_d^T X \beta$, where $\beta \in \mathbb{R}^p$ is a trainable weight vector. However, the parameters $\beta$ may become unidentifiable when the covariate dimension $p$ exceeds $d$. To address this issue, we reparameterize the linear regression as follows:
$$ U_d U_d^T X \beta = \tilde{X} \tilde{\beta}, \qquad (2) $$
where $\tilde{\beta} = \Sigma V_2^T \beta$ and $\tilde{X} = U_d V_1$. Here, $V_1 \in \mathbb{R}^{d \times r}$, $V_2 \in \mathbb{R}^{p \times r}$, and $\Sigma \in \mathbb{R}^{r \times r}$ denote the left and right singular vectors and the diagonal matrix of the $r$ singular values of $U_d^T X$, respectively. In the reparameterized form (2), the columns of $\tilde{X}$ serve as bases for $\mathcal{H}_\omega(X, A)$; thus $\dim(\mathcal{H}_\omega(X, A)) = \mathrm{rank}(\tilde{X}) = r \le \min\{d, p\}$. The transformed predictors $\tilde{X}$ capture the components of the node covariates constrained within the low-frequency graph spectrum. A graph signal $f \in \mathcal{H}_\omega(X, A)$ can be parameterized as a linear combination of the columns of $\tilde{X}$, with the corresponding weights $\tilde{\beta}$ identified via
$$ \hat{\beta} = \arg\min_{\tilde{\beta}} \sum_{i \in S} \big( f(i) - (\tilde{X}_S)_{i\cdot} \tilde{\beta} \big)^2, \qquad (3) $$
where $\tilde{X}_S \in \mathbb{R}^{|S| \times r}$ is the submatrix of $\tilde{X}$ containing rows indexed by the query set $S$, and $\{f(i)\}_{i \in S}$ represents the true labels for nodes in $S$. To achieve the identification of $f$, it is necessary that $|S| \ge r$; otherwise, there will be more parameters than equations in (3). More importantly, since $\mathrm{rank}(\tilde{X}_S) \le \mathrm{rank}(\tilde{X}) = r$, $f$ is only identifiable if $\tilde{X}_S$ has full column rank. Notice that $r$ increases monotonically with $\omega$. If $S$ is not carefully selected, the graph signal can only be identified in $\mathcal{H}_{\omega'}(X, A)$ for some $\omega' < \omega$, which is a subspace of $\mathcal{H}_\omega(X, A)$.
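The constrained basis $\tilde{X}$ in (2) is straightforward to construct. Below is a minimal sketch (our own helper, not from the paper) that builds $\tilde{X} = U_d V_1$ from the SVD of $U_d^T X$, assuming the Laplacian eigendecomposition is already available.

```python
import numpy as np

def constrained_basis(X, U, lam, omega, tol=1e-10):
    """Build the basis X_tilde = U_d V_1 of H_omega(X, A), following (2).

    X    : (n, p) covariate matrix
    U    : (n, n) Laplacian eigenvectors (columns), eigenvalues lam in ascending order
    omega: frequency cutoff
    """
    d = int(np.sum(lam <= omega))            # number of retained frequencies
    Ud = U[:, :d]
    # SVD of U_d^T X gives U_d^T X = V1 Sigma V2^T
    V1, Sigma, V2t = np.linalg.svd(Ud.T @ X, full_matrices=False)
    r = int(np.sum(Sigma > tol))             # r = rank(U_d^T X) <= min(d, p)
    return Ud @ V1[:, :r]                    # n x r basis; X_tilde^T X_tilde = I_r
```

Because both $U_d$ and $V_1$ have orthonormal columns, the returned basis satisfies $\tilde{X}^T \tilde{X} = I_r$, which is the normalization used in Section 3.3.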
3.2 Informative node selection

We first define the identification of $\mathcal{H}_\omega(X, A)$ by the node query set $S$ as follows:

Definition 1 A subset of nodes $S \subseteq V$ can identify the graph signal space $\mathcal{H}_\omega(X, A)$ up to frequency $\omega$ if, for any two functions $f_1, f_2 \in \mathcal{H}_\omega(X, A)$ such that $f_1(i) = f_2(i)$ for all $i \in S$, it follows that $f_1(j) = f_2(j)$ for all $j \in V$.

Intuitively, the informativeness of a set $S$ can be quantified by the frequency $\omega$ corresponding to the space $\mathcal{H}_\omega(X, A)$ that can be identified. To select informative nodes, we need to bridge the query set $S$ in the spatial domain with $\omega$ in the spectral domain. To achieve this, we consider the counterpart of the function space $\mathcal{H}_\omega(X, A)$ in the spatial domain. Specifically, we introduce the projection space with respect to a subset of nodes $S$ as follows:
$$ \mathcal{H}_S(X, A) := \mathrm{Span}\{X^{(S)}_1, \ldots, X^{(S)}_p\}, \quad \text{where } X^{(S)}_p(i) = \begin{cases} X_p(i) & \text{if } i \in S, \\ 0 & \text{if } i \in S^c. \end{cases} $$
Here, $X_p(i)$ denotes the $p$-th covariate for node $i$ in $X$. Theorem 3.1 establishes a connection between the two graph signal spaces $\mathcal{H}_\omega(X, A)$ and $\mathcal{H}_S(X, A)$, providing a metric for evaluating the informativeness of querying a subset of nodes on the graph.

Theorem 3.1 Any graph signal $f \in \mathcal{H}_\omega(X, A)$ can be identified using labels on a subset of nodes $S$ if and only if:
$$ \omega < \omega(S) := \inf_{g \in \mathcal{H}_{S^c}(X, A)} \omega_g, \qquad (4) $$
where $S^c$ denotes the complement of $S$ in $V$.

We denote the quantity $\omega(S)$ in (4) as the bandwidth frequency with respect to the node set $S$. This quantity can be explicitly calculated and measures the size of the space $\mathcal{H}_\omega(X, A)$ that can be recovered from the subset of nodes $S$. The goal of the active learning strategy is to select $S$ within a given budget to maximize the bandwidth frequency $\omega(S)$, thus enabling the identification of graph signals with the highest possible complexity. To calculate the bandwidth frequency $\omega(S)$, consider any graph signal $g$ and its components with non-zero frequency $\Lambda_g := \{\lambda_i \mid \alpha_g(\lambda_i) > 0\}$. We use the fact that
$$ \lim_{k \to \infty} \Big( \sum_{j: \lambda_j \in \Lambda_g} c_j \lambda_j^k \Big)^{1/k} = \max_{\lambda_j \in \Lambda_g} (\lambda_j), \quad \text{where } \sum_{j: \lambda_j \in \Lambda_g} c_j = 1 \text{ and } 0 \le c_j \le 1. $$
Combined with the Rayleigh quotient representation of eigenvalues, the bandwidth frequency $\omega_g$ can be calculated as $\omega_g = \lim_{k \to \infty} \omega_g(k)$, where
$$ \omega_g(k) = \Big( \frac{g^T L^k g}{g^T g} \Big)^{1/k}. $$
As a result, we can approximate the bandwidth $\omega_g$ using $\omega_g(k)$ for a large $k$. Maximizing $\omega(S)$ over $S$ then transforms into the following optimization problem:
$$ S = \arg\max_{S: |S| \le B} \hat{\omega}(S), \quad \text{where } \hat{\omega}(S) := \inf_{g \in \mathcal{H}_{S^c}(X, A)} \omega_g(k), \qquad (5) $$
where $B$ represents the budget for querying labels.
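A sketch of how $\omega_g(k)$ and the set-level quantity $\hat{\omega}(S)$ in (5) can be evaluated numerically is given below. This is our own simplified implementation under the assumption that $\mathcal{H}_{S^c}(X, A)$ is spanned by the covariate columns zeroed out on $S$; the infimum over that subspace is obtained from the restricted Rayleigh quotient of $L^k$.

```python
import numpy as np

def omega_g(g, L, k=8):
    """Rayleigh-quotient approximation (g^T L^k g / g^T g)^(1/k) of the bandwidth of g."""
    v = g.copy()
    for _ in range(k):
        v = L @ v
    return float(g @ v / (g @ g)) ** (1.0 / k)

def omega_hat(S, X, L, k=8, tol=1e-10):
    """Approximate bandwidth frequency of a query set S: the smallest omega_g(k)
    over signals g in H_{S^c}(X, A), the span of the covariates zeroed out on S."""
    n = X.shape[0]
    mask = np.ones(n)
    mask[list(S)] = 0.0
    Xc = mask[:, None] * X                          # covariate columns restricted to S^c
    Uc, s, _ = np.linalg.svd(Xc, full_matrices=False)
    Q = Uc[:, s > tol]                              # orthonormal basis of H_{S^c}(X, A)
    Lk = np.linalg.matrix_power(L, k)
    evals = np.linalg.eigvalsh(Q.T @ Lk @ Q)        # Rayleigh quotients of L^k on the subspace
    return max(float(evals.min()), 0.0) ** (1.0 / k)
```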
Due to the combinatorial complexity of directly solving the optimization problem (5) by simultaneously selecting $S$, we propose a greedy selection strategy as a continuous relaxation of (5). The selection procedure starts with $S = \emptyset$ and sequentially adds to $S$ the node that maximizes the increase in $\omega(S)$ until the budget is reached. We introduce an $n$-dimensional vector $t = (t_1, t_2, \ldots, t_n)^T$ with $0 \le t_i \le 1$, and define the corresponding diagonal matrix $D(t)$ with diagonal entries given by $t$. This allows us to encode the set of query nodes using $t = 1_S$, where $1_S(i) = 1$ if $i \in S$ and $1_S(i) = 0$ if $i \in S^c$. We then consider the space spanned by the columns of $D(t)X$, denoted $\mathrm{Span}\{D(t)X\}$, and the following relation holds: $\mathcal{H}_{S^c}(X, A) = \mathrm{Span}\{D(1_{S^c})X\}$.

Intuitively, $\mathrm{Span}\{D(t)X\}$ acts as a differentiable relaxation of the subspace $\mathcal{H}_S(X, A)$, enabling perturbation analysis of the bandwidth frequency when a new node is added to $S$. The projection operator associated with $\mathrm{Span}\{D(t)X\}$ can be explicitly expressed as
$$ P(t) = D(t) X \big( X^T D(t^2) X \big)^{-1} X^T D(t). $$
To quantify the increase in $\hat{\omega}(S)$ when adding a new node to $S$, we consider the following regularized optimization problem:
$$ \lambda_\alpha(t) = \min_{\phi} \frac{\phi^T L^k \phi}{\phi^T \phi} + \alpha\, \phi^T \big( I - P(t) \big) \phi. \qquad (6) $$
The penalty term on the right-hand side of (6) encourages the graph signal $\phi$ to remain in $\mathcal{H}_{S^c}(X, A)$. As the parameter $\alpha$ approaches infinity and $t = 1_{S^c}$, the minimum $\lambda_\alpha(1_{S^c})$ in (6) converges to $\hat{\omega}(S)$ in (5). The information gain from labeling a node $i \in S^c$ can then be quantified by the gradient of the bandwidth frequency as $t_i$ decreases from 1 to 0:
$$ \Delta_i = 2\alpha\, \phi^T \frac{\partial P(t)}{\partial t_i} \phi \,\Big|_{t = 1_{S^c}}, \qquad (7) $$
where $\phi$ is the minimizer of (6) at $t = 1_{S^c}$, which corresponds to the eigenvector associated with the smallest non-zero eigenvalue of the matrix $P(1_{S^c}) L^k P(1_{S^c})$. We then select the node $i^* = \arg\max_{j \in S^c} \Delta_j$ and update the query set as $S = S \cup \{i^*\}$. The explicit representation of $\Delta_i$ in (7) can be found in Appendix B.5.
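The informative step can be sketched as follows. This is our own simplification: instead of evaluating (7) exactly, it extracts the eigenvector $\phi$ of $P(1_{S^c}) L^k P(1_{S^c})$ attached to the smallest non-zero eigenvalue and scores each node by its squared loading on $\phi$, the per-node frequency gain used in the analysis of Appendix A.2.

```python
import numpy as np

def informativeness_scores(S, X, L, k=8, tol=1e-10):
    """Score unqueried nodes by their loading on the least-smooth identifiable direction."""
    n = X.shape[0]
    mask = np.ones(n)
    mask[list(S)] = 0.0
    Xc = mask[:, None] * X                          # covariates zeroed on S
    Uc, s, _ = np.linalg.svd(Xc, full_matrices=False)
    Q = Uc[:, s > tol]                              # orthonormal basis of H_{S^c}(X, A)
    P = Q @ Q.T                                     # projector P(1_{S^c})
    M = P @ np.linalg.matrix_power(L, k) @ P
    evals, evecs = np.linalg.eigh((M + M.T) / 2.0)
    nonzero = np.where(evals > tol)[0]
    if nonzero.size == 0:                           # nothing left to identify
        return np.full(n, -np.inf)
    phi = evecs[:, nonzero[0]]                      # eigenvector of the smallest non-zero eigenvalue
    scores = phi ** 2                               # squared node loadings (cf. Appendix A.2)
    scores[list(S)] = -np.inf                       # exclude already-queried nodes
    return scores

# Greedy informative choice within a candidate set B_m:
#   i_star = max(B_m, key=lambda i: informativeness_scores(S, X, L)[i])
```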
3.3 Representative node selection

In real-world applications, we often have access only to a perturbed version of the true graph signals, denoted as $Y = f + \xi$, where $\xi$ represents node labeling noise that is independent of the network data. When replacing the true label $f(i)$ with $Y(i)$ in (3), this noise term introduces both finite-sample bias and variance in the estimation of the graph signal $f$. As a result, we aim to query nodes that are sufficiently representative of the entire covariate space to bound the generalization error. To achieve this, we introduce principled randomness into the deterministic selection procedure described in Section 3.2 to ensure that $S$ includes nodes that are both informative and representative. The modified graph signal estimation procedure is given by:
$$ \hat{\beta} = \arg\min_{\tilde{\beta}} \sum_{i \in S} s_i \big( Y(i) - (\tilde{X}_S)_{i\cdot} \tilde{\beta} \big)^2, \qquad (8) $$
where $s_i$ is the weight associated with the probability of selecting node $i$ into $S$. Specifically, the generalization error of the estimator in (8) is determined by the smallest eigenvalue of $\tilde{X}_S^T \tilde{X}_S$, denoted as $\lambda_{\min}(\tilde{X}_S^T \tilde{X}_S)$. Given that $\lambda_{\min}(\tilde{X}^T \tilde{X}) = 1$, our goal is to increase the representativeness of $S$ such that $\lambda_{\min}(\tilde{X}_S^T \tilde{X}_S)$ is lower-bounded by:
$$ \lambda_{\min}\big( \tilde{X}_S^T \tilde{X}_S \big) \ge \big( 1 - o_{|S|}(1) \big)\, \lambda_{\min}\big( \tilde{X}^T \tilde{X} \big). \qquad (9) $$
However, the informative selection method in Section 3.2 does not guarantee (9). To address this, we propose a sequential biased sampling approach that balances informative node selection with generalization error control. The key idea to achieve a lower bound for $\lambda_{\min}(\tilde{X}_S^T \tilde{X}_S)$ is to use spectral sparsification techniques for positive semi-definite matrices [15]. Let $v_i \in \mathbb{R}^{1 \times r}$ denote the $i$-th row of the constrained basis $\tilde{X}$. By the definition of $\tilde{X}$, it follows that $I_{r \times r} = \sum_{i=1}^n v_i^T v_i$. Inspired by the randomized sampling approach in [17], we propose a biased sampling strategy to construct $S$ with $|S| \ll n$ and weights $\{s_i > 0, i \in S\}$ such that $\sum_{i \in S} s_i v_i^T v_i \approx I$. In other words, the weighted covariance matrix of the query set $S$ satisfies $\lambda_{\min}(\tilde{X}_S^T W_S \tilde{X}_S) \approx 1$, where $W_S$ is a diagonal matrix with $s_i$ on its diagonal. We outline the detailed sampling procedure as follows. After the $(t-1)$th selection, let the set of query nodes be $S_{t-1}$ with corresponding node-wise weights $W_{t-1} = \{s_j > 0 \mid j \in S_{t-1}\}$. The covariance matrix of $S_{t-1}$ is given by $C_{t-1} \in \mathbb{R}^{r \times r}$, defined as $C_{t-1} = \sum_{j \in S_{t-1}} s_j v_j^T v_j$.

To analyze the behavior of eigenvalues as the query set is updated, we follow [17] and introduce the potential function:
$$ \Phi_{t-1} = \mathrm{Tr}\big[ (u_{t-1} I - C_{t-1})^{-1} \big] + \mathrm{Tr}\big[ (C_{t-1} - l_{t-1} I)^{-1} \big], \qquad (10) $$
where $u_{t-1}$ and $l_{t-1}$ are constants such that $l_{t-1} < \lambda_{\min}(C_{t-1}) \le \lambda_{\max}(C_{t-1}) < u_{t-1}$, and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. The potential function $\Phi_{t-1}$ quantifies the coherence among all eigenvalues of $C_{t-1}$. To construct the candidate set $B_m$, we sample node $i$ and update $C_t$, $u_t$, and $l_t$ such that all eigenvalues of $C_t$ remain within the interval $(l_t, u_t)$. To achieve this, we first calculate the node-wise probabilities $\{p_i\}_{i=1}^n$ as:
$$ p_i = \big[ v_i (u_{t-1} I - C_{t-1})^{-1} v_i^T + v_i (C_{t-1} - l_{t-1} I)^{-1} v_i^T \big] / \Phi_{t-1}, \qquad (11) $$
where $\sum_{i=1}^n p_i = 1$. We then sample $m$ nodes into $B_m$ according to $\{p_i\}_{i=1}^n$. For each node $i \in B_m$, the corresponding weight is given by $s_i = \frac{\epsilon}{p_i \Phi_{t-1}}$, with $0 < \epsilon < 1$. After obtaining the candidate set $B_m$, we apply the informative node selection criterion $\Delta_i$ introduced in Section 3.2, i.e., selecting the node $i^* = \arg\max_{i \in B_m} \Delta_i$, and update the query set and weights as follows:
$$ \text{if } i^* \in S_{t-1}^c: \quad S_t = S_{t-1} \cup \{i^*\}, \quad W_t = W_{t-1} \cup \{s_{i^*}\}; \qquad \text{if } i^* \in S_{t-1}: \quad s_{i^*} \leftarrow s_{i^*} + \frac{\epsilon}{p_{i^*} \Phi_{t-1}}. $$
We then update the lower and upper eigenvalue bounds as follows:
$$ u_t = u_{t-1} + \frac{\epsilon}{\Phi_{t-1}(1 - \epsilon)}, \qquad l_t = l_{t-1} + \frac{\epsilon}{\Phi_{t-1}(1 + \epsilon)}. \qquad (12) $$
The update rule ensures that $u_t - l_t$ increases at a slower rate than $u_t$, leading to the convergence of the gap between the largest and smallest eigenvalues of $\tilde{X}_S^T W_S \tilde{X}_S$, thereby controlling the condition number. Accordingly, the covariance matrix is updated with the selected node $i^*$ as:
$$ C_t = C_{t-1} + \frac{\epsilon}{p_{i^*} \Phi_{t-1}} v_{i^*}^T v_{i^*}. \qquad (13) $$
With the covariance matrix update rule in (13), the average increment is $E(C_t) - C_{t-1} = \sum_{i=1}^n p_i s_i v_i^T v_i = \frac{\epsilon}{\Phi_{t-1}} I$. Intuitively, the selected node allows all eigenvalues of $C_{t-1}$ to increase at the same rate on average. This ensures that $\lambda_{\min}(\tilde{X}_S^T \tilde{X}_S)$ continues to approach $\lambda_{\min}(\tilde{X}^T \tilde{X}) = 1$ during the selection process, thus driving the smallest eigenvalue away from zero. Additionally, the selected node remains locally informative within the candidate set $B_m$. Compared with selecting from the entire set of nodes, selecting from a subset serves as a regularization on informativeness maximization, achieving a balance between informativeness and representativeness for node queries.
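One round of the representative stage can be sketched as follows. This is our own code: it evaluates the potential (10), the sampling probabilities (11), and the candidate weights, and updates the eigenvalue barriers following Step 4 of Algorithm 1 (which uses the factor $1 \pm m\epsilon$). The barriers can be initialized with $C = 0$, $u_0 = 2r/\epsilon$, and $l_0 = -2r/\epsilon$, which we take to be the intended initialization in Algorithm 1.

```python
import numpy as np

def representative_candidates(V_rows, C, u, l, m, eps, rng):
    """One sampling round of the representative stage.

    V_rows: (n, r) rows v_i of X_tilde;  C: current weighted covariance sum_j s_j v_j^T v_j
    u, l  : current upper/lower eigenvalue barriers; m: candidate-set size; eps in (0, 1/m)
    """
    n, r = V_rows.shape
    upper_inv = np.linalg.inv(u * np.eye(r) - C)
    lower_inv = np.linalg.inv(C - l * np.eye(r))
    Phi = np.trace(upper_inv) + np.trace(lower_inv)                 # potential (10)
    R = np.einsum('ij,jk,ik->i', V_rows, upper_inv, V_rows) \
      + np.einsum('ij,jk,ik->i', V_rows, lower_inv, V_rows)
    p = R / Phi                                                     # probabilities (11); sums to 1
    B_m = rng.choice(n, size=m, replace=True, p=p)                  # candidate set
    weights = eps / (p * Phi)                                       # w_i = eps / (p_i * Phi); Algorithm 1 rescales by kappa
    u_new = u + eps / (Phi * (1 - m * eps))                         # barrier updates (Step 4 of Algorithm 1)
    l_new = l + eps / (Phi * (1 + m * eps))
    return B_m, p, weights, u_new, l_new

# After choosing i_star from B_m by informativeness, the covariance update (13) is:
#   C += weights[i_star] * np.outer(V_rows[i_star], V_rows[i_star])
```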
3.4 Node query algorithm and graph signal recovery

We summarize the biased sampling selection strategy in Algorithm 1. At a high level, each step in the biased sampling query strategy consists of two stages. First, we use randomized spectral sparsification to sample $m \ll n$ nodes and collect them into a candidate set $B_m$. Intuitively, the covariance matrix on the updated $S$ maintains lower-bounded eigenvalues if a node from $B_m$ is added to $S$. In the second stage, we select one node from $B_m$ based on the informativeness criterion in Section 3.2 to achieve a significant frequency increase in (7). For initialization, the dimension of the network spectrum $d$, the size of the candidate set $m$, and the constant $0 < \epsilon < 1$ for spectral sparsification need to be specified. Based on the discussion at the end of Section 3.1, the dimension of the function space $\mathcal{H}_\omega(X, A)$ is at most $B$, where $B$ is the budget for label queries. Therefore, we can set $d = \min\{p, B\}$. The parameters $m$ and $\epsilon$ jointly control the condition number $\lambda_{\max}(\tilde{X}_S^T W_S \tilde{X}_S) / \lambda_{\min}(\tilde{X}_S^T W_S \tilde{X}_S)$.

Algorithm 1: Biased Sampling Query Algorithm
Require: $t = 0$, $C_0 = 0$, the set of query nodes $S_0 = \emptyset$, the set of node weights $W_0 = \emptyset$, spectral dimension $d$, size of candidate set $m$, constant $0 < \epsilon < 1/m$, query budget $B$.
Initialization: Compute the SVD decomposition $U_d^T X = V_1 \Sigma V_2^T$, and set $\tilde{X} = U_d V_1$, $r = \mathrm{rank}(\tilde{X})$, $u_0 = 2r/\epsilon$, $l_0 = -2r/\epsilon$, $\kappa = 2r(1 - m^2\epsilon^2)/(m\epsilon^2)$.
while $B > 0$ do
  Step 1: Calculate $\Phi_t$ as in (10) and the node-wise probabilities $\{p_i\}_{i=1}^n$ using (11).
  Step 2: Sample $m$ nodes with replacement according to probabilities $\{p_i\}_{i=1}^n$ to form the candidate set $B_m$.
  Step 3: Select node $i^* = \arg\max_{i \in B_m} \Delta_i$ and calculate its weight $w_{i^*} = \frac{\epsilon}{p_{i^*} \Phi_t}$. If $i^* \notin S_t$, then update $S_{t+1} = S_t \cup \{i^*\}$ and $W_{t+1} = W_t \cup \{s_{i^*}\}$ with $s_{i^*} = \frac{w_{i^*}}{\kappa}$. Else if $i^* \in S_t$, then update $s_{i^*} \leftarrow s_{i^*} + \frac{w_{i^*}}{\kappa}$.
  Step 4: Update $C_t$, $u_t$, $l_t$, $B$, and $t$ as: $C_{t+1} = C_t + w_{i^*} v_{i^*}^T v_{i^*}$, $u_{t+1} = u_t + \frac{\epsilon}{\Phi_t(1 - m\epsilon)}$, $l_{t+1} = l_t + \frac{\epsilon}{\Phi_t(1 + m\epsilon)}$, $B = B - 1$, $t = t + 1$.
end while
Query: Label all nodes in $S$ through an external annotator.
Output: Set of queried nodes $S$, annotated responses $\{Y_i \mid i \in S\}$, smoothed covariates $\tilde{X}_S$, and weights of queried nodes $W$.

Based on the output from Algorithm 1, we solve the weighted least squares problem in (8):
$$ \hat{\beta} = \arg\min_{\tilde{\beta}} \sum_{i \in S} s_i \big( Y(i) - (\tilde{X}_S)_{i\cdot} \tilde{\beta} \big)^2, \qquad (14) $$
and recover the graph signal on the entire network as $\hat{f} = \tilde{X} \hat{\beta}$. The proposed method is illustrated for the regression task; an extension to the classification task is discussed in Appendix B.2.
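Given the output of Algorithm 1, the recovery step (14) is a weighted least-squares fit followed by prediction on all nodes. The sketch below is our own minimal helper; it assumes `X_tilde` is the constrained basis, `S` the queried node indices, `y_S` their annotated responses, and `s_S` the weights returned by the algorithm.

```python
import numpy as np

def recover_signal(X_tilde, S, y_S, s_S):
    """Weighted least squares (14) on the queried nodes, then prediction on all nodes."""
    Xs = X_tilde[list(S)]                              # (|S|, r) rows of the queried nodes
    w = np.sqrt(np.asarray(s_S, dtype=float))          # weighted LS via row rescaling
    beta_hat, *_ = np.linalg.lstsq(w[:, None] * Xs, w * np.asarray(y_S, dtype=float), rcond=None)
    return X_tilde @ beta_hat                          # f_hat on the entire network

# Usage: f_hat = recover_signal(X_tilde, S, y_S, s_S); predictions for unlabeled nodes
# are simply the entries of f_hat outside S.
```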
4 Theoretical Analysis

In this section, we present a theoretical analysis of the proposed node query strategy. The results are divided into two parts: the first focuses on the local information gain of the selection process, while the second examines the global performance of graph signal recovery. Given a set of query nodes $S$, the information gain from querying the label of a new node $i$ is measured as the increase in bandwidth frequency, defined as $\Delta_i := \omega(S \cup \{i\}) - \omega(S)$. We provide a step-by-step analysis of the proposed method by comparing the increase in bandwidth frequency with that of random selection.

Theorem 4.1 Define $d_{\min} = \min_i \{d_i\}$, where $d_i$ denotes the degree of node $i$. Let $S$ represent the set of queried nodes prior to the $s$th selection. Denote the adjacency matrix of the subgraph excluding $S$ as $A_{(n-|S|) \times (n-|S|)}$. Let $\Delta^R_s$ and $\Delta^B_s$ denote the increase in bandwidth frequency resulting from the $s$th label query on a node selected by random sampling and by the proposed sampling method, respectively. Let $j$ denote the node with the largest magnitude in the eigenvector corresponding to the smallest non-zero eigenvalue of $L_{S^c}$. Then we have:
$$ E(\Delta^R_s) = \Omega\Big(\frac{1}{n}\Big) \quad (1), \qquad E(\Delta^B_s) - E(\Delta^R_s) > \Omega\Big(\frac{1}{\eta_0 \eta_1^3 d_{\min}^2}\Big) - \Omega\Big(\frac{1}{n}\Big) \quad (2), $$
where $f = \Omega(g)$ if $c_1 \le |f/g| \le c_2$ for constants $c_1, c_2$ when $n$ is sufficiently large. Inequality (2) holds given $m$ satisfying $\big(\frac{n-m}{n}\big)^m \big(\frac{n-m-d_{\min}}{n-d_{\min}}\big)^{d_{\min}} \sqrt{d_{\min}} = O(1)$. The expectation $E(\cdot)$ is taken over the randomness of node selection. Both $\eta_0$ and $\eta_1$ are network-related quantities, where $\eta_0 \asymp \#\big|\{i : |\tfrac{d_i - d_j}{d_{\min}}| \le 1\}\big|$ and $\eta_1 \asymp \max_i \big(\tfrac{d_i}{d_{\min}}\big)$.

Theorem 4.1 provides key insights into the information gain achieved through different node label querying strategies. While random selection yields a constant average information gain, the proposed biased sampling method guarantees a higher information gain under mild assumptions. In Theorem 4.2, we provide the generalization error bound for the proposed sampling method under the weighted OLS estimator. To formally state Theorem 4.2, we first introduce the following two assumptions:

Assumption 1 For the underlying graph signal $f$, there exists a bandwidth frequency $\omega_0$ such that $f \in \mathcal{H}_{\omega_0}(X, A)$.

Assumption 2 The observed node-wise response $Y_i$ can be decomposed as $Y_i = f(i) + \xi_i$, where $\{\xi_i\}_{i=1}^n$ are independent random variables with $E(\xi_i) = 0$ and $\mathrm{Var}(\xi_i) \le \sigma^2$.

Theorem 4.2 Under Assumptions 1 and 2, for the graph signal estimation $\hat{f}$ obtained by training (14) on $B$ labeled nodes selected by Algorithm 1, with probability greater than $1 - \frac{2m}{t}$, where $t > 2m$, we have
$$ E_Y \|\hat{f} - f\|_2^2 \le O\Big( \Big[ \frac{r_d t}{B} + \Big(\frac{r_d t}{B}\Big)^{3/2} + \Big(\frac{r_d t}{B}\Big)^2 \Big] \Big( n\sigma^2 + \sum_{i > d,\, i \in \mathrm{supp}(f)} \alpha_i^2 \Big) \Big) + \sum_{i > d,\, i \in \mathrm{supp}(f)} \alpha_i^2, \qquad (15) $$
where $\alpha_i := \langle f, U_i \rangle$, $\mathrm{supp}(f) := \{i : 1 \le i \le n, |\alpha_i| > 0\}$, and $r_d = \mathrm{rank}(U_d^T X)$. $E_Y(\cdot)$ denotes the expected value with respect to the randomness in observed responses.

Theorem 4.2 reveals the trade-off between informativeness and representativeness in graph-based active learning, which is controlled by the spectral dimension $d$. Since $r_d$ is a monotonic function of $d$, a larger $d$ reduces representativeness among queried nodes, thereby increasing the variance associated with controlling the condition number (i.e., the first three terms). On the other hand, a larger $d$ reduces the approximation bias to the true graph signal (i.e., the fifth and last terms) by including more informative nodes for capturing less smooth signals.

Figure 1: Prediction performance on unlabeled nodes at different levels of labeling noise ($\sigma^2$) for three simulated networks: (a) Small World, (b) Community, (c) Scale-free. All three simulated networks have n = 100 nodes, with the number of labeled nodes fixed at 25.

5 Numerical Studies

In this section, we conduct extensive numerical studies to evaluate the proposed active learning strategy for node-level prediction tasks on both synthetic and real-world networks. For the synthetic networks, we focus on regression tasks with continuous responses, while for the real-world networks, we consider classification tasks with discrete node labels.

5.1 Synthetic networks

We consider three different network topologies generated by widely studied statistical network models: the Watts-Strogatz model [33] for small-world properties, the stochastic block model [13] for community structure, and the Barabási-Albert model [3] for scale-free properties. Node responses are generated as $Y = f + \xi$, where $f$ is the true graph signal and $\xi \sim N(0, \sigma^2 I_n)$ is Gaussian noise. The true signal is constructed as $f = U_d \beta$, where $\beta$ is the linear coefficient and $U_d$ denotes the leading $d$ eigenvectors of the normalized graph Laplacian of the synthetic network. Since our theoretical analysis assumes that the observed node covariates $X$ contain noise, we generate $X$ as a perturbed version of $U_d$ by adding non-leading eigenvectors of the normalized graph Laplacian. The detailed simulation settings can be found in Appendix C.1.

We compare our algorithm with several offline active learning methods:
1. D-optimal [26] selects a subset of nodes $S$ to maximize the determinant of the observed covariate information matrix $X_S^T X_S$.
2. RIM [38] selects nodes to maximize the number of influenced nodes.
3. GPT [22] and SPA [27] split the graph into disjoint partitions and select informative nodes from each partition.

After the node query step, we fit the weighted linear regression from (14) on the labeled nodes, using the smoothed covariates $\tilde{X}$, to estimate the linear coefficient $\hat{\beta}$ and predict the response $\hat{Y}$ for the unlabeled nodes. In Figure 1, we plot the prediction MSE of the proposed method against baselines on unlabeled nodes for various levels of labeling noise $\sigma^2 \in \{0.5, 0.6, 0.7, 0.8, 0.9, 1\}$. The results show that the proposed method significantly outperforms all baselines across all simulation settings and exhibits strong robustness to noise. The inferior performance of the baselines can be attributed to several factors. D-optimal and RIM fail to account for noise in the node covariates.
Meanwhile, partition-based methods like GPT and SPA are highly sensitive to hyperparameters, such as the optimal number of partitions, which limits their generalization to networks lacking a clear community structure.

5.2 Real-world networks

We evaluate the proposed method on node classification tasks using real-world datasets, which include five networks with varying homophily levels (from high to low: Cora, PubMed, Citeseer, Chameleon, and Texas) and two large-scale networks (Ogbn-Arxiv and Co-Physics). In addition to the offline methods described in Section 5.1, we also compare our approach with two GNN-based online active learning methods, AGE [4] and IGP [39]. In each GNN iteration, AGE selects nodes to maximize a linear combination of heuristic metrics, while IGP selects nodes that maximize information gain propagation.

Table 1: Test accuracy (Micro-F1 %) on five real-world networks with varying levels of homophily. The edge homophily ratio h of a network is defined as the fraction of edges that connect nodes with the same class label; a higher h indicates a network with stronger homophily (Cora h = 0.81, Pubmed h = 0.80, Citeseer h = 0.74, Chameleon h = 0.23, Texas h = 0.11). The number in parentheses in each column header is the number of labeled nodes.

| Method | Cora (35) | Cora (70) | Cora (140) | Pubmed (15) | Pubmed (30) | Pubmed (60) | Citeseer (30) | Citeseer (60) | Citeseer (120) | Chameleon (50) | Chameleon (75) | Chameleon (100) | Texas (15) | Texas (30) | Texas (45) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 68.2±1.3 | 74.5±1.0 | 78.9±0.9 | 71.2±1.8 | 74.9±1.6 | 78.4±0.5 | 57.7±0.8 | 65.3±1.4 | 70.7±0.7 | 22.4±2.6 | 22.1±2.5 | 21.8±2.1 | 67.0±3.3 | 69.9±3.3 | 73.8±3.2 |
| AGE | 72.1±1.1 | 78.0±0.9 | 82.5±0.5 | 74.9±1.1 | 77.5±1.2 | 79.4±0.7 | 65.3±1.1 | 67.7±0.5 | 71.4±0.5 | 30.0±4.5 | 28.2±4.9 | 28.6±5.0 | 67.9±2.6 | 68.8±3.3 | 72.1±3.6 |
| GPT | 77.4±1.6 | 81.6±1.2 | 86.5±1.2 | 77.0±3.1 | 79.9±2.8 | 81.5±1.6 | 67.9±1.8 | 71.0±2.4 | 74.0±2.0 | 14.1±2.5 | 15.8±2.2 | 16.4±2.4 | 72.6±2.0 | 72.5±3.6 | 74.6±1.8 |
| RIM | 77.5±0.8 | 81.6±1.1 | 84.1±0.8 | 75.0±1.5 | 77.2±0.6 | 80.2±0.4 | 67.5±0.7 | 70.0±0.6 | 73.2±0.7 | 35.5±3.7 | 42.8±3.0 | 34.4±3.5 | 68.5±3.7 | 78.4±3.0 | 74.6±3.7 |
| IGP | 77.4±1.7 | 81.7±1.6 | 86.3±0.7 | 78.5±1.2 | 82.3±1.4 | 83.5±0.5 | 68.2±1.1 | 72.1±0.9 | 75.8±0.4 | 32.5±3.6 | 33.7±3.1 | 33.4±3.5 | 70.8±3.7 | 69.9±3.3 | 76.1±3.6 |
| SPA | 76.5±1.9 | 80.3±1.6 | 85.2±0.6 | 75.4±1.6 | 78.3±2.0 | 73.5±1.2 | 66.4±2.2 | 69.3±1.7 | 73.5±2.0 | 30.2±3.2 | 28.5±2.9 | 31.0±4.4 | 72.0±3.2 | 72.5±3.1 | 74.6±2.1 |
| Proposed | 78.4±1.7 | 81.8±1.8 | 86.5±1.1 | 78.9±1.1 | 79.1±0.6 | 82.3±0.6 | 69.1±1.0 | 72.2±1.3 | 75.5±0.8 | 35.1±2.8 | 35.7±3.0 | 37.2±3.0 | 75.0±1.9 | 79.5±0.8 | 80.4±2.7 |

Table 2: Test accuracy (Macro-F1 % and Micro-F1 %) on two real-world large-scale networks: Ogbn-Arxiv (n = 169,343) and Co-Physics (n = 34,493). The number in parentheses in each column header is the number of labeled nodes.

| Method | Arxiv Macro-F1 (160) | Arxiv Macro-F1 (320) | Arxiv Macro-F1 (640) | Arxiv Macro-F1 (1280) | Arxiv Micro-F1 (160) | Arxiv Micro-F1 (320) | Arxiv Micro-F1 (640) | Arxiv Micro-F1 (1280) | Co-Physics Macro-F1 (10) | Co-Physics Macro-F1 (20) | Co-Physics Macro-F1 (40) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 21.9±1.4 | 27.6±1.5 | 33.0±1.4 | 37.2±1.1 | 52.3±0.8 | 56.4±0.8 | 60.0±0.7 | 63.5±0.4 | 58.3±13.8 | 66.9±10.1 | 78.3±7.1 |
| AGE | 20.4±0.9 | 25.9±1.1 | 31.7±0.8 | 36.4±0.8 | 48.3±2.3 | 54.9±1.6 | 60.0±0.7 | 63.5±0.3 | 63.7±7.8 | 71.0±8.8 | 82.4±3.9 |
| GPT | 24.2±0.7 | 29.5±0.8 | 36.4±0.5 | 41.0±0.5 | 52.3±0.9 | 56.8±0.8 | 60.7±0.6 | 63.6±0.5 | 75.8±2.7 | 85.8±0.3 | 88.9±0.3 |
| Proposed | 25.8±1.3 | 34.3±1.4 | 38.3±1.2 | 41.3±1.3 | 53.1±1.3 | 58.0±1.0 | 62.3±1.6 | 64.8±1.0 | 83.5±0.8 | 86.8±1.3 | 89.2±1.2 |

Table 3: Average query time (in seconds) per node.

| Dataset | Size | Time (s) |
|---|---|---|
| Texas | 183 | 0.19±0.03 |
| Chameleon | 2,277 | 0.34±0.18 |
| Cora | 2,708 | 0.30±0.19 |
| Citeseer | 3,327 | 0.26±0.07 |
| Pubmed | 19,717 | 0.48±0.25 |
| Co-Physics | 34,493 | 1.08±0.43 |
| Ogbn-Arxiv | 169,343 | 2.11±0.33 |

Unlike regression, node classification with GNNs is a widely studied area of research. Previous works [22, 30, 38, 39] have demonstrated that the prediction performance of various active learning strategies on unlabeled nodes remains relatively consistent across different types of GNNs.
Therefore, we employ Simplified Graph Convolution (SGC) [34] as the GNN classifier due to its straightforward theoretical intuition. Since SGC is essentially multi-class logistic regression on low-pass-filtered covariates, it can be approximately viewed as a special case of the regression model defined in (14). Thus, we conjecture that our theoretical analysis can also be extended to classification tasks and leave its formal verification for future work.

The results in Table 1 demonstrate that the proposed algorithm is highly competitive with baselines across real-world networks with varying degrees of homophily. Our method achieves the best performance on Cora (highest homophily) and Texas (lowest homophily, i.e., highest heterophily) and is particularly effective when the labeling budget is most limited. To handle heterophily in networks like Chameleon and Texas, we expand the graph signal subspace $U_d$ in Algorithm 1 to $U_d = \{U_1, \ldots, U_d, U_{n-d+1}, \ldots, U_n\}$, combining eigenvectors corresponding to the $d$ smallest and $d$ largest eigenvalues. Admittedly, relying on a priori knowledge of label construction may be unrealistic, so developing adaptive methods for designing the signal subspace to effectively handle both homophily and heterophily remains a promising direction for future research. Table 2 summarizes the performance on two large-scale networks. The greatest improvement is observed in the Macro-F1 score on Ogbn-Arxiv, with an increase of up to 4.8% at 320 labeled nodes. Moreover, Table 3 demonstrates that our algorithm scales efficiently to large networks, with the time cost of querying a single node being approximately 2 seconds when n = 169,343.

6 Conclusion

We propose a graph-based offline active learning framework for node-level tasks. Our node query strategy effectively leverages both the network structure and node covariate information, demonstrating robustness to diverse network topologies and node-level noise. We provide theoretical guarantees for controlling generalization error, uncovering a novel trade-off between informativeness and representativeness in active learning on graphs. Empirical results demonstrate that our method performs strongly on both synthetic and real-world networks, achieving competitiveness with state-of-the-art methods on benchmark datasets. Future work could explore extensions to an online active learning setting that iteratively incorporates node response information to further enhance query efficiency. Additionally, scalability on large graphs could be improved by utilizing the Lanczos method [1] or Chebyshev polynomial approximation [16] during node selection.

References

[1] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst. Accelerated filtering on graphs using the Lanczos method. arXiv preprint arXiv:1509.04537, 2015.
[2] B. Bamieh. A tutorial on matrix perturbation theory (using compact matrix notation). arXiv preprint arXiv:2002.05001, 2020.
[3] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 1999.
[4] H. Cai, V. W. Zheng, and K. Chen-Chuan Chang. Active learning for graph embedding. arXiv preprint arXiv:1705.05085, 2017.
[5] X. Chen and E. Price. Active regression via linear-sample sparsification. In Conference on Learning Theory, pages 663-695. PMLR, 2019.
[6] D. Blakely, J. Lanchantin, and Y. Qi. Time and space complexity of graph convolutional networks. Accessed: Dec. 2021.
[7] J. Du and C. X. Ling. Active learning with human-like noisy oracle.
In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM), 2010.
[8] F. Regol, S. Pal, Y. Zhang, and M. Coates. Active learning on attributed graphs via graph cognizant logistic regression and preemptive query generation. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
[9] G. Dasarathy, R. Nowak, and X. Zhu. S2: An efficient graph based active learning algorithm with application to nonparametric classification. In Proceedings of The 28th Conference on Learning Theory (COLT), 2015.
[10] G. Puy, N. Tremblay, R. Gribonval, and P. Vandergheynst. Random sampling of bandlimited signals on graphs. Applied and Computational Harmonic Analysis, 2016.
[11] A. Gadde, A. Anis, and A. Ortega. Active semi-supervised learning using sampling theory for graph signals. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014.
[12] A. Gadde, A. Anis, and A. Ortega. Active semi-supervised learning using sampling theory for graph signals. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 492-501, 2014.
[13] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 2002.
[14] G. Golub and C. Van Loan. Matrix Computations. JHU Press, 2013.
[15] J. Batson, D. A. Spielman, N. Srivastava, and S.-H. Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 2013.
[16] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
[17] Y. T. Lee and H. Sun. Constructing linear-sized spectral sparsification in almost-linear time. SIAM Journal on Computing, 47(6):2315-2336, 2018.
[18] P. Di Lorenzo, S. Barbarossa, and P. Banelli. Cooperative and Graph Signal Processing. Academic Press, 2018.
[19] M. Derezinski, M. K. Warmuth, and D. J. Hsu. Leveraged volume sampling for linear regression. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[20] M. Kutner, C. Nachtsheim, J. Neter, and W. Li. Applied Linear Statistical Models. McGraw-Hill/Irwin, 2004.
[21] M.-R. Bouguelia, S. Nowaczyk, K. C. Santosh, and A. Verikas. Agreeing to disagree: active learning with noisy labels without crowdsourcing. 2018.
[22] J. Ma, Z. Ma, J. Chai, and Q. Mei. Partition-based active learning for graph neural networks. Transactions on Machine Learning Research, 2023.
[23] N. T. Hoang, T. Maehara, and T. Murata. Revisiting graph neural networks: Graph filtering perspective. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), 2021.
[24] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, and X. Wang. A survey of deep active learning. ACM Computing Surveys, 2020.
[25] E. L. Paluck, H. Shepherd, and P. M. Aronow. Changing climates of conflict: a social network experiment in 56 schools. Proceedings of the National Academy of Sciences, pages 566-571, 2016.
[26] F. Pukelsheim. Optimal Design of Experiments (Classics in Applied Mathematics, 50). Society for Industrial and Applied Mathematics, 2018.
[27] R. Fajri, Y. Pei, L. Yin, and M. Pechenizkiy. A structural-clustering based active learning for graph neural networks. In Symposium on Intelligent Data Analysis (IDA), 2024.
[28] S. Chen, A. Sandryhaila, J. M. F. Moura, and J. Kovačević. Signal recovery on graphs: Variation minimization.
IEEE Transactions on Signal Processing, 2016.
[29] B. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2010.
[30] Z. Song, Y. Zhang, and I. King. No change, no gain: Empowering graph neural networks with expected model change maximization for active learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
[31] M. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 1991.
[32] W. Wang and W. N. Street. Modeling and maximizing influence diffusion in social networks for viral marketing. Applied Network Science, 2018.
[33] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 1998.
[34] F. Wu, A. H. Souza Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
[35] Y. Xu, X. Chen, A. Liu, and C. Hu. A latency and coverage optimized data collection scheme for smart cities. In New Advances in Identification, Information and Knowledge in the Internet of Things, 2017.
[36] Z. Allen-Zhu, Y. Li, A. Singh, and Y. Wang. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, 2020.
[37] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[38] W. Zhang, Y. Wang, Z. You, M. Cao, P. Huang, J. Shan, Z. Yang, and B. Cui. RIM: Reliable influence-based active learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[39] W. Zhang, Y. Wang, Z. You, M. Cao, P. Huang, J. Shan, Z. Yang, and B. Cui. Information gain propagation: a new way to graph active learning with soft labels. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.

A.1 Proof of Theorem 3.1

Proof: Consider a threshold frequency $\omega$ satisfying
$$ \omega < \omega(S) := \inf_{g \in \mathrm{Proj}_{\mathcal{L}_{S^c}} \mathrm{span}(X)} \omega_g, \qquad (16) $$
and notice that (16) holds if and only if
$$ \mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X) \cap \mathrm{Proj}_{\mathcal{L}_{S^c}} \mathrm{span}(X) = \{0\}. \qquad (17) $$
For $\phi \in \mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X) \cap \mathrm{Proj}_{\mathcal{L}_{S^c}} \mathrm{span}(X)$ and $f \in \mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X)$, we have $g = \phi + f \in \mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X)$ by closure under addition, and $f(S) = g(S)$. Since $\mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X)$ can be identified by $S$, one must have $f = g$, which implies $\phi = 0$. Hence, (17) is true. Given that (17) is true, assume that $\mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X)$ cannot be identified by $S$. Then, by definition, there exist $f_1, f_2 \in \mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X)$ such that $f_1(S) = f_2(S)$ and $f_1 \ne f_2$. Since $(f_1 - f_2)(S) = 0$, we have $f_1 - f_2 \in \mathrm{Proj}_{\mathcal{L}_{S^c}} \mathrm{span}(X)$. By closure under addition, $f_1 - f_2 \in \mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X)$. However, $f_1 - f_2 \in \mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X) \cap \mathrm{Proj}_{\mathcal{L}_{S^c}} \mathrm{span}(X)$ contradicts (17). Therefore, $\mathrm{Proj}_{\mathcal{L}_\omega} \mathrm{span}(X)$ can be identified by $S$.

A.2 Proof of Theorem 4.1

Proof: Let $U = \{U_1, U_2, \ldots, U_n\}$ be the eigenvectors of $L = I - D^{-1/2} A D^{-1/2}$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Without loss of generality, we analyze the case where $k = 1$ and there are no repeated eigenvalues. The analysis for the case of repeated eigenvalues can be similarly performed with matrix perturbation analysis for the degenerate case [2]. After selecting node $i$, we have the reduced Laplacian matrix $L' = L + L(-i)$, where
$$ L(-i) = \begin{pmatrix} \vdots & & \vdots & & \vdots \\ L_{i1} & \cdots & L_{ii} & \cdots & L_{in} \\ \vdots & & \vdots & & \vdots \end{pmatrix}. $$
Define $\Lambda'$ as the diagonal matrix containing the eigenvalues of $L'$. Using first-order perturbation analysis [2], we have $\Lambda' \approx \Lambda + \mathrm{diag}(U^T L(-i) U)$. Let $L_{\cdot i}$ and $L_{i \cdot}$ be the $i$th column and row of $L$, respectively.
Since = L i Uij + Lii Uij L( i) i Uj = L i Uij + (Lii λj)Uij U T j L( i)Uj = Uij(Li Uj) + (Lii λj)U2 ij = Uij(λj Uij) + (Lii λj)U2 ij =(1 2λj)U2 ij since the diagonal entry of L is 1 (1 2λ1)U2 i1 (1 2λn)U2 in Assume U1 corresponding to the smallest non-zero eigenvalue of Λ , then the increase of bandwidth frequency for node i is U2 i1. For random selection, where node i is queried with uniform probability, i=1 U2 i1 1 For the proposed selection method, define R = diag(r1, r2, , rn), where ri = di dmin , then L =I D 1/2AD 1/2 = I + ϵ D | {z } A0 +ϵ (ϵ dmin) 1R 1 2 D | {z } A1 Next we perform matrix perturbation analysis, define U0 = I, Λ0 = A0, Λ = Λ0 + ϵ Λ1, U = U0 + ϵ U1 where Λ, U approximate eigenvalues and eigenvectors of Λ . Denote (U1)k as the kth column of U1 and assume di = dj, we have (U0) i A1 (U0)k (Λ0)k (Λ0)i (U0)i = X Aik (ε )2 (di dk) di dk ei (U)k = (U0)k + ε (U1)k To satisfy Uk 2 = 1, we multiply τ as (τU)k = (τU0)k + ε (τU1)k recalculate U1 by τU0 as (U1)k = X (ε )2 (di dk) di dk ei (τU)k = τek + X ε (di dk) di dk ei, choose ε = τ 3 1 dkd then τUk 2 = 1 if τ = 1 r i =k Aik di dk Then we consider the normalized τU as U in the following analysis. Assume (U)k is the eigenvector to the smallest non-zero eigenvalue, then at the t 1 step E( ) = E(U2 ik) where E is in terms of the randomness in the proposed sampling procedure. Based on the approximation (U)k = τek + X # {Uik = 0}n i=1 = 1 + dk > dmin Define S = {i {1, 2, , n} : Uik = 0} and P1 = P(the node being selected to query / S). Take p = mini S P(i), q = maxi/ S P(i) and P( ) denotes the probability of being selected into candidate set size m. Denote k = dmin, we first upper bound P1 as n k m qm(1 q)n k m(1 p)k Pk i=0 n k m i qm k(1 q)n k (m i) k i pi(1 p)k i Pk i=0 n k m i k i ηi , where η = p(1 q) We calculate the denominator as In addition, by Stirling s approximation, when n is large n m r n 2πm(n m) nn mm(n m)n m n k m n k 2πm(n k m) (n k)n k mm(n k m)n k m then combining the above simplification, we have n n m n k m (n k)n k nn (n m)n m (n k m)n k m then we can lower bound the expected value of information gain as E(U2 ik) = X i S P(i)U2 i1 (1 P1) min i S U2 ik Notice that P1 is a monotone decreasing function of m given n and k fixed, then we can select a m0 such that n k m0 where δ < 1 is a constant, therefore E(U2 ik) > (1 δ) mini S U2 ik . Next we lower bound the quantity mini S U2 ik . Denote η1 := maxi( di dmin ) and η0 := #|{i : | di dk dmin | 1}|, we calculate the lower bound for mini S U2 ik as min i S U2 ik min d2 min di dmin d2 min di dmin + 1 + (di dk)2 d2 min di dmin X d2 min di dmin 1 + on(1) + X j {i:| di dk dmin | 1} d2 min + X j / {i:| di dk dmin |<1} 1 1 η3 1(η0d2 min + dmin η0) which implies min i S U2 ik 1 η3 1(η0d2 min + dmin η0) As a result, as long as m m0 we have E(U2 ik) 1 δ η3 1(η0d2 min + dmin η0). A.3 Proof of Theorem 4.2 Proof: Based on the assumption that f Proj Lω0span(X), we denote d0 = |{1 j n | λj ω0}| and Ud0 = (U1, U2, , Ud0). Therefore, we can represent f = Ud0UT d0Xβ for some parameter β Rp 1 and f, Ui = U T i Xβ. For the query set S and the corresponding bandwidth frequency ω ω0, we similarly denote d = |{1 j n | λj ω}| d0 and Ud = (U1, U2, , Ud). We denote Vn rd = Ud V1 as the bases of Proj Lωspan(X) where V1 is obtained from SVD decomposition UT d X = V1ΣV T 2 where (V1)d rd and (V2)p rd are left and right singular vectors, respectively. The diagonal matrix Σrd rd contains rd positive singular values with rd min{d, p}. 
The estimation (11) at the end of section 3 is equivalent to the weighted regression problem of {(Vi , Yi, si)}i S, f = arg min f Proj Lω span(X) i S si|Yi f(i)|2 arg min α Rrd 1 i=1 | si Yi ( si V1(i), . . . , si Vrd(i))α|2, where |S| = B. We have the least squares solution α( f) = A A 1 A WYB 1 s1V1(1) s1Vrd(1) ... ... ... s BV1 (B) . . . srd Vrd (B) and W = diag( s1, . . . , s B) (18) We assume Y = f + ε, where E (ε) = 0 and V ar(ε) = σ2. Notice the oracle f satisfies f = arg min f Proj Lω0 span(X) i=1 EY Yi f 2(i) We decompose the space Proj Lω0span(X) as Proj Lω0span(X) = Proj Lωspan(X) M Proj Lωspan(X) c Then we decompose f = f1 + f2, where f1 Proj Lωspan(X), f2 Proj Lωspan(X) c, then f1 = arg min f Proj Lω {span(X)} i=1 EY (Yi f (i))2 Then we can represent f1(i) = (V1(i), . . . , Vrd(i))α(f1), by solving Aα(f1) = Wf1, we have α(f1) = A A 1 A Wf1 2 From 1 and 2 , we have f (i) f (i) 2 = f f 2 2 f f1 2 2 + f1 f 2 2 α f α(f1) 2 2 + f1 f 2 2 A A 1 A W (YB (f1)B) 2 2 + f1 f 2 2 λmax A A 1 A W YB (f1)B 2 2 + f1 f 2 2 Denote gi = Yi f1(i), we have ES A Wg 2 = j=1 s2 j V2 i (j)|gj|2 Denote αi = si pi for j S where pj is the probability of node j being selected to query. Since ES [sj Vi(j)(Yj f1(j))] =αj ES pj Vi(j)(Yj f1(j)) l=1 EY [Vi(l)(Yl f1(l))] =0 since Vi f2 we have i=1 s2 j V2 i (j)|gj|2 # i=1 |Vi(j)|2 ! j=1 ES sjg2 j i=1 |Vi(j)|2 ! i=1 |Vi(j)|2 ! l=1 EY (Yl f1(l))2 Notice that l=1 EY (Yl f1(l))2 = l=1 EY (Yl f(l))2 + l=1 (f(l) f1(l))2 = nσ2 + f f1 2 2 Notice that f = Ud0UT d0Xβ and f1 = Ud UT d f, then f f1 = Ud UT d Xβ = P i>d f, Ui Ui where Ud = (Ud+1, , Ud0). Therefore, f f1 = P i>d,i supp(f) f, Ui 2. We first state and then prove the following Lemma 1. Lemma A.1 For the output from Algorithm 1 with B query budgets, we have i=1 |Vi(j)|2 ! 2 i with probability 1 2 where δ = rd CC2 0 m B , and C0 is a constant such that C2 0 = (m)2 + max(2, 16 Using Lemma 1 we have E f f 2 2 λmin A A i=1 |Vi(j)|2 ! j=1 EY (Yj f1(j))2 + f f1 2 2 3 10 CC2 0rd m 1 j=1 EY (Yj f1(j))2 + f f1 2 2 B )3/2 + (rdt B )2 (nσ2 + X i>d,i supp(f) Ui, f 2) + X i>d,i supp(f) Ui, f 2, with probability larger than 1 2m t where t > 2m. In the following, we prove Lemma 1 which is based on Theorem 5.2 in [5] and Lemma 3.5 and 3.6 in [17]. In the following, we denote the accumulated covariance matrix in the j selection as Aj, the potential function as Φuj,lj(Aj) = Tr[(uj I Aj) 1]+Tr[(Aj lj I) 1], and Ri(u, l, A) = vi(u I A) 1v T i + vi(A l I) 1v T i , where vi is the ith row of V. Notice that Pn i=1 Ri = Φu,l(A). At each iteration of algorithm 1, the ith node is selected as one of m candidates with p i = Ri Φ . For the m candidates, we define the following probability 1 η if i has maximum i among m candidates η m 1 otherwise where 0 < η < 1. Notice that when η goes to 0, the qi approximate the step 3 in Algorithm 1. Therefore, the probability of node k being query is pk = P(select k) =P( select k | {z } Qk |k in Bm) P(k in Bm) = p k qk =EBm EQk|Bm 1 P(select k)vkv k k Bm P(select k|k Bm) 1 P(select k)vkv k 1 P(k Bm)vkv k 1 P(k Bm)vkv k k Bm P(Bm | k Bm) vkv k where Ωdenote all Cm n possible candidate set with choosing m nodes from n nodes, P(Bm | k Bm) denotes the probability of selecting m 1 nodes into Bm conditioning on k Bm. Denote Ωk as all possible size m candidate sets with node k always in the set. Then Bk m 1 Ωk P (Bm | k Bm) k=1 vkv k = I ϵ (P Ri)pk vkv k = ϵ (P Ri)p kqk vkv k = ϵ Rkqk vkv k where we use the fact vv T (v T B 1v)B for any semi-positive definite matrix B. 
In addition, E ϵ (P Ri)pk vkv k = ϵ P Ri I = ϵ Φu,l(A)I for any k = 1, , n. Denote wk = q ϵ P Ripk vk, then wkw k mϵ η (u I A), which implies for any k [1, n] w k (u I A) 1wkw k (u I A) 1wk mϵ η w k (u I A) 1wk w k (u I A) 1wk mϵ η Similarly, we have w k (A l I) 1wk mϵ η Then from Lemma 3.3 and Lemma 3.4 in [? ] we have Tr(u I A wkw k ) Tr(u I A) + w k (u I A) 2wk Tr(A + wkw k l I) Tr(A l I) w k (A l I) 2wk Define ϵ = mϵ η , we show in the following that E Φuj,lj(Aj) Φuj 1,lj 1(Aj 1). From (19) we have Φuj,lj(Aj) Φuj,lj(Aj 1) + w j 1(uj I Aj 1) 2wj 1 1 ϵ w j 1(Aj 1 lj I) 2wj 1 Define u = uj uj 1 = ϵ (1 ϵ ) P Rj and l = lj lj 1 = ϵ (1+ϵ ) P Rj . Notice that u Tr(u I A) 1 = Tr(u I A) 2 < 0 l Tr(A l I) 1 = Tr(A l I) 2 < 0 at each step based on the design of uj and lj. and Φu,l(A) is convex in terms of u and l. From 3 , we have Φuj,lj(Aj) Φuj,lj(Aj 1)+ 1 1 ϵ Tr (uj I Aj 1) 2wj 1w j 1 1 1 + ϵ Tr (Aj 1 lj I) 2wj 1w j 1 then with E(wkw T k ) = ϵ P Ri I E Φuj,lj(Aj) Φuj,lj(Aj 1) + ϵ (1 ϵ ) P Ri Tr (uj I Aj 1) 2 ϵ (1 + ϵ ) P Ri Tr (Aj 1 lj I) 2 Φuj,lj(Aj 1) + u Tr (uj I Aj 1) 2 l Tr (Aj 1 lj) 2 f(t) = Tr (uj 1 + t u)I Aj 1 1 + Tr Aj 1 (lj 1 + l t)I 1 t = u Tr (uj 1 + t u)I Aj 1 2 + l Tr Aj 1 (lj 1 + l t)I 2 Since f(t) is convex, we have t=1 f(1) f(0) = Φuj,lj(Aj 1) Φuj 1,lj 1(Aj 1) (22) Then plugin (22), we have E Φuj,lj(Aj) Φuj 1,lj 1(Aj 1) Notice that for selection ε t(1 ε ) ε t(1+ε ) ε t(1 ε ) = 1 (1 ε ) 1 (1+ε ) 1 (1 ε ) 2ε where t = P Ri. We consider that the selection process stops when at the iteration k that uk lk 8rd/ϵ. Notice u0 = 2rd ε , l0 = 2rd ε , when stop at uk lk 8rd ε , we have uk = (u0 l0) + Pk 1 j=0 uj lj u0 + Pk 1 j=0 uj 4rd/ε + Pk 1 j=0 uj lj 2rd/ε + (2ε ) 1 Pk 1 j=0 uj lj 4rd/ε + 4rd/ε 2rd/ε + (2ε ) 1 4rd/ε = 8rd/ε 2rd 1 + 1 Then we have uk lk = 1 uk lk 1 1 + 4 (ε ). Notice that uk lk 8rd ε = Pk 1 j=0 uj lj 4rd Consider at the jth selection uj lj = ϵ 1 ϵ ϵ 1 + ϵ = ϵ Φuj,lj(Aj), ϵ = 2ϵϵ (1 ϵ )(1 + ϵ ) P (finish selection after B times selectionz vectors) P ϵ Φuj,lj(Aj) 4rd j=0 Φ 1 uj,lj(Aj) 4rd B2 PB 1 j=0 Φuj,lj(Aj) 4rd j=0 Φuj,lj(Aj) B2 ϵϵ where we use the result that E Φuj,lj(Aj) Φu0,l0(A0) by recursively using E Φuj,lj(Aj) Φuj 1,lj 1(Aj 1) and the fact that Φu0,l0(A0) = ϵ. We consider the following reparametrization: δ C0 , 0 < δ, mid = 2rd(1 m2ϵ2)/(mϵ2). For the jth selection, αj = ϵ Φj 1 mid. From previous result, we have uk lk = 1 uk lk with probability 1 2/C with δ = CC2 0ηrd m B . Notice that uk = u0 + Pk j=1 ϵ (1 ϵ )Φj and lk = l0 + Pk j=1 ϵ (1+ϵ )Φj , and 1 1 ϵ + 1 1 + ϵ Then if stop at kth selection Φk 2rd uk lk = 2rd (uk 1 lk 1) + ϵ Φk 1 1 ϵ 1 1+ϵ 1 1 ϵ 1 1+ϵ Denote c = 1 1 ϵ 1 1+ϵ , then Φk 1 8ϵ2. We find C0 such that cϵ = 2ϵ ϵ 1 (ϵ )2 < 1 then Φk ϵ uk lk = uk 1 lk 1 + ϵ 1 1 ϵ 1 1 + ϵ 1 1 ϵ 1 1 + ϵ ϵ + 8 1 1 ϵ 1 1 + ϵ Then we choose C0 large such that 16ϵ 1 (ϵ )2 < rd ϵ uk lk 9rd ϵ . Given that ϵ = m η ϵ and ϵ = δ C0 , we choose appropriate C0 to satisfy previous requirement on ϵ and ϵ as 2ϵϵ < 1 (ϵ )2 1 (ϵ )2 < rd Therefore, we choose C0 such that C2 0 > m η 2 + max(2, 16 η . Notice that 1 1 ϵ 1 1 + ϵ and mid is defined as mid = ϵ 1 1 ϵ 1 1+ϵ , then Pk j=1 ϵ Φj mid. Also, ϵ Φj mid mid > rd(1 (ϵ )2) rd(1 (ϵ )2) which implies mid h 1 4ϵϵ rd(1 (ϵ )2), 1 i ϵ Φj = h 1 4ϵϵ rd(1 (ϵ )2), 1 i uk + lk 1 1 ϵ + 1 1+ϵ Notice that for the design matrix A in (18), we have 1 mid A = Ak where Ak is the accumulated covariance matrix when the query process stops at the kthe selection. 
Therefore, the eigenvalues of A satisfy λ(AT A) = 1 midλ(AT k Ak) " lk mid, uk " lk uk+lk 1 1 ϵ + 1 1+ϵ , uk 1 4ϵϵ rd(1 (ϵ )2) uk+lk 1 1 ϵ + 1 1+ϵ Given that with high probability, 1 4ϵ uk lk 1 + 4ϵ . Then for the lower bound, lk uk+lk 1 1 ϵ + 1 1+ϵ 1 (1 (ϵ )2)(1 + 2ϵ ) > 1 (1 + ϵ )(1 + 2ϵ ) > 1 2 1 (1 + ϵ )2 = 1 and upper bound uk 1 4ϵϵ rd(1 (ϵ )2) uk+lk 1 1 ϵ + 1 1+ϵ (1 + 2ϵ )(1 (ϵ )2) (1 + 2ϵ ) 4ϵϵ (1 + 2ϵ)(1 (ϵ )2), given 4ϵϵ 3 1 1 (ϵ )2 δ/(ηC0))2 . Then with probability larger than 1 2 C , we have δ ηC0 )2 , 8 Consider αj = ϵ Φj 1 mid, ϵ Φj 1 mid 1, 1 1 4ϵϵ rd(1 (ϵ )2) 4 (1 ϵ )2 (1 ϵ )2 = 4 Finally, check sup j S sj Prd i=1 |Vi(j)|2 at the kth selection i=1 |Vi(j)|2 ! Φk 1 mid Φk i=1 |Vi(j)|2o n Prd j=1 |Vi(j)|2 mid 1 1 uk lk + 1 uk lk 2 ϵ mid 9 rd ϵ 2 = 4.5rd 4.5rd 2rd(1 (ϵ )2) ϵϵ = 2.25 ϵϵ 1 (ϵ )2 2.25 4δ = 10δ Then we finish the proof of Lemma 1. B More on Biased Sequential Sampling B.1 Computational complexity In the representative sampling stage, the computational complexity of calculating the sampling probability is O(n). We then sample m nodes to formulate a candidate set Bm, where the complexity of sampling m variables from a discrete probability distribution is O(m) [31]. Consequently, the complexity of the representative learning stage is O(n + m). In the informative selection stage, we calculate the information gain i for each node. This involves obtaining the eigenvector corresponding to the smallest non-zero eigenvalue of the projected graph Laplacian matrix, with a complexity of O(n3) due to the singular value decomposition (SVD) operation. Subsequently, we compute i for each node in the candidate set Bm based on their loadings on the eigenvector, which incurs an additional computational cost of O(mn). Therefore, the total complexity of our biased sampling method is O(n + m + nm + n3). Given a node label query budget B, the overall computational cost becomes O(B n + m + nm + n3) . When the dimension of node covariates p n, we can replace the SVD operation with the Lanczos algorithm to accelerate the informative selection stage. The Lanczos algorithm is designed to efficiently obtain the kth largest or smallest eigenvalues and their corresponding eigenvectors using a generalized power iteration method, which has a time complexity of O(kn2) [14]. As a result, the complexity of the proposed biased sampling method reduces to O(pn2). This is comparable to GNN-based active learning methods, as GNNs and their variations generally have a complexity of O(pn2) per training update [6, 37]. B.2 Connection to classification tasks Although our theoretical analysis is developed for node regression tasks, the proposed query strategy and graph signal recovery procedure are also applicable to classification tasks. Consider a Kclass classification problem, where the response on each node i is given by f(i) 1, 2, . . . , K. We introduce a dummy membership vector (Y1(i), . . . , YK(i)), where Yc(i) = 1 if f(i) = c and Yc(i) = 0 otherwise. For each class c {1, 2, . . . , K}, we first estimate ˆβc based on (14) with the training data { Xi , Yc(i), si}i S, and then compute the score for class c as ˆfc = X ˆβc. The label of an unqueried node j is assigned as ˆf(j) = arg max1 c K{ˆf1(j),ˆf2(j), ,ˆf K(j)}. Notice that the above score-based classifier is equivalent to the softmax classifier: ˆf = arg max 1 c K { exp(ˆf1) P c exp(ˆfc) , , exp(ˆf K) P c exp(ˆfc) } since the softmax function is monotonically increasing with respect to each score function {ˆfc}K c=1. 
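The one-vs-rest scheme of Appendix B.2 can be sketched as follows. This is our own code; it reuses the hypothetical `recover_signal` helper from the regression sketch in Section 3.4 and assumes classes are coded 0, ..., K-1.

```python
import numpy as np

def classify_nodes(X_tilde, S, labels_S, s_S, num_classes):
    """One-vs-rest classification as in Appendix B.2: fit (14) on each dummy
    membership vector Y_c and assign the class with the largest score."""
    scores = np.zeros((X_tilde.shape[0], num_classes))
    for c in range(num_classes):
        y_c = (np.asarray(labels_S) == c).astype(float)   # dummy membership Y_c(i)
        scores[:, c] = recover_signal(X_tilde, S, y_c, s_S)
    return scores.argmax(axis=1)                          # equivalent to the softmax rule
```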
B.3 Discussion on Theorem 4.1

Theorem 4.1 is derived using first-order matrix perturbation theory [2] on the Laplacian matrix $L$. In Theorem 4.1, we assume that the column space of the node covariate matrix $X$ is identical to the space spanned by the first $d$ eigenvectors of $L_{S^c}$. This assumption simplifies the analysis and the results by focusing on the perturbation of $L_{S^c}$, where $L_{S^c}$ is the reduced Laplacian matrix with zero entries in the rows and columns indexed by $S$. The analysis can be naturally extended to the general setting by replacing $L_{S^c}$ with $P(\mathbf{1}_{S^c})\, L\, P(\mathbf{1}_{S^c})$, where $P(t)$ is the projection operator defined in Section 3.2. Moreover, under the assumption on the node covariates, the information gain $\Delta_i$ exhibits an explicit dependence on the network statistics, providing a clearer interpretation of how the network structure influences the benefits of selecting informative nodes.

Theorem 4.1 indicates that the improvement of biased sampling is more significant when $d_{\min}$ is larger and $\eta_0$, $\eta_1$ are smaller. Specifically, $d_{\min}$ reflects the connectedness of the network: a better-connected network facilitates the propagation of label information and enhances the informativeness of a node's label for other nodes. A smaller $\eta_1$ prevents the existence of dominating nodes, ensuring that the connectedness does not decrease significantly when some nodes are removed from the network. Notice that node $j$ is the most informative node for the next selection, and $\eta_0$ measures the number of nodes similar to $j$ in the network. Recall that the proposed biased sampling method considers both the informativeness and the representativeness of the selected nodes; therefore, the information gain is less penalized by the representativeness requirement when $\eta_0$ is small. Additionally, the size of the candidate set $m$ should be sufficiently large to ensure that informative nodes are included in $B_m$.

B.4 Discussion on Theorem 4.2

The RHS of (15) captures both the variance and the bias involved in estimating $f$ using noisy labels on sampled nodes. Specifically, the first three terms represent the estimation variance arising from controlling the condition number of the design matrix on the queried nodes. The fourth and fifth terms reflect the noise and unidentifiable components in the responses of the queried nodes, while the last term denotes the bias resulting from the error of approximating $f$ within the space $H_\omega(X, A)$. The bias term in Theorem 4.2 can be further controlled if the true signal $f$ exhibits decaying or zero weights on high-frequency network components. In addition to $rd$, the size of the candidate set $m$ also influences the probability of controlling the generalization error. A small $m$ places greater emphasis on the representativeness criterion in sampling, increasing the likelihood of controlling the condition number but potentially overlooking informative nodes, thereby increasing the approximation bias.

For a fixed prediction MSE, the query complexity of our method is $O(d)$, whereas random sampling incurs a complexity of $O(\tilde{d}\log\tilde{d})$, where $\tilde{d} > d$. Our method outperforms random sampling in two key aspects: (1) the information-based selection identifies $f$ with fewer queries than random sampling, as shown in Theorem 4.1, and (2) our method achieves an additional improvement by actively controlling the condition number of the covariate matrix, resulting in a logarithmic-factor reduction compared to random sampling.
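As a small illustration of the condition-number control discussed above (and revisited in the ablation study of Appendix C.3), the ratio $\lambda_{\max}(X_S^\top W_S X_S)/\lambda_{\min}(X_S^\top W_S X_S)$ can be monitored with a few lines of code. The function name and the optional `weights` argument (standing in for the diagonal of $W_S$) are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def design_condition_number(X, queried_idx, weights=None):
    """Condition number lambda_max / lambda_min of X_S^T W_S X_S on the query set S."""
    X_s = np.asarray(X)[queried_idx]
    # Default to the unweighted case W_S = I when no sampling weights are supplied.
    w = np.ones(len(queried_idx)) if weights is None else np.asarray(weights)
    gram = X_s.T @ (w[:, None] * X_s)      # X_S^T W_S X_S with diagonal W_S
    eigvals = np.linalg.eigvalsh(gram)     # eigenvalues of the symmetric Gram matrix, ascending
    return eigvals[-1] / eigvals[0]        # lambda_max / lambda_min
```

A diagnostic of this kind underlies the tuning rule in Appendix B.5 below, which accepts the largest $m$ for which the ratio stays below 10.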
B.5 Calculation of node-wise information gain and hyperparameter tuning

When calculating the node-wise informativeness in (7), we can enhance computational efficiency by avoiding the inversion of $D(t)$. When $t$ is in the neighborhood of $\mathbf{1}_{S^c}$, we can approximate
\[
P(t) \approx D(t)\, X_{S^c}\big(X_{S^c}^\top X_{S^c}\big)^{-1} X_{S^c}^\top D(t) = D(t)\, Z_{S^c} Z_{S^c}^\top D(t),
\]
where $X_{S^c} = \big((X_1)_{S^c}, \ldots, (X_p)_{S^c}\big)$ and $Z_{S^c} = X_{S^c}\big(X_{S^c}^\top X_{S^c}\big)^{-1/2}$. Then the node-wise informativeness can be explicitly expressed as
\[
\Delta_i \approx t_i \phi_i^2\, (Z_{S^c})_{i\cdot}\, (Z_{S^c}^\top)_{\cdot i} + \sum_{j\neq i,\, 1\le j\le n} t_i t_j \phi_i \phi_j\, (Z_{S^c})_{i\cdot}\, (Z_{S^c}^\top)_{\cdot j}. \tag{23}
\]
We find that this approximation yields very similar empirical performance compared to the exact formulation in (7). Therefore, we adopt the formulation in (23) for the subsequent numerical experiments.

In practice, we can tune $m$ to ensure that the covariance matrix is well-conditioned. Specifically, we can run the biased sampling procedure multiple times with different values of $m$ and select the largest $m$ such that the condition number of the covariance matrix on the query set $S$ is less than 10 [20]. This threshold is a commonly accepted rule of thumb for considering a covariance matrix to be well-conditioned [20]. Additionally, $\epsilon$ is typically fixed at a small value, following the protocol outlined in [17].

C More on Numerical Studies

C.1 Experimental setups

Synthetic networks. The parameters for the three network topologies are: the Watts–Strogatz (WS) model ($K = 4$, $\beta_{WS} = 0.1$) for small-world properties, the stochastic block model (SBM) ($N_{\mathrm{community}} = 4$, $P_{\mathrm{in}} = 0.35$, $P_{\mathrm{out}} = 0.01$) for community structure, and the Barabási–Albert (BA) model ($\alpha = 3$) for scale-free properties. We set $n = 100$ for all three networks. After generating the networks, we consider them fixed and then simulate $Y$ and $X$ repeatedly using 10 different random seeds. By a slight abuse of notation, we set the node responses and covariates for SBM and WS as $Y = U_{1:10}\beta + \xi$ and $X = U_{1:10} + M U_{45:54}$, where $M_{ij} \overset{iid}{\sim} N(0.3, 0.1)$ and $\beta = (5, 5, \ldots, 5)^\top$ has length 10. For the BA model, we set $Y = U_{1:15}\beta + \xi$ and $X = U_{1:15} + M U_{45:59}$, where $M_{ij} \overset{iid}{\sim} N(0.5, 0.2)$ and $\beta = (1, \ldots, 1, 5, \ldots, 5)^\top$ has five entries equal to 1 followed by ten entries equal to 5. (An illustrative sketch of this data-generating process is given after Table 4.)

Real-world networks. For the proposed method and all baselines, we train a 2-layer SGC model for a fixed 300 epochs. In SGC, the propagation matrix performs low-pass filtering on homophilic networks and high-pass filtering on heterophilic networks. During training, the initial learning rate is set to $10^{-2}$ and the weight decay to $10^{-4}$.

C.2 Visualization

In Figure 2, we visualize the node query process on synthetic networks generated using SBM and BA, as described in Section 5.1. The figure clearly demonstrates that the nodes queried by the proposed algorithm adapt to the informativeness criterion specific to each network topology, effectively aligning with the community structure in SBM and the scale-free structure in BA.

Dataset      #Nodes    Type          m     d
SBM          100       homophilic    50    10
WS           100       homophilic    50    10
BA           100       homophilic    50    15
Cora         2,708     homophilic    2000  200
Pubmed       19,717    homophilic    3000  60
Citeseer     3,327     homophilic    1000  100
Chameleon    2,277     heterophilic  800   30, 30
Texas        183       heterophilic  60    15, 15
Ogbn-Arxiv   169,343   homophilic    1000  120
Co-Physics   34,493    homophilic    3000  150

Table 4: A description of all datasets used in Section 5 and the hyperparameter settings for each dataset. We set $\epsilon = 0.001$ for all networks. For heterophilic networks, we combine the eigenvectors corresponding to the $d$ smallest and $d$ largest eigenvalues.
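The following is a minimal sketch of the SBM simulation setup described in C.1. It assumes that $U$ denotes the eigenvectors of the graph Laplacian ordered by increasing eigenvalue, that $M U_{45:54}$ acts as an elementwise (Hadamard) perturbation, that the second parameter of the normal distributions is a standard deviation, and that $\xi$ is standard Gaussian noise; these interpretations are assumptions made for illustration rather than the exact simulation protocol.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n = 100

# SBM with 4 communities, P_in = 0.35, P_out = 0.01 (parameters from Section C.1).
sizes = [25, 25, 25, 25]
probs = [[0.35 if a == b else 0.01 for b in range(4)] for a in range(4)]
G = nx.stochastic_block_model(sizes, probs, seed=0)

# U: eigenvectors of the graph Laplacian, ordered by increasing eigenvalue
# (assumed interpretation of U_{1:10} and U_{45:54} in the simulation setup).
L = nx.laplacian_matrix(G).toarray().astype(float)
_, U = np.linalg.eigh(L)

beta = np.full(10, 5.0)                  # beta = (5, ..., 5)^T of length 10
xi = rng.normal(0.0, 1.0, size=n)        # noise scale of xi is an assumption
M = rng.normal(0.3, 0.1, size=(n, 10))   # M_ij ~ N(0.3, 0.1); 0.1 treated as a standard deviation
Y = U[:, :10] @ beta + xi                # Y = U_{1:10} beta + xi
X = U[:, :10] + M * U[:, 44:54]          # X = U_{1:10} + M * U_{45:54} (elementwise, assumed)
```

The WS and BA settings follow the same pattern, swapping in `nx.watts_strogatz_graph` or `nx.barabasi_albert_graph` and the corresponding $\beta$ and $M$ parameters listed above.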
C.3 Ablation study

To gain deeper insights into the respective roles of representative sampling and informative selection in the proposed algorithm, we conduct additional experiments on a New Jersey public school social network dataset, School [25], which was originally collected to study the impact of educational workshops on reducing conflicts in schools. As School is not a benchmark dataset in the active learning literature, we did not compare the performance of our method against other baselines in Section 5.2, to ensure fairness. In this dataset with $n = 615$ nodes, each node represents an individual student, and edges denote friendships among students. We treat the students' grade point averages (GPA) as the node responses and select $p = 5$ student features (grade level, race, and three binary survey responses) as node covariates using a standard forward selection approach.

As shown in Section 3.4, the representative sampling in steps 1 and 2 of Algorithm 1 is essential to control the condition number of the design matrix and, consequently, the prediction error given noisy network data. We illustrate in Figure 3a the condition number $\lambda_{\max}(X_S^\top W_S X_S)/\lambda_{\min}(X_S^\top W_S X_S)$ obtained using the proposed method, and compare it with the one obtained using random selection. With $m = 200$, the proposed algorithm achieves a significantly lower condition number than random selection, especially when the number of queries is small. On the Citeseer dataset, we investigate the prediction performance of Algorithm 1 when removing steps 1 and 2, i.e., setting the candidate set $B_m = S^c_{t-1}$ for the $t$th selection. Figure 3b shows that, with representative sampling, the Macro-F1 score is consistently higher, with a performance gap of up to 15%. Given that node classification on Citeseer is found to be sensitive to labeling noise [38], this result validates the effectiveness of representative sampling in improving the robustness of our query strategy to data noise.

In addition, we examine the ability of the proposed method to integrate node covariates for improving prediction performance. In the School dataset, we compare our method to a variant that removes node covariates during the query stage by setting $X$ to the identity matrix $I$. Figure 3c illustrates that the prediction MSE for GPA is significantly lower when incorporating node covariates, thus distinguishing our node query strategy from existing graph signal recovery methods [12] that do not account for node covariate information. The implementation code for the proposed algorithm is available at github.com/Yuanchen Wu/Robust Active Learning/.

Figure 2: Node query process on (a) the stochastic block model (SBM) and (b) the Barabási–Albert (BA) model. For (a) SBM, nodes are grouped by the assigned community; for (b) BA, nodes are grouped by degree. The integer $i$ on each node represents the $i$th node queried by the proposed algorithm in one replication.

Figure 3: Ablation study: (a) the condition number (log scale) of the design matrix of query nodes selected by the proposed method and by random sampling; the effectiveness of (b) representative sampling and (c) incorporating covariate information in Algorithm 1.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We support the claims with theoretical analysis in Section 4 and empirical performance in Section 5.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: See Section 5.2 for the limitations in the real-data studies and Section 6 for future work directions.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: See Sections 3 and 4 for the main theorems with assumptions. See Appendix 3.1, 3.2 and 3.3 for the detailed proofs.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We include implementation details for both synthetic and real-world networks in Appendix B.1, B.2 and B.3.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We use open-source datasets that can be readily downloaded online. The implementation details in the Appendix are sufficient to reproduce the main experimental results.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We include implementation details for both synthetic and real-world networks in Appendix C.1.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We include error bars for the numerical studies on both synthetic and real-world networks.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper focuses mainly on theory, and our algorithm is not deep-learning based.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: Please see the introduction section, where we briefly touch on the positive societal impacts.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: Please see Appendix C.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: Please see Appendix C.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [No]
Justification: Our experiment simulates the human labeling process in active learning.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]