# Uncertainty for Active Learning on Graphs

Dominik Fuchsgruber\*¹², Tom Wollschläger\*¹², Bertrand Charpentier¹, Antonio Oroz¹, Stephan Günnemann¹²

\*Equal contribution. ¹School of Computation, Information and Technology, Technical University of Munich, Germany. ²Munich Data Science Institute, Germany. Correspondence to: Dominik Fuchsgruber, Tom Wollschläger. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models by iteratively acquiring labels of data points with the highest uncertainty. While it has proven effective for independent data, its applicability to graphs remains under-explored. We propose the first extensive study of Uncertainty Sampling for node classification: (1) We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap to other Active Learning strategies. (2) We develop ground-truth Bayesian uncertainty estimates in terms of the data-generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries. We confirm our results on synthetic data and design an approximate approach that consistently outperforms other uncertainty estimators on real datasets. (3) Based on this analysis, we relate pitfalls in modeling uncertainty to existing methods. Our analysis enables and informs the development of principled uncertainty estimation on graphs.

1. Introduction

Applications in machine learning are often limited by their data efficiency. This encompasses effort spent on experimental design (Sverchkov & Craven, 2017) or the cost of training on large datasets (Cui et al., 2022). To remedy these problems, Active Learning (AL) allows the learner to query an oracle (e.g. users, machines, or experiments) to label specific data points considered informative, thus saving labeling labor and training effort that would otherwise be spent on uninformative labeled data.

Uncertainty Sampling (US) methods (Beluch et al., 2018; Joshi et al., 2009) rely on uncertainty estimates to measure the informativeness of labeling each data point. Intuitively, areas where a learner lacks knowledge are assigned high uncertainty. Moreover, methods should distinguish the (irreducible) aleatoric uncertainty from the (reducible) epistemic uncertainty that together compose the total uncertainty about a prediction (Kiureghian & Ditlevsen, 2009; Kendall & Gal, 2017). This disentanglement is particularly important for AL: the learner might not benefit much from labeling instances with high irreducible uncertainty, while acquiring knowledge about data points whose uncertainty stems from reducible sources can be highly informative. This suggests epistemic uncertainty as a sensible acquisition function (Nguyen et al., 2022). For independent and identically distributed (i.i.d.) data, US methods for AL, in particular those disentangling aleatoric and epistemic uncertainty, have demonstrated high data efficiency in benchmarks (Joshi et al., 2009; Gal et al., 2017; Beluch et al., 2018; Kirsch et al., 2019; Nguyen et al., 2022; Schmidt & Günnemann, 2023).
Existing efforts in AL for interdependent data like graphs neglect the compositional nature of uncertainty (Zhu et al., 2003; Jun & Nowak, 2016; Regol et al., 2020; Wu et al., 2021; Zhang et al., 2022c) and limit themselves to a single measure of total uncertainty. This leaves it unclear (i) to which extent uncertainty estimators can effectively inform US for graph data, and (ii) whether disentangling aleatoric and epistemic uncertainty benefits AL as it does in i.i.d. settings. The complex nature of graph data makes investigating US methods for AL particularly challenging: uncertainty estimates should not only capture information about node features independently but also model information about their relationships (Stadler et al., 2021).

In this work, we examine US for node classification problems. We critically evaluate and benchmark state-of-the-art uncertainty estimators against traditional AL strategies and find that US (as well as many other approaches) falls short of surpassing random sampling. Motivated by the effectiveness of US for i.i.d. data, we formally approach the question of whether US methods for graphs align with the AL objective at all. We derive novel ground-truth uncertainty measures from the underlying data-generating process and disentangle its aleatoric and epistemic components.

Figure 1: US can be realized by acquiring the label of a node with maximal total, aleatoric, or epistemic uncertainty. The former two include irreducible effects, leading to labeling node a, while the latter isolates epistemic factors and queries the label of node b, thereby increasing the confidence in correctly predicting the remaining unlabeled nodes the most.

This generative perspective enables a principled treatment of instance interdependence in uncertainty estimation that we leverage for AL: we prove that querying epistemically uncertain nodes is equivalent to maximizing the relative gain in confidence a Bayesian classifier puts on correctly predicting all unlabeled nodes. Figure 1 shows that acquiring the most epistemically uncertain label can improve the confidence of the classifier in the correct prediction more than selecting a node associated with high total uncertainty that mainly stems from irreducible factors: while node a is mainly uncertain because of its link to an orange node, node b's uncertainty comes from the lack of labels in its neighbourhood. By employing our proposed estimates on a Contextual Stochastic Block Model (CSBM) (Deshpande et al., 2018), we empirically confirm the validity of our findings on synthetic data. Our analysis reveals that US is an effective strategy only if uncertainty is disentangled into aleatoric and epistemic components.

The main contributions of our work are:

- We provide the first extensive AL benchmark¹ for node classification including both a broad range of state-of-the-art uncertainty estimators for US and traditional acquisition strategies. Many traditional methods and uncertainty estimators do not outperform random acquisition.
- We derive ground-truth aleatoric and epistemic uncertainty for a Bayesian classifier and formally prove the alignment of US with AL.
- We empirically confirm the efficacy of epistemic US both on synthetic and real data by employing an approximation to our proposed ground-truth uncertainty that outperforms SOTA uncertainty estimators off-the-shelf. This enables future uncertainty estimators to achieve competitive AL performance by building on the principles of our work.

¹Find our code at cs.cit.tum.de/daml/graph-active-learning/

2. Background

AL for Semi-Supervised Node Classification. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a graph with a set of nodes $\mathcal{V}$ and edges $\mathcal{E} \subseteq \{\{i, j\} : i, j \in \mathcal{V}\}$. With $n = |\mathcal{V}|$ nodes, the adjacency matrix $A \in \{0, 1\}^{n \times n}$ is defined by $A_{ij} = A_{ji} = 1$ if there exists an edge between nodes $i$ and $j$. The features are represented as a matrix $X \in \mathbb{R}^{n \times d}$. In node classification, every node has a label, represented jointly by a vector $y \in [C]^n$. We decompose the set of all nodes into $\mathcal{V} = \mathcal{U} \cup \mathcal{O}$, where $\mathcal{O}$ contains nodes with observed labels $y_\mathcal{O}$ and $\mathcal{U}$ contains the remaining nodes whose unobserved labels $y_\mathcal{U}$ we want to infer. We consider pool-based AL: in each iteration, the learner queries an oracle for a label $y_i$ from the set of unobserved labels $y_\mathcal{U}$, adds it to the set of observed labels $y_\mathcal{O}$, and retrains the model.

Uncertainty in Machine Learning. Alongside the predicted label $y_i$, it is crucial to consider the associated uncertainty (Kiureghian & Ditlevsen, 2009; Kendall & Gal, 2017). Aleatoric uncertainty $u_{\text{alea}}$ is the inherent uncertainty that comes from elements like experimental randomness; it cannot be reduced by acquiring more data. Epistemic uncertainty $u_{\text{epi}}$, on the other hand, reflects knowledge gaps, which can be addressed by data acquisition. A commonly employed conceptualization (Depeweg et al., 2018) defines the total predictive uncertainty $u_{\text{total}}$ as encapsulating both reducible and irreducible factors, i.e. $u_{\text{total}} = u_{\text{epi}} + u_{\text{alea}}$.

Contextual Stochastic Block Model. We approach US on graphs sampled from an explicit generative process $p(A, X, y)$. To that end, we generate data from a Contextual Stochastic Block Model (CSBM) (Deshpande et al., 2018), enabling a well-principled study of exact ground-truth uncertainty estimators. We first sample the node labels $y$ independently from a prior $p(y)$. Node features are generated from class-conditional normal distributions $p(X_i \mid y_i) = \mathcal{N}(\mu_{y_i}, \sigma_x^2 I)$ and edges are introduced independently according to an affiliation matrix $F \in [0, 1]^{C \times C}$ as $p(A_{ij} \mid y_i, y_j) = \mathrm{Ber}(F_{y_i, y_j})$. We defer an in-depth description to Appendix B.3.

3. Related Work

Active Learning on Independent and Identically Distributed Data. AL has seen substantial exploration in the context of i.i.d. data (Ren et al., 2021). Approaches can be divided into three main categories: diversity-based, uncertainty-based, or a combination thereof (Zhan et al., 2022). Diversity- or representation-based methods query data samples that best represent the full dataset, i.e. they opt for a diverse set of data points. Approaches like K-Means or Coreset minimize the difference in model loss between the selected training set and the whole dataset (Sener & Savarese, 2018). Other approaches use adversarial techniques to estimate the representativeness and diversity of new samples (Sinha et al., 2019; Shui et al., 2020).
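As an illustration of this family, the greedy k-center selection at the core of Coreset can be sketched in a few lines. This is our own minimal sketch, assuming numpy and precomputed latent embeddings, not the reference implementation:

```python
import numpy as np

def k_center_greedy(embeddings, labeled_idx, budget):
    """Greedy core-set selection: repeatedly query the node whose embedding
    is farthest from every node selected so far (Sener & Savarese, 2018)."""
    # Distance of each node to its nearest labeled node (assumes >= 1 label).
    dists = np.min(
        np.linalg.norm(embeddings[:, None] - embeddings[labeled_idx][None], axis=-1),
        axis=1,
    )
    queries = []
    for _ in range(budget):
        i = int(np.argmax(dists))  # farthest-first: the next cluster center
        queries.append(i)
        # Selected nodes get distance 0, so they are never picked twice.
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[i], axis=-1))
    return queries
```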
Uncertainty-based approaches query instances that the classifier is most uncertain about. Uncertainty is commonly computed from the predictive distribution of a classifier by calculating its entropy (Shannon, 1948), the margin between the two most likely labels, or the probability of the least confident label (Wang & Shang, 2014). Houlsby et al. (2011) introduce Bayesian Active Learning by Disagreement (BALD), which queries points with high mutual information between the model parameters and the class label. Another line of work uses a loss-prediction module as a proxy for uncertainty (Yoo & Kweon, 2019; Schmidt & Günnemann, 2023). Nguyen et al. (2019) leverage disentangled uncertainties (Kendall & Gal, 2017) for AL and argue that epistemic uncertainty is a better proxy for US than aleatoric estimates. Other approaches linearly combine diversity- and uncertainty-based measures (Yin et al., 2017) or employ a two-step optimization scheme (Ash et al., 2020; Zhan et al., 2022). BADGE (Ash et al., 2019) queries diverse and uncertain instances in the gradient space of the model using clustering. Another line of work constructs similarity graphs between instances to enforce diversity among queries (Dasarathy et al., 2015). GALAXY (Zhang et al., 2022a) further refines this approach by mitigating class imbalance. In contrast, our work concerns settings where the adjacency matrix $A$ is explicitly given.

Active Learning on Interdependent Graph Data. While a plethora of studies exist on AL for i.i.d. data, only a limited amount of work addresses interdependent data like graphs. Previous methods approach AL on graphs from different perspectives, including random fields (Zhu et al., 2003; Ma et al., 2013; Ji & Han, 2012; Berberidis & Giannakis, 2018), risk minimization (Jun & Nowak, 2016; Regol et al., 2020), adversarial learning (Li et al., 2020), knowledge transfer between graphs (Hu et al., 2020), or querying cheap soft labels (Zhang et al., 2022b). US is typically only considered in terms of the predictive distribution (Madhawa & Murata, 2020), which does not disentangle aleatoric and epistemic components. Other uncertainty-reliant approaches do not make that distinction either (Cai et al., 2017; Gao et al., 2018; Li et al., 2022), even though the literature on i.i.d. data suggests that US benefits from disentangled uncertainty estimators (Nguyen et al., 2022; Sharma & Bilgic, 2016). The exploration of US for AL on graphs beyond total uncertainty remains uncharted territory. Our work targets this gap and showcases the unrealized potential of epistemic uncertainty estimators for AL on graph data.

Uncertainty Estimation on Graphs. US strategies necessitate accurate uncertainty estimates, which can be obtained in different ways. In classification, the predictive distribution of deterministic classifiers has been used to obtain aleatoric uncertainty (Stadler et al., 2021). The Graph Convolutional Network (GCN) (Kipf & Welling, 2017) iteratively updates the representation $H^{(l)}$ of each node using a linear transformation $W^{(l)}$, subsequent diffusion along the edges, and a non-linearity $\sigma$. Formally, a GCN layer is expressed as $f(H^{(l)}, A) = \sigma(A H^{(l)} W^{(l)})$. The APPNP model (Klicpera et al., 2018) first transforms the node features independently and then diffuses predictions according to approximate Personalized PageRank (PPR) scores. Bayesian approaches model the posterior distribution over the model parameters. Ensembles (Lakshminarayanan et al., 2017) fall into this category: they approximate the posterior over model parameters through a collection of independently trained models.
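To make these scores concrete, the following minimal sketch (our own illustration; shapes and function names are assumptions) computes the entropy, margin, least-confidence, and BALD scores from the samples of such a Monte-Carlo posterior approximation, e.g. the softmax outputs of an ensemble:

```python
import numpy as np

def acquisition_scores(probs):
    """probs: (K, n, C) class probabilities from K ensemble members
    (or K stochastic forward passes). Returns per-node scores where
    a higher value means "query this node first"."""
    mean_p = probs.mean(axis=0)                              # (n, C) posterior predictive
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)     # total predictive uncertainty
    top2 = np.sort(mean_p, axis=-1)[:, -2:]                  # two most likely classes
    margin = -(top2[:, 1] - top2[:, 0])                      # small margin = uncertain
    least_conf = 1.0 - mean_p.max(axis=-1)
    # BALD: mutual information between parameters and label, i.e.
    # entropy of the mean prediction minus the mean per-sample entropy.
    exp_entropy = -(probs * np.log(probs + 1e-12)).sum(-1).mean(0)
    bald = entropy - exp_entropy
    return {"entropy": entropy, "margin": margin,
            "least_confident": least_conf, "bald": bald}
```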
Monte Carlo Dropout (MC-Dropout) (Gal & Ghahramani, 2016) instead emulates a distribution over model parameters by applying dropout at inference time. DropEdge (Rong et al., 2020) additionally drops edges to reduce over-fitting and over-smoothing. Variational Bayes methods place a prior on the model parameters (BGCN) and sample different parameter sets for each forward pass (Blundell et al., 2015). They allow access to disentangled uncertainty estimates by approximating a distribution over predictions from multiple forward passes. Commonly employed measures of epistemic uncertainty are the mutual information between the model weights and predicted labels (Gawlikowski et al., 2023) or the variance in confidence about the predicted label (Stadler et al., 2021). Evidential methods like GPN (Stadler et al., 2021) disentangle epistemic and aleatoric uncertainty by outputting the parameters of a Dirichlet prior to the categorical predictive distribution. This method has shown strong performance in detecting distribution shifts. Finally, Gaussian processes on graphs, while potentially leading to strong uncertainty estimates in specific domains (Wollschläger et al., 2023), do not disentangle aleatoric and epistemic uncertainty (Liu et al., 2020; Borovitskiy et al., 2021).

4. Benchmarking Uncertainty Sampling Approaches for Active Learning on Graphs

Previous studies on non-uncertainty-based AL on graphs find that AL strategies struggle to consistently outperform random sampling (Madhawa & Murata, 2020). We therefore ask whether US shows any merit when using state-of-the-art uncertainty estimation. We are the first to design a comprehensive AL benchmark for node classification that not only considers traditional methods but also includes a variety of uncertainty estimators for US. Our work answers the following research question:

**Does Uncertainty Sampling using state-of-the-art uncertainty estimators work on graph data?** No: no uncertainty estimator outperforms random sampling, and most non-uncertainty strategies fail to consistently improve over random queries as well.

Experimental Setup. We evaluate AL on five common benchmark datasets for node classification: the citation graphs Cora ML (Bandyopadhyay et al., 2005), Citeseer (Sen et al., 2008; Giles et al., 1998), and PubMed (Namata et al., 2012), as well as the co-purchase graphs Amazon Photos and Amazon Computers (McAuley et al., 2015). We evaluate the models of Section 3: GCN, APPNP, MC-Dropout, BGCN, GPN, and Ensembles, and report average results over multiple AL runs (see Appendix B). When acquiring multiple labels per iteration, greedily selecting the most promising candidates might overestimate the performance improvement (Kirsch et al., 2019). We therefore acquire only a single label in each iteration, enabling us to analyze the performance of different acquisition strategies without having to consider the potential side effects of batched acquisition. We initially label one node per class and fix the acquisition budget to $4C$. In addition to a qualitative evaluation of the accuracy curves of different strategies in Figure 2, we report the accuracy after the labeling budget is exhausted in Table 6. Good acquisition functions should achieve higher accuracy with fewer queries, which in turn results in a larger area under the visualized curves (AUC). After normalization, this metric quantifies the average accuracy, which we report as a summary of the AL performance in Table 1.
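For concreteness, this normalized AUC summary can be computed as follows; a minimal sketch of ours, assuming test accuracies are recorded after every single-label acquisition (the benchmark's exact evaluation code may differ in details):

```python
import numpy as np

def normalized_auc(accuracies):
    """Area under the accuracy-vs-#labels curve, normalized by the budget
    so that the result is the average accuracy over the AL run."""
    acc = np.asarray(accuracies, dtype=float)
    # Trapezoidal rule with unit spacing between consecutive acquisitions.
    auc = 0.5 * (acc[:-1] + acc[1:]).sum()
    return float(auc / (len(acc) - 1))

# e.g. normalized_auc([0.42, 0.55, 0.61, 0.64]) ~= 0.563, the average accuracy
```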
We proceed as follows: first, we benchmark traditional AL approaches and find that only one consistently outperforms random selection, which acquires labels with uniform probability. Then, we show that US fails to even match the performance of random queries in many instances, contrasting its successful application to i.i.d. data. We supply the corresponding AUC and final accuracy scores as well as visualizations on all datasets in Appendix C.

Non-Uncertainty-based Strategies. We first examine strategies that do not exclusively rely on uncertainty. (i) Coreset (Sener & Savarese, 2018) opts to find a core-set cover of the training pool by selecting nodes that maximize the minimal distance in the latent space of a classifier to already acquired instances. (ii) Coreset-PPR is similar to Coreset, but we use inverse Personalized PageRank (PPR) scores as a distance measure to select structurally dissimilar nodes. (iii) Coreset Features: distances between nodes are computed only in terms of input features. (iv) Degree and PPR: we acquire nodes with the highest corresponding centrality measure. (v) AGE (Cai et al., 2017) and ANRMAB (Gao et al., 2018) combine total predictive uncertainty, informativeness, and representativeness metrics. (vi) GEEM (Regol et al., 2020) uses risk minimization to select the next query. Because of its high computational cost, we follow its authors and employ an SGC (Wu et al., 2019) backbone. (vii) SEAL (Li et al., 2020) uses adversarial learning to identify nodes dissimilar from the labeled set. If applicable, we also consider setting $A = I$ to exclude structural information, similar to Stadler et al. (2021).

Figure 2: Accuracy of AL strategies on Citeseer using a GCN (left) / SGC (right) classifier. Except for GEEM, which is only tractable for SGCs, traditional AL cannot significantly outperform random selection.

Observations: Figure 2 shows that only GEEM identifies a training set that is significantly more informative than random sampling. The performance of GEEM, however, comes at a high computational cost, as it requires training $O(nC)$ models in each acquisition step, which makes it intractable for larger datasets and models beyond SGC. Structural Coreset strategies work well on both co-purchase networks but do not show strong results on other graphs. This highlights a notable gap between many commonly used acquisition functions and an optimal strategy.

Uncertainty Sampling. We evaluate US using different uncertainty estimators: (i) Aleatoric: like Stadler et al. (2021), we compute $u_{\text{alea}} = \max_c p_c$. (ii) Epistemic: for models performing multiple predictions (MC-Dropout, Ensembles, BGCNs), we compute epistemic uncertainty as the variance of the confidence in the predicted class $\hat{c}$, $u_{\text{epi}} = \mathrm{Var}[p_{\hat{c}}]$. For GPN, we follow the authors and use evidence as a measure of epistemic confidence. (iii) Energy: energy-based models (EBMs) (Liu et al., 2021; Wu et al., 2023) relate uncertainty to the energy $u = -\tau \log \sum_c \exp(l_c / \tau)$ of the predicted logits $l$. We apply this estimator to the deterministic GCN, APPNP, and SGC models as a surrogate for epistemic uncertainty.
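A minimal, numerically stable sketch of this energy estimator (our own illustration; the benchmarked implementations may differ in details such as the choice of temperature $\tau$):

```python
import numpy as np

def energy_uncertainty(logits, tau=1.0):
    """Energy-based uncertainty for logits of shape (n, C):
    u_i = -tau * log sum_c exp(l_ic / tau). Higher energy means the logits
    are less peaked, i.e. the model is more uncertain about that node."""
    z = logits / tau
    # log-sum-exp with max-shift for numerical stability
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -tau * lse
```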
Figure 3: US on Citeseer. No method significantly outperforms random selection.

Again, we ablate each strategy by withholding structural information ($A = I$). Observations: in contrast to the literature on i.i.d. data (Beluch et al., 2018; Gal et al., 2017; Nguyen et al., 2022), we observe none of the US approaches to be effective in Figure 3. While sampling instances with high aleatoric uncertainty matches the performance of random queries, we find epistemic uncertainty to even underperform in many instances. This is surprising, as the efficacy of epistemic US has been demonstrated for i.i.d. data (Nguyen et al., 2022; Kirsch et al., 2019). Only ensemble models match the performance of random sampling and slightly outperform it on some datasets. GPN and energy-based approaches cannot guide US toward effective queries. This is an intriguing result, as both uncertainty estimators have been shown to be highly effective for out-of-distribution detection (Stadler et al., 2021; Wu et al., 2023).

5. Ground-Truth Uncertainty from the Data Generating Process

As no existing US method yields satisfactory results, we formally answer the following research question:

**Does US on graphs align with the AL objective?** Yes: we formally show that acquiring the node with maximal epistemic uncertainty optimizes the gain in the posterior probability of the ground-truth labels $y_\mathcal{U}$ of unobserved nodes.

Evaluating the quality of uncertainty estimates is inherently difficult as, in general, ground-truth values are unavailable for both the total predictive uncertainty $u_{\text{total}}$ and its constituents $u_{\text{alea}}$ and $u_{\text{epi}}$. Additionally, since epistemic uncertainty pertains to the knowledge of the classifier, it cannot be defined in a model-agnostic manner. Therefore, we analyze uncertainty from the perspective of the underlying (potentially unknown) data-generating process $p(X, A, y)$ with respect to a Bayesian classifier. This lends itself to a definition of ground-truth uncertainty. In the following, we propose confidence measures and relate them to uncertainty as their inverse, $\mathrm{conf} := u^{-1}$. This allows us to state the main theoretical result of our work: the optimality of US using epistemic uncertainty.

**Definition 5.1.** We define the parametrized Bayesian classifier $f_\theta(A, X, y_\mathcal{O}^{gt})$ in terms of the data-generating process $p(A, X, y)$ as the prediction $c \in [C]^{|\mathcal{U}|}$ that maximizes:

$$\mathbb{E}_{p(\theta \mid A, X, y_\mathcal{O}^{gt})}\left[ P\left( y_\mathcal{U} = c \mid A, X, y_\mathcal{O} = y_\mathcal{O}^{gt}, \theta \right) \right] \tag{1}$$

Here, we denote with $y_\mathcal{O}^{gt}$ the labels of already observed instances. The predictive distribution $p(y_\mathcal{U} \mid A, X, y_\mathcal{O}^{gt})$ encapsulates the total confidence of $f_\theta$. The classifier averages its prediction according to a learnable posterior distribution $p(\theta \mid A, X, y_\mathcal{O})$ over its parameters, e.g. the weights of a GNN. Marginalization yields the total confidence $\mathrm{conf}_{\text{total}}$.

**Definition 5.2.** The total confidence $\mathrm{conf}_{\text{total}}(i, c)$ of $f_\theta$ in predicting label $c$ for node $i$ is defined as:

$$\mathbb{E}_{p(\theta \mid A, X, y_\mathcal{O}^{gt})}\left[ P\left( y_i = c \mid A, X, y_\mathcal{O} = y_\mathcal{O}^{gt}, \theta \right) \right] \tag{2}$$

Intuitively, the total confidence captures aleatoric factors through the inherent randomness in the data-generating process $p(y, A, X)$. Epistemic uncertainty is incorporated by conditioning on a limited set of observed labels $y_\mathcal{O}^{gt}$. With a growing labeled set, irreducible errors will increasingly dominate total predictive uncertainty. In the extreme case where all labels but one have been observed, i.e. $\mathcal{O} = \mathcal{V} \setminus \{v_i\}$, the remaining uncertainty stems only from aleatoric factors.
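For intuition, when the parameter posterior has finite support, for instance an ensemble of $K$ parameter draws $\theta_k$ with weights $w_k$ (an illustrative special case, not an assumption of the analysis), the expectation in Definition 5.2 reduces to a weighted average:

$$\mathrm{conf}_{\text{total}}(i, c) = \sum_{k=1}^{K} w_k \, P\left( y_i = c \mid A, X, y_\mathcal{O} = y_\mathcal{O}^{gt}, \theta_k \right), \qquad \sum_{k=1}^{K} w_k = 1$$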
**Definition 5.3.** The aleatoric confidence $\mathrm{conf}_{\text{alea}}(i, c)$ of $f_\theta$ in predicting label $c$ for node $i$ is defined as:

$$\mathbb{E}_{p(\theta \mid A, X, y_{-i}^{gt})}\left[ P\left( y_i = c \mid A, X, y_{-i} = y_{-i}^{gt}, \theta \right) \right] \tag{3}$$

Here, we denote with $y_{-i} = y_{-i}^{gt}$ that all nodes excluding the predicted node $i$ are observed with their true values. All remaining lack of confidence is deemed irreducible. Lastly, we define epistemic confidence by comparing aleatoric factors to the overall confidence. As both are defined probabilistically, we consider their ratio:

**Definition 5.4.** The epistemic confidence of $f_\theta$ in predicting label $c$ for node $i$ is defined as:

$$\mathrm{conf}_{\text{epi}}(i, c) := \mathrm{conf}_{\text{alea}}(i, c) \,/\, \mathrm{conf}_{\text{total}}(i, c) \tag{4}$$

These definitions directly imply a notion of uncertainty: $u_{\text{epi}}(i, c) = u_{\text{total}}(i, c) / u_{\text{alea}}(i, c)$. Epistemic US thus labels node $i$ when the associated total uncertainty $u_{\text{total}}(i, y_i^{gt})$ is large compared to its aleatoric uncertainty $u_{\text{alea}}(i, y_i^{gt})$. It favors uncertain nodes when the uncertainty stems from non-aleatoric sources.

Table 1: Average AUC (↑) for different acquisition strategies on different models and datasets. We mark the best strategy per model in bold. Strategy inputs: Coreset, AGE, ANRMAB, GEEM, and SEAL use A & X; Coreset PPR uses only A; Coreset Inputs uses only X; the Epi./(Energy) and Alea. columns are reported once with inputs A & X and once with features only (X, i.e. A = I).

| Dataset | Model | Random | Coreset | AGE | ANRMAB | GEEM | SEAL | Coreset PPR | Coreset Inputs | Epi./(Energy), A & X | Alea., A & X | Epi./(Energy), X | Alea., X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cora ML | GCN | 62.51 | 64.35 | 64.12 | 64.24 | n/a | **66.07** | 59.53 | 61.26 | 63.97 | 61.30 | 65.65 | 64.33 |
| | APPNP | 67.72 | 67.32 | 66.12 | 69.49 | n/a | n/a | **71.04** | 64.49 | 64.92 | 67.68 | 69.59 | 66.69 |
| | Ensemble | 63.89 | 60.55 | 64.80 | 65.10 | n/a | n/a | 62.65 | 65.07 | 63.47 | 64.03 | 64.80 | **65.82** |
| | MC-Dropout | **64.94** | 64.37 | 64.44 | 64.06 | n/a | n/a | 62.92 | 64.35 | 59.17 | 63.69 | 61.82 | 63.87 |
| | BGCN | 45.76 | 49.37 | **51.25** | 47.23 | n/a | n/a | 39.43 | 44.85 | 44.45 | 46.61 | 42.74 | 48.11 |
| | GPN | 56.50 | n/a | n/a | n/a | n/a | n/a | **58.04** | 54.02 | 54.75 | 57.16 | 55.89 | 57.21 |
| | SGC | 63.85 | 65.23 | 67.56 | 61.14 | **71.39** | n/a | 60.24 | 59.18 | 67.51 | 65.66 | 65.05 | 67.13 |
| Citeseer | GCN | 61.56 | 62.61 | **69.48** | 60.31 | n/a | 58.62 | 61.71 | 56.71 | 59.64 | 61.85 | 59.66 | 60.34 |
| | APPNP | 64.61 | 63.88 | **70.18** | 63.83 | n/a | n/a | 64.21 | 56.87 | 63.09 | 62.37 | 62.23 | 63.95 |
| | Ensemble | 59.26 | 64.25 | **68.26** | 60.40 | n/a | n/a | 61.89 | 56.36 | 63.70 | 61.37 | 59.71 | 61.15 |
| | MC-Dropout | 58.30 | 62.97 | **65.24** | 60.50 | n/a | n/a | 61.43 | 56.01 | 58.67 | 59.07 | 59.23 | 62.22 |
| | BGCN | 53.59 | **59.29** | 56.93 | 52.68 | n/a | n/a | 53.40 | 51.40 | 55.19 | 52.81 | 57.09 | 54.62 |
| | GPN | 59.76 | n/a | n/a | n/a | n/a | n/a | **62.08** | 54.34 | 58.82 | 57.24 | 56.25 | 59.37 |
| | SGC | 56.79 | 64.48 | **69.20** | 60.49 | 64.82 | n/a | 62.15 | 52.25 | 62.04 | 61.55 | 61.04 | 60.74 |
| Amazon Photos | GCN | 79.06 | 78.58 | 75.17 | 79.97 | n/a | 71.16 | 70.20 | **82.71** | 74.66 | 74.61 | 79.96 | 79.63 |
| | APPNP | 79.29 | 81.04 | 79.02 | 80.35 | n/a | n/a | 76.37 | **84.24** | 79.72 | 77.45 | 80.48 | 77.69 |
| | Ensemble | 82.23 | 80.44 | 77.45 | 82.77 | n/a | n/a | 74.93 | 84.04 | **84.46** | 77.85 | 80.50 | 81.25 |
| | MC-Dropout | 80.32 | 76.63 | 74.75 | 80.21 | n/a | n/a | 75.32 | **82.45** | 72.42 | 73.16 | 69.68 | 78.80 |
| | BGCN | 71.22 | 67.15 | 65.69 | 70.69 | n/a | n/a | 59.34 | **73.39** | 70.83 | 67.83 | 72.21 | 69.19 |
| | GPN | 62.80 | n/a | n/a | n/a | n/a | n/a | 55.59 | **65.07** | 54.78 | 60.53 | 62.90 | 62.41 |
| | SGC | 80.52 | 82.32 | 74.01 | 80.92 | **86.43** | n/a | 66.94 | 84.24 | 84.01 | 71.43 | 80.75 | 76.38 |

**Remark 5.5.** In most applications, including US, monotonic transformations of an uncertainty estimator do not affect its behaviour. Therefore, uncertainty can equivalently be defined as a difference between log-likelihoods instead of a ratio of likelihoods. This definition recovers the well-established additive nature of uncertainty: $\log u_{\text{total}}(i, y_i^{gt}) - \log u_{\text{alea}}(i, y_i^{gt}) = \log u_{\text{epi}}(i, y_i^{gt})$. The same holds for logarithmic confidence definitions.
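When the joint likelihood of the generative process is available, Definitions 5.2 to 5.4 can be evaluated exactly by brute-force enumeration. The following sketch is our own illustration, assuming known generative parameters (so the posterior over $\theta$ collapses to a point) and a graph small enough for the $O(C^n)$ enumeration:

```python
import itertools
import numpy as np

def ground_truth_confidences(joint_loglik, y_true, observed, i):
    """Exact conf_total / conf_alea / conf_epi (Defs. 5.2-5.4) for node i and
    its true class, by enumerating all completions of the label vector.
    joint_loglik: callable mapping a full label vector y to log p(A, X, y)
                  (y-independent constants cancel in the ratios)
    y_true:       ground-truth labels (the aleatoric term conditions on them)
    observed:     set of node indices whose labels were already acquired."""
    n, C = len(y_true), int(max(y_true)) + 1
    c = y_true[i]

    def conditional_conf(conditioned):
        # P(y_i = c | y_conditioned = ground truth), marginalizing the rest.
        free = [v for v in range(n) if v not in conditioned and v != i]
        num = den = 0.0
        y = np.array(y_true)  # conditioned nodes keep their true labels
        for assignment in itertools.product(range(C), repeat=len(free)):
            if free:
                y[free] = assignment
            for ci in range(C):
                y[i] = ci
                # exp() may underflow on larger graphs; a log-sum-exp
                # accumulation would fix this in a real implementation.
                p = np.exp(joint_loglik(y))
                den += p
                if ci == c:
                    num += p
        return num / den

    conf_total = conditional_conf(set(observed))        # Definition 5.2
    conf_alea = conditional_conf(set(range(n)) - {i})   # Definition 5.3
    conf_epi = conf_alea / conf_total                   # Definition 5.4
    return conf_total, conf_alea, conf_epi
```

Epistemic US would then query the node maximizing $u_{\text{epi}} = \mathrm{conf}_{\text{epi}}^{-1}$, i.e. the largest ratio of total to aleatoric confidence.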
Notably, the core result of our work, which we state next, also holds when defining uncertainty in terms of log-likelihoods.

**Theorem 5.6.** Epistemic uncertainty $u_{\text{epi}}(i, y_i^{gt})$ of a node $i$ is equivalent to the relative gain its acquisition provides to the posterior over the remaining true labels:

$$u_{\text{epi}}(i, y_i^{gt}) = \frac{P\left( y_{\mathcal{U} \setminus i} = y_{\mathcal{U} \setminus i}^{gt} \mid A, X, y_\mathcal{O}, y_i = y_i^{gt} \right)}{P\left( y_{\mathcal{U} \setminus i} = y_{\mathcal{U} \setminus i}^{gt} \mid A, X, y_\mathcal{O} \right)}$$

Hence, acquiring the most epistemically uncertain node is an optimal AL strategy for $f_\theta$.

We provide a proof of Theorem 5.6 in Appendix A. Here, we refer to all unobserved labels excluding $y_i$ as $y_{\mathcal{U} \setminus i}$. The ratio that is optimized by epistemic US corresponds to the relative increase in the posterior of the true unobserved labels. That is, it compares the probability the classifier $f_\theta$ assigns to the remaining unobserved true labels after acquiring the ground-truth label $y_i^{gt}$ of node $i$, as opposed to not acquiring its label. High values indicate that the underlying classifier will be significantly more likely to predict the true labels of the remaining nodes after the corresponding query. Thus, a query that maximizes epistemic uncertainty will push the classifier toward predicting the true labels for all remaining unlabeled nodes. This holds for any Bayesian classifier that specifies a posterior $p(\theta \mid A, X, y_\mathcal{O})$ over the parameters of the generative process. For example, fitting the parameters of a GNN is an instance of this framework where all probability mass is put on one estimate of $\theta$. Approaches like Bayesian GNNs explicitly specify this posterior distribution. However, computing exact disentangled uncertainty requires access to the unavailable labels $y_\mathcal{U}$ and is therefore impractical. Our analysis motivates the development of tractable approximations to these quantities. Novel US approaches can directly benefit from the theoretical optimality guarantees that this work provides.

6. Uncertainty Sampling with Ground-Truth Uncertainty

To support our theoretical claims, we employ US using the proposed ground-truth epistemic uncertainty as an acquisition function. Since the data-generating process is not known for real-world datasets, we first focus our analysis on the CSBMs defined in Section 2 and discuss a practical approximation later in Section 7. This allows us to compute the uncertainty estimates of Definitions 5.2 to 5.4 directly by evaluating the explicit joint likelihood of the generative process $p(A, X, y)$ (see Appendix D). While the optimality of epistemic US holds for any data-generating process, we focus on CSBMs as they have been extensively studied as proxies for real data in node classification (Palowitch et al., 2022). To isolate the effect of correctly disentangling uncertainty, we also assume the parameters of the underlying CSBM to be known to the Bayesian classifier $f_\theta$. Therefore, any discrepancies in US performance are purely linked to the disentanglement into aleatoric and epistemic factors.

**Is US effective in practice?** Yes: we observe a significant improvement over random acquisition using the proposed ground-truth uncertainty. It is crucial to disentangle uncertainty into aleatoric and epistemic factors.

We compare the performance of US using the proposed uncertainty measures to contemporary uncertainty estimators over 5 graphs with 100 nodes and 7 classes sampled from a CSBM distribution $p(A, X, y)$ in Figure 4. We report similar findings for larger graphs in Appendix E.
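The CSBM of Section 2 makes this joint likelihood explicit. A minimal sketch of ours (`mu`, `sigma`, `F`, and `prior` denote the known CSBM parameters; additive terms independent of $y$ are dropped since they cancel in the confidence ratios):

```python
import numpy as np

def csbm_joint_loglik(y, A, X, mu, sigma, F, prior):
    """log p(A, X, y) for a CSBM, up to additive terms independent of y.
    prior: (C,) class prior, mu: (C, d) class means, sigma: scalar std,
    F: (C, C) edge affiliation matrix, A: (n, n) adjacency, X: (n, d)."""
    ll = np.log(prior[y]).sum()                              # p(y): independent prior
    ll += -0.5 * (((X - mu[y]) / sigma) ** 2).sum()          # p(X | y): Gaussian features
    iu = np.triu_indices(len(y), k=1)                        # each node pair once
    p_edge = F[y[:, None], y[None, :]][iu]                   # Ber(F_{y_i, y_j})
    a = A[iu]
    ll += (a * np.log(p_edge) + (1 - a) * np.log1p(-p_edge)).sum()  # p(A | y)
    return ll
```

Passing `lambda y: csbm_joint_loglik(y, A, X, mu, sigma, F, prior)` to the enumeration sketch from Section 5 reproduces, on tiny graphs, the kind of ground-truth scores evaluated here.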
Figure 4: US on a CSBM with 100 nodes and 7 classes using $f_\theta$. Ground-truth epistemic uncertainty significantly outperforms other estimators and random queries.

Observations: In agreement with Theorem 5.6, epistemic uncertainty significantly outperforms random queries as well as aleatoric and total uncertainty, which we explain formally in the following Propositions 6.1 and 6.2. We continue by analyzing which aspects of uncertainty modelling are crucial for its success in AL.

Disentangling Uncertainty. We first discuss why acquiring nodes with high total uncertainty $u_{\text{total}}$ performs worse than isolating epistemic factors: total uncertainty favors not only informative queries but also tends to acquire labels of nodes that are associated with a high aleatoric uncertainty $u_{\text{alea}}$.

**Proposition 6.1.** Total uncertainty $u_{\text{total}}(i, y_i^{gt})$ of a node $i$ is proportional to the posterior over the unobserved true labels $y_{\mathcal{U} \setminus i}^{gt}$ after acquiring its label $y_i^{gt}$:

$$u_{\text{total}}(i, y_i^{gt}) \propto P\left( y_{\mathcal{U} \setminus i} = y_{\mathcal{U} \setminus i}^{gt} \mid A, X, y_\mathcal{O}, y_i = y_i^{gt} \right)$$

We provide a proof of Proposition 6.1 in Appendix A. Acquiring nodes with maximal total uncertainty maximizes the posterior of the remaining unlabeled set $y_{\mathcal{U} \setminus i}$. This is problematic, as one way to increase this posterior probability is to remove an aleatorically uncertain node $i$ from the unlabeled set. Such a query will not push the posterior of the remaining nodes in $y_\mathcal{U}$ toward their true labels and instead improves the posterior by excluding nodes that are inherently difficult to predict. In contrast, the epistemic acquisition evaluates the joint posterior in relation to the effect of removing node $i$ from the unlabeled set (see Theorem 5.6). In fact, acquiring aleatorically uncertain nodes directly removes inherently ambiguous nodes from the unlabeled set.

**Proposition 6.2.** Aleatoric uncertainty $u_{\text{alea}}(i, y_i^{gt})$ of a node $i$ is proportional to the posterior over the unobserved true labels $y_{\mathcal{U} \setminus i}^{gt}$ without acquiring its label $y_i^{gt}$:

$$u_{\text{alea}}(i, y_i^{gt}) \propto P\left( y_{\mathcal{U} \setminus i} = y_{\mathcal{U} \setminus i}^{gt} \mid A, X, y_\mathcal{O} \right)$$

We provide a proof of Proposition 6.2 in Appendix A. Proposition 6.2 explains why we observe aleatoric US to be ineffective in Figure 4. It optimizes the posterior of the remaining labels $y_{\mathcal{U} \setminus i}^{gt}$ without considering the acquisition of $y_i^{gt}$. Such queries do not align with AL as they neglect the additional information obtained in each iteration. To optimize predictions on all remaining nodes, it is crucial to properly disentangle uncertainty into aleatoric and epistemic components and acquire epistemically uncertain labels.

Modelling the Data Generating Process. We also investigate the importance of the uncertainty estimator faithfully modelling the true data-generating process $p(A, X, y)$. To that end, we ablate our proposed uncertainty measures but only consider a Bayesian classifier that erroneously models present edges $(i, j) \in \mathcal{E}$ exclusively, while neglecting $(i, j) \notin \mathcal{E}$: $\hat{p}(A, X, y) := \prod_i \cdots$