# Uncertainty Aware Semi-Supervised Learning on Graph Data

Xujiang Zhao 1, Feng Chen 1, Shu Hu 2, Jin-Hee Cho 3
1 The University of Texas at Dallas, {xujiang.zhao, feng.chen}@utdallas.edu
2 University at Buffalo, SUNY, shuhu@buffalo.edu
3 Virginia Tech, jicho@vt.edu

## Abstract

Thanks to graph neural networks (GNNs), semi-supervised node classification has achieved state-of-the-art performance on graph data. However, GNNs have not considered the different types of uncertainty associated with class probabilities, which is needed to reduce the risk of misclassification under uncertainty in real-life settings. In this work, we propose a multi-source uncertainty framework using a GNN that reflects various types of predictive uncertainties from both the deep learning and belief/evidence theory domains for node classification predictions. By collecting evidence from the given labels of training nodes, the Graph-based Kernel Dirichlet distribution Estimation (GKDE) method is designed to accurately predict node-level Dirichlet distributions and detect out-of-distribution (OOD) nodes. We validated that our proposed model outperforms state-of-the-art counterparts in misclassification detection and OOD detection on six real network datasets. We found that dissonance-based detection yielded the best results on misclassification detection while vacuity-based detection was the best for OOD detection. To clarify the reasons behind these results, we provide a theoretical proof that explains the relationships between the different types of uncertainties considered in this work.

## 1 Introduction

Inherent uncertainties arising from different root causes have emerged as serious hurdles to finding effective solutions for real-world problems.
Critical safety concerns have arisen from failing to consider the diverse causes of uncertainty, since misinterpreting uncertainty leads to high risk (e.g., misdetection or misclassification of an object by an autonomous vehicle). Graph neural networks (GNNs) [12, 21] have received tremendous attention in the data science community. Despite their superior performance in semi-supervised node classification and regression, they did not consider various types of uncertainties in their decision process. Predictive uncertainty estimation [11] using Bayesian NNs (BNNs) has been explored for classification and regression in computer vision applications, based on aleatoric uncertainty (AU) and epistemic uncertainty (EU). AU refers to data uncertainty from statistical randomness (e.g., inherent noise in observations) while EU indicates model uncertainty due to limited knowledge (e.g., ignorance) in the collected data. In the belief or evidence theory domain, Subjective Logic (SL) [9] considers vacuity (i.e., a lack of evidence, or ignorance) as uncertainty in a subjective opinion. Recently, other uncertainty types, such as dissonance, consonance, vagueness, and monosonance [9], have been discussed based on SL and measured according to their different root causes.

We are the first to consider multidimensional uncertainty types from both the deep learning (DL) and belief and evidence theory domains for node-level classification, misclassification detection, and out-of-distribution (OOD) detection tasks. By leveraging the learning capability of GNNs and considering multidimensional uncertainties, we propose an uncertainty-aware estimation framework that quantifies the different uncertainty types associated with the predicted class probabilities. In this work, we made the following key contributions:

- A multi-source uncertainty framework for GNNs. Our proposed framework is the first to estimate various types of uncertainty from both the DL and evidence/belief theory domains, such as dissonance (derived from conflicting evidence) and vacuity (derived from a lack of evidence). In addition, we designed a Graph-based Kernel Dirichlet distribution Estimation (GKDE) method to reduce errors in quantifying predictive uncertainties.
- Theoretical analysis. Our work is the first to provide a theoretical analysis of the relationships between the different types of uncertainties considered in this work. We demonstrate via a theoretical analysis that an OOD node may have a high predictive uncertainty under GKDE.
- Comprehensive experiments for validating the performance of our proposed framework. Based on six real graph datasets, we compared the performance of our proposed framework with that of other competitive counterparts. We found that dissonance-based detection yielded the best results in misclassification detection while vacuity-based detection performed best in OOD detection.

Note that we use the term predictive uncertainty to mean uncertainty estimated to solve prediction problems.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## 2 Related Work

DL research has mainly considered aleatoric uncertainty (AU) and epistemic uncertainty (EU) using BNNs for computer vision applications. AU consists of homoscedastic uncertainty (i.e., constant errors for different inputs) and heteroscedastic uncertainty (i.e., different errors for different inputs) [4]. A Bayesian DL framework was presented to simultaneously estimate both AU and EU in regression (e.g., depth regression) and classification (e.g., semantic segmentation) tasks [11]. Later, distributional uncertainty was defined based on the mismatch between testing and training data distributions [14]. Dropout variational inference [5] was used for approximate inference in BNNs using epistemic uncertainty, similar to DropEdge [15].
Other algorithms have considered overall uncertainty in node classification [3, 13, 22]. However, no prior work has considered uncertainty decomposition in GNNs.

In the belief (or evidence) theory domain, uncertainty reasoning has been substantially explored, for example in Fuzzy Logic [1], Dempster-Shafer Theory (DST) [19], and Subjective Logic (SL) [8]. Belief theory focuses on reasoning about the inherent uncertainty in information caused by unreliable, incomplete, deceptive, or conflicting evidence. SL considers predictive uncertainty in subjective opinions in terms of vacuity (i.e., a lack of evidence) and vagueness (i.e., failure to discriminate a belief state) [8]. Recently, other uncertainty types have been studied, such as dissonance caused by conflicting evidence [9]. For deep NNs, [18] proposed the evidential deep learning (EDL) model, which uses SL to train a deterministic NN for supervised classification in computer vision based on the sum of squared loss. However, EDL did not consider a general method of estimating multidimensional uncertainty or graph structure.

## 3 Multidimensional Uncertainty and Subjective Logic

This section provides an overview of SL and discusses multiple types of uncertainties estimated based on SL, called evidential uncertainty, with the measures of vacuity and dissonance. In addition, we give a brief overview of probabilistic uncertainty, discussing the measures of aleatoric uncertainty and epistemic uncertainty.

### 3.1 Subjective Logic

A multinomial opinion of a random variable $y$ is represented by $\omega = (b, u, a)$, where the domain is $\mathbb{Y} \equiv \{1, \ldots, K\}$ and the additivity requirement of $\omega$ is given as $\sum_{k \in \mathbb{Y}} b_k + u = 1$. Specifically, the parameters are:

- $b$: belief mass distribution over $\mathbb{Y}$, with $b = [b_1, \ldots, b_K]^T$;
- $u$: uncertainty mass representing vacuity of evidence;
- $a$: base rate distribution over $\mathbb{Y}$, with $a = [a_1, \ldots, a_K]^T$.

The projected probability distribution of a multinomial opinion can be calculated as:

$$P(y = k) = b_k + a_k u, \quad \forall k \in \mathbb{Y}. \tag{1}$$

A multinomial opinion $\omega$ defined above can be equivalently represented by a $K$-dimensional Dirichlet probability density function (PDF), where the special case with $K = 2$ is the Beta PDF as a binomial opinion. Let $\alpha$ be a strength vector over the singletons (or classes) in $\mathbb{Y}$ and $p = [p_1, \ldots, p_K]^T$ be a probability distribution over $\mathbb{Y}$. The Dirichlet PDF with $p$ as a $K$-dimensional random vector is defined by:

$$\mathrm{Dir}(p \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k \in \mathbb{Y}} p_k^{(\alpha_k - 1)}, \tag{2}$$

where $\frac{1}{B(\alpha)} = \frac{\Gamma(\sum_{k \in \mathbb{Y}} \alpha_k)}{\prod_{k \in \mathbb{Y}} \Gamma(\alpha_k)}$, $\alpha_k \geq 0$, and $p_k \neq 0$ if $\alpha_k < 1$.

The term evidence is introduced as a measure of the amount of supporting observations collected from data indicating that a sample should be classified into a certain class. Let $e_k$ be the evidence derived for the class $k \in \mathbb{Y}$. The total strength $\alpha_k$ for the belief of each class $k \in \mathbb{Y}$ can be calculated as $\alpha_k = e_k + a_k W$, where $e_k \geq 0$, $\forall k \in \mathbb{Y}$, and $W$ refers to a non-informative weight representing the amount of uncertain evidence. Given the Dirichlet PDF as defined above, the expected probability distribution over $\mathbb{Y}$ can be calculated as:

$$\mathbb{E}[p_k] = \frac{\alpha_k}{\sum_{k=1}^{K} \alpha_k} = \frac{e_k + a_k W}{W + \sum_{k=1}^{K} e_k}. \tag{3}$$

The observed evidence in a Dirichlet PDF can be mapped to a multinomial opinion as follows:

$$b_k = \frac{e_k}{S}, \quad u = \frac{W}{S}, \tag{4}$$

where $S = \sum_{k=1}^{K} \alpha_k$ refers to the Dirichlet strength. Without loss of generality, we set $a_k = \frac{1}{K}$ and the non-informative prior weight $W = K$, which indicates that $a_k W = 1$ for each $k \in \mathbb{Y}$.

### 3.2 Evidential Uncertainty

In [9], we discussed several uncertainty dimensions of a subjective opinion based on the formalism of SL, such as singularity, vagueness, vacuity, dissonance, consonance, and monosonance. These uncertainty dimensions can be observed from binomial, multinomial, or hyper opinions depending on their characteristics (e.g., vagueness is only observed in hyper opinions, which deal with composite beliefs). In this paper, we discuss the two main uncertainty types that can be estimated in a multinomial opinion, which are vacuity and dissonance.
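To make the mapping between evidence, Dirichlet parameters, and a multinomial opinion concrete, here is a minimal Python sketch. It assumes the setting stated above ($a_k = 1/K$, $W = K$) and the usual Subjective Logic mapping $b_k = e_k / S$, $u = W / S$. The function name `evidence_to_opinion` is hypothetical and is not taken from the authors' code.

```python
def evidence_to_opinion(evidence, prior_weight=None):
    """Map per-class evidence e_k >= 0 to (belief, vacuity, expected_prob)."""
    num_classes = len(evidence)
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    # Assumption from the text: W = K and a_k = 1/K, so a_k * W = 1.
    weight = float(num_classes) if prior_weight is None else float(prior_weight)
    base_rate = 1.0 / num_classes
    alpha = [e + base_rate * weight for e in evidence]  # alpha_k = e_k + a_k W
    strength = sum(alpha)                               # S = sum_k alpha_k
    belief = [e / strength for e in evidence]           # b_k = e_k / S
    vacuity = weight / strength                         # u = W / S
    expected = [a / strength for a in alpha]            # E[p_k] = alpha_k / S
    return belief, vacuity, expected


belief, vacuity, expected = evidence_to_opinion([8.0, 1.0, 0.0])
# Additivity: the belief masses and the vacuity should sum to one.
print(sum(belief) + vacuity)
# Projected probability b_k + a_k * u should match the Dirichlet mean.
print([b + vacuity / 3 for b in belief], expected)
```

With no evidence at all (`[0.0, 0.0, 0.0]`) the sketch returns zero belief masses and a vacuity of one, which is the situation the paper associates with OOD nodes.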
The main cause of vacuity is a lack of evidence or knowledge, which corresponds to the uncertainty mass $u$ of a multinomial opinion in SL: $vac(\omega) \equiv u = K/S$, as estimated in Eq. (4). This uncertainty exists because the analyst may have insufficient information or knowledge to analyze the uncertainty. The dissonance of a multinomial opinion arises from comparable amounts of conflicting evidence and can be estimated from the differences between singleton belief masses (e.g., class labels); it leads to inconclusiveness in decision-making applications. For example, consider a four-state multinomial opinion $(b_1, b_2, b_3, b_4, u, a) = (0.25, 0.25, 0.25, 0.25, 0.0, a)$ based on Eq. (4). Although the vacuity $u$ is zero, a decision cannot be made because the same amount of belief supports each class. Given a multinomial opinion with non-zero belief masses, the measure of dissonance can be calculated as:

$$diss(\omega) = \sum_{i=1}^{K} \left( \frac{b_i \sum_{j \neq i} b_j \, \mathrm{Bal}(b_j, b_i)}{\sum_{j \neq i} b_j} \right), \tag{5}$$

where the relative mass balance between a pair of belief masses $b_j$ and $b_i$ is defined as $\mathrm{Bal}(b_j, b_i) = 1 - |b_j - b_i| / (b_j + b_i)$. We note that the dissonance is measured only when the belief mass is non-zero. If all belief masses are equal to zero and the vacuity is 1 (i.e., $u = 1$), the dissonance is set to zero.

### 3.3 Probabilistic Uncertainty

For classification, the estimation of probabilistic uncertainty relies on the design of an appropriate Bayesian DL model with parameters $\theta$. Given input $x$ and dataset $\mathcal{G}$, we estimate a class probability by $P(y \mid x) = \int P(y \mid x; \theta) P(\theta \mid \mathcal{G}) \, d\theta$, and obtain the epistemic uncertainty estimated by mutual information [2, 14]:

$$\underbrace{I(y, \theta \mid x, \mathcal{G})}_{\text{Epistemic}} = \underbrace{\mathcal{H}\big[\mathbb{E}_{P(\theta \mid \mathcal{G})}[P(y \mid x; \theta)]\big]}_{\text{Entropy}} - \underbrace{\mathbb{E}_{P(\theta \mid \mathcal{G})}\big[\mathcal{H}[P(y \mid x; \theta)]\big]}_{\text{Aleatoric}}, \tag{6}$$

where $\mathcal{H}(\cdot)$ is Shannon's entropy of a probability distribution. The first term is the entropy, which represents the total uncertainty, while the second term is the aleatoric uncertainty, which indicates data uncertainty.
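The two evidential measures can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes $W = K$ so that vacuity is $K/S$ and belief masses are $(\alpha_k - 1)/S$, and it adopts the convention that a pair with zero total mass has zero balance. All function names are hypothetical.

```python
def balance(b_j, b_i):
    """Relative mass balance Bal(b_j, b_i) = 1 - |b_j - b_i| / (b_j + b_i)."""
    total = b_j + b_i
    if total == 0.0:
        return 0.0  # convention chosen for this sketch: no mass, no balance
    return 1.0 - abs(b_j - b_i) / total


def dissonance(belief):
    """Dissonance of a multinomial opinion computed from its belief masses."""
    diss = 0.0
    for i, b_i in enumerate(belief):
        others = [b_j for j, b_j in enumerate(belief) if j != i]
        denom = sum(others)
        if b_i == 0.0 or denom == 0.0:
            continue  # zero masses contribute nothing
        numer = sum(b_j * balance(b_j, b_i) for b_j in others)
        diss += b_i * numer / denom
    return diss


def vacuity(alpha):
    """Vacuity u = K / S for Dirichlet parameters alpha (assumes W = K)."""
    return len(alpha) / sum(alpha)


def belief_from_alpha(alpha):
    """Belief masses b_k = (alpha_k - 1) / S (assumes a_k * W = 1)."""
    strength = sum(alpha)
    return [(a - 1.0) / strength for a in alpha]


# The four-state example from the text: equal beliefs, zero vacuity.
print(dissonance([0.25, 0.25, 0.25, 0.25]))

# No evidence versus plenty of perfectly conflicting evidence.
for alpha in ([1.0, 1.0, 1.0], [100.0, 100.0, 100.0]):
    print(alpha, vacuity(alpha), dissonance(belief_from_alpha(alpha)))
```

In this sketch the no-evidence case yields high vacuity with zero dissonance, whereas the conflicting case yields low vacuity with dissonance close to one.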
By computing the difference between the entropy and the aleatoric uncertainty, we obtain the epistemic uncertainty, which refers to uncertainty from model parameters.

## 4 Relationships Between Multiple Uncertainties

Figure 1: Multiple uncertainties of different predictions. Let $u = [u_v, u_{diss}, u_{alea}, u_{epis}, u_{en}]$.

We use the shorthand notations $u_v$, $u_{diss}$, $u_{alea}$, $u_{epis}$, and $u_{en}$ to represent vacuity, dissonance, aleatoric uncertainty, epistemic uncertainty, and entropy, respectively.

To interpret multiple types of uncertainty, we show three prediction scenarios of 3-class classification in Figure 1, in each of which the strength parameters $\alpha = [\alpha_1, \alpha_2, \alpha_3]$ are known. For a prediction with high confidence, the subjective multinomial opinion, following a Dirichlet distribution, yields a sharp distribution at one corner of the simplex (see Figure 1 (a)). For a prediction with conflicting evidence, called a conflicting prediction (CP), the multinomial opinion should yield a central distribution, representing confidence in predicting a flat categorical distribution over class labels (see Figure 1 (b)). For an OOD scenario with $\alpha = [1, 1, 1]$, the multinomial opinion yields a flat distribution over the simplex (Figure 1 (c)), indicating high uncertainty due to the lack of evidence. The first technical contribution of this work is as follows.

Theorem 1. We consider a simplified scenario, where a multinomial random variable $y$ follows a $K$-class categorical distribution, $y \sim \mathrm{Cat}(p)$, the class probabilities $p$ follow a Dirichlet distribution, $p \sim \mathrm{Dir}(\alpha)$, and $\alpha$ refers to the Dirichlet parameters. Given a total Dirichlet strength $S = \sum_{i=1}^{K} \alpha_i$, for any opinion $\omega$ on a multinomial random variable $y$, we have:

1. General relations on all prediction scenarios.
   (a) $u_v + u_{diss} \leq 1$;
   (b) $u_v > u_{epis}$.
2. Special relations on the OOD and the CP.
   (a) For an OOD sample with a uniform prediction (i.e., $\alpha = [1, \ldots, 1]$), we have
   $$1 = u_v = u_{en} > u_{alea} > u_{epis} > u_{diss} = 0.$$
   (b) For an in-distribution sample with a conflicting prediction (i.e., $\alpha = [\alpha_1, \ldots, \alpha_K]$ with $\alpha_1 = \alpha_2 = \cdots = \alpha_K$), if $S \to \infty$, we have
   $$u_{en} = 1, \quad \lim_{S \to \infty} u_{diss} = \lim_{S \to \infty} u_{alea} = 1, \quad \lim_{S \to \infty} u_v = \lim_{S \to \infty} u_{epis} = 0,$$
   with $u_{en} > u_{alea} > u_{diss} > u_v > u_{epis}$.

The proof of Theorem 1 can be found in Appendix A.1. As demonstrated in Theorem 1 and Figure 1, entropy cannot distinguish OOD (see Figure 1 (c)) from conflicting predictions (see Figure 1 (b)) because entropy is high in both cases. Similarly, neither aleatoric uncertainty nor epistemic uncertainty can distinguish OOD from conflicting predictions: in both cases, aleatoric uncertainty is high while epistemic uncertainty is low. On the other hand, vacuity and dissonance can clearly distinguish OOD from a conflicting prediction. For example, OOD objects typically show high vacuity with low dissonance, while conflicting predictions exhibit low vacuity with high dissonance. This observation is confirmed empirically by our extensive experiments on misclassification and OOD detection tasks.

## 5 Uncertainty-Aware Semi-Supervised Learning

Figure 2: Uncertainty Framework Overview. Subjective Bayesian GNN (a) is designed for estimating the different types of uncertainties. The loss function includes the square error (d) to reduce bias, GKDE (b) to reduce errors in uncertainty estimation, and the teacher network (c) to refine class probabilities.

In this section, we describe our proposed uncertainty framework for the semi-supervised node classification problem. It is designed to predict the subjective opinions about the classification of testing nodes, such that a variety of uncertainty types, such as vacuity, dissonance, aleatoric uncertainty, and epistemic uncertainty, can be quantified based on the estimated subjective opinions and the posterior of the model parameters.
As a subjective opinion can be equivalently represented by a Dirichlet distribution over the class probabilities, we propose a way to predict node-level subjective opinions in the form of node-level Dirichlet distributions. The overall framework is shown in Figure 2.

### 5.1 Problem Definition

We are given an input graph $\mathcal{G} = (\mathbb{V}, \mathbb{E}, r, y_{\mathbb{L}})$, where $\mathbb{V} = \{1, \ldots, N\}$ is a ground set of nodes, $\mathbb{E} \subseteq \mathbb{V} \times \mathbb{V}$ is a ground set of edges, $r = [r_1, \ldots, r_N]^T \in \mathbb{R}^{N \times d}$ is a node-level feature matrix, $r_i \in \mathbb{R}^d$ is the feature vector of node $i$, $y_{\mathbb{L}} = \{y_i \mid i \in \mathbb{L}\}$ are the labels of the training nodes $\mathbb{L} \subset \mathbb{V}$, and $y_i \in \{1, \ldots, K\}$ is the class label of node $i$. We aim to predict: (1) the class probabilities of the testing nodes, $p_{\mathbb{V} \setminus \mathbb{L}} = \{p_i \in [0, 1]^K \mid i \in \mathbb{V} \setminus \mathbb{L}\}$; and (2) the associated multidimensional uncertainty estimates introduced by different root causes, $u_{\mathbb{V} \setminus \mathbb{L}} = \{u_i \in [0, 1]^m \mid i \in \mathbb{V} \setminus \mathbb{L}\}$, where $p_{i,k}$ is the probability that the class label $y_i = k$ and $m$ is the total number of uncertainty types.

### 5.2 Proposed Uncertainty Framework

Learning evidential uncertainty. As discussed in Section 3.1, evidential uncertainty can be derived from multinomial opinions or, equivalently, from Dirichlet distributions that model a probability distribution over the class probabilities. Therefore, we design a Subjective GNN (S-GNN) $f$ to form the multinomial opinion of a given node $i$ as a node-level Dirichlet distribution $\mathrm{Dir}(p_i \mid \alpha_i)$. Then, the conditional probability $P(p \mid A, r; \theta)$ can be obtained by:

$$P(p \mid A, r; \theta) = \prod_{i=1}^{N} \mathrm{Dir}(p_i \mid \alpha_i), \quad \alpha_i = f_i(A, r; \theta), \tag{7}$$

where $f_i$ is the output of S-GNN for node $i$, $\theta$ denotes the model parameters, and $A$ is an adjacency matrix. The Dirichlet probability function $\mathrm{Dir}(p_i \mid \alpha_i)$ is defined by Eq. (2).

Note that S-GNN is similar to a classical GNN, except that we use an activation layer (e.g., ReLU) instead of the softmax layer, which only outputs class probabilities. This ensures that S-GNN outputs non-negative values, which are taken as the parameters of the predicted Dirichlet distribution.
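The architectural change described above (a ReLU output layer in place of softmax) can be illustrated with a small dependency-free sketch. It is not the authors' implementation. It assumes a two-layer GCN with the symmetric normalization of [12], dense list-of-lists matrices, and the convention that the ReLU output is treated as evidence with $\alpha_i = e_i + 1$ as in Section 3.1. The function names and the toy weights are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def relu(matrix):
    return [[max(0.0, v) for v in row] for row in matrix]


def sgnn_alpha(adj, features, w_hidden, w_out):
    """Two-layer GCN whose final activation is ReLU rather than softmax."""
    a_norm = normalized_adjacency(adj)
    hidden = relu(matmul(a_norm, matmul(features, w_hidden)))
    # Assumption: the non-negative output is read as evidence e_i >= 0 ...
    evidence = relu(matmul(a_norm, matmul(hidden, w_out)))
    # ... and turned into Dirichlet parameters via alpha_i = e_i + 1.
    return [[e + 1.0 for e in row] for row in evidence]


# Toy path graph 0 - 1 - 2 with two features and two classes.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_hidden = [[0.5, -0.3], [0.2, 0.8]]
w_out = [[1.0, -1.0], [-0.5, 0.7]]
for node, alpha in enumerate(sgnn_alpha(adj, features, w_hidden, w_out)):
    print(node, alpha, "vacuity =", len(alpha) / sum(alpha))
```

Because every $\alpha_{ik} \geq 1$, the resulting vacuity $K / S$ always lies in $(0, 1]$, and a node that receives no evidence ends up with vacuity exactly one.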
Learning probabilistic uncertainty. Since probabilistic uncertainty relies on a Bayesian framework, we propose a Subjective Bayesian GNN (S-BGNN) that adapts S-GNN to a Bayesian framework, with the model parameters $\theta$ following a prior distribution. The joint class probability of $y$ can be estimated by:

$$P(y \mid A, r; \mathcal{G}) = \int \int P(y \mid p) P(p \mid A, r; \theta) P(\theta \mid \mathcal{G}) \, dp \, d\theta \approx \frac{1}{M} \sum_{m=1}^{M} \prod_{i=1}^{N} \int P(y_i \mid p_i) P(p_i \mid A, r; \theta^{(m)}) \, dp_i, \quad \theta^{(m)} \sim q(\theta), \tag{8}$$

where $P(\theta \mid \mathcal{G})$ is the posterior and $M$ is the number of sampled parameter sets. The posterior is estimated via dropout inference, which provides an approximate posterior $q(\theta)$ from which model samples are drawn [5]. Thanks to dropout inference, training a DL model directly by minimizing the cross entropy (or square error) loss function can effectively minimize the KL-divergence between the approximated distribution and the full posterior (i.e., $\mathrm{KL}[q(\theta) \| P(\theta \mid \mathcal{G})]$) in variational inference [5, 10]. Interested readers can find more detail in Appendix B.8.

Therefore, training S-GNN with stochastic gradient descent enables learning an approximated distribution of weights, which can provide good explainability of data and prevent overfitting. We use a loss function that computes the Bayes risk with respect to the sum of squares loss $\| y - p \|_2^2$:

$$\mathcal{L}(\theta) = \sum_{i \in \mathbb{L}} \int \| y_i - p_i \|_2^2 \, P(p_i \mid A, r; \theta) \, dp_i = \sum_{i \in \mathbb{L}} \sum_{k=1}^{K} \big( y_{ik} - \mathbb{E}[p_{ik}] \big)^2 + \mathrm{Var}(p_{ik}), \tag{9}$$

where $y_i$ is a one-hot vector encoding the ground-truth class, with $y_{ij} = 1$ and $y_{ik} = 0$ for all $k \neq j$, and $j$ is the class label. Eq. (9) aims to minimize the prediction error and variance, maximizing the classification accuracy of each training node by removing excessive misleading evidence.

### 5.3 Graph-based Kernel Dirichlet distribution Estimation (GKDE)

Figure 3: Illustration of GKDE. Estimate the prior Dirichlet distribution $\mathrm{Dir}(\hat{\alpha})$ for node $j$ (red) based on training nodes (blue) and graph structure information.

The loss function in Eq. (9) is designed to measure the sum of squared loss based on the class labels of training nodes.
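For a single training node, the Bayes risk in Eq. (9) has a closed form, because the mean and variance of a Dirichlet distribution are known: $\mathbb{E}[p_k] = \alpha_k / S$ and $\mathrm{Var}(p_k) = \alpha_k (S - \alpha_k) / (S^2 (S + 1))$. The sketch below evaluates that per-node term. The function name is hypothetical, and class labels are assumed to be 0-based indices, unlike the 1-based labels in the text.

```python
def dirichlet_squared_loss(alpha, label):
    """Bayes risk of ||y - p||^2 under Dir(alpha) for one node (cf. Eq. (9))."""
    strength = sum(alpha)  # S = sum_k alpha_k
    loss = 0.0
    for k, a_k in enumerate(alpha):
        y_k = 1.0 if k == label else 0.0  # one-hot target, 0-based label
        mean = a_k / strength             # E[p_k]
        var = a_k * (strength - a_k) / (strength ** 2 * (strength + 1.0))
        loss += (y_k - mean) ** 2 + var
    return loss


# No evidence, evidence for the correct class, evidence for a wrong class.
print(dirichlet_squared_loss([1.0, 1.0, 1.0], label=0))
print(dirichlet_squared_loss([10.0, 1.0, 1.0], label=0))
print(dirichlet_squared_loss([1.0, 10.0, 1.0], label=0))
```

In this toy comparison, evidence for the correct class gives the smallest risk and evidence for a wrong class gives the largest. This matches the stated aim of discouraging misleading evidence on training nodes.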
However, this loss does not directly measure the quality of the predicted node-level Dirichlet distributions. To address this limitation, we propose Graph-based Kernel Dirichlet distribution Estimation (GKDE) to better estimate node-level Dirichlet distributions by using graph structure information. The key idea of GKDE is to estimate prior Dirichlet distribution parameters for each node based on the class labels of training nodes (see Figure 3). Then, we use the estimated prior Dirichlet distribution in the training process to learn the following patterns: (i) nodes with a high vacuity lie far from training nodes; and (ii) nodes with a high dissonance lie near the boundaries of classes.

Based on SL, let each training node represent one piece of evidence for its class label. Denote the contribution of evidence estimation for node $j$ from training node $i$ by $h(y_i, d_{ij}) = [h_1, \ldots, h_k, \ldots, h_K] \in [0, 1]^K$, where $h_k(y_i, d_{ij})$ is obtained by:

$$h_k(y_i, d_{ij}) = \begin{cases} 0 & y_i \neq k \\ g(d_{ij}) & y_i = k, \end{cases} \tag{10}$$

where $g(d_{ij}) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{d_{ij}^2}{2\sigma^2} \right)$ is the Gaussian kernel function used to estimate the distribution effect between nodes $i$ and $j$, $d_{ij}$ is the node-level distance (the length of a shortest path between nodes $i$ and $j$), and $\sigma$ is the bandwidth parameter. The prior evidence is estimated based on GKDE as $\hat{e}_j = \sum_{i \in \mathbb{L}} h(y_i, d_{ij})$, where $\mathbb{L}$ is the set of training nodes, and the prior Dirichlet distribution parameters are $\hat{\alpha}_j = \hat{e}_j + 1$. During the training process, we minimize the KL-divergence between the model's predicted Dirichlet distribution and the prior distribution: $\min \mathrm{KL}[\mathrm{Dir}(\alpha) \| \mathrm{Dir}(\hat{\alpha})]$. This process can prioritize the extent of data relevance based on the estimated evidential uncertainty, which is shown to be effective by the proposition below.

Proposition 1. Given $L$ training nodes, for any testing nodes $i$ and $j$, let $d_i = [d_{i1}, \ldots, d_{iL}]$ be the vector of graph distances from node $i$ to the training nodes and $d_j = [d_{j1}, \ldots, d_{jL}]$ be the vector of graph distances from node $j$ to the training nodes, where $d_{il}$ is the node-level distance between nodes $i$ and $l$. If $d_{il} \geq d_{jl}$ for all $l \in \{1, \ldots, L\}$, then we have

$$\hat{u}_{v_i} \geq \hat{u}_{v_j},$$

where $\hat{u}_{v_i}$ and $\hat{u}_{v_j}$ refer to the vacuity uncertainties of nodes $i$ and $j$ estimated based on GKDE.

The proof of this proposition can be found in Appendix A.2. The proposition shows that the farther a testing node is from the training nodes, the higher its vacuity, implying that an OOD node is expected to have a high vacuity.

In addition, we designed a simple iterative knowledge distillation method [7] (i.e., Teacher Network) to refine the node-level classification probabilities. The key idea is to train our proposed model (Student) to imitate the outputs of a pre-trained vanilla GNN (Teacher) by adding a KL-divergence regularization term. This leads to solving the following optimization problem:

$$\min_{\theta} \; \mathcal{L}(\theta) + \lambda_1 \mathrm{KL}[\mathrm{Dir}(\alpha) \| \mathrm{Dir}(\hat{\alpha})] + \lambda_2 \mathrm{KL}[P(y \mid A, r; \mathcal{G}) \| P(y \mid \hat{p})], \tag{11}$$

where $\hat{p}$ is the output of the vanilla GNN (Teacher) and $\lambda_1$ and $\lambda_2$ are trade-off parameters.

## 6 Experiments

In this section, we conduct experiments on the tasks of misclassification and OOD detection to answer the following questions for semi-supervised node classification:

- Q1. Misclassification Detection: What type of uncertainty is the most promising indicator of high confidence in node classification predictions?
- Q2. OOD Detection: What type of uncertainty is a key indicator of accurate detection of OOD nodes?
- Q3. GKDE with Uncertainty Estimates: How can GKDE help enhance prediction tasks, and with what types of uncertainty estimates?

Through extensive experiments, we found the following answers to the above questions:

- A1. Dissonance (i.e., uncertainty due to conflicting evidence) is more effective than other uncertainty estimates in misclassification detection.
- A2. Vacuity (i.e., uncertainty due to lack of evidence) is more effective than other uncertainty estimates in OOD detection.
- A3.
GKDE can indeed help improve the estimation quality of node-level Dirichlet distributions, resulting in better OOD detection.

### 6.1 Experiment Setup

Datasets: We used six datasets, including three citation network datasets [17] (i.e., Cora, Citeseer, Pubmed) and three new datasets [20] (i.e., Coauthor Physics, Amazon Computer, and Amazon Photo). The description and experimental setup of these datasets are summarized in Appendix B.2.

Comparing Schemes: We conducted an extensive comparative performance analysis of our proposed models and several state-of-the-art competitive counterparts. We implemented all models on top of the most popular GNN model, GCN [12]. We compared our model (S-BGCN-T-K) against: (1) Softmax-based GCN [12], with uncertainty measured by entropy; (2) Drop-GCN, which adapts Monte-Carlo Dropout [5, 16] to the GCN model to learn probabilistic uncertainty; (3) EDL-GCN, which adapts the EDL model [18] to GCN to estimate evidential uncertainty; and (4) DPN-GCN, which adapts the DPN [14] method to GCN to estimate probabilistic uncertainty. We evaluated the performance of all models using the area under the ROC curve (AUROC) and the area under the Precision-Recall curve (AUPR) in both experiments [6].

### 6.2 Results

Misclassification Detection. The misclassification detection experiment involves detecting whether a given prediction is incorrect using an uncertainty estimate. Table 1 shows that S-BGCN-T-K outperforms all baseline models under the AUROC and AUPR for misclassification detection. Dissonance-based detection performs particularly well. This confirms that low dissonance (a small amount of conflicting evidence) is the key to maximizing the accuracy of node classification prediction. We observe the following performance order: Dissonance > Entropy ≈ Aleatoric > Vacuity ≈ Epistemic, which is aligned with our conjecture: higher dissonance with a conflicting prediction leads to higher misclassification detection.
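As a rough illustration of how an uncertainty estimate is scored as a misclassification detector, the sketch below computes AUROC directly from its pairwise definition: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. It is a hypothetical helper, not the authors' evaluation code. It assumes that misclassified nodes are the positives and that a higher uncertainty should flag them. The scores and flags are made-up toy values. The quadratic loop is only meant for small examples.

```python
def auroc(scores, positives):
    """AUROC as P(score of a positive > score of a negative), ties count 1/2."""
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: per-node dissonance and whether the prediction was wrong.
dissonance_scores = [0.9, 0.1, 0.2, 0.8]
misclassified = [True, False, True, False]
print(auroc(dissonance_scores, misclassified))  # 3 of 4 pairs ranked correctly
```

Treating correctly classified nodes as positives and negating the score gives the same AUROC, so this choice of convention does not affect the ranking of uncertainty types under that metric.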
We also conducted experiments on three additional datasets and observed similar trends, as shown in Appendix C.

OOD Detection. This experiment involves detecting whether an input example is out-of-distribution (OOD) given an estimate of uncertainty. For semi-supervised node classification, we randomly selected one to four categories as OOD categories and trained the models on the training nodes of the other categories. Due to space constraints, the experimental setup for OOD detection is detailed in Appendix B.3. In Table 2, across six network datasets, our vacuity-based detection significantly outperformed the other competitive methods, exceeding the performance of epistemic uncertainty and the other types of uncertainty. This demonstrates that the vacuity-based model is more effective than counterparts based on other uncertainty estimates for OOD detection. We observed the following performance order: Vacuity > Entropy ≈ Aleatoric > Epistemic ≈ Dissonance, which is consistent with the theoretical results in Theorem 1.

The source code and datasets are accessible at https://github.com/zxj32/uncertainty-GNN

Table 1: AUROC and AUPR for the Misclassification Detection.

| Data | Model | AUROC Va. | AUROC Dis. | AUROC Al. | AUROC Ep. | AUROC En. | AUPR Va. | AUPR Dis. | AUPR Al. | AUPR Ep. | AUPR En. | Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cora | S-BGCN-T-K | 70.6 | 82.4 | 75.3 | 68.8 | 77.7 | 90.3 | 95.4 | 92.4 | 87.8 | 93.4 | 82.0 |
| Cora | EDL-GCN | 70.2 | 81.5 | - | - | 76.9 | 90.0 | 94.6 | - | - | 93.6 | 81.5 |
| Cora | DPN-GCN | - | - | 78.3 | 75.5 | 77.3 | - | - | 92.4 | 92.0 | 92.4 | 80.8 |
| Cora | Drop-GCN | - | - | 73.9 | 66.7 | 76.9 | - | - | 92.7 | 90.0 | 93.6 | 81.3 |
| Cora | GCN | - | - | - | - | 79.6 | - | - | - | - | 94.1 | 81.5 |
| Citeseer | S-BGCN-T-K | 65.4 | 74.0 | 67.2 | 60.7 | 70.0 | 79.8 | 85.6 | 82.2 | 75.2 | 83.5 | 71.0 |
| Citeseer | EDL-GCN | 64.9 | 73.6 | - | - | 69.6 | 79.2 | 84.6 | - | - | 82.9 | 70.2 |
| Citeseer | DPN-GCN | - | - | 66.0 | 64.9 | 65.5 | - | - | 78.7 | 77.6 | 78.1 | 68.1 |
| Citeseer | Drop-GCN | - | - | 66.4 | 60.8 | 69.8 | - | - | 82.3 | 77.8 | 83.7 | 70.9 |
| Citeseer | GCN | - | - | - | - | 71.4 | - | - | - | - | 83.2 | 70.3 |
| Pubmed | S-BGCN-T-K | 64.1 | 73.3 | 69.3 | 64.2 | 70.7 | 85.6 | 90.8 | 88.8 | 86.1 | 89.2 | 79.3 |
| Pubmed | EDL-GCN | 62.6 | 69.0 | - | - | 67.2 | 84.6 | 88.9 | - | - | 81.7 | 79.0 |
| Pubmed | DPN-GCN | - | - | 72.7 | 69.2 | 72.5 | - | - | 87.8 | 86.8 | 87.7 | 77.1 |
| Pubmed | Drop-GCN | - | - | 67.3 | 66.1 | 67.2 | - | - | 88.6 | 85.6 | 89.0 | 79.0 |
| Pubmed | GCN | - | - | - | - | 68.5 | - | - | - | - | 89.2 | 79.0 |

Va.: Vacuity, Dis.: Dissonance, Al.: Aleatoric, Ep.: Epistemic, En.: Entropy

Table 2: AUROC and AUPR for the OOD Detection.

| Data | Model | AUROC Va. | AUROC Dis. | AUROC Al. | AUROC Ep. | AUROC En. | AUPR Va. | AUPR Dis. | AUPR Al. | AUPR Ep. | AUPR En. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cora | S-BGCN-T-K | 87.6 | 75.5 | 85.5 | 70.8 | 84.8 | 78.4 | 49.0 | 75.3 | 44.5 | 73.1 |
| Cora | EDL-GCN | 84.5 | 81.0 | - | - | 83.3 | 74.2 | 53.2 | - | - | 71.4 |
| Cora | DPN-GCN | - | - | 77.3 | 78.9 | 78.3 | - | - | 58.5 | 62.8 | 63.0 |
| Cora | Drop-GCN | - | - | 81.9 | 70.5 | 80.9 | - | - | 69.7 | 44.2 | 67.2 |
| Cora | GCN | - | - | - | - | 80.7 | - | - | - | - | 66.9 |
| Citeseer | S-BGCN-T-K | 84.8 | 55.2 | 78.4 | 55.1 | 74.0 | 86.8 | 54.1 | 80.8 | 55.8 | 74.0 |
| Citeseer | EDL-GCN | 78.4 | 59.4 | - | - | 69.1 | 79.8 | 57.3 | - | - | 69.0 |
| Citeseer | DPN-GCN | - | - | 68.3 | 72.2 | 69.5 | - | - | 68.5 | 72.1 | 70.3 |
| Citeseer | Drop-GCN | - | - | 72.3 | 61.4 | 70.6 | - | - | 73.5 | 60.8 | 70.0 |
| Citeseer | GCN | - | - | - | - | 70.8 | - | - | - | - | 70.2 |
| Pubmed | S-BGCN-T-K | 74.6 | 67.9 | 71.8 | 59.2 | 72.2 | 69.6 | 52.9 | 63.6 | 44.0 | 56.5 |
| Pubmed | EDL-GCN | 71.5 | 68.2 | - | - | 70.5 | 65.3 | 53.1 | - | - | 55.0 |
| Pubmed | DPN-GCN | - | - | 63.5 | 63.7 | 63.5 | - | - | 50.7 | 53.9 | 51.1 |
| Pubmed | Drop-GCN | - | - | 68.7 | 60.8 | 66.7 | - | - | 59.7 | 46.7 | 54.8 |
| Pubmed | GCN | - | - | - | - | 68.3 | - | - | - | - | 55.3 |
| Amazon Photo | S-BGCN-T-K | 93.4 | 76.4 | 91.4 | 32.2 | 91.4 | 94.8 | 68.0 | 92.3 | 42.3 | 92.5 |
| Amazon Photo | EDL-GCN | 63.4 | 78.1 | - | - | 79.2 | 66.2 | 74.8 | - | - | 81.2 |
| Amazon Photo | DPN-GCN | - | - | 83.6 | 83.6 | 83.6 | - | - | 82.6 | 82.4 | 82.5 |
| Amazon Photo | Drop-GCN | - | - | 84.5 | 58.7 | 84.3 | - | - | 87.0 | 57.7 | 86.9 |
| Amazon Photo | GCN | - | - | - | - | 84.4 | - | - | - | - | 87.0 |
| Amazon Computer | S-BGCN-T-K | 82.3 | 76.6 | 80.9 | 55.4 | 80.9 | 70.5 | 52.8 | 60.9 | 35.9 | 60.6 |
| Amazon Computer | EDL-GCN | 53.2 | 70.1 | - | - | 70.0 | 33.2 | 43.9 | - | - | 45.7 |
| Amazon Computer | DPN-GCN | - | - | 77.6 | 77.7 | 77.7 | - | - | 50.8 | 51.2 | 51.0 |
| Amazon Computer | Drop-GCN | - | - | 74.4 | 70.5 | 74.3 | - | - | 50.0 | 46.7 | 49.8 |
| Amazon Computer | GCN | - | - | - | - | 74.0 | - | - | - | - | 48.7 |
| Coauthor Physics | S-BGCN-T-K | 91.3 | 87.6 | 89.7 | 61.8 | 89.8 | 72.2 | 56.6 | 68.1 | 25.9 | 67.9 |
| Coauthor Physics | EDL-GCN | 88.2 | 85.8 | - | - | 87.6 | 67.1 | 51.2 | - | - | 62.1 |
| Coauthor Physics | DPN-GCN | - | - | 85.5 | 85.6 | 85.5 | - | - | 59.8 | 60.2 | 59.8 |
| Coauthor Physics | Drop-GCN | - | - | 89.2 | 78.4 | 89.3 | - | - | 66.6 | 37.1 | 66.5 |
| Coauthor Physics | GCN | - | - | - | - | 89.1 | - | - | - | - | 64.0 |

Va.: Vacuity, Dis.: Dissonance, Al.: Aleatoric, Ep.: Epistemic, En.: Entropy

Ablation Study.
We conducted additional experiments (see Table 3) to demonstrate the contributions of the key technical components, including GKDE, the Teacher Network, and the subjective Bayesian framework. The key findings from this experiment are: (1) GKDE can enhance OOD detection (i.e., a 30% increase with vacuity), which is consistent with our theoretical proof about the advantage of GKDE in uncertainty estimation, i.e., OOD nodes have a higher vacuity than other nodes; and (2) the Teacher Network can further improve the node classification accuracy.

### 6.3 Why is Epistemic Uncertainty Less Effective than Vacuity?

Although epistemic uncertainty is known to be effective for improving OOD detection [5, 11] in computer vision applications, our results demonstrate that it is less effective than our vacuity-based approach. The first potential reason is that epistemic uncertainty is always smaller than vacuity (from Theorem 1), which potentially indicates that epistemic uncertainty captures less information related to OOD. Another potential reason is that the previous success of epistemic uncertainty for OOD detection is limited to supervised learning in computer vision applications, and its effectiveness for OOD detection has not been sufficiently validated in semi-supervised learning tasks. Recall that epistemic uncertainty (i.e., model uncertainty) is calculated based on mutual information (see Eq. (6)). In a semi-supervised setting, the features of unlabeled nodes are also fed to the model during training, which gives the model high confidence in its output. For example, the model output $P(y \mid A, r; \theta)$ would not change much even with differently sampled parameters $\theta$, i.e., $P(y \mid A, r; \theta^{(i)}) \approx P(y \mid A, r; \theta^{(j)})$, which results in a low epistemic uncertainty. We also designed a semi-supervised learning experiment for image classification and observed a pattern consistent with these results, as demonstrated in Appendix C.6.
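The argument above can be checked numerically with a small sketch of the entropy decomposition: total entropy of the averaged prediction, minus the average entropy of the individual predictions, gives the mutual-information estimate of epistemic uncertainty. The sketch assumes the sampled class-probability vectors come from stochastic forward passes such as Monte-Carlo dropout and uses natural-log entropy without normalization. The function names and sample values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(sampled_probs):
    """Return (total entropy, aleatoric, epistemic) from sampled predictions."""
    num_samples = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    mean_probs = [sum(s[k] for s in sampled_probs) / num_samples
                  for k in range(num_classes)]
    total = entropy(mean_probs)  # entropy of the averaged prediction
    aleatoric = sum(entropy(s) for s in sampled_probs) / num_samples
    epistemic = total - aleatoric  # mutual information estimate
    return total, aleatoric, epistemic


# Samples that agree with each other: high entropy, almost no epistemic part.
print(decompose_uncertainty([[0.49, 0.51], [0.51, 0.49]]))
# Samples that disagree strongly: the same total entropy, mostly epistemic.
print(decompose_uncertainty([[0.99, 0.01], [0.01, 0.99]]))
```

Both toy cases have the same averaged prediction and therefore the same total entropy. Only the second case has a large epistemic term, because the sampled predictions disagree. When sampled outputs barely differ, as the text argues for the semi-supervised setting, the epistemic term stays near zero.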
Table 3: Ablation study of our proposed models: (1) S-GCN: Subjective GCN with vacuity and dissonance estimation; (2) S-BGCN: S-GCN with Bayesian framework; (3) S-BGCN-T: S-BGCN with a Teacher Network; (4) S-BGCN-T-K: S-BGCN-T with GKDE to improve uncertainty estimation.

Misclassification Detection:

| Data | Model | AUROC Va. | AUROC Dis. | AUROC Al. | AUROC Ep. | AUROC En. | AUPR Va. | AUPR Dis. | AUPR Al. | AUPR Ep. | AUPR En. | Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cora | S-BGCN-T-K | 70.6 | 82.4 | 75.3 | 68.8 | 77.7 | 90.3 | 95.4 | 92.4 | 87.8 | 93.4 | 82.0 |
| Cora | S-BGCN-T | 70.8 | 82.5 | 75.3 | 68.9 | 77.8 | 90.4 | 95.4 | 92.6 | 88.0 | 93.4 | 82.2 |
| Cora | S-BGCN | 69.8 | 81.4 | 73.9 | 66.7 | 76.9 | 89.4 | 94.3 | 92.3 | 88.0 | 93.1 | 81.2 |
| Cora | S-GCN | 70.2 | 81.5 | - | - | 76.9 | 90.0 | 94.6 | - | - | 93.6 | 81.5 |

OOD Detection:

| Data | Model | AUROC Va. | AUROC Dis. | AUROC Al. | AUROC Ep. | AUROC En. | AUPR Va. | AUPR Dis. | AUPR Al. | AUPR Ep. | AUPR En. | Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amazon Photo | S-BGCN-T-K | 93.4 | 76.4 | 91.4 | 32.2 | 91.4 | 94.8 | 68.0 | 92.3 | 42.3 | 92.5 | - |
| Amazon Photo | S-BGCN-T | 64.0 | 77.5 | 79.9 | 52.6 | 79.8 | 67.0 | 75.3 | 82.0 | 53.7 | 81.9 | - |
| Amazon Photo | S-BGCN | 63.0 | 76.6 | 79.8 | 52.7 | 79.7 | 66.5 | 75.1 | 82.1 | 53.9 | 81.7 | - |
| Amazon Photo | S-GCN | 64.0 | 77.1 | - | - | 79.6 | 67.0 | 74.9 | - | - | 81.6 | - |

Va.: Vacuity, Dis.: Dissonance, Al.: Aleatoric, Ep.: Epistemic, En.: Entropy

## 7 Conclusion

In this work, we proposed a multi-source uncertainty framework of GNNs for semi-supervised node classification. Our proposed framework provides an effective way of predicting node classification and out-of-distribution detection considering multiple types of uncertainty. We leveraged various types of uncertainty estimates from both DL and evidence/belief theory domains. Through our extensive experiments, we found that dissonance-based detection yielded the best performance on misclassification detection while vacuity-based detection performed the best for OOD detection, compared to other competitive counterparts. In particular, it was noticeable that applying GKDE and the Teacher network further enhanced the accuracy in node classification and uncertainty estimates.

## Acknowledgments

We would like to thank Yuzhe Ou for providing proof suggestions. This work is supported by the National Science Foundation (NSF) under Grant No #1815696 and #1750911.
Broader Impact

In this paper, we propose an uncertainty-aware semi-supervised learning framework of GNNs for predicting multi-dimensional uncertainties in the task of semi-supervised node classification. Our proposed framework can be applied to a wide range of applications, including computer vision, natural language processing, recommendation systems, traffic prediction, generative models, and many more [23]. In these applications, it can predict multiple uncertainties with different root causes for GNNs, improving the understanding of individual decisions as well as of the underlying models. While there will be important impacts resulting from the use of GNNs in general, our focus in this work is on investigating the impact of using our method to predict multi-source uncertainties for such systems. The additional benefits of this method include improved safety and transparency in decision-critical applications by avoiding overconfident predictions, which can easily lead to misclassification. We see promising research opportunities that can adopt our uncertainty framework, such as investigating whether it can further enhance misclassification detection or OOD detection. To mitigate the risk from different types of uncertainties, we encourage future research to understand the impacts of this proposed uncertainty framework on solving other real-world problems.

References

[1] C. W. De Silva. Intelligent control: fuzzy logic applications. CRC Press, 2018.
[2] S. Depeweg, J.-M. Hernandez-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pages 1184-1193. PMLR, 2018.
[3] D. Eswaran, S. Günnemann, and C. Faloutsos. The power of certainty: A Dirichlet-multinomial model for belief propagation. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 144-152. SIAM, 2017.
[4] Y. Gal. Uncertainty in deep learning. University of Cambridge, 2016.
[5] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050-1059, 2016.
[6] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. Proceedings of International Conference on Learning Representations, 2017.
[7] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
[8] A. Jøsang. Subjective logic. Springer, 2016.
[9] A. Jøsang, J.-H. Cho, and F. Chen. Uncertainty characteristics of subjective opinions. In FUSION, pages 1998-2005. IEEE, 2018.
[10] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
[11] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, pages 5574-5584, 2017.
[12] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
[13] Z.-Y. Liu, S.-Y. Li, S. Chen, Y. Hu, and S.-J. Huang. Uncertainty aware graph Gaussian process for semi-supervised learning. 2020.
[14] A. Malinin and M. Gales. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pages 7047-7058, 2018.
[15] Y. Rong, W. Huang, T. Xu, and J. Huang. DropEdge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2019.
[16] S. Ryu, Y. Kwon, and W. Y. Kim. Uncertainty quantification of molecular property prediction with Bayesian neural networks. arXiv preprint arXiv:1903.08375, 2019.
[17] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
[18] M. Sensoy, L. Kaplan, and M. Kandemir. Evidential deep learning to quantify classification uncertainty. In NIPS, pages 3183-3193, 2018.
[19] K. Sentz, S. Ferson, et al. Combination of evidence in Dempster-Shafer theory, volume 4015. Citeseer, 2002.
[20] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann. Pitfalls of graph neural network evaluation. Relational Representation Learning Workshop, NeurIPS 2018, 2018.
[21] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention Networks. ICLR, 2018.
[22] Y. Zhang, S. Pal, M. Coates, and D. Ustebay. Bayesian graph convolutional neural networks for semi-supervised classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5829-5836, 2019.
[23] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.