# Graph-Based Semi-Supervised Learning with Nonignorable Nonresponses

Fan Zhou¹, Tengfei Li², Haibo Zhou², Jieping Ye³, Hongtu Zhu³,²

¹ Shanghai University of Finance and Economics, zhoufan@mail.shufe.edu.cn
² University of North Carolina at Chapel Hill, tengfei_li@med.unc.edu, zhou@bios.unc.edu
³ AI Labs, Didi Chuxing, {yejieping, zhuhongtu}@didiglobal.com

**Abstract.** Graph-based semi-supervised learning is important for many classification tasks, but most existing methods assume that all labelled nodes are randomly sampled. In the presence of nonignorable nonresponse, ignoring the missing nodes can lead to significant estimation bias and handicap the resulting classifiers. To address this issue, we propose a Graph-based joint model with Nonignorable Missingness (GNM) and develop an imputation and inverse probability weighting estimation approach. We further use graph neural networks to model nonlinear link functions and then use a gradient descent (GD) algorithm to estimate all the parameters of GNM. We prove the identifiability of the GNM model and validate its predictive performance in both simulations and real data analysis by comparing it with models that ignore or misspecify the missingness mechanism. Our method achieves up to a 7.5% improvement over the baseline model for the document classification task on the Cora dataset.

## 1 Introduction

Graph-based semi-supervised learning has been increasingly studied as real graph datasets have become more abundant. The problem is to predict the labels of all unlabelled nodes in a graph when only a small subset of nodes is observed. A popular strategy is to use graph Laplacian regularization to learn node representations, such as label propagation [25] and manifold regularization [3]. Recently, attention has shifted to the learning of network embeddings [12, 13, 20, 7, 23, 10, 6].

Almost all existing methods assume that the labelled nodes are randomly selected. In the real world, however, the probability of missingness may depend on the unobserved data even after conditioning on the observed data. For example, when predicting the traffic volume of a road network, sensors used to collect data are usually installed at intersections with large traffic flow. A researcher labelling documents in a citation network is more likely to label those that fall into categories he or she is familiar with. In these cases, nonresponses may be missing not at random (MNAR). Ignoring nonignorable nonresponses fails to account for the non-representativeness of the remaining samples, leading to significant estimation bias.

Modeling nonignorable missingness is challenging because the MNAR mechanism is usually unknown and may require additional model identifiability assumptions [5, 14, 21]. A popular method assigns the inverse of the estimated response probabilities as weights to the observed nodes [16, 4], but these procedures are designed for the missing at random (MAR) mechanism rather than MNAR. Another approach is to impute the missing data from the observed data [18, 19, 11]. More advanced methods [24, 21] estimate the nonignorable missingness using external data [8], but such data is often unavailable in practice, making these methods infeasible. Moreover, all these methods are built on simple regressions and do not directly apply to graphs.
In this paper, we develop a Graph-based joint model with Nonignorable Missingness (GNM) by assigning inverse response probabilities to the labelled nodes when estimating the target classifier or regression. To model the nonignorable missingness, we propose a deep-learning-based exponential tilting model that exploits the strengths of neural networks in function approximation and representation learning. The main contributions of this paper can be summarized as follows:

- To the best of our knowledge, we are the first to consider the graph-based semi-supervised learning problem in the presence of nonignorable nonresponse and to address it from the perspective of missing data.
- We introduce a novel notion of identifiability on the prediction-equivalence quotient (PEQ) space for neural network architectures, together with easily checked sufficient conditions.
- Different from traditional statistical methods, which extract features and fit the prediction model separately, we propose a novel joint estimation approach that integrates the inverse weighting framework with a modified loss function based on the imputation of nonresponses. It is easy to implement in practice and is robust to the normality assumption when the node response is continuous.
- We use a gradient descent (GD) algorithm to learn all the parameters, which works for traditional regression models as well as for modern deep graph neural networks.
- We examine the finite sample performance of our methods in both simulation and real data experiments, demonstrating the necessity of de-biasing for obtaining unbiased predictions on the testing data under the nonignorable nonresponse setting.

## 2 Model Description

Let $G = (V, E, A)$ be a weighted graph, where $V = \{v_1, \ldots, v_N\}$ denotes the vertex set of size $|V| = N$, $E$ contains all the edges, and $A$ is an $N \times N$ adjacency matrix. The $N$ vertexes make up the whole population, with only a small subset of vertexes being labelled. We introduce some important notation as follows:

(i) $x = [x_1, x_2, \ldots, x_N]^T \in \mathbb{R}^{N \times p}$ is a fully observed input feature matrix of size $N \times p$, with each $x_i \in \mathbb{R}^p$ being a $p \times 1$ feature vector at vertex $v_i$.
(ii) $Y = (y_1, y_2, \ldots, y_N)^T$ is a vector of vertex responses, which is partially observed subject to missingness; $y_i$ can be either categorical or continuous.
(iii) $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix (binary or weighted), which encodes node similarity and network connectivity. Specifically, $a_{ij}$ represents the edge weight between vertexes $v_i$ and $v_j$.
(iv) $r_i \in \{0, 1\}$ is a labeling indicator, that is, $y_i$ is observed if and only if $r_i = 1$. Let $R = \{1, \ldots, n\}$ denote the set of labelled vertexes and $R^c = \{n+1, \ldots, N\}$ denote the subsample of non-respondents for which the vertex label is missing.
(v) $G_A(x; \theta_g) \in \mathbb{R}^{N \times q}$ denotes the matrix whose $i$-th row $G_A(x; \theta_g)_i$ is a $q \times 1$ vector of unknown functions of $x$, which can be computed by a deep neural network incorporating the network connectivity $A$, such as a multi-layer GCN [10] or GAT [22].

In this paper, we consider a nonignorable response mechanism, where the indicator variable $r_i$ depends on $y_i$ (which is unobserved when $r_i = 0$). It is assumed that $r_i$ follows a Bernoulli distribution as follows:
$$r_i \mid (y_i, h(x_i; \theta_h)) \sim \mathrm{Bernoulli}(\pi_i), \qquad (1)$$
where $h(x_i; \theta_h)$ is an unknown parametric function of $x_i$ and $\pi_i = \pi(y_i, h(x_i; \theta_h)) = P(r_i = 1 \mid y_i, h(x_i; \theta_h))$ is the response probability for $y_i$. Given $G_A(x; \theta_g)$, $y_i$ and $y_j$ are assumed to be independent, and given $y_i$ and $h(x_i; \theta_h)$, $r_i$ and $r_j$ are assumed to be independent for $i \neq j$.
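To make the notation concrete, here is a minimal NumPy sketch (all variable and function names are ours, not taken from the authors' code) of the data objects above and of drawing the labeling indicators in (1); it uses the logistic form of $\pi_i$ that is specified in (2) below and a simple stand-in for the unknown $h(x_i; \theta_h)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 2708, 1433                        # number of vertexes and feature dimension (Cora-sized)

# Toy versions of the observed objects: adjacency A, features x, and a latent response y.
A = (rng.random((N, N)) < 0.002).astype(float)
A = np.maximum(A, A.T)                   # symmetric, binary adjacency
np.fill_diagonal(A, 0.0)
x = rng.random((N, p))                   # fully observed feature matrix
y = rng.normal(size=N)                   # vertex responses; only partially observed in practice

# Nonignorable labeling mechanism of Eq. (1): r_i ~ Bernoulli(pi_i), where pi_i depends on y_i.
alpha_r, gamma, phi = -3.0, 0.5, 1.5
h_x = x.mean(axis=1)                     # illustrative stand-in for the unknown h(x_i; theta_h)
pi = 1.0 / (1.0 + np.exp(-(alpha_r + gamma * h_x + phi * y)))
r = rng.binomial(1, pi)                  # labeling indicators

labelled_idx   = np.flatnonzero(r == 1)  # the set R of labelled vertexes
unlabelled_idx = np.flatnonzero(r == 0)  # the set R^c of non-respondents
y_obs = y[labelled_idx]                  # the only responses available for training
```

Because $\phi \neq 0$ here, the probability of a node being labelled depends on its own (possibly unobserved) response, which is exactly the nonignorable setting considered in this paper.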
Furthermore, an exponential tilting model is proposed for $\pi_i$ as follows:
$$\pi(y_i, h(x_i; \theta_h)) = \pi(y_i, h(x_i; \theta_h); \alpha_r, \gamma, \phi) = \frac{\exp\{\alpha_r + \gamma^T h(x_i; \theta_h) + \phi y_i\}}{1 + \exp\{\alpha_r + \gamma^T h(x_i; \theta_h) + \phi y_i\}}. \qquad (2)$$

Our goal is to learn an outcome model for $Y \mid x$ without bias. Without loss of generality, when $y$ is continuous, we consider a linear model given by
$$Y = \alpha + G_A(x; \theta_g)\beta + \epsilon, \qquad (3)$$
where $\epsilon = (\epsilon_1, \ldots, \epsilon_N)^T \sim N(0, \sigma^2 I)$ is an error term independent of $x$ with zero unconditional mean, that is, $E(\epsilon_i) = 0$. In this case, dropping the missing data can lead to strongly biased estimates when $r$ depends on $y$. The parameter estimates will not be consistent, since $E\{\epsilon_i \mid r_i = 1\}$ and $E\{\epsilon_i G_A(x; \theta_g)_i \mid r_i = 1\}$ are not zero. Moreover, the missing values could not be imputed even if we had consistent estimates, since
$$E\{y_i \mid r_i = 0, G_A(x; \theta_g)_i; \alpha, \beta\} = \frac{E\{y_i(1 - r_i) \mid G_A(x; \theta_g)_i; \alpha, \beta\}}{1 - P(r_i = 1 \mid G_A(x; \theta_g)_i; \alpha, \beta)} \qquad (4)$$
$$= \alpha + \beta^T G_A(x; \theta_g)_i - \frac{\mathrm{cov}(y_i, \pi_i \mid G_A(x; \theta_g)_i; \alpha, \beta)}{1 - E(\pi_i \mid G_A(x; \theta_g)_i; \alpha, \beta)} \neq \alpha + \beta^T G_A(x; \theta_g)_i.$$

When $y$ is a $K$-class discrete variable, we consider a multicategorical logit model as follows:
$$P(y_i = k \mid G_A(x; \theta_g)_i; \alpha_k, \beta_k) = \exp(\alpha_k + \beta_k^T G_A(x; \theta_g)_i) \Big/ \sum_{j=1}^{K} \exp(\alpha_j + \beta_j^T G_A(x; \theta_g)_i), \quad \forall k. \qquad (5)$$
Therefore, we can define a joint model of (1) and (3) (or (1) and (5)), called the Graph-based joint model with Nonignorable Missingness (GNM), to obtain an unbiased estimate of $Y \mid x$.

## 3 Estimation

In this section, we examine several important properties of GNM, such as identifiability, and present its estimation algorithm.

### 3.1 Identifiability

We first consider the identifiability of GNM. Let $Y = (y_{\mathrm{obs}}^T, y_{\mathrm{mis}}^T)^T$ and $J = (R, R^c)$. The joint probability density function (pdf) of the observed data is given by
$$f(y_{\mathrm{obs}}, J \mid x) = f(y_1, y_2, \ldots, y_n, r_1, \ldots, r_N \mid x) = \prod_{i=1}^{n} f(y_i, r_i \mid x) \prod_{i=n+1}^{N} \int f(y_i, r_i \mid x)\, dy_i. \qquad (6)$$
Based on the assumptions on $r_i \mid (y_i, h(x_i))$ and $y_i \mid G_A(x; \theta_g)_i$, (6) is equivalent to
$$\prod_{i} \big[P(r_i = 1 \mid y_i, h(x_i; \theta_h)) f(y_i \mid G_A(x; \theta_g)_i)\big]^{r_i} \Big[1 - \int P(r_i = 1 \mid y, h(x_i; \theta_h)) f(y \mid G_A(x; \theta_g)_i)\, dy\Big]^{1 - r_i}. \qquad (7)$$

The GNM model is called identifiable if different sets of parameters $(\theta_h, \theta_g)$ yield different functions $P(r_i = 1 \mid y_i, h(x_i; \theta_h)) f(y_i \mid G_A(x; \theta_g)_i)$ of $(y_i, x)$. Identifiability implies that, with positive probability, the global maximum of (7) is unique. However, identifiability may fail for many neural network models. For example, the identifiability of the parameters in (1) is a necessary condition for model identifiability, and it can fail for a ReLU network. Specifically, we have
$$\mathrm{Logit}[P(r_i = 1 \mid y_i, h(z_i; \beta_r)); \gamma] = \alpha_r + \gamma\,\mathrm{ReLU}(z_i \beta_r) + \phi y_i = \mathrm{Logit}[P(r_i = 1 \mid y_i, h(z_i; 2\beta_r)); \gamma/2].$$
Fortunately, this type of non-identifiability does not create any prediction discrepancy, since under GNM the prediction of $y$ given $x$ is exactly the same for two parameter sets $(\gamma, \theta_h, \beta, \theta_g)$ and $(\gamma', \theta_h', \beta', \theta_g')$ whenever
$$\gamma^T h(x; \theta_h) = \gamma'^T h(x; \theta_h') \quad \text{and} \quad G_A(x; \theta_g)\beta = G_A(x; \theta_g')\beta'. \qquad (8)$$

In consideration of this prediction equivalence, a more useful definition of identifiability is given as follows. Let $f(y_i \mid G_A(x)_i; \theta_y) = f(y_i \mid G_A(x; \theta_g)_i; \alpha, \beta)$ and $P(r_i = 1 \mid y_i, h(z_i); \theta_r) = P(r_i = 1 \mid y_i, h(z_i; \theta_h); \alpha_r, \gamma, \phi)$, where $\theta_y = (\alpha, \beta, \theta_g)$ and $\theta_r = (\alpha_r, \gamma, \phi, \theta_h)$ contain the unknown parameters of the outcome model $Y \mid x$ and the missing-data model $r \mid (y, z)$, respectively. Let $D(\theta_y) \otimes D(\theta_r)$ denote the domain of $(\theta_y, \theta_r)$, where $\otimes$ is the tensor product of the two spaces.

**Definition 3.1.** Under GNM, we say $(\theta_y, \theta_r)$ is equivalent to $(\theta_y', \theta_r')$, denoted by $(\theta_y, \theta_r) \sim (\theta_y', \theta_r')$, if (8) holds and $\alpha' = \alpha$, $\alpha_r' = \alpha_r$ and $\phi' = \phi$, where $\theta_y = (\alpha, \beta, \theta_g)$, $\theta_r = (\alpha_r, \gamma, \phi, \theta_h)$, $\theta_y' = (\alpha', \beta', \theta_g')$, and $\theta_r' = (\alpha_r', \gamma', \phi', \theta_h')$.
The equivalence class of an element $(\theta_y, \theta_r)$ is denoted by $[[(\theta_y, \theta_r)]]$ and defined as the set $[[(\theta_y, \theta_r)]] = \{(\theta_y', \theta_r') \in D(\theta_y) \otimes D(\theta_r) \mid (\theta_y', \theta_r') \sim (\theta_y, \theta_r)\}$. The set of all equivalence classes is called the Prediction-Equivalent Quotient (PEQ) space, denoted by $S = D(\theta_y) \otimes D(\theta_r)/\!\sim$. The GNM model is called identifiable on the PEQ space if and only if $f(y \mid G_A(x)_i; \theta_y) P(r = 1 \mid y, h(x_i); \theta_r) = f(y \mid G_A(x)_i; \theta_y') P(r = 1 \mid y, h(x_i); \theta_r')$ for all $x, y$ implies $(\theta_y, \theta_r) \sim (\theta_y', \theta_r')$. Different from identifiability on the parameter space, identifiability on the PEQ space implies the uniqueness of the prediction given $x$ rather than of the parameter estimates. It is therefore applicable to complex architectures that focus more on prediction than on parameters. The following example is identifiable on neither the parameter space nor the PEQ space.

**Example 1.** Let $G_A(x; \theta_g) = x$, $h(x; \theta_h) = x$, $y_i \sim N(\mu + x\beta, 1)$, and $P(r_i = 1 \mid y_i) = [1 + \exp(-\alpha_r - x\gamma - \phi y_i)]^{-1}$ with unknown real-valued $\alpha_r, \gamma, \phi, \mu$ and $\beta$, and thus
$$P(r_i = 1 \mid y_i, h(x_i)) f(y_i \mid G_A(x)_i) = \frac{\exp[-(y_i - \mu - x_i\beta)^2/2]}{\sqrt{2\pi}\,[1 + \exp(-\alpha_r - \phi y_i - \gamma x)]}. \qquad (9)$$
In this case, two different sets of parameters $(\alpha_r, \gamma, \phi, \mu, \beta)$ and $(\alpha_r', \gamma', \phi', \mu', \beta')$ produce equal values of (9) if $\alpha_r = (\mu^2 - \mu'^2)/2$, $\beta' = \beta$, $\phi = \mu' - \mu$, $\gamma = \beta(\mu - \mu')$, $\alpha_r' = -\alpha_r$, $\phi' = -\phi$, and $\gamma' = -\gamma$. The observed likelihood is identifiable only under ignorable missingness, i.e. $\phi = \phi' = 0$.

Additional conditions are therefore required to ensure the identifiability of GNM on the PEQ space.

**Theorem 3.1.** Assume the following three conditions.
(A1) For all $\theta_g$, there exists $(x_1, x_2)$ such that $G_A(x_1; \theta_g)_i \neq G_A(x_2; \theta_g)_i$ for each $i$, and $\beta \neq 0$ holds.
(A2) For all $\theta_g$ and $z$, there exists $(u_1, u_2)$ such that $G_A([z, u_1]; \theta_g)_i \neq G_A([z, u_2]; \theta_g)_i$ for each $i$, and $\beta \neq 0$ holds.
(A3) For all $\theta_h$, there exists $(z_1, z_2)$ such that $h(z_1; \theta_h) \neq h(z_2; \theta_h)$, and $\gamma \neq 0$ holds.
The GNM model (1) and (5) is identifiable on the PEQ space under Condition (A1). Suppose that there exists an instrumental variable $u$ in $x = [z, u]$ such that $f(y_i \mid G_A(x)_i)$ depends on $u$, whereas $P(r_i = 1 \mid y_i, h(x_i))$ does not. Then the GNM model (1) and (3) is identifiable on the PEQ space under Conditions (A2) and (A3).

Regularity conditions (A1)-(A3) are easy to satisfy.

### 3.2 Estimation Approach

It is not easy to directly maximize the full likelihood function (6) in practice, since its integral term can be extremely difficult to compute. On the other hand, the normality assumption on the error term can be restrictive for the GNM consisting of (1) and (3). Therefore, we propose a doubly robust (DR) estimation approach that alternately obtains the Inverse Probability Weighted Estimator (IPWE) of $\theta_y$ and the imputation estimator of $\theta_r$ [15, 1].

**Inverse Probability Weighted Estimator (IPWE) of $\theta_y$.** With $\pi(y_i, h(x_i); \theta_r)$ estimated by $\pi(y_i, h(x_i); \hat{\theta}_r)$, the IPWE of $\theta_y$ can be obtained by minimizing the weighted cross-entropy loss
$$L_1(\theta_y \mid \hat{\theta}_r) = -\sum_i \frac{r_i}{\pi(y_i, h(x_i); \hat{\theta}_r)} \sum_{k=1}^{K} 1(y_i = k)\, \log P(y_i = k \mid G_A(x)_i; \theta_y) \qquad (10)$$
when $Y \mid x$ follows (5), or by minimizing the weighted mean squared error (MSE)
$$L_1(\theta_y \mid \hat{\theta}_r) = \sum_i \frac{r_i}{\pi(y_i, h(x_i); \hat{\theta}_r)} \{y_i - \alpha - \beta^T G_A(x; \theta_g)_i\}^2 \qquad (11)$$
when $Y$ is continuous. The estimating equation (11) is robust with respect to the normality assumption. If $\pi(y_i, h(x_i); \theta_r)$ is correctly specified, the IPW estimator of $\theta_y$ that solves $\partial L_1(\theta_y \mid \hat{\theta}_r)/\partial \theta_y = 0$ is consistent and converges to the true $\theta_y$ according to the following theorem.

**Theorem 3.2.** If $\theta_r$ is known, then any estimating function $l(y_i, G_A(x)_i; \theta_y)$ with $E_{\theta_y}\{\sum_i l(y_i, G_A(x)_i; \theta_y)\} = 0$ satisfies
$$E_{\theta_y}\Big\{\sum_i \frac{r_i}{\pi(y_i, h(x_i); \theta_r)}\, l(y_i, G_A(x)_i; \theta_y)\Big\} = 0.$$
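As a concrete illustration of (10) and (11), the sketch below computes the two inverse-probability-weighted losses given estimated response probabilities $\hat{\pi}_i$. The function names and the clipping of $\hat{\pi}_i$ away from zero (a common numerical safeguard against exploding weights, not specified in the paper) are our own.

```python
import numpy as np

def ipw_mse_loss(y, y_pred, r, pi_hat):
    """Weighted MSE of Eq. (11): only labelled nodes (r_i = 1) contribute,
    each weighted by 1 / pi_hat_i, the inverse of its estimated response probability."""
    w = r / np.clip(pi_hat, 1e-6, 1.0)
    return np.sum(w * (y - y_pred) ** 2)

def ipw_cross_entropy_loss(y_onehot, probs, r, pi_hat):
    """Weighted cross-entropy of Eq. (10) for the K-class logit model (5).
    y_onehot: (N, K) one-hot labels; probs: (N, K) predicted class probabilities."""
    w = r / np.clip(pi_hat, 1e-6, 1.0)
    log_lik = np.sum(y_onehot * np.log(np.clip(probs, 1e-12, 1.0)), axis=1)
    return -np.sum(w * log_lik)
```

Nodes with $r_i = 0$ drop out of both sums, while labelled nodes that were unlikely to be labelled (small $\hat{\pi}_i$) receive correspondingly larger weights, which is what corrects the selection bias induced by the nonrandom labelling.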
**Imputation estimator of $\theta_r$.** With the estimated $f(Y \mid G_A(x; \hat{\theta}_g))$, we can obtain an estimator of $\theta_r$ by minimizing
$$L_2(\theta_r \mid \hat{\theta}_y) = -\sum_{r_i=1} \log \pi(y_i, h(x_i); \theta_r) - \sum_{r_i=0} \log\big(1 - E\{\pi(y_i, h(x_i); \theta_r) \mid x; \hat{\theta}_y\}\big), \qquad (12)$$
where $\pi(y_i, h(x_i); \theta_r) = P(r_i = 1 \mid y_i, h(x_i); \theta_r)$ and $E\{\pi(y_i, h(x_i)) \mid x; \hat{\theta}_y\} = \int P(r_i = 1 \mid y, h(x_i); \theta_r) f(y \mid G_A(x)_i; \hat{\theta}_y)\, dy$. One advantage of our proposed joint estimation approach is that $E\{\pi(y_i, h(x_i); \theta_r) \mid x\}$ can be easily approximated by the empirical average over a set of random draws at the nodes with missing $y$, used as imputed responses:
$$E\{\pi(y_i, h(x_i); \theta_r) \mid x; \theta_y\} = \int P(r_i = 1 \mid y, h(x_i); \theta_r) f(y \mid G_A(x)_i; \theta_y)\, dy \approx B^{-1} \sum_b \pi(y_{ib}, h(x_i); \theta_r),$$
where $\{y_{ib}\}_{b=1}^B \stackrel{\mathrm{iid}}{\sim} f(y \mid G_A(x)_i; \hat{\theta}_y)$. Thus, we can get an unbiased estimate of (12) by replacing the expectation with an empirical mean over samples generated from $f(y \mid G_A(x)_i; \hat{\theta}_y)$, as follows:
$$\tilde{L}_2(\theta_r \mid \hat{\theta}_y) = -\sum_{r_i=1} \log \pi(y_i, h(x_i); \theta_r) - \sum_{r_i=0} \log\Big(1 - B^{-1} \sum_{y_{ib} \sim f(y \mid G_A(x)_i; \hat{\theta}_y)} \pi(y_{ib}, h(x_i); \theta_r)\Big), \qquad (13)$$
the gradient of which can be expressed as
$$\partial_{\theta_r} \tilde{L}_2(\theta_r \mid \hat{\theta}_y) = -\sum_{r_i=1} \frac{\partial_{\theta_r} \pi(y_i, h(x_i); \theta_r)}{\pi(y_i, h(x_i); \theta_r)} + \sum_{r_i=0} \frac{B^{-1} \sum_b \partial_{\theta_r} \pi(y_{ib}, h(x_i); \theta_r)}{1 - B^{-1} \sum_b \pi(y_{ib}, h(x_i); \theta_r)}. \qquad (14)$$
The imputation estimator of $\theta_r$ obtained by minimizing $L_2(\theta_r \mid \theta_y)$ is consistent when $f(Y \mid G_A(x; \theta_g))$ is correctly specified. The overall estimation procedure is schematically depicted in Figure 1.

Figure 1: General picture of the joint estimation approach.

### 3.3 Algorithm

In this subsection, we give more details of the proposed imputation and IPW estimation approach, describing how to jointly estimate $\theta_y$ and $\theta_r$ by alternately minimizing the conditional loss functions $L_1(\theta_y \mid \hat{\theta}_r)$ and $\tilde{L}_2(\theta_r \mid \hat{\theta}_y)$ in practice. Specifically, at each epoch we update $\theta_y$ and then $\theta_r$ via $\theta_y^{(e+1)} = \arg\min_{\theta_y} L_1(\theta_y \mid \theta_r^{(e)})$ and $\theta_r^{(e+1)} = \arg\min_{\theta_r} \tilde{L}_2(\theta_r \mid \theta_y^{(e+1)})$, where $\theta_r^{(e)}$ and $\theta_y^{(e+1)}$ are the estimates of $\theta_r$ and $\theta_y$ obtained at the $e$-th and $(e+1)$-th epoch, respectively. We use the gradient descent (GD) algorithm to learn all the parameters in $\theta_r$ and $\theta_y$, while incorporating the network architectures of $G_A(x; \theta_g)$ and $h(x; \theta_h)$. To avoid relying on the normality assumption when $y_i$ is continuous, we replace the random draw $y_{ib}^{(e)}$ in (13) by the expectation $\beta_0 + \beta_1^T G_A(x; \theta_g^{(e)})_i$ at the $e$-th epoch. This can be seen as an approximation obtained by linearizing $\pi(y_i, h(x_i))$ via a Taylor series expansion and taking the expectation of the first two terms [2]:
$$E\{\pi(y_i, h(x_i)) \mid x; \theta_y^{(e)}\} \approx \pi\big(E(y_i \mid x; \theta_y^{(e)}), h(x_i)\big) = \pi\big(\beta_0 + \beta_1^T G_A(x; \theta_g^{(e)})_i, h(x_i)\big).$$
In this case, it is equivalent to setting $B = 1$, and the sample size, i.e. the total number of nodes, is fixed at each training epoch. Based on the simulations and real experiments below, this simplification still outperforms the baseline models with a significant improvement in prediction accuracy on the non-response nodes. The details of the algorithm are described in five steps as follows:

1. Determine the initial values of the response probabilities $\pi_i^{(0)}$ (or $\theta_r^{(0)}$). For example, we can let $\pi_i^{(0)} = 1$ for all the labelled vertexes ($r_i = 1$).
2. Let $e = 1$ initially, where $e$ indexes the epoch. We update $\theta_y$ based on $\pi_i^{(e-1)}$ obtained from the previous epoch by minimizing the loss function in (10) (or (11)) using GD. At the $i$-th iteration within the $e$-th epoch, we update $\theta_y$ as follows:
$$\theta_y^{(e,i+1)} \leftarrow \theta_y^{(e,i)} - \gamma_0\, \partial_{\theta_y} L_1(\theta_y \mid \theta_r^{(e-1)}), \qquad (15)$$
where $\gamma_0$ is the learning rate and $L_1(\theta_y \mid \theta_r^{(e-1)})$ is the loss function based on $\pi_i^{(e-1)} = \pi(y_i, h(x_i); \theta_r^{(e-1)})$. We denote the updated $\theta_y$ as $\theta_y^{(e)}$ after $M^{(e)}$ iterations.
3. Impute $y_i$ for all the unlabelled nodes ($r_i = 0$) using $y_i^{(e)} = \beta_0^{(e)} + G_A(x; \theta_g^{(e)})_i^T \beta_1^{(e)}$ in the continuous case, and by sampling $y_i^{(e)}$ from the distribution $P(y_i \mid G_A(x)_i; \theta_y^{(e)})$ otherwise.
4. We use GD to update $\theta_r$. Specifically, at the $j$-th iteration, we have
$$\theta_r^{(e,j+1)} \leftarrow \theta_r^{(e,j)} - \gamma_1\, \partial_{\theta_r} \tilde{L}_2(\theta_r \mid \theta_y^{(e)}), \qquad (16)$$
with the initial value $\theta_r^{(e,0)}$ set to $\theta_r^{(e-1)}$, and $\gamma_1$ is the learning rate. After convergence, we obtain the estimate of $\theta_r$, denoted $\theta_r^{(e)}$, at the end of this training epoch. Then we update the sampling weights $\pi_i^{(e)}$ based on $P(r_i = 1 \mid y_i, h(x_i); \theta_r^{(e)})$ for all labelled vertexes.
5. Stop once convergence has been achieved; otherwise let $e = e + 1$ and return to Step 2. The convergence criterion is that the imputed values for the unlabelled vertexes at epoch $e$ differ only slightly from those at epoch $e - 1$. In other words, the iterative procedure stops if
$$\sum_{r_i=0} |y_i^{(e)} - y_i^{(e-1)}| \Big/ \sum_i 1(r_i = 0) \le \varepsilon.$$

We let $M_0$ and $M_1$ be the maximal numbers of internal iterations allowed at each epoch for updating $\theta_y$ and $\theta_r$, respectively. For more details, refer to Algorithm 1 in the supplements. Theoretically, the complexity of Steps 2 and 3 (for example, for a one-layer GCN) is $O(|E|pq)$ at each epoch according to [10], where $|E| < N^2$ is the number of edges. Moreover, the complexity of Step 4 is $O(Np)$ when $h$ is a single fully connected layer.

## 4 Experiments

In this section, simulations and a real data analysis are conducted to evaluate the empirical performance of our proposed method against a baseline model that ignores the non-response (SM). Our GNM model reduces to SM when it only contains the outcome model $Y \mid x$ given in (3) (or (5)) with all the weights in loss (10) set to 1. In the real data analysis, GNM is also compared with a model using a misspecified ignorable missingness mechanism, as well as with other state-of-the-art de-biasing methods. In the simulations, we simulate the node response $y$ based on (3) and generate the labelled set via the exponential tilting model (1). For the real data analysis, we evaluate all the compared models on semi-supervised document classification on the Cora citation network with nonignorable non-response.

In this paper, we use a GCN to learn the latent node representations $G_A(x)$, with the layer-wise propagation defined as
$$H^{(l+1)} = f(H^{(l)}, A) = \sigma\big(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big), \qquad (17)$$
where $\hat{A} = A + I$, in which $I$ is the identity matrix, and $\hat{D}$ is the diagonal vertex degree matrix of $\hat{A}$. Here $W^{(l)}$ is the weight matrix of the $l$-th layer and $\sigma(\cdot)$ is a non-linear activation function. $H^{(0)} = x$ is the initial input and $G_A(x) = H^{(2)} \in \mathbb{R}^{N \times q}$ is the output of the second layer-wise propagation. For fairness, we let $G_A(x)$ be a 2-layer GCN model for all compared approaches.

### 4.1 Simulations

We consider network data generated over $|V| = 2708$ vertexes together with a binary adjacency matrix $A$. The fully observed input feature matrix $x \in \mathbb{R}^{2708 \times 1433}$ is a large-scale sparse matrix. Both $A$ and $x$ are taken from the Cora dataset. The node response is simulated from the following model:
$$y_i = \beta_0 + \beta_1^T G_A(x)_i + \epsilon_i, \qquad (18)$$
where $\epsilon_i \sim N(0, \sigma^2)$ and $G_A(x)$ is the output of a 2-layer GCN model. We let the response probability $\pi$ depend on the unobserved vertex response $y$ only, so that (1) simplifies to
$$\pi_i \equiv P(r_i = 1 \mid y_i) = \frac{\exp\{\alpha_r + \phi y_i\}}{1 + \exp\{\alpha_r + \phi y_i\}}. \qquad (19)$$
In this case, the instrumental variable $u$ is exactly $x$ itself, and identifiability holds automatically by Theorem 3.1. All $\beta$'s in (18) are sampled from the uniform distribution $U(0, 1)$.
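To illustrate the layer-wise propagation rule (17) and the 2-layer $G_A(x)$ used in (18), here is a minimal dense NumPy sketch with random, untrained weights (our own illustration; a practical implementation such as [10] uses sparse matrices and learns the $W^{(l)}$ by gradient descent):

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One layer of Eq. (17): sigma(D_hat^{-1/2} A_hat D_hat^{-1/2} H W),
    with A_hat = A + I and D_hat its diagonal degree matrix.
    Dense arrays are used purely for illustration."""
    N = A.shape[0]
    A_hat = A + np.eye(N)                          # add self-loops
    d = A_hat.sum(axis=1)                          # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))         # D_hat^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # symmetric normalization
    return activation(A_norm @ H @ W)

# A toy 2-layer G_A(x); the first hidden layer has 16 units as in the experiments.
rng = np.random.default_rng(0)
N, p, hidden, q = 50, 20, 16, 4
A = (rng.random((N, N)) < 0.1).astype(float); A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
x = rng.random((N, p))
H1 = gcn_layer(A, x, rng.normal(size=(p, hidden)), activation=lambda z: np.maximum(z, 0))
G_A = gcn_layer(A, H1, rng.normal(size=(hidden, q)), activation=lambda z: z)  # linear output
```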
The $\alpha_r$ and $\phi$ were selected to make the overall missing proportion approximately 90%. The labelled subset is randomly split into training and validation sets, while the remaining non-response nodes form the testing set. We train all the compared models for a maximum of 200 epochs ($E = 200$) using Adam [9] with a learning rate of 0.05 and make a prediction $\hat{y}_i$ for each testing vertex. Training is stopped when the validation loss does not decrease for 15 consecutive iterations. We keep all other model settings used by [10] and fix the size of the first hidden layer to 16 units.

Table 1 summarizes the estimation results under different $(p, \sigma)$ combinations, where the root mean squared error (RMSE) and mean absolute percentage error (MAPE) are computed between the true node response $y$ and the prediction $\hat{y}$ over 50 runs. We can clearly see that GNM outperforms SM under all four settings, with much smaller mean RMSEs and MAPEs. Moreover, GNM is more stable than SM, with smaller estimation variance.

Table 1: Mean RMSEs and MAPEs by GNM and SM based on simulated data sets.

| p  | σ   | Method | Metric | Mean   | SD      |
|----|-----|--------|--------|--------|---------|
| 4  | 0.5 | SM     | RMSE   | 1.1925 | 6.43e-1 |
| 4  | 0.5 | SM     | MAPE   | 0.2932 | 2.01e-1 |
| 4  | 0.5 | GNM    | RMSE   | 0.6983 | 1.28e-2 |
| 4  | 0.5 | GNM    | MAPE   | 0.1995 | 1.00e-2 |
| 4  | 1   | SM     | RMSE   | 1.6185 | 8.58e-2 |
| 4  | 1   | SM     | MAPE   | 0.3104 | 4.73e-2 |
| 4  | 1   | GNM    | RMSE   | 1.2103 | 4.81e-2 |
| 4  | 1   | GNM    | MAPE   | 0.2263 | 2.28e-2 |
| 16 | 0.5 | SM     | RMSE   | 0.7923 | 9.94e-2 |
| 16 | 0.5 | SM     | MAPE   | 0.2014 | 2.42e-2 |
| 16 | 0.5 | GNM    | RMSE   | 0.6015 | 2.17e-2 |
| 16 | 0.5 | GNM    | MAPE   | 0.1672 | 1.90e-2 |
| 16 | 1   | SM     | RMSE   | 1.4212 | 2.14e-1 |
| 16 | 1   | SM     | MAPE   | 0.2129 | 1.05e-2 |
| 16 | 1   | GNM    | RMSE   | 1.1316 | 6.04e-2 |
| 16 | 1   | GNM    | MAPE   | 0.1849 | 4.62e-3 |

Figure 2: Boxplot of RMSEs in the real data analysis.

### 4.2 Real Data Analysis

For the real data analysis, we modify Cora into a binary-class dataset by merging the six non-Neural-Network classes together. The global prevalences of the two new classes are (0.698, 0.302), with $N_0 = \#\{y = 0\} = 1890$ and $N_1 = \#\{y = 1\} = 818$, respectively. Two missing mechanisms are considered. The simple setup is the same as (19). In this case, we compare our method with the inverse weighting approach proposed by [17]. We let the two functions of $x$ required to estimate $\pi$ under their framework be the constant 1 and the first principal component (PC) score, which is more stable than other choices such as a single $x_j$ or $\sum_j x_{ij}$. In the more complicated setup, the labelled nodes are generated based on
$$\pi_i \equiv P(r_i = 1 \mid y_i, h(x_i)) = \frac{\exp\{\alpha_r + \gamma^T h(x_i) + \phi y_i\}}{1 + \exp\{\alpha_r + \gamma^T h(x_i) + \phi y_i\}}, \qquad (20)$$
where $h(x_i) = \exp(\sum_j x_{ij}/a_0 - a_1)\,(\sum_j x_{ij} - a_2)/a_3$ takes values in $[0, 1]$. The explicit form of $h(x)$ is assumed to be unknown, and we use a multi-layer perceptron to approximate it. The network has two hidden layers with 128 and 64 units, respectively, and we use the tanh activation for the final output layer. As a comparison, we also include the results when the nonignorable missingness is over-simplified to an ignorable one (GIM). We let $n_k = \#\{(y_i = k) \cap (r_i = 1)\}$ and use $\lambda$ to denote the size ratio between the two groups of labelled nodes, i.e. $n_1/n_0$. We carry out further experiments on other datasets, including Citeseer, and explore the finite sample performance of our method with other state-of-the-art architectures such as GAT [22]. More details are provided in the supplementary materials.

Table 2: Mean prediction accuracy for the simple setup by each method.

| λ   | Method | Mean Accuracy | SD      |
|-----|--------|---------------|---------|
| 1   | SM     | 0.8683        | 1.98e-2 |
| 1   | Rosset | 0.8514        | 5.19e-2 |
| 1   | GNM    | 0.8947        | 6.47e-3 |
| 1.5 | SM     | 0.8458        | 2.21e-2 |
| 1.5 | Rosset | 0.8311        | 7.09e-2 |
| 1.5 | GNM    | 0.8908        | 1.26e-2 |
| 2   | SM     | 0.8052        | 3.26e-2 |
| 2   | Rosset | 0.8193        | 6.05e-2 |
| 2   | GNM    | 0.8648        | 2.54e-2 |

Figure 3: Boxplot of prediction accuracy for the simple setup.

Results are summarized in Tables 2 and 3.
Reported values are the average classification accuracies on the testing data over 50 replications with re-sampling allowed. In each setup, two de-biasing methods, including our approach, are compared with SM. We adjust $\alpha$ and $\beta$ to make the size of the training set around 120 for each sub-setting. Increasing $\lambda$ reduces the number of $y = 0$ nodes included in the training set, leading to insufficient learning power and thus a lower overall classification accuracy. For the simple setup, GNM significantly outperforms the compared models, increasing the baseline prediction accuracy by 3.1%-7.4%. Moreover, GNM is less sensitive to the sample selection and has smaller variance than the method of [17]. For the complicated setup, mis-specifying the nonignorable missingness as ignorable still incurs large biases, even though it achieves some improvement over SM: the mean prediction accuracy of GNM is between 3.7% and 4.8% higher than that of GIM. In both sub-settings, our method always yields the smallest estimation variance and is least affected by the selection of labelled nodes. For both setups, a higher $\lambda$ value leads to larger sampling bias, and consequently to a more significant improvement in prediction accuracy. Figures 3 and 4 show the boxplots of prediction accuracy obtained by each method under the two model setups. They intuitively demonstrate the necessity of taking the missingness mechanism into account in order to achieve higher prediction accuracy on the unlabelled nodes.

Table 3: Mean prediction accuracy for the complicated setup by each method.

| λ | Method | Mean Accuracy | SD      |
|---|--------|---------------|---------|
| 1 | SM     | 0.8663        | 1.21e-2 |
| 1 | GIM    | 0.8713        | 1.52e-2 |
| 1 | GNM    | 0.8961        | 1.18e-2 |
| 2 | SM     | 0.8141        | 2.34e-2 |
| 2 | GIM    | 0.8291        | 2.79e-2 |
| 2 | GNM    | 0.8669        | 1.63e-2 |

Figure 4: Boxplot of prediction accuracy for the complicated setup.

We also empirically analyze the computational efficiency of our algorithm. The number of epochs for GNM to achieve convergence in the 50-run real-data experiments is 3 (21 runs), 4 (19 runs), 5 (7 runs), and 6 (3 runs) in Setting 1. Figure 5 summarizes the number of iterations for the 2-layer GCN in SM and those for Step 2 of our algorithm at each epoch. It shows that the computational cost of our GNM model at each epoch is comparable to that of the baseline GCN model.

Figure 5: Number of iterations for GNM and SM at each epoch under sub-setting one.

Our implementation of GNM can be found at https://github.com/BIG-S2/keras-gnm.

## References

[1] Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962-973, 2005.
[2] Jean-Francois Beaumont. An estimation method for nonignorable nonresponse. Survey Methodology, 26(2):131-136, 2000.
[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399-2434, 2006.
[4] James R. Carpenter, Michael G. Kenward, and Stijn Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(3):571-584, 2006.
[5] Kani Chen. Parametric models for response-biased sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(4):775-789, 2001.
[6] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844-3852, 2016.
[7] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855-864. ACM, 2016.
[8] Jae Kwang Kim and Cindy Long Yu. A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106(493):157-165, 2011.
[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[11] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data, volume 793. Wiley, 2019.
[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[13] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701-710. ACM, 2014.
[14] Jing Qin, Denis Leung, and Jun Shao. Estimation with survey data under nonignorable nonresponse or informative sampling. Journal of the American Statistical Association, 97(457):193-200, 2002.
[15] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846-866, 1994.
[16] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429):106-121, 1995.
[17] Saharon Rosset, Ji Zhu, Hui Zou, and Trevor J. Hastie. A method for inferring label sampling mechanisms in semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1161-1168, 2005.
[18] Donald B. Rubin. Inference and missing data. Biometrika, 63(3):581-592, 1976.
[19] Joseph L. Schafer and Nathaniel Schenker. Inference with imputed conditional means. Journal of the American Statistical Association, 95(449):144-154, 2000.
[20] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067-1077. International World Wide Web Conferences Steering Committee, 2015.
[21] Niansheng Tang, Puying Zhao, and Hongtu Zhu. Empirical likelihood for estimating equations with nonignorably missing data. Statistica Sinica, 24(2):723, 2014.
[22] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[23] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
[24] Hui Zhao, Pu-Ying Zhao, and Nian-Sheng Tang. Empirical likelihood inference for mean functionals with nonignorably missing response data. Computational Statistics & Data Analysis, 66:101-116, 2013.
[25] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912-919, 2003.