# Estimation with Incomplete Data: The Linear Case

Karthika Mohan¹, Felix Thoemmes² and Judea Pearl³
¹ University of California, Berkeley
² Cornell University
³ University of California, Los Angeles
karthika@eecs.berkeley.edu, felix.thoemmes@cornell.edu, judea@cs.ucla.edu

Abstract

Traditional methods for handling incomplete data, including Multiple Imputation and Maximum Likelihood, require that the data be Missing At Random (MAR). In most cases, however, missingness in a variable depends on the underlying value of that variable. In this work, we devise model-based methods to consistently estimate mean, variance and covariance given data that are Missing Not At Random (MNAR). While previous work on MNAR data requires variables to be discrete, we extend the analysis to continuous variables drawn from Gaussian distributions. We demonstrate the merits of our techniques by comparing them empirically to state-of-the-art software packages.

1 Introduction

Incomplete (or missing) data are data in which values of one or more variables are missing. Almost all existing techniques for handling incomplete data employ maximum likelihood or multiple imputation methods to fill in the missing values and estimate parameters of interest. To guarantee convergence and consistency, these techniques require that the missing data mechanism be ignorable ([Rubin, 1976]), i.e. that the causes of missingness be either random or fully observed. In practice, however, missingness is almost always caused by variables that are themselves afflicted by missingness ([Osborne, 2014; Sverdlov, 2015; Adams, 2007; van Stein and Kowalczyk, 2016]). Such data are called Missing Not At Random (MNAR).

Among all MNAR problems, the most frequently encountered case is that of a variable causing its own missingness, which we call self-masking MNAR. It is discussed in almost all literature on missing data, including [Koller and Friedman, 2009] (chapter 19) and [Darwiche, 2009] (chapter 17). Examples include smokers not answering questions about their smoking behavior in insurance applications, longitudinal studies with attrition ([Little, 1995]), people with high income not revealing their incomes, and a general reluctance to answer questions about sensitive topics such as religion, sexual preference and abortion ([Greenland and Finkle, 1995]).

Figure 1: Examples of (a) MCAR, (b) MAR and (c) MNAR models

While there has been some recent work ([Daniel et al., 2012; Mohan et al., 2013; Mohan and Pearl, 2014a; Mohan and Pearl, 2018; Thoemmes and Mohan, 2015; Shpitser et al., 2015]) on estimation from MNAR data with discrete variables, to the best of our knowledge there exists no theoretically sound and empirically efficient graph-based procedure for handling MNAR missingness in datasets with continuous variables. In this paper we focus on MNAR problems in linear systems. As in [Mohan and Pearl, 2014b] and [Mohan and Pearl, 2018], we treat missing data as a causal inference problem. We present sound, graph-based procedures for recovering (i.e. consistently estimating) parameters such as mean, variance and covariance. In the following section we review missingness graphs and structural equation models.

2 Preliminaries

2.1 Missingness Graphs: Notations and Terminology

Let G(V, E) be the causal DAG where V = Vo ∪ Vm ∪ U ∪ V* ∪ R. Nodes in the graph correspond to variables in the data set and E is the set of edges in the DAG.
Vo is the set of variables that are observed in all records in the population and Vm is the set of variables that have missing values in at least one record. A variable X is termed fully observed if X ∈ Vo and partially observed if X ∈ Vm. Every partially observed variable Vi is associated with two other variables: a proxy variable V*i, which is what is actually observed, and Rvi, which represents the status of the causal mechanism responsible for the missingness of Vi; formally,

   V*i = f(Rvi, Vi) = Vi if Rvi = 0, and m if Rvi = 1        (1)

V* is the set of all proxy variables and R is the set of all missingness mechanisms. U is the set of unobserved nodes (or latent variables). We use bi-directed edges as shorthand notation to denote the existence of a U variable that is a common parent of two variables in Vo ∪ Vm ∪ R. For any W ∈ Vo, Rw = ∅, i.e. fully observed variables have no associated missingness mechanism. Unless stated otherwise, it is assumed that no variable in Vo ∪ Vm ∪ U is a child of an R variable. This graphical representation is called a Missingness Graph (or m-graph) ([Mohan et al., 2013]).

An m-graph portrays Missing Completely At Random (MCAR) missingness if (Vm, Vo, U) ⊥⊥ R, Missing At Random (MAR) if (Vm, U) ⊥⊥ R | Vo, and Missing Not At Random (MNAR) if neither MCAR nor MAR holds. Figure 1 (a), (b) and (c) exemplify MCAR, MAR and MNAR missingness, respectively. Proxy variables may not always be explicitly shown in m-graphs in order to keep the figures simple and clear. Conditional independencies are read off the graph using the d-separation criterion [Pearl, 2009] (for a quick introduction to d-separation see http://www.dagitty.net/learn/dsep/index.html). Before formally defining the linear missingness model, we briefly review Structural Equation Models. For a detailed discussion see [Pearl, 2009] (chapter 5) and [Brito, 2004].

2.2 Structural Equation Models

A structural equation model (SEM) is a system of equations defined over a set of variables, such that each variable appears on the left hand side of at most one equation. Each equation describes the dependence of one variable on the others and contains an error term to account for the influence of unobserved factors. Example: X = ϵx and Y = αX + ϵy. As in [Pearl, 2013], we interpret structural equations as an assignment process whose directionality is captured by a path diagram (see Figure 2). In this work all substantive variables (Vm ∪ Vo ∪ U) and error terms are assumed to be drawn from Gaussian distributions. Linear structural equation modeling is widely used for estimating parameters of interest given data that are missing at random ([Allison, 2003; Graham, 2003; Ullman and Bentler, 2003; Enders, 2006; Schlomer et al., 2010]).

2.3 Recoverability

Definition 1 (Recoverability of target quantity Q ([Mohan et al., 2013])). Let A denote the set of assumptions about the data generation process and let Q be any functional of the underlying distribution P(Vm, Vo, R). Q is recoverable if there exists an algorithm that computes a consistent estimate of Q for all strictly positive manifest distributions P(V*, Vo, R) that may be generated under A.

We present an example of recoverability in linear models in section 3.1. For examples of recoverability in nonparametric models with discrete variables, please see [Mohan et al., 2013].
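To make the notation concrete, the sketch below (ours, not from the paper) simulates incomplete data from an m-graph of the kind shown in Figure 1(b), assuming for concreteness the structure X → Y → Rx; the masking of X through its proxy X* follows equation (1). The parameter values, the logistic form of the missingness mechanism, and the use of Python/NumPy are all illustrative assumptions.

```python
# A minimal sketch (not from the paper): simulating incomplete data from the
# MAR m-graph of Figure 1(b), assumed here to be X -> Y -> Rx, to make the
# proxy mechanism of equation (1) concrete. All parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Substantive (linear-Gaussian) part of the model.
X = rng.normal(0.0, 1.0, n)                 # X = eps_x
Y = 2.0 * X + rng.normal(0.0, 1.0, n)       # Y = alpha*X + eps_y, alpha = 2

# Missingness mechanism: Rx is a (non-linear) function of the fully observed Y,
# so X _||_ Rx | Y holds and the data are MAR.
p_miss = 1.0 / (1.0 + np.exp(-(Y - 1.0)))   # logistic in Y (illustrative choice)
Rx = rng.binomial(1, p_miss)                # Rx = 1  <=>  X is missing

# Proxy variable X* per equation (1): observed value when Rx = 0, masked otherwise.
X_star = np.where(Rx == 0, X, np.nan)

print(f"fraction missing: {np.mean(Rx):.3f}")
print(f"naive (complete-case) mean of X: {np.nanmean(X_star):+.3f}  vs true 0")
```

The complete-case mean of X is biased here because missingness depends on Y, which is correlated with X; section 3.1 shows how the graph licenses a consistent estimate instead.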
3 Quasi-linear Missingness Model

The causal missingness mechanism is a binary variable and as such the function generating it cannot be linear. The quasi-linear model defined below captures the missingness process.

Figure 2: Quasi-linear missingness model and the equations constituting the corresponding SEM:
(i) X1 = ϵx1
(ii) Y = αX1 + ϵy
(iii) X2 = γY + δX1 + ϵx2
(iv) Ry = f(Y, ϵry)
(v) Y* = (1 − Ry)Y + m·Ry

Definition 2. A Quasi-linear Missingness Model is a Structural Equation Model such that:
1. Every substantive variable X is a linear function of its causes Y1, ..., Yn and a random error term ϵx:
   X = α1 Y1 + α2 Y2 + ... + αn Yn + ϵx.
   The coefficients αi are called path coefficients or structural parameters.
2. For every Rx ∈ R, Rx = f(Z, ϵRx), where Z is the set of causes of Rx and f is a non-linear function. No R variable is a parent of any substantive variable.
3. Every proxy variable X* is generated by the non-linear function X* = (1 − Rx)X + m·Rx.

Figure 2 and equations (i) to (v) exemplify an m-graph and its corresponding quasi-linear missingness model. Path coefficients can be identified in linear models by applying criteria such as single-door and back-door (see [Pearl, 2009], chapter 5). In the following lemma we rephrase a basic result in linear path analysis that links covariance with path coefficients ([Pearl, 2013; Pearl, 2009; Brito, 2004; Wright, 1921]).

Lemma 1. Let G be an m-graph with k unblocked paths p1, ..., pk between X and Y. Let Api be the ancestor of all nodes on path pi and let npi be the number of nodes on pi. Then

   cov(X, Y) = Σ_{i=1}^{k} var(Api) · Π_{j=1}^{npi − 1} α_{pi,j}        (2)

where Π_j α_{pi,j} is the product of all causal parameters (path coefficients) on path pi. For example, in Figure 2 there exist two unblocked paths between X1 and X2, namely X1 → Y → X2 and X1 → X2. On applying lemma 1 we get cov(X1, X2) = αγ·var(X1) + δ·var(X1).

In the following subsection we exemplify a novel path-analytic procedure for consistent estimation of the covariance matrix given MAR data.

3.1 Recoverability of Covariance Matrix: An Example

For any target quantity Q, we use Q|RX=0 to denote: compute Q from samples in which all variables in X are observed.

Example 1. Consider the problem of estimating the covariance matrix given the MAR model of Figure 1(b). Y is fully observed and hence var(Y) is trivially recoverable. In order to recover cov(X, Y), we first recover βXY, the regression coefficient of X on Y. Since X ⊥⊥ Rx | Y, we have the license to compute βXY (using OLS) from samples in which X is observed, i.e. βXY = βXY|Rx=0. cov(X, Y) can now be recovered as:

   cov(X, Y) = βXY|Rx=0 · var(Y)        (3)

To recover var(X), we apply the law of total variance:

   var(X) = E(var(X|Y)) + var(E(X|Y))

Recovering var(E(X|Y)): E(X|Y) = βXY·Y + cx, where βXY and cx denote the slope and intercept of the regression line of X on Y. Since X ⊥⊥ Rx | Y, E(X|Y) = E(X|Y)|Rx=0. Therefore,

   var(E(X|Y)) = var(βXY|Rx=0 · Y + cx|Rx=0) = (βXY|Rx=0)² · var(Y)        (4)

Recovering E(var(X|Y)): The variance of a conditional Gaussian distribution is constant; therefore E(var(X|Y)) = var(X|Y). The variance of a conditional bivariate Gaussian distribution is given by var(X|Y) = var(X)(1 − ρ²), where ρ = cov(X, Y)/(var(X)·var(Y))^(1/2) denotes the correlation coefficient. Since X ⊥⊥ Rx | Y, var(X|Y) = var(X|Y)|Rx=0. Hence,

   var(X|Y) = var(X)|Rx=0 · (1 − (ρ|Rx=0)²)        (5)

Using (4) and (5), var(X) is computed as

   var(X) = var(X)|Rx=0 · (1 − (ρ|Rx=0)²) + (βXY|Rx=0)² · var(Y)

Note that while all factors in the preceding estimand except var(Y) are estimated from samples in which X is observed, var(Y) is recovered from all samples in the dataset, including those in which X is missing. In other words, although a consistent estimate of var(X) cannot be computed directly from the fully observed cases (i.e. from P(X*, Y, Rx = 0)) alone, it can be recovered by a procedure in which each factor in the estimand is estimated from a suitable subset of the available data. We further note that, as a consequence of recovering var(X), βYX, the causal effect of X on Y, can be recovered as

   βYX = cov(X, Y)/var(X) = βXY|Rx=0 · var(Y) / [var(X)|Rx=0 · (1 − (ρ|Rx=0)²) + (βXY|Rx=0)² · var(Y)]
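The following minimal sketch (ours, not the authors') checks estimands (3)-(5) numerically on data simulated from an assumed X → Y → Rx model; the generating coefficients and the missingness function are arbitrary choices.

```python
# A minimal numerical sketch of Example 1 (not from the paper): recovering
# cov(X, Y) and var(X) from MAR data via estimands (3)-(5). The generating
# model X -> Y -> Rx and all parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(0.0, 1.5, n)                       # true var(X) = 2.25
Y = 0.8 * X + rng.normal(0.0, 1.0, n)
Rx = rng.binomial(1, 1 / (1 + np.exp(-1.5 * Y)))  # missingness of X driven by Y
obs = Rx == 0                                     # rows where X is observed

var_Y = Y.var()                                   # recoverable from all rows
# Quantities computed on the Rx = 0 subsample, as licensed by X _||_ Rx | Y.
beta_XY, _ = np.polyfit(Y[obs], X[obs], 1)        # slope of the regression of X on Y
rho = np.corrcoef(X[obs], Y[obs])[0, 1]
var_X_cc = X[obs].var()

cov_XY_hat = beta_XY * var_Y                                  # eq. (3)
var_X_hat = var_X_cc * (1 - rho**2) + beta_XY**2 * var_Y      # eqs. (4)+(5)

print(f"cov(X,Y): recovered {cov_XY_hat:.3f}, complete-case {np.cov(X[obs], Y[obs])[0,1]:.3f}, true {np.cov(X, Y)[0,1]:.3f}")
print(f"var(X):   recovered {var_X_hat:.3f}, complete-case {var_X_cc:.3f}, true {X.var():.3f}")
```

Because selection into the complete cases depends on Y, the complete-case cov(X, Y) and var(X) are biased in this design, while the recovered estimands approach the true values as the sample grows.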
The following section presents procedures for computing parameters of interest in quasi-linear models.

4 Recovering Mean, Variance and Covariance

Theorem 1 (Recovering the mean of a partially observed variable). Let {X} ∪ Z ⊆ Vm ∪ Vo, X ∉ Z, |X| = 1. Given m-graph G, E(X) is recoverable if there exists Z = {Z1, Z2, ..., Zn} such that X ⊥⊥ Rx, Rz | Z and E(Zi) is recoverable for all Zi ∈ Z. If recoverable, E(X) is given by

   E(X) = c|Rxz=0 + Σ_{i=1}^{n} αi|Rxz=0 · E(Zi)        (6)

where c and the αi denote the intercept and coefficients of the regression line of X on Z.

Proof. Let X = α1 Z1 + α2 Z2 + ... + αn Zn + c be the regression line of X on Z. Since X ⊥⊥ Rx, Rz | Z, we have E(X|Z) = E(X|Z)|Rxz=0. Hence,

   E(X|Z) = α1|Rxz=0 Z1 + ... + αn|Rxz=0 Zn + c|Rxz=0

However, since E(X) = E(E(X|Z)), we can write

   E(X) = E(α1|Rxz=0 Z1 + ... + αn|Rxz=0 Zn + c|Rxz=0) = c|Rxz=0 + Σ_{i=1}^{n} αi|Rxz=0 · E(Zi)

Theorem 2 (Recovering the covariance of X and Y). Let {X, Y} ∪ Z ⊆ Vm ∪ Vo, {X, Y} ∩ Z = ∅, |X| = 1, |Y| = 1 and |Z| = n. Given m-graph G, cov(Y, X) is recoverable if there exists Z = {Z1, Z2, ..., Zn} such that (i) Y ⊥⊥ Rx, Ry, Rz | X, Z, and (ii) E(X), E(Y), var(X), and, for all i, E(Zi) and cov(X, Zi) are recoverable. If recoverable, cov(Y, X) is given by

   cov(Y, X) = αx|Rxyz=0 (var(X) + E(X)²) + c|Rxyz=0 E(X) + Σ_i αi|Rxyz=0 (cov(X, Zi) + E(X)E(Zi)) − E(X)E(Y)

where c, αx and the αi denote the intercept and coefficients of the regression line of Y on X and Z.

Proof.

   cov(Y, X) = E(XY) − E(Y)E(X) = E(E(XY | X, Z)) − E(Y)E(X) = E(X · E(Y | X, Z)) − E(Y)E(X)

Using Y ⊥⊥ Rx, Ry, Rz | X, Z, we get E(Y | X, Z) = E(Y | X, Z)|Rxyz=0. Therefore,

   E(X · E(Y | X, Z)) = E(X · (αx|Rxyz=0 X + α1|Rxyz=0 Z1 + ... + αn|Rxyz=0 Zn + c|Rxyz=0))
                      = αx|Rxyz=0 E(X²) + c|Rxyz=0 E(X) + Σ_i αi|Rxyz=0 E(X Zi)
                      = αx|Rxyz=0 (var(X) + E(X)²) + c|Rxyz=0 E(X) + Σ_i αi|Rxyz=0 (cov(X, Zi) + E(X)E(Zi))
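As a hedged illustration of how the estimand in equation (6) can be assembled, the sketch below regresses X on Z in the rows where X and Z are observed and plugs in externally recovered means E(Zi). The function name and interface are our own and not part of the paper.

```python
# A minimal sketch (not from the paper) of the estimand in Theorem 1, eq. (6):
# regress X on the separating set Z using only rows where X and Z are observed
# (Rxz = 0), then plug in separately recovered means E(Zi).
import numpy as np

def recover_mean(X_star: np.ndarray, Z_star: np.ndarray, EZ: np.ndarray) -> float:
    """X_star: (n,) proxy of X with np.nan for missing entries.
    Z_star:  (n, k) proxies of the separating set Z (np.nan where missing).
    EZ:      (k,) previously recovered means E(Z1), ..., E(Zk).
    Assumes the graph licenses X _||_ Rx, Rz | Z."""
    complete = ~np.isnan(X_star) & ~np.isnan(Z_star).any(axis=1)  # Rxz = 0 rows
    A = np.column_stack([np.ones(complete.sum()), Z_star[complete]])
    coef, *_ = np.linalg.lstsq(A, X_star[complete], rcond=None)   # [c, a1..ak]
    c, alpha = coef[0], coef[1:]
    return c + alpha @ EZ                                         # eq. (6)
```

Theorem 2's estimand can be assembled in the same fashion from the regression of Y on (X, Z) in the Rxyz = 0 rows, together with the recovered E(X), var(X), E(Zi) and cov(X, Zi).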
A well known and widely used property of Gaussian distributions is that their conditional variances are constant. Let |X| = m and |Y| = n, and let Mxx, Myy and Mxy denote the covariance matrix of X, of Y, and between X and Y, respectively. The variance of the conditional Gaussian distribution f(Y|X) is given by

   var(Y|X) = Myy − Mxyᵀ Mxx⁻¹ Mxy        (7)

Theorem 3 (Recovering the variance of a partially observed variable X). Let {X} ∪ Z ⊆ Vm ∪ Vo, X ∉ Z, |X| = 1 and |Z| = n. Given m-graph G, var(X) is recoverable if there exists Z = {Z1, Z2, ..., Zn} such that X ⊥⊥ Rx, Rz | Z and cov(Zi, Zj) is recoverable for all Zi, Zj ∈ Z. If recoverable, var(X) is given by

   var(X) = (Mxx − Mzxᵀ Mzz⁻¹ Mzx)|Rxz=0 + Σ_i αi²|Rxz=0 var(Zi) + Σ_{i≠j} (αi αj)|Rxz=0 cov(Zi, Zj)

where the αi denote the coefficients of the regression line of X on Z.

Proof. We first apply the law of total variance, var(X) = E(var(X|Z)) + var(E(X|Z)), and then prove recoverability of var(X) by showing that both summands are recoverable.

Recovering E(var(X|Z)): Since X and all Zi ∈ Z are Gaussian variables, var(X|Z) is constant; therefore,

   E(var(X|Z)) = var(X|Z) = var(X|Z)|Rxz=0   (since X ⊥⊥ Rx, Rz | Z)
               = (Mxx − Mzxᵀ Mzz⁻¹ Mzx)|Rxz=0   (using equation (7))

Recovering var(E(X|Z)): Since X ⊥⊥ Rx, Rz | Z, we have E(X|Z) = E(X|Z)|Rxz=0 = cx|Rxz=0 + Σ_i αi|Rxz=0 Zi. Hence,

   var(E(X|Z)) = var(cx|Rxz=0 + Σ_i αi|Rxz=0 Zi) = Σ_i αi²|Rxz=0 var(Zi) + Σ_{i≠j} (αi αj)|Rxz=0 cov(Zi, Zj)

Lemma 2 (Recovering Partial Regression Coefficients). Let {X, Y} ∪ Z ⊆ Vo ∪ Vm. Given m-graph G and missing data D, the partial regression coefficient βXY.Z is recoverable if X ⊥⊥ Rx, Ry, Rz | Y, Z, and is given by βXY.Z|Rxyz=0.

Notice that recoverability of E(X) using theorem 1 is contingent upon the recoverability of E(Zi) for all i. Clearly, to recover E(X) we should first recover all the E(Zi). The same holds for theorems 2 and 3. For recoverability of conditional distributions in datasets with discrete variables, [Mohan et al., 2013] defined the notion of an ordered factorization to sequence the recoverability procedure. In the following theorem we adapt this idea to formulate a sufficient condition for recovering the mean.

Theorem 4. Let Y ∪ {Z} ⊆ Vm ∪ Vo and let O: Y1 < Y2 < ... < Yk = Z be an ordered set of all variables in Y ∪ {Z}. Let Xi ⊆ {Y1, ..., Yi−1}, 1 ≤ i ≤ k. Given an m-graph G, a sufficient condition for recovering E(Z) is that there exist an ordering O and sets Xi for every i such that Yi ⊥⊥ (RYi, RXi) | Xi.

Proof. Recoverability follows from theorem 1: we first recover E(Y1) and then recover E(Y2), ..., E(Yk) sequentially in the order dictated by O.

Example 2. Consider the problem of recovering E(X) and E(Y) given the m-graph G of Figure 1(c) and the ordering O: Y < X. The ordering directs us to first recover E(Y) and then recover E(X). Since Y ⊥⊥ Ry in G, we can apply theorem 1 to recover E(Y) as E(Y)|Ry=0. Since X ⊥⊥ Rx, Ry | Y and E(Y) is recoverable, we can apply theorem 1 to recover E(X) as c|Rxy=0 + α1|Rxy=0 · E(Y)|Ry=0.

Ordered recoverability procedures can be extended in a similar manner to recover covariance and variance as well. In the case of MCAR data, all orderings of the variables guarantee recoverability, whereas in the case of MAR data, orderings in which all fully observed variables precede the partially observed variables guarantee recoverability. Finally, in the case of MNAR data, the ordering is determined by the graph structure; heuristics for finding admissible orders are discussed in [Mohan et al., 2013].
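Below is a minimal sketch (ours) of the ordered recovery in Example 2, under the assumption that Figure 1(c) has the structure Y → X, Y → Rx with Ry exogenous, which satisfies Y ⊥⊥ Ry and X ⊥⊥ Rx, Ry | Y; the parameter values are arbitrary.

```python
# A minimal sketch (not from the paper) of the ordered recovery in Example 2.
# Assumed concrete structure for Figure 1(c): Y -> X, Y -> Rx, Ry exogenous.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Y = 1.0 + rng.normal(0.0, 1.0, n)                     # true E(Y) = 1
X = 0.5 * Y + rng.normal(0.0, 1.0, n)                 # true E(X) = 0.5
Ry = rng.binomial(1, 0.3, n)                          # exogenous missingness of Y
Rx = rng.binomial(1, 1 / (1 + np.exp(-Y)))            # MNAR: Rx driven by partially observed Y
Y_star = np.where(Ry == 0, Y, np.nan)
X_star = np.where(Rx == 0, X, np.nan)

# Step 1 (ordering O: Y < X): E(Y) is recovered as E(Y)|Ry=0.
EY = np.nanmean(Y_star)

# Step 2: regress X on Y in rows with Rx = 0 and Ry = 0, then plug in E(Y).
both = (Rx == 0) & (Ry == 0)
alpha1, c = np.polyfit(Y_star[both], X_star[both], 1)  # slope, intercept of X on Y
EX = c + alpha1 * EY                                    # theorem 1 estimand

print(f"E(Y): recovered {EY:+.3f} (true 1.0);  E(X): recovered {EX:+.3f} (true 0.5), "
      f"complete-case {np.nanmean(X_star):+.3f}")
```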
5 Extending Recoverability to Complex MNAR Models

In this section we focus on m-graphs in which the conditional independencies between a variable and its missingness mechanism (such as Y ⊥⊥ Ry and Yi ⊥⊥ (RYi, RXi) | Xi) required by theorems 1, 2, 3 and 4 do not hold in the graph. In the following subsection we present a theorem for recovering E(Y) for any Y ∈ Vm by leveraging X, a neighbor of Y.

5.1 Recovering E(Y) When Y and Ry are Not Separable

Theorem 5 (Recovering E(Y) for any Y ∈ Vm). Let X ∈ Vm ∪ Vo be a neighbor of Y. Given m-graph G and missing data D, E(Y) is recoverable if E(X) is recoverable and there exists Z ⊆ (Vm ∪ Vo) \ {X, Y} such that X ⊥⊥ Rx, Ry, Rz | Y, Z and E(Zi) is recoverable for all i. If recoverable, E(Y) is given by

   E(Y) = (E(X) − cx|Rxyz=0 − Σ_{i=1}^{k} αi|Rxyz=0 E(Zi)) / αy|Rxyz=0        (8)

where the α's and c denote the coefficients and intercept of the regression line of X on Z and Y.

Proof: see Appendix A.

Example 3. To recover E(Y) given the m-graph in Figure 2, we leverage X1. X1 ⊥⊥ Ry | Y and E(X1) is recoverable since X1 ∈ Vo. Hence the mean of Y can be recovered using theorem 5 as

   E(Y) = (E(X1) − c|Ry=0) / βX1,Y|Ry=0

where βX1,Y and c are the slope and intercept of the regression line of X1 on Y.

5.2 Recovering var(Y) When Y and Ry are Not Separable

Algorithm 1 recovers the variance of a variable Y given an m-graph in which X1 is a parent of Y and X2 is a child of Y. It uses two subroutines (outlined in Appendix B) that compute a path coefficient and a covariance using lemma 1. If a quantity of interest Q is recoverable, then Q* is used as shorthand for the recovered estimand; for example, if var(Y) is recoverable, then var(Y)* denotes the recovered estimand.

Figure 3: (a) Self-masking model in which all variables are partially observed, yet E(Y) and var(Y) are recoverable. (b) m-graph constructed from (a) by treating Z, W, Rz and Rw as latent.

Algorithm 1 Recover_Variance(Y, G, X1, X2)
Input: Y: variable whose variance is to be recovered; G: Markovian m-graph in which X1 is a parent of Y and X2 is a child of Y
Output: var(Y)* if var(Y) is recoverable, NULL otherwise
1: if var(Y) is recoverable using theorem 3 then
2:     recover the estimand var(Y)*
3:     return var(Y)*
4: if βX1,Y is recoverable by lemma 2 then
5:     recover the estimand β*X1,Y
6: else return NULL
7: α*y ← Recover_αy(G, Y, X1, X2)
8: if α*y == NULL then return NULL
9: cov*(X1, Y) ← Recover_cov(G, Y, X1, α*y)
10: if cov*(X1, Y) == NULL then return NULL
11: return cov*(X1, Y) / β*X1,Y

We exemplify the recovery procedure using algorithm 1 below.

Example 4. Consider the problem of recovering the variance of Y given the m-graph in Figure 2. We show that var(Y) is recoverable using algorithm 1.

Steps 1-3: Since Y and Ry are neighbors, theorem 3 is not applicable. Hence we proceed to the next step.

Steps 4-6: Since X1 ⊥⊥ Ry | Y, βX1,Y is recoverable using lemma 2, i.e. β*X1,Y = βX1,Y|Ry=0.

Steps 7-8: We invoke subroutine Recover_αy to recover the path coefficient α in Figure 2. Since X1 and X2 are fully observed, cov(X1, X2) is recoverable. Path coefficients δ and γ may be recovered as:

   δ = βX2X1.Y (by the single-door criterion, [Pearl, 2009]) = βX2X1.Y|Ry=0 (using lemma 2)
   γ = βX2Y.X1 (by the back-door criterion, [Pearl, 2009]) = βX2Y.X1|Ry=0 (using lemma 2)

Since X1 is fully observed, var(X1) is recoverable. On applying lemma 1 we get cov(X2, X1) = γα·var(X1) + δ·var(X1). Therefore,

   α* = (1 / βX2Y.X1|Ry=0) · (cov(X2, X1)/var(X1) − βX2X1.Y|Ry=0)

Steps 9-10: We invoke subroutine Recover_cov to recover cov(X1, Y). On applying lemma 1 to the single unblocked path X1 → Y we get cov(X1, Y) = α·var(X1); hence cov*(X1, Y) = α*·var(X1).

Step 11: var(Y) is recovered as:

   var(Y)* = cov*(X1, Y) / β*X1,Y = α*·var(X1) / βX1,Y|Ry=0
           = (var(X1) / βX1,Y|Ry=0) · (1 / βX2Y.X1|Ry=0) · (cov(X2, X1)/var(X1) − βX2X1.Y|Ry=0)

While algorithm 1 currently handles only Markovian graphs, we note that it is possible to extend the algorithm to semi-Markovian models provided we make additional assumptions (pertaining to the variances of latent variables). We further note that suitable candidates for X1 and X2 are the non-descendants and descendants of Y, respectively. A latent projection ([Pearl, 2009], chapter 2) of the input graph, constructed by treating all intermediate nodes on unblocked paths between X1 and Y, and between X2 and Y, as latent, can yield graphs compatible with the requirements of algorithm 1. When the latent projection does not introduce bi-directed edges, recovery is straightforward. We exemplify this in section 6.
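The following sketch (ours, not the authors') runs Examples 3 and 4 numerically on the Figure 2 model X1 → Y → X2, X1 → X2, Y → Ry; the coefficients and the logistic missingness function are illustrative assumptions.

```python
# A minimal numerical sketch (not from the paper) of Examples 3 and 4 on the
# Figure 2 model: X1 -> Y -> X2, X1 -> X2, Y -> Ry. Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
alpha, gamma, delta = 0.7, 1.2, -0.4
X1 = 2.0 + rng.normal(0.0, 1.0, n)
Y = alpha * X1 + rng.normal(0.0, 1.0, n)
X2 = gamma * Y + delta * X1 + rng.normal(0.0, 1.0, n)
Ry = rng.binomial(1, 1 / (1 + np.exp(-(Y - 1.5))))        # Y drives its own missingness
obs = Ry == 0                                              # rows where Y is observed

# Example 3: E(Y) from the regression of X1 on Y in the Ry = 0 rows (theorem 5).
b_x1y, c = np.polyfit(Y[obs], X1[obs], 1)
EY_hat = (X1.mean() - c) / b_x1y                           # eq. (8) with Z empty

# Example 4 / Algorithm 1: recover the path coefficient alpha, then var(Y).
A = np.column_stack([np.ones(obs.sum()), X1[obs], Y[obs]])
_, delta_hat, gamma_hat = np.linalg.lstsq(A, X2[obs], rcond=None)[0]   # X2 ~ X1 + Y
alpha_hat = (np.cov(X1, X2)[0, 1] / X1.var() - delta_hat) / gamma_hat  # lemma 1
varY_hat = alpha_hat * X1.var() / b_x1y                    # step 11 of algorithm 1

print(f"E(Y): recovered {EY_hat:.3f}, true {Y.mean():.3f}, complete-case {Y[obs].mean():.3f}")
print(f"var(Y): recovered {varY_hat:.3f}, true {Y.var():.3f}, complete-case {Y[obs].var():.3f}")
```

The complete-case mean and variance of Y are biased here because Y drives its own missingness, whereas the recovered quantities are consistent.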
6 Empirical Evaluation

We denominate the graph-based recovery procedure presented in this paper Model-Based Estimation (MBE) and evaluate it by simulating partially observed datasets from missingness graphs and estimating their parameters from the incomplete data. We compare our estimates against those yielded by state-of-the-art packages for SEM that apply Multiple Imputation (MI; using the mice package in R) and full information maximum likelihood (FIML; using lavaan in R) [Schminkey et al., 2016; Enders, 2006]. Parameters are evaluated in terms of mean squared error and KL divergence between the original and learned distributions.

Missing At Random: We generate data according to the following model and evaluate the performance of MBE, MI and FIML in terms of mean squared error (MSE) and the time taken to compute the mean of X:

   X = ϵx,   Yi = αi X + ϵyi (1 ≤ i ≤ k),   Rx = f(Y1, Y2, ..., Yk, ϵrx)

The first experiment measures the mean squared error of the three estimators, MI, FIML and MBE, and studies how it varies with increasing sample size for the problem of recovering E(X) when |Y| = k = 5. The results of this experiment are plotted in Figure 4. We further estimated the average time each procedure took to recover E(X) and plotted time vs. sample size in Figure 5. Figure 4 shows that all three methods (MBE, MI, FIML) perform almost identically as far as the quality of the estimate of E(X) is concerned. From Figure 5 it is clear that MBE (a non-iterative procedure) takes less time to compute E(X) than MI and FIML. While all the reported numbers (MSE and time) are averaged over 50 repetitions with different random estimation problems, for a given problem each individual MSE was computed from 500 simulations.

Figure 4: MSE of E(X) vs. sample size for MAR: X ⊥⊥ Rx | Y, when |Y| = 5
Figure 5: Time to recover E(X) vs. sample size for MAR: X ⊥⊥ Rx | Y, when |Y| = 5

In the next experiment we study the efficiency of these procedures as the complexity of the model increases. In MAR models, the complexity of recoverability depends on the size of the separating set Y that d-separates X from Rx (and on the size of the dataset). This experiment was conducted with the sample size fixed at 100,000. We observe in Figure 6 that as the separating set becomes larger, the time taken to recover estimates also increases. Note that in this case the time gain pertains to computing the parameters of a single partially observed variable X; in a real-world dataset with several partially observed variables, the time savings offered by MBE will be substantial.

Figure 6: Time to recover E(X) vs. |Y| for MAR: X ⊥⊥ Rx | Y

Missing Not At Random: The goal here is to evaluate the effectiveness of algorithm 1 using data generated by the self-masking model in Figure 3(a). βWY and βZY are not recoverable by lemma 2; hence Z and W cannot be used for recovering var(Y). However, βX1Y is recoverable. We can create a latent projection by treating Z, W, Rz and Rw as latent variables, as shown in Figure 3(b). Figure 7 depicts the KL divergence between the true and estimated distributions of Y. As expected, FIML and MI behave unpredictably, while MBE, which attains the smallest KL divergence, behaves as desired.

Figure 7: KL divergence vs. sample size for the self-masking model
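For reference, here is a compact sketch (an assumed Python re-implementation, ours; the paper's experiments use R packages) of the MAR design from the first experiment above with k = 5, contrasting the closed-form MBE estimand of theorem 1 with the complete-case mean.

```python
# A sketch (assumed re-implementation, not the paper's R code) of the MAR design:
# X's missingness is driven by the fully observed Y1..Yk, and E(X) is recovered
# via the closed-form estimand of theorem 1 rather than by iterative imputation.
import numpy as np

rng = np.random.default_rng(4)
n, k = 100_000, 5
alphas = rng.uniform(0.5, 1.5, k)
X = 1.0 + rng.normal(0.0, 1.0, n)                          # true E(X) = 1
Y = X[:, None] * alphas + rng.normal(0.0, 1.0, (n, k))     # Yi = alpha_i * X + eps
Rx = rng.binomial(1, 1 / (1 + np.exp(-Y.sum(axis=1) / k))) # Rx = f(Y1..Yk)
obs = Rx == 0

A = np.column_stack([np.ones(obs.sum()), Y[obs]])          # regress X on Y (Rx = 0 rows)
coef = np.linalg.lstsq(A, X[obs], rcond=None)[0]
EX_mbe = coef[0] + coef[1:] @ Y.mean(axis=0)               # eq. (6), E(Yi) from all rows

print(f"E(X): MBE {EX_mbe:.3f}, complete-case {X[obs].mean():.3f}, true 1.0")
```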
7 Conclusions

We presented novel graph-based procedures, non-iterative and independent of the likelihood function, for recovering parameters in quasi-linear missingness models. We further developed procedures for recovering parameters in self-masking models. Finally, we showed that given MAR data our techniques are much faster than state-of-the-art procedures, and that given MNAR data our techniques can recover parameters where existing techniques fail.

A Proof of Theorem 5

   E(X) = E(E(X|Z, Y)) = E(E(X|Z, Y)|Rxyz=0) = αy|Rxyz=0 E(Y) + Σ_{i=1}^{k} αi|Rxyz=0 E(Zi) + cx|Rxyz=0

Therefore,

   E(Y) = (E(X) − cx|Rxyz=0 − Σ_{i=1}^{k} αi|Rxyz=0 E(Zi)) / αy|Rxyz=0

B Subroutines Used in Algorithm 1

In this appendix we outline the subroutines Recover_αy and Recover_cov invoked by algorithm 1. Recoverability of the queries appearing in these subroutines is determined using the results in section 4. Note that in this work we only deal with recoverability of statistical parameters; path coefficients (which are causal parameters) are considered recoverable if (i) they are identifiable and (ii) all factors in the (identified) estimand are recoverable.

Recover_αy:
Input: G: m-graph in which X1 is a parent of Y and X2 is a child of Y; αy: the path coefficient of the edge X1 → Y that is to be recovered.
Output: α*y if αy is recoverable, NULL otherwise.
This routine applies lemma 1 to X1 and X2 to obtain cov(X1, X2) = Σ_{i=1}^{k} var(Api) Π_j α_{pi,j}. If cov(X1, X2), the var(Api) and all path coefficients on the right-hand side other than αy are recoverable, their recovered estimands are substituted into the equation above, which is then solved to recover and return the estimand α*y.

Recover_cov:
Input: G: m-graph in which X1 is a parent of Y and X2 is a child of Y; α*y: the recovered estimand of αy, the path coefficient of the edge X1 → Y.
Output: estimand for cov(X1, Y) if it is recoverable, NULL otherwise.
This routine applies lemma 1 to X1 and Y and returns the estimand for recovering cov(X1, Y) if all variances and path coefficients on the RHS of equation (2) are recoverable.

References

[Adams, 2007] Jon Adams. Researching Complementary and Alternative Medicine. Routledge, 2007.
[Allison, 2003] Paul D. Allison. Missing data techniques for structural equation modeling. Journal of Abnormal Psychology, 112(4):545, 2003.
[Brito, 2004] Carlos Brito. Graphical Models for Identification in Structural Equation Models. PhD thesis, University of California Los Angeles, 2004.
[Daniel et al., 2012] Rhian M. Daniel, Michael G. Kenward, Simon N. Cousens, and Bianca L. De Stavola. Using causal diagrams to guide analysis in missing data problems. Statistical Methods in Medical Research, 21(3):243–256, 2012.
[Darwiche, 2009] Adnan Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.
[Enders, 2006] Craig K. Enders. Analyzing structural equation models with missing data. In Gregory R. Hancock and Ralph O. Mueller, editors, Structural Equation Modeling: A Second Course, pages 315–344. Information Age Publishing, Inc., 2006.
[Graham, 2003] John W. Graham. Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10(1):80–100, 2003.
[Greenland and Finkle, 1995] Sander Greenland and William D. Finkle. A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology, 142(12):1255–1264, 1995.
[Koller and Friedman, 2009] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
[Little, 1995] Roderick J. A. Little. Modeling the dropout mechanism in repeated-measures studies. Journal of the American Statistical Association, 90(431):1112–1121, 1995.
[Mohan and Pearl, 2014a] Karthika Mohan and Judea Pearl. Graphical models for recovering probabilistic and causal queries from missing data. In C. Cortes, M. Welling, Z. Ghahramani and N. Lawrence, editors, Advances in Neural Information Processing Systems 27, pages 1520–1528. Curran Associates, 2014.
[Mohan and Pearl, 2014b] Karthika Mohan and Judea Pearl. On the testability of models with missing data. In S. Kaski and J. Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 33, pages 643–650, 2014.
[Mohan and Pearl, 2018] Karthika Mohan and Judea Pearl. Graphical models for processing missing data. arXiv preprint arXiv:1801.03583, 2018.
[Mohan et al., 2013] Karthika Mohan, Judea Pearl, and Jin Tian. Graphical models for inference with missing data. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1277–1285. Curran Associates, 2013.
[Osborne, 2014] Jason W. Osborne. Best Practices in Logistic Regression. SAGE Publications, 2014.
[Pearl, 2009] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, 2nd edition, 2009.
[Pearl, 2013] Judea Pearl. Linear models: A useful microscope for causal analysis. Journal of Causal Inference, 1(1):155–170, 2013.
[Rubin, 1976] Donald B. Rubin. Inference and missing data. Biometrika, 63:581–592, 1976.
[Schlomer et al., 2010] Gabriel L. Schlomer, Sheri Bauman, and Noel A. Card. Best practices for missing data management in counseling psychology. Journal of Counseling Psychology, 57(1):1, 2010.
[Schminkey et al., 2016] Donna L. Schminkey, Timo von Oertzen, and Linda Bullock. Handling missing data with multilevel structural equation modeling and full information maximum likelihood techniques. Research in Nursing & Health, 39(4):286–297, 2016.
[Shpitser et al., 2015] Ilya Shpitser, Karthika Mohan, and Judea Pearl. Missing data as a causal and probabilistic problem. In M. Meila and T. Heskes, editors, Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 802–811. AUAI Press, 2015.
[Sverdlov, 2015] Oleksandr Sverdlov. Modern Adaptive Randomized Clinical Trials: Statistical and Practical Aspects. Chapman and Hall/CRC, 2015.
[Thoemmes and Mohan, 2015] Felix Thoemmes and Karthika Mohan. Graphical representation of missing data problems. Structural Equation Modeling: A Multidisciplinary Journal, 22:631–642, 2015.
[Ullman and Bentler, 2003] Jodie B. Ullman and Peter M. Bentler. Structural Equation Modeling. Wiley Online Library, 2003.
[van Stein and Kowalczyk, 2016] Bas van Stein and Wojtek Kowalczyk. An incremental algorithm for repairing training sets with missing values. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 175–186. Springer, 2016.
[Wright, 1921] Sewall Wright. Correlation and causation. Journal of Agricultural Research, 20(7):557–585, 1921.