# Reconsidering Generative Objectives for Counterfactual Reasoning

Danni Lu1,*, Chenyang Tao2,*, Junya Chen2,3, Fan Li4, Feng Guo1,5, Lawrence Carin2
1 Department of Statistics, Virginia Tech, Blacksburg, VA, USA
2 Electrical & Computer Engineering, Duke University, Durham, NC, USA
3 School of Mathematical Sciences, Fudan University, Shanghai, China
4 Department of Statistical Science, Duke University, Durham, NC, USA
5 Virginia Tech Transportation Institute, Blacksburg, VA, USA
ludanni@vt.edu, chenyang.tao@duke.edu
\* Contributed equally. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

There has been recent interest in exploring generative goals for counterfactual reasoning, e.g., individualized treatment effect (ITE) estimation. However, existing solutions often fail to address issues that are unique to causal inference, such as covariate balancing and counterfactual validation. As a step toward more flexible, scalable and accurate ITE estimation, we present a novel generative Bayesian estimation framework that integrates representation learning, adversarial matching and causal estimation. By appealing to the Robinson decomposition, we derive a reformulated variational bound that explicitly targets causal effect estimation rather than specific predictive goals. Our procedure acknowledges the uncertainties in representation and solves a Fenchel mini-max game to resolve the representation imbalance for better counterfactual generalization, justified by new theory. The latent variable formulation enables robustness to unobservable latent confounders, extending the scope of its applicability. The proposed approach is demonstrated via an extensive set of tests against competing solutions, both under various simulation setups and on real-world datasets, with encouraging results reported.

## 1 Introduction

Inferring individualized treatment effects from observational data is a fundamental challenge shared by many decision-making application domains, including healthcare [23], advertising [15], and policy making [44], among others. Recent advances in machine learning have motivated new causal inference methodologies inspired by modern learning perspectives, such as representation learning, adversarial training, etc.

In this work we focus on the problem of causal estimation from observational data, which differs from standard supervised learning in fundamental ways [60]. First, only a partial observation of the potential outcomes, the one corresponding to the assigned intervention, can be made. The lack of counterfactual labels prohibits direct validation of the estimated causal effect (CE). Second, observational studies are susceptible to selection bias due to confounding. In particular, some variables obfuscate causation as they affect both treatment assignment and outcome [81], and they may be latent. Without a proper confounder compensation mechanism, causal estimation can suffer severe bias.

To resolve this difficulty, the classical statistics literature has mainly focused on sample-based adjustment strategies, namely matching and weighting. Matching pairs units that are similar with respect to particular matching criteria [74], forming the basic elements of synthetic randomized trials; weighting reassigns importance weights to each sample unit to create a pseudo-population of better balance [29, 46, 47].
Both approaches typically invoke the unconfoundedness assumption [65], i.e., that there are no latent variables affecting both the outcome and the treatment assignment. To guard against failures induced by model mis-specification [63], balancing weights are often used in conjunction with outcome regression models to achieve double robustness [72]. However, these classical solutions are constantly challenged by modern datasets, characterized by features such as high dimensionality [12] and complex interactions [88].

More recently, representation learning has emerged as a new, promising alternative for approaching covariate balance [48, 36]. Such schemes explicitly seek an intermediate (low-dimensional) representation that is both (i) predictive of the outcome [82]; and (ii) matched between treatment groups [34]. From a learning perspective, these two points serve to promote the generalization performance of counterfactual predictions [75]. On the flip side, causal perspectives also motivate invariant feature representation learning under general machine learning setups [7].

Recent strides in generative modeling techniques, such as the variational auto-encoder (VAE) [39] and the generative adversarial network (GAN) [24], have equipped causal estimation with new learning principles. Rather than appealing to predictive goals [82], these schemes learn stochastic rules that mimic the data generating procedure, i.e., how to synthesize realistic counterfactuals based on observed data [87]. Such generative causal models typically relax the model assumptions posited by standard causal estimation machinery, allowing black-box type inference using flexible learners such as deep networks. Despite their reported strong empirical performance, questions remain: (i) Confounding: Do we fully trust the observed confounders? (ii) Balancing: What if the covariates are unbalanced? (iii) Counterfactual validation: How do we avoid over-fitting?

Notably, in-depth discussion of (iii), the causal validation procedure, has appeared in the literature only recently, despite its paramount importance [8, 83]. The promise of a fully automated causal estimation procedure has inspired many (unreliable) heuristic proxies [73] (e.g., plug-in surrogates or predictive loss), and principled evaluation strategies have emerged only quite recently. While scholarly consensus on best practice is yet to be reached [20], prominent examples from this category include influence-function-based causal validation [3] and rank-preserving causal cross-validation [71]. Of particular interest is the Robinson residual decomposition employed by the R-learner [56] and generalized random forests [10], which constructs a directly learnable objective.

Motivated by the preceding discussion, this work seeks a unified treatment that accommodates (i)-(iii). We revisit the generative perspective of causal modeling, and demonstrate how explicitly accounting for balancing and counterfactual validation helps to improve causal estimation. In particular, we present a variational procedure, termed Balancing Variational Neural Inference of Causal Effects (BV-NICE), to address the challenges of generative learning for causal estimation.
Our key contributions include: (i) repurposing variational inference as a random feature representation learning scheme to facilitate causal estimation; (ii) reformulating the variational objective to better balance confounder representations between comparison groups; (iii) incorporating causal validation targets to scrutinize the inferred causal effect. Our approach features direct modeling of the causal effect, rather than of the difference between outcome models. It joins strengths from distribution matching, representation learning and generative causal estimation, resulting in a principled attempt to better address the challenges of counterfactual inference. To embrace a more holistic picture, we also cover related issues such as identifiability and establish broader connections to the literature on causal discovery, with extended discussions found in our supplementary material (SM).

## 2 Preliminaries

**Problem setup** We consider the basic setup under the potential outcome framework [69, 33]. Assume a sample of $n$ units, with unit $i$ associated with a covariate $X_i \in \mathbb{R}^p$, a treatment indicator $T_i \in \{0, 1\}$ and potential outcomes $[Y_i(0), Y_i(1)] \in \mathbb{R}^2$. The fundamental problem of causal inference [32] is that only the outcome associated with the prescribed treatment is observed, i.e., $Y_i \triangleq Y(T_i) = T_i Y_i(1) + (1 - T_i) Y_i(0)$, known as the factual data. The individualized treatment effect (ITE) is defined as the expected difference between outcomes, $\tau(x) \triangleq \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = x]$, and our goal is to learn a generalizable model $\tau(x)$ that predicts the ITE given observed covariates $x$. We often assume the decomposition $\tau(x) = \mu_1(x) - \mu_0(x)$, where $\mu_t(x) \triangleq \mathbb{E}[Y(t) \mid x]$, $t \in \{0, 1\}$, are known as the outcome models. Another key concept in causal estimation is the propensity score (PS): $e(x) \triangleq p(T = 1 \mid x)$, i.e., the conditional probability of receiving the treatment given $x$. While the identifiability of the causal effect can only be established in the average sense for observational studies, under the assumptions of unconfoundedness, $\{Y(0), Y(1)\} \perp T \mid X$, and positivity, $p(T \mid X, Y(0), Y(1)) \in (0, 1)$ [66], individualized predictions still hold promise. A typical predictive scheme minimizes the prediction loss on the factual observations, i.e., $\hat{\mu} = \arg\min_{\mu} \{\sum_i (Y_i - \mu_{t_i}(X_i))^2\}$. Alternatively, generative schemes seek to identify a data generation procedure $p_\theta(x, t, y)$ that is consistent with the factual observations $\mathcal{D}_n = \{(x_i, y_i, t_i)\}_{i=1}^n$.

**Robinson residual decomposition** Under unconfoundedness, it is easy to verify $\mathbb{E}[\epsilon(T) \mid X, T] = 0$, where $\epsilon(T) \triangleq Y(T) - (\mu_0(X) + T \tau(X))$ is known as the Robinson residual [64]. Denoting the conditional mean outcome as $m(x) \triangleq \mathbb{E}[Y \mid x] = \mu_0(x) + e(x)\tau(x)$, we can rewrite the Robinson residual as $\epsilon(T) = Y(T) - m(X) - (T - e(X))\tau(X)$. Note that this decomposition holds for any outcome distribution, including binary outcomes. This directly motivates the R-learning [56] objective $\hat{\tau} = \arg\min_\tau \{\frac{1}{n}\sum_i (y_i - \hat{m}(x_i) - (t_i - \hat{e}(x_i))\tau(x_i))^2\}$, where $\hat{m}(x)$ and $\hat{e}(x)$ are estimated surrogates for the mean outcome and propensity score models. Recently, many have considered the direct modeling of the CE $\tau$ through the R-decomposition [91, 92, 16, 61, 10], rather than indirectly through $(\mu_0, \mu_1)$.
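To make the R-learning objective concrete, the following is a minimal sketch in PyTorch; it is our own illustration, not taken from the paper's released code. Here `m_hat` and `e_hat` stand for the plug-in surrogates $\hat{m}(x)$ and $\hat{e}(x)$, assumed to be fit beforehand.

```python
import torch

def r_learner_loss(y, t, m_hat, e_hat, tau):
    """Empirical R-learner objective: the mean squared Robinson residual
    y - m_hat(x) - (t - e_hat(x)) * tau(x), minimized over tau only."""
    resid = y - m_hat - (t - e_hat) * tau
    return resid.pow(2).mean()

# Usage sketch: tau_net is any differentiable model of the CE tau(x), e.g.
#   loss = r_learner_loss(y, t, m_hat, e_hat, tau_net(x)); loss.backward()
```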
**Variational inference** A general learning principle is to maximize the expected log-likelihood of the observed data, i.e., $\ell(\theta) := \sum_i \log p_\theta(x_i)$, which constitutes maximum likelihood estimation (MLE). For a latent variable model $p_\theta(x, z)$, we consider $x$ as an observation (i.e., data) and $z$ as a latent variable. The marginal likelihood $p_\theta(x) = \int p_\theta(x, z)\, dz$ typically does not have a closed-form expression, and to avoid direct numerical estimation of $p_\theta(x)$, variational inference (VI) instead optimizes a variational bound on the marginal log-likelihood $\log p_\theta(x)$ [14, 79]. The most popular choice is known as the Evidence Lower Bound (ELBO), given by
$$\mathrm{ELBO} \triangleq \mathbb{E}_{Z \sim q_\phi(z|x)}\left[\log p_\theta(x, Z) - \log q_\phi(Z|x)\right] \le \log p_\theta(x), \quad (1)$$
where $q_\phi(z|x)$ is an approximation to the true posterior $p_\theta(z|x)$ and the inequality is a result of Jensen's inequality. This bound tightens as $q_\phi(z|x)$ approaches the true posterior $p_\theta(z|x)$. For estimation, we seek parameters $\theta$ that maximize the ELBO, and the commensurately learned parameters $\phi$ are often used in a subsequent inference task with new data.

**Adversarial distribution matching** Consider the problem of matching a model distribution $p_G(x)$ to some true data distribution $p_d(x)$, presented as empirical samples, wrt some discrepancy measure $d(p_d, p_G)$. Typically, $p_G(x)$ is given in the form of a stochastic sampler. In the GAN framework, the discrepancy is first estimated by maximizing an auxiliary variational functional $V(p_d, p_G; D): \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ between the distributions $p_d(x)$ and $p_G(x)$ satisfying $d(p_d, p_G) = \max_D V(p_d, p_G; D)$, where $\mathcal{P}$ is the space of probability distributions and $V(p_d, p_G; D)$ is estimated using samples from the two distributions. The function $D(x; \omega)$, parameterized by $\omega$ and known as the critic function, is intended to maximally discriminate between samples of the two distributions. Subsequently, one seeks to match the generator distribution $p_G(x)$ to the unknown true distribution $p_d(x)$ by minimizing the estimated discrepancy, resulting in a minimax game between the critic and the generator: $\min_G \max_D V(p_d, p_G; D)$.

## 3 Balancing VI For Causal Estimation

Inspired by the above, we present BV-NICE, a model that seeks to improve the current practice of generative learning for causal inference from the following perspectives: (a) automated feature representation learning that explicitly accounts for covariate balance; (b) a built-in mechanism for automated model selection that directly targets CE estimation accuracy; (c) acknowledgment of the uncertainty in the observed confounders via the introduction of inferred latent variables. We frame our construction under variational inference based on the following considerations:

- We treat the covariate $x$ as a noisy proxy for the true, unobservable confounders (latent $z$);
- The (approximate) posterior acts as a representation encoder that encapsulates uncertainties;
- Matching to the prior $p(z)$ naturally regularizes for model generalization.

Consider the following latent variable model $p_\theta(x, y, t, z) = p(x|z)\,p(y|z, t)\,p(t|z)\,p(z)$ (Figure S1), where $(x, y, t)$ are the observables, $z$ is the (continuous) latent variable, and $\theta$ denotes the model parameters. In accordance with standard practice, we model discrete variables with multinomial logistic distributions and continuous variables as Gaussian $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ is a function of $z$ (and possibly also $t$, depending on the context), with $\sigma^2$ set to some prescribed value to avoid overfitting. We parameterize stochastic encoders $q_\phi(z|x, y, t)$ to infer the unobserved confounders $z$. For flexible inference, we model all functions with deep neural nets. Plugging into (1) gives us a tractable objective for stochastic optimization (see Equation (4)). We relegate the specifics of our modeling choices to the subsections that follow, after revealing more causal insights embodied in our reformulation.
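For a concrete reference point for Equation (1), below is a minimal reparameterized Gaussian encoder with its closed-form KL to a standard normal prior. This is a simplified stand-in (the paper's preferred encoder is implicit, see Sec 3.3), and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """q_phi(z|x) = N(mu(x), diag(sigma^2(x))), sampled as z = mu + sigma * eps."""
    def __init__(self, x_dim, z_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        # Closed-form KL(q(z|x) || N(0, I)): the KL-loss term of the ELBO
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return z, kl
```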
### 3.1 A unifying view for VI and R-learner

A key feature we seek to incorporate is to automatically favor solutions that more accurately describe the causal effect based on the factual observations. Unlike a model-selection procedure, where candidates are screened in an ad hoc manner, we want our model to explore the parameter space and identify the best candidates for causal descriptives as part of training. This precludes options such as metalearners [43] and influence-function-based estimators [3], as they function as causal estimators and cannot be efficiently trained in an end-to-end manner. We choose to work with the Robinson residual decomposition, and show how the resulting R-learner [56] relates to VI. This implies that our variational framework automatically assumes the model selection property.

It is convenient to denote $\mu_t(z) \triangleq \mu_y(z, t)$, and the causal effect estimator $\tau(z) = \mu_y(z, t=1) - \mu_y(z, t=0)$. Under the R-learning framework, one models the mean outcome $m(z)$ and $\tau(z)$ rather than $(\mu_0(z), \mu_1(z))$. It is easy to see these two modeling choices are related by
$$\begin{cases} m(z) = e(z)\mu_1(z) + (1 - e(z))\mu_0(z) \\ \tau(z) = \mu_1(z) - \mu_0(z) \end{cases} \;\Longleftrightarrow\; \begin{cases} \mu_0(z) = m(z) - e(z)\tau(z) \\ \mu_1(z) = m(z) + (1 - e(z))\tau(z). \end{cases} \quad (2)$$
A key insight is given by the observation
$$\epsilon(z, t, y) = y - m(z) - (t - e(z))\tau(z) = y - \{t\mu_1(z) + (1 - t)\mu_0(z)\}. \quad (3)$$
Note that the RHS is the residual error of the prediction given $(z, t)$. Consequently, $\ell_R(z, y, t) = \epsilon(z, t, y)^2 = -2\sigma^2 \log p_\theta(y|z, t) + \text{const}$. Plugging this result back into the ELBO, and recalling that $p_\theta(t|z)$ is essentially the propensity score model $e(z)$, we obtain the following factorization:
$$\mathrm{ELBO}(x, y, t \mid p_\theta, q_\phi) = \mathbb{E}_{Z \sim q_\phi}\big[\underbrace{\log p_\theta(x|Z)}_{\text{optional}} + \underbrace{\log p_\theta(y|Z, t)}_{\text{R-loss}} + \underbrace{\log p_\theta(t|Z)}_{\text{PS-loss}}\big] - \underbrace{\mathrm{KL}(q_\phi(z|x, y, t)\,\|\,p(z))}_{\text{KL-loss}}, \quad (4)$$
where the last three terms constitute V-NICE. Since our primary goal is to model the causal effect $\tau$, we discard the first term, related to the likelihood of $x$, and treat the rest as our training target, which we term $\ell_{\text{V-NICE}}$. This choice is motivated by the fact that to correctly infer the CE we only need the part of $x$ that is predictive of $(y, t)$ [82]. Modeling $x$ indiscriminately, as practiced by existing generative causal models [50, 68], takes away representation capacity from $z$ [30, 5], compromising our main objective.

Figure 1: BV-NICE model architecture: the ELBO is combined with GAN-style critics over the causal triplet, with $m(z)$ the mean outcome, $e(z)$ the propensity score, and $\tau(z)$ the causal effect.

Intuitively, $\ell_{\text{V-NICE}}$, our reformulated ELBO, is a combination of the R-loss and the propensity score loss, regularized by a KL-divergence on the latents to encourage better generalization. Unlike its generative counterparts, our model is directly parameterized through the causal triplet $(\tau(z), m(z), e(z))$, emphasizing the causal perspective and allowing structural constraints to be imposed [43]. V-NICE also approximately recovers the R-learner as $\sigma^2 \to 0$. By optimizing the triplet jointly, rather than via the two-stage procedure employed by the R-learner, the three components share the refined representation learned. Our discussion also bridges R-learning and likelihood-based learning.
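The reformulated objective lends itself to a compact implementation. Below is a hedged sketch of $\ell_{\text{V-NICE}}$ as a loss (the negated ELBO without the $x$-likelihood), assuming a Gaussian outcome with fixed $\sigma^2$; the shared network with triplet heads and all module names are our own illustration. The `kl` term can come, e.g., from an encoder such as the Gaussian sketch above, or from a critic-based estimate when the encoder is implicit (Sec 3.3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTriplet(nn.Module):
    """Shared body with heads for the causal triplet (m(z), e(z), tau(z))."""
    def __init__(self, z_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU())
        self.m_head = nn.Linear(hidden, 1)    # mean outcome m(z)
        self.e_head = nn.Linear(hidden, 1)    # propensity logit; e(z) = sigmoid(.)
        self.tau_head = nn.Linear(hidden, 1)  # causal effect tau(z)

    def forward(self, z):
        h = self.body(z)
        return (self.m_head(h).squeeze(-1),
                torch.sigmoid(self.e_head(h)).squeeze(-1),
                self.tau_head(h).squeeze(-1))

def v_nice_loss(y, t, z, kl, triplet, sigma2=1.0):
    """-ELBO of Eq. (4) with the x-likelihood dropped:
    R-loss + PS-loss + KL-loss (up to additive constants)."""
    m, e, tau = triplet(z)
    r_loss = (y - m - (t - e) * tau).pow(2) / (2 * sigma2)            # -log p(y|z,t)
    ps_loss = F.binary_cross_entropy(e, t.float(), reduction="none")  # -log p(t|z)
    return (r_loss + ps_loss + kl).mean()
```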
**Benefits of integrating the R-loss.** A major difference in the construction of the R-learner objective, relative to the standard two-learner setup, is that the propensity score is explicitly involved. This allows additional information to be leveraged in many practical settings. For example, a common scenario is that a significant lag can be expected between the application of a treatment and the observation of the outcome (e.g., when the target outcome is the patient's recovery within one year of administering a drug). In such scenarios, data with only the confounders and treatment will be available; these can refine the propensity score estimate, which in turn improves treatment effect estimation in R-learning, but cannot be used for outcome modeling in the two-learner setup. A similar argument holds when additional knowledge is available about the treatment assignment (e.g., when the data is a hybrid of observational and randomized trials). In the same spirit, R-learning also allows the use of data where the treatment information is missing, as they can still be used to improve the estimate of the average outcome $m(x)$.

### 3.2 Balancing VI for causal estimation

Our next goal is to establish a mechanism that enables covariate balance. Further denote $q_t(z) = \int q_\phi(z|x, t)\, p_d(x \mid T = t)\, dx$. To achieve better balance for subsequent causal estimation, one seeks to match the confounder distributions between treatment groups, i.e., $q_1$ should be close to $q_0$. To this end, we augment the original ELBO with a distribution discrepancy score $D(q_0 \| q_1)$, resulting in
$$\ell_{\text{BV-NICE}} \triangleq \ell_{\text{V-NICE}}(p_\theta, q_\phi) - \lambda D(q_0 \| q_1) \quad (5)$$
as our objective for balancing VI (BV-NICE), where $\lambda > 0$ specifies the regularization strength.

**Choice of discrepancy score** While the marginal densities of $q_0$ and $q_1$ are intractable, it is relatively easy to acquire samples from them. This motivates leveraging adversarial distribution matching strategies to (indirectly) optimize the discrepancy through a mini-max game. Hence, we indirectly assess $D(q_0 \| q_1)$ via the use of a critic function (the max step), and then update the model accordingly to reduce the discrepancy (the min step). In this study, we appeal to the KL-divergence as our discrepancy measure, which can be recast in its Fenchel dual form as [18, 80]
$$D_{\mathrm{KL}}(q_0 \| q_1) = \mathbb{E}_{q_0}[\log q_0 - \log q_1] = \max_{\nu > 0}\{\mathbb{E}_{q_0}[\log \nu] - \mathbb{E}_{q_1}[\nu] + 1\}, \quad (6)$$
and note the maximizer satisfies $\nu^* = q_0 / q_1$. This choice is motivated by the following considerations:

- easy implementation relative to integral probability metric (IPM)-based schemes;
- it also bounds generalization performance (Sec 3.3);
- it encourages parameter sharing, as the ELBO already involves a KL term.

Note that this choice is not restrictive, as practitioners are free to choose their favorite distribution matching schemes, such as Wasserstein [75, 6], MMD [25, 48], JSD [24, 87] or other f-divergences [57, 78], that possess other appealing properties. See the SM for a more thorough discussion. To implement KL-matching, we model $\log \nu$ as a deep neural network $\vartheta_\psi(z)$, our critic function, where $\psi$ denotes the network parameters. This gives the following neural estimator for the KL term (dropping the constant for clarity):
$$\hat{D}_{\mathrm{KL}}(q_0 \| q_1) = \max_\psi \{\mathbb{E}_{Z \sim q_0}[\vartheta_\psi(Z)] - \mathbb{E}_{Z' \sim q_1}[\exp(\vartheta_\psi(Z'))]\}. \quad (7)$$
In our case, the distributions are characterized by a neural sampler via the reparameterization trick, e.g., $q_\phi(z|x)$ as $z = G_\phi(\xi, x)$, $\xi \sim p(\xi)$. Gradients of the sampler can be easily obtained by directly differentiating $\hat{D}_{\mathrm{KL}}$ wrt $\phi$.
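The Fenchel-dual estimator (7) reduces to a one-line objective once latent samples from $q_0$ and $q_1$ are available. Below is a sketch with illustrative names, where the critic network plays the role of $\log \nu = \vartheta_\psi$.

```python
import torch
import torch.nn as nn

def fenchel_kl(critic, z0, z1):
    """Dual estimate of KL(q0 || q1), Eq. (7) with the constant dropped:
    E_{q0}[theta(z)] - E_{q1}[exp(theta(z'))]; ascended in the critic
    (max step) and descended through the encoder (min step)."""
    return critic(z0).mean() - torch.exp(critic(z1)).mean()

# z0, z1: latent codes sampled from the control / treated groups, e.g. with
#   critic = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, 1))
```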
### 3.3 Practical implementation

**Random feature encoder** To enable flexible encoding of latent features, we employ a neural sampler $r_\phi(z|x)$. The sampler can either be explicit, with a tractable likelihood [39, 40], or implicit, mapping $x$ and noise to a latent sample, i.e., $z = G_\phi(\xi, x)$, $\xi \sim U([-1, 1]^k)$. We choose the implicit feature encoder as it produces better results.

**Algorithm 1** BV-NICE
Input: empirical data $\hat{p}_d = \{(x_i, y_i, t_i)\}_{i=1}^n$, imbalance weight $\lambda$.
For $k = 1, 2, \ldots$:
1. Sample $(x, y, t) \sim \hat{p}_d$, $z' \sim p(z)$, and $z_\phi = G_\phi(\xi, x)$ with $\xi \sim p(\xi)$.
2. Encoder and balancing: $\phi_{k+1} \leftarrow \phi_k + \nabla_\phi\{\log p_\theta(t, y|z_\phi) - \vartheta_\psi(x, z_\phi) - \lambda[\tilde{\vartheta}_{\tilde{\psi}}(z_\phi^{t=0}) - \exp(\tilde{\vartheta}_{\tilde{\psi}}(z_\phi^{t=1}))]\}$.
3. Model: $\theta_{k+1} \leftarrow \theta_k + \nabla_\theta\{\log p_\theta(t, y|z_\phi)\}$.
4. Critic: $\psi_{k+1} \leftarrow \psi_k + \nabla_\psi\{\vartheta_\psi(x, z_\phi) - \exp(\vartheta_\psi(x, z'))\}$.
5. Balancing critic: $\tilde{\psi}_{k+1} \leftarrow \tilde{\psi}_k + \nabla_{\tilde{\psi}}\{\tilde{\vartheta}_{\tilde{\psi}}(z_\phi^{t=0}) - \exp(\tilde{\vartheta}_{\tilde{\psi}}(z_\phi^{t=1}))\}$.

Another empirical decision is whether to include the treatment and outcome in the encoder. Both choices induce a valid lower bound. While the inclusion is practiced in Louizos et al. [50], we argue otherwise. First, it complicates the inference procedure and introduces additional approximation error, as auxiliary models must be introduced to sample the latent. Second, causal effect identification requires that the assignment is independent of the potential outcomes conditional on the covariates. The inclusion of the outcome in the encoder will, on the contrary, potentially introduce bias and violate the unconfoundedness assumption [70, 77].

**Practical variants** Modifications to the original VI procedure are often considered by practitioners for better performance, as compensation mechanisms to correct for potential model mis-specification. We consider two variants that are more principled in their derivation: β-VAE [30] and AAE [54]. The former seeks to address the potential vanishing KL, while the latter explicitly targets the mismatch between the aggregated posterior and the prior. Both strategies diminish the role of the KL term in the ELBO, which often compromises empirical performance by synthesizing uninformative latents to reduce the mismatch to the prior. Implementation details are included in the SM.

**Inferring causal effects** Given a new observation $x$, we wish to infer the expected effect $\tau(x)$ of a given intervention under the learned model. Since under BV-NICE the causal effect $\tau(z)$ is defined on the latent variable $z$ rather than the observed $x$, the estimation of the causal effect becomes a two-stage process. In the first stage we infer the hidden $z$ given $x$, and in the second stage we average over the latent samples to estimate the causal effect for $x$. An estimate of the causal effect is given by $\tau(x) \approx \frac{1}{J}\sum_j \tau(z_j)$, $z_j \sim q_\phi(z|x)$ (a code sketch follows at the end of this subsection).

**Counterfactual cross-validation with the R-residual.** A major obstacle in counterfactual reasoning is that, due to the absence of counterfactual observations, models cannot be validated directly. In our setting, we apply the R-loss on held-out factual observations to cross-validate our model. Although this may seem similar to the CV applied in standard machine learning, a key distinction should be noted: our CV target is explicitly defined wrt the counterfactual estimates. As noted in Nie and Wager [56] and Schuler et al. [73], the factual residual does not effectively assess counterfactual performance, resulting in biased or unreliable estimation.
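The two-stage inference step amounts to Monte Carlo averaging of the $\tau$ head over posterior samples; a sketch, again with illustrative names:

```python
import torch

@torch.no_grad()
def estimate_ite(x, encoder, triplet, n_samples=100):
    """tau(x) ~= (1/J) * sum_j tau(z_j), z_j ~ q_phi(z|x).
    `encoder(x)` is assumed to draw one implicit posterior sample per call;
    `triplet(z)` returns (m, e, tau), of which we keep the tau head."""
    taus = [triplet(encoder(x))[2] for _ in range(n_samples)]
    return torch.stack(taus).mean(dim=0)
```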
### 3.4 Generalization bounds for BV-NICE

We provide theoretical justification for the use of KL balancing. In particular, we show that the counterfactual generalization error can be bounded by the factual error plus a KL term between the representation distributions of the treatment groups, adjusted by the variance of the conditional outcome model. We provide additional discussion of other theoretical aspects in the SM.

**Definition 3.1.** The expected loss for the unit and treatment pair $(z, t)$ is
$$\ell_h(z, t) = \int_{\mathcal{Y}} L(Y_t, h(z, t))\, p(Y_t \mid z)\, dY_t, \quad (8)$$
where $L(y; h)$ denotes some loss wrt observation $y$ and hypothesis $h$, and $z$ is parameterized via the stochastic encoder $q_\phi(z|x)$. The expected factual and counterfactual losses of $h$ and $\phi$ are
$$\epsilon_F(h, \phi) \triangleq \int_{\mathcal{Z} \times \{0,1\}} \ell_h(z, t)\, p_\phi(z, t)\, dz\, dt, \qquad \epsilon_{CF}(h, \phi) \triangleq \int_{\mathcal{Z} \times \{0,1\}} \ell_h(z, t)\, p_\phi(z, 1-t)\, dz\, dt, \quad (9)$$
where $p_\phi(z, t) = \int q_\phi(z|x, y, t)\, p_d(x, y, t)\, dx\, dy$. The expected factual treated ($t = 1$) and control ($t = 0$) losses are
$$\epsilon_F^{t=1}(h, \phi) \triangleq \int_{\mathcal{Z}} \ell_h(z, 1)\, q_1(z)\, dz, \qquad \epsilon_F^{t=0}(h, \phi) \triangleq \int_{\mathcal{Z}} \ell_h(z, 0)\, q_0(z)\, dz, \quad (10)$$
where $q_t(z)$ is the aggregated approximate posterior of $z$ given $t$, defined as in Sec 3.2.

**Definition 3.2.** The precision of estimating heterogeneous effects (PEHE) for a causal effect estimator $\hat{\tau}$ is defined as $\epsilon_{PEHE}(\hat{\tau}) \triangleq \mathbb{E}\|\hat{\tau} - \tau\|^2_{L_2(P)}$, where $L_2(P)$ is the $L_2$ norm wrt the feature density $P(x)$.

The following statements assert that the generalization error for PEHE can be bounded by the factual error plus a KL-discrepancy term, adjusted by the variance of the outcome.

**Lemma 3.3.** Let $q_t$, $t \in \{0, 1\}$, be the marginal aggregated approximate posterior distributions defined as in Sec 3.2, let $u \triangleq p(T = 1)$ be the prevalence of treatment, and let $h: \mathcal{Z} \times \{0, 1\} \to \mathcal{Y}$ be a hypothesis. Assume $\ell_h(z, t) \le M$ for $t \in \{0, 1\}$. Then we have
$$\epsilon_{CF}(h, \phi) \le (1 - u)\,\epsilon_F^{t=1}(h, \phi) + u\,\epsilon_F^{t=0}(h, \phi) + \tfrac{1}{2} M \sqrt{\tfrac{1}{2} D_{KL}(q_0 \| q_1)}. \quad (11)$$

**Theorem 3.4.** Under the conditions of Lemma 3.3, assuming the loss $L$ defining $\ell_h$ is the squared loss $L(y, y') = (y - y')^2$, and defining $\sigma_Y^2 \triangleq \max_{t \in \{0,1\}} \mathbb{E}_Z[(Y(t) - \mathbb{E}[Y(t)|Z])^2]$, we have
$$\epsilon_{PEHE}(h, \phi) \le 2\,\epsilon_F^{t=0}(h, \phi) + 2\,\epsilon_F^{t=1}(h, \phi) + M\sqrt{\tfrac{1}{2} D_{KL}(q_0 \| q_1)} - 4\sigma_Y^2. \quad (12)$$

This result bears resemblance to the generalization bound proved in Shalit et al. [75]. The key difference is that we have replaced the IPM bound with a KL bound. The original implementation of CFR used Sinkhorn iterations or MMD to compute its IPM, which scales quadratically wrt the mini-batch size. Our Fenchel dual KL estimation scales linearly wrt the sample size, and is consequently more scalable. Moreover, the new boundedness assumption on $\ell_h$ is generally easy to satisfy in practice, while the RKHS assumption made by CFR is difficult to verify.
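For intuition, the step that converts representation imbalance into the KL penalty in Lemma 3.3 is, up to constants, a Pinsker-type argument; a sketch, assuming the bounded loss $\ell_h \le M$:

```latex
% A bounded loss turns the q_0 vs q_1 mismatch into an L1 gap,
% which Pinsker's inequality controls by the KL divergence:
\left|\int \ell_h(z,t)\,\bigl(q_0(z)-q_1(z)\bigr)\,dz\right|
  \;\le\; M \int \bigl|q_0(z)-q_1(z)\bigr|\,dz
  \;\le\; M \sqrt{2\,D_{KL}(q_0\,\|\,q_1)}.
```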
## 4 Related Work

**Bayesian causal estimation** can be classified based on how the uncertainty is accounted for. Classical approaches place uncertainty on the model itself, with prominent examples such as BART [17]. To flexibly model complex causation, Bayesian nonparametric (BNP) schemes have become popular [31]. Alaa and van der Schaar [4] investigated the fundamental limit of the information rate for BNP causal models. Closest to this paper is the work on the causal effect VAE (CE-VAE) [50], where latent variables are introduced to account for the uncertainty, with the model learned through variational Bayesian analysis. Our work enhances CE-VAE by infusing additional causal perspectives into its construction: we explicitly address the covariate balancing issue and elaborate how VI connects to R-learning, based on which a reformulated ELBO is derived. Also highly relevant are the works on Bayesian counterfactual risk minimization (CRM) [85, 49], where a KL-divergence on the policy (model) distributions is regularized to upper bound the excess risk. Our BV-NICE differs in promoting representation balance to reduce the generalization risk.

**Representation learning** has drawn considerable attention in counterfactual inference. Early work explored the use of shrinkage estimators, such as LASSO [12] and the elastic net [9]. Recently, nonlinear representation learning has gained considerable momentum in recognition of growing data complexity [48]. Popular strategies include kernelization [48], neural encoding [37], and representation embedding [82]. While most approaches adopt a deterministic design [48, 75, 2], stochastic variants are considered in the works on CE-VAE [50], CE-GAN [45] and CE-IB [59], which enable additional flexibility and better matching, and consequently improved generalization [39]. Distinct from prior art, BV-NICE directly targets representations for causal estimation and balancing rather than focusing on predictive performance [56]. See the SM for further discussion and causal perspectives on invariant representation learning [89, 90, 7].

**Generative causal learning** is an emerging subject in causal inference. The burgeoning field of generative modeling provides ample new tools and inspiration for causal modeling. GAN-based variants have been most successful in finding direct applications in counterfactual practice [1, 42, 58, 87, 11], and to a lesser extent variational schemes [50, 59, 62]. Indirectly, the counterfactual literature has also greatly benefited from borrowing tools originally developed for generative modeling [75], such as distribution matching schemes [25, 6]. Our work presents a principled attempt to integrate generative and causal views, by bringing together counterfactual reasoning, variational learning and adversarial matching.

**Covariate balancing** is challenged by the fragility of conventional schemes applied to modern datasets. As discussed previously, matching criteria often fail in the presence of nuisance noise [52], while the use of weighting strategies is limited by restrictive linear assumptions [9], unreliable propensity estimates [37], or unscalable numerical schemes [29]. This motivates a variety of work exploring representation learning with direct regularization of imbalance metrics, such as the Mahalanobis, Wasserstein, and MMD measures [9, 13, 93, 94], to learn a proper representation, possibly in conjunction with a (learned) weighting strategy [37, 35], to mitigate the representation mismatch. A generalization argument was provided by Shalit et al. [75] to support such practice. While some works demonstrate gains from adopting sophisticated balancing criteria [86], here we advocate the use of a simple, flexible KL-balancing under a generative framework.

**Hidden confounding** is detrimental to the many representation learning and covariate balancing methods that posit the ignorability or unconfoundedness assumption [65]. Residual confounding due to noisy measurement and unobserved confounders remains a major challenge in practice, threatening the validity of causal estimation [26]. Sensitivity analysis is advised to assess the potential effect of unmeasured confounders on causal estimates [67, 22]. Extensive investigations have been carried out on the robust recovery of (equivalent) causal graphs in the presence of unobserved latents [53, 76, 84], and potential synergies can be exploited between recent advances in causal discovery and counterfactual reasoning. Limited by space, we defer an extended discussion of this to the SM.

**Consistency and identifiability** are key concepts of parallel interest to generalizability. Beyond the common assumption of strong ignorability, conditions ensuring identifiability in the presence of latent variables have been adequately discussed in D'Amour [21], Miao et al.
[55] and the references therein, and we note their settings are drastically simpler than what is assumed by BV-NICE. Most related to this work are the emerging theories on the identifiability of latent variables under the general framework assumed by variational inference [38]. While a full exposition of the topic in the context of causal inference is beyond the scope of this study, we refer readers to our SM for some preliminary discussion.

## 5 Experiments

We consider a wide range of semi-synthetic and real-world tasks to validate our models experimentally. Details of the experimental setup are described in the SM, and our code is available from https://github.com/DannieLu/BV-NICE. Importantly, we want to experimentally unveil aspects that are important for the design of generative causal models. More analyses can be found in the SM.

### 5.1 Experimental setups

Table 1: Comparison of performance ($\epsilon_{PEHE}$, mean ± std) on semi-synthetic datasets.

| $\epsilon_{PEHE}$ | IHDP1000 in-sample | IHDP1000 out-sample | ACIC2016 in-sample | ACIC2016 out-sample |
|---|---|---|---|---|
| OLS | 0.29 ± .09 | 0.30 ± .11 | 0.52 ± .13 | 0.65 ± .16 |
| CFR | 1.47 ± .35 | 1.46 ± .36 | 0.52 ± .14 | 0.90 ± .26 |
| BART | 0.30 ± .08 | 0.33 ± .11 | 0.58 ± .12 | 0.70 ± .17 |
| Causal RF | 0.63 ± .24 | 0.63 ± .26 | 0.68 ± .32 | 0.81 ± .40 |
| BV-NICE | 0.20 ± .04 | 0.20 ± .06 | 0.50 ± .13 | 0.62 ± .17 |

**Model architecture, hyper-parameter tuning and data pre-processing** For all instantiations, we use fully-connected multi-layer perceptrons (MLPs) as our flexible learners. We randomly sample model architectures (number of layers, hidden units) and other hyper-parameters (learning rate, batch size, regularization strength, etc.). For practical cross-validation, we use a 7/3 split for training and validation, respectively, and rely on the validation outcome RMSE to set the best configuration (equivalent to Robinson residual validation).

**Datasets** To extensively validate the proposed procedure in a realistic setup, we consider the following four datasets: (i) IHDP1000 [31]: a semi-synthetic dataset with 1,000 simulations of different treatment and outcome mechanisms. (ii) ACIC2016 [20]: a benchmark released by the Atlantic Causal Inference Competition, which involves 77 semi-synthetic datasets with 100 replications each. (iii) JOBS [44]: a real-world dataset with binary outcomes, a small portion of which comes from randomized trials. (iv) SHRP2 [27]: a 3-year case-cohort study of driver behavior and environmental factors at the onset of crashes and under normal driving conditions, derived from over 1 million hours of continuous video recordings. Detailed descriptions of these datasets can be found in the SM.

**Evaluation metrics** To quantitatively assess the performance of competing causal inference procedures, we consider the following performance metrics from the literature: (i) ITE accuracy as quantified by $\epsilon_{PEHE}$; (ii) policy risk $R_{pol} \triangleq 1 - \pi_f\,\mathbb{E}[Y(1) \mid f(X) = 1] - (1 - \pi_f)\,\mathbb{E}[Y(0) \mid f(X) = 0]$ [75], where $f(x): \mathcal{X} \to \{0, 1\}$ denotes a decision rule for whether to apply the treatment and $\pi_f$ denotes the portion of the population receiving the treatment under $f(x)$. Note that policy risk only applies to datasets with a randomized (RCT) component.

**Baseline solutions** For comparison, the following strong or popular causal estimation baselines are considered: linear regression (OLS, with the T-learner setup); Bayesian Additive Regression Trees (BART) [17]; Causal Random Forests (Causal RF) [83]; and Counterfactual Regression (CFR) [75].
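For reference, both evaluation metrics admit direct empirical estimators; the following is a sketch of our own helper functions, assuming ground-truth ITEs for the semi-synthetic data and RCT data for the policy risk.

```python
import numpy as np

def pehe(tau_hat, tau_true):
    """epsilon_PEHE (Def. 3.2): mean squared ITE error;
    its square root is also commonly reported in the literature."""
    return np.mean((tau_hat - tau_true) ** 2)

def policy_risk(f, t, y):
    """Empirical R_pol for a binary decision rule f, estimated on RCT
    data from units whose assigned treatment agrees with the policy."""
    pi_f = f.mean()
    y1 = y[(f == 1) & (t == 1)].mean()  # estimates E[Y(1) | f(X)=1]
    y0 = y[(f == 0) & (t == 0)].mean()  # estimates E[Y(0) | f(X)=0]
    return 1.0 - (pi_f * y1 + (1.0 - pi_f) * y0)
```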
### 5.2 Dissecting VI for counterfactual reasoning

We first investigate which factors most impact performance, to support design choices in the construction of generative causal models. In particular, we seek answers to the following points through the lens of empirical experiments: (a) the level of uncertainty in the feature representation; (b) the degree of balancing (overlap); (c) the type of distributional regularization.

Figure 2: Impact of imbalance and randomness in the feature representation. Normalized $\epsilon_{PEHE}$ reported; lower is better. Upper: sensitivity to (L) imbalance $\lambda$, (R) randomness $\eta$. Lower: projections of the under-, proper- and over-balanced features.

To see how representation uncertainty affects performance, we introduce a randomness parameter $\eta \ge 0$ that scales the noise input to our stochastic feature encoder, i.e., $z = G(\eta\,\xi, x)$. We carried out a grid search over configurations of $(\lambda, \eta)$ on both IHDP and ACIC. In Figure 2, we plot the response curves for the imbalance parameter and the randomness parameter, with their respective counterparts fixed at their optima. Optimal results, as measured by $\epsilon_{PEHE}$, appear at moderate levels of imbalance and representation randomness. This is consistent with theoretical predictions: perfectly balanced representations (large $\lambda$, Fig 2C) compromise the discriminative power of the latent representation, while under-balanced representations (small $\lambda$, Fig 2A) are subject to selection bias.

### 5.3 Evaluation on semi-synthetic and real datasets

Figure 3: Result visualization on ACIC2016 (left, $\epsilon_{PEHE}$) and JOBS (right, $R_{pol}$). Lower is better. Index sorted for ACIC to facilitate visualization.

Table 1 summarizes the performance of BV-NICE along with its competing solutions. For both datasets, the proposed BV-NICE performs strongly, giving the best results in terms of both in-sample and out-of-sample performance. In Figure 3, we plot the mean $\epsilon_{PEHE}$ computed on ACIC2016 for each simulation type. The dataset index is sorted based on the out-of-sample PEHE of BV-NICE. We can see that, with very few exceptions, BV-NICE consistently outperforms the counterparts being compared. These results underscore the importance of modeling representation uncertainty in CE estimation. Additionally, we applied BV-NICE to the JOBS dataset, and show the policy risk curve in Figure 3. In the inclusion rate regime [0.5, 0.9], BV-NICE gives significantly lower risks.

### 5.4 Traffic safety risk analysis with naturalistic driving data

Figure 4: Cellphone risk modulation by exogenous factors (lighting, traffic density, relation to junction, locality, age group, weather, surface condition, etc.); larger values imply stronger risk reduction.

In our last experiment, we apply the proposed BV-NICE to analyze risk factors in traffic safety [41, 19], in the hope that a fine-grained picture of intervention effectiveness can better inform driving safety regulations and reduce the number of tragic events. Note this study is further characterized by the challenge of rare-event modeling, due to the exceptionally low incidence rates of traffic accidents [28].
Only 1k crashes were flagged and annotated by trained analysts to represent the potential risk factors, along with 20k normal driving baselines for control. Given the prevalence of smartphone usage in modern life, our analysis concentrates on the risk of cellphone use during driving. Following Lu et al. [51], 11 variables are included as confounders out of the 84 variables originally recorded by the study, with the inclusion criteria derived from both domain knowledge and statistical independence tests. In Figure 4, we visualize how exogenous factors modulate the heterogeneous risk distribution of cellphone use, in terms of expected reduction in incidence rate. We see that restricting cellphone use is most effective in reducing collisions in bad road conditions (e.g., snowy, wet, rainy, foggy), followed by complex environments (e.g., parking lot crossings, intersections). More statistical summaries and comparisons to alternative causal effect estimators can be found in the SM.

## 6 Conclusion

This study revisits design principles for the training objectives of generative causal models. In particular, we highlight the significance of covariate balancing and representation uncertainty, which is largely missing from prior investigations. We further present a strong causal inference procedure, called BV-NICE, which bridges R-learning and variational inference. We extensively test our model on realistic datasets, and our results reveal the intricate nature of practical causal estimation procedures. While the empirical performance largely conforms to the guiding principles, caution needs to be exercised to avoid pitfalls which, while not in violation of theoretical predictions, can severely degrade performance. Further scrutiny is warranted for the study of robust causal estimation with flexible learners, ameliorating the burden of exhaustive parameter search.

## Acknowledgments and Disclosure of Funding

The authors would like to thank the anonymous reviewers for their insightful comments. The work at Virginia Tech was supported by the National Surface Transportation Safety Center for Excellence. The research at Duke University was supported in part by DARPA, DOE, NIH, ONR and NSF. J Chen was partially supported by the China Scholarship Council (CSC) and the Shanghai Municipal Science and Technology Major Project under Grant 2018SHZDZX01 and ZJLab. The authors would also like to thank Serge Assaad, Shuxi Zeng and Wanlu Deng for fruitful discussions.

## Broader Impact

This study presents a novel generative causal inference framework, called BV-NICE, that brings together ideas from both statistical and machine-learning-based causal modeling. By joining the strengths of variational inference, R-learning, and Fenchel mini-max learning, the resulting procedure fully acknowledges the representation uncertainty and enables accurate, reliable, direct estimation of individualized causal effects in a flexible, scalable manner. Importantly, while there has been growing consensus that generative causal modeling such as CE-VAE is better suited for many applications yet has delivered suboptimal performance, our research identifies the performance bottleneck and closes the gap between generative causal schemes and state-of-the-art alternatives. This work promises to have positive societal impact into the future, and the authors hope this research will be applied to advance the good of humanity.
The areas most likely to benefit from this research are personalized healthcare, public policy, and transportation safety regulation. Variants of the proposed variational framework also promise robustness against algorithmic biases toward minority populations, a major issue that draws criticism in machine learning applications. This suggests our model can be well suited for supporting social justice.

## References

[1] Umar I Abdullahi, Spyros Samothrakis, and Maria Fasli. Counterfactual domain adversarial training of neural networks. In 2017 International Conference on the Frontiers and Advances in Data Science (FADS), pages 151–155, 2017.
[2] Ahmed Alaa and Mihaela van der Schaar. Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design. In ICML, 2018.
[3] Ahmed Alaa and Mihaela van der Schaar. Validating causal inference models via influence functions. In ICML, pages 191–201, 2019.
[4] Ahmed M Alaa and Mihaela van der Schaar. Bayesian nonparametric causal inference: Information rates and learning algorithms. IEEE Journal of Selected Topics in Signal Processing, 12(5):1031–1046, 2018.
[5] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In ICML, 2018.
[6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[7] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[8] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
[9] Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.
[10] Susan Athey, Julie Tibshirani, Stefan Wager, et al. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
[11] Amelia J Averitt, Natnicha Vanitchanant, Rajesh Ranganath, and Adler J Perotte. The counterfactual χ-GAN. arXiv preprint arXiv:2001.03115, 2020.
[12] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.
[13] Dimitris Bertsimas, Mac Johnson, and Nathan Kallus. The power of optimization over randomization in designing experiments involving small samples. Operations Research, 63(4):868–876, 2015.
[14] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[15] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.
[16] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
[17] Hugh A Chipman, Edward I George, Robert E McCulloch, et al. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
[18] Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao, and Le Song. Coupled variational Bayes via optimization embedding. In NIPS, pages 9690–9700, 2018.
[19] Thomas A Dingus, Feng Guo, Suzie Lee, Jonathan F Antin, Miguel Perez, Mindy Buchanan-King, and Jonathan Hankey. Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proceedings of the National Academy of Sciences, 113(10):2636–2641, 2016.
[20] Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, Dan Cervone, et al. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1):43–68, 2019.
[21] Alexander D'Amour. On multi-cause approaches to causal inference with unobserved confounding: Two cautionary failure cases and a promising alternative. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3478–3486, 2019.
[22] Alexander Franks, Alexander D'Amour, and Avi Feller. Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association, pages 1–33, 2019.
[23] Thomas A Glass, Steven N Goodman, Miguel A Hernán, and Jonathan M Samet. Causal inference in public health. Annual Review of Public Health, 34:61–75, 2013.
[24] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[25] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[26] R. H. H. Groenwold, E. Hak, and A. W. Hoes. Quantitative assessment of unobserved confounding is mandatory in nonrandomized intervention studies. Journal of Clinical Epidemiology, 62(1):22–28, 2009. ISSN 0895-4356. doi: 10.1016/j.jclinepi.2008.02.011. URL http://dx.doi.org/10.1016/j.jclinepi.2008.02.011.
[27] Feng Guo. Statistical methods for naturalistic driving studies. Annual Review of Statistics and Its Application, 6:309–328, 2019.
[28] Feng Guo and Youjia Fang. Individual driver risk assessment using naturalistic driving data. Accident Analysis & Prevention, 61:3–9, 2013.
[29] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
[30] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[31] Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
[32] Paul W Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
[33] Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[34] Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In ICML, 2016.
[35] Fredrik D Johansson, Nathan Kallus, Uri Shalit, and David Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.
[36] Fredrik D Johansson, Uri Shalit, Nathan Kallus, and David Sontag. Generalization bounds and representation learning for estimation of potential outcomes and causal effects. arXiv preprint arXiv:2001.07426, 2020.
[37] Nathan Kallus. DeepMatch: Balancing deep covariate representations for causal inference using adversarial training. In ICML, 2020.
[38] Ilyes Khemakhem, Diederik P Kingma, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In AISTATS, 2020.
[39] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[40] Diederik P Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. In NIPS, 2016.
[41] Sheila G Klauer, Feng Guo, Bruce G Simons-Morton, Marie Claude Ouimet, Suzanne E Lee, and Thomas A Dingus. Distracted driving and risk of road crashes among novice and experienced drivers. New England Journal of Medicine, 370(1):54–59, 2014.
[42] Murat Kocaoglu, Christopher Snyder, Alexandros G Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023, 2017.
[43] Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
[44] Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.
[45] Changhee Lee, Nicholas Mastronarde, and Mihaela van der Schaar. Estimation of individual treatment effect in latent confounder models via adversarial learning. arXiv preprint arXiv:1811.08943, 2018.
[46] Fan Li, Alan M. Zaslavsky, and Mary Beth Landrum. Propensity score weighting with multilevel data. Statistics in Medicine, 32(19):3373–3387, 2013. ISSN 0277-6715. doi: 10.1002/sim.5786.
[47] Fan Li, Laine E. Thomas, and Fan Li. Addressing extreme propensity scores via the overlap weights. American Journal of Epidemiology, 188(1):250–257, 2019. ISSN 1476-6256. doi: 10.1093/aje/kwy201.
[48] Sheng Li and Yun Fu. Matching on balanced nonlinear representations for treatment effects estimation. In NIPS, pages 929–939, 2017.
[49] Ben London and Ted Sandler. Bayesian counterfactual risk minimization. In International Conference on Machine Learning, pages 4125–4133. PMLR, 2019.
[50] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In NIPS, pages 6446–6456, 2017.
[51] Danni Lu, Feng Guo, and Fan Li. Evaluating the causal effects of cellphone distraction on crash risk using propensity score methods. Accident Analysis & Prevention, 143:105579, 2020.
[52] Wei Luo and Yeying Zhu. Matching using sufficient dimension reduction for causal inference. Journal of Business & Economic Statistics, pages 1–13, 2019.
[53] Marloes H Maathuis, Markus Kalisch, Peter Bühlmann, et al. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
[54] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[55] Wang Miao, Zhi Geng, and Eric J Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987–993, 2018.
[56] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 2017.
[57] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
[58] Michal Ozery-Flato, Pierre Thodoroff, Matan Ninio, Michal Rosen-Zvi, and Tal El-Hay. Adversarial balancing for causal inference. arXiv preprint arXiv:1810.07406, 2018.
[59] Sonali Parbhoo, Mario Wieser, and Volker Roth. Causal deep information bottleneck. arXiv preprint arXiv:1807.02326, 2018.
[60] Judea Pearl. Causality. Cambridge University Press, 2009.
[61] Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H Shah, Trevor Hastie, and Robert Tibshirani. Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine, 37(11):1767–1787, 2018.
[62] Aahlad Manas Puli and Rajesh Ranganath. Generalized control functions via variational decoupling. arXiv preprint arXiv:1907.03451, 2019.
[63] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
[64] Peter M Robinson. Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pages 931–954, 1988.
[65] Paul R. Rosenbaum. From association to causation in observational studies: The role of tests of strongly ignorable treatment assignment. Journal of the American Statistical Association, 79(385):41–48, 1984.
[66] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983. ISSN 0006-3444. doi: 10.2307/2335942. URL http://www.jstor.org/stable/2335942.
[67] Paul R Rosenbaum and Donald B Rubin. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2):212–218, 1983.
[68] Jason Roy, Kirsten J Lum, Bret Zeldow, Jordan D Dworkin, Vincent Lo Re III, and Michael J Daniels. Bayesian nonparametric generative models for causal inference with missing at random covariates. Biometrics, 74(4):1193–1202, 2018.
[69] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
[70] Donald B. Rubin. Using propensity scores to help design observational studies: Application to the tobacco litigation. Matched Sampling for Causal Effects, 2:169–188, 2001. doi: 10.1017/CBO9780511810725.030.
[71] Yuta Saito and Shota Yasui. Counterfactual cross-validation: Effective causal model selection from observational data. arXiv preprint arXiv:1909.05299, 2019.
[72] Daniel O Scharfstein, Andrea Rotnitzky, and James M Robins. Adjusting for nonignorable dropout using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999.
[73] Alejandro Schuler, Michael Baiocchi, Robert Tibshirani, and Nigam Shah. A comparison of methods for model selection when estimating individual treatment effects. arXiv preprint arXiv:1804.05146, 2018.
[74] Patrick Schwab, Lorenz Linhardt, and Walter Karlen. Perfect match: A simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656, 2018.
[75] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In ICML, 2017.
[76] Peter L Spirtes, Christopher Meek, and Thomas S Richardson. Causal inference in the presence of latent variables and selection bias. arXiv preprint arXiv:1302.4983, 2013.
[77] Elizabeth A. Stuart. Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1):1–21, 2010. ISSN 0883-4237. doi: 10.1214/09-STS313.
[78] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin. Chi-square generative adversarial network. In ICML, 2018.
[79] Chenyang Tao, Liqun Chen, Ruiyi Zhang, Ricardo Henao, and Lawrence Carin. Variational inference and model selection with generalized evidence bounds. In ICML, 2018.
[80] Chenyang Tao, Liqun Chen, Shuyang Dai, Junya Chen, Ke Bai, Dong Wang, Jianfeng Feng, Wenlian Lu, Georgiy Bobashev, and Lawrence Carin. On Fenchel mini-max learning. In NeurIPS, pages 10427–10439, 2019.
[81] Tyler J. VanderWeele and Ilya Shpitser. On the definition of a confounder. Annals of Statistics, 41(1):196–220, 2013. ISSN 0090-5364. doi: 10.1214/12-AOS1058.
[82] Victor Veitch, Yixin Wang, and David M Blei. Using embeddings to correct for unobserved confounding. In NeurIPS, 2019.
[83] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
[84] Janine Witte, Leonard Henckel, Marloes H Maathuis, and Vanessa Didelez. On efficient adjustment in causal graphs. arXiv preprint arXiv:2002.06825, 2020.
[85] Hang Wu and May Wang. Variance regularized counterfactual risk minimization via variational divergence minimization. In ICML, 2018.
[86] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. Representation learning for treatment effect estimation from observational data. In NIPS, 2018.
[87] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In ICLR, 2018.
[88] Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In UAI, 2009.
[89] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In ICML, 2013.
[90] Kun Zhang, Mingming Gong, and Bernhard Schölkopf. Multi-source domain adaptation: A causal view. In AAAI, 2015.
[91] Qingyuan Zhao, Dylan S Small, and Ashkan Ertefaie. Selective inference for effect modification via the lasso. arXiv preprint arXiv:1705.08020, 2017.
[92] Yan Zhao, Xiao Fang, and David Simchi-Levi. A practically competitive and provably consistent algorithm for uplift modeling. In ICDM, 2017.
[93] José R Zubizarreta. Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–1371, 2012.
[94] José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.