# efficient_causal_decision_making_with_onesided_feedback__3276e184.pdf

Published as a conference paper at ICLR 2025

EFFICIENT CAUSAL DECISION MAKING WITH ONE-SIDED FEEDBACK

Jianing Chu Amazon jianing.chu3@gmail.com

Shu Yang & Wenbin Lu Department of Statistics North Carolina State University {syang24,wlu4}@ncsu.edu

Pulak Ghosh Indian Institute of Management pulak.ghosh@iimb.ac.in

We study a class of decision-making problems with one-sided feedback, where outcomes are only observable for specific actions. A typical example is bank loans, where the repayment status is known only if a loan is approved and remains undefined if rejected. In such scenarios, conventional approaches to causal decision evaluation and learning from observational data are not directly applicable. In this paper, we introduce a novel value function to evaluate decision rules that addresses the issue of undefined counterfactual outcomes. Without assuming no unmeasured confounders, we establish the identification of the value function using shadow variables. Furthermore, leveraging semiparametric theory, we derive the efficiency bound for the proposed value function and develop efficient methods for decision evaluation and learning. Numerical experiments and a real-world data application demonstrate the empirical performance of our proposed methods.

1 INTRODUCTION

Binary decision-making problems are pervasive in the real world, encompassing domains such as bank loan approval (Pacchiano et al., 2021), job hiring (Raghavan et al., 2020), school admission (Baker & Hawn, 2022), and criminal recidivism prediction (Lakkaraju et al., 2017). Often, feedback in these scenarios is one-sided. Take bank loan approval as an example: a decision-maker is presented with covariates describing a loan applicant and decides whether to grant or deny the loan. If the loan is approved, feedback regarding the applicant's repayment is subsequently received. However, if the loan is denied, no further information is obtained. There are two main objectives in these decision-making processes: (1) evaluating a decision rule that aims to approve loans for applicants likely to repay while denying loans to those unlikely to do so, based on the expected outcomes it achieves; and (2) deriving an optimal decision rule that maximizes the expected outcome.

Decision-making with one-sided feedback can be viewed as a special contextual bandit problem with two actions, "approve" and "reject", where the outcome is observable exclusively when an individual is approved. Significant challenges arise due to the inherent heterogeneity between the approved and rejected groups: specifically, the conditional distribution of the outcome given the covariates may differ between these two groups. As a result, using an outcome model trained on approved samples to predict outcomes for the rejected group is generally infeasible. To address model bias, one category of approaches uses exploration strategies to gather additional information from new samples, gradually reducing the bias over time (e.g., Jiang et al., 2021; Pacchiano et al., 2021). However, most existing works are restricted to binary outcomes and specific outcome models; they lack robustness to model misspecification and cannot be generalized to numerical outcomes. Moreover, in real-world applications such as healthcare, finance, and education, exploration can be costly, risky, or even unethical. This motivates us to develop practical approaches to decision evaluation and learning for different types of outcomes from observational data (Dudík et al., 2014; Kallus & Uehara, 2020; Athey & Wager, 2021; Chu et al., 2023a;b).

This work was done prior to joining Amazon.

As mentioned above, disparities between approved and rejected groups often lead to variations in outcome measures due to unobserved differences in action selection, which also serve as predictors for the outcomes. This phenomenon violates a critical assumption in the causal inference literature for identifying and estimating the value function, known as the no unmeasured confounders (NUC) assumption (Imbens, 2004). This assumption, also referred to as strong ignorability (Rosenbaum & Rubin, 1983) or exogeneity (Imbens & Rubin, 2015), posits that actions are independent of potential outcomes given the covariates. Under this assumption, various approaches have been developed for estimating the value function, such as the inverse propensity weighting (IPW) method (Horvitz & Thompson, 1952) and the doubly robust (DR) method (Zhang et al., 2012; Dudík et al., 2014; Jiang & Li, 2016). The NUC assumption, however, is often violated in real-world scenarios. When it does not hold, the identifiability of the value function may be compromised, and existing estimators that rely on it may no longer be consistent for the value function.

To deal with such violations, the use of instrumental variables (IVs) is a well-established strategy in the literature (Angrist et al., 1996; Hernán & Robins, 2006; Aronow & Carnegie, 2013; Wang & Tchetgen Tchetgen, 2018). An IV is defined as a pretreatment variable that is independent of all unmeasured confounders and does not have a direct causal effect on the outcome other than through the action. However, identifying suitable IVs poses a considerable challenge, given the potential existence of numerous unmeasured confounders and the difficulty of ruling out an IV's dependence on any of them. In contrast to IVs, we consider an alternative approach using a distinct type of variables known as shadow variables (SVs) (Wang et al., 2014; Shao & Wang, 2016; Miao et al., 2016; Li et al., 2024). SVs are independent of the action after conditioning on fully observed covariates and the outcome itself. Meanwhile, SVs are related to the outcome, potentially through unmeasured confounders. For example, in fairness-oriented employment, sensitive attributes such as the age of candidates should be independent of the decision. However, these attributes may be related to the performance of candidates, thereby qualifying them as SVs. Using SVs, we show that the proposed value function is identifiable.

The contribution of this paper is multi-fold.

First, we propose a novel value function for decision-making with one-sided feedback. Without assuming the NUC condition, we consider a model that involves both outcomes and covariates for the action assignment mechanism. We provide identification for the proposed value function under this model by leveraging SVs.

Second, we derive the efficient influence function (EIF) and the semiparametric efficiency bound of the value function. Motivated by the EIF, we develop two different efficient estimators for the value function with binary and continuous outcomes, respectively. Our proposed estimation strategy does not require estimating the density when the outcome is continuous, thereby avoiding instability and distinguishing our methods from existing literature.

Third, we establish theoretical properties for the proposed estimators. We show the estimators are consistent and achieve the semiparametric efficiency bound under mild conditions on the approximation of the nuisance functions.

Fourth, we propose a classification-based framework for learning the optimal decision rule, which allows us to leverage a wide range of existing classification tools tailored to different classes of decision rules. Through numerical experiments, we demonstrate that the proposed method significantly outperforms conventional decision learning methods.

2 RELATED WORK

Contextual Bandits, Off-policy Evaluation and Learning As formally described in Section 3, decision-making with one-sided feedback can be formulated as a special type of contextual bandit problem (Chu et al., 2011; Agrawal & Goyal, 2013; Zhou et al., 2020). Only a limited number of works focus on one-sided feedback, with two notable related works in this setting. Jiang et al. (2021) considered binary outcomes and estimated outcome functions using generalized linear models, proposing an adaptive online learning approach that integrates uncertainty into outcome estimation. Pacchiano et al. (2021) studied the same problem setting with binary outcomes, approximating the outcome function using deep neural networks and proposing an online algorithm to train an optimistic decision-making model. However, their methods cannot be generalized to numerical outcomes and focus on the online learning setting. In contrast, the primary focus of our work is on decision evaluation and learning using observational data, commonly referred to as off-policy evaluation and learning in the context of contextual bandits. Off-policy methods have attracted significant interest, particularly in fields such as finance, medicine, and education, where experimentation and exploration can be risky, costly, or even unethical (Dudík et al., 2014; Kallus & Uehara, 2020; Athey & Wager, 2021; Chu et al., 2023a;b).

Selective/Non-Random-Missing Labels Although we study the problem under the contextual bandits setting, it is intrinsically related to the selective/non-random-missing labels problems in semi-supervised learning (Misra et al., 2016; Kleinberg et al., 2018; Sohn et al., 2020; Coston et al., 2021). In these problems, only a subset of instances receive labels, determined by the choices of decision-makers. This issue is further complicated by unmeasured confounders that influence both human decisions and the resulting outcomes. Lakkaraju et al. (2017) proposed a model evaluation method based on the assumption that the decisions in the historical dataset are made by different decision-makers with varying thresholds for their yes-no decisions. Sportisse et al. (2023) studied the problem in semi-supervised learning, adopting the assumption that the label-missing mechanism is independent of covariates given the label itself, implying that all covariates are SVs. Based on this assumption, they constructed consistent estimators for the loss function by modeling the label-missing mechanism. Hu et al. (2022) adopted the same assumption but proposed estimators without modeling the missing mechanism. The key difference in our work is that we do not require all covariates to be SVs; instead, we allow the missing mechanism to depend on both the covariates and the outcome. More importantly, we develop the most efficient estimator by utilizing semiparametric theory.

3 PRELIMINARIES

We consider a binary action $A \in \{0, 1\}$, where action 1 denotes "approve" and action 0 denotes "reject". Let $X \in \mathcal{X} \subset \mathbb{R}^p$ denote a vector of covariates, and $Y \in \mathbb{R}$ denote the observed outcome of interest. By convention, we assume larger values of $Y$ are preferred. We study the problem under the counterfactual potential-outcome framework (Rubin, 2005). In conventional decision-making problems, the potential outcomes $Y(a)$, $a = 0, 1$, which are the outcomes that would be observed if a subject received action $a$, are both well-defined. Under the Stable Unit Treatment Value Assumption (SUTVA) (Rubin, 2005), we have $Y = AY(1) + (1 - A)Y(0)$. However, under the one-sided feedback setting, only $Y(1)$ is defined, and the outcome $Y$ is observed only if an individual is approved ($A = 1$). In this case, the observed outcome is always $Y = Y(1)$. The observed data are then $\{O_i = (Y_i A_i, A_i, X_i), i = 1, \ldots, n\}$, which we assume are independent and identically distributed.

A decision rule $\pi : \mathcal{X} \to [0, 1]$ is a map from covariates to a probability, so that a decision maker, when presented with covariates $X$, selects action 1 with probability $\pi(X)$. In conventional decision-making, where potential outcomes are defined for both actions, implementing a decision rule $\pi$ in a population would yield the population mean outcome, commonly referred to as the value function, defined as follows:

$$V(\pi) = E\left[Y(1)\pi(X) + Y(0)\{1 - \pi(X)\}\right]. \tag{1}$$

Under the one-sided feedback setting, since $Y(0)$ is not defined, we can no longer use the definition of the value function in (1). We define a new value function as

$$V_1(\pi) = E\{Y(1)\pi(X)\}. \tag{2}$$

The interpretation of $V_1(\pi)$ is straightforward. Consider a practical example of bank loans and a deterministic decision rule $\pi$ (where $\pi(X)$ can only take on values 0 or 1). Let $Y(1)$ denote the money earned by the bank if a loan is approved. For an applicant with covariates $X$, if $\pi(X) = 1$, indicating loan approval, then $Y(1)\pi(X) = Y(1)$ represents the potential financial outcome for the bank. On the other hand, if $\pi(X) = 0$, indicating loan rejection, the bank neither earns nor loses any money. Therefore, the newly defined value function $V_1(\pi)$ quantifies the expected monetary outcome for the bank when implementing decision rule $\pi$ for loan approvals. We define the optimal decision rule as the one that maximizes the defined value function: $\pi^* = \arg\max_{\pi \in \Pi} V_1(\pi)$. Our first goal is to evaluate a given decision rule $\pi$ by estimating $V_1(\pi)$ using the historical data $\{O_i = (Y_i A_i, A_i, X_i), i = 1, \ldots, n\}$. Our second goal is to learn the optimal decision rule $\pi^*$.
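As a small illustration of definition (2), the sketch below evaluates the sample analogue of $V_1(\pi)$ on a hypothetical four-applicant portfolio in which $Y(1)$ is assumed known for everyone (which is never the case in observed data). The scores, profits, and the two rules are invented for illustration only.

```python
from typing import Callable, Sequence


def empirical_v1(y1: Sequence[float], x: Sequence[float],
                 pi: Callable[[float], float]) -> float:
    """Sample analogue of V1(pi) = E{Y(1) pi(X)} when Y(1) is known for all units."""
    return sum(y * pi(xi) for y, xi in zip(y1, x)) / len(y1)


# Hypothetical toy portfolio: x is a credit score, y1 the profit if approved.
scores = [0.9, 0.7, 0.4, 0.2]
profits = [1.0, 1.0, -1.0, -1.0]


def approve_all(score: float) -> float:
    return 1.0


def approve_good(score: float) -> float:
    # Deterministic rule: approve only applicants with a score above 0.5.
    return 1.0 if score > 0.5 else 0.0


v_all = empirical_v1(profits, scores, approve_all)    # gains and losses cancel
v_good = empirical_v1(profits, scores, approve_good)  # rejected units contribute 0
```

Rejected applicants contribute zero rather than an undefined $Y(0)$, which is why the selective rule scores higher than approving everyone in this toy example.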

4 IDENTIFICATION, EIF, AND EFFICIENCY BOUND

In this section, we provide the identification of the value function V1(π), and establish the corresponding EIF and efficiency bound under the semiparametric theory.

4.1 IDENTIFICATION

Without assuming the NUC condition that $Y(1) \perp A \mid X$, we consider a general action assignment mechanism that depends not only on covariates but also on the potential outcome:

$$\varphi(x, y) \equiv P\{A = 1 \mid X = x, Y(1) = y\},$$

and we assume $0 < \varphi(x, y) < 1$. Let $f(x)$ denote the marginal density of $X$, and let $f(y \mid x, 1)$ denote the conditional density of $Y(1)$ given $X = x$ and $A = 1$. Let $w(x) \equiv P(A = 1 \mid X = x)$. We can show that the value function $V_1(\pi)$ has the following representation (details are given in Appendix A.1):

$$V_1(\pi) = E\{Y(1)\pi(X)\} = \int f(x)\, w(x) \left\{ \int y\, \frac{f(y \mid x, 1)}{\varphi(x, y)}\, dy \right\} \pi(x)\, dx. \tag{3}$$

Therefore, we can identify V1(π) through identifying f(x), w(x), f(y | x, 1), and φ(x, y). The likelihood function for a single observation is

$$f(x)\, w(x)^{a} \{1 - w(x)\}^{1 - a} f(y \mid x, 1)^{a}.$$

Thus, f(x), w(x), and f(y | x, 1) can be identified from the observed data distribution. However, as noted in the literature (e.g. Wang et al., 2014; Miao et al., 2016), φ(x, y) is not identifiable without further assumptions.
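The representation of $V_1(\pi)$ in terms of $f(x)$, $w(x)$, $f(y \mid x, 1)$, and $\varphi(x, y)$ can be checked exactly on a small discrete example. The sketch below uses invented probabilities for a binary $X$ and binary $Y(1)$, with exact rational arithmetic; it also shows that plugging in the approved-only outcome model without the $w/\varphi$ correction gives a different number.

```python
from fractions import Fraction as F

# Hypothetical discrete population (all numbers are illustrative assumptions).
f_x = {0: F(1, 2), 1: F(1, 2)}                      # marginal law of X
p_y1 = {0: F(1, 4), 1: F(3, 4)}                     # P{Y(1)=1 | X=x}
phi = {(0, 0): F(1, 5), (0, 1): F(3, 5),            # P{A=1 | X=x, Y(1)=y}
       (1, 0): F(2, 5), (1, 1): F(4, 5)}
pi = {0: F(1, 3), 1: F(1)}                          # decision rule pi(x)


def p_y(y, x):
    return p_y1[x] if y == 1 else 1 - p_y1[x]


def w(x):
    """P(A=1 | X=x), obtained by averaging phi over Y(1) given X."""
    return sum(phi[(x, y)] * p_y(y, x) for y in (0, 1))


def f_obs(y, x):
    """f(y | x, A=1): law of Y(1) among approved units (Bayes' rule)."""
    return phi[(x, y)] * p_y(y, x) / w(x)


# Left-hand side: E{Y(1) pi(X)} computed from the full-data law.
direct = sum(f_x[x] * pi[x] * p_y1[x] for x in f_x)

# Right-hand side: built only from f(x), w(x), f(y|x,1) and phi(x,y).
via_representation = sum(
    f_x[x] * w(x) * pi[x]
    * sum(y * f_obs(y, x) / phi[(x, y)] for y in (0, 1))
    for x in f_x
)

# Uncorrected plug-in of the approved-only outcome model, for contrast.
naive = sum(f_x[x] * pi[x] * f_obs(1, x) for x in f_x)
```

Here `direct` and `via_representation` agree exactly, while `naive` overstates the value because approved applicants are more likely to repay in this toy population.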

We assume that covariates $X$ can be partitioned into two subsets of variables $U$ and $Z$, i.e., $X = (U^T, Z^T)^T$. $U$ and $Z$ are variables satisfying the following assumptions.

Assumption 4.1 $Z \perp A \mid U, Y(1)$ and $Z \not\perp Y(1) \mid U$.

Assumption 4.2 For any function h(Y (1), U), E{h(Y (1), U) | X, A = 1} = 0 implies h(Y (1), U) = 0 almost surely.

Assumption 4.1 indicates Z are SVs and φ(x, y) = P{A = 1 | X = x, Y (1) = y} = P{A = 1 | U = u, Y (1) = y} = φ(u, y). For example, in fairness-oriented employment, sensitive attributes such as the age of candidates should be unrelated to the action assignment. If these attributes correlate with the performance of candidates, they can be considered SVs. SVs can be selected based on expert prior knowledge, or alternatively, representations that serve the role of shadow variables can be generated directly from observed covariates without the need for prior knowledge (Li et al., 2024). Assumption 4.2 is known as the conditional completeness assumption, which is widely used in identification problems (Newey & Powell, 2003; Miao et al., 2015; Yang et al., 2019). This condition guarantees the uniqueness of φ(u, y). When both Y (1) and Z are categorical variables with l and m levels, respectively, Assumption 4.2 holds if l < m. When Y (1) is continuous, Assumption 4.2 holds when f(y | x, 1) follows some common distributions, such as exponential families.
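To see how a shadow variable pins down the assignment mechanism, consider the simplest categorical case: binary $Y(1)$, binary $Z$, and no $U$. Since $E\{1/\varphi(Y(1)) \mid Z = z, A = 1\} = 1/w(z)$, the two unknowns $1/\varphi(0)$ and $1/\varphi(1)$ solve a $2 \times 2$ linear system built from observable quantities. The sketch below is a toy illustration with invented probabilities, not the estimator proposed in this paper.

```python
from fractions import Fraction as F

# Hypothetical truth, used only to generate the observable quantities.
phi_true = {0: F(1, 5), 1: F(4, 5)}       # P{A=1 | Y(1)=y}; free of Z (shadow variable)
p_y1_given_z = {0: F(1, 4), 1: F(3, 4)}   # P{Y(1)=1 | Z=z}; Z is related to Y(1)


def observed_law(z):
    """Quantities identifiable from (Y*A, A, Z): w(z) and f(y | z, A=1)."""
    p1 = p_y1_given_z[z]
    w_z = phi_true[1] * p1 + phi_true[0] * (1 - p1)
    f_y1 = phi_true[1] * p1 / w_z
    return w_z, {0: 1 - f_y1, 1: f_y1}


(w0, fy_z0), (w1, fy_z1) = observed_law(0), observed_law(1)

# Solve  f(0|z,1) * r0 + f(1|z,1) * r1 = 1 / w(z)  for z = 0, 1,
# where r_y = 1 / phi(y).  Cramer's rule on the 2x2 system:
det = fy_z0[0] * fy_z1[1] - fy_z0[1] * fy_z1[0]
r0 = (fy_z1[1] / w0 - fy_z0[1] / w1) / det
r1 = (fy_z0[0] / w1 - fy_z1[0] / w0) / det
phi_recovered = {0: 1 / r0, 1: 1 / r1}
```

If $Z$ were unrelated to $Y(1)$, the two rows of the system would coincide and `det` would be zero, so the mechanism could not be recovered; this is the role played by the dependence between $Z$ and $Y(1)$ and by the completeness condition.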

Theorem 4.3 Under Assumptions 4.1 and 4.2, f(x), w(x), f(y | x, 1), and φ(u, y) are identifiable, and thus V1(π) is identified by

$$V_1(\pi) = \int f(x)\, w(x) \left\{ \int y\, \frac{f(y \mid x, 1)}{\varphi(u, y)}\, dy \right\} \pi(x)\, dx. \tag{4}$$


4.2 EIF AND EFFICIENCY BOUND

The identification (4) motivates a rich class of estimators for the value function. However, to guide the construction of more principled estimators, in this section we derive the EIF and the efficiency bound for the value function using semiparametric theory (Bickel et al., 1993; Tsiatis, 2006). Semiparametric models are sets of probability distributions that are indexed by both finite-dimensional parametric and infinite-dimensional nonparametric components. The semiparametric efficiency bound is defined as the supremum of the Cramér-Rao lower bounds for all parametric submodels. The EIF is the influence function of a semiparametric regular and asymptotically linear estimator that achieves the semiparametric efficiency bound. We assume a general model for the action assignment mechanism, denoted as $\varphi(u, y; \eta)$, which is represented by a parameter $\eta$. Consider the Hilbert space $\mathcal{T}$ of all measurable functions of the observed data with mean zero and finite variance, equipped with the covariance inner product $\langle h_1, h_2 \rangle = E\{h_1(\cdot)^T h_2(\cdot)\}$, where $h_1, h_2 \in \mathcal{T}$. We first derive the nuisance tangent space and its orthogonal complement, where the nuisance tangent space is defined as the mean-square closure of all parametric submodel nuisance tangent spaces (Bickel et al., 1993; Tsiatis, 2006). For ease of exposition, we abbreviate $\varphi(U, Y(1); \eta)$ as $\varphi(\eta)$ and $\partial \varphi(U, Y(1); \eta)/\partial \eta$ as $\dot\varphi(\eta)$.

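Theorem 4.4 The Hilbert space $\mathcal{T}$ can be decomposed as $\mathcal{T} = \Lambda_1 \oplus \Lambda_2 \oplus \Lambda^{\perp}$, where

$$\Lambda_1 = \left[ h_1(X) : E\{h_1(X)\} = 0 \right],$$

$$\Lambda_2 = \left[ A h_2(X, Y(1)) + \frac{w(X) - A}{1 - w(X)} E\{h_2(X, Y(1)) \mid X\} : E\{h_2(X, Y(1)) \mid X, A = 1\} = 0 \right],$$

$$\Lambda^{\perp} = \left[ \frac{\varphi(\eta) - A}{\varphi(\eta)}\, g(X) \right],$$

$g(X)$ is a function with the same dimension as $\eta$, and the notation $\oplus$ denotes the direct sum of two spaces that are orthogonal to each other.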

Based on Theorem 4.4, the EIF for V1(π) has the following form

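$$\phi_{\mathrm{eff}} = \underbrace{h_1^*(X)}_{\Lambda_1} + \underbrace{A h_2^*(X, Y(1)) + \frac{w(X) - A}{1 - w(X)} E\{h_2^*(X, Y(1)) \mid X\}}_{\Lambda_2} + \underbrace{D^T S_{\eta,\mathrm{eff}}}_{\Lambda^{\perp}},$$

where $E\{h_1^*(X)\} = 0$, $E\{h_2^*(X, Y(1)) \mid X, A = 1\} = 0$, $S_{\eta,\mathrm{eff}}$ is the efficient score for $\eta$, and $D$ is a vector with the same dimension as $\eta$. The efficient score $S_{\eta,\mathrm{eff}}$ can be obtained by projecting the score function of $\eta$ onto $\Lambda^{\perp}$, as stated in the following theorem.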

Theorem 4.5 Under Assumptions 4.1 and 4.2, the efficient score for η is

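$$S_{\eta,\mathrm{eff}} = \frac{\varphi(\eta) - A}{\varphi(\eta)} \cdot \frac{E\left\{ \dfrac{\dot\varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, X, A = 1 \right\}}{E\left\{ \dfrac{\varphi(\eta) - 1}{\varphi(\eta)^2} \,\Big|\, X, A = 1 \right\}}.$$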

By projecting the value function identification (4) onto $\Lambda_1$, $\Lambda_2$, and $\Lambda^{\perp}$, we can derive $h_1^*$, $h_2^*$, and $D$. The EIF and semiparametric efficiency bound for the value function are given in the following theorem.

Theorem 4.6 Under Assumptions 4.1 and 4.2, the EIF for V1(π) is

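$$\phi_{\mathrm{eff}}(\pi) = \pi(X) \left[ \frac{A}{\varphi(\eta)} Y + \left\{ 1 - \frac{A}{\varphi(\eta)} \right\} \frac{E\left\{ \dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \,\Big|\, X, A = 1 \right\}}{E\left\{ \dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, X, A = 1 \right\}} \right] - V_1(\pi) + D^T S_{\eta,\mathrm{eff}}, \tag{5}$$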

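where

$$D = \{\mathrm{Var}(S_{\eta,\mathrm{eff}})\}^{-1} \left( E\left[ \pi(X)\, \frac{E\left\{ \dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \,\Big|\, X, A = 1 \right\}}{E\left\{ \dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, X, A = 1 \right\}}\, \frac{\dot\varphi(\eta)}{\varphi(\eta)} \right] - E\left[ \pi(X)\, E\left\{ \frac{\dot\varphi(\eta)}{\varphi(\eta)^2} Y \,\Big|\, X, A = 1 \right\} \right] \right).$$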

The semiparametric efficiency bound for $V_1(\pi)$ is $\Upsilon(\pi) = E\{\phi_{\mathrm{eff}}^2(\pi)\}$.


5 EFFICIENT DECISION EVALUATION AND LEARNING

5.1 EFFICIENT VALUE ESTIMATION

Based on the EIF (5), since D is a fixed vector and Sη,eff is a score function with mean zero, we propose the following estimator for V1(π):

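$$\hat V_1(\pi) = P_n \left[ \pi(x) \left\{ \frac{a}{\varphi(\hat\eta)} y + \left( 1 - \frac{a}{\varphi(\hat\eta)} \right) \frac{\hat E\left\{ \dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \,\Big|\, x, 1 \right\}}{\hat E\left\{ \dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, x, 1 \right\}} \right\} \right], \tag{6}$$

where $P_n[h(x)] = \frac{1}{n} \sum_{i=1}^n h(x_i)$ for any given function $h(x)$, and quantities marked with hats are estimates of their unmarked counterparts. To obtain the value estimator, we first need to estimate $\eta$ and the two conditional expectations $E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \mid x, 1 \right\}$ and $E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} \mid x, 1 \right\}$. A general semiparametric estimator for $\eta$ can be obtained by solving the following equation:

$$P_n \left[ \frac{\varphi(u, y; \eta) - a}{\varphi(u, y; \eta)}\, g(x; \eta) \right] = 0, \tag{7}$$

where $g(x; \eta)$ is a calibration function with the same dimension as $\eta$. Although this estimator achieves consistency and asymptotic normality under certain regularity conditions, its efficiency is not guaranteed. To minimize the estimation variability introduced by $\hat\eta$, we need to derive the efficient estimator of $\eta$, denoted as $\hat\eta_{\mathrm{eff}}$. This estimator can be obtained by solving the estimating equation based on the efficient score $S_{\eta,\mathrm{eff}}$ given in Theorem 4.5,

$$P_n \left[ \frac{\varphi(\eta) - a}{\varphi(\eta)} \cdot \frac{E\left\{ \dfrac{\dot\varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, x, 1 \right\}}{E\left\{ \dfrac{\varphi(\eta) - 1}{\varphi(\eta)^2} \,\Big|\, x, 1 \right\}} \right] = 0. \tag{8}$$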

However, the closed forms of the two conditional expectations in (8) are unknown and need to be approximated. We consider the following two scenarios.
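Before turning to the two scenarios, it may help to see why the calibration equation (7) is computable under one-sided feedback: the summand $\{\varphi - a\}/\varphi = 1 - a/\varphi$ equals 1 for rejected units, so outcomes are needed only for approved units. The sketch below solves a one-parameter version by bisection, assuming for illustration a logistic mechanism $\varphi(y; \eta) = \mathrm{expit}(\eta + \text{slope} \cdot y)$ with a known slope and $g \equiv 1$; these modeling choices and the toy data are assumptions, not the specification used in the paper.

```python
import math


def expit(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def calibration_gap(eta, y, a, slope):
    """P_n[(phi - a) / phi] with g = 1 and phi = expit(eta + slope * y).

    For rejected units (a = 0) the summand is 1, so their y is never read.
    """
    total = 0.0
    for yi, ai in zip(y, a):
        if ai == 1:
            total += 1.0 - 1.0 / expit(eta + slope * yi)
        else:
            total += 1.0
    return total / len(a)


def solve_eta(y, a, slope, lo=-20.0, hi=20.0, tol=1e-12):
    """Bisection; the gap is increasing in eta, negative at lo, positive at hi."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if calibration_gap(mid, y, a, slope) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Toy data: outcomes of rejected units are missing (None) and never used.
y_obs = [1.0, 0.0, None, None]
a_obs = [1, 1, 0, 0]

eta_no_y = solve_eta(y_obs, a_obs, slope=0.0)    # mechanism ignores the outcome
eta_with_y = solve_eta(y_obs, a_obs, slope=0.5)  # mechanism depends on the outcome
```

With `slope=0.0` the solution reproduces the empirical approval rate, as expected for an intercept-only model; with a non-zero slope the intercept shifts to keep the inverse-probability weights calibrated.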

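Scenario I: When the outcome $Y$ is binary, say $Y \in \{0, 1\}$, we can specify a model for $P(Y = 1 \mid X, A = 1)$ and denote its estimator by $\hat P(Y = 1 \mid X, A = 1)$. The conditional expectations in (8) can be estimated by

$$\hat E\left\{ \frac{\dot\varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, X, A = 1 \right\} = \frac{1}{\varphi(U, 1; \eta)^2} \frac{\partial \varphi(U, 1; \eta)}{\partial \eta}\, \hat P(Y = 1 \mid X, A = 1) + \frac{1}{\varphi(U, 0; \eta)^2} \frac{\partial \varphi(U, 0; \eta)}{\partial \eta}\, \{1 - \hat P(Y = 1 \mid X, A = 1)\},$$

and

$$\hat E\left\{ \frac{\varphi(\eta) - 1}{\varphi(\eta)^2} \,\Big|\, X, A = 1 \right\} = \frac{\varphi(U, 1; \eta) - 1}{\varphi(U, 1; \eta)^2}\, \hat P(Y = 1 \mid X, A = 1) + \frac{\varphi(U, 0; \eta) - 1}{\varphi(U, 0; \eta)^2}\, \{1 - \hat P(Y = 1 \mid X, A = 1)\}.$$

Thus we can get the efficient estimator $\hat\eta_{\mathrm{eff}}$ by solving (8). Next, the conditional expectations in (6) can be estimated by

$$\hat E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \,\Big|\, X, A = 1 \right\} = \frac{1 - \varphi(U, 1; \hat\eta_{\mathrm{eff}})}{\varphi(U, 1; \hat\eta_{\mathrm{eff}})^2}\, \hat P(Y = 1 \mid X, A = 1),$$

and

$$\hat E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, X, A = 1 \right\} = \frac{1 - \varphi(U, 1; \hat\eta_{\mathrm{eff}})}{\varphi(U, 1; \hat\eta_{\mathrm{eff}})^2}\, \hat P(Y = 1 \mid X, A = 1) + \frac{1 - \varphi(U, 0; \hat\eta_{\mathrm{eff}})}{\varphi(U, 0; \hat\eta_{\mathrm{eff}})^2}\, \{1 - \hat P(Y = 1 \mid X, A = 1)\}.$$

By plugging the estimators $\hat\eta_{\mathrm{eff}}$, $\hat E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \mid X, A = 1 \right\}$, and $\hat E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} \mid X, A = 1 \right\}$ into (6), we obtain the value estimator and denote it as $\hat V_{\mathrm{eff}}(\pi)$.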
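For a binary outcome, the plug-in step of Scenario I reduces to a few arithmetic operations per unit. The sketch below assumes that the fitted assignment probabilities $\varphi(U, 1; \hat\eta_{\mathrm{eff}})$, $\varphi(U, 0; \hat\eta_{\mathrm{eff}})$ and the fitted $\hat P(Y = 1 \mid X, A = 1)$ are already available as numbers; the function names are hypothetical and the inputs in the example are invented.

```python
def weight(phi: float) -> float:
    """(1 - phi) / phi^2, the weight inside both conditional expectations in (6)."""
    return (1.0 - phi) / phi ** 2


def outcome_ratio_binary(phi_y1: float, phi_y0: float, p_hat: float) -> float:
    """Ratio of the two conditional expectations in (6) for Y in {0, 1}.

    phi_y1, phi_y0: fitted phi(U, 1; eta) and phi(U, 0; eta) for this unit.
    p_hat:          fitted P(Y = 1 | X, A = 1) for this unit.
    """
    numerator = weight(phi_y1) * p_hat            # the Y = 0 term vanishes
    denominator = weight(phi_y1) * p_hat + weight(phi_y0) * (1.0 - p_hat)
    return numerator / denominator


def unit_contribution(a: int, y: float, phi_obs: float, ratio: float) -> float:
    """Bracketed term of (6) for one unit; y is read only when a = 1."""
    if a == 1:
        return y / phi_obs + (1.0 - 1.0 / phi_obs) * ratio
    return ratio


def value_estimate(pi_vals, contributions) -> float:
    """P_n[pi(x) * contribution], i.e. the estimator in (6)."""
    return sum(p * c for p, c in zip(pi_vals, contributions)) / len(pi_vals)


# Worked numbers: phi(U,1) = 0.5, phi(U,0) = 0.25, P_hat = 0.5.
ratio = outcome_ratio_binary(0.5, 0.25, 0.5)           # 1 / (1 + 6) = 1/7
c_rejected = unit_contribution(0, 0.0, 0.25, ratio)    # imputed term only
c_approved = unit_contribution(1, 1.0, 0.5, ratio)     # 2 - 1/7
v_hat = value_estimate([1.0, 1.0], [c_rejected, c_approved])
```

A rejected unit contributes only the imputed ratio, while an approved unit combines its inverse-weighted outcome with an augmentation term of opposite sign.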

Scenario II: When the outcome $Y$ is continuous, one can still first model the conditional density $f(Y \mid X, A = 1)$. However, density estimation often requires large sample sizes and complex algorithms to achieve accurate estimates. It can be computationally intensive and prone to high variance, particularly in high-dimensional spaces. Instead, we propose a two-step estimation strategy. In step 1, we find a root-$n$ consistent estimator $\hat\eta^{(1)}$. For example, we can choose a simple calibration function $g(x; \eta)$ and solve equation (7). In step 2, we construct the pseudo-outcomes $\dot\varphi(\hat\eta^{(1)}) / \varphi^2(\hat\eta^{(1)})$ and $\{\varphi(\hat\eta^{(1)}) - 1\} / \varphi^2(\hat\eta^{(1)})$; the estimators of the conditional expectations, $\hat E\left\{ \frac{\dot\varphi(\eta)}{\varphi(\eta)^2} \mid X, A = 1 \right\}$ and $\hat E\left\{ \frac{\varphi(\eta) - 1}{\varphi(\eta)^2} \mid X, A = 1 \right\}$, can then be obtained using regression with these pseudo-outcomes. Thus we can get the efficient estimator $\hat\eta_{\mathrm{eff}}$ by solving (8). Similarly, to estimate the conditional expectations in (6), we can construct the pseudo-outcomes $\frac{1 - \varphi(\hat\eta_{\mathrm{eff}})}{\varphi(\hat\eta_{\mathrm{eff}})^2} Y$ and $\frac{1 - \varphi(\hat\eta_{\mathrm{eff}})}{\varphi(\hat\eta_{\mathrm{eff}})^2}$. The estimators $\hat E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \mid X, A = 1 \right\}$ and $\hat E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} \mid X, A = 1 \right\}$ can be obtained using regression with these pseudo-outcomes. By plugging the estimators $\hat\eta_{\mathrm{eff}}$, $\hat E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \mid X, A = 1 \right\}$, and $\hat E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} \mid X, A = 1 \right\}$ into (6), we obtain the value estimator and denote it as $\hat V_{\mathrm{eff}}(\pi)$.
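The pseudo-outcome regression in Scenario II can be sketched as follows. The example uses approved units only, takes the fitted $\varphi$ values as given, and replaces the nonparametric learner with a deliberately crude binned-mean regressor on a scalar covariate; this stand-in, the helper names, and the toy numbers are assumptions made for illustration.

```python
from collections import defaultdict


def binned_regression(x, target, width=1.0):
    """Stand-in nonparametric regressor: mean of `target` within bins of x.

    Querying a bin that contains no training point raises ZeroDivisionError.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for xi, ti in zip(x, target):
        b = int(xi // width)
        sums[b] += ti
        counts[b] += 1
    return lambda x_new: sums[int(x_new // width)] / counts[int(x_new // width)]


def fit_outcome_ratio(x, y, phi, width=1.0):
    """Fit on approved units only.

    Builds the pseudo-outcomes w*Y and w with w = (1 - phi) / phi^2,
    regresses each on x, and returns x -> E_hat{wY | x, 1} / E_hat{w | x, 1}.
    No conditional density of Y is estimated.
    """
    w = [(1.0 - p) / p ** 2 for p in phi]
    num = binned_regression(x, [wi * yi for wi, yi in zip(w, y)], width)
    den = binned_regression(x, w, width)
    return lambda x_new: num(x_new) / den(x_new)


# Toy approved sample (hypothetical numbers).
x_appr = [0.2, 0.8, 1.5, 1.6]
y_appr = [1.0, 3.0, 10.0, 20.0]
phi_appr = [0.5, 0.5, 0.5, 0.25]

ratio_fn = fit_outcome_ratio(x_appr, y_appr, phi_appr)
r_first_bin = ratio_fn(0.5)    # constant phi in this bin: plain mean of y
r_second_bin = ratio_fn(1.2)   # low-phi unit (y = 20) is up-weighted
```

When $\varphi$ is constant within a bin the ratio equals the ordinary mean outcome, whereas units that were unlikely to be approved receive larger weights and pull the ratio toward their outcomes.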

We now establish the theoretical results for the proposed value estimator. We first make the following assumptions for the nuisance functions and their approximations.

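Assumption 5.1 For all $x \in \mathcal{X}$, (i) $\min\{|k_1(x)|, |\hat k_1(x)|\} > 0$, where $k_1(x) = E\left\{ \frac{\varphi(\eta) - 1}{\varphi(\eta)^2} \mid x, 1 \right\}$; (ii) for any $k_2(x) \in \left\{ E\left\{ \frac{\dot\varphi(\eta)}{\varphi(\eta)^2} \mid x, 1 \right\}, E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \mid x, 1 \right\} \right\}$, $\max\{|k_2(x)|, |\hat k_2(x)|\} < \infty$; (iii) for any $k_3(x) \in \left\{ E\left\{ \frac{\varphi(\eta) - 1}{\varphi(\eta)^2} \mid x, 1 \right\}, E\left\{ \frac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \mid x, 1 \right\}, E\left\{ \frac{\dot\varphi(\eta)}{\varphi(\eta)^2} \mid x, 1 \right\} \right\}$, $\hat k_3(x) \xrightarrow{p} k_3(x)$.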

Assumption 5.1 (i) and (ii) require that the conditional expectations and their estimations are bounded. Assumption 5.1 (iii) requires that the conditional expectations are consistently estimated. In the case of a binary outcome, the estimation of P(Y = 1 | X, A = 1) is required to be consistent. For continuous outcomes, given the root-n consistency of bη(1), we only require that the regression with constructed pseudo-outcomes is consistent. This can be achieved by various machine and deep learning models (e.g. Kennedy, 2016; Farrell et al., 2021).

Theorem 5.2 Under Assumptions 4.1, 4.2, and 5.1 (i)-(ii), $\hat V_{\mathrm{eff}}(\pi)$ is a consistent estimator for $V_1(\pi)$. Additionally, if Assumption 5.1 (iii) holds, $\hat V_{\mathrm{eff}}(\pi)$ achieves the semiparametric efficiency bound $\Upsilon(\pi)$.

5.2 FROM EFFICIENT DECISION EVALUATION TO LEARNING

In this section, we consider a deterministic decision rule class $\Pi$ and propose a method based on the efficient estimator $\hat V_{\mathrm{eff}}(\pi)$ to learn the optimal decision rule, $\pi^* = \arg\max_{\pi \in \Pi} V_1(\pi)$. A natural estimator for the optimal decision rule $\pi^*$ would be $\hat\pi = \arg\max_{\pi \in \Pi} \hat V_{\mathrm{eff}}(\pi)$. However, this direct search poses a significant challenge, as it typically involves non-convex and non-smooth optimization problems and can be computationally expensive. The following proposition transforms it into a weighted classification problem.

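Proposition 5.3 Maximizing the value estimator $\hat V_{\mathrm{eff}}(\pi)$ is equivalent to a weighted classification problem of minimizing the following loss function over $\pi \in \Pi$,

$$\frac{1}{n} \sum_{i=1}^n I\left\{ I\{\hat\psi(x_i, y_i, a_i) > 0\} \neq \pi(x_i) \right\} |\hat\psi(x_i, y_i, a_i)|, \tag{9}$$

where

$$\hat\psi(x_i, y_i, a_i) = \frac{a_i}{\varphi_i(\hat\eta_{\mathrm{eff}})} y_i + \left\{ 1 - \frac{a_i}{\varphi_i(\hat\eta_{\mathrm{eff}})} \right\} \frac{\hat E\left\{ \dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} Y \,\Big|\, x_i, 1 \right\}}{\hat E\left\{ \dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, x_i, 1 \right\}}, \quad \text{for } 1 \le i \le n.$$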

With Proposition 5.3, we have transformed optimal decision rule learning into a weighted classification problem (9) where, for subject $i$ with features $x_i$, the true label is $I\{\hat\psi(x_i, y_i, a_i) > 0\}$ and the sample weight is $|\hat\psi(x_i, y_i, a_i)|$. The choice of classification approach dictates the restricted class $\Pi$. Compared to a direct search, a classification-based optimizer facilitates handling more complex functional classes and allows for the use of off-the-shelf machine learning and deep learning software packages.
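The equivalence rests on the pointwise identity $\pi\psi = \psi\, I(\psi > 0) - |\psi|\, I\{I(\psi > 0) \neq \pi\}$ for $\pi \in \{0, 1\}$, so the estimated value equals a constant minus the weighted misclassification loss. The sketch below checks this on invented per-unit scores $\hat\psi$ over a hypothetical class of threshold rules; in practice any classifier that accepts sample weights could play the role of the search.

```python
def value_hat(pi_vals, psi):
    """P_n[pi(x) * psi], the plug-in value of a deterministic rule."""
    return sum(p * s for p, s in zip(pi_vals, psi)) / len(psi)


def weighted_loss(pi_vals, psi):
    """Weighted misclassification loss: label I{psi > 0}, weight |psi|."""
    return sum(abs(s) for p, s in zip(pi_vals, psi)
               if (1 if s > 0 else 0) != p) / len(psi)


# Hypothetical scalar covariate and per-unit scores psi_hat.
x = [0.1, 0.3, 0.5, 0.7, 0.9]
psi = [-2.0, 0.5, -0.4, 1.5, 3.0]

# Hypothetical rule class: approve when x exceeds a threshold t.
thresholds = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
rules = {t: [1 if xi > t else 0 for xi in x] for t in thresholds}

# Constant term P_n[psi * I(psi > 0)], which does not depend on the rule.
const = sum(max(s, 0.0) for s in psi) / len(psi)

best_by_value = max(thresholds, key=lambda t: value_hat(rules[t], psi))
best_by_loss = min(thresholds, key=lambda t: weighted_loss(rules[t], psi))
```

Both searches select the same threshold, and for every rule the value and the loss sum to the same constant.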

6 EXPERIMENTS

We have carried out extensive simulation studies and a real data application to evaluate the performance of the proposed methods.


6.1 SYNTHETIC SCENARIOS

We compare the proposed method with three alternative methods. One consistent but not efficient estimator for $\eta$ is the solution to the estimating equation (7) with a simple choice of $g(x; \eta)$. We denote this estimator as $\hat\eta_{\mathrm{naive}}$. The first estimator for the value function is the IPW estimator with $\hat\eta_{\mathrm{naive}}$: $\hat V_{\mathrm{IPW\text{-}naive}}(\pi) = P_n\left[ \frac{a}{\varphi(\hat\eta_{\mathrm{naive}})}\, y\, \pi(x) \right]$. The second estimator is also an IPW estimator but with $\hat\eta_{\mathrm{eff}}$: $\hat V_{\mathrm{IPW\text{-}eff}}(\pi) = P_n\left[ \frac{a}{\varphi(\hat\eta_{\mathrm{eff}})}\, y\, \pi(x) \right]$. The third estimator is the DR estimator (Zhang et al., 2012; Dudík et al., 2014): $\hat V_{\mathrm{DR}}(\pi) = P_n\left[ \pi(x) \left\{ \frac{a}{\hat w(x)} \{y - \hat E(y \mid x)\} + \hat E(y \mid x) \right\} \right]$.
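The comparison estimators are simple sample averages once the nuisance estimates are available. The following sketch takes the fitted $\varphi$, $\hat w$, and $\hat E(y \mid x)$ values as given inputs (how they are fitted is outside the snippet), stores the unobserved outcome of a rejected unit as a placeholder that is multiplied by $a = 0$, and uses invented numbers.

```python
def ipw_value(pi_vals, a, y, phi):
    """IPW estimator: P_n[ a / phi * y * pi(x) ]."""
    return sum(p * ai * yi / ph
               for p, ai, yi, ph in zip(pi_vals, a, y, phi)) / len(a)


def dr_value(pi_vals, a, y, w_hat, m_hat):
    """DR estimator: P_n[ pi(x) { a / w_hat (y - m_hat) + m_hat } ].

    w_hat approximates P(A=1 | x); m_hat approximates E(y | x).
    """
    return sum(p * (ai / w * (yi - m) + m)
               for p, ai, yi, w, m in zip(pi_vals, a, y, w_hat, m_hat)) / len(a)


# Toy inputs (hypothetical). y is a placeholder 0.0 where a = 0.
pi_vals = [1, 1, 0, 1]
a = [1, 0, 1, 1]
y = [2.0, 0.0, 1.0, -1.0]
phi = [0.5, 0.5, 0.25, 0.5]
w_hat = [0.5, 0.5, 0.5, 0.5]
m_hat = [1.0, 1.0, 1.0, 0.0]

v_ipw = ipw_value(pi_vals, a, y, phi)
v_dr = dr_value(pi_vals, a, y, w_hat, m_hat)
```

The two IPW variants in the text differ only in which estimate of $\eta$ produces `phi`; the DR estimator relies on $\hat w(x)$ and an outcome model that do not account for dependence of the action on $Y(1)$.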

Decision Evaluation: We first generate covariates $X = (X_1, X_2, X_3)^T \sim N((1, 1, 0)^T, \Sigma)$, where

$$\Sigma = \begin{pmatrix} 1 & 0.25 & 0.25 \\ 0.25 & 1 & 0.25 \\ 0.25 & 0.25 & 1 \end{pmatrix}.$$

We consider two types of potential outcome, continuous and binary.

Case 1: The potential outcome $Y(1)$ is generated by $Y(1) = 8X_1 - 4X_1^2 - 4X_2 + 4X_3^2 + \epsilon$, where $\epsilon$ is generated from a normal distribution with mean 0 and standard deviation 0.5. The action $A$ is generated from $A \sim \mathrm{Bernoulli}\{\varphi(X, Y(1))\}$, with $\varphi(X, Y(1)) = 1/[1 + \exp\{0.5 - X_1 - X_2 - 0.1 Y(1)\}]$. Thus, $X_3$ is the shadow variable. We construct three different evaluation decision rules as mixtures of a deterministic decision rule $\pi_d(X) = I(2X_1 - X_1^2 - X_2 + X_3^2 > 0)$ and the uniform random decision rule $\pi_u(X)$ by changing a mixture parameter $\alpha$, i.e., $\pi(X) = \alpha \pi_d(X) + (1 - \alpha) \pi_u(X)$. The candidate values of the mixture parameter $\alpha$ are $\{0.6, 0.3, 0.0\}$.

Case 2: The potential outcome $Y(1)$ follows a Bernoulli distribution with probability of success $1/\{1 + \exp(X_1 + X_2 + X_3)\}$. The action $A$ is generated from $A \sim \mathrm{Bernoulli}\{\varphi(X, Y(1))\}$, with $\varphi(X, Y(1)) = 1/[1 + \exp\{0.5 + X_1 + X_2 + 0.5 Y(1)\}]$. Thus, $X_3$ is the shadow variable. We construct three different evaluation decision rules as mixtures of a deterministic decision rule $\pi_d(X) = I(X_1 + X_2 + X_3 < 0)$ and the uniform random decision rule $\pi_u(X)$ by changing a mixture parameter $\alpha$, i.e., $\pi(X) = \alpha \pi_d(X) + (1 - \alpha) \pi_u(X)$. The candidate values of the mixture parameter $\alpha$ are $\{0.7, 0.4, 0.0\}$.

For both cases, the true value function for each evaluation decision rule is obtained by generating a large sample $\{X_i, Y_i(1)\}_{i=1}^N$ with size $N = 10^5$ and applying the empirical version of $V_1(\pi) = E[Y(1)\pi(X)]$. We consider a correctly specified logistic regression model for $\varphi(\eta)$. We obtain $\hat\eta_{\mathrm{naive}}$ using $g(x; \eta) = (1, x_1, x_2, x_3)^T$. We obtain the efficient estimators $\hat\eta_{\mathrm{eff}}$ and $\hat V_{\mathrm{eff}}(\pi)$ using the approach introduced in Section 5. Specifically, in case 1, all the regressions with pseudo-outcomes use random forest (RF) models. In case 2, we estimate $P(Y = 1 \mid X, A = 1)$ using a generalized additive model (GAM). For the DR estimator, we estimate $w(x)$ using GAM in both cases. We estimate $E(y \mid x)$ using RF in case 1 and using GAM in case 2.
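The Monte Carlo computation of the true value can be sketched as below for a Case 1 style design. Several ingredients are assumptions rather than confirmed details: the signs in the outcome and assignment models follow a reading of the formulas in which the missing operators are minus signs, the covariate mean is taken as $(1, 1, 0)$, and the uniform random rule is interpreted as $\pi_u(X) = 0.5$. The equicorrelated normal vector is generated with a shared-factor construction instead of a matrix factorization.

```python
import math
import random


def draw_case1(rng, rho=0.25, mean=(1.0, 1.0, 0.0)):
    """One draw of (X, Y(1), A); signs and mean vector are assumptions."""
    shared = rng.gauss(0.0, 1.0)
    # Unit variances and pairwise correlation rho via a shared factor.
    x = [m + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0) + math.sqrt(rho) * shared
         for m in mean]
    y1 = 8 * x[0] - 4 * x[0] ** 2 - 4 * x[1] + 4 * x[2] ** 2 + rng.gauss(0.0, 0.5)
    phi = 1.0 / (1.0 + math.exp(0.5 - x[0] - x[1] - 0.1 * y1))
    a = 1 if rng.random() < phi else 0   # X3 does not enter phi (shadow variable)
    return x, y1, a


def pi_d(x):
    return 1.0 if 2 * x[0] - x[0] ** 2 - x[1] + x[2] ** 2 > 0 else 0.0


def pi_mix(x, alpha):
    # Assumes the uniform random rule approves with probability 0.5.
    return alpha * pi_d(x) + (1.0 - alpha) * 0.5


def monte_carlo_v1(alpha, n, seed=0):
    """Empirical version of V1(pi) = E[Y(1) pi(X)] on a large synthetic sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y1, _a = draw_case1(rng)
        total += y1 * pi_mix(x, alpha)
    return total / n


v_u = monte_carlo_v1(0.0, 5000)
v_mix = monte_carlo_v1(0.6, 5000)
v_d = monte_carlo_v1(1.0, 5000)
```

Because the value is linear in the rule, the value of a mixture is the same mixture of the component values on any fixed sample, which gives a quick sanity check of the simulation code; the paper uses $N = 10^5$ rather than the smaller size used here.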

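We consider samples with size $n = 1000, 2000$. For each case, we conduct 500 replications. The root-mean-square error (RMSE), the standard deviation (SD), and the bias results for cases 1 and 2 are reported in Table 1 and Table 2.

Table 1: Decision evaluation results for case 1: (a) $0.0\pi_d + 1.0\pi_u$, (b) $0.3\pi_d + 0.7\pi_u$, (c) $0.6\pi_d + 0.4\pi_u$.

| n | Estimator | (a) RMSE | (a) SD | (a) Bias | (b) RMSE | (b) SD | (b) Bias | (c) RMSE | (c) SD | (c) Bias |
|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | $\hat V_{\mathrm{eff}}$ | 0.3512 | 0.3480 | 0.0468 | 0.5509 | 0.5483 | 0.0530 | 0.7999 | 0.7977 | 0.0591 |
| 1000 | $\hat V_{\mathrm{IPW\text{-}naive}}$ | 0.7893 | 0.7890 | -0.0229 | 0.8279 | 0.8278 | -0.0127 | 0.8740 | 0.8740 | -0.0024 |
| 1000 | $\hat V_{\mathrm{IPW\text{-}eff}}$ | 0.6172 | 0.6119 | 0.0807 | 0.8426 | 0.8387 | 0.0809 | 1.0852 | 1.0822 | 0.0810 |
| 1000 | $\hat V_{\mathrm{DR}}$ | 0.4421 | 0.1559 | 0.4138 | 0.4371 | 0.1842 | 0.3964 | 0.4364 | 0.2162 | 0.3790 |
| 2000 | $\hat V_{\mathrm{eff}}$ | 0.2003 | 0.1985 | 0.0274 | 0.2016 | 0.2005 | 0.0209 | 0.2169 | 0.2165 | 0.0143 |
| 2000 | $\hat V_{\mathrm{IPW\text{-}naive}}$ | 0.7057 | 0.7026 | -0.0662 | 0.7363 | 0.7341 | -0.0575 | 0.7733 | 0.7718 | -0.0489 |
| 2000 | $\hat V_{\mathrm{IPW\text{-}eff}}$ | 0.2563 | 0.2539 | 0.0353 | 0.2771 | 0.2761 | 0.0228 | 0.3121 | 0.3119 | 0.0103 |
| 2000 | $\hat V_{\mathrm{DR}}$ | 0.3647 | 0.1077 | 0.3485 | 0.3538 | 0.1245 | 0.3312 | 0.3455 | 0.1444 | 0.3139 |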
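The three reported summaries are linked by $\mathrm{RMSE}^2 = \mathrm{SD}^2 + \mathrm{Bias}^2$, which is a convenient way to read the tables: an estimator can have a small SD and still a large RMSE if it is biased. The helper below computes the summaries from a list of replicate estimates; it uses the divisor $n$ for the SD, which is an assumption about the convention (with 500 replications the choice of divisor is immaterial at the reported precision).

```python
import math


def summarize(estimates, truth):
    """Return (RMSE, SD, bias) of replicate estimates around a known truth."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - truth
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / n)
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    return rmse, sd, bias


# Tiny illustrative input (not simulation output).
rmse, sd, bias = summarize([1.0, 2.0, 3.0], truth=1.0)

# Reported (SD, Bias) pairs for rule (a), n = 1000, recombined into an RMSE.
rmse_eff_from_table = math.hypot(0.3480, 0.0468)
rmse_dr_from_table = math.hypot(0.1559, 0.4138)
```

Recombining the reported SD and bias reproduces the reported RMSE values for these rows up to rounding, and makes visible that the RMSE of the DR estimator is dominated by its bias.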

Table 2: Decision evaluation results for case 2: (a) $0.0\pi_d + 1.0\pi_u$, (b) $0.4\pi_d + 0.6\pi_u$, (c) $0.7\pi_d + 0.3\pi_u$.

| n | Estimator | (a) RMSE | (a) SD | (a) Bias | (b) RMSE | (b) SD | (b) Bias | (c) RMSE | (c) SD | (c) Bias |
|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | $\hat V_{\mathrm{eff}}$ | 0.0172 | 0.0172 | -0.0005 | 0.0207 | 0.0207 | -0.0008 | 0.0239 | 0.0239 | -0.0011 |
| 1000 | $\hat V_{\mathrm{IPW\text{-}naive}}$ | 0.0204 | 0.0204 | -0.0001 | 0.0246 | 0.0246 | -0.0003 | 0.0282 | 0.0282 | -0.0005 |
| 1000 | $\hat V_{\mathrm{IPW\text{-}eff}}$ | 0.0179 | 0.0179 | -0.0006 | 0.0219 | 0.0219 | -0.0009 | 0.0254 | 0.0253 | -0.0012 |
| 1000 | $\hat V_{\mathrm{DR}}$ | 0.0196 | 0.0097 | 0.0170 | 0.0223 | 0.0124 | 0.0185 | 0.0248 | 0.0152 | 0.0196 |
| 2000 | $\hat V_{\mathrm{eff}}$ | 0.0119 | 0.0119 | -0.0005 | 0.0142 | 0.0142 | -0.0009 | 0.0163 | 0.0163 | -0.0013 |
| 2000 | $\hat V_{\mathrm{IPW\text{-}naive}}$ | 0.0141 | 0.0141 | -0.0003 | 0.0167 | 0.0167 | -0.0006 | 0.0190 | 0.0190 | -0.0009 |
| 2000 | $\hat V_{\mathrm{IPW\text{-}eff}}$ | 0.0122 | 0.0122 | -0.0004 | 0.0148 | 0.0147 | -0.0007 | 0.0171 | 0.0170 | -0.0009 |
| 2000 | $\hat V_{\mathrm{DR}}$ | 0.0179 | 0.0069 | 0.0166 | 0.0198 | 0.0087 | 0.0178 | 0.0215 | 0.0106 | 0.0187 |

We have the following observations. $\hat V_{\mathrm{eff}}$, $\hat V_{\mathrm{IPW\text{-}naive}}$, and $\hat V_{\mathrm{IPW\text{-}eff}}$ are nearly unbiased with sample sizes $n = 1000, 2000$. However, $\hat V_{\mathrm{DR}}$ has a substantially larger bias than the other estimators. This is because the NUC assumption is violated in this setting. Among the three consistent estimators $\hat V_{\mathrm{eff}}$, $\hat V_{\mathrm{IPW\text{-}naive}}$, and $\hat V_{\mathrm{IPW\text{-}eff}}$, $\hat V_{\mathrm{eff}}$ has the smallest standard deviation and RMSE, as expected. One interesting observation is that for case 1, when the sample size is $n = 1000$, the standard deviations of $\hat V_{\mathrm{IPW\text{-}naive}}$ with decision rules (b) and (c) are smaller than those of $\hat V_{\mathrm{IPW\text{-}eff}}$. One possible reason is that when the sample size is small, the nonparametric regressions with pseudo-outcomes may exhibit larger variation. As the sample size increases, the standard deviations and RMSEs of the three consistent estimators $\hat V_{\mathrm{eff}}$, $\hat V_{\mathrm{IPW\text{-}naive}}$, and $\hat V_{\mathrm{IPW\text{-}eff}}$ become smaller.

Decision Learning: We consider the same covariates as those used in decision evaluation. The potential outcome is generated by $Y(1) = 8X_1 - 6X_1^2 - 4X_2 + 2X_3^2 + \epsilon$, where $\epsilon$ is generated from a normal distribution with mean 0 and standard deviation 0.25. The action $A$ is generated from $A \sim \mathrm{Bernoulli}\{\varphi(X, Y(1))\}$, with $\varphi(X, Y(1)) = 1/[1 + \exp\{0.5 - X_1 - X_2 - 0.15 Y(1)\}]$. We construct four estimators following the same procedure as in decision evaluation. We use a tree-based classification algorithm introduced in Zhou et al. (2023) and focus on depth-2 decision trees for illustration. To evaluate and compare the performance of estimated optimal decision rules obtained by different methods, we compute the corresponding value functions and percentages of making correct decisions (PCD). Again, we generate a large sample $\{X_i, Y_i(1)\}_{i=1}^N$ with size $N = 10^5$. For a fixed decision rule $\pi$, its value function is computed using the empirical version of $V_1(\pi) = E[Y(1)\pi(X)]$. We then maximize the value function and obtain the oracle optimal depth-2 decision tree, denoted as $\pi^*$. For each estimated optimal decision rule $\hat\pi$, its associated value function is computed using the generated large sample, and the PCD is computed by $1 - N^{-1} \sum_{i=1}^N |\hat\pi(X_i) - \pi^*(X_i)|$. We report the value and PCD results for the decision rules obtained by different methods in Figure 1. We observe that the decision rule obtained by our proposed method has the best performance among all methods, in terms of both values and PCDs. For our proposed method, as the sample size increases, the means of values become larger, PCDs get close to 1, and the standard deviations of values and PCDs become smaller.
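For deterministic rules, the PCD is the fraction of units on which the learned rule agrees with the oracle rule, that is, one minus the mean absolute disagreement. A minimal sketch with hypothetical decisions and outcomes follows; the helper names are illustrative.

```python
def pcd(pi_hat_vals, pi_star_vals):
    """Share of units where the learned rule matches the oracle rule."""
    n = len(pi_hat_vals)
    return 1.0 - sum(abs(p - q) for p, q in zip(pi_hat_vals, pi_star_vals)) / n


def empirical_value(pi_vals, y1):
    """Empirical V1(pi) on an evaluation sample where Y(1) is simulated."""
    return sum(p * y for p, y in zip(pi_vals, y1)) / len(y1)


# Hypothetical decisions on four evaluation units.
learned = [1, 0, 1, 1]
oracle = [1, 1, 1, 0]
y1_eval = [2.0, 1.0, 0.5, -3.0]

agreement = pcd(learned, oracle)
value_learned = empirical_value(learned, y1_eval)
value_oracle = empirical_value(oracle, y1_eval)
```

In this toy example the two rules disagree on half of the units, and the learned rule loses value both by rejecting a profitable unit and by approving an unprofitable one.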

Figure 1: The values and PCDs of estimated optimal decision rules.


6.2 REAL DATA APPLICATION

In this section, we apply our method to a loan application dataset from a fintech company. A simulated dataset based on the real data is available upon request. The fintech lender aims to provide short-term credit to young salaried professionals by using their mobile and social footprints to determine their creditworthiness. To get a loan, a customer needs to download the lending app, submit all the requisite details and documentation, and give the lender permission to gather additional information from the smartphone, such as the number of apps and SMSs. We obtained data from the lending firm for all loans granted from February 2016 to November 2018. There are 42,777 customers in total. We select a set of covariates X, which includes the applicant's age, salary, loan amount, CIBIL credit score, number of apps, number of SMSs, number of contacts, and number of social connections. The action A indicates whether or not the lender approves the loan application. The outcome Y is defined as 1 if the loan is repaid and -1 if the applicant defaults on the loan. We conduct hypothesis testing, and our analysis reveals no significant evidence that the number of social connections violates Assumption 4.1. Therefore, we treat it as an SV.

We randomly sample training data of sizes 3000 and 5000. We compare the four estimators introduced in Section 6.1, which are constructed using the same procedure for the binary outcome. Specifically, we estimate $E(Y \mid X)$ for the DR method and $P(Y \mid X, A = 1)$ for the proposed method using GAM. For the DR method, we estimate $w(X)$ using GAM as well. We consider a logistic regression model for $\varphi(\eta)$ that uses all covariates (excluding the SV) and the potential outcome as predictors. We obtain $\hat\eta_{\mathrm{naive}}$ using $g(X, \eta) = (1, X^T)^T$. We use the same classification algorithm as in the synthetic scenarios to estimate the optimal decision rule. The proposed efficient estimator computed over the entire dataset is used as the testing value. The training-testing procedure is repeated 100 times. We report the testing values in Figure 2. We observe that the average value of the proposed method is much larger than those of the other three methods, while its variability is smaller. This implies that the proposed method performs better than the other three methods.

Figure 2: The boxplots of testing values under estimated optimal decision rules by different methods.

7 CONCLUSION

In this paper, we propose a novel framework for causal decision making under the one-sided feedback setting. Specifically, we define a new value function for this task and provide identification leveraging SVs, without assuming NUC. We develop efficient evaluation and learning methods motivated by semiparametric theory. Numerical experiments and a real-world data application demonstrate the empirical performance of our proposed methods. Although this work focuses on the contextual bandits setting, our method has significant potential for extension to many semi-supervised learning tasks (Hu et al., 2022; Sportisse et al., 2023) and generative models (Ma & Zhang, 2021; Ipsen et al., 2021) with non-random missing data.

REFERENCES

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135. PMLR, 2013.

Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.

Peter M Aronow and Allison Carnegie. Beyond LATE: Estimation of the average treatment effect with an instrumental variable. Political Analysis, 21(4):492–506, 2013.

Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.

Ryan S Baker and Aaron Hawn. Algorithmic bias in education. International Journal of Artificial Intelligence in Education, 32(4):1052–1092, 2022.

Peter J Bickel, CAJ Klaassen, Y Ritov, and JA Wellner. Efficient and Adaptive Inference in Semiparametric Models. Johns Hopkins University Press, Baltimore, 1993.

Jianing Chu, Wenbin Lu, and Shu Yang. Targeted optimal treatment regime learning using summary statistics. Biometrika, 110(4):913–931, 2023a.

Jianing Chu, Shu Yang, and Wenbin Lu. Multiply robust off-policy evaluation and learning under truncation by death. In International Conference on Machine Learning, pp. 6195–6227. PMLR, 2023b.

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. JMLR Workshop and Conference Proceedings, 2011.

Amanda Coston, Ashesh Rambachan, and Alexandra Chouldechova. Characterizing fairness over the set of good models under selective labels. In International Conference on Machine Learning, pp. 2144–2155. PMLR, 2021.

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. Econometrica, 89(1):181–213, 2021.

Miguel A Hernán and James M Robins. Instruments for causal inference: an epidemiologist's dream? Epidemiology, 17(4):360–372, 2006.

Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

Xinting Hu, Yulei Niu, Chunyan Miao, Xian-Sheng Hua, and Hanwang Zhang. On non-random missing labels in semi-supervised learning. arXiv preprint arXiv:2206.14923, 2022.

Guido W Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1):4–29, 2004.

Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

Niels Bruun Ipsen, Pierre-Alexandre Mattei, and Jes Frellsen. not-MIWAE: Deep generative modelling with missing not at random data. In ICLR 2021-International Conference on Learning Representations, 2021.

Heinrich Jiang, Qijia Jiang, and Aldo Pacchiano. Learning the truth from only one side of the story. In International Conference on Artificial Intelligence and Statistics, pp. 2413–2421. PMLR, 2021.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661. PMLR, 2016.


Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167), 2020.

Edward H Kennedy. Semiparametric theory and empirical processes in causal inference. Statistical Causal Inferences and Their Applications in Public Health Research, pp. 141–167, 2016.

Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2018.

Himabindu Lakkaraju, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 275–284, 2017.

Baohong Li, Haoxuan Li, Ruoxuan Xiong, Anpeng Wu, Fei Wu, and Kun Kuang. Learning shadow variable representation for treatment effect estimation under collider bias. In Proceedings of the 41st International Conference on Machine Learning, pp. 28146–28163. PMLR, 2024.

Chao Ma and Cheng Zhang. Identifiable generative models for missing not at random data imputation. Advances in Neural Information Processing Systems, 34:27645–27658, 2021.

Wang Miao, Lan Liu, Eric Tchetgen Tchetgen, and Zhi Geng. Identification, doubly robust estimation, and semiparametric efficiency theory of nonignorable missing data with a shadow variable. arXiv preprint arXiv:1509.02556, 2015.

Wang Miao, Peng Ding, and Zhi Geng. Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association, 111(516):1673–1683, 2016.

Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2939, 2016.

Whitney K Newey and James L Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.

Aldo Pacchiano, Shaun Singh, Edward Chou, Alex Berg, and Jakob Foerster. Neural pseudo-label optimism for the bank loan problem. Advances in Neural Information Processing Systems, 34:6580–6593, 2021.

Manish Raghavan, Solon Barocas, Jon Kleinberg, and Karen Levy. Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 469–481, 2020.

Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Donald B Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.

Jun Shao and Lei Wang. Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika, 103(1):175–187, 2016.

Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596–608, 2020.

Aude Sportisse, Hugo Schmutz, Olivier Humbert, Charles Bouveyron, and Pierre-Alexandre Mattei. Are labels informative in semi-supervised learning? Estimating and leveraging the missing-data mechanism. In International Conference on Machine Learning, pp. 32521–32539. PMLR, 2023.

Anastasios A Tsiatis. Semiparametric theory and missing data. Springer, 2006.


Linbo Wang and Eric Tchetgen Tchetgen. Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80:531–550, 2018.

Sheng Wang, Jun Shao, and Jae Kwang Kim. An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, pp. 1097–1116, 2014.

Shu Yang, Linbo Wang, and Peng Ding. Causal inference with confounders missing not at random. Biometrika, 106(4):875–888, 2019.

Baqun Zhang, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018, 2012.

Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pp. 11492–11502. PMLR, 2020.

Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1):148–183, 2023.


A TECHNICAL PROOFS

A.1 PROOF OF THEOREM 4.3

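$$
\begin{aligned}
E\{Y(1) \mid X = x\} &= E\{Y(1) \mid X = x, A = 1\}\,w(x) + E\{Y(1) \mid X = x, A = 0\}\{1 - w(x)\} \\
&= w(x)\int y f(y \mid x, 1)\,dy + \{1 - w(x)\}\int y f(y \mid x, 0)\,dy \\
&= w(x)\int y f(y \mid x, 1)\,dy + \int y\{1 - w(x)\}f(y \mid x, 0)\,dy \\
&= w(x)\int y f(y \mid x, 1)\,dy + \int y f(y \mid x, 1)\,\frac{f(y \mid x, 0)\{1 - w(x)\}}{f(y \mid x, 1)}\,dy \\
&= w(x)\int y f(y \mid x, 1)\,dy + \int y f(y \mid x, 1)\,w(x)\left\{\frac{1}{\varphi(x, y)} - 1\right\}dy \\
&= w(x)\int y f(y \mid x, 1)\,dy + w(x)\int y f(y \mid x, 1)\left\{\frac{1}{\varphi(x, y)} - 1\right\}dy \\
&= w(x)\int y\,\frac{f(y \mid x, 1)}{\varphi(x, y)}\,dy.
\end{aligned}
$$

$$
\begin{aligned}
V_1(\pi) = E\{Y(1)\pi(X)\} &= E\left(E[\{Y(1)\pi(X)\} \mid X]\right) \\
&= \int f(x)\pi(x)E\{Y(1) \mid X = x\}\,dx \\
&= \int f(x)w(x)\left\{\int y\,\frac{f(y \mid x, 1)}{\varphi(x, y)}\,dy\right\}\pi(x)\,dx.
\end{aligned}
$$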

To identify $V_1(\pi)$, we need to identify $f(x)$, $w(x)$, $f(y \mid x, 1)$, and $\varphi(x, y)$. The likelihood function for a single observation is

$$
f(x)\,w(x)^{a}\{1 - w(x)\}^{1-a} f(y \mid x, 1)^{a}.
$$

A key observation is that

$$
w(x)^{-1} = \int \frac{f(y \mid x, 1)}{\varphi(x, y)}\,dy.
$$
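Both identities used so far, the inverse-weighting form of $E\{Y(1) \mid X = x\}$ and the expression for $w(x)^{-1}$, can be checked numerically on a toy example. The sketch below fixes one covariate value and uses a discrete $Y(1)$; the probabilities and propensities are invented for illustration only.

```python
# Toy check at one fixed covariate value x (all numbers are made up).
p_y = {0.0: 0.2, 1.0: 0.5, 2.0: 0.3}    # f(y | x): law of Y(1) given X = x
phi = {0.0: 0.25, 1.0: 0.5, 2.0: 0.8}   # phi(x, y) = P(A = 1 | X = x, Y(1) = y)

# w(x) = P(A = 1 | X = x) = sum_y phi(x, y) f(y | x)
w = sum(phi[y] * p_y[y] for y in p_y)

# f(y | x, 1): law of Y(1) among approved units, by Bayes' rule
f_y1 = {y: phi[y] * p_y[y] / w for y in p_y}

# Key observation: sum_y f(y | x, 1) / phi(x, y) should equal 1 / w(x)
inv_w = sum(f_y1[y] / phi[y] for y in p_y)

# Identification formula: w(x) * sum_y y f(y | x, 1) / phi(x, y)
mean_via_weighting = w * sum(y * f_y1[y] / phi[y] for y in p_y)

# Target: E{Y(1) | X = x}, computed from the full (unobservable) law
mean_direct = sum(y * p_y[y] for y in p_y)
```

Here `inv_w` matches `1 / w` and `mean_via_weighting` matches `mean_direct` up to floating-point error, even though the propensity depends on the outcome itself.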

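Under Assumption 4.1, $\varphi(x, y) = P\{A = 1 \mid X = x, Y(1) = y\} = P\{A = 1 \mid U = u, Y(1) = y\} = \varphi(u, y)$, and the likelihood function becomes

$$
f(x)\left\{\int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy\right\}^{-a}\left[1 - \left\{\int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy\right\}^{-1}\right]^{1-a} f(y \mid x, 1)^{a}.
$$

Assume we have two different sets of models, $f(x)$, $f(y \mid x, 1)$, $\varphi(u, y)$ and $\tilde f(x)$, $\tilde f(y \mid x, 1)$, $\tilde\varphi(u, y)$, such that

$$
\begin{aligned}
& f(x)\left\{\int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy\right\}^{-a}\left[1 - \left\{\int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy\right\}^{-1}\right]^{1-a} f(y \mid x, 1)^{a} \\
&\quad = \tilde f(x)\left\{\int \frac{\tilde f(y \mid x, 1)}{\tilde\varphi(u, y)}\,dy\right\}^{-a}\left[1 - \left\{\int \frac{\tilde f(y \mid x, 1)}{\tilde\varphi(u, y)}\,dy\right\}^{-1}\right]^{1-a} \tilde f(y \mid x, 1)^{a}.
\end{aligned}
\tag{10}
$$

Taking $a = 0$ in (10), we have

$$
f(x)\left[1 - \left\{\int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy\right\}^{-1}\right] = \tilde f(x)\left[1 - \left\{\int \frac{\tilde f(y \mid x, 1)}{\tilde\varphi(u, y)}\,dy\right\}^{-1}\right].
\tag{11}
$$

Taking $a = 1$ in (10) and integrating with respect to $Y(1)$ on both sides, we have

$$
f(x)\left\{\int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy\right\}^{-1} = \tilde f(x)\left\{\int \frac{\tilde f(y \mid x, 1)}{\tilde\varphi(u, y)}\,dy\right\}^{-1}.
\tag{12}
$$

By Equations (11) and (12), we have

$$
f(x) = \tilde f(x) \quad\text{and}\quad \int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy = \int \frac{\tilde f(y \mid x, 1)}{\tilde\varphi(u, y)}\,dy.
$$

Taking $a = 1$ in (10), we have

$$
f(x)\left\{\int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy\right\}^{-1} f(y \mid x, 1) = \tilde f(x)\left\{\int \frac{\tilde f(y \mid x, 1)}{\tilde\varphi(u, y)}\,dy\right\}^{-1} \tilde f(y \mid x, 1).
$$

Thus, we have $f(y \mid x, 1) = \tilde f(y \mid x, 1)$. Finally, from

$$
\int \frac{f(y \mid x, 1)}{\varphi(u, y)}\,dy = \int \frac{f(y \mid x, 1)}{\tilde\varphi(u, y)}\,dy
$$

and Assumption 4.2, we have $\varphi(u, y) = \tilde\varphi(u, y)$. Thus, $f(x)$, $w(x)$, $f(y \mid x, 1)$, and $\varphi(x, y)$ are all identified. The value function $V_1(\pi)$ is then identified.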

A.2 PROOF OF THEOREM 4.4

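Proof. Let $O = \{AY, A, X\}$ summarize the vector of observed variables with the likelihood factorized as

$$
f(O) = f(X)\,w(X)^{A}\{1 - w(X)\}^{1-A} f(Y \mid X, A = 1)^{A}.
$$

We consider a one-dimensional parametric submodel $f_{\theta_1}(X)$ for $f(X)$, and a one-dimensional parametric submodel $f_{\theta_2}(Y \mid X, A = 1)$ for $f(Y \mid X, A = 1)$, respectively. The submodel $f_{\theta_1}(X)$ contains the true model $f(X)$ at $\theta_1 = 0$, i.e., $f_{\theta_1}(X)|_{\theta_1 = 0} = f(X)$. Similarly, the submodel $f_{\theta_2}(Y \mid X, A = 1)$ contains the true model $f(Y \mid X, A = 1)$ at $\theta_2 = 0$, i.e., $f_{\theta_2}(Y \mid X, A = 1)|_{\theta_2 = 0} = f(Y \mid X, A = 1)$. The submodel for the likelihood is

$$
f_{\theta_1,\theta_2}(O) = f_{\theta_1}(X)\,w_{\theta_2}(X)^{A}\{1 - w_{\theta_2}(X)\}^{1-A} f_{\theta_2}(Y \mid X, A = 1)^{A}.
$$

The corresponding score functions are

$$
\frac{\partial \log f_{\theta_1,\theta_2}(O)}{\partial\theta_1} = \frac{\partial \log f_{\theta_1}(X)}{\partial\theta_1},
$$

$$
\frac{\partial \log f_{\theta_1,\theta_2}(O)}{\partial\theta_2} = A\,\frac{\partial \log f_{\theta_2}(Y \mid X, A = 1)}{\partial\theta_2} + \frac{w_{\theta_2}(X) - A}{1 - w_{\theta_2}(X)}\,E\left\{\frac{\partial \log f_{\theta_2}(Y \mid X, A = 1)}{\partial\theta_2} \,\Big|\, X\right\}.
$$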

By the semiparametric theory (Bickel et al., 1993; Tsiatis, 2006), we have the nuisance tangent spaces

$$
\Lambda_1 = \left[h_1(X) : E\{h_1(X)\} = 0\right],
$$

$$
\Lambda_2 = \left[A h_2(X, Y(1)) + \frac{w(X) - A}{1 - w(X)}\,E\{h_2(X, Y(1)) \mid X\} : E\{h_2(X, Y(1)) \mid X, A = 1\} = 0\right].
$$


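It is easy to verify that $\Lambda_1 \perp \Lambda_2$. Consider a generic mean-zero element in $\Lambda^{\perp}$, $A g_1(X, Y(1)) + (1 - A) g_2(X)$. Since $\Lambda_1 \perp \Lambda^{\perp}$, for any measurable mean-zero function $h_1(X)$, we have

$$
\begin{aligned}
E[\{A g_1(X, Y(1)) + (1 - A) g_2(X)\}h_1(X)] &= E\left(E[\{A g_1(X, Y(1)) + (1 - A) g_2(X)\}h_1(X) \mid X]\right) \\
&= E\left([w(X)E\{g_1(X, Y(1)) \mid X, A = 1\} + \{1 - w(X)\}g_2(X)]\,h_1(X)\right) \\
&= 0.
\end{aligned}
$$

Therefore, $w(X)E\{g_1(X, Y(1)) \mid X, A = 1\} + \{1 - w(X)\}g_2(X)$ is a constant and we denote it as $c$. Since $A g_1(X, Y(1)) + (1 - A) g_2(X)$ is mean zero, we have

$$
E\{A g_1(X, Y(1)) + (1 - A) g_2(X)\} = E[w(X)E\{g_1(X, Y(1)) \mid X, A = 1\} + \{1 - w(X)\}g_2(X)] = E(c) = 0.
$$

Therefore, we have

$$
w(X)E\{g_1(X, Y(1)) \mid X, A = 1\} + \{1 - w(X)\}g_2(X) = 0.
\tag{13}
$$

Since $\Lambda_2 \perp \Lambda^{\perp}$, we have

$$
\begin{aligned}
& E\left[\{A g_1(X, Y(1)) + (1 - A) g_2(X)\}\left\{A h_2(X, Y(1)) + \frac{w(X) - A}{1 - w(X)}\,E\{h_2(X, Y(1)) \mid X\}\right\}\right] \\
&= E\left[w(X)E\{g_1(X, Y(1))h_2(X, Y(1)) \mid X, A = 1\} + g_2(X)E\{h_2(X, Y(1)) \mid X\}\right] \\
&= E\left[w(X)E\{g_1(X, Y(1))h_2(X, Y(1)) \mid X, A = 1\} + w(X)g_2(X)E\left\{\frac{h_2(X, Y(1))}{\varphi(\eta)} \,\Big|\, X, A = 1\right\}\right] \\
&= E\left(E\left[w(X)\left\{g_1(X, Y(1)) + \frac{g_2(X)}{\varphi(\eta)}\right\}h_2(X, Y(1)) \,\Big|\, X, A = 1\right]\right) = 0.
\end{aligned}
$$

Therefore, $g_1(X, Y(1)) + g_2(X)/\varphi(\eta)$ is a function of $X$ and we denote it as $k(X)$:

$$
k(X) = g_1(X, Y(1)) + \frac{g_2(X)}{\varphi(\eta)}.
$$

Taking the conditional expectation given $X$ and $A = 1$ on both sides, and by (13), we have

$$
k(X) = E\{g_1(X, Y(1)) \mid X, A = 1\} + \frac{g_2(X)}{w(X)} = g_2(X).
$$

Therefore, we have

$$
g_2(X) = g_1(X, Y(1)) + \frac{g_2(X)}{\varphi(\eta)}, \qquad A g_1(X, Y(1)) + (1 - A) g_2(X) = \frac{\varphi(\eta) - A}{\varphi(\eta)}\,g_2(X),
$$

and $\Lambda^{\perp} = \left\{\dfrac{\varphi(\eta) - A}{\varphi(\eta)}\,g_2(X)\right\}$. This completes the proof.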
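As a quick sanity check on this characterization (a reader's verification, not part of the original proof), every element of the stated form is automatically mean zero. Writing $g(X)$ for an arbitrary function of $X$ and using $E\{A \mid X, Y(1)\} = \varphi(\eta)$:

$$
E\left\{\frac{\varphi(\eta) - A}{\varphi(\eta)}\,g(X)\right\} = E\left[g(X)\,E\left\{1 - \frac{A}{\varphi(\eta)} \,\Big|\, X, Y(1)\right\}\right] = E\left[g(X)\left\{1 - \frac{\varphi(\eta)}{\varphi(\eta)}\right\}\right] = 0.
$$

So no additional centering constraint on $g$ is needed, which is consistent with the constant $c$ in the derivation being forced to zero.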

A.3 PROOF OF THEOREM 4.5

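Proof. The score function for $\eta$ is

$$
S_\eta = \frac{A - w(X)}{1 - w(X)}\,E\left\{\frac{\dot\varphi(\eta)}{\varphi(\eta)} \,\Big|\, X\right\},
$$

where $\dot\varphi(\eta) = \partial\varphi(\eta)/\partial\eta$. The efficient score for $\eta$ is the projection of the score function $S_\eta$ onto the space $\Lambda^{\perp}$. Notice that $S_\eta \perp \Lambda_1$. Therefore, we can write

$$
\frac{A - w(X)}{1 - w(X)}\,E\left\{\frac{\dot\varphi(\eta)}{\varphi(\eta)} \,\Big|\, X\right\} = \underbrace{A\,b(X, Y(1)) + \frac{w(X) - A}{1 - w(X)}\,E\{b(X, Y(1)) \mid X\}}_{\Lambda_2} + \underbrace{\frac{\varphi(\eta) - A}{\varphi(\eta)}\,c(X)}_{\Lambda^{\perp}},
\tag{14}
$$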


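where $E\{b(X, Y(1)) \mid X, A = 1\} = 0$. Letting $A = 1$ in (14), we have

$$
E\left\{\frac{\dot\varphi(\eta)}{\varphi(\eta)} \,\Big|\, X\right\} = b(X, Y(1)) - E\{b(X, Y(1)) \mid X\} + \frac{\varphi(\eta) - 1}{\varphi(\eta)}\,c(X).
$$

By taking $E(\cdot \mid X)$ on both sides, we have

$$
c(X) = \frac{E\left\{\dfrac{\dot\varphi(\eta)}{\varphi(\eta)} \,\Big|\, X\right\}}{1 - E\left\{\dfrac{1}{\varphi(\eta)} \,\Big|\, X\right\}} = \frac{E\left\{\dfrac{\dot\varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, X, A = 1\right\}}{E\left\{\dfrac{\varphi(\eta) - 1}{\varphi(\eta)^2} \,\Big|\, X, A = 1\right\}}.
$$

Therefore,

$$
S_{\eta,\mathrm{eff}} = \frac{\varphi(\eta) - A}{\varphi(\eta)}\cdot\frac{E\left\{\dfrac{\dot\varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, X, A = 1\right\}}{E\left\{\dfrac{\varphi(\eta) - 1}{\varphi(\eta)^2} \,\Big|\, X, A = 1\right\}}.
$$

Letting $A = 0$ in (14), we can further derive that

$$
b(X, Y(1)) = \left\{\frac{1}{\varphi(\eta)} - \frac{1}{w(X)}\right\}c(X).
$$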
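One way to confirm that this $b$ is admissible (a reader's check under the setup above, not taken from the original text) is to verify the constraint $E\{b(X, Y(1)) \mid X, A = 1\} = 0$ required of elements generating $\Lambda_2$. Using the key observation from the proof of Theorem 4.3, $E\{1/\varphi(\eta) \mid X, A = 1\} = w(X)^{-1}$, we get

$$
E\{b(X, Y(1)) \mid X, A = 1\} = c(X)\left[E\left\{\frac{1}{\varphi(\eta)} \,\Big|\, X, A = 1\right\} - \frac{1}{w(X)}\right] = c(X)\left\{\frac{1}{w(X)} - \frac{1}{w(X)}\right\} = 0.
$$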

A.4 PROOF OF THEOREM 4.6

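Proof. We consider a one-dimensional parametric submodel $f_\alpha(X)$ for $f(X)$, and a one-dimensional parametric submodel $f_\beta(Y \mid X, A = 1)$ for $f(Y \mid X, A = 1)$, respectively. The submodel $f_\alpha(X)$ contains the true model $f(X)$ at $\alpha = \alpha_0$, i.e., $f_{\alpha_0}(X) = f(X)$. Similarly, the submodel $f_\beta(Y \mid X, A = 1)$ contains the true model $f(Y \mid X, A = 1)$ at $\beta = \beta_0$, i.e., $f_{\beta_0}(Y \mid X, A = 1) = f(Y \mid X, A = 1)$. Let $\theta = (\alpha, \beta)$. The submodel for the likelihood can be represented as

$$
f_{\theta,\eta}(O) = f_\alpha(X)\{w_{\beta,\eta}(X)\}^{A} f_\beta(Y \mid X, A = 1)^{A}\{1 - w_{\beta,\eta}(X)\}^{1-A},
$$

which contains the true model at $\theta_0 = (\alpha_0, \beta_0)$. For ease of exposition, we write $V_1(\pi)$ as $V(\pi)$. We use $\theta$ in the subscript to denote the quantity with respect to the submodel, e.g., $V_\theta(\pi)$ is the value of $V(\pi)$ in the submodel. The score functions are

$$
S_{\alpha_0} = \frac{\partial \log f_\theta(O)}{\partial\alpha}\Big|_{\theta = \theta_0} = \frac{\partial \log f_\alpha(X)}{\partial\alpha}\Big|_{\alpha = \alpha_0},
$$

$$
S_{\beta_0} = \frac{\partial \log f_\theta(O)}{\partial\beta}\Big|_{\theta = \theta_0} = A\,\frac{\partial \log f_\beta(Y \mid X, A = 1)}{\partial\beta}\Big|_{\beta = \beta_0} + \frac{w(X) - A}{1 - w(X)}\,E\left\{\frac{\partial \log f_\beta(Y \mid X, A = 1)}{\partial\beta}\Big|_{\beta = \beta_0} \,\Big|\, X\right\},
$$

$$
S_\eta = \frac{\partial \log f_\theta(O)}{\partial\eta}\Big|_{\theta = \theta_0} = \frac{A - w(X)}{1 - w(X)}\,E\left\{\frac{\partial \log \varphi(\eta)}{\partial\eta} \,\Big|\, X\right\}.
$$

Let $s_{\beta_0} = \partial \log f_\beta(Y \mid X, A = 1)/\partial\beta\,|_{\beta = \beta_0}$ and $s_\eta = \partial \log \varphi(\eta)/\partial\eta$.

By the semiparametric theory, the EIF for $V(\pi)$ must have the form

$$
\phi_{\mathrm{eff}} = \underbrace{h_1^*(X)}_{\Lambda_1} + \underbrace{A h_2^*(X, Y(1)) + \frac{w(X) - A}{1 - w(X)}\,E\{h_2^*(X, Y(1)) \mid X\}}_{\Lambda_2} + \underbrace{D^T S_{\eta,\mathrm{eff}}}_{\Lambda^{\perp}},
$$

where $E\{h_1^*(X)\} = 0$, $E\{h_2^*(X, Y(1)) \mid X, A = 1\} = 0$, and $D$ is a vector with the same dimension as $\eta$. The EIF $\phi_{\mathrm{eff}}$ for $V(\pi)$ must satisfy

$$
\frac{\partial V_\theta(\pi)}{\partial\alpha}\Big|_{\theta = \theta_0} = E(\phi_{\mathrm{eff}} S_{\alpha_0}), \qquad
\frac{\partial V_\theta(\pi)}{\partial\beta}\Big|_{\theta = \theta_0} = E(\phi_{\mathrm{eff}} S_{\beta_0}), \qquad
\frac{\partial V_\theta(\pi)}{\partial\eta^T}\Big|_{\theta = \theta_0} = E(\phi_{\mathrm{eff}} S_\eta^T).
$$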


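For $\alpha$, we have

$$
\frac{\partial V_\theta(\pi)}{\partial\alpha}\Big|_{\theta = \theta_0} = E\left[\pi(X)w(X)E\left\{\frac{Y}{\varphi(\eta)} \,\Big|\, X, A = 1\right\}S_{\alpha_0}\right], \qquad E(\phi_{\mathrm{eff}} S_{\alpha_0}) = E\{h_1^*(X)S_{\alpha_0}\}.
$$

Therefore,

$$
h_1^*(X) = \pi(X)w(X)E\left\{\frac{Y}{\varphi(\eta)} \,\Big|\, X, A = 1\right\} - V(\pi).
$$

For $\beta$, we have

$$
\frac{\partial V_\theta(\pi)}{\partial\beta}\Big|_{\theta = \theta_0} = E\left[\pi(X)\{Y(1) - E(Y(1) \mid X)\}s_{\beta_0}\right],
$$

$$
E(\phi_{\mathrm{eff}} S_{\beta_0}) = E\left(\left[\varphi(\eta)h_2^*(X, Y(1)) + \frac{w(X)}{1 - w(X)}\,E\{h_2^*(X, Y(1)) \mid X\}\right]s_{\beta_0}\right).
$$

Equating the two gives

$$
\begin{aligned}
0 &= E(\phi_{\mathrm{eff}} S_{\beta_0}) - \frac{\partial V_\theta(\pi)}{\partial\beta}\Big|_{\theta = \theta_0} \\
&= E\left(\left[\varphi(\eta)h_2^*(X, Y(1)) + \frac{w(X)}{1 - w(X)}\,E\{h_2^*(X, Y(1)) \mid X\} - \pi(X)\{Y(1) - E\{Y(1) \mid X\}\}\right]s_{\beta_0}\right) \\
&= E\left(E\left[\left\{h_2^*(X, Y(1)) + \frac{w(X)}{1 - w(X)}\,\frac{E\{h_2^*(X, Y(1)) \mid X\}}{\varphi(\eta)} - \pi(X)\,\frac{Y(1) - E\{Y(1) \mid X\}}{\varphi(\eta)}\right\}\varphi(\eta)s_{\beta_0} \,\Big|\, X\right]\right).
\end{aligned}
$$

Since $E\{\varphi(\eta)s_{\beta_0} \mid X\} = 0$, the term in braces multiplying $\varphi(\eta)s_{\beta_0}$ must be a function of $X$ and we denote it as $m(X)$:

$$
m(X) = h_2^*(X, Y(1)) + \frac{w(X)}{1 - w(X)}\,\frac{E\{h_2^*(X, Y(1)) \mid X\}}{\varphi(\eta)} - \pi(X)\,\frac{Y(1) - E\{Y(1) \mid X\}}{\varphi(\eta)}.
\tag{15}
$$

Taking the conditional expectation given $X$ and $A = 1$ on both sides, we have

$$
m(X) = \frac{E\{h_2^*(X, Y(1)) \mid X\}}{1 - w(X)}.
$$

Therefore, we have

$$
\frac{E\{h_2^*(X, Y(1)) \mid X\}}{1 - w(X)} = h_2^*(X, Y(1)) + \frac{w(X)}{1 - w(X)}\,\frac{E\{h_2^*(X, Y(1)) \mid X\}}{\varphi(\eta)} - \pi(X)\,\frac{Y(1) - E\{Y(1) \mid X\}}{\varphi(\eta)}.
$$

Taking $E(\cdot \mid X)$ on both sides,

$$
\begin{aligned}
\frac{E\{h_2^*(X, Y(1)) \mid X\}}{1 - w(X)} &= E\{h_2^*(X, Y(1)) \mid X\} + \frac{w(X)}{1 - w(X)}\,E\{h_2^*(X, Y(1)) \mid X\}\,E\{1/\varphi(\eta) \mid X\} \\
&\quad - \pi(X)\left[E\{Y(1)/\varphi(\eta) \mid X\} - E\{Y(1) \mid X\}E\{1/\varphi(\eta) \mid X\}\right].
\end{aligned}
$$

Solving for $E\{h_2^*(X, Y(1)) \mid X\}$ yields

$$
E\{h_2^*(X, Y(1)) \mid X\} = \pi(X)\,\frac{1 - w(X)}{w(X)}\cdot\frac{E\{Y(1)/\varphi(\eta) \mid X\} - E\{Y(1) \mid X\}E\{1/\varphi(\eta) \mid X\}}{E\{1/\varphi(\eta) \mid X\} - 1}.
\tag{16}
$$

By Equations (15) and (16),

$$
h_2^*(X, Y(1)) = \pi(X)\left[\left\{\frac{1}{w(X)} - \frac{1}{\varphi(\eta)}\right\}\frac{E\left\{\dfrac{Y(1)}{\varphi(\eta)} \,\Big|\, X\right\} - E\{Y(1) \mid X\}E\left\{\dfrac{1}{\varphi(\eta)} \,\Big|\, X\right\}}{E\{1/\varphi(\eta) \mid X\} - 1} + \frac{Y(1) - E\{Y(1) \mid X\}}{\varphi(\eta)}\right].
$$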
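The expressions for $h_1^*$ and $h_2^*$ can be combined into a more compact form. The following is a reader's simplification under the notation of this proof, not a step stated in the original; the symbols $\mu(X)$ and $R(X)$ are introduced here only for the check. Let $\mu(X) = E\{Y(1) \mid X\}$ and

$$
R(X) = \frac{E\left[\{1 - \varphi(\eta)\}Y/\varphi(\eta)^2 \mid X, A = 1\right]}{E\left[\{1 - \varphi(\eta)\}/\varphi(\eta)^2 \mid X, A = 1\right]}.
$$

Using $E\{g(X, Y(1)) \mid X\} = w(X)E\{g(X, Y(1))/\varphi(\eta) \mid X, A = 1\}$ for any $g$, the ratio appearing in (16) equals $R(X) - \mu(X)$, so that

$$
h_2^*(X, Y(1)) = \pi(X)\left[\left\{\frac{1}{w(X)} - \frac{1}{\varphi(\eta)}\right\}\{R(X) - \mu(X)\} + \frac{Y(1) - \mu(X)}{\varphi(\eta)}\right], \qquad h_1^*(X) = \pi(X)\mu(X) - V(\pi).
$$

Collecting the coefficients of $Y$, $R(X)$, and $\mu(X)$ then gives

$$
h_1^*(X) + A h_2^*(X, Y(1)) + \frac{w(X) - A}{1 - w(X)}\,E\{h_2^*(X, Y(1)) \mid X\} = \pi(X)\left[\frac{A}{\varphi(\eta)}\,Y + \left\{1 - \frac{A}{\varphi(\eta)}\right\}R(X)\right] - V(\pi),
$$

which has the same augmented inverse-weighting structure as the estimator analyzed in Appendix A.5.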


Vθ(π)/ η|θ=θ0 = E

π(X) E n Y (1) 1 φ(η)

φ(η) | X o φ(η) φ(η)

E π(X)Y (1) φ(η)

E(ϕeff ST η ) = DT E{Seff(η)Seff(η)T }.

By Vθ(π)/ ηT |θ=θ0 = E(ϕeff ST η ),

D = {Var(Sη,eff)} 1

π(X) E n 1 φ(η)

φ(η)2 Y | X, A = 1 o

φ(η)2 | X, A = 1 o φ(η) φ(η)

E π(X)E φ(η)

φ(η)2 Y | X, A = 1

By (I),(II), and (III), we complete the proof.

A.5 PROOF OF THEOREM 5.2

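$$
\begin{aligned}
& E\left(\pi(X)\left[\frac{A}{\varphi(\eta)}\,Y + \left\{1 - \frac{A}{\varphi(\eta)}\right\}\frac{E\left\{\dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2}\,Y \,\Big|\, X, 1\right\}}{E\left\{\dfrac{1 - \varphi(\eta)}{\varphi(\eta)^2} \,\Big|\, X, 1\right\}}\right]\right) \\
&= E\left[E\left\{\pi(X)\,\frac{A}{\varphi(\eta)}\,Y(1) \,\Big|\, X, Y(1)\right\}\right] \\
&= E\left[\frac{\pi(X)Y(1)}{\varphi(\eta)}\,E\{A \mid X, Y(1)\}\right] \\
&= E\{\pi(X)Y(1)\} = V_1(\pi).
\end{aligned}
$$

Since a solution to Equation (7) is a root-$n$ consistent estimator of $\eta$, by the strong law of large numbers and uniform consistency, we have $\widehat V_{\mathrm{eff}}(\pi) = V_1(\pi) + o_p(1)$.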
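The unbiasedness step at the start of this proof relies only on $E\{A \mid X, Y(1)\} = \varphi(\eta)$, so the augmentation term averages to zero for any function of $X$. The simulation below is a minimal sketch of that fact under an invented data-generating process; the propensity `phi`, the rule `policy`, and the augmentation function `g` are hypothetical stand-ins, and the true propensity is treated as known rather than estimated.

```python
import math
import random

random.seed(0)
n = 50_000


def phi(y):
    # Hypothetical approval propensity P(A = 1 | Y(1) = y); it depends on the
    # potential outcome itself, so there is unmeasured confounding given X.
    return 1.0 / (1.0 + math.exp(-(0.5 + 0.3 * y)))


def policy(x):
    # Hypothetical decision rule pi(x).
    return 1.0 if x > 0.5 else 0.0


def g(x):
    # Arbitrary function of X standing in for the ratio of conditional expectations.
    return 2.0 * x


oracle = 0.0     # mean of pi(X) * Y(1); needs Y(1) for everyone
augmented = 0.0  # mean of pi(X) * {A * Y / phi + (1 - A / phi) * g(X)}
for _ in range(n):
    x = random.random()
    y1 = 1.0 + x + random.gauss(0.0, 1.0)
    p = phi(y1)
    a = 1.0 if random.random() < p else 0.0
    oracle += policy(x) * y1
    # When a == 0 the summand reduces to g(x), so neither Y(1) nor phi is needed.
    augmented += policy(x) * (a * y1 / p + (1.0 - a / p) * g(x))
oracle /= n
augmented /= n
```

The two averages agree up to Monte Carlo error. The choice of the augmentation function affects only the variance, which is where the specific ratio in the efficient estimator matters.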

By Assumption 5.1 and the empirical process theory, we have

b E n φ(η) φ(η)2 | x, 1 o

b E n φ(η) 1

φ(η)2 | x, 1 o

E n φ(η) φ(η)2 | x, 1 o

φ(η)2 | x, 1 o

b E n φ(η) φ(η)2 | x, 1 o

b E n φ(η) 1

φ(η)2 | x, 1 o

E n φ(η) φ(η)2 | x, 1 o

φ(η)2 | x, 1 o

+ op(n 1/2). (17)


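For ease of exposition, let $E_1 = E\{\dot\varphi(\eta)/\varphi(\eta)^2 \mid x, 1\}$, with $\dot\varphi(\eta) = \partial\varphi(\eta)/\partial\eta$, and $E_2 = E[\{\varphi(\eta) - 1\}/\varphi(\eta)^2 \mid x, 1]$. By Assumption 5.1, we have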

( φ(bηeff) a

P φ(bηeff) a

φ(bηeff) E1 E2

" φ(bηeff) a

" φ(bηeff) a

" φ(bηeff) a

b E2 + E1(E2 b E2)

Op(n 1/2) op(1)

=op(n 1/2). (18)

By Equations (17) and (18), we have

b E n φ(η) φ(η)2 | x, 1 o

b E n φ(η) 1

φ(η)2 | x, 1 o

E n φ(η) φ(η)2 | x, 1 o

φ(η)2 | x, 1 o

+ op(n 1/2).

By taking Taylor expansion, we have

E n φ(η) φ(η)2 | x, 1 o

φ(η)2 | x, 1 o

=Pn(Sη,eff) + P

E n φ(η) φ(η)2 | x, 1 o

φ(η)2 | x, 1 o

(bη η) + op(n 1/2)

=Pn(Sη,eff) Var(Sη,eff)(bη η) + op(n 1/2). (19)

By Assumption 5.1 and the empirical process theory, we have

b Veff(π) =Pn

a φ(bηeff)y + 1 a φ(bηeff)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

1 a φ(bηeff)

b E n 1 φ(η)

φ(η)2 Y | x, 1 o

b E n 1 φ(η)

φ(η)2 | x, 1 o

1 a φ(bηeff)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

+ op(n 1/2).


For ease of exposition, let $E_3 = E[\{1 - \varphi(\eta)\}Y/\varphi(\eta)^2 \mid x, 1]$. By Assumption 5.1, we have

+ P φ(bηeff) a

φ(bηeff) E3 E2

" φ(bηeff) a

b E3 b E2 + E3

" φ(bηeff) a

b E3 b E2 + E3

" φ(bηeff) a

b E2 + E3(b E2 E2)

Op(n 1/2) op(1)

=op(n 1/2). (21)

By Equations (20) and (21), we have

b Veff(π) = Pn

a φ(bηeff)y + 1 a φ(bηeff)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

+ op(n 1/2).

By taking Taylor expansion, we have

b Veff(π) =Pn

a φ(η)y + 1 a φ(η)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

φ2(η) y + a φ(η)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

(bη η) + op(n 1/2).

By Equations (19) and (22), we have

b Veff(π) V1(π)

a φ(η)y + 1 a φ(η)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

φ2(η) y + a φ(η)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

{Var(Sη,eff)} 1Pn(Sη,eff) V1(π) + op(n 1/2)

a φ(η)y + 1 a φ(η)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

+ DT Pn(Sη,eff) V1(π) + op(n 1/2)

a φ(η)y + 1 a φ(η)

φ(η)2 Y | x, 1 o

φ(η)2 | x, 1 o

+ DT Sη,eff V1(π)

+ op(n 1/2)

=Pn {ϕeff(π)} + op(n 1/2).

This completes the proof.


A.6 PROOF OF PROPOSITION 5.3

arg max π Π b Veff(π)

= arg max π Π

i=1 π(xi) bψ(xi, yi, ai)

= arg max π Π

i=1 π(xi)| bψ(xi, yi, ai)|[I{ bψ(xi, yi, ai) > 0} I{ bψ(xi, yi, ai) 0}]

= arg max π Π

i=1 | bψ(xi, yi, ai)|I{ bψ(xi, yi, ai) > 0}

| bψ(xi, yi, ai)|[{1 π(xi)}I{ bψ(xi, yi, ai) > 0} + π(xi)I{ bψ(xi, yi, ai) 0}]

= arg max π Π

i=1 | bψ(xi, yi, ai)|I{ bψ(xi, yi, ai) > 0}

| bψ(xi, yi, ai)|[π(xi) + I{ bψ(xi, yi, ai) > 0} 2π(xi)I{ bψ(xi, yi, ai) > 0}]

= arg max π Π

i=1 | bψ(xi, yi, ai)|I{ bψ(xi, yi, ai) > 0}

| bψ(xi, yi, ai)|[π2(x) + I2{ bψ(xi, yi, ai) > 0} 2π(xi)I{ bψ(xi, yi, ai) > 0}]

= arg max π Π

i=1 | bψ(xi, yi, ai)|I{ bψ(xi, yi, ai) > 0} | bψ(xi, yi, ai)|[π(xi) I{ bψ(xi, yi, ai) > 0}]2

= arg max π Π

i=1 | bψ(xi, yi, ai)|[π(xi) I{ bψ(xi, yi, ai) > 0}]2

= arg min π Π

i=1 | bψ(xi, yi, ai)|[π(xi) I{ bψ(xi, yi, ai) > 0}]2

= arg min π Π

i=1 | bψ(xi, yi, ai)|I[π(xi) = I{ bψ(xi, yi, ai) > 0}].

Therefore, the decision learning is equivalent to a weighted classification problem, where for subject i with features xi, the true label is I{ bψ(xi, yi, ai) > 0} and the sample weight is | bψ(xi, yi, ai)|.
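The equivalence can be illustrated by brute force on a tiny example. The sketch below uses made-up pseudo-outcomes and a one-dimensional threshold policy class (both assumptions for illustration; the paper uses depth-2 trees via the algorithm of Zhou et al. (2023)) and confirms that maximizing the plug-in value and minimizing the weighted misclassification error select the same rule.

```python
# Hypothetical pseudo-outcomes psi_hat_i and a scalar covariate x_i (made-up numbers).
xs = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9, 0.2]
psi = [-1.0, 0.5, -0.25, 2.0, 1.25, -0.5, -2.0]


def rule(t):
    """Threshold rule pi_t(x) = I(x > t); this policy class is an assumption."""
    return lambda x: 1 if x > t else 0


def plug_in_value(t):
    """Sum of pi(x_i) * psi_hat_i, the objective being maximized."""
    pi = rule(t)
    return sum(pi(x) * p for x, p in zip(xs, psi))


def weighted_error(t):
    """Weighted misclassification: label I(psi_hat_i > 0), weight |psi_hat_i|."""
    pi = rule(t)
    return sum(abs(p) for x, p in zip(xs, psi) if pi(x) != (1 if p > 0 else 0))


thresholds = [0.0] + sorted(xs)
t_value = max(thresholds, key=plug_in_value)   # maximizer of the plug-in value
t_class = min(thresholds, key=weighted_error)  # minimizer of the weighted error

# The two objectives differ by a constant: the sum of |psi_hat_i| over positive labels.
positive_mass = sum(p for p in psi if p > 0)
```

For every threshold, `plug_in_value(t) + weighted_error(t)` equals `positive_mass`, which is the identity behind the chain of equalities above. In practice any classifier that accepts per-sample weights could be used in place of the exhaustive search.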