# Off-Policy Evaluation Under Nonignorable Missing Data

Han Wang¹, Yang Xu¹, Wenbin Lu¹, Rui Song¹

¹Department of Statistics, North Carolina State University, USA. Correspondence to: Rui Song.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

**Abstract**

Off-Policy Evaluation (OPE) aims to estimate the value of a target policy using offline data collected from potentially different policies. In real-world applications, however, logged data often suffers from missingness. While OPE has been extensively studied in the literature, a theoretical understanding of how missing data affects OPE results remains unclear. In this paper, we investigate OPE in the presence of monotone missingness and theoretically demonstrate that the value estimates remain unbiased under ignorable missingness but can be biased under nonignorable (informative) missingness. To retain the consistency of value estimation, we propose an inverse probability weighting value estimator and conduct statistical inference to quantify the uncertainty of the estimates. Through a series of numerical experiments, we empirically demonstrate that our proposed estimator yields more reliable value inference under missing data.

## 1. Introduction

Reinforcement learning (RL) has demonstrated many successes in various domains such as game playing (Mnih et al., 2013; Silver et al., 2016), robotic control (Kober et al., 2013), bidding (Jin et al., 2018; Xu et al., 2023), and ridesharing (Xu et al., 2018). These successes often rely on simulators to generate large amounts of interaction data for training. However, in real-world applications, direct access to the environment is usually limited, making it difficult to deploy and evaluate new policies in practice, especially in safety-critical fields like healthcare and autonomous driving. Off-Policy Evaluation (OPE) is a critical step in offline RL for estimating the value of a target policy using offline data that may have been collected under different policies (Prasad et al., 2017; Raghu et al., 2017b; Wang et al., 2018).

In recent years, there has been growing interest in high-confidence off-policy evaluation (HCOPE), which not only estimates the policy's value but also provides statistical inference to quantify the confidence in these estimates (Thomas et al., 2015; Luckett et al., 2019; Shi et al., 2021b). However, one overlooked aspect in this literature is that offline data is often incomplete due to different types of missingness. For example, the reward and next state may be absent following some actions, resulting in incomplete transition tuples. Missing data mechanisms are generally categorized as ignorable or nonignorable (informative) missingness. Ignorable missingness assumes that the pattern of missing data can be fully explained by observed variables. In contrast, nonignorable missingness occurs when the missing data depends on the outcome or reward of interest, making it a more complex and challenging issue to address. Both types of missingness frequently arise in real-world RL applications, including online advertising (Tewari & Murphy, 2017), healthcare (Smith et al., 2023; Yu et al.), robotics (Wang et al., 2019), and more.
Failing to account for the underlying missingness pattern, particularly nonignorable missingness (as discussed in later sections), can lead to biased evaluations of policy performance and, consequently, flawed decision-making. For example, in movie recommendation systems like MovieLens (Harper & Konstan, 2015), platforms aim to determine the best strategy for recommending personalized movie genres using historical user rating data. However, it is well known that users are more likely to rate movies they prefer, leading to nonignorable missingness. Imagine we want to assess user preferences for two genres, comedy and horror. If users only rate movies they enjoy and avoid providing ratings for those they dislike, the dataset might consist solely of high ratings (e.g., 5-star ratings) while lower ratings remain completely unobserved. Without accounting for this nonignorable missingness, the resulting recommendation strategy might mistakenly treat comedy and horror as equally preferred genres and recommend them at random to all users, which entirely fails to reflect actual user preferences.

A similar challenge arises in survival analysis. Healthcare data often suffers from missingness caused by factors like early discharge (nonignorable when health status is the outcome) or missed follow-ups (ignorable, as it occurs randomly and is unrelated to health status). For example, in the sepsis dataset (Komorowski et al., 2018), nonignorable missingness is evident in the shorter trajectories of deceased patients, who typically exhibit higher SOFA scores (Sepsis-related Organ Failure Assessment; Vincent et al., 1996), indicating more severe conditions. As shown in Figure 1, the average SOFA score of deceased patients is higher than that of surviving patients. Ignoring this mortality-driven missingness pattern can result in underestimating the effect of treatments on SOFA scores, potentially leading to misguided decisions about treatment effectiveness.

*Figure 1. The average SOFA scores for patients remaining in the dataset (blue) and patients who died during their ICU stay (red). The shaded region represents the 25% to 75% quantiles.*

In the RL literature, missingness is sometimes addressed by manually defining it as a special event within the reward framework. For example, in the Gridworld environment, missingness might be represented by the agent hitting a wall. In such cases, RL algorithms typically assign a large negative value (e.g., -10 or -100) to discourage the agent from taking actions that lead to these boundary states. However, the choice of penalty severity can significantly affect the evaluation of the value function. Other naive approaches might treat missingness as an additional constraint, transforming the problem into a constrained RL task (Chu et al., 2023). This leads to challenges such as balancing the original reward against the penalty for missingness. Moreover, these simplified solutions do not truly treat missingness as a distinct issue, which risks diluting the focus of the problem: what is the expected value function under the target policy, assuming the trajectories were not subject to missingness?

Building on the extensive literature on missingness in survival analysis (Goldberg & Kosorok, 2012; Dong et al., 2020; Zhao & Ma, 2022; Miao et al., 2024), we aim to propose a systematic framework for off-policy evaluation in the presence of missing data.
Unlike the simplifications described above, we treat missingness as a distinct node in the causal diagram, which can be causally influenced by both historical information and the rewards. This allows us to explicitly model its impact on policy evaluation. Specifically, we theoretically demonstrate that the original value estimator remains valid under ignorable missingness but becomes biased when the missing mechanism is nonignorable. To mitigate this bias, we propose a novel Inverse Probability Weighting (IPW) value estimator that is shown to be consistent under nonignorable missingness. Furthermore, we conduct statistical inference on the proposed value estimator and provide the associated confidence interval to quantify the uncertainty in value estimation. The effectiveness of the proposed estimator is empirically demonstrated through a simulation study and a real application to MIMIC-III data. We highlight our contributions as follows:

- Under MAR, we are the first to provide theoretical justification for traditional OPE estimators that do not explicitly account for missingness. Specifically, we identify the key conditions required to ensure the consistency and validity of these estimators.
- Under MNAR, we show that traditional OPE methods lead to biased estimates. To address this, we propose a novel value estimator that ensures consistency, with a flexible framework that supports both parametric and semi-parametric estimation.
- We are also the first to thoroughly study the asymptotic properties of OPE under MNAR, providing uncertainty quantification for the value estimate.
- The effectiveness of our approach is validated through extensive simulations and real-world data applications.

## 2. Related Work

**Off-Policy Evaluation.** OPE has been extensively studied in the literature. Existing approaches can be categorized into three classes. The first category is the Direct Method (DM), where the value is estimated by learning the transition model or fitting the Q-function via model-free function approximation (Bradtke & Barto, 1996; Lagoudakis & Parr, 2003; Le et al., 2019). The second category is the Importance Sampling (IS) method (Precup, 2000; Liu et al., 2018; Nachum et al., 2019), which re-weights the observed rewards to correct the mismatch between the data distributions of the target policy and the behavior policy. The third category is the Doubly Robust (DR) method (Jiang & Li, 2016; Tang et al., 2019; Kallus & Uehara, 2022), which combines the first two for more robust and efficient value evaluation. For a comparison of the empirical performance of various OPE approaches, we refer the reader to Voloshin et al. (2019). However, the performance of these OPE methods under missing data is seldom explored.

**High-Confidence OPE.** In addition to obtaining point estimates of the value, many applications would benefit from quantifying the uncertainty in Off-Policy Evaluation (OPE) estimates. This type of OPE method is referred to as High-Confidence Off-Policy Evaluation (HCOPE) (Thomas et al., 2015). Dai et al. (2020) estimated the value confidence interval (CI) using the empirical likelihood approach under the assumption of i.i.d. transitions, which is often violated in practice (Shi et al., 2021a). Recently, Luckett et al. (2019) and Shi et al. (2021b) advanced the asymptotic theory under Markov Decision Processes (MDPs), with the former focusing on policy learning and the latter on value inference.
Despite the extensive literature on OPE, HCOPE remains theoretically more challenging and, as a result, has been less explored in the literature.

**Missing Data.** For ignorable missingness, the IPW technique has been studied in finite-horizon decision-making (Goldberg & Kosorok, 2012; Dong et al., 2020). However, these methods depend on backward recursion, making them susceptible to model misspecification as the horizon length increases. To the best of our knowledge, there is a lack of theoretical support guaranteeing the performance of classical OPE approaches in the infinite horizon with ignorable missingness, such as under the widely studied MDP framework (Puterman, 1994). For nonignorable missingness, the problem becomes more challenging. In the absence of auxiliary variables, it has been proven that the observed likelihood is not identifiable, even within a parametric framework (Wang et al., 2014). Existing research has made some progress in identifying and estimating dropout patterns using auxiliary variables, such as shadow variables (Zhao & Ma, 2022; Miao et al., 2024) or instrumental variables (Chen et al., 2009; Wang et al., 2014; Shao & Wang, 2016; Sun et al., 2018; Tchetgen Tchetgen & Wirth, 2017). However, these approaches are limited to single-stage decision-making and do not provide the insights necessary for value estimation in sequential decision-making, which is crucial in many real-world applications, as outlined in the introduction. Among these works, Wang et al. (2014) and Tchetgen Tchetgen & Wirth (2017) focus on parametric estimation, while Kim & Yu (2011), Shao & Wang (2016), and Sun et al. (2018) emphasize semi-parametric estimation, which offers greater flexibility in specific contexts. Notably, Miao et al. (2024) advances the field by providing a non-parametric identification framework that integrates and generalizes the findings of Wang et al. (2014), Shao & Wang (2016), and Miao et al. (2016). This comprehensive framework serves as an excellent foundation for addressing the general problem of MNAR in more complex multi-stage settings such as MDPs.

## 3. Preliminaries

Assume the data follows an MDP defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(\cdot \mid s, a)$ is the Markov transition kernel that characterizes the environment dynamics, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function where larger positive rewards are preferable, and $\gamma \in (0, 1)$ is a discount factor that trades off long-term rewards against immediate rewards. In this work, we assume the state space $\mathcal{S}$ is continuous with $d$-dimensional state variables, and the action space $\mathcal{A} = \{1, \ldots, m\}$ is discrete with $m$ distinct actions. Consider a trajectory $\{(S_t, A_t, R_{t+1})\}_{t \ge 0}$ generated from the MDP model, where $(S_t, A_t, R_{t+1})$ denotes the triplet of state, action, and immediate reward. Here, we denote the reward as $R_{t+1}$ instead of $R_t$ to emphasize that the reward $R_{t+1}$ and the next state $S_{t+1}$ are jointly determined. The following assumptions are commonly imposed in infinite-horizon RL problems.

**Assumption 3.1** (Time-homogeneous Markov Assumption). The transition probability satisfies $P(S_{t+1} \in \cdot \mid S_t = s, A_t = a, \{S_j, A_j, R_{j+1}\}_{0 \le j < t}) = P(S_{t+1} \in \cdot \mid S_t = s, A_t = a) = p(\cdot \mid s, a)$.

In the presence of missingness, each trajectory may be observed only up to dropout. Let $\eta_t \in \{0, 1\}$ denote the response indicator at time $t$, with $\eta_t = 1$ if the data at time $t$ are observed; under monotone missingness, $\eta_t = 0$ implies $\eta_{t'} = 0$ for all $t' > t$. To describe the lengths of observed trajectories, we also define the dropout time $C$: $C = t$ if the subject dropped out right after action $A_t$, which corresponds to $(\eta_t, \eta_{t+1}) = (1, 0)$. If the trajectory is fully observed, $C$ is set to $T$. Given the offline data $\mathcal{D}$, our goal is to estimate and conduct statistical inference on the value function under the target policy $\pi$.
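To make this bookkeeping concrete, the following minimal Python sketch (the function and variable names are ours, not the paper's) recovers the dropout time $C$ from a monotone response-indicator sequence:

```python
import numpy as np

def dropout_time(eta: np.ndarray) -> int:
    """Dropout time C from a monotone response-indicator sequence
    eta = (eta_0, ..., eta_T): C = t if (eta_t, eta_{t+1}) = (1, 0),
    and C = T when the trajectory is fully observed."""
    T = len(eta) - 1
    for t in range(T):
        if eta[t] == 1 and eta[t + 1] == 0:
            return t
    return T

# Example: dropout right after A_2, so C = 2.
assert dropout_time(np.array([1, 1, 1, 0, 0])) == 2
```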
Although this work focuses on monotone missingness where dropout occurs after the action is observed, the proposed framework and theoretical results are applicable to more general dropout patterns. A discussion of these broader scenarios, including dropout before the action occurs and intermittent missingness, is provided in Appendix F.1.

### 4.1. Missing Data Mechanisms

There are two major types of missing data mechanisms: ignorable and nonignorable missingness. Ignorable missingness refers to the case where the missingness can be fully explained by the observed information; it is also referred to as Missing-At-Random (MAR). An example of MAR would be dropout in a clinical trial due to recorded side effects and lack of efficacy, or other known baseline characteristics. The term "randomness" in MAR implies that once one has conditioned on all the available data, any remaining missingness is completely random (Graham et al., 2009). On the other hand, if the missingness depends on unobserved components, the missing data mechanism is referred to as nonignorable, or Missing-Not-At-Random (MNAR). Dropout in a clinical trial due to unobserved current health status or other latent variables is an example of MNAR. Formal definitions of the two missing mechanisms are given as follows.

**Definition 4.1** (Ignorable Missingness, MAR). The missingness can be fully accounted for by the observed information, i.e., $\eta_{t+1} \perp (R_{t+1}, S_{t+1}) \mid (S_t, A_t, \{(S_j, A_j, R_{j+1})\}_{0 \le j < t})$.

## 5. Simulation Studies

Among the target policies we evaluate is $\pi(a = 1 \mid s) = \mathbb{1}\{s^{(1)} > 0\}$, a deterministic policy characterized by a discontinuous function of the state. For each target policy, the true policy value is estimated with 100,000 Monte Carlo approximations. Assume $S^{(2)}_t$ is a shadow variable, i.e., it is correlated with $(R_{t+1}, S_{t+1})$ but uncorrelated with the dropout propensity. We consider the MNAR dropout model
$$\lambda_1(S_t, A_t, R_{t+1}, S_{t+1}) = \{1 + \exp(2.2 + 0.15 S^{(1)}_t - 0.3 R_{t+1})\}^{-1}$$
and the MAR model
$$\lambda_2(R_t, S_t, A_t) = \{1 + \exp(2.2 + 0.15 S^{(1)}_t - 0.3 R_t)\}^{-1}.$$
The difference between the two models is that $\lambda_1$ depends on the next state $S_{t+1}$ through the reward $R_{t+1}$, while $\lambda_2$ does not. In this setting, a higher reward leads to a higher dropout propensity, so the distribution of the observed data is biased towards the low-reward region. More details about the data generation are provided in Appendix E.2.

*Table 1. Results of value estimates and 95% confidence intervals under policy π. The average bias, MSE values, and ECP are reported for each estimator (with standard errors in parentheses). IPW (P) refers to the IPW estimator with parametric estimation, while IPW (SP) refers to the semi-parametric version.*

| n | Dropout | Method | Bias | MSE | ECP |
|---|---------|--------|------|-----|-----|
| 500 | no dropout | CC | 0.013 (0.602) | 1.592 | 0.968 |
| 500 | MAR | CC | -0.028 (0.807) | 2.198 | 0.972 |
| 500 | MAR | IPW | -0.025 (0.806) | 2.207 | 0.980 |
| 500 | MNAR | CC | -0.598 (0.823) | 2.658 | 0.904 |
| 500 | MNAR | IPW (P) | -0.016 (0.851) | 2.315 | 0.976 |
| 500 | MNAR | IPW (SP) | 0.015 (0.861) | 2.340 | 0.960 |
| 1000 | no dropout | CC | -0.040 (0.443) | 1.235 | 0.968 |
| 1000 | MAR | CC | -0.029 (0.580) | 1.538 | 0.944 |
| 1000 | MAR | IPW | -0.030 (0.582) | 1.545 | 0.956 |
| 1000 | MNAR | CC | -0.614 (0.587) | 2.013 | 0.820 |
| 1000 | MNAR | IPW (P) | -0.023 (0.602) | 1.591 | 0.940 |
| 1000 | MNAR | IPW (SP) | 0.003 (0.608) | 1.596 | 0.932 |

Four different combinations of $n$ and $T$ are considered in our experiment: $(500, 10)$, $(1000, 10)$, $(500, 25)$, and $(1000, 25)$. For each setting, we run 250 experiments. In each experiment, we generate a new dataset, estimate the value, and compute its confidence interval. The Empirical Coverage Probability (ECP) is then calculated as the proportion of intervals, out of 250, that contain the true value of the target policy.
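As a reference for how such data can be generated, here is a minimal sketch (assuming the sign convention above, under which a higher reward raises the dropout propensity; all names are ours) that applies the MNAR model $\lambda_1$ to a complete trajectory and truncates it at the first dropout, following the protocol in Appendix E.2.1:

```python
import numpy as np

rng = np.random.default_rng(0)

def lambda_mnar(s1, r_next):
    # lambda_1 from Section 5: dropout propensity increases with the reward
    return 1.0 / (1.0 + np.exp(2.2 + 0.15 * s1 - 0.3 * r_next))

def apply_monotone_dropout(states, rewards, warmup=2):
    """Sample eta_{t+1} ~ Bernoulli(1 - lambda_t) along a complete trajectory
    and truncate it at the first dropout. The two-step no-dropout warm-up
    period follows Appendix E.2.1. Returns the observed prefix and C."""
    T = len(rewards)
    for t in range(warmup, T):
        lam = lambda_mnar(states[t, 0], rewards[t])  # rewards[t] stores R_{t+1}
        if rng.binomial(1, 1.0 - lam) == 0:          # eta_{t+1} = 0: dropout after A_t
            return states[: t + 1], rewards[:t], t   # (R_{t+1}, S_{t+1}) unobserved; C = t
    return states, rewards, T
```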
We present the results for $T = 10$ in the main paper; the complete set of results is provided in Appendix D. According to Table 1, the CC estimator remains consistent under MAR, which is in line with our theoretical findings. The associated confidence intervals also achieve satisfactory coverage, indicating that no further adjustment is necessary in this scenario. However, when it comes to MNAR, the CC estimator exhibits high bias, resulting in poor coverage probability of the associated confidence intervals. This under-coverage issue gets worse as the sample size grows. In contrast, the proposed IPW estimator (the last two rows of each block, with P or SP in parentheses to distinguish between the parametric and semi-parametric models used to estimate the dropout propensity $\lambda(\cdot)$, as detailed in Appendix B) effectively reduces the bias and yields more accurate confidence intervals than CC, in line with our theoretical results.

*Figure 2. Empirical coverage probability with respect to different values of α under (n, T) = (1000, 10) and target policy π.*

Figure 2 visualizes the ECP of the estimated confidence intervals at different values of $\alpha$. As can be seen, our proposed IPW estimator (under both parametric and semi-parametric dropout function estimation) achieves a coverage probability that closely matches the true value of $(1 - \alpha)$. In contrast, the CC estimator exhibits significantly lower coverage. These results further support the findings in Theorems 4.5 to 4.7.

## 6. Real Data

We now demonstrate the accuracy and stability of our value estimates by comparing them with existing baselines using a real-world sepsis dataset from the Medical Information Mart for Intensive Care (MIMIC-III v1.4) database (Johnson et al., 2016). The dataset is processed according to the cohort and inclusion/exclusion criteria outlined by Komorowski et al. (2018). Note that the construction of states, actions, rewards, and the dropout model is based on our limited medical research and is solely for data demonstration purposes. Further validation by domain experts may be necessary for healthcare applications.

Sepsis is a life-threatening condition that arises when the body's response to infection causes damage to its tissues and organs (Singer et al., 2016). For each patient, we have information on relevant physiological features, including demographics, lab values, vital signs, and treatment administration information. In order to capture the early phase of sepsis management, data is included from the diagnosis of sepsis until 48 hours following the onset of sepsis. Intravenous fluids (IV) and vasopressors (VASO) are two commonly administered interventions to correct hypotension caused by infection. Our goal is to evaluate different IV and VASO management policies using this offline dataset. After close examination of the dataset, we find that around 72% of patients did not have complete treatment trajectories during the observational time window, due either to early discharge from the ICU or to mortality. Here we focus only on missingness due to early discharge, because mortality within 48 hours of sepsis onset accounts for only a very small proportion of our data. The application of offline RL or OPE to the MIMIC-III dataset has been studied in several works (Raghu et al., 2017b; 2018; Peng et al., 2018; Li et al., 2020; Sonabend et al., 2020).
To the best of our knowledge, none of these works consider the monotone missingness issue. We construct a 14-dimensional state feature vector aggregated over a time resolution of 4 hours (see details in Appendix E.3). The action space is discretized into three bins: no intravenous fluids and no vasopressors, intravenous fluids only, and vasopressors. We adopt a reward function similar to Raghu et al. (2017a). Specifically,
$$r(S_t, A_t, S_{t+1}) = C_0\,\mathbb{1}(S^{\text{SOFA}}_{t+1} = S^{\text{SOFA}}_t \,\&\, S^{\text{SOFA}}_{t+1} > 0) + C_1\,(S^{\text{SOFA}}_{t+1} - S^{\text{SOFA}}_t) + C_2 \tanh(S^{\text{Lactate}}_{t+1} - S^{\text{Lactate}}_t) + C_3,$$
where $C_0 = -5$, $C_1 = -2.5$, $C_2 = -10$, $C_3 = 10$. A higher reward indicates that the patient is in better health.

The decision of discharge from the ICU is often made based on the current status of the patient, so it is reasonable to assume the missing mechanism is nonignorable, that is, the missingness cannot be fully accounted for by the observed information in the data. To handle nonignorable missingness, we incorporate dropout information into the value estimation. The Glasgow Coma Scale (GCS) measures a person's level of consciousness, which has been shown to be an important factor for early discharge prediction (Knight, 2003; Kramer & Zimmerman, 2010; McWilliams et al., 2019), so we include the GCS score $S^{\text{GCS}}_t$ in our dropout model. Besides, we also add the fraction of inspired oxygen (FiO2), heart rate (HR), and respiratory rate (RR) at the previous time window to the dropout model. The previous GCS score $S^{\text{GCS}}_{t-1}$ is used as a shadow variable.²

The target policies to be evaluated include a fitted behavior policy and optimal policies trained by Deep Q-Network (Mnih et al., 2015), Dueling Double Deep Q-Network (Wang et al., 2016), and Batch-Constrained Deep Q-Learning (BCQ) (Fujimoto et al., 2019). Note that these methods are used only for learning the target policies, while the entire estimation and inference process relies on either the CC or IPW method, as discussed earlier in Section 4. In our implementation, the dataset is split into two parts, with the first part used for learning the optimal policy and the second part for policy evaluation.

²The rationale for selecting the previous GCS score as a shadow variable is detailed in Appendix B.3.

*Table 2. Results of value estimates and confidence intervals using the original sepsis dataset. In the $\hat V^\pi$ column, the number within the parentheses is the standard error. For clarity in comparing CC and IPW, IPW refers specifically to IPW (SP) in this table.*

| Policy | Method | $\hat V^\pi$ | CI |
|---|---|---|---|
| Behavior | CC | 2.356 (0.334) | (1.702, 3.010) |
| Behavior | IPW | 2.377 (0.338) | (1.716, 3.039) |
| DQN | CC | 4.459 (0.505) | (3.470, 5.448) |
| DQN | IPW | 4.542 (0.506) | (3.551, 5.534) |
| Dueling DQN | CC | 4.823 (0.557) | (3.731, 5.915) |
| Dueling DQN | IPW | 4.925 (0.557) | (3.833, 6.016) |
| BCQ | CC | 5.002 (0.609) | (3.808, 6.195) |
| BCQ | IPW | 5.115 (0.608) | (3.924, 6.306) |

Table 2 presents the value estimation results. Since the dropout model $\lambda$ is often unknown in real applications, we adopt the semi-parametric estimation, i.e., IPW (SP) (see Appendix B), in this section and omit the "SP" when there is no confusion. In most cases, the IPW estimator yields higher value estimates than the CC estimator. This matches our intuition, as patients who had an early discharge were believed to be in better condition than those who did not. However, for this real-world dataset, the true dropout mechanism is unknown and hard to verify.
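For reference, a minimal sketch of this reward function (assuming our reading of the constants' signs, i.e., deterioration in SOFA and lactate is penalized; the function name is ours):

```python
import numpy as np

# Constants as in Section 6, with signs following our reading that
# worsening SOFA/lactate is penalized (higher reward = better health).
C0, C1, C2, C3 = -5.0, -2.5, -10.0, 10.0

def sepsis_reward(sofa_t, sofa_next, lactate_t, lactate_next):
    """r(S_t, A_t, S_{t+1}) in the style of Raghu et al. (2017a)."""
    stagnation = float(sofa_next == sofa_t and sofa_next > 0)
    return (C0 * stagnation
            + C1 * (sofa_next - sofa_t)
            + C2 * np.tanh(lactate_next - lactate_t)
            + C3)
```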
To better illustrate the effectiveness of the proposed IPW adjustment, we also build a synthetic dataset consisting of 4,490 complete trajectories from the whole sepsis dataset and design a custom dropout hazard model
$$\lambda(\cdot) = \big\{1 + \exp\big(4.5 - 0.8\,\mathbb{1}(S^{\text{FiO}_2}_t \le 0.6) - 0.8\,\mathbb{1}(60 \le S^{\text{HR}}_t \le 100) - 0.6\,\mathbb{1}(10 \le S^{\text{RR}}_t \le 30) - 1.5\,\mathbb{1}(S^{\text{GCS}}_{t+1} \ge 14)\big)\big\}^{-1},$$
which is used as the ground truth. We then apply this dropout model to the complete data and generate a synthetic dataset with nonignorable missingness. The OPE results are summarized in Table 3, where we also include the CC estimates calculated from the complete data as a baseline. Table 3 shows that the CC estimator tends to underestimate the value, while our proposed IPW estimator effectively reduces the bias with respect to the baseline. This pseudo-real-data experiment further illustrates the effectiveness of our approach in correcting the bias introduced by MNAR.

*Table 3. Results of value estimates and confidence intervals using the synthetic sepsis dataset. In the $\hat V^\pi$ column, the number within the parentheses is the standard error.*

| Policy | Dropout | Method | $\hat V^\pi$ | CI |
|---|---|---|---|---|
| Behavior | no dropout | CC | 1.075 (0.299) | (0.487, 1.662) |
| Behavior | MNAR | CC | 0.504 (0.467) | (-0.412, 1.420) |
| Behavior | MNAR | IPW | 0.988 (0.501) | (0.005, 1.971) |
| DQN | no dropout | CC | 2.160 (0.448) | (1.282, 3.039) |
| DQN | MNAR | CC | 1.437 (0.727) | (0.011, 2.863) |
| DQN | MNAR | IPW | 2.227 (0.742) | (0.772, 3.682) |
| Dueling DQN | no dropout | CC | 2.368 (0.472) | (1.442, 3.293) |
| Dueling DQN | MNAR | CC | 1.672 (0.750) | (0.202, 3.142) |
| Dueling DQN | MNAR | IPW | 2.450 (0.758) | (0.962, 3.938) |
| BCQ | no dropout | CC | 2.085 (0.470) | (1.163, 3.008) |
| BCQ | MNAR | CC | 1.403 (0.779) | (-0.124, 2.931) |
| BCQ | MNAR | IPW | 2.147 (0.795) | (0.589, 3.706) |

## 7. Conclusion

In this work, we comprehensively studied the problem of missingness in off-policy evaluation, providing theoretical guarantees with statistical inference on the value estimates. One limitation of our method results from the difficulty of justifying the missingness mechanism, which often requires context and subject-matter knowledge. In fact, handling nonignorable missingness remains an active research area in the field of missing data methodology, and we leave the integration of these evolving advancements for future exploration. Additionally, we believe that extending this work to learning optimal policies from offline data with nonignorable missingness presents an intriguing direction for future research.

## Acknowledgments

This research was conducted with the support of NSF Grant DMS-2113637. The authors gratefully acknowledge this funding, which made it possible to carry out the simulation studies and real data analysis.

## Impact Statement

To the best of our knowledge, this paper is the first to address the missing data problem, particularly missing not at random (MNAR), within the context of off-policy evaluation. Our approach has broad applicability to real-world domains such as healthcare, robotics, recommendation systems, and beyond. We believe that this work, along with the potential extensions discussed in Section 4.4 and the Appendices, will inspire future research aimed at improving evaluation methods and decision-making in various machine learning applications.

## References

Birmingham, J., Rotnitzky, A., and Fitzmaurice, G. M. Pattern mixture and selection models for analysing longitudinal data with monotone missing patterns. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(1):275–297, 2003.

Bradley, R. C. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.

Bradtke, S. J. and Barto, A. G.
Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.

Chen, H., Geng, Z., and Zhou, X.-H. Identifiability and estimation of causal effects in randomized trials with noncompliance and completely nonignorable missing data. Biometrics, 65(3):675–682, 2009.

Chu, J., Yang, S., and Lu, W. Multiply robust off-policy evaluation and learning under truncation by death. In International Conference on Machine Learning, pp. 6195–6227. PMLR, 2023.

Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., and Schuurmans, D. CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33:9398–9411, 2020.

De Boor, C. Splines as linear combinations of B-splines. A survey. Technical report, Wisconsin Univ Madison Mathematics Research Center, 1976.

Diggle, P. and Kenward, M. G. Informative drop-out in longitudinal data analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 43(1):49–73, 1994.

Dong, L., Laber, E., Goldberg, Y., Song, R., and Yang, S. Ascertaining properties of weighting in the estimation of optimal treatment regimes under monotone missingness. Statistics in Medicine, 39(25):3503–3520, 2020.

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. PMLR, 2019.

Goldberg, Y. and Kosorok, M. R. Q-learning with censored data. Annals of Statistics, 40(1):529, 2012.

Graham, J. W. et al. Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60(1):549–576, 2009.

Hansen, L. P. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pp. 1029–1054, 1982.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.

Huang, J. Z. et al. Projection estimation in multiple regression with application to functional ANOVA models. Annals of Statistics, 26(1):242–272, 1998.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661. PMLR, 2016.

Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2193–2201, 2018.

Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.

Kallus, N. and Uehara, M. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research, 2022.

Kim, J. K. and Yu, C. L. A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106(493):157–165, 2011.

Knight, G. Nurse-led discharge from high dependency unit. Nursing in Critical Care, 8(2):56–61, 2003.

Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., and Faisal, A. A.
The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716–1720, 2018.

Kramer, A. A. and Zimmerman, J. E. A predictive model for the early identification of patients at risk for a prolonged intensive care unit length of stay. BMC Medical Informatics and Decision Making, 10(1):1–16, 2010.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003.

Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. PMLR, 2019.

Li, L., Albert-Smet, I., and Faisal, A. A. Optimizing medical treatment for sepsis in intensive care: from reinforcement learning to pre-trial evaluation. arXiv preprint arXiv:2003.06474, 2020.

Linero, A. R. and Daniels, M. J. A flexible Bayesian approach to monotone missing data in longitudinal studies with nonignorable missingness with application to an acute schizophrenia clinical trial. Journal of the American Statistical Association, 110(509):45–55, 2015.

Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.

Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E., and Kosorok, M. R. Estimating dynamic treatment regimes in mobile health using V-learning. Journal of the American Statistical Association, 2019.

McLeish, D. L. Dependent central limit theorems and invariance principles. The Annals of Probability, 2(4):620–628, 1974.

McWilliams, C. J., Lawson, D. J., Santos-Rodriguez, R., Gilchrist, I. D., Champneys, A., Gould, T. H., Thomas, M. J., and Bourdeaux, C. P. Towards a decision support tool for intensive care discharge: machine learning algorithm development using electronic healthcare data from MIMIC-III and Bristol, UK. BMJ Open, 9(3):e025925, 2019.

Miao, W., Ding, P., and Geng, Z. Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association, 111(516):1673–1683, 2016.

Miao, W., Liu, L., Li, Y., Tchetgen Tchetgen, E. J., and Geng, Z. Identification and semiparametric efficiency theory of nonignorable missing data with a shadow variable. ACM/JMS Journal of Data Science, 1(2):1–23, 2024.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019.

Peng, X., Ding, Y., Wihl, D., Gottesman, O., Komorowski, M., Lehman, L.-w. H., Ross, A., Faisal, A., and Doshi-Velez, F. Improving sepsis treatment strategies by combining deep and kernel-based reinforcement learning. In AMIA Annual Symposium Proceedings, volume 2018, pp. 887. American Medical Informatics Association, 2018.
Prasad, N., Cheng, L.-F., Chivers, C., Draugelis, M., and Engelhardt, B. E. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. In 33rd Conference on Uncertainty in Artificial Intelligence, 2017.

Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Puterman, M. L. Markov decision processes: Discrete stochastic dynamic programming, 1994.

Raghu, A., Komorowski, M., Ahmed, I., Celi, L., Szolovits, P., and Ghassemi, M. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602, 2017a.

Raghu, A., Komorowski, M., Celi, L. A., Szolovits, P., and Ghassemi, M. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pp. 147–163. PMLR, 2017b.

Raghu, A., Komorowski, M., and Singh, S. Model-based reinforcement learning for sepsis treatment. arXiv preprint arXiv:1811.09602, 2018.

Robins, J. M., Rotnitzky, A., and Scharfstein, D. O. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pp. 1–94. Springer, 2000.

Rosenbaum, P. R. and Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Rotnitzky, A. and Robins, J. Analysis of semi-parametric regression models with non-ignorable non-response. Statistics in Medicine, 16(1):81–102, 1997.

Shao, J. and Wang, L. Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika, 103(1):175–187, 2016.

Shi, C., Wan, R., Song, R., Lu, W., and Leng, L. Does the Markov decision process fit the data: Testing for the Markov property in sequential decision making. In International Conference on Machine Learning, pp. 8807–8817. PMLR, 2020.

Shi, C., Wan, R., Chernozhukov, V., and Song, R. Deeply-debiased off-policy interval estimation. In International Conference on Machine Learning, pp. 9580–9591. PMLR, 2021a.

Shi, C., Zhang, S., Lu, W., and Song, R. Statistical inference of the value function for reinforcement learning in infinite-horizon settings. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 2021b.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Singer, M., Deutschman, C. S., Seymour, C. W., Shankar-Hari, M., Annane, D., Bauer, M., Bellomo, R., Bernard, G. R., Chiche, J.-D., Coopersmith, C. M., et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA, 315(8):801–810, 2016.

Smith, B., Khojandi, A., and Vasudevan, R. Bias in reinforcement learning: A review in healthcare applications. ACM Computing Surveys, 56(2):1–17, 2023.

Sonabend, A., Lu, J., Celi, L. A., Cai, T., and Szolovits, P. Expert-supervised reinforcement learning for offline policy learning and evaluation. Advances in Neural Information Processing Systems, 33:18967–18977, 2020.

Sun, B., Liu, L., Miao, W., Wirth, K., Robins, J., and Tchetgen, E. J. T. Semiparametric estimation with data missing not at random using an instrumental variable. Statistica Sinica, 28(4):1965, 2018.

Tang, Z., Feng, Y., Li, L., Zhou, D., and Liu, Q.
Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186, 2019.

Tchetgen Tchetgen, E. J. and Wirth, K. E. A general instrumental variable framework for regression analysis with outcome missing not at random. Biometrics, 73(4):1123–1131, 2017.

Tewari, A. and Murphy, S. A. From ads to interventions: Contextual bandits in mobile health. Mobile Health: Sensors, Analytic Methods, and Applications, pp. 495–517, 2017.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

Uehara, M., Huang, J., and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668. PMLR, 2020.

Vincent, J. L., Moreno, R., Takala, J., Willatts, S., De Mendonça, A., Bruining, H., Reinhart, C. K., Suter, P. M., and Thijs, L. G. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Medicine, 22(7):707–710, 1996.

Voloshin, C., Le, H. M., Jiang, N., and Yue, Y. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

Wang, L., Zhang, W., He, X., and Zha, H. Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2447–2456, 2018.

Wang, S., Shao, J., and Kim, J. K. An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, pp. 1097–1116, 2014.

Wang, Y., He, H., and Tan, X. Robust reinforcement learning in POMDPs with incomplete and noisy observations. arXiv preprint arXiv:1902.05795, 2019.

Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003. PMLR, 2016.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Advances in Neural Information Processing Systems, 32, 2019.

Xu, Y., Zhu, J., Shi, C., Luo, S., and Song, R. An instrumental variable approach to confounded off-policy evaluation. In International Conference on Machine Learning, pp. 38848–38880. PMLR, 2023.

Xu, Z., Li, Z., Guan, Q., Zhang, D., Li, Q., Nan, J., Liu, C., Bian, W., and Ye, J. Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 905–913, 2018.

Yu, H., Xiang, J., Kong, N., Luo, L., and Yang, C.-C. Partially observable reinforcement learning for blood glucose control under missing data. Available at SSRN 4647947.

Zahner, G. E., Pawelkiewicz, W., DeFrancesco, J. J., and Adnopoz, J. Children's mental health service needs and utilization patterns in an urban community: An epidemiological assessment. Journal of the American Academy of Child & Adolescent Psychiatry, 31(5):951–960, 1992.

Zhao, J. and Ma, Y. Optimal pseudolikelihood estimation in the analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 105(2):479–486, 2018.

Zhao, J. and Ma, Y. A versatile estimation procedure without estimating the nonignorable missingness mechanism.
Journal of the American Statistical Association, 117(540):1916–1930, 2022.

Zhao, J. and Shao, J. Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110(512):1577–1590, 2015.

Zhou, M. and Kim, J. K. An efficient method of estimation for longitudinal surveys with monotone missing data. Biometrika, 99(3):631–648, 2012.

## A. Assumptions

In this section, we provide the assumptions for the theoretical results. The following assumption is introduced by Shi et al. (2021b) to ensure consistency and the asymptotic distribution of the value estimate when there is no missing data.

**Assumption A.1.** The following conditions hold.

(a) The transition kernel $p(\cdot \mid s, a)$ is absolutely continuous with respect to the Lebesgue measure, so there exists some transition density function $q$ such that $p(ds' \mid s, a) = q(s' \mid s, a)\,ds'$. Let $\Lambda(p, c)$ denote the class of $p$-smooth functions
$$\Lambda(p, c) = \Bigg\{ h : \sup_{\|\alpha\|_1 \le \lfloor p \rfloor}\, \sup_{s \in \mathcal{S}} |D^{\alpha} h(s)| \le c,\;\; \sup_{\|\alpha\|_1 = \lfloor p \rfloor}\, \sup_{\substack{s, s' \in \mathcal{S} \\ s \ne s'}} \frac{|D^{\alpha} h(s) - D^{\alpha} h(s')|}{\|s - s'\|_2^{\,p - \lfloor p \rfloor}} \le c \Bigg\},$$
where $D^{\alpha}$ denotes the differential operator $D^{\alpha} h(s) = \partial^{\|\alpha\|_1} h(s) / (\partial s_1^{\alpha_1} \cdots \partial s_d^{\alpha_d})$, $s_j$ denotes the $j$-th element of $s$, and $\lfloor p \rfloor$ denotes the largest integer smaller than $p$. Assume there exist some $p, c > 0$ such that $r(\cdot, a),\, q(s' \mid \cdot, a) \in \Lambda(p, c)$ for any $a \in \mathcal{A}$ and $s' \in \mathcal{S}$.

(b) Let $\mathrm{BSpl}(L, r)$ denote a tensor-product B-spline basis of degree $r$ and dimension $L$ on $[0, 1]^d$, and let $\mathrm{Wav}(L, r)$ denote a tensor-product wavelet basis of regularity $r$ and dimension $L$ on $[0, 1]^d$. The sieve $\Phi_L$ is either $\mathrm{BSpl}(L, r)$ or $\mathrm{Wav}(L, r)$ with $r > \max(p, 1)$.

(c) Assume the Markov chain has a unique invariant distribution with density function $\mu(\cdot)$ on $\mathcal{S}$; the probability density function of $S_0$ is denoted by $\nu_0$. The density functions $\mu$ and $\nu_0$ are uniformly bounded away from $0$ and $\infty$ on $\mathcal{S}$.

(d) Suppose (i) and (ii) hold when $T \to \infty$ and (iii) holds when $T$ is bounded.
(i) $\lambda_{\min}\Big[\int \sum_{a \in \mathcal{A}} \big\{\xi(s, a)\xi^{\top}(s, a) - \gamma^2 u_{\pi}(s, a)\, u_{\pi}^{\top}(s, a)\big\}\, b(a \mid s)\,\mu(s)\,ds\Big] \ge \bar{c}$ for some constant $\bar{c} > 0$, where $u_{\pi}(s, a) = \mathbb{E}\{U_{\pi}(S_1) \mid S_0 = s, A_0 = a\}$ and $\lambda_{\min}(K)$ denotes the minimum eigenvalue of a matrix $K$.
(ii) The Markov chain $\{S_t\}_{t \ge 0}$ is geometrically ergodic, i.e., there exist some function $M(\cdot)$ on $\mathcal{S}$ and some constant $\rho < 1$ such that $\int_{s \in \mathcal{S}} M(s)\mu(s)\,ds < +\infty$ and $\|p^t_S(\cdot \mid s) - \mu(\cdot)\|_{TV} \le M(s)\rho^t$ for any $t \ge 0$, where $\|\cdot\|_{TV}$ denotes the total variation norm and $p^t_S(B \mid s) = P(S_t \in B \mid S_0 = s)$ is the $t$-step transition kernel.
(iii) $\lambda_{\min}\Big[\sum_{t=0}^{T-1} \mathbb{E}\big\{\xi_t \xi_t^{\top} - \gamma^2 u_{\pi}(S_t, A_t)\, u_{\pi}^{\top}(S_t, A_t)\big\}\Big] \ge T\bar{c}$ for some constant $\bar{c} > 0$.

(e) The number of basis functions $L$ satisfies $L = o\{\sqrt{nT / \log(nT)}\}$ and $L^{2p/d} \gg nT\big\{1 + \big\|\int_s \Phi_L(s)\, G(ds)\big\|_2^2\big\}$.

(f) There exists some constant $c_0 \ge 1$ such that $\delta_{\pi}(s, a) \ge c_0^{-1}$, where
$$\delta_{\pi}(s, a) = \mathbb{E}\Big[\Big\{\sum_{a' \in \mathcal{A}} \pi(a' \mid S_1)\, Q^{\pi}(S_1, a') - Q^{\pi}(S_0, A_0)\Big\}^2 \,\Big|\, S_0 = s, A_0 = a\Big],$$
for any $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $P(\max_t |R_t| \le c_0) = 1$.

These assumptions together guarantee consistent value estimation under complete data. Assumption A.1(a) essentially assumes the smoothness of the reward function $r$ and the transition density function $q$ with respect to the current state $s$; this allows us to establish the smoothness of the Q-function, which is critical when deriving inference for the value function. Assumption A.1(b) specifies the types of sieve $\Phi_L$ used to approximate the Q-function, which is more of a claim or explanation than a strict assumption. Here we consider two types of basis functions, which are standard choices for such problems and serve to simplify the analysis while ensuring general applicability.
Assumption A.1(c) is a mild condition on the marginal distribution over states; when $\nu_0 = \mu$, $\{S_t\}_{t \ge 0}$ is stationary. Assumption A.1(d) is imposed to guarantee the invertibility of $\hat\Sigma_\pi$. The geometric ergodicity condition in Assumption A.1(d)(ii) ensures that $\{S_t\}_{t \ge 0}$ is exponentially $\beta$-mixing (see Theorem 3.7 of Bradley (2005)). We remark that the geometric ergodicity condition is less restrictive than the independence assumption imposed in some of the existing reinforcement learning literature (e.g., Dai et al. (2020)). For Assumption A.1(e), the constraint on the number of basis functions $L$ controls the smoothness of the basis functions, which in turn determines how closely the linear sieve can approximate the true Q-function. In the proof of asymptotic normality, we rely on this condition on $L$ to establish that $\sup_{s \in \mathcal{S}, a \in \mathcal{A}} |Q(\pi; s, a) - \Phi_L^{\top}(s)\,\beta^{*}_{\pi, a}| = O(L^{-p/d})$, which guarantees the consistency and asymptotic behavior of $\hat\beta$ and the value estimates. Assumption A.1(f) is a mild condition on the randomness of the observed reward $R_{t+1}$ around $r(S_t, A_t)$ and on the uniform boundedness of the observed reward.

Besides the assumptions on Q-function approximation, we also make the following assumption regarding the dropout propensity.

**Assumption A.2.** We assume the following conditions hold.

(a) The dropout propensity model is correctly specified, i.e., $\lambda(S_t, A_t, R_{t+1}, S_{t+1}) = \lambda(S_t, A_t, R_{t+1}, S_{t+1}; \psi^{*})$ for some $\psi^{*}$.

(b) There exists some $c_\lambda > 0$ such that $1 - \lambda(S_t, A_t, R_{t+1}, S_{t+1}) \ge c_\lambda$.

(c) There exists a shadow variable $Z_t$ at each stage, such that $Z_t \perp \eta_{t+1} \mid (S_t, A_t, R_{t+1}, S_{t+1})$, and $Z_t \not\perp (R_{t+1}, S_{t+1}) \mid (\eta_{t+1} = 1, S_t, A_t)$.

Assumption A.2(a) ensures that, for either the parametric or the semi-parametric model specified in Appendix B, the chosen model class includes the true model of $\lambda(\cdot)$. This assumption is essential for guaranteeing the statistical properties of the parameter estimate of $\psi$. Assumption A.2(b) is similar to the positivity assumption described by Rosenbaum & Rubin (1983) in causal inference. It requires that there is always a non-zero probability, $c_\lambda$, of observing the data, ruling out cases of pure dropout. Assumption A.2(c), as explained in Section 4.3, ensures there is an auxiliary variable (i.e., a shadow variable) that helps identify the model, which is necessary for handling nonignorable missingness (Shao & Wang, 2016; Wang et al., 2014; Zhao & Ma, 2022; Miao et al., 2024).

## B. Estimation Details of the Dropout Propensity Model

To solve for $\beta_\pi$, the first step is to fit the dropout propensity model. In this section, we detail two approaches for estimating the dropout propensity, which can be incorporated into our IPW-based value estimation framework. For ignorable missingness, the dropout propensity function simplifies to $\lambda(S_t, A_t)$, since $\eta_{t+1}$ is conditionally independent of $S_{t+1}$ and $R_{t+1}$. In such cases, the propensity can be modeled with any binary classification method (see the sketch below). Unlike ignorable missingness, modeling nonignorable missingness is much more challenging. The difficulty is that if both the dropout propensity $\lambda(S_t, A_t, R_{t+1}, S_{t+1})$ and the conditional density function $f(R_{t+1}, S_{t+1} \mid S_t, A_t)$ are completely unknown, the joint distribution of $(\eta_{t+1}, R_{t+1}, S_{t+1})$ given $(S_t, A_t)$ is non-identifiable (Rotnitzky & Robins, 1997).
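Before turning to the MNAR estimators, here is a minimal sketch of the ignorable case mentioned above, where $\lambda(S_t, A_t)$ can be fit with an off-the-shelf classifier (the feature construction and names are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_mar_propensity(X: np.ndarray, dropout: np.ndarray) -> LogisticRegression:
    """Fit lambda(S_t, A_t) under MAR.
    X: stacked (S_t, A_t) features over all observed transitions;
    dropout: 1 - eta_{t+1}, the dropout indicator for each transition."""
    clf = LogisticRegression()
    clf.fit(X, dropout)
    return clf

# usage: lambda_hat = fit_mar_propensity(X, dropout).predict_proba(X)[:, 1]
```

Any other probabilistic classifier (e.g., gradient boosting) could be substituted here, since under MAR the propensity depends only on observed covariates.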
As discussed in Section 4.3 of the main paper, our goal is to solve the following estimating equation:
$$\mathbb{E}[m(Z_t, S_t, A_t, R_{t+1}, S_{t+1}; \psi)] = 0, \quad \text{where} \quad m(Z_t, S_t, A_t, R_{t+1}, S_{t+1}; \psi) = \bigg\{\frac{\eta_{t+1}}{1 - \lambda(S_t, A_t, R_{t+1}, S_{t+1}; \psi)} - 1\bigg\}\, h(S_t, A_t, Z_t), \tag{5}$$
with $h(S_t, A_t, Z_t)$ denoting a user-specified vector function of the same dimension $q$ as $\psi$. For simplicity, we write $m(Z_t, S_t, A_t, R_{t+1}, S_{t+1}; \psi)$ as $m_t(\psi)$ and $h(S_t, A_t, Z_t)$ as $h_t$ when there is no confusion. That is, $\mathbb{E}[m_t(\psi)] = 0$. Specifically, we adapt both parametric and semi-parametric approaches from the single-stage survival analysis literature that address MNAR, which are explained in the following two subsections.

### B.1. Parametric Estimation for λ(ψ) under MNAR

If the parametric form of the dropout model $\lambda(\cdot)$ is known, the simplest approach is to directly substitute this parametric form into Equation (5) and solve for $\psi$ using the Method of Moments (MoM) or gradient-based algorithms such as gradient descent (GD). In our simulation studies, we found that a generalized MoM (GMM) approach, where the weights are determined by the inverse of the covariance matrix, tends to outperform the classical MoM, especially when $\psi$ is relatively high-dimensional.

### B.2. Semi-Parametric Estimation for λ(ψ) under MNAR

Inspired by the recent development of semiparametric frameworks for modeling nonignorable missing data (Kim & Yu, 2011; Shao & Wang, 2016), we consider a semiparametric exponential tilting model for the dropout propensity:
$$\lambda(S_t, A_t, R_{t+1}, S_{t+1}; \psi) = \{1 + \exp[g(S_t, A_t) + V_{t+1}^{\top}\psi]\}^{-1},$$
where $\psi \in \mathbb{R}^q$ is an unknown tilting parameter to learn, $V_{t+1} \in \mathbb{R}^q$ are features mapped from $(R_{t+1}, S_{t+1})$, and $g(\cdot)$ is a non-parametric function of the observed variables $(S_t, A_t)$. For succinctness, we suppress the data arguments in $\lambda(S_t, A_t, R_{t+1}, S_{t+1}; \psi)$ and write it as $\lambda_{t+1}(\psi)$. Based on the definition of the shadow variable (which is the same as the instrumental variable originally introduced in Shao & Wang (2016)³), $Z_t$ can be removed from the modeling of the dropout propensity function. For clarity, we will refer to it as the shadow variable throughout the remainder of this work. Denoting the non-shadow part of $(S_t, A_t)$ as $U_t$, the exponential tilting model can be rewritten as
$$\lambda_{t+1}(\psi) = \big\{1 + \exp[g(U_t) + \psi^{\top} V_{t+1}]\big\}^{-1}. \tag{6}$$

According to the definition, a shadow variable $Z_t$ is a covariate in $(S_t, A_t)$ that is related to the outcome $(R_{t+1}, S_{t+1})$ but not related to the dropout propensity given the other covariates. With the shadow variable, multiple estimating equations can be constructed to estimate the parameters of interest, $g$ and $\psi$. If the shadow variable $Z_t$ is originally discrete with $\tilde{L}$ levels, the $\tilde{L}$ estimating equations can be constructed as
$$\mathbb{E}_{nT}\big\{\mathbb{1}(Z_t = l)\big(\eta_{t+1}(1 - \lambda_{t+1}(\psi))^{-1} - 1\big)\big\} = 0, \quad l \in \{1, \ldots, \tilde{L}\}.$$
Here we use the notation $\tilde{L}$ to differentiate it from $L$, which denotes the number of basis functions. If $Z_t$ is a continuous variable, it can first be discretized into $\tilde{L}$ bins. In order to estimate $\psi$, the non-parametric component $g$ is first profiled out with a kernel estimator. The remaining $\tilde{L} - 1$ estimating equations are then used to solve for $\psi$ via the Generalized Method of Moments (GMM) (Hansen, 1982). Notice that this semiparametric approach to estimating $\lambda(\cdot)$ involves a specialized form of the estimating equation (5), where each dimension of $h(S_t, A_t, Z_t)$ is set to $\mathbb{1}(Z_t = l)$ for $l \in \{1, \ldots, \tilde{L}\}$.
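To make the moment construction concrete, below is a minimal first-step GMM sketch for $\psi$ based on Equations (5) and (6). For simplicity it replaces the kernel-profiled $g(\cdot)$ with a working linear function of $U_t$, so it illustrates the estimating equations rather than reproducing the paper's exact profiling procedure; all names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def make_gmm_objective(eta, V, U, z_bins, n_levels):
    """Identity-weighted GMM objective for the tilting model
    lambda(psi) = 1 / (1 + exp(g(U) + V @ psi)), with g taken to be linear
    in U as a working model. h_t is the vector of shadow-variable bin
    indicators 1(Z_t = l), so the moments match Eq. (5).
    Identification requires n_levels >= dim(g_coef) + dim(psi)."""
    H = np.eye(n_levels)[z_bins]                 # (nT, L_tilde) bin indicators

    def objective(theta):
        g_coef, psi = theta[: U.shape[1]], theta[U.shape[1]:]
        lam = 1.0 / (1.0 + np.exp(U @ g_coef + V @ psi))
        resid = eta / (1.0 - lam) - 1.0          # eta_{t+1}(1 - lam)^{-1} - 1
        m = (H * resid[:, None]).mean(axis=0)    # empirical moments E_nT[m_t]
        return m @ m

    return objective

# usage:
# obj = make_gmm_objective(eta, V, U, z_bins, n_levels=5)
# theta_hat = minimize(obj, x0=np.zeros(U.shape[1] + V.shape[1])).x
```

A two-step GMM would re-weight the moments by the inverse of their estimated covariance, as mentioned in Appendix B.1.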
Instead of directly specifying a parametric form for $\lambda(\cdot)$, which could potentially be misspecified, we combine a non-parametric function $g(\cdot)$ with a parametric component to solve for $\psi$, thus allowing for greater flexibility.

### B.3. How to Find the Shadow Variable

Identifying a suitable shadow variable remains a challenging issue in the literature on MNAR, even without accounting for sequential decision-making scenarios like MDPs. As defined in Definition 4.8, a shadow variable is an auxiliary variable that has a causal relationship with the outcome of interest while remaining conditionally independent of the dropout pattern. This assumption is often plausible in various empirical settings. For example, in our real-world application, certain baseline measurements taken before treatment, such as the previous GCS score, can act as shadow variables. This is because measurements like the previous GCS score typically help explain the patient's health status at the current stage, but do not offer additional information about dropout when conditioned on the outcome (the patient's current health status).

A comparable scenario arises in healthcare, as described in Miao et al. (2024). In a study investigating the mental health of children in Connecticut (Zahner et al., 1992), researchers sought to assess the prevalence of students exhibiting abnormal psychopathological traits based on teacher evaluations, which were prone to missing data. The likelihood of a teacher providing an assessment might depend on their evaluation of the student, but is unlikely to influence a separate parent report, provided the teacher's evaluation and all observed covariates are accounted for. Additionally, the parent report is expected to correlate strongly with the teacher's assessment, making the parent's evaluation a valid shadow variable in this context. Other illustrative examples of shadow variables can be found in Wang et al. (2014); Zhao & Shao (2015); Zhao & Ma (2018).

While the above examples provide some insight into understanding and identifying shadow variables in real-world applications, the practical task of identifying shadow variables can still be challenging. In cases where definitive domain knowledge is lacking or the conditions for a chosen shadow variable are only partially satisfied, the observed likelihood becomes non-identifiable, and the resulting estimating equation is likely to fail. In such situations, it is advisable to perform a sensitivity analysis (Robins et al., 2000) on the selected shadow variable to evaluate how variations in the choice impact the results.

³To avoid confusion with the existing literature on instrumental variables for missing data (Tchetgen Tchetgen & Wirth, 2017; Sun et al., 2018), we use the term "shadow variable" exclusively in the following discussion, avoiding the term "instrumental variable", as the identification framework presented here differs slightly from that in other studies.

### B.4. Dropout Estimation Accuracy

In this section, we briefly evaluate the accuracy of the dropout propensity estimation for both the parametric and semi-parametric models. Accurate estimation of the dropout propensity is crucial in MNAR settings, as it appears in the denominator of the IPW estimator and directly influences the estimation accuracy of $\hat\psi$ and the value function. Figure 3 is generated based on the simulation setting described in Section 5, and provides an overall assessment of the estimation accuracy of $\lambda(\hat\psi)$.
As shown in the MSE distribution plot, both the parametric and semi-parametric models perform adequately well. As expected, the semi-parametric model yields slightly higher MSE than the parametric model due to its greater flexibility and the potential for model misspecification.

*Figure 3. Mean Squared Errors (MSEs) of the estimated dropout function for both the parametric (mean MSE: 0.0006) and semi-parametric (mean MSE: 0.0011) models are close to zero, demonstrating the effectiveness of the dropout propensity estimation.*

## C. A Brief Introduction to B-Spline and Wavelet Basis Functions

### C.1. B-Spline Basis Function

A B-spline basis function $B_{i,k}(x)$ belongs to a collection of piecewise polynomial functions defined over a partition of the domain (called knots), which are used to approximate smooth functions. For a non-decreasing sequence of knots $\{t_i\}_{i=0}^{n+k}$, the B-spline basis of degree $k$ is defined recursively via the Cox–de Boor recursion formula. For degree $k = 0$,
$$B_{i,0}(x) = \begin{cases} 1 & \text{if } t_i \le x < t_{i+1}, \\ 0 & \text{otherwise}. \end{cases}$$
For degree $k > 0$,
$$B_{i,k}(x) = \frac{x - t_i}{t_{i+k} - t_i}\, B_{i,k-1}(x) + \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}}\, B_{i+1,k-1}(x),$$
where terms with zero denominators are treated as zero. By adjusting the number and placement of knots, B-splines can flexibly combine piecewise polynomial segments to create smooth functions that approximate complex functional forms.

### C.2. Wavelet Basis Function

Wavelet basis functions are localized in both time (or space) and frequency, making them suitable for representing functions with local irregularities. A wavelet basis is constructed from a single mother wavelet $\psi(x)$ via dilations and translations:
$$\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k),$$
where $j \in \mathbb{Z}$ denotes the scale (frequency) and $k \in \mathbb{Z}$ the translation (location). Commonly used wavelets such as Haar and Daubechies form orthonormal bases in $L^2(\mathbb{R})$ and are particularly useful in nonparametric regression, signal processing, and capturing complex functions with discontinuities.

## D. More on Simulations

In this section, we present additional simulation results to further support our study. We provide comprehensive results for two dropout models to evaluate the performance of the different value estimators. The MNAR dropout model is defined as $\lambda_1(S_t, A_t, R_{t+1}, S_{t+1}) = \{1 + \exp(\psi_1 + \psi_2 S^{(1)}_t + \psi_3 R_{t+1})\}^{-1}$, and the MAR model as $\lambda_2(R_t, S_t, A_t) = \{1 + \exp(\psi_1 + \psi_2 S^{(1)}_t + \psi_3 R_t)\}^{-1}$. The true parameter values of $\psi = [\psi_1, \psi_2, \psi_3]^{\top}$ are set to $[2, 0.08, 0.15]^{\top}$ in Setting 1 and $[2, 0.15, 0.3]^{\top}$ in Setting 2 (results for Setting 2 are presented in the main paper). The full set of comparisons, covering four combinations of $(n, T)$ with values $(500, 10)$, $(1000, 10)$, $(500, 25)$, and $(1000, 25)$, is provided in Tables 4–5.

*Table 4. Results of value estimates and 95% confidence intervals for Setting 1 in the 2D-Linear environment. The average bias, MSE values, and ECP are reported for each estimator (with standard errors in parentheses).*
C.2. Wavelet Basis Function

Wavelet basis functions are localized in both time (or space) and frequency, making them suitable for representing functions with local irregularities. A wavelet basis is constructed from a single mother wavelet ψ(x) via dilations and translations:

$$\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k),$$

where j ∈ Z denotes the scale (frequency) and k ∈ Z the translation (location). Commonly used wavelets such as the Haar and Daubechies families form orthonormal bases of L²(R) and are particularly useful in nonparametric regression, signal processing, and capturing complex functions with discontinuities.

D. More on Simulations

In this section, we present additional simulation results to further support our study. We provide comprehensive results for two dropout models to evaluate the performance of the different value estimators. The MNAR dropout model is defined as

$$\lambda_1(S_t, A_t, R_{t+1}, S_{t+1}) = \big\{1 + \exp\big(\psi_1 + \psi_2 S^{(1)}_t + \psi_3 R_{t+1}\big)\big\}^{-1},$$

and the MAR model as

$$\lambda_2(R_t, S_t, A_t) = \big\{1 + \exp\big(\psi_1 + \psi_2 S^{(1)}_t + \psi_3 R_t\big)\big\}^{-1}.$$

The true parameter values of ψ = [ψ1, ψ2, ψ3]ᵀ are set to [2, 0.08, 0.15]ᵀ in Setting 1 and to [2, 0.15, 0.3]ᵀ in Setting 2 (results for Setting 2 are presented in the main paper). The full set of comparisons, covering the four combinations of (n, T) with values (500, 10), (1000, 10), (500, 25), and (1000, 25), is provided in Tables 4-5.

Table 4. Results of value estimates and 95% confidence intervals for Setting 1 in the 2D-Linear environment. The average bias (with standard error in parentheses), MSE, and empirical coverage probability (ECP) are reported for each estimator; the four blocks correspond, in order, to (n, T) = (500, 10), (1000, 10), (500, 25), (1000, 25).

T    n     DROPOUT      METHOD      BIAS (SE)        MSE     ECP
10   500   NO DROPOUT   CC           0.013 (0.602)   1.592   0.968
10   500   MAR          CC           0.009 (0.779)   2.095   0.964
10   500   MAR          IPW          0.012 (0.781)   2.097   0.964
10   500   MNAR         CC          -0.246 (0.806)   2.205   0.948
10   500   MNAR         IPW (P)      0.007 (0.819)   2.158   0.948
10   500   MNAR         IPW (SP)     0.013 (0.837)   2.188   0.948
10   1000  NO DROPOUT   CC          -0.040 (0.443)   1.235   0.968
10   1000  MAR          CC          -0.050 (0.566)   1.471   0.952
10   1000  MAR          IPW         -0.051 (0.566)   1.472   0.952
10   1000  MNAR         CC          -0.307 (0.555)   1.588   0.920
10   1000  MNAR         IPW (P)     -0.050 (0.563)   1.487   0.940
10   1000  MNAR         IPW (SP)    -0.056 (0.558)   1.482   0.944
25   500   NO DROPOUT   CC          -0.027 (0.413)   1.168   0.944
25   500   MAR          CC          -0.022 (0.653)   1.748   0.964
25   500   MAR          IPW         -0.020 (0.651)   1.742   0.976
25   500   MNAR         CC          -0.296 (0.702)   1.923   0.920
25   500   MNAR         IPW (P)     -0.012 (0.712)   1.837   0.960
25   500   MNAR         IPW (SP)    -0.045 (0.718)   1.841   0.936
25   1000  NO DROPOUT   CC          -0.033 (0.245)   0.985   0.976
25   1000  MAR          CC          -0.052 (0.460)   1.291   0.960
25   1000  MAR          IPW         -0.052 (0.460)   1.289   0.964
25   1000  MNAR         CC          -0.320 (0.470)   1.439   0.896
25   1000  MNAR         IPW (P)     -0.068 (0.474)   1.322   0.964
25   1000  MNAR         IPW (SP)    -0.062 (0.475)   1.322   0.952

The key takeaways from the results in the tables are as follows:
1. When the missing data is ignorable (i.e., under MAR), both the CC and IPW estimators perform well in terms of estimation accuracy and empirical coverage.
2. In the case of nonignorable missingness (i.e., MNAR), the CC estimator suffers from significant bias, higher MSE, and lower coverage, which negatively impacts the performance of the value estimation. In contrast, our proposed IPW-based estimator, whether using a parametric or a semi-parametric dropout model, remains stable in terms of both bias and coverage.
3. Across different values of (n, T), increasing the total number of data points n·T leads to slightly more stable estimates, particularly in the standard error of the bias and in the MSE.

Table 5. Results of value estimates and 95% confidence intervals for Setting 2 in the 2D-Linear environment, with the same layout as Table 4.

T    n     DROPOUT      METHOD      BIAS (SE)        MSE     ECP
10   500   NO DROPOUT   CC           0.013 (0.602)   1.592   0.968
10   500   MAR          CC          -0.028 (0.807)   2.198   0.972
10   500   MAR          IPW         -0.025 (0.806)   2.207   0.980
10   500   MNAR         CC          -0.598 (0.823)   2.658   0.904
10   500   MNAR         IPW (P)     -0.016 (0.851)   2.315   0.976
10   500   MNAR         IPW (SP)     0.015 (0.861)   2.340   0.960
10   1000  NO DROPOUT   CC          -0.040 (0.443)   1.235   0.968
10   1000  MAR          CC          -0.029 (0.580)   1.538   0.944
10   1000  MAR          IPW         -0.030 (0.582)   1.545   0.956
10   1000  MNAR         CC          -0.614 (0.587)   2.013   0.820
10   1000  MNAR         IPW (P)     -0.023 (0.602)   1.591   0.940
10   1000  MNAR         IPW (SP)     0.003 (0.608)   1.596   0.932
25   500   NO DROPOUT   CC          -0.027 (0.413)   1.168   0.944
25   500   MAR          CC          -0.021 (0.723)   1.925   0.952
25   500   MAR          IPW         -0.016 (0.723)   1.927   0.956
25   500   MNAR         CC          -0.655 (0.743)   2.476   0.864
25   500   MNAR         IPW (P)     -0.072 (0.760)   2.025   0.944
25   500   MNAR         IPW (SP)    -0.041 (0.764)   2.031   0.964
25   1000  NO DROPOUT   CC          -0.033 (0.245)   0.985   0.976
25   1000  MAR          CC          -0.034 (0.485)   1.361   0.972
25   1000  MAR          IPW         -0.034 (0.486)   1.361   0.976
25   1000  MNAR         CC          -0.648 (0.509)   1.919   0.780
25   1000  MNAR         IPW (P)     -0.057 (0.523)   1.443   0.972
25   1000  MNAR         IPW (SP)    -0.033 (0.523)   1.439   0.956
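For reference, a minimal sketch of how the three reported metrics can be computed from simulation replications (names are ours; v_hat and se_hat collect the per-replication value estimates and standard errors, and v_true is the known true value):

```python
import numpy as np
from scipy.stats import norm

def summarize_replications(v_hat, se_hat, v_true, alpha=0.05):
    """Average bias, MSE, and empirical coverage probability (ECP)
    of nominal (1 - alpha) confidence intervals over replications."""
    v_hat, se_hat = np.asarray(v_hat), np.asarray(se_hat)
    z = norm.ppf(1 - alpha / 2)
    bias = float(np.mean(v_hat - v_true))
    mse = float(np.mean((v_hat - v_true) ** 2))
    ecp = float(np.mean(np.abs(v_hat - v_true) <= z * se_hat))
    return bias, mse, ecp
```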
E. Additional Experimental Details

In this section, we provide more details on the experiments and our implementation.

E.1. ALGORITHM

The outline of the proposed estimator is presented in Algorithm 1.

Algorithm 1: Off-Policy Evaluation with Nonignorable Monotone Missingness
1: Input: observed dataset D = {τ_i}_{i=1}^n, target policy π, discount factor γ, number of basis functions L
2: Fit the dropout propensity model (6) using the semi-parametric approach
3: Construct a set of basis functions Φ_L(s) from the state variables and estimate β̂_{π,IPW} by (3)
4: Estimate the value V̂^π_IPW(s) = U_π(s)ᵀβ̂_{π,IPW} and V̂^π_IPW(G) = ∫_{s∈S} V̂^π_IPW(s) G(ds)
5: Calculate the asymptotic variance σ̂²_{π,IPW}(G) given by Equation (34)
6: Return: the CI of V^π(G): [V̂^π_IPW(G) ± z_{α/2} (nT)^{-1/2} σ̂_{π,IPW}(G)]

In Line 5, when employing a semi-parametric model to estimate the dropout propensity function (see Appendix B.2), an approximation of the estimated standard deviation σ̃_{π,IPW}(G), as given in Equation (35), can be used as a substitute for the exact solution σ̂_{π,IPW}(G). Accordingly, the confidence interval can be obtained as

$$\Big[\,\widehat V^\pi_{\mathrm{IPW}}(G) \pm z_{\alpha/2}\,(nT)^{-1/2}\,\widetilde\sigma_{\pi,\mathrm{IPW}}(G)\,\Big].$$

E.2. SIMULATION SETTINGS

E.2.1. DATA GENERATION

The initial states are generated from the standard bivariate normal distribution N(0₂, I₂). For t ≥ 0, we slightly modify the original dynamics and consider the following transition:

$$S^{(1)}_{t+1} = (2A_t - 1)\,S^{(1)}_t + \varepsilon^{(1)}_t, \qquad S^{(2)}_{t+1} = (1 - 2A_t)\,S^{(2)}_t + \varepsilon^{(2)}_t,$$

where ε^{(1)}_t and ε^{(2)}_t are independent N(0, 0.25) random variables. The immediate reward is designed as

$$R_{t+1} = 2S^{(1)}_{t+1} + S^{(2)}_{t+1} + 0.5\,S^{(2)}_t - 0.25\,(2A_t - 1) + \varepsilon^{(3)}_t,$$

where ε^{(3)}_t ~ N(0, 10^{-4}). The behavior policy follows a Bernoulli distribution with mean 0.5. Throughout the simulation studies, we set the discount factor γ to 0.9.

To generate complete data with n trajectories, we first sample n initial states from the reference distribution G, and then generate the action, the next state, and the reward from the generative model described above. This process is repeated until reaching the maximum horizon T. To generate incomplete data, we first calculate the dropout probability λ_{i,t} at each step using the dropout model λ_j(·) defined in Section 5; this corresponds to the probability of subject i dropping out after taking action A_t. Given the dropout probability, we sample the response indicator η_{i,t+1} from a Bernoulli distribution with mean 1 − λ_{i,t}. To control the overall missing rate, we also impose a no-dropout period of two steps, i.e., η_{i,0} = η_{i,1} = 1. After the second step, the dropout probability is applied and a trajectory terminates once the response indicator η_{i,t} turns 0.
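For concreteness, a minimal sketch of this data-generating process, including the monotone dropout step (psi and the logistic dropout form follow Appendix D; function and variable names are ours):

```python
import numpy as np

def generate_trajectories(n=500, T=10, psi=(2.0, 0.08, 0.15),
                          mnar=True, seed=0):
    """Simulate the 2D-Linear environment with monotone dropout.

    Returns, per subject, the observed (S_t, A_t, R_{t+1}, S_{t+1}) tuples,
    truncated at the first dropout; eta_0 = eta_1 = 1 (no early dropout).
    """
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n):
        s = rng.standard_normal(2)                  # S_0 ~ N(0_2, I_2)
        traj, prev_r = [], 0.0
        for t in range(T):
            a = rng.binomial(1, 0.5)                # behavior policy
            s1 = np.array([(2 * a - 1) * s[0] + rng.normal(0, 0.5),
                           (1 - 2 * a) * s[1] + rng.normal(0, 0.5)])
            r = (2 * s1[0] + s1[1] + 0.5 * s[1]
                 - 0.25 * (2 * a - 1) + rng.normal(0, 1e-2))
            if t >= 1:                              # dropout possible from eta_2 on
                z = psi[0] + psi[1] * s[0] + psi[2] * (r if mnar else prev_r)
                if rng.random() < 1.0 / (1.0 + np.exp(z)):
                    break                           # quadruple at step t unobserved
            traj.append((s.copy(), a, r, s1.copy()))
            prev_r, s = r, s1
        data.append(traj)
    return data
```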
E.2.2. COMPLETE-CASE (CC) ESTIMATOR

The OPE step of Shi et al. (2021b) approximates the Q-function with linear sieves, Qπ(s, a) ≈ Φ_L(s)ᵀβπ,a, where Φ_L(·) = (φ_{L,1}(·), …, φ_{L,L}(·))ᵀ is a vector of L spline bases. In our implementation, we first scale the state variables onto [0, 1] and then construct 6 cubic B-spline bases for each dimension, with knots placed at equally spaced quantiles of the transformed state variables. To avoid extrapolation of the basis functions, three repeated knots are placed on the boundary. The tensor product of the per-dimension bases is used to construct the final basis, hence L = 36. The number of basis functions L is allowed to grow with the sample size to reduce the approximation error; for a fair comparison, however, we fix L = 36 throughout the experiments regardless of the sample size. The CC estimator of the Q-function parameter β*π is given in (4.2). The inversion of Σ̂π ∈ R^{mL×mL} tends to be unstable when mL is large, so we add a small ridge penalty with weight 10^{-5} to improve stability. Given β̂π, the value function can be calculated as V̂π(s) = U_π(s)ᵀβ̂π. We approximate the integrated value V̂π(G) = ∫_{s∈S} V̂π(s)G(ds) by sampling 10,000 states from the reference distribution G and averaging the estimated values.

E.2.3. SOLVING THE ESTIMATING EQUATION (4)

To calculate the IPW estimator, we first need to estimate the dropout probability from the data. For ignorable missingness (MAR), we fit a logistic regression with the correctly specified model to predict the dropout probability. For nonignorable missingness (MNAR), we adopt both a parametric method (Miao et al., 2024) and a semi-parametric method (Shao & Wang, 2016) for flexible dropout model estimation.

In the parametric estimation for the simulation study, we select h(S_t, A_t, Z_t) = [1, S^{(1)}_t, Z_t]ᵀ for simplicity when learning ψ. Theoretically, it has been established (see Miao et al. (2024)) that any function of (S_t, A_t, Z_t) can be chosen as the h-function in the estimating equation while maintaining estimation consistency. Furthermore, Miao et al. (2024) demonstrated, by leveraging doubly robust estimation derived from the efficient influence function of ψ, that there exists a specific function h_eff(S_t, A_t, Z_t) that minimizes the estimation variance. However, given that the proposed value estimator in this work is based on IPW, and to simplify practical implementation, we recommend choosing h to be as simple as possible in real applications. For instance, incorporating an intercept term, using the original form of the non-shadow variables rather than higher-order polynomial transformations, and including the shadow variable in the construction of h to ensure the identifiability of ψ can help reduce the estimation variance and potentially improve the overall accuracy of the estimate.

In the semi-parametric estimation for the simulation study, the shadow variable S^{(2)}_t is discretized into 4 bins based on its quartiles. The nonparametric part is approximated using a Gaussian kernel with bandwidth h_l = c σ_l n_l^{-1/3}, where σ_l and n_l are the estimated standard deviation and the sample size of the samples with S^{(2)} = l ∈ {1, 2, 3, 4}. We pick c = 7.5 in the bandwidth formula based on an inspection of the objective function curve. The parametric part, i.e., ψ, is estimated using Equation (5) with h(S_t, A_t, Z_t) = [1(Z_t = 1), …, 1(Z_t = 3)]ᵀ. In the minimization step of GMM, we use the limited-memory BFGS algorithm (Liu & Nocedal, 1989) for both parametric and semi-parametric estimation, with several initial values to avoid local minima. In the semi-parametric estimation procedure, we recommend constructing the h-function from the discretized shadow variable, which has been empirically shown to perform well in our synthetic studies.

E.2.4. IPW ESTIMATOR

After obtaining the estimate ψ̂_{nT} of the dropout model parameter, we plug the estimated propensity into (3) to calculate β̂π,IPW. To avoid extremely large inverse weights, we bound the estimated propensity in the denominator from below at 0.01. After obtaining β̂π,IPW, we calculate the integrated value V̂π_IPW(G) in the same way as for the CC estimator. Finally, we construct the confidence interval for the proposed IPW estimator based on the theoretical form of the asymptotic variance, as provided in Equation (34). Note that when a semi-parametric model is used for dropout estimation, the asymptotic variance of ψ̂ has a complex form (as discussed in Shao & Wang (2016)) that is difficult to compute due to the non-parametric kernel g(U_t). To simplify the computation, we suggest using the approximation of σ̂²π,IPW(G) given by (35). In all the empirical experiments presented in the main paper and the appendices, the confidence intervals based on this approximation are very close to those obtained through bootstrapping and provide stable, satisfactory coverage; we therefore use this approximation in our implementation.
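Returning to the dropout-model estimation in E.2.3, the following minimal sketch illustrates the parametric GMM step in our own notation. The moment condition E[h(S, Z){η/(1 − λ(ψ)) − 1}] = 0 is the standard shadow-variable construction (cf. Miao et al., 2024) and is meant to illustrate, not reproduce, Equation (4); h = [1, S^(1), Z] as above:

```python
import numpy as np
from scipy.optimize import minimize

def gmm_psi(S1, Z, R_next, eta, psi_init=(0.0, 0.0, 0.0)):
    """GMM estimate of psi in lambda(psi) = {1 + exp(psi1 + psi2*S1
    + psi3*R_next)}^{-1}, using moments E[h * (eta/(1 - lambda) - 1)] = 0
    with h = [1, S1, Z]; R_next enters only through observed cases."""
    S1, Z, R_next = map(np.asarray, (S1, Z, R_next))
    eta = np.asarray(eta).astype(bool)
    H = np.column_stack([np.ones_like(S1), S1, Z])

    def objective(psi):
        w = np.zeros(len(S1))
        lam = 1.0 / (1.0 + np.exp(psi[0] + psi[1] * S1[eta]
                                  + psi[2] * R_next[eta]))
        w[eta] = 1.0 / (1.0 - lam)           # inverse observation probability
        gbar = (H * (w - 1.0)[:, None]).mean(axis=0)
        return gbar @ gbar                    # identity-weighted GMM criterion

    return minimize(objective, np.asarray(psi_init), method="L-BFGS-B").x
```

Several random initial values can be tried, keeping the solution with the smallest criterion value, mirroring the multi-start strategy described above.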
E.3. Real Data Application

Data description. The sepsis dataset is extracted from the MIMIC-III v1.4 database (Johnson et al., 2016). We follow the data processing procedure described in Komorowski et al. (2018) and use a pure-Python re-implementation available at https://github.com/microsoft/mimic_sepsis. Data are included from the diagnosis of sepsis until 48 hours after its onset, so as to capture treatment management in the early phase. We exclude mortality cases within this time window and focus only on early-discharged patients.

Model features. We consider a 14-dimensional state feature vector representing important features that clinicians would examine when deciding the treatment and dosage for a patient. The following physiological features are used in our model:

- Demographics: age
- Lab values: arterial pH, chloride, hemoglobin, INR (international normalized ratio), PT (prothrombin time), arterial blood gas, ionised calcium, calcium, arterial lactate
- Vital signs: SpO2, temperature, heart rate
- Other: Sequential Organ Failure Assessment (SOFA) score

The features are aggregated over a time resolution of 4 hours; if no record is available in the current time window, we carry the last value forward.

Target policies. In our experiments, we evaluate a fitted behavior policy as well as optimal policies learned via Deep Q-Network (Mnih et al., 2015), Dueling Double Deep Q-Network (Wang et al., 2016), and a discrete version of Batch-Constrained Deep Q-Learning (BCQ) (Fujimoto et al., 2019). The behavior policy is fitted with a random forest with 250 trees. Each of the three Q-learning algorithms is run for 2 × 10^5 iterations with minibatch size 256 and learning rate 1 × 10^-3.

More details about implementation. As in the simulated environment, we first scale the state variables onto [0, 1] and construct 4 cubic B-spline bases for each dimension. We do not use a tensor product here due to high-dimensionality concerns, so there are L = 56 bases in total. To fit the dropout model, we use the previous GCS score as the shadow variable; it is discretized into 4 bins based on quantiles. We consider the model in (6) with U_t = {1(S^{FiO2}_t ≥ 0.6), 1(60 ≤ S^{HR}_t ≤ 100), 1(10 ≤ S^{RR}_t ≤ 30)}, Z_{t+1} = 1(S^{GCS}_{t+1} ≥ 14), and a bandwidth of h_l = 10 σ_l n_l^{-1/3}. For a large dataset, the kernel estimator used in this semi-parametric method can become a computational bottleneck. To accelerate the algorithm, we apply a downsampling technique: we repeatedly sample random subsets from the whole dataset and aggregate the value estimation results by averaging.
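A minimal sketch of this downsample-and-aggregate step (names are ours; `estimate_value` stands for one full run of the estimator on a subset):

```python
import numpy as np

def downsampled_value(trajectories, estimate_value, n_subsets=10,
                      subset_frac=0.2, seed=0):
    """Average value estimates over random subsets to sidestep the cost of
    the kernel step in the semi-parametric dropout estimation."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    m = max(1, int(subset_frac * n))
    ests = [estimate_value([trajectories[j] for j in
                            rng.choice(n, size=m, replace=False)])
            for _ in range(n_subsets)]
    return float(np.mean(ests))
```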
F. Generalizability of the Proposed Framework

In the main paper, we focus on the dropout scenario in which dropout occurs after the action is observed but before the reward and the next state are observed. Additionally, we use Least-Squares Temporal Difference Q-learning (LSTDQ) as the base OPE algorithm to investigate the effect of missing data. In this section, we expand our scope to more general dropout patterns and other OPE methods.

F.1. More General Dropout Patterns

The proposed framework applies to a broader class of dropout patterns. Specifically, the theoretical results for our IPW estimator remain valid whenever dropout occurs after the observed action, regardless of whether the reward is observed. This is because the key idea behind the proposed IPW estimator is to weight each transition by the inverse probability of observing the complete transition quadruple (S_t, A_t, R_{t+1}, S_{t+1}) given the observed (S_t, A_t).

On the other hand, when dropout occurs after an observed state but before an observed action, the proposed framework also applies. The distinction is that MAR and MNAR are now defined with respect to S_t instead of (S_t, A_t). If the missingness of the current action depends only on the current state and not on the action itself, it is ignorable. In such cases, the CC estimator remains valid and no further adjustment is required. This can be seen from the decomposition

$$E\{\eta_{t+1}M_t(\beta^*_\pi)\} = E\big[E(\eta_{t+1} \mid S_t, \eta_t)\,E\{M_t(\beta^*_\pi) \mid S_t\}\big] = 0.$$

Here, E{M_t(β*π) | S_t} = 0 follows from the law of total probability together with the equation E{R_{t+1} + γVπ(S_{t+1}) − Qπ(S_t, A_t) | S_t, A_t} = 0. In the case of nonignorable missingness, where dropout depends on the potential action, the CC estimator can be biased, and the proposed IPW estimator can still be used to mitigate this bias. Moreover, the idea of IPW adjustment can potentially be extended to handle intermittent missingness. The key distinction from monotone missingness lies in estimating the dropout propensity, which must be handled case by case and sometimes requires additional assumptions on the missingness pattern. We leave this for future investigation.

F.2. More General OPE Methods

In this section, we discuss how the proposed framework can be extended to Marginalized Importance Sampling (MIS) methods. We first introduce some additional notation and review the MIS value estimator. Given the discount factor γ and the reference distribution G of initial states, the discounted state-action visitation probability density for policy π is defined as

$$d^\pi(s, a) = (1-\gamma)\,\pi(a \mid s)\sum_{t=0}^{\infty}\gamma^t\,P(S_t = s \mid \pi).$$

It satisfies the backward Bellman recursion

$$d^\pi(s', a') = (1-\gamma)\,G(s')\,\pi(a' \mid s') + \gamma\,\pi(a' \mid s')\int_{s}\sum_{a\in\mathcal A} d^\pi(s, a)\,p(s' \mid s, a)\,ds. \tag{7}$$

With the notation dπ(s, a), the policy value can also be expressed as

$$V^\pi_G = \frac{1}{1-\gamma}\,E_{(S_t,A_t)\sim d^\pi,\,R_{t+1}\sim r(S_t,A_t)}\{R_{t+1}\}.$$

In offline settings, the data may be collected from policies that differ from the target policy π; denote the state-action visitation probability density under the behavior policies by d^D. To estimate the value from off-policy data, define the marginalized density ratio under the target policy π as

$$\omega^\pi(s, a) := \frac{d^\pi(s, a)}{d^{\mathcal D}(s, a)}.$$

The policy value can then be equivalently expressed as

$$V^\pi_G = \frac{1}{1-\gamma}\,E_{(S_t,A_t)\sim d^{\mathcal D},\,R_{t+1}\sim r(S_t,A_t)}\{\omega^\pi(S_t, A_t)\,R_{t+1}\},$$

which leads to the MIS value estimator (Liu et al., 2018; Xie et al., 2019). Compared with trajectory-based importance sampling methods, the marginalized density ratio plays a critical role in breaking the curse of horizon. To handle unknown behavior policies, it is preferable to model the density ratio ωπ(s, a) directly and to plug ω̂π(s, a) into the final value estimate:

$$\widehat V^\pi_G = \frac{1}{1-\gamma}\cdot\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\widehat\omega^\pi(S_{i,t}, A_{i,t})\,R_{i,t+1}.$$

The key to estimating ωπ(s, a) is the following result derived from (7): for any function f,

$$E_{d^{\mathcal D}}\Big[\omega^\pi(S_t, A_t)\big\{f(S_t, A_t) - \gamma\,E_{S_{t+1}\sim p(\cdot\mid S_t,A_t),\,a'\sim\pi(\cdot\mid S_{t+1})}[f(S_{t+1}, a')]\big\}\Big] = (1-\gamma)\,E_{S_0\sim G,\,a\sim\pi(\cdot\mid S_0)}[f(S_0, a)].$$
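Once an estimate of the ratio is available, the plug-in value step above is immediate; a minimal sketch, where `omega_hat` is an assumed, already-estimated ratio function:

```python
import numpy as np

def mis_value(transitions, omega_hat, gamma=0.9):
    """Plug-in MIS estimate: (1 - gamma)^{-1} times the average of
    omega_hat(S, A) * R over all observed transitions (s, a, r)."""
    w_r = [omega_hat(s, a) * r for s, a, r in transitions]
    return float(np.mean(w_r) / (1.0 - gamma))
```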
A special case is to replace f(s, a) with Qπ(s, a), leading to the following equation:

$$E_{d^{\mathcal D}}\Big[\omega^\pi(S_t, A_t)\big\{Q^\pi(S_t, A_t) - \gamma\,E_{S_{t+1}\sim p(\cdot\mid S_t,A_t),\,a'\sim\pi(\cdot\mid S_{t+1})}[Q^\pi(S_{t+1}, a')]\big\}\Big] = (1-\gamma)\,E_{S_0\sim G,\,a\sim\pi(\cdot\mid S_0)}[Q^\pi(S_0, a)]. \tag{8}$$

Another way to derive (8) is to note the equivalence between two expressions of the policy value,

$$E_{d^{\mathcal D}}[\omega^\pi(S_t, A_t)\,r(S_t, A_t)] = (1-\gamma)\,E_{S_0\sim G,\,a\sim\pi(\cdot\mid S_0)}[Q^\pi(S_0, a)],$$

and then replace r(S_t, A_t) with Qπ(S_t, A_t) − γ E_{S_{t+1}∼p(·|S_t,A_t), a′∼π(·|S_{t+1})}[Qπ(S_{t+1}, a′)] using the Bellman equation. Based on this equation, several methods have been developed to estimate the density ratio ωπ(s, a) (Nachum et al., 2019; Uehara et al., 2020). These methods typically learn ωπ(s, a) by minimizing the difference between the two sides of equation (8) over chosen function classes for Qπ(s, a) and ωπ(s, a). Denote the function class for Qπ(s, a) by Q and that for ωπ(s, a) by Ω. To illustrate the estimation process, we use Minimax Weight Learning (MWL) (Uehara et al., 2020) as an example, which estimates ωπ by solving

$$\widehat\omega^{\pi}_{nT}(s, a) = \operatorname*{argmin}_{\omega^\pi\in\Omega}\;\sup_{Q^\pi\in\mathcal Q}\;L_{nT}(\omega^\pi, Q^\pi)^2,$$

where L_{nT}(ωπ, Qπ) is defined as

$$L_{nT}(\omega^\pi, Q^\pi) = \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\eta_{i,t+1}\,\omega^\pi(S_{i,t}, A_{i,t})\Big\{\gamma\sum_{a'\in\mathcal A}\pi(a' \mid S_{i,t+1})\,Q^\pi(S_{i,t+1}, a') - Q^\pi(S_{i,t}, A_{i,t})\Big\} + (1-\gamma)\,E_{S_0\sim G}\Big[\sum_{a\in\mathcal A}\pi(a \mid S_0)\,Q^\pi(S_0, a)\Big].$$

The complete-case MIS value estimator is then obtained by plugging in ω̂π_{nT}(s, a):

$$\widehat V^\pi_{CC}(G) = \frac{1}{1-\gamma}\cdot\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\eta_{i,t+1}\,\widehat\omega^{\pi}_{nT}(S_{i,t}, A_{i,t})\,R_{i,t+1}. \tag{10}$$

Next, we present the consistency results under the two missingness mechanisms.

Theorem F.1. Assume Conditions 3.1-4.4 and A.1(a)-(f) hold. Let ωπ(s, a) denote the true density ratio under missing data and ω̂π(s, a) the estimated density ratio from the observed data. Further assume:

(a) There exists a constant c_ω > 0 such that sup_{s,a} |ωπ(s, a)| ≤ c_ω, and the function class Ω satisfies ∥ω∥∞ ≤ c_ω for all ω ∈ Ω.

(b) L_{nT}(ω̂π, Qπ) = o_p(1), where Qπ represents the true Q-function.

Under ignorable missingness (MAR), the value estimate (10) remains consistent. On the other hand, if the missingness is nonignorable (MNAR), the value estimator (10) can be biased.

In Assumption (a), the boundedness of the marginalized state-action density ratio ωπ is ensured if the numerator dπ is bounded above and the denominator d^D is bounded away from 0; such an assumption is commonly made in the literature on importance sampling and inverse weighting. The boundedness of the function class Ω can be guaranteed through a truncation argument. Assumption (b) states that ω̂π makes equation (8) hold approximately when f(s, a) is substituted by the true Q-function Qπ(s, a). This can be achieved when the function class Q captures the true Q-function, i.e., Qπ ∈ Q, and the OPE algorithm drives sup_{Qπ∈Q} L_{nT}(ωπ, Qπ)² sufficiently close to 0. The proof of Theorem F.1 can be found in Appendix G.4.

It is noteworthy that the statement of Theorem F.1 also covers a special case of MWL in which ωπ(s, a) and Qπ(s, a) are modeled with the same set of basis functions, i.e., ωπ(s, a) = Φ_L(s)ᵀαπ,a and Qπ(s, a) = Φ_L(s)ᵀβπ,a. The corresponding value estimator can be shown to be

$$\widehat V^\pi_{CC}(G) = \Big\{\int_s U_\pi(s)^\top G(ds)\Big\}\Big\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\eta_{i,t+1}\,\xi_{i,t}\,(\xi_{i,t} - \gamma U_{\pi,i,t+1})^\top\Big\}^{-1}\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\eta_{i,t+1}\,\xi_{i,t}\,R_{i,t+1},$$

which is identical to the complete-case (CC) estimator discussed in Section 4.2; the derivation is similar to Appendix A.3 of Uehara et al. (2020). Consequently, the two estimators share the same theoretical properties described in Theorem 4.5. In the case of nonignorable missingness, the IPW adjustment discussed in Section 4.2 can be applied to this estimator as well.
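As a concrete illustration of this special case, a minimal sketch that assembles the closed-form estimator above from linear bases (names are ours; `phi` maps a state to its L basis values and `pi` gives target-policy probabilities):

```python
import numpy as np

def cc_value_linear(transitions, phi, pi, n_actions, gamma, g_states):
    """Closed-form CC value estimate with shared linear bases for Q and omega.

    transitions: (eta, s, a, r, s_next) tuples; xi(s, a) places phi(s) in the
    a-th block of an (m*L,)-vector; a small ridge stabilizes the inversion,
    as in Appendix E.2.2.
    """
    L = len(phi(transitions[0][1]))
    dim = n_actions * L

    def xi(s, a):
        v = np.zeros(dim)
        v[a * L:(a + 1) * L] = phi(s)
        return v

    def u_pi(s):  # policy-averaged basis, Sum_a pi(a|s) xi(s, a)
        return sum(pi(a, s) * xi(s, a) for a in range(n_actions))

    Sigma, b = 1e-5 * np.eye(dim), np.zeros(dim)
    for eta, s, a, r, s1 in transitions:
        if eta:                                   # complete cases only
            x = xi(s, a)
            Sigma += np.outer(x, x - gamma * u_pi(s1))
            b += x * r
    beta = np.linalg.solve(Sigma, b)              # the 1/(nT) factors cancel
    U_G = np.mean([u_pi(s) for s in g_states], axis=0)
    return float(U_G @ beta)
```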
G. Consistency and Asymptotic Results

In this section, we provide the proofs of Theorems 4.5, 4.6, and 4.7. For simplicity, we omit the subscript π in Σπ, Σ̂π, βπ, β̂π, σπ, and σ̂π. We first introduce the following lemmas from Shi et al. (2021b); their proofs can be found in Section E of that paper.

Lemma G.1. Under Assumption A.1(a), there exists some constant c* > 0 such that Qπ(·, a) ∈ Λ(p, c*) for any policy π and any a ∈ A.

Lemma G.2. Under Assumption A.1(b), there exists some constant c̄ ≥ 1 such that

$$\bar c^{-1} \le \lambda_{\min}\Big(\int_{s\in\mathcal S}\Phi_L(s)\Phi_L(s)^\top ds\Big) \le \lambda_{\max}\Big(\int_{s\in\mathcal S}\Phi_L(s)\Phi_L(s)^\top ds\Big) \le \bar c,$$

and sup_{s∈S} ∥Φ_L(s)∥₂ ≤ c̄√L.

Lemma G.3. Suppose Assumption A.1 holds, and define Σ* = EΣ̂. Then ∥(Σ*)^{-1}∥₂ ≤ 3c̄, ∥Σ*∥₂ = O(1), ∥Σ̂ − Σ*∥₂ = O_p{L^{1/2}(nT)^{-1/2} log(nT)}, ∥Σ̂^{-1} − (Σ*)^{-1}∥₂ = O_p{L^{1/2}(nT)^{-1/2} log(nT)}, and ∥Σ̂^{-1}∥₂ ≤ 6c̄ with probability approaching 1, as either n → ∞ or T → ∞.

Lemma G.4. Suppose Assumption A.1 holds. As either n → ∞ or T → ∞, we have λmax(T^{-1}Σ_{t=0}^{T-1} Eξ_tξ_tᵀ) = O_p(1), λmax{(nT)^{-1}Σ_{i=1}^{n}Σ_{t=0}^{T-1} ξ_{i,t}ξ_{i,t}ᵀ} = O_p(1), λmin(T^{-1}Σ_{t=0}^{T-1} Eξ_tξ_tᵀ) ≥ c/2, and λmin{(nT)^{-1}Σ_{i=1}^{n}Σ_{t=0}^{T-1} ξ_{i,t}ξ_{i,t}ᵀ} ≥ c/3 with probability approaching 1.

Lemma G.5. ∥∫_s U(s)G(ds)∥₂ ≤ m^{1/2}∥∫_s Φ_L(s)G(ds)∥₂, where m is the number of actions in the action space.

Lemma G.6. Define Σ̄ = ∫_s Σ_{a∈A} ξ(s, a){ξ(s, a) − γu(s, a)}ᵀ b(a | s)µ(s) ds. Suppose T → ∞. Under the conditions given in Lemma G.3, we have ∥Σ* − Σ̄∥₂ ≲ T^{-1/2}.

Remark G.7. The notation a_n ≲ b_n means that there exists some constant C > 0 such that a_n ≤ C b_n for all n; the notation a_n ≲_p 1 means a_n = O_p(1).

Next, we go through the proofs of the consistency and asymptotic results for the proposed IPW estimator. The overall strategy is similar to the proof of Theorem 1 in Shi et al. (2021b), but with additional components to handle the inverse weights and the associated uncertainty.

G.1. Proof of Theorem 4.5

Theorem 4.5. Suppose Assumption A.1 holds. Then V̂π_CC(G) is a consistent estimator if the missingness mechanism is ignorable (MAR); if the missingness mechanism is nonignorable (MNAR), V̂π_CC(G) can be biased.

Proof. We first sketch the main idea. Assume the true Q-function is Qπ(s, a) = Φ_L(s)ᵀβ*π, where the true parameter β* satisfies E{M_t(β*π)} = 0 with

$$M_t(\beta_\pi) = \xi_t\big\{R_{t+1} - (\xi_t - \gamma U_{\pi,t+1})^\top\beta_\pi\big\}.$$

Under incomplete data, the estimating equation becomes E{η_{t+1}M_t(βπ)} = 0. Using the MAR condition (Definition 4.1), we apply the conditional independence between η_{t+1} and (S_{t+1}, R_{t+1}) to separate the η_{t+1} and M_t terms:

$$E\{\eta_{t+1}M_t(\beta_\pi)\} = E\big[E\{\eta_{t+1}M_t(\beta_\pi) \mid S_t, A_t, \eta_t\}\big] = E\big[E(\eta_{t+1} \mid S_t, A_t, \eta_t)\,E\{M_t(\beta_\pi) \mid S_t, A_t\}\big].$$

It follows from E{R_{t+1} + γΣ_{a′∈A} Qπ(S_{t+1}, a′)π(a′ | S_{t+1}) − Qπ(S_t, A_t) | S_t, A_t} = 0 that E{M_t(β*π) | S_t, A_t} = 0. Therefore, E{η_{t+1}M_t(β*π)} = 0, i.e., β*π is still a solution to E{η_{t+1}M_t(βπ)} = 0, and the corresponding value estimator remains unbiased under some regularity conditions. For nonignorable missingness (MNAR), however, E{η_{t+1}M_t(β*π)} = 0 no longer holds, because η_{t+1} and M_t cannot be separated via conditional independence. Thus, the complete-case estimator β̂π,CC is biased away from β*π unless the probability P(η_{t+1} = 1 | S_t, A_t, R_{t+1}, S_{t+1}, η_t) is constant.

Next, we provide a more rigorous proof that takes the approximation error into account. By Conditions A.1(a), (b), and (e), the number of basis functions L for the Q-function satisfies L^{2p/d} ≳ nT{1 + ∥∫_s Φ_L(s)G(ds)∥₂²}; it then follows from Lemma G.5 that L^{2p/d} ≳ nT{1 + ∥∫_s U(s)G(ds)∥₂²}.
By Lemma G.1 and Condition A.1(b), there exists a set of vectors {β*_a} satisfying

$$\sup_{s\in\mathcal S,\,a\in\mathcal A}\big|Q^\pi(s,a) - \Phi_L(s)^\top\beta^*_a\big| \le C L^{-p/d} \tag{11}$$

for some constant C > 0 (Huang et al., 1998). Let β* = ((β*_1)ᵀ, …, (β*_m)ᵀ)ᵀ and define

$$m^*_{i,t} = \gamma\sum_{a\in\mathcal A}\big\{\Phi_L(S_{i,t+1})^\top\beta^*_a - Q^\pi(S_{i,t+1},a)\big\}\,\pi(a\mid S_{i,t+1}) - \big\{\Phi_L(S_{i,t})^\top\beta^*_{A_{i,t}} - Q^\pi(S_{i,t},A_{i,t})\big\},$$

$$\varepsilon_{i,t} = R_{i,t+1} + \gamma\sum_{a\in\mathcal A} Q^\pi(S_{i,t+1},a)\,\pi(a\mid S_{i,t+1}) - Q^\pi(S_{i,t},A_{i,t}). \tag{12}$$

The condition P(max_t |R_t| ≤ c₀) = 1 in Assumption A.1(f) implies that |R_{i,t}| ≤ c₀ for all i, t almost surely. By Lemma G.1, |Qπ(s, a)| ≤ c* for any π, s, a. Therefore, the error term ε_{i,t} can be bounded as

$$\max_{i,\,0\le t<T}|\varepsilon_{i,t}| \le c_0 + 2c^* \quad \text{almost surely}. \tag{13}$$

Moreover, σ²_{π,IPW}(G) ≥ C_{Ω,1} C ∥∫_s U(s)G(ds)∥₂² > 0. Combining Equation (39) together with Equation (37) yields

$$\frac{1}{\sigma_{\pi,\mathrm{IPW}}(G)}\Big|\widehat V^\pi_{\mathrm{IPW}}(G) - V^\pi(G) - \Big\{\int_s U(s)G(ds)\Big\}^\top\zeta_1\Big| \lesssim \big\|\widehat\beta_{\mathrm{IPW}} - \beta^* - \zeta_1\big\|_2 + CL^{-p/d}.$$

According to the previous part of the proof, we have

$$\widehat\beta_{\mathrm{IPW}} - \beta^* = \zeta_1 + O_p\big\{L(nT)^{-1}\log(nT)\big\} + O_p\big(L^{-p/d}\big).$$

Together with the conditions L ≲ √(nT)/log(nT) and L^{2p/d} ≳ nT{1 + ∥∫_s U(s)G(ds)∥₂²}, we obtain

$$\frac{\sqrt{nT}\,\{\widehat V^\pi_{\mathrm{IPW}}(G) - V^\pi(G)\}}{\sigma_{\pi,\mathrm{IPW}}(G)} = \frac{\sqrt{nT}\,\{\int_s U(s)G(ds)\}^\top\zeta_1}{\sigma_{\pi,\mathrm{IPW}}(G)} + o_p(1). \tag{40}$$

This completes the second step of the proof.

Step 3. Show that √(nT){∫_s U(s)G(ds)}ᵀζ₁/σ_{π,IPW}(G) is asymptotically standard normal.

In this step, we first construct a martingale and then apply the martingale central limit theorem. For any integer 1 ≤ g ≤ nT, let i(g) and t(g) be the quotient and the remainder of g + T − 1 divided by T, that is, g = {i(g) − 1}T + t(g) + 1 with 1 ≤ i(g) ≤ n and 0 ≤ t(g) < T. Let F(0) = {S_{1,0}, A_{1,0}}, and iteratively define {F(g)}_{1≤g≤nT} as follows:

F(g) = F(g−1) ∪ {R_{i(g),t(g)+1}, η_{i(g),t(g)+1}, S_{i(g),t(g)+1}, A_{i(g),t(g)+1}} if t(g) < T − 1;
F(g) = F(g−1) ∪ {R_{i(g),T}, η_{i(g),T}, S_{i(g),T}, S_{i(g)+1,0}, A_{i(g)+1,0}} otherwise.

Use ξ(g), m*(g), ε(g), and ω*(g) to denote ξ_{i(g),t(g)}, m*_{i(g),t(g)}, ε_{i(g),t(g)}, and ω_{i(g),t(g)+1,ψ*}, respectively. It follows from (31) that

$$\frac{\sqrt{nT}\,\{\int_s U(s)G(ds)\}^\top\zeta_1}{\sigma_{\pi,\mathrm{IPW}}(G)} = \sum_{g=1}^{nT}\frac{\{\int_s U(s)G(ds)\}^\top(\Sigma^*)^{-1}\zeta(g)}{\sqrt{nT}\,\sigma_{\pi,\mathrm{IPW}}(G)} + o_p(1), \tag{41}$$

where ζ(g) = ω*(g)ξ(g)ε(g) − H₂m*(g). Using arguments similar to (22), we can show E{ω*(g)ξ(g)ε(g) | F(g−1)} = 0. Meanwhile, E{m*(g) | F(g−1)} = 0 holds as a result of E{ω*(g) | F(g−1)} = 1. Therefore E{ζ(g) | F(g−1)} = 0, so the first term on the RHS of (41) forms a martingale with respect to the filtration {σ(F(g))}_{g≥0}, where σ(F(g)) stands for the σ-algebra generated by F(g). We can then apply a martingale central limit theorem for triangular arrays (Corollary 2.8 of McLeish, 1974) to show the asymptotic normality. This requires verifying the following two conditions:

(a) max_{1≤g≤nT} |{∫_s U(s)G(ds)}ᵀ(Σ*)^{-1}ζ(g)| / {√(nT) σ_{π,IPW}(G)} →p 0;

(b) (nT)^{-1} Σ_{g=1}^{nT} [{∫_s U(s)G(ds)}ᵀ(Σ*)^{-1}ζ(g)]² / σ²_{π,IPW}(G) →p 1.

First, we verify condition (a). It follows from the Cauchy-Schwarz inequality that

$$\frac{|\{\int_s U(s)G(ds)\}^\top(\Sigma^*)^{-1}\zeta(g)|}{\sqrt{nT}\,\sigma_{\pi,\mathrm{IPW}}(G)} \le \frac{\|\{\int_s U(s)G(ds)\}^\top(\Sigma^*)^{-1}\|_2\,\|\zeta(g)\|_2}{\sqrt{nT}\,\sigma_{\pi,\mathrm{IPW}}(G)}.$$

Notice that

$$\|\zeta(g)\|_2 \le |\omega^*(g)|\,\|\xi(g)\|_2\,|\varepsilon(g)| + \|H_2\|_2\,\|m^*(g)\|_2 \le c_\lambda^{-1}(c_0 + 2c^*)\sup_{s}\|\Phi_L(s)\|_2 + \|H_2\|_2\,\|m^*(g)\|_2 \le C_\zeta\sqrt{L}$$

for some constant C_ζ. Together with Equation (39), we have

$$\frac{|\{\int_s U(s)G(ds)\}^\top(\Sigma^*)^{-1}\zeta(g)|}{\sqrt{nT}\,\sigma_{\pi,\mathrm{IPW}}(G)} \lesssim \sqrt{\frac{L}{nT}}.$$

Since L ≲ √(nT)/log(nT), condition (a) is proven.

Next, we verify condition (b). Notice that

$$\frac{1}{nT}\sum_{g=1}^{nT}\frac{[\{\int_s U(s)G(ds)\}^\top(\Sigma^*)^{-1}\zeta(g)]^2}{\sigma^2_{\pi,\mathrm{IPW}}(G)} - 1 = \frac{1}{\sigma^2_{\pi,\mathrm{IPW}}(G)}\Big\{\int_s U(s)G(ds)\Big\}^\top(\Sigma^*)^{-1}\Big\{\frac{1}{nT}\sum_{g=1}^{nT}\zeta(g)\zeta(g)^\top - \Omega_{\mathrm{IPW}}\Big\}(\Sigma^*)^{-1}\Big\{\int_s U(s)G(ds)\Big\}.$$

In view of Equation (38), it suffices to show

$$\Big\|\frac{1}{nT}\sum_{g=1}^{nT}\zeta(g)\zeta(g)^\top - \Omega_{\mathrm{IPW}}\Big\|_2 = o_p(1). \tag{42}$$

This can be proven using arguments similar to those used to bound ∥Σ̂ − Σ*∥₂ in the proof of Lemma G.3.
Therefore,

$$\sum_{g=1}^{nT}\frac{\{\int_s U(s)G(ds)\}^\top(\Sigma^*)^{-1}\zeta(g)}{\sqrt{nT}\,\sigma_{\pi,\mathrm{IPW}}(G)} \xrightarrow{d} N(0,1),$$

and hence √(nT){∫_s U(s)G(ds)}ᵀζ₁/σ_{π,IPW}(G) →d N(0, 1). It follows from (40) and Slutsky's theorem that

$$\frac{\sqrt{nT}\,\{\widehat V^\pi(G) - V^\pi(G)\}}{\sigma_{\pi,\mathrm{IPW}}(G)} \xrightarrow{d} N(0,1).$$

Step 4. Show σ̂π(G)/σ_{π,IPW}(G) →p 1.

Using arguments similar to those used to verify condition (b), it suffices to show

$$\big\|\widehat\Sigma_{\mathrm{IPW}}^{-1}\widehat\Omega_{\mathrm{IPW}}(\widehat\Sigma_{\mathrm{IPW}}^\top)^{-1} - (\Sigma^*)^{-1}\Omega_{\mathrm{IPW}}(\Sigma^{*\top})^{-1}\big\|_2 = o_p(1).$$

Lemma G.10 indicates that ∥Ω_IPW∥₂ = O_p(1). This, together with Lemma G.8 and the condition L ≲ √(nT)/log(nT), yields

$$\big\|\widehat\Sigma_{\mathrm{IPW}}^{-1}\Omega_{\mathrm{IPW}}(\widehat\Sigma_{\mathrm{IPW}}^\top)^{-1} - (\Sigma^*)^{-1}\Omega_{\mathrm{IPW}}(\Sigma^{*\top})^{-1}\big\|_2 \le \big\|\widehat\Sigma_{\mathrm{IPW}}^{-1} - (\Sigma^*)^{-1}\big\|_2\,\|\Omega_{\mathrm{IPW}}\|_2\,\big(\|\widehat\Sigma_{\mathrm{IPW}}^{-1}\|_2 + \|(\Sigma^*)^{-1}\|_2\big) = O_p\big\{L^{1/2}(nT)^{-1/2}\log(nT)\big\} = o_p(1).$$

Thus, it remains to show ∥Σ̂_IPW^{-1}Ω̂_IPW(Σ̂_IPWᵀ)^{-1} − Σ̂_IPW^{-1}Ω_IPW(Σ̂_IPWᵀ)^{-1}∥₂ = o_p(1), or equivalently,

$$\|\widehat\Omega_{\mathrm{IPW}} - \Omega_{\mathrm{IPW}}\|_2 = o_p(1). \tag{43}$$

In view of (42), we only need to show ∥Ω̂_IPW − (nT)^{-1}Σ_{g=1}^{nT} ζ(g)ζ(g)ᵀ∥₂ = o_p(1). Notice that

$$\widehat\Omega_{\mathrm{IPW}} - \frac{1}{nT}\sum_{g=1}^{nT}\zeta(g)\zeta(g)^\top = \frac{1}{nT}\sum_{g=1}^{nT}\big\{\widehat\zeta(g)\widehat\zeta(g)^\top - \zeta(g)\zeta(g)^\top\big\} = \frac{1}{nT}\sum_{g=1}^{nT}\big\{(\widehat\zeta(g)-\zeta(g))\widehat\zeta(g)^\top + \zeta(g)(\widehat\zeta(g)-\zeta(g))^\top\big\}.$$

By the triangle inequality, we have

$$\Big\|\widehat\Omega_{\mathrm{IPW}} - \frac{1}{nT}\sum_{g=1}^{nT}\zeta(g)\zeta(g)^\top\Big\|_2 \le \Big\|\frac{1}{nT}\sum_{g=1}^{nT}(\widehat\zeta(g)-\zeta(g))\widehat\zeta(g)^\top\Big\|_2 + \Big\|\frac{1}{nT}\sum_{g=1}^{nT}\zeta(g)(\widehat\zeta(g)-\zeta(g))^\top\Big\|_2. \tag{44}$$

It suffices to show that both terms are o_p(1). Recall that ζ̂(g) = ω̂(g)ε̂(g)ξ(g) − Ĥ₂m̂(g), where

$$\widehat\varepsilon(g) = R_{i(g),t(g)+1} + \gamma\sum_{a\in\mathcal A}\pi\big(a \mid S_{i(g),t(g)+1}\big)\,\Phi_L\big(S_{i(g),t(g)+1}\big)^\top\widehat\beta_a - \Phi_L\big(S_{i(g),t(g)}\big)^\top\widehat\beta_{A_{i(g),t(g)}},$$

and ω̂(g), Ĥ₂, m̂(g) are obtained by plugging in ψ̂. We first show ∥(nT)^{-1}Σ_g ζ(g)(ζ̂(g)−ζ(g))ᵀ∥₂ = ∥(nT)^{-1}Σ_g (ζ̂(g)−ζ(g))ζ(g)ᵀ∥₂ = o_p(1); the other statement can be shown using similar arguments. By definition, ζ̂(g) − ζ(g) can be expressed as

$$\widehat\zeta(g) - \zeta(g) = \underbrace{\big\{\widehat\omega(g)\widehat\varepsilon(g) - \omega^*(g)\varepsilon(g)\big\}\xi(g)}_{E_1^{(g)}} - \underbrace{\big\{\widehat H_2\widehat m(g) - H_2 m^*(g)\big\}}_{E_2^{(g)}},$$

and ζ(g) can be expressed as

$$\zeta(g) = \underbrace{\omega^*(g)\varepsilon(g)\xi(g)}_{E_3^{(g)}} - \underbrace{H_2 m^*(g)}_{E_4^{(g)}}.$$

By the triangle inequality,

$$\Big\|\frac{1}{nT}\sum_{g=1}^{nT}(\widehat\zeta(g)-\zeta(g))\zeta(g)^\top\Big\|_2 \le \sum_{i\in\{1,2\}}\sum_{j\in\{3,4\}}\Big\|\frac{1}{nT}\sum_{g=1}^{nT}E_i^{(g)}(E_j^{(g)})^\top\Big\|_2.$$

Thus, it suffices to show ∥(nT)^{-1}Σ_g E_i^{(g)}(E_j^{(g)})ᵀ∥₂ = o_p(1) for all i ∈ {1, 2} and j ∈ {3, 4}.

We first show ∥(nT)^{-1}Σ_g E_1^{(g)}(E_3^{(g)})ᵀ∥₂ = o_p(1). This is equivalent to showing

$$\sup_{a\in\mathbb S^{mL-1}}\frac{1}{nT}\sum_{g=1}^{nT} a^\top\xi(g)\xi(g)^\top a\,\big|\big\{\widehat\omega(g)\widehat\varepsilon(g)-\omega^*(g)\varepsilon(g)\big\}\,\omega^*(g)\varepsilon(g)\big| = o_p(1),$$

where S^{mL−1} denotes the unit sphere {a ∈ R^{mL} : ∥a∥₂ = 1}. According to Lemma G.4, sup_{a∈S^{mL−1}} (nT)^{-1}Σ_g aᵀξ(g)ξ(g)ᵀa = O_p(1), hence we only need to show max_{1≤g≤nT} |{ω̂(g)ε̂(g) − ω*(g)ε(g)} ω*(g)ε(g)| = o_p(1). The bound in (13) indicates that the ε(g)'s are uniformly bounded; together with the bound on ω*(g) given in Condition A.2(b), we obtain max_{1≤g≤nT} |ω*(g)ε(g)| = O_p(1). Therefore, it remains to show

$$\max_{1\le g\le nT}\big|\widehat\omega(g)\widehat\varepsilon(g) - \omega^*(g)\varepsilon(g)\big| = o_p(1).$$

Note that the term can be decomposed as ω̂(g)ε̂(g) − ω*(g)ε(g) = {ω̂(g) − ω*(g)}ε̂(g) + ω*(g){ε̂(g) − ε(g)}. Using arguments similar to those establishing (E.50) in Shi et al. (2021b), we have max_{1≤g≤nT} |ε(g) − ε̂(g)| = o_p(1). On the other hand, the consistency of the dropout propensity model, ψ̂ →p ψ*, implies ω̂(g) →p ω*(g) for every g, and thus max_{1≤g≤nT} |ω(g) − ω̂(g)| = o_p(1). Combining these results yields

$$\max_{1\le g\le nT}|\widehat\omega(g)\widehat\varepsilon(g) - \omega^*(g)\varepsilon(g)| \le \max_{1\le g\le nT}|\widehat\omega(g)-\omega^*(g)|\,|\widehat\varepsilon(g)| + \max_{1\le g\le nT}|\omega^*(g)|\,|\widehat\varepsilon(g)-\varepsilon(g)| = o_p(1). \tag{45}$$

This completes the proof of ∥(nT)^{-1}Σ_g E_1^{(g)}(E_3^{(g)})ᵀ∥₂ = o_p(1).

Next, we show ∥(nT)^{-1}Σ_g E_2^{(g)}(E_4^{(g)})ᵀ∥₂ = o_p(1). Using arguments similar to those used to bound ∥Σ̂ − Σ*∥₂ in the proof of Lemma G.3, we obtain ∥Ĥ₂ − H₂∥₂ = o_p(1) as well as ∥(nT)^{-1}Σ_g m*(g)(m*(g))ᵀ − E{m*(m*)ᵀ}∥₂ = o_p(1). It follows from ψ̂ →p ψ* that (nT)^{-1}Σ_g ∥m̂(g) − m*(g)∥₂² →p 0.
Therefore,

$$\Big\|\frac{1}{nT}\sum_{g=1}^{nT}\widehat m(g)(m^*(g))^\top - \frac{1}{nT}\sum_{g=1}^{nT}m^*(g)(m^*(g))^\top\Big\|_2 \le \frac{1}{nT}\sum_{g=1}^{nT}\big\|\{\widehat m(g)-m^*(g)\}(m^*(g))^\top\big\|_2 \le \Big\{\frac{1}{nT}\sum_{g=1}^{nT}\|\widehat m(g)-m^*(g)\|_2^2\Big\}^{1/2}\Big\{\frac{1}{nT}\sum_{g=1}^{nT}\|m^*(g)\|_2^2\Big\}^{1/2} = o_p(1),$$

where the last step holds because the convergence of the diagonal elements of (nT)^{-1}Σ_g m*(g)(m*(g))ᵀ implies that (nT)^{-1}Σ_g ∥m*(g)∥₂² is bounded in probability. Based on the aforementioned results, we can show

$$\Big\|\frac{1}{nT}\sum_{g=1}^{nT}E_2^{(g)}(E_4^{(g)})^\top\Big\|_2 = \Big\|\frac{1}{nT}\sum_{g=1}^{nT}\big\{\widehat H_2\widehat m(g) - H_2 m^*(g)\big\}(m^*(g))^\top H_2^\top\Big\|_2 = o_p(1).$$

It remains to show ∥(nT)^{-1}Σ_g E_1^{(g)}(E_4^{(g)})ᵀ∥₂ = o_p(1) and ∥(nT)^{-1}Σ_g E_2^{(g)}(E_3^{(g)})ᵀ∥₂ = o_p(1). They can be shown in similar ways; for brevity, we only prove the former. Notice that

$$\Big\|\frac{1}{nT}\sum_{g=1}^{nT}E_1^{(g)}(E_4^{(g)})^\top\Big\|_2 = \Big\|\frac{1}{nT}\sum_{g=1}^{nT}\big\{\widehat\omega(g)\widehat\varepsilon(g)-\omega^*(g)\varepsilon(g)\big\}\xi(g)(m^*(g))^\top H_2^\top\Big\|_2 \le \|H_2\|_2\Big\{\frac{1}{nT}\sum_{g=1}^{nT}\big\|\{\widehat\omega(g)\widehat\varepsilon(g)-\omega^*(g)\varepsilon(g)\}\xi(g)\big\|_2^2\Big\}^{1/2}\Big\{\frac{1}{nT}\sum_{g=1}^{nT}\|m^*(g)\|_2^2\Big\}^{1/2}.$$

Since (nT)^{-1}Σ_g ∥m*(g)∥₂² and ∥H₂∥₂ are bounded in probability, it suffices to show (nT)^{-1}Σ_g ∥{ω̂(g)ε̂(g) − ω*(g)ε(g)}ξ(g)∥₂² = o_p(1). We have already proven max_{1≤g≤nT} |ω̂(g)ε̂(g) − ω*(g)ε(g)| = o_p(1) in (45), hence max_{1≤g≤nT} {ω̂(g)ε̂(g) − ω*(g)ε(g)}² = o_p(1). Meanwhile, by Lemma G.4, sup_{a∈S^{mL−1}} (nT)^{-1}Σ_g aᵀξ(g)ξ(g)ᵀa = O_p(1). Thus,

$$\sup_{a\in\mathbb S^{mL-1}}\frac{1}{nT}\sum_{g=1}^{nT}a^\top\xi(g)\xi(g)^\top a\,\big\{\widehat\omega(g)\widehat\varepsilon(g)-\omega^*(g)\varepsilon(g)\big\}^2 = o_p(1),$$

or equivalently, (nT)^{-1}Σ_g ∥{ω̂(g)ε̂(g) − ω*(g)ε(g)}ξ(g)∥₂² = o_p(1). This completes the proof of ∥(nT)^{-1}Σ_g E_1^{(g)}(E_4^{(g)})ᵀ∥₂ = o_p(1). We can show ∥(nT)^{-1}Σ_g E_2^{(g)}(E_3^{(g)})ᵀ∥₂ = o_p(1) using similar steps. Combining these results yields ∥(nT)^{-1}Σ_g ζ(g)(ζ̂(g) − ζ(g))ᵀ∥₂ = o_p(1); similarly, ∥(nT)^{-1}Σ_g (ζ̂(g) − ζ(g))ζ̂(g)ᵀ∥₂ = o_p(1). Therefore, ∥Ω̂_IPW − (nT)^{-1}Σ_g ζ(g)ζ(g)ᵀ∥₂ = o_p(1), and hence (43) is proven.

G.4. Proof of Theorem F.1

Proof. Similar to the proof of Theorem 4.5, we define ε_{i,t} as follows:

$$\varepsilon_{i,t} = R_{i,t+1} + \gamma\sum_{a\in\mathcal A}Q^\pi(S_{i,t+1},a)\,\pi(a \mid S_{i,t+1}) - Q^\pi(S_{i,t},A_{i,t}),$$

and use F_t = {(S_j, A_j, R_{j+1})}_{0≤j<t} to denote the data history.