# Deeply-Debiased Off-Policy Interval Estimation

Chengchun Shi*1, Runzhe Wan*2, Victor Chernozhukov3, Rui Song2

*Equal contribution. 1Department of Statistics, London School of Economics and Political Science, London, United Kingdom. 2Department of Statistics, North Carolina State University, Raleigh, USA. 3Department of Economics, Massachusetts Institute of Technology, Cambridge, USA. Correspondence to: Rui Song. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Off-policy evaluation learns a target policy's value with a historical dataset generated by a different behavior policy. In addition to a point estimate, many applications would benefit significantly from having a confidence interval (CI) that quantifies the uncertainty of the point estimate. In this paper, we propose a novel deeply-debiasing procedure to construct an efficient, robust, and flexible CI on a target policy's value. Our method is justified by theoretical results and numerical experiments. A Python implementation of the proposed procedure is available at https://github.com/RunzheStat/D2OPE.

1. Introduction

Reinforcement learning (RL, Sutton & Barto, 2018) is a general technique in sequential decision making that learns an optimal policy to maximize the average cumulative reward. Prior to adopting any policy in practice, it is crucial to know the impact of implementing such a policy. In many real domains such as healthcare (Murphy et al., 2001; Luedtke & van der Laan, 2017; Shi et al., 2020a), robotics (Andrychowicz et al., 2020) and autonomous driving (Sallab et al., 2017), it is costly, risky, unethical, or even infeasible to evaluate a policy's impact by directly running this policy. This motivates us to study the off-policy evaluation (OPE) problem, which learns a target policy's value with pre-collected data generated by a different behavior policy.

In many applications (e.g., mobile health studies), the number of observations is limited. Take the OhioT1DM dataset (Marling & Bunescu, 2018) as an example: only a few thousand observations are available (Shi et al., 2020b). In these cases, in addition to a point estimate of a target policy's value, it is crucial to construct a confidence interval (CI) that quantifies the uncertainty of the value estimate. This paper is concerned with the following question: is it possible to develop a robust and efficient off-policy value estimator, and provide rigorous uncertainty quantification under practically feasible conditions? We will give an affirmative answer to this question.

Overview of the OPE Literature. There is a growing literature on OPE. Existing works can be cast into the direct method (see e.g., Le et al., 2019; Shi et al., 2020c; Feng et al., 2020), the importance sampling-based method (IS, Precup, 2000; Thomas et al., 2015b; Hanna et al., 2016; Liu et al., 2018; Nachum et al., 2019; Dai et al., 2020), and the doubly robust method (Jiang & Li, 2016; Thomas & Brunskill, 2016; Farajtabar et al., 2018; Tang et al., 2019; Uehara et al., 2019; Kallus & Uehara, 2020; Jiang & Huang, 2020). The direct method derives the value estimate by learning the system transition matrix or the Q-function under the target policy. IS estimates the value by re-weighting the observed rewards with the density ratio of the target and behavior policies. Both the direct method and IS have their own merits.
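To make the IS idea concrete, below is a minimal sketch of a stepwise IS estimate of the discounted value. The trajectory format and the callables `target_prob` and `behavior_prob` (action probabilities under the target and behavior policies) are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def stepwise_is_estimate(trajectories, target_prob, behavior_prob, gamma=0.98):
    """Stepwise importance-sampling estimate of the discounted value (sketch).

    trajectories: list of trajectories, each a list of (state, action, reward).
    target_prob(s, a), behavior_prob(s, a): action probabilities under the
    target and behavior policies (assumed callables).
    """
    returns = []
    for traj in trajectories:
        rho, value = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # cumulative density ratio up to time t
            rho *= target_prob(s, a) / behavior_prob(s, a)
            # re-weighted discounted reward
            value += (gamma ** t) * rho * r
        returns.append(value)
    return float(np.mean(returns))
```

The cumulative product of per-step ratios in this sketch is exactly what drives the variance issue discussed next.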
In general, IS-type estimators might suffer from a large variance due to the use of the density ratio, whereas the direct method might suffer from a large bias due to potential misspecification of the model. Doubly robust methods combine both for more robust and efficient value evaluation.

Despite the popularity of developing point estimates of a target policy's value, less attention has been paid to constructing its CI, which is the focus of this paper. Among the available methods, Thomas et al. (2015b) and Hanna et al. (2016) derived CIs by applying the bootstrap or concentration inequalities to the stepwise IS estimator. These methods suffer from the curse of horizon (Liu et al., 2018), leading to very wide CIs. Feng et al. (2020) applied Hoeffding's inequality to derive a CI based on a kernel-based Q-function estimator. Similar to the direct method, their estimator might suffer from a large bias. Dai et al. (2020) reformulated the OPE problem using the generalized estimating equation approach and applied the empirical likelihood approach (see e.g., Owen, 2001) to CI estimation. They derived the CI by assuming the data transitions are i.i.d. However, observations in reinforcement learning are typically time-dependent. Directly applying the empirical likelihood method to weakly dependent data would fail without further adjustment (Kitamura et al., 1997; Duchi et al., 2016). The resulting CI might not be valid. We discuss this in detail in Appendix D.

Recently, Kallus & Uehara (2019) made an important step forward for OPE by developing a double reinforcement learning (DRL) estimator that achieves the semiparametric efficiency bound (see e.g., Tsiatis, 2007). Their method learns a Q-function and a marginalized density ratio and requires only one of the two estimators to be consistent. When both estimators converge at certain rates, DRL is asymptotically normal, based on which a Wald-type CI can be derived. However, these convergence rates might not be achievable in complicated RL tasks with high-dimensional state variables, resulting in an asymptotically biased value estimator and an invalid CI. See Section 2.2 for details. Finally, we remark that our work is also related to a line of research on statistical inference in bandits (Van Der Laan & Lendle, 2014; Deshpande et al., 2018; Zhang et al., 2020; Hadad et al., 2021). However, these methods are not applicable to our setting.

Advances of the Proposed Method. Our proposal is built upon the DRL estimator to achieve sample efficiency. To derive a valid CI under weaker and practically more feasible conditions than DRL, we propose to learn a conditional density ratio estimator and develop a deeply-debiasing process that iteratively reduces the biases of the Q-function and value estimators. Debiasing brings additional robustness and flexibility. In a contextual bandit setting, our proposal shares a similar spirit with the minimax optimal estimation procedure that uses higher-order influence functions for learning average treatment effects (see e.g., Robins et al., 2008; 2017; Mukherjee et al., 2017; Mackey et al., 2018). As such, the proposed method is:

- robust, as the proposed value estimator is more robust than DRL and can converge to the true value in cases where neither the Q-function nor the marginalized density ratio estimator is consistent.
More specifically, it is "triply-robust" in that it only requires one of the Q-function, marginalized density ratio, and conditional density ratio estimators to be consistent. See Theorem 1 for a formal statement.

- efficient, as we can show it achieves the same semiparametric efficiency bound as DRL. This in turn implies that the proposed CI is tight. See Theorem 2 for details.

- flexible, as it requires much weaker and practically more feasible conditions to achieve nominal coverage. Specifically, our procedure allows the Q-estimator and the marginalized density ratio estimator to converge at an arbitrary rate. See Theorem 3 for details.

2. Preliminaries

We first formulate the OPE problem. We next review the DRL method, as it is closely related to our proposal.

2.1. Off-Policy Evaluation

We assume the data in OPE follow a Markov Decision Process (MDP, Puterman, 2014) model defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p: \mathcal{S}^2 \times \mathcal{A} \to [0, 1]$ is the Markov transition matrix that characterizes the system transitions, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is a discount factor that balances the immediate and future rewards. To simplify the presentation, we assume the state space is discrete; the proposed method is equally applicable to continuous state spaces. Let $\{(S_t, A_t, R_t)\}_{t \ge 0}$ denote a trajectory generated from the MDP model, where $(S_t, A_t, R_t)$ denotes the state-action-reward triplet at time $t$. Throughout this paper, we assume the following Markov assumption (MA) and conditional mean independence assumption (CMIA) hold:

$P(S_{t+1} = s \mid \{S_j, A_j, R_j\}_{0 \le j \le t}) = p(s; A_t, S_t)$, (MA)

$E(R_t \mid S_t, A_t, \{S_j, A_j, R_j\}_{0 \le j < t}) = r(S_t, A_t)$. (CMIA)

The CI based on DRL requires $\min(\alpha_1, \alpha_2) > 1/4$, whereas our second-order triply-robust estimator relaxes this condition by requiring $\min(\alpha_1, \alpha_2, \alpha_3) > 1/6$, as shown in Figure 1.

3.2.3. The m-Step Debias Iteration

To further relax the convergence rate requirement, we can iteratively debias the Q-estimator to construct higher-order value estimates. Specifically, for any order $m \ge 2$, we iteratively apply the debiasing operator to the initial Q-estimator $m-1$ times and average over all individual tuples, leading to the following estimator,

$\widehat{Q}^{(m)}_k = \binom{|I_k| T}{m-1}^{-1} \sum \mathcal{D}^{(i_1,t_1)}_k \cdots \mathcal{D}^{(i_{m-1},t_{m-1})}_k \widehat{Q}_k,$

where the sum is taken over all possible combinations of disjoint tuples $(i_1, t_1), (i_2, t_2), \ldots, (i_{m-1}, t_{m-1})$ in the set $\{(i, t) : i \in I_k, 0 \le t < T\}$. Note that the definition involves repeated compositions of the debiasing operator. For $m = 3$, we present the detailed form in the appendix. In general, $\widehat{Q}^{(m)}_k(a, s)$ corresponds to an order-$(m-1)$ U-statistic (see e.g., Lee, 2019) for any $(a, s)$. The resulting value estimator $\widehat{\eta}^{(m)}_{\mathrm{TR}}$ is given by $(nT)^{-1} \sum_{i,t} \psi^{(m)}_{i,t}$, where for any $(i, t)$ with $i \in I_k$, the estimating function $\psi^{(m)}_{i,t}$ is obtained by replacing $\widehat{Q}$ in (2) with $\widehat{Q}^{(m)}_k$.

We make a few remarks. First, when $m = 1$, $\widehat{Q}^{(m)}_k$ corresponds to the initial Q-estimator; as such, the proposed estimator reduces to the DRL estimator. When $m = 2$, the definition here is consistent with the second-order triply-robust estimator. Second, for large $m$, calculating $\widehat{Q}^{(m)}_k$ is computationally intensive. In practice, we may approximate it using incomplete U-statistics (Lee, 2019; Chen et al., 2019) to facilitate the computation. For instance, to calculate $\widehat{Q}^{(3)}_k(a, s)$, we could approximate it by averaging $\widehat{\mathcal{D}}^{(i_1,t_1)}_k \widehat{\mathcal{D}}^{(i_2,t_2)}_k \widehat{Q}_k(a, s)$ over $M$ pairs sampled from the set $\{(i_1, t_1, i_2, t_2) : i_1, i_2 \in I_k, (i_1, t_1) \ne (i_2, t_2)\}$. We require $M$ to diverge with $nT$ so that the approximation error is asymptotically negligible.
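Below is a minimal sketch of this incomplete U-statistic approximation for $\widehat{Q}^{(3)}_k$. The interfaces debias_op (the estimated debiasing operator $\widehat{\mathcal{D}}^{(i,t)}_k$, mapping a Q-function to a debiased Q-function) and Q_init (the initial Q-estimator) are assumed for illustration and are not the paper's implementation.

```python
import random

def incomplete_third_order_Q(debias_op, Q_init, index_set, M, seed=0):
    """Monte-Carlo (incomplete U-statistic) approximation of the third-order
    Q-estimator: average the composition of two debiasing operators applied
    to Q_init over M randomly sampled pairs of distinct (i, t) tuples.

    debias_op(i, t, Q) -> new Q-function; Q_init(a, s) -> float.
    M should diverge with nT so the approximation error is negligible.
    """
    rng = random.Random(seed)
    index_list = list(index_set)
    # sample M pairs of distinct (i, t) tuples from the data fold
    pairs = [tuple(rng.sample(index_list, 2)) for _ in range(M)]

    def Q3(a, s):
        total = 0.0
        for (i1, t1), (i2, t2) in pairs:
            # compose the two estimated debiasing operators, then evaluate at (a, s)
            total += debias_op(i1, t1, debias_op(i2, t2, Q_init))(a, s)
        return total / M

    return Q3
```

Each query of the approximate estimator costs M applications of the debiasing operator, so M trades Monte-Carlo error against computation.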
The computational complexity of our whole algorithm is analyzed in Appendix C.4 in the supplement. Third, the bias of the Q-estimator and that of the resulting value estimator decrease as the order $m$ increases. Specifically, we have the following result.

Lemma 3 Suppose the conditions in Lemma 2 hold. Let $\alpha^* = \alpha_1 + (m-1)\alpha_2$ and $\alpha = \min\{1, \alpha_1 + (m-1)\alpha_2 + \alpha_3\}$. Then $\mathbb{E}_{(a,s) \sim p} |\mathbb{E}\, \widehat{Q}^{(m)}_k(a, s) - Q^\pi(a, s)| = O\{(nT)^{-\alpha^*}\}$ and $|\mathbb{E}\, \widehat{\eta}^{(m)}_{\mathrm{TR}} - \eta^\pi| = O\{(nT)^{-\alpha}\}$.

To ensure $\alpha > 1/2$, it suffices to require $\alpha_1 + (m-1)\alpha_2 + \alpha_3 > 1/2$. As long as $\alpha_1, \alpha_2, \alpha_3 > 0$, there exists some $m$ that satisfies this condition. As such, the resulting bias decays faster than $(nT)^{-1/2}$. This yields the flexibility of our estimator, as it allows the nuisance function estimators to converge at an arbitrary rate. When $m = 2$, Lemmas 1 and 2 are directly implied by Lemma 3.

3.3. Learning the Nuisance Functions

This step estimates the nuisance functions used in our algorithm, including $Q^\pi$, $\omega^\pi$, and $\tau^\pi$, based on each data subset $I^c_k$, for $k = 1, \ldots, K$.

The Q-function. There are multiple learning methods available to produce an initial estimator for $Q^\pi$. We employ the fitted Q-evaluation method (Le et al., 2019) in our implementation. Based on the Bellman equation for $Q^\pi$ (see Equation (4.6), Sutton & Barto, 2018), it iteratively solves the following optimization problem,

$\widehat{Q}^{\ell} = \arg\min_{Q} \sum_{i \in I^c_k} \sum_{0 \le t < T} \Big\{ R_{i,t} + \gamma \sum_{a \in \mathcal{A}} \pi(a \mid S_{i,t+1}) \widehat{Q}^{\ell-1}(a, S_{i,t+1}) - Q(A_{i,t}, S_{i,t}) \Big\}^2.$

To establish the theoretical properties of the proposed estimator, we impose the following conditions.

(A1) The process $\{(S_t, A_t, R_t)\}_{t \ge 0}$ is strictly stationary and exponentially $\beta$-mixing.

(A2) The estimators $\widehat{Q}_k$, $\widehat{\omega}_k$ and $\widehat{\tau}_k$ converge in $L_2$-norm to $Q^\pi$, $\omega^\pi$ and $\tau^\pi$ at rates $(nT)^{-\alpha_1}$, $(nT)^{-\alpha_2}$ and $(nT)^{-\alpha_3}$ for some $\alpha_1, \alpha_2, \alpha_3 > 0$, respectively.

(A3) $\tau^\pi$ and $\omega^\pi$ are uniformly bounded away from infinity.

Condition (A1) allows the data observations to be weakly dependent. When the behavior policy is not history-dependent, the process $\{(S_t, A_t, R_t)\}_{t \ge 0}$ forms a Markov chain. The exponential $\beta$-mixing condition is automatically satisfied when the Markov chain is geometrically ergodic (see Theorem 3.7 of Bradley, 2005). Geometric ergodicity is less restrictive than the conditions imposed in the existing reinforcement learning literature that require observations to be independent (see e.g., Dai et al., 2020) or to follow a uniformly ergodic Markov chain (see e.g., Bhandari et al., 2018; Zou et al., 2019). We also remark that the stationarity assumption in (A1) is imposed for convenience, since the Markov chain will eventually reach stationarity. Condition (A2) characterizes the theoretical requirements on the nuisance function estimators. This assumption is mild, as we allow these estimators to converge at any rate. When using kernels or neural networks for function approximation, the corresponding convergence rates of $\widehat{Q}_k$ and $\widehat{\omega}_k$ are provided in Fan et al. (2020); Liao et al. (2020). The convergence rate for $\widehat{\tau}_k$ can be derived similarly to that of $\widehat{\omega}_k$. Condition (A3) essentially requires that any state-action pair supported by the density function $(1-\gamma) \sum_{t \ge 0} \gamma^t p^\pi_t$ is supported by the stationary behavior density function as well. This assumption is similar to the sequential overlap condition imposed by Kallus & Uehara (2020).

Theorem 1 (Robustness) Suppose (A1) and (A3) hold, and $\widehat{Q}_k$, $\widehat{\tau}_k$, $\widehat{\omega}_k$ are uniformly bounded away from infinity almost surely. Then for any $m$, as either $n$ or $T$ diverges to infinity, our value estimator $\widehat{\eta}^{(m)}_{\mathrm{TR}}$ is consistent when $\widehat{Q}_k$, $\widehat{\tau}_k$ or $\widehat{\omega}_k$ converges in $L_2$-norm to $Q^\pi$, $\tau^\pi$ or $\omega^\pi$ for any $k$.

Theorem 1 does not rely on Condition (A2). It only requires one of the three nuisance estimators to converge. As such, it is more robust than existing doubly-robust estimators.

Theorem 2 (Efficiency) Suppose (A1) and (A2) hold, and $\widehat{Q}_k$, $\widehat{\tau}_k$, $\widehat{\omega}_k$, $\tau^\pi$, $\omega^\pi$ are uniformly bounded away from infinity almost surely.
Then for any $m$, as either $n$ or $T$ approaches infinity, $\sqrt{nT}\,(\widehat{\eta}^{(m)}_{\mathrm{TR}} - \mathbb{E}\,\widehat{\eta}^{(m)}_{\mathrm{TR}}) \stackrel{d}{\to} N(0, \sigma^2)$, where $\sigma^2$ corresponds to the efficiency bound in (3).

We make some remarks. In the proof of Theorem 2, we show that $\widehat{\eta}^{(m)}_{\mathrm{TR}}$ is asymptotically equivalent to an $m$th-order U-statistic. According to the Hoeffding decomposition (Hoeffding, 1948), we can decompose the U-statistic into the sum $\eta^\pi + \sum_{j=1}^{m} \widehat{\eta}_j$, where $\eta^\pi$ is the main effect term that corresponds to the asymptotic mean of the value estimator, $\widehat{\eta}_1$ is the first-order term

$(nT)^{-1} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \omega^\pi(A_{i,t}, S_{i,t}) \{R_{i,t} + \gamma \mathbb{E}_{a \sim \pi(\cdot \mid S_{i,t+1})} Q^\pi(a, S_{i,t+1}) - Q^\pi(A_{i,t}, S_{i,t})\},$

and $\widehat{\eta}_j$ corresponds to a $j$th-order degenerate U-statistic for any $j \ge 2$. See Part 3 of the proof of Theorem 2 for details. Note that the DRL estimator is asymptotically equivalent to $\eta^\pi + \widehat{\eta}_1$. Under (A1), the $\widehat{\eta}_j$'s are asymptotically uncorrelated. As such, the variance of our estimator is asymptotically equivalent to $\sum_{j=1}^{m} \mathrm{Var}(\widehat{\eta}_j)$, where the $\sigma^2_j$'s are bounded. When $j = 1$, we have $\sigma^2_j = \sigma^2$. For $j \ge 2$, $\mathrm{Var}(\widehat{\eta}_j)$ decays at a faster rate than $\mathrm{Var}(\widehat{\eta}_1) = \sigma^2 (nT)^{-1}$. As such, the variance of our estimator is asymptotically equivalent to that of DRL. However, in finite samples, the variance of the proposed estimator is strictly larger than that of DRL, due to the presence of the higher-order variance terms. This is consistent with our experimental results (see Section 5), where we find the proposed CI is usually slightly wider than that based on DRL. This reflects a bias-variance trade-off. Specifically, our procedure alleviates the bias of the DRL estimator to obtain valid uncertainty quantification. The resulting estimator has a strictly larger variance than DRL in finite samples, although the difference is asymptotically negligible. We also remark that in interval estimation, the first priority is to ensure the CI has nominal coverage. This requires an estimator's bias to decay faster than its variance. The second priority is to shorten the length of the CI (the variance of the estimator) if possible. In that sense, variance is less significant than bias.

Theorem 3 (Flexibility) Suppose the conditions in Theorem 2 hold. Then as long as $m$ satisfies $\alpha_1 + (m-1)\alpha_2 + \alpha_3 > 1/2$, the proposed CI achieves nominal coverage.

Theorem 3 implies that our CI allows the nuisance function estimators to converge at an arbitrarily slow rate, provided $m$ is sufficiently large.

5. Experiments

In this section, we evaluate the empirical performance of our method using two synthetic environments: CartPole from the OpenAI Gym (Brockman et al., 2016) and a simulation environment (referred to as Diabetes) built to simulate the OhioT1DM data (Shi et al., 2020b). In the second environment, the goal is to learn an optimal policy as a function of patients' time-varying covariates to improve their health status. In both settings, following Uehara et al. (2019), we first learn a near-optimal policy as the target policy, and then apply softmax to its Q-function divided by a temperature parameter $\tau$ to set the action probabilities that define the behavior policy. A larger $\tau$ implies a larger difference between the behavior policy and the target policy. We denote the proposed method as TR and present results with $m = 2$ and $3$. The choice of $m$ represents a trade-off: in theory, $m$ should be as large as possible to guarantee the validity of our CI, yet the computational complexity increases exponentially in $m$. In our experiments, we find that setting $m = 3$ yields satisfactory performance in general.
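As an illustration of the behavior-policy construction described above, the following is a minimal sketch of the softmax-with-temperature step; the function name and the example Q-values are placeholders rather than the paper's implementation.

```python
import numpy as np

def softmax_behavior_policy(q_values, tau):
    """Action probabilities from softmax of the target policy's Q-values
    divided by a temperature tau (illustrative sketch).

    A larger tau flattens the distribution, so the behavior policy deviates
    more from the (near-greedy) target policy.
    """
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()          # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# nearly greedy at a small temperature, close to uniform at a large one
print(softmax_behavior_policy([1.0, 0.5, 0.2], tau=0.1))
print(softmax_behavior_policy([1.0, 0.5, 0.2], tau=10.0))
```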
For point estimation, we compare the bias and RMSE of our method with DRL and the estimator computed via fitted Q-evaluation (FQE). For interval estimation, we compare the proposed CI with several competing baselines, including CoinDICE (Dai et al., 2020), the stepwise IS-based estimator with bootstrapping (Thomas et al., 2015a), the stepwise IS-based estimator with the Bernstein inequality (Thomas et al., 2015b), and the CI based on DRL. For each method, we report the empirical coverage probability and the average length of the constructed CI. We set $T = 300$ and $\gamma = 0.98$ for CartPole, and $T = 200$ and $\gamma = 0.95$ for Diabetes. For both environments, we vary the number of trajectories $n$ and the temperature $\tau$ to design different settings. Results are aggregated over 200 replications. Note that FQE and DRL share the same subroutines with TR, and hence the same hyper-parameters are used. More details about the environments and the implementations can be found in Section C of the supplement.

The results for CartPole and Diabetes are depicted in Figures 3 and 4, respectively.

Figure 3. Results for CartPole. We fix n = 20 and vary τ in the upper subplots, and fix τ = 0.3 and vary n in the lower subplots. The subplots from left to right report the coverage frequency with α = 0.9, the coverage frequency with α = 0.95, the mean log-width of the CIs with α = 0.95, the RMSE of the value estimates, and the bias of the value estimates. The yellow line (TR, m = 2) and the green line (TR, m = 3) largely overlap.

Figure 4. Results for Diabetes. We fix n = 20 and vary τ in the upper subplots, and fix τ = 1.0 and vary n in the lower subplots. Same legend as Figure 3. The yellow line (TR, m = 2) and the green line (TR, m = 3) largely overlap.

We summarize our findings as follows. In terms of interval estimation, first, the proposed CI achieves nominal coverage in all cases, whereas the CI based on DRL fails to cover the true value. This demonstrates that the proposed method is more robust than DRL. In addition, the average length of our CI is slightly larger than that of DRL in all cases, reflecting the bias-variance trade-off detailed in Section 4. Second, CoinDICE yields the narrowest CI, but its empirical coverage probability is well below the nominal level in all cases. As we commented in the introduction, this is because their method requires i.i.d. observations and would fail with weakly dependent data. Please refer to Appendix D for details. Third, the stepwise IS-based estimators suffer from the curse of horizon: the lengths of the resulting CIs are much larger than ours. Moreover, the CI based on bootstrapping the stepwise IS estimator fails to achieve nominal coverage, because the standard bootstrap method is not valid with weakly dependent data. In terms of point estimation, TR yields a smaller bias than DRL in all cases, and FQE suffers from the largest bias among the three methods. The RMSEs of DRL and TR are comparable and generally smaller than that of FQE. This demonstrates the efficiency of the proposed estimator.

6. Discussion

6.1. Order Selection

In this paper, we develop a deeply-debiased procedure for off-policy interval estimation. Our proposal relies on the specification of $m$, the number of debias iterations. The choice of $m$ represents a trade-off. In theory, $m$ should be as large as possible to reduce the bias of the value estimator and guarantee the validity of the resulting CI. Yet, the variance of the value estimator and the computational cost of our procedure increase with $m$. In the statistics literature, Lepski's method is a data-adaptive procedure for identifying the optimal tuning parameter when cross-validation is difficult to implement, as in our setup (see e.g., Su et al., 2020). It can be naturally coupled with the proposed method for order selection, to balance the bias-variance trade-off. A practical version of Lepski's method was developed using the bootstrap by Chernozhukov et al. (2014). This idea is worthwhile to explore and we leave it for future research.
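Purely as an illustration of how a Lepski-type order-selection rule might look, the sketch below picks the smallest order whose estimate agrees with every higher-order estimate up to a multiple of their standard errors. The stopping criterion and the inputs are assumptions made for this example, not the paper's procedure.

```python
def select_order_lepski(estimates, std_errors, z=1.96):
    """Lepski-type order selection (illustrative sketch).

    estimates[m-1], std_errors[m-1]: value estimate and its standard error
    for debiasing order m = 1, ..., M (hypothetical inputs).
    Returns the smallest m whose estimate is within z times the larger of
    the two standard errors of every higher-order estimate.
    """
    M = len(estimates)
    for m in range(1, M + 1):
        consistent = all(
            abs(estimates[m - 1] - estimates[j - 1])
            <= z * max(std_errors[m - 1], std_errors[j - 1])
            for j in range(m + 1, M + 1)
        )
        if consistent:
            return m
    return M
```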
6.2. Non-asymptotic Confidence Bound

A non-asymptotic confidence bound is typically obtained by applying concentration inequalities (e.g., Hoeffding's inequality or the Bernstein inequality; Van Der Vaart & Wellner, 1996) to a sum of uncorrelated variables. In our setup, the proposed estimator is a U-statistic. We could apply concentration inequalities for U-statistics (see e.g., Feng et al., 2020) to derive the confidence bound. Alternatively, we may apply self-normalized moderate deviation inequalities (Peña et al., 2008) to derive the non-asymptotic bound. The resulting confidence bound will be wider than the proposed CI; however, it is valid even with small sample sizes.

6.3. Hardness of Learning $\tau^\pi$

Learning $\tau^\pi$ can be much more challenging than learning $\omega^\pi$. In our current numerical experiments, all the state variables are continuous and it is challenging to obtain the ground truth of the conditional density ratio, which involves estimating a high-dimensional conditional density. As such, we did not investigate the goodness-of-fit of the proposed estimator for $\tau^\pi$. It would be practically interesting to explore the optimal neural network structure to approximate $\tau^\pi$ and to investigate the finite-sample convergence rate of our estimator. However, this is beyond the scope of the current paper. We leave it for future research.

6.4. Extension to Exploration

Finally, we remark that based on the proposed debiased Q-estimator, a two-sided CI can be similarly constructed to quantify its uncertainty. This allows us to follow the "optimism in the face of uncertainty" principle for online exploration. This is another topic that warrants future investigation.

Acknowledgements

Shi's research was partially supported by the LSE New Research Support Fund. The authors thank the meta-reviewer and the reviewers for their constructive comments, which have led to a significant improvement of the earlier version of this article.

References

Andrychowicz, O. M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.

Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Johns Hopkins University Press, Baltimore, 1993.

Bradley, R. C. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.

Breiman, L. Random forests. Machine Learning, 45(1):5–32, 2001.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Chen, X., Kato, K., et al. Randomized incomplete U-statistics in high dimensions. Annals of Statistics, 47(6):3127–3156, 2019.

Chernozhukov, V., Chetverikov, D., Kato, K., et al. Anti-concentration and honest, adaptive confidence bands. Annals of Statistics, 42(5):1787–1818, 2014.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–265, 2017.

Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvari, C., and Schuurmans, D. CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33, 2020.

Deshpande, Y., Mackey, L., Syrgkanis, V., and Taddy, M. Accurate inference for adaptive linear models. In International Conference on Machine Learning, pp. 1194–1203. PMLR, 2018.

Duchi, J., Glynn, P., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.

Fan, J., Wang, Z., Xie, Y., and Yang, Z. A theoretical analysis of deep Q-learning. In Learning for Dynamics and Control, pp. 486–489. PMLR, 2020.

Farajtabar, M., Chow, Y., and Ghavamzadeh, M. More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.

Feng, Y., Ren, T., Tang, Z., and Liu, Q. Accountable off-policy evaluation with kernel Bellman statistics. arXiv preprint arXiv:2008.06668, 2020.

Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S., and Athey, S. Confidence intervals for policy evaluation in adaptive experiments. Proceedings of the National Academy of Sciences, 118(15), 2021.

Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. arXiv preprint arXiv:1606.06126, 2016.

Hoeffding, W. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, pp. 293–325, 1948.

Jiang, N. and Huang, J. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33, 2020.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661. PMLR, 2016.

Kallus, N. and Uehara, M. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. arXiv preprint arXiv:1909.05850, 2019.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

Kitamura, Y. et al. Empirical likelihood methods with weakly dependent processes. The Annals of Statistics, 25(5):2084–2102, 1997.

Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. arXiv preprint arXiv:1903.08738, 2019.

Lee, A. J. U-Statistics: Theory and Practice. Routledge, 2019.

Liao, P., Qi, Z., and Murphy, S. Batch policy learning in average reward Markov decision processes. arXiv preprint arXiv:2007.11771, 2020.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366, 2018.

Luedtke, A. R. and van der Laan, M. J. Evaluating the impact of treating the optimal subgroup. Statistical Methods in Medical Research, 26(4):1630–1640, 2017.
Mackey, L., Syrgkanis, V., and Zadik, I. Orthogonal machine learning: Power and limitations. In International Conference on Machine Learning, pp. 3375–3383. PMLR, 2018.

Marling, C. and Bunescu, R. C. The OhioT1DM dataset for blood glucose level prediction. In KHD@IJCAI, pp. 60–63, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mukherjee, R., Newey, W. K., and Robins, J. M. Semiparametric efficient empirical higher order influence function estimators. arXiv preprint arXiv:1705.07577, 2017.

Murphy, S. A., van der Laan, M. J., Robins, J. M., and Group, C. P. P. R. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2318–2328, 2019.

Owen, A. B. Empirical Likelihood. CRC Press, 2001.

Peña, V. H., Lai, T. L., and Shao, Q.-M. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.

Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Robins, J., Li, L., Tchetgen, E., van der Vaart, A., et al. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and Statistics: Essays in Honor of David A. Freedman, pp. 335–421. Institute of Mathematical Statistics, 2008.

Robins, J. M., Li, L., Mukherjee, R., Tchetgen, E. T., van der Vaart, A., et al. Minimax estimation of a functional on a structured high-dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.

Sallab, A. E., Abdou, M., Perot, E., and Yogamani, S. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.

Shi, C. and Li, L. Testing mediation effects using logic of Boolean matrices. Journal of the American Statistical Association, accepted, 2021.

Shi, C., Lu, W., and Song, R. Breaking the curse of nonregularity with subagging: inference of the mean outcome under optimal treatment regimes. Journal of Machine Learning Research, 21(176):1–67, 2020a.

Shi, C., Wan, R., Song, R., Lu, W., and Leng, L. Does the Markov decision process fit the data: testing for the Markov property in sequential decision making. In International Conference on Machine Learning, pp. 8807–8817. PMLR, 2020b.

Shi, C., Zhang, S., Lu, W., and Song, R. Statistical inference of the value function for reinforcement learning in infinite horizon settings. arXiv preprint arXiv:2001.04515, 2020c.

Su, Y., Srinath, P., and Krishnamurthy, A. Adaptive estimator selection for off-policy evaluation. In International Conference on Machine Learning, pp. 9196–9205. PMLR, 2020.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tang, Z., Feng, Y., Li, L., Zhou, D., and Liu, Q. Doubly robust bias reduction in infinite horizon off-policy estimation. In International Conference on Learning Representations, 2019.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.
Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388, 2015a.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015b.

Tsiatis, A. Semiparametric Theory and Missing Data. Springer Science & Business Media, 2007.

Uehara, M., Huang, J., and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

Van Der Laan, M. J. and Lendle, S. D. Online targeted learning. 2014.

Van der Vaart, A. W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Van Der Vaart, A. W. and Wellner, J. A. Weak convergence. In Weak Convergence and Empirical Processes, pp. 16–28. Springer, 1996.

Zhang, K. W., Janson, L., and Murphy, S. A. Inference for batched bandits. arXiv preprint arXiv:2002.03217, 2020.

Zou, S., Xu, T., and Liang, Y. Finite-sample analysis for SARSA with linear function approximation. In Advances in Neural Information Processing Systems, pp. 8665–8675, 2019.