Bootstrapping Fitted Q-Evaluation for Off-Policy Inference

Botao Hao 1, Xiang Ji 2, Yaqi Duan 2, Hao Lu 2, Csaba Szepesvári 1 3, Mengdi Wang 1 2

1 DeepMind, 2 Princeton University, 3 University of Alberta. Correspondence to: Botao Hao, Mengdi Wang. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

1. Introduction

Off-policy evaluation (OPE) often serves as the starting point of batch reinforcement learning (RL). The objective of OPE is to estimate the value of a target policy based on batch episodes of state-transition trajectories that were generated using a different and possibly unknown behavior policy. In this paper, we investigate statistical inference for OPE. In particular, we analyze the popular fitted Q-evaluation (FQE) method, a basic model-free approach that fits the unknown value function from data using function approximation and backward dynamic programming (Fonteneau et al., 2013; Munos & Szepesvári, 2008; Le et al., 2019). In practice, FQE has demonstrated robust and satisfying performance on many classical RL tasks under different metrics (Voloshin et al., 2019). A more recent study by Paine et al. (2020) demonstrated the surprising scalability and effectiveness of FQE with deep neural nets in a range of complex continuous-state RL tasks. On the theoretical side, FQE was proved to be a minimax-optimal policy evaluator in the tabular and linear-model cases (Yin & Wang, 2020; Duan & Wang, 2020).

The aforementioned research mostly focuses on point estimation for OPE. In practical batch RL applications, a point estimate is far from enough, and statistical inference for OPE is of great interest. For instance, one often hopes to construct a tight confidence interval around the policy value, estimate the variance of an off-policy evaluator, or evaluate multiple policies using the same data and estimate their correlations. Bootstrapping (Efron, 1982) is a conceptually simple and generalizable approach to infer the error distribution based on batch data. Therefore, in this work, we study the use of bootstrapping for off-policy inference. We will provide theoretical justifications as well as numerical experiments. Our main results are summarized below.

First, we analyze the asymptotic distribution of FQE with linear function approximation and show that the policy evaluation error asymptotically follows a normal distribution (Theorem 4.2).
The asymptotic variance matches the Cramér-Rao lower bound for OPE (Theorem 4.5), implying that this estimator is asymptotically efficient. We propose a bootstrapping FQE method for estimating the distribution of the off-policy evaluation error. We prove that bootstrapping FQE is asymptotically consistent in estimating the distribution of the original FQE (Theorem 5.1), and we establish the consistency of the bootstrap confidence interval as well as of bootstrap variance estimation. Further, we propose a subsampled bootstrap procedure to improve the computational efficiency of bootstrapping FQE.

We highlight the necessity of bootstrapping by episodes, rather than by individual sample transitions as considered in previous work; see Kostrikov & Nachum (2020). The reason is that bootstrapping dependent data in general fails to characterize the right error distribution (Remark 2.1 in Singh (1981)). We illustrate this phenomenon via experiments (see Figure 1). All of our theoretical analysis applies to episodic, dependent data, and we do not require the i.i.d. sample-transition assumption commonly made in the OPE literature (Jiang & Huang, 2020; Kostrikov & Nachum, 2020; Dai et al., 2020).

Finally, we evaluate subsampled bootstrapping FQE in a range of classical RL tasks, including a discrete tabular domain, a continuous control domain, and a simulated healthcare example. We test variants of bootstrapping FQE with tabular representation, linear function approximation, and neural networks. We carefully examine the effectiveness and tightness of bootstrap confidence intervals, as well as the accuracy of bootstrapping for estimating the variance and correlation of off-policy estimators.

Related Work. Point estimation for OPE has received considerable attention in recent years. Popular approaches include direct methods (Lagoudakis & Parr, 2003; Ernst et al., 2005; Munos & Szepesvári, 2008; Le et al., 2019), doubly-robust and importance-sampling methods (Precup et al., 2000; Jiang & Li, 2016; Thomas & Brunskill, 2016), and marginalized importance sampling (Hallak & Mannor, 2017; Liu et al., 2018; Xie et al., 2019; Nachum et al., 2019; Uehara & Jiang, 2019; Zhang et al., 2020a;b). On the theoretical side, Uehara & Jiang (2019); Yin & Wang (2020) established asymptotic optimality and efficiency for OPE in the tabular setting, and Kallus & Uehara (2020) provided a complete study of semiparametric efficiency in a more general setting. Duan & Wang (2020); Hao et al. (2020b) showed that FQE with linear/sparse-linear function approximation is minimax-optimal, and Wang et al. (2020) studied the fundamental hardness of OPE with linear function approximation.

Confidence interval estimation for OPE is also important in many high-stakes applications. Thomas et al. (2015) proposed high-confidence OPE based on importance sampling and an empirical Bernstein inequality. Kuzborskij et al. (2020) proposed a tighter confidence interval for contextual bandits based on an empirical Efron-Stein inequality. However, importance sampling suffers from the curse of horizon (Liu et al., 2018), and concentration-based confidence intervals are typically overly conservative since they only exploit tail information (Hao et al., 2020a). Another line of recent work formulates confidence interval estimation as an optimization problem (Feng et al., 2020; 2021; Dai et al., 2020). These works are specific to confidence interval construction for OPE, and they do not provide distributional consistency guarantees.
Thus, they do not easily generalize to other statistical inference tasks. In the statistics community, Liao et al. (2019) studied OPE in an infinite-horizon undiscounted MDP and derived the asymptotic distribution of an empirical Bellman residual minimization estimator. Their asymptotic variance has a tabular representation and thus does not reflect the effect of function approximation. Shi et al. (2020) considered asymptotic confidence intervals for the policy value, but under a different model assumption that the Q-function is smooth.

Several existing works have investigated the use of bootstrapping in OPE. Thomas et al. (2015); Hanna et al. (2017) constructed confidence intervals by bootstrapping importance-sampling estimators or learned models, but without any consistency guarantee. The most closely related work is Kostrikov & Nachum (2020), which provided the first asymptotic consistency result for bootstrap confidence intervals in OPE. Our analysis improves on their work in the following aspects. First, we study FQE with linear function approximation, while Kostrikov & Nachum (2020) only considered the tabular case. Second, we provide distributional consistency of bootstrapping FQE, which is stronger than the consistency of the confidence interval in Kostrikov & Nachum (2020).

2. Preliminary

Consider an episodic Markov decision process (MDP) defined by a tuple $M = (\mathcal{S}, \mathcal{A}, P, r, H)$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s' \mid s, a)$ is the probability of reaching state $s'$ when taking action $a$ in state $s$, $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the reward function, and $H$ is the length of the horizon. A policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ maps states to distributions over actions. The state-action value function (Q-function) is defined, for $h = 1, \ldots, H$, as
$$Q^\pi_h(s, a) = \mathbb{E}^\pi\Big[ \sum_{h'=h}^{H} r(s_{h'}, a_{h'}) \,\Big|\, s_h = s,\, a_h = a \Big],$$
where $a_{h'} \sim \pi(\cdot \mid s_{h'})$, $s_{h'+1} \sim P(\cdot \mid s_{h'}, a_{h'})$, and $\mathbb{E}^\pi$ denotes expectation over the sample path generated under policy $\pi$. The Q-function satisfies the Bellman equation for policy $\pi$:
$$Q^\pi_{h-1}(s, a) = r(s, a) + \mathbb{E}\big[ V^\pi_h(s') \mid s, a \big],$$
where $s' \sim P(\cdot \mid s, a)$ and $V^\pi_h : \mathcal{S} \to \mathbb{R}$ is the value function defined as $V^\pi_h(s) = \int_{\mathcal{A}} Q^\pi_h(s, a)\, \pi(a \mid s)\, da$. Let $[n] = \{1, \ldots, n\}$. For a positive semidefinite matrix $X$, we denote by $\lambda_{\min}(X)$ the minimum eigenvalue of $X$. Denote by $I_d \in \mathbb{R}^{d \times d}$ the identity matrix, i.e., the diagonal matrix with 1 on every diagonal entry and 0 elsewhere.

Off-policy evaluation. Suppose that the batch data $D = \{D_1, \ldots, D_K\}$ consists of $K$ independent episodes collected using an unknown behavior policy $\bar\pi$. Each episode, denoted $D_k = \{(s^k_h, a^k_h, r^k_h)\}_{h \in [H]}$, is a trajectory of $H$ state-transition tuples. It is easy to generalize our analysis to multiple unknown behavior policies, since our algorithms do not require knowledge of the behavior policy. Let $N = KH$ be the total number of sample transitions; we sometimes write $D = \{(s_n, a_n, r_n)\}_{n \in [N]}$ for simplicity. The goal of OPE is to estimate the expected cumulative return (i.e., value) of a target policy $\pi$ from a fixed initial distribution $\xi_1$, based on the dataset $D$. The value is defined as
$$v^\pi = \mathbb{E}^\pi\Big[ \sum_{h=1}^{H} r(s_h, a_h) \Big], \qquad s_1 \sim \xi_1.$$
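For concreteness, the following minimal sketch (not from the paper) fixes a data layout for the batch dataset $D$ of $K$ episodes that the later code sketches reuse; the tuple ordering, the `Transition`/`Episode`/`Dataset` aliases, and the helper name are illustrative assumptions, not an interface defined by the authors.

```python
from typing import List, Tuple
import numpy as np

# One sample transition (s_h, a_h, r_h, s_{h+1}); states are arrays ready for a feature map,
# and actions come from a finite action set.
Transition = Tuple[np.ndarray, int, float, np.ndarray]
Episode = List[Transition]   # length H, ordered h = 1, ..., H
Dataset = List[Episode]      # the batch data D = {D_1, ..., D_K}

def num_transitions(dataset: Dataset) -> int:
    """N = K * H when every episode has full length H."""
    return sum(len(episode) for episode in dataset)
```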
Fitted Q-evaluation. Fitted Q-evaluation (FQE) is an instance of the fitted Q-iteration method, dating back to Fonteneau et al. (2013); Le et al. (2019). Let $\mathcal{F}$ be a given function class, for example a linear function class or a neural network class. Set $\hat{Q}^\pi_{H+1} = 0$. For $h = H, \ldots, 1$, we recursively estimate $Q^\pi_h$ by regression and function approximation:
$$\hat{Q}^\pi_h = \operatorname*{argmin}_{f \in \mathcal{F}} \Big\{ \sum_{n=1}^{N} \big( f(s_n, a_n) - y_n \big)^2 + \lambda \rho(f) \Big\},$$
where $y_n = r_n + \int_{\mathcal{A}} \hat{Q}^\pi_{h+1}(s_{n+1}, a)\, \pi(a \mid s_{n+1})\, da$ and $\rho(f)$ is a proper regularizer. The value estimate is
$$\hat{v}^\pi = \mathbb{E}_{s \sim \xi_1,\, a \sim \pi(\cdot \mid s)}\big[ \hat{Q}^\pi_1(s, a) \big], \tag{2.1}$$
which can be directly computed from $\hat{Q}^\pi_1$. See the full description of FQE in Appendix A.1.

Off-policy inference. Let $\hat{v}^\pi$ be an off-policy estimator of the target policy value $v^\pi$. In addition to the point estimate, we are primarily interested in the distribution of the off-policy evaluation error $\hat{v}^\pi - v^\pi$; we aim to infer this error distribution in order to conduct statistical inference. Suppose $\hat{F}$ is an estimated distribution of $\hat{v}^\pi - v^\pi$. Then we can use $\hat{F}$ for a range of downstream off-policy inference tasks, for example:

Moment estimation. With $\hat{F}$, we can estimate the $p$-th moment of $\hat{v}^\pi - v^\pi$ by $\int x^p \, d\hat{F}(x)$. Two important examples are bias estimation and variance estimation.

Confidence interval construction. Define the quantile function of $\hat{F}$ as $G(p) = \inf\{x \in \mathbb{R} : p \le \hat{F}(x)\}$. Specify a confidence level $0 < \delta \le 1$. With $\hat{F}$, we can construct the $1-\delta$ confidence interval as $[\hat{v}^\pi - G(1-\delta/2),\; \hat{v}^\pi - G(\delta/2)]$. If $\hat{F}$ is close to the true distribution of $\hat{v}^\pi - v^\pi$, this is nearly the tightest confidence interval for $v^\pi$ based on $\hat{v}^\pi$.

Evaluating multiple policies and estimating their correlation. Suppose there are two target policies $\pi_1, \pi_2$ to evaluate, with corresponding off-policy estimators $\hat{v}^{\pi_1}, \hat{v}^{\pi_2}$. Let $\hat{F}_{12}$ be the estimated joint distribution of $\hat{v}^{\pi_1} - v^{\pi_1}$ and $\hat{v}^{\pi_2} - v^{\pi_2}$. The Pearson correlation coefficient between the two estimators is
$$\rho(\hat{v}^{\pi_1}, \hat{v}^{\pi_2}) = \frac{\mathrm{Cov}(\hat{v}^{\pi_1}, \hat{v}^{\pi_2})}{\sqrt{\mathrm{Var}(\hat{v}^{\pi_1})\,\mathrm{Var}(\hat{v}^{\pi_2})}}.$$
Both the covariance and the variances can be estimated from $\hat{F}_{12}$, so we can further estimate the correlation between off-policy evaluators.

Remark 2.1 (Practical scenarios of estimating correlations). Correlation is a basic statistical metric for comparing two estimators, and we use it as an example to illustrate that bootstrapping can be used for estimating a variety of statistics, not limited to confidence intervals. In medical applications, we may have multiple target treatment policies to compare, where a correlation estimate together with confidence intervals would better inform physicians and allow a fairer comparison.

3. Bootstrapping Fitted Q-Evaluation (FQE)

As shown in Le et al. (2019); Voloshin et al. (2019); Duan & Wang (2020); Paine et al. (2020), FQE not only demonstrates strong empirical performance but also enjoys provably optimal theoretical guarantees. Thus it is natural to conduct bootstrapping on top of FQE for off-policy inference. Recall that the original dataset $D$ consists of $K$ episodes. We propose to bootstrap FQE by episodes: draw sample episodes $D^*_1, \ldots, D^*_K$ independently with replacement from $D$. This is the standard Efron's nonparametric bootstrap (Efron, 1982). We then run FQE on the bootstrapped set $D^* = \{D^*_1, \ldots, D^*_K\}$ as in Eq. (2.1) and let the output $\hat{v}^{*\pi}$ be the bootstrapping FQE estimator. By repeating the above process, we may obtain multiple samples of $\hat{v}^{*\pi}$ and use these samples to conduct further off-policy inference (see Section 6.2 for details).

Figure 1. Bootstrap by episodes vs. by sample transitions. The first panel is the true FQE error distribution by Monte Carlo approximation; the second panel is the bootstrap distribution by episode; the third is by sample transitions. Both behavior and target policies are the optimal policy. The number of Monte Carlo and bootstrap samples is 10,000.
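To make the procedure concrete, here is a minimal sketch of FQE with linear function approximation and of the episode-level bootstrap described above, using the data layout from the earlier sketch. It is an illustration under assumptions, not the authors' implementation: the feature map `phi`, the finite action set `actions`, the ridge regularizer, and the pooling of all $N$ transitions at every backward step (consistent with the time-homogeneous transition model) are choices made here.

```python
import numpy as np

def fqe_linear(episodes, phi, pi, actions, d, H, lam=1e-3):
    """Linear FQE (sketch): backward regressions for h = H, ..., 1; returns the weights of Q_1_hat.

    episodes: list of length-H lists of (s, a, r, s_next) tuples.
    phi(s, a) -> d-dimensional feature vector; pi(s) -> action probabilities over `actions`.
    """
    transitions = [t for ep in episodes for t in ep]
    Phi = np.array([phi(s, a) for (s, a, _, _) in transitions])   # N x d design matrix
    rewards = np.array([r for (_, _, r, _) in transitions])
    w = np.zeros(d)                                               # weights of Q_{H+1} = 0
    for _ in range(H):
        # Regression target y_n = r_n + sum_a pi(a | s_{n+1}) * Q_{h+1}(s_{n+1}, a),
        # with Q_{h+1}(s, a) represented as phi(s, a)^T w.
        y = rewards + np.array([
            sum(p * (phi(s_next, a) @ w) for a, p in zip(actions, pi(s_next)))
            for (_, _, _, s_next) in transitions
        ])
        # Ridge-regularized least squares over all N pooled transitions.
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
    return w

def fqe_value(w1, phi, pi, actions, initial_states):
    """Value estimate (2.1), with the initial distribution xi_1 approximated by `initial_states`."""
    return float(np.mean([
        sum(p * (phi(s, a) @ w1) for a, p in zip(actions, pi(s)))
        for s in initial_states
    ]))

def bootstrap_fqe_by_episodes(episodes, value_fn, num_boot=100, seed=None):
    """Efron's nonparametric bootstrap *by episodes*: resample K episodes with replacement
    and re-run FQE on each bootstrapped set; returns bootstrap samples of v_hat^{*pi}."""
    rng = np.random.default_rng(seed)
    K = len(episodes)
    samples = []
    for _ in range(num_boot):
        idx = rng.integers(0, K, size=K)
        samples.append(value_fn([episodes[i] for i in idx]))
    return np.array(samples)
```

Here `value_fn` would wrap `fqe_linear` followed by `fqe_value`, so that each bootstrap replicate re-runs the full FQE pipeline on a resampled set of whole episodes.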
3.1. Bootstrap by episodes vs. bootstrap by sample transitions

Practitioners may wonder what the right way to bootstrap a dataset is. This question is well understood in supervised learning when the data points are independent and identically distributed; there, the best way to bootstrap is to resample data points directly. However, in episodic RL, although episodes may be generated independently from one another, sample transitions $(s_n, a_n, r_n)$ within the same episode are highly dependent. Therefore, we choose to bootstrap the batch dataset by episodes, rather than by sample transitions as is commonly done in the previous literature (Kostrikov & Nachum, 2020).

We argue that bootstrapping by sample transitions may fail to correctly characterize the target error distribution of OPE, due to the in-episode dependence. To illustrate this phenomenon, we conduct numerical experiments using a toy Cliff Walking environment. We compare the true distribution of the FQE error, obtained by Monte Carlo sampling, with the error distributions obtained by bootstrapping FQE. Figure 1 clearly shows that the bootstrap distribution of $\hat{v}^{*\pi} - \hat{v}^\pi$ (by episodes) closely approximates the true error distribution of $\hat{v}^\pi - v^\pi$, while the bootstrap distribution by sample transitions is highly irregular and incorrect. This validates our belief that it is necessary to bootstrap by episodes and handle dependent data carefully for OPE.

4. Asymptotic Distribution and Optimality of FQE

Before analyzing the use of the bootstrap, we first study the asymptotic properties of FQE estimators. For the sake of theoretical abstraction, we focus our analysis on FQE with linear function approximation, because it is the most basic and universal form of function approximation. We will show that the FQE error is asymptotically normal and that its asymptotic variance exactly matches the Cramér-Rao lower bound. All proofs are deferred to Appendices A.3 and A.4.

Notations. Given a feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, we let $\mathcal{F}$ be the linear function class spanned by $\phi$. Without loss of generality, we assume $\|\phi(s, a)\| \le 1$ for any $(s, a) \in \mathcal{S} \times \mathcal{A}$. Define the Bellman operator for policy $\pi$ as $P^\pi : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ such that for any $f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, $P^\pi f(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}[f(s', a')]$. Denote the expected covariance matrix induced by the feature $\phi$ as $\Sigma = \mathbb{E}\big[ \frac{1}{H} \sum_{h=1}^{H} \phi(s^1_h, a^1_h)\, \phi(s^1_h, a^1_h)^\top \big]$, where $\mathbb{E}$ is the expectation over the population distribution generated by the behavior policy.

4.1. Asymptotic normality

We need a representation condition on the function class $\mathcal{F}$, which ensures sample-efficient policy evaluation via FQE.

Condition 4.1 (Policy completeness). For any $f \in \mathcal{F}$, we assume $P^\pi f \in \mathcal{F}$, and $r \in \mathcal{F}$.

Policy completeness requires that the function class $\mathcal{F}$ can well capture the Bellman operator. It is crucial for the estimation consistency of FQE (Le et al., 2019; Duan & Wang, 2020) and implies the realizability condition $Q^\pi_h \in \mathcal{F}$ for $h \in [H]$. Recently, Wang et al. (2020) established a lower bound showing that the condition $Q^\pi_h \in \mathcal{F}$ alone is not enough for sample-efficient OPE. Thus we need the policy completeness condition in order to leverage the generalizability of the linear function class.
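Theorem 4.2 below additionally requires the feature-coverage condition $\lambda_{\min}(\Sigma) > 0$. A quick empirical sanity check of this condition on the batch data is sketched here; it is an illustration, not part of the paper's procedure, and it reuses the episode layout and the hypothetical `phi` from the earlier sketches.

```python
import numpy as np

def min_eig_feature_cov(episodes, phi, d):
    """Estimate Sigma = E[(1/H) * sum_h phi(s_h, a_h) phi(s_h, a_h)^T] from the batch data
    and return its smallest eigenvalue; a value bounded well away from zero suggests that
    the coverage condition lambda_min(Sigma) > 0 is plausible for this dataset."""
    sigma_hat = np.zeros((d, d))
    count = 0
    for episode in episodes:
        for (s, a, _, _) in episode:
            f = phi(s, a)
            sigma_hat += np.outer(f, f)
            count += 1
    return float(np.linalg.eigvalsh(sigma_hat / count).min())
```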
Next we present our first main result: the asymptotic normality of FQE with linear function approximation. For any $h_1, h_2 \in [H]$, define the cross-time covariance matrix as
$$\Omega_{h_1, h_2} = \mathbb{E}\Big[ \frac{1}{H} \sum_{h'=1}^{H} \phi(s^1_{h'}, a^1_{h'})\, \phi(s^1_{h'}, a^1_{h'})^\top\, \varepsilon^1_{h_1, h'}\, \varepsilon^1_{h_2, h'} \Big],$$
where $\varepsilon^1_{h_1, h'} = Q^\pi_{h_1}(s^1_{h'}, a^1_{h'}) - \big( r^1_{h'} + V^\pi_{h_1+1}(s^1_{h'+1}) \big)$.

Theorem 4.2 (Asymptotic normality of FQE). Suppose $\lambda_{\min}(\Sigma) > 0$ and Condition 4.1 holds. FQE with linear function approximation is $\sqrt{N}$-consistent and asymptotically normal:
$$\sqrt{N}\,(\hat{v}^\pi - v^\pi) \;\overset{d}{\longrightarrow}\; N(0, \sigma^2), \quad \text{as } N \to \infty, \tag{4.1}$$
where $\overset{d}{\to}$ denotes convergence in distribution. The asymptotic variance $\sigma^2$ is given by
$$\sigma^2 = \sum_{h=1}^{H} (\nu^\pi_h)^\top \Sigma^{-1} \Omega_{h,h} \Sigma^{-1} \nu^\pi_h \;+\; \sum_{h_1 \neq h_2} (\nu^\pi_{h_1})^\top \Sigma^{-1} \Omega_{h_1, h_2} \Sigma^{-1} \nu^\pi_{h_2}.$$
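As a standard consequence of (4.1), and not a statement taken from the text above: if $\sigma^2$ were known, or replaced by any consistent estimate, one could form the usual Wald-type asymptotic confidence interval
$$\lim_{N \to \infty} \mathbb{P}\!\left( v^\pi \in \Big[ \hat{v}^\pi - z_{1-\delta/2}\,\frac{\sigma}{\sqrt{N}},\; \hat{v}^\pi + z_{1-\delta/2}\,\frac{\sigma}{\sqrt{N}} \Big] \right) = 1 - \delta,$$
where $z_{1-\delta/2}$ is the standard normal quantile. The percentile bootstrap interval of Section 6.2 serves the same purpose without requiring an explicit variance formula.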

for some $q > 2$. Then we have, for any $1 \le r < q$,
$$\mathbb{E}^*\Big[ \big( \sqrt{N}\,(\hat{v}^{*\pi} - \hat{v}^\pi) \big)^r \Big] \;\longrightarrow\; \int t^r \, d\mu(t),$$
where $\mu(\cdot)$ is the distribution of $N(0, \sigma^2)$. The consistency of the bootstrap variance estimate is immediately implied by setting $r = 2$.

6. Subsampled Bootstrapping FQE

Computing bootstrap-based quantities can be prohibitively demanding as the data size grows. Inspired by recent developments in the statistics community (Kleiner et al., 2014; Sengupta et al., 2016), we adapt a simple subsampled bootstrap procedure for FQE to accelerate the computation.

6.1. Subsampled bootstrap

Let the original dataset be $D = \{D_1, \ldots, D_K\}$. For any dataset $\tilde{D}$, we denote by $\hat{v}^\pi(\tilde{D})$ the FQE estimator based on dataset $\tilde{D}$, and by $B$ the number of bootstrap samples. The subsampled bootstrap consists of the following three steps. For each $b \in [B]$, we first construct a random subset $D^{(b)}_{K,s}$ of $s$ episodes, where each sample episode is drawn independently without replacement from the dataset $D$; typically $s = K^\gamma$ for some $0 < \gamma \le 1$. Then we generate a resample set $D^{(b)*}_{K,s}$ of $K$ episodes, where each sample episode is drawn independently with replacement from $D^{(b)}_{K,s}$. Note that when $s = K$, $D^{(b)}_{K,s}$ is always equal to $D$, so the subsampled bootstrap reduces to the vanilla bootstrap. In the end, we compute $\varepsilon^{(b)} = \hat{v}^\pi(D^{(b)*}_{K,s}) - \hat{v}^\pi(D^{(b)}_{K,s})$. Algorithm 1 gives the full description.

Remark 6.1 (Computational benefit). In Algorithm 1, although each run of FQE is still over a dataset of $K$ episodes, only $s$ of them are distinct. As a result, the runtime of running FQE on a bootstrapped set can be substantially reduced. With linear function approximation, one run of FQE requires solving $H$ least-squares problems. Thus the total runtime complexity of the subsampled bootstrapping FQE is $O(B(K^{2\gamma} H^3 d + H d^3))$, where $0 < \gamma < 1$ controls the subsample size. When $\gamma$ is small, we achieve a speedup of an order of magnitude.

Algorithm 1 Subsampled Bootstrapping FQE
Input: dataset $D = \{D_1, \ldots, D_K\}$, target policy $\pi$, subset size $s$, number of bootstrap samples $B$.
1: Compute the FQE estimator $\hat{v}^\pi(D)$ (Algorithm 2).
2: for $b = 1, \ldots, B$ do
3:   Build a random subset $D^{(b)}_{K,s}$.
4:   Feed $D^{(b)}_{K,s}$ to FQE and compute $\hat{v}^\pi(D^{(b)}_{K,s})$.
5:   Generate a resample set $D^{(b)*}_{K,s}$.
6:   Compute $\hat{v}^\pi(D^{(b)*}_{K,s})$.
7:   Compute $\varepsilon^{(b)} = \hat{v}^\pi(D^{(b)*}_{K,s}) - \hat{v}^\pi(D^{(b)}_{K,s})$.
8: end for
Output: $\{\varepsilon^{(1)}, \ldots, \varepsilon^{(B)}\}$.

6.2. Off-policy inference via bootstrapping FQE

We describe how to conduct off-policy inference based on the output of Algorithm 1.

Bootstrap variance estimation. To estimate the variance of the FQE estimator, we calculate the bootstrap sample variance
$$\widehat{\mathrm{Var}}(\hat{v}^\pi(D)) = \frac{1}{B-1} \sum_{b=1}^{B} \big( \varepsilon^{(b)} - \bar{\varepsilon} \big)^2, \quad \text{where } \bar{\varepsilon} = \frac{1}{B} \sum_{b=1}^{B} \varepsilon^{(b)}.$$

Bootstrap confidence interval. Compute the $\delta/2$ and $1-\delta/2$ quantiles of the empirical distribution $\{\varepsilon^{(1)}, \ldots, \varepsilon^{(B)}\}$, denoted $\hat{q}^\pi_{\delta/2}$ and $\hat{q}^\pi_{1-\delta/2}$ respectively. The percentile bootstrap confidence interval is $[\hat{v}^\pi(D) - \hat{q}^\pi_{1-\delta/2},\; \hat{v}^\pi(D) - \hat{q}^\pi_{\delta/2}]$.

Bootstrap correlation estimation. For any two target policies $\pi_1$ and $\pi_2$, we want to estimate the Pearson correlation coefficient between their FQE estimators. The bootstrap sample correlation can be computed as
$$\hat{\rho}(\hat{v}^{\pi_1}(D), \hat{v}^{\pi_2}(D)) = \frac{\sum_{b=1}^{B} (\varepsilon^{(b)}_1 - \bar{\varepsilon}_1)(\varepsilon^{(b)}_2 - \bar{\varepsilon}_2)}{\sqrt{\sum_{b=1}^{B} (\varepsilon^{(b)}_1 - \bar{\varepsilon}_1)^2}\, \sqrt{\sum_{b=1}^{B} (\varepsilon^{(b)}_2 - \bar{\varepsilon}_2)^2}}.$$
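Below is a minimal sketch of Algorithm 1 together with the three downstream estimates just described. It assumes the episode layout from the earlier sketches and a user-supplied `fqe_value_of(episodes) -> float` (e.g., built from the linear-FQE sketch); it is illustrative, not the authors' code.

```python
import numpy as np

def subsampled_bootstrap_fqe(episodes, fqe_value_of, s, B, seed=None):
    """Algorithm 1 (sketch): returns the bootstrap error samples [eps^(1), ..., eps^(B)].

    s: subset size (s = K recovers the vanilla bootstrap); B: number of bootstrap samples.
    """
    rng = np.random.default_rng(seed)
    K = len(episodes)
    eps = []
    for _ in range(B):
        # Step 1: subset of s distinct episodes, drawn without replacement from D.
        subset_idx = rng.choice(K, size=s, replace=False)
        subset = [episodes[i] for i in subset_idx]
        v_subset = fqe_value_of(subset)
        # Step 2: resample K episodes with replacement from the subset.
        resample_idx = rng.integers(0, s, size=K)
        resample = [subset[i] for i in resample_idx]
        v_resample = fqe_value_of(resample)
        # Step 3: bootstrap error sample eps^(b) = v_hat(resample) - v_hat(subset).
        eps.append(v_resample - v_subset)
    return np.array(eps)

def bootstrap_variance(eps):
    # Sample variance with denominator B - 1, matching the formula in Section 6.2.
    return float(np.var(eps, ddof=1))

def percentile_ci(v_hat, eps, delta=0.1):
    # v_hat is the FQE estimate on the full dataset D (Algorithm 1, line 1).
    q_lo, q_hi = np.quantile(eps, [delta / 2, 1 - delta / 2])
    return (v_hat - q_hi, v_hat - q_lo)

def bootstrap_correlation(eps1, eps2):
    # Pearson correlation of the two error sequences; eps1 and eps2 should be computed
    # from the *same* resampled datasets (shared draws) so the pairing is meaningful.
    return float(np.corrcoef(eps1, eps2)[0, 1])
```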
7. Experiments

In this section, we numerically evaluate the proposed bootstrapping FQE method in several RL environments. For constructing confidence intervals, we fix the confidence level at $\delta = 0.1$. For estimating variances and correlations, we average the results over 200 trials. More details about the experiments are given in Appendix C.

7.1. Experiment with a tabular discrete environment

We first consider the Cliff Walking environment (Sutton & Barto, 2018), with artificially added randomness to create stochastic transitions (see Appendix C for details). The target policy is chosen to be a near-optimal policy trained using Q-learning. We consider three choices of the behavior policy: the same as the target policy (on-policy), the $0.1$ $\epsilon$-greedy policy, and the soft-max policy with temperature 1.0 based on the learned optimal Q-function. The results for the soft-max policy and for correlation estimation are deferred to Appendix C.

We test three different methods. The first two are subsampled bootstrapping FQE with subsample sizes $s = K$ (the vanilla bootstrap) and $s = K^{0.5}$ (the computationally efficient version), where $B = 100$. The third is high-confidence off-policy evaluation (HCOPE) (Thomas et al., 2015), which we use as a baseline for comparison. HCOPE is a method for constructing off-policy confidence intervals for tabular MDPs; it is based on concentration inequalities and has a provable coverage guarantee. We also compare these methods with the oracle confidence interval (the true distribution's quantiles obtained by Monte Carlo simulation).

Coverage and tightness of off-policy confidence intervals (CI). We study the empirical coverage probability and the interval width for different numbers of episodes. Figure 2 shows the results under different behavior policies. In the left panel of Figure 3, we report the effect of the number of bootstrap samples on the empirical coverage probability ($\epsilon$-greedy behavior policy, $K = 100$). It is clear that the empirical coverage of our confidence interval based on bootstrapping FQE becomes increasingly close to the expected coverage ($= 1 - \delta$) as the number of episodes increases. The width of the bootstrapping-FQE confidence interval is significantly tighter than that of HCOPE and very close to the oracle one. It is worth noting that, even in the on-policy case, our bootstrap-based confidence interval still has a clear advantage over the concentration-based confidence interval. The advantage of our method comes from the fact that it fully exploits distributional information. However, the bootstrap confidence interval tends to under-cover when the number of episodes is extremely small ($K = 10$). Thus we suggest that practitioners use bootstrap methods when the sample size is moderately large ($K > 50$).

Further, the subsampled bootstrapping FQE demonstrates competitive performance with significantly reduced computation time. The saving in computation time becomes increasingly substantial as the data size grows; see the right panel of Figure 3.

Figure 2. Off-policy CI for Cliff Walking, under the on-policy and $\epsilon$-greedy behavior policies (x-axis: number of episodes in the dataset). Left: empirical coverage probability of the CI (vanilla bootstrap, subsampled bootstrap, expected coverage); Right: CI width (vanilla bootstrap, subsampled bootstrap, HCOPE, oracle CI). The bootstrapping-FQE confidence interval demonstrates better and tighter coverage of the ground truth.
It closely resembles the oracle confidence interval, which comes from the true error distribution.

Figure 3. Sample and time efficiency of bootstrapping FQE. Left: empirical coverage of the bootstrapping-FQE CI, as the number of bootstrap samples increases. Right: runtime of bootstrapping FQE, as the data size increases (with subsample size $s = K^{0.5}$).

Figure 4. Bootstrapping for variance estimation and with function approximation. Left: error of the variance estimates in the tabular case, as the data size increases (on-policy, $\epsilon$-greedy, and soft-max behavior policies). Right: confidence interval constructed using bootstrapping FQE with linear function approximation, compared against the oracle CI and the bound of Duan & Wang (2020).

Bootstrapping FQE for variance estimation. We study the performance of variance estimation using subsampled bootstrapping FQE under three different behavior policies. We vary the number of episodes, and the true $\mathrm{Var}(\hat{v}^\pi(D))$ is computed by the Monte Carlo method. We report the estimation error $|\widehat{\mathrm{Var}}(\hat{v}^\pi(D)) - \mathrm{Var}(\hat{v}^\pi(D))|$ across 200 trials in the left panel of Figure 4.

7.2. Experiment with Mountain Car using linear function approximation

Next we test the methods on the classical Mountain Car environment (Moore, 1990) with linear function approximation. We artificially add a Gaussian random force to the car's dynamics to create stochastic transitions. For the linear function approximation, we choose 400 radial basis functions (RBFs) as the feature map. The target policy is the optimal policy trained by Q-learning, and the behavior policy is the $0.1$ $\epsilon$-greedy policy based on the learned optimal Q-function.

For comparison, we compute an empirical Bernstein-inequality-based confidence interval (Duan & Wang, 2020), which to the best of our knowledge is the only provable CI based on FQE with function approximation (see Appendix C for its detailed form). We also compute the oracle CI using Monte Carlo simulation. The right panel of Figure 4 gives the results: our method demonstrates good coverage of the ground truth and is much tighter than the concentration-based CI, even though both use linear function approximation.

7.3. Experiment with septic management using neural nets for function approximation

Lastly, we consider a real-world healthcare problem: treating sepsis in the intensive care unit (ICU). We use the sepsis management simulator of Oberst & Sontag (2019) for our study. It simulates a patient's vital signs, e.g., heart rate, blood pressure, oxygen concentration, and glucose level, with three treatment actions (antibiotics, vasopressors, and mechanical ventilation) to choose from at each time step. The reward is $+1$ when a patient is discharged and $-1$ if the patient reaches a life-critical state. We apply bootstrapping FQE using a neural-network function approximator with three fully connected layers, where the first layer uses 256 units and a ReLU activation, the second layer uses 32 units and a SELU activation, and the last layer uses Softsign. The network takes as input the state-action pair (an 11-dimensional vector) and outputs a Q-value estimate. Let the behavior policy be the $0.15$ $\epsilon$-greedy policy.
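For concreteness, here is a sketch of the Q-network just described (three fully connected layers: 256 units with ReLU, 32 units with SELU, and a Softsign output; 11-dimensional state-action input). Only the architecture follows the text; the class name is an illustrative choice, and the training loop (FQE regression targets and optimizer) is omitted.

```python
import torch
import torch.nn as nn

class SepsisQNetwork(nn.Module):
    """Q-value network for the sepsis OPE experiment: input is an 11-dim state-action encoding;
    hidden layers of 256 (ReLU) and 32 (SELU); Softsign output, whose range (-1, 1) is
    consistent with the +/-1 terminal rewards of the simulator."""
    def __init__(self, input_dim: int = 11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 32), nn.SELU(),
            nn.Linear(32, 1), nn.Softsign(),
        )

    def forward(self, state_action: torch.Tensor) -> torch.Tensor:
        # Returns one Q-value estimate per input row.
        return self.net(state_action).squeeze(-1)
```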
We evaluate two policies based on the same set of data. This is very common in healthcare problems, since we may have multiple treatments prescribed by doctors. One target policy is fixed to be the optimal policy, while we vary the other one with different levels of $\epsilon$-greedy noise. We expect the correlation to decrease as the difference between the two target policies increases. Figure 5 is well aligned with this expectation. In Figure 6, we plot the confidence region of the two target policies obtained by bootstrapping FQE using neural networks.

Figure 5. Bootstrapping FQE with neural nets for estimating the correlation between two FQE estimators (x-axis: the second policy's $\epsilon$; y-axis: correlation). The left panel uses 300 episodes, while the right panel uses 500 episodes.

Figure 6. Estimated confidence region for evaluating two policies using bootstrapping FQE with neural networks. The two target policies are the optimal policy and the $0.15$ $\epsilon$-greedy policy. Red points are the true values of the two target policies. From left to right, the sample sizes are $K = 100, 300, 500$. Here we use the geom_density() function to illustrate the density of the 2-D bootstrap estimators.

According to Figures 5 and 6, the bootstrapping FQE method can effectively construct confidence regions and correlation estimates, even when using neural networks for function approximation. These results suggest that the proposed bootstrapping FQE method reliably achieves off-policy inference with more general function approximators.

8. Conclusion

This paper studies bootstrapping FQE for statistical off-policy inference and establishes its asymptotic distributional consistency as a theoretical benchmark. Our experiments suggest that bootstrapping FQE is effective and efficient in a range of tasks, from tabular problems to continuous problems, with linear and neural-network function approximation.

Acknowledgements

Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii, and NSERC. Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI-1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI.

References

Bickel, P. J. and Freedman, D. A. Some asymptotic theory for the bootstrap. The Annals of Statistics, pp. 1196-1217, 1981.

Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., and Schuurmans, D. CoinDICE: Off-policy confidence interval estimation. arXiv preprint arXiv:2010.11652, 2020.

Duan, Y. and Wang, M. Minimax-optimal off-policy evaluation with linear function approximation. International Conference on Machine Learning, 2020.

Eck, D. J. Bootstrapping for multivariate linear regression models. Statistics & Probability Letters, 134:141-149, 2018.

Efron, B. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503-556, 2005.

Feng, Y., Ren, T., Tang, Z., and Liu, Q. Accountable off-policy evaluation with kernel Bellman statistics. Proceedings of the International Conference on Machine Learning, 2020.
Feng, Y., Tang, Z., Zhang, N., and Liu, Q. Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=dKg5D1Z1Lm.

Fonteneau, R., Murphy, S. A., Wehenkel, L., and Ernst, D. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383-416, 2013.

Freedman, D. A. Bootstrapping regression models. The Annals of Statistics, 9(6):1218-1228, 1981.

Hallak, A. and Mannor, S. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1372-1383. JMLR.org, 2017.

Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Hao, B., Abbasi-Yadkori, Y., Wen, Z., and Cheng, G. Bootstrapping upper confidence bound. Thirty-Fourth Annual Conference on Neural Information Processing Systems, 2020a.

Hao, B., Duan, Y., Lattimore, T., Szepesvári, C., and Wang, M. Sparse feature selection makes batch reinforcement learning more sample efficient. arXiv preprint arXiv:2011.04019, 2020b.

Jiang, N. and Huang, J. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33, 2020.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652-661, 2016.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1-63, 2020.

Kato, K. A note on moment convergence of bootstrap M-estimators. Statistics & Risk Modeling, 28(1):51-61, 2011.

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pp. 795-816, 2014.

Kostrikov, I. and Nachum, O. Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609, 2020.

Kuzborskij, I., Vernade, C., György, A., and Szepesvári, C. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107-1149, 2003.

Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. Proceedings of Machine Learning Research, 97:3703-3712, 2019.

Liao, P., Klasnja, P., and Murphy, S. Off-policy estimation of long-term average outcomes with applications to mobile health. arXiv preprint arXiv:1912.13088, 2019.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356-5366, 2018.

McLeish, D. L. Dependent central limit theorems and invariance principles. The Annals of Probability, 2(4):620-628, 1974.

Moore, A. W. Efficient memory-based learning for robot control. 1990.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2315-2325, 2019.
Oberst, M. and Sontag, D. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In International Conference on Machine Learning, pp. 4881-4890, 2019.

Paine, T. L., Paduraru, C., Michi, A., Gulcehre, C., Zolna, K., Novikov, A., Wang, Z., and de Freitas, N. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055, 2020.

Petersen, K. and Pedersen, M. The Matrix Cookbook. Technical University of Denmark, Technical Manual, 2008.

Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

Sengupta, S., Volgushev, S., and Shao, X. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111(515):1222-1232, 2016.

Shi, C., Zhang, S., Lu, W., and Song, R. Statistical inference of the value function for reinforcement learning in infinite horizon settings. arXiv preprint arXiv:2001.04515, 2020.

Singh, K. On the asymptotic accuracy of Efron's bootstrap. The Annals of Statistics, pp. 1187-1195, 1981.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139-2148, 2016.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380-2388. PMLR, 2015.

Uehara, M. and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge University Press, 2000.

Voloshin, C., Le, H. M., Jiang, N., and Yue, Y. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

Wang, R., Foster, D. P., and Kakade, S. M. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895, 2020.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9665-9675, 2019.

Yin, M. and Wang, Y.-X. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.

Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020a.

Zhang, S., Liu, B., and Whiteson, S. GradientDICE: Rethinking generalized offline estimation of stationary values. arXiv preprint arXiv:2001.11113, 2020b.