Bootstrapping Fitted Q-Evaluation for Off-Policy Inference

Botao Hao 1, Xiang Ji 2, Yaqi Duan 2, Hao Lu 2, Csaba Szepesvári 1 3, Mengdi Wang 1 2

1 DeepMind, 2 Princeton University, 3 University of Alberta. Correspondence to: Botao Hao, Mengdi Wang. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

1. Introduction

Off-policy evaluation (OPE) often serves as the starting point of batch reinforcement learning (RL). The objective of OPE is to estimate the value of a target policy based on batch episodes of state-transition trajectories that were generated using a different and possibly unknown behavior policy. In this paper, we investigate statistical inference for OPE. In particular, we analyze the popular fitted Q-evaluation (FQE) method, a basic model-free approach that fits the unknown value function from data using function approximation and backward dynamic programming (Fonteneau et al., 2013; Munos & Szepesvári, 2008; Le et al., 2019). In practice, FQE has demonstrated robust and satisfying performance on many classical RL tasks under different metrics (Voloshin et al., 2019). A more recent study by Paine et al. (2020) demonstrated the surprising scalability and effectiveness of FQE with deep neural nets in a range of complex continuous-state RL tasks. On the theoretical side, FQE was proved to be a minimax-optimal policy evaluator in the tabular and linear-model cases (Yin & Wang, 2020; Duan & Wang, 2020).

The aforementioned research mostly focuses on point estimation for OPE. In practical batch RL applications, a point estimate is far from enough, and statistical inference for OPE is of great interest. For instance, one often hopes to construct a tight confidence interval around the policy value, estimate the variance of an off-policy evaluator, or evaluate multiple policies using the same data and estimate their correlations. Bootstrapping (Efron, 1982) is a conceptually simple and generalizable approach to infer the error distribution based on batch data. Therefore, in this work, we study the use of bootstrapping for off-policy inference. We will provide theoretical justifications as well as numerical experiments. Our main results are summarized below.

First, we analyze the asymptotic distribution of FQE with linear function approximation and show that the policy evaluation error asymptotically follows a normal distribution (Theorem 4.2).
The asymptotic variance matches the Cramér-Rao lower bound for OPE (Theorem 4.5), implying that this estimator is asymptotically efficient. We propose a bootstrapping FQE method for estimating the distribution of the off-policy evaluation error. We prove that bootstrapping FQE is asymptotically consistent in estimating the distribution of the original FQE (Theorem 5.1), and we establish the consistency of the bootstrap confidence interval as well as of bootstrap variance estimation. Further, we propose a subsampled bootstrap procedure to improve the computational efficiency of bootstrapping FQE.

We highlight the necessity of bootstrapping by episodes, rather than by individual sample transitions as considered in previous work; see Kostrikov & Nachum (2020). The reason is that bootstrapping dependent data in general fails to characterize the right error distribution (Remark 2.1 in Singh (1981)). We illustrate this phenomenon via experiments (see Figure 1). All of our theoretical analysis applies to episodic, dependent data, and we do not require the i.i.d. sample-transition assumption commonly made in the OPE literature (Jiang & Huang, 2020; Kostrikov & Nachum, 2020; Dai et al., 2020).

Finally, we evaluate subsampled bootstrapping FQE in a range of classical RL tasks, including a discrete tabular domain, a continuous control domain, and a simulated healthcare example. We test variants of bootstrapping FQE with tabular representation, linear function approximation, and neural networks. We carefully examine the effectiveness and tightness of bootstrap confidence intervals, as well as the accuracy of bootstrapping for estimating the variance and correlation of off-policy estimators.

Related Work. Point estimation for OPE has received considerable attention in recent years. Popular approaches include direct methods (Lagoudakis & Parr, 2003; Ernst et al., 2005; Munos & Szepesvári, 2008; Le et al., 2019), doubly-robust and importance-sampling methods (Precup et al., 2000; Jiang & Li, 2016; Thomas & Brunskill, 2016), and marginalized importance sampling (Hallak & Mannor, 2017; Liu et al., 2018; Xie et al., 2019; Nachum et al., 2019; Uehara & Jiang, 2019; Zhang et al., 2020a;b). On the theoretical side, Uehara & Jiang (2019); Yin & Wang (2020) established asymptotic optimality and efficiency for OPE in the tabular setting, and Kallus & Uehara (2020) provided a complete study of semiparametric efficiency in a more general setting. Duan & Wang (2020); Hao et al. (2020b) showed that FQE with linear/sparse-linear function approximation is minimax-optimal, and Wang et al. (2020) studied the fundamental hardness of OPE with linear function approximation.

Confidence interval estimation for OPE is also important in many high-stakes applications. Thomas et al. (2015) proposed high-confidence OPE based on importance sampling and an empirical Bernstein inequality. Kuzborskij et al. (2020) proposed a tighter confidence interval for contextual bandits based on an empirical Efron-Stein inequality. However, importance sampling suffers from the curse of horizon (Liu et al., 2018), and concentration-based confidence intervals are typically overly conservative since they only exploit tail information (Hao et al., 2020a). Another line of recent work formulates confidence interval estimation as an optimization problem (Feng et al., 2020; 2021; Dai et al., 2020). These works are specific to confidence interval construction for OPE, and they do not provide distributional consistency guarantees.
Thus, they do not easily generalize to other statistical inference tasks. In the statistics community, Liao et al. (2019) studied OPE in an infinite-horizon undiscounted MDP and derived the asymptotic distribution of an empirical Bellman residual minimization estimator. Their asymptotic variance has a tabular representation and thus does not reflect the effect of function approximation. Shi et al. (2020) considered asymptotic confidence intervals for the policy value, but under a different model assumption that the Q-function is smooth.

Several existing works have investigated the use of bootstrapping in OPE. Thomas et al. (2015); Hanna et al. (2017) constructed confidence intervals by bootstrapping importance-sampling estimators or learned models, but without any consistency guarantee. The most closely related work is Kostrikov & Nachum (2020), which provided the first asymptotic consistency result for bootstrap confidence intervals in OPE. Our analysis improves on their work in the following aspects. First, we study FQE with linear function approximation, while Kostrikov & Nachum (2020) only considered the tabular case. Second, we provide distributional consistency of bootstrapping FQE, which is stronger than the consistency of the confidence interval in Kostrikov & Nachum (2020).

2. Preliminary

Consider an episodic Markov decision process (MDP) defined by a tuple $M = (\mathcal{S}, \mathcal{A}, P, r, H)$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s' \mid s, a)$ is the probability of reaching state $s'$ when taking action $a$ in state $s$, $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the reward function, and $H$ is the length of the horizon. A policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ maps states to distributions over actions. The state-action value function (Q-function) is defined, for $h = 1, \ldots, H$, as
$$Q^\pi_h(s, a) = \mathbb{E}^\pi\Big[ \sum_{h'=h}^{H} r(s_{h'}, a_{h'}) \,\Big|\, s_h = s,\, a_h = a \Big],$$
where $a_{h'} \sim \pi(\cdot \mid s_{h'})$, $s_{h'+1} \sim P(\cdot \mid s_{h'}, a_{h'})$, and $\mathbb{E}^\pi$ denotes expectation over the sample path generated under policy $\pi$. The Q-function satisfies the Bellman equation for policy $\pi$:
$$Q^\pi_{h-1}(s, a) = r(s, a) + \mathbb{E}\big[ V^\pi_h(s') \mid s, a \big],$$
where $s' \sim P(\cdot \mid s, a)$ and $V^\pi_h : \mathcal{S} \to \mathbb{R}$ is the value function defined as $V^\pi_h(s) = \int_{\mathcal{A}} Q^\pi_h(s, a)\, \pi(a \mid s)\, da$. Let $[n] = \{1, \ldots, n\}$. For a positive semidefinite matrix $X$, we denote by $\lambda_{\min}(X)$ the minimum eigenvalue of $X$. Denote by $I_d \in \mathbb{R}^{d \times d}$ the identity matrix, i.e., the diagonal matrix with 1 on every diagonal entry and 0 elsewhere.

Off-policy evaluation. Suppose that the batch data $D = \{D_1, \ldots, D_K\}$ consists of $K$ independent episodes collected using an unknown behavior policy $\bar\pi$. Each episode, denoted $D_k = \{(s^k_h, a^k_h, r^k_h)\}_{h \in [H]}$, is a trajectory of $H$ state-transition tuples. It is easy to generalize our analysis to multiple unknown behavior policies, since our algorithms do not require knowledge of the behavior policy. Let $N = KH$ be the total number of sample transitions; we sometimes write $D = \{(s_n, a_n, r_n)\}_{n \in [N]}$ for simplicity. The goal of OPE is to estimate the expected cumulative return (i.e., value) of a target policy $\pi$ from a fixed initial distribution $\xi_1$, based on the dataset $D$. The value is defined as
$$v^\pi = \mathbb{E}^\pi\Big[ \sum_{h=1}^{H} r(s_h, a_h) \Big], \qquad s_1 \sim \xi_1.$$
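For concreteness, the following minimal sketch (not from the paper) fixes a data layout for the batch dataset $D$ of $K$ episodes that the later code sketches reuse; the tuple ordering, the `Transition`/`Episode`/`Dataset` aliases, and the helper name are illustrative assumptions, not an interface defined by the authors.

```python
from typing import List, Tuple
import numpy as np

# One sample transition (s_h, a_h, r_h, s_{h+1}); states are arrays ready for a feature map,
# and actions come from a finite action set.
Transition = Tuple[np.ndarray, int, float, np.ndarray]
Episode = List[Transition]   # length H, ordered h = 1, ..., H
Dataset = List[Episode]      # the batch data D = {D_1, ..., D_K}

def num_transitions(dataset: Dataset) -> int:
    """N = K * H when every episode has full length H."""
    return sum(len(episode) for episode in dataset)
```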
Fitted Q-evaluation. Fitted Q-evaluation (FQE) is an instance of the fitted Q-iteration method, dating back to Fonteneau et al. (2013); Le et al. (2019). Let $\mathcal{F}$ be a given function class, for example a linear function class or a neural network class. Set $\hat{Q}^\pi_{H+1} = 0$. For $h = H, \ldots, 1$, we recursively estimate $Q^\pi_h$ by regression and function approximation:
$$\hat{Q}^\pi_h = \operatorname*{argmin}_{f \in \mathcal{F}} \Big\{ \sum_{n=1}^{N} \big( f(s_n, a_n) - y_n \big)^2 + \lambda \rho(f) \Big\},$$
where $y_n = r_n + \int_{\mathcal{A}} \hat{Q}^\pi_{h+1}(s_{n+1}, a)\, \pi(a \mid s_{n+1})\, da$ and $\rho(f)$ is a proper regularizer. The value estimate is
$$\hat{v}^\pi = \mathbb{E}_{s \sim \xi_1,\, a \sim \pi(\cdot \mid s)}\big[ \hat{Q}^\pi_1(s, a) \big], \tag{2.1}$$
which can be directly computed from $\hat{Q}^\pi_1$. See the full description of FQE in Appendix A.1.

Off-policy inference. Let $\hat{v}^\pi$ be an off-policy estimator of the target policy value $v^\pi$. In addition to the point estimate, we are primarily interested in the distribution of the off-policy evaluation error $\hat{v}^\pi - v^\pi$; we aim to infer this error distribution in order to conduct statistical inference. Suppose $\hat{F}$ is an estimated distribution of $\hat{v}^\pi - v^\pi$. Then we can use $\hat{F}$ for a range of downstream off-policy inference tasks, for example:

Moment estimation. With $\hat{F}$, we can estimate the $p$-th moment of $\hat{v}^\pi - v^\pi$ by $\int x^p \, d\hat{F}(x)$. Two important examples are bias estimation and variance estimation.

Confidence interval construction. Define the quantile function of $\hat{F}$ as $G(p) = \inf\{x \in \mathbb{R} : p \le \hat{F}(x)\}$. Specify a confidence level $0 < \delta \le 1$. With $\hat{F}$, we can construct the $1-\delta$ confidence interval as $[\hat{v}^\pi - G(1-\delta/2),\; \hat{v}^\pi - G(\delta/2)]$. If $\hat{F}$ is close to the true distribution of $\hat{v}^\pi - v^\pi$, this is nearly the tightest confidence interval for $v^\pi$ based on $\hat{v}^\pi$.

Evaluating multiple policies and estimating their correlation. Suppose there are two target policies $\pi_1, \pi_2$ to evaluate, with corresponding off-policy estimators $\hat{v}^{\pi_1}, \hat{v}^{\pi_2}$. Let $\hat{F}_{12}$ be the estimated joint distribution of $\hat{v}^{\pi_1} - v^{\pi_1}$ and $\hat{v}^{\pi_2} - v^{\pi_2}$. The Pearson correlation coefficient between the two estimators is
$$\rho(\hat{v}^{\pi_1}, \hat{v}^{\pi_2}) = \frac{\mathrm{Cov}(\hat{v}^{\pi_1}, \hat{v}^{\pi_2})}{\sqrt{\mathrm{Var}(\hat{v}^{\pi_1})\,\mathrm{Var}(\hat{v}^{\pi_2})}}.$$
Both the covariance and the variances can be estimated from $\hat{F}_{12}$, so we can further estimate the correlation between off-policy evaluators.

Remark 2.1 (Practical scenarios of estimating correlations). Correlation is a basic statistical metric for comparing two estimators, and we use it as an example to illustrate that bootstrapping can be used for estimating a variety of statistics, not limited to confidence intervals. In medical applications, we may have multiple target treatment policies to compare, where a correlation estimate together with confidence intervals would better inform physicians and allow a fairer comparison.

3. Bootstrapping Fitted Q-Evaluation (FQE)

As shown in Le et al. (2019); Voloshin et al. (2019); Duan & Wang (2020); Paine et al. (2020), FQE not only demonstrates strong empirical performance but also enjoys provably optimal theoretical guarantees. Thus it is natural to conduct bootstrapping on top of FQE for off-policy inference. Recall that the original dataset $D$ consists of $K$ episodes. We propose to bootstrap FQE by episodes: draw sample episodes $D^*_1, \ldots, D^*_K$ independently with replacement from $D$. This is the standard Efron's nonparametric bootstrap (Efron, 1982). We then run FQE on the bootstrapped set $D^* = \{D^*_1, \ldots, D^*_K\}$ as in Eq. (2.1) and let the output $\hat{v}^{*\pi}$ be the bootstrapping FQE estimator. By repeating the above process, we may obtain multiple samples of $\hat{v}^{*\pi}$ and use these samples to conduct further off-policy inference (see Section 6.2 for details).

Figure 1. Bootstrap by episodes vs. by sample transitions. The first panel is the true FQE error distribution by Monte Carlo approximation; the second panel is the bootstrap distribution by episode; the third is by sample transitions. Both behavior and target policies are the optimal policy. The number of Monte Carlo and bootstrap samples is 10,000.
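To make the procedure concrete, here is a minimal sketch of FQE with linear function approximation and of the episode-level bootstrap described above, using the data layout from the earlier sketch. It is an illustration under assumptions, not the authors' implementation: the feature map `phi`, the finite action set `actions`, the ridge regularizer, and the pooling of all $N$ transitions at every backward step (consistent with the time-homogeneous transition model) are choices made here.

```python
import numpy as np

def fqe_linear(episodes, phi, pi, actions, d, H, lam=1e-3):
    """Linear FQE (sketch): backward regressions for h = H, ..., 1; returns the weights of Q_1_hat.

    episodes: list of length-H lists of (s, a, r, s_next) tuples.
    phi(s, a) -> d-dimensional feature vector; pi(s) -> action probabilities over `actions`.
    """
    transitions = [t for ep in episodes for t in ep]
    Phi = np.array([phi(s, a) for (s, a, _, _) in transitions])   # N x d design matrix
    rewards = np.array([r for (_, _, r, _) in transitions])
    w = np.zeros(d)                                               # weights of Q_{H+1} = 0
    for _ in range(H):
        # Regression target y_n = r_n + sum_a pi(a | s_{n+1}) * Q_{h+1}(s_{n+1}, a),
        # with Q_{h+1}(s, a) represented as phi(s, a)^T w.
        y = rewards + np.array([
            sum(p * (phi(s_next, a) @ w) for a, p in zip(actions, pi(s_next)))
            for (_, _, _, s_next) in transitions
        ])
        # Ridge-regularized least squares over all N pooled transitions.
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
    return w

def fqe_value(w1, phi, pi, actions, initial_states):
    """Value estimate (2.1), with the initial distribution xi_1 approximated by `initial_states`."""
    return float(np.mean([
        sum(p * (phi(s, a) @ w1) for a, p in zip(actions, pi(s)))
        for s in initial_states
    ]))

def bootstrap_fqe_by_episodes(episodes, value_fn, num_boot=100, seed=None):
    """Efron's nonparametric bootstrap *by episodes*: resample K episodes with replacement
    and re-run FQE on each bootstrapped set; returns bootstrap samples of v_hat^{*pi}."""
    rng = np.random.default_rng(seed)
    K = len(episodes)
    samples = []
    for _ in range(num_boot):
        idx = rng.integers(0, K, size=K)
        samples.append(value_fn([episodes[i] for i in idx]))
    return np.array(samples)
```

Here `value_fn` would wrap `fqe_linear` followed by `fqe_value`, so that each bootstrap replicate re-runs the full FQE pipeline on a resampled set of whole episodes.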
3.1. Bootstrap by episodes vs. bootstrap by sample transitions

Practitioners may wonder what the right way to bootstrap a dataset is. This question is well understood in supervised learning when the data points are independent and identically distributed; there, the best way to bootstrap is to resample data points directly. However, in episodic RL, although episodes may be generated independently from one another, sample transitions $(s_n, a_n, r_n)$ within the same episode are highly dependent. Therefore, we choose to bootstrap the batch dataset by episodes, rather than by sample transitions as is commonly done in the previous literature (Kostrikov & Nachum, 2020).

We argue that bootstrapping by sample transitions may fail to correctly characterize the target error distribution of OPE, due to the in-episode dependence. To illustrate this phenomenon, we conduct numerical experiments using a toy Cliff Walking environment. We compare the true distribution of the FQE error, obtained by Monte Carlo sampling, with the error distributions obtained by bootstrapping FQE. Figure 1 clearly shows that the bootstrap distribution of $\hat{v}^{*\pi} - \hat{v}^\pi$ (by episodes) closely approximates the true error distribution of $\hat{v}^\pi - v^\pi$, while the bootstrap distribution by sample transitions is highly irregular and incorrect. This validates our belief that it is necessary to bootstrap by episodes and handle dependent data carefully for OPE.

4. Asymptotic Distribution and Optimality of FQE

Before analyzing the use of the bootstrap, we first study the asymptotic properties of FQE estimators. For the sake of theoretical abstraction, we focus our analysis on FQE with linear function approximation, because it is the most basic and universal form of function approximation. We will show that the FQE error is asymptotically normal and that its asymptotic variance exactly matches the Cramér-Rao lower bound. All proofs are deferred to Appendices A.3 and A.4.

Notations. Given a feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, we let $\mathcal{F}$ be the linear function class spanned by $\phi$. Without loss of generality, we assume $\|\phi(s, a)\| \le 1$ for any $(s, a) \in \mathcal{S} \times \mathcal{A}$. Define the Bellman operator for policy $\pi$ as $P^\pi : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ such that for any $f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, $P^\pi f(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}[f(s', a')]$. Denote the expected covariance matrix induced by the feature $\phi$ as $\Sigma = \mathbb{E}\big[ \frac{1}{H} \sum_{h=1}^{H} \phi(s^1_h, a^1_h)\, \phi(s^1_h, a^1_h)^\top \big]$, where $\mathbb{E}$ is the expectation over the population distribution generated by the behavior policy.

4.1. Asymptotic normality

We need a representation condition on the function class $\mathcal{F}$, which ensures sample-efficient policy evaluation via FQE.

Condition 4.1 (Policy completeness). For any $f \in \mathcal{F}$, we assume $P^\pi f \in \mathcal{F}$, and $r \in \mathcal{F}$.

Policy completeness requires that the function class $\mathcal{F}$ can well capture the Bellman operator. It is crucial for the estimation consistency of FQE (Le et al., 2019; Duan & Wang, 2020) and implies the realizability condition $Q^\pi_h \in \mathcal{F}$ for $h \in [H]$. Recently, Wang et al. (2020) established a lower bound showing that the condition $Q^\pi_h \in \mathcal{F}$ alone is not enough for sample-efficient OPE. Thus we need the policy completeness condition in order to leverage the generalizability of the linear function class.
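Theorem 4.2 below additionally requires the feature-coverage condition $\lambda_{\min}(\Sigma) > 0$. A quick empirical sanity check of this condition on the batch data is sketched here; it is an illustration, not part of the paper's procedure, and it reuses the episode layout and the hypothetical `phi` from the earlier sketches.

```python
import numpy as np

def min_eig_feature_cov(episodes, phi, d):
    """Estimate Sigma = E[(1/H) * sum_h phi(s_h, a_h) phi(s_h, a_h)^T] from the batch data
    and return its smallest eigenvalue; a value bounded well away from zero suggests that
    the coverage condition lambda_min(Sigma) > 0 is plausible for this dataset."""
    sigma_hat = np.zeros((d, d))
    count = 0
    for episode in episodes:
        for (s, a, _, _) in episode:
            f = phi(s, a)
            sigma_hat += np.outer(f, f)
            count += 1
    return float(np.linalg.eigvalsh(sigma_hat / count).min())
```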
Next we present our first main result: the asymptotic normality of FQE with linear function approximation. For any $h_1, h_2 \in [H]$, define the cross-time covariance matrix as
$$\Omega_{h_1, h_2} = \mathbb{E}\Big[ \frac{1}{H} \sum_{h'=1}^{H} \phi(s^1_{h'}, a^1_{h'})\, \phi(s^1_{h'}, a^1_{h'})^\top\, \varepsilon^1_{h_1, h'}\, \varepsilon^1_{h_2, h'} \Big],$$
where $\varepsilon^1_{h_1, h'} = Q^\pi_{h_1}(s^1_{h'}, a^1_{h'}) - \big( r^1_{h'} + V^\pi_{h_1+1}(s^1_{h'+1}) \big)$.

Theorem 4.2 (Asymptotic normality of FQE). Suppose $\lambda_{\min}(\Sigma) > 0$ and Condition 4.1 holds. FQE with linear function approximation is $\sqrt{N}$-consistent and asymptotically normal:
$$\sqrt{N}\,(\hat{v}^\pi - v^\pi) \;\overset{d}{\longrightarrow}\; N(0, \sigma^2), \quad \text{as } N \to \infty, \tag{4.1}$$
where $\overset{d}{\to}$ denotes convergence in distribution. The asymptotic variance $\sigma^2$ is given by
$$\sigma^2 = \sum_{h=1}^{H} (\nu^\pi_h)^\top \Sigma^{-1} \Omega_{h,h} \Sigma^{-1} \nu^\pi_h \;+\; \sum_{h_1 \neq h_2} (\nu^\pi_{h_1})^\top \Sigma^{-1} \Omega_{h_1, h_2} \Sigma^{-1} \nu^\pi_{h_2}.$$
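As a standard consequence of (4.1), and not a statement taken from the text above: if $\sigma^2$ were known, or replaced by any consistent estimate, one could form the usual Wald-type asymptotic confidence interval
$$\lim_{N \to \infty} \mathbb{P}\!\left( v^\pi \in \Big[ \hat{v}^\pi - z_{1-\delta/2}\,\frac{\sigma}{\sqrt{N}},\; \hat{v}^\pi + z_{1-\delta/2}\,\frac{\sigma}{\sqrt{N}} \Big] \right) = 1 - \delta,$$
where $z_{1-\delta/2}$ is the standard normal quantile. The percentile bootstrap interval of Section 6.2 serves the same purpose without requiring an explicit variance formula.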

for some $q > 2$. Then we have, for any $1 \le r < q$,
$$\mathbb{E}^*\Big[ \big( \sqrt{N}\,(\hat{v}^{*\pi} - \hat{v}^\pi) \big)^r \Big] \;\longrightarrow\; \int t^r \, d\mu(t),$$
where $\mu(\cdot)$ is the distribution of $N(0, \sigma^2)$. The consistency of the bootstrap variance estimate is immediately implied by setting $r = 2$.

6. Subsampled Bootstrapping FQE

Computing bootstrap-based quantities can be prohibitively demanding as the data size grows. Inspired by recent developments in the statistics community (Kleiner et al., 2014; Sengupta et al., 2016), we adapt a simple subsampled bootstrap procedure for FQE to accelerate the computation.

6.1. Subsampled bootstrap

Let the original dataset be $D = \{D_1, \ldots, D_K\}$. For any dataset $\tilde{D}$, we denote by $\hat{v}^\pi(\tilde{D})$ the FQE estimator based on dataset $\tilde{D}$, and by $B$ the number of bootstrap samples. The subsampled bootstrap consists of the following three steps. For each $b \in [B]$, we first construct a random subset $D^{(b)}_{K,s}$ of $s$ episodes, where each sample episode is drawn independently without replacement from the dataset $D$; typically $s = K^\gamma$ for some $0 < \gamma \le 1$. Then we generate a resample set $D^{(b)*}_{K,s}$ of $K$ episodes, where each sample episode is drawn independently with replacement from $D^{(b)}_{K,s}$. Note that when $s = K$, $D^{(b)}_{K,s}$ is always equal to $D$, so the subsampled bootstrap reduces to the vanilla bootstrap. In the end, we compute $\varepsilon^{(b)} = \hat{v}^\pi(D^{(b)*}_{K,s}) - \hat{v}^\pi(D^{(b)}_{K,s})$. Algorithm 1 gives the full description.

Remark 6.1 (Computational benefit). In Algorithm 1, although each run of FQE is still over a dataset of $K$ episodes, only $s$ of them are distinct. As a result, the runtime of running FQE on a bootstrapped set can be substantially reduced. With linear function approximation, one run of FQE requires solving $H$ least-squares problems. Thus the total runtime complexity of the subsampled bootstrapping FQE is $O(B(K^{2\gamma} H^3 d + H d^3))$, where $0 < \gamma < 1$ controls the subsample size. When $\gamma$ is small, we achieve a speedup of an order of magnitude.

Algorithm 1 Subsampled Bootstrapping FQE
Input: dataset $D = \{D_1, \ldots, D_K\}$, target policy $\pi$, subset size $s$, number of bootstrap samples $B$.
1: Compute the FQE estimator $\hat{v}^\pi(D)$ (Algorithm 2).
2: for $b = 1, \ldots, B$ do
3:   Build a random subset $D^{(b)}_{K,s}$.
4:   Feed $D^{(b)}_{K,s}$ to FQE and compute $\hat{v}^\pi(D^{(b)}_{K,s})$.
5:   Generate a resample set $D^{(b)*}_{K,s}$.
6:   Compute $\hat{v}^\pi(D^{(b)*}_{K,s})$.
7:   Compute $\varepsilon^{(b)} = \hat{v}^\pi(D^{(b)*}_{K,s}) - \hat{v}^\pi(D^{(b)}_{K,s})$.
8: end for
Output: $\{\varepsilon^{(1)}, \ldots, \varepsilon^{(B)}\}$.

6.2. Off-policy inference via bootstrapping FQE

We describe how to conduct off-policy inference based on the output of Algorithm 1.

Bootstrap variance estimation. To estimate the variance of the FQE estimator, we calculate the bootstrap sample variance
$$\widehat{\mathrm{Var}}(\hat{v}^\pi(D)) = \frac{1}{B-1} \sum_{b=1}^{B} \big( \varepsilon^{(b)} - \bar{\varepsilon} \big)^2, \quad \text{where } \bar{\varepsilon} = \frac{1}{B} \sum_{b=1}^{B} \varepsilon^{(b)}.$$

Bootstrap confidence interval. Compute the $\delta/2$ and $1-\delta/2$ quantiles of the empirical distribution $\{\varepsilon^{(1)}, \ldots, \varepsilon^{(B)}\}$, denoted $\hat{q}^\pi_{\delta/2}$ and $\hat{q}^\pi_{1-\delta/2}$ respectively. The percentile bootstrap confidence interval is $[\hat{v}^\pi(D) - \hat{q}^\pi_{1-\delta/2},\; \hat{v}^\pi(D) - \hat{q}^\pi_{\delta/2}]$.

Bootstrap correlation estimation. For any two target policies $\pi_1$ and $\pi_2$, we want to estimate the Pearson correlation coefficient between their FQE estimators. The bootstrap sample correlation can be computed as
$$\hat{\rho}(\hat{v}^{\pi_1}(D), \hat{v}^{\pi_2}(D)) = \frac{\sum_{b=1}^{B} (\varepsilon^{(b)}_1 - \bar{\varepsilon}_1)(\varepsilon^{(b)}_2 - \bar{\varepsilon}_2)}{\sqrt{\sum_{b=1}^{B} (\varepsilon^{(b)}_1 - \bar{\varepsilon}_1)^2}\, \sqrt{\sum_{b=1}^{B} (\varepsilon^{(b)}_2 - \bar{\varepsilon}_2)^2}}.$$
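Below is a minimal sketch of Algorithm 1 together with the three downstream estimates just described. It assumes the episode layout from the earlier sketches and a user-supplied `fqe_value_of(episodes) -> float` (e.g., built from the linear-FQE sketch); it is illustrative, not the authors' code.

```python
import numpy as np

def subsampled_bootstrap_fqe(episodes, fqe_value_of, s, B, seed=None):
    """Algorithm 1 (sketch): returns the bootstrap error samples [eps^(1), ..., eps^(B)].

    s: subset size (s = K recovers the vanilla bootstrap); B: number of bootstrap samples.
    """
    rng = np.random.default_rng(seed)
    K = len(episodes)
    eps = []
    for _ in range(B):
        # Step 1: subset of s distinct episodes, drawn without replacement from D.
        subset_idx = rng.choice(K, size=s, replace=False)
        subset = [episodes[i] for i in subset_idx]
        v_subset = fqe_value_of(subset)
        # Step 2: resample K episodes with replacement from the subset.
        resample_idx = rng.integers(0, s, size=K)
        resample = [subset[i] for i in resample_idx]
        v_resample = fqe_value_of(resample)
        # Step 3: bootstrap error sample eps^(b) = v_hat(resample) - v_hat(subset).
        eps.append(v_resample - v_subset)
    return np.array(eps)

def bootstrap_variance(eps):
    # Sample variance with denominator B - 1, matching the formula in Section 6.2.
    return float(np.var(eps, ddof=1))

def percentile_ci(v_hat, eps, delta=0.1):
    # v_hat is the FQE estimate on the full dataset D (Algorithm 1, line 1).
    q_lo, q_hi = np.quantile(eps, [delta / 2, 1 - delta / 2])
    return (v_hat - q_hi, v_hat - q_lo)

def bootstrap_correlation(eps1, eps2):
    # Pearson correlation of the two error sequences; eps1 and eps2 should be computed
    # from the *same* resampled datasets (shared draws) so the pairing is meaningful.
    return float(np.corrcoef(eps1, eps2)[0, 1])
```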
7. Experiments

In this section, we numerically evaluate the proposed bootstrapping FQE method in several RL environments. For constructing confidence intervals, we fix the confidence level at $\delta = 0.1$. For estimating variances and correlations, we average the results over 200 trials. More details about the experiments are given in Appendix C.

7.1. Experiment with a tabular discrete environment

We first consider the Cliff Walking environment (Sutton & Barto, 2018), with artificially added randomness to create stochastic transitions (see Appendix C for details). The target policy is chosen to be a near-optimal policy trained using Q-learning. We consider three choices of the behavior policy: the same as the target policy (on-policy), the $0.1$ $\epsilon$-greedy policy, and the soft-max policy with temperature 1.0 based on the learned optimal Q-function. The results for the soft-max policy and for correlation estimation are deferred to Appendix C.

We test three different methods. The first two are subsampled bootstrapping FQE with subsample sizes $s = K$ (the vanilla bootstrap) and $s = K^{0.5}$ (the computationally efficient version), where $B = 100$. The third is high-confidence off-policy evaluation (HCOPE) (Thomas et al., 2015), which we use as a baseline for comparison. HCOPE is a method for constructing off-policy confidence intervals for tabular MDPs; it is based on concentration inequalities and has a provable coverage guarantee. We also compare these methods with the oracle confidence interval (the true distribution's quantiles obtained by Monte Carlo simulation).

Coverage and tightness of off-policy confidence intervals (CI). We study the empirical coverage probability and the interval width for different numbers of episodes. Figure 2 shows the results under different behavior policies. In the left panel of Figure 3, we report the effect of the number of bootstrap samples on the empirical coverage probability ($\epsilon$-greedy behavior policy, $K = 100$). It is clear that the empirical coverage of our confidence interval based on bootstrapping FQE becomes increasingly close to the expected coverage ($= 1 - \delta$) as the number of episodes increases. The width of the bootstrapping-FQE confidence interval is significantly tighter than that of HCOPE and very close to the oracle one. It is worth noting that, even in the on-policy case, our bootstrap-based confidence interval still has a clear advantage over the concentration-based confidence interval. The advantage of our method comes from the fact that it fully exploits distributional information. However, the bootstrap confidence interval tends to under-cover when the number of episodes is extremely small ($K = 10$). Thus we suggest that practitioners use bootstrap methods when the sample size is moderately large ($K > 50$).

Further, the subsampled bootstrapping FQE demonstrates competitive performance with significantly reduced computation time. The saving in computation time becomes increasingly substantial as the data size grows; see the right panel of Figure 3.

Figure 2. Off-policy CI for Cliff Walking, under the on-policy and $\epsilon$-greedy behavior policies (x-axis: number of episodes in the dataset). Left: empirical coverage probability of the CI (vanilla bootstrap, subsampled bootstrap, expected coverage); Right: CI width (vanilla bootstrap, subsampled bootstrap, HCOPE, oracle CI). The bootstrapping-FQE confidence interval demonstrates better and tighter coverage of the ground truth.
It closely resembles the oracle confidence interval, which comes from the true error distribution.

Figure 3. Sample and time efficiency of bootstrapping FQE. Left: empirical coverage of the bootstrapping-FQE CI, as the number of bootstrap samples increases. Right: runtime of bootstrapping FQE, as the data size increases (with subsample size $s = K^{0.5}$).

Figure 4. Bootstrapping for variance estimation and with function approximation. Left: error of the variance estimates in the tabular case, as the data size increases (on-policy, $\epsilon$-greedy, and soft-max behavior policies). Right: confidence interval constructed using bootstrapping FQE with linear function approximation, compared against the oracle CI and the bound of Duan & Wang (2020).

Bootstrapping FQE for variance estimation. We study the performance of variance estimation using subsampled bootstrapping FQE under three different behavior policies. We vary the number of episodes, and the true $\mathrm{Var}(\hat{v}^\pi(D))$ is computed by the Monte Carlo method. We report the estimation error $|\widehat{\mathrm{Var}}(\hat{v}^\pi(D)) - \mathrm{Var}(\hat{v}^\pi(D))|$ across 200 trials in the left panel of Figure 4.

7.2. Experiment with Mountain Car using linear function approximation

Next we test the methods on the classical Mountain Car environment (Moore, 1990) with linear function approximation. We artificially add a Gaussian random force to the car's dynamics to create stochastic transitions. For the linear function approximation, we choose 400 radial basis functions (RBFs) as the feature map. The target policy is the optimal policy trained by Q-learning, and the behavior policy is the $0.1$ $\epsilon$-greedy policy based on the learned optimal Q-function.

For comparison, we compute an empirical Bernstein-inequality-based confidence interval (Duan & Wang, 2020), which to the best of our knowledge is the only provable CI based on FQE with function approximation (see Appendix C for its detailed form). We also compute the oracle CI using Monte Carlo simulation. The right panel of Figure 4 gives the results: our method demonstrates good coverage of the ground truth and is much tighter than the concentration-based CI, even though both use linear function approximation.

7.3. Experiment with septic management using neural nets for function approximation

Lastly, we consider a real-world healthcare problem: treating sepsis in the intensive care unit (ICU). We use the sepsis management simulator of Oberst & Sontag (2019) for our study. It simulates a patient's vital signs, e.g., heart rate, blood pressure, oxygen concentration, and glucose level, with three treatment actions (antibiotics, vasopressors, and mechanical ventilation) to choose from at each time step. The reward is $+1$ when a patient is discharged and $-1$ if the patient reaches a life-critical state. We apply bootstrapping FQE using a neural-network function approximator with three fully connected layers, where the first layer uses 256 units and a ReLU activation, the second layer uses 32 units and a SELU activation, and the last layer uses Softsign. The network takes as input the state-action pair (an 11-dimensional vector) and outputs a Q-value estimate. Let the behavior policy be the $0.15$ $\epsilon$-greedy policy.
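For concreteness, here is a sketch of the Q-network just described (three fully connected layers: 256 units with ReLU, 32 units with SELU, and a Softsign output; 11-dimensional state-action input). Only the architecture follows the text; the class name is an illustrative choice, and the training loop (FQE regression targets and optimizer) is omitted.

```python
import torch
import torch.nn as nn

class SepsisQNetwork(nn.Module):
    """Q-value network for the sepsis OPE experiment: input is an 11-dim state-action encoding;
    hidden layers of 256 (ReLU) and 32 (SELU); Softsign output, whose range (-1, 1) is
    consistent with the +/-1 terminal rewards of the simulator."""
    def __init__(self, input_dim: int = 11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 32), nn.SELU(),
            nn.Linear(32, 1), nn.Softsign(),
        )

    def forward(self, state_action: torch.Tensor) -> torch.Tensor:
        # Returns one Q-value estimate per input row.
        return self.net(state_action).squeeze(-1)
```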
We evaluate two policies based on the same set of data. This is very common in healthcare problems, since we may have multiple treatments prescribed by doctors. One target policy is fixed to be the optimal policy, while we vary the other one with different levels of $\epsilon$-greedy noise. We expect the correlation to decrease as the difference between the two target policies increases. Figure 5 is well aligned with this expectation. In Figure 6, we plot the confidence region of the two target policies obtained by bootstrapping FQE using neural networks.

Figure 5. Bootstrapping FQE with neural nets for estimating the correlation between two FQE estimators (x-axis: the second policy's $\epsilon$; y-axis: correlation). The left panel uses 300 episodes, while the right panel uses 500 episodes.

Figure 6. Estimated confidence region for evaluating two policies using bootstrapping FQE with neural networks. The two target policies are the optimal policy and the $0.15$ $\epsilon$-greedy policy. Red points are the true values of the two target policies. From left to right, the sample sizes are $K = 100, 300, 500$. Here we use the geom_density() function to illustrate the density of the 2-D bootstrap estimators.

According to Figures 5 and 6, the bootstrapping FQE method can effectively construct confidence regions and correlation estimates, even when using neural networks for function approximation. These results suggest that the proposed bootstrapping FQE method reliably achieves off-policy inference with more general function approximators.

8. Conclusion

This paper studies bootstrapping FQE for statistical off-policy inference and establishes its asymptotic distributional consistency as a theoretical benchmark. Our experiments suggest that bootstrapping FQE is effective and efficient in a range of tasks, from tabular problems to continuous problems, with linear and neural-network function approximation.

Acknowledgements

Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii, and NSERC. Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI-1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI.

References

Bickel, P. J. and Freedman, D. A. Some asymptotic theory for the bootstrap. The Annals of Statistics, pp. 1196-1217, 1981.

Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., and Schuurmans, D. CoinDICE: Off-policy confidence interval estimation. arXiv preprint arXiv:2010.11652, 2020.

Duan, Y. and Wang, M. Minimax-optimal off-policy evaluation with linear function approximation. International Conference on Machine Learning, 2020.

Eck, D. J. Bootstrapping for multivariate linear regression models. Statistics & Probability Letters, 134:141-149, 2018.

Efron, B. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503-556, 2005.

Feng, Y., Ren, T., Tang, Z., and Liu, Q. Accountable off-policy evaluation with kernel Bellman statistics. Proceedings of the International Conference on Machine Learning, 2020.
Feng, Y., Tang, Z., Zhang, N., and Liu, Q. Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=dKg5D1Z1Lm.

Fonteneau, R., Murphy, S. A., Wehenkel, L., and Ernst, D. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383-416, 2013.

Freedman, D. A. Bootstrapping regression models. The Annals of Statistics, 9(6):1218-1228, 1981.

Hallak, A. and Mannor, S. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1372-1383. JMLR.org, 2017.

Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Hao, B., Abbasi-Yadkori, Y., Wen, Z., and Cheng, G. Bootstrapping upper confidence bound. Thirty-Fourth Annual Conference on Neural Information Processing Systems, 2020a.

Hao, B., Duan, Y., Lattimore, T., Szepesvári, C., and Wang, M. Sparse feature selection makes batch reinforcement learning more sample efficient. arXiv preprint arXiv:2011.04019, 2020b.

Jiang, N. and Huang, J. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33, 2020.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652-661, 2016.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1-63, 2020.

Kato, K. A note on moment convergence of bootstrap M-estimators. Statistics & Risk Modeling, 28(1):51-61, 2011.

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pp. 795-816, 2014.

Kostrikov, I. and Nachum, O. Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609, 2020.

Kuzborskij, I., Vernade, C., György, A., and Szepesvári, C. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107-1149, 2003.

Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. Proceedings of Machine Learning Research, 97:3703-3712, 2019.

Liao, P., Klasnja, P., and Murphy, S. Off-policy estimation of long-term average outcomes with applications to mobile health. arXiv preprint arXiv:1912.13088, 2019.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356-5366, 2018.

McLeish, D. L. Dependent central limit theorems and invariance principles. The Annals of Probability, 2(4):620-628, 1974.

Moore, A. W. Efficient memory-based learning for robot control. 1990.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2315-2325, 2019.
Oberst, M. and Sontag, D. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In International Conference on Machine Learning, pp. 4881-4890, 2019.

Paine, T. L., Paduraru, C., Michi, A., Gulcehre, C., Zolna, K., Novikov, A., Wang, Z., and de Freitas, N. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055, 2020.

Petersen, K. and Pedersen, M. The Matrix Cookbook. Technical University of Denmark, Technical Manual, 2008.

Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

Sengupta, S., Volgushev, S., and Shao, X. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111(515):1222-1232, 2016.

Shi, C., Zhang, S., Lu, W., and Song, R. Statistical inference of the value function for reinforcement learning in infinite horizon settings. arXiv preprint arXiv:2001.04515, 2020.

Singh, K. On the asymptotic accuracy of Efron's bootstrap. The Annals of Statistics, pp. 1187-1195, 1981.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139-2148, 2016.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380-2388. PMLR, 2015.

Uehara, M. and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge University Press, 2000.

Voloshin, C., Le, H. M., Jiang, N., and Yue, Y. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

Wang, R., Foster, D. P., and Kakade, S. M. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895, 2020.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9665-9675, 2019.

Yin, M. and Wang, Y.-X. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.

Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020a.

Zhang, S., Liu, B., and Whiteson, S. GradientDICE: Rethinking generalized offline estimation of stationary values. arXiv preprint arXiv:2001.11113, 2020b.