Variance-Aware Off-Policy Evaluation with Linear Function Approximation

Yifei Min (Department of Statistics and Data Science, Yale University, CT 06511; yifei.min@yale.edu), Tianhao Wang (Department of Statistics and Data Science, Yale University, CT 06511; tianhao.wang@yale.edu), Dongruo Zhou (Department of Computer Science, University of California, Los Angeles, CA 90095; drzhou@cs.ucla.edu), Quanquan Gu (Department of Computer Science, University of California, Los Angeles, CA 90095; qgu@cs.ucla.edu). Equal contribution.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.

1 Introduction

Reinforcement learning (RL) has been a hot spot in both theory and practice in the past decade. Many efficient algorithms have been proposed and theoretically analyzed for finding the optimal policy adopted by an agent to maximize the long-term cumulative reward. In contrast to online RL, where the agent actively interacts with the environment, offline RL (a.k.a. batch RL) [25, 24] aims to extract information from past data and use it to learn the optimal policy. There has been much empirical success of offline RL in various application domains [4, 6, 37, 40, 36]. Among the various tasks of offline RL, an important one is off-policy evaluation (OPE), which evaluates the performance of a target policy $\pi$ given offline data generated by a behavior policy $\bar\pi$.

Most existing theoretical works on OPE are in the setting of tabular MDPs [34, 26, 11, 16, 45, 47, 49], where the state space $\mathcal S$ and the action space $\mathcal A$ are both finite. However, real-world applications often have high-dimensional or even infinite-dimensional state and action spaces, where function approximation is required for computational tractability and generalization. While provably efficient online RL with linear function approximation has been widely studied recently [46, 17, 50, 15, 3, 54], little work has been done on analyzing OPE with linear function approximation, with one notable exception by Duan et al. [10]. More specifically, Duan et al. [10] analyzed a regression-based Fitted Q-Iteration method (FQI-OPE) that achieves an $\tilde O\big(H^2\sqrt{(1+d(\pi,\bar\pi))/N}\big)$ error for linear MDPs [46, 17], where $H$ is the planning horizon, $N$ is the sample size, and $d(\pi,\bar\pi)$ represents the distribution shift between the behavior policy and the target policy. They also proved a sample complexity lower bound for a subclass of linear MDPs, for which their algorithm is nearly minimax optimal. However, as we will show later, the $H^2$ dependence is not tight, since they discard the useful variance information contained in the offline data.
Consequently, their result is only optimal for a small class of MDPs whose value functions have large variance. The $H^2$ dependence in the sample complexity also makes their algorithm less sample-efficient for long-horizon problems, which is one of the major challenges in RL. Extracting useful information from the data is particularly important for offline RL since, unlike in online RL, the agent cannot collect additional data by interacting with the environment.

In this paper, we propose a new algorithm that incorporates the variance information of the value functions to improve the sample efficiency of OPE. This allows us to achieve a deeper understanding of, and tighter error bounds for, OPE with linear function approximation. In detail, we consider time-inhomogeneous linear MDPs [46, 17], where the transition probability and the reward function are assumed to be linear functions of a known feature mapping and may vary from stage to stage. The main contributions of this paper are summarized as follows:

- We develop VA-OPE (Variance-Aware Off-Policy Evaluation), an algorithm for OPE that effectively utilizes the variance information in the offline data. The core idea behind the proposed algorithm is to calibrate the Bellman residual in the regression by an estimator of the conditional variance of the value functions, so that data points of higher quality receive larger importance weights. We show that our algorithm achieves an $\tilde O\big(\sum_h (\nu_h^\top \Lambda_h^{-1} \nu_h)^{1/2}/\sqrt K\big)$ policy evaluation error, where $\nu_h$ is the expectation of the feature vector under the target policy and $\Lambda_h$ is the uncentered covariance matrix under the behavior policy, weighted by the conditional variance of the value function. Our algorithm achieves a tighter error bound and a milder dependence on $H$ than FQI-OPE [10], and provides a tighter characterization of the distribution shift between the behavior policy and the target policy, which is also verified by extensive numerical experiments.

- Our analysis is based on a novel two-step proof technique. In the first step, we use backward induction to establish worst-case uniform convergence² results for the estimators of the value functions. In the second step, the convergence of the OPE estimator is proved by tightening the uniform convergence result via an average-case analysis. Our proof strategy provides a generic way of analyzing (weighted) ridge regression methods that are carried out in a backward and iterated fashion. The analyses in both steps might be of independent interest.

Notation. We use lower case letters to denote scalars, and lower and upper case boldface letters to denote vectors and matrices, respectively. For any vector $x \in \mathbb R^d$ and any positive semi-definite matrix $\Sigma \in \mathbb R^{d\times d}$, we denote by $\|x\|_2$ the Euclidean norm of $x$, by $\|\Sigma\|$ the operator norm of $\Sigma$, and define $\|x\|_\Sigma = \sqrt{x^\top \Sigma x}$. For any positive integer $n$, we denote by $[n]$ the set $\{1,\dots,n\}$. For any finite set $A$, we denote by $|A|$ its cardinality. For two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ if there exists an absolute constant $C$ such that $a_n \le C b_n$, and we write $a_n = \Omega(b_n)$ if there exists an absolute constant $C$ such that $a_n \ge C b_n$. We use $\tilde O(\cdot)$ to further hide logarithmic factors.

2 Preliminaries

2.1 Markov Decision Processes

We consider a time-inhomogeneous episodic Markov decision process (MDP), represented by a tuple $M(\mathcal S, \mathcal A, H, \{r_h\}_{h=1}^H, \{\mathbb P_h\}_{h=1}^H)$. Specifically, $\mathcal S$ denotes the state space, $\mathcal A$ the action space, and $H > 0$ is the horizon length of each episode.
At each stage $h \in [H]$, $r_h : \mathcal S \times \mathcal A \to [0,1]$ is the reward function, and $\mathbb P_h(s'\mid s,a)$ is the transition probability function, which gives the probability of transiting to state $s'$ from state $s$ when action $a$ is taken. A policy $\pi$ consists of $H$ mappings $\{\pi_h\}_{h=1}^H$ from $\mathcal S$ to the simplex over $\mathcal A$, such that for any $(h,s) \in [H]\times\mathcal S$, $\pi_h(\cdot\mid s)$ is a probability distribution over $\mathcal A$. A policy can be either deterministic (a point mass) or stochastic.

²By uniform convergence we mean the convergence of the estimated value functions in $\ell_\infty$-norm to their true values, which is different from the uniform convergence over all policies in Yin et al. [48].

For any policy $\pi$, we define the associated action-value function $Q_h^\pi(s,a)$ and value function $V_h^\pi(s)$ at each stage $h\in[H]$ as follows:

$$Q_h^\pi(s,a) = \mathbb E_\pi\Big[\textstyle\sum_{i=h}^H r_i(s_i,a_i)\ \Big|\ s_h = s,\ a_h = a\Big], \qquad V_h^\pi(s) = \int_{\mathcal A} Q_h^\pi(s,a)\, d\pi_h(a\mid s), \qquad (2.1)$$

where $a_i \sim \pi_i(\cdot\mid s_i)$ and $s_{i+1}\sim \mathbb P_i(\cdot\mid s_i,a_i)$. For any function $V:\mathcal S\to\mathbb R$, we introduce the following shorthand notation for the conditional expectation and variance of $V$:

$$[\mathbb P_h V](s,a) = \mathbb E_{s'\sim\mathbb P_h(\cdot\mid s,a)}[V(s')], \qquad [\mathbb V_h V](s,a) = [\mathbb P_h V^2](s,a) - \big([\mathbb P_h V](s,a)\big)^2. \qquad (2.2)$$

Time-inhomogeneous linear MDPs. We consider a special class of MDPs called linear MDPs [46, 17]. Note that most existing works on RL with linear function approximation rely on this assumption.

Assumption 2.1. $M(\mathcal S,\mathcal A,H,\{r_h\}_{h=1}^H,\{\mathbb P_h\}_{h=1}^H)$ is called a linear MDP with a known feature mapping $\phi:\mathcal S\times\mathcal A\to\mathbb R^d$ if, for any $h\in[H]$, there exist a vector $\gamma_h\in\mathbb R^d$ and a vector-valued measure $\mu_h(\cdot)$ over $\mathcal S$ taking values in $\mathbb R^d$, such that for any state-action pair $(s,a)\in\mathcal S\times\mathcal A$,

$$\mathbb P_h(\cdot\mid s,a) = \langle\phi(s,a),\,\mu_h(\cdot)\rangle, \qquad r_h(s,a) = \langle\phi(s,a),\,\gamma_h\rangle. \qquad (2.3)$$

We assume that at any stage $h$, for any state-action pair $(s,a)\in\mathcal S\times\mathcal A$, the reward received by the agent is $r = r_h(s,a) + \epsilon_h(s,a)$, where $r_h(s,a)\in[0,1]$ is the expected reward and $\epsilon_h(s,a)$ is random noise. We assume the noise is zero-mean and independent of everything else. Without loss of generality, we assume that $\|\gamma_h\|_2\le 1$ and $\|\phi(s,a)\|_2\le 1$ for all $(s,a)\in\mathcal S\times\mathcal A$. We also assume that $r_h(s,a)+\epsilon_h(s,a)\le 1$ and $|\epsilon_h(s,a)|\le 1$ almost surely, and thus $\mathrm{Var}(\epsilon_h(s,a))\le 1$, for all $h\in[H]$ and $(s,a)\in\mathcal S\times\mathcal A$. Moreover, we assume that $\max_{h\in[H]}\big\|\int_{\mathcal S} f(s)\,d\mu_h(s)\big\|_2 \le \sqrt d$ for every bounded function $f:\mathcal S\to\mathbb R$ with $\sup_{s\in\mathcal S}|f(s)|\le 1$.

The above assumption on linear MDPs implies the following proposition for the action-value functions.

Proposition 2.2 (Proposition 2.3, [17]). For a linear MDP and any policy $\pi$, there exist weights $\{w_h^\pi,\ h\in[H]\}$ such that for any $(s,a,h)\in\mathcal S\times\mathcal A\times[H]$ we have $Q_h^\pi(s,a) = \langle\phi(s,a), w_h^\pi\rangle$. Moreover, $\|w_h^\pi\|_2 \le 2H\sqrt d$ for all $h\in[H]$.

Following this proposition, we can further show that the value functions are also linear, but in different features. Define $\phi_h^\pi(s) = \int_{\mathcal A}\phi(s,a)\,d\pi_h(a\mid s)$ for all $s\in\mathcal S$ and $h\in[H]$. Then by (2.1) we have

$$V_h^\pi(s) = \int_{\mathcal A}\phi(s,a)^\top w_h^\pi\, d\pi_h(a\mid s) = \langle\phi_h^\pi(s),\, w_h^\pi\rangle.$$
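As a concrete illustration of Proposition 2.2 (not taken from the paper), the following sketch treats a small tabular MDP as a linear MDP with one-hot features, computes $w_h^\pi$ via the backward recursion $w_h^\pi = \gamma_h + \int_{\mathcal S}V_{h+1}^\pi(s')\,d\mu_h(s')$ implied by the Bellman equation, and evaluates $V_h^\pi(s) = \langle\phi_h^\pi(s), w_h^\pi\rangle$; all variable names are ours.

```python
import numpy as np

# A tabular MDP is a linear MDP with one-hot features phi(s, a) in R^{S*A}:
# P_h(s'|s, a) = <phi(s, a), mu_h(s')> and r_h(s, a) = <phi(s, a), gamma_h>.
S, A, H = 3, 2, 4
d = S * A
rng = np.random.default_rng(0)

def phi(s, a):                                   # one-hot feature, index s * A + a
    v = np.zeros(d)
    v[s * A + a] = 1.0
    return v

P = rng.dirichlet(np.ones(S), size=(H, S, A))    # P[h, s, a, s']
r = rng.uniform(0, 1, size=(H, S, A))            # expected rewards in [0, 1]
pi = rng.dirichlet(np.ones(A), size=(H, S))      # target policy pi_h(a|s)

# Backward recursion: w_h = gamma_h + integral of V_{h+1}(s') d mu_h(s').
V_next = np.zeros(S)                             # V^pi_{H+1} = 0
for h in reversed(range(H)):
    gamma_h = r[h].reshape(d)                    # gamma_h[s*A + a] = r_h(s, a)
    mu_V = (P[h] @ V_next).reshape(d)            # <mu_h(.), V^pi_{h+1}> per (s, a)
    w_h = gamma_h + mu_V                         # Proposition 2.2
    Q_h = np.array([[phi(s, a) @ w_h for a in range(A)] for s in range(S)])
    V_next = (pi[h] * Q_h).sum(axis=1)           # V^pi_h(s) = <phi^pi_h(s), w_h>
print("V^pi_1:", V_next)
```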
2.2 Off-policy Evaluation

The purpose of OPE is to evaluate a (known) target policy $\pi$ given an offline dataset generated by a different (unknown) behavior policy $\bar\pi$. In this paper, our goal is to estimate the expectation of the value function induced by $\pi$ over a fixed initial distribution $\xi_1$, i.e., $v_1^\pi = \mathbb E_{s\sim\xi_1}[V_1^\pi(s)]$.

To facilitate the presentation, we introduce some additional notation. For all $h\in[H]$, let $\bar\nu_h$ be the occupancy measure over $\mathcal S\times\mathcal A$ at stage $h$ induced by the transition $\mathbb P$ and the behavior policy $\bar\pi$; that is, for any $E\subseteq\mathcal S\times\mathcal A$,

$$\bar\nu_h(E) = \mathbb P\big[(s_h,a_h)\in E \ \big|\ s_1\sim\xi_1,\ a_i\sim\bar\pi_i(\cdot\mid s_i),\ s_{i+1}\sim\mathbb P_i(\cdot\mid s_i,a_i),\ 1\le i\le h\big]. \qquad (2.4)$$

For simplicity, we write $\bar{\mathbb E}_h[f(s,a)] = \mathbb E_{\bar\pi,h}[f(s,a)] = \int_{\mathcal S\times\mathcal A} f(s,a)\,d\bar\nu_h(s,a)$ for any function $f$ on $\mathcal S\times\mathcal A$. Similarly, we use $\mathbb E_{\pi,h}[f(s,a)]$ to denote the expectation of $f$ with respect to the occupancy measure at stage $h$ induced by the transition $\mathbb P$ and the target policy $\pi$.

We define the following uncentered covariance matrix under the behavior policy for all $h\in[H]$:

$$\Sigma_h = \mathbb E_{\bar\pi,h}\big[\phi(s,a)\phi(s,a)^\top\big]. \qquad (2.5)$$

Intuitively, these matrices measure the coverage of the offline data over the state-action space. It is known that the success of OPE necessitates good coverage [10, 43]. Therefore, we make the same coverage assumption on the offline data.

Assumption 2.3 (Coverage). For all $h\in[H]$, $\kappa_h := \lambda_{\min}(\Sigma_h) > 0$. Denote $\kappa = \min_{h\in[H]}\kappa_h$.

A key difference in our result is that, instead of depending on $\Sigma_h$ directly, the error bound depends on the following weighted version of the covariance matrices, defined for all $h\in[H]$ as

$$\Lambda_h := \mathbb E_{\bar\pi,h}\big[\sigma_h(s,a)^{-2}\,\phi(s,a)\phi(s,a)^\top\big], \qquad (2.6)$$

where each $\sigma_h:\mathcal S\times\mathcal A\to\mathbb R$ is defined as

$$\sigma_h(s,a) := \sqrt{\max\big\{1,\ [\mathbb V_h V_{h+1}^\pi](s,a)\big\} + 1}. \qquad (2.7)$$

Note that in the definition of $\sigma_h(\cdot,\cdot)$, taking the maximum and adding the extra $1$ are purely for technical reasons and are related to its estimator $\hat\sigma_h(\cdot,\cdot)$, which we introduce and explain in Section 3.2. In general, one can think of $\sigma_h^2(s,a)\approx[\mathbb V_h V_{h+1}^\pi](s,a)$. Therefore, compared with the raw covariance matrix $\Sigma_h$, $\Lambda_h$ further incorporates the variance of the value functions under the target policy. This is the key to obtaining a tighter instance-dependent error bound.

Definition 2.4 (Variance-aware coverage). We define $\iota_h := \lambda_{\min}(\Lambda_h)$ and $\iota = \min_{h\in[H]}\iota_h$.

Since $\sup_{(s,a)\in\mathcal S\times\mathcal A}\sigma_h(s,a)^2$ is bounded from above, by (2.6) and Assumption 2.3 we immediately have $\iota_h \ge \kappa_h / \big[\sup_{(s,a)\in\mathcal S\times\mathcal A}\sigma_h(s,a)^2\big] > 0$ for all $h\in[H]$, and thus $\iota > 0$. Even if Assumption 2.3 does not hold, we can always restrict attention to the subspace $\mathrm{span}\{\phi(s_h,a_h)\}$. For convenience of presentation, we maintain Assumption 2.3 throughout this paper.

Next, we introduce the assumption on the sampling process of the offline data.

Assumption 2.5 (Stage-sampling Data). We have two offline datasets $D$ and $\check D$, each consisting of data from $H$ stages: $D = \{D_h\}_{h\in[H]}$ and $\check D = \{\check D_h\}_{h\in[H]}$. For the dataset $D$, we assume $D_{h_1}$ is independent of $D_{h_2}$ for $h_1\ne h_2$. For each stage $h$, we have $D_h = \{(s_{k,h}, a_{k,h}, r_{k,h}, s'_{k,h})\}_{k\in[K]}$, where for each $k\in[K]$ the data point $(s_{k,h}, a_{k,h}, r_{k,h}, s'_{k,h})$ is sampled independently and identically as follows: $(s_{k,h}, a_{k,h})\sim\bar\nu_h(\cdot,\cdot)$, where $\bar\nu_h(\cdot,\cdot)$ is the occupancy measure defined in (2.4), and $s'_{k,h}\sim\mathbb P_h(\cdot\mid s_{k,h}, a_{k,h})$. The same holds for $\check D$, and we write $\check D_h = \{(\check s_{k,h}, \check a_{k,h}, \check r_{k,h}, \check s'_{k,h})\}_{k\in[K]}$. Note that here $s'_{k,h}$ is in general not the same as $s_{k,h+1}$.

Assumption 2.5 is standard in the offline RL literature [48, 10]. Note that the assumption involves a data split: one can view the whole dataset $D\cup\check D$ as being split into two halves. The datasets $D$ and $\check D$ will be used for two different purposes in Algorithm 1, as will be made clear in the next section. We would like to remark that the only purpose of the split is to avoid a lengthy analysis; there is no need to perform the data split in practice. In our implementation and experiments, we do not split the data.
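To make the coverage quantities concrete, here is a minimal sketch (our own, with hypothetical inputs) of the empirical counterparts of $\Sigma_h$ in (2.5) and $\Lambda_h$ in (2.6), together with the coverage constants $\kappa_h$ (Assumption 2.3) and $\iota_h$ (Definition 2.4), computed from $K$ sampled features and per-sample values of $\sigma_h$:

```python
import numpy as np

def coverage_constants(Phi, sigma):
    """Phi: (K, d) feature vectors drawn from the behavior occupancy at stage h;
    sigma: (K,) values sigma_h(s_k, a_k) as in (2.7)."""
    K = Phi.shape[0]
    Sigma_hat = Phi.T @ Phi / K                           # empirical Sigma_h, cf. (2.5)
    Lambda_hat = (Phi / sigma[:, None] ** 2).T @ Phi / K  # empirical Lambda_h, cf. (2.6)
    kappa_h = np.linalg.eigvalsh(Sigma_hat)[0]            # smallest eigenvalue (Assumption 2.3)
    iota_h = np.linalg.eigvalsh(Lambda_hat)[0]            # variance-aware coverage (Definition 2.4)
    return Sigma_hat, Lambda_hat, kappa_h, iota_h

rng = np.random.default_rng(1)
K, d = 500, 10
Phi = rng.normal(size=(K, d))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)         # keep ||phi|| <= 1 as in Assumption 2.1
sigma = np.sqrt(np.maximum(1.0, rng.uniform(0, 4.0, size=K)) + 1.0)
_, _, kappa_h, iota_h = coverage_constants(Phi, sigma)
print(f"kappa_h = {kappa_h:.4f}, iota_h = {iota_h:.4f}")
```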
3 Algorithm

To ease notation, we write $\phi_{k,h} = \phi(s_{k,h}, a_{k,h})$, $\check\phi_{k,h} = \phi(\check s_{k,h}, \check a_{k,h})$, $\hat\sigma_{k,h} = \hat\sigma_h(s_{k,h}, a_{k,h})$, and $r_{k,h} = r_h(s_{k,h}, a_{k,h}) + \epsilon_{k,h}$ for all $(h,k)\in[H]\times[K]$. Recall that the check mark denotes the second half of the split dataset; how this half is used will become clear in Section 3.2, where we introduce the proposed algorithm.

3.1 Regression-Based Value Function Estimation

By Proposition 2.2, it suffices to estimate the vectors $\{w_h^\pi,\ h\in[H]\}$. A popular approach is Least-Squares Value Iteration (LSVI) [17], which relies on the Bellman equation $Q_h^\pi(s,a) = r_h(s,a) + [\mathbb P_h V_{h+1}^\pi](s,a)$ that holds for all $h\in[H]$ and $(s,a)\in\mathcal S\times\mathcal A$. Viewing $V_{h+1}^\pi(s'_{k,h})$ as an unbiased estimate of $[\mathbb P_h V_{h+1}^\pi](s_{k,h}, a_{k,h})$, the idea of LSVI-type methods is to solve the following ridge regression problem:

$$\hat w_h^\pi := \mathop{\mathrm{argmin}}_{w\in\mathbb R^d}\ \lambda\|w\|_2^2 + \sum_{k=1}^K\big(\langle\phi_{k,h}, w\rangle - r_{k,h} - V_{h+1}^\pi(s'_{k,h})\big)^2, \qquad (3.1)$$

for some regularization parameter $\lambda>0$. Since we do not know the exact values of $V_{h+1}^\pi$ in (3.1), we replace it by an estimator $\hat V_{h+1}^\pi$ and then recursively solve the least-squares problem in a backward manner, which enjoys the closed-form solution

$$\hat w_h^\pi = \Big(\sum_{k=1}^K\phi_{k,h}\phi_{k,h}^\top + \lambda I_d\Big)^{-1}\sum_{k=1}^K\phi_{k,h}\big[r_{k,h} + \hat V_{h+1}^\pi(s'_{k,h})\big].$$

This approach has been used in the LSVI-UCB algorithm proposed by Jin et al. [17] and the FQI-OPE algorithm studied by Duan et al. [10], for online learning and OPE of linear MDPs, respectively. For this kind of algorithm, the key difficulty in the analysis lies in bounding the Bellman error

$$\Big(\sum_{k=1}^K\phi_{k,h}\phi_{k,h}^\top + \lambda I_d\Big)^{-1}\sum_{k=1}^K\phi_{k,h}\Big([\mathbb P_h\hat V_{h+1}^\pi](s_{k,h},a_{k,h}) - \hat V_{h+1}^\pi(s'_{k,h})\Big).$$

Jin et al. [17] applied a Hoeffding-type inequality to bound this term. Although Duan et al. [10] applied Freedman's inequality in their analysis, their algorithm design overlooks the variance information in the data, and consequently they can only adopt a crude upper bound on the conditional variance of the value function, namely $[\mathbb V_h V_{h+1}^\pi]\le(H-h)^2$, which simply comes from $\sup_{s\in\mathcal S}V_{h+1}^\pi(s)\le H-h$. This prevents [10] from obtaining a tight instance-dependent error bound for OPE, and it is further verified by our numerical experiments in Appendix A, which show that the performance of FQI-OPE degrades for large $H$. This motivates us to utilize the variance information in the data for OPE.

3.2 The Proposed Algorithm

We present our main algorithm as Algorithm 1. Due to the recursive nature of the value functions, we adopt a backward estimation scheme.

Weighted ridge regression. For any $h\in[H]$, let $\hat w_{h+1}^\pi$ be the estimate of $w_{h+1}^\pi$ computed at the previous step, and correspondingly $\hat V_{h+1}^\pi(\cdot) = \langle\phi_{h+1}^\pi(\cdot),\hat w_{h+1}^\pi\rangle$. Instead of the ordinary ridge regression (3.1), we consider the following weighted ridge regression:

$$\hat w_h^\pi := \mathop{\mathrm{argmin}}_{w\in\mathbb R^d}\ \lambda\|w\|_2^2 + \sum_{k=1}^K\frac{\big[\langle\phi_{k,h},w\rangle - r_{k,h} - \hat V_{h+1}^\pi(s'_{k,h})\big]^2}{\hat\sigma_{k,h}^2}, \qquad (3.2)$$

where $\hat\sigma_{k,h} = \hat\sigma_h(s_{k,h},a_{k,h})$ for all $(h,k)\in[H]\times[K]$, with $\hat\sigma_h(\cdot,\cdot)$ a suitable estimate of $\sigma_h(\cdot,\cdot)$ defined in (2.7). We then have the closed-form solution (Lines 7 and 9 of Algorithm 1)

$$\hat w_h^\pi = \hat\Lambda_h^{-1}\sum_{k=1}^K\frac{\phi_{k,h}\big[r_{k,h} + \hat V_{h+1}^\pi(s'_{k,h})\big]}{\hat\sigma_{k,h}^2}, \qquad\text{with}\qquad \hat\Lambda_h = \sum_{k=1}^K\hat\sigma_{k,h}^{-2}\phi_{k,h}\phi_{k,h}^\top + \lambda I_d. \qquad (3.3)$$
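A minimal sketch of the weighted ridge update (3.2)-(3.3) at a single stage $h$, assuming the regression targets and the estimated standard deviations $\hat\sigma_{k,h}$ are already available as arrays (array names are ours, not from the paper):

```python
import numpy as np

def weighted_ridge_step(Phi, y, sigma_hat, lam=1.0):
    """Closed-form solution of (3.2)/(3.3) at one stage h.

    Phi:       (K, d) features phi_{k,h}
    y:         (K,)  targets r_{k,h} + hat V^pi_{h+1}(s'_{k,h})
    sigma_hat: (K,)  estimated standard deviations hat sigma_{k,h}
    """
    d = Phi.shape[1]
    w2 = 1.0 / sigma_hat ** 2
    Lambda_hat = (Phi * w2[:, None]).T @ Phi + lam * np.eye(d)  # hat Lambda_h in (3.3)
    w_hat = np.linalg.solve(Lambda_hat, Phi.T @ (w2 * y))       # hat w^pi_h in (3.3)
    return w_hat, Lambda_hat
```

Dividing each squared residual by $\hat\sigma_{k,h}^2$ before forming the Gram matrix is precisely what down-weights high-variance transitions relative to the unweighted update.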
In the above estimator, we use the dataset $D$ to estimate the value functions. Next, we apply an LSVI-type method to estimate $\sigma_h$ using the dataset $\check D$.

Variance estimator. By (2.2), we can write

$$[\mathbb V_h V_{h+1}^\pi](s,a) = [\mathbb P_h(V_{h+1}^\pi)^2](s,a) - \big([\mathbb P_h V_{h+1}^\pi](s,a)\big)^2. \qquad (3.4)$$

For the first term in (3.4), by Assumption 2.1 we have

$$[\mathbb P_h(V_{h+1}^\pi)^2](s,a) = \int_{\mathcal S}V_{h+1}^\pi(s')^2\,d\mathbb P_h(s'\mid s,a) = \phi(s,a)^\top\int_{\mathcal S}V_{h+1}^\pi(s')^2\,d\mu_h(s'),$$

which shows that $\mathbb P_h(V_{h+1}^\pi)^2$ also has a linear representation. Thus we adopt a linear estimator $\langle\phi(s,a),\hat\beta_h^\pi\rangle$, where $\hat\beta_h^\pi$ (Line 4) is the solution to the following ridge regression problem:

$$\hat\beta_h^\pi = \mathop{\mathrm{argmin}}_{\beta\in\mathbb R^d}\ \sum_{k=1}^K\big[\langle\check\phi_{k,h},\beta\rangle - \hat V_{h+1}^\pi(\check s'_{k,h})^2\big]^2 + \lambda\|\beta\|_2^2 = \hat\Sigma_h^{-1}\sum_{k=1}^K\check\phi_{k,h}\,\hat V_{h+1}^\pi(\check s'_{k,h})^2. \qquad (3.5)$$

Similarly, we estimate the second term in (3.4) by $\langle\phi(s,a),\hat\theta_h^\pi\rangle$, where $\hat\theta_h^\pi$ (Line 5) is given by

$$\hat\theta_h^\pi = \mathop{\mathrm{argmin}}_{\theta\in\mathbb R^d}\ \sum_{k=1}^K\big[\langle\check\phi_{k,h},\theta\rangle - \hat V_{h+1}^\pi(\check s'_{k,h})\big]^2 + \lambda\|\theta\|_2^2 = \hat\Sigma_h^{-1}\sum_{k=1}^K\check\phi_{k,h}\,\hat V_{h+1}^\pi(\check s'_{k,h}), \qquad (3.6)$$

with $\hat\Sigma_h = \sum_{k=1}^K\check\phi_{k,h}\check\phi_{k,h}^\top + \lambda I_d$. Combining (3.5) and (3.6), we estimate $\mathbb V_h V_{h+1}^\pi$ by

$$[\hat{\mathbb V}_h\hat V_{h+1}^\pi](\cdot,\cdot) = \big[\langle\phi(\cdot,\cdot),\hat\beta_h^\pi\rangle\big]_{[0,(H-h+1)^2]} - \Big(\big[\langle\phi(\cdot,\cdot),\hat\theta_h^\pi\rangle\big]_{[0,H-h+1]}\Big)^2, \qquad (3.7)$$

where the subscript $[0,(H-h+1)^2]$ denotes clipping to the given range, and similarly for the subscript $[0,H-h+1]$. We perform this clipping because $V_{h+1}^\pi\in[0,H-h]$, and the additional $1$ in the clipping ranges accounts for the approximation error in $\hat V_{h+1}^\pi$. Based on $\hat{\mathbb V}_h\hat V_{h+1}^\pi$, the final variance estimator $\hat\sigma_h(\cdot,\cdot)$ (Line 6) is defined as

$$\hat\sigma_h(\cdot,\cdot) = \sqrt{\max\big\{1,\ [\hat{\mathbb V}_h\hat V_{h+1}^\pi](\cdot,\cdot)\big\} + 1}.$$

To handle the case where $\hat{\mathbb V}_h\hat V_{h+1}^\pi$ is negative or very close to $0$, we take the maximum of $\hat{\mathbb V}_h\hat V_{h+1}^\pi$ and $1$. Also, to account for the noise in the observed rewards, we add an extra $1$, which is an upper bound on the noise variance by Assumption 2.1.

Final estimator. We repeat the above procedure recursively for $h = H, H-1, \dots, 1$ and obtain $\hat V_1^\pi$. The final estimator of $v_1^\pi$ (Line 12) is then $\hat v_1^\pi = \int_{\mathcal S}\hat V_1^\pi(s)\,d\xi_1(s)$.

Algorithm 1 Variance-Aware Off-Policy Evaluation (VA-OPE)
1: Input: target policy $\pi=\{\pi_h\}_{h\in[H]}$, datasets $D = \{\{(s_{k,h},a_{k,h},r_{k,h},s'_{k,h})\}_{h\in[H]}\}_{k\in[K]}$ and $\check D = \{\{(\check s_{k,h},\check a_{k,h},\check r_{k,h},\check s'_{k,h})\}_{h\in[H]}\}_{k\in[K]}$, initial distribution $\xi_1$; set $\hat w_{H+1}^\pi = 0$
2: for $h = H, H-1, \dots, 1$ do
3: $\quad\hat\Sigma_h \leftarrow \sum_{k=1}^K\check\phi_{k,h}\check\phi_{k,h}^\top + \lambda I_d$
4: $\quad\hat\beta_h \leftarrow \hat\Sigma_h^{-1}\sum_{k=1}^K\check\phi_{k,h}\,\hat V_{h+1}^\pi(\check s'_{k,h})^2$
5: $\quad\hat\theta_h \leftarrow \hat\Sigma_h^{-1}\sum_{k=1}^K\check\phi_{k,h}\,\hat V_{h+1}^\pi(\check s'_{k,h})$
6: $\quad\hat\sigma_h(\cdot,\cdot) \leftarrow \sqrt{\max\{1,\ [\hat{\mathbb V}_h\hat V_{h+1}^\pi](\cdot,\cdot)\}+1}$
7: $\quad\hat\Lambda_h \leftarrow \sum_{k=1}^K\phi_{k,h}\phi_{k,h}^\top/\hat\sigma_{k,h}^2 + \lambda I_d$
8: $\quad Y_{k,h} \leftarrow r_{k,h} + \langle\phi_{h+1}^\pi(s'_{k,h}),\ \hat w_{h+1}^\pi\rangle$
9: $\quad\hat w_h^\pi \leftarrow \hat\Lambda_h^{-1}\sum_{k=1}^K\phi_{k,h}Y_{k,h}/\hat\sigma_{k,h}^2$
10: $\quad\hat Q_h^\pi(\cdot,\cdot) \leftarrow \langle\phi(\cdot,\cdot),\hat w_h^\pi\rangle$, $\ \hat V_h^\pi(\cdot) \leftarrow \langle\phi_h^\pi(\cdot),\hat w_h^\pi\rangle$
11: end for
12: Output: $\hat v_1^\pi \leftarrow \int_{\mathcal S}\hat V_1^\pi(s)\,d\xi_1(s)$

Intuition behind $\Lambda_h$. To illustrate the intuition behind the weighted covariance matrix $\Lambda_h$, we provide a brief heuristic. Let $\{(s_{k,h},a_{k,h},s'_{k,h})\}_{k\in[K]}$ be i.i.d. samples such that $(s_{k,h},a_{k,h})\sim\nu$ for some distribution $\nu$ over $\mathcal S\times\mathcal A$ and $s'_{k,h}\sim\mathbb P_h(\cdot\mid s_{k,h},a_{k,h})$. Define

$$e_k = \phi(s_{k,h},a_{k,h})\,\frac{[\mathbb P_h V_{h+1}^\pi](s_{k,h},a_{k,h}) - V_{h+1}^\pi(s'_{k,h})}{[\mathbb V_h V_{h+1}^\pi](s_{k,h},a_{k,h})}$$

for all $k\in[K]$. The $e_k$'s are i.i.d. zero-mean random vectors, and a simple calculation yields

$$\mathrm{Cov}(e_k) = \mathbb E\big[[\mathbb V_h V_{h+1}^\pi](s_{k,h},a_{k,h})^{-1}\,\phi(s_{k,h},a_{k,h})\phi(s_{k,h},a_{k,h})^\top\big],$$

which coincides with (2.6) (recall that $\sigma_h^2(s,a)\approx[\mathbb V_h V_{h+1}^\pi](s,a)$). Suppose $\mathrm{Cov}(e_k)\succ 0$; then by the central limit theorem,

$$\frac{1}{\sqrt K}\sum_{k=1}^K e_k \xrightarrow{\ d\ } \mathcal N\big(0,\ \mathrm{Cov}(e_k)\big).$$

Therefore $\mathrm{Cov}(e_k)^{-1}$, or equivalently $\Lambda_h^{-1}$, can be seen as the Fisher information matrix associated with the weighted product of the Bellman error and the feature vectors. This is a tighter characterization of the convergence rate than bounding $\mathbb V_h V_{h+1}^\pi$ by its naive upper bound $(H-h)^2$.
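The variance-estimation step (Lines 3-6 of Algorithm 1, i.e., (3.5)-(3.7)) admits an equally short sketch. It assumes $\hat V_{h+1}^\pi(\check s'_{k,h})$ has already been evaluated on the auxiliary dataset $\check D_h$; the function and array names are ours.

```python
import numpy as np

def variance_estimator(Phi_check, V_next_check, H, h, lam=1.0):
    """Lines 3-6 of Algorithm 1 at stage h (1-indexed), following (3.5)-(3.7).

    Phi_check:    (K, d) features of (check s_{k,h}, check a_{k,h})
    V_next_check: (K,)   values hat V^pi_{h+1}(check s'_{k,h})
    Returns sigma_hat_h, a function of a feature vector phi(s, a).
    """
    d = Phi_check.shape[1]
    Sigma_hat = Phi_check.T @ Phi_check + lam * np.eye(d)                    # Line 3
    beta_hat = np.linalg.solve(Sigma_hat, Phi_check.T @ V_next_check ** 2)   # Line 4, (3.5)
    theta_hat = np.linalg.solve(Sigma_hat, Phi_check.T @ V_next_check)       # Line 5, (3.6)

    def sigma_hat_h(phi_sa):
        m2 = np.clip(phi_sa @ beta_hat, 0.0, (H - h + 1) ** 2)   # clipped second moment
        m1 = np.clip(phi_sa @ theta_hat, 0.0, H - h + 1)         # clipped first moment
        var_hat = m2 - m1 ** 2                                   # (3.7)
        return np.sqrt(np.maximum(1.0, var_hat) + 1.0)           # Line 6
    return sigma_hat_h
```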
4 Theoretical Results

In this section, we introduce our main theoretical results and give an overview of the proof technique.

4.1 OPE Error Bound

Our main result is a refined average-case OPE analysis that yields the tighter error bound in Theorem 4.1. The proof is given in Appendix D. To simplify the notation, we define

$$C_{h,2} = \frac{H-h+1}{2\iota_h}, \qquad C_{h,3} = \frac{(H-h+1)^2}{2}, \qquad C_{h,4} = \|\Lambda_h\|\,\|\Lambda_h^{-1}\|^{1/2}.$$

Theorem 4.1. Set $\lambda = 1$. Under Assumptions 2.1, 2.3 and 2.5, if $K$ satisfies

$$K \ge C\,C_3\,d^2\log\Big(\frac{dH^2K}{\kappa\delta}\Big), \qquad\text{where}\qquad C_3 := \max\Big\{\max_{h\in[H]}\frac{C_{h,3}\,C_{h,2}^2}{8\iota_h^2},\ \frac{H^4}{\kappa^2}\max_{h\in[H]}C_{h,3},\ 2\max_{h\in[H]}C_{h,3}\Big\}$$

and $C$ is a problem-independent universal constant, then with probability at least $1-\delta$, the output of Algorithm 1 satisfies

$$|v_1^\pi - \hat v_1^\pi| \le C\sum_{h=1}^H\frac{\|\nu_h^\pi\|_{\Lambda_h^{-1}}}{\sqrt K} + \frac{C\,C_4}{K}\log\frac{16H}{\delta},$$

where $\nu_h^\pi := \mathbb E_{\pi,h}[\phi(s_h,a_h)]$ and

$$C_4 := \sum_{h=1}^H\sqrt{C_{h,4}}\,C_{h,2}\,\frac{(H-h+1)d}{4\iota_h}\,\log\Big(\frac{dH^2K}{\kappa\delta}\Big)\,\|\nu_h^\pi\|_{\Lambda_h^{-1}}.$$

Theorem 4.1 shows that Algorithm 1 provably achieves a tighter instance-dependent error bound for OPE than that of [10]. In detail, the dominant term in our bound is $\tilde O\big(\sum_{h=1}^H\|\nu_h^\pi\|_{\Lambda_h^{-1}}/\sqrt K\big)$, as compared to the $\tilde O\big(\sum_{h=1}^H(H-h+1)\|\nu_h^\pi\|_{\Sigma_h^{-1}}/\sqrt K\big)$ term in [10]. By (2.5) and (2.6), our bound is at least as good as the latter, since $\Sigma_h \preceq [(H-h+1)^2+1]\Lambda_h$. More importantly, it is instance-dependent and tight for the general class of linear MDPs: for instances where $\mathbb V_h V_{h+1}^\pi$ is close to its crude upper bound $(H-h+1)^2$, our bound recovers the prior result, while when $\mathbb V_h V_{h+1}^\pi$ is small, VA-OPE benefits from incorporating the variance information and our bound tightens accordingly.

Remark 4.2. Note that we do not require $[\mathbb V_h V_{h+1}^\pi](s,a)$ to be uniformly small for all $s$, $a$, and $h$. From the bound and (2.6), as long as the variances are smaller than $(H-h+1)^2$ on average over $(s,a)\in\mathcal S\times\mathcal A$ and in sum over $h$, the bound is improved. It is also worth noting that the lower bound proved in [10] only holds for a subclass of linear MDPs with $\mathbb V_h V_{h+1}^\pi = \Omega((H-h+1)^2)$, and thus their minimax optimality does not extend to general linear MDPs. For a more detailed comparison we refer the reader to Appendix B.

Remark 4.3. Conceptually, the term $\|\nu_h^\pi\|_{\Lambda_h^{-1}}$ serves as a more precise, variance-aware characterization of the distribution shift between the behavior policy $\bar\pi$ and the target policy $\pi$. This enables our algorithm to use the data more effectively. Compared with online RL, where one can sample new data, OPE is more "data-hungry": one cannot control the overall quality of the data, so it is especially beneficial to put more weight on targets with less uncertainty. This is also the intuitive reason why our algorithm achieves a tighter error bound.
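For completeness, the comparison $\Sigma_h \preceq [(H-h+1)^2+1]\Lambda_h$ invoked above follows from a one-line calculation based on (2.5)-(2.7); the display below is our reconstruction of that step.

```latex
% V^\pi_{h+1}(s) \in [0, H-h], so by (2.7):
%   \sigma_h(s,a)^2 = \max\{1, [\mathbb{V}_h V^\pi_{h+1}](s,a)\} + 1 \le (H-h+1)^2 + 1.
\begin{align*}
\Sigma_h
  &= \mathbb{E}_{\bar\pi,h}\big[\sigma_h(s,a)^2 \cdot \sigma_h(s,a)^{-2}\,\phi(s,a)\phi(s,a)^\top\big]
   \preceq \big[(H-h+1)^2 + 1\big]\,\Lambda_h, \\
\text{hence}\quad
\Lambda_h^{-1} &\preceq \big[(H-h+1)^2 + 1\big]\,\Sigma_h^{-1}
\quad\text{and}\quad
\|\nu^\pi_h\|_{\Lambda_h^{-1}}
  \le \sqrt{(H-h+1)^2 + 1}\;\|\nu^\pi_h\|_{\Sigma_h^{-1}}
  \le \sqrt{2}\,(H-h+1)\,\|\nu^\pi_h\|_{\Sigma_h^{-1}}.
\end{align*}
```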
4.2 Overview of the Proof Technique

Here we provide an overview of the proof of Theorem 4.1. Due to the parallel estimation of the value functions and their variances, the analysis of VA-OPE is considerably more challenging than that of FQI-OPE, and we need to develop a novel proof technique. First, we have the following error decomposition.

Lemma 4.4. For any $h\in[H]$, let $\hat V_h^\pi$ be the output of Algorithm 1. Then it holds that

$$V_h^\pi(s) - \hat V_h^\pi(s) = \int_{\mathcal A}\big[\mathbb P_h(V_{h+1}^\pi - \hat V_{h+1}^\pi)\big](s,a)\,d\pi_h(a\mid s) + \lambda\,\phi_h^\pi(s)^\top\hat\Lambda_h^{-1}w_h^\pi$$
$$\qquad -\ \lambda\,\phi_h^\pi(s)^\top\hat\Lambda_h^{-1}\int_{\mathcal S}\big(V_{h+1}^\pi(s') - \hat V_{h+1}^\pi(s')\big)\,d\mu_h(s') + \phi_h^\pi(s)^\top\hat\Lambda_h^{-1}\sum_{k=1}^K\phi_{k,h}\hat\sigma_{k,h}^{-2}\Delta_{k,h}, \qquad (4.2)$$

where $\Delta_{k,h} = [\mathbb P_h\hat V_{h+1}^\pi](s_{k,h},a_{k,h}) - \hat V_{h+1}^\pi(s'_{k,h}) - \epsilon_{k,h}$. In particular, recalling that $\hat v_1^\pi = \mathbb E[\hat V_1^\pi(s_1)\mid s_1\sim\xi_1]$, the OPE error can be decomposed as

$$v_1^\pi - \hat v_1^\pi = -\lambda\sum_{h=1}^H(\nu_h^\pi)^\top\hat\Lambda_h^{-1}\int_{\mathcal S}\big(V_{h+1}^\pi(s) - \hat V_{h+1}^\pi(s)\big)\,d\mu_h(s) + \sum_{h=1}^H(\nu_h^\pi)^\top\hat\Lambda_h^{-1}\sum_{k=1}^K\phi_{k,h}\hat\sigma_{k,h}^{-2}\Delta_{k,h} + \lambda\sum_{h=1}^H(\nu_h^\pi)^\top\hat\Lambda_h^{-1}w_h^\pi. \qquad (4.3)$$

The OPE error bound (Theorem 4.1) is proved by bounding the three terms in (4.3) separately. This decomposition differs from [10] in that $\hat\Sigma_h$ is replaced by $\hat\Lambda_h$, which prevents us from adopting a matrix embedding-type proof as in the prior work. The key is to show the convergence of $\hat\Lambda_h$ to its population counterpart. However, by the definition of $\hat\Lambda_h$, establishing such a result first requires the convergence of $\hat V_{h+1}^\pi$ to $V_{h+1}^\pi$ in a uniform manner, i.e., a high-probability bound on $\sup_{s\in\mathcal S}|\hat V_h^\pi(s) - V_h^\pi(s)|$. To show this, we leverage the decomposition in (4.2) and a backward induction technique, and prove a uniform convergence result stating that with high probability, for all $h\in[H]$, Algorithm 1 guarantees

$$\sup_{s\in\mathcal S}\big|\hat V_h^\pi(s) - V_h^\pi(s)\big| \le \tilde O\Big(\frac{1}{\sqrt K}\Big).$$

This result is formalized as Theorem C.2 and proved in Appendix C. To the best of our knowledge, Theorem C.2 is the first result establishing uniform convergence of the estimation error of the value functions in offline RL with linear function approximation. We believe this result is of independent interest and may be broadly useful in OPE.

5 Numerical Experiments

In this section, we provide numerical experiments to evaluate our algorithm VA-OPE and compare it with FQI-OPE. We construct a linear MDP instance as follows. The MDP has $|\mathcal S| = 2$ states and $|\mathcal A| = 100$ actions, with feature dimension $d = 10$. The behavior policy chooses action $a = 0$ with probability $p$, and with probability $1-p$ chooses an action uniformly at random from $\{1,\dots,99\}$. The target policy $\pi$ always chooses $a = 0$ regardless of the state, making states $0$ and $1$ absorbing. The parameter $p$ can be used to control the distribution shift between the behavior and target policies: here $p\to 0$ leads to a small distribution shift, and $p\to 1$ leads to a large distribution shift. The initial distribution $\xi_1$ is uniform over $\mathcal S$. For more details about the construction of the linear MDP and the parameter configuration, please refer to Appendix A.

We compare the performance of the two algorithms on the synthetic MDP described above under different choices of the horizon length $H$. We plot the log-scaled OPE error versus $K$ in Figure 1. VA-OPE is at least as good as FQI-OPE in all cases. For small $H$ (Figure 1a), their performance is very comparable, as expected. As $H$ increases, Figures 1a, 1b and 1c show that VA-OPE starts to dominate FQI-OPE, and the advantage becomes more significant for larger $H$, as suggested by Theorem 4.1. Due to the space limit, a comprehensive comparison under different parameter settings is deferred to Appendix A.

[Figure 1: three panels comparing VA-OPE and FQI-OPE for different horizon lengths, including (b) $H = 10$ and (c) $H = 30$; in each panel the x-axis is $K$ (0 to 200) and the y-axis is the log-scaled OPE error.] Figure 1 caption: Comparison of VA-OPE and FQI-OPE under different settings of the horizon length $H$. VA-OPE's advantage becomes more significant as $H$ increases, matching the theoretical prediction. The results are averaged over 50 trials and the error bars denote an empirical [10%, 90%] confidence interval. The y-axis is the log-scaled OPE error and the x-axis is $K$. For more details please see Appendix A.
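The behavior- and target-policy sampling described at the beginning of this section can be sketched as follows; the $d = 10$ feature construction lives in their Appendix A and is not reproduced here, and the function names are ours.

```python
import numpy as np

# Synthetic MDP from Section 5: |S| = 2, |A| = 100; the target policy always
# plays a = 0, while the behavior policy plays a = 0 with probability p and
# otherwise an action uniform over {1, ..., 99}.
num_actions = 100

def behavior_action(rng, p):
    if rng.random() < p:
        return 0
    return int(rng.integers(1, num_actions))   # uniform over {1, ..., 99}

def target_action():
    return 0                                    # the target policy is deterministic

rng = np.random.default_rng(2)
p = 0.5                                         # controls the distribution shift
sample = [behavior_action(rng, p) for _ in range(5)]
print("behavior actions:", sample, "| target action:", target_action())
```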
6 Related Work

Off-policy evaluation. There is a large body of literature on OPE for tabular MDPs. Since the seminal work of Precup [34], various importance sampling-based estimators have been studied in the literature [26, 27, 39]. By using marginalized importance sampling methods [28, 45, 21, 47], one is able to further break the "curse of horizon". Moreover, various doubly robust estimators [11, 16, 12, 38, 49] have been developed to achieve variance reduction. Most recently, Yin et al. [48] showed that uniform convergence over all possible policies is also achievable. However, all the aforementioned works are limited to tabular MDPs.

There is also a notable line of work on estimating the stationary distribution ratio between the target policy and the behavior policy using a primal-dual formulation [30, 51, 8]; however, these works do not provide a theoretical guarantee on the OPE error. More recently, Chen et al. [7] studied OPE in the infinite-horizon setting with linear function approximation. Many other topics are related to OPE, for example, policy gradient [20, 31, 2], conservative policy iteration [19], off-policy temporal-difference learning [35], off-policy Q-learning [23], safe policy iteration [33], and pessimism in RL [22, 18], to mention a few. We refer the reader to the excellent survey by Levine et al. [25] for a more detailed introduction.

Online RL with linear function approximation. RL with function approximation has been actively studied as an extension of the tabular setting. Yang and Wang [46] studied discounted linear MDPs with a generative model, and Jin et al. [17] proposed the efficient LSVI-UCB algorithm for linear MDPs without a generative model. Du et al. [9] showed that MDPs with misspecified linear function approximation can be exponentially hard to learn. Linear MDPs under various settings have also been studied by [50, 32, 14, 44]. A parallel line of work studies linear mixture MDPs [15, 3, 5, 53, 29] (a.k.a. linear kernel MDPs [54]), where the transition kernel is a linear function of a ternary feature mapping $\psi:\mathcal S\times\mathcal A\times\mathcal S\to\mathbb R^d$. In particular, Zhou et al. [53] achieved a nearly minimax regret bound by carefully utilizing the variance information of the value functions, and Zhang et al. [52] constructed a variance-aware confidence set for time-homogeneous linear mixture MDPs. However, both works focus on online RL rather than offline RL. Exploiting the variance information for offline tasks such as OPE requires new algorithm designs. Moreover, the analysis in the offline setting deviates substantially from that for online RL, where one can easily apply the law of total variance to obtain tighter bounds.

7 Conclusion and Future Work

In this paper, we incorporate the variance information of the value functions into OPE and propose VA-OPE, an algorithm that provably achieves a tighter error bound. Our $\tilde O\big(\sum_h(\nu_h^\top\Lambda_h^{-1}\nu_h)^{1/2}/\sqrt K\big)$ error bound has a sharper dependence on the distribution shift between the behavior policy and the target policy. Our work suggests several promising future directions. Theoretically, it remains open to provide an instance-dependent lower bound for the OPE error. Also, beyond linear function approximation, it is interesting to establish similar results under more general function approximation schemes. Empirically, can we exploit the algorithmic insight of our algorithm to develop practically more data-efficient OPE algorithms for complex real-world RL tasks? We plan to explore these directions in the future.

Acknowledgments and Disclosure of Funding

We thank Mengdi Wang and Yaqi Duan for helpful discussions during the preparation of the paper. We also thank the anonymous reviewers for their helpful comments. DZ and QG are partially supported by the National Science Foundation CAREER Award 1906169, IIS-1904183, and an AWS Machine Learning Research Award. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312-2320, 2011. [2] A.
Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. ar Xiv preprint ar Xiv:1908.00261, 2019. [3] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463 474. PMLR, 2020. [4] F. Bertoluzzo and M. Corazza. Testing different reinforcement learning configurations for financial trading: Introduction and applications. Procedia Economics and Finance, 3:68 77, 2012. [5] Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283 1294. PMLR, 2020. [6] D. Charles, M. Chickering, and P. Simard. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14, 2013. [7] L. Chen, B. Scherrer, and P. L. Bartlett. Infinite-horizon offline reinforcement learning with linear function approximation: Curse of dimensionality and algorithm. ar Xiv preprint ar Xiv:2103.09847, 2021. [8] B. Dai, O. Nachum, Y. Chow, L. Li, C. Szepesvári, and D. Schuurmans. Coindice: Off-policy confidence interval estimation. In Advances in Neural Information Processing Systems, 2020. [9] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020. [10] Y. Duan, Z. Jia, and M. Wang. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701 2709. PMLR, 2020. [11] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In International Conference on Machine Learning. PMLR, 2011. [12] M. Farajtabar, Y. Chow, and M. Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447 1456. PMLR, 2018. [13] D. A. Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100 118, 1975. [14] J. He, D. Zhou, and Q. Gu. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning. PMLR, 2021. [15] Z. Jia, L. Yang, C. Szepesvari, and M. Wang. Model-based reinforcement learning with value-targeted regression. In Learning for Dynamics and Control, pages 666 686. PMLR, 2020. [16] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652 661. PMLR, 2016. [17] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137 2143. PMLR, 2020. [18] Y. Jin, Z. Yang, and Z. Wang. Is pessimism provably efficient for offline rl? ar Xiv preprint ar Xiv:2012.15085, 2020. [19] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In In Proc. 19th International Conference on Machine Learning. Citeseer, 2002. [20] S. M. Kakade. A natural policy gradient. In Advances in neural information processing systems, volume 14, 2001. [21] N. Kallus and M. Uehara. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. ar Xiv preprint ar Xiv:1909.05850, 2019. [22] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims. 
Morel: Model-based offline reinforcement learning. ar Xiv preprint ar Xiv:2005.05951, 2020. [23] A. Kumar, J. Fu, G. Tucker, and S. Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, volume 32, 2019. [24] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement learning, pages 45 73. Springer, 2012. [25] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. [26] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-banditbased news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297 306, 2011. [27] L. Li, R. Munos, and C. Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608 616. PMLR, 2015. [28] Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, volume 31, 2018. [29] Y. Min, J. He, T. Wang, and Q. Gu. Learning stochastic shortest path with linear function approximation. ar Xiv preprint ar Xiv:2110.12727, 2021. [30] O. Nachum, Y. Chow, B. Dai, and L. Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, volume 32, pages 2318 2328. Curran Associates, Inc., 2019. [31] O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans. Algaedice: Policy gradient from arbitrary experience. ar Xiv preprint ar Xiv:1912.02074, 2019. [32] G. Neu and C. Pike-Burke. A unifying view of optimism in episodic reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020. [33] M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307 315. PMLR, 2013. [34] D. Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000. [35] D. Precup, R. S. Sutton, and S. Dasgupta. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pages 417 424, 2001. [36] D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6284 6291. IEEE, 2018. [37] L. Tang, R. Rosales, A. Singh, and D. Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1587 1594, 2013. [38] Z. Tang, Y. Feng, L. Li, D. Zhou, and Q. Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In International Conference on Learning Representations, 2020. [39] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139 2148. PMLR, 2016. [40] P. S. Thomas, G. Theocharous, M. Ghavamzadeh, I. Durugkar, and E. Brunskill. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In AAAI, pages 4740 4745, 2017. [41] J. A. Tropp. 
User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389 434, 2012. [42] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010. [43] R. Wang, Y. Wu, R. Salakhutdinov, and S. M. Kakade. Instabilities of offline rl with pre-trained neural representation. In International Conference on Machine Learning. PMLR, 2021. [44] T. Wang, D. Zhou, and Q. Gu. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. In Advances in Neural Information Processing Systems, 2021. [45] T. Xie, Y. Ma, and Y.-X. Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, volume 32, 2019. [46] L. Yang and M. Wang. Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995 7004, 2019. [47] M. Yin and Y.-X. Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948 3958. PMLR, 2020. [48] M. Yin, Y. Bai, and Y.-X. Wang. Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1567 1575. PMLR, 2021. [49] M. Yin, Y. Bai, and Y.-X. Wang. Near-optimal offline reinforcement learning via double variance reduction. In Advances in Neural Information Processing Systems, 2021. [50] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pages 10978 10989. PMLR, 2020. [51] R. Zhang, B. Dai, L. Li, and D. Schuurmans. Gendice: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020. [52] Z. Zhang, J. Yang, X. Ji, and S. S. Du. Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture mdp. ar Xiv preprint ar Xiv:2101.12745, 2021. [53] D. Zhou, Q. Gu, and C. Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory. PMLR, 2021. [54] D. Zhou, J. He, and Q. Gu. Provably efficient reinforcement learning for discounted mdps with feature mapping. In International Conference on Machine Learning. PMLR, 2021. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] (c) Did you discuss any potential negative societal impacts of your work? [N/A] Explain: our work is seeking to develop a mathematical understanding of the off-policy evaluation in reinforcement learning. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? 
[Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]