# Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Yaqi Duan (1), Zeyu Jia (2), Mengdi Wang (3, 4)

*Equal contribution. (1) Department of Operations Research and Financial Engineering, Princeton University, NJ, USA. (2) School of Mathematics, Peking University, Beijing, China. (3) Department of Operations Research and Financial Engineering, Princeton University, NJ, USA. (4) DeepMind, London, UK. Correspondence to: Mengdi Wang.

*Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020.*

## Abstract

This paper studies the statistical theory of off-policy policy evaluation with function approximation in the batch-data reinforcement learning problem. We consider a regression-based fitted Q-iteration method and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence, taken over the function class, between the long-term distribution of the target policy and the distribution of the past data. This restricted $\chi^2$-divergence characterizes the statistical limit of off-policy evaluation and is both instance-dependent and function-class-dependent. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.

## 1. Introduction

Batch-data reinforcement learning (RL) is common in decision-making applications where rich experiences are available but new experiments are costly. A first-order question is how much one can learn from existing experiences to predict and improve the performance of new policies. This is known as the off-policy policy evaluation (OPE) problem, where one needs to estimate the cumulative reward (a.k.a. the value) to be earned by a new policy based on logged history.

In this paper, we study off-policy evaluation with linear function approximation. We assume that the Q-functions of interest belong to a known function class $Q$ with $d$ basis functions. We adopt a direct regression-based approach and investigate the basic fitted Q-iteration (FQI) (Bertsekas et al., 1995; Sutton & Barto, 2018), which works by iteratively estimating Q-functions via supervised learning on the batch data. This approach turns out to be equivalent to a model-based plug-in estimator in which one estimates the conditional mean embedding of the unknown transition model and uses it to compute a plug-in value estimate. It is also related to variants of importance sampling methods (see the discussions in Sections 1.1 and 3.3).

We provide a finite-sample error upper bound for this policy evaluator, as well as a nearly matching minimax-optimal lower bound. Putting them together, we see that the regression-based policy evaluator is nearly statistically optimal. For RL with horizon $H$, the minimax-optimal OPE error takes the form (up to constant and logarithmic factors)

$$H^2\sqrt{\frac{1+\chi^2_{Q}(\mu^{*},\bar\mu)}{N}},$$

where $\mu^{*}$ is a long-term state-action occupancy measure of the target policy, $\bar\mu$ is the data distribution, and $\chi^2_{Q}$ is a variant of the $\chi^2$-divergence restricted to the family $Q$:

$$\chi^2_{Q}(p_1,p_2) := \sup_{f\in Q}\frac{\mathbb{E}_{p_1}[f(x)]^2}{\mathbb{E}_{p_2}[f(x)^2]} - 1.$$
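For a linear class $Q=\{\phi(\cdot)^\top w : w\in\mathbb R^d\}$, this restricted divergence has a closed form that follows directly from the definition: writing $\nu_1:=\mathbb E_{p_1}[\phi(x)]$ and $\Sigma_2:=\mathbb E_{p_2}[\phi(x)\phi(x)^\top]$ (assumed invertible), one has $\chi^2_Q(p_1,p_2)=\nu_1^\top\Sigma_2^{-1}\nu_1-1$ by the Cauchy-Schwarz inequality. The minimal NumPy sketch below evaluates this closed form from finite samples of $p_1$ and $p_2$; the sample-based plug-in is our own illustration, not an estimator prescribed by the paper.

```python
import numpy as np

def restricted_chi2(phi_p1, phi_p2):
    """Sample-based evaluation of the restricted divergence
        chi2_Q(p1, p2) = sup_{f in Q} E_{p1}[f]^2 / E_{p2}[f^2] - 1
    for a linear class Q = {phi(.)^T w}.  For linear Q this equals
    nu1^T Sigma2^{-1} nu1 - 1 with nu1 = E_{p1}[phi], Sigma2 = E_{p2}[phi phi^T].

    phi_p1 : (n1, d) array, feature vectors of samples drawn from p1
    phi_p2 : (n2, d) array, feature vectors of samples drawn from p2
    """
    nu1 = phi_p1.mean(axis=0)                      # estimate of E_{p1}[phi]
    Sigma2 = phi_p2.T @ phi_p2 / phi_p2.shape[0]   # estimate of E_{p2}[phi phi^T]
    return float(nu1 @ np.linalg.solve(Sigma2, nu1) - 1.0)
```

If the feature map spans the constant function (as assumed later, $1\in Q$), the returned quantity is nonnegative up to sampling error.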
The term $\chi^2_{Q}(\mu^{*},\bar\mu)$ captures the distributional mismatch between the behavior policy and the target policy that is relevant to the function class $Q$. It determines the theoretical limit of OPE within this function class. In the tabular case, it relates to the worst-case density ratio, which often shows up in importance sampling methods. However, when we use function approximation, this $\chi^2_{Q}$-divergence can be significantly smaller than the worst-case density ratio. In particular, our analysis shows that $\chi^2_{Q}(\mu^{*},\bar\mu)$ is the condition number of a finite matrix, which can be reliably estimated. This result suggests that OPE can be more data-efficient with appropriate function approximation.

A summary of the technical results of this paper:

- A regression-based algorithm that unifies FQI and plug-in estimation. It requires neither knowledge of the behavior policy nor an estimate of it. It uses iterative regression but does not require Monte Carlo sampling. In the case of linear models, the estimator can be computed easily using simple matrix-vector operations.
- A finite-sample error upper bound for the regression-based policy evaluator. Although the regression may be biased for OPE, we show that the curse of horizon does not occur as long as $N=\Omega(dH^3)$. A key to the analysis is the use of contraction properties of a Markov process to show that the estimation error accumulates linearly in multi-step policy evaluation, instead of exponentially.
- A minimax error lower bound that sets the statistical limit for OPE with function approximation. The lower bound nearly matches our upper bound and therefore proves the efficiency of regression-based FQI.
- A data-dependent confidence bound that can be computed as a byproduct of the FQI algorithm.

### 1.1. Related Literature

Off-policy policy evaluation (OPE) is often the starting point of batch reinforcement learning. A direct approach is to estimate the transition probability distributions and then execute the target policy on the estimated model. This has been studied in the tabular case with bias and variance analysis (Mannor et al., 2004). In real-world applications, in order to tackle MDPs with infinite or continuous state spaces, one often needs various forms of function approximation, and many methods such as fitted Q-iteration and least-squares policy iteration were developed (Jong & Stone, 2007; Lagoudakis & Parr, 2003; Grunewalder et al., 2012; Fonteneau et al., 2013). Regression methods are often used to fit value functions and to satisfy the Bellman equation (Bertsekas et al., 1995; Sutton & Barto, 2018; Yang & Wang, 2019b).

A popular class of OPE methods uses importance sampling (IS) to reweight sample rewards and obtain unbiased value estimates of a new policy (Precup, 2000). The doubly robust technique blends IS with model-based estimators to reduce the high variance (Jiang & Li, 2016; Thomas & Brunskill, 2016). Liu et al. (2018) suggested estimating the stationary state occupancy measure instead of the cumulative importance ratio in order to break the curse of horizon. Many IS methods apply only to tabular MDPs and require knowledge of the behavior policy. Following these ideas, Nachum et al. (2019) proposed a minimax optimization problem that uses function approximation to learn the IS weights without requiring knowledge of the behavior policy. Dann et al. (2019) provided error bounds and certificates for the tabular case to achieve accountability. Liu et al. (2019) studied an off-policy gradient method for batch-data policy optimization.
On the theoretical side, the sharpest OPE error bounds, to our best knowledge, are given by Xie et al. (2019) and Yin & Wang (2020), which apply to time-inhomogeneous tabular MDPs. Jiang & Li (2016) provided a Cramer-Rao lower bound for discrete-tree MDPs. To the authors' best knowledge, most existing theoretical results on OPE apply only to tabular MDPs without function approximation. Le et al. (2019) studied batch RL and FQI for policy learning and evaluation, and provided generalization bounds that depend on the VC dimension of the function class. Their results require a concentration-coefficient assumption, namely that the elementwise ratio between the data-generating and target density functions is uniformly bounded across all states, actions and policies. In comparison, our results do not require such a concentration condition, and appear to be the first and sharpest error bounds for OPE with linear function approximation.

## 2. Problem and Model

In this paper, we study off-policy policy evaluation for a Markov decision process (MDP) when we only have a fixed dataset of empirical transitions. An instance of an MDP is a controlled random walk over a state space $S$: at each state $s$, if we pick an action $a\in A$, the system evolves to a random next state $s'$ according to the distribution $p(s'\mid s,a)$ and generates a reward $r'\in[0,1]$ with $\mathbb E[r'\mid s,a]=r(s,a)$. A policy $\pi$ specifies a distribution $\pi(\cdot\mid s)$ for choosing actions conditioned on the current state $s$.

Our objective is to evaluate the performance of a target policy $\pi$ at a fixed initial distribution $\xi_0$, where the transition model $p$ is unknown. The value to be estimated is the expected cumulative reward in an $H$-horizon episode, given by

$$v^{\pi} := \mathbb E^{\pi}\Big[\sum_{h=0}^{H} r(s_h,a_h)\,\Big|\, s_0\sim\xi_0\Big],$$

where $a_h\sim\pi(\cdot\mid s_h)$, $s_{h+1}\sim p(\cdot\mid s_h,a_h)$, and $\mathbb E^{\pi}$ denotes expectation over the sample path generated under policy $\pi$.

Let $\mathcal D=\{(s_n,a_n,r'_n,s'_n)\}_{n=1}^{N}$ be a set of sample transitions, where each $s'_n$ is sampled from the distribution $p(\cdot\mid s_n,a_n)$. The sample transitions may be collected from multiple trajectories and under a possibly unknown behavior policy, denoted by $\bar\pi$. Our goal is to estimate $v^{\pi}$ from $\mathcal D$.

Given a target policy $\pi$ and a reward function $r$, the state-action value functions, also known as Q-functions, are defined for $h=0,1,\ldots,H$ as

$$Q^{\pi}_h(s,a) := \mathbb E^{\pi}\Big[\sum_{h'=h}^{H} r(s_{h'},a_{h'})\,\Big|\, s_h=s,\ a_h=a\Big],$$

where $a_{h'}\sim\pi(\cdot\mid s_{h'})$ and $s_{h'+1}\sim p(\cdot\mid s_{h'},a_{h'})$.

Let $X:=S\times A$. Define the conditional transition operator $P^{\pi}:\mathbb R^{X}\to\mathbb R^{X}$ as

$$P^{\pi}f(s,a) := \mathbb E\big[f(s',a')\mid s,a\big]\quad\text{for any } f:X\to\mathbb R,$$

where $s'\sim p(\cdot\mid s,a)$ and $a'\sim\pi(\cdot\mid s')$. Throughout the paper, we suppose that $P^{\pi}$ operates within a function class $Q$, so that we can approximate the unknown Q-functions within this family. Assume without loss of generality that $1\in Q$.

**Assumption 1** (Function class). For any $f\in Q$, $P^{\pi}f\in Q$, and $r\in Q$. It follows that $Q^{\pi}_0,\ldots,Q^{\pi}_H\in Q$, where $Q\subseteq\mathbb R^{X}$.

In most parts of the paper, we assume that the transition data are collected from multiple independent episodes.

**Assumption 2** (Data generating process). The dataset $\mathcal D$ consists of samples from $K$ i.i.d. episodes $\tau_1,\tau_2,\ldots,\tau_K$. Each $\tau_k$ has $H$ consecutive sample transitions generated by some policy $\bar\pi$ on a single sample path, i.e.,
$$\tau_k = \big(s_{k,0}, a_{k,0}, r'_{k,0},\ s_{k,1}, a_{k,1}, r'_{k,1},\ \ldots,\ s_{k,H}, a_{k,H}, r'_{k,H}\big).$$

We will focus mainly on the case where $Q$ is a linear space spanned by $d$ feature functions $\phi_1,\ldots,\phi_d$. Also note that the behavior policy $\bar\pi$ is not known.
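To make the data format of Assumption 2 concrete, the following minimal Python sketch collects $K$ i.i.d. episodes under a behavior policy and flattens them into the transition set $\mathcal D$ used by the estimators in the next section. The `env` and `behavior_policy` objects are hypothetical stand-ins; the paper does not prescribe any particular interface.

```python
import numpy as np

def collect_batch(env, behavior_policy, K, H, seed=0):
    """Collect K i.i.d. episodes of H transitions each (Assumption 2) and flatten
    them into D = [(s_n, a_n, r_n, s_next_n)], n = 1..N with N = K * H.
    `env.reset()` and `env.step(s, a)` are hypothetical stand-ins for the unknown MDP."""
    rng = np.random.default_rng(seed)
    D = []
    for _ in range(K):
        s = env.reset()                      # initial state of episode k
        for _ in range(H):
            a = behavior_policy(s, rng)      # a_{k,h} ~ pi_bar(. | s_{k,h})
            r, s_next = env.step(s, a)       # r'_{k,h} and s_{k,h+1} ~ p(. | s, a)
            D.append((s, a, r, s_next))
            s = s_next
    return D
```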
**Notation.** Denote $X=S\times A$. Let $\mathbb R^{X}$ be the collection of all functions $f:X\to\mathbb R$. For any $f\in\mathbb R^{X}$, define $f^{\pi}:S\to\mathbb R$ by $f^{\pi}(s):=\int_A f(s,a)\,\pi(a\mid s)\,da$. If $A$ is a symmetric positive semidefinite matrix, let $\sigma_{\min}(A)$ denote its smallest eigenvalue, and let $A^{1/2}$ denote the symmetric positive semidefinite matrix such that $A^{1/2}A^{1/2}=A$. For nonnegative sequences $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$, we write $a_n\lesssim b_n$ if there exists $c>0$ such that $a_n\le c\,b_n$ for $n=1,2,\ldots$. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables and $\{a_n\}_{n=1}^{\infty}\subseteq\mathbb R$ be deterministic. We write $X_n=O_{\mathbb P}(a_n)$ if for any $\delta>0$ there exists $M>0$ such that $\mathbb P(|X_n|>a_n M)\le\delta$ for all $n$. If a distribution $p$ is absolutely continuous with respect to a distribution $q$, the Pearson $\chi^2$-divergence is defined by $\chi^2(p,q):=\mathbb E_q\big[\big(\tfrac{dp}{dq}\big)^2\big]-1$.

## 3. Regression-Based Off-Policy Evaluation

We consider a fitted Q-iteration method for new-policy evaluation using linear function approximation. We show that it is equivalent to a model-based method that estimates a conditional mean operator embedding the unknown $p$ into the feature space. Both admit a simple matrix-vector implementation when $Q$ is a linear model with finite dimension.

### 3.1. Fitted Q-Iteration (FQI)

The Q-functions satisfy the Bellman equation

$$Q^{\pi}_{h-1}(s,a) = r(s,a) + \mathbb E\big[V^{\pi}_h(s')\mid s,a\big],\qquad h=1,2,\ldots,H,$$

where $s'\sim p(\cdot\mid s,a)$ and $V^{\pi}_h:S\to\mathbb R$ is the value function defined as $V^{\pi}_h(s):=\int_A Q^{\pi}_h(s,a)\,\pi(a\mid s)\,da$.

For the given target policy $\pi$, we apply regression recursively by letting $\hat Q^{\pi}_{H+1}:=0$ and, for $h=H,H-1,\ldots,0$,

$$\hat Q^{\pi}_{h} := \arg\min_{f\in Q}\ \sum_{n=1}^{N}\Big(f(s_n,a_n) - r'_n - \int_A \hat Q^{\pi}_{h+1}(s'_n,a)\,\pi(a\mid s'_n)\,da\Big)^2 + \lambda\,\rho(f),\tag{4}$$

where $\lambda\ge 0$ and $\rho(\cdot)$ is a regularization function. The scheme above provides a recursive way to evaluate $\hat Q^{\pi}_H,\hat Q^{\pi}_{H-1},\ldots,\hat Q^{\pi}_0$ and $v^{\pi}$ by regression using empirical data. It is essentially a fitted Q-iteration. The full algorithm is summarized in Algorithm 1.

**Algorithm 1** Fitted Q-Iteration for Off-Policy Evaluation (FQI-OPE)
- **Input:** initial distribution $\xi_0$, target policy $\pi$, horizon $H$, function class $Q$, sample transitions $\mathcal D=\{(s_n,a_n,r'_n,s'_n)\}_{n=1}^{N}$.
- Let $\hat Q^{\pi}_{H+1}:=0$;
- **for** $h=H,H-1,\ldots,0$ **do** calculate $\hat Q^{\pi}_h$ by solving (4); **end for**
- **Output:** $\hat v^{\pi}:=\int_{S\times A}\hat Q^{\pi}_0(s,a)\,\xi_0(s)\,\pi(a\mid s)\,ds\,da$.

### 3.2. An Equivalent Model-Based Method Using the Conditional Mean Operator

The preceding FQI method can be equivalently viewed as a model-based plug-in estimator. Recall that the conditional transition operator $P^{\pi}:\mathbb R^{X}\to\mathbb R^{X}$ is $P^{\pi}f(s,a):=\mathbb E[f(s',a')\mid s,a]$ for any $f:X\to\mathbb R$. Under Assumption 1, it always holds that $P^{\pi}Q^{\pi}_h\in Q$. To this end, we are only interested in a projection of the ground-truth $P^{\pi}$ onto $Q$.

We estimate the conditional transition operator by $\hat P^{\pi}$: for any $f:X\to\mathbb R$, let

$$\hat P^{\pi}f := \arg\min_{g\in Q}\ \sum_{n=1}^{N}\Big(g(s_n,a_n) - \int_A f(s'_n,a)\,\pi(a\mid s'_n)\,da\Big)^2 + \lambda\,\rho(g).\tag{5}$$

We can see that, as $N\to\infty$, $\hat P^{\pi}$ converges to a projected version of $P^{\pi}$ onto $Q$.

Denote $\phi(\cdot):=[\phi_1(\cdot),\ldots,\phi_d(\cdot)]^{\top}:X\to\mathbb R^d$. In the case where $Q$ is the linear space $Q=\{\phi(\cdot)^{\top}w : w\in\mathbb R^d\}$ and $\rho(\cdot)$ is taken as

$$\rho(f) := \|w\|_2^2\quad\text{for } f(\cdot)=\phi(\cdot)^{\top}w,\tag{6}$$

the constructed $\hat P^{\pi}$ in (5) corresponds to an estimated $\hat p$ of the form

$$\hat p(\cdot\mid s,a) := \phi(s,a)^{\top}\hat\Sigma^{-1}\sum_{n=1}^{N}\phi(s_n,a_n)\,\delta_{s'_n}(\cdot),$$

where $\hat\Sigma:=\lambda I+\sum_{n=1}^{N}\phi(s_n,a_n)\phi(s_n,a_n)^{\top}$ is the (regularized) empirical covariance matrix and $\delta_{s'}(\cdot)$ denotes the Dirac measure. Note that this $\hat p$ is not necessarily a transition kernel.

We adopt a model-based approach and use $\hat P^{\pi}$ in the Bellman equation as a plug-in estimator. In particular, let

$$\hat r := \arg\min_{f\in Q}\ \sum_{n=1}^{N}\big(f(s_n,a_n)-r'_n\big)^2 + \lambda\,\rho(f),$$

$$\hat Q^{\pi}_{H+1} := 0,\qquad \hat Q^{\pi}_{h-1} := \hat r + \hat P^{\pi}\hat Q^{\pi}_{h},\qquad h=H+1,H,\ldots,1.$$

Then we can estimate the policy value by $\hat v^{\pi}_{\mathrm{plug\text{-}in}}:=\int_{S\times A}\hat Q^{\pi}_0(s,a)\,\xi_0(s)\,\pi(a\mid s)\,ds\,da$. It is easy to verify that this plug-in estimator is equivalent to the earlier FQI estimator; see the proof in Appendix A.
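In the linear case (6), the plug-in recursion reduces to the matrix-vector operations made explicit in equations (8)-(9) and Algorithm 2 below. The following minimal NumPy sketch is one way to implement it; the precomputed feature matrices, the target-policy feature average $\phi^{\pi}(s')$, and the vector $\nu_0$ are assumed to be supplied by the user, which is an implementation choice on our part rather than a prescription of the paper.

```python
import numpy as np

def cme_ope(phi_sa, phi_pi_next, rewards, nu0, H, lam=1e-3):
    """Conditional-mean-embedding OPE (cf. Algorithm 2, CME-PE).

    phi_sa      : (N, d) array, rows phi(s_n, a_n)
    phi_pi_next : (N, d) array, rows phi^pi(s'_n) = E_{a ~ pi(.|s'_n)}[phi(s'_n, a)]
    rewards     : (N,)  array, observed rewards r'_n
    nu0         : (d,)  array, E_{s ~ xi0, a ~ pi(.|s)}[phi(s, a)]
    Returns the value estimate hat{v}^pi.
    """
    N, d = phi_sa.shape
    Sigma_hat = lam * np.eye(d) + phi_sa.T @ phi_sa               # hat{Sigma}
    R_hat = np.linalg.solve(Sigma_hat, phi_sa.T @ rewards)        # hat{R}: reward embedding
    M_hat = np.linalg.solve(Sigma_hat, phi_sa.T @ phi_pi_next)    # hat{M}^pi: transition embedding
    w = np.zeros(d)                                               # hat{w}_{H+1} = 0
    for _ in range(H + 1):                                        # h = H, H-1, ..., 0
        w = R_hat + M_hat @ w                                     # hat{w}_h = hat{R} + hat{M}^pi hat{w}_{h+1}
    return float(nu0 @ w)                                         # hat{v}^pi = nu0^T hat{w}_0
```

The only data-dependent work is one ridge-regression solve for $\hat R$ and one for $\hat M^{\pi}$, followed by $H+1$ matrix-vector multiplications.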
**Theorem 1** (Equivalence between FQI and a model-based method). If $Q$ is a linear space and $\rho$ is given by (6), then Algorithm 1 and the preceding model-based approach generate identical policy value estimators, i.e., $\hat v^{\pi}=\hat v^{\pi}_{\mathrm{plug\text{-}in}}$.

When $Q$ is a $d$-dimensional linear space with feature mapping $\phi$, under Assumption 1, there exists a matrix $M^{\pi}\in\mathbb R^{d\times d}$ such that

$$\phi(s,a)^{\top}M^{\pi} = \mathbb E\big[\phi^{\pi}(s')^{\top}\,\big|\,s,a\big]\qquad\forall(s,a)\in X,$$

where $\phi^{\pi}(s):=\int_A\phi(s,a)\,\pi(a\mid s)\,da$. We refer to $M^{\pi}$ as the matrix mean embedding of the conditional transition operator $P^{\pi}$.

We can implement Algorithm 1 in simple vector form. We embed the one-step reward function and the conditional transition operator into a vector and a matrix, respectively:

$$\hat r(\cdot) = \phi(\cdot)^{\top}\hat R\ \text{ with }\ \hat R := \hat\Sigma^{-1}\sum_{n=1}^{N}\phi(s_n,a_n)\,r'_n,\qquad \hat M^{\pi} := \hat\Sigma^{-1}\sum_{n=1}^{N}\phi(s_n,a_n)\,\phi^{\pi}(s'_n)^{\top}.\tag{8}$$

The corresponding conditional mean operator $\hat P^{\pi}$ is

$$\hat P^{\pi}f(s,a) = \phi(s,a)^{\top}\hat M^{\pi}w\quad\text{for } f(\cdot)=\phi(\cdot)^{\top}w.\tag{9}$$

We represent $\hat Q^{\pi}_h$ in the form $\hat Q^{\pi}_h(s,a)=\phi(s,a)^{\top}\hat w^{\pi}_h$. In this way, we can easily compute $\hat Q^{\pi}_h$ using compact recursive vector-matrix operations, as given in Algorithm 2.

**Algorithm 2** Conditional Mean Embedding for Policy Evaluation (CME-PE)
- **Input:** initial distribution $\xi_0$, target policy $\pi$, horizon $H$, a basis $\{\phi_1,\ldots,\phi_d\}$ of $Q$, sample transitions $\mathcal D=\{(s_n,a_n,r'_n,s'_n)\}_{n=1}^{N}$.
- Estimate $\hat R$ and $\hat M^{\pi}$ according to (8);
- Let $\hat w^{\pi}_{H+1}:=0$ and $\nu_0:=\int_{S\times A}\phi(s,a)\,\xi_0(s)\,\pi(a\mid s)\,ds\,da$;
- **for** $h=H,H-1,\ldots,0$ **do** calculate $\hat w^{\pi}_h:=\hat R+\hat M^{\pi}\hat w^{\pi}_{h+1}$; **end for**
- **Output:** $\hat v^{\pi}:=\nu_0^{\top}\hat w^{\pi}_0$.

### 3.3. Relations to Other Methods

Our method turns out to be closely related to variants of importance sampling methods for OPE. For example:

**Marginalized importance sampling.** Our FQI estimator can be written as
$$\hat v^{\pi} = \frac{1}{N}\sum_{n=1}^{N}\hat w^{\pi/\mathcal D}(s_n,a_n)\,r'_n,\qquad \hat w^{\pi/\mathcal D}(s,a) := N\sum_{h=0}^{H}\nu_0^{\top}\big(\hat M^{\pi}\big)^{h}\hat\Sigma^{-1}\phi(s,a).$$
By viewing $\hat w^{\pi/\mathcal D}(s,a)$ as weights, our estimator can be obtained equivalently by importance sampling. In the special tabular case, our $\hat v^{\pi}$ is equivalent to the marginalized importance sampling (MIS) estimator of Yin & Wang (2020).

**DualDICE.** Nachum et al. (2019) proposed a minimax formulation that uses function approximation to learn the stationary state occupancy measure and the density ratio (the weight for importance sampling). We observe that, if those function classes are taken to be $Q$, a version of DualDICE produces the same estimator as the FQI estimator. The two methods can be viewed as dual to each other.

These intriguing relations permit a unified view of OPE methods. See Appendix A for more discussion.

## 4. Finite-Sample Error Bound

Recall that $\mathcal D$ is a collection of $K$ independent $H$-horizon trajectories. Let $\bar\Sigma$ be the uncentered covariance matrix of the data distribution,

$$\bar\Sigma := \mathbb E_{\bar\pi}\Big[\frac1H\sum_{h=1}^{H}\phi(s_{1,h},a_{1,h})\,\phi(s_{1,h},a_{1,h})^{\top}\Big],$$

which is determined by the unknown behavior policy $\bar\pi$. Given a target policy $\pi$, let $\omega_\pi$ be an invariant distribution of the Markov chain with transition kernel $p^{\pi}(s'\mid s)=\int_A p(s'\mid s,a)\,\pi(a\mid s)\,da$. Define

$$\Lambda_{\pi} := \mathbb E\big[\phi^{\pi}(s)\,\phi^{\pi}(s)^{\top}\,\big|\, s\sim\omega_\pi\big].$$

We assume $\phi(s,a)^{\top}\bar\Sigma^{-1}\phi(s,a)\le C_1 d$ without loss of generality. Theorem 2 provides an instance-dependent policy evaluation upper bound. Its complete proof is given in Appendix B.

**Theorem 2** (Upper bound). Let $\delta\in(0,1)$. Under Assumptions 1 and 2, if $N\ge 20\,\kappa_1\,(2+\sqrt2)^2\ln(12dH/\delta)\,C_1dH^3$ and $\lambda\le\ln(12dH/\delta)\,C_1dH\,\sigma_{\min}(\bar\Sigma)$, then with probability at least $1-\delta$,

$$\big|v^{\pi}-\hat v^{\pi}\big| \;\le\; \sum_{h=0}^{H}(H-h+1)\,\sup_{f\in Q}\frac{\mathbb E_{\pi}\big[f(s_h,a_h)\,\big|\,s_0\sim\xi_0\big]}{\sqrt{\mathbb E_{\bar\pi}\big[\frac1H\sum_{h'=1}^{H}f^2(s_{1,h'},a_{1,h'})\big]}}\,\sqrt{\frac{\ln(12dH/\delta)}{2N}} \;+\; \frac{C\,\ln(12dH/\delta)\,dH^{3.5}}{N},\tag{10}$$

where $C:=15\,\kappa_1 C_1(3+\sqrt2)$ and $\kappa_1:=\mathrm{cond}\big(\mathbb E_{\bar\pi}\big[\frac1H\sum_{h=1}^{H}\phi^{\pi}(s_{1,h})\,\phi^{\pi}(s_{1,h})^{\top}\big]\big)$.

Additionally, if either one of the following holds:
- $\phi(s,a)^{\top}\bar\Sigma^{-1}\phi(s',a')\ge 0$ for any $(s,a),(s',a')\in X$;
- the MDP is time-inhomogeneous,

then the upper bound can be improved to

$$\big|v^{\pi}-\hat v^{\pi}\big| \;\le\; \sup_{f\in Q}\frac{\mathbb E_{\pi}\big[\sum_{h=0}^{H}(H-h+1)\,f(s_h,a_h)\,\big|\,s_0\sim\xi_0\big]}{\sqrt{\mathbb E_{\bar\pi}\big[\frac1H\sum_{h=1}^{H}f^2(s_{1,h},a_{1,h})\big]}}\,\sqrt{\frac{\ln(12dH/\delta)}{2N}} \;+\; \frac{C\,\ln(12dH/\delta)\,dH^{3.5}}{N}.\tag{11}$$

**Distributional mismatch as a Q-$\chi^2$-divergence.**
Let $\bar\mu$ be the expected occupancy measure of the observations $\{(s_n,a_n)\}_{n=1}^{N}$. Let $\mu^{*}$ be the weighted occupancy distribution of $(s_h,a_h)$ under policy $\pi$ and $\xi_0$, given by

$$\mu^{*}(s,a) := \frac{\mathbb E_{\pi}\big[\sum_{h=0}^{H}(H-h+1)\,\mathbf 1(s_h=s,\,a_h=a)\big]}{\sum_{h=0}^{H}(H-h+1)}.$$

The upper bound (11) can then be simplified, up to logarithmic factors, to

$$\big|\hat v^{\pi}-v^{\pi}\big| \;\le\; C\,H^2\sqrt{\frac{1+\chi^2_{Q}(\mu^{*},\bar\mu)}{N}} + O(N^{-1}).$$

Moreover, each mismatch term in (10) has a vector form,

$$\sup_{f\in Q}\frac{\mathbb E_{\pi}\big[f(s_h,a_h)\,\big|\,s_0\sim\xi_0\big]}{\sqrt{\mathbb E_{\bar\pi}\big[\frac1H\sum_{h'=1}^{H}f^2(s_{1,h'},a_{1,h'})\big]}} = \sqrt{\nu_h^{\top}\bar\Sigma^{-1}\nu_h},\qquad \nu_h := \mathbb E_{\pi}\big[\phi(s_h,a_h)\,\big|\,s_0\sim\xi_0\big],$$

so it can be estimated tractably.

**The case of tabular MDPs.** In the tabular case, the condition $\phi(s,a)^{\top}\bar\Sigma^{-1}\phi(s',a')\ge 0$ holds for all $(s,a),(s',a')\in X$. It can be easily seen that the error bound (11) has a strong connection with the $\chi^2$-divergence between the state-action distributions under the behavior and target policies.

**Corollary 1** (Upper bound in the tabular case). In the tabular case with $Q=\mathbb R^{X}$, if $N$ is sufficiently large and $\lambda=0$, then with probability at least $1-\delta$,

$$\big|v^{\pi}-\hat v^{\pi}\big| \;\le\; 3H^2\sqrt{\frac{1+\chi^2(\mu^{*},\bar\mu)}{2N}} + O(N^{-1}),\tag{12}$$

where $\chi^2(\cdot,\cdot)$ denotes the Pearson $\chi^2$-divergence. If the MDP is also time-inhomogeneous, then

$$\big|v^{\pi}-\hat v^{\pi}\big| \;\le\; \sqrt{\frac{H}{N}\sum_{h=0}^{H}\sum_{(s,a)\in X}\frac{\mu^{*}_h(s,a)^2}{\mu_h(s,a)}\,\mathrm{Var}\big[r'_h+V^{\pi}_{h+1}(s'_h)\,\big|\,s_h=s,\,a_h=a\big]} + o(N^{-1/2}),\tag{13}$$

where $\mu_h$ is the marginal distribution of $(s_{1,h},a_{1,h})$ and $\mu^{*}_h$ is the marginal distribution of $(s_h,a_h)$ under policy $\pi$ and $\xi_0$.

The tabular-case upper bound (13) has the same form as Theorem 3.1 in Yin & Wang (2020). The proof of Corollary 1 is deferred to Appendix B.7.

### 4.1. Proof Outline

We decompose the error into three terms: $v^{\pi}-\hat v^{\pi}=E_1+E_2+E_3$, where $E_1$ is a linear function of $\hat P^{\pi}-P^{\pi}$, $E_2$ is a higher-order function of $\hat P^{\pi}-P^{\pi}$, and $E_3=O(\lambda)$. In the following, we outline the analysis of $E_1$ and $E_2$.

**First-order term $E_1$.** This linear error term takes the form $E_1=\frac1N\sum_{n=1}^{N}e_n$, where each $e_n$ is a weighted combination of terms of the form $\nu_h^{\top}\bar\Sigma^{-1}\phi(s_n,a_n)$ multiplied by the Bellman residual of sample $n$ at stage $h$. Define a filtration $\{\mathcal F_n\}_{n=1,\ldots,N}$, where $\mathcal F_n$ is generated by $(s_1,a_1,s'_1),\ldots,(s_{n-1},a_{n-1},s'_{n-1})$ and $(s_n,a_n)$. Then $e_1,e_2,\ldots,e_N$ is a martingale difference sequence with respect to $\{\mathcal F_n\}_{n=1,\ldots,N}$. In what follows, we analyze $\mathrm{Var}[e_n\mid\mathcal F_n]$ and apply Freedman's inequality (Freedman, 1975) to derive a finite-sample upper bound for $E_1$.

Consider the conditional variance $\mathrm{Var}[e_n\mid\mathcal F_n]$. Using the Cauchy-Schwarz inequality and the fact that the conditional variance of each stage-$h$ residual is at most $4(H-h+1)^2$, we have

$$\mathrm{Var}[e_n\mid\mathcal F_n] \;\lesssim\; \sum_{h=0}^{H}(H-h+1)^2\big(\nu_h^{\top}\bar\Sigma^{-1}\phi(s_n,a_n)\big)^2.\tag{14}$$

We learn from a matrix Bernstein inequality that $\frac1N\sum_{n=1}^{N}\phi(s_n,a_n)\phi(s_n,a_n)^{\top}$ concentrates around $\bar\Sigma$ with high probability. It follows that

$$\frac1N\sum_{n=1}^{N}\big(\nu_h^{\top}\bar\Sigma^{-1}\phi(s_n,a_n)\big)^2 = \nu_h^{\top}\bar\Sigma^{-1}\Big(\frac1N\sum_{n=1}^{N}\phi(s_n,a_n)\phi(s_n,a_n)^{\top}\Big)\bar\Sigma^{-1}\nu_h \;\lesssim\; \nu_h^{\top}\bar\Sigma^{-1}\nu_h.\tag{15}$$

Plugging (15) into (14) and taking the summation, we obtain a bound on $\sum_{n=1}^{N}\mathrm{Var}[e_n\mid\mathcal F_n]$ in terms of the mismatch quantities $\nu_h^{\top}\bar\Sigma^{-1}\nu_h$. It then follows from Freedman's inequality that, with high probability, $|E_1|$ is bounded by the first term of (10).

**Higher-order term $E_2$ (bias-inducing term).** The higher-order term $E_2$ involves powers of $\hat P^{\pi}-P^{\pi}$. We use the contraction property of the Markov process with respect to its invariant measure; in particular,

$$\big\|\Lambda_{\pi}^{1/2}\,M^{\pi}\,\Lambda_{\pi}^{-1/2}\big\| \le 1,\qquad \Lambda_{\pi}=\mathbb E\big[\phi^{\pi}(s)\phi^{\pi}(s)^{\top}\mid s\sim\omega_\pi\big],$$

where $\omega_\pi$ is an invariant distribution under policy $\pi$. Assume $\Lambda_{\pi}$ has full rank for simplicity. Using the contraction property, we will see that the value error does not grow exponentially in $H$ for large $N$. We can show a bound of the form

$$|E_2| \;\lesssim\; \sum_{h=0}^{H}\mathrm{Err}(\hat Q^{\pi}_h)\,\big(1+\mathrm{Err}(\hat M^{\pi})\big)^{h}\,\big(1+\mathrm{Err}(N\hat\Sigma^{-1})\big),\tag{17}$$

where the explicit definitions of the errors $\mathrm{Err}(\hat M^{\pi})$, $\mathrm{Err}(N\hat\Sigma^{-1})$ and $\mathrm{Err}(\hat Q^{\pi}_h)$ can be found in Lemma B.7, Appendix B.4. By concentration arguments, we can show that $\mathrm{Err}(\hat M^{\pi}),\ \mathrm{Err}(N\hat\Sigma^{-1})\lesssim\sqrt{dH/N}$ and $\mathrm{Err}(\hat Q^{\pi}_h)\lesssim(H-h+1)\sqrt{d/N}$ with high probability. According to (17), as long as $\mathrm{Err}(\hat M^{\pi})\lesssim H^{-1}$, the policy evaluation error does not grow exponentially in $H$. As a result, if $N\gtrsim dH^3$, we have $|E_2|\lesssim dH^{3.5}/N$.
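Since each mismatch term in (10) admits the vector form $\sqrt{\nu_h^{\top}\bar\Sigma^{-1}\nu_h}$ noted above, the dominant part of the upper bound can be estimated from the batch data itself. The NumPy sketch below is one possible plug-in: it approximates $\nu_h^{\top}\approx\nu_0^{\top}(\hat M^{\pi})^{h}$ (the same quantities reused by the confidence bound of Section 6) and $\bar\Sigma\approx\hat\Sigma/N$; these plug-in choices are our own illustration, not estimators specified by the paper.

```python
import numpy as np

def mismatch_terms(M_hat, Sigma_hat, nu0, H, N):
    """Estimate the per-step mismatch terms sqrt(nu_h^T Sigma_bar^{-1} nu_h), h = 0..H,
    using nu_h^T ~ nu0^T M_hat^h and Sigma_bar ~ Sigma_hat / N as plug-in estimates."""
    Sigma_bar_hat = Sigma_hat / N
    terms = []
    nu_h = nu0.copy()
    for _ in range(H + 1):
        x = np.linalg.solve(Sigma_bar_hat, nu_h)          # Sigma_bar^{-1} nu_h
        terms.append(float(np.sqrt(max(nu_h @ x, 0.0))))  # sqrt(nu_h^T Sigma_bar^{-1} nu_h)
        nu_h = M_hat.T @ nu_h                             # nu_{h+1}^T = nu_h^T M_hat
    return terms

# The leading term of (10) is then approximately
#   sum_h (H - h + 1) * terms[h] * sqrt(log(12 * d * H / delta) / (2 * N)).
```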
## 5. Minimax Lower Bound

In this section, we establish a minimax lower bound that characterizes the hardness of off-policy evaluation using linear function approximation. Theorem 3 nearly matches the finite-sample upper bound given by Theorem 2. The complete proof of Theorem 3 is given in Appendix C.

**Theorem 3** (Minimax lower bound). Suppose that an MDP instance $M=(p,r)$ satisfies the following:
- There exist a set of high-value states $S^{+}\subseteq S$ and a set of low-value states $S^{-}\subseteq S$ under the target policy $\pi$ such that $V^{\pi}_h(s)\ge\frac34(H-h+1)$ if $s\in S^{+}$ and $V^{\pi}_h(s)\le\frac14(H-h+1)$ if $s\in S^{-}$;
- $p^{+}:=\int_{S^{+}}\min_{s\in S}p^{\pi}(s'\mid s)\,ds'\ge c$ and $p^{-}:=\int_{S^{-}}\min_{s\in S}p^{\pi}(s'\mid s)\,ds'\ge c$ for some constant $c>0$.

For any behavior policy $\bar\pi$ (assumed deterministic only for the sake of notational simplicity), when $N$ is sufficiently large, any estimator $\hat v^{\pi}$ satisfies

$$\sup_{M'\in\mathcal N(M)}\ \mathbb E_{M'}\big|\hat v^{\pi}-v^{\pi}(M')\big| \;\gtrsim\; \frac{1}{\sqrt N}\,\sup_{f\in Q}\frac{\mathbb E_{\pi}\big[\sum_{h=0}^{H}(H-h)\,f(s_h,a_h)\big]}{\sqrt{\mathbb E_{\bar\pi}\big[\frac1H\sum_{h=1}^{H}f^2(s_{1,h},a_{1,h})\big]}},\tag{18}$$

where $\mathcal N(M)$ is a small neighborhood of $M$, consisting of instances $M'=(p',r)$ whose transition kernels are uniformly close to $p$ over $(s,a)\in X$.

**Example.** Consider a two-state construction with $d=2$ features, where $\phi^{\bar\pi}(\bar s)=[1-z,\,z]^{\top}$ under the behavior policy, while under the target policy $\phi^{\pi}(\bar s)=[1,\,0]^{\top}$ and $\phi^{\pi}(\underline s)=[0,\,1]^{\top}$; here $z\in[0,1]$ is a parameter. (The transition models under the behavior and target policies are given by two-state diagrams in the original figure, which is omitted here.) Suppose that the behavior policy initiates at either one of the two states with probability $1/2$, and the target policy always initiates at state $\bar s$. Then $\bar\Sigma$ is an explicit $2\times 2$ matrix in $z$ (with off-diagonal entries $z(1-z)$ and one diagonal entry $z^2$), and $\nu_1=\cdots=\nu_{H-1}=[1,\,0]^{\top}$. For $z$ in a constant range around $1/2$ (e.g., $z\in[\frac14,\frac34]$), the distributional mismatch term controlling the lower bound becomes $1+\frac{1}{(2z-1)^2}$, where $z$ quantifies how well one can tell apart the two states under the target policy using data generated by $\bar\pi$. When $z\to 1/2$, one cannot distinguish the two states from data generated by $\bar\pi$, and the lower bound becomes unbounded.

### 5.1. Proof Outline

We start with an arbitrary MDP $M$ with transition kernel $p$ that satisfies the assumptions. We construct a perturbed instance $\tilde p=p+\Delta p$ so that the two transition models are similar but have a gap between their policy values, denoted by $v^{\pi}$ and $\tilde v^{\pi}$.

**Constructing the perturbation.** We construct $\Delta p$ such that $\Delta p(s'\mid s,a)\ge 0$ if $s'\in S^{+}$, $\Delta p(s'\mid s,a)\le 0$ if $s'\in S^{-}$ and $\Delta p(s'\mid s,a)=0$ elsewhere. In particular, we take a perturbation of the form

$$\Delta p(s'\mid s,a) = \phi(s,a)^{\top}\Delta q(s'),\qquad \Delta q(s') := x\cdot\min_{s\in S}p^{\pi}(s'\mid s)\ \text{ on } S^{+},\tag{19}$$

with a matching negative part on $S^{-}$ chosen so that $\int_S\Delta p(s'\mid s,a)\,ds'=0$ for every $(s,a)$; here $x$ is a vector to be picked later.

**Reduction to a likelihood test.** We define the likelihood functions $L(\mathcal D)$ and $\tilde L(\mathcal D)$ of the transition kernels $p$ and $\tilde p$. The likelihood ratio $\frac{\tilde L(\mathcal D)}{L(\mathcal D)}=\prod_{n=1}^{N}\frac{\tilde p(s'_n\mid s_n,a_n)}{p(s'_n\mid s_n,a_n)}$ reflects how likely the observation $\mathcal D$ comes from model $\tilde p$ rather than $p$. When $p\approx\tilde p$, with high probability the dataset $\mathcal D$ generated by model $p$ has a relatively large likelihood ratio, so it is hard to distinguish $p$ and $\tilde p$ based on the observation $\mathcal D$. We prove by a martingale concentration argument that, when $N$ is sufficiently large, the log-likelihood ratio concentrates around a quantity of order $N\,x^{\top}\bar\Sigma\,x$ with high probability, so it remains bounded whenever

$$x^{\top}\bar\Sigma\,x \;\lesssim\; N^{-1/2}.\tag{20}$$

If we further have $|v^{\pi}-\tilde v^{\pi}|\ge\Delta+\tilde\Delta$ for some gaps $\Delta,\tilde\Delta\ge 0$, condition (20) implies that, for an arbitrary estimator $\hat v^{\pi}$, at least one of the following must hold: either $\mathbb P\big(|v^{\pi}-\hat v^{\pi}(\mathcal D)|\ge\Delta\big)\ge\frac16$ or $\tilde{\mathbb P}\big(|\tilde v^{\pi}-\hat v^{\pi}(\mathcal D)|\ge\tilde\Delta\big)\ge\frac16$. In other words, no algorithm can achieve small OPE error for both $p$ and $\tilde p$.

**Constructing similar instances with a gap in values.** We have

$$\tilde v^{\pi}-v^{\pi} = \mathbb E_{\xi_0,\pi}\Big[\sum_{h=0}^{H-1}\big(\tilde P^{\pi}\big)^{h}\big(\tilde P^{\pi}-P^{\pi}\big)Q^{\pi}_{h+1}\Big],$$

and, by a first-order Taylor expansion under our construction, if the perturbation $\Delta p$ is sufficiently small,

$$\tilde v^{\pi}-v^{\pi} \approx \mathbb E_{\xi_0,\pi}\Big[\sum_{h=0}^{H-1}\big(P^{\pi}\big)^{h}\big(\tilde P^{\pi}-P^{\pi}\big)Q^{\pi}_{h+1}\Big].$$

For a given $N$, we maximize this value over $x$ under the constraint $x^{\top}\bar\Sigma\,x\lesssim N^{-1/2}$. We then obtain $x=c_0\,x_0/\sqrt{x_0^{\top}\bar\Sigma\,x_0}$, where $c_0>0$ is a constant and $x_0=\bar\Sigma^{-1}\sum_{h=1}^{H-1}\nu_h$.
In this way, we have shown that $\tilde v^{\pi}-v^{\pi}\gtrsim\frac{1}{\sqrt N}$ times the distributional-mismatch term in (18), using the above construction of $x$. Similarly, one can show that, for $N$ sufficiently large, $|v^{\pi}-\tilde v^{\pi}|\ge\Delta+\tilde\Delta$ with $\Delta\asymp\tilde\Delta$, where $\tilde\nu_h$ and $\tilde\Delta$ are the counterparts of $\nu_h$ and $\Delta$ under the perturbed model $\tilde p$. Finally, we apply the result of the likelihood test and complete the proof.

## 6. A Computable Confidence Bound

Next we study how to quantify the uncertainty of the policy evaluator given by Algorithm 1. In this section, we assume that the dataset is an arbitrary set of experiences, not necessarily independent episodes. We only assume that the transition samples $\mathcal D=\{(s_n,a_n,s'_n)\}_{n=1,\ldots,N}$ are collected in time order.

**Assumption 3.** The dataset $\mathcal D$ consists of sample transitions $\{(s_t,a_t,s'_t)\}_{t=1}^{N}$ generated in time order, i.e., adapted to a filtration $\{\mathcal F_t\}_{t=1}^{N}$ such that $\{(s_\tau,a_\tau,s'_\tau)\}_{\tau=1}^{t}$ are $\mathcal F_t$-measurable.

Assumption 3 is much weaker than Assumption 2. It allows the samples to be generated from a single long path, possibly under a nonstationary adaptive policy, as is typical in online reinforcement learning. Under this mild assumption, we provide a confidence bound for the policy evaluation error $|v^{\pi}-\hat v^{\pi}|$ that can be computed analytically from the data $\mathcal D$.

**Theorem 4** (Computable confidence bound). Let Assumptions 1 and 3 hold. Let $\omega:=\max\big\{\|w\|_2 : 0\le\phi(s,a)^{\top}w\le 1,\ \forall(s,a)\in X\big\}$ (such an $\omega$ always exists and can be computed a priori), and assume $\|\phi(s,a)\|_2\le 1$ for any $(s,a)\in X$. Then for a target policy $\pi$, with probability at least $1-\delta$, the error $|v^{\pi}-\hat v^{\pi}|$ admits an explicit, data-computable upper bound expressed through the quantities $\hat\nu^{\pi}_h$, $\hat\Sigma$, $\omega$, $H$, $\lambda$ and $\delta$, where $\hat\nu^{\pi\top}_h$ is given by $\nu_0^{\top}\big(\hat M^{\pi}\big)^{h}$.

The proof begins with a decomposition of the error,

$$v^{\pi}-\hat v^{\pi} = \sum_{h=0}^{H}\hat\nu^{\pi\top}_h\big(R+M^{\pi}w^{\pi}_{h+1}-\hat R-\hat M^{\pi}w^{\pi}_{h+1}\big),$$

from which we derive $|v^{\pi}-\hat v^{\pi}|\le\sum_{h=0}^{H}\big\|\hat\nu^{\pi}_h\big\|_2\,\epsilon_h$ with $\epsilon_h:=\big\|R+M^{\pi}w^{\pi}_{h+1}-\hat R-\hat M^{\pi}w^{\pi}_{h+1}\big\|_2$. We analyze the concentration of $\epsilon_h$ using a martingale argument similar to the bandit literature (e.g., the proof of Theorem 5 in Dani et al. (2008)). The complete proof is given in Appendix D.

The confidence bound given in Theorem 4 can be easily calculated as a byproduct of FQI-OPE (Algorithm 1), since $\hat\nu^{\pi}_h$ and $\hat\Sigma$ are already computed in the iterations. In practice, one can tune the value of $\lambda$ to obtain the smallest possible confidence bound.
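To illustrate how such a bound can be evaluated as a byproduct of Algorithm 2, the sketch below reuses $\hat\nu^{\pi\top}_h=\nu_0^{\top}(\hat M^{\pi})^{h}$ and multiplies each $\|\hat\nu^{\pi}_h\|_2$ term by a per-stage confidence radius. The radius function `eps_h` is a user-supplied stand-in: Theorem 4 gives an explicit choice in terms of $\omega$, $H$, $\lambda$ and $\delta$, and that exact expression is not reproduced here.

```python
import numpy as np

def confidence_bound(M_hat, nu0, H, eps_h):
    """Evaluate a confidence bound of the form sum_h ||nu_hat_h||_2 * eps_h(h),
    reusing quantities already computed by Algorithm 2 (CME-PE).

    eps_h : callable h -> per-stage confidence radius for
            ||R + M^pi w_{h+1} - R_hat - M_hat^pi w_{h+1}||_2; supplied by the user
            (Theorem 4 gives an explicit choice in terms of omega, H, lambda, delta).
    """
    bound = 0.0
    nu_h = nu0.copy()
    for h in range(H + 1):
        bound += np.linalg.norm(nu_h) * eps_h(h)   # ||nu_hat_h||_2 * eps_h
        nu_h = M_hat.T @ nu_h                      # nu_hat_{h+1}^T = nu_hat_h^T M_hat^pi
    return bound
```

In practice one would pair this with the tuning suggestion above: recompute the bound for a few values of $\lambda$ and keep the smallest.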
## 7. Conclusion

This paper studies the statistical limits of off-policy evaluation with linear function approximation. We establish a minimax error lower bound that depends on a function-class-restricted $\chi^2$-divergence between the data distribution and the target policy's occupancy measure. We prove that a regression-based FQI method, which can be viewed as a plug-in estimator based on a conditional mean embedding of the transition operator, nearly achieves the minimax lower bound. We also provide a computable confidence bound as a byproduct of the algorithm.

## Acknowledgments

Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI-1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI.

## References

Bertsekas, D. P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. 2008.

Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, 2019.

Fonteneau, R., Murphy, S. A., Wehenkel, L., and Ernst, D. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.

Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.

Grunewalder, S., Lever, G., Baldassarre, L., Pontil, M., and Gretton, A. Modelling transition dynamics in MDPs with RKHS embeddings. 2012.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

Jong, N. K. and Stone, P. Model-based function approximation in reinforcement learning. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 1–8, 2007.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. arXiv preprint arXiv:1903.08738, 2019.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366, 2018.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. Off-policy policy gradient with state distribution correction. In Conference on Uncertainty in Artificial Intelligence, 2019.

Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance in value function estimation. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 72, 2004.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32, 2019.

Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

Tropp, J. et al. Freedman's inequality for matrix martingales. Electronic Communications in Probability, 16:262–270, 2011.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9665–9675, 2019.

Yang, L. F. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019a.

Yang, L. F. and Wang, M. Sample-optimal parametric Q-learning using linearly additive features. In 36th International Conference on Machine Learning, ICML 2019, pp. 12095–12114. International Machine Learning Society (IMLS), 2019b.

Yin, M. and Wang, Y.-X. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. arXiv preprint arXiv:2001.10742, 2020.