Accountable Off-Policy Evaluation With Kernel Bellman Statistics

Yihao Feng¹, Tongzheng Ren*¹, Ziyang Tang*¹, Qiang Liu¹

*Equal contribution. ¹Department of Computer Science, The University of Texas at Austin. Correspondence to: Yihao Feng, Tongzheng Ren.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We consider off-policy evaluation (OPE), which evaluates the performance of a new policy from observed data collected from previous experiments, without requiring the execution of the new policy. This finds important applications in areas with high execution cost or safety concerns, such as medical diagnosis, recommendation systems, and robotics. In practice, due to the limited information in off-policy data, it is highly desirable to construct rigorous confidence intervals, not just point estimates, for the policy performance. In this work, we propose a new variational framework which reduces the problem of calculating tight confidence bounds in OPE to an optimization problem over a feasible set that contains the true state-action value function with high probability. The feasible set is constructed by leveraging statistical properties of a recently proposed kernel Bellman loss (Feng et al., 2019). We design an efficient computational approach for calculating our bounds, and extend it to perform post-hoc diagnosis and correction for existing estimators. Empirical results show that our method yields tight confidence intervals in different settings.

1. Introduction

Reinforcement learning (Sutton & Barto, 1998) has achieved remarkable successes recently, but is still highly data demanding. An essential approach to improving data efficiency is to use off-policy methods, which leverage historical data collected from different behavior policies (a.k.a. off-policy data) to evaluate and optimize the target policies of interest. Using off-policy data is of critical importance in many real-world applications (e.g., Murphy et al., 2001; Li et al., 2011; Bottou et al., 2013; Thomas et al., 2017), especially in cases when it is infeasible or even risky to collect online experimental data or build high-quality simulators, such as robotics, advertisement, online education, and medical treatment (e.g., Murphy et al., 2001; Li et al., 2011; Hirano et al., 2003; Liao et al., 2019).

This work concerns the problem of off-policy evaluation (OPE), which estimates the expected reward of a given target policy from off-policy data. Because the information in off-policy data is often limited, typical point estimates may suffer from large error in difficult cases. Therefore, for making high-stakes decisions in areas such as medical treatment, it is crucially important to provide reliable confidence bounds that quantify the uncertainty in the estimation.

Importance sampling (IS) provides a basic principle for developing OPE methods (e.g., Dudík et al., 2011; Jiang & Li, 2016; Thomas & Brunskill, 2016; Liu et al., 2018; Nachum et al., 2019; Zhang et al., 2020), including doubly robust estimators (Dudík et al., 2011; Jiang & Li, 2016; Farajtabar et al., 2018; Kallus & Uehara, 2020; Tang et al., 2020), and also underlies existing confidence bounds for OPE (e.g., Thomas et al., 2015a; Hanna et al., 2017).
However, typical IS-based methods suffer from a number of critical disadvantages:

Requiring known behavior policies: Typical IS estimators require the observed data to be drawn from a known behavior policy. However, this is often not the case in practice: the policies from which the observed data is collected may be unknown, or may not even exist (e.g., when the data is drawn from a mix of different policies).

High variance and loose bounds: The variance of IS-based estimators can be excessively high when the target and behavior policies are very different from each other. As a consequence, concentration confidence bounds based on IS (e.g., Thomas et al., 2015a;b) can be very loose.

Curse of long horizon: The variance of IS estimators is known to deteriorate quickly as the horizon of the Markov decision process becomes large. Although there are recent improved methods for long-horizon problems (Liu et al., 2018; Tang et al., 2020; Mousavi et al., 2020), they introduce additional bias due to the estimation of density ratios.

Our Method: We propose a variational framework for constructing tight, provable confidence bounds which avoids both the requirement of known behavior policies and the curse of long horizon. Instead of using importance sampling, we construct bounds based on the state-action value function (a.k.a. Q-function) specified by the Bellman equation. The idea is to construct a high-confidence feasible set of the true Q-function of the target policy, by leveraging the tail bound of the kernel Bellman loss (Feng et al., 2019), and to derive the upper (resp. lower) bound of the expected reward of interest by solving a maximization (resp. minimization) problem over the high-confidence feasible set. Further, by constraining the optimization inside a reproducing kernel Hilbert space (RKHS) ball, we obtain a computationally efficient framework for providing confidence bounds for off-policy value evaluation.

Post-hoc Analysis: In practice, we may already have an estimated Q-function from previous data or experiments, and we may want to construct post-hoc confidence bounds around the existing estimate, or perform post-hoc correction if the estimate is detected to deviate significantly from the true value. In this work, we enable such post-hoc diagnosis by extending our main framework. We empirically evaluate our method on several control benchmarks; the results show that our method can provide tight confidence bounds, as well as reasonable post-hoc corrections for existing estimators.

2. Related Work

Off-policy value (point) estimation has been studied in various reinforcement learning settings, including contextual bandits (e.g., Dudík et al., 2011; Wang et al., 2017), finite-horizon Markov decision processes (MDPs) (e.g., Thomas & Brunskill, 2016; Jiang & Li, 2016; Farajtabar et al., 2018; Xie et al., 2019), and infinite-horizon MDPs (e.g., Liu et al., 2018; Nachum et al., 2019; Tang et al., 2020). Unlike our work, most of these works are not behavior-agnostic and require knowledge of the behavior policy, except a line of very recent works (e.g., Nachum et al., 2019; Zhang et al., 2020; Mousavi et al., 2020). A smaller set of works focuses on confidence bounds for various RL problems (e.g., White & White, 2010; Thomas et al., 2015a;b; Hanna et al., 2017; Dann et al., 2019; Jiang & Huang, 2020; Duan et al., 2020).
A recent line of works (e.g., Dann et al., 2019; Jin et al., 2018; 2020) has utilized upper confidence bounds for exploration in tabular (or linear) MDPs. Thomas et al. (2015a) and Hanna et al. (2017) give high-confidence bounds for off-policy estimators based on importance sampling, which suffer from the high variance of IS-based estimators. Recently, Jiang & Huang (2020) proposed upper/lower bounds for OPE based on a minimax formulation of double robustness (e.g., Tang et al., 2020; Uehara et al., 2020), but their method does not take the randomness of the empirical data (data uncertainty) into account to obtain high-confidence bounds. Concurrently, Duan et al. (2020) provides data-dependent confidence bounds based on regression-based fitted Q-iteration (FQI), by assuming the function class is linear and making additional assumptions on the MDP. Compared with these existing works, our method provides tighter bounds that avoid the high variance issue of IS, works in the behavior-agnostic setting, and can be directly applied to MDPs with both continuous and discrete states and actions once a kernel function can be properly defined.

Finally, kernel methods have been widely used in reinforcement learning, but typically to represent value functions or transition models (e.g., Xu et al., 2005; 2007; Jung & Polani, 2007; Taylor & Parr, 2009; Grunewalder et al., 2012; Farahmand et al., 2016). Our work admits a novel application of kernels in RL, leveraging them as a technical tool for both defining the kernel Bellman loss and constructing high-confidence bounds.

3. Off-Policy Evaluation and Kernel Bellman Statistics

We start by introducing background on reinforcement learning and off-policy evaluation (OPE), then review the kernel Bellman loss (Feng et al., 2019) and introduce concentration bounds for it.

3.1. Preliminary

Denote by $M = \langle S, A, P, r, \gamma, \mu_0\rangle$ a Markov decision process (MDP), where $S$ is the state space; $A$ is the action space; $P(s'|s,a)$ is the transition probability; $r(s,a)$ is the average immediate reward; $\mu_0$ is the initial state distribution on $S$; and $\gamma \in (0,1)$ is the discount factor. We focus on the discounted case ($\gamma < 1$) in the main paper, and discuss the extension to the average reward case ($\gamma = 1$) in Appendix C.

A policy $\pi$ specifies a distribution of actions given states, and $\pi(a|s)$ denotes the probability of selecting $a$ given $s$. Given a target policy $\pi$, we are interested in estimating the expected total discounted reward associated with $\pi$:
$$\eta^\pi := \lim_{T\to\infty} \mathbb{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\Big], \qquad (1)$$
where we start from an initial state $s_0 \sim \mu_0$ and execute the policy $\pi$ in the MDP to generate trajectories; here $T$ is the trajectory length. We mainly consider the case with infinite horizon ($T = \infty$) in this work.

Assume we are given an observed set of transition pairs $D = \{(s_i, a_i, r_i, s_i')\}_{i=1}^n$, where $r_i$ and $s_i'$ are the observed local reward and the next state following $(s_i, a_i)$. The data is assumed to be off-policy in that it is collected from some arbitrary, unknown policy (a.k.a. the behavior policy), which is different from the target policy $\pi$ of interest. The goal of off-policy evaluation (OPE) is to provide an interval estimate (upper and lower bounds) for the expected reward $\eta^\pi$ from the off-policy data $D$, without knowing the underlying behavior policy (i.e., behavior-agnostic).

Value Estimation via Q-Function: We approach this problem by approximating the expected reward $\eta^\pi$ through the Q-function.
The Q-function (a.k.a. state-action value function) $Q^\pi\colon S\times A\to\mathbb{R}$ of a policy $\pi$ is defined by
$$Q^\pi(s,a) := \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t) \;\Big|\; s_0 = s,\; a_0 = a\Big],$$
which equals the expected long-term discounted reward when we execute policy $\pi$ starting from $(s,a)$. Obviously, $Q^\pi$ and $\eta^\pi$ are related via
$$\eta^\pi = \eta(Q^\pi) := \mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(\cdot|s_0)}\big[Q^\pi(s_0, a_0)\big], \qquad (2)$$
where $s_0$ is drawn from the fixed initial distribution $\mu_0$. Therefore, given an empirical approximation $\hat Q$ of $Q^\pi$, we can estimate $\eta^\pi$ by $\eta(\hat Q)$, which can be further approximated by drawing an i.i.d. sample $\{(s_{i,0}, a_{i,0})\}_{i=1}^N$ from $\mu_0\times\pi$:
$$\eta(\hat Q) \approx \hat\eta(\hat Q) := \frac{1}{N}\sum_{i=1}^{N} \hat Q(s_{i,0}, a_{i,0}). \qquad (3)$$
Since $\mu_0$ and $\pi$ are assumed to be known, we can draw an i.i.d. sample with a very large size $N$ to achieve arbitrarily high accuracy in the Monte Carlo estimate.

Bellman Equation: It is well known that $Q = Q^\pi$ is the unique solution of the Bellman equation (Puterman, 1994): $Q = \mathcal B^\pi Q$, where $\mathcal B^\pi$ is the Bellman evaluation operator, defined by
$$\mathcal B^\pi Q(s,a) := \mathbb{E}_{s',a'}\big[r(s,a) + \gamma Q(s', a') \;\big|\; s, a\big],$$
where $(s', a')$ is drawn with $s'\sim P(\cdot\,|\,s,a)$ and $a'\sim\pi(\cdot\,|\,s')$. For simplicity, we define the Bellman residual operator as $\mathcal R^\pi Q = \mathcal B^\pi Q - Q$.

We can approximate the Bellman operator on a state-action pair $(s_i, a_i)$ by bootstrapping:
$$\hat{\mathcal B}^\pi Q(s_i, a_i) := r_i + \gamma\,\mathbb{E}_{a'\sim\pi(\cdot|s_i')}\big[Q(s_i', a')\big],$$
where $r_i, s_i'$ are the observed local reward and the next state of $(s_i, a_i)$, and $a'$ is drawn from policy $\pi$ by Monte Carlo (as in expected Sarsa (Sutton & Barto, 1998)). Because the policy is known, we can approximate the expectation term $\mathbb{E}_{a'\sim\pi(\cdot|s_i')}[Q(s_i', a')]$ up to arbitrarily high accuracy by drawing a large sample from $\pi(\cdot\,|\,s_i')$. We can also empirically estimate the Bellman residual by $\hat{\mathcal R}^\pi Q(s_i, a_i) := \hat{\mathcal B}^\pi Q(s_i, a_i) - Q(s_i, a_i)$, which provides an unbiased estimate of the Bellman residual, that is, $\mathbb{E}[\hat{\mathcal R}^\pi Q(s_i, a_i) \mid s_i, a_i] = \mathcal R^\pi Q(s_i, a_i)$. Note that $\hat{\mathcal B}^\pi Q(s_i, a_i)$ is a function of $(s_i, a_i, r_i, s_i')$, but the dependency on $r_i, s_i'$ is dropped in the notation for convenience.

The Double Sampling Problem: A naive approach for estimating $Q^\pi$ is to minimize the empirical Bellman residual. However, a key barrier is the so-called double sampling problem. To illustrate, consider estimating $Q^\pi$ by minimizing the squared norm of the empirical Bellman residual:
$$\hat L_2(Q) = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{\mathcal R}^\pi Q(s_i, a_i)\big)^2.$$
Unfortunately, minimizing $\hat L_2(Q)$ does not yield a consistent estimate of $Q^\pi$, because $(\hat{\mathcal R}^\pi Q(s_i, a_i))^2$ does not correctly estimate $(\mathcal R^\pi Q(s_i, a_i))^2$ due to the square function outside of the empirical Bellman residual (even though $\hat{\mathcal R}^\pi Q(s_i, a_i)$ is an unbiased estimate of $\mathcal R^\pi Q(s_i, a_i)$). One strategy for obtaining an unbiased estimate of the exact square $(\mathcal R^\pi Q(s_i, a_i))^2$ is to multiply two copies of $\hat{\mathcal R}^\pi Q(s_i, a_i)$ obtained from two independent copies of $(r_i, s_i')$ following the same $(s_i, a_i)$, hence requiring double sampling (Baird, 1995). However, double samples are rarely available in data collected from a natural environment, and double sampling is impractical in real-world problems. As a result, many works on Q-function estimation resort to temporal difference (TD) based algorithms, which, however, suffer from instability and divergence issues, especially when nonlinear function approximation is used.
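To make these quantities concrete, the following minimal Python sketch implements the Monte Carlo estimate $\hat\eta(Q)$ of Equation (3) and the single-transition empirical Bellman residual $\hat{\mathcal R}^\pi Q$. The callables `Q(s, a)` and `sample_actions(s, k)` are hypothetical placeholders (not from the paper) for a Q-function and a sampler from $\pi(\cdot|s)$.

```python
import numpy as np

def eta_hat(Q, sample_actions, init_states, n_mc=64):
    """Monte Carlo estimate of eta(Q) in Eq. (3): average Q(s0, a0) over
    initial states s0 ~ mu_0 and actions a0 ~ pi(.|s0)."""
    vals = []
    for s0 in init_states:
        actions = sample_actions(s0, n_mc)              # a0 ~ pi(.|s0)
        vals.append(np.mean([Q(s0, a) for a in actions]))
    return float(np.mean(vals))

def bellman_residual_hat(Q, transition, sample_actions, gamma, n_mc=64):
    """Empirical Bellman residual for one transition (s, a, r, s'):
    R_hat = r + gamma * E_{a'~pi(.|s')}[Q(s', a')] - Q(s, a),
    an unbiased estimate of the true Bellman residual at (s, a)."""
    s, a, r, s_next = transition
    next_actions = sample_actions(s_next, n_mc)         # a' ~ pi(.|s')
    backup = r + gamma * np.mean([Q(s_next, ap) for ap in next_actions])
    return backup - Q(s, a)
```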
3.2. Kernel Bellman Statistics

Recently, Feng et al. (2019) proposed a kernel Bellman loss to address the double sampling problem in value function estimation. This objective provides a consistent estimate of the deviation of a given $Q$ from the true $Q^\pi$, and hence allows us to estimate $Q^\pi$ without requiring double sampling, nor risking the instability or divergence of TD-based algorithms. We now give a brief introduction to the kernel Bellman loss, the related U/V-statistics, and the concentration inequalities that are useful in our framework.

For notation, we use $x = (s, a)$ to denote a state-action pair, and $x' = (s', a')$ a successive state-action pair following $x = (s, a)$ collected under some behavior policy. We use $\tau = (s, a, r, s')$ to denote a transition pair.

Following Feng et al. (2019), let $K\colon (S\times A)\times(S\times A)\to\mathbb{R}$ be an integrally strictly positive definite (ISPD) kernel. The expected kernel Bellman loss is defined by
$$L_K(Q) := \mathbb{E}_{x,\bar x\sim\mu}\big[\mathcal R^\pi Q(x)\, K(x, \bar x)\, \mathcal R^\pi Q(\bar x)\big], \qquad (4)$$
where $\bar x = (\bar s, \bar a)$ is an independent and identical copy of $x$, and $\mu$ is any distribution supported on $S\times A$. Note that $\mu$ can be either the on-policy visitation distribution induced by $\pi$, or some other valid distribution defined by the observed data (the off-policy case). As shown in Feng et al. (2019), the expected kernel Bellman loss fully characterizes the true $Q^\pi$, in that $L_K(Q)\ge 0$ for any $Q$, and $L_K(Q) = 0$ if and only if $Q = Q^\pi$.

Another key property of $L_K(Q)$ is that it can be easily estimated and optimized from observed transitions, without requiring double samples. Given a set of observed transition pairs $D = \{\tau_i\}_{i=1}^n$, we can use the so-called V-statistic to estimate $L_K(Q)$:
$$\hat L_K^V(Q) := \frac{1}{n^2}\sum_{i,j=1}^{n} \ell_{\pi,Q}(\tau_i, \tau_j), \qquad (5)$$
where $\ell_{\pi,Q}$ is defined by
$$\ell_{\pi,Q}(\tau_i, \tau_j) := \hat{\mathcal R}^\pi Q(x_i)\, K(x_i, x_j)\, \hat{\mathcal R}^\pi Q(x_j). \qquad (6)$$
Another way is to use the so-called U-statistic, which removes the diagonal terms in (5):
$$\hat L_K^U(Q) := \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\ell_{\pi,Q}(\tau_i, \tau_j). \qquad (7)$$
We call $\hat L_K^U(Q)$ and $\hat L_K^V(Q)$ the kernel Bellman U/V-statistics. Both can be shown to give consistent estimates of $L_K(Q)$. The U-statistic $\hat L_K^U(Q)$ is an unbiased estimator of the kernel loss $L_K(Q)$, but it can be negative and hence unstable when used as an objective function in practice. In comparison, the V-statistic $\hat L_K^V(Q)$ is always non-negative, that is, $\hat L_K^V(Q)\ge 0$ for any $Q$, and hence behaves more stably in practice. We mainly use the V-statistic in this work.

3.3. Concentration Bounds of Kernel Bellman Statistics

We now introduce concentration inequalities for the kernel Bellman statistics, which form an important tool in our framework. The inequalities here are classical results from Hoeffding (1963).

Proposition 3.1 (Concentration bound of U-/V-statistics). Consider a set of i.i.d. random transition pairs $\{\tau_i\}_{i=1}^n$ with $x_i\sim\mu$. Assume $\sup_{\tau,\bar\tau}|\ell_{\pi,Q}(\tau,\bar\tau)| \le \ell_{\max} < \infty$ and $n$ is an even number. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have the following concentration bound for the U-statistic,
$$\big|\hat L_K^U(Q) - L_K(Q)\big| \le 2\ell_{\max}\sqrt{\frac{\log(2/\delta)}{n}},$$
and for the V-statistic,
$$\big|\hat L_K^V(Q) - L_K(Q)\big| \le 2\ell_{\max}\Big(\sqrt{\frac{\log(2/\delta)}{n}} + \frac{1}{n}\Big).$$
We include the proof in Appendix A for completeness. In addition, in the case when $Q = Q^\pi$, it can be shown that the i.i.d. assumption can be removed; see Appendix A.3.

The key quantity in the concentration inequalities is the upper bound $\ell_{\max}$. The lemma below shows that it can be estimated easily in practice.

Lemma 3.1. Assume the reward function and the kernel function are bounded, i.e., $\sup_x |r(x)| \le r_{\max}$ and $\sup_{x,\bar x}|K(x,\bar x)| \le K_{\max}$. Then we have
$$\sup_x |Q^\pi(x)| \le Q_{\max} := \frac{r_{\max}}{1-\gamma}, \qquad \sup_{\tau,\bar\tau}\big|\ell_{\pi,Q^\pi}(\tau,\bar\tau)\big| \le \ell_{\max} := \frac{4 K_{\max} r_{\max}^2}{(1-\gamma)^2}.$$
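As a concrete illustration of the statistics in Equations (5)-(7), the following sketch computes the V- and U-statistics from precomputed empirical Bellman residuals $\hat{\mathcal R}^\pi Q(x_i)$, assuming a Gaussian RBF kernel (which is also the kernel used in our experiments in Section 5); the function names are our own.

```python
import numpy as np

def rbf_kernel(X, bandwidth):
    """Gaussian RBF kernel matrix K(x_i, x_j) = exp(-||x_i - x_j||^2 / h^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / bandwidth ** 2)

def kernel_bellman_v_stat(residuals, X, bandwidth):
    """V-statistic of Eq. (5): (1/n^2) * sum_{i,j} R_i K(x_i, x_j) R_j."""
    K = rbf_kernel(X, bandwidth)
    n = len(residuals)
    return float(residuals @ K @ residuals) / n ** 2

def kernel_bellman_u_stat(residuals, X, bandwidth):
    """U-statistic of Eq. (7): the V-statistic numerator with the diagonal
    terms removed and a 1/(n(n-1)) normalization."""
    K = rbf_kernel(X, bandwidth)
    n = len(residuals)
    total = residuals @ K @ residuals
    diag = np.sum(residuals ** 2 * np.diag(K))
    return float(total - diag) / (n * (n - 1))
```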
4. Accountable Off-Policy Evaluation

This section introduces our approach for accountable off-policy evaluation. We start with the main variational framework for providing confidence bounds in Section 4.1, and then discuss post-hoc diagnosis and correction for existing estimators in Sections 4.3 and 4.4.

4.1. Variational Confidence Bounds for Off-Policy Evaluation

We consider the problem of providing confidence bounds for the expected reward $\eta^\pi$. To simplify the presentation, we focus on the upper bounds here; one can derive the lower bounds with almost the same arguments.

Let $\mathcal F$ be a function class that contains $Q^\pi$, that is, $Q^\pi\in\mathcal F$. Then we can construct an upper bound of $\eta^\pi$ by solving the following variational (or functional) optimization over $\mathcal F$:
$$\eta^+ = \max_{Q\in\mathcal F}\big\{\eta(Q) \;\;\text{s.t.}\;\; L_K(Q)\le\lambda\big\}, \qquad \lambda > 0. \qquad (8)$$
Here $\eta^+$ is an upper bound of $\eta^\pi$ because $Q^\pi$ satisfies the condition that $Q^\pi\in\mathcal F$ and $L_K(Q^\pi) = 0$, and hence $\eta^+\ge\eta^\pi$ follows from the definition of the max operator.

In practice, we cannot calculate $\eta^+$ exactly, as it involves the exact kernel loss $L_K(Q)$ and the exact $\eta(Q)$. However, if we replace $L_K(Q)$ and $\eta(Q)$ with proper empirical estimates, we can derive a computationally tractable high-probability upper bound of $\eta^\pi$.

Proposition 4.1. Let $\hat\eta(Q)$ and $\hat L_K(Q)$ be estimators of $\eta(Q)$ and $L_K(Q)$ such that, for $\delta\in(0,1)$,
$$P\big(\eta(Q^\pi)\le\hat\eta(Q^\pi)+\lambda_\eta\big)\ge 1-\delta, \qquad P\big(\hat L_K(Q^\pi)\le\lambda_K\big)\ge 1-\delta, \qquad (9)$$
where $\lambda_\eta$ and $\lambda_K$ are two constants. Note that (9) only needs to hold for $Q = Q^\pi$. Assume $Q^\pi\in\mathcal F$. Define
$$\hat\eta^+ := \max_{Q\in\mathcal F}\big\{\hat\eta(Q)+\lambda_\eta \;\;\text{s.t.}\;\; \hat L_K(Q)\le\lambda_K\big\}. \qquad (10)$$
Then we have $\hat\eta^+\ge\eta^\pi$ with probability at least $1-2\delta$.

Proof. From the assumption, with probability at least $1-2\delta$, we have both $\eta(Q^\pi)\le\hat\eta(Q^\pi)+\lambda_\eta$ and $\hat L_K(Q^\pi)\le\lambda_K$. In this case, $Q^\pi$ belongs to the feasible set of the optimization in (10), i.e., $Q^\pi\in\mathcal F$ and $\hat L_K(Q^\pi)\le\lambda_K$. Therefore, we have $\eta^\pi = \eta(Q^\pi)\le\hat\eta(Q^\pi)+\lambda_\eta\le\hat\eta^+$ by the definition of the max operator.

We can use the kernel Bellman V-statistic and Proposition 3.1 to construct $\hat L_K(Q^\pi)$ and $\lambda_K$. For $\eta(Q^\pi)$, we use the Monte Carlo estimator $\hat\eta(Q^\pi)$ in (3) and set $\lambda_\eta = Q_{\max}\sqrt{2\log(2/\delta)/N}$ by Hoeffding's inequality.
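For concreteness, a minimal sketch of how the two thresholds could be set from the results above (Lemma 3.1, the V-statistic bound of Proposition 3.1 evaluated at $Q^\pi$ where $L_K(Q^\pi)=0$, and Hoeffding's inequality for the Monte Carlo estimate of $\eta$); the exact constants should follow the statements of those results, and the helper name is our own.

```python
import numpy as np

def concentration_thresholds(r_max, gamma, K_max, n, N, delta):
    """Thresholds (lambda_eta, lambda_K) for the feasible set in Eq. (10),
    assuming the bound forms stated in Lemma 3.1 and Proposition 3.1.
    n: number of observed transitions; N: Monte Carlo sample size for eta."""
    Q_max = r_max / (1.0 - gamma)                          # Lemma 3.1
    l_max = 4.0 * K_max * r_max ** 2 / (1.0 - gamma) ** 2  # Lemma 3.1
    # Hoeffding bound for the Monte Carlo estimate of eta(Q^pi).
    lam_eta = Q_max * np.sqrt(2.0 * np.log(2.0 / delta) / N)
    # V-statistic bound of Proposition 3.1 at Q^pi, where L_K(Q^pi) = 0.
    lam_K = 2.0 * l_max * (np.sqrt(np.log(2.0 / delta) / n) + 1.0 / n)
    return lam_eta, lam_K
```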
4.2. Optimization in Reproducing Kernel Hilbert Space

To provide tight high-confidence bounds, we need to choose $\mathcal F$ to be a function space that is both simple and rich enough to contain $Q^\pi$. Here we choose $\mathcal F$ to be a ball in the reproducing kernel Hilbert space (RKHS) $\mathcal H_{K_0}$ induced by a positive definite kernel $K_0(x,\bar x)$; note that $K_0$ should be distinguished from the kernel $K$ used in the kernel Bellman loss. Using an RKHS allows us to incorporate a rich, infinite-dimensional set of functions, and still obtain a computationally tractable solution for the optimization in (10).

Proposition 4.2. For the RKHS $\mathcal H_{K_0}$ with kernel $K_0$, define
$$\mathcal F = \big\{Q\in\mathcal H_{K_0}\colon \|Q\|^2_{\mathcal H_{K_0}}\le\rho\big\}, \qquad (11)$$
where $\rho$ is a positive constant. Using the $\mathcal F$ in (11), the solution of the optimization in (10) can be expressed as
$$Q(\cdot) = \sum_{i=0}^{n}\alpha_i f_i(\cdot). \qquad (12)$$
Here $\alpha = [\alpha_i]_{i=0}^n$ are coefficients decided by the optimization problem, and
$$f_0(\cdot) = \mathbb{E}_{x\sim\mu_0\times\pi}\big[K_0(\cdot, x)\big], \qquad f_i(\cdot) = K_0(\cdot, x_i) - \gamma\,\mathbb{E}_{a_i'\sim\pi(\cdot|s_i')}\big[K_0(\cdot, x_i')\big], \quad i = 1,\ldots,n,$$
where $x_i' = (s_i', a_i')$, with $s_i'$ the observed next state following $x_i = (s_i, a_i)$ and $a_i'$ randomly drawn from $\pi(\cdot\,|\,s_i')$.

In addition, for $Q$ of the form (12), $\hat\eta(Q)$ is a linear function of $\alpha$, and both $\hat L_K(Q)$ and $\|Q\|^2_{\mathcal H_{K_0}}$ are convex quadratic functions of $\alpha$. In particular, we have $\|Q\|^2_{\mathcal H_{K_0}} = \alpha^\top B\alpha$, where $B = [B_{ij}]$ is an $(n+1)\times(n+1)$ matrix with $B_{ij} = \langle f_i, f_j\rangle_{\mathcal H_{K_0}}$. Therefore, the optimization in (10) reduces to an optimization over $\alpha$ with a linear objective and convex quadratic constraints,
$$\max_{\alpha^\top B\alpha\le\rho}\big\{c^\top\alpha+\lambda_\eta \;\;\text{s.t.}\;\; (\alpha-b)^\top A(\alpha-b)\le\lambda_K\big\},$$
where $A, B$ are two $(n+1)\times(n+1)$ positive definite matrices and $b, c$ are two $(n+1)\times 1$ vectors, whose definitions can be found in Appendix E.

Random Feature Approximation: Unfortunately, solving the program in Proposition 4.2 requires $O((n+1)^3)$ time and is too slow when the data size $n$ is very large. We can speed up the computation by using random feature approximations. The idea is that any positive definite kernel can be expressed as $K_0(x,y) = \int_{\mathcal W}\phi(x,w)\phi(y,w)\,d\mu(w)$, where $\phi(x,w)$ denotes a feature map indexed by a parameter $w$ in some space $\mathcal W$ and $\mu$ is a measure on $\mathcal W$. A typical example of this is the random Fourier expansion of stationary kernels by Bochner's theorem (Rahimi & Recht, 2007), in which $\phi(x,w) = \cos(w^\top[x, 1])$. To speed up the computation, we draw i.i.d. $\{w_k\}_{k=1}^m$ from $\mu$ and take
$$K_0(x,y) = \frac{1}{m}\sum_{k=1}^{m}\phi(x, w_k)\phi(y, w_k). \qquad (13)$$
Then one can show that any function $Q$ in the RKHS of the $K_0$ in (13) can be expressed as $Q(x) = \theta^\top\Phi(x)$, where $\Phi(x) = [\phi(x, w_k)]_{k=1}^m$, with $\|Q\|^2_{\mathcal H_{K_0}} = m\|\theta\|_2^2$, where $\|\cdot\|_2$ is the typical $L_2$ norm on $\mathbb{R}^m$.

Given the off-policy dataset $D = \{(x_i, r_i, s_i')\}$ with $x_i = (s_i, a_i)$, and an i.i.d. sample $\{x_{i,0}\}_{i=1}^N$ with $x_{i,0} = (s_{i,0}, a_{i,0})$ from the initial distribution $\mu_0\times\pi$, the optimization in (10) reduces to
$$\hat\eta^+ = \max_{\|\theta\|_2^2\le\rho/m}\big\{c_0^\top\theta+\lambda_\eta \;\;\text{s.t.}\;\; (Z\theta - v)^\top M(Z\theta - v)\le\lambda_K\big\}, \qquad (14)$$
where $v = [r_i]_{i=1}^n\in\mathbb{R}^{n\times 1}$, $M\in\mathbb{R}^{n\times n}$ with $M_{ij} = K(x_i, x_j)/n^2$, $c_0 = \frac{1}{N}\sum_{i=1}^N\Phi(x_{i,0})\in\mathbb{R}^{m\times 1}$, and $Z = \big[\Phi(x_i) - \gamma\,\mathbb{E}_{a_i'\sim\pi(\cdot|s_i')}[\Phi(x_i')]\big]_{i=1}^n$ with $x_i' = (s_i', a_i')$. The expectation in $Z$ can be approximated by Monte Carlo sampling from $\pi(\cdot\,|\,s_i')$. Similarly, we can get the lower confidence bound via
$$\hat\eta^- = \min_{\|\theta\|_2^2\le\rho/m}\big\{c_0^\top\theta-\lambda_\eta \;\;\text{s.t.}\;\; (Z\theta - v)^\top M(Z\theta - v)\le\lambda_K\big\}. \qquad (15)$$
Compared with the program in Proposition 4.2, the optimization problems in (14)-(15) have lower dimension and are hence much faster when $m\ll n$, since the dimension of $\theta$ is $m$ while that of $\alpha$ is $n+1$. We describe the detailed procedure for obtaining the upper and lower confidence bounds in Algorithm 1.

Algorithm 1: Confidence Bounds for Off-Policy Evaluation
Input: off-policy data $D = \{(x_i, r_i, s_i')\}_{i=1}^n$; maximum reward $r_{\max}$; discount factor $\gamma$; positive definite kernel $K$; RKHS norm bound $\rho$; random feature map $\Phi(\cdot)$; failure probability $\delta$; Monte Carlo sample size $N$.
1. Draw $\{x_{i,0}\}_{i=1}^N$ from the initial distribution $\mu_0\times\pi$; decide $\lambda_K$ and $\lambda_\eta$ by the concentration inequalities.
2. Calculate the upper bound $\hat\eta^+$ by solving (14), and the lower bound $\hat\eta^-$ by (15).
Output: $\hat\eta^+$, $\hat\eta^-$.
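Since the experiments in Section 5 solve these convex programs with CVXPY, a minimal sketch of (14)-(15) could look as follows. The helper name `confidence_bounds` and the use of a symmetric square root of $M$ (to keep the quadratic constraint DCP-compliant) are our own choices, not part of the paper.

```python
import numpy as np
import cvxpy as cp

def confidence_bounds(Phi0, Z, v, M, rho, lam_eta, lam_K):
    """Sketch of the programs in Eqs. (14)-(15).

    Phi0: (N, m) features Phi(x_{i,0}) of initial pairs drawn from mu_0 x pi
    Z:    (n, m) rows Phi(x_i) - gamma * E_{a'~pi}[Phi(x_i')]
    v:    (n,)   observed rewards r_i
    M:    (n, n) matrix with entries K(x_i, x_j) / n^2 (positive semidefinite)
    """
    n, m = Z.shape
    c0 = Phi0.mean(axis=0)                                # c_0 in Eq. (14)

    # Symmetric PSD square root of M, so the quadratic constraint is DCP.
    w, U = np.linalg.eigh(M)
    M_sqrt = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

    theta = cp.Variable(m)
    constraints = [
        cp.sum_squares(theta) <= rho / m,                     # ||theta||_2^2 <= rho/m
        cp.sum_squares(M_sqrt @ (Z @ theta - v)) <= lam_K,    # kernel Bellman constraint
    ]
    upper = cp.Problem(cp.Maximize(c0 @ theta), constraints).solve()
    lower = cp.Problem(cp.Minimize(c0 @ theta), constraints).solve()
    return upper + lam_eta, lower - lam_eta                   # (eta_hat^+, eta_hat^-)
```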
4.3. Post-hoc Confidence Bound of Existing Estimators

We extend our method to provide post-hoc confidence bounds around existing estimators provided by users. Given an existing estimator $\hat Q$ of the Q-function, we denote by $Q_{\mathrm{res}} := Q^\pi - \hat Q$ the difference between the ground truth $Q^\pi$ and the prior estimate $\hat Q$. Assuming $Q_{\mathrm{res}}\in\mathcal F$, we obtain an upper bound by
$$\max_{Q_{\mathrm{res}}\in\mathcal F}\big\{\hat\eta(\hat Q + Q_{\mathrm{res}})+\lambda_\eta \;\;\text{s.t.}\;\; \hat L_K(\hat Q + Q_{\mathrm{res}})\le\lambda_K\big\}.$$
This can be viewed as a special case of (10), but with the function space anchored around the existing estimator $\hat Q$. Similar to the earlier case, with the random feature approximation the optimization reduces to
$$\hat\eta^+ = \max_{\|\theta\|_2^2\le\rho/m}\big\{\hat\eta(\hat Q)+c_0^\top\theta+\lambda_\eta \;\;\text{s.t.}\;\; (Z\theta-\zeta)^\top M(Z\theta-\zeta)\le\lambda_K\big\}, \qquad (16)$$
where $c_0, Z, M$ are defined as before, $\hat\eta(\hat Q) = \frac{1}{N}\sum_{i=1}^N\hat Q(x_{i,0})$, and $\zeta\in\mathbb{R}^{n\times 1}$ is the TD error vector of $\hat Q(x)$ evaluated on the dataset $D$; that is, $\zeta = [\zeta_i]_{i=1}^n$ with $\zeta_i = r_i + \gamma\,\mathbb{E}_{a_i'\sim\pi(\cdot|s_i')}[\hat Q(s_i', a_i')] - \hat Q(x_i)$. The post-hoc lower bound follows a similar form:
$$\hat\eta^- = \min_{\|\theta\|_2^2\le\rho/m}\big\{\hat\eta(\hat Q)+c_0^\top\theta-\lambda_\eta \;\;\text{s.t.}\;\; (Z\theta-\zeta)^\top M(Z\theta-\zeta)\le\lambda_K\big\}. \qquad (17)$$

4.4. Post-hoc Correction of Existing Estimators

In addition to providing confidence bounds around the existing estimator $\hat Q$, we may want to further correct $\hat Q$ when it is identified to have large error. The post-hoc correction should ensure that the corrected estimate falls into the confidence bounds provided earlier, while keeping as much of the information in $\hat Q$ as possible. Our idea is to correct $\hat Q$ by adding a debiasing term $Q_{\mathrm{debias}}$ such that $\hat L_K^V(\hat Q + Q_{\mathrm{debias}})\le\lambda_K$, while keeping $Q_{\mathrm{debias}}$ as small as possible. This is framed as
$$\min_{Q_{\mathrm{debias}}\in\mathcal F}\big\{\|Q_{\mathrm{debias}}\|^2_{\mathcal H_{K_0}} \;\;\text{s.t.}\;\; \hat L_K(\hat Q + Q_{\mathrm{debias}})\le\lambda_K\big\}.$$
We should distinguish $Q_{\mathrm{debias}}$ from the $Q_{\mathrm{res}}$ in Section 4.3: $Q_{\mathrm{res}}$ cannot be used for the debiasing purpose, because it is designed to give an extreme estimate (for providing the bound) rather than a minimal correction. In the case of the random feature approximation, the optimization reduces to
$$\theta^* = \arg\min_\theta\big\{\|\theta\|_2^2 \;\;\text{s.t.}\;\; (Z\theta-\zeta)^\top M(Z\theta-\zeta)\le\lambda_K\big\}, \qquad (18)$$
and $Q_{\mathrm{debias}}(x) = \theta^{*\top}\Phi(x)$. If the existing estimator $\hat Q$ already satisfies $\hat L_K^V(\hat Q)\le\lambda_K$, this provides a zero correction (i.e., $\theta^* = 0$), since the estimator is already sufficiently accurate. This procedure is summarized in Algorithm 2.

Algorithm 2: Post-hoc Diagnosis for Existing Estimators
Input: off-policy data $D = \{(x_i, r_i, s_i')\}_{i=1}^n$; maximum reward $r_{\max}$; discount factor $\gamma$; positive definite kernel $K$; RKHS norm bound $\rho$; random feature map $\Phi(\cdot)$; failure probability $\delta$; Monte Carlo sample size $N$; existing estimate $\hat Q$.
1. Draw $\{x_{i,0}\}_{i=1}^N$ from the initial distribution $\mu_0\times\pi$; decide $\lambda_\eta$ and $\lambda_K$ by the concentration inequalities.
2. Calculate the upper bound $\hat\eta^+$ by solving (16), the lower bound $\hat\eta^-$ by (17), and the debias function $Q_{\mathrm{debias}}$ by (18).
Output: the upper and lower bounds $\hat\eta^+, \hat\eta^-$, and $Q_{\mathrm{debias}}(x) = \theta^{*\top}\Phi(x)$.
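A minimal CVXPY sketch of the correction step (18); the helper name `posthoc_debias` is ours, and as in the previous sketch we use a symmetric square root of $M$ to keep the quadratic constraint DCP-compliant.

```python
import numpy as np
import cvxpy as cp

def posthoc_debias(Z, zeta, M, lam_K):
    """Sketch of Eq. (18): the smallest correction theta such that
    Q_hat + theta^T Phi satisfies the kernel Bellman constraint.

    Z:    (n, m) rows Phi(x_i) - gamma * E_{a'~pi}[Phi(x_i')]
    zeta: (n,)   TD errors of the existing estimate Q_hat on the data
    M:    (n, n) matrix with entries K(x_i, x_j) / n^2
    """
    n, m = Z.shape
    w, U = np.linalg.eigh(M)
    M_sqrt = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

    theta = cp.Variable(m)
    problem = cp.Problem(
        cp.Minimize(cp.sum_squares(theta)),
        [cp.sum_squares(M_sqrt @ (Z @ theta - zeta)) <= lam_K])
    problem.solve()
    # theta = 0 is returned (feasible and optimal) whenever Q_hat already
    # satisfies the constraint, i.e., zeta^T M zeta <= lam_K.
    return theta.value   # Q_debias(x) = theta . Phi(x)
```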
5. Experiments

We evaluate the algorithms proposed in Section 4 on two continuous control tasks: Inverted-Pendulum and Puck-Mountain. Details of the tasks are in Appendix D. For all of our experiments, we use a Gaussian RBF kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2/h^2)$ for evaluating the kernel Bellman statistics, with the bandwidth $h$ selected on a separate batch of training data, and use the V-statistic in Equation (5) to evaluate the kernel loss on the empirical data.

To construct the behavior and target policies, we first train an optimal Q-function using deep Q-learning, use its softmax functions as policies, and set the temperature to be higher for the behavior policies to encourage exploration. Note that our algorithms do not require knowledge of the behavior policies, since they are behavior-agnostic. A description of how we construct the policies is in Appendix D.

5.1. Confidence Bounds for Off-Policy Evaluation

We test Algorithm 1 on the two continuous control tasks, Inverted-Pendulum (Figure 1(a)-(d)) and Puck-Mountain (Figure 1(e)-(h)). We solve the convex optimization in Algorithm 1 using CVXPY (Diamond & Boyd, 2016; Agrawal et al., 2018), which gives us the upper and lower bounds $\hat\eta^+$ and $\hat\eta^-$. The results are reported w.r.t. the average discounted reward defined by $\eta^\pi/(1-\gamma)$, and are averaged over 50 random trials.

Figure 1. Results on Inverted-Pendulum (a)-(d) and Puck-Mountain (e)-(h). The plots show the confidence bounds on the average discounted reward as we vary the number of transitions $n$ in (a) & (e), the failure probability $\delta$ in (b) & (f), the temperature parameter $\tau$ of the behavior policy in (c) & (g), and the number of features in (d) & (h); curves compare random Fourier features and neural features against the true reward $\eta^\pi$. The default parameters (when not varied) are: discount factor $\gamma = 0.95$; horizon length $T = 50$ for Inverted-Pendulum and $T = 100$ for Puck-Mountain; number of episodes 20; failure probability $\delta = 0.10$; temperature of the behavior policy $\tau = 1$; and feature dimension 10.

We use two types of feature maps $\Phi(\cdot)$ to parameterize the state-action value function $Q(x) := \theta^\top\Phi(x)$:
1) Random Fourier features: $\Phi(x) := [\cos(\mu_i^\top x + b_i)]_{i=1}^m$, where $\mu_i\sim\mathcal N(0, \frac{1}{h_0^2} I)$, $b_i\sim\mathrm{Uniform}[0, 2\pi]$, and $h_0$ is a bandwidth parameter (see the sketch below).
2) Neural features: we use a small neural network to parameterize the Q-function, learn it by minimizing the kernel loss on the training dataset, and set $\Phi(\cdot)$ to a set of neural features (the feature map before the last linear layer) selected on the validation set.
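As an illustration of the first feature map, a minimal sketch of constructing the random Fourier features described above; the function name and random number handling are our own.

```python
import numpy as np

def random_fourier_features(X, m, h0, seed=None):
    """Random Fourier feature map Phi(x) = [cos(mu_i^T x + b_i)]_{i=1..m},
    with mu_i ~ N(0, I / h0^2) and b_i ~ Uniform[0, 2*pi], used to
    parameterize Q(x) = theta^T Phi(x).

    X: (n, d) array of state-action pairs; returns an (n, m) feature matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    mu = rng.normal(0.0, 1.0 / h0, size=(m, d))   # frequencies with std 1/h0
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)     # random phase offsets
    return np.cos(X @ mu.T + b)
```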
Figure 1(a)-(d) demonstrate the evaluation results on Inverted-Pendulum. From Figure 1(a) & (b) we can see that, as we increase the number of transitions or the failure probability $\delta$, the confidence bounds become tighter, since $\lambda_K$ becomes smaller; nevertheless, the interval still covers the ground truth $\eta^\pi$, which indicates that our method gives a reasonable confidence interval for off-policy estimation. In Figure 1(c), we investigate the performance of our method when we vary the temperature parameter of the behavior policy, which controls the concentration of its probability mass; the confidence bounds do not change significantly across different temperatures. In Figure 1(d), we study the performance of our algorithm with different numbers of features (both random Fourier and neural features), which shows that we can obtain a tight confidence interval when the number of features is small. This is because, when the number of features decreases, the Q-function is constrained to a lower dimensional function space and hence yields a tighter confidence interval. However, decreasing the number of features also increases the risk of model misspecification. We also test our method on Puck-Mountain and report the results in Figure 1(e)-(h), which show similar behavior to what we observe on Inverted-Pendulum.

5.2. Post-hoc Diagnosis for Existing Estimators

We implement Algorithm 2 to provide post-hoc diagnosis for existing Q estimators, and test it on Inverted-Pendulum (Figure 2) and Puck-Mountain (Figure 3 in Appendix D).

Figure 2. Post-hoc diagnosis on Inverted-Pendulum. Panels (a), (b), (d), (e) show the diagnosis for $\hat Q_1$ and $\hat Q_2$ (the reward of the original and corrected estimators, the true reward $\eta^\pi$, and the confidence bounds); panels (c) and (f) show the norm of $Q_{\mathrm{debias}}$. In (a)-(c), we vary the number of transitions with fixed $\delta = 0.10$; in (d)-(f), we fix the number of transitions $n = 500$ and vary the failure probability $\delta$. The default parameters are: discount factor $\gamma = 0.95$, horizon length $T = 50$, number of transitions $n = 500$, failure probability $\delta = 0.10$, temperature of the behavior policy $\tau = 1$, and feature dimension 10.

Figure 2(a)-(f) show the diagnosis results for two state-action value function estimators ($\hat Q_1$ and $\hat Q_2$) on Inverted-Pendulum, which we learn with kernel Bellman statistics using different numbers of optimization steps. Here $\hat Q_1$ is a relatively accurate estimator, while the estimate $\hat Q_2$ has larger bias. Figure 2(a)-(c) show that, as we increase the number of transitions, the norm of the debias term $Q_{\mathrm{debias}}$ stays at zero as long as $\hat Q_1$ and $\hat Q_2$ lie inside the confidence interval. Meanwhile, when $\hat Q_1$ or $\hat Q_2$ fails to provide an accurate estimate (falling outside of the confidence interval), our algorithm gives a post-hoc correction that brings the corrected estimate back inside the confidence interval. Such post-hoc diagnosis provides both confidence bounds and corrections for existing estimators, which can be useful in real-world applications. Figure 2(d)-(f) demonstrate the performance of our algorithm as we change the failure probability $\delta$. Again, the empirical results show that our algorithm provides meaningful post-hoc diagnosis. We also test our method on Puck-Mountain, following the same procedure as on Inverted-Pendulum, and observe similar results.

6. Conclusion

In this paper, we develop a new variational framework for constructing confidence bounds for off-policy evaluation, and extend it to perform post-hoc diagnosis on existing estimators. In future work, we will leverage our framework to develop safe and efficient policy optimization and exploration algorithms based on the kernel Bellman statistics.

Acknowledgment

This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421.

References

Agrawal, A., Verschueren, R., Diamond, S., and Boyd, S. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42-60, 2018.

Baird, L. C. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pp. 30-37, 1995.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207-3260, 2013.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 1507-1516, 2019.

Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.

Duan, Y., Jia, Z., and Wang, M. Minimax-optimal off-policy evaluation with linear function approximation. In Proceedings of the 37th International Conference on Machine Learning, 2020.
Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 1097-1104, 2011.

Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17(1):4809-4874, 2016.

Farajtabar, M., Chow, Y., and Ghavamzadeh, M. More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, pp. 1446-1455, 2018.

Feng, Y., Li, L., and Liu, Q. A kernel loss for solving the Bellman equation. In Advances in Neural Information Processing Systems, pp. 15456-15467, 2019.

Grunewalder, S., Lever, G., Baldassarre, L., Pontil, M., and Gretton, A. Modelling transition dynamics in MDPs with RKHS embeddings. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Han, F. An exponential inequality for U-statistics under mixing conditions. Journal of Theoretical Probability, 31(1):556-578, 2018.

Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Hirano, K., Imbens, G. W., and Ridder, G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161-1189, 2003.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Jiang, N. and Huang, J. Minimax confidence interval for off-policy evaluation and policy optimization. arXiv preprint arXiv:2002.02081, 2020.

Jiang, N. and Li, L. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 652-661, 2016.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863-4873, 2018.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Proceedings of the 33rd Conference on Learning Theory, 2020.

Jung, T. and Polani, D. Kernelizing LSPE(λ). In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 338-345. IEEE, 2007.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient and robust off-policy evaluation. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Li, L., Chu, W., Langford, J., and Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the 4th International Conference on Web Search and Data Mining (WSDM), pp. 297-306, 2011.

Liao, P., Klasnja, P., and Murphy, S. Off-policy estimation of long-term average outcomes with applications to mobile health. arXiv preprint arXiv:1912.13088, 2019.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356-5366, 2018.

Mousavi, A., Li, L., Liu, Q., and Zhou, D. Black-box off-policy estimation for infinite-horizon reinforcement learning. In International Conference on Learning Representations, 2020.

Murphy, S. A., van der Laan, M., and Robins, J. M. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410-1423, 2001.
Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2318-2328, 2019.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994. ISBN 0-471-61977-9.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177-1184, 2007.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, March 1998. ISBN 0-262-19398-1.

Tang, Z., Feng, Y., Li, L., Zhou, D., and Liu, Q. Doubly robust bias reduction in infinite horizon off-policy estimation. In International Conference on Learning Representations, 2020.

Taylor, G. and Parr, R. Kernelized value function approximation for reinforcement learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pp. 1017-1024, 2009.

Thomas, P. S. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2139-2148, 2016.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2380-2388, 2015a.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015b.

Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., and Brunskill, E. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 4740-4745, 2017.

Uehara, M., Huang, J., and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Wang, Y.-X., Agarwal, A., and Dudík, M. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, pp. 3589-3597, 2017.

White, M. and White, A. Interval estimation for reinforcement-learning algorithms in continuous-state domains. In Advances in Neural Information Processing Systems, pp. 2433-2441, 2010.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9668-9678, 2019.

Xu, X., Xie, T., Hu, D., and Lu, X. Kernel least-squares temporal difference learning. International Journal of Information and Technology, 11(9):54-63, 2005.

Xu, X., Hu, D., and Lu, X. Kernel-based least-squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18(4):973-992, 2007.

Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020.