# Off-Policy Interval Estimation with Lipschitz Value Iteration

Ziyang Tang (University of Texas at Austin, ztang@utexas.edu); Yihao Feng (University of Texas at Austin, yihao@cs.utexas.edu); Na Zhang (Tsinghua University, zhangna@pbcsf.tsinghua.edu.cn); Jian Peng (University of Illinois at Urbana-Champaign, jianpeng@illinois.edu); Qiang Liu (University of Texas at Austin, lqiang@cs.utexas.edu)

Abstract: Off-policy evaluation provides an essential tool for evaluating the effects of different policies or treatments using only observed data. When applied to high-stakes scenarios such as medical diagnosis or financial decision-making, it is crucial to provide provably correct upper and lower bounds of the expected reward to the end users, not just a classical single point estimate, as executing a poor policy can be very costly. In this work, we propose a provably correct method for obtaining interval bounds for off-policy evaluation in a general continuous setting. The idea is to search for the maximum and minimum values of the expected reward among all the Lipschitz Q-functions that are consistent with the observations, which amounts to solving a constrained optimization problem on a Lipschitz function space. We go on to introduce a Lipschitz value iteration method to monotonically tighten the interval, which is simple yet efficient and provably convergent. We demonstrate the practical efficiency of our method on a range of benchmarks.

1 Introduction

Reinforcement learning (RL) (e.g., Sutton & Barto, 1998) has become widely used in tasks such as recommendation systems, robotics, trading, and healthcare (Murphy et al., 2001; Li et al., 2011; Bottou et al., 2013; Thomas et al., 2017). The current success of RL relies heavily on large amounts of data, which, however, are often unavailable in real-world tasks where deploying a new policy is very costly or even risky. Off-policy evaluation (OPE) (e.g., Fonteneau et al., 2013; Jiang & Li, 2016; Liu et al., 2018a), which estimates the expected reward of a target policy using observational data gathered from previous behavior policies, therefore holds tremendous promise for designing data-efficient RL algorithms that leverage previously collected data.

Existing OPE methods mainly focus on point estimation, which provides only a single point estimate of the expected reward. However, such a point estimate can be rather unreliable, as OPE often suffers from large error due to scarce historical samples, policy shift, or model misspecification. Further, for applications in high-stakes areas such as medical diagnosis and financial investment, a point estimate alone is far from enough and can even be dangerous if it is unreliable. Hence, it is essential to provide a provably correct interval estimate of the expected reward, which is both trustworthy and theoretically justified.

To address this problem, we propose a general optimization-based framework for deriving provably correct off-policy interval estimates from historical samples. Our idea is to search for the largest and smallest possible values of the expected reward among all the Q-functions in a Lipschitz function space that are consistent with the observed historical samples. This interval estimator is provably correct once the true Q-function satisfies the Lipschitz assumption.
Computing our upper and lower bounds amounts to solving a constrained optimization problem in the space of Lipschitz functions. We introduce a Lipschitz value iteration algorithm in the style of classical value iteration. Our method is efficient and provably convergent. In particular, our algorithm has a simple closed-form update at each iteration and is guaranteed to monotonically tighten the bounds at a linear rate under mild conditions. To speed up the algorithm, we develop a double subsampling strategy: at each iteration, we only update the value estimates at a random subsample of data points and use the same subsample to form the constraints. We test our algorithm on a number of benchmarks and show that it provides tight and provably correct bounds.

Related Work. Our work is closely related to off-policy point estimation. There are typically two types of OPE methods: importance sampling (IS) based methods (e.g., Liu, 2001; Precup et al., 2000; Liu et al., 2018a; Xie et al., 2019) and value function or model based methods (e.g., Fonteneau et al., 2013; Liu et al., 2018b; Le et al., 2019; Feng et al., 2019). Another line of work combines these two approaches and yields doubly-robust estimators for off-policy evaluation (Jiang & Li, 2016; Thomas & Brunskill, 2016; Kallus & Uehara, 2019; Tang et al., 2020). In this work, we consider the black-box setting, in which the behavior policy is unknown (e.g., Nachum et al., 2019; Mousavi et al., 2020; Zhang et al., 2020; Feng et al., 2020).

IS-based point estimation methods naturally yield confidence intervals via standard concentration inequalities (Thomas et al., 2015b,a). Another major type of confidence interval estimation leverages the statistical procedure of bootstrapping (White & White, 2010; Hanna et al., 2017). However, these confidence intervals are typically loose due to the curse of horizon. Moreover, they rely heavily on the assumption that the off-policy data are drawn i.i.d. from a particular behavior policy, which is not always true, since behavior policies usually evolve over time and depend on previously deployed policies.

Another related line of work is PAC-RL (e.g., Jin et al., 2018; Dann et al., 2018; Song & Sun, 2019; Yang et al., 2019), which mainly focuses on regret bounds or the sample complexity of exploration in Q-learning. As a by-product, these methods also provide confidence intervals for the value function. However, this line of work mostly focuses on tabular or linear MDPs. In contrast, our work aims to handle general continuous MDPs by leveraging Lipschitz properties of Q-functions. Song & Sun (2019) provide a metric-embedding-style Q-learning method, with a particular focus on finite-horizon settings.

2 Background and Problem Settings

We first set up the problem of off-policy interval evaluation and then introduce the relevant background on Q-learning and the Bellman equation.

Markov Decision Process. A Markov decision process (MDP) $M = \langle \mathcal{S}, \mathcal{A}, r, \mathcal{T}, \mu_0, \gamma \rangle$ consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, an unknown deterministic reward function $r \colon \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, an unknown transition function $\mathcal{T} \colon \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, an initial state distribution $\mu_0$, and a discount factor $\gamma$. Throughout this work, we assume for simplicity that the transition and reward functions are deterministic, and that we can easily draw samples from the initial state distribution $\mu_0$.
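As a concrete point of reference for the code sketches given in later sections, the off-policy data and the target policy can be represented as follows. The array shapes and the `PolicySampler` convention are illustrative choices of this sketch, not requirements of our method.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class TransitionData:
    """Off-policy dataset D = {(s_i, a_i, s'_i, r_i)}_{i=1..n}, stored as NumPy arrays."""
    states: np.ndarray       # shape (n, state_dim)
    actions: np.ndarray      # shape (n, action_dim)
    next_states: np.ndarray  # shape (n, state_dim)
    rewards: np.ndarray      # shape (n,)

# The target policy is accessed only through sampling:
# a batch of states (m, state_dim) -> a batch of actions a ~ pi(.|s), shape (m, action_dim).
PolicySampler = Callable[[np.ndarray], np.ndarray]
```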
In RL, an agent acts in an MDP following a policy $\pi(\cdot \mid s)$, which prescribes a distribution over the action space $\mathcal{A}$ given each state $s \in \mathcal{S}$. Running the policy starting from the initial distribution $\mu_0$ yields a random trajectory $\tau := \{s_i, a_i, r_i\}_{0 \le i \le T}$, where $s_i$, $a_i$, $r_i$ represent the state, action, and reward at time $i$, respectively. We define the infinite-horizon discounted reward of $\pi$ as
$$R^\pi := \lim_{T \to \infty} \mathbb{E}_{\tau \sim \pi} \Big[ \sum_{i=0}^{T} \gamma^i r_i \Big],$$
where $\gamma \in (0, 1)$ is a discount factor; $T$ is the horizon length of the trajectory, which we assume to approach infinity, hence yielding an infinite-horizon problem; and $\mathbb{E}_{\tau \sim \pi}[\cdot]$ denotes the expectation over random trajectories collected under the policy $\pi$.

Black Box Off-Policy Interval Estimation. We are interested in the problem of black-box off-policy interval evaluation, which requires arguably the minimum assumptions on the off-policy data. It amounts to providing an interval estimate $[\underline{R}^\pi, \overline{R}^\pi]$ of the expected reward $R^\pi$ of a policy $\pi$ (called the target policy), given a set of transition pairs $\{s_i, a_i, s_i', r_i\}_{i=1}^n$ collected under a different, unknown behavior policy, or even a mix of different policies; here $s_i' = \mathcal{T}(s_i, a_i)$ and $r_i = r(s_i, a_i)$ denote the next state and the local reward following $(s_i, a_i)$, respectively.

Q-function and Bellman Equation. We review the properties of the Q-function most relevant to our work. The Q-function $Q^\pi(s, a)$ specifies the expected reward when following $\pi$ from the state-action pair $(s, a)$ and is known to be the unique fixed point of the Bellman equation:
$$Q(s, a) = \mathcal{B}^\pi Q(s, a) := r(s, a) + \gamma P^\pi Q(s, a), \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}, \tag{1}$$
where $\mathcal{B}^\pi$ denotes the Bellman operator and $P^\pi$ is the transition operator defined by
$$P^\pi Q(s, a) := \mathbb{E}_{s' = \mathcal{T}(s, a),\, a' \sim \pi(\cdot \mid s')} [Q(s', a')]. \tag{2}$$
The expected reward associated with a Q-function is defined via
$$R_{\mu_0,\pi}[Q] := \mathbb{E}_{(s_0, a_0) \sim \mu_{0,\pi}} [Q(s_0, a_0)], \tag{3}$$
where we use $\mu_{0,\pi}(s, a) = \mu_0(s)\pi(a \mid s)$ to denote the joint initial state-action distribution.

Off-Policy Q-Learning. The Q-function can be learned in an off-policy manner. Assume we have a set of transition pairs $\mathcal{D} := \{s_i, a_i, s_i', r_i\}_{i=1}^n$. Under the assumption of deterministic transitions and rewards, we can estimate the Bellman operator at each data point:
$$\mathcal{B}^\pi Q(s_i, a_i) = r_i + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s_i')} [Q(s_i', a')],$$
which can be estimated to arbitrarily high accuracy by drawing a large number of samples from $\pi(\cdot \mid s_i')$. Assuming $Q^\pi$ belongs to a class of functions $\mathcal{F}$, we can estimate $Q^\pi$ by finding a $Q \in \mathcal{F}$ that satisfies the Bellman equation on the data points:
$$Q \in \mathcal{F}, \quad \text{s.t.} \quad Q(s_i, a_i) = \mathcal{B}^\pi Q(s_i, a_i), \ \forall i \in [n]. \tag{4}$$
Compared with the exact Bellman equation (1), we only match the equation on the observed data in (4), which may yield multiple or even infinitely many solutions for $Q$. Therefore, the function class $\mathcal{F}$ needs to be sufficiently constrained in order to yield meaningful solutions. In practice, (4) is often solved using fitted value iteration (Munos & Szepesvári, 2008), which starts from an initial $Q_0$ and then performs iterative updates
$$Q_{t+1} \leftarrow \arg\min_{Q \in \mathcal{F}} \sum_{i=1}^{n} \big( Q(s_i, a_i) - \mathcal{B}^\pi Q_t(s_i, a_i) \big)^2. \tag{5}$$
It is easy to see that $R^\pi = R_{\mu_0,\pi}[Q^\pi]$. With an estimate of $Q^\pi$, the expected reward $R^\pi$ in (3) can be estimated by Monte Carlo sampling from $\mu_0$ and $\pi$.
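To make the fitted value iteration update (5) concrete, here is a minimal sketch with a linear-in-features Q-function. The feature map `feature_fn`, the least-squares fit, and the `n_pi` Monte Carlo action samples used to approximate $\mathbb{E}_{a' \sim \pi(\cdot \mid s')}$ are illustrative assumptions; (5) does not prescribe a particular function class.

```python
import numpy as np

def fitted_value_iteration(states, actions, next_states, rewards,
                           policy_sampler, feature_fn, gamma=0.95,
                           n_pi=10, n_iters=100):
    """Return weights w such that Q(s, a) ~= feature_fn(s, a) @ w, via the update (5)."""
    phi = feature_fn(states, actions)                  # (n, k) features at the data points
    w = np.zeros(phi.shape[1])
    for _ in range(n_iters):
        # Empirical Bellman backup B^pi Q(s_i, a_i) = r_i + gamma * E_{a'~pi(.|s'_i)}[Q(s'_i, a')]
        q_next = np.zeros(len(rewards))
        for _ in range(n_pi):
            a_next = policy_sampler(next_states)       # one sampled action per next state
            q_next += feature_fn(next_states, a_next) @ w
        targets = rewards + gamma * q_next / n_pi
        # Least-squares regression onto the backup targets, i.e. the fitting step in (5)
        w, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return w
```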
3 Main Method

We now introduce our main framework for providing provably correct upper and lower bounds of the expected reward. For notation, we use $x = (s, a)$ to represent a state-action pair.

3.1 Motivation and Optimization Framework

When the fitted value iteration in (5) converges, it provides only one Q-function, which yields a point estimate of the reward. To get an interval estimate, we instead seek two Q-functions $\underline{Q}$ and $\overline{Q}$ such that all possible $Q$ consistent with the data points lie between them, i.e., $\underline{Q} \preceq Q \preceq \overline{Q}$, where $f \preceq g$ means $f(x) \le g(x)$ for all $x$. More concretely, consider a function set $\mathcal{F}$ that is expected to include $Q^\pi$; we construct an upper bound of $R^\pi$ by
$$\overline{R}^\pi_{\mathcal{F}} = \sup_{Q \in \mathcal{F}} R_{\mu_0,\pi}[Q], \quad \text{s.t.} \quad Q(x_i) \le \mathcal{B}^\pi Q(x_i), \ \forall i \in [n], \tag{6}$$
where $R_{\mu_0,\pi}[Q]$ is defined in (3). It is easy to see that $\overline{R}^\pi_{\mathcal{F}} \ge R^\pi$ whenever $Q^\pi \in \mathcal{F}$: $Q^\pi$ satisfies the constraints in the optimization, and hence $\overline{R}^\pi_{\mathcal{F}} \ge R_{\mu_0,\pi}[Q^\pi] = R^\pi$ by definition of the supremum.

It is worth noting that we use a Bellman inequality constraint in (6), which does not cause any looseness compared to the equality constraint. To see this, note that the exact reward $R^\pi$ can be framed as (Bertsekas, 2000)
$$R^\pi = \sup_{Q \in \mathcal{F}} R_{\mu_0,\pi}[Q], \quad \text{s.t.} \quad Q(x) \le \mathcal{B}^\pi Q(x), \ \forall x \in \mathcal{S} \times \mathcal{A},$$
which implies that $\overline{R}^\pi_{\mathcal{F}}$ converges to $R^\pi$ as the number of data points increases. We construct the lower bound in a similar way:
$$\underline{R}^\pi_{\mathcal{F}} = \inf_{Q \in \mathcal{F}} R_{\mu_0,\pi}[Q], \quad \text{s.t.} \quad Q(x_i) \ge \mathcal{B}^\pi Q(x_i), \ \forall i \in [n]. \tag{7}$$
Define $I_{\mathcal{F}} = [\underline{R}^\pi_{\mathcal{F}}, \overline{R}^\pi_{\mathcal{F}}]$ as the interval estimate of $R^\pi$; once the true $Q^\pi$ lies in the function space $\mathcal{F}$, $R^\pi$ provably lies in $I_{\mathcal{F}}$.

Benefits of Our Framework. Our optimization framework enjoys several advantages over existing methods. First, unlike standard concentration bounds, our bounds do not rely on an i.i.d. assumption for the transition pairs $\{s_i, a_i\}$. In RL settings, the historical transition pairs are highly dependent on each other: on one hand, in a sequential data stream, the current state is the next state of the previous step; on the other hand, the behavior policy that generates the trajectories also evolves during the learning process. Secondly, under our optimization framework, more data enables us to add more constraints to the search space and therefore obtain tighter bounds. This property allows a trade-off between time complexity and the tightness of the bounds, which we discuss further in Section 4.1. Last but not least, the tightness of the bounds depends on the capacity of the function space $\mathcal{F}$, which naturally yields a nice monotonicity property: if $Q^\pi \in \mathcal{F}_1 \subseteq \mathcal{F}_2$, then $R^\pi \in I_{\mathcal{F}_1} \subseteq I_{\mathcal{F}_2}$.

Lipschitz Function Space. To implement this framework, it is important to choose a proper function set $\mathcal{F}$ and to solve the optimization efficiently. Intuitively, $\mathcal{F}$ should be rich enough to include the true value function, but not so large that the bounds become vacuous. In light of this, we propose to use the space of Lipschitz functions. Let $\mathcal{X}$ be a metric space equipped with a distance $d \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We take $\mathcal{F}$ to be a ball in the Lipschitz function space:
$$\mathcal{F}_\eta := \{f \colon \|f\|_{d,\mathrm{Lip}} \le \eta\}, \quad \text{where} \quad \|f\|_{d,\mathrm{Lip}} = \sup_{x, x' \in \mathcal{X},\, x \ne x'} \frac{|f(x) - f(x')|}{d(x, x')}, \tag{8}$$
where $\|f\|_{d,\mathrm{Lip}}$ is the Lipschitz norm of $f$ and $\eta$ is a radius parameter. Although the Lipschitz space yields an infinite-dimensional optimization, our key technical contribution is to show that it is possible to calculate the upper and lower bounds with an efficient fixed-point algorithm. We formulate the bounds as an optimization problem and solve it efficiently in a value-iteration-like fashion. In addition, in Section 3.4, we show that the size of the Lipschitz space enables us to establish a diminishing bound on the gap between the upper and lower bounds.
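For intuition about the Lipschitz ball in (8), the sketch below computes the Lipschitz norm of a function restricted to a finite sample, i.e., the largest ratio $|f(x_i) - f(x_j)| / d(x_i, x_j)$ over observed pairs. The Euclidean metric and the function names are illustrative choices; the same empirical quantity reappears in (17) when we set $\eta$ from data.

```python
import numpy as np

def pairwise_euclidean(a, b):
    """All pairwise Euclidean distances between rows of a (m, d) and rows of b (k, d)."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

def empirical_lipschitz_norm(values, points):
    """max_{i != j} |f(x_i) - f(x_j)| / d(x_i, x_j) over the observed points (a lower
    estimate of the true norm ||f||_{d,Lip} in (8))."""
    d = pairwise_euclidean(points, points)
    diff = np.abs(values[:, None] - values[None, :])
    mask = d > 0                      # skip i == j and exact duplicates
    return float(np.max(diff[mask] / d[mask]))
```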
3.2 Optimization in Lipschitz Function Space

Figure 1: Illustration.

We propose a novel value-iteration-like algorithm for solving the optimizations in (6) and (7). We focus on the upper bound (6), as the case of the lower bound is similar and is discussed in Appendix A. Our algorithm follows the same spirit as the fitted value iteration (5): we apply the Bellman operator to the current estimate of $Q$, and then use the result as the constraint in the Bellman inequalities to obtain a new estimate. Specifically, starting from an initial $\overline{Q}_0$ with $q_{0,i} := \overline{Q}_0(x_i)$, at the $t$-th iteration we update by
$$\overline{Q}_t = \arg\max_{Q \in \mathcal{F}} R_{\mu_0,\pi}[Q] \ \ \text{s.t.} \ Q(x_i) \le q_{t,i}, \ \forall i \in [n]; \qquad q_{t+1,i} = \mathcal{B}^\pi \overline{Q}_t(x_i), \ \forall i \in [n]. \tag{9}$$
This update requires solving a constrained optimization over the space $\mathcal{F}$. For our choice of the Lipschitz function space, it admits a simple closed-form solution.

Algorithm 1 Lipschitz Value Iteration (for Upper Bound)
Input: transition data $D = \{s_i, a_i, s_i', r_i\}_{1 \le i \le n}$; discount factor $\gamma$; target policy $\pi$; Lipschitz constant $\eta$.
Initialize $q_{0,i}$ with the criterion in (12).
for iteration $t$ do
  Update: $q_{t+1,i} = \mathcal{B}^\pi \overline{Q}_t(x_i)$, where $\overline{Q}_t(x) = \min_{j \in [n]} \{q_{t,j} + \eta\, d(x, x_j)\}$.
end for
Return: upper bound $\overline{R}^\pi_{\mathcal{F}} = \mathbb{E}_{(s,a) \sim \mu_{0,\pi}} \big[ \min_{j \in [n]} \{q_{T,j} + \eta\, d(x, x_j)\} \big]$, where $x = (s, a)$ and $T$ is the final iteration.

Proposition 3.1. Suppose $\mu_{0,\pi}(x) > 0$ for all $x \in \mathcal{S} \times \mathcal{A}$, and consider the optimization in (9) with $\mathcal{F} = \mathcal{F}_\eta$. We have
$$\overline{Q}_t(x) = \min_{j \in [n]} \big( q_{t,j} + \eta\, d(x, x_j) \big), \qquad q_{t+1,i} = \mathcal{B}^\pi \overline{Q}_t(x_i), \ \forall i \in [n]. \tag{10}$$

Intuitively, $\overline{Q}_t$ in (10) is an upper envelope of all the possible Lipschitz functions $Q$ that satisfy $Q(x_i) \le q_{t,i}$ for all $i$, and hence solves the optimization in (9), as $R_{\mu_0,\pi}[Q]$ is monotonically increasing in $Q$. The updates for the lower bound can be derived similarly by calculating the lower envelope:
$$\underline{Q}_t(x) = \max_{i \in [n]} \big( q_{t,i} - \eta\, d(x, x_i) \big), \qquad q_{t+1,i} = \mathcal{B}^\pi \underline{Q}_t(x_i), \ \forall i \in [n]. \tag{11}$$
Note that these updates can be simplified to only keep track of the Q-function values at the data points, $\{q_{t,i}\}_{i=1}^n$, as summarized in Algorithm 1. Figure 1 illustrates how the upper and lower envelopes bound all the possible Lipschitz functions passing through the same set of data points.

3.3 Convergence Analysis

Our algorithm enjoys two important and desirable properties. One is monotonic convergence, which means the iteration can be stopped at any time, depending on the time budget, and still yield a valid upper/lower bound. The other is linear convergence, which implies that only a logarithmic number of steps is needed to converge. Thanks to these properties, our method is much faster than directly solving the convex optimization in (6) with (stochastic) sub-gradient ascent.

The monotonic convergence relies on the initial upper bound we pick, which is inspired by the monotonicity of the Bellman operator. See Appendix B.1 for more details.

Theorem 3.2. Following the update in (10), starting from
$$q_{0,i} = \frac{1}{1 - \gamma} \Big( r_i + \gamma \eta\, \mathbb{E}_{x_i' \sim \mathcal{T}^\pi(\cdot \mid x_i)} \big[ d(x_i, x_i') \big] \Big), \tag{12}$$
where $x_i' = (s_i', a')$ with $a' \sim \pi(\cdot \mid s_i')$, we have
$$\overline{Q}_t \succeq \overline{Q}_{t+1} \succeq \overline{Q}^*_\pi, \quad t = 0, 1, 2, \dots, \tag{13}$$
where $\overline{Q}^*_\pi = \arg\max_{Q \in \mathcal{F}} \{ R_{\mu_0,\pi}[Q] \ \text{s.t.} \ Q(x_i) \le \mathcal{B}^\pi Q(x_i), \ \forall i \in [n] \}$. Therefore, $R_{\mu_0,\pi}[\overline{Q}_t] \ge R_{\mu_0,\pi}[\overline{Q}_{t+1}] \ge \overline{R}^\pi_{\mathcal{F}} \ge R^\pi$, and $\lim_{t \to \infty} R_{\mu_0,\pi}[\overline{Q}_t] = \overline{R}^\pi_{\mathcal{F}}$.

We also establish a fast linear convergence rate for the updates in (10). This convergence result does not require the initialization (12).

Proposition 3.3. Following the updates in (10) under arbitrary initialization, with constant $C := \max_{i \in [n]} |q_{1,i} - q_{0,i}|$, we have $\big| R_{\mu_0,\pi}[\overline{Q}_t] - \overline{R}^\pi_{\mathcal{F}} \big| \le C\, \gamma^t$.

Algorithm 2 Lipschitz Value Iteration with Stochastic Update (for Upper Bound)
Input: transition data $D = \{s_i, a_i, s_i', r_i\}_{1 \le i \le n}$; discount factor $\gamma$; target policy $\pi$; distance function $d$; Lipschitz constant $\eta$; subsample size $n_B$.
Initialize: $q_{0,i} = \frac{1}{1-\gamma} \big( r_i + \gamma \eta\, \widehat{\mathbb{E}}_{x' \sim \mathcal{T}^\pi(\cdot \mid x_i)} [d(x_i, x')] \big)$ for all $i$, according to equation (12).
for iteration $t$ do
  Subsample a subset $S_t \subseteq \{1, 2, \dots, n\}$ with $|S_t| = n_B$.
  Update: $q_{t+1,i} = \min\{q_{t,i}, \mathcal{B}^\pi \overline{Q}_t(x_i)\}$ for $i \in S_t$ and $q_{t+1,i} = q_{t,i}$ for $i \notin S_t$, where $\overline{Q}_t(x) = \min_{j \in S_t} \{q_{t,j} + \eta\, d(x_j, x)\}$.
end for
Return: upper bound $\overline{R}^\pi_{\mathcal{F}} = \widehat{\mathbb{E}}_{x \sim \mu_{0,\pi}} \big[ \min_{i \in [n]} \{q_{T,i} + \eta\, d(x_i, x)\} \big]$.
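The following is a minimal NumPy sketch of the upper-bound iteration under simplifying assumptions of this illustration: state-action pairs are concatenated into vectors $x = (s, a)$, $d$ is the Euclidean metric, and the expectations over $\pi(\cdot \mid s')$ and $\mu_{0,\pi}$ are approximated with a fixed number of Monte Carlo samples. It follows the stochastic update of Algorithm 2; setting `batch_size = n` recovers the full-sample update (10) of Algorithm 1. This is a sketch, not a reference implementation.

```python
import numpy as np

def upper_envelope(x_query, x_ref, q_ref, eta):
    """Evaluate Q_bar(x) = min_j (q_j + eta * d(x, x_j)) at each query point (Euclidean d)."""
    d = np.sqrt(((x_query[:, None, :] - x_ref[None, :, :]) ** 2).sum(-1))  # (m, k)
    return np.min(q_ref[None, :] + eta * d, axis=1)

def lipschitz_value_iteration_ub(x, sample_next_x, rewards, sample_init_x,
                                 eta, gamma=0.95, batch_size=500,
                                 n_iters=100, n_pi=10, n_init=1000, seed=0):
    """Subsampled Lipschitz value iteration for the upper bound (cf. Algorithm 2).

    x:             (n, d) array of state-action pairs x_i = (s_i, a_i)
    sample_next_x: callable returning an (n, d) array of x'_i = (s'_i, a'), a' ~ pi(.|s'_i)
    rewards:       (n,) array of observed rewards r_i
    sample_init_x: callable m -> (m, d) array of x_0 ~ mu_0 x pi
    """
    rng = np.random.default_rng(seed)
    n = len(rewards)

    # Initialization (12): q_0,i = (r_i + gamma * eta * E[d(x_i, x'_i)]) / (1 - gamma)
    exp_dist = np.zeros(n)
    for _ in range(n_pi):
        exp_dist += np.sqrt(((x - sample_next_x()) ** 2).sum(-1))
    q = (rewards + gamma * eta * exp_dist / n_pi) / (1.0 - gamma)

    for _ in range(n_iters):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)   # subsample S_t
        q_next = np.zeros(len(idx))
        for _ in range(n_pi):
            x_next = sample_next_x()                  # fresh a' ~ pi(.|s'_i) for each i
            q_next += upper_envelope(x_next[idx], x[idx], q[idx], eta)
        backup = rewards[idx] + gamma * q_next / n_pi # B^pi Q_bar_t(x_i) on the subsample
        q[idx] = np.minimum(q[idx], backup)           # update (14); other q_i stay fixed

    x0 = sample_init_x(n_init)                        # x_0 ~ mu_0 x pi
    return np.mean(upper_envelope(x0, x, q, eta))     # Monte Carlo estimate of the bound
```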
3.4 Tightness of Lipschitz-Based Bounds

We provide a quantitative estimate of the gap $\overline{R}^\pi_{\mathcal{F}} - \underline{R}^\pi_{\mathcal{F}}$ between the upper and lower bounds when using the Lipschitz function set. We show that it depends on the covering radius of the data points in the domain.

Theorem 3.4. Let $\mathcal{F} = \mathcal{F}_\eta$ be the Lipschitz function class with Lipschitz constant $\eta$. Suppose $\mathcal{X} = \mathcal{S} \times \mathcal{A}$ is a compact and bounded domain equipped with a distance $d \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. For a set of data points $X = \{x_i\}_{i=1}^n$, we have
$$\overline{R}^\pi_{\mathcal{F}} - \underline{R}^\pi_{\mathcal{F}} \le \frac{2\eta}{1 - \gamma}\, \varepsilon_X,$$
where $\gamma$ is the discount factor and $\varepsilon_X = \sup_{x \in \mathcal{X}} \min_i d(x, x_i)$ is the covering radius.

Typically, the covering radius in a bounded and compact domain scales asymptotically as $O(n^{-1/\tilde{d}})$, where $\tilde{d}$ is an intrinsic dimension of the domain (Cohen et al., 1985), provided that the support of the data distribution (from which the $x_i$ are drawn) covers $d^\pi$, the stationary distribution of policy $\pi$. This shows that the bound gets tighter as the number of samples grows, but it may decay slowly when the data domain has a very high intrinsic dimension. While it is possible to choose smaller function spaces (such as an RKHS) to obtain smaller gaps, doing so would sacrifice other properties such as capacity and simplicity.

4 Practical Considerations

We discuss some practical concerns of Lipschitz value iteration in this section.

4.1 Accelerating with Stochastic Subsampling

If we draw $D$ samples to estimate $\mathcal{B}^\pi$ when updating with equation (10), each round requires $O(n^2 D)$ computations. The $n^2$ factor comes from two sources: one is updating all $n$ values $q_{t,i}$ in each iteration, and the other is taking the minimum/maximum over all $n$ upper/lower curves when evaluating the envelope function $\overline{Q}_t(x)$. This is quadratic in the number of samples and therefore computationally expensive as the sample size grows.

To tackle this problem, we propose a fast random subsampling technique. Instead of updating all $n$ values $q_{t,i}$ in each iteration, we pick a batch of subsamples $\{q_{t,i}\}_{i \in S}$ to update, where $S \subseteq \{1, 2, \dots, n\}$ is a subset of fixed size $|S| = n_B$. The benefit of subsampling is that we can trade off between time complexity and tightness. More precisely, the new update scheme is
$$\overline{Q}_t(x) = \min_{j \in S_t} \big( q_{t,j} + \eta\, d(x, x_j) \big), \qquad q_{t+1,i} = \begin{cases} \min\{q_{t,i},\, \mathcal{B}^\pi \overline{Q}_t(x_i)\}, & i \in S_t, \\ q_{t,i}, & i \notin S_t, \end{cases} \tag{14}$$
where $S_t$ is the subsample set at iteration $t$. A similar monotonicity result holds, as stated below.

Figure 2: Results for the synthetic environment with a known value function. Default settings: number of trajectories $n_t = 30$, horizon length $H = 100$, discount factor $\gamma = 0.95$, Lipschitz constant $\eta = 2.0$, and subsample size $n_B = 500$. (a) y-axis: log of relative mean error $\log\big((\overline{R}^\pi_{\mathcal{F}} - \underline{R}^\pi_{\mathcal{F}})/R^\pi\big)$, x-axis: number of trajectories $n_t$. (b)(c) y-axis: relative reward, x-axes: number of trajectories $n_t$ and subsample size $n_B$. (d) Landscape of the value function $V^\pi$ with $[\underline{V}^\pi, \overline{V}^\pi]$ over the state $s$, where the state is bounded in the interval $[-1.2, 1.2]$ and the interval is estimated using 100 samples. Blue curves: subsample bounds; purple lines: full-sample bounds.

Proposition 4.1. Consider the update in (14) with initialization following (12), and let $\overline{Q}_t$ be the upper envelope function of the data points $\{x_i, q_{t,i}\}_{i=1}^n$. We have a monotonicity result similar to Theorem 3.2: $\overline{Q}_t \succeq \overline{Q}_{t+1} \succeq \overline{Q}^*_\pi$ for $t = 0, 1, 2, \dots$.

We summarize this strategy in Algorithm 2. The subsampled algorithm only needs $O(n_B^2 D)$ time per iteration. Although this strategy does not converge to the exact optimal solution $\overline{R}^\pi_{\mathcal{F}}$, it still gives valid (though less tight) bounds. In our numerical experiments, we find that once the subsample size is sufficiently large, we obtain almost the same tightness as the exact optimal bounds.
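Since both the full-sample gap in Theorem 3.4 and the effect of subsampling depend on how well the (sub)sampled points cover the domain, a cheap sanity check is to estimate the covering radius $\varepsilon_X$ empirically and plug it into the bound $2\eta\varepsilon_X/(1-\gamma)$. The sketch below uses Euclidean distances and Monte Carlo probe points; the probe distribution is an assumption of the sketch, and since the true $\varepsilon_X$ is a supremum over the whole domain this only gives a rough estimate.

```python
import numpy as np

def empirical_gap_bound(x_data, x_probe, eta, gamma):
    """Estimate eps_X = sup_x min_i d(x, x_i) over probe points, then return 2*eta*eps/(1-gamma)."""
    d = np.sqrt(((x_probe[:, None, :] - x_data[None, :, :]) ** 2).sum(-1))  # (m, n)
    eps = np.max(np.min(d, axis=1))     # covering radius restricted to the probe points
    return 2.0 * eta * eps / (1.0 - gamma)
```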
4.2 On Estimating the Lipschitz Norm

The only model assumption we need to specify is the function set $\mathcal{F}_\eta$: we want $\eta \ge \|Q^\pi\|_{d,\mathrm{Lip}}$, but not too large, since the estimated interval becomes looser as $\eta$ increases. Nevertheless, compared with other value-based function approximation methods, which also require model-specification assumptions, our non-parametric Lipschitz assumption is very mild.

To set the hyperparameter $\eta$, we would like to estimate an upper bound of $\|Q^\pi\|_{d,\mathrm{Lip}}$, which is typically non-identifiable purely from data. This is because, given a sufficiently large $\eta$, we can always find a function $Q$ with $\|Q\|_{d,\mathrm{Lip}} \le \eta$ that satisfies all the Bellman constraints, by perturbing with a small function; see Appendix C for more details. However, if we know or can estimate the Lipschitz norms of the reward and transition functions, then it is possible to derive a theoretical upper bound on $\|Q^\pi\|_{d,\mathrm{Lip}}$ under only mild assumptions.

Proposition 4.2. Let $(\mathcal{S} \times \mathcal{A}, d_x)$ be a metric space for state-action pairs $x$ and $(\mathcal{S}, d_s)$ a metric space for states $s$. Suppose $d_x$ is separable, so that $d_x(x_1, x_2) = d_s(s_1, s_2)$ whenever $a_1 = a_2$. Suppose the reward function $r$ and the transition $\mathcal{T}$ are both Lipschitz, in the sense that
$$|r(x_1) - r(x_2)| \le \|r\|_{\mathrm{Lip}}\, d_x(x_1, x_2), \qquad d_s(\mathcal{T}(x_1), \mathcal{T}(x_2)) \le \|\mathcal{T}\|_{\mathrm{Lip}}\, d_x(x_1, x_2), \qquad \forall x_1, x_2.$$
Then, if $\gamma \|\mathcal{T}\|_{\mathrm{Lip}} < 1$, we have
$$\|Q^\pi\|_{\mathrm{Lip}} \le \frac{\|r\|_{\mathrm{Lip}}}{1 - \gamma \|\mathcal{T}\|_{\mathrm{Lip}}} \tag{15}$$
when $\pi$ is a constant policy. Furthermore, for the optimal policy $\pi^*$ with value function $Q^*$, we have
$$\|Q^*\|_{\mathrm{Lip}} \le \frac{\|r\|_{\mathrm{Lip}}}{1 - \gamma \|\mathcal{T}\|_{\mathrm{Lip}}}. \tag{16}$$

Proposition 4.2 suggests that if our target policy is close to optimal, we can set $\eta = \|r\|_{\mathrm{Lip}} / (1 - \gamma \|\mathcal{T}\|_{\mathrm{Lip}})$, provided we can estimate $\|r\|_{\mathrm{Lip}}$ and $\|\mathcal{T}\|_{\mathrm{Lip}}$. This provides a way to upper bound the Lipschitz norm of $Q^\pi$ by leveraging the Lipschitz norms of the reward and transition functions. In practice, we can estimate $\|r\|_{\mathrm{Lip}}$ and $\|\mathcal{T}\|_{\mathrm{Lip}}$ from historical data:
$$\widehat{\|r\|}_{\mathrm{Lip}} = \max_{i \ne j} \frac{|r_i - r_j|}{d_x(x_i, x_j)}, \qquad \widehat{\|\mathcal{T}\|}_{\mathrm{Lip}} = \max_{i \ne j} \frac{d_s(s_i', s_j')}{d_x(x_i, x_j)}. \tag{17}$$

Figure 3: Results for the pendulum and HIV simulators. Default settings for pendulum (panels a, b): $n_t = 30$, $H = 100$, $\gamma = 0.95$, $\eta = 10.0$, $n_B = 500$. Default settings for HIV (panels c, d): $n_t = 50$, $H = 30$, $\gamma = 0.75$, $\eta = 40.0$, $n_B = 500$. y-axes: relative reward; x-axes: number of trajectories $n_t$ and subsample size $n_B$. The experiments follow the setup of Figure 2 (b) and (c).

Diagnosing Model Misspecification from Data. Since the empirical maximum tends to underestimate the true maximum, simply using Proposition 4.2 may still underestimate the true Lipschitz norm. Fortunately, we can diagnose whether $\eta$ is too small to be consistent with the data by adding only a few lines of diagnostic code. From Theorem 3.2 we know that every $Q \in \mathcal{F}_\eta$ consistent with the finite-sample Bellman equations satisfies $\underline{Q}_t \preceq Q \preceq \overline{Q}_t$. Thus, if at some iteration $t$ (or after convergence) we find that $\overline{Q}_t(x) < \underline{Q}_t(x)$ for some $x$, we can reject the hypothesis
$$h \colon \exists\, Q \in \mathcal{F}_\eta \ \text{s.t.} \ Q(x_i) = \mathcal{B}^\pi Q(x_i), \ \forall i \in [n],$$
and conclude that $Q^\pi \notin \mathcal{F}_\eta$. We can then increase $\eta$ by a constant factor $\kappa > 1$ and rerun the upper/lower bound algorithm. Note that we do not need to check $\overline{Q}_t(x) < \underline{Q}_t(x)$ at infinitely many points $x$: since $\overline{Q}_t(x) = \min_j \{\overline{q}_{t,j} + \eta\, d(x, x_j)\}$ and $\underline{Q}_t(x) = \max_j \{\underline{q}_{t,j} - \eta\, d(x, x_j)\}$, it is sufficient to check whether there exists an index $i$ such that $\overline{q}_{t,i} < \underline{q}_{t,i}$.
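A minimal sketch of this procedure, assuming Euclidean metrics for both $d_x$ and $d_s$: estimate $\|r\|_{\mathrm{Lip}}$ and $\|\mathcal{T}\|_{\mathrm{Lip}}$ by the empirical maxima in (17), set $\eta$ via Proposition 4.2, and after running both bounds test whether $\overline{q}_{t,i} < \underline{q}_{t,i}$ at any data point. The function names and the error handling here are illustrative.

```python
import numpy as np

def pairwise_euclidean(a, b):
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

def estimate_eta(x, next_states, rewards, gamma):
    """eta = ||r||_Lip / (1 - gamma * ||T||_Lip), with both norms estimated as in (17)."""
    dx = pairwise_euclidean(x, x)
    mask = dx > 0                                    # exclude i == j (and exact duplicates)
    r_lip = np.max(np.abs(rewards[:, None] - rewards[None, :])[mask] / dx[mask])
    t_lip = np.max(pairwise_euclidean(next_states, next_states)[mask] / dx[mask])
    if gamma * t_lip >= 1.0:
        raise ValueError("Proposition 4.2 requires gamma * ||T||_Lip < 1")
    return r_lip / (1.0 - gamma * t_lip)

def eta_is_consistent(q_upper, q_lower):
    """Diagnosis: eta is rejected if the upper q-value drops below the lower one at some point."""
    return bool(np.all(q_upper >= q_lower))

# Typical use: eta = estimate_eta(...); run both bounds; while not eta_is_consistent(...),
# multiply eta by a factor kappa (e.g. 1.1) and rerun, as described above.
```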
5 Experiments

We test our algorithm in different environments, following Algorithm 2 with the subsampling technique. In each environment, we evaluate the tightness of our bounds while varying 1) the number of samples $n$ and 2) the subsample size $n_B$. We also compare against the exact bound computed with the full sample size (i.e., $n_B = n$). We initialize the hyperparameter $\eta$ with the estimate from Proposition 4.2 using empirical maxima, and use the diagnostic procedure from the previous section to gradually increase $\eta$ until it is consistent with the data, with increase factor $\kappa = 1.1$. We observe that the diagnostic check usually passes with the initial $\eta$. As a baseline, the method of Thomas et al. (2015b) needs a huge number of samples to obtain bounds as tight as ours; we include a comparison experiment in Appendix D.

Synthetic Environment with a Known Value Function. We consider a simple environment with a one-dimensional state space $\mathcal{S} = \mathbb{R}$ and a one-dimensional action space $\mathcal{A} = \mathbb{R}$ with a linear transition function. Given a target policy $\pi$, we enforce the value function $Q^\pi(s, a)$ to be a predefined function. This is done by setting the reward function $r$ of the environment to the reverse Bellman error of $Q^\pi$, i.e., $r_i = Q^\pi(x_i) - \gamma P^\pi Q^\pi(x_i)$, with $P^\pi$ the transition operator in equation (2). We use the Euclidean distance metric, under which we can prove that the Lipschitz constant is 2. See Appendix D for more details.

All reported results are averaged over 300 runs using the following default settings: number of trajectories $n_t = 30$, horizon length $H = 100$, discount factor $\gamma = 0.95$, Lipschitz constant $\eta = 2.0$, and subsample size $n_B = 500$. We run Lipschitz value iteration for 100 iterations to ensure near convergence. From Figure 2(a)(b) we see that the bounds get tighter as the sample size grows. Figure 2(c) indicates that with a sufficiently large subsample size, e.g., $n_B = 500$, we achieve bounds nearly as accurate as the full-sample algorithm (purple lines). We also show the landscape of the estimated state-value bounds $\overline{V}^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[\overline{Q}_t(s, a)]$ and $\underline{V}^\pi(s)$ after the final Lipschitz value iteration with 100 data samples. Compared with the true value function, we obtain tighter bounds in the neighborhood of the data points than in unseen regions.

Pendulum Environment. We demonstrate our method on pendulum, a continuous control environment with state space $\mathbb{R}^3$ and actions in the interval $[-2, 2]$. In this environment, we aim to control the pendulum so that it stands up for as long as possible (in the large-discount case) or as quickly as possible (in the small-discount case). See Appendix D for further experimental setup. Figure 3(a)(b) shows a similar result, indicating that our interval estimate is tight and that subsampling achieves almost the same tightness as the full-sample algorithm.

HIV Simulator. The HIV simulator described in Ernst et al. (2006) is an environment with a continuous 6-dimensional state space and a discrete action space with 4 actions. In this environment, we seek an optimal drug schedule given a patient's 6 key HIV indicators. The HIV simulator has richer dynamics than the previous two environments.
We follow Liu et al. (2018b) to learn a target policy by fitted Q-iteration and use an $\epsilon$-greedy policy of the Q-function as the behavior policy. The default settings are similar to the previous two experiments, but we use a relatively small discount factor $\gamma = 0.75$ to ensure that we obtain a reasonable Lipschitz constant from equation (15). Figure 3(c)(d) shows similar results for the HIV environment.

6 Conclusion

We develop a general optimization framework for off-policy interval estimation and propose a value-iteration-style algorithm to monotonically tighten the interval. Our Lipschitz value iteration for continuous MDPs enjoys convergence properties similar to those of value iteration for tabular MDPs, which is worth further investigation. Future directions include leveraging our interval estimation to encourage policy exploration or offline safe policy improvement.

Broader Impact

Off-policy interval evaluation can not only advise end users on deploying a new policy, but also serve as an intermediate step for subsequent policy optimization. Our proposed method also helps fill a gap in the theoretical understanding of Markov structure in Lipschitz regression. The current work stands as a contribution to fundamental ML methodology, and we do not foresee potential negative impacts.

Funding Transparency Statement

This work is supported in part by NSF CAREER #1846421, SenSE #2037267, and EAGER #2041327.

References

Bertsekas, D. P. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2000. ISBN 1886529094.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.

Cohen, G., Karpovsky, M., Mattson, H., and Schatz, J. Covering radius: Survey and recent results. IEEE Transactions on Information Theory, 31(3):328–343, 1985.

Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.

Ernst, D., Stan, G.-B., Goncalves, J., and Wehenkel, L. Clinical data based optimal STI strategies for HIV: A reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pp. 667–672. IEEE, 2006.

Feng, Y., Li, L., and Liu, Q. A kernel loss for solving the Bellman equation. In Neural Information Processing Systems (NeurIPS), 2019.

Feng, Y., Ren, T., Tang, Z., and Liu, Q. Accountable off-policy evaluation with kernel Bellman statistics. In International Conference on Machine Learning, 2020.

Fonteneau, R., Murphy, S. A., Wehenkel, L., and Ernst, D. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.

Hanna, J., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2017.

Jiang, N. and Li, L. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pp. 652–661, 2016.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873, 2018.

Kallus, N. and Uehara, M. Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850, 2019.
Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712, 2019.

Li, L., Chu, W., Langford, J., and Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the 4th International Conference on Web Search and Data Mining (WSDM), pp. 297–306, 2011.

Liu, J. S. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag, 2001. ISBN 0387763694.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5361–5371, 2018a.

Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A. A., Doshi-Velez, F., and Brunskill, E. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2644–2653, 2018b.

Mousavi, A., Li, L., Liu, Q., and Zhou, D. Black-box off-policy estimation for infinite-horizon reinforcement learning. In International Conference on Learning Representations, 2020.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

Murphy, S. A., van der Laan, M., and Robins, J. M. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2315–2325, 2019.

Precup, D., Sutton, R. S., and Singh, S. P. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 759–766, 2000.

Song, Z. and Sun, W. Efficient model-free reinforcement learning in metric spaces. arXiv preprint arXiv:1905.00475, 2019.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, March 1998. ISBN 0-262-19398-1.

Tang, Z., Feng, Y., Li, L., Zhou, D., and Liu, Q. Doubly robust bias reduction in infinite horizon off-policy estimation. In International Conference on Learning Representations, 2020.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388, 2015a.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015b.

Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., and Brunskill, E. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pp. 4740–4745, 2017.

White, M. and White, A. Interval estimation for reinforcement-learning algorithms in continuous-state domains. In Advances in Neural Information Processing Systems, pp. 2433–2441, 2010.

Xie, T., Ma, Y., and Wang, Y.-X. Optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Neural Information Processing Systems (NeurIPS), 2019.

Yang, L. F., Ni, C., and Wang, M. Learning to control in metric space with optimal regret. arXiv preprint arXiv:1905.01576, 2019.
Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020.