SOPE: Spectrum of Off-Policy Estimators

Christina J. Yuan, University of Texas at Austin, cjyuan@cs.utexas.edu
Yash Chandak, University of Massachusetts, ychandak@cs.umass.edu
Stephen Giguere, University of Texas at Austin, sgiguere@cs.utexas.edu
Philip S. Thomas, University of Massachusetts, pthomas@cs.umass.edu
Scott Niekum, University of Texas at Austin, sniekum@cs.utexas.edu

Abstract: Many sequential decision making problems are high-stakes and require off-policy evaluation (OPE) of a new policy using historical data collected under some other policy. One of the most common OPE techniques that provides unbiased estimates is trajectory-based importance sampling (IS). However, due to the high variance of trajectory IS estimates, importance sampling methods based on state-action visitation distributions (SIS) have recently been adopted. Unfortunately, while SIS often provides lower-variance estimates for long horizons, estimating the state-action distribution ratios can be challenging and lead to biased estimates. In this paper, we present a new perspective on this bias-variance trade-off and show the existence of a spectrum of estimators whose endpoints are SIS and IS. Additionally, we establish a spectrum for doubly-robust and weighted versions of these estimators. We provide empirical evidence that estimators in this spectrum can be used to trade off between the bias and variance of IS and SIS and can achieve lower mean-squared error than both IS and SIS.

1 Introduction

Many sequential decision making problems, such as automated health-care, robotics, and online recommendations, are high-stakes in terms of health, safety, or finance [Liao et al., 2020, Brown et al., 2020, Theocharous et al., 2020]. For such problems, collecting new data to evaluate the performance of a new decision rule, called an evaluation policy $\pi_e$, may be expensive or even dangerous if $\pi_e$ results in undesired outcomes. Therefore, one of the most important challenges in such problems is the estimation of the performance $J(\pi_e)$ of the policy $\pi_e$ before its deployment. Many off-policy evaluation (OPE) methods enable estimation of $J(\pi_e)$ with historical data collected using an existing decision rule, called a behavior policy $\pi_b$. One popular OPE technique is trajectory-based importance sampling (IS) [Precup, 2000]. While this method is non-parametric and provides unbiased estimates of $J(\pi_e)$, it suffers from the curse of horizon and can have variance exponential in the horizon length [Jiang and Li, 2016, Guo et al., 2017]. To mitigate this problem, recent methods use stationary distribution importance sampling (SIS) to adjust the stationary distribution of the Markov chain induced by the policies, instead of the individual trajectories [Liu et al., 2018, Gelada and Bellemare, 2019, Nachum and Dai, 2020]. This requires (parametric) estimation of the ratio between the stationary distributions induced by $\pi_e$ and $\pi_b$. Unfortunately, estimating this ratio accurately can require unverifiably strong assumptions on the parameterization [Jiang and Huang, 2020], and often requires solving non-trivial min-max saddle-point optimization problems [Yang et al., 2020]. Consequently, if the parameterization is not rich enough, then it may not be possible to represent the distribution ratios accurately, and when rich function approximators (such as neural networks) are used, the optimization procedure may get stuck in sub-optimal saddle points.
In practice, these challenges can introduce error when estimating the distribution ratio, potentially leading to arbitrarily biased estimates of $J(\pi_e)$, even when an infinite amount of data is available. In this work, we present a new perspective on the bias-variance trade-off for OPE that bridges the unbiasedness of IS and the often lower variance of SIS. In particular, we show that: (i) there exists a spectrum of OPE estimators whose end-points are IS and SIS, respectively; (ii) estimators in this spectrum can have lower mean-squared error than both IS and SIS; and (iii) this spectrum can also be established for doubly-robust and weighted versions of IS and SIS. In Sections 3 and 4 we show how trajectory-based and distribution-based methods can be combined. The core idea establishing the existence of this spectrum relies upon first splitting individual trajectories into two parts and then computing the probability of the first part using SIS and that of the latter using IS. In Section 5, we introduce weighted and doubly-robust extensions of the spectrum. Finally, in Section 6, we present empirical case studies to highlight the effectiveness of these new estimators.

2 Background

Notation: A Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, r, T, \gamma, d_1)$, where $\mathcal{S}$ is the state set, $\mathcal{A}$ is the action set, $r$ is the reward function, $T$ is the transition function, $\gamma$ is the discounting factor, and $d_1$ is the initial state distribution. Although our results extend to the continuous setting, for simplicity of notation we assume that $\mathcal{S}$ and $\mathcal{A}$ are finite. A policy $\pi$ is a distribution over $\mathcal{A}$, conditioned on the state. Starting from an initial state $S_1 \sim d_1$, a policy $\pi$ interacts with the environment iteratively by sampling an action $A_t \sim \pi(\cdot|S_t)$ at every time step $t$. The environment then produces a reward $R_t$ with expected value $r(S_t, A_t)$, and transitions to the next state $S_{t+1} \sim T(\cdot|S_t, A_t)$. Let $\tau := (S_1, A_1, R_1, S_2, \dots, S_L, A_L, R_L)$ be the sequence of random variables corresponding to a trajectory sampled from $\pi$, where $L$ is the horizon length. Let $p_\pi$ denote the distribution of $\tau$ under $\pi$.

Problem Statement: The performance of any policy $\pi$ is given by its value, defined as the expected discounted sum of rewards $J(\pi) := \mathbb{E}_{\tau \sim p_\pi}\left[\sum_{t=1}^{L} \gamma^{t-1} R_t\right]$. The infinite-horizon setting can be obtained by letting $L \to \infty$. In general, for any random variable, we use the superscript $i$ to denote the trajectory associated with it. The goal of the off-policy policy evaluation (OPE) problem is to estimate the performance $J(\pi_e)$ of an evaluation policy $\pi_e$ using only a batch of historical trajectories $D := \{\tau^i\}_{i=1}^{m}$ collected from a different behavior policy $\pi_b$. This problem is challenging because $J(\pi_e)$ must be estimated using only observational, off-policy data from the deployment of a different behavior policy $\pi_b$. Additionally, this problem might not be feasible if the data collected using $\pi_b$ is not informative about the outcomes possible under $\pi_e$. Therefore, to make the problem tractable, we make the following standard support assumption, which implies that any outcome possible under $\pi_e$ also has non-zero probability of occurring under $\pi_b$.

Assumption 1. For all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, the ratio $\frac{\pi_e(a|s)}{\pi_b(a|s)} < \infty$.
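To ground this notation for the code sketches that follow, here is a minimal Python sketch (all names are illustrative, not from the paper) that represents a trajectory as a list of (state, action, reward) tuples and forms the on-policy Monte Carlo estimate of $J(\pi)$, i.e., the quantity that OPE must recover without being able to run $\pi_e$.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """sum_{t=1}^{L} gamma^(t-1) * R_t for one trajectory (rewards listed in order)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def on_policy_value(trajectories, gamma):
    """Monte Carlo estimate of J(pi) from trajectories collected with pi itself.
    Each trajectory is a list of (state, action, reward) tuples."""
    returns = [discounted_return([r for (_, _, r) in traj], gamma) for traj in trajectories]
    return float(np.mean(returns))
```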
Trajectory-Based Importance Sampling: One of the earliest methods for estimating $J(\pi_e)$ is trajectory-based importance sampling. This method corrects the difference between the distributions induced by $\pi_b$ and $\pi_e$ by re-weighting the trajectories from $\pi_b$ in $D$ by the probability ratio of the trajectory under $\pi_e$ and $\pi_b$, i.e.,

$$\frac{p_{\pi_e}(\tau)}{p_{\pi_b}(\tau)} = \prod_{t=1}^{L} \frac{\pi_e(A_t|S_t)}{\pi_b(A_t|S_t)}.$$

Let the single-step action likelihood ratio be denoted $\rho_t := \frac{\pi_e(A_t|S_t)}{\pi_b(A_t|S_t)}$ and the likelihood ratio from steps $j$ to $k$ be denoted $\rho_{j:k} := \prod_{t=j}^{k} \rho_t$. The full-trajectory importance sampling (IS) estimator and the per-decision importance sampling (PDIS) estimator [Precup, 2000] can then be defined as:

$$\text{IS}(D) := \frac{1}{m} \sum_{i=1}^{m} \rho^i_{1:L} \sum_{t=1}^{L} \gamma^{t-1} R^i_t, \qquad \text{PDIS}(D) := \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{L} \gamma^{t-1} \rho^i_{1:t} R^i_t.$$

It was shown by Precup [2000] that under Assumption 1, IS$(D)$ and PDIS$(D)$ are unbiased estimators of $J(\pi_e)$. That is, $J(\pi_e) = \mathbb{E}_{\tau \sim p_{\pi_b}}[\text{IS}(\tau)] = \mathbb{E}_{\tau \sim p_{\pi_b}}[\text{PDIS}(\tau)]$. Unfortunately, however, both IS and PDIS directly depend on the product of importance ratios and thus can often suffer from variance that is exponential in the horizon length $L$, known as the curse of horizon [Jiang and Li, 2016, Guo et al., 2017, Liu et al., 2018].

Distribution-Based Importance Sampling: To eliminate the dependency on trajectory length, recent works apply importance sampling over the state-action space rather than the trajectory space. For any policy $\pi$, let $d^\pi_t$ denote the induced state-action distribution at time step $t$, i.e., $d^\pi_t(s, a) = p_\pi(S_t = s, A_t = a)$. Let the average state-action distribution be $d^\pi(s, a) := \left(\sum_{t=1}^{L} \gamma^{t-1} d^\pi_t(s, a)\right) / \left(\sum_{t=1}^{L} \gamma^{t-1}\right)$. This gives the likelihood of encountering $(s, a)$ when following policy $\pi$ and averaging over time with $\gamma$-discounting. Let $(S, A) \sim d^\pi$ and $(S, A) \sim d^\pi_t$ denote that $(S, A)$ are sampled from $d^\pi$ and $d^\pi_t$, respectively. The performance of $\pi_e$ can be expressed as

$$J(\pi_e) = \sum_{t=1}^{L} \gamma^{t-1} \sum_{s,a} d^{\pi_e}_t(s, a)\, r(s, a) = \Big(\sum_{t=1}^{L} \gamma^{t-1}\Big) \sum_{s,a} d^{\pi_e}(s, a)\, r(s, a) \overset{(a)}{=} \Big(\sum_{t=1}^{L} \gamma^{t-1}\Big) \sum_{s,a} \frac{d^{\pi_e}(s, a)}{d^{\pi_b}(s, a)}\, d^{\pi_b}(s, a)\, r(s, a) = \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{L} \gamma^{t-1} \frac{d^{\pi_e}(S_t, A_t)}{d^{\pi_b}(S_t, A_t)} R_t\right],$$

where (a) is possible due to Assumption 1. Using this observation, recent works have considered the following stationary-distribution importance sampling estimator [Liu et al., 2018, Yang et al., 2020, Jiang and Huang, 2020]:

$$\text{SIS}(D) := \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{L} \gamma^{t-1} w(S^i_t, A^i_t)\, R^i_t,$$

where $w(s, a) := \frac{d^{\pi_e}(s,a)}{d^{\pi_b}(s,a)}$ is the distribution correction ratio. Notice that SIS$(\tau)$ marginalizes over the product of importance ratios $\rho_{1:t}$, and thus can help mitigate the dependence on horizon length that drives the variance of the PDIS and IS estimators. When an unbiased estimate of $w$ is available, SIS$(\tau)$ is also an unbiased estimator, i.e., $\mathbb{E}_{\tau \sim p_{\pi_b}}[\text{SIS}(\tau)] = J(\pi_e)$. Unfortunately, such an estimate of $w$ is often not available. For large-scale problems, parametric estimation of $w$ is required in practice, and we replace the true density ratio $w$ with an estimate $\hat{w}$. However, estimating $w$ accurately may require both non-verifiable strong assumptions on the parametric function class and the global solution of a non-trivial min-max optimization problem [Jiang and Huang, 2020, Yang et al., 2020]. When these conditions are not met, SIS estimates can be arbitrarily biased, even when an infinite amount of data is available.
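As a reference point for the estimators above, the following sketch gives one possible implementation of PDIS and SIS on a batch of trajectories. The names are hypothetical and not from the paper: `pi_e(a, s)` and `pi_b(a, s)` are assumed to return action probabilities, and `w_hat(s, a)` is assumed to be a density-ratio estimate supplied by some external procedure.

```python
import numpy as np

def pdis(trajectories, pi_e, pi_b, gamma):
    """Per-decision IS: weight each reward R_t by the running product rho_{1:t}."""
    estimates = []
    for traj in trajectories:                  # traj: list of (s, a, r) tuples
        total, rho = 0.0, 1.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)     # rho_{1:t}
            total += gamma ** t * rho * r
        estimates.append(total)
    return float(np.mean(estimates))

def sis(trajectories, w_hat, gamma):
    """Stationary-distribution IS: weight each reward R_t by w_hat(S_t, A_t)."""
    estimates = []
    for traj in trajectories:
        total = sum(gamma ** t * w_hat(s, a) * r for t, (s, a, r) in enumerate(traj))
        estimates.append(total)
    return float(np.mean(estimates))
```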
3 Combining Trajectory-Based and Density-Based Importance Sampling

Trajectory-based and distribution-based importance sampling methods are typically presented as alternative ways of applying importance sampling for off-policy evaluation. However, in this section we show that the choice of estimator is not binary, and that these two styles of computing importance weights can actually be combined into a single importance sampling estimate. Furthermore, using this combination, in the next section we derive a spectrum of estimators that allows interpolation between the trajectory-based PDIS and the distribution-based SIS, which will often allow us to trade off between the strengths and weaknesses of these methods.

Intuitively, trajectory-based and distribution-based importance sampling provide two different ways of correcting the distribution mismatch under the evaluation and behavior policies. Trajectory-based importance sampling corrects the distribution mismatch by examining how likely the policies are to take the same sequence of actions, and thus applies the action likelihood ratio as the correction term. Distribution-based importance sampling corrects the mismatch by how likely the policies are to visit the same state-action pairs, while remaining agnostic to how they arrived there, and applies the distribution ratio as the importance weight. However, using distribution ratio and action likelihood ratio correction terms is not mutually exclusive, and one can draw on both types of correction terms to derive combined estimators.

To build intuition for why likelihood ratios and distribution ratios can naturally be combined, we consider the two rooms domain shown in Figure 1. In this example, there are two policies, $\pi_b$ and $\pi_e$, which have different strategies for navigating from the first room to the second room.

Figure 1: Illustration of the two room domain. The domain consists of two rooms, the left room and the right room, separated by a connecting door. $\pi_b$ and $\pi_e$ are two different policies that move from the left room to the right room. Note that, although $\pi_b$ and $\pi_e$ behave differently in the left room and the right room, both pass through the connecting door.

Note that while the behaviors of the two policies are very different in the left room, both policies must pass through the connecting door at some point in time to get to the right room. Conditioned on having passed through the connecting door at a point in time, all parts of the trajectory that occur in the right room are independent of what has occurred in the left room, by the Markov property. Thus, when considering a reward $R_t$ that occurs in the right room, it is natural to consider the probability of reaching the door and then the probability of the action sequence taken in the right room under each policy.

Now, we formalize this intuition and show how trajectory-based and density-based importance sampling can be combined in the same estimator. Given a trajectory $\tau$, we can consider $(S_z, A_z)$, the state and action at time $z$ in the trajectory. By conditioning on $(S_z, A_z)$, the trajectory $\tau$ can be separated into two conditionally independent partial trajectories, the segment up to time $z$ and the segment $\tau_{z+1:L}$, by the Markov property. Since the segments of $\tau$ before and after time $z$ are conditionally independent, $\rho_{1:z}$, the likelihood ratio for the trajectory up to time $z$, is conditionally independent of $\rho_{z+1:L}$ and of $R_t$ for all $t > z$. Formally, let $(S_z, A_z) \sim d^{\pi_b}_z$; then,

$$J(\pi_e) = \mathbb{E}_{\tau \sim p_{\pi_b}}[\text{PDIS}(\tau)] = \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{L} \gamma^{t-1} \rho_{1:t} R_t\right] = \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{z} \gamma^{t-1} \rho_{1:t} R_t + \sum_{t=z+1}^{L} \gamma^{t-1} \rho_{1:z}\, \rho_{z+1:t} R_t\right]$$
$$= \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{z} \gamma^{t-1} \rho_{1:t} R_t\right] + \sum_{t=z+1}^{L} \gamma^{t-1}\, \mathbb{E}_{\tau \sim p_{\pi_b}}\Big[\mathbb{E}_{\tau \sim p_{\pi_b}}[\rho_{1:z} \mid S_z, A_z]\; \mathbb{E}_{\tau \sim p_{\pi_b}}[\rho_{z+1:t} R_t \mid S_z, A_z]\Big]$$
$$\overset{(a)}{=} \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{z} \gamma^{t-1} \rho_{1:t} R_t\right] + \sum_{t=z+1}^{L} \gamma^{t-1}\, \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\frac{d^{\pi_e}_z(S_z, A_z)}{d^{\pi_b}_z(S_z, A_z)}\, \rho_{z+1:t} R_t\right]$$
$$= \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{z} \gamma^{t-1} \rho_{1:t} R_t + \sum_{t=z+1}^{L} \gamma^{t-1} \frac{d^{\pi_e}_z(S_z, A_z)}{d^{\pi_b}_z(S_z, A_z)}\, \rho_{z+1:t} R_t\right], \qquad (1)$$

where (a) follows from Property 1 below, which states that the expected value of the product of likelihood ratios $\rho_{1:z}$ conditioned on $(S_z, A_z)$ is equal to the time-dependent state-action distribution ratio at $(S_z, A_z)$. We provide a detailed proof of Property 1 in Appendix A.

Property 1 ([Liu et al., 2018]). Under Assumption 1, $\mathbb{E}_{\tau \sim p_{\pi_b}}[\rho_{1:t} \mid S_t = s, A_t = a] = \frac{d^{\pi_e}_t(s, a)}{d^{\pi_b}_t(s, a)}$.

Observe that Eq. (1) is indexed by the time $z$. Intuitively, $z$ can be thought of as the time to switch from using distribution ratios to action likelihood ratios in the importance weight. Specifically, the distribution ratio is used to estimate the probability of being in state $S_z$ and taking action $A_z$ at time $z$, and action likelihood ratios are used to correct for the probability of actions taken after time $z$.
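To make the split in Eq. (1) concrete, the sketch below (hypothetical names; `d_ratio_z(s, a, z)` is assumed to return the time-dependent ratio $d^{\pi_e}_z(s,a)/d^{\pi_b}_z(s,a)$) computes, for a reward at time $t$, both the full PDIS weight $\rho_{1:t}$ and the combined weight $\frac{d^{\pi_e}_z(S_z,A_z)}{d^{\pi_b}_z(S_z,A_z)}\rho_{z+1:t}$ that Eq. (1) uses for $t > z$; by Property 1 the two agree in expectation.

```python
def full_weight(traj, t, pi_e, pi_b):
    """rho_{1:t}: product of action likelihood ratios up to and including (0-indexed) time t."""
    w = 1.0
    for k in range(t + 1):
        s, a, _ = traj[k]
        w *= pi_e(a, s) / pi_b(a, s)
    return w

def combined_weight(traj, t, z, pi_e, pi_b, d_ratio_z):
    """d^{pi_e}_z(S_z, A_z) / d^{pi_b}_z(S_z, A_z) * rho_{z+1:t}, the Eq. (1) weight for t > z."""
    s_z, a_z, _ = traj[z]
    w = d_ratio_z(s_z, a_z, z)
    for k in range(z + 1, t + 1):
        s, a, _ = traj[k]
        w *= pi_e(a, s) / pi_b(a, s)
    return w
```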
Further observe that $z$ does not have to be a fixed constant; $z(t)$ can be a function of $t$, so that each reward $R_t$ in the trajectory can utilize a different switching time. In the next section, we show that by using a function $z(t)$ that allows the switching time to be time-dependent, we are able to further marginalize over time and create an estimator that interpolates using the average state-action distribution ratio $w(s, a) = \frac{d^{\pi_e}(s,a)}{d^{\pi_b}(s,a)}$ rather than the time-dependent distribution ratios $\frac{d^{\pi_e}_t(s,a)}{d^{\pi_b}_t(s,a)}$.

Figure 2: Illustrations of the PDIS, SOPE$_n$, and SIS estimators. The dotted blue line represents an example trajectory drawn from $\pi_e$, and the solid red line represents an example trajectory from $\pi_b$. All three importance sampling methods work by re-weighting each reward $R_t$ in the trajectory from $\pi_b$. (a) Trajectory-based PDIS re-weights each reward by $\frac{p_{\pi_e}(\tau_{1:t})}{p_{\pi_b}(\tau_{1:t})}$, the probability ratio of the sub-trajectory leading up to $R_t$ under $\pi_e$ and $\pi_b$, respectively; this factors into $\rho_{1:t}$, the product of $t$ action likelihood ratios. (b) SOPE$_n$ combines trajectory and distribution importance sampling weights by considering the probability of each policy visiting $(S_{t-n}, A_{t-n})$, the state-action pair $n$ steps in the past, and additionally the probability of the sub-trajectory $\tau_{t-n+1:t}$ from $n$ steps in the past up to $t$; thus, SOPE$_n$ re-weights $R_t$ by $\frac{d^{\pi_e}(S_{t-n}, A_{t-n})}{d^{\pi_b}(S_{t-n}, A_{t-n})}\, \rho_{t-n+1:t}$. (c) Distribution-based SIS considers the probability of encountering $(S_t, A_t)$ under $\pi_e$ and $\pi_b$, and re-weights $R_t$ by $\frac{d^{\pi_e}(S_t, A_t)}{d^{\pi_b}(S_t, A_t)}$.

4 Bias-Variance Trade-off using n-step Interpolation Between PDIS and SIS

We now build upon the ideas from Section 3 to derive a spectrum of off-policy estimators that allows for interpolation between the trajectory-based PDIS and distribution-based SIS estimators. This spectrum contains PDIS and SIS at the endpoints and allows for smooth interpolation between them to obtain new estimators that can often trade off the strengths and weaknesses of PDIS and SIS. An illustration of the key idea can be found in Figure 2.

One simple way to perform this trade-off is to control the number of terms in the product forming the action likelihood ratio for each reward $R_t$. Specifically, for any reward $R_t$, we propose including only the $n$ most recent action likelihood ratios $\rho_{t-n+1:t}$ in the importance weight, rather than $\rho_{1:t}$. Thus, the overall importance weight becomes the re-weighted probability of visiting $(S_{t-n}, A_{t-n})$, followed by the re-weighted probability of taking the last $n$ actions leading up to reward $R_t$. This reduces the exponential impact that the horizon length $L$ has on the variance of PDIS, and provides control over this reduction via the parameter $n$.

To obtain an estimator that performs this trade-off, we start with the derivation in (1) with $z(t) = t - n$, and then accumulate the time-dependent state-action distributions $d_t$ over time. The final expression for the finite horizon setting requires some additional constructs and is thus presented, along with its derivation and additional discussion, in Appendix B. In the following we present the result for the infinite horizon setting:

$$J(\pi_e) = \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{n} \gamma^{t-1} \rho_{1:t} R_t + \sum_{t=n+1}^{\infty} \gamma^{t-1} \frac{d^{\pi_e}(S_{t-n}, A_{t-n})}{d^{\pi_b}(S_{t-n}, A_{t-n})}\, \rho_{t-n+1:t} R_t\right]. \qquad (2)$$

Using the sample estimate of (2), we obtain the Spectrum of Off-Policy Estimators (SOPE$_n$),

$$\text{SOPE}_n(D) := \frac{1}{m} \sum_{i=1}^{m} \left[\sum_{t=1}^{n} \gamma^{t-1} \rho^i_{1:t} R^i_t + \sum_{t=n+1}^{\infty} \gamma^{t-1} \frac{d^{\pi_e}(S^i_{t-n}, A^i_{t-n})}{d^{\pi_b}(S^i_{t-n}, A^i_{t-n})}\, \rho^i_{t-n+1:t} R^i_t\right].$$
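A minimal sketch of the SOPE$_n$ estimate for a batch of trajectories, assuming an estimated average-distribution ratio `w_hat(s, a)` and action-probability callables `pi_e`/`pi_b` (illustrative names, not from the paper). For each reward it multiplies the distribution ratio at the state-action pair $n$ steps earlier by the last $n$ action likelihood ratios, and falls back to the full product for the first $n$ steps, mirroring the two sums above.

```python
import numpy as np

def sope_n(trajectories, n, w_hat, pi_e, pi_b, gamma):
    """Spectrum of off-policy estimators: n = 0 recovers SIS, n >= L recovers PDIS."""
    estimates = []
    for traj in trajectories:                            # traj: list of (s, a, r), 0-indexed time
        rhos = [pi_e(a, s) / pi_b(a, s) for (s, a, _) in traj]
        total = 0.0
        for t, (s, a, r) in enumerate(traj):
            if t < n:                                    # first n steps: full product rho_{1:t}
                weight = float(np.prod(rhos[: t + 1]))
            else:                                        # distribution ratio n steps back, then last n ratios
                s_back, a_back, _ = traj[t - n]
                weight = w_hat(s_back, a_back) * float(np.prod(rhos[t - n + 1 : t + 1]))
            total += gamma ** t * weight * r
        estimates.append(total)
    return float(np.mean(estimates))
```

Setting `n = 0` reduces this sketch to SIS, and setting `n` at or beyond the trajectory length reduces it to PDIS, matching the endpoints discussed next.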
Remark 1. Note that since we generally do not have access to the true density ratios, in practice we substitute $w$ with estimated density ratios $\hat{w}$, as in SIS. Since SOPE$_n$ is agnostic to how $\hat{w}$ is estimated, it can readily leverage existing and new methods for estimating $\hat{w}$.

Observe that SOPE$_n$ does not just give a single estimator, but a spectrum of off-policy estimators indexed by $n$. An illustration of this spectrum can be seen in Figure 3.

Figure 3: On the left side of the figure, we show an illustration of the SOPE$_n$ spectrum of estimators. For the purpose of this illustration, consider that only the last time step has a non-zero reward $R_L$. The SOPE$_n$ spectrum allows for control of how much an estimate depends on distribution ratios versus action likelihood ratios. Notice that SOPE$_0$ results in the SIS estimator, SOPE$_L$ results in the PDIS estimator, and other values of $n$ result in new interpolated estimators. As an analogy, consider the backup diagram [Sutton and Barto, 2018] for the n-step q-estimate, as illustrated on the right-hand side of the solid vertical line. In the n-step q-estimate, returns are backed up from possible future outcomes, whereas in the n-step interpolation estimators the probabilities are backed up from the possible histories. (In the diagram, the bias-variance characterization of PDIS and SIS is based on typical practical observations [Voloshin et al., 2019, Fu et al., 2021]; however, it is worth noting that SIS is not biased when oracle density ratios are available, and there are also edge cases, particularly for short horizon problems, where SIS can have higher variance than PDIS [Liu et al., 2020, Metelli et al., 2020].)

As $n$ decreases, the number of terms in the action likelihood ratio decreases, SOPE$_n$ depends more on the distribution correction ratio, and the estimator is more like SIS. Likewise, as $n$ increases, the number of terms in the action likelihood ratio increases, and SOPE$_n$ is closer to PDIS. Further note that for the endpoint values of this spectrum, $n = 0$ and $n = L$, SOPE$_n$ gives the SIS and PDIS estimators exactly (for PDIS, the horizon length needs to be $L$ instead of $\infty$ for the estimator to be well defined):

$$\text{SOPE}_0(D) = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{L} \gamma^{t-1} w(S^i_t, A^i_t)\, R^i_t = \text{SIS}(D), \qquad \text{SOPE}_L(D) = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{L} \gamma^{t-1} \rho^i_{1:t} R^i_t = \text{PDIS}(D).$$

5 Doubly-Robust and Weighted IS Extensions to SOPE$_n$

An additional advantage of SOPE$_n$ is that it can be readily extended to obtain a spectrum for other estimators. For instance, to mitigate variance further, a popular technique is to leverage domain knowledge from (imperfect) models using doubly-robust estimators [Jiang and Li, 2016, Jiang and Huang, 2020]. In the following we create a doubly-robust version of the SOPE$_n$ estimator. Before moving further, we introduce some additional notation. Let

$$w(t, n) := \begin{cases} \dfrac{d^{\pi_e}(S_{t-n}, A_{t-n})}{d^{\pi_b}(S_{t-n}, A_{t-n})} \displaystyle\prod_{j=0}^{n-1} \frac{\pi_e(A_{t-j}|S_{t-j})}{\pi_b(A_{t-j}|S_{t-j})} & \text{if } t > n, \\[2ex] \displaystyle\prod_{j=1}^{t} \frac{\pi_e(A_j|S_j)}{\pi_b(A_j|S_j)} & \text{if } 1 \le t \le n, \\[2ex] 1 & \text{otherwise.} \end{cases}$$

Let $q$ be an estimate of the q-value function for $\pi_e$, computed using the (imperfect) model. For brevity, we leave the distribution $p_{\pi_b}$ implicit in the expectations in this section. For a given value of $n$, the performance (2) of $\pi_e$ can then be expressed as

$$J(\pi_e) = \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} R_t\right].$$

We now use this form to create a spectrum of doubly-robust estimators:

$$J(\pi_e) = \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} R_t\right] - \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} q(S_t, A_t)\right] + \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} q(S_t, A_t)\right]$$
$$\overset{(a)}{=} \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} R_t\right] - \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} q(S_t, A_t)\right] + \mathbb{E}\left[\sum_{t=1}^{\infty} w(t-1, n)\, \gamma^{t-1} q(S_t, A^{\pi_e}_t)\right]$$
$$= \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} R_t\right] - \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} q(S_t, A_t)\right] + \mathbb{E}\left[w(0, n)\, \gamma^{0} q(S_1, A^{\pi_e}_1)\right] + \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t} q(S_{t+1}, A^{\pi_e}_{t+1})\right]$$
$$= \mathbb{E}\left[w(0, n)\, q(S_1, A^{\pi_e}_1)\right] + \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1}\Big(R_t + \gamma\, q(S_{t+1}, A^{\pi_e}_{t+1}) - q(S_t, A_t)\Big)\right], \qquad (3)$$

where in (a) we used the notation $A^{\pi_e}_t$ to indicate that $A_t \sim \pi_e(\cdot|S_t)$. Using $A^{\pi_e}_t$ eliminates the need for correcting $A_t$ sampled under $\pi_b$. We define DR-SOPE$_n$(D) to be the sample estimate of (3), i.e., a doubly-robust form of the SOPE$_n$(D) estimator.
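A sketch of one way to form the sample estimate of Eq. (3) under the reconstruction above; all names are hypothetical: `q_hat(s, a)` is the model-based q-estimate, `pi_e_probs(s)` is assumed to return a dict of action probabilities under $\pi_e$, the expectation over $A^{\pi_e}_t$ is computed exactly over the finite action set rather than sampled, and $q$ beyond the end of an observed trajectory is treated as zero.

```python
import numpy as np

def v_e(q_hat, pi_e_probs, s):
    """Expected q-value under pi_e: sum_a pi_e(a|s) * q_hat(s, a), over a finite action set."""
    return sum(p * q_hat(s, a) for a, p in pi_e_probs(s).items())

def dr_sope_n(trajectories, n, w_hat, pi_e, pi_b, pi_e_probs, q_hat, gamma):
    """Doubly-robust SOPE_n: sample estimate of Eq. (3), truncated at the end of each
    observed trajectory (q taken as zero past the last step)."""
    estimates = []
    for traj in trajectories:
        L = len(traj)
        rhos = [pi_e(a, s) / pi_b(a, s) for (s, a, _) in traj]
        total = v_e(q_hat, pi_e_probs, traj[0][0])   # w(0, n) = 1 term: E_{a~pi_e}[q(S_1, a)]
        for t, (s, a, r) in enumerate(traj):         # 0-indexed time
            if t < n:                                # w(t, n) = rho_{1:t}
                w_tn = float(np.prod(rhos[: t + 1]))
            else:                                    # w(t, n) = w_hat at t-n times last n ratios
                s_back, a_back, _ = traj[t - n]
                w_tn = w_hat(s_back, a_back) * float(np.prod(rhos[t - n + 1 : t + 1]))
            next_v = v_e(q_hat, pi_e_probs, traj[t + 1][0]) if t + 1 < L else 0.0
            total += gamma ** t * w_tn * (r + gamma * next_v - q_hat(s, a))
        estimates.append(total)
    return float(np.mean(estimates))
```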
It can now be observed that existing doubly-robust estimators are end-points of DR-SOPE$_n$(D) (for the trajectory-wise setting, the horizon length needs to be $L$ instead of $\infty$ for the estimator to be well defined):

DR-SOPE$_L$(D) = Trajectory-wise DR [Jiang and Li, 2016, Thomas and Brunskill, 2016],
DR-SOPE$_0$(D) = State-action distribution DR [Jiang and Huang, 2020, Kallus and Uehara, 2020].

A variation of PDIS that can often also help in mitigating the variance of the PDIS method is the Consistent Weighted Per-Decision Importance Sampling estimator (CWPDIS) [Thomas, 2015]. CWPDIS renormalizes the importance ratio at each time step by the sum of the importance weights, which causes CWPDIS to be biased (but consistent) and often have lower variance than PDIS:

$$\text{CWPDIS}(D) := \sum_{t=1}^{L} \gamma^{t-1} \frac{\sum_{i=1}^{m} \rho^i_{1:t} R^i_t}{\sum_{i=1}^{m} \rho^i_{1:t}}.$$

Similar to DR-SOPE$_n$, we can create a weighted version of the SOPE$_n$ estimator that interpolates between a weighted version of SIS and CWPDIS:

$$\text{W-SOPE}_n(D) := \sum_{t=1}^{L} \gamma^{t-1} \frac{\sum_{i=1}^{m} w^i(t, n)\, R^i_t}{\sum_{i=1}^{m} w^i(t, n)},$$

where $w^i(t, n)$ denotes $w(t, n)$ computed on trajectory $\tau^i$. Since, unlike PDIS, CWPDIS is a biased (but consistent) estimator, W-SOPE$_n$ interpolates between two biased estimators as endpoints. Nonetheless, we show experimentally in Section 6 that in practice W-SOPE$_n$ estimators for intermediate values of $n$ can still outperform weighted-SIS and CWPDIS.
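Assuming the per-time-step self-normalization form given above for W-SOPE$_n$ (mirroring CWPDIS), here is a minimal sketch with hypothetical names, assuming equal-length trajectories:

```python
import numpy as np

def sope_weight(traj, t, n, w_hat, pi_e, pi_b):
    """SOPE_n importance weight w(t, n) for the reward at (0-indexed) time t."""
    rhos = [pi_e(a, s) / pi_b(a, s) for (s, a, _) in traj]
    if t < n:
        return float(np.prod(rhos[: t + 1]))
    s_back, a_back, _ = traj[t - n]
    return w_hat(s_back, a_back) * float(np.prod(rhos[t - n + 1 : t + 1]))

def w_sope_n(trajectories, n, w_hat, pi_e, pi_b, gamma):
    """Weighted SOPE_n: normalize the SOPE_n weights across trajectories at each time step,
    analogous to how CWPDIS normalizes rho_{1:t}. Assumes equal-length trajectories."""
    L = len(trajectories[0])
    total = 0.0
    for t in range(L):
        weights = np.array([sope_weight(traj, t, n, w_hat, pi_e, pi_b) for traj in trajectories])
        rewards = np.array([traj[t][2] for traj in trajectories])
        total += gamma ** t * np.dot(weights, rewards) / np.sum(weights)
    return total
```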
6 Experimental Results

In this section, we present experimental results showing that interpolated estimators within the SOPE$_n$ and W-SOPE$_n$ spectrums can outperform the SIS/weighted-SIS and PDIS/CWPDIS endpoints. In each experiment, we evaluate SOPE$_n$ and W-SOPE$_n$ for different values of $n$ ranging from 0 to $L$. This allows us to compare the different estimators we get for each $n$ and see trends in how the performance changes as $n$ varies. Additionally, we plot estimates of the bias and the variance for the different values of $n$ to further investigate the properties of estimators in this spectrum. For our experiments, we utilize the environments and implementations of baseline estimators in the Caltech OPE Benchmarking Suite (COBS) [Voloshin et al., 2019]. In this section, we present results on the Graph and Toy Mountain Car environments. To obtain an estimate of the density ratios $\hat{w}$, we use COBS's implementation of the infinite-horizon methods from Liu et al. [2018]. Full experimental details and additional experimental results can be found in Appendix D. Additional experiments include an investigation of the impact of the degree of mismatch between $\pi_e$ and $\pi_b$ on SOPE$_n$ and W-SOPE$_n$, as well as additional experiments on the Mountain Car domain. The experimental results for the SOPE$_n$ and W-SOPE$_n$ estimators can be seen in Figures 4 and 5, respectively.

Figure 4 ((a) SOPE$_n$ on the Graph domain; (b) SOPE$_n$ on the Toy Mountain Car domain): Experimental results from evaluating the SOPE$_n$ estimator on the Graph and Toy Mountain Car domains. The x-axis of each plot indicates the value of $n$ in the SOPE$_n$ estimate. The shaded regions denote 95% confidence regions on the mean of the MSE. Recall that SOPE$_0$ gives SIS and SOPE$_L$ gives PDIS. The evaluation and behavior policies are $\pi_e(a=0) = 0.9$ and $\pi_b(a=0) = 0.5$ for the experiments on the Graph domain, and $\pi_e(a=0) = 0.5$ and $\pi_b(a=0) = 0.6$ for the Toy Mountain Car domain. In both of these domains, we can see that there exist interpolating estimators in the SOPE$_n$ spectrum that outperform SIS and PDIS, and that the SOPE$_n$ spectrum empirically performs a bias-variance trade-off.

We observe that for both SOPE$_n$ and W-SOPE$_n$, the plots of mean-squared error (MSE) have a U-shape, indicating that there exist interpolated estimators within the spectrum with lower MSE than the endpoints. Additionally, from the bias and variance plots, we can see that SOPE$_n$ performs a bias-variance trade-off in these experiments. We observe that as $n$ increases and the estimators become closer to PDIS, the bias decreases but the variance increases. Likewise, as $n$ decreases and the estimators become closer to SIS, the variance decreases but the bias increases. This bias-variance trade-off trend is very notable for the unweighted SOPE$_n$, which trades off between the biased SIS and unbiased PDIS endpoints. However, we can still see this trend even with the W-SOPE$_n$ estimator, although the trade-off is not as clean because W-SOPE$_n$ interpolates between the biased weighted-SIS and the also biased (but consistent) CWPDIS. Finally, note that our plots also show the results for different batch sizes of historical data. As the batch size increases, for some domains the PDIS/CWPDIS endpoints eventually outperform the SIS/weighted-SIS endpoints. However, even in this case, there still exist interpolated estimators that outperform both endpoints.

Figure 5 ((a) W-SOPE$_n$ on the Graph domain; (b) W-SOPE$_n$ on the Toy Mountain Car domain): Experimental results from evaluating the W-SOPE$_n$ estimator on the Graph and Toy Mountain Car domains. The x-axis of each plot indicates the value of $n$ in the W-SOPE$_n$ estimate. The shaded regions denote 95% confidence regions on the mean of the MSE. Recall that W-SOPE$_0$ gives weighted-SIS and W-SOPE$_L$ gives CWPDIS. The evaluation and behavior policies are $\pi_e(a=0) = 0.9$ and $\pi_b(a=0) = 0.7$ for the experiments on the Graph domain, and $\pi_e(a=0) = 0.5$ and $\pi_b(a=0) = 0.9$ for the Toy Mountain Car domain. In both of these domains, we can see that, although we do not get as clean a bias-variance trade-off as with SOPE$_n$, there still exist interpolating estimators in the W-SOPE$_n$ spectrum that outperform the weighted-SIS and CWPDIS endpoints.

7 Related Work

Off-policy evaluation (also related to counterfactual inference in the causality literature [Pearl, 2009]) is one of the most crucial aspects of RL, and importance sampling [Metropolis and Ulam, 1949, Horvitz and Thompson, 1952] plays a central role in it. Precup [2000] first introduced the IS, PDIS, and WIS estimates for OPE. Since then there has been a flurry of research in this direction: using partial models to develop doubly-robust estimators [Jiang and Li, 2016, Thomas and Brunskill, 2016], using multiple importance sampling [Papini et al., 2019, Metelli et al., 2020], estimating the behavior policy [Hanna et al., 2019], clipping importance ratios [Bottou et al., 2013, Thomas et al., 2015, Munos et al., 2016, Schulman et al., 2017], dropping importance ratios [Guo et al., 2017], importance sampling the entire return distribution [Chandak et al., 2021], importance resampling of trajectories [Schlegel et al., 2019], emphatic weighting of TD methods [Mahmood et al., 2015, Hallak et al., 2016, Patterson et al., 2021], and estimating state-action distributions [Hallak and Mannor, 2017, Liu et al., 2018, Gelada and Bellemare, 2019, Xie et al., 2019, Nachum and Dai, 2020, Yang et al., 2020, Zhang et al., 2020, Jiang and Huang, 2020, Uehara et al., 2020]. Perhaps the most relevant to our work are the recent works by Liu et al. [2020] and Rowland et al. [2020], which use the conditional IS (CIS) framework to show how IS, PDIS, and SIS are special instances of CIS.
Similarly, our proposed method for combining trajectory-based and density-based importance sampling also falls under the CIS framework. Liu et al. [2020] also showed that, in the finite horizon setting, none of IS, PDIS, or SIS always has lower variance than the others. Similarly, Rowland et al. [2020] used sufficient conditioning functions to create new off-policy estimators and showed that return-conditioned estimates (RCIS) can provide optimal variance reduction. However, using RCIS requires the challenging task of estimating density ratios for returns (rather than state-action pairs), and Liu et al. [2020] established a negative result that estimating these ratios using linear regression may recover the IS estimate itself. Our analysis complements these recent works by showing that there exist interpolated estimators that can provide lower-variance estimates than any of IS, PDIS, or SIS. Our proposed estimator SOPE$_n$ provides a natural interpolation technique to trade off between the strengths and weaknesses of these trajectory-based and density-based methods. Additionally, while it is known that $q^\pi(s, a)$ and $d^\pi(s, a)$ have a primal-dual connection [Wang et al., 2007], our time-based interpolation technique also sheds new light on connections between their n-step generalizations.

8 Conclusions

We present a new perspective on off-policy evaluation connecting two popular estimators, PDIS and SIS, and show that PDIS and SIS lie as endpoints on the Spectrum of Off-Policy Estimators (SOPE$_n$), which interpolates between them. Additionally, we derive weighted and doubly-robust versions of this spectrum of estimators. With our experimental results, we illustrate that estimators that lie in the interior of the SOPE$_n$ and W-SOPE$_n$ spectrums can be used to outperform their endpoints, SIS/weighted-SIS and PDIS/CWPDIS. While we are able to show that there exist SOPE$_n$ estimators that outperform PDIS and SIS, it remains future work to devise strategies to automatically select $n$ to trade off bias and variance. Future directions may include developing methods to select $n$, or to combine the estimators for all $n$ using λ-trace methods [Sutton and Barto, 2018], to best trade off bias and variance. Finally, like all off-policy evaluation methods, our approach carries risks if used inappropriately. When using OPE for sensitive or safety-critical applications such as medical domains, caution should be taken to carefully consider the variance and bias of the estimator that is used. In these cases, high-confidence OPE methods [Thomas et al., 2015] may be more appropriate.

9 Acknowledgement

We thank members of the Personal Autonomous Robotics Lab (PeARL) at the University of Texas at Austin for discussion and feedback on early stages of this work. We especially thank Jordan Schneider, Harshit Sikchi, and Prasoon Goyal for reading and giving suggestions on early drafts. We additionally thank Ziyang Tang for suggesting an additional marginalization step in the main proof that helped us unify the results for the finite and infinite horizon settings. The work also benefited from feedback by Nan Jiang during initial stages of this work. We would also like to thank the anonymous reviewers for their suggestions, which helped improve the paper. This work has taken place in part in the Personal Autonomous Robotics Lab (PeARL) at The University of Texas at Austin. PeARL research is supported in part by the NSF (IIS-1724157, IIS-1638107, IIS-1749204, IIS-1925082), ONR (N00014-18-2243), AFOSR (FA9550-20-1-0077), and ARO (78372-CS).
This research was also sponsored by the Army Research Office under Cooperative Agreement Number W911NF-19-2-0333, a gift from Adobe, NSF award #2018372, and the DEVCOM Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Daniel Brown, Russell Coleman, Ravi Srinivasan, and Scott Niekum. Safe imitation learning via fast Bayesian reward inference from preferences. In International Conference on Machine Learning, pages 1165–1177. PMLR, 2020.

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S Thomas. Universal off-policy evaluation. arXiv preprint arXiv:2104.12820, 2021.

Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, et al. Benchmarks for deep off-policy evaluation. arXiv preprint arXiv:2103.16596, 2021.

Carles Gelada and Marc G Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3647–3655, 2019.

Zhaohan Daniel Guo, Philip S Thomas, and Emma Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. arXiv preprint arXiv:1703.03453, 2017.

Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In International Conference on Machine Learning, pages 1372–1383. PMLR, 2017.

Assaf Hallak, Aviv Tamar, Rémi Munos, and Shie Mannor. Generalized emphatic temporal difference learning: Bias-variance analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Josiah Hanna, Scott Niekum, and Peter Stone. Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pages 2605–2613. PMLR, 2019.

Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

Nan Jiang and Jiawei Huang. Minimax confidence interval for off-policy evaluation and policy optimization. arXiv preprint arXiv:2002.02081, 2020.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

Peng Liao, Predrag Klasnja, and Susan Murphy. Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, pages 1–10, 2020.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. arXiv preprint arXiv:1810.12429, 2018.

Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling. In International Conference on Machine Learning, pages 6184–6193. PMLR, 2020.

A Rupam Mahmood, Huizhen Yu, Martha White, and Richard S Sutton. Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569, 2015.

Alberto Maria Metelli, Matteo Papini, Nico Montali, and Marcello Restelli. Importance sampling techniques for policy optimization. Journal of Machine Learning Research, 21(141):1–75, 2020.

Nicholas Metropolis and Stanislaw Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.

Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G Bellemare. Safe and efficient off-policy reinforcement learning. arXiv preprint arXiv:1606.02647, 2016.

Ofir Nachum and Bo Dai. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.

Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, and Marcello Restelli. Optimistic policy optimization via multiple importance sampling. In International Conference on Machine Learning, pages 4989–4999. PMLR, 2019.

Andrew Patterson, Sina Ghiassian, D Gupta, A White, and M White. Investigating objectives for off-policy value estimation in reinforcement learning, 2021.

Judea Pearl. Causality. Cambridge University Press, 2009.

Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.

Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, and Will Dabney. Conditional importance sampling for off-policy learning. In International Conference on Artificial Intelligence and Statistics, pages 45–55. PMLR, 2020.

Matthew Schlegel, Wesley Chung, Daniel Graves, Jian Qian, and Martha White. Importance resampling for off-policy prediction. arXiv preprint arXiv:1906.04328, 2019.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.

Georgios Theocharous, Yash Chandak, Philip S Thomas, and Frits de Nijs. Reinforcement learning for strategic recommendations. arXiv preprint arXiv:2009.07346, 2020.

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.

Cameron Voloshin, Hoang M Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

Tao Wang, Michael Bowling, and Dale Schuurmans. Dual representations for dynamic programming and reinforcement learning. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 44–51. IEEE, 2007.

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. arXiv preprint arXiv:1906.03393, 2019.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. Off-policy evaluation via the regularized Lagrangian. arXiv preprint arXiv:2007.03438, 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020.