SOPE: Spectrum of Off-Policy Estimators

Christina J. Yuan, University of Texas at Austin, cjyuan@cs.utexas.edu
Yash Chandak, University of Massachusetts, ychandak@cs.umass.edu
Stephen Giguere, University of Texas at Austin, sgiguere@cs.utexas.edu
Philip S. Thomas, University of Massachusetts, pthomas@cs.umass.edu
Scott Niekum, University of Texas at Austin, sniekum@cs.utexas.edu

Abstract: Many sequential decision making problems are high-stakes and require off-policy evaluation (OPE) of a new policy using historical data collected under some other policy. One of the most common OPE techniques that provides unbiased estimates is trajectory-based importance sampling (IS). However, due to the high variance of trajectory IS estimates, importance sampling methods based on state-action visitation distributions (SIS) have recently been adopted. Unfortunately, while SIS often provides lower-variance estimates for long horizons, estimating the state-action distribution ratios can be challenging and lead to biased estimates. In this paper, we present a new perspective on this bias-variance trade-off and show the existence of a spectrum of estimators whose endpoints are SIS and IS. Additionally, we establish a spectrum for doubly-robust and weighted versions of these estimators. We provide empirical evidence that estimators in this spectrum can be used to trade off between the bias and variance of IS and SIS and can achieve lower mean-squared error than both IS and SIS.

1 Introduction

Many sequential decision making problems, such as automated health-care, robotics, and online recommendations, are high-stakes in terms of health, safety, or finance [Liao et al., 2020, Brown et al., 2020, Theocharous et al., 2020]. For such problems, collecting new data to evaluate the performance of a new decision rule, called an evaluation policy $\pi_e$, may be expensive or even dangerous if $\pi_e$ results in undesired outcomes. Therefore, one of the most important challenges in such problems is the estimation of the performance $J(\pi_e)$ of the policy $\pi_e$ before its deployment. Many off-policy evaluation (OPE) methods enable estimation of $J(\pi_e)$ with historical data collected using an existing decision rule, called a behavior policy $\pi_b$. One popular OPE technique is trajectory-based importance sampling (IS) [Precup, 2000]. While this method is non-parametric and provides unbiased estimates of $J(\pi_e)$, it suffers from the curse of horizon and can have variance exponential in the horizon length [Jiang and Li, 2016, Guo et al., 2017]. To mitigate this problem, recent methods use stationary distribution importance sampling (SIS) to adjust the stationary distribution of the Markov chain induced by the policies, instead of the individual trajectories [Liu et al., 2018, Gelada and Bellemare, 2019, Nachum and Dai, 2020]. This requires (parametric) estimation of the ratio between the stationary distributions induced by $\pi_e$ and $\pi_b$. Unfortunately, estimating this ratio accurately can require unverifiably strong assumptions on the parameterization [Jiang and Huang, 2020], and often requires solving non-trivial min-max saddle-point optimization problems [Yang et al., 2020]. Consequently, if the parameterization is not rich enough, then it may not be possible to represent the distribution ratios accurately, and when rich function approximators (such as neural networks) are used, the optimization procedure may get stuck in sub-optimal saddle points.
In practice, these challenges can introduce error when estimating the distribution ratio, potentially leading to arbitrarily biased estimates of $J(\pi_e)$, even when an infinite amount of data is available. In this work, we present a new perspective on the bias-variance trade-off for OPE that bridges the unbiasedness of IS and the often lower variance of SIS. In particular, we show that: (i) there exists a spectrum of OPE estimators whose end-points are IS and SIS, respectively; (ii) estimators in this spectrum can have lower mean-squared error than both IS and SIS; and (iii) this spectrum can also be established for doubly-robust and weighted versions of IS and SIS. In Sections 3 and 4 we show how trajectory-based and distribution-based methods can be combined. The core idea establishing the existence of this spectrum relies upon first splitting individual trajectories into two parts and then computing the probability of the first part using SIS and that of the latter using IS. In Section 5, we introduce weighted and doubly-robust extensions of the spectrum. Finally, in Section 6, we present empirical case studies to highlight the effectiveness of these new estimators.

2 Background

Notation: A Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, r, T, \gamma, d_1)$, where $\mathcal{S}$ is the state set, $\mathcal{A}$ is the action set, $r$ is the reward function, $T$ is the transition function, $\gamma$ is the discounting factor, and $d_1$ is the initial state distribution. Although our results extend to the continuous setting, for simplicity of notation we assume that $\mathcal{S}$ and $\mathcal{A}$ are finite. A policy $\pi$ is a distribution over $\mathcal{A}$, conditioned on the state. Starting from an initial state $S_1 \sim d_1$, a policy $\pi$ interacts with the environment iteratively by sampling an action $A_t \sim \pi(\cdot|S_t)$ at every time step $t$. The environment then produces a reward $R_t$ with expected value $r(S_t, A_t)$, and transitions to the next state $S_{t+1} \sim T(\cdot|S_t, A_t)$. Let $\tau := (S_1, A_1, R_1, S_2, \dots, S_L, A_L, R_L)$ be the sequence of random variables corresponding to a trajectory sampled from $\pi$, where $L$ is the horizon length. Let $p_\pi$ denote the distribution of $\tau$ under $\pi$.

Problem Statement: The performance of any policy $\pi$ is given by its value, defined as the expected discounted sum of rewards $J(\pi) := \mathbb{E}_{\tau \sim p_\pi}\left[\sum_{t=1}^{L} \gamma^{t-1} R_t\right]$. The infinite-horizon setting can be obtained by letting $L \to \infty$. In general, for any random variable, we use the superscript $i$ to denote the trajectory associated with it. The goal of the off-policy policy evaluation (OPE) problem is to estimate the performance $J(\pi_e)$ of an evaluation policy $\pi_e$ using only a batch of historical trajectories $D := \{\tau^i\}_{i=1}^{m}$ collected from a different behavior policy $\pi_b$. This problem is challenging because $J(\pi_e)$ must be estimated using only observational, off-policy data from the deployment of a different behavior policy $\pi_b$. Additionally, this problem might not be feasible if the data collected using $\pi_b$ is not informative about the outcomes possible under $\pi_e$. Therefore, to make the problem tractable, we make the following standard support assumption, which implies that any outcome possible under $\pi_e$ also has non-zero probability of occurring under $\pi_b$.

Assumption 1. For all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, the ratio $\frac{\pi_e(a|s)}{\pi_b(a|s)} < \infty$.
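To ground this notation for the code sketches that follow, here is a minimal Python sketch (all names are illustrative, not from the paper) that represents a trajectory as a list of (state, action, reward) tuples and forms the on-policy Monte Carlo estimate of $J(\pi)$, i.e., the quantity that OPE must recover without being able to run $\pi_e$.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """sum_{t=1}^{L} gamma^(t-1) * R_t for one trajectory (rewards listed in order)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def on_policy_value(trajectories, gamma):
    """Monte Carlo estimate of J(pi) from trajectories collected with pi itself.
    Each trajectory is a list of (state, action, reward) tuples."""
    returns = [discounted_return([r for (_, _, r) in traj], gamma) for traj in trajectories]
    return float(np.mean(returns))
```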
Trajectory-Based Importance Sampling: One of the earliest methods for estimating $J(\pi_e)$ is trajectory-based importance sampling. This method corrects the difference between the distributions induced by $\pi_b$ and $\pi_e$ by re-weighting the trajectories from $\pi_b$ in $D$ by the probability ratio of the trajectory under $\pi_e$ and $\pi_b$, i.e.,

$$\frac{p_{\pi_e}(\tau)}{p_{\pi_b}(\tau)} = \prod_{t=1}^{L} \frac{\pi_e(A_t|S_t)}{\pi_b(A_t|S_t)}.$$

Let the single-step action likelihood ratio be denoted $\rho_t := \frac{\pi_e(A_t|S_t)}{\pi_b(A_t|S_t)}$ and the likelihood ratio from steps $j$ to $k$ be denoted $\rho_{j:k} := \prod_{t=j}^{k} \rho_t$. The full-trajectory importance sampling (IS) estimator and the per-decision importance sampling (PDIS) estimator [Precup, 2000] can then be defined as:

$$\text{IS}(D) := \frac{1}{m} \sum_{i=1}^{m} \rho^i_{1:L} \sum_{t=1}^{L} \gamma^{t-1} R^i_t, \qquad \text{PDIS}(D) := \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{L} \gamma^{t-1} \rho^i_{1:t} R^i_t.$$

It was shown by Precup [2000] that under Assumption 1, IS$(D)$ and PDIS$(D)$ are unbiased estimators of $J(\pi_e)$. That is, $J(\pi_e) = \mathbb{E}_{\tau \sim p_{\pi_b}}[\text{IS}(\tau)] = \mathbb{E}_{\tau \sim p_{\pi_b}}[\text{PDIS}(\tau)]$. Unfortunately, however, both IS and PDIS directly depend on the product of importance ratios and thus can often suffer from variance that is exponential in the horizon length $L$, known as the curse of horizon [Jiang and Li, 2016, Guo et al., 2017, Liu et al., 2018].

Distribution-Based Importance Sampling: To eliminate the dependency on trajectory length, recent works apply importance sampling over the state-action space rather than the trajectory space. For any policy $\pi$, let $d^\pi_t$ denote the induced state-action distribution at time step $t$, i.e., $d^\pi_t(s, a) = p_\pi(S_t = s, A_t = a)$. Let the average state-action distribution be $d^\pi(s, a) := \left(\sum_{t=1}^{L} \gamma^{t-1} d^\pi_t(s, a)\right) / \left(\sum_{t=1}^{L} \gamma^{t-1}\right)$. This gives the likelihood of encountering $(s, a)$ when following policy $\pi$ and averaging over time with $\gamma$-discounting. Let $(S, A) \sim d^\pi$ and $(S, A) \sim d^\pi_t$ denote that $(S, A)$ are sampled from $d^\pi$ and $d^\pi_t$, respectively. The performance of $\pi_e$ can be expressed as

$$J(\pi_e) = \sum_{t=1}^{L} \gamma^{t-1} \sum_{s,a} d^{\pi_e}_t(s, a)\, r(s, a) = \Big(\sum_{t=1}^{L} \gamma^{t-1}\Big) \sum_{s,a} d^{\pi_e}(s, a)\, r(s, a) \overset{(a)}{=} \Big(\sum_{t=1}^{L} \gamma^{t-1}\Big) \sum_{s,a} \frac{d^{\pi_e}(s, a)}{d^{\pi_b}(s, a)}\, d^{\pi_b}(s, a)\, r(s, a) = \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{L} \gamma^{t-1} \frac{d^{\pi_e}(S_t, A_t)}{d^{\pi_b}(S_t, A_t)} R_t\right],$$

where (a) is possible due to Assumption 1. Using this observation, recent works have considered the following stationary-distribution importance sampling estimator [Liu et al., 2018, Yang et al., 2020, Jiang and Huang, 2020]:

$$\text{SIS}(D) := \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{L} \gamma^{t-1} w(S^i_t, A^i_t)\, R^i_t,$$

where $w(s, a) := \frac{d^{\pi_e}(s,a)}{d^{\pi_b}(s,a)}$ is the distribution correction ratio. Notice that SIS$(\tau)$ marginalizes over the product of importance ratios $\rho_{1:t}$, and thus can help mitigate the dependence on horizon length that drives the variance of the PDIS and IS estimators. When an unbiased estimate of $w$ is available, SIS$(\tau)$ is also an unbiased estimator, i.e., $\mathbb{E}_{\tau \sim p_{\pi_b}}[\text{SIS}(\tau)] = J(\pi_e)$. Unfortunately, such an estimate of $w$ is often not available. For large-scale problems, parametric estimation of $w$ is required in practice, and we replace the true density ratio $w$ with an estimate $\hat{w}$. However, estimating $w$ accurately may require both non-verifiable strong assumptions on the parametric function class and the global solution of a non-trivial min-max optimization problem [Jiang and Huang, 2020, Yang et al., 2020]. When these conditions are not met, SIS estimates can be arbitrarily biased, even when an infinite amount of data is available.
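As a reference point for the estimators above, the following sketch gives one possible implementation of PDIS and SIS on a batch of trajectories. The names are hypothetical and not from the paper: `pi_e(a, s)` and `pi_b(a, s)` are assumed to return action probabilities, and `w_hat(s, a)` is assumed to be a density-ratio estimate supplied by some external procedure.

```python
import numpy as np

def pdis(trajectories, pi_e, pi_b, gamma):
    """Per-decision IS: weight each reward R_t by the running product rho_{1:t}."""
    estimates = []
    for traj in trajectories:                  # traj: list of (s, a, r) tuples
        total, rho = 0.0, 1.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)     # rho_{1:t}
            total += gamma ** t * rho * r
        estimates.append(total)
    return float(np.mean(estimates))

def sis(trajectories, w_hat, gamma):
    """Stationary-distribution IS: weight each reward R_t by w_hat(S_t, A_t)."""
    estimates = []
    for traj in trajectories:
        total = sum(gamma ** t * w_hat(s, a) * r for t, (s, a, r) in enumerate(traj))
        estimates.append(total)
    return float(np.mean(estimates))
```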
3 Combining Trajectory-Based and Density-Based Importance Sampling

Trajectory-based and distribution-based importance sampling methods are typically presented as alternative ways of applying importance sampling for off-policy evaluation. However, in this section we show that the choice of estimator is not binary, and that these two styles of computing importance weights can actually be combined into a single importance sampling estimate. Furthermore, using this combination, in the next section we derive a spectrum of estimators that allows interpolation between the trajectory-based PDIS and the distribution-based SIS, which will often allow us to trade off between the strengths and weaknesses of these methods.

Intuitively, trajectory-based and distribution-based importance sampling provide two different ways of correcting the distribution mismatch under the evaluation and behavior policies. Trajectory-based importance sampling corrects the distribution mismatch by examining how likely the policies are to take the same sequence of actions, and thus applies the action likelihood ratio as the correction term. Distribution-based importance sampling corrects the mismatch by how likely the policies are to visit the same state-action pairs, while remaining agnostic to how they arrived there, and applies the distribution ratio as the importance weight. However, using distribution ratio and action likelihood ratio correction terms is not mutually exclusive, and one can draw on both types of correction terms to derive combined estimators.

To build intuition for why likelihood ratios and distribution ratios can naturally be combined, we consider the two rooms domain shown in Figure 1. In this example, there are two policies, $\pi_b$ and $\pi_e$, which have different strategies for navigating from the first room to the second room.

Figure 1: Illustration of the two room domain. The domain consists of two rooms, the left room and the right room, separated by a connecting door. $\pi_b$ and $\pi_e$ are two different policies that move from the left room to the right room. Note that, although $\pi_b$ and $\pi_e$ behave differently in the left room and the right room, both pass through the connecting door.

Note that while the behaviors of the two policies are very different in the left room, both policies must pass through the connecting door at some point in time to get to the right room. Conditioned on having passed through the connecting door at a point in time, all parts of the trajectory that occur in the right room are independent of what has occurred in the left room, by the Markov property. Thus, when considering a reward $R_t$ that occurs in the right room, it is natural to consider the probability of reaching the door and then the probability of the action sequence taken in the right room under each policy.

Now, we formalize this intuition and show how trajectory-based and density-based importance sampling can be combined in the same estimator. Given a trajectory $\tau$, we can consider $(S_z, A_z)$, the state and action at time $z$ in the trajectory. By conditioning on $(S_z, A_z)$, the trajectory $\tau$ can be separated into two conditionally independent partial trajectories, the segment up to time $z$ and the segment $\tau_{z+1:L}$, by the Markov property. Since the segments of $\tau$ before and after time $z$ are conditionally independent, $\rho_{1:z}$, the likelihood ratio for the trajectory up to time $z$, is conditionally independent of $\rho_{z+1:L}$ and of $R_t$ for all $t > z$. Formally, let $(S_z, A_z) \sim d^{\pi_b}_z$; then,

$$J(\pi_e) = \mathbb{E}_{\tau \sim p_{\pi_b}}[\text{PDIS}(\tau)] = \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{L} \gamma^{t-1} \rho_{1:t} R_t\right] = \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{z} \gamma^{t-1} \rho_{1:t} R_t + \sum_{t=z+1}^{L} \gamma^{t-1} \rho_{1:z}\, \rho_{z+1:t} R_t\right]$$
$$= \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{z} \gamma^{t-1} \rho_{1:t} R_t\right] + \sum_{t=z+1}^{L} \gamma^{t-1}\, \mathbb{E}_{\tau \sim p_{\pi_b}}\Big[\mathbb{E}_{\tau \sim p_{\pi_b}}[\rho_{1:z} \mid S_z, A_z]\; \mathbb{E}_{\tau \sim p_{\pi_b}}[\rho_{z+1:t} R_t \mid S_z, A_z]\Big]$$
$$\overset{(a)}{=} \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{z} \gamma^{t-1} \rho_{1:t} R_t\right] + \sum_{t=z+1}^{L} \gamma^{t-1}\, \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\frac{d^{\pi_e}_z(S_z, A_z)}{d^{\pi_b}_z(S_z, A_z)}\, \rho_{z+1:t} R_t\right]$$
$$= \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{z} \gamma^{t-1} \rho_{1:t} R_t + \sum_{t=z+1}^{L} \gamma^{t-1} \frac{d^{\pi_e}_z(S_z, A_z)}{d^{\pi_b}_z(S_z, A_z)}\, \rho_{z+1:t} R_t\right], \qquad (1)$$

where (a) follows from Property 1 below, which states that the expected value of the product of likelihood ratios $\rho_{1:z}$ conditioned on $(S_z, A_z)$ is equal to the time-dependent state-action distribution ratio at $(S_z, A_z)$. We provide a detailed proof of Property 1 in Appendix A.

Property 1 ([Liu et al., 2018]). Under Assumption 1, $\mathbb{E}_{\tau \sim p_{\pi_b}}[\rho_{1:t} \mid S_t = s, A_t = a] = \frac{d^{\pi_e}_t(s, a)}{d^{\pi_b}_t(s, a)}$.

Observe that Eq. (1) is indexed by the time $z$. Intuitively, $z$ can be thought of as the time to switch from using distribution ratios to action likelihood ratios in the importance weight. Specifically, the distribution ratio is used to estimate the probability of being in state $S_z$ and taking action $A_z$ at time $z$, and action likelihood ratios are used to correct for the probability of actions taken after time $z$.
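To make the split in Eq. (1) concrete, the sketch below (hypothetical names; `d_ratio_z(s, a, z)` is assumed to return the time-dependent ratio $d^{\pi_e}_z(s,a)/d^{\pi_b}_z(s,a)$) computes, for a reward at time $t$, both the full PDIS weight $\rho_{1:t}$ and the combined weight $\frac{d^{\pi_e}_z(S_z,A_z)}{d^{\pi_b}_z(S_z,A_z)}\rho_{z+1:t}$ that Eq. (1) uses for $t > z$; by Property 1 the two agree in expectation.

```python
def full_weight(traj, t, pi_e, pi_b):
    """rho_{1:t}: product of action likelihood ratios up to and including (0-indexed) time t."""
    w = 1.0
    for k in range(t + 1):
        s, a, _ = traj[k]
        w *= pi_e(a, s) / pi_b(a, s)
    return w

def combined_weight(traj, t, z, pi_e, pi_b, d_ratio_z):
    """d^{pi_e}_z(S_z, A_z) / d^{pi_b}_z(S_z, A_z) * rho_{z+1:t}, the Eq. (1) weight for t > z."""
    s_z, a_z, _ = traj[z]
    w = d_ratio_z(s_z, a_z, z)
    for k in range(z + 1, t + 1):
        s, a, _ = traj[k]
        w *= pi_e(a, s) / pi_b(a, s)
    return w
```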
Further observe that $z$ does not have to be a fixed constant; $z(t)$ can be a function of $t$, so that each reward $R_t$ in the trajectory can utilize a different switching time. In the next section, we show that by using a function $z(t)$ that allows the switching time to be time-dependent, we are able to further marginalize over time and create an estimator that interpolates using the average state-action distribution ratio $w(s, a) = \frac{d^{\pi_e}(s,a)}{d^{\pi_b}(s,a)}$ rather than the time-dependent distribution ratios $\frac{d^{\pi_e}_t(s,a)}{d^{\pi_b}_t(s,a)}$.

Figure 2: Illustrations of the PDIS, SOPE$_n$, and SIS estimators. The dotted blue line represents an example trajectory drawn from $\pi_e$, and the solid red line represents an example trajectory from $\pi_b$. All three importance sampling methods work by re-weighting each reward $R_t$ in the trajectory from $\pi_b$. (a) Trajectory-based PDIS re-weights each reward by $\frac{p_{\pi_e}(\tau_{1:t})}{p_{\pi_b}(\tau_{1:t})}$, the probability ratio of the sub-trajectory leading up to $R_t$ under $\pi_e$ and $\pi_b$, respectively; this factors into $\rho_{1:t}$, the product of $t$ action likelihood ratios. (b) SOPE$_n$ combines trajectory and distribution importance sampling weights by considering the probability of each policy visiting $(S_{t-n}, A_{t-n})$, the state-action pair $n$ steps in the past, and additionally the probability of the sub-trajectory $\tau_{t-n+1:t}$ from $n$ steps in the past up to $t$; thus, SOPE$_n$ re-weights $R_t$ by $\frac{d^{\pi_e}(S_{t-n}, A_{t-n})}{d^{\pi_b}(S_{t-n}, A_{t-n})}\, \rho_{t-n+1:t}$. (c) Distribution-based SIS considers the probability of encountering $(S_t, A_t)$ under $\pi_e$ and $\pi_b$, and re-weights $R_t$ by $\frac{d^{\pi_e}(S_t, A_t)}{d^{\pi_b}(S_t, A_t)}$.

4 Bias-Variance Trade-off using n-step Interpolation Between PDIS and SIS

We now build upon the ideas from Section 3 to derive a spectrum of off-policy estimators that allows for interpolation between the trajectory-based PDIS and distribution-based SIS estimators. This spectrum contains PDIS and SIS at the endpoints and allows for smooth interpolation between them to obtain new estimators that can often trade off the strengths and weaknesses of PDIS and SIS. An illustration of the key idea can be found in Figure 2.

One simple way to perform this trade-off is to control the number of terms in the product forming the action likelihood ratio for each reward $R_t$. Specifically, for any reward $R_t$, we propose including only the $n$ most recent action likelihood ratios $\rho_{t-n+1:t}$ in the importance weight, rather than $\rho_{1:t}$. Thus, the overall importance weight becomes the re-weighted probability of visiting $(S_{t-n}, A_{t-n})$, followed by the re-weighted probability of taking the last $n$ actions leading up to reward $R_t$. This reduces the exponential impact that the horizon length $L$ has on the variance of PDIS, and provides control over this reduction via the parameter $n$.

To obtain an estimator that performs this trade-off, we start with the derivation in (1) with $z(t) = t - n$, and then accumulate the time-dependent state-action distributions $d_t$ over time. The final expression for the finite horizon setting requires some additional constructs and is thus presented, along with its derivation and additional discussion, in Appendix B. In the following we present the result for the infinite horizon setting:

$$J(\pi_e) = \mathbb{E}_{\tau \sim p_{\pi_b}}\left[\sum_{t=1}^{n} \gamma^{t-1} \rho_{1:t} R_t + \sum_{t=n+1}^{\infty} \gamma^{t-1} \frac{d^{\pi_e}(S_{t-n}, A_{t-n})}{d^{\pi_b}(S_{t-n}, A_{t-n})}\, \rho_{t-n+1:t} R_t\right]. \qquad (2)$$

Using the sample estimate of (2), we obtain the Spectrum of Off-Policy Estimators (SOPE$_n$),

$$\text{SOPE}_n(D) := \frac{1}{m} \sum_{i=1}^{m} \left[\sum_{t=1}^{n} \gamma^{t-1} \rho^i_{1:t} R^i_t + \sum_{t=n+1}^{\infty} \gamma^{t-1} \frac{d^{\pi_e}(S^i_{t-n}, A^i_{t-n})}{d^{\pi_b}(S^i_{t-n}, A^i_{t-n})}\, \rho^i_{t-n+1:t} R^i_t\right].$$
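A minimal sketch of the SOPE$_n$ estimate for a batch of trajectories, assuming an estimated average-distribution ratio `w_hat(s, a)` and action-probability callables `pi_e`/`pi_b` (illustrative names, not from the paper). For each reward it multiplies the distribution ratio at the state-action pair $n$ steps earlier by the last $n$ action likelihood ratios, and falls back to the full product for the first $n$ steps, mirroring the two sums above.

```python
import numpy as np

def sope_n(trajectories, n, w_hat, pi_e, pi_b, gamma):
    """Spectrum of off-policy estimators: n = 0 recovers SIS, n >= L recovers PDIS."""
    estimates = []
    for traj in trajectories:                            # traj: list of (s, a, r), 0-indexed time
        rhos = [pi_e(a, s) / pi_b(a, s) for (s, a, _) in traj]
        total = 0.0
        for t, (s, a, r) in enumerate(traj):
            if t < n:                                    # first n steps: full product rho_{1:t}
                weight = float(np.prod(rhos[: t + 1]))
            else:                                        # distribution ratio n steps back, then last n ratios
                s_back, a_back, _ = traj[t - n]
                weight = w_hat(s_back, a_back) * float(np.prod(rhos[t - n + 1 : t + 1]))
            total += gamma ** t * weight * r
        estimates.append(total)
    return float(np.mean(estimates))
```

Setting `n = 0` reduces this sketch to SIS, and setting `n` at or beyond the trajectory length reduces it to PDIS, matching the endpoints discussed next.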
Remark 1. Note that since we generally do not have access to the true density ratios, in practice we substitute $w$ with estimated density ratios $\hat{w}$, as in SIS. Since SOPE$_n$ is agnostic to how $\hat{w}$ is estimated, it can readily leverage existing and new methods for estimating $\hat{w}$.

Observe that SOPE$_n$ does not just give a single estimator, but a spectrum of off-policy estimators indexed by $n$. An illustration of this spectrum can be seen in Figure 3.

Figure 3: On the left side of the figure, we show an illustration of the SOPE$_n$ spectrum of estimators. For the purpose of this illustration, consider that only the last time step has a non-zero reward $R_L$. The SOPE$_n$ spectrum allows for control of how much an estimate depends on distribution ratios versus action likelihood ratios. Notice that SOPE$_0$ results in the SIS estimator, SOPE$_L$ results in the PDIS estimator, and other values of $n$ result in new interpolated estimators. As an analogy, consider the backup diagram [Sutton and Barto, 2018] for the n-step q-estimate, as illustrated on the right-hand side of the solid vertical line. In the n-step q-estimate, returns are backed up from possible future outcomes, whereas in the n-step interpolation estimators the probabilities are backed up from the possible histories. (In the diagram, the bias-variance characterization of PDIS and SIS is based on typical practical observations [Voloshin et al., 2019, Fu et al., 2021]; however, it is worth noting that SIS is not biased when oracle density ratios are available, and there are also edge cases, particularly for short horizon problems, where SIS can have higher variance than PDIS [Liu et al., 2020, Metelli et al., 2020].)

As $n$ decreases, the number of terms in the action likelihood ratio decreases, SOPE$_n$ depends more on the distribution correction ratio, and the estimator is more like SIS. Likewise, as $n$ increases, the number of terms in the action likelihood ratio increases, and SOPE$_n$ is closer to PDIS. Further note that for the endpoint values of this spectrum, $n = 0$ and $n = L$, SOPE$_n$ gives the SIS and PDIS estimators exactly (for PDIS, the horizon length needs to be $L$ instead of $\infty$ for the estimator to be well defined):

$$\text{SOPE}_0(D) = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{L} \gamma^{t-1} w(S^i_t, A^i_t)\, R^i_t = \text{SIS}(D), \qquad \text{SOPE}_L(D) = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{L} \gamma^{t-1} \rho^i_{1:t} R^i_t = \text{PDIS}(D).$$

5 Doubly-Robust and Weighted IS Extensions to SOPE$_n$

An additional advantage of SOPE$_n$ is that it can be readily extended to obtain a spectrum for other estimators. For instance, to mitigate variance further, a popular technique is to leverage domain knowledge from (imperfect) models using doubly-robust estimators [Jiang and Li, 2016, Jiang and Huang, 2020]. In the following we create a doubly-robust version of the SOPE$_n$ estimator. Before moving further, we introduce some additional notation. Let

$$w(t, n) := \begin{cases} \dfrac{d^{\pi_e}(S_{t-n}, A_{t-n})}{d^{\pi_b}(S_{t-n}, A_{t-n})} \displaystyle\prod_{j=0}^{n-1} \frac{\pi_e(A_{t-j}|S_{t-j})}{\pi_b(A_{t-j}|S_{t-j})} & \text{if } t > n, \\[2ex] \displaystyle\prod_{j=1}^{t} \frac{\pi_e(A_j|S_j)}{\pi_b(A_j|S_j)} & \text{if } 1 \le t \le n, \\[2ex] 1 & \text{otherwise.} \end{cases}$$

Let $q$ be an estimate of the q-value function for $\pi_e$, computed using the (imperfect) model. For brevity, we leave the distribution $p_{\pi_b}$ implicit in the expectations in this section. For a given value of $n$, the performance (2) of $\pi_e$ can then be expressed as

$$J(\pi_e) = \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} R_t\right].$$

We now use this form to create a spectrum of doubly-robust estimators:

$$J(\pi_e) = \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} R_t\right] - \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} q(S_t, A_t)\right] + \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} q(S_t, A_t)\right]$$
$$\overset{(a)}{=} \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} R_t\right] - \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} q(S_t, A_t)\right] + \mathbb{E}\left[\sum_{t=1}^{\infty} w(t-1, n)\, \gamma^{t-1} q(S_t, A^{\pi_e}_t)\right]$$
$$= \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} R_t\right] - \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1} q(S_t, A_t)\right] + \mathbb{E}\left[w(0, n)\, \gamma^{0} q(S_1, A^{\pi_e}_1)\right] + \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t} q(S_{t+1}, A^{\pi_e}_{t+1})\right]$$
$$= \mathbb{E}\left[w(0, n)\, q(S_1, A^{\pi_e}_1)\right] + \mathbb{E}\left[\sum_{t=1}^{\infty} w(t, n)\, \gamma^{t-1}\Big(R_t + \gamma\, q(S_{t+1}, A^{\pi_e}_{t+1}) - q(S_t, A_t)\Big)\right], \qquad (3)$$

where in (a) we used the notation $A^{\pi_e}_t$ to indicate that $A_t \sim \pi_e(\cdot|S_t)$. Using $A^{\pi_e}_t$ eliminates the need for correcting $A_t$ sampled under $\pi_b$. We define DR-SOPE$_n$(D) to be the sample estimate of (3), i.e., a doubly-robust form of the SOPE$_n$(D) estimator.
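A sketch of one way to form the sample estimate of Eq. (3) under the reconstruction above; all names are hypothetical: `q_hat(s, a)` is the model-based q-estimate, `pi_e_probs(s)` is assumed to return a dict of action probabilities under $\pi_e$, the expectation over $A^{\pi_e}_t$ is computed exactly over the finite action set rather than sampled, and $q$ beyond the end of an observed trajectory is treated as zero.

```python
import numpy as np

def v_e(q_hat, pi_e_probs, s):
    """Expected q-value under pi_e: sum_a pi_e(a|s) * q_hat(s, a), over a finite action set."""
    return sum(p * q_hat(s, a) for a, p in pi_e_probs(s).items())

def dr_sope_n(trajectories, n, w_hat, pi_e, pi_b, pi_e_probs, q_hat, gamma):
    """Doubly-robust SOPE_n: sample estimate of Eq. (3), truncated at the end of each
    observed trajectory (q taken as zero past the last step)."""
    estimates = []
    for traj in trajectories:
        L = len(traj)
        rhos = [pi_e(a, s) / pi_b(a, s) for (s, a, _) in traj]
        total = v_e(q_hat, pi_e_probs, traj[0][0])   # w(0, n) = 1 term: E_{a~pi_e}[q(S_1, a)]
        for t, (s, a, r) in enumerate(traj):         # 0-indexed time
            if t < n:                                # w(t, n) = rho_{1:t}
                w_tn = float(np.prod(rhos[: t + 1]))
            else:                                    # w(t, n) = w_hat at t-n times last n ratios
                s_back, a_back, _ = traj[t - n]
                w_tn = w_hat(s_back, a_back) * float(np.prod(rhos[t - n + 1 : t + 1]))
            next_v = v_e(q_hat, pi_e_probs, traj[t + 1][0]) if t + 1 < L else 0.0
            total += gamma ** t * w_tn * (r + gamma * next_v - q_hat(s, a))
        estimates.append(total)
    return float(np.mean(estimates))
```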
It can now be observed that existing doubly-robust estimators are end-points of DR-SOPE$_n$(D) (for the trajectory-wise setting, the horizon length needs to be $L$ instead of $\infty$ for the estimator to be well defined):

DR-SOPE$_L$(D) = Trajectory-wise DR [Jiang and Li, 2016, Thomas and Brunskill, 2016],
DR-SOPE$_0$(D) = State-action distribution DR [Jiang and Huang, 2020, Kallus and Uehara, 2020].

A variation of PDIS that can often also help in mitigating the variance of the PDIS method is the Consistent Weighted Per-Decision Importance Sampling estimator (CWPDIS) [Thomas, 2015]. CWPDIS renormalizes the importance ratio at each time step by the sum of the importance weights, which causes CWPDIS to be biased (but consistent) and often have lower variance than PDIS:

$$\text{CWPDIS}(D) := \sum_{t=1}^{L} \gamma^{t-1} \frac{\sum_{i=1}^{m} \rho^i_{1:t} R^i_t}{\sum_{i=1}^{m} \rho^i_{1:t}}.$$

Similar to DR-SOPE$_n$, we can create a weighted version of the SOPE$_n$ estimator that interpolates between a weighted version of SIS and CWPDIS:

$$\text{W-SOPE}_n(D) := \sum_{t=1}^{L} \gamma^{t-1} \frac{\sum_{i=1}^{m} w^i(t, n)\, R^i_t}{\sum_{i=1}^{m} w^i(t, n)},$$

where $w^i(t, n)$ denotes $w(t, n)$ computed on trajectory $\tau^i$. Since, unlike PDIS, CWPDIS is a biased (but consistent) estimator, W-SOPE$_n$ interpolates between two biased estimators as endpoints. Nonetheless, we show experimentally in Section 6 that in practice W-SOPE$_n$ estimators for intermediate values of $n$ can still outperform weighted-SIS and CWPDIS.
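Assuming the per-time-step self-normalization form given above for W-SOPE$_n$ (mirroring CWPDIS), here is a minimal sketch with hypothetical names, assuming equal-length trajectories:

```python
import numpy as np

def sope_weight(traj, t, n, w_hat, pi_e, pi_b):
    """SOPE_n importance weight w(t, n) for the reward at (0-indexed) time t."""
    rhos = [pi_e(a, s) / pi_b(a, s) for (s, a, _) in traj]
    if t < n:
        return float(np.prod(rhos[: t + 1]))
    s_back, a_back, _ = traj[t - n]
    return w_hat(s_back, a_back) * float(np.prod(rhos[t - n + 1 : t + 1]))

def w_sope_n(trajectories, n, w_hat, pi_e, pi_b, gamma):
    """Weighted SOPE_n: normalize the SOPE_n weights across trajectories at each time step,
    analogous to how CWPDIS normalizes rho_{1:t}. Assumes equal-length trajectories."""
    L = len(trajectories[0])
    total = 0.0
    for t in range(L):
        weights = np.array([sope_weight(traj, t, n, w_hat, pi_e, pi_b) for traj in trajectories])
        rewards = np.array([traj[t][2] for traj in trajectories])
        total += gamma ** t * np.dot(weights, rewards) / np.sum(weights)
    return total
```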
6 Experimental Results

In this section, we present experimental results showing that interpolated estimators within the SOPE$_n$ and W-SOPE$_n$ spectrums can outperform the SIS/weighted-SIS and PDIS/CWPDIS endpoints. In each experiment, we evaluate SOPE$_n$ and W-SOPE$_n$ for different values of $n$ ranging from 0 to $L$. This allows us to compare the different estimators we get for each $n$ and see trends in how the performance changes as $n$ varies. Additionally, we plot estimates of the bias and the variance for the different values of $n$ to further investigate the properties of estimators in this spectrum. For our experiments, we utilize the environments and implementations of baseline estimators in the Caltech OPE Benchmarking Suite (COBS) [Voloshin et al., 2019]. In this section, we present results on the Graph and Toy Mountain Car environments. To obtain an estimate of the density ratios $\hat{w}$, we use COBS's implementation of the infinite-horizon methods from Liu et al. [2018]. Full experimental details and additional experimental results can be found in Appendix D. Additional experiments include an investigation of the impact of the degree of mismatch between $\pi_e$ and $\pi_b$ on SOPE$_n$ and W-SOPE$_n$, as well as additional experiments on the Mountain Car domain. The experimental results for the SOPE$_n$ and W-SOPE$_n$ estimators can be seen in Figures 4 and 5, respectively.

Figure 4 ((a) SOPE$_n$ on the Graph domain; (b) SOPE$_n$ on the Toy Mountain Car domain): Experimental results from evaluating the SOPE$_n$ estimator on the Graph and Toy Mountain Car domains. The x-axis of each plot indicates the value of $n$ in the SOPE$_n$ estimate. The shaded regions denote 95% confidence regions on the mean of the MSE. Recall that SOPE$_0$ gives SIS and SOPE$_L$ gives PDIS. The evaluation and behavior policies are $\pi_e(a=0) = 0.9$ and $\pi_b(a=0) = 0.5$ for the experiments on the Graph domain, and $\pi_e(a=0) = 0.5$ and $\pi_b(a=0) = 0.6$ for the Toy Mountain Car domain. In both of these domains, we can see that there exist interpolating estimators in the SOPE$_n$ spectrum that outperform SIS and PDIS, and that the SOPE$_n$ spectrum empirically performs a bias-variance trade-off.

We observe that for both SOPE$_n$ and W-SOPE$_n$, the plots of mean-squared error (MSE) have a U-shape, indicating that there exist interpolated estimators within the spectrum with lower MSE than the endpoints. Additionally, from the bias and variance plots, we can see that SOPE$_n$ performs a bias-variance trade-off in these experiments. We observe that as $n$ increases and the estimators become closer to PDIS, the bias decreases but the variance increases. Likewise, as $n$ decreases and the estimators become closer to SIS, the variance decreases but the bias increases. This bias-variance trade-off trend is very notable for the unweighted SOPE$_n$, which trades off between the biased SIS and unbiased PDIS endpoints. However, we can still see this trend even with the W-SOPE$_n$ estimator, although the trade-off is not as clean because W-SOPE$_n$ interpolates between the biased weighted-SIS and the also biased (but consistent) CWPDIS. Finally, note that our plots also show the results for different batch sizes of historical data. As the batch size increases, for some domains the PDIS/CWPDIS endpoints eventually outperform the SIS/weighted-SIS endpoints. However, even in this case, there still exist interpolated estimators that outperform both endpoints.

Figure 5 ((a) W-SOPE$_n$ on the Graph domain; (b) W-SOPE$_n$ on the Toy Mountain Car domain): Experimental results from evaluating the W-SOPE$_n$ estimator on the Graph and Toy Mountain Car domains. The x-axis of each plot indicates the value of $n$ in the W-SOPE$_n$ estimate. The shaded regions denote 95% confidence regions on the mean of the MSE. Recall that W-SOPE$_0$ gives weighted-SIS and W-SOPE$_L$ gives CWPDIS. The evaluation and behavior policies are $\pi_e(a=0) = 0.9$ and $\pi_b(a=0) = 0.7$ for the experiments on the Graph domain, and $\pi_e(a=0) = 0.5$ and $\pi_b(a=0) = 0.9$ for the Toy Mountain Car domain. In both of these domains, we can see that, although we do not get as clean a bias-variance trade-off as with SOPE$_n$, there still exist interpolating estimators in the W-SOPE$_n$ spectrum that outperform the weighted-SIS and CWPDIS endpoints.

7 Related Work

Off-policy evaluation (also related to counterfactual inference in the causality literature [Pearl, 2009]) is one of the most crucial aspects of RL, and importance sampling [Metropolis and Ulam, 1949, Horvitz and Thompson, 1952] plays a central role in it. Precup [2000] first introduced the IS, PDIS, and WIS estimates for OPE. Since then there has been a flurry of research in this direction: using partial models to develop doubly-robust estimators [Jiang and Li, 2016, Thomas and Brunskill, 2016], using multiple importance sampling [Papini et al., 2019, Metelli et al., 2020], estimating the behavior policy [Hanna et al., 2019], clipping importance ratios [Bottou et al., 2013, Thomas et al., 2015, Munos et al., 2016, Schulman et al., 2017], dropping importance ratios [Guo et al., 2017], importance sampling the entire return distribution [Chandak et al., 2021], importance resampling of trajectories [Schlegel et al., 2019], emphatic weighting of TD methods [Mahmood et al., 2015, Hallak et al., 2016, Patterson et al., 2021], and estimating state-action distributions [Hallak and Mannor, 2017, Liu et al., 2018, Gelada and Bellemare, 2019, Xie et al., 2019, Nachum and Dai, 2020, Yang et al., 2020, Zhang et al., 2020, Jiang and Huang, 2020, Uehara et al., 2020]. Perhaps the most relevant to our work are the recent works by Liu et al. [2020] and Rowland et al. [2020], which use the conditional IS (CIS) framework to show how IS, PDIS, and SIS are special instances of CIS.
Similarly, our proposed method for combining trajectory-based and density-based importance sampling also falls under the CIS framework. Liu et al. [2020] also showed that, in the finite horizon setting, none of IS, PDIS, or SIS always has lower variance than the others. Similarly, Rowland et al. [2020] used sufficient conditioning functions to create new off-policy estimators and showed that return-conditioned estimates (RCIS) can provide optimal variance reduction. However, using RCIS requires the challenging task of estimating density ratios for returns (rather than state-action pairs), and Liu et al. [2020] established a negative result that estimating these ratios using linear regression may recover the IS estimate itself. Our analysis complements these recent works by showing that there exist interpolated estimators that can provide lower-variance estimates than any of IS, PDIS, or SIS. Our proposed estimator SOPE$_n$ provides a natural interpolation technique to trade off between the strengths and weaknesses of these trajectory-based and density-based methods. Additionally, while it is known that $q^\pi(s, a)$ and $d^\pi(s, a)$ have a primal-dual connection [Wang et al., 2007], our time-based interpolation technique also sheds new light on connections between their n-step generalizations.

8 Conclusions

We present a new perspective on off-policy evaluation connecting two popular estimators, PDIS and SIS, and show that PDIS and SIS lie as endpoints on the Spectrum of Off-Policy Estimators (SOPE$_n$), which interpolates between them. Additionally, we derive weighted and doubly-robust versions of this spectrum of estimators. With our experimental results, we illustrate that estimators that lie in the interior of the SOPE$_n$ and W-SOPE$_n$ spectrums can be used to outperform their endpoints, SIS/weighted-SIS and PDIS/CWPDIS. While we are able to show that there exist SOPE$_n$ estimators that outperform PDIS and SIS, it remains future work to devise strategies to automatically select $n$ to trade off bias and variance. Future directions may include developing methods to select $n$, or to combine the estimators for all $n$ using λ-trace methods [Sutton and Barto, 2018], to best trade off bias and variance. Finally, like all off-policy evaluation methods, our approach carries risks if used inappropriately. When using OPE for sensitive or safety-critical applications such as medical domains, caution should be taken to carefully consider the variance and bias of the estimator that is used. In these cases, high-confidence OPE methods [Thomas et al., 2015] may be more appropriate.

9 Acknowledgement

We thank members of the Personal Autonomous Robotics Lab (PeARL) at the University of Texas at Austin for discussion and feedback on early stages of this work. We especially thank Jordan Schneider, Harshit Sikchi, and Prasoon Goyal for reading and giving suggestions on early drafts. We additionally thank Ziyang Tang for suggesting an additional marginalization step in the main proof that helped us unify the results for the finite and infinite horizon settings. The work also benefited from feedback by Nan Jiang during initial stages of this work. We would also like to thank the anonymous reviewers for their suggestions, which helped improve the paper. This work has taken place in part in the Personal Autonomous Robotics Lab (PeARL) at The University of Texas at Austin. PeARL research is supported in part by the NSF (IIS-1724157, IIS-1638107, IIS-1749204, IIS-1925082), ONR (N00014-18-2243), AFOSR (FA9550-20-1-0077), and ARO (78372-CS).
This research was also sponsored by the Army Research Office under Cooperative Agreement Number W911NF-19-2-0333, a gift from Adobe, NSF award #2018372, and the DEVCOM Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Daniel Brown, Russell Coleman, Ravi Srinivasan, and Scott Niekum. Safe imitation learning via fast Bayesian reward inference from preferences. In International Conference on Machine Learning, pages 1165–1177. PMLR, 2020.

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S Thomas. Universal off-policy evaluation. arXiv preprint arXiv:2104.12820, 2021.

Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, et al. Benchmarks for deep off-policy evaluation. arXiv preprint arXiv:2103.16596, 2021.

Carles Gelada and Marc G Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3647–3655, 2019.

Zhaohan Daniel Guo, Philip S Thomas, and Emma Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. arXiv preprint arXiv:1703.03453, 2017.

Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In International Conference on Machine Learning, pages 1372–1383. PMLR, 2017.

Assaf Hallak, Aviv Tamar, Rémi Munos, and Shie Mannor. Generalized emphatic temporal difference learning: Bias-variance analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Josiah Hanna, Scott Niekum, and Peter Stone. Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pages 2605–2613. PMLR, 2019.

Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

Nan Jiang and Jiawei Huang. Minimax confidence interval for off-policy evaluation and policy optimization. arXiv preprint arXiv:2002.02081, 2020.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

Peng Liao, Predrag Klasnja, and Susan Murphy. Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, pages 1–10, 2020.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. arXiv preprint arXiv:1810.12429, 2018.

Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling. In International Conference on Machine Learning, pages 6184–6193. PMLR, 2020.

A Rupam Mahmood, Huizhen Yu, Martha White, and Richard S Sutton. Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569, 2015.

Alberto Maria Metelli, Matteo Papini, Nico Montali, and Marcello Restelli. Importance sampling techniques for policy optimization. Journal of Machine Learning Research, 21(141):1–75, 2020.

Nicholas Metropolis and Stanislaw Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.

Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G Bellemare. Safe and efficient off-policy reinforcement learning. arXiv preprint arXiv:1606.02647, 2016.

Ofir Nachum and Bo Dai. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.

Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, and Marcello Restelli. Optimistic policy optimization via multiple importance sampling. In International Conference on Machine Learning, pages 4989–4999. PMLR, 2019.

Andrew Patterson, Sina Ghiassian, D Gupta, A White, and M White. Investigating objectives for off-policy value estimation in reinforcement learning, 2021.

Judea Pearl. Causality. Cambridge University Press, 2009.

Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.

Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, and Will Dabney. Conditional importance sampling for off-policy learning. In International Conference on Artificial Intelligence and Statistics, pages 45–55. PMLR, 2020.

Matthew Schlegel, Wesley Chung, Daniel Graves, Jian Qian, and Martha White. Importance resampling for off-policy prediction. arXiv preprint arXiv:1906.04328, 2019.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.

Georgios Theocharous, Yash Chandak, Philip S Thomas, and Frits de Nijs. Reinforcement learning for strategic recommendations. arXiv preprint arXiv:2009.07346, 2020.

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.

Cameron Voloshin, Hoang M Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

Tao Wang, Michael Bowling, and Dale Schuurmans. Dual representations for dynamic programming and reinforcement learning. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 44–51. IEEE, 2007.

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. arXiv preprint arXiv:1906.03393, 2019.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. Off-policy evaluation via the regularized Lagrangian. arXiv preprint arXiv:2007.03438, 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020.