# Universal Off-Policy Evaluation

Yash Chandak (University of Massachusetts), Scott Niekum (University of Texas at Austin), Bruno Castro da Silva (University of Massachusetts), Erik Learned-Miller (University of Massachusetts), Emma Brunskill (Stanford University), Philip S. Thomas (University of Massachusetts)

**Abstract:** When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO), one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.

## 1 Introduction

Problems requiring sequential decision-making are ubiquitous [5]. When online experimentation is costly or dangerous, it is essential to conduct off-policy evaluation before deploying a new policy; that is, one must leverage existing data collected using some policy β (called a behavior policy) to evaluate a performance metric of another policy π (called the evaluation policy). For problems with high stakes, such as in terms of health [56] or financial assets [86], it is also crucial to provide high-confidence bounds on the desired performance metric to ensure reliability and safety.

Perhaps the most widely studied performance metric in the off-policy setting is the expected return [83]. However, this metric can be limiting for many problems of interest. Safety-critical applications, such as automated healthcare, require minimizing the chances of risk-prone outcomes, and so performance metrics such as value at risk (VaR) or conditional value at risk (CVaR) are more appropriate [49, 14]. By contrast, applications like online recommendations are subject to noisy data and call for robust metrics like the median and other quantiles [2]. In order to improve user experiences, applications involving direct human-machine interaction, such as robotics and autonomous driving, focus on minimizing uncertainty in their outcomes and thus use metrics like variance and entropy [52, 84]. Recent work in distributional reinforcement learning (RL) has also investigated estimating the cumulative distribution of returns [7, 24] and its various statistical functionals [76]. While it may even be beneficial to use all of these different metrics simultaneously to inform better decision-making, even individually estimating and bounding any performance metric other than the mean and variance in the off-policy setting has remained an open problem. This raises the main question of interest: How do we develop a universal off-policy method, one that can estimate any desired performance metric and can also provide finite-sample confidence bounds that hold simultaneously with high probability for those metrics?
Prior Work: Off-policy methods can be broadly categorized as model-based or model-free [83]. Model-based methods typically require strong assumptions on the parametric model when statistical guarantees are needed. Further, using model-based approaches to estimate parameters other than the mean can also require estimating the distribution of rewards for every state-action pair in order to obtain the complete return distribution for any policy. By contrast, model-free methods are applicable to a wider variety of settings. Unfortunately, the popular technique of using importance-weighted returns [71] only corrects for the mean under the off-policy distribution. Recent work by Chandak et al. [18] provides a specialized extension that corrects only for the variance. Outside RL, work in the econometrics and causal inference literature has also considered quantile treatments [29, 99] and inference on counterfactual distributions [28, 20, 36], but these methods are not developed for sequential decisions and do not provide any high-confidence bounds with guaranteed coverage. Further, they often mandate stationarity, identically distributed data, and full observability (i.e., no confounding). Existing frequentist high-confidence bounds are not only specifically designed for either the mean or the variance, but also hold only individually [92, 45, 18]. Instead of frequentist intervals, a Bayesian posterior distribution over the mean return and various statistics of that distribution can also be obtained [105]. We are not aware of any method that provides off-policy bounds, or even estimates, for any parameter of the return, while also handling the different domain settings that are crucial for RL-related tasks. Therefore, a detailed discussion of existing work is deferred to Appendix C.

Contributions: We take the first steps towards a universal off-policy estimator (UnO) that estimates and bounds the entire distribution of returns, and then derives estimates and simultaneous bounds for all parameters of interest. With UnO, we make the following contributions:

A. For any distributional parameter (mean, variance, quantiles, entropy, CVaR, CDF, etc.), we provide an off-policy method to obtain (A.1) model-free estimators; (A.2) high-confidence bounds that have guaranteed coverage simultaneously for all parameters and that, perhaps surprisingly, often nearly match or outperform prior bounds specifically designed for the mean and the variance; and (A.3) approximate bounds using statistical bootstrapping that can often be significantly tighter.

B. The above advantages hold for (B.1) fully observable and partially observable (i.e., with unobserved confounders) settings, (B.2) Markovian and non-Markovian settings, and (B.3) settings with stationary, smoothly non-stationary, and discrete distribution shifts in a policy's performance.

Limitations: Our method uses importance sampling and thus (1) requires knowledge of action probabilities under the behavior policy β, (2) requires that any outcome under the evaluation policy has a sufficient probability of occurring under β, and (3) has estimators whose variance scales exponentially with the horizon length [39, 57], which may be unavoidable in non-Markovian domains [46].

Notation: For brevity, we first restrict our focus to the stationary setting. In Section 5, we discuss how to tackle non-stationarity and distribution shifts.
A partially observable Markov decision process (POMDP) is a tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{P}, \Omega, \mathcal{R}, \gamma, d_0)$, where $\mathcal{S}$ is the set of states, $\mathcal{O}$ is the set of observations, $\mathcal{A}$ is the set of actions, $\mathcal{P}$ is the transition function, $\Omega$ is the observation function, $\mathcal{R}$ is the reward function, $\gamma \in [0, 1]$ is the discount factor, and $d_0$ is the starting state distribution. Although our results extend to the continuous setting, for notational ease, we consider $\mathcal{S}$, $\mathcal{A}$, $\mathcal{O}$, and the set of rewards to be finite. Since the true underlying states are only partially observable, the resulting rewards and transitions from one partially observed state to another are potentially non-Markovian [80]. We write $S_t$, $O_t$, $A_t$, and $R_t$ to denote the random variables for state, observation, action, and reward, respectively, at time $t$. Let $\mathcal{D}$ be a data set $(H_i)_{i=1}^n$ collected using behavior policies $(\beta_i)_{i=1}^n$, where each $H_i$ denotes the observed trajectory $(O_0, A_0, \beta(A_0|O_0), R_0, O_1, \dots)$. Notice that an observed trajectory contains $\beta(A_t|O_t)$ and does not contain the states $S_t$, for all $t$. Let $G_i := \sum_{j=0}^{T} \gamma^j R_j$ be the return of $H_i$, where $\forall i$, $G_{\min} < G_i < G_{\max}$ for some finite constants $G_{\min}$ and $G_{\max}$, and $T$ is a finite horizon length. Let $G_\pi$ and $H_\pi$ be the random variables for returns and complete trajectories under any policy $\pi$, respectively. Since the sets of observations, actions, and rewards are finite, and $T$ is finite, the total number of possible trajectories is finite. Let $\mathcal{X}$ be the finite set of returns corresponding to these trajectories. Let $\mathcal{H}_\pi$ be the set of all possible trajectories for any policy $\pi$. Sometimes, to make the dependence explicit, we write $g(h)$ to denote the return of trajectory $h$. Further, to ensure that samples in $\mathcal{D}$ are informative, we make a standard assumption that any outcome under $\pi$ has sufficient probability of occurring under $\beta$ (see Appendix B.1 for further discussion of assumptions in general).

Assumption 1. The set $\mathcal{D}$ contains independent (not necessarily identically distributed) observed trajectories generated using $(\beta_i)_{i=1}^n$, such that for some (unknown) $\varepsilon > 0$, $(\beta_i(a|o) < \varepsilon) \implies (\pi(a|o) = 0)$, for all $o \in \mathcal{O}$, $a \in \mathcal{A}$, and $i \in \{1, 2, \dots, n\}$.

## 2 Idea Summary

For the desired universal method, instead of considering each parameter individually, we suggest estimating the entire cumulative distribution function (CDF) of returns first: $\forall \nu \in \mathbb{R}$, $F_\pi(\nu) := \Pr(G_\pi \le \nu)$. Any distributional parameter, $\psi(F_\pi)$, can then be estimated from the estimate of $F_\pi$. However, we only have off-policy data from a behavior policy $\beta$, and the typical use of importance sampling [71] only corrects for the mean return. To overcome this, we propose an estimator $\hat{F}_n$ that uses importance sampling from the perspective of the CDF to correct for the entire distribution of returns. The CDF estimate, $\hat{F}_n$, is then used to obtain a plug-in estimator $\psi(\hat{F}_n)$ for any distributional parameter $\psi(F_\pi)$. Next, we show that this CDF-centric perspective provides the additional advantage that, if we can compute a $1-\delta$ confidence band $\mathcal{F}: \mathbb{R} \to 2^{\mathbb{R}}$ such that $\Pr\big(\forall \nu \in \mathbb{R},\ \Pr(G_\pi \le \nu) \in \mathcal{F}(\nu)\big) \ge 1 - \delta$, then a $1-\delta$ upper (or lower) high-confidence bound on any parameter, $\psi(F_\pi)$, can be obtained by searching for a function $F$ that maximizes (or minimizes) $\psi(F)$ and for all $\nu \in \mathbb{R}$ has $F(\nu) \in \mathcal{F}(\nu)$.

## 3 UnO: Universal Off-Policy Estimator

In the on-policy setting, one approach for estimating any parameter of the returns, $G_\pi$, might be to first estimate its cumulative distribution $F_\pi$ and then use that to estimate its parameter $\psi(F_\pi)$.
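To make this two-step recipe concrete, here is a minimal on-policy sketch (illustrative helper names and synthetic returns standing in for rollouts of $\pi$; this is not the paper's code): build the empirical CDF of sampled returns, then read any plug-in parameter off of it.

```python
# A minimal on-policy sketch (illustrative helpers and synthetic returns standing in
# for rollouts of pi; not the paper's code): estimate the CDF of returns first, then
# read any plug-in parameter psi off of it.
import numpy as np

def empirical_cdf(returns):
    """Return a function nu -> fraction of sampled returns that are <= nu."""
    g = np.sort(np.asarray(returns, dtype=float))
    return lambda nu: np.searchsorted(g, nu, side="right") / len(g)

def plugin_quantile(returns, alpha):
    """alpha-quantile of the empirical return distribution."""
    g = np.sort(np.asarray(returns, dtype=float))
    idx = int(np.ceil(alpha * len(g))) - 1
    return g[max(idx, 0)]

def plugin_cvar(returns, alpha):
    """alpha-CVaR: mean of the returns at or below the alpha-quantile."""
    g = np.asarray(returns, dtype=float)
    return g[g <= plugin_quantile(g, alpha)].mean()

# Synthetic stand-in for on-policy returns of pi.
G = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=10_000)
F = empirical_cdf(G)
print(F(0.0), plugin_quantile(G, 0.5), plugin_cvar(G, 0.1))
```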
However, doing this in the off-policy setting requires additional consideration, as the entire distribution of the observed returns needs to be adjusted to estimate $F_\pi$, since the data is collected using behavior policies that can be different from the evaluation policy $\pi$. We begin by observing that, for all $\nu \in \mathbb{R}$, $F_\pi(\nu)$ can be expanded using the fact that the probability that the return $G_\pi$ equals $x$ is the sum of the probabilities of the trajectories $H_\pi$ whose return equals $x$:

$$F_\pi(\nu) = \Pr(G_\pi \le \nu) = \sum_{x \in \mathcal{X},\, x \le \nu} \Pr(G_\pi = x) = \sum_{x \in \mathcal{X},\, x \le \nu} \ \sum_{h \in \mathcal{H}_\pi} \Pr(H_\pi = h)\, \mathbb{1}_{\{g(h)=x\}}, \quad (1)$$

where $\mathbb{1}_A = 1$ if $A$ is true and $0$ otherwise. Now, observing that the indicator function can be one for at most a single value less than $\nu$, as $g(h)$ is a deterministic scalar given $h$, (1) can be expressed as

$$F_\pi(\nu) = \sum_{h \in \mathcal{H}_\pi} \Pr(H_\pi = h) \sum_{x \in \mathcal{X},\, x \le \nu} \mathbb{1}_{\{g(h)=x\}} = \sum_{h \in \mathcal{H}_\pi} \Pr(H_\pi = h)\, \mathbb{1}_{\{g(h) \le \nu\}}.$$

Now, from Assumption 1, as $\mathcal{H}_\pi \subseteq \mathcal{H}_\beta$,¹

$$F_\pi(\nu) = \sum_{h \in \mathcal{H}_\beta} \Pr(H_\beta = h)\, \frac{\Pr(H_\pi = h)}{\Pr(H_\beta = h)}\, \mathbb{1}_{\{g(h) \le \nu\}}. \quad (2)$$

The form of $F_\pi(\nu)$ in (2) is beneficial as it suggests a way to perform off-policy corrections not only for one specific parameter, as in prior works [71, 18], but for the entire cumulative distribution function (CDF) of the return $G_\pi$. Formally, let $\rho_i := \prod_{j=0}^{T} \frac{\pi(A_j|O_j)}{\beta_i(A_j|O_j)}$ denote the importance ratio for $H_i$, which is equal to $\Pr(H_\pi = h)/\Pr(H_\beta = h)$ (see Appendix D). Then, based on (2), we propose the following non-parametric and model-free estimator for $F_\pi$:

$$\forall \nu \in \mathbb{R}, \quad \hat{F}_n(\nu) := \frac{1}{n} \sum_{i=1}^{n} \rho_i\, \mathbb{1}_{\{G_i \le \nu\}}. \quad (3)$$

¹Results can be extended to hybrid probability measures using Radon-Nikodym derivatives.

Figure 1: An illustration of return distributions for $\pi$ and $\beta$. The CDF at any point $\nu$ corresponds to the area under the probability distribution up until $\nu$. Having order statistics $(G_{(i)})_{i=1}^5$ of samples $(G_i)_{i=1}^5$ drawn using $\beta$, (3) constructs an empirical estimate of the CDF for $\pi$ (green shaded region) by correcting for the probability of observing each $G_i$ using the importance-sampled counts of $G_i \le \nu$. Additionally, weighted IS (WIS) can be used as in (27) for a variance-reduced estimator for $F_\pi$.

Figure 1 provides intuition for (3). In the following theorem, we establish that this estimator, $\hat{F}_n$, is unbiased and not only pointwise consistent, but also a uniformly consistent estimator of $F_\pi$, even when the data $\mathcal{D}$ is collected using multiple behavior policies $(\beta_i)_{i=1}^n$. The proof (deferred to Appendix D) also illustrates that, by using knowledge of action probabilities under the behavior policies, no additional adjustments (e.g., front-door or backdoor [70]) are required by $\hat{F}_n$ to estimate $F_\pi$, even when the domain is non-Markovian or has partial observability (confounders).

Theorem 1. Under Assumption 1, $\hat{F}_n$ is an unbiased and uniformly consistent estimator of $F_\pi$:

$$\forall \nu \in \mathbb{R}, \quad \mathbb{E}_{\mathcal{D}}\big[\hat{F}_n(\nu)\big] = F_\pi(\nu), \qquad \sup_{\nu \in \mathbb{R}} \big|\hat{F}_n(\nu) - F_\pi(\nu)\big| \xrightarrow{\text{a.s.}} 0.$$

Remark 1. Notice that the value of $\hat{F}_n(\nu)$ can be more than one, even though $F_\pi(\nu)$ cannot have a value greater than one for any $\nu \in \mathbb{R}$. This is an expected property of estimators based on importance sampling (IS). For example, the IS estimates of expected return during off-policy mean estimation can be smaller or larger than the smallest and largest possible return when $\rho > 1$.

Having an estimator $\hat{F}_n$ of $F_\pi$, any parameter $\psi(F_\pi)$ can now be estimated using $\psi(\hat{F}_n)$. However, some parameters, like the mean $\mu_\pi$, variance $\sigma^2_\pi$, and entropy $\mathcal{H}_\pi$, are naturally defined using the probability distribution $dF_\pi$ instead of the cumulative distribution $F_\pi$.
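The estimator in (3) is straightforward to compute from logged data. Below is a minimal sketch (the trajectory layout and function names are our own assumptions for illustration, not the paper's released code): each logged step stores the observation, the action, the behavior probability $\beta_i(A_t|O_t)$, and the reward, matching the definition of the observed trajectory $H_i$.

```python
# A minimal sketch of the estimator in (3) (assumed data layout and helper names,
# not the paper's released code). Each logged step stores the observation, the
# action, the behavior probability beta_i(A_t | O_t), and the reward.
import numpy as np

def importance_ratio(traj, pi):
    """rho_i = prod_t pi(a_t | o_t) / beta_i(a_t | o_t) for one logged trajectory.

    `traj` is a list of (obs, action, beta_prob, reward) tuples; `pi(obs, action)`
    returns the evaluation policy's probability of taking `action` given `obs`.
    """
    rho = 1.0
    for obs, action, beta_prob, _ in traj:
        rho *= pi(obs, action) / beta_prob
    return rho

def discounted_return(traj, gamma):
    """G_i = sum_t gamma^t R_t for one trajectory."""
    return sum((gamma ** t) * r for t, (_, _, _, r) in enumerate(traj))

def off_policy_cdf(trajectories, pi, gamma):
    """hat{F}_n from Equation (3): importance-weighted counts of {G_i <= nu}."""
    rhos = np.array([importance_ratio(h, pi) for h in trajectories])
    returns = np.array([discounted_return(h, gamma) for h in trajectories])
    n = len(trajectories)
    # Note: the estimate is intentionally not clipped to [0, 1] (see Remark 1).
    return lambda nu: float(np.sum(rhos * (returns <= nu)) / n)
```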
Similarly, parameters like the $\alpha$-quantile $Q^\alpha_\pi$ and the inter-quantile range (which provide tail-robust measures of the mean and of deviation from the mean) and the conditional value at risk $\text{CVaR}^\alpha_\pi$ (which is a tail-sensitive measure) are defined using the inverse CDF $F^{-1}_\pi(\alpha)$ rather than $F_\pi$ itself. Therefore, let $(G_{(i)})_{i=1}^n$ be the order statistics for the samples $(G_i)_{i=1}^n$ and $G_{(0)} := G_{\min}$. Then, we define the off-policy estimator of the inverse CDF for all $\alpha \in [0, 1]$, and the probability distribution estimator $d\hat{F}_n$, as

$$\hat{F}_n^{-1}(\alpha) := \min\Big\{ g \in (G_{(i)})_{i=1}^n \,\Big|\, \hat{F}_n(g) \ge \alpha \Big\}, \qquad d\hat{F}_n(G_{(i)}) := \hat{F}_n(G_{(i)}) - \hat{F}_n(G_{(i-1)}), \quad (4)$$

where $d\hat{F}_n(\nu) := 0$ if $\nu \neq G_{(i)}$ for all $i \in \{1, \dots, n\}$. Using (4), we now define off-policy estimators for parameters like the mean, variance, quantiles, and CVaR (see Appendix E.1 for more details on these):

$$\mu_\pi(\hat{F}_n) := \sum_{i=1}^{n} d\hat{F}_n(G_{(i)})\, G_{(i)}, \qquad \sigma^2_\pi(\hat{F}_n) := \sum_{i=1}^{n} d\hat{F}_n(G_{(i)}) \big(G_{(i)} - \mu_\pi(\hat{F}_n)\big)^2,$$

$$Q^\alpha_\pi(\hat{F}_n) := \hat{F}_n^{-1}(\alpha), \qquad \text{CVaR}^\alpha_\pi(\hat{F}_n) := \frac{1}{\alpha} \sum_{i=1}^{n} d\hat{F}_n(G_{(i)})\, G_{(i)}\, \mathbb{1}_{\{G_{(i)} \le Q^\alpha_\pi(\hat{F}_n)\}}.$$

This procedure can be generalized to any other parameter of $F_\pi$ for which a sample estimator $\psi(\hat{F}_n)$ can be directly created using $\hat{F}_n$ as a plug-in estimator for $F_\pi$.

Remark 2. Let $H_i$ be the observed trajectory for the $G_i$ that gets mapped to $G_{(i)}$ when computing the order statistics. Note that $d\hat{F}_n(G_{(i)})$ equals $\rho_i/n$ for this $H_i$. This implies that the estimator for the mean, $\mu_\pi(\hat{F}_n)$, reduces exactly to the existing full-trajectory-based IS estimator [71].

Notice that many parameters and their sample estimates discussed above are nonlinear in $F_\pi$ and $\hat{F}_n$, respectively (the mean is one exception). Therefore, even though $\hat{F}_n$ is an unbiased estimator of $F_\pi$, the sample estimator, $\psi(\hat{F}_n)$, may be a biased estimator of $\psi(F_\pi)$. This is expected behavior because even in the on-policy setting it is not possible to get unbiased estimates of some parameters (e.g., the standard deviation), and UnO reduces to the on-policy setting when $\pi = \beta$. However, perhaps surprisingly, we establish in the following section that even when $\psi(\hat{F}_n)$ is a biased estimator of $\psi(F_\pi)$, high-confidence upper and lower bounds can still be computed for both $F_\pi$ and $\psi(F_\pi)$.

Figure 2: An illustration of $\hat{F}_n$ (in black) using five return samples, and the confidence band $\mathcal{F}$ (red shaded region) computed using (5) with confidence intervals (red lines) at three key points $(\kappa_i)_{i=1}^3$. Notice that the vertical steps in $\hat{F}_n$ can be of different heights and their total can be greater than 1 due to importance weighting. However, since we know that $F_\pi$ is never greater than 1, $\mathcal{F}$ can be clipped at 1.

## 4 High-Confidence Bounds for UnO

Off-policy estimators are typically prone to high variance, and when the domain can be non-Markovian, the curse of horizon might be unavoidable [46]. For critical applications, this might be troublesome [94] and thus necessitates obtaining confidence intervals to determine how much our estimates can be trusted. Therefore, in this section, we aim to construct a set of possible CDFs $\mathcal{F}: \mathbb{R} \to 2^{\mathbb{R}}$, called a confidence band, such that the true $F_\pi(\nu)$ is within the set $\mathcal{F}(\nu)$ with high probability, i.e., $\Pr(\forall \nu \in \mathbb{R},\ F_\pi(\nu) \in \mathcal{F}(\nu)) \ge 1 - \delta$, for any $\delta \in (0, 1]$. Subsequently, we develop finite-sample bounds for any parameter $\psi(F_\pi)$ using $\mathcal{F}$. In the on-policy setting, $\mathcal{F}$ can be constructed using the DKW inequality [31] and its tight constants [60].
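For reference, a minimal sketch of that on-policy DKW band (a standard construction with Massart's tight constant; the helper is ours, not the paper's): with $n$ i.i.d. on-policy returns, the empirical CDF lies uniformly within $\varepsilon(\delta) = \sqrt{\ln(2/\delta)/(2n)}$ of $F_\pi$ with probability at least $1-\delta$.

```python
# A minimal sketch of the on-policy DKW band (standard DKW inequality with
# Massart's tight constant; the helper is ours, not the paper's): with n i.i.d.
# returns, the empirical CDF is uniformly within eps = sqrt(ln(2/delta) / (2n))
# of the true CDF with probability at least 1 - delta.
import numpy as np

def dkw_band(returns, delta):
    g = np.sort(np.asarray(returns, dtype=float))
    n = len(g)
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    f_hat = lambda nu: np.searchsorted(g, nu, side="right") / n
    lower = lambda nu: max(0.0, f_hat(nu) - eps)   # clip at 0
    upper = lambda nu: min(1.0, f_hat(nu) + eps)   # clip at 1
    return lower, upper
```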
However, its applicability to the off-policy setting is unclear because (a) unlike the on-policy CDF estimate, the steps of an off-policy CDF estimate are not of equal heights, (b) the steps do not sum to one (see Figure 2), and the maximum height of the steps need not be known either, and (c) DKW assumes samples are identically distributed, whereas off-policy data $\mathcal{D}$ might be collected using multiple different behavior policies. This raises the question: How do we obtain $\mathcal{F}$ in the off-policy setting?

Before constructing a confidence band $\mathcal{F}$, let us first focus on obtaining bounds for a single point, $F_\pi(\kappa)$. Let $X := \rho\, \mathbb{1}_{\{G \le \kappa\}}$. Then, from Theorem 1, we have that $\mathbb{E}_{\mathcal{D}}[X] = F_\pi(\kappa)$. This implies that a confidence interval for the mean of $X$ provides a confidence interval for $F_\pi(\kappa)$. Using this observation, existing confidence intervals for the mean of a bounded random variable can be directly applied to $X$ to obtain a confidence interval for $F_\pi(\kappa)$. For example, Thomas et al. [91] present tight bounds for the mean of IS-based random variables by mitigating the variance resulting from the heavy tails associated with IS; we use their method on $\hat{F}_n(\kappa)$ to bound $F_\pi(\kappa)$. Alternatively, recent work by Kuzborskij et al. [53] can potentially be used with a WIS-based $F_\pi$ estimate (27).

Before moving further, we introduce some additional notation. Let $(\kappa_i)_{i=1}^K$ be any $K$ key points, and let $\mathrm{CI}^-(\kappa_i, \delta_i)$ and $\mathrm{CI}^+(\kappa_i, \delta_i)$ be the lower and upper confidence bounds on $F_\pi(\kappa_i)$ constructed at each key point using the observation made in the previous paragraph, such that

$$\forall i \in \{1, \dots, K\}, \quad \Pr\big(\mathrm{CI}^-(\kappa_i, \delta_i) \le F_\pi(\kappa_i) \le \mathrm{CI}^+(\kappa_i, \delta_i)\big) \ge 1 - \delta_i.$$

We now use the following observation to obtain a band, $\mathcal{F}$, that contains $F_\pi$ with high confidence. Because $F_\pi$ is a CDF, it is necessarily monotonically non-decreasing, and so if $F_\pi(\kappa_i) \ge \mathrm{CI}^-(\kappa_i, \delta_i)$ then for any $\nu \ge \kappa_i$, $F_\pi(\nu)$ must be no less than $\mathrm{CI}^-(\kappa_i, \delta_i)$. Similarly, if $F_\pi(\kappa_i) \le \mathrm{CI}^+(\kappa_i, \delta_i)$ then for any $\nu \le \kappa_i$, $F_\pi(\nu)$ must also be no greater than $\mathrm{CI}^+(\kappa_i, \delta_i)$. Let $\kappa_0 := G_{\min}$, $\kappa_{K+1} := G_{\max}$, $\mathrm{CI}^-(\kappa_0, \delta_0) := 0$, and $\mathrm{CI}^+(\kappa_{K+1}, \delta_{K+1}) := 1$; then, as illustrated in Figure 2, we can construct a lower function $F_-$ and an upper function $F_+$ that encapsulate $F_\pi$ with high probability:

$$F_-(\nu) := \begin{cases} 1 & \text{if } \nu > G_{\max}, \\ \max_{\kappa_i \le \nu} \mathrm{CI}^-(\kappa_i, \delta_i) & \text{otherwise}; \end{cases} \qquad F_+(\nu) := \begin{cases} 0 & \text{if } \nu < G_{\min}, \\ \min_{\kappa_i \ge \nu} \mathrm{CI}^+(\kappa_i, \delta_i) & \text{otherwise}. \end{cases} \quad (5)$$

Theorem 2. Under Assumption 1, for any $\delta \in (0, 1]$, if $\sum_{i=1}^{K} \delta_i \le \delta$, then the confidence band defined by $F_-$ and $F_+$ provides guaranteed coverage for $F_\pi$. That is,

$$\Pr\big(\forall \nu \in \mathbb{R},\ F_-(\nu) \le F_\pi(\nu) \le F_+(\nu)\big) \ge 1 - \delta.$$

Figure 3: Given a confidence band $\mathcal{F}$, bounds for many parameters can be obtained using geometry. (Left) For a lower bound on the mean, we would want a CDF $F \in \mathcal{F}$ that assigns as high a probability as possible to lower $G$ values, and $F_+$ is the CDF which does that. To obtain the mean of $F_+$, we use the property that the mean of a distribution is the area above the CDF on the positive x-axis minus the area below the CDF on the negative x-axis [3]. Hence, the mean of the distribution characterized by $F_+$ is the area of the shaded blue region minus the area of the shaded purple region, and this value is the high-confidence lower bound on the mean. (Middle) Similarly, within $\mathcal{F}$, $F_+$ characterizes the distribution with the smallest $\alpha$-quantile. (Right) Building upon the lower bounds for the mean and the quantile, Thomas and Learned-Miller [90] showed that the lower bound for $\alpha$-CVaR can be obtained using the area of the shaded blue region minus the area of the shaded purple region, normalized by $\alpha$.
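As a concrete instance of the left panel, the following minimal sketch (our own helper, with assumed inputs: the key points $(\kappa_i)$ and their upper confidence values $\mathrm{CI}^+(\kappa_i, \delta_i)$) converts the upper function $F_+$ from (5) into the high-confidence lower bound on the mean, using the fact that for returns bounded in $[G_{\min}, G_{\max}]$ the mean of a CDF equals $G_{\max}$ minus the area under that CDF over $[G_{\min}, G_{\max}]$.

```python
# A minimal sketch of the left panel of Figure 3 (our own helper, with assumed
# inputs): given key points kappa_1..kappa_K and upper confidence values
# CI+(kappa_i, delta_i), build the step function F_+ from (5) and compute its mean,
# which is the high-confidence lower bound on the mean. For returns bounded in
# [G_min, G_max], the mean of a CDF F equals G_max minus the area under F.
import numpy as np

def mean_lower_bound(kappas, ci_upper, g_min, g_max):
    kappas = np.asarray(kappas, dtype=float)
    ci_upper = np.asarray(ci_upper, dtype=float)
    order = np.argsort(kappas)
    kappas, ci_upper = kappas[order], ci_upper[order]

    # Augment with kappa_{K+1} = G_max and CI+ = 1, and clip the band at 1.
    kappas = np.append(kappas, g_max)
    ci_upper = np.minimum(np.append(ci_upper, 1.0), 1.0)

    # On each interval between consecutive key points, F_+ is the running
    # minimum of the CI+ values taken from the right (Equation 5).
    f_plus = np.minimum.accumulate(ci_upper[::-1])[::-1]

    edges = np.concatenate(([g_min], kappas))   # kappa_0 = G_min, ..., G_max
    widths = np.diff(edges)                     # lengths of the K + 1 intervals
    area_under_f_plus = np.sum(widths * f_plus)
    return g_max - area_under_f_plus
```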
To get the upper bounds on the mean, quantile, and CVaR, analogous arguments hold using the lower-bound CDF $F_-$. See Appendix E.5 for discussions of the variance, inter-quantile range, entropy, and other parameters.

Remark 3. Notice that any choice of $(\kappa_i)_{i=1}^K$ results in a valid band $\mathcal{F}$. However, $\mathcal{F}$ can be made tighter by optimizing over the choice of $(\kappa_i)_{i=1}^K$. In Appendix E.5, we present one such method using cross-validation to minimize the area enclosed within $\mathcal{F}$.

Having obtained a high-confidence band for $F_\pi$, we now discuss how high-confidence bounds for any parameter $\psi(F_\pi)$ can be obtained using this band. Formally, with a slight overload of notation, let $\mathcal{F}$ be the set of all possible CDFs bounded between $F_-$ and $F_+$, that is,

$$\mathcal{F} := \Big\{ F \,\Big|\, \forall \nu \in \mathbb{R},\ F_-(\nu) \le F(\nu) \le F_+(\nu) \Big\}.$$

This band $\mathcal{F}$ contains many possible CDFs, one of which is $F_\pi$ with high probability. Therefore, to get a lower or upper bound, $\psi^-$ or $\psi^+$, on $\psi(F_\pi)$, we propose deriving a CDF $F \in \mathcal{F}$ that minimizes or maximizes $\psi(F)$, respectively, and we show that these contain $\psi(F_\pi)$ with high probability:

$$\psi^- := \inf_{F \in \mathcal{F}} \psi(F), \qquad \psi^+ := \sup_{F \in \mathcal{F}} \psi(F). \quad (6)$$

Theorem 3. Under Assumption 1, for any $1-\delta$ confidence band $\mathcal{F}$, the confidence interval defined by $\psi^-$ and $\psi^+$ provides guaranteed coverage for $\psi(F_\pi)$. That is, $\Pr\big(\psi^- \le \psi(F_\pi) \le \psi^+\big) \ge 1 - \delta$.

While obtaining $\psi^-$ or $\psi^+$ might not look straightforward, closed-form expressions exist for many popular parameters of interest. In other cases, simple algorithms exist for computing them [74]. Figure 3 provides geometric depictions of the closed-form expressions for some parameters.

Remark 4. Perhaps surprisingly, even though $\psi(\hat{F}_n)$ may be biased, we can obtain high-confidence bounds with guaranteed coverage on any $\psi(F_\pi)$ using the confidence band $\mathcal{F}$. In fact, confidence bounds for all parameters computed using (6) hold simultaneously with probability at least $1-\delta$, as they are all derived from the same confidence band, $\mathcal{F}$.

3.1 Statistical Bootstrapping: An important advantage of having constructed an off-policy estimator of any $\psi(F_\pi)$ is that it opens up the possibility of using resampling-based methods, like statistical bootstrapping [32], to obtain approximate confidence intervals for $\psi(F_\pi)$. In particular, we can use the bias-corrected and accelerated (BCa) bootstrap procedure to obtain $\psi^-$ and $\psi^+$ for $\psi(F_\pi)$. This procedure is outlined in Algorithm 1 in Appendix E.4. Unlike the bounds from (6), BCa-based bounds do not offer guaranteed coverage and need to be computed individually for each parameter $\psi$. However, they can be combined with UnO to obtain significantly tighter bounds with less data, albeit without guaranteed coverage.

## 5 Confounding, Distributional Shifts, and Smooth Non-Stationarities

A particular advantage of UnO is the remarkable simplicity with which the estimates and bounds for $F_\pi$ or $\psi(F_\pi)$ can be extended to account for confounding, distributional shifts, and smooth non-stationarities that are prevalent in real-world applications [30].

Confounding / Partial Observability: The estimator $\hat{F}_n$ in (3) accounts for partial observability when both $\pi$ and $\beta$ have the same observation set. However, in systems like automated loan approval [94], data might have been collected using a behavior policy $\beta$ dependent on sensitive attributes like race and gender that may no longer be allowable under modern laws. This can make the available observations, $\tilde{O}$, for an evaluation policy $\pi$ different from the observations, $O$, for $\beta$, which may also have been a partial observation of the underlying true state $S$.
However, an advantage of many such automated systems (e.g., online recommendation, automated healthcare, robotics) is the direct availability of the behavior probabilities $\beta_i(A|O)$. In Appendix D, we provide generalized proofs for all the earlier results, showing that access to $\beta_i(A|O)$ allows UnO to handle various sources of confounding even when $\tilde{O} \neq O$, without requiring any additional adjustments. When $\beta_i(A|O)$ is not available, we allude to possible alternatives in Appendix B.1.

Distribution Shifts: Many practical applications exhibit distribution shifts that might be discrete or abrupt. One example is when a medical treatment developed for one demographic is applied to another [37]. To tackle discrete distributional shifts, let $F^{(1)}_\pi$ and $F^{(2)}_\pi$ denote the CDFs of returns under policy $\pi$ in the first and the second domain, respectively. To make the problem tractable, similar to prior work on characterizing distribution shifts [10], we assume that the Kolmogorov-Smirnov distance between $F^{(1)}_\pi$ and $F^{(2)}_\pi$ is bounded.

Assumption 2. There exists $\epsilon \ge 0$ such that $\sup_{\nu \in \mathbb{R}} \big| F^{(1)}_\pi(\nu) - F^{(2)}_\pi(\nu) \big| \le \epsilon$.

Given data $\mathcal{D}$ collected in the first domain, one can obtain the bounds $F^{(1)}_-$ and $F^{(1)}_+$ on $F^{(1)}_\pi$ as in Section 4. Now, since $F^{(2)}_\pi$ can differ from $F^{(1)}_\pi$ by at most $\epsilon$ at any point, we propose the following bounds for $F^{(2)}_\pi$ for all $\nu \in \mathbb{R}$ and show that they readily provide guaranteed coverage for $F^{(2)}_\pi$:

$$F^{(2)}_-(\nu) := \max\big(0,\ F^{(1)}_-(\nu) - \epsilon\big), \qquad F^{(2)}_+(\nu) := \min\big(1,\ F^{(1)}_+(\nu) + \epsilon\big). \quad (7)$$

Theorem 4. Under Assumptions 1 and 2, for all $\delta \in (0, 1]$, the confidence band defined by $F^{(2)}_-$ and $F^{(2)}_+$ provides guaranteed coverage for $F^{(2)}_\pi$. That is, $\Pr\big(\forall \nu,\ F^{(2)}_-(\nu) \le F^{(2)}_\pi(\nu) \le F^{(2)}_+(\nu)\big) \ge 1 - \delta$.

Smooth Non-stationarity: The stationarity assumption is unreasonable for applications like online tutoring or recommendation systems, which must deal with drifts in students' interests or seasonal fluctuations in customers' interests [93, 88]. In the worst case, however, even a small change in the transition dynamics can result in a large fluctuation of a policy's performance and make the problem intractable. Therefore, similar to the work of Chandak et al. [16], we assume that the distribution of returns for any $\pi$ changes smoothly over the past episodes $1$ to $L$ and the $\ell$ episodes in the future. In particular, we assume that the trend of $F^{(i)}_\pi(\nu)$ for all $\nu$ can be modeled using least-squares regression with a nonlinear basis function $\phi: \mathbb{R} \to \mathbb{R}^d$ (e.g., the Fourier basis, which is popular for modeling non-stationary trends [12]).

Assumption 3. For every $\nu$, there exists a $w_\nu \in \mathbb{R}^d$ such that $\forall i \in [1, L + \ell]$, $F^{(i)}_\pi(\nu) = \phi(i)^\top w_\nu$.

Estimating $F^{(L+\ell)}_\pi$ can now be seen as a time-series forecasting problem. Formally, for any key point $\kappa$, let $\hat{F}^{(i)}_n(\kappa)$ be the estimated CDF using $H_i$ observed in episode $i$. From Theorem 1, we know that $\hat{F}^{(i)}_n(\kappa)$ is an unbiased estimator of $F^{(i)}_\pi(\kappa)$; therefore, $(\hat{F}^{(i)}_n(\kappa))_{i=1}^L$ is an unbiased estimate of the underlying time-varying sequence $(F^{(i)}_\pi(\kappa))_{i=1}^L$. Now, using methods from the time-series literature, the trend of $(\hat{F}^{(i)}_n(\kappa))_{i=1}^L$ can be analyzed to forecast $F^{(L+\ell)}_\pi(\kappa)$, along with its confidence intervals. In particular, we propose using the wild bootstrap [58, 26], which provides approximate confidence intervals with a finite-sample error of $O(L^{-1/2})$ while also handling non-normality and heteroskedasticity, which would occur when dealing with IS-based estimates resulting from different behavior policies [16]. See Appendix E.6 for more details.
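A minimal sketch of this per-key-point forecast (the polynomial basis, helper names, and the simple percentile-style wild-bootstrap interval below are our own illustrative choices; the paper's exact procedure is described in Appendix E.6): fit the sequence of per-episode estimates $\hat{F}^{(i)}_n(\kappa)$ by least squares, forecast episode $L + \ell$, and perturb the residuals with Rademacher weights to obtain an approximate interval.

```python
# A minimal sketch of the per-key-point forecast (the polynomial basis, helper
# names, and the percentile-style wild-bootstrap interval are our own illustrative
# choices; the paper's exact procedure is in Appendix E.6). y[i-1] holds the
# unbiased estimate hat{F}^{(i)}_n(kappa) for episodes i = 1..L.
import numpy as np

def basis(i, d=3):
    """Hypothetical polynomial basis phi(i); a Fourier basis could be used instead."""
    return np.array([i ** k for k in range(d)], dtype=float)

def forecast_cdf_at_key_point(y, ell, delta=0.05, n_boot=2000, d=3, seed=0):
    """Least-squares fit of the trend, forecast of F^{(L+ell)}(kappa), and an
    approximate interval from a wild bootstrap with Rademacher weights."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    L = len(y)
    Phi = np.stack([basis(i, d) for i in range(1, L + 1)])   # L x d design matrix
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    resid = y - Phi @ w
    phi_future = basis(L + ell, d)
    point = phi_future @ w

    boots = np.empty(n_boot)
    for b in range(n_boot):
        signs = rng.choice([-1.0, 1.0], size=L)              # Rademacher weights
        w_star, *_ = np.linalg.lstsq(Phi, Phi @ w + resid * signs, rcond=None)
        boots[b] = phi_future @ w_star
    lo, hi = np.quantile(boots, [delta / 2, 1 - delta / 2])
    return np.clip(point, 0, 1), np.clip(lo, 0, 1), np.clip(hi, 0, 1)
```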
Finally, using the bounds obtained with the wild bootstrap at multiple key points, an entire confidence band can be obtained as discussed in Section 4.

## 6 Empirical Studies

In this section, we provide empirical support for the established theoretical results for the proposed UnO estimator and high-confidence bounds. To do so, we use the following domains: (1) an open-source implementation [102] of the FDA-approved type-1 diabetes treatment simulator [59], (2) a stationary and a non-stationary recommender-system domain, and (3) a continuous-state Gridworld with partial observability, where data is collected using multiple behavior policies. Detailed descriptions of the domains and the procedures for obtaining π and β are provided in Appendix F.1; code is also publicly available at https://github.com/yashchandak/UnO. In the following, we discuss four primary takeaway results.

(A) Characteristics of the UnO estimator: Figure 4 reinforces the universality of UnO. As can be seen, UnO can accurately estimate the entire CDF and a wide range of its parameters: mean, variance, quantile, and CVaR.

Figure 4: Performance trend of the proposed estimators and bounds on three domains. The black dashed line is the true value of $F_\pi$ or $\psi(F_\pi)$, green is our UnO estimator, red is our CI-based UnO bound, blue is the bootstrap version of our UnO bound, and yellow is the baseline bound for the mean [91] or variance [18]. Each bound has two lines (upper and lower); however, some are not visible due to overlaps. The shaded regions are two standard errors, computed using 30 trials. The plots in the top row are for CDFs obtained using $3 \times 10^{4.5}$ samples. The next four rows are for different parameters and share the same x-axis. Bounds were obtained for a failure rate δ = 0.05. Since the UnO-Boot and Baseline-CI methods do not hold simultaneously for all the parameters, they were made to hold with a failure rate of δ/4 for a fair comparison (as there are four parameters in this plot).

(B) Comparison of UnO with prior work: Recent works for bounding the mean [45, 35] assume no confounding and Markovian structure. Therefore, for a fair comparison, we resort to the method of Thomas et al. [91], which can provide tight bounds even when the domain is non-Markovian or has confounding (partial observability). Perhaps surprisingly, Figure 4 shows that the proposed guaranteed-coverage bounds, termed UnO-CI here, can be competitive with this existing specialized bound, termed Baseline-CI here, for the mean. In fact, UnO-CI can often require an order of magnitude less data compared to the specialized bounds for variance [18]; we refer readers to Appendix F.2 for a discussion of potential reasons. This suggests that the universality of UnO can be beneficial even when only one specific parameter is of interest.

(C) Finite-sample confidence bounds for other parameters using UnO: Figure 4 demonstrates that UnO-CI also successfully addresses the open question of providing guaranteed-coverage bounds for multiple parameters simultaneously, without additional applications of the union bound. As expected, bounds for parameters like the variance and CVaR that depend heavily on the distribution tails take more samples to shrink than bounds on other parameters (like the median [quantile(0.5)]). Additional discussion of the observed trends for the bounds is provided in Appendix F.2. The proposed UnO-Boot bounds, as discussed in Section 3.1, are approximate and might not always hold with the specified probability. However, they stand out by providing significantly tighter, and thus more practicable, confidence intervals.

(D) Results for non-stationary settings: Results for this setting are presented in Figure 5. As discussed earlier, online recommendation systems for tutorials, movies, advertisements, and other products are ubiquitous. However, the popular assumption of stationarity is seldom applicable to these systems. In particular, personalizing for each user is challenging in such settings, as a user's interest in the different recommendable items fluctuates over time. For example, in the context of online shopping, the interests of customers can vary based on seasonality or other unknown factors. To abstract such settings, in this domain the reward (interest of the user) associated with each item changes over time. See Figure 5 (top row) for a visualization of the domain for different speeds (degrees of non-stationarity). In all the settings with different speeds, a uniformly random policy was used as the behavior policy β to collect data for 1000 episodes. To test the efficacy of UnO when the future domain can be different from the past domains, the evaluation policy was chosen to be a near-optimal policy for the future episode, 1000 + 1.

Figure 5: (Top row) True rewards (unknown to the RL agent) associated with each of the five items over the past 1000 episodes, for different speeds of non-stationarity. A speed of 0 indicates the stationary setting, and higher speeds indicate greater degrees of non-stationarity. (Bottom row) The black dashed line is the true value of the future distribution of returns under π: $F^{(L+\ell)}_\pi$, where L = 1000 and ℓ = 1. In red is our UnO bound that does not account for non-stationarity, and in blue is the wild-bootstrap version of our UnO bound that accounts for non-stationarity. The shaded region corresponds to one standard error computed using 30 trials. Bounds were obtained for a failure rate δ = 0.05. (Left column) In the stationary setting, both variants of the UnO bounds approximately contain the true future CDF $F^{(L+\ell)}_\pi$. In this setting, the UnO method designed only for stationary settings provides a tighter bound. (Middle and right columns) As the domain becomes non-stationary, UnO bounds that do not account for non-stationarity fail to adequately bound the true future CDF $F^{(L+\ell)}_\pi$. When the degree of non-stationarity is high, not accounting for non-stationarity can lead to significantly inaccurate bounds. By comparison, UnO bounds that use the wild bootstrap to tackle non-stationarity provide a more accurate bound throughout. As expected, when the fluctuations due to non-stationarity increase, the width of the confidence band increases as well. These results illustrate (a) the importance of accounting for non-stationarity, when applicable, and (b) the flexibility offered by our proposed universal off-policy estimator, UnO, to tackle such settings.

## 7 Conclusion

We have taken the first steps towards developing a universal off-policy estimator (UnO), closing the open question of whether it is possible to estimate and provide finite-sample bounds (that hold with high probability) for any parameter of the return distribution in the off-policy setting, with minimal assumptions on the domain. Now, without being restricted to the most common and basic parameters, researchers and practitioners can fully characterize the (potentially dangerous or costly) behavior of a policy without having to deploy it.
There are many new questions regarding how UnO can be improved for policy evaluation by further reducing data requirements or weakening assumptions. Using UnO for policy improvement also remains an interesting future direction. Subsequent to this work, Huang et al. [43] showed how models can be used to obtain UnO-style doubly robust estimators, along with their convergence rates, in the contextual bandit setting. This allows their method to also provide finite-sample uniform CDF bounds for a broad class of Lipschitz risk functionals.

## 8 Acknowledgements

We thank Shiv Shankar, Scott Jordan, Wes Cowley, and Nan Jiang for the feedback, corrections, and other contributions to this work. We would also like to thank Bo Liu, Akshay Krishnamurthy, Marc Bellemare, Ronald Parr, Josiah Hannah, Sergey Levine, Jared Yeager, and the anonymous reviewers for their feedback on this work. Research reported in this paper was sponsored in part by a gift from Adobe, NSF award #2018372, and the DEVCOM Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA). Research in this paper is also supported in part by NSF (IIS-1724157, IIS-1638107, IIS-1749204, IIS-1925082), ONR (N00014-18-2243), AFOSR (FA9550-20-1-0077), and ARO (78372-CS, W911NF-19-2-0333). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

## References

[1] C. Acerbi and D. Tasche. On the coherence of expected shortfall. Journal of Banking & Finance, 26(7):1487-1503, 2002.
[2] J. Altschuler, V.-E. Brunel, and A. Malek. Best arm identification for contaminated bandits. Journal of Machine Learning Research, 20(91):1-39, 2019.
[3] T. W. Anderson. Confidence limits for the expected value of an arbitrary bounded random variable with a continuous distribution function. Technical report, Stanford University, Department of Statistics, 1969.
[4] M. G. Azar, R. Munos, and H. J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325-349, 2013.
[5] A. G. Barto, P. S. Thomas, and R. S. Sutton. Some recent applications of reinforcement learning. In Proceedings of the Eighteenth Yale Workshop on Adaptive and Learning Systems, 2017.
[6] M. Bastani. Model-free intelligent diabetes management using machine learning. Master's thesis, University of Alberta, 2014.
[7] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
[8] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, 1995.
[9] A. Bennett, N. Kallus, L. Li, and A. Mousavi. Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders. arXiv preprint arXiv:2007.13893, 2020.
[10] V. W. Berger and Y. Zhou. Kolmogorov-Smirnov test: Overview. Wiley StatsRef: Statistics Reference Online, 2014.
[11] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65-98, 2017.
[12] P. Bloomfield. Fourier Analysis of Time Series: An Introduction. John Wiley & Sons, 2004.
[13] D. B. Brown.
Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 35(6):722-730, 2007.
[14] D. S. Brown, S. Niekum, and M. Petrik. Bayesian robust optimization for imitation learning. arXiv preprint arXiv:2007.12315, 2020.
[15] F. P. Cantelli. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital. Attuari, 4(421-424), 1933.
[16] Y. Chandak, S. M. Jordan, G. Theocharous, M. White, and P. S. Thomas. Towards safe policy improvement for non-stationary MDPs. Neural Information Processing Systems, 2020.
[17] Y. Chandak, G. Theocharous, S. Shankar, S. Mahadevan, M. White, and P. S. Thomas. Optimizing for the future in non-stationary MDPs. International Conference on Machine Learning, 2020.
[18] Y. Chandak, S. Shankar, and P. S. Thomas. High confidence off-policy (or counterfactual) variance estimation. Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
[19] S. X. Chen, W. Härdle, and M. Li. An empirical likelihood goodness-of-fit test for time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):663-678, 2003.
[20] V. Chernozhukov, I. Fernández-Val, and B. Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205-2268, 2013.
[21] K.-J. Chung and M. J. Sobel. Discounted MDPs: Distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 25(1):49-62, 1987.
[22] W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
[23] W. Dabney, M. Rowland, M. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[24] W. Dabney, Z. Kurth-Nelson, N. Uchida, C. K. Starkweather, D. Hassabis, R. Munos, and M. Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792):671-675, 2020.
[25] B. Dai, O. Nachum, Y. Chow, L. Li, C. Szepesvári, and D. Schuurmans. CoinDICE: Off-policy confidence interval estimation. arXiv preprint arXiv:2010.11652, 2020.
[26] R. Davidson and E. Flachaire. The wild bootstrap, tamed at last. Journal of Econometrics, 146(1):162-169, 2008.
[27] R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In AAAI/IAAI, pages 761-768, 1998.
[28] J. DiNardo, N. M. Fortin, and T. Lemieux. Labor market institutions and the distribution of wages, 1973-1992: A semiparametric approach. Technical report, National Bureau of Economic Research, 1995.
[29] S. G. Donald and Y.-C. Hsu. Estimation and inference for distribution functions and quantile functions in treatment effect models. Journal of Econometrics, 178:383-397, 2014.
[30] G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
[31] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642-669, 1956.
[32] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. CRC Press, 1994.
[33] F. Faccio, L. Kirsch, and J. Schmidhuber. Parameter-based value functions. arXiv preprint arXiv:2006.09226, 2020.
[34] R. Feldt and A. Stukalov. BlackBoxOptim.jl, 2019.
[35] Y. Feng, Z. Tang, N. Zhang, and Q. Liu. Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds.
In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=dKg5D1Z1Lm.
[36] S. Firpo and C. Pinto. Identification and estimation of distributional impacts of interventions using changes in inequality measures. Journal of Applied Econometrics, 31(3):457-486, 2016.
[37] Y. Gao and Y. Cui. Deep transfer learning for reducing health care disparities arising from biomedical data inequality. Nature Communications, 11(1):1-8, 2020.
[38] V. Glivenko. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital. Attuari, 4:92-99, 1933.
[39] Z. Guo, P. S. Thomas, and E. Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2492-2501, 2017.
[40] J. Hanna, P. Stone, and S. Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2017.
[41] J. Hanna, S. Niekum, and P. Stone. Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pages 2605-2613. PMLR, 2019.
[42] J. Harb, T. Schaul, D. Precup, and P.-L. Bacon. Policy evaluation networks. arXiv preprint arXiv:2002.11833, 2020.
[43] A. Huang, L. Leqi, Z. C. Lipton, and K. Azizzadenesheli. Off-policy risk assessment in contextual bandits. arXiv preprint arXiv:2104.08977, 2021.
[44] S. C. Jaquette. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, pages 496-505, 1973.
[45] N. Jiang and J. Huang. Minimax confidence interval for off-policy evaluation and policy optimization. arXiv preprint arXiv:2002.02081, 2020.
[46] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652-661. PMLR, 2016.
[47] N. Kallus and M. Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167), 2020.
[48] N. Kallus and A. Zhou. Confounding-robust policy evaluation in infinite-horizon reinforcement learning. arXiv preprint arXiv:2002.04518, 2020.
[49] R. Keramati, C. Dann, A. Tamkin, and E. Brunskill. Being optimistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4436-4443, 2020.
[50] K. Khetarpal, M. Riemer, I. Rish, and D. Precup. Towards continual reinforcement learning: A review and perspectives. arXiv preprint arXiv:2012.13490, 2020.
[51] I. Kostrikov and O. Nachum. Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609, 2020.
[52] S. R. Kuindersma, R. A. Grupen, and A. G. Barto. Variable risk control via stochastic optimization. The International Journal of Robotics Research, 32(7):806-825, 2013.
[53] I. Kuzborskij, C. Vernade, A. György, and C. Szepesvári. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.
[54] T. Lattimore and M. Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320-334. Springer, 2012.
[55] E. Learned-Miller and J. DeStefano. A probabilistic upper bound on differential entropy. IEEE Transactions on Information Theory, 54(11):5223-5230, 2008.
[56] P. Liao, P. Klasnja, and S. Murphy.
Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, pages 1-10, 2020.
[57] Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356-5366, 2018.
[58] E. Mammen. Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics, pages 255-285, 1993.
[59] C. D. Man, F. Micheletto, D. Lv, M. Breton, B. Kovatchev, and C. Cobelli. The UVA/PADOVA type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology, 8(1):26-34, 2014.
[60] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269-1283, 1990.
[61] A. M. Metelli, M. Papini, N. Montali, and M. Restelli. Importance sampling techniques for policy optimization. Journal of Machine Learning Research, 21(141):1-75, 2020.
[62] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In ICML, 2010.
[63] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Parametric return density estimation for reinforcement learning. arXiv preprint arXiv:1203.3497, 2012.
[64] O. Nachum and B. Dai. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
[65] H. Namkoong, R. Keramati, S. Yadlowsky, and E. Brunskill. Off-policy policy evaluation for sequential decisions under unobserved confounding. arXiv preprint arXiv:2003.05623, 2020.
[66] S. Padakandla. A survey of reinforcement learning algorithms for dynamically varying environments. arXiv preprint arXiv:2005.10619, 2020.
[67] M. Papini, A. M. Metelli, L. Lupo, and M. Restelli. Optimistic policy optimization via multiple importance sampling. In International Conference on Machine Learning, pages 4989-4999. PMLR, 2019.
[68] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
[69] B. Pavse, I. Durugkar, J. P. Hanna, and P. Stone. Reducing sampling error in batch temporal difference learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), July 2020.
[70] J. Pearl. Causality. Cambridge University Press, 2009.
[71] D. Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
[72] M. L. Puterman. Markov decision processes. Handbooks in Operations Research and Management Science, 2:331-434, 1990.
[73] J. P. Romano and C. DiCiccio. Multiple data splitting for testing. Department of Statistics, Stanford University, 2019.
[74] J. P. Romano and M. Wolf. Explicit nonparametric confidence intervals for the variance with guaranteed coverage. Communications in Statistics - Theory and Methods, 31(8):1231-1250, 2002.
[75] M. Rowland, M. G. Bellemare, W. Dabney, R. Munos, and Y. W. Teh. An analysis of categorical distributional reinforcement learning. arXiv preprint arXiv:1802.08163, 2018.
[76] M. Rowland, R. Dadashi, S. Kumar, R. Munos, M. G. Bellemare, and W. Dabney. Statistics and samples in distributional reinforcement learning. In International Conference on Machine Learning, pages 5528-5536. PMLR, 2019.
[77] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators.
In International Conference on Machine Learning, pages 1312-1320. PMLR, 2015.
[78] P. K. Sen and J. M. Singer. Large Sample Methods in Statistics: An Introduction With Applications. Chapman & Hall, 1993.
[79] A. M. Shaikh. The Glivenko-Cantelli Theorem. http://home.uchicago.edu/~amshaikh/webfiles/glivenko-cantelli.pdf. Accessed: 2010-09-30.
[80] S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning Proceedings 1994, pages 284-292. Elsevier, 1994.
[81] M. J. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794-802, 1982.
[82] D. A. Stephens. The Glivenko-Cantelli Lemma. http://wwwf.imperial.ac.uk/~das01/MyWeb/M3S3/Handouts/GlivenkoCantelli.pdf, 2006.
[83] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.
[84] A. Tamar, D. Di Castro, and S. Mannor. Learning the variance of the reward-to-go. The Journal of Machine Learning Research, 17(1):361-396, 2016.
[85] G. Tennenholtz, U. Shalit, and S. Mannor. Off-policy evaluation in partially observable environments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10276-10283, 2020.
[86] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Ad recommendation systems for life-time value optimization. In Proceedings of the 24th International Conference on World Wide Web, pages 1305-1310, 2015.
[87] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[88] G. Theocharous, Y. Chandak, P. S. Thomas, and F. de Nijs. Reinforcement learning for strategic recommendations. arXiv preprint arXiv:2009.07346, 2020.
[89] P. Thomas and E. Brunskill. Importance sampling with unequal support. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[90] P. Thomas and E. Learned-Miller. Concentration inequalities for conditional value at risk. In International Conference on Machine Learning, pages 6225-6233, 2019.
[91] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[92] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380-2388, 2015.
[93] P. S. Thomas, G. Theocharous, M. Ghavamzadeh, I. Durugkar, and E. Brunskill. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In AAAI, pages 4740-4745, 2017.
[94] P. S. Thomas, B. C. da Silva, A. G. Barto, S. Giguere, Y. Brun, and E. Brunskill. Preventing undesirable behavior of intelligent machines. Science, 366(6468):999-1004, 2019.
[95] M. Uehara, J. Huang, and N. Jiang. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659-9668. PMLR, 2020.
[96] A. W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[97] G. van Rossum and F. L. Drake Jr. Python Tutorial, volume 620. Centrum voor Wiskunde en Informatica, Amsterdam, 1995.
[98] E. Veach and L. J. Guibas. Optimally combining sampling techniques for Monte Carlo rendering.
In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pages 419-428, 1995.
[99] L. Wang, Y. Zhou, R. Song, and B. Sherwood. Quantile-optimal treatment regimes. Journal of the American Statistical Association, 113(523):1243-1254, 2018.
[100] T. Wang, M. Bowling, and D. Schuurmans. Dual representations for dynamic programming and reinforcement learning. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 44-51. IEEE, 2007.
[101] D. White. Mean, variance, and probabilistic criteria in finite Markov decision processes: A review. Journal of Optimization Theory and Applications, 56(1):1-29, 1988.
[102] A. Xie, J. Harrison, and C. Finn. Deep reinforcement learning amidst lifelong non-stationarity. arXiv preprint arXiv:2006.10701, 2020.
[103] J. Xie. Simglucose v0.2.1 (2018), 2019. URL https://github.com/jxx123/simglucose.
[104] T. Xie, Y. Ma, and Y.-X. Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. arXiv preprint arXiv:1906.03393, 2019.
[105] M. Yang, B. Dai, O. Nachum, G. Tucker, and D. Schuurmans. Offline policy selection under uncertainty. arXiv preprint arXiv:2012.06919, 2020.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Appendix B.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Assumptions 1, 2, and 3.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Appendix D.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Code for the proposed UnO method(s) and the domains used for the empirical studies is available at https://github.com/yashchandak/UnO.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section F.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] All plots have standard error bars computed using 30 trials.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] All the experiments were conducted on a personal computer with 32 GiB of memory and an Intel Core i7 CPU with 12 threads. Total runtime for all the experiments combined was less than a day.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Our code uses Julia [11], the BlackBoxOptim.jl library [34], Python [97], and PyTorch [68].
   (b) Did you mention the license of the assets? [N/A] Only open-source assets have been used.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]