# High-Confidence Off-Policy (or Counterfactual) Variance Estimation

Yash Chandak, Shiv Shankar, Philip S. Thomas
University of Massachusetts
{ychandak, sshankar, pthomas}@cs.umass.edu

Many sequential decision-making systems leverage data collected using prior policies to propose a new policy. For critical applications, it is important that high-confidence guarantees on the new policy's behavior are provided before deployment, to ensure that the policy will behave as desired. Prior works have studied high-confidence off-policy estimation of the expected return; however, high-confidence off-policy estimation of the variance of returns can be equally critical for high-risk applications. In this paper we tackle the previously open problem of estimating and bounding, with high confidence, the variance of returns from off-policy data.

## Introduction

Reinforcement learning (RL) has emerged as a promising method for solving sequential decision-making problems (Sutton and Barto 2018). Deploying RL to real-world applications, however, requires additional consideration of reliability, which has been relatively understudied. Specifically, it is often desirable to provide high-confidence guarantees on the behavior of a given policy, before deployment, to ensure that the policy will behave as desired. Prior works in RL have studied the problem of providing high-confidence guarantees on the expected return of an evaluation policy, $\pi$, using only data collected from a currently deployed policy called the behavior policy, $\beta$ (Thomas, Theocharous, and Ghavamzadeh 2015; Hanna, Stone, and Niekum 2017; Kuzborskij et al. 2020). Analogously, researchers have also studied the problem of counterfactually estimating and bounding the average treatment effect, with high confidence, using data from past treatments (Bottou et al. 2013).

While these methods present important contributions towards developing practical algorithms, real-world problems may require additional consideration of the variance of returns (effect) under any new policy (treatment) before it can be deployed responsibly. For applications that have high stakes in terms of financial cost or public well-being, providing guarantees on only the mean outcome might not be sufficient. Analysis of variance (ANOVA) has therefore become a de facto standard for many industrial and medical applications (Tabachnick and Fidell 2007).

Figure 1: Illustrative example of the distributions of returns from a behavior policy $\beta$ and an evaluation policy $\pi$, along with the importance-weighted returns $\rho G$, discussed later. Given trajectories from the behavior policy $\beta$, we aim to estimate and bound the variance, $\sigma^2(\pi)$, of returns under an evaluation policy $\pi$, with high confidence. Note that the distribution of importance-weighted returns has mean $\mu(\pi)$, but its variance need not equal $\sigma^2(\pi)$.

Similarly, analysis of variance can inform numerous real-world applications of RL, for example: (a) analysing the variance of outcomes in a robotics application (Kuindersma, Grupen, and Barto 2013), (b) ensuring that the variance of outcomes for a medical treatment is not high, (c) characterizing the variance of customer experiences for a recommendation system (Teevan et al. 2009), or (d) limiting the variability of the performance of an autonomous driving system (Montgomery 2007).
More generally, variance estimation can be used to account for risk in decision-making by designing objectives that maximize the mean of returns while minimizing the variance of returns (Sato, Kimura, and Kobayashi 2001; Di Castro, Tamar, and Mannor 2012; La and Ghavamzadeh 2013). Variance estimates have also been shown to be useful for automatically adapting hyper-parameters, like the exploration rate (Sakaguchi and Takano 2004) or $\lambda$ for eligibility traces (White and White 2016), and might also inform other methods that depend on the entire distribution of returns (Bellemare, Dabney, and Munos 2017; Dabney et al. 2017). Despite the wide applicability of variance analysis, estimating and bounding the variance of returns with high confidence, using only off-policy data, has remained an understudied problem.

In this paper, we first formalize the problem statement, an illustration of which is provided in Figure 1. We show that the typical use of importance sampling (IS) in RL only corrects for the mean, and so it does not directly provide unbiased off-policy estimates of variance. We then present an off-policy estimator of the variance of returns that uses IS twice, together with a simple double-sampling technique. To reduce the variance of the estimator, we extend the per-decision IS technique (Precup 2000) to off-policy variance estimation. Building upon this estimator, we provide confidence intervals for the variance using (a) concentration inequalities, and (b) statistical bootstrapping.

**Advantages:** The proposed variance estimator has several advantages: (a) it is a model-free estimator and can thus be used irrespective of the environment complexity, (b) it requires only off-policy data and can therefore be used before actual policy deployment, and (c) it is unbiased and consistent. For high-confidence guarantees, (d) we provide both upper and lower confidence intervals for the variance that have guaranteed coverage (that is, they hold with any desired confidence level and without requiring false assumptions), and (e) we also provide bootstrap confidence intervals, which are approximate but often more practical.

**Limitations:** The proposed off-policy estimator of the variance relies upon IS and thus inherits its limitations. Namely, (a) it requires knowledge of the action probabilities under the behavior policy $\beta$, (b) it requires that the support of the trajectories under the evaluation policy $\pi$ is a subset of the support under the behavior policy $\beta$, and (c) the variance of the estimator scales exponentially with the length of the trajectory (Guo, Thomas, and Brunskill 2017; Liu et al. 2018).

## Background and Problem Statement

A Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, d_0)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{P}$ is the transition function, $\mathcal{R}$ is the reward function, $\gamma \in [0, 1)$ is the discount factor, and $d_0$ is the starting state distribution.¹ A policy $\pi$ is a distribution over the actions conditioned on the state, i.e., $\pi(a|s)$ represents the probability of taking action $a$ in state $s$. We assume that the MDP has a finite horizon $T$, after which any action leads to an absorbing state $S_\infty$. In general, we use subscripts with parentheses for the timestep and subscripts without parentheses for the episode number. Let $R_{i(j)} \in [R_{\min}, R_{\max}]$ represent the reward observed at timestep $j$ of episode $i$. Let the random variable $G_i := \sum_{j=0}^{T} \gamma^j R_{i(j)}$ be the return for episode $i$.
Let $c := (1 - \gamma^{T})/(1 - \gamma)$, so that the minimum and maximum returns possible are $G_{\min} := c\,R_{\min}$ and $G_{\max} := c\,R_{\max}$, respectively. Let $\mu(\pi) := \mathbb{E}_\pi[G]$ be the expected return and $\sigma^2(\pi) := \mathbb{V}_\pi[G]$ be the variance of returns, where the subscript $\pi$ denotes that the trajectories are generated using policy $\pi$.

¹ We formulate the problem in terms of MDPs, but it can analogously be formulated in terms of structural causal models (Pearl 2009). For simplicity, we consider finite states and actions, but our results extend to POMDPs (by replacing states with observations), to continuous states and actions (by appropriately replacing summations with integrals), and to infinite horizons ($T := \infty$).

Let $\mathcal{H}^\pi_{(i):(j)}$ be the set of all possible trajectories for a policy $\pi$, from timestep $i$ to timestep $j$. Let $H$ denote a complete trajectory $(S_{(0)}, A_{(0)}, \Pr(A_{(0)}|S_{(0)}), R_{(0)}, S_{(1)}, \ldots, S_\infty)$, where $T$ is the horizon length and $S_{(0)}$ is sampled from $d_0$. Let $D$ be a set of $n$ trajectories $\{H_i\}_{i=1}^{n}$ generated using behavior policies $\{\beta_i\}_{i=1}^{n}$, respectively. Let $\rho_i(0, T) := \prod_{j=0}^{T} \frac{\pi(A_{i(j)}|S_{i(j)})}{\beta_i(A_{i(j)}|S_{i(j)})}$ denote the product of importance ratios from timestep $0$ to $T$. For brevity, when the range of timesteps is not necessary, we write $\rho_i := \rho_i(0, T)$. Similarly, when referring to $\rho_i$ for an arbitrary $i \in \{1, \ldots, n\}$, we often write $\rho$. With this notation, we now formalize the off-policy variance estimation (OVE) and the high-confidence off-policy variance estimation (HCOVE) problems.

**OVE Problem:** Given a set of trajectories $D$ and an evaluation policy $\pi$, we aim to find an estimator $\hat\sigma^2_n$ that is both an unbiased and a consistent estimator of $\sigma^2(\pi)$, i.e.,
$$\mathbb{E}[\hat\sigma^2_n] = \sigma^2(\pi), \qquad \hat\sigma^2_n \xrightarrow{\text{a.s.}} \sigma^2(\pi).$$

**HCOVE Problem:** Given a set of trajectories $D$, an evaluation policy $\pi$, and a confidence level $1 - \delta$, we aim to find a confidence interval $C := [v_{lb}, v_{ub}]$ such that
$$\Pr\left(\sigma^2(\pi) \in C\right) \geq 1 - \delta.$$

**Remark 1.** It is worth emphasizing that the OVE problem is about estimating the variance of returns, and not the variance of the estimator of the mean of returns.

These problems would not be possible to solve if the trajectories in $D$ were not informative about the trajectories that are possible under $\pi$. For example, if $D$ has no trajectory that could be observed if policy $\pi$ were to be executed, then $D$ provides little or no information about the possible outcomes under $\pi$. To avoid this case, we make the following common assumption (Precup 2000), which is satisfied if $(\beta_i(a|s) = 0) \implies (\pi(a|s) = 0)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $i \in \{1, \ldots, n\}$.

**Assumption 1.** The set $D$ contains independent trajectories generated using behavior policies $\{\beta_i\}_{i=1}^{n}$, such that $\forall i,\ \mathcal{H}^\pi_{(0):(T)} \subseteq \mathcal{H}^{\beta_i}_{(0):(T)}$.

The methods that we derive, and IS methods in general, do not require complete knowledge of $\{\beta_i\}_{i=1}^{n}$ (which might be parameterized using deep neural networks and might be hard to store). Only the probabilities $\beta_i(a|s)$ for the states $s$ and actions $a$ present in $D$ are required. For simplicity, we restrict our notation to a single behavior policy $\beta$, such that $\forall i,\ \beta_i = \beta$.

## Naïve Methods

In the on-policy setting, computing an estimate of $\mu(\pi)$ or $\sigma^2(\pi)$ is trivial: sample $n$ trajectories using $\pi$ and compute the sample mean or variance of the observed returns, $\{G_i\}_{i=1}^{n}$. In the off-policy setting, under Assumption 1, the sample mean $\hat\mu := \frac{1}{n}\sum_{i=1}^{n} \rho_i G_i$ of the importance-weighted returns $\{\rho_i G_i\}_{i=1}^{n}$ is an unbiased estimator of $\mu(\pi)$ (Precup 2000), i.e., $\mathbb{E}_\beta[\hat\mu] = \mu(\pi)$.
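For concreteness, these quantities can be computed directly from logged data. Below is a minimal NumPy sketch (the trajectory container and field layout are illustrative choices of ours, not from the paper) of the full-trajectory ratios $\rho_i$ and the IS mean estimate $\hat\mu$:

```python
import numpy as np

def trajectory_ratio(traj, pi):
    """Full-trajectory importance ratio rho_i(0, T).

    traj : list of (state, action, beta_prob, reward) tuples, one per timestep.
    pi   : callable pi(action, state) -> probability under the evaluation policy.
    """
    ratio = 1.0
    for state, action, beta_prob, _ in traj:
        # Assumption 1 guarantees beta_prob > 0 whenever pi(action, state) > 0.
        ratio *= pi(action, state) / beta_prob
    return ratio

def is_mean_estimate(trajectories, pi, gamma):
    """Importance sampling estimate mu_hat of the expected return mu(pi)."""
    weighted_returns = []
    for traj in trajectories:
        G = sum(gamma ** t * r for t, (_, _, _, r) in enumerate(traj))
        weighted_returns.append(trajectory_ratio(traj, pi) * G)
    return float(np.mean(weighted_returns))
```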
Similarly, one natural way to estimate $\sigma^2(\pi)$ in the off-policy setting might be to compute the sample variance (with Bessel's correction) of the importance-sampled returns $\{\rho_i G_i\}_{i=1}^{n}$,
$$\hat\sigma^{2!!}_n := \frac{1}{n-1}\sum_{i=1}^{n}\left(\rho_i G_i - \frac{1}{n}\sum_{j=1}^{n}\rho_j G_j\right)^2. \tag{1}$$
Unfortunately, $\hat\sigma^{2!!}_n$ is neither an unbiased nor a consistent estimator of $\sigma^2(\pi)$, in general, as shown in the following properties. These properties also reveal that $\rho_i G_i$ only corrects the distribution for the mean and not for the variance, as depicted in Figure 1. Also, note that all proofs are deferred to the appendix.

**Property 1.** Under Assumption 1, $\hat\sigma^{2!!}_n$ may be a biased estimator of $\sigma^2(\pi)$. That is, it is possible that $\mathbb{E}_\beta[\hat\sigma^{2!!}_n] \neq \sigma^2(\pi)$.

**Property 2.** Under Assumption 1, $\hat\sigma^{2!!}_n$ may not be a consistent estimator of $\sigma^2(\pi)$. That is, it is not always the case that $\hat\sigma^{2!!}_n \xrightarrow{\text{a.s.}} \sigma^2(\pi)$.

Since the on-policy variance is $\mathbb{E}_\pi[(G - \mathbb{E}_\pi[G])^2]$, a natural alternative might be to construct an estimator that corrects the off-policy distribution for both the mean and the variance. That is, using the equivalence $\mathbb{V}_\pi(G) = \mathbb{E}_\pi[(G - \mathbb{E}_\pi[G])^2] = \mathbb{E}_\beta[\rho(G - \mathbb{E}_\beta[\rho G])^2]$, an alternative might be to use a plug-in estimator for $\mathbb{E}_\beta[\rho(G - \mathbb{E}_\beta[\rho G])^2]$ (with Bessel's correction),
$$\hat\sigma^{2!}_n := \frac{1}{n-1}\sum_{i=1}^{n}\rho_i\left(G_i - \frac{1}{n}\sum_{j=1}^{n}\rho_j G_j\right)^2. \tag{2}$$
While $\hat\sigma^{2!}_n$ turns out to be a consistent estimator, it is still not an unbiased estimator of $\sigma^2(\pi)$. We formalize this in the following properties.

**Property 3.** Under Assumption 1, $\hat\sigma^{2!}_n$ may be a biased estimator of $\sigma^2(\pi)$. That is, it is possible that $\mathbb{E}_\beta[\hat\sigma^{2!}_n] \neq \sigma^2(\pi)$.

**Property 4.** Under Assumption 1, $\hat\sigma^{2!}_n$ is a consistent estimator of $\sigma^2(\pi)$. That is, $\hat\sigma^{2!}_n \xrightarrow{\text{a.s.}} \sigma^2(\pi)$.

Before even considering confidence intervals for $\sigma^2(\pi)$, the lack of unbiased estimates from these naïve methods leads to a basic question: How can we construct unbiased estimates of $\sigma^2(\pi)$? We answer this question in the following section.

## Off-Policy Variance Estimation

Before constructing an unbiased estimator for $\sigma^2(\pi)$, we first discuss one root cause of the bias of $\hat\sigma^{2!}_n$ and $\hat\sigma^{2!!}_n$. Notice that an expansion of (1) and (2) produces self-coupled importance ratio terms, that is, terms consisting of $\rho_i^2$ and $\rho_i^3$. While $\rho_i$ helps in correcting the distribution, its higher powers, $\rho_i^2$ and $\rho_i^3$, do not. Expansion of (1) and (2) also results in cross-coupled importance ratio terms, $\rho_i\rho_j$, where $i \neq j$. However, because $\mathbb{E}_\beta[\rho_i] = 1$ for all $i \in \{1, \ldots, n\}$ and because $\rho_i$ and $\rho_j$ are independent when $i \neq j$, these terms factor out in expectation. Hence, these terms do not create the troublesome higher powers of importance ratios.

Based on these observations, we create an estimator that does not have any self-coupled importance ratio terms like $\rho_i^2$, but which may have $\rho_i\rho_j$ terms, where $i \neq j$. To do so, we consider the alternate formulation of variance,
$$\mathbb{V}_\pi(G) = \mathbb{E}_\pi[G^2] - \mathbb{E}_\pi[G]^2 = \mathbb{E}_\beta[\rho G^2] - \mathbb{E}_\beta[\rho G]^2. \tag{3}$$
In (3), while a plug-in estimator of $\mathbb{E}_\beta[\rho G^2]$ would be unbiased and free of any self-coupled importance ratio terms, a plug-in estimator for $\mathbb{E}_\beta[\rho G]^2$ would neither be unbiased nor would it be free of $\rho_i^2$ terms. To remedy this problem, we explicitly split the set of sampled trajectories into two mutually exclusive sets, $D_1$ and $D_2$, of equal sizes, and re-express $\mathbb{E}_\beta[\rho G]^2$ as $\mathbb{E}_\beta[\rho G]\,\mathbb{E}_\beta[\rho G]$, where the first expectation is estimated using samples from $D_1$ and the second expectation is estimated using samples from $D_2$. Based on this double-sampling approach, we propose the following off-policy variance estimator,
$$\hat\sigma^2_n := \frac{1}{n}\sum_{i=1}^{n}\rho_i G_i^2 - \left(\frac{1}{|D_1|}\sum_{H_i \in D_1}\rho_i G_i\right)\left(\frac{1}{|D_2|}\sum_{H_i \in D_2}\rho_i G_i\right). \tag{4}$$
This simple data-splitting trick suffices to create $\hat\sigma^2_n$, an off-policy variance estimator that is both unbiased and consistent.
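As a concrete illustration, here is a minimal NumPy sketch of the split-sample estimator in (4) using full-trajectory ratios; the function name and the random permutation used to form the halves are our own choices, not from the paper:

```python
import numpy as np

def ove_estimate(rho, G, rng=None):
    """Split-sample off-policy variance estimate following (4).

    rho : full-trajectory importance ratios, one per trajectory.
    G   : discounted returns observed under the behavior policy.
    """
    rho, G = np.asarray(rho, dtype=float), np.asarray(G, dtype=float)
    rng = np.random.default_rng() if rng is None else rng

    # Plug-in estimate of E_beta[rho * G^2], which needs no data splitting.
    first = np.mean(rho * G ** 2)

    # Split D into two disjoint halves so the two sample means of rho*G are
    # independent; their product is then an unbiased estimate of E_beta[rho*G]^2.
    idx = rng.permutation(len(G))
    d1, d2 = idx[: len(G) // 2], idx[len(G) // 2:]
    second = np.mean(rho[d1] * G[d1]) * np.mean(rho[d2] * G[d2])

    return first - second
```

Note that the resulting estimate can be negative, as discussed in Remark 2 below; clipping it at zero is possible but introduces bias.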
We formalize this in the following theorems.

**Theorem 1.** Under Assumption 1, $\hat\sigma^2_n$ is an unbiased estimator of $\sigma^2(\pi)$. That is, $\mathbb{E}_\beta[\hat\sigma^2_n] = \sigma^2(\pi)$.

**Theorem 2.** Under Assumption 1, $\hat\sigma^2_n$ is a consistent estimator of $\sigma^2(\pi)$. That is, $\hat\sigma^2_n \xrightarrow{\text{a.s.}} \sigma^2(\pi)$.

**Remark 2.** It is possible that $\hat\sigma^2_n$ results in negative values (see Appendix C for an example). One practical solution to avoid negative values for the variance is to define $\hat\sigma^{2+}_n := \mathrm{clip}(\hat\sigma^2_n, \min = 0, \max = \infty)$. However, this may make $\hat\sigma^{2+}_n$ a biased estimator, i.e., $\mathbb{E}_\beta[\hat\sigma^{2+}_n] \neq \sigma^2(\pi)$. Notice that this is expected behavior for IS-based estimators. For example, the IS estimates of the expected return can be smaller or larger than the smallest and largest possible returns when $\rho > 1$. We refer the reader to the works by McHugh and Mielke (1968), Anderson (1965), and Nelder (1954) for other occurrences of negative variance and its interpretations.

## Variance-Reduced Estimation of Variance

Despite $\hat\sigma^2_n$ being both an unbiased and a consistent estimator of the variance, the use of IS can make its own variance high. Specifically, the importance ratio $\rho$ may become unstable when its denominator, $\prod_{i=0}^{T}\beta(A_{(i)}|S_{(i)})$, is small. To mitigate variance, it is common in off-policy mean estimation to use per-decision importance sampling (PDIS) instead of full-trajectory IS, which reduces variance without incurring any bias (Precup 2000). It is therefore natural to ask: Is it also possible to have something like PDIS for off-policy variance estimation?

Recall from (4) that the expectations of the terms inside the parentheses correspond to $\mathbb{E}_\beta[\rho G] = \mu(\pi)$, a term for which we can directly leverage the existing PDIS estimator, $\mathbb{E}_\beta[\rho G] = \mathbb{E}_\beta\left[\sum_{i=0}^{T}\rho(0, i)\,\gamma^i R_{(i)}\right]$. Intuitively, PDIS leverages the fact that the probability of observing an individual reward at timestep $i$ only depends upon the probability of the trajectory up to timestep $i$. However, the first term on the right-hand side (RHS) of (4) contains $G^2 = \left(\sum_{i=0}^{T}\gamma^i R_{(i)}\right)^2$. Expanding this expression, we obtain self-coupled and cross-coupled reward terms, $R_{(i)}^2$ and $R_{(i)}R_{(j)}$, which makes PDIS not directly applicable. In the following theorem we present a new estimator, coupled-decision importance sampling (CDIS), which extends PDIS to handle these coupled rewards.

**Theorem 3.** Under Assumption 1,
$$\mathbb{E}_\beta\left[\rho G^2\right] = \mathbb{E}_\beta\left[\sum_{i=0}^{T}\sum_{j=0}^{T}\rho\left(0, \max(i, j)\right)\gamma^{i+j} R_{(i)}R_{(j)}\right].$$

Borrowing intuition from PDIS, CDIS leverages the fact that the probability of observing a coupled reward, $R_{(i)}R_{(j)}$, only depends on the probability of the trajectory up to the $i$th or $j$th timestep, whichever is larger. Importance ratios beyond that timestep can therefore be discarded without incurring bias. In Algorithm 1, we combine both per-decision and coupled-decision IS to construct a variance-reduced estimator of $\sigma^2(\pi)$.

**Algorithm 1: Variance-Reduced Off-Policy Variance Estimator**
1. Input: set of trajectories $D$
2. $D_1, D_2 \leftarrow \mathrm{equal\_split}(D)$
3. $X = \frac{1}{|D|}\sum_{H_i \in D}\sum_{j=0}^{T}\sum_{k=0}^{T}\rho_i(0, \max(j, k))\,\gamma^{j+k} R_{i(j)} R_{i(k)}$
4. $Y = \frac{1}{|D_1|}\sum_{H_i \in D_1}\sum_{j=0}^{T}\rho_i(0, j)\,\gamma^{j} R_{i(j)}$
5. $Y' = \frac{1}{|D_2|}\sum_{H_i \in D_2}\sum_{j=0}^{T}\rho_i(0, j)\,\gamma^{j} R_{i(j)}$
6. Return $X - Y\,Y'$
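The following is a minimal sketch of Algorithm 1, assuming each logged trajectory stores its per-step rewards and the cumulative per-step ratios $\rho_i(0, j)$; the container layout and field names are illustrative, not from the paper:

```python
import numpy as np

def variance_reduced_ove(trajectories, gamma, rng=None):
    """Sketch of Algorithm 1: variance-reduced off-policy variance estimate.

    Each trajectory is assumed to be a dict with
      'rewards'    : per-step rewards R_(0), ..., R_(T)
      'cum_ratios' : cum_ratios[j] = rho(0, j), the product of importance
                     ratios pi/beta up to and including timestep j.
    """
    rng = np.random.default_rng() if rng is None else rng

    def cdis_second_moment(traj):
        # Coupled-decision IS term (Theorem 3):
        # sum_{j,k} rho(0, max(j, k)) * gamma^(j+k) * R_(j) * R_(k).
        r = np.asarray(traj['rewards'], dtype=float)
        rho = np.asarray(traj['cum_ratios'], dtype=float)
        disc = gamma ** np.arange(len(r))
        total = 0.0
        for j in range(len(r)):
            for k in range(len(r)):
                total += rho[max(j, k)] * disc[j] * disc[k] * r[j] * r[k]
        return total

    def pdis_return(traj):
        # Per-decision IS estimate of rho * G: sum_j rho(0, j) * gamma^j * R_(j).
        r = np.asarray(traj['rewards'], dtype=float)
        rho = np.asarray(traj['cum_ratios'], dtype=float)
        return float(np.sum(rho * gamma ** np.arange(len(r)) * r))

    # X over all of D; Y and Y' over the two halves D1 and D2 (Algorithm 1).
    X = np.mean([cdis_second_moment(t) for t in trajectories])
    idx = rng.permutation(len(trajectories))
    half = len(trajectories) // 2
    Y = np.mean([pdis_return(trajectories[i]) for i in idx[:half]])
    Y_prime = np.mean([pdis_return(trajectories[i]) for i in idx[half:]])
    return X - Y * Y_prime
```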
## HCOVE using Concentration Inequalities

In the previous section we found that the reformulation presented in (3) was helpful for creating a variance-reduced off-policy variance estimator $\hat\sigma^2_n$. In this section, we again build upon (3) to obtain a confidence interval (CI) for $\sigma^2(\pi)$. One specific advantage of (3) is that it allows us to build upon existing concentration inequalities, which were developed for obtaining CIs for $\mu(\pi)$, to obtain a CI for $\sigma^2(\pi)$. Before moving further, we define some additional notation.

For any random variable $X$, let $CI^+(\mathbb{E}[X], \delta)$, $CI^-(\mathbb{E}[X], \delta)$, and $CI^{\pm}(\mathbb{E}[X], \delta)$ represent only upper, only lower, and both upper and lower $(1-\delta)$-confidence bounds for $\mathbb{E}[X]$, respectively. That is, $\Pr\left(CI^+(\mathbb{E}[X], \delta) \geq \mathbb{E}[X]\right) \geq 1 - \delta$, $\Pr\left(CI^-(\mathbb{E}[X], \delta) \leq \mathbb{E}[X]\right) \geq 1 - \delta$, etc. For brevity, we will sometimes suppress the CI's dependency on $\delta$.

With the above notation, we now establish a high-confidence bound on (3). Recall that (3) consists of one positive term, $\mathbb{E}[\rho G^2]$, and a negative term, $\mathbb{E}[\rho G]^2$. Therefore, given a confidence interval for each of these terms, the high-confidence upper bound for (3) is the high-confidence upper bound of $\mathbb{E}[\rho G^2]$ minus the high-confidence lower bound of $\mathbb{E}[\rho G]^2$, and vice versa to obtain a high-confidence lower bound on (3). That is, let $\delta_1, \delta_2, \delta_3$, and $\delta_4$ be constants in $(0, 0.5]$ such that $\delta/2 = \delta_1 + \delta_2 = \delta_3 + \delta_4$. The lower bound $v_{lb}$ and the upper bound $v_{ub}$ can be expressed as
$$v_{lb} := CI^-\left(\mathbb{E}_\beta[\rho G^2], \delta_1\right) - CI^+\left(\mathbb{E}_\beta[\rho G]^2, \delta_2\right), \tag{5}$$
$$v_{ub} := CI^+\left(\mathbb{E}_\beta[\rho G^2], \delta_3\right) - CI^-\left(\mathbb{E}_\beta[\rho G]^2, \delta_4\right). \tag{6}$$

To get the desired CIs for the first terms on the RHS of (5) and (6), notice that any method for obtaining a CI on the expected return, $\mathbb{E}_\beta[\rho G]$, can also be used to bound $\mathbb{E}_\beta[\rho G']$, where $G' := G^2$. To get the desired CIs for the second terms on the RHS of (5) and (6), we perform interval propagation (Jaulin, Braems, and Walter 2002). That is, given a high-confidence interval for $\mathbb{E}[\rho G]$, since $\mathbb{E}[\rho G]^2$ is a quadratic function of $\mathbb{E}[\rho G]$, the upper bound for the value of $\mathbb{E}[\rho G]^2$ is the maximum of the squared values of the end-points of the interval for $\mathbb{E}[\rho G]$. Similarly, the lower bound on $\mathbb{E}[\rho G]^2$ is $0$ if the signs of the upper and lower bounds for $\mathbb{E}[\rho G]$ differ; otherwise, it is the minimum of the squared values of the end-points of the interval for $\mathbb{E}[\rho G]$. An illustration of this concept is presented in Figure 2. Using interval propagation, the resulting upper bound is
$$CI^+\left(\mathbb{E}_\beta[\rho G]^2\right) = \max\left(CI^-\left(\mathbb{E}_\beta[\rho G]\right)^2,\ CI^+\left(\mathbb{E}_\beta[\rho G]\right)^2\right),$$
and the resulting high-confidence lower bound is $CI^-\left(\mathbb{E}_\beta[\rho G]^2\right) = 0$ if both $CI^-\left(\mathbb{E}_\beta[\rho G]\right) \leq 0$ and $CI^+\left(\mathbb{E}_\beta[\rho G]\right) \geq 0$, and $CI^-\left(\mathbb{E}_\beta[\rho G]^2\right) = \min\left(CI^+\left(\mathbb{E}_\beta[\rho G]\right)^2,\ CI^-\left(\mathbb{E}_\beta[\rho G]\right)^2\right)$ otherwise. Notice that these upper and lower high-confidence bounds on $\mathbb{E}_\beta[\rho G]^2$ can always be reduced to $\max(G_{\min}^2, G_{\max}^2)$ (the maximum squared return under any policy) when they are larger. In the following theorem, we prove that the resulting confidence interval, $C$, has guaranteed coverage, i.e., that it holds with probability at least $1 - \delta$.

**Theorem 4** (Guaranteed coverage). Under Assumption 1, if $(\delta_1 + \delta_2 + \delta_3 + \delta_4) \leq \delta$, then for the confidence interval $C := [v_{lb}, v_{ub}]$,
$$\Pr\left(\sigma^2(\pi) \in C\right) \geq 1 - \delta.$$

Figure 2: Two separate examples that show how interval propagation can be used to map confidence intervals for $X$ (in red) to confidence intervals for $Y = X^2$ (in blue).

**Remark 3.** Theorem 4 presents a two-sided interval. If only a lower bound or only an upper bound is required, then it suffices that only $(\delta_1 + \delta_2) \leq \delta$ or $(\delta_3 + \delta_4) \leq \delta$, respectively.

**Remark 4.** $C$ can always be clipped by taking the intersection with the interval $[0, (G_{\max} - G_{\min})^2/4]$, since the variance will always be within this range (see Popoviciu's inequality for the deterministic upper bound on variance).
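The interval-propagation step and the assembly of (5) and (6) reduce to a few lines. The sketch below is a simplification of ours: it assumes the caller supplies one two-sided confidence interval for each of $\mathbb{E}_\beta[\rho G^2]$ and $\mathbb{E}_\beta[\rho G]$, leaving the $\delta_1, \ldots, \delta_4$ splitting to the caller, and the function names are illustrative:

```python
def propagate_square(lower, upper):
    """Map a confidence interval [lower, upper] for E_beta[rho*G] to one for
    E_beta[rho*G]^2 via interval propagation (Figure 2)."""
    ub = max(lower ** 2, upper ** 2)
    # If the interval straddles zero, the square can be as small as 0.
    lb = 0.0 if lower <= 0.0 <= upper else min(lower ** 2, upper ** 2)
    return lb, ub

def variance_interval(ci_rho_g2, ci_rho_g, g_min, g_max):
    """Assemble [v_lb, v_ub] for sigma^2(pi) as in (5) and (6), then clip with
    Popoviciu's inequality (Remark 4).

    ci_rho_g2 : (lower, upper) confidence bounds on E_beta[rho * G^2]
    ci_rho_g  : (lower, upper) confidence bounds on E_beta[rho * G]
    """
    lb_g2, ub_g2 = ci_rho_g2
    sq_lb, sq_ub = propagate_square(*ci_rho_g)
    v_lb = lb_g2 - sq_ub      # (5): lower bound on the variance
    v_ub = ub_g2 - sq_lb      # (6): upper bound on the variance
    popoviciu = (g_max - g_min) ** 2 / 4.0
    return max(v_lb, 0.0), min(v_ub, popoviciu)
```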
## A Tale of Long-Tails

One important advantage of Theorem 4 is that it constructs a CI, $C$, for $\sigma^2(\pi)$ using any concentration inequality that can be used to get the CIs $CI^+(\mathbb{E}_\beta[\rho G])$ and $CI^-(\mathbb{E}_\beta[\rho G])$ for $\mu(\pi)$. Hence, the tightness of $C$ scales directly with the tightness of these existing off-policy policy evaluation methods for the expected discounted return. However, naïvely using common concentration inequalities can result in wide and uninformative CIs, as we discuss below. Therefore, in this section, we aim to establish a control variate that is designed to produce tighter CIs for $\sigma^2(\pi)$.

Typically, for a random variable $X \in [a, b]$, the width of the confidence interval for $\mathbb{E}[X]$ obtained using common concentration inequalities, like Hoeffding's inequality (Hoeffding 1994) or an empirical Bernstein inequality (Maurer and Pontil 2009), depends directly on the range, $(b - a)$. Unfortunately, as shown by Thomas, Theocharous, and Ghavamzadeh (2015), IS-based estimators may exhibit extremely long-tailed behavior and can have a range on the order of $10^{10}$. For example, even if $\beta(a|s) > 0.1$ for all $a \in \mathcal{A}$ and $s \in \mathcal{S}$, the maximum possible importance-weighted return of a ten-timestep trajectory can be on the order of $(1/0.1)^{10} = 10^{10}$, even when returns are normalized to the $[0, 1]$ interval. Such a large range causes Hoeffding's inequality and empirical Bernstein inequalities to produce wide and uninformative confidence intervals, especially when the number of samples is not enormous.

To construct a lower bound for $\mathbb{E}[X]$ while being robust to the long tail, Thomas, Theocharous, and Ghavamzadeh (2015) notice that truncating the upper tail of $X$ at a constant $c$ can only lower the expected value of $X$; i.e., for $X' := \min(X, c)$, $\mathbb{E}[X'] \leq \mathbb{E}[X]$. Therefore, $X'_{lb} := CI^-(\mathbb{E}[X'], \delta)$ is also a valid lower bound for $\mathbb{E}[X]$, and $\Pr(X'_{lb} \leq \mathbb{E}[X]) \geq 1 - \delta$. Additionally, truncating allows for significantly shrinking the range from $[a, b]$ to $[a, c]$, thereby leading to a much tighter lower bound when $c$ is chosen appropriately. For completeness, we review this bound in Appendix F.

While this bound was designed specifically for getting the lower bounds required in (5) and (6), it cannot be naïvely used to get the upper bounds. As $\mathbb{E}[X'] \leq \mathbb{E}[X]$, the upper bound $X'_{ub} := CI^+(\mathbb{E}[X'], \delta)$ may not be a valid upper bound for $\mathbb{E}[X]$, and $\Pr(X'_{ub} \geq \mathbb{E}[X])$ may be less than $1 - \delta$. A natural question is then: How can an upper bound be obtained that is robust to the long upper tail? To answer this question, notice that if, instead of the upper tail, the lower tail of the distribution were long, then the upper bound constructed after truncating the lower tail would still be valid. Therefore, we introduce a control variate $\xi$ that can be used to switch the tails of the distribution of an IS-based estimator, such that both valid upper and valid lower bounds can be obtained from the resulting distribution. We formalize this in the following theorem.

**Theorem 5.** Let $X$ be either $G$ or $G^2$. Then for any $\delta \in (0, 0.5]$ and a fixed constant $\xi$,
$$CI^{\pm}\left(\mathbb{E}_\beta[\rho X], \delta\right) = CI^{\pm}\left(\mathbb{E}_\beta[\rho(X - \xi)], \delta\right) + \xi.$$

**Remark 5.** When $\xi$ is set to be the maximum value that $X$ can take, the random variable $\rho(X - \xi)$ has an upper bound of $0$ and a long lower tail, since $\rho \geq 0$ and $(X - \xi) \leq 0$. Similarly, when $\xi$ is set to be the minimum value that $X$ can take, the random variable $\rho(X - \xi)$ has a lower bound of $0$ and a long upper tail. When a two-sided interval is required, two different estimators need to be constructed using the values of $\xi$ discussed above.

Theorem 5 allows us to control the tail behavior such that the tight bounds presented by Thomas, Theocharous, and Ghavamzadeh (2015) can be leveraged to obtain both valid upper and valid lower high-confidence bounds.
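A sketch of the tail switch in Theorem 5, under the assumption that a generic one-sided bound routine is available and valid when the variable's long tail is on the appropriate side (the `one_sided_upper` and `one_sided_lower` callables are placeholders, not APIs from the paper):

```python
import numpy as np

def upper_bound_via_tail_switch(rho, X, x_max, one_sided_upper, delta):
    """Theorem 5 with xi = max(X): rho * (X - xi) <= 0, so the long tail is
    moved to the lower side and a truncation-style upper bound stays valid.

    one_sided_upper : placeholder for any routine returning a high-confidence
                      upper bound on a mean, valid for variables bounded above.
    """
    shifted = np.asarray(rho, dtype=float) * (np.asarray(X, dtype=float) - x_max)
    return one_sided_upper(shifted, delta) + x_max

def lower_bound_via_tail_switch(rho, X, x_min, one_sided_lower, delta):
    """Symmetric shift with xi = min(X): rho * (X - xi) >= 0, long upper tail."""
    shifted = np.asarray(rho, dtype=float) * (np.asarray(X, dtype=float) - x_min)
    return one_sided_lower(shifted, delta) + x_min
```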
However, Theorem 5 still makes use of the full-trajectory importance ratio $\rho$, which can result in high variance and inflate the confidence intervals. To mitigate this problem as well, we combine the variance-reduction property of per-decision and coupled-decision IS offered by Theorem 3 with the control over the tail behavior offered by Theorem 5, and present the following theorem (see Appendix F for the complete algorithm).

**Theorem 6.** Under Assumption 1, for any $\delta \in (0, 0.5]$, let $\xi_R := \max(R_{\min}^2, R_{\max}^2)$ and $\xi_G := \max(G_{\min}^2, G_{\max}^2)$, and let
$$X := \sum_{i=0}^{T}\sum_{j=0}^{T}\rho\left(0, \max(i, j)\right)\gamma^{i+j}\left(R_{(i)}R_{(j)} - \xi_R\right), \qquad Y := \sum_{i=0}^{T}\rho(0, i)\,\gamma^{i}\left(R_{(i)} - R_{\max}\right).$$
Then $\Pr(X \leq 0) = \Pr(Y \leq 0) = 1$, and
$$CI^+\left(\mathbb{E}_\beta[\rho G^2], \delta\right) = CI^+\left(\mathbb{E}_\beta[X], \delta\right) + \xi_G, \qquad CI^+\left(\mathbb{E}_\beta[\rho G], \delta\right) = CI^+\left(\mathbb{E}_\beta[Y], \delta\right) + G_{\max}.$$

**Remark 6.** For a lower bound on $\mathbb{E}_\beta[\rho G^2]$, notice that $\rho G^2 \geq 0$ always, and thus the lower bound on $\mathbb{E}_\beta[X]$, with $\xi_R$ and $\xi_G$ set to $0$, can be used. A lower bound on $\mathbb{E}_\beta[\rho G]$ can be constructed by using the lower bound on $\mathbb{E}_\beta[Y]$, with $R_{\max}$ and $G_{\max}$ replaced by $R_{\min}$ and $G_{\min}$.

**Remark 7.** If some trajectories have horizon length $t < T$, then they must be appropriately padded to ensure that $\forall i \in [t+1, T]$, $\rho(0, i) = \rho(0, t)$ and $R_{(i)} = 0$, such that in expectation the total amount added/subtracted by the control variate is zero.

## HCOVE using Statistical Bootstrapping

The bootstrap is a popular non-parametric technique for finding approximate confidence intervals (Efron and Tibshirani 1994). The core idea of the bootstrap is to re-sample the observed data $D$ and construct $B$ pseudo-datasets $\{D^*_i\}_{i=1}^{B}$ in such a way that each $D^*_i$ resembles a draw from the true underlying data-generating process. With each pseudo-dataset $D^*_i$, an unbiased pseudo-estimate of a desired sample statistic can be created. For our problem, this statistic corresponds to $\hat\sigma^2_n$, the estimate of $\sigma^2(\pi)$ obtained using (4). Thereby, leveraging the entire set of pseudo-datasets $\{D^*_i\}_{i=1}^{B}$, an empirical distribution of the estimates of the variance, $\{\hat\sigma^{2*}_{n,i}\}_{i=1}^{B}$, can be obtained. This empirical distribution approximates the true distribution of $\hat\sigma^2_n$ and can thus be leveraged to obtain CIs for $\sigma^2(\pi)$ using the percentile method, the bias-corrected and accelerated (BCa) method, etc. (DiCiccio and Efron 1996).

A drawback of the bootstrap is the increased computational cost required for re-sampling and analysing $B$ pseudo-datasets. Further, the CIs obtained from the bootstrap are only approximate, meaning that they can fail with more than $\delta$ probability. However, the primary advantage of using the bootstrap is that it provides much tighter CIs than those obtained using concentration inequalities, and hence can be more informative for certain applications in practice.

Let $\hat{C}$ be the approximate interval for $\sigma^2(\pi)$, for a given confidence level $1 - \delta$, obtained using the bootstrap (see Appendix F for the complete algorithm). Then, under the following assumption on the higher moments of $\hat\sigma^2_n$, we directly leverage existing results for the bootstrap to obtain an error rate for $\hat{C}$.

**Assumption 2.** The third moment of $\hat\sigma^2_n$ is bounded. That is, $\exists c_1 < \infty$ such that $\mathbb{E}_\beta\left[(\hat\sigma^2_n - \mathbb{E}_\beta[\hat\sigma^2_n])^3\right] < c_1$.

Assumption 2 is a typical requirement for bootstrap methods (Efron and Tibshirani 1994). It can easily be satisfied by commonly used entropy-regularized behavior policies, which ensure that $\exists c_2 > 0$ such that $\forall a \in \mathcal{A}, s \in \mathcal{S},\ \beta(a|s) \geq c_2$. This ensures that the importance ratio satisfies $\rho \leq 1/c_2^{T}$, and because $G \leq G_{\max}$, $\rho G$ and $\rho G^2$ are also bounded. Consequently, $\hat\sigma^2_n$ is bounded, and therefore all its moments are bounded, as required by Assumption 2.
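A minimal sketch of the percentile-bootstrap interval $\hat{C}$, resampling whole trajectories with replacement; the `estimator` argument is a placeholder for any variance estimator taking $(\rho, G)$, such as the split-sample sketch shown earlier, and other variants such as BCa would replace the percentile step:

```python
import numpy as np

def bootstrap_variance_interval(rho, G, estimator, delta=0.05, B=2000, seed=0):
    """Approximate (1 - delta)-confidence interval for sigma^2(pi) via the
    percentile bootstrap."""
    rng = np.random.default_rng(seed)
    rho, G = np.asarray(rho, dtype=float), np.asarray(G, dtype=float)
    n = len(G)
    # Resample whole trajectories with replacement to form B pseudo-datasets,
    # compute the pseudo-estimates, and read off the empirical percentiles.
    estimates = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        estimates.append(estimator(rho[idx], G[idx]))
    lo, hi = np.percentile(estimates, [100 * delta / 2, 100 * (1 - delta / 2)])
    return lo, hi
```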
We formalize the asymptotic correctness of bootstrap confidence intervals in the following theorem.

**Theorem 7.** Under Assumptions 1 and 2, the confidence interval $\hat{C}$ has a finite-sample error of $O(n^{-1/2})$. That is,
$$\Pr\left(\sigma^2(\pi) \in \hat{C}\right) \geq 1 - \delta - O\left(n^{-1/2}\right).$$

**Remark 8.** Other variants of the bootstrap (bootstrap-t, BCa, etc.) can also be used; these typically offer higher-order refinement by reducing the finite-sample error rate to $O(n^{-1})$ (DiCiccio and Efron 1996).

## Related Work

When samples are drawn from the distribution whose variance needs to be estimated, then under the assumption that the distribution is normal, the $\chi^2$ distribution can be used to provide CIs for the variance. Effects of non-normality on tests of significance were first analyzed by Pearson and Adyanthaya (1929), and this has led to a large body of literature on variance tests (Pearson 1931; Box 1953; Levene 1960). Various modifications of $\chi^2$ tests have also been proposed to be robust against samples from non-normal distributions (Subrahmaniam 1966; García-Pérez 2006; Pan 1999; Lim and Loh 1996). The statistical bootstrap approach used in this paper to obtain bounds on the variance is closest to the bootstrap test developed by Shao (1990). However, all of these methods are analogues of on-policy variance analysis.

In the context of RL, Sobel (1982) first introduced Bellman operators for the second moment and combined them with the first moment to compute the variance. Temporal difference (TD) style algorithms have subsequently been developed for estimating the variance of returns (Tamar, Di Castro, and Mannor 2016; La and Ghavamzadeh 2013; White and White 2016; Sherstan et al. 2018). However, such TD methods might suffer from potential instabilities when used with function approximators and off-policy data (Sutton and Barto 2018). Policy gradient style algorithms have also been developed for finding policies that optimize variance-related objectives (Di Castro, Tamar, and Mannor 2012; Tamar and Mannor 2013); however, these are limited to the on-policy setting. We are not aware of any work in the RL literature that provides unbiased and consistent off-policy variance estimators, nor high-confidence bounds thereon.

Outside RL, variants of off-policy (or counterfactual) estimation using importance sampling (or the inverse propensity estimator (Horvitz and Thompson 1952)) are common in econometrics (Hoover 2011; Stock and Watson 2015) and causal inference (Pearl 2009). While these works have mostly focused on mean estimation, counterfactual probability densities or quantiles of potential outcomes can also be estimated (DiNardo, Fortin, and Lemieux 1995; Melly 2006; Chernozhukov, Fernández-Val, and Melly 2013; Donald and Hsu 2014). Such distribution estimation methods could possibly also be used to estimate the off-policy variance; however, it is unclear how to obtain unbiased estimates of the variance from an unbiased estimate of the distribution. Instead, focusing directly on the variance can be more data-efficient and can also lead to unbiased estimators. Further, these works neither leverage any MDP structure to reduce the variance resulting from IS, nor do they provide methods for high-confidence bounds on the variance. In the RL setting, the problem of high variance in IS is exacerbated because sequential interaction leads to multiplicative importance ratios, thereby requiring additional consideration of long tails to obtain tight bounds.

## Experimental Study

Inspired by real-world applications where OVE and HCOVE can be useful, we validate our proposed estimators empirically on two domains motivated by real-world applications. Here, we provide only a brief description of the experimental setup and the main results; Appendix G contains additional experimental details.
Figure 3: Experimental results using 100 trials. (Top) Empirically observed fraction (out of 100 trials) for which the computed confidence interval did not include the actual variance, for the given number of trajectories (plotted on the shared horizontal axis), for the proposed upper and lower high-confidence bounds constructed using concentration inequalities (labeled CI+ and CI-) and using the bootstrap (labeled Bootstrap+ and Bootstrap-). The colors of the bars refer to the legend, and these bars should ideally be below the line representing the confidence level $\delta = 0.05$. (Bottom) The dashed colored lines represent the values of the respective high-confidence bounds, each constructed with confidence level $1 - \delta$. The green line represents the value of our proposed estimator $\hat\sigma^2_n$, and the shaded area around it (almost negligible) corresponds to the standard error. The black dashed line represents the true variance, $\sigma^2(\pi)$. The unbiased and consistent properties of $\hat\sigma^2_n$ can be visualized by comparing it with $\sigma^2(\pi)$. Notice that since the bootstrap confidence interval $\hat{C}$ is only approximate, it can fail with more than $\delta$ probability. In comparison, the confidence interval $C$ obtained using concentration inequalities provides guaranteed coverage. However, as is clear from the plots, $C$ can be conservative, while $\hat{C}$ provides a tighter interval.

**Diabetes treatment:** This domain is based on an open-source implementation (Xie 2019) of the FDA-approved Type-1 Diabetes Mellitus simulator (T1DMS) (Man et al. 2014) for the treatment of Type-1 Diabetes, where the objective is to control an insulin pump to regulate the blood-glucose level of a patient. High-confidence estimation of the variance of a controller's outcome, before deployment, can be informative when assessing potential harm to the patient that may be caused by the controller.

**Recommender system:** This domain simulates the problem of providing online recommendations based on customer interests, where it is often useful to obtain high-confidence estimates of the variance of the customers' experience, before actually deploying the system, to limit financial loss.

**Gridworld:** We also consider a standard 4×4 Gridworld with stochastic transitions. There are eight discrete actions corresponding to up, down, left, right, and the four diagonal movements.

Given trajectories collected using a behavior policy $\beta$, in Figure 3 we provide the trend of our estimator $\hat\sigma^2_n$ for an evaluation policy $\pi$, along with the confidence intervals $C$ and $\hat{C}$, as the number of trajectories increases (more details on how $\pi$ and $\beta$ were constructed can be found in Appendix G). As established in Theorems 1 and 2, $\hat\sigma^2_n$ can be seen to be both an unbiased and a consistent estimator of $\sigma^2(\pi)$. Similarly, as established in Theorem 4, the $(1-\delta)$-confidence interval $C$ provides guaranteed coverage. In comparison, as established in Theorem 7, bootstrap bounds are approximate and can fail more than a $\delta$ fraction of the time. However, bootstrap bounds can still be useful in many applications as they provide tighter intervals.

## Conclusion

In this work, we addressed the understudied problem of estimating and bounding $\sigma^2(\pi)$ using only off-policy data. We took the first steps towards developing a model-free, off-policy, unbiased, and consistent estimator of $\sigma^2(\pi)$ using a simple double-sampling trick.
We then showed how bound propagation using concentration inequalities, or statistical bootstrapping, can be used to obtain CIs for $\sigma^2(\pi)$. Finally, empirical results were provided to support the established theoretical results.

## References

Anderson, R. 1965. Negative variance estimates. Technometrics 7(1): 75-76.
Bastani, M. 2014. Model-free intelligent diabetes management using machine learning. M.S. Thesis, University of Alberta.
Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887.
Bottou, L.; Peters, J.; Quiñonero-Candela, J.; Charles, D. X.; Chickering, D. M.; Portugaly, E.; Ray, D.; Simard, P.; and Snelson, E. 2013. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research 14(1): 3207-3260.
Box, G. E. 1953. Non-normality and tests on variances. Biometrika 40(3/4): 318-335.
Chernozhukov, V.; Fernández-Val, I.; and Melly, B. 2013. Inference on counterfactual distributions. Econometrica 81(6): 2205-2268.
Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2017. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044.
Di Castro, D.; Tamar, A.; and Mannor, S. 2012. Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404.
DiCiccio, T. J.; and Efron, B. 1996. Bootstrap confidence intervals. Statistical Science 189-212.
DiNardo, J.; Fortin, N. M.; and Lemieux, T. 1995. Labor market institutions and the distribution of wages, 1973-1992: A semiparametric approach. Technical report, National Bureau of Economic Research.
Donald, S. G.; and Hsu, Y.-C. 2014. Estimation and inference for distribution functions and quantile functions in treatment effect models. Journal of Econometrics 178: 383-397.
Efron, B.; and Tibshirani, R. J. 1994. An Introduction to the Bootstrap. CRC Press.
García-Pérez, A. 2006. Chi-square tests under models close to the normal distribution. Metrika 63(3): 343-354.
Guo, Z.; Thomas, P. S.; and Brunskill, E. 2017. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, 2492-2501.
Hanna, J. P.; Stone, P.; and Niekum, S. 2017. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems.
Hoeffding, W. 1994. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, 409-426. Springer.
Hoover, K. D. 2011. Counterfactuals and causal structure. Oxford University Press, Oxford.
Horvitz, D. G.; and Thompson, D. J. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260): 663-685.
Jaulin, L.; Braems, I.; and Walter, E. 2002. Interval methods for nonlinear identification and robust control. In Proceedings of the 41st IEEE Conference on Decision and Control, volume 4, 4676-4681. IEEE.
Kostrikov, I.; and Nachum, O. 2020. Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609.
Kuindersma, S. R.; Grupen, R. A.; and Barto, A. G. 2013. Variable risk control via stochastic optimization. The International Journal of Robotics Research 32(7): 806-825.
Kuzborskij, I.; Vernade, C.; György, A.; and Szepesvári, C. 2020. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460.
La, P.; and Ghavamzadeh, M. 2013. Actor-critic algorithms for risk-sensitive MDPs. Advances in Neural Information Processing Systems 26: 252-260.
Levene, H. 1960. Robust tests for equality of variances. In Olkin, I., ed., Contributions to Probability and Statistics.
Lim, T.-S.; and Loh, W.-Y. 1996. A comparison of tests of equality of variances. Computational Statistics & Data Analysis 22(3): 287-301.
Liu, Q.; Li, L.; Tang, Z.; and Zhou, D. 2018. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, 5356-5366.
Man, C. D.; Micheletto, F.; Lv, D.; Breton, M.; Kovatchev, B.; and Cobelli, C. 2014. The UVA/PADOVA type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology 8(1): 26-34.
Maurer, A.; and Pontil, M. 2009. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740.
McHugh, R. B.; and Mielke, P. W. 1968. Negative variance estimates and statistical dependence in nested sampling. Journal of the American Statistical Association 63(323): 1000-1003.
Melly, B. 2006. Estimation of counterfactual distributions using quantile regression. Technical report, Universität St. Gallen.
Montgomery, D. C. 2007. Introduction to Statistical Quality Control. John Wiley & Sons.
Nelder, J. 1954. The interpretation of negative components of variance. Biometrika 41(3/4): 544-548.
Pan, G. 1999. On a Levene type test for equality of two variances. Journal of Statistical Computation and Simulation 63(1): 59-71.
Pearl, J. 2009. Causality. Cambridge University Press.
Pearson, E. S. 1931. The analysis of variance in cases of non-normal variation. Biometrika 114-133.
Pearson, E. S.; and Adyanthaya, N. 1929. The distribution of frequency constants in small samples from non-normal symmetrical and skew populations. Biometrika 21(1/4): 259-286.
Precup, D. 2000. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series 80.
Sakaguchi, Y.; and Takano, M. 2004. Reliability of internal prediction/estimation and its application. I. Adaptive action selection reflecting reliability of value function. Neural Networks 17(7): 935-952.
Sato, M.; Kimura, H.; and Kobayashi, S. 2001. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence 16(3): 353-362.
Shao, J. 1990. Bootstrap estimation of the asymptotic variances of statistical functionals. Annals of the Institute of Statistical Mathematics 42(4): 737-752.
Sherstan, C.; Ashley, D. R.; Bennett, B.; Young, K.; White, A.; White, M.; and Sutton, R. S. 2018. Comparing direct and indirect temporal-difference methods for estimating the variance of the return. In UAI, 63-72.
Sobel, M. J. 1982. The variance of discounted Markov decision processes. Journal of Applied Probability 19(4): 794-802.
Stock, J. H.; and Watson, M. W. 2015. Introduction to Econometrics. Pearson Press.
Subrahmaniam, K. 1966. Some contributions to the theory of non-normality-I (univariate case). Sankhyā: The Indian Journal of Statistics, Series A 389-406.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2nd edition.
Tabachnick, B. G.; and Fidell, L. S. 2007. Experimental Designs Using ANOVA. Thomson/Brooks/Cole, Belmont, CA.
Tamar, A.; Di Castro, D.; and Mannor, S. 2016. Learning the variance of the reward-to-go. The Journal of Machine Learning Research 17(1): 361-396.
Tamar, A.; and Mannor, S. 2013. Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697.
Teevan, J. B.; Dumais, S. T.; Liebling, D. J.; and Horvitz, E. J. 2009. Using variation in user interest to enhance the search experience. US Patent App. 12/163,561.
Thomas, P. S.; Theocharous, G.; and Ghavamzadeh, M. 2015. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Wasserman, L. 2006. All of Nonparametric Statistics. Springer Science & Business Media.
White, M.; and White, A. 2016. A greedy approach to adapting the trace parameter for temporal difference learning. arXiv preprint arXiv:1607.00446.
Xie, J. 2019. Simglucose v0.2.1 (2018). URL https://github.com/jxx123/simglucose.