# CoinDICE: Off-Policy Confidence Interval Estimation

Bo Dai1, Ofir Nachum1, Yinlam Chow1, Lihong Li1, Csaba Szepesvári2,3, Dale Schuurmans1,3
1Google Research, Brain Team  2University of Alberta  3DeepMind
*Equal contribution. Email: {bodai, ofirnachum}@google.com.

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods. Open-source code for CoinDICE is available at https://github.com/google-research/dice_rl.

1 Introduction

One of the major barriers that hinders the application of reinforcement learning (RL) is the ability to evaluate new policies reliably before deployment, a problem generally known as off-policy evaluation (OPE). In many real-world domains, e.g., healthcare (Murphy et al., 2001; Gottesman et al., 2018), recommendation (Li et al., 2011; Chen et al., 2019), and education (Mandel et al., 2014), deploying a new policy can be expensive, risky or unsafe. Accordingly, OPE has seen a recent resurgence of research interest, with many methods proposed to estimate the value of a policy (Precup et al., 2000; Dudík et al., 2011; Bottou et al., 2013; Jiang and Li, 2016; Thomas and Brunskill, 2016; Liu et al., 2018; Nachum et al., 2019a; Kallus and Uehara, 2019a,b; Zhang et al., 2020b).

However, the very settings where OPE is necessary usually entail limited data access. In these cases, obtaining knowledge of the uncertainty of the estimate is as important as having a consistent estimator. That is, rather than a point estimate, many applications would benefit significantly from having confidence intervals on the value of a policy. The problem of estimating these confidence intervals, known as high-confidence off-policy evaluation (HCOPE) (Thomas et al., 2015b), is imperative in real-world decision making, where deploying a policy without high-probability safety guarantees can have catastrophic consequences (Thomas, 2015).

Most existing high-confidence off-policy evaluation algorithms in RL (Bottou et al., 2013; Thomas et al., 2015a,b; Hanna et al., 2017) construct such intervals using statistical techniques such as concentration inequalities and the bootstrap applied to importance-corrected estimates of policy value. The primary challenge with these correction-based approaches is the high variance resulting from multiplying per-step importance ratios in long-horizon problems. Moreover, they typically require full knowledge (or a good estimate) of the behavior policy, which is not easily available in behavior-agnostic OPE settings (Nachum et al., 2019a).

In this work, we propose an algorithm for behavior-agnostic HCOPE. We start from a linear programming formulation of the state-action value function. We show that the value of the policy may be obtained from a Lagrangian optimization problem with generalized estimating equation constraints
over data sampled from off-policy distributions. This observation inspires a generalized empirical likelihood approach (Owen, 2001; Broniatowski and Keziou, 2012; Duchi et al., 2016) to confidence interval estimation. These derivations enable us to express high-confidence lower and upper bounds for the policy value as results of minimax optimizations over an arbitrary offline dataset, with the appropriate distribution corrections being implicitly estimated during the optimization. We translate this understanding into a practical estimator, Confidence Interval DIstribution Correction Estimation (CoinDICE), and design an efficient algorithm for implementing it. We then justify the asymptotic coverage of these bounds and present non-asymptotic guarantees to characterize finite-sample effects. Notably, CoinDICE is behavior-agnostic and its objective function does not involve any per-step importance ratios, so the estimator is less susceptible to high-variance gradient updates. We evaluate CoinDICE in a number of settings and show that it provides both tighter confidence interval estimates and more correctly matches the desired statistical coverage compared to existing methods.

2 Preliminaries

For a set $W$, the set of probability measures over $W$ is denoted by $\mathcal{P}(W)$.³ We consider a Markov Decision Process (MDP) (Puterman, 2014), $M = \langle S, A, T, R, \gamma, \mu_0\rangle$, where $S$ denotes the state space, $A$ denotes the action space, $T: S\times A\to\mathcal{P}(S)$ is the transition probability kernel, $R: S\times A\to\mathcal{P}([0, R_{\max}])$ is a bounded reward kernel, $\gamma\in(0,1]$ is the discount factor, and $\mu_0$ is the initial state distribution. A policy, $\pi: S\to\mathcal{P}(A)$, can be used to generate a random trajectory by starting from $s_0\sim\mu_0$, then following $a_t\sim\pi(\cdot|s_t)$, $r_t\sim R(s_t,a_t)$ and $s_{t+1}\sim T(s_t,a_t)$ for $t\ge 0$. The state- and action-value functions of $\pi$ are denoted $V^\pi$ and $Q^\pi$, respectively. The policy also induces an occupancy measure, $d^\pi(s,a) := (1-\gamma)\,\mathbb{E}\big[\sum_{t\ge 0}\gamma^t \mathbf{1}\{s_t=s, a_t=a\}\big]$, the normalized discounted probability of visiting $(s,a)$ in a trajectory generated by $\pi$, where $\mathbf{1}\{\cdot\}$ is the indicator function. Finally, the policy value is defined as the normalized expected reward accumulated along a trajectory:

$$\rho(\pi) := (1-\gamma)\,\mathbb{E}\Big[\textstyle\sum_{t\ge 0}\gamma^t r_t \,\Big|\, s_0\sim\mu_0,\ a_t\sim\pi(\cdot|s_t),\ r_t\sim R(s_t,a_t),\ s_{t+1}\sim T(s_t,a_t)\Big]. \quad (1)$$

We are interested in estimating the policy value and its confidence interval (CI) in the behavior-agnostic off-policy setting (Nachum et al., 2019a; Zhang et al., 2020a), where interaction with the environment is limited to a static dataset of experience $D := \{(s, a, s', r)_i\}_{i=1}^n$. Each tuple in $D$ is generated according to $(s,a)\sim d^D$, $r\sim R(s,a)$ and $s'\sim T(s,a)$, where $d^D$ is an unknown distribution over $S\times A$, perhaps induced by one or more unknown behavior policies. The initial distribution $\mu_0(s)$ is assumed to be easy to sample from, as is typical in practice. Abusing notation, we denote by $d^D$ both the distribution over $(s,a,s',r)$ and its marginal on $(s,a)$. We use $\mathbb{E}_d[\cdot]$ for the expectation over a given distribution $d$, and $\mathbb{E}_D[\cdot]$ for its empirical approximation using $D$. Following previous work (Sutton et al., 2012; Uehara et al., 2019; Zhang et al., 2020a), for ease of exposition we assume the transitions in $D$ are i.i.d. However, our results may be extended to fast-mixing, ergodic MDPs, where the empirical distribution of states along a long trajectory is close to being i.i.d. (Antos et al., 2008; Lazaric et al., 2012; Dai et al., 2017; Duchi et al., 2016).
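To make these objects concrete, the following is a minimal sketch, not taken from the paper's implementation, of the quantities defined above on a toy tabular MDP: a Monte Carlo estimate of the normalized policy value in (1), and a static off-policy dataset $D$ of $(s, a, s', r)$ tuples generated by an unknown behavior policy. All MDP parameters, sizes, and helper names here are hypothetical.

```python
# Toy tabular MDP illustrating rho(pi) in Eq. (1) and the off-policy dataset D.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.95
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a] = P(. | s, a)
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # mean rewards in [0, R_max]
mu0 = np.ones(n_states) / n_states                                  # initial state distribution

def rollout_value(policy, horizon=200, n_traj=500):
    """Monte Carlo estimate of rho(pi) = (1 - gamma) * E[sum_t gamma^t r_t]."""
    total = 0.0
    for _ in range(n_traj):
        s = rng.choice(n_states, p=mu0)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=policy[s])
            ret += disc * R[s, a]
            disc *= gamma
            s = rng.choice(n_states, p=T[s, a])
        total += ret
    return (1 - gamma) * total / n_traj

def collect_offpolicy_dataset(behavior_policy, n=5000):
    """Static dataset D = {(s, a, s', r)} induced by an (unknown) behavior policy."""
    data, s = [], rng.choice(n_states, p=mu0)
    for _ in range(n):
        a = rng.choice(n_actions, p=behavior_policy[s])
        s_next = rng.choice(n_states, p=T[s, a])
        data.append((s, a, s_next, R[s, a]))
        s = s_next
    return data

target_policy = rng.dirichlet(np.ones(n_actions), size=n_states)
behavior_policy = rng.dirichlet(np.ones(n_actions), size=n_states)
print("rho(pi) ~", rollout_value(target_policy), "| |D| =", len(collect_offpolicy_dataset(behavior_policy)))
```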
Under mild regularity assumptions, the OPE problem may be formulated as a linear program referred to as the Q-LP (Nachum et al., 2019b; Nachum and Dai, 2020), with the following primal and dual forms:

$$\min_{Q:\,S\times A\to\mathbb{R}}\ (1-\gamma)\,\mathbb{E}_{\mu_0\pi}[Q(s_0,a_0)] \quad \text{s.t.}\quad Q(s,a) \ge R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a),\ \ \forall (s,a)\in S\times A, \quad (2)$$

$$\max_{d:\,S\times A\to\mathbb{R}_+}\ \mathbb{E}_d[r(s,a)] \quad \text{s.t.}\quad d(s,a) = (1-\gamma)\,\mu_0(s)\pi(a|s) + \gamma\,\mathcal{P}^\pi_* d(s,a),\ \ \forall (s,a)\in S\times A, \quad (3)$$

where the operator $\mathcal{P}^\pi$ and its adjoint, $\mathcal{P}^\pi_*$, are defined as

$$\mathcal{P}^\pi Q(s,a) := \mathbb{E}_{s'\sim T(\cdot|s,a),\,a'\sim\pi(\cdot|s')}[Q(s',a')], \qquad \mathcal{P}^\pi_* d(s,a) := \pi(a|s)\sum_{\bar{s},\bar{a}} T(s|\bar{s},\bar{a})\, d(\bar{s},\bar{a}).$$

³All sets and maps are assumed to satisfy appropriate measurability conditions, which we omit below for the sake of reducing clutter.

The optimal solutions of (2) and (3) are the Q-function, $Q^\pi$, and stationary state-action occupancy, $d^\pi$, respectively, for policy $\pi$; see Nachum et al. (2019b, Theorems 3 & 5) for details as well as extensions to the undiscounted case. Using the Lagrangian of (2) or (3), we have

$$\rho(\pi) = \min_{Q}\,\max_{\tau\ge 0}\ (1-\gamma)\,\mathbb{E}_{\mu_0\pi}[Q(s_0,a_0)] + \mathbb{E}_{d^D}\big[\tau(s,a)\,(R(s,a) + \gamma\,Q(s',a') - Q(s,a))\big], \quad (4)$$

where $\tau(s,a) := \frac{d^\pi(s,a)}{d^D(s,a)}$ is the stationary distribution corrector. One of the key benefits of the minimax optimization (4) is that both expectations can be immediately approximated by sample averages.⁴ In fact, this formulation allows the derivation of several recent behavior-agnostic OPE estimators in a unified manner (Nachum et al., 2019a; Uehara et al., 2019; Zhang et al., 2020a; Nachum and Dai, 2020).

3 CoinDICE

We now develop a new approach to obtaining confidence intervals for OPE. The algorithm, COnfidence INterval stationary DIstribution Correction Estimation (CoinDICE), is derived by combining function space embedding and the previously described Q-LP.

3.1 Function Space Embedding of Constraints

Both the primal and dual forms of the Q-LP contain $|S|\,|A|$ constraints that involve expectations over state transition probabilities. Working directly with these constraints quickly becomes computationally and statistically prohibitive when $|S|\,|A|$ is large or infinite, as with standard LP approaches (De Farias and Van Roy, 2003). Instead, we consider a relaxation that embeds the constraints in a function space:

$$\tilde{\rho}(\pi) := \max_{d:\,S\times A\to\mathbb{R}_+}\ \mathbb{E}_d[r(s,a)] \quad\text{s.t.}\quad \langle\phi, d\rangle = \langle\phi,\ (1-\gamma)\,\mu_0\pi + \gamma\,\mathcal{P}^\pi_* d\rangle, \quad (5)$$

where $\phi: S\times A\to\mathbb{R}^p$ is a feature map, and $\langle\phi, d\rangle := \int \phi(s,a)\, d(s,a)\, ds\, da$. By projecting the constraints onto a function space with feature mapping $\phi$, we can reduce the number of constraints from $|S|\,|A|$ to $p$. Note that $p$ may still be infinite.

The constraint in (5) can be written as generalized estimating equations (Qin and Lawless, 1994; Lam and Zhou, 2017) for the correction ratio $\tau(s,a)$ over augmented samples $x := (s_0, a_0, s, a, r, s', a')$ with $(s_0,a_0)\sim\mu_0\pi$, $(s,a,r,s')\sim d^D$, and $a'\sim\pi(\cdot|s')$:

$$\langle\phi, d\rangle = \langle\phi,\ (1-\gamma)\,\mu_0\pi + \gamma\,\mathcal{P}^\pi_* d\rangle \iff \mathbb{E}_x[\psi(x;\tau,\phi)] = 0, \quad (6)$$

where $\psi(x;\tau,\phi) := (1-\gamma)\,\phi(s_0,a_0) + \tau(s,a)\,(\gamma\,\phi(s',a') - \phi(s,a))$. The corresponding Lagrangian is

$$\tilde{\rho}(\pi) = \max_{\tau:\,S\times A\to\mathbb{R}_+}\ \min_{\beta\in\mathbb{R}^p}\ \mathbb{E}_{d^D}[\tau\, r(s,a)] + \langle\beta,\ \mathbb{E}_{d^D}[\psi(x;\tau,\phi)]\rangle. \quad (7)$$

This embedding approach for the dual Q-LP is closely related to approximation methods for the standard state-value LP (De Farias and Van Roy, 2003; Pazis and Parr, 2011; Lakshminarayanan et al., 2017).
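As a concrete illustration of the embedded Lagrangian (7), here is a minimal sketch, assuming a tabular correction ratio $\tau$ and an indicator (one-hot) feature map over state-action pairs, of the sample-based objective $\mathbb{E}_D[\tau r] + \langle\beta, \mathbb{E}_D[\psi(x;\tau,\phi)]\rangle$. The helper names (`features`, `lagrangian_value`) and the augmented-tuple format are illustrative, not the paper's code.

```python
import numpy as np

def features(s, a, n_states, n_actions):
    """Indicator feature map phi(s, a): a one-hot vector of length |S||A|."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def lagrangian_value(aug_data, tau, beta, gamma, n_states, n_actions):
    """Empirical estimate of E_D[tau r] + <beta, E_D[psi(x; tau, phi)]> from Eq. (7).

    aug_data: list of augmented tuples (s0, a0, s, a, r, s_next, a_next).
    tau:      array of shape (n_states, n_actions), the correction ratio d^pi / d^D.
    beta:     array of length n_states * n_actions, multipliers for the embedded constraints.
    """
    reward_term, psi_sum = 0.0, np.zeros_like(beta)
    for (s0, a0, s, a, r, s_next, a_next) in aug_data:
        reward_term += tau[s, a] * r
        psi = (1 - gamma) * features(s0, a0, n_states, n_actions) \
            + tau[s, a] * (gamma * features(s_next, a_next, n_states, n_actions)
                           - features(s, a, n_states, n_actions))
        psi_sum += psi
    n = len(aug_data)
    return reward_term / n + beta @ (psi_sum / n)

# Tiny usage with hypothetical numbers: 2 states, 2 actions, a handful of augmented samples.
n_s, n_a, gamma = 2, 2, 0.9
aug = [(0, 1, 0, 0, 1.0, 1, 0), (1, 0, 1, 1, 0.5, 0, 1)]
tau = np.ones((n_s, n_a))
beta = np.zeros(n_s * n_a)
print(lagrangian_value(aug, tau, beta, gamma, n_s, n_a))  # equals E_D[tau * r] when beta = 0
```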
The gap between the solutions to (5) and the original dual LP (3) depends on the expressiveness of the feature mapping $\phi$. Before stating a theorem that quantifies the error, we first offer a few examples to provide intuition for the role played by $\phi$.

Example (Indicator functions): Suppose $p = |S|\,|A|$ is finite and $\phi = [\delta_{s,a}]_{(s,a)\in S\times A}$, where $\delta_{s,a}\in\{0,1\}^p$ has a $1$ at position $(s,a)$ and $0$ otherwise. Plugging this feature mapping into (5), we recover the original dual Q-LP (3).

Example (Full-rank basis): Suppose $\Phi\in\mathbb{R}^{p\times p}$ is a full-rank matrix with $p = |S|\,|A|$; furthermore, $\phi(s,a) = \Phi((s,a),\cdot)^\top$. Although the constraints in (5) and (3) are different, their solutions are identical. This can be verified by the Lagrangian in Appendix A.

Example (RKHS function mappings): Suppose $\phi(s,a) := k((s,a),\cdot)$ with $p = \infty$, which forms a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$. The LHS and RHS in the constraint of (5) are the kernel embeddings of $d(s,a)$ and $(1-\gamma)\,\mu_0\pi(s,a) + \gamma\,\mathcal{P}^\pi_* d(s,a)$, respectively. The constraint in (5) can then be understood as a form of distribution matching by comparing kernel embeddings, rather than element-wise matching as in (3). If the kernel function $k(\cdot,\cdot)$ is characteristic, the embeddings of two distributions will match if and only if the distributions are identical almost surely (Sriperumbudur et al., 2011).

⁴We assume one can sample initial states from $\mu_0$, an assumption that often holds in practice. Then, the data in $D$ can be treated as being augmented as $(s_0, a_0, s, a, r, s', a')$ with $a_0\sim\pi(\cdot|s_0)$, $a'\sim\pi(\cdot|s')$.

Theorem 1 (Approximation error) Suppose the constant function $1\in\mathcal{F}_\phi := \mathrm{span}\{\phi\}$. Then,
$$0 \le \tilde{\rho}(\pi) - \rho(\pi) \le 2\min_{\beta}\,\|Q^\pi - \langle\beta,\phi\rangle\|_\infty,$$
where $Q^\pi$ is the fixed-point solution to the Bellman equation $Q(s,a) = R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a)$.

Please refer to Appendix A for the proof. The condition $1\in\mathcal{F}_\phi$ is standard and is trivial to satisfy. Although the approximation error relies on $\|\cdot\|_\infty$, a sharper bound that relies on a norm taking the state-action distribution into account can also be obtained (De Farias and Van Roy, 2003). We focus on characterizing the uncertainty due to sampling in this paper, so for ease of exposition we will consider a setting where $\phi$ is sufficiently expressive to make the approximation error zero. If desired, the approximation error in Theorem 1 can be included in the analysis. Note that, compared to using a characteristic kernel to ensure injectivity for the RKHS embeddings over all distributions (and thus guaranteeing arbitrarily small approximation error), Theorem 1 only requires that $Q^\pi$ be representable in $\mathcal{F}_\phi$, which is a much weaker condition. In practice, one may also learn the feature mapping $\phi$ for the projection jointly.
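Footnote 4 describes how each logged tuple is augmented with an initial-state sample and next-action samples from the target policy. The following is a minimal sketch of that augmentation step, assuming tabular representations of $\pi$ and $\mu_0$; the array and function names are illustrative.

```python
import numpy as np

def augment_dataset(data, target_policy, mu0, rng):
    """Turn D = {(s, a, s', r)} into augmented samples x = (s0, a0, s, a, r, s', a')."""
    n_states, n_actions = target_policy.shape
    augmented = []
    for (s, a, s_next, r) in data:
        s0 = rng.choice(n_states, p=mu0)                         # s0 ~ mu0, assumed easy to sample
        a0 = rng.choice(n_actions, p=target_policy[s0])          # a0 ~ pi(.|s0)
        a_next = rng.choice(n_actions, p=target_policy[s_next])  # a' ~ pi(.|s')
        augmented.append((s0, a0, s, a, r, s_next, a_next))
    return augmented

# Tiny usage with a hypothetical 2-state, 2-action target policy.
rng = np.random.default_rng(0)
pi = np.array([[0.9, 0.1], [0.2, 0.8]])   # pi(a|s)
mu0 = np.array([0.5, 0.5])
print(augment_dataset([(0, 1, 1, 0.3)], pi, mu0, rng))
```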
3.2 Off-policy Confidence Interval Estimation

By introducing the function space embedding of the constraints in (5), we have transformed the original point-wise constraints in the Q-LP to generalized estimating equations. This paves the way to applying the generalized empirical likelihood (EL) (Owen, 2001; Broniatowski and Keziou, 2012; Bertail et al., 2014; Duchi et al., 2016) method to estimate a confidence interval on the policy value. Recall that, given a convex, lower-semicontinuous function $f:\mathbb{R}_+\to\mathbb{R}$ satisfying $f(1) = 0$, the f-divergence between two distributions $P$ and $Q$ is defined as $D_f(P\|Q) := \int f\!\big(\tfrac{dP(x)}{dQ(x)}\big)\, dQ(x)$.

Given an f-divergence, we propose our main confidence interval estimate based on the following confidence set:

$$C^f_{n,\xi} := \big\{\rho(w) : w\in K_f\big\}, \qquad \rho(w) := \max_{\tau\ge 0}\big\{\mathbb{E}_w[\tau\, r(s,a)] : \mathbb{E}_w[\psi(x;\tau,\phi)] = 0\big\}, \quad (8)$$
$$K_f := \big\{w\in\mathcal{P}_{n-1}(\hat{p}_n) : D_f(w\|\hat{p}_n)\le\xi\big\},$$

where $\mathcal{P}_{n-1}(\hat{p}_n)$ denotes the $n$-simplex on the support of $\hat{p}_n$, the empirical distribution over $D$. It is easy to verify that this set $C^f_{n,\xi}\subset\mathbb{R}$ is convex, since $\rho(w)$ is a convex function over a convex feasible set. Thus, $C^f_{n,\xi}$ is an interval. In fact, $C^f_{n,\xi}$ is the image of the policy value on a bounded (in f-divergence) perturbation to $w$ in the neighborhood of the empirical distribution $\hat{p}_n$.

Intuitively, the confidence interval $C^f_{n,\xi}$ possesses a close relationship to bootstrap estimators. In the vanilla bootstrap, one constructs a set of empirical distributions $\{\hat{p}^{(i)}_n\}_{i=1}^m$ by resampling from the dataset $D$. Such subsamples are used to form the empirical distribution of the statistic $\{\rho(\hat{p}^{(i)}_n)\}_{i=1}^m$, which provides population statistics for confidence interval estimation. However, this procedure is computationally very expensive, involving $m$ separate optimizations. By contrast, our proposed estimator $C^f_{n,\xi}$ exploits the asymptotic properties of the statistic $\rho(w)$ to derive a target confidence interval by solving only two optimization problems (Section 3.3), a dramatic savings in computational cost.

Before introducing the algorithm for computing $C^f_{n,\xi}$, we establish the first key result: by choosing $\xi = \chi^2_{1-\alpha}(1)/n$, the set $C^f_{n,\xi}$ is asymptotically a $(1-\alpha)$-confidence interval on the policy value, where $\chi^2_{1-\alpha}(1)$ is the $(1-\alpha)$-quantile of the $\chi^2$-distribution with 1 degree of freedom.

Theorem 2 (Informal asymptotic coverage) Under some mild conditions, if $D$ contains i.i.d. samples and the optimal solution to the Lagrangian of (5) is unique, then $C^f_{n,\xi}$ with $\xi = \chi^2_{1-\alpha}(1)/n$ is an asymptotic $(1-\alpha)$-confidence interval of the value of the policy $\pi$.

Please refer to Appendix E.1 for the precise statement and proof of Theorem 2. Theorem 2 generalizes the result in Duchi et al. (2016) to statistics with generalized estimating equations, maintaining the 1 degree of freedom in the asymptotic $\chi^2(1)$-distribution. One may also apply existing results for EL with generalized estimating equations (e.g., Lam and Zhou, 2017), but these would lead to a limiting distribution of $\chi^2(m)$ with $m > 1$ degrees of freedom, resulting in a much looser confidence interval estimate than Theorem 2.

Note that Theorem 2 can also be specialized to multi-armed contextual bandits to achieve a tighter confidence interval estimate in this special case. In particular, for contextual bandits, the stationary distribution constraint in (5), $\mathbb{E}_w[\psi(x;\tau,\phi)] = 0$, is no longer needed, and can be replaced by $\mathbb{E}_w[\tau - 1] = 0$. Then by the same technique used for MDPs, we can obtain a confidence interval estimate for offline contextual bandits; see details in Appendix C. Interestingly, the resulting confidence interval estimate not only has the same asymptotic coverage as previous work (Karampatziakis et al., 2019), but is also simpler and computationally more efficient.
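For reference, a small sketch of the radius choice suggested by Theorem 2, assuming SciPy is available: the f-divergence ball around the empirical distribution is given radius $\xi = \chi^2_{1-\alpha}(1)/n$, the $(1-\alpha)$-quantile of a chi-squared distribution with one degree of freedom, scaled by $1/n$.

```python
from scipy.stats import chi2

def divergence_radius(alpha: float, n: int) -> float:
    """Radius xi such that {w : D_f(w || p_hat_n) <= xi} yields an asymptotic (1 - alpha) CI."""
    return chi2.ppf(1.0 - alpha, df=1) / n

for alpha in (0.1, 0.05):
    print(alpha, divergence_radius(alpha, n=200))  # roughly 2.71/200 and 3.84/200
```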
3.3 Computing the Confidence Interval

Now we provide a distributionally robust optimization view of the upper and lower bounds of $C^f_{n,\xi}$.

Theorem 3 (Upper and lower confidence bounds) Denote the upper and lower confidence bounds of $C^f_{n,\xi}$ by $u_n$ and $l_n$, respectively:

$$l_n := \min_{w\in K_f}\,\max_{\tau\ge 0}\,\min_{\beta\in\mathbb{R}^p}\,\mathbb{E}_w[\ell(x;\tau,\beta)], \qquad u_n := \max_{w\in K_f}\,\max_{\tau\ge 0}\,\min_{\beta\in\mathbb{R}^p}\,\mathbb{E}_w[\ell(x;\tau,\beta)], \quad (9,\,10)$$

which, by strong duality, can be rewritten with the optimization over $w$ moved innermost:

$$l_n = \max_{\tau\ge 0}\,\min_{\beta\in\mathbb{R}^p}\,\min_{w\in K_f}\,\mathbb{E}_w[\ell(x;\tau,\beta)], \qquad u_n = \max_{\tau\ge 0}\,\min_{\beta\in\mathbb{R}^p}\,\max_{w\in K_f}\,\mathbb{E}_w[\ell(x;\tau,\beta)], \quad (11)$$

where $\ell(x;\tau,\beta) := \tau\, r + \beta^\top\psi(x;\tau,\phi)$.

Introducing multipliers $\lambda\ge 0$ and $\eta$ for the divergence and normalization constraints defining $K_f$, for any $(\tau,\beta,\lambda,\eta)$ feasible for (11), the optimal weights for the lower and upper confidence bounds are

$$w_l = \hat{p}_n\cdot (f')^{-1}\!\big((\eta - \ell(\cdot;\tau,\beta))/\lambda\big) \quad\text{and}\quad w_u = \hat{p}_n\cdot (f')^{-1}\!\big((\ell(\cdot;\tau,\beta) - \eta)/\lambda\big), \quad (12)$$

respectively. Therefore, the confidence bounds simplify to saddle-point optimizations over $(\tau,\beta,\lambda,\eta)$, with all expectations taken under the empirical distribution $\mathbb{E}_D[\cdot]$ (13).

The proof of this result relies on Lagrangian duality and the convexity and concavity of the optimization; it may be found in full detail in Appendix D.1. As we can see in Theorem 3, by exploiting strong duality to move $w$ into the innermost optimizations in (11), the obtained optimization is a distributionally robust extension of the saddle-point problem (7). The closed-form reweighting scheme is demonstrated in (12). For particular f-divergences, such as the KL- and 2-power divergences, for a fixed $(\tau,\beta)$, the optimal $(\lambda,\eta)$ can be easily computed and the weights $w$ recovered in closed form. For example, using $D_f = \mathrm{KL}(w\|\hat{p}_n)$, (12) yields the updates

$$w_l(x) = \frac{\hat{p}_n(x)\,\exp\!\big(-\ell(x;\tau,\beta)/\lambda\big)}{Z_l}, \qquad w_u(x) = \frac{\hat{p}_n(x)\,\exp\!\big(\ell(x;\tau,\beta)/\lambda\big)}{Z_u},$$

where $Z_l$ and $Z_u$ provide the normalizing constants. (For closed-form updates of $w$ w.r.t. other f-divergences, please refer to Appendix D.2.) Plugging the closed form of the optimal weights into (11) greatly simplifies the optimization over the data perturbations, yielding (13), and establishes a connection to prioritized experience replay (Schaul et al., 2016): both reweight the experience data according to its loss, but with different reweighting schemes. Note that it is straightforward to check that the estimator for $u_n$ in (13) is nonconvex-concave and the estimator for $l_n$ in (13) is nonconcave-convex. Therefore, one could alternatively apply stochastic gradient descent-ascent (SGDA) to solve (13) and benefit from attractive finite-step convergence guarantees (Lin et al., 2019).

Remark (Practical considerations): As also observed in Namkoong and Duchi (2016), SGDA for (13) could potentially suffer from high variance in both the objective and gradients when $\lambda$ approaches 0. In Appendix D.3, we exploit several properties of (11) to overcome this numerical issue, which leads to a computationally efficient algorithm. Please refer to Appendix D.3 for the details of Algorithm 1 and the practical considerations.

Remark (Joint learning for feature embeddings): The proposed framework also allows for the possibility of learning the features for the constraint projection. In particular, consider the function $\beta^\top\phi(\cdot,\cdot): S\times A\to\mathbb{R}$. Since the combination $\beta^\top\phi(s,a)$ can be treated as the Lagrange-multiplier function for the original Q-LP with infinitely many constraints, both $\beta$ and $\phi(\cdot,\cdot)$ can be updated jointly. Although the conditions for asymptotic coverage no longer hold, the finite-sample correction results of the next section are still applicable. This might offer an interesting way to reduce the approximation error introduced by inappropriate feature embeddings of the constraints, while still maintaining calibrated confidence intervals.
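To illustrate the KL reweighting above, here is a minimal sketch, assuming per-sample losses $\ell_i = \tau_i r_i + \beta^\top\psi_i$ have already been computed and a fixed temperature $\lambda$: the inner optimization over $w$ tilts the empirical distribution exponentially in the loss, upweighting high-loss samples for the upper bound and low-loss samples for the lower bound (the prioritized-replay-like scheme noted above). The function and variable names are illustrative.

```python
import numpy as np

def kl_tilted_weights(ell, lambda_, upper=True):
    """w(x) proportional to p_hat(x) * exp(+/- ell(x) / lambda), normalized over the dataset."""
    sign = 1.0 if upper else -1.0
    logits = sign * ell / lambda_
    logits -= logits.max()            # numerical stability
    w = np.exp(logits)
    return w / w.sum()

ell = np.array([0.2, 1.5, 0.7, 0.1])  # hypothetical per-sample losses ell(x_i; tau, beta)
w_u = kl_tilted_weights(ell, lambda_=0.5, upper=True)
w_l = kl_tilted_weights(ell, lambda_=0.5, upper=False)
print("upper-bound objective:", w_u @ ell, " lower-bound objective:", w_l @ ell)
```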
4 Finite-sample Analysis

Theorem 2 establishes the asymptotic $(1-\alpha)$-coverage of the confidence interval estimates produced by CoinDICE, ignoring higher-order error terms that vanish as the sample size $n\to\infty$. In practice, however, $n$ is always finite, so it is important to quantify these higher-order terms. This section addresses this problem, and presents a finite-sample bound for the estimate of CoinDICE. In the following, we let $\mathcal{F}_\tau$ and $\mathcal{F}_\beta$ be the function classes of $\tau$ and $\beta$ used by CoinDICE.

Theorem 4 (Informal finite-sample correction) Denote by $d_{\mathcal{F}_\tau}$ and $d_{\mathcal{F}_\beta}$ the (finite) VC-dimensions of $\mathcal{F}_\tau$ and $\mathcal{F}_\beta$, respectively. Under some mild conditions, when $D_f$ is the $\chi^2$-divergence, the value $\rho(\pi)$ lies in the slightly enlarged interval $[l_n - \epsilon_n,\ u_n + \epsilon_n]$ with probability at least $1 - 12\exp(-c_1)$, where $\epsilon_n$ is a higher-order correction term and $c_1$ depends on $n$, $d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} + 1$, $\log d_{\mathcal{F}_\tau} + \log d_{\mathcal{F}_\beta}$, and universal constants $(c, M, C_\ell)$.

The precise statement and detailed proof of Theorem 4 can be found in Appendix E.2. The proof relies on empirical Bernstein bounds with a careful analysis of the variance term. Compared to the vanilla sample complexity, we achieve a faster rate without any additional assumptions on the noise or curvature conditions. The tight sample complexity in Theorem 4 implies that one can construct a $(1-\alpha)$ finite-sample confidence interval by optimizing (11) with a radius $\xi$ proportional to $d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} + 1$ (with the explicit form given in Appendix E.2), and composing with the correction $\epsilon_n$. However, we observe that this bound can be conservative compared to the asymptotic confidence interval in Theorem 2. Therefore, we will evaluate the asymptotic version of CoinDICE based on Theorem 2 in the experiments. The conservativeness arises from the use of a union bound; however, we conjecture that the rate is optimal up to a constant. We exploit the VC dimension due to its generality. In fact, the bound can be improved by considering a data-dependent measure, e.g., Rademacher complexity, or a function-class-dependent measure, e.g., the function norm in an RKHS, for specific function approximators.

5 Optimism vs. Pessimism Principle

CoinDICE provides both upper and lower bounds on the target policy's estimated value, which paves the way for applying the principle of optimism (Lattimore and Szepesvári, 2020) or pessimism (Swaminathan and Joachims, 2015) in the face of uncertainty for policy optimization in different learning settings.

Optimism in the face of uncertainty. Optimism in the face of uncertainty leads to risk-seeking algorithms, which can be used to balance the exploration/exploitation trade-off. Conceptually, they always treat the environment as the best plausibly possible. This principle has been successfully applied to stochastic bandit problems, leading to many instantiations of UCB algorithms (Lattimore and Szepesvári, 2020). In each round, an action is selected according to the upper confidence bound, and the obtained reward is used to refine the confidence bound iteratively. When applied to MDPs, this principle inspires many optimistic model-based (Bartlett and Mendelson, 2002; Auer et al., 2009; Strehl et al., 2009; Szita and Szepesvari, 2010; Dann et al., 2017), value-based (Jin et al., 2018), and policy-based algorithms (Cai et al., 2019). Most of these algorithms are not compatible with function approximators. We can also implement the optimism principle by optimizing the upper bound in CoinDICE iteratively, i.e., $\max_\pi u_D(\pi)$. In the $t$-th iteration, we calculate the gradient of $u_{D_t}(\pi_t)$, i.e., $\nabla_\pi u_{D_t}(\pi_t)$, based on the existing dataset $D_t$; the policy $\pi_t$ is then updated by (natural) policy gradient, and new samples are collected with the updated policy $\pi_{t+1}$. Please refer to Appendix F for the gradient computation and algorithm details.
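A schematic sketch of the iterative policy-improvement loop just described (not the Appendix F algorithm): gradient ascent of policy parameters on a confidence bound, with `bound_fn` standing in for a routine that solves (13) for the current policy; the gradient is taken numerically purely for illustration. The pessimistic, offline variant discussed next differs only in keeping the dataset fixed.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-4):
    """Central-difference gradient, used only to keep this sketch self-contained."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

def optimize_with_bound(bound_fn, theta0, lr=0.1, n_iters=50):
    """Gradient ascent on a confidence bound of the policy value w.r.t. policy parameters theta."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta += lr * numerical_gradient(bound_fn, theta)
        # Optimistic (online) setting: new trajectories from pi_theta would be added to D here.
        # Pessimistic (offline) setting: D stays fixed and bound_fn is the lower bound l_D.
    return theta

# Toy usage with a stand-in quadratic "bound" just to show the interface.
theta_star = optimize_with_bound(lambda th: -np.sum((th - 1.0) ** 2), np.zeros(3))
print(theta_star)  # approaches [1, 1, 1]
```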
Pessimism in the face of uncertainty. In offline reinforcement learning (Lange et al., 2012; Fujimoto et al., 2019; Wu et al., 2019; Nachum et al., 2019b), only a fixed set of data from behavior policies is given, and a safe optimization criterion is to maximize the worst-case performance among a set of statistically plausible models (Laroche et al., 2019; Kumar et al., 2019; Yu et al., 2020). In contrast to the previous case of online exploration, this is a pessimism principle (Cohen and Hutter, 2020; Buckman et al., 2020), or counterfactual risk minimization (Swaminathan and Joachims, 2015), and is highly related to robust MDPs (Iyengar, 2005; Nilim and El Ghaoui, 2005; Tamar et al., 2013; Chow et al., 2015). Different from most existing methods, where the worst-case performance is characterized by model-based perturbations or ensembles, CoinDICE provides a lower bound with which to implement the pessimism principle, i.e., $\max_\pi l_D(\pi)$. Conceptually, we apply the (natural) policy gradient w.r.t. $l_D(\pi_t)$ to update the policy iteratively. Since we are dealing with policy optimization in the offline setting, the dataset $D$ remains unchanged. Please refer to Appendix F for the algorithm details.

6 Related Work

Off-policy estimation has been extensively studied in the literature, given its practical importance. Most existing methods are based on the core idea of importance reweighting to correct for distribution mismatches between the target policy and the off-policy data (Precup et al., 2000; Bottou et al., 2013; Li et al., 2015; Xie et al., 2019). Unfortunately, when applied naively, importance reweighting can result in excessively high variance, which is known as the curse of horizon (Liu et al., 2018). To avoid this drawback, there has been rapidly growing interest in estimating the correction ratio of the stationary distribution (e.g., Liu et al., 2018; Nachum et al., 2019a; Uehara et al., 2019; Liu et al., 2019; Zhang et al., 2020a,b). This work is along the same line and thus applicable in long-horizon problems. Other off-policy approaches are also possible, notably model-based (e.g., Fonteneau et al., 2013) and doubly robust methods (Jiang and Li, 2016; Thomas and Brunskill, 2016; Tang et al., 2020; Uehara et al., 2019). These techniques can potentially be combined with our algorithm, which we leave for future investigation.

While most OPE work focuses on obtaining accurate point estimates, several authors provide ways to quantify the amount of uncertainty in OPE estimates. In particular, confidence bounds have been developed using the central limit theorem (Bottou et al., 2013), concentration inequalities (Thomas et al., 2015b; Kuzborskij et al., 2020), and nonparametric methods such as the bootstrap (Thomas et al., 2015a; Hanna et al., 2017). In contrast to these works, CoinDICE is asymptotically pivotal, meaning that there are no hidden quantities that need to be estimated; it is based on correcting the stationary distribution in the behavior-agnostic setting, thus avoiding the curse of horizon and broadening the applicability of the uncertainty estimator. Recently, Jiang and Huang (2020) provide confidence intervals for OPE, but focus on intervals determined by the approximation error induced by a function approximator, while our confidence intervals quantify statistical error.

Empirical likelihood (Owen, 2001) is a powerful tool with many applications in statistical inference, such as econometrics (Chen et al., 2018), and more recently in distributionally robust optimization (Duchi et al., 2016; Lam and Zhou, 2017). EL-based confidence intervals can be used to guide exploration in multi-armed bandits (Honda and Takemura, 2010; Cappé et al., 2013), and for OPE (Karampatziakis et al., 2019; Kallus and Uehara, 2019b). While the work of Kallus and Uehara (2019b) is also based on EL, it differs from the present work in two important ways.
First, their focus is on developing an asymptotically efficient OPE point estimate, not confidence intervals. Second, they solve for timestep-dependent weights, whereas we only need to solve for timestep-independent weights from a system of moment-matching equations induced by an underlying ergodic Markov chain.

7 Experiments

We now evaluate the empirical performance of CoinDICE, comparing it to a number of existing confidence interval estimators for OPE based on concentration inequalities. Specifically, given a dataset of logged trajectories, we first use weighted step-wise importance sampling (Precup et al., 2000) to calculate a separate estimate of the target policy value for each trajectory. Given such a finite sample of estimates, we then use the empirical Bernstein inequality (Thomas et al., 2015b) to derive high-confidence lower and upper bounds for the true value. Alternatively, one may also use Student's t-test or Efron's bias-corrected and accelerated bootstrap (Thomas et al., 2015a).

Figure 1: Results of CoinDICE and baseline methods (Bootstrapping, Bernstein, and Student's t) on a simple two-armed bandit, with panels for 50, 100, and 200 samples. We plot empirical coverage and median log-width (y-axes) of intervals evaluated at a number of desired confidence levels 1 − α (x-axis), as measured over 200 random trials; the expected coverage is also shown. We find that CoinDICE achieves more accurate coverage and narrower intervals compared to the baseline confidence interval estimation methods.

We begin with a simple bandit setting, devising a two-armed bandit problem with stochastic payoffs. We define the target policy as a near-optimal policy, which chooses the optimal arm with probability 0.95. We collect off-policy data using a behavior policy which chooses the optimal arm with probability of only 0.55. Our results are presented in Figure 1. We plot the empirical coverage and width of the estimated intervals across different confidence levels. More specifically, each data point in Figure 1 is the result of 200 experiments. In each experiment, we randomly sample a dataset and then compute a confidence interval. The interval coverage is then computed as the proportion of intervals out of 200 that contain the true value of the target policy. The interval log-width is the median of the log of the width of the 200 computed intervals. Figure 1 shows that the intervals produced by CoinDICE achieve an empirical coverage close to the intended coverage. In this simple bandit setting, the coverages of Student's t and bootstrapping are also close to correct, although they suffer more in the low-data regime. Notably, the intervals produced by CoinDICE are especially narrow while maintaining accurate coverage.
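For completeness, a minimal sketch of the baseline interval and evaluation protocol used here, assuming per-trajectory value estimates bounded in [0, b] and treated as i.i.d.: an empirical Bernstein interval (one standard form, following Maurer and Pontil, 2009) around the mean of per-trajectory importance-sampling estimates, together with the coverage and median log-width statistics computed over repeated trials. This is an illustrative re-implementation, not the exact estimator of Thomas et al. (2015b), and the stand-in data below is synthetic.

```python
import numpy as np

def empirical_bernstein_interval(x, delta, b):
    """Two-sided (1 - delta) interval for the mean of i.i.d. samples x in [0, b]."""
    n = len(x)
    mean, var = np.mean(x), np.var(x, ddof=1)
    slack = np.sqrt(2 * var * np.log(4 / delta) / n) + 7 * b * np.log(4 / delta) / (3 * (n - 1))
    return mean - slack, mean + slack

def coverage_and_logwidth(interval_fn, true_value, n_trials=200, rng=None):
    """Empirical coverage and median log-width over repeated random datasets (as in Figures 1-3)."""
    rng = rng or np.random.default_rng(0)
    hits, widths = 0, []
    for _ in range(n_trials):
        per_traj_estimates = rng.uniform(0.0, 1.0, size=50)  # synthetic stand-in for WIS estimates
        lo, hi = interval_fn(per_traj_estimates)
        hits += (lo <= true_value <= hi)
        widths.append(np.log(hi - lo))
    return hits / n_trials, float(np.median(widths))

cov, logw = coverage_and_logwidth(lambda x: empirical_bernstein_interval(x, delta=0.1, b=1.0),
                                  true_value=0.5)
print("coverage:", cov, "median log-width:", logw)
```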
Figure 2: Results of CoinDICE and baseline methods on an infinite-horizon version of Frozen Lake (with 50 and 100 trajectories) and Taxi (with 20 and 50 trajectories), plotting interval coverage and interval log-width against the confidence level 1 − α, with the same methods as in Figure 1. In Frozen Lake, each dataset consists of trajectories of length 100; in Taxi, each dataset consists of trajectories of length 500.

We now turn to more complicated MDP environments. We use Frozen Lake (Brockman et al., 2016), a highly stochastic gridworld environment, and Taxi (Dietterich, 1998), an environment with a moderate state space of 2,000 elements. As in (Liu et al., 2018), we modify these environments to be infinite-horizon by randomly resetting the state upon termination. The discount factor is γ = 0.99. The target policy is taken to be a near-optimal one, while the behavior policy is highly suboptimal. The behavior policy in Frozen Lake is the optimal policy with 0.2 white noise, which reduces the policy value dramatically, from 0.74 to 0.24. For the behavior policies in Taxi and Reacher, we follow the same experimental setting for constructing behavior policies and collecting data as in (Nachum et al., 2019a; Liu et al., 2018).

Figure 3: Results of CoinDICE and baseline methods on Reacher (Brockman et al., 2016; Todorov et al., 2012), using 25 trajectories of length 100, plotting interval coverage and interval log-width against the confidence level 1 − α. Colors and markers are as defined in the legends of previous figures.

We follow the same evaluation protocol as in the bandit setting, measuring empirical interval coverage and log-width over 200 experimental trials for various dataset sizes and confidence levels. Results are shown in Figure 2. We reach a similar conclusion: CoinDICE consistently achieves more accurate coverage and smaller widths than the baselines. Notably, the baseline methods' accuracy suffers more significantly compared to the simpler bandit setting described earlier.

Lastly, we evaluate CoinDICE on Reacher (Brockman et al., 2016; Todorov et al., 2012), a continuous-control environment. In this setting, we use a one-hidden-layer neural network with ReLU activations. Results are shown in Figure 3. To account for the approximation error of the neural network used, we measure the coverage of CoinDICE with respect to a true value computed as the median of a large ensemble of neural networks trained on the off-policy data. To keep the comparison fair, we measure the coverage of the IS-based baselines with respect to a true value computed as the median of a large number of IS-based point estimates. The results show similar conclusions as before: CoinDICE achieves more accurate coverage than the IS-based methods. Still, we see that CoinDICE's coverage suffers in this regime, likely due to optimization difficulties. If the optimum of the Lagrangian is only approximately found, the empirical coverage will inevitably be inexact.

8 Conclusion

In this paper, we have developed CoinDICE, a novel and efficient confidence interval estimator applicable to the behavior-agnostic offline setting. The algorithm builds on a few technical components, including a new feature-embedded Q-LP and a generalized empirical likelihood approach to confidence interval estimation. We analyzed the asymptotic coverage of CoinDICE's estimate, and provided a finite-sample bound. On a variety of off-policy benchmarks, we empirically compared the new algorithm with several strong baselines and found it to be superior to them.

Broader Impact

This research is fundamental and targets a broad question in reinforcement learning.
The ability to reliably assess uncertainty in off-policy evaluation would have significant positive benefits for safety-critical applications of reinforcement learning. Inaccurate uncertainty estimates create the danger of misleading decision makers and could lead to detrimental consequences. However, our primary goal is to improve these estimators and reduce the ultimate risk of deploying reinforcement-learned systems. The techniques are general and do not otherwise target any specific application area.

Acknowledgements

We thank Hanjun Dai, Mengjiao Yang and other members of the Google Brain team for helpful discussions. Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii and NSERC.

References

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.

P. Auer. Using upper confidence bounds for online learning. In Proc. 41st Annual Symposium on Foundations of Computer Science, pages 270–279. IEEE Computer Society Press, Los Alamitos, CA, 2000.

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Patrice Bertail, Emmanuelle Gautherat, and Hugo Harari-Kermadec. Empirical φ-divergence minimizers for Hadamard differentiable functionals. In Topics in Nonparametric Statistics, pages 21–32. Springer, 2014.

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Michel Broniatowski and Amor Keziou. Divergences and duality for estimation and test under moment condition models. Journal of Statistical Planning and Inference, 142(9):2554–2573, 2012.

Jacob Buckman, Carles Gelada, and Marc G Bellemare. The importance of pessimism in fixed-dataset policy optimization. arXiv preprint arXiv:2009.06799, 2020.

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456–464, 2019.

Xiaohong Chen, Timothy M. Christensen, and Elie Tamer. Monte Carlo confidence sets for identified sets. Econometrica, 86(6):1965–2018, 2018.

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone.
Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.

Michael K Cohen and Marcus Hutter. Pessimism about unknown unknowns inspires conservatism. In Conference on Learning Theory, pages 1344–1373. PMLR, 2020.

Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. CoRR, abs/1712.10285, 2017.

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.

Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Proc. Intl. Conf. Machine Learning, pages 118–126. Morgan Kaufmann, San Francisco, CA, 1998.

John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.

Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011. CoRR abs/1103.4601.

Ivar Ekeland and Roger Temam. Convex analysis and variational problems, volume 28. SIAM, 1999.

Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.

Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li-wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings, 2018. arXiv:1805.12298.

Josiah P. Hanna, Peter Stone, and Scott Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 4933–4934, 2017.

Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79, 2010.

Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

Nan Jiang and Jiawei Huang. Minimax confidence interval for off-policy evaluation and policy optimization, 2020. arXiv:2002.02081.

Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019a.

Nathan Kallus and Masatoshi Uehara.
Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems 32, pages 3320–3329, 2019b.

Nikos Karampatziakis, John Langford, and Paul Mineiro. Empirical likelihood for contextual bandits. arXiv preprint arXiv:1906.03323, 2019.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11784–11794, 2019.

Ilja Kuzborskij, Claire Vernade, András György, and Csaba Szepesvári. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.

Chandrashekar Lakshminarayanan, Shalabh Bhatnagar, and Csaba Szepesvari. A linearly relaxed approximate linear program for Markov decision processes. arXiv preprint arXiv:1704.02544, 2017.

Henry Lam and Enlu Zhou. The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters, 45(4):301–307, 2017.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.

Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pages 3652–3661. PMLR, 2019.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074, 2012.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.

Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. pages 608–616, 2015.

Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. CoRR, abs/1906.00331, 2019.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5356–5366. Curran Associates, Inc., 2018.

Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling, 2019. arXiv:1910.06508.

Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. 2014.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.

Ofir Nachum and Bo Dai. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. pages 2315–2325, 2019a.
Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.

Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.

Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.

Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

Art B Owen. Empirical Likelihood. Chapman and Hall/CRC, 2001.

Jason Pazis and Ronald Parr. Non-parametric approximate linear programming for MDPs. In AAAI, 2011.

Doina Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proc. Intl. Conf. Machine Learning, pages 759–766. Morgan Kaufmann, San Francisco, CA, 2000.

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Jin Qin and Jerry Lawless. Empirical likelihood and general estimating equations. The Annals of Statistics, pages 300–325, 1994.

R Tyrrell Rockafellar. Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM Journal on Control, 12(2):268–285, 1974.

Werner Römisch. Delta method, infinite dimensional. Wiley StatsRef: Statistics Reference Online.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the 4th International Conference on Learning Representations, 2016.

B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael P Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.

Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.

Istvan Szita and Csaba Szepesvari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038. Omnipress, 2010.

Aviv Tamar, Huan Xu, and Shie Mannor. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189, 2013.

Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In Proceedings of the 8th International Conference on Learning Representations, 2020.

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015a.

Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.

Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 2139–2148, 2016.
Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015b.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.

Weiran Wang and Miguel A Carreira-Perpinán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pages 9665–9675, 2019.

Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020a.

Shangtong Zhang, Bo Liu, and Shimon Whiteson. GradientDICE: Rethinking generalized offline estimation of stationary values, 2020b. arXiv:2001.11113.