# CoinDICE: Off-Policy Confidence Interval Estimation

Bo Dai1, Ofir Nachum1, Yinlam Chow1, Lihong Li1, Csaba Szepesvári2,3, Dale Schuurmans1,3
1Google Research, Brain Team  2University of Alberta  3DeepMind
*Equal contribution. Email: {bodai, ofirnachum}@google.com.

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods. Open-source code for CoinDICE is available at https://github.com/google-research/dice_rl.

1 Introduction

One of the major barriers that hinders the application of reinforcement learning (RL) is the ability to evaluate new policies reliably before deployment, a problem generally known as off-policy evaluation (OPE). In many real-world domains, e.g., healthcare (Murphy et al., 2001; Gottesman et al., 2018), recommendation (Li et al., 2011; Chen et al., 2019), and education (Mandel et al., 2014), deploying a new policy can be expensive, risky or unsafe. Accordingly, OPE has seen a recent resurgence of research interest, with many methods proposed to estimate the value of a policy (Precup et al., 2000; Dudík et al., 2011; Bottou et al., 2013; Jiang and Li, 2016; Thomas and Brunskill, 2016; Liu et al., 2018; Nachum et al., 2019a; Kallus and Uehara, 2019a,b; Zhang et al., 2020b).

However, the very settings where OPE is necessary usually entail limited data access. In these cases, obtaining knowledge of the uncertainty of the estimate is as important as having a consistent estimator. That is, rather than a point estimate, many applications would benefit significantly from having confidence intervals on the value of a policy. The problem of estimating these confidence intervals, known as high-confidence off-policy evaluation (HCOPE) (Thomas et al., 2015b), is imperative in real-world decision making, where deploying a policy without high-probability safety guarantees can have catastrophic consequences (Thomas, 2015).

Most existing high-confidence off-policy evaluation algorithms in RL (Bottou et al., 2013; Thomas et al., 2015a,b; Hanna et al., 2017) construct such intervals using statistical techniques such as concentration inequalities and the bootstrap applied to importance-corrected estimates of policy value. The primary challenge with these correction-based approaches is the high variance resulting from multiplying per-step importance ratios in long-horizon problems. Moreover, they typically require full knowledge (or a good estimate) of the behavior policy, which is not easily available in behavior-agnostic OPE settings (Nachum et al., 2019a).

In this work, we propose an algorithm for behavior-agnostic HCOPE. We start from a linear programming formulation of the state-action value function. We show that the value of the policy may be obtained from a Lagrangian optimization problem with generalized estimating equation constraints
over data sampled from off-policy distributions. This observation inspires a generalized empirical likelihood approach (Owen, 2001; Broniatowski and Keziou, 2012; Duchi et al., 2016) to confidence interval estimation. These derivations enable us to express high-confidence lower and upper bounds for the policy value as results of minimax optimizations over an arbitrary offline dataset, with the appropriate distribution corrections being implicitly estimated during the optimization. We translate this understanding into a practical estimator, Confidence Interval DIstribution Correction Estimation (CoinDICE), and design an efficient algorithm for implementing it. We then justify the asymptotic coverage of these bounds and present non-asymptotic guarantees to characterize finite-sample effects. Notably, CoinDICE is behavior-agnostic and its objective function does not involve any per-step importance ratios, so the estimator is less susceptible to high-variance gradient updates. We evaluate CoinDICE in a number of settings and show that it provides both tighter confidence interval estimates and more correctly matches the desired statistical coverage compared to existing methods.

2 Preliminaries

For a set $W$, the set of probability measures over $W$ is denoted by $\mathcal{P}(W)$.³ We consider a Markov Decision Process (MDP) (Puterman, 2014), $M = \langle S, A, T, R, \gamma, \mu_0\rangle$, where $S$ denotes the state space, $A$ denotes the action space, $T: S\times A\to\mathcal{P}(S)$ is the transition probability kernel, $R: S\times A\to\mathcal{P}([0, R_{\max}])$ is a bounded reward kernel, $\gamma\in(0,1]$ is the discount factor, and $\mu_0$ is the initial state distribution. A policy, $\pi: S\to\mathcal{P}(A)$, can be used to generate a random trajectory by starting from $s_0\sim\mu_0$, then following $a_t\sim\pi(\cdot|s_t)$, $r_t\sim R(s_t,a_t)$ and $s_{t+1}\sim T(s_t,a_t)$ for $t\ge 0$. The state- and action-value functions of $\pi$ are denoted $V^\pi$ and $Q^\pi$, respectively. The policy also induces an occupancy measure, $d^\pi(s,a) := (1-\gamma)\,\mathbb{E}\big[\sum_{t\ge 0}\gamma^t \mathbf{1}\{s_t=s, a_t=a\}\big]$, the normalized discounted probability of visiting $(s,a)$ in a trajectory generated by $\pi$, where $\mathbf{1}\{\cdot\}$ is the indicator function. Finally, the policy value is defined as the normalized expected reward accumulated along a trajectory:

$$\rho(\pi) := (1-\gamma)\,\mathbb{E}\Big[\textstyle\sum_{t\ge 0}\gamma^t r_t \,\Big|\, s_0\sim\mu_0,\ a_t\sim\pi(\cdot|s_t),\ r_t\sim R(s_t,a_t),\ s_{t+1}\sim T(s_t,a_t)\Big]. \quad (1)$$

We are interested in estimating the policy value and its confidence interval (CI) in the behavior-agnostic off-policy setting (Nachum et al., 2019a; Zhang et al., 2020a), where interaction with the environment is limited to a static dataset of experience $D := \{(s, a, s', r)_i\}_{i=1}^n$. Each tuple in $D$ is generated according to $(s,a)\sim d^D$, $r\sim R(s,a)$ and $s'\sim T(s,a)$, where $d^D$ is an unknown distribution over $S\times A$, perhaps induced by one or more unknown behavior policies. The initial distribution $\mu_0(s)$ is assumed to be easy to sample from, as is typical in practice. Abusing notation, we denote by $d^D$ both the distribution over $(s,a,s',r)$ and its marginal on $(s,a)$. We use $\mathbb{E}_d[\cdot]$ for the expectation over a given distribution $d$, and $\mathbb{E}_D[\cdot]$ for its empirical approximation using $D$. Following previous work (Sutton et al., 2012; Uehara et al., 2019; Zhang et al., 2020a), for ease of exposition we assume the transitions in $D$ are i.i.d. However, our results may be extended to fast-mixing, ergodic MDPs, where the empirical distribution of states along a long trajectory is close to being i.i.d. (Antos et al., 2008; Lazaric et al., 2012; Dai et al., 2017; Duchi et al., 2016).
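To make these objects concrete, the following is a minimal sketch, not taken from the paper's implementation, of the quantities defined above on a toy tabular MDP: a Monte Carlo estimate of the normalized policy value in (1), and a static off-policy dataset $D$ of $(s, a, s', r)$ tuples generated by an unknown behavior policy. All MDP parameters, sizes, and helper names here are hypothetical.

```python
# Toy tabular MDP illustrating rho(pi) in Eq. (1) and the off-policy dataset D.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.95
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a] = P(. | s, a)
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # mean rewards in [0, R_max]
mu0 = np.ones(n_states) / n_states                                  # initial state distribution

def rollout_value(policy, horizon=200, n_traj=500):
    """Monte Carlo estimate of rho(pi) = (1 - gamma) * E[sum_t gamma^t r_t]."""
    total = 0.0
    for _ in range(n_traj):
        s = rng.choice(n_states, p=mu0)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=policy[s])
            ret += disc * R[s, a]
            disc *= gamma
            s = rng.choice(n_states, p=T[s, a])
        total += ret
    return (1 - gamma) * total / n_traj

def collect_offpolicy_dataset(behavior_policy, n=5000):
    """Static dataset D = {(s, a, s', r)} induced by an (unknown) behavior policy."""
    data, s = [], rng.choice(n_states, p=mu0)
    for _ in range(n):
        a = rng.choice(n_actions, p=behavior_policy[s])
        s_next = rng.choice(n_states, p=T[s, a])
        data.append((s, a, s_next, R[s, a]))
        s = s_next
    return data

target_policy = rng.dirichlet(np.ones(n_actions), size=n_states)
behavior_policy = rng.dirichlet(np.ones(n_actions), size=n_states)
print("rho(pi) ~", rollout_value(target_policy), "| |D| =", len(collect_offpolicy_dataset(behavior_policy)))
```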
Under mild regularity assumptions, the OPE problem may be formulated as a linear program referred to as the Q-LP (Nachum et al., 2019b; Nachum and Dai, 2020), with the following primal and dual forms:

$$\min_{Q:\,S\times A\to\mathbb{R}}\ (1-\gamma)\,\mathbb{E}_{\mu_0\pi}[Q(s_0,a_0)] \quad \text{s.t.}\quad Q(s,a) \ge R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a),\ \ \forall (s,a)\in S\times A, \quad (2)$$

$$\max_{d:\,S\times A\to\mathbb{R}_+}\ \mathbb{E}_d[r(s,a)] \quad \text{s.t.}\quad d(s,a) = (1-\gamma)\,\mu_0(s)\pi(a|s) + \gamma\,\mathcal{P}^\pi_* d(s,a),\ \ \forall (s,a)\in S\times A, \quad (3)$$

where the operator $\mathcal{P}^\pi$ and its adjoint, $\mathcal{P}^\pi_*$, are defined as

$$\mathcal{P}^\pi Q(s,a) := \mathbb{E}_{s'\sim T(\cdot|s,a),\,a'\sim\pi(\cdot|s')}[Q(s',a')], \qquad \mathcal{P}^\pi_* d(s,a) := \pi(a|s)\sum_{\bar{s},\bar{a}} T(s|\bar{s},\bar{a})\, d(\bar{s},\bar{a}).$$

³All sets and maps are assumed to satisfy appropriate measurability conditions, which we omit below for the sake of reducing clutter.

The optimal solutions of (2) and (3) are the Q-function, $Q^\pi$, and stationary state-action occupancy, $d^\pi$, respectively, for policy $\pi$; see Nachum et al. (2019b, Theorems 3 & 5) for details as well as extensions to the undiscounted case. Using the Lagrangian of (2) or (3), we have

$$\rho(\pi) = \min_{Q}\,\max_{\tau\ge 0}\ (1-\gamma)\,\mathbb{E}_{\mu_0\pi}[Q(s_0,a_0)] + \mathbb{E}_{d^D}\big[\tau(s,a)\,(R(s,a) + \gamma\,Q(s',a') - Q(s,a))\big], \quad (4)$$

where $\tau(s,a) := \frac{d^\pi(s,a)}{d^D(s,a)}$ is the stationary distribution corrector. One of the key benefits of the minimax optimization (4) is that both expectations can be immediately approximated by sample averages.⁴ In fact, this formulation allows the derivation of several recent behavior-agnostic OPE estimators in a unified manner (Nachum et al., 2019a; Uehara et al., 2019; Zhang et al., 2020a; Nachum and Dai, 2020).

3 CoinDICE

We now develop a new approach to obtaining confidence intervals for OPE. The algorithm, COnfidence INterval stationary DIstribution Correction Estimation (CoinDICE), is derived by combining function space embedding and the previously described Q-LP.

3.1 Function Space Embedding of Constraints

Both the primal and dual forms of the Q-LP contain $|S|\,|A|$ constraints that involve expectations over state transition probabilities. Working directly with these constraints quickly becomes computationally and statistically prohibitive when $|S|\,|A|$ is large or infinite, as with standard LP approaches (De Farias and Van Roy, 2003). Instead, we consider a relaxation that embeds the constraints in a function space:

$$\tilde{\rho}(\pi) := \max_{d:\,S\times A\to\mathbb{R}_+}\ \mathbb{E}_d[r(s,a)] \quad\text{s.t.}\quad \langle\phi, d\rangle = \langle\phi,\ (1-\gamma)\,\mu_0\pi + \gamma\,\mathcal{P}^\pi_* d\rangle, \quad (5)$$

where $\phi: S\times A\to\mathbb{R}^p$ is a feature map, and $\langle\phi, d\rangle := \int \phi(s,a)\, d(s,a)\, ds\, da$. By projecting the constraints onto a function space with feature mapping $\phi$, we can reduce the number of constraints from $|S|\,|A|$ to $p$. Note that $p$ may still be infinite.

The constraint in (5) can be written as generalized estimating equations (Qin and Lawless, 1994; Lam and Zhou, 2017) for the correction ratio $\tau(s,a)$ over augmented samples $x := (s_0, a_0, s, a, r, s', a')$ with $(s_0,a_0)\sim\mu_0\pi$, $(s,a,r,s')\sim d^D$, and $a'\sim\pi(\cdot|s')$:

$$\langle\phi, d\rangle = \langle\phi,\ (1-\gamma)\,\mu_0\pi + \gamma\,\mathcal{P}^\pi_* d\rangle \iff \mathbb{E}_x[\psi(x;\tau,\phi)] = 0, \quad (6)$$

where $\psi(x;\tau,\phi) := (1-\gamma)\,\phi(s_0,a_0) + \tau(s,a)\,(\gamma\,\phi(s',a') - \phi(s,a))$. The corresponding Lagrangian is

$$\tilde{\rho}(\pi) = \max_{\tau:\,S\times A\to\mathbb{R}_+}\ \min_{\beta\in\mathbb{R}^p}\ \mathbb{E}_{d^D}[\tau\, r(s,a)] + \langle\beta,\ \mathbb{E}_{d^D}[\psi(x;\tau,\phi)]\rangle. \quad (7)$$

This embedding approach for the dual Q-LP is closely related to approximation methods for the standard state-value LP (De Farias and Van Roy, 2003; Pazis and Parr, 2011; Lakshminarayanan et al., 2017).
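As a concrete illustration of the embedded Lagrangian (7), here is a minimal sketch, assuming a tabular correction ratio $\tau$ and an indicator (one-hot) feature map over state-action pairs, of the sample-based objective $\mathbb{E}_D[\tau r] + \langle\beta, \mathbb{E}_D[\psi(x;\tau,\phi)]\rangle$. The helper names (`features`, `lagrangian_value`) and the augmented-tuple format are illustrative, not the paper's code.

```python
import numpy as np

def features(s, a, n_states, n_actions):
    """Indicator feature map phi(s, a): a one-hot vector of length |S||A|."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def lagrangian_value(aug_data, tau, beta, gamma, n_states, n_actions):
    """Empirical estimate of E_D[tau r] + <beta, E_D[psi(x; tau, phi)]> from Eq. (7).

    aug_data: list of augmented tuples (s0, a0, s, a, r, s_next, a_next).
    tau:      array of shape (n_states, n_actions), the correction ratio d^pi / d^D.
    beta:     array of length n_states * n_actions, multipliers for the embedded constraints.
    """
    reward_term, psi_sum = 0.0, np.zeros_like(beta)
    for (s0, a0, s, a, r, s_next, a_next) in aug_data:
        reward_term += tau[s, a] * r
        psi = (1 - gamma) * features(s0, a0, n_states, n_actions) \
            + tau[s, a] * (gamma * features(s_next, a_next, n_states, n_actions)
                           - features(s, a, n_states, n_actions))
        psi_sum += psi
    n = len(aug_data)
    return reward_term / n + beta @ (psi_sum / n)

# Tiny usage with hypothetical numbers: 2 states, 2 actions, a handful of augmented samples.
n_s, n_a, gamma = 2, 2, 0.9
aug = [(0, 1, 0, 0, 1.0, 1, 0), (1, 0, 1, 1, 0.5, 0, 1)]
tau = np.ones((n_s, n_a))
beta = np.zeros(n_s * n_a)
print(lagrangian_value(aug, tau, beta, gamma, n_s, n_a))  # equals E_D[tau * r] when beta = 0
```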
The gap between the solutions to (5) and the original dual LP (3) depends on the expressiveness of the feature mapping $\phi$. Before stating a theorem that quantifies the error, we first offer a few examples to provide intuition for the role played by $\phi$.

Example (Indicator functions): Suppose $p = |S|\,|A|$ is finite and $\phi = [\delta_{s,a}]_{(s,a)\in S\times A}$, where $\delta_{s,a}\in\{0,1\}^p$ has a $1$ at position $(s,a)$ and $0$ otherwise. Plugging this feature mapping into (5), we recover the original dual Q-LP (3).

Example (Full-rank basis): Suppose $\Phi\in\mathbb{R}^{p\times p}$ is a full-rank matrix with $p = |S|\,|A|$; furthermore, $\phi(s,a) = \Phi((s,a),\cdot)^\top$. Although the constraints in (5) and (3) are different, their solutions are identical. This can be verified by the Lagrangian in Appendix A.

Example (RKHS function mappings): Suppose $\phi(s,a) := k((s,a),\cdot)$ with $p = \infty$, which forms a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$. The LHS and RHS in the constraint of (5) are the kernel embeddings of $d(s,a)$ and $(1-\gamma)\,\mu_0\pi(s,a) + \gamma\,\mathcal{P}^\pi_* d(s,a)$, respectively. The constraint in (5) can then be understood as a form of distribution matching by comparing kernel embeddings, rather than element-wise matching as in (3). If the kernel function $k(\cdot,\cdot)$ is characteristic, the embeddings of two distributions will match if and only if the distributions are identical almost surely (Sriperumbudur et al., 2011).

⁴We assume one can sample initial states from $\mu_0$, an assumption that often holds in practice. Then, the data in $D$ can be treated as being augmented as $(s_0, a_0, s, a, r, s', a')$ with $a_0\sim\pi(\cdot|s_0)$, $a'\sim\pi(\cdot|s')$.

Theorem 1 (Approximation error) Suppose the constant function $1\in\mathcal{F}_\phi := \mathrm{span}\{\phi\}$. Then,
$$0 \le \tilde{\rho}(\pi) - \rho(\pi) \le 2\min_{\beta}\,\|Q^\pi - \langle\beta,\phi\rangle\|_\infty,$$
where $Q^\pi$ is the fixed-point solution to the Bellman equation $Q(s,a) = R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a)$.

Please refer to Appendix A for the proof. The condition $1\in\mathcal{F}_\phi$ is standard and is trivial to satisfy. Although the approximation error relies on $\|\cdot\|_\infty$, a sharper bound that relies on a norm taking the state-action distribution into account can also be obtained (De Farias and Van Roy, 2003). We focus on characterizing the uncertainty due to sampling in this paper, so for ease of exposition we will consider a setting where $\phi$ is sufficiently expressive to make the approximation error zero. If desired, the approximation error in Theorem 1 can be included in the analysis. Note that, compared to using a characteristic kernel to ensure injectivity for the RKHS embeddings over all distributions (and thus guaranteeing arbitrarily small approximation error), Theorem 1 only requires that $Q^\pi$ be representable in $\mathcal{F}_\phi$, which is a much weaker condition. In practice, one may also learn the feature mapping $\phi$ for the projection jointly.
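Footnote 4 describes how each logged tuple is augmented with an initial-state sample and next-action samples from the target policy. The following is a minimal sketch of that augmentation step, assuming tabular representations of $\pi$ and $\mu_0$; the array and function names are illustrative.

```python
import numpy as np

def augment_dataset(data, target_policy, mu0, rng):
    """Turn D = {(s, a, s', r)} into augmented samples x = (s0, a0, s, a, r, s', a')."""
    n_states, n_actions = target_policy.shape
    augmented = []
    for (s, a, s_next, r) in data:
        s0 = rng.choice(n_states, p=mu0)                         # s0 ~ mu0, assumed easy to sample
        a0 = rng.choice(n_actions, p=target_policy[s0])          # a0 ~ pi(.|s0)
        a_next = rng.choice(n_actions, p=target_policy[s_next])  # a' ~ pi(.|s')
        augmented.append((s0, a0, s, a, r, s_next, a_next))
    return augmented

# Tiny usage with a hypothetical 2-state, 2-action target policy.
rng = np.random.default_rng(0)
pi = np.array([[0.9, 0.1], [0.2, 0.8]])   # pi(a|s)
mu0 = np.array([0.5, 0.5])
print(augment_dataset([(0, 1, 1, 0.3)], pi, mu0, rng))
```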
3.2 Off-policy Confidence Interval Estimation

By introducing the function space embedding of the constraints in (5), we have transformed the original point-wise constraints in the Q-LP to generalized estimating equations. This paves the way to applying the generalized empirical likelihood (EL) (Owen, 2001; Broniatowski and Keziou, 2012; Bertail et al., 2014; Duchi et al., 2016) method to estimate a confidence interval on the policy value. Recall that, given a convex, lower-semicontinuous function $f:\mathbb{R}_+\to\mathbb{R}$ satisfying $f(1) = 0$, the f-divergence between two distributions $P$ and $Q$ is defined as $D_f(P\|Q) := \int f\!\big(\tfrac{dP(x)}{dQ(x)}\big)\, dQ(x)$.

Given an f-divergence, we propose our main confidence interval estimate based on the following confidence set:

$$C^f_{n,\xi} := \big\{\rho(w) : w\in K_f\big\}, \qquad \rho(w) := \max_{\tau\ge 0}\big\{\mathbb{E}_w[\tau\, r(s,a)] : \mathbb{E}_w[\psi(x;\tau,\phi)] = 0\big\}, \quad (8)$$
$$K_f := \big\{w\in\mathcal{P}_{n-1}(\hat{p}_n) : D_f(w\|\hat{p}_n)\le\xi\big\},$$

where $\mathcal{P}_{n-1}(\hat{p}_n)$ denotes the $n$-simplex on the support of $\hat{p}_n$, the empirical distribution over $D$. It is easy to verify that this set $C^f_{n,\xi}\subset\mathbb{R}$ is convex, since $\rho(w)$ is a convex function over a convex feasible set. Thus, $C^f_{n,\xi}$ is an interval. In fact, $C^f_{n,\xi}$ is the image of the policy value on a bounded (in f-divergence) perturbation to $w$ in the neighborhood of the empirical distribution $\hat{p}_n$.

Intuitively, the confidence interval $C^f_{n,\xi}$ possesses a close relationship to bootstrap estimators. In the vanilla bootstrap, one constructs a set of empirical distributions $\{\hat{p}^{(i)}_n\}_{i=1}^m$ by resampling from the dataset $D$. Such subsamples are used to form the empirical distribution of the statistic $\{\rho(\hat{p}^{(i)}_n)\}_{i=1}^m$, which provides population statistics for confidence interval estimation. However, this procedure is computationally very expensive, involving $m$ separate optimizations. By contrast, our proposed estimator $C^f_{n,\xi}$ exploits the asymptotic properties of the statistic $\rho(w)$ to derive a target confidence interval by solving only two optimization problems (Section 3.3), a dramatic savings in computational cost.

Before introducing the algorithm for computing $C^f_{n,\xi}$, we establish the first key result: by choosing $\xi = \chi^2_{1-\alpha}(1)/n$, the set $C^f_{n,\xi}$ is asymptotically a $(1-\alpha)$-confidence interval on the policy value, where $\chi^2_{1-\alpha}(1)$ is the $(1-\alpha)$-quantile of the $\chi^2$-distribution with 1 degree of freedom.

Theorem 2 (Informal asymptotic coverage) Under some mild conditions, if $D$ contains i.i.d. samples and the optimal solution to the Lagrangian of (5) is unique, then $C^f_{n,\xi}$ with $\xi = \chi^2_{1-\alpha}(1)/n$ is an asymptotic $(1-\alpha)$-confidence interval of the value of the policy $\pi$.

Please refer to Appendix E.1 for the precise statement and proof of Theorem 2. Theorem 2 generalizes the result in Duchi et al. (2016) to statistics with generalized estimating equations, maintaining the 1 degree of freedom in the asymptotic $\chi^2(1)$-distribution. One may also apply existing results for EL with generalized estimating equations (e.g., Lam and Zhou, 2017), but these would lead to a limiting distribution of $\chi^2(m)$ with $m > 1$ degrees of freedom, resulting in a much looser confidence interval estimate than Theorem 2.

Note that Theorem 2 can also be specialized to multi-armed contextual bandits to achieve a tighter confidence interval estimate in this special case. In particular, for contextual bandits, the stationary distribution constraint in (5), $\mathbb{E}_w[\psi(x;\tau,\phi)] = 0$, is no longer needed, and can be replaced by $\mathbb{E}_w[\tau - 1] = 0$. Then by the same technique used for MDPs, we can obtain a confidence interval estimate for offline contextual bandits; see details in Appendix C. Interestingly, the resulting confidence interval estimate not only has the same asymptotic coverage as previous work (Karampatziakis et al., 2019), but is also simpler and computationally more efficient.
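For reference, a small sketch of the radius choice suggested by Theorem 2, assuming SciPy is available: the f-divergence ball around the empirical distribution is given radius $\xi = \chi^2_{1-\alpha}(1)/n$, the $(1-\alpha)$-quantile of a chi-squared distribution with one degree of freedom, scaled by $1/n$.

```python
from scipy.stats import chi2

def divergence_radius(alpha: float, n: int) -> float:
    """Radius xi such that {w : D_f(w || p_hat_n) <= xi} yields an asymptotic (1 - alpha) CI."""
    return chi2.ppf(1.0 - alpha, df=1) / n

for alpha in (0.1, 0.05):
    print(alpha, divergence_radius(alpha, n=200))  # roughly 2.71/200 and 3.84/200
```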
3.3 Computing the Confidence Interval

Now we provide a distributionally robust optimization view of the upper and lower bounds of $C^f_{n,\xi}$.

Theorem 3 (Upper and lower confidence bounds) Denote the upper and lower confidence bounds of $C^f_{n,\xi}$ by $u_n$ and $l_n$, respectively:

$$l_n := \min_{w\in K_f}\,\max_{\tau\ge 0}\,\min_{\beta\in\mathbb{R}^p}\,\mathbb{E}_w[\ell(x;\tau,\beta)], \qquad u_n := \max_{w\in K_f}\,\max_{\tau\ge 0}\,\min_{\beta\in\mathbb{R}^p}\,\mathbb{E}_w[\ell(x;\tau,\beta)], \quad (9,\,10)$$

which, by strong duality, can be rewritten with the optimization over $w$ moved innermost:

$$l_n = \max_{\tau\ge 0}\,\min_{\beta\in\mathbb{R}^p}\,\min_{w\in K_f}\,\mathbb{E}_w[\ell(x;\tau,\beta)], \qquad u_n = \max_{\tau\ge 0}\,\min_{\beta\in\mathbb{R}^p}\,\max_{w\in K_f}\,\mathbb{E}_w[\ell(x;\tau,\beta)], \quad (11)$$

where $\ell(x;\tau,\beta) := \tau\, r + \beta^\top\psi(x;\tau,\phi)$.

Introducing multipliers $\lambda\ge 0$ and $\eta$ for the divergence and normalization constraints defining $K_f$, for any $(\tau,\beta,\lambda,\eta)$ feasible for (11), the optimal weights for the lower and upper confidence bounds are

$$w_l = \hat{p}_n\cdot (f')^{-1}\!\big((\eta - \ell(\cdot;\tau,\beta))/\lambda\big) \quad\text{and}\quad w_u = \hat{p}_n\cdot (f')^{-1}\!\big((\ell(\cdot;\tau,\beta) - \eta)/\lambda\big), \quad (12)$$

respectively. Therefore, the confidence bounds simplify to saddle-point optimizations over $(\tau,\beta,\lambda,\eta)$, with all expectations taken under the empirical distribution $\mathbb{E}_D[\cdot]$ (13).

The proof of this result relies on Lagrangian duality and the convexity and concavity of the optimization; it may be found in full detail in Appendix D.1. As we can see in Theorem 3, by exploiting strong duality to move $w$ into the innermost optimizations in (11), the obtained optimization is a distributionally robust extension of the saddle-point problem (7). The closed-form reweighting scheme is demonstrated in (12). For particular f-divergences, such as the KL- and 2-power divergences, for a fixed $(\tau,\beta)$, the optimal $(\lambda,\eta)$ can be easily computed and the weights $w$ recovered in closed form. For example, using $D_f = \mathrm{KL}(w\|\hat{p}_n)$, (12) yields the updates

$$w_l(x) = \frac{\hat{p}_n(x)\,\exp\!\big(-\ell(x;\tau,\beta)/\lambda\big)}{Z_l}, \qquad w_u(x) = \frac{\hat{p}_n(x)\,\exp\!\big(\ell(x;\tau,\beta)/\lambda\big)}{Z_u},$$

where $Z_l$ and $Z_u$ provide the normalizing constants. (For closed-form updates of $w$ w.r.t. other f-divergences, please refer to Appendix D.2.) Plugging the closed form of the optimal weights into (11) greatly simplifies the optimization over the data perturbations, yielding (13), and establishes a connection to prioritized experience replay (Schaul et al., 2016): both reweight the experience data according to its loss, but with different reweighting schemes. Note that it is straightforward to check that the estimator for $u_n$ in (13) is nonconvex-concave and the estimator for $l_n$ in (13) is nonconcave-convex. Therefore, one could alternatively apply stochastic gradient descent-ascent (SGDA) to solve (13) and benefit from attractive finite-step convergence guarantees (Lin et al., 2019).

Remark (Practical considerations): As also observed in Namkoong and Duchi (2016), SGDA for (13) could potentially suffer from high variance in both the objective and gradients when $\lambda$ approaches 0. In Appendix D.3, we exploit several properties of (11) to overcome this numerical issue, which leads to a computationally efficient algorithm. Please refer to Appendix D.3 for the details of Algorithm 1 and the practical considerations.

Remark (Joint learning for feature embeddings): The proposed framework also allows for the possibility of learning the features for the constraint projection. In particular, consider the function $\beta^\top\phi(\cdot,\cdot): S\times A\to\mathbb{R}$. Since the combination $\beta^\top\phi(s,a)$ can be treated as the Lagrange-multiplier function for the original Q-LP with infinitely many constraints, both $\beta$ and $\phi(\cdot,\cdot)$ can be updated jointly. Although the conditions for asymptotic coverage no longer hold, the finite-sample correction results of the next section are still applicable. This might offer an interesting way to reduce the approximation error introduced by inappropriate feature embeddings of the constraints, while still maintaining calibrated confidence intervals.
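To illustrate the KL reweighting above, here is a minimal sketch, assuming per-sample losses $\ell_i = \tau_i r_i + \beta^\top\psi_i$ have already been computed and a fixed temperature $\lambda$: the inner optimization over $w$ tilts the empirical distribution exponentially in the loss, upweighting high-loss samples for the upper bound and low-loss samples for the lower bound (the prioritized-replay-like scheme noted above). The function and variable names are illustrative.

```python
import numpy as np

def kl_tilted_weights(ell, lambda_, upper=True):
    """w(x) proportional to p_hat(x) * exp(+/- ell(x) / lambda), normalized over the dataset."""
    sign = 1.0 if upper else -1.0
    logits = sign * ell / lambda_
    logits -= logits.max()            # numerical stability
    w = np.exp(logits)
    return w / w.sum()

ell = np.array([0.2, 1.5, 0.7, 0.1])  # hypothetical per-sample losses ell(x_i; tau, beta)
w_u = kl_tilted_weights(ell, lambda_=0.5, upper=True)
w_l = kl_tilted_weights(ell, lambda_=0.5, upper=False)
print("upper-bound objective:", w_u @ ell, " lower-bound objective:", w_l @ ell)
```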
4 Finite-sample Analysis

Theorem 2 establishes the asymptotic $(1-\alpha)$-coverage of the confidence interval estimates produced by CoinDICE, ignoring higher-order error terms that vanish as the sample size $n\to\infty$. In practice, however, $n$ is always finite, so it is important to quantify these higher-order terms. This section addresses this problem, and presents a finite-sample bound for the estimate of CoinDICE. In the following, we let $\mathcal{F}_\tau$ and $\mathcal{F}_\beta$ be the function classes of $\tau$ and $\beta$ used by CoinDICE.

Theorem 4 (Informal finite-sample correction) Denote by $d_{\mathcal{F}_\tau}$ and $d_{\mathcal{F}_\beta}$ the (finite) VC-dimensions of $\mathcal{F}_\tau$ and $\mathcal{F}_\beta$, respectively. Under some mild conditions, when $D_f$ is the $\chi^2$-divergence, the value $\rho(\pi)$ lies in the slightly enlarged interval $[l_n - \epsilon_n,\ u_n + \epsilon_n]$ with probability at least $1 - 12\exp(-c_1)$, where $\epsilon_n$ is a higher-order correction term and $c_1$ depends on $n$, $d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} + 1$, $\log d_{\mathcal{F}_\tau} + \log d_{\mathcal{F}_\beta}$, and universal constants $(c, M, C_\ell)$.

The precise statement and detailed proof of Theorem 4 can be found in Appendix E.2. The proof relies on empirical Bernstein bounds with a careful analysis of the variance term. Compared to the vanilla sample complexity, we achieve a faster rate without any additional assumptions on the noise or curvature conditions. The tight sample complexity in Theorem 4 implies that one can construct a $(1-\alpha)$ finite-sample confidence interval by optimizing (11) with a radius $\xi$ proportional to $d_{\mathcal{F}_\tau} + d_{\mathcal{F}_\beta} + 1$ (with the explicit form given in Appendix E.2), and composing with the correction $\epsilon_n$. However, we observe that this bound can be conservative compared to the asymptotic confidence interval in Theorem 2. Therefore, we will evaluate the asymptotic version of CoinDICE based on Theorem 2 in the experiments. The conservativeness arises from the use of a union bound; however, we conjecture that the rate is optimal up to a constant. We exploit the VC dimension due to its generality. In fact, the bound can be improved by considering a data-dependent measure, e.g., Rademacher complexity, or a function-class-dependent measure, e.g., the function norm in an RKHS, for specific function approximators.

5 Optimism vs. Pessimism Principle

CoinDICE provides both upper and lower bounds on the target policy's estimated value, which paves the way for applying the principle of optimism (Lattimore and Szepesvári, 2020) or pessimism (Swaminathan and Joachims, 2015) in the face of uncertainty for policy optimization in different learning settings.

Optimism in the face of uncertainty. Optimism in the face of uncertainty leads to risk-seeking algorithms, which can be used to balance the exploration/exploitation trade-off. Conceptually, they always treat the environment as the best plausibly possible. This principle has been successfully applied to stochastic bandit problems, leading to many instantiations of UCB algorithms (Lattimore and Szepesvári, 2020). In each round, an action is selected according to the upper confidence bound, and the obtained reward is used to refine the confidence bound iteratively. When applied to MDPs, this principle inspires many optimistic model-based (Bartlett and Mendelson, 2002; Auer et al., 2009; Strehl et al., 2009; Szita and Szepesvari, 2010; Dann et al., 2017), value-based (Jin et al., 2018), and policy-based algorithms (Cai et al., 2019). Most of these algorithms are not compatible with function approximators. We can also implement the optimism principle by optimizing the upper bound in CoinDICE iteratively, i.e., $\max_\pi u_D(\pi)$. In the $t$-th iteration, we calculate the gradient of $u_{D_t}(\pi_t)$, i.e., $\nabla_\pi u_{D_t}(\pi_t)$, based on the existing dataset $D_t$; the policy $\pi_t$ is then updated by (natural) policy gradient, and new samples are collected with the updated policy $\pi_{t+1}$. Please refer to Appendix F for the gradient computation and algorithm details.
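A schematic sketch of the iterative policy-improvement loop just described (not the Appendix F algorithm): gradient ascent of policy parameters on a confidence bound, with `bound_fn` standing in for a routine that solves (13) for the current policy; the gradient is taken numerically purely for illustration. The pessimistic, offline variant discussed next differs only in keeping the dataset fixed.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-4):
    """Central-difference gradient, used only to keep this sketch self-contained."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

def optimize_with_bound(bound_fn, theta0, lr=0.1, n_iters=50):
    """Gradient ascent on a confidence bound of the policy value w.r.t. policy parameters theta."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta += lr * numerical_gradient(bound_fn, theta)
        # Optimistic (online) setting: new trajectories from pi_theta would be added to D here.
        # Pessimistic (offline) setting: D stays fixed and bound_fn is the lower bound l_D.
    return theta

# Toy usage with a stand-in quadratic "bound" just to show the interface.
theta_star = optimize_with_bound(lambda th: -np.sum((th - 1.0) ** 2), np.zeros(3))
print(theta_star)  # approaches [1, 1, 1]
```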
Pessimism in the face of uncertainty. In offline reinforcement learning (Lange et al., 2012; Fujimoto et al., 2019; Wu et al., 2019; Nachum et al., 2019b), only a fixed set of data from behavior policies is given, and a safe optimization criterion is to maximize the worst-case performance among a set of statistically plausible models (Laroche et al., 2019; Kumar et al., 2019; Yu et al., 2020). In contrast to the previous case of online exploration, this is a pessimism principle (Cohen and Hutter, 2020; Buckman et al., 2020), or counterfactual risk minimization (Swaminathan and Joachims, 2015), and is highly related to robust MDPs (Iyengar, 2005; Nilim and El Ghaoui, 2005; Tamar et al., 2013; Chow et al., 2015). Different from most existing methods, where the worst-case performance is characterized by model-based perturbations or ensembles, CoinDICE provides a lower bound with which to implement the pessimism principle, i.e., $\max_\pi l_D(\pi)$. Conceptually, we apply the (natural) policy gradient w.r.t. $l_D(\pi_t)$ to update the policy iteratively. Since we are dealing with policy optimization in the offline setting, the dataset $D$ remains unchanged. Please refer to Appendix F for the algorithm details.

6 Related Work

Off-policy estimation has been extensively studied in the literature, given its practical importance. Most existing methods are based on the core idea of importance reweighting to correct for distribution mismatches between the target policy and the off-policy data (Precup et al., 2000; Bottou et al., 2013; Li et al., 2015; Xie et al., 2019). Unfortunately, when applied naively, importance reweighting can result in excessively high variance, which is known as the curse of horizon (Liu et al., 2018). To avoid this drawback, there has been rapidly growing interest in estimating the correction ratio of the stationary distribution (e.g., Liu et al., 2018; Nachum et al., 2019a; Uehara et al., 2019; Liu et al., 2019; Zhang et al., 2020a,b). This work is along the same line and thus applicable in long-horizon problems. Other off-policy approaches are also possible, notably model-based (e.g., Fonteneau et al., 2013) and doubly robust methods (Jiang and Li, 2016; Thomas and Brunskill, 2016; Tang et al., 2020; Uehara et al., 2019). These techniques can potentially be combined with our algorithm, which we leave for future investigation.

While most OPE work focuses on obtaining accurate point estimates, several authors provide ways to quantify the amount of uncertainty in OPE estimates. In particular, confidence bounds have been developed using the central limit theorem (Bottou et al., 2013), concentration inequalities (Thomas et al., 2015b; Kuzborskij et al., 2020), and nonparametric methods such as the bootstrap (Thomas et al., 2015a; Hanna et al., 2017). In contrast to these works, CoinDICE is asymptotically pivotal, meaning that there are no hidden quantities that need to be estimated; it is based on correcting the stationary distribution in the behavior-agnostic setting, thus avoiding the curse of horizon and broadening the applicability of the uncertainty estimator. Recently, Jiang and Huang (2020) provide confidence intervals for OPE, but focus on intervals determined by the approximation error induced by a function approximator, while our confidence intervals quantify statistical error.

Empirical likelihood (Owen, 2001) is a powerful tool with many applications in statistical inference, such as econometrics (Chen et al., 2018), and more recently in distributionally robust optimization (Duchi et al., 2016; Lam and Zhou, 2017). EL-based confidence intervals can be used to guide exploration in multi-armed bandits (Honda and Takemura, 2010; Cappé et al., 2013), and for OPE (Karampatziakis et al., 2019; Kallus and Uehara, 2019b). While the work of Kallus and Uehara (2019b) is also based on EL, it differs from the present work in two important ways.
First, their focus is on developing an asymptotically efficient OPE point estimate, not confidence intervals. Second, they solve for timestep-dependent weights, whereas we only need to solve for timestep-independent weights from a system of moment-matching equations induced by an underlying ergodic Markov chain.

7 Experiments

We now evaluate the empirical performance of CoinDICE, comparing it to a number of existing confidence interval estimators for OPE based on concentration inequalities. Specifically, given a dataset of logged trajectories, we first use weighted step-wise importance sampling (Precup et al., 2000) to calculate a separate estimate of the target policy value for each trajectory. Given such a finite sample of estimates, we then use the empirical Bernstein inequality (Thomas et al., 2015b) to derive high-confidence lower and upper bounds for the true value. Alternatively, one may also use Student's t-test or Efron's bias-corrected and accelerated bootstrap (Thomas et al., 2015a).

Figure 1: Results of CoinDICE and baseline methods (Bootstrapping, Bernstein, and Student's t) on a simple two-armed bandit, with panels for 50, 100, and 200 samples. We plot empirical coverage and median log-width (y-axes) of intervals evaluated at a number of desired confidence levels 1 − α (x-axis), as measured over 200 random trials; the expected coverage is also shown. We find that CoinDICE achieves more accurate coverage and narrower intervals compared to the baseline confidence interval estimation methods.

We begin with a simple bandit setting, devising a two-armed bandit problem with stochastic payoffs. We define the target policy as a near-optimal policy, which chooses the optimal arm with probability 0.95. We collect off-policy data using a behavior policy which chooses the optimal arm with probability of only 0.55. Our results are presented in Figure 1. We plot the empirical coverage and width of the estimated intervals across different confidence levels. More specifically, each data point in Figure 1 is the result of 200 experiments. In each experiment, we randomly sample a dataset and then compute a confidence interval. The interval coverage is then computed as the proportion of intervals out of 200 that contain the true value of the target policy. The interval log-width is the median of the log of the width of the 200 computed intervals. Figure 1 shows that the intervals produced by CoinDICE achieve an empirical coverage close to the intended coverage. In this simple bandit setting, the coverages of Student's t and bootstrapping are also close to correct, although they suffer more in the low-data regime. Notably, the intervals produced by CoinDICE are especially narrow while maintaining accurate coverage.
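For completeness, a minimal sketch of the baseline interval and evaluation protocol used here, assuming per-trajectory value estimates bounded in [0, b] and treated as i.i.d.: an empirical Bernstein interval (one standard form, following Maurer and Pontil, 2009) around the mean of per-trajectory importance-sampling estimates, together with the coverage and median log-width statistics computed over repeated trials. This is an illustrative re-implementation, not the exact estimator of Thomas et al. (2015b), and the stand-in data below is synthetic.

```python
import numpy as np

def empirical_bernstein_interval(x, delta, b):
    """Two-sided (1 - delta) interval for the mean of i.i.d. samples x in [0, b]."""
    n = len(x)
    mean, var = np.mean(x), np.var(x, ddof=1)
    slack = np.sqrt(2 * var * np.log(4 / delta) / n) + 7 * b * np.log(4 / delta) / (3 * (n - 1))
    return mean - slack, mean + slack

def coverage_and_logwidth(interval_fn, true_value, n_trials=200, rng=None):
    """Empirical coverage and median log-width over repeated random datasets (as in Figures 1-3)."""
    rng = rng or np.random.default_rng(0)
    hits, widths = 0, []
    for _ in range(n_trials):
        per_traj_estimates = rng.uniform(0.0, 1.0, size=50)  # synthetic stand-in for WIS estimates
        lo, hi = interval_fn(per_traj_estimates)
        hits += (lo <= true_value <= hi)
        widths.append(np.log(hi - lo))
    return hits / n_trials, float(np.median(widths))

cov, logw = coverage_and_logwidth(lambda x: empirical_bernstein_interval(x, delta=0.1, b=1.0),
                                  true_value=0.5)
print("coverage:", cov, "median log-width:", logw)
```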
Figure 2: Results of CoinDICE and baseline methods on an infinite-horizon version of Frozen Lake (with 50 and 100 trajectories) and Taxi (with 20 and 50 trajectories), plotting interval coverage and interval log-width against the confidence level 1 − α, with the same methods as in Figure 1. In Frozen Lake, each dataset consists of trajectories of length 100; in Taxi, each dataset consists of trajectories of length 500.

We now turn to more complicated MDP environments. We use Frozen Lake (Brockman et al., 2016), a highly stochastic gridworld environment, and Taxi (Dietterich, 1998), an environment with a moderate state space of 2,000 elements. As in (Liu et al., 2018), we modify these environments to be infinite-horizon by randomly resetting the state upon termination. The discount factor is γ = 0.99. The target policy is taken to be a near-optimal one, while the behavior policy is highly suboptimal. The behavior policy in Frozen Lake is the optimal policy with 0.2 white noise, which reduces the policy value dramatically, from 0.74 to 0.24. For the behavior policies in Taxi and Reacher, we follow the same experimental setting for constructing behavior policies and collecting data as in (Nachum et al., 2019a; Liu et al., 2018).

Figure 3: Results of CoinDICE and baseline methods on Reacher (Brockman et al., 2016; Todorov et al., 2012), using 25 trajectories of length 100, plotting interval coverage and interval log-width against the confidence level 1 − α. Colors and markers are as defined in the legends of previous figures.

We follow the same evaluation protocol as in the bandit setting, measuring empirical interval coverage and log-width over 200 experimental trials for various dataset sizes and confidence levels. Results are shown in Figure 2. We reach a similar conclusion: CoinDICE consistently achieves more accurate coverage and smaller widths than the baselines. Notably, the baseline methods' accuracy suffers more significantly compared to the simpler bandit setting described earlier.

Lastly, we evaluate CoinDICE on Reacher (Brockman et al., 2016; Todorov et al., 2012), a continuous-control environment. In this setting, we use a one-hidden-layer neural network with ReLU activations. Results are shown in Figure 3. To account for the approximation error of the neural network used, we measure the coverage of CoinDICE with respect to a true value computed as the median of a large ensemble of neural networks trained on the off-policy data. To keep the comparison fair, we measure the coverage of the IS-based baselines with respect to a true value computed as the median of a large number of IS-based point estimates. The results show similar conclusions as before: CoinDICE achieves more accurate coverage than the IS-based methods. Still, we see that CoinDICE's coverage suffers in this regime, likely due to optimization difficulties. If the optimum of the Lagrangian is only approximately found, the empirical coverage will inevitably be inexact.

8 Conclusion

In this paper, we have developed CoinDICE, a novel and efficient confidence interval estimator applicable to the behavior-agnostic offline setting. The algorithm builds on a few technical components, including a new feature-embedded Q-LP and a generalized empirical likelihood approach to confidence interval estimation. We analyzed the asymptotic coverage of CoinDICE's estimate, and provided a finite-sample bound. On a variety of off-policy benchmarks, we empirically compared the new algorithm with several strong baselines and found it to be superior to them.

Broader Impact

This research is fundamental and targets a broad question in reinforcement learning.
The ability to reliably assess uncertainty in off-policy evaluation would have significant positive benefits for safety-critical applications of reinforcement learning. Inaccurate uncertainty estimates create the danger of misleading decision makers and could lead to detrimental consequences. However, our primary goal is to improve these estimators and reduce the ultimate risk of deploying reinforcement-learned systems. The techniques are general and do not otherwise target any specific application area.

Acknowledgements

We thank Hanjun Dai, Mengjiao Yang and other members of the Google Brain team for helpful discussions. Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii and NSERC.

References

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.

P. Auer. Using upper confidence bounds for online learning. In Proc. 41st Annual Symposium on Foundations of Computer Science, pages 270–279. IEEE Computer Society Press, Los Alamitos, CA, 2000.

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Patrice Bertail, Emmanuelle Gautherat, and Hugo Harari-Kermadec. Empirical φ-divergence minimizers for Hadamard differentiable functionals. In Topics in Nonparametric Statistics, pages 21–32. Springer, 2014.

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Michel Broniatowski and Amor Keziou. Divergences and duality for estimation and test under moment condition models. Journal of Statistical Planning and Inference, 142(9):2554–2573, 2012.

Jacob Buckman, Carles Gelada, and Marc G Bellemare. The importance of pessimism in fixed-dataset policy optimization. arXiv preprint arXiv:2009.06799, 2020.

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456–464, 2019.

Xiaohong Chen, Timothy M. Christensen, and Elie Tamer. Monte Carlo confidence sets for identified sets. Econometrica, 86(6):1965–2018, 2018.

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone.
Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.

Michael K Cohen and Marcus Hutter. Pessimism about unknown unknowns inspires conservatism. In Conference on Learning Theory, pages 1344–1373. PMLR, 2020.

Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. CoRR, abs/1712.10285, 2017.

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.

Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Proc. Intl. Conf. Machine Learning, pages 118–126. Morgan Kaufmann, San Francisco, CA, 1998.

John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.

Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011. CoRR abs/1103.4601.

Ivar Ekeland and Roger Temam. Convex analysis and variational problems, volume 28. SIAM, 1999.

Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.

Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li-wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings, 2018. arXiv:1805.12298.

Josiah P. Hanna, Peter Stone, and Scott Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 4933–4934, 2017.

Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79, 2010.

Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

Nan Jiang and Jiawei Huang. Minimax confidence interval for off-policy evaluation and policy optimization, 2020. arXiv:2002.02081.

Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019a.

Nathan Kallus and Masatoshi Uehara.
Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems 32, pages 3320–3329, 2019b.

Nikos Karampatziakis, John Langford, and Paul Mineiro. Empirical likelihood for contextual bandits. arXiv preprint arXiv:1906.03323, 2019.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11784–11794, 2019.

Ilja Kuzborskij, Claire Vernade, András György, and Csaba Szepesvári. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.

Chandrashekar Lakshminarayanan, Shalabh Bhatnagar, and Csaba Szepesvari. A linearly relaxed approximate linear program for Markov decision processes. arXiv preprint arXiv:1704.02544, 2017.

Henry Lam and Enlu Zhou. The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters, 45(4):301–307, 2017.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.

Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pages 3652–3661. PMLR, 2019.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074, 2012.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.

Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. pages 608–616, 2015.

Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. CoRR, abs/1906.00331, 2019.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5356–5366. Curran Associates, Inc., 2018.

Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling, 2019. arXiv:1910.06508.

Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. 2014.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.

Ofir Nachum and Bo Dai. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. pages 2315–2325, 2019a.
Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.

Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.

Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.

Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

Art B Owen. Empirical Likelihood. Chapman and Hall/CRC, 2001.

Jason Pazis and Ronald Parr. Non-parametric approximate linear programming for MDPs. In AAAI, 2011.

Doina Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proc. Intl. Conf. Machine Learning, pages 759–766. Morgan Kaufmann, San Francisco, CA, 2000.

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Jin Qin and Jerry Lawless. Empirical likelihood and general estimating equations. The Annals of Statistics, pages 300–325, 1994.

R Tyrrell Rockafellar. Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM Journal on Control, 12(2):268–285, 1974.

Werner Römisch. Delta method, infinite dimensional. Wiley StatsRef: Statistics Reference Online.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the 4th International Conference on Learning Representations, 2016.

B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael P Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.

Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.

Istvan Szita and Csaba Szepesvari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038. Omnipress, 2010.

Aviv Tamar, Huan Xu, and Shie Mannor. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189, 2013.

Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In Proceedings of the 8th International Conference on Learning Representations, 2020.

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015a.

Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.

Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 2139–2148, 2016.
Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015b.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.

Weiran Wang and Miguel A Carreira-Perpinán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pages 9665–9675, 2019.

Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020a.

Shangtong Zhang, Bo Liu, and Shimon Whiteson. GradientDICE: Rethinking generalized offline estimation of stationary values, 2020b. arXiv:2001.11113.