# More Robust Doubly Robust Off-policy Evaluation

Mehrdad Farajtabar*¹, Yinlam Chow*², Mohammad Ghavamzadeh²

*Equal contribution. ¹Georgia Tech. ²DeepMind. Correspondence to: Yinlam Chow.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## Abstract

We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of a policy from the data generated by another policy (or policies). In particular, we focus on the doubly robust (DR) estimators that consist of an importance sampling (IS) component and a performance model, and utilize the low (or zero) bias of IS and the low variance of the model at the same time. Although the accuracy of the model has a huge impact on the overall performance of DR, most of the work on using DR estimators in OPE has focused on improving the IS part, and not much on how to learn the model. In this paper, we propose alternative DR estimators, called more robust doubly robust (MRDR), that learn the model parameter by minimizing the variance of the DR estimator. We first present a formulation for learning the DR model in RL. We then derive formulas for the variance of the DR estimator in both contextual bandits and RL, such that their gradients w.r.t. the model parameters can be estimated from samples, and propose methods to efficiently minimize the variance. We prove that the MRDR estimators are strongly consistent and asymptotically optimal. Finally, we evaluate MRDR in bandit and RL benchmark problems, and compare its performance with that of existing methods.

## 1. Introduction

In many real-world decision-making problems, in areas such as marketing, finance, robotics, and healthcare, deploying a policy without having an accurate estimate of its performance could be costly, unethical, or even illegal. This is why the problem of off-policy evaluation (OPE) has been heavily studied in contextual bandits (e.g., Dudík et al. 2011; Swaminathan et al. 2017) and reinforcement learning (RL) (e.g., Precup et al. 2000a; 2001; Paduraru 2013; Mahmood et al. 2014; Thomas et al. 2015a; Li et al. 2015; Jiang & Li 2016; Thomas & Brunskill 2016), and some of the results have been applied to problems in marketing (e.g., Li et al. 2011; Theocharous et al. 2015), healthcare (e.g., Murphy et al. 2001; Hirano et al. 2003), and education (e.g., Mandel et al. 2014; 2016). The goal in OPE is to estimate the performance of an evaluation policy, given a log of data generated by the behavior policy (or policies). The OPE problem can be viewed as a form of counterfactual reasoning to infer the causal effects of a new treatment from historical data (e.g., Bottou et al. 2013; Shalit et al. 2017; Louizos et al. 2017).

Three different approaches to OPE in RL can be identified in the literature. 1) The direct method (DM), which learns a model of the system and then uses it to estimate the performance of the evaluation policy. This approach often has low variance, but its bias depends on how well the selected function class represents the system and on whether the number of samples is sufficient to accurately learn this function class.
There are two major problems with this approach: (a) its bias cannot be easily quantified, since in general it is difficult to quantify the approximation error of a function class; (b) it is not clear how to choose the loss function for model learning without knowledge of the evaluation policy (or the distribution of evaluation policies). Without this knowledge, we may select a loss function that focuses on learning areas that are irrelevant for the evaluation policy(ies). 2) Importance sampling (IS), which uses an IS term to correct the mismatch between the distributions of the system trajectory induced by the evaluation and behavior policies. Although this approach is unbiased (under mild assumptions) when the behavior policy is known, its variance can be very large when there is a big difference between the distributions induced by the evaluation and behavior policies, and it grows exponentially with the horizon of the RL problem. 3) Doubly robust (DR), which is a combination of DM and IS, and can achieve the low variance of DM and the no (or low) bias of IS. The DR estimator was first developed in statistics (e.g., Cassel et al. 1976; Robins et al. 1994; Robins & Rotnitzky 1995; Bang & Robins 2005) for estimation from incomplete data, with the property that it is unbiased when either its DM or its IS component is correct. It was brought to our community first in contextual bandits by Dudík et al. (2011) and then in RL by Jiang & Li (2016). Thomas & Brunskill (2016) proposed two methods to reduce the variance of DR, at the cost of introducing bias: one that selects a low-variance IS estimator, namely weighted IS (WIS), and one that blends DM and IS together (instead of simply combining them as in the standard DR approach) in a way that minimizes the mean squared error (MSE).

In this paper, we propose to reduce the variance of DR in bandits and RL by designing the loss function used to learn the model in the DM part of the estimator. The main idea of our estimator, called more robust doubly robust (MRDR), is to learn the parameters of the DM model by minimizing the variance of the DR estimator. This idea has been investigated in statistics in the context of regression when the labels of a subset of samples are randomly missing (Cao et al., 2009). We first present a novel formulation for the DM part of the DR estimator in RL. We then derive formulas for the variance of the DR estimator in both bandits and RL in a way that its gradient w.r.t. the model parameters can be estimated from samples. Note that the DR variances reported for bandits (Dudík et al., 2011) and RL (Jiang & Li, 2016) contain the bias of the DM component, which is unknown. We then propose methods to efficiently minimize the variance in both bandits and RL. Furthermore, we prove that the MRDR estimator is strongly consistent and asymptotically optimal. Finally, we evaluate the MRDR estimator in bandit and RL benchmark problems, and compare its performance with DM, IS, and DR approaches.

## 2. Preliminaries

In this paper, we consider the reinforcement learning (RL) problem in which the agent's interaction with the system is modeled as a Markov decision process (MDP). Note that the contextual bandit problem is a special case with horizon-1 decision-making. In this section, we first define MDPs and the relevant quantities that we use throughout the paper, and then define the off-policy evaluation problem in RL, which is the main topic of this work.
### 2.1. Markov Decision Processes

An MDP is a tuple $\langle \mathcal{X}, \mathcal{A}, P_r, P, P_0, \gamma \rangle$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces, $P_r(\cdot|x,a)$ is the distribution of the bounded random variable $r(x,a) \in [0, R_{\max}]$, the immediate reward of taking action $a$ in state $x$, $P(\cdot|x,a)$ is the transition probability distribution, $P_0 : \mathcal{X} \to [0,1]$ is the initial state distribution, and $\gamma \in [0,1)$ is the discount factor. A (stationary) policy $\pi : \mathcal{X} \times \mathcal{A} \to [0,1]$ is a stochastic mapping from states to actions, with $\pi(a|x)$ being the probability of taking action $a$ in state $x$. We denote by $P^\pi$ the state transition kernel of the Markov chain induced by policy $\pi$, i.e., $P^\pi(x_{t+1}|x_t) = \sum_{a \in \mathcal{A}} \pi(a|x_t)\, P(x_{t+1}|x_t, a)$. We denote by $\xi = (x_0, a_0, r_0, \ldots, x_{T-1}, a_{T-1}, r_{T-1}, x_T)$ a $T$-step trajectory generated by policy $\pi$, and by $R_{0:T-1}(\xi) = \sum_{t=0}^{T-1} \gamma^t r_t$ the return of trajectory $\xi$. Note that in $\xi$, $x_0 \sim P_0$, and for all $t \in \{0, \ldots, T-1\}$, $a_t \sim \pi(\cdot|x_t)$, $x_{t+1} \sim P(\cdot|x_t, a_t)$, and $r_t \sim P_r(\cdot|x_t, a_t)$. These distributions together define $P^\pi_\xi$, i.e., the distribution of trajectory $\xi$.

We evaluate a policy $\pi$ by the expectation of the return of the $T$-step trajectories it generates, i.e., $\rho^\pi_T = \mathbb{E}_{\xi \sim P^\pi_\xi}[R_{0:T-1}(\xi)]$. If we set $T$ to be of order $O\big(1/(1-\gamma)\big)$, then $\rho^\pi_T$ is a good approximation of the infinite-horizon performance $\rho^\pi$. Throughout the paper, we assume that $T$ has been selected such that $\rho^\pi_T \approx \rho^\pi$, and thus we refer to $\rho^\pi = \rho^\pi_T$ as the performance of policy $\pi$. We further define the value (action-value) function of a policy $\pi$ at each state $x$ (state-action pair $(x,a)$), denoted by $V^\pi(x)$ ($Q^\pi(x,a)$), as the expectation of the return of a $T$-step trajectory generated by starting at state $x$ (state-action pair $(x,a)$) and then following policy $\pi$. Note that $\rho^\pi = \mathbb{E}_{x \sim P_0}[V^\pi(x)]$. Note also that the contextual bandit setting is a special case of the setting described above, in which $T = 1$, the context is sampled from $P_0$, and there are no dynamics $P$.

### 2.2. Off-policy Evaluation Problem

In the off-policy evaluation (OPE) problem, we are given a set of $T$-step trajectories $\mathcal{D} = \{\xi^{(i)}\}_{i=1}^{n}$ independently generated by the behavior policy $\pi_b$,¹ and the goal is to obtain a good estimate of the performance of the evaluation policy $\pi_e$. We consider an estimator $\hat\rho^{\pi_e}$ good if it has low mean squared error (MSE), i.e., low $\mathrm{MSE}(\rho^{\pi_e}, \hat\rho^{\pi_e}) = \mathbb{E}\big[(\hat\rho^{\pi_e} - \rho^{\pi_e})^2\big]$. We make the following standard regularity assumption:

**Assumption 1** (Absolute Continuity). For all state-action pairs $(x,a) \in \mathcal{X} \times \mathcal{A}$, if $\pi_b(a|x) = 0$ then $\pi_e(a|x) = 0$.

In order to quantify the mismatch between the behavior and evaluation policies in generating a trajectory, we define the cumulative importance ratio as follows. For each $T$-step trajectory $\xi \in \mathcal{D}$, the cumulative importance ratio from time step $t_1$ to time step $t_2$, where both $t_1$ and $t_2$ are in $\{0, \ldots, T\}$, is $\omega_{t_1:t_2} = 1$ if $t_1 > t_2$, and $\omega_{t_1:t_2} = \prod_{\tau=t_1}^{t_2} \frac{\pi_e(a_\tau|x_\tau)}{\pi_b(a_\tau|x_\tau)}$ otherwise. In case the behavior policy $\pi_b$ is unknown, we define $\hat\omega_{t_1:t_2}$ exactly as $\omega_{t_1:t_2}$, with $\pi_b$ replaced by its approximation $\hat\pi_b$. Under Assumption 1, it is easy to see that $\rho^{\pi_e} = \mathbb{E}_{P^{\pi_e}_\xi}\big[\sum_{t=0}^{T-1} \gamma^t r_t\big] = \mathbb{E}_{P^{\pi_b}_\xi}\big[\sum_{t=0}^{T-1} \gamma^t \omega_{0:t}\, r_t\big]$. Similar equalities hold for the value and action-value functions of $\pi_e$, e.g., $V^{\pi_e}(x) = \mathbb{E}_{P^{\pi_e}_\xi}\big[\sum_{t=0}^{T-1} \gamma^t r_t \mid x_0 = x\big] = \mathbb{E}_{P^{\pi_b}_\xi}\big[\sum_{t=0}^{T-1} \gamma^t \omega_{0:t}\, r_t \mid x_0 = x\big]$, and analogously for $Q^{\pi_e}(x,a)$ with the additional conditioning on $a_0 = a$.

¹The results of this paper can be easily extended to the case in which the trajectories are generated by multiple behavior policies.
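The cumulative importance ratios $\omega_{0:t}$ defined above are the basic building block of every estimator discussed in the rest of the paper. The short sketch below shows one way to compute them, together with the importance-corrected return, for a single logged trajectory; the function names and the representation of `pi_e`/`pi_b` as probability callables are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def cumulative_importance_ratios(states, actions, pi_e, pi_b):
    """Per-step ratios pi_e(a_t|x_t)/pi_b(a_t|x_t) and their running
    products omega_{0:t} for one T-step trajectory logged under pi_b."""
    rho = np.array([pi_e(x, a) / pi_b(x, a) for x, a in zip(states, actions)])
    omega = np.cumprod(rho)  # omega[t] = prod_{tau <= t} rho[tau]
    return rho, omega

def corrected_return(rewards, omega, gamma):
    """Step-wise importance-corrected return: sum_t gamma^t * omega_{0:t} * r_t,
    an unbiased estimate of rho^{pi_e} under Assumption 1."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * omega * np.asarray(rewards)))
```

Averaging `corrected_return` over the $n$ logged trajectories gives exactly the step-wise IS estimator introduced in Section 3.2.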
## 3. Existing Approaches to OPE

The objective of MRDR is to learn the model part of a DR estimator by minimizing its variance. MRDR is thus a variation of DR whose DM loss function is derived from minimizing the DR variance, and it is built on top of IS and DM. Therefore, before stating our main results in Section 4, we first provide a brief overview of these popular approaches.

### 3.1. Direct Estimators

The idea of the direct method (DM) is to first learn a model of the system and then use it to estimate the performance of the evaluation policy $\pi_e$. In the case of bandits, this model is the mean reward of each pair of context and arm; in RL, it is either the mean reward $r(x,a)$ and state transition $P(\cdot|x,a)$, or the value (action-value) function $V(x)$ ($Q(x,a)$). In either case, if we select a good representation for the quantities that need to be learned, and our dataset² contains a sufficient number of the states and actions relevant to the evaluation of $\pi_e$, then the DM estimator has low variance and small bias, and thus has the potential to outperform the estimators resulting from the other approaches.

As mentioned in Section 1, an important issue that has been neglected in previous work on off-policy evaluation in RL is the loss function used in estimating the model in DM. As pointed out by Dudík et al. (2011), the direct approach has a problem if the model is estimated without knowledge of the evaluation policy. This is because the distribution of the states and actions that are visited under the evaluation policy should be reflected in the loss function of the direct approach. In other words, if upon learning a model we have no information about the evaluation policy (or the distribution of evaluation policies), then it is not clear how to design the DM loss function (perhaps a uniform distribution over the states and actions would be the most reasonable choice). Therefore, in this paper, we assume that the evaluation policy is known prior to learning the model.³

In their DM and DR experiments, both Jiang & Li (2016) and Thomas & Brunskill (2016) learn the MDP model, $r(x,a)$ and $P(\cdot|x,a)$, although all the model-learning discussion in Thomas & Brunskill (2016) is about the reward of the evaluation policy $\pi_e$ at every step $t$ along the $T$-step trajectory, i.e., $r^{\pi_e}(x,t)$. More generally, in off-policy actor-critic algorithms (such as the Reactor algorithm proposed in Gruslys et al. 2017), where one can view the gradient-estimation part as an off-policy evaluation problem, the DM state-action value function model is learned by minimizing the Bellman residual in an off-policy setting (Precup et al., 2000b; Munos et al., 2016; Geist & Scherrer, 2014). However, none of these three approaches incorporates the design of the DM loss function into the primary objective, perhaps because they consider the setting in which the model is learned independently.

²Note that we shall use separate datasets for learning the model in DM and evaluating the policy.

³Our results can be extended to the case in which the distribution of the evaluation policies is known prior to learning the model.

**Our approach to DM in RL:** In this paper, we propose to learn $Q^{\pi_e}$, the action-value function of the evaluation policy $\pi_e$, and then use it to evaluate the performance of $\pi_e$ as

$$\hat\rho^{\pi_e}_{\mathrm{DM}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \mathcal{A}} \pi_e(a|x^{(i)}_0)\, \hat Q^{\pi_e}(x^{(i)}_0, a; \beta) = \frac{1}{n} \sum_{i=1}^{n} \hat V^{\pi_e}(x^{(i)}_0; \beta). \qquad (1)$$

We model $Q^{\pi_e}$ using a parameterized class of functions with parameter vector $\beta$ and learn $\beta$ by solving the following weighted MSE problem

$$\beta^* \in \arg\min_{\beta}\ \mathbb{E}_{(x,a) \sim \mu_{\pi_e}} \Big[ \big( Q^{\pi_e}(x,a) - \hat Q^{\pi_e}(x,a;\beta) \big)^2 \Big], \qquad (2)$$

where $\mu_{\pi_e}$ is the $\gamma$-discounted horizon-$T$ state-action occupancy of $\pi_e$, i.e., $\mu_{\pi_e}(x,a) = \frac{1-\gamma}{1-\gamma^T} \sum_{t=0}^{T-1} \gamma^t\, \mathbb{E}_{P^{\pi_e}_\xi}\big[ \mathbf{1}\{x_t = x, a_t = a\} \big]$, and $\mathbf{1}\{\cdot\}$ is the indicator function. Since the actions in the dataset $\mathcal{D}$ are generated by $\pi_b$, we rewrite the objective function of the optimization problem (2) as

$$\beta^* \in \arg\min_{\beta}\ \mathbb{E}_{P^{\pi_b}_\xi} \Big[ \sum_{t=0}^{T-1} \gamma^t\, \omega_{0:t} \big( R_{t:T-1}(\xi) - \hat Q^{\pi_e}(x_t, a_t; \beta) \big)^2 \Big], \qquad (3)$$

where $R_{t:T-1}(\xi) = \sum_{\tau=t}^{T-1} \gamma^{\tau-t}\, \omega_{t+1:\tau}\, r(x_\tau, a_\tau)$ is the Monte Carlo estimate of $Q^{\pi_e}(x_t, a_t)$. The proof of the equivalence of the objective functions (2) and (3) can be found in Appendix A. We obtain $\hat\beta_n$ by solving the sample average approximation (SAA) of (3), i.e.,

$$\hat\beta_n \in \arg\min_{\beta}\ \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t\, \omega^{(i)}_{0:t} \big( R_{t:T-1}(\xi^{(i)}) - \hat Q^{\pi_e}(x^{(i)}_t, a^{(i)}_t; \beta) \big)^2. \qquad (4)$$

Since the SAA estimator (4) is unbiased, for large enough $n$, $\hat\beta_n \to \beta^*$ almost surely. We define the bias of our DM estimator at each state-action pair as $\Delta(x,a) = \hat Q^{\pi_e}(x,a;\beta) - Q^{\pi_e}(x,a)$. Note that in contextual bandits with a deterministic evaluation policy, the SAA (4) may be written as the weighted least squares (WLS) problem

$$\hat\beta_n \in \arg\min_{\beta}\ \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbf{1}\{\pi_e(x_i) = a_i\}}{\pi_b(a_i|x_i)} \big( r(x_i, a_i) - \hat Q(x_i, a_i; \beta) \big)^2, \qquad (5)$$

i.e., with weights $1/\pi_b(a_i|x_i)$ for the actions consistent with $\pi_e$.
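As a concrete illustration of the bandit WLS (5), the sketch below fits a linear-in-features reward model per arm with the weights $\mathbf{1}\{\pi_e(x_i)=a_i\}/\pi_b(a_i|x_i)$. The linear model, the helper names, and the small ridge term are illustrative assumptions; the paper itself does not prescribe a particular function class.

```python
import numpy as np

def dm_wls_bandit(X, actions, rewards, pi_e, pi_b_probs, n_actions):
    """Weighted least-squares fit of a linear reward model Q_hat(x, a) = beta_a^T x
    for the direct method, with weights 1{pi_e(x_i) = a_i} / pi_b(a_i | x_i).

    X:           (n, d) context features logged under the behavior policy
    actions:     (n,) logged actions
    rewards:     (n,) logged rewards
    pi_e:        callable mapping a context to the (deterministic) evaluation action
    pi_b_probs:  (n,) probabilities pi_b(a_i | x_i) of the logged actions
    """
    n, d = X.shape
    beta = np.zeros((n_actions, d))
    w = np.array([float(pi_e(x) == a) for x, a in zip(X, actions)]) / pi_b_probs
    for a in range(n_actions):
        mask = (actions == a) & (w > 0)
        if not mask.any():
            continue
        Xa, ra, wa = X[mask], rewards[mask], w[mask]
        # Solve the weighted normal equations (Xa^T W Xa) beta_a = Xa^T W ra.
        A = Xa.T @ (wa[:, None] * Xa)
        b = Xa.T @ (wa * ra)
        beta[a] = np.linalg.solve(A + 1e-8 * np.eye(d), b)
    return beta
```

The fitted `beta` gives $\hat Q(x,a;\beta) = \beta_a^\top x$, which can then be plugged into the DM estimate (1) or into the DR estimators of Section 3.3.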
### 3.2. Importance Sampling Estimators

Another common approach to off-policy evaluation in RL is to use importance sampling (IS) to estimate the performance of the evaluation policy, i.e.,

$$\hat\rho^{\pi_e}_{\mathrm{IS}} = \frac{1}{n} \sum_{i=1}^{n} \omega^{(i)}_{0:T-1}\, R^{(i)}_{0:T-1}, \qquad (6)$$

where $\omega^{(i)}_{0:T-1}$ and $r^{(i)}_t$ are the cumulative importance ratio and the reward at step $t$ of trajectory $\xi^{(i)} \in \mathcal{D}$, respectively, and $R^{(i)}_{0:T-1} = R_{0:T-1}(\xi^{(i)})$. Under Assumption 1, the IS estimator (6) is unbiased. A variant of IS that often has less variance, while still being unbiased, is step-wise importance sampling (step-IS), i.e.,

$$\hat\rho^{\pi_e}_{\text{step-IS}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t\, \omega^{(i)}_{0:t}\, r^{(i)}_t.$$

If the behavior policy $\pi_b$ is unknown, which is the case in many applications, then either $\pi_b$ or the importance ratio $\omega = \pi_e/\pi_b$ needs to be estimated, and thus IS may no longer be unbiased. In this case, the bias of IS and step-IS are $\big|\mathbb{E}_{P^{\pi_e}_\xi}\big[\delta_{0:T-1}(\xi)\, R_{0:T-1}(\xi)\big]\big|$ and $\big|\sum_{t=0}^{T-1} \gamma^t\, \mathbb{E}_{P^{\pi_e}_\xi}\big[\delta_{0:t}(\xi)\, r_t\big]\big|$, respectively, where $\delta_{0:t}(\xi) = 1 - \lambda_{0:t}(\xi) = 1 - \prod_{\tau=0}^{t} \frac{\pi_b(a_\tau|x_\tau)}{\hat\pi_b(a_\tau|x_\tau)}$, with $\hat\pi_b$ being our approximation of $\pi_b$ (see the proofs in Appendix B). Note that when $\pi_b$ is known, i.e., $\hat\pi_b = \pi_b$, we have $\delta_{0:t} = 0$, and the bias of both IS and step-IS is zero. Although the unbiasedness of IS estimators is desirable for certain applications such as safety (Thomas et al., 2015b), their high variance (even in the step-wise case), which grows exponentially with horizon $T$, restricts their applicability. This is why another variant of IS, called weighted importance sampling (WIS), and particularly its step-wise version, i.e.,

$$\hat\rho^{\pi_e}_{\text{step-WIS}} = \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t\, \frac{\omega^{(i)}_{0:t}}{\sum_{j=1}^{n} \omega^{(j)}_{0:t}}\, r^{(i)}_t,$$

is considered more practical, especially where being biased is not crucial. The WIS estimators are biased but consistent, and have lower variance than their IS counterparts.

### 3.3. Doubly Robust Estimators

Doubly robust (DR) estimators that combine DM and IS were first developed for regression (e.g., Cassel et al. 1976), brought to contextual bandits by Dudík et al. (2011), and to RL by Jiang & Li (2016) and Thomas & Brunskill (2016). The DR estimator for RL is defined as

$$\hat\rho^{\pi_e}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \Big( \omega^{(i)}_{0:t} \big( r^{(i)}_t - \hat Q^{\pi_e}(x^{(i)}_t, a^{(i)}_t; \beta) \big) + \omega^{(i)}_{0:t-1}\, \hat V^{\pi_e}(x^{(i)}_t; \beta) \Big). \qquad (7)$$

Eq. 7 clearly shows that a DR estimator contains both the cumulative importance ratio $\omega$ (IS part) and the model estimates $\hat V^{\pi_e}$ and $\hat Q^{\pi_e}$ (DM part). Note that the IS part of the DR estimator (7) is based on step-wise IS. Thomas & Brunskill (2016) derived a DR estimator whose IS part is based on step-wise WIS. In this paper, we use step-wise IS for the IS part of our DR-based estimators, but our results can be easily extended to other IS estimators. The bias of a DR estimator is the product of that of DM and IS, and thus DR is unbiased whenever either IS or DM is unbiased. This is what the term doubly robust refers to. The bias of the DR estimator (7) is $\big|\mathbb{E}_{P^{\pi_e}_\xi}\big[\sum_{t=0}^{T-1} \gamma^t\, \lambda_{0:t-1}(\xi)\, \delta_t(\xi)\, \Delta(x_t, a_t)\big]\big|$ (see the proofs in Appendix C), and thus it would be zero if either $\Delta(x_t, a_t)$ or $\delta_t(\xi)$ is zero. As discussed in Section 3.2, if $\pi_b$ is known, $\delta_t = 0$ and the DR estimator (7) is unbiased. Throughout this paper, we assume that $\pi_b$ is known, and thus DR is unbiased as long as it uses unbiased variants of IS. However, our proposed estimator, described in Section 4, can be extended to the case in which $\pi_b$ is unknown.
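To make the relationship between the estimators of this section concrete, the sketch below computes the ordinary IS (6), step-wise IS, step-wise WIS, and step-wise DR (7) estimates from a batch of logged trajectories. It assumes a fixed horizon $T$, known behavior probabilities, and user-supplied model estimates `q_hat`/`v_hat` for the DR part; all names are illustrative.

```python
import numpy as np

def ope_estimates(trajs, pi_e, pi_b, gamma, q_hat=None, v_hat=None):
    """trajs: list of (states, actions, rewards) tuples of equal length T,
    logged under pi_b. pi_e(x, a) and pi_b(x, a) return action probabilities.
    q_hat(x, a) and v_hat(x) are DM estimates of Q^{pi_e} and V^{pi_e}."""
    n, T = len(trajs), len(trajs[0][2])
    W = np.empty((n, T))  # W[i, t] = omega^{(i)}_{0:t}
    R = np.empty((n, T))  # R[i, t] = r^{(i)}_t
    for i, (xs, acts, rs) in enumerate(trajs):
        rho = np.array([pi_e(x, a) / pi_b(x, a) for x, a in zip(xs, acts)])
        W[i], R[i] = np.cumprod(rho), rs
    g = gamma ** np.arange(T)

    v_is = np.mean(W[:, -1] * (g * R).sum(axis=1))                   # Eq. (6)
    v_step_is = np.mean((g * W * R).sum(axis=1))                     # step-IS
    v_step_wis = np.mean((g * W / W.mean(axis=0) * R).sum(axis=1))   # step-WIS

    v_dr = None
    if q_hat is not None and v_hat is not None:                      # Eq. (7)
        v_dr = 0.0
        for i, (xs, acts, rs) in enumerate(trajs):
            w_prev = 1.0  # omega_{0:-1} = 1 by convention
            for t in range(T):
                w_cur = W[i, t]
                v_dr += g[t] * (w_cur * (rs[t] - q_hat(xs[t], acts[t]))
                                + w_prev * v_hat(xs[t]))
                w_prev = w_cur
        v_dr /= n
    return v_is, v_step_is, v_step_wis, v_dr
```

The self-normalized weights in the step-WIS line implement the denominator $\sum_j \omega^{(j)}_{0:t}$, which is what trades unbiasedness for lower variance.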
## 4. More Robust Doubly Robust Estimators

In this section, we present our class of more robust doubly robust (MRDR) estimators. The main idea of MRDR is to learn the DM parameter of a DR estimator, $\beta$, by minimizing its variance. In other words, MRDR is a variation of DR with a DM loss function derived from minimizing the DR variance. As mentioned earlier, we assume that the behavior policy $\pi_b$ is known, and thus both the IS (step-IS) and DR estimators are unbiased. This means that our MRDR estimator is also unbiased, and since it is the result of minimizing the DR variance, it has the lowest MSE among all the DR estimators.

### 4.1. MRDR Estimators for Contextual Bandits

Before presenting MRDR for RL, we first formulate it in the contextual bandit setting. We follow the setting of Dudík et al. (2011) and define the DR estimator as

$$\hat\rho^{\pi_e}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\pi_e(a_i|x_i)}{\hat\pi_b(a_i|x_i)} \big( r(x_i, a_i) - \hat Q(x_i, a_i; \beta) \big) + \hat V^{\pi_e}(x_i; \beta) \right], \qquad (8)$$

where $\hat Q(x,a;\beta)$ approximates $Q(x,a) = \mathbb{E}_{P_r}[r(x,a)]$ and $\hat V^{\pi_e}(x;\beta) = \mathbb{E}_{a \sim \pi_e}[\hat Q(x,a;\beta)]$. We further define the DM bias $\Delta(x,a) = \hat Q(x,a;\beta) - Q(x,a)$, and the error in learning the behavior policy $\delta(x,a) = 1 - \lambda(x,a) = 1 - \frac{\pi_b(a|x)}{\hat\pi_b(a|x)}$. Proposition 1 gives the bias and variance of DR for a stochastic evaluation policy $\pi_e$. Note that the results stated in Theorems 1 and 2 of Dudík et al. (2011) are only for deterministic $\pi_e$.

**Proposition 1.** The bias of the DR estimator (8) for stochastic $\pi_e$ is an expectation (over $P_0$, $\pi_b$, and $P_r$) of terms of the form $\delta(x,a)\,\Delta(x,a)$, and its variance is an expectation of terms involving the reward noise $r(x,a) - Q(x,a)$ and $\delta(x,a)\,\Delta(x,a)$; the exact expressions are given in Appendix D.

*Proof.* See Appendix D.

As expected from a DR estimator, Proposition 1 shows that (8) is unbiased if either its DM part is unbiased, $\Delta = 0$, or its IS part is unbiased, $\delta = 0$. When the behavior policy $\pi_b$ is known, and thus $\delta(x,a) = 0$ for all $x$ and $a$, the variance of (8) in Proposition 1 reduces to a simpler expression (9) involving the reward noise $r(x,a) - Q(x,a)$, the importance ratio $\omega(x,a)$, and the squared DM bias $\Delta(x,a)^2$. Unfortunately, the variance formulation (9) is not suitable for our MRDR method, because its derivative w.r.t. $\beta$ contains the term $\Delta(x,a) = \hat Q(x,a) - Q(x,a)$, which cannot be estimated from samples, as the true expected reward $Q$ is unknown. To address this issue, we derive a new formulation of the variance in Theorem 1, whose derivative does not contain such terms.

**Theorem 1.** The variance of the DR estimator (8) for stochastic $\pi_e$ may be written in two equivalent forms: (10) an expansion in terms of $\omega(x,a)$, $r(x,a)$, $\hat Q(x,a;\beta)$, $\hat V^{\pi_e}(x;\beta)$, and $\mathbb{E}_{\pi_e}[r(x,a)]$; and (11) a decomposition into a $\beta$-dependent term

$$J(\beta) = \mathbb{E}_{x \sim P_0}\, \mathbb{E}_{a \sim \pi_b(\cdot|x),\, r} \Big[ \omega(x,a)\; q_\beta(x,a,r)^{\top}\, \Omega_{\pi_b}(x)\, q_\beta(x,a,r) \Big]$$

plus a $\beta$-independent term, where $\Omega_{\pi_b}(x) = \mathrm{diag}\big[1/\pi_b(a|x)\big]_{a \in \mathcal{A}} - \mathbf{e}\mathbf{e}^{\top}$ is a positive semi-definite matrix (see Proposition 6 in Appendix D for the proof) with $\mathbf{e} = [1, \ldots, 1]^{\top}$, and $q_\beta(x,a,r) = D_{\pi_e}(x)\, Q(x;\beta) - I(a)\, r$ is a vector built from $D_{\pi_e}(x) = \mathrm{diag}\big[\pi_e(a|x)\big]_{a \in \mathcal{A}}$, the vector of model predictions $Q(x;\beta) = \big[\hat Q(x,a;\beta)\big]_{a \in \mathcal{A}}$, and the vector of indicator functions $I(a) = \big[\mathbf{1}\{a' = a\}\big]_{a' \in \mathcal{A}}$.

*Proof.* See Appendix D.

The significance of the variance formulations of Theorem 1 is that 1) the variance of the DR estimator has no dependence on the unknown term $\Delta$, and thus its derivative w.r.t. $\beta$ is computable; 2) the expectation in (11) is w.r.t. the data distribution under $\pi_b$, which makes it possible to replace $J(\beta)$ with its unbiased SAA

$$J_n(\beta) = \frac{1}{n} \sum_{i=1}^{n} \omega(x_i, a_i)\, q_\beta(x_i, a_i, r_i)^{\top}\, \Omega_{\pi_b}(x_i)\, q_\beta(x_i, a_i, r_i),$$

where $\mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^{n}$ is the dataset generated by the behavior policy $\pi_b$, such that the optimizer of $J_n(\beta)$ converges to that of $J(\beta)$ almost surely; and 3) $J(\beta)$ in (11) is a convex quadratic function of $q_\beta$, which, in case $\hat Q(x,a;\beta)$ is smooth, makes it possible to efficiently optimize $J_n(\beta)$ with stochastic gradient descent. Moreover, when $\nabla_\beta \hat Q(x,a;\beta)$ can be written explicitly, we can obtain $\hat\beta_n \in \arg\min_\beta J_n(\beta)$ by solving the first-order optimality condition $\sum_{i=1}^{n} \omega(x_i, a_i)\, q_\beta(x_i, a_i, r_i)^{\top}\, \Omega_{\pi_b}(x_i)\, D_{\pi_e}(x_i)\, \nabla_\beta Q(x_i; \beta) = 0$.

In case the evaluation policy is deterministic, the variance $n\, \mathbb{V}(\hat\rho^{\pi_e}_{\mathrm{DR}})$ in (10) becomes, up to $\beta$-independent terms,

$$J(\beta) = \mathbb{E}_{x \sim P_0}\, \mathbb{E}_{a \sim \pi_b(\cdot|x),\, r} \left[ \frac{\mathbf{1}\{\pi_e(x) = a\}}{\pi_b(a|x)} \left( \frac{1}{\pi_b(a|x)} - 1 \right) \big( r(x,a) - \hat Q(x,a;\beta) \big)^2 \right].$$

This form of $J(\beta)$ allows us to find the model parameter of MRDR by solving the WLS

$$\hat\beta_n \in \arg\min_\beta\ J_n(\beta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\pi_e(x_i) = a_i\}\, \frac{1 - \pi_b(a_i|x_i)}{\pi_b(a_i|x_i)^2}\, \big( r(x_i, a_i) - \hat Q(x_i, a_i; \beta) \big)^2. \qquad (12)$$

Comparing this WLS with that of the DM approach in Section 3.1, we note that MRDR changes the weights from $1/\pi_b$ to $(1-\pi_b)/\pi_b^2$, and this way it increases the penalty of the samples whose actions are the same as those suggested by $\pi_e$ but have low probability under $\pi_b$, and decreases the penalty of the rest of the samples.
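For the deterministic-policy case, the only difference between the MRDR fit (12) and the DM fit (5) is the sample weights. A minimal sketch of this weight change follows; the function and variable names are illustrative.

```python
import numpy as np

def dm_weights(actions, eval_actions, pi_b_probs):
    """DM / WLS (5) weights: 1{pi_e(x_i) = a_i} / pi_b(a_i|x_i)."""
    match = (actions == eval_actions).astype(float)
    return match / pi_b_probs

def mrdr_weights(actions, eval_actions, pi_b_probs):
    """MRDR / WLS (12) weights: 1{pi_e(x_i) = a_i} * (1 - pi_b) / pi_b**2."""
    match = (actions == eval_actions).astype(float)
    return match * (1.0 - pi_b_probs) / pi_b_probs ** 2
```

Either weight vector can be passed to the same weighted regression routine (e.g., the `dm_wls_bandit` sketch above, with its weight line swapped); the MRDR weights further up-weight matched actions that were unlikely under $\pi_b$.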
### 4.2. MRDR Estimators for Reinforcement Learning

We now present our MRDR estimator for RL. We begin with the DR estimator for RL given by (7). Similar to the bandit case of Section 4.1, we first derive a formula for the variance of the estimator (7) whose derivative can be easily estimated from trajectories generated by the behavior policy. We then use this variance formulation as the objective function to find the MRDR model parameter.

**Theorem 2.** The variance of the DR estimator in (7), which we refer to as (13), decomposes over time steps $t = 0, \ldots, T-1$ into $\gamma^{2t}$-discounted terms of two types: (i) terms in which $(\omega_{0:t-1})^2$ multiplies the conditional variance $\mathbb{V}_{\mathcal{F}_{t:T-1}}\big( \omega_t \big( \tilde R_{t:T-1} - \hat Q^{\pi_e}(x_t, a_t; \beta) \big) + \hat V^{\pi_e}(x_t; \beta) \big)$ plus a $\beta$-independent term $C_t$, and (ii) terms in which $(\omega_{0:t-1})^2$ multiplies $\mathbb{V}_{\mathcal{F}_{t+1:T-1}}\big( \tilde R_{t:T-1} \mid \mathcal{F}_t \big)$. Here $\mathcal{F}_{t_1:t_2}$ is the filtration induced by the sequence $\{x_{t_1}, a_{t_1}, r_{t_1}, \ldots, x_{t_2}, a_{t_2}, r_{t_2}\} \sim P^{\pi_b}_\xi$, $\tilde R_{t:T-1} = r(x_t, a_t) + \gamma \sum_{\tau=t+1}^{T-1} \gamma^{\tau-(t+1)}\, \omega_{t+1:\tau}\, r(x_\tau, a_\tau)$, and $C_t$ is a $\beta$-independent term.

*Proof.* The proof is by mathematical induction and is reported in Appendix E.

As opposed to the DR variance reported in Jiang & Li (2016), ours in (13) has no dependence on the DM bias $\Delta$, which contains the unknown term $Q^{\pi_e}$, and in addition, all its expectations are over $P^{\pi_b}_\xi$. This allows us to easily compute the MRDR model parameter from the gradient of (13). Let us define $\beta^* \in \arg\min_\beta \mathbb{V}_{P^{\pi_b}_\xi}(\hat\rho^{\pi_e}_{\mathrm{DR}})$ as the minimizer of the DR variance. Using the variance formulation of Theorem 2, and after dropping the $\beta$-independent terms, $\beta^*$ minimizes the sum over $t$ of the terms of type (i) above. Similar to the derivation of (11) for bandits, we can show that

$$\beta^* \in \arg\min_\beta\ J(\beta) = \sum_{t=0}^{T-1} \gamma^{2t}\, \mathbb{E}_{\mathcal{F}_{0:t-1}} \Big[ (\omega_{0:t-1})^2\, \omega_t\; q_\beta(x_t, a_t, \tilde R_{t:T-1})^{\top}\, \Omega_{\pi_b}(x_t)\, q_\beta(x_t, a_t, \tilde R_{t:T-1}) \Big]. \qquad (14)$$

As shown in Proposition 6, $J(\beta)$ is a convex quadratic function of $q_\beta$, which means that if the approximation $\hat Q^{\pi_e}(\cdot, \cdot; \beta)$ is smooth in $\beta$, then this problem can be effectively solved by gradient descent. Since the expectation in (14) is w.r.t. $P^{\pi_b}_\xi$, we may use the trajectories in $\mathcal{D}$ (generated by $\pi_b$), replace $J(\beta)$ with its unbiased SAA $J_n(\beta)$, and solve it for $\beta$, i.e.,

$$\hat\beta_n \in \arg\min_\beta\ J_n(\beta) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^{2t}\, \big(\omega^{(i)}_{0:t-1}\big)^2\, \omega^{(i)}_t\; q_\beta\big(x^{(i)}_t, a^{(i)}_t, \tilde R^{(i)}_{t:T-1}\big)^{\top}\, \Omega_{\pi_b}(x^{(i)}_t)\, q_\beta\big(x^{(i)}_t, a^{(i)}_t, \tilde R^{(i)}_{t:T-1}\big). \qquad (15)$$

Since $J_n(\beta)$ is a strongly consistent estimate of $J(\beta)$, $\hat\beta_n \to \beta^*$ almost surely. If we can explicitly write $\nabla_\beta \hat Q(x,a;\beta)$, then $\hat\beta_n$ is the solution of the equation $0 = \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^{2t}\, \big(\omega^{(i)}_{0:t-1}\big)^2\, \omega^{(i)}_t\; q_\beta\big(x^{(i)}_t, a^{(i)}_t, \tilde R^{(i)}_{t:T-1}\big)^{\top}\, \Omega_{\pi_b}(x^{(i)}_t)\, D_{\pi_e}(x^{(i)}_t)\, \nabla_\beta Q(x^{(i)}_t; \beta)$. In case the evaluation policy is deterministic, we can further simplify $J_n(\beta)$ and derive the model parameter for MRDR by solving the following WLS problem:

$$\hat\beta_n \in \arg\min_\beta\ \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^{2t}\, \big(\omega^{(i)}_{0:t-1}\big)^2\, \omega^{(i)}_t\, \mathbf{1}\{\pi_e(x^{(i)}_t) = a^{(i)}_t\}\, \frac{1 - \pi_b(a^{(i)}_t|x^{(i)}_t)}{\pi_b(a^{(i)}_t|x^{(i)}_t)} \Big( \tilde R^{(i)}_{t:T-1} - \hat Q^{\pi_e}(x^{(i)}_t, a^{(i)}_t; \beta) \Big)^2. \qquad (16)$$

The intuition behind the weights in WLS (16) is 1) to adjust for the difference between the occupancy measures of the behavior and evaluation policies, and 2) to increase the penalty of the policy discrepancy term $\mathbf{1}\{\pi_e(x_t) = a_t\}$.
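To make the deterministic-policy WLS (16) concrete, the sketch below fits a linear model $\hat Q^{\pi_e}(x,a;\beta) = \beta^\top \phi(x,a)$ with the weights $\gamma^{2t} (\omega_{0:t-1})^2\, \omega_t\, \mathbf{1}\{\pi_e(x_t)=a_t\}\, (1-\pi_b)/\pi_b$ as reconstructed above. The feature map `phi`, the helper names, the linear model, and the small ridge term are illustrative assumptions, and the weight expression follows our reading of (16) rather than a verbatim copy of the paper's formula.

```python
import numpy as np

def mrdr_wls_rl(trajs, pi_e_action, pi_b_prob, gamma, phi, dim):
    """Fit Q_hat(x, a; beta) = beta^T phi(x, a) by the MRDR weighted least
    squares for a deterministic evaluation policy (sketch of WLS (16)).

    trajs:        list of (states, actions, rewards) of equal length T, logged under pi_b
    pi_e_action:  callable x -> action chosen by the deterministic evaluation policy
    pi_b_prob:    callable (x, a) -> pi_b(a | x)
    phi:          callable (x, a) -> feature vector of length `dim`
    """
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for xs, acts, rs in trajs:
        rs = np.asarray(rs, dtype=float)
        T = len(rs)
        pb = np.array([pi_b_prob(x, a) for x, a in zip(xs, acts)])
        match = np.array([float(pi_e_action(x) == a) for x, a in zip(xs, acts)])
        rho = match / pb                                   # omega_t for deterministic pi_e
        omega_prev = np.concatenate(([1.0], np.cumprod(rho)[:-1]))  # omega_{0:t-1}
        # Monte Carlo targets: R~_{t:T-1} = sum_{tau>=t} gamma^{tau-t} omega_{t+1:tau} r_tau
        targets = np.empty(T)
        for t in range(T):
            w_future = np.concatenate(([1.0], np.cumprod(rho[t + 1:])))
            targets[t] = np.sum(gamma ** np.arange(T - t) * w_future * rs[t:])
        # MRDR weights: gamma^{2t} (omega_{0:t-1})^2 omega_t 1{pi_e=a_t} (1-pi_b)/pi_b
        w = (gamma ** (2 * np.arange(T)) * omega_prev ** 2 * rho
             * match * (1.0 - pb) / pb)
        for t in range(T):
            if w[t] == 0.0:
                continue
            f = phi(xs[t], acts[t])
            A += w[t] * np.outer(f, f)      # weighted normal equations
            b += w[t] * targets[t] * f
    return np.linalg.solve(A + 1e-8 * np.eye(dim), b)
```

The returned `beta` can be plugged into $\hat Q^{\pi_e}$ and $\hat V^{\pi_e}$ of the DR estimator (7) to obtain the MRDR estimate.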
### 4.3. Other Properties of the MRDR Estimators

**Strong Consistency.** Similar to the analysis in Thomas & Brunskill (2016) for weighted DR, we prove (in Appendix F) that the MRDR estimators are strongly consistent, i.e., $\lim_{n \to \infty} \hat\rho^{\pi_e}(\hat\beta_n) = \rho^{\pi_e}$ almost surely. This implies that MRDR is a well-posed OPE estimator.

**Asymptotic Optimality.** The MRDR estimator, by construction, has the lowest variance among the DR estimators of the form (7). On the other hand, the semi-parametric theory in multivariate regression (Robins et al., 1994) states that, without extra assumptions on the data distribution, the class of unbiased, consistent, and asymptotically normal OPE estimators is asymptotically equivalent to the DR estimators in (7). Utilizing this result, we can show that the MRDR estimators are asymptotically optimal (i.e., have minimum variance) in this class of estimators.

**MRDR Extensions.** Similar to Thomas & Brunskill (2016), we can derive the weighted MRDR estimator by replacing the IS part of the MRDR estimator in (7) with (per-step) weighted importance sampling. This introduces bias, but potentially reduces the variance, and thus the MSE. Throughout the paper, we assumed that the data has been generated by a single behavior policy. We can extend our MRDR results to the case in which there is more than one behavior policy by replacing the IS part of our estimator with fused importance sampling (Peshkin & Shelton, 2002).

## 5. Experiments

In this section, we demonstrate the effectiveness of the proposed MRDR estimator by comparing it with the other state-of-the-art methods from Section 3 on both contextual bandit and RL benchmark problems.

### 5.1. Contextual Bandit

Using the 9 benchmark experiments described in Dudík et al. (2011), we evaluate the OPE algorithms on standard classification datasets from the UCI repository, and follow the same procedure of transforming a classification dataset into a contextual bandit dataset. For the sake of brevity, detailed descriptions of the experimental setup are deferred to the appendix. Given a deterministic policy $\pi$, which is a logistic regression model trained on the classification dataset, we use three methods of transforming it into a stochastic policy. The first one, known as friendly softening, constructs a stochastic policy with the following smoothing procedure: given two constants $\alpha$ and $\beta$, and a uniform (continuous) random variable $u \in [-0.5, 0.5]$, for each $a \in \{1, \ldots, l\}$, whenever $\pi(x) = a$, the stochastic policy $\pi_{\alpha,\beta}(x)$ returns $a$ with probability $\alpha + \beta u$, and it returns $k$, a realization of the uniform (discrete) random variable over $\{1, \ldots, l\} \setminus \{a\}$, with probability $\frac{1 - (\alpha + \beta u)}{l - 1}$. The second one, known as adversarial softening, constructs a stochastic policy $\pi_{\alpha,\beta}(x)$ from policy $\pi$ in a similar fashion: whenever $\pi(x) = a$, $\pi_{\alpha,\beta}(x)$ returns some $k \neq a$ with probability $\alpha + \beta u$, and it returns $k$, a realization of the uniform (discrete) random variable over $\{1, \ldots, l\}$, with probability $\frac{1 - (\alpha + \beta u)}{l}$. The third one, the neutral policy, is a uniformly random policy. We use these methods to construct the behavior and evaluation policies; Table 1 summarizes their specifications.
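As an illustration, the snippet below samples an action from a friendly-softened policy as described above; the symbol names (`alpha`, `beta`) and the per-call draw of `u` are assumptions made for concreteness.

```python
import numpy as np

def friendly_softened_action(x, pi, n_actions, alpha, beta, rng=np.random):
    """Sample an action from the friendly softening pi_{alpha,beta} of a
    deterministic policy pi: return pi(x) with probability alpha + beta*u,
    and a uniformly random other action otherwise (u ~ Uniform[-0.5, 0.5])."""
    a = pi(x)
    u = rng.uniform(-0.5, 0.5)
    p_keep = alpha + beta * u
    if rng.random() < p_keep:
        return a
    others = [k for k in range(n_actions) if k != a]
    return rng.choice(others)
```

Larger `alpha` makes the softened policy closer to the underlying deterministic policy, which is why the "Friendly" behavior policies in Table 1 are easier OPE instances than the "Adversary" ones.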
Here we compare the MRDR method with the direct method (DM), the importance sampling (IS) method, and two doubly robust (DR) estimators. The model parameter of the DM estimator is obtained by solving the SAA of the following problem: $\beta_{\mathrm{DM}} \in \arg\min_\beta \mathbb{E}_{(x,a) \sim P^{\pi_b}}\big[ ( Q^{\pi_e}(x,a) - \hat Q^{\pi_e}(x,a;\beta) )^2 \big]$, which means all samples are weighted according to the data, without consideration of the visitation distribution induced by the evaluation policy. The model parameter of the DR estimator is optimized based on the DM methodology described in (2). Besides the standard DR estimator, we also include another alternative, known as DR0, which heuristically uses the model parameter from the vanilla DM method (called DM0, which assigns uniform weights over samples).

Below are results over the five behavior policies and five algorithms on the benchmark datasets. Due to the page limit, only the results for Vehicle, Sat Image, Pen Digits, and Letter are included in the main paper; see Appendix G for the remaining results. We evaluate the accuracy of the estimation via the root mean squared error (RMSE): $\sqrt{\sum_{j=1}^{N} (\hat\rho^{\pi_e}_j - \rho^{\pi_e})^2 / N}$, where $\hat\rho^{\pi_e}_j$ is the estimated value from the $j$-th dataset. Furthermore, we perform a 95% significance test only on MRDR and DR, with bold numbers indicating that the corresponding method significantly outperforms its counterpart.

In the contextual bandit experiments, it is clear that in most cases the proposed MRDR estimator is statistically significantly superior to all alternative estimators. Similar to the results reported in Dudík et al. (2011), the DM method incurs much higher MSE than the other methods in all of the experiments. This is potentially due to the issue of high bias in model estimation when the sample size is small. In general, the estimation error increases across rows from top to bottom. This is expected due to the increasing difficulty of the OPE tasks, which is accounted for by the increasing mismatch between the behavior and evaluation policies. Although there is no theoretical justification, in most cases the performance of the DR estimator (with the DM method described in Section 3.1) is better than that of DR0. This also illustrates the benefit of optimizing the model parameter based on knowledge of the trajectory distribution $P^{\pi_e}_\xi$ induced by the evaluation policy.

Table 1. Behavior and Evaluation Policies

| Policy | α | β |
|---|---|---|
| Evaluation Policy | 0.9 | 0 |
| Behavior: Friendly I | 0.7 | 0.2 |
| Behavior: Friendly II | 0.5 | 0.2 |
| Behavior: Neutral | - | - |
| Behavior: Adversary I | 0.3 | 0.2 |
| Behavior: Adversary II | 0.5 | 0.2 |

Table 2. Vehicle

| Behavior Policy | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| Friendly I | 0.3273 | 0.0347 | 0.0217 | 0.0202 | 0.0224 |
| Friendly II | 0.3499 | 0.0517 | 0.0331 | 0.0318 | 0.0356 |
| Neutral | 0.4384 | 0.087 | 0.0604 | 0.0549 | 0.0722 |
| Adversary I | 0.405 | 0.0937 | 0.0616 | 0.0516 | 0.0769 |
| Adversary II | 0.405 | 0.1131 | 0.0712 | 0.0602 | 0.0952 |
Table 3. Sat Image

| Behavior Policy | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| Friendly I | 0.2884 | 0.0128 | 0.0071 | 0.0063 | 0.0073 |
| Friendly II | 0.3328 | 0.0191 | 0.0107 | 0.0087 | 0.0119 |
| Neutral | 0.3848 | 0.0413 | 0.0246 | 0.0186 | 0.0335 |
| Adversary I | 0.3963 | 0.0459 | 0.027 | 0.0195 | 0.0383 |
| Adversary II | 0.4093 | 0.0591 | 0.0364 | 0.0262 | 0.0521 |

Table 4. Pen Digits

| Behavior Policy | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| Friendly I | 0.4014 | 0.0103 | 0.0056 | 0.0037 | 0.0059 |
| Friendly II | 0.4628 | 0.0159 | 0.0092 | 0.0056 | 0.0194 |
| Neutral | 0.564 | 0.0450 | 0.0314 | 0.0138 | 0.0412 |
| Adversary I | 0.5861 | 0.0503 | 0.0366 | 0.0172 | 0.0472 |
| Adversary II | 0.5641 | 0.0646 | 0.0444 | 0.0188 | 0.0611 |

Table 5. Letter

| Behavior Policy | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| Friendly I | 0.392 | 0.0074 | 0.0056 | 0.0044 | 0.0057 |
| Friendly II | 0.4146 | 0.0102 | 0.0077 | 0.0054 | 0.0083 |
| Neutral | 0.4713 | 0.0467 | 0.0363 | 0.0315 | 0.0456 |
| Adversary I | 0.46 | 0.0587 | 0.0455 | 0.0385 | 0.0575 |
| Adversary II | 0.4728 | 0.0714 | 0.055 | 0.0481 | 0.0703 |

### 5.2. Reinforcement Learning

In this section, we present the experimental results of OPE in reinforcement learning. We first test the OPE algorithms on the standard domains Model Win, Model Fail, and 4 × 4 Maze, with the behavior and evaluation policies used in Thomas & Brunskill (2016). A schematic diagram of these domains is shown in Figure 1. To demonstrate the scalability of the proposed OPE methods, we also test the OPE algorithms on the following two domains with continuous state space: Mountain Car and Cart Pole. To construct the stochastic behavior and evaluation policies, we first compute the optimal policy using standard RL algorithms such as SARSA and Q-learning. These policies are then constructed by applying friendly softening to the optimal policy with specific values of $(\alpha, \beta)$. For both domains, the evaluation policy is constructed using $(\alpha, \beta) = (0.9, 0.05)$, and the behavior policy is constructed analogously using $(\alpha, \beta) = (0.8, 0.05)$. Detailed explanations of the experimental setups can be found in Appendix G.

In the following experiments, we set the discount factor to $\gamma = 1$. For both the Model Fail and Model Win domains, the number of training trajectories is set to 64; for the Maze, Mountain Car, and Cart Pole domains, this number is set to 1024. The number of trajectories for the sampling-based part of the estimators varies from 32 to 512 for the Model Win, Model Fail, and Cart Pole domains, and from 128 to 2048 for the Maze and Mountain Car domains.

Figure 1. Environments from Thomas & Brunskill (2016). Top left: Model Fail; bottom left: Model Win; right: Maze.

In all of the above experiments, we compare the results of MRDR with the DM, IS, DR, and DR0 estimators by their corresponding MSE values. Similarly, the bold numbers represent cases in which the performance of the MRDR estimator is statistically significantly better than that of the DR estimator. Similar to the contextual bandit setting, except for the Model Win domain, which is known to favor the DM estimator (Thomas & Brunskill, 2016), in most cases the MRDR estimator has significantly lower MSE than the other methods. Furthermore, when we increase the number of evaluation trajectories, the accuracy of all the estimators improves in all the experiments. Similar to the contextual bandit setting, a significant performance improvement can be observed when one switches from DR0 to DR in the RL experiments.
Table 6. Model Fail

| Sample Size | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| 32 | 0.07152 | 1.37601 | 0.18461 | 0.1698 | 1.16084 |
| 64 | 0.07152 | 1.07213 | 0.1314 | 0.11405 | 0.9046 |
| 128 | 0.07152 | 0.752 | 0.09901 | 0.08188 | 0.63571 |
| 256 | 0.07152 | 0.55955 | 0.06565 | 0.05527 | 0.47211 |
| 512 | 0.07152 | 0.39533 | 0.04756 | 0.03819 | 0.33391 |

Table 7. Model Win

| Sample Size | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| 32 | 0.06182 | 0.78452 | 1.55244 | 1.46778 | 1.51858 |
| 64 | 0.06182 | 1.03207 | 1.13856 | 0.98433 | 1.40758 |
| 128 | 0.06182 | 0.90166 | 1.4195 | 1.27891 | 1.52634 |
| 256 | 0.06182 | 0.78507 | 1.03575 | 0.79849 | 1.10332 |
| 512 | 0.06182 | 0.55647 | 0.89655 | 0.66791 | 0.97128 |

Table 8. 4 × 4 Maze

| Sample Size | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| 128 | 1.77598 | 6.68579 | 0.70465 | 0.57042 | 0.70969 |
| 256 | 1.77598 | 3.50346 | 0.69886 | 0.58871 | 0.70211 |
| 512 | 1.77598 | 2.64257 | 0.60124 | 0.58879 | 0.60338 |
| 1024 | 1.77598 | 1.45434 | 0.5201 | 0.4666 | 0.52148 |
| 2048 | 1.77598 | 0.89668 | 0.3932 | 0.31274 | 0.39425 |

Table 9. Mountain Car

| Sample Size | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| 128 | 17.80368 | 23.11318 | 16.14661 | 14.96227 | 19.46953 |
| 256 | 14.62359 | 14.82684 | 13.89212 | 12.48327 | 22.80573 |
| 512 | 13.22012 | 8.26484 | 8.01421 | 7.89474 | 7.96849 |
| 1024 | 10.24318 | 3.26843 | 3.03239 | 3.1359 | 9.16269 |
| 2048 | 10.91577 | 2.50591 | 2.75933 | 2.17138 | 8.25527 |

Table 10. Cart Pole

| Sample Size | DM | IS | DR | MRDR | DR0 |
|---|---|---|---|---|---|
| 32 | 86.81935 | 70.58151 | 12.13028 | 16.45905 | 10.84913 |
| 64 | 87.00547 | 75.86198 | 14.82026 | 14.16847 | 15.69192 |
| 128 | 84.40824 | 77.38233 | 18.55218 | 15.38549 | 19.15905 |
| 256 | 83.31824 | 64.75034 | 9.96921 | 8.36612 | 9.36373 |
| 512 | 84.09259 | 65.72996 | 6.88534 | 4.60712 | 6.9962 |

## 6. Conclusions

In this paper, we proposed the class of more robust doubly robust (MRDR) estimators for off-policy evaluation in RL. In particular, we proposed a principled method to learn the model of the DR estimator that aims at minimizing its variance. Furthermore, we showed that our estimator is consistent and asymptotically optimal in the class of unbiased, consistent, and asymptotically normal estimators. Finally, we demonstrated the effectiveness of our MRDR estimator in bandit and RL benchmark problems. Future work includes extending the MRDR estimator to the cases 1) when there are multiple behavior policies, 2) when the action set has a combinatorial structure, e.g., actions are in the form of slates (Swaminathan et al., 2017), and 3) when the behavior policy is unknown.

## References

Bang, H. and Robins, J. Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962-972, 2005.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR, 14:3207-3260, 2013.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv:1606.01540, 2016.

Cao, W., Tsiatis, A., and Davidian, M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96:723-734, 2009.

Cassel, C., Särndal, C., and Wretman, J. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63:615-620, 1976.

Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 1097-1104, 2011.

Geist, M. and Scherrer, B. Off-policy learning with eligibility traces: A survey. The Journal of Machine Learning Research, 15(1):289-333, 2014.

Gruslys, A., Azar, M., Bellemare, M., and Munos, R. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.
Hanna, J., Stone, P., and Niekum, S. High confidence off-policy evaluation with models. arXiv preprint arXiv:1606.06126, 2016.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pp. 2094-2100, 2016.

Hirano, K., Imbens, G., and Ridder, G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161-1189, 2003.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 652-661, 2016.

Li, L., Chu, W., Langford, J., and Wang, X. Unbiased offline evaluation of contextual bandit-based news article recommendation algorithms. In Proceedings of the 4th International Conference on Web Search and Data Mining, pp. 297-306, 2011.

Li, L., Munos, R., and Szepesvári, Cs. Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 608-616, 2015.

Louizos, C., Shalit, U., Mooij, J., Sontag, D., Zemel, R., and Welling, M. Causal effect inference with deep latent-variable models. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3076-3085, 2017.

Mahmood, A., van Hasselt, H., and Sutton, R. Weighted importance sampling for off-policy learning with linear function approximation. In Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014.

Mandel, T., Liu, Y., Levine, S., Brunskill, E., and Popovic, Z. Off-policy evaluation across representations with applications to educational games. In Proceedings of the 13th International Conference on Autonomous Agents and Multi-agent Systems, pp. 1077-1084, 2014.

Mandel, T., Liu, Y., Brunskill, E., and Popovic, Z. Offline evaluation of online reinforcement learning algorithms. In Proceedings of the 30th Conference on Artificial Intelligence, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054-1062, 2016.

Murphy, S., van der Laan, M., and Robins, J. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410-1423, 2001.

Paduraru, C. Off-policy Evaluation in Markov Decision Processes. PhD thesis, McGill University, 2013.

Peshkin, L. and Shelton, C. Learning from scarce experience. In Proceedings of the 19th International Conference on Machine Learning, pp. 498-505, 2002.

Precup, D., Sutton, R., and Singh, S. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759-766, 2000a.

Precup, D., Sutton, R., and Singh, S. Eligibility traces for off-policy policy evaluation. In ICML, pp. 759-766. Citeseer, 2000b.

Precup, D., Sutton, R., and Dasgupta, S. Off-policy temporal difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, pp. 417-424, 2001.

Robins, J. and Rotnitzky, A. Semi-parametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90:122-129, 1995.

Robins, J., Rotnitzky, A., and Zhao, L. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846-866, 1994.
Shalit, U., Johansson, F., and Sontag, D. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, pp. 3076-3085, 2017.

Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudík, M., Langford, J., Jose, D., and Zitouni, I. Off-policy evaluation for slate recommendation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3635-3645, 2017.

Theocharous, G., Thomas, P., and Ghavamzadeh, M. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 1806-1812, 2015.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2139-2148, 2016.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence off-policy evaluation. In Proceedings of the 29th Conference on Artificial Intelligence, 2015a.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2380-2388, 2015b.