# Statistically Efficient Off-Policy Policy Gradients

Nathan Kallus¹, Masatoshi Uehara²

*Equal contribution. ¹Cornell University, Ithaca, NY, USA. ²Harvard University, Boston, Massachusetts, USA. Correspondence to: Masatoshi Uehara.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Policy gradient methods in reinforcement learning update policy parameters by taking steps in the direction of an estimated gradient of policy value. In this paper, we consider the statistically efficient estimation of policy gradients from off-policy data, where the estimation is particularly non-trivial. We derive the asymptotic lower bound on the feasible mean-squared error in both Markov and non-Markov decision processes and show that existing estimators fail to achieve it in general settings. We propose a meta-algorithm that achieves the lower bound without any parametric assumptions and exhibits a unique 3-way double robustness property. We discuss how to estimate the nuisances that the algorithm relies on. Finally, we establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.

## 1. Introduction

Learning sequential decision policies from observational off-policy data is an important problem in settings where exploration is limited and simulation is unreliable. A key application is reinforcement learning (RL) for healthcare (Gottesman et al., 2019). In such settings, data is limited and it is crucial to use the available data efficiently. Recent advances in off-policy evaluation (Kallus & Uehara, 2019a;b) have shown how efficiently leveraging problem structure, such as Markovianness, can significantly improve off-policy evaluation and tackle such sticky issues as the curse of horizon (Liu et al., 2018). An important next step is to translate these successes in off-policy evaluation to off-policy learning. In this paper we tackle this question by studying the efficient estimation of the policy gradient from off-policy data and the implications of this for learning via estimated-policy-gradient ascent.

Table 1. Comparison of off-policy policy gradient estimators. Here, $f = \Theta(g)$ means $0 < \liminf f/g \le \limsup f/g < \infty$ (not to be confused with the policy parameter space $\Theta$). In the second row, nuisances must be estimated at $n^{-1/2}$-rates, and in the rows below it, nuisances may be estimated at slow non-parametric rates.

| Estimator | MSE | Efficient | Nuisances |
|---|---|---|---|
| REINFORCE, Eq. (4) | $2^{\Theta(H)}\Theta(1/n)$ | no | none |
| PG, Eq. (5) | $2^{\Theta(H)}\Theta(1/n)$ | no | $q$ (parametric) |
| EOPPG (NMDP) | $2^{\Theta(H)}\Theta(1/n)$ | yes (NMDP) | $q,\ d^q$ |
| EOPPG (MDP) | $\Theta(H^4/n)$ | yes (MDP) | $q,\ \mu,\ d^q,\ d^\mu$ |

Policy gradient algorithms (Sutton & Barto, 2018, Chapter 13) enable one to effectively learn complex, flexible policies in potentially non-tabular, non-parametric settings and are therefore very popular in both on-policy and off-policy RL. We begin by describing the problem and our contributions, before reviewing the literature in Section 1.2.

Consider an $(H+1)$-long Markov decision process (MDP) with states $s_t \in \mathcal S_t$, actions $a_t \in \mathcal A_t$, rewards $r_t \in \mathbb R$, initial state distribution $p_0(s_0)$, transition distributions $p_t(s_{t+1} \mid s_t, a_t)$, and reward distributions $p_t(r_t \mid s_t, a_t)$, for $t = 0, \ldots, H$. A policy $(\pi_t(a_t \mid s_t))_{t \le H}$ induces a distribution over trajectories $T = (s_0, a_0, r_0, \ldots, s_H, a_H, r_H, s_{H+1})$:

$$p^\pi(T) = p_0(s_0)\prod_{t=0}^H \pi_t(a_t \mid s_t)\,p_t(r_t \mid s_t, a_t)\,p_t(s_{t+1} \mid s_t, a_t). \tag{1}$$
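To make the data-generating process in Eq. (1) concrete, here is a minimal sketch (in Python) of drawing one trajectory under a generic policy; the `Trajectory` container and the `p0`, `policy`, `reward`, and `transition` callables are illustrative assumptions of the sketch, not objects defined in the paper.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    states: list    # s_0, ..., s_{H+1}
    actions: list   # a_0, ..., a_H
    rewards: list   # r_0, ..., r_H

def sample_trajectory(H, p0, policy, reward, transition, rng):
    """Draw one trajectory T = (s_0, a_0, r_0, ..., s_H, a_H, r_H, s_{H+1})
    from the trajectory distribution p^pi(T) in Eq. (1)."""
    s = p0(rng)
    states, actions, rewards = [s], [], []
    for t in range(H + 1):
        a = policy(t, s, rng)          # a_t ~ pi_t(. | s_t)
        r = reward(t, s, a, rng)       # r_t ~ p_t(. | s_t, a_t)
        s = transition(t, s, a, rng)   # s_{t+1} ~ p_t(. | s_t, a_t)
        actions.append(a); rewards.append(r); states.append(s)
    return Trajectory(states, actions, rewards)
```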
Given a class of policies $\pi^\theta_t(a_t \mid s_t)$ parametrized by $\theta \in \Theta \subseteq \mathbb R^D$, we seek the parameters with greatest average reward, defined as

$$J(\theta) = \mathbb E_{p^{\pi_\theta}}\Big[\sum_{t=0}^H r_t\Big]. \qquad\text{(Policy Value)}$$

A generic approach is to repeatedly move $\theta$ in the direction of the policy gradient (PG), defined as

$$Z(\theta) = \nabla_\theta J(\theta) = \mathbb E_{p^{\pi_\theta}}\Big[\sum_{t=0}^H r_t \sum_{k=0}^t \nabla_\theta \log \pi^\theta_k(a_k \mid s_k)\Big]. \qquad\text{(Policy Gradient)}$$

For example, in the on-policy setting, we can generate trajectories from $\pi_\theta$, $T^{(1)}, \ldots, T^{(n)} \sim p^{\pi_\theta}$, and the (GPOMDP variant of the) REINFORCE algorithm (Baxter & Bartlett, 2001) advances in the direction of the stochastic gradient

$$\hat Z_{\text{on-policy}}(\theta) = \frac1n\sum_{i=1}^n\sum_{t=0}^H r^{(i)}_t \sum_{k=0}^t \nabla_\theta \log \pi^\theta_k(a^{(i)}_k \mid s^{(i)}_k).$$

In the off-policy setting, however, we cannot generate trajectories from any given policy; instead, our data consists only of trajectory observations from one fixed policy,

$$T^{(1)}, \ldots, T^{(n)} \sim p^{\pi_b}, \qquad\text{(Off-policy data)}$$

where $\pi_b$ is known as the behavior policy. With this data, $\hat Z_{\text{on-policy}}(\theta)$ is no longer a stochastic gradient (i.e., it is biased and inconsistent) and we must seek other ways to estimate $Z(\theta)$ in order to make policy gradient updates. This paper addresses the efficient estimation of $Z(\theta)$ from off-policy data and its use in off-policy policy learning. Specifically, our contributions are:

- (Section 2) We calculate the asymptotic lower bound on the minimal-feasible mean-squared error (MSE) in estimating the policy gradient, which is of order $O(H^4/n)$. In addition, we demonstrate that existing off-policy policy gradient approaches fail to achieve this bound and may even have exponential dependence on the horizon.
- (Section 3.1) We propose a meta-algorithm called Efficient Off-Policy Policy Gradient (EOPPG) that achieves this bound without any parametric assumptions. In addition, we prove it enjoys a unique 3-way double robustness property.
- (Section 3.3) We show how to estimate the nuisance functions needed for our meta-algorithm by introducing Bellman equations for the gradients of the q-function and the stationary distribution.
- (Section 4) We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient. Based on the efficiency results for our gradient estimator, we prove the regret's horizon dependence is $H^2$.

### 1.1. Notation and definitions

We define the following variables (note the implicit dependence on $\theta$):

- $g_t = \nabla_\theta \log \pi^\theta_t(a_t \mid s_t)$ (policy score)
- $q_t = \mathbb E_{p^{\pi_\theta}}\big[\sum_{k=t}^H r_k \mid s_t, a_t\big]$ (q-function)
- $v_t = \mathbb E_{p^{\pi_\theta}}\big[\sum_{k=t}^H r_k \mid s_t\big]$ (v-function)
- $\nu_t = \pi^\theta_t(a_t \mid s_t)/\pi^b_t(a_t \mid s_t)$ (density ratio)
- $\nu_{t':t} = \prod_{k=t'}^t \nu_k$ (cumulative density ratio)
- $\mu'_t = p^{\pi_\theta}(s_t)/p^{\pi_b}(s_t)$ (marginal state density ratio)
- $\mu_t = \mu'_t\,\nu_t$ (marginal state-action density ratio)
- $d^q_t = \nabla_\theta q_t$, $d^v_t = \nabla_\theta v_t$, $d^\mu_t = \nabla_\theta \mu_t$, $d^\nu_t = \nabla_\theta \nu_{0:t}$.

Note that all of the above are simply functions of the trajectory, $T$, and of $\theta$. To make this explicit, we sometimes write, for example, $q_t = q_t(s_t, a_t)$ and refer to $q_t$ as a function. Similarly, when we estimate this function by $\hat q_t$ we also refer to $\hat q_t$ as the random variable gotten by evaluating it on the trajectory, $\hat q_t(s_t, a_t)$. We write $a \lesssim b$ to mean that there exists a universal constant $C$ satisfying $a \le Cb$, such as a number like 5 that doesn't depend on any instance-specific parameters. We let $\|\cdot\|_2$ denote the Euclidean vector norm and $\|\cdot\|_{\mathrm{op}}$ the matrix operator norm. All expectations, variances, and probabilities without subscripts are understood to be with respect to $p^{\pi_b}$.
Given a vector-valued function of the trajectory, $f$, we define its $L_2$ norm as $\|f\|^2_{L_2^b} = \mathbb E\,\|f(T)\|_2^2$. Further, given trajectory data $T^{(1)}, \ldots, T^{(n)}$, we define the empirical expectation as $\mathbb E_n f = \mathbb E_n[f(T)] = \frac1n\sum_{i=1}^n f(T^{(i)})$.

MDP and NMDP. Throughout this paper, we focus on the MDP setting where the trajectory distribution $p^\pi$ is given by Eq. (1). For completeness, we also consider relaxing the Markov assumption, yielding a non-Markov decision process (NMDP), where the trajectory distribution $p^\pi(T)$ is
$$p_0(s_0)\prod_{t=0}^H \pi_t(a_t \mid H_{s_t})\,p_t(r_t \mid H_{a_t})\,p_t(s_{t+1} \mid H_{a_t}),$$
where $H_{a_t} = (s_0, a_0, \ldots, a_t)$ and $H_{s_t} = (s_0, a_0, \ldots, s_t)$.

Assumptions. Throughout we assume that, for all $t \le H$,
$$0 \le r_t \le R_{\max}, \qquad \|g_t\|_{\mathrm{op}} \le G_{\max}, \qquad \nu_t \le C_1, \qquad \mu'_t \le C'_2,$$
and we define $C_2 = C_1 C'_2$ so that $\mu_t \le C_2$. The bounds on $\nu_t, \mu'_t$ are often called (sequential) overlap or concentrability conditions. The bound on $g_t$ is used to bound the variance of the policy gradient; a similar assumption is made in Agarwal et al. (2019). The above uniform bounds may also be replaced with bounds on second moments at the cost of stronger conditions on nuisance-estimate convergence in, e.g., Theorem 7; we omit these details to focus on the more common $L_2$ mode of estimate convergence and uniform bounds on rewards, density ratios, and policy scores.

### 1.2. Related literature

#### 1.2.1. Off-policy policy gradients

A standard approach to dealing with off-policy data is to correct the policy gradient equation by importance sampling (IS) with the cumulative density ratios $\nu_{0:t}$ (see, e.g., Papini et al., 2018, Appendix A; Hanna & Stone, 2018). This allows us to rewrite the policy gradient $Z(\theta)$ as an expectation over $p^{\pi_b}$ and then estimate it using the corresponding empirical expectation. The off-policy version of the classic REINFORCE algorithm (Williams, 1992) recognizes
$$Z(\theta) = \mathbb E\Big[\nu_{0:H}\sum_{t=0}^H r_t\sum_{t=0}^H g_t\Big] \tag{2}$$
(recall that $\mathbb E$ is understood as $\mathbb E_{p^{\pi_b}}$) and uses the estimated policy gradient given by replacing $\mathbb E$ with $\mathbb E_n$. The GPOMDP variant (Baxter & Bartlett, 2001) refines this by
$$Z(\theta) = \mathbb E\Big[\nu_{0:H}\sum_{t=0}^H r_t\sum_{s=0}^t g_s\Big], \tag{3}$$
whose empirical version ($\mathbb E_n$) has less variance and is therefore preferred. A further refinement is given by step-wise IS (Precup et al., 2000) as in Deisenroth et al. (2013):
$$Z(\theta) = \mathbb E\Big[\sum_{t=0}^H \nu_{0:t}\,r_t\sum_{s=0}^t g_s\Big]. \tag{4}$$
Following Degris et al. (2012), Chen et al. (2019) replace $\nu_{0:t}$ with $\nu_t$ in Eq. (4) to reduce variance, but this is an approximation that incurs non-vanishing bias. By exchanging the order of summation in Eq. (4) and recognizing $q_t = \mathbb E\big[\sum_{j=t}^H \nu_{t+1:j} r_j \mid s_t, a_t\big]$, we obtain a policy gradient in terms of the q-function (Sutton et al., 1998),
$$Z(\theta) = \mathbb E\Big[\sum_{t=0}^H \nu_{0:t}\,g_t\,q_t\Big]. \tag{5}$$
The off-policy policy gradient (Off-PAC) of Degris et al. (2012) is obtained by replacing $\nu_{0:t}$ with $\nu_t$ in Eq. (5), estimating $q_t$ by $\hat q_t$ and plugging it in, and taking the empirical expectation. Replacing $\nu_{0:t}$ with $\nu_t$ is intended to reduce variance, but it is an approximation that ignores the state distribution mismatch (essentially, $\mu'_t$) and incurs non-vanishing bias. Since it amounts to a reweighting and the unconstrained optimal policy remains optimal under any input distribution, in the tabular and fully unconstrained case considered in Degris et al. (2012) we may still converge; but this fails in the general non-parametric, non-tabular setting. We therefore focus only on consistent estimates of $Z(\theta)$ in this paper (which requires zero or vanishing bias).
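As an illustration of the estimators just discussed, the following is a minimal sketch of the empirical step-wise IS policy gradient of Eq. (4), i.e., $\mathbb E_n\big[\sum_t \nu_{0:t}\,r_t\sum_{s\le t} g_s\big]$, assuming a known behavior policy; the trajectory format and the `score` and `density_ratio` callables are assumptions of the sketch.

```python
import numpy as np

def stepwise_is_policy_gradient(trajectories, score, density_ratio):
    """Empirical version of Eq. (4): E_n[ sum_t nu_{0:t} r_t sum_{s<=t} g_s ].

    trajectories           : list of (states, actions, rewards), each H+1 steps long
    score(t, s, a)         : grad_theta log pi^theta_t(a | s), shape (D,)
    density_ratio(t, s, a) : nu_t = pi^theta_t(a | s) / pi^b_t(a | s)
    """
    grads = []
    for states, actions, rewards in trajectories:
        H = len(actions) - 1
        grad = 0.0
        nu_cum = 1.0          # nu_{0:t}, built up step by step
        score_cum = 0.0       # sum_{s<=t} g_s
        for t in range(H + 1):
            nu_cum *= density_ratio(t, states[t], actions[t])
            score_cum = score_cum + score(t, states[t], actions[t])
            grad = grad + nu_cum * rewards[t] * score_cum
        grads.append(grad)
    return np.mean(grads, axis=0)
```

The same skeleton gives Eqs. (2), (3), and (5) by changing which cumulative ratio and which sum of scores multiplies each reward (or by replacing rewards with a q-function estimate).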
Many existing off-policy RL algorithms, including DDPG (Silver et al., 2014) and Off-PAC with emphatic weightings (Imani et al., 2018), also use the above trick, i.e., ignoring the state distribution mismatch. Various recent work deals with this problem (Dai et al., 2019; Liu et al., 2019; Tosatto et al., 2020). These, however, assume the existence of a stationary distribution and are not efficient. We do not assume the existence of a stationary distribution, since many RL problems have a finite horizon and/or do not have a stationary distribution. Moreover, our gradient estimates are efficient in that they achieve the MSE lower bound among all regular estimators.

#### 1.2.2. Other literature

Online off-policy PG. Online policy gradients have shown marked success in the last few years (Schulman et al., 2015). Various work has investigated incorporating offline information into online policy gradients (Gu et al., 2017; Metelli et al., 2018). Compared with this setting, our setting is completely off-policy, with no opportunity to collect new data from arbitrary policies, as considered in, e.g., Athey & Wager (2017); Kallus (2018); Kallus & Zhou (2018); Swaminathan & Joachims (2015) for $H = 0$ and Chen et al. (2019); Fujimoto et al. (2019) for $H \ge 1$.

Variance reduction in PG. Variance reduction has been a central topic for PG (Greensmith et al., 2004; Schulman et al., 2016; Tang & Abbeel, 2010; Wu et al., 2018). These papers generally consider a given class of estimators defined by an explicit formula (such as all possible baselines) and show that some estimator is optimal within that class. In our work, the class of estimators among which we are optimal is all regular estimators, which is both extremely general and also provides minimax bounds in any vanishing neighborhood of $p^{\pi_b}$ (van der Vaart, 1998, Thm. 25.21).

Off-policy evaluation (OPE). OPE is the problem of estimating $J(\theta)$ for a given $\theta$ from off-policy data. Step-wise IS (Precup et al., 2000) and direct estimation of q-functions (Munos & Szepesvári, 2008) are two common approaches to OPE. However, the former is known to suffer from high variance and the latter from model misspecification. To alleviate this, the doubly robust estimate combines the two; however, its asymptotic MSE still explodes exponentially in the horizon, like $\Omega(C_1^H H^2/n)$ (Jiang & Li, 2016; Thomas & Brunskill, 2016). Kallus & Uehara (2019b) show that the efficiency bound in the MDP case is actually polynomial in $H$ and give an estimator achieving it, which combines marginalized IS (Xie et al., 2019) and q-modeling using cross-fold estimation. This achieves MSE $O(C_2 H^2/n)$. Kallus & Uehara (2019a) further study efficient OPE in the infinite-horizon MDP setting with non-iid batch data.

Offline policy learning. There are many types of methods for offline policy learning (batch RL), such as fitted Q-iteration (Munos & Szepesvári, 2008), Bellman residual minimization (Antos et al., 2008), and minimax learning (Chen & Jiang, 2019). We focus on the policy gradient approach since it can be applied very generally, in particular when actions are continuous. As far as we know, there are few studies of offline policy gradients with regret guarantees as in our Section 4.

## 2. Efficiency Bound for Estimating Z(θ)

Our target estimand is $Z(\theta)$, so a natural question is: what is the least possible error we can achieve in estimating it?
In parametric models, the Cramér–Rao bound lower bounds the variance of all unbiased estimators and, due to Hájek (1970), also the asymptotic MSE of all (regular) estimators. Our model, however, is nonparametric, as it consists of all MDP distributions, i.e., any choice of $p_0(s_0)$, $p_t(s_{t+1} \mid s_t, a_t)$, $p_t(r_t \mid s_t, a_t)$, and $\pi_t(a_t \mid s_t)$ in Eq. (1). Semiparametric theory gives an answer to this question. We first informally state the key property of the efficient influence function (EIF) from semiparametric theory. The EIF is a function defined in terms of only the estimand (a map from data generating distribution to quantity of interest), the model (the set of possible data generating distributions), and the instance in the model (the specific unknown data generating distribution). It provides a lower bound on the feasible asymptotic MSE in estimating the estimand, which is sometimes achievable. We will then show how to achieve the corresponding bound in the next section. We present the key property in terms of our own model, which is all MDP distributions, and our estimand, which is $Z(\theta)$. For additional detail, see Appendix B and Tsiatis (2006); van der Vaart (1998).

Theorem 1 (Informal description of van der Vaart (1998), Theorem 25.20). The EIF $\xi_{\mathrm{MDP}}(T; p^{\pi_b})$ satisfies that for any MDP distribution $p^{\pi_b}$ and any regular estimator $\hat Z(\theta)$,
$$\inf_{\|v\|_2 \le 1} v^\top\big(\mathrm{AMSE}[\hat Z(\theta)] - \mathrm{var}[\xi_{\mathrm{MDP}}]\big)v \ge 0,$$
where $\mathrm{AMSE}[\hat Z(\theta)] = \int z z^\top\,\mathrm d F(z)$ is the second moment of $F$, the limiting distribution of $\sqrt n(\hat Z(\theta) - Z(\theta))$. This also implies $\|\mathrm{AMSE}[\hat Z(\theta)]\|_{\mathrm{op}} \ge \|\mathrm{var}[\xi_{\mathrm{MDP}}]\|_{\mathrm{op}}$.

Here, $\mathrm{var}[\xi_{\mathrm{MDP}}]$ is called the efficiency bound (note it is a covariance matrix). Estimators satisfying $\mathrm{AMSE}[\hat Z(\theta)] = \mathrm{var}[\xi_{\mathrm{MDP}}]$ are called efficient. A regular estimator is one whose limiting distribution is insensitive to small changes of order $O(1/\sqrt n)$ to $p^{\pi_b}$ that keep it an MDP distribution (see van der Vaart, 1998, Chapter 25). So the above provides a lower bound on the variance of all regular estimators, which is a very general class. It is so general that the bound also applies to all estimators whatsoever in a local asymptotic minimax sense (see van der Vaart, 1998, Theorem 25.21).

Technically, we first need to prove that the EIF $\xi_{\mathrm{MDP}}$ exists in order to obtain the bound in Theorem 1. The following result does so and derives it explicitly (in terms of unknown nuisance functions). The result after does the same in the NMDP model. (Note that, while by the usual convention the EIF refers to a function with mean 0, we instead let the EIF have mean $Z(\theta)$ everywhere, as it simplifies the presentation. Since adding a constant does not change the variance, Theorem 1 is unchanged.)

Theorem 2. The EIF of $Z(\theta)$ under MDP, $\xi_{\mathrm{MDP}}$, exists and is equal to
$$\sum_{j=0}^H\Big(d^\mu_j\,(r_j - q_j) - \mu_j\,d^q_j + \mu_{j-1}\,d^v_j + d^\mu_{j-1}\,v_j\Big),$$
where $\mu_{-1} = 1$, $d^\mu_{-1} = 0$. And, in particular, $\|\mathrm{var}[\xi_{\mathrm{MDP}}]\|_{\mathrm{op}} \le C_2 R_{\max}^2 G_{\max}(H+1)^2(H+2)^2/4$.

Theorem 3. The EIF of $Z(\theta)$ under NMDP, $\xi_{\mathrm{NMDP}}$, exists and is equal to
$$\sum_{j=0}^H\Big(d^\nu_j\,(r_j - q_j) - \nu_{0:j}\,d^q_j + \nu_{0:j-1}\,d^v_j + d^\nu_{j-1}\,v_j\Big),$$
where $\nu_{0:-1} = 1$, $d^\nu_{-1} = 0$. (Note that here $\nu_{0:j}, d^\nu_j, q_j, d^q_j$ are actually functions of all of $H_{a_j}$ and not just of $(s_j, a_j)$ as in the MDP case in Theorem 2.) And, in particular, $\|\mathrm{var}[\xi_{\mathrm{NMDP}}]\|_{\mathrm{op}} \le C_1^H R_{\max}^2 G_{\max}(H+1)^2(H+2)^2/4$.

Roughly speaking, the EIFs of $Z(\theta)$ in Theorems 2 and 3 are derived by differentiating the EIFs of $J(\theta)$ obtained in Kallus & Uehara (2019b) with respect to $\theta$, since we have the relation $Z(\theta) = \nabla_\theta J(\theta)$. That is why $d^\mu_j, d^q_j$ appear in addition to $\mu_j, q_j$ in the EIFs above.
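To spell out this "roughly speaking" step in the MDP case, here is a sketch of the calculation: differentiate the OPE efficient influence function $v_0 + \sum_j \mu_j(r_j - q_j + v_{j+1})$ of Kallus & Uehara (2019b) term by term and then reindex, under the conventions $v_{H+1} = d^v_{H+1} = 0$, $\mu_{-1} = 1$, $d^\mu_{-1} = 0$.

```latex
\begin{aligned}
\xi_{\mathrm{MDP}}
 &= \nabla_\theta\Big\{v_0 + \sum_{j=0}^{H}\mu_j\,(r_j - q_j + v_{j+1})\Big\} \\
 &= d^v_0 + \sum_{j=0}^{H}\Big[d^\mu_j\,(r_j - q_j + v_{j+1})
      + \mu_j\big(-d^q_j + d^v_{j+1}\big)\Big] \\
 &= \sum_{j=0}^{H}\Big(d^\mu_j\,(r_j - q_j) - \mu_j\,d^q_j
      + \mu_{j-1}\,d^v_j + d^\mu_{j-1}\,v_j\Big).
\end{aligned}
```

The last equality shifts the index on the $d^\mu_j v_{j+1}$ and $\mu_j d^v_{j+1}$ terms from $j+1$ to $j$, whereupon the boundary term $d^v_0$ cancels and the expression in Theorem 2 results; the NMDP formula follows the same way with $\nu_{0:j}$ in place of $\mu_j$.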
In fact,
$$\xi_{\mathrm{MDP}} = \nabla_\theta\Big\{v_0 + \sum_{j=0}^H \mu_j\,(r_j - q_j + v_{j+1})\Big\}, \qquad \xi_{\mathrm{NMDP}} = \nabla_\theta\Big\{v_0 + \sum_{j=0}^H \nu_{0:j}\,(r_j - q_j + v_{j+1})\Big\}.$$
Formulae for $\mathrm{var}[\xi_{\mathrm{MDP}}]$ and $\mathrm{var}[\xi_{\mathrm{NMDP}}]$ are given in Appendix C.

Theorem 3 showed that $\mathrm{var}[\xi_{\mathrm{NMDP}}]$ is at most exponential; we next show it is also at least exponential.

Theorem 4. Suppose that $\nu_t \ge C_3$ and that $\mathrm{var}\big[(\sum_h g_h)(r_H - q_H) \mid H_{a_H}\big] \succeq c\,I$. Then $\|\mathrm{var}[\xi_{\mathrm{NMDP}}]\|_{\mathrm{op}} \ge C_3^{2H}\,c$.

Theorems 3 and 4 show that the curse of horizon is generally unavoidable in NMDP, since the lower bound is at least exponential in the horizon. On the other hand, Theorem 2 shows there is a possibility of avoiding the curse of horizon in MDP, in the sense that the lower bound is at most polynomial in the horizon as long as $C_2$ remains bounded as $H$ grows. This is true, for example, if $p^{\pi_b}(s_t)$ converges in distribution, which will necessarily occur if the MDP is ergodic.

We next show that REINFORCE necessarily suffers from the curse of horizon.

Theorem 5. The MSE of step-wise REINFORCE, Eq. (4), is
$$\sum_{k=0}^{H+1}\mathbb E\Big[\nu_{0:k-1}^2\,\mathrm{var}\Big[\mathbb E\Big[\sum_{t=k-1}^{H}\nu_{k:t}\,r_t\sum_{s=k-1}^{t} g_s \,\Big|\, H_{a_k}\Big]\,\Big|\,H_{a_{k-1}}\Big]\Big],$$
which is no smaller than the MSE of REINFORCE, Eq. (2), and GPOMDP-REINFORCE, Eq. (3). (Equation references refer to the estimate given by replacing $\mathbb E$ with $\mathbb E_n$.)

Theorem 6. Suppose that $\nu_t \ge C_3$ and that $\mathrm{var}[r_H g_H \mid H_{a_H}] \succeq c\,I$. Then the operator norm of the variance of step-wise REINFORCE is lower bounded by $c\,C_3^{2H}/n$.

## 3. Efficient Policy Gradient Estimation

In this section we develop an estimator, EOPPG, for $Z(\theta)$ achieving the lower bound in Theorem 2 under weak nonparametric rate assumptions.

### 3.1. The Meta-Algorithm

Having derived the EIF of $Z(\theta)$ under MDP in Theorem 2, we use a meta-algorithm based on estimating the unknown functions (aka nuisances) $\mu_j, d^q_j, q_j, d^\mu_j$ and plugging them into $\xi_{\mathrm{MDP}}$, as described in Algorithm 1, which we call the Efficient Off-Policy Policy Gradient (EOPPG). In particular, we use a cross-fitting technique (Chernozhukov et al., 2018; van der Vaart, 1998) to avoid technical smoothness conditions on our nuisance estimates. We refer to this as a meta-algorithm as it relies on given nuisance estimators; we show how to construct these in Section 3.3.

Algorithm 1 (Efficient Off-Policy Policy Gradient).
- Take a $K$-fold random partition $(I_k)_{k=1}^K$ of the observation indices $\{1, \ldots, n\}$ such that the size of each fold, $|I_k|$, is within 1 of $n/K$. Let $L_k = \{T^{(i)} : i \in I_k\}$, $U_k = \{T^{(i)} : i \notin I_k\}$.
- For $k \in \{1, \ldots, K\}$:
  - Using only $L_k$ as data, construct nuisance estimators $\hat q^{(k)}_j, \hat\mu^{(k)}_j, \hat d^{q(k)}_j, \hat d^{\mu(k)}_j$ for $j \le H$ (see Section 3.3).
  - By integrating/summing with respect to $a_j \sim \pi^\theta_j(a_j \mid s_j)$, set
    $$\hat v_j(s_j) = \mathbb E_{\pi^\theta_j}[\hat q_j \mid s_j], \qquad \hat d^v_j(s_j) = \mathbb E_{\pi^\theta_j}[\hat d^q_j + \hat q_j g_j \mid s_j]. \tag{6}$$
  - Construct an intermediate estimate
    $$\hat Z_k(\theta) = \mathbb E_{U_k}\Big[\sum_{j=0}^H\Big(\hat d^{\mu(k)}_j(r_j - \hat q^{(k)}_j) - \hat\mu^{(k)}_j\,\hat d^{q(k)}_j + \hat\mu^{(k)}_{j-1}\,\hat d^{v(k)}_j + \hat d^{\mu(k)}_{j-1}\,\hat v^{(k)}_j\Big)\Big],$$
    where $\mathbb E_{U_k}$ is the empirical expectation over $U_k$.
- Return $\hat Z_{\mathrm{EOPPG}}(\theta) = \frac1K\sum_{k=1}^K \hat Z_k(\theta)$.

Note that Eq. (6) in Algorithm 1 is computed simply by taking an integral over $a_j$ (or a sum, for finite actions) with respect to the known measure (or mass function) $\pi^\theta_j(a_j \mid s_j)$.

We next prove that EOPPG achieves the efficiency bound under MDP and enjoys a 3-way double robustness (see Fig. 1). We require the following about our nuisance estimators, which arises from the boundedness assumed in Section 1.1.

Assumption 1. For all $k \le K$, $j \le H$, we have $0 \le \hat q^{(k)}_j \le R_{\max}(H + 1 - j)$, $\hat\mu^{(k)}_j \le C_2$, and $\|\hat d^{q(k)}_j\|_{\mathrm{op}}, \|\hat d^{\mu(k)}_j\|_{\mathrm{op}} \le C_4$.
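As a concrete illustration of Algorithm 1, here is a minimal sketch of the cross-fitted EOPPG estimate for the case of finitely many actions (so the integral in Eq. (6) is a sum) and a known behavior policy; the data format and the `fit_nuisances`, `pi_theta`, and `score` callables are assumptions of the sketch, not the authors' implementation.

```python
import numpy as np

def eoppg(trajectories, fit_nuisances, pi_theta, score, K=2, seed=0):
    """Cross-fitted EOPPG estimate of Z(theta) (Algorithm 1), finite actions.

    trajectories      : list of (states, actions, rewards), each H+1 steps long
    fit_nuisances(L)  : given a list of trajectories, returns callables
                        (q, mu, dq, dmu), each taking (j, s, a); dq, dmu return (D,)
    pi_theta(j, s, a) : target-policy probability of action a in state s at time j
    score(j, s, a)    : grad_theta log pi^theta_j(a | s), shape (D,)
    """
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    folds = np.array_split(rng.permutation(n), K)
    action_set = sorted({a for (_, acts, _) in trajectories for a in acts})
    Z_hats = []
    for fold in folds:
        in_fold = set(fold.tolist())
        L = [trajectories[i] for i in in_fold]                        # fit nuisances
        U = [trajectories[i] for i in range(n) if i not in in_fold]   # evaluate
        q, mu, dq, dmu = fit_nuisances(L)

        def v(j, s):    # Eq. (6): v_j(s) = E_{pi^theta}[ q_j | s ]
            return sum(pi_theta(j, s, a) * q(j, s, a) for a in action_set)

        def dv(j, s):   # Eq. (6): d^v_j(s) = E_{pi^theta}[ d^q_j + q_j g_j | s ]
            return sum(pi_theta(j, s, a) * (dq(j, s, a) + q(j, s, a) * score(j, s, a))
                       for a in action_set)

        psi = []
        for states, acts, rews in U:
            H = len(acts) - 1
            total = 0.0
            mu_prev, dmu_prev = 1.0, 0.0      # mu_{-1} = 1, d^mu_{-1} = 0
            for j in range(H + 1):
                s, a, r = states[j], acts[j], rews[j]
                total = (total
                         + dmu(j, s, a) * (r - q(j, s, a))
                         - mu(j, s, a) * dq(j, s, a)
                         + mu_prev * dv(j, s)
                         + dmu_prev * v(j, s))
                mu_prev, dmu_prev = mu(j, s, a), dmu(j, s, a)
            psi.append(total)
        Z_hats.append(np.mean(psi, axis=0))
    return np.mean(Z_hats, axis=0)
```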
Figure 1. Doubly robust and efficient structure of EOPPG. Every circle represents the event that the corresponding nuisance is well-specified. The cyan-shaded region represents the event that $\hat Z_{\mathrm{EOPPG}}(\theta)$ is consistent. The red-shaded region represents the event that $\hat Z_{\mathrm{EOPPG}}(\theta)$ is efficient (when nuisances are consistently estimated non-parametrically at slow rates).

Theorem 7 (Efficiency). Suppose that for all $k \le K$, $j \le H$,
$$\|\hat\mu^{(k)}_j - \mu_j\|_{L_2^b} = o_p(n^{-\alpha_1}), \quad \|\hat d^{\mu(k)}_j - d^\mu_j\|_{L_2^b} = o_p(n^{-\alpha_2}), \quad \|\hat q^{(k)}_j - q_j\|_{L_2^b} = o_p(n^{-\alpha_3}), \quad \|\hat d^{q(k)}_j - d^q_j\|_{L_2^b} = o_p(n^{-\alpha_4}),$$
with $\min(\alpha_1, \alpha_2) + \min(\alpha_3, \alpha_4) \ge 1/2$ and $\alpha_1, \alpha_2, \alpha_3, \alpha_4 > 0$. Then
$$\sqrt n\,\big(\hat Z_{\mathrm{EOPPG}}(\theta) - Z(\theta)\big) \xrightarrow{d} \mathcal N\big(0,\ \mathrm{var}[\xi_{\mathrm{MDP}}]\big).$$

An important feature of Theorem 7 is that the required nuisance convergence rates are nonparametric (slower than $n^{-1/2}$) and no metric entropy condition (e.g., Donsker) is needed. In particular, the result does not depend on the particular nuisance estimates used, and we experience no variance inflation due to plugging in estimates instead of true nuisances. While usually we can expect inflation due to nuisance variance (e.g., PG, Eq. (5), generally converges slower than the parametric $n^{-1/2}$ rate if we use an estimate $\hat q$ with a nonparametric rate), we avoid this due to the special doubly robust structure of $\xi_{\mathrm{MDP}}$, which renders our estimate insensitive to the way the nuisances are estimated.

To establish this doubly robust structure, in the key step of the proof we show that $\hat Z_{\mathrm{EOPPG}}(\theta)$ is equal to
$$\mathbb E_n[\xi_{\mathrm{MDP}}] + K^{-1}\sum_{k=1}^K\sum_{j=0}^H \mathrm{Bias}_{k,j} + o_p(n^{-1/2}), \tag{7}$$
where $\|\mathrm{Bias}_{k,j}\|_2$ is upper bounded, up to a constant, by
$$\|\hat\mu^{(k)}_j - \mu_j\|_{L_2^b}\,\|\hat d^{q(k)}_j - d^q_j\|_{L_2^b} + \|\hat d^{\mu(k)}_j - d^\mu_j\|_{L_2^b}\,\|\hat q^{(k)}_j - q_j\|_{L_2^b} + \|\hat\mu^{(k)}_{j-1} - \mu_{j-1}\|_{L_2^b}\,\|\hat d^{v(k)}_j - d^v_j\|_{L_2^b} + \|\hat d^{\mu(k)}_{j-1} - d^\mu_{j-1}\|_{L_2^b}\,\|\hat v^{(k)}_j - v_j\|_{L_2^b}. \tag{8}$$
After establishing this key result, Eqs. (7) and (8), Theorem 7 follows by showing that the bias term is $o_p(n^{-1/2})$ and applying the CLT. These asymptotic results can also be extended to finite-sample results by assuming finite-sample bounds on the nuisance estimates, following Kallus & Uehara (2019b). We also obtain the following from Eq. (7) via the LLN.

Theorem 8 (3-way double robustness). Suppose that for all $k \le K$, $j \le H$, $\|\hat\mu^{(k)}_j - \mu^\dagger_j\|_{L_2^b}$, $\|\hat d^{q(k)}_j - d^{q\dagger}_j\|_{L_2^b}$, $\|\hat q^{(k)}_j - q^\dagger_j\|_{L_2^b}$, $\|\hat d^{\mu(k)}_j - d^{\mu\dagger}_j\|_{L_2^b}$ all converge to 0 in probability. Then $\hat Z_{\mathrm{EOPPG}}(\theta) \xrightarrow{p} Z(\theta)$ if one of the following holds: $\mu^\dagger_j = \mu_j$ and $d^{\mu\dagger}_j = d^\mu_j$; or $q^\dagger_j = q_j$ and $d^{q\dagger}_j = d^q_j$; or $\mu^\dagger_j = \mu_j$ and $q^\dagger_j = q_j$.

That is, as long as (a) $\hat\mu, \hat d^\mu$ are correct, (b) $\hat q, \hat d^q$ are correct, or (c) $\hat\mu, \hat q$ are correct, EOPPG is still consistent. (Here the daggered quantities denote the probability limits of the corresponding nuisance estimates.) The reason the estimator is not consistent when only $\hat d^q, \hat d^\mu$ are correct is that $\hat d^v$ is constructed using both $\hat q$ and $\hat d^q$ (see Eq. (6)). Commonly, double robustness refers to a situation with two nuisances where an estimator is consistent as long as either nuisance estimate is consistent (Rotnitzky & Vansteelandt, 2014). In doubly robust OPE in MDPs, these nuisances are $\mu$ and $q$ (Kallus & Uehara, 2019b). Here, for policy gradient estimation, we have four nuisances and a new 3-way double robustness, wherein there are three pairs of nuisances of which only one pair need be consistently estimated to make the final estimator consistent.

### 3.2. Special Cases

Example 1 (On-policy case). If $\pi_b = \pi_\theta$, then
$$\xi_{\mathrm{NMDP}} = \sum_{j=0}^H\Big(\Big(\sum_{i=j}^H (r_i + v_{i+1} - q_i)\Big)g_j + d^v_j - d^q_j\Big), \qquad \xi_{\mathrm{MDP}} = \sum_{j=0}^H\Big(d^\mu_j\,(r_j - q_j) - d^q_j + d^v_j + d^\mu_{j-1}\,v_j\Big),$$
where $d^\mu_j = \mathbb E\big[\sum_{i=0}^j g_i(a_i \mid s_i) \mid s_j, a_j\big]$. (Recall that $q_j, d^q_j$ are functions of $H_{a_j}$ in NMDP but only of $(s_j, a_j)$ in MDP; similarly for $v_j, d^v_j$ and $H_{s_j}$ compared to just $s_j$.) In the on-policy case, Cheng et al. (2019); Huang & Jiang (2019) propose estimators equivalent to estimating $q, d^q$ and plugging into the above equation for $\xi_{\mathrm{NMDP}}$.
Using our results (establishing the efficiency bound and that $\xi_{\mathrm{NMDP}}$ is the EIF under NMDP), these estimators can then be shown to be efficient for NMDP (either under a Donsker condition or using cross-fitting instead of their in-sample estimation). They are not efficient under MDP, however, and $\xi_{\mathrm{MDP}}$ will still have lower variance. Nonetheless, in the on-policy case $C_1 = 1$, so the curse of horizon does not affect $\xi_{\mathrm{NMDP}}$, and since it requires fewer nuisances it might be preferable.

Example 2 (Logged bandit case). If $H = 0$ (one decision point), then $\xi_{\mathrm{MDP}} = \xi_{\mathrm{NMDP}}$, both equal to
$$\nu_0\,(r_0 - q_0)\,g_0 + \mathbb E_{\pi^\theta_0(a_0 \mid s_0)}[q_0\,g_0 \mid s_0].$$
We can construct an estimator by cross-fold estimation of $q_0$ (note the last expectation is just an integral/sum with respect to the measure $\pi^\theta(a_0 \mid s_0)$ for a given $s_0$); a code sketch of this estimator appears after Example 4. While policy gradients are used in the logged bandit case in the counterfactual learning community (e.g., Swaminathan & Joachims, 2015, who use the gradient $\nu_0 r_0 g_0$), as far as we know, no one uses this efficient estimator for the gradient even in the logged bandit case, where NMDP and MDP are the same.

Example 3. By Theorem 8, each of the following is a new policy gradient estimator that is consistent given consistent estimates of its respective nuisances:
- (a) $\hat\mu = 0,\ \hat d^\mu = 0$: $\ \mathbb E_n[\hat d^v_0]$;
- (b) $\hat q = 0,\ \hat d^q = 0$: $\ \mathbb E_n\big[\sum_{j=0}^H \hat d^\mu_j r_j\big]$;
- (c) $\hat d^q = 0,\ \hat d^\mu = 0$: $\ \mathbb E_n\big[\sum_{j=0}^H \mathbb E_{\pi^\theta}[\hat\mu_{j-1}\,\hat q_j\,g_j \mid s_j]\big]$, where the inner expectation is only over $a_j \sim \pi^\theta(a_j \mid s_j)$.

Example 4 (Stationary infinite-horizon case). Suppose the MDP transition and reward probabilities and the behavior and target policy ($\pi_\theta$) are all stationary (i.e., time-invariant, so that $\pi = \pi_t$, $g = g_t$, $p_t = p$, etc.). Suppose moreover that, as $H \to \infty$, the Markov chain on the variables $\{(s_t, a_t, r_t) : t = 0, 1, \ldots\}$ is ergodic under either the behavior or the target policy. Consider estimating the derivative of the long-run average reward $Z^\infty(\theta) = \lim_{H\to\infty} Z(\theta)/H$. By taking the limit of $\xi_{\mathrm{MDP}}/H$ as $H \to \infty$, we obtain
$$\xi^\infty_{\mathrm{MDP}} \overset{\mathrm{dist}}{=} d^\mu(s', a')\,\big(r - q(s', a')\big) - \mu(s', a')\,d^q(s', a') + \mu(s, a)\,d^v(s') + d^\mu(s, a)\,v(s'),$$
where $(s, a, r, s', a')$ are distributed as the stationary distribution of $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ under the behavior policy, $\mu(s, a)$ is the ratio of stationary distributions of $(s_t, a_t)$ under the target and behavior policies, $q(s, a)$ and $v(s)$ are the long-run average q- and v-functions under the target policy, and $d^\mu, d^q, d^v$ are their derivatives with respect to $\theta$. It can be shown that, under appropriate conditions, $\xi^\infty_{\mathrm{MDP}}$ is in fact the EIF for $Z^\infty(\theta)$ if our data were iid observations of $(s, a, r, s', a')$ from the stationary distribution under the behavior policy. If our data consists, as it does, of $n$ observations of $(H+1)$-long trajectories, then we can instead construct the estimator
$$\frac1{n(H+1)}\sum_{i=1}^n\sum_{j=0}^H\Big(d^\mu(s^{(i)}_j, a^{(i)}_j)\big(r^{(i)}_j - q(s^{(i)}_j, a^{(i)}_j)\big) - \mu(s^{(i)}_j, a^{(i)}_j)\,d^q(s^{(i)}_j, a^{(i)}_j) + \mu(s^{(i)}_{j-1}, a^{(i)}_{j-1})\,d^v(s^{(i)}_j) + d^\mu(s^{(i)}_{j-1}, a^{(i)}_{j-1})\,v(s^{(i)}_j)\Big),$$
where the nuisances $\mu, d^\mu, q, d^q$ are appropriately estimated in a cross-fold manner as in Algorithm 1. Following similar arguments as in Kallus & Uehara (2019a), who study infinite-horizon OPE, one can show that this extension of EOPPG maintains its efficiency and 3-way robustness guarantees as long as our data satisfies appropriate mixing conditions (which ensure it appropriately approximates observing draws from the stationary distribution). Fleshing out these details is beyond the scope of this paper.
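Following up on Example 2, here is a minimal sketch of the efficient logged-bandit ($H = 0$) gradient estimate $\nu_0(r_0 - \hat q_0)g_0 + \mathbb E_{\pi^\theta}[\hat q_0 g_0 \mid s_0]$ with cross-fold estimation of $q_0$, assuming finitely many actions and a known behavior policy; `fit_q` and the other callables are placeholders of the sketch.

```python
import numpy as np

def efficient_bandit_policy_gradient(data, fit_q, pi_theta, pi_b, score, K=2, seed=0):
    """Efficient gradient estimate in the logged bandit case (Example 2, H = 0):
    E_n[ nu_0 (r_0 - q_0) g_0 + E_{pi^theta}[q_0 g_0 | s_0] ],
    with q_0 estimated by cross-fold regression.

    data          : list of (s, a, r) logged under the behavior policy pi_b
    fit_q(subset) : returns q_hat(s, a), an estimate of E[r | s, a]
    pi_theta(s, a), pi_b(s, a) : action probabilities (finite actions)
    score(s, a)   : grad_theta log pi^theta(a | s), shape (D,)
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    folds = np.array_split(rng.permutation(n), K)
    actions = sorted({a for (_, a, _) in data})
    terms = []
    for fold in folds:
        in_fold = set(fold.tolist())
        # fit q on the other folds, evaluate on this fold
        q_hat = fit_q([data[i] for i in range(n) if i not in in_fold])
        for i in in_fold:
            s, a, r = data[i]
            nu = pi_theta(s, a) / pi_b(s, a)
            direct = sum(pi_theta(s, b) * q_hat(s, b) * score(s, b) for b in actions)
            terms.append(nu * (r - q_hat(s, a)) * score(s, a) + direct)
    return np.mean(terms, axis=0)
```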
### 3.3. Estimation of Nuisance Functions

We next explain how to estimate the nuisances $d^\mu_j$ and $d^q_j$. The estimation of $q_j$ is covered by Chen & Jiang (2019); Munos & Szepesvári (2008) and the estimation of $\mu_j$ by Kallus & Uehara (2019b); Xie et al. (2019). In this section, we focus on the case where the behavior policy (and thus $\nu_t$) is known. When the behavior policy is unknown, each method can be adapted by estimating the behavior policy first and then plugging it in. The error from estimating the behavior policy is additive and hence does not deteriorate the rates of the methods below unless it is itself strictly slower.

Monte-Carlo approach. First we explain a Monte-Carlo way to estimate $d^q_j, d^\mu_j$. We use the following theorem.

Theorem 9 (Monte Carlo representations of $d^\mu_j, d^q_j$).
$$d^q_j = \mathbb E\Big[\sum_{t=j+1}^H r_t\,\nu_{j+1:t}\sum_{\ell=j+1}^t g_\ell \,\Big|\, s_j, a_j\Big], \qquad d^\mu_j = \mathbb E\Big[\nu_{0:j}\sum_{\ell=0}^j g_\ell \,\Big|\, s_j, a_j\Big].$$

Based on this result, we can simply learn $d^q_j, d^\mu_j$ using any regression algorithm. Specifically, we construct the response variables
$$y^{(i)}_{d^q_j} = \sum_{t=j+1}^H r^{(i)}_t\,\nu^{(i)}_{j+1:t}\sum_{\ell=j+1}^t g^{(i)}_\ell, \qquad y^{(i)}_{d^\mu_j} = \nu^{(i)}_{0:j}\sum_{\ell=0}^j g^{(i)}_\ell,$$
and we regress these on $(s^{(i)}_j, a^{(i)}_j)$; a code sketch of these regression targets appears after Remark 1. For example, we can use empirical risk minimization:
$$\hat d^q_j = \arg\min_{f\in\mathcal F}\frac1n\sum_{i=1}^n\big\|y^{(i)}_{d^q_j} - f(s^{(i)}_j, a^{(i)}_j)\big\|_2^2, \qquad \hat d^\mu_j = \arg\min_{f\in\mathcal F}\frac1n\sum_{i=1}^n\big\|y^{(i)}_{d^\mu_j} - f(s^{(i)}_j, a^{(i)}_j)\big\|_2^2.$$
Examples for $\mathcal F$ include RKHS norm balls, an expanding subspace of $L_2$ (i.e., a sieve), and neural networks. Rates for such estimators can, for example, be derived from Bartlett et al. (2005); Wainwright (2019). A careful reader might wonder whether estimating nuisances in this way causes the final EOPPG estimator to suffer from the curse of horizon, since $\nu_{0:j}$ can grow exponentially in $j$. However, as long as we have suitable nonparametric rates (in $n$) for the nuisances as in Theorem 7, the asymptotic MSE of $\hat Z_{\mathrm{EOPPG}}(\theta)$ does not depend on the estimation error of the nuisances. These errors only appear in higher-order (in $n$) terms and therefore vanish. This is still an important concern in finite samples, which is why we next present an alternative nuisance estimation approach.

Recursive approach. Next, we explain a recursive way to estimate $d^q_j, d^\mu_j$. This relies on the following result.

Theorem 10 (Bellman equations of $d^q_j, d^\mu_j$).
$$d^q_j(s_j, a_j) = \mathbb E[d^v_{j+1} \mid s_j, a_j], \qquad d^v_j(s_j) = \mathbb E_{\pi_\theta}[d^q_j + g_j q_j \mid s_j], \qquad d^\mu_j(s_j, a_j) = \mathbb E[d^\mu_{j-1}\,\nu_j \mid s_j, a_j] + \mu_j g_j.$$

This is derived by differentiating the Bellman equations
$$q_j(s_j, a_j) = \mathbb E[r_j + v_{j+1}(s_{j+1}) \mid s_j, a_j], \qquad \mu_j(s_j, a_j) = \mathbb E[\mu_{j-1}(s_{j-1}, a_{j-1})\,\nu_j \mid s_j, a_j].$$
Following Theorem 10, we propose the recursive Algorithms 2 and 3, which estimate $d^q_j$ using backwards recursion and $d^\mu_j$ using forward recursion.

Algorithm 2 (Estimation of $d^q_j$, recursive way).
- Input: q-estimates $\hat q_j$, hypothesis classes $\mathcal F_{d^q_j}$.
- Set $\hat d^v_H = \hat d^q_H = 0$.
- For $j = H-1, H-2, \ldots, 0$:
  - Set $\hat d^q_j \in \arg\min_{f\in\mathcal F_{d^q_j}}\sum_{i=1}^n\big\|\hat d^v_{j+1}(s^{(i)}_{j+1}) - f(s^{(i)}_j, a^{(i)}_j)\big\|_2^2$.
  - Set $\hat d^v_j(s_j) = \mathbb E_{\pi^\theta_j}[\hat d^q_j + \hat q_j g_j \mid s_j]$ (by integrating/summing with respect to $a_j \sim \pi^\theta_j(a_j \mid s_j)$).

Algorithm 3 (Estimation of $d^\mu_j$, recursive way).
- Input: µ-estimates $\hat\mu_j$, hypothesis classes $\mathcal F_{d^\mu_j}$.
- Set $\hat d^\mu_0 = \nu_0 g_0$.
- For $j = 1, 2, \ldots, H$:
  - Set $\hat d^\mu_j = \arg\min_{f\in\mathcal F_{d^\mu_j}}\sum_{i=1}^n\big\|f(s^{(i)}_j, a^{(i)}_j) - \nu^{(i)}_j\,\hat d^\mu_{j-1}(s^{(i)}_{j-1}, a^{(i)}_{j-1}) - \hat\mu^{(i)}_j g^{(i)}_j\big\|_2^2$.

Remark 1. Morimura et al. (2010) discussed a way to estimate the gradient of the stationary distribution in an on-policy setting. In comparison, our setting is off-policy.
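As a concrete instantiation of the Monte-Carlo approach (Theorem 9), here is a minimal sketch that builds the regression targets $y_{d^q_j}$ and $y_{d^\mu_j}$ for one trajectory, assuming a known behavior policy; any off-the-shelf vector-valued regressor can then be fit to these targets across trajectories, separately for each time step $j$. The function and argument names are assumptions of the sketch.

```python
import numpy as np

def monte_carlo_nuisance_targets(states, actions, rewards, density_ratio, score):
    """Regression targets from Theorem 9, for one trajectory:
      y_dq[j]  = sum_{t>j} r_t nu_{j+1:t} sum_{l=j+1}^{t} g_l   (target for d^q_j)
      y_dmu[j] = nu_{0:j} sum_{l=0}^{j} g_l                      (target for d^mu_j)
    Each row j is then regressed on (s_j, a_j) across trajectories."""
    H = len(actions) - 1
    nu = np.array([density_ratio(t, states[t], actions[t]) for t in range(H + 1)])
    g = np.stack([score(t, states[t], actions[t]) for t in range(H + 1)])  # (H+1, D)
    D = g.shape[1]

    # y_dmu[j] = prod_{k<=j} nu_k * sum_{l<=j} g_l
    y_dmu = np.cumprod(nu)[:, None] * np.cumsum(g, axis=0)

    # y_dq[j] = sum_{t=j+1}^{H} r_t * prod_{k=j+1}^{t} nu_k * sum_{l=j+1}^{t} g_l
    # (computed by a direct double loop; could be vectorized)
    y_dq = np.zeros((H + 1, D))
    for j in range(H):
        nu_cum, g_cum, acc = 1.0, np.zeros(D), np.zeros(D)
        for t in range(j + 1, H + 1):
            nu_cum *= nu[t]
            g_cum = g_cum + g[t]
            acc = acc + rewards[t] * nu_cum * g_cum
        y_dq[j] = acc
    return y_dq, y_dmu
```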
Remark 2. We have directly estimated $d^\mu_j$. Another way is to use $d^\mu_j = \nu_j\,\nabla_\theta\mu'_j + \mu_j g_j$ and to estimate $\nabla_\theta\mu'_j$ recursively based on a Bellman equation for $\mu'_j$, derived in a similar way to that for $d^\mu_j$ in Theorem 10.

## 4. Off-policy Optimization with EOPPG

Next, we discuss how to use the EOPPG estimator presented in Section 3 for off-policy optimization using projected gradient ascent, and the resulting guarantees. The algorithm is given in Algorithm 4.

Algorithm 4 (Off-policy projected gradient ascent).
- Input: an initial point $\theta_1 \in \Theta$ and a step size schedule $\alpha_t$.
- For $t = 1, 2, \ldots$:
  - $\tilde\theta_{t+1} = \theta_t + \alpha_t\,\hat Z_{\mathrm{EOPPG}}(\theta_t)$
  - $\theta_{t+1} = \mathrm{Proj}_\Theta(\tilde\theta_{t+1})$

Defining the error $B_t = \hat Z_{\mathrm{EOPPG}}(\theta_t) - Z(\theta_t)$, we have the following theorem.

Theorem 11. Assume the function $J(\theta)$ is differentiable and $M$-smooth in $\theta$, $M < 1/(4\alpha_t)$, and $\tilde\theta_t = \theta_t$.¹ Set $J^* = \max_{\theta\in\Theta} J(\theta)$. Then $\{\theta_t\}_{t=1}^T$ in Algorithm 4 satisfies
$$\frac1T\sum_{t=1}^T \alpha_t\,\|Z(\theta_t)\|_2^2 \le \frac{4\,(J^* - J(\theta_1))}{T} + \frac3T\sum_{t=1}^T \alpha_t\,\|B_t\|_2^2.$$

Theorem 11 gives a bound on the average derivative. That is, if we let $\hat\theta$ be chosen at random from $\{\theta_t\}_{t=1}^T$ with weights $\alpha_t$, then via Markov's inequality,
$$\|Z(\hat\theta)\|_2^2 = O_p\Big(\tfrac4T\big(J^* - J(\theta_1)\big) + \tfrac3T\sum_{t=1}^T \alpha_t\,\|B_t\|_2^2\Big).$$
So as long as we can bound the error term $\sum_t \alpha_t\|B_t\|_2^2/T$, we have that $\hat\theta$ becomes a near-stationary point. This error term comes from the noise of the EOPPG estimator. A heuristic calculation based on Theorem 7, ignoring the fact that $\theta_t$ is actually random, would suggest
$$\|B_t\|_2^2 \approx \frac{\mathrm{trace}(\mathrm{var}[\xi_{\mathrm{MDP}}])}{n} + o_p(1/n) \le \frac{D\,C_2 R_{\max}^2 G_{\max}(H+1)^2(H+2)^2}{n} + o_p(1/n).$$
To establish this formally, we recognize that $\theta_t$ is a random variable and bound the uniform deviation of EOPPG over all $\theta \in \Theta$. We then obtain the following high-probability bound on the cumulative errors.

Theorem 12 (Bound for cumulative errors). Suppose the assumptions of Theorem 7 hold, that $\theta \mapsto \xi_{\mathrm{MDP},j}$ is almost surely differentiable with derivatives bounded by $L$ for $j \in \{1, \ldots, D\}$, where $\xi_{\mathrm{MDP},j}$ is the $j$-th component of $\xi_{\mathrm{MDP}}$, and that $\Theta$ is compact with diameter $\Upsilon$. Then, for any $\delta$, there exists $N_\delta$ such that for all $n \ge N_\delta$, with probability at least $1 - \delta$, for all $t$, $\|B_t\|_2^2 \le U_{n,T,\delta}$, where
$$U_{n,T,\delta} = \frac{D\big(L^2 D\,\Upsilon^2 + C_2 G_{\max} R_{\max}^2(H+1)^2(H+2)^2\log(TD/\delta)\big)}{n}.$$

This shows that, by letting $T = n^\beta$ ($\beta > 1$) be sufficiently large, we can obtain $\|Z(\hat\theta)\|_2^2 = O_p(H^4 C_2\log(n)/n)$ for $\hat\theta$ chosen at random from $\{\theta_t\}_{t=1}^T$ as above. Note that if we had used other policy gradient estimators, such as (step-wise) REINFORCE, PG as in Eq. (5), or off-policy variants of the estimators of Cheng et al. (2019); Huang & Jiang (2019), then the term $C_1^H$ would have appeared in the bound, and the curse of horizon would have meant that our learned policies would not be near-stationary for long-horizon problems.

Remark 3. Many more sophisticated gradient-based optimization methods equipped with our EOPPG gradient estimator can be used in place of the vanilla projected gradient ascent in Algorithm 4. We refer the reader to Jain & Kar (2017) for a review of non-convex optimization methods.

¹This means all iterates remain in $\Theta$, so the projection is the identity. This is a standard condition in the analysis of non-convex optimization methods that can be guaranteed under certain assumptions; see Khamaru & Wainwright (2018); Nesterov & Polyak (2006).
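Here is a minimal sketch of Algorithm 4, where `eoppg_gradient` stands for the EOPPG estimate $\hat Z_{\mathrm{EOPPG}}(\theta)$ (e.g., as sketched earlier after Assumption 1) and `project` for the Euclidean projection onto $\Theta$; both are placeholders of the sketch.

```python
import numpy as np

def offpolicy_projected_gradient_ascent(theta1, data, eoppg_gradient, project,
                                        step_sizes, T):
    """Algorithm 4: off-policy projected gradient ascent with the EOPPG gradient.

    theta1                      : initial parameter, shape (D,)
    eoppg_gradient(theta, data) : estimated policy gradient Z_hat(theta), shape (D,)
    project(theta)              : Euclidean projection of theta onto Theta
    step_sizes                  : sequence alpha_1, ..., alpha_T
    """
    theta = np.asarray(theta1, dtype=float)
    iterates = [theta]
    for t in range(T):
        theta = project(theta + step_sizes[t] * eoppg_gradient(theta, data))
        iterates.append(theta)
    # e.g., return a random iterate weighted by alpha_t (Theorem 11),
    # or the average iterate in the concave case (Theorem 13)
    return iterates
```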
The concave case. The previous results study the guarantees of Algorithm 4 in terms of convergence to a stationary point, which is the standard form of analysis for non-convex optimization. If we additionally assume that $J(\theta)$ is a concave function, then we can see how the efficiency of EOPPG translates into convergence to an optimal solution in terms of the regret compared to the optimal policy. In this case we set $\hat\theta = \frac1T\sum_{t=1}^T \theta_t$, for which we can prove the following.

Theorem 13 (Regret bound). Suppose the assumptions of Theorem 12 hold, that $J(\theta)$ is a concave function, and that $\Theta$ is convex. For a suitable choice of $\alpha_t$ we have that, for any $\delta$, there exists $N_\delta$ such that for all $n \ge N_\delta$, with probability at least $1 - \delta$,
$$J^* - J(\hat\theta) \le \frac{\Upsilon\,\sup_{\theta\in\Theta}\|Z(\theta)\|_2}{\sqrt T} + \Upsilon\sqrt{U_{n,T,\delta}}.$$

Here, the first term is the optimization error we would incur if we knew the true gradient $Z(\theta)$. The second term is the approximation error due to the error in our estimated gradient $\hat Z_{\mathrm{EOPPG}}(\theta)$. When we set $T = n^\beta$ ($\beta > 1$), the final regret bound is
$$O_p\Big(\Upsilon R_{\max} H^2\sqrt{D\beta\,G_{\max} C_2\log(nD/\delta)}\,\big/\sqrt n\Big).$$
The regret's horizon dependence is $H^2$. This is a crucial result, since regret with polynomial horizon dependence is a desired property in RL (Jiang & Agarwal, 2018). Again, if we had used other policy gradient methods, an exponential dependence via $C_1^H$ would appear. Moreover, the regret depends on $C_2$, which corresponds to a concentrability coefficient (Antos et al., 2008).

Remark 4. Recent work studies the global convergence of online PG algorithms without concavity (Agarwal et al., 2019; Bhandari & Russo, 2019). This may be applicable here, but our setting is completely off-policy and therefore different, requiring future work. Notably, the above works focus on optimization rather than PG variance reduction. In a truly off-policy setting, the available data is limited, so statistical efficiency is crucial and is our focus here.

Remark 5 (Comparison with other results for off-policy policy learning). In the logged bandit case ($H = 0$), the regret bound of off-policy learning via exhaustive-search (non-convex) optimization is $O_p(\sqrt{\tau(\Pi)\log(1/\delta)/n})$, where $\tau(\Pi)$ represents the complexity of the hypothesis class (Foster & Syrgkanis, 2019; Zhou et al., 2018). In this bandit case, the nuisance functions of the EIF do not depend on the policy itself, making this analysis tractable. However, for our RL problem ($H > 0$), the nuisance functions depend on the policy; thus, these techniques do not extend directly. Nie et al. (2019) do extend these types of regret results to an RL problem, but one with a special when-to-treat structure, not the general MDP case.

Figure 2. Comparison of MSE of gradient estimation (log MSE with 95% CI vs. log sample size, for REINFORCE, PG, EOPPG, and EOPPG with corrupted (q, dq), (µ, dµ), or (dµ, dq)).

Figure 3. Comparison of regret after gradient ascent (log regret with 95% CI vs. log sample size, for the same estimators).

## 5. Experiments

We conducted an experiment in a simple environment to confirm the theoretical guarantees of the proposed estimator. The setting is as follows. Set $\mathcal S_t = \mathbb R$, $\mathcal A_t = \mathbb R$, $s_0 = 0$. Then set the transition dynamics as $s_t = a_{t-1} - s_{t-1}$, the reward as $r_t = -s_t^2$, the behavior policy as $\pi^b_t(a \mid s) = \mathcal N(0.8s, 0.2^2)$, the policy class as $\mathcal N(\theta s, 0.2^2)$, and the horizon as $H = 49$. Then $\theta^* = 1$ with optimal value $J^* = -1.96$, obtained by analytical calculation. We compare REINFORCE (Eq. (4)), PG (Eq. (5)), and EOPPG with $K = 2$. Nuisance functions $q, \mu, d^q, d^\mu$ are estimated by polynomial sieve regressions (Chen, 2007). Additionally, to investigate 3-way double robustness, we consider corrupting the nuisance models by adding $\mathcal N(0, 1)$ noise; we thus consider corrupting $(q, d^q)$, $(\mu, d^\mu)$, or $(d^\mu, d^q)$.
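To make the simulation concrete, here is a minimal sketch of the data-generating process as we read it from the description above (transition $s_t = a_{t-1} - s_{t-1}$, reward $r_t = -s_t^2$, behavior policy $\mathcal N(0.8s, 0.2^2)$, $H = 49$); this is an illustration, not the authors' experimental code.

```python
import numpy as np

def simulate_behavior_data(n, H=49, theta_b=0.8, sigma=0.2, seed=0):
    """Generate n trajectories from the toy environment of Section 5:
    s_0 = 0, a_t ~ N(theta_b * s_t, sigma^2), s_{t+1} = a_t - s_t, r_t = -s_t^2."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n):
        s = 0.0
        states, actions, rewards = [s], [], []
        for t in range(H + 1):
            a = rng.normal(theta_b * s, sigma)   # behavior policy N(0.8 s, 0.2^2)
            r = -s ** 2                          # reward at the current state
            s = a - s                            # transition s_{t+1} = a_t - s_t
            actions.append(a); rewards.append(r); states.append(s)
        data.append((states, actions, rewards))
    return data
```

Under the target policy class $\mathcal N(\theta s, 0.2^2)$, the choice $\theta = 1$ makes $s_{t+1}$ pure noise, which is how the analytical optimum $J^* = -49 \cdot 0.2^2 = -1.96$ arises.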
First, in Fig. 2, we compare the MSE of these gradient estimators at $\theta = 1.0$ using 100 replications of the experiment for each of $n = 800, 1600, 3200, 6400$. As can be seen, the performance of EOPPG is superior to the existing estimators in terms of MSE, validating our efficiency results. We can also see that EOPPG with misspecified models still converges, validating our 3-way double robustness results. Second, in Fig. 3, we apply gradient ascent as in Algorithm 4 with $\alpha_t = 0.15$ and $T = 40$. We compare the regret of the final policy, i.e., $J(\theta^*) - J(\hat\theta_{40})$, using 60 replications of the experiment for each of $n = 200, 400, 800, 1600$. Notice that the lines decrease roughly as $1/\sqrt n$, but because of the large differences in values, the lines appear somewhat flat. This shows that the efficiency and 3-way double robustness translate into good regret performance, as predicted by our policy learning analysis.

## 6. Conclusions

We established an MSE efficiency bound of order $O(H^4/n)$ for estimating a policy gradient in an MDP in an off-policy manner. We proposed an estimator, EOPPG, that achieves the bound, enjoys 3-way double robustness, and leads to regret of order $H^2/\sqrt n$ when used for policy learning. Notably, this is much smaller than other approaches, which incur exponential-in-$H$ errors. This paper is only a first step toward efficient and effective off-policy policy gradients in MDPs. Remaining questions include how to estimate $d^q, d^\mu$ in large-scale environments, the performance of more practical implementations that alternate between updating $\theta$ and the nuisance estimates with only one gradient update at a time, and extending our theory to deterministic policy classes as in Silver et al. (2014).

Acknowledgements. We thank the anonymous reviewers for their insightful comments and suggestions. This material is based upon work supported by the National Science Foundation under Grant No. 1846210. Masatoshi Uehara was partially supported by the Masason Foundation.

## References

Agarwal, A., Kakade, S., Lee, J., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.

Athey, S. and Wager, S. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. Annals of Statistics, 33:1497–1537, 2005.

Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.

Bickel, P. J., Klaassen, C. A., Ritov, Y., and Wellner, J. A. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore, 1993.

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 1042–1051, 2019.

Chen, M., Beutel, A., Covington, P., Jain, S., Belletti, F., and Chi, E. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, pp. 456–464, 2019.

Chen, X. Chapter 76: Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632, 2007.

Cheng, C.-A., Yan, X., and Boots, B.
Trajectory-wise control variates for variance reduction in policy gradient methods. arXiv preprint arXiv:1908.03263, 2019.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21:C1–C68, 2018.

Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.

Degris, T., White, M., and Sutton, R. Off-policy actor-critic. In Proceedings of the 29th International Conference on Machine Learning, pp. 179–186, 2012.

Deisenroth, M., Neumann, G., and Peters, J. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.

Foster, D. J. and Syrgkanis, V. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 2052–2062, 2019.

Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25:16–18, 2019.

Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, pp. 1471–1530, 2004.

Gu, S. S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., and Levine, S. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 3846–3855, 2017.

Hájek, J. A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14:323–330, 1970.

Hanna, J. and Stone, P. Towards a data efficient off-policy policy gradient. In AAAI Spring Symposium on Data Efficient Reinforcement Learning, 2018.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2:157–325, 2015.

Huang, J. and Jiang, N. From importance sampling to doubly robust policy gradient. arXiv preprint arXiv:1910.09066, 2019.

Imani, E., Graves, E., and White, M. An off-policy policy gradient theorem using emphatic weightings. In Advances in Neural Information Processing Systems 31, pp. 96–106, 2018.

Jain, P. and Kar, P. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10:142–336, 2017.

Jiang, N. and Agarwal, A. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Proceedings of the 31st Conference On Learning Theory, volume 75, pp. 3395–3398, 2018.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 652–661, 2016.

Kallus, N. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pp. 8895–8906, 2018.

Kallus, N. and Uehara, M. Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850, 2019a.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019b.
Kallus, N. and Zhou, A. Confounding-robust policy improvement. In Advances in Neural Information Processing Systems, pp. 9269–9279, 2018.

Khamaru, K. and Wainwright, M. Convergence guarantees for a class of non-convex and non-smooth optimization problems. In Proceedings of the 35th International Conference on Machine Learning, pp. 2601–2610, 2018.

Klaassen, C. A. Consistent estimation of the influence function of locally asymptotically linear estimators. The Annals of Statistics, pp. 1548–1562, 1987.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5361–5371, 2018.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. Off-policy policy gradient with state distribution correction. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2019), 2019.

Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems 31, pp. 5442–5454, 2018.

Morimura, T., Uchibe, E., Yoshimoto, J., Peters, J., and Doya, K. Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural Computation, 22:342–376, 2010.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.

Nesterov, Y. and Polyak, B. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108:177–205, 2006.

Nie, X., Brunskill, E., and Wager, S. Learning when-to-treat policies. arXiv preprint arXiv:1905.09751, 2019.

Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. In International Conference on Machine Learning, pp. 4026–4035, 2018.

Precup, D., Sutton, R. S., and Singh, S. P. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766, 2000.

Rotnitzky, A. and Vansteelandt, S. Double-robust methods. In Handbook of Missing Data Methodology, Handbooks of Modern Statistical Methods, pp. 185–212. Chapman and Hall/CRC, 2014.

Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In Neural Information Processing Systems (NIPS), Montreal, Canada, 2015.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Sen, B. A gentle introduction to empirical process theory and applications. http://www.stat.columbia.edu/~bodhi/Talks/Emp-Proc-Lecture-Notes.pdf, 2018. Accessed: 2020-1-1.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, pp. 387–395, 2014.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., Precup, D., and Singh, S. P. Intra-option learning about temporally abstract actions. In Proceedings of the 15th International Conference on Machine Learning, pp. 556–564, 1998.

Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pp. 814–823, 2015.

Tang, J. and Abbeel, P.
On a connection between importance sampling and the likelihood ratio policy gradient. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, Volume 1, pp. 1000–1008, 2010.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2139–2148, 2016.

Tosatto, S., Carvalho, J., Abdulsamad, H., and Peters, J. A nonparametric off-policy policy gradient. arXiv preprint arXiv:2001.02435, 2020.

Tsiatis, A. A. Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York, NY, 2006.

van der Laan, M. J. and Robins, J. M. Unified Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics. Springer, New York, NY, 2003.

van der Laan, M. J. and Rose, S. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics. Springer, New York, NY, 2018.

van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, New York, 2019.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.

Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A., Kakade, S., Mordatch, I., and Abbeel, P. Variance reduction for policy gradient with action-dependent factorized baselines. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pp. 9665–9675, 2019.

Zhou, Z., Athey, S., and Wager, S. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778, 2018.