# Generalized Off-Policy Actor-Critic

Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
Department of Computer Science, University of Oxford
{shangtong.zhang, wendelin.boehmer, shimon.whiteson}@cs.ox.ac.uk

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

## 1 Introduction

Reinforcement learning (RL) algorithms based on the policy gradient theorem (Sutton et al., 2000; Marbach and Tsitsiklis, 2001) have recently enjoyed great success in various domains, e.g., achieving human-level performance on Atari games (Mnih et al., 2016). The original policy gradient theorem is on-policy and used to optimize the on-policy objective. However, in many cases, we would prefer to learn off-policy to improve data efficiency (Lin, 1992) and exploration (Osband et al., 2018). To this end, the Off-Policy Policy Gradient (OPPG) Theorem (Degris et al., 2012; Maei, 2018; Imani et al., 2018) was developed and has been widely used (Silver et al., 2014; Lillicrap et al., 2015; Wang et al., 2016; Gu et al., 2017; Ciosek and Whiteson, 2017; Espeholt et al., 2018).

Ideally, an off-policy algorithm should optimize the off-policy analogue of the on-policy objective. In the continuing RL setting, this analogue would be the performance of the target policy in expectation w.r.t. the stationary distribution of the target policy, which is referred to as the alternative life objective (Ghiassian et al., 2018). This objective corresponds to the performance of the target policy when deployed. However, OPPG optimizes a different objective, the performance of the target policy in expectation w.r.t. the stationary distribution of the behavior policy. This objective is referred to as the excursion objective (Ghiassian et al., 2018), as it corresponds to the excursion setting (Sutton et al., 2016). Unfortunately, the excursion objective can be misleading about the performance of the target policy when deployed, as we illustrate in Section 3.

It is infeasible to optimize the alternative life objective directly in the off-policy continuing setting. Instead, we propose to optimize the counterfactual objective, which approximates the alternative life objective. In the excursion setting, an agent in the stationary distribution of the behavior policy considers a hypothetical excursion that follows the target policy. The return from this hypothetical excursion is an indicator of the performance of the target policy. The excursion objective measures this return w.r.t. the stationary distribution of the behavior policy, using samples generated by executing the behavior policy.
By contrast, evaluating the alternative life objective requires samples from the stationary distribution of the target policy, to which the agent does not have access. In the counterfactual objective, we use a new parameter $\hat{\gamma}$ to control how counterfactual the objective is, akin to Gelada and Bellemare (2019). With $\hat{\gamma} = 0$, the counterfactual objective uses the stationary distribution of the behavior policy to measure the performance of the target policy, recovering the excursion objective. With $\hat{\gamma} = 1$, the counterfactual objective is fully decoupled from the behavior policy and uses the stationary distribution of the target policy to measure the performance of the target policy, recovering the alternative life objective. As in the excursion objective, the excursion is never actually executed and the agent always follows the behavior policy.

We make two contributions in this paper. First, we prove the Generalized Off-Policy Policy Gradient (GOPPG) Theorem, which gives the policy gradient of the counterfactual objective. Second, using an emphatic approach (Sutton et al., 2016) to compute an unbiased sample for this policy gradient, we develop the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We evaluate Geoff-PAC empirically in challenging robot simulation tasks with neural network function approximators. Geoff-PAC outperforms the actor-critic algorithms proposed by Degris et al. (2012) and Imani et al. (2018), and to our best knowledge, Geoff-PAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

## 2 Background

We use a time-indexed capital letter (e.g., $X_t$) to denote a random variable. We use a bold capital letter (e.g., $\mathbf{X}$) to denote a matrix and a bold lowercase letter (e.g., $\mathbf{x}$) to denote a column vector. If $x : \mathcal{S} \to \mathbb{R}$ is a scalar function defined on a finite set $\mathcal{S}$, we use its corresponding bold lowercase letter to denote its vector form, i.e., $\mathbf{x} \doteq [x(s_1), \dots, x(s_{|\mathcal{S}|})]^T$. We use $I$ to denote the identity matrix and $\mathbf{1}$ to denote an all-one column vector.

We consider an infinite-horizon MDP (Puterman, 2014) consisting of a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a bounded reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and a transition kernel $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$. We consider a transition-based discount function (White, 2017) $\gamma : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ for unifying continuing tasks and episodic tasks. At time step $t$, an agent at state $S_t$ takes an action $A_t$ according to a policy $\pi : \mathcal{A} \times \mathcal{S} \to [0, 1]$. The agent then proceeds to a new state $S_{t+1}$ according to $p$ and gets a reward $R_{t+1}$ satisfying $\mathbb{E}[R_{t+1}] = r(S_t, A_t)$. The return of $\pi$ at time step $t$ is $G_t \doteq \sum_{i=0}^{\infty} \Gamma_t^{i-1} R_{t+1+i}$, where $\Gamma_t^{i-1} \doteq \prod_{j=0}^{i-1} \gamma(S_{t+j}, A_{t+j}, S_{t+j+1})$ and $\Gamma_t^{-1} \doteq 1$. We use $v_\pi$ to denote the value function of $\pi$, which is defined as $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$. Like White (2017), we assume $v_\pi$ exists for all $s$. We use $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ to denote the state-action value function of $\pi$. We use $P_\pi$ to denote the transition matrix induced by $\pi$, i.e., $P_\pi[s, s'] \doteq \sum_a \pi(a|s)p(s'|s, a)$. We assume the chain induced by $\pi$ is ergodic and use $d_\pi$ to denote its unique stationary distribution. We define $D_\pi \doteq \mathrm{diag}(d_\pi)$.

In the off-policy setting, an agent aims to learn a target policy $\pi$ but follows a behavior policy $\mu$. We use the same assumption of coverage as Sutton and Barto (2018), i.e., $\forall (s, a),\ \pi(a|s) > 0 \implies \mu(a|s) > 0$. We assume the chain induced by $\mu$ is ergodic and use $d_\mu$ to denote its stationary distribution.
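To make the matrix notation concrete, here is a minimal NumPy sketch (ours, not part of the paper) that builds $P_\pi$ for a small hypothetical MDP and approximates its stationary distribution $d_\pi$ by power iteration; the transition kernel and policy are invented for illustration only.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP used only to illustrate the notation.
# p[s, a, s'] is the transition kernel, pi[s, a] a tabular target policy.
p = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
              [[0.0, 0.5, 0.5], [0.3, 0.7, 0.0]],
              [[0.2, 0.0, 0.8], [0.6, 0.2, 0.2]]])
pi = np.array([[0.7, 0.3],
               [0.5, 0.5],
               [0.2, 0.8]])

# P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
P_pi = np.einsum('sa,sax->sx', pi, p)

# Stationary distribution d_pi of the ergodic chain: d_pi^T = d_pi^T P_pi.
d = np.ones(3) / 3
for _ in range(10_000):
    d = d @ P_pi

D_pi = np.diag(d)          # D_pi = diag(d_pi)
print(d, d @ P_pi)         # d is (approximately) invariant under P_pi
```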
Similarly, $D_\mu \doteq \mathrm{diag}(d_\mu)$. We define $\rho(s, a) \doteq \frac{\pi(a|s)}{\mu(a|s)}$, $\rho_t \doteq \rho(S_t, A_t)$ and $\gamma_t \doteq \gamma(S_{t-1}, A_{t-1}, S_t)$. Typically, there are two kinds of tasks in RL, prediction and control.

**Prediction:** In prediction, we are interested in finding the value function $v_\pi$ of a given policy $\pi$. Temporal Difference (TD) learning (Sutton, 1988) is perhaps the most powerful algorithm for prediction. TD enjoys a convergence guarantee in both on- and off-policy tabular settings. TD can also be combined with linear function approximation. The update rule for on-policy linear TD is $\mathbf{w} \leftarrow \mathbf{w} + \alpha \Delta_t$, where $\alpha$ is a step size and $\Delta_t \doteq [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \nabla_{\mathbf{w}} V(S_t)$ is an incremental update. Here we use $V$ to denote an estimate of $v_\pi$ parameterized by $\mathbf{w}$. Tsitsiklis and Van Roy (1997) prove the convergence of on-policy linear TD. In off-policy linear TD, the update $\Delta_t$ is weighted by $\rho_t$. The divergence of off-policy linear TD is well documented (Tsitsiklis and Van Roy, 1997). To address this issue, Gradient TD methods (Sutton et al., 2009) were proposed. Instead of bootstrapping from the prediction of a successor state like TD, Gradient TD methods compute the gradient of the projected Bellman error directly. Gradient TD methods are true stochastic gradient methods and enjoy convergence guarantees. However, they are usually two-time-scale, involving two sets of parameters and two learning rates, which makes them hard to use in practice (Sutton et al., 2016). To address this issue, Emphatic TD (ETD, Sutton et al., 2016) was proposed.

ETD introduces an interest function $i : \mathcal{S} \to [0, \infty)$ to specify the user's preferences for different states. With function approximation, we typically cannot get accurate predictions for all states and must thus trade off between them. States are usually weighted by $d_\mu(s)$ in the off-policy setting (e.g., in Gradient TD methods), but with the interest function, we can explicitly weight them by $d_\mu(s)i(s)$ in our objective. Consequently, we weight the update at time $t$ via $M_t$, the emphasis, which accumulates previous interests in a certain way. In the simplest form of ETD, we have $M_t \doteq i(S_t) + \gamma_t \rho_{t-1} M_{t-1}$, and the update $\Delta_t$ is weighted by $\rho_t M_t$. In practice, we usually set $i(s) \equiv 1$.

Inspired by ETD, Hallak and Mannor (2017) propose to weight $\Delta_t$ via $\rho_t c(S_t)$ in the Consistent Off-Policy TD (COP-TD) algorithm, where $c(s) \doteq \frac{d_\pi(s)}{d_\mu(s)}$ is the density ratio, also known as the covariate shift (Gelada and Bellemare, 2019). To learn $c$ via stochastic approximation, Hallak and Mannor (2017) propose the COP operator. However, the COP operator does not have a unique fixed point, so extra normalization and projection are used to ensure convergence (Hallak and Mannor, 2017) in the tabular setting. To address this limitation, Gelada and Bellemare (2019) further propose the $\hat{\gamma}$-discounted COP operator. They define a new transition matrix $P_{\hat{\gamma}} \doteq \hat{\gamma} P_\pi + (1 - \hat{\gamma}) \mathbf{1} d_\mu^T$, where $\hat{\gamma} \in [0, 1]$ is a constant. Following this matrix, an agent either proceeds to the next state according to $P_\pi$ w.p. $\hat{\gamma}$ or gets reset to $d_\mu$ w.p. $1 - \hat{\gamma}$. Gelada and Bellemare (2019) show the chain under $P_{\hat{\gamma}}$ is ergodic. With $d_{\hat{\gamma}}$ denoting its stationary distribution, they prove

$$d_{\hat{\gamma}} = (1 - \hat{\gamma})(I - \hat{\gamma} P_\pi^T)^{-1} d_\mu \quad (\hat{\gamma} < 1) \qquad (1)$$

and $d_{\hat{\gamma}} = d_\pi$ ($\hat{\gamma} = 1$). With $c(s) \doteq \frac{d_{\hat{\gamma}}(s)}{d_\mu(s)}$, Gelada and Bellemare (2019) prove that

$$c = \hat{\gamma} D_\mu^{-1} P_\pi^T D_\mu c + (1 - \hat{\gamma}) \mathbf{1}, \qquad (2)$$

yielding the following learning rule for estimating $c$ in the tabular setting:

$$C(S_{t+1}) \leftarrow C(S_{t+1}) + \alpha \big[ \hat{\gamma} \rho_t C(S_t) + (1 - \hat{\gamma}) - C(S_{t+1}) \big], \qquad (3)$$

where $C$ is an estimate of $c$ and $\alpha$ is a step size.
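As an illustration, below is a minimal tabular sketch of learning rule (3), assuming tabular policies and a tabular estimate $C$; the function signature and variable names are ours, not from Gelada and Bellemare (2019).

```python
import numpy as np

def cop_td_update(C, s, a, s_next, pi, mu, gamma_hat, alpha):
    """One tabular discounted COP-TD step, following learning rule (3).

    C         : np.ndarray, current estimate of the density ratio c(s)
    pi, mu    : tabular target / behavior policies, shape [num_states, num_actions]
    gamma_hat : the discount of the gamma_hat-discounted COP operator, in [0, 1)
    alpha     : step size
    """
    rho = pi[s, a] / mu[s, a]                      # importance sampling ratio rho_t
    target = gamma_hat * rho * C[s] + (1.0 - gamma_hat)
    C[s_next] += alpha * (target - C[s_next])      # move C(S_{t+1}) toward its target
    return C
```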
A semi-gradient is used when $C$ is a parameterized function (Gelada and Bellemare, 2019). For a small $\hat{\gamma}$ (depending on the difference between $\pi$ and $\mu$), Gelada and Bellemare (2019) prove a multi-step contraction under linear function approximation. For a large $\hat{\gamma}$ or nonlinear function approximation, they provide an extra normalization loss to enforce the constraint $d_\mu^T c = \mathbf{1}^T d_{\hat{\gamma}} = 1$. Gelada and Bellemare (2019) use $\rho_t c(S_t)$ to weight the update $\Delta_t$ in Discounted COP-TD. They demonstrate empirical success in Atari games (Bellemare et al., 2013) with pixel inputs.

**Control:** In this paper, we focus on policy-based control. In the on-policy continuing setting, we seek to optimize the average value objective (Silver, 2015)

$$J_\pi \doteq \sum_s d_\pi(s) i(s) v_\pi(s). \qquad (4)$$

Optimizing the average value objective is equivalent to optimizing the average reward objective (Puterman, 2014) if both $\gamma$ and $i$ are constant (see White, 2017). In general, the average value objective can be interpreted as a generalization of the average reward objective that adopts transition-based discounting and a nonconstant interest function. In the off-policy continuing setting, Imani et al. (2018) propose to optimize the excursion objective

$$J_\mu \doteq \sum_s d_\mu(s) i(s) v_\pi(s) \qquad (5)$$

instead of the alternative life objective $J_\pi$. The key difference between $J_\pi$ and $J_\mu$ is how we trade off different states. With function approximation, it is usually not possible to maximize $v_\pi(s)$ for all states, which is the first trade-off we need to make. Moreover, visiting one state more implies visiting another state less, which is the second trade-off we need to make. $J_\mu$ and $J_\pi$ achieve both kinds of trade-off according to $d_\mu$ and $d_\pi$ respectively. However, it is $J_\pi$, not $J_\mu$, that correctly reflects the deploy-time performance of $\pi$, as the behavior policy no longer matters once we deploy the off-policy learned $\pi$ in a continuing task. In both objectives, $i(s)$ is usually set to 1.

We assume $\pi$ is parameterized by $\theta$. In the rest of this paper, all gradients are taken w.r.t. $\theta$ unless otherwise specified, and we consider the gradient for only one component of $\theta$ for the sake of clarity. It is not clear how to compute the policy gradient of $J_\pi$ in the off-policy continuing setting directly. For $J_\mu$, we can compute the policy gradient as

$$\nabla J_\mu = \sum_s d_\mu(s) i(s) \sum_a \big[ q_\pi(s, a) \nabla \pi(a|s) + \pi(a|s) \nabla q_\pi(s, a) \big]. \qquad (6)$$

Degris et al. (2012) prove in the Off-Policy Policy Gradient (OPPG) theorem that we can ignore the term $\pi(a|s) \nabla q_\pi(s, a)$ without introducing bias for a tabular policy¹ when $i(s) \equiv 1$, yielding the Off-Policy Actor-Critic (Off-PAC), which updates $\theta$ as

$$\theta_{t+1} = \theta_t + \alpha \rho_t q_\pi(S_t, A_t) \nabla \log \pi(A_t|S_t), \qquad (7)$$

where $\alpha$ is a step size, $S_t$ is sampled from $d_\mu$, and $A_t$ is sampled from $\mu(\cdot|S_t)$.

¹ See Errata in Degris et al. (2012), also discussed in Imani et al. (2018) and Maei (2018).

For a policy using a general function approximator, Imani et al. (2018) propose a new OPPG theorem. They define

$$F^{(1)}_t \doteq i(S_t) + \gamma_t \rho_{t-1} F^{(1)}_{t-1}, \quad M^{(1)}_t \doteq (1 - \lambda_1) i(S_t) + \lambda_1 F^{(1)}_t, \quad Z^{(1)}_t \doteq \rho_t M^{(1)}_t q_\pi(S_t, A_t) \nabla \log \pi(A_t|S_t),$$

where $\lambda_1 \in [0, 1]$ is a constant used to optimize the bias-variance trade-off and $F^{(1)}_{-1} \doteq 0$. Imani et al. (2018) prove that $Z^{(1)}_t$ is an unbiased sample of $\nabla J_\mu$ in the limiting sense if $\lambda_1 = 1$ and $\pi$ is fixed, i.e., $\lim_{t \to \infty} \mathbb{E}_\mu[Z^{(1)}_t] = \nabla J_\mu$. Based on this, Imani et al. (2018) propose the Actor-Critic with Emphatic weightings (ACE) algorithm, which updates $\theta$ as $\theta_{t+1} = \theta_t + \alpha Z^{(1)}_t$.
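A minimal sketch of this per-step computation is shown below, assuming the critic estimate of $q_\pi$, the ratios $\rho$, and the score vector $\nabla \log \pi$ are supplied by the caller (e.g., by a critic and an autodiff library); class and variable names are ours, not from Imani et al. (2018).

```python
import numpy as np

class ACEUpdater:
    """Sketch of the ACE actor update with emphatic weightings."""

    def __init__(self, lambda1):
        self.lambda1 = lambda1
        self.F1 = 0.0                       # follow-on trace F^(1), with F^(1)_{-1} = 0

    def step(self, interest, gamma_t, rho_prev, rho_t, q_sa, grad_log_pi, theta, alpha):
        # F^(1)_t = i(S_t) + gamma_t * rho_{t-1} * F^(1)_{t-1}
        self.F1 = interest + gamma_t * rho_prev * self.F1
        # M^(1)_t = (1 - lambda1) * i(S_t) + lambda1 * F^(1)_t
        M1 = (1 - self.lambda1) * interest + self.lambda1 * self.F1
        # Z^(1)_t = rho_t * M^(1)_t * q_pi(S_t, A_t) * grad log pi(A_t | S_t)
        Z1 = rho_t * M1 * q_sa * np.asarray(grad_log_pi)
        return theta + alpha * Z1           # theta_{t+1} = theta_t + alpha * Z^(1)_t
```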
ACE is an emphatic approach in which $M^{(1)}_t$ is the emphasis used to reweight the update.

## 3 The Counterfactual Objective

We now introduce the counterfactual objective

$$J_{\hat{\gamma}} \doteq \sum_s d_{\hat{\gamma}}(s) \hat{i}(s) v_\pi(s), \qquad (8)$$

where $\hat{i}$ is a user-defined interest function. Similarly, we can set $\hat{i}(s)$ to 1 for the continuing setting, but we proceed with a general $\hat{i}$. When $\hat{\gamma} = 1$, $J_{\hat{\gamma}}$ recovers the alternative life objective $J_\pi$. When $\hat{\gamma} = 0$, $J_{\hat{\gamma}}$ recovers the excursion objective $J_\mu$.

To motivate the counterfactual objective $J_{\hat{\gamma}}$, we first present the two-circle MDP (Figure 1a) to highlight the difference between $J_\pi$ and $J_\mu$.

*Figure 1: (a) The two-circle MDP; rewards are 0 unless specified on an edge. (b) The probability of transitioning to B from A under the target policy $\pi$ during training. (c) The influence of $\hat{\gamma}$ and $\lambda_2$ on the final solution found by Geoff-PAC.*

In the two-circle MDP, an agent needs to make a decision only in state A. The behavior policy $\mu$ proceeds to B or C randomly with equal probability. For this continuing task, the discount factor $\gamma$ is always 0.6 and the interest is always 1. Under this task specification (White, 2017), the optimal policy under the alternative life objective $J_\pi$, which is equivalent to the average reward objective as $\gamma$ and $i$ are constant, is to stay in the outer circle. However, to maximize $J_\mu$, the agent prefers the inner circle. To see this, first note that $v_\pi(B)$ and $v_\pi(C)$ hardly change w.r.t. $\pi$, and we have $v_\pi(B) \approx 3.6$ and $v_\pi(C) \approx 5$. To maximize $J_\mu$, the target policy $\pi$ would prefer transitioning to state C to maximize $v_\pi(A)$. The agent, therefore, remains in the inner circle.

The two-circle MDP is tabular, so the policy maximizing $v_\pi(s)$ for all $s$ can be represented accurately. If we considered an episodic task, e.g., if we aimed to maximize only $v_\pi(A)$ and set the discount of the transition back to A to 0, such a policy would be optimal under the episodic return criterion. However, when we consider a continuing task and aim to optimize $J_\pi$, we have to consider state visitation. The excursion objective $J_\mu$ maximizes $v_\pi(A)$ regardless of the state visitation under $\pi$, yielding a policy $\pi$ that never visits state D, the state with the highest value. Such a policy is sub-optimal in this continuing task. To maximize $J_\pi$, the agent has to sacrifice $v_\pi(A)$ and visit state D more. The two-circle MDP is not an artifact of the small $\gamma$; the same effect can occur with a larger $\gamma$ if the path is longer. With function approximation, the discrepancy between $J_\mu$ and $J_\pi$ can be magnified, as we need to make trade-offs in both maximizing $v_\pi(s)$ and state visitation.

One solution to this problem is to set the interest function $i$ in $J_\mu$ in a clever way. However, it is not clear how to achieve this without domain knowledge, and Imani et al. (2018) simply set $i$ to 1. Another solution might be to optimize $J_\pi$ directly in off-policy learning, if one could use importance sampling ratios to fully correct $d_\mu$ to $d_\pi$, as Precup et al. (2001) propose for value-based methods in the episodic setting. However, this solution is infeasible for the continuing setting (Sutton et al., 2016). One may also use a differential value function (Sutton and Barto, 2018) to replace the discounted value function in $J_\mu$. Off-policy policy gradient with a differential value function, however, is still an open problem and we are not aware of existing literature on this. In this paper, we propose to optimize $J_{\hat{\gamma}}$ instead.

It is a well-known fact that the policy gradient of the stationary distribution exists under mild conditions (e.g., see Yu, 2005).² It follows immediately from the proof of this existence that $\lim_{\hat{\gamma} \to 1} d_{\hat{\gamma}} = d_\pi$.

² For completeness, we include that proof in the supplementary materials.
Moreover, it is trivial to see that $\lim_{\hat{\gamma} \to 0} d_{\hat{\gamma}} = d_\mu$, indicating that the counterfactual objective can smoothly recover both the excursion objective and the alternative life objective. Furthermore, we show empirically that a small $\hat{\gamma}$ (e.g., 0.6 in the two-circle MDP and 0.2 in the Mujoco tasks) is enough to generate a solution different from the one that maximizes $J_\mu$.

## 4 Generalized Off-Policy Policy Gradient

In this section, we derive an estimator for $\nabla J_{\hat{\gamma}}$ and show in Proposition 1 that it is unbiased in the limiting sense. Our (standard) assumptions are given in the supplementary materials. The OPPG theorem (Imani et al., 2018) leaves us the freedom to choose the interest function $i$ in $J_\mu$. In this paper, we set $i(s) \doteq \hat{i}(s)c(s)$, which, to our best knowledge, is the first time that a non-trivial interest is used. Hence, $i$ depends on $\pi$ and we cannot invoke OPPG directly, as $\nabla J_\mu \neq \sum_s d_\mu(s) i(s) \nabla v_\pi(s)$. However, we can still invoke the remaining parts of OPPG:

$$\sum_s d_\mu(s) i(s) \nabla v_\pi(s) = \sum_s m(s) \sum_a q_\pi(s, a) \nabla \pi(a|s), \qquad (9)$$

where $m^T \doteq i^T D_\mu (I - P_{\pi,\gamma})^{-1}$ and $P_{\pi,\gamma}[s, s'] \doteq \sum_a \pi(a|s) p(s'|s, a) \gamma(s, a, s')$. We now compute the gradient $\nabla J_{\hat{\gamma}}$.

**Theorem 1 (Generalized Off-Policy Policy Gradient Theorem)**

$$\nabla J_{\hat{\gamma}} = \sum_s m(s) \sum_a q_\pi(s, a) \nabla \pi(a|s) + \sum_s d_\mu(s) \hat{i}(s) v_\pi(s) g(s),$$

where $g \doteq \hat{\gamma} D_\mu^{-1} (I - \hat{\gamma} P_\pi^T)^{-1} b$ and $b \doteq (\nabla P_\pi^T) D_\mu c$.

*Proof.* We first use the product rule of calculus and plug in $d_{\hat{\gamma}}(s) = d_\mu(s)c(s)$:

$$\nabla J_{\hat{\gamma}} = \sum_s d_{\hat{\gamma}}(s) \hat{i}(s) \nabla v_\pi(s) + \sum_s \nabla d_{\hat{\gamma}}(s)\, \hat{i}(s) v_\pi(s) = \sum_s d_\mu(s) c(s) \hat{i}(s) \nabla v_\pi(s) + \sum_s d_\mu(s) \nabla c(s)\, \hat{i}(s) v_\pi(s).$$

The first term equals the first term in the statement of the theorem directly by (9), using $i(s) = \hat{i}(s)c(s)$. For the second term, we take gradients on both sides of (2):

$$\nabla c = \hat{\gamma} D_\mu^{-1} (\nabla P_\pi^T) D_\mu c + \hat{\gamma} D_\mu^{-1} P_\pi^T D_\mu \nabla c.$$

Solving this linear system for $\nabla c$ leads to

$$\nabla c = \big(I - \hat{\gamma} D_\mu^{-1} P_\pi^T D_\mu\big)^{-1} \hat{\gamma} D_\mu^{-1} (\nabla P_\pi^T) D_\mu c = \big[D_\mu^{-1} (I - \hat{\gamma} P_\pi^T) D_\mu\big]^{-1} \hat{\gamma} D_\mu^{-1} (\nabla P_\pi^T) D_\mu c = D_\mu^{-1} (I - \hat{\gamma} P_\pi^T)^{-1} D_\mu\, \hat{\gamma} D_\mu^{-1} (\nabla P_\pi^T) D_\mu c = g.$$

With $\nabla c(s) = g(s)$, the second term equals $\sum_s d_\mu(s)\hat{i}(s)v_\pi(s)g(s)$, which completes the proof. ∎

Now we use an emphatic approach to provide an unbiased sample of $\nabla J_{\hat{\gamma}}$. We define

$$I_t \doteq c(S_{t-1}) \rho_{t-1} \nabla \log \pi(A_{t-1}|S_{t-1}), \quad F^{(2)}_t \doteq I_t + \hat{\gamma} \rho_{t-1} F^{(2)}_{t-1}, \quad M^{(2)}_t \doteq (1 - \lambda_2) I_t + \lambda_2 F^{(2)}_t.$$

Here $I_t$ functions as an intrinsic interest (in contrast to the user-defined extrinsic interest $\hat{i}$) and is a sample for $b$. $F^{(2)}_t$ accumulates previous interests and translates $b$ into $g$. $\lambda_2$ controls a bias-variance trade-off similar to Sutton et al. (2016) and Imani et al. (2018). We now define

$$Z^{(2)}_t \doteq \hat{\gamma} \hat{i}(S_t) v_\pi(S_t) M^{(2)}_t, \quad Z_t \doteq Z^{(1)}_t + Z^{(2)}_t$$

and proceed to show that $Z_t$ is an unbiased sample of $\nabla J_{\hat{\gamma}}$ when $t \to \infty$.

**Lemma 1** Assuming the chain induced by $\mu$ is ergodic and $\pi$ is fixed, the limit $f(s) \doteq d_\mu(s) \lim_{t \to \infty} \mathbb{E}_\mu[F^{(2)}_t \mid S_t = s]$ exists, and $f = (I - \hat{\gamma} P_\pi^T)^{-1} b$ for $\hat{\gamma} < 1$.

*Proof.* Previous works (Sutton et al., 2016; Imani et al., 2018) assume $\lim_{t \to \infty} \mathbb{E}_\mu[F^{(1)}_t \mid S_t = s]$ exists. Here we prove the existence of $\lim_{t \to \infty} \mathbb{E}_\mu[F^{(2)}_t \mid S_t = s]$, inspired by the process of computing the value of $\lim_{t \to \infty} \mathbb{E}_\mu[F^{(1)}_t \mid S_t = s]$ (assuming its existence) in Sutton et al. (2016). The existence of $\lim_{t \to \infty} \mathbb{E}_\mu[F^{(1)}_t \mid S_t = s]$ with transition-dependent $\gamma$ can also be established with the same routine.³ The proof also involves techniques similar to Hallak and Mannor (2017). Details are in the supplementary materials.

³ This existence does not follow directly from the convergence analysis of ETD in Yu (2015).

**Proposition 1** Assuming the chain induced by $\mu$ is ergodic, $\pi$ is fixed, $\lambda_1 = \lambda_2 = 1$, $\hat{\gamma} < 1$, and $i(s) \doteq \hat{i}(s)c(s)$, then $\lim_{t \to \infty} \mathbb{E}_\mu[Z_t] = \nabla J_{\hat{\gamma}}$.

*Proof.* The proof involves Proposition 1 in Imani et al. (2018) and Lemma 1. Details are provided in the supplementary materials.

When $\hat{\gamma} = 0$, the Generalized Off-Policy Policy Gradient (GOPPG) Theorem recovers the OPPG theorem in Imani et al. (2018).
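For concreteness, the following is a minimal sketch of one actor step combining both traces, assuming the critic value $V$, the density-ratio estimate $C$, the ratios $\rho$, and the score vector $\nabla \log \pi$ are supplied by the caller; this is our own illustration under those assumptions, not the paper's released implementation.

```python
import numpy as np

class GeoffPACUpdater:
    """Sketch of one Geoff-PAC actor step with both emphatic traces."""

    def __init__(self, dim_theta, lambda1, lambda2, gamma_hat):
        self.lambda1, self.lambda2, self.gamma_hat = lambda1, lambda2, gamma_hat
        self.F1 = 0.0                          # scalar follow-on trace for Z^(1)
        self.F2 = np.zeros(dim_theta)          # vector trace for Z^(2)
        self.prev = None                       # (rho_{t-1}, C(S_{t-1}), grad log pi at t-1)

    def step(self, i_hat, C_s, V_s, gamma_t, rho_t, q_sa, grad_log_pi, theta, alpha):
        grad_log_pi = np.asarray(grad_log_pi)
        interest = i_hat * C_s                 # i(S_t) = i_hat(S_t) * c(S_t)
        if self.prev is None:                  # convention for the first step
            rho_prev, C_prev, glp_prev = 0.0, 0.0, np.zeros_like(grad_log_pi)
        else:
            rho_prev, C_prev, glp_prev = self.prev

        # Z^(1): emphatically weighted off-policy policy-gradient term.
        self.F1 = interest + gamma_t * rho_prev * self.F1
        M1 = (1 - self.lambda1) * interest + self.lambda1 * self.F1
        Z1 = rho_t * M1 * q_sa * grad_log_pi

        # Z^(2): correction from the gradient of the density ratio.
        I_t = C_prev * rho_prev * glp_prev     # intrinsic interest I_t (a vector)
        self.F2 = I_t + self.gamma_hat * rho_prev * self.F2
        M2 = (1 - self.lambda2) * I_t + self.lambda2 * self.F2
        Z2 = self.gamma_hat * i_hat * V_s * M2

        self.prev = (rho_t, C_s, grad_log_pi)
        return theta + alpha * (Z1 + Z2)       # theta_{t+1} = theta_t + alpha * Z_t
```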
The main contribution of GOPPG lies in the computation of $\nabla c$, i.e., the policy gradient of a distribution, which has not been done by previous policy gradient methods. The main contribution of Proposition 1 is the trace $F^{(2)}_t$, which is an effective way to approximate $\nabla c$. Inspired by Proposition 1, we propose to update $\theta$ as $\theta_{t+1} = \theta_t + \alpha Z_t$, where $\alpha$ is a step size.

So far, we have discussed the policy gradient for a single dimension of the policy parameter $\theta$, so $F^{(1)}_t, M^{(1)}_t, F^{(2)}_t, M^{(2)}_t$ are all scalars. When we compute policy gradients for the whole $\theta$ in parallel, $F^{(1)}_t, M^{(1)}_t$ remain scalars while $F^{(2)}_t, M^{(2)}_t$ become vectors of the same size as $\theta$. This is because our intrinsic interest function $I_t$ is a multi-dimensional random variable, instead of a deterministic scalar function like $\hat{i}$. We, therefore, generalize the concept of interest. So far, we have also assumed access to the true density ratio $c$ and the true value function $v_\pi$. We can plug in their estimates $C$ and $V$, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm.⁴ The density ratio estimate $C$ can be learned via the learning rule in (3). The value estimate $V$ can be learned by any off-policy prediction algorithm, e.g., one-step off-policy TD (Sutton and Barto, 2018), Gradient TD methods, (Discounted) COP-TD, or V-trace (Espeholt et al., 2018). Pseudocode of Geoff-PAC is provided in the supplementary materials.

⁴ At this moment, a convergence analysis of Geoff-PAC is an open problem.

We now discuss two potential practical issues with Geoff-PAC. First, Proposition 1 requires $t \to \infty$. In practice, this means $\mu$ has been executed for a long time, which can be satisfied by a warm-up period before training. Second, Proposition 1 provides an unbiased sample for a fixed policy $\pi$. Once $\pi$ is updated, $F^{(1)}_t, F^{(2)}_t$ are invalidated, as are $C$ and $V$. As their update rules do not have a learning rate, we cannot simply use a larger learning rate for $F^{(1)}_t, F^{(2)}_t$ as we would for $C, V$. This issue also appears in Imani et al. (2018). In principle, we could store previous transitions in a replay buffer (Lin, 1992) and replay them for a certain number of steps after $\pi$ is updated. In this way, we can satisfy the requirement $t \to \infty$ and get up-to-date $F^{(1)}_t, F^{(2)}_t$. In practice, we found this unnecessary: when we use a small learning rate for $\pi$, we assume $\pi$ changes slowly and ignore this invalidation effect.

## 5 Experimental Results

Our experiments aim to answer the following questions. 1) Can Geoff-PAC find the same solution as on-policy policy gradient algorithms in the two-circle MDP, as promised? 2) How does the degree of counterfactualness ($\hat{\gamma}$) influence the solution? 3) Can Geoff-PAC scale up to challenging tasks like robot simulation in Mujoco with neural network function approximators? 4) Does the counterfactual objective in Geoff-PAC translate into a performance improvement over Off-PAC and ACE? 5) How does Geoff-PAC compare with other downstream applications of OPPG, e.g., DDPG (Lillicrap et al., 2015) and TD3 (Fujimoto et al., 2018)?

### 5.1 Two-circle MDP

We implemented a tabular version of ACE and Geoff-PAC for the two-circle MDP. The behavior policy $\mu$ was random, and we monitored the probability of transitioning from A to B under the target policy $\pi$. In Figure 1b, we plot $\pi(A \to B)$ during training. The curves are averaged over 10 runs and the shaded regions indicate standard errors. We set $\lambda_1 = \lambda_2 = 1$ so that both ACE and Geoff-PAC are unbiased.
For Geoff-PAC, $\hat{\gamma}$ was set to 0.9. ACE converges to the policy that maximizes $J_\mu$, as expected, while Geoff-PAC converges to the policy that maximizes $J_\pi$, the policy we want in on-policy training. Figure 1c shows how manipulating $\hat{\gamma}$ and $\lambda_2$ influences the final solution. In this two-circle MDP, $\lambda_2$ has little influence on the final solution, while manipulating $\hat{\gamma}$ changes it significantly.

### 5.2 Robot Simulation

*Figure 2: Comparison among Off-PAC, ACE, and Geoff-PAC. Black dashed lines are random agents.*

*Figure 3: Comparison among DDPG, TD3, and Geoff-PAC.*

**Evaluation:** We benchmarked Off-PAC, ACE, DDPG, TD3, and Geoff-PAC on five Mujoco robot simulation tasks from OpenAI Gym (Brockman et al., 2016). As all the original tasks are episodic, we adopted techniques similar to White (2017) to compose continuing tasks. We set the discount function $\gamma$ to 0.99 for all non-termination transitions and to 0 for all termination transitions. The agent was teleported back to the initial states upon termination. The interest function was always 1. This setting complies with the common training scheme for Mujoco tasks (Lillicrap et al., 2015; Asadi and Williams, 2016). However, we interpret the tasks as continuing tasks. As a consequence, $J_\pi$, instead of episodic return, is the proper metric to measure the performance of a policy $\pi$. The behavior policy $\mu$ is a fixed uniformly random policy, the same as in Gelada and Bellemare (2019). The data generated by $\mu$ is significantly different from the data generated by any meaningful policy in those tasks, so this setting exhibits a high degree of off-policyness. We monitored $J_\pi$ periodically during training. To evaluate $J_\pi$, states were sampled according to $\pi$, and $v_\pi$ was approximated via Monte Carlo return. Evaluation based on the commonly used total undiscounted episodic return criterion is provided in the supplementary materials; the relative performance under the two criteria is almost identical.

**Implementation:** Although emphatic algorithms have enjoyed great theoretical success (Yu, 2015; Hallak et al., 2016; Sutton et al., 2016; Imani et al., 2018), their empirical success is still limited to simple domains (e.g., simple hand-crafted Markov chains, cart-pole balancing) with linear function approximation. To our best knowledge, this is the first time that emphatic algorithms are evaluated in challenging robot simulation tasks with neural network function approximators. To stabilize training, we adopted the A2C (Clemente et al., 2017) paradigm with multiple workers and utilized a target network (Mnih et al., 2015) and a replay buffer (Lin, 1992). All three algorithms share the same architecture and the same parameterization. We first tuned hyperparameters for Off-PAC; ACE and Geoff-PAC inherited common hyperparameters from Off-PAC. For DDPG and TD3, we used the same architecture and hyperparameters as Lillicrap et al. (2015) and Fujimoto et al. (2018) respectively. More details are provided in the supplementary materials and all the implementations are publicly available.⁵

⁵ https://github.com/ShangtongZhang/DeepRL

**Results:** We first studied the influence of $\lambda_1$ on ACE and the influence of $\lambda_1, \lambda_2, \hat{\gamma}$ on Geoff-PAC in HalfCheetah. The results are reported in the supplementary materials. We found ACE was not sensitive to $\lambda_1$ and set $\lambda_1 = 0$ for all experiments. For Geoff-PAC, we found $\lambda_1 = 0.7$, $\lambda_2 = 0.6$, $\hat{\gamma} = 0.2$ produced good empirical results and used this combination for all remaining tasks. All curves are averaged over 10 independent runs and shaded regions indicate standard errors. Figure 2 compares Geoff-PAC, ACE, and Off-PAC.
Geoff-PAC significantly outperforms ACE and Off-PAC in 3 out of 5 tasks; the performance on Walker and Reacher is similar. This performance improvement supports our claim that optimizing $J_{\hat{\gamma}}$ can better approximate $J_\pi$ than optimizing $J_\mu$. We also report the performance of a random agent for reference. Moreover, this is the first time that ACE is evaluated on such challenging domains instead of simple Markov chains.

Figure 3 compares Geoff-PAC, DDPG, and TD3. Geoff-PAC outperforms DDPG in Hopper and Swimmer. DDPG with a uniformly random behavior policy exhibits high instability in HalfCheetah, Walker, and Hopper. This is expected because DDPG ignores the discrepancy between $d_\mu$ and $d_\pi$. As training progresses, this discrepancy gets larger and finally yields a performance drop. TD3 uses several techniques to stabilize DDPG, which translate into the performance and stability improvement over DDPG in Figure 3. However, Geoff-PAC still outperforms TD3 in Hopper and Swimmer. This is not a fair comparison in that many design choices for DDPG, TD3, and Geoff-PAC are different (e.g., one worker vs. multiple workers, deterministic vs. stochastic policy, network architectures), and we do not expect Geoff-PAC to outperform all applications of OPPG. However, this comparison does suggest that GOPPG sheds light on how to improve applications of OPPG.

## 6 Related Work

The density ratio $c$ is a key component in Geoff-PAC and was proposed by Gelada and Bellemare (2019). However, how we use this density ratio is different. Q-Learning (Watkins and Dayan, 1992; Mnih et al., 2015) is a semi-gradient method, and Gelada and Bellemare (2019) use the density ratio to reweight the Q-Learning semi-gradient update directly; the resulting algorithm still belongs to semi-gradient methods. If we used the density ratio to reweight the Off-PAC update (7) directly, it would just be an actor-critic analogue of the Q-Learning approach in Gelada and Bellemare (2019). This reweighted Off-PAC, however, would no longer follow the policy gradient of the objective $J_\mu$, yielding instead a "policy semi-gradient". In this paper, we use the density ratio to define a new objective, the counterfactual objective, and compute the policy gradient of this new objective directly (Theorem 1). The resulting algorithm, Geoff-PAC, still belongs to policy gradient methods (in the limiting sense). Computing the policy gradient of the counterfactual objective requires computing the policy gradient of the density ratio, which has not been explored in Gelada and Bellemare (2019).

There have been many applications of OPPG, e.g., DPG (Silver et al., 2014), DDPG, ACER (Wang et al., 2016), EPG (Ciosek and Whiteson, 2017), and IMPALA (Espeholt et al., 2018). In particular, Gu et al. (2017) propose IPG to unify on- and off-policy policy gradients. IPG is a mix of gradients (i.e., a mix of $\nabla J_\mu$ and $\nabla J_\pi$). To compute $\nabla J_\pi$, IPG does need on-policy samples. In this paper, the counterfactual objective is a mix of objectives, and we do not need on-policy samples to compute the policy gradient of the counterfactual objective. Mixing $J_{\hat{\gamma}}$ and $J_\pi$ directly in IPG-style is a possibility for future work.

There have been other policy-based off-policy algorithms. Maei (2018) provides an unbiased estimator (in the limiting sense) for $\nabla J_\mu$, assuming the value function is linear; theoretical results are provided without an empirical study. Imani et al. (2018) eliminate this linear assumption and provide a thorough empirical study.
We, therefore, conduct our comparison with Imani et al. (2018) instead of Maei (2018). In another line of work, the policy entropy is used for reward shaping. The target policy can then be derived from the value function directly (O'Donoghue et al., 2016; Nachum et al., 2017a; Schulman et al., 2017). This line of work includes deep energy-based RL (Haarnoja et al., 2017, 2018), where a value function is learned off-policy and the policy is derived from the value function directly, and path consistency learning (Nachum et al., 2017a,b), where gradients are computed to satisfy certain path consistencies. This line of work is orthogonal to this paper, where we compute the policy gradient of the counterfactual objective directly in an off-policy manner and do not involve reward shaping.

Liu et al. (2018) prove that $c$ is the unique solution of a minimax problem, which involves maximization over a function set $\mathcal{F}$. They show that, theoretically, $\mathcal{F}$ should be sufficiently rich (e.g., neural networks). To make it tractable, they restrict $\mathcal{F}$ to a ball of a reproducing kernel Hilbert space, yielding a closed-form solution for the maximization step. SGD is then used to learn an estimate of $c$ in the minimization step, which is then used for policy evaluation. In a concurrent work (Liu et al., 2019), this approximation of $c$ is used in off-policy policy gradient for $J_\pi$, and empirical success is observed in simple domains. By contrast, our $J_{\hat{\gamma}}$ unifies $J_\pi$ and $J_\mu$, where $\hat{\gamma}$ naturally allows a bias-variance trade-off, yielding empirical success in challenging robot simulation tasks.

## 7 Conclusions

In this paper, we introduced the counterfactual objective, unifying the excursion objective and the alternative life objective in the continuing RL setting. We further provided the Generalized Off-Policy Policy Gradient Theorem and the corresponding Geoff-PAC algorithm. GOPPG is the first example in which a non-trivial interest function is used, and Geoff-PAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks. There have been numerous applications of OPPG, including DDPG, ACER, IPG, EPG and IMPALA. We expect GOPPG to shed light on improving those applications. Theoretically, a convergence analysis of Geoff-PAC involving the compatible function assumption (Sutton et al., 2000) or multi-timescale stochastic approximation (Borkar, 2009) is also worth further investigation.

## Acknowledgments

SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. The authors thank Richard S. Sutton, Matthew Fellows, and Huizhen Yu for valuable discussions.

## References

Asadi, K. and Williams, J. D. (2016). Sample-efficient deep reinforcement learning for dialog control. arXiv preprint arXiv:1612.06000.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.

Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint. Springer.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.

Ciosek, K. and Whiteson, S. (2017). Expected policy gradients. arXiv preprint arXiv:1706.05374.

Clemente, A. V., Castejón, H. N., and Chandra, A. (2017). Efficient parallel methods for deep reinforcement learning. arXiv preprint arXiv:1705.04862.
Degris, T., White, M., and Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.

Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.

Gelada, C. and Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.

Ghiassian, S., Patterson, A., White, M., Sutton, R. S., and White, A. (2018). Online off-policy prediction.

Gu, S. S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., and Levine, S. (2017). Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.

Hallak, A. and Mannor, S. (2017). Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning.

Hallak, A., Tamar, A., Munos, R., and Mannor, S. (2016). Generalized emphatic temporal difference learning: Bias-variance analysis. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.

Imani, E., Graves, E., and White, M. (2018). An off-policy policy gradient theorem using emphatic weightings. In Advances in Neural Information Processing Systems.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.

Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2019). Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473.

Maei, H. R. (2018). Convergent actor-critic algorithms under off-policy training and function approximation. arXiv preprint arXiv:1802.07842.

Marbach, P. and Tsitsiklis, J. N. (2001). Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature.
Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017a). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017b). Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626.

Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems.

Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning.

Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons.

Schulman, J., Chen, X., and Abbeel, P. (2017). Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.

Silver, D. (2015). Policy gradient methods. URL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction (2nd edition). MIT Press.

Sutton, R. S., Maei, H. R., and Szepesvári, C. (2009). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.

Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning.

White, M. (2017). Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.

Yu, H. (2005). A function approximation approach to estimation of policy gradient for POMDP with structured policies. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence.

Yu, H. (2015). On convergence of emphatic temporal-difference learning. In Conference on Learning Theory.