# Exploration Conscious Reinforcement Learning Revisited

Lior Shani*, Yonathan Efroni*, Shie Mannor

*Equal contribution. Department of Electrical Engineering, Technion, Haifa, Israel. Correspondence to: Lior Shani, Yonathan Efroni. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

The Exploration-Exploitation tradeoff arises in Reinforcement Learning when one cannot tell if a policy is optimal. Then, there is a constant need to explore new actions instead of exploiting past experience. In practice, it is common to resolve the tradeoff by using a fixed exploration mechanism, such as ϵ-greedy exploration or by adding Gaussian noise, while still trying to learn an optimal policy. In this work, we take a different approach and study exploration-conscious criteria that result in optimal policies with respect to the exploration mechanism. Solving these criteria, as we establish, amounts to solving a surrogate Markov Decision Process. We continue and analyze properties of exploration-conscious optimal policies and characterize two general approaches to solving such criteria. Building on the approaches, we apply simple changes to existing tabular and deep Reinforcement Learning algorithms and empirically demonstrate superior performance relative to their non-exploration-conscious counterparts, both for discrete and continuous action spaces.

1. Introduction

The main goal of Reinforcement Learning (RL) (Sutton et al., 1998) is to find an optimal policy for a given decision problem. A major difficulty arises due to the Exploration-Exploitation tradeoff, which characterizes the omnipresent tension between exploring new actions and exploiting the so-far acquired knowledge. A considerable line of work has been devoted to dealing with this tradeoff. Algorithms that explicitly balance between exploration and exploitation were developed for tabular RL (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Jaksch et al., 2010; Osband et al., 2013). However, generalizing these results to approximate RL, i.e., when using function approximation, remains an open problem. On the practical side, recent works combined more advanced exploration schemes in approximate RL (e.g., Bellemare et al. (2016); Fortunato et al. (2017)), inspired by the theory of tabular RL. Nonetheless, even in the presence of more advanced mechanisms, ϵ-greedy exploration is still applied (Bellemare et al., 2017; Dabney et al., 2018; Osband et al., 2016). More generally, the traditional and simpler ϵ-greedy scheme (Sutton et al., 1998; Asadi & Littman, 2016) in discrete RL, and Gaussian action noise in continuous RL, are still very useful and popular in practice (Mnih et al., 2015; 2016; Silver et al., 2014; Schulman et al., 2017; Horgan et al., 2018), especially due to their simplicity.

These types of exploration schemes share common properties. First, they all fix some exploration parameter beforehand, e.g., ϵ, the inverse temperature β, or the action variance σ for the ϵ-greedy, soft-max and Gaussian exploration schemes, respectively. By doing so, the balance between exploration and exploitation is set. Second, they all explore using a random policy, and exploit using the current estimate of the optimal policy. In this work, we follow a different approach when using these fixed exploration schemes: exploiting by using an estimate of the optimal policy w.r.t. the exploration mechanism.
Exploration-Consciousness is the main reason for the improved performance of on-policy methods like Sarsa and Expected-Sarsa (Van Seijen et al., 2009) over Q-learning during training (Sutton et al., 1998)[Example 6.6: Cliff Walking]. Imagine a simple Cliff-Walking problem: the goal of the agent is to reach the end without falling off the cliff, where the optimal policy is to go alongside the cliff. While using a fixed exploration scheme, playing a near-optimal policy which goes alongside the cliff will lead to significantly sub-optimal performance. This, in turn, will hurt the acquisition of new experience needed to learn the optimal policy. However, learning to act optimally w.r.t. the exploration scheme can mitigate this difficulty; the agent learns to reach the goal while keeping a safe enough distance from the cliff.

In the past, tabular q-learning-like exploration-conscious algorithms were suggested (John, 1994; Littman et al., 1997; Van Seijen et al., 2009). Here we take a different approach, and focus on exploration-conscious policies. The main contributions of this work are as follows. We define exploration-conscious optimization criteria, for discrete and continuous action spaces. The criteria are interpreted as finding an optimal policy within a restricted set of policies. Both, we show, can be reduced to solving a surrogate MDP. The surrogate MDP approach, to the best of our knowledge, is a new one, and serves us repeatedly in this work. We formalize a bias-error sensitivity tradeoff: the solutions are biased w.r.t. the optimal policy, yet are less sensitive to approximation errors. We establish two fundamental approaches to practically solve exploration-conscious optimization problems. Based on these, we formulate algorithms in discrete and continuous action spaces, and empirically test the algorithms on the Atari and MuJoCo domains.

2. Preliminaries

Our framework is the infinite-horizon discounted Markov Decision Process (MDP). An MDP is defined as the 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ (Puterman, 1994), where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a compact action space, $P \equiv P(s' \mid s, a)$ is a transition kernel, $R \equiv r(s, a) \in [0, R_{\max}]$ is a bounded reward function, and $\gamma \in [0, 1)$. Let $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ be a stationary policy, where $\mathcal{P}(\mathcal{A})$ is the set of probability distributions on $\mathcal{A}$, and denote by $\Pi$ the set of deterministic policies, $\pi \in \Pi: \mathcal{S} \to \mathcal{A}$. Let $v^\pi \in \mathbb{R}^{|\mathcal{S}|}$ be the value of a policy $\pi$, defined at state $s$ as $v^\pi(s) \equiv \mathbb{E}^{\pi}[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s]$, where $a_t \sim \pi(s_t)$ and the expectation is taken w.r.t. the distribution induced by $\pi$, conditioned on the event $\{s_0 = s\}$. It is known that $v^\pi = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t r^\pi = (I - \gamma P^\pi)^{-1} r^\pi$, with the component-wise values $[P^\pi]_{s,s'} \equiv \mathbb{E}_{a \sim \pi}[P(s' \mid s, a)]$ and $[r^\pi]_s \equiv \mathbb{E}_{a \sim \pi}[r(s, a)]$. Furthermore, the q-function of $\pi$ is given by $q^\pi(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v^\pi(s')$, and represents the value of taking an action $a$ from state $s$ and then using the policy $\pi$.

Usually, the goal is to find $\pi^*$ yielding the optimal value, $\pi^* \in \arg\max_{\pi \in \Pi} \mathbb{E}^{\pi}[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)]$, and the optimal value is $v^* = v^{\pi^*}$. It is known that an optimal deterministic policy always exists (Puterman, 1994). To achieve this goal the following classical operators are defined (with equalities holding component-wise):

$$\forall v, \pi: \quad T^\pi v = r^\pi + \gamma P^\pi v, \qquad T v = \max_\pi T^\pi v, \tag{1}$$
$$G(v) = \{\pi : T^\pi v = T v\}, \tag{2}$$

where $T^\pi$ is a linear operator, $T$ is the optimal Bellman operator, and both $T^\pi$ and $T$ are $\gamma$-contraction mappings w.r.t. the max norm. It is known that the unique fixed points of $T^\pi$ and $T$ are $v^\pi$ and $v^*$, respectively. $G(v)$ is the standard set of 1-step greedy policies w.r.t. $v$. Furthermore, given $v^*$, the set $G(v^*)$ coincides with that of stationary optimal policies.
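These definitions translate directly into code. The following minimal NumPy sketch is an illustration added here (not code from the paper); the random toy MDP and all helper names are our own assumptions. It implements policy evaluation via $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$ and the operators of Eq. (1).

```python
import numpy as np

def random_mdp(n_states=5, n_actions=3, gamma=0.9, seed=0):
    """A small random tabular MDP: P[s, a, s'] and r[s, a]."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)      # normalize the transition kernel
    r = rng.random((n_states, n_actions))   # rewards in [0, 1]
    return P, r, gamma

def policy_matrices(P, r, pi):
    """P^pi and r^pi for a stochastic policy pi[s, a]."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return P_pi, r_pi

def evaluate(P, r, gamma, pi):
    """v^pi = (I - gamma P^pi)^{-1} r^pi."""
    P_pi, r_pi = policy_matrices(P, r, pi)
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

def T_pi(P, r, gamma, pi, v):
    """Fixed-policy Bellman operator, Eq. (1)."""
    P_pi, r_pi = policy_matrices(P, r, pi)
    return r_pi + gamma * P_pi @ v

def T(P, r, gamma, v):
    """Optimal Bellman operator, Eq. (1)."""
    return (r + gamma * P @ v).max(axis=1)

P, r, gamma = random_mdp()
n_s, n_a = r.shape
pi_uniform = np.full((n_s, n_a), 1.0 / n_a)
v = evaluate(P, r, gamma, pi_uniform)
assert np.allclose(T_pi(P, r, gamma, pi_uniform, v), v)  # v^pi is the fixed point of T^pi
```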
It is also useful to define the q-optimal Bellman operator, which is a $\gamma$-contraction with fixed point $q^*$:

$$T^q q(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} q(s', a'). \tag{3}$$

In this work, the use of mixture policies is abundant. We denote the $\alpha \in [0, 1]$-convex mixture of policies $\pi_1, \pi_2$ by $\pi_\alpha(\pi_1, \pi_2) \equiv (1 - \alpha)\pi_1 + \alpha \pi_2$. Importantly, $\pi_\alpha(\pi_1, \pi_2)$ can be interpreted as a stochastic policy s.t. with probability $(1 - \alpha)$ the agent acts with $\pi_1$ and with probability $\alpha$ it acts with $\pi_2$.

3. The α-optimal criterion

In this section, we define the notion of an α-optimal policy w.r.t. a policy $\pi_0$. We then claim that finding an α-optimal policy can be done by solving a surrogate MDP. We continue by defining the surrogate MDP, and analyze some basic properties of the α-optimal policy.

Let $\alpha \in [0, 1]$. We define $\pi^*_{\alpha, \pi_0}$ to be the α-optimal policy w.r.t. $\pi_0$; it is contained in the following set,

$$\pi^*_{\alpha, \pi_0} \in \arg\max_{\pi \in \Pi} \mathbb{E}^{\pi_\alpha(\pi, \pi_0)}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \tag{4}$$

or, $\pi^*_{\alpha, \pi_0} \in \arg\max_{\pi} v^{\pi_\alpha(\pi, \pi_0)}$, where $a_t \sim \pi_\alpha(\pi, \pi_0)$ and $\pi_\alpha(\pi, \pi_0)$ is the α-convex mixture of $\pi$ and $\pi_0$, and thus a probability distribution. For brevity, we omit the subscript $\pi_0$ and denote the α-optimal policy by $\pi^*_\alpha$ throughout the rest of the paper. The α-optimal value (w.r.t. $\pi_0$) is $v^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$, the value of the policy $\pi_\alpha(\pi^*_\alpha, \pi_0)$. In the following, we will see the problem is equivalent to solving a surrogate MDP, for which an optimal deterministic policy is known to exist. Thus, there is no loss in optimizing over the set of deterministic policies $\Pi$.

Optimization problem (4) can be viewed as optimizing over a restricted set of policies: all policies that are an α-convex combination with $\pi_0$, for a fixed $\alpha$. Naturally, we can consider in (4) a state-dependent $\alpha(s)$ as well, and some of the results in this work will consider this scenario. In other words, $\pi^*_\alpha$ is the best policy an agent can act with, if it plays w.p. $(1 - \alpha)$ according to $\pi^*_\alpha$ and w.p. $\alpha$ according to $\pi_0$, where $\pi_0$ can be any policy. The relation to the ϵ-greedy exploration setup becomes clear when $\pi_0$ is a uniform distribution on the actions and $\alpha$ is set to ϵ. Then, $\pi^*_\alpha$ is optimal w.r.t. the ϵ-greedy exploration scheme; the policy would have the largest accumulated reward, relative to all other policies, when acting in an ϵ-greedy fashion w.r.t. it. We choose to name the policy the α-optimal, and not the ϵ-optimal, policy to prevent confusion with other frameworks: the ϵ-optimal policy is a notion used in the context of PAC-MDP analysis (Strehl et al., 2009), and has a different meaning than the objective in this work (4).

3.1. The α-optimal Bellman operator, α-optimal policy and policy improvement

In the previous section, we defined the α-optimal policy and the α-optimal value, $\pi^*_\alpha$ and $v^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$, respectively. We start this section by observing that problem (4) can be viewed as solving a surrogate MDP, denoted by $\mathcal{M}_\alpha$. We define the Bellman operators of the surrogate MDP, and use them to prove an important improvement property. Define the surrogate MDP as $\mathcal{M}_\alpha = (\mathcal{S}, \mathcal{A}, P_\alpha, R_\alpha, \gamma)$. For every $a \in \mathcal{A}$,

$$r_\alpha(s, a) = (1 - \alpha)\, r(s, a) + \alpha\, r^{\pi_0}(s), \qquad P_\alpha(s' \mid s, a) = (1 - \alpha)\, P(s' \mid s, a) + \alpha\, P^{\pi_0}(s' \mid s), \tag{5}$$

are its reward and dynamics, and the rest of its ingredients are identical to those of $\mathcal{M}$. We denote the value of a policy $\pi$ on $\mathcal{M}_\alpha$ by $v^\pi_\alpha$, and the optimal value on $\mathcal{M}_\alpha$ by $v^*_\alpha$.
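Continuing the toy sketch above, the surrogate MDP of Eq. (5) is a one-line transformation of $(P, r)$, and $\pi^*_\alpha$ can then be obtained by ordinary value iteration on $\mathcal{M}_\alpha$. Again, this is an illustrative sketch under our assumed toy MDP, not the paper's code.

```python
import numpy as np  # continues the toy sketch above

def surrogate_mdp(P, r, pi0, alpha):
    """M_alpha of Eq. (5): r_alpha(s, a) and P_alpha(s' | s, a)."""
    r_pi0 = np.einsum('sa,sa->s', pi0, r)        # r^{pi_0}(s)
    P_pi0 = np.einsum('sa,sap->sp', pi0, P)      # P^{pi_0}(s' | s)
    r_alpha = (1 - alpha) * r + alpha * r_pi0[:, None]
    P_alpha = (1 - alpha) * P + alpha * P_pi0[:, None, :]
    return P_alpha, r_alpha

def value_iteration(P, r, gamma, tol=1e-10):
    """Optimal value and a deterministic greedy policy (action indices)."""
    v = np.zeros(r.shape[0])
    while True:
        q = r + gamma * P @ v
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new

# pi_alpha_star: the alpha-optimal (deterministic) policy of problem (4)
pi0 = np.full((n_s, n_a), 1.0 / n_a)             # e.g., uniform exploration policy
alpha = 0.3
P_a, r_a = surrogate_mdp(P, r, pi0, alpha)
v_alpha_star, pi_alpha_star = value_iteration(P_a, r_a, gamma)
```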
The following simple lemma relates the value of a policy $\pi$ measured on $\mathcal{M}$ and on $\mathcal{M}_\alpha$ (see proof in Appendix D).

Lemma 1. For any policy $\pi$, $v^\pi_\alpha = v^{\pi_\alpha(\pi, \pi_0)}$. Thus, an optimal policy of $\mathcal{M}_\alpha$ is the α-optimal policy $\pi^*_\alpha$ (4).

The fixed-policy and optimal Bellman operators of $\mathcal{M}_\alpha$ are denoted by $T^\pi_\alpha$ and $T_\alpha$, respectively. Again, for brevity we omit $\pi_0$ from the definitions. Notice that $T^\pi_\alpha$ and $T_\alpha$ are $\gamma$-contractions, being Bellman operators of a $\gamma$-discounted MDP. The following proposition relates $T^\pi_\alpha$ and $T_\alpha$ to the Bellman operators of the original MDP $\mathcal{M}$. Furthermore, it stresses a non-trivial relation between the α-optimal policy $\pi^*_\alpha$ and the α-optimal value $v^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$.

Proposition 2. The following claims hold for any policy $\pi$:
1. $T^\pi_\alpha = (1 - \alpha) T^\pi + \alpha T^{\pi_0}$, with fixed point $v^\pi_\alpha = v^{\pi_\alpha(\pi, \pi_0)}$.
2. $T_\alpha = (1 - \alpha) T + \alpha T^{\pi_0}$, with fixed point $v^*_\alpha = v^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$.
3. An α-optimal policy is an optimal policy of $\mathcal{M}_\alpha$ and is greedy w.r.t. $v^*_\alpha$, i.e., $\pi^*_\alpha \in G(v^*_\alpha) = \{\pi : T^\pi v^*_\alpha = T v^*_\alpha\}$.

In previous works, e.g., (Asadi & Littman, 2016), the operator $(1 - \epsilon) T + \epsilon T^{\pi_0}$ was referred to as the ϵ-greedy operator. Proposition 2 shows this operator is $T_\alpha$ (with $\alpha = \epsilon$), the optimal Bellman operator of the defined surrogate MDP $\mathcal{M}_\alpha$. This leads to the following important property.

Proposition 3. Let $\alpha \in [0, 1)$, $\beta \in [0, \alpha]$, $\pi_0$ be a policy, and $\pi^*_\alpha$ be the α-optimal policy w.r.t. $\pi_0$. Then, $v^{\pi_0} \le v^{\pi_\alpha(\pi^*_\alpha, \pi_0)} \le v^{\pi_\beta(\pi^*_\alpha, \pi_0)}$, with equality iff $v^{\pi_0} = v^*$.

The first relation, $v^{\pi_0} \le v^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$, i.e., that $\pi_\alpha(\pi^*_\alpha, \pi_0)$ is better than $\pi_0$, is trivial and holds by definition (4). The non-trivial statement is the second one. It asserts that, given $\pi^*_\alpha$, it is worthwhile to use the mixture policy $\pi_\beta(\pi^*_\alpha, \pi_0)$ with $\beta < \alpha$, i.e., to use $\pi_0$ with smaller probability. Specifically, better performance, compared to $\pi_\alpha(\pi^*_\alpha, \pi_0)$, is assured when using the deterministic policy $\pi^*_\alpha$, by setting $\beta = 0$. In Section 6, we demonstrate the empirical consequences of this improvement result, which, to our knowledge, has not yet been stated. Furthermore, the improvement result is unique to the defined optimization criterion (4); we will show that alternative definitions of exploration-conscious criteria do not necessarily have this property. Moreover, one can use Proposition 3 to generalize the notion of the 1-step greedy policy (2), as was done in Efroni et al. (2018) with multiple-step greedy improvement. We leave studying this generalization and its Policy Iteration scheme for future work, and focus on solving (4) a single time.

3.2. Performance bounds in the presence of approximations

We now consider an approximate setting and quantify a bias-error sensitivity tradeoff in $\pi_\alpha(\hat\pi^*_\alpha, \pi_0)$, where $\hat\pi^*_\alpha$ is an approximated α-optimal policy. We formalize an intuitive argument: as α increases, the bias relative to the optimal policy increases, yet the sensitivity to errors decreases, since the agent uses $\pi_0$ w.p. α regardless of errors.

Definition 1. Let $v^*$ be the optimal value of an MDP $\mathcal{M}$. We define $L(s) \equiv v^*(s) - T^{\pi_0} v^*(s) \ge 0$ to be the Lipschitz constant w.r.t. $\pi_0$ of the MDP at state $s$. We further define the upper bound on the Lipschitz constant, $L \equiv \max_s L(s)$.

Definition 1 defines the Lipschitz property of the optimal value $v^*$. Intuitively, $L(s)$ quantifies a degree of smoothness of the optimal value. A small value of $L(s)$ indicates that if one acts according to $\pi_0$ once and then continues playing the optimal policy from state $s$, a great loss will not occur. Large values of $L(s)$ indicate that using $\pi_0$ from state $s$ leads to an irreparable outcome (e.g., falling off a cliff).
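Proposition 3 and Definition 1 are easy to probe numerically. Continuing the toy sketch (the helper names are ours), the snippet below evaluates the mixture $\pi_\beta(\pi^*_\alpha, \pi_0)$ on $\mathcal{M}$ for decreasing $\beta$ and computes $L(s)$ and $L$.

```python
import numpy as np  # continues the toy sketch above

def mixture(pi_det, pi0, beta):
    """pi_beta(pi_det, pi0) as a stochastic policy matrix; pi_det is given as greedy action indices."""
    n_states, n_actions = pi0.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), pi_det] = 1.0
    return (1 - beta) * pi + beta * pi0

# Proposition 3: performance should not decrease as beta is lowered from alpha to 0.
for beta in (alpha, alpha / 2, 0.0):
    v_beta = evaluate(P, r, gamma, mixture(pi_alpha_star, pi0, beta))
    print(f"beta={beta:.2f}  min_s v(s)={v_beta.min():.4f}")

# Definition 1: L(s) = v*(s) - T^{pi0} v*(s) >= 0, and L = max_s L(s).
v_star, _ = value_iteration(P, r, gamma)
L_s = v_star - T_pi(P, r, gamma, pi0, v_star)
L = L_s.max()
```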
The following theorem formalizes the bias-error sensitivity tradeoff: as α increases, the bias increases, while the sensitivity to errors decreases (see proof in Appendix H).

Theorem 4. Let $\alpha \in [0, 1]$. Assume $\hat v_\alpha$ is an approximate α-optimal value s.t. $\|v^*_\alpha - \hat v_\alpha\|_\infty = \delta$ for some $\delta \ge 0$. Let $\hat\pi^*_\alpha$ be the greedy policy w.r.t. $\hat v_\alpha$, $\hat\pi^*_\alpha \in G(\hat v_\alpha)$. Then, the performance relative to the optimal policy is bounded by

$$\|v^* - v^{\pi_\alpha(\hat\pi^*_\alpha, \pi_0)}\|_\infty \;\le\; \underbrace{\frac{\alpha L}{1 - \gamma}}_{\text{Bias}} \;+\; \underbrace{\frac{2\gamma(1 - \alpha)\delta}{1 - \gamma}}_{\text{Sensitivity}}.$$

When the bias of the α-optimal value relative to the optimal one is small, solving (4) does not lead to a great loss relative to the optimal performance. The bias can be bounded by the Lipschitz property $L$ of the MDP. For a state-dependent $\alpha(s)$, the bias bound changes to depend on $\max_s \alpha(s) L(s)$. This highlights the importance of prior knowledge when using (4): choosing $\pi_0$ (possibly state-wise) s.t. $\max_s \alpha(s) L(s)$ is small allows using a bigger α while keeping the bias small. The sensitivity term upper bounds the performance of $\pi_\alpha(\hat\pi^*_\alpha, \pi_0)$ relative to the α-optimal value, and is less sensitive to errors as α increases.

The bias term is derived by using the structure of $\mathcal{M}_\alpha$, and is not a direct application of the Simulation Lemma (Kearns & Singh, 2002; Strehl et al., 2009); applying it would lead to a bias of $\frac{\alpha R_{\max}}{(1 - \gamma)^2}$. For the sensitivity term, we generalize (Bertsekas & Tsitsiklis, 1995)[Proposition 6.1] (see Appendix G); there, a $(1 - \alpha)$ factor does not exist.
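Under the same toy setup, the bias part of Theorem 4 can be compared with the actual optimality gap; the Simulation-Lemma-style bound $\alpha R_{\max}/(1-\gamma)^2$ is included for reference. This is an illustrative check on an assumed random MDP, not an experiment from the paper.

```python
import numpy as np  # continues the toy sketch above

# Exact alpha-optimal value, measured on the original MDP M (Lemma 1).
v_mixture = evaluate(P, r, gamma, mixture(pi_alpha_star, pi0, alpha))

gap = np.max(v_star - v_mixture)                  # ||v* - v^{pi_alpha(pi*_alpha, pi0)}||_inf
bias_bound = alpha * L / (1 - gamma)              # Theorem 4, bias term (delta = 0 here)
naive_bound = alpha * r.max() / (1 - gamma) ** 2  # Simulation-Lemma-style bias, for comparison

print(f"actual gap       : {gap:.4f}")
print(f"Theorem 4 bias   : {bias_bound:.4f}")
print(f"Simulation Lemma : {naive_bound:.4f}")
assert gap <= bias_bound + 1e-8
```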
4. Exploration-Conscious Continuous Control

The α-greedy approach from Section 3 relies on an exploration mechanism which is fixed beforehand: $\pi_0$ and α are fixed, and an optimal policy w.r.t. them is calculated (4). However, in continuous-control RL algorithms, such as DDPG and PPO (Lillicrap et al., 2015; Schulman et al., 2017), a different approach is used. Usually, a policy is being learned, and the exploration noise is injected by perturbing the policy, e.g., by adding Gaussian noise to it. We start this section by defining an exploration-conscious optimality criterion that captures such perturbation for the simple case of Gaussian noise. Then, results from Section 3 are adapted to the newly defined criterion, while highlighting commonalities and differences relative to (4). As in Section 3, we define an appropriate surrogate MDP and show it can be solved by the usual machinery of Bellman operators. Unlike Section 3, we show that improvement when decreasing the stochasticity does not generally hold. Finally, we prove a similar bias-error sensitivity result: as σ grows, the bias increases, but the sensitivity term decreases.

Instead of restricting the set of policies to the one defined in (4), we restrict our set of policies to be the set of Gaussian policies with a fixed variance $\sigma^2$. Formally, we wish to find the optimal deterministic policy $\mu^*_\sigma: \mathcal{S} \to \mathcal{A}$ in this set,

$$\mu^*_\sigma \in \arg\max_{\mu \in \Pi} \mathbb{E}^{\pi_{\mu,\sigma}}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \tag{6}$$

where $\pi_{\mu,\sigma}(\cdot \mid s) = \mathcal{N}(\mu(s), \sigma^2)$ is a Gaussian policy with mean $\mu(s)$ and a fixed variance $\sigma^2$. We name $\mu^*_\sigma$ and $\pi^*_\sigma$ the mean and the σ-optimal policy, respectively. As in (4), we show in the following that solving (6) is equivalent to solving a surrogate MDP. Thus, an optimal policy can always be found in the deterministic class of policies $\Pi$; a mixture of Gaussians would not lead to better performance in (6). Similarly to (5), we define a surrogate MDP $\mathcal{M}_\sigma$ w.r.t. the Gaussian noise and relate it to values of Gaussian policies on the original MDP $\mathcal{M}$. Then, we characterize its Bellman operators and thus establish that it can be solved using Dynamic Programming.

Define the surrogate MDP as $\mathcal{M}_\sigma = (\mathcal{S}, \mathcal{A}, P_\sigma, R_\sigma, \gamma)$. For every $a \in \mathcal{A}$,

$$r_\sigma(s, a) = \int_{\mathcal{A}} \mathcal{N}(a'; a, \sigma)\, r(s, a')\, da', \qquad P_\sigma(s' \mid s, a) = \int_{\mathcal{A}} \mathcal{N}(a'; a, \sigma)\, P(s' \mid s, a')\, da', \tag{7}$$

are its reward and dynamics, and denote the value of a policy $\mu$ on $\mathcal{M}_\sigma$ by $v^\mu_\sigma$. The following results correspond to Lemma 1 and Proposition 2 for the class of Gaussian policies.

Lemma 5. For any deterministic policy $\mu$, $v^{\pi_{\mu,\sigma}} = v^\mu_\sigma$. Thus, an optimal policy on $\mathcal{M}_\sigma$ is the mean optimal policy $\mu^*_\sigma$ (6).

Proposition 6. Let $\pi$ be a mixture of Gaussian policies. Then, the following holds:
1. $T^\mu_\sigma = \mathbb{E}_{\pi \sim \pi_{\mu,\sigma}}\, T^\pi$, with fixed point $v^\mu_\sigma = v^{\pi_{\mu,\sigma}}$.
2. $T_\sigma = \max_{\mu \in \mathcal{A}} \mathbb{E}_{\pi \sim \pi_{\mu,\sigma}}\, T^\pi$, with fixed point $v^*_\sigma = v^{\pi_{\mu^*_\sigma,\sigma}}$.
3. The mean σ-optimal policy $\mu^*_\sigma$ is an optimal policy of $\mathcal{M}_\sigma$ and $\mu^*_\sigma \in \{\mu : T^{\pi_{\mu,\sigma}} v^*_\sigma = \max_{\mu'} T^{\pi_{\mu',\sigma}} v^*_\sigma\}$.

Surprisingly, given the mean $\mu^*_\sigma$ of a σ-optimal policy, an improvement is not assured when lowering the stochasticity by decreasing σ in $\pi_{\mu^*_\sigma,\sigma}$. This comes in contrast to Proposition 3 and highlights its uniqueness (proof in Appendix J).

Proposition 7. Let $0 \le \sigma' < \sigma$ and let $\mu^*_\sigma$ be the mean σ-optimal policy. There exists an MDP s.t. $v^{\pi_{\mu^*_\sigma,\sigma'}} < v^{\pi_{\mu^*_\sigma,\sigma}}$.

Definition 2. Let $\mathcal{M}$ be a continuous-action-space MDP. Assume there exist $L_r, L_p \ge 0$ s.t. $\forall s \in \mathcal{S},\, a_1, a_2 \in \mathcal{A}$, $|r(s, a_1) - r(s, a_2)| \le L_r \|a_1 - a_2\|_1$ and $\|p(\cdot \mid s, a_1) - p(\cdot \mid s, a_2)\|_{TV} \le L_p \|a_1 - a_2\|_1$. The Lipschitz constant of $\mathcal{M}$ is $L \equiv (1 - \gamma) L_r + \gamma L_p R_{\max}$.

The following theorem quantifies a bias-error sensitivity tradeoff in σ, similarly to Theorem 4 (see Appendix K).

Theorem 8. Let $\mathcal{M}$ be an MDP with Lipschitz constant $L$ and let $\sigma \in \mathbb{R}^{|\mathcal{A}|}_+$. Let $v^*_\sigma$ be the σ-optimal value of $\mathcal{M}_\sigma$. Let $\hat v_\sigma$ be an approximation of $v^*_\sigma$ s.t. $\|v^*_\sigma - \hat v_\sigma\|_\infty = \delta$ for $\delta \ge 0$. Let $\mu^*_\sigma, \hat\mu^*_\sigma \in \mathbb{R}^{\mathcal{A}}$ be the greedy mean policies w.r.t. $v^*_\sigma$ and $\hat v_\sigma$, respectively, and let $\|\cdot\|_{\sigma^{-2}}$ be the $\sigma^{-2}$-weighted Euclidean norm. Then,

$$\|v^* - v^{\hat\pi^*_\sigma}\|_\infty \;\le\; \underbrace{\frac{\sqrt{2/\pi}\, L \|\sigma\|_1}{(1 - \gamma)^2}}_{\text{Bias}} \;+\; \underbrace{\frac{\gamma\delta \min\{\tfrac{1}{2}\|\mu^*_\sigma - \hat\mu^*_\sigma\|_{\sigma^{-2}},\, 2\}}{1 - \gamma}}_{\text{Sensitivity}}.$$
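For intuition, the σ-optimal criterion (6)-(7) can be probed on a single-state problem (a continuous bandit), where $v^{\pi_{\mu,\sigma}}$ reduces to the smoothed reward $r_\sigma(\mu)$. The sketch below is our own toy example (the reward function and all names are assumptions, not from the paper): it estimates $r_\sigma$ by Monte Carlo and shows that the σ-optimal mean can differ from the noise-free maximizer, since a sharp "cliff" in the reward pushes the smoothed optimum away from the drop.

```python
import numpy as np

def reward(a):
    """Toy 1-D reward with a sharp 'cliff': high reward near a = 0.8, disaster just above it."""
    return np.where(a > 0.85, -1.0, np.exp(-80.0 * (a - 0.8) ** 2))

def smoothed_reward(mu, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of r_sigma(mu) = E_{a ~ N(mu, sigma^2)}[r(a)], cf. Eq. (7)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(mu, sigma, size=n_samples)
    return reward(a).mean()

grid = np.linspace(0.0, 1.0, 201)
mu_star = grid[np.argmax(reward(grid))]                                   # noise-free maximizer
mu_sigma_star = grid[np.argmax([smoothed_reward(m, 0.1) for m in grid])]  # sigma-optimal mean
print(f"noise-free optimum mu*       ~ {mu_star:.2f}")
print(f"sigma-optimal mean mu*_sigma ~ {mu_sigma_star:.2f}")
```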
5. Algorithms

In this section, we offer two fundamental approaches to solving exploration-conscious criteria using sample-based algorithms: the Expected and the Surrogate approaches. For both, we formulate converging, q-learning-like algorithms. Next, by adapting DDPG, we show the two approaches can be used in exploration-conscious continuous control as well.

Consider any fixed exploration scheme. Generally, these schemes operate in two stages: (i) choose a greedy action, $a_{\text{chosen}}$; (ii) based on $a_{\text{chosen}}$ and some randomness generator, choose an action to be applied to the environment, $a_{\text{env}}$. E.g., for ϵ-greedy exploration, w.p. $1 - \alpha$ the agent acts with $a_{\text{chosen}}$, and otherwise with a random uniform policy. While in RL the common update rules use $a_{\text{env}}$, and the saved experience is $(s, a_{\text{env}}, r, s')$, in the following we motivate the use of $a_{\text{chosen}}$, and view the data as $(s, a_{\text{chosen}}, a_{\text{env}}, r, s')$.

The two approaches characterized in the following are based on two inequivalent ways to define the q-function. For the Expected approach, the q-function is defined as usual: $q^\pi(s, a)$ represents the value obtained when taking an action $a = a_{\text{env}}$ and then acting with $\pi$, meaning $a$ is the action chosen in step (ii). Alternatively, for the Surrogate approach, the q-function is defined on the surrogate MDP, i.e., the exploration is viewed as stochasticity of the environment. Then, $q^\pi_\alpha(s, a)$ is the value obtained when $a$ is the action of step (i), i.e., choosing action $a = a_{\text{chosen}}$.

5.1. Exploration Conscious Q-Learning

We focus on solving for the α-optimal policy (4), and formulate q-learning-like algorithms using the two aforementioned approaches. The Expected α-optimal q-function is

$$q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}(s, a) \equiv r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v^*_\alpha(s'). \tag{8}$$

Indeed, $q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$ is the usually defined q-function of the policy $\pi_\alpha(\pi^*_\alpha, \pi_0)$ on the MDP $\mathcal{M}$. Here, the action $a$ represents the actually performed action, $a_{\text{env}}$. By relating $q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$ to $v^*_\alpha$, it can be easily verified that $q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$ satisfies the fixed-point equation (see Appendix L)

$$q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}(s, a) = r(s, a) + \gamma(1 - \alpha)\sum_{s'} P(s' \mid s, a)\max_{a'} q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}(s', a') + \gamma\alpha \sum_{s', a'} P(s' \mid s, a)\, \pi_0(a' \mid s')\, q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}(s', a'). \tag{9}$$

Alternatively, consider the optimal q-function of the surrogate MDP $\mathcal{M}_\alpha$ (5). It satisfies the fixed-point equation $q^*_\alpha(s, a) \equiv r_\alpha(s, a) + \gamma \sum_{s'} P_\alpha(s' \mid s, a) \max_{a'} q^*_\alpha(s', a')$. The following lemma formalizes the relation between the two q-functions, and shows they are related by a function of the state, and not of the action.

Lemma 9. $q^*_\alpha(s, a) = (1 - \alpha)\, q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}(s, a) + f(s)$.

The α-optimal policy $\pi^*_\alpha$ is also an optimal policy of $\mathcal{M}_\alpha$ (Lemma 1). Thus, it is greedy w.r.t. $q^*_\alpha$, the optimal q-function of $\mathcal{M}_\alpha$. By Proposition 2.3 it is also greedy w.r.t. $q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$, i.e.,

$$\pi^*_\alpha(s) \in \arg\max_{a'} q^*_\alpha(s, a') = \arg\max_{a'} q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}(s, a').$$

Lemma 9 describes this fact by different means: the two q-functions are related by a function of the state and, thus, the greedy action w.r.t. each is the same. Furthermore, it stresses the fact that the two q-functions are not equal. Before describing the algorithms, we define the following notation: for any $q(s, a)$, $v(s) = \max_{a'} q(s, a')$ and $v^\pi(s) = \sum_{a'} \pi(a' \mid s)\, q(s, a')$.

We now describe the Expected α-Q-learning algorithm (see Algorithm 1), also given in (John, 1994; Littman et al., 1997), and re-interpret it in light of the previous discussion. The fixed-point equation (9) leads us to define the operator $T^{Eq}_\alpha$ for which $q^{\pi_\alpha(\pi^*_\alpha, \pi_0)} = T^{Eq}_\alpha q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$. Expected α-Q-learning (Alg. 1) is a Stochastic Approximation (SA) algorithm based on the operator $T^{Eq}_\alpha$. Given a sample of the form $(s, a_{\text{chosen}}, a_{\text{env}}, r, s')$, it updates $q(s, a_{\text{env}})$ by

$$q(s, a_{\text{env}}) \leftarrow (1 - \eta)\, q(s, a_{\text{env}}) + \eta\,\big(r_t + \gamma((1 - \alpha)\, v(s_{t+1}) + \alpha\, v^{\pi_0}(s_{t+1}))\big). \tag{10}$$

Algorithm 1 Expected α-Q-Learning
  Initialize: α ∈ [0, 1], π0, q, learning rates ηt.
  for t = 0, 1, ... do
    a_chosen ← argmax_a q_t(s_t, a)
    X_t ∼ Bernoulli(1 − α)
    a_env ← a_chosen if X_t = 1;  a_env ∼ π0(· | s_t) if X_t = 0
    r_t, s_{t+1} ← ACT(a_env)
    y_t ← r_t + γ(1 − α) v_t(s_{t+1}) + γα v_t^{π0}(s_{t+1})
    q(s_t, a_env) ← (1 − η_t) q(s_t, a_env) + η_t y_t
  end for
  return: π ∈ argmax_a q(·, a)

Its convergence proof is standard and follows by showing $T^{Eq}_\alpha$ is a $\gamma$-contraction and using (Bertsekas & Tsitsiklis, 1995)[Proposition 4.4] (see proof in Appendix L.1).
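A compact NumPy sketch of Algorithm 1 is given below. The minimal `env.reset()` / `env.step(a)` interface returning `(s', r, done)` and the uniform $\pi_0$ are assumptions of this sketch, not part of the paper; the updates follow the pseudocode and Eq. (10).

```python
import numpy as np

def expected_alpha_q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.99,
                              eta=0.1, n_steps=100_000, seed=0):
    """Tabular sketch of Algorithm 1 (Expected alpha-Q-learning) with a uniform pi_0."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    pi0 = np.full(n_actions, 1.0 / n_actions)        # uniform exploration policy
    s = env.reset()                                  # assumed env interface
    for _ in range(n_steps):
        a_chosen = int(q[s].argmax())
        x = rng.random() < 1 - alpha                 # X_t ~ Bernoulli(1 - alpha)
        a_env = a_chosen if x else int(rng.integers(n_actions))
        s_next, r, done = env.step(a_env)
        v_next = q[s_next].max()                     # v(s_{t+1})
        v_pi0_next = q[s_next] @ pi0                 # v^{pi_0}(s_{t+1})
        target = r if done else r + gamma * ((1 - alpha) * v_next + alpha * v_pi0_next)
        q[s, a_env] += eta * (target - q[s, a_env])  # update of Eq. (10)
        s = env.reset() if done else s_next
    return q.argmax(axis=1), q                       # greedy policy w.r.t. q
```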
We now turn to describe an alternative algorithm, which operates on the surrogate MDP $\mathcal{M}_\alpha$ and converges to $q^*_\alpha$. Naively, given a sample $(s, a_{\text{chosen}}, r, s')$, regular q-learning on $\mathcal{M}_\alpha$ can be used by updating $q(s, a_{\text{chosen}})$ as

$$q(s, a_{\text{chosen}}) \leftarrow (1 - \eta_t)\, q(s, a_{\text{chosen}}) + \eta_t\,(r_t + \gamma\, v(s_{t+1})). \tag{11}$$

Yet, this approach does not utilize meaningful knowledge: when the exploration policy $\pi_0$ is played, i.e., when $X_t = 0$, the sample $(r_t, s_{t+1})$ can be used to update all the action entries of the current state, since these entries are also affected by the policy $\pi_0$. In fact, we cannot prove the convergence of the naive update based on current techniques; if the greedy action is repeatedly chosen, infinitely many visits to all $(s, a)$ pairs cannot be guaranteed. This reasoning leads us to formulate Surrogate α-Q-learning (see Algorithm 2).

Algorithm 2 Surrogate α-Q-Learning
  Initialize: α ∈ [0, 1], π0, qα, q, learning rates ηt.
  for t = 0, 1, ... do
    a_chosen ← argmax_a q(s_t, a)
    X_t ∼ Bernoulli(1 − α)
    a_env ← a_chosen if X_t = 1;  a_env ∼ π0(· | s_t) if X_t = 0
    r_t, s_{t+1} ← ACT(a_env)
    for a ∈ A do
      y_t^a ← r_t + γ v_α(s_{t+1})                               if a = a_chosen
      y_t^a ← X_t q(s_t, a) + (1 − X_t)(r_t + γ v_α(s_{t+1}))     otherwise
      q_α(s_t, a) ← (1 − η_t) q_α(s_t, a) + η_t y_t^a
    end for
    y_t ← r_t + γ(1 − α) v(s_{t+1}) + γα v^{π0}(s_{t+1})
    q(s_t, a_env) ← (1 − η_t) q(s_t, a_env) + η_t y_t
  end for
  return: π ∈ argmax_a q_α(·, a)

The Surrogate α-Q-learning algorithm updates two q-functions, $q$ and $q_\alpha$. The first, $q$, has the same update as in Expected α-Q-learning, and thus converges (w.p. 1) to $q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$. The second, $q_\alpha$, updates the chosen greedy action using equation (11) when the exploration policy is not played ($X_t = 1$). By bootstrapping on $q$, the algorithm updates all other actions when the exploration policy $\pi_0$ is played ($X_t = 0$). Using (Singh et al., 2000)[Lemma 1], the convergence of Surrogate α-Q-learning to $(q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}, q^*_\alpha)$ is established (see proof in Appendix L.2). Interestingly, and unlike other q-learning algorithms (e.g., Expected α-Q-learning, Q-learning, etc.), Surrogate α-Q-learning updates the entire action set given a single sample. For completeness, we state the convergence result for both algorithms.

Theorem 10. Consider the processes described in Algorithms 1 and 2. Assume $\{\eta_t\}_{t=0}^{\infty}$ satisfies, for all $s \in \mathcal{S}, a \in \mathcal{A}$, $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$, where $\eta_t \equiv \eta_t(s_t = s, a_{\text{env},t} = a)$. Then, for both Algorithms 1 and 2 the sequence $\{q_n\}_{n=0}^{\infty}$ converges w.p. 1 to $q^{\pi_\alpha(\pi^*_\alpha, \pi_0)}$, and for Algorithm 2, $\{q_{\alpha,n}\}_{n=0}^{\infty}$ converges w.p. 1 to $q^*_\alpha$.

5.2. Continuous Control

Building on the two approaches for solving exploration-conscious criteria, we suggest two techniques to find an optimal Gaussian policy (6) using gradient-based Deep RL (DRL) algorithms, and specifically DDPG (Lillicrap et al., 2015). Nonetheless, the techniques are generalizable to other actor-critic DRL algorithms (Schulman et al., 2017). Assume we wish to find an optimal Gaussian policy by parameterizing its mean $\mu_\phi$. Nachum et al. (2018)[Eq. 13] showed the gradient of the value w.r.t. $\phi$ is, similarly to Silver et al. (2014),

$$\int_{\mathcal{S}} \nabla_a q^{\mu}_\sigma(s, a)\big|_{a = \mu_\phi(s)}\, \nabla_\phi \mu_\phi(s)\, d\rho^{\pi_{\mu,\sigma}}(s), \tag{12}$$

where $q^\mu_\sigma(s, a) = r_\sigma(s, a) + \gamma \int_{\mathcal{S}} p_\sigma(s' \mid s, a)\, v^{\pi_{\mu,\sigma}}(s')\, ds'$ is the q-function of the surrogate MDP. In light of the previous section, we interpret $q^\mu_\sigma$ as the q-function of the surrogate MDP $\mathcal{M}_\sigma$ (7). Furthermore, we have the following relation between the surrogate and expected q-functions, $q^\mu_\sigma(s, a) = \int_{\mathcal{A}} \mathcal{N}(a' \mid a, \sigma)\, q^{\pi_{\mu,\sigma}}(s, a')\, da'$, from which it is easy to verify that (see Appendix L.3)

$$\nabla_a q^{\mu}_\sigma(s, a) = \int_{\mathcal{A}} \mathcal{N}(b \mid a, \sigma)\, \nabla_b q^{\pi_{\mu,\sigma}}(s, b)\, db. \tag{13}$$

Thus, we can update the actor in two inequivalent ways: by using gradients of the surrogate MDP's q-function (12), or by using gradients of the expected q-function (13). The updates of the critic, $q^\mu_\sigma$ or $q^{\pi_{\mu,\sigma}}$, can be done using the same notion that led to the two forms of updates in (11) and (10). When using Gaussian noise, one performs the two stages defined in Section 5, where $a_{\text{chosen}}$ is the output of the actor, $\mu(s)$, and $a_{\text{env}} \sim \mathcal{N}(a_{\text{chosen}}, \sigma)$. Then, the sample $(s, a_{\text{chosen}}, a_{\text{env}}, r, s')$ is obtained by interacting with the environment. Based on the fixed-policy TD-error defined in (11), we define the following loss function for learning $q^\mu_\sigma$, the q-function of the fixed policy $\mu$ over $\mathcal{M}_\sigma$:

$$\Big(q^\theta_\sigma(s, a_{\text{chosen}}) - r - \gamma\, q^{\theta'}_\sigma(s', \mu_{\phi'}(s'))\Big)^2.$$

On the other hand, we can define a loss function derived from the fixed-policy TD-error defined in (10) for learning $q^{\pi_{\mu,\sigma}}$, the q-function of the Gaussian policy with mean $\mu$ and variance $\sigma^2$ over $\mathcal{M}$:

$$\Big(q^\theta(s, a_{\text{env}}) - r - \gamma \int_{\mathcal{A}} \mathcal{N}(b \mid \mu_{\phi'}(s'), \sigma)\, q^{\theta'}(s', b)\, db\Big)^2.$$
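As a rough illustration of the two critic targets of Section 5.2, the PyTorch-style snippet below contrasts them for a batch of transitions $(s, a_{\text{chosen}}, a_{\text{env}}, r, s')$. The critic signature `q(state, action)`, the target networks, and the Monte Carlo approximation of the integral over $\mathcal{N}(\cdot \mid \mu_{\phi'}(s'), \sigma)$ are our own illustrative choices, not the paper's exact implementation.

```python
import torch

def surrogate_critic_loss(q, q_targ, mu_targ, batch, gamma):
    """TD loss for q_sigma: bootstrap at the *chosen* (pre-noise) action, cf. Eq. (11)."""
    s, a_chosen, a_env, r, s_next, done = batch          # tensors; r, done have shape (B,)
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_targ(s_next, mu_targ(s_next)).squeeze(-1)
    return ((q(s, a_chosen).squeeze(-1) - y) ** 2).mean()

def expected_critic_loss(q, q_targ, mu_targ, batch, gamma, sigma, n_mc=8):
    """TD loss for q^{pi_mu,sigma}: bootstrap at the *executed* action and average the
    target over Gaussian perturbations of mu(s'), a Monte Carlo version of Eq. (10)."""
    s, a_chosen, a_env, r, s_next, done = batch
    with torch.no_grad():
        mu_next = mu_targ(s_next)                                        # (B, act_dim)
        noise = sigma * torch.randn(n_mc, *mu_next.shape, device=mu_next.device)
        a_next = mu_next.unsqueeze(0) + noise                            # (n_mc, B, act_dim)
        s_rep = s_next.unsqueeze(0).expand(n_mc, *s_next.shape).reshape(-1, s_next.shape[-1])
        q_next = q_targ(s_rep, a_next.reshape(-1, mu_next.shape[-1]))
        q_next = q_next.view(n_mc, -1).mean(dim=0)                       # ~ Gaussian integral
        y = r + gamma * (1 - done) * q_next
    return ((q(s, a_env).squeeze(-1) - y) ** 2).mean()
```

In the surrogate variant, the only other change (as noted in Section 6) is storing $a_{\text{chosen}}$ rather than $a_{\text{env}}$ in the replay buffer.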
6. Experiments

In this section, we test the theory and algorithms¹ suggested in this work. In all experiments we used γ = 0.99. The tested DRL algorithms in this section (see Appendix B) are simple variations of DDQN (Van Hasselt et al., 2016) and DDPG (Lillicrap et al., 2015), without any parameter tuning, and based on Section 5. For example, for the surrogate approach in both DDQN and DDPG we merely save $(s, a_{\text{chosen}}, r, s')$ instead of $(s, a_{\text{env}}, r, s')$ in the replay buffer (see Section 5 for definitions of $a_{\text{env}}$, $a_{\text{chosen}}$). We observe significantly improved empirical performance, both in training and in evaluation, for both the surrogate and expected approaches relative to the baseline performance. The improved training performance is predictable; the learned policy is optimal w.r.t. the noise which is being played. In a large portion of the results, the exploration-conscious criteria lead to better performance in evaluation.

¹ Implementation of the proposed algorithms can be found at https://github.com/shanlior/ExplorationConsciousRL.

6.1. Exploration Consciousness with Prior Knowledge

We use an adaptation of the Cliff-Walking maze (Sutton et al., 1998), which we term T-Cliff-Walking (see Appendix C).

[Figure 1: two panels of learning curves, E_{s∼U}[v*(s) − v_t(s)] vs. iteration (×10^5), for Q-learning and the α-optimal algorithms, as mixture and greedy policies.]
Figure 1. T-Cliff-Walking for the expected (E) and surrogate (S) approaches. (Left) α = 0.3. (Right) α(s) from prior knowledge.

The agent starts at the bottom-left side of a maze, and needs to get to the bottom-right-side goal state with value +1. If the agent falls off the cliff, the episode terminates with reward −1. When the agent visits any of the first three steps on top of the cliff, it gets a reward of $0.01(1 - \gamma)$. We tested Expected α-Q-learning and Surrogate α-Q-learning, and compared their performance to Q-learning in the presence of ϵ-greedy exploration. Figure 1 stresses the typical behaviour of the α-optimality criterion. It is easier to approximate $\pi_\alpha(\pi^*_\alpha, \pi_0)$ than the optimal policy. Further, by being exploration-conscious, the value of the approximated policy improves faster using the α-optimal algorithms; the agent learns faster which regions to avoid. As Theorem 4 suggests, the value of the learned policy is biased w.r.t. $v^*$. Next, as suggested by Proposition 3, acting greedily w.r.t. the approximated value attains better performance. Such improvement is not guaranteed while the value has not yet converged to $v^*_\alpha$. However, the empirical results suggest that if the agent performs well with the mixture policy, it is worth using the greedy policy.

We show that it is possible to incorporate prior knowledge to decrease the bias caused by being exploration-conscious. The T-Cliff-Walking example demands high exploration, α = ϵ = 0.3, because of the bottleneck state between the two sides of the maze. The α-optimal policy in such a case is to stay at the left part of the maze. We used the prior knowledge that L(s) close to the barrier is high. The knowledge was injected through the choice of α, i.e., we chose a state-wise exploration scheme with α(s) = ϵ(s) = 0.1 in the passage and the two states around it, and α(s) = 0.3 elsewhere, for all three algorithms. The results in Figure 1 suggest that using prior knowledge to set α(s) can increase the performance by reducing the bias. In contrast, such prior knowledge does not help the baseline q-learning.

6.2. Exploration Consciousness in Atari

We tested the α-optimal criterion in the more complex function approximation setting (see Appendix Alg. 3, 4).
We used five Atari 2600 games from the ALE (Bellemare et al., 2013). We chose games that resemble the Cliff-Walking scenario, where the wrong choice of action can lead to a sudden termination of the episode. Thus, being unaware of the exploration strategy can lead to poor training results. We used the same deep neural network as in DQN (Mnih et al., 2015), using the OpenAI Baselines implementation (Dhariwal et al., 2017), without any parameter tuning, except for the update equations. We chose to use the Double-DQN variant of DQN (Van Hasselt et al., 2016) for simplicity and generality. Nonetheless, changing the optimality criterion is orthogonal to any of the suggested add-ons to DQN (Hessel et al., 2017). We used α = ϵ = 0.01 in the train phase, and ϵ = 0.001 in the evaluation phase. For the surrogate version, we used a naive implementation based on equation (11).

Table 1. Train and test rewards for the Atari 2600 environment, with 90% confidence intervals.

| Game (train) | DDQN | Expected α-DDQN | Surrogate α-DDQN |
| --- | --- | --- | --- |
| Breakout | 350 ± 4 | 356 ± 6 | 357 ± 4 |
| Fishing Derby | −45 ± 9 | −35 ± 27 | −8 ± 8 |
| Frostbite | 1191 ± 171 | 794 ± 158 | 1908 ± 162 |
| Qbert | 13221 ± 565 | 13431 ± 178 | 14240 ± 225 |
| Riverraid | 8602 ± 205 | 8811 ± 645 | 11476 ± 79 |

| Game (test) | DDQN | Expected α-DDQN | Surrogate α-DDQN |
| --- | --- | --- | --- |
| Breakout | 402 ± 14 | 390 ± 5 | 392 ± 5 |
| Fishing Derby | −37 ± 15 | −19 ± 34 | −3 ± 19 |
| Frostbite | 1720 ± 191 | 1638 ± 292 | 2686 ± 278 |
| Qbert | 15627 ± 497 | 15780 ± 206 | 16082 ± 338 |
| Riverraid | 9049 ± 443 | 9491 ± 802 | 12846 ± 241 |

Table 1 shows that our method improves upon using the standard optimality criterion: while bias exists, the algorithm still converges to a better policy. This result holds both in the exploratory training regime and in the evaluation regime. Again, acting greedily w.r.t. the approximation of the α-optimal policy proved beneficial: the evaluation-phase results surpass the train-phase results, as shown in the table and in the training figures in the Appendix (Figure 2). The evaluation is usually done with ϵ = 0.001 > 0. Proposition 3 puts the use of a smaller ϵ in the evaluation phase than in the training phase on formal grounds; improvement is assured. Being accurate is extremely important in most Atari games, so Exploration Consciousness can also hurt the performance. Still, one can use prior knowledge to overcome this obstacle.

6.3. Exploration Consciousness in MuJoCo

We tested Expected σ-DDPG and Surrogate σ-DDPG (see Appendix Alg. 5, 6) on continuous control tasks from the MuJoCo environment (Todorov et al., 2012). We used the OpenAI implementation of DDPG as the baseline, where we only changed the update equations to match our proposed algorithms. We used the default hyper-parameters, and independent Gaussian noise with σ = 0.2, for all tasks and algorithms.

Table 2. Train and test rewards for the MuJoCo environment.

| Task (train) | DDPG | Expected σ-DDPG | Surrogate σ-DDPG |
| --- | --- | --- | --- |
| Ant | 809 ± 47 | 1013 ± 49 | 993 ± 110 |
| HalfCheetah | 2255 ± 804 | 2634 ± 828 | 3848 ± 248 |
| Hopper | 1864 ± 139 | 1866 ± 132 | 2566 ± 155 |
| Humanoid | 1281 ± 142 | 1416 ± 155 | 1703 ± 272 |
| InvertedPendulum | 694 ± 109 | 882 ± 33 | 998 ± 3 |
| Walker | 1722 ± 170 | 2144 ± 145 | 2587 ± 214 |

| Task (test) | DDPG | Expected σ-DDPG | Surrogate σ-DDPG |
| --- | --- | --- | --- |
| Ant | 1611 ± 120 | 1924 ± 126 | 1754 ± 184 |
| HalfCheetah | 2729 ± 936 | 3147 ± 986 | 4579 ± 298 |
| Hopper | 3099 ± 113 | 3071 ± 50 | 3037 ± 78 |
| Humanoid | 1688 ± 223 | 1994 ± 389 | 2154 ± 408 |
| InvertedPendulum | 999 ± 2 | 1000 ± 0 | 1000 ± 0 |
| Walker | 3031 ± 298 | 3315 ± 147 | 3501 ± 240 |
This suggests that improvement can be expected on certain scenarios, although that generally it is not guaranteed. We also found that the training process was faster using the σ-optimal algorithms, as can be seen in the learning curves in Appendix 3. Interestingly, again, the surrogate approach proved superior. 7. Relation to existing work Lately, several works have tackled the exploration problem for deep RL. In some, like Bootstrapped-DQN (see appendix [D.1] in (Osband et al., 2016)), the authors still employ an ϵ-greedy mechanism on top of their methods. Moreover, methods like Distributional-DQN (Bellemare et al., 2017; Dabney et al., 2018) and the state-of-the-art Ape-X DQN (Horgan et al., 2018), still uses ϵ-greedy and Gaussian noise, for discrete and continuous actions, respectively. Hence, all the above works are applicable for the α-optimal criterion by using the simple techniques described in Section 5. Existing on-policy methods produce variants of Exploration Consciousness. In TRPO and A3C (Schulman et al., 2015; Mnih et al., 2016), the exploration is implicitly injected into the agent policy through entropy regularization, and the agent improves upon the value of the explorative policy. Simple derivation shows the α-greedy and the Gaussian approaches are both equivalent to regularizing the entropy to be higher than a certain value by setting α or σ appropriately. Expected α-Q-learning highlights a relation to algorithms analysed in (John, 1994; Littman et al., 1997) and to Expected-Sarsa (ES) (Van Seijen et al., 2009). The focus of (John, 1994; Littman et al., 1997) is exploration-conscious q-based methods. In ES, when setting the estimation policy (Van Seijen et al., 2009) to be π = (1 αt)πG + αtπ0, we get similar updating equations as in lines 1-1, and similarly to (John, 1994; Littman et al., 1997). However, in ES αt decays to zero, and the optimal policy is obtained in the infinite time limit. In (Nachum et al., 2018), the authors offer a gradient based mechanism for updating the mean and variance of the actor. Here, we offer and analyze the approach of setting αt and σt to a constant value. This would be of interest especially when a good mechanism for decaying αt and σt lacks; the decay mechanism is usually chosen by trial-and-error, and is not clear how it should be set. Lastly, (4) and (6) can be understood as defining a surrogate problem , rather than finding an optimal policy. In this sense, it offers an alternative approach to biasing the problem by lowering the discount-factor, i.e., solve a surrogate MDP with γ < γ (Petrik & Scherrer, 2009; Jiang et al., 2015). Interestingly, the introduced bias when solving (4) is proportional to a local property of v , L(s), that can be estimated using prior-knowledge on the MDP, where solving an MDP with γ introduces a bias proportional to a non-local term, which is harder to estimate. More importantly, the performance of an α-optimal policy π α is assured to improve when tested on the original MDP M (Proposition 3), while the performance of an optimal policy in an MDP with γ might decline when tested on M with γ-discounting. In this paper, we revisited the notion of an agent being conscious to an exploration process. To our view, this notion did not receive the proper attention, though it is implicitly and repeatedly used. We started by formally defining optimal policy w.r.t. an exploration mechanism (4), (6). 
This expanded the view on exploration-conscious q-learning (John, 1994; Littman et al., 1997) to a more general one, and led us to derive new algorithms, as well as to re-interpret existing ones (Van Seijen et al., 2009). We formulated the surrogate MDP notion, which helped us establish that exploration-conscious criteria can be solved by Dynamic Programming, or, more generally, by an MDP solver. On the practical side, based on the theory, we tested DRL algorithms obtained by simply modifying existing ones, with no further hyper-parameter tuning, and empirically showed their superiority. Although a bias-error sensitivity tradeoff was formulated, we did not prove that (4), (6) are easier to solve than an MDP. We believe determining whether this claim is true is of interest. Furthermore, analyzing more exploration-conscious criteria, e.g., exploration-consciousness w.r.t. Ornstein-Uhlenbeck noise, is of interest, as well as defining a unified framework for exploration-conscious criteria.

Acknowledgments

We would like to thank Chen Tessler, Nadav Merlis and Tom Zahavy for helpful discussions.

References

Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. arXiv preprint arXiv:1612.05628, 2016.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471-1479, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: an overview. In Decision and Control, 1995, Proceedings of the 34th IEEE Conference on, volume 1, pp. 560-564. IEEE, 1995.

Brafman, R. I. and Tennenholtz, M. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213-231, 2002.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Efroni, Y., Dalal, G., Scherrer, B., and Mannor, S. Beyond the one-step greedy approach in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 1386-1395, 2018.

Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., and Silver, D. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563-1600, 2010.
Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181-1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

John, G. H. When the best move isn't optimal: Q-learning with exploration. Citeseer, 1994.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Littman, M. L. et al. Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. 1997.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

Nachum, O., Norouzi, M., Tucker, G., and Schuurmans, D. Smoothed action value functions for learning Gaussian policies. arXiv preprint arXiv:1803.02348, 2018.

Osband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003-3011, 2013.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026-4034, 2016.

Petrik, M. and Scherrer, B. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pp. 1265-1272, 2009.

Puterman, M. L. Markov Decision Processes. J. Wiley and Sons, 1994.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.

Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287-308, 2000.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413-2444, 2009.

Sutton, R. S., Barto, A. G., et al. Reinforcement Learning: An Introduction. MIT Press, 1998.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026-5033. IEEE, 2012.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, volume 2, pp. 5. Phoenix, AZ, 2016.

Van Seijen, H., Van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected sarsa. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL '09. IEEE Symposium on, pp. 177-184. IEEE, 2009.