# Munchausen Reinforcement Learning

Nino Vieillard, Google Research, Brain Team; Université de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France. vieillard@google.com
Olivier Pietquin, Google Research, Brain Team. pietquin@google.com
Matthieu Geist, Google Research, Brain Team. mfgeist@google.com

## Abstract

Bootstrapping is a core mechanism in Reinforcement Learning (RL). Most algorithms, based on temporal differences, replace the true value of a transiting state by their current estimate of this value. Yet, another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution stands in a very simple idea: adding the scaled log-policy to the immediate reward. We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games, without making use of distributional RL, n-step returns or prioritized replay. To demonstrate the versatility of this idea, we also use it together with an Implicit Quantile Network (IQN). The resulting agent outperforms Rainbow on Atari, establishing a new state of the art with very little modification to the original algorithm. To add to this empirical study, we provide strong theoretical insights on what happens under the hood: implicit Kullback-Leibler regularization and increase of the action-gap.

## 1 Introduction

Most Reinforcement Learning (RL) algorithms make use of Temporal Difference (TD) learning [29] in some way. It is a well-known bootstrapping mechanism that consists in replacing the unknown true value of a transiting state by its current estimate and using it as a target for learning. Yet, agents compute another estimate while learning that could be leveraged to bootstrap RL: their current policy. Indeed, it reflects the agent's hunch about which actions should be executed next and thus, which actions are good. Building upon this observation, our core contribution stands in a very simple idea: optimizing for the immediate reward augmented by the scaled log-policy of the agent when using any TD scheme. We insist right away that this is different from maximum entropy RL [36], which subtracts the scaled log-policy from all rewards and aims at maximizing both the expected return and the expected entropy of the resulting policy. We call this general approach Munchausen Reinforcement Learning (M-RL), as a reference to a famous passage of *The Surprising Adventures of Baron Munchausen* by Raspe [24], where the Baron pulls himself out of a swamp by pulling on his own hair.

To demonstrate the genericity and the strength of this idea, we introduce it into the most popular RL agent: the seminal Deep Q-Network (DQN) [23]. Yet, DQN does not compute stochastic policies, which prevents using log-policies. So, we first introduce a straightforward generalization of DQN to maximum entropy RL [36, 17], and then modify the resulting TD update by adding the scaled log-policy to the immediate reward. The resulting algorithm, referred to as Munchausen-DQN (M-DQN), is thus genuinely a slight modification of DQN. Yet, it comes with strong empirical performance. On the Arcade Learning Environment (ALE) [6], not only does it surpass the original DQN by a large margin, but it also overtakes C51 [8], the first agent based on distributional RL (distRL). As far as we know, M-DQN is the first agent not using distRL that outperforms a distRL agent.[^1]
The current state of the art for single-agent algorithms is considered to be Rainbow [18], which combines C51 with other enhancements to DQN, and does not rely on massively distributed computation (unlike R2D2 [19], SEED [12] or Agent57 [4]). To demonstrate the versatility of the M-RL idea, we apply the same recipe to modify Implicit Quantile Network (IQN) [11], a recent distRL agent. The resulting Munchausen-IQN (M-IQN) surpasses Rainbow, establishing a new state of the art.

To support these empirical results, we provide strong theoretical insights about what happens under the hood. We rewrite M-DQN under an abstract dynamic programming scheme and show that it implicitly performs Kullback-Leibler (KL) regularization between consecutive policies. M-RL is not the first approach to take advantage of KL regularization [27, 2], but we show that, because this regularization is implicit, it comes with stronger theoretical guarantees. From this, we link M-RL to Conservative Value Iteration (CVI) [20] and Dynamic Policy Programming (DPP) [3], which were not introduced with deep RL implementations. We also draw connections with Advantage Learning (AL) [5, 7] and study the effect of M-RL on the action-gap [13]. While M-RL is not the first scheme to induce an increase of the action-gap [7], it is the first one that allows quantifying this increase.

## 2 Munchausen Reinforcement Learning

RL is usually formalized within the Markov Decision Processes (MDP) framework. An MDP models the environment and is a tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$, with $\mathcal{S}$ and $\mathcal{A}$ the state and action spaces, $P$ the Markovian transition kernel, $r$ the bounded reward function and $\gamma$ the discount factor. The RL agent interacts with the MDP using a policy $\pi$, which associates to every state either an action (deterministic policy) or a distribution over actions (stochastic policy). The quality of this interaction is quantified by the expected discounted cumulative return, formalized as the state-action value function, $q_\pi(s,a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,|\, s_0 = s, a_0 = a]$, the expectation being over trajectories induced by the policy $\pi$ and the dynamics $P$. An optimal policy satisfies $\pi_* \in \operatorname{argmax}_\pi q_\pi$. The associated optimal value function $q_* = q_{\pi_*}$ satisfies the Bellman equation $q_*(s,a) = r(s,a) + \gamma \mathbb{E}_{s'|s,a}[\max_{a'} q_*(s', a')]$. A deterministic greedy policy satisfies $\pi(a|s) = 1$ for $a \in \operatorname{argmax}_{a'} q(s, a')$ and will be written $\pi \in \mathcal{G}(q)$. We also use softmax policies, $\pi = \operatorname{sm}(q) \Leftrightarrow \pi(a|s) = \frac{\exp q(s,a)}{\sum_{a'} \exp q(s,a')}$.

A standard RL agent maintains both a q-function and a policy (which can be implicit, for example $\pi \in \mathcal{G}(q)$), and it aims at learning an optimal policy. To do so, it often relies on Temporal Difference (TD) updates. To recall the principle of TD learning, we quickly revisit the classical Q-learning algorithm [34]. When interacting with the environment, the agent observes transitions $(s_t, a_t, r_t, s_{t+1})$. Were the optimal q-function $q_*$ known in the state $s_{t+1}$, the agent could use it as a learning target and build successive estimates as $q(s_t, a_t) \leftarrow q(s_t, a_t) + \eta\,(r_t + \gamma \max_{a'} q_*(s_{t+1}, a') - q(s_t, a_t))$, using the Bellman equation, $\eta$ being a learning rate. Yet, $q_*$ is unknown, and the agent actually uses its current estimate $q$ instead, which is known as bootstrapping. We argue that the q-function is not the sole quantity that could be used to bootstrap RL. Assume that an optimal deterministic policy $\pi_*$ is known. The log-policy is therefore $0$ for optimal actions, and $-\infty$ for sub-optimal ones.
This is a very strong learning signal that we could add to the reward to ease learning, without changing the optimal control. The optimal policy $\pi_*$ being obviously unknown, we replace it by the agent's current estimate $\pi$, and we assume stochastic policies for numerical stability. To sum up, M-RL is a very simple idea that consists in replacing $r_t$ by $r_t + \alpha \ln \pi(a_t|s_t)$ in any TD scheme, assuming that the current agent's policy $\pi$ is stochastic, so as to bootstrap the current agent's guess about which actions are good.

To demonstrate the generality of this approach, we use it to enhance the seminal DQN [23] deep RL algorithm. In DQN, the q-values are estimated by an online Q-network $q_\theta$, with weights copied regularly to a target network $q_{\bar\theta}$. The agent behaves following a policy $\pi_\theta \in \mathcal{G}(q_\theta)$ (with $\varepsilon$-greedy exploration), and stores transitions $(s_t, a_t, r_t, s_{t+1})$ in a FIFO replay buffer $B$. DQN performs stochastic gradient descent on the loss $\hat{\mathbb{E}}_B[(q_\theta(s_t, a_t) - \hat{q}_{\mathrm{dqn}}(r_t, s_{t+1}))^2]$, regressing the target $\hat{q}_{\mathrm{dqn}}$:
$$\hat{q}_{\mathrm{dqn}}(r_t, s_{t+1}) = r_t + \gamma \sum_{a' \in \mathcal{A}} \pi_{\bar\theta}(a'|s_{t+1})\, q_{\bar\theta}(s_{t+1}, a') \quad \text{with } \pi_{\bar\theta} \in \mathcal{G}(q_{\bar\theta}).$$

To derive Munchausen-DQN (M-DQN), we simply modify the regression target. M-RL assumes stochastic policies while DQN computes deterministic policies. A simple way to address this is to not only maximize the return, but also the entropy of the resulting policy, that is, adopting the viewpoint of maximum entropy RL [36, 17]. It is straightforward to extend DQN to this setting, see Appx. A.1 for a detailed derivation. We call the resulting agent Soft-DQN (S-DQN). Let $\tau$ be the temperature parameter scaling the entropy; it just amounts to replacing the original regression target by
$$\hat{q}_{\text{s-dqn}}(r_t, s_{t+1}) = r_t + \gamma \sum_{a' \in \mathcal{A}} \pi_{\bar\theta}(a'|s_{t+1}) \big( q_{\bar\theta}(s_{t+1}, a') - \tau \ln \pi_{\bar\theta}(a'|s_{t+1}) \big) \quad \text{with } \pi_{\bar\theta} = \operatorname{sm}\!\left(\tfrac{q_{\bar\theta}}{\tau}\right), \tag{1}$$
the differences with DQN being the softmax policy and the entropy term. Notice that this is nothing more than the most straightforward discrete-actions version of Soft Actor-Critic (SAC) [17]. Notice also that in the limit $\tau \to 0$ we retrieve DQN. The last step to obtain M-DQN is to add the scaled log-policy to the reward. Let $\alpha \in [0, 1]$ be a scaling factor; the regression target of M-DQN is thus
$$\hat{q}_{\text{m-dqn}}(r_t, s_{t+1}) = r_t + \alpha \tau \ln \pi_{\bar\theta}(a_t|s_t) + \gamma \sum_{a' \in \mathcal{A}} \pi_{\bar\theta}(a'|s_{t+1}) \big( q_{\bar\theta}(s_{t+1}, a') - \tau \ln \pi_{\bar\theta}(a'|s_{t+1}) \big), \tag{2}$$
still with $\pi_{\bar\theta} = \operatorname{sm}(\tfrac{q_{\bar\theta}}{\tau})$, the difference with Soft-DQN being the Munchausen term $\alpha \tau \ln \pi_{\bar\theta}(a_t|s_t)$ (Soft-DQN is retrieved by setting $\alpha = 0$). Hence, M-DQN is genuinely obtained by replacing $\hat{q}_{\mathrm{dqn}}$ by $\hat{q}_{\text{m-dqn}}$ as the regression target of DQN. All details of the resulting algorithm are provided in Appx. B.1.

[^1]: It appears that the benefits of distRL do not really come from RL principles, but rather from the regularizing effect of modelling a distribution and its role as an auxiliary task in a deep learning context [21].

Figure 1: Left: Human-normalized mean scores. Right: Human-normalized median scores.

Despite being an extremely simple modification of DQN, M-DQN is very efficient. We show in Fig. 1 the Human-normalized mean and median scores for various agents on the full set of 60 Atari games of ALE (more details in Sec. 4). We observe that M-DQN significantly outperforms DQN, but also C51 [8]. As far as we know, M-DQN is the first method not based on distRL that overtakes C51. These are quite encouraging empirical results.
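
To make Eq. (2) concrete, here is a minimal NumPy sketch of the target computation for a single transition. It is not the authors' implementation (which is built on Dopamine/TensorFlow); the function and argument names are ours, and the defaults simply mirror the hyperparameters reported later in the paper.

```python
# A minimal sketch (not the authors' implementation) of the M-DQN regression target of
# Eq. (2) for a single transition, using NumPy; all names here are illustrative.
import numpy as np

def scaled_log_softmax(q, tau):
    """Stable computation of tau * ln pi(.|s) for pi = sm(q / tau)."""
    z = q / tau
    z = z - z.max()                              # log-sum-exp trick
    return tau * (z - np.log(np.exp(z).sum()))

def m_dqn_target(r_t, a_t, q_bar_st, q_bar_st1, gamma=0.99, alpha=0.9, tau=0.03):
    """Munchausen-DQN target: reward + Munchausen bonus + discounted soft value of s_{t+1}.

    q_bar_st, q_bar_st1: target-network q-values q_bar(s_t, .) and q_bar(s_{t+1}, .).
    """
    # Munchausen term: alpha * tau * ln pi_bar(a_t | s_t).
    munchausen = alpha * scaled_log_softmax(q_bar_st, tau)[a_t]
    # Soft value of s_{t+1}: sum_a' pi_bar(a'|s_{t+1}) (q_bar(s_{t+1},a') - tau ln pi_bar(a'|s_{t+1})).
    tau_log_pi_st1 = scaled_log_softmax(q_bar_st1, tau)
    pi_st1 = np.exp(tau_log_pi_st1 / tau)
    soft_value = np.sum(pi_st1 * (q_bar_st1 - tau_log_pi_st1))
    return r_t + munchausen + gamma * soft_value

# Setting alpha = 0 recovers the Soft-DQN target of Eq. (1); alpha = 0 and tau -> 0 recover DQN.
```
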
To demonstrate the versatility of the M-RL principle, we also combine it with IQN [11], a recent and efficient distRL agent (note that IQN has had recent successors, such as Fully Parameterized Quantile Function (FQF) [35], to which, in principle, we could also apply M-RL). We denote the resulting algorithm M-IQN. In a nutshell, IQN does not estimate the q-function, but the distribution of which the q-function is the mean, using a distributional Bellman operator. The (implicit) policy is still greedy with respect to the q-function, computed as the (empirical) mean of the estimated distribution. We apply the exact same recipe: derive Soft-IQN using the principle of maximum entropy RL (which is as easy as for DQN), and add the scaled log-policy to the reward. For the sake of showing the generality of our method, we combine M-RL with a version of IQN that uses 3-step returns (and we compare to IQN and Rainbow, which both use the same). We can observe in Fig. 1 that M-IQN outperforms Rainbow, both in terms of mean and median scores, and thus defines the new state of the art. In addition, even when using only 1-step returns, M-IQN still outperforms Rainbow. This result and the details of M-IQN can be found respectively in Appx. B.3 and B.1.

## 3 What happens under the hood?

The impressive empirical results of M-RL (see Sec. 4 for more) call for some theoretical insights. To provide them, we frame M-DQN in an abstract Approximate Dynamic Programming (ADP) framework and analyze it. We mainly provide two strong results: (1) M-DQN implicitly performs KL regularization between successive policies, which translates into an averaging effect of approximation errors (instead of an accumulation, as in general ADP frameworks); (2) it increases the action-gap by a quantifiable amount, which also helps dealing with approximation errors. We also use this section to draw connections with the existing literature in ADP.

Let's first introduce some additional notations. We write $\Delta_X$ the simplex over the finite set $X$ and $Y^X$ the set of applications from $X$ to the set $Y$. With this, an MDP is $\{\mathcal{S}, \mathcal{A}, P \in \Delta_{\mathcal{S}}^{\mathcal{S}\times\mathcal{A}}, r \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}, \gamma \in (0,1)\}$, the state and action spaces being assumed finite. For $f, g \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, we define a component-wise dot product $\langle f, g\rangle = (\sum_a f(s,a) g(s,a))_s \in \mathbb{R}^{\mathcal{S}}$. This will be used with q-functions and (log-)policies, e.g. for expectations: $\mathbb{E}_{a\sim\pi(\cdot|s)}[q(s,a)] = \langle\pi, q\rangle(s)$. For $v \in \mathbb{R}^{\mathcal{S}}$, we have $P v = (\mathbb{E}_{s'|s,a}[v(s')])_{s,a} = (\sum_{s'} P(s'|s,a)\, v(s'))_{s,a} \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$. We also define a policy-induced transition kernel $P_\pi$ as $P_\pi q = P\langle\pi, q\rangle$. With these notations, the Bellman evaluation operator is $T_\pi q = r + \gamma P_\pi q$ and its unique fixed point is $q_\pi$. An optimal policy still satisfies $\pi_* \in \operatorname{argmax}_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}} q_\pi$. The set of greedy policies can be written as $\mathcal{G}(q) = \operatorname{argmax}_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}} \langle\pi, q\rangle$. We will also make use of the entropy of a policy, $\mathcal{H}(\pi) = -\langle\pi, \ln\pi\rangle$, and of the KL divergence between two policies, $\mathrm{KL}(\pi_1\|\pi_2) = \langle\pi_1, \ln\pi_1 - \ln\pi_2\rangle$. The softmax is the maximizer of the Legendre-Fenchel transform of the entropy [9, 32], $\operatorname{sm}(q) = \operatorname{argmax}_\pi \langle\pi, q\rangle + \mathcal{H}(\pi)$. Using this and the introduced notations, we can write M-DQN in the following abstract form (each iteration consists of a greedy step and an evaluation step):
$$\begin{cases} \pi_{k+1} = \operatorname{argmax}_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}} \langle\pi, q_k\rangle + \tau\mathcal{H}(\pi) \\ q_{k+1} = r + \alpha\tau\ln\pi_{k+1} + \gamma P\langle\pi_{k+1}, q_k - \tau\ln\pi_{k+1}\rangle + \epsilon_{k+1} \end{cases} \qquad \text{M-VI}(\alpha,\tau) \tag{3}$$
We call the resulting scheme Munchausen Value Iteration, or M-VI($\alpha$,$\tau$). The term $\epsilon_{k+1}$ stands for the error between the actual and the ideal update (sampling instead of expectation, approximation of $q_k$ by a neural network, fitting of the neural network).
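
For intuition, the following is a minimal tabular sketch of the M-VI($\alpha$,$\tau$) scheme of Eq. (3), assuming the finite MDP ($P$ of shape [S, A, S], $r$ of shape [S, A]) is known and omitting the error term; names and default values are ours, not the paper's.

```python
# A tabular sketch of M-VI(alpha, tau) from Eq. (3), assuming the MDP (P, r) is known and
# omitting the error term epsilon; this is an illustration, not the deep M-DQN agent.
import numpy as np

def log_softmax(q, tau):
    """Row-wise stable ln sm(q/tau): shape [S, A] -> [S, A]."""
    z = q / tau
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def m_vi(P, r, gamma=0.99, alpha=0.9, tau=0.03, n_iter=1000):
    """P: transition kernel of shape [S, A, S]; r: reward of shape [S, A]."""
    q = np.zeros_like(r)
    for _ in range(n_iter):
        # Greedy step: pi_{k+1} = argmax_pi <pi, q_k> + tau H(pi) = sm(q_k / tau).
        log_pi = log_softmax(q, tau)
        pi = np.exp(log_pi)
        # Evaluation step: q_{k+1} = r + alpha tau ln pi_{k+1}
        #                  + gamma P <pi_{k+1}, q_k - tau ln pi_{k+1}>.
        soft_v = np.sum(pi * (q - tau * log_pi), axis=1)     # shape [S]
        q = r + alpha * tau * log_pi + gamma * P @ soft_v    # P @ soft_v has shape [S, A]
    return q, np.exp(log_softmax(q, tau))
```
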
Removing the Munchausen term $\alpha\tau\ln\pi_{k+1}$ from Eq. (3), we retrieve approximate VI (AVI) regularized by a scaled entropy, as introduced by Geist et al. [15], of which Soft-DQN is an instantiation (as well as SAC, with an additional error in the greedy step). Removing also the entropy terms ($\tau\mathcal{H}(\pi)$ in the greedy step and $\tau\ln\pi_{k+1}$ in the evaluation step), we retrieve the classic AVI [26], of which DQN is an instantiation. To get some insights, we rewrite the evaluation step, setting $\alpha = 1$ and with $q'_k \coloneqq q_k - \tau\ln\pi_k$:
$$\begin{aligned} q_{k+1} &= r + \tau\ln\pi_{k+1} + \gamma P\langle\pi_{k+1}, q_k - \tau\ln\pi_{k+1}\rangle + \epsilon_{k+1} \\ \Leftrightarrow\quad q_{k+1} - \tau\ln\pi_{k+1} &= r + \gamma P\langle\pi_{k+1}, q_k - \tau\ln\pi_k - \tau(\ln\pi_{k+1} - \ln\pi_k)\rangle + \epsilon_{k+1} \\ \Leftrightarrow\quad q'_{k+1} &= r + \gamma P\big(\langle\pi_{k+1}, q'_k\rangle - \tau\,\mathrm{KL}(\pi_{k+1}\|\pi_k)\big) + \epsilon_{k+1}. \end{aligned}$$
Then, the greedy step can be rewritten as (looking at what $\pi_{k+1}$ maximizes)
$$\langle\pi, q_k\rangle + \tau\mathcal{H}(\pi) = \langle\pi, q'_k + \tau\ln\pi_k\rangle - \tau\langle\pi, \ln\pi\rangle = \langle\pi, q'_k\rangle - \tau\,\mathrm{KL}(\pi\|\pi_k). \tag{4}$$
We have just shown that M-VI(1,$\tau$) implicitly performs KL regularization between successive policies. This is a very insightful result, as KL regularization is the core component of recent efficient RL agents such as TRPO [27] or MPO [2]. It is extensively discussed by Vieillard et al. [32]. Interestingly, we can show that the sequence of policies produced by M-VI($\alpha$,$\tau$) is the same as the one of their Mirror Descent VI (MD-VI), with the KL scaled by $\alpha\tau$ and the entropy scaled by $(1-\alpha)\tau$. Thus, M-VI($\alpha$,$\tau$) is equivalent to MD-VI($\alpha\tau$, $(1-\alpha)\tau$), as formalized below (proof in Appx. A.2).

**Theorem 1.** For any $k \geq 0$, define $q'_k = q_k - \alpha\tau\ln\pi_k$; we have
$$\begin{cases} \pi_{k+1} = \operatorname{argmax}_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}} \langle\pi, q'_k\rangle - \alpha\tau\,\mathrm{KL}(\pi\|\pi_k) + (1-\alpha)\tau\mathcal{H}(\pi) \\ q'_{k+1} = r + \gamma P\big(\langle\pi_{k+1}, q'_k\rangle - \alpha\tau\,\mathrm{KL}(\pi_{k+1}\|\pi_k) + (1-\alpha)\tau\mathcal{H}(\pi_{k+1})\big) + \epsilon_{k+1}. \end{cases}$$
Moreover, [32, Thm. 1] applies to M-VI(1,$\tau$) and [32, Thm. 2] applies to M-VI($\alpha < 1$,$\tau$).

In their work, Vieillard et al. [32] show that using regularization can reduce the dependency on the horizon $(1-\gamma)^{-1}$ and that using a KL divergence allows for a compensation of the errors $\epsilon_k$ over iterations, which is not true for classical ADP. We refer to them for a detailed discussion on this topic. However, we would like to highlight that they acknowledge that their theoretical analysis does not apply to the deep RL setting. The reason is that their analysis does not hold when the greedy step is approximated, and they deem it impossible to perform the greedy step exactly when using neural networks. Indeed, computing $\pi_{k+1}$ by maximizing Eq. (4) yields an analytical solution proportional to $\pi_k \exp(\frac{q'_k}{\tau})$, which thus depends on the previous policy $\pi_k$. Consequently, the solution to this equation cannot be computed exactly when using deep function approximation (unless one would be willing to remember every computed policy). On the contrary, their analysis applies in our deep RL setting. In M-VI, the KL regularization is implicit, so we do not introduce errors in the greedy step. To be precise, the greedy step of M-VI is only a softmax of the q-function, which can be computed exactly in a discrete-actions setting, even when using deep networks. Their strong bounds for MD-VI therefore hold for M-VI, as formalized in Thm. 1, and in particular for M-DQN. Indeed, let $\bar\theta_k$ be the $k$-th update of the target network, write $q_k = q_{\bar\theta_k}$, $\pi_{k+1} = \operatorname{sm}(\frac{q_k}{\tau})$, and define $\epsilon_{k+1} = q_{k+1} - (r + \alpha\tau\ln\pi_{k+1} + \gamma P\langle\pi_{k+1}, q_k - \tau\ln\pi_{k+1}\rangle)$, the difference between the actual update and the ideal one. As a direct corollary of Thm. 1 and [32, Thm. 1], we have that, for $\alpha = 1$,
$$\|q_* - q_{\pi_k}\|_\infty \leq \frac{2}{1-\gamma}\left\|\frac{1}{k}\sum_{j=1}^{k}\epsilon_j\right\|_\infty + \frac{4}{(1-\gamma)^2}\,\frac{r_{\max} + \tau\ln|\mathcal{A}|}{k},$$
with $r_{\max}$ the maximum reward (in absolute value), and with $q_{\pi_k}$ the true value function of the policy of M-DQN. This is a very strong bound. The error term is $\|\frac{1}{k}\sum_{j=1}^{k}\epsilon_j\|_\infty$, to be compared to the one of AVI [26], $(1-\gamma)\sum_{j=1}^{k}\gamma^{k-j}\|\epsilon_j\|_\infty$. Instead of having a discounted sum of the norms of the errors, we have the norm of the average of the errors. This is very interesting, as it allows for a compensation of errors between iterations instead of an accumulation (sum and norm do not commute). The error term is scaled by $(1-\gamma)^{-1}$ (the average horizon of the MDP), while the one of AVI would be scaled by $(1-\gamma)^{-2}$. This is also quite interesting: a $\gamma$ close to 1 impacts the bound less negatively. We refer to [32, Sec. 4.1] for further discussion about the advantages of this kind of bound. Similarly, we could derive a bound for the case $\alpha < 1$, and even more general and meaningful component-wise bounds. We defer the statement of these bounds and their proofs to Appx. A.3.
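
As a small numerical illustration of the implicit KL regularization above (Eq. (4) and Thm. 1, case $\alpha = 1$), the following NumPy snippet, with names of our own choosing, checks at one state that the entropy-regularized greedy step on $q_k$ and the explicitly KL-regularized greedy step on $q'_k = q_k - \tau\ln\pi_k$ return the same policy.

```python
# Numerical check (alpha = 1 case) that argmax_pi <pi, q_k> + tau H(pi) and
# argmax_pi <pi, q'_k> - tau KL(pi || pi_k), with q'_k = q_k - tau ln pi_k, coincide.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
tau = 0.03
q_k = rng.normal(size=6)                    # q_k(s, .) at some state
pi_k = softmax(rng.normal(size=6))          # some previous policy pi_k(.|s)
q_prime = q_k - tau * np.log(pi_k)          # change of variable of Sec. 3

pi_entropy = softmax(q_k / tau)             # implicit (Munchausen) greedy step: sm(q_k / tau)
w = pi_k * np.exp(q_prime / tau - (q_prime / tau).max())
pi_kl = w / w.sum()                         # explicit KL-regularized step: prop. to pi_k exp(q'_k / tau)

assert np.allclose(pi_entropy, pi_kl)       # same policy, but no pi_k is needed in the first form
```
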
From Eq. (3), we can also relate the proposed approach to another part of the literature. Still from basic properties of the Legendre-Fenchel transform, we have that $\max_\pi \langle\pi, q_k\rangle + \tau\mathcal{H}(\pi) = \langle\pi_{k+1}, q_k\rangle + \tau\mathcal{H}(\pi_{k+1}) = \tau\ln\langle 1, \exp\frac{q_k}{\tau}\rangle$. In other words, if the maximizer is the softmax, the maximum is the log-sum-exp. Using this, Eq. (3) can be rewritten as (see Appx. A.4 for a detailed derivation)
$$q_{k+1} = r + \gamma P\Big(\tau\ln\Big\langle 1, \exp\tfrac{q_k}{\tau}\Big\rangle\Big) + \alpha\Big(q_k - \tau\ln\Big\langle 1, \exp\tfrac{q_k}{\tau}\Big\rangle\Big) + \epsilon_{k+1}. \tag{5}$$
This is very close to Conservative Value Iteration[^2] (CVI) [20], a purely theoretical algorithm, as far as we know. With $\alpha = 0$ (without Munchausen), we get Soft Q-learning [14, 16]. Notice that with this, CVI can be seen as Soft Q-learning plus a scaled and smooth advantage (the term $\alpha(q_k - \tau\ln\langle 1, \exp\frac{q_k}{\tau}\rangle)$). With $\alpha = 1$, we retrieve a variation of Dynamic Policy Programming (DPP) [3, Appx. A]. DPP has been extended to a deep learning setting [30], but it is less efficient than DQN[^3] [32]. Taking the limit $\tau \to 0$, we retrieve Advantage Learning (AL) [5, 7] (see Appx. A.4):
$$q_{k+1} = r + \gamma P\langle\pi_{k+1}, q_k\rangle + \alpha\big(q_k - \langle\pi_{k+1}, q_k\rangle\big) + \epsilon_{k+1} \quad \text{with } \pi_{k+1} \in \mathcal{G}(q_k). \tag{6}$$
AL aims at increasing the action-gap [13], defined as the difference, for a given state, between the (optimal) value of the optimal action and that of the suboptimal ones. The intuitive reason to want a large action-gap is that it can mitigate the undesirable effects of approximation and estimation errors made on $q$ on the induced greedy policies. Bellemare et al. [7] have introduced a family of Bellman-like operators that are gap-increasing. Not only do we show that M-VI is gap-increasing, but we also quantify the increase. To do so, we introduce some last notations. As explained before, with $\alpha = 0$, M-VI(0,$\tau$) reduces to AVI regularized by an entropy (that is, maximum entropy RL). Without error, it is known that the resulting regularized MDP has a unique optimal policy $\pi^*_\tau$ and a unique optimal q-function[^4] $q^*_\tau$ [15]. This being defined, we can state our result (proven in Appx. A.5).

[^2]: In CVI, $\langle 1, \exp\frac{q_k}{\tau}\rangle$ is replaced by $\langle\frac{1}{|\mathcal{A}|}, \exp\frac{q_k}{\tau}\rangle$.
[^3]: In fact, Tsurumine et al. [30] show better performance for deep DPP than for DQN in their setting. Yet, their experiment involves a small number of interactions, while the function estimated by DPP naturally diverges. See [33, Sec. 6] for further discussion about this.
[^4]: It can be related to the unregularized optimal q-function: $\|q^*_\tau - q_*\|_\infty \leq \tau\ln|\mathcal{A}|$.

**Theorem 2.** For any state $s \in \mathcal{S}$, define the action-gap of an MDP regularized by an entropy scaled by $\tau$ as $\delta_\tau(s) = \max_a q^*_\tau(s,a) - q^*_\tau(s,\cdot) \in \mathbb{R}_+^{\mathcal{A}}$. Define also $\delta^{\alpha,\tau}_k(s)$ as the action-gap for the $k$-th iteration of M-VI($\alpha$,$\tau$), without error ($\epsilon_k = 0$): $\delta^{\alpha,\tau}_k(s) = \max_a q_k(s,a) - q_k(s,\cdot) \in \mathbb{R}_+^{\mathcal{A}}$. Then, for any $s\in\mathcal{S}$, for any $0 \leq \alpha \leq 1$ and for any $\tau > 0$, we have
$$\lim_{k\to\infty} \delta^{\alpha,\tau}_k(s) = \frac{1+\alpha}{1-\alpha}\,\delta_{(1-\alpha)\tau}(s),$$
with the convention that $\infty \cdot 0 = 0$ for $\alpha = 1$.

Thus, the original action-gap is multiplied by $\frac{1+\alpha}{1-\alpha}$ with M-VI. In the limit $\alpha = 1$, it is even infinite (and zero for the optimal actions). This suggests choosing a large value of $\alpha$, but not too close to 1 (for numerical stability: if having a large action-gap is desirable, having an infinite one is not).
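
Before turning to the experiments, a quick NumPy check (our own sketch, under the notations above) of the rewriting of Eq. (3) into the CVI-like form of Eq. (5) at a single state.

```python
# Check at one state s that (i) the entropy-augmented expectation equals the scaled
# log-sum-exp, and (ii) the Munchausen bonus equals the scaled "soft advantage" of Eq. (5).
import numpy as np

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

rng = np.random.default_rng(1)
tau, alpha = 0.03, 0.9
q = rng.normal(size=5)                           # q_k(s, .)

log_pi = q / tau - logsumexp(q / tau)            # ln pi_{k+1}(.|s), with pi_{k+1} = sm(q_k / tau)
pi = np.exp(log_pi)

lse = tau * logsumexp(q / tau)                   # tau ln <1, exp(q_k / tau)>(s)
assert np.allclose(np.sum(pi * (q - tau * log_pi)), lse)    # <pi, q - tau ln pi> = log-sum-exp
assert np.allclose(alpha * tau * log_pi, alpha * (q - lse))  # Munchausen term = scaled soft advantage
```
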
## 4 Experiments

**Munchausen agents.** We implement M-DQN and M-IQN as variations of respectively DQN and IQN from Dopamine [10]. We use the same hyperparameters as IQN[^5], and we only change the optimizer from RMSProp to Adam for DQN. This is actually not anodyne, and we study its impact in an ablation study. We also consider a Munchausen-specific modification, log-policy clipping. Indeed, the log-policy term is not bounded, and can cause numerical issues if the policy becomes too close to deterministic. Thus, with a hyperparameter $l_0 < 0$, we replace $\tau\ln\pi(a|s)$ by $[\tau\ln\pi(a|s)]^0_{l_0}$, where $[\cdot]^y_x$ is the clipping function. For numerical stability, we use a specific log-sum-exp trick to compute the log-policy (see Appx. B.1). Hence, we add three parameters to the modified agent: $\alpha$, $\tau$ and $l_0$. After some tuning on a few Atari games, we found a working zone for these parameters to be $\alpha = 0.9$, $\tau = 0.03$ and $l_0 = -1$, used for all experiments, in M-DQN and M-IQN. All details about the rest of the parameters can be found in Appx. B.1. DQN and IQN use $\varepsilon$-greedy policies to interact with the environment. Although M-DQN and M-IQN naturally produce stochastic policies, we use the same $\varepsilon$-greedy policies. We discuss this further in Appx. B.2, where we also compare to stochastic policies.

**Baselines.** First, we consider both DQN and IQN, as these are the algorithms we modify. Second, we compare to C51 because, as far as we know, it has never been outperformed by a non-distRL agent before. We also consider Rainbow, as it stands as the state-of-the-art non-distributed agent on ALE. All our baselines are taken from Dopamine. For Rainbow, this version does not contain all the original improvements, but only the ones deemed the most important and efficient by Hessel et al. [18]: n-step returns and Prioritized Experience Replay (PER) [25], on top of C51.

**Task.** We evaluate our methods and the baselines in the ALE environment, i.e. on the full set of 60 Atari games. Notice that it is not a canonical environment. For example, choosing to end an episode when an agent loses a life or after game-over can dramatically change the score an agent can reach (e.g., [10, Fig. 4]). The same holds for using sticky actions, which introduce stochasticity in the dynamics (e.g., [10, Fig. 6]). Even the ROMs could be different, with unpredictable consequences (e.g., different video encoding). Here, we follow the methodological best practices proposed by Machado et al. [22] and instantiated in Dopamine [10], which also make the ALE more challenging. Notably, the results we present are hardly comparable to the ones presented in the original publications of DQN [23], C51 [8], Rainbow [18] or IQN [11], which use a different, easier, setting. Yet, for completeness, we report results on one game (Asterix) using an ALE setting as close as possible to the original papers, in Appx. B.4: the baseline results match the previously published ones, and M-RL still brings an improvement.
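
Returning to the log-policy clipping and log-sum-exp trick described in the Munchausen agents paragraph above, here is a sketch of one way to compute the clipped term from target-network q-values; the exact implementation is the one of Appx. B.1 of the paper, and the names and example values below are ours.

```python
# A sketch of the two numerical-stability devices described above, with illustrative names;
# the exact implementation is in the paper's Appx. B.1, which we do not reproduce here.
import numpy as np

def clipped_scaled_log_policy(q, tau=0.03, l0=-1.0):
    """Clipped Munchausen term [tau * ln pi(.|s)]_{l0}^{0} for pi = sm(q/tau), from q-values q(s, .).

    tau * ln pi(a|s) = q(s,a) - tau * ln sum_a' exp(q(s,a')/tau), computed with the
    max-subtraction (log-sum-exp) trick so that the exponentials never overflow.
    """
    q_max = q.max()
    tau_log_pi = q - q_max - tau * np.log(np.sum(np.exp((q - q_max) / tau)))
    return np.clip(tau_log_pi, l0, 0.0)

# Example: a nearly deterministic policy yields a large negative log-policy for the
# dominated action, which the clipping caps at l0 = -1.
print(clipped_scaled_log_policy(np.array([1.0, 0.0])))   # approximately [0., -1.]
```
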
We also highlight that we stick to a single-agent version of the environment: we do not claim that our method can be compared to highly distributed agents, such as R2D2 [19] or Agent57 [4], that use several versions of the environment in parallel and train on a much higher number of frames (around 10G frames vs 200M here). Yet, we are confident that our approach could easily apply to such agents.

**Metrics.** All algorithms are evaluated on the same training regime (details in Appx. B.1), during 200M frames, and results are averaged over 3 seeds. As a metric for any game, we compute the baseline-normalized score, for each iteration (here, 1M frames), normalized so that 0% corresponds to a random score, and 100% to the final performance of the baseline. At each iteration, the score is the undiscounted sum of rewards, averaged over the last 100 learning episodes. The normalized score is then $\frac{a - r}{|b - r|}$, with $a$ the score of the agent, $b$ the score of the baseline, and $r$ the score of a random policy. For a human baseline, the scores are those provided in Table 3 (Appx. B.6); for an agent baseline, the score is the one after 200M frames. With this, we provide aggregated results, showing the mean and the median over games, as learning proceeds when the baseline is the human score (e.g., Fig. 1), or after 200M steps with human and Rainbow baselines in Tab. 1 (more results in Appx. B.6, as learning proceeds). We also compute a baseline-improvement score as $\frac{a - b}{|b - r|}$, and use it to report a per-game improvement after 200M frames (Fig. 4, M-Agent versus Agent, or Appx. B.6).

[^5]: By default, Dopamine's IQN uses 3-step returns. We rather consider 1-step returns, as in [11].

Figure 2: Action-gaps (Asterix).

**Action-gap.** We start by illustrating the action-gap phenomenon suggested by Thm. 2. To do so, let $q_\theta$ be the q-function of a given agent after training for 200M steps. At any time-step $t$, writing $\hat a_t \in \operatorname{argmax}_{a\in\mathcal{A}} q_\theta(s_t, a)$ the current greedy action, we compute the empirical action-gap as the difference of estimated values between the best and second-best actions, $q_\theta(s_t, \hat a_t) - \max_{a\in\mathcal{A}\setminus\{\hat a_t\}} q_\theta(s_t, a)$. We do so for M-DQN, for AL (which was introduced specifically to increase the action-gap) and for DQN with the Adam optimizer (Adam DQN), as both build on top of it (only changing the regression targets, see Appx. B.1 for details). We consider the game Asterix, for which the final average performance of the agents is (roughly) 15k for Adam DQN, 13k for AL and 20k for M-DQN. We report the results in Fig. 2: we run each agent for 10 trajectories and average the resulting action-gaps (the length of the resulting curve is that of the shortest trajectory, and we apply an exponential smoothing of 0.99). Both M-DQN and AL increase the action-gaps compared to Adam DQN. While AL increases it more, it also seems to be less stable, and less proportional to the original action-gap. Despite this increase, it performs worse than Adam DQN (13k vs 15k), while M-DQN increases it and performs better (20k vs 15k). An explanation to this phenomenon could be the one of Van Seijen et al. [31], who suggest that what is important is not the value of the action-gap itself, but its uniformity over the state-action space: here, M-DQN seems to benefit from a more stable action-gap than AL. This figure is for illustrative purposes; one game is not enough to draw conclusions. Yet, the following ablation shows that, globally, M-DQN performs better than AL. Also, it benefits from more theoretical justifications (not only a quantified action-gap increase, but also implicit KL regularization and the resulting performance bounds).
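
The quantities used in this section are simple to compute; the small helpers below (our own, not Dopamine's) make the empirical action-gap just described and the normalized/improvement scores of the Metrics paragraph explicit.

```python
# Minimal helpers (illustrative, not Dopamine code) for the evaluation quantities above.
import numpy as np

def empirical_action_gap(q_values: np.ndarray) -> float:
    """q_values: estimated q(s_t, .) for one state; returns q(s_t, a_hat) minus the second-best value."""
    top_two = np.sort(q_values)[-2:]          # two largest estimates
    return float(top_two[1] - top_two[0])

def normalized_score(a: float, b: float, r: float) -> float:
    """Baseline-normalized score (a - r) / |b - r| (0 = random, 1 = baseline)."""
    return (a - r) / abs(b - r)

def improvement_score(a: float, b: float, r: float) -> float:
    """Baseline-improvement score (a - b) / |b - r|, as reported per game in Fig. 4."""
    return (a - b) / abs(b - r)

print(empirical_action_gap(np.array([2.0, 5.0, 4.5])))   # 0.5
print(normalized_score(300.0, 200.0, 10.0))              # ~1.53: 53% above the baseline
```
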
**Ablation study.** We built M-DQN from DQN by adding the Adam optimizer (Adam DQN), extending it to maximum entropy RL (Soft-DQN, Eq. (1)), and then adding the Munchausen term (M-DQN, Eq. (2)). A natural ablation is to remove the Munchausen term and use only maximum entropy RL, by considering M-DQN with $\alpha = 0$ (instead of 0.9 for M-DQN) and the same $\tau$ (here, 3e-2), which would give Soft-DQN($\tau$). However, Thm. 1 states that M-DQN performs entropy regularization with an implicit coefficient of $(1-\alpha)\tau$, so to compare M-DQN and Soft-DQN fairly, one should evaluate Soft-DQN with such a temperature, that is 3e-3 in this case. We denote this ablation Soft-DQN($(1-\alpha)\tau$). As sketched in Sec. 3, AL can also be seen as a limit case (in an abstract way, as $\tau \to 0$; see also Appx. B.1 for details on the algorithm). We provide an ablation study of all these variations, all using Adam (except DQN), in Fig. 3. All methods perform better than DQN. Adam DQN performs very well and is even competitive with C51. This is an interesting insight, as changing the optimizer compared to the published parameters dramatically improves the performance, and Adam DQN could be considered as a better baseline[^6]. Surprisingly, while better than DQN, Soft-DQN does not perform better than Adam DQN. This suggests that maximum entropy RL alone might not be sufficient. We kept the temperature $\tau = 0.03$, and one could argue that it was not tuned for Soft-DQN, but it is on par with the temperature of similar algorithms [28, 32]. We observe that AL performs better than Adam DQN. Again, we kept $\alpha = 0.9$, but this is consistent with the best performing parameter of Bellemare et al. [7, e.g., Fig. 7]. The proposed M-DQN outperforms all other methods, both in mean and median, and especially Soft-DQN by a significant margin (the sole difference being the Munchausen term).

[^6]: To be on par with the literature, we keep using the published DQN as the baseline for other experiments.

Figure 3: Ablation study of M-DQN: Human-normalized mean (left) and median (right) scores.

Table 1: Mean/median Human/Rainbow-normalized scores at 200M frames, on the 60 games, averaged over 3 random seeds. In bold are the best of each column. We also provide the number of improved games (compared to Human and Rainbow).

| Agent | Human-norm. Mean | Human-norm. Median | #Improved (vs Human) | Rainbow-norm. Mean | Rainbow-norm. Median | #Improved (vs Rainbow) |
|---|---|---|---|---|---|---|
| M-DQN | 340% | 124% | 37 | 89% | 92% | 21 |
| M-IQN | **563%** | **165%** | **43** | **130%** | **109%** | **38** |
| RAINBOW | 414% | 150% | **43** | 100% | 100% | - |
| IQN | 441% | 139% | 41 | 105% | 99% | 27 |
| C51 | 339% | 111% | 33 | 84% | 70% | 11 |
| DQN | 228% | 71% | 23 | 51% | 51% | 3 |

**Comparison to the baselines.** We report aggregated results as Human-normalized mean and median scores in Figure 1, which compares the Munchausen agents to the baselines. M-DQN is largely above DQN, and outperforms C51 both in mean and median. It is remarkable that M-DQN, justified by theoretically sound RL principles and without using common deep RL tricks like n-step returns, PER or distRL, is competitive with distRL methods. It is even close to IQN (in median), considered as the best distRL-based agent.
We observe that M-IQN, which combines IQN with the Munchausen principle, is better than all other baselines, by a significant margin in mean. We also report the final Human-normalized and Rainbow-normalized scores of all the algorithms in Table 1. These results are on par with the Human-normalized scores of Fig. 1 (see Appx. B.6 for results over frames). M-DQN is still close to IQN in median, and is better than DQN and C51, while M-IQN is the best agent w.r.t. all metrics.

**Per-game improvements.** In Figure 4, we report the improvement, for each game, of the Munchausen agents over the algorithms they modify. The Munchausened versions show significant improvements on a large majority of Atari games (53/60 for M-DQN vs DQN, 40/60 for M-IQN vs IQN). This result also explains the sometimes large difference between the mean and median metrics, as some games benefit from a particularly large improvement. All learning curves are in Appx. B.6.

Figure 4: Per-game improvement of M-DQN vs DQN (top; average improvement 724.7%, median improvement 45.0%, 53/60 games improved) and of M-IQN vs IQN (bottom; average improvement 31.1%, median improvement 8.7%, 40/60 games improved).

## 5 Conclusion

In this work, we presented a simple extension to RL algorithms: Munchausen RL. This method augments the immediate rewards by the scaled logarithm of the policy computed by an RL agent. We applied this method to a simple variation of DQN, Soft-DQN, resulting in the M-DQN algorithm. M-DQN shows large performance improvements: it outperforms DQN on 53 of the 60 Atari games, while simply using a modification of the DQN loss. In addition, it outperforms the seminal distributional RL algorithm C51. We also extended the Munchausen idea to distributional RL, showing that it could be successfully combined with IQN to outperform the Rainbow baseline. Munchausen-DQN relies on theoretical foundations. To show that, we have studied an abstract Munchausen Value Iteration scheme and shown that it implicitly performs KL regularization. Notably, the strong theoretical results of [32] apply to M-DQN. By rewriting it in an equivalent ADP form, we have related our approach to the literature, notably to CVI, DPP and AL. We have shown that M-VI increases the action-gap, and we have quantified this increase, which can be infinite in the limit. In the end, this work highlights that a thoughtful revisiting of the core components of reinforcement learning can lead to new and efficient deep RL algorithms.

## Broader impact

The core contribution of this work is to propose a new RL algorithm that surpasses state-of-the-art results on a challenging discrete-actions environment. We believe it can positively impact the RL community, as it sheds light on fundamental ideas, justified by deep theoretical foundations, that prove to be efficient in practice. Outside of the RL community, the impact of this paper is part of the global impact of RL.
This work is mainly algorithmic and theoretical, with no specific applications in mind, but it contributes to the general development of efficient and practical RL methods.

## Funding transparency statement

Nothing to disclose.

## References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), 2018.

[3] Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J. Kappen. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207–3245, 2012.

[4] Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari human benchmark. arXiv preprint arXiv:2003.13350, 2020.

[5] Leemon C. Baird III. Reinforcement Learning Through Gradient Descent. PhD thesis, US Air Force Academy, US, 1999.

[6] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[7] Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 2016.

[8] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

[9] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[10] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.

[11] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning (ICML), 2018.

[12] Lasse Espeholt, Raphael Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. SEED RL: Scalable and efficient deep-RL with accelerated central inference. In International Conference on Learning Representations (ICLR), 2020.

[13] Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 172–180, 2011.

[14] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence (UAI), 2016.

[15] Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning (ICML), 2019.
[16] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.

[17] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018.

[18] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 2018.

[19] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations (ICLR), 2018.

[20] Tadashi Kozuno, Eiji Uchibe, and Kenji Doya. Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[21] Clare Lyle, Marc G. Bellemare, and Pablo Samuel Castro. A comparative analysis of expected and distributional reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 2019.

[22] Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

[23] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[24] Rudolf Erich Raspe. Baron Munchhausen's Narrative of his Marvellous Travels and Campaigns in Russia, 1785.

[25] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations (ICLR), 2016.

[26] Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 16:1629–1676, 2015.

[27] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.

[28] Zhao Song, Ron Parr, and Lawrence Carin. Revisiting the softmax Bellman operator: New benefits and new perspective. In International Conference on Machine Learning (ICML), 2019.

[29] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

[30] Yoshihisa Tsurumine, Yunduan Cui, Eiji Uchibe, and Takamitsu Matsubara. Deep dynamic policy programming for robot control with raw images. In International Conference on Intelligent Robots and Systems (IROS), 2017.

[31] Harm Van Seijen, Mehdi Fatemi, and Arash Tavakoli. Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 14134–14144, 2019.

[32] Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. Leverage the average: An analysis of regularization in RL. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[33] Nino Vieillard, Bruno Scherrer, Olivier Pietquin, and Matthieu Geist. Momentum in reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

[34] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[35] Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 6190–6199, 2019.

[36] Brian D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, University of Washington, 2010.