# Muesli: Combining Improvements in Policy Optimization

Matteo Hessel\*¹, Ivo Danihelka\*¹,², Fabio Viola¹, Arthur Guez¹, Simon Schmitt¹, Laurent Sifre¹, Theophane Weber¹, David Silver¹,², Hado van Hasselt¹

\*Equal contribution. ¹DeepMind, London, UK. ²University College London. Correspondence to: Matteo Hessel, Ivo Danihelka, Hado van Hasselt. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has a computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.

## 1. Introduction

Reinforcement learning (RL) is a general formulation for the problem of sequential decision making under uncertainty, where a learning system (the agent) must learn to maximize the cumulative rewards provided by the world it is embedded in (the environment), from experience of interacting with that environment (Sutton & Barto, 2018). An agent is said to be value-based if its behavior, i.e. its policy, is inferred (e.g. by inspection) from learned value estimates (Sutton, 1988; Watkins, 1989; Rummery & Niranjan, 1994; Tesauro, 1995). In contrast, a policy-based agent directly updates a (parametric) policy (Williams, 1992; Sutton et al., 2000) based on past experience. We may also classify as model-free the agents that update values and policies directly from experience (Sutton, 1988), and as model-based those that use (learned) models (Oh et al., 2015; van Hasselt et al., 2019) to plan either global (Sutton, 1990) or local (Richalet et al., 1978; Kaelbling & Lozano-Pérez, 2010; Silver & Veness, 2010) values and policies. Such distinctions are useful for communication, but, to master the singular goal of optimizing rewards in an environment, agents often combine ideas from more than one of these areas (Hessel et al., 2018; Silver et al., 2016; Schrittwieser et al., 2020).

Figure 1. Median human-normalized score across 57 Atari games. (a) Muesli and other policy updates (TRPO penalty, PPO, PG, MPO); all use the same IMPALA network and a moderate amount of replay data (75%). Shades denote standard errors across 5 seeds. (b) Muesli with the larger MuZero network and the high replay fraction used by MuZero (95%), compared to the latest version of MuZero (MuZero[2021]). These large-scale runs use 2 seeds. Muesli still acts directly with the policy network and uses one-step look-aheads in updates.

In this paper, we focus on a critical part of RL, namely policy optimization. We leave a precise formulation of the problem for later, but different policy optimization algorithms can be seen as answers to the following crucial question: given data about an agent's interactions with the world, and predictions in the form of value functions or models, how should we update the agent's policy? We start from an analysis of the desiderata for general policy optimization.
These include support for partial observability and function approximation, the ability to learn stochastic policies, robustness to diverse environments or training regimes (e.g. off-policy data), and being able to represent knowledge as value functions and models. See Section 3 for further details on our desiderata for policy optimization.

Then, we propose a policy update combining regularized policy optimization with model-based ideas so as to make progress on the dimensions highlighted in the desiderata. More specifically, we use a model inspired by MuZero (Schrittwieser et al., 2020) to estimate action values via one-step look-ahead. These action values are then plugged into a modified Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018) mechanism, based on clipped normalized advantages, that is robust to scaling issues without requiring constrained optimization. The overall update, named Muesli, then combines the clipped MPO targets and policy gradients into a direct method (Vieillard et al., 2020) for regularized policy optimization.

The majority of our experiments were performed on 57 classic Atari games from the Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2018), a popular benchmark for deep RL. We found that, on Atari, Muesli can match the state-of-the-art performance of MuZero, without requiring deep search, but instead acting directly with the policy network and using one-step look-aheads in the updates. To help understand the different design choices made in Muesli, our experiments on Atari include multiple ablations of our proposed update. Additionally, to evaluate how well our method generalises to different domains, we performed experiments on a suite of continuous control environments (based on MuJoCo and sourced from the OpenAI Gym (Brockman et al., 2016)). We also conducted experiments in 9x9 Go in self-play, to evaluate our policy update in a domain traditionally dominated by search methods.

## 2. Background

**The environment.** We are interested in episodic environments with variable episode lengths (e.g. Atari games), formalized as Markov Decision Processes (MDPs) with initial state distribution $\mu$ and discount $\gamma \in [0, 1)$; ends of episodes correspond to absorbing states with no rewards.

**The objective.** The agent starts at a state $S_0 \sim \mu$ from the initial state distribution. At each time step $t$, the agent takes an action $A_t \sim \pi(\cdot|S_t)$ from a policy $\pi$, obtains the reward $R_{t+1}$ and transitions to the next state $S_{t+1}$. The expected sum of discounted rewards after a state-action pair is called the action-value or Q-value $q_\pi(s, a)$:

$$q_\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, \pi, S_0 = s, A_0 = a\right]. \tag{1}$$

The value of a state $s$ is $v_\pi(s) = \mathbb{E}_{A \sim \pi(\cdot|s)}\left[q_\pi(s, A)\right]$ and the objective is to find a policy $\pi$ that maximizes the expected value of the states from the initial state distribution:

$$J(\pi) = \mathbb{E}_{S \sim \mu}\left[v_\pi(S)\right]. \tag{2}$$

**Policy improvement.** Policy improvement is one of the fundamental building blocks of reinforcement learning algorithms. Given a policy $\pi_{\text{prior}}$ and its Q-values $q_{\pi_{\text{prior}}}(s, a)$, a policy improvement step constructs a new policy $\pi$ such that $v_\pi(s) \geq v_{\pi_{\text{prior}}}(s)\ \forall s$. For instance, a basic policy improvement step is to construct the greedy policy:

$$\arg\max_\pi\ \mathbb{E}_{A \sim \pi(\cdot|s)}\left[q_{\pi_{\text{prior}}}(s, A)\right]. \tag{3}$$

**Regularized policy optimization.** A regularized policy optimization algorithm solves the following problem:

$$\arg\max_\pi\ \mathbb{E}_{A \sim \pi(\cdot|s)}\left[\hat{q}_{\pi_{\text{prior}}}(s, A)\right] - \Omega(\pi), \tag{4}$$

where $\hat{q}_{\pi_{\text{prior}}}(s, a)$ are approximate Q-values of a $\pi_{\text{prior}}$ policy and $\Omega(\pi) \in \mathbb{R}$ is a regularizer. For example, we may use as the regularizer the negative entropy of the policy, $\Omega(\pi) = -\lambda \mathrm{H}[\pi]$, weighted by an entropy cost $\lambda$ (Williams & Peng, 1991). Alternatively, we may also use $\Omega(\pi) = \lambda \mathrm{KL}(\pi_{\text{prior}}, \pi)$, where $\pi_{\text{prior}}$ is the previous policy, as used in TRPO (Schulman et al., 2015).
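For concreteness, here is a minimal numerical sketch of the objective in Eq. (4) for a single state and a small discrete action set. The Q-values, the prior policy, and the cost $\lambda$ are arbitrary illustrative choices, and the helper names (`regularized_objective`, `neg_entropy`) are ours, not anything defined in the paper.

```python
import numpy as np

def neg_entropy(pi, entropy_cost):
    """Omega(pi) = -lambda * H[pi], the negative-entropy regularizer."""
    return entropy_cost * np.sum(pi * np.log(pi + 1e-12))

def kl(p, q):
    """KL(p, q) between two discrete distributions."""
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))

def regularized_objective(pi, q_hat, omega):
    """Eq. (4) for one state: E_{A~pi}[q_hat(s, A)] - Omega(pi)."""
    return np.dot(pi, q_hat) - omega(pi)

# Arbitrary illustrative numbers: 3 actions, approximate Q-values of pi_prior.
q_hat = np.array([1.0, 0.5, -0.2])
pi_prior = np.array([0.4, 0.4, 0.2])
lam = 0.1

greedy = np.eye(3)[np.argmax(q_hat)]   # greedy improvement step (Eq. 3)
uniform = np.full(3, 1 / 3)

for name, pi in [("greedy", greedy), ("uniform", uniform), ("prior", pi_prior)]:
    print(name,
          regularized_objective(pi, q_hat, lambda p: neg_entropy(p, lam)),
          regularized_objective(pi, q_hat, lambda p: lam * kl(pi_prior, p)))
```

With the entropy regularizer the greedy policy is only mildly penalized for being deterministic, whereas the $\mathrm{KL}(\pi_{\text{prior}}, \pi)$ regularizer penalizes it heavily for moving all probability mass away from the prior.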
Following the terminology introduced by Vieillard et al. (2020), we can then solve Eq. 4 by either direct or indirect methods. If $\pi(a|s)$ is differentiable with respect to the policy parameters, a direct method applies gradient ascent to

$$J(s, \pi) = \mathbb{E}_{A \sim \pi(\cdot|s)}\left[\hat{q}_{\pi_{\text{prior}}}(s, A)\right] - \Omega(\pi). \tag{5}$$

Using the log derivative trick to sample the gradient of the expectation results in the canonical (regularized) policy gradient update (Sutton et al., 2000).

In indirect methods, the solution of the optimization problem (4) is found exactly, or numerically, for one state and then distilled into a parametric policy. For example, Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018) uses the regularizer $\Omega(\pi) = \lambda \mathrm{KL}(\pi, \pi_{\text{prior}})$, for which the exact solution to the regularized problem is

$$\pi_{\text{MPO}}(a|s) = \pi_{\text{prior}}(a|s) \exp\left(\frac{\hat{q}_{\pi_{\text{prior}}}(s, a)}{\lambda}\right) \frac{1}{z(s)}, \tag{6}$$

where $z(s) = \mathbb{E}_{A \sim \pi_{\text{prior}}(\cdot|s)}\left[\exp\left(\frac{\hat{q}_{\pi_{\text{prior}}}(s, A)}{\lambda}\right)\right]$ is a normalization factor that ensures that the resulting probabilities form a valid probability distribution (i.e. they sum up to 1).

**MuZero.** MuZero (Schrittwieser et al., 2020) uses a weakly grounded (Grimm et al., 2020) transition model $m$ trained end-to-end exclusively to support accurate reward, value and policy predictions: $m(s_t, a_t, a_{t+1}, \ldots, a_{t+k}) \approx \left(R_{t+k+1}, v_\pi(S_{t+k+1}), \pi(\cdot|S_{t+k+1})\right)$. Since such a model can be unrolled to generate sequences of rewards and value estimates for different sequences of actions (or plans), it can be used to perform Monte-Carlo Tree Search, or MCTS (Coulom, 2006). MuZero then uses MCTS to construct a policy as the categorical distribution over the normalized visit counts for the actions in the root of the search tree; this policy is then used both to select actions, and as a policy target for the policy network.

Despite MuZero being introduced with different motivations, Grill et al. (2020) showed that the MuZero policy update can also be interpreted as approximately solving a regularized policy optimization problem with the regularizer $\Omega(\pi) = \lambda \mathrm{KL}(\pi_{\text{prior}}, \pi)$ also used by the TRPO algorithm (Schulman et al., 2015).
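As a sketch of the indirect route, the exact MPO target of Eq. (6) can be computed in closed form for a discrete action space. The snippet below does so for arbitrary illustrative numbers; `mpo_target` is our own helper name, not an API from the paper.

```python
import numpy as np

def mpo_target(pi_prior, q_hat, lam):
    """Exact solution of the KL(pi, pi_prior)-regularized problem (Eq. 6):
    pi_MPO(a|s) proportional to pi_prior(a|s) * exp(q_hat(s, a) / lam)."""
    w = pi_prior * np.exp(q_hat / lam)
    # w.sum() equals z(s) = E_{A~pi_prior}[exp(q_hat(s, A) / lam)].
    return w / w.sum()

# Arbitrary illustrative numbers: 3 actions.
pi_prior = np.array([0.5, 0.3, 0.2])
q_hat = np.array([0.1, 0.4, -0.3])

for lam in [10.0, 1.0, 0.1]:
    print(lam, mpo_target(pi_prior, q_hat, lam))
# Large lam -> target stays close to pi_prior; small lam -> mass concentrates on argmax_a q_hat.
```

In a full indirect method this non-parametric target would then be distilled into the parametric policy network.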
## 3. Desiderata and motivating principles

First, to motivate our investigation, we discuss a few desiderata for a general policy optimization algorithm.

### 3.1. Observability and function approximation

Being able to learn stochastic policies, and being able to leverage Monte-Carlo or multi-step bootstrapped return estimates, is important for a policy update to be truly general. This is motivated by the challenges of learning in partially observable environments (Åström, 1965) or, more generally, in settings where function approximation is used (Sutton & Barto, 2018). Note that these two are closely related: if a chosen function approximation ignores a state feature, then that state feature is, for all practical purposes, not observable.

In POMDPs the optimal memory-less stochastic policy can be better than any memory-less deterministic policy, as shown by Singh et al. (1994). As an illustration, consider the MDP in Figure 2; in this problem we have 4 states and, on each step, 2 actions (up or down). If the state representation $\phi(s)$ is the same for all states, the optimal policy is stochastic. We can easily find such a policy with pen and paper: $\pi^*(\text{up}|\phi(s)) = \frac{5}{8}$; see Appendix B for details.

It is also known that, in these settings, it is often preferable to leverage Monte-Carlo returns, or at least multi-step bootstrapped estimators, instead of using one-step targets (Jaakkola et al., 1994). Consider again the MDP in Figure 2: bootstrapping from $v_\pi(\phi(s))$ produces biased estimates of the expected return, because $v_\pi(\phi(s))$ aggregates the values of multiple states; again, see Appendix B for the derivation.

Among the methods in Section 2, both policy gradients and MPO allow convergence to stochastic policies, but only policy gradients naturally incorporate multi-step return estimators. In MPO, stochastic return estimates could make the agent overly optimistic ($\mathbb{E}[\exp(G)] \geq \exp(\mathbb{E}[G])$).

### 3.2. Policy representation

Policies may be constructed from action values, or they may combine action values and other quantities (e.g., a direct parametrization of the policy, or historical data). We argue that action values alone are not enough.

First, we show that action values are not always enough to represent the best stochastic policy. Consider again the MDP in Figure 2 with identical state representation $\phi(s)$ in all states. As discussed, the optimal stochastic policy is $\pi^*(\text{up}|\phi(s)) = \frac{5}{8}$. This non-uniform policy cannot be inferred from Q-values, as these are the same for all actions and are thus wholly uninformative about the best probabilities: $q_{\pi^*}(\phi(s), \text{up}) = q_{\pi^*}(\phi(s), \text{down}) = \frac{1}{4}$. Similarly, a model on its own is also insufficient without a policy, as it would produce the same uninformative action values.

Figure 2. An episodic MDP with 4 states. State 1 is the initial state. State 4 is terminal. At each step, the agent can choose amongst two actions: up or down. The rewards range from -1 to 1, as displayed. The discount is 1. If the state representation $\phi(s)$ is the same in all states, the best stochastic policy is $\pi^*(\text{up}|\phi(s)) = \frac{5}{8}$.

One approach to address this limitation is to parameterize the policy explicitly (e.g. via a policy network). This has the additional advantage that it allows us to directly sample both discrete (Mnih et al., 2016) and continuous (van Hasselt & Wiering, 2007; Degris et al., 2012; Silver et al., 2014) actions. In contrast, maximizing Q-values over continuous action spaces is challenging. Access to a parametric policy network that can be queried directly is also beneficial for agents that act by planning with a learned model (e.g. via MCTS), as it allows to guide search in large or continuous action spaces.
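The MDP of Figure 2 is specified in the figure and analysed in Appendix B. As a self-contained stand-in for the same aliasing argument, the sketch below uses the classic short-corridor example with switched actions (Sutton & Barto, 2018) rather than the Figure 2 MDP: under state aliasing a memoryless policy reduces to a single probability $p$ of choosing "right"; both deterministic choices never reach the goal, while the best stochastic policy (with $p \approx 0.59$) does.

```python
import numpy as np

# Short corridor with switched actions (Sutton & Barto, 2018, Example 13.1).
# Non-terminal states 0, 1, 2; state 3 is terminal; reward -1 per step; discount 1.
# All non-terminal states look identical to the agent (state aliasing), so a
# memoryless policy is one probability p of taking the "right" action.
# In state 1 the action effects are switched.
RIGHT_NEXT = {0: 1, 1: 0, 2: 3}   # successor when the agent picks "right"
LEFT_NEXT = {0: 0, 1: 2, 2: 1}    # successor when the agent picks "left"

def value_of_start(p):
    """Solve v(s) = -1 + p * v(right successor) + (1 - p) * v(left successor)."""
    a = np.eye(3)
    b = -np.ones(3)
    for s in range(3):
        for prob, nxt in ((p, RIGHT_NEXT[s]), (1 - p, LEFT_NEXT[s])):
            if nxt != 3:                 # the terminal state has value 0
                a[s, nxt] -= prob
    return np.linalg.solve(a, b)[0]      # value of the start state 0

probs = np.linspace(0.01, 0.99, 99)
values = [value_of_start(p) for p in probs]
best = probs[int(np.argmax(values))]
print(best, max(values))  # ~0.59 and ~-11.6: better than any deterministic memoryless policy
```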
### 3.3. Robust learning

We seek algorithms that are robust to: 1) off-policy or historical data; 2) inaccuracies in values and models; 3) diversity of environments. In the following paragraphs we discuss what each of these entails.

Reusing data from previous iterations of the policy $\pi$ (Lin, 1992; Riedmiller, 2005; Mnih et al., 2015) can make RL more data efficient. However, when computing the gradient of the objective $\mathbb{E}_{A \sim \pi(\cdot|s)}\left[\hat{q}_{\pi_{\text{prior}}}(s, A)\right]$ on data from an older policy $\pi_{\text{prior}}$, an unregularized application of the gradient can degrade the value of $\pi$. The amount of degradation depends on the total variation distance between $\pi$ and $\pi_{\text{prior}}$, and we can use a regularizer to control it, as in Conservative Policy Iteration (Kakade & Langford, 2002), Trust Region Policy Optimization (Schulman et al., 2015), and Appendix C.

Whether we learn on- or off-policy, the agent's predictions incorporate errors. Regularization can also help here. For instance, if Q-values have errors, the MPO regularizer $\Omega(\pi) = \lambda \mathrm{KL}(\pi, \pi_{\text{prior}})$ maintains a strong performance bound (Vieillard et al., 2020): the errors from multiple iterations average out, instead of appearing in a discounted sum of the absolute errors. While not all assumptions behind this result apply in an approximate setting, Section 5 shows that MPO-like regularizers are helpful empirically.

**Observability and function approximation**
- 1a) Support learning stochastic policies
- 1b) Leverage Monte-Carlo targets

**Policy representation**
- 2a) Support learning the optimal memory-less policy
- 2b) Scale to (large) discrete action spaces
- 2c) Scale to continuous action spaces

**Robust learning**
- 3a) Support off-policy and historical data
- 3b) Deal gracefully with inaccuracies in the values/model
- 3c) Be robust to diverse reward scales
- 3d) Avoid problem-dependent hyperparameters

**Rich representation of knowledge**
- 4a) Estimate values (variance reduction, bootstrapping)
- 4b) Learn a model (representation, composability)

Table 1. A recap of the desiderata or guiding principles that we believe are important when designing general policy optimization algorithms. These are discussed in Section 3.

Finally, robustness to diverse environments is critical to ensure a policy optimization algorithm operates effectively in novel settings. This can take various forms, but we focus on robustness to diverse reward scales and on minimizing problem-dependent hyperparameters. The latter are an especially subtle form of inductive bias that may limit the applicability of a method to established benchmarks (Hessel et al., 2019).

### 3.4. Rich representation of knowledge

Even if the policy is parametrized explicitly, we argue it is important for the agent to represent knowledge in multiple ways (Degris & Modayil, 2012) to update such a policy in a reliable and robust way. Two classes of predictions have proven particularly useful: value functions and models.

Value functions (Sutton, 1988; Sutton et al., 2011) can capture knowledge about a cumulant over long horizons, but can be learned with a cost independent of the span of the predictions (van Hasselt & Sutton, 2015). They have been used extensively in policy optimization, e.g., to implement forms of variance reduction (Williams, 1992), and to allow updating policies online through bootstrapping, without waiting for episodes to fully resolve (Sutton et al., 2000).

Models can also be useful in various ways: 1) learning a model can act as an auxiliary task (Schmidhuber, 1990; Sutton et al., 2011; Jaderberg et al., 2017; Guez et al., 2020) and help with representation learning; 2) a learned model may be used to update policies and values via planning (Werbos, 1987; Sutton, 1990; Ha & Schmidhuber, 2018); 3) finally, the model may be used to plan for action selection (Richalet et al., 1978; Silver & Veness, 2010). These benefits of learned models are entangled in MuZero. Sometimes, it may be useful to decouple them, for instance to retain the benefits of models for representation learning and policy optimization, without depending on the computationally intensive process of planning for action selection.

## 4. Robust yet simple policy optimization

The full list of desiderata is presented in Table 1. These are far from solved problems, but they can be helpful to reason about policy updates. In this section, we describe a policy optimization algorithm designed to address these desiderata.
### 4.1. Our proposed clipped MPO (CMPO) regularizer

We use the Maximum a Posteriori Policy Optimization (MPO) algorithm (Abdolmaleki et al., 2018) as a starting point, since it can learn stochastic policies (1a), supports discrete and continuous action spaces (2c), can learn stably from off-policy data (3a), and has strong performance bounds even when using approximate Q-values (3b). We then improve the degree of control provided by MPO on the total variation distance between $\pi$ and $\pi_{\text{prior}}$ (3a), avoiding sensitive domain-specific hyperparameters (3d).

MPO uses the regularizer $\Omega(\pi) = \lambda \mathrm{KL}(\pi, \pi_{\text{prior}})$, where $\pi_{\text{prior}}$ is the previous policy. Since we are interested in learning from stale data, we allow $\pi_{\text{prior}}$ to correspond to arbitrary previous policies, and we introduce the regularizer $\Omega(\pi) = \lambda \mathrm{KL}(\pi_{\text{CMPO}}, \pi)$, based on the new target

$$\pi_{\text{CMPO}}(a|s) = \frac{\pi_{\text{prior}}(a|s) \exp\left(\mathrm{clip}(\hat{\mathrm{adv}}(s, a), -c, c)\right)}{z_{\text{CMPO}}(s)},$$

where $\hat{\mathrm{adv}}(s, a)$ is a non-stochastic approximation of the advantage $\hat{q}_{\pi_{\text{prior}}}(s, a) - \hat{v}_{\pi_{\text{prior}}}(s)$ and the factor $z_{\text{CMPO}}(s)$ ensures the policy is a valid probability distribution.

The $\pi_{\text{CMPO}}$ term we use in the regularizer has an interesting relation to natural policy gradients (Kakade, 2001): $\pi_{\text{CMPO}}$ is obtained if the natural gradient is computed with respect to the logits of $\pi_{\text{prior}}$ and the expected gradient is then clipped (for a proof, note that the natural policy gradient with respect to the logits is equal to the advantages (Agarwal et al., 2019)).

The clipping threshold $c$ controls the maximum total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$. Specifically, the total variation distance between $\pi'$ and $\pi$ is defined as

$$D_{\text{TV}}(\pi'(\cdot|s), \pi(\cdot|s)) = \frac{1}{2} \sum_a |\pi'(a|s) - \pi(a|s)|. \tag{8}$$

As discussed in Section 3.3, constrained total variation supports robust off-policy learning. The clipped advantages allow us to derive not only a bound for the total variation distance but an exact formula:

**Theorem 4.1** (Maximum CMPO total variation distance). *For any clipping threshold $c > 0$, we have:*

$$\max_{\pi_{\text{prior}},\, \hat{\mathrm{adv}},\, s} D_{\text{TV}}\left(\pi_{\text{CMPO}}(\cdot|s), \pi_{\text{prior}}(\cdot|s)\right) = \tanh\left(\frac{c}{2}\right).$$

Figure 3. (a) The maximum total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$ is exclusively a function of the clipping threshold $c$. (b) A comparison of the Muesli sensitivity to the regularizer multiplier $\lambda$ on 10 Atari games (alien, beam_rider, breakout, gravitar, hero, ms_pacman, phoenix, robotank, seaquest, time_pilot). Each dot is the mean of 5 runs with different random seeds and the black line is the mean across all 10 games. With Muesli's normalized advantages, the good range of values for $\lambda$ is fairly large, not strongly problem dependent, and $\lambda = 1$ performs well on many environments.

We refer readers to Appendix D for the proof of Theorem 4.1; we also verified the theorem's predictions numerically. Note that the maximum total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$ does not depend on the number of actions or other environment properties (3d). It only depends on the clipping threshold, as visualized in Figure 3a. This allows us to control the maximum total variation distance under a CMPO update, for instance by setting the maximum total variation distance to $\epsilon$, without requiring the constrained optimization procedure used in the original MPO paper. Instead of constrained optimization, we just set $c = 2\,\mathrm{arctanh}(\epsilon)$. We used $c = 1$ in our experiments, across all domains.
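Theorem 4.1 is straightforward to check numerically, as mentioned above. The sketch below builds the CMPO target for randomly drawn priors and advantages and compares the largest observed total variation distance against $\tanh(c/2)$; the random-search setup and helper names are ours, chosen purely for illustration.

```python
import numpy as np

def cmpo_target(pi_prior, adv, c):
    """CMPO target: pi_CMPO(a|s) proportional to pi_prior(a|s) * exp(clip(adv(s,a), -c, c))."""
    w = pi_prior * np.exp(np.clip(adv, -c, c))
    return w / w.sum()

def tv_distance(p, q):
    """Total variation distance (Eq. 8)."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
c = 1.0
max_tv = 0.0
for _ in range(100_000):
    n = rng.integers(2, 6)                       # random number of actions
    pi_prior = rng.dirichlet(np.ones(n) * rng.uniform(0.1, 2.0))
    adv = rng.normal(scale=10.0, size=n)         # large advantages so clipping is active
    max_tv = max(max_tv, tv_distance(cmpo_target(pi_prior, adv, c), pi_prior))

print(max_tv, np.tanh(c / 2))  # the observed maximum approaches, and never exceeds, tanh(c/2)

# Conversely, to cap the total variation at eps, set c = 2 * arctanh(eps):
eps = np.tanh(c / 2)
assert np.isclose(2 * np.arctanh(eps), c)
```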
### 4.2. A novel policy update

Given the proposed regularizer $\Omega(\pi) = \lambda \mathrm{KL}(\pi_{\text{CMPO}}, \pi)$, we can update the policy by direct optimization of the regularized objective, that is, by gradient descent on

$$L_{\text{PG+CMPO}}(\pi, s) = -\mathbb{E}_{A \sim \pi(\cdot|s)}\left[\hat{\mathrm{adv}}(s, A)\right] + \lambda\, \mathrm{KL}\left(\pi_{\text{CMPO}}(\cdot|s), \pi(\cdot|s)\right), \tag{9}$$

where the advantage terms in each component of the loss can be normalized using the approach described in Section 4.5 to improve the robustness to reward scales. The first term corresponds to a standard policy gradient update, thus allowing stochastic estimates of $\hat{\mathrm{adv}}(s, A)$ that use Monte-Carlo or multi-step estimators (1b). The second term adds regularization via distillation of the CMPO target, to preserve the desiderata addressed in Section 4.1. Critically, the hyperparameter $\lambda$ is easy to set (3d), because even if $\lambda$ is high, $\lambda\, \mathrm{KL}(\pi_{\text{CMPO}}(\cdot|s), \pi(\cdot|s))$ still proposes improvements to the policy $\pi_{\text{prior}}$. This property is missing in popular regularizers that maximize entropy or minimize a distance from $\pi_{\text{prior}}$. We refer to the sensitivity analysis depicted in Figure 3b for a sample of the wide range of values of $\lambda$ that we found to perform well on Atari. We used $\lambda = 1$ in all other experiments reported in the paper.

Both terms can be sampled, allowing us to trade off the computation cost and the variance of the update; this is especially useful in large or continuous action spaces (2b), (2c). We can sample the gradient of the first term by computing the loss on data generated by a prior policy $\pi_{\text{prior}}$, and then using importance sampling to correct for the distribution shift with respect to $\pi$. This results in the estimator

$$-\frac{\pi(a|s)}{\pi_b(a|s)}\left(G^v(s, a) - \hat{v}_{\pi_{\text{prior}}}(s)\right) \tag{10}$$

for the first term of the policy loss. In this expression, $\pi_b(a|s)$ is the behavior policy; the advantage $\left(G^v(s, a) - \hat{v}_{\pi_{\text{prior}}}(s)\right)$ uses a stochastic multi-step bootstrapped estimator $G^v(s, a)$ and a learned baseline $\hat{v}_{\pi_{\text{prior}}}(s)$.

We can also sample the regularizer, by computing a stochastic estimate of the KL on a subset of $N$ actions $a^{(k)}$ sampled from $\pi_{\text{prior}}(\cdot|s)$. In that case, the second term of Eq. 9 becomes (ignoring an additive constant):

$$\frac{\lambda}{N} \sum_{k=1}^{N} \left[-\frac{\exp\left(\mathrm{clip}(\hat{\mathrm{adv}}(s, a^{(k)}), -c, c)\right)}{z_{\text{CMPO}}(s)} \log \pi(a^{(k)}|s)\right],$$

where $\hat{\mathrm{adv}}(s, a) = \hat{q}_{\pi_{\text{prior}}}(s, a) - \hat{v}_{\pi_{\text{prior}}}(s)$ is computed from the learned values $\hat{q}_{\pi_{\text{prior}}}$ and $\hat{v}_{\pi_{\text{prior}}}$. To support sampling just a few actions from the current state $s$, we can estimate $z_{\text{CMPO}}(s)$ for the $i$-th sample out of $N$ as

$$z^{(i)}_{\text{CMPO}}(s) = \frac{z_{\text{init}} + \sum_{k \neq i}^{N} \exp\left(\mathrm{clip}(\hat{\mathrm{adv}}(s, a^{(k)}), -c, c)\right)}{N},$$

where $z_{\text{init}}$ is an initial estimate. We use $z_{\text{init}} = 1$.

### 4.3. Learning a model

As discussed in Section 3.4, learning models has several potential benefits. Thus, we propose to train a model alongside the policy and value estimates (4b). As in MuZero (Schrittwieser et al., 2020), our model is not trained to reconstruct observations; it is only required to provide accurate estimates of rewards, values and policies. It can be seen as an instance of value equivalent models (Grimm et al., 2020). For training, the model is unrolled $k > 1$ steps, taking as inputs an initial state $s_t$ and an action sequence a