When Maximum Entropy Misleads Policy Optimization

Ruipeng Zhang 1  Ya-Chien Chang 1  Sicun Gao 1

The Maximum Entropy Reinforcement Learning (Max Ent RL) framework is a leading approach for achieving efficient learning and robust performance across many RL tasks. However, Max Ent methods have also been shown to struggle with performance-critical control problems in practice, where non-Max Ent algorithms can successfully learn. In this work, we analyze how the trade-off between robustness and optimality affects the performance of Max Ent algorithms in complex control tasks: while entropy maximization enhances exploration and robustness, it can also mislead policy optimization, leading to failure in tasks that require precise, low-entropy policies. Through experiments on a variety of control problems, we concretely demonstrate this misleading effect. Our analysis leads to a better understanding of how to balance reward design and entropy maximization in challenging control problems.

1. Introduction

The Maximum Entropy Reinforcement Learning (Max Ent RL) framework (Ziebart et al., 2008; Abdolmaleki et al., 2018; Haarnoja et al., 2018a; Han & Sung, 2021) augments the standard objective of maximizing return with the additional objective of maximizing policy entropy. Max Ent methods such as Soft Actor-Critic (SAC) (Haarnoja et al., 2018a) have shown superior performance over other on-policy or off-policy methods (Schulman, 2015; Schulman et al., 2017; Lillicrap, 2015; Fujimoto et al., 2018) in many standard continuous control benchmarks (Achiam, 2018; Raffin et al., 2021; Weng et al., 2021; Huang et al., 2024). Explanations of their performance include better exploration, smoothing of the optimization landscape, and enhanced robustness to disturbances (Hazan et al., 2019; Ahmed et al., 2019; Eysenbach & Levine, 2021).

1Computer Science and Engineering, UC San Diego. Correspondence to: Ruipeng Zhang. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1. (Upper) In the quadrotor control environment, SAC learns well under simplified dynamics (SAC-Simplified), but fails to learn under realistic dynamics models (SAC-Realistic). PPO can learn well despite the use of the latter (PPO-Realistic). (Lower) Intuitive illustration of hard control problems, where critical states naturally require low-entropy policies, while Max Ent RL can favor mediocre states with robust policies of low returns that branch out towards failure and are not recoverable.

Interestingly, the well-motivated benefits of Max Ent and SAC have not led to their dominance in RL for real-world continuous control problems in practice (Shengren et al., 2022; Tan & Karaköse, 2023; Xu et al., 2021; Radwan et al., 2021). Most recent RL-based robotic control work (Kaufmann et al., 2023; Miki et al., 2022; Zhuang et al., 2024) still mostly uses a combination of imitation learning and fine-tuning with non-Max Ent methods such as PPO (Schulman et al., 2017). The typical factors of consideration that favor PPO over SAC in practice include computational cost, sensitivity to hyperparameters, and ease of customization.
The performance of SAC is often shown to be inferior to PPO despite tuning efforts (Muzahid et al., 2021; Lee & Moon, 2021; Nair et al., 2024). In fact, it is easy to reproduce such behaviors. Figure 1 shows a comparison of SAC and PPO for learning to control a quadrotor to follow a path. When the underlying model for the quadrotor is a simplified dynamics model, SAC can quickly learn a stable controller. When a more realistic dynamics model for the quadrotor is used, SAC always fails, while PPO can succeed under the same initialization and dynamics.

In this paper, we show how the conflict between maximizing entropy and maximizing overall return can be magnified and hinder learning in performance-critical control problems that naturally require precise, low-entropy policies to solve. The example of quadrotor control in Figure 1 highlights a common structure in complex control tasks: achieving desired performance often requires executing precise actions at a sequence of critical states. At these states, only a narrow (often zero-dimensional) subset of the action space is feasible, hence the ground-truth optimal policy has inherently low entropy. (In aerodynamic terms, this is often referred to as flying on the edge of instability.) Conversely, actions that deviate from this narrow feasible set often lead to states that are not recoverable: once the system enters these states, all available actions tend to produce similarly poor outcomes but accumulate short-term entropy benefits that can be favored by Max Ent. Over time, this drift can compound, ultimately pushing the system into irrecoverable failure. Consequently, Max Ent RL may bias the agent toward suboptimal behaviors rather than the precise low-entropy optimal policies that are key to solving hard control problems.

Formalizing this intuition, we give an in-depth analysis of the trade-off. Our main result is that for an arbitrary MDP, there exists an explicit way of introducing entropy traps, which we define as Entropy Bifurcation Extension, such that Max Ent methods can be misled to consider an arbitrary policy distribution as Max Ent-optimal, while the ground-truth optimal policy is not affected by the extension. Importantly, this is not a matter of sample efficiency or exploration bias during training, but is the end result at the convergence of Max Ent algorithms. Consequently, the misleading effect of entropy can easily occur where standard policy optimization methods are not affected.

We then demonstrate that this concern is not merely theoretical, and can in fact explain key failures of Max Ent algorithms in practical control problems. We analyze the behavior of SAC in several realistic control environments, including wheeled vehicles at high speed, quadrotors for trajectory tracking, and quadruped robot control that directly corresponds to hardware platforms. We show that the gap between the value landscapes under Max Ent and regular policy optimization explains the difficulty SAC has in converging to feasible control policies in these environments. Our analysis does not imply that Max Ent is inherently unsuitable for control problems. In fact, following the same analysis, we can now concretely understand why Max Ent leads to successful learning in certain environments that benefit from robust exploration, including common benchmarking OpenAI Gym environments where SAC generally performs well.
Overall, the analysis aims to guide reward design and hyperparameter tuning of Max Ent algorithms for complex control problems. We first give a toy example to showcase the core misleading effect of entropy maximization in Section 4, and then generalize the construction to the technique of entropy bifurcation extension in Section 5. We then show experimental results of how the misleading effects affect learning in practice, and how adaptive tuning of entropy further validates our analysis in Section 6.

2. Related Work

Max Ent RL and Analysis. The Max Ent RL framework incorporates an entropy term in the RL objective and performs probability matching, such that the policy distribution aligns with the soft-value landscape (Ziebart et al., 2008; Toussaint, 2009; Rawlik et al., 2013; Fox et al., 2015; O'Donoghue et al., 2016; Abdolmaleki et al., 2018; Haarnoja et al., 2018a; Mazoure et al., 2020; Han & Sung, 2021). Max Ent RL has strong theoretical connections to probabilistic inference (Toussaint, 2009; Rawlik et al., 2013; Levine, 2018) and is well motivated for ensuring robustness from stochastic-inference (Ziebart, 2010; Eysenbach & Levine, 2021) and game-theoretic perspectives (Grünwald & Dawid, 2004; Ziebart et al., 2010; Han & Sung, 2021; Kim & Sung, 2023). SAC and algorithms such as SAC-NF (Mazoure et al., 2020), MME (Han & Sung, 2021), and MEow (Chao et al., 2024) have outperformed most non-Max Ent methods in many standard benchmarking environments (Brockman, 2016; Todorov et al., 2012; Towers et al., 2024). A common explanation of the benefits of Max Ent is that it enhances exploration (Haarnoja et al., 2018a; Hazan et al., 2019), smooths the optimization landscape (Ahmed et al., 2019), and solves robust versions of the control problems (Eysenbach & Levine, 2021). Despite these benefits, we show that they can also mislead Max Ent into converging to suboptimal policies in complex control problems.

Practical Difficulties with Max Ent in Control Problems. RL algorithms are well known to be sensitive to parameter tuning (Wang & Ni, 2020; Muzahid et al., 2021; Nair et al., 2024). Various recent learning-based robotics control works report that SAC delivers suboptimal solutions compared to PPO in complex control problems (Tan & Karaköse, 2023; Xu et al., 2021; Radwan et al., 2021). We aim to understand the discrepancy between such results and the generally good performance of Max Ent algorithms (Haarnoja et al., 2018b; Achiam, 2018; Raffin et al., 2021).

Trade-off between robustness and optimality. There is a long line of work studying the trade-off between robustness and performance in supervised deep learning (Su et al., 2018; Zhang et al., 2019; Tsipras et al., 2018; Raghunathan et al., 2019; 2020; Yang et al., 2020). This trade-off in the RL context often overlaps with the exploration issues mentioned above. Note that instead of sample efficiency or exploration issues, in this work we focus on pointing out the deeper issue of misguiding policy optimization results at convergence.

3. Preliminaries

A Markov Decision Process (MDP) is defined by the tuple M = (S, A, P, r, γ), where S is the state space; A is the action space, which can be discrete or continuous; P(s′|s, a) is the transition probability distribution, defining the probability of transitioning to state s′ after taking action a at state s; r(s, a) is the reward function, which specifies the immediate scalar reward received for taking action a at state s; and
γ ∈ [0, 1) is the discount factor.

A policy is a mapping π : S → Δ(A), where Δ(A) is the probability simplex over the action space A; it defines a probability distribution over actions given any state in S. We often write the distribution determined by a policy π at a state s as π(·|s). The goal of standard RL is to find an optimal policy π* that maximizes the expected return over the trajectory of states and actions to achieve the best cumulative reward.

Maximum Entropy RL extends the standard framework by incorporating an entropy term into the objective, encouraging stochasticity in the optimal policy. This modification ensures that the agent not only maximizes reward but also maintains exploration. Instead of maximizing only the expected sum of rewards, the agent maximizes the entropy-augmented objective:

    J(π) = E_π[ Σ_{t=0}^∞ γ^t ( r(s_t, a_t) + α H(π(·|s_t)) ) ]    (1)

where H(π(·|s)) = −E_{a∼π(·|s)}[log π(a|s)] is the entropy of the policy at state s. The coefficient α ≥ 0 controls the weight on entropy. The Bellman backup operator T^π is:

    T^π Q(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼p}[V(s_{t+1})],    (2)

where V(s_t) = E_{a_t∼π}[Q(s_t, a_t)] + α H(π(·|s_t)) is the soft state value at s_t. Treating Q-values as energy, the Boltzmann distribution induced at state s is defined as:

    π_Q(a|s) = exp(α^{-1} Q^π(s, a)) / Z(π(·|s))    (3)

with Z(π(·|s)) = ∫ exp(α^{-1} Q^π(s, a)) da as the normalization factor. The policy update in Max Ent RL at each state s aims to minimize the KL divergence between the policy π(·|s) and π_Q(·|s). Naturally, π = π_Q itself is an optimal policy at state s in the Max Ent sense, since D_KL(π ‖ π_Q) = 0.

4. A Toy Example

From the soft value definitions in Max Ent, it is reasonable to expect some trade-off between return and entropy. But the key to understanding how it can affect learning at convergence in major ways is by introducing intermediate states, where entropy shapes the soft values differently so that the Max Ent-optimal policy is misled at an upstream state. We show a simple example to illustrate this effect.

Figure 2. (Left) MDP in the Toy Example: The MDP consists of an initial state s0 and two subsequent states sg (good) and sb (bad). It is clear that an optimal policy for s0 should be centered in the left half of the action interval, since only sg can transit to the terminal state s⁺_T with positive reward. (Right) Learning results of SAC and PPO at s0 at convergence. In the SAC plot, the soft Q-value Q(s0, a) is higher for actions leading to sb, resulting in an incorrect policy centered in A2, the wrong action region (µ: dashed green line, σ: green area). We also show the Q-values learned without the entropy term by separate networks (red line), showing higher values for actions leading to sg. In the PPO plot, it learns the correct optimal policy.

Consider an MDP (depicted in Figure 2) defined as follows:

State space S = {s0, sg, sb, s⁺_T, s⁻_{g,T}, s⁻_{b,T}}. Here s0 is the critical state for selecting actions. sg is the good next state for s0, and sb is the bad next state, in the sense that only sg can transit to the terminal state s⁺_T with positive reward (under some subset of actions), while sb always transits to terminal state s⁻_{b,T} with negative reward.

Action space A = [−1, 1], a continuous interval in R.
The transitions on s0 are defined as follows, reflecting the intuition mentioned above in the state definition: at state s0, any action in A1 = [−1, 0) leads to the good state deterministically, i.e., for all a ∈ A1, P(sg|s0, a) = 1, while any action in A2 = [0, 1] leads to the bad state, i.e., for all a ∈ A2, P(sb|s0, a) = 1. Note that allowing a zero-dimensional overlap between A1 and A2 with random transitions at the overlapping points would not change the results.

The transitions from sg: P(s⁺_T | sg, a) = 1 for a ∈ [−0.1, 0.1] and P(s⁻_{g,T} | sg, a) = 1 for a ∈ [−1, −0.1) ∪ (0.1, 1]. That is, for any action in a small fraction of the action space, a ∈ [−0.1, 0.1], we transit to the positive-reward terminal state s⁺_T.

The transitions from sb under any action deterministically lead to the negative-reward terminal state, i.e., for all a′ ∈ A, P(s⁻_{b,T} | sb, a′) = 1.

The rewards are collected only at the terminal states, with r(s⁺_T) = 1, r(s⁻_{g,T}) = −20, and r(s⁻_{b,T}) = −1. The discount factor is set to γ = 0.99 and α is set to 1.

Given the MDP definition, it is clear that the ground-truth policy at s0 should allocate as much probability mass as possible (within the policy class being considered) to actions in the A1 interval, because A1 is the only range of actions that leads to sg, and then has a chance of collecting the positive reward at the terminal state s⁺_T, if the action at sg is correctly taken. We can calculate the soft Q-values and the Max Ent-optimal policy distributions analytically (shown in Appendix A.1). We can observe that the soft values of sg and sb will force Max Ent to favor sb at s0. Indeed, as shown in Figure 2, the SAC algorithm with a Gaussian policy has its mean converge to the center of A2. The black curve in the SAC plot shows the soft Q-values of the actions, showing how entropy introduces the bias. On the other hand, PPO correctly captures the policy that favors the A1 range. More detailed explanations are in Appendix A.1. We will illustrate how the misleading effect of entropy captured in the toy example can arise in realistic control problems through experimental results in Section 6.
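To make the calculation referenced above (detailed in Appendix A.1) concrete, the short script below is a minimal sketch of our own, not the paper's code: it numerically integrates the partition functions Z(sg) and Z(sb) and recovers the two soft Q-values at s0.

```python
import numpy as np

# Minimal sketch (our own, not the authors' code) of the Appendix A.1 calculation:
# Q(s0, a) = gamma * log Z(s_next), with Z(s) the integral of exp(Q(s, a')/alpha) da',
# using the toy rewards r(s+_T) = 1, r(s-_gT) = -20, r(s-_bT) = -1 and alpha = 1.
gamma, alpha = 0.99, 1.0
a = np.linspace(-1.0, 1.0, 200001)
da = a[1] - a[0]

Q_sg = np.where(np.abs(a) <= 0.1, 1.0, -20.0)  # only |a'| <= 0.1 reaches s+_T from sg
Q_sb = np.full_like(a, -1.0)                   # every action from sb reaches s-_bT

def soft_value(Q):
    # V(s) = alpha * log Z(s), the soft value under the probability-matching policy
    return alpha * np.log(np.sum(np.exp(Q / alpha)) * da)

print(gamma * soft_value(Q_sg))  # Q(s0, a) for a in A1 = [-1, 0): about -0.603
print(gamma * soft_value(Q_sb))  # Q(s0, a) for a in A2 = [0, 1]:  about -0.304
```

Because the uniformly mediocre branch sb has a larger partition function than the narrow good branch sg, probability matching at s0 places the policy mass on A2 even though A1 is the only way to reach the positive reward.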
5. Entropic Bifurcation Extension

Building on the intuition from the toy example, we introduce a general method for manipulating Max Ent policy optimization. The method can target arbitrary states in any MDP and inject new states with a specially configured soft Q-value landscape to mislead the Max Ent-optimal policy. Importantly, the newly introduced states do not change the optimal policy on any non-targeted original states (in either the Max Ent or non-Max Ent sense), but can arbitrarily shape the Max Ent-optimal policy on the targeted state. Thus, the technique can be applied to any number of states, and in the extreme case can change the entire Max Ent-optimal policy to mirror the behavior of the worst possible policy on all states, thereby creating an arbitrarily large gap between the Max Ent-optimal policy and the true optimal policy.

The key to our construction is to introduce new states that create a bifurcating soft Q-value landscape, such that the Max Ent objective of probability matching biases the policy to favor states that have low return and cannot transit back to desired trajectories, thus sabotaging learning.

Definition 5.1 (Entropic Bifurcation Extension). Consider an arbitrary MDP M = ⟨S_M, A_M, P_M, r_M, γ⟩ with continuous action space. The entropic bifurcation extension of M at state s is a set E(M, s) of new MDPs M̂ of the form M̂ = ⟨S_M̂, A_M̂, P_M̂, r_M̂, γ⟩ ∈ E(M, s), constructed with the following steps (illustrated in Figure 3):

Figure 3. MDP M and its entropic bifurcation extension M̂. The extension captures the intuition in the toy example, by using additional intermediate states with specifically designed rewards to mislead Max Ent-optimal policies that match the soft-Q landscapes.

1. We write N(s) = {s′ | P(s′|s, a) > 0 for some a ∈ A_M} to denote the set of next states with non-zero transition probability from s. We introduce new states as follows. For any s′ ∈ N(s), introduce a new state sµ that has not appeared in the state space S_M, and let the correspondence between s′ and sµ be marked as sµ = µ(s′). We can then write Sµ = µ(N(s)) for the set of all new states that are introduced in this way for state s. At the same time, for each sµ, we introduce a fresh state sµ_T that is a new terminal state, and write the set of such newly introduced terminal states as Sµ_T. We now let the state space of M̂ be S_M̂ = S_M ∪ Sµ ∪ Sµ_T. Note that Sµ, Sµ_T, and S_M are always disjoint.

2. At the original state s, for each s′ ∈ N(s), we now set P_M̂(s′|s, a) = 0 and P_M̂(µ(s′)|s, a) = P_M(s′|s, a). That is, we delay the transition to s′ and let the newly introduced state sµ = µ(s′) take over the same transitions. For all other states in S_M \ {s}, the transitions in M and M̂ are the same.

3. For all the newly introduced states in Sµ, their action space is a new Aµ ⊆ R. At each sµ ∈ Sµ, we will define transitions on two disjoint intervals, i.e., A^{sµ}_1 ∪ A^{sµ}_2 ⊆ Aµ and A^{sµ}_1 ∩ A^{sµ}_2 = ∅. This partitioning is sµ-dependent (the intervals will be used to tune the soft-Q landscapes), but for notational simplicity we will just write Aµ_1 and Aµ_2 when possible. Overall, the new MDP M̂ has action space A_M̂ = A_M ∪ Aµ.

4. At each sµ, the transitions are defined to produce bifurcation behaviors, as follows. For any a ∈ Aµ_1, P(s′|sµ, a) = 1. That is, such actions deterministically lead back to the s′ in the original MDP. On the other side, any a ∈ Aµ_2 leads to the new terminal state, P(sµ_T|sµ, a) = 1. That is, the two intervals of actions introduce a bifurcation into two next states, both deterministically. This design generalizes the construction in the toy example, splitting the action space into one part that recovers the original MDP, and a second part that leads to non-recoverable suboptimal behaviors.

5. The reward function r_M̂ is the same as r_M on all the original states and actions, i.e., r_M̂(s, a) = r_M(s, a) for all (s, a) ∈ S_M × A_M. The reward on the new state is r_M̂(sµ, a) = 0 for any action a ∈ Aµ_2. On the new terminal state sµ_T, we can choose r(sµ_T) to shape the soft-Q values as needed. The same discount γ is shared between M and M̂.

Notation 5.2. The construction above defines the set of all possible bifurcation extensions E(M, s). For any specific instance, the only tunable parameters are the sizes |Aµ_1| and |Aµ_2|, as well as the rewards on the newly introduced states sµ and sµ_T. We will show that these parameters already give enough degrees of freedom to shape the soft Q-value landscapes and policy at the target state s.
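To make the construction steps concrete, here is a minimal structural sketch (our own illustration with a dictionary-based MDP representation and hypothetical helper names, not the paper's code); actions are treated abstractly as labeled intervals with widths, so |Aµ_1| and |Aµ_2| are plain numbers:

```python
# Structural sketch of Definition 5.1 (our own illustration, not the paper's code).
# The MDP is a dictionary; an action interval is a (label, width) pair. The widths
# and the rewards r(s_mu, .) and r(s_mu_T) are the tunable knobs of Notation 5.2.

def bifurcation_extend(mdp, s, a1_width=0.5, a2_width=0.5,
                       r_recover=0.0, r_trap_terminal=0.0):
    """Route every transition s -> s' through a fresh bifurcating state mu(s')."""
    new = {
        "states": set(mdp["states"]),
        "terminals": set(mdp["terminals"]),
        # transitions: {state: [(action_interval, next_state, prob), ...]}
        "transitions": {k: list(v) for k, v in mdp["transitions"].items()},
        "rewards": dict(mdp["rewards"]),  # {(state, action_interval): reward}
    }
    rerouted = []
    for interval, s_next, p in mdp["transitions"][s]:
        s_mu, s_mu_T = f"mu({s_next})", f"muT({s_next})"   # step 1: fresh states
        new["states"] |= {s_mu, s_mu_T}
        new["terminals"].add(s_mu_T)
        rerouted.append((interval, s_mu, p))               # step 2: delay s -> s'
        A1 = (f"A1@{s_mu}", a1_width)                      # step 3: disjoint intervals
        A2 = (f"A2@{s_mu}", a2_width)
        new["transitions"][s_mu] = [(A1, s_next, 1.0),     # step 4: recover branch
                                    (A2, s_mu_T, 1.0)]     #         trap branch
        new["rewards"][(s_mu, A1)] = r_recover             # step 5: tunable rewards
        new["rewards"][(s_mu, A2)] = 0.0
        new["rewards"][(s_mu_T, None)] = r_trap_terminal
    new["transitions"][s] = rerouted
    return new
```

The interval widths and the two rewards exposed as arguments are exactly the degrees of freedom listed in Notation 5.2; the lemmas below show how to set them.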
Our main theorem relies on two important lemmas, which guarantee that we can use the newly introduced states to arbitrarily shape the policy at the targeted state s, without affecting the policy at any other state in the original MDP. Backward compatibility ensures that any fixed value of V(s) can remain invariant, by finding an appropriate soft Q-value landscape at s. Importantly, this Q-landscape can be shaped to match an arbitrary policy distribution at s. Forward compatibility ensures that by choosing appropriate r(sµ_T), |Aµ_1|, and |Aµ_2|, the Max Ent-optimal value V(sµ) can match arbitrary target values, given the original values on the original next states s′ ∈ N(s). These two properties ensure the feasibility of shaping the Max Ent-optimal policy at the targeted state s without affecting the policy on any other state of the original MDP. Formally, the lemmas can be stated as follows, and the proofs are in the Appendix.

Lemma 5.3 (Backward Compatibility). Let π(·|s) : A_M → [0, 1] be an arbitrary policy distribution over the action space at the targeted state s. Let v_s ∈ R be an arbitrary desired value for state s. There exists a value function V : Sµ → R on all the newly introduced states sµ such that v_s is the optimal soft value of s under the Max Ent-optimal policy at s (Definition 4).

Lemma 5.4 (Forward Compatibility). Let sµ = µ(s′) be the newly introduced state for s′ ∈ N(s). Let V(s′) be an arbitrarily fixed value for the original next state s′, and r(sµ, a) an arbitrary reward for the newly introduced state sµ. Let v ∈ R be an arbitrary target value. Then, there exist choices of Aµ_1, Aµ_2, and r(sµ_T) such that V(sµ) = v is the optimal soft value for the bifurcating state sµ.

The lemmas ensure that for any transition (s, a, s′) in the original MDP, and for any values V^{πM}_M(s) and V^{πM}_M(s′) in M under some policy πM, there exist parameter choices in the bifurcation extension that maintain the same values of V_M(s) and V_M(s′), while shaping the Max Ent-optimal policy arbitrarily. Consequently, the bifurcation extension can create an arbitrary value gap between the Max Ent-optimal policy and the ground-truth optimal policy at the targeted state s. This leads to the main theorem:

Theorem 5.5 (Bifurcation Extension Misleads Max Ent RL). Let M be an MDP with optimal Max Ent policy π*, and s an arbitrary state in S_M. Let π(·|s) be an arbitrary distribution over the action space A_M at state s. We can construct an entropy bifurcation extension M̂ of M such that M̂ is equivalent to M restricted to S_M \ {s} and does not change its optimal policy on those states, while the Max Ent-optimal policy at s after entropy bifurcation extension can follow the arbitrary distribution π(·|s) over the actions.

Proposition 5.6 (Bifurcation Extension Preserves Optimal Policies). By setting r(sµ, a) = (1 − γ)V(s′) for every newly introduced bifurcating state sµ = µ(s′) and a ∈ Aµ_1, the optimal policy is preserved under bifurcation extension.

Now, since this construction can alter the policy at any state without affecting other states, it can be used independently at any number of states simultaneously, and alter the entire policy of the MDP. In particular, the bifurcation extensions can force the Max Ent-optimal policy to match the worst policy in the original MDP.

Corollary 5.7. Let M be an MDP whose optimal policy has value J⁺ and whose worst policy (minimizing return) has value J⁻.
By applying entropy bifurcation extension to all states in M, we can obtain an MDP M̂ whose Max Ent-optimal policy has value J⁻ while its ground-truth optimal policy still has value J⁺.

Remark 5.8. Our theoretical analysis does not need to use properties of function approximators or other components of practical Max Ent algorithms, because the Max Ent-optimal policies can be directly characterized and manipulated, as they must align with the soft-Q landscape. In the next section we show how this theoretical analysis explains the practical behaviors of SAC in realistic control environments.

6. Experiments

We now show empirical evidence of how the misleading effect of entropy can play a crucial role in the performance of Max Ent algorithms, both when they fail in complex control tasks and when they outperform non-Max Ent methods.

Figure 4. Reward performance of SAC and PPO across five environments (Vehicle, Quadrotor, Opencat, Acrobot, Obstacle2D) with five random seeds. Note that we choose to show SAC and PPO because they are the best representatives of Max Ent and non-Max Ent algorithms.

We first analyze the performance of the algorithms on continuous control environments that involve realistic complex dynamics, including quadrotor control (direct actuation on the propellers), wheeled vehicle path tracking (nonlinear dynamic model at high speed), and quadruped robot control (high-fidelity dynamics simulation from a commercial project (PetoiCamp)). We show how the soft value landscapes mislead policy learning at critical states, leading to the failure of the control task, while non-Max Ent algorithms such as PPO can successfully acquire high-return actions. We also revisit some common benchmark environments to show how the superior performance of Max Ent can be attributed to the same misleading effect, which prevents it from getting stuck as non-Max Ent methods do. This reinforces the more well-known advantages of Max Ent with grounded explanations supported by our theoretical understanding.

To further validate our theory, we add new adaptive entropy tuning in SAC, enabling it to switch from soft-Q to plain Q-values when their landscapes show a large discrepancy. We then observe that the performance of SAC is improved in the environments where it was failing. In particular, the newly learned policy acquires visibly better control actions on critical states. This form of adaptive entropy tuning is not intended as a new algorithm, as it relies on global estimation of Q-values that is hard to scale. Instead, the goal is to show the importance of understanding the effect of entropy, as a direction for future improvement of Max Ent algorithms.

Environments. In the Vehicle environment, the task is to control a wheeled vehicle under the nonlinear dynamic bicycle model (Kong et al., 2015) to move at a constant high speed along a path. Effective control is critical for steering the vehicle onto the path. In the Quadrotor environment, the task is to control a quadrotor to track a simple path under small perturbations. The actions are the independent speeds of its four rotors (Rubí et al., 2020), which makes the learning task harder than with simpler models. The Opencat environment simulates an open-source Arduino-based
quadruped robot (PetoiCamp). The action space is the 8 joint angles of the robot. Acrobot is a two-link planar robot arm with one end fixed at the shoulder and the only actuated joint at the elbow (Spong, 1995). The action is changed to continuous torque values instead of the simpler discrete values in OpenAI Gym (Brockman, 2016). In Obstacle2D, the goal is to navigate an agent to a goal behind a wall, which creates a clear suboptimal local policy that the agent should learn to avoid. Hopper is the standard MuJoCo environment (Todorov et al., 2012) where SAC typically learns faster and more stably than PPO.

Overall Performance. Fig. 4 shows the overall comparison of the learning curves of SAC and PPO across environments. We notice that SAC performs worse than PPO in the first three environments, which are harder to control under complex dynamics, while significantly outperforming PPO in Acrobot, Obstacle2D, and Hopper (shown in Appendix D.3). Our goal is to understand how these behaviors are affected by entropy in the soft value landscapes.

6.1. Misleading Soft-Q Landscapes

Figure 5. Soft and plain Q-value landscapes in Vehicle. (Left) Q-value landscapes with entropy, Qsoft(s, a), and without entropy, Qplain(s, a). The entropy introduced in SAC elevates the true Q-values to encourage exploration, risking missing the only feasible optimal actions. (Right) Rendering of the queried state. The grey rectangle denotes the vehicle, with the black arrow as its heading direction. The current SAC policy steers the vehicle left and moves forward, while the PPO policy reasonably steers it back on track with braking, aligning with the optimal region indicated by Qplain.

In Figure 5, we show the Q-values at a critical state in the Vehicle control environment, where the vehicle is about to deviate substantially from the path to be tracked. Because of the high entropy of the policy on the next states where the vehicle deviates further, the soft-Q value landscape at this state is as shown in the top layer on the left of Figure 5. It fails to capture that critical actions are needed, and instead encourages the agent to stay in the generally high soft-Q value region, whose central action is shown as the green arrow in the right-hand plot of Figure 5. It is clear that this action leads the vehicle to deviate even more aggressively from the target path. On the other hand, the plain Q-value landscape, shown in the bottom layer on the left, is trained on exactly the same states that the SAC agent collected in the buffer, and it recognizes that only a small region of the action space corresponds to greater plain Q-values. The blue dot in the plot, mapped to the blue action vector on the vehicle, shows a correct action direction that steers the vehicle back onto the desired path. Notably, this is the action learned by PPO at this state, and it largely explains the success of PPO in this environment.

Figure 6. Q landscapes in Quadrotor. (Left) The current state is at the end of the black trajectory. The red dashed line is the target track. (Right) Qsoft and Qplain at this state. SAC fails to push upward with minimal action at this state, leading to failure against gravity. PPO successfully applies greater thrust to the back motor (#3), flying the quadrotor towards the path.

In Figure 6, we observe similar behaviors in the Quadrotor control environment. The quadrotor should track a horizontal path. The controller should apply forward-leaning thrusts to the two rotors parallel to the path, while balancing the other two rotors.
The middle and right plots in Figure 6 show the Max Ent and plain Q-values in the action space for the two rotors aligned with the forward direction. Again, because of the high entropy of soft-Q values on mediocre next states that will lead to failure, the value landscape of SAC favors a center at the green dot in the plots, which corresponds to the action shown as the green arrow in the plot on the left. In contrast, the plain Q-value landscape shows that high-quality actions are centered differently. In particular, the blue dot indicates a good action that can be acquired by PPO, controlling the quadrotor towards the right direction.

Figure 7. (Upper) Comparison of soft-Q (left) and plain-Q (right) value landscapes at the current state shown below. (Lower) The second to fourth snapshots show Hopper's state after taking actions at the circled positions of corresponding colors in the action space shown above. The SAC policy benefits from entropy by "leaning forward", a risky move despite this action being suboptimal under the current ground-truth values.

6.2. Benefits of Misleading Landscapes

It is commonly accepted that the main benefit of Max Ent is the exploratory behavior that it naturally encourages, especially in continuous control problems where exploration is harder to achieve than in discrete action spaces. That is, entropy is designed to intentionally mislead in order to avoid getting stuck at local solutions. We have focused on showing how this design can unintentionally cause failure in learning, but the same perspective allows us to more concretely understand the benefits of Max Ent in control problems. We briefly discuss this here, and more details are in the Appendix.

Figure 7 shows the comparison between the soft-Q and plain-Q landscapes during training in Hopper, plotted at a particular state shown in the snapshots in the figure. The action learned by SAC is in fact not a high-return action at this point of the learning process. According to the plain Q-values, the agent should take actions that lead to more stable poses. However, after this risky move of SAC, the nature of the environment makes it possible to achieve higher rewards, which leads to successful learning. Figure 8 shows similar positive outcomes of Max Ent encouraged by the soft value landscape. Overall, it is important to note that the effectiveness of Max Ent learning depends crucially on the nature of the underlying environment, and our analysis aims to give a framework for understanding how reward design and entropy-based exploration should be carefully balanced.

Figure 8. Q/Advantage landscapes of SAC and PPO in Obstacle2D and Acrobot. (Upper) In the Obstacle2D environment, SAC successfully bypasses the wall while PPO fails, as explained by the soft-Q/advantage landscapes. (Lower) In Acrobot, SAC learns a more stable control policy (applying the right torque to neutralize momentum to prevent failing) while PPO updates are stuck at a local solution that fails to robustly stabilize.

Figure 9. Performance of SAC-AdaEnt vs. SAC. (Left) Learning curves. (Middle) Full trajectory rendering. (Right) Behavior of the policy at critical states. In Vehicle, SAC-AdaEnt successfully steers and brakes to bring the vehicle back on track. In Quadrotor, it effectively lifts the quadrotor to follow the designated path.
6.3. Importance of Adaptive Entropy Scaling

To further test the theory that the use of soft values in Max Ent can negatively affect learning, we modify the SAC algorithm by actively monitoring the discrepancy between the soft Q-value landscape and the plain Q-values. We simultaneously train two networks for Qsoft and Qplain. During policy updates, we sample from the action space at each state under the current policy and evaluate the Qsoft and Qplain values. If Qsoft deviates significantly from Qplain, this indicates that entropy could mislead the policy, and we rely on Qplain as the target Q-value for the policy update instead. This adaptive approach, named SAC-AdaEnt, ensures a balance between promoting exploration in less critical states and prioritizing exploitation in states where the misleading effects may result in failure. Note that SAC-AdaEnt is different from SAC with an auto-tuned entropy coefficient, which applies a uniform entropy adjustment across all states.

Figure 9 shows how the adaptive tuning of entropy in SAC-AdaEnt affects learning. In both the Vehicle and Quadrotor environments, the policy learned by SAC-AdaEnt largely corrects the behavior of the SAC policy, as illustrated in the overall trajectories and the critical states shown in the plots. Note that the simple change in SAC-AdaEnt is not intended as a new efficient algorithm, because measuring the discrepancy of the Q landscapes requires a global understanding at each state, which is unrealistic in high dimensions. It does confirm the misleading effects of entropy in control environments where the Max Ent approach was originally failing. More details of the algorithm are in Appendix E.

7. Conclusion

We analyzed a fundamental trade-off of the Max Ent RL framework for solving challenging control problems. While entropy maximization improves exploration and robustness, it can also mislead policy optimization towards failure. We introduced Entropy Bifurcation Extension to show how the ground-truth policy distribution at certain states can become adversarial to the overall learning process in Max Ent RL. Such effects can naturally occur in real-world control tasks, where states with precise low-entropy policies are essential for success. Our experiments validated the theoretical analysis in practical environments such as high-speed vehicle control, quadrotor trajectory tracking, and quadruped control. We also showed that adaptive tuning of entropy can alleviate the misleading effects of entropy, but may offset its benefits too. Overall, our analysis provides concrete guidelines for understanding and tuning the trade-off between reward design and entropy maximization in RL for complex control problems. We believe the results also have implications for potential adversarial attacks in RL-from-human-feedback scenarios.

Acknowledgment

We thank the anonymous reviewers for their helpful comments in revising the paper. This material is based on work supported by NSF Career CCF 2047034, NSF CCF DASS 2217723, and NSF AI Institute CCF 2112665.

Impact Statement

Our paper offers both theoretical and experimental insights without immediate negative impacts. However, it contributes to a deeper understanding of entropy maximization principles and their role in reinforcement learning, laying the groundwork for future advancements in Max Ent RL that may also involve new models of adversarial attacks and defenses on RL-based engineering of critical AI systems.

References

Abdolmaleki, A., Springenberg, J.
T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.

Achiam, J. Spinning Up in Deep Reinforcement Learning. 2018. URL https://github.com/openai/spinningup.

Ahmed, Z., Le Roux, N., Norouzi, M., and Schuurmans, D. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pp. 151-160. PMLR, 2019.

Brockman, G. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Chao, C.-H., Feng, C., Sun, W.-F., Lee, C.-K., See, S., and Lee, C.-Y. Maximum entropy reinforcement learning via energy-based normalizing flow. arXiv preprint arXiv:2405.13629, 2024.

Eysenbach, B. and Levine, S. Maximum entropy RL (provably) solves some robust RL problems. arXiv preprint arXiv:2103.06257, 2021.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.

Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587-1596. PMLR, 2018.

Grünwald, P. D. and Dawid, A. P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. 2004.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1861-1870, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018a. PMLR.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.

Han, S. and Sung, Y. A max-min entropy framework for reinforcement learning. Advances in Neural Information Processing Systems, 34:25732-25745, 2021.

Hazan, E., Kakade, S., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681-2691. PMLR, 2019.

Huang, S., Gallouédec, Q., Felten, F., Raffin, A., Dossa, R. F. J., Zhao, Y., Sullivan, R., Makoviychuk, V., Makoviichuk, D., Danesh, M. H., Roumégous, C., Weng, J., Chen, C., Rahman, M. M., M. Araújo, J. G., Quan, G., Tan, D., Klein, T., Charakorn, R., Towers, M., Berthelot, Y., Mehta, K., Chakraborty, D., KG, A., Charraut, V., Ye, C., Liu, Z., Alegre, L. N., Nikulin, A., Hu, X., Liu, T., Choi, J., and Yi, B. Open RL Benchmark: Comprehensive tracked experiments for reinforcement learning. arXiv preprint arXiv:2402.03046, 2024. URL https://arxiv.org/abs/2402.03046.

Kaufmann, E., Bauersfeld, L., Loquercio, A., Müller, M., Koltun, V., and Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature, 620(7976):982-987, 2023.

Kim, W. and Sung, Y. An adaptive entropy-regularization framework for multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 16829-16852. PMLR, 2023.

Kong, J., Pfeiffer, M., Schildbach, G., and Borrelli, F. Kinematic and dynamic vehicle models for autonomous driving control design. In 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 1094-1099. IEEE, 2015.

Lee, M. H. and Moon, J. Deep reinforcement learning-based UAV navigation and control: A soft actor-critic with hindsight experience replay approach. arXiv preprint arXiv:2106.01016, 2021.

Levine, S.
Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

Lillicrap, T. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Mazoure, B., Doan, T., Durand, A., Pineau, J., and Hjelm, R. D. Leveraging exploration in off-policy algorithms via normalizing flows. In Conference on Robot Learning, pp. 430-444. PMLR, 2020.

Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62):eabk2822, 2022.

Muzahid, A. J. M., Kamarulzaman, S. F., and Rahman, M. A. Comparison of PPO and SAC algorithms towards decision making strategies for collision avoidance among multiple autonomous vehicles. In 2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM), pp. 200-205. IEEE, 2021.

Nair, V. G., D'Souza, J. M., Asha, C., and Rafikh, R. M. A scoping review on unmanned aerial vehicles in disaster management: Challenges and opportunities. Journal of Robotics and Control (JRC), 5(6):1799-1826, 2024.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.

PetoiCamp. OpenCat: Open-source quadruped robot. URL https://github.com/PetoiCamp/OpenCat?tab=readme-ov-file.

Radwan, M. O., Sedky, A. A. H., and Mahar, K. M. Obstacles avoidance of self-driving vehicle using deep reinforcement learning. In 2021 31st International Conference on Computer Theory and Applications (ICCTA), pp. 215-222. IEEE, 2021.

Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1-8, 2021.

Raghunathan, A., Xie, S. M., Yang, F., Duchi, J. C., and Liang, P. Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032, 2019.

Raghunathan, A., Xie, S. M., Yang, F., Duchi, J., and Liang, P. Understanding and mitigating the tradeoff between robustness and accuracy. arXiv preprint arXiv:2002.10716, 2020.

Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. 2013.

Rubí, B., Pérez, R., and Morcego, B. A survey of path following control strategies for UAVs focused on quadrotors. Journal of Intelligent & Robotic Systems, 98(2):241-265, 2020.

Schulman, J. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shengren, H., Salazar, E. M., Vergara, P. P., and Palensky, P. Performance comparison of deep RL algorithms for energy systems optimal scheduling. In 2022 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), pp. 1-6. IEEE, 2022.

Spong, M. W. The swing up control problem for the acrobot. IEEE Control Systems Magazine, 15(1):49-55, 1995.

Su, D., Zhang, H., Chen, H., Yi, J., Chen, P.-Y., and Gao, Y. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In Computer Vision - ECCV 2018, pp. 644-661. Springer International Publishing, 2018.

Tan, Z. and Karaköse, M.
A new approach for drone tracking with drone using proximal policy optimization based distributed deep reinforcement learning. SoftwareX, 23:101497, 2023.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.

Toussaint, M. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1049-1056, 2009.

Towers, M., Kwiatkowski, A., Terry, J., Balis, J. U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.

Wang, Y. and Ni, T. Meta-SAC: Auto-tune the entropy temperature of soft actor-critic via metagradient. arXiv preprint arXiv:2007.01932, 2020.

Weng, J., Chen, H., Yan, D., You, K., Duburcq, A., Zhang, M., Su, H., and Zhu, J. Tianshou: A highly modularized deep reinforcement learning library. arXiv preprint arXiv:2107.14171, 2021.

Xu, C., Zhu, R., and Yang, D. Karting racing: A revisit to PPO and SAC algorithm. In 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI), pp. 310-316. IEEE, 2021.

Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R., and Chaudhuri, K. A closer look at accuracy vs. robustness. arXiv preprint arXiv:2003.02460, 2020.

Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, pp. 7472-7482. PMLR, 2019.

Zhuang, Z., Yao, S., and Zhao, H. Humanoid parkour learning. arXiv preprint arXiv:2406.10759, 2024.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K., et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433-1438. Chicago, IL, USA, 2008.

Ziebart, B. D., Bagnell, D., and Dey, A. K. Maximum causal entropy correlated equilibria for Markov games. In Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

A. Details on the Toy Example

A.1. Calculation of soft Q(s0, a) for SAC

Figure 10. Toy Example results of SAC and PPO at states sg and sb.

In the Max Ent framework, the policy at s0 is iteratively updated towards the Boltzmann distribution π_Q(·|s0). Given the simple transitions in the MDP, we can easily calculate the Q-values for any action. We use α = 1 for the entropy coefficient.

A.1.1. DIRECT CALCULATION WITHOUT PARAMETRIZATION

Since the transitions from s0 to sg and sb are deterministic and yield zero reward, we have

    Q(s0, a) = γ V(s_{g/b}) = γ E_{a′∼π}[ Q(s_{g/b}, a′) − α log π(a′|s_{g/b}) ] = γ ∫ π(a′|s_{g/b}) log Z(s_{g/b}) da′ = γ log Z(s_{g/b})

under the optimal policy π = π_Q = exp(α^{-1} Q(s_{g/b}, a′)) / Z(s_{g/b}).
For a ∈ [−1, 0), which transits to sg,

    Q(s0, a) = γ log Z(sg) = γ log( ∫_{−1}^{1} exp[Q(sg, a′)] da′ )
             = γ log( ∫_{−0.1}^{0.1} e^{1} da′ + ∫_{−1}^{−0.1} e^{−20} da′ + ∫_{0.1}^{1} e^{−20} da′ ) ≈ −0.603,

and for a ∈ [0, 1], which transits to sb,

    Q(s0, a) = γ log Z(sb) = γ log( ∫_{−1}^{1} exp[Q(sb, a′)] da′ ) = γ log( ∫_{−1}^{1} e^{−1} da′ ) ≈ −0.304.

As Q(s0, [−1, 0)) < Q(s0, [0, 1]), SAC is theoretically expected to incorrectly select a ∈ [0, 1] as the optimal policy.

A.1.2. CALCULATION WITH EMPIRICAL PARAMETERIZATION

Empirically, with a Gaussian policy, we can also compute Q(s0, a) given the learned policies at sg and sb, as shown in Figure 10. In practice, π(sg) = Squash[N(µ(sg), σ(sg))] and π(sb) = Squash[N(µ(sb), σ(sb))], where specifically µ(sg) = 0.013, σ(sg) = 0.027, µ(sb) = 0.016, σ(sb) = 0.877; thus

    Q(s0, a) = γ V(s_{g/b}) = γ E_{a′∼π(s_{g/b})}[ Q(s_{g/b}, a′) − α log N_π(a′|s_{g/b}) + α log(1 − tanh²(a′)) ].

We can compute this numerically as Q(s0, a) = γ V(sg) ≈ −1.696 for a ∈ [−1, 0), and Q(s0, a) = γ V(sb) ≈ −0.485 for a ∈ [0, 1]. These values are consistent with the results in Figure 2. Consequently, as expected, Max Ent algorithms such as SAC quickly converge to the Max Ent-optimal policy, which leads almost all trajectories to the terminal state s⁻_{b,T} with negative reward.

A.2. Results of PPO Policies

The advantage landscapes at s0, sg, and sb are shown in Figure 2 and Figure 10. From those, PPO is observed to converge to the correct optimal policy.

A.3. Remarks on Arbitrary α

Although we set α = 1 in the toy example for simplicity, it can be an arbitrary non-negative value.

Remark A.1. If the Max Ent policy is misled at s0, i.e., Q(s0, a | a ∈ [−1, 0)) < Q(s0, a | a ∈ [0, 1]) when α = 1, then for arbitrary α ≠ 1 we can keep misleading the Max Ent policy through the reward scaling r̂(s⁺_T) = α r(s⁺_T), r̂(s⁻_{g,T}) = α r(s⁻_{g,T}), r̂(s⁻_{b,T}) = α r(s⁻_{b,T}).

Proof. For arbitrary α,

    Q(s0, a) = γ E_{a′∼π}[ Q(s_{g/b}, a′) − α log π(a′|s_{g/b}) ] = γ α log Z_{Q/α}(s_{g/b}),

where π matches the optimal softmax policy exp(Q(s_{g/b}, a′)/α) / Z_{Q/α}(s_{g/b}), and Z_{Q/α}(s_{g/b}) = ∫ exp(Q(s_{g/b}, a′)/α) da′. For the scaled rewards r̂(s⁺_T), r̂(s⁻_{g,T}), r̂(s⁻_{b,T}), we have Q̂(s_{g/b}, a′) = α Q(s_{g/b}, a′), and hence Z_{Q̂/α}(s_{g/b}) = ∫ exp(Q̂(s_{g/b}, a′)/α) da′ = ∫ exp(Q(s_{g/b}, a′)) da′. Therefore the ordering of Z_{Q̂/α}(sg) and Z_{Q̂/α}(sb) is the same as that of the original Z_Q(sg) and Z_Q(sb).

Remark A.2. For arbitrary α, we can always find r⁻_b so that the optimal policy for standard RL (e.g., PPO) will favor sg while the Max Ent policy will favor sb, i.e., Q(s0, a | a ∈ [−1, 0)) < Q(s0, a | a ∈ [0, 1]).

Proof. Requiring Q(s0, a | a ∈ [−1, 0)) < Q(s0, a | a ∈ [0, 1]) is equivalent to log Z_{Q/α}(sg) < log Z_{Q/α}(sb), i.e.,

    log ∫ exp(Q(sg, a′)/α) da′ < log[ exp(r⁻_b/α) · |A| ],

which rearranges to

    α log ∫ exp(Q(sg, a′)/α) da′ − α log |A| < r⁻_b < r⁺.

Since α log ∫ exp(Q(sg, a′)/α) da′ ≤ max_{a′} Q(sg, a′) + α log |A|, the left-hand side is upper bounded by max_{a′} Q(sg, a′) ≤ r⁺. As long as sup_{a′} Q(sg, a′) < r⁺, the open interval I(sg) = ( α log ∫ exp(Q(sg, a′)/α) da′ − α log |A|, r⁺ ) is non-empty. Therefore, we can always pick r⁻_b ∈ I(sg) so that Q(s0, a | a ∈ [−1, 0)) < Q(s0, a | a ∈ [0, 1]), i.e., Max Ent favors sb.
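As a quick numeric illustration of Remark A.2 (our own check, assuming the toy MDP's constants), the interval I(sg) can be evaluated directly:

```python
import numpy as np

# Numeric check of Remark A.2 on the toy MDP (our own illustration): the interval
# I(s_g) of bad-branch rewards r_b for which standard RL still favors s_g while
# Max Ent favors s_b. Assumes r+ = 1, r(s-_gT) = -20, and |A| = 2 (A = [-1, 1]).

def interval_I_sg(alpha, r_plus=1.0, r_minus_g=-20.0, n=200001):
    a = np.linspace(-1.0, 1.0, n)
    da = a[1] - a[0]
    Q_sg = np.where(np.abs(a) <= 0.1, r_plus, r_minus_g)
    Z = np.sum(np.exp(Q_sg / alpha)) * da
    lower = alpha * np.log(Z) - alpha * np.log(2.0)  # alpha*log Z_{Q/alpha}(s_g) - alpha*log|A|
    return lower, r_plus                             # the open interval (lower, r+)

for alpha in (0.2, 1.0, 5.0):
    lo, hi = interval_I_sg(alpha)
    print(f"alpha = {alpha}: r_b in ({lo:.3f}, {hi:.3f}) misleads Max Ent but not standard RL")
# With alpha = 1 the interval is roughly (-1.30, 1), so the toy's choice r_b = -1 lies inside it.
```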
B. Full Proofs

Lemma B.1 (Lemma 5.3, Backward Compatibility). Let π(·|s) : A_M → [0, 1] be an arbitrary policy distribution over the action space at the targeted state s. Let v_s ∈ R be an arbitrary desired value for state s. There exists a value function V : Sµ → R on all the newly introduced states sµ such that v_s is the optimal soft value of s under the Max Ent-optimal policy at s (Definition 4).

Proof. Based on the definition of the soft value,

    V(s_t) = E_{a_t∼π}[Q(s_t, a_t)] + α H(π(·|s_t)),    (4)

we need to show that there exists a function Q : {s} × A_M → R such that

    v_s = V(s) = E_{a∼π(·|s)}[Q(s, a)] + α H(π(·|s))

and such that π(·|s) minimizes the KL divergence to the Boltzmann distribution induced by Q, i.e., D_KL(π(·|s) ‖ π_Q) = 0.

To ensure D_KL(π(·|s) ‖ π_Q) = 0, we can directly construct Q(s, a) such that it matches the target policy:

    π_Q(a|s) = exp(α^{-1} Q(s, a)) / Z(s),

with normalization term Z(s) = ∫_{A_M} exp(α^{-1} Q(s, a)) da. Taking logarithms on both sides and rearranging for Q(s, a):

    Q(s, a) = α log π(a|s) + α log Z(s),    (5)

where the normalization factor is a constant that we can choose arbitrarily without changing the KL divergence. Let c = α log Z(s). Next, taking the expectation over π:

    E_{a∼π(·|s)}[Q(s, a)] = ∫_{A_M} π(a|s) Q(s, a) da.

Substituting in Q(s, a) from Eq. (5), we have

    E_{a∼π(·|s)}[Q(s, a)] = ∫ π(a|s) (α log π(a|s) + c) da = α ∫ π(a|s) log π(a|s) da + c ∫ π(a|s) da = −α H(π(·|s)) + c.

Now, to match the soft value function V(s), we set:

    v_s = E_{a∼π}[Q(s, a)] + α H(π(·|s)) = c − α H(π(·|s)) + α H(π(·|s)) = c.    (6)

Thus, solving for c, we obtain c = v_s. Substituting back into Eq. (5), we get Q(s, a) = α log π(a|s) + v_s. This ensures that both v_s = V(s) = E_{a∼π}[Q(s, a)] + α H(π(·|s)) and D_KL(π(·|s) ‖ π_Q) = 0 are satisfied.

Lemma B.2 (Lemma 5.4, Forward Compatibility). Let sµ = µ(s′) be the newly introduced state for s′ ∈ N(s). Let V(s′) be an arbitrarily fixed value for the original next state s′, and r(sµ, a) an arbitrary reward defined for the newly introduced state sµ. Let v ∈ R be an arbitrarily chosen target value. Then, there exist choices of Aµ_1, Aµ_2, and r(sµ_T) such that V(sµ) = v is the optimal soft value for the bifurcating state sµ.

Proof. Following the definition of the Max Ent value (Definition 4), we need to show:

    v = V(sµ) = E_{a∼π(·|sµ)}[Q(sµ, a)] + α H(π(·|sµ)),    (7)

where π(·|sµ) is the policy distribution at sµ that exactly matches the Boltzmann distribution induced by some Q-function Q(sµ, a), i.e., D_KL(π ‖ π_Q) = 0. With the bifurcating action space Aµ = Aµ_1 ∪ Aµ_2 and deterministic transitions, we define:

    Q1 = r(sµ, a) + γ V(s′),    (8)
    Q2 = γ r(sµ_T),    (9)

where r(sµ, a) is an arbitrarily fixed reward for any a ∈ Aµ_1, and for any a ∈ Aµ_2 we set r(sµ, a) = 0. Since r(sµ, a) and V(s′) are fixed, only Q2 is tunable via the choice of r(sµ_T). A policy that minimizes the KL divergence with π_Q at sµ is:

    π(a|sµ) = e^{Q1/α} / Z(Q) for a ∈ Aµ_1,    π(a|sµ) = e^{Q2/α} / Z(Q) for a ∈ Aµ_2,

and we define the normalization term as Z(Q) = |Aµ_1| e^{Q1/α} + |Aµ_2| e^{Q2/α}. The probabilities over the two action subspaces are:

    p1 = |Aµ_1| e^{Q1/α} / Z(Q),    p2 = |Aµ_2| e^{Q2/α} / Z(Q).

The expected value is:

    E_{a∼π}[Q(sµ, a)] = p1 Q1 + p2 Q2.    (10)

The entropy of π(·|sµ) is H(π(·|sµ)) = −(p1 Q1 + p2 Q2)/α + log Z(Q), so

    α H(π(·|sµ)) = −(p1 Q1 + p2 Q2) + α log Z(Q).    (11)

Substituting Eqs. (10) and (11) into Eq. (7):

    V(sµ) = p1 Q1 + p2 Q2 + α H(π(·|sµ)) = p1 Q1 + p2 Q2 − (p1 Q1 + p2 Q2) + α log Z(Q) = α log Z(Q).    (12)

Thus we observe that:

    V(sµ) = α log( |Aµ_1| e^{Q1/α} + |Aµ_2| e^{Q2/α} ).

Solving for r(sµ_T), we get:

    r(sµ_T) = (α/γ) log( ( e^{v/α} − |Aµ_1| e^{Q1/α} ) / |Aµ_2| ),

which is valid for all α > 0. Since V(sµ) = α log Z(Q) is a surjection onto R when varying over |Aµ_1|, |Aµ_2|, and r(sµ_T), we conclude that for any v ∈ R, a valid construction exists such that V(sµ) = v.
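The closed form above can be verified numerically. The following sketch (our own, with illustrative parameter values; the function names are not from the paper) solves for r(sµ_T) given a target value v and checks that the resulting soft value of sµ matches it:

```python
import math

# Numeric check of Lemma B.2 (our own sketch, illustrative values only).
# Q1 = r(s_mu, a) + gamma * V(s') is fixed by the recover branch; Q2 = gamma * r(s_mu_T)
# is the tunable trap branch. We solve for r(s_mu_T) so that V(s_mu) hits a target v.

def solve_trap_reward(v, Q1, len_A1, len_A2, alpha=1.0, gamma=0.99):
    residual = math.exp(v / alpha) - len_A1 * math.exp(Q1 / alpha)
    # If the residual is not positive, shrink |A1| (always possible per the lemma).
    assert residual > 0, "target v too small for this |A1|; choose a smaller |A1|"
    Q2 = alpha * math.log(residual / len_A2)
    return Q2 / gamma                                   # r(s_mu_T)

def soft_value(Q1, Q2, len_A1, len_A2, alpha=1.0):
    # V(s_mu) = alpha * log( |A1| e^{Q1/alpha} + |A2| e^{Q2/alpha} )   (Eq. 12)
    return alpha * math.log(len_A1 * math.exp(Q1 / alpha) + len_A2 * math.exp(Q2 / alpha))

gamma = 0.99
V_s_next = 2.0                                          # value of the original next state s'
Q1 = (1 - gamma) * V_s_next + gamma * V_s_next          # = V(s'), using Proposition 5.6's reward
r_T = solve_trap_reward(v=3.5, Q1=Q1, len_A1=0.5, len_A2=0.5, gamma=gamma)
print(soft_value(Q1, gamma * r_T, 0.5, 0.5))            # prints approximately 3.5
```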
Theorem B.3 (Theorem 5.5, Bifurcation Extension Misleads Max Ent RL). Let M be an MDP with optimal Max Ent policy π*, and s an arbitrary state in S_M. Let π(·|s) be an arbitrary distribution over the action space A_M at state s. We can construct an entropy bifurcation extension M̂ of M such that M̂ is equivalent to M restricted to S_M \ {s} and does not change its optimal policy on those states, while the Max Ent-optimal policy at s after entropy bifurcation extension can follow the arbitrary distribution π(·|s) over the actions. In other words, without affecting the rest of the MDP, we can introduce a bifurcation extension at an arbitrary state such that the Max Ent-optimal policy becomes arbitrarily bad at the affected state.

Proof. Following Lemma 5.3, we introduce the bifurcation extension as in Definition 5.1 and obtain Q-values Q(s, a) at the targeted state such that V(s) remains unchanged, while the Max Ent-optimal policy at s becomes π(·|s). Such target Q(s, a) in turn imposes target values V(sµ) on the introduced bifurcation states, through the transitions P(sµ|s, a), because by construction P(sµ|s, a) = P(s′|s, a) > 0. We then use the forward compatibility Lemma 5.4 to set the parameters in the bifurcation extension, such that V(sµ) is attained by the Max Ent policy at sµ, without changing the existing values on the original next states V(s′) for any s′ ∈ N(s). Since we have not changed the values of s or any s′ ∈ N(s), the bifurcation extension does not affect the policy on any other state in S_M \ {s}. At the same time, the target arbitrary policy π(·|s) is now a Max Ent-optimal policy at s.

Proposition B.4 (Proposition 5.6, Bifurcation Extension Preserves Optimal Policies). By setting r(sµ, a) = (1 − γ)V(s′) for every newly introduced bifurcating state sµ = µ(s′) and a ∈ Aµ_1, the optimal policy is preserved under bifurcation extension.

Proof. In the non-Max Ent setting, the state value of sµ maximizes the Q-value, and the optimal policy chooses actions in Aµ_1. Since r(sµ, a) = (1 − γ)V(s′), the additional reward on (sµ, a) ensures that V(sµ) = V(s′). Note that Lemma 5.4 holds for arbitrary r(sµ, a). Consequently, there is no change in Q(s, a), and the optimal policy remains the same between M and M̂.

C. Environments

C.1. Vehicle

The task is to control a wheeled vehicle to maintain a constant high speed while following a designated path. Practically, the vehicle chases a moving goal, which travels at a constant speed, and receives negative rewards penalizing the distance difference. The vehicle also receives a penalty for deviating from the track. The overall reward is

    r = r_goal + r_track = −‖p_vehicle − p_goal‖₂ − β_b |R²_vehicle − R²_track|,

where p_vehicle and p_goal are the positions of the vehicle and the goal respectively, β_b = 0.3 is a scaling factor, R_track = 10 is the radius of the quarter-circle track, and R_vehicle = √(x²_vehicle + y²_vehicle) is the vehicle's radial distance from the origin. The initial state is set to make steering critical for aligning the vehicle with the path, given an initial forward speed of v = 3. The action space is steering and throttle. The vehicle follows a dynamic bicycle model (Kong et al., 2015), where throttle and steering affect speed, direction, and lateral dynamics. The model introduces slip, acceleration, and braking, requiring the agent to manage stability and traction for precise path tracking.
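As an illustration, here is a minimal sketch of the Vehicle reward above (our own code; the penalty signs follow the description of r_goal and r_track as penalties, the constants β_b = 0.3 and R_track = 10 are those stated in the text, and the function name is ours):

```python
import numpy as np

# Minimal sketch of the Vehicle reward from Appendix C.1 (our own illustration).
BETA_B, R_TRACK = 0.3, 10.0

def vehicle_reward(p_vehicle, p_goal):
    p_vehicle, p_goal = np.asarray(p_vehicle, float), np.asarray(p_goal, float)
    r_goal = -np.linalg.norm(p_vehicle - p_goal)           # chase the moving goal
    r_vehicle_sq = p_vehicle[0] ** 2 + p_vehicle[1] ** 2   # squared radial distance
    r_track = -BETA_B * abs(r_vehicle_sq - R_TRACK ** 2)   # stay on the quarter-circle track
    return r_goal + r_track

print(vehicle_reward([9.8, 1.5], [10.0, 2.0]))  # small penalty near the track and the goal
```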
C.2. Quadrotor

The task is to control a quadrotor to track a simple path while handling small initial perturbations. The quadrotor also chases a target moving at a constant speed. The reward is given by the distance between the quadrotor and the target, combined with a penalty on the quadrotor's three Euler angles to encourage stable orientation and prevent excessive tilting. The overall reward function is:
$$r = r_{\text{goal}} + r_{\text{stability}} = -\beta\, \|p_{\text{quadrotor}} - p_{\text{target}}\|_2 - |\theta| - |\phi| - |\psi|,$$
where $p_{\text{quadrotor}}, p_{\text{target}}$ are the positions of the quadrotor and the target respectively, $\theta, \phi, \psi$ are the Euler angles, and $\beta = 5$ is a scaling factor.

Simplified: Since the track aligns with one of the rotor axes, we fix the thrust of the orthogonal rotors to zero, providing additional stabilization. The agent controls only the thrust and pitch torque, simplifying the task.

Realistic: The agent must fully control all four rotors, with the action space consisting of four independent rotor speeds, making stabilization and trajectory tracking more challenging.

C.3. Opencat

The Opencat environment simulates an open-source Arduino-based quadruped robot, based on Petoi's OpenCat project (Petoi Camp). The task focuses on controlling the quadruped's joint torques to achieve stable locomotion while adapting to perturbations. The action space consists of 8 continuous joint torques, corresponding to the two actuated joints on each of the four legs. The agent must learn to coordinate leg movements efficiently to maintain balance and move forward.

C.4. Acrobot

Acrobot is a two-link planar robot arm with one end fixed at the shoulder ($\theta_1$) and an actuated joint at the elbow ($\theta_2$) (Spong, 1995). The control action for this underactuated system involves applying continuous torque at the free joint to swing the arm to the upright position and stabilize it. The task is to minimize the deviation between the joint angle $\theta_1$ and the target upright position ($\theta_1 = \pi$), while maintaining zero angular velocity when the arm is upright. The reward function is defined as follows:
$$r = -(\theta_1 - \pi)^2 - \theta_2^2 - 0.1\,\dot{\theta}_1^2 - 0.1\,\dot{\theta}_2^2.$$

C.5. Obstacle2D

The goal is to navigate a 2D agent from the start $(0, 0)$ to the goal $(3, 0)$ while avoiding a wall at $x = 2$ spanning $y \in [-2, 2]$. The action range is $[-3, 3]$ in each dimension, which is sufficient for the agent to avoid the wall in one step. The reward is based on progress toward the goal, measured as the difference in distance to the goal before and after taking a step. As special cases, the agent receives $+500$ for reaching the goal and $-200$ for hitting the wall.

C.6. Hopper

Hopper is the OpenAI Gym environment based on the MuJoCo engine, in which the agent aims to hop forward by applying torques at its joints.

D. Details on Experiments

D.1. Training without entropy in target Q values

In Sec. 4 and Sec. 6, we simultaneously train Q networks with (soft) and without (plain) the entropy term in the target Q values, in order to illustrate the effect of entropy on policy optimization. Specifically,
$$\mathcal{T}^\pi Q_{\text{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}, a_{t+1}}\big[Q_{\text{soft}}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1}|s_{t+1})\big],$$
$$\mathcal{T}^\pi Q_{\text{plain}}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}, a_{t+1}}\big[Q_{\text{plain}}(s_{t+1}, a_{t+1})\big],$$
with the rest of the SAC algorithm unchanged. We still update the policy based on the target Q with entropy, i.e., $Q_{\text{soft}}(s_t, a_t)$, as in the original SAC; training $Q_{\text{plain}}(s_t, a_t)$ serves only to better understand entropy's role in the policy-update dynamics.
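A hedged sketch of the two Bellman targets is given below in PyTorch style; `policy`, `q_soft_target`, and `q_plain_target` are illustrative handles for the actor and the two target critics, not names from our implementation.

```python
import torch

# Sketch: soft vs. plain Bellman targets, everything else in SAC left unchanged.
def compute_targets(r, s_next, done, policy, q_soft_target, q_plain_target,
                    alpha=0.2, gamma=0.99):
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)      # a' ~ pi(.|s') with log-probability
        mask = 1.0 - done
        # The soft target keeps the entropy bonus; the plain target drops it.
        y_soft = r + gamma * mask * (q_soft_target(s_next, a_next) - alpha * logp_next)
        y_plain = r + gamma * mask * q_plain_target(s_next, a_next)
    return y_soft, y_plain
```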
D.2. Experiment Hyperparameters

The hyperparameters for training the algorithms are listed in Table 1 and Table 2.

Table 1. Hyperparameters for SAC (SAC-auto-α), PPO, and DDPG

Hyperparameter | SAC / SAC-auto-α | PPO | DDPG
Discount factor (γ) | 0.99 | 0.99 | 0.99
Entropy coefficient (α) | 0.2 / N/A | 0 | /
Exploration noise | 0 | 0 | 0.1
Target smoothing coefficient (τ) | 0.005 | / | 0.005
Batch size | 256 | 64 | 256
Replay buffer size | 1M | 2048 | 1M
Hidden layers | 2 | 2 | 2
Hidden units per layer | 256 (64 for Toy Example) | 256 (64 for Toy Example) | 256
Activation function | ReLU | Tanh | ReLU
Optimizer | Adam | Adam | Adam
Number of updates per environment step | 1 | 10 | 1
Clipping parameter (ϵ) | / | 0.2 | /
GAE parameter (λ) | / | 0.95 | /

Table 2. Learning Rates for SAC, PPO, and DDPG Across Different Environments

Algorithm | Learning rate | Vehicle | Quadrotor | Opencat | Acrobot | Obstacle2D | Hopper
SAC | Actor | 1e-3 | 3e-4 | 1e-3 | 1e-3 | 1e-3 | 1e-3
SAC | Q-function | 1e-3 | 3e-4 | 1e-3 | 1e-3 | 1e-3 | 1e-3
SAC auto-α | α | 1e-3 | 3e-4 | 1e-3 | 1e-3 | 1e-3 | 1e-3
PPO | Actor | 3e-4 | 3e-4 | 1e-4 | 3e-4 | 3e-4 | 3e-4
PPO | Value function | 3e-4 | 3e-4 | 1e-4 | 3e-4 | 3e-4 | 3e-4
DDPG | Actor | 1e-3 | 3e-4 | 3e-4 | 1e-3 | 1e-3 | 1e-3
DDPG | Q-function | 1e-3 | 3e-4 | 3e-4 | 1e-3 | 1e-3 | 1e-3

D.3. Performance of DDPG and SAC with Auto-tuned Entropy Coefficient

We also run SAC with auto-tuned α and DDPG across all six environments, as shown in Fig. 11. The first row includes environments where SAC fails due to critical control requirements, while the second row shows cases where SAC performs better. Notably, auto-tuning the entropy temperature in SAC improves performance in some critical environments but not all, and it still fails to surpass PPO.

Figure 11. Performance of all algorithms (PPO, SAC, SAC-auto-α, and DDPG) across the six environments.

D.4. Benefits of Misleading Landscapes in SAC

Nonetheless, entropy in the target Q is beneficial as designed because it drives necessary exploration. In Gym Hopper, we investigate the state shown in Fig. 12. Entropy smooths the Q landscape in regions that may not produce optimal actions at the current training stage, encouraging exploration and enabling the policy to achieve robustness rather than clinging solely to the current optima.

Figure 12. Q landscapes in Hopper. Upper: We fix torque #0 (top torso) at the current SAC policy mean and plot Q_soft and Q_plain over torque #1 (middle thigh) and #2 (bottom leg) for the state shown in the bottom-left panel. Lower: Rendered Hopper gestures resulting from the corresponding policies. SAC's policy benefits from entropy by "leaning further forward", taking a risky move despite this action being suboptimal under the current true Q. Investigating the peaks o1 and o2 in Q_plain reveals that the hopper tends to bend its knee and jump up when following the corresponding policies, demonstrating less exploration.

D.5. PPO Trapped by Advantage Zero-Level Sets

Without extra entropy to encourage exploration, PPO as an on-policy RL algorithm can be trapped in the zero-level set of advantages. In Obstacle2D (Fig. 13, first row), we plot the policy at the initial state, where the optimal action is to move to either the upper or lower corner of the wall, avoiding it in one step. The reward is designed to encourage the agent to approach the goal while penalizing collisions with a large negative reward. The region in front of the wall is a higher-reward area because of the instant approaching reward. The advantage landscape reveals that PPO's policy moves to the center of the positive-advantage region but remains confined by the zero-level set. Notably, although the optimal regions (upper and lower corners) have positive advantages, PPO remains trapped due to its local behavior. The coupling of exploration and the actual policy worsens this issue: if PPO fails to explore actions that bypass the wall, its policy's σ shrinks, further reducing exploration and leading to entrapment. We can also observe this across training stages, as shown in Fig. 14. A similar phenomenon is observed in Acrobot (Fig. 13, second row), where PPO's policy shrinks prematurely, leading to insufficient exploration. However, this phenomenon can also be viewed as a strength of PPO: it builds on the current optimal policy and makes incremental improvements step by step, and is thus not misled by suboptimal actions introduced by entropy. As a result, PPO performs better in environments where the feasible action regions are small and narrow in the action space, such as in Vehicle, Quadrotor, and Opencat, which closely resemble real-world control settings.
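The advantage landscapes in Fig. 13 and Fig. 14 can be probed with a simple grid sweep over one-step actions at the initial state. The sketch below is illustrative: `one_step` and `value_fn` are hypothetical handles standing in for a single-step environment rollout and PPO's learned value network, not names from our codebase.

```python
import numpy as np

# Hedged sketch: one-step advantage landscape over the 2-D action grid at the initial state.
def advantage_grid(one_step, value_fn, s0, n=101, a_max=3.0, gamma=0.99):
    axis = np.linspace(-a_max, a_max, n)            # action range is [-3, 3] per dimension
    adv = np.zeros((n, n))
    v0 = value_fn(s0)
    for i, ax in enumerate(axis):
        for j, ay in enumerate(axis):
            r, s1, done = one_step(s0, np.array([ax, ay]))
            adv[i, j] = r + (0.0 if done else gamma * value_fn(s1)) - v0   # one-step advantage
    return axis, adv   # the zero-level set of `adv` traces the boundary that traps PPO's policy mean
```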
Figure 13. Q/advantage landscapes of SAC and PPO in Obstacle2D and Acrobot. Upper: In Obstacle2D with start (0, 0), goal (3, 0), and a wall at x = 2 spanning y ∈ [−2, 2], SAC succeeds in bypassing the wall whereas PPO fails. We plot the Q/advantage landscape of the initial state. For SAC, entropy encourages exploration, guiding updates toward the upper and lower ends of the wall via the soft Q. In contrast, PPO remains trapped despite the presence of positive-advantage regions near the wall's ends. Lower: In Acrobot, both algorithms achieve swing-up, but near the stabilization height, SAC applies the right torque to neutralize momentum, preventing the arm from falling again. In contrast, PPO remains stuck in a local optimum, leading to repeated failures.

Figure 14. Advantage landscapes in Obstacle2D for PPO. Panels (a) to (f) show the advantage landscapes at different training stages for the initial state, where PPO's policy center remains trapped in front of the wall while σ gradually shrinks.

E. Details on SAC-AdaEnt

E.1. Pseudocode

We provide the detailed algorithm in Algorithm 1.

E.2. SAC-AdaEnt improves performance in environments where SAC fails

To further validate the misleading-entropy claim and enhance SAC's performance in critical environments, we propose SAC with Adaptive Entropy (SAC-AdaEnt) and test it on the Vehicle and Quadrotor environments, showing improvements in Fig. 15. In these environments, SAC relies excessively on entropy as it dominates the soft Q values. To address this, SAC-AdaEnt adaptively combines target Q values with and without entropy. Specifically, we simultaneously train Q_soft and Q_plain as in Appendix D.1. During policy updates, we sample multiple actions per state under the current policy and evaluate their Q_soft and Q_plain values. By comparing these values, we compute the similarity of the two landscapes. If Q_soft deviates significantly from Q_plain, indicating that entropy could mislead the policy, we rely on Q_plain as the target Q value instead. Otherwise, entropy is retained to encourage exploration. This adaptive approach ensures a balance between safe exploration and exploitation, promoting exploration in less critical states and prioritizing exploitation in states where errors could result in failure. Note that SAC-AdaEnt is fundamentally different from SAC with an auto-tuned entropy coefficient, which applies a uniform entropy adjustment across all states. In contrast, SAC-AdaEnt adaptively adjusts entropy for each state, making it particularly effective in environments requiring precise control and careful exploration.
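A minimal sketch of this per-state switch is given below; `policy`, `q_soft`, and `q_plain` are illustrative handles for the actor and the two critics, and the cosine similarity mirrors the score in Algorithm 1.

```python
import torch

# Hedged sketch of the adaptive actor update in SAC-AdaEnt. Shapes are [batch, n_samples]
# after sampling several actions per state; handle names are illustrative.
def adaent_actor_loss(s, policy, q_soft, q_plain, alpha=0.2, n_samples=10, eps=0.95):
    a, logp = policy.rsample(s, n_samples)               # reparameterized samples a ~ pi(.|s)
    qs, qp = q_soft(s, a), q_plain(s, a)                 # evaluate both Q landscapes
    sim = torch.cosine_similarity(qs, qp, dim=-1)        # per-state similarity of the landscapes
    use_soft = (sim > eps).float().unsqueeze(-1)         # keep entropy only where the landscapes agree
    q_used = use_soft * qs + (1.0 - use_soft) * qp
    return (alpha * logp - q_used).mean()                # standard SAC-style actor objective
```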
Algorithm 1 SAC with Adaptive Entropy (SAC-AdaEnt)
Initialize: actor network $\pi_\theta$; Q networks and their paired target networks for targets with entropy ($\phi_1, \phi_2, \phi_{\mathrm{targ},1}, \phi_{\mathrm{targ},2}$) and without entropy ($\phi'_1, \phi'_2, \phi'_{\mathrm{targ},1}, \phi'_{\mathrm{targ},2}$); replay buffer $D$; similarity threshold $\epsilon$
for each training step do
  Sample action $a_t \sim \pi_\theta(a_t|s_t)$ and observe $s_{t+1}, r_t$
  Store $(s_t, a_t, r_t, s_{t+1})$ in replay buffer $D$
  for each gradient update step do
    Sample a minibatch of transitions $(s, a, r, s')$ from $D$
    Compute target values, with $a' \sim \pi_\theta(\cdot|s')$:
      $y = r + \gamma\big(\min_{i=1,2} \hat{Q}_{\phi_i}(s', a') - \alpha \log \pi_\theta(a'|s')\big)$, $\quad y' = r + \gamma \min_{i=1,2} \hat{Q}_{\phi'_i}(s', a')$
    Update Q networks:
      $\phi_i \leftarrow \phi_i - \eta_Q \nabla_{\phi_i} \frac{1}{N}\sum_{n=1}^{N} (Q_{\phi_i}(s, a) - y)^2$, $\quad \phi'_i \leftarrow \phi'_i - \eta_Q \nabla_{\phi'_i} \frac{1}{N}\sum_{n=1}^{N} (Q_{\phi'_i}(s, a) - y')^2$
    For each $s$, sample actions using the current policy: $A_s = \{a_s : a_s \sim \pi_\theta(\cdot|s)\}$
    Compute similarity score:
      $\mathrm{sim}(Q, Q') = \dfrac{Q(s) \cdot Q'(s)}{\|Q(s)\|\,\|Q'(s)\|}$, where $Q(s) = \big[\min_{i=1,2} \hat{Q}_{\phi_i}(s, a_s)\big]_{a_s \in A_s}$ and $Q'(s) = \big[\min_{i=1,2} \hat{Q}_{\phi'_i}(s, a_s)\big]_{a_s \in A_s}$
    Update actor policy using the reparameterization trick:
      $\theta \leftarrow \theta - \eta_\pi \nabla_\theta\, \mathbb{E}_{s \sim D,\, a \sim \pi_\theta}\begin{cases} \alpha \log \pi_\theta(a|s) - Q_{\phi_1}(s, a), & \text{if } \mathrm{sim}(Q, Q') > \epsilon \\ \alpha \log \pi_\theta(a|s) - Q_{\phi'_1}(s, a), & \text{otherwise} \end{cases}$
    Update target networks: $\hat{Q}_{\phi_i} \leftarrow \tau Q_{\phi_i} + (1 - \tau)\hat{Q}_{\phi_i}$, $\quad \hat{Q}_{\phi'_i} \leftarrow \tau Q_{\phi'_i} + (1 - \tau)\hat{Q}_{\phi'_i}$
  end for
end for

E.3. SAC-AdaEnt preserves performance in environments where SAC succeeds

SAC-AdaEnt not only improves performance in environments where SAC struggles, but also retains SAC's strengths in those where SAC already excels. We report SAC-AdaEnt's results on Hopper, Obstacle2D, and Acrobot in Table 3.

Table 3. Performance (mean ± std) of SAC and SAC-AdaEnt across tasks.

Algorithm | Hopper | Obstacle2D | Acrobot
SAC | 3484.46 ± 323.87 | 501.98 ± 0.62 | −45.25 ± 7.94
SAC-AdaEnt | 3285.17 ± 958.43 | 501.50 ± 0.57 | −36.31 ± 16.42

Figure 15. Performance of SAC-AdaEnt vs. SAC. Left: reward improvement. Middle: full trajectory rendering. Right: behavior of the policy on critical states. In Vehicle, SAC-AdaEnt successfully steers and brakes to bring the vehicle back on track, while in Quadrotor, it effectively lifts the quadrotor to follow the designated path.

F. Additional Experiments on Other Max Ent Algorithms

Although SAC is a powerful Max Ent algorithm, to ensure our findings generalize beyond SAC's particular implementation of entropy regularization, we also evaluate Soft Q-Learning (SQL), an alternative Max Ent method. SQL extends traditional Q-learning by incorporating an entropy bonus into its Bellman backup, resulting in a policy that maximizes both expected return and action entropy, thereby fitting within the maximum-entropy RL framework. It extends to continuous actions by parameterizing both the soft Q-function and the policy with neural networks and using the reparameterization trick for efficient, entropy-regularized updates. We compare the performance of all algorithms on Vehicle, Quadrotor, and Hopper. The results in Table 4 show that SQL also suffers from the entropy-misleading issue, but its AdaEnt variant effectively mitigates this weakness.
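For reference, a minimal sketch of the entropy-regularized backup underlying SQL is shown below, written in discrete-action form for clarity; continuous-action variants typically estimate the log-sum-exp with sampled actions.

```python
import torch

# Hedged sketch of the soft Bellman target used by Soft Q-Learning (discrete-action form).
def soft_q_target(r, q_next, done, alpha=0.2, gamma=0.99):
    # q_next: Q(s', .) over all actions, shape [batch, n_actions]
    v_soft_next = alpha * torch.logsumexp(q_next / alpha, dim=-1)   # soft value of s'
    return r + gamma * (1.0 - done) * v_soft_next                   # entropy-regularized target
```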
Table 4. Performance (mean ± std) across Vehicle, Quadrotor, and Hopper tasks for all algorithms.

Algorithm | Vehicle | Quadrotor | Hopper
SAC (α = 0.2) | −2003.85 ± 867.82 | −475.29 ± 244.96 | 3484.46 ± 323.87
SAC (auto-α) | −1551.96 ± 636.88 | −666.62 ± 233.19 | 2572.00 ± 901.35
SAC-AdaEnt | −1250.45 ± 725.40 | −247.58 ± 45.15 | 3285.17 ± 958.43
SQL | −2715.48 ± 453.00 | −6082.51 ± 1632.35 | 2998.21 ± 158.19
SQL-AdaEnt | −2077.59 ± 266.84 | −4499.35 ± 863.73 | 3115.94 ± 25.19