# DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

Onur Celik 1, Zechu Li 2, Denis Blessing 1, Ge Li 1, Daniel Palenicek 3 4, Jan Peters 3 4 5 6, Georgia Chalvatzaki 2 4, Gerhard Neumann 1

Abstract

Maximum entropy reinforcement learning (Max Ent-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into Max Ent-RL poses challenges, primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of Max Ent-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion-based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.1

1 Autonomous Learning Robots, KIT. 2 Interactive Robot Perception & Learning, TU Darmstadt. 3 Intelligent Autonomous Systems, TU Darmstadt. 4 Hessian.AI. 5 German Research Center for AI. 6 Centre for Cognitive Science, TU Darmstadt. Correspondence to: Onur Celik.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1 https://alrhub.github.io/dime-website/

1. Introduction

The maximum entropy reinforcement learning (Max Ent-RL) objective augments the task reward in each time step with the entropy of the policy (Ziebart et al., 2008; Toussaint, 2009; Haarnoja et al., 2017; 2018b). This objective has several favorable properties, among which improved exploration (Ziebart, 2010; Haarnoja et al., 2017) is crucial for RL. Recent successful model-free RL algorithms leverage these favorable properties and build upon this framework (Bhatt et al., 2024; Nauman et al., 2024), improving sample efficiency and leading to remarkable results. However, the policies are traditionally parameterized using Gaussian distributions, significantly limiting their representational capacity. On the other hand, diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021; Karras et al., 2022) are highly expressive generative models and have proven beneficial in representing complex behavior policies (Reuss et al., 2023; Chi et al., 2023). However, important quantities such as the marginal entropy are intractable to compute (Zhou et al., 2024), which restricts their usage in RL. Because of this shortcoming, recent methods propose different ways to train diffusion-based policies in off-policy RL. While these methods are discussed in more detail in the related work section, most of them require additional techniques that add artificial (in most cases Gaussian) noise to the generated actions to induce exploration in the behavior generation process.
Hence, they do not leverage the diffusion model to generate potentially non-Gaussian exploration patterns but fall back to mainly Gaussian exploration. Nonetheless, there have been significant advances in training diffusion-based models for approximate inference (Berner et al.; Richter & Berner). Since the policy improvement in Max Ent-RL can also be cast as an approximate inference problem to the energy-based policy (Haarnoja et al., 2017), it is a natural step to explore these parallels. We propose Diffusion-Based Maximum Entropy Reinforcement Learning (DIME). DIME leverages recent advances in approximate inference with diffusion models (Richter & Berner) to derive a lower bound on the Max Ent objective. We propose a policy iteration framework with monotonic policy improvement that converges to the optimal diffusion policy. Additionally, building on recent off-policy RL algorithms such as Cross-Q (Bhatt et al., 2024) and distributional RL (Bellemare et al., 2017), we propose a practical version of DIME that can be used for training diffusion-based RL policies. On 13 challenging continuous high-dimensional control benchmarks, we empirically validate that DIME significantly outperforms other diffusion-based baselines on DIME: Diffusion-Based Maximum Entropy Reinforcement Learning all environments and consistently outperforms other stateof-the-art RL methods based on a Gaussian policy on 10 out of 13 environments, while being computationally more efficient and requiring less algorithmic design choices as the current state of the art baseline BRO (Nauman et al., 2024). 2. Related Work Maximum Entropy RL. The maximum entropy RL framework uses the entropy of the policy at each time step as an additional objective, providing a principled way of inducing exploration in the RL policy. It is different from entropy regularized RL (Neu et al., 2017), where the entropy of the policy is maximized only for the current time step. Haarnoja et al. (2017) proposed Soft-Q Learning, where amortized Stein variational gradient descent (Wang & Liu, 2016) (SVGD) is used to train a parameterized sampler that can sample from the energy-based policy. SAC (Haarnoja et al., 2018b) proposes an actor-critic RL method but frames the policy update as an approximate inference problem to the energy-based policy using a Gaussian policy parameterization. SAC has been extended to energy-based policies using SVGD in (Messaoud et al.), where the authors also propose a new method to estimate the entropy in closed form. While SVGD is a powerful method for learning an energy-based policy, it is harder to scale these approaches to high-dimensional control problems. For improving exploration, LSAC (Ishfaq et al., 2025) proposes leveraging Langevin Monte Carlo (Welling & Teh, 2011) in conjunction with a distributed critic objective to sample a state-action value. Haarnoja et al. (2018a) proposes learning a latent variable model as a policy representation, but relies on the change of variable formula to express the density of the policy by calculating the Jacobian of the transformations. Recent advances of SAC also define the state-of-the-art in off-policy RL in many domains, such as Cross Q (Bhatt et al., 2024) and BRO (Nauman et al., 2024). Cross Q proposed removing the target network by leveraging batch renormalization and BRO scales to large networks in RL by using several methods such as optimistic exploration (Nauman & Cygan, 2023), network resets (Nikishin et al., 2022), weight decay, and high update-to-data ratios. 
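For context on the Gaussian-policy machinery that the diffusion-based approaches above and below replace, the following minimal sketch shows the SAC-style policy improvement step, i.e., approximate inference towards the energy-based policy exp(Q/α), with a reparameterized, tanh-squashed Gaussian actor. Network sizes and names are illustrative and are not taken from any of the cited implementations.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Tanh-squashed diagonal Gaussian policy, as used by SAC-style methods."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def sample(self, obs):
        h = self.net(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5.0, 2.0)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                       # reparameterized sample
        a = torch.tanh(u)                        # squash action into (-1, 1)
        # change-of-variables correction for the tanh squashing
        log_prob = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, log_prob

def policy_improvement_loss(actor, q_fn, obs, alpha):
    """Minimize E[alpha * log pi(a|s) - Q(s, a)], i.e. the KL to exp(Q/alpha)/Z."""
    a, log_prob = actor.sample(obs)
    return (alpha * log_prob - q_fn(obs, a)).mean()
```

Because the Gaussian log-density is available in closed form, the entropy term is trivial here; the difficulty addressed in this paper is that no such closed form exists for the marginal of a diffusion policy.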
Diffusion-Based Policies in RL. Early works have researched diffusion models in offline RL (Lange et al., 2012; Levine et al., 2020) as trajectory generators (Janner et al., 2022) or as expressive policy representations (Wang et al., 2023; Kang et al., 2023; Hansen-Estruch et al., 2023; Chen et al., 2023; Ding & Jin, 2024; Mao et al., 2024; Fang et al., 2025; Lu et al., 2023). More recently, diffusion models in online RL have become more popular. DIPO (Yang et al., 2023) proposes training a diffusion-based policy using a behavior cloning loss. The actions in the replay buffer serve as target actions for the policy improvement step and are updated using the gradients of the Q-function a Q(s, a). DIPO has been extended to develop methods for learning multi-modal behaviors(Li et al., 2024) by leveraging hierarchical clustering to isolate different behavior modes. DIPO relies on the stochasticity inherent to the diffusion model for exploration and does not explicitly control it via an objective. QSM (Psenka et al., 2024) directly matches the policy s score with the gradient of the Q-function a Q(s, a). While their objective avoids differentiating through the whole diffusion chain, the proposed objective disregards the entropy of the policy and, therefore, exploration. Consequently, QSM needs to add noise to the final action of the diffusion chain. More recently, DACER (Wang et al., 2024) proposed using the data-generating process as the policy representation and backpropagating the gradients through the diffusion chain. However, they do not consider a backward process as we do, and their objective for updating the diffusion model is based on the expected Q-values only. To incentivize the exploration, DACER adds diagonal Gaussian noise to the sampled actions, where the variance of this noise is controlled by a scaling term that is updated automatically using an approximation of the marginal entropy by extracting a Gaussian Mixture Model from the diffusion policy. Concurrently, QVPO (Ding et al., 2024) proposed weighting their diffusion loss with their respective Q-values after applying transformations. However, QVPO relies on sampling actions from a uniform distribution to enforce exploration. DIME distinguishes from prior works in that we use the maximum entropy RL framework for training the diffusion policy, which was not considered before. This allows direct control of the exploration-exploitation trade-off arising naturally through this objective without the need for additional approximations. DIME is leveraging the diffusion model to generate non-Gaussian exploration actions which is in contrast to most other diffusion RL approaches that still require including Gaussian or uniform exploration noise. Approximate Inference with Diffusion Models. Early works on approximate inference with diffusion models were formalized as a stochastic optimal control problem using Schr odinger-F ollmer diffusions (Dai Pra, 1991; Tzen & Raginsky, 2019; Huang et al., 2021) and only recently realized with deep-learning based approaches (Vargas et al., 2023; Zhang & Chen, 2021). Vargas et al.; Berner et al. later extended these results to denoising diffusion models. A more general framework where both forward and backward processes of the diffusion model are learnable was concurrently proposed by Richter & Berner; Nusken et al. (2024). Recently, many extensions have been proposed, see e.g. 
(Akhound-Sadegh et al., 2024; Noble et al., 2024; Geffner & Domke, 2023; Zhang et al., 2023; Chen et al., 2024; Blessing et al., 2025b;a; Chen et al., 2025). Our work can be seen as an instance of the sampler presented in (Berner et al.). However, our formulation allows using different diffusion samplers, such as those presented in (Richter & Berner; Blessing et al., 2025a), while we restrict ourselves in this work to the sampler presented in (Berner et al.).

3. Preliminaries

3.1. Maximum Entropy Reinforcement Learning

Notation. We consider the task of learning a policy π : S × A → R+ using reinforcement learning (RL), where S and A denote a continuous state and action space, respectively. We formalize the RL problem using an infinite-horizon Markov decision process consisting of the tuple (S, A, r, p, ρπ, γ), with bounded reward function r : S × A → [rmin, rmax] and transition density p : S × S × A → R+, which denotes the likelihood of transitioning into a state s′ ∈ S when being in s ∈ S and executing an action a ∈ A. We follow (Haarnoja et al., 2018b) and slightly overload ρπ, which denotes the state and state-action marginals induced by a policy π. Moreover, γ ∈ [0, 1) denotes the discount factor. For brevity, we use rt ≜ r(st, at). Lastly, we denote objective functions that we aim to maximize as J and minimize as L.

Control as inference. The goal of maximum entropy reinforcement learning (Max Ent-RL) is to jointly maximize the sum of expected rewards and entropies of a policy,

J(\pi) = \sum_{t=l}^{\infty} \gamma^{t-l} \, \mathbb{E}_{\rho_\pi}\!\left[ r_t + \alpha \mathcal{H}\big(\pi(a_t \mid s_t)\big) \right], \qquad (1)

where \mathcal{H}(\pi(a \mid s)) = -\int \pi(a \mid s) \log \pi(a \mid s) \,\mathrm{d}a is the differential entropy, and α ∈ R+ controls the exploration-exploitation trade-off (Haarnoja et al., 2017). To keep the notation uncluttered we absorb α into the reward function via r ← r/α. Defining the Q-function of a policy π as

Q^\pi(s_t, a_t) = r_t + \sum_{l=1}^{\infty} \gamma^{l} \, \mathbb{E}_{\rho_\pi}\!\left[ r_{t+l} + \mathcal{H}\big(\pi(a_{t+l} \mid s_{t+l})\big) \right], \qquad (2)

with Q^π : S × A → R, the Max Ent objective can be cast as an approximate inference problem towards the energy-based policy π(a_t | s_t) ∝ exp Q^π(s_t, a_t), that is,

\mathcal{L}(\pi) = \mathbb{E}_{\rho_\pi}\!\left[ D_{\mathrm{KL}}\!\left( \pi(\cdot \mid s_t) \,\Big\|\, \frac{\exp Q^\pi(s_t, \cdot)}{Z^\pi(s_t)} \right) \right], \qquad (3)

in the sense that max_π J(π) = min_π L(π). Here, D_KL denotes the Kullback-Leibler divergence and

Z^\pi(s) = \int \exp Q^\pi(s, a) \,\mathrm{d}a \qquad (4)

is the state-dependent normalization constant.

Policy iteration is a two-step iterative update scheme that is, under certain assumptions, guaranteed to converge to the optimal policy with respect to the maximum entropy objective. The two steps are policy evaluation and policy improvement. Given a policy π, policy evaluation aims to evaluate the value of π. To that end, (Haarnoja et al., 2018b) showed that repeated application Q_{k+1} = T^π Q_k of the Bellman backup operator

\mathcal{T}^\pi Q(s_t, a_t) \triangleq r_t + \gamma \, \mathbb{E}\!\left[ Q(s_{t+1}, a_{t+1}) + \mathcal{H}\big(\pi(a_{t+1} \mid s_{t+1})\big) \right], \qquad (5)

converges to Q^π as k → ∞, starting from any Q. To update the policy, that is, to perform the policy improvement step, the Q-function of the previous evaluation step, Q^{πold}, is used to obtain a new policy according to

\pi_{\mathrm{new}} = \arg\min_{\pi \in \Pi} D_{\mathrm{KL}}\!\left( \pi(\cdot \mid s_t) \,\Big\|\, \frac{\exp Q^{\pi_{\mathrm{old}}}(s_t, \cdot)}{Z^{\pi_{\mathrm{old}}}(s_t)} \right), \qquad (6)

where Π is a set of policies, such as a family of parameterized distributions. Note that Z^{πold}(st) is not required for optimization as it is independent of π. Haarnoja et al. (2018b) showed that for all state-action pairs (s, a) ∈ S × A it holds that Q^{πnew}(s, a) ≥ Q^{πold}(s, a), ensuring that policy iteration converges to the optimal policy in the limit of infinite repetitions of policy evaluation and improvement.

3.2.
Denoising Diffusion Policies For a given state s S, we consider a stochastic process on the time-interval [0, T] given by an Ornstein-Uhlenbeck (OU) process 2 (S arkk a & Solin, 2019) dat = βtatdt + η p 2βtd Bt, a0 π0( |s), (7) with diffusion coefficient β : [0, T] R+, standard Brownian motion (Bt)t [0,T ], and some target policy π0. For t, l [0, T], we denote the marginal density of Eq. 7 at t as πt and the conditional density at time t given l as πt|l. Eq. 7 is commonly referred to as forward or noising process since, for a suitable choice of β, it holds that πT N(0, η2I). Denoising diffusion models leverage the fact, that the timereversed process of Eq. 7 is given by dat = βtatdt 2η2βt log πt(at|s) + η p 2βtd Bt, (8) starting from πT = πT N(0, η2I) and running backwards in time (Nelson, 2020; Anderson, 1982; Haussmann & Pardoux, 1986). For the backward, generative or denoising process (Eq. 8), we denote the density as π. Here, time-reversal means that the marginal densities align, i.e., πt = πt for all t [0, T]. Hence, starting from a T N(0, η2I), one can sample from the target policy π0 by simulating Eq. 8. However, for most densities π0, 2Please note, for clarity, we slightly abuse notation by using t to denote the time in the stochastic process. This should not be confused with the time step in RL. The distinction becomes clear when we discretize the processes. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning exp(Qπ/α)/Zπ N(0, I) t exp(Qπ/α)/Zπ N(0, I) t exp(Qπ/α)/Zπ N(0, I) t Figure 1. The effect of the reward scaling parameter α. The figures in (a)-(b) show diffusion processes for different α values starting at a prior distribution N(0, I) and going backward in time to approximate the target distribution exp (Qπ/α)/Zπ. Small values for α (a) lead to concentrated target distributions with less noise in the diffusion trajectories especially at the last time steps. The higher α becomes (b) and (c), the more the target distribution is smoothed and the distribution of the samples at the last time steps becomes more noisy. Therefore, the parameter α directly controls the exploration by enforcing noisier samples the higher α becomes. the scores ( log πt(at|s))t [0,T ] are intractable, requiring numerical approximations. To address this, denoising scorematching objectives are commonly employed, that is, LSM(θ) = E βt f θ t (at, s) log πt|0(at|a0, s) 2 , (9) where t is sampled on [0, T] and f θ denotes a parameterized score network (Hyv arinen & Dayan, 2005; Vincent, 2011). For OU processes, the conditional densities log πt|0 are explicitly computable, making the objective tractable for optimizing θ (Song et al., 2021). Once trained, the score network f θ can be used to simulate the denoising process dat = βtatdt 2η2βtf θ t (at, s) + η p 2βtd Bt, (10) to obtain samples a0 πθ 0 that are approximately distributed according to π0. Here, πθ t denotes the marginal distribution of Eq. 10 at time t. While score-matching techniques work well in practice, they cannot be applied to maximum entropy reinforcement learning. This is because the expectation in Eq. 9 requires samples a0 π0 exp Qπ which are not available. However, in the next section, we build on recent advances in approximate inference to optimize diffusion models without requiring samples from a0, relying instead on evaluations of Qπ. 4. Diffusion-Based Maximum Entropy RL Here, we explain how diffusion models can be used within a maximum entropy RL framework. 
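As a reference point for the sampler-based objective derived next, the following sketch makes the denoising score-matching objective of Eq. 9 concrete for the OU noising process of Section 3.2, whose conditional π_{t|0} is a Gaussian with known mean and variance. The schedule functions and the score-network signature are placeholders; note that the loss consumes target samples a0 ~ π0, which is exactly what is unavailable in the maximum entropy RL setting.

```python
import torch

def ou_conditional(a0, int_beta, eta):
    """Mean and std of pi_{t|0} for the OU process da = -beta_t a dt + eta*sqrt(2 beta_t) dB."""
    mean = a0 * torch.exp(-int_beta)                          # a0 * exp(-int_0^t beta_s ds)
    std = eta * torch.sqrt(1.0 - torch.exp(-2.0 * int_beta))
    return mean, std

def denoising_score_matching_loss(score_net, a0, s, beta_fn, int_beta_fn, eta, T=1.0):
    """One-sample Monte Carlo estimate of the score-matching objective in Eq. 9.

    a0: target actions [B, act_dim] (samples from pi_0, unavailable in Max Ent-RL),
    s: states [B, obs_dim]; beta_fn / int_beta_fn map t of shape [B, 1] to beta_t / its integral.
    """
    t = torch.rand(a0.shape[0], 1) * T                        # t ~ Uniform[0, T]
    mean, std = ou_conditional(a0, int_beta_fn(t), eta)
    eps = torch.randn_like(a0)
    at = mean + std * eps                                     # a_t ~ pi_{t|0}(.|a0, s)
    target_score = -(at - mean) / (std ** 2)                  # grad_a log pi_{t|0}(a_t|a0)
    pred = score_net(at, s, t)                                # parameterized score network f_theta
    return (beta_fn(t) * (pred - target_score) ** 2).sum(-1).mean()
```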
To that end, we express the maximum entropy objective as an approximate inference problem for diffusion models. We then use these results to introduce a policy iteration scheme that provably converges to the optimal policy. Lastly, we propose a practical algorithm for optimizing diffusion models. 4.1. Control as Inference for Diffusion Policies Directly maximizing the maximum entropy objective t=l γt l Eρπ rt(st, a0 t) + αH( π0(a0 t|st)) , for a diffusion model is difficult as the marginal entropy H( π0(a|s)) of the denoising process in Eq. 8 is intractable. Please note that we use superscripts for the actions to indicate the diffusion step to avoid collisions with the time step used in RL. Moreover, we will again absorb α into the reward and use rt r(st, a0 t). To overcome this intractability, we propose to maximize a lower bound. We start by discretizing the stochastic processes introduced in Section 3.2 and use the results as a foundation to derive this lower bound. Note that while similar results can be derived from a continuous-time perspective (see e.g., Berner et al.; Richter & Berner; Nusken et al. (2024)), such derivation would require a background in stochastic calculus, making it less accessible to a broader audience. The Euler-Maruyama (EM) discretization (S arkk a & Solin, 2019) of the noising (Eq. 7) and denoising (Eq. 8) process is given by an+1 = an βnanδ + ϵn and (11) an 1 = an + βnan + 2η2βn log πn(an|s) δ + ξn, (12) respectively, with ϵn, ξn N(0, 2η2βnδI). Here, δ denotes a constant discretization step size such that N = T/δ is an integer. To simplify notation, we write an, instead of anδ. Under the EM discretization, the noising and denoising process admit the following joint distributions π0:N(a0:N|s) = π0(a0|s) n=0 πn+1|n(an+1 an, s), (13) π0:N(a0:N|s) = πn 1|n(an 1 an, s), in a sense that π0:N and π0:N converge to the law of (at)t [0,T ] in Eq. 7 and 8, as δ 0, respectively (Doucet et al., 2022). Here, πn+1|n and πn 1|n are Gaussian transition densities that directly follow from Eq. 11 and 12. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning To obtain a maximum entropy objective for diffusion models, we make use of the following lower bound on the marginal entropy, that is, H( π0(a0|s)) ℓ π(a0, s), where π(a0, s) = E log π1:N|0(a1:N|a0, s) π0:N(a0:N|s) Please note that similar bounds have been used, e.g., in (Agakov & Barber, 2004; Tran et al., 2015; Ranganath et al., 2016; Maaløe et al., 2016; Arenz et al., 2018), or, more generally, follow from the data processing inequality (Cover, 1999). A derivation can be found in Appendix A. From Eq. 15, it directly follows that t=l γt l Eρπ rt + ℓ π(a0 t, st) . (16) Next, we cast Eq. 16 as an approximate inference problem to make the objective more interpretable. To that end, let us define the Q-function of a denoising policy π with respect to the maximum entropy objective J as π(st, a0 t) = rt + X l=1 γl Eρπ rt+l + ℓ π(a0 t+l, st+l) , (17) with Q π : S A R. With Eq. 17 we identify the corresponding approximate inference problem as finding π which minimizes (please see Appendix A for derivation) π0:N(a0:N|s)| π0:N(a0:N|s) , (18) where the target policy, i.e., the marginal of the noising process in Eq. 13 is given by the exponentiated Q-function of the diffusion policy π0(a0|s) = exp Q π(s) . (19) Recall from Section 3.2 that we aim to time-reverse the noising process, that is, to ensure for all states s S, it holds that π0:N = π0:N. Please note that this is precisely what Eq. 
18 is trying to accomplish, i.e., we aim to learn a diffusion model π, such that the denoising process time-reverses the noising process, and, in particular, has a marginal distribution given by π0 = exp Q π. Lastly, from the data processing inequality, it directly follows that π0(a0|s) exp Q π(a0:N|s)| π(a0:N|s) , (20) which shows the approximate inference problem in Eq. 18 indeed optimizes the same inference problem stated in Eq. 3. Next, we will use these results to develop a policy iteration scheme for diffusion models. 4.2. Diffusion-based Policy Iteration We propose a policy iteration scheme for learning an optimal maximum entropy policy, similar to (Haarnoja et al., 2018b). However, here we restrict the family of stochastic actors to diffusion policies Π Π. Throughout this section, we assume finite action spaces to enable theoretical analysis, but relax this assumption in Section 4.3. All proofs of this section are deferred to Appendix A. For policy evaluation, we aim to compute the value of a policy π. We define the Bellman backup operator as πQ(st, a0 t) rt+γE Q(st+1, a0 t+1) + ℓ π(a0 t+1, st+1) . (21) Note that Eq. 21 contains the entropy-lower bound ℓ π. By applying standard convergence results for policy evaluation (Sutton & Barto, 1999) we can obtain the value of a policy by repeatedly applying T π as established in Proposition 4.1. Proposition 4.1 (Policy Evaluation). Let T π be the Bellman backup operator for a diffusion policy π as defined in Eq. 21. Further, let Q0 : S A R and Qk+1 = T πQk. Then, it holds that limk Qk = Q π is the Q value of For the policy improvement step, we seek to improve the current policy based on its value using the Q-function. Formally, we need to solve the approximate inference problem πnew = arg min π0:N(a0:N|s)| π old 0:N(a0:N|s) , (22) for all s S, where π old 0:N(a0:N|s) is as in Eq. 13 with marginal density π old 0 (a0|s) = exp Q πold(s, a0) Z πold(s) . (23) Indeed, solving Eq. 22 results in a policy with higher value as established below. Proposition 4.2 (Policy Improvement). Let Π be defined as in Eq. 23 and 22, respectively. Then for all (s, a) S A it holds that Q πnew(s, a) Q πold(s, a). Combining these results leads to the policy iteration method which alternates between policy evaluation (Proposition 4.1) and policy improvement (Proposition 4.2) and provably converges to the optimal policy in Π (Proposition 4.3). Proposition 4.3 (Policy Iteration). Let Π. Further, let πi+1 be the policy obtained from πi after a policy evaluation and improvement step. Then, for any starting policy π0 it holds that limi such that for all Π and (s, a) S A it holds that Q However, performing policy iteration until convergence is in practice often intractable, particularly for continuous control tasks. As such, we will introduce a practical algorithm next. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning 10 01 10 02 10 03 10 04 10 05 10 08 10 09 10 12 0 0.5 1 1.5 2 2.5 3 Number Env Interactions 400 500 600 700 800 IQM Return 0 0.5 1 1.5 2 2.5 3 Number Env Interactions Gaussian Policy DIME 0 0.5 1 1.5 2 2.5 3 Number Env Interactions Gaussian Policy DIME Figure 2. Reward Scaling Sensitivity (a)-(b). The α parameter controls the exploration-exploitation trade-off. (a) shows the learning curves for varying values on DMC s dog-run task. Too high α values (α = 0.1) do not incentivize learning whereas too small α values (α 10 5) converge to suboptimal behavior. (b) shows the aggregated end performance for each learning curve in (a). 
For increasing α values, the end performance increases until it reaches an optimum at α = 10 3 after which the performance starts dropping. Diffusion Policy Benefit (c) and (d). We compare DIME to a Gaussian policy with the same implementation details as DIME on the (a) humanoid-run and (b) dog-run tasks. The diffusion-based policy reaches a higher return (a) and converges faster. 4.3. DIME: A Practical Diffusion RL Algorithm To obtain a practical algorithm, we use a parameterized function approximation for the Q-function and the policy, that is, Qϕ and πθ, with parameters ϕ and θ, respectively. Here, πθ is represented by a parameterized score network, see Eq. 10. To perform approximate policy evaluation, we can minimize the Bellman residual, 2E h Qϕ(st, a0 t) Qtarget(st, a0 t) 2i , (24) using stochastic gradients with respect to ϕ. We provide implementation details in Section 4.4. Moreover, the expectation is computed using state-action pairs collected from environment interactions and saved in a replay buffer. For policy improvement, we solve the approximate inference problem L(θ) = DKL πθ 0:N(a0:N|s)| π0:N(a0:N|s) , (25) where the target policy, i.e., the marginal of the noising process in Eq. 13 is given by the approximate Q-function π0(a0|s) = exp Qϕ(s, a0) Zϕ(s) , (26) where states are again sampled from a replay buffer. Further expanding L(θ) yields log πθ N(a N|s) Qϕ(s, a0) (27) n=1 log πθ n 1|n(an 1 an, s) πn|n 1(an an 1, s) + log Zϕ(s), showing that Zϕ is not needed to minimize Eq. 27 as it is independent of θ. Moreover, contrary to the score-matching objective (see Eq. 9) that is commonly used to optimize diffusion models, stochastic optimization of L(θ) does not need access to samples a0 exp Qϕ/Zϕ, instead relying on stochastic gradients obtained via reparameterization trick (Kingma, 2013) using samples from the diffusion model πθ. 4.4. Implementation Details Autotuning Temperature. We follow implementations like SAC (Haarnoja et al., 2018c) where the reward scaling parameter α (also see Fig. 1) is not absorbed into the reward but scales the entropy term. Choosing α depends on the reward ranges and the dimensionality of the action space, which requires tuning it per environment. We instead follow prior works (Haarnoja et al., 2018c) for auto-tuning α by optimizing J(α) = α Htarget ℓθ H , (28) where Htarget is a target value for the mismatch between the noising and denoising processes measured by the log ratio. Autotuning Diffusion Coefficient. Please note that the objective function in Eq. 27 is fully differentiable with respect to parameters of the diffusion process. As such, we additionally treat the diffusion coefficient β as a learnable parameter that is optimized end-to-end, further reducing the need for manual hyperparameter tuning. Further details on the parameterization can be found in Appendices D and G. Q-function. Following Bhatt et al. (2024) we adopt the Cross Q algorithm, i.e., we use Batch Renormalization in the Q-function and avoid a target network for calculating Qtarget. When updating the Q-function, the values for the current and next state-action pairs are queried in parallel. The next Q-values are used as target values where the gradients are stopped. Additionally, we employ distributional Q learning as proposed by (Bellemare et al., 2017). The details are described in Appendix D. 
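To illustrate the distributional critic just described, the sketch below computes a categorical target by projecting the entropy-augmented backup of Eq. 21 onto a fixed bin support, following Bellemare et al. (2017), together with the corresponding entropy-regularized cross-entropy loss (the 0.005 coefficient is the one reported in Appendix D). The bin range, tensor shapes, and the assumption that the entropy lower-bound term ell_next is supplied by the diffusion policy are illustrative, not DIME's exact implementation.

```python
import torch
import torch.nn.functional as F

def categorical_target(q_probs_next, rewards, ell_next, gamma, v_min, v_max, num_bins):
    """Project r + gamma * (Z(s', a') + ell(s', a')) onto fixed bins (Bellemare et al., 2017).

    q_probs_next: [B, num_bins] next-state value distribution; rewards, ell_next: [B, 1].
    """
    bins = torch.linspace(v_min, v_max, num_bins)                  # support of the value distribution
    delta = (v_max - v_min) / (num_bins - 1)
    tz = (rewards + gamma * (bins.unsqueeze(0) + ell_next)).clamp(v_min, v_max)   # [B, num_bins]
    b = (tz - v_min) / delta
    lower, upper = b.floor().long(), b.ceil().long()
    # avoid losing probability mass when an atom lands exactly on a bin
    lower = torch.where((upper > 0) & (lower == upper), lower - 1, lower)
    upper = torch.where((lower < num_bins - 1) & (lower == upper), upper + 1, upper)
    target = torch.zeros_like(q_probs_next)
    # distribute the mass of each atom to its two neighbouring bins
    target.scatter_add_(1, lower, q_probs_next * (upper.float() - b))
    target.scatter_add_(1, upper, q_probs_next * (b - lower.float()))
    return target, bins

def critic_loss(q_logits, target_probs, ent_reg=0.005):
    """Cross-entropy to the projected target with a small entropy bonus on the prediction."""
    log_q = F.log_softmax(q_logits, dim=-1)
    ce = -(target_probs * log_q).sum(-1).mean()
    ent = -(log_q.exp() * log_q).sum(-1).mean()
    return ce - ent_reg * ent
```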
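Similarly, a minimal sketch of the policy improvement step shows how the expanded objective in Eq. 27 can be estimated by rolling out the Euler-Maruyama denoising chain (Eq. 12) with reparameterized noise and accumulating the log-ratio against the noising kernels of Eq. 11; the temperature update of Eq. 28 is appended. It assumes an isotropic prior N(0, η²I), a fixed per-step β array, and a score_net(a, s, n) callable, and it omits DIME's tanh squashing, learned β scaling, and distributional critic.

```python
import math
import torch

def gaussian_log_prob(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I), summed over action dimensions."""
    return (-0.5 * ((x - mean) ** 2 / var + math.log(2.0 * math.pi * var))).sum(-1)

def diffusion_policy_loss(score_net, q_fn, s, act_dim, betas, delta, eta, alpha, n_steps):
    """Monte Carlo estimate of the expanded objective in Eq. 27 for one batch of states s.

    betas: list of floats of length n_steps + 1; gradients flow through the whole chain.
    """
    a = eta * torch.randn(s.shape[0], act_dim)                    # a^N ~ N(0, eta^2 I)
    log_ratio = torch.distributions.Normal(0.0, eta).log_prob(a).sum(-1)   # log pi^theta_N(a^N|s)
    for n in range(n_steps, 0, -1):
        var_b = 2.0 * eta ** 2 * betas[n] * delta                 # backward kernel variance (Eq. 12)
        mean_b = a + (betas[n] * a + 2.0 * eta ** 2 * betas[n] * score_net(a, s, n)) * delta
        a_prev = mean_b + math.sqrt(var_b) * torch.randn_like(a)  # reparameterized a^{n-1}
        log_ratio = log_ratio + gaussian_log_prob(a_prev, mean_b, var_b)   # log pi^theta_{n-1|n}
        var_f = 2.0 * eta ** 2 * betas[n - 1] * delta              # forward kernel variance (Eq. 11)
        mean_f = a_prev - betas[n - 1] * a_prev * delta
        log_ratio = log_ratio - gaussian_log_prob(a, mean_f, var_f)        # log pi_{n|n-1}
        a = a_prev
    ell = -log_ratio                                               # entropy lower bound (Eq. 15)
    policy_loss = (alpha * log_ratio - q_fn(s, a)).mean()          # Eq. 27, alpha scaling the entropy term
    return policy_loss, ell.mean()

def alpha_loss(log_alpha, ell, target_entropy):
    """Auto-tuning of the temperature, maximizing J(alpha) = alpha * (H_target - ell) from Eq. 28."""
    return (log_alpha.exp() * (ell.detach() - target_entropy)).mean()
```

Gradients reach the policy parameters both through the Gaussian kernels' log-densities and through the final action a0 entering the Q-function, which is what makes the reparameterization trick applicable here.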
[Figure 3: (a) IQM return on humanoid-run when varying the number of diffusion steps (32, 16, 8, 4, 2); (b) mean runtime in hours for 1M steps as a function of the number of diffusion steps; (c) Ant-v3 and (d) Humanoid-v3 learning curves for DIME (ours), Cross Q, QSM, Diff-QL, Consistency-AC, DIPO, QVPO, and DACER.]

Figure 3. Varying the number of diffusion steps (a)-(b). The number of diffusion steps might affect the performance and the computation time. (a) shows DIME's learning curves for varying diffusion steps. Two diffusion steps perform badly, whereas four and eight diffusion steps perform similarly but still worse than 16 and 32 diffusion steps, which perform similarly. (b) shows the computation time for 1M steps of the corresponding learning curves. The fewer the diffusion steps, the less computation time is required. Learning curves on the Gym benchmark suite (c)-(d). We compare DIME against various diffusion baselines and Cross Q on the (c) Ant-v3 and (d) Humanoid-v3 from the Gym suite. While all diffusion-based methods are outperformed by DIME, DIME performs on par with Cross Q on the Ant environment. DIME performs favorably on the high-dimensional Humanoid-v3 environment, where it also outperforms Cross Q.

5. Experiments

We analyze DIME's algorithmic features with an extensive ablation study in which we clarify the role of the reward scaling parameter α, the effect of varying the number of diffusion steps, and the performance gained when using a diffusion policy representation over a Gaussian representation. An additional analysis on employing distributional Q-learning is provided in Appendix G. On a broad range of 13 sophisticated learning environments from different benchmark suites, namely MuJoCo Gym (Brockman et al., 2016), the DeepMind Control Suite (DMC) (Tunyasuvunakool et al., 2020), and MyoSuite (Caggiano et al., 2022), we compare DIME's performance against state-of-the-art RL baselines that employ diffusion and Gaussian policy parameterizations. The considered environments are challenging locomotion and manipulation tasks with up to 39-dimensional action and 223-dimensional observation spaces. We consider QSM (Psenka et al., 2024), Diffusion-QL (Wang et al., 2023), Consistency-AC (Ding & Jin, 2024), DIPO (Yang et al., 2023), QVPO (Ding et al., 2024), and DACER (Wang et al., 2024) as baselines for diffusion-based policy representations. Additionally, we compare against the state-of-the-art RL methods Cross Q (Bhatt et al., 2024) and BRO (Nauman et al., 2024), where we have used the provided learning curves for the latter. Both methods use a Gaussian-parameterized policy and have shown remarkable results. We have run the learning curves for 10 seeds using the official code releases and report the interquartile mean (IQM) with a 95% stratified bootstrap confidence interval as suggested by Agarwal et al. (2021).

5.1. Ablation Studies

Exploration Control. The parameter α balances the exploration-exploitation trade-off by scaling the reward signal. We analyze the effect of this parameter by comparing DIME's learning curves with different α values on the dog-run task from DMC (see Fig. 2a). Additionally, we show the performance of the last return measurements for each learning curve in Fig. 2b. Too high α values (α = 0.1) do not incentivize maximizing the task's return, leading to no learning at all, whereas small values (α ≤ 10⁻⁵) lead to suboptimal performance because the policy does not explore sufficiently. We can also see a clear trend that, starting from α = 10⁻¹², the performance gradually increases until the best performance is reached for α = 10⁻³.

Diffusion Policy Benefit. We aim to analyze the performance benefits of the diffusion-parameterized policy compared to a Gaussian parameterization in the same setup by only exchanging the policy and the corresponding policy update. This comparison ensures that the Gaussian policy is trained with the identical implementation details of DIME as described in Sec. 4.4 and showcases the performance benefits of a diffusion-based policy. Fig. 2c and 2d show the learning curves of both versions on DMC's humanoid-run and dog-run environments. The diffusion policy's expressivity leads to a higher aggregated return on humanoid-run and to significantly faster convergence on the high-dimensional dog-run task. We attribute this performance benefit to an improved exploration behavior.

Number of Diffusion Steps. The number of diffusion steps determines how accurately the stochastic differential equations are simulated and is a hyperparameter that affects the performance. Usually, the higher the number of diffusion steps, the better the model performs, at the burden of higher computational costs. In Fig. 3a we plot DIME's performance for varying diffusion steps on DMC's humanoid-run environment and report the corresponding runtimes for 1M environment steps in Fig. 3b on an Nvidia A100 GPU machine. With an increasing number of diffusion steps, the performance and runtime increase. However, from 16 diffusion steps on, the performance stays the same.

[Figure 4: learning curves on (a) Dog Run, (b) Dog Trot, (c) Dog Walk, (d) Dog Stand, (e) Humanoid Run, (f) Humanoid Walk, (g) Humanoid Stand (IQM return) and (h) Object Hold Hard, (i) Reach Hard, (j) Key Turn Hard, (k) Pen Twirl Hard (IQM success rate) for DIME (ours), BRO, BRO (Fast), Cross Q, QSM, Diff-QL, Consistency-AC, and DIPO.]

Figure 4. Training curves on DMC's dog and humanoid tasks and on the hand environments from MyoSuite. DIME performs favorably on the high-dimensional dog tasks, where it significantly outperforms all baselines (dog-run) or converges faster to the final performance. On the humanoid tasks, DIME outperforms all diffusion-based baselines, Cross Q, and BRO Fast, performs on par with BRO on the humanoid-stand task, and performs slightly worse on the humanoid-run and humanoid-walk tasks. In the MyoSuite environments, DIME performs consistently on all tasks, either outperforming the baselines or performing on par.

5.2.
Performance Comparisons We consider environments with high-dimensional observation and action spaces from three benchmark suits for a robust performance assessment (please see Appendix C). Gym Environments. Fig 3c and Fig. 3d show the learning curves for the An-tv3 and Humanoid-v3 tasks respectively. While the diffusion-based baselines perform reasonably well on the Ant-v3 task with DIPO outperforming the rest, they are all outperformed by DIME and Cross Q which perform comparably. On the Humanoid-v3 DIME achieves a significantly higher return than all baselines. DMC: Dog and Humanoid Tasks (Fig. 4). We benchmark on DMC suit s challenging dog and humanoid environments, where we additionally consider BRO and BRO Fast as a Gaussian-based policy baseline. BRO Fast is identical to BRO but differs only in the update-to-data (UTD) ratio of two as DIME and Cross Q. Please note that we used the online available learning curves provided by the official implementation for BRO. DIME outperforms all baselines significantly on the dog-run environment and converges faster to the same end performance on the remaining dog environments (see Fig. 4a - 4d). BRO has slightly higher average performance on the humanoid-run and humanoidwalk (see Fig. 4f - 4e)) tasks indicating that DIME performs favorably on more high-dimensional tasks like the dog environments and tasks from the myo suite. However, DIME s asymptotic behavior in the humanoid-run achieves slightly higher aggregated performance than BRO, where we have run both algorithms for 3M steps (Fig. 6c). However, BRO requires full parameter resets leading to performance drops DIME: Diffusion-Based Maximum Entropy Reinforcement Learning during training and it is run with a UTD ratio of 10 which is 5 times higher than DIME. This leads to longer training times. As reported in their paper (Nauman et al., 2024), BRO needs an average training time of 8.5h, whereas DIME trains in approximately 4.5h with 16 diffusion steps on the humanoid-run with the same hardware (Nvidia A100). MYO Suite (Fig. 4). Except for pen twirl hard (Fig. 4k), DIME consistently outperforms BRO and BRO Fast in that it converges to a higher or faster end success rate. DIME also consistently outperforms Cross Q in terms of the achieved success rates on all the tasks except for the object hold hard task 4h, where DIME converges faster. 6. Conclusion and Future Work In this work, we introduced DIME, a method for learning diffusion models for maximum entropy reinforcement learning by leveraging connections to approximate inference. We view this work as a starting point for exciting future research. Specifically, we explored denoising diffusion models, where the forward process follows an Ornstein-Uhlenbeck process. However, approximate inference with diffusion models is an active and rapidly evolving field, with numerous recent advancements that consider alternative stochastic processes. For example, Richter & Berner proposed learning both the forward and backward processes, while Nusken et al. (2024) further enhanced exploration by incorporating the gradient of the target density into the diffusion process. Additionally, Chen et al. (2024) combined learned diffusion models with Sequential Monte Carlo (Del Moral et al., 2006), resulting in a highly effective inference method. These approaches hold significant promise for further improving diffusion-based policies in RL. We have conducted preliminary experiments on the framework from Richter & Berner and provide them in Appendix F. 
Finally, we note that the loss function used in this work (see Eq. 25) is based on the Kullback-Leibler divergence. However, in principle, any divergence could be used. For instance, the log-variance divergence (Richter & Berner) has shown promising results in optimizing diffusion models for approximate inference (Chen et al., 2024; Noble et al., 2024). Exploring alternative objectives could lead to additional performance improvements. Another interesting future research lies in investigating the effects of using more sophisticated critic structures, such as transformers, as proposed by Li et al. (2025). Acknowledgements The authors acknowledge support from the state of Baden W urttemberg through the Hore Ka supercomputer funded by the Ministry of Science, Research and the Arts Baden W urttemberg and by the German Federal Ministry of Education and Research. This work has been supported by the DFG Collaborative Research Center 1574, Circular Factory for the Perpetual Product, and by the pilot program Core Informatics of the Helmholtz Association (HGF). This research has been additionally supported by the DFG Emmy Noether project CH 2676/1-1. This research was also supported by the research cluster Third Wave of AI , funded by the excellence program of the Hessian Ministry of Higher Education, Science, Research and the Arts, hessian.AI. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Agakov, F. V. and Barber, D. An auxiliary variational method. In Neural Information Processing: 11th International Conference, ICONIP 2004, Calcutta, India, November 22-25, 2004. Proceedings 11, pp. 561 566. Springer, 2004. Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., and Bellemare, M. Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304 29320, 2021. Akhound-Sadegh, T., Rector-Brooks, J., Bose, A. J., Mittal, S., Lemos, P., Liu, C.-H., Sendera, M., Ravanbakhsh, S., Gidel, G., Bengio, Y., et al. Iterated denoising energy matching for sampling from boltzmann densities. ar Xiv preprint ar Xiv:2402.06121, 2024. Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313 326, 1982. Arenz, O., Neumann, G., and Zhong, M. Efficient gradientfree variational inference using policy search. In International conference on machine learning, pp. 234 243. PMLR, 2018. Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International conference on machine learning, pp. 449 458. PMLR, 2017. Berner, J., Richter, L., and Ullrich, K. An optimal control perspective on diffusion-based generative modeling. Transactions on Machine Learning Research. Bhatt, A., Palenicek, D., Belousov, B., Argus, M., Amiranashvili, A., Brox, T., and Peters, J. Crossq: Batch normalization in deep reinforcement learning for greater DIME: Diffusion-Based Maximum Entropy Reinforcement Learning sample efficiency and simplicity. In The Twelfth International Conference on Learning Representations, 2024. Blessing, D., Berner, J., Richter, L., and Neumann, G. Underdamped diffusion bridges with applications to sampling. In The Thirteenth International Conference on Learning Representations, 2025a. Blessing, D., Jia, X., and Neumann, G. End-to-end learning of gaussian mixture priors for diffusion sampler. 
In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview. net/forum?id=i Xb Uqua Wbl. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016. Caggiano, V., Wang, H., Durandau, G., Sartori, M., and Kumar, V. Myosuite a contact-rich simulation suite for musculoskeletal motor control, 2022. ar Xiv preprint ar Xiv:2205.00588. Chen, H., Lu, C., Ying, C., Su, H., and Zhu, J. Offline reinforcement learning via high-fidelity generative behavior modeling. In The Eleventh International Conference on Learning Representations, 2023. Chen, J., Richter, L., Berner, J., Blessing, D., Neumann, G., and Anandkumar, A. Sequential controlled langevin diffusions. ar Xiv preprint ar Xiv:2412.07081, 2024. Chen, J., Richter, L., Berner, J., Blessing, D., Neumann, G., and Anandkumar, A. Sequential controlled langevin diffusions. In The Thirteenth International Conference on Learning Representations, 2025. URL https:// openreview.net/forum?id=d Im D2sgy86. Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023. Cover, T. M. Elements of information theory. John Wiley & Sons, 1999. Dai Pra, P. A stochastic control approach to reciprocal diffusion processes. Applied mathematics and Optimization, 23(1):313 329, 1991. Del Moral, P., Doucet, A., and Jasra, A. Sequential monte carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411 436, 2006. Ding, S., Hu, K., Zhang, Z., Ren, K., Zhang, W., Yu, J., Wang, J., and Shi, Y. Diffusion-based reinforcement learning via q-weighted variational policy optimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. Ding, Z. and Jin, C. Consistency models as a rich and efficient policy class for reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. Doucet, A., Grathwohl, W., Matthews, A. G., and Strathmann, H. Score-based diffusion meets annealed importance sampling. Advances in Neural Information Processing Systems, 35:21482 21494, 2022. Fang, L., Liu, R., Zhang, J., Wang, W., and Jing, B. Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025. Geffner, T. and Domke, J. Langevin diffusion variational inference. In International Conference on Artificial Intelligence and Statistics, pp. 576 593. PMLR, 2023. Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pp. 1352 1361. PMLR, 2017. Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. Latent space policies for hierarchical reinforcement learning. In International Conference on Machine Learning, pp. 1851 1860. PMLR, 2018a. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861 1870. PMLR, 2018b. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. ar Xiv preprint ar Xiv:1812.05905, 2018c. 
Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. Idql: Implicit q-learning as an actorcritic method with diffusion policies. ar Xiv preprint ar Xiv:2304.10573, 2023. Haussmann, U. G. and Pardoux, E. Time reversal of diffusions. The Annals of Probability, pp. 1188 1205, 1986. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840 6851, 2020. Huang, J., Jiao, Y., Kang, L., Liao, X., Liu, J., and Liu, Y. Schr odinger-f ollmer sampler: sampling without ergodicity. ar Xiv preprint ar Xiv:2106.10880, 1, 2021. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning Hyv arinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. Ishfaq, H., Wang, G., Islam, S. N., and Precup, D. Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=Fv Qsk3la17. Janner, M., Du, Y., Tenenbaum, J., and Levine, S. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, pp. 9902 9915. PMLR, 2022. Kang, B., Ma, X., Du, C., Pang, T., and Yan, S. Efficient diffusion policies for offline reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023. Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35: 26565 26577, 2022. Kingma, D. P. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement learning: State-of-theart, pp. 45 73. Springer, 2012. Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Li, G., Tian, D., Zhou, H., Jiang, X., Lioutikov, R., and Neumann, G. TOP-ERL: Transformer-based off-policy episodic reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025. Li, Z., Krohn, R., Chen, T., Ajay, A., Agrawal, P., and Chalvatzaki, G. Learning multimodal behaviors from scratch with diffusion policy gradient. ar Xiv preprint ar Xiv:2406.00681, 2024. Lu, C., Chen, H., Chen, J., Su, H., Li, C., and Zhu, J. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In International Conference on Machine Learning, pp. 22825 22855. PMLR, 2023. Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. Auxiliary deep generative models. In International conference on machine learning, pp. 1445 1453. PMLR, 2016. Mao, L., Xu, H., Zhan, X., Zhang, W., and Zhang, A. Diffusion-dice: In-sample diffusion guidance for offline reinforcement learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. Messaoud, S., Mokeddem, B., Xue, Z., Pang, L., An, B., Chen, H., and Chawla, S. S 2 ac: Energy-based reinforcement learning with stein soft actor critic. In The Twelfth International Conference on Learning Representations. Nauman, M. and Cygan, M. On the theory of risk-aware agents: Bridging actor-critic and economics. In ICML 2024 Workshop: Aligning Reinforcement Learning Experimentalists and Theorists, 2023. 
Nauman, M., Ostaszewski, M., Jankowski, K., Miło s, P., and Cygan, M. Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. Nelson, E. Dynamical theories of Brownian motion, volume 101. Princeton university press, 2020. Neu, G., Jonsson, A., and G omez, V. A unified view of entropy-regularized markov decision processes. ar Xiv preprint ar Xiv:1705.07798, 2017. Nikishin, E., Schwarzer, M., D Oro, P., Bacon, P.-L., and Courville, A. The primacy bias in deep reinforcement learning. In International conference on machine learning, pp. 16828 16847. PMLR, 2022. Noble, M., Grenioux, L., Gabri e, M., and Durmus, A. O. Learned reference-based diffusion sampling for multimodal distributions. ar Xiv preprint ar Xiv:2410.19449, 2024. Nusken, N., Vargas, F., Padhy, S., and Blessing, D. Transport meets variational inference: Controlled monte carlo diffusions. In The Twelfth International Conference on Learning Representations: ICLR 2024, 2024. Psenka, M., Escontrela, A., Abbeel, P., and Ma, Y. Learning a diffusion model policy from rewards via q-score matching. In Forty-first International Conference on Machine Learning, 2024. Ranganath, R., Tran, D., and Blei, D. Hierarchical variational models. In International conference on machine learning, pp. 324 333. PMLR, 2016. Reuss, M., Li, M., Jia, X., and Lioutikov, R. Goal conditioned imitation learning using score-based diffusion policies. In Robotics: Science and Systems, 2023. Richter, L. and Berner, J. Improved sampling via learned diffusions. In The Twelfth International Conference on Learning Representations. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning S arkk a, S. and Solin, A. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256 2265. PMLR, 2015. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. Robotica, 17(2):229 235, 1999. Toussaint, M. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp. 1049 1056, 2009. Tran, D., Ranganath, R., and Blei, D. M. The variational gaussian process. ar Xiv preprint ar Xiv:1511.06499, 2015. Tunyasuvunakool, S., Muldal, A., Doron, Y., Liu, S., Bohez, S., Merel, J., Erez, T., Lillicrap, T., Heess, N., and Tassa, Y. dm control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020. ISSN 2665-9638. Tzen, B. and Raginsky, M. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, pp. 3084 3114. PMLR, 2019. Vargas, F., Grathwohl, W. S., and Doucet, A. Denoising diffusion samplers. In The Eleventh International Conference on Learning Representations. Vargas, F., Ovsianas, A., Fernandes, D., Girolami, M., Lawrence, N. D., and N usken, N. Bayesian learning via neural schr odinger f ollmer flows. Statistics and Computing, 33(1):3, 2023. Vincent, P. A connection between score matching and denoising autoencoders. 
Neural computation, 23(7):1661 1674, 2011. Wang, D. and Liu, Q. Learning to draw samples: With application to amortized mle for generative adversarial learning. ar Xiv preprint ar Xiv:1611.01722, 2016. Wang, Y., Wang, L., Jiang, Y., Zou, W., Liu, T., Song, X., Wang, W., Xiao, L., WU, J., Duan, J., and Li, S. E. Diffusion actor-critic with entropy regulator. In The Thirtyeighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview. net/forum?id=l0c1j4Qv Tq. Wang, Z., Hunt, J. J., and Zhou, M. Diffusion policies as an expressive policy class for offline reinforcement learning. International Conference on Learning Representations, 2023. Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681 688. Citeseer, 2011. Yang, L., Huang, Z., Lei, F. h., Zhong, Y., Yang, Y., Fang, C., Wen, S., Zhou, B., and Lin, Z. Policy representation via diffusion probability model for reinforcement learning. ar Xiv preprint ar Xiv:2305.13122, 2023. Zhang, D., Chen, R. T., Liu, C.-H., Courville, A., and Bengio, Y. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. ar Xiv preprint ar Xiv:2310.02679, 2023. Zhang, Q. and Chen, Y. Path integral sampler: a stochastic control approach for sampling. ar Xiv preprint ar Xiv:2111.15141, 2021. Zhou, H., Blessing, D., Li, G., Celik, O., Jia, X., Neumann, G., and Lioutikov, R. Variational distillation of diffusion policies into mixture of experts. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/ forum?id=ii Yadg KHwo. Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010. Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K., et al. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pp. 1433 1438. Chicago, IL, USA, 2008. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning A. Derivations Lower-Bound Derivation. H(π0(a0|s)) ℓ H(π0(a0|s)) = E π0:N(a0:N|s) π1:N|0(a1:N|s, a0) π0:N(a0:N|s) π1:N|0(a1:N|s, a0) π1:N|0(a1:N|s, a0) π1:N|0(a1:N|s, a0) log π1:N|0(a1:N|s, a0) π0:N(a0:N|s) π1:N|0(a1:N|s, a0) π1:N|0(a1:N|s, a0) log π1:N|0(a1:N|s, a0) π0:N(a0:N|s) π1:N|0(a1:N|s, a0) π1:N|0(a1:N|s, a0) (31) log π1:N|0(a1:N|s, a0) π0:N(a0:N|s) where we have used the relation π0:N(a0:N|s) π1:N|0(a1:N|s, a0) (33) and the fact that the KL divergence is always non-negative Approximate Inference Formulation. Recall the definition of the Q-function π(st, a0 t) = rt + X l=1 γl Eρπ rt+l + ℓ π(a0 t+l, st+l) . (34) π(a0, s) = E log π1:N|0(a1:N|a0, s) π0:N(a0:N|s) We start reformulating the objective t=l γt l Eρπ rt + ℓ π(a0 t, st) . (36) t=l+1 γt l Eρπ rt + ℓ π(a0 t, st) + Eρπ rl + ℓ π(a0 l , sl) (37) π(st, a0 t) + Eρπ ℓ π(a0 l , sl) (38) π(st, a0 t) + ℓ π(a0 l , sl) (39) π(st, a0 t) + log π1:N|0(a1:N|a0, s) π0:N(a0:N|s) π(a0:N|s) π(a0:N|s) log Z π(s) , (41) where we used π0(a0|s) = exp Q in the last step. When minimizing, the negative sign in front of the KL vanishes. Please note that the expectation over the marginal state distribution was ommited in the main text to avoid cluttered notation. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning Proof of Proposition 4.1 (Policy Evaluation). 
Let s define the entropy-augmented reward of a diffusion policy as π(st, a0 t) rt(st, a0 t) + E log π1:N|0(a1:N|a0, s) π0:N(a0:N|s) and the update rule for the Q-function as Q(st, a0 t) r π(st, a0 t) + γEst+1 p,a0 t+1 π Q(st+1, a0 t+1) . (44) This formulation allows us to apply the standard convergence results for policy evaluation as stated in (Sutton & Barto, 1999). Proof of Proposition 4.2 (Policy Improvement). It holds that π(i+1)(a0:N|s) = exp Qπ(i)(s, a N) Zπ(i)(s) π(i)(a0:N 1|a N, s) (45) Moreover, using the fact that the KL divergence is always non-negative, we obtain π(i+1)(a0:N|s) π(i+1)(a0:N|s) DKL π(i)(a0:N|s) π(i+1)(a0:N|s) (46) Rewriting the KL divergences yields π(i+1)(a0:N|s) π(i+1)(a0:N|s) π(i)(a0:N|s) π(i+1)(a0:N|s) π(i+1) h log π(i+1)(a0:N|s) i E π(i+1) h log π(i+1)(a0:N|s) i (48) π(i)(a0:N|s) i E π(i+1)(a0:N|s) i π(i+1) h log π(i+1)(a0:N|s) i E log exp Qπ(i)(s, a N) Zπ(i)(s) π(i)(a0:N 1|a N, s) π(i)(a0:N|s) i E log exp Qπ(i)(s, a N) Zπ(i)(s) π(i)(a0:N 1|a N, s) π(i+1) h Qπ(i)(s, a N) i + E log π(i)(a0:N 1|a N, s) π(i+1)(a0:N|s) π(i) h Qπ(i)(s, a N) i + E log π(i)(a0:N 1|a N, s) π(i)(a0:N|s) To keep the notation uncluttered we use d(i+1)(s, a N) = E log π(i)(a0:N 1|a N, s) π(i+1)(a0:N|s) and d(i)(s, a N) = E log π(i)(a0:N 1|a N, s) π(i)(a0:N|s) DIME: Diffusion-Based Maximum Entropy Reinforcement Learning Figure 5. Considered environments. The Humanoid-v3 and the Ant-v3 are environments from the mujoco gym benchmark (Brockman et al., 2016). The three environmentshumanoid-run,humanoid-walk and humanoid-stand are from the deepmind control suite (DMC) benchmark (Tunyasuvunakool et al., 2020). The dog environments consist of dog-run, dog-walk, dog-stand, dog-trot and are also from the DMC sutie benchmark. Finally, the myo suite hand environments object-hold-hard,reach-hard, key-turn-hard, pen-twirl-hard are from the myo suite (Caggiano et al., 2022). Qπ(i)(s, a N) = r0 + E h γ d(i)(s1, a N 1 ) + E π(i) h Qπ(i)(s1, a N 1 ) i i (52) r0 + E h γ d(i+1)(s1, a N 1 ) + E π(i+1) h Qπ(i)(s1, a N 1 ) i i (53) = r0 + E h γ d(i+1)(s1, a N 1 ) + r1 + γ2 d(i)(s2, a N 2 ) + E π(i) h Qπ(i)(s2, a N 2 ) i i (54) r0 + E h γ d(i+1)(s1, a N 1 ) + r1 + γ2 d(i+1)(s2, a N 2 ) + E π(i+1) h Qπ(i)(s2, a N 2 ) i i (55) t=1 γt d(i+1)(st, a N t ) + rt # = Qπ(i+1)(s, a N) (57) Since Q improves monotonically, we eventually reach a fixed point Q(i+1) = Q(i) = Q Proof of Proposition 4.3 (Policy Iteration). From Proposition 4.2 it follows that Q πi+1(s, a) Q πi(s, a). If for limk π , then it must hold that Q π(s, a) for all Π which is guaranteed by Proposition 4.2. C. Environments All environments are visualized in Fig. 5. We consider the Ant-v3 and the Humanoid-v3 environments from mujoco gym (Brockman et al., 2016). The humanoid-stand, humanoid-walk , humanoid-run, dog-stand, dog-walk, dog-trot and dog-run environments from the deepmind control suite (DMC) (Tunyasuvunakool et al., 2020). The hand environments from myo suite are the object-hold-random,reach-random, key-turn-random and pen-twirl-random environments (Caggiano et al., 2022). The action and observation spaces of the respective environments are shown in Table 1. D. Implementation Details We consider a score network with 3 layers and a 256 dimensional hidden layer with gelu activation function. We use Fourier features to encode the timestep and scale the embedding using a feed-forward neural network with two layers, with a hidden dimension of 256. 
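A rough sketch of the score-network parameterization described above: a random Fourier-feature embedding of the diffusion timestep passed through a two-layer network, concatenated with the state-action input, and processed by a three-layer GELU MLP. The layer widths follow the text; the number of Fourier frequencies and the concatenation scheme are assumptions.

```python
import math
import torch
import torch.nn as nn

class FourierTimeEmbedding(nn.Module):
    """Random Fourier features of the scalar diffusion timestep, followed by a 2-layer MLP."""
    def __init__(self, embed_dim: int = 256, n_frequencies: int = 64):
        super().__init__()
        self.register_buffer("freqs", torch.randn(n_frequencies) * 2.0 * math.pi)
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_frequencies, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: [batch, 1] in [0, 1]; project onto random frequencies and take sin/cos features
        proj = t * self.freqs
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.mlp(feats)

class ScoreNetwork(nn.Module):
    """3-layer GELU MLP predicting the score f_theta(a^n, s, n) for the denoising process (Eq. 10)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, embed_dim: int = 256):
        super().__init__()
        self.time_embed = FourierTimeEmbedding(embed_dim)
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + embed_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, a: torch.Tensor, s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([a, s, self.time_embed(t)], dim=-1))
```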
For the diffusion coefficient, we use a cosine schedule and additionally optimize a scaling parameter for the diffusion coefficient per dimension end-to-end (i.e., we learn the parameter β; please see Appendix G). We employ distributional Q following (Bellemare et al., 2017), where the Q-model outputs probabilities q over b bins. Using the Bellman backup operator for diffusion models from Eq. 21 and the bin values b, we follow (Bellemare et al., 2017) and calculate the target probabilities q_target. Using the entropy-regularized cross-entropy loss

$$\mathcal{L}(\phi) = -\sum_i q_{\text{target},i} \log q_{\phi,i} \;-\; 0.005 \sum_i q_{\phi,i} \log q_{\phi,i},$$

we update the parameters ϕ of the Q-function. Please note that the entropy regularization was not proposed in the original paper by (Bellemare et al., 2017); however, we noticed that a small regularization helps improve the performance in the early learning stages but does not change the asymptotic performance.

| Environment | Observation space dim. | Action space dim. |
| --- | --- | --- |
| Ant-v3 | 111 | 8 |
| Humanoid-v3 | 376 | 17 |
| dog-run | 223 | 38 |
| dog-walk | 223 | 38 |
| dog-trot | 223 | 38 |
| dog-stand | 223 | 38 |
| humanoid-run | 67 | 24 |
| humanoid-walk | 67 | 24 |
| humanoid-stand | 67 | 24 |
| myoHandObjHoldRandom-v0 | 91 | 39 |
| myoHandReachRandom-v0 | 115 | 39 |
| myoHandKeyTurnRandom-v0 | 93 | 39 |
| myoHandPenTwirlRandom-v0 | 83 | 39 |

Table 1. Observation and action space dimensions for the various training environments.

Additionally, we follow (Nauman et al., 2024) and use the mean of the two Q-values instead of the min, which has usually been used in RL so far. The expected Q-values for updating the actor can be easily calculated using the expectation $Q(s_t, a^{0}_t) = \sum_i q_i(s_t, a^{0}_t)\, b_i$.

Action Scaling. Practical applications have a bounded action space that can usually be scaled to a fixed range. However, the action range of the diffusion policy π is unbounded. Therefore, we follow recent works (Haarnoja et al., 2018b) and propose applying a change of variables with a tanh squashing function at the last diffusion step n = 0. For the chain $q^{0:N}(u^{0:N}|s)$ over the unbounded actions $u \in \mathbb{R}^{D}$, we squash the action $a^{0} = \tanh u^{0}$ such that $a^{0} \in (-1, 1)$, and its density is given by

$$\pi^{0:N}(a^{0:N}|s) = q^{0:N}(u^{0:N}|s)\, \Big|\det \frac{\partial a^{0}}{\partial u^{0}}\Big|^{-1}$$

with the corresponding log-likelihood

$$\log \pi^{0:N}(a^{0:N}|s) = \log q^{0:N}(u^{0:N}|s) - \sum_{i=1}^{D} \log\big(1 - \tanh^{2} u^{0}_{i}\big). \quad (59)$$

This means that the Gaussian kernels $q^{n-1|n}(u^{n-1}|u^{n}, s)$ of the diffusion chain keep their log probabilities unchanged, except for the correction term at the last step n = 0.

Algorithm 1 DIME: Diffusion-Based Maximum Entropy Reinforcement Learning
Input: Initialized parameters θ, ϕ, α; learning rates λ_θ, λ_ϕ, λ_α
1: for k = 1 to M do
2:   if k % UTD == 0 then
3:     a^{0:N}_t ∼ π^θ_{0:N}(a^{0:N}|s_t)
4:     s_{t+1} ∼ p(s_{t+1}|a^0_t, s_t)
5:     D ← D ∪ {s_t, a^0_t, r_t, s_{t+1}}
6:   end if
7:   ϕ ← ϕ − λ_ϕ ∇_ϕ J_Q(ϕ) (Eq. 24)
8:   if k % POLICYDELAY == 0 then
9:     θ ← θ − λ_θ ∇_θ L(θ) (Eq. 25)
10:    α ← α − λ_α ∇_α J(α) (Eq. 28)
11:  end if
12: end for

Algorithm 1 shows the learning procedure of DIME. Note that the policy delay refers to the number of critic updates per policy update. UTD is the update-to-data ratio.
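The action-scaling step above amounts to a single change-of-variables correction on the last diffusion step. The following minimal numpy sketch applies it, assuming the log-density of the unbounded chain has already been computed; the small `eps` constant is a numerical-stability detail added here for illustration and is not mentioned in the text.

```python
import numpy as np

def squash_with_logprob(u0, logprob_u_chain, eps=1e-6):
    """Map the unbounded final diffusion sample u^0 to a bounded action
    a^0 = tanh(u^0) and apply the change-of-variables correction (Eq. 59)
    to the chain log-likelihood log q^{0:N}(u^{0:N}|s)."""
    a0 = np.tanh(u0)
    # log|det(da^0/du^0)| = sum_i log(1 - tanh^2(u^0_i))
    correction = np.sum(np.log(1.0 - a0 ** 2 + eps), axis=-1)
    logprob_a_chain = logprob_u_chain - correction
    return a0, logprob_a_chain
```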
E. List of Hyperparameters

Please note that we have used the official code releases of the respective baseline methods for training. For BRO and BRO Fast, we used the provided learning curves.

| Hyperparameter | DIME | QSM | Diff-QL | Consistency-AC | DIPO | DACER | QVPO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Update-to-data ratio | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| Discount | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| Batch size | 256 | 256 | 256 | 256 | 256 | 256 | 256 |
| Buffer size | 1e6 | 1e6 | 1e5 | 1e5 | 1e6 | 1e6 | 1e6 |
| H_target | 4 dim(A) | N/A | N/A | N/A | N/A | -0.9 dim(A) | N/A |
| Critic hidden depth | 2 | 2 | 2 | 3 | 3 | 3 | 2 |
| Critic hidden size | 2048 | 2048 | 256 | 256 | 256 | 256 | 256 |
| Actor/Score depth | 3 | 3 | 4 | 4 | 4 | 3 | 2 |
| Actor/Score size | 256 | 256 | 256 | 256 | 256 | 256 | 256 |
| Num. bins/quantiles | 100 | N/A | N/A | N/A | N/A | 2 | N/A |
| Temp. learn. rate | 1e-3 | N/A | N/A | N/A | N/A | 3e-2 | N/A |
| Learn. rate critic | 3e-4 | 3e-4 | 3e-4 | 3e-4 | 3e-4 | 3e-4 | 3e-4 |
| Learn. rate actor/score | 3e-4 | 3e-4 | 1e-5 | 1e-5 | 3e-4 | 3e-4 | 3e-4 |
| Optimizer | Adam | Adam | Adam | Adam | Adam | Adam | Adam |
| Diffusion steps | 16 | 15 | 5 | N/A | 100 | 20 | 20 |
| Prior distr. | N(0, 2.5) | N(0, 1) | N/A | N/A | N/A | N(0, 1) | N(0, 1) |
| Exploration steps | 5000 | 1e4 | 1e4 | 1e4 | 1e4 | 1e4 | 1e4 |
| Score-Q align. factor | N/A | 50 | N/A | N/A | N/A | N/A | N/A |

Table 2. Hyperparameters of DIME and all diffusion-based algorithms for all benchmark suites. Varying hyperparameters for different benchmark suites are described in the text.

| Hyperparameter | DIME | BRO | BRO (Fast) | Cross Q |
| --- | --- | --- | --- | --- |
| Polyak weight | N/A | 0.005 | 0.005 | N/A |
| Update-to-data ratio | 2 | 10 | 2 | 2 |
| Discount | 0.99 | 0.99 | 0.99 | 0.99 |
| Batch size | 256 | 128 | 128 | 256 |
| Buffer size | 1e6 | 1e6 | 1e6 | 1e6 |
| H_target | 4 dim(A) | dim(A)/2 | dim(A)/2 | dim(A) |
| Critic hidden depth | 2 | BRONET | BRONET | 2 |
| Critic hidden size | 2048 | 512 | 512 | 2048 |
| Actor/Score depth | 3 | BRONET | BRONET | 3 |
| Actor/Score size | 256 | 256 | 256 | 256 |
| Num. bins/quantiles | 100 | 100 | 100 | N/A |
| Temp. learn. rate | 1e-3 | 3e-4 | 3e-4 | 3e-4 |
| Learn. rate critic | 3e-4 | 3e-4 | 3e-4 | 7e-4 |
| Learn. rate actor/score | 3e-4 | 3e-4 | 3e-4 | 7e-4 |
| Optimizer | Adam | AdamW | AdamW | Adam |
| Diffusion steps | 16 | N/A | N/A | N/A |
| Prior distr. | N(0, 2.5) | N/A | N/A | N/A |
| Exploration steps | 5000 | 2500 | 2500 | 5000 |

Table 3. Hyperparameters of DIME and Gaussian-based algorithms for all benchmark suites. Varying hyperparameters for different benchmark suites are described in the text.

DIME. For DIME, we use distributional Q, where the maximum and minimum values for the bins have been chosen per benchmark suite. We used vmin = -1600 and vmax = 1600 for the gym environments, vmin = -200 and vmax = 200 for the DMC suite, and vmin = -3600 and vmax = 3600 for the myo suite.

QSM. In certain environments, we observed that QSM with default hyperparameters performed poorly, particularly in several DMC tasks and the gym Ant-v3 task. To address this, we tuned the QSM hyperparameters for each of these underperforming tasks. For the DMC tasks, we found that QSM often requires an α value, representing the alignment factor between the score and the Q-function (Psenka et al., 2024), in the range of 100-200 rather than the default value of 50 reported in QSM's original implementation. In the Ant-v3 task, we determined that α needs to be set to 1. In the original implementation, the number of diffusion steps is set to 5; however, we found that using more steps, such as 10 and 15, can significantly improve the performance in these underperforming tasks.

Cross Q. We used the hyperparameters from the original paper (Bhatt et al., 2024) for the gym benchmark suite. However, we used a different set of hyperparameters for the DMC and myo suites for improved performance. More precisely, we increased the policy size to 3 layers with a hidden size of 256. Additionally, we reduced the learning rate to 7e-4.
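To illustrate how the distributional critic described in Appendix D fits together with the per-suite value ranges listed above, the sketch below builds the bin values from vmin/vmax, forms the entropy-regularized cross-entropy loss on given target probabilities, and computes the expected Q-value used for the actor update. The target probabilities are assumed to come from the categorical projection of the Bellman backup (Bellemare et al., 2017), which is omitted here; function and variable names are illustrative, not taken from the authors' code.

```python
import torch

def make_bins(vmin: float, vmax: float, num_bins: int = 100) -> torch.Tensor:
    """Evenly spaced bin values b_i, e.g. vmin=-200, vmax=200 for the DMC suite."""
    return torch.linspace(vmin, vmax, num_bins)

def critic_loss(logits: torch.Tensor, q_target: torch.Tensor, coef: float = 0.005):
    """Entropy-regularized cross-entropy loss from Appendix D:
    L(phi) = -sum_i q_target_i log q_phi_i - coef * sum_i q_phi_i log q_phi_i."""
    log_q = torch.log_softmax(logits, dim=-1)
    q = log_q.exp()
    loss = -(q_target * log_q).sum(-1) - coef * (q * log_q).sum(-1)
    return loss.mean()

def expected_q(logits: torch.Tensor, bins: torch.Tensor) -> torch.Tensor:
    """Expected Q-value used for the actor update: Q(s, a^0) = sum_i q_i b_i."""
    q = torch.softmax(logits, dim=-1)
    return (q * bins).sum(-1)
```

A usage pattern consistent with the text would average the expected Q of two critics (rather than taking the min) before backpropagating through the actor.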
F. General Diffusion Policies

DIME's maximum entropy reinforcement learning framework for training diffusion policies is not restricted to denoising diffusion policies but can be extended to general diffusion policies. This can be realized using the general bridges (GB) framework presented in (Richter & Berner). In this case, we can write the forward and backward processes as

$$\mathrm{d}a_t = \big[f(a_t, t) + \beta_t\, u(a_t, s, t)\big]\, \mathrm{d}t + \sqrt{2\beta_t}\, \mathrm{d}B_t, \qquad a_0 \sim \pi^{0}(\cdot|s), \quad (60)$$
$$\mathrm{d}a_t = \big[f(a_t, t) - \beta_t\, v(a_t, s, t)\big]\, \mathrm{d}t + \sqrt{2\beta_t}\, \mathrm{d}B_t, \qquad a_T \sim \mathcal{N}(0, I), \quad (61)$$

with the drift and control functions $f, u, v : \mathbb{R}^{d} \times [0, T] \to \mathbb{R}^{d}$, the diffusion coefficient $\beta : [0, T] \to \mathbb{R}^{+}$, standard Brownian motion $(B_t)_{t \in [0,T]}$, and some target policy $\pi^{0}$. Again, we denote the marginal density of the forward process at time $t$ as $\pi_t$ and the conditional density at time $t$ given time $l$ as $\pi_{t|l}$ for $t, l \in [0, T]$. The backward process starts from $\overleftarrow{\pi}_T = \mathcal{N}(0, I) \approx \pi_T$ and runs backward in time; we denote its density as $\overleftarrow{\pi}_t$. The respective discretizations using the Euler–Maruyama (EM) method (Särkkä & Solin, 2019) are given by

$$a_{n+1} = a_n + \big[f(a_n, n) + \beta_n u(a_n, s, n)\big]\, \delta + \epsilon_n, \quad (62)$$
$$a_{n-1} = a_n - \big[f(a_n, n) - \beta_n v(a_n, s, n)\big]\, \delta + \xi_n, \quad (63)$$

where $\epsilon_n, \xi_n \sim \mathcal{N}(0, 2\beta_n\delta I)$, with a constant discretization step size $\delta$ such that $N = T/\delta$ is an integer. We use the simplified notation $a_n$ instead of $a_{n\delta}$. The discretizations admit the joint distributions

$$\pi^{0:N}(a^{0:N}|s) = \pi^{0}(a^{0}|s) \prod_{n=0}^{N-1} \pi^{n+1|n}(a^{n+1}|a^{n}, s), \quad (64)$$
$$\overleftarrow{\pi}^{0:N}(a^{0:N}|s) = \mathcal{N}(a^{N}; 0, I) \prod_{n=1}^{N} \overleftarrow{\pi}^{n-1|n}(a^{n-1}|a^{n}, s), \quad (65)$$

with Gaussian kernels

$$\pi^{n+1|n}(a^{n+1}|a^{n}, s) = \mathcal{N}\big(a^{n+1} \,\big|\, a^{n} + [f(a^{n}, n) + \beta_n u(a^{n}, s, n)]\,\delta,\; 2\beta_n\delta I\big), \quad (66)$$
$$\overleftarrow{\pi}^{n-1|n}(a^{n-1}|a^{n}, s) = \mathcal{N}\big(a^{n-1} \,\big|\, a^{n} - [f(a^{n}, n) - \beta_n v(a^{n}, s, n)]\,\delta,\; 2\beta_n\delta I\big). \quad (67)$$

Following the same framework presented in the main text, we can now optimize the controls $u$ and $v$ using the same objective

$$\mathcal{L}(u, v) = D_{\mathrm{KL}}\big(\pi^{0:N}(a^{0:N}|s)\,\|\,\overleftarrow{\pi}^{0:N}(a^{0:N}|s)\big), \quad (68)$$

where the target policy at time step n = 0 is given as

$$\pi^{0}(a^{0}|s) = \frac{\exp Q(s, a^{0})}{Z^{\pi}(s)}. \quad (69)$$

In practice, we optimize the control functions u and v using parameterized neural networks. We have run preliminary experiments using the general bridges framework within the maximum entropy objective as suggested in our work. The learning curves can be seen in Fig. 6.

Figure 6. Preliminary results for the GB sampler (IQM mean return vs. number of environment interactions): (a) DIME and GB on dog-run and (b) DIME and GB on humanoid-run from DMC; (c) comparison of DIME and BRO on humanoid-run for 3 million steps.

Figure 7. Learned β parameters on the dog-run task (panels (a)–(d): β_0, β_10, β_20, β_30). DIME's policy improvement objective (Eq. 27) allows training various parameters end-to-end, such as the scaling for the diffusion coefficient β. More concretely, we train a scaling parameter β_k per dimension k that scales the cosine schedule. We visualize the adaptation of the parameter for the dimensions k = 0, 10, 20, 30 over the training, averaged over 10 seeds on the dog-run task. DIME first increases the parameter at the beginning of the training phase. Depending on the dimension, it either converges to a rather high value (k = 20 and k = 30) or keeps decreasing (k = 0 and k = 10).
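For concreteness, the sketch below simulates the EM-discretized backward process of Eq. 63 and evaluates the Gaussian kernel densities (Eqs. 66 and 67) that enter the KL objective in Eq. 68. The drift f, controls u and v, and the schedule β are passed in as plain callables and arrays; they are placeholders for the parameterized networks mentioned above, so this is an illustrative sketch under those assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(x, mean, var):
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

def backward_step(a_n, n, state, f, v, beta, delta):
    """One EM step of the backward process (Eq. 63) and its kernel density (Eq. 67)."""
    mean = a_n - (f(a_n, n) - beta * v(a_n, state, n)) * delta
    a_prev = mean + rng.normal(scale=np.sqrt(2 * beta * delta), size=a_n.shape)
    logp = gaussian_logpdf(a_prev, mean, 2 * beta * delta)
    return a_prev, logp

def forward_logpdf(a_next, a_n, n, state, f, u, beta, delta):
    """Density of the forward Gaussian kernel (Eq. 66) evaluated at a_next."""
    mean = a_n + (f(a_n, n) + beta * u(a_n, state, n)) * delta
    return gaussian_logpdf(a_next, mean, 2 * beta * delta)

# Example with trivial placeholder drift/controls (illustration only):
f = lambda a, n: -a                    # e.g. an Ornstein-Uhlenbeck-style drift
u = lambda a, s, n: np.zeros_like(a)   # control of the forward process
v = lambda a, s, n: np.zeros_like(a)   # control of the backward process

state, delta, betas = None, 0.05, 0.1 * np.ones(20)
a = rng.normal(size=4)                 # a^N ~ N(0, I)
for n in range(len(betas), 0, -1):     # run the backward chain from noise to action
    a, _ = backward_step(a, n, state, f, v, betas[n - 1], delta)
```

Accumulating the backward kernel log-densities along a simulated chain and the forward kernel log-densities of the same samples gives the per-chain log-ratio needed to estimate the KL objective.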
G. Additional Experiments

End-To-End Learning of β. We visualize the adaptation of the scaling of the diffusion coefficient β during learning on DMC's dog-run environment in Fig. 7.

Extended Analysis on Distributional Q-Learning. DIME employs distributional Q-learning (Bellemare et al., 2017) to represent the Q-function as a distribution over bins. We compare DIME to baselines both when using distributional Q-learning and when using the well-known Bellman residual (see Eq. 24) for updating the parameters of the Q-function. We start by comparing DIME with distributional Q-learning against diffusion-based baselines that also employ distributional Q-learning. Fig. 8a and Fig. 8b show the learning curves on Ant-v3 and Humanoid-v3, respectively, where we compare against DACER, a distributional-Q variant of Diff-QL, and Consistency-AC. DIME converges faster to the same performance as DACER on the Ant-v3 task and outperforms the baselines on the Humanoid-v3 task. In the setting without distributional Q-learning, i.e., when updating the parameters using the Bellman residual, DIME performs similarly to DIPO and QVPO on the Ant-v3 task and outperforms all baselines on the higher-dimensional Humanoid-v3 task (Fig. 8c and Fig. 8d). Additionally, we compare DIME with and without distributional Q-learning on the four dog environments from the DMC suite (Fig. 9), where we concentrate on the strong baselines BRO (Nauman et al., 2024) and Cross Q (Bhatt et al., 2024). BRO employs quantile distributional Q-learning, whereas Cross Q uses the Bellman residual loss function for updating the Q-function's parameters.

Figure 8. Comparison to diffusion baselines with ((a)-(b)) and without ((c)-(d)) distributional Q on the Ant-v3 and Humanoid-v3 tasks (IQM mean return vs. number of environment interactions). We provide the learning curves for distributional versions of Diff-QL and Consistency-AC alongside DACER, which employs distributional Q by default, on the Ant-v3 (a) and Humanoid-v3 (b) tasks. DIME converges faster on the Ant-v3 (a) task to the same performance achieved by DACER and outperforms all baselines on the higher-dimensional Humanoid-v3 (b) task. Additionally, we compare DIME without distributional Q against the diffusion baselines without distributional Q on the Ant-v3 (c) and Humanoid-v3 (d) tasks. DIME without distributional Q performs on par with the baselines DIPO and QVPO on the Ant-v3 (c) and outperforms all baselines on the Humanoid-v3 (d).

Figure 9. Ablation on distributional Q on the dog environments: (a) dog-run, (b) dog-trot, (c) dog-walk, (d) dog-stand (IQM mean return vs. number of environment interactions). Comparison of DIME and DIME without distributional Q (dashed line). While there is a small improvement when using distributional Q, DIME w/o distributional Q still performs on par with, or better than, BRO, which employs quantile distributional RL. DIME w/o distributional Q outperforms Cross Q and BRO (Fast).
In the main text, we have already observed that DIME with distributional Q performs favorably compared to the baselines. Fig. 9 shows a small improvement when using distributional Q. However, DIME without distributional Q (dashed line) still performs on par with, or better than, BRO and consistently performs better than BRO (Fast) and Cross Q. Please note that BRO and BRO (Fast) employ quantile distributional RL (Nauman et al., 2024).