# Risk-Sensitive Variational Actor-Critic: A Model-Based Approach

Published as a conference paper at ICLR 2025

Alonso Granados, Department of Computer Science, University of Arizona, Tucson, AZ, USA, alonsog@cs.arizona.edu
Reza Ebrahimi, School of Information Systems and Management, University of South Florida, Tampa, FL, USA, ebrahimim@usf.edu
Jason Pacheco, Department of Computer Science, University of Arizona, Tucson, AZ, USA, pachecoj@cs.arizona.edu

Risk-sensitive reinforcement learning (RL) with an entropic risk measure typically requires knowledge of the transition kernel or performs unstable updates w.r.t. exponential Bellman equations. As a consequence, algorithms that optimize this objective have been restricted to tabular or low-dimensional continuous environments. In this work we leverage the connection between the entropic risk measure and the RL-as-inference framework to develop a risk-sensitive variational actor-critic algorithm (rsVAC). Our work extends the variational framework to incorporate stochastic rewards and proposes a variational model-based actor-critic approach that modulates policy risk via a risk parameter. We consider both the risk-seeking and risk-averse regimes and present rsVAC learning variants for each setting. Our experiments demonstrate that this approach produces risk-sensitive policies and yields improvements in both tabular and risk-aware variants of complex continuous control tasks in MuJoCo.

1 INTRODUCTION

Deep reinforcement learning (RL) algorithms have contributed to many breakthroughs in domains such as games (Mnih et al., 2015) and robotics (Levine et al., 2016). However, the standard objective in RL, maximization of the expected sum of rewards, disregards the variability of the return due to the intrinsic uncertainty in the transition dynamics and the stochasticity of the rewards, which can lead to catastrophic behavior.
Such catastrophic behavior is especially common in real-world applications, such as autonomous driving agents acting dangerously to achieve high reward (Chia et al., 2022) or financial losses in portfolio management (Lai et al., 2011). As a consequence, risk-aware agents are important to adapt to inherent environmental risk.

Many risk measures have been studied to introduce risk-sensitivity into RL algorithms, including value at risk (VaR) (Chow et al., 2018), conditional value at risk (CVaR) (Chow & Ghavamzadeh, 2014; Greenberg et al., 2022), mean-variance (Tamar et al., 2012; La & Ghavamzadeh, 2013), the reward-volatility risk measure (Zhang et al., 2021) and Gini-deviation (Luo et al., 2024). In this work we focus on the entropic risk measure, an approach that incorporates risk into its objective via the exponential utility function (Howard & Matheson, 1972; Borkar, 2002). Directly optimizing this objective is challenging: it requires knowledge of the transition kernel or it depends on unstable updates w.r.t. exponential Bellman equations (Noorani et al., 2023).

We address these challenges by exploiting the connection between RL and probabilistic inference to obtain a surrogate objective to the entropic risk measure. RL-as-inference algorithms search for policies that maximize the probability of optimal trajectories rather than maximizing the expected return. However, it has been observed that this objective can produce unwanted risk-seeking behaviour in the learned policy (Levine, 2018; O'Donoghue et al., 2019; Tarbouriech et al., 2023). Many such methods take a variational approach that constrains the posterior dynamics to equal those of the true environment (Haarnoja et al., 2017; 2018), but can lead to overly stochastic policies (Fellows et al., 2019).
Existing variational model-based methods allow posterior dynamics to vary (Chow et al., 2018) but lead to risk-seeking policies that do not adapt to aleatoric risk in the environment (Eysenbach et al., 2022). Furthermore, these methods assume a deterministic reward model that implicitly ignores its risk contribution to the original objective.

In this work, we leverage the connection between RL and probabilistic inference to formulate a variational lower bound on the entropic risk measure that can be optimized using only experience from an agent. We optimize this surrogate objective using an EM-style algorithm that consists of learning variational dynamics and reward models that account for intrinsic uncertainty in the environment (E-step) and improving the objective w.r.t. a policy (M-step). Our comprehensive approach permits learning of both risk-seeking and risk-averse policies, the latter of which has been mostly ignored in the RL-as-inference literature. Our formulation also adapts to risk induced by stochastic rewards, a further extension of the RL-as-inference literature, which assumes deterministic rewards. Furthermore, we demonstrate the robustness of our method relative to other risk-aware algorithms in risk-sensitive variants of MuJoCo tasks. Code is available at https://github.com/AlonsoGranados/rsVAC/.

2 PRELIMINARIES: RISK-SENSITIVE REINFORCEMENT LEARNING

The RL framework consists of a Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, p, \mathcal{R})$. $\mathcal{S}$, $\mathcal{A}$ and $\mathcal{R}$ are the state, action and reward spaces, respectively. The transition probability over the next state $s_{t+1} \in \mathcal{S}$ given the current state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$ is denoted as $p(s_{t+1} \mid s_t, a_t)$, and the initial state distribution as $p(s_1)$. A policy $\pi$ specifies a probability distribution over actions given a current state $s_t$. The reward $r_t \in \mathcal{R}$ is treated as a random variable with distribution $p(r_t \mid s_t, a_t)$.
The distribution over trajectories $\tau = (s_1, a_1, r_1, s_2, a_2, \ldots, s_T, a_T, r_T, s_{T+1})$ for a sampling policy $\pi$ is given by $p_\pi(\tau) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, p(r_t \mid s_t, a_t)\, \pi(a_t \mid s_t)$. The standard objective in RL is to find a policy that maximizes expected return: $\pi^* = \arg\max_\pi \mathbb{E}_{p_\pi(\tau)}\left[\sum_{t=1}^{T} r_t\right]$.

2.1 ENTROPIC RISK MEASURE

In risk-sensitive RL with the entropic risk measure the goal is to find a policy that maximizes:

$$\max_\pi \; \beta \log \mathbb{E}_{p_\pi(\tau)}\left[\exp\left(\frac{\sum_t r_t}{\beta}\right)\right] \tag{1}$$

for risk parameter $\beta \in \mathbb{R}$. This objective is closely related to mean-variance RL (Mannor & Tsitsiklis, 2011) given that a Taylor expansion of Eq. (1) yields $\mathbb{E}_{p_\pi(\tau)}\left[\sum_t r_t\right] + \frac{1}{2\beta} \mathrm{Var}_\pi\left(\sum_t r_t\right) + O\left(\frac{1}{\beta^2}\right)$ (Mihatsch & Neuneier, 2002; García & Fernández, 2015). The parameter $\beta$ controls the risk-sensitivity of the objective, producing risk-seeking policies for $\beta > 0$ and risk-averse policies for $\beta < 0$. Additionally, it reduces to the standard (risk-neutral) RL objective when $|\beta| \to \infty$. Based on this framework, we define the soft value functions as the cumulative rewards under the entropic risk:

$$V_\pi(s) = \log \mathbb{E}_{p_\pi}\left[\exp\left(\frac{\sum_t r_t}{\beta}\right) \,\middle|\, s_1 = s\right], \qquad Q_\pi(s, a) = \log \mathbb{E}_{p_\pi}\left[\exp\left(\frac{\sum_t r_t}{\beta}\right) \,\middle|\, s_1 = s, a_1 = a\right]. \tag{2}$$

These functions are recursively associated via Bellman-style backup equations:

$$V_\pi(s_t) = \log \mathbb{E}_{\pi(\cdot \mid s_t)}\left[\exp\left(Q_\pi(s_t, a_t)\right)\right], \qquad Q_\pi(s_t, a_t) = \log \mathbb{E}_{p(\cdot \mid s_t, a_t)}\left[\exp\left(\frac{r_t}{\beta} + V_\pi(s_{t+1})\right)\right]. \tag{3}$$

These are known as soft value functions, due to the presence of operators $\log \mathbb{E}[\exp(\cdot)]$ that act as soft approximations to $\max(\cdot)$. Finally, we have the Bellman optimality equations,

$$V^*(s_t) = \max_{a_t} Q^*(s_t, a_t), \qquad Q^*(s_t, a_t) = \log \mathbb{E}_{p(\cdot \mid s_t, a_t)}\left[\exp\left(\frac{r_t}{\beta} + V^*(s_{t+1})\right)\right]. \tag{4}$$

Although these value functions can be estimated using dynamic programming, they require knowledge of the transition dynamics and the reward model to compute the expectations, since unbiased sample-based estimates are not available due to the nonlinear log operation. Fig. 1 illustrates the impact of the risk-sensitivity parameter $\beta$ on the optimal policy in a simple three-arm MDP, and its effects in modulating risk-seeking and risk-averse policies.
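As a small worked illustration of Eq. (1), the sketch below evaluates the entropic risk $\beta \log \mathbb{E}[\exp(r/\beta)]$ for a single-step problem using the three-arm rewards given in the Fig. 1 caption. The function names (`entropic_risk`, `best_arm`) and the dictionary layout are illustrative assumptions, not part of the paper's code.

```python
import math

# Reward distributions of the three-arm MDP in Fig. 1: lists of (reward, probability).
ARMS = {
    "left": [(0.0, 0.5), (4.0, 0.5)],
    "down": [(1.0, 1.0)],
    "right": [(0.0, 0.9), (10.0, 0.1)],
}


def entropic_risk(outcomes, beta):
    """Return beta * log E[exp(r / beta)] for a discrete reward distribution."""
    exponents = [r / beta for r, _ in outcomes]
    shift = max(exponents)  # log-sum-exp shift for numerical stability
    total = sum(p * math.exp(e - shift) for e, (_, p) in zip(exponents, outcomes))
    return beta * (shift + math.log(total))


def best_arm(beta):
    """Arm that maximizes the entropic risk objective for a given beta."""
    return max(ARMS, key=lambda name: entropic_risk(ARMS[name], beta))


for beta in (1.0, 100.0, -100.0, -1.0):
    values = {name: round(entropic_risk(o, beta), 3) for name, o in ARMS.items()}
    print(f"beta={beta:>7}: best={best_arm(beta):<5} values={values}")
```

In this toy setting a small positive $\beta$ selects the high-variance arm, a small negative $\beta$ selects the deterministic arm, and a large $|\beta|$ selects the arm with the highest mean, which matches the qualitative behavior described for Fig. 1.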
We emphasize that a risk-neutral policy is recovered for large $|\beta|$ values, while small $|\beta|$ values produce risk-seeking/averse policies.

Figure 1: Three arms environment. Left: MDP with three actions (left, down and right) and initial state S. Action right produces reward 0 and 10 with probability 0.9 and 0.1, respectively. Action left produces reward 0 and 4 with probability 0.5 and 0.5, respectively. Finally, action down has a deterministic reward of 1. A risk-neutral agent would prefer action left, which has the highest mean return, while risk-seeking and risk-averse agents would prefer actions right and down, respectively. Middle: Soft-Q values as a function of $\beta$ for $\beta > 0$. Observe that for small $\beta$ the framework learns a risky policy (red region) while for large $\beta$ it recovers an optimal risk-neutral policy (green region). Right: Soft-Q values as a function of $\beta$ for $\beta < 0$. Now the agent learns a risk-averse policy when $|\beta|$ is small (red region) while it recovers the neutral policy for larger $|\beta|$ (green region).

2.2 RISK-SENSITIVE VARIATIONAL BOUND

In this work we leverage the well-established connection between RL and probabilistic inference (Levine, 2018). Under this formulation we incorporate the rewards into a probabilistic model by introducing a set of binary auxiliary variables $O_t \in \{0, 1\}$ that are independently distributed at each time as $p(O_t = 1 \mid r_t) \propto \exp\left(\frac{r_t}{\beta}\right)$. The event $O_t = 1$ can loosely be interpreted as the agent having acted optimally at time $t$.$^1$ An important motivation for using this interpretation is that we can define a surrogate objective for the entropic risk measure via the evidence lower bound (ELBO) on the log-marginal likelihood:

$$\log p_\pi(O_{1:T}) = \log \mathbb{E}_{p_\pi(\tau)}\left[\exp\left(\frac{\sum_t r_t}{\beta}\right)\right] \geq \mathbb{E}_{q(\tau)}\left[\frac{\sum_t r_t}{\beta}\right] - \mathrm{KL}\left(q(\tau) \,\|\, p_\pi(\tau)\right) := J_\beta(q, \pi), \tag{5}$$

where the LHS is shorthand for the marginal likelihood of an optimal trajectory: $\log p_\pi(O_{1:T} = 1)$.
The log-marginal is equivalent to the entropic risk measure up to the multiplicative constant $\beta$, which controls risk-sensitivity, and is bounded below by the RHS. The bound in Eq. (5) arises from the application of Jensen's inequality, where $q(\tau)$ is a variational distribution over trajectories. This bound is tight when the variational distribution equals the posterior over trajectories, $q(\tau) = p(\tau \mid O_{1:T} = 1)$, almost everywhere. A variety of expectation-maximization (EM) style algorithms have been proposed to optimize $J_\beta$ by alternating improvements w.r.t. $q$ and $\pi$ (Abdolmaleki et al., 2018b; Peters et al., 2010; Levine & Koltun, 2013; Chow et al., 2021).

3 VARIATIONAL MODEL / POLICY ITERATION

Using $J_\beta(q, \pi)$ as a surrogate objective for the entropic risk measure, we now propose two algorithms that approximately optimize it for both the risk-seeking ($\beta > 0$) and risk-averse ($\beta < 0$) settings. We consider variational distributions over trajectories of the form,

$$q_\pi(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, q_r(r_t \mid s_t, a_t)\, q_d(s_{t+1} \mid s_t, a_t). \tag{6}$$

Note that we incorporate a variational posterior distribution $q_r$ over rewards. This stochastic reward model is an extension of existing RL-as-inference methods that are constrained to deterministic rewards. Expanding the KL regularizer of Eq. (5) yields the variational objective,

$$J_\beta(q, \pi) = \mathbb{E}_{q_\pi(\tau)}\left[\sum_t \left(\frac{r_t}{\beta} - \log \frac{q_d(s_{t+1} \mid s_t, a_t)}{p(s_{t+1} \mid s_t, a_t)} - \log \frac{q_r(r_t \mid s_t, a_t)}{p(r_t \mid s_t, a_t)}\right)\right]. \tag{7}$$

$^1$ The optimality interpretation is a loose one stemming from the reward at time $t$, which increases the probability of $O_t = 1$ exponentially. This interpretation has become standard in the literature (Levine, 2018).

To optimize Eq. (7) we consider an EM-style algorithm where the E-step maximizes $J_\beta$ w.r.t. $q$ and the M-step optimizes w.r.t. $\pi$. Risk-sensitivity arises in Eq. (7) from the maximization w.r.t. the variational distribution $q$.
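The tightness of the bound in Eq. (5) can be checked numerically in a one-step, single-action setting, where the posterior over rewards is proportional to $p(r)\exp(r/\beta)$. The following is a minimal sketch under that simplification; the reward model `p` and all function names are hypothetical and chosen only for illustration.

```python
import math


def log_marginal(p, beta):
    """log E_p[exp(r / beta)] for a discrete reward distribution p = {reward: prob}."""
    return math.log(sum(prob * math.exp(r / beta) for r, prob in p.items()))


def elbo(q, p, beta):
    """E_q[r / beta] - KL(q || p), the one-step analogue of the bound in Eq. (5)."""
    expected = sum(qr * r / beta for r, qr in q.items())
    kl = sum(qr * math.log(qr / p[r]) for r, qr in q.items() if qr > 0.0)
    return expected - kl


def tilted_posterior(p, beta):
    """Posterior q(r) proportional to p(r) * exp(r / beta)."""
    weights = {r: prob * math.exp(r / beta) for r, prob in p.items()}
    z = sum(weights.values())
    return {r: w / z for r, w in weights.items()}


def mean(dist):
    """Expected reward under a discrete distribution."""
    return sum(r * prob for r, prob in dist.items())


p = {0.0: 0.9, 10.0: 0.1}  # hypothetical reward model
for beta in (5.0, -5.0):
    q = tilted_posterior(p, beta)
    print(beta, log_marginal(p, beta), elbo(q, p, beta), elbo(p, p, beta), mean(q))
```

For both signs of $\beta$ the bound evaluated at the tilted posterior matches the log-marginal, while the untilted choice $q = p$ gives a strictly smaller value. The tilted distribution shifts probability toward high rewards when $\beta > 0$ and toward low rewards when $\beta < 0$.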
Although the penalty discourages deviations from the true model, the agent is willing to pay this penalty if the increase in expected return is large enough to compensate for this extra cost. When $\beta > 0$, the variational model becomes optimistic (risk-seeking) as it aims to increase the expected return. When $\beta < 0$, it becomes pessimistic (risk-averse), as it instead aims to increase the expected cost. Finally, we recover the true model when $|\beta| \to \infty$, as the objective only suffers the deviation penalty.

For a fixed policy $\pi$ we denote the optimal variational distribution as $q^*_\pi = \arg\max_q J_\beta(q, \pi)$. Directly maximizing $J_\beta$ can be computationally expensive as it requires optimizing $q$ over the full trajectory. Instead, we consider a Bellman-like operator $\mathcal{T}_\pi$ as a partial optimization over $q$ for a single transition, where $V$ is a state-value function:

$$\mathcal{T}_\pi[V](s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\max_{q_r \in \Delta_{\mathcal{R}}} \mathbb{E}_{r \sim q_r}\left[\frac{r}{\beta} - \log \frac{q_r(r \mid s, a)}{p(r \mid s, a)}\right] + \max_{q_d \in \Delta_{\mathcal{S}}} \mathbb{E}_{s' \sim q_d}\left[V(s') - \log \frac{q_d(s' \mid s, a)}{p(s' \mid s, a)}\right]\right]. \tag{8}$$

In particular, we have the following theorem for the operator $\mathcal{T}_\pi[V](s)$ (see Appendix for proof):

Theorem 1. Repeated application of $\mathcal{T}_\pi$ to any value function $V$ such that $V(s_{T+1}) = 0$ converges to the optimal value function $V^*_k$ for all $k$, where:

$$V^*_k(s_k) = \mathbb{E}_{q^*_\pi(\tau)}\left[\sum_{t=k}^{T} \left(\frac{r_t}{\beta} - \log \frac{q^*_d(s_{t+1} \mid s_t, a_t)}{p(s_{t+1} \mid s_t, a_t)} - \log \frac{q^*_r(r_t \mid s_t, a_t)}{p(r_t \mid s_t, a_t)}\right)\right]. \tag{9}$$

Hence, we can obtain the optimal value function by iteratively applying $\mathcal{T}_\pi$ to some initial value function $V_0$. We can recover the optimal variational distributions using the optimal value function:

Theorem 2. Let $q^*_r$ and $q^*_d$ be the solution of $\arg\max_q J_\beta(q, \pi)$. Then

$$q^*_r(r \mid s, a) \propto p(r \mid s, a) \exp\left(\frac{r}{\beta}\right), \qquad q^*_d(s' \mid s, a) \propto p(s' \mid s, a) \exp\left(V^*_\pi(s')\right). \tag{10}$$

All proofs can be found in the Appendix. We optimize $J_\beta(q^*, \pi)$ w.r.t. the policy $\pi$ using the variational distribution $q^*_\pi$ from the E-step:

$$\pi^* = \arg\max_\pi \mathbb{E}_{q^*_\pi(\tau)}\left[\sum_t \underbrace{r_t - \beta \log \frac{q^*_d(s_{t+1} \mid s_t, a_t)}{p(s_{t+1} \mid s_t, a_t)} - \beta \log \frac{q^*_r(r_t \mid s_t, a_t)}{p(r_t \mid s_t, a_t)}}_{:= \hat{r}_t}\right]. \tag{11}$$

Observe that Eq.
(11) is equivalent to learning the optimal policy for a standard RL problem with transition dynamics $q^*$ and augmented rewards $\hat{r}_t$, so any RL algorithm can be used for the M-step. Although the expectation can now be estimated using easily obtained samples from $q^*$, we still need to evaluate the true dynamics and reward model in the augmented reward, and these might be unknown. In the following section we address this by providing an algorithm that can be optimized using off-policy data. Finally, we note that in the risk-averse setting Eq. (11) corresponds to a minimization w.r.t. the policy: $\arg\min_\pi J_\beta(q^*, \pi)$. See Appendix B for an extended discussion of optimization in the risk-averse regime.

4 RSVAC: RISK SENSITIVE VARIATIONAL ACTOR-CRITIC

We now present a practical RL algorithm that approximately optimizes $J_\beta(q, \pi)$ using only experience collected by the agent. We make three design choices to approximate this objective: first, we learn parameterized probabilistic networks, $p_\theta(s_{t+1} \mid s_t, a_t)$ and $p_\theta(r_t \mid s_t, a_t)$, for the unknown transition dynamics and reward model; next, we represent the variational distributions using probabilistic networks, $q_\phi(s_{t+1} \mid s_t, a_t)$ and $q_\phi(r_t \mid s_t, a_t)$, and approximate the maximization operation in the E-step with stochastic gradient descent; finally, we use an actor-critic architecture with function approximators to learn the optimal value function and policy from the M-step.

4.1 VARIATIONAL REWARD AND DYNAMICS MODEL OPTIMIZATION

We model the reward and dynamics as Gaussian distributions with mean and covariance given by neural networks and train them with stochastic gradient descent to minimize cross-entropy, i.e., to maximize the log-likelihoods:

$$J_r(\theta) = \mathbb{E}_{(s_t, a_t, r_t) \sim \mathcal{D}_{\text{env}}}\left[\log p_\theta(r_t \mid s_t, a_t)\right], \qquad J_d(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}_{\text{env}}}\left[\log p_\theta(s_{t+1} \mid s_t, a_t)\right], \tag{12}$$

where $\mathcal{D}_{\text{env}}$ is an experience replay buffer that stores previously seen interactions with the environment.
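Once the models are Gaussian as in Eq. (12), the augmented reward $\hat{r}_t$ of Eq. (11) only requires log-density ratios. The sketch below shows this computation for scalar rewards and states; it assumes the variational models are also Gaussian, and the `(mean, std)` tuples standing in for network outputs, as well as the function names, are hypothetical.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def augmented_reward(r, s_next, q_r, p_r, q_d, p_d, beta):
    """r_hat = r - beta * log(q_d / p_d) - beta * log(q_r / p_r), as in Eq. (11).

    Each model is a hypothetical (mean, std) pair, already conditioned on (s_t, a_t).
    """
    log_ratio_d = gaussian_log_prob(s_next, *q_d) - gaussian_log_prob(s_next, *p_d)
    log_ratio_r = gaussian_log_prob(r, *q_r) - gaussian_log_prob(r, *p_r)
    return r - beta * log_ratio_d - beta * log_ratio_r


# Variational reward model shifted upward relative to the learned reward model;
# variational and learned dynamics agree, so only the reward term is penalized.
r_hat = augmented_reward(r=1.5, s_next=0.3, q_r=(1.5, 1.0), p_r=(1.0, 1.0),
                         q_d=(0.0, 1.0), p_d=(0.0, 1.0), beta=2.0)
print(r_hat)
```

When the variational models coincide with the learned models, both log-ratios vanish and the augmented reward reduces to the raw reward.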
We similarly parameterize the variational models using Gaussian distributions. To learn the variational reward model we approximate the optimization w.r.t. $q_\phi(r_t \mid s_t, a_t)$ in Eq. (8) by maximizing:

$$J_r(\phi) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}_{\text{env}},\, r_t \sim q_\phi(r_t \mid s_t, a_t)}\left[\frac{r_t}{\beta} - \log \frac{q_\phi(r_t \mid s_t, a_t)}{p_\theta(r_t \mid s_t, a_t)}\right] \tag{13}$$

w.r.t. its parameters $\phi$. In particular, we use the reparameterization trick to obtain a lower variance estimator that can be optimized using stochastic gradient ascent,

$$J_r(\phi) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}_{\text{env}},\, \epsilon \sim \mathcal{N}}\left[\frac{f_\phi(\epsilon; s_t, a_t)}{\beta} - \log \frac{q_\phi(f_\phi(\epsilon; s_t, a_t) \mid s_t, a_t)}{p_\theta(f_\phi(\epsilon; s_t, a_t) \mid s_t, a_t)}\right], \tag{14}$$

where $f_\phi(\epsilon; s_t, a_t)$ is the reparameterized reward model and $\epsilon$ is a noise vector sampled from a spherical Gaussian distribution. Similarly, we learn variational dynamics by approximating the optimization w.r.t. $q_\phi(s_{t+1} \mid s_t, a_t)$ in Eq. (8) with:

$$J_d(\phi) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}_{\text{env}},\, \epsilon \sim \mathcal{N}}\left[V_\psi(g_\phi(\epsilon; s_t, a_t)) - \log \frac{q_\phi(g_\phi(\epsilon; s_t, a_t) \mid s_t, a_t)}{p_\theta(g_\phi(\epsilon; s_t, a_t) \mid s_t, a_t)}\right], \tag{15}$$

where again we use the reparameterization trick to reparameterize the dynamics model, $s_{t+1} = g_\phi(\epsilon; s_t, a_t)$, and substitute the optimal value function with a critic $V_\psi$ that can be differentiated, so Eq. (15) can be optimized using stochastic gradient ascent.

4.2 ACTOR-CRITIC OPTIMIZATION

We now present an actor-critic algorithm to optimize the M-step. As previously stated, the optimization in Eq. (11) is equivalent to the RL problem that has transition dynamics $q_\phi(s_{t+1} \mid s_t, a_t)$, reward model $q_\phi(r_t \mid s_t, a_t)$, and augmented reward $\hat{r}_t = r_t - \beta \log \frac{q_\phi(s_{t+1} \mid s_t, a_t)}{p_\theta(s_{t+1} \mid s_t, a_t)} - \beta \log \frac{q_\phi(r_t \mid s_t, a_t)}{p_\theta(r_t \mid s_t, a_t)}$. We collect transitions from the variational model using branched rollouts (Janner et al., 2019), i.e., we sample states from $\mathcal{D}_{\text{env}}$, which were visited under the true dynamics, and run the policy under $q_\phi$ to generate new transitions, which we store in the model replay buffer $\mathcal{D}_{\text{model}}$. We approximate the critic using a neural network $Q_\psi(s_t, a_t)$ which we train by minimizing the squared TD-error:

$$J_Q(\psi) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}_{\text{model}}}\left[\left(Q_\psi(s_t, a_t) - \hat{r}_t - V_{\bar{\psi}}(s_{t+1})\right)^2\right], \tag{16}$$

using stochastic gradient descent and samples from the model replay buffer.
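To make the reparameterized reward objective of Eq. (14) concrete, the sketch below estimates it by Monte Carlo for a single fixed $(s_t, a_t)$ with scalar Gaussian models and compares it with the closed form $\mathbb{E}_q[r]/\beta - \mathrm{KL}(q \,\|\, p)$ available for two Gaussians. The scalar setting, the parameter values, and the function names are assumptions made for illustration; the paper's models are neural networks trained by stochastic gradient ascent.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def reward_objective_mc(q_mean, q_std, p_mean, p_std, beta, n=20000, seed=0):
    """Monte Carlo estimate of Eq. (14) with r = q_mean + q_std * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        r = q_mean + q_std * rng.gauss(0.0, 1.0)  # reparameterized sample
        log_ratio = (gaussian_log_prob(r, q_mean, q_std)
                     - gaussian_log_prob(r, p_mean, p_std))
        total += r / beta - log_ratio
    return total / n


def reward_objective_exact(q_mean, q_std, p_mean, p_std, beta):
    """Closed form for two 1-D Gaussians: E_q[r] / beta - KL(q || p)."""
    kl = (math.log(p_std / q_std)
          + (q_std ** 2 + (q_mean - p_mean) ** 2) / (2.0 * p_std ** 2) - 0.5)
    return q_mean / beta - kl


p_mean, p_std, beta = 1.0, 2.0, -4.0
best_mean = p_mean + p_std ** 2 / beta  # stationary point of the closed form

print(reward_objective_exact(best_mean, p_std, p_mean, p_std, beta))
print(reward_objective_mc(best_mean, p_std, p_mean, p_std, beta))
print(reward_objective_exact(p_mean, p_std, p_mean, p_std, beta))
```

In this Gaussian case the maximizer keeps the variance of the learned model and shifts the mean by $\sigma^2/\beta$, which agrees with the exponential tilt $q^*_r \propto p \exp(r/\beta)$ of Theorem 2: the variational reward model is shifted downward for $\beta < 0$ and upward for $\beta > 0$.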
The optimal state-value function $V_{\bar{\psi}}(s_{t+1})$ is implicitly represented by $\mathbb{E}_{a_{t+1} \sim \pi_\theta(a_{t+1} \mid s_{t+1})}\left[Q_{\bar{\psi}}(s_{t+1}, a_{t+1})\right]$, where $Q_{\bar{\psi}}$ is a target critic network that we update using an exponential moving average of the $Q_\psi$ weights (Lillicrap et al., 2015). We approximate this expectation using a single action sample from the policy $\pi_\theta$. The policy $\pi_\theta$ is a Gaussian distribution parameterized with neural networks and is trained to maximize $Q_\psi$ with an added entropy regularizer to improve exploration during learning (Haarnoja et al., 2018):

$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D}_{\text{env}},\, \epsilon \sim \mathcal{N}}\left[Q_\psi(s_t, f_\theta(\epsilon; s_t)) - \log \pi_\theta(f_\theta(\epsilon; s_t) \mid s_t)\right], \tag{17}$$

where we have reparameterized the policy $a_t = f_\theta(\epsilon; s_t)$ and $\epsilon$ is a noise vector sampled from a spherical Gaussian distribution. Again we learn these parameters using stochastic gradient descent. One benefit of rsVAC is its flexibility: any actor-critic method can be incorporated into the framework as long as the rewards and samples come from the variational model. Pseudocode for the rsVAC algorithm can be found in Appendix F.

5 RELATED WORK

Entropic risk. The risk-sensitive objective with entropic risk measure was first described by the seminal work of Howard & Matheson (1972). This work has inspired many methods in a variety of settings (Borkar, 2001; 2002; 2010; Borkar & Meyn, 2002; Coraluppi & Marcus, 1999; Di Masi & Stettner, 1999; Fleming & McEneaney, 1995; Hernández-Hernández & Marcus, 1996; Huang & Haskell, 2020). However, these algorithms are constrained to simple environments as they require knowledge of the transition dynamics or assume access to a simulator of the environment.
In the setting with unknown transition dynamics, TD(0) and Q-learning-style algorithms have been proposed by applying an exponential transformation to the risk-sensitive objective, but estimating these value functions can lead to instabilities when introducing function approximators (Bäuerle & Rieder, 2014; Borkar, 2002; Fei et al., 2021b;a; Mihatsch & Neuneier, 2002; Noorani et al., 2023).

RL-as-inference. Probabilistic inference methods for solving RL can be traced back to the Kalman duality in linear-quadratic systems (Kalman, 1960) and later to linearly solvable MDPs (Todorov, 2006). The variational framework can be formulated as searching for maximum likelihood policies on an augmented MDP with exponentiated rewards treated as probabilities and is equivalent to the risk-sensitive objective (Todorov, 2008; Levine & Koltun, 2013; Levine, 2018). These approaches learn risk-seeking policies as they only consider $\beta > 0$. Model-free variational approaches such as MaxEnt RL combat this behavior by removing the controller's ability to modify the variational dynamics, resulting in high-entropy policies. This penalization of determinism has been effective in some high-dimensional tasks (O'Donoghue et al., 2016; Nachum et al., 2017; Haarnoja et al., 2018; Lee et al., 2020), but has been shown to lead to undesirable behavior (Fellows et al., 2019). Closely related KL-regularized RL methods include a proximal operator on the policy (Peters et al., 2010; Schulman et al., 2015; Chebotar et al., 2017; Noorani & Baras, 2021). More generally, EM-style algorithms jointly optimize their variational and prior policies (Peters & Schaal, 2007; Neumann et al., 2011; Abdolmaleki et al., 2018b;a). Our variational formulation is most similar to the one used in VMBPO (Chow et al., 2021), an EM-style algorithm that also learns variational dynamics. However, their approach only considers risk-seeking policies and deterministic rewards.

Connections to other methods.
The role of the risk parameter $\beta$ is to limit the disagreement between the variational and true environment dynamics and reward model through the KL penalty. β-VAE (Higgins et al., 2016) studies a similar objective to ours, albeit in the different context of representation learning, where $\beta$ limits the capacity of the variational distribution to learn disentangled representations. In Bayesian RL, a similar risk parameter balances the exploration-exploitation tradeoff by modulating epistemic uncertainty (O'Donoghue et al., 2019; O'Donoghue & Lattimore, 2021; O'Donoghue, 2023).

6 EXPERIMENTS

We evaluate the ability of rsVAC to learn risk-sensitive policies in a variety of risky environments. First, we consider a risky variation of the tabular environment discussed in Eysenbach et al. (2022), where exact inference is possible and equality to the entropic risk can be achieved. We next evaluate the inclusion of function approximators in a continuous 2D environment with stochastic transition dynamics where the goal is to land proximal to the environment boundary without crossing over. Finally, we compare rsVAC to risk-sensitive baseline methods in variations of three challenging MuJoCo environments that incorporate risk in the manner introduced by Luo et al. (2024). In all cases we find that rsVAC capably learns risk-sensitive policies in both the risk-averse and risk-seeking regimes while simultaneously achieving high reward.

6.1 RISK IN TABULAR ENVIRONMENTS

Our motivation for using the tabular setting is that we can study the risk-sensitive behavior of our algorithm without introducing function approximation error. We consider a risky variant of the gridworld presented in Eysenbach et al. (2022). In this environment the goal of the agent, which starts from the top-left corner, is to reach the star goal state (see Fig. 2a). We modify the original environment to incorporate aleatoric risk by including a cliff region (gray squares in Fig. 2a).
Falling into the cliff incurs a large negative reward and a transition to the initial state. The agent can choose from four actions (up, left, down and right), which can result in a transition in the chosen direction or a random move to one of the four directions with equal probability.

Figure 2: Risky tabular setting. Panels: (a) Environment, (b) Risk-neutral, (c) Risk-seeking, (d) Risk-averse. (a) The modified grid environment with cliff region given by gray states. We show the dynamics in the grid for the action right at each state, where we represent a transition probability between two states as a vector with its magnitude proportional to the probability. (b, c, d) To demonstrate the risk preferences of our algorithm, we sample 1000 episodes for three policies, Q-learning (risk-neutral), $\beta = 1$ (risk-seeking) and $\beta = -0.5$ (risk-averse), and compute histograms for the count of visited states.

Figure 3: Stochastic cliff performance. Panels: (a) Performance curves, (b) Convergence for $\beta > 0$, (c) Convergence for $\beta < 0$. (a) We compare the expected return for 5 independent runs for different algorithms. rsVAC for $\beta > 0$ performs comparably to both Q-learning and VMBPO. (b, c) We show that dual optimization eventually converges to the same optimal value for different initial settings of $\beta$.

We demonstrate that rsVAC can produce risk-sensitive policies by training the surrogate objective for the values of $\beta = 1$ and $\beta = -0.5$, along with the risk-neutral policy. We compare the risk preferences of these policies by computing a histogram over states for 1000 trajectories. From these trajectories, we observe that the risk-seeking policy (Fig. 2c) takes the shortest path to the goal, but in the process occasionally falls into the cliff. In contrast, the risk-averse policy (Fig. 2d) entirely avoids the cliff region, resulting in longer trajectories. Finally, the risk-neutral policy takes a middle-of-the-road approach between the two previous policies (Fig.
2b) where it rarely falls into the cliff but on average takes longer to reach the final state than the risk-seeking policy.

We now compare the average return performance of rsVAC when including the dual optimization w.r.t. $\beta$ discussed in Appendix C. As comparison baselines we consider Q-learning and VMBPO (Chow et al., 2021), a model-based algorithm that also learns variational dynamics but is restricted to the risk-seeking setting. The performance curves in Fig. 3a show that rsVAC ($\beta > 0$) is as efficient as, and performs as well as or better than, VMBPO and Q-learning. Figs. 3b and 3c demonstrate the robustness of our dual optimization over $\beta$, which converges to the same value regardless of initialization.

6.2 STOCHASTIC CONTINUOUS 2D ENVIRONMENT

We verify that our algorithm can learn risk-sensitive policies with function approximators in a stochastic continuous environment. An agent begins in the middle of a 2D space and the goal is to navigate as near to the lower left or right corners as possible without crossing the side edges. The agent observes its 2D coordinates $(x, y)$, chooses a direction $a$ ($\|a\|_2 = 1$), and moves in that direction with noise sampled from $\mathcal{N}(a, 0.5^2 I)$. An episode terminates when the agent leaves the square given by $\{(x, y) : |x| \leq 7, |y| \leq 7\}$. The agent receives $-0.1$ reward at every step with an additional positive reward proportional to its X-position, $(100/7)\,|x|$, when it exits the square through the bottom, or $-100$ reward if it leaves through either the left or right side of the square.

Figure 4: Trajectories for stochastic 2D environment. Panels: (a) Risk-Neutral ($\beta = 100$), (b) Risk-Seeking ($\beta = 2$), (c) Risk-Averse ($\beta = -2$). We illustrate the learned policy by sampling 10 trajectories for different $\beta$ values. (a) Policies trained with large $\beta$ magnitude tend to be risk-neutral. (b) Policies trained with small positive $\beta$ are risk-seeking and try to reach high reward, ignoring the possibility of hitting the side wall. (c) Policies trained with negative $\beta$ stay in the center part of the square.

Figure 5: Exit regions for stochastic 2D environment. Panels: low-risk, medium-risk, high-risk; x-axis: environment steps (1k). We define 3 regions depending on the agent's X-position when an episode ends: low-risk ($|x| < 2.8$), medium-risk ($2.8 \leq |x| < 5.6$), and high-risk ($5.6 \leq |x| < 7$). We calculate the percentage of episodes ending in each region as a function of environment steps for different $\beta$ values.

In Fig. 4, we visualize trajectories for the different learned policies on the true environment. Observe that small positive $\beta$ values tend to produce risk-seeking policies where the agent aims to get as much reward as possible (close to the walls) while ignoring the likelihood of hitting the sides of the square. Policies trained with negative $\beta$ are risk-avoiding and stay in the center region. We also calculate the percentage of episodes that terminate in different risk regions (Fig. 5), which demonstrates how $\beta$ interpolates between different regions. We designate low-risk (left) as far from the walls, medium-risk (center), and high-risk (right) as near the wall. We find that policies with negative $\beta$ primarily terminate in the low-risk region, whereas policies with small positive $\beta$ primarily terminate in the high-risk region, and the risk-neutral policy ($\beta = 100$) terminates in the intermediate and, to a lesser extent, high-risk regions. We additionally include visualizations of the learned variational dynamics in the Appendix (Fig. 7).

6.3 SIMULATED ROBOTIC BENCHMARKS

We use the MuJoCo physics engine (Todorov et al., 2012) in Gymnasium (Towers et al., 2023) to evaluate our method on three continuous tasks (InvertedPendulum, HalfCheetah, and Swimmer).
We follow the modifications made to the reward function in Luo et al. (2024) to produce risky regions in these environments based on the X-position of the agent. An additional stochastic reward is sampled from $\mathcal{N}(0, 10^2)$ if X-position > 0.01 in InvertedPendulum, > 3 in HalfCheetah, or > 0.5 in Swimmer. Hence, we can test whether an agent is risk-seeking or risk-averse by calculating the percentage of time spent in the region of stochastic reward.

We compare our method to Mean Gini deviation (MG) (Luo et al., 2024), a policy gradient algorithm that optimizes the Gini deviation as an alternative risk measure and outperforms other mean-variance algorithms; mean-variance policy iteration (MVPI) (Zhang et al., 2021), a highly flexible algorithm that optimizes the reward-volatility risk measure with the primary goal of reducing the performance gap between risk-neutral and risk-averse algorithms; and exponential TD (expTD) (Noorani et al., 2023), an actor-critic algorithm that optimizes the entropic risk measure by using a critic that estimates the exponentiated return. To achieve a fair comparison, we implement every actor-critic algorithm on top of TD3 (Fujimoto et al., 2018).

Figure 6: Risk-Averse MuJoCo. Recovered panel titles: InvertedPendulum-v4, HalfCheetah-v4 (top row); X > 0.01 Rate (InvertedPendulum-v4), X < -3 Rate (HalfCheetah-v4), X > 0.5 Rate (bottom row); x-axis: environment steps (10k); legend: MG, MVPI, expTD, rsVAC-SAC, rsVAC-TD3. Top row: Average return on risky MuJoCo benchmarks. Bottom row: Percentage of steps in an episode spent in risky regions. The solid curves correspond to the mean and shaded regions to one standard deviation over 10 random trials.
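The risky-region reward modification and the risky-region rate described at the start of this subsection can be sketched as follows. This is a minimal, environment-agnostic illustration rather than the authors' wrapper: the function names and the `in_risky_region` predicate are hypothetical, and the region test is passed in as a function so that each task can supply its own threshold.

```python
import random


def risky_reward(base_reward, x_position, in_risky_region, rng, noise_std=10.0):
    """Add zero-mean Gaussian noise to the reward while the agent is in the risky region."""
    if in_risky_region(x_position):
        return base_reward + rng.gauss(0.0, noise_std)
    return base_reward


def risky_rate(x_positions, in_risky_region):
    """Fraction of episode steps spent in the risky region."""
    if not x_positions:
        return 0.0
    return sum(1 for x in x_positions if in_risky_region(x)) / len(x_positions)


def swimmer_region(x):
    """Hypothetical predicate using the Swimmer threshold quoted in the text."""
    return x > 0.5


rng = random.Random(0)
print(risky_reward(1.0, 0.0, swimmer_region, rng))   # outside the region: unchanged
print(risky_reward(1.0, 0.8, swimmer_region, rng))   # inside the region: noisy
print(risky_rate([0.0, 0.6, 0.7, 0.2], swimmer_region))
```

Because the added noise has zero mean, the modification leaves the expected return unchanged and only increases its variance, so the fraction of steps spent in the region serves as a direct indicator of a policy's risk preference.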
For MG we follow the authors' implementation, which uses a PPO-style policy gradient to maximize the expected return (Schulman et al., 2017). For consistency we use the same network architectures across all algorithms. We also update the policy at each environment step for all algorithms, with the exception of MG, which requires the collection of 10 episodes before updating its model. See the Appendix for additional configuration details.

In Fig. 6, we report the total return (top row) and percentage of timesteps visiting the noisy region (bottom row) for each algorithm in a risk-averse configuration. We perform 10 runs of each algorithm with different random seeds and report the average and standard deviation every 10k environment steps. Agents are tested for 20 episodes per evaluation. For rsVAC, we include both a version where the actor-critic is given by TD3 and another given by SAC (Haarnoja et al., 2018). The results show that rsVAC is effective at learning the stochasticity of the environment while producing better policies in terms of learning speed and final performance. MVPI also learns risk-averse policies in all three domains, but results in lower overall mean returns. We also perform an ablation analysis for the parameter $\beta$ to demonstrate that our algorithm can learn both risk-seeking and risk-averse policies (Appendix Fig. 8).

7 CONCLUSION

In this work, we leveraged the connection between RL and probabilistic inference to formulate a surrogate objective on the entropic risk measure. We proposed an EM-style algorithm that consists of learning variational dynamics and reward models that account for aleatoric uncertainty in the environment (E-step) and improving the objective w.r.t. a policy (M-step). Finally, we proposed a practical algorithm (rsVAC) that permits learning of risk-seeking and risk-averse policies from experience replay alone. Our evaluations demonstrate that rsVAC is effective in learning risk-sensitive policies in several challenging environments.
In particular, we show that, compared to baseline risk-sensitive methods, rsVAC performs capably in risky variants of challenging MuJoCo environments, and in all cases yields superior return.

8 ACKNOWLEDGMENTS

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-22-1-0194.

REFERENCES

Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, and Martin Riedmiller. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018b.
Nicole Bäuerle and Ulrich Rieder. More risk-sensitive markov decision processes. Mathematics of Operations Research, 39(1):105–120, 2014.
Dimitri Bertsekas. Dynamic programming and optimal control 4th edition, volume 1. Athena scientific, 2020.
Dimitri P Bertsekas. Dynamic programming and optimal control 4th edition, volume 2. Athena scientific, 2012.
Vivek S Borkar. A sensitivity formula for risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 44(5):339–346, 2001.
Vivek S Borkar. Q-learning for risk-sensitive control. Mathematics of operations research, 27(2):294–311, 2002.
Vivek S Borkar. Learning algorithms for risk-sensitive control. In Proceedings of the 19th International Symposium on Mathematical Theory of Networks and Systems MTNS, volume 5, 2010.
Vivek S Borkar and Sean P Meyn. Risk-sensitive optimal control for markov decision processes with monotone cost. Mathematics of Operations Research, 27(1):192–209, 2002.
Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine.
Path integral guided policy search. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3381–3388. IEEE, 2017.
Wei Ming Dan Chia, Sye Loong Keoh, Cindy Goh, and Christopher Johnson. Risk assessment methodologies for autonomous driving: A survey. IEEE transactions on intelligent transportation systems, 23(10):16923–16939, 2022.
Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for cvar optimization in mdps. Advances in neural information processing systems, 27, 2014.
Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–51, 2018.
Yinlam Chow, Brandon Cui, Moonkyung Ryu, and Mohammad Ghavamzadeh. Variational model-based policy optimization. In Zhi-Hua Zhou (ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 2292–2299. International Joint Conferences on Artificial Intelligence Organization, 8 2021. doi: 10.24963/ijcai.2021/316. URL https://doi.org/10.24963/ijcai.2021/316. Main Track.
Stefano P Coraluppi and Steven I Marcus. Risk-sensitive, minimax, and mixed risk-neutral/minimax control of Markov decision processes. Springer, 1999.
Giovanni B Di Masi and Lukasz Stettner. Risk-sensitive control of discrete-time markov processes with infinite horizon. SIAM Journal on Control and Optimization, 38(1):61–78, 1999.
Benjamin Eysenbach, Alexander Khazatsky, Sergey Levine, and Russ R Salakhutdinov. Mismatched no more: Joint model-policy optimization for model-based rl. Advances in Neural Information Processing Systems, 35:23230–23243, 2022.
Yingjie Fei, Zhuoran Yang, Yudong Chen, and Zhaoran Wang. Exponential bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in neural information processing systems, 34:20436–20446, 2021a.
Yingjie Fei, Zhuoran Yang, and Zhaoran Wang.
Risk-sensitive reinforcement learning with function approximation: A debiasing approach. In International Conference on Machine Learning, pp. 3198–3207. PMLR, 2021b.
Matthew Fellows, Anuj Mahajan, Tim GJ Rudner, and Shimon Whiteson. Virel: A variational inference framework for reinforcement learning. Advances in neural information processing systems, 32, 2019.
Wendell H Fleming and William M McEneaney. Risk-sensitive control on an infinite time horizon. SIAM Journal on Control and Optimization, 33(6):1881–1915, 1995.
Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. PMLR, 2018.
Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
Ido Greenberg, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Efficient risk-averse reinforcement learning. Advances in Neural Information Processing Systems, 35:32639–32652, 2022.
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pp. 1352–1361. PMLR, 2017.
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
Matthias Heger. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994, pp. 105–111. Elsevier, 1994.
Daniel Hernández-Hernández and Steven I Marcus. Risk sensitive control of markov processes in countable state space. Systems & control letters, 29(3):147–155, 1996.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner.
beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2016.
Ronald A Howard and James E Matheson. Risk-sensitive markov decision processes. Management science, 18(7):356–369, 1972.
Wenjie Huang and William B Haskell. Stochastic approximation for risk-aware markov decision processes. IEEE Transactions on Automatic Control, 66(3):1314–1320, 2020.
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32, 2019.
Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. In ASME Transactions journal of basic engineering, pp. 82(1):35–45, 1960.
Prashanth La and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive mdps. Advances in neural information processing systems, 26, 2013.
Tze Leung Lai, Haipeng Xing, and Zehao Chen. Mean-variance portfolio optimization when means and covariances are unknown. The Annals of Applied Statistics, pp. 798–823, 2011.
Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020.
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. Advances in neural information processing systems, 26, 2013.
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra.
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Yudong Luo, Guiliang Liu, Pascal Poupart, and Yangchen Pan. An alternative to variance: Gini deviation for risk-averse policy gradient. Advances in Neural Information Processing Systems, 36, 2024.
Shie Mannor and John N Tsitsiklis. Mean-variance optimization in markov decision processes. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 177–184, 2011.
Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine learning, 49:267–290, 2002.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. Advances in neural information processing systems, 30, 2017.
Gerhard Neumann et al. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pp. 817–824, 2011.
Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
Erfaun Noorani and John S Baras. Risk-sensitive reinforcement learning and robust learning for control. In 2021 60th IEEE Conference on Decision and Control (CDC), pp. 2976–2981. IEEE, 2021.
Erfaun Noorani, Christos N Mavridis, and John S Baras. Exponential td learning: A risk-sensitive actor-critic reinforcement learning algorithm. In 2023 American Control Conference (ACC), pp. 4104–4109. IEEE, 2023.
Brendan O'Donoghue and Tor Lattimore. Variational bayesian optimistic sampling. Advances in Neural Information Processing Systems, 34:12507–12519, 2021.
Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and q-learning. In International Conference on Learning Representations, 2016.
Brendan O'Donoghue, Ian Osband, and Catalin Ionescu. Making sense of reinforcement learning and probabilistic inference. In International Conference on Learning Representations, 2019.
Takayuki Osogami. Robustness and risk-sensitivity in markov decision processes. Advances in neural information processing systems, 25, 2012.
Brendan O'Donoghue. Efficient exploration via epistemic-risk-seeking policy optimization. International Conference on Machine Learning, pp. 26382–26402, 2023.
Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pp. 745–750, 2007.
Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, pp. 1607–1612, 2010.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. PMLR, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy gradients with variance related risk criteria. In Proceedings of the twenty-ninth international conference on machine learning, pp. 387–396, 2012.
Jean Tarbouriech, Tor Lattimore, and Brendan O'Donoghue. Probabilistic inference in reinforcement learning done right. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Emanuel Todorov. Linearly-solvable markov decision problems. Advances in neural information processing systems, 19, 2006.
Emanuel Todorov.
General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pp. 4286–4292. IEEE, 2008.
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026–5033. IEEE, 2012.
Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium, March 2023. URL https://zenodo.org/record/8127025.
Shangtong Zhang, Bo Liu, and Shimon Whiteson. Mean-variance policy iteration for risk-averse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10905–10913, 2021.

A OPERATOR $\mathcal{T}_\pi$ PROOFS

For our formulation, we consider a finite-horizon problem for which we maximize the following objective over variational distributions $q^t_d(s_{t+1}\mid s_t,a_t)$ and $q^t_r(r_t\mid s_t,a_t)$:

$$V^*_1(s_1) = \max_{q^1_r,q^1_d,\ldots,q^T_r,q^T_d} \mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=1}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right].$$

Let $\{q^{1*}_r, q^{1*}_d, \ldots, q^{T*}_r, q^{T*}_d\}$ be an optimal set of variational distributions for this problem. By the principle of optimality, we have that the truncated set of variational distributions $\{q^{k*}_r, q^{k*}_d, \ldots, q^{T*}_r, q^{T*}_d\}$ is optimal for the subproblem where we start at $s_k$:

$$V^*_k(s_k) = \max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d} \mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right].$$

Lemma 1.
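Define the recursive value functions as:

$$V_T(s_T) = \mathbb{E}_{a_T\sim\pi}\left[\max_{q^T_r}\mathbb{E}_{q^T_r}\left[\frac{r_T}{\beta} - \log\frac{q^T_r(r_T\mid s_T,a_T)}{p^T_r(r_T\mid s_T,a_T)}\right] + \max_{q^T_d}\mathbb{E}_{q^T_d}\left[-\log\frac{q^T_d(s_{T+1}\mid s_T,a_T)}{p^T_d(s_{T+1}\mid s_T,a_T)}\right]\right],$$

$$V_k(s_k) = \mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r}\mathbb{E}_{q^k_r}\left[\frac{r_k}{\beta} - \log\frac{q^k_r(r_k\mid s_k,a_k)}{p^k_r(r_k\mid s_k,a_k)}\right] + \max_{q^k_d}\mathbb{E}_{q^k_d}\left[V_{k+1}(s_{k+1}) - \log\frac{q^k_d(s_{k+1}\mid s_k,a_k)}{p^k_d(s_{k+1}\mid s_k,a_k)}\right]\right], \quad k = 1,\ldots,T-1.$$

Then we have that $V_k(s_k) = V^*_k(s_k)$.

Proof. This analysis proceeds similarly to the proof of Prop. 1.3.1 of Bertsekas (2020). We will show by induction that the functions $V_k$ are equal to the optimal value functions $V^*_k$. For $k = T$, we have that

$$V^*_T(s_T) = \max_{q^T_r,q^T_d}\mathbb{E}_{q^\pi(\tau)}\left[\frac{r_T}{\beta} - \log\frac{q^T_r(r_T\mid s_T,a_T)}{p^T_r(r_T\mid s_T,a_T)} - \log\frac{q^T_d(s_{T+1}\mid s_T,a_T)}{p^T_d(s_{T+1}\mid s_T,a_T)}\right]$$
$$= \mathbb{E}_{a_T\sim\pi}\left[\max_{q^T_r}\mathbb{E}_{q^T_r}\left[\frac{r_T}{\beta} - \log\frac{q^T_r(r_T\mid s_T,a_T)}{p^T_r(r_T\mid s_T,a_T)}\right] + \max_{q^T_d}\mathbb{E}_{q^T_d}\left[-\log\frac{q^T_d(s_{T+1}\mid s_T,a_T)}{p^T_d(s_{T+1}\mid s_T,a_T)}\right]\right] = V_T(s_T),$$

where the max and expectation operators commute by the principle of optimality. Now, let us assume that for some $k$ and all $s_{k+1}$, we have $V_{k+1}(s_{k+1}) = V^*_{k+1}(s_{k+1})$. Then

$$V^*_k(s_k) = \max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]$$
$$= \mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r}\mathbb{E}_{q^k_r}\left[\frac{r_k}{\beta} - \log\frac{q^k_r(r_k\mid s_k,a_k)}{p^k_r(r_k\mid s_k,a_k)}\right] + \max_{q^k_d}\mathbb{E}_{q^k_d}\left[\max_{q^{k+1}_r,\ldots,q^T_d}\mathbb{E}_{q^\pi}\left[\sum_{t=k+1}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right] - \log\frac{q^k_d(s_{k+1}\mid s_k,a_k)}{p^k_d(s_{k+1}\mid s_k,a_k)}\right]\right]$$
$$= \mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r}\mathbb{E}_{q^k_r}\left[\frac{r_k}{\beta} - \log\frac{q^k_r(r_k\mid s_k,a_k)}{p^k_r(r_k\mid s_k,a_k)}\right] + \max_{q^k_d}\mathbb{E}_{q^k_d}\left[V^*_{k+1}(s_{k+1}) - \log\frac{q^k_d(s_{k+1}\mid s_k,a_k)}{p^k_d(s_{k+1}\mid s_k,a_k)}\right]\right]$$
$$= \mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r}\mathbb{E}_{q^k_r}\left[\frac{r_k}{\beta} - \log\frac{q^k_r(r_k\mid s_k,a_k)}{p^k_r(r_k\mid s_k,a_k)}\right] + \max_{q^k_d}\mathbb{E}_{q^k_d}\left[V_{k+1}(s_{k+1}) - \log\frac{q^k_d(s_{k+1}\mid s_k,a_k)}{p^k_d(s_{k+1}\mid s_k,a_k)}\right]\right],$$

where we obtain the second equality by moving the max operator inside the expectation using the principle of optimality. For the third equality, we use the definition of $V^*_{k+1}$, and for the fourth equality we use the induction hypothesis. This completes our induction, and we have $V^*_k(s_k) = V_k(s_k)$ for all $k$.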
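The step in the proof above that moves the max inside the expectation can be illustrated on a toy finite problem. The sketch below is only an analogy under stated assumptions: the random variable `x` stands in for the sampled action or state, each "choice" stands in for a conditional variational distribution that may be picked separately per outcome, and all probabilities and payoffs are made-up numbers. Because the decision rule can be chosen independently for every outcome, maximizing over whole rules and then averaging gives the same value as averaging the pointwise maxima.

```python
from itertools import product

# Toy setting (all numbers are made up): 3 outcomes of a random variable x
# with probabilities p, and 2 candidate choices per outcome with payoff f[x][c].
p = [0.2, 0.5, 0.3]
f = [[1.0, 4.0], [3.0, 2.0], [0.5, 0.7]]

# E[max_c f(x, c)]: optimize separately for every outcome, then average.
expect_of_max = sum(p[x] * max(f[x]) for x in range(3))

# max over decision rules c(x) of E[f(x, c(x))]: enumerate all 2^3 rules.
max_of_expect = max(
    sum(p[x] * f[x][rule[x]] for x in range(3))
    for rule in product(range(2), repeat=3)
)

# The two quantities agree because the rule is free to depend on x.
gap = abs(expect_of_max - max_of_expect)
```

If the choice were forced to be the same for every outcome (a constant rule), the maximum of the expectation could be strictly smaller, which is why the per-state, per-action structure of the variational distributions matters for the interchange.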
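The ability to commute expectation and maximization operators, used in the previous proof, deserves additional discussion. Formal proofs of this result in the general DP setting can be found in Appendix A of Bertsekas (2012). We include a discussion specific to our setting below, beginning with the assumption that each maximization step is finite:

$$Q_k(s_k,a_k) = \max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right].$$

Hence, for every $\epsilon > 0$ there exists a set of variational distributions $\{q^{k,\epsilon}_r, \ldots, q^{T,\epsilon}_d\}$ that satisfies

$$\mathbb{E}_{q^\pi_\epsilon(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^{t,\epsilon}_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^{t,\epsilon}_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right] \geq Q_k(s_k,a_k) - \epsilon.$$

Then we have that

$$\max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{a_k\sim\pi}\left[\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right]$$
$$\geq \mathbb{E}_{a_k\sim\pi}\left[\mathbb{E}_{q^\pi_\epsilon(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^{t,\epsilon}_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^{t,\epsilon}_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right]$$
$$\geq \mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right] - \epsilon.$$

Since $\epsilon > 0$ is arbitrary, it follows that

$$\max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{a_k\sim\pi}\left[\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right]$$
$$\geq \mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right].$$

On the other hand, we have that

$$\mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right]$$
$$\geq \mathbb{E}_{a_k\sim\pi}\left[\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right]$$

for all sets of variational distributions $\{q^k_r, \ldots, q^T_d\}$.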
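So the inequality holds when taking the maximum,

$$\mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right]$$
$$\geq \max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{a_k\sim\pi}\left[\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right].$$

Combining these two results we obtain the sought equality:

$$\mathbb{E}_{a_k\sim\pi}\left[\max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right]$$
$$= \max_{q^k_r,q^k_d,\ldots,q^T_r,q^T_d}\mathbb{E}_{a_k\sim\pi}\left[\mathbb{E}_{q^\pi(\tau)}\left[\sum_{t=k}^{T}\left(\frac{r_t}{\beta} - \log\frac{q^t_d(s_{t+1}\mid s_t,a_t)}{p^t_d(s_{t+1}\mid s_t,a_t)} - \log\frac{q^t_r(r_t\mid s_t,a_t)}{p^t_r(r_t\mid s_t,a_t)}\right)\right]\right].$$

We now present our main result for application of the Bellman-style operator.

Theorem 1. Repeated application of $\mathcal{T}_\pi$ to any value function $V$ such that $V(s_{T+1}) = 0$ converges to the optimal value function $V^*_k$ for all $k$.

Proof. We demonstrate this by induction. After one application of $\mathcal{T}_\pi$ we have that

$$\mathcal{T}_\pi[V](s_T) = \mathbb{E}_\pi\left[\max_{q^T_r}\mathbb{E}_{q^T_r}\left[\frac{r_T}{\beta} - \log\frac{q^T_r(r_T\mid s_T,a_T)}{p^T_r(r_T\mid s_T,a_T)}\right] + \max_{q^T_d}\mathbb{E}_{q^T_d}\left[V(s_{T+1}) - \log\frac{q^T_d(s_{T+1}\mid s_T,a_T)}{p^T_d(s_{T+1}\mid s_T,a_T)}\right]\right]$$
$$= \mathbb{E}_\pi\left[\max_{q^T_r}\mathbb{E}_{q^T_r}\left[\frac{r_T}{\beta} - \log\frac{q^T_r(r_T\mid s_T,a_T)}{p^T_r(r_T\mid s_T,a_T)}\right] + \max_{q^T_d}\mathbb{E}_{q^T_d}\left[-\log\frac{q^T_d(s_{T+1}\mid s_T,a_T)}{p^T_d(s_{T+1}\mid s_T,a_T)}\right]\right] = V_T(s_T),$$

where the second equality uses the fact that $V(s_{T+1}) = 0$. Now, let us assume that for some $k$ and all $s_{k+1}$, we have $V(s_{k+1}) = V_{k+1}(s_{k+1})$. Then

$$\mathcal{T}_\pi[V](s_k) = \mathbb{E}_\pi\left[\max_{q^k_r}\mathbb{E}_{q^k_r}\left[\frac{r_k}{\beta} - \log\frac{q^k_r(r_k\mid s_k,a_k)}{p^k_r(r_k\mid s_k,a_k)}\right] + \max_{q^k_d}\mathbb{E}_{q^k_d}\left[V(s_{k+1}) - \log\frac{q^k_d(s_{k+1}\mid s_k,a_k)}{p^k_d(s_{k+1}\mid s_k,a_k)}\right]\right]$$
$$= \mathbb{E}_\pi\left[\max_{q^k_r}\mathbb{E}_{q^k_r}\left[\frac{r_k}{\beta} - \log\frac{q^k_r(r_k\mid s_k,a_k)}{p^k_r(r_k\mid s_k,a_k)}\right] + \max_{q^k_d}\mathbb{E}_{q^k_d}\left[V_{k+1}(s_{k+1}) - \log\frac{q^k_d(s_{k+1}\mid s_k,a_k)}{p^k_d(s_{k+1}\mid s_k,a_k)}\right]\right],$$

where the second equality uses the induction hypothesis. This shows that after $k$ successive applications of $\mathcal{T}_\pi$ we recover the optimal value function $V_k$.

Lemma 2.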
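For any state $s \in \mathcal{S}$, action $a \in \mathcal{A}$ and reward distribution $p(r\mid s,a)$, we have

$$\max_{q_r\in\Delta_{\mathcal{R}}}\mathbb{E}_{r\sim q_r(r\mid s,a)}\left[\frac{r}{\beta} - \log\frac{q_r(r\mid s,a)}{p(r\mid s,a)}\right] = \log\mathbb{E}_{r\sim p(r\mid s,a)}\left[\exp\left(\frac{r}{\beta}\right)\right]. \tag{20}$$

Analogously, for any state $s \in \mathcal{S}$, action $a \in \mathcal{A}$, value function $V$ and dynamics distribution $p(s'\mid s,a)$, we have

$$\max_{q_d\in\Delta_{\mathcal{S}}}\mathbb{E}_{s'\sim q_d(s'\mid s,a)}\left[V(s') - \log\frac{q_d(s'\mid s,a)}{p(s'\mid s,a)}\right] = \log\mathbb{E}_{s'\sim p(s'\mid s,a)}\left[\exp(V(s'))\right]. \tag{21}$$

Proof. For Eq. 20, we have that

$$\max_{q_r\in\Delta_{\mathcal{R}}}\mathbb{E}_{r\sim q_r(r\mid s,a)}\left[\frac{r}{\beta} - \log\frac{q_r(r\mid s,a)}{p(r\mid s,a)}\right] = \max_{q_r\in\Delta_{\mathcal{R}}}\mathbb{E}_{r\sim q_r(r\mid s,a)}\left[\frac{r}{\beta} + \log p(r\mid s,a) - \log q_r(r\mid s,a)\right]$$
$$= \log\sum_{r}\exp\left(\frac{r}{\beta} + \log p(r\mid s,a)\right) = \log\mathbb{E}_{r\sim p(r\mid s,a)}\left[\exp\left(\frac{r}{\beta}\right)\right],$$

where the second equality follows from Lemma 4 in Nachum et al. (2017). Eq. 21 follows from Lemma 5 in Chow et al. (2021).

Lemma 3. The operator $\mathcal{T}_\pi$ is monotonic.

Proof. Let $V, W : \mathcal{S} \to \mathbb{R}$ be functions such that $V(s) \leq W(s)$ for all $s \in \mathcal{S}$. We show that $\mathcal{T}_\pi[V](s) \leq \mathcal{T}_\pi[W](s)$ for all $s \in \mathcal{S}$. From Lemma 2, we have that

$$\mathcal{T}_\pi[V](s) = \mathbb{E}_\pi\left[\log\mathbb{E}_{r\sim p}\left[\exp\left(\frac{r}{\beta}\right)\right] + \log\mathbb{E}_{s'\sim p}\left[\exp(V(s'))\right]\right] = \mathbb{E}_\pi\left[\log\mathbb{E}_{s',r\sim p}\left[\exp\left(\frac{r}{\beta} + V(s')\right)\right]\right],$$

where the second equality uses the fact that $p(s',r\mid s,a) = p(s'\mid s,a)\,p(r\mid s,a)$. Therefore,

$$\mathcal{T}_\pi[V](s) = \mathbb{E}_\pi\left[\log\mathbb{E}_{s',r\sim p}\left[\exp\left(\frac{r}{\beta} + V(s')\right)\right]\right] \leq \mathbb{E}_\pi\left[\log\mathbb{E}_{s',r\sim p}\left[\exp\left(\frac{r}{\beta} + W(s')\right)\right]\right] = \mathcal{T}_\pi[W](s),$$

where we use the monotonicity of the exp, expectation and log operations.

Before proving Theorem 2, we prove the following lemma:

Lemma 4. Let $q^*_r$ be the solution of Eq. 20. Then

$$q^*_r(r\mid s,a) \propto p(r\mid s,a)\exp\left(\frac{r}{\beta}\right). \tag{22}$$

Analogously, let $q^V_d$ be the solution of Eq. 21. Then

$$q^V_d(s'\mid s,a) \propto p(s'\mid s,a)\exp(V(s')). \tag{23}$$

Proof. For Eq. 22, we have that

$$q^*_r(r\mid s,a) = \frac{\exp\left(\frac{r}{\beta} + \log p(r\mid s,a)\right)}{\sum_{r}\exp\left(\frac{r}{\beta} + \log p(r\mid s,a)\right)} \propto p(r\mid s,a)\exp\left(\frac{r}{\beta}\right),$$

where the first equality follows from Corollary 6 in Nachum et al. (2017). Eq. 23 follows from Lemma 3 in Chow et al. (2021).

Theorem 2. Let $q^*_r$ and $q^*_d$ be the solution of $\arg\max_q J_\beta(q,\pi)$. Then

$$q^*_r(r\mid s,a) \propto p(r\mid s,a)\exp\left(\frac{r}{\beta}\right), \tag{24}$$
$$q^*_d(s'\mid s,a) \propto p(s'\mid s,a)\exp\left(V^\pi_*(s')\right). \tag{25}$$

Proof. We have that

$$J_\beta(q^*,\pi) = V^\pi_*(s) = \max_{q^\pi}\mathbb{E}_{q^\pi}\left[\frac{r}{\beta} - \log\frac{q_r(r\mid s,a)}{p(r\mid s,a)} + V^\pi_*(s') - \log\frac{q_d(s'\mid s,a)}{p(s'\mid s,a)}\right]$$
$$= \mathbb{E}_\pi\left[\max_{q_r\in\Delta_{\mathcal{R}}}\mathbb{E}_{r\sim q_r}\left[\frac{r}{\beta} - \log\frac{q_r(r\mid s,a)}{p(r\mid s,a)}\right] + \max_{q_d\in\Delta_{\mathcal{S}}}\mathbb{E}_{s'\sim q_d}\left[V^\pi_*(s') - \log\frac{q_d(s'\mid s,a)}{p(s'\mid s,a)}\right]\right],$$

where the second equality comes from the definition of $V^\pi_*$.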
Using Lemma 4, we conclude that

$$q^*_r(r\mid s,a) \propto p(r\mid s,a)\exp\left(\frac{r}{\beta}\right), \qquad q^*_d(s'\mid s,a) \propto p(s'\mid s,a)\exp\left(V^\pi_*(s')\right).$$

B RISK-AVERSE M-STEP

In this section, we derive Eq. (11) for the two cases: β > 0 and β < 0. For β > 0, the M-step is

$$\arg\max_\pi J_\beta(q^*,\pi) = \arg\max_\pi \mathbb{E}_{q^{*,\pi}(\tau)}\left[\sum_t\left(\frac{r_t}{\beta} - \log\frac{q^*_d(s_{t+1}\mid s_t,a_t)}{p(s_{t+1}\mid s_t,a_t)} - \log\frac{q^*_r(r_t\mid s_t,a_t)}{p(r_t\mid s_t,a_t)}\right)\right]$$
$$= \arg\max_\pi \mathbb{E}_{q^{*,\pi}(\tau)}\left[\sum_t\left(r_t - \beta\log\frac{q^*_d(s_{t+1}\mid s_t,a_t)}{p(s_{t+1}\mid s_t,a_t)} - \beta\log\frac{q^*_r(r_t\mid s_t,a_t)}{p(r_t\mid s_t,a_t)}\right)\right],$$

where the second equality comes from multiplying by β. For β < 0, the M-step is

$$\arg\min_\pi J_\beta(q^*,\pi) = \arg\min_\pi \mathbb{E}_{q^{*,\pi}(\tau)}\left[\sum_t\left(\frac{r_t}{\beta} - \log\frac{q^*_d(s_{t+1}\mid s_t,a_t)}{p(s_{t+1}\mid s_t,a_t)} - \log\frac{q^*_r(r_t\mid s_t,a_t)}{p(r_t\mid s_t,a_t)}\right)\right]$$
$$= \arg\max_\pi \mathbb{E}_{q^{*,\pi}(\tau)}\left[\sum_t\left(r_t - \beta\log\frac{q^*_d(s_{t+1}\mid s_t,a_t)}{p(s_{t+1}\mid s_t,a_t)} - \beta\log\frac{q^*_r(r_t\mid s_t,a_t)}{p(r_t\mid s_t,a_t)}\right)\right],$$

where the arg min becomes arg max in the second equality due to multiplying through by negative β. Hence, for both cases the M-step is equivalent to Eq. (11). This shows that for β < 0, the overall optimization is equivalent to the saddle-point problem $\arg\min_\pi \arg\max_q J_\beta(q,\pi)$. Thus, for the risk-averse setting we do not have monotonic improvement on the entropic objective. Nonetheless, it serves as an approximation to the unconstrained objective when $q(\tau) = p(\tau \mid \mathcal{O}_{1:T} = 1)$, for which we have equality to the entropic objective (Osogami, 2012). This surrogate objective is also related to Robust MDPs (Nilim & El Ghaoui, 2005) by treating the maximization of q as the worst choice for an uncertain set, i.e. the set of variational distributions of the form $q^\pi(\tau)$. Additionally, the objective is equivalent to the Minimax Criterion under inherent uncertainty (García & Fernández, 2015; Heger, 1994) when β → 0.

C DUAL OPTIMIZATION

Choosing a suitable β can be a deciding factor between learning risk-sensitive policies or divergence in practice.
Hence, we propose a Lagrangian formulation that automatically tunes the risk-sensitive parameter β for both the risk-seeking (β > 0) and risk-averse (β < 0) settings. For β > 0, we observe that the maximization of $J_\beta(q,\pi)$ w.r.t. q reflects the Lagrangian of the following constrained optimization,

$$\max_q \mathbb{E}_{q^\pi(\tau)}\left[\sum_t r_t\right] \quad \text{s.t.} \quad \mathrm{KL}(q^\pi(\tau)\,\|\,p^\pi(\tau)) \leq \epsilon, \tag{26}$$

where ε sets a hard constraint on the allowable divergence of the distribution $q^\pi(\tau)$. We recognize β as a Lagrange multiplier and perform dual gradient descent (Boyd & Vandenberghe, 2004) via the loss function:

$$J(\beta) = \beta\left(\epsilon - \mathrm{KL}(q(s_{t+1}, r_t \mid s_t, a_t)\,\|\,p(s_{t+1}, r_t \mid s_t, a_t))\right). \tag{27}$$

Although the constraint in the primal problem of Eq. (26) suggests optimizing the dual parameter β w.r.t. the entire-trajectory divergence $\mathrm{KL}(q^\pi(\tau)\,\|\,p^\pi(\tau))$, this can lead to high variance for long trajectories. We instead impose the constraint only at single transitions, which yields more stable learning. This approach most closely aligns with SAC (Haarnoja et al., 2018), which introduces a dual relaxation to modulate its policy entropy. For β < 0, we now consider the following primal problem,

$$\max_q \mathbb{E}_{q^\pi(\tau)}\left[\sum_t c_t\right] \quad \text{s.t.} \quad \mathrm{KL}(q^\pi(\tau)\,\|\,p^\pi(\tau)) \leq \epsilon, \tag{28}$$

where the optimization is w.r.t. costs $c_t = -r_t$. In other words, the agent aims to find the worst-case dynamics q that are within ε of the true dynamics in a KL sense. Now consider the dual problem with Lagrange multiplier λ:

$$\min_{\lambda > 0}\max_q \mathbb{E}_{q^\pi(\tau)}\left[\sum_t c_t\right] + \lambda\left(\epsilon - \mathrm{KL}(q^\pi(\tau)\,\|\,p^\pi(\tau))\right). \tag{29}$$

In particular, we observe that for any fixed λ the maximization w.r.t. q is equivalent to maximizing $J_\beta(q,\pi)$ with β = −λ. Hence, we propose a dual gradient descent optimization with loss function:

$$J(\lambda) = \lambda\left(\epsilon - \mathrm{KL}(q(s_{t+1}, r_t \mid s_t, a_t)\,\|\,p(s_{t+1}, r_t \mid s_t, a_t))\right), \tag{30}$$

where we can recover β by setting it to −λ. Again, we impose the constraint only at single transitions, which yields more stable learning.

D ADDITIONAL EXPERIMENTS

D.1 VISUALIZATION OF VARIATIONAL DYNAMICS FOR RSVAC

In Fig.
7, we visualize the learned variational dynamics on the stochastic continuous 2D environment for a range of β values. When β < 0, we observe that the variational dynamics model is pessimistic and moves the agent towards the horizontal sides of the square. In contrast, when β > 0 the variational dynamics guide the agent towards the regions of high reward and ignore the potential of hitting the walls.

D.2 ABLATION EXPERIMENTS FOR RSVAC

We run ablation experiments for rsVAC using SAC as its actor-critic on the risky MuJoCo benchmarks (InvertedPendulum, HalfCheetah and Swimmer) for a range of β values. Fig. 8 demonstrates that rsVAC is capable of learning risk-sensitive policies in both the risk-averse and risk-seeking regimes while achieving high reward.

[Figure 7 panels; surviving panel labels: (c) β = 100, (f) β = 100.]

Figure 7: Visualizations of variational dynamics for linearly spaced coordinates in the stochastic continuous 2D environment. From each state, we draw a vector to its expected next state colored by the agent's action: up (black), right (green), down (blue) and left (red).

[Figure 8 plots. Panels: InvertedPendulum-v4, HalfCheetah-v4 and Swimmer. x-axis: Environment steps (1k). Bottom-row y-axes: X > 0.01 Rate, X < -3 Rate, X > 0.5 Rate. Legend: Beta=1.0, Beta=10.0, Beta=100.0, Beta=-1.0, Beta=-10.0, Beta=-100.0.]

Figure 8: Ablation analysis w.r.t. risk parameter β. The solid curves correspond to the mean and shaded regions to one standard deviation over 5 random trials. Top row: Average return. Bottom row: Percentage of steps of an episode in risky regions.

D.3 ADDITIONAL MUJOCO EXPERIMENTS

Fig. 9 compares rsVAC (with TD3 as its actor-critic) to other risk-averse baselines on the MuJoCo environment Ant-v4.
Similarly to our previous experiments, we modify this environment by including an additional stochastic reward sampled from $\mathcal{N}(0, 10^2)$ if X-position > 0.5. Again we observe that rsVAC is effective at learning risk-sensitivity while producing better policies in terms of average return.

[Figure 9 plots. x-axis: Environment steps (10k). Right-panel y-axis: X > 0.5 Rate. Legend: MG, MVPI, ExpTD, rsVAC-TD3.]

Figure 9: Risky Ant-v4. Left: Average return on Ant-v4. Right: Percentage of steps of an episode in risky regions. The solid curves correspond to the mean and shaded regions to one standard deviation over 5 random trials.

E IMPLEMENTATION DETAILS

E.1 STABILITY MODIFICATIONS

During the implementation of rsVAC, we noticed that the log-terms in the critic update have no effect on controlling the risk-sensitivity of the algorithm, while producing instabilities that can hurt the critic's convergence. For this reason, we remove them during optimization of rsVAC for the continuous experiments. Another modification that we found can improve learning of the variational dynamics is the introduction of a separate critic V optimized w.r.t. real environment data. This critic is convenient as it provides information about the return in future states while remaining independent of the variational dynamics, so it does not tend to become overly optimistic (or pessimistic) for β values with small magnitude.

E.2 CONTINUOUS 2D ENVIRONMENT

Table 1 lists the hyperparameters used by rsVAC for the stochastic continuous 2D environment.
Table 1: Hyperparameters for stochastic continuous 2D environment

Schedule details
  Environment steps before training      5000 steps
  Environment steps per epoch            1000 steps
  Model optimization                     every 1 step
  Number of model rollouts               128 rollouts
  Rollout length                         1 step
Network details
  Discount factor                        0.9
  Soft target update                     0.005
  Experience buffer D_env                1,000,000
  Model buffer D_model                   128
  Dynamics Network Architecture          MLP with 2 hidden layers of size 256
  Actor Network Architecture             MLP with 2 hidden layers of size 256
  Critic Network Architecture            MLP with 2 hidden layers of size 256
  Network optimizer                      Adam
  Non-linear layers                      ReLU
  Learning rate                          0.0003

E.3 MUJOCO ENVIRONMENTS HYPERPARAMETERS

For MG and MVPI, we use the implementations in Luo et al. (2024) and follow the same hyperparameters suggested by the authors. We implement expTD (Noorani et al., 2023) on top of the TD3 implementation in Luo et al. (2024) and select −10 as its risk parameter from {−1, −10, −20, −100}. For rsVAC we select its risk parameter from {−1, −4, −8}. For SAC we use a re-implementation of that algorithm made available by other authors.² We use the same network architectures and learning rates for all algorithms. Table 2 lists the hyperparameters used for rsVAC and all actor-critic algorithms in risk-aware MuJoCo benchmarks.
Table 2: Hyperparameters for risk-aware MuJoCo benchmark

Schedule details
  Environment steps before training      5000 steps
  Environment steps per epoch            1000 steps
  Model optimization                     every 1 step
  Number of model rollouts               256 rollouts
  Rollout length                         1 step
Network details
  Discount factor                        0.99
  Soft target update                     0.005
  Experience buffer D_env                1,000,000
  Model buffer D_model                   256
  Reward Network Architecture            MLP with 2 hidden layers of size 256
  Actor Network Architecture             MLP with 2 hidden layers of size 256
  Critic Network Architecture            MLP with 2 hidden layers of size 256
  Network optimizer                      Adam
  Non-linear layers                      ReLU
  Learning rate                          0.0003
  β initialization                       -1 (-8 for InvertedPendulum with TD3)
  α initialization                       0.2 (for SAC implementation only)

² https://github.com/Xingyu-Lin/mbpo_pytorch

F PSEUDOCODE OF RSVAC

This section contains the pseudocode for our algorithm, rsVAC.

Algorithm 1 rsVAC

    Initialize networks, parameters and replay buffers.
    for each epoch do
        for each environment step do
            a_t ∼ π_θ(a_t | s_t)
            s_{t+1}, r_t ∼ p(s_{t+1}, r_t | s_t, a_t)                 ▷ Sample next state from environment.
            D_env ← D_env ∪ {(s_t, a_t, s_{t+1}, r_t)}                ▷ Add tuple to experience buffer.
            if model optimization then
                {(s_t^i, a_t^i, s_{t+1}^i, r_t^i)}_{i=1}^N ∼ D_env    ▷ Sample every tuple in experience buffer.
                θ ← θ − ∇J_d(θ)                                       ▷ Update prior dynamics p_θ.
                θ ← θ − ∇J_r(θ)                                       ▷ Update prior reward p_θ.
                ϕ ← ϕ − ∇J_d(ϕ)                                       ▷ Update variational dynamics q_ϕ.
                ϕ ← ϕ − ∇J_r(ϕ)                                       ▷ Update variational reward q_ϕ.
                for m = 1, 2, ..., M do
                    s_t ∼ D_env                                       ▷ Sample state from experience buffer D_env.
                    a_t ∼ π_θ(a_t | s_t)                              ▷ Sample action using policy.
                    s_{t+1} ∼ q_ϕ(s_{t+1} | s_t, a_t)                 ▷ Sample next state using variational dynamics.
                    r_t ∼ q_ϕ(r_t | s_t, a_t)                         ▷ Sample reward using variational reward model.
                    D_model ← D_model ∪ {(s_t, a_t, s_{t+1}, r_t)}    ▷ Add tuple to model buffer.
                end for
            end if
            for k = 1, 2, ..., K do
                {(s_t^i, a_t^i, s_{t+1}^i, r_t^i)}_{i=1}^B ∼ D_model  ▷ Sample mini-batch from model buffer D_model.
                ψ ← ψ − ∇J(ψ)                                         ▷ Update critic Q_ψ.
                θ ← θ − ∇J(θ)                                         ▷ Update policy π_θ.
                ψ̄ ← τψ + (1 − τ)ψ̄                                     ▷ Update target critic Q_ψ̄.
            end for
        end for
    end for
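To make the data flow of the inner loops of Algorithm 1 more concrete, the following is a minimal sketch of the one-step model rollouts from the variational models and of the soft target update. It assumes toy stand-ins for everything learned: a scalar state, a uniformly random policy, and hand-written Gaussian samplers in place of the variational dynamics and reward networks. The function names `model_rollouts` and `soft_update` are hypothetical, and the gradient updates of the models, critic and policy are omitted, so this is an illustration of the buffer handling rather than the authors' implementation.

```python
import random


def soft_update(target, online, tau=0.005):
    """Polyak averaging of parameters: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_bar for w, w_bar in zip(online, target)]


def model_rollouts(env_buffer, policy, q_dynamics, q_reward, num_rollouts, rng):
    """One-step rollouts branched from real states, using the variational models."""
    model_buffer = []
    for _ in range(num_rollouts):
        s, _, _, _ = rng.choice(env_buffer)   # s_t ~ D_env
        a = policy(s, rng)                    # a_t ~ pi(. | s_t)
        s_next = q_dynamics(s, a, rng)        # s_{t+1} ~ q(. | s_t, a_t)
        r = q_reward(s, a, rng)               # r_t ~ q(. | s_t, a_t)
        model_buffer.append((s, a, s_next, r))
    return model_buffer


# Toy stand-ins (assumptions, not learned models): scalar state and action.
def toy_policy(s, rng):
    return rng.uniform(-1.0, 1.0)


def toy_q_dynamics(s, a, rng):
    return s + a + rng.gauss(0.0, 0.1)


def toy_q_reward(s, a, rng):
    return -abs(s) + rng.gauss(0.0, 0.1)


rng = random.Random(0)
# Pretend experience buffer of (s_t, a_t, s_{t+1}, r_t) tuples from the real environment.
env_buffer = [(float(i), 0.0, float(i + 1), 1.0) for i in range(10)]

model_buffer = model_rollouts(env_buffer, toy_policy, toy_q_dynamics, toy_q_reward, 256, rng)
target_critic = soft_update(target=[0.0, 0.0], online=[1.0, -1.0])
```

The rollout count of 256 and the soft-update coefficient of 0.005 mirror the values listed in Table 2; the critic and policy would then be trained on mini-batches drawn from `model_buffer`.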