# Risk-Sensitive Control as Inference with Rényi Divergence

Kaito Ito (The University of Tokyo, kaito@g.ecc.u-tokyo.ac.jp) and Kenji Kashima (Kyoto University, kk@i.kyoto-u.ac.jp)

This paper introduces risk-sensitive control as inference (RCaI), which extends CaI by using Rényi divergence variational inference. RCaI is shown to be equivalent to log-probability regularized risk-sensitive control, which is an extension of maximum entropy (MaxEnt) control. We also prove that the risk-sensitive optimal policy can be obtained by solving a soft Bellman equation, which reveals several equivalences between RCaI, MaxEnt control, the optimal posterior for CaI, and linearly-solvable control. Moreover, based on RCaI, we derive risk-sensitive reinforcement learning (RL) methods: the policy gradient and the soft actor-critic. As the risk-sensitivity parameter vanishes, we recover the risk-neutral CaI and RL, which means that RCaI is a unifying framework. Furthermore, we give another risk-sensitive generalization of MaxEnt control using Rényi entropy regularization. We show that in both of our extensions, the optimal policies have the same structure even though the derivations are very different.

1 Introduction

Optimal control theory is a powerful framework for sequential decision making [1]. In optimal control problems, one seeks a control policy that minimizes a given cost functional and typically assumes full knowledge of the system's dynamics. Optimal control with unknown or partially known dynamics is called reinforcement learning (RL) [2], which has been successfully applied to highly complex and uncertain systems, e.g., robotics [3] and self-driving vehicles [4]. However, solving optimal control and RL problems is still challenging, especially on continuous spaces.

Control as inference (CaI), which connects optimal control and Bayesian inference, is a promising paradigm for overcoming the challenges of RL [5]. In CaI, the optimality of a state and control trajectory is defined by introducing optimality variables rather than explicit costs. Consequently, an optimal control problem can be formulated as a probabilistic inference problem. In particular, maximum entropy (MaxEnt) control [6, 7] is equivalent to a variational inference problem using the Kullback–Leibler (KL) divergence. MaxEnt control has entropy regularization of the control policy, and as a result, the optimal policy is stochastic. Several works have revealed advantages of this regularization, such as robustness against disturbances [8], natural exploration induced by the stochasticity [7, 9], and fast convergence of the MaxEnt policy gradient method [10].

On the other hand, the KL divergence is not the only option available for variational inference. In [11], variational inference was extended to the Rényi α-divergence [12], which is a rich family of divergences including the KL divergence. Similar to traditional variational inference, this extension optimizes a lower bound of the evidence, called the variational Rényi bound. The parameter α of the Rényi divergence controls the balance between mass-covering and zero-forcing effects for approximate inference [13]. However, if we use the Rényi divergence for CaI, it remains unclear how α affects the optimal policy, and a natural question arises: what objective does CaI using the Rényi divergence optimize?
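Since the order parameter α of the Rényi divergence is the central object in what follows, a small numerical illustration may help fix ideas. The snippet below is only a sketch: it assumes NumPy, uses discrete distributions (the paper works with densities), and evaluates the Rényi entropy and divergence exactly as defined in the Notation paragraph below, checking that both approach their Shannon and KL counterparts as α → 1.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """H_alpha(p) = 1/(alpha*(1-alpha)) * log(sum_u p(u)^alpha), discrete analogue."""
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (alpha * (1.0 - alpha))

def renyi_divergence(p1, p2, alpha):
    """D_alpha(p1 || p2) = 1/(alpha-1) * log(sum_u p1(u)^alpha * p2(u)^(1-alpha))."""
    mask = (p1 > 0) & (p2 > 0)
    p1, p2 = p1[mask], p2[mask]
    return np.log(np.sum(p1 ** alpha * p2 ** (1.0 - alpha))) / (alpha - 1.0)

# Two distributions on a four-point space (values made up for illustration).
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.full(4, 0.25)

shannon = -np.sum(p * np.log(p))        # H_1(p)
kl = np.sum(p * np.log(p / q))          # D_1(p || q)

# As alpha -> 1, H_alpha -> Shannon entropy and D_alpha -> KL divergence.
for alpha in (0.5, 0.9, 0.99, 1.01, 1.5):
    print(f"alpha={alpha:4.2f}  H_alpha={renyi_entropy(p, alpha):.4f} (H_1={shannon:.4f})"
          f"  D_alpha={renyi_divergence(p, q, alpha):.4f} (D_1={kl:.4f})")
```

Both functions follow the normalization conventions stated in the Notation paragraph, in particular the 1/(α(1−α)) factor for H_α.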
Contributions The contributions of this work are as follows:

1. We reveal that CaI with Rényi divergence solves a log-probability (LP) regularized risk-sensitive control problem with exponential utility [14] (Theorem 2). The order parameter α of the Rényi divergence plays the role of the risk-sensitivity parameter, which determines whether the resulting policy is risk-averse or risk-seeking. Based on this result, we refer to CaI using Rényi divergence as risk-sensitive CaI (RCaI). Since the Rényi divergence includes the KL divergence, RCaI is a unifying framework of CaI. Additionally, we show that the risk-sensitive optimal policy takes the form of the Gibbs distribution whose energy is given by the Q-function, which can be obtained by solving a soft Bellman equation (Theorem 3); a tabular sketch of this recursion is given after the related-work paragraph below. Furthermore, this reveals several equivalence results between RCaI, MaxEnt control, the optimal posterior for CaI, and linearly-solvable control [15, 16].

2. Based on RCaI, we derive risk-sensitive RL methods. First, we provide a policy gradient method [17–19] for the regularized risk-sensitive RL (Proposition 7). Next, we derive the risk-sensitive counterpart of the soft actor-critic algorithm [7] through the maximization of the variational Rényi bound (Subsection 4.2). As the risk-sensitivity parameter vanishes, the proposed methods converge to REINFORCE [19] with entropy regularization and the risk-neutral soft actor-critic [7], respectively. One of their advantages over other risk-sensitive approaches, including distributional RL [20, 21], is that they require only minor modifications to the standard REINFORCE and soft actor-critic. The behavior of the risk-sensitive soft actor-critic is examined via an experiment.

3. Although the risk-sensitive control induced by RCaI has LP regularization of the policy, the regularizer is not an entropy, unlike the MaxEnt control with Shannon entropy regularization. To bridge this gap, we provide another risk-sensitive generalization of the MaxEnt control using Rényi entropy regularization. We prove that the resulting optimal policy and Bellman equation have the same structure as the LP regularized risk-sensitive control (Theorem 6). The derivation differs significantly from that for the LP regularization, and for the analysis, we establish a duality between exponential integrals and Rényi entropy (Lemma 5).

The relations established between the several control problems in this paper are summarized in Fig. 1.

Figure 1: Relations of control problems (VI: variational inference). CaI combined with VI using the KL divergence yields MaxEnt control; CaI combined with VI using the Rényi (1+η)-divergence yields LP regularized risk-sensitive control, whose policy converges to the CaI posterior as η → −1; the Rényi entropy regularized risk-sensitive control has the same structure; for deterministic systems, these problems are linearly solvable.

Related work The duality between control and inference has been extensively studied [15, 22–26]. Inspired by CaI, [27, 28] reformulated model predictive control (MPC) as a variational inference problem. In [29], variational inference MPC using the Tsallis divergence, which is equivalent to the Rényi divergence, was proposed. The difference between our results and theirs is that variational inference MPC infers feedforward optimal controls while RCaI infers feedback optimal controls. Consequently, the equivalence of risk-sensitive control and Tsallis variational inference MPC is not derived, unlike for RCaI. The work [30] proposed an EM-style algorithm for RL based on CaI, where the resulting policy is risk-seeking. However, risk-averse policies cannot be derived from CaI by this approach. Our framework provides the equivalence between CaI and risk-sensitive control both for risk-seeking and risk-averse cases. Risk-averse policies are known to yield robust control [31, 32], and risk-seeking policies are useful for balancing exploration and exploitation in RL [33]. Because of these merits, many efforts have been devoted to risk-sensitive RL [19, 34–36]. In [37], risk-sensitive RL with Shannon entropy regularization was investigated. However, their theoretical results are valid only for almost risk-neutral cases. Our results imply that LP and Rényi entropy regularization are suitable for risk-sensitive RL. In [16], risk-sensitive control whose control cost is defined by the Rényi divergence was investigated, and it was shown that the associated Bellman equation can be linearized. However, it is assumed there that the transition distribution can be controlled as desired, which is not satisfied in general, as pointed out in [38]. On the other hand, our result shows that when the dynamics is deterministic, LP and Rényi entropy regularized risk-sensitive control problems are linearly solvable without the full controllability assumption on the transition distribution.
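As noted in Contribution 1, the optimal policy is a Gibbs distribution obtained from a soft Bellman recursion (Theorem 3). The following minimal tabular sketch makes this concrete. It is an illustration only: the finite state and action spaces, costs, and transition kernel below are made up, NumPy is assumed, and sums replace the integrals of the continuous-space formulation. The recursion implements Q_t(x,u) = c_t(x,u) + (1/η) log E[exp(η V_{t+1}(x_{t+1}))] and V_t(x) = −log Σ_u exp(−Q_t(x,u)), with π*_t(u|x) = exp(−Q_t(x,u) + V_t(x)); as η → 0 it reduces to MaxEnt soft value iteration, and at η = −1 it has the same form as the CaI recursion (4)–(5).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite MDP (all numbers below are made up for illustration).
nS, nA, T = 5, 3, 10
cost = rng.uniform(0.0, 1.0, size=(nS, nA))      # stage cost c_t(x, u), time-invariant here
cost_T = rng.uniform(0.0, 1.0, size=nS)          # terminal cost c_T(x)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[x, u, x'] = p(x' | x, u)

def risk_sensitive_soft_bellman(eta):
    """Backward recursion of Theorem 3; returns the Gibbs policies pi_t(u|x) and V_0."""
    V = cost_T.copy()
    policies = []
    for t in reversed(range(T)):
        if abs(eta) > 1e-8:
            # Q_t(x,u) = c_t(x,u) + (1/eta) * log E_{x'}[exp(eta * V_{t+1}(x'))]
            Q = cost + np.log(P @ np.exp(eta * V)) / eta
        else:
            # Risk-neutral limit eta -> 0 (MaxEnt control): Q_t = c_t + E[V_{t+1}]
            Q = cost + P @ V
        # Soft Bellman equation: V_t(x) = -log sum_u exp(-Q_t(x,u))
        V = -np.log(np.sum(np.exp(-Q), axis=1))
        # Gibbs-form optimal policy: pi_t(u|x) = exp(-Q_t(x,u) + V_t(x))
        policies.append(np.exp(-Q + V[:, None]))
    return policies[::-1], V

for eta in (-1.0, -0.5, 0.0, 0.5):   # eta = -1: CaI posterior; eta > 0: risk-averse
    _, V0 = risk_sensitive_soft_bellman(eta)
    print(f"eta = {eta:+.1f}, V_0 = {np.round(V0, 3)}")
```

For continuous action spaces, the sum over u becomes the integral in (13), and the assumption ∫ exp(−Q_t(x,u')) du' < ∞ of Theorem 3 replaces the finiteness of the action set used here.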
Notation For simplicity, by abuse of notation, we write the density (or probability mass) functions of random variables x, y as p(x), p(y), and the expectation with respect to p(x) is denoted by E_{p(x)}. For a set S, the set of all densities on S is denoted by P(S). The Rényi entropy and divergence with parameter α > 0, α ≠ 1, are defined as
H_α(p) := \frac{1}{\alpha(1-\alpha)} \log \int_{\{u : p(u) > 0\}} p(u)^{\alpha} \, du,
D_α(p_1 \,\|\, p_2) := \frac{1}{\alpha - 1} \log \int_{\{u : p_1(u) p_2(u) > 0\}} p_1(u)^{\alpha} p_2(u)^{1-\alpha} \, du.
For the factor 1/(α(1−α)) of H_α, we follow [39, 40] because this choice is more convenient for the analysis in Subsection 3.2 than the other common choice 1/(1−α). We formally extend the definition of H_α to α < 0. We denote the Shannon entropy and KL divergence by H_1(p), D_1(p_1‖p_2), respectively, because lim_{α→1} H_α(p) = H_1(p) and lim_{α→1} D_α(p_1‖p_2) = D_1(p_1‖p_2). For further properties of the Rényi entropy and divergence, see, e.g., [41]. The set of integers {k, k+1, …, s}, k < s, is denoted by [[k, s]]. A sequence {x_k, x_{k+1}, …, x_s} is denoted by x_{k:s}. The set of non-negative real numbers is denoted by R_{≥0}.

2 Brief introduction to control as inference

Figure 2: Graphical model for CaI.

First, we briefly introduce the framework of CaI. For the detailed derivation, see Appendix A and [5]. Throughout the paper, x_t and u_t denote the X-valued state and U-valued control variables at time t, respectively, where X ⊆ R^{n_x}, U ⊆ R^{n_u}, and µ_L(U) > 0. Here, µ_L denotes the Lebesgue measure on R^{n_u}. The initial distribution is p(x_0), and the transition density is denoted by p(x_{t+1}|x_t, u_t), which depends only on the current state and control input. Let T > 0 be a finite time horizon. CaI connects control and probabilistic inference problems by introducing optimality variables O_t ∈ {0, 1} as in Fig. 2. For c_t : X × U → R_{≥0}, c_T : X → R_{≥0}, which will serve as cost functions, the distribution of O_t is given by p(O_t = 1|x_t, u_t) = exp(−c_t(x_t, u_t)), t ∈ [[0, T−1]], and p(O_T = 1|x_T) = exp(−c_T(x_T)). If O_t = 1, then (x_t, u_t) at time t is said to be optimal. The control posterior p(u_t|x_t, O_{t:T} = 1) is called the optimal policy. Let the prior of u_t be uniform: p(u_t) = 1/µ_L(U), u_t ∈ U. Although this choice is common for CaI, the arguments in this paper may be extended to non-uniform priors. Then, for the graphical model in Fig.
2, the distribution of the optimal state and control input trajectory τ := (x0:T , u0:T 1) satisfies p(τ|O0:T = 1) t=0 p(xt+1|xt, ut) p(OT = 1|x T ) t=0 p(Ot = 1|xt, ut) t=0 p(xt+1|xt, ut) t=0 ct(xt, ut) For notational simplicity, we will drop = 1 for Ot in the remainder of this paper. The optimal policy p(ut|xt, Ot:T ) can be computed in a recursive manner. To this end, define Qt(xt, ut) := log p(Ot:T |xt, ut) µL(U) , Vt(xt) := log p(Ot:T |xt), (2) which play a role of value functions. Then, the following result holds. Proposition 1. Assume that µL(U) < and let ct(xt, ut) := ct(xt, ut) + log µL(U). Assume further the existence of density functions p(x0) and p(xt+1|xt, ut) for any t [[0, T 1]]1. Then, it holds that p(ut|xt, Ot:T = 1) = exp ( Qt(xt, ut) + Vt(xt)) , xt X, ut U, (3) 1When considering discrete variables xt, ut, the assumption µL(U) < is replaced by the finiteness of the set U, and the existence of the densities is not required. Vt(xt) = log Z U exp( Qt(xt, ut))dut , t [[0, T 1]], VT (x T ) = c T (x T ), (4) Qt(xt, ut) = ct(xt, ut) log Ep(xt+1|xt,ut) [exp( Vt+1(xt+1))] , t [[0, T 1]]. (5) The recursive computation (4), (5) is similar to the Bellman equation for the risk-seeking control. However, it is not still clear what kind of performance index the optimal trajectory p(τ|Ot:T ) optimizes because (4) does not coincide with that of the conventional risk-seeking control. An indirect way to make this clear is variational inference. Let us consider finding the closest trajectory distribution pπ(τ) to the optimal distribution p(τ|O0:T ). The variational distribution is chosen as pπ(τ) = p(x0) t=0 p(xt+1|xt, ut)πt(ut|xt), (6) where πt( |xt) P(U) is the conditional density of ut given xt and corresponds to a control policy. Then, the minimization of the KL divergence D1(pπ(τ) p(τ|O0:T )) is known to be equivalent to the following Max Ent control problem: minimize {πt}T 1 t=0 Epπ(τ) c T (x T ) + ct(xt, ut) H1(πt( |xt)) # Especially when the system p(xt+1|xt, ut) is deterministic, the minimum value of D1(pπ(τ) p(τ|O0:T )) is 0, and the posterior p(ut|xt, Ot:T ) yields the optimal control of (7). As mentioned in Introduction, this work uses Rényi divergence rather than the KL divergence. Moreover, we characterize the optimal posterior p(ut|xt, Ot:T ) more directly even for stochastic systems. 3 Control as Rényi divergence variational inference In this section, we address the question of what kind of control problem is solved by Ca I with Rényi divergence and characterize the optimal policy. 3.1 Equivalence between Ca I with Rényi divergence and risk-sensitive control Let η > 1, η = 0. Then, Ca I using Rényi variational inference is formulated as the minimization of D1+η(pπ(τ) p(τ|O0:T )) with respect to pπ in (6). Now, we have D1+η(pπ p( |O0:T )) = 1 η log Z pπ(τ)1+ηp(τ, O0:T ) ηdτ | {z } (Variational Rényi bound) + log p(O0:T ). (8) That is, Ca I with Rényi divergence is equivalent to maximizing the above variational Rényi bound. Moreover, by (1), it holds that log Z pπ(τ)1+ηp(τ, O0:T ) ηdτ p(x0) QT 1 t=0 p(xt+1|xt, ut)πt(ut|xt) 1 µL(U)p(x0) h QT 1 t=0 p(xt+1|xt, ut) i exp c T (x T ) PT 1 t=0 ct(xt, ut) = log Z pπ(τ) exp ηc T (x T ) + η ct(xt, ut) + log πt(ut|xt) dτ + η log µL(U). Consequently, we obtain the first equivalence result in this paper. Theorem 2. Suppose that the assumptions in Proposition 1 hold. 
Then, for any η > 1, η = 0, the minimization of D1+η(pπ p( |O0:T = 1)) with respect to pπ in (6) is equivalent to minimize {πt}T 1 t=0 1 η log Epπ(τ) ηc T (x T ) + η ct(xt, ut) + log πt(ut|xt) !# Problem (9) is a risk-sensitive control problem with the log-probability regularization log πt(ut|xt) of the control policy. Let ηΦ(τ) be the exponent in (9). Then, 1 η log E[exp(ηΦ(τ))] = E[Φ(τ)] + η 2Var[Φ(τ)] + O(η2), where Var[ ] denotes the variance [42]. Hence, η > 0 (resp. η < 0) leads to risk-averse (resp. risk-seeking) policies. As η goes to zero, the objective in (9) converges to the risk-neutral Max Ent control problem (7). 3.2 Derivation of optimal control and further equivalence results In this subsection, we derive the optimal policy of (9) and give its characterizations. For the analysis, we do not need the non-negativity of the cost ct. We only sketch the derivation, and the detailed proof is given in Appendix B. Similar to the conventional optimal control problems, we adopt the dynamic programming. Another approach based on variational inference will be given in Subsection 4.2. Define the optimal (state-)value function Vt : X R and the Q-function Qt : X U R as follows: Vt(xt) := inf {πs}T 1 s=t 1 η log Epπ(τ|xt) ηc T (x T ) + η cs(xs, us) + log πs(us|xs) !# Qt(xt, ut) := ct(xt, ut) + 1 η log Ep(xt+1|xt,ut) exp ηVt+1(xt+1) , t [[0, T 1]], (11) and VT (x T ) := c T (x T ). Then, it can be shown that the Bellman equation for Problem (9) is Vt(xt) = log Z U exp ( Qt(xt, u )) du + inf πt( |xt) P(U) D1+η(πt( |xt) π t ( |xt)), (12) where π t (ut|xt) := exp ( Qt(xt, ut)) /Zt(xt), and the normalizing constant is assumed to fulfill Zt(xt) := R U exp ( Qt(xt, u )) du < . Since D1+η(πt( |xt) π t ( |xt)) attains its minimum value 0 if and only if πt( |xt) = π t ( |xt), the unique optimal policy that minimizes the right-hand side of (12) is given by π t ( |xt) and Vt(xt) = log Z U exp ( Qt(xt, u )) du , π t (ut|xt) = exp ( Qt(xt, ut) + Vt(xt)) . (13) Because of the softmin operation above, the left equation in (13) is called the soft Bellman equation. Theorem 3. Assume that R U exp ( Qt(x, u )) du < holds for any t [[0, T 1]] and x X. Let η > 1, η = 0. Then, the unique optimal policy of Problem (9) is given by (13). Especially when the dynamics is deterministic, i.e., p(xt+1|xt, ut) = δ(xt+1 ft(xt, ut)) for some ft : X U X and the Dirac delta function δ, it holds that Qt(xt, ut) = ct(xt, ut) + Vt+1 ft(xt, ut) , (14) and the optimal policy of the Max Ent control problem (7) solves the LP-regularized risk-sensitive control problem (9) for any η > 1, η = 0. Assumption R U exp ( Qt(x, u )) du < is satisfied for example when ct is bounded for any t [[0, T]] and µL(U) < . The linear quadratic setting also fulfills this assumption; see (16). Theorem 3 suggests several equivalence results: RCa I and Max Ent control for deterministic systems. First, we emphasize that even though the equivalence between unregularized risk-neutral and risk-sensitive controls for deterministic systems is already known, our equivalence result for Max Ent and regularized risk-sensitive controls is nontrivial. This is because the regularized policy π t makes a system stochastic even though the original system is deterministic, and for stochastic systems, the unregularized risk-sensitive control does not coincide with the risk-neutral control. This implies that the optimal randomness introduced by the regularization does not affect the risk sensitivity of the policy. 
This provides insight into the robustness of Max Ent control [8]. Note that [43] mentioned that the Max Ent control objective can be reconstructed by the risk-sensitive control objective under the heuristic assumption that the cost follows a uniform distribution. However, this assumption is not satisfied in general. Our equivalence result does not require such an unrealistic assumption. RCa I and optimal posterior. Although the optimal posterior p(ut|xt, Ot:T ) yields the Max Ent control for deterministic systems as mentioned in Section 2, it is not known what objective p(ut|xt, Ot:T ) optimizes for stochastic systems. Theorem 3 gives a new characterization of p(ut|xt, Ot:T ). By formally substituting η = 1 into (11), the Bellman equation for computing π t becomes (4), (5) for the optimal posterior p(ut|xt, Ot:T ). Note that even if the cost function ct in (9) is replaced by ct in Proposition 1, {π t } is still optimal. Therefore, by taking the limit as η 1, the policy π t (ut|xt) in Theorem 3 converges to p(ut|xt, Ot:T ), and in this sense, the policy p(ut|xt, Ot:T ) is risk-seeking. Corollary 4. Under the assumptions in Proposition 1, it holds that lim η 1 π t (ut|xt) = exp( Qt(xt, ut) + Vt(xt)) = p(ut|xt, Ot:T = 1), (15) where Vt and Qt are given by (11), (13) with η = 1. RCa I for deterministic systems and linearly-solvable control. For deterministic systems, by the transformation Et(xt) := exp( Vt(xt)), the Bellman equation (14) becomes linear: Et(xt) = R exp( ct(xt, u ))Et+1( ft(xt, u ))du . That is, when the system is deterministic, the LP-regularized risk-sensitive control, or equivalently, the Max Ent control is linearly solvable [15, 16, 44], which enables efficient computation of RL. Even for the Max Ent control, this fact seems not to be mentioned explicitly in the literature. RCa I and unregularized risk-sensitive control in linear quadratic setting. Similar to the unregularized and Max Ent problems [45, 46], Problem (9) with a linear system p(xt+1|xt, ut) = N(xt+1|Atxt + Btut, Σt) and quadratic costs ct(xt, ut) = (x t Qtxt + u t Rtut)/2, c T (x T ) = x T QT x T /2 admits an explicit form of the optimal policy: π t (u|x) = N u| (Rt + B t Πt+1(I ηΣtΠt+1) 1Bt) 1B t Πt+1(I ηΣtΠt+1) 1Atx, (Rt + BtΠt+1(I ηΣtΠt+1) 1Bt) 1 . (16) Here, N( |µ, Σ) denotes the Gaussian density with mean µ and covariance Σ. The definition of Πt and the proof are given in Appendix C. In general, the mean of the regularized risk-sensitive control deviates from the unregularized risk-sensitive control. However, in the linear quadratic Gaussian (LQG) case, the mean of the optimal policy (16) coincides with the optimal control of risk-sensitive LQG control without the regularization [47]. 3.3 Another risk-sensitive generalization of Max Ent control via Rényi entropy The Shannon entropy regularization E[ H1(πt( |xt))] of the Max Ent control problem (7) can be rewritten as E[log πt(ut|xt)]. In this sense, the risk-sensitive control (9) is a natural extension of (7). Nevertheless, for the risk-sensitive case, the interpretation of log πt(ut|xt) as entropy is no longer available. In this subsection, we provide another risk-sensitive extension of the Max Ent control. Inspired by the Rényi divergence utilized so far, we employ Rényi entropy regularization: minimize {πt}T 1 t=0 1 η log Epπ(τ) ηc T (x T ) + η ct(xt, ut) H1 η(πt( |xt)) !# where η R \ {0, 1}, and πt( |x) L1 η(U) := {ρ P(U)| R U ρ(u)1 ηdu < }, x, which implies |H1 η(πt( |xt))| < . As η tends to zero, (17) converges to the Max Ent control problem (7). 
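For the linear quadratic Gaussian case discussed above, the policy (16) can be evaluated by a backward Riccati recursion for Π_t (stated as (58)–(59) in Appendix C). The snippet below is a sketch only: the system matrices, horizon, and η are made up, NumPy is assumed, and the mean of the policy is written as −K_t x following the usual regulator sign convention.

```python
import numpy as np

# Made-up time-invariant LQG data (n_x = 2, n_u = 1) for illustration.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Sigma = 0.01 * np.eye(2)                 # process noise covariance Sigma_t
Q, R, Q_T = np.eye(2), np.array([[0.1]]), np.eye(2)
T, eta = 20, 0.2                         # horizon and risk-sensitivity parameter
I = np.eye(2)

Pi = Q_T                                 # Pi_T = Q_T
gains, covs = [], []
for t in reversed(range(T)):
    Pi_next = Pi
    # Riccati step (58): Pi_t = Q + A^T Pi_{t+1} (I - eta*Sigma*Pi_{t+1} + B R^{-1} B^T Pi_{t+1})^{-1} A
    Pi = Q + A.T @ Pi_next @ np.linalg.solve(
        I - eta * Sigma @ Pi_next + B @ np.linalg.solve(R, B.T) @ Pi_next, A)
    # Policy (16): mean -K_t x and covariance S_t^{-1}, where
    # Pi_tilde = Pi_{t+1} (I - eta*Sigma*Pi_{t+1})^{-1}.
    Pi_tilde = Pi_next @ np.linalg.inv(I - eta * Sigma @ Pi_next)
    S = R + B.T @ Pi_tilde @ B
    gains.append(np.linalg.solve(S, B.T @ Pi_tilde @ A))
    covs.append(np.linalg.inv(S))

K0, cov0 = gains[-1], covs[-1]           # quantities for t = 0 (recursion ran backward)
print("K_0 =", K0, "\ncovariance_0 =", cov0)
```

Note that the policy covariance does not depend on the state, so only the feedback mean −K_t x varies with x; as η → 0 the recursion reduces to the risk-neutral (MaxEnt) linear quadratic solution.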
Define the value function Vt and the Q-function Qt associated with (17) like (10) and (11). Then, as in Subsection 3.2, the following Bellman equation holds. The derivation is given in Appendix E. Vt(xt) = inf πt L1 η(U) U πt(u |xt) exp(ηQt(xt, u ))du H1 η(πt( |xt)) . (18) For the minimization in (18), we establish the duality between exponential integrals and Rényi entropy like in [40] because the same procedure as for (12) cannot be applied. Lemma 5 (Informal). For β, γ R \ {0} such that β < γ and for g : U R, it holds that U exp(βg(u))du = inf ρ L 1 γ γ β (U) U exp(γg(u))ρ(u)du 1 γ β H1 γ γ β (ρ) , and the unique optimal solution that minimizes the right-hand side of (19) is given by ρ(u) = exp ( (γ β)g(u)) R U exp( (γ β)g(u ))du , u U. (20) For the precise statement and the proof, see Appendix D. By applying Lemma 5 with β = η 1, γ = η to (18), we obtain the optimal policy of (17) as follows. Theorem 6. Assume that ct is bounded below for any t [[0, T]]. Assume further that for any x X and t [[0, T 1]], it holds that R U exp ( Qt(x, u )) du < , R U exp ( (1 η)Qt(x, u )) du < . Then, the unique optimal policy of Problem (17) is given by π t (ut|xt) = 1 Z (xt) exp ( Qt(xt, ut)) , t [[0, T 1]], xt X, ut U, (21) where Zt(xt) := R U exp( Qt(xt, u ))du , and it holds that Vt(xt) = 1 1 η log Z U exp ( (1 η)Qt(xt, u )) du , t [[0, T 1]], xt X. (22) Recall that the LP regularized risk-sensitive optimal control is given by (11), (13) while the Rényi entropy regularized control is determined by (21), (22), and Qt(xt, ut) = ct(xt, ut) + 1 η log Ep(xt+1|xt,ut)[exp(ηVt+1(xt+1))]. Hence, the only difference between the risk-sensitive controls for the LP and Rényi regularization is the coefficient in the soft Bellman equations (13), (22). 4 Risk-sensitive reinforcement learning via RCa I Standard RL methods can be derived from Ca I using the KL divergence [5]. In this section, we derive risk-sensitive policy gradient and soft actor-critic methods from RCa I. 4.1 Risk-sensitive policy gradient In this subsection, we consider minimizing the cost (9) by a time-invariant policy parameterized as πt(u|x) = π(θ)(u|x), θ Rnθ. Let Cθ(τ) := c T (x T )+PT 1 t=0 (ct(xt, ut)+log π(θ)(ut|xt)) and pθ be the density of the trajectory τ under the policy π(θ). Then, Problem (9) can be reformulated as the minimization of J(θ)/η where J(θ) := R pθ(τ) exp(ηCθ(τ))dτ. To optimize J(θ)/η by gradient descent, we give the gradient θJ(θ). The proof is shown in Appendix F. Proposition 7. Assume the existence of densities p(xt+1|xt, ut), p(x0). Assume further that π(θ) is differentiable in θ, and the derivative and the integral can be interchanged as θJ(θ) = R θ[pθ(τ) exp(ηCθ(τ))]dτ. Then, for any function b : Rnx R, it holds that θJ(θ) = (η + 1)Epθ(τ) t=0 θ log π(θ)(ut|xt) ηc T (x T ) + η cs(xs, us) + log π(θ)(us|xs) ! The function b is referred to as a baseline function, which can be used for reducing the variance of an estimate of θJ. The following gradient estimate of J(θ)/η is unbiased: t=0 θ log π(θ)(ut|xt) exp ηc T (x T ) + η cs(xs, us) + log π(θ)(us|xs) b(xt) . This is almost the same as risk-sensitive REINFORCE [19] except for the additional term log π(θ)(us|xs). In the risk-neutral limit η 0, this estimator converges to the Max Ent policy gradient estimator [5]. 4.2 Risk-sensitive soft actor-critic In Subsection 3.2, we used dynamic programming to obtain the optimal policy {π t }. Rather, in this section, we adopt a standard procedure of variational inference [48]. 
First, we find the optimal factor πt for fixed πs, s = t as follows. The proof is deferred to Appendix G. Proposition 8. For t [[0, T 1]], let πs, s = t be fixed. Let η > 1, η = 0. Then, the optimal factor π t := arg minπt P(U) D1+η(pπ p( |O0:T = 1)) is given by π t (ut|xt) = 1 Zt(xt) Epπ(xt+1:T ,ut+1:T 1|xt,ut) " QT 1 s=t+1 πs(us|xs) p(Ot|x T ) QT 1 s=t p(Os|xs, us) where Zt(xt) is the normalizing constant. By (24), the optimal factor π t is independent of the past factors πs, s [[0, t 1]]. Therefore, the variational Rényi bound in (8) is maximized by optimizing πt in backward order from t = T 1 to t = 0, which is consistent with the dynamic programming. Associated with (24), we define V π t (xt) := 1 η log Epπ(xt+1:T ,ut:T 1|xt) " QT 1 s=t πs(us|xs) p(Ot|x T ) QT 1 s=t p(Os|xs, us) η log Epπ(xt+1:T ,ut:T 1|xt) ηc T (x T ) + η cs(xs, us) + log πs(us|xs) !# which is the value function for the policy {πs}T 1 s=t satisfying the following Bellman equation. V π t (xt) = 1 η log Eπt(ut|xt) p(Ot|xt, ut) η Ep(xt+1|xt,ut) exp(ηV π t+1(xt+1)) (26) η log Eπt(ut|xt) h exp (ηct(xt, ut) + η log πt(ut|xt)) Ep(xt+1|xt,ut) exp(ηV π t+1(xt+1)) i . By the value function, π t (ut|xt) can be written as π t (ut|xt) = p(Ot|xt, ut) Zt(xt) Ep(xt+1:T ,ut+1:T 1|xt,ut) " QT 1 s=t+1 πs(us|xs) p(Ot|x T ) QT 1 s=t+1 p(Os|xs, us) = p(Ot|xt, ut) Zt(xt) Ep(xt+1|xt,ut) exp(ηV π t+1(xt+1)) 1/η . (27) Next, we define the Q-function for {πs}T 1 s=t+1 as follows: Qπ t (xt, ut) := log p(Ot|xt, ut) + 1 η log Ep(xt+1|xt,ut) exp(ηV π t+1(xt+1)) . (28) Then, it follows from (26) and (27) that V π t (xt) = 1 η log Eπt(ut|xt) [πt(ut|xt)η exp(ηQπ t (xt, ut))] , (29) π t (ut|xt) = 1 Zt(xt) exp( Qπ t (xt, ut)), Zt(xt) = Z U exp ( Qπ t (xt, u )) du . (30) Especially when πt(ut|xt) = π t (ut|xt), it holds that V π t (xt) = log R exp( Qπ t (xt, u ))du , which coincides with the soft Bellman equation in (13). In summary, in order to obtain the optimal factor π t , it is sufficient to compute V π t and Qπ t in a backward manner. Next, we consider the situation when the policy is parameterized as π(θ) t (ut|xt), θ Rnθ and there is no parameter θ that gives the optimal factor π(θ) t = π t . To accommodate this situation, we utilize the variational Rényi bound. One can easily see that the maximization of the Rényi bound in (8) with respect to a single factor πt is equivalent to the following problem. minimize πt 1 η log Epπ(xt) Eπt(ut|xt) [πt(ut|xt)η exp(ηQπ t (xt, ut))] . (31) This suggests choosing θ that minimizes (31) whose πt is replaced by π(θ) t . Note that this is further equivalent to minimize θ Epπ(xt) π(θ) t ( |xt) exp( Qπ t (xt, )) Zt(xt) We also parameterize V π t and Qπ t as V (ψ), Q(φ) and optimize ψ, φ so that the relations (28), (29) approximately hold. To obtain unbiased gradient estimators later, we minimize the following squared residual error based on (28), (29), and the transformation Tη(v) := (eηv 1)/η, v R: JQ(φ) := Epπ(xt,ut) n Tη Q(φ)(xt, ut) c(xt, ut) Ep(xt+1|xt,ut) h Tη(V (ψ)(xt+1)) io2 , JV (ψ) := Epπ(xt) n Tη(V (ψ)(xt)) Eπ(θ)(ut|xt) h Tη Q(φ)(xt, ut) + log π(θ)(ut|xt) io2 . Using Q(φ) and Tη, we replace (31) with the following equivalent objective: Jπ(θ) := Epπ(xt) Eπ(θ)(ut|xt) Tη Q(φ)(xt, ut) + log π(θ)(ut|xt) . (33) Noting that limη 0 Tη(κ(η)) = κ(0) for κ : R R, as the risk sensitivity η goes to zero, the objectives JQ, JV , Jπ converge to those used for the risk-neutral soft actor-critic [7]. 
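The three objectives above translate almost directly into code. The sketch below assumes PyTorch; `q_net`, `v_net`, `v_target_net`, and `policy` (with a `sample_with_log_prob` method returning a reparameterized action and its log-density) are placeholder modules, and the batch tensors are assumed to come from a replay buffer. It computes single-sample estimates of J_Q, J_V, and J_π using the transformation T_η.

```python
import torch

def T_eta(v, eta):
    """T_eta(v) = (exp(eta * v) - 1) / eta, with the limit T_0(v) = v as eta -> 0."""
    if abs(eta) < 1e-8:
        return v
    return torch.expm1(eta * v) / eta

def rsac_losses(batch, q_net, v_net, v_target_net, policy, eta):
    """Minibatch estimates of the objectives J_Q, J_V, J_pi of Subsection 4.2."""
    x, u, c, x_next = batch            # states, actions, stage costs, next states

    # J_Q: squared residual of T_eta(Q - c) against E[T_eta(V(x_{t+1}))] (single sample).
    with torch.no_grad():
        target = T_eta(v_target_net(x_next), eta)
    q_loss = (T_eta(q_net(x, u) - c, eta) - target).pow(2).mean()

    # J_V: squared residual of T_eta(V) against E_pi[T_eta(Q + log pi)] (single sample).
    u_new, log_pi = policy.sample_with_log_prob(x)        # reparameterized sample
    with torch.no_grad():
        soft_q = T_eta(q_net(x, u_new) + log_pi, eta)
    v_loss = (T_eta(v_net(x), eta) - soft_q).pow(2).mean()

    # J_pi: E_pi[T_eta(Q + log pi)], minimized over the policy parameters.
    pi_loss = T_eta(q_net(x, u_new) + log_pi, eta).mean()
    return q_loss, v_loss, pi_loss
```

Automatic differentiation of `q_loss` and `v_loss` reproduces the gradients (34) and (35) derived next up to a constant factor, since T_η'(v) = exp(ηv); for the policy term, the sketch differentiates through the reparameterized sample as in standard SAC, whereas (36) is the likelihood-ratio form of the same gradient.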
Differentiating these objectives, we obtain
∇_φ J_Q(φ) = E_{p_π(x_t,u_t)} [ ∇_φ Q^{(φ)}(x_t, u_t) \, exp(ηQ^{(φ)}(x_t, u_t) − ηc(x_t, u_t)) ( T_η(Q^{(φ)}(x_t, u_t) − c(x_t, u_t)) − E_{p(x_{t+1}|x_t,u_t)}[T_η(V^{(ψ)}(x_{t+1}))] ) ], (34)
∇_ψ J_V(ψ) = E_{p_π(x_t)} [ ∇_ψ V^{(ψ)}(x_t) \, exp(ηV^{(ψ)}(x_t)) ( T_η(V^{(ψ)}(x_t)) − E_{π^{(θ)}(u_t|x_t)}[ T_η(Q^{(φ)}(x_t, u_t) + log π^{(θ)}(u_t|x_t)) ] ) ], (35)
∇_θ J_π(θ) = (η + 1) E_{p_π(x_t,u_t)} [ ∇_θ log π^{(θ)}(u_t|x_t) \, T_η( Q^{(φ)}(x_t, u_t) + log π^{(θ)}(u_t|x_t) ) ]. (36)
Thanks to the transformation T_η, the expectations appear linearly, and an unbiased gradient estimator can be obtained by replacing them with samples. By simply replacing the gradients of the soft actor-critic [7] with (34)–(36), we obtain the risk-sensitive soft actor-critic (RSAC). It is worth mentioning that since RSAC requires only minor modifications to SAC, techniques for stabilizing SAC, e.g., reparameterization, minibatch sampling with a replay buffer, target networks, and the double Q-network, can be directly used for RSAC.

5 Experiment

Figure 3: Average episode cost for RSAC with several values of η and for standard SAC.

Unregularized risk-averse control is known to be robust against perturbations in systems [32]. Since the robustness of the regularized case has not yet been established theoretically, we verify the robustness of policies learned by RSAC through a numerical example. The environment is Pendulum-v1 in OpenAI Gymnasium. We trained control policies using the hyperparameters shown in Appendix H. There were no significant differences in the control performance obtained or in the behavior during training. On the other hand, for each η, one control policy was selected and applied to a slightly different environment without retraining. To be more precise, the pendulum length l, which is 1.0 during training, is changed to 1.25 and 1.5; see Fig. 3. In this example, it can be seen that the control policy obtained with a larger η suffers a smaller performance degradation due to the environmental changes. This robustness can be considered a benefit of risk-sensitive control. In Fig. 4, empirical distributions of the costs for different risk-sensitivity parameters η are plotted.

Figure 4: Empirical distributions of the costs for different risk-sensitivity parameters η: (a) pendulum length l = 1.0 during training, (b) system perturbation l = 1.25, (c) system perturbation l = 1.5.

Only the distribution for η = 0.02 does not change much under the system perturbations. The distribution for SAC (η = 0) with l = 1.5 deviates from the original one (l = 1.0), and another peak of the distribution appears in the high-cost area. This means that there is a high probability of incurring a high cost, which clarifies the advantage of RSAC. The more risk-seeking the policy becomes, the less robust it becomes against the system perturbation.

6 Conclusions

In this paper, we proposed a unifying framework of CaI, named RCaI, using Rényi divergence variational inference. We revealed that RCaI yields the LP regularized risk-sensitive control with exponential performance criteria. Moreover, we showed the equivalences between risk-sensitive control, MaxEnt control, the optimal posterior for CaI, and linearly-solvable control. In addition to these connections, we derived the policy gradient method and the soft actor-critic method for risk-sensitive RL via RCaI. Interestingly, Rényi entropy regularization also results in the same form of the risk-sensitive optimal policy and the soft Bellman equation as the LP regularization.
From a practical point of view, a major limitation of the proposed risk-sensitive soft actor-critic is its numerical instability for large |η| cases. Since η appears, for example, as exp(ηQ(φ)(xt, ut)) in the gradients (34) (36), the magnitude of η that does not cause the numerical instability depends on the scale of costs. Therefore, we need to choose η depending on environments. In the experiment using Pendulum-v1, |η| that is larger than 0.03 results in the failure of learning due to the numerical instability. Although it is an important future work to address this issue, we would like to note that this issue is not specific to our algorithms, but occurs in general risk-sensitive RL with exponential utility. It is also important how to choose a specific value of the order parameter 1 + η of Rényi divergence. Since we showed that η determines the risk sensitivity of the optimal policy, we can follow previous studies on the choice of the sensitivity parameter of the risk-sensitive control without regularization. The properties of the derived algorithms also need to be explored in future work, e.g., the compatibility of a function approximator for RSAC [49]. Acknowledgments The authors thank Ran Wang for his valuable help in conducting the experiment. This work was supported in part by JSPS KAKENHI Grant Numbers JP23K19117, JP24K17297, JP21H04875. [1] Onésimo Hernández-Lerma and Jean B. Lasserre, Discrete-time Markov Control Processes: Basic Optimality Criteria, vol. 30, Springer-Verlag New York, 1996. [2] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, second edition, 2018. [3] Tuomas Haarnoja, Sehoon Ha, Aurick Zhou, Jie Tan, George Tucker, and Sergey Levine, Learning to walk via deep reinforcement learning , in Robotics: Science and Systems, 2019. [4] B. Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, and Patrick Pérez, Deep reinforcement learning for autonomous driving: A survey , IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 4909 4926, 2022. [5] Sergey Levine, Reinforcement learning and control as probabilistic inference: Tutorial and review , ar Xiv preprint ar Xiv:1805.00909, 2018. [6] Brian D. Ziebart, Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy, Ph D thesis, Carnegie Mellon University, 2010. [7] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine, Soft actor-critic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor , in International Conference on Machine Learning. PMLR, 2018, pp. 1861 1870. [8] Benjamin Eysenbach and Sergey Levine, Maximum entropy RL (provably) solves some robust RL problems , in International Conference on Learning Representations, 2022. [9] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine, Soft actor-critic algorithms and applications , ar Xiv preprint ar Xiv:1812.05905, 2018. [10] Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans, On the global convergence rates of softmax policy gradient methods , in International Conference on Machine Learning. PMLR, 2020, vol. 119, pp. 6820 6829. [11] Yingzhen Li and Richard E. Turner, Rényi divergence variational inference , in Advances in Neural Information Processing Systems, 2016, vol. 29, pp. 1073 1081. 
[12] Alfréd Rényi, On measures of entropy and information , in Proceedings of the fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961, vol. 1, pp. 547 561. [13] Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt, Advances in variational inference , IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 2008 2026, 2019. [14] Peter Whittle, Risk-Sensitive Optimal Control, John Wiley & Sons, Ltd., 1990. [15] Emanuel Todorov, Linearly-solvable Markov decision problems , in Advances in Neural Information Processing Systems, 2006, vol. 19, pp. 1369 1376. [16] Krishnamurthy Dvijotham and Emanuel Todorov, A unifying framework for linearly solvable control , in 27th Conference on Uncertainty in Artificial Intelligence, 2011, pp. 179 186. [17] Ronald J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning , Machine Learning, vol. 8, pp. 229 256, 1992. [18] David Nass, Boris Belousov, and Jan Peters, Entropic risk measure in policy search , in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 1101 1106. [19] Erfaun Noorani and John S. Baras, Risk-sensitive REINFORCE: A Monte Carlo policy gradient algorithm for exponential performance criteria , in 2021 60th IEEE Conference on Decision and Control (CDC). IEEE, 2021, pp. 1522 1527. [20] Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Qi Sun, and Bo Cheng, Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors , IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6584 6598, 2022. [21] Jinyoung Choi, Christopher Dance, Jung-Eun Kim, Seulbin Hwang, and Kyung-sik Park, Risk-conditioned distributional soft actor-critic for risk-sensitive navigation , in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 8337 8344. [22] Hilbert J. Kappen, Path integrals and symmetry breaking for optimal control theory , Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 11, pp. P11011, 2005. [23] Emanuel Todorov, General duality between optimal control and estimation , in 2008 47th IEEE Conference on Decision and Control. IEEE, 2008, pp. 4286 4292. [24] Hilbert J. Kappen, Vicenç Gómez, and Manfred Opper, Optimal control as a graphical model inference problem , Machine Learning, vol. 87, pp. 159 182, 2012. [25] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar, On stochastic optimal control and reinforcement learning by approximate inference , in Proceedings of Robotics: Science and Systems, 2012. [26] Marc Toussaint, Robot trajectory optimization using approximate inference , in International Conference on Machine Learning, 2009, pp. 1049 1056. [27] Masashi Okada and Tadahiro Taniguchi, Variational inference MPC for Bayesian model-based reinforcement learning , in Conference on Robot Learning. PMLR, 2020, pp. 258 272. [28] Alexander Lambert, Fabio Ramos, Byron Boots, Dieter Fox, and Adam Fishman, Stein variational model predictive control , in Conference on Robot Learning. PMLR, 2021, vol. 155, pp. 1278 1297. [29] Ziyi Wang, Oswin So, Jason Gibson, Bogdan Vlahov, Manan S. Gandhi, Guan-Horng Liu, and Evangelos A. Theodorou, Variational inference MPC using Tsallis divergence , in Robotics: Science and Systems, 2021. [30] Yinlam Chow, Brandon Cui, Moon Kyung Ryu, and Mohammad Ghavamzadeh, Variational model-based policy optimization , ar Xiv preprint ar Xiv:2006.05443, 2020. 
[31] Marco C. Campi and Matthew R. James, Nonlinear discrete-time risk-sensitive optimal control , International Journal of Robust and Nonlinear Control, vol. 6, no. 1, pp. 1 19, 1996. [32] Ian R Petersen, Matthew R James, and Paul Dupuis, Minimax optimal control of stochastic uncertain systems with relative entropy constraints , IEEE Transactions on Automatic Control, vol. 45, no. 3, pp. 398 412, 2000. [33] Brendan O Donoghue, Variational Bayesian reinforcement learning with regret bounds , in Advances in Neural Information Processing Systems, 2021, vol. 34, pp. 28208 28221. [34] Vivek S. Borkar, Q-learning for risk-sensitive control , Mathematics of Operations Research, vol. 27, no. 2, pp. 294 311, 2002. [35] Yingjie Fei, Zhuoran Yang, Yudong Chen, and Zhaoran Wang, Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning , in Advances in Neural Information Processing Systems, 2021, vol. 34, pp. 20436 20446. [36] Javier Garcıa and Fernando Fernández, A comprehensive survey on safe reinforcement learning , Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437 1480, 2015. [37] Tobias Enders, James Harrison, and Maximilian Schiffer, Risk-sensitive soft actor-critic for robust deep reinforcement learning under distribution shifts , ar Xiv preprint ar Xiv:2402.09992, 2024. [38] Kaito Ito and Kenji Kashima, Kullback Leibler control for discrete-time nonlinear systems on continuous spaces , SICE Journal of Control, Measurement, and System Integration, vol. 15, no. 2, pp. 119 129, 2022. [39] Friedrich Liese and Igor Vajda, Convex Statistical Distances, Teubner, Leipzig, 1987. [40] Rami Atar, Kenny Chowdhary, and Paul Dupuis, Robust bounds on risk-sensitive functionals via Rényi divergence , SIAM/ASA Journal on Uncertainty Quantification, vol. 3, no. 1, pp. 18 33, 2015. [41] Tim Van Erven and Peter Harremos, Rényi divergence and Kullback Leibler divergence , IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797 3820, 2014. [42] Oliver Mihatsch and Ralph Neuneier, Risk-sensitive reinforcement learning , Machine Learning, vol. 49, pp. 267 290, 2002. [43] Erfaun Noorani, Christos Mavridis, and John Baras, Risk-sensitive reinforcement learning with exponential criteria , ar Xiv preprint ar Xiv:2212.09010, 2023. [44] Krishnamurthy Dvijotham and Emanuel Todorov, Inverse optimal control with linearlysolvable MDPs , in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 335 342. [45] Kaito Ito and Kenji Kashima, Maximum entropy optimal density control of discrete-time linear systems and Schrödinger bridges , IEEE Transactions on Automatic Control, vol. 69, no. 3, pp. 1536 1551, 2023. [46] Kaito Ito and Kenji Kashima, Maximum entropy density control of discrete-time linear systems with quadratic cost , To appear in IEEE Transactions on Automatic Control, 2025, ar Xiv preprint ar Xiv:2309.10662. [47] Peter Whittle, Risk-sensitive linear/quadratic/Gaussian control , Advances in Applied Probability, vol. 13, no. 4, pp. 764 777, 1981. [48] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. [49] Richard S Sutton, David Mc Allester, Satinder Singh, and Yishay Mansour, Policy gradient methods for reinforcement learning with function approximation , in Advances in Neural Information Processing Systems, 1999, vol. 12, pp. 1057 1063. 
[50] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann, Stable-baselines3: Reliable reinforcement learning implementations , Journal of Machine Learning Research, vol. 22, no. 268, pp. 1 8, 2021. [51] Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimization , ar Xiv preprint ar Xiv:1412.6980, 2014. A More details on Control as Inference In this appendix, we give more details on Ca I. As mentioned in (1), the distribution of the state and control input trajectory given optimality variables satisfies p(τ|O0:T ) p(τ, O0:T ) p(OT |x T ) t=0 p(Ot|xt, ut) t=0 p(xt+1|xt, ut)p(ut) where p(ut) = 1/µL(U) and p(τ, O0:T ) is defined so that P(τ B, O0:T = o0:T ) = Z B p(τ, o0:T )dτ for any o0:T {0, 1}T +1 and any Borel set B, where P denotes the probability. Therefore, we have p(τ|O0:T = 1) t=0 p(xt+1|xt, ut) t=0 ct(xt, ut) The posterior p(ut|xt, Ot:T = 1) given the optimality condition Ot:T = 1 is called the optimal policy. We emphasize that the optimality of p(ut|xt, Ot:T = 1) is defined by the condition Ot:T = 1 rather than by introducing a cost functional, unlike π (ut|xt) in (13). In the following, we drop = 1 for Ot. The optimal policy can be computed as follows. Define βt(xt, ut) := p(Ot:T |xt, ut), (37) ζt(xt) := p(Ot:T |xt). (38) Then, it holds that U p(Ot:T |xt, ut)p(ut|xt)dut = Z U βt(xt, ut)p(ut)dut = 1 µL(U) U βt(xt, ut)dut. In addition, we have βt(xt, ut) = p(Ot:T |xt, ut) = p(Ot|xt, ut)p(Ot+1:T |xt, ut) = p(Ot|xt, ut) Z X p(Ot+1:T |xt+1)p(xt+1|xt, ut)dxt+1 = p(Ot|xt, ut) Z X ζt+1(xt+1)p(xt+1|xt, ut)dxt+1, (40) ζT (x T ) = p(OT |x T ) = exp( c T (x T )), where we used p(Ot+1:T |xt, ut) = Z X p(Ot+1:T , xt+1|xt, ut)dxt+1 X p(Ot+1:T |xt+1, xt, ut)p(xt+1|xt, ut)dxt+1 X p(Ot+1:T |xt+1)p(xt+1|xt, ut)dxt+1. In terms of βt and ζt, the optimal policy can be written as p(ut|xt, Ot:T ) = p(xt, ut, Ot:T ) p(xt, Ot:T ) = p(Ot:T |xt, ut) p(Ot:T |xt) p(ut|xt) = βt(xt, ut) µL(U)ζt(xt). (41) Next, by the logarithmic transformation, we define Qt(xt, ut) := log βt(xt, ut) µL(U) , (42) Vt(xt) := log ζt(xt). (43) Then, by (41), the optimal policy satisfies p(ut|xt, Ot:T ) = exp ( Qt(xt, ut) + Vt(xt)) . (44) By (39), it holds that Vt(xt) = log Z U exp( Qt(xt, ut))dut By using (40), we obtain exp( Qt(xt, ut))µL(U) = exp( ct(xt, ut)) Z X ζt+1(xt+1)p(xt+1|xt, ut)dxt+1, which yields Qt(xt, ut) = ct(xt, ut) log Ep(xt+1|xt,ut) [exp( Vt+1(xt+1))] . (46) Here, we defined ct(xt, ut) := ct(xt, ut) + log µL(U). In summary, Proposition 1 holds. B Proof of Theorem 3 This appendix is devoted to the analysis of the following problem: minimize {πt}T 1 t=0 ηc T (x T ) + η ct(xt, ut) + ε log πt(ut|xt) !# subject to xt+1 = ft(xt, ut, wt), ut U, t [[0, T 1]], (48) ut πt( |x) given xt = x, (49) x0 Px0. (50) Here, {wt}T 1 t=0 is an independent sequence, x0 is independent of {wt}, ε > 0 is the regularization parameter, and η is the risk-sensitivity parameter satisfying η > ε 1, η = 0. Note that we do not assume the existence of densities p(xt+1|xt, ut), p(x0). To perform dynamic programming for Problem (47), define the value function and the Q-function as Vt(x) := inf {πs}T 1 s=t ηc T (x T ) + η cs(xs, us) + ε log πs(us|xs) ! xt = x t [[0, T 1]], x X, (51) VT (x) := c T (x), x X, Qt(x, u) := ct(x, u) + 1 η log E exp ηVt+1(ft(x, u, wt)) , t [[0, T 1]], x X, u U. Then, under the assumption that R U exp Qt(x,u ) ε du < , we prove that the unique optimal policy of Problem (47) is given by π t (u|x) := exp Qt(x,u) U exp Qt(x,u ) ε du , t [[0, T 1]], u U, x X. 
(53) First, by definition, we have Vt(x) = inf {πs}T 1 s=t U πt(u|x)E exp ηct(x, u) + εη log πt(u|x) + ηc T (x T ) cs(xs, us) + ε log πs(us|xs) xt = x, ut = u du = inf {πs}T 1 s=t U πt(u|x) exp ηct(x, u) + εη log πt(u|x) E exp ηc T (x T ) + η cs(xs, us) + ε log πs(us|xs) xt = x, ut = u du = inf πt 1 η log Z U πt(u|x) exp ηct(x, u) + εη log πt(u|x) E exp ηVt+1(ft(x, u, wt)) du . By the definition of the Q-function (52), we get Vt(x) = inf πt( |x) P(U) 1 η log Z U πt(u|x) exp(εη log πt(u|x)) exp(ηQt(x, u))du = inf πt( |x) P(U) 1 η log πt(u|x) 1+εη exp Qt(x, u) = inf πt( |x) P(U) 1 η log U exp Qt(x, u ) U πt(u|x)1+εηπ t (u|x) εηdu U exp Qt(x, u ) du + inf πt( |x) P(U) εD1+εη(πt( |x) π t ( |x)). Since D1+εη(πt( |x) π t ( |x)) attains its minimum value 0 if and only if πt( |x) = π t ( |x), we conclude that Vt(x) = ε log Z U exp Qt(x, u ) du , x X, (54) and the unique optimal policy of Problem (47) is given by (53). Moreover, π t can be rewritten as π t (u|x) = exp Qt(x, u) , t [[0, T 1]], u U, x X. (55) When considering the deterministic system xt+1 = ft(xt, ut), we immediately obtain the relation Qt(x, u) = ct(x, u) + Vt+1( ft(x, u)). (56) On the other hand, the unique optimal policy of the Max Ent control problem: minimize {πt}T 1 t=0 E c T (x T ) + ct(xt, ut) εH1(πt( |xt)) # is also given by (55) whose Q-function (52) is replaced by Qt(x, u) = ct(x, u) + E[Vt+1(ft(x, u, wt))]. Therefore, when the system is deterministic, the Q-function of the LP regularized risk-sensitive control problem (47) coincides with that of the Max Ent control problem (57). Consequently, the optimal policy of Problem (57) solves Problem (47) for any η > ε 1, η = 0 for deterministic systems. C Linear quadratic Gaussian setting In this appendix, we derive the regularized risk-sensitive optimal policy in the linear quadratic Gaussian setting. Theorem 9. Let p(xt+1|xt, ut) = N(Atxt + Btut, Σt) and ct(xt, ut) = (x t Qtxt + u t Rtut)/2, c T (x T ) = x T QT x T /2, where Σt, Qt, and Rt are positive definite matrices for any t, and N(µ, Σ) denotes the Gaussian distribution with mean µ and covariance Σ. Let X = Rnx, U = Rnu. Assume that there exists a solution {Πt}T t=0 to the following Riccati difference equation: Πt = Qt + A t Πt+1(I ηΣtΠt+1 + Bt R 1 t B t Πt+1) 1At, t [[0, T 1]], (58) ΠT = QT , (59) such that Σ 1 t ηΠt+1 is positive definite for any t [[0, T 1]]. Here, I denotes the identity matrix of appropriate dimension. Then, the unique optimal policy of Problem (9) is given by π t (u|x) = N u| (Rt + B t Πt+1(I ηΣtΠt+1) 1Bt) 1B t Πt+1(I ηΣtΠt+1) 1Atx, (Rt + BtΠt+1(I ηΣtΠt+1) 1Bt) 1 . (60) Proof. In this proof, for notational simplicity, we often drop the time index t as A, B. First, for t = T 1, the Q-function in (11) is QT 1(x, u) = 1 2 x 2 QT 1 + 1 2 u 2 RT 1 + 1 η log E h exp η 2 AT 1x + BT 1u + w T 1 2 ΠT i , where x 2 P := x Px for a symmetric matrix P. Here, we have 2 Ax + Bu + w T 1 2 ΠT i (2π)nx|ΣT 1| 2 w 2 Σ 1 T 1 + η 2 Ax + Bu + w 2 ΠT where |ΣT 1| denotes the determinant of ΣT 1, and 2 w 2 Σ 1 T 1 + η 2 Ax + Bu + w 2 ΠT w 2 Σ 1 ηΠ 2ηw Π(Ax + Bu) Ax + Bu 2 ηΠ . By the assumption that Σ 1 T 1 ηΠT is positive definite and a completion of squares argument, 2 w 2 Σ 1 T 1 + η 2 Ax + Bu + w 2 ΠT w (Σ 1 ηΠ) 1ηΠ(Ax + Bu) 2 Σ 1 ηΠ ηΠ(Ax + Bu) 2 (Σ 1 ηΠ) 1 Ax + Bu 2 ηΠ . 
Thus, we obtain Z 2 w 2 Σ 1 T 1 + η 2 Ax + Bu + w 2 ΠT (2π)nx|(Σ 1 ηΠ) 1| exp 1 2 ηΠ(Ax + Bu) 2 (Σ 1 ηΠ) 1 + 1 2 Ax + Bu 2 ηΠ Consequently, by (61) (63), the Q-function can be written as QT 1(x, u) = 1 2 x 2 QT 1 + 1 2 u 2 RT 1 + 1 2η ηΠ(AT 1x + BT 1u) 2 (Σ 1 T 1 ηΠT ) 1 2 AT 1x + BT 1u 2 Π + CQT 1 2 x 2 Q + 1 2 u 2 R + 1 2 Ax + Bu 2 ηΠ(Σ 1 ηΠ) 1Π+Π + CQT 1 2 x 2 Q + 1 2 u 2 R + 1 2 Ax + Bu 2 Π(I ηΣΠ) 1 + CQT 1, where the constant CQT 1 is independent of (x, u). Now, we adopt a completion of squares argument again: QT 1(x, u) = 1 u 2 R+B Π(I ηΣΠ) 1B + 2x A Π(I ηΠΣ) 1Bu + x 2 Q+A Π(I ηΣΠ) 1A u + (R + B Π(I ηΣΠ) 1B) 1B (I ηΠΣ) 1ΠAx 2 R+B Π(I ηΣΠ) 1B B (I ηΠΣ) 1ΠAx 2 (R+B Π(I ηΣΠ) 1B) 1 + x 2 Q+A Π(I ηΣΠ) 1A 2 u + (R + B ΠT (I ηΣΠT ) 1B) 1B ΠT (I ηΣΠT ) 1Ax 2 R+B ΠT (I ηΣΠT ) 1B 2 x 2 ΠT 1 + CQT 1. Here, we used ΠT (I ηΣT 1ΠT ) 1 = (I ηΠT ΣT 1) 1ΠT and ΠT 1 = QT 1 + A T 1ΠT (I ηΣT 1ΠT + BT 1R 1 T 1B T 1ΠT ) 1AT 1 = Q + A ΠT (I ηΣT 1ΠT ) 1A A ΠT (I ηΣT 1ΠT ) 1B (RT 1 + B ΠT (I ηΣT 1ΠT ) 1B) 1B (I ηΠT ΣT 1) 1ΠT A. Therefore, the optimal policy at t = T 1 is π T 1(u|x) = N u| (RT 1 + B ΠT (I ηΣT 1ΠT ) 1B) 1B ΠT (I ηΣT 1ΠT ) 1Ax, (RT 1 + B ΠT (I ηΣT 1ΠT ) 1B) 1 . (64) The value function is given by VT 1(x) = log Z Rnu exp( QT 1(x, u))du = 1 2 x 2 ΠT 1 + CVT 1, where CVT 1 does not depend on x. By applying the same argument as above for t = T 2, . . . , 0, we arrive at the optimal policy (60) and 2 x 2 Πt + CVt, (65) 2 u + (Rt + B Πt+1(I ηΣtΠt+1) 1B) 1B Πt+1(I ηΣtΠt+1) 1Ax 2 Rt+B Πt+1(I ηΣtΠt+1) 1B 2 x 2 Πt + CQt, (66) where CVt and CQt are independent of (x, u). This completes the proof. By the same argument as above, the optimal policy of the Rényi entropy regularized risk-sensitive control problem (17) in the linear quadratic Gaussian setting is also given by (60). D Proof of Lemma 5 First, we give the precise statement of Lemma 5. To this end, for a, b R, define Ba,b(U) := g : U R g is bounded below, Z U exp(ag(u))du < , Z U exp(bg(u))du < . Similarly, define Ba,b(U) for upper bounded functions. For given g : U R, a R, and α R \ {0, 1}, define Pa,g(U) := ρ P(U) U exp(ag(u))ρ(u)du < , Lα(U) := ρ P(U) U ρ(u)αdu < . If ρ Lα(U) and α (0, 1), then it holds that Hα(ρ) < . If α ( , 0) (1, ), we have Hα(ρ) > . Now, we are ready to state the duality lemma. Lemma 10. For β, γ R \ {0} such that β < γ and for g B{β, (γ β)}(U), it holds that U exp(βg(u))du = inf ρ L 1 γ γ β (U) U exp(γg(u))ρ(u)du 1 γ β H1 γ γ β (ρ) , and the unique optimal solution that minimizes the right-hand side of (68) is given by ρ(u) = exp ( (γ β)g(u)) R U exp( (γ β)g(u ))du , u U. (69) In addition, for h B{γ,γ β}(U), it holds that 1 γ log Z exp(γh(u))du = sup ρ L γ γ β (U) β log Z exp(βh(u))ρ(u)du + 1 γ β H γ γ β (ρ) , and the unique optimal solution that maximizes the right-hand side of (70) is given by ρ(u) = exp((γ β)h(u)) R exp((γ β)h(u ))du , u U. (71) Although the proof is similar to that of the duality between exponential integrals and Rényi divergence [40], it requires more careful analysis because we do not assume the upper boundedness of g and the lower boundedness of h, unlike in [40]. Proof. For notational simplicity, we often drop U as Lα. 
First, we note that it is sufficient to prove that for α > 0, α = 1, g B{α 1, 1}, and h B{α,1}, it holds that 1 α 1 log Z exp((α 1)g(u))du = inf ρ L1 α α log Z exp(αg(u))ρ(u)du H1 α(ρ) , 1 α log Z exp(αh(u))du = sup ρ Lα 1 α 1 log Z exp((α 1)h(u))ρ(u)du + Hα(ρ) , ρ (u) := exp( g(u)) R exp( g(u ))du , ρ (u) := exp(h(u)) R exp(h(u ))du (74) are the unique optimal solutions to (72), (73), respectively. To see this, note that if (72), (73) hold for α > 0, α = 1, they hold for any α R \ {0, 1}. Indeed, when α < 0, let α := 1 α > 1 and for h B{α,1}, let g := h. Since g B{ α 1, 1}, by (72), we have 1 α 1 log Z exp(( α 1) g(u))du = inf ρ L1 α α log Z exp( α g(u))ρ(u)du H1 α(ρ) . Therefore, it holds that α log Z exp(αh(u))du = inf ρ Lα 1 1 α log Z exp((α 1)h(u))ρ(u)du Hα(ρ) 1 α 1 log Z exp((α 1)h(u))ρ(u)du + Hα(ρ) , which means that for any α < 0 and any h Bα,1, (73) holds. Similarly, by considering h := g B{ α,1} for g B{α 1, 1}, we can see that for any α < 0 and any g B{α 1, 1}, (72) holds. Additionally, (72) and (73) with α = γ γ β , g = (γ β)eg, h = (γ β)eh coincide with (68), (70) where g and h are replaced by eg, eh. In what follows, for α > 0, α = 1, we prove (72). Note that when ρ L1 α, |H1 α(ρ)| < holds. Hence, for the minimization of (72), it is sufficient to consider ρ Pα,g L1 α. The density ρ defined in (74) fulfills ρ Pα,g L1 α because g B{α 1, 1}, and it can be easily seen that 1 α 1 log Z exp((α 1)g(u))du = 1 α log Z exp(αg(u))ρ (u)du H1 α(ρ ). (75) First, we consider the case α > 1. Define eρ(u) := exp((α 1)g(u)), ϕ(u) := exp( g(u)). Then, by Hölder s inequality, for any ρ Pα,g L1 α, it holds that Z eρ(u)du = Z ϕ(u) α 1 eρ(u)du ϕ(u) eρ(u)du α 1 = Z ρ(u)1 αdu 1 α Z exp(αg(u))ρ(u)du α 1 Noting that α 1 > 0 and taking the logarithm of (76), we get for any ρ Pα,g L1 α, 1 α 1 log Z exp((α 1)g(u))du 1 α log Z exp(αg(u))ρ(u)du H1 α(ρ). Combining this with (75), the relation (72) holds, and by (75), ρ in (74) is an optimal solution. The equality of Hölder s inequality (76) holds if and only if there exist a1, a2 0, a1a2 = 0 such that ρ(u) 1 α = a2 ρ(u) ϕ(u) holds eµ-almost everywhere. Here, eµ is the measure defined by eρ. This condition is satisfied only for ρ , that is, it is an unique optimal solution. Next, we analyze the case α (0, 1). By Hölder s inequality, for any ρ Pα,g, α 1 eρ(u)du Z 11/αeρ(u)du α α 1# 1 1 α eρ(u)du = Z eρ(u)du α Z ρ(u) ϕ(u) eρ(u)du 1 α , which yields 1 α 1 log Z exp((α 1)g(u))du 1 Z exp(αg(u))ρ(u)du H1 α(ρ), ρ Pα,g. Then, similar to the case α > 1, it can be seen that for α (0, 1), (72) holds and ρ is a unique optimal solution. Next, we show (73) for α > 1. Since α > 1 and h is upper bounded, it holds that ρ Pα 1,h. The density ρ defined in (74) satisfies ρ Pα 1,h Lα because h B{α,1}, and one can easily see that 1 α log Z exp(αh(u))du = 1 α 1 log Z exp((α 1)h(u))ρ (u)du + Hα(ρ ). Define bρ(u) := exp((α 1)h(u))ρ(u), λ(u) := exp( h(u))ρ(u). Then, by Hölder s inequality, for any ρ Lα, it holds that Z bρ(u)du = Z λ(u) α 1 Z λ(u)α 1bρ(u)du 1 α Z λ(u) 1bρ(u)du α 1 = Z ρ(u)αdu 1 α Z exp(αh(u))du α 1 It follows from the above that for any ρ Lα, 1 α 1 log Z exp((α 1)h(u))ρ(u)du 1 α log Z exp(αh(u))du Hα(ρ). Hence, by the same argument as for (72), we can show that (73) holds for α > 1, and ρ is a unique optimal solution. Lastly, we show (73) for α (0, 1). For ρ Lα, it holds that |Hα(ρ)| < . Then, noting that α 1 < 0, it is sufficient to perform the maximization in (73) for ρ Pα 1,h Lα. 
By Hölder s inequality, for any ρ Pα 1,h, we have Z ραdu = Z λ(u)α 1bρ(u)du Z 11/αbρ(u)du α (λ(u)α 1) 1 1 α bρ(u)du 1 α = Z exp((α 1)h(u))ρ(u)du α Z exp(αh(u))du 1 α . Therefore, 1 α 1 log Z exp((α 1)h(u))ρ(u)du 1 α log Z exp(αh(u))du Hα(ρ), and similar to the case α > 1, we arrive at (73) for α (0, 1), and the unique optimal solution is ρ . This completes the proof. E Proof of Theorem 6 In this appendix, we analyze the following problem: minimize {πt}T 1 t=0 ηc T (x T ) + η ct(xt, ut) εH1 εη(πt( |xt)) !# where ε > 0, η R \ {0, ε 1}, the system is given by (48) (50), and πt( |x) L1 εη(U) := {ρ P(U) | R U ρ(u)1 εηdu < } for any x X and t [[0, T 1]]. Define the value function and the Q-function associated with (78) as Vt(x) := inf {πs}T 1 s=t ηc T (x T ) + η cs(xs, us) εH1 εη(πs( |xs)) ! xt = x t [[0, T 1]], x X, (79) VT (x) := c T (x), x X, Qt(x, u) := ct(x, u) + 1 η log E exp ηVt+1(ft(x, u, wt)) , t [[0, T 1]], x X, u U. (80) For the analysis, we assume the following conditions. Assumption 11. For any t [[0, T]], ct is bounded below. Assumption 12. The Q-function Qt in (80) satisfies U exp Qt(x, u) U exp (1 εη)Qt(x, u) for any x X and t [[0, T 1]]. For example, when ct is bounded for any t [[0, T]], Qt is also bounded, and in addition, if µL(U) < , (81) holds. In the linear quadratic setting, Assumption 12 also holds without the boundedness of ct and U. Now, we prove Theorem 6 by induction. First, for t = T 1, we have VT 1(x) = inf πT 1( |x) L1 εη(U) εH1 εη(πT 1( |x)) U πT 1(u|x)E exp (ηc T 1(x, u) + ηc T (x T )) x T 1 = x, u T 1 = u du . The derivation is same as (85) and (86). By the definition of the Q-function in (80), it holds that VT 1(x) = inf πT 1( |x) L1 εη(U) U πT 1(u|x) exp(ηQT 1(x, u))du εH1 εη(πT 1( |x)) . Since c T and c T 1 are bounded below, QT 1 is also bounded below. Therefore, by Assumption 12, QT 1(x, ) B (ε 1 η), ε 1(U) (see (67) for the definition of Ba,b), and we can apply Lemma 10 with β = (ε 1 η), γ = η to (82). As a result, VT 1(x) = 1 ε 1 η log Z U exp (ε 1 η)QT 1(x, u) du , (83) and the unique optimal policy that minimizes the right-hand side of (82) is π T 1(u|x) = exp QT 1(x,u) U exp QT 1(x,u ) ε du . (84) Moreover, since QT 1 is bounded below, VT 1 is also bounded below. Next, we assume the induction hypothesis that for some t [[0, T 2]], {π s}T 1 s=t+1 is the unique optimal policy of the minimization in the definition of Vt+1, and Vt+1 is bounded below. By definition, Vt(x) = inf {πs}T 1 s=t ηct(x, ut) εηH1 εη(πt( |x)) + ηc T (x T ) cs(xs, us) εH1 εη(πs( |xs)) ! xt = x = inf {πs}T 1 s=t εH1 εη(πt( |x)) ηct(x, ut) + ηc T (x T ) + η cs(xs, us) εH1 εη(πs( |xs)) ! xt = x = inf {πs}T 1 s=t εH1 εη(πt( |x)) + 1 exp ηct(x, u) + ηc T (x T ) cs(xs, us) εH1 εη(πs( |xs)) xt = x, ut = u = inf πt( |x) L1 εη(U) εH1 εη(πt( |x)) + 1 "Z πt(u|x) exp(ηct(x, u)) E{π s }T 1 s=t+1 exp ηc T (x T ) + η cs(xs, us) εH1 εη(π s( |xs)) xt = x, ut = u du Moreover, noting that exp(ηVt+1(x)) = E{π s }T 1 s=t+1 ηc T (x T ) + η cs(xs, us) εH1 εη(π s( |xs)) ! xt+1 = x Vt(x) = inf πt( |x) L1 εη(U) εH1 εη(πt( |x)) η log Z πt(u|x) exp(ηct(x, u))E exp ηVt+1(ft(x, u, wt)) du . By using Qt, the above equation can be written as Vt(x) = inf πt( |x) L1 εη(U) 1 η log Z U πt(u|x) exp(ηQt(x, u))du εH1 εη(πt( |x)). (87) Since we assumed that Vt+1 is bounded below, Qt is also bounded below. By combining this with Assumption 12, it holds that Qt(x, ) B (ε 1 η), ε 1(U). 
Thus, by Lemma 10, the unique optimal policy that minimizes the right-hand side of (87) is
$$\pi^{*}_{t}(u|x) = \frac{\exp\bigl(-Q_{t}(x,u)/\varepsilon\bigr)}{\int_{\mathcal{U}}\exp\bigl(-Q_{t}(x,u')/\varepsilon\bigr)\,du'}, \tag{88}$$
and
$$V_t(x) = -\frac{1}{\varepsilon^{-1}-\eta}\log\int_{\mathcal{U}}\exp\bigl(-(\varepsilon^{-1}-\eta)Q_t(x,u)\bigr)\,du. \tag{89}$$
Lastly, since $Q_t$ is bounded below, $V_t$ is also bounded below. This completes the induction step, and we obtain Theorem 6.

F Proof of Proposition 7

By using the relation $\nabla_\theta\log p_\theta(\tau)=\nabla_\theta p_\theta(\tau)/p_\theta(\tau)$, we obtain
$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\exp(\eta C_\theta(\tau))\bigl(\eta\nabla_\theta C_\theta(\tau)+\nabla_\theta\log p_\theta(\tau)\bigr)\,d\tau.$$
In addition, by the expression $p_\theta(\tau)=p(x_0)\prod_{t=0}^{T-1}p(x_{t+1}|x_t,u_t)\pi^{(\theta)}(u_t|x_t)$,
$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\exp(\eta C_\theta(\tau))\left(\eta\sum_{t=0}^{T-1}\nabla_\theta\log\pi^{(\theta)}(u_t|x_t)+\sum_{t=0}^{T-1}\nabla_\theta\log\pi^{(\theta)}(u_t|x_t)\right)d\tau = (\eta+1)\,\mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1}\nabla_\theta\log\pi^{(\theta)}(u_t|x_t)\exp\left(\eta c_T(x_T)+\eta\sum_{t=0}^{T-1}\bigl(c_t(x_t,u_t)+\log\pi^{(\theta)}(u_t|x_t)\bigr)\right)\right]. \tag{90}$$
Note that for any $h:\mathcal{X}^{t+1}\times\mathcal{U}^{t+1}\to\mathbb{R}$, it holds that
$$\begin{aligned}
\mathbb{E}\bigl[h(x_{0:t},u_{0:t})\bigr] &= \int h(x_{0:t},u_{0:t})\,p(x_0)\prod_{s=0}^{T-1}p(x_{s+1}|x_s,u_s)\pi^{(\theta)}(u_s|x_s)\,dx_{0:T}\,du_{0:T-1}\\
&= \int h(x_{0:t},u_{0:t})\,p(x_0)\prod_{s=0}^{T-2}p(x_{s+1}|x_s,u_s)\pi^{(\theta)}(u_s|x_s)\left(\int p(x_T|x_{T-1},u_{T-1})\pi^{(\theta)}(u_{T-1}|x_{T-1})\,dx_T\,du_{T-1}\right)dx_{0:T-1}\,du_{0:T-2}\\
&= \int h(x_{0:t},u_{0:t})\,p(x_0)\prod_{s=0}^{T-2}p(x_{s+1}|x_s,u_s)\pi^{(\theta)}(u_s|x_s)\,dx_{0:T-1}\,du_{0:T-2}\\
&\ \ \vdots\\
&= \int h(x_{0:t},u_{0:t})\,\pi^{(\theta)}(u_t|x_t)\,p(x_0)\prod_{s=0}^{t-1}p(x_{s+1}|x_s,u_s)\pi^{(\theta)}(u_s|x_s)\,dx_{0:t}\,du_{0:t}.
\end{aligned}$$
It follows from the above that
$$\begin{aligned}
&\mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta\log\pi^{(\theta)}(u_t|x_t)\exp\left(\eta\sum_{s=0}^{t-1}\bigl(c_s(x_s,u_s)+\log\pi^{(\theta)}(u_s|x_s)\bigr)\right)\right]\\
&= \int\nabla_\theta\log\pi^{(\theta)}(u_t|x_t)\exp\left(\eta\sum_{s=0}^{t-1}\bigl(c_s(x_s,u_s)+\log\pi^{(\theta)}(u_s|x_s)\bigr)\right)\pi^{(\theta)}(u_t|x_t)\,p(x_0)\prod_{s=0}^{t-1}p(x_{s+1}|x_s,u_s)\pi^{(\theta)}(u_s|x_s)\,dx_{0:t}\,du_{0:t}\\
&= \int\nabla_\theta\pi^{(\theta)}(u_t|x_t)\exp\left(\eta\sum_{s=0}^{t-1}\bigl(c_s(x_s,u_s)+\log\pi^{(\theta)}(u_s|x_s)\bigr)\right)p(x_0)\prod_{s=0}^{t-1}p(x_{s+1}|x_s,u_s)\pi^{(\theta)}(u_s|x_s)\,dx_{0:t}\,du_{0:t}\\
&= \int\left(\nabla_\theta\int\pi^{(\theta)}(u_t|x_t)\,du_t\right)\exp\left(\eta\sum_{s=0}^{t-1}\bigl(c_s(x_s,u_s)+\log\pi^{(\theta)}(u_s|x_s)\bigr)\right)p(x_0)\prod_{s=0}^{t-1}p(x_{s+1}|x_s,u_s)\pi^{(\theta)}(u_s|x_s)\,dx_{0:t}\,du_{0:t-1} = 0,
\end{aligned}$$
where the last equality follows from $\int\pi^{(\theta)}(u_t|x_t)\,du_t=1$. By combining this with (90), we get
$$\nabla_\theta J(\theta) = (\eta+1)\,\mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1}\nabla_\theta\log\pi^{(\theta)}(u_t|x_t)\left(\exp\left(\eta c_T(x_T)+\eta\sum_{s=0}^{T-1}\bigl(c_s(x_s,u_s)+\log\pi^{(\theta)}(u_s|x_s)\bigr)\right)-\exp\left(\eta\sum_{s=0}^{t-1}\bigl(c_s(x_s,u_s)+\log\pi^{(\theta)}(u_s|x_s)\bigr)\right)\right)\right].$$
Lastly, for any function $b:\mathbb{R}^n\to\mathbb{R}$, it holds that
$$\mathbb{E}_{p_\theta(\tau)}\bigl[\nabla_\theta\log\pi^{(\theta)}(u_t|x_t)\,b(x_t)\bigr] = \int p_\theta(x_t,u_t)\frac{\nabla_\theta\pi^{(\theta)}(u_t|x_t)}{\pi^{(\theta)}(u_t|x_t)}\,b(x_t)\,dx_t\,du_t = \int p(x_t)\,b(x_t)\left(\int\nabla_\theta\pi^{(\theta)}(u_t|x_t)\,du_t\right)dx_t = 0.$$
This completes the proof.

G Proof of Proposition 8

By definition,
$$\pi^{*}_t = \underset{\pi_t\in\mathcal{P}(\mathcal{U})}{\arg\min}\ \mathbb{E}_{p^{\pi}}\left[\left(\frac{\prod_{s=0}^{T-1}\pi_s(u_s|x_s)}{p(O_T|x_T)\prod_{s=0}^{T-1}p(O_s|x_s,u_s)}\right)^{\eta}\right].$$
The term between the brackets is
$$\begin{aligned}
&\int p^{\pi}(x_{0:t},u_{0:t})\int p^{\pi}(x_{t+1:T},u_{t+1:T}|x_t,u_t)\left(\frac{\prod_{s=0}^{T-1}\pi_s(u_s|x_s)}{p(O_T|x_T)\prod_{s=0}^{T-1}p(O_s|x_s,u_s)}\right)^{\eta}dx_{t+1:T}\,du_{t+1:T}\,dx_{0:t}\,du_{0:t}\\
&= \int p^{\pi}(x_{0:t},u_{0:t})\left(\frac{\prod_{s=0}^{t-1}\pi_s(u_s|x_s)}{\prod_{s=0}^{t-1}p(O_s|x_s,u_s)}\right)^{\eta}M\,dx_{0:t}\,du_{0:t},\\
M &= \pi_t(u_t|x_t)^{\eta}\int p^{\pi}(x_{t+1:T},u_{t+1:T}|x_t,u_t)\left(\frac{\prod_{s=t+1}^{T-1}\pi_s(u_s|x_s)}{p(O_T|x_T)\prod_{s=t}^{T-1}p(O_s|x_s,u_s)}\right)^{\eta}dx_{t+1:T}\,du_{t+1:T}.
\end{aligned}$$
In addition, by the expression $p^{\pi}(x_{0:t},u_{0:t})=p(x_0)\pi_t(u_t|x_t)\prod_{s=0}^{t-1}p(x_{s+1}|x_s,u_s)\pi_s(u_s|x_s)$,
$$\pi^{*}_t = \underset{\pi_t}{\arg\min}\int\pi_t(u_t|x_t)^{1+\eta}\,\mathbb{E}_{p^{\pi}(x_{t+1:T},u_{t+1:T}|x_t,u_t)}\left[\left(\frac{\prod_{s=t+1}^{T-1}\pi_s(u_s|x_s)}{p(O_T|x_T)\prod_{s=t}^{T-1}p(O_s|x_s,u_s)}\right)^{\eta}\right]dx_t\,du_t. \tag{94}$$
Define
$$\hat\pi_t(u_t|x_t) := \frac{1}{Z_t(x_t)}\left(\mathbb{E}_{p^{\pi}(x_{t+1:T},u_{t+1:T}|x_t,u_t)}\left[\left(\frac{\prod_{s=t+1}^{T-1}\pi_s(u_s|x_s)}{p(O_T|x_T)\prod_{s=t}^{T-1}p(O_s|x_s,u_s)}\right)^{\eta}\right]\right)^{-1/\eta},$$
$$Z_t(x_t) := \int\left(\mathbb{E}_{p^{\pi}(x_{t+1:T},u_{t+1:T}|x_t,u_t)}\left[\left(\frac{\prod_{s=t+1}^{T-1}\pi_s(u_s|x_s)}{p(O_T|x_T)\prod_{s=t}^{T-1}p(O_s|x_s,u_s)}\right)^{\eta}\right]\right)^{-1/\eta}du_t.$$
Then, (94) can be rewritten as
$$\pi^{*}_t = \underset{\pi_t}{\arg\min}\int_{\mathcal{X}}Z_t(x_t)^{-\eta}\int_{\mathcal{U}}\left(\frac{\pi_t(u_t|x_t)}{\hat\pi_t(u_t|x_t)}\right)^{1+\eta}\hat\pi_t(u_t|x_t)\,du_t\,dx_t. \tag{98}$$
By Jensen's inequality, for any $\eta>-1$, $\eta\neq 0$, it holds that
$$\int_{\mathcal{X}}Z_t(x_t)^{-\eta}\int_{\mathcal{U}}\left(\frac{\pi_t(u_t|x_t)}{\hat\pi_t(u_t|x_t)}\right)^{1+\eta}\hat\pi_t(u_t|x_t)\,du_t\,dx_t \ \ge\ \int_{\mathcal{X}}Z_t(x_t)^{-\eta}\left(\int_{\mathcal{U}}\frac{\pi_t(u_t|x_t)}{\hat\pi_t(u_t|x_t)}\,\hat\pi_t(u_t|x_t)\,du_t\right)^{1+\eta}dx_t = \int_{\mathcal{X}}Z_t(x_t)^{-\eta}\,dx_t,$$
where the equality holds if and only if $\pi_t(\cdot|x_t)\ll\hat\pi_t(\cdot|x_t)$ and $\pi_t(\cdot|x_t)/\hat\pi_t(\cdot|x_t)$ is constant $\hat{\mathbb{P}}_{x_t}$-almost everywhere. Here, $\hat{\mathbb{P}}_{x_t}$ is the probability distribution associated with $\hat\pi_t(\cdot|x_t)$. Hence, the infimum in (98) is attained only by $\pi_t=\hat\pi_t$. This completes the proof.
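The full-trajectory gradient expression (90) translates directly into a Monte Carlo estimator. The sketch below assumes a scalar linear-Gaussian policy $u_t=\theta x_t+\sigma w_t$ with an analytic score function and a made-up one-dimensional system and cost; none of these modeling choices come from the paper, and the sketch implements only the weighting in (90), not the variance-reduced form or a state-dependent baseline.

```python
import numpy as np

# Monte Carlo sketch of the risk-sensitive policy-gradient weighting in (90):
#   grad J ~ (eta + 1) * mean over trajectories of
#            (sum_t d/dtheta log pi(u_t|x_t)) * exp(eta * C_theta(tau)).
# Assumptions: a 1-D linear-Gaussian policy and a toy environment, chosen only
# to make the estimator concrete; they are not taken from the paper.

rng = np.random.default_rng(0)
theta, sigma, eta, T, N = 0.5, 0.3, 0.1, 20, 256

def rollout():
    x, states, actions, cost = 1.0, [], [], 0.0
    for _ in range(T):
        u = theta * x + sigma * rng.standard_normal()
        states.append(x); actions.append(u)
        cost += x**2 + 0.1 * u**2                                   # stage cost c_t(x_t, u_t)
        # log-probability regularization term log pi(u_t|x_t) of C_theta
        cost += -0.5 * ((u - theta * x) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
        x = 0.9 * x + 0.5 * u + 0.05 * rng.standard_normal()
    cost += x**2                                                    # terminal cost c_T(x_T)
    return np.array(states), np.array(actions), cost

grad = 0.0
for _ in range(N):
    xs, us, C = rollout()
    score = np.sum((us - theta * xs) * xs) / sigma**2   # sum_t d/dtheta log pi(u_t|x_t)
    grad += score * np.exp(eta * C)
grad *= (eta + 1) / N
print(grad)   # risk-sensitive gradient estimate; eta -> 0 recovers the risk-neutral weighting
```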
H Details of the experiment

The implementation of the risk-sensitive SAC (RSAC) algorithm follows the stable-baselines3 [50] version of the SAC algorithm, which means that the RSAC algorithm also implements some tricks including reparameterization, minibatch sampling with a replay buffer, target networks, and a double Q-network. The hyperparameters listed in Table 1 are shared by the SAC and RSAC algorithms.

Table 1: SAC and RSAC Hyperparameters

Parameter | Value
optimizer | Adam [51]
learning rate | 10^-3
discount factor | 0.99
regularization coefficient | 0.1
target smoothing coefficient | 0.005
replay buffer size | 10^5
number of critic networks | 2
number of hidden layers (all networks) | 2
number of hidden units per layer | 256
number of samples per minibatch | 256
activation function | ReLU

As mentioned in Section 5, with these hyperparameters there were no significant differences in the obtained control performance or in the training behavior shown in Fig. 5. However, when η is too small or too large, the training process becomes unstable due to vanishing gradients or exponentially growing gradients, respectively, leading to training failure. For this reason, we compare the robustness of the policies trained with RSAC (η ∈ {−0.02, −0.01, 0.01, 0.02}) and with the standard SAC, which corresponds to η = 0. For each learned policy, we run 20 trials; in each trial, we take 100 sample paths to compute the average episode cost. In Fig. 3, the error bars depict the maximum and minimum values, and the points depict the mean over the 20 trials. To test the robustness of the learned policies, we change the length of the pole l in the Pendulum-v1 environment (l = 1.0 m in the original environment). For the training, we used an Ubuntu 20.04 server (GPU: NVIDIA GeForce RTX 2080 Ti). The code is available at https://github.com/kaito-1111/risk-sensitive-sac.git.

[Figure 5: Training process of RSAC (with different η) and SAC in terms of average episode cost. Axes: average episode cost vs. learning steps.]

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The main claims are made based on our theoretical results (Theorems 2, 3, 6, and Propositions 7, 8).
Guidelines:
The answer NA means that the abstract and introduction do not include the claims made in the paper.
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The limitations are discussed in Section 6.
Guidelines:
The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
The authors are encouraged to create a separate "Limitations" section in their paper.
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Assumptions and a complete proof of all our results (Theorems 2, 3, 6, and Propositions 7, 8) are provided in the main paper and appendix.
Guidelines:
The answer NA means that the paper does not include theoretical results.
All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
All assumptions should be clearly stated or referenced in the statement of any theorems.
The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: All the information is disclosed in Appendix H.
Guidelines:
The answer NA means that the paper does not include experiments.
If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
(a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide open access to the code via GitHub.
Guidelines:
The answer NA means that the paper does not include experiments requiring code.
Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: All the training and test details are given in Appendix H.
Guidelines:
The answer NA means that the paper does not include experiments.
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We report error bars in Fig. 3.
Guidelines:
The answer NA means that the paper does not include experiments.
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
The assumptions made should be given (e.g., Normally distributed errors).
It should be clear whether the error bar is the standard deviation or the standard error of the mean.
It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The information on the computer resources is provided in Appendix H.
Guidelines:
The answer NA means that the paper does not include experiments.
The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: This work does not involve human subjects or participants, and there are no data-related concerns such as privacy issues.
Guidelines:
The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The contribution of this paper is theoretical and we do not anticipate any direct societal impact of the work.
Guidelines:
The answer NA means that there is no societal impact of the work performed.
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: In this work, we do not need data or models that have a high risk for misuse.
Guidelines:
The answer NA means that the paper poses no such risks.
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: For the experiment, we use OpenAI Gym, and it is properly mentioned.
Guidelines:
The answer NA means that the paper does not use existing assets.
The authors should cite the original paper that produced the code package or dataset.
The authors should state which version of the asset is used and, if possible, include a URL.
The name of the license (e.g., CC-BY 4.0) should be included for each asset.
For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We submit the documentation as a supplementary material.
Guidelines:
The answer NA means that the paper does not release new assets.
Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
The paper should discuss whether and how consent was obtained from people whose asset is used.
At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.