# Constrained Risk-Averse Markov Decision Processes

Mohamadreza Ahmadi¹, Ugo Rosolia¹, Michel D. Ingham², Richard M. Murray¹, and Aaron D. Ames¹
¹California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125
²NASA Jet Propulsion Laboratory, 4800 Oak Grove Dr, Pasadena, CA 91109
mrahmadi@caltech.edu

We consider the problem of designing policies for Markov decision processes (MDPs) with dynamic coherent risk objectives and constraints. We begin by formulating the problem in a Lagrangian framework. Under the assumption that the risk objectives and constraints can be represented by a Markov risk transition mapping, we propose an optimization-based method to synthesize Markovian policies that lower-bound the constrained risk-averse problem. We demonstrate that the formulated optimization problems are in the form of difference convex programs (DCPs) and can be solved by the disciplined convex-concave programming (DCCP) framework. We show that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints. Finally, we illustrate the effectiveness of the proposed method with numerical experiments on a rover navigation problem involving the conditional value-at-risk (CVaR) and entropic value-at-risk (EVaR) coherent risk measures.

## Introduction

With the rise of autonomous systems being deployed in real-world settings, the associated risk that stems from unknown and unforeseen circumstances is correspondingly on the rise. In particular, in risk-sensitive scenarios, such as aerospace applications, decision making should account for uncertainty and minimize the impact of unfortunate events. For example, spacecraft control technology relies heavily on a relatively large and highly skilled mission operations team that generates detailed time-ordered and event-driven sequences of commands. This approach will not be viable in the future with an increasing number of missions and a desire to limit operations team and Deep Space Network (DSN) costs. In order to maximize the science returns under these conditions, the ability to deal with emergencies and safely explore remote regions is becoming increasingly important (McGhan et al. 2016). For instance, in Mars rover navigation problems, finding planning policies that minimize risk is critical to mission success, due to the uncertainties present in Mars surface data (Ono et al. 2018) (see Figure 1).

Figure 1: Mars surface (Eberswalde Crater) slope uncertainty for rover navigation: the slopes range within (blue) 5–10°, (green) 10–15°, (yellow) 15–20°, (orange) 20–25°, (red) above 25°, and (the rest) below 5° or no data.

Risk can be quantified in numerous ways, such as chance constraints (Ono et al. 2015; Wang, Jasour, and Williams 2020). However, applications in autonomy and robotics require more nuanced assessments of risk (Majumdar and Pavone 2020). Artzner et al. (1999) characterized a set of natural properties that are desirable for a risk measure, called a coherent risk measure; coherent risk measures have obtained widespread acceptance in finance and operations research, among other fields. An important example of a coherent risk measure is the conditional value-at-risk (CVaR), which has received significant attention in decision making problems, such as Markov decision processes (MDPs) (Chow et al.
2015; Chow and Ghavamzadeh 2014; Prashanth 2014; Bäuerle and Ott 2011). In many real-world applications, risk-averse decision making should also account for system constraints, such as fuel budgets and communication rates. In this paper, we attempt to address this issue and therefore consider MDPs with both total coherent risk costs and constraints. Using a Lagrangian framework and properties of coherent risk measures, we propose an optimization problem whose solution provides a lower bound to the constrained risk-averse MDP problem. We show that this result is indeed a generalization of constrained MDPs with total expected costs and constraints (Altman 1999). For general coherent risk measures, we show that this optimization problem is a difference convex program (DCP) and propose a method based on disciplined convex-concave programming to solve it. We illustrate our proposed method with a numerical example of path planning under uncertainty with not only the CVaR risk measure but also the more recently proposed entropic value-at-risk (EVaR) (Ahmadi-Javid 2012; McAllister et al. 2020) coherent risk measure.

Notation: We denote by $\mathbb{R}^n$ the $n$-dimensional Euclidean space and by $\mathbb{N}_{\geq 0}$ the set of non-negative integers. Throughout the paper, we use bold font to denote a vector and $(\cdot)^\top$ for its transpose, e.g., $\boldsymbol{a} = (a_1, \ldots, a_n)^\top$, with $n \in \{1, 2, \ldots\}$. For a vector $\boldsymbol{a}$, we use $\boldsymbol{a} \geq 0$ ($\boldsymbol{a} \leq 0$) to denote element-wise non-negativity (non-positivity) and $\boldsymbol{a} \equiv 0$ to show that all elements of $\boldsymbol{a}$ are zero. For two vectors $\boldsymbol{a}, \boldsymbol{b} \in \mathbb{R}^n$, we denote their inner product by $\langle \boldsymbol{a}, \boldsymbol{b} \rangle$, i.e., $\langle \boldsymbol{a}, \boldsymbol{b} \rangle = \boldsymbol{a}^\top \boldsymbol{b}$. For a finite set $\mathcal{A}$, we denote its power set by $2^{\mathcal{A}}$, i.e., the set of all subsets of $\mathcal{A}$. For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a constant $p \in [1, \infty)$, $\mathcal{L}_p(\Omega, \mathcal{F}, \mathbb{P})$ denotes the vector space of real-valued random variables $c$ for which $\mathbb{E}|c|^p < \infty$.

## Preliminaries

In this section, we briefly review some notions and definitions used throughout the paper.

### Markov Decision Processes

An MDP is a tuple $\mathcal{M} = (\mathcal{S}, Act, T, \kappa_0)$ consisting of a set of states $\mathcal{S} = \{s_1, \ldots, s_{|\mathcal{S}|}\}$ of the autonomous agent(s) and world model, actions $Act = \{\alpha_1, \ldots, \alpha_{|Act|}\}$ available to the robot, a transition function $T(s_j \mid s_i, \alpha)$, and an initial distribution $\kappa_0$ over the states. This paper considers finite Markov decision processes, where $\mathcal{S}$ and $Act$ are finite sets. For each action, the probability of making a transition from state $s_i \in \mathcal{S}$ to state $s_j \in \mathcal{S}$ under action $\alpha \in Act$ is given by $T(s_j \mid s_i, \alpha)$. The probabilistic components of a Markov decision process must satisfy
$$\sum_{s \in \mathcal{S}} T(s \mid s_i, \alpha) = 1, \quad \forall s_i \in \mathcal{S},\ \forall \alpha \in Act, \qquad \text{and} \qquad \sum_{s \in \mathcal{S}} \kappa_0(s) = 1.$$

### Coherent Risk Measures

Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a filtration $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_N \subset \mathcal{F}$, and an adapted sequence of random variables (stage-wise costs) $c_t$, $t = 0, \ldots, N$, where $N \in \mathbb{N}_{\geq 0} \cup \{\infty\}$. For $t = 0, \ldots, N$, we further define the spaces $\mathcal{C}_t = \mathcal{L}_p(\Omega, \mathcal{F}_t, \mathbb{P})$, $p \in [1, \infty)$, $\mathcal{C}_{t:N} = \mathcal{C}_t \times \cdots \times \mathcal{C}_N$, and $\mathcal{C} = \mathcal{C}_0 \times \mathcal{C}_1 \times \cdots$. We assume that the sequence $\boldsymbol{c} \in \mathcal{C}$ is almost surely bounded (with exceptions having probability zero), i.e., $\max_t \operatorname{ess\,sup} |c_t(\omega)| < \infty$. In order to describe how one can evaluate the risk of the subsequence $c_t, \ldots, c_N$ from the perspective of stage $t$, we require the following definitions.

**Definition 1 (Conditional Risk Measure).** A mapping $\rho_{t:N} : \mathcal{C}_{t:N} \to \mathcal{C}_t$, where $0 \leq t \leq N$, is called a conditional risk measure if it has the following monotonicity property: $\rho_{t:N}(\boldsymbol{c}) \leq \rho_{t:N}(\boldsymbol{c}')$ for all $\boldsymbol{c}, \boldsymbol{c}' \in \mathcal{C}_{t:N}$ such that $\boldsymbol{c} \leq \boldsymbol{c}'$.

**Definition 2 (Dynamic Risk Measure).**
A dynamic risk measure is a sequence of conditional risk measures $\rho_{t:N} : \mathcal{C}_{t:N} \to \mathcal{C}_t$, $t = 0, \ldots, N$.

One fundamental property of dynamic risk measures is their consistency over time (Ruszczyński 2010, Definition 3). That is, if $\boldsymbol{c}$ will be as good as $\boldsymbol{c}'$ from the perspective of some future time $\theta$, and they are identical between times $\tau$ and $\theta$, then $\boldsymbol{c}$ should not be worse than $\boldsymbol{c}'$ from the perspective of time $\tau$. If a risk measure is time-consistent, we can define the one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, N-1$, as
$$\rho_t(c_{t+1}) = \rho_{t,t+1}(0, c_{t+1}), \quad (1)$$
and for all $t = 1, \ldots, N$, we obtain
$$\rho_{t,N}(c_t, \ldots, c_N) = \rho_t\Big(c_t + \rho_{t+1}\big(c_{t+1} + \rho_{t+2}(c_{t+2} + \cdots + \rho_{N-1}(c_{N-1} + \rho_N(c_N)) \cdots )\big)\Big). \quad (2)$$
Note that a time-consistent risk measure is completely defined by its one-step conditional risk measures $\rho_t$, $t = 0, \ldots, N-1$, and, in particular, for $t = 0$, (2) defines a risk measure of the entire sequence $\boldsymbol{c} \in \mathcal{C}_{0:N}$. At this point, we are ready to define a coherent risk measure.

**Definition 3 (Coherent Risk Measure).** We call the one-step conditional risk measures $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 1, \ldots, N-1$, as in (2) coherent risk measures if they satisfy the following conditions:

- Convexity: $\rho_t(\lambda c + (1-\lambda)c') \leq \lambda \rho_t(c) + (1-\lambda)\rho_t(c')$ for all $\lambda \in (0,1)$ and all $c, c' \in \mathcal{C}_{t+1}$;
- Monotonicity: if $c \leq c'$, then $\rho_t(c) \leq \rho_t(c')$ for all $c, c' \in \mathcal{C}_{t+1}$;
- Translational invariance: $\rho_t(c + c') = c + \rho_t(c')$ for all $c \in \mathcal{C}_t$ and $c' \in \mathcal{C}_{t+1}$;
- Positive homogeneity: $\rho_t(\beta c) = \beta \rho_t(c)$ for all $c \in \mathcal{C}_{t+1}$ and $\beta \geq 0$.

Henceforth, all the risk measures considered are assumed to be coherent. In this paper, we are interested in discounted infinite-horizon problems. Let $\gamma \in (0,1)$ be a given discount factor. For $t = 0, 1, \ldots$, we define the functionals
$$\rho^{\gamma}_{0,t}(c_0, \ldots, c_t) = \rho_{0,t}(c_0, \gamma c_1, \ldots, \gamma^t c_t) = c_0 + \rho_1\Big(\gamma c_1 + \rho_2\big(\gamma^2 c_2 + \cdots + \rho_{t-1}(\gamma^{t-1} c_{t-1} + \rho_t(\gamma^t c_t)) \cdots \big)\Big),$$
which are the same as (2) for $t = 0$, but with discounting $\gamma^t$ applied to each $c_t$. Finally, we have the total discounted risk functional $\rho^{\gamma} : \mathcal{C} \to \mathbb{R}$ defined as
$$\rho^{\gamma}(\boldsymbol{c}) = \lim_{t \to \infty} \rho^{\gamma}_{0,t}(c_0, \ldots, c_t). \quad (3)$$
From (Ruszczyński 2010, Theorem 3), we have that $\rho^{\gamma}$ is convex, monotone, and positive homogeneous.

## Constrained Risk-Averse MDPs

Notions of coherent risk and dynamic risk measures discussed in the previous section have been developed and applied in microeconomics and mathematical finance over the past two decades (Vose 2008). Generally, risk-averse decision making is concerned with the behavior of agents, e.g., consumers and investors, who, when exposed to uncertainty, attempt to lower that uncertainty. The agents avoid situations with unknown payoffs in favor of situations with payoffs that are more predictable, even if they are lower. In a Markov decision making setting, the main idea in risk-averse control is to replace the conventional risk-neutral conditional expectation of the cumulative cost objective with more general coherent risk measures. In a path planning setting, we will show in our numerical experiments that considering coherent risk measures leads to significantly more robustness to environment uncertainty. In addition to risk-aversion, an autonomous agent is often subject to constraints, e.g., fuel, communication, or energy budgets. These constraints can also represent mission objectives, e.g., exploring an area or reaching a goal.

Consider a stationary controlled Markov process $\{s_t\}$, $t = 0, 1, \ldots$, wherein policies, transition probabilities, and cost functions do not depend explicitly on time. Each policy $\pi = \{\pi_t\}_{t=0}^{\infty}$ leads to cost sequences $c_t = c(s_t, \alpha_t)$, $t = 0, 1, \ldots$,
and $d^i_t = d^i(s_t, \alpha_t)$, $t = 0, 1, \ldots$, $i = 1, 2, \ldots, n_c$. We define the dynamic risk of evaluating the $\gamma$-discounted cost of a policy $\pi$ as
$$J_{\gamma}(\kappa_0, \pi) = \rho^{\gamma}\big(c(s_0, \alpha_0), c(s_1, \alpha_1), \ldots\big), \quad (4)$$
and the $\gamma$-discounted dynamic risk constraints of executing policy $\pi$ as
$$D^i_{\gamma}(\kappa_0, \pi) = \rho^{\gamma}\big(d^i(s_0, \alpha_0), d^i(s_1, \alpha_1), \ldots\big) \leq \beta^i, \quad i = 1, 2, \ldots, n_c, \quad (5)$$
where $\rho^{\gamma}$ is defined in equation (3), $s_0 \sim \kappa_0$, and $\beta^i > 0$, $i = 1, 2, \ldots, n_c$, are given constants. In this work, we are interested in addressing the following problem:

**Problem 1.** For a given Markov decision process, a discount factor $\gamma \in (0,1)$, a total risk functional $J_{\gamma}(\kappa_0, \pi)$ as in equation (4), and total cost constraints (5) with $\{\rho_t\}_{t=0}^{\infty}$ being coherent risk measures, compute
$$\pi^* \in \operatorname*{argmin}_{\pi}\ J_{\gamma}(\kappa_0, \pi) \quad \text{subject to} \quad D_{\gamma}(\kappa_0, \pi) \leq \beta. \quad (6)$$

We call a controlled Markov process with the nested objective (4) and constraints (5) a constrained risk-averse MDP. It was previously demonstrated in (Chow et al. 2015; Osogami 2012) that such coherent risk measure objectives can account for modeling errors and parametric uncertainty in MDPs. One interpretation of Problem 1 is that we are interested in policies that minimize the incurred costs in the worst case in a probabilistic sense, while at the same time ensuring that the system constraints, e.g., fuel constraints, are not violated even in worst-case probabilistic scenarios.

Note that in Problem 1 both the objective function and the constraints are in general non-differentiable and non-convex in the policy $\pi$ (with the exception of total expected cost as the coherent risk measure $\rho^{\gamma}$ (Altman 1999)). Therefore, finding optimal policies may in general be hopeless. Instead, we find sub-optimal policies by taking advantage of a Lagrangian formulation and then using an optimization form of Bellman's equations. Next, we show that, thanks to the Lagrangian method, the constrained risk-averse problem is equivalent to an unconstrained inf-sup risk-averse problem.

**Proposition 1.** Let $J_{\gamma}(\kappa_0)$ be the value of Problem 1 for a given initial distribution $\kappa_0$ and discount factor $\gamma$. Then, (i) the value function satisfies
$$J_{\gamma}(\kappa_0) = \inf_{\pi} \sup_{\lambda \geq 0} L_{\gamma}(\pi, \lambda), \quad (7)$$
where
$$L_{\gamma}(\pi, \lambda) = J_{\gamma}(\kappa_0, \pi) + \langle \lambda, D_{\gamma}(\kappa_0, \pi) - \beta \rangle \quad (8)$$
is the Lagrangian function. (ii) Furthermore, a policy $\pi^*$ is optimal for Problem 1 if and only if $J_{\gamma}(\kappa_0) = \sup_{\lambda \geq 0} L_{\gamma}(\pi^*, \lambda)$.

At any time $t$, the value of $\rho_t$ is $\mathcal{F}_t$-measurable and is allowed to depend on the entire history of the process $\{s_0, s_1, \ldots\}$, so we cannot expect to obtain a Markov optimal policy (Ott 2010). In order to obtain Markov policies, we need the following property (Ruszczyński 2010).

**Definition 4.** Let $m, n \in [1, \infty)$ be such that $1/m + 1/n = 1$ and
$$\mathcal{P} = \Big\{ p \in \mathcal{L}_n(\mathcal{S}, 2^{\mathcal{S}}, P) \ \Big|\ \sum_{s' \in \mathcal{S}} p(s') P(s') = 1,\ p \geq 0 \Big\}.$$
A one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$ is a Markov risk measure with respect to the controlled Markov process $\{s_t\}$, $t = 0, 1, \ldots$, if there exists a risk transition mapping $\sigma_t : \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, P) \times \mathcal{S} \times \mathcal{P} \to \mathbb{R}$ such that for all $v \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, P)$ and $\alpha_t \in \pi(s_t)$, we have
$$\rho_t(v(s_{t+1})) = \sigma_t\big(v(s_{t+1}), s_t, p(s_{t+1} \mid s_t, \alpha_t)\big), \quad (9)$$
where $p : \mathcal{S} \times Act \to \mathcal{P}$ is called the controlled kernel.

In fact, if $\rho_t$ is a coherent risk measure, $\sigma_t$ also satisfies the properties of a coherent risk measure (Definition 3). In this paper, since we are concerned with MDPs, the controlled kernel is simply the transition function $T$.

**Assumption 1.** The one-step coherent risk measure $\rho_t$ is a Markov risk measure.
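To make the risk transition mapping of Definition 4 concrete, the following sketch (illustrative only; the data and helper names are not from the paper) evaluates $\sigma(v, s, p(\cdot \mid s, \alpha))$ for two one-step measures: the conditional expectation case discussed next, and CVaR in its standard Rockafellar–Uryasev form, which is the representation the paper uses later in (16). Both mappings take the successor-state value vector and one row of the transition function $T$, matching the controlled-kernel reading of (9).

```python
import numpy as np

def sigma_expectation(v, p):
    """Risk transition mapping for the conditional expectation:
    sigma(v, s, p) = sum_{s'} v(s') p(s' | s, alpha)."""
    return float(np.dot(p, v))

def sigma_cvar(v, p, eps):
    """Risk transition mapping for CVaR_eps in the Rockafellar-Uryasev form,
    sigma(v, s, p) = min_zeta { zeta + (1/eps) sum_{s'} p(s') max(v(s') - zeta, 0) }.
    For a finite distribution, a minimizing zeta is the (1 - eps)-quantile of v
    under p, so the infimum can be evaluated exactly by sorting."""
    order = np.argsort(v)
    v_sorted, p_sorted = v[order], p[order]
    cum = np.cumsum(p_sorted)
    zeta = v_sorted[np.searchsorted(cum, 1.0 - eps)]  # (1 - eps)-quantile (VaR)
    return float(zeta + np.dot(p_sorted, np.maximum(v_sorted - zeta, 0.0)) / eps)

# Hypothetical data: values on three successor states and one row T(. | s, alpha)
# of the transition function of a small MDP (numbers invented for illustration).
v = np.array([1.0, 4.0, 10.0])
p = np.array([0.7, 0.2, 0.1])

print(sigma_expectation(v, p))      # 2.5
print(sigma_cvar(v, p, eps=0.15))   # 8.0: the risk-averse map weights the tail
print(sigma_cvar(v, p, eps=1.0))    # 2.5: eps -> 1 recovers the expectation
```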
The simplest case of the risk transition mapping is the conditional expectation, $\rho_t(v(s_{t+1})) = \mathbb{E}\{v(s_{t+1}) \mid s_t, \alpha_t\}$, i.e.,
$$\sigma\{v(s_{t+1}), s_t, p(s_{t+1} \mid s_t, \alpha_t)\} = \mathbb{E}\{v(s_{t+1}) \mid s_t, \alpha_t\} = \sum_{s_{t+1} \in \mathcal{S}} v(s_{t+1})\, T(s_{t+1} \mid s_t, \alpha_t). \quad (10)$$
Note that in the total discounted expectation case $\sigma$ is a linear function of $v$, rather than a convex function, which is the case for general coherent risk measures.

In the next result, we show that we can find a lower bound to the solution of Problem 1 by solving an optimization problem.

**Theorem 1.** Consider an MDP $\mathcal{M}$ with the nested risk objective (4), constraints (5), and discount factor $\gamma \in (0,1)$. Let Assumption 1 hold, let $\rho_t$, $t = 0, 1, \ldots$, be coherent risk measures as described in Definition 3, and suppose $c(\cdot, \cdot)$ and $d^i(\cdot, \cdot)$, $i = 1, 2, \ldots, n_c$, are non-negative and upper-bounded. Then the solution $(V^*_{\gamma}, \lambda^*)$ to the following optimization problem (Bellman's equation)
$$\sup_{V_{\gamma},\, \lambda \geq 0}\ \langle \kappa_0, V_{\gamma} \rangle - \langle \lambda, \beta \rangle$$
$$\text{subject to} \quad V_{\gamma}(s) \leq c(s, \alpha) + \langle \lambda, d(s, \alpha) \rangle + \gamma\, \sigma\{V_{\gamma}(s'), s, p(s' \mid s, \alpha)\}, \quad \forall s \in \mathcal{S},\ \forall \alpha \in Act, \quad (11)$$
satisfies
$$J_{\gamma}(\kappa_0) \geq \langle \kappa_0, V^*_{\gamma} \rangle - \langle \lambda^*, \beta \rangle. \quad (12)$$

One interesting observation is that if the coherent risk measure $\rho_t$ is the total discounted expectation, then Theorem 1 is consistent with the result of (Altman 1999).

**Corollary 1.** Let the assumptions of Theorem 1 hold and let $\rho_t(\cdot) = \mathbb{E}(\cdot \mid s_t, \alpha_t)$, $t = 1, 2, \ldots$. Then the solution $(V^*_{\gamma}, \lambda^*)$ to optimization (11) satisfies $J_{\gamma}(\kappa_0) = \langle \kappa_0, V^*_{\gamma} \rangle - \langle \lambda^*, \beta \rangle$. Furthermore, with $\rho_t(\cdot) = \mathbb{E}(\cdot \mid s_t, \alpha_t)$, $t = 1, 2, \ldots$, optimization (11) becomes a linear program.

Once the values of $\lambda^*$ and $V^*_{\gamma}$ are found by solving optimization problem (11), we can find the policy as
$$\pi^*(s) \in \operatorname*{argmin}_{\alpha \in Act}\ \Big[ c(s, \alpha) + \langle \lambda^*, d(s, \alpha) \rangle + \gamma\, \sigma\{V^*_{\gamma}(s'), s, p(s' \mid s, \alpha)\} \Big]. \quad (13)$$
Note that $\pi^*$ is a deterministic, stationary policy. Such policies are desirable in practical applications, since they are more convenient to implement on actual robots. Given an uncertain environment, $\pi^*$ can be designed offline and used for path planning. In the next section, we discuss methods to find solutions to optimization problem (11) when $\rho^{\gamma}$ is an arbitrary coherent risk measure.
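Before turning to those general coherent risk measures, the total-expectation case of Corollary 1 gives a useful baseline: with $\sigma$ linear as in (10), program (11) is a linear program in $(V_{\gamma}, \lambda)$. The sketch below is a minimal illustration and not the paper's implementation; the 3-state, 2-action MDP, its costs, and the budget $\beta$ are invented for the example. It poses the LP with CVXPY and reads off a policy via (13).

```python
import cvxpy as cp
import numpy as np

# Hypothetical 3-state, 2-action MDP; all numbers are invented for illustration.
nS, nA, gamma = 3, 2, 0.95
T = np.array([[[0.8, 0.2, 0.0],      # T[s, a, s'] = T(s' | s, a)
               [0.1, 0.8, 0.1]],
              [[0.0, 0.9, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.0, 0.0, 1.0],      # state 2 acts as an absorbing "goal"
               [0.0, 0.0, 1.0]]])
c = np.array([[1.0, 4.0], [2.0, 1.0], [0.0, 0.0]])   # objective cost c(s, a)
d = np.array([[2.0, 2.0], [2.0, 2.0], [0.0, 0.0]])   # constraint cost d(s, a)
kappa0 = np.array([1.0, 0.0, 0.0])                   # initial distribution
beta = 10.0                                          # constraint budget

V = cp.Variable(nS)                  # risk value function V_gamma
lam = cp.Variable(nonneg=True)       # Lagrange multiplier

# Bellman inequalities of (11) with sigma = conditional expectation, cf. (10).
constraints = [V[s] <= c[s, a] + lam * d[s, a] + gamma * (T[s, a, :] @ V)
               for s in range(nS) for a in range(nA)]
prob = cp.Problem(cp.Maximize(kappa0 @ V - lam * beta), constraints)
prob.solve()

# Policy extraction as in (13): greedy with respect to the optimal V and lambda.
V_opt, lam_opt = V.value, lam.value
policy = [int(np.argmin([c[s, a] + lam_opt * d[s, a] + gamma * T[s, a, :] @ V_opt
                         for a in range(nA)]))
          for s in range(nS)]
print(prob.value, policy)
```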
## DCPs for Constrained Risk-Averse MDPs

Note that since $\rho^{\gamma}$ is a coherent, Markov risk measure (Assumption 1), $v \mapsto \sigma(v, \cdot, \cdot)$ is convex (because $\sigma$ is also a coherent risk measure). Next, we demonstrate that optimization problem (11) is indeed a DCP. Re-formulating (11) as a minimization yields
$$\inf_{V_{\gamma},\, \lambda \geq 0}\ \langle \lambda, \beta \rangle - \langle \kappa_0, V_{\gamma} \rangle$$
$$\text{subject to} \quad V_{\gamma}(s) \leq c(s, \alpha) + \langle \lambda, d(s, \alpha) \rangle + \gamma\, \sigma\{V_{\gamma}(s'), s, p(s' \mid s, \alpha)\}, \quad \forall s \in \mathcal{S},\ \forall \alpha \in Act. \quad (14)$$
At this point, we define $f_0(\lambda) = \langle \lambda, \beta \rangle$, $g_0(V_{\gamma}) = \langle \kappa_0, V_{\gamma} \rangle$, $f_1(V_{\gamma}) = V_{\gamma}$, $g_1(\lambda) = c + \langle \lambda, d \rangle$, and $g_2(V_{\gamma}) = \gamma\, \sigma(V_{\gamma}, \cdot, \cdot)$. Note that $f_0$ and $g_1$ are convex (linear) functions of $\lambda$, and $g_0$, $f_1$, and $g_2$ are convex functions of $V_{\gamma}$. Then, we can re-write (14) as
$$\inf_{V_{\gamma},\, \lambda \geq 0}\ f_0(\lambda) - g_0(V_{\gamma}) \quad \text{subject to} \quad f_1(V_{\gamma}) - g_1(\lambda) - g_2(V_{\gamma}) \leq 0, \quad \forall s, \alpha. \quad (15)$$
In fact, optimization problem (15) is a standard DCP (Horst and Thoai 1999). DCPs arise in many applications, such as feature selection in machine learning (Le Thi et al. 2008) and inverse covariance estimation in statistics (Thai et al. 2014). Although DCPs can be solved globally (Horst and Thoai 1999), e.g., using branch-and-bound algorithms (Lawler and Wood 1966), a locally optimal solution can be obtained more efficiently based on techniques of nonlinear optimization (Bertsekas 1999). In particular, in this work, we use a variant of the convex-concave procedure (Lipp and Boyd 2016; Shen et al. 2016), wherein the concave terms are replaced by a convex upper bound and the resulting problem is solved. In fact, the disciplined convex-concave programming (DCCP) (Shen et al. 2016) technique linearizes DCP problems into a (disciplined) convex program (carried out automatically via the DCCP Python package (Shen et al. 2016)), which is then converted into an equivalent cone program by replacing each function with its graph implementation. The cone program can then be solved readily by available convex programming solvers, such as CVXPY (Diamond and Boyd 2016). At this point, we should point out that solving (11) via the DCCP method finds the (local) saddle points of optimization problem (11). However, from Theorem 1, we have that every saddle point of (11) satisfies (12). In other words, every saddle point corresponds to a lower bound to the optimal value of Problem 1.

### DCPs for CVaR and EVaR Risk Measures

In this section, we present the specific DCPs for finding the risk value functions for the two coherent risk measures studied in our numerical experiments, namely CVaR and EVaR.

For a given confidence level $\varepsilon \in (0,1)$, the value-at-risk ($\mathrm{VaR}_{\varepsilon}$) denotes the $(1-\varepsilon)$-quantile value of the cost variable. $\mathrm{CVaR}_{\varepsilon}$ measures the expected loss in the $(1-\varepsilon)$-tail, given that the particular threshold $\mathrm{VaR}_{\varepsilon}$ has been crossed. $\mathrm{CVaR}_{\varepsilon}$ is given by
$$\rho_t(c_{t+1}) = \inf_{\zeta \in \mathbb{R}} \Big\{ \zeta + \frac{1}{\varepsilon}\, \mathbb{E}\big[(c_{t+1} - \zeta)_+ \mid \mathcal{F}_t\big] \Big\}, \quad (16)$$
where $(\cdot)_+ = \max\{\cdot, 0\}$. A value of $\varepsilon \to 1$ corresponds to a risk-neutral policy, whereas a value of $\varepsilon \to 0$ yields a rather risk-averse policy. In fact, Theorem 1 can be applied to CVaR since it is a coherent risk measure. For an MDP $\mathcal{M}$, the risk value functions can be computed by DCP (15), where
$$g_2(V_{\gamma}) = \gamma \inf_{\zeta \in \mathbb{R}} \Big\{ \zeta + \frac{1}{\varepsilon} \sum_{s' \in \mathcal{S}} \big(V_{\gamma}(s') - \zeta\big)_+\, T(s' \mid s, \alpha) \Big\},$$
where the infimum on the right-hand side of the above equation can be absorbed into the overall infimum problem, i.e., $\inf_{V_{\gamma}, \lambda \geq 0, \zeta}$. Note that $g_2(V_{\gamma})$ above is convex in $\zeta$ (Rockafellar, Uryasev et al. 2000, Theorem 1), because the function $(\cdot)_+$ is increasing and convex (Ott 2010, Lemma A.1, p. 117).

Unfortunately, CVaR ignores the losses below the VaR threshold. EVaR is the tightest upper bound, in the sense of the Chernoff inequality, for the value-at-risk (VaR) and CVaR, and its dual representation is associated with the relative entropy. In fact, it was shown in (Ahmadi-Javid and Pichler 2017) that $\mathrm{EVaR}_{\varepsilon}$ and $\mathrm{CVaR}_{\varepsilon}$ are equal only if there are no losses below the $\mathrm{VaR}_{\varepsilon}$ threshold. In addition, EVaR is a strictly monotone risk measure, whereas CVaR is only monotone (Ahmadi-Javid and Fallah-Tafti 2019). $\mathrm{EVaR}_{\varepsilon}$ is given by
$$\rho_t(c_{t+1}) = \inf_{\zeta > 0} \Big\{ \frac{1}{\zeta} \log \Big( \frac{\mathbb{E}\big[e^{\zeta c_{t+1}} \mid \mathcal{F}_t\big]}{\varepsilon} \Big) \Big\}.$$
Similar to $\mathrm{CVaR}_{\varepsilon}$, for $\mathrm{EVaR}_{\varepsilon}$, $\varepsilon \to 1$ corresponds to a risk-neutral case, whereas $\varepsilon \to 0$ corresponds to a risk-averse case. In fact, it was demonstrated in (Ahmadi-Javid 2012, Proposition 3.2) that $\lim_{\varepsilon \to 0} \mathrm{EVaR}_{\varepsilon}(Z) = \operatorname{ess\,sup}(Z)$.

Since $\mathrm{EVaR}_{\varepsilon}$ is a coherent risk measure, the conditions of Theorem 1 hold. Since $\zeta > 0$, using the change of variables $\tilde{V}_{\gamma} = \zeta V_{\gamma}$ and $\tilde{\lambda} = \zeta \lambda$ (note that this change of variables is monotone increasing in $\zeta$ (Agrawal et al. 2018)), we can compute the EVaR value functions by solving (15), where $f_0(\tilde{\lambda}) = \langle \tilde{\lambda}, \beta \rangle$, $f_1(\tilde{V}_{\gamma}) = \tilde{V}_{\gamma}$, $g_0(\tilde{V}_{\gamma}) = \langle \kappa_0, \tilde{V}_{\gamma} \rangle$, $g_1(\tilde{\lambda}) = \zeta c + \langle \tilde{\lambda}, d \rangle$, and
$$g_2(\tilde{V}_{\gamma}) = \gamma \log \Big( \frac{\sum_{s' \in \mathcal{S}} e^{\tilde{V}_{\gamma}(s')}\, T(s' \mid s, \alpha)}{\varepsilon} \Big).$$
Similar to the CVaR case, the infimum over $\zeta$ can be lumped into the overall infimum problem, i.e., $\inf_{\tilde{V}_{\gamma}, \tilde{\lambda} \geq 0, \zeta > 0}$. Note that $g_2(\tilde{V}_{\gamma})$ is convex in $\tilde{V}_{\gamma}$, since the logarithm of a sum of exponentials is convex (Boyd and Vandenberghe 2004, p. 72). The lower bound in (12) can then be obtained as $\frac{1}{\zeta}\big( \langle \kappa_0, \tilde{V}^*_{\gamma} \rangle - \langle \tilde{\lambda}^*, \beta \rangle \big)$.
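To illustrate how the CVaR instance of (15) can be handed to the DCCP framework, the sketch below poses the program for the same kind of invented toy MDP as before (all data hypothetical; DCCP returns a locally optimal point, in line with the saddle-point discussion above). The CVaR auxiliary variable $\zeta(s, \alpha)$ is absorbed into the overall problem as described.

```python
import cvxpy as cp
import dccp   # CVXPY add-on for disciplined convex-concave programming
import numpy as np

# Same style of hypothetical 3-state, 2-action MDP (numbers invented).
nS, nA, gamma, eps = 3, 2, 0.95, 0.15
T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1]],
              [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
c = np.array([[1.0, 4.0], [2.0, 1.0], [0.0, 0.0]])
d = np.array([[2.0, 2.0], [2.0, 2.0], [0.0, 0.0]])
kappa0 = np.array([1.0, 0.0, 0.0])
beta = 10.0

V = cp.Variable(nS)                   # risk value function V_gamma
lam = cp.Variable(nonneg=True)        # Lagrange multiplier
zeta = cp.Variable((nS, nA))          # CVaR auxiliary variable, one per (s, alpha)

# Constraint of (14)/(15) with the CVaR risk transition mapping:
#   V(s) <= c(s,a) + lam*d(s,a)
#           + gamma*( zeta(s,a) + (1/eps) * sum_s' T(s'|s,a) * (V(s') - zeta(s,a))_+ ).
# The right-hand side is convex in (V, zeta), so the constraint is a difference-of-
# convex constraint; DCCP handles it via the convex-concave procedure.
constraints = [
    V[s] <= c[s, a] + lam * d[s, a]
            + gamma * (zeta[s, a]
                       + (1.0 / eps) * (T[s, a, :] @ cp.pos(V - zeta[s, a])))
    for s in range(nS) for a in range(nA)
]
prob = cp.Problem(cp.Maximize(kappa0 @ V - lam * beta), constraints)
prob.solve(method="dccp")             # local solution via the DCCP heuristic

print("bound from (12):", float(kappa0 @ V.value - lam.value * beta))
```

The EVaR instance follows the same pattern, with the positive-part term replaced by the log-sum-exp expression above and the additional scalar $\zeta > 0$ from the change of variables.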
## Related Work and Discussion

We believe this work is the first study of constrained MDPs with both coherent risk objectives and constraints. We emphasize that our method leads to a policy that lower-bounds the value of Problem 1 for general coherent, Markov risk measures. In the unconstrained case ($\lambda \equiv 0$), our proposed method can also be applied to risk-averse MDPs without constraints.

With respect to risk-averse MDPs, (Tamar et al. 2016, 2015) proposed a sampling-based algorithm for finding saddle point solutions to MDPs with total coherent risk measure costs using policy gradient methods. (Tamar et al. 2016) relies on the assumption that the risk envelope appearing in the dual representation of the coherent risk measure is known with an explicit canonical convex programming formulation. As the authors indicated, this is the case for CVaR, mean-semi-deviation, and spectral risk measures (Shapiro, Dentcheva, and Ruszczyński 2014). However, such an explicit form is not known for general coherent risk measures, such as EVaR. Furthermore, it is not clear whether the saddle point solutions are a lower bound or an upper bound to the optimal value. Also, policy-gradient based methods require calculating the gradient of the coherent risk measure, which is not available in explicit form in general. MDPs with a CVaR constraint and total expected costs were studied in (Prashanth 2014; Chow and Ghavamzadeh 2014), and locally optimal solutions were found via policy gradients as well. However, this method also leads to saddle point solutions and cannot be applied to general coherent risk measures. In addition, since the objective and the constraints are described by different coherent risk measures, the authors assume there exists a policy that satisfies the CVaR constraint (a feasibility assumption), which may not be the case in general. Following in the footsteps of (Pflug and Pichler 2016), a promising approach based on approximate value iteration was proposed for MDPs with CVaR objectives in (Chow et al. 2015). But it is not clear how one can extend this method to other coherent risk measures. An infinite-dimensional linear program was derived analytically for MDPs with coherent risk objectives in (Haskell and Jain 2015), with implications for solving chance- and stochastic-dominance-constrained MDPs. Successive finite approximation of such infinite-dimensional linear programs was suggested, leading to sub-optimal solutions. Yet, the method cannot be extended to constrained MDPs with coherent risk objectives and constraints. A policy iteration algorithm for finding policies that minimize total coherent risk measures for MDPs was studied in (Ruszczyński 2010; Fan and Ruszczyński 2018a), and a computational non-smooth Newton method was proposed in (Ruszczyński 2010). Our work extends (Ruszczyński 2010; Fan and Ruszczyński 2018a) to constrained problems and uses a DCCP computational method, which takes advantage of already available software (DCCP and CVXPY).

## Numerical Experiments

In this section, we evaluate the proposed methodology with a numerical example. In addition to the traditional total expectation, we consider two other coherent risk measures, namely CVaR and EVaR. All experiments were carried out using a MacBook Pro with a 2.8 GHz Quad-Core Intel Core i5 and 16 GB of RAM.

Figure 2: Grid-world illustration for the rover navigation example. Blue cells denote the obstacles and the yellow cell denotes the goal.

The resultant linear programs and DCPs
were solved using CVXPY (Diamond and Boyd 2016) with the DCCP (Shen et al. 2016) add-on in Python.

### Rover MDP Example

**Set Up.** An agent (e.g., a rover) has to autonomously navigate a two-dimensional terrain map (e.g., the Mars surface) represented by an $M \times N$ grid. The rover can move from cell to cell. Thus, the state space is given by $\mathcal{S} = \{s_i \mid i = x + My,\ x \in \{1, \ldots, M\},\ y \in \{1, \ldots, N\}\}$. The action set available to the robot is $Act = \{E, W, N, S, NE, NW, SE, SW\}$, i.e., diagonal moves are allowed. The actions move the robot from its current cell to a neighboring cell, with some uncertainty. The state transition probabilities for various cell types are shown for actions $E$ and $N$ in Figure 2; other actions lead to analogous transitions. With regard to constraint costs, there is a stage-wise cost of 2 for each move until reaching the goal, to account for fuel usage constraints. In between the starting point and the destination, there are a number of obstacles that the agent should avoid. Hitting an obstacle incurs an immediate cost of 10, while the goal grid region has zero immediate cost. These two latter immediate costs are captured by the cost function. The discount factor is $\gamma = 0.95$.

The objective is to compute a safe path that is fuel efficient, i.e., to solve Problem 1. To this end, we consider total expectation, CVaR, and EVaR as the coherent risk measure. As a robustness test, inspired by (Chow et al. 2015), we included a set of single-cell obstacles that are perturbed in a random direction to one of the neighboring grid cells with probability 0.2 to represent uncertainty in the terrain map. For each risk measure, we run 100 Monte Carlo simulations with the calculated policies and count failure rates, i.e., the number of times a collision has occurred during a run.

**Results.** In our experiments, we consider three grid-world sizes of $10 \times 10$, $15 \times 15$, and $20 \times 20$, corresponding to 100, 225, and 400 states, respectively. For each grid-world, we allocate 25% of the grid to obstacles, including 3, 6, and 9 uncertain (single-cell) obstacles for the $10 \times 10$, $15 \times 15$, and $20 \times 20$ grids, respectively. In each case, we solve DCP (11) (a linear program in the case of total expectation) with $|\mathcal{S}||Act| = MN \times 8 = 8MN$ constraints and $MN + 2$ variables (the $MN$ risk value functions $V_{\gamma}$, the Lagrangian coefficient $\lambda$, and $\zeta$ for CVaR and EVaR). In these experiments, we set $\varepsilon = 0.15$ for the CVaR and EVaR coherent risk measures. The fuel budget (constraint bound $\beta$) was set to 50, 10, and 200 for the $10 \times 10$, $15 \times 15$, and $20 \times 20$ grid-worlds, respectively. The initial condition was chosen as $\kappa_0(s_M) = 1$, i.e., the agent starts at the right-most grid cell at the bottom. A summary of our numerical experiments is provided in Table 1.

| $(M \times N)$, $\rho_t$ | $J_{\gamma}(\kappa_0)$ | Total Time [s] | # U.O. | F.R. |
|---|---|---|---|---|
| $(10 \times 10)$, $\mathbb{E}$ | 5.10 | 0.7 | 3 | 9% |
| $(15 \times 15)$, $\mathbb{E}$ | 7.53 | 1.0 | 6 | 18% |
| $(20 \times 20)$, $\mathbb{E}$ | 8.98 | 1.6 | 9 | 21% |
| $(10 \times 10)$, $\mathrm{CVaR}_{\varepsilon}$ | 7.76 | 5.4 | 3 | 1% |
| $(15 \times 15)$, $\mathrm{CVaR}_{\varepsilon}$ | 9.22 | 8.3 | 6 | 3% |
| $(20 \times 20)$, $\mathrm{CVaR}_{\varepsilon}$ | 12.76 | 10.5 | 9 | 5% |
| $(10 \times 10)$, $\mathrm{EVaR}_{\varepsilon}$ | 7.99 | 3.2 | 3 | 0% |
| $(15 \times 15)$, $\mathrm{EVaR}_{\varepsilon}$ | 11.04 | 4.9 | 6 | 0% |
| $(20 \times 20)$, $\mathrm{EVaR}_{\varepsilon}$ | 15.28 | 6.6 | 9 | 2% |

Table 1: Comparison between total expectation, CVaR, and EVaR coherent risk measures. $(M \times N)$, $\rho_t$ denotes an experiment with a grid-world of size $M \times N$ and one-step coherent risk measure $\rho_t$. $J_{\gamma}(\kappa_0)$ is the value of the constrained risk-averse problem (Problem 1). Total Time denotes the time taken by the CVXPY solver to solve the associated linear programs or DCPs, in seconds. # U.O. denotes the number of single-cell uncertain obstacles used for the robustness test. F.R. denotes the failure rate out of 100 Monte Carlo simulations with the computed policy.
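For concreteness, the sketch below shows one way such a grid-world MDP could be assembled in code; it is illustrative only and not the setup used for Table 1. The exact per-cell-type transition probabilities of Figure 2 are not reproduced here, so an assumed generic slip model (intended cell with probability 0.8, the two perpendicular directions sharing the remainder) stands in for them; the costs follow the description above (stage-wise constraint cost 2, obstacle cost 10, zero cost at the goal).

```python
import numpy as np

def build_grid_mdp(M, N, obstacles, goal, p_intend=0.8):
    """Assemble states, actions, T, c, d for an M x N grid-world.
    Illustrative slip model only; the per-cell-type probabilities used in the
    paper are those of Figure 2."""
    moves = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1),
             "NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1)}
    actions = list(moves)
    nS, nA = M * N, len(actions)
    idx = lambda x, y: x + M * y            # 0-based version of i = x + M*y

    def clamp(x, y):
        return min(max(x, 0), M - 1), min(max(y, 0), N - 1)

    T = np.zeros((nS, nA, nS))
    c = np.zeros((nS, nA))                  # objective cost: obstacle / goal terms
    d = np.full((nS, nA), 2.0)              # constraint cost: fuel per move
    for x in range(M):
        for y in range(N):
            s = idx(x, y)
            if (x, y) in obstacles:
                c[s, :] = 10.0              # hitting an obstacle costs 10
            if (x, y) == goal:
                c[s, :], d[s, :] = 0.0, 0.0
                T[s, :, s] = 1.0            # goal treated as absorbing (assumption)
                continue
            for a, name in enumerate(actions):
                dx, dy = moves[name]
                side = (1.0 - p_intend) / 2.0
                # assumed slip model: intended cell w.p. p_intend, the two
                # perpendicular directions (90-degree rotations) share the rest.
                for (ddx, ddy), p in [((dx, dy), p_intend),
                                      ((-dy, dx), side), ((dy, -dx), side)]:
                    T[s, a, idx(*clamp(x + ddx, y + ddy))] += p
    return actions, T, c, d

actions, T, c, d = build_grid_mdp(10, 10, obstacles={(4, 4), (5, 5)}, goal=(9, 9))
assert np.allclose(T.sum(axis=2), 1.0)      # each T(. | s, a) is a distribution
```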
Note that the computed values of Problem 1 satisfy $\mathbb{E}(c) \leq \mathrm{CVaR}_{\varepsilon}(c) \leq \mathrm{EVaR}_{\varepsilon}(c)$. This is in accordance with the theory that EVaR is a more conservative coherent risk measure than CVaR (Ahmadi-Javid 2012). For the total expectation coherent risk measure, the calculations took significantly less time, since they are the result of solving a set of linear programs. For CVaR and EVaR, a set of DCPs were solved. The CVaR calculation was the most computationally involved. This observation is consistent with (Ahmadi-Javid and Fallah-Tafti 2019), where it was discussed that EVaR calculation is much more efficient than CVaR calculation. Note that these calculations can be carried out offline for policy synthesis, and then the policy can be applied for risk-averse robot path planning.

The table also outlines the failure rates of each risk measure. In this case, EVaR outperformed both CVaR and total expectation in terms of robustness, tallying with the fact that EVaR is conservative. In addition, these results suggest that, although discounted total expectation is a measure of performance over a high number of Monte Carlo simulations, it may not be practical to use it for real-world planning under uncertainty. CVaR, and especially EVaR, seem to be more reliable metrics for performance in planning under uncertainty.

Figure 3: Results for the MDP example with total expectation (left), CVaR (middle), and EVaR (right) coherent risk measures. The goal is located at the yellow cell. Notice the 9 single-cell obstacles used for the robustness test.

For the sake of illustrating the computed policies, Figure 3 depicts the results obtained from solving DCP (11) for a $20 \times 20$ grid-world. The arrows on the grids depict the (sub)optimal actions and the heat map indicates the values of Problem 1 for each grid state. Note that the values for EVaR are greater than those for CVaR, and the values for CVaR are greater than those for total expectation. This is in accordance with the theory that $\mathbb{E}(c) \leq \mathrm{CVaR}_{\varepsilon}(c) \leq \mathrm{EVaR}_{\varepsilon}(c)$ (Ahmadi-Javid 2012). In addition, by inspecting the computed actions in obstacle-dense areas of the grid-world (for example, the top-right area), we infer that the actions in more risk-averse cases (especially EVaR) have a higher tendency to steer the robot away from the obstacles, given the diagonal transition uncertainty depicted in Figure 2; whereas, for total expectation, the actions are rather concerned with taking the robot to the goal region.

## Conclusions

We proposed an optimization-based method for finding sub-optimal policies that lower-bound the constrained risk-averse problem in MDPs. We showed that this optimization problem is in DCP form for general coherent risk measures and can be solved using the DCCP method. Our methodology generalizes constrained MDPs with the total expectation risk measure to general coherent, Markov risk measures. Numerical experiments were provided to show the efficacy of our approach. In this paper, we assumed the states are fully observable. In future work, we will consider extensions of the proposed framework to Markov processes with partially observable states (Ahmadi et al. 2020; Fan and Ruszczyński 2015, 2018b). Moreover, we only considered discounted infinite-horizon problems in this work. We will explore risk-averse policy synthesis in the presence of other cost criteria (Carpin, Chow, and Pavone 2016; Cavus and Ruszczynski 2014) in our prospective research.
In particular, we are interested in risk-averse planning in the presence of high-level mission specifications in terms of linear temporal logic formulas (Wongpiromsarn, Topcu, and Murray 2012; Ahmadi, Sharan, and Burdick 2020).

## References

Agrawal, A.; Verschueren, R.; Diamond, S.; and Boyd, S. 2018. A rewriting system for convex optimization problems. Journal of Control and Decision 5(1): 42–60.
Ahmadi, M.; Ono, M.; Ingham, M. D.; Murray, R. M.; and Ames, A. D. 2020. Risk-Averse Planning Under Uncertainty. In 2020 American Control Conference (ACC), 3305–3312. IEEE.
Ahmadi, M.; Sharan, R.; and Burdick, J. W. 2020. Stochastic finite state control of POMDPs with LTL specifications. arXiv preprint arXiv:2001.07679.
Ahmadi-Javid, A. 2012. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications 155(3): 1105–1123.
Ahmadi-Javid, A.; and Fallah-Tafti, M. 2019. Portfolio optimization with entropic value-at-risk. European Journal of Operational Research 279(1): 225–241.
Ahmadi-Javid, A.; and Pichler, A. 2017. An analytical study of norms and Banach spaces induced by the entropic value-at-risk. Mathematics and Financial Economics 11(4): 527–550.
Altman, E. 1999. Constrained Markov Decision Processes, volume 7. CRC Press.
Artzner, P.; Delbaen, F.; Eber, J.; and Heath, D. 1999. Coherent measures of risk. Mathematical Finance 9(3): 203–228.
Bäuerle, N.; and Ott, J. 2011. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research 74(3): 361–379.
Bertsekas, D. 1999. Nonlinear Programming. Athena Scientific.
Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
Carpin, S.; Chow, Y.; and Pavone, M. 2016. Risk aversion in finite Markov decision processes using total cost criteria and average value at risk. In 2016 IEEE International Conference on Robotics and Automation (ICRA), 335–342. IEEE.
Cavus, O.; and Ruszczynski, A. 2014. Risk-averse control of undiscounted transient Markov models. SIAM Journal on Control and Optimization 52(6): 3935–3966.
Chow, Y.; and Ghavamzadeh, M. 2014. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, 3509–3517.
Chow, Y.; Tamar, A.; Mannor, S.; and Pavone, M. 2015. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, 1522–1530.
Diamond, S.; and Boyd, S. 2016. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17(83): 1–5.
Fan, J.; and Ruszczyński, A. 2015. Dynamic risk measures for finite-state partially observable Markov decision problems. In 2015 Proceedings of the Conference on Control and its Applications, 153–158. SIAM.
Fan, J.; and Ruszczyński, A. 2018a. Process-based risk measures and risk-averse control of discrete-time systems. Mathematical Programming: 1–28.
Fan, J.; and Ruszczyński, A. 2018b. Risk measurement and risk-averse control of partially observable discrete-time Markov systems. Mathematical Methods of Operations Research 88(2): 161–184.
Haskell, W. B.; and Jain, R. 2015. A convex analytic approach to risk-aware Markov decision processes. SIAM Journal on Control and Optimization 53(3): 1569–1598.
Horst, R.; and Thoai, N. V. 1999. DC programming: overview. Journal of Optimization Theory and Applications 103(1): 1–43.
Lawler, E. L.; and Wood, D. E. 1966. Branch-and-bound methods: A survey. Operations Research 14(4): 699–719.
Le Thi, H. A.; Le, H. M.; Dinh, T. P.; et al. 2008. A DC programming approach for feature selection in support vector machines learning. Advances in Data Analysis and Classification 2(3): 259–278.
Lipp, T.; and Boyd, S. 2016. Variations and extension of the convex-concave procedure. Optimization and Engineering 17(2): 263–287.
Majumdar, A.; and Pavone, M. 2020. How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research, 75–84. Springer.
McAllister, W.; Whitman, J.; Varghese, J.; Axelrod, A.; Davis, A.; and Chowdhary, G. 2020. Agbots 2.0: Weeding Denser Fields with Fewer Robots. Robotics: Science and Systems 2020.
McGhan, C. L.; Vaquero, T.; Subrahmanya, A. R.; Arslan, O.; Murray, R.; Ingham, M. D.; Ono, M.; Estlin, T.; Williams, B.; and Elaasar, M. 2016. The Resilient Spacecraft Executive: An Architecture for Risk-Aware Operations in Uncertain Environments. In AIAA SPACE 2016, 5541.
Ono, M.; Heverly, M.; Rothrock, B.; Almeida, E.; Calef, F.; Soliman, T.; Williams, N.; Gengl, H.; Ishimatsu, T.; Nicholas, A.; et al. 2018. Mars 2020 Site-Specific Mission Performance Analysis: Part 2. Surface Traversability. In 2018 AIAA SPACE and Astronautics Forum and Exposition, 5419.
Ono, M.; Pavone, M.; Kuwata, Y.; and Balaram, J. 2015. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots 39(4): 555–571.
Osogami, T. 2012. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, 233–241.
Ott, J. T. 2010. A Markov decision model for a surveillance application and risk-sensitive Markov decision processes.
Pflug, G. C.; and Pichler, A. 2016. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research 41(2): 682–699.
Prashanth, L. 2014. Policy gradients for CVaR-constrained MDPs. In International Conference on Algorithmic Learning Theory, 155–169. Springer.
Rockafellar, R. T.; Uryasev, S.; et al. 2000. Optimization of conditional value-at-risk. Journal of Risk 2: 21–42.
Ruszczyński, A. 2010. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming 125(2): 235–261.
Shapiro, A.; Dentcheva, D.; and Ruszczyński, A. 2014. Lectures on Stochastic Programming: Modeling and Theory. SIAM.
Shen, X.; Diamond, S.; Gu, Y.; and Boyd, S. 2016. Disciplined convex-concave programming. In 2016 IEEE 55th Conference on Decision and Control (CDC), 1009–1014. IEEE.
Tamar, A.; Chow, Y.; Ghavamzadeh, M.; and Mannor, S. 2015. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, 1468–1476.
Tamar, A.; Chow, Y.; Ghavamzadeh, M.; and Mannor, S. 2016. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control 62(7): 3323–3338.
Thai, J.; Hunter, T.; Akametalu, A. K.; Tomlin, C. J.; and Bayen, A. M. 2014. Inverse covariance estimation from data with missing values using the concave-convex procedure. In 53rd IEEE Conference on Decision and Control, 5736–5742. IEEE.
Vose, D. 2008. Risk Analysis: A Quantitative Guide. John Wiley & Sons.
Wang, A.; Jasour, A. M.; and Williams, B. 2020. Non-Gaussian chance-constrained trajectory planning for autonomous vehicles under agent uncertainty. IEEE Robotics and Automation Letters.
Wongpiromsarn, T.; Topcu, U.; and Murray, R. M. 2012. Receding horizon temporal logic planning. IEEE Transactions on Automatic Control 57(11): 2817–2830.