# The Geometry of Robust Value Functions

Kaixin Wang 1, Navdeep Kumar 2, Kuangqi Zhou 3, Bryan Hooi 1 4, Jiashi Feng 5, Shie Mannor 2 6

Abstract

The space of value functions is a fundamental concept in reinforcement learning. Characterizing its geometric properties may provide insights for optimization and representation. Existing works mainly focus on the value space for Markov Decision Processes (MDPs). In this paper, we study the geometry of the robust value space for the more general Robust MDP (RMDP) setting, where transition uncertainties are considered. Specifically, since we find it hard to directly adapt prior approaches to RMDPs, we start by revisiting the non-robust case and introduce a new perspective that enables us to characterize both the non-robust and the robust value space in a similar fashion. The key of this perspective is to decompose the value space, in a state-wise manner, into unions of hypersurfaces. Through our analysis, we show that the robust value space is determined by a set of conic hypersurfaces, each of which contains the robust values of all policies that agree on one state. Furthermore, we find that taking only extreme points in the uncertainty set is sufficient to determine the robust value space. Finally, we discuss some other aspects of the robust value space, including its non-convexity and policy agreement on multiple states.

1 Institute of Data Science, National University of Singapore, Singapore. 2 Electrical and Computer Engineering, Technion, Haifa, Israel. 3 Department of Electrical and Computer Engineering, National University of Singapore, Singapore. 4 School of Computing, National University of Singapore, Singapore. 5 ByteDance, Singapore. 6 NVIDIA Research, Haifa, Israel. Correspondence to: Kaixin Wang.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Figure 1. The value space can be decomposed in a state-wise manner as an intersection of unions of hypersurfaces. Each union corresponds to a state and each hypersurface contains the value functions of policies agreeing on that state.

1. Introduction

The space of value functions for stationary policies is a central concept in Reinforcement Learning (RL), since many RL algorithms are essentially navigating this space to find an optimal policy that maximizes the value function, such as policy gradient (Sutton et al., 1999), policy iteration (Howard, 1960) and evolutionary strategies (de Boer et al., 2005). Characterizing the geometric properties of the space of value functions (i.e., the value space) would offer insights for RL research. A recent work (Dadashi et al., 2019) shows that the value space for Markov Decision Processes (MDPs) is a possibly non-convex polytope, which inspires new methods in representation learning in RL (Bellemare et al., 2019; Dabney et al., 2021).

Compared to MDPs, Robust MDPs (RMDPs) are more general, since they do not assume that the transition dynamics are known exactly; instead, the dynamics may take any value from a given uncertainty set (Xu & Mannor, 2006; Iyengar, 2005; Nilim & El Ghaoui, 2005; Wiesemann et al., 2013). This makes RMDPs more suitable for real-world problems where parameters may not be precisely given. Therefore, characterizing the geometric properties of the value space for RMDPs (i.e., the robust value space) is of interest.
However, we find it hard to directly adapt the prior approach (Dadashi et al., 2019) from MDPs to RMDPs. Their method builds upon a key theorem (the Line Theorem), but we find it difficult to prove a robust counterpart of this theorem (see more discussion in Section 5.3).

In this work, we introduce a new perspective for investigating the geometry of the space of value functions. Specifically, we start by revisiting the non-robust case due to its simplicity. By decomposing the value space in a state-wise manner (as illustrated in Figure 1), we can give an explicit form of the value function polytope. With this decomposition-based perspective, we show that the robust value space is determined by a set of conic hypersurfaces, each of which contains the robust value functions of policies that agree on one state. Furthermore, from a geometric perspective, we show that the robust value space can be fully determined by a subset of the uncertainty set, which consists of extreme points of the uncertainty set. As a result, for polyhedral uncertainty sets such as the $\ell_1$-ball and the $\ell_\infty$-ball (Ho et al., 2018; 2021; Behzadian et al., 2021), we can replace the infinite uncertainty set with a finite active uncertainty subset, without losing any useful information for policy optimization. Finally, we discuss some other aspects of the robust value space, including policy agreement on more than one state, the non-convexity of the robust value space, and why it is difficult to obtain a Line Theorem for RMDPs. All proofs and the specifics of the MDPs and RMDPs used for illustration can be found in the Appendix.

2. Preliminaries

We introduce background for MDPs in Section 2.1 and for RMDPs in Section 2.2. Importantly, Section 2.3 sets up some essential concepts and notations for studying the value space, which will be used frequently in the rest of the paper.

Notations. We use $\mathbf{1}$ and $\mathbf{0}$ to denote vectors of all ones and all zeros respectively; their sizes can be inferred from the context. For vectors and matrices, $<$, $\le$, $>$ and $\ge$ denote element-wise comparisons. Calligraphic letters such as $\mathcal{P}$ are mainly used for sets. For an index set $Z = \{1, \ldots, k\}$, $(x_i)_{i\in Z}$ denotes a vector $(x_1, x_2, \ldots, x_k)$ if $x_i$ is a scalar, or a matrix $(x_1, x_2, \ldots, x_k)$ if $x_i$ is a vector. $\Delta_U$ is used to denote the space of probability distributions over a set $U$. For a non-empty set $U$, we denote its polar cone as $U^{\circ}$ (Bertsekas, 2009), given by

$$U^{\circ} := \{y \mid \langle y, x\rangle \le 0,\ \forall x \in U\}. \qquad (1)$$

We use $\mathrm{conv}(\cdot)$ to denote the convex hull of a set, and $\mathrm{ext}(\cdot)$ to denote the set of extreme points of a non-empty convex set.

2.1. Markov Decision Processes

We consider an MDP $(S, A, P, r, \gamma, p_0)$ with a finite state set $S$ and a finite action set $A$. The number of states $|S|$ and the number of actions $|A|$ are denoted by $S$ and $A$, respectively. The initial state is generated according to $p_0 \in \Delta_S$. We use $P_{s,a} \in \Delta_S$ to specify the probabilities of transitioning to new states when taking action $a$ in state $s$, and employ $P := (P_{s,a})_{s\in S, a\in A} \in (\Delta_S)^{S\times A}$ as a condensed notation. An immediate reward $r_{s,a} \in \mathbb{R}$ is given after taking action $a$ in state $s$, and similarly $r := (r_{s,a})_{s\in S, a\in A} \in \mathbb{R}^{S\times A}$ is a condensed notation. $\gamma \in [0, 1)$ is the discount factor. In addition, we also define $P_s := (P_{s,a})_{a\in A} \in (\Delta_S)^A$ and $r_s := (r_{s,a})_{a\in A} \in \mathbb{R}^A$.

A stationary stochastic policy $\pi := (\pi_{s,a})_{s\in S, a\in A} \in (\Delta_A)^S$ specifies a decision-making strategy, where $\pi_{s,a} \in [0, 1]$ is the probability of taking action $a$ in current state $s$. We denote $\pi_s := (\pi_{s,a})_{a\in A} \in \Delta_A$ as the probability vector over actions.
In particular, we use $d_{s,a} \in \Delta_A$ to represent a deterministic $\pi_s$ that is all-zero except $\pi_{s,a} = 1$. Under a given policy $\pi$, we define the state-to-state transition probabilities as $P^{\pi} := (P^{\pi_s})_{s\in S} \in (\Delta_S)^S$, where

$$P^{\pi_s} := P_s \pi_s = \sum_{a\in A} \pi_{s,a} P_{s,a} \in \Delta_S. \qquad (2)$$

The reward function under this policy is defined as $r^{\pi} := (r^{\pi_s})_{s\in S} \in \mathbb{R}^S$, where

$$r^{\pi_s} := r_s^{\top}\pi_s = \sum_{a\in A} \pi_{s,a} r_{s,a} \in \mathbb{R}. \qquad (3)$$

The value $V^{\pi,P} \in \mathbb{R}^S$ is defined to be the expected cumulative reward from starting in a state and acting according to the policy $\pi$ under transition dynamic $P$:

$$V^{\pi,P}(s) := \mathbb{E}^{\pi,P}\left[\sum_{t=0}^{\infty} \gamma^t r_{s_t, a_t} \,\middle|\, s_0 = s\right]. \qquad (4)$$

2.2. Robust Markov Decision Processes

Robust Markov Decision Processes (RMDPs) generalize MDPs in that the uncertainty in the transition dynamic $P$ is considered (Iyengar, 2005; Nilim & El Ghaoui, 2005; Wiesemann et al., 2013). In an RMDP, the transition dynamic $P$ is chosen adversarially from an uncertainty set $\mathcal{P} \subseteq (\Delta_S)^{S\times A}$. We assume throughout the paper that the set $\mathcal{P}$ is compact. The robust value function for a policy $\pi$ and the optimal robust value function are defined as

$$V^{\pi,\mathcal{P}}(s) := \min_{P\in\mathcal{P}} V^{\pi,P}(s), \qquad (5)$$
$$V^{*,\mathcal{P}}(s) := \max_{\pi\in\Pi} V^{\pi,\mathcal{P}}(s). \qquad (6)$$

Both the policy evaluation and policy improvement problems are intractable for generic $\mathcal{P}$ (Wiesemann et al., 2013). However, they become tractable when certain independence assumptions about $\mathcal{P}$ are made. Two common assumptions are $(s,a)$-rectangularity (Iyengar, 2005; Nilim & El Ghaoui, 2005) and $s$-rectangularity (Wiesemann et al., 2013), which we will use in this paper. $(s,a)$-rectangularity assumes that the adversarial nature selects the worst transition probabilities independently for each state and action. Under $(s,a)$-rectangularity, the uncertainty set $\mathcal{P}$ can be factorized into sets $\mathcal{P}_{s,a} \subseteq \Delta_S$ for each state-action pair, i.e.,

$$\mathcal{P} = \{P \mid P_{s,a} \in \mathcal{P}_{s,a},\ \forall s\in S, a\in A\}, \qquad (7)$$

or in short $\mathcal{P} = \times_{(s,a)\in S\times A}\,\mathcal{P}_{s,a}$, where $\times$ denotes the Cartesian product. $s$-rectangularity is less restrictive and assumes the adversarial nature selects the worst transition probabilities independently for each state. Under $s$-rectangularity, the uncertainty set $\mathcal{P}$ can be factorized into sets $\mathcal{P}_s \subseteq (\Delta_S)^A$ for each state, i.e.,

$$\mathcal{P} = \{P \mid P_s \in \mathcal{P}_s,\ \forall s\in S\}, \qquad (8)$$

or in short $\mathcal{P} = \times_{s\in S}\,\mathcal{P}_s$. Note that $(s,a)$-rectangularity is a special case of $s$-rectangularity. Below we present a restatement of the remark in (Ho et al., 2021) that the optimal policy for the robust policy evaluation MDP is deterministic; this restatement will be used later. Under $s$-rectangularity, we have that

$$\text{for any } \pi,\ \exists\, P \in \mathcal{P}\ \text{s.t.}\ V^{\pi,P}(s) = V^{\pi,\mathcal{P}}(s),\ \forall s\in S. \qquad (9)$$

2.3. The Space of Value Functions

The space of value functions (or value space in short) is the set of value functions of all stationary policies. We use $f^P$ and $f^{\mathcal{P}}$ to represent the mappings from a set of policies to their non-robust and robust value functions, respectively, i.e.,

$$f^P(U) := \{V^{\pi,P} \mid \pi\in U\}, \qquad (10)$$
$$f^{\mathcal{P}}(U) := \{V^{\pi,\mathcal{P}} \mid \pi\in U\}. \qquad (11)$$

The set of all stationary stochastic policies is denoted as $\Pi = (\Delta_A)^S$. Then, the non-robust value space for a transition dynamic $P$ and the robust value space for an uncertainty set $\mathcal{P}$ can be respectively expressed as

$$\mathcal{V}^P := f^P(\Pi), \qquad (12)$$
$$\mathcal{V}^{\mathcal{P}} := f^{\mathcal{P}}(\Pi). \qquad (13)$$

We then introduce some notation that will be used frequently later. We use $Y_s^{\pi}$ to denote the set of policies that agree with $\pi$ on $s$, i.e.,

$$Y_s^{\pi} := \{\pi' \mid \pi'_s = \pi_s\}. \qquad (14)$$

Note that policy agreement on state $s$ does not imply disagreement on other states. Thus, $\pi$ itself is also in $Y_s^{\pi}$.
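To make these definitions concrete, the following minimal sketch (our own NumPy code and array layout, not code from the paper) evaluates a fixed policy exactly via Eqns. (2)-(4), and evaluates it robustly for a finite $s$-rectangular uncertainty set by iterating the fixed-policy robust Bellman operator, the standard way to compute Eqn. (5) under rectangularity (Wiesemann et al., 2013):

```python
import numpy as np

def policy_eval(P, r, pi, gamma):
    """Non-robust evaluation (Eqns. (2)-(4)): V^{pi,P} = (I - gamma P^pi)^{-1} r^pi.
    P: (S, A, S) transition tensor, r: (S, A) rewards, pi: (S, A) policy."""
    S = r.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)   # P^{pi_s} = sum_a pi_{s,a} P_{s,a}
    r_pi = np.einsum('sa,sa->s', pi, r)     # r^{pi_s} = sum_a pi_{s,a} r_{s,a}
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def robust_policy_eval(P_sets, r, pi, gamma, iters=1000):
    """Robust evaluation (Eqn. (5)) for a finite s-rectangular set.
    P_sets[s] is a list of candidate P_s matrices, each of shape (A, S).
    Iterates the fixed-policy robust Bellman operator (a gamma-contraction)."""
    S = r.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        V = np.array([min(pi[s] @ r[s] + gamma * (pi[s] @ Ps) @ V
                          for Ps in P_sets[s]) for s in range(S)])
    return V
```

For instance, the 2-state, 3-action MDP listed in Appendix A (used for Figure 2(a) and Figure 3) can be passed in as an array P of shape (2, 3, 2) and r of shape (2, 3), and the finite $s$-rectangular set used for Figures 6-9 becomes a list of two (2, 2) matrices per state; the discount factor is not specified in the appendix, so any $\gamma \in [0, 1)$ must be assumed.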
The row of the matrix $I - \gamma P^{\pi}$ corresponding to state $s$ is denoted as $L^{\pi_s,P_s}$, i.e.,

$$L^{\pi_s,P_s} := e_s - \gamma P^{\pi_s} = e_s - \gamma P_s\pi_s, \qquad (15)$$

where $e_s \in \mathbb{R}^S$ is an all-zero vector except that the entry corresponding to $s$ is 1.

Figure 2. Hyperplanes $\mathcal{H}^{\pi_s,P_s}$ corresponding to different $s$ intersect at the value function $V^{\pi,P}$.

3. The Value Function Polytope Revisited

In this section, we revisit the non-robust value space from a new perspective, where the value space is decomposed in a state-wise manner. This perspective enables us to characterize the polytope shape of the value space in a more straightforward way, and leads to an explicit form of the value polytope. Our first step is to connect a single value function $V^{\pi,P}$ to a set of hyperplanes, each of which can be expressed as

$$\mathcal{H}^{\pi_s,P_s} := \{x \in \mathbb{R}^S \mid \langle x, L^{\pi_s,P_s}\rangle = r^{\pi_s}\}. \qquad (16)$$

As shown in Lemma 3 in (Dadashi et al., 2019), the value functions $f^P(Y_s^{\pi})$ lie in the hyperplane $\mathcal{H}^{\pi_s,P_s}$. In particular, since $\pi \in Y_s^{\pi}$, every hyperplane $\mathcal{H}^{\pi_s,P_s}$ passes through $V^{\pi,P}$ (see examples in Figure 2). The following lemma states that this intersecting point is unique.

Lemma 3.1. Consider a policy $\pi$ and a transition dynamic $P$, we have

$$\{V^{\pi,P}\} = \bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s}. \qquad (17)$$

Lemma 3.1 bridges a single value function and the intersection of $S$ different hyperplanes, each of which corresponds to a state $s$. Then, by definition (Eqn. (12)), we can obtain the value space by taking the union over all $\pi \in \Pi$, i.e.,

$$\mathcal{V}^P = \bigcup_{\pi\in\Pi}\bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s}, \qquad (18)$$

as illustrated in Figure 3(a). From Eqn. (18), we observe that the value space $\mathcal{V}^P$ can also be expressed from an alternative perspective (as shown in Figure 3(b)): 1) for each state $s \in S$, take the union of all hyperplanes corresponding to different $\pi_s \in \Delta_A$; 2) take the intersection of the unions obtained in the previous step. The following lemma formalizes this perspective.

Figure 3. Visualization of the value functions for a 2-state 3-action MDP. (a) For each policy $\pi$, we plot the value function $V^{\pi,P}$ and the corresponding hyperplanes $\mathcal{H}^{\pi_s,P_s}$ intersecting at $V^{\pi,P}$. (b) For each policy $\pi$, the hyperplanes $\mathcal{H}^{\pi_s,P_s}$ intersecting at $V^{\pi,P}$ are plotted in different colors for different states. (c) For each state $s$, the union of $\mathcal{H}_+^{\pi_s,P_s}$ and the union of $\mathcal{H}_-^{\pi_s,P_s}$ over all $\pi_s \in \Delta_A$ are highlighted respectively. (d) For each state $s$, the hyperplanes $\mathcal{H}^{d_{s,a},P_s}$ for different actions $a$ are plotted. The union of $\mathcal{H}_+^{d_{s,a},P_s}$ and the union of $\mathcal{H}_-^{d_{s,a},P_s}$ over all actions $a \in A$ are highlighted as dashed. The entire value space $\mathcal{V}^P$ is visualized as the purple region.

Lemma 3.2. Consider a transition dynamic $P$, the value space $\mathcal{V}^P$ can be represented as

$$\mathcal{V}^P = \bigcup_{\pi\in\Pi}\bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s} = \bigcap_{s\in S}\bigcup_{\pi_s\in\Delta_A}\mathcal{H}^{\pi_s,P_s}. \qquad (19)$$

As suggested in Lemma 3.2, the core of this perspective is to decompose the value space in a state-wise manner. In this way, to study the whole value space, we only need to focus on the union of hyperplanes corresponding to one state. Specifically, let us denote the two closed half-spaces determined by the hyperplane $\mathcal{H}^{\pi_s,P_s}$ as

$$\mathcal{H}_+^{\pi_s,P_s} := \{x\in\mathbb{R}^S \mid \langle x, L^{\pi_s,P_s}\rangle \ge r^{\pi_s}\}, \qquad \mathcal{H}_-^{\pi_s,P_s} := \{x\in\mathbb{R}^S \mid \langle x, L^{\pi_s,P_s}\rangle \le r^{\pi_s}\}. \qquad (20)$$

Then the value space can be expressed in terms of the half-spaces:

$$\mathcal{V}^P = \bigcap_{s\in S}\bigcup_{\pi_s\in\Delta_A}\left(\mathcal{H}_+^{\pi_s,P_s} \cap \mathcal{H}_-^{\pi_s,P_s}\right). \qquad (21)$$

Recall that in (Dadashi et al., 2019) a convex polyhedron is defined as a finite intersection of half-spaces, and a polytope is a bounded finite union of convex polyhedra. So our goal is to get rid of the infinite union over $\pi_s \in \Delta_A$ in Eqn. (21). To this end, we first replace the inner union in Eqn. (21) with an intersection of two unions, as illustrated in Figure 3(c) and formally stated in Lemma 3.3 below.
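Before turning to Lemma 3.3, here is a quick numerical illustration of the policy-agreement hyperplanes (a sketch with our own helper names, not code from the paper): sample policies that agree with a fixed $\pi_s$ on one state and check that their values all satisfy $\langle V, L^{\pi_s,P_s}\rangle = r^{\pi_s}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def value(P, r, pi, gamma):
    """V^{pi,P} = (I - gamma P^pi)^{-1} r^pi for a tabular MDP."""
    S = r.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def max_hyperplane_gap(P, r, gamma, s, pi_s, n_samples=200):
    """Sample policies agreeing with pi_s on state s and return the largest
    |<V, L^{pi_s,P_s}> - r^{pi_s}|; it should be ~0, since f^P(Y_s^pi) lies in
    the hyperplane H^{pi_s,P_s} (Lemma 3 of Dadashi et al., 2019)."""
    S, A = r.shape
    L = np.eye(S)[s] - gamma * (pi_s @ P[s])    # L^{pi_s,P_s} = e_s - gamma P_s pi_s
    r_pi_s = pi_s @ r[s]
    gaps = []
    for _ in range(n_samples):
        pi = rng.dirichlet(np.ones(A), size=S)  # random stochastic policy
        pi[s] = pi_s                            # force agreement on state s
        gaps.append(abs(value(P, r, pi, gamma) @ L - r_pi_s))
    return max(gaps)
```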
Lemma 3.3. Consider a policy $\pi$ and a transition dynamic $P$, we have for all states $s \in S$,

$$\bigcup_{\pi_s\in\Delta_A}\left(\mathcal{H}_+^{\pi_s,P_s} \cap \mathcal{H}_-^{\pi_s,P_s}\right) = \left(\bigcup_{\pi_s\in\Delta_A}\mathcal{H}_+^{\pi_s,P_s}\right)\cap\left(\bigcup_{\pi_s\in\Delta_A}\mathcal{H}_-^{\pi_s,P_s}\right). \qquad (22)$$

Although these two unions are still taken over the infinite set $\Delta_A$, the following Lemma 3.4 shows that they actually coincide with the finite unions of half-spaces that correspond to $d_{s,a}$ (i.e., deterministic $\pi_s$). We can get an intuition by comparing Figure 3(c) and Figure 3(d).

Lemma 3.4. Consider a policy $\pi$ and a transition dynamic $P$, we have for all states $s \in S$,

$$\bigcup_{\pi_s\in\Delta_A}\mathcal{H}_{\delta}^{\pi_s,P_s} = \bigcup_{a\in A}\mathcal{H}_{\delta}^{d_{s,a},P_s}, \qquad \delta\in\{+, -\}. \qquad (23)$$

Finally, putting everything together, we are able to represent the value space with finitely many union and intersection operations on half-spaces, as stated in Theorem 3.5 and illustrated in Figure 3(d). Using the distributive law of sets, we can see that the value space $\mathcal{V}^P$ is a finite union of convex polyhedra. Since $\mathcal{V}^P$ is bounded, we can conclude that $\mathcal{V}^P$ is a polytope.

Theorem 3.5. Consider a transition dynamic $P$, the value space $\mathcal{V}^P$ can be represented as

$$\mathcal{V}^P = \bigcap_{s\in S}\left[\left(\bigcup_{a\in A}\mathcal{H}_+^{d_{s,a},P_s}\right)\cap\left(\bigcup_{a\in A}\mathcal{H}_-^{d_{s,a},P_s}\right)\right] = \bigcup_{a,\,a'}\bigcap_{s\in S}\left[\mathcal{H}_+^{d_{s,a_s},P_s}\cap\mathcal{H}_-^{d_{s,a'_s},P_s}\right], \qquad (24)$$

where $a = (a_s)_{s\in S}$, $a' = (a'_s)_{s\in S}$, and $a_s, a'_s \in A$.

Compared to the prior approach (Dadashi et al., 2019), our work gives an explicit form of the value function polytope, showing how the value polytope is formed (cf. the proof of Proposition 1 in (Dadashi et al., 2019)).
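As a sanity check on the finite characterization in Theorem 3.5, one can test whether a point of $\mathbb{R}^S$ satisfies the right-hand side of Eqn. (24): for every state there must be one action whose deterministic half-space constraint holds from above and one for which it holds from below. A sketch (helper names are ours):

```python
import numpy as np

def deterministic_gaps(P, r, V, gamma):
    """gaps[s, a] = <V, L^{d_{s,a},P_s}> - r_{s,a}, with L^{d_{s,a},P_s} = e_s - gamma P_{s,a}."""
    S, A = r.shape
    gaps = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            gaps[s, a] = V @ (np.eye(S)[s] - gamma * P[s, a]) - r[s, a]
    return gaps

def satisfies_theorem_3_5(P, r, V, gamma, tol=1e-9):
    """Membership test for Eqn. (24): every state needs an action with gap >= 0
    and an action with gap <= 0."""
    g = deterministic_gaps(P, r, V, gamma)
    return bool(np.all(g.max(axis=1) >= -tol) and np.all(g.min(axis=1) <= tol))
```

Evaluating this on the values of randomly sampled policies should always return True, whereas a generic point of $\mathbb{R}^S$ away from the polytope typically fails the test.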
4. Value Space Geometry of RMDPs

4.1. Policy Agreement and the Conic Hypersurface

Figure 4. Visualization of the robust value functions for a 2-state 2-action RMDP with an $s$-rectangular uncertainty set. We consider $\mathcal{P} = \mathcal{P}_{s_1}\times\mathcal{P}_{s_2}$ with $\mathcal{P}_{s_1} = \{P_{s_1}^{(1)}, P_{s_1}^{(2)}\}$ and $\mathcal{P}_{s_2} = \{P_{s_2}^{(1)}, P_{s_2}^{(2)}\}$. Denote by $P^{(ij)} \in \mathcal{P}$ the element such that $P^{(ij)}_{s_1} = P^{(i)}_{s_1}$ and $P^{(ij)}_{s_2} = P^{(j)}_{s_2}$. We plot with different widths to differentiate overlapping lines. (a) For the same set of policies $Y_{s_1}^{\pi}$, the set of non-robust value functions $f^P(Y_{s_1}^{\pi})$ for different $P \in \mathcal{P}$ and the set of robust value functions $f^{\mathcal{P}}(Y_{s_1}^{\pi})$ are plotted. (b) For different $P \in \mathcal{P}$, $f^P(Y_{s_1}^{\pi})$ are highlighted in different colors. The hyperplanes corresponding to different $P_{s_1} \in \mathcal{P}_{s_1}$ are plotted.

Recall that in Section 3, our new perspective connects the value space to the hyperplanes in which $f^P(Y_s^{\pi})$ lies. Thus, in order to characterize the robust value space, we start by studying the geometric properties of the robust value functions of all policies that agree on one state, i.e., $f^{\mathcal{P}}(Y_s^{\pi})$. Unlike the non-robust case, $f^{\mathcal{P}}(Y_s^{\pi})$ may not lie in a hyperplane, as shown in Figure 4(a). Nevertheless, it appears that $f^{\mathcal{P}}(Y_s^{\pi})$ still lies in a hypersurface (also see the example for $|S| = 3$ in the supplementary). In what follows, we are going to characterize this hypersurface. First, as shown in Figure 4(b), for different $P \in \mathcal{P}$ that share the same $P_s$, their $f^P(Y_s^{\pi})$ lie in the same hyperplane $\mathcal{H}^{\pi_s,P_s}$. Comparing Figure 4(a) and (b), it seems that the robust value functions $f^{\mathcal{P}}(Y_s^{\pi})$ always lie in the lower half-space $\mathcal{H}_-^{\pi_s,P_s}$ for different $P \in \mathcal{P}$. On the other hand, from Eqn. (9), we know that there exists $P_s \in \mathcal{P}_s$ such that $V^{\pi,\mathcal{P}}$ lies in the hyperplane $\mathcal{H}^{\pi_s,P_s}$. Putting it together, we have the following lemma about $f^{\mathcal{P}}(Y_s^{\pi})$.

Lemma 4.1. Consider an $s$-rectangular uncertainty set $\mathcal{P}$ and a policy $\pi$, we have for all states $s \in S$,

$$f^{\mathcal{P}}(Y_s^{\pi}) \subseteq \left[\bigcap_{P_s\in\mathcal{P}_s}\mathcal{H}_-^{\pi_s,P_s}\right]\cap\left[\bigcup_{P_s\in\mathcal{P}_s}\mathcal{H}^{\pi_s,P_s}\right]. \qquad (25)$$

Figure 5. (a) For different $P_s \in \mathcal{P}_s$, the hyperplanes $\mathcal{H}^{\pi_s,P_s}$ intersect at one point. (b) Illustration of the conic hypersurface in which $f^{\mathcal{P}}(Y_s^{\pi})$ lies.

Note that the right-hand side (RHS) of Eqn. (25) is essentially the boundary of the intersection of half-spaces $\bigcap_{P_s\in\mathcal{P}_s}\mathcal{H}_-^{\pi_s,P_s}$. To further characterize the geometry, we need to know how these half-spaces intersect (equivalently, how the hyperplanes intersect). One interesting observation is that even when $\mathcal{P}_s$ contains more than 2 elements, the hyperplanes still intersect at one point, as illustrated in Figure 5(a). The following lemma states this property and also gives the intersecting point.

Lemma 4.2. Consider an $s$-rectangular uncertainty set $\mathcal{P}$ and a policy $\pi$, we have for all states $s \in S$,

$$\frac{r^{\pi_s}}{1-\gamma}\mathbf{1} \in \bigcap_{P_s\in\mathcal{P}_s}\mathcal{H}^{\pi_s,P_s}. \qquad (26)$$

Since the hyperplanes intersect at the same point, the intersection of the half-spaces will be a convex cone. We denote

$$\mathcal{C}_+^{\pi_s,\mathcal{P}_s} := \{x\in\mathbb{R}^S \mid \langle x, L^{\pi_s,P_s}\rangle \ge r^{\pi_s} \text{ for some } P_s\in\mathcal{P}_s\},$$
$$\mathcal{C}_-^{\pi_s,\mathcal{P}_s} := \{x\in\mathbb{R}^S \mid \langle x, L^{\pi_s,P_s}\rangle \le r^{\pi_s} \text{ for all } P_s\in\mathcal{P}_s\}. \qquad (27)$$

The following corollary characterizes the hypersurface in which $f^{\mathcal{P}}(Y_s^{\pi})$ lies. Figure 5(b) gives an illustration.

Corollary 4.3. Consider an $s$-rectangular uncertainty set $\mathcal{P}$ and a policy $\pi$, we have for all states $s \in S$,

$$f^{\mathcal{P}}(Y_s^{\pi}) \subseteq \mathcal{C}^{\pi_s,\mathcal{P}_s}, \qquad (28)$$

where $\mathcal{C}^{\pi_s,\mathcal{P}_s} = \mathcal{C}_+^{\pi_s,\mathcal{P}_s}\cap\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$ is a conic hypersurface.

Figure 6. Visualization of the robust value functions for a 2-state 2-action RMDP with an $s$-rectangular uncertainty set. (a) For a fixed $\pi$, the conic hypersurfaces $\mathcal{C}^{\pi_s,\mathcal{P}_s}$ corresponding to different $s$ intersect at the robust value function $V^{\pi,\mathcal{P}}$. (b) For each policy $\pi$, the robust value function $V^{\pi,\mathcal{P}}$ is plotted, and the corresponding conic hypersurfaces $\mathcal{C}^{\pi_s,\mathcal{P}_s}$ intersecting at $V^{\pi,\mathcal{P}}$ are plotted in different colors for different states. (c) For each state $s$, the union of $\mathcal{C}_+^{\pi_s,\mathcal{P}_s}$ and the union of $\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$ over all $\pi_s \in \Delta_A$ are highlighted respectively. (d) For each state $s$, the union of $\mathcal{C}_+^{d_{s,a},\mathcal{P}_s}$ and the union of $\mathcal{C}_-^{d_{s,a},\mathcal{P}_s}$ over all $a \in A$ are highlighted respectively.

4.2. The Robust Value Space

With the knowledge about the geometry of $f^{\mathcal{P}}(Y_s^{\pi})$, we are now ready to characterize the entire robust value space $\mathcal{V}^{\mathcal{P}}$. Similar to Section 3, we first connect a single robust value function to the intersection of $S$ different conic hypersurfaces by the following lemma (see Figure 6(a) for an illustration).

Lemma 4.4. Consider an $s$-rectangular uncertainty set $\mathcal{P}$ and a policy $\pi$, we have

$$\{V^{\pi,\mathcal{P}}\} = \bigcap_{s\in S}\mathcal{C}^{\pi_s,\mathcal{P}_s}. \qquad (29)$$

Then, from the introduced perspective, we show that the robust value space can also be viewed as an intersection of state-wise unions of conic hypersurfaces, as illustrated in Figure 6(b) and formally stated in Lemma 4.5.

Figure 7. The robust value space $\mathcal{V}^{\mathcal{P}}$ and the conic hypersurfaces $\mathcal{C}^{d_{s,a},\mathcal{P}_s}$ under $s$-rectangularity (a) and $(s,a)$-rectangularity (b).

Lemma 4.5. Consider an $s$-rectangular uncertainty set $\mathcal{P}$, the robust value space $\mathcal{V}^{\mathcal{P}}$ can be represented as

$$\mathcal{V}^{\mathcal{P}} = \bigcup_{\pi\in\Pi}\bigcap_{s\in S}\mathcal{C}^{\pi_s,\mathcal{P}_s} = \bigcap_{s\in S}\bigcup_{\pi_s\in\Delta_A}\mathcal{C}^{\pi_s,\mathcal{P}_s}. \qquad (30)$$

Next, we show the equivalence between each inner union in the RHS of the above equation and an intersection of two unions in Lemma 4.6. Figure 6(c) gives an illustration. Similar to the non-robust case, Lemma 4.6 will help us characterize the relationship between the robust value space $\mathcal{V}^{\mathcal{P}}$ and the conic hypersurfaces corresponding to $d_{s,a}$.

Lemma 4.6. Consider an $s$-rectangular uncertainty set $\mathcal{P}$, we have for all states $s \in S$,

$$\bigcup_{\pi_s\in\Delta_A}\mathcal{C}^{\pi_s,\mathcal{P}_s} = \left(\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_+^{\pi_s,\mathcal{P}_s}\right)\cap\left(\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_-^{\pi_s,\mathcal{P}_s}\right). \qquad (31)$$
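A quick numerical check of Lemma 4.4 and Corollary 4.3 on a finite $s$-rectangular set (a sketch; the helper below expects the robust value $V^{\pi,\mathcal{P}}$ as an input, e.g., computed with the robust_policy_eval sketch from Section 2):

```python
import numpy as np

def conic_residuals(P_sets, r, pi, V, gamma):
    """For each state s, return the residuals <V, L^{pi_s,P_s}> - r^{pi_s}
    over the candidates P_s in P_sets[s] (each of shape (A, S)).
    If V is the robust value V^{pi,P}, Lemma 4.4 / Corollary 4.3 require every
    residual to be <= 0 with at least one equal to 0 in each state, i.e. V lies
    on every conic hypersurface C^{pi_s,P_s}."""
    S = r.shape[0]
    residuals = []
    for s in range(S):
        e_s = np.eye(S)[s]
        res = [V @ (e_s - gamma * (pi[s] @ Ps)) - pi[s] @ r[s] for Ps in P_sets[s]]
        residuals.append(np.array(res))
    return residuals
```

Up to the tolerance of the fixed-point iteration that produced V, every state's residual vector should be nonpositive and its maximum approximately zero.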
As shown in Figure 6(d), unlike the non-robust case, the infinite union $\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$ does not necessarily coincide with the finite union $\bigcup_{a\in A}\mathcal{C}_-^{d_{s,a},\mathcal{P}_s}$. The following Lemma 4.7 characterizes their relationship.

Lemma 4.7. Consider an $s$-rectangular uncertainty set $\mathcal{P}$, we have for all states $s \in S$,

$$\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_+^{\pi_s,\mathcal{P}_s} = \bigcup_{a\in A}\mathcal{C}_+^{d_{s,a},\mathcal{P}_s}, \qquad (32)$$
$$\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_-^{\pi_s,\mathcal{P}_s} \supseteq \bigcup_{a\in A}\mathcal{C}_-^{d_{s,a},\mathcal{P}_s}, \qquad (33)$$

where the equality in the second line holds when $\mathcal{P}$ is $(s,a)$-rectangular.

Putting it together, the robust value space can be characterized as in Theorem 4.8. Figure 7 highlights the difference in the robust value space between $s$-rectangularity and $(s,a)$-rectangularity, using the same set of probability values (see Appendix A). Our results also provide a geometric perspective on why the optimal policies under $s$-rectangularity might be stochastic, which is only exemplified in prior works (Wiesemann et al., 2013): the robust value functions of deterministic policies always lie in the region defined by the RHS of Eqn. (34), but the optimal robust value might lie outside of it.

Theorem 4.8. Consider an $s$-rectangular uncertainty set $\mathcal{P}$, the robust value space $\mathcal{V}^{\mathcal{P}}$ satisfies

$$\mathcal{V}^{\mathcal{P}} = \bigcap_{s\in S}\left[\left(\bigcup_{a\in A}\mathcal{C}_+^{d_{s,a},\mathcal{P}_s}\right)\cap\left(\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_-^{\pi_s,\mathcal{P}_s}\right)\right] \supseteq \bigcap_{s\in S}\left[\left(\bigcup_{a\in A}\mathcal{C}_+^{d_{s,a},\mathcal{P}_s}\right)\cap\left(\bigcup_{a\in A}\mathcal{C}_-^{d_{s,a},\mathcal{P}_s}\right)\right], \qquad (34)$$

where the equality in the second line holds when $\mathcal{P}$ is $(s,a)$-rectangular.

Figure 8. A closer look at the extra region under $s$-rectangularity. Here $\mathcal{P}_{s_2} = \{P_{s_2}^{(1)}, P_{s_2}^{(2)}\}$. We highlight the hyperplanes in (a), and the upper and lower bounds of the region $\bigcup_{\pi_s\in\Delta_A}\mathcal{C}^{\pi_s,\mathcal{P}_s}$ in (b). Note that the black boundaries in (b) are composed of the hyperplanes in (a).

Furthermore, we take a closer look at this extra region under $s$-rectangularity. Since the space can be decomposed state-wise, we focus on a single state $s$. Recall the definition of $\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$ in Eqn. (27), i.e.,

$$\mathcal{C}_-^{\pi_s,\mathcal{P}_s} = \bigcap_{P_s\in\mathcal{P}_s}\mathcal{H}_-^{\pi_s,P_s}. \qquad (35)$$

From our results in Section 3 (Lemma 3.4), we know

$$\bigcup_{\pi_s\in\Delta_A}\mathcal{H}_-^{\pi_s,P_s} = \bigcup_{a\in A}\mathcal{H}_-^{d_{s,a},P_s}. \qquad (36)$$

Therefore, we can obtain

$$\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_-^{\pi_s,\mathcal{P}_s} \subseteq \bigcap_{P_s\in\mathcal{P}_s}\bigcup_{a\in A}\mathcal{H}_-^{d_{s,a},P_s}, \qquad (37)$$

and accordingly

$$\bigcup_{\pi_s\in\Delta_A}\mathcal{C}^{\pi_s,\mathcal{P}_s} \subseteq \bigcap_{P_s\in\mathcal{P}_s}\bigcup_{a\in A}\mathcal{H}_-^{d_{s,a},P_s}. \qquad (38)$$

The RHS of the above equation gives an upper bound of the region $\bigcup_{\pi_s\in\Delta_A}\mathcal{C}^{\pi_s,\mathcal{P}_s}$, while the RHS of Eqn. (33) provides a lower bound. The extra region lies within the gap between them. Figure 8 gives an illustration using the same RMDP example as in Figure 7.

4.3. Active Uncertainty Subsets

In the above sections, we have shown that the robust value space $\mathcal{V}^{\mathcal{P}}$ depends on $\mathcal{P}$ in the form of a set of conic hypersurfaces $\mathcal{C}^{\pi_s,\mathcal{P}_s}$. In this section, by taking a closer look at how $\mathcal{P}_s$ and $\mathcal{C}^{\pi_s,\mathcal{P}_s}$ are related, we will show that only a subset $\mathcal{P}' \subseteq \mathcal{P}$ is sufficient to determine the robust value space, i.e.,

$$\mathcal{V}^{\mathcal{P}} = \mathcal{V}^{\mathcal{P}'}. \qquad (39)$$

We term $\mathcal{P}'$ the active uncertainty subset, analogous to active constraints, in the sense that all $P \in \mathcal{P}'$ are active in determining the shape of the robust value space $\mathcal{V}^{\mathcal{P}}$.

First, let us keep $\pi_s$ fixed, and note that the conic hypersurface $\mathcal{C}^{\pi_s,\mathcal{P}_s}$ is uniquely determined by the convex cone $\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$. We then focus on the relationship between $\mathcal{P}_s$ and $\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$. Denote the set

$$\mathcal{L}^{\pi_s,\mathcal{P}_s} := \{L^{\pi_s,P_s} \mid P_s\in\mathcal{P}_s\}. \qquad (40)$$

From the definition of $\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$, we can see that $\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$ is exactly the polar cone of $\mathcal{L}^{\pi_s,\mathcal{P}_s}$ (plus a translation), i.e.,

$$\mathcal{C}_-^{\pi_s,\mathcal{P}_s} = \left(\mathcal{L}^{\pi_s,\mathcal{P}_s}\right)^{\circ} + \left\{\frac{r^{\pi_s}}{1-\gamma}\mathbf{1}\right\}, \qquad (41)$$

where $+$ here denotes the Minkowski addition. Figure 9(a) gives an illustration.

Figure 9. Visualization of the convex cone $\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$ for a fixed $\pi_s$ and different $\mathcal{P}_s$. The translation $\frac{r^{\pi_s}}{1-\gamma}\mathbf{1}$ is ignored since $\pi_s$ is fixed. (a) We set $\mathcal{P}_s = \{P_s^{(1)}, P_s^{(2)}\}$. (b) We set $\mathcal{P}_s = \{P_s^{(\mu)} \mid P_s^{(\mu)} = \mu P_s^{(1)} + (1-\mu)P_s^{(2)},\ 0\le\mu\le 1\}$ and also plot the hyperplanes $\mathcal{H}^{\pi_s,P_s^{(\mu)}}$ for different $\mu$.

Note that for a fixed $\pi_s$, $\mathcal{L}^{\pi_s,\mathcal{P}_s}$ is the image of $\mathcal{P}_s$ under a fixed affine transformation. We denote this affine transformation as $g$, i.e., $\mathcal{L}^{\pi_s,\mathcal{P}_s} = g(\mathcal{P}_s)$.
Then we are able to obtain the following lemma.

Lemma 4.9. Consider an $s$-rectangular uncertainty set $\mathcal{P}$ and a policy $\pi$, we have

$$\left(\mathcal{L}^{\pi_s,\mathcal{P}_s}\right)^{\circ} = \left(g(\mathcal{P}_s)\right)^{\circ} = \left(g(\mathrm{ext}(\mathrm{conv}(\mathcal{P}_s)))\right)^{\circ}. \qquad (42)$$

This lemma implies that, in order to determine the conic hypersurface $\mathcal{C}^{\pi_s,\mathcal{P}_s}$, we only need to care about those $P_s \in \mathcal{P}_s$ that are extreme points of the convex hull. Figure 9(b) gives an illustration. We then generalize this to the whole robust value space and present the following theorem.

Theorem 4.10. Consider an $s$-rectangular uncertainty set $\mathcal{P}$, we have

$$\mathcal{V}^{\mathcal{P}} = \mathcal{V}^{\mathcal{P}'}, \qquad (43)$$

where $\mathcal{P}' = \mathrm{ext}(\mathrm{conv}(\mathcal{P})) \subseteq \mathcal{P}$.

If $\mathcal{P}$ (or more generally $\mathrm{conv}(\mathcal{P})$) is polyhedral, such as an $\ell_1$-ball or an $\ell_\infty$-ball (Ho et al., 2018; 2021; Behzadian et al., 2021), then we can reduce $\mathcal{P}$ to a finite set without losing any useful information for policy optimization. In addition, $\mathrm{conv}(\mathcal{P})$ being polyhedral implies that $\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$ is a polyhedral cone. Combined with Theorem 4.8, this means that the robust value space for such an $(s,a)$-rectangular uncertainty set will be a polytope.

5. Discussion

5.1. Policy Agreement on More States

We already know that the value functions of policies that agree on a single state lie in a hyperplane for MDPs (Dadashi et al., 2019), and in a conic hypersurface for $s$-rectangular RMDPs (Section 4.1). One natural question is what the space of value functions looks like when we fix the policies at more states. With our new decomposition-based perspective, the results are immediately available from Lemma 3.2 and Lemma 4.5.

Figure 10. A spectrum of the spaces of value functions.

In Figure 10, we show the space of value functions for policies that agree on the states in a subset $S' \subseteq S$, under both the non-robust and the robust setting. Moreover, as illustrated in Figure 10, our perspective reveals a spectrum of the spaces of value functions. When the policies agree on all states, it reduces to a single value function. When the policies are free to vary on all states, it is the whole value space. This perspective enables us to characterize every point on this spectrum in an explicit form. In comparison, for the non-robust case, prior works (Dadashi et al., 2019) only prove that the spaces are polytopes without giving a clear characterization.

5.2. The Non-convexity of the Robust Value Space

Figure 11. (a) The intersections between the robust value space and axis-parallel lines are line segments. (b) An example showing that the robust value space is not star-convex.

Like the non-robust case, the robust value space $\mathcal{V}^{\mathcal{P}}$ is also possibly non-convex (e.g., Figure 7). Despite the non-convexity, $\mathcal{V}^{\mathcal{P}}$ exhibits some interesting properties analogous to monotone polygons. As shown in Figure 11(a), for any point in the robust value space $\mathcal{V}^{\mathcal{P}}$, if we draw an axis-parallel line passing through this point, the intersection will be a line segment (or a point in the degenerate case). We formalize this observation in the following corollary.

Corollary 5.1. Consider an $s$-rectangular uncertainty set $\mathcal{P}$. If an axis-parallel line intersects the robust value space $\mathcal{V}^{\mathcal{P}}$, then the intersection will be a line segment.

From the examples in Figure 7, one may wonder whether the robust value space $\mathcal{V}^{\mathcal{P}}$ is a star-convex set. For many randomly generated RMDPs, $\mathcal{V}^{\mathcal{P}}$ does look like a star-convex set (see Figure 12 in Appendix C). However, we show a carefully crafted counter-example in Figure 11(b), which is clearly not star-shaped. Nevertheless, it seems to be a rare case. One interesting question to explore in the future is how non-convex the robust value space can be and how likely it is to exhibit such non-convexity. If it is nearly convex most of the time, then we might be able to design efficient algorithms tailored to such cases.
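Returning briefly to Theorem 4.10, here is a small numerical illustration on a finite $s$-rectangular set (a sketch; helper names are ours, and the robust evaluation is the fixed-policy iteration sketched in Section 2): augmenting each $\mathcal{P}_s$ with convex combinations of its members, which are not extreme points of $\mathrm{conv}(\mathcal{P}_s)$, should leave robust values unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def robust_value(P_sets, r, pi, gamma, iters=1000):
    """Fixed-policy robust evaluation for a finite s-rectangular set."""
    S = r.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        V = np.array([min(pi[s] @ r[s] + gamma * (pi[s] @ Ps) @ V
                          for Ps in P_sets[s]) for s in range(S)])
    return V

def augment_with_mixtures(P_sets, n_extra=5):
    """Add random convex combinations of existing candidates to each P_s;
    the added points are not extreme points of conv(P_s)."""
    new_sets = []
    for candidates in P_sets:
        extras = []
        for _ in range(n_extra):
            w = rng.dirichlet(np.ones(len(candidates)))
            extras.append(sum(wi * Ps for wi, Ps in zip(w, candidates)))
        new_sets.append(list(candidates) + extras)
    return new_sets

# For any policy pi, robust_value(P_sets, r, pi, gamma) and
# robust_value(augment_with_mixtures(P_sets), r, pi, gamma) should agree,
# consistent with Lemma 4.9 and Theorem 4.10.
```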
5.3. The Line Theorem for RMDPs

As mentioned before, one major obstacle that prevents us from adapting the prior method (Dadashi et al., 2019) from MDPs to RMDPs is that deriving a robust counterpart of the Line Theorem is highly challenging. Here we elaborate on this issue, with the help of our findings about the robust value space. Without loss of generality, suppose the set of policies only differ on $s_1$. From the discussion in Section 5.1, we know the resulting set of robust value functions is

$$\left[\bigcap_{i=2}^{S}\mathcal{C}^{\pi_{s_i},\mathcal{P}_{s_i}}\right]\cap\left[\bigcup_{\pi_{s_1}\in\Delta_A}\mathcal{C}^{\pi_{s_1},\mathcal{P}_{s_1}}\right].$$

The first term is an intersection of $S - 1$ conic hypersurfaces and the second term is an infinite union of conic hypersurfaces. Both are hard to characterize further. For example, though we know the first term could be a curve, it is challenging to give a closed-form expression for it. In comparison, for MDPs, the first term is just a line and its direction is known (see the proof of Lemma 4 (ii) in (Dadashi et al., 2019)).

6. Related Works

The geometry of the space of value functions has been studied only recently. Dadashi et al. (2019) first investigate it, and establish that for MDPs the value space is a possibly non-convex polytope. Their results provide a geometric perspective that helps understand the dynamics of different RL algorithms (Kumar et al., 2019; Chan et al., 2020; Harb et al., 2020; Chan et al., 2021), and also inspire new methods in representation learning in RL (Bellemare et al., 2019; Dabney et al., 2021). In the RMDP literature, some works take advantage of the geometric properties of special uncertainty sets to design efficient algorithms (Ho et al., 2018; Behzadian et al., 2021; Ho et al., 2021), but no prior work studies the geometry of the robust value space. Our work can be viewed as an extension of (Dadashi et al., 2019) to RMDPs. We introduce a new perspective to characterize the geometric properties of the value space for RMDPs. Our approach also leads to a finer characterization of the value function polytope in the MDP setting.

7. Conclusion and Future Work

In this work, we characterize the geometry of the space of robust value functions from a new perspective, where the value space is decomposed in a state-wise manner. We show that the robust value space is determined by a set of conic hypersurfaces. Furthermore, we can reduce the uncertainty set to a subset of extreme points without sacrificing any useful information for policy optimization. There remain some interesting open questions. As discussed in Section 5, it is worth studying how non-convex the robust value space can be (i.e., can it be approximated by a convex set?). A further question is whether the level of non-convexity increases or decreases with the number of states/actions. Another direction is to investigate the geometry for other uncertainty sets, such as coupled uncertainty (Mannor et al., 2012), r-rectangular sets (Goyal & Grand-Clément, 2022) or more general ones. In addition, as in the non-robust case, it is interesting to study the geometry of robust value functions when the state space is very large and some approximation is needed. We leave these questions to future work.

Acknowledgements

This work was partially supported by the Israel Science Foundation under contract 2199/20. We appreciate the valuable feedback from the ICML anonymous reviewers.
We also thank Bingyi Kang and Pengqian Yu for some helpful discussions about RMDPs.

References

Behzadian, B., Petrik, M., and Ho, C. P. Fast algorithms for ℓ∞-constrained s-rectangular robust MDPs. In Advances in Neural Information Processing Systems, volume 34, pp. 25982-25992, 2021. URL https://proceedings.neurips.cc/paper/2021/file/da4fb5c6e93e74d3df8527599fa62642-Paper.pdf.

Bellemare, M., Dabney, W., Dadashi, R., Ali Taiga, A., Castro, P. S., Le Roux, N., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/3cf2559725a9fdfa602ec8c887440f32-Paper.pdf.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, 1957.

Bertsekas, D. Convex Optimization Theory. Athena Scientific, 2009. ISBN 9781886529311. URL http://www.athenasc.com/convexduality.html.

Bertsekas, D., Nedic, A., and Ozdaglar, A. Convex Analysis and Optimization. Athena Scientific, 2003. ISBN 9781886529458. URL http://www.athenasc.com/convexity.html.

Chan, A., Asis, K. D., and Sutton, R. S. Inverse policy evaluation for value-based sequential decision-making. CoRR, abs/2008.11329, 2020. URL https://arxiv.org/abs/2008.11329.

Chan, A., Silva, H., Lim, S., Kozuno, T., Mahmood, A. R., and White, M. Greedification operators for policy optimization: Investigating forward and reverse KL divergences. CoRR, abs/2107.08285, 2021. URL https://arxiv.org/abs/2107.08285.

Dabney, W., Barreto, A., Rowland, M., Dadashi, R., Quan, J., Bellemare, M. G., and Silver, D. The value-improvement path: Towards better representations for reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):7160-7168, 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/16880.

Dadashi, R., Taiga, A. A., Roux, N. L., Schuurmans, D., and Bellemare, M. G. The value function polytope in reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1486-1495. PMLR, 2019. URL https://proceedings.mlr.press/v97/dadashi19a.html.

Dattorro, J. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing, 2005. ISBN 9780976401308. URL https://meboo.convexoptimization.com/.

de Boer, P.-T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19-67, 2005. doi: 10.1007/s10479-005-5724-z.

Goyal, V. and Grand-Clément, J. Robust Markov decision processes: Beyond rectangularity. Mathematics of Operations Research, 2022. doi: 10.1287/moor.2022.1259. URL https://doi.org/10.1287/moor.2022.1259.

Harb, J., Schaul, T., Precup, D., and Bacon, P. Policy evaluation networks. CoRR, abs/2002.11833, 2020. URL https://arxiv.org/abs/2002.11833.

Ho, C. P., Petrik, M., and Wiesemann, W. Fast Bellman updates for robust MDPs. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1979-1988. PMLR, 2018. URL https://proceedings.mlr.press/v80/ho18a.html.
Ho, C. P., Petrik, M., and Wiesemann, W. Partial policy iteration for ℓ1-robust Markov decision processes. Journal of Machine Learning Research, 22(275):1-46, 2021. URL http://jmlr.org/papers/v22/20-445.html.

Howard, R. A. Dynamic Programming and Markov Processes. John Wiley, Oxford, England, 1960.

Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257-280, 2005. URL http://www.jstor.org/stable/25151652.

Krein, M. and Milman, D. On extreme points of regular convex sets. Studia Mathematica, 9(1):133-138, 1940. URL http://eudml.org/doc/219061.

Kumar, S., Ahmed, Z., Dadashi, R., Schuurmans, D., and Bellemare, M. G. Generalized policy updates for policy optimization. In NeurIPS 2019 Optimization Foundations for Reinforcement Learning Workshop, 2019.

Mannor, S., Mebel, O., and Xu, H. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In Proceedings of the 29th International Conference on Machine Learning, ICML '12, pp. 451-458. Omnipress, 2012. ISBN 9781450312851.

Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780-798, 2005. doi: 10.1287/opre.1050.0216. URL https://doi.org/10.1287/opre.1050.0216.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf.

Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153-183, 2013. doi: 10.1287/moor.1120.0566. URL https://doi.org/10.1287/moor.1120.0566.

Xu, H. and Mannor, S. The robustness-performance tradeoff in Markov decision processes. In Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006. URL https://proceedings.neurips.cc/paper/2006/file/177540c7bcb8db31697b601642eac8d4-Paper.pdf.

A. Details of MDPs and RMDPs

In this section, we give the specifics of the MDPs and the RMDPs used for illustration in this work.
Figure 2(a) and Figure 3: $S = 2$, $A = 3$, with
$r_{s_1} = (0.0199, 0.6097, 0.8313)$, $r_{s_2} = (0.4044, 0.5534, 0.8319)$,
$P_{s_1,a_1} = (0.7793, 0.2207)$, $P_{s_1,a_2} = (0.9713, 0.0287)$, $P_{s_1,a_3} = (0.0668, 0.9332)$,
$P_{s_2,a_1} = (0.0676, 0.9324)$, $P_{s_2,a_2} = (0.5929, 0.4071)$, $P_{s_2,a_3} = (0.2497, 0.7503)$,
$\pi_{s_1} = (0.2, 0.3, 0.5)$, $\pi_{s_2} = (0.3, 0.1, 0.6)$.

Figure 2(b): $S = 3$, $A = 2$, with
$r_{s_1} = (0.5, 0.8)$, $r_{s_2} = (0.4, 0.2)$, $r_{s_3} = (0.2, 0.6)$,
$P_{s_1,a_1} = (0.14, 0.75, 0.11)$, $P_{s_1,a_2} = (0.44, 0.45, 0.11)$,
$P_{s_2,a_1} = (0.23, 0.19, 0.58)$, $P_{s_2,a_2} = (0.44, 0.32, 0.24)$,
$P_{s_3,a_1} = (0.45, 0.43, 0.12)$, $P_{s_3,a_2} = (0.14, 0.54, 0.32)$,
$\pi_{s_1} = (0.46, 0.54)$, $\pi_{s_2} = (0.38, 0.62)$, $\pi_{s_3} = (0.49, 0.51)$.

$S = 2$, $A = 2$, with
$r_{s_1} = (0.5, 0.6)$, $r_{s_2} = (0.4, 0.7)$,
$\mathcal{P}_{s_1} = \{ (0.78, 0.22;\ 0.79, 0.21),\ (0.85, 0.15;\ 0.99, 0.01) \}$,
$\mathcal{P}_{s_2} = \{ (0.59, 0.41;\ 0.92, 0.08),\ (0.60, 0.40;\ 0.39, 0.61) \}$,
$\pi_{s_1} = (0.45, 0.55)$, $\pi_{s_2} = (0.10, 0.90)$.
(In each candidate $P_s$, the rows separated by ";" are the distributions $P_{s,a_1}$ and $P_{s,a_2}$.)

$S = 2$, $A = 2$, with
$r_{s_1} = (0.5, 0.6)$, $r_{s_2} = (0.4, 0.7)$,
$\mathcal{P}_{s_1} = \{ (0.78, 0.22;\ 0.79, 0.21),\ (0.85, 0.15;\ 0.99, 0.01),\ (0.92, 0.08;\ 0.99, 0.01),\ (0.92, 0.08;\ 0.83, 0.17) \}$,
$\mathcal{P}_{s_2} = \{ (0.59, 0.41;\ 0.92, 0.08) \}$,
$\pi_{s_1} = (0.45, 0.55)$, $\pi_{s_2} = (0.10, 0.90)$.

Figure 6, Figure 7(a), Figure 8, Figure 9 and Figure 11(a): $S = 2$, $A = 2$, with
$r_{s_1} = (0.27, 0.9398)$, $r_{s_2} = (0.3374, 0.2212)$,
$\mathcal{P}_{s_1} = \{ (0.95, 0.05;\ 0.17, 0.83),\ (0.24, 0.76;\ 0.05, 0.95) \}$,
$\mathcal{P}_{s_2} = \{ (0.07, 0.93;\ 0.83, 0.17),\ (0.70, 0.30;\ 0.23, 0.77) \}$,
$\pi_{s_1} = (0.8, 0.2)$, $\pi_{s_2} = (0.9, 0.1)$.

Figure 7(b): $S = 2$, $A = 2$, with
$r_{s_1} = (0.27, 0.9398)$, $r_{s_2} = (0.3374, 0.2212)$,
$\mathcal{P}_{s_1,a_1} = \{(0.95, 0.05), (0.24, 0.76)\}$, $\mathcal{P}_{s_1,a_2} = \{(0.17, 0.83), (0.05, 0.95)\}$,
$\mathcal{P}_{s_2,a_1} = \{(0.07, 0.93), (0.70, 0.30)\}$, $\mathcal{P}_{s_2,a_2} = \{(0.83, 0.17), (0.23, 0.77)\}$.

Figure 11(b): $S = 2$, $A = 2$, with
$r_{s_1} = (0.24, 0.998)$, $r_{s_2} = (0.3574, 0.412)$,
$\mathcal{P}_{s_1} = \{ (0.95, 0.05;\ 0.05, 0.95),\ (0.24, 0.76;\ 0.95, 0.05) \}$,
$\mathcal{P}_{s_2} = \{ (0.2, 0.8;\ 0.99, 0.01),\ (0.2, 0.8;\ 0.01, 0.99) \}$.

B. Proofs

Lemma 3.1. Consider a policy $\pi$ and a transition dynamic $P$, we have

$$\{V^{\pi,P}\} = \bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s}. \qquad (17)$$

Proof. Observe that

$$\mathcal{H}^{\pi_s,P_s} = \{x\in\mathbb{R}^S \mid \langle x, L^{\pi_s,P_s}\rangle = r^{\pi_s}\} \qquad (45)$$

is the set of vectors that satisfy the $s$-th equation of the following system of linear equations:

$$(I - \gamma P^{\pi})x = r^{\pi}. \qquad (46)$$

Since $(I - \gamma P^{\pi})$ is invertible, this system of linear equations has a unique solution $V^{\pi,P}$. Hence, we have

$$\{V^{\pi,P}\} = \bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s}, \qquad (47)$$

which completes the proof.

Lemma 3.2. Consider a transition dynamic $P$, the value space $\mathcal{V}^P$ can be represented as

$$\mathcal{V}^P = \bigcup_{\pi\in\Pi}\bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s} = \bigcap_{s\in S}\bigcup_{\pi_s\in\Delta_A}\mathcal{H}^{\pi_s,P_s}. \qquad (19)$$

Proof. By the definition of $\mathcal{V}^P$ and Lemma 3.1, we have

$$\mathcal{V}^P = \bigcup_{\pi\in\Pi}\{V^{\pi,P}\} = \bigcup_{\pi\in\Pi}\bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s}. \qquad (48)$$

We can break the union into nested unions by fixing $\pi_s$ for each $s$:

$$\bigcup_{\pi\in\Pi}\bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s} = \bigcup_{\pi_{s_1}\in\Delta_A}\cdots\bigcup_{\pi_{s_S}\in\Delta_A}\bigcap_{s\in S}\mathcal{H}^{\pi_s,P_s}. \qquad (49)$$

Then, we have

$$\bigcup_{\pi_{s_2}\in\Delta_A}\cdots\bigcup_{\pi_{s_S}\in\Delta_A}\bigcup_{\pi_{s_1}\in\Delta_A}\left[\mathcal{H}^{\pi_{s_1},P_{s_1}}\cap\bigcap_{i=2}^{S}\mathcal{H}^{\pi_{s_i},P_{s_i}}\right] = \bigcup_{\pi_{s_2}\in\Delta_A}\cdots\bigcup_{\pi_{s_S}\in\Delta_A}\left[\left(\bigcup_{\pi_{s_1}\in\Delta_A}\mathcal{H}^{\pi_{s_1},P_{s_1}}\right)\cap\bigcap_{i=2}^{S}\mathcal{H}^{\pi_{s_i},P_{s_i}}\right] \qquad \text{(distributive law of sets)}$$

By iteratively applying the distributive law of sets, we can obtain

$$\mathcal{V}^P = \bigcap_{s\in S}\bigcup_{\pi_s\in\Delta_A}\mathcal{H}^{\pi_s,P_s}, \qquad (51)$$

which completes the proof.

Lemma 3.3. Consider a policy $\pi$ and a transition dynamic $P$, we have for all states $s \in S$,

$$\bigcup_{\pi_s\in\Delta_A}\left(\mathcal{H}_+^{\pi_s,P_s}\cap\mathcal{H}_-^{\pi_s,P_s}\right) = \left(\bigcup_{\pi_s\in\Delta_A}\mathcal{H}_+^{\pi_s,P_s}\right)\cap\left(\bigcup_{\pi_s\in\Delta_A}\mathcal{H}_-^{\pi_s,P_s}\right). \qquad (22)$$

Proof. First, by the distributive property of sets, it is trivial to obtain LHS ⊆ RHS. Next, we will show RHS ⊆ LHS. For any $x \in$ RHS, we have

$$\exists\, \pi'_s, \pi''_s \in \Delta_A\ \text{s.t.}\ x \in \mathcal{H}_+^{\pi'_s,P_s}\cap\mathcal{H}_-^{\pi''_s,P_s}. \qquad (52)$$

When $\pi'_s = \pi''_s$, it is trivial to obtain $x \in$ LHS. When $\pi'_s \neq \pi''_s$, there exist $\alpha, \beta \ge 0$ such that

$$\langle x, L^{\pi'_s,P_s}\rangle - r^{\pi'_s} = \alpha, \qquad r^{\pi''_s} - \langle x, L^{\pi''_s,P_s}\rangle = \beta. \qquad (53)$$

When either $\alpha = 0$ or $\beta = 0$, we have $x \in \mathcal{H}^{\pi'_s,P_s}$ or $x \in \mathcal{H}^{\pi''_s,P_s}$, and accordingly $x \in$ LHS. Therefore, we only focus on the case where $\alpha, \beta > 0$. If we set

$$\pi'''_s = \frac{\beta}{\alpha+\beta}\pi'_s + \frac{\alpha}{\alpha+\beta}\pi''_s, \qquad (54)$$

then we have

$$\begin{aligned}\langle x, L^{\pi'''_s,P_s}\rangle - r^{\pi'''_s} &= \left\langle x,\ \frac{\beta}{\alpha+\beta}L^{\pi'_s,P_s} + \frac{\alpha}{\alpha+\beta}L^{\pi''_s,P_s}\right\rangle - \frac{\beta}{\alpha+\beta}r^{\pi'_s} - \frac{\alpha}{\alpha+\beta}r^{\pi''_s}\\ &= \frac{\beta}{\alpha+\beta}\left(\langle x, L^{\pi'_s,P_s}\rangle - r^{\pi'_s}\right) + \frac{\alpha}{\alpha+\beta}\left(\langle x, L^{\pi''_s,P_s}\rangle - r^{\pi''_s}\right)\\ &= \frac{\beta}{\alpha+\beta}\,\alpha - \frac{\alpha}{\alpha+\beta}\,\beta = 0.\end{aligned}$$

Note that $\pi'''_s \in \Delta_A$.
The above result implies x lies in the hyperplane Hπ s,Ps. Thus x LHS and accordingly RHS LHS. Putting it together, we obtain LHS = RHS. The Geometry of Robust Value Functions Lemma 3.4. Consider a policy π and a transition dynamic P, we have for all states s S, [ πs A Hπs,Ps δ = [ a A Hds,a,Ps δ , δ {+, }. (23) Proof. We first prove S a A Hds,a,Ps + = S πs A Hπs,Ps + . It is trivial that LHS RHS. We then focus on proving RHS LHS. For any x RHS, we have π s A s.t. x Hπ s,Ps + . (56) Note that any πs A can be written as a convex combination of ds,a, a A. In our case, we write a A π s,ads,a, (57) then we have x, Lπ s,Ps rπ s 0 a A π s,a Lds,a,Ps X a A π s,ards,a 0 a A π s,a x, Lds,a,Ps rds,a 0. Since π s,a 0 for all a A, the above inequality implies a A s.t. x, Lds,a ,Ps rds,a 0. (59) This is equivalent to x LHS. Putting it together, we obtain LHS = RHS. The second part S a A Hds,a,Ps = S πs A Hπs,Ps can be proved in the same way. Theorem 3.5. Consider a transition dynamic P, the value space VP can be represented as a A Hds,a,Ps + a A Hds,a,Ps h Hds,as,Ps + H ds,a s,Ps i (24) where a = (as)s S, a = (a s)s S, and as, a s A. Proof. The first equality follow immediately from Lemma 3.2, Lemma 3.3 and Lemma 3.4. The second equality can be obtained using the distributive law of sets. Lemma 4.1. Consider an s-rectangular uncertainty set P and a policy π, we have for all states s S, Ps Ps Hπs,Ps Ps Ps Hπs,Ps # Proof. For any π Y πs, from Eqn. (9), we know that P P, s.t. P P, V π ,P V π ,P . (60) Using the Bellman equation (Bellman, 1957), we can obtain V π ,P V π ,P = γP π V π ,P γP π V π ,P = γ(P π P π )V π ,P γP π (V π ,P V π ,P ) = (I γP π ) 1γ(P π P π )V π ,P The Geometry of Robust Value Functions Note that (I γP π ) 1 = P t=0(γP π )t 0. Thus, we have P P, γ(P π P π )V π ,P 0 (62) Rearranging the above inequality, we obtain P P, (I γP π )V π ,P (I γP π )V π ,P . (63) Since (I γP π )V π ,P = rπ and V π ,P = V π ,P , we have P P, (I γP π )V π ,P rπ . (64) Taking the s-th inequality and noting that π s = πs, we have Ps Ps, V π ,P, Lπs,Ps rπs. (65) Therefore, we have f P(Y πs) \ Ps Ps Hπs,Ps . (66) On the other hand, from Eqn. (9) we know Ps Ps, V π ,P, Lπs,Ps = rπs, (67) which is equivalent to f P(Y πs) [ Ps Ps Hπs,Ps. (68) Putting it together, we get Ps Ps Hπs,Ps Ps Ps Hπs,Ps # which completes the proof. Lemma 4.2. Consider an s-rectangular uncertainty set P and a policy π, we have for all states s S, Ps Ps Hπs,Ps. (26) Proof. Recall that Hπs,Ps = {x RS | x, Lπs,Ps = rπs}. (70) From the definition of Lπs,Ps, we know 1, Lπs,Ps = 1 1 γ . (71) Thus, it is easy to verify that rπs 1 γ 1, Lπs,Ps = rπs for all Ps Ps, which concludes the proof. Corollary 4.3. Consider an s-rectangular uncertainty set P and a policy π, we have for all states s S, f P(Y πs) Cπs,Ps (28) where Cπs,Ps = Cπs,Ps + Cπs,Ps is a conic hypersurface. The Geometry of Robust Value Functions Proof. This corollary is a restatement of Lemma 4.1. Note that " \ Ps Ps Hπs,Ps Ps Ps Hπs,Ps # Ps Ps Hπs,Ps Ps Ps Hπs,Ps Ps Ps Hπs,Ps Ps Ps Hπs,Ps + = Cπs,Ps Cπs,Ps + . From Lemma 4.2, we know all halfspaces Hπs,Ps intersect at the same point. Then their intersection Cπs,Ps will be a convex cone. Note that each Hπs,Ps is a supporting hyperplane of the cone Cπs,Ps and all Hπs,Ps determine this cone. Thus the intersection of S Ps Ps Hπs,Ps and Cπs,Ps is exactly the surface of Cπs,Ps . Lemma 4.4. Consider an s-rectangular uncertainty set P and a policy π, we have {V π,P} = \ s S Cπs,Ps. (29) Proof. 
For any x RHS, we have that for all s S Ps Ps, x, Lπs,Ps = rπs; (73) Ps Ps, x, Lπs,Ps rπs. (74) Since P is s-rectangular, we have P P, (I γP π)x = rπ; (75) P P, (I γP π)x rπ. (76) Since the Bellman equation has a unique solution, the first line implies P P, x = V π,P . Suppose V π,P = V π,P, then from Eqn. (9) we have P P, s.t. V π,P = V π,P < V π,P . (77) On the other hand, from Eqn. (76), we know (I γP π )V π,P rπ 0 (I γP π )V π,P (I γP π )V π,P 0 γ(P π P π )V π,P 0 (I γP π ) 1γ(P π P π )V π,P 0 (see the proof of Lemma 4.1) V π,P V π,P 0 V π,P V π,P . We have an contradiction. Therefore, we can conclude x = V π,P and accordingly {V π,P} = T s S Cπs,Ps. Lemma 4.5. Consider an s-rectangular uncertainty set P, the robust value function space VP can be represented as s S Cπs,Ps = \ πs A Cπs,Ps. (30) The Geometry of Robust Value Functions Proof. The proof below follows exactly the same procedure as the proof of Lemma 3.2. By the definition of VP and Lemma 4.4, we have VP = [ π Π {V π,P} = [ s S Cπs,Ps. (79) We can break the union into nested unions by fixing πs for each s: [ s S Cπs,Ps = [ s S Cπs,Ps. (80) Then, we have i=2 Cπsi,Psi πs1 A Cπs1,Ps1 i=2 Cπsi,Psi . (distributive law of sets) By iteratively applying the distributive law of sets, we can obtain πs A Cπs,Ps, (82) which completes the proof. Lemma 4.6. Consider an s-rectangular uncertainty set P, we have for all states s S, πs A Cπs,Ps = πs A Cπs,Ps + πs A Cπs,Ps Proof. Recall that Cπs,Ps = Cπs,Ps + Cπs,Ps , then we need to prove πs A Cπs,Ps + Cπs,Ps = πs A Cπs,Ps + πs A Cπs,Ps First, by the distributive property of sets, it is trivial to obtain LHS RHS. Next, we will show RHS LHS. For any x RHS, we have π s, π s A s.t. x Cπ s,Ps + Cπ s ,Ps . (84) When π s = π s , it is trivial to obtain x LHS. When π s = π s , then we have Ps Ps, x, Lπ s,Ps rπ s 0; Ps Ps, x, Lπ s ,Ps rπ s 0. (85) If there exists Ps Ps such that x, Lπ s ,Ps rπ s = 0, then we will get x Hπ s ,Ps Cπ s ,Ps + and accordingly x LHS. Therefore, we only consider the case where Ps Ps, x, Lπ s,Ps rπ s 0; Ps Ps, x, Lπ s ,Ps rπ s < 0. (86) The Geometry of Robust Value Functions αPs := x, Lπ s,Ps rπ s, βPs := rπ s x, Lπ s ,Ps , Ps := {Ps | αPs 0, βPs > 0, Ps Ps}, λ := min Ps Ps αPs αPs + βPs . and accordingly 1 λ = max Ps Ps βPs αPs + βPs . (88) We construct π s := (1 λ)π s + λπ s . (89) Note that 0 λ 1. We have π s A since π s is a convex combination of π s and π s . Then we are going to show that x Cπ s,Ps + Cπ s,Ps , i.e., Ps Ps, x, Lπ s,Ps rπ s 0; Ps Ps, x, Lπ s,Ps rπ s 0. (90) On the one hand, denoting P s := arg min Ps Ps αPs + βPs , (91) x, Lπ s,P s rπ s = x, (1 λ)Lπ s,P s + λLπ s ,P s (1 λ)rπ s λrπ s = (1 λ) x, Lπ s,P s rπ s λ rπ s x, Lπ s ,P s = (1 λ)αP s λβP s = βP s αP s αP s + βP s αP sβP s αP s + βP s On the other hand, for all Ps Ps we have x, Lπ s,Ps rπ s = x, (1 λ)Lπ s,Ps + λLπ s ,Ps (1 λ)rπ s λrπ s = (1 λ) x, Lπ s,Ps rπ s λ rπ s x, Lπ s ,Ps = (1 λ)αPs λβPs αPs + βPs λβPs αPs + βPs αPsβPs The Geometry of Robust Value Functions x, Lπ s,Ps rπ s = x, (1 λ)Lπ s,Ps + λLπ s ,Ps (1 λ)rπ s λrπ s = (1 λ) x, Lπ s,Ps rπ s λ rπ s x, Lπ s ,Ps = (1 λ)αPs λβPs αPs + βPs λβPs αPs + βPs αPsβPs Putting it together, we obtain x Cπ s,Ps + Cπ s,Ps and thus x LHS. Lemma 4.7. Consider an s-rectangular uncertainty set P, we have for all states s S, [ πs A Cπs,Ps + = [ a A Cds,a,Ps + , (32) πs A Cπs,Ps [ a A Cds,a,Ps , (33) where the equality in the second line holds when P is (s, a)-rectangular. Proof. First, we are going to prove [ πs A Cπs,Ps + = [ a A Cds,a,Ps + . 
(95) It is trivial that RHS LHS. We then focus on proving LHS RHS. For any x LHS, we have π s A s.t. x Cπ s,Ps + . (96) Note that πs A can be written as a convex combination of ds,a, a A. In our case, we write a A π s,ads,a. (97) Also note that for any Ps Ps, x, Lπ s,Ps rπ s = x, X a A π s,a Lds,a,Ps X a A π s,ards,a = X a A π s,a x, Lds,a,Ps rds,a . (98) Therefore, we can write x Cπ s,Ps + as a A π s,a x, Lds,a,Ps rds,a 0. (99) Since π s,a 0 for all a A, the above statement implies Ps Ps, a A s.t. x, Lds,a ,Ps rds,a 0. (100) This is equivalent to x RHS. Putting it together, we obtain LHS = RHS. Second, we are going to prove [ πs A Cπs,Ps [ a A Cds,a,Ps , (101) The Geometry of Robust Value Functions where the equality holds when P is (s, a)-rectangular. Again, it is trivial that RHS LHS. We then focus on proving LHS RHS when P is (s, a)-rectangular. For any x LHS, we have π s A s.t. x Cπ s,Ps . (102) Similarly, we can obtain Ps Ps, X a A π s,a x, Lds,a,Ps rds,a 0. (103) This is equivalent to max Ps Ps a A π s,a x, Lds,a,Ps rds,a 0. (104) Due to (s, a)-rectangularity of P, we have X a A π s,a max Ps,a Ps,a x, Lds,a,Ps rds,a 0. (105) Since π s,a 0 for all a A, the above statement implies a A, s.t. max Ps,a Ps,a x, Lds,a ,Ps rds,a 0, (106) which is equivalent to a A, Ps,a Ps,a s.t. x, Lds,a ,Ps rds,a 0. (107) This is essentially saying x RHS. Putting it together, we obtain LHS = RHS when P is (s, a)-rectangular. Theorem 4.8. Consider an s-rectangular uncertainty set P, the robust value function space VP satisfies a A Cds,a,Ps + πs A Cπs,Ps a A Cds,a,Ps + a A Cds,a,Ps where the equality in the second line holds when P is (s, a)-rectangular. Proof. The proof follows immediately from Lemma 4.5, Lemma 4.6 and Lemma 4.7. Lemma 4.9. Consider a s-rectangular uncertainty set P and a policy π, we have (Lπs,Ps) = (g(Ps)) = (g(ext(conv(Ps)))) . (42) Proof. Since affine transformations preserve affine hulls (Dattorro, 2005), we have conv(g(Ps)) = g(conv(Ps)), conv(g(ext(conv(Ps)))) = g(conv(ext(conv(Ps)))). (108) Using Krein-Milman Theorem (Krein & Milman, 1940), we can obtain g(conv(ext(conv(Ps)))) = g(conv(Ps)). (109) Putting it together, we have conv(g(Ps)) = conv(g(ext(conv(Ps)))). (110) Then by the properties of polar cones (Proposition 2.2.1 in (Bertsekas, 2009)), we can get (g(Ps)) = (g(ext(conv(Ps)))) , (111) which completes the proof. The Geometry of Robust Value Functions Theorem 4.10. Consider a s-rectangular uncertainty set P, we have VP = VP (43) where P = ext(conv(P)) P. Proof. From Eqn. (41) and Lemma 4.9, we know that each conic hypersurface Cπs,Ps only depends on ext(conv(Ps)). Then we have VP = VP , where P = s S ext(conv(Ps)). (112) By the definition of extreme points, it is straightforward to show that s S ext(conv(Ps)) = ext s S conv(Ps) Using the properties of Cartesian products (Bertsekas et al., 2003), we can get s S conv(Ps) = conv = conv(P). (114) Putting it together, we have P = ext(conv(P)). Since P is assumed to be compact, then P P. Corollary 5.1. Consider an s-rectangular uncertainty set P, if an axis-parallel line intersects with the robust value space VP, then the intersection will be a line segment. Proof. Without loss of generality, consider a line parallel to the axis corresponding to state s1, and denote it as K = {u + tes1 | t R} (115) where u RS is fixed. Then the intersection between this line and the robust value space is πs A Cπs,Ps + πs A Cπs,Ps On the line K, denote the direction of the ray {u + tes1 | t 0} as negative and the opposite direction as positive. 
First, we have

$$K \cap \mathcal{H}_+^{\pi_s,P_s} = \left\{u + te_{s_1} \,\middle|\, t\,\langle e_{s_1}, L^{\pi_s,P_s}\rangle \ge r^{\pi_s} - \langle u, L^{\pi_s,P_s}\rangle\right\}. \qquad (117)$$

For $s \neq s_1$, since $\langle e_{s_1}, L^{\pi_s,P_s}\rangle \le 0$, the intersection $K \cap \mathcal{H}_+^{\pi_s,P_s}$ is either the line $K$ or a negative ray. Thus, the intersection $K \cap \bigcup_{\pi_s\in\Delta_A}\mathcal{C}_+^{\pi_s,\mathcal{P}_s}$ is either the line $K$ or a negative ray. For $s = s_1$, since $\langle e_{s_1}, L^{\pi_s,P_s}\rangle > 0$, the intersection $K \cap \mathcal{H}_+^{\pi_s,P_s}$ is a positive ray. Thus, the intersection $K \cap \bigcup_{\pi_s\in\Delta_A}\mathcal{C}_+^{\pi_s,\mathcal{P}_s}$ is also a positive ray. Putting it together, we can obtain that the intersection

$$K \cap \bigcap_{s\in S}\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_+^{\pi_s,\mathcal{P}_s}$$

is either empty or a line segment (or a point in the degenerate case). Similarly, we can show that the intersection

$$K \cap \bigcap_{s\in S}\bigcup_{\pi_s\in\Delta_A}\mathcal{C}_-^{\pi_s,\mathcal{P}_s}$$

is either empty or a line segment (or a point in the degenerate case). Finally, taking the intersection of the two, we have that the intersection between $K$ and the robust value space is either empty or a line segment (or a point in the degenerate case).

C. Additional Figures

Figure 12. Visualization of the robust value space for several randomly generated s-rectangular RMDPs.
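Figure 12 is produced from randomly generated instances; a minimal sketch of how such a visualization could be reproduced (our own code, with assumed sizes and an assumed discount factor, since these are not specified for the random instances) is:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
S, A, gamma, n_candidates = 2, 2, 0.9, 3   # assumed sizes and discount factor

# Random rewards and a finite s-rectangular uncertainty set:
# P_sets[s] is a list of candidate P_s matrices of shape (A, S).
r = rng.random((S, A))
P_sets = [[rng.dirichlet(np.ones(S), size=A) for _ in range(n_candidates)]
          for _ in range(S)]

def robust_value(pi, iters=500):
    """Robust policy evaluation by iterating the s-rectangular robust Bellman
    operator for a fixed policy pi of shape (S, A)."""
    V = np.zeros(S)
    for _ in range(iters):
        V = np.array([min(pi[s] @ r[s] + gamma * (pi[s] @ Ps) @ V
                          for Ps in P_sets[s]) for s in range(S)])
    return V

# Sample many stochastic policies and scatter their robust values (S = 2).
values = np.array([robust_value(rng.dirichlet(np.ones(A), size=S))
                   for _ in range(2000)])
plt.scatter(values[:, 0], values[:, 1], s=2)
plt.xlabel('V(s1)'); plt.ylabel('V(s2)'); plt.show()
```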