# Robust Satisficing MDPs

Haolin Ruan 1, Siyu Zhou 2, Zhi Chen 3, Chin Pang Ho 1

Despite being a fundamental building block for reinforcement learning, Markov decision processes (MDPs) often suffer from ambiguity in model parameters. Robust MDPs are proposed to overcome this challenge by optimizing the worst-case performance under ambiguity. While robust MDPs can provide reliable policies with limited data, their worst-case performances are often overly conservative, and so they do not offer practical insights into the actual performance of these reliable policies. This paper proposes robust satisficing MDPs (RSMDPs), where the expected returns of feasible policies are softly constrained to achieve a user-specified target under ambiguity. We derive a tractable reformulation for RSMDPs and develop a first-order method for solving large instances. Experimental results demonstrate that RSMDPs can prescribe policies to achieve their targets, which are much higher than the optimal worst-case returns computed by robust MDPs. Moreover, the average and percentile performances of our model are competitive among other models. We also demonstrate the scalability of the proposed algorithm compared with a state-of-the-art commercial solver.

1School of Data Science, City University of Hong Kong; 2The City University of Hong Kong Shenzhen Research Institute; 3CUHK Business School, The Chinese University of Hong Kong. Correspondence to: Chin Pang Ho.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

Markov decision processes (MDPs) have emerged as a powerful modeling framework for sequential decision-making problems under uncertainty (Ashok et al., 2019; Puterman, 2014; Sutton & Barto, 2018). Successful applications of MDPs largely rely on the perfect estimation of model parameters (Petrik & Russel, 2019), which, unfortunately, is not always available. A common situation is when the true parameters are estimated from a limited amount of samples, which may lead to non-negligible estimation deviation (Mannor et al., 2007). Sometimes, these true parameters themselves may be uncertain or even time-dependent, yet they are mistreated as fixed in the modeling process (Mannor et al., 2016; Suilen et al., 2022). Due to the sequential nature of MDPs, these estimation errors accumulate quickly (Behzadian et al., 2021b; Xu & Mannor, 2009), and so the output policies of MDPs can be disappointing in practice.

As an encouraging framework to mitigate or resolve these issues, a robust MDP (RMDP) assumes the uncertain reward function and/or transition kernel to reside in an ambiguity set, which includes the possible candidates of the unknown true parameters with high confidence (Delgado et al., 2016; Ghavamzadeh et al., 2016; Hanasusanto & Kuhn, 2013; Petrik, 2012; Tamar et al., 2014; Xu & Mannor, 2006). By optimizing against the most adversarial value within the ambiguity set, RMDPs can provide policies that are robust in practice (Auer et al., 2008; Goyal & Grand-Clement, 2022; Hansen et al., 2013; Ho et al., 2018; Iyengar, 2005; Kaufman & Schaefer, 2013; Taleghan et al., 2015). However, since RMDPs optimize over worst cases (which are typically unlikely to occur), the optimized worst-case performances are often too pessimistic and do not offer insights into the actual performance of the obtained policies.
Moreover, one major drawback of using ambiguity sets to account for parameter ambiguity in RMDPs (resp., distributionally robust MDPs, DRMDPs) is that the model may perform poorly when the true parameter (resp., the true probability distribution of the parameter) falls outside the ambiguity set, which could be catastrophic in high-risk applications such as healthcare and robotics (Brown et al., 2020). Another potential issue is the difficulty of determining the right size of the ambiguity set: for example, when using a Wasserstein ambiguity set (Derman & Mannor, 2020; Yang, 2017), choosing an appropriate size/radius that provides satisfying out-of-sample performance can be challenging (Mohajerin Esfahani & Kuhn, 2018).

In this paper, we focus on the case of an ambiguous transition kernel, and we propose a novel framework, named robust satisficing MDPs (RSMDPs), to handle the ambiguity. Robust satisficing is an alternative framework for optimization under uncertainty (Long et al., 2023). As opposed to RMDPs, the proposed RSMDPs consider all possible transition kernels, so one does not need to specify an ambiguity set. Unlike nominal MDPs, where only the hard constraints that correspond to the empirical transition kernel are considered, RSMDPs also impose soft constraints for all other transition kernels. The magnitudes of violation of these soft constraints depend on the distance between the associated transition kernel and the empirical one. This is one notable feature of our RSMDPs that prevents the model from performing too poorly when the empirical transition kernel is not close to the true one, which RMDPs may fail to achieve when the true kernel is not included in the ambiguity set. In particular, to achieve robustness, RSMDPs directly minimize the magnitude of constraint violations, and the level of robustness can be controlled elastically by articulating the targeted average return. Compared to the radius/size of the ambiguity set in RMDPs, the targeted return is a more tangible parameter, where a lower target corresponds to a higher level of robustness. Moreover, parameter selection approaches are much easier to apply for choosing the one-dimensional targeted return than for choosing the possibly multi-dimensional size of an ambiguity set (e.g., Delage & Ye, 2010). More details of RSMDPs are provided in Section 3.

Our contributions may be summarized as follows. (i) We propose a novel framework of RSMDPs that allows the uncertain transition probabilities to vary within the entire support set and minimizes the constraint violation directly while attaining an intuitive and tangible target on the expected return articulated by the decision-maker. (ii) We derive a tractable reformulation of our RSMDPs as a conic program. (iii) To solve RSMDPs, we design a first-order method that is more scalable than the Gurobi solver (Gurobi Optimization, LLC, 2022) and thus is advantageous in large-scale problems. (iv) Via data-driven experiments, we compare RSMDPs with nominal MDPs (NMDPs), RMDPs, and DRMDPs. Results show that RSMDPs have better percentile performances and a target-oriented feature.

The remainder of the paper is organized as follows. Preliminaries are introduced in Section 2. We study RSMDPs and derive their tractable reformulation in Section 3, and we propose a first-order method to solve RSMDPs efficiently in Section 4. We conduct numerical experiments in Section 5.

Notations. We denote vectors (resp., matrices) by boldface lowercase (resp., uppercase) letters.
The sets of nonnegative and strictly positive real numbers are denoted by $\mathbb{R}_+$ and $\mathbb{R}_{++}$, respectively. The symbols $0$ and $e$ stand for the vectors of all zeros and all ones, respectively, of a size that is clear from the context. We use $e_s$, $s \in \{1, \dots, S\}$, to denote the $s$-th standard basis vector in $\mathbb{R}^S$. The probability simplex is denoted by $\Delta^S = \{p \in \mathbb{R}^S_+ \mid e^\top p = 1\}$. The matrix $A = \mathrm{diag}(a) \in \mathbb{R}^{S \times S}$ is diagonal with its diagonal entries defined by the vector $a \in \mathbb{R}^S$.

Related Work

RMDPs. Solving RMDPs is generally an NP-hard problem (Ho et al., 2021; Iyengar, 2005; Wiesemann et al., 2013). The rectangularity assumption is crucial in obtaining tractability, as it ensures that the optimal policy can be computed via robust variants of value or policy iteration in polynomial time (Hansen et al., 2013; Iyengar, 2005); many RMDPs are equipped with rectangular ambiguity sets. The sa-rectangularity is the most common assumption, under which the uncertain transition probabilities at each state-action pair are treated independently (Iyengar, 2005; Nilim & El Ghaoui, 2005; Strehl et al., 2009). However, the volumes of sa-rectangular ambiguity sets are usually unnecessarily large, which implies that the output policies are often too conservative. To this end, s-rectangular sets have been proposed as a less conservative alternative, where only independence between states is assumed (Goyal & Grand-Clement, 2022; Ho et al., 2021; 2022; Wiesemann et al., 2013). Fast algorithms for solving RMDPs are also an active research topic, where the RMDPs are equipped with polyhedral ambiguity sets defined using the $L_1$- or $L_\infty$-norm (Behzadian et al., 2021a; Derman et al., 2021; Ho et al., 2021) or with nonlinear ambiguity sets such as the spherical ambiguity set and the KL uncertainty set (Grand-Clément & Kroer, 2021b; Ho et al., 2022). On another note, it is worth mentioning that DRMDPs (Chen et al., 2019; Clement & Kroer, 2021; Xu & Mannor, 2010) are closely related to RMDPs; there, the transition kernels are assumed to be random and governed by an unknown probability distribution that resides in an ambiguity set.

Dual formulation of NMDPs. As we will demonstrate in Section 3, since the interpretation of the value function becomes unclear when applying the robust satisficing framework to the primal formulation of NMDPs, the proposed RSMDPs are motivated by the dual formulation. In recent years, various new models have been proposed based on the dual formulation of NMDPs due to its interpretability. For example, Lobo et al. (2020) optimize a weighted average of the expectation and the conditional value-at-risk (CVaR) of the return under model ambiguity, while Brown et al. (2020) and Delage & Mannor (2010) optimize the CVaR and the value-at-risk (VaR) of the average return, respectively; by considering random rewards $\tilde{r} \sim P$ and a risk threshold $\varepsilon \in (0, 1)$, Brown et al. (2020) and Delage & Mannor (2010) replace the objective function $r^\top u$ in (2) by $\max_y \{ y - (1/(1-\varepsilon))\, \mathbb{E}_P[(y - \tilde{r}^\top u)_+] \}$ and $\max_y \{ y \mid P[\tilde{r}^\top u \ge y] \ge 1 - \varepsilon \}$, respectively. Our proposed RSMDPs, as opposed to the aforementioned models, do not optimize the expected return or a risk measure of the return. Instead, we minimize the constraint violation while specifying a target value for the expected return, reflecting their target-oriented feature.

Model-free approaches for robust reinforcement learning. Our RSMDPs are motivated by the linear programming formulation of NMDPs; both are model-based approaches.
We remark that, beyond model-based methods, there is an abundance of inspiring research on robust reinforcement learning, such as robust policy gradient (Wang & Zou, 2022), sample complexity analysis (Panaganti & Kalathil, 2022), least-squares policy iteration (Lagoudakis & Parr, 2003) and robust Q-learning (Roy et al., 2017; Wang & Zou, 2021). While model-free methods often require a large number of interactions with the environment, model-based learning is known for its high sample efficiency (Sutton & Barto, 2018), which is especially preferable in applications with limited data such as medicine (Imani et al., 2018) and manufacturing (Doltsinis et al., 2014).

2. Preliminaries

Consider an infinite-horizon MDP denoted by a tuple $\langle \mathcal{S}, \mathcal{A}, p, r, \gamma, d \rangle$ with a finite state space $\mathcal{S} = \{1, \dots, S\}$ and a finite action space $\mathcal{A} = \{1, \dots, A\}$. When an action $a \in \mathcal{A}$ is chosen at state $s \in \mathcal{S}$, the transition to a new state $s' \in \mathcal{S}$ follows a distribution $p_{s,a} \in \Delta^S$ and a non-negative reward $r_{s,a} \in \mathbb{R}_+$ materializes. We condense the transition probabilities and rewards to $p = (p_{s,a})_{s \in \mathcal{S}, a \in \mathcal{A}} \in (\Delta^S)^{S \times A}$ and $r = (r_{s,a})_{s \in \mathcal{S}, a \in \mathcal{A}}$, respectively. The discount factor is $\gamma \in (0, 1)$ and the initial state distribution is $d \in \mathbb{R}^S_{++}$. We optimize a policy $\pi \in \Pi = (\Delta^A)^S$ that takes action $a \in \mathcal{A}$ with probability $\pi_{s,a}$ at state $s \in \mathcal{S}$, where $\Pi$ is the set of all (stationary) randomized policies.

For a nominal MDP (NMDP), to obtain the optimal policy that maximizes the total expected discounted reward, we solve $\max_{\pi \in \Pi} \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_{S_t, A_t}]$, where the initial state $S_0$ follows the distribution $d$ and, for any time step $t$, $S_t$ denotes the state at time $t$ while $A_t$ is a random action governed by the distribution $\pi_{S_t} \in \Delta^A$. Under the dynamic programming principle, we let $v^\pi \in \mathbb{R}^S_+$ be a vector whose entry $v^\pi_s$ denotes the total expected discounted reward starting at state $s \in \mathcal{S}$ when applying policy $\pi$. The optimal value function $v^\star$ (achieved by the optimal policy $\pi^\star$) can be expressed via the Bellman optimality equation

$$ v^\star_s = \max_{a \in \mathcal{A}} \left\{ r_{s,a} + \gamma\, p_{s,a}^\top v^\star \right\} \quad \forall s \in \mathcal{S}, \qquad (1) $$

and the optimal solution can be retrieved by value iteration or policy iteration; see, e.g., (Sutton & Barto, 2018). Alternatively, an NMDP can be formulated as a linear program (in primal and dual form):¹

$$
\begin{array}{rll}
Z_N(p) \;=\; \min & d^\top v \\
\text{s.t.} & v_s \ge r_{s,a} + \gamma\, p_{s,a}^\top v & \forall (s,a) \in \mathcal{S} \times \mathcal{A} \\
& v \in \mathbb{R}^S_+ \\[6pt]
\;=\; \max & r^\top u \\
\text{s.t.} & e^\top u_s - p^\top Q_s u - d_s \le 0 & \forall s \in \mathcal{S} \\
& u \in \mathbb{R}^{S \times A}_+,
\end{array} \qquad (2)
$$

where for each $s \in \mathcal{S}$, $Q_s = \gamma\, \mathrm{diag}(e_s, \dots, e_s)$ is such that $\gamma \sum_{s' \in \mathcal{S}, a \in \mathcal{A}} p_{s',a,s}\, u_{s',a} = p^\top Q_s u$. Here, we focus on the dual formulation (i.e., the maximization problem) in (2), where a feasible solution $u$ is interpreted as the discounted probability of executing action $a$ at state $s$ when employing the (stationary randomized) policy $\pi_{s,a} = u_{s,a} / \sum_{a' \in \mathcal{A}} u_{s,a'}$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Therefore, the dual formulation, in essence, chooses the optimal policy that maximizes the total expected discounted reward $r^\top u$ (Puterman, 2014).

Robust optimization (e.g., (Ben-Tal et al., 2009)) is a classic paradigm to account for parameter uncertainty:

$$
\begin{array}{rll}
\max & r^\top u \\
\text{s.t.} & e^\top u_s - p^\top Q_s u - d_s \le 0 & \forall p \in F,\ \forall s \in \mathcal{S} \\
& u \in \mathbb{R}^{S \times A}_+.
\end{array} \qquad (3)
$$

Here, the ambiguity set $F = \{p \in P \mid \ell(p, \hat{p}) \le \kappa\}$ is a $\kappa$-neighbourhood (measured by a distance function $\ell$) around the empirical estimate $\hat{p}$. Comparing the first set of constraints of (3) to that of the dual formulation (2), one can observe that, for all $s \in \mathcal{S}$, the solution of (3) is robust against all transition kernels in the ambiguity set $F$; that is, the solution of (3) remains feasible for all $p \in F$, while the solution of $Z_N(\hat{p})$ is only guaranteed to be feasible when the true transition kernel is the same as the empirical one (i.e., $\hat{p}$).
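To make the dual formulation in (2) concrete, the following minimal sketch (our illustration, not the authors' code; the function and variable names such as solve_nominal_dual are ours) builds and solves the dual LP of a small random instance with SciPy and recovers the stationary policy by the normalization $\pi_{s,a} = u_{s,a}/\sum_{a'} u_{s,a'}$ described above.

```python
import numpy as np
from scipy.optimize import linprog

def solve_nominal_dual(p, r, d, gamma):
    """Dual LP in (2): max r^T u  s.t.  e^T u_s - p^T Q_s u <= d_s for all s,  u >= 0.

    p: (S, A, S) transition kernel, r: (S, A) rewards, d: (S,) initial distribution.
    Returns the occupancy measure u with shape (S, A) and the induced policy.
    """
    S, A = r.shape
    A_ub = np.zeros((S, S * A))
    for s in range(S):
        # Coefficient of u[s', a] in constraint s: 1{s' = s} - gamma * p[s', a, s].
        row = -gamma * p[:, :, s]
        row[s, :] += 1.0
        A_ub[s] = row.reshape(-1)
    res = linprog(c=-r.reshape(-1), A_ub=A_ub, b_ub=d, bounds=(0, None), method="highs")
    u = res.x.reshape(S, A)
    policy = u / np.maximum(u.sum(axis=1, keepdims=True), 1e-12)
    return u, policy

# Tiny random instance as a smoke test.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
p_hat = rng.dirichlet(np.ones(S), size=(S, A))   # empirical kernel, shape (S, A, S)
r = rng.uniform(size=(S, A))
d = np.ones(S) / S
u, policy = solve_nominal_dual(p_hat, r, d, gamma)
print("Z_N(p_hat) =", float(r.reshape(-1) @ u.reshape(-1)))
```

The optimal value printed here is $Z_N(\hat{p})$, which, by Proposition 3.1 in the next section, is also the natural upper end of the interval from which the RSMDP target $\tau$ is calibrated.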
3. Robust Satisficing MDPs

Equipped with the dual formulation of NMDPs in (2), our target-oriented robust satisficing MDP is formulated as

$$
\begin{array}{rll}
Z_{RS}(\hat{p}) \;=\; \min & w^\top k \\
\text{s.t.} & e^\top u_s - p^\top Q_s u - d_s \le k_s\, \ell(p, \hat{p}) & \forall p \in P,\ \forall s \in \mathcal{S} \\
& r^\top u \ge \tau \\
& u \in \mathbb{R}^{S \times A}_+,\ k \in \mathbb{R}^S_+,
\end{array} \qquad (4)
$$

where $\hat{p}$ is the empirical transition kernel and the support set is $P = \{ p \in \mathbb{R}^{S \times A \times S}_+ \mid e^\top p_{s,a} = 1 \;\; \forall s \in \mathcal{S},\, a \in \mathcal{A} \}$.

¹Throughout this paper, the last line of constraints of an optimization problem indicates its decision variables (and their dimensions). For example, $v \in \mathbb{R}^S_+$ and $u \in \mathbb{R}^{S \times A}_+$ are the decision variables of the primal and dual NMDPs in (2), respectively.

While $w$ could be set to any non-negative vector, we commonly set $w = e$ in (4): if there is no additional information, the constraints in the first set appear symmetric to each other and should be equally important. The term $\ell(p, \hat{p})$, which can be a general norm $\ell(p, \hat{p}) = \|p - \hat{p}\|$ or another distance such as the KL divergence, measures the proximity between $p$ and the empirical $\hat{p}$.

Comparing the first collection of constraints in RSMDP (4) to that of (the dual formulation of) NMDP (2), we observe that the decision variables $\{k_s\}_{s \in \mathcal{S}}$ in RSMDPs reflect the magnitude of constraint violation incurred by the distance between the ambiguous transition kernel and the empirical one: when the values of $\{k_s\}_{s \in \mathcal{S}}$ are small, only mild violation occurs even when the true transition kernel is far away from the empirical estimate $\hat{p}$, an event that is itself unlikely. RSMDPs then minimize a weighted sum of $\{k_s\}_{s \in \mathcal{S}}$ under the promise that the average return $r^\top u$ is not smaller than the pre-specified target $\tau > 0$, highlighting the notion of satisficing. Notice that when $p = \hat{p}$, there is no violation in RSMDP (4), so the achieved average return reaches the target $\tau$.

Note that the feasible region of (4) becomes smaller with a larger $\tau$, and the optimal $k_s$, $s \in \mathcal{S}$, then tend to be larger. Therefore, one can interpret $\tau$ as, in addition to the targeted expected return, a controller of the robustness of (4): a smaller $\tau$ corresponds to higher robustness. By setting $\tau = Z_N(\hat{p})$, one can recover the optimal policy of the corresponding NMDP.

Proposition 3.1. (i) Any optimal $u$ of $Z_{RS}(\hat{p})$ with $\tau = Z_N(\hat{p})$ is also optimal in $Z_N(\hat{p})$, while (ii) $Z_{RS}(\hat{p})$ is infeasible when $\tau > Z_N(\hat{p})$.

Since we consider non-negative rewards, when we set $\tau \le 0$, the second constraint of (4) (i.e., $r^\top u \ge \tau$) is automatically satisfied and can be eliminated without changing the feasible region. This observation, together with Proposition 3.1, provides the interval $[0, Z_N(\hat{p})]$ within which we calibrate $\tau$ for RSMDPs. As opposed to sizing the ambiguity set in RMDPs, setting the target $\tau$ is another distinguishing feature of RSMDPs. To a decision-maker, the target $\tau$ is more directly related to her objective (i.e., the return) and is thus more tangible.

The robust optimization model (3) is another important benchmark for RSMDPs (4). Comparing their first sets of constraints, one can observe that for all $s \in \mathcal{S}$, the robust optimization model only hedges against transition kernels in the ambiguity set $F$, while giving no guarantee about the constraint violation when the transition kernel is outside $F$. In contrast, the RSMDP (4) minimizes the magnitude of the constraint violation for all possible transition kernels from the whole support set $P$, where the decision variable $k_s$, which is itself optimized, bounds the distance-scaled violation of the $s$-th constraint over all $p \in P$.
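To make the role of $k_s$ more concrete, the sketch below (our illustration, not part of the paper; min_budget_estimate and all names are hypothetical) estimates, for a fixed occupancy measure $u$ and the $L_1$ distance, the smallest budget $k_s$ consistent with a finite sample of kernels drawn from $P$. The true $k_s$ in (4) must cover every $p \in P$, so this Monte Carlo estimate is only a lower bound.

```python
import numpy as np

def min_budget_estimate(u, p_hat, d, s, gamma, n_samples=5000, seed=0):
    """Monte Carlo lower bound on the smallest k_s such that
    e^T u_s - p^T Q_s u - d_s <= k_s * ||p - p_hat||_1 for the sampled kernels p.
    Assumes u already satisfies the nominal constraint at p_hat (slack <= 0 there)."""
    rng = np.random.default_rng(seed)
    S, A = u.shape
    lhs_base = u[s].sum() - d[s]                    # e^T u_s - d_s
    best = 0.0
    for _ in range(n_samples):
        p = rng.dirichlet(np.ones(S), size=(S, A))  # random kernel from P, shape (S, A, S)
        slack = lhs_base - gamma * (p[:, :, s] * u).sum()   # constraint left-hand side at p
        dist = np.abs(p - p_hat).sum()              # ||p - p_hat||_1
        if slack > 0 and dist > 0:
            best = max(best, slack / dist)
    return best
```

A small estimate indicates that $u$ degrades gracefully under misspecification of the kernel; RSMDP (4) chooses $u$ and $k$ jointly so that the weighted sum of these budgets is as small as possible while the target $\tau$ is met.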
To further distinguish between these two frameworks, let $u^{RO}$ and $(u^{RS}, k)$ be feasible solutions of the robust optimization model (3) and RSMDP (4), respectively. Comparing the $s$-th constraints of (3) and (4), for the former we have

$$ e^\top u^{RO}_s - p^\top Q_s u^{RO} - d_s \le 0 \;\; \forall p \in F, \qquad e^\top u^{RO}_s - p^\top Q_s u^{RO} - d_s \le +\infty \;\; \forall p \in P \setminus F, \qquad (5) $$

while for the latter, we have

$$ e^\top u^{RS}_s - p^\top Q_s u^{RS} - d_s \le k_s\, \ell(p, \hat{p}) \;\; \forall p \in F, \qquad e^\top u^{RS}_s - p^\top Q_s u^{RS} - d_s \le k_s\, \ell(p, \hat{p}) \;\; \forall p \in P \setminus F. \qquad (6) $$

It is clearly indicated by (5) and (6) that, though RSMDPs may allow some additional violation (i.e., $k_s\, \ell(p, \hat{p})$) when the true transition kernel is inside the ambiguity set $F$, they protect one from disastrous situations when $p \in P \setminus F$. This is one advantage of RSMDPs, because the ambiguity set $F$ is often much smaller than the support set $P$. One can also regard $P$ as the ambiguity set of RSMDPs, a special one in the sense that the allowed magnitude of violation is proportional to the proximity of the unknown true transition kernel to its nominal value (i.e., $\ell(p, \hat{p})$); for the robust optimization framework, in contrast, no violation is allowed for any $p \in F$, and the violation is unpredictable and ignored for $p \notin F$.

3.2. Reformulation

RSMDP (4) is an infinitely constrained optimization problem, where each possible $p \in P$ corresponds to one constraint. Quite notably, it can be reformulated as a conic program when equipped with a general norm.

Theorem 3.2. Equipped with a general norm $\ell(p, \hat{p}) = \|p - \hat{p}\|$, RSMDP (4) is equivalent to the conic program

$$
\begin{array}{rll}
Z_{RS}(\hat{p}) \;=\; \min & \displaystyle\sum_{s \in \mathcal{S}} w_s\, \| \beta_s - Q_s u - B^\top \alpha_s \|_* \\
\text{s.t.} & e^\top u_s - d_s \le -\hat{p}^\top \beta_s + \hat{p}^\top Q_s u & \forall s \in \mathcal{S} \\
& r^\top u \ge \tau \\
& u \in \mathbb{R}^{S \times A}_+,\ \alpha_s \in \mathbb{R}^{S \times A},\ \beta_s \in \mathbb{R}^{S \times A \times S}_+ & \forall s \in \mathcal{S},
\end{array} \qquad (7)
$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, $B = \mathrm{diag}(e^\top, \dots, e^\top) \in \mathbb{R}^{SA \times SAS}$, and $e \in \mathbb{R}^S$.

We remark that, due to the choice of a general norm in (4), the conic program (7) admits an equivalent reformulation as a minimax optimization problem (as we show in the coming section), to which an efficient primal-dual algorithm can be applied. Besides a general norm, choosing the KL divergence in (4) also allows a reformulation as a convex optimization problem (see more details in Appendix B). We also remark that, after solving (7), the optimal policy of RSMDPs can be retrieved as in NMDPs (2).

4. First-Order Method

By Theorem 3.2, we can already compute the optimal policy of an RSMDP (equipped with a general norm) by solving its equivalent reformulation (7) via state-of-the-art commercial solvers; however, the computation time may become quite long as the problem size grows. To this end, we apply the first-order primal-dual algorithm (PDA) (e.g., (Chambolle & Pock, 2016; Esser et al., 2010; Grand-Clément & Kroer, 2021b; He & Yuan, 2012)), which solves problems in convex-concave min-max form at a convergence rate of $O(1/N)$ (Chambolle & Pock, 2016), where $N$ is the number of iterations. First-order methods are known for their computationally efficient updates, by which large-scale problems can be solved to moderate accuracy efficiently. To apply PDA, we transform problem (7) into a min-max form. A necessary preceding result for the transformation is relegated to Lemma A.1 in Appendix A.2, and we provide the min-max reformulation in the following proposition.

Proposition 4.1. When $w \in \mathbb{R}^S_{++}$, problem (7) has an equivalent minimax reformulation:

$$ \min_{u \in U}\; \max_{(\lambda_s, \theta_s) \in V_q(w_s):\, s \in \mathcal{S}}\; \sum_{s \in \mathcal{S}} \big( \lambda_s (e^\top u_s - d_s) - \theta_s^\top Q_s u \big), \qquad (8) $$

where $U = \{ u \in \mathbb{R}^{S \times A}_+ \mid r^\top u \ge \tau \}$ and

$$ V_q(w) = \left\{ (\lambda, \theta) \in \mathbb{R}_+ \times \mathbb{R}^{S \times A \times S}_+ \;\middle|\; \|\theta - \lambda \hat{p}\|_q \le w,\; \lambda e = B\theta \right\}. $$

In the remainder, we focus on solving the min-max problem (8) (and thus RSMDP (4)) equipped with the $L_\infty$-norm:

$$ \min_{u \in U}\; \max_{(\lambda_s, \theta_s) \in V_\infty(w_s):\, s \in \mathcal{S}}\; \sum_{s \in \mathcal{S}} \big( \lambda_s (e^\top u_s - d_s) - \theta_s^\top Q_s u \big). \qquad (9) $$
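For intuition, and as a brute-force correctness check against the efficient routines developed in Sections 4.1 and 4.2 below, the inner maximization in (9) for a single state and a fixed $u$ is just a small linear program. A sketch with cvxpy follows; it assumes $B$, $Q_s$ and the remaining data have already been assembled as dense arrays, and all names are ours, not the paper's.

```python
import cvxpy as cp
import numpy as np

def inner_max_state(u_flat, u_s_sum, d_s, w_s, p_hat_flat, Q_s, B):
    """Brute-force LP for one state's inner problem in (9):
        max  lambda * (e^T u_s - d_s) - theta^T (Q_s u)
        s.t. ||theta - lambda * p_hat||_inf <= w_s,  B theta = lambda * e,
             theta >= 0, lambda >= 0.
    The value is finite only when e^T u_s - p_hat^T Q_s u - d_s <= 0
    (cf. the proof of Proposition 4.1); otherwise the LP is unbounded."""
    n = p_hat_flat.size                     # n = S * A * S
    lam = cp.Variable(nonneg=True)
    theta = cp.Variable(n, nonneg=True)
    constraints = [
        cp.norm(theta - lam * p_hat_flat, "inf") <= w_s,
        B @ theta == lam * np.ones(B.shape[0]),
    ]
    objective = cp.Maximize(lam * (u_s_sum - d_s) - theta @ (Q_s @ u_flat))
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return problem.value, lam.value, theta.value
```

Looping this over all states reproduces the state-wise decomposition discussed after (9); the golden section search of Section 4.2 is the scalable replacement for this generic LP call.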
As we demonstrate later in this section, the specific structure of this min-max formulation, together with the computational scheme of PDA, allows us to decompose it into several subproblems that can be solved efficiently in each iteration by our designed strategy. In particular, to update the decision variable of the outer minimization problem, we solve a minimization problem by an efficient procedure with time complexity $O(SA \log(SA))$; to update the inner problem, we design a strategy with time complexity $O(S^3 \log(S)\, A \log(1/\delta))$, where $\delta$ is the desired precision of the golden section search that will be introduced in Section 4.2.

Observe that, for any fixed $u \in \mathbb{R}^{S \times A}_+$, the inner maximization problem of (9) is decomposable into $S$ subproblems, which allows an equivalent reformulation of (9) as

$$ \min_{u \in U}\; \sum_{s \in \mathcal{S}}\; \max_{(\lambda_s, \theta_s) \in V_\infty(w_s)} \big\{ \lambda_s (e^\top u_s - d_s) - \theta_s^\top Q_s u \big\}. $$

Hence, for any fixed $u$, it is sufficient to solve each of these $S$ subproblems separately. This decomposition is beneficial to our PDA since it remarkably lowers the time complexity of updating the decision variables $(\lambda, \theta)$ of the inner maximization problem in (9).

Now we introduce our PDA in Algorithm 1. In every iteration, PDA updates the primal (resp., dual) variable by solving a minimization problem with the dual (resp., primal) variable fixed at a value related to its last update. Here, the primal update operator is defined as

$$ P(\lambda, \theta; \hat{u}) = \operatorname*{argmin}_{u \in U}\; \sum_{s \in \mathcal{S}} \big( \lambda_s\, e^\top u_s - \theta_s^\top Q_s u \big) + \frac{1}{2\nu} \|u - \hat{u}\|_2^2 $$

for $\hat{u} \in \mathbb{R}^{S \times A}_+$ and $(\lambda, \theta) \in \mathbb{R}^S_+ \times \mathbb{R}^{S \times S \times A \times S}_+$, while for any $\hat{s} \in \mathcal{S}$, the dual update operator is

$$ D_{\hat{s}}(u; \hat{\lambda}, \hat{\theta}) = \operatorname*{argmin}_{(\lambda, \theta) \in V_\infty(w_{\hat{s}})}\; \lambda (d_{\hat{s}} - e^\top u_{\hat{s}}) + \theta^\top Q_{\hat{s}} u + \frac{1}{2\sigma} \big( (\lambda - \hat{\lambda})^2 + \|\theta - \hat{\theta}\|_2^2 \big) $$

for any $u \in \mathbb{R}^{S \times A}_+$ and $(\hat{\lambda}, \hat{\theta}) \in \mathbb{R}_+ \times \mathbb{R}^{S \times A \times S}_+$, where $\nu > 0$ and $\sigma > 0$ are, respectively, the stepsizes of the primal and dual updates. The input coefficient matrix $C \in \mathbb{R}^{(S + S \cdot SAS) \times SA}$ stacks $\mathrm{diag}(e, \dots, e)$ (with $S$ all-ones vectors $e \in \mathbb{R}^A$ on its block diagonal) on top of $Q_1, \dots, Q_S$, and satisfies $\langle Cu, (\lambda^\top, \theta^\top)^\top \rangle = \sum_{s \in \mathcal{S}} \big( \lambda_s (e^\top u_s - d_s) - \theta_s^\top Q_s u \big)$. Note that one needs to choose the Bregman divergence in PDA; herein we choose the convex function in the definition of the Bregman divergence as $(1/2)\|\cdot\|_2^2$ for both the primal and the dual updates (Chambolle & Pock, 2016).

We provide a simplified result for the convergence rate of Algorithm 1, which is based on a stronger but more technical convergence result, Theorem 1 in (Chambolle & Pock, 2016). Specifically, our result is obtained by specifying the sufficient condition $\nu\sigma \le 1/L^2$. We refer interested readers to Theorem A.2 in Appendix A.2 for the original convergence result of (Chambolle & Pock, 2016).

Theorem 4.2. Let $(u^k, (\lambda^k, \theta^k))$, $k = 0, 1, 2, \dots, K$, be a sequence generated by Algorithm 1, and suppose the stepsizes $\nu, \sigma > 0$ are chosen such that $\nu\sigma \le 1/L^2$. Then for any feasible solution $(u, (\lambda, \theta))$ of problem (9) it holds that

$$ \sum_{s \in \mathcal{S}} \varphi_s(\bar{u}^K, \lambda_s, \theta_s) - \sum_{s \in \mathcal{S}} \varphi_s(u, \bar{\lambda}^K_s, \bar{\theta}^K_s) = O\!\left(\frac{1}{K}\right), $$

where $\bar{u}^K = (\sum_{k \in [K]} u^k)/K$, $(\bar{\lambda}^K, \bar{\theta}^K) = (\sum_{k \in [K]} (\lambda^k, \theta^k))/K$, and for all $s \in \mathcal{S}$, $\varphi_s(u, \lambda, \theta) = \lambda (e^\top u_s - d_s) - \theta^\top Q_s u$.
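Theorem 4.2 only asks for stepsizes with $\nu\sigma \le 1/L^2$, where $L = \|C\|$ is the operator (spectral) norm of the coupling matrix. A small helper along the following lines can pick a symmetric pair once $C$ has been assembled; this is our own sketch (pda_stepsizes is not from the paper), assuming $C$ is available as a SciPy sparse matrix.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def pda_stepsizes(C, safety=0.9):
    """Return (nu, sigma) with nu * sigma <= 1 / L**2, where L = ||C|| is the
    spectral norm (largest singular value) of the coupling matrix C.
    For small dense C, numpy.linalg.norm(C, 2) is an equivalent alternative."""
    L = spla.svds(sp.csr_matrix(C), k=1, return_singular_vectors=False)[0]
    nu = sigma = safety / L
    return nu, sigma
```

The factor safety < 1 keeps the product strictly below $1/L^2$, which does not affect the guarantee.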
Algorithm 1 Primal-Dual Algorithm (PDA) for Problem (9)

Input: operator norm $L = \|C\|$, an initial feasible solution $(u^0, (\lambda^0, \theta^0))$ of problem (9), stepsizes $\nu, \sigma > 0$, desired precision $\varepsilon$; set $k \leftarrow 0$
repeat
  // Step 1: Primal update
  $u^{k+1} \leftarrow P(\lambda^k, \theta^k; u^k)$
  // Step 2: Dual update
  for $\hat{s} = 1$ to $S$ do
    $(\lambda^{k+1}_{\hat{s}}, \theta^{k+1}_{\hat{s}}) \leftarrow D_{\hat{s}}(2u^{k+1} - u^k; \lambda^k_{\hat{s}}, \theta^k_{\hat{s}})$
  end for
  $k \leftarrow k + 1$
until $\|u^k - u^{k-1}\| < \varepsilon$
Output: solution $\bar{u}^k = \frac{1}{k} \sum_{i \in [k]} u^i$ and $(\bar{\lambda}^k, \bar{\theta}^k) = \frac{1}{k} \sum_{i \in [k]} (\lambda^i, \theta^i)$

[Figure 1: four panels plotting the 10-percentile, 20-percentile, 30-percentile, and average return against the sample size M (1 to 13), with curves for RSMDPs 0.8, RSMDPs 0.85, DRMDPs 1.0, DRMDPs 1.6, RMDPs 0.8, and RMDPs 1.0.] Figure 1. Average and percentile performances over 5000 out-of-sample testing trajectories in the river swim application.

4.1. Solving $P(\lambda, \theta; \hat{u})$ via Interval Search

To solve $P(\lambda, \theta; \hat{u})$ in Step 1, note that it is a quadratic program with no cross terms in the objective function and with only linear constraints. Based on this structure, we develop an efficient algorithm to solve it. In particular, we show that solving $P(\lambda, \theta; \hat{u})$ is equivalent to finding the root of a non-decreasing piecewise linear function, which can be done efficiently via interval search (where we search the intervals between breakpoints of the piecewise linear function). The time complexity of the proposed algorithm is provided in the following proposition, while the details of the algorithm and its pseudocode are relegated to the appendix.

Proposition 4.3. Problem $P(\lambda, \theta; \hat{u})$ can be solved in time $O(SA \log(SA))$.

4.2. Solving $D_{\hat{s}}(u; \hat{\lambda}, \hat{\theta})$ via Golden Section Search

For Step 2 in our PDA, for each $\hat{s} \in \mathcal{S}$, we solve the subproblem $D_{\hat{s}}(u; \hat{\lambda}, \hat{\theta})$ by the well-known golden section search (e.g., (Truhar & Veselić, 2009)). For any fixed $\lambda \in \mathbb{R}_+$, the function

$$ f_{\hat{s}}(\lambda) = \min_{\theta}\; \left\{ \lambda (d_{\hat{s}} - e^\top u_{\hat{s}}) + \theta^\top Q_{\hat{s}} u + \frac{1}{2\sigma} \big( (\lambda - \hat{\lambda})^2 + \|\theta - \hat{\theta}\|_2^2 \big) \;\middle|\; (\lambda, \theta) \in V_\infty(w_{\hat{s}}),\; \theta \in \mathbb{R}^{S \times A \times S}_+ \right\} \qquad (10) $$

defined on $\mathbb{R}_+$ involves the inner minimization problem that we need to solve. Notice that herein we treat the problem $D_{\hat{s}}(u; \hat{\lambda}, \hat{\theta})$ as a min-min problem, where we optimize $\lambda \in \mathbb{R}_+$ in the outer minimization problem and $\theta \in \mathbb{R}^{S \times A \times S}_+$ in the inner one. The golden section search is used to locate the optimal $\lambda$ of the outer problem: we initialize the search interval $[\underline{\lambda}, \bar{\lambda}]$, where $\underline{\lambda} = 0$, and we provide Lemma A.3 in Appendix A.2 for selecting $\bar{\lambda}$. At the initialization phase and in each iteration of the search, we need to solve the inner problem (10) once. To implement the golden section search, we also need to show that $f_{\hat{s}}$ is well-defined on $\mathbb{R}_+$; we provide this proof in Lemma A.4 in Appendix A.2. In general, the golden section search converges to a local minimizer of the problem; fortunately, this is a global minimizer in our case.

Proposition 4.4. The golden section search converges to a global minimizer $\lambda^\star$ of $D_{\hat{s}}(u; \hat{\lambda}, \hat{\theta})$.
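Concretely, the primal update of Section 4.1 reduces (see the proof of Proposition 4.3 and Algorithm 2 in the appendix) to $u_i = [\nu(\zeta r_i - a_i)]_+$ for a scalar multiplier $\zeta \ge 0$, where $a$ collects the linear terms of $P(\lambda, \theta; \hat{u})$, including $-(1/\nu)\hat{u}$. The sketch below is ours (primal_update is a hypothetical name); it finds $\zeta$ by simple bisection on the non-decreasing piecewise linear function $\Phi$ instead of the sorted $O(SA \log(SA))$ interval search.

```python
import numpy as np

def primal_update(a, r, tau, nu, tol=1e-10):
    """Solve min_u a^T u + (1/(2*nu)) ||u||^2  s.t.  r^T u >= tau, u >= 0.
    KKT: u_i = [nu * (zeta * r_i - a_i)]_+ for a multiplier zeta >= 0 of the
    target constraint; zeta = 0 if the unconstrained solution already meets it.
    Assumes the problem is feasible (some r_i > 0 whenever tau > 0)."""
    u0 = np.maximum(-nu * a, 0.0)
    if r @ u0 >= tau:
        return u0
    phi = lambda z: r @ np.maximum(nu * (z * r - a), 0.0)  # nondecreasing, piecewise linear
    lo, hi = 0.0, 1.0
    while phi(hi) < tau:        # grow the bracket until it contains the root of phi(z) = tau
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < tau else (lo, mid)
    return np.maximum(nu * (hi * r - a), 0.0)
```

The dual subproblems (11) below admit a very similar treatment with additional upper bounds $\lambda\hat{p}_{s,a,s'} + w_{\hat{s}}$ on the variables, which is what Algorithm 3 in the appendix exploits.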
Table 1. Predicted returns and the corresponding differences (in median) between sample returns and predicted returns over 1000 samples in the river swim application.

| RSMDPs | $\tau / Z_N(\hat{p})$ | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 |
|---|---|---|---|---|---|---|---|
| | Predicted Return | 58.6 | 52.7 | 46.9 | 41.0 | 35.2 | 29.3 |
| | Difference (in median) | -4.4 | 0.9 | 6.1 | 12.0 | 17.9 | 23.6 |
| DRMDPs | $r$ | 0.0 | 0.3 | 0.6 | 0.9 | 1.2 | 1.5 |
| | Predicted Return | 58.6 | 41.0 | 32.4 | 26.5 | 21.5 | 17.3 |
| | Difference (in median) | -4.4 | 12.9 | 20.7 | 26.6 | 31.7 | 35.9 |
| RMDPs | $r$ | 0.0 | 0.3 | 0.6 | 0.9 | 1.2 | 1.5 |
| | Predicted Return | 58.6 | 38.6 | 27.2 | 20.2 | 15.0 | 11.3 |
| | Difference (in median) | -4.4 | 15.6 | 27.0 | 34.1 | 39.2 | 43.0 |

It remains to solve problem (10) to obtain the optimal solution $\theta^\star$. Observe that the problem can be further decomposed into $SA$ subproblems: for each $(s, a) \in \mathcal{S} \times \mathcal{A}$, we have

$$
\begin{array}{rll}
\min & \dfrac{1}{2\sigma}\, \theta_{s,a}^\top \theta_{s,a} + \theta_{s,a}^\top \Big( z_{\hat{s},s,a} - \dfrac{1}{\sigma}\, \hat{\theta}_{s,a} \Big) \\
\text{s.t.} & [\lambda \hat{p}_{s,a,s'} - w_{\hat{s}}]_+ \le \theta_{s,a,s'} \le \lambda \hat{p}_{s,a,s'} + w_{\hat{s}} & \forall s' \in \mathcal{S} \\
& e^\top \theta_{s,a} = \lambda \\
& \theta_{s,a} \in \mathbb{R}^S,
\end{array} \qquad (11)
$$

where $z_{\hat{s}} = Q_{\hat{s}} u \in \mathbb{R}^{S \times A \times S}$ and $z_{\hat{s},s,a} \in \mathbb{R}^S$ is its $(s,a)$-block. The only difference between (11) and $P(\lambda, \theta; \hat{u})$ is that the former has upper bounds on its decision variables, while the latter has not. Therefore, we can develop a similar efficient strategy to solve problem (11), whose time complexity is provided in the following proposition.

Proposition 4.5. Problem (11) can be solved in time $O(S \log S)$.

Now we provide the total time complexity of the strategy described in this section for computing Step 2 in Algorithm 1.

Proposition 4.6. The output of Step 2 in Algorithm 1 can be computed in time $O(S^3 \log(S)\, A \log(\delta^{-1}))$, where $\delta > 0$ is the desired precision of the golden section search.

4.3. Randomized Block Coordinate Gradient Descent for Dual Updates

Although Step 2 in Algorithm 1 can be computed in time that is almost linear in the number of dual variables $\lambda$ and $\theta$, the dual updates in Step 2 remain the computational bottleneck of the algorithm. This is because the number of dual variables is $O(S^3 A)$, as opposed to the number of primal variables $u$, which is $O(SA)$. Moreover, as mentioned above, our policy is computed via the normalization step $\pi_{s,a} = u_{s,a} / \sum_{a' \in \mathcal{A}} u_{s,a'}$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$; it is therefore desirable to update the primal variables $u$ frequently toward their optimal values. To achieve this, one possible direction is to update only part of the dual variables in every Step 2. In particular, inspired by randomized block coordinate gradient descent, one may modify Step 2 in Algorithm 1 to be

$$ (\lambda^{k+1}_s, \theta^{k+1}_s) \leftarrow D_s(2u^{k+1} - u^k; \lambda^k_s, \theta^k_s) \quad \forall s \in \mathcal{S}_k. \qquad (12) $$

Here, $\mathcal{S}_k \subseteq \mathcal{S}$ with $|\mathcal{S}_k| = M \le S$ is an index set whose elements are sampled uniformly from $\mathcal{S}$ without replacement in each iteration. Therefore, only $M$ dual update operators are applied, and the time complexity of this modified Step 2 is reduced from $O(S^3 \log(S)\, A \log(\delta^{-1}))$ to $O(M S^2 \log(S)\, A \log(\delta^{-1}))$. In the same spirit, a more radical approach is to uniformly sample $(\hat{s}, s, a)$ from $\mathcal{S} \times \mathcal{S} \times \mathcal{A}$ in every Step 2 and update only the dual variables associated with the corresponding problem (11); the complexity of such an update is reduced to $O(S \log S)$. Since this second type of update does not change $\lambda$, one needs to apply both aforementioned updates. We relegate the discussion of the computational complexities of NMDPs, RMDPs and RSMDPs to Appendix C.

5. Numerical Results

We compare the performance of the proposed RSMDPs with NMDPs, RMDPs and DRMDPs in three applications: river swim (Strehl & Littman, 2008), machine replacement (Delage & Mannor, 2010) and grid world (Ghavamzadeh et al., 2016); the results of the latter two applications are provided in Appendices D.5 and D.6, and detailed settings are provided in Appendices D, D.1 and D.4. We also compare our proposed algorithms with the Gurobi solver in terms of scalability; see Appendix D.7 for more details.
The code and data to reproduce our experiments are available online at https://github.com/RUANHaolin/RSMDPs.

[Figure 2: three panels plotting the difference between sample return and predicted return against the level of contamination (0 to 30), with curves for RSMDPs 0.8 and RSMDPs 0.85; DRMDPs 0.05 and DRMDPs 0.1; and RMDPs 0.05 and RMDPs 0.1.] Figure 2. Differences between sample returns and predicted returns over 1000 samples in the river swim environment. Due to the page limit, we only plot the parameters (of the three models) for which the sample returns are close to the predicted returns.

Table 2. Average computation times (in ms) of different algorithms for RSMDPs and of Gurobi for RMDPs, and the ratios of Gurobi's computation times for RSMDPs to those of PDA, PDAblock, and PDAblock+.

| S = A | Gurobi | PDA | PDAblock | PDAblock+ | RMDP | Gurobi/PDA | Gurobi/PDAblock | Gurobi/PDAblock+ |
|---|---|---|---|---|---|---|---|---|
| 10 | 8459.3 | 1521.5 | 1325.2 | 341.4 | 2818.6 | 5.6 | 6.4 | 24.8 |
| 13 | 60684.0 | 3104.8 | 2158.5 | 1132.7 | 4900.6 | 19.5 | 28.1 | 53.6 |
| 15 | 194114.5 | 4813.1 | 2241.7 | 2049.5 | 6604.5 | 40.3 | 86.6 | 94.7 |
| 17 | 552354.5 | 9305.4 | 2825.4 | 5286.8 | 9332.2 | 59.4 | 195.5 | 104.5 |

(The first five numeric columns report computation times; the last three report the ratios of Gurobi's RSMDP computation time to that of each first-order method.)

5.1. Improvements on Percentiles

In this section we compare the average and percentile performances of the four MDP models. Figure 1 presents the case where each model is equipped with the best two parameters selected by cross-validation. In most cases, RSMDPs outperform the other three models in terms of both average and percentile performances.

5.2. Target-Oriented Feature

In this section, we contaminate the true transition kernel $p$ to obtain a polluted transition kernel $\tilde{p}$; in particular, we quantify the level of contamination by $\|\tilde{p} - p\|_1$. We first input the true transition kernel to the four MDP models, then use the polluted kernels to test the policies they yield. To test the models' ability to reach their predicted (average) returns (computed with $p$ as input) under contamination, we compute the difference between their sample returns and predicted returns, so that a non-negative difference indicates reaching the prediction.

Table 1 compares the differences between sample returns and predicted returns over 1000 samples for each value of the parameters (i.e., $\tau$ for RSMDPs and $r$ for RMDPs and DRMDPs). Results show that RSMDPs have predicted returns that are much higher than those of RMDPs and DRMDPs, while their sample returns still reach their predicted/target returns most of the time (i.e., they have high prediction accuracy). Notice that the three models have the same predicted and sample returns in the third column of Table 1 because RSMDPs with $\tau = Z_N(\hat{p})$ (by Proposition 3.1), and RMDPs and DRMDPs with $r = 0$, all degenerate to NMDPs. Figure 2 further illustrates that, even with transition kernel samples that are far away from the true one, RSMDPs remain highly accurate in prediction, whereas nearly half and even most of the sample returns fall below the prediction for RMDPs and DRMDPs; this reflects the target-oriented feature of RSMDPs.

5.3. Scalability of Different Algorithms

In this section, we compare the computation times of our proposed first-order algorithms with those of the state-of-the-art solver Gurobi (academic license) (Gurobi Optimization, LLC, 2022). Table 2 reports the computation times of Gurobi when solving RMDPs (see Appendix D.2 for details of that model) and RSMDPs, as well as those of the proposed algorithms when solving RSMDPs.
Results show that directly solving RSMDPs with Gurobi can be very challenging: the computation time increases rapidly with even a small increase in problem size, and it is much larger than that of RMDPs. This observation confirms our motivation for developing a tailored first-order method for this problem. Compared to Gurobi, the proposed algorithms remain scalable as the problem size increases. In particular, the computation time of PDA scales similarly to that of RMDPs, and PDAblock (PDA with the dual updates in (12)) and PDAblock+ (PDA whose dual updating step follows the second strategy mentioned in Section 4.3) provide computationally cheap updates of the policy. This matches the advantages of first-order methods, which are designed to solve large problems to moderate accuracy with high efficiency.

6. Conclusion and Future Work

We propose RSMDPs to compute satisficing policies under model ambiguity. In particular, the expected return is constrained to meet a user-specified target, which is strictly imposed under the empirical transition kernel and softly imposed under all other possible transition kernels. RSMDPs minimize the magnitude of violation of those soft constraints, with additional tolerance that depends on the distance of the associated transition kernel from the empirical one. We reformulate the RSMDP model into a min-max form to which a scalable PDA algorithm is applicable. Experimental results showcase the robustness and target-oriented feature of RSMDPs as well as the scalability of the algorithm. A promising direction for future work is to extend RSMDPs to settings with continuous state and action spaces, for which discretization of the state and action spaces, as well as the approximate linear programming (ALP) method (e.g., (Abbasi-Yadkori et al., 2019)), may be needed.

Acknowledgements

We thank the anonymous reviewers for their comments. This work was supported, in part, by the General Research Fund Grant (Project No. 9043424) from the Hong Kong Research Grants Council, the NSFC/RGC Joint Research Scheme N_CityU105/21 from the Hong Kong Research Grants Council, the CityU Start-Up Grant (Project No. 9610481), the National Natural Science Foundation of China (Project No. 72032005), and the Chow Sang Sang Group Research Fund sponsored by Chow Sang Sang Holdings International Limited (Project No. 9229076).

References

Abbasi-Yadkori, Y., Bartlett, P. L., Chen, X., and Malek, A. Large-scale Markov decision problems via the linear programming dual. arXiv preprint arXiv:1901.01992, 2019.

Ashok, P., Křetínský, J., and Weininger, M. PAC statistical model checking for Markov decision processes and stochastic games. In Computer Aided Verification: 31st International Conference, CAV 2019, New York City, NY, USA, July 15-18, 2019, Proceedings, Part I, pp. 497-519. Springer, 2019.

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.

Bayraksan, G. and Love, D. K. Data-driven stochastic programming using phi-divergences. In The Operations Research Revolution, pp. 1-19. INFORMS, 2015.

Behzadian, B., Petrik, M., and Ho, C. P. Fast algorithms for L∞-constrained s-rectangular robust MDPs. Advances in Neural Information Processing Systems, 34, 2021a.

Behzadian, B., Russel, R. H., Petrik, M., and Ho, C. P. Optimizing percentile criterion using robust MDPs. In International Conference on Artificial Intelligence and Statistics, pp. 1009-1017. PMLR, 2021b.

Ben-Tal, A., El Ghaoui, L., and Nemirovski, A.
Robust optimization, volume 28. Princeton University Press, 2009. Bertsimas, D., Shtern, S., and Sturt, B. A data-driven approach for multi-stage linear optimization. Available at Optimization Online, 2018. Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004. Brown, D., Niekum, S., and Petrik, M. Bayesian robust optimization for imitation learning. Advances in Neural Information Processing Systems, 33:2479 2491, 2020. Chambolle, A. and Pock, T. On the ergodic convergence rates of a first-order primal-dual algorithm. Mathematical Programming, 159(1):253 287, 2016. Chen, Z., Yu, P., and Haskell, W. Distributionally robust optimization for sequential decision-making. Optimization, 68(12):2397 2426, 2019. Clement, J. G. and Kroer, C. First-order methods for Wasserstein distributionally robust MDP. In International Conference on Machine Learning, pp. 2010 2019. PMLR, 2021. Delage, E. and Mannor, S. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203 213, 2010. Delage, E. and Ye, Y. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595 612, 2010. Delgado, K., De Barros, L., Dias, D., and Sanner, S. Realtime dynamic programming for Markov decision processes with imprecise probabilities. Artificial Intelligence, 230:192 223, 2016. Derman, E. and Mannor, S. Distributional robustness and regularization in reinforcement learning. ar Xiv preprint ar Xiv:2003.02894, 2020. Robust Satisficing MDPs Derman, E., Geist, M., and Mannor, S. Twice regularized MDPs and the equivalence between robustness and regularization. Advances in Neural Information Processing Systems, 34, 2021. Doltsinis, S., Ferreira, P., and Lohse, N. An MDP modelbased reinforcement learning approach for production station ramp-up optimization: Q-learning analysis. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44(9):1125 1138, 2014. Esser, E., Zhang, X., and Chan, T. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3(4):1015 1046, 2010. Ghavamzadeh, M., Petrik, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. Advances in Neural Information Processing Systems, 29, 2016. Goyal, V. and Grand-Clement, J. Robust Markov decision processes: beyond rectangularity. Mathematics of Operations Research, 2022. Grand-Cl ement, J. and Kroer, C. First-order methods for Wasserstein distributionally robust MDPs. In Proceedings of Machine Learning Research, volume 139, pp. 2010 2019, 2021a. Grand-Cl ement, J. and Kroer, C. Scalable first-order methods for robust MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 12086 12094, 2021b. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022. URL https://www.gurobi.com. Hanasusanto, G. and Kuhn, D. Robust data-driven dynamic programming. Advances in Neural Information Processing Systems, 26, 2013. Hansen, T., Miltersen, P., and Zwick, U. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1 16, 2013. He, B. and Yuan, X. Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective. SIAM Journal on Imaging Sciences, 5(1): 119 149, 2012. Ho, C. P., Petrik, M., and Wiesemann, W. Fast Bellman updates for robust MDPs. 
In International Conference on Machine Learning, pp. 1979 1988. PMLR, 2018. Ho, C. P., Petrik, M., and Wiesemann, W. Partial policy iteration for L1-robust Markov decision processes. Journal of Machine Learning Research, 22(275):1 46, 2021. Ho, C. P., Petrik, M., and Wiesemann, W. Robust phidivergence MDPs. ar Xiv preprint ar Xiv:2205.14202, 2022. Imani, M., Ghoreishi, S. F., and Braga-Neto, U. M. Bayesian control of large MDPs with unknown dynamics in datapoor environments. Advances in Neural Information Processing Systems, 31, 2018. Iyengar, G. Robust dynamic programming. Mathematics of Operations Research, 30(2):257 280, 2005. Kaufman, D. and Schaefer, A. Robust modified policy iteration. INFORMS Journal on Computing, 25(3):396 410, 2013. Khanh, P. Q. and Quan, N. H. Versions of the Weierstrass theorem for bifunctions and solution existence in optimization. SIAM Journal on Optimization, 29(2):1502 1523, 2019. Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. The Journal of Machine Learning Research, 4: 1107 1149, 2003. Lobo, E. A., Ghavamzadeh, M., and Petrik, M. Softrobust algorithms for batch reinforcement learning. ar Xiv preprint ar Xiv:2011.14495, 2020. Long, D. Z., Sim, M., and Zhou, M. Robust satisficing. Operations Research, 71(1):61 82, 2023. Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. Bias and variance approximation in value function estimates. Management Science, 53(2):308 322, 2007. Mannor, S., Mebel, O., and Xu, H. Robust MDPs with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484 1509, 2016. Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115 166, 2018. MOSEK Ap S. MOSEK Optimizer API for Python 9.3.20, 2022. URL https://docs.mosek.com/latest/ pythonapi/index.html. Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780 798, 2005. Panaganti, K. and Kalathil, D. Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pp. 9582 9602. PMLR, 2022. Robust Satisficing MDPs Petrik, M. Approximate dynamic programming by minimizing distributionally robust bounds. ar Xiv preprint ar Xiv:1205.1782, 2012. Petrik, M. and Russel, R. Beyond confidence regions: tight Bayesian ambiguity sets for robust MDPs. Advances in Neural Information Processing Systems, 32, 2019. Petrik, M. and Subramanian, D. RAAM: the benefits of robustness in approximating aggregated MDPs in reinforcement learning. Advances in Neural Information Processing Systems, 27, 2014. Puterman, M. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Roy, A., Xu, H., and Pokutta, S. Reinforcement learning under model mismatch. Advances in Neural Information Processing Systems, 30, 2017. Russel, R., Behzadian, B., and Petrik, M. Optimizing normbounded weighted ambiguity sets for robust MDPs. ar Xiv preprint ar Xiv:1912.02696, 2019. Strehl, A. and Littman, M. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309 1331, 2008. Strehl, A., Li, L., and Littman, M. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(11), 2009. Suilen, M., Sim ao, T. D., Parker, D., and Jansen, N. 
Robust anytime learning of Markov decision processes. Advances in Neural Information Processing Systems, 35:28790 28802, 2022. Sutton, R. and Barto, A. Reinforcement learning: an introduction. MIT Press, 2018. Taleghan, M., Dietterich, T., Crowley, M., Hall, K., and Albers, J. PAC optimal MDP planning with application to invasive species management. Journal of Machine Learning Research, 16, 2015. Tamar, A., Mannor, S., and Xu, H. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pp. 181 189. PMLR, 2014. Truhar, N. and Veseli c, K. An efficient method for estimating the optimal dampers viscosity for linear vibrating systems using Lyapunov equation. SIAM Journal on Matrix Analysis and Applications, 31(1):18 39, 2009. Wang, Y. and Zou, S. Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193 7206, 2021. Wang, Y. and Zou, S. Policy gradient method for robust reinforcement learning. ar Xiv preprint ar Xiv:2205.07344, 2022. Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003. Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. Xie, W. Tractable reformulations of two-stage distributionally robust linear programs over the type Wasserstein ball. Operations Research Letters, 48(4):513 523, 2020. Xu, H. and Mannor, S. The robustness-performance tradeoff in Markov decision processes. Advances in Neural Information Processing Systems, 19, 2006. Xu, H. and Mannor, S. Parametric regret in uncertain Markov decision processes. In Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pp. 3606 3613. IEEE, 2009. Xu, H. and Mannor, S. Distributionally robust Markov decision processes. Advances in Neural Information Processing Systems, 23, 2010. Yang, I. A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance. IEEE Control Systems Letters, 1(1):164 169, 2017. Robust Satisficing MDPs A. Technical Results and Proofs A.1. Proofs of Results in Section 3 Proof of Proposition 3.1 By letting p = ˆp in the first S constraints in (4), on the one hand, one can observe that the feasible set of (4) is a subset of the optimal solution set of (2), where conclusion (i) follows; on the other hand, one can also observe that for all feasible u of (4), the maximal reachable value of r u is at most ZN(ˆp), which implies the infeasibility when τ > ZN(ˆp). Proof of Theorem 3.2 Focus on the first S constraints in problem (4). For every s S, the s-th one equivalent to e us ds min p P p Qsu + ks p ˆp . The optimization problem on the right-hand side is min p Qsu + ks p ˆp s.t. Bp = e p RS A S + , where Bp = e is a compact form of P s S ps,a,s = 1 (s, a) S A. The equivalent min-max form of this problem is min p max β 0,α p Qsu + ks p ˆp + α (Bp e) β p, whose dual problem2 is max β 0,α min p p (Qsu + B α β) + ks p ˆp α e = max β 0,α α e max p p (β Qsu B α) ks p ˆp . By the technique of convex conjugate, we have its equivalent reformulation as max α e ˆp (β Qsu B α) s.t. β Qsu B α ks α RS A, β RS A S + . Therefore, for all s S, the s-th constraint in (4) can be reformulated as e us ds α s e ˆp (βs Qsu B αs) βs Qsu B αs ks αs RS A, βs RS A S + . Now we can re-express problem (4) as min w k s.t. 
e us ds α s e ˆp (βs Qsu B αs) s S βs Qsu B αs ks s S r u τ u RS A + , k RS +, αs RS A, βs RS A S + s S, which is equivalent to min X s S ws βs Qsu B αs s.t. e us ds α s e ˆp (βs Qsu B αs) s S r u τ u RS A + , αs RS A, βs RS A S + s S. Our conclusion then follows from the fact that B ˆp = e. 2Here, strong duality follows from the fact that the primal problem has only the collection of constraints ps,a S (s, a) S A. Robust Satisficing MDPs A.2. Proofs of Results in Section 4 Lemma A.1. When w R++, strong duality holds for the following convex optimization problem: max (λ,θ) Vq(w) λ(e u d) θ Qu, (13) where u RA, d R, Q RS A S S A and q 1 are constants, and Vq(w) := (λ, θ) R+ RS A S + θ λ ˆp q w λ e = Bθ Proof of Lemma A.1 Notice that, since we have B ˆp = e by the definition of B, by letting λ be any nonnegative number and θ = λ ˆp, we can have (λ, θ) as the feasible solution of (13) (which is also strictly feasible in the first inequality constraint in Vq(w)). If ˆp > 0, then (λ, λ ˆp) with λ > 0 is the strictly feasible solution we want. Otherwise, if ˆp s, a, s = 0 for some ( s, a, s ) S A S, then (λ, θ) with λ > 0 and θ = λ ˆp will be a solution of (13) that is not strictly feasible because θ s, a, s = 0. Now we demonstrate how we can construct a strictly feasible solution of (13). First, observe that the second set of constraints in Vq(w) of (13) is equivalent to e θs,a = λ (s, a) S A. (14) Since θ 0 and λ > 0, there must exists some s S such that θ s, a, s > 0. Let ε = min{θ s, a, s /2, w/(2 e s, a, s e s, a, s )}, where e s, a, s and e s, a, s are two standard bases of RS A S. One can then easily verify that (λ, θ) with θ = θ + ε (e s, a, s e s, a, s ) is a feasible solution of (13) which remains strictly feasible in the first inequality constraint, while θ s, a, s remains strictly positive and θ s, a, s becomes strictly positive. By going through a similar procedure iteratively, we can finally construct a strictly feasible solution of (13). Proof of Proposition 4.1 Notice that, by Theorem 3.2, the RSMDP model (4) is equivalent to s.t. e us ds α s e ˆp ys s S ys = βs Qsu B αs s S r u τ u RS A + , αs RS A, βs RS A S + , ys RS A S s S. Since the first set of constraints in (4) implies that e us ˆp Qsu ds 0 s S, we have an equivalent reformulation of (15): s.t. e us ds α s e ˆp ys s S ys = βs Qsu B αs s S r u τ e us ˆp Qsu ds 0 s S u RS A + , αs RS A, βs RS A S + , ys RS A S s S. Notice that for any u RS A + that satisfies r u τ and e us ˆp Qsu ds 0 s S, (αs, βs, ys) = (0, 0, Qsu), s S is feasible in (16), whose objective value has a lower bound 0. We thus can re-express (16) in Robust Satisficing MDPs a min-min form: min u min α,β,y s.t. e us ds α s e ˆp ys s S ys = βs Qsu B αs s S r u τ e us ˆp Qsu ds 0 s S u RS A + , αs RS A, βs RS A S + , ys RS A S s S, Fix any u RS A + that satisfies r u τ and e us ˆp Qsu ds 0 s S, we have the dual of the inner minimization problem of (17) as follows: max λ,θ min α,β,y s S ws ys + X s S λs(e us ds + α s e + ˆp ys) + X s S θ s (βs Qsu B αs ys) s.t. λ RS +, θ RS S A S + , αs RS A, βs RS A S + , ys RS A S s S. (18) Here, for the inner minimization problem of (18), we have ws ys + y s (λs ˆp θs) = 0 θs λs ˆp ws s S otherwise, s S α s (λs e Bθs) = 0 λs e Bθs = 0 s S otherwise, s S θ s βs = 0 θs 0 s S otherwise. Therefore we have the equivalent reformulation of (18) as follows: λs(e us ds) θ s Qsu s.t. θs λs ˆp ws s S λs e = Bθs s S λ RS +, θs RS A S + s S, where a strictly feasible solution exists by Lemma A.1. 
Therefore we claim that strong duality between the inner minimization problem of (17) and its dual problem (19) holds by Slater s condition. We thus have an equivalent minimax formulation of (17) is as follows: min u max λ,θ λs(e us ds) θ s Qsu s.t. θs λs ˆp ws s S λs e = Bθs s S e us ˆp Qsu ds 0 s S r u τ u RS A + , λ RS +, θs RS A S + s S. It remains to argue that it is free to eliminate the third collection of constraints. To this end, notice that, if we choose a feasible u in (8) that satisfies e us ˆp Qsu ds > 0 for some s S, the inner maximum will approach + since we can choose a feasible solution (λs, θs) that satisfies θs = λs ˆp with arbitrarily large λs > 0, by which the objective value is expanded to + . Theorem A.2. [theorem 1, (Chambolle & Pock, 2016)] Let (uk, (λk, θk)), k = 0, 1, 2, , K be a sequence generated by Algorithm 1. If the stepsize parameters σ, ν > 0 satisfy 1 2ν u u 2 2 + 1 2 C(u u ), ((λ λ ) , (θ θ ) ) 0 (20) Robust Satisficing MDPs for any u, u RS A and (λ, θ), (λ , θ ) RS RS S A S. Then it holds for any feasible solution (u, (λ, θ)) of problem (9) that X s S {(λs (e u K s ds) θ s Qsu) ( λK s (e us ds) θK s Qsu))} 1 2ν u u0 2 2 + 1 2σ λ λ0 θ θ0 2 C(u u0), ((λ λ0) , (θ θ0) ) where u K = 1 k [K] uk, λK = 1 k [K] λk and θK = 1 Proof of Proposition 4.3 We first rewrite P(λ, θ; ˆu) as arg min a u + 1 s.t. r u τ u RS A + , where the coefficient vector a RS A satisfies a u = P s S λs e us θ s Qsu 1 ν ˆu u. Introducing dual variables ζ R+ and κ RS A + , the Lagrangian dual function of this problem is L(u, ζ, κ) = a u + 1 2ν u u + ζ (τ r u) κ u. Since ν > 0, problem (21) is a convex optimization problem. The KKT conditions (e.g., (Boyd & Vandenberghe, 2004)) thus are sufficient conditions of the optimality for the primal and dual solutions, which are r u τ u 0 ζ 0 κ 0 ζ (τ r u) = 0 κiui = 0 i [SA] u L(u, ζ, κ) = 1 ν u + a ζ r κ = 0. By these conditions, for ζ = 0 we have ui = νai i [SA] : κi = 0 0 i [SA] : κi = 0 u 0; and for ζ > 0 we have ui = ν (ζri ai) i [SA] : κi = 0 0 i [SA] : κi = 0 u 0, where it is sufficient to find the optimal ζ R+ which is the solution to the equation Φ(ζ) := P i [SA] riν [ζri ai]+ = τ, and then obtain optimal u i = [ν (ζ ri ai)]+ i [SA]. Notice that Φ(ζ) is a piecewise linear function so that we can locate all its breakpoints αi = ai ri : i I with I = {i [SA] | ai > 0, ri > 0} and then sort them from smallest to largest as ζi1 ζi2 ζi|I|. As Φ is non-decreasing on [0, + ), by searching the intervals [0, ζi1], [ζi1, ζi2], , [ζi I, + ) sequentially in an ascending order, we can obtain optimal ζ and u . The time complexity is dominated by the breakpoint sorting, which is O(SA log(SA)). We provide the pseudocode for solving problem (21) in Algorithm 2. Robust Satisficing MDPs Algorithm 2 Interval Search Algorithm for Problem (21) i [SA] ri[ νai]+ τ then return ζ = 0 and u i = [ νai]+ i [SA] else Compute all the breakpoints ζi = ai ri : i I with I = {i [SA] | ai > 0, ri > 0}; Sort the breakpoints from smallest to largest as ζi1 ζi2 ζi|I|; Compute the initial index set I [SA] \ I; for k = 1 to |I| do j I ν(ζikrj aj) rj then j I rjrj and u i = [ν(ζ ri ai)]+ i [SA] else I I {ik}; end if end for return ζ = r r and u i = [ν(ζ ri ai)]+ i [SA] end if Output: Solutions ζ and u Lemma A.3. 
For the optimal solution (λ , θ ) of Dˆs(u; ˆλ, ˆθ), the upper bound for λ is λ ˆλ + σ (dˆs e uˆs) + (dˆs e uˆs)2 + 2 (s,a,s ) Z+ zs,a,s (ˆλˆps,a,s + wˆs) + X (s,a,s ) Z zs,a,s [ˆλˆps,a,s wˆs]+ where z = Qˆsu, Z+ = {(s, a, s ) S A S | zs,a,s > 0} and Z = {(s, a, s ) S A S | zs,a,s < 0}. Proof. By Algorithm 1, (ˆλ, ˆθ) is a feasible solution of Dˆs(u; ˆλ, ˆθ). Thus by the first constraint in Dˆs(u; ˆλ, ˆθ), we have [ˆλˆps,a,s wˆs]+ ˆθs,a,s ˆλˆps,a,s +wˆs (s, a, s ) S A S. Plugging (ˆλ, ˆθ) in the objective function of Dˆs(u; ˆλ, ˆθ), we obtain an upper bound of its optimal value (dˆs e uˆs) ˆλ + z ˆθ (dˆs e uˆs) ˆλ + X (s,a,s ) Z+ zs,a,s (ˆλˆps,a,s + wˆs) + X (s,a,s ) Z zs,a,s [ˆλˆps,a,s wˆs]+. Let (λ , θ ) with λ = ˆλ+ λ be the optimal solution of Dˆs(u; ˆλ, ˆθ) and (ˆλ+ λ, θ) be a feasible solution of Dˆs(u; ˆλ, ˆθ). Plugging (ˆλ + λ, θ) in Dˆs(u; ˆλ, ˆθ), we consider the problem: min (dˆs e uˆs) (ˆλ + λ) + z θ + 1 2σ (( λ)2 + θ ˆθ 2 2) s.t. (ˆλ + λ)ˆps,a,s wˆs θs,a,s (s, a, s ) S A S (ˆλ + λ)ˆps,a,s + wˆs θs,a,s (s, a, s ) S A S e θs,a = ˆλ + λ s S, a A θ RS A S + . Notice that problem (22) and Dˆs(u; ˆλ, ˆθ) share the same optimal value. The equivalent minimax representation of Robust Satisficing MDPs problem (22) is: min θ RS A S max χ RS A S + , ψ RS A S + , ξ RS A, µ RS A S + (dˆs e uˆs) (ˆλ + λ) + z θ + 1 2σ (( λ)2 + θ ˆθ 2 2) (s,a,s ) S A S χs,a,s ((ˆλ + λ)ˆps,a,s wˆs θs,a,s ) (s,a,s ) S A S ψs,a,s (θs,a,s (ˆλ + λ)ˆps,a,s wˆs) (s,a) S A ξs,a (e θs,a ˆλ λ) Taking its dual, we have max (dˆs e uˆs) (ˆλ + λ) + 1 2σ (( λ)2 + ˆθ ˆθ) (s,a,s ) S A S χs,a,s ((ˆλ + λ)ˆps,a,s wˆs) (s,a,s ) S A S ψs,a,s ((ˆλ + λ)ˆps,a,s + wˆs) (s,a) S A ξs,a (ˆλ + λ) σ σ ˆθ χ + ψ + B ξ µ 2 s.t. χ RS A S + , ψ RS A S + , ξ RS A, µ RS A S + . Considering a feasible solution (χ, ψ, ξ, µ) = 0 in (23), by weak duality, we have a lower bound: Dˆs(u; ˆλ, ˆθ) (dˆs e uˆs) (ˆλ + λ) + 1 2σ (( λ)2 + ˆθ ˆθ) σ Hence, the existence of the optimal value requires that the following inequality must hold: (dˆs e uˆs) (ˆλ + λ) + 1 2σ (( λ)2 + ˆθ ˆθ) σ 2 (dˆs e uˆs) ˆλ + X (s,a,s ) Z+ zs,a,s (ˆλˆps,a,s + wˆs) + X (s,a,s ) Z zs,a,s [ˆλˆps,a,s wˆs]+, 1 2σ ( λ)2 + (dˆs e uˆs) λ σ (s,a,s ) Z+ zs,a,s (ˆλˆps,a,s + wˆs) + X (s,a,s ) Z zs,a,s [ˆλˆps,a,s wˆs]+ Hence, we have an upper bound λ σ (dˆs e uˆs) + (dˆs e uˆs)2 + 2 (s,a,s ) Z+ zs,a,s (ˆλˆps,a,s + wˆs) + X (s,a,s ) Z zs,a,s [ˆλˆps,a,s wˆs]+ from which our conclusion follows immediately. By Proposition 4.3, we have in Algorithm 1 that for all k N+, uk+1 is bounded if uk and (λ, θ) are bounded; while by Lemma A.3, λk+1 is bounded if uk+1, uk and (λk, θk) are bounded. Since we initialize (u0, (λ0, θ0)) with some real numbers, Lemma A.3 can iteratively provide us the upper bounds for the golden section search. Robust Satisficing MDPs Lemma A.4. The function fˆs is well-defined on R+. Proof It is sufficient to prove that problem (10) can achieve its minimum for all λ R+. Since for all λ R+, the feasible region of (10), C(λ) := θ RS A S + θ λ ˆp wˆs e θs,a = λ (s, a) S A is compact and the objective function hˆs(λ, θ) := (dˆs e uˆs) λ + θ Qˆsu + 1 2σ λ ˆλ θ ˆθ is continuous in θ, by the Weierstrass extreme-value theorem (see, e.g., (Khanh & Quan, 2019)), problem (10) can obtain its minimum. Proof of Proposition 4.4 Let us fix λ, λ R+ and ω [0, 1]. 
Proof of Proposition 4.4. Let us fix $\lambda,\lambda'\in\mathbb{R}_+$ and $\omega\in[0,1]$. By Lemma A.4, $f_{\hat s}$ is well-defined, so there exist $\bar\theta\in C(\lambda)$ and $\bar\theta'\in C(\lambda')$ such that $h_{\hat s}(\lambda,\bar\theta) = f_{\hat s}(\lambda)$ and $h_{\hat s}(\lambda',\bar\theta') = f_{\hat s}(\lambda')$, where $C(\lambda)$ and $h_{\hat s}(\lambda,\theta)$ are defined as in the proof of Lemma A.4. By the convexity of $C := \{(\lambda,\theta)\in\mathbb{R}_+\times\mathbb{R}^{SAS}_+\mid \theta\in C(\lambda)\}$ (which is the feasible region of $D_{\hat s}(u;\hat\lambda,\hat\theta)$), we have $\big((1-\omega)\lambda + \omega\lambda',\,(1-\omega)\bar\theta + \omega\bar\theta'\big)\in C$. Thus $(1-\omega)\bar\theta + \omega\bar\theta'\in C\big((1-\omega)\lambda + \omega\lambda'\big)$ and
$$
f_{\hat s}\big((1-\omega)\lambda + \omega\lambda'\big) \;\le\; h_{\hat s}\big((1-\omega)\lambda + \omega\lambda',\,(1-\omega)\bar\theta + \omega\bar\theta'\big) \;\le\; (1-\omega)\,h_{\hat s}(\lambda,\bar\theta) + \omega\,h_{\hat s}(\lambda',\bar\theta') \;=\; (1-\omega)\,f_{\hat s}(\lambda) + \omega\,f_{\hat s}(\lambda'),
$$
where the first inequality follows from the definition of $f_{\hat s}$ and the second holds by the convexity of $h_{\hat s}$. Since $\lambda$, $\lambda'$ and $\omega$ are arbitrary, $f_{\hat s}$ is convex, and our conclusion follows.

Proof of Proposition 4.5. Rearranging the constraints in (11), we rewrite the problem as
$$
\begin{array}{ll}
\min & \frac{1}{2\sigma}\theta_{s,a}^\top\theta_{s,a} + \theta_{s,a}^\top\big(z_{\hat s,s,a} - \frac{1}{\sigma}\hat\theta_{s,a}\big)\\
\text{s.t.} & \theta_{s,a,s'} \ge [\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ \quad\forall s'\in S\\
& \theta_{s,a,s'} \le \lambda\hat p_{s,a,s'} + w_{\hat s} \quad\forall s'\in S\\
& e^\top\theta_{s,a} = \lambda,\quad \theta_{s,a}\in\mathbb{R}^S.
\end{array}
$$
Introducing dual variables $\eta\in\mathbb{R}^S_+$, $\phi\in\mathbb{R}^S_+$ and $\iota\in\mathbb{R}$, the Lagrangian of this problem is
$$
L(\theta_{s,a},\eta,\phi,\iota) = \frac{1}{2\sigma}\theta_{s,a}^\top\theta_{s,a} + \theta_{s,a}^\top\Big(z_{\hat s,s,a} - \frac{1}{\sigma}\hat\theta_{s,a}\Big) + \sum_{s'\in S}\eta_{s'}\big([\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ - \theta_{s,a,s'}\big) + \sum_{s'\in S}\phi_{s'}\big(\theta_{s,a,s'} - (\lambda\hat p_{s,a,s'} + w_{\hat s})\big) + \iota\big(\lambda - e^\top\theta_{s,a}\big).
$$
Thus the KKT conditions of problem (11) are
$$
\theta_{s,a,s'} \ge [\lambda\hat p_{s,a,s'} - w_{\hat s}]_+\;\;\forall s'\in S,\qquad \theta_{s,a,s'} \le \lambda\hat p_{s,a,s'} + w_{\hat s}\;\;\forall s'\in S,\qquad e^\top\theta_{s,a} = \lambda,\qquad \eta\ge 0,\qquad \phi\ge 0,
$$
$$
\eta_{s'}\big([\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ - \theta_{s,a,s'}\big) = 0\;\;\forall s'\in S,\qquad \phi_{s'}\big(\theta_{s,a,s'} - (\lambda\hat p_{s,a,s'} + w_{\hat s})\big) = 0\;\;\forall s'\in S,
$$
$$
\nabla_{\theta_{s,a}} L(\theta_{s,a},\eta,\phi,\iota) = \frac{1}{\sigma}\theta_{s,a} + \Big(z_{\hat s,s,a} - \frac{1}{\sigma}\hat\theta_{s,a}\Big) - \eta + \phi - \iota e = 0,
$$
from which we have
$$
\theta_{s,a,s'} = \begin{cases}\lambda\hat p_{s,a,s'} + w_{\hat s} & s'\in S: \phi_{s'}\neq 0\\ \sigma\big(\iota + \frac{1}{\sigma}\hat\theta_{s,a,s'} - z_{\hat s,s,a,s'}\big) & s'\in S: \eta_{s'} = 0 \text{ and } \phi_{s'} = 0\\ [\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ & s'\in S: \eta_{s'}\neq 0,\end{cases}
$$
where it suffices to solve the equation $W_{s,a}(\iota) = \lambda$ to find $\iota^\star$ and then obtain the optimal solution $\theta^\star_{s,a,s'} = W_{s,a,s'}(\iota^\star)$ for all $s'\in S$, where $W_{s,a}(\iota) = \sum_{s'\in S} W_{s,a,s'}(\iota)$ with
$$
W_{s,a,s'}(\iota) = \begin{cases}\lambda\hat p_{s,a,s'} + w_{\hat s} & \text{if } \iota \ge \frac{1}{\sigma}(\lambda\hat p_{s,a,s'} + w_{\hat s}) - \frac{1}{\sigma}\hat\theta_{s,a,s'} + z_{\hat s,s,a,s'}\\ [\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ & \text{if } \iota < \frac{1}{\sigma}[\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ - \frac{1}{\sigma}\hat\theta_{s,a,s'} + z_{\hat s,s,a,s'}\\ \sigma\big(\iota + \frac{1}{\sigma}\hat\theta_{s,a,s'} - z_{\hat s,s,a,s'}\big) & \text{otherwise}\end{cases}
$$
for all $s'\in S$. Since each $W_{s,a,s'}$ is clearly piecewise linear and non-decreasing, the function $W_{s,a} = \sum_{s'\in S} W_{s,a,s'}$ is also piecewise linear and non-decreasing with $2S$ breakpoints: the upper breakpoints $\frac{1}{\sigma}(\lambda\hat p_{s,a,s'} + w_{\hat s}) - \frac{1}{\sigma}\hat\theta_{s,a,s'} + z_{\hat s,s,a,s'}$, $s'\in S$, and the lower breakpoints $\frac{1}{\sigma}[\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ - \frac{1}{\sigma}\hat\theta_{s,a,s'} + z_{\hat s,s,a,s'}$, $s'\in S$. Sorting them from smallest to largest as $\iota_1\le\dots\le\iota_{2S}$ and searching the intervals $[\iota_1,\iota_2],[\iota_2,\iota_3],\dots,[\iota_{2S-1},\iota_{2S}]$ in ascending order, we can locate the optimal $\iota^\star$ and set $\theta^\star_{s,a,s'} = W_{s,a,s'}(\iota^\star)$ for all $s'\in S$. The time complexity is dominated by sorting the breakpoints, which takes $O(S\log S)$ time. We provide the pseudocode for solving problem (11) in Algorithm 3. Here the functions $p_1(\cdot): [2S]\to S$ and $p_2(\cdot): [2S]\to\{\text{lower},\text{upper}\}$ map the indices of the non-decreasing breakpoint sequence to the indices and types of the lower/upper breakpoints, respectively; e.g., if $\iota_3$ corresponds to the lower breakpoint of $s'=5$, then $p_1(3) = 5$ and $p_2(3) = \text{lower}$.

Algorithm 3 Interval Search Algorithm for Problem (11)
Input: $z_{\hat s,s,a}$, $\hat\theta_{s,a}$, $\hat p_{s,a}$, $\lambda$, $w_{\hat s}$, $\sigma$
  Compute all the upper breakpoints $\bar\iota_{s'} \gets \frac{1}{\sigma}(\lambda\hat p_{s,a,s'} + w_{\hat s}) - \frac{1}{\sigma}\hat\theta_{s,a,s'} + z_{\hat s,s,a,s'}$, $s'\in S$, and lower breakpoints $\underline\iota_{s'} \gets \frac{1}{\sigma}[\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ - \frac{1}{\sigma}\hat\theta_{s,a,s'} + z_{\hat s,s,a,s'}$, $s'\in S$;
  Sort the breakpoints from smallest to largest as $\iota_1\le\dots\le\iota_{2S}$;
  Initialize $\chi \gets \sigma$ and $\psi \gets \sum_{s'\in S: s'\neq p_1(1)}[\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ + \sigma\big(\frac{1}{\sigma}\hat\theta_{s,a,p_1(1)} - z_{\hat s,s,a,p_1(1)}\big)$;
  Initialize the upper breakpoint index set $U\gets\emptyset$ and the lower breakpoint index set $L\gets S\setminus\{p_1(1)\}$;
  for $k = 1$ to $2S-1$ do
    if $\chi\,\iota_{k+1} + \psi \ge \lambda$ then
      $\iota^\star \gets (\lambda - \psi)/\chi$;
      for $s' = 1$ to $S$ do
        $\theta^\star_{s,a,s'} \gets \lambda\hat p_{s,a,s'} + w_{\hat s}$ if $s'\in U$; $\;[\lambda\hat p_{s,a,s'} - w_{\hat s}]_+$ if $s'\in L$; $\;\sigma\big(\iota^\star + \frac{1}{\sigma}\hat\theta_{s,a,s'} - z_{\hat s,s,a,s'}\big)$ if $s'\in S\setminus(U\cup L)$;
      end for
      break
    else if $p_2(k+1) = \text{upper}$ then
      $\chi \gets \chi - \sigma$; $\psi \gets \psi - \sigma\big(\frac{1}{\sigma}\hat\theta_{s,a,p_1(k+1)} - z_{\hat s,s,a,p_1(k+1)}\big) + \lambda\hat p_{s,a,p_1(k+1)} + w_{\hat s}$; $U\gets U\cup\{p_1(k+1)\}$;
    else
      $\chi \gets \chi + \sigma$; $\psi \gets \psi + \sigma\big(\frac{1}{\sigma}\hat\theta_{s,a,p_1(k+1)} - z_{\hat s,s,a,p_1(k+1)}\big) - [\lambda\hat p_{s,a,p_1(k+1)} - w_{\hat s}]_+$; $L\gets L\setminus\{p_1(k+1)\}$;
    end if
  end for
Output: Solution $\theta^\star_{s,a}$
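Viewed geometrically, problem (11) projects the unconstrained minimizer $\hat\theta_{s,a} - \sigma z_{\hat s,s,a}$ onto the box $[\lambda\hat p_{s,a} - w_{\hat s}]_+ \le \theta_{s,a} \le \lambda\hat p_{s,a} + w_{\hat s}$ intersected with the hyperplane $e^\top\theta_{s,a} = \lambda$. The Python sketch below is an illustrative alternative (not the authors' implementation): it solves the same one-dimensional equation $W_{s,a}(\iota) = \lambda$ by bisection instead of the breakpoint sort of Algorithm 3, which is what yields the $O(S\log S)$ bound.

```python
import numpy as np

def solve_theta_block(z, theta_hat, p_hat, lam, w, sigma):
    """One (s,a)-block of problem (11): project theta_hat - sigma*z onto
    { [lam*p_hat - w]_+ <= theta <= lam*p_hat + w, sum(theta) = lam }."""
    lower = np.maximum(lam * p_hat - w, 0.0)
    upper = lam * p_hat + w
    base = theta_hat - sigma * z                 # unconstrained minimizer

    def W(iota):                                 # clipped affine components W_{s,a,s'}(iota)
        return np.clip(base + sigma * iota, lower, upper)

    # Bracket iota: below lo every component sits at `lower`, above hi at `upper`.
    lo = np.min((lower - base) / sigma)
    hi = np.max((upper - base) / sigma)
    for _ in range(100):                         # bisection on the non-decreasing map
        mid = 0.5 * (lo + hi)
        if W(mid).sum() < lam:
            lo = mid
        else:
            hi = mid
    return W(0.5 * (lo + hi))
```

Since $\sum_{s'}[\lambda\hat p_{s,a,s'} - w_{\hat s}]_+ \le \lambda \le \sum_{s'}(\lambda\hat p_{s,a,s'} + w_{\hat s})$, a root always lies in the initial bracket, so the bisection is well-defined.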
Proof of Proposition 4.6. By the strategy described in Section 4.2, obtaining the output of Step 2 in Algorithm 1 requires solving $S$ subproblems, each by a golden section search with $O(\log(1/\delta))$ function evaluations. By the analysis in Section 4.2, each evaluation of $f_{\hat s}$ in the search requires solving $SA$ subproblems of the form (11), and each of these subproblems is, by Proposition 4.5, solvable in time $O(S\log S)$. Our conclusion thus follows.

B. An Equivalent Convex Optimization Problem for RSMDPs with KL Divergence

Lemma B.1. Let $U := \{u\in\mathbb{R}^{SA}_+ \mid r^\top u\ge\tau\}$. For all $s\in S$ and $(\hat\alpha,\hat u)\in\mathbb{R}^{SA}\times U$, if
$$
\lim_{k\to 0^+}\; -\hat\alpha^\top e - \sum_{i\in[SAS]}\hat p_i\, k\,\exp\Big(-\frac{Q_{s,i}^\top\hat u + B_{:,i}^\top\hat\alpha}{k}\Big) \;\ge\; e^\top\hat u_s - d_s, \tag{24}
$$
then it holds that $Q_s\hat u + B^\top\hat\alpha \ge 0$.

Proof. Suppose to the contrary that there exists some $i\in[SAS]$ such that $Q_{s,i}^\top\hat u + B_{:,i}^\top\hat\alpha < 0$. It follows that
$$
\lim_{k\to 0^+}\; k\,\exp\Big(-\frac{Q_{s,i}^\top\hat u + B_{:,i}^\top\hat\alpha}{k}\Big) = +\infty,
$$
by which the left-hand side of (24) is driven to $-\infty$, yielding a contradiction.

Lemma B.2. Let $B_s(u) := \{\alpha\in\mathbb{R}^{SA}\mid -\alpha^\top e \ge e^\top u_s - d_s \text{ and } Q_su + B^\top\alpha\ge 0\}$ and
$$
B'_s(u) := \Big\{\alpha\in\mathbb{R}^{SA}\;\Big|\; \lim_{k\to 0^+}\; -\alpha^\top e - \sum_{i\in[SAS]}\hat p_i\, k\,\exp\Big(-\frac{Q_{s,i}^\top u + B_{:,i}^\top\alpha}{k}\Big) \ge e^\top u_s - d_s\Big\}
$$
for all $s\in S$, where $Q_{s,i}$ is the $i$-th row of $Q_s$ and $B_{:,i}$ is the $i$-th column of $B$, both viewed as column vectors. When $\hat p\in\mathbb{R}^{SAS}_{++}$, it holds that $B_s(u) = B'_s(u)$ for all $s\in S$ and all $u\in U := \{u\in\mathbb{R}^{SA}_+\mid r^\top u\ge\tau\}$.

Proof. Fix some $s\in S$. We first prove that $B_s(u)\subseteq B'_s(u)$ for all $u\in U$. To this end, let $(\hat\alpha,\hat u)\in\mathbb{R}^{SA}\times U$ be arbitrarily taken such that $\hat\alpha\in B_s(\hat u)$, i.e., $(\hat\alpha,\hat u)$ satisfies $r^\top\hat u\ge\tau$, $-\hat\alpha^\top e\ge e^\top\hat u_s - d_s$ and $Q_s\hat u + B^\top\hat\alpha\ge 0$. We then have
$$
\lim_{k\to 0^+}\; -\hat\alpha^\top e - \sum_{i\in[SAS]}\hat p_i\, k\,\exp\Big(-\frac{Q_{s,i}^\top\hat u + B_{:,i}^\top\hat\alpha}{k}\Big) \;=\; -\hat\alpha^\top e \;\ge\; e^\top\hat u_s - d_s,
$$
which implies that $\hat\alpha\in B'_s(\hat u)$. Next, we prove that $B'_s(u)\subseteq B_s(u)$ for all $u\in U$. To this end, let $(\hat\alpha,\hat u)\in\mathbb{R}^{SA}\times U$ be arbitrarily taken such that $\hat\alpha\in B'_s(\hat u)$. By Lemma B.1, it holds that $Q_s\hat u + B^\top\hat\alpha\ge 0$. In this case, we have
$$
\lim_{k\to 0^+}\; -\hat\alpha^\top e - \sum_{i\in[SAS]}\hat p_i\, k\,\exp\Big(-\frac{Q_{s,i}^\top\hat u + B_{:,i}^\top\hat\alpha}{k}\Big) \;=\; -\hat\alpha^\top e,
$$
which means that $\hat\alpha\in B'_s(\hat u)$ if and only if $(\hat\alpha,\hat u)$ satisfies $-\hat\alpha^\top e\ge e^\top\hat u_s - d_s$. Hence, $\hat\alpha\in B_s(\hat u)$ follows from the definition of $B_s(\hat u)$.

Proposition B.3. When $\hat p\in\mathbb{R}^{SAS}_{++}$, the RSMDP (4) equipped with the KL divergence³ $\ell(p,\hat p) = \sum_{(s,a,s')\in S\times A\times S}\hat p_{s,a,s'}\,\phi(p_{s,a,s'}/\hat p_{s,a,s'})$, where the phi-divergence function is $\phi(t) = t\log t - t + 1$, is equivalent to the following convex optimization problem:
$$
\begin{array}{lll}
Z_{RS} = & \min & w^\top k\\
& \text{s.t.} & e^\top u_s - d_s \le -\alpha_s^\top e - \displaystyle\sum_{i\in[SAS]}\hat p_i k_s\exp\Big(-\frac{Q_{s,i}^\top u + B_{:,i}^\top\alpha_s}{k_s}\Big) \quad\forall s\in S\\
&& r^\top u \ge \tau\\
&& u\in\mathbb{R}^{SA}_+,\; k\in\mathbb{R}^S_+,\;\alpha_s\in\mathbb{R}^{SA}\;\;\forall s\in S,
\end{array}
$$
where $Q_{s,i}$ is the $i$-th row of $Q_s$ and $B_{:,i}$ is the $i$-th column of $B$, both viewed as column vectors.

Proof. The $s$-th constraint of (4) is equivalent to $e^\top u_s - d_s \le \min_{p\in P}\; p^\top Q_s u + k_s\,\ell(p,\hat p)$, where the minimization problem on the right-hand side is
$$
\min_p\;\; p^\top Q_s u + k_s\,\ell(p,\hat p)\quad\text{s.t.}\quad Bp = e,\;\; p\in\mathbb{R}^{SAS}_+.
$$
The dual of this problem is
$$
\begin{aligned}
\max_\alpha\;\min_{p\ge 0}\;& p^\top Q_s u + k_s\,\ell(p,\hat p) + \alpha^\top(Bp - e)\\
&= \max_\alpha\;\min_{p\ge 0}\;(Q_s u + B^\top\alpha)^\top p - \alpha^\top e + k_s\sum_{i\in[SAS]}\hat p_i\,\phi\!\Big(\frac{p_i}{\hat p_i}\Big)\\
&= \max_\alpha\;-\alpha^\top e + \min_{p\ge 0}\;(Q_s u + B^\top\alpha)^\top p + k_s\sum_{i\in[SAS]}\hat p_i\,\phi\!\Big(\frac{p_i}{\hat p_i}\Big)\\
&= \max_\alpha\;-\alpha^\top e - \sum_{i\in[SAS]}\max_{p\ge 0}\Big\{-(Q_{s,i}^\top u + B_{:,i}^\top\alpha)\,p - k_s\hat p_i\,\phi\!\Big(\frac{p}{\hat p_i}\Big)\Big\}\\
&= \max_\alpha\;-\alpha^\top e - \sum_{i\in[SAS]}\max_{q\ge 0}\Big\{-(Q_{s,i}^\top u + B_{:,i}^\top\alpha)\,\hat p_i q - k_s\hat p_i\,\phi(q)\Big\}.
\end{aligned}
$$
Here, strong duality holds because $\{p\in\mathbb{R}^{SAS}_+\mid Bp = e\} = (\Delta_S)^{SA}$ admits the strictly feasible solution $p = (1/S)\,e$. The last equality holds by the substitution $q = p/\hat p_i$. We can further have
$$
\max_\alpha\;-\alpha^\top e - \sum_{i\in[SAS]}\max_{q\ge 0}\Big\{-(Q_{s,i}^\top u + B_{:,i}^\top\alpha)\hat p_i q - k_s\hat p_i\phi(q)\Big\} = \begin{cases}\displaystyle\max_\alpha\;\Big\{-\alpha^\top e \;:\; Q_su + B^\top\alpha\ge 0,\;\alpha\in\mathbb{R}^{SA}\Big\} & \text{if } k_s = 0,\\[2mm] \displaystyle\max_\alpha\;-\alpha^\top e - \sum_{i\in[SAS]}\hat p_i k_s\exp\Big(-\frac{Q_{s,i}^\top u + B_{:,i}^\top\alpha}{k_s}\Big) & \text{if } k_s > 0,\end{cases}
$$
where the equality for the case $k_s > 0$ follows from the convex conjugate of $\phi(\cdot)$; see (Bayraksan & Love, 2015).

³We refer interested readers to (Bayraksan & Love, 2015) for more details about the KL divergence.
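For the reader's convenience, the conjugate computation invoked for the case $k_s > 0$ is the following standard fact about $\phi(t) = t\log t - t + 1$ (a short derivation added here, not taken from the paper):
$$
\phi^*(y) \;=\; \sup_{t\ge 0}\big\{\,y\,t - (t\log t - t + 1)\,\big\} \;=\; e^{y} - 1,
$$
since the supremum is attained at $t = e^{y}$. Evaluating $\phi^*$ at $y = -\big(Q_{s,i}^\top u + B_{:,i}^\top\alpha\big)/k_s$ and scaling by $k_s\hat p_i$ produces the exponential terms appearing in the case $k_s > 0$ above.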
Therefore, we have that the RSMDP (4) is equivalent to
$$
\begin{array}{llll}
Z_{RS}(\hat p) = & \min & w^\top k\\
& \text{s.t.} & -\alpha_s^\top e \ge e^\top u_s - d_s & \forall s: k_s = 0\\
&& Q_s u + B^\top\alpha_s \ge 0 & \forall s: k_s = 0\\
&& -\alpha_s^\top e - \displaystyle\sum_{i\in[SAS]}\hat p_i k_s\exp\Big(-\frac{Q_{s,i}^\top u + B_{:,i}^\top\alpha_s}{k_s}\Big) \ge e^\top u_s - d_s & \forall s: k_s > 0\\
&& r^\top u \ge \tau\\
&& u\in\mathbb{R}^{SA}_+,\; k\in\mathbb{R}^S_+,\;\alpha_s\in\mathbb{R}^{SA}\;\;\forall s\in S.
\end{array}
$$
Our conclusion then follows from Lemma B.2.

C. Discussions on Computation Complexities

For value iteration, the time complexity per iteration for solving NMDPs is $O(S^2A)$. For robust value iteration, the per-iteration time complexity for solving RMDPs differs across ambiguity sets. In general, for $s$-rectangular ambiguity sets, the practical time complexity per iteration can be at least $O(S^4A^3)$ (Boyd & Vandenberghe, 2004). There are recent algorithmic developments for specific ambiguity sets; for example, the per-iteration time complexity for solving RMDPs with an unweighted $L_1$-norm ambiguity set can be reduced to $O(S^2A\log S)$ (Ho et al., 2021). As mentioned in Section 4.3, the proposed PDAblock and PDAblock+ have per-iteration time complexities of $O(MS^2A\log(S)\log(\delta^{-1}))$ and $O(S\log S)$ (with high probability, as PDAblock+ is a randomized algorithm), respectively.

D. Additional Details and Results on the Experiments

In our experiments, the reward function is deterministic and known to the agent, while the transition kernel is uncertain. We adopt data-driven setups and evaluate models based on their out-of-sample performances. All optimization problems are solved on an Intel 3.6 GHz processor with 32GB RAM. In the experiment in Section 5.1, we vary the training sample size⁴ among $\{10, 20, \dots, 140, M\}$ for some sufficiently large $M$ such that the empirical transition kernel is close enough to the true one, and we generate 5000 out-of-sample testing trajectories of length 100. The parameters of RMDPs and DRMDPs (i.e., the radius $r$ of the ambiguity set) are selected from $r\in[0.0, 1.8]$, and those of RSMDPs (i.e., the target $\tau$) are selected from $\tau\in[0.5Z_N, 1.0Z_N]$, via cross validation. In the experiment in Section 5.2, $\tau$ is the predicted return for RSMDPs, while $Z_N = d^\top v^\star_N$, $d^\top v^\star_R$ and $d^\top v^\star_{DR}$ are the predicted returns of NMDPs, RMDPs and DRMDPs, respectively, where the optimal value function $v^\star_N$ is computed via (1), and $v^\star_R$ and $v^\star_{DR}$ are computed via the robust and distributionally robust Bellman optimality equations, respectively (see Appendices D.2 and D.3), all with the true $p^\star$. All sample returns are computed by the Bellman equation with the polluted transition kernel. In the experiment in Section 5.3, we implement our algorithms in C++, and Gurobi is also called from C++. We synthetically generate random RSMDP instances; the details of the experiments and parameters can be found in Appendix D.7, which also includes additional experimental results with PDAblock+, a variant of PDA where the dual updating step follows the second strategy mentioned in Section 4.3.

⁴Here the sample size is the number of transitions that an agent experiences (with a uniform initial distribution).
D.1. Setup for Initial Transition Kernel

For the experiments in Section 5.1, the next-state space of each state-action pair is given to the agent at the beginning when the state and action spaces are large (e.g., grid world), but such information is withheld when the state and action spaces are small (e.g., machine replacement, river swim); otherwise, the optimal policy could be obtained immediately from the initial transition kernel using any method.

D.2. Additional Details on Robust MDPs

Robust MDPs (RMDPs) maximize the total expected return under the worst-case realization of the uncertain parameter within a predefined ambiguity set $P$:
$$
Z_R = \max_{\pi\in\Pi}\;\min_{p\in P}\; d^\top v(\pi, p). \tag{25}
$$
In the experiments, we adopt the popular $sa$-rectangular ambiguity set (Behzadian et al., 2021a; Iyengar, 2005; Nilim & El Ghaoui, 2005; Strehl et al., 2009; Weissman et al., 2003)
$$
P = \big\{p\in\mathbb{R}^{SAS}_+ \;\big|\; p_{s,a}\in P_{s,a},\; e^\top p_{s,a} = 1\;\;\forall s\in S,\, a\in A\big\},
$$
where $P_{s,a} = \{p\in\Delta_S\mid \|p - \hat p_{s,a}\|_1\le r\}$ with $\hat p_{s,a}\in\Delta_S$ being the subvector of the empirical transition kernel $\hat p\in P$; see, e.g., (Petrik & Subramanian, 2014; Russel et al., 2019). To calculate the optimal policy of an RMDP, we utilise the value iteration method, where in each iteration the robust Bellman optimality equation
$$
v^\star_s = \max_{a\in A}\;\min_{p\in P_{s,a}}\; r_{s,a} + \gamma\, p^\top v^\star = \max_{a\in A}\;\min_{p\in\Delta_S}\;\big\{r_{s,a} + \gamma\, p^\top v^\star \;\big|\; \|p - \hat p_{s,a}\|_1\le r\big\}\quad\forall s\in S \tag{26}
$$
is solved to update the value function. We adopt Algorithm 4 to solve the inner minimization problem in (26) (Petrik & Subramanian, 2014). We remark that the input $v$ in Algorithm 4 is given by the value function from the previous iteration of the value iteration process, initialized as $0$ at the first iteration.

Algorithm 4 Solve the inner minimization problem in (26)
Input: Value function $v$, transition kernel $p = \hat p_{s,a}$
  Sort the entries of $p$ so that the corresponding entries of $v$ are non-decreasing, indexed as $1,\dots,S$;
  $o \gets \mathrm{copy}(p)$; $i \gets S$; $b \gets \min\{1 - p_1,\, r/2\}$; $o_1 \gets p_1 + b$;
  repeat
    $\delta \gets \min\{b, o_i\}$; $o_i \gets o_i - \delta$; $b \gets b - \delta$; $i \gets i - 1$;
  until $b \le 0$ or $i < 1$
  $p^\star \gets o$;
Output: Optimal solution $p^\star$
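For illustration, a minimal Python sketch of the worst-case step in Algorithm 4 (the standard $L_1$-ball worst-case distribution, not the authors' implementation) is given below; `v` and `p_hat` are assumed to be length-$S$ arrays for a fixed state-action pair and `r` is the radius of $P_{s,a}$.

```python
import numpy as np

def worst_case_l1(v, p_hat, r):
    """Worst-case distribution for  min_p { p'v : ||p - p_hat||_1 <= r, p in simplex }:
    move up to r/2 of probability mass onto the lowest-value state, taking it
    from the highest-value states first."""
    order = np.argsort(v)                 # indices sorted by non-decreasing value
    p = p_hat.astype(float).copy()
    budget = min(1.0 - p[order[0]], 0.5 * r)
    p[order[0]] += budget                 # add mass to the lowest-value state
    i = len(v) - 1
    while budget > 1e-12 and i > 0:
        take = min(budget, p[order[i]])   # remove mass from the best states first
        p[order[i]] -= take
        budget -= take
        i -= 1
    return p

# Example robust Bellman backup for one (s, a) pair (hypothetical variable names):
# q_sa = r_sa + gamma * worst_case_l1(v, p_hat_sa, radius) @ v
```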
D.3. Additional Details on Distributionally Robust MDPs

We follow the settings in (Grand-Clément & Kroer, 2021a), where the ambiguity set is defined using the Wasserstein distance with the $L_p$-norm for some $p\in\mathbb{R}\cup\{\infty\}$. In particular, we solve the following distributionally robust Bellman equation:
$$
v^\star_s = \max_{\pi_s\in\Delta_A}\;\min_{p^1,\dots,p^N\in\hat P}\;\left\{\frac{1}{N}\sum_{i=1}^N\sum_{a\in A}\pi_{s,a}\big(r_{s,a} + \gamma\,(p^i_{s,a})^\top v^\star\big) \;\;\Big|\;\; \frac{1}{N}\sum_{i=1}^N\big\|p^i_s - \hat p^i_s\big\|_p \le \mu_W\right\},
$$
where $\mu_W$ is the radius and $\hat p^1,\dots,\hat p^N\in P$ are $N$ samples of transition kernels (Bertsimas et al., 2018; Xie, 2020; Yang, 2017). In the experiments, we set $p = 1$, and Mosek (academic license) is used to solve the inner minimization problem in each iteration (MOSEK ApS, 2022).

D.4. Additional Details on Environments

We use a discount factor $\gamma = 0.85$ for all environments, and the objective is always to maximize the total discounted reward.

Machine Replacement: there are 2 repair options constituting the action set [repair, do nothing] and 10 states. The rewards depend only on the states and are [20, 20, 20, 20, 20, 20, 20, 0, 18, 10].

River Swim: there are 2 swimming directions constituting the action set [move left, move right] and 10 states. The rewards depend only on the state and are [5, 0, 0, 10, 10, 10, 10, 10, 10, 15].

Grid World: the grid world has two rows and 12 columns, and the rewards depend only on the column indices; they are [0, 3, 21, 27, 6, 0, 0, 0, 0, 0, 15, 24]. There are four available actions: move up and move down for vertical moves (which decrease and increase the row index, respectively), as well as move left and move right for horizontal moves (which decrease and increase the column index, respectively). Horizontal moves have a chance of failure that depends only on the row index (0.9 for the first row and 0.2 for the second). Failing a transfer or selecting a vertical move generates the column index of the next state according to a Dirichlet distribution. After selecting a horizontal move, the agent randomly goes up, goes down, or stays, with probabilities 0.35, 0.35 and 0.3, respectively.

D.5. Additional Results on the Improvements on Percentiles

[Figure 3. Average and percentile performances over 5000 out-of-sample testing trajectories in the machine replacement application. Panels: 10-, 20-, 30-percentile and average returns versus sample size; legends: RSMDPs 0.85/0.9, DRMDPs 0.2/0.5, RMDPs 0.2/1.2.]

[Figure 4. Average and percentile performances over 5000 out-of-sample testing trajectories in the grid world application. Panels: 10-, 20-, 30-percentile and average returns versus sample size; legends: RSMDPs 0.85/0.9, DRMDPs 0.2/0.4, RMDPs 1.0/1.2.]

D.6. Additional Results on the Target-Oriented Feature

[Figure 5. Differences between sample returns and predicted returns over 1000 samples in the machine replacement environment, plotted against the level of contamination; panels: RSMDPs 0.9/0.85, DRMDPs 0.05/0.1, RMDPs 0.05/0.1. Due to the page limit, we only plot the parameters (of the three models) for which the sample returns are close to the predicted returns.]

Table 3. Predicted returns and the corresponding differences (in median) between sample returns and predicted returns over 1000 samples in the machine replacement application.

| Model | Parameter / Quantity | | | | | | |
|---|---|---|---|---|---|---|---|
| RSMDPs | $\tau/Z_N(\hat p)$ | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 |
| | Predicted Return | 123.9 | 111.5 | 99.1 | 86.7 | 74.4 | 62.0 |
| | Sample Return - Predicted Return (median) | -10.5 | 1.0 | 13.4 | 25.8 | 38.2 | 50.9 |
| DRMDPs | $r$ | 0.0 | 0.3 | 0.6 | 0.9 | 1.2 | 1.5 |
| | Predicted Return | 123.9 | 105.4 | 88.2 | 68.5 | 45.2 | 31.5 |
| | Sample Return - Predicted Return (median) | -10.5 | 7.6 | 24.8 | 44.0 | 66.9 | 79.1 |
| RMDPs | $r$ | 0.0 | 0.3 | 0.6 | 0.9 | 1.2 | 1.5 |
| | Predicted Return | 123.9 | 99.9 | 76.1 | 52.8 | 30.2 | 20.7 |
| | Sample Return - Predicted Return (median) | -10.5 | 13.7 | 37.5 | 60.8 | 83.4 | 92.9 |

[Figure 6. Differences between sample returns and predicted returns over 1000 samples in the grid world environment, plotted against the level of contamination; panels: RSMDPs 0.9/0.85, DRMDPs 0.05/0.1, RMDPs 0.05/0.1. Due to the page limit, we only plot the parameters (of the three models) for which the sample returns are close to the predicted returns.]

D.7. Detailed Numerical Results on PDAs

In our experiments, we generate random instances as follows. The entries of the reward function $r$, the initial distribution $d$, and the transition kernel $\hat p$ are sampled from a uniform distribution on $[0, 1]$. The entries of $d$ and $\hat p$ are then normalized so that $d$ and each $\hat p_{s,a}$ lie in the probability simplex, for all $s\in S$, $a\in A$. We set the discount factor $\gamma = 0.95$ and the target $\tau = 0.85\,Z_N$.
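A small Python sketch of this instance generator (with hypothetical function and variable names; the nominal return $Z_N$ is obtained here by plain value iteration) is:

```python
import numpy as np

def random_rsmdp_instance(S, A, gamma=0.95, target_ratio=0.85, seed=0):
    """Random instance as described above: uniform entries for r, d and p_hat,
    with d and each p_hat[s, a, :] normalized onto the probability simplex.
    The target is tau = target_ratio * Z_N, where Z_N = d' v_N is the nominal
    optimal return."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(size=(S, A))                        # rewards r(s, a)
    d = rng.uniform(size=S); d /= d.sum()               # initial distribution
    p_hat = rng.uniform(size=(S, A, S))
    p_hat /= p_hat.sum(axis=2, keepdims=True)           # empirical kernel

    # Nominal value iteration to obtain Z_N = d' v_N.
    v = np.zeros(S)
    for _ in range(5000):
        q = r + gamma * np.einsum('sap,p->sa', p_hat, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < 1e-10:
            v = v_new
            break
        v = v_new
    tau = target_ratio * (d @ v)
    return r, d, p_hat, gamma, tau
```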
We denote by PDA the proposed Algorithm 1. PDAblock is a variant of PDA in which the dual updating step (i.e., Step 2 in Algorithm 1) follows the first strategy mentioned in Section 4.3, so that the dual updating step becomes
$$
(\lambda^{k+1}_s, \theta^{k+1}_s) \gets D_s(2u^{k+1} - u^k;\; \lambda^k_s, \theta^k_s)\qquad \forall s\in S_k. \tag{27}
$$
Here $S_k$ with $|S_k| = M \le S$ is an index set whose elements are sampled uniformly from $S$ without replacement in each iteration, and we set $M = 2$. On the other hand, PDAblock+ is a PDA where the dual updating step follows the second strategy mentioned in Section 4.3. In particular, at each iteration, $(\hat s, s, a)$ is sampled uniformly; we perform the update (27) with probability $P$, and otherwise we update $\theta_{s,a}$ by solving problem (11), which is the subproblem corresponding to problem (10). We set $P = 1/(SA)$ in this experiment.

In the experiment, we also compare our PDAs against the Gurobi solver applied to the RSMDP reformulation. Since the overhead (in terms of computation time) can be dominating, we report only the solver runtime of Gurobi. Figure 7 reports the average computation times and the per-iteration computation times (relative to Gurobi) over 20 randomly generated test instances; the vertical bars indicate standard deviations. Since PDA is a first-order method, it is impractical to expect solutions of extremely high precision. Hence, we terminate PDA when $|f_{\mathrm{PDA}} - f_{\mathrm{Gurobi}}|/f_{\mathrm{Gurobi}} < 5\%$, where $f_{\mathrm{PDA}}$ and $f_{\mathrm{Gurobi}}$ are the objective values computed by PDA and Gurobi, respectively. The same stopping criterion is used for PDAblock and PDAblock+. To ensure that the algorithms terminate within a reasonable amount of time, we also cap the maximal numbers of iterations at 2000, 20000, and 400000 for PDA, PDAblock, and PDAblock+, respectively.

Table 4. Predicted returns and the corresponding differences (in median) between sample returns and predicted returns over 1000 samples in the grid world application.

| Model | Parameter / Quantity | | | | | | |
|---|---|---|---|---|---|---|---|
| RSMDPs | $\tau/Z_N(\hat p)$ | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 |
| | Predicted Return | 55.9 | 50.3 | 44.7 | 39.1 | 33.5 | 27.9 |
| | Sample Return - Predicted Return (median) | -2.4 | 3.1 | 8.7 | 14.3 | 19.8 | 25.4 |
| DRMDPs | $r$ | 0.0 | 0.3 | 0.6 | 0.9 | 1.2 | 1.5 |
| | Predicted Return | 55.9 | 31.8 | 21.1 | 14.7 | 11.0 | 9.4 |
| | Sample Return - Predicted Return (median) | -2.5 | 21.1 | 31.6 | 38.0 | 41.6 | 42.8 |
| RMDPs | $r$ | 0.0 | 0.3 | 0.6 | 0.9 | 1.2 | 1.5 |
| | Predicted Return | 55.9 | 26.6 | 15.5 | 11.1 | 9.4 | 8.5 |
| | Sample Return - Predicted Return (median) | -2.5 | 26.7 | 38.2 | 42.6 | 44.3 | 45.2 |

[Figure 7. Left: computation times (in ms) of Gurobi, PDA, PDAblock, and PDAblock+ for problem sizes $S = A \in \{10, \dots, 17\}$. Right: the ratio of Gurobi's computation time to the per-iteration computation times of PDA, PDAblock, and PDAblock+.]

Figure 7 shows that, as the problem size increases, the computation time of Gurobi grows rapidly compared with PDA, PDAblock, and PDAblock+. PDA also exhibits growing computation time, but at a slower rate than Gurobi. The right panel of Figure 7 demonstrates the scalability of the proposed algorithms: as the problem size increases, the per-iteration computation times remain remarkably cheap compared with Gurobi.
This phenomenon highlights the advantage of the proposed PDAs and matches the expected behavior of first-order methods, which are designed to solve large problems to moderate accuracy via computationally cheap updates.

E. Limitations and Potential Negative Societal Impact

One limitation of this work is that RSMDPs are not (yet) solvable by a dynamic programming approach. Another limitation is the lack of further exploration of how the optimal policy and robust value function of RMDPs can indicate a suitable range for the targeted return of RSMDPs. Potential negative societal impact is not applicable to this work.