# Multi-Objective MDPs with Conditional Lexicographic Reward Preferences

Kyle Hollins Wray¹, Shlomo Zilberstein¹, Abdel-Illah Mouaddib²
¹ School of Computer Science, University of Massachusetts, Amherst, MA, USA
² GREYC Laboratory, University of Caen, Basse-Normandie, France

Sequential decision problems that involve multiple objectives are prevalent. Consider, for example, a driver of a semi-autonomous car who may want to optimize competing objectives such as travel time and the effort associated with manual driving. We introduce a rich model called Lexicographic MDP (LMDP) and a corresponding planning algorithm called LVI that generalize previous work by allowing for conditional lexicographic preferences with slack. We analyze the convergence characteristics of LVI and establish its game-theoretic properties. The performance of LVI in practice is tested within a realistic benchmark problem in the domain of semi-autonomous driving. Finally, we demonstrate how GPU-based optimization can improve the scalability of LVI and other value iteration algorithms for MDPs.

## 1 Introduction

Stochastic planning problems designed to optimize multiple objectives are widespread within numerous domains such as management of smart homes and commercial buildings (Kwak et al. 2012), reservoir water control (Castelletti, Pianosi, and Soncini-Sessa 2008), and autonomous robotics (Mouaddib 2004; Calisi et al. 2007). Current approaches often use a scalarization function and a weight vector to project the multi-objective problem to a single-objective problem (Roijers et al. 2013; Natarajan and Tadepalli 2005; Perny and Weng 2010; Perny et al. 2013). While these approaches effectively leverage the vast existing work on single-objective optimization, they have several drawbacks. Choosing a projection is often too onerous to use in practice since there are many viable Pareto optimal solutions to the original multi-objective problem, making it hard to visualize and analyze alternative solutions. Often there is no clear way to prefer one over another. In some cases, a simple lexicographic order exists among the objectives; for example, using plan safety as the primary criterion and cost as the secondary one. But a lexicographic order of objectives can be too rigid, not allowing any trade-offs between objectives (e.g., a large cost reduction for taking a minimal risk).

Recent work by Mouaddib (2004) used a strict lexicographic preference ordering for multi-objective MDPs. Others have also developed lexicographic orderings over value functions, calling this technique ordinal dynamic programming (Mitten 1974; Sobel 1975). Mitten assumed a specific preference ordering over outcomes for a finite-horizon MDP; Sobel extended this model to infinite-horizon MDPs. Ordinal dynamic programming has also been explored in reinforcement learning (Gábor, Kalmár, and Szepesvári 1998; Natarajan and Tadepalli 2005), with the notion of a minimum criterion value. We propose a natural extension of sequential decision making with lexicographic order by introducing two added model components: conditioning and slack. Conditioning allows the lexicographic order to depend on certain state variables. Slack allows a small deviation from the optimal value of a primary variable so as to improve secondary value functions.
The added flexibility is essential to capture practical preferences in many domains. For example, in manufacturing there is always a trade-off among cost, quality, and time. In critical states of the manufacturing process, one may prefer to optimize quality with no slack, whereas in less important states one may optimize cost, but allow for some slack in order to improve time and quality.

Our model is motivated by work on planning for semi-autonomous systems (Zilberstein 2015). Consider a car that can operate autonomously under certain conditions, for example, maintaining safe speed and distance from other vehicles on a highway. All other roads require manual driving. A driver may want to minimize both the time needed to reach the destination and the effort associated with manual driving. The concepts of conditional preference and slack are quite useful in defining the overall objective. To ensure safety, if the driver is tired, then roads that are autonomy-capable are preferred without any margin of slack; however, if the driver is not tired, then roads that optimize travel time are preferred, perhaps with some slack to increase the inclusion of autonomy-capable segments. We focus on this sample domain for the remainder of the paper.

The general use of preference decomposition is popular, as found in Generalized Additive Decomposable (GAI) networks (Gonzales, Perny, and Dubus 2011) or Conditional Preference Networks (CP-nets) (Boutilier et al. 2004). Constrained MDPs (CMDPs) can also capture complex preference structures, as well as slack, and are potentially a more general representation than LMDPs (Altman 1999). Various other forms of slack are also commonly found in the literature (Gábor, Kalmár, and Szepesvári 1998). However, as we show, LMDPs offer a good trade-off between expressive power and computational complexity, allowing a new range of objectives to be expressed without requiring substantially more complex solution methods.

Our primary contributions include formulating the Lexicographic MDP (LMDP) model and the corresponding Lexicographic Value Iteration (LVI) algorithm. They generalize the previous methods mentioned above with our formulation of slack variables and conditional state-based preferences. Additionally, we show that LMDPs offer a distinct optimization framework from what can be achieved using a linearly weighted scalarization function. Furthermore, we introduce a new benchmark problem involving semi-autonomous driving, together with general tools to experiment in this domain. Finally, we develop a Graphics Processing Unit (GPU) implementation of our algorithm and show that GPU-based optimization can greatly improve the scalability of Value Iteration (VI) in MDPs.

Section 2 states the LMDP problem definition. Section 3 presents our main convergence results, a bound on slack, and an interesting relation to game theory. Section 4 presents our experimental results. Finally, Section 5 concludes with a discussion of LMDPs and LVI.

## 2 Problem Definition

A Multi-Objective Markov Decision Process (MOMDP) is a sequential decision process in which an agent controls a domain with a finite set of states. The actions the agent can perform in each state cause a stochastic transition to a successor state. This transition results in a reward, which consists of a vector of values, each of which depends on the state transition and action.
The process unfolds over a finite or infinite number of discrete time steps. In a standard MDP, there is a single reward function and the goal is to maximize the expected cumulative discounted reward over the sequence of stages. MOMDPs present a more general model with multiple reward functions. We define below a variant of MOMDPs that we call the Lexicographic MDP (LMDP), which extends MOMDPs with lexicographic preferences to also include conditional preferences and slack. We then introduce a Lexicographic Value Iteration (LVI) algorithm (Algorithm 1) which solves LMDPs.

**Definition 1.** A Lexicographic Markov Decision Process (LMDP) is represented by a 7-tuple $\langle S, A, T, \mathbf{R}, \boldsymbol{\delta}, \mathcal{S}, \mathbf{o} \rangle$:

- $S$ is a finite set of $n$ states, with initial state $s_0 \in S$;
- $A$ is a finite set of $m$ actions;
- $T : S \times A \times S \to [0, 1]$ is a state transition function which specifies the probability of transitioning from a state $s \in S$ to a state $s' \in S$, given that action $a \in A$ was performed; equivalently, $T(s, a, s') = \Pr(s' \mid s, a)$;
- $\mathbf{R} = [R_1, \ldots, R_k]^T$ is a vector of reward functions such that $\forall i \in K = \{1, \ldots, k\}$, $R_i : S \times A \times S \to \mathbb{R}$; each specifies the reward for being in a state $s \in S$, performing action $a \in A$, and transitioning to a state $s' \in S$, often written as $\mathbf{R}(s, a, s') = [R_1(s, a, s'), \ldots, R_k(s, a, s')]$;
- $\boldsymbol{\delta} = \langle \delta_1, \ldots, \delta_k \rangle$ is a tuple of slack variables such that $\forall i \in K$, $\delta_i \ge 0$;
- $\mathcal{S} = \{S_1, \ldots, S_\ell\}$ is a set which forms an $\ell$-partition over the state space $S$;
- $\mathbf{o} = \langle o_1, \ldots, o_\ell \rangle$ is a tuple of strict preference orderings such that, with $L = \{1, \ldots, \ell\}$, $\forall j \in L$, $o_j$ is a $k$-tuple ordering the elements of $K$.

**Algorithm 1** Lexicographic Value Iteration (LVI)

```
 1: V ← 0
 2: V′ ← 0
 3: while ‖V − V′‖_∞^S > ε(1 − γ)/γ do
 4:   V′ ← V
 5:   V^fixed ← V
 6:   for j = 1, …, ℓ do
 7:     for i = o_j(1), …, o_j(k) do
 8:       while ‖V′_i − V_i‖_∞^{S_j} > ε(1 − γ)/γ do
 9:         V′_i(s) ← V_i(s), ∀s ∈ S_j
10:         V_i(s) ← B_i V′_i(s), ∀s ∈ S_j
11:       end while
12:     end for
13:   end for
14: end while
15: return V′
```

In the interest of clarity, we limit this paper to infinite-horizon LMDPs (i.e., $h = \infty$) with a discount factor $\gamma \in [0, 1)$. The finite-horizon case follows in the natural way. A policy $\pi : S \to A$ maps each state $s \in S$ to an action $a \in A$. Let $\mathbf{V} = [V_1, \ldots, V_k]^T$ be a set of value functions. Let each function $V_i^\pi : S \to \mathbb{R}$, $\forall i \in K$, represent the value of the states in $S$ following policy $\pi$. The stochastic process of MDPs enables us to represent this using the expected value of the reward for following the policy at each stage:

$$\mathbf{V}^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \mathbf{R}(s_t, \pi(s_t), s_{t+1}) \,\Big|\, s_0 = s, \pi\Big].$$

This allows us to recursively write the value of a state $s \in S$, given a particular policy $\pi$, in the following manner:

$$\mathbf{V}^\pi(s) = \sum_{s' \in S} T(s, \pi(s), s')\big(\mathbf{R}(s, \pi(s), s') + \gamma \mathbf{V}^\pi(s')\big).$$

### Lexicographic Value Iteration

LMDPs lexicographically maximize $V_{o_j(i)}(s)$ over $V_{o_j(i+1)}(s)$, for all $i \in \{1, \ldots, k-1\}$, $j \in L$, and $s \in S$, using $V_{o_j(i+1)}$ to break ties. The model allows for slack $\delta_{o_j(i)} \ge 0$ (deviation from the overall optimal value). We will also refer to $\eta_{o_j(i)} \ge 0$ as the deviation from optimal for a single action change. We show that the classical value iteration algorithm (Bellman 1957) can be easily modified to solve MOMDPs with this preference characterization. For the sake of readability, we use the following convention: always assume that the ordering is applied, e.g., $V_{i+1} = V_{o_j(i+1)}$ and $\{1, \ldots, i-1\} = \{o_j(1), \ldots, o_j(i-1)\}$. This allows us to omit the explicit ordering $o_j(\cdot)$ for subscripts, sets, etc.
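To make the tuple in Definition 1 concrete, the following minimal sketch shows one possible in-memory representation using dense NumPy arrays. The class and field names mirror the paper's notation, but the layout itself is an illustrative assumption, not the authors' implementation (which is in C++/CUDA).

```python
# Minimal sketch of an LMDP container (Definition 1); the dense-array layout
# and the class/field names are illustrative assumptions, not the authors' code.
from dataclasses import dataclass
import numpy as np

@dataclass
class LMDP:
    T: np.ndarray          # shape (n, m, n): T[s, a, s'] = Pr(s' | s, a)
    R: np.ndarray          # shape (k, n, m, n): R[i, s, a, s'] = R_i(s, a, s')
    delta: np.ndarray      # shape (k,): slack delta_i >= 0 per reward function
    partitions: list       # ell lists of state indices forming the partition S_1, ..., S_ell
    orderings: list        # ell permutations of range(k): o_j orders the objectives for S_j
    gamma: float = 0.95    # discount factor in [0, 1)

# Hypothetical example: two objectives (0 = time, 1 = fatigue) and a state space
# split into "attentive" and "tired" partitions with different orderings, as in
# the driving domain described later.
# lmdp = LMDP(T, R, delta=np.array([10.0, 0.0]),
#             partitions=[attentive_states, tired_states],
#             orderings=[[0, 1], [1, 0]], gamma=0.95)
```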
First, Equation 1 defines $Q_i(s, a)$, the value of taking an action $a \in A$ in a state $s \in S$ according to objective $i \in K$:

$$Q_i(s, a) = \sum_{s' \in S} T(s, a, s')\big(R_i(s, a, s') + \gamma V_i(s')\big). \tag{1}$$

With this definition in place, we may define a restricted set of actions for each state $s \in S$. For $i = 1$, let $A_1(s) = A$, and for all $i \in \{1, \ldots, k-1\}$ let $A_{i+1}(s)$ be defined as:

$$A_{i+1}(s) = \Big\{a \in A_i(s) \;\Big|\; \max_{a' \in A_i(s)} Q_i(s, a') - Q_i(s, a) \le \eta_i\Big\}. \tag{2}$$

For reasons explained in Section 3, we let $\eta_i = (1 - \gamma)\delta_i$. Finally, let Equation 3 below be the Bellman update equation for MOMDPs with lexicographic reward preferences for $i \in K$, using slack $\delta_i \ge 0$, for all states $s \in S$. If $i > 1$, then we require $V_{i-1}$ to have converged.

$$V_i(s) = \max_{a \in A_i(s)} Q_i(s, a). \tag{3}$$

Within the algorithm, we leverage a modified value iteration with slack Bellman update equation (from Equation 3), denoted $B_i$. We use either $V_i'(s')$ for $s' \in S_j \subseteq S$ or $V_i^{\text{fixed}}(s')$ for $s' \in S \setminus S_j$, as shown in Equation 4 below, with $[\cdot]$ denoting Iverson brackets.

$$B_i V_i'(s) = \max_{a \in A_i(s)} \sum_{s' \in S} T(s, a, s')\big(R_i(s, a, s') + \gamma \bar{V}_i(s')\big) \tag{4}$$

$$\bar{V}_i(s') = V_i'(s')\,[s' \in S_j] + V_i^{\text{fixed}}(s')\,[s' \notin S_j] \tag{5}$$

## 3 Theoretical Analysis

This section provides a theoretical analysis of LVI in three parts. First, we provide a strong lower bound on slack to complete the definition of LVI. Second, we prove convergence given an important assumption, and explain why the assumption is needed. Third, we show two properties of LMDPs: a game theory connection and a key limitation of scalarization methods in comparison to LMDPs.

### Strong Bound on Slack

First we show in Proposition 1 that $\eta_i$ from Equation 2 may be defined as $(1 - \gamma)\delta_i$ to bound the final deviation from the optimal value of a state by $\delta_i$, for $i \in K$. This is designed to be a worst-case guarantee that considers that each state selects an action as far from optimal as it can, given the slack allocated to it. The accumulation of error over all states is bounded by $\boldsymbol{\delta}$. In practice, this strong bound can be relaxed, if desired.

**Proposition 1.** For all $j \in L$, for $i \in K$, assume $1, \ldots, i-1$ have converged. Let $V^\eta$ be the value functions returned following Equation 4 (Lines 7–10). Let $V^\pi$ be the value functions returned by value iteration, following the resulting optimal policy $\pi$, starting at $V^\eta$. If $\eta_i = (1 - \gamma)\delta_i$, then $\forall s \in S_j$, $V_i^\eta(s) - V_i^\pi(s) \le \delta_i$.

*Proof.* For any $i \in K$, the full (infinite) expansion of value iteration for $V_i^\eta$ is as follows ($t \to \infty$):

$$V_i^\pi(s) = \sum_{s^t \in S} T(s, \pi(s), s^t)\Big(R_i(s, \pi(s), s^t) + \gamma\Big(\cdots \sum_{s^1 \in S} T(s^2, \pi(s^2), s^1)\big(R_i(s^2, \pi(s^2), s^1) + \gamma V_i^0(s^1)\big)\cdots\Big)\Big).$$

Since value iteration admits exactly one unique fixed point, any initial value of $V_i^0$ is allowed; we let $V_i^0 = V_i^\eta$. From this, $Q_i^\eta(s, \pi(s))$ (Equation 1) exists within the above equation for $V_i^\pi(s)$. By Equation 2, $V_i^\eta(s) - Q_i^\eta(s, \pi(s)) \le \eta_i$, since $\pi(s) \in A_k(s) \subseteq \cdots \subseteq A_{i+1}(s)$. Equivalently, $Q_i^\eta(s, \pi(s)) \ge V_i^\eta(s) - \eta_i$. Combine all of these facts and bound the expansion from below. The $\eta_i$ falls out of the inner equation. Also, recall that $\sum_{s^2 \in S} T(s^3, \pi(s^3), s^2) = 1$ and that $\gamma\eta_i$ is a constant. This produces:

$$V_i^\pi(s) \ge \sum_{s^t \in S} T(s, \pi(s), s^t)\Big(R_i(s, \pi(s), s^t) + \gamma\Big(\cdots \sum_{s^2 \in S} T(s^3, \pi(s^3), s^2)\big(R_i(s^3, \pi(s^3), s^2) + \gamma V_i^\eta(s^2)\big) - \gamma\eta_i \cdots\Big)\Big).$$

We again recognize $Q_i^\eta(s^3, \pi(s^3))$ and place a lower bound on it with $V_i^\eta(s^3) - \eta_i$. This process repeats, each time introducing a new $\eta_i$ with one less $\gamma$ multiplied in front of it, until we reach the final equation. We obtain the following inequality, and note that if $\eta_i \ge 0$, then we may subtract another $\eta_i$ in order to obtain a geometric series (i.e., the sum may begin at $t = 0$):

$$V_i^\pi(s) \ge V_i^\eta(s) - \sum_{t=0}^{\infty} \gamma^t \eta_i \ge V_i^\eta(s) - \frac{\eta_i}{1 - \gamma} \quad\Longleftrightarrow\quad V_i^\eta(s) - V_i^\pi(s) \le \frac{\eta_i}{1 - \gamma}.$$

Therefore, let $\eta_i = (1 - \gamma)\delta_i$. This guarantees that the error for all states $s \in S$ is bounded by $\delta_i$. ∎
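The update equations above translate directly into code. The following is a minimal sketch of Algorithm 1 with the slack-restricted action sets of Equation 2 and the backup $B_i$ of Equations 4–5, under the same dense-array conventions as the container sketched earlier. It is an illustrative NumPy implementation, not the authors' C++/CUDA code; the function and variable names are ours.

```python
# Illustrative sketch of LVI (Algorithm 1) with slack-restricted action sets
# (Eqs. 1-5); eta_i = (1 - gamma) * delta_i as in Proposition 1. Assumes
# gamma in (0, 1) and non-empty partitions; not the authors' implementation.
import numpy as np

def lexicographic_vi(T, R, delta, partitions, orderings, gamma=0.95, eps=1e-4):
    """T: (n, m, n), R: (k, n, m, n), delta: (k,), partitions: list of index
    lists, orderings: list of permutations of range(k). Returns V of shape (k, n)."""
    k, n, m, _ = R.shape
    eta = (1.0 - gamma) * delta
    threshold = eps * (1.0 - gamma) / gamma           # stopping rule of Algorithm 1
    V = np.zeros((k, n))
    while True:
        V_prev = V.copy()
        V_fixed = V.copy()                            # other partitions' values stay fixed
        for Sj, oj in zip(partitions, orderings):
            Sj = np.asarray(Sj)
            allowed = [np.ones(m, dtype=bool) for _ in range(n)]   # A_1(s) = A
            for i in oj:
                # Inner value iteration for objective i on partition S_j (lines 8-11).
                while True:
                    Vbar = V_fixed[i].copy()
                    Vbar[Sj] = V[i, Sj]               # Eq. 5: mix current and fixed values
                    # Eq. 1: Q[s, a] = sum_{s'} T(s,a,s') (R_i(s,a,s') + gamma * Vbar(s'))
                    Q = np.einsum('sap,sap->sa', T, R[i] + gamma * Vbar[None, None, :])
                    Vi_new = np.array([Q[s, allowed[s]].max() for s in Sj])  # Eq. 3
                    diff = np.abs(Vi_new - V[i, Sj]).max()
                    V[i, Sj] = Vi_new
                    if diff <= threshold:
                        break
                # Eq. 2: restrict actions for the next objective using slack eta_i.
                # Q from the final sweep approximates the converged Q_i.
                for s in Sj:
                    best = Q[s, allowed[s]].max()
                    allowed[s] &= (best - Q[s]) <= eta[i]
        if np.abs(V - V_prev).max() <= threshold:     # line 3 of Algorithm 1
            return V
```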
### Convergence Properties

In order to prove the convergence of Algorithm 1, we first prove that the value iteration component over a partition with slack is a contraction map. The proof follows from value iteration (Bellman 1957) and from the suggested proof by Russell and Norvig (2010). We include it because of its required modifications and because it explains exactly why Assumption 1 is needed. Finally, we indicate the domain of each max norm, i.e., $\|\cdot\|_\infty^Z = \max_{z \in Z} |\cdot|$.

**Proposition 2.** For all $j \in L$, for $i \in K$, assume $1, \ldots, i-1$ have converged, with discount factor $\gamma \in [0, 1)$. $B_i$ (Equation 4) is a contraction map in the space of value functions over $s \in S_j$, i.e., $\|B_i V_1 - B_i V_2\|_\infty^{S_j} \le \gamma \|V_1 - V_2\|_\infty^{S_j}$.

*Proof.* Let the space $Y_i = \mathbb{R}^z$ be the space of value functions for $i$, with $z = |S_j|$, i.e., $V_i = [V_i(s_{j1}), \ldots, V_i(s_{jz})]^T \in Y_i$. Let the distance metric $d_i$ be the max norm, i.e., $\|V_i\|_\infty = \max_{s \in S_j} |V_i(s)|$. Since $\gamma \in [0, 1)$, the metric space $M_i = \langle Y_i, d_i \rangle$ is a complete normed metric space (i.e., a Banach space). Let the lexicographic Bellman optimality equation for $i$ (Equation 4) be defined as an operator $B_i$. We must show that the operator $B_i$ is a contraction map in $M_i$ for all $i \in K$, given either that $i = 1$ or that the previous $i - 1$ has converged to within $\epsilon$ of its fixed point. Let $V_1, V_2 \in Y_i$ be any two value function vectors, and apply Equation 4. For $s \in S_j$, if $i = 1$ then $A_i(s) = A$; otherwise, $A_i(s)$ is defined using $A_{i-1}(s)$ (Equation 2), which by construction has converged.

$$\|B_i V_1 - B_i V_2\|_\infty^{S_j} = \max_{s \in S_j} \Big|\max_{a \in A_i(s)} Q_1(s, a) - \max_{a \in A_i(s)} Q_2(s, a)\Big|$$

As part of the $Q(\cdot)$ values, we distribute $T(\cdot)$ to each $R(\cdot)$ and $V(\cdot)$ in the summations, then apply the property $\max_x (f(x) + g(x)) \le \max_x f(x) + \max_x g(x)$ twice. We may then pull out $\gamma$ and recall that, for any two functions $f$ and $g$, $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$:

$$\le \gamma \max_{s \in S_j} \max_{a \in A_i(s)} \sum_{s' \in S} T(s, a, s')\,\big|\bar{V}_1(s') - \bar{V}_2(s')\big|$$

Now we can apply Equation 5. Note the requirement that $\|V_1^{\text{fixed}} - V_2^{\text{fixed}}\|_\infty^{S \setminus S_j} \le \|V_1 - V_2\|_\infty^{S_j}$. Informally, this means that the largest error in the fixed values is bounded by the current iteration's largest source of error from partition $j$. This is trivially true when performing VI in Algorithm 1 on Lines 6–13 because $V^{\text{fixed}}$ is the same for $V_1$ and $V_2$, implying that $0 \le \|V_1 - V_2\|_\infty^{S_j}$. Therefore, we are left with the difference of $V_1$ and $V_2$ over $S_j$:

$$\le \gamma \max_{s \in S_j} \max_{a \in A_i(s)} \sum_{s' \in S_j} T(s, a, s')\,\big|V_1(s') - V_2(s')\big| \le \gamma \max_{s \in S_j} \big|V_1(s) - V_2(s)\big| \le \gamma \|V_1 - V_2\|_\infty^{S_j}$$

This proves that the operator $B_i$ is a contraction map on the metric space $M_i$, for all $i \in K$. ∎

We may now guarantee convergence to within $\epsilon > 0$ of the fixed point for a specific partition's value function. We omit this proof because it is the same as previous proofs and does not produce more assumptions (unlike Proposition 2).

**Proposition 3.** For all $j \in L$, for $i \in K$, assume $1, \ldots, i-1$ have converged. Following Equation 4 (Lines 7–10), for any $i \in K$, $B_i$ converges to within $\epsilon > 0$ of a unique fixed point once $\|V_i^{t+1} - V_i^t\|_\infty^{S_j} < \epsilon(1 - \gamma)/\gamma$ for iteration $t > 0$.
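For completeness, the omitted argument is the standard one for contraction maps (sketched here under the same assumptions as Proposition 2): once successive iterates over $S_j$ differ by less than $\epsilon(1-\gamma)/\gamma$, the remaining distance to the fixed point $V_i^*$ is below $\epsilon$, since

$$\|V_i^{t+1} - V_i^*\|_\infty^{S_j} \;\le\; \sum_{\tau=1}^{\infty} \|V_i^{t+\tau+1} - V_i^{t+\tau}\|_\infty^{S_j} \;\le\; \sum_{\tau=1}^{\infty} \gamma^{\tau}\,\|V_i^{t+1} - V_i^{t}\|_\infty^{S_j} \;=\; \frac{\gamma}{1-\gamma}\,\|V_i^{t+1} - V_i^{t}\|_\infty^{S_j} \;<\; \frac{\gamma}{1-\gamma}\cdot\frac{\epsilon(1-\gamma)}{\gamma} \;=\; \epsilon.$$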
As part of Proposition 2's proof, there is an implicit constraint underlying its application; essentially, for a partition to properly converge, there must be a bound on the fixed values of other partitions. This is trivially true for the inner while loop on Lines 8–11, but not necessarily true for the outer while loop. We formally state the assumption below.

**Assumption 1 (Proper Partitions).** Let $j \in L$ be the partition which satisfies $\|V_1^{\text{fixed}} - V_2^{\text{fixed}}\|_\infty^{S \setminus S_j} \le \|V_1 - V_2\|_\infty^{S_j}$ from Proposition 2. Also, let $i \in K$ be such that $1, \ldots, i-1$ have converged. Assume that this inequality holds until $j$ has converged to within $\epsilon$ and then remains fixed.

Intuitively, with respect to LVI, the assumption takes the partition which is the source of the largest error, and assumes that it converges first and then remains fixed thereafter. For example, this can arise in scenarios with goal states, since the partition without the goal state implicitly depends on the value functions of the partition with the goal state. We prove that LVI converges under this assumption.

**Proposition 4.** LVI (Algorithm 1) converges to a unique fixed point $V$ under Assumption 1 with $\gamma \in [0, 1)$.

*Proof.* We must show that, for each partition, all value functions converge following the respective orderings, in sequence, to a unique fixed point. We do this by constructing a metric space of the currently converged values and the ones that are converging (i.e., satisfying Assumption 1). Then, we apply Proposition 2, which proves that $G$ is a contraction map on this space. Finally, since this is true for all partitions and their value functions, and Assumption 1 guarantees that converged values remain stable, the process converges over the entire metric space spanning all partitions and their values.

Let $Y$ be the space of all value functions over partitions $j \in L' \subseteq L$, value functions $K_j' \subseteq K$, and their states, which have converged or satisfy Assumption 1. Note that the base case includes only the first $o_j(1) \in K_j'$ from each $j \in L'$. Let the distance metric $d$ be the max norm, i.e., $\|V\|_\infty = \max_{j \in L'} \max_{i \in K_j'} \max_{s \in S_j} |V_i(s)|$. Thus, the metric space $M = \langle Y, d \rangle$ is a Banach space. Let $G$ be an operator in $M$ following Lines 4–13 in Algorithm 1, i.e., $V' = GV$ using the algorithm's variables $V$ and $V'$. We must show that $G$ is a contraction map. Let $V_1, V_2 \in Y$ be any two value function vectors. Rewrite in terms of $B_i$ and then apply Proposition 2:

$$\|GV_1 - GV_2\|_\infty = \max_{j \in L'} \max_{i \in K_j'} \max_{s \in S_j} |(GV_1)_i(s) - (GV_2)_i(s)| = \max_{j \in L'} \max_{i \in K_j'} \max_{s \in S_j} |B_i V_{1i}(s) - B_i V_{2i}(s)| \le \gamma \max_{j \in L'} \max_{i \in K_j'} \max_{s \in S_j} |V_{1i}(s) - V_{2i}(s)| \le \gamma \|V_1 - V_2\|_\infty$$

Therefore, $G$ is a contraction map. By Banach's fixed point theorem, since $M$ is a complete metric space and $G$ is a contraction map on $Y$, $G$ admits a unique fixed point $V \in Y$. The convergence of $G$ in $M$ produces a new, expanded metric space, which converges to a new unique fixed point in the larger space. This process repeats to completion until we reach the entire space over $\mathbb{R}^{k \times n}$, yielding a final unique fixed point $V$. ∎

### Relation to Other Models

Interestingly, there is a connection between LMDPs and game theory. This link bridges the two often distinct fields of decision theory and game theory, while simultaneously providing another description of the optimality of an LMDP policy. Furthermore, it opens the possibility for future work on game-theoretic solvers which produce solutions that map back from the normal form game to an LMDP policy. Definition 2 describes this mapping from an LMDP to a normal form game. Intuitively, each partition can be thought of as a player in a game. The actions a partition can take are the sub-policies for the states in the partition. We select any $\ell$ value functions, one from each partition, and use those as the payoffs for a normal form game. We show in Proposition 5 that the resulting LMDP's optimal policy computed from LVI is, in fact, a Nash equilibrium.
**Definition 2.** Let LMDP $\langle S, A, T, \mathbf{R}, \boldsymbol{\delta}, \mathcal{S}, \mathbf{o} \rangle$ have value functions $V^\pi$ for the corresponding optimal policy $\pi$, and optimal value functions $V^\eta$, computed via LVI (Algorithm 1). Let $\mathbf{s} = \langle s_1, \ldots, s_\ell \rangle$ be a state tuple such that $\forall z \in L$, $s_z \in S_z$. Let $\mathbf{i} = \langle i_1, \ldots, i_\ell \rangle$ be any tuple of indices $i_z \in K$. The LMDP's normal form game $\langle \mathcal{L}, \mathcal{A}, U \rangle$ is:

- $\mathcal{L} = \{1, \ldots, \ell\}$ is a set of agents (partitions);
- $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_\ell$ is a finite set of joint actions, such that $\forall z \in L$, with $x = |S_z|$ and $S_z = \{s_{z1}, \ldots, s_{zx}\}$, $\mathcal{A}_z = A_{i_z}(s_{z1}) \times \cdots \times A_{i_z}(s_{zx})$;
- $U = \langle u_1, \ldots, u_\ell \rangle$ is a set of utility functions, such that $\forall z \in L$, $\forall a \in \mathcal{A}$, with $s_z \in \mathbf{s}$, $u_z(a_z, a_{-z}) = \min\{V_{i_z}^{\pi, a_z}(s_z), V_{i_z}^\eta(s_z) - \delta_{i_z}\}$.

Note that we only consider pure strategy profiles. Thus, a player's strategy set will be used synonymously with its action set. Similarly, since we consider one-stage games, "strategy profile" will be synonymous with "action profile".

**Proposition 5.** For an LMDP, the optimal policy $\pi$ is a weak pure strategy Nash equilibrium of the LMDP's equivalent normal form game.

*Proof.* Let LMDP $\langle S, A, T, \mathbf{R}, \boldsymbol{\delta}, \mathcal{S}, \mathbf{o} \rangle$ have optimal policy $\pi$. Let $f(\pi) = \langle \omega_1, \ldots, \omega_\ell \rangle$ so that $\forall z \in L$, with $x = |S_z|$ and $S_z = \{s_{z1}, \ldots, s_{zx}\}$, $\omega_z = \langle \pi(s_{z1}), \ldots, \pi(s_{zx}) \rangle$. After applying the transformation in Definition 2, we must show that the strategy (action) profile $f(\pi) = a = (a_z, a_{-z}) \in \mathcal{A}$ is a weak pure strategy Nash equilibrium. By the definition of a (weak) Nash equilibrium, we must show that $\forall z \in L$, $\forall a' \in \mathcal{A}_z$, $u_z(a_z, a_{-z}) \ge u_z(a', a_{-z})$. Let $\pi'$ be the corresponding policy for $f(\pi') = \langle a_1, \ldots, a_{z-1}, a', a_{z+1}, \ldots, a_\ell \rangle \in \mathcal{A}$. Let $V^{\pi'}$ be the value functions after value iteration has converged for each value function in $K$ following policy $\pi'$. Note that $V_x^\pi = V_x^{\pi'}$, $\forall x \in \{1, \ldots, i_z - 1\}$, and therefore, by Equation 2, $A_{i_z}^\pi = A_{i_z}^{\pi'}$. This ensures we may use Proposition 1, by considering a MOMDP with a reduced number of rewards up to $i_z$. By Proposition 1, $V_{i_z}^\pi(s_z) \ge V_{i_z}^\eta(s_z) - \delta_{i_z}$. Apply the fact that $\pi$ is defined following action $a_z$, so $V_{i_z}^{\pi, a_z}(s_z) = V_{i_z}^\pi(s_z)$. Thus, by Definition 2, $u_z(a_z, a_{-z}) = \min\{V_{i_z}^\pi(s_z), V_{i_z}^\eta(s_z) - \delta_{i_z}\} = V_{i_z}^\eta(s_z) - \delta_{i_z}$. There are two cases for $V_{i_z}^{\pi'}(s_z)$; regardless, the inequality $u_z(a_z, a_{-z}) \ge u_z(a', a_{-z})$ is true. Therefore, the strategy (action) profile for $\pi$ is a Nash equilibrium. ∎

We now show that optimal solutions for LMDPs may not exist in the space of solutions captured by MOMDPs with linearly weighted scalarization functions. Hence, LMDPs are a new, distinct multi-objective optimization problem, while still being related to MOMDPs. Figure 1 demonstrates the proof's main idea: construct a MOMDP with conflicting weight requirements to produce the LMDP's optimal policy.

**Proposition 6.** The optimal policy $\pi$ of an LMDP may not exist in the space of solutions captured by its corresponding scalarized MOMDP's policy $\pi_w$.

*Figure 1: The example MOMDP for Proposition 6.*

*Proof.* Consider the MOMDP depicted in Figure 1, with $S = \{s_1, s_2, s_3, s_4\}$, $A = \{\text{stay}, \text{leave}\}$, deterministic transitions $T(s, \text{stay}, s') = 1$ if $s = s'$, $T(s, \text{leave}, s') = 1$ for the successor $s' \ne s$ shown in Figure 1, $T(s, a, s') = 0$ otherwise, and rewards $\mathbf{R} = [R_1, R_2]^T$ as shown. For states $s_1$ and $s_2$, weights $w_1 < 1/3$ and $w_2 > 2/3$ produce the policy $\pi_w(s_1) = \text{leave}$ and $\pi_w(s_2) = \text{stay}$. Similarly, weights $w_1 > 1/3$ and $w_2 < 2/3$ produce the policy $\pi_w(s_1) = \text{stay}$ and $\pi_w(s_2) = \text{leave}$. Only weights set to $w_1 = 1/3$ and $w_2 = 2/3$ allow for ambiguity; a carefully selected tie-breaking rule allows for the policy $\pi_w(s_1) = \pi_w(s_2) = \text{stay}$. The exact same logic holds for states $s_3$ and $s_4$ with the weights reversed, at $w_1 = 2/3$ and $w_2 = 1/3$, to allow a tie-breaking rule to produce $\pi_w(s_3) = \pi_w(s_4) = \text{stay}$. Construct the corresponding LMDP with $\mathcal{S} = \{S_1, S_2\}$, $S_1 = \{s_1, s_3\}$, $S_2 = \{s_2, s_4\}$, $\mathbf{o} = \langle o_1, o_2 \rangle$, $o_1 = \langle 1, 2 \rangle$, $o_2 = \langle 2, 1 \rangle$, and $\boldsymbol{\delta} = \langle 0, 0 \rangle$. The policy returned by LVI is $\pi(s) = \text{stay}$ for all $s \in S$. This policy is unique since it cannot be produced by any selection of weights. ∎
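As an illustration of Definition 2, the sketch below enumerates the pure strategy profiles of a small LMDP and computes the payoffs $u_z(a_z, a_{-z}) = \min\{V^{\pi,a_z}_{i_z}(s_z), V^\eta_{i_z}(s_z) - \delta_{i_z}\}$ by exact policy evaluation of the policy induced by each profile. It is an illustrative construction under the dense-array conventions of the earlier sketches; the function and argument names are ours (not the paper's), and the exhaustive enumeration is only practical for toy instances.

```python
# Sketch of the normal form game of Definition 2 for a toy LMDP. Inputs are
# assumed: obj_idx[z] = i_z, rep_states[z] = s_z, restricted[i][s] = the
# converged action set A_i(s) from LVI, V_eta = the LVI value functions.
import itertools
import numpy as np

def policy_value(T, R_i, policy, gamma):
    """Exact policy evaluation for one reward: solve (I - gamma T_pi) V = R_pi."""
    n = T.shape[0]
    T_pi = T[np.arange(n), policy]                                  # (n, n)
    R_pi = np.einsum('sp,sp->s', T_pi, R_i[np.arange(n), policy])   # expected reward
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def normal_form_game(T, R, delta, partitions, obj_idx, rep_states,
                     restricted, V_eta, gamma):
    """Return a dict mapping each joint pure strategy profile to its payoff tuple."""
    # A player's pure strategy assigns one allowed action to every state it owns.
    strategies = []
    for z, Sz in enumerate(partitions):
        per_state = [restricted[obj_idx[z]][s] for s in Sz]
        strategies.append(list(itertools.product(*per_state)))
    payoffs = {}
    for profile in itertools.product(*strategies):
        # The joint profile induces one deterministic policy over all of S.
        policy = np.zeros(T.shape[0], dtype=int)
        for z, Sz in enumerate(partitions):
            for s, a in zip(Sz, profile[z]):
                policy[s] = a
        u = []
        for z in range(len(partitions)):
            i_z, s_z = obj_idx[z], rep_states[z]
            V_pi = policy_value(T, R[i_z], policy, gamma)
            u.append(min(V_pi[s_z], V_eta[i_z][s_z] - delta[i_z]))  # Definition 2
        payoffs[profile] = tuple(u)
    return payoffs
```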
## 4 Experimentation

Semi-autonomous systems require efficient policy execution involving the transfer of control between human and agent (Cohn, Durfee, and Singh 2011). Our experiments focus on a semi-autonomous driving scenario in which transfer-of-control decisions depend on the driver's level of fatigue. The multi-objective model we use is motivated by an extensive body of engineering and psychological research on monitoring driver fatigue and risk perception (Ji, Zhu, and Lan 2004; Pradhan et al. 2005). We use real-world road data from OpenStreetMap (http://wiki.openstreetmap.org) for sections of the 10 cities in Table 1.

### The Semi-Autonomous Driving Domain

LMDP states are formed by a pair of intersections (previous and current), driver tiredness (true/false), and autonomy (enabled/disabled). The intersection pair captures the direction of travel. Actions are taken at intersections, as found in path planning in graphs (e.g., GPS). Due to real-world limitations, autonomy can only be enabled on main roads and highways; in our case, this means a speed limit greater than or equal to 30 miles per hour. Actions are simply which road to take at each intersection and, if the road allows, whether or not to enable autonomy. The stochasticity within state transitions models the likelihood that the driver will drift from attentive to tired, with probability 0.1.

Rewards are the negation of costs. Thus, there are two cost functions: time and fatigue. The time cost is proportional to the time spent on the road (in seconds), plus a small constant value of 5 seconds which models time spent at a traffic light, turning, or slowed by traffic. The fatigue cost is proportional to the time spent on the road, but is only applied if the driver is tired and autonomy is disabled; otherwise, there is an $\epsilon$-cost. The absorbing goal state awards zero for both. It is natural to minimize the time cost when the driver is attentive and the fatigue cost when the driver is tired. We allow for a 10 second slack in the expected time to reach the goal, in order to favor selecting autonomy-capable roads.

*Figure 2: The policy for attentive (above) and tired (below).*

Figure 2 shows an example optimal policy returned by LVI for roads north of Boston Commons. Each road at an intersection has an action, denoted by arrows. Green arrows denote that autonomy was disabled; purple arrows denote that autonomy was enabled. The blue roads show autonomy capability. We see that the policy (green) correctly favors a more direct route when the driver is attentive, and favors the longer but safer autonomous path when the driver is tired. This is exactly the desired behavior: autonomy is favored for a fatigued driver to improve vehicle safety.

We created simple-to-use tools to generate a variety of scenarios from the semi-autonomous driving domain. This domain addresses many of the existing concerns regarding current MOMDP evaluation domains (Vamplew et al. 2011). In particular, our domain consists of real-world road data, streamlined stochastic state transitions, both small and large state and action spaces, policies which are natural to understand, and the ability for both qualitative and quantitative comparisons among potential algorithms.
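To make the domain's cost structure concrete, the following minimal sketch encodes the two cost functions and the autonomy restriction described above. The constant names and the $\epsilon$ value are illustrative assumptions; only the structure (time cost with a 5 second intersection penalty, fatigue cost when tired with autonomy disabled, the 30 mph autonomy threshold, and the 10 second slack on time) follows the text.

```python
# Sketch of the driving domain's two reward functions (rewards = negated costs).
# EPSILON and the helper names are illustrative assumptions; the structure
# follows the text: time cost = travel time + 5 s per intersection, fatigue
# cost = travel time only when tired with autonomy disabled, zero at the goal.
EPSILON = 1e-3
INTERSECTION_PENALTY = 5.0          # seconds at a light, turning, or in traffic

def autonomy_capable(speed_limit_mph: float) -> bool:
    # Autonomy may only be enabled on main roads and highways (>= 30 mph).
    return speed_limit_mph >= 30.0

def time_reward(travel_seconds: float, at_goal: bool) -> float:
    return 0.0 if at_goal else -(travel_seconds + INTERSECTION_PENALTY)

def fatigue_reward(travel_seconds: float, tired: bool,
                   autonomy_enabled: bool, at_goal: bool) -> float:
    if at_goal:
        return 0.0
    return -travel_seconds if (tired and not autonomy_enabled) else -EPSILON

# Conditional preferences and slack from the text: attentive states order
# (time, fatigue) with a 10 s slack on time; tired states order (fatigue, time).
# The zero fatigue slack is an assumption consistent with "no margin of slack".
ORDERING_ATTENTIVE, ORDERING_TIRED = ("time", "fatigue"), ("fatigue", "time")
SLACK = {"time": 10.0, "fatigue": 0.0}
```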
### GPU-Optimized LVI

We explore the scalability of LVI in a GPU-optimized implementation using CUDA (https://developer.nvidia.com/about-cuda). VI on GPUs has been shown to improve performance and enable scalability (Jóhannsson 2009; Boussard and Miura 2010). Our approach for LVI exploits the structure of Bellman's optimality equation: independent updates of the values of states. LVI also allows us to run each partition separately, further improving its scalability.

| City       | \|S\| | \|A\| | w       | CPU     | GPU   |
|------------|-------|-------|---------|---------|-------|
| Chicago    | 400   | 10    | 1.5     | 3.3     | 3.9   |
| Denver     | 616   | 10    | 6.2     | 12.9    | 6.4   |
| Baltimore  | 676   | 8     | 5.6     | 14.1    | 5.7   |
| Pittsburgh | 864   | 8     | 7.3     | 15.4    | 7.9   |
| Seattle    | 1168  | 10    | 29.2    | 63.5    | 14.2  |
| Austin     | 1880  | 10    | 99.4    | 433.3   | 29.8  |
| San Franc. | 2016  | 10    | 4235.8  | 4685.7  | 159.4 |
| L.A.       | 2188  | 10    | 118.7   | 273.5   | 37.8  |
| Boston     | 2764  | 14    | 10595.5 | 11480.9 | 393.2 |
| N.Y.C.     | 3608  | 10    | 14521.8 | 16218.5 | 525.7 |

Table 1: Computation time in seconds for weighted scalarized VI (w), and for LVI on both the CPU and GPU.

Our implementation first transfers the entire LMDP model from the host (CPU) to the device (GPU). Each partition is executed separately, with multiple Bellman updates running in parallel. After convergence, the final policy and values are transferred back from the device to the host. One of the largest performance costs comes from transferring data between host and device. LVI avoids this issue since transfers occur outside each iteration. Our experiments, shown in Table 1, were executed with an Intel(R) Core(TM) i7-4702HQ CPU at 2.20GHz, 8GB of RAM, and an Nvidia(R) GeForce GTX 870M graphics card, using C++ and CUDA 6.5.

Proposition 6 clearly shows that optimal policies for an LMDP may differ from those of a scalarized MOMDP. One valid concern, however, is that scalarization methods may still return a reasonable policy much faster than LVI. For this reason, we evaluated the computation times of weighted VI as well as LVI run on both a CPU and a GPU. As shown in the table, the runtimes of LVI and scalarized VI differ by a small constant factor. The GPU version of LVI vastly improves the potential scale of LMDPs, reaching speeds roughly 30 times that of the CPU.

## 5 Conclusion

Our work focuses on a novel model for state-dependent lexicographic MOMDPs called LMDPs. This model naturally captures common multi-objective stochastic control problems such as planning for semi-autonomous driving. Competing approaches such as linearly weighted scalarization involve non-trivial tuning of weights and, as we show, may not be able to produce the optimal policy for an LMDP. Our proposed solution, LVI, converges under a partition-dependent assumption and is amenable to GPU-based optimization, offering vast performance gains. Within the domain of semi-autonomous driving, policies returned by LVI successfully optimize the multiple state-dependent objectives, favoring autonomy-capable roads when the driver is fatigued, thus maximizing driver safety. In future work, we plan to enrich the semi-autonomous driving domain and make the scenario creation tools public to facilitate further research and comparison of algorithms for multi-objective optimization. We also plan to further investigate the analytical properties and practical applicability of LVI and its GPU-based implementation.

## 6 Acknowledgements

We thank Laurent Jeanpierre and Luis Pineda for fruitful discussions of multi-objective MDPs. This work was supported in part by the National Science Foundation grant number IIS-1405550.

## References

Altman, E. 1999. Constrained Markov Decision Processes, volume 7. CRC Press.

Bellman, R. E. 1957. Dynamic Programming. Princeton, NJ: Princeton University Press.

Boussard, M., and Miura, J. 2010. Observation planning with on-line algorithms and GPU heuristic computation.
In ICAPS Workshop on Planning and Scheduling Under Uncertainty.

Boutilier, C.; Brafman, R. I.; Domshlak, C.; Hoos, H. H.; and Poole, D. 2004. CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements. Journal of Artificial Intelligence Research 21:135–191.

Calisi, D.; Farinelli, A.; Iocchi, L.; and Nardi, D. 2007. Multi-objective exploration and search for autonomous rescue robots. Journal of Field Robotics 24(8-9):763–777.

Castelletti, A.; Pianosi, F.; and Soncini-Sessa, R. 2008. Water reservoir control under economic, social and environmental constraints. Automatica 44(6):1595–1607.

Cohn, R.; Durfee, E.; and Singh, S. 2011. Comparing action-query strategies in semi-autonomous agents. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems, 1287–1288.

Gábor, Z.; Kalmár, Z.; and Szepesvári, C. 1998. Multi-criteria reinforcement learning. In Proceedings of the 15th International Conference on Machine Learning, volume 98, 197–205.

Gonzales, C.; Perny, P.; and Dubus, J. P. 2011. Decision making with multiple objectives using GAI networks. Artificial Intelligence 175(7-8):1153–1179.

Ji, Q.; Zhu, Z.; and Lan, P. 2004. Real-time nonintrusive monitoring and prediction of driver fatigue. IEEE Transactions on Vehicular Technology 53(4):1052–1068.

Jóhannsson, A. P. 2009. GPU-based Markov decision process solver. Master's thesis, School of Computer Science, Reykjavík University.

Kwak, J.-Y.; Varakantham, P.; Maheswaran, R.; Tambe, M.; Jazizadeh, F.; Kavulya, G.; Klein, L.; Becerik-Gerber, B.; Hayes, T.; and Wood, W. 2012. SAVES: A sustainable multiagent application to conserve building energy considering occupants. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, 21–28.

Mitten, L. G. 1974. Preference order dynamic programming. Management Science 21(1):43–46.

Mouaddib, A.-I. 2004. Multi-objective decision-theoretic path planning. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 3, 2814–2819.

Natarajan, S., and Tadepalli, P. 2005. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, 601–608.

Perny, P., and Weng, P. 2010. On finding compromise solutions in multiobjective Markov decision processes. In Proceedings of the European Conference on Artificial Intelligence, 969–970.

Perny, P.; Weng, P.; Goldsmith, J.; and Hanna, J. 2013. Approximation of Lorenz-optimal solutions in multiobjective Markov decision processes. In Proceedings of the 27th AAAI Conference on Artificial Intelligence, 92–94.

Pradhan, A. K.; Hammel, K. R.; DeRamus, R.; Pollatsek, A.; Noyce, D. A.; and Fisher, D. L. 2005. Using eye movements to evaluate effects of driver age on risk perception in a driving simulator. Human Factors: The Journal of the Human Factors and Ergonomics Society 47(4):840–852.

Roijers, D. M.; Vamplew, P.; Whiteson, S.; and Dazeley, R. 2013. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48:67–113.

Russell, S., and Norvig, P. 2010. Artificial Intelligence: A Modern Approach. Upper Saddle River, New Jersey: Prentice Hall.

Sobel, M. J. 1975. Ordinal dynamic programming. Management Science 21(9):967–975.

Vamplew, P.; Dazeley, R.; Berry, A.; Issabekov, R.; and Dekker, E. 2011. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning 84(1-2):51–80.

Zilberstein, S. 2015.
Building strong semi-autonomous systems. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.