# Preference Planning for Markov Decision Processes

Meilun Li and Zhikun She (School of Mathematics and Systems Science, Beihang University, China)
Andrea Turrini and Lijun Zhang (State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China)

The classical planning problem can be enriched with quantitative and qualitative user-defined preferences on how the system behaves while achieving the goal. In this paper, we propose the probabilistic preference planning problem for Markov decision processes, where the preferences are based on an enriched probabilistic LTL-style logic. We develop P4Solver, an SMT-based planner that computes the preferred plan by reducing the problem to a quadratic programming problem, which can be solved using SMT solvers such as Z3. We illustrate the framework by applying our approach to two selected case studies.

## Introduction

The classical planning problem focuses on finding a sequence of actions leading from the initial state to a user-defined set of goal states. For many problems in reality, however, users may prefer some actions over others, or may wish to visit a particular group of states before reaching the goal. This has led to preference-based planning, which integrates planning with user preferences (Baier and McIlraith 2009). Planning with preferences has been studied for decades and has become even more attractive in recent years. Many researchers have proposed preference languages to express users' preferences effectively and succinctly (Coste-Marquis et al. 2004; Delgrande, Schaub, and Tompits 2004; Boutilier et al. 2004), algorithms for designing preference-based planners corresponding to the different preference languages, and algorithms to increase planning efficiency (Bacchus and Kabanza 2000; Edelkamp 2006; Sohrabi, Baier, and McIlraith 2009; Tu, Son, and Pontelli 2007; Baier, Bacchus, and McIlraith 2009).

The above-mentioned work is essentially based on nondeterministic transition systems. For most systems in reality, however, the effect of an action is often unpredictable. Even though we cannot know precisely what will follow an action, some outcomes are more (or less) frequent than others. This motivates incorporating probabilities into planning problems to capture this kind of uncertainty. Markov Decision Processes (MDPs) (Bellman 1957) provide a mathematical framework for modelling decision making in situations where outcomes are partly stochastic and partly under the control of a decision maker. They are a unifying model for many classes of planning problems studied in AI (Boutilier, Dean, and Hanks 1999).

In this paper, we propose the probabilistic preference-based planning problem on MDPs, which we refer to as P4. Derived from probabilistic linear-time temporal logic (probabilistic LTL) (Hansson and Jonsson 1994; Baier and Katoen 2008), we propose a preference language allowing us to specify the goal and the preferences in terms of probabilistic formulas. Using the language, we can for instance express the property "the goal states are eventually reached with probability at least 0.95" (in probabilistic LTL style, P≥0.95(F(goal))) and, among all ways of doing so, we prefer those such that with probability at least 0.99 the system never enters a forbidden state (P≥0.99(G(¬forbidden))).
Further, we propose P4Solver, a planner for P4, in which we reduce P4 to a quadratic programming problem. We use the nonlinear real arithmetic of Z3 (de Moura and Bjørner 2008) to solve the problem, after having encoded it in the SMT-LIB format (Barrett, Stump, and Tinelli 2010). The experimental results confirm in general the effectiveness of the approach.

## A Motivating Example

As an example of the problems our technique addresses, consider a robot moving in one direction on a ring rail surrounded by N loading and unloading areas (or positions), depicted in Figure 1. The task of the robot is to sort B boxes between these areas, with B < N.

Figure 1: The robot MDP

The robot alternates between two modes: control (mode(c)) and action (mode(a)). In the control mode, the robot decides what to do next: either to move on the rail (m) or to use its arm to load/unload boxes (a). As for movement, it can either go to the next area (n) or go quickly 5 areas ahead (l); the area actually reached depends on the distance: starting from area i, action n reaches the area (i + 1) mod N with probability 1, while action l reaches the area (i + 5) mod N with probability 0.7, the areas (i + 4) mod N and (i + 6) mod N with probability 0.1 each, and the areas (i + 3) mod N and (i + 7) mod N with probability 0.05 each. When the robot faces an area i containing a box j and it is in action mode, it can pick the box up (p(j, i)) so that it can later drop it (d(j, t)) in the target area t, if that area is empty. Both operations succeed with probability 0.95. Only one box can be carried at a time. Possible user preferences are that the robot moves quickly at least once, that it drops box 1 somewhere, that a box is never picked up and dropped in the same area, and so on.

## Preliminaries

In this section we introduce Markov Decision Processes (MDPs) (Bellman 1957) as our model. We fix Σ as the finite set of atomic propositions. Each state in the MDP is labelled with exactly the atomic propositions that are satisfied in this state[^1]. A state evolves into other states as the result of performing actions, which probabilistically determine the state to transfer to. The formal definition of an MDP is as follows.

Definition 1. A Markov Decision Process is a tuple M = (S, L, Act, s̄, P) where S is a finite set of states, L: S → 2^Σ is a labelling function, Act is a finite set of actions, s̄ ∈ S is the initial state, and P: S × Act → Dist(S) is the transition probability function, where by Dist(S) we denote the set of probability distributions over S.

For example, robotAt(2) is the atomic proposition holding whenever the robot is in area 2. To simplify the exposition, we use the special action ⊥ to model the stopping of the computation; for each state s ∈ S, P(s, ⊥) = δ_s, where δ_s is the Dirac distribution assigning probability 1 to s. We denote by Act(s) the set of possible actions from state s. Note that for each s, Act(s) ≠ ∅ since ⊥ ∈ Act(s).

A path π in an MDP M is a finite or infinite sequence of alternating states and actions s0 a1 s1 . . . starting from a state s0, also denoted by first(π), and, if the sequence is finite, ending with a state denoted by last(π), such that for each i > 0, ai ∈ Act(si−1) and P(si−1, ai)(si) > 0. We denote the set of all finite paths for a given MDP M by Paths*_M and the set of all finite and infinite paths by Paths_M. We may drop the subscript M whenever it is clear from the context.
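To fix ideas, here is a minimal Scala sketch of the structures in Definition 1; the case classes, the field names, and the explicit stop action are illustrative choices of ours, not the data structures used by P4Solver.

```scala
// A minimal sketch of Definition 1. All names are illustrative; P4Solver's
// actual data structures may differ.
object MdpSketch {
  type AtomicProp = String
  type Action     = String
  // A probability distribution over values of type A (values should sum to 1).
  type Dist[A]    = Map[A, Double]

  // Each state is identified by the set of atomic propositions labelling it, i.e. L(s).
  final case class State(label: Set[AtomicProp])

  final case class MDP(
      states:  Set[State],
      actions: Set[Action],
      init:    State,                              // the initial state s̄
      trans:   Map[(State, Action), Dist[State]]   // P(s, a), given only for enabled pairs
  ) {
    val Stop: Action = "stop"                      // the special terminating action (⊥ in the text)

    // Act(s): the actions enabled in s; Stop is always enabled.
    def enabled(s: State): Set[Action] =
      actions.filter(a => trans.contains((s, a))) + Stop

    // P(s, a): Stop loops on s with probability 1 (the Dirac distribution δ_s).
    def step(s: State, a: Action): Dist[State] =
      if (a == Stop) Map(s -> 1.0) else trans((s, a))
  }
}
```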
[^1]: In (Bienvenu, Fritz, and McIlraith 2011), fluents and non-fluent relational formulas are used for characterizing state properties. The former are of the form f(s, x0, x1, . . .), whose value depends on the assignment of the variables xi in the state s, while the latter do not depend on the state s.

For a given path π = s0 a1 s1 . . . , we denote by |π| the length of π, i.e., the number of actions occurring in π. If π is infinite, then |π| = ∞. For 1 ≤ i ≤ |π|, we denote by action(π, i) the action ai; for 0 ≤ i ≤ |π|, we denote by state(π, i) the state si and by suffix(π, i) the suffix of π starting from state(π, i).

The nondeterminism among the transitions enabled in a state is resolved by a policy χ based on the previous history. Formally, a policy for an MDP M is a function χ: Paths* → Dist(Act) such that for each π ∈ Paths*, { a ∈ Act | χ(π)(a) > 0 } ⊆ Act(last(π)) and, if action(π, |π|) = ⊥, then χ(π) = δ_⊥. The last condition ensures that no action except ⊥ can be chosen after the stop action has occurred.

Given an MDP M, a policy χ and a state s induce a probability distribution over paths as follows. The basic measurable events are the cylinder sets of finite paths, where the cylinder set of π, denoted by Cπ, is the set { π′ ∈ Paths | π ≤ π′ }, where π ≤ π′ means that the sequence π is a prefix of the sequence π′. The probability µχ,s of a cylinder set Cπ is defined recursively as follows:

- µχ,s(Cπ) = 0, if π = t for a state t ≠ s;
- µχ,s(Cπ) = 1, if π = s;
- µχ,s(Cπ) = µχ,s(Cπ′) · χ(π′)(a) · P(last(π′), a)(t), if π = π′ a t.

Standard measure-theoretical arguments ensure that µχ,s extends uniquely to the σ-field generated by the cylinder sets. Given a finite path π, we define µχ,s(π) as µχ,s(π) = µχ,s(Cπ) · χ(π)(⊥), where χ(π)(⊥) is the probability of terminating the computation after π has occurred. The definition can be extended to a set of finite paths: given a set of finite paths B, we define µχ,s(B) as µχ,s(B) = Σ_{π ∈ B} µχ,s(π).

As an example of a path and its probability, consider the path π = s0 m s1 l s2 a s3 p(2, 4) s4 relative to the robot, where each action is chosen by the policy with probability 1 and the states have the following meaning:

| State | Robot at | Mode | Carrying box | Box 2 at |
|-------|----------|------|--------------|----------|
| s0 | 0 | c | – | 4 |
| s1 | 0 | m | – | 4 |
| s2 | 4 | c | – | 4 |
| s3 | 4 | a | – | 4 |
| s4 | 4 | c | 2 | – |

The probability of π is 1 · 1 · 1 · 1 · 0.1 · 1 · 1 · 1 · 0.95 = 0.095.
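Before moving on to property formulas, the following self-contained Scala sketch shows how the path probability µχ,s(π) = µχ,s(Cπ) · χ(π)(⊥) defined above can be computed for a finite path under a history-dependent policy; the types and names are illustrative, not taken from the P4Solver sources.

```scala
// Sketch: probability of a finite path under a policy, following the recursive
// definition of µ_{χ,s}(C_π) and µ_{χ,s}(π) = µ_{χ,s}(C_π) · χ(π)(⊥).
object PathProbabilitySketch {
  type Action  = String
  type Dist[A] = Map[A, Double]

  // A finite path s0 a1 s1 ... an sn: the first state plus a list of (action, state) steps.
  final case class FinPath[S](first: S, steps: List[(Action, S)]) {
    def last: S = steps.lastOption.map(_._2).getOrElse(first)
  }

  def pathProbability[S](
      path:   FinPath[S],
      policy: FinPath[S] => Dist[Action],   // χ, history dependent
      trans:  (S, Action) => Dist[S],       // P
      stop:   Action                        // the terminating action ⊥
  ): Double = {
    // µ_{χ,s}(C_π): multiply χ(prefix)(a) · P(last(prefix), a)(t) along the path.
    val (cylinder, _) =
      path.steps.foldLeft((1.0, FinPath(path.first, List.empty[(Action, S)]))) {
        case ((prob, prefix), (a, t)) =>
          val stepProb = policy(prefix).getOrElse(a, 0.0) * trans(prefix.last, a).getOrElse(t, 0.0)
          (prob * stepProb, FinPath(prefix.first, prefix.steps :+ (a -> t)))
      }
    // µ_{χ,s}(π): the path is then terminated with probability χ(π)(⊥).
    cylinder * policy(path).getOrElse(stop, 0.0)
  }
}
```

On the example path above, the accumulated factors are exactly 1 · 1, 1 · 0.1, 1 · 1 and 1 · 0.95, followed by a stop probability of 1, giving 0.095.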
## Probabilistic Preference Planning Problem

In this section we present the Probabilistic Preference Planning Problem (P4). This is the extension of the classical preference-based planning problem to the probabilistic setting. In P4, preference formulas and goal formulas are described with probabilities, and both are based on the concept of property formula.

Definition 2. The syntax of a property formula ϕ is defined as follows:

ϕ ::= p | ¬ϕ | ϕ ∧ ϕ | final(ϕ) | occ(a) | X(ϕ) | U(ϕ, ϕ) | ∀x(ϕ)

where p ∈ Σ is an atomic proposition and a ∈ Act. x is a variable that can appear only inside the terms p and the action a of occ(a).

Throughout the paper, we use the standard derived operators ϕ ∨ ψ = ¬(¬ϕ ∧ ¬ψ), True = p ∨ ¬p, False = ¬True, F(ϕ) = U(True, ϕ), G(ϕ) = ¬F(¬ϕ), ϕ → ψ = ¬ϕ ∨ ψ, and ∃x(ϕ) = ¬∀x(¬ϕ).

Remark 1. We allow property formulas to express meta-descriptions of atomic propositions by means of variables and their domains. Taking the sorting robot as an example, if we have the atomic propositions carryBox(1), carryBox(2), ..., carryBox(B) ∈ Σ to represent the fact that the robot is carrying a specific box, we can express the property "the robot is carrying nothing" as ∀x(¬carryBox(x)), where the domain of x is dom(x) = {1, . . . , B}.

Property formulas are defined on finite paths. Given an MDP M, a path π and a formula ϕ, we use π |=_M ϕ (or π |= ϕ if M is clear from the context) to denote that the path π satisfies the property ϕ. The formal semantics of property formulas is defined inductively as follows:

- π |= p if p ∈ L(first(π))
- π |= ¬ϕ if π |= ϕ does not hold
- π |= ϕ ∧ ψ if π |= ϕ and π |= ψ
- π |= final(ϕ) if last(π) |= ϕ
- π |= X(ϕ) if suffix(π, 1) |= ϕ
- π |= U(ϕ, ψ) if π |= ψ or π |= ϕ ∧ X(U(ϕ, ψ))
- π |= occ(a) if action(π, 1) = a
- π |= ∀x(ϕ) if π |= ⋀_{v ∈ dom(x)} ϕ[x ↦ v]

where dom(x) is the domain of x and ϕ[x ↦ v] represents the formula in which all free occurrences of x in ϕ have been replaced by v ∈ dom(x). The formula occ(a) holds on π if the first action of π is a. The temporal operators X(ϕ) and U(ϕ, ψ) are defined in the standard way (Vardi 1995). The formula ∀x(ϕ) holds if ϕ holds on π for all assignments of x over dom(x), where x is a variable appearing in ϕ.

Example 1. Requiring that the robot never drops a box in the same position where it picked it up can be expressed as ∀b(∀z(G(occ(p(b, z)) → ¬F(occ(d(b, z)))))). The LTL formula ∀p(G(robotAt(p) ∧ occ(l) → X(¬robotAt(p)))) means that the robot never goes back to its starting area while doing a quick movement.

Along with the definition of property formulas, we denote by P_J(ϕ) the corresponding probabilistic property, where J ⊆ [0, 1] is a closed interval. Given an MDP M, a state s ∈ S, and a policy χ, s |=^χ_M P_J(ϕ) (s |= P_J(ϕ) for short if χ and M are clear from the context) if µχ,s({ π ∈ Paths*_M | π |= ϕ }) ∈ J.

For defining preference formulas, we assume a total order ≫ on the considered set of probabilistic properties. For two probabilistic properties ψ1 and ψ2, we say that ψ1 is more preferred than ψ2 if and only if ψ1 ≫ ψ2. We now define preference formulas using P_J(ϕ) and ≫.

Definition 3. A preference formula Φ_P is of the form ψ1 ≫ ψ2 ≫ · · · ≫ ψk, where each ψi (0 < i ≤ k) is a probabilistic property, ψk = P_{[1,1]}(True), and for each m, n ≤ k, m < n implies ψm ≫ ψn.

Definition 3 orders the probabilistic properties according to their order of appearance: the earlier a formula appears, the more preferred it is. The formula P_{[1,1]}(True) expresses no preference, and it is the least preferred one. With this notion we introduce the preference planning problem for the probabilistic setting:

Definition 4. A Probabilistic Preference Planning Problem (P4) is a tuple (M, Φ_G, Φ_P) where M is an MDP, Φ_G is a goal formula, i.e., a probabilistic property of the form Φ_G = P_{J_g}(final(γ)) where γ is a property formula, and Φ_P = ψ1 ≫ ψ2 ≫ · · · ≫ ψk is a preference formula.

A solution to a P4 problem is a policy χ such that s̄ |=^χ_M P_{J_g}(final(γ)). An optimal solution to a P4 is a solution χ for which there exists i ∈ {1, . . . , k} such that s̄ |=^χ_M ψi and no solution χ′ satisfies s̄ |=^{χ′}_M ψj for any 0 < j < i. Since we assume that ψk = P_{[1,1]}(True), an optimal solution for (M, Φ_G, Φ_P) exists whenever there is a policy χ such that s̄ |=^χ_M P_{J_g}(final(γ)). We remark that different optimal solutions may exist.

We can extend P4 to General Preference Formulas (Bienvenu, Fritz, and McIlraith 2011) by decorating preference formulas with values taken from a totally ordered set (V, ≤_V) and by defining ≫ according to ≤_V.
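To make the syntax and the finite-path semantics above concrete, the following self-contained Scala sketch models property formulas as an algebraic data type and checks satisfaction recursively. The ADT, the path representation, and the treatment of X on paths of length 0 are illustrative choices of ours, not P4Solver's internal representation; ∀x(ϕ) is assumed to be expanded over dom(x) before evaluation, as in the semantics.

```scala
// Sketch: property formulas (Definition 2) and their satisfaction on finite paths.
object PropertySemanticsSketch {
  type AtomicProp = String
  type Action     = String

  sealed trait Formula
  final case class Atom(p: AtomicProp)           extends Formula
  final case class Not(f: Formula)               extends Formula
  final case class And(l: Formula, r: Formula)   extends Formula
  final case class Final(f: Formula)             extends Formula
  final case class Occ(a: Action)                extends Formula
  final case class Next(f: Formula)              extends Formula
  final case class Until(l: Formula, r: Formula) extends Formula
  // ∀x(ϕ) is assumed to be expanded into the conjunction over dom(x) before evaluation.

  // A finite path: the labels L(s_i) of its states and the actions between them.
  final case class FinPath(states: Vector[Set[AtomicProp]], actions: Vector[Action]) {
    def length: Int = actions.length
    def suffix(i: Int): FinPath = FinPath(states.drop(i), actions.drop(i))
  }

  def sat(pi: FinPath, f: Formula): Boolean = f match {
    case Atom(p)     => pi.states.head.contains(p)                            // p ∈ L(first(π))
    case Not(g)      => !sat(pi, g)
    case And(l, r)   => sat(pi, l) && sat(pi, r)
    case Final(g)    => sat(FinPath(Vector(pi.states.last), Vector.empty), g) // last(π) |= ϕ
    case Occ(a)      => pi.actions.headOption.contains(a)                     // action(π, 1) = a
    case Next(g)     => pi.length >= 1 && sat(pi.suffix(1), g)                // needs a suffix
    case Until(l, r) => sat(pi, r) || (sat(pi, l) && pi.length >= 1 && sat(pi.suffix(1), Until(l, r)))
  }
}
```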
## Planning Algorithm for P4

In this section we introduce P4Solver, a planner solving P4. To compute the optimal solution to a P4 problem, we start with the most preferred formula ψ1. If we find a solution, then it is an optimal solution to this P4. If we fail to find a solution, we continue with the next preferred formula; if no solution has been found after enumerating all the property formulas, we conclude that the P4 has no solution. As said before, this happens only when the goal formula cannot be satisfied. The scheme is shown in Algorithm 1.

Algorithm 1: P4Solver
- Input: an MDP model M, a preference formula Φ_P, a goal formula Φ_G
- Output: an optimal policy χ together with the index of the satisfied preference, or unsatisfiable

1. Parse Φ_P into the form Φ_P = ψ1 ≫ ψ2 ≫ · · · ≫ ψk
2. χ := ∅
3. for i = 1 to k do
4. χ := PolFinder(ψi, Φ_G, M)
5. if χ ≠ ∅ then
6. return (χ, i)
7. return unsatisfiable

The procedure PolFinder is the core of the algorithm. It translates P4 into a quadratic programming problem based on the concept of progression of property formulas.

Definition 5. Given a property formula ϕ, a state s, and an action a, the progression ρ^s_a(ϕ) of ϕ from s through a is defined as follows:

- ρ^s_a(p) = True if s |= p, and False otherwise
- ρ^s_a(¬ϕ) = ¬ρ^s_a(ϕ)
- ρ^s_a(ϕ1 ∧ ϕ2) = ρ^s_a(ϕ1) ∧ ρ^s_a(ϕ2)
- ρ^s_a(final(ϕ)) = final(ϕ) if a ≠ ⊥; True if a = ⊥ and s |= ρ^s_⊥(ϕ); False otherwise
- ρ^s_a(X(ϕ)) = ϕ if a ≠ ⊥, and False otherwise
- ρ^s_a(U(ϕ1, ϕ2)) = ρ^s_a(ϕ2) ∨ (ρ^s_a(ϕ1) ∧ U(ϕ1, ϕ2)) if a ≠ ⊥, and ρ^s_⊥(ϕ2) otherwise
- ρ^s_a(occ(a′)) = True if a = a′, and False otherwise
- ρ^s_a(∀x(ϕ)) = ⋀_{v ∈ dom(x)} ρ^s_a(ϕ[x ↦ v])

Note that we can simplify the resulting progressed formula with the basic rules of Boolean logic: if ρ^s_a(ϕ2) reduces to True, then we can simplify ρ^s_a(U(ϕ1, ϕ2)) = ρ^s_a(ϕ2) ∨ (ρ^s_a(ϕ1) ∧ U(ϕ1, ϕ2)) to True as well.

Intuitively, the progression of a formula ϕ from s through a is a new property formula that says which parts of ϕ still have to be satisfied after reaching s and performing a. When we are in a state s along a path π, the sequence of progressed formulas encodes what has already been satisfied while reaching s and what the remainder of π has to satisfy from s on, so that π |= ϕ. Once we reach the end of the path, the last progressed formula can be definitely decided, and this decides exactly whether π satisfies the original formula ϕ.

Theorem 1. Let π be a path. Then π |= ϕ if and only if

- π |= ρ^{first(π)}_⊥(ϕ), if |π| = 0;
- suffix(π, 1) |= ρ^{first(π)}_{action(π,1)}(ϕ), otherwise.

The proof is a minor adaptation of the one in (Bienvenu, Fritz, and McIlraith 2011). Note that the value of ρ^s_a(ϕ) may remain unknown until we reach the last state of the path.

Example 2. Take the rail robot example and assume that we want to check whether the robot moves quickly at least once on the path π = s0 p(1, 0) s1 n s2 l s3, a requirement represented by the property formula F(occ(l)). At first we cannot decide whether π |= F(occ(l)), since F(occ(l)) is neither True nor False. So we use the information of s0 and of the action p(1, 0) and get ρ^{s0}_{p(1,0)}(F(occ(l))) = ρ^{s0}_{p(1,0)}(occ(l)) ∨ F(occ(l)) = False ∨ F(occ(l)) = F(occ(l)), since p(1, 0) ≠ l. Now we need to check whether suffix(π, 1) |= ρ^{s0}_{p(1,0)}(F(occ(l))), i.e., whether s1 n s2 l s3 |= F(occ(l)). In a similar way we iteratively try to match each action of π against l, and finally get π |= F(occ(l)) since the second-to-last action is l (note that the last action, ⊥, is not written explicitly on a finite path).
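The following Scala sketch implements the progression function ρ^s_a(ϕ) of Definition 5, together with the basic Boolean simplification mentioned above. It uses an illustrative formula ADT extended with explicit True, False, and ∨ so that progressed formulas can be simplified; the handling of final(ϕ) under the stop action simply progresses ϕ once more under ⊥, assuming this reduces it to True or False. This is a sketch of the technique, not the P4Solver code.

```scala
// Sketch of formula progression ρ^s_a(ϕ) from Definition 5.
object ProgressionSketch {
  type AtomicProp = String
  type Action     = String

  sealed trait Formula
  case object True                               extends Formula
  case object False                              extends Formula
  final case class Atom(p: AtomicProp)           extends Formula
  final case class Not(f: Formula)               extends Formula
  final case class And(l: Formula, r: Formula)   extends Formula
  final case class Or(l: Formula, r: Formula)    extends Formula
  final case class Final(f: Formula)             extends Formula
  final case class Occ(a: Action)                extends Formula
  final case class Next(f: Formula)              extends Formula
  final case class Until(l: Formula, r: Formula) extends Formula

  val Stop: Action = "stop"                      // the terminating action ⊥

  // Top-level Boolean simplification, applied after every progression step.
  def simp(f: Formula): Formula = f match {
    case And(True, g)                  => g
    case And(g, True)                  => g
    case And(False, _) | And(_, False) => False
    case Or(True, _) | Or(_, True)     => True
    case Or(False, g)                  => g
    case Or(g, False)                  => g
    case Not(True)                     => False
    case Not(False)                    => True
    case other                         => other
  }

  // ρ^s_a(ϕ): `label` is L(s), `a` the performed action.
  def progress(label: Set[AtomicProp], a: Action, f: Formula): Formula = simp(f match {
    case True | False => f
    case Atom(p)      => if (label.contains(p)) True else False
    case Not(g)       => Not(progress(label, a, g))
    case And(l, r)    => And(progress(label, a, l), progress(label, a, r))
    case Or(l, r)     => Or(progress(label, a, l), progress(label, a, r))
    case Occ(b)       => if (a == b) True else False
    case Next(g)      => if (a != Stop) g else False
    // final(ϕ): kept as is until the path stops; under ⊥ we progress ϕ once more,
    // assuming this reduces it to True or False.
    case Final(g)     => if (a != Stop) Final(g) else progress(label, Stop, g)
    case Until(l, r)  =>
      if (a != Stop) Or(progress(label, a, r), And(progress(label, a, l), Until(l, r)))
      else progress(label, Stop, r)
  })
}
```

Repeatedly applying progress along a path and checking whether the final formula is True mirrors Theorem 1 and the computation carried out in Example 2.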
We denote by p_{s,a} the probability of choosing action a in state s, and by µ^s_ϕ the probability that state s satisfies a property formula ϕ. Formally, µ^s_True = 1, µ^s_False = 0, and

µ^s_ϕ = Σ_{a ∈ Act(s)} p_{s,a} · Σ_{s′ ∈ S} P(s, a)(s′) · µ^{s′}_{ρ^s_a(ϕ)}.   (1)

Note that when a = ⊥, the term p_{s,a} · Σ_{s′ ∈ S} P(s, a)(s′) · µ^{s′}_{ρ^s_a(ϕ)} reduces to p_{s,⊥} · µ^s_{ρ^s_⊥(ϕ)} since P(s, ⊥) = δ_s. When solving a P4, we construct the equations iteratively, one instance of the form above for each µ^s_ϕ, until the progression reaches True or False, at which point we know for sure that the probability of the formula is 1 or 0, respectively. If the set of equations corresponding to the P4 has a solution, then the values p_{s,a} induce the solution of the P4.

So P4 can be converted into a set of non-linear equations (actually, quadratic equations) whose solution, if any, is the solution of the P4. This conversion is managed by PolFinder. Starting from the initial state of M, PolFinder explores the state space of M and, for each newly reached pair of state and progressed formula, generates the corresponding instance of Equation 1. In the bounded version of PolFinder, we also take the number of past actions into account in the definition of Equation 1. Obviously, the correctness of P4Solver relies on the termination of PolFinder; since we consider only finite MDPs, PolFinder is guaranteed to terminate. Regarding complexity, for an MDP M and a property formula ϕ, the time complexity of PolFinder is linear in the size of M and doubly exponential in the size of ϕ.

Theorem 2. A P4 (M, Φ_G, Φ_P) has an optimal policy if and only if P4Solver returns a policy χ together with the index i of the satisfied preference formula.

## Implementation

We implemented the P4Solver algorithm in Scala and delegated to Z3 (de Moura and Bjørner 2008) the evaluation of the generated quadratic programming problem. We represent each state of the MDP by the atomic propositions it is labelled with; for instance, the state of the MDP depicted in Figure 1 is represented by the atomic propositions robotAt(1), carryBox(1), boxAt(2, 6), boxAt(3, 21), and mode(c), given that the robot is in control mode.

An action a is symbolically encoded as a possibly empty list of variables, a precondition, and the effects. The variables bind the precondition and the effects to the state s enabling the action. The precondition is simply a property formula that has to be satisfied by the state s (or, more precisely, by the path π = s) in order for a to be enabled in s. The effect of a is a probability distribution mapping each target state occurring with non-zero probability to the corresponding probability value. The target states are encoded by a pair (R, A) of sets of atomic propositions: those not holding anymore (R) and those now holding (A). As a concrete example, the action d(b, p) models the dropping of box b at position p: the precondition is mode(a) ∧ robotAt(p) ∧ carryBox(b) ∧ ∀b′(¬boxAt(b′, p)); the effects are ({carryBox(b), mode(a)}, {mode(c), boxAt(b, p)}) with probability 0.95 and ({mode(a)}, {mode(c)}) with probability 0.05.

The preference and goal formulas are encoded according to the grammars in Definitions 2 and 3, and the probability intervals by their corresponding bounds.
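As an illustration of the symbolic action encoding described above, the sketch below represents an action by a precondition and a list of probabilistic (R, A) effects, instantiated on the drop action d(b, p). The Scala types, the string-based atomic propositions, and the helper names are hypothetical; in particular, the precondition is given directly as a predicate on the state rather than as a property formula.

```scala
// Sketch of the symbolic action encoding: precondition + probabilistic (R, A) effects.
object SymbolicActionSketch {
  type AtomicProp = String
  type State      = Set[AtomicProp]

  // R = propositions that stop holding, A = propositions that start holding.
  final case class Effect(remove: Set[AtomicProp], add: Set[AtomicProp], prob: Double) {
    def apply(s: State): State = (s -- remove) ++ add
  }

  final case class SymbolicAction(
      name:         String,
      precondition: State => Boolean,   // stands in for the property formula on π = s
      effects:      List[Effect]        // probabilities should sum to 1
  ) {
    def enabledIn(s: State): Boolean = precondition(s)
    // The induced distribution P(s, a): merge effects leading to the same successor.
    def successors(s: State): Map[State, Double] =
      effects.groupMapReduce(e => e(s))(_.prob)(_ + _)
  }

  // d(b, p): drop box b at position p, with the probabilities from the example above.
  def drop(b: Int, p: Int, boxes: Range): SymbolicAction = SymbolicAction(
    name = s"d($b,$p)",
    precondition = s =>
      s.contains("mode(a)") && s.contains(s"robotAt($p)") && s.contains(s"carryBox($b)") &&
        boxes.forall(bp => !s.contains(s"boxAt($bp,$p)")),   // ∀b′(¬boxAt(b′, p))
    effects = List(
      Effect(Set(s"carryBox($b)", "mode(a)"), Set("mode(c)", s"boxAt($b,$p)"), 0.95),
      Effect(Set("mode(a)"), Set("mode(c)"), 0.05)
    )
  )
}
```

Applying an effect removes the propositions in R and adds those in A; successors that coincide are merged by summing their probabilities.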
Each state, action, and formula is uniquely identified by a number that is used to generate the variables for the SMT solver: for instance, the scheduler's choice of action 4 in state 37 is represented by the variable sched_s37_a4, while sched_s37_stop stands for the probability of stopping in state 37. We encode the SMT problem in the SMT-LIB format (Barrett, Stump, and Tinelli 2010), so we can replace Z3 with any other solver supporting this format. Moreover, it is easy to change the output format to support solvers like Redlog and Mathematica.

We have run our prototype on a single core of a laptop running openSUSE 13.1 on an Intel Core i5-4200M CPU with 8 GB of RAM.
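As a rough illustration of how the quadratic constraints can be emitted in SMT-LIB form with the variable-naming scheme just described, consider the sketch below. The mu_ variable names, the helper functions, and the exact shape of the assertion are our own assumptions; only the sched_s<i>_a<j> and sched_s<i>_stop names come from the text.

```scala
// Sketch: emitting one instance of Equation 1 as an SMT-LIB assertion string.
object SmtEmissionSketch {
  // Probability of choosing action `a` (by id) in state `s` (by id); None encodes the stop action.
  def schedVar(s: Int, a: Option[Int]): String = a match {
    case Some(id) => s"sched_s${s}_a$id"
    case None     => s"sched_s${s}_stop"
  }

  // Variable standing for µ^s_ϕ: the probability that state `s` satisfies formula `phi` (by id).
  def muVar(s: Int, phi: Int): String = s"mu_s${s}_f$phi"

  private def sum(terms: Iterable[String]): String =
    if (terms.size == 1) terms.head else terms.mkString("(+ ", " ", ")")

  // One instance of Equation 1:
  //   mu_s_phi = Σ_a sched_s_a · Σ_{s'} P(s, a)(s') · mu_{s'}_{ρ^s_a(phi)}.
  // `succ(a)` lists the successors of `s` under `a` as (s', probability, id of progressed formula).
  def equation1(s: Int, phi: Int, succ: Map[Option[Int], List[(Int, Double, Int)]]): String = {
    val terms = succ.map { case (a, successors) =>
      val inner = sum(successors.map { case (sp, prob, progressed) =>
        s"(* $prob ${muVar(sp, progressed)})"
      })
      s"(* ${schedVar(s, a)} $inner)"
    }
    s"(assert (= ${muVar(s, phi)} ${sum(terms)}))"
  }
}
```

In addition, one would constrain the sched_ variables of each state to lie in [0, 1] and to sum to 1 so that they form a distribution; the products of sched_ and mu_ variables are what makes the generated problem quadratic.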
The results for the rail robot and an adaptation of the dinner domain (Bienvenu, Fritz, and McIlraith 2011) are shown in Tables 1 and 2, respectively.

For the robot example, we consider N = 5, 6, 7 positions, B = 2 boxes, and four preference formulas requiring the robot to eventually pick up or drop the boxes, which are placed in different areas and have to be sorted according to the goal positions (each box i in position i). As preferences, we have F(∃B(∃P(occ(p(B, P))))) as preferences 1 and 3 (with different initial states), F(∃B(∃P(occ(d(B, P))))) as preference 2, and F(∃P(occ(d(1, P)))) as preference 4. The initial state relative to preference 4 has boxAt(1, 1), so the preference requires dropping the box even though this action is not needed to reach the goal (which asks for boxAt(1, 1) as well). The goal has to be satisfied with probability in [0.75, 1] and the preference with probability in [1, 1]. The remaining columns in the table show the number of nodes (pairs of state and formula) of the MDP we generated, the time tg spent by Scala to generate the SMT problem, the number of variables in the problem, the time te used by Z3 to solve the problem, and whether the problem is satisfiable.

Table 1: Performance of P4Solver: Rail Robot

| N | B | Property | Nodes | tg | Vars | te | Res |
|---|---|----------|-------|------|------|-------|-----|
| 5 | 2 | 1 | 279 | 0.06 | 333 | 0.06 | sat |
| 5 | 2 | 2 | 329 | 0.05 | 348 | 0.08 | sat |
| 5 | 2 | 3 | 1332 | 0.10 | 1764 | 0.73 | sat |
| 5 | 2 | 4 | 1690 | 0.09 | 1870 | 0.83 | sat |
| 6 | 2 | 1 | 376 | 0.06 | 453 | 0.11 | sat |
| 6 | 2 | 2 | 436 | 0.05 | 471 | 0.11 | sat |
| 6 | 2 | 3 | 2173 | 0.15 | 2884 | 1.90 | sat |
| 6 | 2 | 4 | 2707 | 0.13 | 3042 | 1.89 | sat |
| 7 | 2 | 1 | 487 | 0.07 | 591 | 0.15 | sat |
| 7 | 2 | 2 | 557 | 0.06 | 612 | 0.15 | sat |
| 7 | 2 | 3 | 3304 | 0.28 | 4368 | 10.21 | sat |
| 7 | 2 | 4 | 4048 | 0.20 | 4606 | 5.08 | sat |

For the dinner domain example, we consider two types of a simplified version, where we have removed some of the foods available for dinner: np for the non-probabilistic version and pr for the probabilistic one. In the former, all actions reach a single state; in the latter, an action may have no effect with non-zero probability. In the column Bound we show the bound on the length of the paths used to construct the model; "–" indicates the unbounded version. In some cases, the resulting model is so large that it takes too much time to be generated (30 seconds) or solved (30 minutes); we mark these cases with time-out.

Table 2: Performance of P4Solver: Dinner Domain

| Type | Bound | Property | Nodes | tg | Vars | te | Res |
|------|-------|----------|-------|------|------|----------|-------|
| np | 2 | 1 | 28 | 0.01 | 41 | 0.01 | sat |
| np | 4 | 2 | 533 | 0.08 | 268 | 1.58 | sat |
| np | 4 | 3 | 533 | 0.09 | 270 | 1.53 | sat |
| np | 4 | 4 | 533 | 0.06 | 265 | 1.55 | sat |
| np | 4 | 5 | 533 | 0.08 | 290 | 0.06 | sat |
| np | – | 1 | 265 | 0.04 | 458 | 0.06 | sat |
| np | – | 2 | 529 | 0.08 | 522 | 0.09 | sat |
| np | – | 3 | – | time-out | – | – | – |
| np | – | 4 | 534 | 0.06 | 523 | 0.08 | sat |
| np | – | 5 | 265 | 0.05 | 458 | 0.07 | sat |
| pr | 2 | 1 | 87 | 0.01 | 41 | 0.01 | unsat |
| pr | 4 | 2 | 6058 | 0.19 | 271 | time-out | – |
| pr | 4 | 3 | 6058 | 0.21 | 284 | time-out | – |
| pr | 4 | 4 | 6058 | 0.15 | 272 | time-out | – |
| pr | 4 | 5 | 6058 | 0.08 | 290 | 0.09 | sat |
| pr | – | 1 | 497 | 0.05 | 458 | 0.06 | sat |
| pr | – | 2 | 993 | 0.09 | 522 | 0.08 | sat |
| pr | – | 3 | – | time-out | – | – | – |
| pr | – | 4 | 1002 | 0.08 | 523 | 0.08 | sat |
| pr | – | 5 | 497 | 0.04 | 458 | 0.06 | sat |

We are at home with the kitchen clean as the initial state, and our goal state is to be at home, sated, and with the kitchen clean. As preferences, let ε be the formula ∃L(F(occ(eat(pizza, L)))) and ι be occ(drive(home, italianRest)). ε is preference 1, ε ∧ F(ι) is preference 2, F(ε ∧ ι) is preference 3, ι ∧ ε is preference 4, and ∀L(occ(drive(home, L)) → X(at(italianRest))) is preference 5. The probabilities for the goal and the preferences are again [0.75, 1] and [1, 1], respectively.

It is interesting to observe the unsat result for property 1 of the probabilistic bounded version: it is caused by the fact that the bound prevents satisfying the preference with enough probability, while this is always possible in the unbounded version, since it suffices to retry the action until the desired state is reached.

## Related Work and Discussion

Related Work. The Planning Domain Definition Language (PDDL) was first proposed in (McDermott et al. 2000) and has evolved with the International Planning Competition (IPC) into PDDL3 (Gerevini and Long 2005) and PPDDL (Younes et al. 2005; Younes and Littman 2004). PDDL3 is an extension of PDDL that includes temporally extended preferences described by a subset of LTL formulas, but it does not consider probabilistic aspects, so it is not suitable for modelling MDPs. PDDL3 is a quantitative preference language that evaluates plans by optimizing an associated numeric objective function, while our preference language expresses the preference among plans by defining an order on the probabilistic properties the plan has to satisfy. This order can easily be extended to quantitative preferences by associating each probabilistic property with a unique element taken from a totally ordered set, as in (Bienvenu, Fritz, and McIlraith 2011). Another difference between PDDL3 and our P4 language is that PDDL3 does not allow us to express preferences on the occurrence of actions, only a subset of temporal operators is available, and the nesting of LTL formulas is limited, while P4 supports all of these. On the other hand, PDDL3 has some features that are missing in P4: preference formulas in PDDL3 allow us to compare plans by the degree to which they violate the preferences, by means of the value of the associated numeric objective function. This quantitative information can be convenient for defining some users' preferences.

PPDDL extends PDDL with features for modelling probabilistic effects and rewards, so it can be used to model MDPs. Being based on PDDL2.1 (Fox and Long 2003), the above considerations about PDDL3 and P4 apply to PPDDL as well, since the differences between PDDL2.1 and PDDL3 concern only derived predicates and constraints.

PP and LPP are noteworthy preference languages related to our work. PP was proposed in (Son and Pontelli 2004) with an answer set programming implementation. LPP was proposed in (Bienvenu, Fritz, and McIlraith 2011) together with PPLAN, a bounded best-first search planner. The property formulas of this paper and the Trajectory Property Formulas (TPF) of LPP inherit all the features of the Basic Desire Formulas (BDF) of PP: they consider preferences over states and actions with static properties, and over trajectories with temporal properties. Both also extend PP to support first-order preferences. LPP inherits the General Preferences of PP and extends the structure of preferences with Aggregated Preference Formulas (APF) to support more complex comparison and aggregation of different preferences.
Several planners have been proposed for solving the preference planning problem in non-probabilistic systems. HPLAN-P (Baier, Bacchus, and McIlraith 2009) is a bounded planner that computes optimal plans performing at most a given number of actions. It can deal with temporally extended preferences (TEPs) by compiling them into parametrized finite automata, exploiting TLPlan's ability to deal with numerical functions and quantification (Bacchus and Kabanza 2000). HPLAN-P is equipped with heuristics to prune the search space during the computation. SGPLAN5 (Hsu et al. 2007) is another search-based planner with the ability to plan with TEPs. It uses a constraint partitioning approach to split the planning problem into sub-problems, taking advantage of locally optimal values. This makes it possible to reduce the makespan of the plan in several cases. MIPS-XXL (Edelkamp, Jabbar, and Nazih 2006) and MIPS-BDD (Edelkamp 2006) are two well-performing planners that compile the given TEPs into Büchi automata, which are then synchronized with the model under consideration. MIPS-XXL targets very large systems and combines internal weighted best-first search with external cost-optimal breadth-first search to control the pace of planning. MIPS-BDD performs symbolic breadth-first search on the state space encoded as a binary decision diagram (BDD). We refer to (Baier and McIlraith 2009) and (Bienvenu, Fritz, and McIlraith 2011) for a more comprehensive comparison.

For MDPs, there are some related works on planning. In (Boutilier, Dean, and Hanks 1999), the authors introduce many MDP-related methods and show how several AI planning problems can be unified within this framework. The work presents a quantitative preference based on a reward structure on MDPs and shows how to compute a plan with the tools of decision-theoretic planning, such as value iteration and policy iteration. To specify uncontrollable behaviours of systems, it introduces exogenous events in the MDP. As mentioned above, the preference formulas defined in our work extend probabilistic LTL, so it is in principle possible to take advantage of the algorithms of decision-theoretic planning. Another similar work along this line is (Younes 2003), which extends PDDL for modelling stochastic decision processes by using PCTL and CSL to specify Probabilistic Temporally Extended Goals (PTEGs). It encodes many stochastic effects in PDDL and discusses the expressiveness of those effects in different kinds of MDP models. The essential difference between the above works and our framework lies in the preference language. Our preference language is qualitative and can specify properties along sequences of behaviours, whereas the quantitative preferences of (Boutilier, Dean, and Hanks 1999) cannot capture those properties. And even though the PTEGs of (Younes 2003) can deal with temporal properties over real time, they lack the flexibility to express various features and cannot handle preferences over actions.

Discussion. There are some other possible approaches to P4 based on model checking methods. One is to use current planners for deterministic models: planners like MIPS-XXL, MIPS-BDD and HPLAN-P first translate the LTL formula into an automaton and then find the plan in the product. Whereas this works nicely for models without probabilities, it cannot be adapted in a straightforward way to MDPs. The main bottleneck is the fact that a deterministic automaton may be needed.
(In the model checking community, this classical automaton-based approach for MDPs always involves determinisation; see (Baier and Katoen 2008) for details.) Unfortunately, a deterministic automaton for an LTL formula can be doubly exponential in the size of the formula, as shown in (Courcoubetis and Yannakakis 1995). In addition, this approach faces some other challenges: (i) the preference formula has to be satisfied before the goal formula, (ii) the desired probability intervals of both the preference and the goal formulas have to be matched under the same plan, and (iii) the plan is not required to maximize or minimize the probability of the formulas. Besides, our method of building the equations is an on-the-fly construction of the product of the LTL formula and the MDP, in which we avoid an explicit representation of the whole product.

Another approach would try to solve P4 with current probabilistic planners capable of planning with soft preferences. These planners support the Relational Dynamic Influence Diagram Language (RDDL) (Sanner 2010), used for the ICAPS'11 and ICAPS'14 International Probabilistic Planning Competitions. RDDL is a uniform language to model factored MDPs and the corresponding reward structures, which can be regarded as quantitative preferences. It is influenced by PPDDL and PDDL3 and is designed to be a simple and uniform language whose expressive power comes from the composition of simple constructs. In particular, it aims at representing problems that are difficult to model in PPDDL and PDDL. In its current specification, RDDL does not support temporal state/action goals or preferences. A direct encoding of our preference formulas in the reward structure is not obvious, and we leave it as future work.

## Conclusion

In this paper we have proposed a framework for the probabilistic preference-based planning problem on MDPs. Our language can express rich properties, and we have designed a planning algorithm based on a reduction to a quadratic programming problem. We have presented two case studies. As future work, we plan to enhance the algorithm to improve its efficiency. We would also like to investigate the possibility of combining model checking methods with P4. Finding a k-optimal path-based policy by approximation is also a promising way to reduce the complexity.

## Acknowledgments

Zhikun She is supported by the National Natural Science Foundation of China under grants 11422111 and 11371047. Lijun Zhang (corresponding author) is supported by the National Natural Science Foundation of China under grants 61472473, 61428208, and 61361136002, and by the CAS/SAFEA International Partnership Program for Creative Research Teams.

## References

- Bacchus, F., and Kabanza, F. 2000. Using temporal logics to express search control knowledge for planning. Artificial Intelligence 116:123–191.
- Baier, C., and Katoen, J.-P. 2008. Principles of Model Checking. The MIT Press.
- Baier, J. A., and McIlraith, S. A. 2009. Planning with preferences. Artificial Intelligence 173:593–618.
- Baier, J. A.; Bacchus, F.; and McIlraith, S. A. 2009. A heuristic search approach to planning with temporally extended preferences. Artificial Intelligence 173:593–618.
- Barrett, C.; Stump, A.; and Tinelli, C. 2010. The SMT-LIB standard: Version 2.0. In SMT.
- Bellman, R. 1957. A Markovian decision process. Indiana University Mathematics Journal 6:679–684.
- Bienvenu, M.; Fritz, C.; and McIlraith, S. A. 2011. Specifying and computing preferred plans. Artificial Intelligence 175:1308–1345.
- Boutilier, C.; Brafman, R. I.; Domshlak, C.; Hoos, H. H.; and Poole, D.
2004. CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements. Journal of Artificial Intelligence Research 21:135–191.
- Boutilier, C.; Dean, T.; and Hanks, S. 1999. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11:1–94.
- Coste-Marquis, S.; Lang, J.; Liberatore, P.; and Marquis, P. 2004. Expressive power and succinctness of propositional languages for preference representation. In Proceedings of the 9th International Conference on Knowledge Representation and Reasoning, 203–212.
- Courcoubetis, C., and Yannakakis, M. 1995. The complexity of probabilistic verification. Journal of the ACM 42(4):857–907.
- de Moura, L. M., and Bjørner, N. 2008. Z3: An efficient SMT solver. In TACAS, volume 4963 of LNCS, 337–340.
- Delgrande, J. P.; Schaub, T.; and Tompits, H. 2004. Domain-specific preferences for causal reasoning and planning. In Proceedings of the 9th International Conference on Knowledge Representation and Reasoning, 673–682.
- Edelkamp, S.; Jabbar, S.; and Nazih, M. 2006. Large-scale optimal PDDL3 planning with MIPS-XXL. In 5th International Planning Competition Booklet (IPC-2006).
- Edelkamp, S. 2006. Optimal symbolic PDDL3 planning with MIPS-BDD. In 5th International Planning Competition Booklet (IPC-2006).
- Fox, M., and Long, D. 2003. PDDL 2.1: An extension to PDDL for expressing temporal planning problems. Journal of Artificial Intelligence Research 20:61–124.
- Gerevini, A., and Long, D. 2005. Plan constraints and preferences in PDDL3: The language of the fifth international planning competition. Technical report, University of Brescia.
- Hansson, H., and Jonsson, B. 1994. A logic for reasoning about time and reliability. Formal Aspects of Computing 6(5):512–535.
- Hsu, C.; Wah, B. W.; Huang, R.; and Chen, Y. 2007. Constraint partitioning for solving planning problems with trajectory constraints and goal preferences. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 1924–1929.
- McDermott, D.; Ghallab, M.; Howe, A.; Knoblock, C.; Ram, A.; Veloso, M.; Weld, D.; and Wilkins, D. 2000. PDDL: the planning domain definition language. Technical report, Yale Center for Computational Vision and Control.
- Sanner, S. 2010. Relational dynamic influence diagram language (RDDL): Language description. Unpublished manuscript, Australian National University.
- Sohrabi, S.; Baier, J. A.; and McIlraith, S. A. 2009. HTN planning with preferences. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, 1790–1797.
- Son, T. C., and Pontelli, E. 2004. Planning with preferences using logic programming. In Proceedings of the 7th International Conference on Logic Programming and Nonmonotonic Reasoning, 247–260.
- Tu, P.; Son, T.; and Pontelli, E. 2007. CPP: A constraint logic programming based planner with preferences. In Proceedings of the 9th International Conference on Logic Programming and Nonmonotonic Reasoning, 290–296.
- Vardi, M. Y. 1995. An automata-theoretic approach to linear temporal logic. In Logics for Concurrency: Structure versus Automata (8th Banff Higher Order Workshop), volume 1043 of LNCS, 238–266.
- Younes, H. L. S., and Littman, M. L. 2004. PPDDL 1.0: An extension to PDDL for expressing planning domains with probabilistic effects. Technical Report CMU-CS-04-167, Carnegie Mellon University.
- Younes, H. L. S.; Littman, M. L.; Weissman, D.; and Asmuth, J. 2005. The first probabilistic track of the international planning competition.
Journal of Artificial Intelligence Research 24:851–887.
- Younes, H. L. S. 2003. Extending PDDL to model stochastic decision processes. In Proceedings of the ICAPS-03 Workshop on PDDL, Trento, Italy, 95–103.