# ltlfldlf_nonmarkovian_rewards__858dc1bb.pdf LTLf/LDLf Non-Markovian Rewards Ronen I. Brafman Ben-Gurion University, Beer-Sheva, Israel brafman@cs.bgu.ac.il Giuseppe De Giacomo, Fabio Patrizi Sapienza Universit a di Roma, Italy {degiacomo,patrizi}@dis.uniroma1.it In Markov Decision Processes (MDPs), the reward obtained in a state is Markovian, i.e., depends on the last state and action. This dependency makes it difficult to reward more interesting long-term behaviors, such as always closing a door after it has been opened, or providing coffee only following a request. Extending MDPs to handle non-Markovian reward functions was the subject of two previous lines of work. Both use LTL variants to specify the reward function and then compile the new model back into a Markovian model. Building on recent progress in temporal logics over finite traces, we adopt LDLf for specifying non-Markovian rewards and provide an elegant automata construction for building a Markovian model, which extends that of previous work and offers strong minimality and compositionality guarantees. Introduction Markov Decision Processes (MDPs) are a central model for sequential decision making under uncertainty. They are used to model and solve many real-world problems, and to address the problem of learning to behave well in unknown environments. The Markov assumption is a key element of this model. It states that the effects of an action depend only on the state it was executed in, and that a reward given at a state depends only on the previous action and state. It has long been observed (Bacchus, Boutilier, and Grove 1996; Thi ebaux et al. 2006) that many performance criteria call for more sophisticated reward functions that do not depend on the last state only. For example, we may want to reward a robot that eventually delivers coffee each time it gets a request; or, to ensure it will access restricted areas only after having acquired the right permission. Such rewards are non-Markovian. Interestingly, Littman has advocated at IJCAI 2015 that it may actually be more convenient, from a design perspective, to assign rewards to the satisfaction of (non-Markovian) declarative temporal properties, rather than to states (Littman et al. 2017). To extend MDPs with non-Markovian rewards we need a language for specifying such rewards. Markovian rewards are specified as a function R from the previous state and action to the reals. R can be specified using an explicit reward matrix, or implicitly, by associating a reward with properties Copyright 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. of the last state and action. With non-Markovian rewards, an explicit representation is no longer possible, as the number of possible histories or futures of a state is infinite. Hence, we must use an implicit specification that can express properties of past (or future) states. To date, two specification languages have been proposed. (Bacchus, Boutilier, and Grove 1996) suggest using a temporal logic of the past. Whether a state satisfies such a past temporal formula depends on the entire sequence of states leading to the state. Thus, we can reward appropriate response to a bring-coffee command by associating a reward with the property: the bring-coffee command was issued in the past, and now I have coffee. A second proposal, by (Thi ebaux et al. 2006), uses a temporal logic of the future with a special symbol to denote awarding a reward. At each step, one checks whether this symbol must be true in the current state, for the reward formula to be satisfied in the initial state. If that is the case, the current state is rewarded. This language is not often used in other areas, and its semantics is more involved. Existing MDP solution methods, possibly with the exception of Monte-Carlo tree search algorithms (Kocsis and Szepesv ari 2006), rely heavily on the Markov assumption, and cannot be applied directly with non-Markovian rewards. To address this, both proposals above transform the non Markovian model to an equivalent Markovian one that can be solved using existing algorithms, by enriching the state with information about the past. For example, if we extend our state to record whether a bring-coffee command was issued earlier, a reward for bringing coffee in states indicating that bring-coffee was issued in the past, is now Markovian. It rewards the same behaviors as the non-Markovian reward on the original state. We call a model obtained by extending the state space of the original non-Markovian MDP, an extended MDP. Using extended MDPs is a well-known idea. Since state space size affects the practical and theoretical complexity of most MDP solution algorithms, the main question is how to minimally enrich the state so as to make rewards Markovian. (Bacchus, Boutilier, and Grove 1996) provide algorithms for constructing an extended MDP that attempt to minimize size by reducing the amount of information about the past that is maintained. While their construction does not generate a minimal extended MDP, they allude to using automata minimization techniques to accomplish this. (Thi ebaux et The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) al. 2006), instead, use a construction that works well with forward search based algorithms, such as LAO* (Hansen and Zilberstein 2001) and LRTDP (Bonet and Geffner 2003). Unlike classical dynamic programming methods that require the entire state space a-priori, these algorithms generate reachable states only. With a good heuristic function, they often generate only a fraction of the state space. So, while the augmented search space they obtain is not minimal, because states are constructed on the fly during forward search, their approach does not require a-priori enumeration of the state space, and never generates an unreachable state. The aim of this paper is to bring to bear developments in the theory of temporal logic over finite traces to the problem of specifying and solving MDPs with non-Markovian rewards. With these tools, which were unavailable to earlier work, we are able to provide a cleaner, more elegant approach that builds on well understood semantics, much more expressive languages, and enjoys good algorithmic properties. We adopt LDLf, a temporal logic of the future interpreted over finite traces, which extends LTLf, the classical linear-time temporal logic over finite traces (De Giacomo and Vardi 2013). LDLf has the same computational features of LTLf but is more expressive, as it captures monadic second-order logic (MSO) on finite traces (i.e., inductively defined properties), instead of first-order logic (FO), as LTLf. A number of techniques based on automata manipulation have been developed for LDLf, to address tasks such as satisfiability, model checking, reactive synthesis, and planning under full/partial observability (De Giacomo and Vardi 2013; 2015; De Giacomo and Vardi 2016; Torres and Baier 2015; Camacho et al. 2017c). We exploit such techniques to generate an extended MDP with good properties.1 Our formalism has three important advantages: 1. Enhanced expressive power. We move from linear-time temporal logics used by past authors to LDLf, paying no additional (worst-case) complexity costs. LDLf can encode in polynomial time LTLf, regular expressions (RE), the past LTL (PLTL) of (Bacchus, Boutilier, and Grove 1996), and all examples of (Thi ebaux et al. 2006). Moreover, LDLf can naturally represent procedural constraints (Baier et al. 2008), i.e., sequencing constraints expressed as programs, using if and while . In fact, future logics are more commonly used in the model checking community, as they are considered more natural for expressing desirable properties. This is especially true with complex properties that require the power of LDLf. 2. Minimality and Compositionality. We generate a minimal equivalent extended MDP, exploiting existing techniques for constructing automata that track the satisfiability of an LDLf formula. This construction is relatively simple and compositional: if a new reward formula is added, we only need to optimize its automaton and add it to the current (extended) MDP. If the current MDP was minimal, the resulting (extended) MDP is minimal too. 3. Forward Construction via Progression. The automaton used to identify when a reward should be given can be constructed in a forward manner using progression. This ensures the generation of reachable 1We share this objective with work independently carried out in (Camacho et al. 2017b; 2017a). states only, as in (Thi ebaux et al. 2006). Moreover, we can combine progression in the space of MDPs with precomputed automata minimization to obtain the best of both worlds. Background MDPs. A Markov Decision Process (MDP) M = S,A,Tr,R contains a set S of states, a set A of actions, and a transition function Tr S A Prob(S) that returns for every state s and action a a distribution over the next state. We can further restrict actions to be defined on a subset of S only, and use A(s) to denote the actions applicable in s. The reward function, R S A R, specifies the real-valued reward received by the agent when applying action a in state s. In this paper, states in S are truth assignments to a set P of primitive propositions. Hence, if φ is a propositional formula and s a state, we can check whether s φ. A solution to an MDP, called a policy, assigns an action to each state, possibly conditioned on past states and actions. The value of policy ρ at s, vρ(s), is the expected sum of (possibly discounted) rewards when starting at s and selecting actions based on ρ. Every MDP has an optimal policy, ρ , i.e., one that maximizes the expected sum of rewards for every starting state s S. When the horizon is infinite, there exists an optimal policy ρ S A that is stationary and deterministic (i.e., ρ depends only on the current state) (Puterman 2005). There are diverse methods for computing an optimal policy. With the exception of online simulation-based methods, they rely on the Markov assumption, and their theoretical and practical complexity is strongly impacted by S . LTLf and LDLf. LTLf is essentially the standard Lineartime Temporal Logic LTL (Pnueli 1977) interpreted over finite, instead of infinite, traces (De Giacomo and Vardi 2013). LTLf is as expressive as FO over finite traces and star-free regular expressions (RE), thus strictly less expressive than RE, which in turn are as expressive as MSO over finite traces. RE themselves are not convenient for expressing temporal specifications, since, e.g., they miss direct constructs for negation and conjunction. For this reason, (De Giacomo and Vardi 2013) introduced LDLf (linear dynamic logic on finite traces), which merges LTLf with RE, through the syntax of the wellknown logic of programs PDL, propositional dynamic logic (Fischer and Ladner 1979; Harel, Kozen, and Tiuryn 2000; Vardi 2011), but interpreted over finite traces. We consider a variant of LDLf that works also on empty traces. Formally, LDLf formulas ϕ are built as follows: ϕ = tt ϕ ϕ1 ϕ2 ϱ ϕ ϱ = φ ϕ? ϱ1 + ϱ2 ϱ1;ϱ2 ϱ where tt stands for logical true; φ is a propositional formula over P (including true, not to be confused with tt); ϱ denotes path expressions, which are RE over propositional formulas φ with the addition of the test construct ϕ? typical of PDL. We use abbreviations [ϱ]ϕ ϱ ϕ as in PDL, ff tt for false, and φ φ tt to denote occurence of propositional formula φ. Intuitively, ϱ ϕ states that, from the current step in the trace, there exists an execution satisfying the RE ϱ such that its last step satisfies ϕ, while [ϱ]ϕ states that, from the current step, all executions satisfying the RE ϱ are such that their last step satisfies ϕ. Tests are used to insert into the execution path checks for satisfaction of additional LDLf formulas. The semantics of LDLf is given in terms of finite traces, i.e., finite sequences π = π0,...,πn of elements from the alphabet 2P. We define π(i) πi, length(π) n + 1, and π(i,j) πi,πi+1,...,πj 1. When j > length(π), π(i,j) π(i,length(π)). In decision processes, traces are usually sequences of states and actions, i.e., they have the form: s0,a1,s1,...,sn 1,an . These can still be represented as traces of the form π = π0,...,πn, by extending the set P to include one proposition pa per action a, and setting πi si {pa a = ai+1}. In this way, πi denotes the pair (si,ai+1). We will always assume this form, even if referring to sequences of states and actions. Given a finite trace π, an LDLf formula ϕ, and a position i, we define when ϕ is true at step i, written π,i ϕ, by (mutual) induction, as follows: π,i tt; π,i ϕ iff π,i / ϕ; π,i ϕ1 ϕ2 iff π,i ϕ1 and π,i ϕ2; π,i ϱ ϕ iff there exists i j such that π(i,j) L(ϱ) and π,j ϕ, where the relation π(i,j) L(ϱ) is as follows: π(i,j) L(φ) if j=i+1, i < length(π), and π(i) φ (φ propositional); π(i,j) L(ϕ?) if j = i and π,i ϕ; π(i,j) L(ϱ1+ϱ2) if π(i,j) L(ϱ1) or π(i,j) L(ϱ2); π(i,j) L(ϱ1;ϱ2) if there exists k [i,j] such that π(i,k) L(ϱ1) and π(k,j) L(ϱ2); π(i,j) L(ϱ ) if j = i or there exists k such that π(i,k) L(ϱ) and π(k,j) L(ϱ ). Note that if i length(π), the above definitions still apply; though, φ ϕ (φ prop.) and ϱ ϕ become trivially false. We say that a trace π satisfies an LDLf formula ϕ, written π ϕ, if π,0 ϕ. Also, sometimes we denote by L(ϕ) the set of traces that satisfy ϕ: L(ϕ) = {π π ϕ}. LDLf is as expressive as MSO over finite words. It captures LTLf, by seeing next and until as the abbreviations ϕ true (ϕ end) and ϕ1 U ϕ2 (ϕ1?;true) (ϕ2 end), and any RE r, with the formula r end, where end [true]ffexpresses that the trace has ended. Note that in addition to end we can also denote the last element of the trace as last true end. Section Non-Markovian Rewards illustrates several examples of LDLf formulas in our context. Computing DFA for LDLf formulas As standard, an NFA is a tuple A = Σ,Q,q0,δ,F , where: (i) Σ is the input alphabet; (ii) Q is the finite set of states; (iii) q0 Q is the initial state; (iv) δ Q Σ Q is the transition relation; (v) F Q is the set of final states. A DFA is an NFA where δ is a function δ Q Σ Q. By L(A) we mean the set of all traces over Σ accepted by A. We can associate each LDLf formula ϕ with an (exponentially large) NFA Aϕ = 2P,Q,q0,δ,F that accepts exactly the traces satisfying ϕ. A direct algorithm (LDLf 2NFA) for computing the NFA given the LDLf formula, which is a variant of that in (De Giacomo and Vardi 2015), is reported below. Its correctness relies on the fact that (i) every LDLf formula ϕ can be associated with a polynomial alternating automaton on words (AFW) accepting exactly the traces that satisfy ϕ (De Giacomo and Vardi 2013), and (ii) every AFW can be transformed into an NFA, see, e.g., (De Giacomo and Vardi 2013). The algorithm assumes the LDLf formula is in negation normal form (NNF), i.e., with negation symbols occurring only in front of propositions (any LDLf formula can be rewritten in NNF in linear time). Let be the following auxiliary function, which takes as input an (implicitly quoted) LDLf formula ϕ (in NNF), extended with auxiliary constructs Fψ and Tψ, and a propositional interpretation Θ for P, and returns a positive boolean formula whose atoms are (implicitly quoted) ϕ subformulas (not including Fψ or Tψ): (tt, Θ) = true (ff, Θ) = false (φ, Θ) = ( φ tt, Θ) (φ prop.) (ϕ1 ϕ2, Θ) = (ϕ1, Θ) (ϕ2, Θ) (ϕ1 ϕ2, Θ) = (ϕ1, Θ) (ϕ2, Θ) ( φ ϕ, Θ) = E(ϕ) if Θ φ (φ prop.) false if Θ / φ ( ϱ? ϕ, Θ) = (ϱ, Θ) (ϕ, Θ) ( ϱ1 + ϱ2 ϕ, Θ) = ( ϱ1 ϕ, Θ) ( ϱ2 ϕ, Θ) ( ϱ1; ϱ2 ϕ, Θ) = ( ϱ1 ϱ2 ϕ, Θ) ( ϱ ϕ, Θ) = (ϕ, Θ) ( ϱ F ϱ ϕ, Θ) ([φ]ϕ, Θ) = E(ϕ) if Θ φ (φ prop.) true if Θ / φ ([ϱ?]ϕ, Θ) = (nnf ( ϱ), Θ) (ϕ, Θ) ([ϱ1 + ϱ2]ϕ, Θ) = ([ϱ1]ϕ, Θ) ([ϱ2]ϕ, Θ) ([ϱ1; ϱ2]ϕ, Θ) = ([ϱ1][ϱ2]ϕ, Θ) ([ϱ ]ϕ, Θ) = (ϕ, Θ) ([ϱ]T[ϱ ]ϕ, Θ) (Fψ, Θ) = false (Tψ, Θ) = true where E(ϕ) recursively replaces in ϕ all occurrences of atoms of the form Tψ and Fψ by E(ψ); and (ϕ,ϵ) is defined inductively exactly as above, except for the following base cases: ( φ ϕ,ϵ) = false ([φ]ϕ,ϵ) = true (φ prop.) Note that (ϕ,ϵ) is always either true or false. The NFA Aϕ for an LDLf formula ϕ is then built in a forward fashion as shown in Figure 1, where: states of Aϕ are sets of atoms (each atom is a quoted ϕ subformula) to be interpreted as conjunctions; the empty conjunction stands for true; q is a set of quoted subformulas of ϕ denoting a minimal interpretation such that q (ψ q) (ψ,Θ) (notice that we trivially have ( ,a, ) δ for every a 2P). Theorem 1. (De Giacomo and Vardi 2015) Algorithm LDLf 2NFA is correct, i.e., for every finite trace π: π ϕ iff π L(Aϕ). Moreover, it terminates in at most an exponential number of steps, and generates a set of states S whose size is at most exponential in the size of the formula ϕ. The NFA Aϕ can be transformed into a DFA, in exponential time, following the standard procedure, and then possibly put in (the unique) minimal form, in polynomial time (Rabin 1: algorithm LDLf 2NFA 2: input LDLf formula ϕ 3: output NFA Aϕ = (2P, Q, q0, δ, F) 4: q0 {ϕ} 5: F { } 6: if ( (ϕ, ϵ) = true) then 7: F F {q0} 8: Q {q0, }, δ 9: while (Q or δ change) do 10: for (q Q) do 11: if (q (ψ q) (ψ, Θ) then 12: Q Q {q } 13: δ δ {(q, Θ, q )} 14: if ( (ψ q ) (ψ, ϵ) = true) then 15: F F {q } Figure 1: LDLf 2NFA algorithm and Scott 1959). Thus, we can transform any LDLf formula into a DFA of double exponential size. While this is a worstcase complexity, in most cases the size of the DFA is actually manageable (Tabakov and Vardi 2005). Computing the DFA on-the-fly. All operations above can be performed on-the-fly, without the need for constructing Aϕ. To do so, we progress all possible states that the NFA can be in, after consuming the next trace symbol, and accept the trace iff, once it has been completely read, the set of possible states contains a final state. More formally, call a set of possible states for the NFA a macro state, let Q = {q1,...,qn} be the current macro state (initially Q = Q0 = {q0} = {{ϕ}}), and let Θ be the next trace symbol. Then, the successor macro state is the set Q = {q q Q s.t. q (ψ q) (ψ,Θ)}. Given an input trace π, we decide whether π ϕ by iterating the above procedure, starting from the initial state Q = Q0, and accepting π iff the last state obtained includes {true}. The following observations are in order. Firstly, to compute Q , only function is needed. Neither the set of states Q nor the transition relation δ of the NFA are required. In other words, the on-the-fly progression does not require the construction of Aϕ. Secondly, the progression step produces only one successor macro state Q , thus transitions are deterministic. Indeed, it is immediate to see that the on-the-fly progression includes the determinization of Aϕ. As it turns out, the on-the-fly approach performs, in fact, the execution and the determinization at once and in a lazy way, i.e., avoiding the construction of the entire resulting DFA. Non-Markovian Rewards In this section we extend MDPs with LDLf-based reward functions resulting in a non-Markovian-reward decision process (NMRDP). Specifically, a NMRDP is a tuple M = S,A,Tr,R , where S,A and Tr are as in an MDP, and R is redefined as R (S A) R. The reward is now a realvalued function over finite state-action sequences. Given a (possibly infinite) trace π, the value of π is: π i=1 γi 1R( π(1),π(2),...,π(i) ), where 0 < γ 1 is the discount factor and π(i) denotes the pair (si 1,ai). Since every policy ρ S A induces a distribution over the set of possible infinite traces, we can now define the value of a policy ρ given an initial state s0 as: vρ(s) = Eπ M,ρ,s0v(π). That is, vρ(s) is the expected value of infinite traces, where the distribution over traces is defined by the initial state s0, the transition function Tr, and the policy ρ. Specifying a non-Markovian reward function explicitly is cumbersome and unintuitive, even if we only want to reward a finite number of traces. But, typically, we want to reward behaviors that correspond to various patterns. LDLf provides an intuitive and convenient language for specifying R implicitly, using a set of pairs {(ϕi,ri)m i=1}. Intuitively, if the current (partial) trace is π = s0,a1,...,sn 1,an , the agent receives at sn a reward ri for every formula ϕi satisfied by π. Formally: R(π) = 1 i m π ϕi ri From now on, we assume R is thus specified. Note that we use LDLf to reward partial traces. Sometimes, we may want to reward complete traces only, this is studied later. Examples. To illustrate the power of LDLf as a mechanism to specify non-Markovian rewards, we show some examples. The properties mentioned in the introduction are LDLf-expressible: [true ](requestp true coffeep) (all coffee requests from person p will eventually be served); (( rstr a) ;perma;( rstr a) ;rstr a) ;( rstr a) end (before entering restricted area a the robot must have permission for a). Note that the first formula can be easily rewritten in LTLf, as (requestp coffeep), but not the second one. Also the formulas from (Bacchus, Boutilier, and Grove 1996) and (Thi ebaux et al. 2006) are LDLf-expressible: 1. g ;g end, reward offered only at the first state where g holds; 2. true ;g;true end, reward offered at every state that follows g (included); 3. g ;g;( gk; g ;g) ) end, achievement of g is rewarded periodically, at most once every k steps; 4. true ; g;g+(( g1+...+ gk);g) end, achievement of g is rewarded whenever it occurs within k steps (k 1) of a state where g holds; 5. true ;g1;g2;g3 end, reward issued whenever g1 is achieved and followed immediately by g2 and then by g3; 6. true ;c;true ;g end achievement of g is rewarded whenever it follows c; 7. true ;c; g; g ;g end, only the first achievement of g that follows c is rewarded; 8. true ;c;g end, g is rewarded whenever it follows c immediately; 9. true ;c;g + ((true1 + ... + truek);g) end, achievement of g is rewarded whenever occurring within k steps (k 1) of c; 10. true ;c;g + (( g1 + ... + gk);g) end, only the first achievement of g occurring within k steps (k 1) of c is rewarded; 11. g end, reward issued if g has always been true; 12. c ;g end, the holding of c until g is rewarded. To appreciate the power of LDLf, consider the following example. Suppose one prefers policies for a physician with the following structure (on top of whatever other requirements/rewards exist): the physician should work in clear phases: check patient (a) then update patient record (b) repeatedly. Occasionally, and always in the end, after treating a patient and updating her record, upload updated records to server (c). This can be concisely captured in LDLf by ((a;b) ;c) end. The equivalent LTLf formula would be: ( b W a) (b ( b W a)) ((c true) c) (a b) (b (c a)) If a,b and c are non exclusive formulas (they can be true simultaneously), the equivalent LTLf formula can be much more complex (depending on the actual formulas a, b, c), and in some cases it may not exist at all (as in the limit case where a = b = c = true). LDLf, differently from LTLf, can also easily express procedural constraints (De Giacomo and Vardi 2015; Fritz and Mc Ilraith 2007; Baier et al. 2008), so one can reward the traces satisfying such constraints. To see this, consider a sort of propositional variant of GOLOG (Levesque et al. 1997): ϱ = A φ? ϱ1 + ϱ2 ϱ1;ϱ2 ϱ (A prop.) if φ then ϱ1 else ϱ2 while φ do ϱ, These programs correspond to LDLf path expressions, with if and while abbreviations (Fischer and Ladner 1979): if φ then ϱ1else ϱ2 (φ?;ϱ1) + ( φ?;ϱ2) while φ do ϱ (φ?;ϱ) ; φ? We can reward the traces that follow such (nondeterministic) programs. E.g.: At every point, if it is hot then, if the air-conditioning is off, turn it on, else do not turn it off: [true ] if (hot) then if ( air On) then turn On Air else turn Off Air true As another example: Alternate the following two instructions: while it is hot, if the air-conditioning is off then turn it on, else do not turn it off; do something for one step: (while (hot) do if ( air On) then turn On Air else turn Off Air;true) end Building an Equivalent Markovian Model When the rewards are Markovian, one can compute vρ (for stationary ρ) and an optimal policy ρ using Bellman s dynamic programming equations (Puterman 2005). However, this is not the case when the reward is non Markovian, and thus the optimal policy may not be stationary. The standard solution is to formulate an extended MDP (with Markvian rewards) that is equivalent to the original NMRDP (Bacchus, Boutilier, and Grove 1996; Thi ebaux et al. 2006). Definition 1 ((Bacchus, Boutilier, and Grove 1996)). An NMRDP M = S,A,Tr,R is equivalent to an extended MDP M = S ,A,Tr ,R if there exist two functions τ S S and σ S S such that 1. s S τ(σ(s)) = s; 2. s1,s2 S and s 1 S : if Tr(s1,a,s2) > 0 and τ(s 1) = s1, there exists a unique s 2 S such that τ(s 2) = s2 and Tr(s 1,a,s 2) = Tr(s1,a,s2); 3. For any feasible trajectory s0,a1,...,sn 1,an of M and s 0,a1,...,s n 1,an of M , such that τ(s i) = si and σ(s0) = s 0, we have R( s0,a1,...,sn 1,an ) = R ( s 0,a1,...,s n 1,an ). As in previous work, we restrict our attention to extended MDPs such that S = Q S, for some set Q. Given an NMRDP M = S,A,Tr,R , we now show how to construct an equivalent extended MDP. First, using the methods described earlier, construct for each reward formula ϕi its corresponding (minimal) DFA, Aϕi = 2P,Qi,qi0,δi,Fi (notice that S 2P and δi is total). Then, define the equivalent extended MDP M = S ,A ,Tr ,R where: S = Q1 Qm S is the set of states; A = A; Tr S A S [0,1] is defined as follows: Tr (q1,...,qm,s,a,q 1,...,q m,s ) = { Tr(s,a,s ) if i δi(qi,s) = q i 0 otherwise; R S A R is defined as: R(q1,...,qm,s,a) = i δi(qi,s) Fi ri That is, the state space is a product of the states of the original MDP and the various automata. The action set is the same. Given action a, the S-component of the state progresses according to the original MDP dynamics, and the other components progress according to the transition function of the corresponding automaton. Finally, in every state, and for every 1 i m, the agent receives the reward associated with ϕi if the DFA Aϕi reached a final state. Theorem 2. The NMRDP M = S,A,Tr,R is equivalent to the extended MDP M = S ,A ,Tr ,R . Proof. Recall that every s S has the form (q1,...,qm,s). Define δ(q1,...,qm,s) = s. Define σ(s) = (q10,...,qm0,s). We have δ(σ(s)) = s. Condition 2 of Def. 1 is easily verifiable by inspection. For condition 3, consider a possible trace π = s0,a1,...,sn 1,an . We use σ to obtain s 0 = σ(s0) and given si, we define s i (for 1 i < n) to be the unique state (q1i,...,qmi,si) such that qji = δ(qji 1,ai) for all 1 j m. We now have a corresponding possible trace of M , i.e., π = s 0,a1,s 1 ...,s n 1,an . This is the only feasible trajectory of M that satisfies Condition 3. The reward at π = s0,a1,s1 ...,sn 1,an depends only on whether or not each formula ϕi is satisfied by π. However, by construction of the automaton Aϕi and the transition function Tr , π ϕi iff s n 1 = (q1,...,qm,s n) and qi Fi. Let ρ be a policy for the Markovian M . It is easy to define an equivalent policy on M: Let π = s0,a1,s1 ...,sn 1,an be the current history of the process leading to state sn. Let qin denote the current state of automaton Aϕi given input π. Define ρ(π) = ρ (q1n,...,qmn,sn). Theorem 3 ((Bacchus, Boutilier, and Grove 1996)). Given an NMRDP M, let ρ be an optimal policy for an equivalent MDP M . Then, policy ρ for M that is equivalent to ρ is optimal for M. Minimality and Compositionality One of the main aims of previous work on NMRDP specification methods was to help minimize the size of the resulting MDP. We now explain the attractive minimality properties supported by our construction. Note that the minimization discussed below is w.r.t. the extended MDP. It is quite possible that the original NMRDP can be minimized by removing duplicate states. Its minimization is orthogonal to the issue of minimizing the extended MDP, and hence, we assume that the NMRDP itself is already minimal. Constructing a Minimal MDP. The Markovian model is obtained by taking the synchronous product of the original MDP and a DFA that is itself the synchronous product of smaller DFA s, one for each formula. We can apply the simple, standard automaton minimization algorithm to obtain a minimal automaton, thus obtaining a minimal MDP. But even better, as we show below, it is enough to ensure that each DFA Aϕi in the above construction is minimal to ensure the overall minimality of the extended MDP. Theorem 4. If every automaton Aϕi (1 i m) is minimal then the extended MDP defined above is minimal. Proof. Let As be the synchronous product of Aϕi (1 i m). We show that no two distinct states of the synchronous product As are equivalent, and therefore, all of them are needed, hence the thesis. Suppose two distinct states of the synchronous product As are equivalent. Then, being As a DFA, such two states are bisimilar. Two states of As are bisimilar (denoted by ) iff: (q1,...,qn) (t1,...,tm) implies for all i. qi Fi iff ti Fi; for all a. δs(q1,...,qm) = (q 1,...,q m) implies δs(t1,...,tm,a) = (t 1,...,t m) and (q 1,...,q m) (t 1,...,t m); for all a. δs(t1,...,tm) = (t 1,...,t m) implies δs(q1,...,qm,a) = (q 1,...,q m) and (q 1,...,q m) (t 1,...,t m). Now we show that (q1,...,qm) (t1,...,tm) implies qi = ti, for all i. To check this we show that the relation project on i , Πi((q1,...,qm) (t1,...,tm)) extracting the i-th component on the left and on the right of is a bisimulation for states in Ai. Indeed it is immediate to verify that Πi((q1,...,qm) (t1,...,tm)) implies qi Fi iff ti Fi; for all a, δi(qi,a) = q i implies δi(ti,a) = t i and Πi((q 1,...,q m) (t 1,...,t m)); for all a, δi(ti,a) = t i implies δi(qi,a) = qi and Πi((q 1,...,q m) (t 1,...,t m)). Hence if there are two distinct states (q1,...,qm) (t1,...,tm) then at least for one i it must be the case that qi and ti are distinct and bisimilar and hence equivalent. But this is impossible since each DFA Aϕi is minimal. In general, the synchronous product of minimal DFA s can be minimized further. The theorem shows that no further minimization is possible if the final states of the automata are kept distinct, to assign proper rewards. This is not required if one does not need to distinguish between multiple reward formulas (in which case, final states can be conjoined). This theorem also implies that the construction is compositional and, hence, incremental: If a new formula is added, we need not change the MDP, but simply extend it with one additional component. If the original MDP was minimal and the new component is minimal, so is the resulting MDP. Generating Reachable States Only. Minimizing the state space of the extended MDP is important if one wants to apply classical dynamic programming algorithms such as value iteration (Bellman 1957) and policy iteration (Howard 1960). However, these methods typically do not scale up as the size of S increases. Instead, search-based algorithms, such as LAO (Hansen and Zilberstein 2001), LRTDP (Bonet and Geffner 2003), or MCTS (Kocsis and Szepesv ari 2006) are preferred. Their main advantage is that they explore only a subset of the reachable states. These states are generated by progressing the initial state with a sequence of actions. For this reason, (Thi ebaux et al. 2006) developed an approach in which progression can be applied to generate the states of the extended MDP, as well. That is, one does not need to enumerate the entire set S of the original NMRDP and the entire set S of the extended MDP a priori. We, too, support progression, and in a manner that is simpler to understand and analyze. Recall that the automaton that tracks the satisfaction of a reward formula can be constructed by progression, and that the states of the extended MDP are simply vectors that represent the state of the NMRDP and the state of the automaton for each reward formula. Hence, we can trivially support progression by simply progressing each component of this vector in the standard manner. As observed by (Thi ebaux et al. 2006), when progression is used, the constructed states may not be minimal. That is, one may generate two extended states that have an identical underlying MDP state but a different extended part, yet both states are equivalent in the sense that we get the same reward behavior from both. This issue is easy to understand with our automata-based construction: It is simply a result of the fact that the automaton constructed by progression is not minimal. Our construction provides the user with a clear set of options: 1. Apply simple progression, possibly generating a nonminimal automaton for some of the formulas. This is essentially the approach of (Thi ebaux et al. 2006). (But see the discussion in Section .) 2. Build the automata for the reward formulas and minimize them off-line before starting search. Then, apply progression using the minimal automata. This approach is attractive because the reward formulas are typically much smaller than the MDP and so is the size of their automata. Hence the effort of constructing and minimizing them will not be significant. Yet, by minimizing them a-priori, we can reduce the size of the combined state space (which is a product of the two) significantly. 3. Apply a more complex progression algorithm that generates a minimal automaton on-the-fly in time quadratic in the size of the minimal automata for the reward formulas (Lee and Yannakakis 1992). Getting rewards for complete traces only. We may want to reward an agent for its entire behavior rather than for each prefix of it. This means that the value of a sequence π = s0,a1,s1 ...,sn 1,an is defined as follows: v(π) = i π ϕi ri Behaviors optimal w.r.t. this definition may differ from those optimal w.r.t. the original definition in which rewards are collected following each action. Now, an agent must attempt to make as many formulas true at once, as it does not get any credit for having achieved them in the past. Given an NMRDP M with the above reward semantics, we can easily generate an equivalent MDP using the above construction, preceded by the following steps: 1. Add a special action stop to A. 2. Add a new proposition done to S. 3. No action is applicable in a state in which done is true. 4. The only effect of the stop action is to make done be true. 5. Convert every reward formula ϕi to done ϕi. Interestingly, when focussing on complete traces, our framework becomes an extension of Goal MDP planning that handles temporally extended goals, see, e.g., Chapter 6 and Chapter 4 of (Geffner and Bonet 2013). Comparison with previous proposals Capturing PLTL rewards. Our proposal can be seen as extending (Bacchus, Boutilier, and Grove 1996). There, rewards are assigned to partial traces whenever the last state of the trace satisfies past-LTL (PLTL) formulas. Without introducing explicitly PLTL, but given a partial trace π0,...,πn we reverse it into πn,...,π0 and evaluate it over the LTLf formula ϕ obtained from the PLTL formula by simply replacing the past operators with the corresponding future operators (e.g., replace since with eventually). Then, the setting remains analogous to the one shown above. In particular, we can construct the NFA Aϕ associated with ϕ and, instead of reversing the partial traces, reverse Aϕ, thus getting an NFA A ϕ, by simply reversing the edge directions and switching initial and final states. This can be done in linear time. If we now determinize (and minimize) A ϕ, getting the (minimal) DFA A ϕ, we can proceed exactly as above. Given this essential equivalence of PLTL and LTLf, and the fact that LDLf is strictly more expressive than LTLf, we conclude that our setting is strictly more expressive than the one in (Bacchus, Boutilier, and Grove 1996). In addition, unlike (Bacchus, Boutilier, and Grove 1996), our automata construction algorithm is based on progression, allowing us to use information about the initial state to prevent the generation of unreachable states. Comparing with $FLTL rewards. In (Thi ebaux, Kabanza, and Slaney 2002; Gretton, Price, and Thi ebaux 2003; Thi ebaux et al. 2006) a sophisticated temporal logic, called $FLTL is introduced, which is able to specify explicitly when a partial (finite) trace gets a reward. In fact, many formulas in this logic cannot naturally be interpreted as specifying reasonable rewards. Thus, the results of (Thi ebaux et al. 2006) focus on a class of formulas called reward-normal. This class has been further studied in (Slaney 2005) and in (Gretton 2014). In this work it is first shown that a notable fragment of reward-perfect formulas is equivalent in expressive power to star-free RE, and hence equivalent to LTLf. Then, rewardnormal formulas are shown to be expressible as RE, and hence expressible in LDLf. Moreover, in (Gretton 2014), a variant of reward-normal formulas is introduced with exactly the same expressive power as REf, and hence equivalent to LDLf. Interestingly, the reduction to RE is based on an actual translation into finite state automata. Besides the useful ability to build the extended MDP by progression, (Thi ebaux et al. 2006) discuss a minimality notion called blind minimality where no two states in the extended MDP lead to the same reward behavior under all conceivable futures (where a conceivable future is any future sequence of states and actions, including ones that are not reachable). Observe that if one considers an NMRDP with a single state, this amounts to being able to generate the minimal automata for the reward formulas by progression. We are unaware of any general method for providing such a construction in linear time, which further hints that the set of formulas for which this result holds is rather limited. Given the above, we cannot claim to provide blind minimality. We can only speculate that given the similarity between their progression algorithm and the standard progression algorithm in cases where they are both defined, we should be able to offer similar guarantees. Thus, the formalism developed here is simpler syntactically and semantically, most likely more expressive, can support minimization without requiring the construction of the NMRDP, but supports progression as well, and is compositional. We presented a new language for specifying non-Markovian rewards in MDPs. Our language is based on LTLf/LDLf and is more expressive than previous proposals and being based on a standard temporal logic of the future, is likely to be more intuitive to use. We showed how to construct a minimal equivalent MDP, and since we rely on general methods for tracking temporal formulas, the construction is cleaner. Being based on progression, it can use information about the initial state to prune unreachable states. In future work we intend to examine the use of monitoring notions developed for LTLf and LDLf (Bauer, Leucker, and Schallhart 2010; De Giacomo et al. 2014; Maggi et al. 2011). Using such monitors one could extract early rewards that guide the process to get full rewards later. Another important direction for future work is exploiting non-Markovian rewards in reinforcement learning (RL) to provide better guidance to the learning agent, as well as extending inverse RL methods to learn to assign non-Markovian rewards in a state. Finally, we observe that LDLf can capture all co-safe LTL formulas, i.e., LTL formulas whose automaton is a DFA (instead of a B uchi automaton). Co-safe LTL formulas were used in Robotics to select traces of an MDP that represent behaviors of interest (Lacerda, Parker, and Hawes 2014; 2015). Thus, instead of co-safe LTL, we can use LDLf to gain expressivity in trace specification at essentially no cost. Acknowledgements: The first author is supported in part by ISF grant 933/13 and the Lynn and William Frankel Center for Computer Science. The work was partially supported by the Sapienza project Immersive Cognitive Environments . Bacchus, F.; Boutilier, C.; and Grove, A. J. 1996. Rewarding behaviors. In AAAI. Baier, J. A.; Fritz, C.; Bienvenu, M.; and Mc Ilraith, S. A. 2008. Beyond classical planning: Procedural control knowledge and preferences in state-of-the-art planners. In AAAI. Bauer, A.; Leucker, M.; and Schallhart, C. 2010. Comparing LTL semantics for runtime verification. Logic and Computation. Bellman, R. 1957. Dynamic Programming. Princeton University Press. Bonet, B., and Geffner, H. 2003. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS. Camacho, A.; Chen, O.; Sanner, S.; and Mc Ilraith, S. A. 2017a. Decision-making with non-markovian rewards: From ltl to automata-based reward shaping. In RLDM, 279 283. Camacho, A.; Chen, O.; Sanner, S.; and Mc Ilraith, S. A. 2017b. Non-markovian rewards expressed in LTL: guiding search via reward shaping. In SOC, 159 160. Camacho, A.; Triantafillou, E.; Muise, C.; Baier, J. A.; and Mc Ilraith, S. 2017c. Non-deterministic planning with temporally extended goals: LTL over finite and infinite traces. In AAAI. De Giacomo, G., and Vardi, M. Y. 2013. Linear temporal logic and linear dynamic logic on finite traces. In IJCAI. De Giacomo, G., and Vardi, M. Y. 2015. Synthesis for LTL and LDL on finite traces. In IJCAI. De Giacomo, G., and Vardi, M. Y. 2016. LTLf and LDLf synthesis under partial observability. In IJCAI. De Giacomo, G.; De Masellis, R.; Grasso, M.; Maggi, F. M.; and Montali, M. 2014. Monitoring business metaconstraints based on LTL and LDL for finite traces. In BPM. Fischer, M. J., and Ladner, R. E. 1979. Propositional dynamic logic of regular programs. J. Com. Systems and Science 18. Fritz, C., and Mc Ilraith, S. A. 2007. Monitoring plan optimality during execution. In ICAPS. Geffner, H., and Bonet, B. 2013. A Coincise Introduction to Models and Methods for Automated Planning. Morgan&Claypool. Gretton, C.; Price, D.; and Thi ebaux, S. 2003. Implementation and comparison of solution methods for decision processes with non-markovian rewards. In UAI. Gretton, C. 2014. A more expressive behavioral logic for decision-theoretic planning. In PRICAI, 13 25. Hansen, E. A., and Zilberstein, S. 2001. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence 129(1-2):35 62. Harel, D.; Kozen, D.; and Tiuryn, J. 2000. Dynamic Logic. MIT Press. Howard, R. A. 1960. Dynamic Programming and Markov Processes. MIT Press. Kocsis, L., and Szepesv ari, C. 2006. Bandit based montecarlo planning. In ECML. Lacerda, B.; Parker, D.; and Hawes, N. 2014. Optimal and dynamic planning for Markov decision processes with co-safe LTL specifications. In IROS. Lacerda, B.; Parker, D.; and Hawes, N. 2015. Optimal Policy Generation for Partially Satisfiable Co-Safe LTL Specifications. In IJCAI. Lee, D., and Yannakakis, M. 1992. Online minimization of transition systems (extended abstract). In Proceedings of the 24th Annual ACM Symposium on Theory of Computing. Levesque, H. J.; Reiter, R.; Lesperance, Y.; Lin, F.; and Scherl, R. 1997. GOLOG: A logic programming language for dynamic domains. J. of Logic Programming 31. Littman, M. L.; Topcu, U.; Fu, J.; Jr., C. L. I.; Wen, M.; and Mac Glashan, J. 2017. Environment-independent task specifications via GLTL. Co RR abs/1704.04341. Maggi, F. M.; Montali, M.; Westergaard, M.; and van der Aalst, W. M. P. 2011. Monitoring business constraints with linear temporal logic: An approach based on colored automata. In Proc. of BPM. Pnueli, A. 1977. The temporal logic of programs. In FOCS. Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley. Rabin, M. O., and Scott, D. 1959. Finite automata and their decision problems. IBM J. Res. Dev. 3:114 125. Slaney, J. K. 2005. Semipositive LTL with an uninterpreted past operator. Logic Journal of the IGPL 13(2):211 229. Tabakov, D., and Vardi, M. Y. 2005. Experimental evaluation of classical automata constructions. In LPAR. Thi ebaux, S.; Gretton, C.; Slaney, J. K.; Price, D.; and Kabanza, F. 2006. Decision-theoretic planning with nonmarkovian rewards. J. Artif. Intell. Res. (JAIR) 25:17 74. Thi ebaux, S.; Kabanza, F.; and Slaney, J. K. 2002. Anytime state-based solution methods for decision processes with nonmarkovian rewards. In UAI. Torres, J., and Baier, J. A. 2015. Polynomial-time reformulations of LTL temporally extended goals into final-state goals. In IJCAI. Vardi, M. Y. 2011. The rise and fall of linear time logic. In Gand ALF.