# LTL-Constrained Steady-State Policy Synthesis

Jan Křetínský, Technical University of Munich, jan.kretinsky@tum.de

**Abstract.** Decision-making policies for agents are often synthesized with the constraint that a formal specification of behaviour is satisfied. Here we focus on infinite-horizon properties. On the one hand, Linear Temporal Logic (LTL) is a popular example of a formalism for qualitative specifications. On the other hand, Steady-State Policy Synthesis (SSPS) has recently received considerable attention as it provides a more quantitative and more behavioural perspective on specifications, in terms of the frequency with which states are visited. Finally, rewards provide a classic framework for quantitative properties. In this paper, we study Markov decision processes (MDP) with a specification combining all three of these types. The derived policy maximizes the reward among all policies that ensure the LTL specification with the given probability and adhere to the steady-state constraints. To this end, we provide a unified solution reducing the multi-type specification to a multi-dimensional long-run average reward. This is enabled by Limit-Deterministic Büchi Automata (LDBA), recently studied in the context of LTL model checking on MDP, and allows for an elegant solution through a simple linear programme. The algorithm also extends to general ω-regular properties and runs in time polynomial in the sizes of the MDP as well as the LDBA.

## 1 Introduction

Markov decision processes (MDP), e.g. [Puterman, 1994], are a basic model for agents operating in uncertain environments, since they feature both non-determinism and stochasticity. An important class of problems is to automatically synthesize a policy that resolves the non-determinism in such a way that a given formal specification of some type is satisfied on the induced Markov chain. Here we focus on classic formalisms capable of specifying properties over the infinite horizon.

**Infinite-horizon properties.** On the one hand, Linear Temporal Logic (LTL) [Pnueli, 1977] is a popular formalism for qualitative specifications, capable of expressing complex temporal relationships while abstracting from concrete quantitative timing, e.g., after every request there is a grant (not saying when exactly). It has found applications in verification of programs [Baier and Katoen, 2008] as well as high-level robot control, e.g. [Kress-Gazit et al., 2009], or preference-based planning in PDDL [Benton et al., 2012], to name a few. On the other hand, Steady-State Control (SSC, a.k.a. Steady-State Policy Synthesis) [Akshay et al., 2013] constrains the frequency with which states are visited, providing a more quantitative and more behavioural perspective (in terms of states of the system, as opposed to logic-based or reward-based specifications). Recently, it has started receiving more attention also in AI planning [Velasquez, 2019; Atia et al., 2020], improving the theoretical complexity and the applicability to a wider class of MDP (although still being quite restrictive on the class of policies, see below). Finally, rewards provide a classic framework for quantitative properties. In the infinite-horizon setting, a key role is played by the long-run average reward (LRA), e.g. [Puterman, 1994], which constrains the reward gained on average per step and has, over the decades, found numerous applications [Feinberg and Shwartz, 2012].
**Our contribution.** In this paper, we study Markov decision processes (MDP) with specifications combining all these three types, yielding a more balanced and holistic perspective. We synthesize a policy maximizing the LRA reward among all policies ensuring the LTL specification (with the given probability) and adhering to the steady-state constraints. To this end, we provide a unified solution conceptually reducing the problem with the heterogeneous specification, combining three types of properties, to a problem with a single-type specification, namely the multi-dimensional LRA reward. This in turn can be solved using a single simple linear programme, which is easy to understand and shorter than previous solutions to the special cases considered in the literature. Not only does the unified approach allow us to use the powerful results from the literature on rewards, but it also allows for easy extensions and modifications, such as considering various classes of policies (depending on the memory available or further requirements on the induced chain) or variants of the objectives (e.g. constraints on the frequency of satisfying recurrent goals, multiple rewards, trade-offs between objectives), which we also describe. Our reduction is particularly elegant due to the use of Limit-Deterministic Büchi Automata (LDBA), recently studied in the context of LTL model checking on MDP. Moreover, the solution extends trivially from LTL to general ω-regular properties (given as automata) and runs in time polynomial in the sizes of the MDP as well as the LDBA.

In summary, our contribution is as follows:

- We introduce the heterogeneous LTL-SSC-LRA specification for maximizing the reward under LTL and steady-state constraints.
- We provide a unified solution framework via recent results on LDBA automata and on multi-dimensional LRA optimization.
- The resulting solution is generic, as documented by a number of extensions and variants of the problem we discuss, as well as simple, generating a linear programme with a structure close to classic reward optimization.

### 1.1 Related Work

Steady-state control was introduced in [Akshay et al., 2013], treating the case of recurrent MDP and showing that the problem is in PSPACE via quadratic programming. It is combined with LRA reward maximization, giving rise to steady-state policy synthesis, in [Velasquez, 2019]. A polynomial-time solution is provided via linear programming, even for general (multi-chain) MDP, but restricted to stationary policies inducing recurrent Markov chains. In [Atia et al., 2020], general (multi-chain) MDP and a wider class of policies are considered. However, the EP policies prohibit playing in some non-bottom end components, which makes stationary policies optimal within this class but is very restrictive in contrast to general strategies. In this work, we additionally allow an LTL specification, and we impose neither a restriction on the MDP nor on the class of strategies considered and chains induced. Our linear programme is simpler and extends the correspondence between solutions and the policies with their steady-state distributions to general ones (with no recurrence assumed). On the other hand, we also treat the restricted settings of finite memory, recurrence, and further modifications, to show the versatility of our approach. The main idea of our approach is an underlying conceptual reduction to multi-dimensional LRA rewards, allowing for a simple linear programme.
The classic linear-programming solution for a single LRA reward [Puterman, 1994] has been extended to various settings such as multi-dimensional reachability and ω-regular properties [Etessami et al., 2008] or multi-dimensional LRA rewards [Brázdil et al., 2014; Chatterjee et al., 2017]. Some of these results [Brázdil et al., 2014] are also implemented [Brázdil et al., 2015] on top of the PRISM model checker [Kwiatkowska et al., 2011]. Rewards have also been combined with LTL on finite paths in [Brafman et al., 2018; Giacomo et al., 2019]. Our results rely particularly on [Chatterjee et al., 2017]; however, that work treats neither SSC nor LTL. The other element simplifying our work is the limit-deterministic Büchi automaton [Courcoubetis and Yannakakis, 1995]. It has been shown to be usable for LTL model checking on MDP under some conditions [Hahn et al., 2015; Sickert et al., 2016]; these conditions are satisfied by an efficient translation from LTL [Sickert et al., 2016]. Tools for the translation [Křetínský et al., 2018] as well as the model checking [Sickert and Křetínský, 2016] are also available.

## 2 Preliminaries

### 2.1 Basic Definitions

We use $\mathbb{N}$, $\mathbb{Q}$, $\mathbb{R}$ to denote the sets of positive integers, rational numbers, and real numbers, respectively. The Kronecker function $1_x(y)$ yields 1 if $x = y$ and 0 otherwise. The set of all distributions over a countable set $X$ is denoted by $Dist(X)$. A distribution $d \in Dist(X)$ is Dirac on $x \in X$ if $d = 1_x$.

**Markov chains.** A Markov chain (MC) is a tuple $M = (L, P, \mu, \nu)$ where $L$ is a countable set of locations, $P : L \to Dist(L)$ is a probabilistic transition function, $\mu \in Dist(L)$ is the initial probability distribution, and $\nu : L \to 2^{Ap}$ is a labelling function. A run in $M$ is an infinite sequence $\rho = \ell_1 \ell_2 \cdots$ of locations; a path in $M$ is a finite prefix of a run. Each path $w$ in $M$ determines the set $Cone(w)$ consisting of all runs that start with $w$. To $M$ we associate the probability space $(Runs, \mathcal{F}, \mathbb{P})$, where $Runs$ is the set of all runs in $M$, $\mathcal{F}$ is the σ-field generated by all $Cone(w)$, and $\mathbb{P}$ is the unique probability measure such that
$$\mathbb{P}(Cone(\ell_1 \cdots \ell_k)) = \mu(\ell_1) \cdot \prod_{i=1}^{k-1} P(\ell_i)(\ell_{i+1}).$$

**Markov decision processes.** A Markov decision process (MDP) is a tuple $G = (S, A, Act, \delta, \hat{s}, \nu)$ where $S$ is a finite set of states, $A$ is a finite set of actions, $Act : S \to 2^A \setminus \{\emptyset\}$ assigns to each state $s$ the set $Act(s)$ of actions enabled in $s$ so that $\{Act(s) \mid s \in S\}$ is a partitioning of $A$, $\delta : A \to Dist(S)$ is a probabilistic transition function that, given an action $a$, gives a probability distribution over the successor states, $\hat{s}$ is the initial state, and $\nu : S \to 2^{Ap}$ is a labelling function. Note that every action is enabled in exactly one state (w.l.o.g., by renaming). A run in $G$ is an infinite alternating sequence of states and actions $\rho = s_1 a_1 s_2 a_2 \cdots$ such that for all $i \geq 1$, we have $a_i \in Act(s_i)$ and $\delta(a_i)(s_{i+1}) > 0$. A path of length $k$ in $G$ is a finite prefix $w = s_1 a_1 \cdots a_{k-1} s_k$ of a run in $G$.

**Policies and plays.** Intuitively, a policy (a.k.a. strategy) in an MDP is a recipe to choose actions. A policy is formally defined as a function $\sigma : (SA)^*S \to Dist(A)$ that, given a finite path $w$ representing the history of a play, gives a probability distribution over the actions enabled in the last state of $w$. A play of $G$ determined by a policy $\sigma$ is a Markov chain $G^\sigma$ where the set of locations is the set of paths $S(AS)^*$, the valuation depends on the current state only, $\nu(ws) = \nu(s)$, the initial distribution is Dirac on $\hat{s}$, and the transition function $P$ is defined by
$$P(w)(was) = \sigma(w)(a) \cdot \delta(a)(s).$$
Hence, $G^\sigma$ starts in the initial state of the MDP, actions are chosen according to $\sigma$ evaluated on the current history of the play, and the next state is chosen according to the transition function evaluated on the chosen action.
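To make the definitions concrete, the following minimal Python sketch (not from the paper) represents a toy two-state MDP with dictionaries and simulates a play under a memoryless policy; all state and action names are illustrative assumptions.

```python
# A minimal sketch (not from the paper) of the MDP and policy definitions above,
# using Python dictionaries; state/action names are purely illustrative.
import random

# MDP G = (S, A, Act, delta, s_hat, nu): every action is enabled in exactly one state.
ACT = {"s": ["a", "b"], "t": ["c"]}                          # Act : S -> 2^A
DELTA = {"a": {"s": 1.0}, "b": {"t": 1.0}, "c": {"t": 1.0}}  # delta : A -> Dist(S)
NU = {"s": {"p_s"}, "t": {"p_t"}}                            # nu : S -> 2^Ap
S_HAT = "s"

def sample(dist):
    """Sample a key of `dist` (a dict mapping outcomes to probabilities)."""
    r, acc = random.random(), 0.0
    for outcome, prob in dist.items():
        acc += prob
        if r <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def simulate(policy, steps=1000):
    """Simulate a play G^sigma under a memoryless policy sigma : S -> Dist(A)."""
    s, visits = S_HAT, {state: 0 for state in ACT}
    for _ in range(steps):
        visits[s] += 1
        a = sample(policy(s))       # choose an enabled action
        s = sample(DELTA[a])        # successor according to delta(a)
    return {state: count / steps for state, count in visits.items()}

# A memoryless policy playing `a` in s with probability 0.9 and `b` with 0.1.
print(simulate(lambda s: {"a": 0.9, "b": 0.1} if s == "s" else {"c": 1.0}))
```

The returned dictionary estimates the empirical state frequencies of the induced play, foreshadowing the steady-state quantities used later.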
The induced probability measure is denoted by $\mathbb{P}^\sigma$, and "almost surely (a.s.)" or "almost all runs" refers to events happening with probability 1 according to this measure. The respective expected value of a random variable $F : Runs \to \mathbb{R}$ is $\mathbb{E}^\sigma[F] = \int_{Runs} F \, d\mathbb{P}^\sigma$. For $t \in \mathbb{N}$, the random variables $S_t$, $A_t$ return the $t$-th state and action on the run.

A policy is memoryless if it only depends on the last state of the current history. A policy is bounded-memory (or finite-memory) if it stores only finite information about the history; technically, it can be written down using a finite-state machine, as detailed in Appendix A.

**End components.** A set $T \cup B$ with $\emptyset \neq T \subseteq S$ and $B \subseteq \bigcup_{t \in T} Act(t)$ is an end component of $G$ if (1) for all $a \in B$, whenever $\delta(a)(s') > 0$ then $s' \in T$; and (2) for all $s, t \in T$ there is a path $w = s_1 a_1 \cdots a_{k-1} s_k$ such that $s_1 = s$, $s_k = t$, and all states and actions that appear in $w$ belong to $T$ and $B$, respectively. An end component $T \cup B$ is a maximal end component (MEC) if it is maximal with respect to the subset ordering. Given an MDP, the set of MECs is denoted by $\mathsf{MEC}$ and can be computed in polynomial time [Courcoubetis and Yannakakis, 1995].

### 2.2 Specifications

In order to define our problem, we first recall definitions of the considered classes of properties.

**LTL and ω-regular properties.** Linear Temporal Logic (LTL) [Pnueli, 1977] over the finite set $Ap$ of atomic propositions is given by the syntax
$$\varphi ::= p \mid \varphi \wedge \varphi \mid \neg\varphi \mid X\varphi \mid \varphi\, U\, \varphi \qquad\qquad p \in Ap$$
and is interpreted over a run $\rho = s_1 s_2 \cdots$ of $M$ by

- $\rho \models p$ if $p \in \nu(s_1)$,
- $\rho \models X\varphi$ if $s_2 s_3 \cdots \models \varphi$,
- $\rho \models \varphi\, U\, \psi$ if $\exists k : s_k s_{k+1} \cdots \models \psi$ and $\forall j < k : s_j s_{j+1} \cdots \models \varphi$,

and the usual semantics of Boolean connectives. The event that $\varphi$ is satisfied is denoted by $\models \varphi$. Given an MDP $G$ and a policy $\sigma$, $\mathbb{P}^\sigma[\models \varphi]$ then denotes the probability that $\varphi$ is satisfied on $G^\sigma$.

A Büchi Automaton (BA) is a tuple $B = (\Sigma, Q, \Delta, q_0, F)$ where $\Sigma$ is a finite alphabet, $Q$ is a finite set of states, $\Delta : Q \times \Sigma \to 2^Q$ is a transition function, writing $q \xrightarrow{\alpha} r$ for $r \in \Delta(q, \alpha)$, $q_0$ is the initial state, and $F \subseteq Q$ is the set of accepting states. $B$ accepts the language
$$L(B) = \{\alpha_1\alpha_2\cdots \in \Sigma^\omega \mid \exists\, q_0 \xrightarrow{\alpha_1} q_1 \xrightarrow{\alpha_2} q_2 \cdots : q_i \in F \text{ for infinitely many } i\}$$
of infinite words that can pass through $F$ infinitely often. A BA $B$ is limit-deterministic (LDBA) [Sickert et al., 2016] if $Q$ can be partitioned as $Q = Q_N \uplus Q_D$ so that

1. $\Delta(q, \alpha) \subseteq Q_D$ and $|\Delta(q, \alpha)| = 1$ for $q \in Q_D$, $\alpha \in \Sigma$,
2. $F \subseteq Q_D$, and
3. $|\Delta(q, \alpha) \cap Q_N| = 1$ for $q \in Q_N$, $\alpha \in \Sigma$.

Intuitively, a BA is an LDBA if it can be split into a non-deterministic part without accepting transitions and a deterministic component, where it has to remain forever but may now accept. The third constraint is less standard and not essential, but it eases the argumentation. Intuitively, it implies that the only choice made on a run is when and where to do the single non-deterministic jump from $Q_N$ to $Q_D$ [Sickert et al., 2016]. Every LTL formula can be translated into an equivalent BA $BA(\varphi)$ in exponential time and into an LDBA $LDBA(\varphi)$ in double exponential time [Sickert et al., 2016], so that both automata are over the alphabet $2^{Ap}$ and accept exactly the language $L(\varphi)$ of words satisfying $\varphi$.¹ A set is an ω-regular language if it can be written as $L(B)$ for some BA (or, equivalently, LDBA) $B$.

¹ Technically, $L(\varphi)$ is the set such that for any run $\rho = s_1 s_2 \cdots$ with $\nu(s_i) = \alpha_i$, we have $\rho \models \varphi$ iff $\alpha_1\alpha_2\cdots \in L(\varphi)$.
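As a hedged illustration of the three LDBA conditions (not from the paper), the sketch below encodes a small hand-built automaton intended to accept FG p over Ap = {p} and checks the limit-determinism conditions; the state names and the automaton itself are assumptions made for this example only.

```python
# A minimal sketch (not from the paper) of an LDBA as a data structure, together
# with a check of the three limit-determinism conditions above. The automaton is
# hand-built and intended to accept FG p over Ap = {p}; treat it as illustrative.
EMPTY, P = frozenset(), frozenset({"p"})
SIGMA = [EMPTY, P]                      # alphabet 2^Ap

Q_N, Q_D = {"q0"}, {"q1", "qrej"}       # candidate partition Q = Q_N ⊎ Q_D
F = {"q1"}                              # accepting states
DELTA = {                               # Delta : Q × Σ → 2^Q
    ("q0", EMPTY): {"q0"},
    ("q0", P):     {"q0", "q1"},        # non-deterministic jump into Q_D
    ("q1", P):     {"q1"},
    ("q1", EMPTY): {"qrej"},            # guessed the jump too early: reject
    ("qrej", EMPTY): {"qrej"},
    ("qrej", P):     {"qrej"},
}

def is_ldba(q_n, q_d, delta, sigma, acc):
    for q in q_d:                       # condition 1: the Q_D part is deterministic
        for a in sigma:
            succ = delta[(q, a)]
            if not (succ <= q_d and len(succ) == 1):
                return False
    if not acc <= q_d:                  # condition 2: accepting states lie in Q_D
        return False
    for q in q_n:                       # condition 3: exactly one Q_N-successor
        for a in sigma:
            if len(delta[(q, a)] & q_n) != 1:
                return False
    return True

print(is_ldba(Q_N, Q_D, DELTA, SIGMA, F))  # expected: True
```

The single non-deterministic choice is the jump from q0 to q1 on a letter containing p, matching the intuition stated above.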
The standard algorithm for LTL model checking on MDP, i.e. to decide whether $\exists \sigma : \mathbb{P}^\sigma[\models \varphi] \geq p$, is to translate it to the more general problem of model checking an ω-regular specification $\exists \sigma : \mathbb{P}^\sigma[\nu(\rho) \in L(\mathcal{A})] \geq p$ for some automaton $\mathcal{A}$ accepting $L(\varphi)$. Typically, $\mathcal{A}$ is deterministic with a more complex acceptance condition [Baier and Katoen, 2008], but LDBA (with the simple Büchi condition of visiting $F$ infinitely often) can also be used if they satisfy a certain condition, which is ensured by the recent translation from LTL [Sickert et al., 2016]. Then the problem can be solved by combining the MDP $G$ and the automaton $B := LDBA(\varphi)$ into the product $G \otimes B$, where the automaton monitors the run of the MDP, and by its subsequent analysis.

The product of $G = (S, A, Act, \delta, \hat{s}, \nu)$ and $B = (2^{Ap}, Q, \Delta, q_0, F)$ is the MDP $G \otimes B = (S \times Q, A', Act', \delta', \langle \hat{s}, q_0 \rangle, \nu')$ where $A' = A \times Q \times Q$, ensuring uniqueness of actions; its elements are written as $a_{q \to r}$, intuitively combining $a$ with $q \to r$;

- $Act'(\langle s, q \rangle) = \{a_{q \to r} \mid a \in Act(s),\ q \xrightarrow{\nu(s)} r\}$,
- $\delta'(a_{q \to r})(\langle t, r' \rangle) = \delta(a)(t) \cdot 1_r(r')$,
- $\nu'(\langle s, q \rangle) = \nu(s)$.

Note that indexing the actions preserves their uniqueness. A state $\langle s, q \rangle$ of the product is accepting if $q \in F$. Observe that a run of the original MDP satisfies $\varphi$ iff there is a choice of the jump such that the corresponding run of the product visits accepting states infinitely often. A MEC which contains an accepting state is called an accepting MEC, and the set of accepting MECs is denoted by $\mathsf{AMEC}$. Observe that once in an accepting MEC, any policy that uses all actions of the current MEC (and none that leave it) ensures visiting all its states, hence also acceptance. Consequently, $\max_\sigma \mathbb{P}^\sigma[\models \varphi] = \max_\sigma \mathbb{P}^\sigma[\bigcup \mathsf{AMEC} \text{ is reached}]$ by [Sickert et al., 2016].

**Steady-state constraints.** The classic way of constraining the steady state is to require that $\pi(s) \in [\ell, u]$, where $\pi(s)$ is the steady-state distribution applied to the state $s$ and $[\ell, u] \subseteq [0, 1]$ is the constraining interval. We slightly generalize the definition in two directions. Firstly, we can constrain (not necessarily disjoint) sets of states, to match the ability of the logical specification. Secondly, we consider general cases where the steady-state distribution does not exist, to cater for general policies (in contrast to the restricted classes of policies investigated in previous works), which may cause such behaviour. Formally, a steady-state specification (SSS) is a set $\mathcal{S}$ of steady-state constraints of the form $(p, \ell, u) \in Ap \times [0, 1] \times [0, 1]$. A Markov chain $M$ satisfies $\mathcal{S}$, written $M \models \mathcal{S}$, if for each $(p, \ell, u) \in \mathcal{S}$ we have
$$\ell \;\leq\; \pi_{\inf}(p) := \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{P}[p \in \nu(S_t)] \qquad \text{and} \qquad \pi_{\sup}(p) := \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{P}[p \in \nu(S_t)] \;\leq\; u.$$
Intuitively, we measure the frequency with which $p$ is satisfied, and the average frequency must remain in the constraining interval at all times (even when the limit does not exist), except for an initial heat-up phase. It is worth noting that the policies we synthesize always have even the limit defined. If an atomic proposition $p_s$ is satisfied only in state $s$, then the constraint $(p_s, x, x)$ is equivalent to the usual constraint $\pi(s) = x$ for the steady-state distribution $\pi$.

Further, $M$ δ-satisfies $\mathcal{S}$ for some $\delta \geq 0$, written $M \models_\delta \mathcal{S}$, if $M \models \{(p, \ell - \delta, u + \delta) \mid (p, \ell, u) \in \mathcal{S}\}$. Intuitively, δ-satisfaction allows for imprecision $\delta$ in satisfying the constraints. Such a relaxation is often used, for instance, to obtain δ-decidability of otherwise undecidable problems, e.g. on systems with complex continuous dynamics [Gao et al., 2012; Rungger and Tabuada, 2017]. In our case, we show that it allows for a polynomial-time solution even when restricting to finite-memory policies.
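For the common case in which the induced chain has a well-defined steady-state distribution, the following minimal sketch (not from the paper) checks a single constraint $(p, \ell, u)$ by solving $\pi P = \pi$ with numpy; the chain, its labels, and the concrete constraint are illustrative assumptions.

```python
# A minimal sketch (not from the paper) of checking a steady-state constraint
# (p, l, u) on a small Markov chain whose steady-state distribution exists, in
# which case pi_inf(p) = pi_sup(p) = sum of pi(s) over states labelled with p.
import numpy as np

STATES = ["s", "t"]
P = np.array([[0.9, 0.1],           # transition matrix: row = current, column = next
              [0.5, 0.5]])
NU = {"s": {"p_s"}, "t": {"p_t"}}   # labelling nu : L -> 2^Ap

def steady_state(P):
    """Solve pi P = pi, sum(pi) = 1 for an irreducible chain."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])   # stationarity + normalisation
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def satisfies(constraint, pi):
    """Check a single steady-state constraint (p, l, u)."""
    p, l, u = constraint
    freq = sum(pi[i] for i, s in enumerate(STATES) if p in NU[s])
    return l <= freq <= u

pi = steady_state(P)
print(pi)                                  # long-run frequencies of s and t
print(satisfies(("p_t", 0.1, 0.2), pi))    # e.g. require the t-frequency in [0.1, 0.2]
```

When the limit does not exist, the definition above falls back to the liminf/limsup bounds, which the synthesized policies of this paper avoid by always having the limit defined.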
**Long-run average reward.** Let $r : A \to \mathbb{Q}$ be a reward function. Recall that $A_t$ is a random variable returning the action played at time $t$. Similarly to the steady-state distribution, the random variable given by the limit-average function
$$lr(r) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r(A_t)$$
may be undefined for some runs, so we consider maximizing the respective point-wise limit inferior:
$$lr_{\inf}(r) = \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r(A_t).$$
Although we could also maximize the limit superior, it is technically less interesting, see [Brázdil et al., 2014], and less relevant from the perspective of ensuring the required performance. Further, the respective minimization problems can be solved by maximization with opposite rewards.

## 3 Problem Statement and Examples

This paper is concerned with the following tasks:

- **Satisfiability:** Given an MDP $G$ with the long-run specification $LR = ((\varphi, \theta), \mathcal{S}, (r, R))$, where $\varphi$ is an LTL formula, $\theta \in [0, 1]$ is a probability threshold, $\mathcal{S}$ is a steady-state specification, $r$ is a reward function, and $R \in \mathbb{Q}$ is a reward threshold, decide whether there is a policy $\sigma$ such that
  - $\mathbb{P}^\sigma[\models \varphi] \geq \theta$ (LTL)
  - $G^\sigma \models \mathcal{S}$ (SSS)
  - $\mathbb{E}^\sigma[lr_{\inf}(r)] \geq R$ (LRA)
- **Policy synthesis:** If satisfiable, construct a policy satisfying the requirements, i.e. inducing a Markov chain satisfying them.
- **δ-satisfying finite-memory policy synthesis (for δ > 0):** If satisfiable, construct a finite-memory policy $\sigma$ ensuring
  - $\mathbb{P}^\sigma[\models \varphi] \geq \theta$ (LTL)
  - $G^\sigma \models_\delta \mathcal{S}$ (δ-SSS)
  - $\mathbb{E}^\sigma[lr_{\inf}(r)] \geq R - \delta$ (δ-LRA)

We illustrate the problem, some of its intricacies, the effect of the bound on the memory, and the consequent importance of δ-satisfaction on the following examples.

**Example 1.** Consider the MDP in Fig. 1 with the specification $\pi(s) = 0.5 = \pi(t)$. On the one hand, every memoryless policy yields either $\pi(s) = 1$ or $\pi(t) = 1$, the latter occurring iff $b$ is played in $s$ with non-zero probability. On the other hand, a history-dependent policy $\sigma$ defined by $\sigma(sas)(a) = 1$ and $\sigma(s)(a) = 0.5 = \sigma(s)(b)$ satisfies the specification. In terms of the automata representation, the policy is 2-memory, remembering whether a step has already been taken; see Fig. 5 in Appendix A.

*Figure 1: A (multichain) MDP where memory is required for satisfying a steady-state specification (SSS).*

Consequently, memory may be necessary, in contrast to the claim of [Velasquez, 2019] that memoryless policies are sufficient by [Akshay et al., 2013], which holds only for the setting with recurrent chains. Moreover, the combination with LTL may require even unbounded memory:

**Example 2.** Consider the MDP in Fig. 2 with the LTL specification $(GF p_t, 1)$, where $p_t$ holds in $t$ only, and the steady-state constraint $(p_s, 1, 1)$, where $p_s$ holds in $s$ only. Every policy satisfying the specification must play action $b$ infinitely often. On the one hand, a history-dependent policy $\sigma$ defined by $\sigma(ws)(b) = 1/|w|$, where $|w|$ denotes the length of $w$, satisfies the specification. Indeed, $t$ is still visited infinitely often almost surely, but with frequency decreasing below every positive bound. While $\sigma$ is Markovian (it does not depend on the history, only on the time), it still uses unbounded memory. On the other hand, every finite-memory policy satisfying the LTL specification visits $t$ with positive frequency, in fact with frequency at least $p^{|S| \cdot m}$, where $m$ is the size of the memory and $p$ is the minimum probability occurring in the policy. Hence no finite-memory policy can satisfy the whole specification. However, in order to δ-satisfy it, a memoryless policy is sufficient, defined by $\sigma(s)(b) = \delta$.

*Figure 2: A (unichain) MDP where unbounded memory is required for satisfying SSS and LTL specifications, but finite memory is sufficient for δ-satisfaction.*
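The following minimal sketch (not from the paper) simulates Example 2 under the memoryless δ-satisfying policy; the exact shape of the Fig. 2 MDP is an assumption inferred from the example text (action a stays in s, b moves to t, c returns to s), so treat the model as illustrative.

```python
# A minimal sketch (not from the paper) simulating Example 2 under the memoryless
# policy sigma(s)(b) = delta. The shape of the Fig. 2 MDP is an assumption inferred
# from the example text: action a stays in s, b moves to t, c returns to s.
import random

DELTA_SPEC = 0.05                      # allowed imprecision delta

def step(state, delta=DELTA_SPEC):
    """One step of the play under the memoryless delta-policy."""
    if state == "s":
        return "t" if random.random() < delta else "s"   # play b w.p. delta, else a
    return "s"                                           # in t, action c goes back to s

def long_run_frequency(steps=200_000):
    state, visits_s, visits_t = "s", 0, 0
    for _ in range(steps):
        visits_s += state == "s"
        visits_t += state == "t"
        state = step(state)
    return visits_s / steps, visits_t

freq_s, visits_t = long_run_frequency()
# freq_s should be close to 1/(1+delta) >= 1-delta, so (p_s, 1, 1) is delta-satisfied,
# while t keeps being visited with positive frequency delta/(1+delta), so GF p_t
# holds almost surely.
print(f"frequency of s ≈ {freq_s:.4f}, visits to t = {visits_t}")
```

The run illustrates the trade-off of Example 2: exact satisfaction needs the frequency of t to vanish (unbounded memory), whereas δ-satisfaction tolerates a fixed small frequency and hence a memoryless policy.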
## 4 LTL-Constrained Steady-State Policy Synthesis

Fig. 3 depicts a set of linear constraints capturing the (both transient and recurrent) behaviour of a policy as a flow [Chatterjee et al., 2017], with variables capturing how much each action is used. Together with the maximization of a certain linear function, it yields the standard linear programme for maximizing the LRA reward in general (multichain) MDP [Puterman, 1994]. Since then it has been used in various contexts, including multiple reachability objectives [Etessami et al., 2008] or multi-dimensional rewards [Brázdil et al., 2014; Chatterjee et al., 2017].

**Input:** MDP $(S, A, Act, \delta, \hat{s}, \nu)$. Besides requiring all variables $y_a$, $y_s$, $x_a$ for $a \in A$, $s \in S$ to be non-negative, the policy-flow constraints are the following:

1. transient flow: for $s \in S$:
   $$1_{\hat{s}}(s) + \sum_{a \in A} y_a \cdot \delta(a)(s) = \sum_{a \in Act(s)} y_a + y_s$$
2. almost-sure switching to recurrent behaviour:
   $$\sum_{s \in S \cap \bigcup \mathsf{MEC}} y_s = 1$$
3. probability of switching in a MEC is the frequency of using its actions: for $C \in \mathsf{MEC}$:
   $$\sum_{s \in C \cap S} y_s = \sum_{a \in C \cap A} x_a$$
4. recurrent flow: for $s \in S$:
   $$\sum_{a \in A} x_a \cdot \delta(a)(s) = \sum_{a \in Act(s)} x_a$$

*Figure 3: Linear program for policy synthesis in general (multichain) MDP.*

Intuitively, $x_a$ is the expected frequency of using $a$ in the long run; Equation 4 thus expresses the recurrent flow in MECs, and $\pi(s) := \sum_{a \in Act(s)} x_a$ is the steady-state distribution for state $s$. However, before we can play according to the x-variables, we have to reach MECs and switch from the transient behaviour to this recurrent behaviour. Equation 1 expresses the transient flow before switching. Variables $y_a$ are the expected numbers of times $a$ is used until we switch to the recurrent behaviour in MECs, and $y_s$ is the probability of this switch upon reaching $s$. To relate the y- and x-variables, Equation 3 states that the probability to switch and remain within a given MEC is the same whether viewed from the transient or the recurrent flow perspective.² Finally, Equation 2 states that switching happens almost surely.³

² Actually, one could eliminate the variables $y_s$, use directly $x_a$ in Equation 1, and leave out Equation 3 completely, in the spirit of [Puterman, 1994]. However, the form with explicit $y_s$ is closer to the more recent related work, more intuitive, and more convenient for the correctness proofs.

³ Note that summing Equation 1 over all $s \in S$ yields $\sum_{s \in S} y_s = 1$. Since $y_s$ can be shown to equal 0 for states $s$ not in any MEC, Equation 2 is again redundant but more convenient.

**Input:** MDP $(S, A, Act, \delta, \hat{s}, \nu)$, long-run specification $LR = ((\varphi, \theta), \mathcal{S}, (r, R))$, and the set $\mathsf{AMEC}$, available due to the assumption that the MDP is a product of another MDP with $LDBA(\varphi)$. The specification constraints are the following:

5. LTL specification:
   $$\sum_{a \in A \setminus \bigcup \mathsf{AMEC}} x_a \leq 1 - \theta$$
6. steady-state specification: for every $(p, \ell, u) \in \mathcal{S}$:
   $$\ell \;\leq\; \sum_{\substack{s \in S \\ p \in \nu(s)}} \; \sum_{a \in Act(s)} x_a \;\leq\; u$$
7. LRA rewards:
   $$\sum_{a \in A} x_a \cdot r(a) \geq R$$

*Figure 4: Linear program for the long-run specification on the product MDP (together with the policy-flow constraints of the LP in Fig. 3).*

**Lemma 1 ([Chatterjee et al., 2017]).** Existence of a solution to the equations in Fig. 3 is equivalent to existence of a policy inducing the steady-state distribution $\pi(s) = \sum_{a \in Act(s)} x_a$ for all $s \in S$.
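To make the LP concrete, the following minimal sketch (not from the paper) instantiates the policy-flow constraints of Fig. 3 together with the steady-state constraint 6 for the two-state MDP of Example 1 and checks feasibility with scipy; the shape of the Fig. 1 MDP (a loops in s, b leads to t, c loops in t) is an assumption inferred from the example text.

```python
# A minimal sketch (not from the paper) instantiating the LP of Fig. 3 plus the
# steady-state constraint 6 for the two-state MDP of Example 1 (its shape, with
# a looping in s, b leading to t and c looping in t, is inferred from the text).
# Variables, in order: y_a, y_b, y_c, y_s, y_t, x_a, x_b, x_c (all non-negative).
import numpy as np
from scipy.optimize import linprog

A_eq = np.array([
    # 1. transient flow at s:  sum_a y_a*delta(a)(s) - sum_{a in Act(s)} y_a - y_s = -1_{s_hat}(s)
    [0, -1,  0, -1,  0,  0,  0,  0],
    # 1. transient flow at t
    [0,  1,  0,  0, -1,  0,  0,  0],
    # 2. switching happens almost surely: y_s + y_t = 1 (both states lie in MECs)
    [0,  0,  0,  1,  1,  0,  0,  0],
    # 3. per-MEC coupling: MEC ({s},{a}): y_s = x_a;  MEC ({t},{c}): y_t = x_c
    [0,  0,  0,  1,  0, -1,  0,  0],
    [0,  0,  0,  0,  1,  0,  0, -1],
    # 4. recurrent flow at s and at t
    [0,  0,  0,  0,  0,  0, -1,  0],
    [0,  0,  0,  0,  0,  0,  1,  0],
    # 6. steady-state specification: pi(s) = x_a + x_b = 0.5 and pi(t) = x_c = 0.5
    [0,  0,  0,  0,  0,  1,  1,  0],
    [0,  0,  0,  0,  0,  0,  0,  1],
])
b_eq = np.array([-1, 0, 1, 0, 0, 0, 0, 0.5, 0.5])

# Zero objective: we only check feasibility; linprog's default bounds are (0, None).
res = linprog(c=np.zeros(8), A_eq=A_eq, b_eq=b_eq)
print("feasible:", res.success)
x_a, x_b, x_c = res.x[5:]
print("pi(s) =", x_a + x_b, " pi(t) =", x_c)
```

By Lemma 1, feasibility corresponds to a policy inducing π(s) = π(t) = 0.5, matching the 2-memory policy of Example 1 even though no memoryless policy achieves it. To maximize a reward as in Corollary 1 below, one would additionally pass the negated reward coefficients on the x-variables as the objective c, since linprog minimizes.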
**Theorem 1 (Linear program for LTL-constrained steady-state policy synthesis on general MDP).** For an MDP $G$ and a long-run specification $LR$, let $L$ be the system consisting of the linear policy-flow constraints of Fig. 3 and the specification constraints of Fig. 4, both on input $G \otimes LDBA(\varphi)$, where $\varphi$ is the LTL formula of $LR$. Then:

1. Every policy satisfying $LR$ induces a solution to $L$.
2. Every solution to $L$ effectively induces a (possibly unbounded-memory) policy satisfying $LR$ and, moreover, for every $\delta > 0$, also a finite-memory policy δ-satisfying $LR$.

*Proof sketch.* The full proof can be found in [Křetínský, 2021, Appendix B]. It heavily relies on the results about the linear-programming solution for the multi-dimensional long-run average reward [Chatterjee et al., 2017] and also on the LTL model checking algorithm for MDP via LDBA [Sickert et al., 2016]. Essentially, each steady-state constraint can be seen as an LRA constraint on a new reward that is 1 whenever $p$ holds and 0 otherwise. Besides, δ-satisfaction of the LTL constraint with finite memory is equivalent to positive reward collected on accepting states. The constructed policy first reaches the desired MECs, where it switches to staying in the MEC and playing a memoryless policy whose frequencies of using each action $a$ are close to $x_a$, and hence we obtain a 2-memory policy; alternatively, the switch might be followed by a policy achieving exactly $x_a$ but possibly requiring unbounded memory. The policy's distribution over actions then depends on the length of the current history.

**Corollary 1.** Given an MDP $G$ with a long-run specification $LR$ (possibly with a trivial reward threshold, e.g. the minimum reward), a policy maximizing the reward among policies satisfying $LR$ can be effectively synthesized.

*Proof.* First, we construct the product $G \otimes LDBA(\varphi)$, where $\varphi$ is the LTL formula of $LR$. Second, we construct the linear programme with the objective function $\sum_{a \in A} x_a \cdot r(a)$ and the constraints 1.–7., or, in the case of a trivial reward threshold, only 1.–6. Thirdly, a solution to the programme induces the satisfying policy (as described in detail in the proof of Theorem 1).

## 5 Complexity

Observe that the number of variables is linear in the size of the product and the size of the LP is quadratic; hence the size is polynomial in the sizes of $G$ and $LDBA(\varphi)$. Since linear programming is solvable in polynomial time [Khachiyan, 1979], our problems can also be solved in time polynomial in $G$ and $LDBA(\varphi)$. Consequently, the satisfiability problem can be solved in double exponential time due to the translation of LTL to LDBA with this complexity. In fact, it is 2-EXPTIME-complete, inheriting the hardness from its special case of LTL model checking of MDP. While this sounds pessimistic at first glance, the worst-case complexity is rarely an issue in practice whenever it stems only from the translation, as is the case here. Indeed, the sizes of the LDBA produced for practically occurring formulae are typically very small, see [Sickert et al., 2016; Sickert and Křetínský, 2016; Křetínský et al., 2018; Esparza et al., 2020] for examples and more details. Moreover, in the frequent case of qualitative (almost-sure) LTL satisfaction, Constraint 5 takes the form $x_a = 0$ for every $a \in A \setminus \bigcup \mathsf{AMEC}$, effectively eliminating many variables from the LP. Further, for long-run specifications with ω-regular properties (instead of LTL), the problem can be solved in polynomial time if the input specification is provided directly as an LDBA.
If instead a fully non-deterministic BA is used, single exponential time is required due to the semi-determinization into LDBA [Courcoubetis and Yannakakis, 1995]. In any case, the algorithm is polynomial with respect to the MDP and there are no extra variables for the steady-state or reward specifications. Hence, if the LTL property is trivial (omitted), there is almost no overhead compared to the LP of Fig. 3 (or the LP for rewards, which contains one more constraint, namely 7.). In particular, for point specifications with constraints $(p, x, x)$ as in [Akshay et al., 2013], Constraint 6 takes the form of the equality $\sum_{p \in \nu(s),\, a \in Act(s)} x_a = x$, which can be substituted into the system, even decreasing the number of its variables.

The final remark concerns the type of automaton used. While we used LDBA, one could use other, deterministic automata such as the traditionally used Rabin or parity automata [Baier and Katoen, 2008]. However, there are two drawbacks of doing so. Firstly, while the translation from LTL is double exponential in the worst case for all these automata, in practice it is often worse for Rabin than for LDBA, and significantly worse still for parity, see e.g. [Křetínský et al., 2018]. Secondly, the acceptance condition is more complex. Instead of the Büchi condition with infinitely many visits of a set of states, for Rabin there is a disjunction of several options, each a conjunction of the Büchi condition and the prohibition of some other states. A reduction to LRA would require several copies of these parts in the product, each with the prohibited parts removed, resulting in a larger system. For parity, one copy would be sufficient together with a pre-processing by the attractor construction; yet, as already mentioned, the automaton is typically significantly larger than the LDBA, see [Esparza et al., 2017].

## 6 Discussion

We have seen that we can synthesize satisfying policies whenever they exist. The generality of the framework can be documented by the ease of extending our solution to various modifications of the problem. For instance, general policies suffer from several deficiencies, which we can easily address here:

- Unbounded memory may not be realistic for implementation. As discussed, we can solve the δ-satisfaction problem in polynomial time for steady-state specifications (or also for general long-run specifications, provided $LDBA(\varphi)$ is given on the input), yielding finite-memory policies, where the imprecision δ can be arbitrarily small.
- The unbounded-memory solution may lead to policies visiting the accepting states less and less frequently over time. Avoiding this is easy through an additional constraint
  $$\sum_{a \in Acc \cap \bigcup \mathsf{AMEC}} x_a \;\geq\; \frac{1}{f},$$
  where $Acc$ denotes the set of actions enabled in accepting states, ensuring that the frequency does not decrease below once per $f$ steps on average. Alternatively, if we want to enforce the frequency bound for almost all satisfying runs, we can instead use the frequency bound in each MEC: for each $C \in \mathsf{AMEC}$,
  $$\sum_{a \in Acc \cap C} x_a \;\geq\; \frac{1}{f} \sum_{a \in A \cap C} x_a.$$

Further modifications and their solutions include:

- Similarly to the frequency bounds for visiting accepting states, we may require bounds on the frequency of satisfying recurrent subgoals, i.e. subformulae of $\varphi$. Indeed, the translation from LTL to LDBA [Sickert et al., 2016] labels the states with the progress of satisfying the subformulae. For each subformula, there are its breakpoint states, where the subformula is satisfied and whose x-variables we can constrain as above.
- LRA rewards can be multi-dimensional, i.e. take the vector form $r : A \to \mathbb{Q}^n$. Then the single Constraint 7 appears once for each dimension, with negligible additional computational requirements.
- Moreover, we can optimize any weighted combination of the LRA rewards or consider the trade-offs among them. To this end, we can ε-approximate the Pareto frontier capturing all pointwise optimal combinations, in time polynomial in the size of the product and in $1/\varepsilon$ [Papadimitriou and Yannakakis, 2000; Chatterjee et al., 2017].
- We can optimize the satisfaction probability of the LTL formula instead of the reward, or consider the trade-offs. To this end, it is important that our solution uses a single linear programme. A possible approach of analyzing each MEC separately and then merging the fixed solutions would suffer from their mutual inflexibility and could not efficiently address these trade-offs.
- In contrast, if only unichain⁴ solutions should be considered, see [Velasquez, 2019], we can check each accepting MEC separately (through our LP) whether it also satisfies the steady-state specification and whether it can be reached almost surely.

⁴ A Markov chain is a unichain if it has only one bottom strongly connected component (one recurrent class) and all other states are transient.

As future work, one may consider combinations with non-linear properties, such as those expressible in branching-time logics, or with finite-horizon or discounted rewards.

## Acknowledgments

This work has been partially supported by the German Research Foundation (DFG) projects 427755713 (KR 4890/3-1) Group-By Objectives in Probabilistic Verification (GOPro) and 317422601 (KR 4890/1-1) Verified Model Checkers (VMC).

## Appendix A: Finite-Memory Policies

In order to work with finite-memory policies, we also use a slightly different (though equivalent, see [Brázdil et al., 2014, Section 6]) definition of policies, which is more convenient in some settings. Let $\mathsf{M}$ be a countable set of memory elements. A policy is a triple $\sigma = (\sigma_u, \sigma_n, \alpha)$, where $\sigma_u : A \times S \times \mathsf{M} \to Dist(\mathsf{M})$ and $\sigma_n : S \times \mathsf{M} \to Dist(A)$ are the memory update and next move functions, respectively, and $\alpha$ is the initial distribution on memory elements. We require that, for all $(s, m) \in S \times \mathsf{M}$, the distribution $\sigma_n(s, m)$ assigns a positive value only to actions enabled at $s$, i.e. $\sigma_n(s, m) \in Dist(Act(s))$.

A play of $G$ determined by a policy $\sigma$ is a Markov chain $G^\sigma$ where the set of locations is $S \times \mathsf{M} \times A$, the initial distribution $\mu$ is zero except for $\mu(\hat{s}, m, a) = \alpha(m) \cdot \sigma_n(\hat{s}, m)(a)$, and
$$P(s, m, a)(s', m', a') = \delta(a)(s') \cdot \sigma_u(a, s', m)(m') \cdot \sigma_n(s', m')(a').$$
Hence, $G^\sigma$ starts in a location chosen randomly according to $\alpha$ and $\sigma_n$. In a current location $(s, m, a)$, the next action to be performed is $a$, hence the probability of entering $s'$ is $\delta(a)(s')$. The probability of updating the memory to $m'$ is $\sigma_u(a, s', m)(m')$, and the probability of selecting $a'$ as the next action is $\sigma_n(s', m')(a')$. Note that these choices are independent, and thus we obtain the product above.

*Figure 5: An automaton representation of the 2-memory strategy considered in Example 1 and depicted in Fig. 1, with the two memory modes prescribing the action distributions {a ↦ 0.5, b ↦ 0.5, c ↦ 1} and {a ↦ 1, b ↦ 0, c ↦ 1}.*

In general, a policy may use infinite memory $\mathsf{M}$, and both $\sigma_u$ and $\sigma_n$ may randomize, the former being called stochastic-update, the latter randomization in the narrower sense. We can now classify policies according to the size of memory they use. Important subclasses are memoryless policies, in which $\mathsf{M}$ is a singleton, n-memory policies, in which $\mathsf{M}$ has exactly $n$ elements, and finite-memory policies, in which $\mathsf{M}$ is finite. Other policies are unbounded-memory or infinite-memory policies.
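As a hedged illustration of the $(\sigma_u, \sigma_n, \alpha)$ representation (not from the paper), the sketch below encodes the 2-memory strategy of Example 1 / Fig. 5 and estimates the expected long-run frequency of s by simulation; the shape of the Fig. 1 MDP is again an assumption inferred from the example text.

```python
# A minimal sketch (not from the paper) of the (sigma_u, sigma_n, alpha) representation
# above, encoding the 2-memory strategy of Example 1 / Fig. 5 for the Fig. 1 MDP
# (whose shape -- a loops in s, b leads to t, c loops in t -- is inferred from the text).
import random

DELTA = {"a": {"s": 1.0}, "b": {"t": 1.0}, "c": {"t": 1.0}}
S_HAT = "s"

ALPHA = {"m0": 1.0}                                  # initial memory distribution
def sigma_n(s, m):                                   # next-move function
    if m == "m0":
        return {"a": 0.5, "b": 0.5} if s == "s" else {"c": 1.0}
    return {"a": 1.0} if s == "s" else {"c": 1.0}    # after the first step: a forever
def sigma_u(a, s, m):                                # memory update: switch to m1
    return {"m1": 1.0}

def sample(dist):
    r, acc = random.random(), 0.0
    for k, v in dist.items():
        acc += v
        if r <= acc:
            return k
    return k

def frequency_of_s(steps=2000):
    s, m = S_HAT, sample(ALPHA)
    a = sample(sigma_n(s, m))
    in_s = 0
    for _ in range(steps):
        in_s += s == "s"
        s = sample(DELTA[a])
        m = sample(sigma_u(a, s, m))
        a = sample(sigma_n(s, m))
    return in_s / steps

# Averaging over runs estimates E[frequency of s] ~ 0.5 (and of t ~ 0.5), as in Example 1.
print(sum(frequency_of_s() for _ in range(2000)) / 2000)
```

Each individual run has frequency close to 0 or 1; only the expectation over runs meets the steady-state specification, which is exactly the behaviour discussed in Example 1.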
## References

[Akshay et al., 2013] S. Akshay, Nathalie Bertrand, Serge Haddad, and Loïc Hélouët. The steady-state control problem for Markov decision processes. In QEST, volume 8054 of Lecture Notes in Computer Science, pages 290-304. Springer, 2013.

[Atia et al., 2020] George K. Atia, Andre Beckus, Ismail Alkhouri, and Alvaro Velasquez. Steady-state policy synthesis in multichain Markov decision processes. In IJCAI, pages 4069-4075. ijcai.org, 2020.

[Baier and Katoen, 2008] C. Baier and J.-P. Katoen. Principles of Model Checking. MIT Press, 2008.

[Benton et al., 2012] J. Benton, Amanda Jane Coles, and Andrew Coles. Temporal planning with preferences and time-dependent continuous costs. In ICAPS. AAAI, 2012.

[Brafman et al., 2018] Ronen I. Brafman, Giuseppe De Giacomo, and Fabio Patrizi. LTLf/LDLf non-Markovian rewards. In AAAI, pages 1771-1778. AAAI Press, 2018.

[Brázdil et al., 2014] Tomáš Brázdil, Václav Brožek, Krishnendu Chatterjee, Vojtěch Forejt, and Antonín Kučera. Two views on multiple mean-payoff objectives in Markov decision processes. Log. Methods Comput. Sci., 10(1), 2014.

[Brázdil et al., 2015] Tomáš Brázdil, Krishnendu Chatterjee, Vojtěch Forejt, and Antonín Kučera. MultiGain: A controller synthesis tool for MDPs with multiple mean-payoff objectives. In TACAS, volume 9035 of Lecture Notes in Computer Science, pages 181-187. Springer, 2015.

[Chatterjee et al., 2017] Krishnendu Chatterjee, Zuzana Křetínská, and Jan Křetínský. Unifying two views on multiple mean-payoff objectives in Markov decision processes. Log. Methods Comput. Sci., 13(2), 2017.

[Courcoubetis and Yannakakis, 1995] C. Courcoubetis and M. Yannakakis. The complexity of probabilistic verification. Journal of the ACM, 42(4):857-907, 1995.

[Esparza et al., 2017] Javier Esparza, Jan Křetínský, Jean-François Raskin, and Salomon Sickert. From LTL and limit-deterministic Büchi automata to deterministic parity automata. In TACAS (1), volume 10205 of Lecture Notes in Computer Science, pages 426-442, 2017.

[Esparza et al., 2020] Javier Esparza, Jan Křetínský, and Salomon Sickert. A unified translation of linear temporal logic to ω-automata. J. ACM, 67(6):33:1-33:61, 2020.

[Etessami et al., 2008] K. Etessami, M. Kwiatkowska, M. Vardi, and M. Yannakakis. Multi-objective model checking of Markov decision processes. LMCS, 4(4):1-21, 2008.

[Feinberg and Shwartz, 2012] Eugene A. Feinberg and Adam Shwartz. Handbook of Markov Decision Processes: Methods and Applications, volume 40. Springer Science & Business Media, 2012.

[Gao et al., 2012] Sicun Gao, Jeremy Avigad, and Edmund M. Clarke. Delta-decidability over the reals. In LICS, pages 305-314. IEEE Computer Society, 2012.

[Giacomo et al., 2019] Giuseppe De Giacomo, Luca Iocchi, Marco Favorito, and Fabio Patrizi. Foundations for restraining bolts: Reinforcement learning with LTLf/LDLf restraining specifications. In ICAPS, pages 128-136. AAAI Press, 2019.

[Hahn et al., 2015] Ernst Moritz Hahn, Guangyuan Li, Sven Schewe, Andrea Turrini, and Lijun Zhang. Lazy probabilistic model checking without determinisation. In CONCUR, volume 42 of LIPIcs, pages 354-367. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.

[Khachiyan, 1979] L. G. Khachiyan. A polynomial algorithm in linear programming. Dokl. Akad. Nauk SSSR, 244:1093-1096, 1979.

[Kress-Gazit et al., 2009] Hadas Kress-Gazit, Georgios E. Fainekos, and George J. Pappas. Temporal-logic-based reactive mission and motion planning. IEEE Trans. Robotics, 25(6):1370-1381, 2009.
[Křetínský et al., 2018] Jan Křetínský, Tobias Meggendorfer, Salomon Sickert, and Christopher Ziegler. Rabinizer 4: From LTL to your favourite deterministic automaton. In CAV (1), volume 10981 of Lecture Notes in Computer Science, pages 567-577. Springer, 2018.

[Křetínský, 2021] Jan Křetínský. LTL-constrained steady-state policy synthesis. CoRR, abs/2105.14894, 2021.

[Kwiatkowska et al., 2011] Marta Z. Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, volume 6806 of LNCS, pages 585-591, 2011.

[Papadimitriou and Yannakakis, 2000] C. H. Papadimitriou and M. Yannakakis. On the approximability of trade-offs and optimal access of web sources. In FOCS, pages 86-92, 2000.

[Pnueli, 1977] Amir Pnueli. The temporal logic of programs. In FOCS, pages 46-57, 1977.

[Puterman, 1994] M. L. Puterman. Markov Decision Processes. John Wiley and Sons, 1994.

[Rungger and Tabuada, 2017] Matthias Rungger and Paulo Tabuada. Computing robust controlled invariant sets of linear systems. IEEE Trans. Autom. Control., 62(7):3665-3670, 2017.

[Sickert and Křetínský, 2016] Salomon Sickert and Jan Křetínský. MoChiBA: Probabilistic LTL model checking using limit-deterministic Büchi automata. In ATVA, volume 9938 of Lecture Notes in Computer Science, pages 130-137, 2016.

[Sickert et al., 2016] Salomon Sickert, Javier Esparza, Stefan Jaax, and Jan Křetínský. Limit-deterministic Büchi automata for linear temporal logic. In CAV, pages 312-332, 2016.

[Velasquez, 2019] Alvaro Velasquez. Steady-state policy synthesis for verifiable control. In IJCAI, pages 5653-5661. ijcai.org, 2019.