# The Trembling-Hand Problem for LTLf Planning

Pian Yu¹, Shufang Zhu¹, Giuseppe De Giacomo¹, Marta Kwiatkowska¹ and Moshe Vardi²
¹Department of Computer Science, University of Oxford, UK
²Department of Computer Science, Rice University, USA
{pian.yu, shufang.zhu, giuseppe.degiacomo, marta.kwiatkowska}@cs.ox.ac.uk, vardi@cs.rice.edu

**Abstract.** Consider an agent acting to achieve its temporal goal, but with a "trembling hand". In this case, the agent may mistakenly instruct, with a certain (typically small) probability, actions that are not intended due to faults or imprecision in its action selection mechanism, thereby leading to possible goal failure. We study the trembling-hand problem in the context of reasoning about actions and planning for temporally extended goals expressed in Linear Temporal Logic on finite traces (LTLf), where we want to synthesize a strategy (aka plan) that maximizes the probability of satisfying the LTLf goal in spite of the trembling hand. We consider both deterministic and nondeterministic (adversarial) domains. We propose solution techniques for both cases by relying respectively on Markov Decision Processes and on Markov Decision Processes with Set-valued Transitions with LTLf objectives, where the set-valued probabilistic transitions capture both the nondeterminism from the environment and the possible action-instruction errors of the agent. We formally show the correctness of our solution techniques and demonstrate their effectiveness experimentally through a proof-of-concept implementation.

## 1 Introduction

In this paper, we study the trembling-hand (TH) problem in the context of reasoning about actions and planning for temporally extended goals expressed in Linear Temporal Logic on finite traces (LTLf) [De Giacomo and Vardi, 2013].[^1] In a chess game, a player may have a trembling hand due to, e.g., nervousness, anxiety, or stress, which results in mistaken moves that were not intended. Likewise, an agent acting in an environment could mistakenly instruct a different action, e.g., due to faults, leading to possible goal failure.

[^1]: All results presented here apply to other linear temporal logics on finite traces, such as LDLf [De Giacomo and Vardi, 2013] and Pure Past LTL [Bonassi et al., 2023], as long as the set of traces that satisfy a formula can be characterized by a regular language, i.e., by a deterministic finite-state automaton.

The TH problem originates from Game Theory in Economics, see e.g. [Marchesi and Gatti, 2021], referring to the situation where players erroneously select unintended moves with a small quantifiable probability. This problem highlights the importance of introducing some form of resilience to these errors [Vardi, 2020] into the players' strategies, and it has given rise to the well-known notion of Trembling-Hand Perfect Equilibrium in Economics [Bielefeld, 1988]. Here, we study this problem in the context of reasoning about actions [Reiter, 2001] and planning [Geffner and Bonet, 2013]. Specifically, we consider an agent acting in a domain. At each state of the domain, when the agent instructs an action, it can, with a certain probability, mistakenly instruct a different action. Notice that this uncertainty concerns the agent's decision-making capabilities, not how the environment executes the instructed actions.
To stress this point, we consider two settings: deterministic domains, where the environment has no choice in responding to agent actions, as in classical planning [Geffner and Bonet, 2013]; and nondeterministic domains, where the environment can adversarially respond to agent actions, as in Fully Observable Nondeterministic Domains (FOND) when considering strong plans [Cimatti et al., 2003; Ghallab et al., 2004; Geffner and Bonet, 2013; De Giacomo and Rubin, 2018]. In both settings, we want to synthesize a strategy (aka plan) that is guaranteed to maximize the probability of fulfilling a temporally extended goal expressed in LTLf, in spite of the adversarial response of the environment in the case of nondeterministic domains.

We devise solution techniques for the two settings by relying respectively on Markov Decision Processes (MDPs) [Puterman, 2014] in the case of deterministic domains and on Markov Decision Processes with Set-valued Transitions (MDPSTs) [Trevizan et al., 2007; Trevizan et al., 2008] in the case of nondeterministic domains. MDPs specify concrete probability values for each transition. MDPs with imprecise probabilities (MDPIPs) [White III and Eldeib, 1994; Satia and Lave Jr, 1973; Givan et al., 2000] and Uncertain MDPs (UMDPs) [Nilim and El Ghaoui, 2004; Buffet et al., 2005; Hahn et al., 2019] have been proposed for scenarios where the probability values are uncertain. MDPSTs constitute a restricted subclass of MDPIPs, which combine probabilistically quantifiable uncertainty with unquantifiable uncertainty (nondeterminism) in a unified framework. They are attractive because they admit a simplified Bellman equation compared to MDPIPs, UMDPs [Trevizan et al., 2007], and thus stochastic games.

In both settings, we consider LTLf objectives instead of standard reachability. Note that MDPs with LTLf objectives have been studied in [Brafman et al., 2018; Wells et al., 2021]. MDPSTs with LTLf objectives, instead, are studied for the first time in this paper. We lift the definition of satisfying LTLf objectives from MDPs to MDPSTs by defining the notion of robust strategy, and we then propose an efficient value iteration algorithm for synthesizing an optimal robust strategy. We evaluate the effectiveness of the proposed solution techniques on a human-robot co-assembly problem, where the robot operates with a trembling hand, and demonstrate promising scalability.

### 1.1 Related Work

Interestingly, the trembling-hand problem has never been specifically studied in reasoning about actions and planning, though some related work exists. For example, the classical work on reasoning with noisy sensors and effectors in the Situation Calculus [Bacchus et al., 1999] is indeed related. There, sensors and effectors are considered to be noisy, so they introduce a stochastic element to be taken into account in reasoning. In particular, uncertainty on the effectors may be considered similar to our trembling hand. There is, however, an important distinction between the two: in our case, the uncertainty is on which action is actually instructed by the agent; in their case, the uncertainty is on how the (perfectly instructed) action is actually executed by the environment.
In other words, in our case the uncertainty is on the agent, while in theirs it is on the environment, which has been more extensively studied since it corresponds to uncertainty in modeling the reaction of the world to agent moves. In fact, there is a growing interest in forms of synthesis/planning that are resilient to errors in modeling the environment. For example, in [Zhu and De Giacomo, 2022], one can compute a maximally permissive strategy for the agent, allowing it to switch from one strategy to another during execution in case exceptional environment changes occur that make the predetermined strategy fail. In [Aminof et al., 2021; Ciolek et al., 2020], multiple models of the environment are considered to handle exceptions during planning, aiming to mitigate the intrinsic risk in a single environment model. In [Aineto et al., 2023], a k-resilient strategy allows the agent to fail at most k times at a repeatedly occurring state. Nevertheless, in all these works the focus is on errors/exceptions wrt expected environment behaviors. The trembling hand, on the other hand, is about errors in the agent behavior and, as such, has not been much studied yet. Only two papers [Wells et al., 2020a; Wells et al., 2021] touch on this aspect, accounting for possible errors in robot decisions. Nevertheless, the environment considered there is either deterministic or has probabilistic uncertainty. In contrast, our environment admits adversarial behaviors.

## 2 Preliminaries

We study the trembling-hand (TH) problem in the context of planning for temporally extended goals expressed in LTLf. We now briefly introduce the logic LTLf, deterministic and nondeterministic planning domains, and the notion of strategy (aka plan) in a domain achieving an LTLf goal.

**LTLf.** Linear Temporal Logic on finite traces (LTLf) is a specification language expressing temporal properties on finite, nonempty traces. In particular, LTLf shares its syntax with LTL, which is instead interpreted over infinite traces [Pnueli, 1977]. Given a set of atomic propositions $Prop$, LTLf formulas are generated as follows:
$$\varphi ::= a \mid \neg\varphi \mid \varphi \wedge \varphi \mid \mathsf{X}\varphi \mid \varphi\,\mathsf{U}\,\varphi,$$
where $a \in Prop$ is an atom, and $\mathsf{X}$ (Next) and $\mathsf{U}$ (Until) are temporal operators. We make use of standard Boolean abbreviations, e.g., $\vee$ (or) and $\rightarrow$ (implies), $\mathit{true}$ and $\mathit{false}$. In addition, we define the following abbreviations: Weak Next $\tilde{\mathsf{X}}\varphi \equiv \neg\mathsf{X}\neg\varphi$, Eventually $\Diamond\varphi \equiv \mathit{true}\,\mathsf{U}\,\varphi$, and Always $\Box\varphi \equiv \mathit{false}\,\mathsf{R}\,\varphi$, where $\mathsf{R}$ stands for Release.

A trace $\pi = \pi_0\pi_1\ldots$ is a sequence of propositional interpretations (sets), where for every $i \geq 0$, $\pi_i \in 2^{Prop}$ is the $i$-th interpretation of $\pi$. Intuitively, $\pi_i$ is interpreted as the set of propositions that are true at instant $i$. We denote the last instant (i.e., index) in a trace $\pi$ by $lst(\pi)$. By $\pi^k = \pi_0\cdots\pi_k$ we denote the prefix of $\pi$ up to the $k$-th instant, and $\pi^k = \epsilon$ denotes the empty trace if $k < 0$. We denote that $\pi$ satisfies $\varphi$ by $\pi \models \varphi$. The detailed semantics of LTLf can be found in [De Giacomo and Vardi, 2013]. It is also shown there that, for every LTLf formula $\varphi$, one can construct a Deterministic Finite Automaton (DFA) $Aut_\varphi = (2^{Prop}, Q, q_0, \delta, acc)$, where $2^{Prop}$ is a finite alphabet, $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $\delta : Q \times 2^{Prop} \to Q$ is the transition function, and $acc \subseteq Q$ is the set of accepting states, such that for every trace $\pi$ we have $\pi \models \varphi$ iff $\pi$ is accepted by $Aut_\varphi$.
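To make the automaton view concrete, the following minimal Python sketch (ours, not part of the paper's implementation) checks whether a finite trace satisfies an LTLf formula once the formula has been compiled into a DFA; the toy DFA for "Eventually a" and the dictionary-based encoding are illustrative assumptions.

```python
# Minimal sketch: membership of a finite trace in the language of a DFA that
# represents an LTLf formula. The trace is a list of sets of atomic propositions.

def dfa_accepts(trace, q0, delta, accepting):
    """Run the DFA on the trace and test whether the last state is accepting."""
    q = q0
    for props in trace:
        q = delta[(q, props)]          # deterministic: exactly one successor
    return q in accepting

# Toy DFA for "Eventually a" over Prop = {a}: q0 --{a}--> q1, q1 is an accepting sink.
delta = {
    ("q0", frozenset()): "q0", ("q0", frozenset({"a"})): "q1",
    ("q1", frozenset()): "q1", ("q1", frozenset({"a"})): "q1",
}
print(dfa_accepts([frozenset(), frozenset({"a"})], "q0", delta, {"q1"}))  # True
```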
**Deterministic Domain.** A deterministic domain is a tuple $D = (S, s_0, A, F_d, L)$, where $S$ is a finite set of states, $s_0 \in S$ is an initial state, $A$ is a finite set of actions, and $F_d : S \times A \to S$ is the deterministic (partial) transition function, where $s' = F_d(s, a)$ is the successor state after performing an applicable action $a$ at $s$. We use $A(s) \subseteq A$ to denote the set of applicable actions at $s$. $L : S \to 2^{Prop}$ is the labeling function, where $Prop$ is a finite set of propositions. Note that, compared to typical formulations of domains in planning [Geffner and Bonet, 2013], we are assuming that more than one state can have the same evaluation of the propositions (fluents).

**Nondeterministic Domain.** A nondeterministic domain is a tuple $N = (S, s_0, A, F_n, L)$, where $S$, $s_0$, $A$, and $L$ are defined as in deterministic domains, and $F_n : S \times A \to 2^S$ is now a nondeterministic transition function such that $F_n(s, a)$ denotes the non-empty set of possible successor states that follow an applicable action $a \in A(s)$ in $s$. It is worth noting that domains that are typically compactly represented, e.g., in the Planning Domain Definition Language (PDDL) [Haslum et al., 2019], can be encoded using a number of bits that is logarithmic in the number of states.

**LTLf Planning.** A path of $D$ (resp. $N$) is a finite or infinite sequence of alternating states and actions $\rho = s_0 a_0 s_1 a_1 \cdots$, ending with a state if finite, where $s_0$ is the initial state and $F_d(s_i, a_i) = s_{i+1}$ (resp. $s_{i+1} \in F_n(s_i, a_i)$) for all $i$ with $0 \leq i < |\rho|$, and $|\rho| \in \mathbb{N} \cup \{\infty\}$. We denote by $\rho^k = s_0 a_0 s_1 a_1 \cdots s_k$ the finite prefix of $\rho$ up to the $k$-th state. The sequence $\pi(\rho, D) = L(s_0)L(s_1)\cdots$ (resp. $\pi(\rho, N) = L(s_0)L(s_1)\cdots$) over $Prop$ is called the trace induced by $\rho$ over $D$ (resp. $N$). $FPaths$ denotes the set of all finite paths.

An agent strategy (or plan) is a function $\sigma_p : FPaths \to A$ mapping a finite path of $D$ (resp. $N$) to an agent action that is applicable at the last state of the finite path. For nondeterministic domains, we assume that the nondeterminism is resolved by the environment, which acts according to an (unknown) strategy as an (adversarial) antagonist of the agent. Environment strategies are functions of the form $\gamma_p : FPaths \times A \to S$, which must comply with the domain, in the sense that given a finite path $\rho \in FPaths$ and an action $a \in A(lst(\rho))$, it must be the case that $\gamma_p(\rho, a) \in F_n(lst(\rho), a)$. Note that, in the case of deterministic domains, this constraint forces the environment to have only one strategy. Given an agent strategy $\sigma_p$ and an environment strategy $\gamma_p$, there is a unique path $\rho(\sigma_p, \gamma_p) = s_0 a_0 s_1 a_1 \cdots$ generated by $\sigma_p$ and $\gamma_p$, where $s_0$ is the initial state and for every $i \geq 0$ it holds that $a_i = \sigma_p(\rho^i)$ and $s_{i+1} = \gamma_p(\rho^i, a_i)$. Sometimes, for simplicity, we write $\rho$ instead of $\rho(\sigma_p, \gamma_p)$ when it is clear from the context. An agent strategy $\sigma_p$ enforces an LTLf formula $\varphi$ in a domain $D$ (resp. $N$) if, for every environment strategy $\gamma_p$, the infinite path $\rho(\sigma_p, \gamma_p)$ contains a finite prefix $\rho^i$ such that the finite trace $\pi(\rho^i, D) \models \varphi$ (resp. $\pi(\rho^i, N) \models \varphi$). LTLf planning concerns computing such an agent strategy $\sigma_p$, if one exists.

## 3 TH in Deterministic Domains

We begin investigating the trembling-hand problem for LTLf planning by focusing on the case of deterministic domains, where the environment has only one strategy, i.e., following the transitions of the domain. Hence, the only uncertainty is the stochastic one arising from the "trembling hand".
In the next section, we consider the case of nondeterministic domains, where the environment employs adversarial strategies.

### 3.1 Problem Formulation

The trembling hand refers to the agent intending to instruct a certain action but, by mistake, instructing a different action with a (small) quantified probability. This probability only depends on the domain state where the action is instructed (and then performed). Notice that, here, the environment is fully deterministic and hence, once the action is instructed, it is executed (without any error) in a deterministic way. The problem that we want to address is to maximize the probability of achieving a given LTLf objective in a deterministic domain in spite of the (state-dependent) action-instruction errors due to the trembling hand.

We formalize the action-instruction errors as follows. Let $D = (S, s_0, A, F_d, L)$ be a deterministic domain, $s \in S$ a domain state, and $a \in A(s)$ an applicable action at $s$. We denote by $err(s, a) \in Dist(A(s))$ the probability distribution over the actions that may actually be instructed when the agent intends to instruct $a$ in state $s$. For instance, suppose the set of applicable actions at $s$ is $A(s) = \{a, a', a''\}$ and $err(s, a) = [0.9, 0.04, 0.06]$. This means that, when the agent intends to instruct action $a$ at state $s$, with probability 0.04 and 0.06 it may instruct actions $a'$ and $a''$, respectively. Let $E = \{err(s, a) : (s, a) \in S \times A\}$ be the set of (state-dependent) action-instruction errors caused by the trembling hand. The set of actions that could be instructed when the agent intends to instruct action $a$ at state $s$ is the support set of $err(s, a)$, denoted by $supp(s, a) \subseteq A(s)$. In this example, $supp(s, a) = \{a, a', a''\}$.

Recall that, without a trembling hand, executing an agent strategy $\sigma_p$ in a deterministic domain $D$ results in a unique path. In the presence of a trembling hand, i.e., of action-instruction errors $E$, we instead get perturbed paths, where the actually instructed action may differ from the intended one with a probability following $E$.

**Definition 1** (Perturbed path in $D$). Let $D$ be a deterministic domain, $\sigma_p$ an agent strategy, and $E$ a set of action-instruction errors. A perturbed path in $D$ wrt $E$ is a sequence of triples $\rho(\sigma_p, E) = (s_0, a_0, a'_0)(s_1, a_1, a'_1)\cdots$, where for every $i \geq 0$, $a_i = \sigma_p(\rho^i)$ (the intended action), $a'_i \in supp(s_i, a_i)$ (the actually instructed action), and $s_{i+1} = F_d(s_i, a'_i)$. The set of all perturbed paths in $D$ wrt $\sigma_p$ and $E$ is denoted by $\Phi_{\sigma_p, E}$.

Given a deterministic domain $D$, an LTLf formula $\varphi$, and a set of action-instruction errors $E$, the probability of $\sigma_p$ enforcing $\varphi$ in $D$ wrt $E$ is defined as
$$Pr^{\sigma_p, E}_D(\varphi) := Pr_D(\{\rho \in \Phi_{\sigma_p, E} \mid \pi(\rho^k, D) \models \varphi \text{ for some } k \geq 0\}).$$

**Definition 2** (TH problem for LTLf planning in $D$). The problem is a tuple $P_d = (D, \varphi, E)$, where $D$ is a deterministic domain, $\varphi$ is an LTLf formula, and $E$ is a set of action-instruction errors. Solving $P_d$ consists in synthesizing an agent strategy $\sigma^*_p$ that maximizes the probability of enforcing $\varphi$ in $D$ wrt $E$, i.e., an optimal strategy for $P_d$, that is: $\sigma^*_p = \arg\max_{\sigma_p} Pr^{\sigma_p, E}_D(\varphi)$.
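As a concrete illustration of the error model $E$, the following hedged Python sketch (ours, with the same numbers as the example above and illustrative names) represents $err(s, a)$ as a dictionary, derives the support set, and samples an actually instructed action.

```python
# Illustrative sketch of the action-instruction error model E: for each
# (state, intended action) pair, a distribution over the actions that may
# actually be instructed under the trembling hand.
import random

# err[(s, a)] maps each possibly-instructed action to its probability.
err = {("s", "a"): {"a": 0.9, "a_prime": 0.04, "a_second": 0.06}}

def supp(s, a):
    """Support set of err(s, a): actions that may actually be instructed."""
    return {ap for ap, p in err[(s, a)].items() if p > 0.0}

def sample_instructed(s, a, rng=random):
    """Sample the actually instructed action when the agent intends a at s."""
    actions, probs = zip(*err[(s, a)].items())
    return rng.choices(list(actions), weights=list(probs), k=1)[0]

print(supp("s", "a"))               # {'a', 'a_prime', 'a_second'}
print(sample_instructed("s", "a"))  # usually 'a'
```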
In the following, an example is given to demonstrate a perturbed transition in a deterministic domain $D$.

**Example 1.** Consider a robot assembly problem, where the robot aims to assemble an arch using $N$ blocks. Figure 1 depicts the goal configurations for $N$ ($2 \leq N \leq 6$) blocks (which are used later in Section 5). Let $OBJ = \{Obj_i \mid i \in \{1, \ldots, N\}\}$ be the set of blocks and $LOC = \{L_j \mid j \in \{0, \ldots, M\}\}$ be the set of locations, where $L_0$ represents the storage. Initially, all the blocks are stored in the storage. During assembly, the robot can perform move actions to relocate blocks. The set of robot actions is $A = \{(Obj_i, L_j) : Obj_i \in OBJ, L_j \in LOC\} \cup \{\text{do-nothing}\}$, where $(Obj_i, L_j)$ means "move block $i$ to location $j$". Due to the trembling hand (which may be caused by drifting), the robot's actions are subject to uncertainty. For instance, if the robot intends to move block $i$ to location $j$, there is a probability that it mistakenly moves a different block $i' \neq i$ or inaccurately places the block at location $j' \neq j$. In addition, the probability of errors varies across actions. For instance, if the robot chooses do-nothing, one can safely assume that the probability of error is 0. However, if the robot chooses to move, we assume a positive probability of error, leading to perturbed transitions and hence to perturbed paths. Figure 2 shows an example of a perturbed transition in this case. The dashed arrow shows the intended action and the solid arrows represent the set of actions that may actually be instructed, with their respective probabilities.

*Figure 1: An arch. Left: 2, 3, and 4 blocks. Right: 5 and 6 blocks.*

*Figure 2: A perturbed transition in a deterministic domain (intended action blue-L2 with probability 0.9; erroneous actions red-L2 and blue-L3 with probabilities 0.06 and 0.04).*

### 3.2 MDPs With LTLf Objectives

Solving $P_d = (D, \varphi, E)$ requires a modeling technique that is able to capture the action-instruction errors $E$ in a deterministic domain. To this end, we first review Markov Decision Processes (MDPs), which are used in our solution technique. MDPs allow an action choice in each state, and each state-action transition is a probability distribution over successor states, which provides a natural way of capturing the probability of the agent mistakenly taking a different action in a deterministic domain.

Following [Puterman, 2014], an MDP is a tuple $M = (S, s_0, A, T, L)$, where $S$ is a finite set of states, $s_0 \in S$ is the initial state, $A$ is a finite set of actions, $T : S \times A \times S \to [0, 1]$ is the probabilistic transition function, and $L : S \to 2^{Prop}$ is the proposition labelling function. Analogously to planning domains, for each $s \in S$, the set of actions applicable at state $s$ is denoted by $A(s)$. Denote by $\xi = s_0 a_0 s_1 a_1 \cdots$ a path and by $FPaths$ the set of all finite paths of $M$. In this work, we focus on deterministic (rather than randomized) agent strategies for MDPs. A strategy $\sigma_m$ of $M$ is a function $\sigma_m : FPaths \to A$ such that, for each $\xi \in FPaths$, $\sigma_m(\xi) \in A(lst(\xi))$, where $lst(\xi)$ is the last state of $\xi$. We denote by $\Xi_{\sigma_m}$ the set of all possible infinite paths of $M$ generated by $\sigma_m$.

Given an MDP $M = (S, s_0, A, T, L)$ and a set of goal states $G \subseteq S$, the probability of an agent strategy $\sigma_m$ enforcing $G$ in $M$ is defined as $Pr^{\sigma_m}_M(G) := Pr_M(\{\xi \in \Xi_{\sigma_m} \mid lst(\xi^k) \in G \text{ for some } k \geq 0\})$. Computing an optimal strategy $\sigma^*_m$ that maximizes the probability of enforcing $G$ is the reachability problem over $M$. Analogously, an agent strategy can also enforce a temporal objective on an MDP. Given an MDP $M$ and an LTLf formula $\varphi$, the probability of an agent strategy $\sigma_m$ enforcing $\varphi$ in $M$ is defined as $Pr^{\sigma_m}_M(\varphi) := Pr_M(\{\xi \in \Xi_{\sigma_m} \mid \pi(\xi^k, M) \models \varphi \text{ for some } k \geq 0\})$. The problem of MDPs with LTLf objectives is to compute an optimal strategy $\sigma^*_m$ that maximizes the probability of enforcing $\varphi$ in $M$ [Baier and Katoen, 2008].

### 3.3 Solution Technique

We now present our solution technique to synthesize an optimal agent strategy $\sigma^*_p$ for $P_d = (D, \varphi, E)$, which aims to maximize the probability of enforcing $\varphi$ in $D$ in spite of $E$.
The key idea is to reduce $P_d$ to an MDP with an LTLf objective. Intuitively, the MDP provides a probabilistic abstraction of instructing mistaken actions in the domain due to the trembling hand. Hence, the MDP has the same states as the domain, retains its original action choices, and incorporates the probability of action-instruction errors into its transitions.

**Probabilistic abstraction.** We define an MDP $M = (S, s_0, A, T, L)$ from $P_d = (D, \varphi, E)$ as follows. $S$, $s_0$, $A$, and $L$ are the same as in $D$. In order to construct the probabilistic transition function $T : S \times A \times S \to [0, 1]$, note that $T(s, a, s')$ gives the probability of transitioning from state $s$ to $s'$ on action $a$. Assuming no trembling-hand errors, we would have $T(s, a, s') = 1$ if $s' = F_d(s, a)$ and $T(s, a, s') = 0$ otherwise. Due to trembling-hand errors, however, unintended actions (e.g., $a'$) may be instructed with some probability (given by $err(s, a)(a')$), thus resulting in a different successor state $s' = F_d(s, a')$. Following this observation, we construct $T$ as follows:
$$T(s, a, s') = \begin{cases} err(s, a)(a') & \text{if } a' \in A(s) \text{ and } s' = F_d(s, a'), \\ 0 & \text{otherwise.} \end{cases}$$
The TH problem for LTLf planning in deterministic domains is now reduced to an MDP with an LTLf objective.

**Theorem 1.** Let $P_d = (D, \varphi, E)$ be a TH problem as defined in Def. 2, and $M$ the MDP constructed above. An optimal strategy for $M$ with $\varphi$ is an optimal strategy for $P_d$ and vice versa, that is:
$$\sigma^*_p = \arg\max_{\sigma_m} \{Pr^{\sigma_m}_M(\varphi)\},$$
where $\sigma^*_p$ (the optimal strategy for $P_d$) is given in Def. 2.

*Proof.* First, we observe that $\sigma_p = \sigma_m$, by construction. Second, one can derive from Def. 1 that $\Phi_{\sigma_p, E} = \Xi_{\sigma_m}$ if $\sigma_p = \sigma_m$. Finally, given a state $s$ and an intended action $a = \sigma_p(s)$ in $P_d$, with probability $err(s, \sigma_p(s))(a')$ the agent may actually instruct a different action $a'$ and then transit to $s' = F_d(s, a')$. Denote by $Pr_{P_d}(s' \mid s, \sigma_p(s))$ the probability of transiting to state $s'$ from state $s$ with action $\sigma_p(s)$. We obtain that if $\sigma_p(s) = \sigma_m(s)$, then $Pr_{P_d}(s' \mid s, \sigma_p(s)) = err(s, \sigma_p(s))(a') = T(s, \sigma_m(s), s')$. Hence, the conclusion follows. □

Thm. 1 allows us to utilize existing algorithms for MDPs with LTLf objectives to solve the TH problem for LTLf planning in deterministic domains. The common approach is by reduction to the reachability problem on an MDP [Baier and Katoen, 2008; Wells et al., 2021]. More specifically, given an MDP $M$ with LTLf objective $\varphi$, we first construct the corresponding DFA $Aut_\varphi$ of the LTLf formula $\varphi$, and then construct the product MDP $M'$ of $M$ and $Aut_\varphi$. In this case, computing an optimal strategy for $M$ with $\varphi$ reduces to the reachability problem over $M'$, where the goal states $G$ are those states of $M'$ whose automaton component is an accepting state of $Aut_\varphi$. The reachability problem of $M'$ wrt $G$ can be solved via value iteration, strategy iteration, or linear programming [Altman, 1998; Wells et al., 2021]. Due to the cross product, every finite path $\rho \in FPaths_D$ with $\rho = s_0 a_0 s_1 a_1 \cdots s_{k+1}$, where $FPaths_D$ denotes the finite paths of $D$, corresponds to a finite path $\rho'$ of $M'$, where $\rho' = (s_0, q_0) a_0 (s_1, q_1) a_1 \cdots (s_k, q_k) a_k (s_{k+1}, q_{k+1})$. Therefore, every strategy $\sigma'_m$ for $M'$ wrt $G$ induces a strategy for $P_d = (D, \varphi, E)$ as follows: $\sigma_p(\rho) = \sigma'_m(\rho')$ for $\rho \in FPaths_D$. Together with Thm. 1, the following theorem is an immediate consequence.

**Theorem 2.** Let $P_d = (D, \varphi, E)$ be a TH problem as defined in Def. 2, $M' = M \times Aut_\varphi$ the constructed product MDP, and $G$ the set of goal states. The computed optimal strategy $\sigma^*_m$ of $M'$ with reachability goal $G$ induces an optimal strategy $\sigma^*_p$ for $P_d$.
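The following hedged Python sketch (ours, not the paper's code) builds the transition function of the probabilistic abstraction from a deterministic domain and the error model, following the case distinction above; the dictionary encoding and function name are illustrative assumptions. When several mistakenly instructed actions lead to the same successor state, their probabilities are accumulated.

```python
# Hedged sketch: constructing T(s, a, .) of the MDP abstraction from the
# deterministic transition function F_d and the error model err.
from collections import defaultdict

def build_mdp_transitions(states, applicable, F_d, err):
    """
    states:     iterable of domain states
    applicable: dict s -> set of applicable actions A(s)
    F_d:        dict (s, a) -> successor state
    err:        dict (s, a) -> {actually_instructed_action: probability}
    returns     dict (s, a) -> {s': probability}
    """
    T = {}
    for s in states:
        for a in applicable[s]:
            dist = defaultdict(float)
            for a_actual, p in err[(s, a)].items():
                if p > 0.0 and a_actual in applicable[s]:
                    dist[F_d[(s, a_actual)]] += p   # several a' may reach the same s'
            T[(s, a)] = dict(dist)
    return T
```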
## 4 TH in Nondeterministic Domains

We now turn to the case in which the domain is nondeterministic, i.e., the environment can choose its strategy adversarially. Hence, in planning, we have to overcome two forms of uncertainty: the stochastic uncertainty from the trembling hand and the adversarial uncertainty from the environment.

### 4.1 Problem Formulation

Consider the case where the agent acts in a nondeterministic (adversarial) domain. We assume the nondeterminism is unquantifiable, i.e., it is adversarial without a probabilistic behavior. In this setting, we want to synthesize an agent strategy that maximizes the probability of achieving its goal in spite of the adversarial behavior of the environment and the (small) probability of instructing wrong actions at every step. In other words, the agent seeks a maximin strategy, that is, a strategy that maximizes the minimal probability across all possible environment strategies. We find this case to be particularly interesting, since it combines the probabilistic aspect of the previous case's action-instruction errors with the environment's adversarial nondeterminism.

We first define perturbed paths of an agent acting in a nondeterministic domain $N$ with action-instruction errors $E$, i.e., the trembling hand, and adversarial environment behaviors.

**Definition 3** (Perturbed paths in $N$). Let $N$ be a nondeterministic domain, $\sigma_p$ an agent strategy, $\gamma_p$ an environment strategy, and $E$ a set of action-instruction errors. A perturbed path in $N$ wrt $E$ is a sequence of triples $\rho(\sigma_p, \gamma_p, E) = (s_0, a_0, a'_0)(s_1, a_1, a'_1)\cdots$. We denote by $\rho^{\downarrow}$ the projection of $\rho$ obtained by keeping only the states $s_i$ and the actually instructed actions $a'_i$. It holds that, for every $i \geq 0$, $a_i = \sigma_p(\rho^{\downarrow i})$ (the intended action), $a'_i \in supp(s_i, a_i)$ (the actually instructed action), and $s_{i+1} = \gamma_p(\rho^{\downarrow i}, a'_i)$. The set of all perturbed paths in $N$ wrt $\sigma_p$, $\gamma_p$, and $E$ is denoted by $\Phi_{\sigma_p, \gamma_p, E}$.

The probability of an agent strategy $\sigma_p$ enforcing $\varphi$ under environment strategy $\gamma_p$ and action-instruction errors $E$ is denoted by $Pr^{\sigma_p, \gamma_p, E}_N(\varphi)$.

**Definition 4** (TH problem for LTLf planning in $N$). The problem is a tuple $P_n = (N, \varphi, E)$, where $N$ is a nondeterministic domain, $\varphi$ is an LTLf formula, and $E$ is a set of action-instruction errors. Solving $P_n$ consists in synthesizing an agent strategy $\sigma^*_p$ that maximizes the probability of enforcing $\varphi$ in $N$ with $E$ in spite of the adversarial strategies of the environment, i.e., an optimal strategy for $P_n$, that is:
$$\sigma^*_p = \arg\max_{\sigma_p} \min_{\gamma_p} Pr^{\sigma_p, \gamma_p, E}_N(\varphi).$$

**Example 2.** Consider the human-robot co-assembly problem adapted from [He et al., 2019], where the robot aims to assemble an arch together with a human in a shared workspace. During assembly, the robot can perform actions to relocate blocks, and the human may perform moves to intervene. The robot has a trembling hand as described in Example 1. Compounding this issue, the robot must ensure that the arch is successfully built, taking the human involvement into account.

*Figure 3: A perturbed transition in a nondeterministic domain (intended action blue-L2 with probability 0.9; erroneous actions red-L2 and blue-L3 with probabilities 0.06 and 0.04).*

At each step, the robot instructs an action to relocate a block, though with a trembling hand, and then the human can react by moving blocks among locations. Human movements are not controllable, thus introducing nondeterminism into the consequences of instructed robot actions.
Figure 3 shows an example of a perturbed transition for a robot working in such a nondeterministic domain with a trembling hand. Note that each intended action (dashed arrow) corresponds to a set of actions (solid arrows) that may actually be instructed, together with their respective probabilities. Furthermore, each actually instructed action leads to a set of possible outcomes due to the uncontrollable human intervention.

### 4.2 MDPSTs With LTLf Objectives

We now introduce Markov Decision Processes with Set-valued Transitions (MDPSTs), which combine probabilistically quantifiable uncertainty with unquantifiable uncertainty (nondeterminism), hence providing a natural probabilistic abstraction of an agent acting with a trembling hand in a nondeterministic domain. Note that MDPSTs are favorable due to their simplified Bellman equation compared to MDPIPs, UMDPs, and thus stochastic games. This distinction will be further clarified in Section 4.3.

An MDPST is a tuple $M_N = (S, s_0, A, F, T_N, L)$, where $S$, $s_0$, $A$, and $L$ are defined as for MDPs, $F : S \times A \to 2^{2^S}$ is the set-valued nondeterministic transition function, and $T_N : S \times A \times 2^S \to [0, 1]$ is the transition probability (or mass assignment) function, i.e., given a set $\Theta \in F(s, a)$, where $\Theta \subseteq S$, $T_N(s, a, \Theta) = Pr(\Theta \mid s, a)$. Note that, in MDPSTs, the transition function $F(s, a)$ returns a set of state sets, i.e., $F(s, a) \subseteq 2^S$, and the transition probability function $T_N$ expresses the probability of transitioning to such a set via a given action. It is worth noting that the problem of MDPSTs with LTLf objectives is studied for the first time in this paper. To define the problem, one needs to lift the definition of enforcing LTLf objectives from MDPs to MDPSTs by defining a notion of robustness for strategies that incorporates the unquantifiable uncertainty.

In an MDPST, due to the unquantifiable uncertainty, the distribution over the set of reachable successor states of a state-action pair $(s, a)$, where $a \in A(s)$, is not uniquely determined by $(s, a)$, in contrast to MDPs. We denote by $Post(s, a)$ the set of reachable states of $(s, a)$, under both quantifiable and unquantifiable uncertainty. Formally, $Post(s, a) = \{s' \mid \exists \Theta \in F(s, a) \text{ s.t. } T_N(s, a, \Theta) > 0 \text{ and } s' \in \Theta\}$.

A feasible distribution of an MDPST guarantees that, given a state-action pair $(s, a)$, where $a \in A(s)$, (i) exactly one state is chosen within $\Theta$ for each $\Theta \in F(s, a)$; (ii) the probabilities of selecting a state from $Post(s, a)$ sum to 1; and (iii) the probability of selecting a state outside $Post(s, a)$ is 0. In the following definition, $\iota^{\Theta}_{s'}$ indicates whether $s'$ is in $\Theta$: $\iota^{\Theta}_{s'} = 1$ if $s' \in \Theta$ and $\iota^{\Theta}_{s'} = 0$ otherwise. Furthermore, $\alpha^{\Theta}_{s'}$ indicates whether $s'$ is selected from $\Theta$: $\alpha^{\Theta}_{s'} = 1$ if $s'$ is selected from $\Theta$, and $\alpha^{\Theta}_{s'} = 0$ otherwise.

**Definition 5** (Feasible distribution in MDPSTs). Let $M_N = (S, s_0, A, F, T_N, L)$ be an MDPST and $(s, a)$ a state-action pair, where $a \in A(s)$. $h^a_s \in Dist(S)$ is a feasible distribution of $(s, a)$, denoted by $s \xrightarrow{a} h^a_s$, if (i) $\sum_{s' \in \Theta} \alpha^{\Theta}_{s'} = 1$ for every $\Theta \in F(s, a)$; (ii) $h^a_s(s') = \sum_{\Theta \in F(s, a)} \iota^{\Theta}_{s'} \, \alpha^{\Theta}_{s'} \, T_N(s, a, \Theta)$ for $s' \in Post(s, a)$; and (iii) $h^a_s(s') = 0$ for $s' \in S \setminus Post(s, a)$.

Following Def. 5, it is evident that, in MDPSTs, the feasible distribution of a given state-action pair is not uniquely determined, in contrast to MDPs. It depends on $\alpha^{\Theta}_{s'}$ (see item (ii)), the unquantifiable choice of whether $s'$ is selected from $\Theta$.
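As a worked illustration of Def. 5, the hedged sketch below (ours; the dictionary encoding `F[(s, a)] = {Theta: probability}` is an assumption) computes $Post(s, a)$ and enumerates every feasible distribution by choosing exactly one state per outcome set $\Theta$, i.e., by ranging over the $\alpha^{\Theta}_{s'}$ choices.

```python
# Hedged sketch: Post(s, a) and the feasible distributions of a state-action
# pair in an MDPST, with outcome sets given as frozensets of states.
from itertools import product

def post(F, s, a):
    """All states reachable from (s, a) under some outcome set Theta."""
    return {sp for Theta, p in F[(s, a)].items() if p > 0 for sp in Theta}

def feasible_distributions(F, s, a):
    """Enumerate feasible distributions: pick exactly one state per Theta
    (the alpha choice), then accumulate the mass T_N(s, a, Theta) on it."""
    thetas = [(Theta, p) for Theta, p in F[(s, a)].items() if p > 0]
    for picks in product(*[sorted(Theta) for Theta, _ in thetas]):
        h = {}
        for (Theta, p), sp in zip(thetas, picks):
            h[sp] = h.get(sp, 0.0) + p
        yield h

F = {("s", "a"): {frozenset({"s1"}): 0.9, frozenset({"s2", "s3"}): 0.1}}
print(post(F, "s", "a"))                        # the set {s1, s2, s3}
print(list(feasible_distributions(F, "s", "a")))
# [{'s1': 0.9, 's2': 0.1}, {'s1': 0.9, 's3': 0.1}]
```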
We now introduce the notion of nature for MDPSTs to characterize this unquantifiable uncertainty, motivated by the definition of nature in robust MDPs [Nilim and El Ghaoui, 2004]. One can intuitively perceive nature as the environment in the context of nondeterministic domains, playing the role of resolving the nondeterminism. We denote by $H^a_s$ the set of feasible distributions of the state-action pair $(s, a)$. Analogously to MDPs, $FPaths$ denotes the set of finite paths of an MDPST.

**Definition 6** (Nature for MDPSTs). A nature of an MDPST is a function $\gamma_m : FPaths \times A \to Dist(S)$ such that $\gamma_m(\xi, a) \in H^a_{lst(\xi)}$ for every $\xi \in FPaths$ and $a \in A(lst(\xi))$.

Suppose we fix a nature $\gamma_m$. The probability of an agent strategy $\sigma_m$ enforcing $\varphi$ is denoted by $Pr^{\sigma_m, \gamma_m}_{M_N}(\varphi) := Pr_{M_N}(\{\xi \in \Xi_{\sigma_m, \gamma_m} \mid \pi(\xi^k, M_N) \models \varphi \text{ for some } k \geq 0\})$, where $\Xi_{\sigma_m, \gamma_m}$ is the set of all possible paths generated by the agent strategy $\sigma_m$ and nature $\gamma_m$. We now define (optimal) robust strategies for MDPSTs, which quantify over all natures.

**Definition 7** (Robust strategy). Let $M_N$ be an MDPST, $\varphi$ an LTLf formula, and $\beta \in [0, 1]$ a threshold. An agent strategy $\sigma_m$ robustly enforces $\varphi$ in $M_N$ wrt $\beta$ if, for every nature $\gamma_m$, the probability of the generated paths satisfying $\varphi$ is no less than $\beta$, that is, $\underline{Pr}^{\sigma_m}_{M_N}(\varphi) \geq \beta$, where $\underline{Pr}^{\sigma_m}_{M_N}(\varphi) := \min_{\gamma_m} \{Pr^{\sigma_m, \gamma_m}_{M_N}(\varphi)\}$. Such a $\sigma_m$ is referred to as a robust strategy for $M_N$ (wrt $\beta$).

**Definition 8** (Optimal robust strategy). A strategy $\sigma^*_m$ that optimally robustly enforces an LTLf formula $\varphi$ in an MDPST $M_N$ is $\sigma^*_m = \arg\max_{\sigma_m} \{\underline{Pr}^{\sigma_m}_{M_N}(\varphi)\}$. In this case, $\sigma^*_m$ is referred to as an optimal robust strategy for $M_N$.

The problem of MDPSTs with LTLf objectives is to compute an optimal robust strategy $\sigma^*_m$, which maximizes the probability of robustly enforcing $\varphi$ in $M_N$, i.e., achieves the maximal value of $\beta$. Analogously, one can also define an MDPST with simple reachability, i.e., reaching a set of goal states; to avoid repetition, this definition is omitted.

### 4.3 Solution Technique

The key idea to solve the TH problem for LTLf planning in nondeterministic domains is to combine the quantifiable action-instruction errors and the unquantifiable adversarial nondeterminism of the domain into an MDPST. This MDPST has the same states as the domain $N$, and incorporates the action-instruction errors and the domain's adversarial nondeterminism into its transitions. In this way, we reduce $P_n$ to an MDPST with an LTLf objective.

**Probabilistic abstraction.** We build an MDPST $M_N = (S, s_0, A, F, T_N, L)$ from $P_n = (N, \varphi, E)$ as follows. $S$, $s_0$, $A$, and $L$ are the same as in $N$. To construct the set-valued nondeterministic transition function $F$ and the mass assignment function $T_N$, we incorporate the action-instruction errors $E$ into the probabilistic transitions, much like in deterministic domains. In nondeterministic domains, however, the successor of each state-action pair is not a single state, due to the adversarial environment behaviour. Consequently, the successor of each probabilistic transition is a set, with the environment deciding which element of the set to transit to. Based on these observations, we construct the set-valued nondeterministic transition function $F$ as $F(s, a) = \bigcup_{a' \in A(s)} \{F_n(s, a')\}$, so that $F(s, a)$ is a set of subsets of $S$, for $s \in S$ and $a \in A(s)$. The corresponding mass assignment function $T_N$ is given by
$$T_N(s, a, \Theta) = \begin{cases} err(s, a)(a') & \text{if } a' \in A(s) \text{ and } \Theta = F_n(s, a'), \\ 0 & \text{otherwise.} \end{cases}$$
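The hedged sketch below (ours, not the released implementation) assembles this abstraction from a nondeterministic domain and the error model, producing outcome sets and their masses in the same `{Theta: probability}` encoding used in the earlier sketches; all names are illustrative.

```python
# Hedged sketch: constructing the set-valued transitions F and mass assignment
# T_N of the MDPST abstraction from F_n and err, following the cases above.
def build_mdpst_transitions(states, applicable, F_n, err):
    """
    F_n: dict (s, a) -> set of possible successor states (environment choice)
    err: dict (s, a) -> {actually_instructed_action: probability}
    returns dict (s, a) -> {Theta (frozenset of states): probability}
    """
    FT = {}
    for s in states:
        for a in applicable[s]:
            outcome = {}
            for a_actual, p in err[(s, a)].items():
                if p > 0.0 and a_actual in applicable[s]:
                    Theta = frozenset(F_n[(s, a_actual)])
                    outcome[Theta] = outcome.get(Theta, 0.0) + p  # merge identical outcome sets
            FT[(s, a)] = outcome
    return FT
```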
We now reduce $P_n$ to an MDPST with an LTLf objective.

**Theorem 3.** Let $P_n = (N, \varphi, E)$ be a problem as defined in Def. 4, and $M_N$ the MDPST constructed above. An optimal robust strategy for $M_N$ with $\varphi$ is an optimal strategy for $P_n$, and vice versa, that is:
$$\sigma^*_p = \arg\max_{\sigma_m} \{\underline{Pr}^{\sigma_m}_{M_N}(\varphi)\},$$
where $\sigma^*_p$ is given in Def. 4 and $\underline{Pr}^{\sigma_m}_{M_N}(\varphi)$ is defined as in Def. 7.

*Proof.* First, we observe that $\sigma_p = \sigma_m$, by construction. Second, one can derive from Definitions 4, 5, and 6 that $\bigcup_{\gamma_p} \Phi_{\sigma_p, \gamma_p, E} = \bigcup_{\gamma_m} \Xi_{\sigma_m, \gamma_m}$ if $\sigma_p = \sigma_m$. The rest of the proof can be completed similarly to that of Theorem 1. □

Finally, we propose an algorithm to solve the problem of MDPSTs with LTLf objectives. The algorithm is based on a reduction to an MDPST with simple reachability, i.e., reaching a set of goal states, and is motivated by existing value iteration algorithms for MDPs with LTLf objectives [Wells et al., 2020b]. Essential adaptations, however, are needed to handle the set-valued transitions of MDPSTs. Given an MDPST $M_N = (S, s_0, A, F, T_N, L)$ and an LTLf formula $\varphi$, we first construct the corresponding DFA $Aut_\varphi = (2^{Prop}, Q, q_0, \delta, acc)$ of $\varphi$. Then, the product MDPST $M'_N = (S', s'_0, A', F', T', L')$ is constructed accordingly, with $S' = S \times Q$ and $A' = A$. The set of goal states is given by $acc' = \{(s, q) \in S' \mid q \in acc\}$.

For the efficiency of strategy synthesis, we further introduce an optimization that computes an optimal robust strategy on a sub-MDPST, which only consists of states that are (forward) reachable from the initial state $s'_0$ and (backward) reachable from the set of accepting states $acc'$. To do so, we first partition $S'$ wrt the initial state $s'_0$ and the set of accepting states $acc'$. Specifically, let $S_r \subseteq S'$ be the set of states that can be reached from $s'_0$. We partition $S'$ into $S' = S_n \cup S_d \cup S_p$, where $S_n = S' \setminus S_r$ consists of states that cannot be reached from $s'_0$, $S_d \subseteq S_r$ consists of states that are reachable from $s'_0$ but from which no state in $acc'$ can be reached, and $S_p = S_r \setminus S_d$ contains the states that are reachable from $s'_0$ and from which $acc'$ is reachable.

We construct a sub-MDPST $Z = (S_p, s'_0, A_p, F_p, T_p, L_p)$ from $M'_N$ with respect to $S_p$ as follows. $A_p = A' \cup \{a_\epsilon\}$, where $a_\epsilon$ denotes a self-loop action. The set-valued transition function is such that $F_p(s, a) = F'(s, a)$ for $s \in S_p \setminus acc'$, and $F_p(s, a_\epsilon) = \{\{s\}\}$ for $s \in acc'$. The mass assignment function $T_p$ is then given by (i) $T_p(s, a, \Theta) = T'(s, a, \Theta)$ for $\Theta \in F_p(s, a)$, $s \in S_p \setminus acc'$, and $a \in A'$, and (ii) $T_p(s, a_\epsilon, \{s\}) = 1$ for $s \in acc'$.

Define a value function $V_Z : S_p \to \mathbb{R}_{\geq 0}$ by $V_Z(s) = \max_{\sigma_m} \min_{\gamma_m} \{Pr^{\sigma_m, \gamma_m}_Z(acc')\}$, which represents the maximal probability of reaching $acc'$ from $s$. One then has $V_Z(s) = 1$ for $s \in acc'$. For $s \in S_p \setminus acc'$, the Bellman principle of optimality gives [Satia and Lave Jr, 1973]:
$$V_Z(s) = \max_{a \in A_p(s)} \min_{Pr(\cdot \mid s, a) \in H^a_s} \Big\{ \sum_{s' \in S_p} Pr(s' \mid s, a)\, V_Z(s') \Big\}. \tag{1}$$
Moreover, it was further shown in [Trevizan et al., 2007] that a simplified Bellman equation exists for MDPSTs: one can safely pull the min operator inside the summation, which gives the more efficient variant
$$V_Z(s) = \max_{a \in A_p(s)} \Big\{ \sum_{\Theta \in F_p(s, a)} T_p(s, a, \Theta) \, \min_{s' \in \Theta} \{V_Z(s')\} \Big\}. \tag{2}$$
An optimal robust strategy $\sigma^*_m$ can be derived from $V_Z$. Analogously to Sec. 3.3, due to the cross product, every strategy $\sigma_m$ for $M'_N$ wrt $acc'$ induces a strategy for $P_n = (N, \varphi, E)$. Together with Thm. 3, we have the following.

**Theorem 4.** Let $P_n = (N, \varphi, E)$ be as defined in Def. 4, $M'_N = M_N \times Aut_\varphi$ the constructed product MDPST, and $acc'$ the set of goal states. The computed optimal robust strategy $\sigma^*_m$ of $M'_N$ with reachability goal $acc'$ induces an optimal strategy $\sigma^*_p$ for $P_n$.
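The following hedged sketch (ours) implements the simplified robust value iteration of Eqn. (2) over an MDPST given in the `FT[(s, a)] = {Theta: probability}` encoding of the earlier sketches; the stopping threshold and input format are assumptions, and an optimal robust strategy can be read off by recording, for each state, an action attaining the maximum.

```python
# Hedged sketch of robust value iteration per Eqn. (2): the adversary (nature)
# picks the worst state inside each outcome set Theta; the agent maximizes.
def robust_value_iteration(states, applicable, FT, goal, eps=1e-3):
    V = {s: (1.0 if s in goal else 0.0) for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in goal:
                continue                         # V(s) = 1 is fixed on goal states
            best = 0.0
            for a in applicable[s]:
                q = sum(p * min(V[sp] for sp in Theta)
                        for Theta, p in FT[(s, a)].items())
                best = max(best, q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V
```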
**Remark 1.** Let $\bar{F} = \max_{s \in S_p} \{\max_{a \in A_p(s)} |F_p(s, a)|\}$ be an upper bound on $|F_p(s, a)|$ over all $(s, a) \in S_p \times A_p$, and let $\epsilon$ be the convergence precision of the value iteration (i.e., of Eqn. (2)). The complexity of computing an $\epsilon$-suboptimal solution is $O(|S_p|^2 |A_p| \bar{F} \log\frac{1}{\epsilon})$.

We highlight that MDPSTs are a subclass of MDPIPs, and they admit more efficient strategy synthesis algorithms than general MDPIPs. This is because, for general MDPIPs, each round of value iteration of the corresponding Bellman equation (Eqn. (1)) involves two stages: 1) the inner minimization problem is solved (e.g., using a bisection algorithm) for each state and each action, and 2) the value function is updated with dynamic programming.

## 5 Implementation and Experimental Results

We implemented the solution technique described in Sec. 4, which subsumes the method described in Sec. 3, in Python, and we use Lydia [De Giacomo and Favorito, 2021] for the LTLf-to-DFA construction. The implementation details of our algorithms and experiments can be found on GitHub: https://github.com/piany/Tremblinghand_LTLf. In this section, we present a case study to demonstrate the effectiveness of the proposed method.

Our case study is based on the human-robot co-assembly problem described in Example 2. The trembling-hand robot aims to stack blocks into a certain configuration under unpredictable human interventions. Note that we assume the human has only a limited number $K$ of moves (otherwise, the robot has no way to guarantee task completion [He et al., 2019]). In particular, we consider configurations that place certain objects at certain locations. The goal configurations for $N$ blocks are depicted in Figure 1, involving from 2 to 6 objects (i.e., $|OBJ| \in \{2, 3, \ldots, 6\}$).

**Implementation.** The key challenge in the implementation is an effective representation of the planning domain with error actions. On the one hand, it impacts the efficiency of model construction: as shown in [Wells et al., 2021], the state space of the co-assembly problem grows exponentially in the number of objects, so it is extremely challenging to build a tractable model for a large number of objects. On the other hand, the efficiency of strategy synthesis is also impacted, as discussed in Remark 1. In [Wells et al., 2021], three different choices are examined to encode the state space: integer encoding (states are enumerated by breadth-first search), object encoding (tuples, e.g., (1, 2, 0), mapping each object to its location), and location encoding (tuples, e.g., (0, 1, 1, 0), mapping each location to the number of objects therein), where location encoding shows the best overall performance. We follow this observation and use location encoding for state-space representation. Note, however, that in our case a single tuple is not enough to represent a state. This is due to the one-to-one correspondence between objects and locations in the goal configuration.

*Figure 4: An execution example of an optimal strategy for the arch-building task. Robot-intended actions, robot-executed actions, and human interventions are shown in black, brick, and blue, respectively.*
For example, in the configuration involving 5 objects (whose goal configuration is shown on the upper right of Figure 1), if the locations of the green and blue blocks are interchanged, the task is deemed incomplete. To resolve this issue, we use a tuple of tuples for state encoding. For instance, the state ((0, 0, 0, 1), (1, 0, 0, 0), (0, 1, 0, 0)) means that $Obj_1$ is at location 3, $Obj_2$ is in the storage (the first element of each tuple represents the storage), and $Obj_3$ is at location 1. Therefore, given $N$ objects, the co-assembly domain has at most $2^{N(N+1)}$ states. Luckily, many of these states are invalid due to physical constraints, e.g., one block cannot be at two different locations, and each location (except for the storage) can hold at most one block. Using these physical constraints, we design a recursive algorithm (which avoids enumerating all possible states, hence greatly improving efficiency) to prune the co-assembly state space so that only valid states are maintained. The details of the state pruning algorithm can be found in Appendix A.

Moreover, the $K$-move limitation of the human should also be taken into account. We use a counter $C = \{0, 1, \ldots, K\}$ to record the human moves. Together with the (pruned) state space $X$ of the domain $N = (S, s_0, A, F_n, L)$, i.e., $X \subseteq S$, we obtain an augmented state space $S' = X \times C$. If the counter value is $K$, then the set of applicable human actions is restricted to do-nothing for all $s \in S'$. Finally, we construct the probabilistic abstraction of the co-assembly problem in the form of an MDPST over $S'$, following Sec. 4.3.

Figure 4 depicts a (simplified) execution example of an optimal strategy for an arch-building task with 5 blocks at 5 given locations (see the right-most arch). All blocks are in the storage (state $s_0$) at the beginning. Then, the robot intends to put the green block at L1, which succeeds without any interference from the human (state $s_1$). Next, the robot intends to finish building the base level by putting the blue block at L2. However, due to the trembling hand, the blue block is put at L3. Furthermore, the human puts the red block at L2 to prevent the robot from building the arch (state $s_2$). Note that the red block is supposed to be at L5 (see the right-most arch). In this case, the robot intends to remove the red block, which succeeds (state $s_3$). After a finite number of steps, in spite of the trembling hand and the interventions of the human, the robot builds the arch (state $s_n$).

**Experimental results.** In our experiments, the convergence precision for the value iteration in Eqn. (2) was set to $10^{-3}$. All experiments were carried out on a MacBook Pro (2.6 GHz 6-core Intel Core i7 and 16 GB of RAM).

We first show the effectiveness of the state-pruning technique. For the case $K = 3$, the number of states and transitions in the constructed MDPST is shown in Figure 5 for different numbers of objects ($2 \leq |OBJ| \leq 6$). It is worth noting that six is the maximum number of objects that has been considered in the literature [Wells et al., 2021] (due to the exponential blowup in the number of states). It is evident that the state space, post-pruning, exhibits much slower growth than exponential expansion.

*Figure 5: Number of states and transitions in the constructed MDPST for $2 \leq |OBJ| \leq 6$ (in log scale).*

*Figure 6: Computation time for model construction and strategy synthesis for $2 \leq |OBJ| \leq 6$.*
In addition, the computation times for model construction (including state-space pruning and MDPST construction) and strategy synthesis (including DFA construction, product MDPST computation, and the robust value iteration of Eqn. (2)) for different numbers of objects ($2 \leq |OBJ| \leq 6$) are depicted in Figure 6. Note that LTLf synthesis in nondeterministic domains is 2EXPTIME-complete in the size of the LTLf formula and EXPTIME-complete in the size of the domain [De Giacomo et al., 2023]. Moreover, the probabilistic behaviour of the agent (caused by the trembling hand) further complicates the synthesis problem. Nevertheless, one can see that, for the case of six objects, the overall problem (including model construction and strategy synthesis) can still be solved quite efficiently using the proposed algorithm.

For the case of 5 objects, we further explore situations where the upper bound on human interventions ranges from 3 to 8 (i.e., $K \in \{3, 4, \ldots, 8\}$). The experimental results are shown in Table 1, which lists the size of the problem, i.e., the number of states and transitions, and the computation times for MDPST construction and strategy synthesis with respect to $K$. It is clear that, although the number of states and transitions grows linearly as $K$ increases (as does the model construction time), the synthesis time grows faster. This is because computing an $\epsilon$-suboptimal solution for an MDPST $M_N = (S, s_0, A, F, T_N, L)$ that consists of reachable states only (following the partition optimization described in Section 4.3) takes $O(|S|^2 |A| \bar{F} \log\frac{1}{\epsilon})$ time.

| K | States | Transitions | Model Const. (s) | Synthesis (s) |
|---|--------|-------------|------------------|---------------|
| 3 | 1724 | 5324 | 3.222 | 45.81 |
| 4 | 2155 | 6655 | 4.685 | 78.820 |
| 5 | 2586 | 7986 | 6.958 | 132.42 |
| 6 | 3017 | 9317 | 9.283 | 190.203 |
| 7 | 3448 | 10648 | 12.383 | 263.826 |
| 8 | 3879 | 11979 | 15.768 | 347.015 |

*Table 1: Model construction and synthesis computation times for 5 objects.*

## 6 Conclusions

In this paper, we have investigated the trembling-hand problem for LTLf planning in deterministic and nondeterministic domains. We formulate the problem formally by defining action-instruction errors and the perturbed paths they induce. For the case of deterministic domains, we show that the problem can be reformulated as an MDP with an LTLf objective, leveraging existing algorithms for synthesis. In the case of nondeterministic domains, on the other hand, we utilise MDPSTs with LTLf objectives, for which we propose an efficient robust value iteration algorithm for synthesis. In particular, MDPSTs with LTLf objectives have been studied here for the first time. We also demonstrate the promising scalability of the proposed algorithm in a case study. For future work, we plan to leverage symbolic techniques for synthesis, aiming to improve efficiency.

## Appendix A: State Pruning Algorithm

Given $N$ objects, the co-assembly domain has at most $2^{N(N+1)}$ states. To obtain an effective representation of the co-assembly domain, we design a state pruning algorithm (Algorithm 1) that takes into account the physical constraints: (1) each block cannot be at more than one location, and (2) each location (except for the storage) can hold at most one block. In this way, only the set of valid states that satisfy these constraints is maintained for strategy synthesis.
```
Algorithm 1: State Pruning
Input: the number of objects N.
Return: the set of valid states Svalid(N).

 1: for i = 2, ..., N do
 2:   if i = 2 then
 3:     S ← Create_States(i, i+1)
 4:     for s ∈ S do
 5:       if Rows_Sum_Is_One(s) and Cols_Sum_Is_One(s) then
 6:         add s into Svalid(i)
 7:       end if
 8:     end for
 9:   end if
10:   if i ≥ 3 then
11:     rvalid ← Create_Row(i)
12:     for s_{i-1} ∈ Svalid(i-1) do
13:       for r(k) in s_{i-1} do
14:         rnew(k) ← {r(k) + (0), (0, ..., 0, 1)}
15:       end for
16:       Scomb ← (rnew(0), ..., rnew(i-2), rvalid)
17:       for sexpand in Product(Scomb) do
18:         if Rows_Sum_Is_One(sexpand) and Cols_Sum_Is_One(sexpand) then
19:           add sexpand to Svalid(i)
20:         end if
21:       end for
22:     end for
23:   end if
24: end for
```

Note that Algorithm 1 employs a recursive approach, strategically avoiding the enumeration of all possible states; this recursive structure is what makes it efficient. Algorithm 1 takes as input the number of objects $N$ and outputs the set of valid states $S_{valid}(N)$. It starts by computing $S_{valid}(2)$ for 2 objects (lines 2-9), followed by the recursive computation of $S_{valid}(3)$ through $S_{valid}(N)$ (lines 10-23). When $i = 2$, we first create all possible states $S$ using Create_States($i$, $i+1$), where each state is a $2 \times 3$ tuple and each element of the tuple is either 0 or 1 (line 3). Then, for each state $s \in S$, if all row sums equal 1 (checked by Rows_Sum_Is_One) and all column sums (except for column 0, which represents the storage) equal 1 (checked by Cols_Sum_Is_One), then $s$ is added to $S_{valid}(2)$ (lines 4-8). When $i \geq 3$, we compute $S_{valid}(i)$ by expanding each state $s_{i-1} \in S_{valid}(i-1)$ (which is an $(i-1) \times i$ tuple) into a set of valid states $s_i$ (which are $i \times (i+1)$ tuples). In line 11, the sub-algorithm Create_Row($i$) returns the set of $1 \times (i+1)$ tuples whose row sum equals 1 (i.e., $r_{valid}$). In lines 13-15, we expand each row $r(k)$ of $s_{i-1}$ into a set of two rows, i.e., $r_{new}(k)$, by adding one element (either 0 or 1) to the end of $r(k)$. Then, in line 17, the sub-algorithm Product(·) computes a set of tuples containing all possible combinations of elements from $r_{new}(0), \ldots, r_{new}(i-2)$ and $r_{valid}$ (i.e., $S_{comb}$). For each $s_{expand}$ so obtained, if all row sums equal 1 and all column sums (except for column 0) equal 1, the state $s_{expand}$ is added to $S_{valid}(i)$ (lines 17-21).
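For concreteness, here is a hedged Python reconstruction of the pruning idea (ours, not the released implementation): it builds the valid state set for $i$ objects from that for $i-1$ objects in the spirit of Algorithm 1, using as validity check the physical constraints stated in Section 5 (every object in exactly one place; at most one object per non-storage location), which we assume here in place of the exact column test of Cols_Sum_Is_One.

```python
# Hedged, illustrative reconstruction of the state-pruning idea: a state is a
# tuple of rows, one per object; column 0 is the storage, columns 1..i are locations.
from itertools import product

def is_valid(state):
    # Each object is in exactly one place; each non-storage location holds <= 1 object.
    if any(sum(row) != 1 for row in state):
        return False
    n_cols = len(state[0])
    return all(sum(row[c] for row in state) <= 1 for c in range(1, n_cols))

def valid_states(n_objects):
    """Build Svalid(i) for i = 2..n_objects by expanding Svalid(i-1),
    instead of enumerating all 2^(i(i+1)) candidate states."""
    # Base case: 2 objects, 3 columns (storage + 2 locations).
    rows2 = [r for r in product((0, 1), repeat=3) if sum(r) == 1]
    svalid = {s for s in product(rows2, repeat=2) if is_valid(s)}
    for i in range(3, n_objects + 1):
        new_rows = [r for r in product((0, 1), repeat=i + 1) if sum(r) == 1]
        expanded = set()
        for old in svalid:
            # Each existing object either stays put (append a 0 for the new
            # location) or moves to the new location (row (0,...,0,1)), as in
            # line 14 of Algorithm 1; the new object gets any single-place row.
            choices = [(tuple(row) + (0,), (0,) * i + (1,)) for row in old]
            for picked in product(*choices):
                for new_row in new_rows:
                    cand = picked + (new_row,)
                    if is_valid(cand):
                        expanded.add(cand)
        svalid = expanded
    return svalid

print(len(valid_states(3)))  # number of valid states for 3 objects
```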
## Acknowledgements

This work was supported in part by the ERC ADG FUN2MODEL (No. 834115), in part by the ERC ADG WhiteMech (No. 834228), in part by the EU ICT-48 2020 project TAILOR (No. 952215), in part by NSF grants IIS-1527668, CCF-1704883, IIS-1830549, CNS-2016656, DoD MURI grant N00014-20-1-2787, and in part by an award from the Maryland Procurement Office.

## References

[Aineto et al., 2023] Diego Aineto, Alessandro Gaudenzi, Alfonso Gerevini, Alberto Rovetta, Enrico Scala, and Ivan Serina. Action-failure resilient planning. In ECAI, pages 44–51, 2023.

[Altman, 1998] Eitan Altman. Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Mathematical Methods of Operations Research, 48(3):387–417, 1998.

[Aminof et al., 2021] Benjamin Aminof, Giuseppe De Giacomo, Alessio Lomuscio, Aniello Murano, and Sasha Rubin. Synthesizing best-effort strategies under multiple environment specifications. In KR, pages 42–51, 2021.

[Bacchus et al., 1999] Fahiem Bacchus, Joseph Y. Halpern, and Hector J. Levesque. Reasoning about noisy sensors and effectors in the situation calculus. Artificial Intelligence, 111(1-2):171–208, 1999.

[Baier and Katoen, 2008] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.

[Bielefeld, 1988] R. Selten Bielefeld. Reexamination of the Perfectness Concept for Equilibrium Points in Extensive Games, pages 1–31. Dordrecht, 1988.

[Bonassi et al., 2023] Luigi Bonassi, Giuseppe De Giacomo, Marco Favorito, Francesco Fuggitti, Alfonso Emilio Gerevini, and Enrico Scala. Planning for temporally extended goals in pure-past linear temporal logic. In ICAPS, pages 61–69, 2023.

[Brafman et al., 2018] Ronen I. Brafman, Giuseppe De Giacomo, and Fabio Patrizi. LTLf/LDLf non-Markovian rewards. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, AAAI, pages 1771–1778, 2018.

[Buffet et al., 2005] Olivier Buffet, Douglas Aberdeen, et al. Robust planning with (L)RTDP. In IJCAI, volume 19, page 1214. Lawrence Erlbaum Associates Ltd, 2005.

[Cimatti et al., 2003] Alessandro Cimatti, Marco Pistore, Marco Roveri, and Paolo Traverso. Weak, strong, and strong cyclic planning via symbolic model checking. Artificial Intelligence, 147(1-2):35–84, 2003.

[Ciolek et al., 2020] Daniel Alfredo Ciolek, Nicolás D'Ippolito, Alberto Pozanco, and Sebastián Sardiña. Multi-tier automated planning for adaptive behavior. In ICAPS, pages 66–74. AAAI Press, 2020.

[De Giacomo and Favorito, 2021] Giuseppe De Giacomo and Marco Favorito. Compositional approach to translate LTLf/LDLf into deterministic finite automata. In ICAPS, pages 122–130, 2021.

[De Giacomo and Rubin, 2018] Giuseppe De Giacomo and Sasha Rubin. Automata-theoretic foundations of FOND planning for LTLf and LDLf goals. In IJCAI, pages 4729–4735, 2018.

[De Giacomo and Vardi, 2013] Giuseppe De Giacomo and Moshe Y. Vardi. Linear temporal logic and linear dynamic logic on finite traces. In IJCAI, 2013.

[De Giacomo et al., 2023] Giuseppe De Giacomo, Gianmarco Parretti, and Shufang Zhu. LTLf best-effort synthesis in nondeterministic planning domains. Pages 533–540, 2023.

[Geffner and Bonet, 2013] Hector Geffner and Blai Bonet. A Concise Introduction to Models and Methods for Automated Planning. Morgan & Claypool Publishers, 2013.

[Ghallab et al., 2004] Malik Ghallab, Dana S. Nau, and Paolo Traverso. Automated Planning: Theory and Practice. 2004.

[Givan et al., 2000] Robert Givan, Sonia Leach, and Thomas Dean. Bounded-parameter Markov decision processes. Artificial Intelligence, 122(1-2):71–109, 2000.

[Hahn et al., 2019] Ernst Moritz Hahn, Vahid Hashemi, Holger Hermanns, Morteza Lahijanian, and Andrea Turrini. Interval Markov decision processes with multiple objectives: from robust strategies to Pareto curves. ACM Transactions on Modeling and Computer Simulation (TOMACS), 29(4):1–31, 2019.

[Haslum et al., 2019] Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, and Christian Muise. An Introduction to the Planning Domain Definition Language. 2019.

[He et al., 2019] Keliang He, Andrew M. Wells, Lydia E. Kavraki, and Moshe Y. Vardi. Efficient symbolic reactive synthesis for finite-horizon tasks. In ICRA, pages 8993–8999, 2019.

[Marchesi and Gatti, 2021] Alberto Marchesi and Nicola Gatti. Trembling-hand perfection and correlation in sequential games. In AAAI, pages 5566–5574, 2021.

[Nilim and El Ghaoui, 2004] Arnab Nilim and Laurent El Ghaoui. Robust Markov decision processes with uncertain transition matrices. PhD thesis, University of California, Berkeley, 2004.

[Pnueli, 1977] Amir Pnueli. The temporal logic of programs. In FOCS, pages 46–57, 1977.

[Puterman, 2014] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 2014.
[Reiter, 2001] Raymond Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. MIT Press, 2001.

[Satia and Lave Jr, 1973] Jay K. Satia and Roy E. Lave Jr. Markovian decision processes with uncertain transition probabilities. Operations Research, 21(3):728–740, 1973.

[Trevizan et al., 2007] Felipe W. Trevizan, Fabio Gagliardi Cozman, and Leliane Nunes de Barros. Planning under risk and Knightian uncertainty. In IJCAI, pages 2023–2028, 2007.

[Trevizan et al., 2008] Felipe W. Trevizan, Fábio G. Cozman, and Leliane N. De Barros. Mixed probabilistic and nondeterministic factored planning through Markov decision processes with set-valued transitions. In Workshop on A Reality Check for Planning and Scheduling Under Uncertainty at ICAPS, page 62, 2008.

[Vardi, 2020] Moshe Y. Vardi. Efficiency vs. resilience: what COVID-19 teaches computing. Communications of the ACM, 63(5):9, 2020.

[Wells et al., 2020a] Andrew M. Wells, Morteza Lahijanian, Lydia E. Kavraki, and Moshe Y. Vardi. LTLf synthesis on probabilistic systems. In GandALF, volume 326 of EPTCS, pages 166–181, 2020.

[Wells et al., 2020b] Andrew M. Wells, Morteza Lahijanian, Lydia E. Kavraki, and Moshe Y. Vardi. LTLf synthesis on probabilistic systems. In GandALF, volume 326 of EPTCS, pages 166–181, 2020.

[Wells et al., 2021] Andrew M. Wells, Zachary K. Kingston, Morteza Lahijanian, Lydia E. Kavraki, and Moshe Y. Vardi. Finite-horizon synthesis for probabilistic manipulation domains. In ICRA, pages 6336–6342, 2021.

[White III and Eldeib, 1994] Chelsea C. White III and Hany K. Eldeib. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4):739–749, 1994.

[Zhu and De Giacomo, 2022] Shufang Zhu and Giuseppe De Giacomo. Synthesis of maximally permissive strategies for LTLf specifications. In IJCAI, pages 2783–2789, 2022.