# Learning Safe Action Models with Partial Observability

Hai S. Le (Washington University in St. Louis), Brendan Juba (Washington University in St. Louis), Roni Stern (Ben Gurion University of the Negev)
bjuba@wustl.edu, hsle@wustl.edu, roni.stern@gmail.com

A common approach for solving planning problems is to model them in a formal language such as the Planning Domain Definition Language (PDDL) and then use an appropriate PDDL planner. Several algorithms for learning PDDL models from observations have been proposed, but plans created with these learned models may not be sound. We propose two algorithms for learning PDDL models that are guaranteed to be safe to use even when given observations that include partially observable states. We analyze these algorithms theoretically, characterizing the sample complexity each algorithm requires to guarantee probabilistic completeness. We also show experimentally that our algorithms are often better than FAMA, a state-of-the-art PDDL learning algorithm.

Introduction

Classical planning, i.e., planning in a discrete, deterministic, and fully observable environment, is a useful abstraction for solving many planning problems. To use classical planners, however, one must first model the problem at hand in a formal language such as the Planning Domain Definition Language (PDDL). This is not an easy task. Therefore, several approaches to learning a PDDL model from observations have been proposed (Aineto, Celorrio, and Onaindia 2019; Stern and Juba 2017; Juba, Le, and Stern 2021; Cresswell, McCluskey, and West 2013; Wu, Yang, and Jiang 2007). A prominent example is FAMA (Aineto, Celorrio, and Onaindia 2019), a state-of-the-art algorithm for learning a PDDL model from observations. A major advantage of FAMA is that it can learn a PDDL model even if the given observations are incomplete, in the sense that only a subset of the actions and state variables are observed. A major disadvantage of FAMA, and of most PDDL model learning algorithms, is that they do not provide any guarantee on the performance of the learned model. Plans generated with the learned model may not be executable or may fail to achieve their intended goals. SAM Learning (Stern and Juba 2017; Juba, Le, and Stern 2021; Juba and Stern 2022; Mordoch et al. 2022) is a recently introduced family of learning algorithms that provide safety guarantees over the learned PDDL model: any plan generated with the model they return is guaranteed to be executable and to achieve the intended goals. SAM Learning, however, is limited to learning from fully observed trajectories.

In this paper, we propose two algorithms for learning safe PDDL models in partially observed domains. The first algorithm, PI-SAM, extends SAM (Juba, Le, and Stern 2021) to support partially observable domains by applying the SAM learning rules only when a literal is observed in the states immediately before and after an action is applied. PI-SAM is easy to implement, has a polynomial running time, and outputs a classical planning PDDL model that provides the desired safety guarantee. The second algorithm, EPI-SAM, utilizes observations that PI-SAM ignores to learn a stronger formulation.
EPI-SAM compiles its knowledge and uncertainty about the underlying action model into a conformant planning problem, whose solution is also a safe solution to the underlying classical planning problem. We analyze the running time of EPI-SAM and prove that the conformant planning problem created by EPI-SAM is the strongest safe problem formulation. In terms of sample complexity, we show that in general it is not possible to guarantee efficient learning of a safe action model when the observations are partially observable. Nevertheless, we introduce a form of bounded concealment assumption, adapted from prior work on learning from partial observations (Michael 2010), under which both PI-SAM and EPI-SAM are guaranteed probabilistic completeness with a tractable sample complexity. Experimentally, we evaluated the performance of both algorithms and compared them with FAMA (Aineto, Celorrio, and Onaindia 2019) on common domains from the International Planning Competition (IPC) (Mc Dermott 2000). Our results show that PI-SAM and EPISAM often outperform FAMA in terms of the number of samples they require to learn effective action models, while still preserving our safety guarantee. Background and Problem Definition A classical planning domain is defined by a tuple F, A where F is a set of Boolean state variables, also known as fluents, and A is a set of actions. A state is a complete assignment of values to all fluents, i.e., s : F {true, false}. A partial state is an assignment of values to some (possibly all) of the fluents. For a fluent f and a partial state p, we denote by p[f] the value assigned to f according to p. A partial state p is consistent with a partial state p if for every The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) fluent f either p[f] = p [f], f is not assigned in p, or f is not assigned in p . A literal in this context is either a fluent f F or its negation f. For a literal ℓ= f, we denote by p[ℓ] = true, and p[ℓ] = false the fact that p[f] = false and p[f] = true, respectively. We say that a literal ℓis in a partial state p, denoted ℓ p, if p[ℓ] = true. Similarly, if p[ℓ] = false we say that ℓis not in s, denoted ℓ/ s. An action a is defined by a tuple name(a), pre(a), eff(a) where name(a) is a unique identifier of the action and pre(a) and eff(a) are partial states that specify the preconditions and effects of a, respectively. An action model of a planning domain is its set of actions including their names, preconditions, and effects. An action a is applicable in a state s if pre(a) is consistent with s. Applying a in s results in a state a(s) where for every fluent f F: (1) if f is assigned in eff(a) then eff(a)[f] = a(s)[f], (2) otherwise, s[f] = a(s)[f]. A sequence of actions π = (a1, . . . an) is applicable in a state s if a1 is applicable in s and for every i = 2, . . . , n, ai is applicable in ai 1( a1(s) ). The result of applying such a sequence of actions in a state s, denoted π(s), is the state an( a1(s) ). A classical planning problem is defined by a tuple F, A, I, G where F, A is a domain, I is the initial state, and G is a partial state representing the goal we aim to achieve. A state s is called a goal state if G is consistent with s. A solution to a planning problem is a plan, which is a sequence of actions π such that π is applicable in I and π(I) results in a goal state. Classical planning domains and problems are often described in a lifted manner, where fluents and actions are parameterized over objects. 
For ease of presentation, we describe our work in a grounded manner, but our work fully supports a lifted domain representation directly following Juba, Le, and Stern (2021). A trajectory is an alternating sequence of states and actions. For a trajectory T = (s0, a1, . . . , an, sn), let T.si = si and T.ai = ai. A transition in a trajectory is a tuple of the form T.si 1, T.ai, T.si . The last state and action in T are denoted by T.s 1 and T.a 1, respectively, and T.s and T.a denote the sequence of states and actions in T, respectively. An action model A is consistent with a trajectory T if according to A the sequence of actions T.a is applicable in T.s0 and T.si = T.ai( T.a1(T.s0) ) for every i {1, . . . , |T|}. Conformant planning (Bonet 2010) and contingent planning (Majercik and Littman 2003; Hoffmann and Brafman 2005; Albore, Palacios, and Geffner 2009; Brafman and Shani 2012) are previously studied types of planning under uncertainty that are directly related to our work. In both, the effects of some actions may be non-deterministic, and the initial state I is replaced by a formula φI over the set of fluents that defines a set of possible initial states. In conformant planning, the agent is assumed to be unable to collect observations during execution. As such, conformant planning algorithms output a linear plan, which is a sequence of actions, as in classical planning. A (strong) solution to a conformant planning problem is a linear plan that is guaranteed to achieve the goal regardless of the inherent uncertainty due to the initial state and non-deterministic effects. In contingent planning, some actions effects may include observing the values of some fluents, and the agent is as- sumed to be able to collect these observations and adapt its behavior accordingly. Many algorithms have been proposed for learning action models from a given set of trajectories (Cresswell, Mc Cluskey, and West 2013; Yang, Wu, and Jiang 2007; Aineto, Celorrio, and Onaindia 2019; Juba, Le, and Stern 2021). Algorithms from the LOCM family (Cresswell and Gregory 2011; Cresswell, Mc Cluskey, and West 2013) learn action models by analyzing observed action sequences and constructing finite state machines that capture how actions change the states of objects in the world. The FAMA algorithm (Aineto, Celorrio, and Onaindia 2019) translates the problem of learning an action model to a planning problem, where every solution to this planning problem is an action model consistent with the available observations. FAMA works even if the observations given to it are partially observable. Algorithms from the SAM learning family (Stern and Juba 2017; Juba, Le, and Stern 2021; Juba and Stern 2022; Mordoch et al. 2022) are different from other action model learning algorithms in that they guarantee that the action model they return is safe, in the sense that plans consistent with it are also consistent with the real, unknown action model. Most algorithms from this family have a tractable running time and reasonable sample complexity to ensure a probabilistic form of completeness, but rely on perfect observability of the given observations. The partially observed trajectories we consider are created by masking some fluent values in a trajectory, essentially changing some states into partial states. A literal ℓis said to be masked in a partial state p, denoted by p[ℓ] = ? if the corresponding fluent is not assigned in p. 
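To make the preceding definitions concrete, here is a minimal Python sketch of partial states, consistency, applicability, and action application; the names (`PartialState`, `Action`, `apply_action`, and the Blocksworld-style fluents) are ours for illustration, not code from the paper. Masked fluents are represented simply by omitting them from the dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict

# A (partial) state maps each assigned fluent to a Boolean value;
# fluents that are unassigned (masked when observing) are simply absent.
PartialState = Dict[str, bool]

@dataclass
class Action:
    name: str
    pre: PartialState                                   # preconditions
    eff: PartialState = field(default_factory=dict)     # effects

def consistent(p: PartialState, q: PartialState) -> bool:
    """p and q are consistent if they agree on every fluent assigned in both."""
    return all(q[f] == v for f, v in p.items() if f in q)

def is_applicable(a: Action, s: PartialState) -> bool:
    """An action is applicable in s if pre(a) is consistent with s."""
    return consistent(a.pre, s)

def apply_action(a: Action, s: PartialState) -> PartialState:
    """Applying a: effect fluents take their new values, all other fluents persist."""
    assert is_applicable(a, s)
    return {**s, **a.eff}

# Example: a fully observed state and a partially observed (masked) version of it.
s = {"holding_a": False, "clear_a": True, "handempty": True}
observed = {"clear_a": True}          # holding_a and handempty are masked
print(consistent(observed, s))        # True: the observation is consistent with s
```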
We say that an action model A is consistent with a partially observable trajectory T if it is consistent with at least one trajectory created by assigning values to all masked literals in T.

Definition 1. A safe model-free planning problem is a tuple ⟨Π, T⟩ where Π = ⟨F, A, I, G⟩ is a classical planning problem, and T is a set of partially observable trajectories created by executing plans that solve other problems in the same domain and masking some literals in the states of the resulting trajectories.

A safe model-free planning algorithm accepts the tuple ⟨F, I, G, T⟩ and outputs a plan π that is a solution to the underlying planning problem Π. The key challenge in solving such problems is that the problem-solver is not given any prior knowledge about the action model or the values of the masked literals. Nevertheless, the returned plan π must be safe, in the sense that π is a sequence of actions that is applicable in I according to the real, unknown action model and ends in a goal state. We make the following simplifying assumptions.
1. Actions have deterministic effects.
2. The preconditions and effects of actions are conjunctions of literals, as opposed to more complex logical statements such as conditional effects.
3. The form of partial observability defined above embodies the assumption that observations are noiseless: the value of a literal that is not masked is assumed to be correct.
These assumptions are reasonable when planning in digital or virtual environments, such as video games, or in environments that have been instrumented with reliable sensors, such as warehouses designed to be navigated by robots (Li et al. 2020).

Partial Information SAM Learning

Following prior work (Stern and Juba 2017; Juba, Le, and Stern 2021), we first learn an action model from the given trajectories and then use a planner to solve the given planning problem. We aim to learn an action model that is safe.

Definition 2 (Safe Action Model). An action model Â is safe w.r.t. an action model A if (1) for every action a ∈ Â and state s, if a is applicable in s according to Â then it is also applicable in s according to A, and (2) for every goal G, if a plan achieves G according to Â then it also achieves G according to A. Safety with respect to a fixed problem and its goal G is defined analogously.

The first learning algorithm we propose is called Partial Information SAM (PI-SAM). PI-SAM is based on the following observation.

Observation 1 (PI-SAM Rules). For any transition ⟨s, a, s′⟩ and literal ℓ:
Rule 1 [not a precondition]. If (ℓ ∉ s) ∧ (s[ℓ] ≠ ?) then ℓ is not a precondition of a.
Rule 2 [an effect]. If (ℓ ∉ s) ∧ (ℓ ∈ s′) ∧ (s[ℓ] ≠ ?) ∧ (s′[ℓ] ≠ ?) then ℓ is an effect of a.
Rule 3 [not an effect]. If (ℓ ∉ s′) ∧ (s′[ℓ] ≠ ?) then ℓ is not an effect of a.

PI-SAM applies Rules 1 and 2 in almost the same way as SAM Learning. For every action a observed in some trajectory, we first assume that it has no effects and that its preconditions consist of all possible literals. Then, for every transition ⟨s, a, s′⟩ and each literal ℓ observed in both the pre- and post-state, i.e., (s[ℓ] ≠ ?) ∧ (s′[ℓ] ≠ ?), we apply Rule 1 to remove preconditions and Rule 2 to add effects. PI-SAM runs in time O(Σ_{a∈A} |T(a)| · |F|), where T(a) is the set of transitions in T with action a (assuming T(a) can be accessed in O(1) time). PI-SAM also returns a safe action model, following the same reasoning given for the fully observable case (Stern and Juba 2017). A sketch of this procedure in code is given below.
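The PI-SAM rules above translate almost directly into code. The following sketch (hypothetical names; not the authors' implementation) uses the dictionary-based partial states from the previous listing, represents a literal as a (fluent, value) pair, and applies Rules 1 and 2 only to literals observed in both the pre- and post-state, which is exactly the restriction discussed next.

```python
from collections import defaultdict

def pi_sam(transitions, fluents):
    """transitions: iterable of (s, action_name, s_next) with partial-state dicts
    in which masked fluents are simply absent."""
    literals = {(f, v) for f in fluents for v in (True, False)}
    pre = defaultdict(lambda: set(literals))   # start: every literal is a precondition
    eff = defaultdict(set)                     # start: no effects

    for s, a, s_next in transitions:
        for f in fluents:
            # PI-SAM only reasons about literals observed in BOTH pre- and post-state.
            if f not in s or f not in s_next:
                continue
            # Rule 1: the literal with the value *not* seen in s cannot be a precondition.
            pre[a].discard((f, not s[f]))
            # Rule 2: a fluent whose value changed must be an effect.
            if s[f] != s_next[f]:
                eff[a].add((f, s_next[f]))
            # Rule 3 (not an effect) is implicit here: effects start empty and are
            # only added when a change is actually observed.
    return pre, eff
```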
Note that PI-SAM essentially uses the SAM learning rules, except that they are only applied to literals observed in both the pre- and post-state. This may seem unintuitive, since Rule 1 does not require that a literal ℓ be observed in the post-state to infer that it cannot be a precondition. To see why this modification is needed, consider running PI-SAM on a single trajectory with a single transition ⟨s, a, s′⟩ where ℓ ∉ s and s′[ℓ] = ?. Since the value of ℓ is masked in s′, we cannot apply Rule 2, and thus PI-SAM will assume ℓ is not an effect of a. However, we cannot know whether ℓ is an effect of a or not. Thus, even though we can infer that ℓ is not a precondition of a, returning an action model that allows a in such states may yield an unsafe action model: the learned model would predict that ¬ℓ still holds after applying a, a prediction that may be wrong.

Sample Complexity Analysis

Learning a non-trivial safe action model without any restrictions on how the partially observable trajectories have been generated is impossible. To see this, consider the case where the value of some fluent f is always masked. Since we never observe the value of f, for every action a we can never be certain whether its preconditions include f, ¬f, or neither. Thus, we can never obtain a safe action model that allows action a to be applied. This example highlights that some assumptions about how the partially observable trajectories were created are necessary to guarantee efficient learning of a safe action model. We propose such an assumption, based on the definition of a masking function.

Definition 3 (Masking function). A trajectory masking function O maps a trajectory T to a partially observable trajectory O(T) where (1) T.a = O(T).a, (2) |T| = |O(T)|, and (3) for every i, T.si is consistent with O(T).si.

An example of a masking function is random masking, which masks the value of each fluent with some fixed, independent probability. Without loss of generality, we assume the set of trajectories T was created by applying some masking function O to fully observable trajectories. Next, we introduce the following assumption about masking functions, adapted from Michael's theory of learning from partial information (Michael 2010):

Definition 4 (Bounded Concealment Assumption). A masking function satisfies the η-bounded concealment assumption in an environment if, for every literal that is not a precondition of an action, when that action is taken and the literal is false, the corresponding fluent is observed in both the pre- and post-state with probability at least η.

As an example of a masking function that satisfies a bounded concealment assumption, consider a random masking function where every literal is masked with a fixed independent probability 1 − α, i.e., observed with probability α. Each literal is then observed in both the pre- and post-state of a transition with probability α², so such masking functions feature α²-bounded concealment. A code sketch of such a masking function is given below.

Next, we analyze the relation between the number of trajectories given to PI-SAM and the ability of the action model it returns to solve new problems in the same domain, under the bounded concealment assumption. Let P_D be a probability distribution over solvable planning problems in a domain D. Let T_D be a probability distribution over pairs ⟨P, T⟩ given by drawing a problem P from P_D, using a sound and complete planner (which need not be deterministic) to generate a plan for P, and setting T to be the trajectory obtained by following this plan.
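For illustration, a random-masking observation function of the kind just described might be implemented as follows (a sketch under the stated assumptions; `observe_prob` and the helper names are ours).

```python
import random

def random_mask(state, observe_prob):
    """Keep each fluent's observed value independently with probability
    observe_prob; otherwise mask it (drop it from the partial state)."""
    return {f: v for f, v in state.items() if random.random() < observe_prob}

def mask_states(states, observe_prob):
    """Apply the masking function to every state of a trajectory;
    the action labels of the trajectory are left untouched."""
    return [random_mask(s, observe_prob) for s in states]

# With per-fluent observation probability alpha, a fluent is visible in both
# the pre- and post-state of any given transition with probability alpha ** 2,
# so this masking function satisfies alpha**2-bounded concealment.
```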
Theorem 2. Under η-bounded concealment, given m ≥ (1/(ϵη)) · (2|A||F| ln 3 + ln(1/δ)) trajectories sampled from T_D, PI-SAM returns a safe action model M_PI-SAM such that, with probability at least 1 − δ, a problem drawn from P_D is not solvable with M_PI-SAM with probability at most ϵ. (For instance, with |A| = 5, |F| = 20, η = 0.25, ϵ = 0.1, and δ = 0.05, the bound evaluates to roughly 8,900 trajectories.)

Definition 5 (Adequate). An action model M is ϵ-adequate if, with probability at most ϵ, a trajectory T sampled from T_D contains a transition ⟨s, a, s′⟩ where (1) s does not satisfy pre_M(a), or (2) there is a literal in s′ \ s that is not in eff_M(a).

Lemma 1. The action model returned by PI-SAM given m trajectories (as specified in Theorem 2) is ϵ-adequate with probability at least 1 − δ.

Sketch of proof: Consider any action model M that is not ϵ-adequate: then either (1) with probability at least ϵ, trajectories sampled from T_D contain a transition ⟨s, a, s′⟩ for which s does not satisfy pre_M(a), or (2) with probability at least ϵ, trajectories sampled from T_D contain a transition ⟨s, a, s′⟩ for which there is a literal in s′ \ s that is not in eff_M(a). In the first case, note that since ⟨s, a, s′⟩ is a valid transition under the true action model M*, the literal for which pre_M(a) is violated cannot be in pre_M*(a). Therefore, by η-bounded concealment, the violated precondition literal in pre_M(a) is observed with probability at least η when such a transition occurs; thus, with probability at least η·ϵ overall, the literal is observed and deleted from the preconditions of a in M_PI-SAM. Since PI-SAM never adds precondition literals back, this ensures that M_PI-SAM ≠ M. Similarly, in the second case, if ℓ ∈ s′ \ s then ℓ ∈ eff_M*(a). Thus, η-bounded concealment ensures that ℓ is observed in both s and s′ with probability at least η when such a transition occurs. So, with overall probability η·ϵ, the trajectory contains a transition ⟨s, a, s′⟩ where ℓ is observed and ℓ ∈ s′ \ s. When this happens, ℓ is added to the effects of a in M_PI-SAM, and we again get M_PI-SAM ≠ M since PI-SAM never removes literals from the effects. Thus, in either case, the probability of obtaining a trajectory that ensures M is not output is at least η·ϵ on each example. The rest of the proof now follows similarly to the original SAM analysis (Stern and Juba 2017; Juba, Le, and Stern 2021).

We now prove Theorem 2.

Proof. When PI-SAM deletes a literal ℓ from pre(a), it has observed a transition ⟨s, a, s′⟩ where ℓ is false in s, so ℓ cannot be a precondition of a in the true action model M*. Thus, whenever action a can be taken in some state under M_PI-SAM, it can also be taken in M*. Conversely, since M_PI-SAM is ϵ-adequate, with probability at least 1 − ϵ the sequence of actions appearing in the trajectory associated with a draw from T_D is a valid plan in M_PI-SAM: the first condition of Definition 5 ensures that the preconditions of M_PI-SAM allow each action to be executed, and the second condition guarantees that M_PI-SAM obtains the same states on each transition. Thus, with probability 1 − ϵ, the goal is achievable under M_PI-SAM using the plan.

Extended PI-SAM (EPI-SAM)

The PI-SAM algorithm is easy to implement and outputs an action model that can be used by any planner designed to solve classical planning problems. Yet, it only uses transitions in which literals are observed in both the pre- and post-state. For example, consider an action a, a literal ℓ, and three transitions ⟨s1, a, s′1⟩, ⟨s2, a, s′2⟩, and ⟨s3, a, s′3⟩, where ℓ is not observed in any state except s1, s′2, and s′3, in which its values are false, false, and true, respectively.
Since ℓwas observed to be false in s1, we can deduce it is not a precondition of a (Rule 1 in Observation 1). Since ℓis never observed in both preand post-states of the same transition, the PI-SAM algorithm still does not remove ℓfrom pre(a). However, considering the value of ℓin s 2 and s 3, we can deduce that neither ℓnor ℓare effects of a (Rule 2 and 3 in Observation 1). Thus, it is possible to apply a in states without ℓand maintain our safety property. Next, we propose the Extended PI-SAM (EPI-SAM) learning algorithm, which is able to make such inferences. EPI-SAM relies on several key observations. The first observation is that learning of the effects of actions and learning their preconditions can be done separately, because we can never be certain that a literal is a precondition of an action. The second observation is that limiting the output of EPI-SAM to a classical planning action model limits the scope of safe model-free planning problems we can solve. For example, if we observe a trajectory (s0, a1, s1, a2, s2), where s0[ℓ] = false, s2[ℓ] = true, and ℓis masked in s1, we cannot discern which action a1 or a2 achieved ℓ, but we can learn that at least one of them has done so. While classical planning action models cannot capture this knowledge directly, such uncertainty can be compiled into a non-classical planning problem. Based on these observations, EPI-SAM has the following parts: learning effects, learning preconditions, and compilation to non-classical planning. In the first part (learning effects), EPI-SAM creates a Conjunctive Normal Form (CNF) formula for each literal ℓ, denoted by CNFeff(ℓ), which describes conditions for sequences of actions that achieve ℓ in the problems returned by EPI-SAM. The literals of this CNF are of the form Is Eff(ℓ, a), representing whether literal ℓis an effect of action a. In the second part (learning preconditions), EPI-SAM creates a set of literals pre(a) for each action a that describes the preconditions of a in the returned problems. In the third part (compilation to nonclassical planning), EPI-SAM creates a conformant planning problem using the output of the previous two parts. This conformant planning problem is constructed so that any (strong) solution to this problem is a safe solution to the actual planning problem. We describe these in detail next. Learning Effects To learn effects, EPI-SAM extends PISAM rules 2 and 3 (Observation 1) from rules over transitions to rules over sub-trajectories. A trajectory T is a subtrajectory of trajectory T, denoted T T, if it is a consecutive subsequence of T, i.e., there exists i and j where i < j such that T .s0 = T.si and for every k {1, . . . , |T |} we have T .sk = T.si+k and T .ak = T.ai+k. Observation 3 (EPI-SAM Rules). For any sub-trajectory T of a trajectory in T that ends in a state where literal l is not masked, i.e., where T .s 1[l] = ?, then Rule 1 [an effect]. If l T .s 1 and l / T .s0 then a T .a that has l as an effect. Rule 2 [not an effect]. If l T.s 1 then l is not an effect of T .a 1 Rule 3 [not deleted].If l T .s 1 and l is an effect of an action T .ai then i > i that has l as an effect. Algorithm 1 lists the pseudo-code for effects learning in EPI-SAM, which builds on the EPI-SAM rules in Observation 3. Initially, CNFeff(ℓ) contains a single clause for every action a that ensures the effects of a are mutually exclusive (line 3). 
Then, we implement the EPI-SAM rules by going over every trajectory T and every state T.si in which ℓis not The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Algorithm 1: EPI-SAM: Learning Effects Input : Partially observed trajectories T Output: CNFeff(ℓ) for each literal l 1 foreach literal ℓdo 2 CNFeff(ℓ) 3 foreach action a do Add to CNFeff(ℓ): { Is Eff(ℓ, a) Is Eff( ℓ, a)} 4 foreach trajectory T T do 5 foreach index i {1, . . . , |T|} where ℓ T.si do 6 T max. prefix of T.si where ℓis masked 7 if ℓ/ T .s0 then Add to CNFeff(ℓ): Is Eff(ℓ, T .a1) Is Eff(ℓ, T .a|T |) 8 Add to CNFeff(ℓ): { Is Eff( ℓ, T .a|T |)} 9 foreach j = 1 to |T | 1 do 10 Add to CNFeff(ℓ): { Is Eff( ℓ, T .aj) Is Eff(ℓ, T .aj+1) Is Eff(ℓ, T .a|T |)} 15 return {CNFeff(ℓ)}ℓ masked. For each such pair of trajectory and state, we extract the longest sub-trajectory T T that ends in T.si and where ℓis masked in all other states in T (line 6). If a literal ℓwas false at the first state of T , then we add to CNFeff(ℓ) a clause to ensure that ℓis an effect of some action ai (EPISAM Rule 1). Then, we add a clause to ensure that ℓis not an effect of the last action in T (EPI-SAM Rule 2). Finally, we add a clause to ensure that if ℓwas an effect of any action a T .a then some action in T after that action must have had ℓas an effect (EPI-SAM Rule 3). Learning Preconditions EPI-SAM starts by assuming for every action a that it has all literals as preconditions. Then, it removes a literal l from the set of preconditions of an action a if and only if assuming l is a precondition of pre(a) is inconsistent with T . There are two possible ways in which the assumption that l is a precondition of a can be inconsistent with the observations: (1) there is a transition s, a, s in T where s[l] = false, and (2) no set of action effects is consistent with T when we additionally set s[l] = true for every transition s, a, s in T . The former corresponds to PI-SAM Rule 1, which can be easily verified in linear time. The latter can be checked by setting s[l] = true in the relevant transitions, running EPI-SAM s effect-learning part (Algorithm 1) on the resulting set of trajectories, and checking if the resulting CNF is satisfiable. This check can be done by calling any SAT solver. Fortunately, it is also possible to perform this satisfiability check in polynomial time. This is because assumptions about which action achieves literal l are independent of any assumption about which actions achieve any other literal except l.3 Algorithm 2 lists the pseudo-code of EPI-SAM s precondition learning part. Like PI-SAM, EPI-SAM initially assumes that the preconditions of every action include all lit- 3This independence fails when conditional effects are allowed. Algorithm 2: EPI-SAM: Learning Preconditions Input : Partially observed trajectories T Output: Precondition pre(a) for each action a 1 foreach action a do pre(a) all literals 2 foreach action a, literal ℓdo 3 if s, a, s T T where ℓ s then 4 Remove ℓfrom pre(a) 5 Continue to the next (a, ℓ) pair 6 Ta,ℓ Assume Precondition(a, ℓ, T ) ; Airr 7 while a / Airr where Irrelevant(a ,ℓ,Ta,ℓ) do 8 foreach s, a , s in T Ta,ℓdo 9 if s[ℓ] and s [ℓ] are inconsistent then 10 Remove ℓfrom pre(a) 11 Continue to the next (a, ℓ) pair 13 if s[ℓ] = ? then s[ℓ] s [ℓ] 14 Remove s, a , s from T 19 return {pre(a)}a erals. Then, EPI-SAM iterates over every pair of action a and literal ℓto check if ℓcan be removed from the set of preconditions assumed for a. 
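Before turning to how each candidate precondition is tested, the effect-learning part can be summarized in code. The sketch below follows the spirit of Algorithm 1 and Observation 3 but is our own illustrative encoding, not the authors' implementation: a CNF variable is a (literal, action) pair standing for Is_Eff(ℓ, a), and a clause is a list of signed variables.

```python
def learn_effect_clauses(trajectories, fluents):
    """Build, per literal, CNF clauses over Is_Eff variables.
    A trajectory is a pair (states, acts) where acts[k] takes states[k]
    to states[k + 1]; states are dicts with masked fluents absent.
    A clause is a list of (sign, (literal, action)) pairs."""
    def neg(lit):
        f, v = lit
        return (f, not v)

    actions = {a for _, acts in trajectories for a in acts}
    cnf = {}
    for lit in {(f, v) for f in fluents for v in (True, False)}:
        f, v = lit
        # Effects of an action are mutually exclusive: not both lit and ~lit.
        clauses = [[(False, (lit, a)), (False, (neg(lit), a))] for a in actions]
        for states, acts in trajectories:
            for i in range(1, len(states)):
                if states[i].get(f) != v:       # need lit observed (true) at states[i]
                    continue
                # Longest sub-trajectory ending at states[i] with f masked in between.
                j = i - 1
                while j > 0 and f not in states[j]:
                    j -= 1
                window = acts[j:i]              # the actions of that sub-trajectory
                if states[j].get(f) == (not v):                        # Rule 1
                    clauses.append([(True, (lit, a)) for a in window])
                clauses.append([(False, (neg(lit), window[-1]))])      # Rule 2
                for k in range(len(window) - 1):                       # Rule 3
                    clauses.append([(False, (neg(lit), window[k]))] +
                                   [(True, (lit, a)) for a in window[k + 1:]])
        cnf[lit] = clauses
    return cnf
```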
The first way EPI-SAM attempts to remove ℓfrom pre(a) is by checking if it violates PI-SAM Rule 1 (lines 3-5). The second way is by using a proof-by-contradiction approach, checking if assuming ℓis a precondition of a leads to a contradiction with the observations and every possible assumption about actions effects. EPI-SAM performs this check by performing the following steps. First, it creates a copy of the set of trajectories T where ℓis set to be true in every state where a is applied (the Assume Precondition call in line 6). This set of modified trajectories is denoted by Ta,ℓin Algorithm 2. Then, EPI-SAM iteratively searches for actions that are irrelevant for the value of ℓ. An action a is said to be irrelevant for the value of ℓif we can infer that neither ℓnor ℓare effects of a. We do this by invoking PI-SAM Rule 2 for both ℓand ℓ. That is, action a is identified as irrelevant to ℓif there are two transitions s1, a , s 1 and s2, a , s 2 where ℓ is not masked in their post-states and it has different values, i.e., (s 1[ℓ] = ?) (s 2[ℓ] = ?) (s 1[ℓ] = s 2[ℓ]). A contradiction is identified if there exists a transition s, a , s where a is an irrelevant action but the value of ℓin s and in s is inconsistent, i.e., unmasked and different (line 9). If a is irrelevant but the values of s and s are consistent, then we propagate the value of s to s and remove the transition s, a , s from Ta,ℓ(lines 13-14). 4 Compilation to Non-Classical Planning Next, EPI-SAM creates a conformant planning problem ΠSAM based on the outputs of the previous EPI-SAM parts, {CNFeff(ℓ)}ℓand {prea}a, and the available knowledge of the underlying planning problem Π. A conformant planning problem is de- 4If s , a , s T, then removing s, a , s implicitly adds the transition s, a , s . The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) fined by a tuple F, O, A, I, G where F, A, I, and G are the set of fluents, actions, initial state, and goals, as in a classical planning problem, except that A may include nondeterministic and conditional effects, and I is a set of possible initial states defined by a formula over F. O is the subset of fluents in F that are observable. The set of fluents in ΠSAM includes all fluents in Π and an additional fluent f Is Eff(a,ℓ) for every action a and literal ℓ. All fluents from Π are observable in ΠSAM and all others are not. The initial state formula in ΠSAM sets the values of all observable fluents according to their initial values in Π. In addition, it includes all the clauses in the CNFs returned by EPI-SAM ({CNFeff(ℓ)}ℓ), replacing every literal Is Eff(a, ℓ) with the corresponding fluent f Is Eff(a,ℓ). The action model of ΠSAM includes all actions observed in T . For each action a, we set its preconditions to the set of preconditions learned for it by EPI-SAM s learning preconditions part, pre(a). All the effects of a are conditional effects. A conditional effect of an action is an effect (i.e., a partial state) that is only applied if a specified condition holds. For each action a and literal ℓ, we add a conditional effect such that if f Is Eff(a,ℓ) is true then ℓis an effect of a. Note that conditional effects are supported by many classical and conformant planners (Bonet 2010; Grastien, Scala, and Kessler 2017). If the agent executing the plan can observe the values of fluents during execution and react, then the above compilation can be used almost as-is to construct a contingent planning problem instead of a conformant planning problem. 
The output of a contingent planning algorithm is a plan tree, branching over the observed values during execution, which can be more efficient than the linear plan returned for the respective conformant planning problem. Theoretical Properties Next, we analyze EPI-SAM theoretically, showing that it is safe, runs in polynomial time, and it is the strongest algorithm for solving safe model-free planning problems, in the sense that any algorithm able to solve a problem that cannot be solved by EPI-SAM cannot also be safe. Throughout this analysis, we denote by A the action model of the underlying problem, and denote by pre A(a) and eff A(a) the set of preconditions and effects, respectively, of an action a according to an action model A. Observe that every classical action model A corresponds to an assignment σA to the formula Φeff = V ℓCNFeff(ℓ), by setting Is Eff(ℓ, a) to true if ℓis an effect of a for each literal ℓand action a. Similarly, every satisfying assignment of Φeff describes the effects of a classical action model. Lemma 2. If a classical action model A is consistent with T then σA is a satisfying assignment of Φeff. Conversely, every satisfying assignment σ to Φeff describes the effects of at least one classical action model that is consistent with T . Proof. Consider the clausal encoding of the STRIPS axioms, instantiated at each step of each trajectory in T . This CNF, denoted CNFT is defined over variables of the form Is Eff(l, a), Is Pre(l, a), and State(l, i, T), representing that l is a precondition of a, l is an effect of a, and l = true in the ith state of trajectory T, respectively. This CNF includes the following clauses for every transition si 1, ai, si in every trajectory T T : (C1) Is Pre(l, ai) State(l, i 1, T) (C2) Is Eff(l, ai) State(l, i, T) (C3) Is Eff(l, ai) State(l, i 1, T) State(l, i, T) By construction, a satisfying assignment to CNFT corresponds to the effects of an action model and the complete trajectories for this action model, given the values observed in the trajectories of T . Moreover, the action model with these effects and no preconditions is consistent with T . Let CNFT (ℓ) be the formula containing all the clauses in CNFT containing literals for a single fluent literal ℓ. Note that the clauses of CNFT only contain literals for a single fluent literal, so CNFT is satisfiable iff for every ℓthe formula CNFT (ℓ) is satisfiable. The final part of our proof will show that the CNF returned by EPI-SAM, CNFeff(ℓ), is satisfiable iff CNFT (ℓ) is satisfiable. To this end, we rely on the refutation-completeness of resolution and examine which clauses may appear in a refutation of CNFT (ℓ). The Is Pre(a, ℓ) literals, appearing only negatively, cannot appear in a refutation. Thus, any refutation will be based on clauses of types C2 and C3. Two types of proofs can be created from such clauses. The first requires observing the value of ℓin enough states such that we have contradicting unit clauses with Is Eff literals for some action ai. That is, we have transitions si, ai, s i and sj, ai, s j where l is observable in states s i, sj 1, and sj with values false, true, and false, respectively. This option is implemented in line 4 of Algorithm 2. The second type of proof requires using resolution to eliminate at least one State literal. 
Reordering the applications of the resolution rule on these literals to the beginning of the proof, we see that we must create clauses that correspond to consecutive runs of unobserved literals using the resolution rule on clauses of type C3 for each step, beginning with either an observed literal or with using clauses of type C2 to eliminate the first State(ℓ, i, T) literal. These are, respectively, the clauses of CNFeff(l) created on lines 7 and 10 in Algorithm 1. Lemma 3. For every action a in ASAM and literal ℓ, it holds that ℓ pre ASAM(a) if and only if there exists an action model A consistent with T where ℓ pre A(a). Proof. We first prove that if EPI-SAM removes a literal ℓ from pre(a), then there exists a transition s, a, s in T where ℓis false, and hence cannot be in pre A (a). EPI-SAM removes ℓfrom pre(a) in two places in Algorithm 2: line 4 and line 10. The correctness of line 4 is immediate: if ℓis observed to be false in a state where a has been applied then it cannot be a precondition of a (PI-SAM Rule 1). Before removing a precondition due to line 10, EPI-SAM creates a set of trajectories Tℓ,a that assumes ℓwas true whenever a was taken, and detects the set of actions Airr that cannot affect the value of ℓin any action model consistent with Tℓ,a. Because of the frame axioms, the value of ℓgets propagated in any transition that includes an action in Airr. ℓis only removed in line 10 if this propagation results in a state where ℓhas contradicting values. As this occurs for any action model consistent with Tℓ,a, this implies that ℓcannot be true in every state where a was applied, and thus cannot be a precondition of a in any action model consistent with T . The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Next, we prove that if ℓhas not been deleted from pre(a) by EPI-SAM, then there exists an action model A consistent with T where ℓ pre A(a). Consider the subset of Airr that includes only actions that have been in a transition where the value of ℓis not masked. For each action a in this set, we are guaranteed that this value of ℓis always the same, denoted v(a , ℓ). Otherwise a would have been added to Airr. The action model created by assigning v(a , ℓ) as an effect of a for each of these actions is consistent with Tl,a. Therefore, there exists an action model where ℓis a precondition of a that is consistent with T . Given these lemmas, the following theorems are straightforward, and we omit the proofs due to space constraints. Theorem 4. EPI-SAM returns a safe plan. Theorem 5 (Time Complexity). Given a set of trajectories T , EPI-SAM runs in time O |A| |F| P a A |T (a)| . A safe action model is not necessarily useful, as it may be too restrictive to allow solving problems in the domain. Next, we characterize the usefulness of the action model returned by EPI-SAM. We say that an action model A1 is stronger than another action model A2 if the set of problems it allows solving is a superset of the set of problems allowed by A2. The strongest safe action model is a safe action model that is stronger than any other safe action model. Theorem 6 (Strength). The problem ΠSAM returned by EPISAM is the strongest safe action model. Proof. Let ASAM be the action model returned by EPI-SAM. By contradiction, let Abad be a safe action model such that ASAM is not stronger than Abad. This means that either there exists a literal ℓthat is in pre ASAM but not in pre Abad or a plan πbad that achieves some goal G according to Abad but not according to ASAM. 
The first condition cannot hold due to Lemma 3: for any precondition assumed by ASAM there exists an action model consistent with T that requires it. For the second condition, suppose that there is a plan under Abad that is allowed by the EPI-SAM action model, but for which EPI-SAM does not achieve the goal. This means (by Lemma 2) that there was some action model consistent with T under which the goal was not achieved. The other action model is therefore not safe. Experiments We evaluate our algorithms performance experimentally on the IPC (Mc Dermott 2000) domains listed in Table 1. The tuple listed under each domain details the number of lifted fluents, lifted actions, maximal arity of fluents, maximal arity of actions, and average trajectory length in that domain. For each domain, we generated problems using the generators provided by the IPC learning tracks and solved them using the true action model and an off-the-shelf planner. In the resulting trajectories, we masked some states using random masking with masking probability η = 0.1 and η = 0.3. Metrics A common approach to comparing action models is by computing the precision and recall of the learned action model with respect to which literals appear in the real action model. However, this syntactic measure has three limitations. First, it requires the evaluated action models to use the same fluents and action names. Second, it gives the same penalty for every mistake in the learned model. Third, domains may have distinct but semantically-equivalent action models. For example, in Npuzzle, we could have a precondition that the tile we are sliding into the empty position is not an empty position. This precondition is not necessary, as there is only ever one empty position in any puzzle. Thus, either formulation of the domain is adequate for planning purposes, but a syntactic measure of correctness will penalize one of the two formulations. Instead, we introduce and use empirically-based precision and recall measures, which are based on comparing the number of transitions that are valid or invalid according to the learned action model ( ˆA) and the true action model (A). The empirical precision and recall measures are defined according to the number of true/false positives/negatives (TP, FP, TN, FN) but compute TP, FP, TN, and FN differently. For preconditions, TP is the number of transitions that are valid according to both ˆA and A, FP is the number of transitions that are valid according to ˆA but not A, TN is the number of transitions that are invalid according to ˆA and A, and FN is the number of transitions that are valid according to A and but not ˆA. TP, FP, TN, and FN for effects are computed similarly. To compute the empirical precision and recall, we created 3 trajectories for each domain. These trajectories were created by solving problems in the same domain. These problems were different, of course, from the trajectories given to the learning algorithm to learn the action model. Results and Discussion We performed experiments using PI-SAM and EPI-SAM*, a simplified (unsafe) version of EPI-SAM. Recall that EPI-SAM does not return a classical action model, and the conformant planning formulation it produces involves explicitly reasoning about the various possible states that could occur in trajectories using the uncertain action model. As such, it does not make sense to apply state-wise measures of precision and recall directly to EPI-SAM. 
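As a concrete reading of these definitions, the empirical precision and recall for preconditions can be computed from a set of held-out transitions as follows (a sketch; the two validity predicates, which check whether a model's preconditions hold in a state, are assumed to be supplied by the caller).

```python
def empirical_pre_metrics(test_transitions, valid_learned, valid_true):
    """Empirical precision/recall for preconditions.
    valid_learned / valid_true: predicates (s, a) -> bool telling whether the
    learned / true action model allows taking action a in state s."""
    tp = fp = tn = fn = 0
    for s, a, _s_next in test_transitions:
        l, t = valid_learned(s, a), valid_true(s, a)
        tp += l and t              # valid under both models
        fp += l and not t          # accepted by the learned model only
        tn += (not l) and not t    # rejected by both models
        fn += (not l) and t        # rejected by the learned model only
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```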
EPI-SAM* uses unit propagation to determine the effects of every action in the CNF returned by Algorithm 1, by checking if assuming literal l is an effect of action a if the CNF formula extended by Is Eff(a, ℓ) is satisfiable. EPI-SAM* outputs a classical action model instead of a conformant plan. Nevertheless, observe that the inferences obtained by unit propagation are sound and are a subset of those obtainable in EPI-SAM s formulation. Thus, since EPI-SAM is safe, the precision and recall for EPI-SAM* provide a lower bound on the performance of EPI-SAM. As a baseline, we compared our algorithms to FAMA (Aineto, Celorrio, and Onaindia 2019), a modern algorithm for learning action models under partial observability. We ran those three algorithms on our benchmark domains. For each domain, we computed the empirical precision (P) and recall (R) separately for the preconditions (pre) and effects (eff). Table 1 lists the results of our experiments, averaged The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Domain Algorithm η = 0.3 η = 0.1 |T | P(pre) R(pre) P(eff) R(eff) T(sec) |T | P(pre) R(pre) P(eff) R(eff) T(sec) Blocks FAMA 3 0.90 0.90 1.00 0.89 13 6 0.90 0.85 0.90 0.85 60 (5,4,2,2,30) PI-SAM 3 1.00 0.90 1.00 0.95 9 6 1.00 0.83 1.00 0.85 43 EPI-SAM* 3 1.00 0.92 1.00 0.95 - 6 1.00 0.85 1.00 0.88 - Depot FAMA 5 0.80 0.85 0.90 1.00 17 8 0.80 0.80 0.90 1.00 60 (6,5,4,2,32) PI-SAM 5 1.00 0.85 1.00 1.00 12 8 1.00 0.82 1.00 1.00 53 EPI-SAM* 5 1.00 0.85 1.00 1.00 - 8 1.00 0.83 1.00 1.00 - Ferry FAMA 3 0.85 1.00 1.00 1.00 9 6 0.80 1.00 0.85 1.00 35 (5,3,2,2,55) PI-SAM 3 1.00 1.00 1.00 1.00 5 6 1.00 0.94 1.00 0.90 27 EPI-SAM* 3 1.00 1.00 1.00 1.00 - 6 1.00 0.95 1.00 0.90 - Floortile FAMA 5 0.84 0.80 0.79 0.80 18 9 0.87 0.82 0.80 0.83 60 (10,7,2,4,40) PI-SAM 5 1.00 0.87 1.00 0.87 15 9 1.00 0.85 1.00 0.85 50 EPI-SAM* 5 1.00 0.89 1.00 0.90 - 9 1.00 0.87 1.00 0.87 - Gripper FAMA 5 1.00 1.00 1.00 1.00 8 10 1.00 1.00 1.00 1.00 30 (4,3,2,3,20) PI-SAM 5 1.00 1.00 1.00 1.00 5 10 1.00 1.00 1.00 1.00 24 EPI-SAM* 5 1.00 1.00 1.00 1.00 - 10 1.00 1.00 1.00 1.00 - Hanoi FAMA 1 0.85 1.00 1.00 1.00 1 1 0.81 1.00 1.00 1.00 60 (3,1,2,3,31) PI-SAM 1 1.00 1.00 1.00 1.00 1 1 1.00 1.00 1.00 1.00 15 EPI-SAM* 1 1.00 1.00 1.00 1.00 - 1 1.00 1.00 1.00 1.00 - Npuzzle FAMA 1 1.00 1.00 1.00 1.00 1 1 0.83 1.00 1.00 1.00 23 (3,1,2,3,48) PI-SAM 1 1.00 1.00 1.00 1.00 1 1 1.00 1.00 1.00 1.00 17 EPI-SAM* 1 1.00 1.00 1.00 1.00 - 1 1.00 1.00 1.00 1.00 - Parking FAMA 6 0.85 0.85 1.00 1.00 13 8 0.83 0.85 0.90 1.00 60 (5,4,2,3,50) PI-SAM 6 1.00 0.88 1.00 1.00 8 8 1.00 0.83 1.00 1.00 49 EPI-SAM* 6 1.00 0.88 1.00 1.00 - 8 1.00 0.85 1.00 1.00 - Sokoban FAMA 2 1.00 1.00 1.00 1.00 8 5 1.00 1.00 1.00 1.00 40 (4,2,3,5,28) PI-SAM 2 1.00 1.00 1.00 1.00 6 5 1.00 1.00 1.00 1.00 33 EPI-SAM* 2 1.00 1.00 1.00 1.00 - 5 1.00 1.00 1.00 1.00 - Transport FAMA 5 0.77 0.80 0.80 0.90 14 9 0.80 0.80 0.84 0.90 60 (5,3,2,5,36) PI-SAM 5 1.00 0.83 1.00 0.90 9 9 1.00 0.80 1.00 0.90 48 EPI-SAM* 5 1.00 0.85 1.00 0.92 - 9 1.00 0.83 1.00 0.92 - Table 1: Empirical precision and recall results under random masking with η = 0.1 and η = 0.3. |T| = 3 |T| = 5 |T| = 7 PI-SAM FAMA PI-SAM FAMA PI-SAM FAMA P(pre) 1.00 0.90 1.00 0.87 1.00 0.90 R(pre) 0.90 0.90 0.92 0.88 0.93 0.90 P(eff) 1.00 1.00 1.00 0.95 1.00 1.00 R(eff) 0.95 0.89 0.96 0.87 0.96 0.90 Table 2: Results on Blocks with η = 0.3. over three independent runs. 
Columns P(pre) , R(pre) , P(eff) , and R(eff) show the empirical precision and recall for preconditions and effects for every evaluated algorithm. |T | is determined as the point that FAMA started decreasing performance (i.e. precision-recall) or reaching the time limit. We limited the running time of each algorithm to 60 seconds. Column T is the runtime of each algorithm in seconds. Since EPI-SAM* is unsafe, we do not report its runtime. Since PI-SAM and EPI-SAM*, by definition, never remove a literal that is an actual precondition from the preconditions or add a literal that is not an actual effect, their empirical precision is perfect for both preconditions and effects, as opposed to FAMA, which does not always achieve this. PI-SAM tends to have a higher empirical recall under lower masking probability (high η), while SAM PI-SAM Alg. (η = 1.0) (η = 0.3) (η = 0.1) Hanoi 1 10 95 Npuzzle 1 9 92 Ferry 4 42 355 Gripper 5 51 476 Sokoban 6 55 563 Table 3: # of transitions needed to learn the preconditions FAMA tends to obtain higher recall under higher masking probability (low η). EPI-SAM* generally outperforms both. Note that FAMA s performance may decrease as more input is given, while PI-SAM cannot. To demonstrate this, we picked a domain (Blocks) and recorded their performance as given an increasing number of trajectories as input. The results are shown in Table 2. We also compared the number of transitions required to correctly learn the preconditions (i.e., P(pre) and R(pre) = 1.0) when using PI-SAM with η {0.1, 0.3} and when having full observability and using SAM. The results are shown in Table 3. As expected, the number of transitions required scales inversely with the random masking probability η2, which verifies the tightness of the bound in Theorem 2. The source code of the experiments The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) is available at https://github.com/hsle/pisam learning. Conclusion and Future Work We proposed two algorithms for learning safe action models in domains with partial observability. The first algorithm, PI-SAM, extends the SAM learning algorithm (Juba, Le, and Stern 2021) to partially observable domains and outputs classical planning action models. The second algorithm, EPI-SAM, provides the outputs in the form of conformant planning problems, but can work on general observations. In practice, we can choose either PI-SAM or EPI-SAM, depending on the specific observation sets (e.g., whether they satisfy the bounded concealment assumption or not). For future work, we aim to extend safe action model learning to more complicated domains, such as domains with stochastic effects, numeric state variables, etc. Acknowledgments This work was partially funded by NSF awards IIS1908287, IIS-1939677, and IIS-1942336, and BSF grant #2018684 to Roni Stern. Aineto, D.; Celorrio, S.; and Onaindia, E. 2019. Learning action models with minimal observability. Artificial Intelligence, 275: 104 137. Albore, A.; Palacios, H.; and Geffner, H. 2009. A translation-based approach to contingent planning. In International Joint Conference on Artificial Intelligence (IJCAI). Bonet, B. 2010. Conformant plans and beyond: Principles and complexity. Artificial Intelligence, 174(3): 245 269. Brafman, R.; and Shani, G. 2012. A multi-path compilation approach to contingent planning. In AAAI Conference on Artificial Intelligence. Cresswell, S.; and Gregory, P. 2011. Generalised domain model acquisition from action traces. 
In International Conference on Automated Planning and Scheduling (ICAPS), 42 49. Cresswell, S. N.; Mc Cluskey, T. L.; and West, M. M. 2013. Acquiring planning domain models using LOCM. The Knowledge Engineering Review, 28(2): 195 213. Grastien, A.; Scala, E.; and Kessler, F. B. 2017. Intelligent Belief State Sampling for Conformant Planning. In IJCAI, 4317 4323. Hoffmann, J.; and Brafman, R. 2005. Contingent planning via heuristic forward search with implicit belief states. In ICAPS, volume 2005. Juba, B.; Le, H. S.; and Stern, R. 2021. Safe Learning of Lifted Action Models. In International Conference on Principles of Knowledge Representation and Reasoning (KR), 379 389. Juba, B.; and Stern, R. 2022. Learning Probably Approximately Complete and Safe Action Models for Stochastic Worlds. In AAAI Conference on Artificial Intelligence. Li, J.; Tinka, A.; Kiesel, S.; Durham, J. W.; Kumar, T. S.; and Koenig, S. 2020. Lifelong Multi-Agent Path Finding in Large-Scale Warehouses. In International Conference on Autonomous Agents and Multi Agent Systems (AAMAS), 1898 1900. Majercik, S. M.; and Littman, M. L. 2003. Contingent planning under uncertainty via stochastic satisfiability. Artificial Intelligence, 147(1): 119 162. Mc Dermott, D. 2000. The 1998 AI Planning Systems Competition. AI Magazine, 21(2): 13. Michael, L. 2010. Partial observability and learnability. Artificial Intelligence, 174(11): 639 669. Mordoch, A.; Portnoy, D.; Stern, R.; and Juba, B. 2022. Collaborative Multi-Agent Planning with Black-Box Agents by Learning Action Models. In Learning with Strategic Agents (LSA) Workshop in the International Conference on Autonomous Agents and Multiagent Systems (AAMAS). Stern, R.; and Juba, B. 2017. Efficient, Safe, and Probably Approximately Complete Learning of Action Models. In the International Joint Conference on Artificial Intelligence (IJCAI), 4405 4411. Wu, K.; Yang, Q.; and Jiang, Y. 2007. ARMS: An automatic knowledge engineering tool for learning action models for AI planning. The Knowledge Engineering Review, 22(2): 135 152. Yang, Q.; Wu, K.; and Jiang, Y. 2007. Learning action models from plan examples using weighted MAX-SAT. Artificial Intelligence, 171(2-3): 107 143. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)