# qualitative_numeric_planning_reductions_and_complexity__a162d423.pdf

Journal of Artiﬁcial Intelligence Research 69 (2020) 923 961 Submitted 12/2019; published 11/2020

Qualitative Numerical Planning: Reductions and Complexity

Blai Bonet bonet@usb.ve Universidad Sim on Bol ıvar Caracas, Venezuela

Hector Geﬀner hector.geffner@upf.edu ICREA & Universitat Pompeu Fabra Barcelona, Spain

Qualitative numerical planning is classical planning extended with non-negative real variables that can be increased or decreased qualitatively , i.e., by positive indeterminate amounts. While deterministic planning with numerical variables is undecidable in general, qualitative numerical planning is decidable and provides a convenient abstract model for generalized planning. The solutions to qualitative numerical problems (QNPs) were shown to correspond to the strong cyclic solutions of an associated fully observable non-deterministic (FOND) problem that terminate. This leads to a generate-and-test algorithm for solving QNPs where solutions to a FOND problem are generated one by one and tested for termination. The computational shortcomings of this approach for solving QNPs, however, are that it is not simple to amend FOND planners to generate all solutions, and that the number of solutions to check can be doubly exponential in the number of variables. In this work we address these limitations while providing additional insights on QNPs. More precisely, we introduce two polynomial-time reductions, one from QNPs to FOND problems and the other from FOND problems to QNPs both of which do not involve termination tests. A result of these reductions is that QNPs are shown to have the same expressive power and the same complexity as FOND problems.

1. Introduction

Qualitative numerical problems (QNPs) are classical planning problems extended with non-negative numerical variables X that can be decreased or increased qualitatively , i.e., by positive indeterminate amounts. Since such numerical variables cannot be used for counting, QNP planning, unlike most other general forms of planning with numbers (Helmert, 2002), turns out to be decidable. QNPs were introduced by Srivastava et al. (2011) as a useful model for generalized planning, namely, the synthesis of plans that solve multiple classical planning instances (Levesque, 2005; Bonet et al., 2009; Srivastava et al., 2011; Hu & De Giacomo, 2011; Bonet & Geﬀner, 2015; Belle & Levesque, 2016; Jim enez-Celorrio et al., 2019). Basically, collections Q of planning instances P that share the same set of actions and state features may be often expressed as a single QNP problem Q whose solutions, that map state features into actions, solve all problems P in Q (Bonet & Geﬀner, 2018). QNPs can be solved in two steps (Srivastava et al., 2011). First, the QNP Q is converted into a standard fully observable non-deterministic (FOND) problem P (Cimatti et al., 2003). Then, solutions of P obtained by an oﬀ-the-shelf FOND planner are tested for termination. This last step is required because the non-determinism in the FOND problem P is not fair in the usual sense: the trajectories that are not fair are those in which a variable is decreased an inﬁnite number of times but increased a ﬁnite number of times only (Bonet et al., 2017). The policies that solve P in which the (QNP) fair trajectories all reach the goal, and which correspond to the terminating strong cyclic policies for P, are the policies that solve the QNP Q (Srivastava et al., 2011).

c 2020 AI Access Foundation. All rights reserved.

Bonet & Geffner

The computational shortcomings of solving QNPs following this generate-and-test approach, however, are two. First, it is not simple to amend FOND planners to generate all the solutions of a FOND problem because FOND plans are not action sequences but closed-loop policies. Second, the number of policies that need to be tested for termination may be huge: exponential in the number of FOND states, and hence, doubly exponential in the number of variables.

In this work we address these limitations while providing additional insights on QNPs. We introduce two polynomial-time reductions, one from QNPs to FOND problems and the other from FOND problems to QNPs. As every (formal) reduction, both reduction are sound and complete, and hence do not require termination tests. A result of these reductions is that QNPs are shown to have the same expressive power and in particular, the plan-existence decision problem for both have the same complexity EXP-Complete (Littman et al., 1998; Rintanen, 2004). The new QNP to FOND translator is implemented and available. In combination with sound and complete FOND planners, the translation yields the only sound and complete QNP planner available (i.e., solver that works directly on the factored QNP representation without explicitly ﬂattening the input QNP).

The structure of the paper is as follows. We review ﬁrst classical planning, FOND planning, QNPs, the direct translation of QNPs into FOND problems, and the termination test. We follow ideas from Srivastava et al. (2011) but in a slightly diﬀerent and more expressive formulation. We then introduce the two new reductions: from FOND problems into QNPs, and from QNPs into FOND problems. This last reduction is very diﬀerent from the one sketched by Bonet et al. (2017) which is incorrect (we illustrate this with an example). We then consider variations and extensions of QNPs, the use of QNPs for generalized planning, experimental results, and related work.

2. Classical and FOND planning

A classical planning problem is a sequential decision problem where a goal is to be reached by performing actions with deterministic eﬀects from a given initial state. These problems are usually expressed in compact form in planning languages such as STRIPS (Fikes & Nilsson, 1971; Russell & Norvig, 2002). A (grounded) STRIPS planning problem (with negation) is a tuple P = F, I, O, G where F denotes a set of propositional variables, I and G are sets of F-literals representing the initial and goal situation, and O is a set of actions a with preconditions and eﬀects Pre(a) and Eﬀ(a) given by sets of F-literals.

The state model S(P) for the problem P = F, I, O, G is a tuple S(P) = S, s0, Act, A, f, SG where S is the set of possible truth-valuations over the F literals, called the states, s0 is the initial state, Act = O, A(s) represents the actions a in Act whose preconditions are true in s, f(a, s) represents the state s that follows action a in s for a A(s), and SG is the set of goal states. It is assumed that the problem P is consistent in the sense that s0 and f are well-deﬁned and SG is not empty. A solution to a classical problem P is an action sequence a0, . . . , an that generates a state sequence s0, . . . , sn+1 over the model S(P) that reaches the goal. In this sequence, ai A(si) and si+1 = f(ai, si) for i = 0, . . . , n, and sn+1 SG.

A fully-observable non-deterministic (FOND) problem P is like a classical planning problem except that actions a may have non-deterministic eﬀects expressed as Eﬀ1(a) | |Eﬀn(a) where Eﬀi(a) is a set of F-literals as above (Cimatti et al., 2003; Geﬀner & Bonet, 2013; Ghallab et al., 2016). The state model S(P) determined by a FOND problem P = F, I, O, G is a tuple S(P) = S, s0, Act, A, F, SG as above with the diﬀerence that the state transition function F is non-deterministic, and maps an action a and state s into a non-empty set F(a, s) of possible successor states. As usual, the non-deterministic transition function F is given in factored form.

Qualitative Numerical Planning: Reductions and Complexity

That is, for action a made of multiple eﬀects Eﬀ1 | | Eﬀn (possibly deterministic when n = 1), each outcome s in F(a, s) results of the choice of one Eﬀi for each non-deterministic eﬀect of a.1

The solutions of FOND problems ensure that the goal is reached with certainty under certain fairness assumptions. Policies or plans in the FOND setting are partial functions π mapping states s into actions π(s). A state trajectory s0, s1, . . . , sn (ﬁnite or inﬁnite) is induced by π over the model S(P) if the action ai = π(si) is deﬁned, it is applicable in the state si, i.e., ai A(si), and si+1 is in F(ai, si), i = 1, . . . , n 1. The trajectory is said to be a π-trajectory. The trajectory is maximal if A) it is inﬁnite, i.e., n = , and does not include a goal state, B) if sn is the ﬁrst goal state in the sequence, or C) the action π(sn) is not deﬁned or not applicable in sn. A policy π is a solution of the FOND problem P if all the fair maximal trajectories induced by π over the model S(P) are goal reaching (Cimatti et al., 1998, 2003). The so-called strong solutions assume that all state trajectories are fair. Strong-cyclic solutions, on the other hand, assume that all trajectories are fair except the inﬁnite trajectories where a state s occurs inﬁnitely often but a state transition (s, s ) for some s F(a, s) for a = π(s), occurs ﬁnitely often. The latter trajectories are deemed to be unfair. Other equivalent characterizations of strong and strong cyclic solutions are common. For example, a strong cyclic solution π for a FOND problem P is also a policy π such that for each π-trajectory connecting an initial state to a state s, there is a π-trajectory connecting s to a goal state. Similarly, a strong solution is a strong cyclic solution π with no cycles; i.e., one where no π-trajectory visits the same state twice. Strong solutions can also be thought as winning strategies against an adversary, while strong cyclic solutions as winning strategies against nature. Indeed, there is a well known relation between (proper) policies that achieve the goal with probability 1 in goal-based MDPs (Markov Decision Processes) and the strong cyclic policies that solve the FOND problem associated with the MDP, where the transition function is such that F(a, s) collects the states s that are possible after action a in s, i.e., for which Pa(s |s) > 0 (Geﬀner & Bonet, 2013). From now, by solution of a FOND problem we mean a strong cyclic solution or plan, and by a FOND planner, we mean a strong cyclic planner, i.e., a planner that produces strong cyclic solutions. There are some good FOND planners available, including PRP (Muise et al., 2012), based on classical planners, My ND (Bercher & Mattm uller, 2009), based on heuristic AND/OR search, and FOND-SAT (Geﬀner & Geﬀner, 2018), based on a reduction to SAT.

3. Qualitative Numerical Problems

Qualitative numerical problems (QNPs) are classical planning problems extended with numerical variables that can be decremented or incremented qualitatively . We make this formal below.

3.1 QNPs: Syntax

The syntax of QNPs is deﬁned as an extension of the STRIPS language with negation. A QNP is a tuple Q = F, V, I, O, G where the new component is a set V of non-negative numerical variables X V . These variables introduce the non-propositional atoms X = 0 and their negations, denoted as X > 0. These literals can appear in the initial situation, action preconditions, and goals of Q. The eﬀects of actions a on a numerical variable X can be only qualitative increments or qualitative

1. As it is standard, any choice of eﬀects is assumed to be consistent (i.e., any pair of choices for two diﬀerent nondeterministic eﬀects of the same action contain no complementary literals). However, with some (polynomially bounded) extra work, our methods, algorithms and results still apply if the model is extended with constraints that every outcome s must satisfy, when such constraints are given in suitable form; e.g. DNF formulas over F.

Bonet & Geffner

decrements denoted by the expressions Inc(X) and Dec(X), often abbreviated as X and X respectively. We refer to X = 0 and X > 0 as the V -literals for X V , and to p and p for p F, as the F-literals in Q.

Deﬁnition 1. A QNP is a tuple Q = F, V, I, O, G where F and V are sets of propositional and numerical variables respectively, I and G denote the initial and goal situations, and O is a set of actions a with preconditions, and propositional and numerical eﬀects that are denoted as Pre(a), Eﬀ(a), and N(a) respectively. The F-literals can appear in I, G, Pre(a), and Eﬀ(a), while V - literals can appear in I, G, and Pre(a). The numerical eﬀects N(a) only contain special atoms of the form Inc(X) or Dec(X) for the variables X in V . Actions with the Dec(X) eﬀect must feature the precondition X > 0 for any variable X in V .

The preconditions and eﬀects of an action a are denoted as pairs Pre(a); Eﬀ(a), N(a) where N(a) contains the numerical eﬀects; namely, expressions like X and X that stand for the increment and decrement of variable X respectively. QNPs are assumed to be syntactically consistent by requiring that no pair of complementary literals or qualitative eﬀects appears in the initial situation, action eﬀects, or goals. A pair of complementary literals or qualitative eﬀects has the form {p, p} for some p in F, or {X = 0, X > 0} or {X , X } for some X in V .

Example. An abstraction that is suitable for expressing the generalized problem of achieving the goal clear(x) over an arbitrary Blocksworld instance (Bonet & Geﬀner, 2018) is given in terms of the QNP Qclear = F, V, I, O, G where F = {H} contains a boolean variable H that represents if the gripper is holding a block, V = {n(x)} contains a numerical variable n(x) that represents the number of blocks above x, and I = { H, n(x) > 0} and G = {n(x) = 0} represent the initial and goal situations. The actions O = {a, b} are

a = H, n(x) > 0; H, n(x) (1)

b = H; H . (2)

It is easy to see that the ﬁrst action a picks up blocks that are above x; its ﬁrst precondition H expresses that the gripper is holding no block, while the second n(x) > 0 that there is at least one block above x. The eﬀects, on the other hand, make H true (expressing that some block is being held) and decrease the number n(x) of blocks above x. The other action b puts the block being held away from block x as expressed by the precondition H and eﬀect H. The fact that b puts blocks away from block x is reﬂected in that it does not aﬀect the variable n(x). The QNP Qclear captures the relevant part of the inﬁnite collection of Blocksworld instances where the goal is to achieve the atom clear(x) for some block x. The solution to Qclear provides the general strategy for solving all such instances. For ways of learning such abstractions automatically; see the recent work of Bonet et al. (2019).

3.2 QNPs: Semantics

A state s for QNP Q = F, V, I, O, G is a valuation that assigns a truth value s[p] to each boolean variable p F, and a non-negative real value s[X] to each numerical variable X V . Since the initial situation I can only feature atoms of the form X = 0 or X > 0, there is a set S0 of possible initial states s0 that correspond to the valuations that satisfy the literals in I. For example, in Qclear, I is given by the literals I = { H, n(x) > 0}, meaning that S0 contains all and only the

Qualitative Numerical Planning: Reductions and Complexity

valuations that make H false and n(x) = r for some positive real number r. The use of variables that can take real values for representing integer counters illustrates that the semantics of QNPs is coarse-grained, and for this reason, decidable. Indeed, QNPs use just one qualitative property of numbers; namely, that a non-negative variable eventually must reach the value of zero if it keeps being decremented and not incremented. This property is true for integers, and it also true for reals, as long as the magnitude of the decrements is bounded from below by some positive ϵ-parameter. More about this below. The state model S(Q) represented by a QNP can be characterized as follows:

Deﬁnition 2. A QNP Q = F, V, I, O, G determines a non-deterministic state model S(Q) = S, S0, Act, A, F, SG where

the states s in S are the valuations that assign a truth value to the boolean variables in F and a non-negative real value to the numerical variables in V ,

the initial states s0 in S0 are those that satisfy the literals in I under a closed-world assumption (s0 makes p and X = 0 false if the literals p and X = 0 are not in I),

the actions in Act are those in O; i.e., Act = O,

the actions A(s) applicable in s are those in Act such that Pre(a) is true in s,

the goal states in SG are those that satisfy G,

the transition function F is such that s F(a, s) for a A(s) if

(a) s [p] is true (resp. false) if p (resp. p) is in Eﬀ(a),

(b) s[X] < s [X] if Inc(X) is in N(a),

(c) s [X] < s[X] if Dec(X) is in N(a),

(d) s [p] = s[p] if neither p nor p in Eﬀ(a),

(e) s [X] = s[X] if neither X nor X in N(a).

A trajectory s0, a0, s1, a1, . . . , sn is compatible with the model S(Q) = S, S0, Act, A, F, SG if s0 S0, and ai A(si) and si+1 F(ai, si) for each ai in the sequence. The trajectory is an ϵbounded trajectory or ϵ-trajectory if the numerical changes are bounded from below by a parameter ϵ > 0, except when this would make the variable negative:

Deﬁnition 3. A trajectory s0, a0, s1, a1, . . . , sn is an ϵ-trajectory iﬀfor any variable X and time point i, with i < n, si+1[X] = si[X] implies |si+1[X] si[X]| ϵ or si+1[X] = 0.

Trajectories bounded by ϵ > 0 cannot decrease the value of a variable asymptotically without ever reaching the value of zero. This is in agreement with the key assumption in QNPs by which variables that keep being decreased and not increased eventually must reach the value zero. From now, trajectories over QNPs will refer to ϵ-trajectories for some ϵ > 0.

3.3 QNPs: Solutions

Solutions to QNPs take the form of partial functions or policies π that map states into actions. The choice of the action π(s) to be done in a state s, however, can only depend on the truth values s[p] associated with the boolean variables p in F and the truth values of the expressions s[X] = 0 associated with the numerical variables X in V . If we use the notation s[X = 0] to refer to s[X] = 0,

Bonet & Geffner

then π(s) must depend solely on the truth-valuation over the F-literals p and the V -literals X = 0 that are determined by the state s. There is indeed a ﬁnite number of such truth valuations but an inﬁnite number of states. We refer to such truth valuations as the boolean states of the QNP and denote the boolean state associated with a state s as s.

Deﬁnition 4 (Policy). A policy π for a QNP Q = F, V, I, O, G is a partial mapping of states into actions such that π(s) = π(s ) if s = s .

A trajectory s0, a0, s1, a1, . . . , sn compatible with the model S(Q) is said to be a π-trajectory for Q if ai = π(si). Sometimes, a π-trajectory is simply denoted as a sequence of states since the actions are determined by π. A π-trajectory is also said to be a trajectory induced by π or compatible with π. As before, a π-trajectory is maximal if A) the trajectory is inﬁnite and does not include a goal state, B) sn is the ﬁrst goal state in the trajectory, or C) π(sn) is undeﬁned or denotes an action that is not applicable in sn. The solutions to QNPs are deﬁned then as follows:

Deﬁnition 5 (Solution). Let Q be a QNP and let π be a policy for Q. The policy π solves Q iﬀ for every ϵ > 0, all the maximal ϵ-trajectories induced by π reach a goal state.

We will see that solutions to QNPs can be characterized equivalently in terms of a suitable notion of QNP-fairness; namely, a policy π solves Q iﬀevery maximal (QNP) fair trajectory induced by π reaches the goal, where the unfair trajectories are those in which some variable X is decreased inﬁnitely often but increased ﬁnitely often. For historical reasons, such unfair trajectories are called terminating instead, as indeed they cannot go on forever if the decrements are bounded from below by some ϵ > 0.

Example. Consider the QNP Qclear = F, V, I, O, G from above with F = {H}, V = {n(x)}, I = { H, n(x) > 0}, G = {n(x) = 0}, and O = {a, b} where a = H, n(x) > 0; H, n(x) and b = H; H . Let π be the policy deﬁned by the rules:

if H and n(x) > 0, then do a , (3)

if H and n(x) > 0, then do b . (4)

All the maximal ϵ-bounded trajectories that are induced by the policy π on Qclear have the form

s0, a, s1, b, s2, a, s3, b, . . . , s2m, a, s2m+1 (5)

where sm+1, for positive integer m, is the ﬁrst state where n(x) = 0 is true. The actions a and b alternate because the ﬁrst makes H false and the second makes it true. In each transition (si, si+1) for a non-negative even integer i, the numerical variable n(x) decreases by ϵ or more, unless si+1[n(x)] = 0. The former case cannot happen more than s0[n(x)]/2ϵ times, as the numerical variable n(x) is decreased every two steps and is never increased. Thus, in all cases and for any ϵ > 0, any ϵ-trajectory induced by the policy π reaches a goal state in a ﬁnite number of steps, regardless of the initial value s0[n(x)] of n(x), and regardless of the actual magnitude of the changes |si+1[n(x)] si[n(x)]|.

Example. A more interesting QNP that requires nested loops is Qnest = F, V, I, O, G with F = , V = {X, Y }, I = {X > 0, Y > 0}, G = {X = 0}, and O = {a, b} where

a = X > 0, Y = 0; X , Y , (6)

b = Y > 0; Y . (7)

Qualitative Numerical Planning: Reductions and Complexity

The policy π is given by the rules:

if X > 0 and Y = 0, then do a , (8)

if X > 0 and Y > 0, then do b . (9)

The policy decrements Y using action b until the action a that decreases X and increases Y can be applied, and the process is repeated until X = 0. The ϵ-trajectories induced by π have the form

s0, b, . . . , b, s1 k0, a, s1 0, b, . . . , b, s1 k1, a, s2 0, b, . . . , b, s2 k2, . . . , a, sm 0 , b, . . . , b, sm km, a, s G . (10)

where there is an outer loop that is executed a number of times m bounded by s0[X]/ϵ, as X is decreased by ϵ or more, but is not increased. In the iteration i of such a loop, the action b is executed a number of times ki bounded by si 0[Y ]/ϵ as in such inner loop Y begins with value si 0[Y ] and it is decreased and not increased. The result is that all the ϵ-trajectories induced by π reach a goal state in a ﬁnite number of steps that cannot be bounded a priori because the increments of Y produced by the action a are ﬁnite but not bounded. The policy π thus solves Qnest.

4. Direct Translation and Termination Test

The problem of deciding the existence of a policy that solves a given QNPs is decidable as noted by Srivastava et al. (2011). They hint a generate-and-test procedure to ﬁnd such a policy where the QNP is ﬁrst translated into a FOND problem, and then all the possible strong cyclic policies for the FOND problem are enumerated and tested for termination. The translation runs in polynomial (linear) time in the number of boolean states for the QNP while the termination test for a given strong cyclic solution is polynomial in the number of FOND states. However, the number of strong cyclic solutions that need to be tested is exponential in the number of FOND states in the worst case. The generate-and-test approach is not eﬃcient but it is complete and runs in ﬁnite time. In contrast, the plan-existence problem for numerical planning is undecidable even in the classical setting where there is a single initial state and the action eﬀects are deterministic; e.g., Post s correspondence problem can be reduced to a numerical planning problem (Helmert, 2002). The decidability of plan existence for QNPs is due to the qualitative behaviour of the numerical variables that cannot keep track of counts; in particular, the variables cannot be incremented or decremented by speciﬁc amounts nor queried about speciﬁc values. We review the translation and the termination test for QNPs before considering a novel polynomial translation which does not require termination tests and which thus is a true reduction of QNPs into FOND problems. The translation TD from a QNP Q to a FOND P = TD(Q) by Srivastava et al. (2011) is simple and direct, and it involves three steps: 1) the literals X = 0 and X > 0 are made propositional with the numerical variables X eliminated, 2) Inc(X) eﬀects are converted into deterministic boolean eﬀects X > 0, and 3) Dec(X) eﬀects are converted into non-deterministic boolean eﬀects X > 0 | X = 0.

Deﬁnition 6 (Direct Translation TD). For QNP Q = F, V, I, O, G , the FOND problem P = TD(Q) is P = F , I , O , G with

1. F = F {X = 0 : X V }, where X = 0 stands for a new propositional symbol p X = 0 and X > 0 stands for p X = 0,

2. I = I but with X = 0 and X > 0 denoting p X = 0 and p X = 0,

3. O = O but with Inc(X) eﬀects replaced by the deterministic propositional eﬀects X > 0, and Dec(X) eﬀects replaced by non-deterministic propositional eﬀects X > 0 | X = 0,

Bonet & Geffner

4. G = G but with X = 0 and X > 0 denoting p X = 0 and p X = 0.

The problem P = TD(Q) is a special type of FOND problem. For example, from its deﬁnition, there is no action in P that can achieve a proposition X = 0 deterministically. We refer to actions in the FOND P with eﬀects X > 0 and X > 0 | X = 0 as Inc(X) and Dec(X) actions, as such eﬀects in P may only come from Inc(X) and Dec(X) eﬀects in Q. Also, observe that the FOND P has a unique initial state even though the QNP Q may have an inﬁnite number of initial states. The states of the FOND problem P = TD(Q) are related to the boolean states over Q, i.e., the truth-assignments over the atoms p and X = 0, the latter of which stand for (abbreviation of) symbols in P. A policy π for the QNP Q thus induces a policy over the FOND problem P and vice versa.2 Moreover, the FOND problem P = TD(Q) captures the possible boolean state transitions in Q exactly. More precisely, (s, a, s ) is a possible transition in Q iﬀ( s, a, s ) is a possible transition in P. Indeed, if we extend the notion of strong cyclic policies to QNPs:

Deﬁnition 7. Let Q be a QNP and let π be a policy for Q. π is strong cyclic for Q iﬀfor every π-trajectory connecting s0 with a state s, there is a π-trajectory connecting s with a goal state.

The following correspondence between boolean states in Q and the states of the boolean FOND problem TD(Q) results:

Theorem 8. Let Q be a QNP and let π be a policy for Q. π is strong cyclic solution for Q iﬀπ is strong cyclic policy for the FOND problem TD(Q).

Proof. Let S(Q) = S, S0, Act, A, F, SG and S(P) = S , s 0, Act , A , F , S G be the state models for the QNP Q and the FOND problem P = TD(Q). From the deﬁnition of the TD translation, the state s is in S0 (resp. in SG) iﬀ s = s 0 (resp. in S G), and the state s F(a, s) for a A(s) iﬀ s F (a, s) for a A ( s). This means that there is a π-trajectory connecting an initial state s0 in S0 with a state s in S iﬀthere is a corresponding π-trajectory connecting s 0 with s in S , and similarly, there is a π-trajectory connecting s with a goal state s iﬀthere is a corresponding π-trajectory connecting s with s in S(Q).

The correspondence between the π-trajectories connecting states s in Q and the π-trajectories connecting the states s in P = TD(Q) does not imply however that the solutions of P and Q are the same. Indeed, the Dec(x) eﬀects of an action a in Q are mapped into the non-deterministic propositional eﬀects X > 0 | X = 0 in P = TD(Q) which implies that X = 0 will be true in P if the action a is repeated inﬁnitely often. On the other hand, a Dec(X) eﬀect in Q ensures that X = 0 will be true if a is repeated inﬁnitely often as long as no Inc(X) action is performed inﬁnitely often as well. In other words, the correspondence between the state transitions (s, a, s ) in Q and the state transitions ( s, a, s ) in P = TD(Q) does not extend to inﬁnite trajectories (Bonet et al., 2017). Recall that trajectories in Q refer to ϵ-trajectories for some ϵ > 0 that exclude inﬁnitesimal changes. As a result:

Theorem 9. Let Q be a QNP and let π be a policy for Q. If τ = s0, s1 . . . is an inﬁnite πtrajectory in Q, then τ = s0, s1, . . . is an inﬁnite π-trajectory in P = TD(Q). Yet, there may be inﬁnite π-trajectories in P = T(D) that do not correspond to any π-trajectory in Q.

2. The policy π over states s of Q determines the policy π over the FOND P where π (t) = π(s) if t = s, and vice versa, a policy π for P determines a policy π for Q where π(s) = π (t) if s = t. For simplicity, we use the same notation π to refer to the policy π over Q and the policy π that it induces over P = TD(Q).

Qualitative Numerical Planning: Reductions and Complexity

Proof. For the ﬁrst part, if τ is an inﬁnite π-trajectory over Q, then si+1 F(ai, si) for ai = π(si); therefore si+1 F(ai, si) for ai = π( si), and hence τ = s0, s1, . . . is an inﬁnite π-trajectory over P. For the second part, one example suﬃces. Let Q be a QNP with a single variable X that is numerical, a single action a with precondition X > 0 and eﬀect Dec(X), initial condition X > 0, and goal X = 0. In the state model S(P) associated with the FOND problem P = TD(Q), there are two states t and t , the ﬁrst where X = 0 is true and the second where X > 0 is true, and there is an inﬁnite trajectory s0, s1, . . . where all si = t and π( si) = a, but there is no inﬁnite trajectory s0, s1 . . . in Q where X > 0 stays true forever while being decremented. Indeed, for any ϵ > 0 and any initial value of X, s0[X] > 0, it is the case that sn[X] = 0 for n > s0[X]/ϵ.

The notion of termination is aimed at capturing the inﬁnite π-trajectories over the FOND problem P = TD(Q) that do not map into inﬁnite π-trajectories over Q. Let

s0, s1, . . . , [ si, . . . , sm] (11)

denote any inﬁnite π-trajectory on the FOND P where the states si, . . . , sm in brackets make the non-empty set of recurring states; namely those that occur inﬁnitely often in the trajectory (not necessarily in that order). We refer to such set of recurrent states as the loop of the trajectory. Observe that knowledge of the loop (and the policy π) is suﬃcient to infer whether a variable X is decremented or incremented inﬁnitely often. Termination imposes the following condition on loops:

Deﬁnition 10 (Terminating Trajectories). Let Q be a QNP and let π be a policy for Q. An inﬁnite π-trajectory s0, . . . , [ si, . . . , sm] is terminating in P = TD(Q) if there is a variable X in Q that is decremented but not incremented in the loop; i.e., if π( sk) is a Dec(X) action for some k [i, m], and π( sj) is not an Inc(X) action for any k [i, m].

The notion of termination is a notion of fairness that is diﬀerent from the one underlying strong cyclic planning that says that inﬁnite but terminating trajectories in P are not fair and hence can be ignored. Indeed, this notion of termination closes the gap in Theorem 9:

Theorem 11. Let Q be a QNP and let π be a policy for Q. τ = s0, s1, . . . is an inﬁnite nonterminating π-trajectory in P = TD(Q) iﬀthere is an inﬁnite π-trajectory τ = s0, s1, . . . in Q.

Proof. Let τ = s0, s1, . . . be an inﬁnite π-trajectory in Q, and let us assume that the inﬁnite trajectory τ = s0, s1, . . . is terminating. Then there must be a variable X that is decremented by π(s) in some recurring state s in τ and which is not incremented by π(s ) on any recurrent state s

in τ. Let s(t) denote the state at time point t in τ, let t be the last time point where variable X is increased in τ (t = 1 if X is not increased in τ), and let X(t+1) be the value of variable X at the next time point. The maximum number of times that X can be decreased after t + 1 is bounded by X(t + 1)/ϵ, and after this, X must have zero value. But in τ, X is decreased an inﬁnite number of times, in contradiction with the assumption that any action that decrements X features X > 0 as precondition. For the converse, we show that one such trajectory τ in Q can be constructed for any ϵ > 0, given that the trajectory τ in P is non-terminating. We do so by adjusting the non-deterministic increments and decrements of the actions, all of which have to be greater than or equal to ϵ, except when this would result in negative values that are increased back to zero. We construct τ = s0, s1, . . . from τ = s0, s1, . . . as follows. The value of the boolean variables is the same in si as in si, and in addition, si[X] = 0 iﬀX = 0 is true in si for all i. We just have to ﬁnd exact values for the numerical variables X in each of the states si in τ, and this is a function of their initial values s0[X] when s0[X] > 0, and the positive decrements or increments (X, si) when π(si) is a

Bonet & Geffner

Dec(X) or Inc(X) action, and (X, si) ϵ. For simplicity and without loss of generality, let us assume that ϵ < 1 (the case for ϵ 1 is an easy exercise). All the positive initial values of numerical variables, increments, and decrements are set to positive integers by considering the sequence of actions π(si), i = 1, . . .. The initial values s0[X] are set to 1 + k(X, 0) where k(X, i) stands for the number of Dec(X) actions that occur between the state si and the ﬁrst state sj after si where an Inc(X) action occurs (if no Inc(X) action occurs after state si, k(X, i) is the number of Dec(X) actions after si). That is, k(X, i) is the cardinality of the set

{j : i j < ind(X, i) and π(sj) is Dec(X) action} (12)

where ind(X, i) is the minimum index j > i such that π(sj) is an Inc(X) action, or if there is no such action after si. Observe that k(X, i) is bounded. The only way it could be inﬁnite is when no Inc(X) action occurs after si while at the same time an inﬁnite number of Dec(X) actions occur; yet, this is impossible since then X eventually becomes zero after which no Dec(X) action may occurs as such actions feature the precondition X > 0. Likewise, the increments (X, si) are set to 1 + k(X, i), and the decrements (X, si) are set to si[X] if X = 0 is true in si+1 and to 1 is X > 0 if true in si+1. It is not diﬃcult to verify that these choices deﬁne a trajectory τ in Q that corresponds to the assumed trajectory τ in P.

The full correspondence between inﬁnite π-trajectories τ in Q and inﬁnite non-terminating πtrajectories τ in P = TD(Q) suggests the following deﬁnition of termination in QNPs Q and FOND problems TD(Q):

Deﬁnition 12 (Termination in Q). A policy π for the QNP Q is terminating iﬀall the π-trajectories on Q are of ﬁnite length. In such a case, we say that π is Q-terminating.

Deﬁnition 13 (Termination in P). A policy π for the FOND problem P = TD(Q) is terminating iﬀ all the inﬁnite π-trajectories on P are terminating. In such a case, we say that π is P-terminating.

The correspondence between policies can then be expressed as:

Theorem 14. Let Q be a QNP, let P = TD(Q) be its direct translation, and let π be a policy for Q (and thus also for P). Then, π is Q-terminating iﬀπ is P-terminating.

Proof. Direct from Theorem 11. For one direction, assume that π is Q-terminating and let τ be a π-trajectory in P. If τ is not terminating, by Theorem 11, τ is inﬁnite and thus π would not be Q-terminating. Therefore, every π-trajectory τ in P is terminating and thus π is P-terminating. The other direction is established similarly.

The soundness and completeness of the direct translation extended with termination can be expressed as following:

Theorem 15 (Soundness and Completeness TD). Let Q be a QNP, let P = TD(Q) be its direct translation, and let π be a policy for Q (and thus also for P). The following are equivalent:

1. π solves Q,

2. π is a strong cyclic solution of Q and π is Q-terminating,

3. π is a strong cyclic solution of P and π is P-terminating.

Qualitative Numerical Planning: Reductions and Complexity

Proof. (1 2) Assume that π solves Q. If there is π-trajectory connecting an initial state with state s, there must be π-trajectory connecting s with a goal state. Otherwise, π would not be a solution for Q. Likewise, if τ is an inﬁnite π-trajectory in Q, then τ does not reach a goal state and thus π would not solve Q. For the converse direction, assume that π is a strong cyclic solution for Q and that π is Q-terminating, and suppose that π does not solve Q. Then, there is a maximal π-trajectory τ in Q that does not reach a goal state. It cannot be the case that τ ends in a state s where π(s) is undeﬁned or non-applicable as π then would not be a strong cyclic solution for Q. Hence, τ must be inﬁnite but this contradicts the assumption that π is Q-terminating. (2 3) By Theorem 8, π is a strong cyclic solution for Q iﬀit is strong cyclic solution for P. By Theorem 14, π is Q-terminating iﬀit is P-terminating.

5. Checking Termination with Sieve

Sieve is the procedure introduced by Srivastava et al. (2011) to test whether a policy terminates. It runs in time that is polynomial in the number of states of the FOND problem reached by the policy.3 For this, the algorithm takes as input a policy graph G(P, π) = V, E constructed from the FOND problem P = TD(Q) and a strong cyclic policy π for P. The nodes in the policy graph are the states s in the state model S(P) that are reachable from the initial state and the policy π, and the directed edges in E are the pairs ( s, s ) for s F(a, s) and π( s) = a. These edges are labeled with the action a. The algorithm iteratively removes edges from the graph G(P, π) until the graph becomes acyclic or no additional edge can be removed. For incrementally removing edges from the graph, Sieve identiﬁes ﬁrst its strongly connected components by a single depth-ﬁrst search traversal, following Tarjan s algorithm (Tarjan, 1972). A strongly connected component (SCC) is a partition of the nodes of the graph such that if a node s belongs to a partition, any node s that can be reached from s and that can reach s back in the graph, is placed in the same partition as s. The algorithm then picks a variable X and a SCC such that the variable X is decremented but not incremented in the SCC. That is, there must be a state s in the SCC such that π( s) is a Dec(X) action and no π( s ) is an Inc(X) action for any s in the SCC. The algorithm then removes all the edges ( s, s ) in the SCC such that π( s) is a Dec(X) action. We abbreviate this by saying that variable X is removed from the SCC, which means that the edges associated with Dec(X) actions are removed. Following the edge removals, the SCCs must be recomputed and the process is repeated until the graph becomes acyclic or no more edges can be removed in this manner. The result is that:

Theorem 16. Let Q be a QNP. A policy π for the FOND problem P = TD(Q) is P-terminating iﬀSieve reduces the policy graph G(P, π) to an acyclic graph.

Proof. We show the contrapositive of the two implications that make up the equivalence. First, let us assume that Sieve terminates with a cyclic graph G and, thus, let C be a non-trivial SCC in the graph G. Since every state in G is reachable from the initial state by π and since every state in C is also reachable by π from any other state in C, there is a π-trajectory τ in P of the form s0, . . . , [ si, . . . , sm] where the recurrent states si, . . . , sm in τ are exactly all the states in C. Observe that Sieve is unable to remove any further edge from G. If no variable is decremented in the loop, τ is not P-terminating by deﬁnition. Similarly, if some variable X is decremented in the loop (i.e., C), X is also incremented in the loop and the trajectory is again not P-terminating.

3. Our account of termination diﬀers from the one of Srivastava et al. (2011) in two main aspects. First, our notion of termination is articulated independently of the algorithm for checking termination. Second, our version of the algorithm is developed and applied to QNPs that involve both numerical and boolean variables.

Bonet & Geffner

Sieve (Graph G = G(P, π)): repeat

Compute the strongly connected components (SCC) of G.

Choose an SCC C and a variable X that is decreased in C but is not increased in C; i.e., for some s in C, π( s) is a Dec(X) action, and for no s in C, π( s) is an Inc(X) action.

Remove the edges ( s, s ) such that s and s are in C, and π( s) is a Dec(X) action. until G is acyclic (terminating) or there is no SCC C and variable X to choose (non-terminating)

Figure 1: Sieve procedure for testing whether policy π for FOND problem P = TD(Q) terminates (Srivastava et al., 2011).

For the other implication, let us assume that π is not P-terminating. Then, there must be an inﬁnite π-trajectory τ in P of the form s0, . . . , [ si, . . . , sm] where every variable that is decremented by an action π( sj), j [i, m], is incremented by another action π( sj), j [i, m]. We want to show that Sieve terminates with a graph that has one SCC that includes all the states in the loop. Indeed, initially, all the states in the loop must be in one component C as they are all reachable from each other. Observe that edges that correspond to actions that do not decrement variables are not removed by Sieve. Hence, if the loop does not decrement any variable X, the states in the loop cannot be separated into diﬀerent SCCs. On the other hand, Sieve cannot remove a variable X from the loop as the component C features actions that increment and decrement X. Therefore, as before, the states in the loop stay together within an SCC for the whole execution of Sieve.

The Sieve procedure, slightly reformulated from Srivastava et al. (2011), is depicted in Figure 1.

Example. The policy π for Qnest above is given by the rules:

if X > 0 and Y = 0, then do a , (13)

if X > 0 and Y > 0, then do b (14)

where recall that Qnest = F, V, I, O, G with F = , V = {X, Y }, I = {X > 0, Y > 0}, G = {X = 0}, and O = {a, b} where a = X > 0, Y = 0; X , Y and b = Y > 0; Y . The policy decrements Y using the action b until the action a that decreases X and increases Y can be applied. The process is repeated until X = 0. The nested loops in the policy graph G(P, π) are shown in Figure 2. The policy graph contains three states: the leftmost one is the initial state, and the rightmost one is a goal state. The two states on the left are reachable from each other, and hence, deﬁne a strongly connected component (SCC). In this SCC, the Dec(X) edges are removed by Sieve because X is not increased anywhere. Once this is done, the Dec(Y ) edges are removed by Sieve because the edges associated with Inc(Y ) eﬀects are gone. The resulting graph is acyclic establishing thus that the policy π terminates in P = TD(Qnest).

Using Sieve it is easy to see that the problem of checking the existence of plans for QNPs can be decided in exponential space:

Theorem 17 (Srivastava et al., 2011). Deciding plan existence for QNPs is in EXPSPACE.

Proof. Let Q = F, V, I, O, G be a QNP. The number of boolean states for Q is exponential in the number of ﬂuents and variables; i.e., |F| + |V |. A policy π for Q can be described in exponential

Qualitative Numerical Planning: Reductions and Complexity

X > 0, Y > 0 X > 0, Y = 0 X = 0, Y > 0

Figure 2: Testing termination with Sieve. Policy graph G(P, π) for the FOND problem P = TD(Qnest) and policy π, from the example in text, containing three states: the leftmost is the initial state, and the rightmost is a goal state. The two states on the left are reachable from each other, and hence, deﬁne a strongly connected component (SCC). In this SCC, the Dec(X) edges are removed by Sieve because X is not increased anywhere. Once this is done, the Dec(Y ) edges are removed because the edges associated with Inc(Y ) eﬀects have been eliminated. The resulting graph is acyclic and hence π terminates in P.

space as a mapping from boolean states into actions. A brute-force algorithm enumerates all policies one by one using exponential space. Each policy is tested for strong cyclicity and termination. The former is a straightforward test in a graph while the latter is done with Sieve. If the policy is strong cyclic and terminating, Q is accepted. Otherwise, if no policy is found to be strong cyclic and terminating, Q is rejected. Since testing strong cyclicity and running Sieve both require polynomial time in the size of the input policy, the whole algorithm can be implemented in space that is exponential in the size of Q.

Below we improve this bound and show exponential time (EXP) solvability through a more complex translation of QNPs into FOND problems that is also polynomial. The novelty of the new translation is that the strong-cyclic policies of the resulting FOND problems do not need to be checked for termination. QNPs are thus fully reduced to FOND problems. Since FOND problems can be reduced to QNPs as well, we will show indeed that FOND problems and QNPs have the same expressive power, and the complexity of plan existence for FOND problems is known to be EXP-Complete (Littman et al., 1998; Rintanen, 2004), then these reductions show the EXPCompleteness of the plan existence decision problem for QNPs. In addition to the establishment of novel theoretical results, these reductions are also of practical importance as they permit the computation of solutions; once a QNP is reduced to FOND, a solution for the QNP can be recovered in linear time from a solution to the FOND problem, and the same for the reduction from FOND problems into QNPs. The distinction between the classes EXP and EXPSPACE is important. EXPSPACE contains the (decision) problems that can be solved with Turing machines (TMs) that operate in exponential space (as a function of the input size), yet such TMs may in fact run in doubly exponential time as the running time is bounded, in the worst case, by an exponential function of the bound in space (Sipser, 2006). On the other hand, EXP comprises the problems that can be solved with TMs that run in exponential time. The diﬀerence is analogous to the diﬀerence between the classes P (polynomial time) and PSPACE (polynomial space).

6. First Reduction: FOND problems into QNPs

The ﬁrst reduction that we introduce is from FOND problems into QNPs. It is a non-trivial reduction yet simpler than the inverse reduction from QNPs into FOND problems. The main

Bonet & Geffner

obstacle to overcome is that the non-deterministic eﬀects in FOND problems are over boolean variables, while those in QNPs are only on numerical variables through decrements. Another important obstacle is that strong cyclic solutions in QNPs are not QNP solutions unless they are terminating. Let P = F, I, O, G be a FOND problem and let us denote the non-deterministic eﬀects of action a as Ea 1 | Ea 2 | | Ea ka where each Ea i is a set (conjunction) of F-literals, and ka denotes the number of non-deterministic eﬀects of the action a. If ka = 1, the action a is deterministic, else it is non-deterministic. For simplicity, we assume that the set of eﬀects {Ea i }i for action a aggregates all the multiple non-deterministic eﬀects in the description of a in P, and the reduction below is presented under this assumption. Afterwards, we discuss how to handle in polynomial time FOND problems whose transitions are factorized. We map P into a QNP Q = F , V , I , O , G that extends P with numerical variables V = {X} {Ya,i : a O, 1 i ka}, extra boolean variables, and extra actions; i.e., F F , I I , O O , and G = G. The heart of the reduction lies in the way in which the non-deterministic eﬀects of each action are captured in Q. For this, the collection of non-deterministic eﬀects of the action a are replaced by an Inc(X) action, for the unique numerical variable X, followed by a ﬁxed loop where the variable X is decremented until becoming zero. The alternative eﬀects Ea i are then triggered when X = 0 becomes true in the corresponding part of the loop. Intuitively, each non-deterministic action a becomes a wheel of fortune that must be spun to select the non-deterministic eﬀect to be applied. The increments and decrements of the extra variables Ya,i ensure that the strong cyclic policies π for P, and only those, induce policies π for Q that are terminating. The ﬁxed loop sequence associated with action a performs the following steps, that are implemented with the help of new auxiliary actions and propositional symbols:

1. Inc(X) (implemented by modiﬁed action a),

2. Dec(X) (implemented by new action Spin),

3. If X = 0, apply the eﬀects in Ea 1, increment Ya,1, decrement Ya,j, j = 1, and break loop (implemented by new actions Prep(a, i) and Exit(a, i) for i = 1),

4. Dec(X) (implemented by new action Next(a, i) for i = 1)

5. If X = 0, apply the eﬀects in Ea 2, increment Ya,2, decrement Ya,j, j = 2, and break loop (implemented by new actions Prep(a, i) and Exit(a, i) for i = 2),

6. Dec(X) (implemented by the new action Next(a, i) for i = 2)

7. If X = 0, apply the eﬀects in Ea ka, increment Ya,ka, decrement Ya,j, j = ka, and break loop (implemented by new actions Prep(a, i) and Exit(a, i) for i = ka),

8. Dec(X) and go back to 3 (implemented by new action Loop(a))

If, throughout the loop, some variable Ya,j becomes zero, the next action to apply is forced to be a new action that achieves the goal thus ending the execution. This mechanism takes care of the non-fair trajectories that may exist in P. To show that the resulting mapping is indeed a reduction (i.e., the mapping is sound and complete), two things need to be established: that a policy π that solves Q induces a policy π that

Qualitative Numerical Planning: Reductions and Complexity

solves P (soundness), and vice versa, that a policy π that solves the FOND problem P induces a policy π that solves Q (completeness). The ﬁxed sequence of eﬀects for each action a in Q is implemented using the following additional boolean variables:

A boolean normal that is false when the sequence is entered for some action a and made true when the loop is exited.

A boolean ex(a) to express that the sequence for the action a is being executed. The reduction is such that the atoms in {normal} {ex(a) : a O} are pairwise mutex and one of them is always true.

A counter from 0 to K encoded using mutex atoms cnt(ℓ), ℓ= 0, 1, . . . , K + 1, that is set to 1 when the loop is entered and re-entered (step 8 above) and where K is the maximum number of non-deterministic outcomes of any action in P. (An engineering trick that saves an extra boolean permits cnt(0) and cnt(i) to be both true after executing the action Prep(a, i) below.)

The actions that implement the ﬁxed loop sequence (i.e., spin the wheel) are the following:

Spin = normal, cnt(0), X > 0; cnt(0), cnt(1), X .

Next(a, i) = ex(a), cnt(i), X > 0; cnt(i), cnt(1+i), X to advance along the ﬁxed loop sequence while X > 0 and decrementing X.

Loop(a) = ex(a), cnt(ka), X > 0; cnt(ka), cnt(1), X to start a new iteration of the loop.

These are the only actions that decrease the variable X, while the (modiﬁed) non-deterministic actions being the only actions that increase X (see below). Once X becomes zero, the nondeterministic eﬀect to apply is determined by the value of the counter. The actions that apply the selected eﬀect and capture the non-fair trajectories are:

Prep(a, i) = ex(a), cnt(i), cnt(0), X = 0, Ya,j > 0; cnt(0), Ya,i , {Ya,j }j =i where the precondition Ya,j > 0 is for all j, and the eﬀect Ya,j for all j = i.

Exit(a, i) = ex(a), cnt(i), cnt(0), X = 0, Ya,j > 0; ex(a), cnt(i), normal, Ea i where the precondition Ya,j > 0 is for all j.

Fin(a, j) = ex(a), X = 0, Ya,j = 0; ex(a), { cnt(i)}i 1, normal, G that is forced when Ya,i = 0 and reaches the goal G, where the eﬀect cnt(i) is for all positive i.

The only actions that aﬀect the variables Ya,j are the Prep(a, i) actions. After applying such an action, some (one or more) variables Ya,j may become zero. If so, Fin(a, j) becomes the only applicable action and the execution terminates as the goal is reached. If no such variable becomes zero, Exit(a, i) becomes forced which applies the eﬀect Ea i and restores normal operation. The initial state of the QNP includes the atoms normal and cnt(0). Deterministic actions in P pass directly into the QNP Q with these two atoms added as extra preconditions. Nondeterministic actions a, however, are handled diﬀerently by replacing their eﬀects Ea 1 | Ea 2 | | Ea ka by deterministic eﬀects { normal, ex(a), X } after which the only applicable action would be the Spin action from above. The idea of the construction is illustrated in Figure 3.

Deﬁnition 18. Let P = F, I, O, G be a FOND problem. The reduction R maps P into the QNP Q = F , V , I , O , G given by

Bonet & Geffner

cnt(0), X = 0, Yj > 0

cnt(0), X > 0, Yj > 0

cnt(1), X > 0, Yj > 0 cnt(1), X = 0, Yj > 0

cnt(2), X > 0, Yj > 0 cnt(2), X = 0, Yj > 0

cnt(k), X > 0, Yj > 0 cnt(k), X = 0, Yj > 0

(modiﬁed) a : X

Next(a, 1) : X

Next(a, 1) : X

Prep(a, 1) : Y1 , {Yj }j =1

Prep(a, 2) : Y2 , {Yj }j =2

Prep(a, k) : Yk , {Yj }j =k

Next(a, k 1) : X

Next(a, k 1) : X

Loop(a) : X

Figure 3: Encoding the non-deterministic boolean eﬀects E1 | | Ek of an action a in the FOND problem P as a ﬁxed sequence of eﬀects that loops while decrementing the variable X in the QNP Q = R(P). The variable X implements the loop while the counter cnt(i) determines which eﬀect Ei obtains when X becomes zero. When X = 0 and cnt(i) hold, Prep(a, i) is applied whose function is to increment Yi = Ya,i and decrement the other Yj variables; if no such variable becomes zero, the eﬀect Ei is applied with the action Exit(a, i) (not shown), otherwise Fin(a, j) is applied when Yj becomes zero (not shown). The variables Yi are used to map unfair trajectories in P into goal-reaching trajectories in Q, while forcing solutions of Q to induce solutions for P. Indeed, if one unfair trajectory contains the action a inﬁnitely often but neglects (starves) the eﬀect Ei, the variable Yi eventually becomes zero, since the only action that increments it is Prep(a, i), and then the trajectory forcibly applies the action Fin(a, i) that terminates the execution by reaching the goal (see text for details).

1. F = F {normal} {ex(a) : a O} {cnt(ℓ) : ℓ [0, K]} where K = maxa O ka,

2. V = {X} {Ya,i : a O, i [1, ka]},

3. I = I {normal, cnt(0)} {X = 0} {Ya,i > 0 : a O, i [1, ka]},

4. O = O {Spin} {A(a, i) : A {Prep, Exit, Fin, Next}, a O, i [1, ka]} {Loop(a) : a O},

For stating the formal properties of the reduction, we need to specify how a policy π for Q induces a policy π for P, and vice versa, as Q involves extra variables and actions.

Qualitative Numerical Planning: Reductions and Complexity

Let us split the non-goal states in Q into two sets: the normal states, where the booleans normal and cnt(0) are true, and the rest of non-goal states where normal or cnt(0) is false. (Observe that there cannot be a normal state where some variable Ya,j is zero as when that happens, the loop must exit with Fin(a, j) that reaches a goal state.) In the normal states, the (qualitative) value of all the extra variables is the same: all the extra boolean variables are false except for normal and cnt(0) that are true, X = 0, and Ya,i > 0 for a O and i = 1, 2, . . . , ka. Hence, any policy for P can be converted into a policy for Q, and vice versa. For example, a policy π for P translates directly into a policy for Q on normal states, where the actions for non-normal states is determined (the only real choice is when more that one variable Ya,j is zero, but then any choice of the Fin(a, j) actions is equivalent as such actions yield goal states). Conversely, any policy for Q induces a unique policy for P as each non-goal state in P is associated with normal states in Q that get the same action from the policy. The main properties of the reduction are expressed as follows:

Theorem 19 (Reduction FOND to QNP). The mapping R is a polynomial-time reduction from FOND problems into QNPs. That is, a FOND problem P has a solution iﬀthe QNP Q = R(P) has a solution. Moreover, a solution for P (resp. Q) can be recovered in polynomial time from a solution for Q (resp. P).

Proof. It is straightforward to see that R is computable in time that is polynomial in the size of the input P and the maximum value ka for any action a. (See below for a discussion on how to deal with FOND problems that may have multiple non-deterministic eﬀects per action.) Likewise, as explained above, policies for P map into policies for Q, and vice versa. We ﬁrst show how to map inﬁnite fair trajectories in P into trajectories in Q that visits normal states inﬁnitely often, and vice versa. Indeed, for an inﬁnite trajectory τ in P, we can construct a corresponding trajectory τ in Q by augmenting the states in τ to give value to the extra booleans and numerical variables in Q, and by inserting sub-trajectories over non-normal states for the triplets (s, a, s ) in τ for non-deterministic actions a. Indeed, for such a triplet, it is enough to determine the eﬀect Ea i that produces the successor state s in order to obtain a ﬁnite sequence of non-normal states that map s into s in Q. For the converse, if τ is a trajectory in Q that visits normal states inﬁnitely often, then τ does not contain an action Fin(a, i). Thus, we can construct a trajectory over P by just projecting τ over its normal states. This trajectory is fair since if some eﬀect Ea i is starved from an action a that is applied inﬁnitely often, the variable Ya,i would become zero forcing the application of Fin(a, i). For showing the ﬁrst implication, let us consider a policy π for P. We want to show that the policy π obtained from π solves Q. Indeed, suppose that π does not solve Q. Then, π is either not strong cyclic or non terminating in Q. The former is impossible since π connects each reachable normal state to a goal state. Thus, π is non terminating in Q and there is a non-terminating trajectory s0, . . . [ si, . . . , sm] in TD(Q). The loop increments and decrements the variable X, and all the variables Ya,j for the actions a that are applied in the loop, and thus it must contain some normal state. Therefore, we can construct an inﬁnite fair π-trajectory in P that is non-goal reaching contradicting the assumption that π solves P. For showing the second implication, let π be a policy for Q and let us suppose that the policy π induced from π does not solve P. Then, either π is not deﬁned over its reachable states or there is an inﬁnite fair π-trajectory in P, one that does not reach the goal. The ﬁrst case is impossible since π solves Q. For the second case, if such trajectory exists, then there is a π -trajectory τ in Q that visits normal states inﬁnitely often. This trajectory is associated with a non-terminating trajectory in TD(Q) in contradiction with π being a solution for Q.

Bonet & Geffner

Theorem 19 states that the strong cyclic policies for a FOND problem P correspond exactly with the policies of the QNP Q = R(P) that can be obtained from P in polynomial time. There is an analogous translation for computing the strong policies of P; namely, the strong cyclic policies that are actually acyclic (no state visited twice along any trajectory). For this, let R be the translation that is like R above but with the numerical variables Ya,i removed; i.e., initial and goal conditions on Ya,i are removed from R (P) as well as the conditions and eﬀects on Ya,i in the actions Prep(a, i) and Exit(a, i), and with the actions Fin(a, j) removed.

Theorem 20 (Reduction Strong FOND to QNP). The mapping R is a polynomial time reduction from FOND problems with strong solutions into QNPs. That is, a FOND problem P has a strong solution iﬀthe QNP Q = R (P) has solution. Moreover, a strong solution for P (resp. solution for Q) can be recovered in polynomial time from a solution for Q (resp. strong solution for P).

Proof. In the absence of the variables Ya,i, the only solutions to Q = R (P) must be acyclic, as any cycle would involve increments and decrements of the single numerical variable X, and thus, would not be terminating. The rest of the proof follows from the correspondence laid out in the previous proof between the trajectories over the normal states in Q and the trajectories in P.

Let us now consider the case when the FOND problem P contains actions with multiple nondeterministic eﬀects. Let a be one such action, and let n be the number of non-deterministic eﬀects in a, each one denoted by Ej 1 | | Ej kj where j [1, n] and kj is the number of outcomes for the j-th non-deterministic eﬀect of a. By assumption, every choice of outcomes for the eﬀects yields a consistent (aggregated) outcome for a (cf. footnote 1). The easiest way to accommodate such actions is to preprocess P by replacing the action a by the sequence a(1), a(2), . . . , a(n) of n nondeterministic actions, each one featuring exactly one eﬀect; i.e., the eﬀect of a(j) is Ej 1 | | Ej kj. For this, the precondition of a(1) is set to the precondition of a, that of a(j) to {seq(a), next(j)}, j [2, n], and the precondition of every other action is extended with seq(a). Likewise, the eﬀect of a(1) is extended with {seq(a), next(2)}, that of a(j) with { next(j), next(j + 1)}, j [2, n 1], and that of a(n) with { next(n), seq(a)}. The new atoms seq(a) and next(j) denote that the sequence for a is being applied and the next action to apply in the sequence is a(j) respectively. In this way, the FOND problem P is converted in linear time into an equivalent FOND problem where each action has exactly one non-deterministic eﬀect. In other words, we may assume without loss of generality that the FOND problem P is such that each action has at most one non-deterministic eﬀect since if this is not the case, P can be converted into an equivalent FOND problem P in linear time. By equivalent, we mean that any solution π for P can be converted in polynomial time into a solution π for P , and vice versa.

We have shown how strong cyclic and strong planning over a FOND problem P translates into QNPs: in one case, including the extra variables X and {Ya,i}a,i in place, in the second, only X. The translation with these variables oﬀers the possibility of capturing more subtle forms of fairness. For example, if we just remove from the translation a variable Ya,i along with the eﬀects on it, the resulting QNP would assume that all the outcomes Ej, j = i, of action a are fair (i.e., they cannot be skipped forever in a fair trajectory) but that the outcome Ei is not. In other words, while in strong cyclic planning, all the non-deterministic actions are assumed to be fair, and in strong planning, all of them to be unfair, in QNP planning, it is possible to handle a combination of fair and unfair actions (as in dual FOND planning, Geﬀner & Geﬀner, 2018), as well as a combination of fair and unfair outcomes of the same action.

Qualitative Numerical Planning: Reductions and Complexity

7. Second Reduction: QNPs into FOND problems

We have shown that FOND problems can be reduced in polynomial time to QNPs. We show now the other direction: QNPs can be reduced in polynomial time to FOND problems. The two results imply a new complexity result; namely, that QNPs have the same expressive power as FOND problem and that the plan existence decision problem for both models have the same complexity. This second translation T is more subtle than the ﬁrst and unlike the direct translation TD above, it is a full reduction which does not require termination tests. The ﬁrst attempt at such a translation was sketched by Bonet et al. (2017) but the reduction is buggy as it is not always sound. The intuition, however, is useful and we build on it. Basically, that reduction introduces boolean variables q X that when set to true preclude increments of variable X, hence making the decrements of X fair . The variable q X can be reset to false when the loop ﬁnishes , i.e., when X = 0 is true. This idea, however, does not fully avoid non-terminating loops and hence, by itself, does not produce a sound reduction.4 The new translation replaces the q X variables by a bounded stack that keeps the variables X that are being decremented in order, and suitable counters. The new variables and actions enforce that solutions of the FOND problem P = T(Q), unlike the solutions of the direct translation TD(Q), are all terminating. For this, the new translation introduces conditions that mirror those captured by the Sieve procedure. In particular, for capturing policies that terminate, the variables are to be placed on the stack following the order by which Sieve removes them.

7.1 Extra Variables and Actions

The reduction T(Q) introduces a bounded stack α where numerical variables from V can be pushed and popped, and bounded counters c(d), for d = 0, . . . , |V | that are associated with the possible levels (depths) d = |α| of the stack. There is also a top counter c T that may only increase. The stack starts empty and may grow to contain all the variables in V , but no variable can appear in the stack more than once. The stack is represented as growing from left to right; e.g., αX is the stack that results of pushing the variable X in the stack α. The c counters start at 0 and may grow up to a Max number, that for completeness must be set to 1 + 2n where n is the total number of boolean and numerical variables in Q. In practice, Max can be set to a much small number.5

In any case, the counters and the stack are captured in terms of a polynomial number of boolean variables and the whole reduction T(Q) is computed in polynomial time. The state of the stack α is captured by the atoms in(X), depth(d), and index(X, d) that represent whether X is in the stack, the depth of the stack, and the depth at which X is in the stack, respectively. X is the top element in the stack when index(X, d) and depth(d) are true (i.e., the stack is αX and |α| = d 1),

4. Consider a QNP Q = F, V, I, O, G with a single numerical variable X and four actions a, b, c, and d that result in a loop where a = p1, X > 0; p1, p2, X , b = p2; p3, p2 , c = p3, X > 0; X , d = p3, X = 0; p3, p1, X . Let us assume that I = {p1, X > 0} and G = {X = 0, p2}. There is a single policy π for Q, as in all the (non-goal) states that can be reached from I, there is a single applicable action. This policy π is strongly cyclic but is not terminating. The reason is that one of the trajectories determined by the policy is a non-terminating loop s0, a, s1, b, s2, c, s3, d, s0, . . . where the single variable that is decremented (X) is also incremented, and where s0 = {p1, X > 0}, s1 = {p2, X > 0}, s2 = {p3, X > 0}, and s3 = {p3, X = 0}. The FOND problem that results from the translation sketched by Bonet et al. (2017) accepts this policy π as a solution, which is incorrect. This happens because the variable q X can be set and reset an inﬁnite number of times; indeed, right before and right after the action c in the loop, respectively. The new translation excludes such non-terminating loops via a stack and counters. 5. For structured policies that result in loops that can be entered and left through single entry and exit points, Max is the bound on the number of consecutive loops (blocks), possibly with other loops nested, that the policy can generate at the same level.

Bonet & Geffner

and it is bottom element when index(X, 1) is true (i.e., the stack is X). The stack is empty when depth(0) holds. The extra actions in P = T(Q) are those for pushing and popping variables to and from the stack, and for advancing the top counter c T .

1. Actions Push(X, d) for variable X and depth d [0, |V | 1] have preconditions in(X), depth(d) and c(d) < Max, and eﬀects:

a) in(X), index(X, d + 1), depth(d + 1) and depth(d) to push X and increase stack depth,

b) c(d) := c(d) + 1 to increment counter for old level,

c) c(d + 1) := 0 to initialize counter for new level.

2. Actions Pop(X, d) for variable X and depth d [1, |V |] have preconditions in(X), index(X, d) and depth(d), and eﬀects:

a) in(X), index(X, d), depth(d), depth(d 1) to pop X and decrease stack depth.

3. Action Move advances the top counter c T by 1 when the stack is empty; i.e., it has preconditions depth(0) and c T < Max, and eﬀect c T := c T + 1.

For simplicity, we assume that the language of our FOND problems P makes room for the integer counters c(d), d = 0, . . . , |V |, that may be increased by 1, from 0 up to a ﬁxed number Max and that may be reset back to 0. In the implementation, these counters are represented in terms of a linear number of boolean variables.6

The actions a that belong to Q are split into two classes. Actions a that do not decrement any variable, keep their names in P = T(Q) but replace their Inc(X) eﬀects by propositional eﬀects X > 0, and add the precondition in(X) that disables a when X is in the stack:

4. Actions a in Q that decrement no variable, keep the same names in T(Q), the same preconditions and same eﬀects, except that the eﬀects Inc(X) are replaced by propositional eﬀects X > 0, if any, and in such a case, the precondition in(X) is added.

Actions a from Q that decrement variables (and hence introduce non-determinism) map into actions in P of type a(X, d) where X is a variable decremented by X that is in the stack at depth d: more of the variables that are decremented are in the stack when the action is applied:

5. Actions a(X, d) for a in Q that decrement a variable X in the stack at level d (and possibly others) inherit propositional preconditions and eﬀects of a, and for each variable Y that is increased by a, they include the precondition in(Y ) and the eﬀect Y > 0. The parameter d [1, |V |] stands for the current stack depth. The action a(X, d) also has:

a) extra precondition index(X, d) (i.e., d is level at which X appears in stack),

6. In the actual encoding, the counters c(d) are represented with 1 + n atoms (bits), bi(d), i = 0, . . . , n, where n is total number of variables in Q, i.e., n = |F| + |V |. A preconditions such as c(d) < Max then translates into the precondition bn(d); the least and most signiﬁcant bits for c(d) are b0(d) and bn(d) respectively. Increments of c(d) by 1 may be translated in two diﬀerent ways, either by using conditional eﬀects, or by increasing the number of actions. For the former, conditional eﬀects of the form b0(d), . . . bi 1(d), bi(d) b0(d), . . . , bi 1(d), bi(d), for i [0, n], are used. For the latter, each action act that increases c(d) is replaced by n actions act(i), i [0, n], that are like act but have the extra precondition b0(d), . . . bi 1(d), bi(d) and the extra eﬀects b0(d), . . . , bi 1(d), bi(d). The ﬁrst translation, however, when compiled into STRIPS introduces additional actions as well (Nebel, 2000). Finally, a reset eﬀect c(d) := 0 is obtained by setting all atoms b0(d), i [0, n], to false.

Qualitative Numerical Planning: Reductions and Complexity

b) extra non-deterministic eﬀects Xi > 0 | Xi = 0 for each Dec(Xi) eﬀect, and

c) extra eﬀects c(d ) := 0 for each d such that d d |V | to reset the counters for the levels above or equal to d.

In words, actions a from Q that do not decrement any variable map into a single action of the form a in P = T(Q), while actions a from Q that decrement variables map into actions a(X, d) applicable only when a variable X decremented by a is in the stack at level d. The actions of the form a in P = T(Q) are deterministic, and only actions a(X, d) can generate cycles in a strong cyclic policy for P. The reduction P = T(Q) can be summarized as follows:

Deﬁnition 21 (Reduction QNP to FOND). Let Q F, V, I, O, G be a QNP. The FOND problem P = T(Q) is P = F , I , O , G with:

1. F = F {c T } {depth(d), c(d)} {in(X)} {index(X, d )},

2. I = I {depth(0), c T = 0, c(0) = 0},

4. O = {a : a O+} {a(Y, d ) : a O } {Push(X, d 1), Pop(X, d )} {Move}

where X ranges over V , d and d range over [0, |V |] and [1, |V |] respectively, and O and O+ stand for the sets of actions in O that decrement and do not decrement a variable respectively, and the variable Y in a(Y, d ) ranges among the variables decremented by the action a in Q. Preconditions and eﬀects of the actions in O are described above in the text.

8. Properties

Clearly, the reduction P = T(Q) can be computed in polynomial time:

Theorem 22. Let Q = F, V, I, O, G be a QNP. The reduction P = T(Q) can be computed in time that is polynomial in the size of Q.

Proof. Let n = |F| + |V | be the number of variables, propositional or numerical, in P = T(Q). P has 1 + n counters of capacity 1 + 2n, each one requiring 1 + n bits: the counter c(d) is encoded in binary with bits c(d, i), i [0, n]. P also has n atoms of form depth(d), |V | = O(n) atoms of form in(X), and n|V | = O(n2) atoms of form index(X, d). Therefore, P has O(|F| + n2) = O(n2) propositional variables. P has |V |2 = O(n2) push actions. Since Push(X, d) has precondition c(d) < Max and eﬀect c(d) := c(d) + 1, it gets compiled into n actions of the form Push(X, d, i), i [0, n 1], where precondition c(d) < Max is expressed as { c(d, i)} {c(d, j) : j [0, i 1]}, and eﬀect c(d) := c(d) + 1 is expressed as {c(d, i)} { c(d, j) : j [0, i 1]}. The pop actions do not modify counters, so there are O(n2) of them. The Move action increments the counter c T and then, like for Push(X, d), it gets compiled into n diﬀerent actions. Actions a in Q that do not decrement variables are translated into actions a in P. Actions a that decrement a variable get translated into actions a(X, d); there are O(n2) such actions a(X, d) in P for each such action a in Q. In total, P has O(n2) propositional variables and O(|O|n2 + n3) actions, where the cubic term accounts for the Push(X, d, i) actions. These numbers (polynomially) bound the size of P. It is clear that producing each action in P is straightforward and can be done in polynomial time.

The second direct property of the translation is that due to the use of the counters, all strong cyclic policies π for P must terminate:

Bonet & Geffner

Theorem 23. Let Q be a QNP. Any strong cyclic policy π for P = T(Q) is P-terminating.

Proof. Let π be a strong cyclic policy for P and let τ = s0, . . . , [ si, . . . , sm] be an inﬁnite πtrajectory. We need to show that there is some variable X that is decreased in one of these states and increased in none. Clearly, π( s) for some s in the recurrent set must be a non-deterministic action, and this means it is an action of form a(X, d). The actions a(X, d) require X to be in the stack and then resets all counters c(d ) for d d back to 0. Let us pick an action a(X, d) in the recurrent states of τ to be one with smallest stack depth d, and let s be one of such states where π( s) = a(X, d). In the state s, the variable X is in the stack at level d; i.e., index(X, d) is true. We show next that this same atom must be true in all the other recurrent states in τ. Indeed, if there is a recurrent state where index(X, d) is false, it means that there are recurrent states where X is popped from the stack, and others where it is pushed back at level d, as s is a recurrent state where index(X, d) holds. Yet, each occurrence of the action Push(X, d 1) needed to make index(X, d) true increases the counter c(d 1) that no action a(Y, d ) can reset with d < d, due our choice of the action a(X, d) as one with minimum d. As a result, it has to be the case that X is in the stack at level d in all the recurrent states of τ, and hence no action that increases X is applied while in a recurrent state (since increments of X are disabled when X is in the stack). Then, since there is a recurrent state where X is decremented, the inﬁnite π-trajectory τ is terminating. Therefore, the policy π is P-terminating.

In order to prove soundness and completeness, we establish a correspondence between the strong cyclic policies of Q and the strong cyclic policies of P = T(Q). The policies cannot be the same, however, as the reduction T, unlike the direct translation TD, adds extra variables and actions. Indeed, T preserves the atoms p and X = 0 from Q, the latter being propositional, but adds boolean variables and actions that ensure that the policies over of T(Q), unlike the policies over TD(Q), terminate. Let QM be the QNP obtained from the FOND problem P = T(Q) by 1) adding the numerical variables X from Q, 2) replacing the eﬀects X > 0 by Inc(X), and the non-deterministic eﬀects X > 0 | X = 0 by Dec(X), and 3) interpreting the preconditions and goal of the form X = 0 and X > 0 in terms of such variables (i.e., non-propositionally).

Theorem 24. If π solves the FOND problem P = T(Q), π solves the QNP QM.

Proof. P is the direct translation of QM, i.e. P = TD(QM). π is P-terminating by Theorem 23 and strong cyclic for P as it solves it. Therefore, by Theorem 15, π solves QM.

The QNP QM can be thought of as the composition of the original QNP Q and with a deterministic model that encodes the state of the stack and counters. From this perspective, the policy π that solves QM stands for a controller πM for Q that has an internal memory M comprised of the atoms that encode the stack and counters: actions like Push(X, d), Pop(X, d) and Move only aﬀect the internal memory M of the controller πM, actions a that do not decrement any variable in Q, only aﬀect the state of Q, while actions a(X, d) aﬀect both the state of Q and the internal memory M. Due to the correspondence between the application of policy π to the QNP QM and the application of the controller with memory πM to the QNP Q, it is then direct that:

Theorem 25 (Soundness). If policy π solves the FOND problem P = T(Q), the controller πM solves Q.

Proof. Each execution of π in QM generates a trajectory over M and one over Q. Since π solves QM, the latter must be terminating and goal reaching, but then they must be terminating and goal reaching in Q that shares the same goal as QM and the same Dec(X) and Inc(X) actions.

Qualitative Numerical Planning: Reductions and Complexity

The inverse direction of this theorem is also true, but it does not give us a completeness result. For that, we need to show that a policy π that solves Q determines a policy π that solves P = T(Q).

8.1 Completeness

We now assume that there is a policy π that solves Q and want to show that there is a policy π that solves P = T(Q). Since π solves Q, π is Q-terminating and also P -terminating where P = TD(Q) is the direct translation of Q (cf. Theorem 15). Let G be the policy graph associated with π in P . By Theorem 16, Sieve reduces G to an acyclic graph. For the rest of this section, we assume that Sieve is run until all edges that are associated with actions that decrement variables are eliminated rather than stopping as soon as the graph becomes acyclic.7 As a result, since π solves Q, the resulting acyclic graph has no edge associated with a decrement of a variable. Each edge removed by Sieve can be identiﬁed with a variable, and edges are removed in batches by Sieve, each such batch(C) associated with a component C and a variable X chosen by Sieve; i.e., in a given iteration, Sieve chooses a component C and a variable X, and removes all edges ( s, s ) from C such that π( s) is a Dec(X) action (cf. Figure 1). Let us index the top SCCs processed by Sieve in topological order (i.e., if Ci reaches Cj for j = i, then i < j), and let scc( s) be the index of the (top) component that includes s (i.e., the index of the component that includes s in the graph G). Sieve decomposes each component C into a collection of nested SCCs that result of recursively removing edges from C. For each state s, let C s = {Cj s}j 1 be the collection of nested SCCs that contain s; i.e.,

C1 s = Ck1 where k1 = scc( s) is the index of the (top) component that contains s, and

for j 1, Cj+1 s = Ckj+1 where s is in the component Ckj+1 of the graph that results when all edges in S{batch(Cki) : i [1, j]} have been removed by Sieve.

For each state s, let stack( s) be the sequence of variables chosen by Sieve for each component in C s. Observe that such sequence contains no repetitions since once Sieve chooses variable X for a component C, the same X cannot be chosen later for another component C contained in C. Also, if the action π( s) is a Dec(X) action for variable X, then stack( s) contains X by the assumption that Sieve is run until all edges associated with decrements of variables are eliminated. We compare the stack α and stack( s), and say that α = X1 Xn is a preﬁx of stack( s) = Z1 Zm if the latter can be obtained from the former by pushing variables only; i.e., if n m and Xi = Zi for i [1, n]. A property of the Sieve algorithm that we exploit in the completeness proof is the following:

Theorem 26. Let Q be a QNP and let P = TD(Q) be its direct translation. If π solves Q and τ = s0, . . . , [ si, . . . , sm] is an inﬁnite π-trajectory in P , there is a variable X and a recurrent state s such that π( s) is a Dec(X) action, and X is in stack( s ) for every recurrent state s in τ.

Proof. If π solves Q, π must be P -terminating. Thus, there must be a variable that is decreased by π in some recurrent state, and increased by π in no recurrent state. At the same time, Sieve is complete and must remove variables (edges) in the policy graph until it becomes acyclic (cf. Theorem 16). Initially, all the recurrent states in τ are in the same component but at one point Sieve removes a variable X and splits the set of recurrent states into diﬀerent and smaller components. From its deﬁnition, this means that X appears in stack( s) for each recurrent state s in

7. Clearly, the modiﬁcation on the stopping condition for Sieve does not aﬀect its correctness since an acyclic graph remains acyclic when one or more edges are removed, and, on the other hand, if the original Sieve cannot reduced a component, the modiﬁed algorithm is not able to reduce it either.

Bonet & Geffner

τ. Moreover, since the removal of X leaves these states into two or more components, π( s) for one such state must be a Dec(X) action.

We now deﬁne the policy π for P = T(Q) that is determined by the policy π that solves Q. In the deﬁnition, we use the two functions scc( s) and stack( s) deﬁned above in terms of the execution of Sieve on the policy graph for π on TD(Q). The states over P are denoted by triplets s, c, α where s is the state in TD(Q), c stands for the state of the counters c(d), d [0, |V |], and c T , and α stands for the state of the stack (given by the atoms depth(d), in(X), index(X, d)). The policy π for P is deﬁned at triplet s, c, α by

Pop(X, d) if c T < scc( s), X is top variable in α, and d = |α|, else Move if c T < scc( s) and empty stack, else Pop(X, d) if X is top variable in α, d = |α|, and α is not a preﬁx of stack( s), else Push(X, d) if αX is a preﬁx of stack( s) and d = |α|, else a if π( s) = a decrements no variable, else a(X, d) if π( s) = a decrements X at depth d in α but no other var at depth d < d.

A ﬁrst observation is that the policy π is deﬁned on every triplet s, c, α such that π( s) is deﬁned. The policy π for the FOND problem P = T(Q) on a triplet s, c, α advances the c T counter until it becomes equal to the index scc( s) of the SCC in G that contains the node s. It then performs pops and pushes until α becomes equal to stack( s), and ﬁnally applies the action a selected by the policy π on Q using the action names a or a(X, d) according to whether a decrements no variable or decrements a variable, which must be in stack( s) as discussed above. In the latter case, the variable X for the action a(X, d) is the variable X decremented by a that is deepest in the stack, at depth d. The completeness result can then be expressed as follows:

Theorem 27 (Completeness). If π solves the QNP Q, then the policy π deﬁned by (15) solves the FOND problem P = T(Q).

Proof. From Theorem 15, π solves Q iﬀπ solves and terminates in P = TD(Q). We will show that if π solves and terminates in P , then π must solve P = T(Q). Therefore, by forward reasoning, given that π solves Q, then π solves and terminates in P from which we obtain that π solves P. We need to show that the policy π is executable in P, and more precisely that 1) π cannot generate non-goal states s, c, α where π is not deﬁned or deﬁned but non applicable, and 2) π

cannot get trapped in a loop that only involves the extra actions Move, Pop(X, d) and Push(X, d). These two properties ensure that in any π -trajectory over P, if a non-goal state s, c, α is reached, an action a or a(X, d) will be the one changing the component s of the state when π( s) = a, and that this will happen after a bounded number of applications of the extra actions Move, Pop(X, d) and Push(X, d) that do not change s. Since the eﬀect of the actions a or a(X, d) on s in P is the same as the eﬀect of a on s in P , it follows that π will be strong cyclic for P if π is strong cyclic for P . Alternatively, 1) and 2) ensure that if π is executable in P, it generate trajectories over the s components that are the same as those obtained by the policy π over P except for a bounded number of steps where the s component in the states s, c, α does not change.

Point 2) is direct. Move increases the counter c T that no other action decreases. Pushes and pops are applied in order, either to ﬂush out the stack when c T < scc( s), or to make α = stack( s). In the latter case, α is popped until it becomes a preﬁx of stack( s) (ﬂushed out in the extreme case), and then pushes take place to make α equal to stack( s). Hence, no loops that only involve Move, Pop(X, d) and Push(X, d) actions are possible.

Qualitative Numerical Planning: Reductions and Complexity

Point 1) is more subtle. The policy π is deﬁned on all triplets s, c, α for which s is reachable by π. We ﬁrst argue, that except for the preconditions on the counters c(d) and c T , the rest of the preconditions are true for the actions selected by π . Observe that every triplet s, c, α reached by π is such that s is reachable by π, and thus π( s) is deﬁned and applicable in s. Second, for an action selected by π , its easy to see, except for in(Y ) when the action is a or a(X, d) and it increments Y , that its preconditions hold. To see that in(Y ) also holds, observe that if the actions increments Y , then stack( s) cannot contain Y ; if so, the collection C s of nested components for s has a component C that contains a state where Y is decremented while being incremented in s, thus making Y ineligible by Sieve.

The actions that have preconditions on counters are of type Push( , d) with precondition c(d) < Max, and Move with precondition c T < Max. Here, is a placeholder that denotes any variable X in V . For the top counter, c T < Max always hold since c T starts at 0, it is only increased to make it equal to scc( s), and the number of components in G is less than or equal the number of subsets of states which is less than Max. We are thus left to show c(d) < Max by considering the only type of actions that increase c(d): Push( , d).

For this, we show that π cannot generate a trajectory τ in P that contains a fragment τ with 1 + 2n (i.e. Max) actions of the form Push( , d) while no action of the form a( , d ), d d, or Push( , d 1) as this would be the only way in which c(d) may grow up to 1 + 2n: actions of the form Push( , d) increase c(d) by 1, and the only actions that decreases c(d), back to 0, have the form a( , d ) for d d, or Push( , d 1).

Indeed, let τ = s1, c1, α1 , s2, c2, α2 , . . . be such a fragment, and let 1 = i1 < i2 < < im, for m = 1 + 2n, be the indices for the triplets in τ on which the policy π selects an action of type Push( , d). Observe that between each pair of such indices, there must be one triplet where an action of type a or a( , d ) is applied: two pushes at the same stack depth must be mediated by at least one such action.

Let i 1 < i 2 < be the indices such that i k is the ﬁrst index after ik where the action selected by π is of type a or a( , d ), k [1, m]. Since the total number of states is less than or equal to m, there is some s that repeats. Without loss of generality, let us assume that si 1 = si m. The policy π loops in P on a set R of recurrent states that includes { si k : k [1, m]}. By Theorem 26, there is a variable X that is decremented by π while looping in R such that X belongs to each stack( si k), k [1, m]. We choose X to be such variable appearing deepest in the stacks. Therefore, there is index k 1 such that π ( sk, ck, αk ) = a(X, d ) where d is the depth of X in αk. Since X also belongs to α1, it must be the case d |α1| = d, the latter inequality since π ( si1, ci1, αi1 ) is of type Push( , d). This is a contradiction with the assumption that τ contains no action of type a( , d ) for d d.

The second reduction from QNPs into FOND problems may be used to compute policies for a given QNP from policies of the resulting FOND. The reduction is a sound and complete mapping. As mentioned above, the resulting QNP policies correspond to controllers that map states s and controller states, pairs c, α that encode the state of the (bounded) counters and stack, into actions. Hence, there is still the question of whether a QNP solvable by such a controller is solvable by a ﬂat policy (as given in Deﬁnition 5). In the examples below, the controllers found by the FOND planner over the translation yield ﬂat policies where the selection of actions does not depend on the internal controller state, but more generally the question remains open and beyond the scope of this paper.

Bonet & Geffner

9. Examples

Let us illustrate the translation and its solution with a simple example and two variations. The base QNP is Q1 = F, V, I, O, G where there are two boolean variables F = {p, g}, two numerical variables V = {n, m}, the initial and goal states are I = {p, n > 0, m > 0} and G = {g} respectively, and the four actions in O are a1 = p, n > 0; p, n , a2 = p; p , and ﬁn1 = n = 0; g and ﬁn2 = m = 0; g , where the last two actions are used to capture the disjunctive goal n = 0 m = 0. The QNP Q2 is like Q1 except that a2 = p; p, n , while the QNP Q3 is like Q1 (and Q2) except that a2 = p, m > 0; p, n , m . Figure 4 shows the solutions of the FOND problems P1 = T(Q1) and P3 = T(Q3) for Q1 and Q3 that are obtained with the qnp2fond translator (see below), and a FOND planner. The QNP P2 = T(Q2), on the other hand, has no solution. In order to understand these results, recall that in the FOND problem T(Q) for any QNP Q, actions that decrement variables may be applied only when some of the decremented variables are in the stack, and actions that increment variables may be applied only when none of the incremented variables is in the stack. In addition, the translation prevents loops where a variable is pushed and popped from the stack, except when the stack contains another variable throughout the loop that is decremented, as otherwise the counters would grow without bounds. With these elements in mind, we can turn to the solutions of Q1 and Q3, and the lack of solutions for Q2. The solution for Q1 is direct. The idea is to use action a1 to decrement n down to zero, and then apply ﬁn1 to reach the goal. However, a1 is applicable only if n is in the stack and p is true. Then, n must be pushed into the stack before the ﬁrst application of a1, and p must be set to true using a2 after each use of a1 as observed in Fig. 4(a). The QNP Q2 has no solution because the action a2 that must be used to restore the precondition p of a1 in the loop increments the variable n as well. Since action a1 requires n to be in the stack, which prevents an action like a2 in Q2 to execute, the only possibility is to pop n from the stack before executing a2, but then variable n should be pushed and popped from the stack in a loop without the presence of a another variable in the stack that is decremented; a condition that is precluded by the translation, and which is necessary for the loop to terminate. It is precisely the presence of such extra variable m that is decremented by a2 and not incremented by any action that makes the QNP Q3 solvable. While the goal can be achieved by either reaching the condition n = 0 or m = 0 and then applying the action ﬁn1 or ﬁn2 respectively, the solution found by the FOND planner over T(Q3) focuses on decrementing m down to zero using a2 instead, and then using ﬁn2 to reach the goal. The loop involving an unbounded number of pushes/pops of n in the stack is permitted in the solution because the variable m is in the stack and it is decremented in each iteration of the loop. In the translation this means that decrements of m reset the counters associated with the variables like n that are above m in the stack, cf. Fig. 4(b). Finally, an extra dummy variable like z together with a dummy action a3 = z > 0; z would not make the QNP Q2 solvable either, because while such an action could be used to render a terminating loop involving actions a1 and a2, along with a3, it would move the unsolvability to the subproblem that results when the variable z becomes zero. The policy would need to map states where z = 0 holds into goal states, and then, the same obstacles arise.

10. Extensions, Variations, and Memoryless Controllers

For simplicity, QNPs have been deﬁned with certain syntactic restrictions that do not limit their expressive power. In particular, there are no actions with non-deterministic eﬀects on boolean variables, and there are no eﬀects that can map a literal X = 0 non-deterministically into the

Qualitative Numerical Planning: Reductions and Complexity

p, n > 0, m > 0, g p, n > 0, m > 0, g

p, n > 0, m > 0, g p, n = 0, m > 0, g

p, n = 0, m > 0, g

p, n > 0, m > 0, g p, n > 0, m > 0, g p, n > 0, m > 0, g

p, n 0, m > 0, g p, n 0, m > 0, g p, n > 0, m = 0, g

p, n > 0, m = 0, g

Push(m) Push(n)

(a) Solution for QNP Q1 (b) Solution for QNP Q3

Figure 4: Solutions to the FOND translations T(Q1) and T(Q3) of the QNPs Q1 and Q3 in the text. Nodes represent states in the translations (i.e., boolean QNP states augmented with stack and counters) but only the QNP part is shown. Edges correspond to actions from Q or actions that manipulate the stack and counters. Edges are annotated with action labels and their eﬀect on the numerical variables. Blue nodes represent multiple QNP states; e.g., the node {p, n 0, m > 0, g} for Q3 represents the QNP states where {p, m > 0, g} and there is no restriction on the value of n. In both controllers, the initial state is the top leftmost state (in yellow) and the goal is the rightmost state at the bottom (green). The solution for Q1 decrements n with action a1 until it becomes zero, and a1 requires p and thus a1 is interleaved with a2 that makes p true. In Q3, the action a2 is changed to increment n as well, and to decrement another variable m. The solution for Q3 found by the solver reaches the goal by decreasing m to zero using action a2, while using a1 to restore the preconditions of a2, and ignoring the variable n. Another solution could be obtained by applying the action ﬁn1 to the states where n = 0 but it would involve more controller states (not shown).

literals X > 0 and X = 0, as decrements require X > 0 as a precondition, and increments yield the outcome X > 0. Yet, these two restrictions are not essential and can be bypassed. As we have seen before, non-deterministic eﬀects on boolean variables as found in strong and strong cyclic FOND planning can be obtained in QNPs by using additional numerical variables (Section 6). Likewise, a sequence of two consecutive eﬀects Inc(X) and Dec(X) can be used to emulate the eﬀect of an action that leaves the value of X completely uncertain; i.e., it may increase X, decrease X, or leave the value of X unchanged. There are also syntactic conditions that when satisﬁed, make QNPs simpler. For example, if the numerical variables in a QNP Q can be linearly ordered as X1, X2, . . . so that the actions that increase a variable Xi also decrease a variable Xj that appears later in the ordering, then every policy π that solves the direct translation P = TD(Q) of Q will necessarily solve Q as well, as any such policy will terminate in Q. Indeed, if Xℓis the last variable in the ordering that is decreased in a cycle induced by π, the cycle cannot include a diﬀerent state where the variable Xℓis increased, as otherwise, the condition implies that another variable Xj appearing later in the ordering must be decreased in the cycle. For such well-ordered QNPs, the simpler, direct translation TD is thus both sound and complete. Finally, recall that a policy π that solves the FOND P = T(Q) obtained from the translation of a QNP Q, deﬁnes a policy that can be understood as a memory-extended controller that solves Q using extra boolean variables and actions. Often, however, the policy π( s, m) obtained from P, where s is the state over Q and m is the memory state, can be projected onto a memoryless controller π for Q which does not use the extra variables or actions. The projection is possible

Bonet & Geffner

and immediate when there is no state s in the controller where the actions π( s, m) and π( s, m ) selected by the policy over two diﬀerent memory states are associated with diﬀerent actions in Q. In such a case, all states s can be associated with a single action a from Q (there must be one such action as otherwise π would not map s into a goal state), and the memoryless policy π for Q is then simply π ( s) = a.

11. QNPs and Generalized Planning

QNPs were introduced by Srivastava et al. (2011) as a useful model for planning with loops and for generalized planning (Levesque, 2005; Bonet et al., 2009; Srivastava et al., 2011; Hu & Levesque, 2011; Bonet & Geﬀner, 2015). In the basic formulation (Hu & De Giacomo, 2011), a generalized planning problem is a collection Q of planning instances P that share the same set of (ground) actions and the same set state features. The solution of the generalized problem Q is then a mapping from feature valuations into actions that solves each of the instances in Q. This basic formulation was then extended to domains where the ground actions change from instance to instance, as in most relational domains, like Blocksworld, where the actions are determined by a small number of action schemas and object names. This formulation is achieved by means of QNPs (Bonet & Geﬀner, 2018) where a single QNP is shown to be capable of representing a suitable abstraction of the concrete instances P involving diﬀerent ground actions. A QNP is a sound abstraction of a family Q of concrete problems P from a common domain when the boolean and numerical variables p and n in the QNP accurately represent and track the value changes of certain boolean and numerical state features φp and φn in each of the instances. More precisely, a QNP action a = Pre; Eﬀ is sound relative to Q if in any (reachable) state s over an instance P in Q, if the formula Pre is true in s, with the QNP variables p and n replaced by the state feature functions φp and φn, then there is an action in P that induces a state transition (s, s ) that agrees with the eﬀects of the abstract action a, once again, with the QNP variables replaced by the corresponding state features. For example, in Blocksworld, a QNP action a = H, n(x) > 0; H, n(x) provides a suitable abstraction of the action of picking up a block from above a designated block x. In this abstraction, the variable H is associated with the boolean state feature φH that captures when the arm is empty, and the variable n(x) is associated with the numerical state feature φn that captures the number of blocks above x. The abstract action is sound in the sense that for any state s of a Blocksworld instance P, if φH(s) is false and φn(x)(s) > 0, there is an action b in P that induces a state transition (s, s ) such that φH(s ) is true and φn(x)(s ) < φn(x)(s), in agreement with a. The concrete action b is then said to instantiate the abstract action a in the state s of the instance. In general, if all the QNP actions are sound relative to Q and suitable conditions apply to the initial and goal conditions of the QNP in relation to Q, any policy π that solves the QNP provides a solution to Q; i.e., the policy π can be applied to any instance P in Q by interpreting the variables in the QNP in terms of the state features (Bonet & Geﬀner, 2018). More recently, it has been shown how these QNPs can be learned directly from a PDDL description of the domain and a number of sampled instances and their plans (Bonet et al., 2019), and also how to obtain testable logical conditions to check the soundness of a QNP-based abstraction for an instance P of a PDDL domain description (Bonet et al., 2017). The QNPs used in the experiments below are variations of QNPs learned from samples. A ﬁnal question about QNPs for generalized planning is what are the generalized planning problems for which QNPs provide a suitable abstraction and solution method. It turns out that with no restrictions on the state features φp and φn that can be abstracted into the QNP, there is indeed, no limit. The solution to any family Q of planning problems can be expressed compactly

Qualitative Numerical Planning: Reductions and Complexity

in terms of a single QNP action a = V > 0; V that involves a single numerical variable V

associated with the feature φV that measures the optimal cost of reaching the goal from a given state. The QNP action is sound and just says to move in the direction of the goal. However, the application of this abstract action in a concrete state requires the computation of optimal costs for each successor state, which is in general intractable in the number of problem variables. Thus, a reasonable restriction is that the QNP variables should represent reasonable features, and in particular, features that can be computed in polynomial (perhaps linear) time.

12. Implementation: Qnp2fond

The reduction from QNPs to FOND problems has been implemented and it can be found in the Github repository https://github.com/bonetblai/qnp2fond. The reduction produces a FOND problem without conditional eﬀects which is desirable since some FOND planners do not support them. The reduction may be parametrized in terms of the maximum range of counters and the maximum stack depth, but their default values are those used in the proofs that ensure completeness in general. In some cases, however, the reduction can be made simpler without compromising soundness or completeness. For example, if no numerical variable in the QNP is incremented by an action, the more compact and eﬃcient direct translation TD is sound and complete. The same holds if the QNP is well-ordered as deﬁned above, where variables may be ordered as X1, . . . , Xn such that actions that increment a variable Xi decrement a variable Xj for i < j.

On the other hand, if there are actions that increase variables but there are no actions that increase a variable X, there is no need to add push and pop actions for X, nor preconditions for X to be in the stack when decrementing it. This is possible since X may be assumed to be always at the bottom of the stack (as if an implicit action that pushes X has been executed before any other action) and because there is no need to pop X as no action increments it. However, according to the translation, every action that decreases X must reset all stack counters. Interestingly, this and the other simpliﬁcations are special cases of a more general simpliﬁcation that can be applied when the QNP contains a subset S of well-ordered numerical variables, as deﬁned above; indeed, a subset of variables X that are not incremented by any action is such a subset. For such a subset S, one can assume that all variables in S are in the stack and thus simplify the translation by (a) not generating push/pop actions for the variables in S, (b) not adding extra preconditions to actions that decrement a variable in S (such actions already fulﬁll the requirement that one of its decremented variables must be in the stack), and (c) requiring that actions that decrement any variable in S, reset all the stack counters.8 When S contains all the numerical variables in the QNP, then there is no need to have a stack and the simpliﬁed translation T simply becomes the direct translation TD.

Qnp2fond supports the optimization for variables that are not incremented by any action. The general optimization involving a subset of well ordered variables is not yet implemented since ﬁnding

8. Observe that the simpliﬁcation cannot aﬀect completeness as no action becomes inapplicable since preconditions are removed and counters are potentially reset more often. For soundness, suppose that there is a non-terminating execution and let R be the set of recurrent actions in such an execution. It cannot be the case that R contains only actions that aﬀect numerical variables outside S since such actions are not aﬀected by the simpliﬁcation and thus are subject to the full translation. Hence, there is at least one action in R that either increments or decrements a variable in S. If the action increments a variable in S, by deﬁnition, the action also decrements another variable in S, one that comes later in the ordering. By inductive reasoning, R contains an action that decrements a last variable Xℓin R that is not incremented by any other action in R. Hence, such a variable eventually becomes zero and the action cannot be applied afterwards, contradicting the choice of R.

Bonet & Geffner

such a subset and ordering is intractable.9 There also options for disabling the optimization, and even to force the direct translation whose solutions need to be checked with Sieve. By default, the options are to use the optimization whenever possible, and to use the maximum range for the counters and stack depth that ensure completeness in general.

13. Experiments

We illustrate the performance of the QNP translator and solver over some QNPs that capture abstraction of generalized planning problems. There is no useful baseline for evaluating the use of the translator in combination with FOND planners. The only other complete QNP planner would result from translating QNP problems into LTL synthesis tasks but the comparison would be unfair because LTL synthesis is computationally harder than QNP planning. There is also no complete generate-and-test QNP planner reported, which would have to generate the strong cyclic policies of the direct FOND translation, one by one, while checking them for termination. In the examples, the resulting FOND problems T(Q) are solved with FOND-SAT (Geﬀner & Geﬀner, 2018), a general SAT-based FOND planner that is available at https://github.com/ tomsons22/FOND-SAT. This planner calls a SAT solver multiple times.10 The SAT solver used is Glucose 4.1 (Audemard & Simon, 2009) which builds on Minisat (Een & Sorensson, 2004). FONDSAT solves a FOND problem by constructing a compact controller where each node represents one or more states. When depicting the solutions found, controller nodes that represent more than one state are shown in blue, while the nodes that correspond to the initial and goal QNP states are shown in yellow and green respectively.

13.1 Clearing a Block

A general plan for clearing a given block x in a Blocksworld instance can be obtained by solving the following abstraction expressed as the QNP problem Q = F, V, I, O, G where F = {H} contains a boolean variable H that represents if a block is being held, V = {n} contains a numerical variable that counts the number of blocks above x, the initial situation I = { H, n > 0} assumes that block x is not clear and that gripper is empty, and the goal situation G = {n = 0} expresses that there are no blocks on top of x. There are four actions in O:11

Putaway = H; H to put the block being held on the table or on a block not above x,

Pick-above-x = H, n > 0; H, n to pick the top block above x,

Put-above-x = H; H, n to put the block being held on the top block above x, and

Pick-other = H; H to pick a block not above x.

Qnp2fond translates the QNP Q into the FOND problem P = T(Q) that has 20 atoms and 16 actions in less than 0.01 seconds. FOND-SAT solves P in 0.08 seconds after 3 calls to the SAT solver that require 0.01 seconds in total. The solution produced by FOND-SAT is the controller

9. A subset S and ordering can be found by solving a simple SAT theory, while a maximum-size subset and ordering can be found with a weighted-max SAT solver, or by doing multiple calls to a SAT solver. 10. In each call to the SAT solver, FOND-SAT tries to ﬁnd a solution (controller) with a given number of states (budget). If no controller is found, the budget is increased by 1, and repeat until one is found. 11. From the point of view of generalized planning, the last QNP action, Pick-other = H; H is not sound in the Blocksworld domain because on the states where there is a single tower and the gripper is empty, the abstract action is applicable yet no concrete action corresponds to it. The obtained policy however is sound as it never prescribes the action Pick-other.

Qualitative Numerical Planning: Reductions and Complexity

H, n > 0 H, n > 0 H, n > 0 H, n = 0 Push(n) Pick-above-x : n

Pick-above-x : n

Figure 5: Solution of the FOND translation for the QNP Q for clearing a block x. Nodes represent states in the translation (i.e., QNP states augmented with stack and counters) but only the QNP part is shown. Edges correspond to actions from Q or actions that manipulate the stack and counters. Edges are annotated with action labels and their eﬀect on the numerical variables. The initial state is the leftmost state and the goal is the rightmost one. The policy is strong cyclic and terminating.

shown in Figure 5 which depicts actions from the QNP as well as the added actions to manipulate the stack. The resulting policy is a ﬁnite-state controller that can be converted into the memoryless policy π for the QNP that picks a block above x when the gripper is free, and puts away the block being held otherwise.

13.2 Placing a Block on Top of Another

A general plan for placing block x on top of block y may be obtained by solving a suitable QNP. For simplicity, we only consider the case when the blocks x and y are initially in diﬀerent towers and remains so on the non-goal reachable states (i.e., there are no actions to put blocks above x or y except Put-x-on-y that achieves the goal). In this case, the QNP Q = F, V, I, O, G has boolean and numerical variables F = {E, X, D} and V = {n, m} that represent whether the gripper is empty (E) or holding the block x (X), or if the goal has been achieved (D), while the numerical variables n and m count the number of blocks above x and y respectively. The initial state I = {E, X, D, n > 0, m > 0} describes a conﬁguration where no block is being held, there are blocks above x and above y, but x and y are in diﬀerent towers; the goal is simply G = {D}. The QNP has six diﬀerent actions:12

Pick-x = E, n = 0; E, X to pick block x,

Pick-above-x = E, n > 0; E, n to pick the topmost block that is above x,

Pick-above-y = E, m > 0; E, m to pick the topmost block that is above y,

Putaside = E, X; E to put aside (not above x or y) the block being held,

Put-x-aside = E, X; E, X to put aside the block x (being held), and

Put-x-on-y = E, X, m = 0; E, X, D, m to put x on y.

12. One reason for why this particular QNP is not suitable for dealing with states where the blocks x and y are in the same tower, is that the Pick-above-x and Pick-above-y actions are not sound relative to the intended features φn and φm that the variables n and m are aimed to track. A needed Pick-above-x-and-y action, for example, will decrement the two variables n and m.

Bonet & Geffner

E, X, D, n > 0, m > 0 E, X, D, n > 0, m > 0

E, X, D, n = 0, m > 0 E, X, D, n = 0, m > 0 E, X, D, n = 0, m > 0 E, X, D, n = 0, m = 0

E, X, D, n = 0, m = 0 E, X, D, n = 0, m = 0 E, X, D, n = 0, m = 0 E, X, D, n = 0, m > 0

Pick-above-x : n

Pick-above-x : n

Push(m) Putaside Pick-above-y : m

Pick-above-y : m

Putaside Pick-x Put-x-on-y : m

Figure 6: Solution of the FOND translation for the QNP Q for placing a block x on top of another block y. Nodes represent states in the translation (i.e., QNP states augmented with stack and counters) but only the QNP part is shown. Edges correspond to actions from Q or actions that manipulate the stack and counters. Edges are annotated with action labels and their eﬀect on the numerical variables. The initial state is the top leftmost state and the goal is the leftmost one at the bottom. The policy is strong cyclic and terminating.

Since Q has no action to increment n, qnp2fond generates in less than 0.01 seconds a simpliﬁed FOND problem P = T(Q) that has 47 atoms and 35 actions, 42 atoms for encoding the counters and the stack, and 27 actions that manipulate the stack and the top counter. FOND-SAT ﬁnds the solution shown in Figure 6 in 2.55 seconds; it makes 9 calls to the SAT solver that require 0.31 seconds in total. The resulting controller also deﬁnes a memoryless policy.

13.3 Gripper

The task involves a robot with grippers whose goal is to move a number of balls from one room into a target room. Each gripper may carry one ball at a time. An abstraction for generalized plan may be obtained with a QNP Q = F, V, I, O, G that involves one boolean feature T that indicates whether the robot is in the target room, and three numerical features that count the number of balls still to be moved (b), the number of balls being carried (c), and the number of empty grippers (g). The initial state I = {T, b > 0, c = 0, g > 0} places the robot at the target room, carrying no balls, and with some balls in the other room. The goal description is simply G = {c = 0, b = 0} saying that the number of balls being carried and the number of balls in the other room are both zero. The set of (abstract) actions in Q is:

Drop-at-source = T, c > 0; b , c , g to drop balls in the other room,

Drop-at-target = T, c > 0; c , g to drop balls in the target room,

Pick-at-source = T, b > 0, g > 0; b , c , g to pick balls in the other room,

Move = T; T to move to the target room, and

Leave = T; T to move to the other room.

It is easy to see that this abstraction captures instances involving any number of balls and robots with any positive number of grippers. The translator runs in less than 0.01 seconds and generates a FOND problem P = T(Q) with 54 atoms and 47 actions (50 atoms for encoding the counters and the stack, and 29 actions to

Qualitative Numerical Planning: Reductions and Complexity

T, b > 0, c = 0, g > 0 T, b > 0, c = 0, g > 0 T, b > 0, c 0, g > 0 T, b = 0, c > 0, g 0

T, b = 0, c > 0, g 0

T, b = 0, c > 0, g 0 T, b = 0, c = 0, g > 0

T, b > 0, c > 0, g = 0 T, b > 0, c > 0, g = 0 T, b > 0, c = 0, g > 0

T, b > 0, c > 0, g 0

Push(b) Leave Pick-at-source : b , c , g

Drop-at-target : c , g

Pick-at-source : b , c , g

Drop-at-target : c , g

Pick-at-source : b , c , g

Drop-at-target : c , g Drop-at-target : c , g

Figure 7: Solution of the FOND translation for the QNP Q for Gripper. Nodes represent states in the translation (i.e., QNP states augmented with stack and counters) but only the QNP part is shown. Blue nodes in the controller represent one or more QNP states; e.g., the node {T, b > 0, c 0, g > 0} in the ﬁrst row represents the QNP states where {T, b > 0, g > 0} hold and there is no restriction on the value of c. Edges correspond to actions from Q or actions that manipulate the stack and counters. Edges are annotated with action labels and their eﬀect on the numerical variables. The initial state is the top leftmost state and the goal is the second one at the bottom row. The policy is strong cyclic and terminating because each self loop is terminating, and the outer loop terminates since the variable b decreases by at least one in each iteration, and no action in the plan increases it.

manipulate the stack and move the top counter). FOND-SAT ﬁnds the solution shown in Figure 7 in 11.25 seconds; it makes 10 calls to the SAT solver that require 3.00 seconds in total. The resulting controller also deﬁnes a memoryless policy.

13.4 Delivery

The last example involves an agent that navigates a grid and whose job is to look for packages and deliver them at a target location subject to the constraint that it can carry one package at a time. The generalized problem consists of all instances with a ﬁnite but unbounded grid, and a ﬁnite but unbounded number of packages. The generalized problem can be captured with the QNP Q = F, V, I, O, G that involves one boolean feature H that tells whether the agent is holding a package, and 3 numerical features that measure the distance to the next package (d), the distance to the target location (t), and the number of packages that still need to be delivered (p). The initial state I = { H, d > 0, t > 0, p > 0} corresponds to a state where the agent holds no package and is neither at a package or the target location, while the goal description G = { H, p = 0} indicates that all packages have been delivered. The QNP Q has ﬁve actions:13

Move = d > 0, p > 0; d , t to move towards next package and away from target location,

Home = t > 0; d , t to move towards target location and away from next package,

Pick = H, d = 0; H to pick a package,

13. From the point of view of generalized planning, the QNP actions Move and Home are not sound over this domain since the agent may move towards the next package without moving away from the target location, and vice versa.

Bonet & Geffner

H, d > 0, t 0, p > 0 H, d > 0, t 0, p > 0 H, d = 0, t > 0, p > 0 H, d = 0, t > 0, p > 0

H, d = 0, t > 0, p > 0 H, d 0, t > 0, p > 0 H, d > 0, t = 0, p > 0 H, d > 0, t = 0, p > 0

H, d > 0, t = 0, p = 0

Push(d) Move : d , t Pop(d)

Push(t) Home : d , t Pop(t)

Deliver : d , p

Deliver : d , p

Move : d , t

Home : d , t

Figure 8: Solution of the FOND translation for the QNP Q for Delivery. Nodes represent states in the translation (i.e., QNP states augmented with stack and counters) but only the QNP part is shown. Blue nodes in the controller represent one or more QNP states; e.g., the top leftmost node {H, d > 0, t 0, p > 0} represents the QNP initial state and a similar state except t = 0. Edges correspond to actions from Q or actions that manipulate the stack and counters. Edges are annotated with action labels and their eﬀect on numerical variables. The initial state is the top leftmost state and the goal is the one at the bottom row. The policy is strong cyclic and terminating because each self loop is terminating, and the outer loop terminates since the variable p decreases by one in each iteration, and no action in the plan increases it.

Drop = H, t > 0; H to drop a package not in target location, and

Deliver = H, t = 0, p > 0; H, d , p to deliver a package in target location.

Since the variable p is not incremented by any action, the translator generates a simpliﬁed FOND problem P = T(Q), in less than 0.01 seconds, that has 54 atoms and 40 actions (50 atoms for encoding the counters and the stack, and 29 actions to manipulate the stack and move the top counter). FOND-SAT ﬁnds the solution shown in Figure 8 in 2.99 seconds; it makes 8 calls to the SAT solver that require 0.30 seconds in total. As in previous examples, the resulting controller deﬁnes a memoryless policy.

14. Related Work

QNPs have been introduced as a decidable planning model able to account for plans with loops (Srivastava et al., 2011, 2015). In addition, by deﬁning the boolean and numerical variables of QNPs as suitable general boolean and numerical features over a given domain, it has been shown that QNPs can be used to express abstract models for generalized planning, in particular when the ground actions change from instance to instance (Bonet & Geﬀner, 2018). More recently, it has been shown that these QNP abstractions can be learned automatically from a given planning domain and sampled plans (Bonet et al., 2019). QNPs thus provide a convenient language for a model-based approach to the computation of general plans where such plans are derived from a (QNP) planning model. If the model is sound, the general plans are guaranteed to be correct (Bonet & Geﬀner, 2018; Bonet et al., 2019). This is in contrast with the more common inductive or learning-based approaches where plans computed to solve a few sampled instances are assumed to generalize to other instances by virtue of the compact form of the plans (Khardon, 1999; Martin

Qualitative Numerical Planning: Reductions and Complexity

& Geﬀner, 2004; Fern et al., 2004). These learning approaches do not construct or solve a suitable abstraction of the problems as expressed by QNPs. Inductive approaches have been used recently to learn general plans in the form of ﬁnite-state controllers (Bonet et al., 2009; Hu & De Giacomo, 2013), ﬁnite programs (Segovia-Aguas et al., 2016), and deep neural nets learned in a supervised manner (Toyer et al., 2018; Bueno et al., 2019; Issakkimuthu et al., 2018; Bajpai et al., 2018). A key diﬀerence between learning-based and model-based approaches is that the correctness of the latter follows from the soundness of the model. Deep reinforcement learning methods have also been used recently for computing generalized plans with no supervision (Groshev et al., 2018; Sukhbaatar et al., 2015), yet by not using ﬁrst-order symbolic representations, they have diﬃculties in dealing with relational domains that involve objects and relations (Garnelo & Shanahan, 2019). Forms of generalized planning have also been formulated using ﬁrst-order logic (Srivastava et al., 2011; Illanes & Mc Ilraith, 2019), and general plans over ﬁnite horizons have been derived using ﬁrst-order regression as well (Boutilier et al., 2001; Wang et al., 2008; Van Otterlo, 2012; Sanner & Boutilier, 2009). The use of QNPs for expressing (or learning) abstractions for generalized planning problems, combined with the compilation of QNPs into FOND problems, allows us to beneﬁt from the performance of propositional oﬀ-the-shelf FOND planners like PRP (Muise et al., 2012), My ND (Bercher & Mattm uller, 2009), and FOND-SAT (Geﬀner & Geﬀner, 2018) in order to ﬁnd generalized plans. QNP problems can be easily translated into LTL planning problems with FOND domains, reachability goals, and a particular type of trajectory constraints that can be expressed as compact LTL formulas (Bonet et al., 2017). The trajectory constraints use a fragment of LTL (Pnueli, 1977) to express the QNP fairness constraints; namely, that trajectories where a variable X is decremented an inﬁnite number of times and incremented a ﬁnite number of times, are not fair and can thus be ignored.14 As a result, QNP planning can be translated quite eﬃciently (linear time) into LTL synthesis. The translation, however, is not particularly useful computationally, as QNP planning, like FOND planning, is EXP-Complete, while LTL synthesis is 2EXP-Complete (doubly exponential in time) (Pnueli & Rosner, 1989). In LTL planning, i.e., FOND planning with LTL goals and trajectory constraints, the double exponential growth is in the number of variables that appear in such formulas (Camacho et al., 2019; Aminof et al., 2019). Tight complexity bounds for the speciﬁc type of LTL trajectory constraints that QNPs convey have not been settled. In any case, such methods need to compute in explicit form the QNP transition system, and thus require exponential space in the total number of variables. This lower bound, that does not consider the LTL formulas associated with the trajectory constrains, already matches the upper bound of the brute-force algorithm that uses Sieve as subroutine (cf. proof of Theorem 17).

15. Conclusions

QNPs are convenient abstract models for generalized planning. In this work we have studied QNPs and placed them on ﬁrmer ground by studying their theoretical foundations further. We have also shown that FOND problems can be reduced into QNPs, and vice versa, that QNPs can reduced into FOND problems. Both translations are new and polynomial-time computable, hence establishing that the two models have the same expressive power and the same complexity. The previous, direct translation TD(Q) for QNPs Q also yields a FOND problem but with fairness assumptions that do not match those underlying strong cyclic FOND planning, and this is why solutions to this translation need to be checked for termination. QNPs can be reduced to LTL synthesis and

14. The LTL formula used by Bonet et al. (2017) is slightly diﬀerent but logically equivalent, and states that always, if a variable is decreased inﬁnitely often and increased ﬁnitely often, it must eventually have value zero. They are equivalent because once the value zero is reached, the variable cannot be decreased without being increased.

Bonet & Geffner

planning but these are harder computational tasks. In the future, it would be interesting to study more general types of fairness assumptions and the fragments of LTL that can be handled eﬃciently with methods similar to the ones developed for QNPs that are based on polynomial-time translations and oﬀ-the-shelf FOND planners.

Acknowledgements

We thank the associate editor, Patrik Haslum, and the anonymous reviewers for useful comments that helped us to improve the paper. This work was performed while B. Bonet was at sabbatical leave at Universidad Carlos III de Madrid under a UC3M-Santander C atedra de Excelencia Award. H. Geﬀner is also a Guest WASP professor at Link oping University, Sweden, and his work is partially funded by a grant TIN-2015-67959-P from MINECO, Spain, and a grant from the Knut and Alice Wallenberg (KAW) Foundation, Sweden.

Aminof, B., De Giacomo, G., Murano, A., & Rubin, S. (2019). Planning under LTL environment speciﬁcations. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 31 39.

Audemard, G., & Simon, L. (2009). Predicting learnt clauses quality in modern SAT solver. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pp. 399 404.

Bajpai, A., Garg, S., & Mausam (2018). Transfer of deep reactive policies for MDP planning. In Advances in Neural Information Processing Systems (Neur IPS), pp. 10988 10998.

Belle, V., & Levesque, H. J. (2016). Foundations for generalized planning in unbounded stochastic domains. In Proc. Principles of Knowledge Representation and Reasoning (KR), pp. 380 389.

Bercher, P., & Mattm uller, R. (2009). Solving non-deterministic planning problems with pattern database heuristics. In Proc. German Conf. on AI (KI), pp. 57 64. Springer.

Bonet, B., Palacios, H., & Geﬀner, H. (2009). Automatic derivation of memoryless policies and ﬁnite-state controllers using classical planners. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 34 41.

Bonet, B., De Giacomo, G., Geﬀner, H., & Rubin, S. (2017). Generalized planning: Nondeterministic abstractions and trajectory constraints. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pp. 873 879.

Bonet, B., Franc es, G., & Geﬀner, H. (2019). Learning features and abstract actions for computing generalized plans. In Proc. AAAI Conf. on Artiﬁcial Intelligence (AAAI), pp. 2703 2710.

Bonet, B., Fuentetaja, R., E-Mart ın, Y., & Borrajo, D. (2017). Guarantees for sound abstractions for generalized planning. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pp. 1566 1573.

Bonet, B., & Geﬀner, H. (2015). Policies that generalize: Solving many planning problems with the same policy.. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pp. 2798 2804.

Bonet, B., & Geﬀner, H. (2018). Features, projections, and representation change for generalized planning. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pp. 4667 4673.

Boutilier, C., Reiter, R., & Price, B. (2001). Symbolic dynamic programming for ﬁrst-order MDPs. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), Vol. 1, pp. 690 700.

Qualitative Numerical Planning: Reductions and Complexity

Bueno, T. P., de Barros, L. N., Mau a, D. D., & Sanner, S. (2019). Deep reactive policies for planning in stochastic nonlinear domains. In Proc. AAAI Conf. on Artiﬁcial Intelligence (AAAI), Vol. 33, pp. 7530 7537.

Camacho, A., Bienvenu, M., & Mc Ilraith, S. A. (2019). Towards a uniﬁed view of ai planning and reactive synthesis. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 58 67.

Cimatti, A., Roveri, M., & Traverso, P. (1998). Automatic OBDD-based generation of universal plans in non-deterministic domains. In Proc. AAAI Conf. on Artiﬁcial Intelligence (AAAI), pp. 875 881.

Cimatti, A., Pistore, M., Roveri, M., & Traverso, P. (2003). Weak, strong, and strong cyclic planning via symbolic model checking. Artiﬁcial Intelligence, 147(1-2), 35 84.

Een, N., & Sorensson, N. (2004). An extensible SAT-solver. Lecture notes in computer science, 2919, 502 518.

Fern, A., Yoon, S., & Givan, R. (2004). Approximate policy iteration with a policy language bias. In Advances in Neural Information Processing Systems (NIPS), pp. 847 854.

Fikes, R., & Nilsson, N. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artiﬁcial Intelligence, 1, 27 120.

Garnelo, M., & Shanahan, M. (2019). Reconciling deep learning with symbolic artiﬁcial intelligence: representing objects and relations. Current Opinion in Behavioral Sciences, 29, 17 23.

Geﬀner, H., & Bonet, B. (2013). A Concise Introduction to Models and Methods for Automated Planning. Morgan & Claypool Publishers.

Geﬀner, T., & Geﬀner, H. (2018). Compact policies for fully observable non-deterministic planning as sat. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 88 96.

Ghallab, M., Nau, D., & Traverso, P. (2016). Automated planning and acting. Cambridge University Press.

Groshev, E., Goldstein, M., Tamar, A., Srivastava, S., & Abbeel, P. (2018). Learning generalized reactive policies using deep neural networks. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 408 416.

Helmert, M. (2002). Decidability and undecidability results for planning with numerical state variables. In Proc. Int. Conf. on Artiﬁcial Intelligence Planning Systems (AIPS), pp. 44 53.

Hu, Y., & De Giacomo, G. (2013). A generic technique for synthesizing bounded ﬁnite-state controllers. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 109 116.

Hu, Y., & De Giacomo, G. (2011). Generalized planning: Synthesizing plans that work for multiple environments. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pp. 918 923.

Hu, Y., & Levesque, H. J. (2011). A correctness result for reasoning about one-dimensional planning problems.. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pp. 2638 2643.

Illanes, L., & Mc Ilraith, S. A. (2019). Generalized planning via abstraction: arbitrary numbers of objects. In Proc. AAAI Conf. on Artiﬁcial Intelligence (AAAI), pp. 7610 7618.

Issakkimuthu, M., Fern, A., & Tadepalli, P. (2018). Training deep reactive policies for probabilistic planning problems. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 422 430.

Bonet & Geffner

Jim enez-Celorrio, S., Segovia-Aguas, J., & Jonsson, A. (2019). A review of generalized planning. The Knowledge Engineering Review, 34.

Khardon, R. (1999). Learning action strategies for planning domains. Artiﬁcial Intelligence, 113, 125 148.

Levesque, H. J. (2005). Planning with loops. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pp. 509 515.

Littman, M. L., Goldsmith, J., & Mundhenk, M. (1998). The computational complexity of probabilistic planning. Journal of Artiﬁcial Intelligence Research, 9, 1 36.

Martin, M., & Geﬀner, H. (2004). Learning generalized policies from planning examples using concept languages. Applied Intelligence, 20(1), 9 19.

Muise, C. J., Mc Ilraith, S. A., & Beck, C. (2012). Improved non-deterministic planning by exploiting state relevance. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 172 180.

Nebel, B. (2000). On the compilability and expressive power of propositional planning. Journal of Artiﬁcial Intelligence Research, 12, 271 315.

Pnueli, A. (1977). The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science, pp. 46 57. IEEE.

Pnueli, A., & Rosner, R. (1989). On the synthesis of an asynchronous reactive module. In Proc. ICALP, pp. 652 671.

Rintanen, J. (2004). Complexity of planning with partial observability.. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 345 354.

Russell, S., & Norvig, P. (2002). Artiﬁcial Intelligence: A Modern Approach. Prentice Hall. 2nd Edition.

Sanner, S., & Boutilier, C. (2009). Practical solution techniques for ﬁrst-order MDPs. Artiﬁcial Intelligence, 173(5-6), 748 788.

Segovia-Aguas, J., Jim enez-Celorrio, S., , & Jonsson, A. (2016). Generalized planning with procedural domain control knowledge. In Proc. Int. Conf. on Automated Planning and Schedulng (ICAPS), pp. 285 293.

Sipser, M. (2006). Introduction to Theory of Computation (2nd edition). Thomson Course Technology, Boston, MA.

Srivastava, S., Immerman, N., & Zilberstein, S. (2011). A new representation and associated algorithms for generalized planning. Artiﬁcial Intelligence, 175(2), 615 647.

Srivastava, S., Zilberstein, S., Gupta, A., Abbeel, P., & Russell, S. (2015). Tractability of planning with loops. In Proc. AAAI Conf. on Artiﬁcial Intelligence (AAAI), pp. 3393 3401.

Srivastava, S., Zilberstein, S., Immerman, N., & Geﬀner, H. (2011). Qualitative numeric planning. In Proc. AAAI Conf. on Artiﬁcial Intelligence (AAAI), pp. 1010 1016.

Sukhbaatar, S., Szlam, A., Synnaeve, G., Chintala, S., & Fergus, R. (2015). Mazebase: A sandbox for learning from games. ar Xiv preprint ar Xiv:1511.07401.

Tarjan, R. (1972). Depth-ﬁrst search and linear graph algorithms. SIAM journal on computing, 1(2), 146 160.

Toyer, S., Trevizan, F., Thi ebaux, S., & Xie, L. (2018). Action schema networks: Generalised policies with deep learning. In Proc. AAAI Conf. on Artiﬁcial Intelligence (AAAI), pp. 6294 6301.

Qualitative Numerical Planning: Reductions and Complexity

Van Otterlo, M. (2012). Solving relational and ﬁrst-order logical markov decision processes: A survey. In Wiering, M., & van Otterlo, M. (Eds.), Reinforcement Learning, pp. 253 292. Springer.

Wang, C., Joshi, S., & Khardon, R. (2008). First order decision diagrams for relational MDPs. Journal of Artiﬁcial Intelligence Research, 31, 431 472.