# Generalized Planning with Positive and Negative Examples

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Javier Segovia-Aguas,¹ Sergio Jiménez,² Anders Jonsson³
¹IRI - Institut de Robòtica i Informàtica Industrial, CSIC-UPC; ²VRAIN - Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València; ³Universitat Pompeu Fabra

Generalized planning aims at computing an algorithm-like structure (a generalized plan) that solves a set of planning instances. In this paper we define negative examples for generalized planning as planning instances that must not be solved by a generalized plan. In this regard, the paper extends the notion of validation of a generalized plan to the problem of verifying that a given generalized plan solves the input set of positive instances while it fails to solve the input set of negative examples. This notion of plan validation allows us to define quantitative metrics to assess the generalization capacity of generalized plans. The paper also shows how to incorporate this new notion of plan validation into a compilation for plan synthesis that takes both positive and negative instances as input. Experiments show that incorporating negative examples can accelerate plan synthesis in several domains and leverage quantitative metrics to evaluate the generalization capacity of the synthesized plans.

## Introduction

Generalized planning studies the computation of plans that can solve a family of planning instances that share a common structure (Hu and De Giacomo 2011; Srivastava, Immerman, and Zilberstein 2011; Jiménez, Segovia-Aguas, and Jonsson 2019). Since generalized planning is computationally expensive, a common approach is to synthesize a generalized plan starting from a set of small instances and validate it on larger instances.
This approach is related to the methodology of Machine Learning (ML), in which a model is trained on a training set and validated on a test set (Mitchell 1997). Traditionally, generalized planning has focused only on solvable instances, both for plan synthesis and for validation (Winner and Veloso 2003; Bonet, Palacios, and Geffner 2010; Hu and Levesque 2011; Srivastava et al. 2011; Hu and De Giacomo 2013; Segovia-Aguas, Jiménez, and Jonsson 2018; Segovia-Aguas, Celorrio, and Jonsson 2019). However, many computational problems in AI benefit from negative examples, e.g. SAT approaches that exploit clause learning (Angluin, Frazier, and Pitt 1992), grammar induction (Parekh, Honavar, and others 2000), program synthesis (Alur et al. 2018) and model learning (Camacho and McIlraith 2019). If used appropriately, negative examples can help reduce the solution space and accelerate the search for a solution.

Figure 1: a) Robot in 2×1, 6×1 and 1×1 corridors; b) Three candidate generalized plans.

In this paper we introduce negative examples for generalized planning as input planning instances that should not be solved by a generalized plan. An intuitive way to come up with negative examples for solutions that generalize is to first synthesize a solution with exclusively positive examples, and then identify cases for which the solution did not generalize as desired, somewhat akin to clause learning in satisfiability problems (Biere et al. 2009). Imagine that we aim to synthesize the generalized plan (paint(X), inc(X), inc(X))+ that makes a robot paint every odd cell black in an N×1 corridor, starting from the leftmost cell (we use Kleene notation to represent regular expressions, and Z+ indicates the repetition of Z at least once). Action paint(X) paints the current cell black while inc(X) increments the robot's X coordinate.
The positive example of a 2×1 corridor (whose goal configuration is illustrated in Figure 1a) is solvable by all three automata plans in Figure 1b (accepting states are marked with overlines, e.g. q̄). These automata plans, namely Π, Π* and Π+, compactly represent the three sets of sequential plans (paint(X), inc(X), inc(X)), (paint(X), inc(X), inc(X))* and (paint(X), inc(X), inc(X))+. Hence the single positive example of a 2×1 corridor is not enough to discriminate among these three generalized plans. Adding a second positive example, the 6×1 corridor in Figure 1a, discards plan Π. Adding a third, negative, 1×1 example, where the initial and goal robot cells are the same and no cell is required to be painted, discards Π* and prevents over-generalization, because Π* solves this negative example.

The problem of deriving generalized plans has been a longstanding open challenge. Compared to previous work on generalized planning, the contributions of this paper are:

1. Negative examples to more precisely specify the semantics of the aimed generalized plan.
2. A new approach for the synthesis of plans that, thanks to negative examples, can generalize from smaller input examples.
3. The definition of quantitative evaluation metrics to assess the generalization capacity of generalized plans.

The paper is organized as follows. We start with some background notation for classical planning and generalized planning (GP). Then we formalize the concept of a negative example for generalized planning. We continue with a description of a generalized planning problem with positive and negative examples that can be compiled to classical planning. We give proofs of soundness and completeness for the two main tasks in GP, synthesis and validation. We continue with experiments where we compare the impact of negative examples, and finally we conclude with a discussion of the presented work.
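The discrimination among Π, Π* and Π+ can be checked mechanically. The following sketch simulates k repetitions of the loop body on an N×1 corridor; the helper names `run_body` and `solves` are our own, and the rule that the robot starts on cell 1 and may stop at most one cell past the corridor end is a simplifying assumption, not a detail stated in the paper:

```python
def run_body(n_cells, k):
    """Execute (paint(X), inc(X), inc(X)) k times on an N x 1 corridor.
    Returns (painted_cells, position), or None if some action is
    inapplicable (painting outside the corridor, or walking too far)."""
    pos, painted = 1, set()
    for _ in range(k):
        if pos > n_cells:              # paint(X) outside the corridor
            return None
        painted.add(pos)               # paint(X)
        pos += 2                       # inc(X), inc(X)
        if pos > n_cells + 1:          # assumption: may stop one cell past the end
            return None
    return painted, pos

def solves(n_cells, goal_painted, repetitions):
    """True iff some allowed repetition count reaches the goal configuration."""
    return any((r := run_body(n_cells, k)) is not None and r[0] == goal_painted
               for k in repetitions)

# Pi repeats the body exactly once, Pi* zero or more times, Pi+ at least once.
# The 2x1 positive example is solved by all three; the 6x1 positive example
# discards Pi; the 1x1 negative example is solved only by Pi* (over-general).
```

For example, `solves(1, set(), range(0, 2))` is True (Π* over-generalizes on the 1×1 negative example), while `solves(1, set(), range(1, 2))` is False (Π+ correctly fails it).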
This section formalizes the planning models used in this work, as well as planning programs (Segovia-Aguas, Celorrio, and Jonsson 2019), an algorithm-like representation for plans that can generalize.

## Classical planning with conditional effects

We use F to denote the set of fluents (propositional variables) describing a state. A literal l is a valuation of a fluent f ∈ F, i.e. l = f or l = ¬f. A set of literals L on F represents a partial assignment of values to fluents (WLOG we assume that L does not assign conflicting values to any fluent). Given L, ¬L = {¬l : l ∈ L} is its complement. Finally, we use L(F) to denote the set of all literal sets on F, i.e. all partial assignments of values to fluents. A state s is a set of literals such that |s| = |F|, i.e. a total assignment of values to fluents. The number of states is then 2^|F|.

A classical planning frame is a tuple Φ = ⟨F, A⟩, where F is a set of fluents and A is a set of actions with conditional effects. Conditional effects can compactly define actions whose precise effects depend on the state where the action is executed. Each action a ∈ A has a set of literals pre(a), called the precondition, and a set of conditional effects cond(a). Each conditional effect C ▷ E ∈ cond(a) is composed of a set of literals C (the condition) and a set of literals E (the effect). Action a is applicable in state s if and only if pre(a) ⊆ s, and the resulting set of triggered effects is eff(s, a) = ⋃_{C ▷ E ∈ cond(a), C ⊆ s} E, i.e. the effects whose conditions hold in s. The result of applying a in s is the successor state θ(s, a) = (s \ ¬eff(s, a)) ∪ eff(s, a).

A classical planning problem with conditional effects is a tuple P = ⟨F, A, I, G⟩, where F is a set of fluents and A is a set of actions with conditional effects as defined for a planning frame, I is an initial state and G is a goal condition, i.e. a set of literals to achieve. A solution for a classical planning problem P can be specified using different representation formalisms, e.g.
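As an illustration of these definitions, here is a minimal sketch of action applicability and the successor function θ, representing each literal as a (fluent, value) pair; the dict-based action encoding is our own choice, not a format from the paper:

```python
def applicable(state, action):
    """pre(a) is a subset of s: every precondition literal holds in s."""
    return action["pre"] <= state

def successor(state, action):
    """theta(s, a) = (s \\ neg-eff(s, a)) U eff(s, a): union the effect sets E
    of every conditional effect C > E whose condition C holds in s, then
    overwrite the affected fluents."""
    triggered = set()
    for cond, eff in action["cond"]:
        if cond <= state:                 # condition C holds in s
            triggered |= eff
    negated = {(f, not v) for (f, v) in triggered}   # the literals neg-eff(s, a)
    return (state - negated) | triggered

# An action with precondition x and two conditional effects:
# {y} > {z}, plus an unconditional effect deleting x.
a = {"pre": {("x", True)},
     "cond": [({("y", True)}, {("z", True)}),
              (set(), {("x", False)})]}
s = {("x", True), ("y", True), ("z", False)}
assert applicable(s, a)
assert successor(s, a) == {("y", True), ("z", True), ("x", False)}
```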
a sequence of actions, a partially ordered plan, a policy, etc. Here we define a sequential plan for P as an action sequence π = ⟨a1, ..., an⟩ whose execution induces a state sequence ⟨s0, s1, ..., sn⟩ such that s0 = I and, for each 1 ≤ i ≤ n, ai is applicable in s_{i-1} and generates the successor state si = θ(s_{i-1}, ai). The plan π solves P if and only if G ⊆ sn, i.e. if the goal condition is satisfied after executing π in I.

## Planning programs

Given a planning frame Φ = ⟨F, A⟩, a planning program (Segovia-Aguas, Celorrio, and Jonsson 2019) is a sequence of instructions Π = ⟨w0, ..., wn⟩. Each instruction wi, 0 ≤ i ≤ n, is associated with a program line i and is drawn from the set of instructions I = A ∪ I_go ∪ {end}, where I_go = {go(i′, !f) : 0 ≤ i′ ≤ n, f ∈ F} is the set of goto instructions. In other words, each instruction is either a planning action a ∈ A, a goto instruction go(i′, !f) or a termination instruction end.

The execution model for a planning program Π is a program state (s, i), i.e. a pair of a planning state s ∈ L(F) (with |s| = |F|) and a program counter 0 ≤ i ≤ n. Given a program state (s, i), the execution of instruction wi on line i is defined as follows:

- If wi ∈ A, the new program state is (s′, i + 1), where s′ = θ(s, wi) is the successor of planning state s under action wi.
- If wi = go(i′, !f), the program state becomes (s, i + 1) if f ∈ s, and (s, i′) otherwise. Conditions in goto instructions can represent arbitrary formulae, since f can be a derived fluent (Lotinac et al. 2016).
- If wi = end, execution terminates.

To execute a planning program Π on a planning problem P = ⟨F, A, I, G⟩, the initial program state is set to (I, 0), i.e. the initial state of P and program line 0. A program Π = ⟨w0, ..., wn⟩ solves a planning problem P = ⟨F, A, I, G⟩ iff its execution terminates in a program state (s, i) that satisfies the goal condition, i.e. wi = end and G ⊆ s.
Segovia-Aguas, Celorrio, and Jonsson (2019) contains a detailed analysis of the failure conditions of planning programs, which we summarize as follows:

Corollary 1 (Planning Program Failure). If a planning program Π fails to solve a planning problem P, the only possible sources of failure are:

1. Incomplete Program. Execution terminates in a program state (s, i) where the goal condition does not hold, i.e. (wi = end) ∧ (G ⊈ s).
2. Inapplicable Action. Executing an action wi ∈ A in program state (s, i) fails because its precondition does not hold, i.e. pre(wi) ⊈ s.
3. Infinite Loop. The program execution enters an infinite loop that never reaches an end instruction.

## Generalized planning

We define a generalized planning problem as a finite set of classical planning problems P = {P1, ..., PT} that are defined on the same planning frame Φ. Therefore, P1 = ⟨F, A, I1, G1⟩, ..., PT = ⟨F, A, IT, GT⟩ share the same fluents and actions but differ in the initial state and goals. A planning program Π solves a given generalized planning problem P iff Π solves every planning problem Pt ∈ P, 1 ≤ t ≤ T.

Segovia-Aguas, Celorrio, and Jonsson (2019) showed that a program Π that solves a generalized planning task P can be synthesized by defining a new classical planning problem Pn = ⟨Fn, An, In, Gn⟩, where n is a bound on the number of program lines. A solution plan π for Pn programs instructions on the available empty lines (building the n-line program Π), and validates Π on each input problem Pt ∈ P. The fluent set is defined as Fn = F ∪ Fpc ∪ Fins ∪ Ftest ∪ {done}, where:

- Fpc = {pc_i : 0 ≤ i ≤ n} models the program counter,
- Fins = {ins_{i,w} : 0 ≤ i ≤ n, w ∈ I ∪ {nil}} models the program lines (nil indicates an empty line),
- Ftest = {test_t : 1 ≤ t ≤ T} indicates the classical planning problem Pt ∈ P currently being solved.

Each instruction w ∈ I is modeled as a planning action, with one copy end_t of the termination instruction per input problem Pt.
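The execution model and the three failure sources of Corollary 1 can be sketched as a direct interpreter. This is an assumption-laden simplification: states are sets of true fluents (closed world), actions carry plain add/delete sets instead of conditional effects, and the interpreter remembers every visited program state, whereas the compilation presented later stores a single guessed copy. Since execution is deterministic, revisiting a program state (s, i) proves an infinite loop:

```python
def run_program(program, init, goal):
    """Execute a planning program and report 'solved' or one of the three
    failure sources of Corollary 1. Instructions are
    ('act', pre, add, delete), ('goto', i2, f) or ('end',)."""
    state, pc, seen = frozenset(init), 0, set()
    while True:
        if (state, pc) in seen:            # deterministic => provably loops
            return "infinite loop"
        seen.add((state, pc))
        ins = program[pc]
        if ins[0] == "end":
            return "solved" if goal <= state else "incomplete program"
        if ins[0] == "act":
            _, pre, add, delete = ins
            if not pre <= state:
                return "inapplicable action"
            state, pc = (state - delete) | add, pc + 1
        else:                              # ('goto', i2, f): jump when f is false
            _, i2, f = ins
            pc = pc + 1 if f in state else i2
```

For example, `run_program([("goto", 0, "f")], set(), {"g"})` returns "infinite loop", since the program state (∅, 0) repeats immediately.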
Preconditions for the goto and end instructions are defined as pre(go(i′, !f)) = ∅ and pre(end_t) = G_t ∪ {test_t}. The authors define two actions for each instruction w and line i: a programming action P(wi) that programs w on line i, and an execution action E(wi) that uses the previous execution model to execute w on line i:

- pre(P(wi)) = pre(w) ∪ {pc_i, ins_{i,nil}},
- cond(P(wi)) = {∅ ▷ {¬ins_{i,nil}, ins_{i,w}}},
- pre(E(wi)) = pre(w) ∪ {pc_i, ins_{i,w}}.

The effect of E(wi) depends on the type of instruction:

- cond(E(wi)) = cond(w) ∪ {∅ ▷ {¬pc_i, pc_{i+1}}}, if w ∈ A,
- cond(E(wi)) = {∅ ▷ {¬pc_i}} ∪ {{f} ▷ {pc_{i+1}}} ∪ {{¬f} ▷ {pc_{i′}}}, if w = go(i′, !f),
- cond(E(wi)) = reset_{t+1}, if w = end_t, t < T,
- cond(E(wi)) = {∅ ▷ {done}}, if w = end_T.

The conditional effect reset_{t+1} resets the program state to (I_{t+1}, 0), preparing execution on the next problem P_{t+1}.

Figure 2: Positive example (upper row) and negative example (lower row) for the generalized planning task of reversing a list.

The initial state is In = I1 ∪ {pc_0} ∪ {ins_{i,nil} : 0 ≤ i ≤ n}, indicating that, initially, the program lines are empty and the program counter is at the first line. The goal is Gn = {done} and can only be achieved after sequentially solving all the instances in the generalized planning problem.

## Negative examples in generalized planning

This section extends the previous generalized planning formalism to include negative examples for the validation and synthesis of programs.

## Negative examples as classical planning problems

Negative examples are additional solution constraints that more precisely specify the semantics of the aimed generalized plan and prevent undesired generalizations.

Definition 1 (Negative examples in generalized planning). Given a generalized planning problem P, a negative example is a classical planning instance P′ = ⟨F, A, I′, G′⟩ that must not be solved by solutions to P.
Figure 2 shows an input/output pair (1, 2, 3, 4, 5, 6)/(6, 5, 4, 3, 2, 1) that represents a positive example for computing a generalized plan that reverses lists of any size. This example can be encoded as a classical planning problem, where the fluents are the state variables necessary to encode a list of arbitrary size plus two pointers over the list nodes. The initial state encodes, using these fluents, the particular list (1, 2, 3, 4, 5, 6). The goal condition encodes the target list (6, 5, 4, 3, 2, 1). Finally, actions encode the swapping of the contents of the two pointers, as well as the movement of the pointers along the list. In this regard, the input/output example (1, 2, 3, 4, 5, 6)/(6, 5, 4, 3, 2, 1) is a positive example while (1, 2, 3, 4, 5, 6)/(6, 5, 2, 3, 4, 1) or (4, 3, 2, 1)/(2, 3, 4, 1) are negative examples for the generalized planning task of reversing lists.

In this work both positive and negative examples are classical planning problems P1 = ⟨F, A, I1, G1⟩, ..., PT = ⟨F, A, IT, GT⟩ defined on the same fluent set F and action set A. Each problem Pt ∈ P, 1 ≤ t ≤ T, encodes an input specification in its initial state It, while Gt encodes the specification of its associated output. Although the examples share actions, each action in A can have different interpretations in different states due to conditional effects. For instance, back in the example of Figure 1, we can encode individual planning tasks with different corridor sizes (the fluent set F can include fluents of type max(N) that encode different corridor boundaries (Segovia-Aguas, Celorrio, and Jonsson 2019)).

Figure 3: Taxonomy of instances in generalized planning.

Negative examples should not be confused with unsolvable planning instances: the goals of negative examples are reachable from the given initial state (see Figure 3).
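A way to generate such labeled input/output pairs for list reversal is sketched below (`make_examples` is our own helper, not from the paper). Two cautions are encoded in it: the goal of a negative example must remain reachable (any permutation is reachable with pointer swaps, unlike an output such as (1, 1, 1, 1)), and we also skip goals that already hold in the initial state, which any immediately terminating program would trivially solve:

```python
import random

def make_examples(lst, n_negative=2, seed=0):
    """Label input/output pairs for the list-reversal task: one positive
    pair (lst, reversed(lst)) and n_negative wrong-but-reachable pairs."""
    rng = random.Random(seed)
    target = list(reversed(lst))
    examples = [((lst, target), "positive")]
    while len(examples) < 1 + n_negative:
        wrong = lst[:]
        rng.shuffle(wrong)                    # permutations stay reachable
        if wrong != target and wrong != lst:  # wrong output, not already true
            examples.append(((lst, wrong), "negative"))
    return examples
```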
For instance, the goals of the negative example (1, 2, 3, 4, 5, 6)/(6, 5, 2, 3, 4, 1) shown in Figure 2 can be reached from the associated initial state by applying the corresponding actions to swap the contents of the pointers and move them appropriately. On the other hand, (4, 3, 2, 1)/(1, 1, 1, 1) would represent an unsolvable (UNSAT) classical planning instance, because (1, 1, 1, 1) is not reachable from (4, 3, 2, 1) given the mentioned actions for reversing lists.

## Program validation with negative examples

In generalized planning the process of plan validation is implicitly required as part of plan synthesis, since computing a solution plan requires us to validate it on all the given input instances. Next, we extend the notion of validation to also consider negative examples.

Definition 2 (Program Validation with Positive and Negative examples). Given a program Π and a set of classical planning problems P = {P1, ..., PT}, each labeled either as positive or negative, validation is the task of verifying whether Π solves each P ∈ P labeled as positive, while it fails to solve each P ∈ P labeled as negative.

Validating a sequential plan on a classical planning problem is straightforward, because either a validation proof or a failure proof is obtained by executing the plan from the initial state of the planning problem (Howey, Long, and Fox 2004). Validating a program on a classical planning problem is no longer straightforward, because it requires a mechanism for detecting infinite loops (checking failure conditions 1 and 2 is, however, straightforward). The execution of plans with control flow on a given planning problem is compilable into classical planning.
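Definition 2 reduces to a simple conjunction once a decision procedure for single instances is available. In this sketch, `solves` is an assumed black box (e.g. executing the program with infinite-loop detection):

```python
def validate(program, labeled_instances, solves):
    """A program is valid for a labeled set iff it solves each instance
    labeled 'positive' and fails to solve each one labeled 'negative'."""
    return all(solves(program, instance) == (label == "positive")
               for instance, label in labeled_instances)

# Toy decision procedure: the "program" is just the set of instances it solves.
toy_solves = lambda prog, inst: inst in prog
assert validate({"a", "b"}, [("a", "positive"), ("c", "negative")], toy_solves)
assert not validate({"a", "b"}, [("b", "negative")], toy_solves)
```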
Examples are compilations for GOLOG procedures (Baier, Fritz, and McIlraith 2007), finite state controllers (Bonet, Palacios, and Geffner 2010; Segovia-Aguas, Jiménez, and Jonsson 2018) and planning programs (Segovia-Aguas, Jiménez, and Jonsson 2016; Segovia-Aguas, Celorrio, and Jonsson 2019). These compilations encode the cross product of the given planning problem and the automaton corresponding to the plan to execute. The plan is valid iff the compiled problem is solvable. If a classical planner proves the compiled problem unsolvable, then the plan is invalid because its execution necessarily failed. However, current planners do not excel at proving that a given problem is unsolvable (Eriksson, Röger, and Helmert 2017), and none of the cited compilations can identify the precise source of a failed plan execution.

Next, we show that the classical planning compilation for the synthesis of planning programs (Segovia-Aguas, Celorrio, and Jonsson 2019) can be extended with a mechanism for detecting infinite loops, taken from an approach for the computation of infinite plans (Patrizi et al. 2011). This updated compilation can identify the three possible sources of execution failure (namely incomplete programs, inapplicable actions and infinite loops) of a program on a given classical planning problem. What is more, the compilation can be further extended to solve generalized planning problems with positive and negative examples.

## A compilation for program validation

Given a generalized planning task P = {P1, ..., PT} and a program Π, program validation is compilable into a planning instance P′n = ⟨F′n, A′n, I′n, Gn⟩ that extends Pn from the Background section.
The extended fluent set is F′n = Fn ∪ Fneg ∪ Fcopy, where Fneg = {checked, holds, stored, acted, loop} contains flags for identifying the source of execution failures, and Fcopy = {copy_f, correct_f : f ∈ F ∪ Fpc} contains the fluents used to store a copy of the program state with the aim of identifying infinite loops.

Unlike An, the action set A′n does not contain programming actions (these actions are only necessary for program synthesis, not for program validation). However, A′n contains a new type of action, called a check action, that verifies whether the precondition of an instruction holds. For an instruction w and line i, the check action C(wi) is defined as

- pre(C(wi)) = {pc_i, ins_{i,w}, ¬checked, ¬loop},
- cond(C(wi)) = {∅ ▷ {checked}} ∪ {pre(w) ▷ {holds}}.

After applying C(wi), execution fails if holds is false, either because the goal condition Gt is not satisfied when we apply a termination instruction end_t, or because the precondition pre(w) of the action w ∈ A does not hold (which corresponds precisely to failure conditions 1 and 2 above). A similar mechanism has been previously developed for computing explanations/excuses of why a plan cannot be found (Göbelbecker et al. 2010). Each execution action E(wi) is defined as before, but we add the precondition {checked, holds} and the conditional effect {∅ ▷ {¬checked, ¬holds, acted}}. As a result, C(wi) has to be applied before E(wi), and E(wi) is only applicable if execution does not fail (i.e. if holds is true).

To identify infinite loops, A′n is extended with three new actions:

- store, which stores a copy of the current program state:
  - pre(store) = {¬checked, ¬stored, acted},
  - cond(store) = {∅ ▷ {stored, ¬acted}} ∪ {{f} ▷ {copy_f} : f ∈ F ∪ Fpc}.

  This action can be applied only once, after an execution action E(wi) and before a check action C(wi). It simply uses conditional effects to store a copy of the program state (s, i) in the fluents of type copy_f.
- compare, which compares the current program state (s, i) with the stored copy:
  - pre(compare) = {¬checked, stored, acted, ¬loop},
  - cond(compare) = {∅ ▷ {¬stored, ¬acted, loop}} ∪ {{f, copy_f} ▷ {correct_f} : f ∈ F ∪ Fpc} ∪ {{¬f, ¬copy_f} ▷ {correct_f} : f ∈ F ∪ Fpc}.

  The result of the comparison is stored in the fluents of type correct_f. Note that acted is not true after applying store, so to satisfy the precondition of compare we have to apply an execution action first (otherwise the current program state would trivially equal the stored copy). For a fluent f to be correct, either it is true in both the current program state and the stored copy, or it is false in both.
- process, which processes the outcome of the comparison:
  - pre(process) = {loop} ∪ {correct_f : f ∈ F ∪ Fpc},
  - cond(process) = {∅ ▷ {¬loop, checked}}.

  This action can only be applied if all fluents in F ∪ Fpc are correct; it adds the fluent checked and resets the other auxiliary fluents to false. The purpose of adding checked is to match the state of the other failure conditions (checked is true and holds is false).

Finally, A′n also contains actions skip_t, 1 ≤ t ≤ T, that terminate program execution as the result of a failure condition. These actions are applicable once a failure condition of either type has been detected (checked is true and holds is false):

- pre(skip_t) = {test_t, checked, ¬holds},
- cond(skip_t) = cond(end_t) ∪ {∅ ▷ {¬checked, ¬stored}} ∪ {∅ ▷ {¬copy_f, ¬correct_f : f ∈ F ∪ Fpc}}.

Note that the action applied immediately before skip_t identifies the source of the execution failure of the program Π on Pt. This action is either:

1. C(end_{t,i}), identifying an incomplete program.
2. C(wi) such that w ∈ A, which proves that w is an inapplicable action.
3. process, identifying that the execution of the program entered an infinite loop.

The precondition ¬stored is added to all check actions C(end_{t,i}), to avoid carrying a stored copy of the program state from one input instance to the next.
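The store/compare/process mechanism amounts to guessing one step of the execution at which to snapshot the program state (s, i); if a later step reproduces the snapshot exactly, the deterministic execution provably loops forever. The sketch below replaces the planner's nondeterministic guess by trying every store point over a recorded execution trace (the helper name `detects_loop` is our own):

```python
def detects_loop(trace):
    """trace is the sequence of program states (s, i) visited during
    execution. store: pick a snapshot; compare/process: a later identical
    program state proves an infinite loop under deterministic execution."""
    for step, snapshot in enumerate(trace):
        if snapshot in trace[step + 1:]:
            return True
    return False

looping = [(frozenset({"f"}), 0), (frozenset(), 1), (frozenset({"f"}), 0)]
assert detects_loop(looping)        # state (s, 0) recurs
assert not detects_loop(looping[:2])
```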
The goal condition is the same as before, and in the initial state I′n the instructions of the program Π are already programmed: I′n = I1 ∪ {pc_0} ∪ {ins_{i,w} : 0 ≤ i ≤ n, wi ∈ Π}.

## Program synthesis with positive and negative examples

We define a generalized planning problem with positive and negative examples as a finite set of classical planning instances P = {P1, ..., P_{T⁺}, ..., PT} that belong to the same planning frame. In this set there are T⁺ positive and T⁻ negative instances, such that T = T⁺ + T⁻ (see Figure 3). We assume that at least one positive instance is required (T⁺ > 0) because otherwise the one-instruction program 0.end covers any negative instance whose goals are not satisfied in the initial state.

To synthesize programs for generalized planning with positive and negative examples we extend the compilation P′n with programming instructions. The output of this final extension of the compilation is a new planning instance P″n = ⟨F″n, A″n, I″n, Gn⟩:

- The fluent set F″n is identical to that of the compilation P′n, except that F″n includes a new fluent negex that marks negative instances.
- The action set A″n is identical to A′n except that we reintroduce the programming actions P(wi), and we add a precondition ¬negex to E(end_{t,i}) and a precondition negex to skip_t, requiring program execution to succeed for positive examples and to fail for negative examples. Moreover, the precondition negex is added to all actions related to infinite loop detection (store, compare and process).
- All program lines are empty in I″n (so they can be programmed), and the goal condition is still Gn = {done}, as in the original compilation.

Theorem 1 (Soundness).
Any plan π that solves the planning instance P″n induces a planning program Π that solves the corresponding generalized planning task with positive and negative examples P = {P1, ..., P_{T⁺}, ..., PT}.

Proof. To solve the planning instance P″n, a solution π has to use programming actions P(wi) to program instructions on empty program lines, effectively inducing a planning program Π. Once an instruction w is programmed on line i, it cannot be altered and can only be executed using the given execution model (that is, using a check action C(wi) to test its precondition, followed by an execution action E(wi) to apply its effects). To achieve the goal Gn = {done}, π has to simulate the execution of Π on each input instance Pt, 1 ≤ t ≤ T, terminating with either E(end_{t,i}) or skip_t, which are the only two actions that allow us to move on to the next input instance (if t < T) or add the fluent done (if t = T). Because of the preconditions of E(end_{t,i}) and skip_t, execution has to terminate with E(end_{t,i}) if Pt is a positive example, and with skip_t if Pt is negative, proving that Π solves each positive example and fails to solve each negative example.

Theorem 2 (Completeness). If there exists a program Π within n program lines that solves P = {P1, ..., P_{T⁺}, ..., PT}, then there exists a classical plan π that solves P″n.

Proof. We can build a prefix of the plan π using programming actions that insert the instructions of Π on each program line. We then determine the remaining actions of π, building a postfix that simulates the execution of Π on each input instance Pt ∈ P. Since Π solves each positive example and fails to solve each negative example, there exists an action sequence that simulates the execution of Π on Pt and ends with action E(end_{t,i}) (if Pt is a positive example) or with skip_t (if Pt is negative).

## Experiments

This section reports the empirical performance of our approach for the synthesis and evaluation of programs for generalized planning.¹
All experiments were run on an Intel Core i5 2.90GHz ×4 with a memory limit of 4GB and a planning timeout of 600 seconds. In order to compare with previous approaches, we use Fast Downward (Helmert 2006) in the LAMA-2011 setting (Richter, Westphal, and Helmert 2011) to synthesize and evaluate programs using the presented compilations. Experiments are carried out over the following generalized planning tasks. Green Block consists of a tower of blocks where exactly one greenish block exists and must be collected. In Fibonacci the aim is to output the correct number in the Fibonacci sequence. In Gripper a robot has to move a number of balls from room A to room B, and in List the aim is to visit all elements of a linked list. In Triangular Sum we must calculate the triangular sum y = Σ_{x=0}^{N} x. Finally, we also introduce in this paper Robo Painter (RP), where a robot should paint, under different constraints, the odd cells of a corridor (see Figure 1).

## Computing programs with positive and negative examples

For the synthesis of programs that solve the previous generalized planning tasks, we compare two versions of our compilation, PN-Lite and PN, with the results for problems whose solutions were reported as One Procedure in Segovia-Aguas, Jiménez, and Jonsson (2016). We use PN to denote the version with positive and negative examples that detects the three possible failures of a planning program, whereas PN-Lite is a simpler sound version that detects incomplete programs and inapplicable actions but not infinite loops. In this experiment we ran almost 100 random configurations with at most 5 instances that could be either positive or negative (where at least one is forced to be positive, see the previous section).
The idea behind this experiment is to evaluate the use of negative examples as counter-examples that prevent undesired generalizations of programs synthesized from small input instances. Some domains from previous approaches are simple enough to generalize from a few positive instances, so for them our compilations only add complexity to the domain, requiring extra search time for failure detection.

¹The source code, benchmarks and scripts are available in the Automated Programming Framework (Segovia-Aguas 2017), such that any experimental data in the paper can be reproduced.

In Table 1, columns PN-Lite and PN report the results obtained when we synthesize programs that are validated on both positive and negative examples. Recall that the P″n compilation has additional fluents and actions compared to Pn, which imposes extra search time. However, including negative examples often makes it possible to synthesize programs from fewer and smaller positive examples (planning instances with fewer fluents) which, in general, increases the percentage of programs found. Also, synthesis from few examples is a benefit of generalized planning compilations akin to few-shot learning (Lake, Salakhutdinov, and Tenenbaum 2015; Camacho and McIlraith 2019).

We briefly describe the best generalizing solution for each domain in Table 1. In Green Block, we repeat the drop and unstack instructions while the green block is not held, and then collect the held green block. In the Fibonacci domain there are four variables, A, B, C and D, that represent F_n, n, F_{n-1} and F_{n-2} respectively. The program assigns C to D, then A to C, then adds D to A and decreases B, repeating this sequence while B is different from 0. The planning program found in Gripper picks up a ball with the left hand, moves to room B, drops the ball, and goes back, until no more balls remain in room A.
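The Fibonacci program described above can be transcribed as an ordinary loop. The initial values A = 0, C = 1 are our assumption (the paper does not state them); with these, A ends as the n-th Fibonacci number, with F(1) = F(2) = 1:

```python
def fibonacci_program(n):
    """Transcription of the synthesized Fibonacci planning program:
    repeat D <- C; C <- A; A <- A + D; B <- B - 1 while B != 0."""
    a, b, c, d = 0, n, 1, 0   # B counts down; A accumulates the result
    while b != 0:
        d = c
        c = a
        a = a + d
        b = b - 1
    return a
```

For example, `fibonacci_program(10)` returns 55.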
The List program visits the current node, moves to the next node and repeats the process until it reaches the tail. Finally, the program for Triangular Sum has a variable A initialized to 0 and a countdown variable B that is iteratively added to A.

| Domain | n | max(T) | Only Positive: Search (s) | Only Positive: Found (%) | PN-Lite: Search (s) | PN-Lite: Found (%) | PN: Search (s) | PN: Found (%) |
|---|---|---|---|---|---|---|---|---|
| Robo Painter | 5 | 5 | 64.58 | 50% | 140.44 | 60% | 86.43 | 40% |
| Green Block | 4 | 5 | 45.40 | 81.25% | 154.35 | 67.5% | 99.12 | 90% |
| Fibonacci | 5 | 5 | 190.14 | 25% | 93.96 | 27.5% | - | 0% |
| Gripper | 4 | 5 | 48.19 | 43.75% | 67.77 | 27.5% | 107.14 | 27.5% |
| List | 3 | 5 | 0.04 | 31.25% | 0.07 | 27.5% | 0.21 | 45% |
| T. Sum | 3 | 5 | 143.36 | 100% | 192.13 | 100% | 141.74 | 100% |

Table 1: Program synthesis with positive and negative examples: number of program lines (n), maximum number of input instances (max(T)), average search time (seconds) and percentage of found solutions (with only positive examples, and with positive and negative examples).

## Evaluating generalized plans with test sets of positive and negative examples

Negative examples are useful for defining quantitative metrics that evaluate the coverage of generalized plans with respect to a test set of unseen examples. Given a labeled classical planning instance P and a program Π:

- If P is labeled as positive and Π solves P, then P is a true positive; otherwise, if Π fails to solve P, then P is a false negative.
- If P is labeled as negative and Π solves P, then P is a false positive; otherwise, if Π fails to solve P, then P is a true negative.

Our notion of positive and negative examples allows us to adopt metrics from ML for the evaluation of planning solutions that generalize with respect to a test set. These metrics are more informative than simply counting the number of positive examples covered by a solution, since they also consider the errors made over the test set (Davis and Goadrich 2006): Precision, pr(Π) = p/(p + p′), and Recall, re(Π) = p/(p + n′), where p is the number of positive examples solved by Π (true positives), p′ is the number of false positives (classical planning problems labeled as negative that are solved by Π) and n′ is the number of false negatives (instances labeled as positive that cannot be solved by Π). Accuracy is a more informed metric frequently used in ML, ac(Π) = (p + n)/(p + n + p′ + n′), where n represents the number of negative examples unsolved by the program Π (true negatives).

For this experiment we consider the planning programs as given, i.e. the programs computed in the previous synthesis experiment. Then we compile a set of positive and negative instances, but include the planning program in the initial state instead of having empty lines (as in the Pn and P″n compilations). The execution of the computed programs on these instances must solve the positive instances while failing to solve the negative instances, verifying plan failure due to a failed condition or the detection of an infinite loop, as explained in the previous section.

| Domain | re(Π_P) | pr(Π_P) | ac(Π_P) | re(Π_PN-Lite) | pr(Π_PN-Lite) | ac(Π_PN-Lite) | re(Π_PN) | pr(Π_PN) | ac(Π_PN) |
|---|---|---|---|---|---|---|---|---|---|
| Robo Painter | 71.74% | 100.00% | 95.19% | 70.90% | 100.00% | 94.58% | 75.63% | 100.00% | 95.57% |
| Green Block | 68.86% | 80.93% | 91.48% | 60.48% | 75.00% | 88.76% | 80.89% | 88.38% | 94.43% |
| Fibonacci | 17.86% | 71.43% | 85.47% | 22.22% | 100.00% | 85.97% | - | - | - |
| Gripper | 41.54% | 85.71% | 87.68% | 62.38% | 88.73% | 91.35% | 35.44% | 84.88% | 86.30% |
| List | 8.08% | 72.73% | 81.89% | 5.96% | 65.00% | 81.00% | 4.75% | 47.22% | 80.27% |
| T. Sum | 71.74% | 100.00% | 95.19% | 75.63% | 100.00% | 95.57% | 70.90% | 100.00% | 94.58% |

Table 2: Program evaluation with respect to a test set of positive and negative examples, using the recall, precision and accuracy metrics. The - symbol refers to a value that is undefined because of an invalid operation (division by zero).
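The metrics of this section can be computed directly from the four counts. As before, `solves` is an assumed decision procedure for one instance; precision and recall are returned as None when their denominator is zero (the situation reported as - in Table 2):

```python
def metrics(program, test_set, solves):
    """Precision pr = p/(p + p'), recall re = p/(p + n') and accuracy
    ac = (p + n)/(p + n + p' + n'), with p true positives, n true
    negatives, p' (fp) false positives and n' (fn) false negatives."""
    p = n = fp = fn = 0
    for instance, label in test_set:
        solved = solves(program, instance)
        if label == "positive":
            if solved: p += 1
            else: fn += 1
        elif solved:
            fp += 1
        else:
            n += 1
    precision = p / (p + fp) if p + fp else None
    recall = p / (p + fn) if p + fn else None
    accuracy = (p + n) / (p + n + fp + fn)
    return precision, recall, accuracy
```

With a toy decision procedure (the "program" as the set of instances it solves), a test set with one instance in each of the four cells yields precision, recall and accuracy of 0.5 each.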
We have used a framework for validating planning programs that reports success, or specifies the source of failure, when executing a program on each randomly generated planning instance. The results are shown in Table 2, where we report the recall, precision and accuracy of programs synthesized using only positive examples and of programs synthesized using both positive and negative examples. The List domain is the only case where positive examples alone yield better accuracy; in the rest of the domains, synthesis with both positive and negative examples improves on synthesis with only positive examples in all metrics.

Generalized planning provides an interesting framework for bridging the gap between ML and AI planning (Geffner 2018). On the one hand, generalized planning follows a model-based approach with declarative definitions of actions and goals, as in AI planning. On the other hand, like inductive ML, generalized planning computes solutions that generalize over a set of input examples. Generalized planning, however, has little work dedicated to assessing the generality of plans beyond the given input planning tasks (positive examples only), whereas ML has a long tradition of empirically assessing the generality of solutions. The fact that our compilation identifies the source of failure of a program execution on a particular planning instance allows us to introduce negative examples and to bring in the evaluation machinery of ML, defining metrics that empirically assess the generality of plans beyond the given input planning tasks, e.g. using test sets. Model checking (Clarke, Grumberg, and Peled 1999) provides effective solvers for automatically verifying correctness properties of diverse finite-state systems. In particular, when actions have non-deterministic effects, program validation becomes complex, since it requires proving that all possible program executions reach the goals.
In such a scenario, model checking (Clarke, Grumberg, and Peled 1999) and non-deterministic planning, like POMDP planning, are more suitable approaches for plan validation (Hoffmann 2015). An interesting research direction is to study how to leverage model checking techniques for the synthesis of generalized plans from both positive and negative examples. Plan constraints are also a compact way of expressing negative information for planning, reducing the space of possible solution plans. Plan constraints have already been introduced into different planning models using the LTL formalism (Baier, Bacchus, and McIlraith 2009; Bauer and Haslum 2010; Patrizi, Lipovetzky, and Geffner 2013; Camacho et al. 2017).

Acknowledgments

The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 731761, IMAGINE; and it is partially supported by grant TIN-2015-67959 and the María de Maeztu Units of Excellence Programme MDM-2015-0502, MEC, Spain. Sergio Jiménez is supported by the Ramón y Cajal program, RYC-2015-18009, the Spanish MINECO project TIN2017-88476-C2-1-R, and the Generalitat Valenciana project GV/2019/082. Anders Jonsson is partially supported by the Spanish grants TIN2015-67959 and PCIN-2017-082.

References

Alur, R.; Singh, R.; Fisman, D.; and Solar-Lezama, A. 2018. Search-based program synthesis. Communications of the ACM 61(12):84-93.
Angluin, D.; Frazier, M.; and Pitt, L. 1992. Learning conjunctions of Horn clauses. Machine Learning 9(2-3):147-164.
Baier, J. A.; Bacchus, F.; and McIlraith, S. A. 2009. A heuristic search approach to planning with temporally extended preferences. Artificial Intelligence 173(5-6):593-618.
Baier, J. A.; Fritz, C.; and McIlraith, S. A. 2007. Exploiting procedural domain control knowledge in state-of-the-art planners. In International Conference on Automated Planning and Scheduling, 26-33.
Bauer, A., and Haslum, P. 2010.
LTL goal specifications revisited. In ECAI, volume 10, 881-886.
Biere, A.; Heule, M.; van Maaren, H.; and Walsh, T. 2009. Conflict-driven clause learning SAT solvers. Handbook of Satisfiability, Frontiers in Artificial Intelligence and Applications, 131-153.
Bonet, B.; Palacios, H.; and Geffner, H. 2010. Automatic derivation of finite-state machines for behavior control. In AAAI.
Camacho, A., and McIlraith, S. A. 2019. Learning interpretable models expressed in linear temporal logic. In International Conference on Automated Planning and Scheduling, volume 29, 621-630.
Camacho, A.; Triantafillou, E.; Muise, C.; Baier, J. A.; and McIlraith, S. A. 2017. Non-deterministic planning with temporally extended goals: LTL over finite and infinite traces. In AAAI.
Clarke, E. M.; Grumberg, O.; and Peled, D. 1999. Model Checking. MIT Press.
Davis, J., and Goadrich, M. 2006. The relationship between precision-recall and ROC curves. In ICML, 233-240. ACM.
Eriksson, S.; Röger, G.; and Helmert, M. 2017. Unsolvability certificates for classical planning. In International Conference on Automated Planning and Scheduling.
Geffner, H. 2018. Model-free, model-based, and general intelligence. In IJCAI.
Göbelbecker, M.; Keller, T.; Eyerich, P.; Brenner, M.; and Nebel, B. 2010. Coming up with good excuses: What to do when no plan can be found. Cognitive Robotics 10081.
Helmert, M. 2006. The Fast Downward planning system. Journal of Artificial Intelligence Research 26:191-246.
Hoffmann, J. 2015. Simulated penetration testing: From Dijkstra to Turing Test++. In International Conference on Automated Planning and Scheduling, 364-372.
Howey, R.; Long, D.; and Fox, M. 2004. VAL: Automatic plan validation, continuous effects and mixed initiative planning using PDDL. In ICTAI, 294-301. IEEE.
Hu, Y., and De Giacomo, G. 2011. Generalized planning: Synthesizing plans that work for multiple environments. In International Joint Conferences on Artificial Intelligence.
Hu, Y., and De Giacomo, G. 2013.
A generic technique for synthesizing bounded finite-state controllers. In International Conference on Automated Planning and Scheduling.
Hu, Y., and Levesque, H. J. 2011. A correctness result for reasoning about one-dimensional planning problems. In International Joint Conferences on Artificial Intelligence, 2638-2643.
Jiménez, S.; Segovia-Aguas, J.; and Jonsson, A. 2019. A review of generalized planning. The Knowledge Engineering Review 34.
Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266):1332-1338.
Lotinac, D.; Segovia-Aguas, J.; Jiménez, S.; and Jonsson, A. 2016. Automatic generation of high-level state features for generalized planning. In International Joint Conferences on Artificial Intelligence.
Mitchell, T. M. 1997. Machine Learning. New York, NY, USA: McGraw-Hill, Inc., 1st edition.
Parekh, R.; Honavar, V.; et al. 2000. Grammar inference, automata induction, and language acquisition. Handbook of Natural Language Processing, 727-764.
Patrizi, F.; Lipovetzky, N.; De Giacomo, G.; and Geffner, H. 2011. Computing infinite plans for LTL goals using a classical planner. In International Joint Conferences on Artificial Intelligence.
Patrizi, F.; Lipovetzky, N.; and Geffner, H. 2013. Fair LTL synthesis for non-deterministic systems using strong cyclic planners. In International Joint Conferences on Artificial Intelligence.
Richter, S.; Westphal, M.; and Helmert, M. 2011. LAMA 2008 and 2011. In International Planning Competition.
Segovia-Aguas, J.; Celorrio, S. J.; and Jonsson, A. 2019. Computing programs for generalized planning using a classical planner. Artificial Intelligence.
Segovia-Aguas, J.; Jiménez, S.; and Jonsson, A. 2016. Generalized planning with procedural domain control knowledge. In International Conference on Automated Planning and Scheduling.
Segovia-Aguas, J.; Jiménez, S.; and Jonsson, A. 2018.
Computing hierarchical finite state controllers with classical planning. Journal of Artificial Intelligence Research 62:755-797.
Segovia-Aguas, J. 2017. Automated programming framework. https://github.com/aig-upf/automated-programming-framework. Accessed: 2019-11-12.
Srivastava, S.; Immerman, N.; Zilberstein, S.; and Zhang, T. 2011. Directed search for generalized plans using classical planners. In International Conference on Automated Planning and Scheduling, 226-233.
Srivastava, S.; Immerman, N.; and Zilberstein, S. 2011. A new representation and associated algorithms for generalized planning. Artificial Intelligence 175(2):615-647.
Winner, E., and Veloso, M. 2003. DISTILL: Learning domain-specific planners by example. In ICML, 800-807.