# Guiding GBFS through Learned Pairwise Rankings

Mingyu Hao¹, Felipe Trevizan¹, Sylvie Thiébaux¹,², Patrick Ferber³,⁴, Jörg Hoffmann⁴
¹Australian National University, Australia
²LAAS-CNRS, Université de Toulouse, France
³University of Basel, Switzerland
⁴Saarland University, Germany
{mingyu.hao,felipe.trevizan,sylvie.thiebaux}@anu.edu.au, patrick.ferber@unibas.ch, hoffmann@cs.uni-saarland.de

Abstract

We propose a new approach based on ranking to learn to guide Greedy Best-First Search (GBFS). Like previous ranking approaches, ours is based on the observation that directly learning a heuristic function is overly restrictive, and that GBFS is capable of efficiently finding good plans for a much more flexible class of total quasi-orders over states. In order to learn an optimal ranking function, we introduce a new ranking framework capable of leveraging any neural network regression model and efficiently handling the training data through batching. Compared with previous ranking approaches for planning, ours does not require complex loss functions and allows training on states outside the optimal plan with minimal overhead. Our experiments on the domains of the latest planning competition learning track show that our approach substantially improves the coverage of the underlying neural network models without degrading plan quality.

1 Introduction

Greedy Best-First Search (GBFS) is one of the most widely used algorithms in satisficing planning. GBFS uses a heuristic function to estimate the cost to reach the goal from a given node. The nodes are put in a priority queue based on their heuristic values, and the node with the lowest heuristic value is expanded next. This strategy will return a plan if one exists, but may result in suboptimal solutions. Many researchers have made significant efforts to design informative heuristics to guide GBFS [Bonet et al., 1997; Hoffmann and Nebel, 2001; Richter and Westphal, 2010]; however, the efficiency of the search depends on properties of the heuristic function which are not yet completely understood [Wilt and Ruml, 2016; Cohen and Beck, 2018]. Therefore, in recent years, the possibility of exploring the space of heuristic functions using machine learning has attracted increasing attention [Arfaee et al., 2011; Shen et al., 2020; Ferber et al., 2020; Karia and Srivastava, 2021; Ståhlberg et al., 2022; Heller et al., 2022; Chen et al., 2024]. These approaches try to learn or improve a heuristic function for a planning domain or, occasionally, a single large problem instance. They typically formulate the learning problem as one of regression, and train models using pre-computed state-value pairs extracted from (often optimal) plans generated by off-the-shelf planners for small problem instances. However, labelling states with informative values and attempting to learn accurate values for the heuristic is expensive and unnecessary for the purpose of learning to guide GBFS. This is because the choice of node expansion by GBFS is not determined by the actual numerical values of the heuristic but by the relative ordering over nodes they induce [Röger and Helmert, 2010]. Therefore, regression models, which attempt to learn precise heuristic values, artificially restrict the target hypothesis space and the training dataset one can learn from. They also introduce issues such as the difficulty of distinguishing between states that have close heuristic values.
In fact, for GBFS to find the optimal plan, it suffices for the heuristic to order the states on the optimal path as preferable to those off it. This is called the perfect ranking heuristic in [Chrestien et al., 2023], as it also minimises search effort. These considerations motivate an alternative framing of the problem as one of learning to rank pairs of states, building on the active research on learning to rank in information retrieval and relevance analysis [Burges, 2010; Damke and Hüllermeier, 2021; Li, 2022]. Garrett et al. (2016) were the first to approach this problem for planning by using a statistical machine learning model, namely RankSVM [Joachims, 2002], hand-crafted domain-independent features, and a loss function measuring the normalised difference between the number of correctly and incorrectly classified pairs.

In contrast, this paper explores pairwise ranking with modern deep-learning methods. We propose a novel framework combining a Direct Ranker [Köppel et al., 2020b] with any existing neural network regression model for learning heuristics, which are trained end-to-end in order to learn a total quasi-order, i.e. a total, transitive, and reflexive relation over pairs of states. This framework does not require complex loss functions, and exploits the transitivity of the ordering relation to learn a ranking just slightly stronger than the perfect ranking, from an order of magnitude less data than alternative approaches. Moreover, it supports efficient batched evaluation to speed up training.

We evaluate our approach on the learning track of the 2023 International Planning Competition, using the recent STRIPS-HGN [Shen et al., 2020] and GOOSE [Chen et al., 2024] domain-independent graph neural network architectures as the underlying regression model. We show that our ranking approach considerably improves the coverage of both these models and expands their ability to generalise to larger problems than those trained on, whilst preserving their plan quality and time to convergence. We also compare our approach with the recently and independently developed work from Chrestien et al. (2023). We show that our framework requires less training data and generalises better to unseen problems.

2 Background

We start by introducing the necessary background on planning, greedy best-first search, and learning to rank. We use ⟦n⟧ to denote {1, 2, ..., n}.

2.1 Planning

We consider classical planning problems described by a tuple P = ⟨S, s_0, G, A, C⟩, where S is a finite and discrete set of states, s_0 is the initial state, G ⊆ S \ {s_0} is the set of goal states, A is the set of actions of which A_s is the subset applicable in state s, and an action a ∈ A_s is a deterministic transition function which takes state s as input and outputs the successor state s'. In the following, we write N_s for the set of successors of a state s: it consists of all states s' such that there exists an applicable action a ∈ A_s with s' := a(s). The cost function C(a) returns the non-negative cost of applying action a. A unit-cost problem is one for which all actions have cost 1. A plan for P is a sequence of actions a_1, ..., a_n that induces a sequence of states (a path) s_0, s_1, ..., s_n where a_i ∈ A_{s_{i−1}}, s_i := a_i(s_{i−1}) for i ∈ ⟦n⟧, and s_n ∈ G. A plan is optimal if its cost Σ_{i∈⟦n⟧} C(a_i) is minimal. In the unit-cost case, optimal plans are plans of minimal length.
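To make the formalism concrete, the sketch below shows one possible Python interface for a unit-cost planning problem. The class name, the method names, and the choice of representing states as frozensets of ground atoms are our own illustrative assumptions, not part of any planner's actual API; later sketches in this paper build on this interface.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Tuple

State = FrozenSet[str]      # a state as a set of ground atoms (illustrative choice)
Action = Tuple[str, ...]    # an opaque action label

@dataclass(frozen=True)
class PlanningProblem:
    """Minimal interface for P = <S, s0, G, A, C> with unit costs (a sketch)."""
    initial_state: State
    is_goal: Callable[[State], bool]                 # membership test for G
    applicable: Callable[[State], Iterable[Action]]  # A_s
    apply: Callable[[State, Action], State]          # a(s) -> successor s'

    def successors(self, s: State) -> Iterable[Tuple[Action, State]]:
        """Enumerate N_s together with the actions that generate each successor."""
        for a in self.applicable(s):
            yield a, self.apply(s, a)

    def cost(self, a: Action) -> float:
        return 1.0  # unit-cost assumption used throughout the paper
```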
Greedy Best-First Search (GBFS) is the most commonly used algorithm for satisficing planning. In order to guide the search towards the goal, GBFS relies on a heuristic function h : S → ℝ returning an estimate h(s) of the cost to reach the goal from a given state s, and always expands the state with the lowest heuristic value on the search frontier. As shown in Algorithm 1, GBFS maintains two lists of nodes¹: the open list (O), a priority queue ordered by increasing h value, containing the nodes that have been generated but not yet expanded; and a closed list (C), containing previously expanded nodes, which is used to avoid revisiting the same state twice and creating loops. At each iteration, the node at the front of the open list, i.e. the node with the lowest h value, is transferred to the closed list and expanded by generating its successors. The search stops if a successor is a goal state. Otherwise, any unvisited successor is added to the open list. GBFS is guaranteed to find a plan if one exists, but the plan may not be optimal. In fact, even when guided with the h* heuristic, which returns the optimal cost to reach the goal, GBFS is only guaranteed to return the optimal plan for unit-cost problems. In the following, we assume that problems have unit cost.

¹In the interest of readability, Algorithm 1 identifies nodes with states, whereas nodes additionally record the h value of the state, the parent node, and the action applied at the parent.

```
Algorithm 1: GBFS
Input: problem P, heuristic function h
Output: plan π or "unsolvable"
 1: O := {s_0}
 2: C := ∅
 3: while O ≠ ∅ do
 4:     s := argmin_{s' ∈ O} h(s')
 5:     O := O \ {s}
 6:     C := C ∪ {s}
 7:     for all s' ∈ N_s do
 8:         if s' ∈ G then
 9:             return Extract-Solution(s')
10:         end if
11:         if s' ∉ C ∪ O then
12:             O := O ∪ {s'}
13:         end if
14:     end for
15: end while
16: return "unsolvable"
```

2.2 Learning to Rank

Learning-to-rank is a well-studied field in machine learning [Liu et al., 2009; Li, 2022]. Given a set of objects, in our case a set of states B = {s_1, ..., s_k}, it aims to learn an order ⪯ that results in the ranking s_{i_1} ⪯ s_{i_2} ⪯ ... ⪯ s_{i_k}, where i_1, ..., i_k is a permutation of ⟦k⟧. In this work, we further require the order ⪯ to have the following properties:

Definition 1 (Total Quasi-Order ⪯). A total quasi-order ⪯ over a set X is a binary relation with the following properties:
- Totality: ∀ a, b ∈ X, a ⪯ b ∨ b ⪯ a;
- Transitivity: ∀ a, b, c ∈ X, a ⪯ b ∧ b ⪯ c ⇒ a ⪯ c.
Moreover, totality implies Reflexivity: ∀ a ∈ X, a ⪯ a.

We define the strict order ≺ in the obvious way: a ≺ b iff a ⪯ b and not b ⪯ a. Note that transitivity also holds for the strict order ≺: ∀ a, b, c ∈ X, a ≺ b ∧ b ≺ c ⇒ a ≺ c.

There are several different approaches to solving the learning-to-rank problem, such as SVMs [Shashua and Levin, 2002; Herbrich et al., 1999], Boosting [Freund et al., 2003; Wu et al., 2010] and Neural Networks [Burges et al., 2005; Köppel et al., 2020b]. Based on how the order is represented, learning-to-rank approaches are divided into three groups: pointwise, pairwise and listwise [Li, 2022]. Listwise approaches learn to sort a list of objects to match a desired total order. For planning, this approach is infeasible since we do not know a priori the list of states we need to sort. The pointwise approach represents the total quasi-order by a ranking function r : B → ℝ which takes in one object and returns a numerical score [Li, 2022]. This numerical score is then used to sort B. Note that GBFS uses the heuristic h as a pointwise ranking function: the set of open states O is sorted according to their h values (Line 4 of Algorithm 1).
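For concreteness, the following is a compact Python rendering of Algorithm 1 on top of the hypothetical PlanningProblem interface sketched in Section 2.1; the FIFO tie-breaking and the helper names are our own illustrative choices, not the authors' implementation. The heuristic enters only as the key used to sort the open list, which is exactly the pointwise-ranking view discussed above.

```python
import heapq
from itertools import count
from typing import Callable, Dict, List, Optional, Tuple

def gbfs(problem: "PlanningProblem",
         h: Callable[["State"], float]) -> Optional[List["Action"]]:
    """Greedy Best-First Search (Algorithm 1): h is used only to sort the open list."""
    tie = count()                                     # FIFO tie-breaking among equal h values
    s0 = problem.initial_state
    open_list = [(h(s0), next(tie), s0)]              # the open list O, keyed by h
    parent: Dict["State", Tuple] = {s0: (None, None)} # how each generated state was reached
    closed = set()                                    # the closed list C

    while open_list:                                  # Line 3
        _, _, s = heapq.heappop(open_list)            # Line 4: state with the lowest h value
        closed.add(s)                                 # Lines 5-6
        for a, succ in problem.successors(s):         # Line 7: expand s, generating N_s
            if problem.is_goal(succ):                 # Lines 8-10
                parent[succ] = (s, a)
                return extract_solution(succ, parent)
            if succ not in closed and succ not in parent:  # Line 11: succ not yet in C or O
                parent[succ] = (s, a)
                heapq.heappush(open_list, (h(succ), next(tie), succ))  # Line 12
    return None                                       # Line 16: "unsolvable"

def extract_solution(goal_state, parent) -> List["Action"]:
    """Follow parent pointers back to s_0 and return the plan."""
    plan, state = [], goal_state
    while parent[state][0] is not None:
        prev, a = parent[state]
        plan.append(a)
        state = prev
    plan.reverse()
    return plan
```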
Similarly, learning a heuristic function can be seen as learning a pointwise ranking function with extra constraints imposed by the definition of a heuristic function, e.g., h(s) ≥ 0 for all s, h(g) = 0 for all g ∈ G, or even admissibility or consistency.

[Figure 1: An example graph representation of part of the search tree. The optimal path is [s_0, s_1, ..., s_n]. States t_{i,j} are siblings of s_i and have higher (worse) ranks, i.e., s_i ≺ t_{i,j}.]

Lastly, a pairwise ranking function takes a pair of elements as input and outputs the ordering between the two objects. Formally, a pairwise ranking function r(s_i, s_j) ∈ {−1, 0, 1} represents whether s_i ≺ s_j, s_i = s_j or s_j ≺ s_i, respectively [Köppel et al., 2020b]. While learning a pointwise ranking function is a regression problem, learning a pairwise ranking is a classification problem, i.e., given two states s_i and s_j, we need to assign the pair either −1, 0, or 1. As we show in the next section, this difference has a large impact on learning a ranking for planning in comparison to learning a heuristic.

3 Rank-based GBFS

In the context of GBFS, the heuristic function h is used solely for sorting states in the priority queue, as shown in Line 4 of Algorithm 1. This means h is being used only as a total quasi-order over the states S. This fact has been exploited by others, for instance, by using functions other than the cost-to-goal to order states in GBFS [Ferber et al., 2022] and to analyse what makes a good heuristic for GBFS [Wilt and Ruml, 2016]. For this reason, we extend GBFS to use a ranking instead of a heuristic. Formally, given a total quasi-order ⪯, rank-based GBFS is the algorithm obtained by replacing Line 4 in Algorithm 1 with: choose s such that s ⪯ t for all t ∈ O. We relate the two versions of GBFS with the lemma below.

Definition 2 (⪯_h). Given a heuristic h, the total quasi-order ⪯_h associated with it is such that s ⪯_h s' ⟺ h(s) ≤ h(s').

Lemma 1 (Rank-based GBFS equivalence). Given a problem P and a heuristic h, GBFS using h is equivalent to rank-based GBFS using ⪯_h for the same tie-breaking method.

It is easy to see that Lemma 1 holds based on the definition of ⪯_h. Lemma 1 also explains why any monotonic non-decreasing function f applied to the heuristic h, i.e., h'(s) = f(h(s)), does not change the solution found by GBFS: f preserves order, and therefore the total quasi-orders ⪯_h and ⪯_{f(h)} are the same.

We close this section by looking at what ranking function we should aim to learn. An intuitive ranking to learn is the one induced by h*, i.e., ⪯_{h*}, since GBFS is guaranteed to find the optimal solution using h* under unit-cost actions. However, for usage with GBFS, ⪯_{h*} orders more states than are necessary to find the optimal solution. To see this, consider the example in Figure 1. The total quasi-order ⪯_{h*} imposes an ordering between any pair of states in the example, including between siblings of states on the optimal path, i.e., t_{i,j} ⪯_{h*} t_{i',j'} ⟺ h*(t_{i,j}) ≤ h*(t_{i',j'}). However, the ordering among siblings is not needed, i.e., it is sufficient to know that s_i ≺ t_{j,m} for j ∈ ⟦i⟧ and m ∈ ⟦k_j⟧, where k_j is the number of siblings of s_j. We exploit this fact to define an optimal ranking as:

Definition 3 (Optimal Ranking ⪯). Given a problem P and an optimal plan π* = [s_0, s_1, ..., s_n] for P, the optimal ranking ⪯ is a total quasi-order such that s_i ≺ s_{i−1} for all i ∈ ⟦n⟧, and s_i ≺ t for all t ∈ N_{s_{i−1}} \ {s_i} and all i ∈ ⟦n⟧.

Notice that Definition 3 only relates a state s_i in the optimal plan with its parent s_{i−1} and its siblings t. Since it does not enforce an ordering between the siblings t of s_i, any ordering of those siblings yields a valid optimal ranking, which remains a total quasi-order. Moreover, due to the transitivity property of total quasi-orders, we have that s_i ≺ t for all t ∈ (∪_{j=0}^{i−1} N_{s_j}) \ {s_i}, i.e., s_i is ranked strictly better than any of its ancestors and their siblings.
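To make Definition 3 concrete, the following sketch (with hypothetical names, building on the PlanningProblem interface of Section 2.1) enumerates the strict pairs that an optimal ranking must satisfy for a given optimal plan.

```python
from typing import Iterable, List, Tuple

def optimal_ranking_pairs(problem: "PlanningProblem",
                          plan_states: List["State"]
                          ) -> Iterable[Tuple["State", "State"]]:
    """Yield every strict pair (better, worse) required by Definition 3:
    s_i before s_{i-1}, and s_i before every sibling t in N_{s_{i-1}} other than s_i."""
    for i in range(1, len(plan_states)):
        s_prev, s_i = plan_states[i - 1], plan_states[i]
        yield s_i, s_prev                        # s_i is preferred to its parent
        for _, t in problem.successors(s_prev):  # siblings of s_i
            if t != s_i:
                yield s_i, t                     # ... and to each of its siblings
```

For a plan of length n and branching factor b this yields on the order of bn pairs; transitivity then entails the remaining orderings exploited in Sections 4.3 and 5.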
Next, we prove that GBFS finds the optimal solution when using an optimal ranking.

Theorem 1. For a solvable problem P with unit costs, given an optimal ranking ⪯, the plan returned by rank-based GBFS is optimal.

Proof. Consider a problem P with unit costs, an optimal solution π* = [s_0, s_1, ..., s_n] and an optimal ranking ⪯. We prove by induction that s_{i−1} is expanded at the i-th iteration of the algorithm and that, after its expansion, the open list is O_i = ∪_{j=0}^{i−1} N_{s_j} \ ∪_{j=0}^{i−1} {s_j}, for all i ∈ ⟦n⟧. For readability, given a state s and a set of states X, we abbreviate "∀ t ∈ X, s ≺ t" as "s ≺ X".

Base case: initially, the open list of GBFS only contains s_0 (Line 1 of Algorithm 1), which is always expanded at the first iteration of the while loop in Line 3, leading to O_1 = N_{s_0} \ {s_0}.

Induction step: assuming the property holds at iteration i, we show that s_i is expanded next and that O_{i+1} = ∪_{j=0}^{i} N_{s_j} \ ∪_{j=0}^{i} {s_j}. According to the definition of the optimal ranking, we have s_i ≺ N_{s_{i−1}} \ {s_i} (a) and s_i ≺ s_{i−1} (b). Because s_{i−1} was expanded in the last iteration, we have s_{i−1} ≺ ∪_{j=0}^{i−2} N_{s_j} \ ∪_{j=0}^{i−2} {s_j} (c). Since ≺ is transitive, (b) and (c) lead to s_i ≺ ∪_{j=0}^{i−2} N_{s_j} \ ∪_{j=0}^{i−2} {s_j} (d). Finally, combining (a) and (d) gives s_i ≺ (∪_{j=0}^{i−1} N_{s_j} \ ∪_{j=0}^{i−1} {s_j}) \ {s_i}. In other words, s_i is ranked at the top of the open list O_i and will be expanded at iteration i+1. After its expansion, the queue becomes O_{i+1} = O_i ∪ (N_{s_i} \ ∪_{j=0}^{i} {s_j}) = ∪_{j=0}^{i} N_{s_j} \ ∪_{j=0}^{i} {s_j}. ∎

A key difference between optimal rankings and ⪯_{h*} is that there are several total quasi-orders that are optimal rankings, while ⪯_{h*} is unique. This makes learning an optimal ranking potentially easier than learning ⪯_{h*} or even h* itself, since h* is also unique. We exploit this in the next section, where we introduce our framework to learn an optimal ranking.

4 Learning an Optimal Ranking

In this section, we introduce a new framework capable of leveraging any neural network regression model to learn an optimal ranking function. We also show how to efficiently integrate the ranking computations into GBFS and how to extract more data from the optimal plans for training, while still efficiently handling this extra data through batching.

4.1 Direct Ranker

Our approach for learning an optimal ranking through pairwise classification uses Direct Ranker [Köppel et al., 2020b].

[Figure 2: The structure of the Direct Ranker [Köppel et al., 2020b].]

Direct Ranker is a general model for pairwise ranking that is designed to learn total quasi-orders. Its structure is shown in Figure 2. The model consists of two parts: feature computation and ranking computation. The first part receives two objects (in our case, states) s_i and s_j to be ranked and generates the embedding vectors emb_{s_i} and emb_{s_j}, both in ℝⁿ, through the neural networks nn_1 and nn_2. The structure of the neural networks nn_1 and nn_2 is not restricted; the only requirement is that both nn_1 and nn_2 have the same structure and share weights, i.e., they must be identical. The second part of the model, the ranking computation, computes the element-wise difference of the two embedding vectors and passes it through a single neuron τ with zero bias. Formally, this second part computes p = σ(w (emb_{s_i} − emb_{s_j})), where w ∈ ℝ^{1×n} are the weights and σ is the activation function of τ. The activation function σ must be odd and sign-preserving, i.e., σ(−x) = −σ(x) and sign(σ(x)) = sign(x). The input states s_i and s_j are classified as s_i ≺ s_j, s_i = s_j, or s_j ≺ s_i if p is less than, equal to, or greater than 0, respectively. Under these assumptions, Direct Ranker is guaranteed to return a total quasi-order [Köppel et al., 2020b, Theorem 1]. Besides the activation function σ, the loss function used by Direct Ranker is also a parameter of the framework, and we use the same parametrisation as in Köppel et al. (2020b)'s experiments: the activation function σ(x) = (1 + e^{−x})^{−1} − 0.5 and mean squared error (MSE) as the loss function [Köppel et al., 2020a].
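A minimal PyTorch sketch of this architecture is shown below. The feature network stands in for any shared encoder (e.g. a GOOSE or STRIPS-HGN embedding); for simplicity it is assumed here to map batches of fixed-size state tensors to ℝⁿ, whereas the actual GNN encoders consume graphs. The class and argument names are ours, not the authors' code.

```python
import torch
import torch.nn as nn

class DirectRanker(nn.Module):
    """Pairwise ranking head on top of a shared feature network:
    p = sigma(w (nn(s_i) - nn(s_j))) with a single, bias-free output neuron tau."""

    def __init__(self, feature_net: nn.Module, embedding_dim: int):
        super().__init__()
        self.feature_net = feature_net                      # shared weights for both inputs
        self.tau = nn.Linear(embedding_dim, 1, bias=False)  # the neuron tau with zero bias

    @staticmethod
    def sigma(x: torch.Tensor) -> torch.Tensor:
        # odd, sign-preserving activation: sigma(x) = sigmoid(x) - 0.5
        return torch.sigmoid(x) - 0.5

    def forward(self, s_i: torch.Tensor, s_j: torch.Tensor) -> torch.Tensor:
        emb_i = self.feature_net(s_i)                # emb_{s_i} in R^n
        emb_j = self.feature_net(s_j)                # emb_{s_j} in R^n
        return self.sigma(self.tau(emb_i - emb_j))   # p < 0 classifies s_i as preferred
```

Because the same feature_net instance processes both inputs, weight sharing between nn_1 and nn_2 is automatic, and the odd activation guarantees the antisymmetry needed for a total quasi-order.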
4.2 Efficient Integration with GBFS

Once we have learned the shared weights of nn_1 and nn_2 as well as the weights w of the ranking computation, we need to integrate them into GBFS. Let D(s, t) ∈ {−1, 0, 1} represent the result of the Direct Ranker for s and t. In order to insert a new state s into the priority queue O, we need to compare s with states already in O, since we have learned a pairwise ranking. However, the amortised computational complexity of insertion is then O(k log |O|), where k is the time needed to evaluate D(s, t) for a single comparison. Since the ranking computation part of Direct Ranker is a single-layer, unbiased neural network, we can transform our learned pairwise ranking into a pointwise representation to improve the insertion efficiency in the priority queue. This transformation is formalised in Lemma 2 and lets us represent the learned D(s, t) as the pointwise ranking r(s) ∈ ℝ. As a result, the priority queue in GBFS can now be sorted by r(s), resulting in an amortised insertion time of O(k + log |O|), where k is the time required to evaluate r(s). With this transformation, only a single neural network evaluation is needed per insertion, instead of log |O| when using the pairwise ranking function D(s, t). Although r(s) is being used as a heuristic in GBFS, it is important to note that it is not a heuristic in the traditional sense; for instance, r(s) can be negative. Moreover, since we are learning an optimal ranking, it is possible that r(t) ≤ r(t') even though h*(t) > h*(t') for states t and t' outside the optimal plan.

Lemma 2. Given a Direct Ranker model with weights w representing the total quasi-order ⪯, we can extract a pointwise ranking function r : S → ℝ representing ⪯.

Proof. Given two states s_i, s_j ∈ S, without loss of generality, we assume s_i ⪯ s_j. Then, from the definition of the Direct Ranker, we have p = σ(w (nn_1(s_i) − nn_2(s_j))) ≤ 0. Since nn_1 and nn_2 are identical and σ is sign-preserving, we have that p ≤ 0 ⟺ w (nn(s_i) − nn(s_j)) ≤ 0. Because the right-hand side is a linear transformation, this is equivalent to w nn(s_i) ≤ w nn(s_j). Defining r(s) as w nn(s), we have that p ≤ 0 ⟺ r(s_i) ≤ r(s_j) and therefore s_i ⪯ s_j ⟺ r(s_i) ≤ r(s_j). ∎
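Under the DirectRanker sketch above, Lemma 2's pointwise conversion and its use as a drop-in "heuristic" for the GBFS sketch of Section 2 amount to the following; the function name and the encode helper (which turns a planner state into the encoder's expected input) are illustrative assumptions.

```python
import torch

def make_pointwise_ranking(ranker: "DirectRanker", encode):
    """Build r(s) = w . nn(s) from a trained DirectRanker (Lemma 2)."""
    def r(state) -> float:
        with torch.no_grad():
            emb = ranker.feature_net(encode(state))  # one embedding per insertion
            return ranker.tau(emb).item()            # scalar score; may be negative
    return r

# rank-based GBFS: the learned ranking replaces h in the earlier gbfs sketch, e.g.
# plan = gbfs(problem, h=make_pointwise_ranking(trained_ranker, encode_state))
```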
4.3 Efficient Training

Another key difference of our framework lies in the training methodology. Unlike when learning a heuristic, where a specific value is required for each state, we train our Direct Ranker D(s_i, s_j) to learn the optimal ranking through classification and then extract the pointwise ranking function r(s). Thus, given a pair of states s_i and s_j, it is sufficient to determine their relative optimal ranking, i.e., whether s_i ⪯ s_j or s_j ⪯ s_i, without assigning specific values to each state. This allows us to use siblings of states in optimal plans in the training process without the need to compute any information about them. Thus, for a problem with a constant branching factor b, an optimal plan π* of size n encodes O(bn) optimal ranking pairs for training. This contrasts with methods that directly learn a heuristic function (e.g., STRIPS-HGN [Shen et al., 2020] and GOOSE [Chen et al., 2024]), which can only extract n pairs of the form (s, h*(s)) from π*. Note that it is computationally expensive to compute h*(s') for any s' ∉ π*, since it requires optimally solving the problem from s'.

Another advantage of our framework is that we can efficiently handle this extra training data by batching the pairs s_i ≺ t for all siblings t of s_i and also for t = s_{i−1}, for each s_i ∈ π* [Damke and Hüllermeier, 2021]. The performance gains come from the fact that the embedding emb_{s_i} is computed only once, as opposed to |N_{s_{i−1}}| times, i.e., once for each pair involving s_i. Formally, let B_{s_i} = {s_{i−1}} ∪ (N_{s_{i−1}} \ {s_i}), i.e., the set consisting of the siblings of s_i and its parent. We compute the embeddings of all states in B_{s_i}, resulting in the matrix E_r ∈ ℝ^{m×n}, where m = |B_{s_i}| and the j-th row of E_r is the embedding vector of the j-th element of B_{s_i}. Next, we compute the embedding emb_{s_i} and stack m copies of it to create the matrix E_l ∈ ℝ^{m×n}. Then we compute p = σ((E_l − E_r) wᵀ) ∈ ℝ^m, where the j-th entry of p is the output of τ comparing s_i and the j-th element of B_{s_i}. Thus, we compute p by generating only |B_{s_i}| + 1 embeddings, as opposed to 2|B_{s_i}| embeddings. This difference is important since computing each embedding in our experiments requires evaluating an expensive neural network architecture.
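A sketch of this batched computation in PyTorch, reusing the hypothetical DirectRanker class from Section 4.1, is shown below; the function name and the assumption that states are already encoded as tensors are ours, and the MSE target value is an illustrative choice since the exact target is an implementation detail not fixed by the text.

```python
import torch

def batched_pair_scores(ranker: "DirectRanker",
                        s_i: torch.Tensor,    # encoded state s_i, shape (1, d_in)
                        batch: torch.Tensor   # encoded states of B_{s_i}, shape (m, d_in)
                        ) -> torch.Tensor:
    """Compute p = sigma((E_l - E_r) w^T) with |B_{s_i}| + 1 embeddings instead of 2|B_{s_i}|."""
    emb_i = ranker.feature_net(s_i)             # shape (1, n), computed once
    E_r = ranker.feature_net(batch)             # shape (m, n), one row per element of B_{s_i}
    E_l = emb_i.expand_as(E_r)                  # m stacked views of emb_{s_i} (no data copy)
    return ranker.sigma(ranker.tau(E_l - E_r))  # shape (m, 1); every entry should be < 0

# during training, an MSE loss pushes every entry towards the "s_i preferred" target, e.g.
# p = batched_pair_scores(ranker, s_i, batch)
# loss = torch.nn.functional.mse_loss(p, torch.full_like(p, -0.5))
```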
5 Related Work

The closest work to ours is [Chrestien et al., 2023], which learns a heuristic function using a pointwise ranking as the target, as opposed to h*. As with other approaches that learn a heuristic function, e.g., STRIPS-HGN [Shen et al., 2020] and GOOSE [Chen et al., 2024], their learned heuristics can be used with A*. However, none of these approaches can guarantee that the solution found is optimal, since the learned heuristic may not be admissible. For this reason, our work focuses on GBFS, which allows us to remove the restriction that the learned ranking must also be a heuristic. What makes the approach by Chrestien et al. (2023) different from all previous works on learning heuristics is that their target is a perfect ranking, a concept they introduced. A perfect ranking orders a pair of states s and t as s ≺ t for all s on the optimal path and all t ∈ O when s is popped from the GBFS queue. It is equivalent to an optimal ranking without transitivity nor the requirement that s_i ≺ s_{i−1} for two consecutive states s_{i−1} and s_i in π*. To see the benefits of using an optimal ranking as the target instead of a perfect ranking, consider a problem with a constant branching factor b and an optimal plan π* = [s_0, s_1, ..., s_n]. The number of ordered pairs s ≺ t encoded in the optimal plan is Σ_{i=1}^{n} ib = b(n² + n)/2. To learn a perfect ranking, Chrestien et al. (2023) use all pairs encoded in the optimal plan with the exception of the pairs s_i ≺ s_j for s_i, s_j ∈ π* and i < j, resulting in (b − 1)(n² + n)/2 total pairs. Alternatively, due to transitivity and the requirement that s_i ≺ s_{i−1} for s_i ∈ π* imposed by our optimal ranking, we can encode all the b(n² + n)/2 pairs using only bn pairs. Therefore, our training data scales linearly instead of quadratically with |π*|, while representing all ordered pairs considered by [Chrestien et al., 2023]. As our experiments will show, this has a large empirical effect on training efficiency and generalisation.

Our approach also differs in how the target ranking is learned. We learn a pairwise ranking function using classification through the Direct Ranker architecture, which is later converted to a pointwise ranking to be efficiently used with GBFS. The heuristic learned by [Chrestien et al., 2023] follows an approach in which a pointwise ranking function is learned using the 0-1 loss for an implicit pairwise ranking function. Their loss function is later relaxed to a logistic loss in order to use gradient optimisation.

Another work based on ranking is [Garrett et al., 2016]. They learn a pairwise ranking by training a Rank Support Vector Machine model [Joachims, 2002] using a convex relaxation of the Kendall rank correlation coefficient. This coefficient measures the degree of similarity between two rankings by computing the difference between the number of concordant and discordant pairs as a proportion of the total number of possible pairs. One main difference between our approaches is that we allow the use of any neural network for computing features, while they use hand-crafted features extracted from the graph implicitly built by the FF heuristic [Hoffmann and Nebel, 2001]. Another key difference is that they use as data only the pairs within a suboptimal plan π, that is, they do not use the states outside of π.

| domain | training size | # problems | testing size | # problems |
|---|---|---|---|---|
| blocksworld | n ∈ [2, 21] | 56 | n ∈ [5, 488] | 90 |
| childsnack | c ∈ [1, 5] | 37 | c ∈ [4, 292] | 90 |
| ferry | c ∈ [1, 13] | 66 | c ∈ [2, 974] | 90 |
| floortile | t ∈ [2, 30] | 64 | t ∈ [12, 980] | 90 |
| miconic | p ∈ [1, 10] | 99 | p ∈ [1, 485] | 90 |
| rovers | r ∈ [1, 4] | 67 | r ∈ [1, 30] | 90 |
| satellite | s ∈ [1, 10] | 90 | n ∈ [3, 99] | 90 |
| sokoban | n ∈ [7, 13] | 99 | b ∈ [8, 99] | 90 |
| spanner | s ∈ [1, 10] | 89 | n ∈ [1, 487] | 90 |
| transport | v ∈ [1, 5] | 47 | n ∈ [3, 50] | 90 |

Table 1: Number of problems per domain and problem size in terms of the most significant object type: n blocks in blocksworld, c children in childsnack, c cars in ferry, etc. The full set of parameters governing problem size can be found in [Segovia and Seipp, 2023].

6 Experiments

In this section, we empirically compare our ranking framework to Chrestien et al. (2023)'s framework using GBFS. For both frameworks, we consider two state-of-the-art neural network regression models, namely STRIPS-HGN [Shen et al., 2020] and GOOSE-LLG [Chen et al., 2024], for internal feature computation. To verify that learning a ranking is beneficial, we consider STRIPS-HGN and GOOSE by themselves, i.e., trained using h* as the learning target. As a baseline, we also consider the hFF heuristic [Hoffmann and Nebel, 2001].

6.1 Dataset

We use the 10 domains from the 2023 International Planning Competition Learning Track (IPC23-LT) [Segovia and Seipp, 2023]. Each domain provides 100 training problems and 90 testing problems. We first solve the training problems using the Scorpion planner [Seipp et al., 2020] with a 30-minute time limit. Only problems solved optimally by Scorpion are used for training, but this still results in a dataset of sufficient size for our experiments. The training problems are randomly split between the training set (90%) and the validation set (10%). The problems in the validation set are used only to monitor convergence and control learning rates.
The testing set is divided into easy, medium and hard subsets per domain, based on problem size, with 30 problems per subset. In contrast, the training set only contains easy problems. Details of the dataset are shown in Table 1.

6.2 Training Settings

The two graph neural network regression models we use are GOOSE Lifted Learning Graph (LLG) [Chen et al., 2024] and STRIPS-HGN [Shen et al., 2020]. The former encodes the full planning problem and domain in a lifted form, whereas the latter encodes the problem's delete relaxation in a grounded form. Both models can learn domain-independent heuristics and our framework inherits this property; however, our experiments are restricted to domain-specific rankings. For both regression models, we used the code provided by their authors with all ReLU activations replaced by Leaky ReLUs, except in their last layer. The ReLU function is suitable for learning heuristics because its values are non-negative; however, this restriction is not needed for learning a ranking, and we observed that it results in slower convergence. For consistency, the same model parameters are used across our experiments: 4 message-passing layers and hidden dimension m = 64. For our framework, we use the output of the second-last hidden layer of the underlying regression model as the embedding vector, thus emb_s ∈ ℝ⁶⁴.

| domain | hFF | GOOSE e/m/h/sum | PerfRank(GOOSE) e/m/h/sum | OptRank(GOOSE) e/m/h/sum | HGN e/m/h/sum | PerfRank(HGN) e/m/h/sum | OptRank(HGN) e/m/h/sum |
|---|---|---|---|---|---|---|---|
| blocksworld | 28 | 30/0/0/30 | 30/5/0/35 | 30/20/1/51 | 22/0/0/22 | -/-/-/- | 23/0/0/23 |
| childsnack | 26 | 19/0/0/19 | 21/4/0/25 | 26/4/0/30 | 6/0/0/6 | 13/0/0/13 | 25/0/0/25 |
| ferry | 68 | 30/30/1/61 | 30/30/2/62 | 30/30/1/61 | 26/2/0/28 | -/-/-/- | 30/8/0/38 |
| floortile | 12 | 1/0/0/1 | 2/0/0/2 | 1/0/0/1 | 0/0/0/0 | 1/0/0/1 | 1/0/0/1 |
| miconic | 90 | 30/30/15/75 | 30/30/16/76 | 30/30/15/75 | 30/30/0/60 | 24/0/0/24 | 30/27/1/58 |
| rovers | 34 | 29/2/0/31 | 28/1/0/29 | 30/1/0/31 | 25/0/0/25 | 10/0/0/10 | 28/0/0/28 |
| satellite | 65 | 27/2/0/29 | 27/0/0/27 | 29/4/0/33 | 13/0/0/13 | 6/0/0/6 | 30/1/0/31 |
| sokoban | 36 | 26/1/0/27 | 30/2/0/32 | 30/2/0/32 | 27/0/0/27 | 17/0/0/17 | 30/1/0/31 |
| spanner | 30 | 30/4/0/34 | 30/30/2/62 | 30/30/1/61 | 30/0/0/30 | 30/4/0/34 | 30/14/0/44 |
| transport | 41 | 30/8/0/38 | 30/9/0/39 | 30/16/0/46 | 22/0/0/22 | 6/0/0/6 | 28/0/0/28 |
| sum | 430 | 252/77/16/345 | 258/111/20/389 | 266/137/18/421 | 201/32/0/233 | 107/4/0/111 | 254/52/1/307 |

Table 2: Rounded average coverage per domain and problem difficulty (easy/medium/hard/sum) for the different configurations.

For each domain, we compare GBFS using the following configurations for the heuristic/ranking:

- GOOSE and HGN. The original GOOSE-LLG and STRIPS-HGN regression models trained using h* as target and their original loss function. The training dataset consists of all state-value pairs (s, h*(s)) for s in the optimal plan.
- OptRank(GOOSE) and OptRank(HGN). Our optimal ranking framework combined with the respective regression models. The training dataset is the set of all ordered pairs s ≺ t for s in the optimal plan and t ∈ (N_{s'} ∪ {s'}) \ {s}, where s' precedes s in the optimal plan.
- PerfRank(GOOSE) and PerfRank(HGN). Chrestien et al. (2023)'s framework to learn a perfect ranking heuristic, combined with each model. The loss function associated with the perfect ranking definition is also used. The training dataset consists of all ordered pairs s ≺ t for s in the optimal plan and t ∈ O when s is popped from the GBFS queue.

We use the Adam optimiser and an initial learning rate of 10⁻³. The learning rate is reduced by a factor of 10 if the accuracy on the validation set does not improve for 10 consecutive epochs. Training is stopped when the learning rate reaches 10⁻⁶ or after 500 epochs.
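As a rough illustration of this schedule (not the authors' training code), a PyTorch loop over the pair batches of Section 4.3 could look as follows; it reuses the batched_pair_scores sketch from that section, and "validation accuracy" is taken here to mean the fraction of validation pairs ranked in the correct order, which is our own assumption.

```python
import torch

def train_ranker(ranker, train_batches, val_batches, target=-0.5):
    """Sketch of the schedule: Adam at 1e-3, LR divided by 10 after 10 epochs
    without validation improvement, stop at LR 1e-6 or after 500 epochs."""
    opt = torch.optim.Adam(ranker.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max",
                                                       factor=0.1, patience=10)
    for epoch in range(500):
        ranker.train()
        for s_i, batch in train_batches:              # pairs grouped as in Section 4.3
            p = batched_pair_scores(ranker, s_i, batch)
            loss = torch.nn.functional.mse_loss(p, torch.full_like(p, target))
            opt.zero_grad()
            loss.backward()
            opt.step()

        ranker.eval()
        with torch.no_grad():
            correct = total = 0
            for s_i, batch in val_batches:
                p = batched_pair_scores(ranker, s_i, batch)
                correct += (p < 0).sum().item()       # s_i ranked before parent/siblings
                total += p.numel()
        sched.step(correct / max(total, 1))           # validation ranking accuracy

        if opt.param_groups[0]["lr"] < 1.05e-6:       # stop once the LR has decayed to 1e-6
            break
```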
For each configuration, we solve the problems in the testing set with Fast Downward using eager GBFS [Helmert, 2006], a time limit of 30 minutes per problem, and 8 GB of memory. All experiments are run on an Intel Xeon 2.1 GHz CPU and an NVIDIA A6000 GPU with 64 GB of memory. The experiments were repeated three times and the average results are reported. The source code can be found at [Hao et al., 2024a] and an extended set of results in [Hao et al., 2024b].

6.3 Results

Total Coverage. Table 2 shows the coverage of each configuration on each domain, i.e., the number of problems it solves for this domain. Overall, using optimal rankings improved the coverage of the underlying regression model: a 22% and 32% increase for GOOSE and STRIPS-HGN, respectively. Our optimal ranking approach also obtained higher total coverage than the perfect ranking one: an 8% and 176% increase for GOOSE and STRIPS-HGN, respectively. The large increase in coverage obtained by OptRank(HGN) w.r.t. PerfRank(HGN) can be partially attributed to: (i) STRIPS-HGN being a slower model to train; and (ii) PerfRank(HGN) using quadratically more data than OptRank(HGN). The convergence time for all learned configurations is shown in Figure 3, and we observe that, for any framework, using STRIPS-HGN to generate features is slower than using GOOSE. The impact of using STRIPS-HGN is less pronounced on OptRank(HGN) due to its compact dataset that exploits transitivity and is nearly equivalent to that of PerfRank(HGN). Moreover, despite processing a factor of O(b) more data than GOOSE, where b is the average branching factor of a domain, OptRank(GOOSE) converges in a similar amount of time. Hence, it processes training data much faster.

[Figure 3: Convergence times (sec) grouped by configuration. Each box shows the training time of all 10 domains and 3 learned models for the configuration. Models whose training failed to converge are excluded.]

Domain-specific coverage. When considering individual domains, OptRank(GOOSE) obtains the highest coverage in 3 domains, namely Blocksworld, Childsnack and Transport, with 46% (16 problems), 15% (4 problems) and 12% (5 problems) more coverage than the second-best planner, respectively. PerfRank(GOOSE) has the highest coverage only on Spanner, where it solves one more problem (2%) than OptRank(GOOSE). In Floortile, all configurations perform very poorly. This domain is known to be hard because it requires a large receptive field and has many state-space symmetries.

GOOSE configurations. OptRank(GOOSE) improves GOOSE's coverage in 6 domains and obtains the same coverage in the other 4 domains; therefore, using optimal rankings is highly beneficial for this regression model. For instance, in Blocksworld and Spanner, GOOSE solves none or hardly any of the medium-difficulty problems, whereas OptRank(GOOSE) solves a large fraction or all of them, demonstrating a greater ability to generalise to larger problem sizes. PerfRank(GOOSE) almost always improves the coverage of GOOSE, and its coverage is marginally smaller (2 problems, representing about 9%) only in Rovers and Satellite.
OptRank(GOOSE) has better coverage than PerfRank(GOOSE) in 5 domains, namely Blocksworld, Childsnack, Rovers, Satellite and Transport, with 46% (16 problems), 20% (5 problems), 7% (2 problems), 22% (6 problems) and 18% (7 problems) more coverage, respectively. Their coverage is the same for Sokoban, and PerfRank(GOOSE) solves one more problem than OptRank(GOOSE) for Ferry, Floortile, Miconic and Spanner. To better understand the differences between OptRank(GOOSE) and both GOOSE and PerfRank(GOOSE), Figure 4 compares them regarding the number of nodes expanded, the cost of the returned plan, and the search time to find a plan. We observe that OptRank(GOOSE) generally expands fewer nodes than GOOSE while achieving similar plan quality. This means OptRank(GOOSE) spends less time exploring unpromising regions of the search space, and shows that our learned ranking generalises better than the learned GOOSE heuristic to unseen problems. Although our framework is guaranteed to learn a total quasi-order, there is no guarantee that the learned quasi-order is an optimal ranking. This can be observed in the plan quality plots, where other approaches find cheaper plans than OptRank(GOOSE) for some problems. We also observe that PerfRank(GOOSE) generally expands more nodes than OptRank(GOOSE) and provides plans of slightly lesser quality.

STRIPS-HGN configurations. Considering only configurations based on STRIPS-HGN, OptRank(HGN) has the highest coverage in all domains except Miconic, where it solves two fewer problems than HGN. As with GOOSE, the usage of optimal rankings is also highly beneficial for this regression model. Surprisingly, PerfRank(HGN)'s coverage is lower than HGN's except on Childsnack, Floortile and Spanner. Moreover, PerfRank(HGN) did not converge for Blocksworld and Ferry. Note that our experiments use the original version of STRIPS-HGN [Shen et al., 2020], whereas Chrestien et al. (2023) use an improved version of STRIPS-HGN. This and the different benchmark problems for each domain could explain part of the discrepancy.

Comparison with hFF. Finally, none of the learned models were able to surpass the coverage of hFF. However, OptRank(GOOSE) is not far off, with just 9 fewer problems solved. This is an improvement over the competition results, where the deep learning entries attempting to directly learn a heuristic or a policy did not exceed the score obtained by HGN in our experiments. Moreover, OptRank(GOOSE) generally expands fewer nodes and finds plans of better quality than hFF. Hence, it is the inference time of GNN models that remains an impediment to their competitiveness with special-purpose planning heuristics.

[Figure 4: Comparison between OptRank(GOOSE) and GOOSE (first column), PerfRank(GOOSE) (second column), and hFF (third column) w.r.t. the number of nodes expanded (first row), plan cost (second row) and search time (third row). Problems only solved by one of the planners are mapped to the plot's boundaries. Points in the lower-triangle parts favour OptRank(GOOSE).]

7 Conclusion

Heuristic search algorithms conventionally assume heuristic functions mapping states to goal-distance estimates.
Yet, as previously observed, for the popular greedy best-first search algorithm the heuristic function merely serves to rank states, so that learning precise goal-distance estimates is overkill. As we show here, to find an optimal plan with minimal search effort we only require an optimal ranking, which relates the states on an optimal plan to their neighbours. While this observation has yet to be used to design model-based heuristics, as we show here it is highly relevant for learning, and enables us to obtain much better performance than heuristic functions learned using the same neural network representations, and even than a closely related ranking-learning framework. It remains of course a bit disappointing that, at least on the benchmarks used here, the classic hFF heuristic still performs best overall. This may change, though, with further research into representations and training algorithms for learning rankings. Apart from this, our results raise interesting questions regarding learning search guidance for other settings and algorithms, for instance: Can our approach be extended to non-unit-cost problems? Are there other algorithms (e.g., hill-climbing or beam search) in which the heuristic function can be replaced with a simpler function that can be efficiently learned? What about variants of greedy best-first search for other forms of planning, such as top-K planning?

Acknowledgements

We thank the reviewers for their comments, which helped improve the paper. This work was supported by Australian Research Council grant DP220103815, by the Artificial and Natural Intelligence Toulouse Institute (ANITI) under grant agreement ANR-19-PI3A-0004, and by the European Union's Horizon Europe Research and Innovation programme under grant agreement TUPLES No. 101070149.

References

[Arfaee et al., 2011] Shahab Jabbari Arfaee, Sandra Zilles, and Robert C. Holte. Learning Heuristic Functions for Large State Spaces. Artificial Intelligence, 175(16-17):2075-2098, 2011.

[Bonet et al., 1997] Blai Bonet, Gábor Loerincs, and Héctor Geffner. A Robust and Fast Action Selection Mechanism for Planning. In Proc. National Conference on Artificial Intelligence (AAAI), pages 714-719, 1997.

[Burges et al., 2005] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to Rank Using Gradient Descent. In Proc. International Conference on Machine Learning (ICML), pages 89-96, 2005.

[Burges, 2010] Chris J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report MSR-TR-2010-82, Microsoft Research, June 2010.

[Chen et al., 2024] Dillon Z. Chen, Sylvie Thiébaux, and Felipe Trevizan. Learning Domain-Independent Heuristics for Grounded and Lifted Planning. Proc. AAAI Conference on Artificial Intelligence (AAAI), 38(18):20078-20086, March 2024.

[Chrestien et al., 2023] Leah Chrestien, Stefan Edelkamp, Antonin Komenda, and Tomas Pevny. Optimize Planning Heuristics to Rank, Not to Estimate Cost-to-Goal. In Proc. Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 25508-25527, 2023.

[Cohen and Beck, 2018] Eldan Cohen and J. Christopher Beck. Local Minima, Heavy Tails, and Search Effort for GBFS. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pages 4708-4714, 2018.

[Damke and Hüllermeier, 2021] Clemens Damke and Eyke Hüllermeier. Ranking Structured Objects with Graph Neural Networks. In Proc. International Conference on Discovery Science (DS), pages 166-180, 2021.
[Ferber et al., 2020] Patrick Ferber, Malte Helmert, and Jörg Hoffmann. Neural Network Heuristics for Classical Planning: A Study of Hyperparameter Space. In Proc. European Conference on Artificial Intelligence (ECAI), pages 2346-2353, 2020.

[Ferber et al., 2022] Patrick Ferber, Florian Geißer, Felipe Trevizan, Malte Helmert, and Jörg Hoffmann. Neural Network Heuristic Functions for Classical Planning: Bootstrapping and Comparison to Other Methods. Proc. International Conference on Automated Planning and Scheduling (ICAPS), 32(1):583-587, June 2022.

[Freund et al., 2003] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research, 4:933-969, 2003.

[Garrett et al., 2016] Caelan Reed Garrett, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Learning to Rank for Synthesizing Planning Heuristics. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pages 3089-3095, 2016.

[Hao et al., 2024a] Mingyu Hao, Sylvie Thiébaux, and Felipe Trevizan. Source Code for "Guiding GBFS Through Learned Pairwise Rankings". https://doi.org/10.5281/zenodo.11107790, 2024.

[Hao et al., 2024b] Mingyu Hao, Felipe Trevizan, Sylvie Thiébaux, Patrick Ferber, and Jörg Hoffmann. Learned Pairwise Rankings for Greedy Best-First Search. In Proc. ICAPS Workshop on Reliable Data-Driven Planning and Scheduling, 2024.

[Heller et al., 2022] Daniel Heller, Patrick Ferber, Julian Bitterwolf, Matthias Hein, and Jörg Hoffmann. Neural Network Heuristic Functions: Taking Confidence Into Account. In Proc. International Symposium on Combinatorial Search (SOCS), pages 223-228, 2022.

[Helmert, 2006] Malte Helmert. The Fast Downward Planning System. Journal of Artificial Intelligence Research, 26:191-246, 2006.

[Herbrich et al., 1999] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large Margin Rank Boundaries for Ordinal Regression. In Advances in Large Margin Classifiers. The MIT Press, 1999.

[Hoffmann and Nebel, 2001] Jörg Hoffmann and Bernhard Nebel. The FF Planning System: Fast Plan Generation Through Heuristic Search. Journal of Artificial Intelligence Research, 14:253-302, 2001.

[Joachims, 2002] Thorsten Joachims. Optimizing Search Engines Using Clickthrough Data. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 133-142, 2002.

[Karia and Srivastava, 2021] Rushang Karia and Siddharth Srivastava. Learning Generalized Relational Heuristic Networks for Model-Agnostic Planning. In Proc. AAAI Conference on Artificial Intelligence (AAAI), pages 8064-8073, 2021.

[Köppel et al., 2020a] Marius Köppel, Alexander Segner, Martin Wagener, Lukas Pensel, Andreas Karwath, and Stefan Kramer. Code and Supplementary Material to the Paper: Pairwise Learning to Rank by Neural Networks Revisited: Reconstruction, Theoretical Analysis and Practical Performance. https://doi.org/10.5281/zenodo.3700819, 2020.

[Köppel et al., 2020b] Marius Köppel, Alexander Segner, Martin Wagener, Lukas Pensel, Andreas Karwath, and Stefan Kramer. Pairwise Learning to Rank by Neural Networks Revisited: Reconstruction, Theoretical Analysis and Practical Performance. In Proc. European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 237-252, 2020.

[Li, 2022] Hang Li. Learning to Rank for Information Retrieval and Natural Language Processing. Springer, 2022.
[Liu et al., 2009] Tie-Yan Liu et al. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval, 3(3):225-331, 2009.

[Richter and Westphal, 2010] Silvia Richter and Matthias Westphal. The LAMA Planner: Guiding Cost-Based Anytime Planning with Landmarks. Journal of Artificial Intelligence Research, 39:127-177, 2010.

[Röger and Helmert, 2010] Gabriele Röger and Malte Helmert. The More, the Merrier: Combining Heuristic Estimators for Satisficing Planning. In Proc. International Conference on Automated Planning and Scheduling (ICAPS), pages 246-249, 2010.

[Segovia and Seipp, 2023] Javier Segovia and Jendrik Seipp. Benchmarking Repository of IPC 2023 - Learning Track. https://github.com/ipc2023-learning/benchmarks, 2023.

[Seipp et al., 2020] Jendrik Seipp, Thomas Keller, and Malte Helmert. Saturated Cost Partitioning for Optimal Classical Planning. Journal of Artificial Intelligence Research, 67:129-167, 2020.

[Shashua and Levin, 2002] Amnon Shashua and Anat Levin. Ranking with Large Margin Principle: Two Approaches. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 937-944, 2002.

[Shen et al., 2020] William Shen, Felipe Trevizan, and Sylvie Thiébaux. Learning Domain-Independent Planning Heuristics with Hypergraph Networks. In Proc. International Conference on Automated Planning and Scheduling (ICAPS), volume 30, pages 574-584, 2020.

[Ståhlberg et al., 2022] Simon Ståhlberg, Blai Bonet, and Hector Geffner. Learning General Optimal Policies with Graph Neural Networks: Expressive Power, Transparency, and Limits. In Proc. International Conference on Automated Planning and Scheduling (ICAPS), volume 32, pages 629-637, 2022.

[Wilt and Ruml, 2016] Christopher Makoto Wilt and Wheeler Ruml. Effective Heuristics for Suboptimal Best-First Search. Journal of Artificial Intelligence Research, 57:273-306, 2016.

[Wu et al., 2010] Qiang Wu, Christopher J.C. Burges, Krysta M. Svore, and Jianfeng Gao. Adapting Boosting for Information Retrieval Measures. Information Retrieval, 13:254-270, 2010.