# Envelope-Based Approaches to Real-Time Heuristic Search

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Kevin C. Gall, Bence Cserna, Wheeler Ruml
Department of Computer Science, University of New Hampshire, USA
kcg245@gmail.com, {bence, ruml}@cs.unh.edu

In real-time heuristic search, the planner must return the next action for the agent within a pre-specified time bound. Many algorithms for this setting are agent-centered in that, at every iteration, they only expand states near the agent's current state, discarding the search frontier afterwards. In this paper, we investigate the alternative paradigm in which the search expands a single ever-growing envelope of states. Previous work on envelope-based methods restricts the agent to move along the generated search tree. We propose a more flexible approach in which an auxiliary search is performed within the envelope to guide the agent toward a promising frontier node. Experimental results indicate that intra-envelope search is beneficial in state spaces that are highly interconnected, such as those for grid pathfinding.

## Introduction

Interactive systems, such as robots or user interfaces, often must be controlled in real time. In this paper, we address real-time planning, where the planner must return the next action for the system to take within a specified wall-clock time bound (Korf 1990). The traditional approach to real-time planning is agent-centered, meaning a portion of the state space immediately surrounding the agent, the Local Search Space (LSS), is expanded to some bounded lookahead (Koenig 2001). For example, in the popular LSS-LRTA* algorithm (Koenig and Sun 2009), the agent generates the LSS using A* and then moves toward the frontier node of lowest f. This paradigm has a significant drawback: the agent can easily fall into local minima, also known as heuristic depressions, in which the h values are misleadingly low, distracting the agent from paths leading toward a goal. To guarantee completeness in domains with local minima larger than the lookahead, the heuristic values of nodes within the LSS are raised (Koenig and Likhachev 2006). However, multiple visits to the same state may be required before the values are high enough, a process known as scrubbing (Sturtevant and Bulitko 2016). Many attempts have been made at improving the ability of LSS-style algorithms to escape heuristic depressions, including analysis of heuristic error and dynamically changing lookahead (Hernández and Baier 2012; Kiesel, Burns, and Ruml 2015). However, for any lookahead size achieved with finite resources, one can always imagine a domain with local minima larger than the achieved lookahead.

Time-Bounded A* (TBA*) (Björnsson, Bulitko, and Sturtevant 2009) represents an alternative to agent-centered methods and is the first example of what we call Envelope-based Search (ES) for real-time search problems. TBA* iteratively expands a single A* search tree throughout the life of the algorithm and periodically commits the agent toward the best node on the A* frontier by tracing and following the pointers of the tree. Because A* with a consistent heuristic will expand a state at most once, scrubbing is eliminated.
TBA* is able to escape heuristic depressions by directing the agent toward a more promising branch of the tree without spending computation time updating heuristic values. However, the algorithm also has drawbacks stemming from the fact that its search is rooted at the agent's start state, an increasingly irrelevant state with respect to the agent as it transitions away. In the worst case, when a new target node is selected, the agent will backtrack all the way to the start state before reaching the new branch. Variants of TBA* that use Weighted A* and Greedy Best-First Search to expand the search graph (Hernández, Baier, and Asín 2016) alleviate this problem by biasing toward nodes with low h, discouraging the search from finding new paths that deviate from the tree near the root (reminiscent of the low-h bias described by Richter, Thayer, and Ruml (2010)). However, even with this enhancement, the agent's path through the constructed search graph is still inflexible, since the graph is not typically rewired to exploit shortcuts between the agent's current branch and a new target branch.

We propose a new ES algorithm called Intra-Envelope Search (I-ES) that seeks to address the flaws in TBA* by splitting time between expanding nodes on the envelope frontier and searching through the expanded envelope to connect the agent with a target node. This target is chosen periodically as the current best node on the envelope frontier. We show that I-ES is complete in undirected and directed domains under reasonable assumptions. Experimental results suggest that the new method has superior performance in domains in which it is easy to find alternative paths between states, such as grid pathfinding.

## Preliminaries

A heuristic search problem $\langle S, E, Goal, h, s_{root} \rangle$ consists of a set of states $S$, a set of edges $E$ labeled with positive costs between states, a predicate $Goal(s)$ that returns true if $s$ is a goal, a heuristic function $h(s, s')$ that estimates the cost $c^*(s, s')$ of an optimal path between $s$ and $s'$, where $h(s)$ denotes $\min_{\{s' \mid Goal(s')\}} h(s, s')$, and the agent's start state $s_{root} \in S$. All heuristics used in this paper are admissible, meaning that they do not overestimate $c^*$. We denote by $g(s)$ the cost of the cheapest known path to reach $s$ from $s_{root}$, and $f(s) = g(s) + h(s)$. OPEN holds all nodes that have been generated but not expanded, and $f_{min}$ is the minimum $f$ value on OPEN. The successors of state $s$, denoted $s_{succ}$, are those $s'$ such that $(s, s') \in E$ and $c(s, s') < \infty$. Similarly, $s_{pred}$ denotes the predecessors.
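To make these definitions concrete, here is a minimal sketch of the problem interface assumed throughout the paper. This is our own illustration; the class and method names (SearchProblem, successors, is_goal) are hypothetical, not from the authors' implementation.

```python
from typing import Iterable, Tuple, TypeVar

State = TypeVar("State")

class SearchProblem:
    """Hypothetical interface for <S, E, Goal, h, s_root>."""

    def successors(self, s: State) -> Iterable[Tuple[State, float]]:
        """Yield (s', c(s, s')) for each edge (s, s') in E; costs are positive."""
        raise NotImplementedError

    def predecessors(self, s: State) -> Iterable[Tuple[State, float]]:
        """Yield (s'', c(s'', s)); identical to successors in undirected domains."""
        raise NotImplementedError

    def is_goal(self, s: State) -> bool:
        raise NotImplementedError

    def h(self, s: State, t: State = None) -> float:
        """Admissible estimate of c*(s, t), or of the cost to the nearest goal if t is None."""
        raise NotImplementedError

def f(problem: SearchProblem, g: dict, s) -> float:
    # f(s) = g(s) + h(s): cheapest known cost to reach s plus estimated cost-to-go
    return g[s] + problem.h(s)
```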
## Envelope-Based Search

Envelope Search (ES) offers an alternative to the LSS paradigm of limited lookahead followed by heuristic updates. An ES algorithm maintains a search envelope containing every node expanded throughout the life of the algorithm. ES algorithms direct the agent toward the best node discovered during the entire search so far, not simply the best node in the agent's immediate vicinity. Heuristic depressions are much more easily avoided by ES algorithms because minima can be fully explored over multiple iterations and thereafter never considered for the agent's target. In this way, they can reason about the entire portion of the state space generated so far, avoiding the obvious blunders that plague agent-centered algorithms.

The problem of discovering and extracting a path from the agent to some arbitrary target node is non-trivial. Agent-centered search in the absence of dead ends typically guarantees that a path will be available and easily extracted for the agent within the time bound, by virtue of the fact that no nodes are considered which are farther away from the agent than the bound. In Envelope Search, however, the agent may be arbitrarily far away from the node set as its target. ES algorithms must be able to incrementally construct paths connecting the agent to the target so that the time constraint is not violated. TBA* and I-ES represent alternative approaches to this problem.

### Time-Bounded A*

Time-Bounded A* (TBA*) (Björnsson, Bulitko, and Sturtevant 2009) is an Envelope Search algorithm that maintains a single A* search tree rooted at s_root throughout the life of the algorithm. The tree is constructed incrementally exactly as in offline A* by expanding nodes until the constant time bound is reached. TBA* periodically commits the agent to the path leading to the current-best node on the frontier with respect to f_min. This is achieved by tracing back paths from target nodes via the parent pointers in the A* search tree. TBA* inherits from A* completeness in solvable infinite undirected domains. Though the algorithm can make no guarantees about overall solution quality, since it is real-time, in practice the underlying optimality of the A* search tree tends to create high-quality solutions.

```
Algorithm 1: TBA*
    input: bound, tracebackLimit
 1  OPEN ← {root}
 3  while loc ≠ goalState do
 4      A* from OPEN until bound is exhausted
 5      if goal node not traced then
 6          newTarget ← TracePath(tracebackLimit)
            if newTarget is not null then
 7              agentTarget ← newTarget
 8      if loc is on path to agentTarget then
 9          loc ← loc.next
        else
11          loc ← loc.parent
```

Pseudocode for TBA* appears in Algorithm 1. TBA* begins precisely as A*: seeding the open list with the root state [line 1] and expanding nodes sorted on f [line 4]. The A* search is paused when bound is reached. Then, the algorithm performs its Traceback phase [line 6]. The traceback function takes tracebackLimit as a parameter, which places a constant bound on the amount of time the algorithm can spend tracing a path. If no trace-in-progress is cached, TBA* selects the best node from the open list and begins extracting a path by following parent pointers from the search tree. The trace terminates on any of three criteria: the trace reaches the root state, the trace reaches the agent's location, or the tracebackLimit is reached. In the last scenario, the trace-in-progress is cached so that it can be resumed during the next iteration, and null is returned. When a trace terminates at the root or the agent's location, the node at the end of the path is returned as newTarget and is set as the agent's target [line 6]. The agent moves based on a simple decision: if it is on the path to its current target, continue on that path. If not, move to the previous state on its current path. Both the number of nodes expanded by A* and the number of nodes traced in the Traceback phase are bounded by a constant, so an iteration guarantees constant time complexity, i.e., O(bound + tracebackLimit).
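The bounded Traceback phase can be sketched as follows. This is our reconstruction from the description above, not the authors' code; `Node` is assumed to be a hypothetical search-tree node with mutable `parent` and `next` pointers, and the trace-in-progress is cached as a (target, node) pair.

```python
def trace_path(open_best, agent_loc, root, traceback_limit, trace):
    """One bounded Traceback phase. `trace` is None or a (target, node) pair
    recording a paused trace. Returns (new_target, trace)."""
    target, node = trace if trace is not None else (open_best, open_best)
    for _ in range(traceback_limit):
        if node is agent_loc or node is root:
            return target, None        # trace complete: target becomes agentTarget
        node.parent.next = node        # forward pointer so the agent can walk the path
        node = node.parent
    return None, (target, node)        # limit reached: cache the trace-in-progress
```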
Björnsson, Bulitko, and Sturtevant (2009) describe some simple optimizations to the basic algorithm. The threshold optimization prevents the agent from choosing a new path unless the g value of newTarget is at least as high as the g value of agentTarget. This prevents the agent from switching rapidly between paths that are roughly as good. This optimization was shown to produce 2% better paths in certain circumstances, and so we include it in our implementation. A second technique, shortcutting, will be discussed below.

### Variants: Weighted, Greedy, and Restarting

The original TBA* algorithm, while effective compared to agent-centered competitors, is flawed due to its reliance on g as a function of s_root. Hernández, Asín, and Baier (2014) introduce two variants: Time-Bounded Weighted A* (TB(WA*)) and Time-Bounded Best-First Search (TB(BFS)). The modification is simple but powerful: instead of an A* search, the underlying search tree is constructed using Weighted A* or Greedy Best-First Search, respectively. After all, there is no reason that the agent must necessarily find an optimal path from the root to a goal, as its actual trajectory is likely to be suboptimal.

TB(WA*) reduces the agent's dependence on s_root by using the weighted A* equation $f(s) = g(s) + w \cdot h(s)$, where $w > 1.0$. This means that the algorithm is not restricted to finding the optimal path, but rather any path within the suboptimality bound w. As a result, increasing the weight can often lead to lower-cost total solutions, since the agent is not forced to backtrack toward an optimal branch of the tree when the goal is closer to the agent's current location.

TB(BFS) divorces the search from the concept of g entirely and instead conducts a Greedy Best-First Search ordered on h. This allows the agent to follow the heuristic greedily. The search tree is still rooted at s_root, but when the tree is long and skinny, as it would be with a heuristic accurate enough to be consistently decreasing, the agent rarely has to backtrack.

Restarting Time-Bounded Weighted A* (TBR(WA*)) (Hernández, Baier, and Asín 2016) is a variant that is complete on directed search graphs (i.e., graphs with non-reversible edges). In this variant, when the agent attempts to backtrack but no edge to the previous state exists, the search graph is discarded. The algorithm then begins expanding a new search graph rooted at the agent's current state. In order to achieve completeness, the heuristic values of all nodes in the discarded search tree are updated using a Dijkstra backup in exactly the same manner as LSS-LRTA*.
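The three variants differ only in the open-list priority used to grow the envelope. A compact way to view this (our own sketch, with hypothetical function names; lower values are expanded first):

```python
def tba_priority(g: float, h: float) -> float:
    return g + h              # TBA*: f = g + h (optimal ordering)

def tbwa_priority(g: float, h: float, w: float = 3.0) -> float:
    return g + w * h          # TB(WA*): f = g + w*h with w > 1

def tbbfs_priority(g: float, h: float) -> float:
    return h                  # TB(BFS): greedy best-first, g is ignored
```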
### Shortcuts

Even with greedy envelope expansion, the problem remains that the agent's behavior is tied to the search tree anchored at s_root, a state which is irrelevant after the first iteration. (In fact, the low-h bias of greedy expansion can even worsen the problem.) Figure 1 presents examples illustrating this poor behavior. Each case is an instance of grid pathfinding with four-way movement and the Manhattan distance heuristic. The start location is marked by @ and the goal by *. As we will explain, the red solid lines approximate the trajectory of TBA* up to a certain time. The green dotted lines represent the backtracking path that TBA* follows when it discovers another path to the goal. The blue dotted lines represent the shortcut path through the state space that an agent could have taken from the end of the red path to reach the goal more quickly.

[Figure 1: Challenges for TB(WA*) (left) and TB(BFS) (right). Red solid lines approximate the trajectory of TBA* to a certain time, green dotted lines represent the backtracking path TBA* takes when a new branch leads to a goal, and blue dotted lines represent a shortcut from the red path to the goal.]

The left panel shows a troublesome instance for TB(WA*). Because h is Manhattan distance and increases as we move away from the start, the search will begin by alternately expanding the paths to the left and right until either the bottom-left (n_L) or bottom-right (n_R) corner node is expanded. Observe that $h(n_L) = h(n_R) = a + b$. Call this value $h_b$, so $f(n_L) = f(n_R) = b + w h_b$. Assume without loss of generality that TB(WA*) breaks the tie in favor of n_L. For the $i$th successor of n_L such that $i < a$, observe that $h(n_i) = (a - i) + b$. Therefore, $f(n_i) = b + i + w h(n_i) = b + i + w(a + b) - wi = b + w h_b + i(1 - w)$, so $f(n_i) < f(n_R)$ if $w > 1$. Therefore, all nodes in the left corridor will be expanded before expanding n_R. Assuming that a and b are large relative to the lookahead size, the agent will move along the red path toward n_U. Then, let $d_i = dx_i + dy_i$ be the distance of a node $l_i$ past n_U, so $h(l_i) = b - dx_i + dy_i$ and $f(l_i) = b + a + d_i + w h(l_i)$. TB(WA*) will continue to expand along the red path as long as:

$$
\begin{aligned}
f(l_i) &< f(n_R) \\
b + a + d_i + w\,h(l_i) &< b + w\,h_b \\
w\,(h(l_i) - h_b) &< -(a + d_i) \\
w\,(h_b - h(l_i)) &> a + d_i \\
w &> \frac{a + d_i}{h_b - h(l_i)} \\
w &> \frac{a + dx_i + dy_i}{a + dx_i - dy_i}
\end{aligned}
$$

which can be made true for any $w > 1$ by increasing a. But as $dy_i$ increases relative to a, there will come a point where w is no longer large enough and n_R is expanded. In the figure, this is where the red path ends. Since n_R then becomes the best node on OPEN, it or its successors will become the agent's target. Because TB(WA*) moves the agent only along the search tree, the agent will be forced to backtrack to s_root, instead of discovering the shortcut represented by the blue dotted arrow.
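As a concrete check of the final inequality (our own worked example, not from the paper), the following snippet computes the critical weight $(a + dx_i + dy_i)/(a + dx_i - dy_i)$ for increasing $dy_i$; once the ratio reaches the fixed weight w, TB(WA*) stops extending the red path and expands n_R:

```python
def critical_weight(a: int, dx: int, dy: int) -> float:
    # Expansion past n_U continues only while w exceeds this ratio.
    return (a + dx + dy) / (a + dx - dy)

w, a = 3.0, 10
for dy in range(0, 6):
    ratio = critical_weight(a, dx=0, dy=dy)
    status = "keep expanding" if w > ratio else "expand n_R instead"
    print(f"dy={dy}: critical w = {ratio:.2f} -> {status}")
# With w = 3.0 and a = 10, the red path is abandoned once dy reaches 5.
```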
The right panel shows a troublesome instance for TB(BFS). Let l be the lookahead. Observe that $h(n_1) = h(n_2) = a + b$. Without loss of generality, assume the search breaks ties in favor of moving south, so n_1 is expanded first and the agent falls into the local minimum, where all nodes besides n_1 have $h < a + b$. Because the search order is greedy, nodes along the red path will be expanded until n_3 is reached, at which point the search will fill the room by diagonals. Once the room is filled, TB(BFS) will expand n_2 and backtrack the agent along the path it came in by. This will be longer than a shortcut (such as the blue path) if the agent has already passed n_3 by the time the room has filled. This is true when the number of actions the agent has taken, $\frac{ab}{l}$, is greater than $a + b$, which happens when $l = 1$ and $a = b = 3$, or when $l > 1$ and $a = b = l^2$.

The original TBA* paper proposes a shortcut optimization. When a goal is traced but the agent is not on the goal branch, a new search is conducted every iteration from the entire goal path toward the agent. If the agent is found, the path is extracted and followed (Björnsson, Bulitko, and Sturtevant 2009). We implemented this technique, but do not include it in our experiments because, for many of the domains we tested on, the algorithm could not complete within the allotted time. The technique requires seeding a new open list from s_root to the discovered goal every iteration after a goal is found, which is worst-case linear in the size of the domain, making it impractical in domains of even moderate size.

### Iterative Updates in TBR(WA*)

Hernández, Baier, and Asín (2016) note that the heuristic update operation of TBR(WA*) is not bounded by a constant and suggest that it can be done iteratively. In consultation with the original authors, the following implementation was used in our experiments. When the agent attempts to backtrack but cannot, the existing OPEN is cached and reordered on h, and a new open list is allocated. We will call this new open list OPEN_current and the cached one OPEN_cached. We choose the best successor s' of the agent's current state s with respect to the priority function and seed it into OPEN_current. The action transitioning s to s' is returned for execution. For every iteration while OPEN_cached is not empty, a number of nodes are expanded from OPEN_cached equal to the number of nodes expanded in the forward search, thus bounding the number of updates performed by a constant. Note that expansions from OPEN_cached are executed as in LSS-LRTA* (Koenig and Sun 2009). When OPEN_cached is empty, it is discarded. If the algorithm attempts to restart when there is already a non-empty OPEN_cached, then the existing OPEN_current is not cached but simply discarded. This ensures that no state can be discarded from OPEN_cached repeatedly and thus never updated.

Observe that the above maintains completeness by not disrupting any invariants of the original TBR(WA*). Any state that is continually visited will eventually be moved to OPEN_cached and updated accordingly. If OPEN is ordered on weighted f, reordering OPEN_cached on h is a linear operation in the size of the frontier and thus still not bounded by a constant. However, it is likely to be significantly cheaper than updating the entire closed list when restarting is triggered, which is linear in the size of CLOSED, not OPEN, and so represents an improvement over the scheme presented by Hernández, Baier, and Asín (2016).

## Intra-Envelope Search

We now turn to a new approach to Envelope Search we call Intra-Envelope Search (I-ES). Like TBA* and its variants, a single search frontier is maintained throughout the life of the algorithm. Periodically, the best node on the frontier is selected as the agent's target, and a search is performed within the envelope to connect the agent to that target node. These connection searches allow the agent to shortcut through the state space toward a target node. Pseudocode is presented in Algorithm 2.

I-ES maintains an open list OPEN_fr representing the envelope frontier. An additional open list OPEN_bk is maintained for the intra-envelope search and is described in detail below. The algorithm is agnostic to the ordering of the nodes on these open lists, except that the priority function must increase as h increases in order to provide completeness guarantees (Theorem 2). The algorithm takes a parameter bound, which is a resource limit on the iteration execution time. This limit is split into two parts by multiplying by a factor $0 < r_1 < 1$. In our experimentation we used $r_1 = 0.8$, implying more frontier expansion and less search within the envelope. The cache of the agent's path is set to an empty list, and OPEN_fr is seeded with the root node [line 7]. While the agent has not reached the goal, the algorithm proceeds in two parts: Exploration and Path Extraction.

The EXPLORE function expands nodes from and adds successors to OPEN_fr until bound_fr is exhausted [line 9]. Every state s expanded by EXPLORE has its heuristic value updated to $h(s) \leftarrow \min_{s' \in s_{succ}} h(s') + c(s, s')$. This update is only relevant in directed graphs, where it provides completeness guarantees (Theorem 2).
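A sketch of EXPLORE under these definitions (our illustration, not the authors' code; it assumes the hypothetical SearchProblem interface above, orders the frontier greedily on h as in our experiments, and caches updated heuristic values in `h_val`):

```python
import heapq
from itertools import count

_tie = count()  # tiebreaker so the heap never compares states directly

def explore(problem, open_fr, generated, h_val, expansion_budget):
    """Frontier expansion (EXPLORE). Each expanded state's stored heuristic is
    raised to min over successors of h(s') + c(s, s'), the update that
    Theorem 2 relies on in directed graphs."""
    while expansion_budget > 0 and open_fr:
        _, _, s = heapq.heappop(open_fr)
        succs = list(problem.successors(s))
        if succs:
            h_val[s] = min(h_val.get(t, problem.h(t)) + c for t, c in succs)
        for t, c in succs:
            if t not in generated:   # each state enters the envelope at most once
                generated.add(t)
                heapq.heappush(open_fr, (h_val.get(t, problem.h(t)), next(_tie), t))
        expansion_budget -= 1
```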
In Path Extraction, I-ES searches through the envelope to connect the agent to a target node. There are many possible techniques for searching within the envelope, but for clarity, we focus on a simple backward search from a target to the agent. First, we seed the intra-envelope search open list if a search is not in progress [line 11]. s_target is set as the root of the backward search and s_forward is set as its goal. Note that if our path cache is non-empty, we seed the forward search from the end of the cached path [line 21]. Next, the ENVELOPESEARCH function is executed. Using the backward strategy (pseudocode in Algorithm 3), nodes are expanded by examining their predecessors. Note that only predecessors that have already been generated by the Exploration stage are examined, ensuring the search remains within the envelope. The goal of the search is to connect s_target with s_forward, so a state s is ordered within OPEN_bk using h(s_forward, s). EXPAND sets pointers on predecessors so that the path to s_target can be extracted. If s_forward is expanded from OPEN_bk, the search is terminated and a path is extracted. The EXTRACTPATH function constructs a path by walking forward from s_forward to s_target following the pointers set in the envelope search. The resulting path is appended to pathCache [lines 28-29]. If no path is available after ENVELOPESEARCH, then a partial path is constructed by simply taking the best successor of s_forward with respect to h. Note that if s_forward has not yet been expanded, it is expanded as by EXPLORE during EXTRACTPARTIALPATH [line 14]. Finally, the agent transitions to the next state in the cached path.

```
Algorithm 2: Intra-Envelope Search (I-ES)
    input: bound
 1  s_agent ← root
 2  s_target ← null
 3  s_forward ← null
 4  bound_fr ← r1 × bound
 5  bound_i ← (1.0 − r1) × bound
 6  pathCache ← [ ]
 7  OPEN_fr ← {root}
 8  while ¬Goal(s_agent) do
        // Exploration
 9      Explore(bound_fr)
        // Path Extraction
10      if envelope search not in progress then
11          ReseedEnvelopeSearch
12      EnvelopeSearch(bound_i)
13      if pathCache is empty then
14          pathCache ← ExtractPartialPath
15      s_agent ← pathCache.first
16      pathCache.removeFirst()

17  define ReseedEnvelopeSearch:
18      s_target ← OPEN_fr.top()
19      OPEN_bk ← {s_target}
20      if pathCache is not empty then
21          s_forward ← pathCache.last
        else
23          s_forward ← s_agent
```

```
Algorithm 3: Envelope Search (Backward Strategy)
24  while bound_i not exhausted do
25      s ← OPEN_bk.pop()
26      Expand(s)
27      add s_pred ∩ (CLOSED_fr ∪ OPEN_fr) to OPEN_bk
28      if s = s_forward then
29          pathCache ← pathCache + ExtractPath(s)
```
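A Python rendering of the backward strategy (our sketch of Algorithm 3, not the authors' code; `envelope` stands for the set CLOSED_fr ∪ OPEN_fr, and `h_between(u, v)` stands in for h(u, v)):

```python
import heapq
from itertools import count

_tie = count()

def envelope_search_backward(problem, s_target, s_forward, envelope, budget, h_between):
    """Expand predecessors already inside the envelope, ordered on
    h(s_forward, s), until s_forward is reached or the budget runs out.
    Returns the connection path [s_forward, ..., s_target] or None."""
    open_bk = [(h_between(s_forward, s_target), next(_tie), s_target)]
    toward_target = {s_target: None}   # pointers set on predecessors during Expand
    while budget > 0 and open_bk:
        _, _, s = heapq.heappop(open_bk)
        if s == s_forward:             # connected: walk the pointers forward
            path, node = [], s
            while node is not None:
                path.append(node)
                node = toward_target[node]
            return path
        for p, _cost in problem.predecessors(s):
            if p in envelope and p not in toward_target:
                toward_target[p] = s
                heapq.heappush(open_bk, (h_between(s_forward, p), next(_tie), p))
        budget -= 1
    return None                        # still in progress; real I-ES caches OPEN_bk
```

In the actual algorithm OPEN_bk and the pointer table persist across iterations so the search can resume; the sketch above restarts them for brevity.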
### Directed Domains

I-ES works in directed search graphs with one modification. If OPEN_bk = ∅ and no connection path has been extracted, the entire envelope is reset: CLOSED_fr ← {}, OPEN_fr ← {s_agent}. Crucially, the heuristic updates for explored states are maintained. If the backward search is empty, this means that the agent cannot reach s_target from within the expanded envelope, and so we begin re-exploring the area of the state space we know the agent will be able to reach.

### Properties of I-ES

We now turn to characterizing conditions under which I-ES terminates and is complete.

Definition 1: In an undirected domain, $\forall s \in S, \forall s' \in s_{succ}: s \in s'_{succ}$. Otherwise, the domain is directed.

Definition 2: In a solvable domain, $\forall s \in S$ there exists a path $p$ such that $p.first = s$ and $Goal(p.last)$. Any s for which this does not hold is a dead end.

Theorem 1: I-ES is complete in finite undirected solvable domains.

Proof: The EXPLORE function will continue to expand nodes until all descendants of the start state have been expanded, which by definition of a finite solvable state space must include a goal node. Once a goal is generated, the backward search will be seeded with it during the next call to RESEEDENVELOPESEARCH. The agent is always either following a path extracted from the intra-envelope search or transitioning to its best successor as discovered by expanding its current state. Therefore the agent is always within the envelope. The backward search will eventually expand s_agent, since the domain is finite and the agent is always within the envelope. Thus a path to the goal will be extracted.

Furthermore, I-ES is also complete in a broad class of directed domains.

Theorem 2: I-ES is complete in finite directed solvable domains with no dead ends, positive finite edge costs, and positive finite heuristics.

Proof: We reuse the classic RTA* proof (Korf 1990, Theorem 1): to be incomplete in a finite domain, the agent must forever traverse a finite set of states S' that does not include a goal. (We note that this implies resetting the envelope frontier an infinite number of times, as at least two nodes will need to be chosen as targets an infinite number of times.) The h value of every node expanded by EXPLORE is set to a value higher than the minimum of its successors, due to the positive action costs. Therefore, the minimum h value in S' will iteratively be raised until a successor outside S' appears more attractive than any node in S', causing the agent to leave S' and implying a contradiction.

Theorem 3: I-ES terminates in finite unsolvable undirected domains.

Proof: If OPEN_fr empties without finding a goal, EXPLORE will cause the algorithm to terminate.

### Ordering the Open Lists

I-ES and its theorems are agnostic to the ordering of the open lists. We explored several priority functions, including greedy best-first on h for all open lists, weighted f for OPEN_bk, and a weighted-f-like function for OPEN_fr calculated as $f(s) = w \cdot h(s) + h(s, s_{agent})$, where s_agent is the agent's current state. This function allows us to approximate g without having to maintain a rigid graph. Since each open list could theoretically be ordered using a different priority function, the number of possible configurations is combinatorial in the number of priority functions considered. For experimental comparison with TBA*, we used a greedy best-first priority function for OPEN_fr and either a greedy or a weighted-f function for OPEN_bk. The weight parameter reported in experiments applies only to the backward open list. In the backward search, $g(s) = c(s, s_{target})$. These configurations were identified as the strongest performers in the domains considered; we omit plots and detailed analysis of the other configurations for brevity.
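These orderings can be written directly (our sketch, reflecting our reading of the configuration described above; `problem.h(u, v)` is the heuristic estimate between two arbitrary states):

```python
def frontier_priority(problem, s, s_agent, w: float = 3.0) -> float:
    # f(s) = w * h(s) + h(s, s_agent): the agent-distance term approximates g
    # without maintaining a rigid search tree rooted at any particular state.
    return w * problem.h(s) + problem.h(s, s_agent)

def backward_priority(problem, s, s_forward, g_from_target: float, w: float = 3.0) -> float:
    # Weighted f for OPEN_bk, where g(s) = c(s, s_target) is the cost
    # accumulated by the backward search and h estimates the distance to s_forward.
    return g_from_target + w * problem.h(s_forward, s)
```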
[Figure 2: left: 100×100 Minima map; right: Starcraft Cauldron map.]

### Time Complexity

As a real-time algorithm, we analyze time complexity with respect to per-iteration guarantees. All operations are either O(1) or otherwise bounded by the constant bound, except where noted below. The open lists in our experimental implementation are binary heaps. Insertion into a binary heap takes time logarithmic in the size of the heap, which is not bounded by a constant. Data structures with constant-time insert exist, such as the Fibonacci heap, but were not used for ease of implementation.

The EXTRACTPATH function is worst-case linear in the number of states generated so far, since it is possible that all generated states lie on the extracted path. Therefore, EXTRACTPATH is not bounded by a constant. It is possible to iteratively construct the path over multiple search iterations in a manner similar to TBA*; however, we decided not to focus on those details, since following pointers is a relatively cheap operation. Profiling of our implementation revealed that our traceback operations take approximately 200 ns, as opposed to our expansion operations, which take approximately 4000 ns; furthermore, our tracebacks account for only 2% of total algorithm runtime. For fair comparison with TBA* and its variants, we set the TBA* parameter tracebackLimit = ∞, allowing it to also extract paths unbounded by constants. pathCache was implemented as a linked list, which allows constant-time insert, delete, and append at both the start and end of the list.
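In Python, for example, collections.deque provides the same guarantees (a minimal sketch of the pathCache usage, with hypothetical state names):

```python
from collections import deque

path_cache: deque = deque()

# Append a newly extracted connection segment at the back (as ExtractPath does)...
path_cache.extend(["s4", "s5", "s_target"])

# ...and pop the agent's next state from the front, both in O(1) time.
next_state = path_cache.popleft()
```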
## Experimental Results

We tested variants of TBA* and I-ES on: 1) the 100 15-puzzles of Korf (1985), using Manhattan distance (MD); 2) a deterministic version of the racetrack game (Barto, Bradtke, and Singh 1995), in which a collision brings the velocity to zero, using MD divided by the maximum achievable speed, on a version of the original Barto racetrack scaled and extended to 414×66; and 3) grid pathfinding using four-way movement and MD on a) the Starcraft Cauldron map (Sturtevant 2012) and b) 1500×1500 grids with large randomly generated minima, referred to as the Minima domain (Figure 2 shows example maps). Each experiment was given 7 GB RAM and 5 minutes. Real-time bounds were specified using limits on the number of expanded nodes, and the CPU time of every iteration before the goal was found was recorded. To compensate for not using a real-time OS, we used the 99th percentile of the per-iteration CPU times, which was much more stable than the maximum.

We present the most successful variants of TBA* and I-ES, with LSS-LRTA* reported as a baseline. For this analysis, the Greedy I-ES variant uses GBFS for both frontier and intra-envelope expansion, whereas the Weighted variant uses GBFS for frontier expansion and Weighted A* for intra-envelope expansion. A weight of 3.0 was selected as a high-performing weight in these domains and compared against the greedy variants.

Figure 3 presents the results. The x-axis represents the mean over the test instances of the 99th percentile of the per-iteration CPU time, with error bars representing 95% confidence intervals. The y-axis represents the mean agent trajectory cost, expressed as a factor of A*'s offline optimal solution, again with confidence intervals.

[Figure 3: Solution cost as a function of the real-time bound for the best-performing variants of TBA* and I-ES. Panels: Minima 1500×1500 (Grid World), Starcraft: Cauldron (Grid World), Sliding Tile (Korf 4×4), and Racetrack (Hansen, 3-wide); each plots factor of optimal against the 99th percentile of per-iteration CPU time (ms) for TB(BFS), TB(WA*)(3.0), LSS-LRTA*, I-ES Greedy, and I-ES W 3.0, with the Racetrack panel using the restarting variants TBR(BFS) and TBR(WA*)(3.0).]

Figures 3a and 3b show that, in grid pathfinding, I-ES is significantly superior to the other tested algorithms at lower time bounds. Above 15 ms/iteration, TB(WA*) becomes slightly superior. Note that even though both TB(BFS) and I-ES expand their frontiers with a greedy best-first strategy, I-ES has much better performance, suggesting that the shortcutting inherent to I-ES makes a crucial contribution in these highly connected domains.

The Sliding Tile experiments (Figure 3c) offer an interesting contrast. I-ES is somewhat competitive with the TBA* variants at low iteration durations, but TB(WA*) becomes much better at higher lookaheads. This suggests that in this domain the ability of TB(BFS) to find a suitable path quickly is more important than the ability to switch between paths opportunistically. This is evidenced by the trajectories of the I-ES and TB(BFS) plots, which are very similar. Recall that both I-ES and TB(BFS) expand their frontiers greedily, but that I-ES spends computation time searching within the envelope. The time spent on intra-envelope search reduces the effectiveness of the greedy expansion strategy, though I-ES is still able to outperform TB(WA*) at low durations, even without a rigid internal graph structure.

Racetrack (Figure 3d) represents a domain where I-ES is not as competitive, though it does ultimately converge with TBA* in solution quality. We speculate that this is because connections between branches in the search tree are difficult to discover. The distance, in terms of node transitions, between two states that share a location but differ in speed may be large, and so discovering these connections takes more computation time than in other domains.

Intra-envelope search techniques have clear advantages in certain domains. They exhibit the crucial behavior of being able to escape heuristic depressions without scrubbing, as evidenced by the high performance of I-ES in the Minima domain. When compared with the leading real-time search algorithms for these domains, TBA* and its variants, I-ES achieves state-of-the-art performance. By anchoring neither the search nor the path extraction in the increasingly less relevant root node, I-ES is very effective at navigating dense graphs via shortcuts when new target nodes are selected.

The performance of I-ES in grid pathfinding does not translate as easily to the Sliding Tile and Racetrack domains. This characteristic is a result of the fact that I-ES spends computation on its extra intra-envelope search instead of continuing to expand the frontier with all available time. In grids, this split search pays dividends in better path quality because shortcuts between target nodes are discovered. This is in contrast to TBA* and its variants, which restrict themselves to their pre-constructed graphs. In Racetrack and Sliding Tile, the state space is not as interconnected, and so the extra time spent trying to find shortcuts ultimately proves less effective than simply following the pointers of a rigidly defined search tree.

We experimented with seeding OPEN_bk with the entire search frontier. This resulted in the agent being drawn toward frontier nodes that happened to be near it rather than toward the goal. We also experimented with replacing the backward envelope search with a naive bidirectional search; however, this did not improve performance. Incorporating more recent work on bidirectional search (Holte et al. 2015; Shperberg et al. 2019) could be a promising future direction.
I-ES could also be extended to dynamic domains where edge costs change over time. For example, the forward frontier search could be replaced with a version of Lifelong Planning A* (Koenig and Likhachev 2001), or the backward search could be replaced with a version of D* Lite (Koenig and Likhachev 2002).

Although the LSS and ES paradigms are the most popular in real-time search, there are others. For example, the FRIT algorithm (Rivera et al. 2014) does not rely on LSS expansion as the only means of guiding the agent. Instead, a relaxed search graph (i.e., one with no obstacles) is constructed and then updated as the agent discovers that assumed edges do not exist in the true search graph. This approach shows improvements when compared against other LSS algorithms, but it still restricts its learning to areas local to the agent and is not limited to expanding states at most once. Real-Time D* (Bond et al. 2010) uses both a forward search for action selection and a backward search to connect the agent with the goal. However, the backward search limits the algorithm to domains featuring a single goal state. I-ES is able to connect the agent to the frontier in the absence of a discovered goal and can guide the agent toward the closest goal if multiple have been discovered.

## Conclusion

We advanced the paradigm of envelope-based real-time heuristic search as a contrast to the more traditional agent-centered approach. We examined TBA*, the previous state-of-the-art envelope-based approach, and presented examples of problems that are troublesome for its way of guiding the agent toward the promising area of the search frontier. We proposed a new approach, Intra-Envelope Search (I-ES), that conducts an auxiliary search within the main envelope to connect the agent to the frontier. We showed that I-ES is complete in both undirected and directed domains under reasonable assumptions and compared its best configurations with TBA*'s in several search benchmarks. I-ES exhibits superior performance in highly connected domains like grid pathfinding, where shortcuts are easily extracted, although it did not provide an advantage in more sparsely connected state spaces such as Racetrack. Given that the main advantage of envelope-based methods is their avoidance of scrubbing, which appears so irrational to an observer, we hope that I-ES's sensible approach to guiding the agent to the frontier will raise the profile of envelope-based search as an effective approach for real-time planning.

## References

Barto, A. G.; Bradtke, S. J.; and Singh, S. P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72(1):81–138.

Björnsson, Y.; Bulitko, V.; and Sturtevant, N. 2009. TBA*: Time-Bounded A*. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09).

Bond, D. M.; Widger, N. A.; Ruml, W.; and Sun, X. 2010. Real-time search in dynamic worlds. In Proceedings of the Symposium on Combinatorial Search (SoCS-10).

Hernández, C., and Baier, J. A. 2012. Avoiding and escaping depressions in real-time heuristic search. Journal of Artificial Intelligence Research 43:523–570.

Hernández, C.; Asín, R.; and Baier, J. A. 2014. Time-bounded best-first search. In Proceedings of the Seventh Annual Symposium on Combinatorial Search (SoCS 2014).

Hernández, C.; Baier, J. A.; and Asín, R. 2016. Time-bounded best-first search for reversible and non-reversible search graphs. Journal of Artificial Intelligence Research 56:547–571.
Holte, R. C.; Felner, A.; Sharon, G.; and Sturtevant, N. R. 2015. Bidirectional search that is guaranteed to meet in the middle. In AAAI.

Kiesel, S.; Burns, E.; and Ruml, W. 2015. Achieving goals quickly using real-time search: experimental results in video games. Journal of Artificial Intelligence Research 54:123–158.

Koenig, S., and Likhachev, M. 2001. Incremental A*. In Proceedings of Neural Information Processing Systems.

Koenig, S., and Likhachev, M. 2002. D* Lite. In AAAI, 476–483. American Association for Artificial Intelligence.

Koenig, S., and Likhachev, M. 2006. Real-time adaptive A*. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-06), 281–288.

Koenig, S., and Sun, X. 2009. Comparing real-time and incremental heuristic search for real-time situated agents. Autonomous Agents and Multi-Agent Systems 18(3):313–341.

Koenig, S. 2001. Agent-centered search. AI Magazine 22(4).

Korf, R. E. 1985. Iterative-deepening-A*: An optimal admissible tree search. In Proceedings of IJCAI-85, 1034–1036.

Korf, R. E. 1990. Real-time heuristic search. Artificial Intelligence 42:189–211.

Richter, S.; Thayer, J. T.; and Ruml, W. 2010. The joy of forgetting: Faster anytime search via restarting. In Proceedings of the Twentieth International Conference on Automated Planning and Scheduling (ICAPS-10).

Rivera, N.; Illanes, L.; Baier, J. A.; and Hernández, C. 2014. Reconnection with the ideal tree: A new approach to real-time search. Journal of Artificial Intelligence Research 50:235–264.

Shperberg, S. S.; Felner, A.; Shimony, S. E.; Sturtevant, N. R.; and Hayoun, A. 2019. Improving bidirectional heuristic search by bounds propagation. In Twelfth Annual Symposium on Combinatorial Search.

Sturtevant, N. R., and Bulitko, V. 2016. Scrubbing during learning in real-time heuristic search. Journal of Artificial Intelligence Research 57:307–343.

Sturtevant, N. 2012. Benchmarks for grid-based pathfinding. Transactions on Computational Intelligence and AI in Games 4(2):144–148.