# scrubbing_during_learning_in_realtime_heuristic_search__043ac016.pdf

Journal of Artiﬁcial Intelligence Research 57 (2016) 307 343 Submitted 07/15; published 10/16

Scrubbing During Learning In Real-time Heuristic Search

Nathan R. Sturtevant sturtevant@cs.du.edu Department of Computer Science University of Denver Denver, Colorado, USA

Vadim Bulitko bulitko@ualberta.ca Department of Computing Science University of Alberta Edmonton, Alberta, T6G 2E8, Canada

Real-time agent-centered heuristic search is a well-studied problem where an agent that can only reason locally about the world must travel to a goal location using bounded computation and memory at each step. Many algorithms have been proposed for this problem and theoretical results have also been derived for the worst-case performance with simple examples demonstrating worst-case performance in practice. Lower bounds, however, have not been widely studied. In this paper we study best-case performance more generally and derive theoretical lower bounds for reaching the goal using LRTA*, a canonical example of a real-time agent-centered heuristic search algorithm. The results show that, given some reasonable restrictions on the state space and the heuristic function, the number of steps an LRTA*-like algorithm requires to reach the goal will grow asymptotically faster than the state space, resulting in scrubbing where the agent repeatedly visits the same state. We then show that while the asymptotic analysis does not hold for more complex realtime search algorithms, experimental results suggest that it is still descriptive of practical performance.

1. Introduction

The framework of real-time agent-centered heuristic search models an agent with locally limited sensing and perception that is trying to reach a goal while interleaving planning and movement (Koenig, 2001). This well-studied problem has led to numerous algorithms (Korf, 1990; Furcy & Koenig, 2000; Shimbo & Ishida, 2003; Hern andez & Meseguer, 2005; Bulitko & Lee, 2006; Hern andez & Baier, 2012) and various theoretical analysis (Ishida & Korf, 1991; Koenig & Simmons, 1993; Koenig, Tovey, & Smirnov, 2003; Bulitko & Lee, 2006; Bulitko & Bulitko, 2009; Sturtevant, Bulitko, & Bj ornsson, 2010). Given the broad work on this problem, it is surprising to ﬁnd that little work has been done on lower bounds. It is clear that in some examples an agent can travel directly to the goal state on an optimal path. Such examples, however, generally require an unreasonably accurate initial heuristic or a favorable domain-speciﬁc tie-breaking schema which are hard to guarantee in practice. Other examples (Koenig & Simmons, 1993; Edelkamp & Schr odl, 2012) show that the worst-case bound is tight, but these examples do not generally characterize when the worst-case and best-case performance coincide. The main contribution of this paper is a non-trivial lower bound on the travel distance required for an agent to reach the goal in the basic real-time heuristic search framework.

c 2016 AI Access Foundation. All rights reserved.

Sturtevant & Bulitko

We show theoretically that there exists a tie-breaking schema that forces the agent to revisit its states ( scrub ) in polynomially growing state spaces; this is an undesirable property of real-time heuristic search. Our result shows that this phenomenon is unavoidable when an agent has limited 1-step lookahead, the state space is polynomial, all edges have unit costs, and the initial heuristic is consistent and integer valued. We later show that lifting the lookahead and unit edge cost restrictions allows an agent in some cases to travel directly from the start to the goal. However, such counter examples are somewhat contrived and, as our empirical results demonstrate, do not appear to happen frequently in practice. It is an open question whether similar theoretical results can be derived with fewer restrictions on the agent and the state space. We also examine exponential state spaces and learning over multiple trials until convergence. The insights from this theoretical work also provide explanations into why several approaches taken by other recent literature are eﬀective. This paper is an extended version of our previously published symposium paper (Sturtevant & Bulitko, 2014). The original paper contained the proof of asymptotic state revisitation under the assumptions of 1-step lookahead, polynomial state spaces, unit edge costs, integer initial heuristics, and non-optimal tie-breaking. This journal paper adds counterexamples for exponential state spaces, larger lookahead, and spaces with non-unit edge costs and/or non-integer initial heuristic. Convergence travel is also analyzed, and new experimental results with diﬀerent tie-breaking rules are added. Finally, we include additional an discussion around all of these issues. The rest of the paper is organized as follows. We begin in Section 2 by formally deﬁning the problem at hand. We will review related work on this problem in Section 3. Our own theoretical analysis is presented in Section 4. We then discuss the applicability of the theory to a broader class of state spaces and algorithms in Section 5 and build concrete counter-examples. The theoretical results are supported by a brief empirical evaluation in Section 6. We then conclude the paper and outline directions for future work.

2. Problem Formulation

In this paper we use a common deﬁnition of the real-time heuristic search problem. We deﬁne a search problem S as the n-tuple (S, E, s0, sg, h) where S is a ﬁnite state of states and E S S is a ﬁnite set of edges between them. S and E jointly deﬁne the search graph which is undirected: a, b S [(a, b) E = (b, a) E]. Two states a and b are immediate neighbors iﬀthere is an edge between them: (a, b) E; we denote the set of immediate neighbors of a state s by N(s). A path P is a sequence of states (s0, s1, . . . , sn) such that for all i {0, . . . , n 1}, (si, si+1) E. A route R is a simple path that does not include duplicate states. Initially we assume that all edge costs are 1 (Assumption 1). At all times t {0, 1, . . . } the agent occupies a single state st S. The state s0 is the start state and is given as a part of the problem. The agent can change its current state, that is, move to any immediately neighboring state in N(s). The traversal incurs a travel cost of c(st, st+1). The agent is said to solve the search problem at the earliest time T it arrives at the goal state: s T = sg. The solution is a path P = (s0, . . . , s T ): a sequence of states visited by the agent from the start state until the goal state. The cumulative cost of all edges in the solution is the travel cost and is the primary performance metric we are concerned with in this paper. The cost of the shortest possible

Scrubbing in Learning Real-time Heuristic Search

path between states a, b S is denoted by h (a, b). We abbreviate h (s, sg) as h (s). The agent has access to a heuristic h : S [0, ). The heuristic function is a part of the search problem speciﬁcation and is meant to give the agent an estimate of the remaining cost to go. The heuristic is integer-valued (Assumption 2). The search agent can modify the heuristic as it sees ﬁt as long as it remains admissible s S [h(s) h (s)], consistent a, b S [|h(a) h(b)| h (a, b)] and integer-valued at all times. (Admissibility and consistency are assumed throughout the paper.) The heuristic at time t is denoted by ht; h0 = h. The total magnitude of all updates to the heuristic function performed by the agent between t = 0 and t = T is called the total learning amount.

Algorithm 1: Basic Real-time Heuristic Search

input : search problem (S, E, s0, sg, h), tie-breaking schema τ output: path (s0, s1, . . . , s T ), s T = sg 1 t 0

3 while st = sg do

4 st+1 arg minτ s N(st)(1 + ht(s ))

5 ht+1(st) 1 + ht(st+1)

Numerous heuristic search algorithms have been developed for the problem described above (e.g., Korf, 1990; Bulitko & Lee, 2006; Koenig & Sun, 2009). Most of them are based on Algorithm 1. A search agent following the algorithm begins in the start state s0 and repeatedly loops through the process of choosing the next state and updating its heuristic until it reaches the goal state sg for the ﬁrst time (line 3). Within the loop, in line 4 the agent ﬁrst computes the immediate neighbor that minimizes the estimated cost of going to that neighbor (always 1 for now) plus the estimated cost of going from that neighbor to the goal. The lookahead is 1 and thus only immediate neighbors (i.e., N(st)) are considered by the agent (Assumption 3). Ties among neighbors that have the same minimal heuristic values are broken with a tie-breaking schema τ that we provide (Assumption 4), as detailed later in the paper, which is suﬃciently suboptimal to prove our results. Then, in line 5, the agent updates (or learns) its heuristic in the current state by making it consistent with the heuristic of the neighbor picked in the previous line. The agent then changes its current state to the neighbor (i.e., makes a move). While the general problem allows for the heuristic to decrease, this algorithm will never decrease the heuristic value of a state, since the heuristic is always consistent. The problem we consider in this paper is to describe the minimum amount Tmin(S) of the travel cost that any search agent of the type described above would necessarily incur in ﬁnding a solution to S = (S, E, s0, sg, h). While the exact value of Tmin(S) can intricately depend on the particulars of a speciﬁc search problem S, we will derive a useful asymptotic lower bound on Tmin(S). In doing so we are concerned with systematic scrubbing: an agent is said to scrub systematically on a series of growing state spaces iﬀthe number of state visits asymptotically dominates the number of states in the space: Tmin(S) ω(|S|).

Sturtevant & Bulitko

Formally, consider a series of seach problems {S1, S2, . . . S2, . . . } of progressively increasing state spaces: |Si| = g(i) (e.g., |Si| = i2 for an open two-dimensional grid Si of i i cells). Then the agent scrubs systematically iﬀits travel time is lower-bounded by a function Tmin(Si) = f(i) which asymptotically dominates the state space size: f ω(g). We use the standard deﬁnition of ω as asymptotic dominance: f(n) ω(g(n)) iﬀ k > 0 n0 n > n0 [kg(n) f(n)]. Since all our functions are positive, we omit the absolute value.

We initially work under the assumption that the agent is only trying to reach the goal once (Assumption 5), for which we measure ﬁrst-trial travel, some other work has considered convergence travel. In convergence travel the agent is teleported back to the start state after reaching the goal, beginning a new trial, now with the updated heuristic. The agent s learning has converged when a trial does not result in further updates to the heuristic function (Bulitko & Lee, 2006). If a systematic tie-breaking schema is used, convergence entails that the agent will follow the same path to the goal on all subsequent trials. First-trial travel is trivially a lower bound on the convergence travel; we will discuss this connection in more detail later in the paper.

3. Related Work

Previous work has focused on deriving upper bounds on the agent s travel cost. For instance, LRTA* (Algorithm 1) is guaranteed to reach the goal in O(|S|2) steps (Ishida & Korf, 1991). This follows from the analysis of MTS (Ishida & Korf, 1991) if the target s position is ﬁxed. Examples have been provided to show that the exact bound, |S|2/2 |S|/2, is tight (Edelkamp & Schr odl, 2012).1 The analysis has been extended to a class of reinforcement learning algorithms, of which LRTA* is a special case (Koenig & Simmons, 1992, 1993, 1996). For state-based value-iteration algorithms, such as the search algorithm listed above, the upper bound is still O(|S|2).

There has also been substantial work on analyzing algorithms with backtracking moves, introduced with SLA* (Shue & Zamani, 1993a, 1993b) and the more general SLA*T (Shue, Li, & Zamani, 2001; Zamani & Shue, 2001). Such algorithms behave in the same way as our algorithm until the total learning exceeds a certain threshold T. After that the agent backtracks to the previous state whenever it raises a heuristic value. It was shown by Bulitko and Lee (2006) that the travel cost of SLA*T is upper bounded by h (s0, sg) + T where T is the threshold (learning quota). However, this upper bound holds only if the path being built by SLA*T is processed after every move and all state revisits are removed from it.

A problem-speciﬁc analysis of the minimum learning required to prove the optimal path was described by Sturtevant et al. (2010) but they did not consider the ﬁrst-trial performance. They provided a proof sketch that agents that lookahead farther will not converge to an optimal solution more quickly. Empirical results suggested that algorithms that look farther ahead have better convergence performance primarily because they perform less learning over and above the minimum required.

1. The chapter containing these results is authored by Koenig.

Scrubbing in Learning Real-time Heuristic Search

4. Our Analysis

Our goal in this paper is to derive a non-trivial asymptotic lower-bound on the amount of travel any search agent that uses Algorithm 1 will need to perform to reach the goal state. We will achieve this goal in stages. Section 4.1 illustrates how tie-breaking can inﬂuence bestand worst-case performance, using two simple examples. We present a high-level overview of our derivation in Section 4.2 and then detail individual steps in Sections 4.3 through 4.6. We then apply the analysis to the case of polynomially growing state spaces in Section 4.9 and discuss its implications on state revisits (i.e., scrubbing ). These results require all of our assumptions (1-5). After this presentation is completed, in Section 5 we discuss extensions of this work to state spaces with non-unit edge costs and larger lookahead. In Appendix A we consider exponential state spaces.

4.1 Intuition

Consider the ﬁve-state grid world in Figure 1. The goal is state 5 on the far right. The agent starts out in state 2. At each state the agent examines the heuristics of its immediate neighbors, to the left and right, and goes to the neighbor with the lowest heuristic. If the neighbors heuristic values are identical then, for the time being, we assume that ties are broken to the neighbor on the right. This is a favorable or fortunate tie-breaking rule for this example because it moves the agent towards the goal when the heuristic is not suﬃcient to do so by itself. A tie-breaking schema that breaks ties away from the goal, in this example, would be unfavorable or unfortunate. In this example we temporarily break Assumption 4 of an agent with unfortunate tie breaking. We use several variations on this problem to illustrate the importance of tie-breaking on solution travel cost.

Figure 1: A one-dimensional grid world of ﬁve states numbered at the top. The agent A is located in state 2 and has two available actions shown in the ﬁgure.

In Figure 2 we plot the initial heuristic values of each state in a ten-state version of this grid, where the goal is on the far right and the agent starts out on the far left. The initial heuristic values are shown with dark bars, and the learned updates to the heuristic are light bars stacked on top. Each plate shows a single time step. The agent s current position is indicated with an A . After reaching time step 4, the agent can continue straight to the goal without further heuristic updates. Due to the fortunate tie-breaking schema in this particular example, the total amount of learning (i.e., the total area of light bars) is 8 and the travel cost is 9. Scaling this example to 2k states, the total amount of learning will be 2(k 1) and the path produced will have the optimal cost of 2k 1, with no state revisits. Thus, the total learning in this example grows linearly with the state space size and no systematic scrubbing is observed (i.e., Tmin(S) ω(|S|)). However, tie breaking can adversely aﬀect the agent s performance, and in general there is no guidance available in a state space to allow favorable tie-breaking. Consider the search problem in Figure 3 with 13 states. In this example, each tie is broken away from the goal

Sturtevant & Bulitko

Figure 2: Heuristic learning with favorable tie breaking. The ten states of the problem are along the horizontal axis. The vertical axis shows the value of the heuristic function per time step. The darker bars are the initial heuristic values. The lighter bars are the increases due to learning. The initial heuristic is at the top. The agent s current state is shown with an A .

(i.e., to the left). The travel cost is now 20 and the total amount of learning is 18 indicating some revisits of the 13 states by the agent, such as at time steps 3, 7, and 9. Scaling this search problem, the total amount of learning will be asymptotically quadratic in the number of states. The amount of heuristic learning per move is at most 2 which means that the travel cost will also be asymptotically quadratic in the number of states and number of revisits per state will increase with the state space size. Thus, Tmin(S) ω(|S|) as |S|2 ω(|S|) and we have systematic scrubbing. The diﬀerence between these two cases is asymptotically signiﬁcant. While favorable tie-breaking can be achieved in speciﬁc examples, it cannot be always guaranteed in general.

4.2 Analysis Overview

Our goal is to quantify the travel required by any agent of the type described in Section 2 to ﬁnd a solution. The examples in the previous section demonstrated that the travel cost can depend on the tie-breaking schema used by the agent. Since it may not always be possible to design a favorable tie-breaking schema for a given problem (or series of problems), we consider the agent s performance with suﬃciently suboptimal tie-breaking schemas (Assumption 4) and develop a meaningful lower bound on the amount of travel. Our lower bound is asymptotic in the number of states and, unfortunately, demonstrates that under common conditions the agent will necessarily have to revisit states many times an undesirable phenomenon known as scrubbing in real-time heuristic search. We present the high-level argument immediately below and then detail each step individually in the subsequent sections. First, due to consistency of the heuristic and a bounded lookahead, the learning performed in each step is constant-bounded. Thus, an asymptotic lower bound on the learning required to reach the goal is also an asymptotic lower bound on travel distance (Section 4.3). We show that with a suﬃciently suboptimal tie breaking schema an agent traveling from sa to sb must raise the heuristic of sa to at least that of

Scrubbing in Learning Real-time Heuristic Search

Figure 3: Heuristic learning with unfortunate tie breaking.

sb (Section 4.4). Given a search graph, a current location, and a goal, we identify a lower bound on the maximum heuristic value ˆh that the agent must encounter when traveling to the goal (Section 4.5). Since the heuristic is consistent at all times we use ˆh to compute the minimum amount of learning (Section 4.6). We then apply the argument to polynomial state spaces where the number of states within distance r from a given state grows as Θ(rd) (e.g., quadratically on commonly used two-dimensional navigation maps with d = 2).2 In such spaces we show that, given an appropriately inaccurate initial heuristic, the amount of learning necessarily performed by the agent can grow as Ω(rd+1) which asymptotically dominates the number of states Θ(rd). Since learning per step is constant-bounded, the amount of travel will also asymptotically dominate the number of states. This result indicates that any real-time heuristic search agent of the type introduced above will necessarily scrub systematically (Corollary 2).

4.3 Learning per Step is Constant Bounded

Here we provide a bound on the learning required to solve a problem the ﬁrst time the cumulative updates to the heuristic function from t = 0 until t = T. In this section we begin by showing that the learning per step is constant-bounded. This will imply that a lower bound on learning is also a lower bound on movement.

2. We use the standard deﬁnition of Θ as bounded above and below : f(n) Θ(g(n)) iﬀ k1 > 0 k2 > 0 n0 n > n0 [k1g(n) f(n) k2g(n)], of Ωas bounded below : f(n) Ω(g(n)) iﬀ k > 0 n0 n > n0 [kg(n) f(n)] and of O as bounded above : f(n) O(g(n)) iﬀg(n) Ω(f(n)).

Sturtevant & Bulitko

Lemma 1 The total change in heuristic values during each learning step of Algorithm 1 is constant bounded independently of the number of states in the state space.

Proof. Algorithm 1 updates the heuristic of a state st in line 5 of the pseudo code based on the heuristic of a neighboring state st+1. Because st+1 is an immediate neighbor of st and because the heuristic is consistent, |ht(st) ht(st+1)| h (st, st+1) = 1. By deﬁnition, the new heuristic for st is 1 + ht(st+1). Thus, the heuristic of st can increase by at most 2. Since st is the only state that has its heuristic updated, the maximum change in heuristic values at each time step is 2, which is constant bounded. 2 We now measure the learning necessary to move between diﬀerent locations in the world.

4.4 Maintaining a Non-increasing Heuristic Slope

In this section we will prove that there exists a tie-breaking schema, τ, that forces the agent to maintain non-increasing heuristic values for states along its route, which we call the non-increasing heuristic slope property. While better and worse tie-breaking schemes may exist for speciﬁc problems, τ is suﬃciently suboptimal to prove our main result. As deﬁned earlier, the agent s route is a simple path from the start state s0 to the current state st with any loops removed. For instance, if by the time t = 5 the agent has traversed the path P = (s0, A, B, A, C, D) then its route R = (s0, A, C, D). A heuristic proﬁle is the vector of heuristic values along the route. If the heuristic proﬁle along this route is {4, 3, 2, 2}, the non-increasing property holds. If the heuristic proﬁle is {2, 3, 4, 3}, the non-increasing properly does not hold. We will construct a tie-breaking schema such that whenever the non-increasing property is violated by raising the heuristic at the current state, the agent will be forced to backtrack, removing the oﬀending state from its route. To illustrate, consider Figure 3. At time step 5 the agent raises the heuristic of its current state and violates the non-increasing property (the new, larger heuristic value is shown at time step 6). The agent therefore backtracks, making a move to the left, which removes the oﬀending state from its route. Only after reaching the state at time step 8 can the agent once again move forward while satisfying the property.

Figure 4: Raising h(st) above h(st 1) allows a tie-breaking schema to force the agent to backtrack.

Scrubbing in Learning Real-time Heuristic Search

Lemma 2 (Forced Backtracking) Consider an agent in a non-goal state st at time t. Its current route is R = (s0, . . . , st 1, st). Then there exists a tie-breaking schema τ such that after updating the heuristic of st from ht(st) to ht+1(st), if the updated value raises the heuristic proﬁle so that it ceases to be non-increasing, then the agent will backtrack to its previous state st 1:

ht+1(st 1) < ht+1(st) = st+1 = st 1. (1)

Proof. First, as the agent moved from st 1 to st, it must hold that ht(st 1) = 1 + ht(st), due to line 5 of Algorithm 1. This is shown in the left side of Figure 4. Due to a consistent, integer-valued heuristic and unit edge costs, the only way the heuristic can be increasing along the route after learning is if ht+1(st) = ht(st 1) + 1, shown on the right side of Figure 4. In this case, the update must have come from some state s N(st) which is a state with the lowest heuristic among st s neighbors (Figure 4) with ht(s ) = ht(st 1). Even if s is not the same as st 1, it has the same heuristic value. Thus, a tie-breaking schema will be able to break the ties towards st 1, making the agent backtrack from st to st 1 and maintaining the non-increasing heuristic. This tie-breaking schema is τ. 2

Since backtracking removes the oﬀending state from the agent s route, we have the following corollary.

Corollary 1 (Heuristic Slope) At any time during the search, there exists a tie-breaking schema such that the heuristic along the agent s route R = (r1, . . . , rn) is non-increasing:

ht(r1) ht(r2) ht(rn). (2)

Proof. We prove this claim by induction on route length n. For n = 1 inequality (2) trivially holds. Suppose it holds for the route of length n then when the (n + 1)th state is added to the route, by Lemma 2 it must hold that the heuristic of the added state is less than or equal to the heuristic of the route s previous end (otherwise the agent would backtrack and the route would not grow). 2

4.5 A Lower Bound on the Maximum Heuristic Encountered

Assume that at time t the agent added the state sb to its route. This means, according to Corollary 1, that ht(sa) ht(sb) for any state sa already in the route. Informally, before the agent passes through a state with a high heuristic, it must ﬁrst raise all previous states in its route to have at least equally high heuristics. Since heuristic values in Algorithm 1 never decrease during learning (due to consistency), this also means that ht(sa) h0(sb) which implies that the agent must have already raised the heuristic of sa by at least h0(sb) h0(sa), assuming this term is positive. We are interested in identifying the states in a particular problem which maximize the diﬀerence h0(sb) h0(sa), as they can be used to bound learning. In this section we show how to ﬁnd these states. In particular, sb will be the state with largest heuristic that the agent must encounter en route to the goal. Informally, we are providing a general deﬁnition of a local minima and then proving that the agent must learn its way out of the local minima by increasing heuristic values. Figure 5 illustrates this concept with pathﬁnding on a video-game map. Suppose that the agent is in state sa and is trying to reach state sg. We observe that the agent must pass

Sturtevant & Bulitko

through one of the states within the bottleneck C1 before reaching the goal. Similarly, the agent must also pass through one of the states in C2 before reaching the goal.

Figure 5: A two-dimensional pathﬁnding search problem with the goal state labeled sg and the start state labeled as sa. C1, C2, and C3 C4 are all cut sets.

We say that a set of states C is a cut set with respect to the states sa and sg iﬀsa / C, sg / C and all possible routes R = (sa, ..., sg) have a non-empty intersection with C. In Figure 5 the sets C1, C2 and C3 C4 are three diﬀerent cut sets with respect to sa and sg. Given two states sa and sg, we denote the set of all their cut sets as C(sa, sg). Thus C1 C(sa, sg), C2 C(sa, sg) and (C3 C4) C(sa, sg).

We use the notion of cut sets to derive a lower bound on the maximum heuristic value seen by an agent en route to the goal. For the map in Figure 5, assuming the standard straight-line heuristic, C1 is not the best cut set because the initial heuristic values of its states will be relatively small. C2 is better because its initial heuristic values will be larger. C3 C4 is an even better cut set as it will have the largest minimum initial heuristic values among C1, C2, C3 C4.

In general, we can ﬁnd the best lower bound by considering all cut sets and choosing the one with the maximal minimal initial heuristic:

ˆh(sa, sg) = max C C(sa,sg) min s C h0(s) (3)

From this deﬁnition and the deﬁnition of a cut set it follows that an agent will necessarily travel through some state s such that h0(s) ˆh(sa, sg). Thus, ˆh(sa, sg) is a useful lower bound on the maximum heuristic encountered along a route from sa to sg.

Scrubbing in Learning Real-time Heuristic Search

Figure 6: An illustration of s , h and ˆh(s0, sg).

Note that the state sa can be an arbitrary state in S. Thus, if R(S) is the set of states in a route generated by the agent while solving the problem S then we can deﬁne:

h = max s R(S)

h ˆh(s, sg) h0(s) i

s = arg max s R(S)

h ˆh(s, sg) h0(s) i . (5)

If there are several states for which h is maximized, s can be set to any of them. From here it follows that s is a state on the agent s route to goal whose heuristic value has to be raised by at least h:

h h T (s ) h0(s ). (6)

Figure 6 illustrates this with the heuristic proﬁle of a simple one-dimensional state space where every state between the start state s0 and the goal state sg forms a single-state cut set. Hence, ˆh(s0, sg) is just the highest initial heuristic on the agent s route. Thus, the heuristic of all states to the left of the peak will have to be raised to at least ˆh(s0, sg). The maximum amount of learning, h, happens in the state s . While all states along a route R(S) may not be known a priori, R(S) must contain the initial state s0 which, given (4), means that:

h ˆh(s0, sg) h0(s0). (7)

Furthermore, it is often possible to identify a state in R(S) which yields a lower bound on h that is higher than ˆh(s0, sg) h0(s0). To illustrate: the h shown in Figure 6 is strictly greater than ˆh(s0, sg) h0(s0). In a more practical example, shown in Figure 8, the octile distance h0 on the eight-connected grid leads to ˆh(s0, sg) h0(s0) = 0. But, in practice, an agent will move to the corner ﬁrst, allowing us to identify a diﬀerent state as s . We analyze this example in detail in Section 4.8.

4.6 Minimum Learning and Minimum Travel Required Overall

We have now established that there exists a state s along the agent s route to goal whose heuristic the agent will necessarily raise by at least h. Per Lemma 1, the amount of learning per move is constant-bounded, so the number of visits to the state s is Ω( h).

Sturtevant & Bulitko

Consistency of the heuristic allows us to derive even stronger lower bounds on the total amount of learning and travel cost. Indeed, raising the heuristic of s by h implies that the heuristic values of many other states have to be raised as well, contributing to the total amount of learning, travel and state re-visits. Speciﬁcally, since at any time t {0, . . . , T} the heuristic ht is consistent, the heuristic value of any state n S is upper and lower bounded with respect to an arbitrary state m S, according to its distance from it:

ht(m) h (m, n) ht(n) ht(m) + h (m, n). (8)

Thus, when the agent reaches the goal at time T, the heuristic of any state in the state space is lower-bounded according to the distance from s :

n S [h T (n) h T (s ) h (s , n)] . (9)

The initial heuristic is consistent and hence upper-bounded:

n S [h0(n) h0(s ) + h (s , n)] . (10)

For each state n S, the diﬀerence between the left sides of (9) and (10) is at least as large as the diﬀerence between their right sides:

h T (n) h0(n) h T (s ) 2h (s , n) h0(s )

which, due to (6), becomes: h 2h (s , n) (11)

We sum the right side of (11) over all states n S and, since no heuristic value decreases during learning, we derive the following lower bound on the total amount of learning:

n S max{0, h 2h (s , n)}. (12)

This is illustrated in Figure 7 where consistency of the heuristic determines the diamond around h. The area of the diamond is the right side of inequality (12). As the amount of learning per move is constant bounded (Section 4.3) we can also derive a lower bound on the amount of travel:

Tmin(S) Ω(Lmin(S)) . (13)

Note that in this one-dimensional example the non-increasing heuristic slope property allows to derive an even higher lower bound on the necessary amount of learning. Speciﬁcally, by the time the agent reaches the goal sg, all heuristic values to the left of the peak have to be raised to at least the level of the dotted line. The ﬁlled-in volume is clearly greater than the area of the diamond in the ﬁgure. This indicates a possible direction for future work ﬁnding a lower bound more aggressive than the one in inequality (12) that we use in the rest of the paper.

Scrubbing in Learning Real-time Heuristic Search

Figure 7: A lower bound on Lmin determined by s , h.

4.7 General Conditions for Systematic Scrubbing

Recall that our deﬁnition of scrubbing is that the number of state visits asymptotically dominates the number of states in the space: Tmin(S) ω(|S|). Thus, one way to establish systematic scrubbing on a series of search problems is to compute or, at least, estimate the sum in Equation 12 and show that it asymptotically dominates the number of states |S|. Together with the link between Tmin and Lmin given by Equation 13, this would establish systematic scrubbing. Generally speaking, the sum may depend intricately on the search problem structure and the initial heuristic. Thus, from here on we will proceed by analyzing two special cases.

4.8 Special Case: The Corner Map

We cannot make general statements about single problem instances, so instead we parameterize problem instances by their size, allowing us to describe how they scale as the map size grows. In this case we measure the size by the radius or the length of one edge of the map. Consider a series of search problems known as the corner map (Figure 8).

Figure 8: The corner map analyzed by Sturtevant et al. (2010).

This map is parameterized by n: a problem instance Sn for n > 3 has |Sn| O(n2) states. This is a two-dimensional grid where the agent can occupy any vacant cell (white in the ﬁgure). The agent can move to any of its four cardinal neighbors unless they are blocked by an obstacle (dark grey cells in the ﬁgure), all edges have unit cost, and we assume the Manhattan distance as the initial heuristic h0. Later in the paper we revisit this example with diagonal, non-unit-cost moves and an octile distance heuristic.

Sturtevant & Bulitko

If the agent starts in the corner created by the obstacles, that corner state will be s . Using the analysis in the previous sections, ˆh in Equation 3 has the value of n + 1 in the two states labeled ˆs in the ﬁgure. Correspondingly, the state s deﬁned in Equation 5 and also shown in the ﬁgure has the initial heuristic value of 4 (for any n); this will be raised to ˆh before the goal state is reached. Thus, h = n + 1 4 = n 3. The lower bound on Lmin (Inequality 12) becomes:

s Sn max{0, n 3 2h (s , s)}. (14)

Looking at the structure of the corner, we can re-write the right side of (14) by summing over i = h (s , s) and multiplying by the number of states with each value of h (s , s). There is one state (s ) with h (s , s) = 0, two states with h (s , s) = 1, and i + 1 states with h (s , s) = i. For odd values of n, we re-write the sum in (14) as:3

i=0 (i + 1)(n 3 2i) = (n 3)(n 1)(n + 1)

For even values of n the result is:

i=0 (i + 1)(n 3 2i) = (n 2)(n 1)n

In either case, the sum, as a function of n, is in the class Θ(n3) which puts the learning amount Lmin in Ω(n3). As the amount of learning per move is constant bounded, Tmin Ω(n3). The number of states on a two-dimensional n n corner map is O(n2) which is asymptotically dominated by Tmin proving that the agent will scrub systematically on {S1, S2, . . . }. Speciﬁc properties of the corner maps allowed us to derive a complexity class for the sum in (12) as well as the total number of states on each map. In the following section we analyze a broader class of search spaces.

4.9 Special Case: Locally Isotropic Polynomial State Spaces

The lower bounds on the amount of learning (12) and total travel (13) hold for a search problem S. However, the bounds do not explicitly reference the number of states in S and thus do not immediately allow us to correlate the amount of travel to the number of states in order to make a claim about state revisitation. To do so we investigate ways of computing the number of states η(r, S, E) which are precisely distance r away from the state s :

η(r, S, E) = |{s S | h (s, s ) = r}|. (17)

Generally speaking, η(r, S, E) depends on the topology of the search graph (S, E) and can be arbitrarily complex. However, if the state space is isotropic around the state s

3. This is still a lower-bound for this example because we are not considering that the full path to the goal has to have its heuristic values raised to h(ˆs). Our analysis assumes that only the ﬁrst state will have its heuristic raised.

Scrubbing in Learning Real-time Heuristic Search

then the number of states distance r away from s does not depend on the direction and can be described simply as η(r). The term isotropic is usually used to refer to the entire state space; we use the term locally isotropic to refer to states spaces that are isotropic in a region within radius r of s . Then the sum in inequality (12) can be alternatively computed as an integral over the shortest distance r between the state s and all other states:

n S max{0, h 2h (s , n)} = Z h

0 η(r)( h 2r)dr. (18)

We can further simplify the computation of the integral (18) for polynomial state spaces which extend from s for a distance of at least h/2; that is, are locally isotropic. These are state spaces in which

η(r) = γrd 1 (19)

for r [0, h/2]. For instance, in two-dimensional navigational maps which are also locally isotropic to at least radius h/2 around the state s , the degree of the polynomial is d = 2 and η(r) = γr2 1 = γr, r [0, h/2]. Note that this is a theoretical abstraction because states in real-life maps (e.g., Figure 5) are not always locally isotropic; the maps have an asymmetric structure and may not stretch far enough around s . Also, they may not be polynomial as the obstacles may non-uniformly reduce the number of available states distance r away from any state. Substituting η(r) = γrd 1 in (18), we get:

0 η(r)( h 2r)dr =

0 γrd 1( h 2r)dr =

0 rd 1dr 2γ Z h

0 rd 1rdr =

r=0 2γ d + 1rd+1

γ d2d ( h)d+1 γ (d + 1)2d ( h)d+1 =

γ 2dd(d + 1)( h)d+1 Θ(( h)d+1), h . (20)

Combining (12), (18) and (20) and we conclude that for locally isotropic polynomial spaces of dimension d that extend for distance at least h/2 around the state s , the minimum

Sturtevant & Bulitko

amount of total learning is lower-bounded as:

Lmin(S) Ω ( h)d+1 , h . (21)

As the amount of learning per step is constant-bounded, the same asymptotic lower bound applies to the travel cost:

Tmin(S) Ω ( h)d+1 , h . (22)

Note that under our assumptions, the number of states within the radius r of the state s grows as: Z r

0 η(x)dx = Z r

0 γxd 1dx Θ(rd), r . (23)

Now, consider an inﬁnite series of growing search problems {S1, S2, . . . }. Suppose each search problem Si is such that the corresponding state space Si is locally isotropic and polynomial to radius ri around the corresponding state s . Assume also that the degree of the polynomial is d 1 and that the radii of the search problems monotonically and unboundedly increase with i:

r1 < r2 < . . . (24)

ri when i . (25)

From (23), it follows the state space size is asymptotically lower-bounded by rd i :

|Si| Ω(rd i ), i . (26)

Suppose also that the state space Si is asymptotically upper-bounded by rd i as well:

|Si| O(rd i ), i . (27)

Further, suppose the initial heuristic for search problem Si is such that the corresponding h grows linearly with ri. Finally suppose that the local isotropicity expands far enough from s : ri h/2. Together these two conditions are:

h/2 ri ξ h (28)

where ξ 1/2 is a ﬁxed constant.

Corollary 2 (Systematic Scrubbing). Given the assumptions in equations 24-28, any real-time heuristic search algorithm that ﬁts the basic framework formulated earlier in this paper (Assumptions 1-5) will necessarily scrub systematically. Proof. As the state space radius ri increases, h for the problem Si will increase linearly with it due to (28). The minimum amount of travel Tmin asymptotically grows as a polynomial of degree d + 1:

Tmin(Si) Ω d+1 h , i (29)

Tmin(Si) Ω rd+1 i , i (30)

Scrubbing in Learning Real-time Heuristic Search

due to (22), (25) and the linear relation between h for Si and its radius ri (28). At the same time, the number of states in the state space will asymptotically grow as a polynomial of degree d of the radius:

|Si| Θ(rd i ), i (31)

due to (26) and (27). Combining (30) and (31) we get that the necessary amount of travel assymptotically dominates the state space size:

Tmin(Si) ω(|Si|), i (32)

and the agent will scrub systematically. 2

4.10 Discussion

To informally summarize the results thus far, we have shown that we can systematically measure the heuristic learning that must be performed in a single state by looking at the maximum heuristic diﬀerence encountered between that state and the goal. Under our assumptions the total learning required around this state asymptotically dominates the number of states in the state space. Since the learning per step is constant-bounded, agents will be forced to scrub. This analysis, in its most general form (Section 4.9), is based on several important assumptions. First, we assume a conservative tie-breaking rule that prefers to re-visit states over exploring new ones. Second, we assume that the state space is locally isotropic around s . Third, we assume that, in general problems, the heuristic error is growing proportional to the radius of the state space. Finally, we assume that the agent has 1-step lookahead, that edges have unit cost, and that heuristic values are integer valued. We now consider these assumptions in more detail. First, in our experimental results we will explore several tie-breaking rules, as well as aggressive tie-breaking that prefers to move to larger g-costs over small g-costs. These results will show that, while a better tie-breaking rule can improve best-case performance, it can also have a large impact on worst-case performance. Second, our deﬁnition of systematic scrubbing holds for any problem for which the total learning (20) grows asymptotically faster than the state space (23). We only proved that this must occur in polynomial state spaces that are locally isotropic around s ; it would be interesting future work to develop a broader range of models that have this property. Experimental results are provided giving evidence that this does occur in practice. Third, if the heuristic error ( h) in a series of growing state spaces is constant, then no scrubbing behavior will be guaranteed by our proofs. This requires a highly accurate heuristic and cannot be assumed in general. It is an interesting open question how the heuristic error grows on diﬀerent classes of problems, whether it is linear, logarithmic, or some other function of the size or radius of the state space. Such analysis is clearly domain and problem dependent, but even if the heuristic consistently underestimates by just 10% in locations that are locally isotropic, and is perfect in others, our analysis holds (Corollary 2). Finally, we will address larger lookahead and non-unit edge and heuristic costs in Section 5.

Sturtevant & Bulitko

4.11 Special Case: Exponential State Spaces

It is not common for real-time agents to traverse exponential state spaces, so we place our analysis of exponential state spaces in Appendix A. The techniques we used to show systematic scrubbing in polynomial spaces that are locally isotropic are insuﬃcient to do so in exponential spaces that are locally isotropic. However, we also do not have a proof that systematic scrubbing will be absent. Thus, more analysis is needed to make a conclusive statement on scrubbing in exponential state spaces.

5. Generalizing the Theory

So far, we have proven that systematic scrubbing (Corollary 2) holds for a basic agent design and a simple state space. In this section we investigate the extent to which the result is generalizable. We independently relax Assumptions 1 & 2 (unit edge costs and integer heuristic values) and Assumption 3 (one-step lookahead) showing, with counterexamples, that Corollary 2 does not directly generalize. Some of these counter-examples contrast with experimental results which suggest that scrubbing does occur in practice. It is an open problem to succinctly describe conditions for asymptotic scrubbing in these more complicated scenarios. Relaxing the single trial assumption (Assumption 5), however, shows that scrubbing is expected when learning until convergence.

5.1 Non-unit Edge Costs and Non-Integer Initial Heuristics

In this section we address whether Corollary 2 generalizes to search problems with non-unit edge costs and/or heuristic values. In particular, we relax the assumption of unit edge costs (Assumption 1), replacing it with an assumption of constant-bounded edge costs. We also relax the assumption of integer heuristic values (Assumption 2). We provide a set of counter-examples to systematic scrubbing: speciﬁcally constructed search problems where an agent running LRTA* can move from a region of low heuristic values to higher heuristic values without maintaining a non-increasing heuristic slope, which was the key assumption to our previous proofs.

s1 s2 s3 s4

Figure 9: Example chain of states for n = 2.

Consider a chain of 2n + 1 states, illustrated in Figure 9 for n = 2, {s0, s1, s2, . . . , sn, . . . , s2n} with s0 the start state and s2n the goal. Suppose the initial heuristic values are monotonically increasing along the chain until state sn (h(s1) < h(s2) < . . . h(sn)), and monotonically decreasing afterwards. Let h(i) = i for i n and h(i) = 2n i for i > n. This set of growing state spaces is well deﬁned for all n > 0 and, if c = 1, meets the conditions of Corollary 2. Thus, the example will cause our previously described agent to scrub.

Scrubbing in Learning Real-time Heuristic Search

However, if we allow the heuristic values to be non-integer or the edge costs to be non-unit then scenarios exists in which the agent will climb the heuristic grade without scrubbing. For instance, if we change the edge costs to be c = 1.5 instead of 1.0, we get the behavior shown Figure 10. Here the agent starts at state 0; the heuristic of this state is updated to be 2.5 which is the edge cost (1.5) plus the h-cost of the neighbor (1.0). The agent then moves to state 1. Notice here that the heuristic to the left (2.5 in state 0) is slightly larger than the heuristic to the right (2.0 in state 2). After learning, the heuristic in state 1 is updated to 3.5, and the agent continues to state 2. In the last step shown here, the agent will begin to follow the gradient to the goal in state 4. Thus, with these edge costs the agent has no choice but to move between the start and the goal without scrubbing.

0 1 2 3 4 0

1.5 1.5 1.5 1.5 0 1 2 3 4 0

1.5 1.5 1.5 1.5 0 1 2 3 4 0

1.5 1.5 1.5 1.5 0 1 2 3 4 0

1.5 1.5 1.5 1.5

Figure 10: The agent climbs a heuristic grade without scrubbing.

The intuition can be formalized for a chain of 2n states as follows:

Lemma 3 Consider a chain of states {s0, s1, s2, . . . , s2n} with s0 being the starting state and s2n being the goal. Suppose the initial heuristic values are monotonically increasing along the chain until some intermediate state sn : h(s1) < h(s2) < . . . h(sn) forming a heuristic grade of n states. Assume that after sn the heuristic is monotonically decreasing. Suppose the heuristic grade is of a constant slope i {0, . . . , 2n 1} [|h(si) h(si+1)| = α] and all edge costs are uniform i {0, . . . , 2n 1} [c(si, si+1) = c]. If the slope α is strictly below the edge cost c then the agent with a lookahead of 1 will climb the grade to state n without any scrubbing, regardless of the tie-breaking schema, and reach the goal in Θ(n) steps. Proof. The proof is by induction on the state number k. We will show that whenever the agent is in the state 0 k n 1 it will set the heuristic of sk to be greater than the heuristic of sk+2 (i.e., hnew(sk) > h(sk+2)) and move to state sk+1. We call this the slope-edge property: the slope growing slower than the edges causes the agent to climb the slope without backtracking. Base case: k = 0. The agent will increase h(s0) to h(s1) + c and move to s1 as its only choice. Starting with c > α we derive:

hnew(s0) h(s1) > h(s2) h(s1) (34)

hnew(s0) > h(s2). (35)

which proves the slope-edge property for k = 0. Inductive step. Suppose the slope-edge property holds for k = j, j n 3. Then it follows that the agent is now in the state sj+1 and hnew(sj) > h(sj+2). From this we

Sturtevant & Bulitko

immediately conclude that the agent will make its next move to sj+2. Before the move it will set hnew(sj+1) to min {hnew(sj) + c, h(sj+2) + c} = h(sj+2)+c. Since h(sj+3) = h(sj+2)+α and c > α, we conclude that hnew(sj+1) > h(sj+3) which makes the slope-edge property hold for k = j + 1. Once the agent reaches state n, the heuristic is monotonically decreasing towards the goal, and the agent will follow the slope downward until reaching the goal. 2

The lemma hinges on the interplay between α and c. The interplay can be satisﬁed by either non-unit edge costs or non-integer valued initial heuristic. Our example above uses c = 1.5 and α = 1.0, but the lemma holds for c = 1.0 and α = 0.5, as well as many other values of c and α.

4.0 4.5 5.0 5.5 6.0 7.0 8.0 9.0

3.0 6.5 7.5 8.5

2.0 6.0 7.0 8.0

1.0 5.5 6.5 7.5

G A 6.0 7.0

4.0 4.5 5.0 5.5 6.0 A 8.0 9.0

3.0 8.0 7.5 8.5

2.0 7.5 7.0 8.0

1.0 7.0 6.5 7.5

G 6.5 6.0 7.0

Figure 11: The agent climbs a heuristic grade without scrubbing (intermediate steps omitted).

This situation can happen in practice, as illustrated in Figure 11. The state with the agent is marked A and the state with the goal is marked G. All other states are labeled with their initial heuristic value. Diagonal edges have cost 1.5 and the heuristic is octile distance, which is the shortest path with no obstacles and only diagonal and cardinal movements allowed. A real-time heuristic search agent with a lookahead of 1 will walk up along the wall because the heuristic increases by only α = 0.5 per step while the edge costs along the agent s path are 1. As a result, the agent will walk directly out of the local heuristic minimum and continue to the goal, traversing a shortest path with no scrubbing. The example, however, is somewhat fragile. If we extend the map downward, the agent will no longer necessarily walk directly to the goal. We illustrate this in Figure 12. In this case the agent begins walking upwards as before, but when it reaches the diagonal line shown in the right side of the ﬁgure, the heuristic no longer changes by 0.5 per step. Because the slope-edge property is no longer maintained, the agent can no longer move against the heuristic slope. At this point the agent will have four choices of where to move next (shown as four gray cells in the ﬁgure) and, with unfavorable tie breaking will backtrack. Running this example shows that, in this particular case, typical scrubbing behavior then follows.

5.2 Larger Lookahead

In this section we show that equipping the agent with a larger lookahead, breaking Assumption 3, in the style of LSS-LRTA* (Koenig & Sun, 2009) can eliminate systematic scrubbing even in a search problem with unit edge costs.

Scrubbing in Learning Real-time Heuristic Search

6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5

5.0 7.5 8.5 9.5

4.0 7.0 8.0 9.0

3.0 6.5 7.5 8.5

2.0 6.0 7.0 8.0

1.0 5.5 6.5 7.5

G A 6.0 7.0

6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5

5.0 A 8.5 9.5

4.0 8.5 8.0 9.0

3.0 8.0 7.5 8.5

2.0 7.5 7.0 8.0

1.0 7.0 6.5 7.5

G 6.5 6.0 7.0

Figure 12: The agent cannot climb the heuristic grade without scrubbing.

First, we redeﬁne our agent algorithm to allow for larger lookahead from Algorithm 1 to Algorithm 2. In the terminology of A* the algorithm works as follows: as long as the goal is not reached (line 3) the agent expands the search space from its current state st. The expansion is carried out via an OPEN set which is seeded with the current state and at each moment of time contains all states that have been generated but not expanded. To expand a state means to generate its neighbors. Such neighbors are placed on the OPEN set unless they are already in the CLOSED set (line 9). The CLOSED set contains all states that have been expanded (line 8). States in the OPEN set are expanded in the order of their minimum f = g + h cost, where the g-cost is the distance from the agent s current state st (line 7). The number of expansions per step is at most l, an algorithm parameter (line 10). When the expansion process is ﬁnished the agent updates the heuristic value for all states on the CLOSED list (line 12). We call these states the local learning space. Then it changes its current state to the most promising state on the OPEN list (line 13). In lines 12 and 13 we use a diﬀerent cost function, c LS(s, s ). This is the cost of the shortest path between s and s using only the states inside OPEN and CLOSED.

We now provide an example of a set of problems where, using larger lookahead, an agent can walk directly out of a local minima and follow an optimal path to the goal without scrubbing. Similar to the example in Section 5.1, a heuristic barrier will be created behind the agent that drives it forward and precludes backtracking/scrubbing. Our example is in Figure 13. In this example all edge costs are 1. Each state is marked with the current heuristic value, and states are labeled from (b) to (n). The state with the agent is marked with an A in the lower left corner. The agent has a lookahead of l = 5.

In the ﬁrst row, the agent will have the gray states in its local learning space. The darker/red states indicate the OPEN list, which is used as the basis for updating heuristic values; the agent will also move to one of these states after learning. No matter how ties are broken (in the f-cost metric used for ordering state expansions), the same ﬁve states will be in the local learning space; the gray states must be expanded. But, because state (h) has a lower heuristic than state (b), the agent necessarily moves to state (h). Upon reaching state (h) the space process repeats, except with higher heuristic values. This process is shown in a similar manner as past examples in Figure 14.

Sturtevant & Bulitko

Algorithm 2: Real-time Heuristic Search with Lookahead

input : search problem (S, E, s0, sg, h), lookahead l output: path (s0, s1, . . . , s T ), s T = sg 1 t 0

3 while st = sg do

4 OPEN {st}

7 n best state in OPEN

8 CLOSED CLOSED {n}

9 OPEN OPEN (N(n) \ CLOSED)

10 until OPEN = |CLOSED| = l

11 for each state s CLOSED do

12 ht+1(s) min s OPEN(c LS(s, s ) + ht(s ))

13 st+1 arg min s OPEN(c LS(st, s ) + ht(s ))

6 5 4 3 4 5 5 6 7 7 8 9 9 A

6 7 8 8 7 6 5 6 7 7 8 9 9 A

6 7 8 8 9 10 10 9 8 7 8 9 9 A

(b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) (m) (n)

Figure 13: The agent climbs a heuristic grade without scrubbing.

1 2 3 4 5 6 7 8 9 10 11 12 13 0

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1 2 3 4 5 6 7 8 9 10 11 12 13 0

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1 2 3 4 5 6 7 8 9 10 11 12 13 0

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

Figure 14: The agent climbs a heuristic grade without scrubbing.

Scrubbing in Learning Real-time Heuristic Search

This example can be generalized in two ways: ﬁrst to any value for the lookahead (l), and second to arbitrary long sequences of states. A given problem instance, however, may not generalize to a diﬀerent lookahead or start state. We build an example that shows that with lookahead l = 3 we can build arbitrarily long sequences of states which a lookahead agent can traverse without scrubbing, reaching the goal in no more than |S| steps.

s-1 s0 s1 s2

1.0 1.0 1.0 1.0

1.0 1.0 1.0

Figure 15: General example for l = 3.

Lemma 4 Consider a chain of states {s 2, s 1, s0, . . . , s3n} with s0 being the starting state and s3n = s2n+n being the goal. The heuristic of state si is i for i < 0. The heuristic of state si is i/2 for 0 i 2n. The heuristic of state si is 3n i for i > 2n as shown in Figure 15. An agent with a lookahead of 3 will climb the heuristic grade to state 2n without any scrubbing, regardless of the tie-breaking schema, and reach the goal in 3n steps. Proof. We prove this by inductively showing an invariant on the heuristic values in the two neighbors on each side of the agent. This invariant will hold until the agent reaches each state up to s2n, after which the agent can walk directly to the goal. The invariant is that the current state has heuristic v, both neighbors of the current state have the same heuristic value v + 1, the next neighbor to the right has the value v + 1, and the next neighbor to the left has the value v + 2. Furthermore, along the path the agent never backtracks to the left, stopping to learn only on the even states s0, s2, s4, . . . , s2n. Base case: The agent starts at s0 with heuristic 0. By deﬁnition, states s1 and s 1 both have heuristics of 1, state s2 has a heuristic of 1, and state s 2 has a value of 2. After learning, h(s 1) = 3, h(s0) = 3 and h(s1) = 2. The agent thus moves to state s2 with heuristic 1. States s1, s3, and s4 have heuristic 2, and s0 has heuristic 3. The agent has only moved to the right thus far, and is in an even-numbered state, so invariant holds for the agent s new location.

v+1 v+2 v+1 v+1

si-1 si si+1 si+2

1.0 1.0 1.0 1.0 1.0

Previously unvisited

Figure 16: Current heuristic values at the induction step.

Inductive step. The current state of the graph at the beginning of the induction step is shown in Figure 16. We assume the invariant holds for the current agent state si with heuristic v = i/2 which is i/2 since i is even by the invariant. Because states

Sturtevant & Bulitko

si+1, si+2, si+3, si+4 have yet to be visited or included in the learning, their heuristic values will be v + 1, v + 1, v + 2, v + 2 respectively. This follows from the initial heuristic values and the fact that i is even. Furthermore, according to the invariant, states si 1 and si+1 both have heuristics of v + 1, state si+2 has a heuristic of v + 1, and state si 2 has a value of v + 2. After learning, h(si 1) = v + 3, h(si) = v + 3 and h(si+1) = v + 2. The agent then must move to si+2 with heuristic v + 1, which meets the even state condition. h(si+1) was just updated to v + 2 and h(si) to v + 3. States and si+3 and si+4 have yet to be visited or be part of the local learning space, so they still have their original heuristic values. In particular, since i is even, they will have the heuristic value (i + 3)/2 and (i + 4)/2 respectively, which are both equal. Thus h(si+3) = v +2 and h(si+4) = v +2. The invariant still holds, generalizing the example. The remaining question is what happens after the agent reaches state s2n. At this point the heuristic values to the left of the agent are higher than those to the right where they decrease directly with the edge cost until they reach zero. Thus, the agent will always prefer to move right until it reaches the goal. The agent never backtracks and moves directly from state s0 to s3n, thus reaching the goal in 3n steps. 2

This example is for a ﬁxed, odd-valued lookahead. When the structure here is wellunderstood, it is not diﬃcult to construct similar examples for other odd-valued lookaheads. If the lookahead is even, an additional state with one neighbor is needed next to each state visited by the agent (i.e., a single new state connected to each state s2i). This extra state will never be visited, because it is a dead end, and essentially reduces the lookahead by one, reducing back to the odd-value case for the lookahead.

5.3 Discussion

The examples presented above require particular starting conditions and can be broken by slightly changing their properties such as the starting state, agent lookahead, and the heuristic slope. Thus, while these examples conﬁrm that, under our assumptions, the agents can, under certain circumstances, reach the goal without scrubbing, they do not guarantee that this will happen in practice on a wide range of real-world problem instances (Section 6). As it may not always be a priori clear which parameters an LSS-LRTA* style agent needs to use to avoid scrubbing on a particular problem, we propose several general approaches to combat scrubbing. They work by removing one or more of the assumptions used in our theoretical analysis. The ﬁrst approach is to learn faster, trying to violate the bound on constant learning per step. There are several ways to do this. Algorithms such as f-LRTA* (Sturtevant & Bulitko, 2011) and LSS-LRTA* with swamps (Sharon, Sturtevant, & Felner, 2013) can prune states from the state space, eﬀectively raising their heuristics to inﬁnity in a single step. The second approach is to use a better tie-breaking rule. Hern andez and Baier (2012) have developed tie-breaking rules that work very well on grid worlds, but on a small fraction of problems result in very poor performance. FALCONS also learns g-costs to do better tie-breaking (Furcy & Koenig, 2000). The third approach is to decrease h to decrease the size of the local minima from which the agent must escape. If we use the 0 initial heuristic there will be no local minima,

Scrubbing in Learning Real-time Heuristic Search

resulting in h = 0 and rendering the theoretical analysis in this paper inapplicable, but the agent would have no guidance and would wander aimlessly. A better approach is to multiply the heuristic by some constant 0 < ϵ < 1, reducing the magnitude of h. This is in contrast to raising the initial heuristic multiplicatively in an attempt to reduce its heuristic error (Shimbo & Ishida, 2003). The latter method is equivalent to putting a weight on the g-cost which was introduced in real-time heuristic search by Bulitko (2004) with the equivalence proven by Bulitko and Lee (2006). The question of how to perform weighting was recently revisited by Rivera, Baier, and Hern andez (2015). Finally, we could avoid the value-iteration based heuristic search approach altogether, using algorithms such as RIBS (Sturtevant et al., 2010) and EDA* (Sharon, Felner, & Sturtevant, 2014).

5.4 Convergence Travel

The results derived thus far apply just to travel on the ﬁrst-trial. If one runs Algorithm 1 or Algorithm 2 repeatedly, preserving the learning and using the same start and goal, some of the heuristic values will eventually converge to the true distance to the goal along at least one optimal path from the start to the goal (breaking Assumption 5) (Bulitko & Lee, 2006). We can thus build a more general bound for the learning required for convergence. To start, we derive an analogous result to Lemma 1, showing that the learning per step is constant-bounded when using larger lookahead.

Lemma 5 Assuming constant-bounded edge costs, constant-bounded branching factor, and constant-bounded lookahead, the total change in heuristic values during each learning step of Algorithm 2 is constant bounded.

Proof. Algorithm 2 updates the heuristic of a state s in the closed list in line 12 of the pseudo code based on the heuristic of a state s in the open list and the distance between s and s . The number of states on the closed list is constant bounded by the constantbounded lookahead, l. The shortest path (c LS) between a state s on the open list and a state s on the closed list is bounded from above by the maximum edge cost times l times the branching factor. These three values are all constants, so the shortest path is constant bounded. Additionally, the change in heuristic is constant bounded due to consistency. Thus, the change in heuristic for any state on the closed list is also constant bounded. Put together, the total change in heuristic values for all states in CLOSED is also constant bounded. 2 Previously, we used a cut set analysis to determine the minimum learning required for some state in the state space. Here, we bypass that analysis and look at the optimal heuristic that must be learned on at least one optimal path between the start to the goal. Let P (s0, sg) be the set of all optimal paths between the start and the goal, and let Pi be the ith such path. We deﬁne P to be the optimal path that requires the minimum learning - that is, it has the smallest diﬀerence between the initial and ﬁnal heuristic after convergence. Then, we can deﬁne alternate versions of h and s for convergence.

Sturtevant & Bulitko

P = arg min Pi P (s0,s G)(max s Pi [h (s, sg) h0(s)]) (36)

h = max s P [h (s, sg) h0(s)] (37)

s = arg max s P [h (s, sg) h0(s)] . (38)

In this deﬁnition, h depends only on the error between the perfect heuristic and the initial heuristic and the fact that the algorithm converges to the perfect heuristic on one path. It does not depend on tie-breaking, lookahead, or the edge costs in the environment. In order to make the connection between L min and T min below, we do, however, assume that edge costs are constant bounded to ensure that the learning per step is also constant bounded. Thus, we can provide a convergence form of Equation 12:

n S max{0, h 2h (s , n)}. (39)

We can also show that the convergence travel is bounded from below by the convergence learning:

T min(S) Ω(L min(S)) . (40)

Thus, on a series of polynomial state spaces where h grows with the map radius, systematic scrubbing will necessarily occur for convergence travel.

Corollary 3 (Systematic Scrubbing for Convergence). Consider a series of search problems {S1, S2, . . . } with locally isotropic polynomial state spaces {S1, S2, . . . } of the dimension d (i.e., the number of states at radius r from a given state is γrd 1). Each state space Si has radius ri = max a,b Si h (a, b). Suppose the initial heuristic for search problem Si

has heuristic error h µri where µ is a positive constant. Also, suppose that the heuristic state space Si extends for at least h/2 around the state s . Then any real-time heuristic search algorithm that runs to convergence will necessarily scrub systematically.

Proof. As the state space radius ri increases, the heuristic error h will increase at least as µri. The minimum amount of travel T min will asymptotically grow at least as rd+1 i (Equation 22 holds trivially for h as well as h given that they both grow as µri). At the same time, the number of states in the state space (|Si|) will asymptotically grow no faster than rd i (follows from (23)). This means that T min(Si) ω(|Si|) and thus the agent will scrub systematically. 2

Note that we have not analyzed here any inﬂuence of number of trials performed by the agent. We leave this analysis for future work.

Scrubbing in Learning Real-time Heuristic Search

6. Experimental Results

The goal of this paper is to investigate conditions for scrubbing to occur. We showed, under certain assumptions, that, given the heuristic learning h required in state s , the agent must perform asymptotically at least d+1 h moves (Tmin) to solve a problem in polynomial state spaces of dimension d. If h is linear in the radius r of the state space where |S| Θ(rd) then Tmin(S) ω(|S|) and systematic scrubbing will occur. In the experimental section of this paper we focus primarily on elements of the agent construction, such as lookahead and tie-breaking rules, but also relax the assumption of unit edge costs. We do not address the questions of whether h is linear in the radius or whether |S| Θ(rd) in general. Our initial proof required an unfavorable tie-breaking rule, so in our ﬁrst set of experiments we will explore what happens as we vary the tie-breaking rule and scale the size of the state space, showing that scrubbing still occurs in practice. Our initial proof also did not generalize when we relaxed the unit-cost and 1-step lookahead assumptions. So, in our second experiment we relax these assumptions and measure performance on game maps, again showing that scrubbing occurs in practice. To further validate these results and the practical value of our theoretical claims, we implemented a version of LSS-LRTA* that uses essentially the opposite tie-breaking rule from τ used in our proofs. In particular, it learns g-costs like FALCONS (Furcy & Koenig, 2000) and f-LRTA* (Sturtevant & Bulitko, 2011), and breaks ties by moving towards the state with the maximum g-cost. We call this algorithm g LSS-LRTA*. Our theoretical tiebreaking rule is conservative, choosing to move back towards the start state (towards states with minimum g-cost) when possible. g LSS-LRTA* is aggressive, moving away from the start state when possible. Evidence both here and elsewhere (Sturtevant, 2012) suggests that the game maps are two dimensional. To conclude, we look at maze maps taken from the moving AI repository, which are estimated to be one-dimensional (Sturtevant, 2012). We repeat our experiments on these maps showing that scrubbing is also occurring.

6.1 Experiments on the Corner Map

Consider the corner map used earlier in the paper (Figure 8). This example contains a corner-shaped wall with the start s0 in the upper left corner and the goal sg behind the wall in the bottom right corner. We now consider that the map is 8-connected; diagonal movement costs 1.5. An octile distance heuristic will mislead the agent into traveling to the state labeled s (Equation (5)) while trying to reach the goal. Under the tie-breaking schema τ constructed earlier in this paper, the ﬁnal heuristic value of that state, h T (s ), will be raised to at least ˆh(s , sg) = h0(ˆs), based on the states marked ˆs in the ﬁgure. On this map ˆh(s , sg) = h0(ˆs) = n where n is the number of cells along the side of the map. So under τ it will be the case that h T (s ) n. By recording h T (s ) ˆh(s , sg) = h T (s ) n we can see how diﬀerent tie-breaking schema compare to τ. Speciﬁcally, a measurement of h T (s ) n < 0 indicates that the agent did less learning in the state s than it would have under τ. A measurement of h T (s ) n indicates that at least as much learning was performed as is required by τ.

Sturtevant & Bulitko

Max Difference Average Difference Min Difference

Heuristic Diff. (h T(sΔ)-h0(ŝ))

Corner Length/Width (n)

20 40 60 80 100

g LSS(1) g LSS(10)

Heuristic Diff. (h T(sΔ)-h0(ŝ))

Corner Length/Width (n)

20 40 60 80 100

Figure 17: Values of h T (s ) n recorded by running LSS-LRTA* and g LSS-LRTA* on corner maps of diﬀerent sizes.

400+2.7x2.5

LSS-LRTA*(1)

Dist Traveled

Max Learning (Δ) 1 10 100

400+0.3x2.55

LSS-LRTA*(10)

Dist Traveled

Max Learning (Δ) 1 10 100

400+0.03x2.65

LSS-LRTA*(100)

Dist Traveled

Max Learning (Δ) 1 10 100

Figure 18: Distance traveled versus maximum learning for any state when solving a problem instance using LSS-LRTA* with lookahead depths (l) of 1 (top left), 10 (top right) and 100 (bottom).

As long as h T (s ) n remains constant as the map radius grows, then scrubbing is occurring. This follows from our analysis in Section 4.8 - subtracting a constant in Equation (14) will not change the asymptotic complexity of the result.

Scrubbing in Learning Real-time Heuristic Search

For each value of n {10, 11, . . . , 100} (i.e., n n map with Θ(n2) states), we ran 20 trials of LRTA* with lookahead of 1 and random tie breaking (instead of τ). We also experimented with ﬁxed tie-breaking (chosen deterministically by operator orderings and the internal data structures) and got similar results. The average value, maximum value and minimum value of h T (s ) n on the 20 trials as a function of n are plotted in Figure 17(a). Analyzing the underlying data, we ﬁnd that LRTA* with randomized tie-breaking performed less than n learning in s in 18% of the trials. That is, 18% of the data points fall below the zero line. We used linear regression to ﬁt a line to the max and min values over diﬀerent segments of the data to test if they were growing. Beyond low values of n, the min and max diﬀerences do not seem to grow signiﬁcantly as the problem size increases. Thus, we can conclude that scrubbing is occurring in these experiments. In Figure 17(b) we look at the performance of g LSS-LRTA* on the same corner map with lookahead 1 and 10. With lookahead 1, g LSS-LRTA* exhibits the same scrubbing behavior as LSS-LRTA* as the radius of the map scales. However, with the lookahead of 10, g LLS-LRTA* is able to escape the corner map without scrubbing. On the largest map tested, where the radius of the local minima is 100, g LSS-LRTA*(10) only increased the heuristic value of s by 15 We suspect that g LSS-LRTA* does not need to raise to the heuristic value of the corner state as much because on the corner map moving away from the start state is correlated to moving closer to the goal. In the next section we look at more complex maps from the game Dragon Age: Origins to see how LSS-LRTA* and g LSS-LRTA* perform there.

6.2 Pathﬁnding on Video-Game Maps

In this section we look at pathﬁnding on video-game maps. Our experiments not only violate assumptions 1-3 (unit edge costs, integer heuristics, and lookahead of one), but the maps are also not guaranteed to be locally isotropic around any state. Despite this, we see that a measure of the heuristic learning correlates with movement that grows asymptotically faster than the state space. Furthermore, experiments with g LSS-LRTA* show that it has worse average-case performance that LSS-LRTA*. We performed these experiments on the Moving AI benchmark set (Sturtevant, 2012), on all Dragon Age: Origins maps with the optimal solution cost in [400, 404). A total of 600 problems were used over 60 maps. We ran these problems with LSS-LRTA* (Koenig & Sun, 2009) with lookahead depths of 1, 10 and 100 (LSS-LRTA* with lookahead 1 is equivalent to Algorithm 1). Diagonal movement was allowed with cost 1.5. Ties were broken in a ﬁxed way in each state, according to the operator ordering and data structures, without randomization. Note that the optimal solution cost is not predictive of the distance traveled when solving the problem, so this setup gives a wide range of problem diﬃculties that are constrained to take at least 400 steps to solve. We measured the maximum learning = max s R(S) h T (s) h0(s) that occurred in any

state as well as the total distance traveled before reaching the goal, and then plotted one point for each problem instance in our test set. If scrubbing is not occurring in practice, then all the values of will be constant-bounded; otherwise we expect a range of values for

Sturtevant & Bulitko

400+2.7x2.5

g LSS-LRTA*(1)

Dist Traveled

Max Learning (Δ) 1 10 100

400+0.3x2.45

g LSS-LRTA*(10)

Dist Traveled

Max Learning (Δ) 1 10 100 1000

400+0.05x2.40

g LSS-LRTA*(100)

Dist Traveled

103 104 105 106

Max Learning (Δ) 1 10 100 1000

Figure 19: Distance travelled versus maximum learning for any state when solving a problem instance using g LSS-LRTA* with lookahead depths (l) of 1 (top left), 10 (top right) and 100 (bottom).

. If the number of states at a given radius in a map grows polynomially, then the total movement should grow faster than a polynomial with degree two. The resulting scatter plots are found in Figure 18. We visually ﬁt a polynomial of the form y = 400 + c1 xc2 to the data, as we knew that all problems have the optimal solution cost between 400 and 403. (y is the distance traveled and x is .) The values for c1 and c2 for each of the three lookahead depths are shown in the ﬁgure, although slightly diﬀerent values do not signiﬁcantly change the curves or the residuals. These two-dimensional video-game maps are not isotropic around the state s and are not exactly polynomial due to their topology, have non-unit-cost edges, have non-unit edge costs, and and may not extend far enough from s to be locally isotropic (Section 4.9). As a result, our theory does not oﬀer the asymptotic bound of Tmin Ω(( h)d+1) = Ω(( h)3) on the travel to hold. Furthermore, the maximum learning amount we measure is not h deﬁned earlier in the paper. Yet, the manually ﬁt polynomial curves appear to come close with the degree of the polynomial being between 2.5 and 2.65. Note that the degree of the polynomial is greater than two, the maximum dimensionality of the maps. The data suggests that is predictive of the total movement required to reach the goal and that scrubbing is occurring in practice. Results for the same experiments with g LSS-LRTA* are found in Figure 19. On these problems the algorithm sometimes raises the heuristic substantially higher than LSS-LRTA* and also travels substantially further.

Scrubbing in Learning Real-time Heuristic Search

LSS-LRTA*(10) g LSS-LRTA*(10)

Max Learning (Δ)

Estimate of h(s0, sg)-h0(s0) 0 25 50 75 100 125

Figure 20: A comparison of the maximum learning in practice ( ) to an estimate of the learning required between the start and goal states (ˆh(s0, sg) h0(s0)).

If the diﬀerent tie-breaking rule achieved better performance, we would expect a smaller range of values on the x-axis (as compared to Figure 18), because there would be fewer states with large amounts of learning. Instead, with lookahead 10 and 100 the range of values is signiﬁcantly increased, with some states having their heuristic raised by almost 1000. A two-sample Kolmogorov-Smirnov test comparing the LSS-LRTA* and g LSS-LRTA* results on the same problems suggests that the diﬀerences are signiﬁcant with lookahead of 10 or 100, but not with lookahead 1. That is g LSS-LRTA* provides worse performance that just LSS-LRTA*. Looking deeper into the data with lookahead 10 we ﬁnd that, although the mean maximum learning is similar (55.29 for LSS-LRTA*(10) versus 59.01 for g LSS-LRTA*(10)), there is a higher standard deviation for g LSS-LRTA*(10) (68.35 versus 45.79 for LSS-LRTA*). The data shows that there are more instances where tie-breaking towards states with higher g-cost can help the agent reach the goal faster (324 instances better versus 240 worse). On the other hand, when moving towards higher g-costs results in worse performance (distance and learning), it outweighs the gains on the other problems, leading to worse average performance. With lookahead 100 the diﬀerence in instance counts with better and worse performance is almost equal, but again, the loss of performance from this tie-breaking rule outweighs the gains. The data suggests that moving away from the start state quickly can, in the best case, improve performance, but in the worst case it can have signiﬁcantly detrimental eﬀects. These results suggest that the corner map is not representative of the video game maps. We now compare the learning required between the start and the goal (ˆh(s0, sg) h0(s0)), a lower-bound on h, to , the actual maximum learning performed by the agent. We estimate ˆh(s0, sg) h0(s0) by measuring over a single shortest path instead of over all shortest paths. If our lower-bound on h holds for LSS-LRTA*, then all points in Figure 20 would be above the straight line x = y in the ﬁgure. On our 600 pathﬁnding problems this was indeed the case for LSS-LRTA*(10) on all but one problem. On that one problem, the algorithm had the maximum amount of learning

Sturtevant & Bulitko

7.1507x2.3067

LSS-LRTA*(1)

Dist Traveled

Max Learning (Δ) 20 50 100 200

1.7745x2.2078

LSS-LRTA*(10)

Dist Traveled

Max Learning (Δ) 20 50 100 200

0.51333x2.0462

LSS-LRTA*(100)

Dist Traveled

Max Learning (Δ) 20 50 100 200

Figure 21: Distance travelled versus maximum learning for any state when solving a problem instances on maze maps with corridor size 8 using LSS-LRTA* with lookahead depths (l) of 1 (top left), 10 (top right) and 100 (bottom).

( ) of 54.5 yet the maximum diﬀerence between initial heuristic values along a particular optimal solution was 55. Repeating the experiments with g LSS-LRTA*(10), the number of such problems rose from 1 to 36 as illustrated by more markers below the line in the Figure. Simultaneously, g LSS-LRTA*(10) raised heuristic values of some states substantially higher than LSS-LRTA*(10).

6.3 Maze Maps

In the previous section we looked at maps from the game Dragon Age: Origins. Analysis of these maps using regression over the number of states at each level in a breadth-ﬁrst search suggests that they are nearly two-dimensional (Sturtevant, 2012). Speciﬁcally, polynomial regression for the equation a + b x + c x2 gave an average value of 0.31 for the constant c. According to this same measure, mazes in the benchmark set appear to be approximately one-dimensional, as regression gives a value of 0.01 to 0.03 for the constant c in this equation (Sturtevant, 2012). Thus, mazes represent a diﬀerent class of problems on which we can re-run our experiments. As before, our theoretical assumptions on the state spaces do not hold in these experiments. We duplicated our settings from the experiments in Section 6.2 on 10 maze maps with corridor width 8 from the Moving AI benchmark set. Because we had fewer maps, we increased the number of problems solved by solving all probelsm with optimal solution between 400 and 439, resulting in 1100 total problems solved. The results for LSS-LRTA* are in Figure 21. We computed the best-ﬁt curve using least squares for the data to the equation c1 xc2, which is also shown.

Scrubbing in Learning Real-time Heuristic Search

The ﬁts have correlation of 0.94, 0.94, and 0.91 respectively for lookaheads 1, 10, and 100. On these maps we see that the degree of the ﬁt polynomial is two or greater. As the maps are one-dimensional and is not constant, these results suggest that scrubbing is occurring in practice. In the initial trials on the maze problems, g LSS-LRTA* had worst-case travel distance several orders of magnitude higher than LSS-LRTA*. This precluded us from running the experiments in a reasonable amount of time and suggests that g LSS-LRTA* is not a practical algorithm on such maps.

7. Conclusions

The primary contribution of this paper is the development of a non-trivial lower bound on the minimum travel that an LRTA*-like real-time heuristic search agent may have to perform to reach the goal state. While previous work has provided examples of problems where state revisitation (i.e., scrubbing) would occur, we provide general conditions that can be used for analysis. In idealized polynomial state spaces the lower bound grows asymptotically faster than the state space. This means that the agent will necessarily scrub an undesirable behavior in many applications such as real-time pathﬁnding in video games. These theoretical results are supported experimentally on real-world search problems. This result may appear discouraging, as it suggests that common real-time heuristic search algorithms may not, on their own, be able to avoid scrubbing. While the proofs rely on a several restrictive assumptions, we expect that our results hold more broadly. From the proof we suggest four directions for future work trying to improve asymptotic performance. This include (1) increasing the amount of learning performed in each step, (2) using diﬀerent tie breaking rules, (3) decreasing the size of the heuristic local minima and (4) developing algorithms that do not use value-iteration as the core technique driving agent behavior. We hope that future researchers will be able to point to these four directions to explain why their approaches improve performance, and they will be able to identify underlying assumptions about the state space that makes their approaches successful. While we have shown that our lower bound does not hold when some of our assumptions are broken, we continue to look for properties that we could use to extend the lower bounds to a larger class of problems and agents.

Acknowledgements

The second author received support for this work from the National Research and Engineering Council (NSERC).

Appendix A. Special Case: Locally Isotropic Exponential State Spaces

In this section we consider the case of exponential state spaces. We limit our analysis to locally isotropic exponential state states spaces where the number of states exactly cost r away (up to h/2) from a given state s is:

η(r) = λbr, r [0, h/2] (41)

Sturtevant & Bulitko

where b is the branching factor of the space. As in Corollary 2, we will assume that the state space size does not expand substantially beyond h/2 so the total state space size is asymptotically the same as the size of the state space within the radius h/2 from the state s (assumption ). In such a case we can repeat the derivation in Section 4.9 by substituting η(r) = λbr in (18): Z h

0 η(r)( h 2r)dr =

0 λbr( h 2r)dr =

0 brdr 2λ Z h

0 brrdr. (42)

To take the integral R h

2 0 brrdr in (42) we introduce the function g(r) = br

ln b so that g (r) = br

and integrate by parts: Z brrdr = Z g (r)rdr =

g(r)r Z g(r)r dr =

g(r)r Z g(r)dr = r br

ln2 b + C = br

With (43) equation (42) becomes:

Combining (12), (18) and (44) we conclude that for locally isotropic exponential spaces of branching factor b that extend for distance at least h/2 around the state s , the minimum amount of total learning is lower-bounded as:

Lmin(S) Ω b

2 , h . (45)

Scrubbing in Learning Real-time Heuristic Search

As the amount of learning per step is constant-bounded, the same asymptotic lower bound applies to the travel cost:

Tmin(S) Ω b

2 , h . (46)

In locally isotropic exponential spaces the number of states within radius h/2 of the state s grows as:

0 η(r)dr = Z h

0 λbrdr Θ b

which is asymptotically the same as the total state space according to the assumption ( ) and asymptotically the same as the lower bound on the minimum amount of travel (46).

This means that in locally isotropic exponential spaces our lower bound Ω b h

2 is not suﬃcient to show systematic scrubbing (i.e., to prove an equivalent of Corollary 2). Note that (46) is a lower asymptotic bound on the amount of travel whereas (47) is both an upper and lower asymptotic bound on the state space growth. Thus, it may be possible that Tmin grows asymptotically faster than the state space size but our lower bound derived above is insuﬃcient to claim so.

Bulitko, V. (2004). Learning for adaptive real-time search. Tech. rep. http://arxiv.org/abs/cs.AI/0407016, Computer Science Research Repository (Co RR).

Bulitko, V., & Lee, G. (2006). Learning in real time search: A unifying framework. Journal of Artiﬁcial Intelligence Research, 25, 119 157.

Bulitko, V. K., & Bulitko, V. (2009). On backtracking in real-time heuristic search. Co RR, abs/0912.3228.

Edelkamp, S., & Schr odl, S. (2012). Heuristic Search - Theory and Applications. Academic Press.

Furcy, D., & Koenig, S. (2000). Speeding up the convergence of real-time search. In National Conference on Artiﬁcial Intelligence (AAAI), pp. 891 897.

Hern andez, C., & Baier, J. A. (2012). Avoiding and escaping depressions in real-time heuristic search. Journal of Artiﬁcial Intelligence Research (JAIR), 43, 523 570.

Hern andez, C., & Meseguer, P. (2005). LRTA*(k). In International Joint Conference on Artiﬁcial Intelligence (IJCAI), pp. 1238 1243.

Ishida, T., & Korf, R. (1991). Moving target search. In International Joint Conference on Artiﬁcial Intelligence (IJCAI), pp. 204 210.

Koenig, S. (2001). Agent-centered search. Artiﬁcial Intelligence Magazine, 22(4), 109 132.

Koenig, S., & Simmons, R. G. (1992). Complexity analysis of real-time reinforcement learning applied to ﬁnding shortest paths in deterministic domains. Tech. rep. CMU CS 93 106, School of Computer Science, Carnegie Mellon University, Pittsburgh.

Sturtevant & Bulitko

Koenig, S., & Simmons, R. G. (1993). Complexity analysis of real-time reinforcement learning. In National Conference on Artiﬁcial Intelligence (AAAI), pp. 99 105.

Koenig, S., & Simmons, R. G. (1996). The eﬀect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22(1-3), 227 250.

Koenig, S., & Sun, X. (2009). Comparing real-time and incremental heuristic search for real-time situated agents. Journal of Autonomous Agents and Multi-Agent Systems, 18(3), 313 341.

Koenig, S., Tovey, C., & Smirnov, Y. (2003). Performance bounds for planning in unknown terrain. Artiﬁcial Intelligence, 147, 253 279.

Korf, R. (1990). Real-time heuristic search. Artiﬁcial Intelligence, 42(2 3), 189 211.

Rivera, N., Baier, J. A., & Hern andez, C. (2015). Incorporating weights into real-time heuristic search. Artiﬁcial Intelligence, 225, 1 23.

Sharon, G., Felner, A., & Sturtevant, N. (2014). Exponential deepening A* for real-time agent-centered search. In AAAI Conference on Artiﬁcial Intelligence, pp. 871 877.

Sharon, G., Sturtevant, N. R., & Felner, A. (2013). Online detection of dead states in realtime agent-centered search. In Helmert, M., & R oger, G. (Eds.), Proceedings of the Sixth Annual Symposium on Combinatorial Search. AAAI Press.

Shimbo, M., & Ishida, T. (2003). Controlling the learning process of real-time heuristic search. Artiﬁcial Intelligence, 146(1), 1 41.

Shue, L.-Y., Li, S.-T., & Zamani, R. (2001). An intelligent heuristic algorithm for project scheduling problems. In Annual Meeting of the Decision Sciences Institute, San Francisco.

Shue, L.-Y., & Zamani, R. (1993a). An admissible heuristic search algorithm. In International Symposium on Methodologies for Intelligent Systems (ISMIS-93), Vol. 689 of LNAI, pp. 69 75.

Shue, L.-Y., & Zamani, R. (1993b). A heuristic search algorithm with learning capability. In ACME Transactions, pp. 233 236.

Sturtevant, N. R. (2012). Benchmarks for grid-based pathﬁnding. Transactions on Computational Intelligence and AI in Games, 4(2), 144 148.

Sturtevant, N. R., Bulitko, V., & Bj ornsson, Y. (2010). On learning in agent-centered search. In Autonomous Agents and Multiagent Systems (AAMAS), pp. 333 340. International Foundation for Autonomous Agents and Multiagent Systems.

Sturtevant, N. R., & Bulitko, V. (2011). Learning where you are going and from whence you came: Hand G-cost learning in real-time heuristic search. In International Joint Conference on Artiﬁcial Intelligence (IJCAI), pp. 365 370. AAAI Press.

Sturtevant, N. R., & Bulitko, V. (2014). Reaching the goal in real-time heuristic search: Scrubbing behavior is unavoidable. In Proceedings of the Symposium on Combinatorial Search (So CS), pp. 166 174.

Scrubbing in Learning Real-time Heuristic Search

Zamani, R., & Shue, L.-Y. (2001). A heuristic learning algorithm and its application to project scheduling problems. Tech. rep., Department of Information Systems, University of Wollongong.