Three New Algorithms to Solve N-POMDPs

Yann Dujardin, CSIRO, yann.dujardin@csiro.au
Tom Dietterich, School of EECS, Oregon State University, tgd@oregonstate.edu
Iadine Chadès, CSIRO, iadine.chades@csiro.au

Abstract

In many fields in computational sustainability, applications of POMDPs are inhibited by the complexity of the optimal solution. One way of delivering simple solutions is to represent the policy with a small number of α-vectors. We would like to find the best possible policy that can be expressed using a fixed number N of α-vectors. We call this the N-POMDP problem. The existing solver α-min approximately solves finite-horizon POMDPs with a controllable number of α-vectors. However, α-min is a greedy algorithm without performance guarantees, and it is rather slow. This paper proposes three new algorithms, based on a general approach that we call α-min-2. These three algorithms are able to approximately solve N-POMDPs. α-min-2-fast (heuristic) and α-min-2-p (with performance guarantees) are designed to complement an existing POMDP solver, while α-min-2-solve (heuristic) is a solver itself. Complexity results are provided for each of the algorithms, and they are tested on well-known benchmarks. These new algorithms will help users to interpret solutions to POMDP problems in computational sustainability.

Introduction

Partially observable Markov decision processes (POMDPs) provide a powerful tool for sequential decision-making under uncertainty. In computational sustainability, POMDPs are increasingly being applied to help solve important problems in biodiversity conservation. For example, rules of thumb were derived using POMDPs to best manage and monitor cryptic species such as the critically endangered Sumatran tiger (Chadès et al. 2008; McDonald-Madden et al. 2011), or the invasive parasitic plant species branched broomrape, which has long-lasting microscopic seeds (Regan, Chadès, and Possingham 2011). More recently, POMDPs were employed to adaptively manage renewable natural resources (Fackler and Pacifici 2014; Springborn and Sanchirico 2013) and migratory shorebirds threatened by the uncertain consequences of sea level rise (Nicol et al. 2015; 2013). Despite this growing interest, POMDP solutions are often too complex to be analyzed and communicated efficiently to stakeholders, which prevents the deployment of POMDP solutions (Tulloch et al. 2015; Nicol and Chadès 2012). Until recently, POMDP algorithms have focused on providing near-optimal solutions regardless of the number of α-vectors required. Indeed, current algorithms require many α-vectors even when solving small problems (Poupart, Kim, and Kim 2011). Here, we contribute to bridging the gap between POMDP solvers and their end users by providing approximate solutions that can be represented with a small number of α-vectors. The recently published algorithm α-min was a first attempt at approximately solving finite-horizon POMDPs given a limit N on the number of α-vectors (Dujardin, Dietterich, and Chadès 2015). We call this problem N-POMDP. α-min is, however, a greedy algorithm and does not provide satisfying performance guarantees. Here, we propose three new algorithms, based on a common approach, α-min-2, that directly searches for subsets of α-vectors of size N.
α-min-2-fast (heuristic) and α-min-2-p (with performance guarantees) are designed to work in complement of an existing POMDP solver, while α-min-2-solve (heuristic) is a solver itself. The new approach can solve infinite-horizon as well as finite-horizon problems and is able to outperform α-min. The paper is organized as follows. First, we introduce POMDPs and the α-min principles. Next, we formalize the N-POMDP problem. Then we present the α-min-2 principle and the three related algorithms: α-min-2-fast, α-min-2-p and α-min-2-solve. Finally, we apply the three algorithms to well-known benchmark problems and discuss the results.

POMDPs, N-POMDPs and the α-min principle

POMDPs and related work

POMDPs are models of sequential decision-making when the decision-maker does not have complete information about the current state of the system (Sigaud and Buffet 2013). Formally, a discrete finite-horizon POMDP is specified as a tuple {S, A, O, τ_a, Ω_a, R, H}, where:
- H = {0, ..., T−1} is the set of time steps, T ∈ ℕ; H is called the time horizon.
- S is the state space; for every t ∈ H, s_t ∈ S is the state of the system at t.
- A is the action space; a_t ∈ A is the action taken at t.
- O is the observation space; z_t ∈ O is the observation received at t.
- τ_a is the transition matrix for action a, with elements τ_a(s_t, s_{t+1}).
- Ω_a is the observation matrix for action a, with elements Ω_a(s_{t+1}, z_{t+1}).
- R is the reward matrix, with elements R(a_t, s_t).

For the sake of clarity, we define the following notation. For action a ∈ A and observation z ∈ O, let M_{a,z} be the matrix of dimension |S| × |S| such that M_{a,z}(s_{t+1}, s_t) = Ω_a(z, s_{t+1}) τ_a(s_{t+1}, s_t). For every a ∈ A, the vector r_a = R(a, ·) corresponds to the row of the matrix R associated with action a.

The optimal decision at time t may depend on the complete history of past actions and observations. Because it is neither practical nor tractable to use the history of the action–observation trajectory to compute an optimal solution, belief states (also called beliefs), i.e., probability distributions over states, are used to summarize the history and overcome the difficulties of imperfect detection (Åström 1965). A POMDP can be cast into a fully observable Markov decision process defined over the continuous belief state space B. Solving a finite-horizon POMDP exactly means finding an optimal policy Π_0 = (π_t)_{t∈H}, where π_t : B → A maps belief states at time t to actions. Π_0 maximizes the expected sum of rewards E[Σ_{t∈H} r_{a_t} · b_t] over the time horizon H (· denotes the scalar product). For each time step t, for a given belief state b_t and a given policy Π_t = (π_{t'})_{t'∈{t,...,T−1}}, the expected sum E[Σ_{t'∈{t,...,T−1}} r_{a_{t'}} · b_{t'}] is also referred to as the value function V_{t,Π_t}(b_t). A value function allows us to rank policies by assigning a real value to each belief b_t. An optimal policy Π*_t is a policy such that, for every b_t ∈ B and every Π_t, V_{t,Π*_t}(b_t) ≥ V_{t,Π_t}(b_t). Several policies can be optimal and share the same optimal value function V_t, which can be computed using Bellman's principle of optimality (Bellman 1957): for all b_t ∈ B,

V_t(b_t) = max_{a_t∈A} { r_{a_t} · b_t + Σ_{z_{t+1}∈O} p(z_{t+1} | a_t, b_t) V_{t+1}(b_{t+1}) }   (1)

where each component b_{t+1}(s_{t+1}) of the new belief can be computed as follows:

b_{t+1}(s_{t+1}) = [ Ω_a(s_{t+1}, z_{t+1}) Σ_{s_t∈S} τ_a(s_t, s_{t+1}) b_t(s_t) ] / [ Σ_{s_t, s'_{t+1}∈S} Ω_a(s'_{t+1}, z_{t+1}) τ_a(s_t, s'_{t+1}) b_t(s_t) ]   (2)

Equation 1 can be rewritten V_t = BL(V_{t+1}), where BL is the Bellman operator (Shani, Pineau, and Kaplow 2012), sometimes also called the backup operator (Pineau, Gordon, and Thrun 2006).
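As an illustration, here is a minimal sketch of Equations 1 and 2 in Python, assuming numpy arrays with the index conventions above (tau[a][s, s'] = τ_a(s, s') and omega[a][s', z] = Ω_a(s', z)); the function names and the V_next callable are ours, not part of the original paper.

```python
import numpy as np

def belief_update(b, a, z, tau, omega):
    """Bayesian belief update (Equation 2).
    b: belief over states, shape (|S|,); tau[a]: |S|x|S| transition matrix;
    omega[a]: |S|x|O| observation matrix. Returns (new belief, p(z | a, b))."""
    predicted = tau[a].T @ b              # P(s' | a, b), shape (|S|,)
    unnorm = omega[a][:, z] * predicted   # numerator of Equation 2
    p_z = unnorm.sum()                    # denominator, equal to p(z | a, b)
    if p_z == 0.0:
        return b, 0.0                     # observation z impossible under (a, b)
    return unnorm / p_z, p_z

def bellman_backup(b, r, tau, omega, V_next, actions, observations):
    """One application of the Bellman operator BL at belief b (Equation 1).
    r[a]: reward vector r_a; V_next: callable evaluating V_{t+1} at a belief."""
    best = -np.inf
    for a in actions:
        value = r[a] @ b
        for z in observations:
            b_next, p_z = belief_update(b, a, z, tau, omega)
            if p_z > 0:                   # skip unreachable observations
                value += p_z * V_next(b_next)
        best = max(best, value)
    return best
```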
While various algorithms from the operations research and artificial intelligence literature have been developed over the years, solving POMDPs exactly is intractable: finite-horizon POMDPs are PSPACE-complete (Papadimitriou and Tsitsiklis 1987) and infinite-horizon POMDPs are undecidable (Madani, Hanks, and Condon 2003). Equation 1 can be solved by directly manipulating α-vectors (Smallwood and Sondik 1973). For every t ∈ H, there exists a finite set Γ_t of vectors of dimension |S| (the so-called α-vectors) that entirely defines V_t, such that |Γ_t| is minimal and

∀t ∈ H, ∀b_t ∈ B, V_t(b_t) = max_{α_t∈Γ_t} α_t · b_t.   (3)

Given that Γ_{T−1} = {r_a | a ∈ A, r_a is not dominated}, one can in theory build a set of α-vectors defining the value function V_t from Γ_{t+1} at any time step t ∈ {0, ..., T−2}, because every α_t ∈ Γ_t can be written

α_t = [ r_{a_t} + Σ_{z_{t+1}∈O} (α_{t+1}^{a_t,z_{t+1}})ᵀ M_{a_t,z_{t+1}} ]ᵀ   (4)

where a_t ∈ A and α_{t+1}^{a_t,z_{t+1}} ∈ Γ_{t+1} for every z_{t+1} ∈ O. The set G_t of α-vectors given by Equation 4 is such that Γ_t ⊆ G_t. Note that some α-vectors of G_t may be dominated by others and therefore do not belong to Γ_t. (A point-based instance of this backup is sketched below.)
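For concreteness, the following sketch instantiates Equation 4 at a single belief, returning the α-vector of G_t that is maximal at that belief; it reuses the numpy conventions of the previous sketch, and backup_alpha is our illustrative name, not the paper's.

```python
import numpy as np

def backup_alpha(b, r, tau, omega, Gamma_next, actions, observations):
    """Point-based instance of Equation 4: for each action a, choose for every
    observation z the vector of Gamma_next whose back-projection through
    M_{a,z} is best at belief b, then return the resulting alpha-vector with
    the highest value at b."""
    best_alpha, best_value = None, -np.inf
    for a in actions:
        alpha = np.array(r[a], dtype=float)
        for z in observations:
            # M[s, s'] = tau_a(s, s') * Omega_a(s', z); M @ ap back-projects ap
            M = tau[a] * omega[a][:, z]
            projections = [M @ ap for ap in Gamma_next]
            scores = [proj @ b for proj in projections]
            alpha += projections[int(np.argmax(scores))]
        if alpha @ b > best_value:
            best_alpha, best_value = alpha, alpha @ b
    return best_alpha
```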
In most exact algorithms, the complexity, when calculated, is exponential in at least one component of the instance (Kaelbling, Littman, and Cassandra 1998). For example, a natural way to compute Γ_t from a given Γ_{t+1} is first to compute G_t entirely and then to prune the dominated vectors that are not useful for representing the value function. This approach has a complexity exponential in the number of observations |O| (Sigaud and Buffet 2013, sections 7.3 and 7.4). Conversely, the one-pass algorithm (Smallwood and Sondik 1973), the linear-support algorithm (Cheng 1988) and the relaxed-region algorithm (Cheng 1988) all try to generate Γ_t directly. In particular, the linear-support algorithm systematically looks for new beliefs that allow it to iteratively compute new vectors of BL(Γ_{t+1}). Although not proven by the authors, the complexity is then controlled by the number of explored beliefs, which in turn is at least exponential in |Γ_t|. The witness algorithm (Kaelbling, Littman, and Cassandra 1998) has a running time exponential in |A|.

N-POMDPs

Definition 1. An N-POMDP is a POMDP with an additional parameter N that defines the maximum size of any admissible policy represented by a set of α-vectors at each time step (for the infinite-horizon case, we consider that there is only one time step). Solving an N-POMDP means finding the best possible policy of size at most N at each time step (Problem 5):

max_{Γ_0,...,Γ_{T−1} ∈ Θ, ∀t∈H, |Γ_t| ≤ N} V_{0,Γ}(b_0)   (5)

where V_{0,Γ}(b_0) is the value function corresponding to the policy Γ = (Γ_t)_{t∈H} for the initial belief b_0, and Θ is the set of all possible sets of α-vectors. In the infinite-horizon case, we only consider one set Γ of α-vectors.

Proposition 1. The N-POMDP problem is NP-hard.

Proof. Because N-POMDPs cover both the finite- and infinite-horizon cases, it suffices to prove NP-hardness for one of them. For the finite-horizon case, this is a direct consequence of the NP-hardness of the exact backup operation: in (Littman, Cassandra, and Kaelbling 1995), it is stated that, unless P = NP, no representation of the optimal value function V_t given an optimal value function V_{t+1} can be found in time polynomial in the instance and in the number of α-vectors needed to describe V_t (i.e., in N in our case). If an N-POMDP could be solved in time polynomial in the POMDP instance and in N, one could then set N to the actual number of α-vectors needed to describe V_t and find an optimal policy of the corresponding POMDP in polynomial time.

The α-min principle

Dujardin, Dietterich, and Chadès (2015) show that an approximate solution for N-POMDPs can be found by minimizing the Bellman error iteratively at every time step (Problem 6):

min_{Γ̃_t ⊆ BL(Γ̃_{t+1}), |Γ̃_t| ≤ N} max_{b∈B} [BL(Ṽ_{t+1})(b) − Ṽ_t(b)]   (6)

where BL is the Bellman operator, Γ̃_t is the set of approximate α-vectors for BL(Γ̃_{t+1}) and Ṽ_t is the value function corresponding to Γ̃_t, t ∈ H. To approximately solve Problem 6, α-min solves Problem 7 N times at each time step using a mixed integer linear program:

max_{b∈B} [BL(Ṽ_{t+1})(b) − Ṽ_t(b)]   (7)

In each iteration, the α-vector that maximizes α · b* is added to the current Γ̃_t (Figure 1).

Figure 1: Two successive iterations (a) and (b) of α-min for a given time step t. In both figures, solid lines represent Ṽ_t, dashed lines represent BL(Ṽ_{t+1}), and belief b*_t corresponds to the current solution of Problem 7. (Reproduced from Dujardin, Dietterich, and Chadès 2015.)

α-min shares some similarities with the linear-support algorithm (Cheng 1988) because it iteratively constructs new α-vectors, but it is much faster than Cheng's algorithm and can solve larger problems. Neither α-min nor linear support is designed to solve N-POMDPs exactly. The main reason is that new α-vectors are added greedily at each iteration, so the optimal combination of α-vectors is unlikely to be found. One can easily construct a 2-state example (S = {0, 1}) where the greedy strategy is arbitrarily bad. Consider Γ = {α_1, α_2, α_3} with α_1 = (x, −x), α_2 = (0, 0), α_3 = (−x, x), x ∈ ℝ₊. α_1 is a dominating vector for b ∈ {b ∈ B | b(0) ≥ b(1)}, α_2 is dominating for b ∈ {b ∈ B | b(0) = b(1)} and α_3 is dominating for b ∈ {b ∈ B | b(0) ≤ b(1)}. A greedy approach starting with α_2 would return the policy Γ̃ = {α_2, α_1} or Γ̃ = {α_2, α_3} with a maximum gap of x, which can be arbitrarily large, while the optimal gap is 0 and corresponds to the policy Γ̃ = {α_1, α_3}. The goal of the α-min-2 algorithms is to solve Problem 6 by looking for the best combinations of α-vectors instead of using the α-min greedy strategy.

The α-min-2 algorithms

Given that good POMDP solvers already exist (Spaan and Vlassis 2005; Kurniawati, Hsu, and Lee 2008), a natural approximate approach to solving an N-POMDP is first to generate a good initial policy Γ without any size limitation, and then to select the best combination of α-vectors of Γ, i.e., the one minimizing the gap between the initial policy and the new policy (Problem 8):

g* = min_{Γ̃⊆Γ, |Γ̃|≤N} max_{b∈B} [V(b) − Ṽ(b)]   (8)

In doing so, we hope that if Γ is good enough, Γ̃ will be good enough too. This approach constitutes the main principle of both α-min-2-fast (heuristic) and α-min-2-p (approximate approach with performance guarantees). Unfortunately, α-min-2-fast and α-min-2-p are unable to solve N-POMDPs when no initial policy is given. If Γ is not good enough or not available, one would like to solve Problem 5 directly; α-min-2-solve aims to do that.

The α-min-2-fast algorithm

Suppose we have a policy Γ at hand and we want to solve Problem 8. Let s be the positive semi-definite function such that s(α̃, α) = max_{b∈B(α)} (α · b − α̃ · b) for all α̃, α ∈ Γ, where B(α) is the subspace of B where α dominates the other α-vectors. (Computing one value of s amounts to solving a small linear program; a sketch follows.)
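Each value s(α̃, α) can be computed with a linear program over the belief simplex, as the paper notes (|S| variables, |Γ| constraints). Below is a sketch using scipy.optimize.linprog; the function name s_gap is ours.

```python
import numpy as np
from scipy.optimize import linprog

def s_gap(alpha_t, alpha, Gamma):
    """s(alpha_t, alpha) = max over b in B(alpha) of (alpha - alpha_t) . b,
    where B(alpha) = {b in the simplex : alpha . b >= alpha' . b for alpha' in Gamma}."""
    n = len(alpha)
    # maximize (alpha - alpha_t) . b  <=>  minimize (alpha_t - alpha) . b
    c = np.asarray(alpha_t) - np.asarray(alpha)
    A_ub = np.array([np.asarray(ap) - np.asarray(alpha) for ap in Gamma])
    b_ub = np.zeros(len(Gamma))                 # (alpha' - alpha) . b <= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0, 1)] * n, method="highs")
    # If B(alpha) is empty (alpha is nowhere dominant), the LP is infeasible
    # and covering alpha costs nothing.
    return -res.fun if res.status == 0 else 0.0
```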
We then have

max_{b∈B} [V(b) − Ṽ(b)] = max_{b∈B} ( max_{α∈Γ} α · b − max_{α̃∈Γ̃} α̃ · b )
                        ≤ max_{α∈Γ} min_{α̃∈Γ̃} max_{b∈B(α)} (α · b − α̃ · b)
                        = max_{α∈Γ} min_{α̃∈Γ̃} s(α̃, α).

Solving Problem 9 then provides an upper bound for Problem 8:

min_{Γ̃⊆Γ, |Γ̃|≤N} max_{α∈Γ} min_{α̃∈Γ̃} s(α̃, α)   (9)

for N ≥ 1. Problem 9 is very similar to the k-center clustering problem (Gonzalez 1985) but cannot be solved using classic algorithms because s is not a distance. However, Problem 9 can easily be modelled as a mixed integer linear program, Linear Program 10, with |Γ|² 0-1 variables and constraints, where the values s(α̃, α) are pre-calculated using linear programs with |S| variables and |Γ| constraints each:

min g   (10)
s.t. g ≥ s(α̃, α) y_{α̃,α},   ∀α̃, α ∈ Γ
     Σ_{α̃∈Γ} y_{α̃,α} ≥ 1,   ∀α ∈ Γ
     x_{α̃} ≥ y_{α̃,α},   ∀α̃, α ∈ Γ
     Σ_{α̃∈Γ} x_{α̃} ≤ N
     y_{α̃,α} ∈ {0, 1},   x_{α̃} ∈ {0, 1}

where every 0-1 variable x_{α̃} is such that x_{α̃} = 1 iff α̃ ∈ Γ̃, and the y_{α̃,α} are 0-1 variables such that y_{α̃,α} = 1 iff α̃ ∈ Γ̃ and α̃ minimizes s(·, α) over Γ̃. Linear Program 10 is a mixed integer linear program with many 0-1 variables (|Γ|²) and is consequently slow to solve. We propose a faster way to solve Problem 9, using pure integer linear programs with only |Γ| variables, within an arbitrary precision p (Algorithm 1). The principle of Algorithm 1 is to perform a binary search on a decision version of Linear Program 10.

Algorithm 1 α-min-2-fast
1: procedure FAST(Γ, p, N ≥ 1)
2:   ϵ⁺ ← ϵ_ub, ϵ⁻ ← 0, δ ← ϵ⁺ − ϵ⁻
3:   while δ > p do
4:     ϵ ← (ϵ⁺ + ϵ⁻)/2
5:     for α̃, α ∈ Γ do
6:       if s(α̃, α) ≤ ϵ then c_{α̃,α} ← 1
7:       else c_{α̃,α} ← 0
8:     C ← {c_{α̃,α} | α̃, α ∈ Γ}
9:     Γ̃ ← LP_FAST(C, N)
10:    if LP_FAST(C, N) has no solution then ϵ⁻ ← ϵ
11:    else ϵ⁺ ← ϵ
12:    δ ← ϵ⁺ − ϵ⁻
13:  return Γ̃, ϵ⁺

Here, ϵ_ub is an upper bound on the optimal value ϵ of Problem 9, and Γ̃ (line 9) is the solution returned by LP_FAST, given by the vectors α̃ ∈ Γ with x_{α̃} = 1:

min Σ_{α̃∈Γ} x_{α̃}   (LP_FAST)
s.t. Σ_{α̃∈Γ} c_{α̃,α} x_{α̃} ≥ 1,   ∀α ∈ Γ
     Σ_{α̃∈Γ} x_{α̃} ≤ N
     x_{α̃} ∈ {0, 1},   ∀α̃ ∈ Γ

Remark. Linear Program LP_FAST actually does not need any objective function, since only feasibility is checked in Algorithm 1. An objective function has been added to obtain a formal integer linear program.

Proposition 2. Algorithm 1 solves Problem 9 within precision p in a finite number of iterations.

Proof. Computing a first upper bound ϵ_ub of g* is easy: choose α_0 ∈ Γ at random and set ϵ_ub = max_{α∈Γ} s(α_0, α) (which can be computed with linear programming). We then have ϵ_ub = max_{α∈Γ} min_{α̃∈{α_0}} s(α̃, α) ≥ min_{Γ̃⊆Γ, |Γ̃|≤N} max_{α∈Γ} min_{α̃∈Γ̃} s(α̃, α) for N ≥ 1, since {α_0} is a feasible choice of Γ̃. For ϵ = ϵ_ub, LP_FAST is feasible because the column (c_{α_0,α})_{α∈Γ} contains only ones, so it is enough to set x_{α_0} = 1 to obtain a feasible solution (given that N ≥ 1). If LP_FAST is feasible for a given ϵ, then ϵ is an upper bound on the optimal value of Problem 9. Indeed, we then have Γ̃ ⊆ Γ with |Γ̃| ≤ N (the vectors corresponding to x_{α̃} = 1) such that for every α ∈ Γ there is some α̃ ∈ Γ̃ with s(α̃, α) ≤ ϵ. This implies min_{α̃∈Γ̃} s(α̃, α) ≤ ϵ for every α ∈ Γ, and consequently max_{α∈Γ} min_{α̃∈Γ̃} s(α̃, α) ≤ ϵ. Similarly, if LP_FAST is infeasible, then one cannot find any solution Γ̃ in which every α ∈ Γ is covered by some α̃ ∈ Γ̃ within ϵ, so in this case ϵ is a lower bound. A binary search on ϵ can therefore solve Problem 9, by setting the first upper bound ϵ⁺ to ϵ_ub and ϵ⁻ to 0. This is exactly what Algorithm 1 performs.

Algorithm 1 has time complexity O(log(ϵ_ub/p) · P(|Γ|, log N) · 2^{|Γ|}), where P is a polynomial, due to the binary search and to the branch-and-bound used to solve the 0-1 integer linear program LP_FAST.
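The following sketch shows the binary-search skeleton of Algorithm 1 on a precomputed matrix of s-values. For brevity, the ILP LP_FAST is replaced here by a brute-force feasibility check over subsets of size N, which is consistent with the exponential worst case but only practical for small |Γ|; the function name fast is ours.

```python
from itertools import combinations
import numpy as np

def fast(S, N, p, eps_ub):
    """Binary search of Algorithm 1 (didactic version).
    S[i, j] holds s(alpha_i, alpha_j); eps_ub must be a true upper bound on
    the optimal value of Problem 9 (e.g. max_j S[i0, j] for some row i0)."""
    n = S.shape[0]
    k = min(N, n)
    chosen, lo, hi = list(range(k)), 0.0, eps_ub
    while hi - lo > p:
        eps = (lo + hi) / 2.0
        C = S <= eps            # coverage matrix: c_{i,j} = 1 iff S[i, j] <= eps
        # feasible iff some subset of k rows covers every column of C
        sol = next((sub for sub in combinations(range(n), k)
                    if C[list(sub), :].any(axis=0).all()), None)
        if sol is None:
            lo = eps            # infeasible: eps is a lower bound
        else:
            chosen, hi = list(sol), eps   # feasible: eps is an upper bound
    return chosen, hi
```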
The α-min-2-p algorithm

Solutions provided by α-min-2-fast have no performance guarantees and can, in theory, be very bad, because for every Γ̃ the real gap g_r(V, Ṽ) = max_{b∈B} [V(b) − Ṽ(b)] between V and Ṽ is only approximated by an upper bound. In order to get a better approximation of g_r(V, Ṽ), we now consider V to be represented both by α-vectors and by β-points in Δ = {(b, V(b)) | b ∈ B_Δ}, where B_Δ is a finite subset of B, while Ṽ remains represented by α-vectors only. With this new representation of V, Problem 8 can be approximated by Problem 11, with the new possibility of adding as many β-points to Δ as necessary to get a better representation of V and a better approximation g*(Δ) of the optimal gap g*:

g*(Δ) = min_{Γ̃⊆Γ, |Γ̃|≤N} max_{β∈Δ} min_{α̃∈Γ̃} s'(α̃, β)   (11)

where N ≥ 1 and s'(α̃, β) = V(b) − α̃ · b for β = (b, V(b)). Figure 2 gives an example where V is represented with 7 α-vectors (in black) and 5 β-points (in gray). For any given set of β-points Δ, it is easy to adapt Algorithm 1 to solve Problem 11: one just needs to replace Γ by Γ ∪ Δ, the pairs (α̃, α) ∈ Γ² by pairs (α̃, β) ∈ Γ × Δ, and s by s'. Let us call the corresponding procedure FASTβ. In Algorithm 2, the construction of Δ is the same as in α-min: the while loop iteratively adds the β-point corresponding to the current biggest gap g_r(V, Ṽ) between V and Ṽ, until g_r(V, Ṽ) and g*(Δ) are close enough.

Figure 2: Figures (a) and (b) show an example of a mixed representation of V, using α-vectors (Γ, in black) and β-points (Δ, in gray). Figure (a) shows s'(α_4, β_2) = 5 and s'(α_3, β_4) = 4.8. Figure (b) shows the real gap g_r(V, Ṽ) between Γ and Γ̃ = {α_2, α_5}, and the gap g*(Δ) between Δ and Γ̃.

Algorithm 2 α-min-2-p
1: procedure PRECISE(Γ, p, N)
2:   Let Δ be the set of β-points corresponding to the extreme beliefs of B: (1, 0, ..., 0), (0, 1, ..., 0), ..., (0, ..., 0, 1)
3:   δ ← ∞, g_ub ← ∞
4:   while δ > p/2 do
5:     (Γ̃, g_Δ) ← FASTβ(Γ ∪ Δ, p/2, N)
6:     b* ← argmax_{b∈B} (V(b) − Ṽ(b))
7:     g_ub ← min(g_ub, V(b*) − Ṽ(b*))
8:     δ ← g_ub − g_Δ
9:     Δ ← Δ ∪ {b*}
10:  return Γ̃, g_ub

Proposition 3. Algorithm 2 solves Problem 8 within precision p in a finite number of iterations.

Proof. Given that FASTβ(Γ ∪ Δ, p/2, N) provides an optimal solution to Problem 11 within p/2, upon termination we have Γ̃, Δ and g_Δ such that g_Δ ≤ g*(Δ) + p/2. Additionally, we have g*(Δ) ≤ g* because Δ ⊆ {(b, V(b)) | b ∈ B}. This leads to g_Δ ≤ g* + p/2. g_ub is always set to a real gap g_r(V, Ṽ) = V(b*) − Ṽ(b*) or left unchanged (line 7); therefore we always have g* ≤ g_ub (because g* ≤ g_r(V, Ṽ) for every Ṽ). Finally, at the end of the algorithm we have g_ub − g_Δ ≤ p/2, so g_ub ≤ g_Δ + p/2 ≤ g* + p. Overall, we have g* ≤ g_ub ≤ g* + p. Algorithm 2 terminates after a finite number of iterations: at each iteration, if the Γ̃ provided by FASTβ (line 5) is the same as a previously generated Γ̃, we obtain δ = 0 by construction. So, in the worst case, Γ̃ is successively equal to every possible combination of N α-vectors before the while loop exits; the number of iterations is therefore bounded by |Γ|^N. FASTβ has time complexity O(log(ϵ_ub/p) · P(|Γ|, log N) · 2^{|Γ|}), where ϵ_ub is defined in the proof of Proposition 2 and P is a polynomial. Finally, the computation in line 6 can be performed in polynomial time (|Γ| being considered part of the instance): max_{b∈B} (V(b) − Ṽ(b)) = max_{α∈Γ} max_{b∈B, v} {α · b − v | v ≥ α̃ · b, ∀α̃ ∈ Γ̃} can be solved using |Γ| linear programs. Overall, the complexity of Algorithm 2 is O(|Γ|^N · log(ϵ_ub/p) · P(|Γ|, log N) · 2^{|Γ|}), where P is a polynomial.
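Line 6 of PRECISE is the only step requiring new machinery: the exact gap max_b (V(b) − Ṽ(b)) is found by solving one linear program per α ∈ Γ, as in the proof above. Here is a sketch under the same scipy conventions as before; real_gap is our name for it.

```python
import numpy as np
from scipy.optimize import linprog

def real_gap(Gamma, Gamma_tilde):
    """Line 6 of PRECISE: max over beliefs b of V(b) - Vtilde(b), via |Gamma| LPs.
    For each alpha in Gamma: maximize alpha.b - v over (b, v), subject to
    v >= alpha_tilde.b for all alpha_tilde in Gamma_tilde, b in the simplex."""
    n = len(Gamma[0])
    best_gap, best_b = -np.inf, None
    for alpha in Gamma:
        # decision variables x = (b_1, ..., b_n, v); minimize -(alpha.b - v)
        c = np.concatenate([-np.asarray(alpha), [1.0]])
        A_ub = np.array([np.concatenate([np.asarray(at), [-1.0]])
                         for at in Gamma_tilde])          # alpha_tilde.b - v <= 0
        b_ub = np.zeros(len(Gamma_tilde))
        A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)  # sum(b) = 1
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * n + [(None, None)], method="highs")
        if res.status == 0 and -res.fun > best_gap:
            best_gap, best_b = -res.fun, res.x[:n]
    return best_gap, best_b
```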
The α-min-2-solve algorithm

Solving Problem 5 directly, without any initial policy at hand, is harder. In the finite-horizon case, one has to solve Problem 6 instead of Problem 8, and BL(V_{t+1}) is generally very hard to compute. In particular, line 6 of α-min-2-p would be very hard to compute if V were replaced by BL(V_{t+1}). However, a simple heuristic based on α-min-2-p can be used, where instead of computing the best belief b* at each iteration (line 6), one generates a random belief. With this change, a new Γ̃ is no longer guaranteed to be found at each iteration, so the new stopping criterion is to stop when the number of sampled beliefs reaches a specified limit. This is α-min-2-solve. α-min-2-solve has the advantage that it can be directly compared to the solver α-min, while the two previous algorithms cannot because they are not solvers. α-min-2-solve has the same time complexity as the previous algorithm, since the main complexity term comes from the integer linear program FASTβ. α-min-2-solve should be faster than α-min because it does not require solving Problem 7 exactly at each iteration. Moreover, the performance of α-min-2-solve should be competitive because it looks for the best combinations of α-vectors, while α-min greedily adds the best α-vectors one by one. The infinite-horizon case is harder because one cannot be sure of the convergence of the process; that is why α-min-2-solve only deals with the finite-horizon case.

Experiments

We compare α-min-2-fast to α-min-2-p, and α-min-2-solve to α-min. α-min-2-fast and α-min-2-p are not POMDP solvers because they require an initial policy calculated by an external POMDP solver; therefore, they cannot be directly compared to α-min and α-min-2-solve.

α-min-2-fast and α-min-2-p

To compare α-min-2-fast and α-min-2-p, we conducted experiments on three benchmark problems¹, given an initial policy Γ provided by SARSOP and the corresponding lower bound V(b_0), where b_0 is the initial belief (at t = 0). We denote by (|S|, |A|, |O|, |Γ|, V(b_0)) the characteristics of the problems: milos-aaai97 (20, 6, 8, 184, 574.8), hallway2 (92, 5, 17, 139, 0.25) and learning.c4 (48, 16, 3, 332, 3.3). Figures 3a, 3b and 4a show the encouraging profile of the calculated gaps and bounds as a function of N. For α-min-2-fast, only an upper bound on g* is provided, called gap_fast. For α-min-2-p, the upper bound g_ub and the lower bound g_Δ − p/2 on g* are provided, respectively called gap_precise_ub and gap_precise_lb. Figure 4b provides the computation times of the α-min-2-p approach on each benchmark. The computation times of α-min-2-fast are negligible (always less than 1 second) and are not shown.

¹ http://www.pomdp.org/examples/

Figure 3: (a) Gaps, milos-aaai97. (b) Gaps, hallway2 (gap_ub and gap_lb overlap almost perfectly).

Figure 4: (a) Gaps, learning.c4. (b) Computation times (α-min-2-p).

Figure 5a and Figure 5b show, respectively, the computation time of α-min-2-p as a function of the required precision p (N fixed), and V(b_0) as a function of N, for milos-aaai97. α-min-2-fast and α-min-2-p are able to solve larger problems. Both algorithms were tested on the benchmark Tag Avoid (870, 5, 30, 615, 6.76) (Kurniawati, Hsu, and Lee 2008). For α-min-2-fast, the computing time was always less than 3 seconds and good performance was obtained, with Ṽ(b_0) ≥ 0.98 V(b_0), but with large gaps (greater than 10 even for N = 50). α-min-2-p provided much smaller gaps, with performance guarantees: for N = 10, gap = 5.98 within precision 0.7. This comes at a high cost in computation time (3389 seconds for N = 10).

α-min-2-solve

To compare α-min-2-solve to α-min, we ran α-min-2-solve on several benchmark problems from (Dujardin, Dietterich, and Chadès 2015). The time horizon chosen for the experiments was T = 10, and the number of β-points for α-min-2-solve was 100.
Table 1 shows that α-min-2-solve is always faster while providing a better V_0(b_0) (LB). Computation times (Time) are the average computation times per time step, in seconds. Note that SARSOP is not an N-POMDP solver; its results are presented for information only. α-min-2-solve has also been tested on the same computational sustainability problem as in (Dujardin, Dietterich, and Chadès 2015), the four-population Sumatran tiger non-stationary problem²: α-min-2-solve is clearly faster while providing a better LB.

² Available at https://sites.google.com/site/ijcaialphamin/home

Figure 5: (a) Computation time as a function of precision, milos-aaai97, N = 5. (b) V(b_0) as a function of N, milos-aaai97.

Table 1: Comparison of α-min-2-solve (denoted α-min-2) and α-min. LB is the provided lower bound of V_0(b_0). SARSOP and α-min results are from (Dujardin, Dietterich, and Chadès 2015).

Problem (|S|, |A|, |O|)        Algo.      N      LB       Time (s)
aloha.10 (30, 9, 3)            SARSOP     190    64.87    1000
                               α-min      30     62.66    1000
                               α-min-2    30     63.51    < 1
learning.c3 (24, 12, 3)        SARSOP     11433  1.36     1000
                               α-min      24     1.96     1000
                               α-min-2    24     2.09     < 1
cheng.D4-5 (4, 4, 4)           SARSOP     15     77.29    1000
                               α-min      4      77.85    1000
                               α-min-2    4      77.90    < 1
milos-aaai97 (20, 6, 8)        SARSOP     122    41.48    1000
                               α-min      20     50.31    1000
                               α-min-2    20     54.76    < 1
dujardin-ijcai15 (16, 13, 16)  α-min      7      207.23   37.8
                               α-min-2    7      208.45   < 1

Discussion

The three α-min-2 algorithms presented in this paper approximately solve N-POMDPs with different levels of approximation and different assumptions. α-min-2-fast minimizes an upper bound on the gap between an initial policy, provided by an external solver, and a policy using only N α-vectors. α-min-2-p uses both an upper and a lower bound, which enables it to provide solutions within any chosen precision; however, it is clearly slower than α-min-2-fast. Finally, because poor initial POMDP policies can produce poor final solutions, α-min-2-solve was written to solve N-POMDPs without any initial POMDP policy; however, α-min-2-solve can only solve finite-horizon N-POMDPs. For all three algorithms, the computation time is not very sensitive to the parameter N, although the worst-case time complexity is exponential in N. In practice, the size of the initial set Γ of α-vectors seems to play a key role in α-min-2-p's computation time. This is because α-min-2-p has to solve |Γ| linear programs at each iteration to update the upper bound of the gap. The number of states also plays a role for α-min-2-p, because the number of β-points needed to obtain sufficient precision can be very large. Future work will aim to reduce the worst-case time complexity of the α-min-2 algorithms.

Interpreting policies for planning problems arising in ecology is very important (Tulloch et al. 2015). While existing POMDP solvers generally provide policies of unlimited size, making interpretation difficult, the α-min-2 algorithms provide policies of a desired size N that are as close as possible to optimal policies. In doing so, we hope that α-min-2 will contribute to bridging the gap between POMDP solutions and their applications.

Acknowledgements

We thank Sam Nicol (CSIRO) and Martin Péron (QUT/CSIRO) for providing valuable feedback on an earlier version of this manuscript.

References

Bellman, R. 1957. Dynamic Programming. Princeton, New Jersey: Princeton University Press.

Chadès, I.; McDonald-Madden, E.; McCarthy, M. A.; Wintle, B.; Linkie, M.; and Possingham, H. P. 2008. When to stop managing or surveying cryptic threatened species. Proceedings of the National Academy of Sciences 105(37):13936–13940.

Cheng, H.-T. 1988.
Algorithms for Partially Observable Markov Decision Processes. Ph.D. Dissertation, University of British Columbia.

Dujardin, Y.; Dietterich, T.; and Chadès, I. 2015. α-min: A compact approximate solver for finite-horizon POMDPs. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-15), 2582–2588.

Fackler, P., and Pacifici, K. 2014. Addressing structural and observational uncertainty in resource management. Journal of Environmental Management 133:27–36.

Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38:293–306.

Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1):99–134.

Kurniawati, H.; Hsu, D.; and Lee, W. S. 2008. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems. Zurich, Switzerland.

Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. 1995. Efficient dynamic-programming updates in partially observable Markov decision processes. Technical report.

Madani, O.; Hanks, S.; and Condon, A. 2003. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence 147(1):5–34.

McDonald-Madden, E.; Chadès, I.; McCarthy, M. A.; Linkie, M.; and Possingham, H. P. 2011. Allocating conservation resources between areas where persistence of a species is uncertain. Ecological Applications 21(3):844–858.

Nicol, S., and Chadès, I. 2012. Which states matter? An application of an intelligent discretization method to solve a continuous POMDP in conservation biology. PLoS ONE 7(2):e28993.

Nicol, S.; Buffet, O.; Iwamura, T.; and Chadès, I. 2013. Adaptive management of migratory birds under sea level rise. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2955–2957.

Nicol, S.; Fuller, R. A.; Iwamura, T.; and Chadès, I. 2015. Adapting environmental management to uncertain but inevitable change. Proceedings of the Royal Society of London B: Biological Sciences 282(1808):20142984.

Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3):441–450.

Pineau, J.; Gordon, G.; and Thrun, S. 2006. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research 27:335–380.

Poupart, P.; Kim, K.-E.; and Kim, D. 2011. Closing the gap: Improved bounds on optimal POMDP solutions. In Proceedings of the 21st International Conference on Automated Planning and Scheduling, 194–201.

Åström, K. J. 1965. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications 10(1):174–205.

Regan, T. J.; Chadès, I.; and Possingham, H. P. 2011. Optimally managing under imperfect detection: a method for plant invasions. Journal of Applied Ecology 48(1):76–85.

Shani, G.; Pineau, J.; and Kaplow, R. 2012. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems 27(1):1–51.

Sigaud, O., and Buffet, O. 2013. Markov Decision Processes in Artificial Intelligence. John Wiley & Sons.

Smallwood, R. D., and Sondik, E. J. 1973. The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21(5):1071–1088.

Spaan, M. T., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for POMDPs.
Journal of Artificial Intelligence Research 24:195–220.

Springborn, M., and Sanchirico, J. N. 2013. A density projection approach for non-trivial information dynamics: Adaptive management of stochastic natural resources. Journal of Environmental Economics and Management 66(3):609–624.

Tulloch, V. J.; Tulloch, A. I.; Visconti, P.; Halpern, B. S.; Watson, J. E.; Evans, M. C.; Auerbach, N. A.; Barnes, M.; Beger, M.; Chadès, I.; et al. 2015. Why do we map threats? Linking threat mapping with actions to make better conservation decisions. Frontiers in Ecology and the Environment 13:91–99.