# approximability_of_constanthorizon_constrained_pomdp__7fdd2d3a.pdf

Approximability of Constant-horizon Constrained POMDP

Majid Khonji1,2 , Ashkan Jasour2 and Brian Williams2

1EECS Department, KU Center for Autonomous Robotic Systems, Khalifa University, Abu Dhabi, UAE 2Computer Science and Artiﬁcial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA majid.khonji@ku.ac.ae, {mkhonji, jasour, williams}@mit.edu

Partially Observable Markov Decision Process (POMDP) is a fundamental framework for planning and decision making under uncertainty. POMDP is known to be intractable to solve or even approximate when the planning horizon is long (i.e., within a polynomial number of time steps). Constrained POMDP (C-POMDP) allows constraints to be speciﬁed on some aspects of the policy in addition to the objective function. When the constraints involve bounding the probability of failure, the problem is called Chance-Constrained POMDP (CC-POMDP). Our ﬁrst contribution is a reduction from CC-POMDP to C-POMDP and a novel Integer Linear Programming (ILP) formulation. Thus, any algorithm for the later problem can be utilized to solve any instance of the former. Second, we show that unlike POMDP, when the length of the planning horizon is constant, (C)C-POMDP is NP-Hard. Third, we present the ﬁrst Fully Polynomial Time Approximation Scheme (FPTAS) that computes (near) optimal deterministic policies for constant-horizon (C)C-POMDP in polynomial time.

1 Introduction

A fundamental area in artiﬁcial intelligence involves planning under uncertainty. Partially observable markov decision process (POMDP) provides a model for optimal planning under actuator and sensor uncertainty, where the goal is to ﬁnd policies that maximize some measure of expected utility [Kaelbling et al., 1998; Sondik, 1971]. POMDP model is quite general and used in many areas such as reinforcement learning [Kaelbling et al., 1996] and robotics [Kress Gazit et al., 2009]. The model extends a well-researched framework of markov decision process (MDP) to scenarios where an agent cannot accurately observe the underlying state of the environment [Howard, 1960; Puterman, 2014]. In many real-world applications, a single measure of performance is not sufﬁcient to capture all requirements (e.g., a battery-operated unmanned aerial vehicle (UAV) tasked

Contact Author

to maximize its surveillance coverage while keeping energy consumption below a given threshold). An extension, often called constrained POMDP (C-POMDP), encodes additional constraints on the system to capture those requirements [Poupart et al., 2015]. Another approach is to bound the probability of constraint violation, modeled as chanceconstrained POMDP (CC-POMDP) [Santana et al., 2016]. Unsurprisingly, modeling constraints as negative rewards in the objective function leads to models that are over-sensitive to the particular value chosen, and to policies that are overly risk-taking or overly risk-averse [Undurti and How, 2010]. The focus in the literature has been primarily on the fully observable constrained MDP (C-MDP), for which non-trivial theoretical results [Altman, 1999; Feinberg and Shwartz, 1996] and efﬁcient algorithms are known (e.g., [Trevizan et al., 2016], [Ono et al., 2012]). On the other hand, the state of the art for (C)C-POMDP is less mature. Despite recent algorithmic developments, there has been relatively little effort devoted to the theoretical aspects of (C)C-POMDP. Current methods span from extensions of dynamic programming [Isom et al., 2008], point-based value iteration [Kim et al., 2011], approximate linear programming [Poupart et al., 2015], on-line search [Undurti and How, 2010], to heuristic forward search for CC-POMDP [Santana et al., 2016]. These methods generally compromise on either the optimality or the feasibility of the problem, in which they may obtain policies that are sub-optimal or in some cases violate the constraints. Another issue is that some of these methods produce randomized policies [Poupart et al., 2015]. Arguably, users rarely trust stochasticity in decision making in safety-critical systems [Dolgov and Durfee, 2005; Santana et al., 2016]. In general, POMDP is known to be NP-Hard [Papadimitriou and Tsitsiklis, 1987]. A stronger hardness result states that there is no α-approximation for POMDP unless P=NP [Lusena et al., 2001], where α characterizes the worst-case ratio between the approximate solution and an optimal solution. In this context, α can be any arbitrary value with a polynomial number of bits (hence exponentially small). The hardness result assumes the planning horizon has a polynomial number of time steps. If we assume a constant length, POMDP can be easily solved in polynomial time using a standard search algorithm [Russell and Norvig, 2016]. This work provides the ﬁrst approximability study for

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

(C)C-POMDP. We show that when the planning horizon is constant, unlike POMDP, (C)C-POMDP is still NP-hard even when the environment is fully observable. Despite the hardness, we show that one can obtain close-to-optimal solutions in polynomial time within a user-deﬁned accuracy parameter ϵ. It is worth noting that in many practical applications (e.g., self-driving vehicle), the planning horizon is reasonably small (within 10 seconds). Thus, a better theoretical understanding of the constant-horizon case is both relevant to practice and paves the way for future development of fast heuristics without compromising optimality. The paper is structured as follows. Sec. 2 presents deﬁnitions and relevant background. Sec. 3 provides a novel integer linear programming (ILP) formulation and a reduction from CC-POMDP into C-POMDP. Sec. 4 presents our hardness result, namely, (C)C-(PO)MDP is NP-Hard even when planning horizon is two. In Sec. 5, we provide a novel fully polynomial time approximation scheme FPTAS for the two problems, which is the best possible algorithm in theory in terms of approximation ratio.

2 Problem Deﬁnition

We consider a series of discretized time steps T = {1, . . . , h} with a ﬁxed interval. Through out this work, we assume h is a constant. By this assumption, we do not require our algorithm to scale in h. This is motivated by the fact that there is no αapproximation for POMDP for polynomial h, unless P=NP [Lusena et al., 2001].

Deﬁnition 2.1 (C-POMDP). Constrained partially observable markov decision process problem is a tuple M = S, A, O, T, O, U, b0, h, U , C , S, A and O are ﬁnite sets of discrete states, actions and observations; T : S A S [0, 1] is a probabilistic transition function between states, i.e., T(s, a, s ) = Pr(s | a, s), where s, s S and a A; O : O S [0, 1] is a probabilistic observation function such that O(o, s) = Pr(o | s), where o O and s S; U : S A R+ is a non-negative reward function; b0 : S [0, 1] is an initial belief state, a probability distribution over the state space; h is the execution horizon; U : S A R+ is a secondary non-negative penalty function; and C R+ is an upper bound on the total expected penalty. The objective is to compute a conditional plan (or a policy) that maximizes the cumulative expected reward while keeping the cumulative expected constraint penalty at most C .

Deﬁnition 2.2 (CC-POMDP). Chance constrained partially observable markov decision process problem is a tuple M = S, A, O, T, O, U, b0, h, R, , where S, A, O, T, O, U, b0, h are deﬁned as in Def. 2.1; R S is a subset that represents risky states (wherein the agent cannot perform any further action); and is the risk budget, a threshold on the probability of failure. The objective is to compute a conditional plan that maximizes the cumulative expected reward such that the probability of ending up in risky states is at most .

In the following, we set up notation and provide formal deﬁnitions for the notions of conditional plan and approxi-

Figure 1: The precedence relationship between action sequences is represented by directed edges. Hyper-nodes, represented by shaded circles, are observation sequences e O, while white circles are action sequences e A, represented by the last action of the sequence (for h = 3, A = {a1, a2}, and O = {o1, o2}).

mation algorithm. We show later how to calculate expected reward, risk, and penalty in Sec. 3. Deﬁne the set of all possible histories of length k, represented by interleaving action-observation sequences, as

{q = a1 q, o2 q, a3 q, . . . , ak 1 q , ok q | a2t 1 q A, o2t q O, t T }, if k is even, {q = a1 q, o2 q, a3 q, . . . , ok 1 q , ak q | a2t 1 q A, o2t q O, t T }, if k is odd.

Notice when k is even, a sequence q Qk ends with an observation, whereas if k is odd, the sequence ends with an action. For convenience, we deﬁne a function τ : N+ T that maps indices k onto time steps in T as follows:

τ(k) k 2 if k is even, k+1

2 if k is odd.

Let e O S k is even, τ(k) T Qk be the set of all possible sequences that

end with an observation. Similarly, deﬁne e A S k is odd, τ(k) T Qk

to be the set of all possible sequences that end with an action. For a sequence q e O and action a A (resp. observation o O), we write qa (resp. qo) to denote the concatenation q a (resp. q o ). Also, for any two sequences q, q e A e O we write qq to denote q q . Let |q| be the length of sequence q. We refer to the empty sequence by 0. Given a sequence q e O e A, let O(q) and A(q) respectively denote the set of observation and action indices along the sequence. We write oi q and aj q to denote the i-th observation and j-th action, respectively, for i O(q), j A(q). For convenience, we write q k to denote the sequence minus the last k elements (where an element is either an action or observation). For example, q = a1o2a3, q 1 = a1o2, q 2 = a1, and so on. Next, deﬁne a precedence relationship between action sequences in e A by a directed tree graph Ga = ( e A, Ea) (depicted in Fig. 1) such that (q, q ) Ea if and only if q = qoa for some o O and a A1.

1Note that q e A (resp. q e O) represents a sequence, while a A (resp. o O) represents a single action (resp. observation).

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Figure 2: A conditional plan is represented by red circles over action sequences, represented by the last action. (h = 3, A = {a1, a2}, and O = {o1, o2}.)

We deﬁne a binary decision variable xq for each sequence such that xq = 1 indicates the last action in q is selected, and xq = 0 otherwise. We write x (xq)q e A to denote the vector of all decision variables. Deﬁnition 2.3. x {0, 1}| e A| is a conditional plan if, X

a A xqa 1, q e O {0}, (1)

xq xq , (q, q ) Ea. (2)

Cons. (1) enforces that at most one action is selected for each observation (see the red circles in Fig. 2), while Cons. (2) maintains the precedence relationship, that is, if an action is selected then all its ancestors (deﬁned by the precedence relationship Ga) must be selected as well. The two constraints enforce a tree structure, also known as a policy tree. We note that the deﬁnition above does not enforce a fullhorizon policy tree. We deliberately highlight this deﬁnition because unlike POMDP where there always exists a full-horizon solution, such solution may not exist for (C)CPOMDP due to the restriction imposed by the constraint (see Sec. 3.3 below). In other words, the highest reward actions may consume all the risk budget before reaching the horizon. Thus, one may opt to prioritize objective optimization over policy completeness. In many practical applications, xq = 0 can be interpreted as a safe maneuver. For example, an autonomous underwater glider, tasked to maximize expected science return while bounding the risk of collision [Timmons and Williams, 2015], can execute a safe-extract action anytime (by changing its buoyancy so that it ﬂoats to the surface for extraction). We note that one can construct a scenario in which a fullhorizon solution exists for CC-POMDP but the current stateof-the-art approach returns infeasible [Santana et al., 2016]). Therefore we ﬁnd it important to highlight this new aspect of the constrained version of POMDP that hasn t been discussed before. The results of this paper can be extended to enforce full-horizon policies. Due to the paucity of space, we omit the details and refer interested readers to the full version of the paper. The subject of approximation algorithms is well-studied in the theoretical computer science community [Vazirani, 2013].

In the following, we deﬁne some standard terminology for approximation algorithms. Consider a maximization problem Π with non-negative objective function f( ), let F be a feasible solution to Π and F be an optimal solution to Π. f(F) denotes the objective value of F. Let OPT = f(F ) be the optimal objective value of F . A common deﬁnition of approximate solutions is α-approximation, where α characterizes the approximation ratio between the approximate solution and an optimal solution. Deﬁnition 2.4 ([Vazirani, 2013]). For α [0, 1], an αapproximation to maximization problem Π is an algorithm that obtains a feasible solution F for any instance such that f(F) α OPT. In particular, polynomial-time approximation scheme (PTAS) is a (1 ϵ)-approximation algorithm to a maximization problem, for any ϵ > 0. The running time of a PTAS is polynomial in the input size for every ﬁxed ϵ, but the exponent of the polynomial might depend on 1/ϵ. In other words, PTAS allows to trade the approximation ratio against the running time. An even stronger notion is a fully polynomial-time approximation scheme (FPTAS), which requires the running time to be polynomial in both input size and 1/ϵ. 3 Novel Formulation for (C)C-POMDP In this section, we present a novel ILP formulation for CC-POMDP and C-POMDP, and show the similarity between the two problems. We observe that the later is in fact more general. In other words, any instance of CC-POMDP can be reduced to an instance of C-POMDP. The reduction implies that any algorithm for C-POMDP can be also used to solve CC-POMDP. It also implies that because CCPOMDP is NP-Hard (as we show in Sec. 4), C-POMDP must be NP-Hard as well. The key idea for the new ILP formulation is that we can calculate rewards and risks (penalties for C-POMDP) independently from decision variable x, hence we get linear constraints.

3.1 Calculating Reward We show how to compute the cumulative expected reward of a conditional plan. The cumulative expected penalty for C-POMDP is obtained in a similar way, thus we omit for brevity. The reward of the last action in a sequence q e A is denoted by uq. Essentially, the reward is the product of the probability of sequence q occurring, denoted by ρ(q), and the sum of expected reward of the last action a|q| q in that sequence, denoted by v(q):

uq ρ(q) v(q). (3) As shown by [Aras et al., 2007], we can recursively compute ρ(q) by the following

i O(q) oi q b0

j A(q) aj q

i O(q) Pr oi q b0

i O(q) Pr(oi q | bi 1 q ),

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

for q e A e O, |q| > 1, where bi 1 q is the prior belief state after action ai 1 q in q, deﬁned by

bi 1 q (s) X

s S T(s , ai 1 q , s) bi 2 q (s ).

The posterior belief bi q is computed as

bi q(s) O(oi q, s) bi 1 q (s)

Pr(oiq | bi 1 q ) , s S,

where Pr(oi q | bi 1 q ) P s S bi 1 q (s) O(oi q, s). Notice that by the deﬁnition of a sequence, if i O(q) corresponds to an observation then (i 1) A(q) corresponds to an action. For |q| = 1, the probability ρ(q) 1 because such state is always reachable. Hence, the expected reward v(q) of sequence q e A is computed by

s S bq 1(s) U(s, a|q| q ). (4)

The total expected reward of a conditional plan x is given by P

q e A uqxq.

Remark 1. Given a sequence q e A, the reward uq can be computed independently from a conditional plan x.

3.2 Calculating Risk We adopt the notion of execution risk from [Santana et al., 2016; Santana and Williams, 2015], denoted by er(0), to compute the probability of ending up in risky states starting from the initial sequence 0. The chance constraint requires that er(0) . (5)

For a given belief b, let r(b) P s R b(s) be the expected risk of failure. We write the execution risk as a function of a conditional plan x {0, 1}| e A| using the recursion in Eqn. (6) (see [Santana et al., 2016] for a systematic derivation). f Pr(o | bq 1) is the observation probability, and bq 1(s) and ebq(s) are safe prior and posterior beliefs, respectively, deﬁned by,

f Pr(o | bq 1) X

s S O(o, s )bq 1(s ),

P s S\R T(s , a, s)ebq 2(s )

1 r(ebq 2) ,

ebq(s) O(o|q| q , s) bq(s) f Pr(o | bq 1) ,

for q e O, and where eb0 q = b0. The reason we distinguish safe prior and posterior from the standard deﬁnitions is that reaching history q implies, roughly speaking, that none of the past states were in R, otherwise no action could have been taken to proceed. Next, we solve the recursive Eqn. (6) and rearrange the terms to obtain

er(0) = r(b0) + X

q e A:τ(|q|)<h

1 r(ebi 2 qo )

f Pr(oi qo | b i 1 qo ) r(ebqo) xq

q e A:τ(|q|)=h

1 r(ebi 2 q )

f Pr(oi q | b i 1 q ) r(bq) xq, (7)

where bi q andebj q are the belief states after the i-th action and jth observation, respectively. For q e A, write the coefﬁcient of xq from Eqn. (7) as

P o O Q i O(qo) 1 r(ebi 2 qo ) e P(oi qo | b i 1 qo )

r(ebqo), if τ(|q|) < h, Q i O(q) 1 r(ebi 2 q ) e P(oi q | b i 1 q )

(1 r(ebq 1)) r(bq), if τ(|q|) = h. (8) By Eqns. (7) and (8), the chance Cons. (5) can be written as

q e A rq xq r(b0).

Remark 2. For a given sequence q e A, the risk rq can be computed independently from a conditional plan x.

3.3 ILP Formulation for CC-POMDP

Chance-constrained POMDP

Given a precomputed (uq, rq)q e A R2| e A| + from Eqns. (3)

and (8), a precedence graph Ga = ( e A, Ea), e O, C r(b0), C R+, CC-POMDP can be formulated by the following integer linear program:

(ILP) max xq {0,1}

q e A uq xq

Subject to X

a A xqa 1, q e O {0}, (9)

xq xq , (q, q ) Ea, (10) X

q e A rq xq C, (11)

Constrained POMDP In CC-POMDP, a secondary objective function U (s, a) is given as a constraint with an upper bound C R+ on the total expected value. We can use the above ILP formulation for C-POMDP with the following changes in the coefﬁcients: rq is replaced by a penalty r q, calculated in the same way as uq replacing U(s, a) by U (s, a) in Eqn. (4) (as shown in Sec. 3.1). Thus Cons. (11) is replaced by

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

r(b0) + (1 r(b0)) P a A P o O f Pr(o | ba) er(ao) if q = 0, r(ebq) xq 1 + (1 r(ebq)) P a A P o O f Pr(o | bqa) er(qao), if q e O and τ(|q| + 2) h, r(ebq) xq 1 + (1 r(ebq)) P a A er(qa), if q e O and τ(|q| + 1) = h, r(bq) xq, if q e A and τ(|q|) = h,

q e A r qxq C . (12)

With the above formulation, we observe the similarity between the two problems. We highlight that C-POMDP is actually more general, because the parameters have higher degrees of freedom as we show below. First, it is worth noting that one can slightly generalize CPOMDP by allowing the penalty function to be a function of state and action history instead of a single action, namely, U : S e A R+ (denoted by CC-POMDP*). Any instance I = (uq, rq)q e A, C of CC-POMDP can be easily reduced to an instance I = (u q, r q)q e A, C of C-POMDP* as follows:

set C = C and deﬁne U (s, q) rq ρ(q), q e A. Following the same lines in Sec. 3.1 we obtain r q = rq (and u q = uq). Any solution to such instance of C-POMDP* is also feasible to that of CC-POMDP with the same objective value. Second, we note that C-POMDP (with the standard transition function) can be also seen as a generalization of CCPOMDP even for the fully observable case (C-MDP) . Given an instance I of CC-POMDP, we can compute an instance I of C-(PO)MDP as follows: set the state space S and observation space O of C-POMDP to be equal to the set of observation sequences of CC-POMDP, S = O = e O {0}; set O(q, q ) = 1 if q = q for q, q S , and 0 otherwise (hence fully observable); set T(q, a, q ) = 1 |O| if q = qao and o O, and 0 otherwise; U(q, a) = uq and U (q, a) = rq for q S , a A. Following the steps in Sec. 3.1 we obtain r q = rq and u q = uq.

4 Hardness of CC-POMDP

In this section we provide our main hardness result. The result is tightly focused on a highly restricted subset of CCPOMDP instances, with a planning horizon of two and a fully observable environment. This provides an argument that, in general, exact algorithms are infeasible for most CCPOMDP instances.

Theorem 4.1. CC-POMDP is NP-Hard even for the fully obervable case (O = S), time horizon h = 2, and a single action A = {a}.

Proof. We present a reduction from the (weakly) NP-Hard knapsack problem (KP) to CC-POMDP.

Deﬁnition 4.2 (KP). Given a set of items {1, . . . , n}, each associated with a value vi R+ and a weight wi R+, i = 1, . . . , n, and a total capacity P R+; ﬁnd a subset of items that achieve the highest aggregate value subject to the capacity constraint. More precisely,

Figure 3: A gadget for the reduction from KP to CC-POMDP. Blue arrows represent the state transition function, red dashed circles represent the risky state sn+2, and green circles represent safe states. The subtree within the gray box represents the precedence relationship between action nodes.

(KP) max xi {0,1}

i=1 vi xi s.t.

i=1 wi xi P.

W.l.o.g. we assume P 1 and Pn i=1 wi = 1 (which can be easily achieved via normalization). We present a reduction from an instance I of KP to instance I of CCPOMDP as follows. Let S = {s1, s2, . . . , sn, sn+1, sn+2}, R = {sn+2}, O = S, A = {a}, and = P

2 . Since the system is fully observable (O(si, sj) = 1 for i = j, and 0 otherwise), we have b0(sn+1) = 1, and b0(s) = 0 for s S\{sn+1}. Based on the above, r(b0) = 0, C = r(b0) = P

2 . Deﬁne the state transition function as follows, for s, s S, a A:

T(s, a, s )

wi if s = sn+1, s = si S\R, 1 2 if s S\{sn+1, sn+2}, s R, 1 2 if s = s S\R 1 if s = s = sn+2, 0 otherwise. (13) Deﬁne the reward function as

U(s, a) vi wi if s = si S\{sn+1, sn+2}, 0 otherwise. (14)

See Fig. 3 for a pictorial illustration of instance I . The precedence graph is given by Ga = ( e A, Ea), where e A = {a, as1a, as2a, . . . , asna, asn+1a, asn+2a}, and Ea is deﬁned as shown in Fig. 3 (the subtree within the box). Fol-

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

lowing the steps in Secs. 3.1 3.2,

f Pr(oi | ba) = Pr(oi | ba) = ba(si) = ba(si)

= T(sn+1, a, si) = wi if i = 1, . . . , n, 0 otherwise;

ebaoi(si) = baoi(si) = 1 if i = 1, . . . , n, 0 otherwise;

baoia(s) = baoia(s) = 1 2 if s {si, sn+2}, 0 otherwise,

where oi = si, i = 1, . . . , n+2. Hence, we obtain rq, q e A, as follows:

o O (1 r(b0)) e P(o | b0) r(ebao) = 0,

raoia = (1 r(b0)) e P(oi | ba) (1 r(ebaoi)) r( baoia) = wi

2 , for i = 1, . . . , n,

raon+1a = raon+2a = 0.

We also obtain uq as follows:

ua = ρ(a) X

s S b0(s)U(s, a) = 0,

uaoia = Pr(oi | ba) X

s S baoi(s) U(s, a)

= wi U(si, a) = wi vi

wi = vi, for i = 1, . . . , n,

uaon+1a = uaon+2a = 0.

Now, any optimal solution x for instance I , of value denoted by OPT(I ), must satisfy x a = 1; otherwise, OPT(I ) = 0 (by Cons. (10)), which is smaller than that of the feasible solution x a = 1, x as1a = 1, x q = 0, for q = as1a, contradicting the optimality of x . Let V (x) and V (x ) denote the objective values of solution x {0, 1}n for KP(I) and x {0, 1}| e A| for CC-POMDP(I ), respectively. Given a non-trivial feasible solution x for CCPOMDP(I ) (such that x a = 1, x asn+1a = x asn+2a = 0), one can obtain a solution x for KP(I) such that xi x asia. By Cons. (11),

q e A rq x q C

2 x asia + 0x a + 0x asn+1a

+ 0x asn+2a P

i=1 wixi P. (15)

Hence, solution x is feasible for KP(I). Note that the objective value of x is equivalent to that of x:

q e A uq x q =

i=1 vi xi, (16)

Hence, V (x) = OPT(I ). It remains to show that x is optimal to KP(I). Suppose that there exists a solution y for KP(I) such that V (y) > OPT(I ), then we can ﬁnd

x for CC-POMDP(I ) such that x asia yi, x a = 1, x asn+1a = x asn+2a = 0, which is a feasible solution by (15) and Cons. (10), and V (x ) > OPT(I ) by (16), contradicting the optimality of x . Therefore OPT(I) = OPT(I ). Note that if there exists an optimal algorithm for CCPOMDP, then we can use it to obtain optimal solutions for KP. Since KP is NP-Hard, we conclude that such algorithm does not exist, unless P=NP.

5 Algorithm

In this section, we provide an FPTAS that is applicable to both C-POMDP and CC-POMDP. We showed in Sec. 4 that CC-POMDP is NP-Hard, which implies that the FPTAS is the best polynomial time algorithm in theory in terms of approximation ratio (unless P=NP). In fact, we consider a generalization of the two problems by allowing arbitrary values for rq, uq, C in the ILP formulation. The new problem is also a generalization of the so-called precedence constrained knapsack problem (PKP) under tree topologies [Johnson and Niemi, 1983; Kellerer et al., 2004].

Deﬁnition 5.1 (PKP). Given a directed out-tree graph Gp = (Vp, Ep), where nodes represent items i, each of which has a value ui R+ and a weight ri R+, and edges correspond to the precedence order between items such that an item can be packed into the knapsack only if its parent is also packed (precedence relationship). The goal is to ﬁnd a subset of items that maximizes the total value while satisﬁes the precedence relationship and the total weight is at most C.

PKP can be formulated using the above ILP with |O| = 1 (whereas rq, uq, C are allowed to be arbitrary non-negative numbers). When |O| > 1, we call this generalization multiple-choice precedence constrained knapsack problem (MPKP).

Deﬁnition 5.2 (MPKP). Given a directed And-Or graph G = (V, E), where nodes are partitioned into And nodes and Or nodes. Contrary to PKP, only an And node i is associated with a value ui R+ and a weight ri R+. Edges correspond to the precedence order between And-Or nodes such that a node is selected only if its parent is also selected (precedence relationship). In addition, for an Or node, at most one of its children is allowed to be selected (disjunction condition). The goal is to ﬁnd a subset of nodes that maximizes the total value of And nodes while satisﬁes the precedence relationship between And-OR nodes, the disjunction condition at Or nodes, and the total weight of And nodes is at most C.

In the following, we present an FPTAS that solves MPKP. The algorithm is a non-trivial extension of the FPTAS for the PKP problem [Johnson and Niemi, 1983; Kellerer et al., 2004] to account for And-Or graphs. The algorithm can also solve both C-POMDP and CC-POMDP since MPKP is a generalization of the two problems. First, we set up notation needed to describe the algorithm. For simplicity, we use the terminologies of CC-POMDP, namely, we say reward instead of value and risk instead of weight . Deﬁne an And-Or graph G = (V, E), where the set of nodes V e A ( e O {0}) is a partition such that e A ( e O {0}) = ,

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

E n (q, q ) | q V s.t. τ(|q|) < h, (if q e O {0}, q = qa, a A)

or (if q e A, q = qo, o O) o .

We call nodes in e A And nodes, while nodes in e O {0} are called Or nodes. The tree is rooted at node 0.

Based on E, we order nodes in V following the standard depth-ﬁrst search order. More precisely, let G (j, i) G represent the subtree induced by j-th node in that order, the ﬁrst i children of j and all their successors, and all vertices in V such that their indices are lower than j. In other words, G (j, i) is the expanded subtree using the depth ﬁrst search at node j after fully expanding its i-th child. For simplicity, we say node j instead of the j-th node in the depth ﬁrst search order. With a slight abuse of notation, we write V = {0, 1, . . . , m}, where j V refers to a sequence in e A e O {0} in that order. Let n = | e A| (and m n = | e O|). In the following, we use subscript j instead of q to refer to nodes in V. We write rj and uj to denote the risk and reward of the respective sequence. Let δ(j) V denote the set of children of j in G. As a slight practical generalization, we can allow |δ(j)| to be different for each j.

5.1 Pseudo Polynomial-time Algorithm We provide an exact algorithm to solve MPKP based on a novel dynamic program. The running time is polynomial in L, deﬁned as the set of all possible objective values. In general the size of L is exponential, but it could be polynomial for certain applications, e.g., when uj = 1 for all j e A then |L| = n. We show in the subsection below that with a proper rounding scheme, one can obtain an FPTAS for any large L. Deﬁne a 3D table D(j, i, ℓ), where each cell corresponds to the minimum weight feasible solution within the subtree G (j, i) that achieve an objective value of at least ℓ L, according to following rules. For ℓ L, R1: j O {0} and i = 0,

D(j, 0, ℓ) .

R2: j δ(0) e A and i = 0,

D(j, 0, ℓ) rj if uj ℓ, otherwise.

R3: j e A e O {0} and i > 0,

D(j, i, ℓ) min n D(j, i 1, ℓ), D(ji, |δ(ji)|, ℓ) o , where ji δ(j) is the i-th child of j.

R4: j e A\δ(0) and i = 0,

D(k , t 1, [ℓ uj]+) + rj, if D(k , t 1, [ℓ uj]+) + rj C , otherwise,

where [β]+ = β if β 0, and [β]+ = 0 if β < 0; j is a child of k, and k is the t-th child of k . Notice that by the deﬁnition of G, j e A implies that k e O and k e A.

Algorithm 1 MPKP-FPTAS[(uj, rj)j e A, C, G, ϵ]

Input: Rewards and risks (uj, rj)j e A; capacity C; And-Or graph G; accuracy parameter ϵ Output: (1 ϵ)-solution x to MPKP 1: Let K = ϵP

K for all j e A

3: x MPKP-Exact h 1, 2, . . . n P

K , (uj, rj)j e A, C, G i

4: return x

By the condition in R4, the process always maintains feasibility with respect to the capacity Cons. (11). Note that the precedence constraints are systematically maintained throughout the entire process. The optimum solution value corresponds to the maximum value ℓ L for which D(0, |δ(0)|, ℓ) < . Through out the above steps, one can compute the corresponding binary vector x such that P

j e A ujx j = ℓ. We denote the algorithm that com-

putes x for any instance I = ((uj, rj)j e A, C, G, e A, e O) by MPKP-Exact[L, (uj, rj)j e A, C, G]. The running time of MPKP-Exact[L, (uj, rj)j e A] is O(n|L|). (In the context

of constant horizon (C)C-POMDP, n = Ph t=1 |A|t|O|t 1.) Hence, we state the following lemma. Lemma 5.3. MPKP-Exact[L, (uj, rj)j e A, C, G] obtains optimal solutions for MPKP in O(n|L|).

5.2 FPTAS In this section we present our (1 ϵ)-approximation algorithm for MPKP. We follow a standard rounding technique for knapsack [Vazirani, 2013]. The basic idea is to round all rewards uj such that the set of all possible objective values L has polynomial cardinality, while closely approximates the actual set of all possible rewards. Then, we envoke MPKP-Exact[ ] with the new L. More precisely, let P maxj e A uj be the maximum reward. Note that n P is a trivial upper bound on OPT. Now, divide all rewards uj by a factor K ϵP

n and then round down. As a result, the set of all possible rewards L = {0, 1, 2, . . . n P

K } has a polynomial length, |L| = n2

ε + 1. Theorem 5.4 shows such rounding yields a good approximation. The pseudocode is provided in Alg. 1, denoted by MPKP-FPTAS. Theorem 5.4. MPKP-FPTAS is a fully polynomial time approximation scheme for MPKP and runs in O( 1

ϵ n3). A proof can be obtained following a standard rounding scheme [Vazirani, 2013]. Due to the lack of space, we omit our proof. Remark. One can speedup the algorithm in practice using a smaller set L with a tighter upper bound on the optimal. For example, instead of n P, one can use the objective value of the linear relaxation of MPKP, such that xj [0, 1], as an upper bound, which can be obtained using a standard LP solver. Corollary 5.5. MPKP-FPTAS is a fully polynomial time approximation scheme for (C)C-POMDP and runs in O( 1

ϵ |A|3h|O|3(h 1)).

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

6 Conclusion In this work, we provide a fundamental study for constanthorizon (C)C-POMDP problem. A reduction is given from CC-POMDP problem to C-POMDP, and a novel ILP formulation that can be solved using standard ILP solvers. We show that CC-POMDP (consequently, C-POMDP) is NPHard even for the fully observable case and a planning horizon of two. We provide an FPTAS that can solve a generalization of constant-horizon C-POMDP called MPKP (and hence solve (C)C-POMDP) in polynomial time, within a bounded optimally gap given as input. We believe our results provide fundamental insights into the problem and can lead to future development of faster heuristics for (C)C-POMDP without compromising optimality.

References [Altman, 1999] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999. [Aras et al., 2007] Raghav Aras, Alain Dutech, Franc ois Charpillet, et al. Mixed integer linear programming for exact ﬁnite-horizon planning in decentralized pomdps. In ICAPS, pages 18 25, 2007. [Dolgov and Durfee, 2005] Dmitri Dolgov and Edmund Durfee. Stationary deterministic policies for constrained mdps with multiple rewards, costs, and discount factors. In International Joint Conference on Artiﬁcial Intelligence, volume 19, page 1326, 2005. [Feinberg and Shwartz, 1996] Eugene A Feinberg and Adam Shwartz. Constrained discounted dynamic programming. Mathematics of Operations Research, 21(4):922 945, 1996. [Howard, 1960] Ronald A Howard. Dynamic programming and markov processes. 1960. [Isom et al., 2008] Joshua D Isom, Sean P Meyn, and Richard D Braatz. Piecewise linear dynamic programming for constrained pomdps. In AAAI, volume 1, pages 291 296, 2008. [Johnson and Niemi, 1983] David S Johnson and KA Niemi. On knapsacks, partitions, and a new dynamic programming technique for trees. Mathematics of Operations Research, 8(1):1 14, 1983. [Kaelbling et al., 1996] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artiﬁcial intelligence research, 4:237 285, 1996. [Kaelbling et al., 1998] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artiﬁcial intelligence, 101(1-2):99 134, 1998. [Kellerer et al., 2004] Hans Kellerer, Ulrich Pferschy, and David Pisinger. Knapsack Problems. Springer, 2004. [Kim et al., 2011] Dongho Kim, Jaesong Lee, Kee-Eung Kim, and Pascal Poupart. Point-based value iteration for constrained pomdps. In Twenty-Second International Joint Conference on Artiﬁcial Intelligence, 2011.

[Kress-Gazit et al., 2009] Hadas Kress-Gazit, Georgios E Fainekos, and George J Pappas. Temporal-logic-based reactive mission and motion planning. IEEE transactions on robotics, 25(6):1370 1381, 2009. [Lusena et al., 2001] Christopher Lusena, Judy Goldsmith, and Martin Mundhenk. Nonapproximability results for partially observable markov decision processes. Journal of artiﬁcial intelligence research, 14:83 103, 2001. [Ono et al., 2012] Masahiro Ono, Yoshiaki Kuwata, and J Balaram. Joint chance-constrained dynamic programming. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 1915 1922. IEEE, 2012. [Papadimitriou and Tsitsiklis, 1987] Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov decision processes. Mathematics of operations research, 12(3):441 450, 1987. [Poupart et al., 2015] Pascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, and Michael Bowling. Approximate linear programming for constrained partially observable markov decision processes. In Twenty Ninth AAAI Conference on Artiﬁcial Intelligence, 2015. [Puterman, 2014] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. [Russell and Norvig, 2016] Stuart J Russell and Peter Norvig. Artiﬁcial intelligence: a modern approach. Malaysia; Pearson Education Limited,, 2016. [Santana and Williams, 2015] Pedro Henrique Santana and Brian C Williams. Dynamic execution of temporal plans with sensing actions and bounded risk. In Twenty-Fourth International Joint Conference on Artiﬁcial Intelligence, 2015. [Santana et al., 2016] Pedro Santana, Sylvie Thi ebaux, and Brian Williams. Rao*: an algorithm for chance constrained pomdps. In Proc. AAAI Conference on Artiﬁcial Intelligence, 2016. [Sondik, 1971] Edward J Sondik. The optimal control of partially observable markov decision processes. Ph D the sis, Stanford University, 1971. [Timmons and Williams, 2015] Eric Timmons and Brian C Williams. Enumerating preferred solutions to conditional simple temporal networks quickly using bounding conﬂicts. In Workshops at the Twenty-Ninth AAAI Conference on Artiﬁcial Intelligence, 2015. [Trevizan et al., 2016] Felipe Trevizan, Sylvie Thi ebaux, Pedro Santana, and Brian Williams. Heuristic search in dual space for constrained stochastic shortest path problems. In Twenty-Sixth International Conference on Automated Planning and Scheduling, 2016. [Undurti and How, 2010] Aditya Undurti and Jonathan P How. An online algorithm for constrained pomdps. In 2010 IEEE International Conference on Robotics and Automation, pages 3966 3973. IEEE, 2010. [Vazirani, 2013] Vijay V Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)