# Monte-Carlo Tree Search for Constrained POMDPs

Jongmin Lee¹, Geon-Hyeong Kim¹, Pascal Poupart², Kee-Eung Kim¹,³
¹ School of Computing, KAIST, Republic of Korea
² University of Waterloo, Waterloo AI Institute and Vector Institute
³ PROWLER.io
{jmlee,ghkim}@ai.kaist.ac.kr, ppoupart@uwaterloo.ca, kekim@cs.kaist.ac.kr

Monte-Carlo Tree Search (MCTS) has been successfully applied to very large POMDPs, a standard model for stochastic sequential decision-making problems. However, many real-world problems inherently have multiple goals, for which multi-objective formulations are more natural. The constrained POMDP (CPOMDP) is such a model: it extends the standard POMDP by maximizing the reward while constraining the cost. To date, solution methods for CPOMDPs assume an explicit model of the environment and are therefore hardly applicable to large-scale real-world problems. In this paper, we present CC-POMCP (Cost-Constrained POMCP), an online MCTS algorithm for large CPOMDPs that leverages the optimization of LP-induced parameters and requires only a black-box simulator of the environment. In the experiments, we demonstrate that CC-POMCP converges to the optimal stochastic action selection in CPOMDPs and pushes the state of the art by scaling to very large problems.

## 1 Introduction

Monte-Carlo Tree Search (MCTS) [4, 5, 12] is a generic online planning algorithm that effectively combines random sampling and tree search, and has shown great promise in many areas such as online Bayesian reinforcement learning [8, 10] and computer Go [7, 20]. MCTS efficiently explores the search space by investing more search effort in promising states and actions while balancing exploration and exploitation in the direction of maximizing the cumulative (scalar) reward. Due to its outstanding performance without relying on any prior domain knowledge or heuristic function, MCTS has become the de-facto standard method for solving very large sequential decision-making problems, commonly formulated as Markov decision processes (MDPs) and partially observable MDPs (POMDPs).

However, in many situations it is not straightforward to formulate the objective with reward maximization alone, as in the following examples. For spoken dialogue systems [24], it is common to optimize the dialogue strategy towards minimizing the number of turns while keeping the success rate of dialogue tasks above a certain level. For UAVs on a search-and-rescue mission, the main goal is to find as many targets as possible while avoiding threats that may endanger the mission itself. The constrained POMDP (CPOMDP) [9] is an appealing framework for dealing with this kind of multi-objective sequential decision-making problem when the environment is partially observable. The model assumes that each action incurs not only a reward but also K different types of costs, and the goal is to find an optimal policy that maximizes the expected cumulative reward while bounding each of the K expected cumulative costs below a certain level. Although the CPOMDP is a very expressive model, it is known to be very difficult to solve due to the PSPACE-complete nature of solving POMDPs [16], originating from two main challenges: the curse of dimensionality and the curse of history.
Partially observable Monte-Carlo Planning (POMCP) [19] tames these two curses of POMDPs by using Monte-Carlo sampling both in the root belief state and in the black-box simulation of the history. In contrast, solution methods for CPOMDPs, e.g. dynamic programming [11], linear programming [18], and online uniform-cost search [23], are not yet advanced to this level, partly because they require an explicit model of the environment. This prevents the CPOMDP from being a practical approach for modeling real-world applications.

In this paper, we present an MCTS algorithm for CPOMDPs, which precisely addresses this scalability issue. To the best of our knowledge, extending MCTS to CPOMDPs (or even CMDPs) has remained unexplored, since it is not straightforward to handle the constrained optimization in tree search. This challenge is compounded by the fact that optimal policies can be stochastic.¹ In order to develop MCTS for CPOMDPs, we first show that solving a CPOMDP is essentially equivalent to jointly solving an unconstrained POMDP while optimizing its LP-induced parameters that control the trade-off between the reward and the costs. From this result, we present our algorithm, Cost-Constrained POMCP (CC-POMCP), for solving large CPOMDPs, which combines traditional MCTS with LP-induced parameter optimization. In the experiments section, we demonstrate that CC-POMCP converges to the optimal stochastic action selection on a synthetic domain and that it can handle very large problems, including constrained variants of Rocksample(15,15) and an Atari 2600 arcade game, pushing the state-of-the-art scalability of CPOMDP solvers.

## 2 Background

Partially observable Markov decision processes (POMDPs) [22] provide a principled framework for modeling sequential decision-making problems under stochastic transitions and noisy observations. A POMDP is formally defined by a tuple $\langle S_p, A, O_p, T_p, Z_p, R_p, \gamma, b_0 \rangle$, where $S_p$ is the set of environment states $s$, $A$ is the set of actions $a$, $O_p$ is the set of observations $o$, $T_p(s'|s,a) = \Pr(s_{t+1}=s' \mid s_t=s, a_t=a)$ is the transition probability, $Z_p(o|s',a) = \Pr(o_{t+1}=o \mid s_{t+1}=s', a_t=a)$ is the observation probability, $R_p(s,a) \in \mathbb{R}$ is the immediate reward for taking action $a$ in state $s$, $\gamma \in [0,1)$ is the discount factor, and $b_0(s) = \Pr(s_0 = s)$ is the starting state distribution at time step 0. The histories $h_t = [a_0, o_0, \ldots, a_t, o_t]$ and $h_t a_{t+1} = [a_0, o_0, \ldots, a_t, o_t, a_{t+1}]$ denote sequences of actions and observations. The agent takes actions via a policy $\pi(a|h) = \Pr(a_t = a \mid h_t = h)$ that maps histories to probability distributions over actions. In POMDPs, the environment state is not directly observable, so the agent maintains a belief $b_t(s) = \Pr(s_t = s \mid h_t)$ that can be recursively updated using Bayes' rule: when taking action $a$ in $b$ and observing $o$, the updated belief is

$$b^{ao}(s') \propto Z_p(o|s',a) \sum_s T_p(s'|s,a)\, b(s).$$

Since the belief $b_t$ is a sufficient statistic of the history $h_t$, the POMDP can be understood as the belief-state MDP $\langle B, A, T, R, \gamma, b_0 \rangle$, where $b_0$ is the initial state, $B$ is the set of beliefs reachable from $b_0$, $T(b'|b,a) = \sum_{o,s,s'} Z_p(o|s',a)\, T_p(s'|s,a)\, b(s)\, \delta(b', b^{ao})$ is the transition probability, and $R(b,a) = \sum_s b(s) R_p(s,a)$ is the immediate reward function. We shall use $h$ and $b = \Pr(s|h)$ interchangeably as long as there is no confusion (e.g. $Q_R(h,a) = Q_R(b,a)$, $\pi(a|h) = \pi(a|b)$). The goal is to find an optimal policy $\pi$ that maximizes the expected discounted return (i.e. the cumulative discounted reward):

$$\max_\pi V^\pi_R(b_0) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t R(b_t, a_t) \,\middle|\, b_0\right].$$
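For concreteness, the belief update above can be written in a few lines of numpy; a minimal sketch follows, where the tiny two-state model is an illustrative assumption and not a domain from the paper.

```python
# Exact belief update b^{ao}(s') ∝ Z(o|s',a) Σ_s T(s'|s,a) b(s) for tabular T and Z.
import numpy as np

def belief_update(b, a, o, T, Z):
    """b: (S,) belief; a, o: action/observation indices;
    T[s, a, s']: transition prob; Z[s', a, o]: observation prob."""
    predicted = b @ T[:, a, :]              # Σ_s T(s'|s,a) b(s)
    unnormalized = Z[:, a, o] * predicted   # weight by Z(o|s',a)
    return unnormalized / unnormalized.sum()

# tiny 2-state, 1-action, 2-observation example
T = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])   # T[s, a, s']
Z = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])   # Z[s', a, o]
b0 = np.array([0.5, 0.5])
print(belief_update(b0, a=0, o=1, T=T, Z=Z))  # updated belief over the two states
```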
Constrained POMDPs (CPOMDPs) [9, 11, 18] generalize POMDPs to multi-objective problems. A CPOMDP is formally defined by a tuple $\langle S_p, A, O_p, T_p, Z_p, R_p, C_p, \hat{c}, \gamma, b_0 \rangle$, where $C_p = \{C_{p,k}\}_{1..K}$ is a set of $K$ non-negative cost functions with individual thresholds $\hat{c} = \{\hat{c}_k\}_{1..K}$. Similarly, a CPOMDP can be cast into an equivalent belief-state CMDP $\langle B, A, T, R, C, \hat{c}, \gamma \rangle$, where $C_k(b,a) = \sum_s b(s)\, C_{p,k}(s,a)$. The goal is to compute an optimal policy that maximizes the expected cumulative reward while bounding the expected cumulative costs:

$$\max_\pi V^\pi_R(b_0) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t R(b_t,a_t) \,\middle|\, b_0\right] \quad \text{s.t.} \quad V^\pi_{C_k}(b_0) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t C_k(b_t,a_t) \,\middle|\, b_0\right] \le \hat{c}_k \;\; \forall k.$$

¹ The stochastic nature of optimal policies in CPOMDPs results from the stochasticity of optimal policies in CMDPs [1]. A more formal treatment of this matter, pertinent to CPOMDPs, can be found in [6].

An optimal policy of the CPOMDP (or of the equivalent belief-state CMDP) is generally stochastic and can be obtained by solving the following linear program (LP) [1]:

$$\max_{\{y(b,a) \ge 0\}_{b,a}} \;\sum_{b,a} R(b,a)\, y(b,a) \tag{1}$$
$$\text{s.t.} \quad \sum_{a'} y(b',a') = \delta(b_0, b') + \gamma \sum_{b,a} T(b'|b,a)\, y(b,a) \;\;\forall b', \qquad \sum_{b,a} C_k(b,a)\, y(b,a) \le \hat{c}_k \;\;\forall k,$$

where $y(b,a)$ can be interpreted as a discounted occupancy measure of $(b,a)$, and $\delta(x,y)$ is a Dirac delta that has value 1 if $x = y$ and 0 otherwise. Once the optimal solution $y^*(b,a)$ is obtained, an optimal stochastic policy and the corresponding optimal value are given by $\pi^*(a|b) = \Pr(a|b) = y^*(b,a) / \sum_{a'} y^*(b,a')$ and $V^*_R(b_0;\hat{c}) = \sum_{b,a} R(b,a)\, y^*(b,a)$, respectively. It is usually intractable to solve LP (1) exactly since the cardinality of $B$ can be infinite.

POMCP [19] is a highly scalable Monte-Carlo tree search (MCTS) algorithm for (unconstrained) POMDPs. The algorithm uses Monte-Carlo simulation for both tree search and belief update to simultaneously tackle the curse of history and the curse of dimensionality. In each simulation, a state particle is sampled from the root node's belief state, $s \sim B(h)$ (called root sampling), and is used to sample a trajectory with a black-box simulator, $(s', o, r) \sim G(s, a)$. POMCP adopts UCB1 [2] as the tree policy, i.e. the action selection rule at the internal nodes of the search tree:

$$\arg\max_a \; Q_R(h,a) + \kappa \sqrt{\frac{\log N(h)}{N(h,a)}},$$

where $Q_R(h,a)$ is the average of the sampled returns, $N(h)$ is the number of simulations performed through $h$, $N(h,a)$ is the number of times action $a$ has been selected in $h$, and $\kappa$ is the exploration constant that adjusts the exploration-exploitation trade-off. POMCP expands the search tree non-uniformly, focusing more search effort on promising nodes. It can be formally shown that $Q_R(h,a)$ asymptotically converges to the optimal value $Q^*_R(h,a)$ in POMDPs.

Unfortunately, it is not straightforward to use POMCP for CPOMDPs, since the original UCB1 action selection rule has no notion of cost constraints. If we naively adopt the vanilla, reward-maximizing POMCP, we may obtain cost-violating action sequences. We could maintain the average of the sampled cumulative costs $Q_C$ during search, but it is not obvious how to leverage it in the tree policy: if we naively prune action branches that violate the cost constraint, $Q_C(h,a) > \hat{c}$, we may end up with policies that are too conservative and thus sub-optimal, i.e. a feasible policy may be rejected during search whenever its Monte-Carlo estimate violates the cost constraint.
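To illustrate LP (1) on something finite, the following sketch solves the occupancy-measure LP for a tiny fully observable CMDP (two states, two actions) and recovers the stochastic policy from $y^*$. The model, the numbers, and the use of scipy.optimize.linprog are illustrative assumptions, not part of the paper.

```python
# LP (1) on a tiny tabular CMDP; variable y[s, a] is the discounted occupancy measure.
import numpy as np
from scipy.optimize import linprog

gamma, c_hat = 0.9, 0.3
S, A = 2, 2
T = np.zeros((S, A, S))
T[0, 0], T[0, 1] = [0.9, 0.1], [0.1, 0.9]
T[1, 0], T[1, 1] = [0.5, 0.5], [0.2, 0.8]
R = np.array([[1.0, 2.0], [0.0, 0.5]])   # action a=1 pays more in s=0 ...
C = np.array([[0.0, 1.0], [0.0, 0.2]])   # ... but incurs cost

c_obj = -R.flatten()                      # maximize Σ R y  ==  minimize -Σ R y
# flow constraints: Σ_a y(s',a) - γ Σ_{s,a} T(s'|s,a) y(s,a) = δ(s0, s')
A_eq = np.zeros((S, S * A))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = (1.0 if s == sp else 0.0) - gamma * T[s, a, sp]
b_eq = np.array([1.0, 0.0])               # start state s0 = 0
# cost constraint: Σ C y <= ĉ
res = linprog(c_obj, A_ub=[C.flatten()], b_ub=[c_hat],
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
assert res.success
y = res.x.reshape(S, A)
pi = y / y.sum(axis=1, keepdims=True)     # stochastic optimal policy π*(a|s)
print("pi* =\n", pi, "\nexpected cost =", float((C * y).sum()))
```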
## 3 Solving CPOMDPs via a POMDP Solver

The derivation of our algorithm starts from the dual of (1):

$$\min_{\{V(b)\}_b,\,\{\lambda_k \ge 0\}_k} \;\sum_b \delta(b_0, b)\, V(b) + \sum_k \hat{c}_k \lambda_k \tag{2}$$
$$\text{s.t.} \quad V(b) \ge R(b,a) - \sum_k C_k(b,a)\,\lambda_k + \gamma \sum_{b'} T(b'|b,a)\, V(b') \quad \forall b, a.$$

Observe that if we treat $\lambda = [\lambda_1, \ldots, \lambda_K]^\top$ as a constant, the problem becomes an unconstrained belief-state MDP with the scalarized reward function $R(b,a) - \lambda^\top C(b,a)$. Let $V^*_\lambda$ be the optimal value function of this unconstrained POMDP. Then, for any $\lambda$, there exists a corresponding unique $V^*_\lambda$, and we can compute $V^*_\lambda$ with a POMDP solver. Thus, solving the dual LP (2) reduces to

$$\min_{\lambda \ge 0} \;\Big[\, V^*_\lambda(b_0) + \lambda^\top \hat{c} \,\Big]. \tag{3}$$

Moreover, if there is an optimal solution $y^*$ to the primal LP (1), then there exists a corresponding dual optimal solution $V^*$ and $\lambda^*$, and the duality gap is zero, i.e.

$$V^*_R(b_0;\hat{c}) = \sum_{b,a} R(b,a)\, y^*(b,a) = V^*_{\lambda^*}(b_0) + \lambda^{*\top} \hat{c},$$

by the strong duality theorem.

To compute the optimal $\lambda^*$ in Eq. (3), we have to consider the trade-off between the first and the second term according to the cost constraint $\hat{c}$. For example, if the cost constraint $\hat{c}$ is very large, the optimal solution $\lambda^*$ tends to be close to zero, since the objective is then mostly affected by the second term $\lambda^\top \hat{c}$. On the other hand, if $\hat{c}$ is sufficiently small, the first term becomes dominant and the optimal $\lambda^*$ tends to grow larger, so as to penalize the reward through $R(b,a) - \lambda^\top C(b,a)$. It may therefore seem that Eq. (3) is a complex optimization problem. However, as the following proposition shows, the objective in Eq. (3) is actually piecewise-linear and convex over $\lambda$, as depicted in Figure 3 in Appendix A.

**Proposition 1.** Let $V^*_\lambda$ be the optimal value function of the POMDP with scalarized reward function $R(b,a) - \lambda^\top C(b,a)$. Then $V^*_\lambda(b_0) + \lambda^\top \hat{c}$ is a piecewise-linear and convex (PWLC) function of $\lambda$. (The proof is provided in Appendix A.)

In addition, we can show that the optimal solution $\lambda^*$ is bounded:

**Proposition 2** (Lemma 4 in [14]). Suppose that the reward function is bounded in $[R_{\min}, R_{\max}]$ and there exist $\tau > 0$ and a (feasible) policy $\pi$ such that $V^\pi_C(b_0) + \tau \mathbf{1} \le \hat{c}$. Then $\lambda^{*\top} \mathbf{1} \le \frac{R_{\max} - R_{\min}}{\tau(1-\gamma)}$.

Thus, from Propositions 1 and 2, we can obtain the optimal $\lambda^*$ by greedily optimizing (3) with each $\lambda_k$ in the range $\big[0, \frac{R_{\max} - R_{\min}}{\tau(1-\gamma)}\big]$. The remaining question is how to compute the update direction for $\lambda$. We start with the following lemma:

**Lemma 1.** Let $M_1 = \langle B, A, T, R_1, \gamma \rangle$ and $M_2 = \langle B, A, T, R_2, \gamma \rangle$ be two (belief-state) MDPs differing only in the reward function, and let $V^\pi_1$ and $V^\pi_2$ be the value functions of $M_1$ and $M_2$ under a fixed policy $\pi$. Then the value function of the new MDP $M = \langle B, A, T, p R_1 + q R_2, \gamma \rangle$ under $\pi$ is $V^\pi(b) = p\, V^\pi_1(b) + q\, V^\pi_2(b)$ for all $b \in B$. (The proof is provided in Appendix B.)

Lemma 1 implies that $V^*_\lambda$ can be decomposed as $V^*_\lambda(b_0) = V^{\pi^*_\lambda}_R(b_0) - \lambda^\top V^{\pi^*_\lambda}_C(b_0)$, where $\pi^*_\lambda$ is the optimal policy with respect to the scalarized reward function $R(b,a) - \lambda^\top C(b,a)$, and thus (3) becomes

$$\min_{\lambda \ge 0} \;\Big[\, V^{\pi^*_\lambda}_R(b_0) - \lambda^\top V^{\pi^*_\lambda}_C(b_0) + \lambda^\top \hat{c} \,\Big]. \tag{4}$$

One way to compute a descent direction for $\lambda$ is to take the derivative of Eq. (4) with respect to $\lambda$ while holding $\pi^*_\lambda$ constant, which yields the direction $V^{\pi^*_\lambda}_C(b_0) - \hat{c}$. The following theorem shows that this is indeed a valid direction:

**Theorem 1.** For any $\lambda$, $V^{\pi^*_\lambda}_C(b_0) - \hat{c}$ is a negative subgradient that decreases the objective in Eq. (3), where $\pi^*_\lambda$ is the optimal policy with respect to the scalarized reward function $R(b,a) - \lambda^\top C(b,a)$. Also, if $V^{\pi^*_\lambda}_C(b_0) - \hat{c} = 0$, then $\lambda$ is an optimal solution of Eq. (3). (Proof provided in Appendix C.)

The direction $V^{\pi^*_\lambda}_C(b_0) - \hat{c}$ has a natural interpretation: if the current policy violates the k-th cost constraint (i.e. $V^{\pi^*_\lambda}_{C_k}(b_0) > \hat{c}_k$), then $\lambda_k$ increases so that the cost is penalized more in the scalarized reward function $R(b,a) - \lambda^\top C(b,a)$.
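As a numerical companion to Proposition 1 and Theorem 1, the sketch below evaluates the dual objective $d(\lambda) = V^*_\lambda(s_0) + \lambda\hat{c}$ and the direction $V^{\pi^*_\lambda}_C(s_0) - \hat{c}$ on the same tiny fully observable CMDP as in the previous snippet (redefined here so this one runs on its own); by Proposition 1, the printed $d(\lambda)$ values trace out a piecewise-linear convex curve. All names and numbers are illustrative assumptions.

```python
# For each λ, solve the MDP with scalarized reward R - λC by value iteration,
# then policy-evaluate the cost of the resulting greedy policy π_λ.
import numpy as np

gamma, c_hat, s0 = 0.9, 0.3, 0
T = np.zeros((2, 2, 2))
T[0, 0], T[0, 1] = [0.9, 0.1], [0.1, 0.9]
T[1, 0], T[1, 1] = [0.5, 0.5], [0.2, 0.8]
R = np.array([[1.0, 2.0], [0.0, 0.5]])
C = np.array([[0.0, 1.0], [0.0, 0.2]])

def value_iteration(Rew, iters=2000):
    V = np.zeros(2)
    for _ in range(iters):
        Q = Rew + gamma * (T @ V)          # Q[s,a] = Rew[s,a] + γ Σ_{s'} T[s,a,s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V             # greedy policy π_λ and its value V*_λ

def policy_eval(Rew, pi, iters=2000):
    V = np.zeros(2)
    for _ in range(iters):
        V = np.array([Rew[s, pi[s]] + gamma * T[s, pi[s]] @ V for s in range(2)])
    return V

for lam in np.linspace(0.0, 3.0, 7):
    pi_lam, V_lam = value_iteration(R - lam * C)
    VC = policy_eval(C, pi_lam)            # cost value V^{π_λ}_C
    print(f"lam={lam:4.2f}  d(lam)={V_lam[s0] + lam * c_hat:7.3f}  "
          f"V_C(s0)-c_hat={VC[s0] - c_hat:7.3f}")
```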
On the other hand, if the current policy is too conservative with respect to the k-th cost constraint (i.e. $V^{\pi^*_\lambda}_{C_k}(b_0) < \hat{c}_k$), then $\lambda_k$ decreases so that the cost is penalized less. In summary, we can solve the dual LP of the belief-state CMDP by iterating the following steps, starting from any $\lambda$:

1. $\pi^*_\lambda \leftarrow \textsc{SolveBeliefMDP}(\langle B, A, T, R - \lambda^\top C, \gamma \rangle)$
2. $V^{\pi^*_\lambda}_C \leftarrow \textsc{PolicyEvaluation}(\langle B, A, T, C, \gamma \rangle, \pi^*_\lambda)$
3. $\lambda \leftarrow \lambda + \alpha_n \big(V^{\pi^*_\lambda}_C(b_0) - \hat{c}\big)$, clipping each $\lambda_k$ to the range $\big[0, \frac{R_{\max}-R_{\min}}{\tau(1-\gamma)}\big]$, $\forall k \in \{1, \ldots, K\}$.

By Theorem 1, this procedure is a subgradient method, guaranteed to converge to the optimal solution when using a step-size sequence $\alpha_n$ such that $\sum_n \alpha_n = \infty$ and $\sum_n \alpha_n^2 < \infty$.

## 4 Cost-Constrained POMCP (CC-POMCP)

Although we have eliminated the cost constraints by introducing the simultaneous update of $\lambda$, the procedure above still relies on exactly solving a POMDP via $\textsc{SolveBeliefMDP}$ in each iteration, which is impractical for large CPOMDPs. Fortunately, all we need in step 3 is the cost value at the initial belief state, $V^{\pi^*_\lambda}_C(b_0)$, with respect to the optimal policy when the reward function is given by $R - \lambda^\top C$. This is exactly the situation where MCTS can be effectively applied: MCTS focuses on finding the optimal action selection at the root node, using Monte-Carlo estimates of the long-term rewards (or costs). We are now ready to present our online algorithm for large CPOMDPs, which we refer to as Cost-Constrained POMCP (CC-POMCP), shown in Algorithm 1. The changes from the standard POMCP are the maintenance of the cost statistics $Q_C$ and $\bar{c}$, the GREEDYPOLICY routine, the $\lambda$ update in SEARCH, and the admissible-cost update in MAINLOOP.

Algorithm 1: Cost-Constrained POMCP (CC-POMCP)

    function SEARCH(h0)
        initialize λ randomly
        repeat
            if h0 = ∅ then s ∼ b0 else s ∼ B(h0)
            SIMULATE(s, h0, 0)
            a ∼ GREEDYPOLICY(h0, 0, 0)
            λ ← λ + αn [QC(h0, a) − ĉ]
            clip λk to the range [0, (Rmax − Rmin)/(τ(1 − γ))]  ∀k ∈ {1, ..., K}
        until TIMEOUT()
        return GREEDYPOLICY(h0, 0, ν)
    end function

    function ROLLOUT(s, h, d)
        if d = (maximum depth) then return [0, 0] end if
        a ∼ πrollout(·|h) and (s′, o, r, c) ∼ G(s, a)
        return [r, c] + γ · ROLLOUT(s′, hao, d + 1)
    end function

    function GREEDYPOLICY(h, κ, ν)
        Qλ(h, a) := QR(h, a) − λ⊤ QC(h, a) + κ √(log N(h) / N(h, a))
        a* ← argmax_a Qλ(h, a)
        A* ← { a*_i : |Qλ(h, a*_i) − Qλ(h, a*)| ≤ ν (√(log N(h, a*_i)/N(h, a*_i)) + √(log N(h, a*)/N(h, a*))) }   (exact condition in Appendix E)
        solve LP (10) with A* to compute a policy π(a*_i | h) = w_i
        return π(·|h)
    end function

    function SIMULATE(s, h, d)
        if d = (maximum depth) then return [0, 0] end if
        if h ∉ T then
            T(ha) ← (N_init, QR,init, QC,init, ∅)  ∀a
            return ROLLOUT(s, h, d)
        end if
        a ∼ GREEDYPOLICY(h, κ, ν)
        (s′, o, r, c) ∼ G(s, a)
        [R, C] ← [r, c] + γ · SIMULATE(s′, hao, d + 1)
        B(h) ← B(h) ∪ {s}
        N(h) ← N(h) + 1
        N(h, a) ← N(h, a) + 1
        QR(h, a) ← QR(h, a) + (R − QR(h, a)) / N(h, a)
        QC(h, a) ← QC(h, a) + (C − QC(h, a)) / N(h, a)
        c̄(h, a) ← c̄(h, a) + (c − c̄(h, a)) / N(h, a)
        return [R, C]
    end function

    function MAINLOOP()
        ĉ ← (cost constraint), s ← (initial state), h ← ∅
        while s is not terminal do
            π ← SEARCH(h)
            a ∼ π(·|h)
            (s′, o, r, c) ∼ G(s, a)
            ĉ ← [ ĉ − π(a|h) c̄(h, a) − Σ_{a′≠a} π(a′|h) QC(h, a′) ] / (γ π(a|h))
            s ← s′,  h ← hao
        end while
    end function
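To make the $\lambda$ coupling concrete, here is a stripped-down Python sketch of the two pieces that differ from plain POMCP: the scalarized UCB1 selection used by GREEDYPOLICY (ignoring the LP-based randomization described in Section 4.2) and the per-iteration $\lambda$ update performed in SEARCH, for the single-cost case (K = 1). The node layout, function names, and numbers are illustrative assumptions, not the authors' implementation.

```python
# Scalarized UCB1 action selection and the root-level λ update of CC-POMCP (simplified).
import math

def greedy_action(node, lam, kappa):
    """node: {'N': int, 'children': {a: {'N': int, 'QR': float, 'QC': float}}}"""
    def score(stats):
        if stats['N'] == 0:
            return float('inf')                              # try unvisited actions first
        bonus = kappa * math.sqrt(math.log(max(node['N'], 1)) / stats['N'])
        return stats['QR'] - lam * stats['QC'] + bonus       # Qλ(h,a) plus exploration bonus
    return max(node['children'], key=lambda a: score(node['children'][a]))

def update_lambda(lam, qc_root_action, c_hat, alpha, lam_max):
    """λ ← clip(λ + α (QC(h0, a) − ĉ), [0, λ_max]); λ_max plays the role of (Rmax−Rmin)/(τ(1−γ))."""
    return min(max(lam + alpha * (qc_root_action - c_hat), 0.0), lam_max)

# toy usage: a root node with two actions, one rewarding but costly
root = {'N': 120, 'children': {0: {'N': 70, 'QR': 1.1, 'QC': 0.9},
                               1: {'N': 50, 'QR': 0.6, 'QC': 0.1}}}
lam, c_hat = 0.5, 0.3
a = greedy_action(root, lam, kappa=0.0)       # κ = 0 at the root, as in SEARCH
lam = update_lambda(lam, root['children'][a]['QC'], c_hat, alpha=0.1, lam_max=10.0)
print(a, lam)                                 # the costly action raises λ toward penalizing cost
```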
CC-POMCP is an extension of POMCP with cost constraints and can be seen as an anytime approximation of policy iteration with simultaneous optimization of $\lambda$: the policy is sequentially evaluated via the Monte-Carlo returns

$$Q_R(h,a) \leftarrow Q_R(h,a) + \frac{R - Q_R(h,a)}{N(h,a)} \quad \text{and} \quad Q_C(h,a) \leftarrow Q_C(h,a) + \frac{C - Q_C(h,a)}{N(h,a)}, \tag{5}$$

and the policy is implicitly improved by the UCB1 action selection rule based on the scalarized value $Q_\lambda(h,a) = Q_R(h,a) - \lambda^\top Q_C(h,a)$:

$$\arg\max_a \;\Big[ Q_R(h,a) - \lambda^\top Q_C(h,a) + \kappa \sqrt{\tfrac{\log N(h)}{N(h,a)}} \Big]. \tag{6}$$

Finally, $\lambda$ is updated simultaneously using the current estimate of $V_C(b_0) - \hat{c}$, which is the descent direction of the convex objective function:

$$\lambda \leftarrow \lambda + \alpha_n \big( Q_C(h_0, a) - \hat{c} \big), \quad \text{where } a \sim \pi(\cdot|h_0). \tag{7}$$

The following theorem states that CC-POMCP asymptotically converges to the optimal $\lambda$ under a mild assumption:

**Theorem 2.** Suppose that $\lambda$ is updated with increasing simulation step $t$, and that the search tree is reset at the end of each $\lambda$ update as detailed in Appendix F. If the asymptotic bias of UCT holds for all types of cost values (i.e. $\exists M > 0$ such that $\forall k$, $|V^{\pi^*_\lambda}_{C_k}(h_0) - V_{C_k}(h_0)| \le M \frac{\log t}{t}$), then either $\operatorname{sign}\!\big(V^{\pi^*_\lambda}_{C_k}(h_0) - \hat{c}_k\big) = \operatorname{sign}\!\big(V_{C_k}(h_0) - \hat{c}_k\big)$ or $|V^{\pi^*_\lambda}_{C_k}(h_0) - \hat{c}_k| \le M \frac{\log t}{t}$ holds with probability 1 as $t \to \infty$.

The theorem states that either $\lambda$ is already close to optimal or it is improved by the update in the direction of the negative subgradient. Note that CC-POMCP inherits the scalability of POMCP and thus does not require an explicit model of the environment: all we need is a black-box simulator $G$ of the CPOMDP, which generates a sample $(s', o, r, c) \sim G(s, a)$ of the next state $s'$, observation $o$, reward $r$, and cost vector $c$, given the current state $s$ and action $a$.

### 4.1 Admissible Costs

After the agent executes an action, the cost constraint threshold $\hat{c}$ must be updated for the next time step. For this, we reformulate the notion of admissible cost [17], originally formulated for dynamic programming. The admissible cost $\hat{c}_{t+1}$ at time step $t+1$ denotes the expected total cost allowed to be incurred in future time steps $\{t+1, t+2, \ldots\}$ without violating the cost constraints. Under dynamic programming, the update is given by [17]

$$\hat{c}_{t+1} = \frac{\hat{c}_t - \mathbb{E}[C(b_t, a_t) \mid b_0, \pi]}{\gamma},$$

where evaluating $\mathbb{E}[C(b_t, a_t) \mid b_0, \pi]$ requires the probability of reaching $(b_t, a_t)$ at time step $t$, which in turn requires marginalizing out the past history $[a_0, b_1, a_1, \ldots, b_{t-1}]$. This is intractable for large state spaces. On the other hand, under forward search, the admissible cost at the next time step $t+1$ is simply $\hat{c}_{t+1} = V^{\pi^*}_C(b_{t+1})$. We can access $V^{\pi^*}_C(b_{t+1})$ by starting from the root node of the search tree $h_t$ and sequentially following the action branch $a_t$ and the next observation branch $o_{t+1}$. Here we are assuming that the exact optimal $V^{\pi^*}_C$ is obtained, which is achievable after infinitely many simulations of CC-POMCP. Note also that even though $\hat{c}_t > V^{\pi^*}_C(b_t)$ is possible in general, assuming $\hat{c}_t = V^{\pi^*}_C(b_t)$ does not change the solution. If $\hat{c}_t < V^{\pi^*}_C(b_t)$, no feasible policy exists.
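As a small illustration of the two updates above, the sketch below contrasts the dynamic-programming recursion with the forward-search rule that simply reads $V^*_C$ off the child node of the search tree. The dict-based node layout is an illustrative assumption, not the authors' data structure.

```python
# Two ways to maintain the admissible cost budget between time steps.

def admissible_cost_dp(c_hat_t, expected_cost_t, gamma):
    # DP recursion ĉ_{t+1} = (ĉ_t - E[C(b_t,a_t)|b_0,π]) / γ; the expectation is
    # generally intractable to evaluate for large state spaces.
    return (c_hat_t - expected_cost_t) / gamma

def admissible_cost_forward(root, a_t, o_tp1):
    # Forward-search shortcut ĉ_{t+1} = V*_C(b_{t+1}): follow the action branch a_t and
    # the observation branch o_{t+1}; each node is assumed to cache 'VC', its cost value.
    return root['children'][a_t]['obs'][o_tp1]['VC']

# toy usage
tree = {'children': {0: {'obs': {'o1': {'VC': 0.42}}}}}
print(admissible_cost_dp(1.0, 0.2, 0.95))        # DP-style budget update
print(admissible_cost_forward(tree, 0, 'o1'))    # forward-search budget update
```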
### 4.2 Filling the Gap: Stochastic vs. Deterministic Policies

Our approach relies on the POMDP with scalarized rewards, but care must be taken, since the optimal policy of the CPOMDP is generally stochastic: given the optimal $\lambda^*$, let $\pi^*_\lambda$ be the deterministic optimal policy for the POMDP with the scalarized reward function $R - \lambda^{*\top} C$. Then, by the duality between the primal (1) and the dual (2),

$$V^*_R(b_0;\hat{c}) = V^*_{\lambda^*}(b_0) + \lambda^{*\top}\hat{c} = V^{\pi^*_\lambda}_R(b_0) - \lambda^{*\top}\big(V^{\pi^*_\lambda}_C(b_0) - \hat{c}\big). \tag{8}$$

This implies that if $\lambda^*_k > 0$ and $V^{\pi^*_\lambda}_{C_k}(b_0) \neq \hat{c}_k$ for some $k$, then $\pi^*_\lambda$ is not optimal for the original CPOMDP. This is exactly the situation where the optimal policy is stochastic. In order to make the policy computed by our algorithm stochastic, we make sure that the following optimality condition, derived from $V^*_R(b_0;\hat{c}) = V^{\pi^*_\lambda}_R(b_0)$, is satisfied:

$$\sum_{k=1}^K \lambda^*_k \big( V^{\pi^*_\lambda}_{C_k}(b_0) - \hat{c}_k \big) = \sum_{k=1}^K \lambda^*_k \Big( \sum_a \pi^*_\lambda(a|b_0)\, Q^{\pi^*_\lambda}_{C_k}(b_0, a) - \hat{c}_k \Big) = 0. \tag{9}$$

That is, actions $a^*_i$ with equally maximal scalarized action values $Q_\lambda(b, a^*_i) = Q_R(b, a^*_i) - \lambda^\top Q_C(b, a^*_i)$ form the support of the stochastic policy and are selected with probabilities $\pi(a^*_i|b)$ that satisfy, for all $k$ with $\lambda^*_k > 0$, $\sum_{a^*_i} \pi(a^*_i|b)\, Q_{C_k}(b, a^*_i) = \hat{c}_k$.

Figure 1: The result of the CPOMDP Toy domain [11] and the constrained variants of Rocksample [21]; the result for Rocksample (15,15) is presented in Appendix I. For each domain, the left panel shows the average discounted cumulative reward and the right panel shows the average discounted cumulative cost, against the number of simulations; the wall-clock search time of CC-POMCP (in seconds) is shown on the top x-axis. Panels: (a) Toy domain, comparing Optimal (stochastic), Optimal (deterministic), Baseline, and CC-POMCP (ours); (b) Rocksample (5,7), (c) Rocksample (7,8), and (d) Rocksample (11,11), comparing CALP, Baseline, and CC-POMCP (ours).

GREEDYPOLICY in Algorithm 1 computes the stochastic policy according to this principle. In practice, due to the randomness of Monte-Carlo sampling, the action values in (9) are always subject to estimation error, so the condition is reformulated as a linear program with up to $|A| + 2K$ variables:

$$\min_{\{w_i,\, \xi^+_k,\, \xi^-_k\}} \;\sum_{k=1}^K \lambda_k \big( \xi^+_k + \xi^-_k \big) \tag{10}$$
$$\text{s.t.} \quad \sum_{i:\, a^*_i \in A^*} w_i\, Q_{C_k}(h, a^*_i) = \hat{c}_k + \big( \xi^+_k - \xi^-_k \big) \;\;\forall k, \qquad \sum_{i:\, a^*_i \in A^*} w_i = 1, \qquad w_i,\, \xi^+_k,\, \xi^-_k \ge 0,$$

where $A^* = \{ a^*_i \mid Q_\lambda(h, a^*_i) \approx Q_\lambda(h, a^*) \text{ with } a^* = \arg\max_a Q_\lambda(h,a) \}$,² and the solutions are $w_i = \pi(a^*_i|h)$. When $K = 1$, there is a simple analytic solution to LP (10), described in Appendix G. Even when $K > 1$, note that the optimization is needed only when more than one action attains the maximal scalarized action value, so that randomization over actions is required. It is well known that an optimal policy of a CMDP requires at most $K$ randomizations [1], which means that we expect to invoke the optimization on an extremely small part of the state space when the problem is very large.

² The exact condition for $Q_\lambda(h, a^*_i) \approx Q_\lambda(h, a^*)$ and its theoretical guarantee are provided in Appendix E.
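LP (10) is small enough to hand to a generic LP solver; the sketch below sets it up with scipy.optimize.linprog for a given near-maximal action set $A^*$, assuming the cost Q-values of those actions have already been read from the tree. The helper name and the toy numbers are ours, not the authors' implementation.

```python
# LP (10): mix the near-maximal actions so the expected cost hits ĉ as closely as possible.
# Variable layout: x = [w_1..w_n, xi_plus_1..xi_plus_K, xi_minus_1..xi_minus_K].
import numpy as np
from scipy.optimize import linprog

def mix_actions(QC, c_hat, lam):
    """QC: (n, K) cost Q-values of the actions in A*; c_hat, lam: length-K. Returns weights w."""
    n, K = QC.shape
    lam = np.asarray(lam, dtype=float)
    c_obj = np.concatenate([np.zeros(n), lam, lam])          # Σ_k λ_k (ξ⁺_k + ξ⁻_k)
    # Σ_i w_i Q_{C_k}(i) - ξ⁺_k + ξ⁻_k = ĉ_k  for each k
    A_eq = np.hstack([QC.T, -np.eye(K), np.eye(K)])
    b_eq = np.asarray(c_hat, dtype=float)
    # Σ_i w_i = 1
    A_eq = np.vstack([A_eq, np.concatenate([np.ones(n), np.zeros(2 * K)])])
    b_eq = np.append(b_eq, 1.0)
    res = linprog(c_obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    assert res.success
    return res.x[:n]

# two near-maximal actions, one cost dimension: the mixture hits ĉ exactly
print(mix_actions(QC=np.array([[0.9], [0.1]]), c_hat=[0.3], lam=[1.0]))  # ≈ [0.25, 0.75]
```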
## 5 Experiments

All the parameters for running CC-POMCP are provided in Appendix H.

**Baseline agent.** To the best of our knowledge, this work is the first attempt to solve constrained (PO)MDPs using Monte-Carlo tree search. Since there is no algorithm available for direct performance comparison on large problems, we implemented a simple baseline agent using MCTS. This agent works as outlined in Section 2: it chooses actions via reward-maximizing POMCP while pruning action branches that violate the cost constraint, $Q_C(s,a) > \hat{c}$. If all action branches violate the cost constraints, the agent chooses an action uniformly at random.

| Domain | \|S\| | ĉ | Algorithm | Cumulative reward | Cumulative cost |
|---|---|---|---|---|---|
| Rocksample (5,7) | 3,201 | 1 | CALP | 12.77 ± 0 | 0.78 ± 0 |
| | | | Baseline | 1.09 ± 0.88 | 12.74 ± 0.50 |
| | | | CC-POMCP | 11.36 ± 1.02 | 0.79 ± 0.06 |
| Rocksample (7,8) | 12,544 | 1 | CALP | 3.67 ± 0 | 1.20 ± 0 |
| | | | Baseline | 0.23 ± 0.44 | 13.92 ± 0.33 |
| | | | CC-POMCP | 9.36 ± 0.76 | 0.56 ± 0.06 |
| Rocksample (11,11) | 247,808 | 1 | CALP | N/A | N/A |
| | | | Baseline | 0.14 ± 0.33 | 15.29 ± 0.25 |
| | | | CC-POMCP | 2.65 ± 0.73 | 0.09 ± 0.04 |
| Rocksample (15,15) | 7,372,800 | 1 | CALP | N/A | N/A |
| | | | Baseline | 0.39 ± 0.58 | 16.27 ± 0.27 |
| | | | CC-POMCP | 0.74 ± 0.33 | 0.69 ± 0.08 |

Table 1: Comparison of CC-POMCP with the state-of-the-art offline solver, CALP [18].

**CPOMDP: Toy and Rocksample.** We first tested CC-POMCP on the synthetic toy domain introduced in [11] to demonstrate convergence to stochastic optimal actions; the cost constraint ĉ is 0.95, and any deterministic policy is either suboptimal or violates the cost constraint. As can be seen in Figure 1a, CC-POMCP converges to the optimal stochastic action selection (thus experimentally confirming the soundness of the algorithm), while the baseline agent converges to a suboptimal policy (the optimal policy among deterministic ones).

We also conducted experiments on cost-constrained variants of Rocksample [21]. Rocksample(n, k) simulates a Mars rover on an n × n grid containing k rocks. The goal is to identify the good rocks, collect them, and escape the map by moving to its rightmost edge. We augmented the single-objective Rocksample with a cost function that assigns cost 1 to low-reward state-action pairs (i.e. $C_p(s,a) = 1$ if $R_p(s,a) < 0$), similarly to [18]. We also assigned a cost of 1 to the actions that sense whether a rock is good or bad. The cost constraint ĉ is set to 1.

We compared CC-POMCP with the state-of-the-art offline CPOMDP solver, CALP [18]. CALP was allowed 10 minutes of offline computation, and we performed exact policy evaluation with respect to the resulting finite-state controller, without simulation in the real environment. The results on Rocksample are summarized in Table 1 and Figure 1. On Rocksample (5,7), the reward performance of CC-POMCP is comparable to CALP when more than 2 seconds of search time is allowed, while at the same time satisfying the cost constraint. In contrast, the baseline agent essentially exhibits random behavior, since at an early stage of search the Monte-Carlo cost returns violate the cost constraint for all actions. On Rocksample (7,8), CALP failed to compute a feasible policy, and CC-POMCP outperformed CALP in terms of reward while satisfying the cost constraint. Finally, CC-POMCP was able to scale to Rocksample (11,11) and (15,15): given a few seconds of search time, it found actions satisfying the cost constraints and tended to yield higher returns as the number of simulations increased.

**CMDP: Pong.** We also conducted experiments on a multi-objective version of PONG, an arcade game running on the Arcade Learning Environment (ALE) [3], depicted in Figure 2a. In this domain, the left paddle is controlled by the default computer opponent and the right paddle is controlled by the agent. We use the RAM state features, i.e. the states are binary strings of length 1024, which results in $|S| = 2^{1024}$. The action space is {up, down, stay}. The agent receives a reward in $\{-1, 1\}$ for each round, depending on win/lose. The episode terminates when the accumulated reward reaches $-21$ or $21$.
We assigned cost 0 to the center area (positions [0.4, 0.6]), cost 1 to the neighboring areas (positions [0.2, 0.4] ∪ [0.6, 0.8]), and cost 2 to the areas farthest from the center (positions [0.0, 0.2] ∪ [0.8, 1.0]). This cost function is motivated by a scenario in which a human expert tries to constrain the RL agent to adhere to the advice that it should stay in the center; the advice is encoded as the cost function and its threshold. We experimented with various cost-constraint thresholds ĉ ∈ {200, 100, 50, 30, 20}, ranging from the unconstrained case ĉ = 200 ($C_{\max}/(1-\gamma) = 200$) to the tightly constrained case ĉ = 20. The agent thus has two conflicting objectives: in order to achieve high rewards, it sometimes needs to move the paddle to positions far from the center, but if this happens too often, the cost constraint will be violated. It therefore needs to trade off reward and cost depending on the cost-constraint threshold ĉ.

Figure 2: (a) Multi-objective version of Atari 2600 PONG ($|S| = 2^{1024}$, $|A| = 3$), visualizing the cost function. (b) Results of the constrained PONG: histograms of the CC-POMCP agent's position, where the horizontal axis denotes the position of the agent (0: topmost, 1: bottommost) and the vertical axis denotes the relative discounted visitation rate for each bin, together with the simulation results below (± values are deviations; minus signs are inferred from the FOE vs. ALGO scores).

| ĉ | Algorithm | Avg. cumulative rewards | Avg. discounted cumulative costs | Avg. score (FOE vs. ALGO) |
|---|---|---|---|---|
| 200 | CC-POMCP | 21.00 ± 0.00 | 133.00 ± 4.97 | 0.0 vs. 21.0 |
| | Baseline | 21.00 ± 0.00 | 136.66 ± 4.45 | 0.0 vs. 21.0 |
| 100 | CC-POMCP | 19.27 ± 1.63 | 99.88 ± 0.13 | 1.4 vs. 20.7 |
| | Baseline | −15.05 ± 3.83 | 110.88 ± 3.86 | 18.9 vs. 3.9 |
| 50 | CC-POMCP | 17.88 ± 1.79 | 49.95 ± 0.07 | 2.8 vs. 20.7 |
| | Baseline | −20.45 ± 0.26 | 130.40 ± 4.99 | 21.0 vs. 0.6 |
| 30 | CC-POMCP | −0.07 ± 5.23 | 30.40 ± 0.46 | 13.2 vs. 13.2 |
| | Baseline | −20.48 ± 0.30 | 131.37 ± 5.08 | 21.0 vs. 0.5 |
| 20 | CC-POMCP | −17.80 ± 2.91 | 25.36 ± 1.25 | 20.1 vs. 2.2 |
| | Baseline | −20.48 ± 0.30 | 131.37 ± 5.08 | 21.0 vs. 0.5 |

Figure 2b summarizes the experimental results of the CC-POMCP and baseline agents. When ĉ = 200 (the unconstrained case), both algorithms always win the game 21 to 0. As we lower ĉ, CC-POMCP tends to stay in the center in order to trade off reward against cost (shown in the histograms of Figure 2b). We can also see that the agent gradually performs worse in terms of score as ĉ decreases. This is a natural result, since it is forced to stay in the center and thus sacrifices game score. Overall, CC-POMCP computes a good policy while generally respecting the cost constraint. The baseline, on the other hand, fails to find a meaningful policy except when ĉ = 200, since the Monte-Carlo cost returns at an early stage of search mostly violate the cost constraint, resulting in random behavior.

## 6 Conclusion

We presented CC-POMCP, an online MCTS algorithm for very large CPOMDPs. We showed that solving the dual LP of a CPOMDP is equivalent to jointly solving an unconstrained POMDP and optimizing its LP-induced parameters λ, and provided theoretical results that shed light on the properties of λ and how to optimize it. We then extended POMCP to maximize the scalarized value while simultaneously updating λ using the current action-value estimates $Q_C$. We also empirically showed that CC-POMCP converges to the optimal stochastic actions on a toy domain and easily scales to very large CPOMDPs, as demonstrated on the constrained variants of Rocksample and the multi-objective version of PONG.
**Acknowledgements.** This work was supported by the ICT R&D program of MSIT/IITP of Korea (No. 2017-0-01778) and by DAPA/ADD of Korea (UD170018CD). J. Lee acknowledges the Global Ph.D. Fellowship Program by NRF of Korea (NRF-2018 Global Ph.D. Fellowship Program).

## References

[1] Eitan Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[3] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[4] Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4:1–49, 2012.
[5] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings of the 5th International Conference on Computers and Games, pages 72–83, 2006.
[6] Eugene A. Feinberg and Aleksey B. Piunovskiy. Nonatomic total rewards Markov decision processes with multiple criteria. Journal of Mathematical Analysis and Applications, 273:93–111, 2002.
[7] Sylvain Gelly and David Silver. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175(11):1856–1875, 2011.
[8] Arthur Guez, David Silver, and Peter Dayan. Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, pages 841–883, 2013.
[9] Joshua D. Isom, Sean P. Meyn, and Richard D. Braatz. Piecewise linear dynamic programming for constrained POMDPs. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pages 291–296, 2008.
[10] Sammie Katt, Frans A. Oliehoek, and Christopher Amato. Learning in POMDPs with Monte Carlo tree search. In Proceedings of the 34th International Conference on Machine Learning, pages 1819–1827, 2017.
[11] Dongho Kim, Jaesong Lee, Kee-Eung Kim, and Pascal Poupart. Point-based value iteration for constrained POMDPs. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI-11), pages 1968–1974, 2011.
[12] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the Seventeenth European Conference on Machine Learning (ECML 2006), pages 282–293, 2006.
[13] Levente Kocsis, Csaba Szepesvári, and Jan Willemson. Improved Monte-Carlo Search. Technical Report 1, University of Tartu, Estonia, 2006.
[14] Jongmin Lee, Youngsoo Jang, Pascal Poupart, and Kee-Eung Kim. Constrained Bayesian reinforcement learning via approximate linear programming. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 2088–2095, 2017.
[15] Eunsoo Oh and Kee-Eung Kim. A geometric traversal algorithm for reward-uncertain MDPs. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 565–572, 2011.
[16] Christos Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[17] Alexei B. Piunovskiy and Xuerong Mao. Constrained Markovian decision processes: the dynamic programming approach. Operations Research Letters, 27(3):119–126, 2000.
[18] Pascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, and Michael Bowling. Approximate linear programming for constrained partially observable Markov decision processes. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 3342–3348, 2015.
[19] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems 23, pages 2164–2172, 2010.
[20] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, pages 484–489, 2016.
[21] Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-04), pages 520–527, 2004.
[22] Edward J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.
[23] A. Undurti and J. P. How. An online algorithm for constrained POMDPs. In 2010 IEEE International Conference on Robotics and Automation, pages 3966–3973, 2010.
[24] Jason D. Williams and Steve Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422, 2007.