# Recursive Small-Step Multi-Agent A* for Dec-POMDPs

Wietze Koops, Nils Jansen, Sebastian Junges and Thiago D. Simão
Radboud University, Nijmegen, The Netherlands
{wietze.koops, nils.jansen, sebastian.junges, thiago.simao}@ru.nl

We present recursive small-step multi-agent A* (RS-MAA*), an exact algorithm that optimizes the expected reward in decentralized partially observable Markov decision processes (Dec-POMDPs). RS-MAA* builds on multi-agent A* (MAA*), an algorithm that finds policies by exploring a search tree, but tackles two major scalability concerns. First, we employ a modified, small-step variant of the search tree that avoids the double-exponential outdegree of the classical formulation. Second, we use a tight, recursive heuristic that we compute on the fly, thereby avoiding an expensive precomputation. The resulting algorithm is conceptually simple, yet it shows superior performance on a rich set of standard benchmarks.

## 1 Introduction

This paper considers finite-horizon decentralized decision-making under stochastic dynamics and partial observability. Markov decision processes (MDPs) are the standard formalism for decision-making under (stochastic) uncertainty. These models assume that a centralized controller can fully observe the state. Partially observable MDPs (POMDPs) capture the natural restriction that a controller only observes features of the current state. However, they still assume a centralized controller that can collect information from all sensors. In decentralized control, we assume that various agents each locally observe features of the state space and must locally select actions to execute. Such problems are formally captured by decentralized partially observable Markov decision processes (Dec-POMDPs) [Oliehoek and Amato, 2016]. These models naturally capture, for instance, sensor networks or multi-robot navigation in settings where communication is impossible, ineffective, or lossy.

Finding an optimal (decentralized) controller (hereafter: a policy) is computationally challenging; the associated decision problem is NEXP-hard [Bernstein et al., 2002]. Nevertheless, a series of advances has resulted in a collection of effective algorithms that can find (1) provably optimal (exact) or (2) ε-optimal solutions. The focus of this paper is the former type, which we refer to as exact algorithms. In particular, we consider a family of exact algorithms referred to as multi-agent A* (MAA*) [Szer et al., 2005; Oliehoek et al., 2008; Oliehoek et al., 2013], based on A*-style heuristic search, that finds provably optimal policies.

Multi-agent A*. At every time step (stage) of the decision-making problem, a local policy must reason about the history of observations to decide upon an action. Such a mapping is commonly referred to as a decision rule, and the decision rules of all agents together form a joint decision rule. A (joint) policy is then described by a sequence of (joint) decision rules. MAA* employs a search tree where a path through the tree encodes this sequence of decision rules. Consequently, the leaves of this tree correspond to all policies. MAA* searches through the tree by incrementally fixing decision rules for the individual stages. Every node corresponds to a partial policy up to time step $t$.
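To see why this tree grows so quickly (the outdegree claim made precise in the next paragraph), a short worked count may help; the concrete numbers below are only illustrative. For $n$ agents with local action sets $A_i$ and local observation sets $O_i$, a stage-$t$ local decision rule maps each of the $|O_i|^t$ length-$t$ local observation histories to one of $|A_i|$ actions, so a node at stage $t$ has one child per joint decision rule, i.e.,

$$\prod_{i=1}^{n} |A_i|^{|O_i|^{t}}$$

children, which is double exponential in $t$. For instance, with two agents that each have three local actions and two local observations, a node at stage $t = 3$ already has $3^{8} \cdot 3^{8} \approx 4.3 \cdot 10^{7}$ children.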
As standard in A*, the order in which we explore the tree is steered by an admissible heuristic that, for every inner node of the tree, guarantees that we conservatively overapproximate the optimal reward obtained by a policy at the leaves of the subtree. Note that nodes reflect choices for the joint decision rules, i.e., we must choose responses for each agent and for each observation history. This representation yields a search tree with an outdegree that is double exponential in the stage $t$. Avoiding the double-exponential blowup and providing a tight admissible heuristic are two primary concerns when developing variations of MAA*.

Our Approach: Small-Step MAA*. In classical MAA*, expanding nodes is cumbersome due to the double-exponential outdegree of nodes. While incremental expansion [Spaan et al., 2011] aims to alleviate this prohibitive outdegree, we prevent it altogether by taking smaller steps in the decision-making process. In particular, we pick actions for each agent and each observation history successively. Compared to MAA*, our search tree reduces the outdegree of nodes from double exponential to constant, while increasing the height from linear to exponential. Paired with a tight heuristic, this significantly limits the number of nodes expanded. The depth of the tree is further restricted by clustering [Oliehoek et al., 2009], which is applicable off-the-shelf to small-step MAA*.

Admissible Heuristics. A further important step towards efficient small-step MAA* is the efficient computation of a tight, admissible heuristic. We base our heuristic on the recursive heuristic from Szer et al. [2005]. In a nutshell, to obtain a heuristic value for a node at stage $t$, we assume that, at stage $t$, the agents communicate their local observations once. This (generally unrealistic) assumption yields a reward value that overapproximates the actual value and thus gives an admissible heuristic. In particular, the assumption allows us to compute a distribution over the current state. We can then compute a heuristic value by solving a Dec-POMDP for horizon $h - t$ from this distribution. We add two ingredients that make this heuristic effective. First, we can decide to assume communication at an earlier stage $t' < t$, which often allows us to reuse earlier computations and provides even tighter results. Second, when computing the heuristic value, we can terminate the computation early and use the current heuristic value as a sound overapproximation of the true value. This means that even an aborted computation yields an admissible heuristic value. With these ingredients, the recursive heuristic has three benefits: 1) it is tighter than the heuristics employed by the state of the art; 2) it can be computed on the fly, avoiding the precomputation and storage of heuristic values; and 3) it has no dependencies on other algorithms or formalisms, such as POMDP solvers or Bayesian games.

Contributions. We present recursive small-step multi-agent A* (RS-MAA*), a fast yet conceptually simple instantiation of MAA* (proofs and source code are given in the supplementary material, available at https://zenodo.org/record/7949016). In particular, RS-MAA* can be concisely implemented and relies neither on precomputed external heuristics nor on solvers for cooperative Bayesian games. Its on-the-fly recursive heuristic computes heuristic values only for nodes in the search tree that are actually expanded.
This alleviates a well-known weakness of the state-of-the-art GMAA*-ICE, which relates to the memory consumption of storing the heuristic values [Spaan et al., 2011]. Our formulation is compatible with clustering [Oliehoek et al., 2009]. A prototype of our algorithm significantly outperforms state-of-the-art optimal planners. In particular, to the best of our knowledge (based on masplan.org and our experiments), RS-MAA* is the first to scale to horizon 12 on DECTIGER (an improvement from 6) and to horizon 1500 on RECYCLING ROBOTS (an improvement from 80).

## Related Work

We review extensions and improvements of MAA* and some recent trends. For a survey and in-depth treatment of various approximate and exact approaches, we refer to the detailed monograph by Oliehoek and Amato [2016].

Szer et al. [2005] introduce multi-agent A* (MAA*), investigating heuristics based on different approximations, such as an MDP, a POMDP, and a recursive MAA*. Thereafter, different heuristics have been proposed to reduce the number of nodes expanded during the search. Oliehoek and Vlassis [2007] introduce a heuristic that models interactions between the agents in one time step using Bayesian games. Oliehoek et al. [2008] introduce generalized MAA* (GMAA*), which uses a Bayesian game to determine which child of a given partial policy is the best, but solving the Bayesian game itself is also expensive. Oliehoek et al. [2009] propose the GMAA* with incremental clustering (GMAA*-IC) algorithm, which clusters observation histories while expanding the tree to reduce the number of nodes in the tree. This was the first approach to solve DECTIGER (see Example 1) for horizon 5. Finally, Spaan et al. [2011] introduce GMAA*-IC with incremental expansion (GMAA*-ICE), which, when expanding a node, finds the best child incrementally by a second (nested, but not recursive) A* search over partially specified Bayesian game policies. This was the first method to solve DECTIGER for horizon 6. We refer to Oliehoek et al. [2013] for an in-depth explanation of these approaches. In this paper, we propose a new variation of MAA* that improves scalability even further, solving DECTIGER for horizon 12.

The state of the art for generating ε-optimal solutions reduces the Dec-POMDP to an MDP with a continuous state space and uses an adaptation of heuristic search value iteration for continuous MDPs [Dibangoye et al., 2016]. Recent work has studied subclasses of Dec-POMDPs to alleviate the complexity of the general model. Amato et al. [2019] add macro-actions (also known as options [Pateria et al., 2022; Barto and Mahadevan, 2003]) to the Dec-POMDP, which allow an agent to perform multiple actions without having to coordinate with the remaining agents. Xie et al. [2020] study problems with two agents under one-sidedness, that is, where one of the agents observes the actions and observations of the other agent. Finally, Lauri et al. [2019; 2020] investigate information-gathering tasks where the reward function is based on the joint belief of the agents.

The idea of splitting steps in an A* search tree into smaller steps to limit the outdegree of the tree was proposed before by Cazenave [2010] and applied to multi-agent path finding and multiple sequence alignment. However, it had not previously been applied to solving Dec-POMDPs.

## 2 Problem Statement

$\Delta(X)$ denotes the set of probability distributions over a finite set $X$.

Definition 1 (Dec-POMDP).
A Dec-POMDP is a tuple $\langle D, S, A, O, b, T, R, O \rangle$ with a set $D = \{1, \ldots, n\}$ of $n$ agents, a finite set $S$ of states, a set $A = \times_{i \in D} A_i$ of joint actions, where each $A_i$ is the finite set of local actions of agent $i$, and a set $O = \times_{i \in D} O_i$ of joint observations, where each $O_i$ is the finite set of local observations of agent $i$. The transition function $T \colon S \times A \to \Delta(S)$ defines the transition probability $\Pr(s' \mid s, a)$, $b \in \Delta(S)$ is the initial belief, $R \colon S \times A \to \mathbb{R}$ is the reward function, and the observation function $O \colon A \times S \to \Delta(O)$ defines the observation probability $\Pr(o \mid a, s')$.

A Dec-POMDP describes a system that initially starts in a state $s^0$ with probability $b(s^0)$. At each time step $t$, each agent $i$ selects a local action $a_i^t$. Together, this yields a joint action $a^t = \langle a_1^t, \ldots, a_n^t \rangle$ and a reward $r^t = R(s^t, a^t)$. The system evolves to the next state $s^{t+1}$ with probability $\Pr(s^{t+1} \mid s^t, a^t)$. The observation $o^{t+1}$ is obtained with probability $\Pr(o^{t+1} \mid a^t, s^{t+1})$, and each agent $i$ receives the local observation $o_i^{t+1}$. We call the sequence of joint observations $\tau = o^1 o^2 \cdots o^t$ a joint observation history (JOH) and the sequence of local observations $\tau_i = o_i^1 o_i^2 \cdots o_i^t$ of agent $i$ a local observation history (LOH). We denote the first $d$ observations of $\tau$ by $\mathrm{pre}(\tau, d)$. We denote the sets of all LOHs of length exactly $\ell$ and of length at most $\ell$ by $O_i^{\ell}$ and $O_i^{\leq \ell}$, respectively.

We are interested in offline planning: how should the individual agents act locally such that, jointly, they optimize the cumulative reward up to horizon $h$? A local policy for agent $i$ maps LOHs of that agent to a local action, formally $\pi_i \colon O_i^{\leq h-1} \to A_i$. A joint policy is a tuple of local policies $\pi = \langle \pi_1, \ldots, \pi_n \rangle$, and $\Pi$ denotes the set of all joint policies. Given a joint policy $\pi$, the value of executing this policy from the initial belief $b$ over a horizon $h$ is

$$V_\pi(b, h) = \mathbb{E}_\pi\left[\sum_{t=0}^{h-1} R(s^t, a^t)\right],$$

where $s^t$ and $a^t$ are the state and joint action at time step $t$.

Problem statement: Given a Dec-POMDP and a horizon $h$, find a joint policy $\pi$ that maximizes $V_\pi(b, h)$.

Example 1 (Decentralized tiger problem). The DECTIGER problem [Nair et al., 2003] is a multi-agent variation of the Tiger problem [Kaelbling et al., 1998]. In this problem, two agents stand in front of two doors. A tiger is behind the left or the right door, which determines the state of the environment (TL or TR), and a treasure is behind the other door. Each agent has three actions: listen (Li), open the left door (OL), or open the right door (OR). Listening gives a noisy observation regarding the location of the tiger (HL or HR). In the multi-agent variation, the dynamics depend on the joint action. The state resets if any agent opens a door, in which case both agents get an uninformative observation (uniform over the observations). If only one agent opens the door with the treasure, the agents receive a positive reward; if both agents open the door with the treasure, they receive a larger reward. A large penalty is given for opening the door with the tiger.

The DECTIGER problem (Example 1) illustrates the challenge of cooperating to solve a Dec-POMDP. Besides finding a policy to gather information, the agents must also coordinate, during the offline planning phase, on the right moment to open a door. Without such coordination, the actions of one agent could compromise the belief of the remaining agents.
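To make Definition 1 and the evaluation of $V_\pi(b, h)$ concrete, here is a minimal Python sketch, not the authors' implementation; the names and the dictionary-based representation of distributions are our own assumptions. It stores the tuple of Definition 1 and evaluates a joint policy exactly by enumerating reachable states and local observation histories.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A minimal Dec-POMDP container mirroring Definition 1.
# Joint actions and observations are tuples with one entry per agent.
@dataclass
class DecPOMDP:
    states: List[str]
    local_actions: List[List[str]]        # local_actions[i] = A_i
    local_observations: List[List[str]]   # local_observations[i] = O_i
    b0: Dict[str, float]                  # initial belief b
    T: Callable[[str, Tuple[str, ...]], Dict[str, float]]              # T(s, a) -> Pr(s' | s, a)
    R: Callable[[str, Tuple[str, ...]], float]                         # R(s, a)
    O: Callable[[Tuple[str, ...], str], Dict[Tuple[str, ...], float]]  # O(a, s') -> Pr(o | a, s')

# Exact evaluation of V_pi(b, h): propagate a distribution over
# (state, tuple of local observation histories) for h steps.
def policy_value(m: DecPOMDP, policies, horizon: int) -> float:
    n = len(m.local_actions)
    dist = {(s, tuple(() for _ in range(n))): p for s, p in m.b0.items() if p > 0}
    value = 0.0
    for _ in range(horizon):
        new_dist = {}
        for (s, hists), p in dist.items():
            a = tuple(policies[i][hists[i]] for i in range(n))   # joint action from local policies
            value += p * m.R(s, a)                               # expected immediate reward
            for s2, pt in m.T(s, a).items():
                for o, po in m.O(a, s2).items():
                    key = (s2, tuple(hists[i] + (o[i],) for i in range(n)))
                    new_dist[key] = new_dist.get(key, 0.0) + p * pt * po
        dist = new_dist
    return value
```

Here, `policies[i]` plays the role of $\pi_i \colon O_i^{\leq h-1} \to A_i$: a dictionary from agent $i$'s LOH (a tuple of local observations, initially empty) to a local action; for $h = 1$, each dictionary only needs an entry for the empty history.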
## 3 Small-Step Multi-Agent A*

In this section, we frame the search for an optimal policy as an incremental heuristic search over partial policies.

Ordered LOHs. For our algorithm, we explore LOHs in a fixed order. Intuitively, we order the LOHs of all agents as follows: first by stage (the length of the observation history), then by agent, and then lexicographically by observation history. This ordering is formalized in the following definition.

Definition 2. Given a total order $\preceq_i$ on $O_i$, we define a total order $\preceq$ on $\bigcup_{i=1}^{n} O_i^{\leq h-1}$ such that $\tau_i^t \preceq \tau_j^{t'}$ if $t < t'$, or $t = t'$ and $i < j$, or $t = t'$, $i = j$, and $\tau_i^t \preceq_i^{\mathrm{lex}} \tau_j^{t'}$.

Throughout the paper, we use this fixed order for LOHs.

Figure 1: A partial tree of the DECTIGER problem at stage 1, where we expanded all nodes in stage 0 but only a subset of the nodes in stage 1 after expanding the gray node (filled). (Nodes are grouped by stage, agent, and LOH.)

Partial policies. A local partial policy for agent $i$ is a partial function $\varphi_i \colon O_i^{\leq h-1} \rightharpoonup A_i$. A partial policy is a tuple of local partial policies $\varphi = \langle \varphi_1, \ldots, \varphi_n \rangle$. The length of a partial policy is $\ell$ if every local partial policy $\varphi_i$ is defined on only the first $\ell$ LOHs (of any agent) according to the order in Definition 2. Hence, a partial policy of length $\ell$ is defined on $\ell$ LOHs in total, summed over all local partial policies. The stage $\sigma(\varphi)$ of a partial policy $\varphi$ is $t$ if the policy is defined for all LOHs of length $t-1$, but not for all LOHs of length $t$. The set of all partial policies is $\Phi$. A partial policy $\varphi'$ extends a partial policy $\varphi$, denoted by $\varphi \sqsubseteq \varphi'$, [...]

[...]

For a partial policy $\varphi$ of length $\ell > 0$, let $\varphi'$ be the partial policy of length $\ell - 1$ such that $\varphi' \sqsubseteq \varphi$. For $d > 0$, we define
$$Q_d(\varphi) = \min\bigl\{\, Q_{\mathrm{Dec},\min\{\sigma(\varphi),d\}}(\varphi),\; Q_d(\varphi') \,\bigr\}.$$
We can also set $d = \infty$; then $\min\{\sigma(\varphi), d\} = \sigma(\varphi)$.

Corollary 1. $Q_d$ is admissible for $d \geq 1$. This follows from Thm. 1 and $Q_d(\varphi) \geq Q_{\mathrm{Dec},1}(\varphi)$ for $d \geq 1$.

Comparing heuristics. The recursive Dec-POMDP heuristic proposed by Szer et al. [2005] can be written as $Q_{\mathrm{Dec},\sigma(\varphi)}(\varphi)$. It does exhibit the weakness mentioned above, which we fix in the definition of $Q_d$. Two other commonly used heuristics in the literature [Oliehoek and Vlassis, 2007; Oliehoek et al., 2013] are the $Q_{\mathrm{POMDP}}$ and the $Q_{\mathrm{BG}}$ heuristic. For the $Q_{\mathrm{POMDP}}$ heuristic, it is assumed that the agents can communicate and hence have access to all joint observations when making their decisions. For the $Q_{\mathrm{BG}}$ heuristic, it is assumed that the agents can communicate with one step of delay. Both heuristics can be captured as instances of the generalized policy optimization problem, which allows us to compare our heuristics to $Q_{\mathrm{POMDP}}$ and $Q_{\mathrm{BG}}$.

Theorem 2. Let $\varphi$ be a partial policy. Then $Q_\infty(\varphi) \leq Q_{\mathrm{POMDP}}(\varphi)$ and $Q_{\mathrm{Dec},\sigma(\varphi)-1}(\varphi) \leq Q_{\mathrm{BG}}(\varphi)$.

Proof. The (centralized) POMDP policy optimization problem is the unconstrained version of the optimization problem (1), from which the first inequality follows. The one-step-delay communication problem used to compute $Q_{\mathrm{BG}}$ corresponds to constraints of the form $\bigl(\mathrm{pre}(\tau, s-1) = \mathrm{pre}(\tilde{\tau}, s-1) \wedge \tau_i = \tilde{\tau}_i\bigr) \implies \pi_i(\tau) = \pi_i(\tilde{\tau})$. The antecedent of the constraint $\mathrm{pre}(\tau, t-1) = \mathrm{pre}(\tilde{\tau}, t-1) \wedge \tau_i = \tilde{\tau}_i$ used in $Q_{\mathrm{Dec},t-1}(\varphi)$ is weaker for $s \geq t$ (leading to a more constrained problem), whereas all actions corresponding to OHs of length $s \leq t-1$ are already fixed (and satisfy the Dec-POMDP constraint, so certainly also the constraint for $Q_{\mathrm{BG}}$). This shows the second inequality.

## 5 Abandoning Heuristic Computations Early

In this section, we consider a cheap variation of the recursive heuristic that provides an upper bound on $Q_d$ and is thus still admissible. We obtain this heuristic algorithmically by terminating the computation of the recursive heuristic early.
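As a preview of the construction formalized in the following paragraphs, the sketch below illustrates the early-abandonment idea in the spirit of $L_M(\varphi, Q)$: run an A*-style search on the subproblem, but stop after $M$ expansions and return the largest heuristic value still on the open list. This is an illustrative sketch under our own assumptions, not the paper's implementation; `expand`, `is_fully_specified`, and `heuristic` are placeholder names.

```python
import heapq
import itertools

def abandonable_astar(root, heuristic, expand, is_fully_specified, M):
    """A*-style maximization that may be abandoned early.

    Returns an upper bound on the optimal value below `root`: either the
    value of a fully specified policy (if the search finishes), or, after
    M expansions, the highest heuristic value of a node on the open list.
    `heuristic` is assumed to be admissible and to return the exact value
    for fully specified policies.
    """
    counter = itertools.count()  # tie-breaker so node objects are never compared
    open_list = [(-heuristic(root), next(counter), root)]  # max-heap via negated keys
    expansions = 0
    while open_list:
        neg_val, _, node = heapq.heappop(open_list)
        if is_fully_specified(node):
            return -neg_val          # best completion found; exact value
        if expansions >= M:
            return -neg_val          # abandon: highest open value at this point
        expansions += 1
        for child in expand(node):
            heapq.heappush(open_list, (-heuristic(child), next(counter), child))
    return float("-inf")             # no completions below root
```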
Assume we are computing a heuristic value for the partial policy $\varphi$. By Eq. 2, this can be done by solving Dec-POMDPs with a smaller horizon, e.g., using an A* algorithm with an admissible heuristic $Q$. Due to the nature of A*, the highest heuristic value of an open node is an upper bound on the expected reward of an optimal policy. Hence, we can abandon a computation after exploring $M$ nodes and return the highest heuristic value of an open node at that point. We call this value $L_M(\varphi, Q)$. If the heuristic is such that the heuristic value of a node is always at most the value of its parent, then the A* algorithm explores nodes in order of decreasing values. $L_M(\varphi, Q)$ will then be the $M$-th largest value in the tree or, if A* terminates within $M$ steps, the value of the fully specified policy. Next, we formalize $L_M(\varphi, Q)$.

Definition 6. Consider a search tree with heuristic $Q \colon \Phi \to \mathbb{R}$ and some fixed partial policy $\varphi$. We define the set of candidate policies $\Phi_c = \{\varphi' \mid \ldots\}$ [...]

[...] the eight different observations. Hence, computing $Q_{\mathrm{BG}}$ is infeasible (for every horizon). $Q_{\mathrm{MDP}}$ can be computed, but is not sufficiently tight. Likewise, the tighter heuristic $Q_3$ outperforms $Q_\infty$ for sufficiently large horizons. GRID is a prime example where the on-the-fly computation of the heuristic values is an essential benefit, as the $Q_{\mathrm{BG}}$ heuristic cannot finish the computation of the heuristic values. On HOTEL and BROADCAST, the $Q_{\mathrm{BG}}$ heuristic is already tight, which means that the more expensive computation of $Q_d$ does not significantly reduce the number of nodes that are explored. This also explains why $Q_\infty$ outperforms $Q_3$ on these benchmarks. FIREFIGHTING3's structure makes $h = 5$ expensive (there is no potential for clustering). The recursive nature of RS-MAA* suffers from this when computing values for $h \geq 6$.

In short, RS-MAA* generally performs better on domains where agent knowledge about the state is relatively valuable. Note that the recursive heuristics are tighter than other known competitive heuristics because they share less information about observations (and hence about the state). Hence, if knowledge about the state is valuable, then the additional tightness of the recursive heuristic is more essential for limiting the number of nodes expanded in the search tree. RS-MAA* is also better on domains with many clustered observations, as the speed-up provided by the small-step search tree is more significant in this case. By contrast, GMAA*-ICE seems better on domains where there are very few clustered observations, although in these cases RS-MAA* can sometimes still outperform GMAA*-ICE.

Ablation study. For DECTIGER, we provide additional results using alternative configurations in Table 2. The first configuration reflects the baseline. The next five configurations change the three hyperparameters. In particular, setting $M = \alpha = \infty$ disables abandoning computations as described in Sec. 5. We additionally used the baseline parameters but without clustering or without the last-stage optimization from Sec. 6.

|          |       |       |       |       |       |       | no clustering | no last stage |
|----------|-------|-------|-------|-------|-------|-------|---------------|---------------|
| $M$      | 200   | ∞     | 200   | ∞     | 200   | 200   |               |               |
| $Q_d$    | $Q_3$ | $Q_3$ | $Q_3$ | $Q_3$ | $Q_1$ | $Q_2$ |               |               |
| $\alpha$ | 0.2   | 0.2   | ∞     | ∞     | 0.2   | 0.2   |               |               |
| $h = 6$  | <1    | <1    | <1    | <1    | 3     | 1     | 1             | 1             |
| $h = 7$  | 2     | 2     | 2     | 2     | 293   | 5     | MO            | 2             |
| $h = 8$  | 2     | 31    | 3     | 40    | 3355  | 15    | MO            | 2             |
| $h = 10$ | 628   | MO    | 630   | TO    | TO    | TO    | MO            | 2453          |
| $h = 12$ | 762   | MO    | 1129  | TO    | TO    | TO    | MO            | 1296          |

Table 2: Ablation study for DECTIGER, showing computation times in seconds. The last two columns are computed using the baseline parameters $M = 200$, heuristic $Q_3$, $\alpha = 0.2$. TO and MO denote a timeout (> 3600 s) and a memout (> 16 GB), respectively.
Clustering, early termination, and avoiding low-depth heuristics are essential components for the performance. Terminating early based on a threshold and the last-stage optimization are less essential, but still provide a significant speedup.

Benchmark-specific parameters. Although the parameters $M = 200$, $d = 3$, and $\alpha = 0.2$ work well for a wide range of benchmarks, some benchmarks benefit from different parameters. For example, BOXPUSHING is solved faster when using the tighter $Q_2$ heuristic, namely in 3 and 6 seconds for $h = 4$ and $h = 5$, respectively. HOTEL can be solved in 44 seconds for $h = 2000$ using $M = 25$ and the less tight $Q_\infty$. Finally, BROADCAST can be solved in 250, 465, and 1342 seconds for $h = 500$, $h = 800$, and $h = 2000$, respectively, using $M = 25$ and the less tight $Q_\infty$. Notably, for most benchmarks, parameter choices that are good for a short horizon are also a good choice for larger horizons.

Comparison to ε-optimal algorithms. It is interesting to compare the performance of RS-MAA* to that of less strict ε-optimal algorithms, such as FB-HSVI [Dibangoye et al., 2016]. These algorithms provide an upper bound and a lower bound on the optimal value that are at most ε apart. As no implementation of FB-HSVI was available at the time of writing, we provide a short qualitative comparison based on the data from Dibangoye et al. [2016]. For GRID, GRID3X3, MARS, and BOXPUSHING, FB-HSVI is able to provide ε-optimal solutions for horizon 10, thereby outperforming RS-MAA*. For BROADCAST, it is likely that RS-MAA* outperforms FB-HSVI, as computing the solution for horizon 100 already took 473 seconds for FB-HSVI. For RECYCLING, both algorithms perform very well; to see which performs better, results for FB-HSVI beyond horizon 100 would be needed. All run times above allow an ε = 0.01 optimality gap. Based on this qualitative comparison, some aspects remain unclear: from which ε onwards RS-MAA* outperforms FB-HSVI, and in which circumstances the upper bounds obtained with RS-MAA* within a given time limit are stricter than the ones found by FB-HSVI.

## 8 Conclusion

We presented a novel method for finding optimal solutions to finite-horizon Dec-POMDP problems. Our approach showed competitive, and often superior, performance compared to the state of the art. In particular, for several benchmarks, we extend the highest horizon for which an optimal solution is known. The contributions that led to this advancement are a small-step variant of the search tree for MAA*, as well as a novel heuristic that is computed on the fly. In the future, we will investigate support for nonlinear objectives, relevant, e.g., for rewards that depend on the joint belief of the agents, as in Lauri et al. [2019] and Lauri et al. [2020]. Finally, towards further scalability, we will integrate results from approximate Dec-POMDP solvers into the heuristic computation.

## Acknowledgements

We would like to thank the anonymous reviewers for their useful comments. This work has been partially funded by the ERC Starting Grant 101077178 (DEUCE) and the NWO grant NWA.1160.18.238 (PrimaVera).

## Contribution Statement

Wietze Koops is the primary designer of the algorithm; he also implemented it and evaluated the algorithms. The other authors contributed discussions and ideas, and shaped the description of the algorithm.

## References

[Amato and Zilberstein, 2009] Christopher Amato and Shlomo Zilberstein.
Achieving goals in decentralized POMDPs. In AAMAS, pages 593-600, 2009.

[Amato et al., 2006] Christopher Amato, Daniel S. Bernstein, and Shlomo Zilberstein. Optimal fixed-size controllers for decentralized POMDPs. In AAMAS MSDM workshop, 2006.

[Amato et al., 2007] Christopher Amato, Daniel S. Bernstein, and Shlomo Zilberstein. Optimizing memory-bounded controllers for decentralized POMDPs. In UAI, pages 1-8, 2007.

[Amato et al., 2009] Christopher Amato, Jilles Steeve Dibangoye, and Shlomo Zilberstein. Incremental policy generation for finite-horizon DEC-POMDPs. In ICAPS, 2009.

[Amato et al., 2019] Christopher Amato, George Dimitri Konidaris, Leslie Pack Kaelbling, and Jonathan P. How. Modeling and planning with macro-actions in decentralized POMDPs. JAIR, 64:817-859, 2019.

[Barto and Mahadevan, 2003] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discret. Event Dyn. Syst., 13(4):341-379, 2003.

[Bernstein et al., 2002] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819-840, 2002.

[Cazenave, 2010] Tristan Cazenave. Partial move A*. In 22nd IEEE International Conference on Tools with Artificial Intelligence, volume 2, pages 25-31. IEEE, 2010.

[Dibangoye et al., 2016] Jilles Steeve Dibangoye, Christopher Amato, Olivier Buffet, and François Charpillet. Optimally solving Dec-POMDPs as continuous-state MDPs. JAIR, 55:443-497, 2016.

[Hansen et al., 2004] Eric A. Hansen, Daniel S. Bernstein, and Shlomo Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI, pages 709-715, 2004.

[Kaelbling et al., 1998] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artif. Intell., 101(1-2):99-134, 1998.

[Lauri et al., 2019] Mikko Lauri, Joni Pajarinen, and Jan Peters. Information gathering in decentralized POMDPs by policy graph improvement. In AAMAS, pages 1143-1151. IFAAMAS, 2019.

[Lauri et al., 2020] Mikko Lauri, Joni Pajarinen, and Jan Peters. Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement. Auton. Agents Multi Agent Syst., 34(2):42, 2020.

[Nair et al., 2003] Ranjit Nair, Milind Tambe, Makoto Yokoo, David Pynadath, and Stacy Marsella. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In IJCAI, pages 705-711, 2003.

[Oliehoek and Amato, 2016] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. Springer, 2016.

[Oliehoek and Vlassis, 2007] Frans A. Oliehoek and Nikos Vlassis. Q-value heuristics for approximate solutions of Dec-POMDPs. In AAAI Spring Symposium: Game Theoretic and Decision Theoretic Agents, pages 31-37, 2007.

[Oliehoek et al., 2008] Frans A. Oliehoek, Matthijs T. J. Spaan, and Nikos Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. JAIR, 32:289-353, 2008.

[Oliehoek et al., 2009] Frans A. Oliehoek, Shimon Whiteson, and Matthijs T. J. Spaan. Lossless clustering of histories in decentralized POMDPs. In AAMAS, pages 577-584, 2009.

[Oliehoek et al., 2013] Frans A. Oliehoek, Matthijs T. J. Spaan, Christopher Amato, and Shimon Whiteson. Incremental clustering and expansion for faster optimal planning in Dec-POMDPs. JAIR, 46:449-509, 2013.

[Oliehoek et al., 2017] Frans A. Oliehoek, Matthijs T. J.
Spaan, Bas Terwijn, Philipp Robbel, and João V. Messias. The MADP toolbox: An open source library for planning and learning in (multi-)agent systems. JMLR, 18:1-5, 2017.

[Pateria et al., 2022] Shubham Pateria, Budhitama Subagdja, Ah-Hwee Tan, and Chai Quek. Hierarchical reinforcement learning: A comprehensive survey. ACM Comput. Surv., 54(5):109:1-109:35, 2022.

[Russell and Norvig, 2020] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (4th Edition). Pearson, 2020.

[Seuken and Zilberstein, 2007] Sven Seuken and Shlomo Zilberstein. Memory-bounded dynamic programming for DEC-POMDPs. In IJCAI, pages 2009-2015, 2007.

[Spaan and Melo, 2008] Matthijs T. J. Spaan and Francisco S. Melo. Interaction-driven Markov games for decentralized multiagent planning under uncertainty. In AAMAS, pages 525-532, 2008.

[Spaan et al., 2011] Matthijs T. J. Spaan, Frans A. Oliehoek, and Christopher Amato. Scaling up optimal heuristic search in Dec-POMDPs via incremental expansion. In IJCAI, 2011.

[Szer et al., 2005] Daniel Szer, François Charpillet, and Shlomo Zilberstein. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In UAI, 2005.

[Xie et al., 2020] Yuxuan Xie, Jilles Dibangoye, and Olivier Buffet. Optimally solving two-agent decentralized POMDPs under one-sided information sharing. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 10473-10482, 2020.