# Factored Online Planning in Many-Agent POMDPs

Maris F. L. Galesloot¹, Thiago D. Simão², Sebastian Junges¹, Nils Jansen¹,³

¹Radboud University Nijmegen, The Netherlands
²Eindhoven University of Technology, The Netherlands
³Ruhr-University Bochum, Germany

maris.galesloot@ru.nl, t.simao@tue.nl, sebastian.junges@ru.nl, n.jansen@rub.de

## Abstract

In centralized multi-agent systems, often modeled as multi-agent partially observable Markov decision processes (MPOMDPs), the action and observation spaces grow exponentially with the number of agents, making the value and belief estimation of single-agent online planning ineffective. Prior work partially tackles value estimation by exploiting the inherent structure of multi-agent settings via so-called coordination graphs. Additionally, belief estimation methods have been improved by incorporating the likelihood of observations into the approximation. However, the challenges of value estimation and belief estimation have only been tackled individually, which prevents existing methods from scaling to settings with many agents. Therefore, we address these challenges simultaneously. First, we introduce weighted particle filtering to a sample-based online planner for MPOMDPs. Second, we present a scalable approximation of the belief. Third, we bring an approach that exploits the typical locality of agent interactions to novel online planning algorithms for MPOMDPs operating on a so-called sparse particle filter tree. Our experimental evaluation against several state-of-the-art baselines shows that our methods (1) are competitive in settings with only a few agents and (2) improve over the baselines in the presence of many agents.

## 1 Introduction

Planning problems with multiple agents, such as teams of mobile robots (Ahmadi et al. 2019) or autonomous surveillance systems (Witwicki et al. 2017), can be modeled by multi-agent partially observable Markov decision processes (MPOMDPs, Messias, Spaan, and Lima 2011). These formal models exhibit sets of (local) observations and actions for each agent that can be shared with a central controller. This controller then makes decisions among the joint actions of all agents. Computationally, the main challenge is that the spaces of joint actions and observations grow exponentially with the number of agents (Pynadath and Tambe 2002). Moreover, as the controller only partially observes the system state, it must base its decisions on the history of previous joint actions and observations.

Online algorithms, such as those based on Monte Carlo tree search (MCTS, Browne et al. 2012), are a common way to tackle large planning problems. These algorithms search for local solutions in the most promising regions of the search space (Kocsis and Szepesvári 2006). In particular, partially observable Monte Carlo planning (POMCP, Silver and Veness 2010) derives Monte Carlo estimates in the form of (1) an approximation of the value function and (2) distributions (beliefs) over states by generating sample trajectories from the simulation of single state particles. However, a naive application of (single-agent) online planning is ill-equipped to handle the high-dimensional MPOMDP setting. First, due to the many actions that must be explored during simulations, the value estimation may suffer from high variance.
Second, the chance of a mismatch between simulated and actual observations is high for large observation spaces, lowering the quality of the approximation in the belief estimates (Sunberg and Kochenderfer 2018).

To address the challenge of value estimation, one can exploit the typical locality of interactions between the agents, captured by so-called coordination graphs (Guestrin, Lagoudakis, and Parr 2002). In particular, Amato and Oliehoek (2015) estimate the action value for subsets of agents instead of all agents based on such graphs. The main concepts are to (1) factorize the value estimates over the action space of subsets of agents in the factored statistics variant and to (2) factorize both the action and the observation space in the factored trees variant. These factorizations are key to achieving good performance in settings with many agents.

The challenge of belief estimation is also a prevalent issue in single-agent continuous settings. From the likelihood of sampled observations, importance sampling weights are added to the Monte Carlo estimates of the beliefs (Thrun 1999). Such weighted beliefs are also used in single-agent online planners that simulate weighted belief estimates instead of single states (Fischer and Tas 2020; Lim, Tomlin, and Sunberg 2020). By simulating belief estimates, these algorithms operate on the set of possible beliefs of the agents, which makes the branching factor insensitive to the number of observations. A particularly effective algorithm is the so-called sparse particle filter tree (Sparse-PFT, Lim et al. 2023), which only searches for local solutions in the set of reachable belief estimates.

To the best of our knowledge, these solutions to value and belief estimation have only been studied independently, and weighted belief estimates have not yet been explored in the online MPOMDP planning setting. Therefore, no method can scale to problems with many agents, and we must tackle both challenges simultaneously.

| Belief Estimation | Simulation | Single-Agent (POMDP Value Estimation) | Factored Statistics (MPOMDP Value Estimation) | Factored Trees (MPOMDP Value Estimation) |
|---|---|---|---|---|
| Unweighted | Single States | POMCP (Silver and Veness 2010) | FS-POMCP (Amato and Oliehoek 2015) | FT-POMCP (Amato and Oliehoek 2015) |
| Weighted | Single States | W-POMCP / POMCPOW (Sunberg and Kochenderfer 2018) | FS-W-POMCP (Sect. 4, new) | FT-W-POMCP (Sect. 4, new) |
| Weighted | Belief Estimates | Sparse-PFT (Lim et al. 2023) | FS-PFT (Sect. 5, new) | FT-PFT (Sect. 5, new) |

Table 1: Our algorithms and state-of-the-art MCTS methods in the (continuous) POMDP and MPOMDP literature.

We present multi-agent online planning algorithms that exploit adequate approximations of both the value and the belief and thereby can scale to many agents. In particular, we integrate factored value estimation and weighted belief estimation. Table 1 positions the new algorithms with respect to the existing solutions in the MCTS literature. First, we add weighted belief estimation to POMCP variants, namely W-POMCP, and combine this algorithm with factored statistics in FS-W-POMCP (Sect. 4.1). Furthermore, we design a weighted belief approximation that is compatible with factored trees in FT-W-POMCP (Sect. 4.2). Then, we introduce two novel variants of the Sparse-PFT algorithm that exploit coordination graphs similarly to Amato and Oliehoek (2015), namely FS-PFT and FT-PFT (Sect. 5.2).
The empirical evaluation (1) shows the improvement of integrating weighted particle filtering in POMCP, and (2) demonstrates the effect of value factorization via coordination graphs in Sparse-PFT. It also shows that exploiting local interactions between agents is beneficial in environments where the way the agents interact may change over time, i.e., where there is no naturally sparse graph. The extended version of this article (Galesloot et al. 2023) contains an Appendix with additional details.

**Contributions.** We present four novel algorithms for many-agent online planning that can scale both value and belief estimation to problems with many agents. Our empirical evaluation, using the nine algorithms in Table 1, shows that our variants improve over the state-of-the-art. For example, in the environments FIREFIGHTINGGRAPH and Multi-Agent ROCKSAMPLE, we scale up to instances with 64 and 6 agents instead of 10 and 2 agents, respectively.

## 2 Online Planning in MPOMDPs

The set of all distributions over the finite set $X$ is $\Delta(X)$.

**MPOMDPs.** We study online planning in centralized multi-agent systems that are modeled as MPOMDPs. Intuitively, agents encounter individual observations but can share those via immediate and noiseless broadcast communication, which allows a centralized control paradigm.

**Definition 1 (MPOMDP).** An MPOMDP is a tuple $\mathcal{M} = \langle I, S, b_0, A, \mathcal{T}, R, \Omega, O, \gamma \rangle$, with the finite set $I$ of $n$ agents, the finite set $S$ of states, an initial state distribution $b_0 \in \Delta(S)$, the set $A = \times_{i \in I} A_i$ of joint actions, composed of the finite sets $A_i$ of actions for each agent $i \in I$, the transition function $\mathcal{T}\colon S \times A \to \Delta(S)$ such that $\mathcal{T}(s' \mid s, \mathbf{a}) = \Pr(s' \mid s, \mathbf{a})$ is the probability of a new state $s'$ given the previous state $s$ and joint action $\mathbf{a}$, the reward function $R\colon S \times A \to \mathbb{R}$ such that $R(s, \mathbf{a})$ is the reward given state $s$ and joint action $\mathbf{a}$, the set $\Omega = \times_{i \in I} \Omega_i$ of joint observations, composed of the finite sets $\Omega_i$ of observations for each agent $i \in I$, the observation function $O\colon S \times A \to \Delta(\Omega)$ such that $O(\mathbf{o} \mid s', \mathbf{a}) = \Pr(\mathbf{o} \mid s', \mathbf{a})$ specifies the probability of observing joint observation $\mathbf{o}$ in the state $s'$ given joint action $\mathbf{a}$, and the discount factor $\gamma \in [0, 1)$.

MPOMDPs generalize POMDPs (Kaelbling, Littman, and Cassandra 1998), which are MPOMDPs with a single agent. An MPOMDP can be treated as a POMDP by ignoring the agent-wise factorization in the action and observation space.

**Objective.** The return $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} R(s_{t'}, \mathbf{a}_{t'})$ is the infinite-horizon discounted sum of rewards from time $t \in \mathbb{N}$. An observable history $\mathbf{h}_t = (\mathbf{o}_1, \mathbf{a}_1, \mathbf{o}_2, \ldots, \mathbf{a}_{t-1}, \mathbf{o}_t)$ is a sequence of joint observations and joint actions. Policies determine the action choices. Optimal policies $\pi\colon \Delta(S) \to A$ for MPOMDPs map the belief $b_t \in \Delta(S)$ to joint actions. The belief $b_t$ is a sufficient statistic (Kaelbling, Littman, and Cassandra 1998) for the history $\mathbf{h}_t$, and resembles the state distribution $b_t(s) = \Pr(s_t \mid \mathbf{h}_t, b_0)$ at time $t$, with $b_0$ from $\mathcal{M}$. Beliefs can be updated as $b_{t+1} = \upsilon(b_t, \mathbf{a}_t, \mathbf{o}_{t+1})$ by $\upsilon$ from $\mathcal{T}$ and $O$, using Bayes' theorem (Spaan 2012). Our aim is to maximize the joint Q-value of a belief $b$, which is the expected return under a policy $\pi$ given action $\mathbf{a}$ and belief $b$ at time $t$, with $\mathbf{a}_{t'} = \pi(b_{t'})$ for subsequent $t' > t$:

$$Q^{\pi}(b, \mathbf{a}) = \mathbb{E}_{\pi}\left[R_t \mid b_t = b,\, \mathbf{a}_t = \mathbf{a},\, \mathbf{a}_{t' > t} = \pi(b_{t'})\right]. \tag{1}$$
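For a small, tabular MPOMDP, the Bayesian update $\upsilon$ can be written out directly. The following is only a minimal sketch, not the authors' code; the dense array layout (and that the joint action and joint observation are already flattened to indices) is an illustrative assumption.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Exact update b' = υ(b, a, o) via Bayes' theorem (O(|S|^2) per call).

    Assumed layout: b has shape (|S|,); T[s, a, s'] = Pr(s' | s, a) has shape
    (|S|, |A|, |S|); O[s', a, o] = Pr(o | s', a) has shape (|S|, |A|, |Ω|).
    """
    predicted = b @ T[:, a, :]           # Pr(s' | b, a) = Σ_s b(s) T(s' | s, a)
    unnormalized = O[:, a, o] * predicted
    total = unnormalized.sum()           # Pr(o | b, a)
    if total == 0.0:
        raise ValueError("Observation has zero probability under this belief")
    return unnormalized / total          # posterior belief b'
```

As the section notes, repeating this exact update is intractable for large state spaces, which motivates the particle filters of Sect. 4.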
**Online planning.** Online search-based planners interleave planning and execution. They perform a forward search in the set of beliefs reachable from the current belief, incrementally building a look-ahead tree known as a search tree. Monte Carlo planners typically do so with a generative interface $G\colon S \times A \to S \times \Omega \times \mathbb{R}$ of the model $\mathcal{M}$, i.e., a simulator (Kearns, Mansour, and Ng 2002), with $(s, \mathbf{a}) \mapsto (s', \mathbf{o}, r)$. After searching from $b$, the planner executes a selected action $\mathbf{a}$ and receives an observation $\mathbf{o}$. Then, it updates its belief $b' = \upsilon(b, \mathbf{a}, \mathbf{o})$ before it starts searching from $b'$.

**MCTS.** Contemporary MCTS methods are based on the upper confidence trees (UCT, Kocsis and Szepesvári 2006) algorithm. In particular, partially observable UCT (PO-UCT) is the search algorithm that underlies POMCP. It plans with a look-ahead search tree comprised of paths of action and observation nodes. It samples states from the current belief $b$, and for each state, it expands a trajectory of actions and observations using $G$ until it reaches a new node in the tree. Then, a (random) rollout estimates the value (Kocsis and Szepesvári 2006). The trajectory is used to update a set of statistics for history $\mathbf{h}$ that includes visit counts $N(\mathbf{h})$ and $n(\mathbf{h}, \mathbf{a})$ in the observation and action nodes, respectively. Additionally, the trajectory updates estimates of the Q-values $Q(\mathbf{h}, \mathbf{a})$ of the action nodes by a running average of the return. The upper confidence bound (UCB1, Auer, Cesa-Bianchi, and Fischer 2002) algorithm decides the most promising actions during the search, balancing exploration and exploitation. It is computed from (an estimate of) the number $N$ of visits to the observation node, as well as the number of visits $n$ and the value $Q$ of the action node, by:

$$\mathrm{UCB}(Q, N, n) = Q + c \sqrt{\frac{\log(N + 1)}{n + 1}}, \tag{2}$$

where $c$ is an exploration constant. During the search, in some history $\mathbf{h}$, PO-UCT chooses actions via $\arg\max_{\mathbf{a}} \mathrm{UCB}(Q(\mathbf{h}, \mathbf{a}), N(\mathbf{h}), n(\mathbf{h}, \mathbf{a}))$.
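As a concrete reference, a minimal sketch of this UCB1 action choice follows, assuming Eq. (2) in the form shown above; the helper names and statistics layout are illustrative, not the authors' implementation.

```python
import math

def ucb(q, n_node, n_action, c):
    """UCB1 score of Eq. (2): Q + c * sqrt(log(N + 1) / (n + 1))."""
    return q + c * math.sqrt(math.log(n_node + 1) / (n_action + 1))

def select_action(action_stats, n_node, c=1.0):
    """PO-UCT action choice: argmax_a UCB(Q(h, a), N(h), n(h, a)).

    action_stats maps a joint action to a (Q, n) pair of its running value
    estimate and visit count at the current observation node.
    """
    return max(action_stats,
               key=lambda a: ucb(action_stats[a][0], n_node, action_stats[a][1], c))
```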
**A separation of beliefs.** POMCP is the extension of PO-UCT that gradually builds up beliefs $B(\mathbf{h})$ consisting of simulated states $s \in S$, i.e., particles, in the observation nodes for history $\mathbf{h}$, representing a Monte Carlo estimate of the belief $b$. However, the number of particles in each observation node depends on how often a history has been recorded during forward simulation and might diminish over time due to a lack of diversity in the set of particles. POMCP requires domain-specific particle reinvigoration techniques to mitigate this. Instead, we separate the concerns of an online belief inside the search tree that is used to estimate $Q$, and the offline belief $b$ that represents the current belief over states. Following related works, we write $B(\mathbf{h})$ for POMCP's online belief and $\tilde{b}$ for Sparse-PFT's weighted online belief.

**Problem statement.** Given an MPOMDP $\mathcal{M}$, how do we, at each time step $t \in \mathbb{N}$, both (1) efficiently search for a joint action $\mathbf{a}_t$ given the current belief estimate $b_t$, and (2) effectively find a good belief estimate $b_{t+1}$ from $b_t$, $\mathbf{a}_t$, and the received observation $\mathbf{o}_{t+1}$?

## 3 Using Structure in Multi-Agent POMDPs

In this section, we introduce coordination graphs to decompose the objective into local sub-problems. In particular, we recap prior work by Amato and Oliehoek (2015).

### 3.1 Coordination Graphs

A coordination graph (CG, Guestrin, Venkataraman, and Koller 2002; Oliehoek and Amato 2016) is an undirected graph $(V, E)$ that represents the local interactions between agents. Each vertex $v \in V$ corresponds to an agent ($V = I$), and each edge $(i, j) \in E$ indicates that agents $i \in V$ and $j \in V$ interact locally. For an edge $e = (i, j) \in E$, we define the local action $\mathbf{a}_e$ and local observation $\mathbf{o}_e$, which range over the products of the individual agent action spaces $A_e = A_i \times A_j$ and observation spaces $\Omega_e = \Omega_i \times \Omega_j$. For three agents $V = \{1, 2, 3\}$ connected in a line $E = \{e_1, e_2\}$, with $e_1 = (1, 2)$ and $e_2 = (2, 3)$, we have $\mathbf{a} = (a_1, a_2, a_3)$, and thus $\mathbf{a}_{e_1} = (a_1, a_2)$ and $\mathbf{a}_{e_2} = (a_2, a_3)$, respectively.

To find the Q-value for some history $\mathbf{h}$ (or equivalent belief $b$) based on the local actions, we define a local payoff function $Q_e(\mathbf{h}, \mathbf{a}_e)$ for each edge $e \in E$, where $\mathbf{a}_e \in A_e$ is the projection of $\mathbf{a}$ to the agents in the edge. Then, $Q(\mathbf{h}, \mathbf{a}) \approx \sum_e Q_e(\mathbf{h}, \mathbf{a}_e)$. Instead of finding $Q_e(\mathbf{h}, \mathbf{a}_e)$, we maintain local predictions $\hat{Q}_e(\mathbf{h}, \mathbf{a}_e) = \mathbb{E}\left[Q(\mathbf{h}, \mathbf{a}) \mid \mathbf{a}_e\right]$ of the joint Q-value. A mixture of experts (MoE) combines these local estimates of $Q$:

$$Q(\mathbf{h}, \mathbf{a}) \approx \hat{Q}(\mathbf{h}, \mathbf{a}) = \sum_e \omega_e \hat{Q}_e(\mathbf{h}, \mathbf{a}_e), \tag{3}$$

where $\omega_e \geq 0$ is the weight for edge $e$, such that $\sum_{e \in E} \omega_e = 1$. We assume uniform mixture weights $\omega_e = 1/|E|$ throughout the remainder of the paper. We pick the estimated maximizing joint action $\mathbf{a}^{\#} \approx \mathbf{a}^{*}$ over the sum of local estimates of $Q$:

$$\mathbf{a}^{\#} = \arg\max_{\mathbf{a}} \sum_e \omega_e \hat{Q}_e(\mathbf{h}, \mathbf{a}_e). \tag{4}$$

We thus aim to find local actions (for the edges of the graph) that maximize the estimated joint value function. Notice that when finding $\mathbf{a}^{\#}$, any agent $i$ might belong to multiple edges, and therefore agent $i$ must be assigned the same action $a^{\#}_i \in A_i$ in all edges $e \in E$ where $i \in e$. We can compute the maximum with graphical inference algorithms, such as Variable Elimination (VE) and Max-Plus (MP) (Vlassis, Elhorst, and Kok 2004). Appendix D provides an overview.

### 3.2 Factored-Value POMCP

Factored-value POMCP (Amato and Oliehoek 2015) consists of two techniques that exploit the structure of a CG to scale POMCP to problems with large action and observation spaces. Next, we outline how these techniques factor the action space to introduce statistics for each edge $e \in E$ for computing the UCB1 value of the local joint action space $\mathbf{a}_e$. Both these algorithms require an inference algorithm to compute Eq. (4) during and after the simulations.

**Factored statistics (FS-POMCP).** FS-POMCP uses the structure of MPOMDPs and stores the statistics $Q$, $N$, $n$ in a factorized manner. This adaptation is more space-efficient and also allows for improved action selection in large action spaces by maximizing over the factored Q-functions. More precisely, the tree structure in FS-POMCP remains the same as in POMCP, representing the history $\mathbf{h}$ with associated visit counts $N(\mathbf{h})$ and particles $B(\mathbf{h})$ in the observation nodes. The action nodes maintain a set of statistics $\langle Q_e(\mathbf{h}, \mathbf{a}_e), N(\mathbf{h}, \mathbf{a}_e) \rangle$ for each edge $e \in E$, independently. Thus, the MoE optimization from Eq. (3) is applied directly in each action node of the search tree. This improves over POMCP, as the combination of local action spaces $\mathbf{a}_e$ of the edges $e \in E$ is smaller than the joint action space of $\mathbf{a}$. During the search, the action $\mathbf{a}^{\#}$ is selected by maximizing over the UCB1 values (Eq. (2)) of the local Q-functions: $\arg\max_{\mathbf{a}^{\#}} \sum_{e \in E} \mathrm{UCB}(Q_e(\mathbf{h}, \mathbf{a}_e), N(\mathbf{h}), N(\mathbf{h}, \mathbf{a}_e))$.
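To make the MoE estimate (Eq. (3)) and the maximization in Eq. (4) concrete, here is a minimal, brute-force sketch; it is only feasible for small joint action spaces, whereas the paper uses VE or Max-Plus, and the data-structure names are illustrative assumptions.

```python
from itertools import product

def moe_estimate(joint_action, q_local, weights):
    """Mixture-of-experts estimate of Eq. (3).

    joint_action: tuple indexed by agent; q_local[e] maps the local action
    tuple (a_i, a_j) of edge e = (i, j) to Q_e; weights[e] is ω_e.
    """
    return sum(weights[e] * q_local[e][tuple(joint_action[i] for i in e)]
               for e in q_local)

def argmax_joint_action(agent_actions, q_local, weights):
    """Brute-force version of Eq. (4); enumerating joint actions guarantees
    each agent gets the same action in every edge it belongs to."""
    best, best_val = None, float("-inf")
    for joint in product(*agent_actions):      # all joint actions
        val = moe_estimate(joint, q_local, weights)
        if val > best_val:
            best, best_val = joint, val
    return best

# Example: the 3-agent line graph from Sect. 3.1 with binary actions.
q_local = {(0, 1): {(a, b): float(a == b) for a in (0, 1) for b in (0, 1)},
           (1, 2): {(a, b): float(a != b) for a in (0, 1) for b in (0, 1)}}
weights = {e: 0.5 for e in q_local}
print(argmax_joint_action([(0, 1)] * 3, q_local, weights))  # prints (0, 0, 1)
```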
**Factored trees (FT-POMCP).** FT-POMCP constructs a tree for every edge $e$. This tree represents the factored history $\mathbf{h}_e$, which consists of a sequence of factored actions and observations $\mathbf{h}_{e,t} = (\mathbf{a}_{e,0}, \mathbf{o}_{e,1}, \ldots, \mathbf{a}_{e,t-1}, \mathbf{o}_{e,t})$. This further reduces the scope of $Q_e(\mathbf{h}, \mathbf{a}_e)$ to $Q_e(\mathbf{h}_e, \mathbf{a}_e)$ by introducing an expert for every $\langle \mathbf{h}_e, \mathbf{a}_e \rangle$ pair. In each tree, the action nodes maintain statistics $Q_e(\mathbf{h}_e, \mathbf{a}_e)$ and $N(\mathbf{h}_e, \mathbf{a}_e)$, and the observation nodes maintain $N(\mathbf{h}_e)$ and $B(\mathbf{h}_e)$, according to the factored history $\mathbf{h}_e$. During the search, we again maximize with respect to the UCB1 values (using Eq. (2)) of the local Q-functions: $\arg\max_{\mathbf{a}^{\#}} \sum_{e \in E} \mathrm{UCB}(Q_e(\mathbf{h}_e, \mathbf{a}_e), N(\mathbf{h}_e), N(\mathbf{h}_e, \mathbf{a}_e))$.

## 4 Scalable Particle Filtering

In MPOMDPs with large state spaces, repeated execution of the Bayesian belief update is intractable, as each update requires $O(|S|^2)$ computations. We represent this belief with weighted particle filters to ensure scalability. The following subsection introduces how we incorporate these filters in W-POMCP and FS-W-POMCP. The second subsection introduces our method that uses the structure of a coordination graph to condition the belief on a local part of the observation space, which we apply to FT-W-POMCP and FT-PFT.

**Particle filtering.** Particle filtering (Thrun, Burgard, and Fox 2005) represents the belief by sequential Monte Carlo approximations, alleviating the bottleneck of the belief update. In an unweighted filter, the belief approximation is a set $b = \{s^{(k)}\}_{k=1}^{K}$, with $s^{(k)} \in S$, and is updated using rejection sampling on the real observation $\mathbf{o}$: $b' = \{s'^{(k)} : \mathbf{o} = \mathbf{o}^{(k)}\}$, where $s'^{(k)}, \mathbf{o}^{(k)}$ are generated from $s^{(k)}, \mathbf{a}$ by $G$ (Kochenderfer et al. 2015). POMCP implicitly uses an unweighted particle filter by using the online particle belief $B(\mathbf{h})$ in the observation nodes to represent $b$.

### 4.1 Weighted Particle Filtering

Weighted particle filters approximate the belief by a weighted set of $K$ particles $b = \{(s^{(k)}, w^{(k)})\}_{k=1}^{K}$, where $s^{(k)} \in S$ is a state in the filter and $w^{(k)} \in \mathbb{R}_+$ the associated weight. We update beliefs in weighted filters with importance sampling as in the bootstrapped particle filter (Gordon, Salmond, and Smith 1993). In the bootstrapped particle filter, the proposal distribution is the transition function, $s'^{(k)} \sim \mathcal{T}(\cdot \mid s^{(k)}, \mathbf{a})$, and the importance weights are computed from the observation function, $w'^{(k)} \propto w^{(k)} O(\mathbf{o} \mid s'^{(k)}, \mathbf{a})$. Additionally, the posterior belief is re-sampled at every time step to alleviate sample degeneracy, after which the weights are set to $1/K$, which is known as sequential importance re-sampling (SIR). We decide whether to re-sample in our SIR filter by comparing the estimated effective sample size (ESS) of the particle filter with the number of particles (Septier and Peters 2016). The $\mathrm{ESS}(b) \approx \left(\sum_{k=1}^{K} (w^{(k)})^2\right)^{-1}$ quantifies weight disparity, which is an indicator of sample degeneracy. The likelihood $L$ of a belief update in a SIR filter represents the probability of the new belief given the observation, action, and previous belief. It is a statistic on the quality of the approximate belief update (Katt, Oliehoek, and Amato 2019). It is computed from the sum of all updated weights multiplied by the previous likelihood, $L(b') = \sum_k w'^{(k)} L(b)$, where $L(b) = 1$ when $b$ was initialized from $b_0$.

**W-POMCP and FS-W-POMCP.** In both W-POMCP and FS-W-POMCP, we represent the current root-node belief estimate with an offline weighted filter $b$ that we update independently of the search tree, instead of using the unweighted online particles $B$ stored inside the tree. We provide the pseudo-code for the SIR filter that updates $b$ in Appendix F.2. Additionally, FS-W-POMCP maintains statistics in each action node for the actions of pairs of agents instead of all joint actions, as explained previously for FS-POMCP.
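The authors' pseudo-code is in Appendix F.2 of the extended version; purely as an illustration, one SIR update with ESS-triggered re-sampling could look as follows, where `step_fn`, `obs_prob`, and the 0.5·K threshold are assumptions rather than the paper's choices.

```python
import numpy as np

def sir_update(particles, weights, a, o, step_fn, obs_prob, ess_frac=0.5, rng=None):
    """One bootstrapped SIR step of the offline weighted filter b.

    step_fn(s, a, rng) samples s' ~ T(· | s, a); obs_prob(o, s2, a) evaluates
    O(o | s', a). Returns (particles', weights', likelihood), where likelihood
    is Σ_k w'^(k) before normalization; the caller folds it into the running
    L(b) as described in Sect. 4.1.
    """
    rng = rng or np.random.default_rng()
    propagated = [step_fn(s, a, rng) for s in particles]
    new_w = np.array([w * obs_prob(o, s2, a) for w, s2 in zip(weights, propagated)],
                     dtype=float)
    likelihood = new_w.sum()
    if likelihood == 0.0:                     # deprived filter: caller falls back
        return propagated, new_w, 0.0
    new_w /= likelihood                       # normalize for ESS and re-sampling
    ess = 1.0 / np.sum(new_w ** 2)            # effective sample size
    if ess < ess_frac * len(propagated):      # re-sample on weight degeneracy
        idx = rng.choice(len(propagated), size=len(propagated), p=new_w)
        propagated = [propagated[i] for i in idx]
        new_w = np.full(len(propagated), 1.0 / len(propagated))
    return propagated, new_w, likelihood
```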
**Particle filtering in MPOMDPs.** In MPOMDPs, the observation signal becomes increasingly sparse as the number of agents increases, as it commonly depends on the probability of all individual observations. This can result in an impoverishment of the particles. Comparably, the likelihood of matching the received joint observation in the rejection update is small for unweighted filters in larger observation spaces. If the particle filter reaches a deprived state where no particles remain, the planner defaults to a baseline policy.

### 4.2 Particle Filtering in a Coordination Graph

To increase the scalability and decrease the chance of deprivation of the particle filter in large observation spaces, we introduce a general filtering approach for $b$ based on the structure of a coordination graph $(V, E)$, independent of the online planning algorithm. This method applies to both FT-POMCP (Sect. 3.2) and FT-PFT (introduced in Sect. 5). We exploit the structure in the following way. For every edge $e \in E$, we introduce a separate particle filter $b_e$ with $K_e$ particles. We choose $K_e$ such that $K = \sum_e K_e$. This method makes the following assumption.

**Assumption 1.** Individual observation probabilities, as given by the individual observation models $O_i\colon S \times A \to \Delta(\Omega_i)$, are conditionally independent given the successor state and the previous action. Therefore, we write the observation model as the product of individual observation probabilities: $O(\mathbf{o} \mid s', \mathbf{a}) = \prod_{i \in I} O_i(o_i \mid s', \mathbf{a})$, with $o_i \in \Omega_i$.

Note that we condition the individual observations on the joint state and action, in contrast to the assumption of observational independence of ND-POMDPs (Nair et al. 2005), and, distinctly from factored beliefs (Messias, Spaan, and Lima 2011) and factored particle filtering (Ng, Peshkin, and Pfeffer 2002), we do not assume any state space factorization.

**Local updates.** Using Assumption 1, we update the particle filters for the edges by the local part of the observation space, $\mathbf{o}_e \in \Omega_e$. For an edge $e = (i, j)$, we retrieve the local observation $\mathbf{o}_e$ by taking the individual observations $o_i, o_j$ from the joint observation $\mathbf{o}$. Then, we change the importance weights of the filtering procedure to be based on the local observation probability $O_e(\mathbf{o}_e \mid s', \mathbf{a})$, instead of the joint observation probability $O(\mathbf{o} \mid s', \mathbf{a})$. For each edge $e$, the local observation probability $O_e(\mathbf{o}_e \mid s', \mathbf{a}) = \prod_{i \in e} O_i(o_i \mid s', \mathbf{a})$ is the probability of observing $o_i \in \Omega_i$ for each agent $i$ in $e$. Consider an offline weighted filter $b_e = \{(s^{(k)}, w_e^{(k)})\}_{k=1}^{K_e}$ and a joint observation $\mathbf{o}$. We first propagate the particle $s'^{(k)} \sim \mathcal{T}(\cdot \mid s^{(k)}, \mathbf{a})$ and then compute the new importance sampling weight $w_e'^{(k)} \propto w_e^{(k)} O_e(\mathbf{o}_e \mid s'^{(k)}, \mathbf{a})$, where $\mathbf{o}_e$ is the local part of the observation $\mathbf{o}$. Every $b_e$ approximates the belief for each $s \in S$ as $b_e(s) = \sum_{k=1}^{K_e} w_e^{(k)} \delta(s, s^{(k)}) \big/ \sum_{l=1}^{K_e} w_e^{(l)}$, where $\delta$ is the Kronecker delta function. Now, $b_e$ represents the particle belief for the history $\mathbf{h}_e$ of joint actions $\mathbf{a}$ and local observations $\mathbf{o}_e$ for edge $e$, and $L(b_e) = \sum_{k=1}^{K_e} w_e^{(k)}$ is its likelihood. We provide more details in Appendix F.2.
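For illustration only, a sketch of the corresponding per-edge update under Assumption 1; the names `step_fn` and `obs_prob_i` are assumptions, and re-sampling (as in the joint SIR sketch above) is omitted for brevity.

```python
def local_sir_update(edge_filters, a, o_joint, step_fn, obs_prob_i, rng):
    """Update every b_e with the local observation o_e taken from o_joint.

    edge_filters maps an edge e = (i, j) to a (particles, weights, L) triple;
    obs_prob_i(i, o_i, s2, a) evaluates the individual model O_i(o_i | s', a).
    """
    for e, (particles, weights, L) in edge_filters.items():
        new_particles, new_weights = [], []
        for s, w in zip(particles, weights):
            s2 = step_fn(s, a, rng)                  # s' ~ T(· | s, a)
            # O_e(o_e | s', a) = Π_{i in e} O_i(o_i | s', a)  (Assumption 1)
            for i in e:
                w *= obs_prob_i(i, o_joint[i], s2, a)
            new_particles.append(s2)
            new_weights.append(w)
        total = sum(new_weights)                     # Σ_k w_e'^(k)
        if total > 0:
            new_weights = [w / total for w in new_weights]
        edge_filters[e] = (new_particles, new_weights, L * total)
    return edge_filters
```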
**Ensembling.** Since we run multiple particle filters in parallel, we must decide how to fuse these beliefs together for sampling. We propose to treat this set of beliefs as an ensemble and determine a procedure for sampling from the ensemble at every simulation iteration. We use the likelihood of the weighted particle filter update as a statistic for the quality of the belief approximation (Katt, Oliehoek, and Amato 2019). Intuitively, we sample more often from higher-quality filters. More precisely, we give filters that contain particles with a higher probability of generating the true observation a higher chance of being sampled. We sample from the set of filters with probabilities proportional to the likelihoods $L(b_e)$ of the particle filters of the edges: $s \sim b_e$ with probability $L(b_e) \big/ \sum_{e' \in E} L(b_{e'})$. Altogether, this ensemble particle filter results in the following approximation $\tilde{b}$ of the belief $b$ for each $s \in S$:

$$b(s) \approx \tilde{b}(s) = \sum_{e \in E} \frac{L(b_e)}{\sum_{e'} L(b_{e'})}\, b_e(s).$$

FT-W-POMCP and FT-PFT maintain an offline belief estimate $b$ consisting of the offline local beliefs $b_e$ for each $e$.

**Limitation.** Our proposal distribution remains $\mathcal{T}$ across the ensemble, but our observation distributions are the local functions $O_e$ for every filter $b_e$. Therefore, these local filters will be biased towards the posterior $p(s \mid \mathbf{h}_e, b_0)$ instead of $p(s \mid \mathbf{h}, b_0)$, where, as before, $\mathbf{h}_e$ is the history of factored observations $\mathbf{o}_e$ and joint actions $\mathbf{a}$. Since each filter considers only local observations, the local filters cannot recover a joint belief that depends on all agents (Capitán et al. 2011).

## 5 Sparse-PFT with Value Factorization

In this section, we lift Sparse-PFT to MPOMDPs. We propose extensions that exploit the factorization of the action space as in Sect. 3.2. First, we introduce a particle belief approximation and Sparse-PFT for MPOMDPs. Then, we introduce variants with factored statistics (FS-PFT) and factored trees (FT-PFT) to combat large action spaces.

**Particle approximation.** For POMDPs, it is natural to consider a fully observable belief-MDP, whose states are the beliefs and whose action space is unchanged (Cassandra, Kaelbling, and Littman 1994). The same construction for MPOMDPs yields a belief-MMDP. Particle-belief-MDPs approximate belief-MDPs (Lim, Tomlin, and Sunberg 2020; Lim et al. 2023). Similarly, we introduce the particle-belief MMDP as an approximation of an MPOMDP:

**Definition 2 (PB-MMDP).** The particle-belief MMDP for an MPOMDP $\mathcal{M} = \langle I, S, A, \mathcal{T}, R, \Omega, O, \gamma \rangle$ is a tuple $\langle I, \Sigma, A, \tau, \rho, \gamma \rangle$ with states $\Sigma = (S \times \mathbb{R}_+)^C$ consisting of online weighted particle beliefs $\tilde{b} = \{(s^{(k)}, w^{(k)})\}_{k=1}^{C}$ encoded by $C$ particles, the transition density function $\tau\colon \Sigma \times A \to \Delta(\Sigma)$ defined by $\tau(\tilde{b}' \mid \tilde{b}, \mathbf{a}) = \sum_{\mathbf{o} \in \Omega} \Pr(\tilde{b}' \mid \tilde{b}, \mathbf{a}, \mathbf{o}) \Pr(\mathbf{o} \mid \tilde{b}, \mathbf{a})$, and the reward function $\rho\colon \Sigma \times A \to \mathbb{R}$ defined by $\rho(\tilde{b}, \mathbf{a}) = \sum_k w^{(k)} R(s^{(k)}, \mathbf{a}) \big/ \sum_k w^{(k)}$.

Simulating the PB-MMDP requires us to update the associated generative model. We simulate particle beliefs $\tilde{b}$ of size $C$ instead of individual states to estimate $Q$. Consequently, the generative model $G_{PF}\colon \Sigma \times A \to \Sigma \times \mathbb{R}$ updates the state based on the action and returns the particle-based reward $\rho$ as specified above. This extension increases the complexity of the generative model by a factor of $O(C)$.
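To make Def. 2 concrete, here is a minimal sketch of one step of a particle-belief generative model $G_{PF}$; sampling a single representative joint observation by particle weight and the helper names `gen` and `obs_prob` are assumptions, not the authors' implementation.

```python
import random

def gpf_step(belief, a, gen, obs_prob, rng=random):
    """One particle-belief transition (b, a) -> (b', o, ρ) with C particles.

    belief: list of (state, weight) pairs; gen(s, a, rng) wraps the MPOMDP
    simulator G and returns (s', o, r); obs_prob(o, s2, a) evaluates O(o | s', a).
    """
    states = [s for s, _ in belief]
    weights = [w for _, w in belief]
    successors = [gen(s, a, rng) for s in states]      # propagate every particle
    # use the observation of one particle (drawn by weight) for the whole belief
    idx = rng.choices(range(len(belief)), weights=weights, k=1)[0]
    o = successors[idx][1]
    total_w = sum(weights)
    new_belief, rho = [], 0.0
    for (s2, _, r), w in zip(successors, weights):
        new_belief.append((s2, w * obs_prob(o, s2, a)))  # reweight by Pr(o | s', a)
        rho += w * r                                     # particle-based reward
    return new_belief, o, rho / total_w                  # ρ = Σ w r / Σ w
```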
### 5.1 Sparse Particle Filter Tree

Sparse-PFT is an application of UCT to the PB-MMDP. While it was designed for continuous state spaces, the fact that the tree branches on a fixed number of belief nodes instead of the number of joint observations is beneficial in our setting. Sparse-PFT constructs a sparse particle-belief tree incrementally during a forward search by allowing each action node to expand up to $C$ particle-belief nodes. The particle-belief nodes correspond to the states of the particle-belief MMDP (Def. 2). The root particle belief $\tilde{b} = \{(s^{(k)}, 1/C)\}_{k=1}^{C}$, with $s^{(k)} \sim b$, is sampled at every simulation iteration from the current offline belief $b$. Following our separation of online and offline beliefs, the number of particles in the offline belief, $|b| \gg C$, can be much greater than in the simulated belief $\tilde{b}$ inside the tree. If the number of children $|\mathrm{Ch}(\tilde{b}, \mathbf{a})|$ of the action node is less than $C$, then we simulate the particle belief through $G_{PF}$ to obtain the next particle belief $\tilde{b}'$ and particle-based reward $\rho$. Otherwise, $\tilde{b}'$ and $\rho$ are sampled uniformly from $\mathrm{Ch}(\tilde{b}, \mathbf{a})$. We continue the simulation and traverse the particle-belief tree until we reach a leaf node or a predetermined maximum depth. If we reach a leaf node, a rollout is performed. Scalability is partially addressed because the branching factor of the belief nodes is independent of the observation size. However, a full enumeration of the action space is still required for selecting actions according to UCB1, which is impractical in MPOMDPs.

### 5.2 Sparse-PFT for MPOMDPs

We introduce two extensions to improve upon the weakness of Sparse-PFT when operating with large action spaces.

**Factored statistics (FS-PFT).** We propose to keep factored action statistics in the nodes of the particle filter tree, similar to FS-POMCP. In addition to the node visit count $N(\tilde{b})$, we maintain sets of statistics $\langle Q_e(\tilde{b}, \mathbf{a}_e), N(\tilde{b}, \mathbf{a}_e) \rangle$ in every particle-belief node that predict the Q-function for every edge $e \in E$, applying the MoE optimization from Eq. (3) directly in the nodes. The offline belief $b$ is represented by an external weighted particle filter, as in FS-W-POMCP. Finally, similarly to the previous factored statistics algorithms, a graphical inference algorithm selects the maximal joint action $\mathbf{a}^{\#}$ during the search by maximizing over the UCB1 values: $\arg\max_{\mathbf{a}^{\#}} \sum_{e \in E} \mathrm{UCB}(Q_e(\tilde{b}, \mathbf{a}_e), N(\tilde{b}), N(\tilde{b}, \mathbf{a}_e))$.

**Factored trees (FT-PFT).** Additionally, we introduce the construction of multiple particle-belief trees in parallel, one for each $e \in E$. These trees have a local action space $\mathbf{a}_e$ and maintain statistics $\langle Q_e(\tilde{b}, \mathbf{a}_e), N_e(\tilde{b}), N_e(\tilde{b}, \mathbf{a}_e) \rangle$ for the agents associated with the edge. Since particle filter trees do not explicitly branch on observations, only the action space is factored inside the trees. We use a single joint particle-belief step in each layer to reduce overhead. Thus, every tree is constructed from the same simulated particle filter beliefs. Although the belief nodes might have the same particles, we maintain independent visit count statistics $N_e$ for each belief node and the associated local joint actions $\mathbf{a}_e \in A_e$, respectively. The inference equation for picking the maximal UCB1 action (using Eq. (2)) is given by: $\arg\max_{\mathbf{a}^{\#}} \sum_{e \in E} \mathrm{UCB}(Q_e(\tilde{b}, \mathbf{a}_e), N_e(\tilde{b}), N_e(\tilde{b}, \mathbf{a}_e))$. The offline belief $b$ is maintained identically to FT-W-POMCP (Sect. 4.2), by the ensemble of offline beliefs $b_e$. In addition to the above, we suspect the improvement of FT-PFT over Sparse-PFT stems from an increase in node re-use and search depth due to the smaller factored action space in the trees.
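A sketch of the factored UCB1 action selection shared by FS-PFT and FT-PFT follows; the per-edge statistics layout and the pluggable `maximizer` argument are illustrative, and in the paper the maximization is performed with Variable Elimination or Max-Plus.

```python
import math

def factored_ucb_action(edge_stats, visit_count, c, maximizer):
    """Select a^# from per-edge statistics at a particle-belief node.

    edge_stats maps edge e -> {a_e: (Q_e, n_e)}; visit_count(e) returns N(b)
    (FS-PFT, identical for all edges) or N_e(b) (FT-PFT, one tree per edge);
    maximizer turns the per-edge UCB tables into a joint action that is
    consistent across edges, e.g. the brute-force enumeration sketched earlier.
    """
    ucb_tables = {
        e: {a_e: q + c * math.sqrt(math.log(visit_count(e) + 1) / (n + 1))
            for a_e, (q, n) in stats.items()}
        for e, stats in edge_stats.items()
    }
    return maximizer(ucb_tables)
```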
## 6 Experimental Evaluation

We evaluate the effectiveness of our methods on MPOMDPs with many agents. Abbreviations follow those in Table 1. The key question is **Q1**: Does the use of coordination graphs (CGs) accelerate online planners for MPOMDPs in general? We evaluate this question on three benchmarks, one with a given coordination graph and two with an artificially chosen graph. Regarding our novel algorithms introduced in Sect. 4 and 5, we evaluate **Q2**: Do (FS/FT)-W-POMCP variants improve over (unweighted) (FS/FT)-POMCP variants? And **Q3**: Do FS/FT-PFT improve over Sparse-PFT?

**Benchmarks.** FIREFIGHTINGGRAPH (FFG, Oliehoek et al. 2008) has been used to evaluate factored POMCP (Amato and Oliehoek 2015). Agents stand in a line, and houses are located to the left and right of each agent. Agents have two actions: fight fires to their left or right. Multi-agent ROCKSAMPLE (MARS, Cai et al. 2021) extends single-agent ROCKSAMPLE (Smith and Simmons 2004). MARS environments are defined by their size $m$, the number of agents $n$, and the number of rocks $k$, with $k = m = 15$. In CAPTURETARGET (CT), agents are tasked with capturing a moving target. We depict results for CT in Appendix A.2. Detailed benchmark descriptions are in Appendix G.

**Experimental set-up.** All algorithm variants are implemented in the same Python prototype, published online (https://zenodo.org/records/10409525). All code ran on a machine with an Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz and 256 GB RAM (8 x 32GB DDR4-3200). Our Python wrapper executed episodes in parallel on 34 threads such that each episode had access to 256/34 ≈ 7.5 GB of RAM. All reported results are the achieved returns averaged over 100 episodes, with error bars representing 95% confidence intervals. We did not run an extensive hyperparameter optimization for any algorithm, and we list the most important parameters in Tab. 2 of Appendix A.1. All algorithms ran with a maximum of 5s and 15s per step on FFG/CT and MARS, respectively. If the particle filter belief is deprived at any point in time during the episode, the policy defaults to a random policy. We set the number $K$ of particles in the joint filters such that $K = \sum_e K_e$ in the factored filters; e.g., if we have three edges with $K_e = 100$, then the joint counterpart has $K = 300$. For MARS and CT, we chose the CG as a line ($n-1$ factors) for odd numbers of agents and as a team formation ($n/2$ factors), where pairs of agents cooperate, for even numbers. The single-agent algorithms could not run with more than 20 and 5 agents on FFG and MARS, respectively.

[Figure 1: Performance comparison for POMCP variants with (solid) and without (dotted) weighted particle filtering, in terms of cumulative discounted reward over the number of agents on FIREFIGHTINGGRAPH and Multi-Agent ROCKSAMPLE.]

**Discussion.** Below, we analyze the results as answers to the three questions.

**Q1.** We study Q1 across our different set-ups. The single-agent algorithms (Sparse-PFT, POMCP, and W-POMCP) are out-scaled by their competitors with value factorization (Fig. 1 and 2) in FFG and MARS. However, planning on the joint value performs better in settings with fewer agents. In MARS and CT, the agents move and thus may coordinate dynamically. Therefore, the desired agent coordination does not induce a sparse coordination graph, meaning the CG acts as a heuristic. The results show that assuming some arbitrary, sparse static graph is helpful, even if this assumes no coordination between agents that, in principle, should coordinate. We find that the static heuristic performs well when many agents are involved. Thus, CGs (as a heuristic) accelerate planning.

[Figure 2: Comparison between Sparse-PFT (dotted) and our PFT variants with value factorization (solid), in terms of cumulative discounted reward over the number of agents on FIREFIGHTINGGRAPH and Multi-Agent ROCKSAMPLE.]
**Q2.** FS-W-POMCP outperforms FS-POMCP across all three benchmarks, showing that POMCP benefits from offline weighted particle filtering. The difference between FT-POMCP and FT-W-POMCP is smaller, as FT-POMCP also benefits from the fact that the offline belief representation consists of local beliefs $b_e$ for each $e$, albeit unweighted (see Appendix E).

**Q3.** Sparse-PFT performs well in MPOMDPs with fewer agents, but with more agents, it runs out of memory or fails to find a good estimate of $Q$ due to its naive enumeration of the action space (Fig. 2). FS-PFT and FT-PFT do scale to settings with many agents. Thus, CGs alleviate Sparse-PFT's scaling issues in many-agent POMDPs. However, they achieve comparable (FFG) or lower (MARS) returns in the settings with few agents.

Combining CGs with weighted filtering performs well across the board. We cross-evaluate our contributions in Fig. 3 of Appendix A.2. We find that both the factored W-POMCP and the factored PFT algorithm variants are suited for many-agent POMDPs and perform well across FFG and MARS, but the POMCP variants generally perform better. The improvement of FS-W-POMCP over FS-POMCP is consistent. FT-W-POMCP is slightly better on FFG and significantly improves over FT-POMCP in CT, but performs equal or worse on MARS. The action selection method has a noticeable influence on performance. In MARS and CT, VE is the best-performing method. However, MP achieves much higher returns in FFG (Fig. 4, Appendix A.2). The factored algorithms are sensitive to the method that maximizes over the local predictions, as recently also demonstrated in the fully observable setting (Choudhury et al. 2022).

## 7 Related Work

**Multi-agent Markov models.** MPOMDPs reside in a realm of models for cooperative multi-agent systems with partial observability. Distributed cooperative systems (Dec-POMDPs, Oliehoek and Amato 2016) remove the communication assumptions of MPOMDPs. However, they are much more computationally complex (doubly exponential), as agents need to reason over each other's policies. Messias, Spaan, and Lima (2011) considered factored MPOMDPs by assuming shared communication in Dec-POMDPs, computing policies over factored beliefs. Additionally, they studied lifting the instantaneous communication assumptions by asynchronous execution (Messias, Spaan, and Lima 2013). Our algorithms build on prior work by Amato and Oliehoek (2015); therefore, their work is summarized in Sect. 3. Zhou et al. (2019) introduce a further decentralized MCTS algorithm for transition-independent MPOMDPs. Choudhury et al. (2022) consider a fully observable MMDP setting and study action selection under state-dependent coordination graphs. Recently, MPOMDPs were also studied with barrier functions over the joint belief (Ahmadi et al. 2019) and to support multi-object tracking (Nguyen et al. 2020).

**Single-agent online planning.** POMCPOW (Sunberg and Kochenderfer 2018) and Sparse-PFT (Lim et al. 2023), also in Table 1, are algorithms that improved UCT-based planners for POMDPs with continuous spaces. They, among other things, replaced unweighted belief estimates with importance sampling estimates using weighted particle filters (Sect. 4). We summarize Sparse-PFT in Sect. 5. POMCPOW shares characteristics with W-POMCP and Sparse-PFT, simulating single states but maintaining weighted particles in the tree. DESPOT (Ye et al. 2017) and AdaOPS (Wu et al. 2021) are alternative, orthogonal online planners that are distinguishable from MCTS methods (e.g., POMCP).
DESPOT utilizes a set of deterministic scenarios and heuristic tree searches to reduce variance in the value estimates, instead of the independent simulations of MCTS. Its extensions employ alpha-vectors to fuse similar paths in the tree (Garg, Hsu, and Lee 2019) or (GPU) parallelization in factored simulators (Cai et al. 2021). AdaOPS also employs offline and weighted particle filtering. Distinctively, it uses adaptive particle filtering (Fox 2001), which requires a partitioning of the state space into grids. It relies on a full-width search instead of simulations, during which it fuses similar observation branches. Both algorithms work well with small discrete action spaces. However, it is unclear how value factorization from coordination graphs can be incorporated, as both algorithms expand the full action space at each new node instead of picking the most promising action to simulate via the UCB1 policy. In MPOMDPs, expanding the combinatorial joint action space is impractical.

## 8 Conclusion

In this paper, we studied how to simultaneously tackle the belief and value estimation challenges in online planning for MPOMDPs. We presented extensions of factored POMCP and novel variants of the Sparse-PFT algorithm tailored specifically for many-agent online planning with partial observability. The empirical evaluation showed the effectiveness of combining weighted particle filtering and value factorization in settings with many agents. However, it is also clear that planning on the joint value suffices when few agents are involved. Future work consists of alleviating the communication assumptions (Spaan, Oliehoek, and Vlassis 2008; Oliehoek and Spaan 2012; Messias, Spaan, and Lima 2013), exploring extensions for continuous MPOMDPs, or learning the coordination graph (Kok et al. 2005).

## Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback. This research has been partially funded by the NWO grant NWA.1160.18.238 (PrimaVera) and the ERC Starting Grant 101077178 (DEUCE). Additionally, we would like to thank the ELLIS Unit Nijmegen and Radboud AI for their support.

## References

Ahmadi, M.; Singletary, A.; Burdick, J. W.; and Ames, A. D. 2019. Safe Policy Synthesis in Multi-Agent POMDPs via Discrete-Time Barrier Functions. In CDC, 4797–4803. IEEE.
Amato, C.; and Oliehoek, F. A. 2015. Scalable Planning and Learning for Multiagent POMDPs. In AAAI, 1995–2002. AAAI Press.
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn., 47(2-3): 235–256.
Browne, C.; Powley, E. J.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Liebana, D. P.; Samothrakis, S.; and Colton, S. 2012. A Survey of Monte Carlo Tree Search Methods. IEEE Trans. Comput. Intell. AI Games, 4(1): 1–43.
Cai, P.; Luo, Y.; Hsu, D.; and Lee, W. S. 2021. HyP-DESPOT: A Hybrid Parallel Algorithm for Online Planning under Uncertainty. Int. J. Robotics Res., 40(2-3).
Capitán, J.; Merino, L.; Caballero, F.; and Ollero, A. 2011. Decentralized Delayed-State Information Filter (DDSIF): A New Approach for Cooperative Decentralized Tracking. Robotics Auton. Syst., 59(6): 376–388.
Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Acting Optimally in Partially Observable Stochastic Domains. In AAAI, 1023–1028. AAAI Press / The MIT Press.
Choudhury, S.; Gupta, J. K.; Morales, P.; and Kochenderfer, M. J. 2022. Scalable Online Planning for Multi-Agent MDPs. J. Artif. Intell. Res., 73: 821–846.
Fischer, J.; and Tas, O. S. 2020. Information Particle Filter Tree: An Online Algorithm for POMDPs with Belief-Based Rewards on Continuous Domains. In ICML, volume 119 of Proceedings of Machine Learning Research, 3177–3187. PMLR.
Fox, D. 2001. KLD-Sampling: Adaptive Particle Filters. In NIPS, 713–720. MIT Press.
Galesloot, M. F. L.; Simão, T. D.; Junges, S.; and Jansen, N. 2023. Factored Online Planning in Many-Agent POMDPs. arXiv:2312.11434.
Garg, N. P.; Hsu, D.; and Lee, W. S. 2019. DESPOT-Alpha: Online POMDP Planning with Large State and Observation Spaces. In Robotics: Science and Systems.
Gordon, N. J.; Salmond, D. J.; and Smith, A. F. M. 1993. Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation. IEE Proceedings F (Radar and Signal Processing), 140(2): 107–113.
Guestrin, C.; Lagoudakis, M. G.; and Parr, R. 2002. Coordinated Reinforcement Learning. In ICML, 227–234. Morgan Kaufmann.
Guestrin, C.; Venkataraman, S.; and Koller, D. 2002. Context-Specific Multiagent Coordination and Planning with Factored MDPs. In AAAI/IAAI, 253–259. AAAI Press / The MIT Press.
Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and Acting in Partially Observable Stochastic Domains. Artif. Intell., 101(1-2): 99–134.
Katt, S.; Oliehoek, F. A.; and Amato, C. 2019. Bayesian Reinforcement Learning in Factored POMDPs. In AAMAS, 7–15. International Foundation for Autonomous Agents and Multiagent Systems.
Kearns, M. J.; Mansour, Y.; and Ng, A. Y. 2002. A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Mach. Learn., 49(2-3): 193–208.
Kochenderfer, M.; Amato, C.; Chowdhary, G.; How, J.; and Reynolds, H. 2015. Decision Making Under Uncertainty: Theory and Application. MIT Lincoln Laboratory Series. MIT Press. ISBN 978-0-262-02925-4.
Kocsis, L.; and Szepesvári, C. 2006. Bandit Based Monte-Carlo Planning. In ECML, volume 4212 of Lecture Notes in Computer Science, 282–293. Springer.
Kok, J. R.; Hoen, P. J.; Bakker, B.; and Vlassis, N. 2005. Utile Coordination: Learning Interdependencies Among Cooperative Agents. In CIG. IEEE.
Lim, M. H.; Becker, T. J.; Kochenderfer, M. J.; Tomlin, C. J.; and Sunberg, Z. N. 2023. Optimality Guarantees for Particle Belief Approximation of POMDPs. J. Artif. Intell. Res., 77: 1591–1636.
Lim, M. H.; Tomlin, C. J.; and Sunberg, Z. N. 2020. Sparse Tree Search Optimality Guarantees in POMDPs with Continuous Observation Spaces. In IJCAI, 4135–4142. ijcai.org.
Messias, J. V.; Spaan, M. T. J.; and Lima, P. U. 2011. Efficient Offline Communication Policies for Factored Multiagent POMDPs. In NIPS, 1917–1925.
Messias, J. V.; Spaan, M. T. J.; and Lima, P. U. 2013. Multiagent POMDPs with Asynchronous Execution. In AAMAS, 1273–1274. IFAAMAS.
Nair, R.; Varakantham, P.; Tambe, M.; and Yokoo, M. 2005. Networked Distributed POMDPs: A Synergy of Distributed Constraint Optimization and POMDPs. In IJCAI, 1758–1760. Professional Book Center.
Ng, B.; Peshkin, L.; and Pfeffer, A. 2002. Factored Particles for Scalable Monitoring. In UAI, 370–377. Morgan Kaufmann.
Nguyen, H. V.; Rezatofighi, H.; Vo, B.; and Ranasinghe, D. C. 2020. Multi-Objective Multi-Agent Planning for Jointly Discovering and Tracking Mobile Objects. In AAAI, 7227–7235. AAAI Press.
Oliehoek, F. A.; and Amato, C. 2016. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. Springer.
Oliehoek, F. A.; and Spaan, M. T. J. 2012. Tree-Based Solution Methods for Multiagent POMDPs with Delayed Communication. In AAAI, 1415–1421. AAAI Press.
Oliehoek, F. A.; Spaan, M. T. J.; Whiteson, S.; and Vlassis, N. 2008. Exploiting Locality of Interaction in Factored Dec-POMDPs. In AAMAS (1), 517–524. IFAAMAS.
Pynadath, D. V.; and Tambe, M. 2002. The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models. J. Artif. Intell. Res., 16: 389–423.
Septier, F.; and Peters, G. W. 2016. Langevin and Hamiltonian Based Sequential MCMC for Efficient Bayesian Filtering in High-Dimensional Spaces. IEEE J. Sel. Top. Signal Process., 10(2): 312–327.
Silver, D.; and Veness, J. 2010. Monte-Carlo Planning in Large POMDPs. In NIPS, 2164–2172. Curran Associates, Inc.
Smith, T.; and Simmons, R. G. 2004. Heuristic Search Value Iteration for POMDPs. In UAI, 520–527. AUAI Press.
Spaan, M. T. J. 2012. Partially Observable Markov Decision Processes. In Reinforcement Learning, volume 12 of Adaptation, Learning, and Optimization, 387–414. Springer.
Spaan, M. T. J.; Oliehoek, F. A.; and Vlassis, N. 2008. Multiagent Planning Under Uncertainty with Stochastic Communication Delays. In ICAPS, 338–345. AAAI.
Sunberg, Z. N.; and Kochenderfer, M. J. 2018. Online Algorithms for POMDPs with Continuous State, Action, and Observation Spaces. In ICAPS, 259–263. AAAI Press.
Thrun, S. 1999. Monte Carlo POMDPs. In NIPS, 1064–1070. The MIT Press.
Thrun, S.; Burgard, W.; and Fox, D. 2005. Probabilistic Robotics. Intelligent Robotics and Autonomous Agents. MIT Press.
Vlassis, N.; Elhorst, R.; and Kok, J. R. 2004. Anytime Algorithms for Multiagent Decision Making Using Coordination Graphs. In SMC (1), 953–957. IEEE.
Witwicki, S. J.; Castillo, J. C.; Messias, J. V.; Capitán, J.; Melo, F. S.; Lima, P. U.; and Veloso, M. M. 2017. Autonomous Surveillance Robots: A Decision-Making Framework for Networked Multiagent Systems. IEEE Robotics Autom. Mag., 24(3): 52–64.
Wu, C.; Yang, G.; Zhang, Z.; Yu, Y.; Li, D.; Liu, W.; and Hao, J. 2021. Adaptive Online Packing-guided Search for POMDPs. In NeurIPS, 28419–28430.
Ye, N.; Somani, A.; Hsu, D.; and Lee, W. S. 2017. DESPOT: Online POMDP Planning with Regularization. J. Artif. Intell. Res., 58: 231–266.
Zhou, X.; Wang, W.; Zhu, Y.; Wang, T.; and Zhang, B. 2019. Centralized Patrolling With Weakly-Coupled Agents Using Monte Carlo Tree Search. IEEE Access, 7: 157293–157302.