# optimally_solving_decpomdps_as_continuousstate_mdps__679d9476.pdf

Journal of Artiﬁcial Intelligence Research 55 (2016) 443-497 Submitted 11/14; published 02/16

Optimally Solving Dec-POMDPs as Continuous-State MDPs

Jilles Steeve Dibangoye jilles-steeve.dibangoye@insa-lyon.fr Univ de Lyon INSA-Lyon, CITI-Inria, F-69621, France

Christopher Amato camato@cs.unh.edu University of New Hampshire Durham, NH, USA

Olivier Buﬀet olivier.buffet@inria.fr Franc ois Charpillet francois.charpillet@inria.fr Inria Universit e de Lorraine CNRS Villers-l es-Nancy, F-54600, France

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general model for decision-making under uncertainty in decentralized settings, but are diﬃcult to solve optimally (NEXP-Complete). As a new way of solving these problems, we introduce the idea of transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function. This approach makes use of the fact that planning can be accomplished in a centralized oﬄine manner, while execution can still be decentralized. This new Dec-POMDP formulation, which we call an occupancy MDP, allows powerful POMDP and continuous-state MDP methods to be used for the ﬁrst time. To provide scalability, we reﬁne this approach by combining heuristic search and compact representations that exploit the structure present in multi-agent domains, without losing the ability to converge to an optimal solution. In particular, we introduce a feature-based heuristic search value iteration (FB-HSVI) algorithm that relies on feature-based compact representations, point-based updates and eﬃcient action selection. A theoretical analysis demonstrates that FB-HSVI terminates in ﬁnite time with an optimal solution. We include an extensive empirical analysis using well-known benchmarks, thereby demonstrating that our approach provides signiﬁcant scalability improvements compared to the state of the art.

1. Introduction

Many signiﬁcant real-world problems involve decision making in sequential multiagent environments. Examples include: exploration robots that must coordinate to perform experiments on an unknown planet (Zilberstein, Washington, Bernstein, & Mouaddib, 2002); rescue robots that, after a disaster, must safely ﬁnd victims as quickly as possible (Paquet, Chaib-draa, Dallaire, & Bergeron, 2010); optimized distributed congestion control in a noisy computer network (Winstein & Balakrishnan, 2013); sensor networks where multiple sensors work jointly to perform a large-scale sensing task under strict power constraints (Jain, Taylor, Tambe, & Yokoo, 2009); or robot logistics problems with communication limitations and sensor uncertainty (Amato, Konidaris, Anders, Cruz, How, & Kaelbling, 2015). All these tasks require multiple decision makers, or agents, to coordinate their actions in order to achieve common long-term goals. Additionally, uncertainty is ubiquitous in these domains, both in the eﬀects of actions and in the information received by the agents.

c 2016 AI Access Foundation. All rights reserved.

Dibangoye, Amato, Buffet & Charpillet

Markov decision processes (MDPs) address uncertainty in system dynamics, but assume centralization. Standard methods for solving MDPs, e.g., linear and dynamic programming (Bellman, 1957; Puterman, 1994) and heuristic search (Barto, Bradtke, & Singh, 1995; Hansen & Zilberstein, 2001), are centralized during both the planning and the execution phases. Partially observable Markov decision processes (POMDPs) extend MDPs to situations in which there is uncertainty over the state of the system (Kaelbling, Littman, & Cassandra, 1998; Smith & Simmons, 2006; Shani, Pineau, & Kaplow, 2013), but similarly assume centralized planning and execution. To use the MDP and POMDP models when multiple agents are present, a centralized coordinator agent must have a global view of the underlying state (or belief state in the case of a POMDP) of the entire system, and plan on behalf of its teammates. Every time step, this agent would transmit the appropriate action that each agent must perform and then observe the resulting state (or observations of each agent in a POMDP). These methods, collectively referred to as centralized planning for centralized control, assume agents communicate at each step with no delay or cost, either explicitly through messages or implicitly through observations. Unfortunately, in many practical applications, agents are not permitted to share their information with no delay or cost; rather, each agent possesses only local, unshared observations, and acts without full knowledge of what others observe or plan to do. These characteristics have led to the development of a rich body of research on decentralized decision-making under uncertainty. The decentralized partially observable Markov decision process (Dec-POMDP) is a standard formulation for cooperative decision-making in these sequential settings without instantaneous, free and noiseless communication (Bernstein, Givan, Immerman, & Zilberstein, 2002). Over the past decade, there has been extensive research on solution methods for Dec-POMDPs, using methods such as dynamic programming (Hansen, Bernstein, & Zilberstein, 2004; Boularias & Chaib-draa, 2008; Amato, Dibangoye, & Zilberstein, 2009), optimization (Aras & Dutech, 2010; Amato, Bernstein, & Zilberstein, 2010) and heuristic search (Szer, Charpillet, & Zilberstein, 2005; Oliehoek, Spaan, & Vlassis, 2008; Oliehoek, Spaan, Amato, & Whiteson, 2013). These approaches directly search for an optimal solution in the space of possible solutions (or policies) but become intractable for larger problems. This is not unexpected given the worst-case NEXP complexity of ﬁnite-horizon Dec-POMDPs (Bernstein et al., 2002). A key assumption in many Dec-POMDP algorithms is that planning can be centralized as long as execution remains decentralized (Hansen et al., 2004; Boularias & Chaib-draa, 2008; Amato et al., 2009; Szer et al., 2005; Oliehoek et al., 2008, 2013). That is, these methods represent centralized planning for decentralized control a centralized planner generates a tuple of individual solutions, one individual solution for each agent. While the use of the Dec-POMDP model does not require centralized planning (e.g., Velagapudi, Varakantham, Sycara, & Scerri, 2011; Wu, Zilberstein, & Chen, 2011; Banerjee, Lyle, Kraemer, & Yellamraju, 2012), because Dec-POMDPs are a cooperative framework it is common to assume that centralized planning is possible. Unfortunately, current algorithms do not take full advantage of this centralized planning assumption. This article extends a conference paper published at IJCAI 13 (Dibangoye, Amato, Buﬀet, & Charpillet, 2013), which includes the introduction of a centralized solution method that recasts a Dec-POMDP as a continuous-state MDP with more detailed background, new theorems and proofs, as well as more concise representations of policies and value functions. Our novel method is also able to produce decentralized solutions, but leverages work on centralized planning methods to signiﬁcantly increase scalability. Furthermore, we show that the optimal value function of the aforementioned MDP is a piecewise-linear and convex function. In this form, theory from POMDPs

Optimally Solving Finite-Horizon Dec-POMDPs

(Kaelbling et al., 1998) applies, allowing POMDP algorithms to produce optimal solutions for Dec POMDPs. A wide range of POMDP algorithms, which have demonstrated signiﬁcant scalability (Shani et al., 2013), can now be applied. We extend one such heuristic search algorithm (Smith & Simmons, 2004) to the Dec-POMDP case, but because the number of states and actions in this MDP grows exponentially with the planning horizon, scalability remains limited. To increase scalability, we introduce a novel mechanism that reﬁnes this centralized solution methodology and present ways to combine classical heuristic search and compact representations, without losing the ability to converge to an optimal solution. To incorporate compact representations, we build on feature-based dynamic programming (Tsitsiklis & van Roy, 1996), which includes feature extraction and value prediction with approximation methods. We introduce the feature-based heuristic search value iteration (FB-HSVI) algorithm that relies on compact representations, pointbased updates and eﬃcient action selection. A theoretical analysis demonstrates that FB-HSVI terminates in ﬁnite time with an optimal solution. This combination of POMDP theory and compact representations can greatly reduce the problem size and solution eﬃciency while retaining optimal solutions for Dec-POMDPs.

State start State State

Observation1 Observation1 Observation1

Observation2 Observation2 Observation2

Action1 Action1

Action2 Action2

Time t t+1 t+2

Agent 1 s World

Agent 2 s World

Figure 1: A graphical model of the two-agent Dec-POMDP model.

Dibangoye, Amato, Buffet & Charpillet

2. Background on Dec-POMDPs

This section introduces the basic components of decentralized partially observable Markov decision processes (Dec-POMDPs).

2.1 Problem Deﬁnition and Notations

Consider multiple agents that are faced with the task of inﬂuencing a stochastic system as it evolves over time (see the two agent case in Figure 1). At every time step, each agent receives a private observation that gives (possibly) incomplete and noisy information about the current state of the system. Since the states are not observable, an agent cannot choose its actions based on the states. Instead, it can consider a complete history of its past actions and observations to choose an action. Actions produce a common immediate reward, and the system evolves to a new state at the next time step according to a probability distribution conditioned on actions. At the next time step, agents face a similar problem again, but now the system may be in a diﬀerent state. The goal of agents is to choose the actions based on these local action-observation sequences which cause the system to perform optimally with respect to a shared performance criterion (which we discuss below). The Dec-POMDP model formalizes these interactions between agents and the system. This paper formulates and solves this general decentralized stochastic control problem for a process that operates for a ﬁnite planning horizon.

Deﬁnition 1. A Dec-POMDP is represented as a tuple M (I, S, A, Z, p, o, r, b0, T), where:

I is a ﬁnite set of agents i {1, 2, . . . , |I|};

S is a ﬁnite set of n states;

Ai is the ﬁnite set of agent i s actions; A i Ai is the ﬁnite set of joint actions;

Zi is the ﬁnite set of agent i s observations; Z i Zi is the ﬁnite set of joint observations;

p = {pa | a A} denotes the transition model. pa is an n n stochastic matrix, where pa(s, s) is the probability of transitioning to state s if the agents choose joint action a in state s;

o = {oa,z | a A, z Z} is the observation model. oa,z is an n 1 vector1, where oa,z( s) is the probability of observing z if joint action a is performed and the resulting state is s;

r = {ra | a A} is the reward function; ra is a 1 n reward vector, where ra(s) is the bounded reward obtained by executing joint action a in state s;

b0 is the initial probability distribution over states; and

T is the number of decision steps t {0, 1, . . . , T 1} (the problem horizon).

Remark 1. We often use shorthand notation pa,z(s, s) def= oa,z( s)pa(s, s), for all s, s S , a A, and z Z, combining the transition and observation models. That is, the probability of transitioning to state s after observing z if joint action a is performed and the resulting state is s.

1. The observation vector is not a stochastic vector.

Optimally Solving Finite-Horizon Dec-POMDPs

To make the representation more concrete, we discuss a very simple Dec-POMDP, namely the multi-agent tiger problem (Nair, Tambe, Yokoo, Pynadath, & Marsella, 2003). We will revisit this problem in later sections to clarify the ideas presented in the paper.

Example 1 (Multi-agent tiger problem description). In the multi-agent tiger problem, two agents stand in front of two closed doors. Behind one of the doors there is a hungry tiger, and behind the other door there is valuable treasure. The agents do not know the position of either. By listening, rather than simply opening one of the doors, the agents can gain information about the position of the tiger. But listening has a cost and is not entirely accurate (i.e., it only reveals the correct information about the location of the tiger with some probability). Moreover, agents cannot communicate their observations to each other. At each step, each agent can independently either listen or open one of the doors. If one of the agents opens the door with the treasure behind it (while the other agent listens or also opens the correct door), they both get the reward. If either agent opens the door with the tiger, a large penalty is incurred. However, if they both open the tiger door at the same time, they receive a smaller penalty. The agents must make decisions about listening and opening doors based on the local observations. After a door is opened and the agents receive a reward or penalty, the problem starts over again and the tiger is randomly repositioned. We refer to the state of the multi-agent tiger world when the tiger is on the left as stl (tiger left) and when it is on the right as str (tiger right). The actions for each agent are aol (open left), aor (open right), and al (listen). There are only two possible observations for each agent (even after opening a door): to hear the tiger on the left zhl (hear left) or to hear the tiger on the right zhr (hear right). The reward function is deﬁned as shown on Table 1.

listens opens opens both agents

good door bad door

2 +9 101 opens good door

+9 +20 100 opens bad door

Table 1: Reward function deﬁnition for the multi-agent tiger problem

The transition and observation models can be described in detail as follows. The joint action (al, al) does not change the state of the world. Any other joint action causes a transition to state stl with probability 0.5 and to state str with probability 0.5 essentially resetting the problem. When the world is in state stl, the joint action (al, al) results in observation zhl for either agent with probability 0.85 and observation zhr with probability 0.15; conversely for state str. No matter what state the world is in, the other joint actions result in either observation with probability 0.5.

This example illustrates a small Dec-POMDP. Even in this small Dec-POMDP, coordination is diﬃcult due to uncertainty about the location of the tiger and the actions of the other agent.

2.2 Preliminaries

Given a T-step Dec-POMDP M, we would like agents to act in such a way as to maximize some common measure of long-term return in M. The challenge in Dec-POMDPs is that each agent s strategy typically must take the other agents strategies into account. To this end, we discuss agent

Dibangoye, Amato, Buffet & Charpillet

and team decision rules and policies that allow the agents to act based on local information, while attempting to maximize a joint objective.

2.2.1 Private Decision Rules and Policies

At every time step, each agent chooses an action to be executed based on the actions the agent has previously executed and the observations that it has received. This is called a policy. To better understand this concept, we introduce the notions of private histories and decision rules.

Deﬁnition 2. A step-t private history of agent i I is a length-t sequence of actions and observations, θi t

def= (ai 0, zi 1, . . . , zi t 1, ai t 1, zi t), where ai t and zi t denote actions and observations of agent i I at time step t {0, 1, . . . , T 1}.

Any step-t private history θi t follows the recursion θi t = (θi t 1, ai t 1, zi t) and the initial private history θi 0 is empty. Let Θi t be the set of all step-t private histories of agent i I, namely the step-t private history set. A private policy speciﬁes private decision rules an agent can use at all time steps, one private decision rule for each time step.

Deﬁnition 3. A step-t private decision rule di t : Θi t 7 Ai prescribes agent i I the private action to be executed in each private history θi t Θi t at a speciﬁed time step t {0, 1, . . . , T 1}.

We further denote Di t to be the set of all step-t private decision rules for agent i I at time step t {0, 1, . . . , T 1}, namely the step-t private decision rule set of agent i I. Decision rules may be randomized or deterministic. In Dec-POMDPs, as in MDPs, there always exists a deterministic decision rule that is as good as any randomized decision rule (Oliehoek et al., 2008; Puterman, 1994). For this reason, we focus on deterministic (private) decision rules. Hence, private policies provide each agent with a mapping for action selection for any possible private history.

Deﬁnition 4. A (t + 1)-step private policy πi t:t+t def= (di t, . . . , di t+t ) is a sequence of private decision rules for agent i I from time step t to time step t + t , where t {0, 1, . . . , T 1} and t N.

Example 2 (Multi-agent tiger private decision rule and policy descriptions). Figure 2 shows a pair of 2-step private policies π1 0:1 and π2 0:1 as trees, for agent 1 and 2, respectively.

Agent 1 Agent 2

agent 2 s decision rule d2 0 depends only upon θ2 0

agent 2 s decision rule d2 1 depends only upon θ2 1

Figure 2: A pair of private policies in the form of trees and decomposed as decision rules for two steps of the multi-agent tiger problem.

Optimally Solving Finite-Horizon Dec-POMDPs

For agent 2 for example, step-2 private policy π2 0:1 consists of two private decision rules d2 0 and d2 1, for time steps t = 0 and t = 1, respectively. The step-0 private decision rule d2 0 maps empty private history θ2 0 = to private action al. In addition, the step-1 private decision rule d2 1 maps private histories (aol, zhl) and (aol, zhr) to private action aor in both cases. It is worth noticing that private decision rules only maintain private histories that are reachable from past actions executed and observations received.

So far, we focused on the information each agent has at the execution phase including: private histories, decision rules and policies. Nevertheless, the goal of Dec-POMDP planning is to ﬁnd a separable joint policy. For this reason, we will next discuss joint histories, decision rules and policies.

2.2.2 Separable Joint Decision Rules and Policies

In this section, we extend private information available to a given agent, e.g., private histories, decision rules and policies, to joint information that consists of a collection of private data. Let joint histories θt, separable joint decision rules dt and separable joint policies πt:t be |I|-tuples of private histories (θ1 t , θ2 t , . . . , θ|I| t ), decision rules (d1 t , d2 t , . . . , d|I| t ) and policies (π1 t:t , π2 t:t , . . . , π|I| t:t ), respectively for all time steps t {0, 1, . . . , T 1} and t N. Note that each of these concepts is separable in the sense that it is expressed as |I|-tuple using only private information, one private concept for each agent.

Example 3 (Multi-agent tiger separable joint decision rule and joint policy descriptions). Figure 3 depicts a 2-step separable joint policy π0:1 (π1 0:1, π2 0:1) as a tuple of private policies, one for each agent i {1, 2}. If we group together private decision rules of agents at a given time step, then we have a separable joint decision rule. For example, at the initial time step t = 0, the tuple of private decision rules d0 (d1 0, d2 0) is a separable joint decision rule. In addition, separable joint decision rule d0 prescribes to agents 1 and 2 actions al and aol, respectively.

Agent 1 Agent 2

separable joint decision rule d0 (d1 0, d2 0) depends on θ0 (θ1 0, θ2 0)

separable joint decision rule d1 (d1 1, d2 1) depends on θ1 (θ1 1, θ2 1)

Figure 3: A 2-step separable joint policy for the multi-agent tiger problem.

2.3 Acting Optimally

In this section, we discuss the criterion used throughout the paper to compute an optimal separable joint policy starting at the initial belief-state. Before proceeding any further, we ﬁrst cast Dec POMDPs into an MDP, namely the information-state MDP.

Dibangoye, Amato, Buffet & Charpillet

2.3.1 Information-state Markov Decision Processes

As mentioned above, a common assumption in solving Dec-POMDPs is that planning takes place in a centralized (oﬄine) manner even though agents execute actions in a decentralized fashion (online). In such a planning paradigm, a centralized algorithm maintains, at each step, the total available information about the process to be controlled. This centralized algorithm essentially performs a policy search in the space of separable joint policies. Thus, the separable joint decision rule choices are based only on the exhaustive information available to the centralized algorithm or on statistics derived from that information. This is illustrated in the inﬂuence diagram in Figure 4, where the separable joint decision rule at time step t depends only on previous separable joint decision rules and initial belief state, not on hidden states. The statistics summarizing the exhaustive information available to the centralized algorithm are called information states (Hauskrecht, 2000). As deﬁned below, complete information states represent a trivial case, i.e., the exhaustive data.

ι0 start ιt ιt+1

Figure 4: Inﬂuence diagram for an information-state MDP. Information states (ιt and ιt+1) are represented by cycles. Joint-decision-rule choices (dt and dt+1) are represented by rectangles, and depend only on the current information state, not on the underlying hidden states. Diamonds represent expected immediate rewards r0, . . . , rt and rt+1. Dashed lines represent indirect inﬂuence over time.

Deﬁnition 5. In Dec-POMDPs, a step-t complete information state ιC t

def= (b0, π0:t 1) is a length-t sequence of separable joint decision rules starting with the initial belief state b0, for all time steps t {0, 1, . . . , T 1}. It satisﬁes the recursion: ιC 0 = (b0), and ιC t+1 = (ιC t , dt).

Example 4 (Multi-agent tiger complete information state description). Figure 5 depicts a step-2 complete information state as a sequence of separable joint decision rules ι2 (b0, d0, d1) starting at initial belief state b0. Alternatively, the complete information state consists of a separable joint policy represented as a separable joint policy tree together with the initial distribution b0. It is worth noting that, in a complete information state, each private history of each agent occurs in more than one joint history. It is this interdependence between joint histories that makes Dec-POMDPs signiﬁcantly diﬀerent from centralized problems (e.g., MDPs and POMDPs) since policies must remain decentralized. This interdependence also explains why joint histories, which are suﬃcient for optimally planning in MDPs and POMDPs, are no longer suﬃcient for optimally planning in Dec-POMDPs. Instead, we rely on complete information states.

Optimally Solving Finite-Horizon Dec-POMDPs

separable joint decision rule d0

separable joint decision rule d1

Figure 5: A step-2 complete information state ιC 2 (b0, d0, d1) for the multi-agent tiger problem. Separable joint decision rules are groups of private decision rules depicted in Figure 3.

There is no need to retain the complete information states; instead, one can rely on more compact information states. Recall that information states are quantities summarizing the complete information states. The collection of random variables {ιt : t {0, 1, . . . , T}} taking values in the information state space S deﬁnes an information-state Markov decision process (ι-MDP). In such an MDP, states are information states and actions are separable joint decision rules as illustrated in the inﬂuence diagram in Figure 4. Our ι-MDP is deterministic since the next-step information state ιt+1 is a deterministic function of the previous information state ιt and the joint-decision-rule choice dt i.e., ιt+1 = P(ιt, dt). Furthermore, after taking a separable joint decision rule dt at an information state ιt, the expected reward is R(ιt, dt).

Deﬁnition 6. A ι-MDP ˆM (S, A, P, R, ι0, T) w.r.t. Dec-POMDP M is given by:

S is the information-state set, which deﬁnes the set of all information states ιt, at every time step t {0, 1, . . . , T 1};

A is the set of separable joint decision rules, which deﬁnes the set of all separable joint decision rules dt, at every time step t {0, 1, . . . , T 1};

P speciﬁes the next-step information state ιt+1 = P(ιt, dt) after taking separable joint decision rule dt at information state ιt, at every time step t {0, 1, . . . , T 1};

R speciﬁes the immediate expected reward R(ιt, dt) = P s,θ Pr(s, θ|ιt) rdt(θ)(s) to be gained by executing a separable joint decision rule dt at information state ιt, at every time step t {0, 1, . . . , T 1};

ι0 is the initial information state; and T is the problem s temporal horizon.

Dibangoye, Amato, Buffet & Charpillet

The information-state MDP ˆM diﬀers from the original Dec-POMDP M because the state space is implicit. That is, the information-state space S is much too large to be generated and stored in memory. Instead, information states are generated as they are explored during the state space search, and typically discarded thereafter. Generating the separable joint decision rules and their corresponding expected rewards are also sources of the complexity for current methods as discussed later. All these methods build upon the assumption that one can always convert the original Dec-POMDP into an information-state MDP by using complete information states without losing optimality (Szer et al., 2005; Oliehoek et al., 2008, 2013). Below, we provide a formal proof of this property for the sake of completeness.

Lemma 1. Any optimal separable joint policy π ˆM for complete-information-state MDP ˆM is an optimal separable joint policy π M for the original Dec-POMDP M.

Proof. In demonstrating the proof, we need to show that optimal joint policies π ˆM and π M are separable joint policies with the same expected value. Throughout the proof, we will use notations AT to denote T-steps separable joint policies and Pb0,π( ) to denote a joint probability distribution with parameter b0 and π. The proof starts with the deﬁnition of an optimal separable joint policy for information-state MDP ˆM:

def= arg maxπ AT

t=0 R(ιC t , dt).

Next, we replace R(ιC t , dt) by the immediate expectation of rewards received after taking joint (separable) decision rule dt at complete information state ιC t :

π ˆM = arg maxπ AT

t=0 E(st,θt) Pb0,π0:t ( ) n rdt(θt)(st) o , (Def. of R(ιC t , dt)).

The following holds because the sum of expectations is equal to expectation of sums:

π ˆM = arg maxπ AT E(s0,a0,...,s T 1,a T 1) Pb0,π( )

t=0 rat(st)

Hence, it is suﬃcient to search for an optimal separable joint policy using ˆM to ﬁnd an optimal separable joint policy for M (and vice versa).

This lemma allows us to interchangeably use either complete-information-state MDPs ˆM or the original Dec-POMDP counterpart M with no loss in optimality.

2.3.2 Optimality Criterion

In this paper, we consider the ﬁnite-horizon Dec-POMDP (and therefore the ﬁnite-horizon ι-MDP ˆM), where the optimality criterion is to ﬁnd a separable joint policy that maximizes the expected sum of rewards over the planning horizon starting at a given belief state. To ﬁnd an optimal separable joint policy, we ﬁrst characterize the expected value to be gained from executing any arbitrary separable joint policy πt:T 1 starting from any arbitrary step-t information state. This characterization represents the Dec-POMDP value function using the ι-MDP notation.

Optimally Solving Finite-Horizon Dec-POMDPs

Deﬁnition 7. Let πt:T 1 be a separable joint policy with respect to ˆM. The value function V ˆM,πt:T 1 denotes the expected cumulative reward obtained if the team of agents executes πt:T 1 from time step t onward. For any arbitrary information state ιt, V ˆM,πt:T 1(ιt) def= PT t 1 k=0 R(ιt+k, dt+k), where ιt+k+1 = (ιt, dt, . . . , dt+k), t {0, 1, . . . , T 1} and k {0, 1, . . . , T t 1}.

Example 5 (Multi-agent tiger expected values given a separable joint policy and an information state). Figure 6 depicts a mapping from step-1 private histories to private policies π1 1:T 1 and π2 1:T 1 for agents 1 and 2, respectively. These private histories result from agents taking one action and receiving an observation, i.e., agent 1 took action al and received either observation zhl or zhr. The mapping ensures decentralized control since private histories map to private policies. For example, agent 1 s private history (al, zhl) maps to private policy x. However, the expected value of one agent s private policy depends on the other agents private policies. For this reason, we rely on joint histories induced by information state ι1 as illustrated in Figure 7. Figure 7 depicts a mapping from joint histories to future separable joint policies. Each joint history is a pair of private histories from Figure 6, one for each agent. For example, joint history {(al, aol), (zhl, zhl)} maps to future separable joint policy (x, α) as a separable joint policy tree. The contribution of separable joint policy (x, α) to the expected value depends on the probability of joint history {(al, aol), (zhl, zhl)} and the initial belief.

private histories generated by d1 0 and d2 0

future separable joint policy π1:T 1 = (π1 1:T 1, π2 1:T 1)

Figure 6: Mappings from private histories to future private policies for each agent.

Value function V ˆM,πt:T 1 satisﬁes the following recursion:V ˆM,πt:T 1(ιt) = R(ιt, dt)+V ˆM,πt+1:T 1(P(ιt, dt)) where V ˆM,πt+1:T 1(P(ιt, dt)) describes the future value of executing separable joint policy πt+1:T 1 from time step t + 1 onward starting at information state ιt+1 = P(ιt, dt).

2.3.3 Bellman s Optimality Equations

The standard deﬁnitions of optimality equations in a T-step ι-MDP follow. We ﬁrst describe the optimal value at a given information state as the highest value of any separable joint policy for that information state. Let Πt:T 1 be the set of all separable joint policies with respect to ˆM. For all t {0, 1, . . . , T 1}, the optimal step-t value function V ˆM,t(ιt) at information state ιt is V ˆM,t(ιt) def=

Dibangoye, Amato, Buffet & Charpillet

x α α x x α α x

joint histories induced by ι1 = (b0, d0)

future separable joint policy π1:T 1

Figure 7: Mappings from joint histories to future separable joint policies.

Figure 8: The information-state search tree where search nodes are information states and arcs of the search tree are labeled by separable joint decision rules.

Optimally Solving Finite-Horizon Dec-POMDPs

maxπ Πt:T 1 V ˆM,π(ιt). The optimality equations (or Bellman s optimality equations, see Bellman, 1957; Puterman, 1994) are:

V ˆM,t(ιt) def= max dt At

R(ιt, dt) + V ˆM,t+1(P(ιt, dt)) , ιt St, t {0, 1, . . . , T 1}, (1)

with an added boundary condition V ˆM,T( ) def= 0 for time step t = T. Note that the optimal step-t value function is written in terms of the optimal step-(t + 1) value function. This recursion implies an eﬃcient procedure for computing step-t value functions which we will discuss below. Moreover, an optimal separable joint policy can be directly extracted from the optimal value functions. Suppose (V ˆM,t)t {0,1,...,T 1} are solutions of the optimality equations (1) subject to the boundary condition, then it is clear that an optimal separable joint policy π 0:T 1 = (d t )t {0,1,...,T 1} satisﬁes:

d t arg maxdt At

R(ιt, dt) + V ˆM,t+1(P(ιt, dt)) , t {0, 1, . . . , T 1}. (2)

This property implies that an optimal separable joint policy is found by ﬁrst solving the optimality equations, and then for each time step choosing a separable joint decision rule that attains the maximum of the right hand side of (2) for t {0, 1, . . . , T 1}.

2.4 Optimally Solving Dec-POMDPs

We provide an overview of dynamic programming and heuristic search principles exact methods build upon. For a thorough introduction to solution methods in Dec-POMDPs, the reader can refer to surveys (e.g., Oliehoek, 2012; Amato, Chowdhary, Geramifard, Ure, & Kochenderfer, 2013). Notice, however, that most dynamic programming methods do not explicitly consider information states (instead considering value functions over underlying system states). They construct separable joint policies from the last step in the horizon to the ﬁrst by evaluating the possible separable joint policies at each step and pruning those that have provably lower value over the full state space (Hansen et al., 2004; Boularias & Chaib-draa, 2008; Amato et al., 2009). Heuristic search techniques implicitly use complete information states in developing Dec-POMDP solution methods (Szer et al., 2005; Oliehoek et al., 2008; Oliehoek, Whiteson, & Spaan, 2009; Spaan, Oliehoek, & Amato, 2011), but do not explicitly use the ιC-MDP representation.

2.4.1 Dynamic Programming Methods

One class of Dec-POMDP solution methods is based on dynamic programming (Howard, 1960). Here, a set of T-step separable joint policies is generated from the bottom up (Hansen et al., 2004). At each step, all step-t separable joint policies are generated that build oﬀseparable joint policies from step t+1. Any separable joint policy that has lower value than some other separable joint policy for all states and possible separable joint policies of the other agents is then pruned (with linear programming). These generation and pruning steps continue until the desired horizon is reached and a separable joint policy with the highest value at the initial state is chosen. Given that the number of separable joint policies grows doubly exponentially every generation step, the importance of pruning away unnecessary separable joint policies is crucial. More eﬃcient dynamic programming methods have been developed, reducing the number of separable joint policies generated (Amato et al., 2009) or compressing separable joint policy representations (Boularias & Chaib-draa, 2008).

Dibangoye, Amato, Buffet & Charpillet

2.4.2 Heuristic Search Methods

Another class of Dec-POMDP solution methods is based on heuristic search techniques. Unlike dynamic programming methods, heuristic search algorithms can take advantage of the initial complete information state. Separable joint policies can be built from the top down using centralized heuristic search methods over the search tree shown on Figure 8 (Szer et al., 2005). In this case, a search node is a complete information state at a given horizon, t. These complete information states can be evaluated up to that horizon and then a heuristic value can be added. The resulting heuristic values are over-estimates of the true value, allowing an A*-style search through the space of possible complete information states, expanding promising search nodes to horizon t + 1 from horizon t. While in principle, A*-style search methods can ﬁnd an optimal separable joint policy, in practice the doubly exponential growth of the search tree makes it diﬃcult. Recent work has included clustering probabilistically equivalent complete information states (Boularias & Chaib-draa, 2008) and histories (Oliehoek et al., 2009), and incrementally expanding nodes in the search tree (Spaan et al., 2011; Oliehoek et al., 2013), greatly improving scalability of the original algorithms.

2.4.3 Limitations of Current Methods

While current methods attempt to reduce the number of separable joint policies or information states considered, they rely on explicit representations that consider all possible joint histories (even though many of them may be unreachable). Moreover, existing techniques fail to generalize value functions from one information state to other information states, which slows down the convergence to an optimal joint policy. Finally, even though most solution methods use an oﬄine centralized planning phase, no concise representation of the information state has been identiﬁed (until now) that allows for greater scalability. Simultaneous to this work one exception developed concise representations based on observation histories, but did not show the value function over the resulting MDP was piecewise linear and convex (Oliehoek, 2013). We are able to show this piecewise linear and convex property and develop a novel algorithm to exploit the resulting structure. To do so, we draw inspiration from advances in MDP and POMDP algorithms as discussed below. Signiﬁcant progress has been made in solving large MDPs and POMDPs. One reason for progress in MDPs has been the use of approximate dynamic programming and function approximation (Tsitsiklis & van Roy, 1996; De Farias & Van Roy, 2003; Powell, 2007) to represent the state of the system and value functions more concisely. For POMDPs, eﬃcient algorithms have been developed by recasting problems as belief MDPs that utilize probability distributions over states of the system, namely belief states (Smallwood & Sondik, 1973). This belief MDP is a continuousstate MDP with a piecewise linear and convex value function, allowing algorithms to scale to large problems while sometimes retaining performance bounds (Smith & Simmons, 2004; Shani et al., 2013). We will take advantage of such advances by recasting a Dec-POMDP as a continuous-state MDP with a piecewise linear and convex optimal value function. The resulting formulation opens the door for direct application of POMDP methods and opens research directions on utilizing Dec POMDP structure in centralized planning representations. We discuss this formulation and some progress in using this structure in the remaining sections.

Optimally Solving Finite-Horizon Dec-POMDPs

3. Solving Dec-POMDPs as Continuous-State MDPs

The contribution of this section is threefold. Section 3.1 introduces a statistic (i.e., the occupancy state) which summarizes the information state. Section 3.1.4 demonstrates occupancy states are suﬃcient for optimally solving Dec-POMDPs, i.e., occupancy states are suﬃcient statistics. Occupancy states further allow transforming information-state MDPs into occupancy-state MDPs. Section 3.1.6 establishes a fundamental property of the resulting MDP, namely the piecewise linearity and convexity of the optimal value function over the occupancy states. Remember that these methods assume centralized oﬄine planning and actions as separable joint decision rules to ensure decentralized execution. These contributions enable the application of a vast collection of MDP and POMDP solution methods to Dec-POMDP problems. Finally, Section 3.2 introduces the occupancy-based heuristic search value iteration (OHSVI) algorithm for solving occupancy-state MDPs that builds upon the HSVI algorithm for POMDPs (Smith, 2007).

3.1 Summarizing Complete Information States

Before providing a formal deﬁnition of occupancy states, we start with a brief motivation. Then, we demonstrate that the occupancy state induces a deterministic process that is Markov, namely the occupancy-state Markov decision process. Finally, we prove that the occupancy state is a suﬃcient statistic for optimal decision-making in Dec-POMDPs.

3.1.1 Occupancy State

As discussed in Section 2.4.2, standard heuristic search methods for solving Dec-POMDPs rely on complete information states (Szer et al., 2005; Oliehoek et al., 2008, 2009; Spaan et al., 2011). While complete information states preserve the ability to ﬁnd an optimal separable joint policy (see Lemma 1), heuristic search methods using them can only solve small toy problems. One reason for this poor behavior is that complete information states result in redundant and useless computations. In particular, every time they estimate the immediate rewards R(ιC, d) the entire multivariate probability distribution Pr(s, θ|ιC) needs to be computed over all states and joint histories (see Deﬁnition 6). This operation is time-consuming because it involves exponentially many joint histories, including unreachable ones. Since this operation occurs at every time step, it is important to reduce the time required. To this end, we introduce a statistic called the occupancy state that we can maintain in place of the complete information state.

Deﬁnition 8. The step-t occupancy state, denoted ξt, is deﬁned as the posterior probability distribution of state st and joint history θt given complete information state ιC t , i.e., ξt(st, θt) def= Pr(st, θt|ιC t ), t {0, 1, . . . , T}. We denote t the step-t occupancy simplex, that is, the set of all possible step-t occupancy states.

Example 6 (Multi-agent tiger from complete information states to occupancy states). Figure 9 depicts a complete information state (left-hand side) and a corresponding occupancy state (righthand side) over joint histories and states of the system. We illustrate an occupancy state as a tree, where branches are joint histories of the complete information state and leaves are state-probability pairs. Note that the same initial belief is assumed in both state types.

The occupancy state represents a predictive model of the state that the system may end up in and joint history the agents may experience given a complete information state. As such, in occupancy

Dibangoye, Amato, Buffet & Charpillet

Complete Information State ιC 1 Corresponding Occupancy State ξ1

Figure 9: A step-1 occupancy state ξ1 that corresponds to complete information state ι1.

states, we need just to maintain state and joint history pairs that are reachable given the complete information state.

3.1.2 Markov Property

This section proves that occupancy states induce a process that is Markov. In other words, the future occupancy states of the process depend only upon the present occupancy state and the next-step separable joint decision rule.

Theorem 1. Occupancy state ξt+1 depends on current occupancy state ξt and separable joint decision rule dt, i.e., for any arbitrary s S , at A, zt+1 Z and θt Θt,

ξt+1( s, (θt, at, zt+1)) = 1{at}(dt(θt)) X

s S ξt(s, θt) pat,zt+1(s, s), (3)

where 1{ }( ) is an indicator function which returns 1 when the actions at are chosen by dt, and returns 0 otherwise.

Proof. In demonstrating this theorem we also derive a procedure for updating the occupancy states. Let ιt be our step-t information state, which we will decompose as ιt = (ιt 1, dt 1) i.e., the information state ιt 1 prior to time-step t plus the known separable joint decision rule dt 1. By Deﬁnition 8, we can relate the occupancy state and the information state as follows: for any arbitrary state st and joint history θt,

ξt(st, θt) def= Pr(st, θt|ιt). (4)

The substitution of ιt = (ιt 1, dt) into (4) yields

ξt(st, θt) = Pr(st, θt|ιt 1, dt 1). (5)

The expansion of the right-hand side of (5) over all states of the system at the end of time-step t 1 produces

ξt(st, θt) = X

st 1 S Pr(st 1, st, θt|ιt 1, dt 1). (6)

Optimally Solving Finite-Horizon Dec-POMDPs

The expansion of the joint probability in (6) as the product of conditional probabilities results in

ξt(st, θt) = X

st 1 S Pr(at 1|θt 1, dt 1) Pr(st, zt|st 1, θt 1, ιt 1, dt 1) Pr(st 1, θt 1|ιt 1, dt 1). (7)

The ﬁrst factor denotes the joint action at 1 that separable joint decision rule dt 1 prescribes at θt 1. Since we assume that separable joint decision rules are deterministic, Pr(at 1|θt 1, dt 1) {0, 1}. In fact, Pr(at 1|θt 1, dt 1) = 1 if dt 1(θt 1) = at 1, otherwise Pr(at 1|θt 1, dt 1) = 0. So, Pr(at 1|θt 1, dt 1) = 1{at 1}(dt 1(θt 1)), where 1F is an indicator function. The second factor on the right-hand side of (7) is the transition probability

ξt(st, θt) = 1{at 1}(dt 1(θt 1)) X

st 1 S pat 1,zt(st 1, st) Pr(st 1, θt 1|ιt 1, dt 1). (8)

The last factor deﬁnes the prior occupancy state ξt 1 at state st 1 and joint history θt 1, which does not depend on the current separable joint decision rule dt 1. Overall (6) becomes

ξt(st, θt) = 1{at 1}(dt 1(θt 1)) X

st 1 S pat 1,zt(st 1, st) ξt 1(st 1, θt 1).

Therefore, the calculation of the occupancy state after time-step t requires only the occupancy state of the previous time-step t 1 and the current separable joint decision rule.

Equation (3) describes the transitions of a continuous-state MDP in which states are occupancy states and actions are separable joint decision rules. For this process, the transitions are deterministic but the state space is continuous. Next, we formally deﬁne the process occupancy states induce.

3.1.3 Occupancy-State Markov Decision Processes

We consider the MDP described by the occupancy states; we call it an occupancy-state Markov decision process.

Deﬁnition 9. Let ˇM ( , A, R, P, b0, T) be the occupancy-state Markov decision process with respect to Dec-POMDP M, where:

= t {0,1,...,T} t is the occupancy simplex, where ξ0 = b0 is the initial occupancy state;

A = t {0,1,...,T} At is the separable joint decision rule set;

R: A 7 R is a reward function: the reward at (ξt, dt) is R(ξt, dt) def= P s,θ ξt(s, θ) rdt(θ)(s);

P: A 7 is a transition function: next occupancy state ξt+1 def= P(ξt, dt) as described in Equation (3) given (ξt, dt);

b0 is the initial belief state; and

T denotes the planning horizon.

Dibangoye, Amato, Buffet & Charpillet

Here, the states of the system represent the centralized knowledge of the planner while the actions represent separable joint decision rules to ensure decentralized execution. The (occupancy) state can be updated by using the known transition and observation functions of the Dec-POMDP given the current (occupancy) state and the chosen separable joint decision rule. The rewards are also calculated (as an expectation) using the known reward model of the Dec-POMDP. The optimal value functions of ˇM are solutions of the optimality equations:

V ˇM,t(ξt) def= max dt A

R(ξt, dt) + V ˇM,t+1(P(ξt, dt)) , t {0, 1, , T 1}, (9)

with an added boundary condition V ˇM,T( ) def= 0 for t = T. Notice that once a solution (V ˇM,t)t {0,1,...,T 1} of the optimality equations Eq. (9) has been found, one can always retrieve an optimal separable joint policy starting at the initial occupancy state. This is achieved by iteratively retrieving optimal separable joint decision rules for decision steps t {0, 1, . . . , T 1}. At each decision step t, the procedure selects the current occupancy state ξt (starting with initial occupancy state ξ0). It uses the max operator to retrieve an optimal separable joint decision rule d t at current occupancy state ξt:

def= arg maxdt A R(ξt, dt) + V ˇM,t+1(P(ξt, dt)) . (10)

Thereafter, it moves to the next occupancy state ξt+1 = P(ξt, d t ) and makes it the current one. The procedure then repeats until ﬁnal decision epoch T has been reached. The sequence of optimal separable joint decision rules (d 0, d 1, . . . , d T 1) deﬁnes an optimal separable joint policy π for an occupancy-state MDP ˇM. Furthermore, Theorem 2 proves that an optimal separable joint policy for an occupancy-state MDP ˇM is an optimal separable joint policy for the original Dec-POMDP M. Optimally solving a continuous-state MDP, such as the occupancy-state MDP, is a nontrivial task. In general, there is no exact solution method for solving general continuous-state MDPs. Methods often rely on structural assumptions about the shape of the optimal value function (Tsitsiklis & van Roy, 1996; De Farias & Van Roy, 2003; Powell, 2007). We next demonstrate that a useful structure does indeed exist for occupancy MDPs in the form of optimal value functions that are piecewise linear and convex functions over the occupancy states.

3.1.4 Sufficiency of Occupancy States

We ﬁrst show that the occupancy state is a suﬃcient statistic for optimal decision-making in Dec POMDPs. Throughout the remainder of this paper, we call a statistic a suﬃcient statistic when the statistic of the information state is suﬃcient for optimal decision making in occupancy-state MDPs.

Theorem 2. Occupancy state ξt = Pr(s, θt|ιC t )s S,θt Θt is a suﬃcient statistic of complete information state ιC t , i.e., it is suﬃcient for optimally solving occupancy-state MDPs. Furthermore, an optimal joint policy for the occupancy-state MDP ˇM together with the correct estimation of the occupancy states, is also optimal for information-state MDP ˆM (respectively Dec-POMDP M).

Proof. In demonstrating the suﬃciency of the occupancy state with respect to its corresponding information state, we need to demonstrate that (a) the optimal value function at an occupancy state is identical to that of its corresponding information state and (b) the future occupancy states depend only upon the current occupancy states (and next-step separable joint decision rule). We proved (b) in Theorem 1, so it only remains to prove statement (a). We show this by induction.

Optimally Solving Finite-Horizon Dec-POMDPs

The suﬃciency of the occupancy state with respect to its corresponding information state trivially holds at the last time step of the problem. In fact, V ˆM,T(ιC T) = V ˇM,T(ξT) = 0 for any arbitrary

complete information state ιC T and its corresponding occupancy state ξT (since the horizon has been reached). If we assume that statement (a) holds for time-step t+1, we can now show it holds for time-step t. For any arbitrary step-t information state, Bellman s optimality criterion prescribes the following:

V ˆM,t(ιC t ) = max dt A R(ιt, dt) + V ˆM,t+1(ιC t+1), (11)

where ιC t+1 = (ιC t , dt). By the induction hypothesis, we have V ˆM,t+1(ιC t+1) = V ˇM,t+1(ξt+1), if ξt+1 corresponds to the occupancy state associated to ιC t+1. Hence, Equation (11) becomes

V ˆM,t(ιC t ) = max dt A R(ιC t , dt) + V ˇM,t+1(ξt+1). (12)

Moreover, given that R(ιC t , dt) = P s,θ rdt(θ)(s) Pr(s, θ|ιC t ) = R(ξt, dt), the following expression holds since ξt = (Pr(s, θ|ιC t ))s,θ:

V ˆM,t(ιC t ) = max dt A R(ξt, dt) + V ˇM,t+1(ξt+1), (13)

which ends the proof of statement (a) at time-step t, since V ˇM,t(ξt) = maxdt A R(ξt, dt)+V ˇM,t+1(ξt+1).

As a consequence, statement (a) holds for any arbitrary complete information state ιC t St and at any arbitrary time-step t {0, 1, . . . , T 1}. Combining statements (a) and (b), we are guaranteed to ﬁnd the optimal value function for ˆM by using occupancy states instead of information states. As such, an optimal joint policy for the occupancy-state MDP ˇM, together with the correct estimation of the occupancy states, is also optimal for information-state MDP ˆM (or the original Dec POMDP M). Given the optimal value function of ˇM, an optimal joint policy π = (d t )t {0,1,...,T 1} is given by:

d t = arg maxdt A R(ξt, dt) + V ˇM,t+1(ξt+1), (14)

for any arbitrary information state ιt, and all t {0, 1, . . . , T 1}, where ξt and ξt+1 correspond to ιC t and ιC t+1 = (ιC t , dt), respectively.

This theorem demonstrates that by optimally solving any of M, ˆM or ˇM, we are guaranteed to ﬁnd an optimal separable joint policy to the others.

3.1.5 Belief States versus Occupancy States

Note that there is a similarity between the occupancy state in Dec-POMDPs and the belief state in POMDPs. Formally, a step-t belief state bt = (P(s|θt, b0))s S is a probability distribution over states conditioned on step-t history θt. It is also a suﬃcient statistic for the total data available to the centralized agent (i.e., the action-observation history) an algorithm can rely on to ﬁnd an optimal solution in POMDPs. Similarly, an occupancy state is a suﬃcient statistic for the total data available to a centralized planner (i.e., the history of separable joint decision rules) an algorithm can rely on to ﬁnd an optimal separable joint policy in Dec-POMDPs. However, the occupancy state remains fundamentally diﬀerent from the belief state. First, the belief state is not suﬃcient for

Dibangoye, Amato, Buffet & Charpillet

optimal decision making in Dec-POMDPs (because it is not geared to ensure the separability of the joint policy). Second, a belief state deﬁnes a time-invariant statistic, i.e., the dimension of belief states is bounded by the number of states. In contrast, the dimension of the occupancy states grows exponentially with the horizon. Also, unlike the belief state, the occupancy state is only a plan time suﬃcient statistic which is not used during execution time. Instead, agents still condition their actions on local action-observation histories in Dec-POMDPs. These diﬀerences make algorithmic and theoretic transfers from belief-state MDPs to occupancy-state MDPs nontrivial.

3.1.6 Piecewise-Linearity and Convexity Property

We now present one of the main results of this paper the piecewise-linearity and convexity of the optimal value function of the occupancy-state MDP. For this discussion, we use vector (resp. matrix) representation for operators R( , dt) and P( , dt). Vector rdt = (rdt(θ)(s))s,θ denotes the immediate reward of executing dt starting from any state and joint history (i.e., R(ξt, dt) is the inner product of ξt and rdt for any occupancy state ξt t). Moreover, operator P( , dt) transforms any step-t occupancy state to a step-(t + 1) occupancy state, that is, P( , dt) describes a transition matrix pdt such that P(ξt, dt) = ξtpdt for every step-t occupancy state ξt t. With these linear transformations as a background, the following holds.

Theorem 3. The optimal value functions (V ˇM,t)t {0,1, ,T} (solutions to Equations in (9)) are piecewiselinear and convex functions of the occupancy states. Hence, for all t {0, 1, , T 1}, there exists a ﬁnite set of length-n|Θt| vector values Λt such that, for any arbitrary occupancy state ξt t, we have:

V ˇM,t(ξt) = max βt Λt ξt, βt , (15)

where ξt, βt denotes the inner product P s P θ ξt(s, θ) βt(s, θ).

Proof. We show that (15) holds by induction. Since V ˇM,T(ξT) = 0 for all ξT (since the horizon has been reached), we have that V ˇM,T(ξT) = maxβT ΛT ξT, βT , where βT( ) = 0 and ΛT = {βT}. Hence, the property holds for k = T. Assume that the property holds for k t+1, that is, V ˇM,k(ξk) = maxβk Λk ξk, βk for k t + 1. Now we want to prove the property for k = t i.e.,

V ˇM,t(ξt) = max dt A

R(ξt, dt) + V ˇM,t+1(P(ξt, dt)) , ξt t.

Using linear transformations rdt and pdt, the following holds:

V ˇM,t(ξt) def= max dt A

R(ξt, dt) + V ˇM,t+1(P(ξt, dt)) ,

ξt, rdt + max βt+1 Λt+1 P(ξt, dt), βt+1 ! , (Inductive Hypothesis)

= max dt A max βt+1 Λt+1

ξt, rdt + D ξt pdt, βt+1 E , (Rearranging Terms)

= max dt A max βt+1 Λt+1

ξt, rdt + βt+1 (pdt) .

Finally, if we let Λt be the set of all length-n|Θt| vectors βt def= rdt + βt+1 (pdt) for all separable joint decision rules dt A and all vectors βt+1 Λt+1, then V ˇM,t(ξt) = maxβt Λt ξt, βt . As a consequence, the proof holds for every time step t {0, 1, . . . , T 1}.

Optimally Solving Finite-Horizon Dec-POMDPs

We demonstrated that the information states and value functions can be represented in a vector space, the occupancy-state space, without loosing optimality. Next, we provide an approach for extending MDP and POMDP solution methods to occupancy-state Markov decision processes.

3.2 Heuristic Search Solution Methods

This section presents the occupancy-based heuristic search value iteration (OHSVI) algorithm for solving occupancy-state MDPs. This algorithm extends the heuristic search value iteration (HSVI) algorithm for POMDPs (Smith & Simmons, 2004) as well as other heuristic search algorithms such as A* (Hart, Nilsson, & Raphael, 1968) and LRTA* (Korf, 1990).

3.2.1 Heuristic Search Value Iteration for Occupancy-State MDPs

HSVI is a state-of-the-art algorithm for solving POMDPs (Smith & Simmons, 2004). It produces solutions by maintaining two-sided bounds on the optimal value function and updating them over a number of sample trajectories. The upper bounds, (U ˇM,t)t {0,1, ,T}, are represented as (belief) state-value mappings, and the lower bounds, (L ˇM,t)t {0,1, ,T}, are represented by vector sets. Each trajectory begins at the initial belief state and continues until the time horizon is reached. Once a trajectory is ﬁnished, the upper and lower bounds are updated at each belief state, in the reverse order of visit. The trajectories can also be interrupted once they have reached a belief state where the upper and lower bounds are equal, since there is no reason to expand a belief state whose optimal value is provably known. Finally, it is often useful to prune lower and upper bounds to maintain concise representations, by removing either dominated vectors or points, respectively (Pineau, Gordon, & Thrun, 2006; Smith, 2007).

Algorithm 1: The OHSVI algorithm

function OHSVI((L ˇM,t)t {0,1, ,T}, (U ˇM,t)t {0,1, ,T})

while Stop(ξ0) do Explore (ξ0)

function Stop(ξt)

return U ˇM,t(ξt) = L ˇM,t(ξt)

function Explore (ξt)

if Stop(ξt) then

d t arg maxdt R(ξt, dt) + U ˇM,t+1(P(ξt, dt)) Explore(P(ξt, d t )) update U ˇM,t and L ˇM,t at ξt

OHSVI (outlined in Algorithm 1) operates in a similar manner as described above, but remains fundamentally diﬀerent from HSVI. HSVI generates trajectories of belief states while OHSVI generates trajectories of occupancy states. Also, HSVI generates trajectories by (i) picking a greedy action with respect to the upper bound (optimistic exploration), and (ii) performing the transition corresponding to the largest gap in error. Due to the deterministic nature of occupancy-state MDPs, OHSVI does not need HSVI s gap-based heuristic to guide the state exploration (Smith, 2007). Instead, OHSVI always executes a greedy separable joint decision rule with respect to the upper bounds, and then selects the next occupancy state based on this greedy separable joint decision rule.

Dibangoye, Amato, Buffet & Charpillet

OHSVI can be also thought of as an extension of learning real-time A* (LRTA*) (Korf, 1990) that takes advantage of the piecewise-linearity and convexity of the optimal value function.

(a) lower bound L ˇM,t

V ˇM,t(ξt) ξt, βk t

β2 t β1 t β0 t

(b) upper bound U ˇM,t

(ξ1 t , v1) (ξ2 t , v2)

ξt = 0.2 ξ1 t + 0.8 ξ2 t

V ˇM,t(ξt) 0.2 v1 t + 0.8 v2 t

Figure 10: (a) Lower bounds are represented using sets of vectors, where vectors are dashed lines, solid lines represent the upper surface of these vectors (lower-bound value function), and the circle is the projection of the target occupancy onto the lower-bound value function. (b) Upper bounds are represented using occupancy-value mappings, where dashed lines denotes the convex hull formed by these points.

OHSVI relies on standard approaches to represent lower and upper bounds for piece-wise linear and convex value functions: vector sets and occupancy-value mappings, which we detail in the next section.

3.2.2 Vector Sets : Lower Bounds

As in HSVI or other algorithms (and depicted in Figure 10(a)), the lower bound L ˇM,t can be represented as a ﬁnite collection Λt of n|Θt|-dimensional vectors, for every time-step t {0, 1, . . . , T} (Smith, 2007; Kaelbling et al., 1998; Hauskrecht, 2000; Smallwood & Sondik, 1973). Lower bounds can be iteratively updated using point-based backup steps as follows: ξt t,

Λ t = Λt [ {backup(Λt+1, ξt)} , where (16)

backup(Λt+1, ξt) = arg maxgξt dt : dt A ξt, gξt dt , (17)

gξt dt = rdt + arg maxgβt+1 dt : βt+1 Λt+1 ξt, gβt+1 dt , (sum vectors in Rn|Θt|) (18)

gβt+1 dt = βt+1 (pdt) , (projection Rn|Θt+1| 7 Rn|Θt|) (19)

Λt being the vector set prior to backup and Λ t the vector set after the backup. Lower bounds (L ˇM,t)t {0,1,...,T 1} initializes (Λt)t {0,1,...,T 1} with a single vector βt( ) = mins S,a A (T t) ra(s), for all t {0, 1, . . . , T}. Notice that the vector representation is suitable only for lower bounds.

3.2.3 Occupancy-Value Mappings : Upper Bounds

Upper bounds (U ˇM,t)t {0,1,...,T 1} can be represented using mappings from occupancy states to reals, see e.g., Figure 10(b). The upper bound is then the convex hull of the current point set. It is

Optimally Solving Finite-Horizon Dec-POMDPs

possible to interpolate the value for occupancy states whose mapping is not currently maintained or is outdated. This can be achieved using linear approximation methods (e.g., Hauskrecht, 2000; Smith, 2007). Of this family, the sawtooth linear interpolation maps every occupancy state ξ t and point set Ψt to upper-bound value

U ˇM,t(ξ ) = min {v ξ , vξ ξ | (ξ 7 vξ) Ψt}, where (20)

vξ ξ = v ξ + (vξ v ξ) D(ξ, ξ ), and (21)

s,θ ξ(s, θ) vs, . (22)

We refer to D(ξ, ξ ) = mins,θ: ξ(s,θ)>0 ξ (s, θ)/ξ(s, θ) as the sawtooth measure. To update the upper bound U ˇM,t at a speciﬁc occupancy state ξ using sawtooth, we need to compute a new value for ξ, and add it to occupancy-value mapping U ˇM,t, as follows:

Ψ t = Ψt [ n (ξ 7 vξ) o , (23)

vξ = max d At R(ξ, d) + U ˇM,t+1(P(ξ, d)), (24)

where Ψt is the point set prior update and Ψ t is the point set after the update. The upper bound U ˇM,t initializes Ψt = {ξs, 7 vs, t | s S } using the optimal value of the underlying MDP for corner points, for every t {0, 1, . . . , T}. Clearly, one can eventually ﬁnd an optimal solution to the occupancy-state MDP using OHSVI with the full lower and upper bounds. However, it quickly becomes intractable to maintain these bounds in the full occupancy-state space since the number of state and joint-history pairs grows as the horizon increases. This highlights the necessity for compact representations of occupancy states, decision rules and vector values.

4. Solving Dec-POMDPs as Lossless Compact Occupancy-State MDPs

The previous sections show that every Dec-POMDP can be represented as an occupancy-state MDP without losing optimality. The diﬃculty with using this representation is that common algorithms for solving such MDPs with piecewise linear convex value functions quickly run out of time and/or memory since state and action spaces become intractably large for most real-world problems. This is not surprising given the NEXP-Complete worst-case complexity of general Dec-POMDPs, but realistic Dec-POMDP applications often have signiﬁcant structure. In this section, we discuss optimally solving occupancy-state MDPs while potentially reducing the dimensionality of the occupancy states, decision rules and value functions. In Subsection 4.1, we reduce the dimensionality of occupancy states and decision rules by constructing clusters of equivalent histories. Next, we deﬁne compact representations for occupancy states and decision rules based upon clusters of histories rather than single histories. While the resulting compact MDP may have exponentially fewer states and actions than the original model, the optimal value function of the compact model may no longer be piecewise linear and convex. In Subsection 4.2, we overcome this limitation, allowing values from one compact occupancy state to generalize to another one using parametric value functions. Finally, Subsection 4.3 presents the feature-based heuristic search value iteration algorithm and theoretical guarantees.

Dibangoye, Amato, Buffet & Charpillet

4.1 Lossless Compact Occupancy-State MDPs

The dimension of occupancy states and decision rules typically grows exponentially with the horizon. Because of this, it is often impractical to compute and store every component of occupancy state and decision rule representations. We overcome this limitation by using compact representations of occupancy states and decision rules based on notions of equivalence between histories. These notions of equivalence are fundamental to designing and analyzing algorithms for reducing the dimensionality of the occupancy-state MDP, thereby improving scalability. Equivalence relations permit us to aggregate histories that convey the same information about the process. We target equivalence relations that, upon replacing each group of aggregated histories by one element of the group, allow us to produce compact representations for all occupancy states and decision rules while still preserving the ability to ﬁnd an optimal solution.

4.1.1 Probabilistic Equivalence for History Space Aggregation

We ﬁrst present the equivalence notions that we build upon before deﬁning compact representations and proving that they preserve optimality. The deﬁnitions here are inspired by work on concise information states (Boularias & Chaib-draa, 2008; Oliehoek et al., 2013). Here, we connect this research to occupancy-state MDPs, and later provide natural algorithms for constructing and eﬀectively solving them.

Deﬁnition 10. Private histories θi, θ i Θi t of agent i I are locally probabilistically equivalent with respect to occupancy state ξ t denoted ξ-LPE if, and only if, for any state s S and history θ i Θ i t of the other agents I\{i}: Pr(s, θ i|θi, ξ) = Pr(s, θ i|θ i, ξ).

It is worth noticing that ξ-LPE can be used to partition private history set Θi t of any agent i I, for all time steps t {0, 1, . . . , T}. This partition is a set of nonempty subsets (called clusters) Bi 1, Bi 2, . . . , Bi k of private histories, such that Bi 1 Bi 2 . . . Bi k = Θi t. We distinguish between two sets of private histories for each agent i I and any occupancy state ξ. The ﬁrst set, denoted Θi t(ξ), consists of private histories with non-zero probability with respect to ξ. The second set, denoted Θi t\Θi t(ξ), consists of private histories with zero probability with respect to ξ. This diﬀerence is particularly important, as we will show later. In fact, only nonzero private histories play a part in demonstrating that ξ-LPE preserves optimality. In addition, it is useful to note that, for each agent, ξ-LPE groups together all zero private histories w.r.t. ξ into the same cluster. Given that all private histories are clustered that convey the same information, representations of compact occupancy states, decision rules and value functions should depend only upon these clusters. Unfortunately, maintaining clusters still requires a large amount of memory, which explains the impetus for labeled clusters. A labeled cluster is a cluster along with a label; all private histories in a cluster match the corresponding label. Throughout the paper, we use the following convention: each cluster of private histories maps to the minimum private history (of that cluster) using lexicographical ordering. Speciﬁcally, the label of a cluster is chosen among private histories in the cluster, which have non-zero probability w.r.t. the occupancy state. Therefore, the label of a cluster is also a private history in that cluster, which leads to a clear relationship between a private history, a cluster and its corresponding label. Thus, the representation of compact occupancy states, decision rules and value functions should depend only upon labels instead of clusters. Nonetheless, these compact representations make it hard to generalize the value function from one compact occupancy state to another. As we will see later, this generalization requires the ability

Optimally Solving Finite-Horizon Dec-POMDPs

to quickly check whether a given private history belongs to a speciﬁed cluster. If we group histories using ξ-LPE, checking whether private history θi belongs to cluster Bi whose label is another private history ϑi Bi can now be replaced by checking whether private histories θi and ϑi are ξ-LPE, for any arbitrary agent i I. Unfortunately, checking whether two private histories of an agent are ξ-LPE requires enumerating all states and other agents nonzero histories w.r.t. ξ, which results in a subroutine with complexity O(n|Θ i t (ξ)|) see Algorithm 3 in Appendix B. Given that we call this procedure for exponentially many private histories, the importance of replacing local probabilistic equivalence by a cheaper equivalence relation is clear. To this end, we introduce truncation probabilistic equivalence w.r.t. ξ (denoted ξ-TPE). Before providing the formal deﬁnition of ξ-TPE, we start with a motivation. In many practical situations, all important information about the process to be controlled lies in a history window (of a ﬁxed length) of actions and observations the agents have experienced. Based on this insight, we want truncation probabilistic equivalence to group private histories of each agent that share: (i) the same model about states and other agent histories (i.e., be ξ-LPE) and (ii) a common suﬃx of a ﬁxed length m. In such a setting, once the private histories have been clustered, checking if any particular private history belongs to a cluster can now be replaced by checking whether both that private history and the label (another private history) for the cluster share the same suﬃx of a speciﬁed length m which is signiﬁcantly faster than checking whether two private histories are ξ-LPE: O(m) versus O(n|Θ i t (ξ)|). That is, once the private histories are clustered, any two private histories with the same suﬃxes of length m are already known to be ξLPE though the clustering process and choice of m. It turns out that in deﬁning the probabilistic truncation equivalence w.r.t. ξ, we need to determine the suﬃx length (i.e., history window) mξ that is suﬃcient called the local truncation parameter w.r.t. ξ. The local truncation parameter mξ with respect to occupancy state ξ t has to be: (i) large enough to ensure all nonzero private histories w.r.t. ξ that share the same suﬃx of length mξ are also ξ-LPE; and (ii) small enough to group the maximum number of private histories. A straightforward method (Algorithm 4 in Appendix B) to compute mξ starts with parameter m = 0 and proceeds as follows: (step 1) If for any agent, nonzero private histories w.r.t. ξ that share a common suﬃx of length m are also ξ-LPE with one another, then set mξ = m and terminate; (step 2) Otherwise, increment m = m + 1 and go back to step 1. This algorithm is guaranteed to terminate after at most t (the current time step) iterations, i.e., mξ = t in the worst case. The later case corresponds to the boundary case where no clustering is done i.e., each cluster corresponds to a single joint history. We are now ready to formally deﬁne the truncation probabilistic equivalence relation.

Deﬁnition 11. Private histories θi, θ i Θi t of agent i I are truncation probabilistically equivalent with respect to occupancy state ξ t we denote ξ-TPE if, and only if, they hold the same last mξ private actions and observations given mξ is found for the set of histories as discussed above.

Not surprisingly, ξ-TPE also partitions the private history sets. However, it often produces more clusters than ξ-LPE since it further constrains ξ-LPE non-zero private histories w.r.t. ξ (since it also requires the suﬃxes to be the same). In contrast to ξ-LPE, it spreads zero private histories w.r.t. ξ over diﬀerent clusters. Recall that the advantage of ξ-TPE over ξ-LPE is that ﬁnding the appropriate cluster for a private history (i.e., checking whether two private histories are equivalent) is cheaper using ξ-TPE than using ξ-LPE. The former requires the comparison of their suﬃx for a ﬁxed length, whereas the latter needs the comparison of large distributions over states and joint histories.

Dibangoye, Amato, Buffet & Charpillet

It is worth noticing that, to the best of our knowledge, TPE and LPE are the weakest assumptions to date that can reduce the dimensionality of full occupancy states. Nevertheless, not all Dec POMDPs will beneﬁt from these data reduction approaches. More precisely, it is unlikely that these assumptions provide concise occupancy states in totally unstructured domains. Fortunately, real-world domains are often very structured, as demonstrated in Section 5. Next, we address the problem of using equivalence relations between private histories to ﬁnd compact representations for all occupancy states, decision rules and real vectors.

4.1.2 Compact States, Actions, Vectors and MDPs

We deﬁne the compact representations for occupancy states, decision rules and real vectors based upon the aforementioned equivalence relations. Notice that the following deﬁnitions depend on either local or truncation probabilistic equivalence relations only through the labeled clusters they generate. Intuitively, compact occupancy states are distributions over states and joint labels, compact decision rules are mappings from labels to private actions, and compact real vectors are mappings from pairs of states and joint labels to reals. Notationally, let Li ξ be the label set of agent i I that occupancy state ξ generates (i.e., the set of labels generated from the histories which make up ξ), and Bϑi be the labeled cluster of agent i I with label ϑi Li ξ and let Lξ = i I Li ξ.

Deﬁnition 12. A compact occupancy state of ξ, denoted ξ, is a distribution over states and joint labels: for all s S and (ϑi)i I Lξ,

ξ(s, ϑ1, ϑ2, . . . , ϑ|I|) def= X

θ2 Bϑ2 . . . X

θ|I| Bϑ|I| ξ(s, θ1, θ2, θ|I|). (25)

Example 7 (Multi-agent tiger from full occupancy states to compact occupancy states). Figure 11 depicts a full occupancy state (left-hand side) and a corresponding compact occupancy state (right-hand side) over joint histories and hidden states. To obtain this compact occupancy state, we show that (al, zhl) and (al, zhr) are ξ1-LPE for agent green, and (al, zhl) and (aol, zhr) are ξ1-LPE for agent red. As a consequence, one can group histories Bgreen {(al, zhl), (al, zhr)} for agent green, and Bred {(aol, zhl), (aol, zhr)} for agent red, and replace each cluster by a single label.

A compact decision rule w.r.t. ξ of agent i I, denoted di ξ : Li ξ 7 Ai, is a mapping from label set to private action set. In addition, a compact private policy w.r.t. (ξ0, ξ1, . . . , ξT 1) of agent i I, denoted πi = ( di ξt)t {0,1,...,T 1}, is a sequence of compact decision rules of agent i I. A similar deﬁnition follows for compact separable joint policies π = ( πi)i I. A compact real vector w.r.t. ξ, denoted βξ Rn|Lξ|, is a mapping from pairs of state s S and joint label ϑ Lξ to reals.

Intuitively, distribution ξ reassigns the probability mass of each joint cluster (Bϑi)i I to the corresponding joint label (ϑi)i I. Algorithms 3 and 4 (see Appendix B) present a straightforward way to compute a compact occupancy state using local and truncation probabilistic equivalence relations, respectively. It is worth noticing that, for a given occupancy state ξ , ξ-LPE always produces a number of joint labels |Lξ| less than or equal to the number of labels that ξ-TPE produces. This is because ξ-TPE is a stricter form of local probabilistic equivalence, as discussed above. Hence, ξ-LPE produces compact occupancy states, decision rules and real vectors that are more concise than those from ξ-TPE.

Optimally Solving Finite-Horizon Dec-POMDPs

Full Occupancy State ξ1 Compact Occupancy State ξ1

Figure 11: A step-1 compact occupancy state ξ1 that corresponds to full occupancy state ξ1.

In demonstrating that compact occupancy states are suﬃcient for optimal decision making in occupancy-state MDPs, we rely on the notion of compact occupancy-state MDPs.

Deﬁnition 13. The compact occupancy-state MDP w.r.t. occupancy-state MDP ˇM is given by tuple M ( , A, R, P, ξ0, T):

= t {0,1,...,T} t is the space of compact occupancy states, with ξ0 = ξ0 the initial compact occupancy state;

A = t {0,1,...,T} At is the space of compact joint (decentralized) decision rules;

R: A 7 R is a compact reward function where R( ξ, dξ) = P s,ϑ ξ(s, ϑ) r dξ(ϑ)(s);

P: A 7 is a compact transition function where P( ξ, dξ) def= ξ , with ξ = P( ξ, dξ); and

T denotes the planning horizon.

Analogous to occupancy-state MDPs, compact occupancy states can be updated using known transition and observation functions of the Dec-POMDP given the current compact occupancy state and the chosen compact separable joint decision rule. The rewards are also calculated (as expectations) using the known reward model of the Dec-POMDP. Finally, the optimal value function of M is solution of the optimality equations:

V M,t( ξt) def= max dξt A

R( ξt, dξt) + V M,t+1( P( ξt, dξt)) , t {0, 1, . . . , T 1}, (26)

with an added boundary condition V M,T( ) = 0.

4.1.3 Sufficiency of Compact Occupancy States

Below, we prove that, when planning over compact occupancy states instead of full occupancy states, the optimal separable joint policy for the compact model immediately induces a corresponding optimal separable joint policy for the original model. Before proceeding any further, we ﬁrst demonstrate the optimality of compact policies.

Dibangoye, Amato, Buffet & Charpillet

Note that this proof mirrors a similar proof showing that histories for an agent can be clustered without any change to the optimal policy or loss in value by using probabilistic equivalence in the traditional Dec-POMDP representation (Oliehoek et al., 2013). We extend these ideas to the occupancy-state MDP case and incorporate truncation probabilistic equivalence here to show that the compact policies preserve optimality.

Theorem 4 (Optimality of compact policies). In occupancy-state MDPs, there exists optimal policies of an agent that depend only upon labels of that agent produced based on either local or truncation probabilistic equivalence with respect to occupancy states these optimal policies induce, and not on private histories.

Proof. We construct the proof by induction. We ﬁrst show that in the last step of the problem, an agent policy depends on labels, but not on its private histories. Given occupancy state ξT 1 T 1 and policies π i T 1 Π i T 1 of the other agents I\{i}, the policy πi T 1 Πi T 1 of agent i I is a best response to π i T 1. That is, it chooses for every private history θi T 1 Θi T 1 a private action so as to maximize value based on model (Pr(s, θ i T 1|θi T 1, ξT 1))s,θ i T 1 occupancy state ξT 1 and private history θi T 1 induce over possible histories of the other agents and resulting states of the system:

πi T 1(θi T 1) arg maxai Ai X

Pr(s, θ i T 1|θi T 1, ξT 1) rai,π i T 1(θ i T 1)(s). (27)

In assigning private actions to private histories given ξT 1, only nonzero private histories w.r.t. ξT 1 aﬀect the outcome. That is, zero probability private histories w.r.t. ξT 1 can be prescribed any private action without losing optimality (e.g., a private action identical to that of the associated labels generated based on either local or truncation probabilistic equivalence relations). Hence, policy πi T 1 depends on zero probability private histories w.r.t. ξT 1 only through corresponding labels. Next, we restrict our attention to nonzero probability private histories w.r.t. ξT 1. Recall that private histories of this family that are ξ-TPE with one another are also ξ-LPE with one another. Therefore, if we demonstrate the property for private histories of this family that are ξ-LPE with one another, the proof follows immediately for those that are ξ-TPE. Assume θi T 1 is a nonzero probability private history w.r.t. ξT 1. If θi T 1 is in the cluster of ξT 1-LPE private histories with label ϑi T 1 (another nonzero probability private history w.r.t. ξT 1), then equality (Pr(s, θ i T 1|θi T 1, ξT 1))s,θ i T 1 = (Pr(s, θ i T 1|ϑi T 1, ξT 1))s,θ i T 1 holds. As a consequence, policy πi T 1 depends on nonzero probability private history w.r.t. ξT 1 only through corresponding labels:

πi T 1(θi T 1) arg maxai Ai X

Pr(s, θ i T 1|ϑi T 1, ξT 1) rai,π i T 1(θ i T 1)(s). (28)

Therefore, the property holds at the last step of the problem, for any arbitrary agent i I. This allows us to deﬁne policies on the last step as mappings from labels to private actions, i.e., compact policies. Next, we rely on the concept of private policy tree, which is a tree that represents actions in nodes and observations in edges with a depth that is the number of stages to go. The root node determines the ﬁrst private action to be taken. Then depending on private observation received, the agent executes another private action; and so on until a leaf node is reached. A policy tree

Optimally Solving Finite-Horizon Dec-POMDPs

δi t:T 1 is a portion of a private policy πi t:T 1, which prescribes private actions to be taken by agent i I in the remaining stages starting at a given private history θi t. By an abuse of the notation, let δi t:T 1 = πi t:T 1(θi t), for all agents i I and all time steps t {0, 1, . . . , T}. For the induction step, we can show that, if an agent policy-tree from step t + 1 onward depends on private histories only through the corresponding labels for either equivalence relations, then that agent s policy tree from step t onward also depends on its private histories only through the corresponding labels for either equivalence relations. Given occupancy state ξt t and policy π i t:T 1 Π i t:T 1 of the other agents I\{i}, policy πi t:T 1 Πi t:T 1 of agent i I is as follows:

πi t:T 1(θi t) arg maxδi t:T 1 Πt:T 1 X

s,θ i t Pr(s, θ i t |θi t, ξt) βδi t:T 1,π i t:T 1(θ i t )(s), (29)

where (δi t:T 1, π i t:T 1(θ i t )) is a separable joint policy-tree and βδi t:T 1,π i t:T 1(θ i t ) is the associated vector value. Again, in assigning private actions to private histories given ξt, only nonzero probability private histories w.r.t. ξt matter. In fact, the property trivially holds for zero probability private histories w.r.t. ξt. For nonzero probability private histories w.r.t. ξt, the property holds for private histories that belong to a cluster of ξt-TPE private histories since those histories would belong to a cluster of ξt-LPE private histories, as previously discussed. For this reason, we restrict our attention to nonzero probability private histories θi t w.r.t. ξt that belong to a cluster of ξt-TPE private histories with label ϑi t. Then, equality (Pr(s, θ i t |θi t, ξt))s,θ i t = (Pr(s, θ i t |ϑi t, ξt))s,θ i t holds. Hence,

πi t:T 1(θi t) arg maxδi t:T 1 Πt:T 1 X

Pr(s, θ i t |ϑi t, ξt) βδi t:T 1,π i t:T 1(θ i t )(s). (30)

Consequently, an agent policy depends on private history for any step of the problem only through corresponding labels for either equivalence relations. This demonstrates that a compact policy does not lose information.

Theorem 5. Compact occupancy state ξ based on either local or truncation probabilistic equivalence relations is a suﬃcient statistic of occupancy state ξ t. Furthermore, an optimal separable joint policy for compact occupancy-state MDP M, together with the correct estimation of the compact occupancy states, immediately induces an optimal separable joint policy for occupancy-state MDP ˇM.

Proof. The proof proceeds similarly to that of Theorem 2. That is, in proving the suﬃciency of the compact occupancy state with respect to its corresponding full occupancy state, we need to demonstrate: (a) the optimal value function at a compact occupancy state is identical to that of its corresponding occupancy state and (b) the next-step compact occupancy states depend only upon the current compact occupancy states (and next-step compact separable joint decision rules). We stated (b) in Deﬁnition 13, so only statement (a) remains to be proved. We show this by induction. The suﬃciency of the compact occupancy state with respect to its corresponding occupancy state trivially holds at the last step of the problem. In fact, V M,T( ξT) = V ˇM,T(ξT) = 0 for any

arbitrary occupancy state ξT and its corresponding compact occupancy state ξT (since the horizon has been reached). If we assume the statement (a) holds for time-step t + 1, we can now show

Dibangoye, Amato, Buffet & Charpillet

that it holds for time-step t. For any arbitrary step-t occupancy state, Bellman s optimality criterion produces the following equality:

V ˇM,t(ξt) def= max dt A R(ξt, dt) + V ˇM,t+1(ξt+1), (31)

where ξt+1 = P(ξt, dt). By the inductive hypothesis, we have that V M,t+1( ξt+1) = V ˇM,t+1(ξt+1). This results in equality:

V ˇM,t(ξt) = max dt A R(ξt, dt) + V M,t+1( P(ξt, dt)). (32)

In addition, Theorem 4 demonstrated that by restricting our attention to compact (joint) decision rules A, we preserve optimality. Thus, we obtain:

R(ξt, dt) = X

φ s,ϑ Φξt rdt(ϑ)( s) X

s,θ φ s,ϑ(s, θ) ξt(s, θ), (Theorem 4)

φ s,ϑ Φξt r dξt (ϑ)( s) ξt( s, ϑ), (Deﬁnition of ξt)

= R( ξt, dξt), ( dξt(ϑ) = dt(ϑ)).

V ˇM,t(ξt) = max dξt A R( ξt, dξt) + V M,t+1( P( ξt, dξt)), (33)

= V M,t( ξt). (34)

Therefore, statement (a) holds for any arbitrary occupancy state ξt t and any arbitrary time step t {0, 1, . . . , T 1}. Combining statements (a) and (b), we are guaranteed to ﬁnd the optimal value function for ˇM by using compact occupancy states instead of occupancy states. In addition, given the optimal value function (V M,t)t {0,1,...,T}, the optimal compact policy π def= ( dt)t {0,1,...,T 1} is obtained by successive one-step lookaheads: for all t {0, 1, . . . , T 1},

dt max dξt A R( ξt, dξt) + V M,t+1( P( ξt, dξt)), (35)

where ξ0 = ξ0 and ξt+1 = P( ξt, dt). This immediately induces a separable joint policy π def= (dt)t {0,1,...,T 1} for the original occupancy-state MDP ˇM such that, for every agent i I and every private history θi t, we set di t(θi t) = di t(ϑi t), where θi t is in a cluster with label ϑi t. Since both π and π yield the same expected value starting at the initial occupancy state, separable joint policy π is optimal for the original occupancy-state MDP ˇM.

By solving the compact problem instead of the original one, we circumvent the exhaustive enumeration of occupancy states and decision rules and preserve ability to ﬁnd an optimal solution for the original problem. It is worth noticing that the optimal value function in the compact occupancy space is not PWLC. This is because compact occupancy states are expressed using diﬀerent label sets. However, by exploiting the PWLC property of the optimal value function in the full occupancy-state space, we develop methods that can generalize value from one compact occupancy

Optimally Solving Finite-Horizon Dec-POMDPs

state to another. Next, we propose a method for incrementally improving lower and upper bounds that narrow the range of the optimal value function. But, in contrast to the OHSVI algorithm, we do this using compact occupancy states and compact real vectors. More precisely, our compact upper and lower bounds are deﬁned on full occupancy states, but only through their compact representations.

4.2 Feature-Based Compact Bounds

In our algorithm, upper and lower bounds are of crucial importance. They can narrow the range of the optimal value function, determine suboptimal regions of the search space, and speed up the convergence towards an exact solution. Our approach approximates the full lowerand upper-bound heuristic functions with heuristic functions deﬁned over the same full occupancy space (Tsitsiklis & van Roy, 1996; Hauskrecht, 2000; Roy, Gordon, & Thrun, 2005). The new heuristic functions are typically more compact (with respect to traditional high-dimensional vector or point sets discussed in Section 3.2.2 and Section 3.2.3), and are easier to compute than the full bounds. Our heuristic functions can be formulated as feature-based compact functions which combine dimensionality reduction (using the clustering methods discussed above) and function approximation (Tsitsiklis & van Roy, 1996). Thus, to demonstrate that the new heuristic functions are valid bounds, we will analyze our dimension reduction and approximation methods.

4.2.1 Feature-Based Compact States and Vectors

One can think of feature-based compact representations as a dimension reduction model for representing high-dimensional bounds using bounds of lower dimensionality. However, this approach requires a set of basis functions (or feature set), in which lower-dimensional bounds can be expressed eﬀectively. A basis function (or feature) is a function which maps high-dimensional data to one salient feature of the problem at hand. In our setting for example, a feature can be an indicator function of whether a private history matches one speciﬁed label. Features are at the core of the feature-based compact representations of full occupancy states and real vectors, which ultimately serve to represent upper and lower bounds using either feature-based compact vector or point sets in a similar way as standard vector and point set representations.

Deﬁnition 14. A feature is an indicator function φs,(ϑi)i I : S Θ 7 {0, 1} for one speciﬁed state s S and one speciﬁed joint label (ϑi)i I Lξ that occupancy state ξ induces i.e., for all s S and (θi)i I Θ,

φs,(ϑi)i I( s, (θi)i I) = ( 1, if s = s and, for all i I, θi belongs to cluster with label ϑi, 0, otherwise.

The feature set w.r.t. ξ, denoted Φξ, is given by Φξ def= {φs,ϑ|s S, ϑ i ILi ξ}.

The feature set w.r.t. ξ represents the partition of the state and joint history space that equivalence relation ξ-LPE or ξ-TPE induces. This partition is a set of nonempty subsets (Bs,ϑ)s S,ϑ Lξ (called labeled joint clusters) such that s S,ϑ LξBs,ϑ = S Θ. A labeled joint cluster Bs,ϑ consists of the cross-product between the singleton {s} and labeled clusters Bϑ1, Bϑ2, . . . , Bϑ|I| that equivalence relation ξ-LPE or ξ-TPE generates. Therefore, a feature can be interpreted as a way to check whether a state and joint history pair belongs to a speciﬁed labeled joint cluster. Thus, features provide

Dibangoye, Amato, Buffet & Charpillet

an alternative (possibly lossy) representation of compact occupancy states. One can express the compact representation ξ of full occupancy state ξ using its feature set Φξ, as follows

θ Θ(ξ) φ s,ϑ(s, θ) ξ(s, θ)

Speciﬁcally, for all state s S and all labels ϑi Li ξ and all agent i I,

ξ( s, ϑ1, ϑ2, . . . , ϑ|I|) def= X

θ2 Bϑ2 . . . X

θ|I| Bϑ|I| ξ( s, θ1, θ2, θ|I|),

( s,θ) B s,ϑ ξ( s, θ), (ϑ def= (ϑ1, ϑ2, . . . , ϑ|I|))

θ Θ(ξ) φ s,ϑ(s, θ) ξ(s, θ),

An analogous property holds for feature-based compact real vectors w.r.t. ξ. That is, a feature-based compact real vector βξ is a |Φξ|-dimensional real vector that is expressed using Φξ. It is worth noticing that standard feature-based approaches assume a unique feature set and data are all expressed based on that unique feature set, which eases bound generalization (Tsitsiklis & van Roy, 1996). In our setting, however, diﬀerent occupancy states yield diﬀerent (possibly disjoint) feature sets. Hence, feature-based compact occupancy states (or real vectors) are expressed based on diﬀerent feature sets, making it hard to transfer value from one feature-based compact occupancy state to another one. Bounds generalize naturally among feature-based compact occupancy states only if they share the same feature set. To remedy this, we introduce basis change operations for both feature-based compact occupancy states and real vectors.

4.2.2 Change of Feature Set

In enabling the bound generalization, it is necessary to work with more than one feature set. Hence, it is important to be able to easily transform feature vectors calculated with respect to one feature set to their corresponding (possibly lossy) representations with respect to another feature set. To ease the change of feature set, it is necessary to match features from the original feature set to those from the destination feature set. We introduce two heuristic methods to match features from diﬀerent feature sets, thereby allowing the change of feature sets.

Deﬁnition 15. Let Φξ and Φξ be feature sets w.r.t. occupancy states ξ and ξ , respectively. The projection of feature-based compact occupancy state ξ onto feature set Φξ , denoted FΦξ ( ξ), is given by probability distribution:

FΦξ ( ξ) def=

φs,ϑ Φξ φ s, ϑ(s, ϑ) ξ(s, ϑ)

where FΦξ ( ξ) reassigns the probability mass ξ(s, ϑ) of each pair (s, ϑ), such that φs,ϑ Φξ, to pair ( s, ϑ), such that φ s, ϑ Φξ , if and only if they match, i.e., φ s, ϑ(s, ϑ) = 1.

Optimally Solving Finite-Horizon Dec-POMDPs

This change of feature set describes a function from original feature set Φξ to destination feature set Φξ . In particular, it is a surjective function (i.e., every feature φ s, ϑ in the destination set has at least one corresponding feature φs,ϑ in the original set those such that φ s, ϑ(s, ϑ) = 1). The transformation assigns to each feature in the destination set the probability mass of all corresponding features in the original set. This heuristic method provides no guarantee that both FΦξ ( ξ) and ξ share the same optimal value. By replacing a range of features in the original set by a single feature in the destination set, we produce compact but possibly lossy representations, which ultimately precludes ability to preserve the value of the original compact occupancy state. Fortunately, since we will use these representations to provide bounds, even lossy representations can generate useful bounds.

Deﬁnition 16. Let Φξ and Φξ be feature sets w.r.t. occupancy states ξ and ξ , respectively. The projection of feature-based compact real vector βξ using feature set Φξ , denoted GΦξ ( βξ), is given by feature vector:

GΦξ ( βξ) def=

φs,ϑ Φξ φs,ϑ( s, ϑ) βξ(s, ϑ)

This change of feature set applies to real vectors, and describes a function from destination feature set Φξ to original feature set Φξ. Speciﬁcally, every pair ( s, ϑ) along with feature φ s, ϑ in the destination set has a corresponding pair (s, ϑ) along with a feature φs,ϑ in the original set i.e., the only one such that φs,ϑ( s, ϑ) = 1. The transformation assigns to each pair ( s, ϑ) in the destination set the value βξ(s, ϑ) from their corresponding pair (s, ϑ) in the original set. Since not all pairs in the original set can have their values represented in GΦξ ( βξ), the resulting feature vector is a lossy representation of the original vector βξ. The loss in the resulting feature vector depends on original and destination feature sets, and the choice of the equivalence relation between histories. As previously mentioned, we will make use of either ξ-LPE or ξ-TPE relations. Using ξ-LPE, we distinguish between state and joint-history pairs that involve only non-zero probability private history sets w.r.t. ξ (i.e., non-zero probability pairs), and state and joint-history pairs that involve zero probability private history sets w.r.t. ξ (i.e., zero probability pairs). For the sake of conciseness, compact real vectors maintain only values associated to non-zero probability pairs. All zero probability pairs have the same default value, for example (T t) mins S,a A ra(s). Hence, the change of feature set is such that all pairs ( s, ϑ) in the destination set, with a corresponding non-zero probability pair (s, ϑ) in the original set, inherits the value βξ(s, ϑ). However, all pairs ( s, ϑ) in the destination set, with a corresponding zero probability pair (s, ϑ) in the original set, inherits the default (and loose) value. Because the number of zero probability pairs in occupancy state ξ is far larger than the number of non-zero probability pairs, feature vectors that result from the change of basis via ξ-LPE have multiple components with a loose value. Using ξ-TPE, we distinguish between state and joint-history pairs only through joint-history suﬃxes of length mξ see Deﬁnition 10. There are many pairs in the destination set, that would have been associated to a corresponding zero probability pair in the original set using ξ-LPE. Using ξ-TPE, however, these pairs are associated to a non-zero probability pair in the original set. Speciﬁcally, pair ( s, ϑ) in the destination set, whose probability is zero with respect to ξ, has a corresponding zero probability pair in the original set using ξ-LPE. Yet, if pair ( s, ϑ) in the destination set and non-zero probability pair (s, ϑ) in the original set share a common length mξ history suﬃx, then ( s, ϑ) is associated to (s, ϑ) using ξ-TPE. Thus, feature vectors that result from the change of

Dibangoye, Amato, Buffet & Charpillet

feature set using ξ-TPE involve fewer components with a default, loose value than those from using ξ-LPE. Although heuristic methods for the change of feature set are not guaranteed to produce representations that retain all information of their compact counterparts, in many cases the resulting (possibly lossy) representation is suﬃcient to generalize bounds over the entire compact occupancy space. The bound property (i.e., the ability to overestimate or underestimate the optimal value) of the resulting representation can be determined by examining methods for the change of feature set. The following theorem (proved in the Appendix) establishes that the FΦ and GΦ mappings we use preserve the bound property.

Theorem 6. For any arbitrary feature set Φ, the change of feature set based on mappings GΦ and FΦ preserve the bound property i.e., ξ t,

a. ν R, if ν V M,t( ξ), then ν V M,t(FΦ( ξ));

b. β R|Φ |, if V M,t( ξ) ξ,GΦξ( β) , then V M,t( ξ) ξ,GΦξ(GΦ( β)) .

Theorem 6 shows that bound properties for a given occupancy state are preserved for the occupancy state obtained upon the change of feature set using heuristic methods F or G. Next, we discuss how to approximate full upper and lower bounds over the entire occupancy space.

4.2.3 Compact Point-Value Mappings: Upper Bounds

The full upper-bound value function can be approximated by a ﬁnite set of points and the sawtooth interpolation rule that estimates the value of an arbitrary point of the compact occupancy space by relying on the points already experienced and their associated values. A key aspect of this heuristic approximation is the sawtooth interpolation rule over compact occupancy states. In the full upper-bound value function, the sawtooth interpolation can only approximate points expressed within the same basis set. Here, we demonstrate that it can apply even when points are expressed in diﬀerent feature sets by means of the change of feature set. Let Ψ = {( ξ1 7 v ξ1), ( ξ2 7 v ξ2), . . . , ( ξk 7 v ξk)} be a set of point-value pairs that represents approximate function U ˇM deﬁned over the compact occupancy space, such that each point satisﬁes Theorem 6a. Then the approximate value for an arbitrary compact occupancy state ξ based on point-value pair ( ξ 7 v ξ) can be obtained using the sawtooth interpolation in a way that is similar to the calculation in the full upper-bound value function:

v ξ ξ = v ξ + (v ξ v ξ) D(FΦξ ( ξ), ξ ), (38)

where v ξ = P s,ϑ ξ(s, ϑ) vs, . From Theorem 6a, we know that feature vector FΦξ ( ξ) shares the

same upper bound v ξ with ξ. In addition, FΦξ ( ξ) and ξ are expressed using the same feature set

Φξ . Hence the sawtooth interpolation can apply and produce upper-bound value v ξ ξ for compact

occupancy state ξ based on point-value pair ( ξ 7 v ξ). The optimization with respect to compact occupancy state ξ is then acquired by choosing the best overall upper-bound value from all pointvalue pairs in Ψ:

U M( ξ ) = min {v ξ , v ξ ξ | ( ξ 7 v ξ) Ψt}. (39)

Optimally Solving Finite-Horizon Dec-POMDPs

This heuristic approximation diﬀers from the full upper-bound value function only because it requires the change of feature set FΦ and compact occupancy states instead of full occupancy states. Hence, this sawtooth interpolation is computationally more eﬃcient than that of the full upperbound value function, because compact occupancy states are of lower dimensionality. As for the accuracy of the resulting upper-bound values, it is not clear how this sawtooth interpolation compares with that from the full upper-bound value function (i.e., whether or not the sawtooth interpolation weakens the upper bounds). Yet, a collection of point-value pairs obtained for a selection of compact occupancy states can be combined to deﬁne an approximate function of the upper-bound value function as discussed next. A feature-based compact value function (U M,t)t {0,1,...,T} over compact occupancy space is represented using sets ( Ψt)t {0,1,...,T} of point-value pairs that estimate values of any arbitrary compact occupancy state. Initially, each set Ψt contains |S | point-value pairs {ξs, 7 vs, t | s S } that represent the step-t optimal value function of the underlying MDP. The sawtooth interpolation estimates U M,t at any compact occupancy state ξ t as follows:

U M,t( ξ ) = min {v ξ , v ξ ξ | ( ξ 7 v ξ) Ψt}, t {0, 1, . . . , T}.

Set Ψt is updated for every compact occupancy state ξ t, using a point-based backup step as follows:

Ψt = Ψt [ n ( ξ 7 v ξ ) o , t {0, 1, . . . , T}

where v ξ = max dξ A R( ξ , dξ ) + U M,t+1( P( ξ , dξ )). Approximate function (U M,t)t {0,1,...,T} is an upper-bound value function similarly to the full upper-bound value function as stated below and proven in Appendix A.

Theorem 7. Feature-based compact value function (U M,t)t {0,1,...,T}, as iteratively updated, upper bounds the optimal value function over the entire compact occupancy space.

4.2.4 Compact Vector Sets: Lower Bounds

A full lower bound over the full occupancy space can be approximated by a ﬁnite set of compact real vectors along with their associated feature sets and linear function updates. We take inspiration from initialization, evaluation and update routines from the full lower-bound value function. One important aspect of our approximation lies in the deﬁnition of the update operation, denoted ] backup. Let Λ be a set of compact real vectors that represents the approximate value function, such that each compact real vector satisﬁes Theorem 6b. Then, a new compact real vector for any compact occupancy state ξ and compact decision rule dξ can be computed eﬃciently as in the full lowerbound value function (Section 3.2.2):

g ξ dξ = r dξ + arg maxgα dξ : α Λ ξ, gα dξ

where gα dξ = GΦP( ξ, dξ)(α) (p dξ) is the expression of the projection of GΦP( ξ, dξ)(α) onto feature set Φξ, and GΦP( ξ, dξ)(α) is the expression of α in feature set ΦP( ξ, dξ). The optimization with respect to

compact occupancy state ξ is then acquired by choosing the compact vector with the best overall

Dibangoye, Amato, Buffet & Charpillet

value from all vectors g ξ dξ:

] backup( Λ, ξ) = arg maxg ξ dξ : dξ A

The principal diﬀerence with respect to the full lower-bound value function lies in the use of transformation GΦ and compact real vectors instead of high-dimensional real vectors. Hence, this backup operation is more eﬃcient than that of the full lower-bound value function, since operations are the same but we now use only lower dimensional vectors, which is likely to save signiﬁcant time. But the change of feature set may produce weaker bounds. Nonetheless, a collection of compact vectors obtained for a selection of compact occupancy states can be combined to deﬁne an approximate function of the lower-bound value function as discussed next. A feature-based compact value function (L M,t)t {0,1,...,T 1} over the compact occupancy space is represented using sets ( Λt)t {0,1,...,T 1} of compact real vectors along with their associated feature sets that estimate values of any arbitrary compact occupancy state. Initially, each set Λt contains a single compact real vector βt given by

βt( ) = (T t) min s S,a A ra(s), t {0, 1, . . . , T}. (42)

The max-vector rule estimates L M,t at any compact occupancy ξt t as follows:

L M,t( ξt) = max (Φ7 βt) Λt ξt,GΦξt( βt) , t {0, 1, . . . , T}. (43)

Set Λt is updated for every compact occupancy state ξt t, using point-based backup steps as follows:

Λt = Λt [ n (Φξt 7 ] backup( Λt+1, ξt)) o . (44)

Approximate function (L M,t)t {0,1,...,T 1} is a lower-bound value function since its main diﬀerence with respect to the full lower-bound value function lies in the use of GΦ, which preserves the bound property. For a complete proof, the reader can refer to Appendix A.

Theorem 8. Feature-based compact value function (L M,t)t {0,1,...,T}, lower bounds the optimal value function over the entire compact occupancy space.

4.3 Feature-Based Heuristic Search Value Iteration Algorithm

This section presents the feature-based heuristic search value iteration algorithm (FB-HSVI) that iteratively updates feature-based compact lowerand upper-bound representations. We also discuss FB-HSVI s theoretical guarantees.

4.3.1 Algorithm Description

Similar to OHSVI (Algorithm 1), FB-HSVI (Algorithm 2) solves occupancy-state MDPs by generating trajectories of occupancy states and iteratively updating lower and upper bounds, but in the case of FB-HSVI, these are now compact occupancy states and feature-based compact lower (L M,t)t {0,1,...,T} and upper bounds (U M,t)t {0,1,...,T}. FB-HSVI improves the scalability of OHSVI in

Optimally Solving Finite-Horizon Dec-POMDPs

several ways. First, FB-HSVI replaces full exact representations by compact representations for all: occupancy states, decision rules, lower and upper bounds. In addition, it combines stopping criteria from HSVI (Smith, 2007) and optimal classical heuristic search methods (e.g., Hart et al., 1968; Korf, 1990), which may result in more eﬃcient pruning of unnecessary subspaces.

Algorithm 2: The FB-HSVI Algorithm.

function FB-HSVI((L M,t)t {0,1, ,T}, (U M,t)t {0,1, ,T})

initialize L M,t and U M,t for all t {0, , T 1}. while Stop ( ξ0, 0) do Explore( ξ0, 0)

function Explore( ξt, gt)

if Stop ( ξt, gt) then

dt arg max d t A R( ξt, d t) + U M,t+1( P( ξt, d t)) Update U M,t Explore( P( ξt, dt), R( ξt, dt) + gt) Update L M,t

function Stop( ξt, gt)

return U M,t( ξt) L M,t( ξt) L M,0( ξ0) gt + U M,t( ξt)

FB-HSVI diﬀers from OHSVI in four main ways: 1. compact representation of occupancy states, which signiﬁcantly reduces the search space; 2. compact representation of decision rules, which speeds up the decision-rule selection; 3. compact representation of lower and upper bounds, which speeds up the convergence; 4. enhanced value function generalization and combination of stopping criteria, which results in more eﬃcient pruning of unnecessary subspaces.

4.3.2 Stopping Criteria

The stopping criteria of FB-HSVI build upon those from optimal classical heuristic search methods (e.g., Hart et al., 1968; Korf, 1990; Smith, 2007). They determine when to stop the current trajectory of compact occupancy states in the algorithm. Ideally, an optimal criterion would measure the distance from the current trajectory to an optimal trajectory, but this is not known. Instead, we use two criteria based on the upper and lower bound values of trajectories and compact occupancy states. The upper bound of the current trajectory, which we denote f( ξt), is the sum of two functions: (1) the past trajectory-reward function g( ξ0, ξt), which is the sum of rewards from the starting compact occupancy state ξ0 to the current occupancy state ξt; and (2) the future trajectory-reward from compact occupancy state ξt, which is an admissible heuristic estimate, e.g., upper-bound U M,t( ξt) at compact occupancy state ξt. The ﬁrst criterion relies on the fact that there is no reason to expand an occupancy state ξt that has f( ξt) less than or equal to L M,0( ξ0), since it cannot lead to a solution better than the current best solution; a criterion previously used in optimal classical heuristic search methods (e.g., Hart et al., 1968; Korf, 1990).

Dibangoye, Amato, Buffet & Charpillet

Criterion 1. A trajectory of occupancy states ( ξ0, . . . , ξt) is interrupted whenever heuristic value f( ξ0) is less than or equal to L M,0( ξ0) i.e., L M,0( ξ0) f( ξ0). The best solution found so far is optimal if there is no expanded occupancy state ξt on the frontier of the search space2 with a heuristic-value f( ξt) higher than L M,0( ξ0).

The second criterion builds upon the fact that there is no reason to expand an occupancy state ξt that has upper bound less than or equal to its lower bound (Smith, 2007).

Criterion 2. A trajectory of occupancy states ( ξ0, . . . , ξt) is interrupted whenever upper bound U M,t( ξt) is less than or equal to lower bound L M,t( ξt) i.e., L M,t( ξt) U M,t( ξt). The best solution found so far is optimal if upper and lower bounds at the initial occupancy state are equal.

By interrupting any trajectory that satisﬁes either of criterion 1 or 2, FB-HSVI preserves the ability to ﬁnd an optimal separable joint policy, as shown below.

4.3.3 Convergence Guarantees

Theorems 7 and 8 show the feature-based compact functions (L M,t)t {0,1, ,T} and (U M,t)t {0,1, ,T} as iteratively updated by FB-HSVI (Algorithm 2) upper and lower-bounds the optimal value function. Next, we prove that upon the update of each trajectory no compact occupancy state has its bounds depreciated, and at least one compact occupancy state improves its bounds. Since there is a ﬁnite number of compact occupancy states, the bounds ultimately converge to the optimal value at the initial occupancy state. Here, we use this argument to show that FB-HSVI converges to an optimal separable joint policy after a ﬁnite number of iterations. To this end, a compact occupancy state is said to be ﬁnished if either criterion 1 or 2 is satisﬁed; otherwise it is not ﬁnished. Moreover, all compact occupancy states ξT at the last time step T are ﬁnished since criterion 2 is satisﬁed at the last time step T.

Theorem 9. The FB-HSVI algorithm always terminates after a ﬁnite number of trials and the solution found at termination the separable joint policy the lower bound induces is optimal.

Proof. First, we show by contradiction that the algorithm cannot terminate before an optimal solution is found. Suppose the algorithm terminates before ﬁnding an optimal solution which has value f ( ξ0). Then, the sequence of f( ξ0) values generated while planning is f 0( ξ0), f 1( ξ0), . . . , f k( ξ0), where f 0( ξ0) is the initial lower bound before any solution is found, f 1( ξ0) is the value of the ﬁrst solution found, and f k( ξ0) that of the last solution found. In addition, we know by hypothesis that f 0( ξ0) < f 1( ξ0) < . . . < f k( ξ0) f ( ξ0), where the last inequality holds under the assumption that the algorithm may terminate with a suboptimal solution. Now consider an optimal path ξ0, ξ1, . . . , ξT leading from the initial occupancy state to a terminal occupancy state. Under the assumption that this optimal path was not found, there must be some occupancy state ξt along this path that was generated but not expanded. This is only possible if f( ξt) f k( ξ0). But by the admissibility of f, we know that f( ξt) f ( ξ0) and therefore f( ξt) f ( ξ0) > f m( ξ0) for any m {0, 1, . . . , k}. From this contradiction, it follows that the algorithm cannot terminate before an optimal solution is found. Next, we show that each trial turns at least one not ﬁnished occupancy state into a ﬁnished one. Suppose the algorithm has not yet terminated and a trial is executed. Let the last two occupancy

2. Typically, search algorithms involves expanding nodes by adding all unexpanded neighboring nodes into a priority queue, called a frontier of the search space.

Optimally Solving Finite-Horizon Dec-POMDPs

states encountered during the forward expansion be ξt and ξt+1. Given that the trial terminated at ξt+1, we know that before the trial, ξt was not ﬁnished but ξt+1 was ﬁnished. Because ξt+1 results from a greedy separable joint decision rule selection at ξt, we know ξt will be ﬁnished after being updated. This is because only two scenarios are possible, each of which corresponds to a stopping condition:

either ξt+1 yields its optimal value, then ξt will also yield its optimal value after being updated, making it a ﬁnished occupancy state after the update;

or ξt+1 has f(ξt+1) lower than or equal to L M,0( ξ0), then ξt will also have an f( ξt) lower than or equal to L M,0( ξ0) after being updated.

Thus, executing a trial causes occupancy state ξt, which was not ﬁnished, to become ﬁnished. Finally, we show that the algorithm terminates after a ﬁnite number of trials. To this end, we note that the search graph of the algorithm is a tree similar to Figure 8, with a bounded branching factor | At| at depth t {0, 1, . . . , T}. By hypothesis, only occupancy states that appear at depth t < T are not ﬁnished. Thus, the total number of occupancy states at all depths up to time step T is upper bounded by the total number of information states at all depths up to time step T:

| T 1 t=0 St| = |A ||I||Z |T|A |T 1

|Z ||A | 1 , (45)

where A = maxi I Ai and Z = maxi I Zi. Given that at least one occupancy state becomes ﬁnished after each trial, the initial occupancy state must become ﬁnished after at most | T 1 t=0 St| trials, causing the algorithm to terminate.

Another important property of FB-HSVI is that it reﬁnes both upper and lower bounds throughout planning. Since compact occupancy state expansions are interleaved with updates, FB-HSVI oﬀers an anytime solution. Furthermore, cutting oﬀFB-HSVI trials at any time, we know that the diﬀerence between the current best solution and the optimal one is bounded.

Theorem 10. At any iteration of FB-HSVI, the current solution and the separable joint policy induced by the current lower bound is within ǫ = U M,0(ξ0) L M,0(ξ0) of the optimal solution.

Proof. Formally, the diﬀerence in value between executing the separable joint policy πlb induced by the current lower bound instead of an optimal separable joint policy π is written as follows:

V ˇM,π (ξ0) V ˇM,πlb(ξ0) = V ˇM,π (ξ0) L M,0(ξ0), (V ˇM,πlb(ξ0) = L M,0(ξ0)) (46)

U M,0(ξ0) L M,0(ξ0). (V ˇM,π (ξ0) U M,0(ξ0)) (47)

Consequently, whenever FB-HSVI is interrupted, its current solution is within the ǫ given above of the optimal solution.

5. Experiments

This section empirically demonstrates and validates the importance of our feature-based heuristic search value iteration (FB-HSVI) algorithm. We show that FB-HSVI outperforms all existing exact algorithms on all tested domains from the literature and that FB-HSVI can solve those problems over unprecedented time horizons.

Dibangoye, Amato, Buffet & Charpillet

5.1 Experimental Setup

As discussed throughout this paper, there are many key components that can aﬀect the performance of FB-HSVI. These key components include the (upper and lower) bound representations, the bound update methods, the history compression, the value generalization, and the initial upper bound. We present three variants of FB-HSVI, denoted OHSVI, FB-HSVI-LPE and FB-HSVI-TPE. The two latter diﬀer only on the notion of history equivalence they use in the feature-based compact representations (see Table 2). When no equivalence relation is given, FB-HSVI refers to its default (and better performing) implementation, FB-HSVI-TPE.

Bound Representations

Compression

FB-HSVI-LPE

feature-based compact

FB-HSVI-TPE

feature-based compact

Table 2: A review of the selected algorithmic components we use.

We selected benchmarks with the goal of spanning the range of properties that may aﬀect the performance of a Dec-POMDP solver. In Table 3, we review the selected domains and their properties. These domains can be downloaded at http://masplan.org.

domain M parameters |Π0:t| for diﬀerent T N |S | |Ai| |Zi| T = 2 T = 5 T = 10 Dec-Tiger 2 2 3 2 6561 3.43 1030 1.39 10977

Mabc 2 4 2 2 256 1.84 1019 3.23 10616

Grid-Small 2 16 5 2 390625 5.42 1044 3.09 101431

Recycling-Robots 2 4 3 2 6561 3.43 1030 1.39 10977

Box Pushing 2 100 4 5 3.34 107 5.23 10940 1.25 102939746

Mars Rovers 2 256 6 8 1.69 1014 1.88 107285 2.57 10238723869

Table 3: Domain parameters and maximum number of separable joint policies per horizons.

5.2 Empirical Analysis of our Algorithms

In this section, we compare FB-HSVI to other exact solvers. The exact Dec-POMDP solvers considered are the state-of-the-art methods including: GMAA*-ICE (Oliehoek et al., 2013), IPG (Amato et al., 2009), MILP (Aras & Dutech, 2010), and LPC (Boularias & Chaib-draa, 2008). IPG and LPC perform dynamic programming, GMAA*-ICE performs heuristic search and MILP is a mixed integer linear programming method. Results for GMAA*-ICE (provided by Matthijs Spaan), IPG, MILP, LPC were conducted on diﬀerent machines. Because of this, the timing results are not directly comparable, but are likely to only diﬀer by a small constant factor. Our three FB-HSVI variants (Table 2) were implemented in the same framework, using identical basic operations, such as occupancy state and value function updates, and separable joint decision rule selection. We terminate FB-HSVI whenever the distance between lower and upper bounds is within ǫ = 0.01. A time limit was set to 1000ms.

Optimally Solving Finite-Horizon Dec-POMDPs

5.2.1 Comparing to other Exact Planners

multi-agent tiger

recycling robot

meeting in a 3x3 grid

Table 4: Experiments comparing the computation times (in seconds) of all exact solvers (part 1). Time limit violations are indicated by , indicate unknown values. Bold entries correspond to the best known results for these benchmarks, both in terms of computational time and expected value.

Tables 4 and 5 show performance results for the exact algorithms. For each algorithm, we reported the computation time, which includes the time to compute heuristic values when appropriate (since all algorithms do not use the same heuristics). We also reported the best expected cumulative reward L ˇM,0(ξ0) at the initial occupancy state. Tables 4 and 5 clearly show that FB-HSVI allows for signiﬁcant improvement over the state-of-the-art solvers: for all tested benchmarks we provide results for longer horizons than have been solved previously (the bold entries). In many cases, an (epsilon) optimal solution can be found for horizons that are an order of magnitude larger than was previously solvable. There are two main reasons for FB-HSVI s performance. First, it searches in the space

Dibangoye, Amato, Buffet & Charpillet

of policies mapping lower-dimensional features to actions, whereas the other exact solvers search in the space of policies mapping full histories to actions. In addition, it uses a value function mapping occupancy states to reals allowing it to generalize the value function over unvisited occupancy states whereas all other solvers use value functions mapping partial policies to reals. FB-HSVI performs best when the domain possesses structure that results in a compact value function, as in the recycling robot and mabc domains.

broadcast channel

cooperative box-pushing

Mars rovers

Table 5: Experiments comparing the computation times (in seconds) of all exact solvers (part 2).

Optimally Solving Finite-Horizon Dec-POMDPs

5.2.2 Choosing a Method to Keep Information Concise

We now compare the local and truncation probabilistic equivalence notions we introduced to maintain concise the representations of the occupancy states, decision rules, and value functions.

2 3 4 5 6 7 0

2 3 4 5 6 7 8 9 10 0

2 3 4 5 6 7 8 9 10 0

Recycling Robot

2 3 4 5 6 7

2 3 4 5 6 0

Box-Pushing

2 3 4 5 6 7 8 9 10

Figure 12: Comparison of compression methods to maintain concise data through FB-HSVI-LPE and FB-HSVI-TPE. All graphs shows the memory requirements until convergence or time exceeds in the y-axis given the various number of planning horizons in x-axis.

Clearly, algorithms that use feature-based compact representations provide signiﬁcant savings in the number of maintained histories over those that do not (e.g., OHSVI). Using OHSVI, the number of generated histories grows (in the worst case) exponentially with the planning horizon. It is this exponential growth that explains why OHSVI, which does not use history aggregation,

Dibangoye, Amato, Buffet & Charpillet

cannot scale beyond planning horizon T = 4 over all tested domains (see Tables 4 and 5). For Recycling Robot at horizon T = 5, experimental results together with Table 3 show that algorithms that use our compression methods maintain up to 30 orders of magnitude less separable joint policies than algorithms that do not. The number of histories retained is important as all occupancy states, decision rules, and value functions are mappings from reachable histories (or corresponding labels). To this end, we compare LPE and TPE over our selection of benchmarks and various planning horizons. As previously discussed, though LPE yields compact occupancy states that are more concise than those that result from TPE, the latter eases the generalization of bounds, which speeds up the convergence to an optimal solution. Figure 12 reports the total number of joint labels denoted |L| that are explicitly maintained for FB-HSVI-LPE and FB-HSVI-TPE over various planning horizons. We observe that TPE yields more concise bound representations than LPE over most benchmarks and planning horizons (i.e., using TPE number |L| is lower than that using LPE). In particular, we notice that, in all tested domains, there is a bounded number of labels that is suﬃcient for representing optimal or near-optimal value functions. TPE often succeeds in identifying this memory-bounded parametric space, resulting in more concise value functions, whereas LPE often fails. In the Recycling Robot problem for example, TPE yields no more than 6 joint labels (i.e., histories) for all horizons whereas LPE maintains up to 38 diﬀerent joint labels, a number that keeps growing as the planning horizon increases. In fact, (Dibangoye, Amato, & Doniec, 2012; Becker, Zilberstein, Lesser, & Goldman, 2004) demonstrated that in the Recycling-Robot problem, the most recent private observation is suﬃcient to summarize all past private histories of each agent (i.e., only the four joint observations are necessary). Here, TPE yields 6 joint labels because it relies on joint action-observation histories rather than joint observation histories. In the Broadcast Channel domain, TPE yields no more than 4 joint labels for all horizons whereas LPE produces up to 20 diﬀerent joint labels. Again, these results are due to the underlying structure of the Broadcast Channel domain. In such a scenario, the future states of the world are conditionally independent of joint histories. Hence, TPE can forget about joint histories, and reason only about states. Another domain of interest is the Dec-Tiger problem. In this problem, for T = 6, TPE produces no more than 30 joint labels for all horizons whereas LPE maintains up to 126 diﬀerent joint labels. Our assumption is that there always exists an optimal separable joint policy that is periodic (with period 3) for the Dec-Tiger domain. In other words, there exists an optimal separable joint policy that depends on histories only upon the most recent three action-observation pairs. Also, there are many scenarios in which both equivalence relations would fail to identify a memory-bounded space of histories, even if such a space exists. For example, important information in a history may be spread over a few time steps, but not necessarily the last ones.

6. Discussion

While we have demonstrated that our method can solve Dec-POMDPs which are larger than those previously solved, many practical applications are much larger than domains considered in this paper. As a result, additional methods may be necessary to solve very large problems which do not permit the construction of a concise feature space while preserving optimality. This is of concern since the numbers of states and histories impact all occupancy states, separable joint decision rules, and value functions. Maintaining these objects for large feature spaces is prohibitive. This highlights

Optimally Solving Finite-Horizon Dec-POMDPs

the necessity of addressing the scalability issue of FB-HSVI through more concise (and possibly lossy) feature spaces. In that direction, we have already extended the general methodology presented in this paper along two lines: error-bounded approximations and tractable subclasses.

6.1 Error-Bounded Approximations

FB-HSVI can ﬁnd an optimal solution because it maintains concise representations that preserve optimality. This is both an advantage and a liability. On the one hand, for problems of a reasonable size, the algorithm can ﬁnd an optimal solution. On the other hand, in many realistic applications, it will run out of time or memory. These scalability limitations are because FB-HSVI maintains accurate estimates of (compact) occupancy states, value functions and decision rules. To improve the scalability of Dec-POMDP solvers, many researchers have investigated approximate solutions. A notable example of this family includes the memory-bounded dynamic programming (MBDP) algorithm for ﬁnite-horizon Dec-POMDPs (Seuken & Zilberstein, 2007; Carlin & Zilberstein, 2008; Dibangoye, Mouaddib, & Chaib-draa, 2009; Kumar & Zilberstein, 2010; Wu, Zilberstein, & Chen, 2010). These are dynamic programming methods that require bounded computational resources to produce heuristic solutions that empirically perform well in standard Dec-POMDP benchmarks. However, these methods do not possess any theoretical guarantees concerning the quality of their solutions. Recently, we introduced a framework for monitoring the error in FB-HSVI by replacing an exact estimate of (compact) occupancy states, decision rules and value functions, by their approximate counterparts (Dibangoye, Mouaddib, & Chaib-draa, 2011; Dibangoye, Buﬀet, & Charpillet, 2014). The resulting algorithm can solve Dec-POMDP instances with larger planning horizon while still providing strong theoretical guarantees. It is also worth noting that, because FB-HSVI is trial-based, it can be used as an anytime algorithm. That is, it alternates between the generation of an occupancy-state trajectory and the update of the current best value function. As the algorithm proceeds, the current (best) value function is improved at the expense of increased computational time. The algorithm can be terminated either when a satisfactory value function is attained, or when allocated planning time is exceeded. In either case, this algorithm can always provide online performance bounds on the returned value function illustrating how far from the optimal value function the current one is. In the future, we also would like to explore using occupancy states over observation histories (rather than action-observation histories), which were shown to be suﬃcient (along with actionobservation histories) simultaneous to this work (Oliehoek, 2013). The inclusion of observation histories could lead to further scalability gains by reducing the dimensionality of the feature space.

6.2 Tractable Subclasses

Many attempts to address the scalability issues in Dec-POMDPs rely on the use of tractable subclasses. These subclasses have additional assumptions that allow more concise representations for all occupancy states, decision rules and value functions; and therefore speed up the convergence towards an optimal solution. For instance, we have already shown that occupancy states over just states (and not agent histories) can be used in transitionand observation-independent Dec-MDPs (Becker et al., 2004) (where the state is fully determined by the joint observation) to greatly increase scalability while preserving optimality (Dibangoye et al., 2012; Dibangoye, Amato, Doniec, & Charpillet, 2013). By restrict-

Dibangoye, Amato, Buffet & Charpillet

ing attention to decentralized Markov policies (i.e., mappings from private states to private actions) we reduce the complexity signiﬁcantly (NP versus NEXP), and make it possible to optimally solve larger problems. We plan to investigate other forms of tractable structures including temporal dependencies or constraints that induce structured domains in both single and multi-agent settings (Dibangoye, Chaib-draa, & Mouaddib, 2008; Dibangoye, Shani, Chaib-Draa, & Mouaddib, 2009; Pajarinen, Hottinen, & Peltonen, 2013). In that line of research, we introduced a novel approach called structural analysis as a means of discovering the underlying structural properties embedded in certain decentralized decision-making problems (Dibangoye, Buﬀet, & Simonin, 2015). We also applied the general methodology presented in this paper to scale up the number of agents involved in the process. To this end, we consider domains that exhibit the locality of interactions (Dibangoye, Amato, Buﬀet, & Charpillet, 2015, 2014). Examples include the networked distributed partially observable Markov decision processes (ND-POMDPs) (Nair, Varakantham, Tambe, & Yokoo, 2005). We plan to explore applying our methodology and FB-HSVI to Dec POMDPs where agents have joint dynamics or rewards, as well as domains with delayed communication (Ooi & Wornell, 1996; Grizzle, Marcus, & Hsu, 1981; Oliehoek & Spaan, 2012), as a means of reducing the memory burden. A secondary (but no less important) issue concerning scalability in Dec-POMDPs pertains to eﬃcient methods to update occupancy states and value functions at the planning stage. The locality of interaction among agents may be exploited statically (e.g., Nair et al., 2005; Kumar & Zilberstein, 2009; Amato, Konidaris, & Kaelbling, 2014) or dynamically (e.g., Canu & Mouaddib, 2011) by considering factorization and graphical models in our representation and hence improve scalability. This is a critical issue as the number of occupancy states necessary to obtain a good solution may be exponential in the planning horizon. So, techniques that can eﬃciently update both occupancy states and value functions are of great importance.

7. Conclusion

This paper describes a novel way of representing Dec-POMDPs, as continuous-state MDPs with piecewise-linear convex value functions, and a scalable algorithm for generating ǫ-optimal solutions. We summarize the key contributions below. By exploiting the assumption of centralized planning for decentralized execution, our method recasts the Dec-POMDP problem into an equivalent deterministic and centralized fully observable MDP (using information that is available to all agents). Next, we identify a concise statistic the occupancy state that represents the state of the resulting fully observable MDP, which we call the occupancy-state MDP. We demonstrate that the optimal value functions of occupancy MDPs are piecewise linear and convex functions of the occupancy states. We also prove that an optimal solution of the occupancy-state MDP is an optimal solution to the corresponding Dec-POMDP. We also present the feature-based heuristic search value iteration (FB-HSVI) algorithm to ﬁnd an optimal solution to the occupancy-state MDP. This algorithm builds oﬀthe theory for solving POMDPs and MDPs, as our occupancy-state MDP allows these methods to be directly applied to Dec-POMDPs for the ﬁrst time. We believe FB-HSVI is a major step forward in scalable exact solutions for Dec-POMDPs. This scalability is achieved by deﬁning feature-based compact occupancy states and decision rules through the use of equivalence relations between private histories. These concise representations permit us to circumvent the exhaustive enumeration of an otherwise intractable number of occupancy states and decision rules.

Optimally Solving Finite-Horizon Dec-POMDPs

Another aspect of the improved scalability stems from generalization of the value function. This is achieved through the piecewise linear and convex functions in the occupancy-state MDP. We show that, although feature-based compact lower and upper bounds are no longer piecewiselinear and convex, they can still generalize value functions over the entire feature-based compact occupancy-state space. Experimentally, we show that FB-HSVI is able to outperform all current state-of-the-art exact Dec-POMDP solvers in common benchmark domains. These results show that ǫ-optimal solutions can be found for larger horizons in all problems and for horizons that are sometimes an order of magnitude larger than those that have previously been solved.

8. Acknowledgements

We would like to thank Matthijs Spaan for providing results for GMAA*-ICE as well as Frans Oliehoek, Akshat Kumar and the anonymous reviewers for their helpful comments. Research supported in part by AFOSR MURI project #FA9550-09-1-0538.

Appendix A. Correctness of Feature-Based Compact Bounds

A.1 Proof of Theorem 6

(a) By hypothesis, we have ξ t, such that v V M,t( ξ). Hence, we obtain successively:

v V M,t( ξ) (48)

= V ˇM,t(ξ) ( ξ = FΦξ(ξ) by Theorem 5), (49)

def= max β Λt

s,θ ξ(s, θ) β(s, θ) (by deﬁnition of V ˇM,t(ξ)), (50)

s,θ φ s, ϑ(s, θ) ξ(s, θ)

β( s, ϑ) (ξ projected onto Φξ), (51)

φ s, ϑ Φξ ξ( s, ϑ) β( s, ϑ), (52)

φ s, ϑ Φξ φs,ϑ( s, ϑ) ξ( s, ϑ)

β(s, ϑ) ( ξ projected onto Φ), (53)

def= V M,t(FΦ( ξ)), (54)

which ends the proof of Theorem 6.a. (b) By hypothesis, we have ξ t such that V M,t( ξ) ξ,GΦξ( β) for any arbitrary β R|Φ |. To

prove V M,t( ξ) ξ,GΦξ(GΦ( β)) for any arbitrary feature set Φ, we successively show:

ξ,GΦξ(GΦ( β)) def= X

φs,ϑ Φξ ξ(s, ϑ) X

φ s, ϑ Φ φ s, ϑ(s, ϑ) X

φ s, ϑ Φ φ s, ϑ( s, ϑ) β( s, ϑ), (55)

φs,ϑ Φξ ξ(s, ϑ) X

φ s, ϑ Φ β( s, ϑ)

φ s, ϑ Φ φ s, ϑ( s, ϑ) φ s, ϑ(s, ϑ)

, (Re-ordering). (56)

Dibangoye, Amato, Buffet & Charpillet

Before proceeding any further, we need to prove quantity φ s, ϑ(s, ϑ) is greater or equal to quantity P φ s, ϑ Φ φ s, ϑ( s, ϑ) φ s, ϑ(s, ϑ) (in the bracket of the last expression above). To this end, we start with the interpretations of each expression. The ﬁrst expression asks whether state-history pair (s, ϑ) belongs to cluster along with feature φ s, ϑ, an aﬃrmative answer results in value φ s, ϑ(s, ϑ) = 1 otherwise 0. Let φ s , ϑ be the feature in Φ whose cluster includes state-history pair (s, ϑ). Then, the second expression asks whether both state-history pairs ( s , ϑ ) and (s, ϑ) belong to cluster along with feature φ s, ϑ Φ , an aﬃrmative answer will result in value P φ s, ϑ Φ φ s, ϑ( s, ϑ) φ s, ϑ(s, ϑ) = 1 otherwise 0. Clearly, the second expression is a stricter form of the ﬁrst expression, hence φ s, ϑ(s, ϑ) is greater or equal to P φ s, ϑ Φ φ s, ϑ( s, ϑ) φ s, ϑ(s, ϑ). Thus, by replacing P φ s, ϑ Φ φ s, ϑ( s, ϑ) φ s, ϑ(s, ϑ) by φ s, ϑ(s, ϑ), we obtain:

ξ,GΦξ(GΦ( β)) = X

φs,ϑ Φξ ξ(s, ϑ) X

φ s, ϑ Φ β( s, ϑ)

φ s, ϑ Φ φ s, ϑ( s, ϑ) φ s, ϑ(s, ϑ)

φs,ϑ Φξ ξ(s, ϑ) X

φ s, ϑ Φ φ s, ϑ(s, ϑ) β( s, ϑ), (58)

def= ξ,GΦξ( β) (59)

V M,t( ξ), (60)

which ends the proof Theorem 6.b.

A.2 Proof of Theorem 7

The proof proceeds by induction. Heuristic function (U M,t)t {0,1,...,T}, initially upper bounds the optimal value function, since it is initialized using the underlying MDP value function. For the induction step, we assume heuristic function (U M,t)t {0,1,...,T} represented using point sets ( Ψt)t {0,1,...,T} upper bounds the optimal value function. Next, we show that, at any arbitrary time step t {0, 1, . . . , T}, heuristic function U M,t, after the

update of U M,t resulting in upper bound v ξ at occupancy state ξ, is also an upper bound on V M,t over the entire compact occupancy space t. That is,

ξ t : U M,t( ξ ) U M,t( ξ ) V M,t( ξ ). (61)

We ﬁrst show that ξ t : U M,t( ξ ) U M,t( ξ ). Using the sawtooth interpolation approach3, we obtain successively:

U M,t( ξ ) def= min v ξ , v ξ ξ | ( ξ 7 v ξ ) Ψt { ξ 7 v ξ} , (see sawtooth deﬁnition) (62)

= min v ξ ξ , U M,t( ξ ) , (63)

U M,t( ξ ), (64)

which proves the ﬁrst part of expression (Eq. 61). Now, we show ξ t : U M,t( ξ ) V M,t( ξ ). To this end, we distinguish between before and after the update of the U M,t.

3. Here, we adapted the sawtooth interpolation to replace full occupancy states by compact occupancy states.

Optimally Solving Finite-Horizon Dec-POMDPs

Before the update, the following holds:

ξ t : U M,t( ξ ) V M,t( ξ ), (65)

by the inductive hypothesis.

After the update, we obtain two important results. On the one hand, we have that the resulting value v ξ is an upper bound at ξ:

v ξ def= max dξ A R( ξ, dξ) + U M,t+1( P( ξ, dξ)), (66)

max dξ A R( ξ, dξ) + V M,t+1( P( ξ, dξ)), (67)

def= V M,t( ξ). (68)

On the other hand, we show that from v ξ one can extrapolate an upper bound value v ξ ξ for any other

compact occupancy state ξ . This is mainly thanks to Theorem 6.a, in which we demonstrate: if expression V M,t( ξ) v ξ holds, then for any arbitrary ξ t, expression V M,t(FΦξ ( ξ)) v ξ holds as well. The sawtooth interpolation method concludes this argument as follows:

def= v ξ + (v ξ v ξ) D(FΦξ ( ξ), ξ ), (69)

V M,t( ξ ). (70)

In fact, the sawtooth interpolation can always generate an upper-bound for one compact occupancy state from the upper bound of any compact occupancy state as long as both are expressed into the same feature set.

A.3 Proof of Theorem 8

The proof proceeds by induction as well. Heuristic function (L M,t)t {0,1,...,T} initially lower bounds the optimal value function since we initialize it using the value function associated with the worst separable joint policy. The one that prescribes the agents the joint action that yields the minimum reward, and this over all time steps, i.e., L M,t( ) def= (T t) mins,a ra(s), for all t {0, 1, . . . , T}. For the induction step, we assume heuristic function (L M,t)t {0,1,...,T} represented using compact vector sets ( Λt)t {0,1,...,T} lower bounds the optimal value function. Next, we show that for any arbitrary time step t {0, 1, . . . , T}, heuristic function L M,t, which results from the update of lower bound L M,t and produces compact vector βξ along feature set Φξ, is also a lower bound on V M,t over the entire compact occupancy space t. That is,

ξ t : L M,t( ξ ) L M,t( ξ ) V M,t( ξ ). (71)

We ﬁrst show that ξ t : L M,t( ξ ) L M,t( ξ ). In fact,

L M,t( ξ ) def= max β Λt ξ ,GΦξ ( β) , (72)

max β Λt { βξ} ξ ,GΦξ ( β) , (73)

def= L M,t( ξ ), (74)

Dibangoye, Amato, Buffet & Charpillet

which proves the ﬁrst part. Now, we show ξ t : L M,t( ξ ) V M,t( ξ ). To this end, we distinguish between before and after the update of L M,t.

Before the update, by the induction hypothesis, we have:

ξ t : L M,t( ξ ) V M,t( ξ ). (75)

After the update, we obtain: ξ t,

V M,t( ξ ) def= max dξ A R( ξ , dξ ) + V M,t+1( P( ξ , dξ )), (76)

= max dξ A ξ , r dξ + max α Λ t+1 ξ p dξ ,GΦ ξ p dξ ( α) , (77)

= max dξ A, α Λ t+1 ξ , r dξ + GΦ ξ p dξ ( α) (p dξ ) , (re-arranging terms ) (78)

max dξ A, α Λt+1 ξ , r dξ + GΦ ξ p dξ ( α) (p dξ ) , (replace Λ t+1 by Λt+1 ) (79)

max dξ A, α Λt+1 ξ ,GΦξ (r dξ + GΦ ξp dξ ( α) (p dξ) )) , (Lemma 6) (80)

ξ ,GΦξ ( ] backup( ξ, Λt+1)) , (retain one element). (81)

Merging together arguments before and after the update, i.e., L M,t( ξ ) V M,t( ξ ), we prove the second part of the proof. This ends the proof.

Appendix B. Subroutines

This section gives subroutines that are required to compute feature-based compact occupancy states using either local or truncation probabilistic equivalence relations (Algorithm 3).

Algorithm 3: Compact feature-based occupancy state through LPE.

function Compact-LPE(ξt)

S {(s, θ) S Θt : ξt(s, θ) > 0} foreach (s, θ) S do

S S\{(s, θ)} and ξt(s, θ) ξt(s, θ) foreach ( s, θ ) S do

if Are State Joint History Pairs LPE((s, θ), ( s, θ ), ξt) then

S S\{( s, θ )} and ξt(s, θ) ξt(s, θ) + ξt( s, θ )

function Compact-TPE(ξt)

mξt get Truncation Param(ξt) and S S Θt(ξ) foreach (s, θ) S do

S S\{(s, θ)} and ξt(s, θ) ξt(s, θ) foreach ( s, θ ) S do

if Are State Joint History Pairs TPE((s, θ), ( s, θ ), mξt) then

S S\{( s, θ )} and ξt(s, θ) ξt(s, θ) + ξt( s, θ )

Optimally Solving Finite-Horizon Dec-POMDPs

Algorithm 4: Subroutines for compact feature-based occupancy state through LPE and TPE.

function Are Private Histories LPE(θi t, θ i t , ξt)

foreach θ i t Θ i t (ξt) and s S do

if Pr(s, θ i t |ξ0, θt) , Pr(s, θ i t |ξ0, θ t) then return False

return True

function Are State Joint History Pairs LPE((s, θi t i I), ( s, θ i t i I), ξt)

if s , s then return False foreach i I do

if Are Private History LPE(θi t, θ i t , ξt) then return False

return True

function get Truncation Param(ξt)

m 0 foreach i I do

foreach θi t, θ i t Θi t(ξt) do

if Suffix(θi t, m) = Suffix(θ i t , m) then

if Are Private Histories LPE(θi t, θ i t , ξt) then m m + 1;

function Are Private Histories TPE(θi t, θ i t , mξt)

return Suffix(θi t, mξt) = Suffix(θ i t , mξt)

function Are State Joint History Pairs TPE((s, θi t i I), ( s, θ i t i I), mξt)

if s , s then return False foreach i I do

if Are Private History LPE(θi t, θ i t , mξt) then return False

return True

Amato, C., Bernstein, D. S., & Zilberstein, S. (2010). Optimizing ﬁxed-size stochastic controllers for POMDPs and decentralized POMDPs. Journal of Autonomous Agents and Multi-Agent Systems, 21(3), 293 320.

Amato, C., Chowdhary, G., Geramifard, A., Ure, N. K., & Kochenderfer, M. J. (2013). Decentralized control of partially observable Markov decision processes. In 54th IEEE Conference on Decision and Control.

Amato, C., Dibangoye, J. S., & Zilberstein, S. (2009). Incremental policy generation for ﬁnitehorizon DEC-POMDPs. In Proceedings of the Nineteenth International Conference on Automated Planning and Scheduling.

Amato, C., Konidaris, G. D., Anders, A., Cruz, G., How, J. P., & Kaelbling, L. P. (2015). Policy search for multi-robot coordination under uncertainty. In Proceedings of the Robotics: Science and Systems Conference.

Amato, C., Konidaris, G. D., & Kaelbling, L. P. (2014). Planning with macro-actions in decentralized POMDPs. In Proceedings of the Thirteenth International Conference on Autonomous Agents and Multiagent Systems.

Dibangoye, Amato, Buffet & Charpillet

Aras, R., & Dutech, A. (2010). An investigation into mathematical programming for ﬁnite horizon decentralized POMDPs. Journal of Artiﬁcial Intelligence Research, 37, 329 396.

Banerjee, B., Lyle, J., Kraemer, L., & Yellamraju, R. (2012). Sample bounded distributed reinforcement learning for decentralized POMDPs. In Proceedings of the Twenty-Sixth AAAI Conference on Artiﬁcial Intelligence, pp. 1256 1262, Toronto, Canada.

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artiﬁcial Intelligence, 72(1-2), 81 138.

Becker, R., Zilberstein, S., Lesser, V. R., & Goldman, C. V. (2004). Solving transition independent decentralized Markov decision processes. Journal of Artiﬁcial Intelligence Research, 22, 423 455.

Bellman, R. E. (1957). Dynamic Programming. Dover Publications, Incorporated.

Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4).

Boularias, A., & Chaib-draa, B. (2008). Exact dynamic programming for decentralized POMDPs with lossless policy compression. In Proceedings of the Eighteenth International Conference on Automated Planning and Scheduling, pp. 20 27.

Canu, A., & Mouaddib, A.-I. (2011). Collective decision under partial observability - a dynamic local interaction model. In IJCCI (ECTA-FCTA), pp. 146 155.

Carlin, A., & Zilberstein, S. (2008). Value-based observation compression for DEC-POMDPs. In Proceedings of the Seventh International Conference on Autonomous Agents and Multiagent Systems.

De Farias, D. P., & Van Roy, B. (2003). The linear programming approach to approximate dynamic programming. Operations Research, 51(6), 850 865.

Dibangoye, J. S., Amato, C., Buﬀet, O., & Charpillet, F. (2014). Exploiting separability in multiagent planning with continuous-state MDPs. In Proceedings of the Thirteenth International Conference on Autonomous Agents and Multiagent Systems, pp. 1281 1288.

Dibangoye, J. S., Amato, C., Buﬀet, O., & Charpillet, F. (2015). Exploiting separability in multiagent planning with continuous-state MDPs (extended abstract). In Proceedings of the Twenty Fifth International Joint Conference on Artiﬁcial Intelligence, pp. 4254 4260.

Dibangoye, J. S., Amato, C., & Doniec, A. (2012). Scaling up decentralized MDPs through heuristic search. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artiﬁcial Intelligence, pp. 217 226.

Dibangoye, J. S., Amato, C., Doniec, A., & Charpillet, F. (2013). Producing eﬃcient error-bounded solutions for transition independent decentralized MDPs. In Proceedings of the Twelfth International Conference on Autonomous Agents and Multiagent Systems, pp. 539 546.

Dibangoye, J. S., Buﬀet, O., & Simonin, O. (2015). Structural results for cooperative decentralized control models. In Proceedings of the Twenty-Fifth International Joint Conference on Artiﬁcial Intelligence, pp. 46 52.

Dibangoye, J. S., Mouaddib, A.-I., & Chaib-draa, B. (2009). Point-based incremental pruning heuristic for solving ﬁnite-horizon DEC-POMDPs. In Proceedings of the Eighth International Conference on Autonomous Agents and Multiagent Systems, pp. 569 576.

Optimally Solving Finite-Horizon Dec-POMDPs

Dibangoye, J. S., Mouaddib, A.-I., & Chaib-draa, B. (2011). Toward error-bounded algorithms for inﬁnite-horizon Dec-POMDPs. In Proceedings of the Tenth International Conference on Autonomous Agents and Multiagent Systems, pp. 947 954.

Dibangoye, J. S., Shani, G., Chaib-Draa, B., & Mouaddib, A.-I. (2009). Topological order planner for POMDPs. In Proceedings of the Twenty-Second International Joint Conference on Artiﬁcial Intelligence, pp. 1684 1689.

Dibangoye, J. S., Amato, C., Buﬀet, O., & Charpillet, F. (2013). Optimally solving Dec-POMDPs as continuous-state MDPs. In Proceedings of the Twenty-Fourth International Joint Conference on Artiﬁcial Intelligence.

Dibangoye, J. S., Buﬀet, O., & Charpillet, F. (2014). Error-bounded approximations for inﬁnitehorizon discounted decentralized POMDPs. In Proceedings of the Twenty-Fourth European Conference on Machine Learning, pp. 338 353.

Dibangoye, J. S., Chaib-draa, B., & Mouaddib, A.-I. (2008). A novel prioritization technique for solving Markov decision processes. In Proceedings of the 21th International Conference of the Florida Artiﬁcial Intelligence Research Society, pp. 537 542.

Grizzle, J. W., Marcus, S. I., & Hsu, K. (1981). Decentralized control of a multiaccess broadcast network. In 20th IEEE Conference on Decision and Control including the Symposium on Adaptive Processes, Vol. 20, pp. 390 391.

Hansen, E. A., Bernstein, D. S., & Zilberstein, S. (2004). Dynamic programming for partially observable stochastic games. In Proceedings of the Nineteenth National Conference on Artiﬁcial Intelligence, pp. 709 715.

Hansen, E. A., & Zilberstein, S. (2001). LAO*: A heuristic search algorithm that ﬁnds solutions with loops. Artiﬁcial Intelligence, 129(1-2), 35 62.

Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Systems Science and Cybernetics, 4(2), 100 107.

Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artiﬁcial Intelligence Research, 13, 33 94.

Howard, R. A. (1960). Dynamic Programming and Markov Processes. The M.I.T. Press.

Jain, M., Taylor, M. E., Tambe, M., & Yokoo, M. (2009). DCOPs meet the real world: Exploring unknown reward matrices with applications to mobile sensor networks. In Proceedings of the Twenty-Second International Joint Conference on Artiﬁcial Intelligence, pp. 181 186.

Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artiﬁcial Intelligence, 101(1-2), 99 134.

Korf, R. E. (1990). Real-time heuristic search. Artiﬁcial Intelligence, 42(2-3), 189 211.

Kumar, A., & Zilberstein, S. (2009). Constraint-based dynamic programming for decentralized POMDPs with structured interactions. In Proceedings of the Eighth International Conference on Autonomous Agents and Multiagent Systems, pp. 561 568.

Kumar, A., & Zilberstein, S. (2010). Point-based backup for decentralized POMDPs: complexity and new algorithms. In Proceedings of the Ninth International Conference on Autonomous Agents and Multiagent Systems, pp. 1315 1322.

Dibangoye, Amato, Buffet & Charpillet

Nair, R., Tambe, M., Yokoo, M., Pynadath, D. V., & Marsella, S. (2003). Taming decentralized POMDPs: Towards eﬃcient policy computation for multiagent settings. In Proceedings of the Eighteenth International Joint Conference on Artiﬁcial Intelligence, pp. 705 711.

Nair, R., Varakantham, P., Tambe, M., & Yokoo, M. (2005). Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In Proceedings of the Twentieth National Conference on Artiﬁcial Intelligence, pp. 133 139.

Oliehoek, F. A. (2012). Decentralized POMDPs. In Wiering, M., & van Otterlo, M. (Eds.), Reinforcement Learning: State of the Art, Vol. 12, pp. 471 503. Springer Berlin Heidelberg, Berlin, Germany.

Oliehoek, F. A. (2013). Suﬃcient plan-time statistics for decentralized POMDPs. In Proceedings of the Twenty-Fourth International Joint Conference on Artiﬁcial Intelligence.

Oliehoek, F. A., Spaan, M. T. J., Amato, C., & Whiteson, S. (2013). Incremental clustering and expansion for faster optimal planning in Dec-POMDPs. Journal of Artiﬁcial Intelligence Research, 46, 449 509.

Oliehoek, F. A., Spaan, M. T. J., & Vlassis, N. A. (2008). Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artiﬁcial Intelligence Research, 32, 289 353.

Oliehoek, F. A., & Spaan, M. T. (2012). Tree-based solution methods for multiagent POMDPs with delayed communication. In Proceedings of the Twenty-Sixth AAAI Conference on Artiﬁcial Intelligence.

Oliehoek, F. A., Whiteson, S., & Spaan, M. T. J. (2009). Lossless clustering of histories in decentralized POMDPs. In Proceedings of the Eighth International Conference on Autonomous Agents and Multiagent Systems, pp. 577 584.

Ooi, J. M., & Wornell, G. W. (1996). Decentralized control of a multiple access broadcast channel: Performance bounds. In Proc. of the 35th IEEE Conference on Decision and Control, Vol. 1, pp. 293 298. IEEE.

Pajarinen, J., Hottinen, A., & Peltonen, J. (2013). Optimizing spatial and temporal reuse in wireless networks by decentralized partially observable Markov decision processes. IEEE Transactions on Mobile Computing, 13(4). Preprint.

Paquet, S., Chaib-draa, B., Dallaire, P., & Bergeron, D. (2010). Task allocation learning in a multiagent environment: Application to the robocuprescue simulation. Multiagent and Grid Systems, 6(4), 293 314.

Pineau, J., Gordon, G. J., & Thrun, S. (2006). Anytime point-based approximations for large POMDPs. Journal of Artiﬁcial Intelligence Research, 27, 335 380.

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley Series in Probability and Statistics). Wiley-Interscience.

Puterman, M. L. (1994). Markov Decision Processes, Discrete Stochastic Dynamic Programming. Wiley-Interscience, Hoboken, New Jersey.

Roy, N., Gordon, G. J., & Thrun, S. (2005). Finding approximate POMDP solutions through belief compression. Journal of Artiﬁcial Intelligence Research, 23, 1 40.

Optimally Solving Finite-Horizon Dec-POMDPs

Seuken, S., & Zilberstein, S. (2007). Improved memory-bounded dynamic programming for DECPOMDPs. In Proceedings of the Twenty-Third Conference on Uncertainty in Artiﬁcial Intelligence.

Shani, G., Pineau, J., & Kaplow, R. (2013). A survey of point-based POMDP solvers. Journal of Autonomous Agents and Multi-Agent Systems, 27(1), 1 51.

Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov decision processes over a ﬁnite horizon. Operations Research, 21(5), 1071 1088.

Smith, T. (2007). Probabilistic Planning for Robotic Exploration. Ph.D. thesis, The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Smith, T., & Simmons, R. (2004). Heuristic search value iteration for POMDPs. In Proceedings of the Twentieth Conference on Uncertainty in Artiﬁcial Intelligence, pp. 520 527, Arlington, Virginia, United States.

Smith, T., & Simmons, R. G. (2006). Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In Proceedings of the Twenty-First AAAI Conference on Artiﬁcial Intelligence, pp. 1227 1232.

Spaan, M. T. J., Oliehoek, F. A., & Amato, C. (2011). Scaling up optimal heuristic search in Dec POMDPs via incremental expansion. In Proceedings of the Twenty-Third International Joint Conference on Artiﬁcial Intelligence, pp. 2027 2032.

Szer, D., Charpillet, F., & Zilberstein, S. (2005). MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proceedings of the Twenty-First Conference on Uncertainty in Artiﬁcial Intelligence, pp. 568 576.

Tsitsiklis, J. N., & van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3), 59 94.

Velagapudi, P., Varakantham, P., Sycara, K., & Scerri, P. (2011). Distributed model shaping for scaling to decentralized POMDPs with hundreds of agents. In Proceedings of the Tenth International Conference on Autonomous Agents and Multiagent Systems, pp. 955 962.

Winstein, K., & Balakrishnan, H. (2013). TCP ex Machina: Computer-generated congestion control. In SIGCOMM, Hong Kong.

Wu, F., Zilberstein, S., & Chen, X. (2010). Point-based policy generation for decentralized POMDPs. In Proceedings of the Ninth International Conference on Autonomous Agents and Multiagent Systems, pp. 1307 1314.

Wu, F., Zilberstein, S., & Chen, X. (2011). Online planning for multi-agent systems with bounded communication. Artiﬁcial Intelligence, 175(2), 487 511.

Zilberstein, S., Washington, R., Bernstein, D. S., & Mouaddib, A.-I. (2002). Decision-theoretic control of planetary rovers. In Revised Papers from the International Seminar on Advances in Plan-Based Control of Robotic Agents, pp. 270 289, London, UK. Springer-Verlag.