# Knowledge-Based Policies for Qualitative Decentralized POMDPs

Abdallah Saffidine, University of New South Wales, Sydney, Australia
François Schwarzentruber, Univ Rennes, CNRS, IRISA, France
Bruno Zanuttini, Normandie Univ; UNICAEN, ENSICAEN, CNRS, GREYC; 14000 Caen, France

## Abstract

Qualitative Decentralized Partially Observable Markov Decision Problems (QDec-POMDPs) constitute a very general class of decision problems. They involve multiple agents, decentralized execution, sequential decision making, partial observability, and uncertainty. Typically, joint policies, which prescribe to each agent an action to take depending on its full history of (local) actions and observations, are huge, which makes it difficult to store them onboard at execution time, and also hampers the computation of joint plans. We propose and investigate a new representation for joint policies in QDec-POMDPs, which we call Multi-Agent Knowledge-Based Programs (MAKBPs), and which uses epistemic logic for compactly representing conditions on histories. Contrary to standard representations, executing an MAKBP requires reasoning at execution time, but we show that MAKBPs can be exponentially more succinct than any reactive representation.

## Introduction

Knowledge-Based Programs (KBPs) (Fagin et al. 2003) describe policies that agents should perform according to their knowledge, such as [if $K_i\Phi$ then $\kappa$], where $K_i$ is the knowledge modality for agent $i$, $\Phi$ is a formula, and $\kappa$ is a program. As shown by Lang and Zanuttini (2012; 2013) for the single-agent case, KBPs offer compact representations of policies for planning problems. Moreover, arguably, they provide a natural level of expressivity for experts to design policies for autonomous agents.

We are interested here in multi-agent decision problems, for which we generalize KBPs. Precisely, we consider Qualitative Decentralized Partially Observable Markov Decision Problems (QDec-POMDPs), as formalized by Brafman, Shani, and Zilberstein (2013). Such problems involve multiple agents, decentralized execution, sequential decision making, partial observability, and uncertainty. A single agent executing a policy for such a problem must reason about what the other agents are executing, what they have observed, and what they believe or know. However, this is taken into account only indirectly by standard representations of policies, and as a consequence such policies are typically huge. By making the reasoning about other agents' knowledge explicit in Multi-Agent KBPs (MAKBPs), we provide more succinct representations. The counterpart is that executing MAKBPs requires reasoning, hence they do not constitute reactive policies. However, in many scenarios some reasoning at execution time may be affordable. Typical examples are applications in which executing atomic actions is not instantaneous (as in extraterrestrial missions), so that deliberation can be performed in parallel.

Knowledge-based policies have been considered in planning settings other than QDec-POMDPs; in particular, they arise naturally in epistemic planning (Bolander and Andersen 2011). Other works have taken inspiration from reasoning about knowledge to design approaches for multi-agent (classical) planning (Kominis and Geffner 2015).
However, the originality of the (qualitative) Dec-POMDP model is that it assumes decentralized execution, but centralized planning. This makes it a challenging problem, as it requires planning offline a joint plan which can be executed in a decentralized fashion. The challenge is visible in the large body of work about the resolution of quantitative (stochastic) Dec-POMDPs (see, for instance, Oliehoek and Amato (2016) and Dibangoye et al. (2016)).

Centralized planning with decentralized execution means that agents agree on a policy offline, then execute it independently of each other. Hanabi is a flagship example of such settings. In this cooperative card game with partial observability and limited communication, players typically agree on conventions before the game takes place. Researchers have started building computer players for Hanabi, notably by encoding human-designed policies (van den Bergh 2015; Cox et al. 2015). Furthermore, describing and using policies that involve epistemic reasoning to determine the next action is seen as a gateway to high-performance automated players (Osawa 2015). Yet, no formal framework allowing such policies has been put forward.

Example 1 (running) As a running example, assume Alice wants to meet Bob. She may take a train or a flight from her place to his, but there is a risk of air crew strike. Alice and Bob may call each other beforehand (centralized planning phase) to agree on the following plan. Alice tries to fly. If there is a strike, she takes the train and listens to the radio to know whether Bob can know this. If the strike is not announced on the radio (hence Bob cannot know), after arriving at the station she will reach the airport to meet Bob there. Otherwise they will meet at her place of arrival. It can be seen that this plan, once agreed upon, can be executed successfully in a decentralized fashion.

Besides defining MAKBPs as a new representation of joint policies for QDec-POMDPs, we make the following contributions. We prove that this representation is as succinct as standard ones, e.g., joint policy trees, and we give families of problems for which it is exponentially more succinct. Contrary to standard representations, determining the action to execute next is nontrivial in MAKBPs, as it involves explicit reasoning about other agents' knowledge. We then investigate the complexity of this execution problem and that of verifying whether a given MAKBP solves a given QDec-POMDP, and we show that both these essential problems are intractable: both are PSPACE-complete if the horizon is bounded, and verification is undecidable otherwise.[1]

[1] A simple demonstration of executions of KBPs can be found in the tool Hintikka's World (http://people.irisa.fr/Francois.Schwarzentruber/hintikkasworld/).

## Decentralized POMDPs

We define the planning tasks which we address in two steps: first the model, encoding the dynamics of actions and observations (Definition 2), then the planning task itself, which specifies an initial (belief) state, a goal, and a planning horizon (Definition 3). Our definitions essentially follow those given by Brafman, Shani, and Zilberstein (2013). We generalize them slightly by (1) allowing nondeterminism in the transitions and observations, (2) allowing the observations to depend on the previous state as well as the action and resulting state, and (3) allowing arbitrary observations rather than only the truth value of propositions.
For simplicity, we assume that all agents have the same actions, and we do not consider explicit preconditions.[2]

[2] The trick to model unexecutable joint actions consists of adding a sink state to which any unexecutable joint action leads.

Definition 2 (factored QDec-POMDP model) A factored model for a Qualitative Decentralized Partially Observable Markov Decision Problem (QDec-POMDP) is a tuple $M = \langle I, X, A, \Omega, T \rangle$, where $I$, $X$, $A$, $\Omega$ are finite sets of agents, propositions, actions, and observations, respectively, and $T$ is a transition function, which maps each couple $(s, \vec{a}) \in 2^X \times A^I$ to a set of couples $\{(s', \vec{\omega})\} \subseteq 2^X \times \Omega^I$.

A Boolean assignment $s$ to $X$ is called a state; $s[x]$ denotes the value it assigns to $x$. An element $\vec{a}$ of $A^I$ is called a joint action, and specifies one (individual) action $a_i$ per agent $i$. We use similar notation for observations. The transition function encodes the dynamics of actions and observations: $(s', \vec{\omega}) \in T(s, \vec{a})$ means that when the state is $s$ and the agents take joint action $\vec{a}$, it is possible that both the environment transitions to state $s'$ and each agent $i$ receives the observation $\omega_i$. Note that the next states and observations depend on the current state and the joint action taken.

At execution time (when agents act), at any timestep $t$ the environment is in some state $s^t \in 2^X$. Each agent $i$ (simultaneously and independently) chooses and takes an (individual) action $a_i^t \in A$, defining the joint action $\vec{a}^t$. Then the environment nondeterministically chooses a couple $(s^{t+1}, \vec{\omega}^t)$ from $T(s^t, \vec{a}^t)$; it transitions to state $s^{t+1}$; each agent $i$ receives the individual observation $\omega_i^t$; and the process repeats.

Definition 3 (factored QDec-POMDP) A factored QDec-POMDP is a quadruple $\Pi = \langle M, B_0, g, H \rangle$, where $M = \langle I, X, A, \Omega, T \rangle$ is a factored QDec-POMDP model, $B_0 \subseteq 2^X$ is the initial belief state, $g$ is a Boolean formula over $X$ called the goal, and $H \in \mathbb{N} \cup \{\infty\}$ is the horizon.[3]

[3] By assuming $g$ to be a Boolean formula, we stick to the standard definition of a Dec-POMDP. However, it is easily seen that allowing epistemic goals, e.g., that all agents know the value of some variable, would not change the results in the paper.

### Histories and policies

Let $M = \langle I, X, A, \Omega, T \rangle$ be a QDec-POMDP model. A $t$-history for $M$ is a sequence $h^t = s^0 \vec{a}^0 \vec{\omega}^0 s^1 \vec{a}^1 \vec{\omega}^1 \dots s^{t-1} \vec{a}^{t-1} \vec{\omega}^{t-1} s^t$ satisfying $\forall t' < t, (s^{t'+1}, \vec{\omega}^{t'}) \in T(s^{t'}, \vec{a}^{t'})$. Of $h^t$, the only information available to an agent $i \in I$ is its local history $h_i^t = a_i^0 \omega_i^0 a_i^1 \omega_i^1 \dots a_i^{t-1} \omega_i^{t-1}$, that is, the sequence of (individual) actions it has taken and (individual) observations it has received. We write $H^t$ for the set of all $t$-histories for $M$ ($M$ will always be clear from the context), and $H_i^t$ for the set of all local $t$-histories for $i$. We also use the notation $H^{\leq t} = \bigcup_{t'=0}^{t} H^{t'}$ and $H_i^{\leq t} = \bigcup_{t'=0}^{t} H_i^{t'}$.

A (deterministic) $t$-policy for $i$ is a mapping $\pi_i : H_i^{\leq t} \to A$. Action $\pi_i(h_i^t)$ is the one which $i$ should take when its local history so far is $h_i^t$. A joint $t$-policy $\vec{\pi}$ is simply a vector of $t$-policies, one for each agent $i$. A history $h^t = s^0 \vec{a}^0 \vec{\omega}^0 \dots s^{t-1} \vec{a}^{t-1} \vec{\omega}^{t-1} s^t$ is said to be consistent with $\vec{\pi}$ if at all timesteps, each agent acts according to its policy, namely, for all timesteps $t' < t - 1$ and all agents $i$, $a_i^{t'+1}$ is $\pi_i(a_i^0 \omega_i^0 \dots a_i^{t'} \omega_i^{t'})$. A joint $t$-policy $\vec{\pi}$ is valid for $\Pi = \langle M, B_0, g, H \rangle$ if for all $H$-histories $s^0 \vec{a}^0 \vec{\omega}^0 \dots s^{H-1} \vec{a}^{H-1} \vec{\omega}^{H-1} s^H$ with $s^0 \in B_0$ and consistent with $\vec{\pi}$, the final state $s^H$ satisfies the goal $g$. Note that we require the final state to satisfy $g$: this implicitly requires that the agents know when $g$ is achieved and maintain it until timestep $H$ (e.g., using void actions if available).
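To make the preceding definitions concrete, here is a minimal Python sketch (a toy encoding of our own; all names are hypothetical) of a factored model together with a brute-force validity check, which enumerates every history consistent with a joint policy, exactly as in the definition above.

```python
# Minimal sketch of Definitions 2 and 3 (the encoding and all names are ours,
# for illustration): a state is a frozenset of true propositions, and joint
# actions and joint observations are tuples indexed like `agents`.

class QDecPOMDP:
    def __init__(self, agents, transition, b0, goal, horizon):
        self.agents = agents    # list of agent identifiers
        self.T = transition     # (state, joint_action) -> set of (state', joint_obs)
        self.b0 = b0            # set of possible initial states
        self.goal = goal        # predicate on states: state -> bool
        self.horizon = horizon  # finite horizon H

def is_valid(model, policies):
    """Brute-force validity check: a joint policy (here, one dict per agent,
    mapping local histories to actions) is valid iff every H-history starting
    in B0 and consistent with it ends in a goal state.  Exponential in H, for
    illustration only."""
    # A local history is the tuple of (action, observation) pairs seen so far.
    frontier = [(s, {i: () for i in model.agents}) for s in model.b0]
    for _ in range(model.horizon):
        next_frontier = []
        for state, local in frontier:
            joint_action = tuple(policies[i][local[i]] for i in model.agents)
            # The environment may pick any (state', observations) allowed by T.
            for state2, joint_obs in model.T(state, joint_action):
                local2 = {i: local[i] + ((joint_action[k], joint_obs[k]),)
                          for k, i in enumerate(model.agents)}
                next_frontier.append((state2, local2))
        frontier = next_frontier
    return all(model.goal(state) for state, _ in frontier)
```

Representing each $\pi_i$ as an explicit mapping from local histories to actions, as this sketch does, is precisely what the rest of the paper seeks to avoid: the number of local histories grows exponentially with the horizon.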
Finally, two joint policies are said to be $t$-equivalent (with respect to $\Pi$) if they induce exactly the same sets of consistent $t$-histories.

In this paper, we are interested in joint policies which are computed offline in a centralized manner (as opposed to execution, which is decentralized), as is standard in the literature on Dec-POMDPs (Brafman, Shani, and Zilberstein 2013; Oliehoek and Amato 2016; Dibangoye et al. 2016).

## Representations

Designing compact representations for (joint) policies is crucial, due to the combinatorial nature of $H_i^{\leq t}$. The most direct representations are (1) by policy trees, which the agent follows by branching on the observations received and executing the actions stored at the nodes, and (2) by finite state controllers, i.e., labelled automata in which transitions are fired by observations and each state holds an action to take (see Kumar, Mostafa, and Zilberstein (2016)).[4]

[4] Note that finite state controllers constitute a much more general class of representations than policy trees. In particular, if there are (feasible) loops in the automaton, the policy can be executed for an infinite number of steps.

A well-known drawback of such representations is that they can be huge even for simple problems. Observe indeed that there are as many as $|\Omega|^t$ local $t$-histories, each of which can yield a distinct path in the tree or automaton. On the other hand, an advantage of using these representations is that they are reactive, in the sense that they are easily executable. Taking inspiration from Bäckström and Jonsson (2012), we formalize this as follows.

Definition 4 (execution problem) The execution problem takes as input a QDec-POMDP model $M$, an initial belief state $B_0$, a joint policy $\vec{\pi}$, an agent $i$, and a local history $h_i^t$ such that there is a history $h^t$ consistent with $\vec{\pi}$ and inducing $h_i^t$. It asks what action $\pi_i(h_i^t)$ is, that is, what action $\vec{\pi}$ prescribes to $i$ at timestep $t + 1$.

Definition 5 (reactive policy representation) A class $R$ of representations of policies is said to be reactive if the execution problem restricted to policies represented in $R$ is solvable in polynomial time.

Clearly, policy trees and finite state controllers constitute reactive representations: executing one consists of iteratively changing state according to the observation just received and taking the action found there. The rest of the paper provides a proper generalization of them, which is not reactive but has the essential advantage of being more succinct.

## Background on Epistemic Logic

Let $X$ be a set of variables and $I$ be a set of agents. A Kripke structure represents an epistemic situation, that is, a description of the state of the variables and of what the agents know about it, about what the other agents know about it, etc. (Hintikka 1962). This is done through a graph of possible worlds and indistinguishability relations $\sim_i$ (one for each agent $i$). Intuitively, a possible world $w$ represents a possible state over $X$ (either the actual one, or one which an agent imagines) together with the epistemic situation for all agents, assuming that $w$ is the actual world. Two such worlds are related by $\sim_i$ if whenever the actual world is one of them, agent $i$ thinks it might as well be the other one.

[Figure 1: Example of a Kripke structure with worlds $w, w', w''$, valuations $V(w) = \{x\}$, $V(w') = \emptyset$, $V(w'') = \{x\}$, and links $w \sim_1 w'$ and $w' \sim_2 w''$ (each world is annotated with its valuation $V(w)$).]
Note that we model knowledge rather than beliefs (which might be false); precisely, we rely on an S5$_n$ semantics, and in particular, an agent may know a formula $\Phi$ only if $\Phi$ is true.

Definition 6 (Kripke structure) A Kripke structure over $X$ is a tuple $S = \langle W, (\sim_i)_{i \in I}, V \rangle$, where $W$ is a nonempty set of worlds, $\sim_i$ is an equivalence relation over $W$ for each agent $i \in I$, and $V : W \to 2^X$ is a valuation function, which associates an assignment to $X$ with each world in $W$.

Figure 1 shows a Kripke structure with three worlds $w, w', w''$ (reflexive links are omitted for simplicity). Variable $x$ is true in $w$ and $w''$, and false in $w'$. Note that different worlds may have the same valuation. The $\sim_1$-equivalence classes are $\{w, w'\}$ and $\{w''\}$. Hence when the actual situation is $w$, agent 1 thinks it might be $w'$ instead, and hence, it thinks that 2 may think that the actual world is $w''$. In contrast, when the actual world is $w$ or $w'$, agent 1 knows that the actual world is not $w''$. It can also be seen that when the actual world is $w$, agent 1 does not know the value of $x$, while it knows that $x$ is true when the actual world is $w''$.

Kripke structures are essentially used for giving semantics to epistemic formulas (over $X$ and $I$), which are logical formulas $\Phi$ generated by the following grammar:

$\Phi ::= x \mid \neg\Phi \mid \Phi \wedge \Phi \mid K_i\Phi \mid KW_i\Phi$

where $x$ ranges over $X$ and $i$ over $I$; $K_i\Phi$ is read "agent $i$ knows that $\Phi$ holds", and $KW_i\Phi$ is read "agent $i$ knows whether $\Phi$ holds or not". An epistemic formula $\Phi$ is said to be subjective for agent $i$ if all propositional variables $x$ are in the scope of a $K_i$ or $KW_i$ modality. This means that $\Phi$ refers to the knowledge of $i$ only; in particular, $i$ is able to evaluate such a formula, while in general it does not know enough of the actual world to evaluate an objective (propositional) formula about it. Finally, the epistemic depth $d(\Phi)$ of $\Phi$ is the deepest nesting of modalities in $\Phi$.

The truth condition $S, w \models \Phi$ is read "$S, w$ satisfies $\Phi$", and means that $\Phi$ is true at world $w$ in the Kripke structure $S$. It is defined by induction on $\Phi$, with the obvious definition for Boolean connectives:
- $S, w \models x$ if $x$ is assigned true by $V(w)$;
- $S, w \models K_i\Phi$ if $\forall w', (w \sim_i w' \implies S, w' \models \Phi)$ holds;
- $S, w \models KW_i\Phi$ if $S, w \models K_i\Phi$ or $S, w \models K_i\neg\Phi$ holds.

We insist that despite the fact that the actual world $w$ is given in the definition, in general the agents do not know what the actual world is. For instance, on Figure 1 we have $S, w \models \neg K_1 x \wedge K_1(x \rightarrow KW_2\, x)$.

In this paper, we use epistemic formulas as branching conditions in policies, and Kripke structures for providing them with operational semantics. In these structures, possible worlds essentially correspond to histories for the QDec-POMDP model at hand, and two histories are indistinguishable for an agent $i$ if they induce the same local history for $i$.

## Multi-Agent Knowledge-Based Programs

We are now ready to introduce our new representation of joint policies for QDec-POMDPs. We build on Knowledge-Based Programs (KBPs) (Fagin et al. 2003) and on their use as policies for single-agent conformant planning (Lang and Zanuttini 2012). In a nutshell, single-agent KBPs represent policies by programs built using sequential composition, and branching and iterating with conditions about what the agent knows. We first generalize them to allow branching on the last observation received, through the constructor jo ("just observed"). This allows us to properly generalize reactive representations such as policy trees.
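As groundwork, the following minimal Python sketch (our own hypothetical encoding) implements Kripke structures (Definition 6) and the truth conditions of the previous section; the branching conditions of MAKBPs will be evaluated against exactly this kind of structure. The example at the end reproduces the claims made about Figure 1.

```python
# A sketch of Definition 6 and the truth conditions (encoding is ours):
# worlds are hashable labels, each ~i is given as a partition of the worlds,
# and V maps worlds to sets of true variables.  Formulas are nested tuples,
# e.g. ('K', 1, ('var', 'x')) for K_1 x.

class KripkeStructure:
    def __init__(self, worlds, classes, valuation):
        self.worlds = worlds    # set of worlds W
        self.classes = classes  # agent -> list of ~i equivalence classes (sets)
        self.V = valuation      # world -> set of true variables

    def indist(self, agent, w):
        """All worlds the agent cannot distinguish from w (its ~i class)."""
        return next(c for c in self.classes[agent] if w in c)

def sat(S, w, phi):
    """Decide S, w |= phi by induction on phi."""
    op = phi[0]
    if op == 'var':
        return phi[1] in S.V[w]
    if op == 'not':
        return not sat(S, w, phi[1])
    if op == 'and':
        return sat(S, w, phi[1]) and sat(S, w, phi[2])
    if op == 'K':    # S, w |= Ki phi iff phi holds at every w' ~i w
        return all(sat(S, w2, phi[2]) for w2 in S.indist(phi[1], w))
    if op == 'KW':   # knowing whether: Ki phi or Ki not-phi
        cls = S.indist(phi[1], w)
        return (all(sat(S, w2, phi[2]) for w2 in cls) or
                all(not sat(S, w2, phi[2]) for w2 in cls))
    raise ValueError(op)

# The structure of Figure 1: x true at w and w'', false at w';
# agent 1 cannot tell w from w', agent 2 cannot tell w' from w''.
S = KripkeStructure(
    worlds={'w', "w'", "w''"},
    classes={1: [{'w', "w'"}, {"w''"}], 2: [{'w'}, {"w'", "w''"}]},
    valuation={'w': {'x'}, "w'": set(), "w''": {'x'}},
)
assert not sat(S, 'w', ('K', 1, ('var', 'x')))   # at w, agent 1 does not know x
assert sat(S, "w''", ('K', 1, ('var', 'x')))     # at w'', it knows x
# K1(x -> KW2 x), with the implication encoded as not(x and not KW2 x):
assert sat(S, 'w', ('K', 1, ('not', ('and', ('var', 'x'),
                                     ('not', ('KW', 2, ('var', 'x')))))))
```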
Let $M = \langle I, X, A, \Omega, T \rangle$ be a QDec-POMDP model.

Definition 7 ((MA)KBP) A Knowledge-Based Program (KBP) for agent $i$ is an expression generated by the grammar:

$\kappa ::= \epsilon \mid a \mid \kappa;\kappa \mid$ if $\Theta$ then $\kappa$ else $\kappa$ fi $\mid$ while $\Theta$ do $\kappa$ od

where $\epsilon$ is the empty program, $a \in A$ is an action, and $\Theta$ is either an epistemic formula over $X$ and $I$ which is subjective for $i$, or a Boolean combination of atoms of the form jo($\omega$) for observations $\omega \in \Omega$. A Multi-Agent Knowledge-Based Program (MAKBP) is a vector $\vec{\kappa}$ of KBPs, one per agent $i \in I$.

We sometimes enclose a KBP inside bold brackets ([...]) to promote readability. The epistemic depth $d(\kappa)$ of a KBP $\kappa$ is defined to be the maximal epistemic depth of a condition $\Theta$ occurring in it, and that of an MAKBP $\vec{\kappa}$ is defined to be the maximal epistemic depth of all individual KBPs $\kappa_i$. We write $K_d$ for the class of all MAKBPs of epistemic depth at most $d$.[5] Finally, the size $|\kappa|$ of a KBP is defined to be the number of symbols used for writing it, including the symbols used for writing the branching conditions, and $|\vec{\kappa}|$ is defined to be $\sum_{i \in I} |\kappa_i|$.

[5] The class $K_0$ consists of KBPs which do not branch, or branch only on combinations of jo($\omega$) atoms, since epistemic conditions are subjective and hence have depth at least 1.

Example 8 (continued) Let strike and radio be variables encoding that there is an air crew strike and that it is announced on the radio, respectively, and let plane_A encode that Alice is in the plane. The plan of Example 1 would be represented by the MAKBP $\vec{\kappa} = (\kappa_A, \kappa_B)$, with

$\kappa_A$ = [ try-plane;
          if $K_A(\neg$plane_A$)$ then
            take-train; turn-radio-on; listen-radio;
            if $K_A(\neg K_B\,$strike$)$ then to-airport fi
          fi ]

$\kappa_B$ = [ turn-radio-on; listen-radio;
          if $K_B($strike$)$ then to-station else to-airport fi ]

Note that conceptually, jo($\omega$) can be seen as the atom $K_i(o_i = \omega)$, if the dynamics of the problem is extended so that each state stores the last observation received by $i$ in an extra variable $o_i$. We also insist that jo constructs (resp. epistemic branching conditions) in the KBP of $i$ refer to the last observation received by $i$ itself (resp. to the knowledge of $i$ itself, possibly about other agents' knowledge). Otherwise the KBP would not make sense, as $i$ would not be able to evaluate its branching conditions at execution time.

## Operational Semantics

We now define an operational semantics for MAKBPs, namely, a model for the agents to execute them. Importantly, recall that we consider decentralized execution (each agent executes its KBP independently of the other agents), but centralized planning, so that at execution time the agents know the QDec-POMDP $\Pi$ as well as the whole MAKBP $\vec{\kappa}$.

The operational semantics which we give is intended to be the straightforward one: each agent executes its own KBP faithfully, and all reason perfectly about the possible evolutions of the environment and of knowledge. However, providing a language which involves reasoning, such as MAKBPs, with an operational semantics is a matter of choice, and we could as well study semantics involving approximate reasoning (e.g., bounded-depth epistemic reasoning). We leave this as a very interesting perspective for future work.

At execution time, when an agent $i$ evaluates a branching condition such as [if $K_i K_{i'} \Phi$], it needs to reason about the current knowledge of $i'$. This requires in particular reasoning about what observations $i'$ has collected so far and what actions it has taken. However, due to partial observability, $i$ may not know exactly what actions $i'$ has taken so far, even if it knows what KBP $i'$ is executing.
To cope with this, our semantics includes reasoning about the program counters in the KBP of each agent. The intended meaning of $\langle \Gamma, a, \kappa \rangle$ is that, provided that all formulas in $\Gamma$ are currently true, $i$ is about to execute action $a$, and then it will execute the KBP $\kappa$.

Definition 9 (program counter) Let $i$ be an agent. A program counter for $i$ is a triple $c = \langle \Gamma, a, \kappa \rangle$ with $\Gamma = \{\Theta^{(1)}, \dots, \Theta^{(k)}\}$, where each $\Theta^{(j)}$ is either an epistemic formula subjective for $i$, or a Boolean combination of jo($\omega$) atoms; $\Gamma$ is called the guard of $c$, $a \in A$ is called the first action of $c$, and $\kappa$ is a KBP called its continuation.

We will reason on what program counters may be executed after what others. We first define the set FPC($\kappa$) of first program counters of a (nonempty) KBP $\kappa$ as follows. If $\kappa$ is of the form [$a$] (resp. [$a ; \kappa'$]), then FPC($\kappa$) is the singleton $\{\langle \emptyset, a, \epsilon \rangle\}$ (resp. $\{\langle \emptyset, a, \kappa' \rangle\}$). If $\kappa$ is of the form [if $\Theta$ then $\kappa'$ else $\kappa''$ fi], then FPC($\kappa$) is defined to be $\{\langle \{\Theta\} \cup \Gamma', a', \kappa''' \rangle \mid \langle \Gamma', a', \kappa''' \rangle \in \mathrm{FPC}(\kappa')\} \cup \{\langle \{\neg\Theta\} \cup \Gamma'', a'', \kappa''' \rangle \mid \langle \Gamma'', a'', \kappa''' \rangle \in \mathrm{FPC}(\kappa'')\}$, and while constructs are handled similarly. Finally, we define the control-flow graph of $\kappa$; intuitively, an edge $c \to_\kappa c'$ in $G(\kappa)$ means that in an execution of $\kappa$, the first action of $c$ will be followed by that of $c'$, provided the guard of $c'$ is true.

Definition 10 (control-flow graph) Let $\kappa$ be a KBP. The control-flow graph of $\kappa$ is the directed graph $G(\kappa) = (V, \to_\kappa)$ where (1) $V$ is the smallest set of program counters which contains FPC($\kappa$), and contains FPC($\kappa'$) whenever it contains a program counter with continuation $\kappa'$, and (2) $\to_\kappa$ contains an edge from $c = \langle \Gamma, a, \kappa' \rangle$ to $c'$ if and only if $c'$ is in FPC($\kappa'$).

Example 11 (continued) Bob's KBP $\kappa_B$ in Example 8 has four program counters: $c_B^0 = \langle \emptyset,$ turn-radio-on, [listen-radio; if ... fi]$\rangle$, $c_B^1 = \langle \emptyset,$ listen-radio, [if ... fi]$\rangle$, $c_B^2 = \langle \{K_B(\text{strike})\},$ to-station, $\epsilon \rangle$, $c_B^3 = \langle \{\neg K_B(\text{strike})\},$ to-airport, $\epsilon \rangle$, and $G(\kappa_B)$ has these as vertices, with $c_B^0 \to_{\kappa_B} c_B^1$, $c_B^1 \to_{\kappa_B} c_B^2$, and $c_B^1 \to_{\kappa_B} c_B^3$.

For simplicity, we assume that there is a unique first program counter in $\kappa$, written $c^0(\kappa)$ or $c^0$ (that is, that FPC($\kappa$) is a singleton); given an initial belief state $B_0$, this can always be enforced by precomputing the value in $B_0$ of all tests performed at the beginning of $\kappa$.

With this in hand, we give an operational semantics to MAKBPs, by defining the Kripke structure in which the agents evaluate branching conditions. The following definition defines both the structure and, implicitly, how it is progressed at execution time. Intuitively, this structure reflects what agents know about the epistemic structure (of all agents) as well as about the state variables $X$ and the program counters of all agents. Technically, it is made of all possible extended histories, where an extended $t$-history $\hat{h}^t$ is a sequence $s^0 \vec{c}^0 \vec{\omega}^0 s^1 \vec{c}^1 \vec{\omega}^1 \dots s^t \vec{c}^t$ consisting of the successive states and joint observations together with the (joint) program counters executed; a unique extended history is associated to each history $s^0 \vec{a}^0 \vec{\omega}^0 s^1 \vec{a}^1 \vec{\omega}^1 \dots s^t \vec{a}^t$ in the straightforward manner. Moreover, two extended histories are indistinguishable by an agent if and only if they have generated the same observations for it.[6]

[6] For the readers familiar with dynamic epistemic logic, our construction amounts to the iterated product update of the initial situation encoding common knowledge of $B_0$, by an event model built from the control-flow graphs of the KBPs.

Definition 12 (structure at time $t$) Let $M$ be a QDec-POMDP model, $B_0$ be an initial belief state, and $\vec{\kappa}$ be an MAKBP. The structure for $M, B_0, \vec{\kappa}$ at time $t$ is the Kripke structure $S^t(\vec{\kappa}) = \langle W^t, (\sim_i^t)_{i \in I}, V \rangle$ over $X$, defined by induction on $t$ as follows:

1. $W^0 = \{s^0 \vec{c}^0 \mid s^0 \in B_0\}$;
2. $\forall i \in I$, $\sim_i^0 = W^0 \times W^0$;
3. $W^{t+1}$ is the set of all worlds $\hat{h}^t s^t \vec{c}^t \vec{\omega}^t s^{t+1} \vec{c}^{t+1}$ with (a) $\hat{h}^t s^t \vec{c}^t \in W^t$, (b) $\forall i \in I$, $S^t(\vec{\kappa}), \hat{h}^t s^t \vec{c}^t \models \Gamma_i^t$, with $\Gamma_i^t$ the guard of $c_i^t$, (c) $(s^{t+1}, \vec{\omega}^t) \in T(s^t, \vec{a}^t)$, where $\vec{a}^t$ is the joint action made of the first actions of $\vec{c}^t$, and (d) $\forall i \in I$, $c_i^t \to_{\kappa_i} c_i^{t+1}$;
4. $\forall i \in I$, $\hat{h}^t s^t \vec{c}^t \vec{\omega}^t s^{t+1} \vec{c}^{t+1} \sim_i^{t+1} \hat{h}'^t s'^t \vec{c}'^t \vec{\omega}'^t s'^{t+1} \vec{c}'^{t+1}$ iff $\hat{h}^t s^t \vec{c}^t \sim_i^t \hat{h}'^t s'^t \vec{c}'^t$ and $\omega_i^t = \omega_i'^t$;
5. $\forall x \in X$, $V(\hat{h}^t s^t \vec{c}^t \vec{\omega}^t s^{t+1} \vec{c}^{t+1})(x) = s^{t+1}[x]$.

Items 1 and 2 in Definition 12 state that it is common knowledge for the agents that the initial state is one in $B_0$ and that they are all about to start executing their KBPs, and that nothing more is known. In the definition of $W^{t+1}$, Item 3a states that possible worlds (extended histories) at time $t+1$ are extensions of possible worlds at time $t$; Item 3b, that the program counter $c_i^t$ was executable; Item 3c, that the new state and observations are consistent with the dynamics of $M$; and Item 3d, that it is possible that $c_i^{t+1}$ is now the current program counter of $i$. Finally, Items 4 and 5 state that $i$ distinguishes between histories only based on its local observations,[7] and that the valuation of a world is determined by the current state of the environment. Note that the situation at time $t$ does not depend on the actual history, nor on the agent. It rather represents all possible evolutions of the environment and knowledge of the agents, i.e., what may have happened rather than what did happen.

[7] We do not need to specify that an agent distinguishes two histories in which it took different actions, since KBPs are deterministic: this can occur only if it receives different observations.

Example 13 (continued) Write $c_A^0, c_A^1$ for Alice's program counters $\langle \emptyset,$ try-plane, [if ... fi]$\rangle$ and $\langle \{K_A(\neg\text{plane}_A)\},$ take-train, ...$\rangle$, respectively. The initial belief state consists of two possible worlds (we write only the relevant variables):

$h^0 = \neg\text{strike}^0 \ (c_A^0, c_B^0)$,  $h'^0 = \text{strike}^0 \ (c_A^0, c_B^0)$

At timestep 1, some possible worlds are:

$h^1 = h^0 \cdot \neg\text{strike}^1 \ \text{plane}_A^1 \ \neg\text{radio}^1 \ (c_A^1, c_B^1)$
$h'^1 = h'^0 \cdot \text{strike}^1 \ \neg\text{plane}_A^1 \ \neg\text{radio}^1 \ (c_A^1, c_B^1)$
$h''^1 = h'^0 \cdot \text{strike}^1 \ \neg\text{plane}_A^1 \ \text{radio}^1 \ (c_A^1, c_B^1)$

where $h'^1, h''^1$ differ on whether the radio announced the strike at this step. Since the only observation is by Alice (whether plane_A is true), we have $h^1 \sim_B^1 h'^1 \sim_B^1 h''^1$ (Bob considers all three equally plausible), but $h^1 \not\sim_A^1 h'^1$ and $h^1 \not\sim_A^1 h''^1$ (Alice knows whether she is in the plane). Still, we have $h'^1 \sim_A^1 h''^1$, since Alice has not listened to the radio yet. Interestingly, at this point Bob does not know whether there is a strike, but he knows that Alice does (since he knows that she observed this as a result of trying to fly).

We finally define the operational semantics of KBPs, by defining when a history is consistent with an MAKBP.

Definition 14 (operational semantics) A $t$-history $h^t$ is said to be consistent with an MAKBP $\vec{\kappa}$ if for all timesteps $t' < t$, each agent evaluates the epistemic branching conditions $\Phi$ in its KBP by deciding $S^{t'}(\vec{\kappa}), \hat{h}^{t'} \models \Phi$, where $\hat{h}^{t'}$ is the extended history associated to $h^{t'}$, and progresses its program counters and takes actions according to the straightforward semantics for ;, if, while, and jo($\omega$).[8]

[8] We define condition jo($\omega$) to evaluate to false at the first timestep.

We emphasize that this semantics is well-defined.
Precisely, since all epistemic conditions in the KBP of $i$ are subjective for $i$ (Definition 7), the test $S^{t'}(\vec{\kappa}), \hat{h}^{t'} \models \Phi$ is equivalent to $i$ deciding whether the test is true at all situations which it considers actually possible, that is, to $S^{t'}(\vec{\kappa}), \hat{h}'^{t'} \models \Phi$ for all histories $\hat{h}'^{t'}$ consistent with the observations which it has received.[9] Given that each agent also has all the information needed for building $S^{t'}(\vec{\kappa})$ (namely $M, B_0, \vec{\kappa}$), because planning is centralized, the test indeed makes sense for each agent individually, at execution time.

[9] Equivalently, the test needs only be performed in the agent's internal, multipointed view of the situation (Aucher 2010).

Finally, we emphasize that we do not intend to design agents which compute or embark an explicit representation of $S^t(\vec{\kappa})$ at execution time. This notion only serves to define the operational semantics. Indeed, which action to perform at execution time can be computed from an intensional representation of $S^t(\vec{\kappa})$, with a procedure similar to model checking for succinct dynamic epistemic logic (Charrier and Schwarzentruber 2017) (see also our Proposition 18).

Notwithstanding the fact that there is not a single correct semantics, let us emphasize that our operational semantics builds on the multi-agent logic of knowledge S5$_n$ and hence, that whenever agent $i$ evaluates, say, $K_i K_{i'} \Phi$ to true, it is indeed the case that $i'$ knows $\Phi$. Hence our semantics correctly captures faithful execution with perfect reasoning about the joint execution, by all agents.

## Expressiveness and Succinctness

We now show that MAKBPs can represent any joint policy, at least as succinctly as standard representations like policy trees, and possibly exponentially more succinctly. For this, we need to fix a concrete representation of QDec-POMDPs. Slightly generalizing the STRIPS-like representation proposed by Brafman, Shani, and Zilberstein (2013), we assume that for each joint action $\vec{a}$, a set of quadruples $T_{\vec{a}} = \{(\phi^{(j)}, e_+^{(j)}, e_-^{(j)}, \vec{\omega}^{(j)}) \mid j \in J\}$ is given, where each $\phi^{(j)}$ is a propositional formula over $X$, each $e_+^{(j)}, e_-^{(j)}$ is a set of propositions (with $e_+^{(j)} \cap e_-^{(j)} = \emptyset$), and each $\vec{\omega}^{(j)}$ is a joint observation given as a tuple. Given a state $s$ and a joint action $\vec{a}$, this encodes $T(s, \vec{a}) = \{(s \cdot e_+^{(j)} \cdot e_-^{(j)}, \vec{\omega}^{(j)}) \mid j \text{ s.t. } s \models \phi^{(j)}\}$, with $s' = s \cdot e_+ \cdot e_-$ defined by: $\forall x \in e_+$, $s'[x] = \top$; $\forall x \in e_-$, $s'[x] = \bot$; and $\forall x \in X \setminus (e_+ \cup e_-)$, $s'[x] = s[x]$. Finally, we assume that $B_0$ is given as a propositional formula over $X$ whose set of models is $B_0$.

Note that one can imagine other compact representations, especially representations based on influence diagrams or dynamic Bayesian networks (Boutilier, Dearden, and Goldszmidt 2000). Our results would also hold with such representations, provided they allow one to efficiently check that a given history is indeed generated by the dynamics, and to efficiently progress the values of the state variables.

To compare representations, we consider a quite general definition of a $t$-policy tree for a model $\langle I, X, A, \Omega, T \rangle$ and an agent $i \in I$: an action $a \in A$ is a 0-policy tree, and if $a$ is an action, $\Omega_1, \Omega_2, \dots, \Omega_k$ form a partition of $\Omega$, and $\tau_1, \tau_2, \dots, \tau_k$ are $(t-1)$-policy trees, then $\tau = (a, \Omega_1 : \tau_1, \Omega_2 : \tau_2, \dots, \Omega_k : \tau_k)$ is a $t$-policy tree. The semantics is that when an agent executes such a tree, it first takes action $a$, then upon receiving an observation $\omega$, it goes on by executing the tree $\tau_j$, where $\Omega_j$ is the subset of $\Omega$ containing $\omega$. The size $|\tau|$ (resp. $|\vec{\tau}|$) of a tree $\tau$ (resp. joint tree $\vec{\tau}$) is defined to be its total number of nodes.
Proposition 15 Let $\vec{\tau}$ be a joint policy tree for a QDec-POMDP model $M$. Then there is an MAKBP $\vec{\kappa}$ which is equivalent to $\vec{\tau}$ and has size linear in $|\vec{\tau}|$.

PROOF. For all agents $i \in I$, define $\kappa_i$ to be $\kappa(\tau_i)$, where $\kappa(\tau)$ is defined inductively as follows. If $\tau$ is reduced to $a$, then $\kappa(\tau)$ is defined to be the KBP [$a$]. Otherwise $\tau$ must be of the form $(a, \Omega_1 : \tau_1, \Omega_2 : \tau_2, \dots, \Omega_k : \tau_k)$, and $\kappa(\tau)$ is defined to be the KBP [$a$ ; if $\bigvee_{\omega \in \Omega_1}$ jo($\omega$) then $\kappa(\tau_1)$ else if $\bigvee_{\omega \in \Omega_2}$ jo($\omega$) then $\kappa(\tau_2)$ ... else if $\bigvee_{\omega \in \Omega_k}$ jo($\omega$) then $\kappa(\tau_k)$ else $\epsilon$ fi]. (Note that the last else clause is never entered.) Clearly, $\kappa(\tau)$ is equivalent to $\tau$ and of size linear in $|\tau|$.

Since by their very definition, joint policy trees can represent any policy (modulo equivalence), we have complete expressiveness as an immediate corollary of Proposition 15.

Corollary 16 All (deterministic) joint policies for QDec-POMDPs can be represented as MAKBPs.

Importantly, the same result as Proposition 15 could be given for other reactive representations, such as finite state controllers (FSCs), were a goto construct added to the syntax of KBPs; this would indeed allow two parts of a KBP to share their continuation. Moreover, such a generalization would not change any result in this paper, and we ignore this possibility only to keep the presentation simple. Anyway, the result which we give next concerns any reactive class of representations, and hence does take FSCs into account.

## Exponential Gains

We now exhibit a family of QDec-POMDPs for which (assuming NP ⊄ P/poly) there is no family of valid reactive policies of polynomial size, but for which there is a polysize family of valid MAKBPs, hence showing that the representation by MAKBPs can yield exponential gains in space.[10] This result is directly inherited from the single-agent case (Lang and Zanuttini 2013, Prop. 2). However, this does not imply per se that reasoning about the other agents' knowledge is useful as far as succinctness is concerned (reasoning about one's own knowledge might have been enough). We show that this is indeed useful, and even that increasing the epistemic depth allowed for MAKBPs always properly increases succinctness. Namely, for all $d$, we show that there is a family $(\Pi_{n,d})_{n \in \mathbb{N}}$ of QDec-POMDPs which has a family $(\vec{\kappa}_{n,d})_{n \in \mathbb{N}}$ of valid MAKBPs of depth $d$, but no polysize valid MAKBPs of depth $d'$ with $d' < d$, nor any in any reactive representation (assuming NP ⊄ P/poly).

[10] The complexity class P/poly (or "nonuniform P") is the class of problems for which for all $n$, there is an algorithm whose description is polysize in $n$ and which correctly decides all instances of size $n$ in polytime. The conjecture NP ⊄ P/poly is standard.

### Construction

Broadly speaking, $\Pi_{n,d}$ encodes a problem in which some information passes through a series of $d$ pairs of agents $(i_\ell, i'_\ell)$ ($\ell = 0, \dots, d-1$), so that at each step, only one agent in the pair gets the information, as controlled by an (unobservable) variable $x_\ell$; this agent must take one of two actions, depending on the information received. The piece of information passed is precisely $x_{\ell-1}$, that is, whether it is $i_{\ell-1}$ or $i'_{\ell-1}$ who obtained the previous information. Finally, once passed, the value of $x_{\ell-1}$ is reset, so that $i_\ell, i'_\ell$ cannot reason on its value any more. In this manner, $i_\ell, i'_\ell$ must choose their action depending on whether they know that $i_{\ell-1}$ obtained the previous information, or they know that $i'_{\ell-1}$ did; by induction, they must reason on whether they know that $i_{\ell-1}$ knows whether $i_{\ell-2}$ knows whether ..., that is, on an epistemic formula of depth $d$.
Finally, for the last level, whether $i_{d-1}$ or $i'_{d-1}$ obtains the information is controlled by whether a propositional formula $\phi$ is satisfiable, instead of a simple variable $x_{d-1}$; assuming NP ⊄ P/poly, this prevents the last agent $i_d$ from having a polysize reactive policy, as this would give a nonuniform polytime algorithm for propositional satisfiability.

More precisely, the dynamics of $\Pi_{n,d}$ is designed so that the sequence of actions is forced in any valid policy. For this, we add a vector $\vec{t}$ of variables to the set $X$, meant to encode the current timestep, we add an encoding of $s'[\vec{t}] = s[\vec{t}] + 1$ to the effect of each joint action, and we make each joint action $\vec{a}$ lead to a sink state if taken at a disallowed timestep. All this can easily be encoded with a vector $\vec{t}$ of logarithmic size and a polynomial number of extra quadruples $(\phi, e_+, e_-, \vec{\cdot})$ (where $\cdot$ represents a dummy observation).

The rest of the construction is as follows. A set of $n$ state variables $x_{\phi,0}, \dots, x_{\phi,n-1}$ represents propositions, and a 3CNF formula $\phi = \bigwedge_{j=0}^{8n^3-1} \gamma^{(j)}$ on these variables is encoded in the state (over $8n^3 \cdot 3 \cdot (1 + \lceil \log n \rceil)$ Boolean variables). The $n$ first steps in any valid policy for $\Pi_{n,d}$ force a specific agent $i_s$ to modify the values of $x_{\phi,0}, \dots, x_{\phi,n-1}$ so that they encode a model of $\phi$, if possible, and to set variable $x_{d-1}$ to $\top$ if and only if $\phi$ is satisfiable; for more details, we refer the reader to (Lang and Zanuttini 2013, Prop. 12).

The problem also has a variable $x_\ell$ for $\ell = 0, \dots, d-2$, with unobservable value. At each timestep $t_\ell = 8n^3 + n + 1 + \ell$, both $i_\ell$ and $i'_\ell$ take an action which reveals the value of $x_{\ell-1}$ to $i_\ell$ (resp. to $i'_\ell$) if $s^{t_\ell}[x_\ell]$ is $\top$ (resp. $\bot$). Moreover, except for $x_0$, whose value never changes, $x_{\ell-1}$ is reset to $\bot$ at the same time. This transition can be encoded efficiently, through the four quadruples $(\phi, e_+, e_-, \vec{\omega})$ of the form $(x_{\ell-1} \wedge x_\ell, \emptyset, \{x_{\ell-1}\}, (\cdot, \dots, \cdot, x_{\ell-1}, \cdot, \dots, \cdot))$, $(\neg x_{\ell-1} \wedge x_\ell, \emptyset, \{x_{\ell-1}\}, (\cdot, \dots, \cdot, \neg x_{\ell-1}, \cdot, \dots, \cdot))$, etc. (where again, $\cdot$ represents a dummy observation). Finally, a last agent $i_d$ observes the CNF $\phi$, and the bits encoding $\phi$ are reset to $\bot$.

At the last timestep $H$, the agents must together take a joint action so that: one of $i_0, i'_0$ takes action $a$ (resp. $a'$) if $s^H[x_0]$ is $\top$ (resp. $\bot$), and the other takes a no-op; for $\ell = 1, \dots, d-1$, one of $i_\ell, i'_\ell$ takes action $a$ (resp. $a'$) if $i_{\ell-1}$ (resp. $i'_{\ell-1}$) takes action $a$ or $a'$; and $i_d$ takes $a$ (resp. $a'$) if $i_{d-1}$ (resp. $i'_{d-1}$) takes $a$ or $a'$. Any other joint action taken at this step leads to a sink state. The dynamics can be encoded efficiently, as there is only a polynomial (in $n$) number of joint actions allowed at all, and overall the size of $\Pi_{n,d}$ is polynomial in $n$. We are now ready to state our main result.

Proposition 17 For all $d \geq 2$, the family $(\Pi_{n,d})_{n \in \mathbb{N}}$ is such that:
1. there is a family $(\vec{\kappa}_{n,d})_{n \in \mathbb{N}}$ of MAKBPs of depth $d$, so that $\vec{\kappa}_{n,d}$ is valid for $\Pi_{n,d}$ and is polysize (in $n$);
2. assuming NP ⊄ P/poly, for any class $R$ of reactive representations, there is no family $(\vec{\pi}_{n,d})_{n \in \mathbb{N}}$ so that $\vec{\pi}_{n,d}$ is valid for $\Pi_{n,d}$ and has a polysize representation in $R$;
3. for $d' < d$, there is no family $(\vec{\kappa}_{n,d'})_{n \in \mathbb{N}}$ of MAKBPs of depth at most $d'$, so that $\vec{\kappa}_{n,d'}$ is valid for $\Pi_{n,d}$.

SKETCH OF PROOF. For Item 1, the valid MAKBPs are the sequences of actions enforced by the problem.
The only exceptions are for $i_s$, for which we use the construction of (Lang and Zanuttini 2013, Prop. 12), and for the last action, for which it is easily seen that it is enough for all agents $i_\ell, i'_\ell$ to execute [if $K(KW_{i_{\ell-1}}(\dots(KW_{i_1} x_0)))$ then $a$ else if $K(\neg KW_{i_{\ell-1}}(\dots(KW_{i_1} x_0)))$ then $a'$ else no-op fi], where $K$ denotes the knowledge modality of the executing agent, which indeed has depth $d$.

For Item 2, assume that there is such a family. Then take any 3CNF $\phi$ and simulate the joint policy. If $i_d$ plays $a$ then $\phi$ is satisfiable, otherwise it is unsatisfiable. This gives a polytime, polysize algorithm for deciding the satisfiability of a 3CNF formula over $n$ variables, hence a nonuniform polytime algorithm for 3SAT, a contradiction.

Finally, for Item 3, we can show that for any valid MAKBP $\vec{\kappa}$, the pointed structures $S^t(\vec{\kappa}), h_a$ and $S^t(\vec{\kappa}), h_{a'}$ ($t > 8n^3 + n + d + 1$) are bisimilar up to depth $d - 2$, i.e., that the subgraphs of $S^t(\vec{\kappa})$ centered at $h_a, h_{a'}$ and of radius $d - 1$ are isomorphic (Blackburn, de Rijke, and Venema 2001), whenever $h_a$ and $h_{a'}$ have the same values for $x_\ell$, $\ell = 0, \dots, d-3$, but a different one for $x_{d-2}$ (by induction on $d$). Intuitively, this captures the fact that no epistemic condition $\Phi$ of depth less than that of $K_{i_d}(KW_{i_{d-1}}(\dots(KW_{i_1} x_0)))$ can tear apart histories of the form $h_a$ (where $i_d$ must take action $a$) from histories of the form $h_{a'}$. As a consequence, any MAKBP allowing $i_d$ to take the correct action at the last timestep must have depth at least $d$.

## Algorithmic Results

We now investigate the complexity of the main computational problems related to MAKBPs, viz. execution and verification, with input given in compact form as in the previous section. Note that the complexity of the general planning problem of deciding whether there exists an MAKBP which is valid for a given QDec-POMDP is the same as for any other representation, since MAKBPs are fully expressive (Corollary 16). Precisely, it is NEXP-complete (Brafman, Shani, and Zilberstein 2013, Corollary 1).

Recall that the execution problem asks what action $i$ should take, given $M, B_0, \vec{\kappa}, h_i^t$. The membership proof in the next proposition follows the same structure as the proof, given by Charrier and Schwarzentruber (2017), that model checking for succinct dynamic epistemic logic is in PSPACE. This is not by coincidence, as progression for MAKBPs can be seen as DEL product updates (see Footnote 6).

Proposition 18 The execution problem for MAKBPs is PSPACE-complete.

Proposition 19 The problem of deciding whether a given MAKBP is valid for a given QDec-POMDP at a finite horizon $H \in \mathbb{N}$, given in unary,[11] is PSPACE-complete.

[11] This amounts to considering that the input has at least size $H$.

SKETCH OF PROOF. For membership, by definition an MAKBP is not valid for $\Pi$ if and only if there is a history consistent with it which does not reach the goal. Since PSPACE is closed under nondeterminism and complementation, we have the result. For hardness, we reduce the execution problem (Proposition 18) to verification by using an action which sets a variable $x_g$ to $\bot$, and defining the goal $g$ to be $x_g$. Then deciding whether the MAKBP is valid amounts to deciding whether this action is executed, hence the result.

These propositions show in particular that MAKBPs can be executed and verified without explicitly computing nor maintaining the (exponential) knowledge structure $S^t(\vec{\kappa})$ at execution time. Hence the gain in succinctness with respect to reactive policies is not only at design time, when writing the policies, but also at execution time, when embarking and reasoning with them can be done in polynomial space.
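To make the operational semantics concrete, here is a naive Python sketch (our own hypothetical encoding) of the progression of the structure $S^t(\vec{\kappa})$ of Definition 12. It enumerates extended histories explicitly, hence runs in exponential time and space, whereas Propositions 18 and 19 show that execution and verification only require polynomial space.

```python
from itertools import product

# A naive rendering of Definition 12 (encoding is ours, for illustration).
# A world is the tuple of (state, joint_counters, joint_obs) triples of an
# extended history; a program counter is a (guard, first_action, continuation)
# triple as in Definition 9.

def initial_worlds(b0, c0):
    # Items 1 and 2: it is common knowledge that the initial state is in B0
    # and that every agent is about to start executing its KBP (counters c0).
    return {((s0, c0, None),) for s0 in b0}

def local_view(world, i):
    # Item 4: agent i relates two worlds iff they gave it the same observations.
    return tuple(step[2][i] for step in world[1:])

def progress(worlds, T, successors, guard_holds):
    """One step S^t -> S^{t+1}.  `successors[c]` lists the counters reachable
    from c in the control-flow graph (Definition 10); `guard_holds(worlds, w, c)`
    decides Item 3b, e.g. with the epistemic model checker sketched earlier."""
    next_worlds = set()
    for w in worlds:
        state, counters, _ = w[-1]
        if not all(guard_holds(worlds, w, c) for c in counters):    # Item 3b
            continue
        action = tuple(c[1] for c in counters)                      # first actions
        for state2, obs in T(state, action):                        # Item 3c
            for counters2 in product(*(successors[c] for c in counters)):  # Item 3d
                next_worlds.add(w + ((state2, counters2, obs),))    # Item 3a
    return next_worlds
```

In this sketch, the set `worlds` plays the role of $W^t$, and `local_view` induces the relations $\sim_i^t$; the PSPACE procedures of Propositions 18 and 19 avoid materializing either.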
Finally, since we allowed the while construct in KBPs, we also investigate verification at an indefinite horizon. The proof uses essentially the same ideas as the proof that epistemic planning with preconditions on knowledge is undecidable (Bolander and Andersen 2011). Observe however that this result is not obvious, as the same problem is decidable in the single-agent case (Lang and Zanuttini 2012, Prop. 6).

Proposition 20 Determining whether a given MAKBP terminates in finite time in all histories, for a given QDec-POMDP model $M$ and initial belief state $B_0$, is undecidable.

## Perspectives

Our first, obvious, perspective is to generalize Multi-Agent KBPs to the context of quantitative (stochastic) Dec-POMDPs (Oliehoek and Amato 2016). To this aim, we will use epistemic logic with probabilities (Fagin and Halpern 1994) and its dynamic extension (Kooi 2003; van Eijck and Schwarzentruber 2014) instead of standard epistemic logic. Doing so will build a bridge between our work and the recently introduced notion of occupancy state (Dibangoye et al. 2016), which can be seen as the stochastic counterpart of our knowledge structures $S^t(\vec{\kappa})$. Another interesting perspective is to investigate approximate operational semantics for MAKBPs, for instance, semantics with reasoning at bounded modal depth, which may lead to a decidable verification problem. We also think that the expressivity of our framework allows one to formalize real applications, such as Hanabi. Finally, the question of how to synthesize MAKBPs (that is, to plan) is very important; for this point, we intend to follow the direction of multi-agent epistemic regression.

## Acknowledgments

This work has been made possible by the existence of the working group MAFTEC of the French CNRS pré-GDR IA. The authors want to thank in particular Frédéric Maris, Tiago de Lima, Guillaume Aucher, and Alexandre Niveau for their contribution to the working group and to the early construction of this work. Abdallah Saffidine is the recipient of an ARC DECRA Fellowship (DE150101351).

## References

Aucher, G. 2010. An internal version of epistemic logic. Studia Logica 94(1):1–22.

Bäckström, C., and Jonsson, P. 2012. Algorithms and limits for compact plan representations. J. Artif. Intell. Res. 44:141–177.

Blackburn, P.; de Rijke, M.; and Venema, Y. 2001. Modal Logic. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press.

Bolander, T., and Andersen, M. B. 2011. Epistemic planning for single- and multi-agent systems. J. Applied Non-Classical Logics 21(1):9–34.

Boutilier, C.; Dearden, R.; and Goldszmidt, M. 2000. Stochastic dynamic programming with factored representations. Artif. Intell. 121(1–2):49–107.

Brafman, R. I.; Shani, G.; and Zilberstein, S. 2013. Qualitative planning under partial observability in multi-agent domains. In Proc. 27th AAAI Conference on Artificial Intelligence (AAAI 2013), 130–137. AAAI Press.

Charrier, T., and Schwarzentruber, F. 2017. A succinct language for dynamic epistemic logic. In Proc. 16th Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2017), 123–131.

Cox, C.; De Silva, J.; Deorsey, P.; Kenter, F. H.; Retter, T.; and Tobin, J. 2015. How to make the perfect fireworks display: Two strategies for Hanabi. Mathematics Magazine 88(5):323–336.

Dibangoye, J. S.; Amato, C.; Buffet, O.; and Charpillet, F. 2016. Optimally solving Dec-POMDPs as continuous-state MDPs. J. Artif. Intell. Res. 55:443–497.

Fagin, R., and Halpern, J. Y. 1994. Reasoning about knowledge and probability. J. ACM 41(2):340–367.
Fagin, R.; Halpern, J. Y.; Moses, Y.; and Vardi, M. 2003. Reasoning About Knowledge. The MIT Press.

Hintikka, J. 1962. Knowledge and Belief: An Introduction to the Logic of the Two Notions. Cornell University Press.

Kominis, F., and Geffner, H. 2015. Beliefs in multiagent planning: From one agent to many. In Proc. 25th International Conference on Automated Planning and Scheduling (ICAPS 2015), 147–155.

Kooi, B. P. 2003. Probabilistic dynamic epistemic logic. J. Logic, Language and Information 12(4):381–408.

Kumar, A.; Mostafa, H.; and Zilberstein, S. 2016. Dual formulations for optimizing Dec-POMDP controllers. In Proc. 26th International Conference on Automated Planning and Scheduling (ICAPS 2016), 202–210.

Lang, J., and Zanuttini, B. 2012. Knowledge-based programs as plans: The complexity of plan verification. In Proc. 20th European Conference on Artificial Intelligence (ECAI 2012), 504–509.

Lang, J., and Zanuttini, B. 2013. Knowledge-based programs as plans: Succinctness and the complexity of plan existence. In Proc. 14th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2013).

Oliehoek, F. A., and Amato, C. 2016. A Concise Introduction to Decentralized POMDPs. Springer.

Osawa, H. 2015. Solving Hanabi: Estimating hands by opponent's actions in cooperative game with incomplete information. In AAAI Workshop: Computer Poker and Imperfect Information, 37–43.

van den Bergh, M. 2015. Hanabi, a co-operative game of fireworks. Bachelor thesis, Universiteit Leiden, the Netherlands.

van Eijck, J., and Schwarzentruber, F. 2014. Epistemic probability logic simplified. In Advances in Modal Logic 10, invited and contributed papers from the tenth conference on "Advances in Modal Logic", 158–177.