# Decentralized Stochastic Planning with Anonymity in Interactions

Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet
School of Information Systems, Singapore Management University
Singapore-MIT Alliance for Research and Technology (SMART), Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
pradeepv@smu.edu.sg, yossiri@smart.mit.edu, jaillet@mit.edu

Abstract. In this paper, we solve cooperative decentralized stochastic planning problems in which the interactions between agents (specified using transition and reward functions) depend on the number of agents involved in the interaction, and not on the identity of the individual agents. A collision of robots in a narrow corridor and defender teams coordinating patrol activities to secure a target are examples of such anonymous interactions. Formally, we consider problems that are a subset of the well known Decentralized MDP (Dec-MDP) model, where the anonymity in interactions is specified within the joint reward and transition functions. In this paper, not only do we introduce a general model called D-SPAIT to capture anonymity in interactions, but we also provide optimization-based optimal and local-optimal solutions for generalizable sub-categories of D-SPAIT.

## Introduction

The Decentralized Markov Decision Problem (Dec-MDP) model provides a rich framework to tackle decentralized decision-making problems. However, solving a Dec-MDP problem to create coordinated multi-agent policies in environments with uncertainty is NEXP-hard (Bernstein et al. 2002). Researchers have typically employed two types of approaches to address this significant computational complexity: (1) approximate dynamic programming and policy iteration approaches (Seuken and Zilberstein 2007; Bernstein, Hansen, and Zilberstein 2005); and (2) approaches that exploit static and dynamic sparsity in interactions (Becker et al. 2004; Nair et al. 2005; Velagapudi et al. 2011; Witwicki and Durfee 2012; Mostafa and Lesser 2012). In this paper, we pursue a third type of approach, wherein we exploit anonymity in interactions. This generalizes the notion of aggregate influences that has previously been considered in existing work (Witwicki and Durfee 2012; Mostafa and Lesser 2012; Varakantham et al. 2009). More specifically, we exploit the fact that in many decentralized stochastic planning problems, interactions between agents (specified as joint transition or reward functions) do not depend on the identity of the agents involved; instead, they depend only on the number of agents involved in the interaction. For instance, in the navigation domains of (Melo and Veloso 2011) and (Varakantham et al. 2009), rewards (and transitions) in narrow corridors depend only on the number of agents entering the narrow corridor simultaneously and not on the specific agents. Similarly, in the coordination problem introduced by (Yin and Tambe 2011) for Autonomous Underwater and Surface Vehicles (AUVs and ASVs), the rewards are associated with the number of agents taking underwater samples simultaneously and not with which specific agents do so. In fact, most sensor network problems (Kumar, Zilberstein, and Toussaint 2011; Nair et al. 2005) have coordination problems with anonymous interactions, where reward depends on the number of agents tracking a region (and not on which specific sensors).
Finally, in the context of coordinating defenders in security patrolling problems (Shieh et al. 2013), the reward for patrolling a target by multiple agents depends on the number of agents patrolling that target.

While anonymity in interactions has been considered in the context of competitive games (Roughgarden and Tardos 2002; Varakantham et al. 2012; Ahmed, Varakantham, and Cheng 2012), it has not been considered in the context of decentralized and stochastic cooperative planning, and that is the main focus of this paper. Concretely, we first provide a general model called Decentralized Stochastic Planning with Anonymous Interactions (D-SPAIT) to represent anonymity in interactions within the context of the Dec-MDP framework. Secondly, we develop an optimization-based formulation for solving general D-SPAIT problems and, in reference to this optimization, we prove a key theoretical result regarding the scalability of this formulation. Thirdly, we develop specific and scalable methods for generalizable sub-categories of D-SPAIT problems and, finally, we demonstrate the performance of these approaches on random D-SPAIT problems. In our experimental results, we also compare against the Softmax based Flow Update (SMFU) algorithm, which was developed for competitive games and can be adapted to solve D-SPAIT problems.

## Model: D-SPAIT

In this section, we describe the Decentralized Stochastic Planning with Anonymous Interactions (D-SPAIT) model that combines the cooperative stochastic planning framework of Decentralized MDPs (Dec-MDPs) (Bernstein et al. 2002) with the anonymity in interactions introduced in the context of multi-agent routing and planning models (Roughgarden and Tardos 2002; Varakantham et al. 2012) from competitive game theory.

Figure 1: Example of a navigation task, where agents are on the left side and goals are on the right side. For agents to achieve the goals, they have to pass through a narrow corridor. If multiple agents pass through the narrow corridor, then each agent receives a penalty that depends on the number of agents passing through the corridor. Except for interactions in the narrow corridor, agents can independently move through the grid world.

More formally, D-SPAIT is described using the tuple $\langle P, S, A, R, \phi, (\delta_i)_{i \in P}, T \rangle$:

$P$ is the agent population, $S$ is the set of states and $A$ is the action set of any individual agent in $P$. $R$ is the joint reward and is expressed as the sum of individual rewards, $R$, parameterized by the number of agents due to any anonymous interactions in the joint state, action pair:
$$R^t(\langle s_1, \ldots, s_{|P|}\rangle, \langle a_1, \ldots, a_{|P|}\rangle) = \sum_i R^t(s_i, a_i, d^t_{s_i,a_i}) \quad (1)$$
where $d^t_{s_i,a_i}$ refers to the number of agents impacting the reward function for $s_i, a_i$ in the anonymous interaction. For instance, in the example of Figure 1, the agents impacting the reward for an agent entering the narrow corridor from $s_1$ and taking action East are those agents which are either entering the narrow corridor from $s_1$ taking action East or from $s_2$ taking action West. Therefore, if $x^t_i(s, a)$ denotes the number of times agent $i$ takes action $a$ in $s$ at time step $t$, then $d^t_{s_1,East}$ corresponding to the interaction in the narrow corridor is given by:
$$d^t_{s_1,East} = \sum_i \big[ x^t_i(s_1, East) + x^t_i(s_2, West) \big]$$
When an agent is not part of any interaction, then $R^t(s_i, a_i, d^t_{s_i,a_i}) = R^t(s_i, a_i)$.
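To make the anonymity concrete, the following is a minimal sketch of how the interaction count $d^t_{s,a}$ and the resulting per-agent reward could be computed for the narrow-corridor example. The state and action names, the flow counts, and the congestion-penalty form of the reward are illustrative assumptions, not taken from the paper.

```python
# Illustrative only: compute the anonymous interaction count d^t_{s,a} and the
# resulting per-agent reward for the narrow-corridor example of Figure 1.
# State/action names, flow counts and reward numbers below are hypothetical.

# x[i][(s, a)] = number of times agent i executes action a in state s at time t
x = {
    0: {("s1", "East"): 1, ("s2", "West"): 0},
    1: {("s1", "East"): 0, ("s2", "West"): 1},
    2: {("s1", "East"): 0, ("s2", "West"): 0},  # agent 2 avoids the corridor
}

# Dependency set for entering the corridor from s1 going East: all (state, action)
# pairs whose occupants affect this reward (here, both corridor entrances).
I_R = {("s1", "East"): {("s1", "East"), ("s2", "West")}}

def interaction_count(state, action):
    """d^t_{s,a}: total number of agents in the state-action pairs that impact (s, a)."""
    return sum(x_i.get(pair, 0) for x_i in x.values() for pair in I_R[(state, action)])

def corridor_reward(state, action, base_reward=10.0, penalty_per_agent=4.0):
    """R^t(s, a, d): a base reward minus a congestion penalty that grows with d."""
    d = interaction_count(state, action)
    return base_reward - penalty_per_agent * max(0, d - 1)

print(interaction_count("s1", "East"))   # 2 agents interact in the corridor
print(corridor_reward("s1", "East"))     # 10 - 4 = 6
```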
$\phi$ refers to the joint transition probability and is expressed as a product of individual transition functions, $\varphi$, parameterized by the number of agents due to any anonymous interactions in the joint state, action pair:
$$\phi^t(\langle s_1, \ldots, s_{|P|}\rangle, \langle a_1, \ldots, a_{|P|}\rangle, \langle \hat{s}_1, \ldots, \hat{s}_{|P|}\rangle) = \prod_i \varphi^t(s_i, a_i, \hat{s}_i, d^t_{s_i,a_i}) \quad (2)$$
$\delta_i$ is the initial belief over states for agent $i$; $\delta^t_i = 0$ for all decision epochs $t > 0$. $T$ is the time horizon. The goal is to compute a strategy for the individual agents so as to maximize the joint expected reward over the time horizon.

Some important aspects of the D-SPAIT model are:
(1) The individual rewards and transitions are parameterized by the number of agents in Equations 1 and 2 respectively. In general, $d^t_{s_i,a_i}$ can be different for the reward and transition functions, and we define $d^t_{s_i,a_i}$ for the reward function as:
$$d^t_{s_i,a_i} = \sum_{j,(s,a) \in I^R_{s_i,a_i}} x^t_j(s, a) \quad (3)$$
where $I^R_{s_i,a_i}$ and $I^\varphi_{s_i,a_i}$ refer to the sets of state, action pairs that impact (through the number of agents executing them) the reward and transition function, respectively, for the $(s_i, a_i)$ pair.
(2) Since the transition and reward functions of agents are independent given the number of agents $d^t_{s_i,a_i}$ for each $s_i, a_i$ and $t$, the value function over all agents is given by:
$$V^0(\{\pi_i\}) = \sum_{i,s_i,a_i,t} R^t(s_i, a_i, d^t_{s_i,a_i})\, x^t_i(s_i, a_i) \quad (4)$$
where $x^t_i(s_i, a_i)$ is the frequency of executing action $a_i$ in state $s_i$ by agent $i$ at time $t$ given the joint policy $\{\pi_i\}$.
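To illustrate the anonymity in transitions (Equation 2), the sketch below shows an individual transition function whose outcome distribution depends on the interaction count $d$. The congestion model (success probability $0.9/d$) and the state names are hypothetical assumptions for illustration only; the joint transition of Equation 2 is simply the product of such individual distributions.

```python
# Illustrative only: an individual transition function varphi^t(s, a, s', d) whose
# success probability degrades as more agents crowd the narrow corridor.
# The 0.9 / d congestion model and the state names are hypothetical.

def corridor_transition(d: int) -> dict:
    """Return a distribution over next states for (s1, East) given d interacting agents."""
    p_success = 0.9 / max(1, d)            # crossing succeeds less often under congestion
    return {"corridor": p_success, "s1": 1.0 - p_success}

for d in (1, 2, 4):
    dist = corridor_transition(d)
    assert abs(sum(dist.values()) - 1.0) < 1e-9   # still a valid probability distribution
    print(d, dist)
```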
## Optimization Approach

We first formulate the solution to a D-SPAIT problem as an optimization problem. Given that the individual reward and transition functions (parameterized by the number of agents) can be arbitrary functions, we cannot solve this optimization problem in the most general case. However, this optimization-based formulation helps deduce an important property associated with optimal policies for D-SPAIT. Intuitively, the problem faced by every agent in D-SPAIT is an MDP with reward and transition functions that depend on the number of agents involved in the anonymous interactions. Due to this dependence of the reward and transition functions on the number of agents, the optimization problem becomes significantly more complicated than a single agent MDP. In the formulation, $x^t_i(s, a)$ represents the fraction of times agent $i$ takes action $a$ in state $s$ at time $t$.

Algorithm 1: SOLVEDSPAIT()
Inputs: $R, \varphi, \delta$. Outputs: $x = \{x^t_i(s, a)\}_{i,t,s,a}$
$$\max \sum_{s,a,i,t} x^t_i(s, a) \cdot R^t\Big(s, a, \sum_{j,(s',a') \in I^R_{s,a}} x^t_j(s', a')\Big)$$
$$\text{s.t.} \quad \sum_a x^t_i(s, a) - \sum_{s',a} x^{t-1}_i(s', a) \cdot \varphi^{t-1}\Big(s', a, s, \sum_{j,(s'',a'') \in I^\varphi_{s',a}} x^{t-1}_j(s'', a'')\Big) = \delta^t_i(s), \;\; \forall i, s \quad (5)$$
$$x^t_i(s, a) \geq 0, \;\forall i, s, a, t \geq 0; \qquad x^t_i(s, a) = 0, \;\forall i, s, a, t < 0$$

The formulation in SOLVEDSPAIT() has three aspects: (1) Since the goal is to compute the optimal expected joint reward for all agents, the objective includes a summation over all agents $i$. (2) The reward function corresponding to a state, action pair in the objective has, as a parameter, the number of agents impacting the reward value in that state, action pair, i.e., $\sum_{j,(s',a') \in I^R_{s,a}} x^t_j(s', a')$. Since we have a finite horizon problem, $x^t_j(s', a')$ corresponds to the proportion of one agent (i.e., $j$) taking action $a'$ in state $s'$ at time $t$ and, consequently, $\sum_j x^t_j(s', a')$ corresponds to the number of agents taking action $a'$ in state $s'$ at time $t$. (3) Similar to the reward function above, the transition function in the first constraint has an extra parameter corresponding to the number of agents executing a dependent action in a dependent state.

We now prove an important property for populations where the starting belief distribution, $b$, over states is the same for all agents. We use $x_i$ to represent the vector of flow values $x^t_i(s, a)$ for all $s, a, t$.

Proposition 1. If $x^* = \langle x_1, x_2, \ldots, x_{|P|}\rangle$ is the optimal solution for the optimization problem in Algorithm 1 and $\forall i, j, s: \delta^0_i(s) = \delta^0_j(s)$, then $\hat{x}$ defined below is also an optimal solution for the same optimization problem:
$$\hat{x} = \Big\langle \frac{\sum_i x_i}{|P|}, \frac{\sum_i x_i}{|P|}, \ldots \Big\rangle$$

Proof. We first show that $\hat{x}$ belongs to the feasible region. By summing up the LHS and RHS of Constraint 5 over all agents $i$, dividing by $|P|$, and doing this for each value of $s$ and $t$, we have:
$$\sum_a \frac{\sum_i x^t_i(s, a)}{|P|} - \sum_{s',a} \frac{\sum_i x^{t-1}_i(s', a)}{|P|}\, \varphi^{t-1}\Big(s', a, s, \sum_{(s'',a'') \in I^\varphi_{s',a}} |P| \cdot \frac{\sum_i x^{t-1}_i(s'', a'')}{|P|}\Big) = \delta^t(s)$$
which is exactly Constraint 5 written for the averaged flows, since each of the $|P|$ identical flows in $\hat{x}$ contributes $\frac{\sum_i x_i}{|P|}$ and hence the parameter of $\varphi$ equals the number of agents induced by $\hat{x}$. Therefore, if $x^*$ is a feasible solution to the optimization problem in SOLVEDSPAIT(), then so is $\hat{x}$. Now, we show that $\hat{x}$ is also an optimal solution. To demonstrate this, we calculate the objective value corresponding to $x^*$ and show that it is equal to the value obtained with $\hat{x}$:
$$\sum_{s,a,t} \sum_i x^t_i(s, a)\, R^t\Big(s, a, \sum_j x^t_j(s, a)\Big) = \sum_{s,a,t} |P| \cdot \frac{\sum_i x^t_i(s, a)}{|P|}\, R^t\Big(s, a, |P| \cdot \frac{\sum_j x^t_j(s, a)}{|P|}\Big)$$
The right hand side is exactly the objective value obtained with $\hat{x}$. $\square$

Proposition 1 provides a very important result that removes the dependence of SOLVEDSPAIT() on the number of agents. It indicates that the class of symmetric joint policies is guaranteed to contain at least one optimal joint policy. Hence, we can rewrite the optimization problem in SOLVEDSPAIT() as the one in SOLVEDSPAIT-HOMOGENEOUS(). While the scalability of SOLVEDSPAIT-HOMOGENEOUS() is independent of the number of agents, it depends on two key factors.
(1) The structure of the $I^\varphi$ and $I^R$ functions, i.e., the dependence of the reward and transition for a given state, action pair on the number of agents in other state, action pairs. We first describe the simpler case, where the reward and transition for a given state, action pair $(s, a)$ at time $t$ depend only on the number of agents executing action $a$ in state $s$ at the same time step $t$, i.e.,
$$I^\varphi_{s,a} = \{(s, a)\}; \qquad I^R_{s,a} = \{(s, a)\} \quad (6)$$

Algorithm 2: SOLVEDSPAIT-HOMOGENEOUS()
Inputs: $R, \varphi, \delta$. Outputs: $x = \{x^t(s, a)\}_{t,s,a}$
$$\max \sum_{s,a,t} |P| \cdot x^t(s, a) \cdot R^t\Big(s, a, |P| \sum_{(s',a') \in I^R_{s,a}} x^t(s', a')\Big)$$
$$\text{s.t.} \quad \sum_a x^t(s, a) - \sum_{s',a} x^{t-1}(s', a)\, \varphi^{t-1}\Big(s', a, s, |P| \sum_{(\bar{s},\bar{a}) \in I^\varphi_{s',a}} x^{t-1}(\bar{s}, \bar{a})\Big) = \delta^t(s), \;\; \forall s$$
$$x^t(s, a) \geq 0, \;\forall s, a, t \geq 0; \qquad x^t(s, a) = 0, \;\forall s, a, t < 0$$

However, our approaches are applicable to general $I^\varphi$ and $I^R$ structures and we provide specific comments in the subsection "Discussion on Generic $I^R$ and $I^\varphi$ Structures".
(2) The functional form of the reward, $R$, and transition, $\varphi$, functions with respect to the number of agents, $d^t(s, a)$. For instance, it is very difficult to solve SOLVEDSPAIT-HOMOGENEOUS() for non-linear and non-convex functions. We focus on functional forms that can be used to approximate non-linear and non-convex functional forms: (a) linear; (b) piecewise constant; (c) piecewise linear and convex. We now focus on functional forms for which it is possible to provide scalable approaches (optimal and local-optimal).
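As a small illustration of the two dependency structures, the sketch below contrasts the simple structure of Equation 6 with a generic structure for the corridor example; the state and action names are hypothetical.

```python
# Illustrative only: dependency sets I^R as Python dictionaries.
# Keys and members are (state, action) pairs; the names are hypothetical.

# Simple structure of Equation 6: each pair depends only on itself.
I_R_simple = {
    ("s1", "East"): {("s1", "East")},
    ("s2", "West"): {("s2", "West")},
}

# Generic structure: entering the corridor from either side is one interaction,
# so both entrances appear in each other's dependency set.
I_R_generic = {
    ("s1", "East"): {("s1", "East"), ("s2", "West")},
    ("s2", "West"): {("s1", "East"), ("s2", "West")},
}

print(I_R_generic[("s1", "East")])
```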
### Linear Rewards

Our first set of assumptions on the reward and transition functions, under which we can provide a scalable approach, is: (1) The transition function $\varphi$ is independent of $d$, i.e., $\varphi^t(s, a, s', d^t_{s,a}) = \varphi^t(s, a, s')$. (2) The reward function $R$ depends linearly on $d^t_{s,a}$, i.e., $R^t(s, a, d^t_{s,a}) = m^t_{s,a} \cdot d^t_{s,a} + c^t_{s,a}$. More specifically, we are interested in linear functions where the reward decreases as the number of agents increases, that is, all $m^t_{s,a}$ are less than 0. It should be noted that a positive set of slopes implies that the problem is not concave and hence requires more sophisticated approximations.

The linearity of the reward function and the structure of $I$ (from Equation 6) reduce the objective in Algorithm 2 as follows:
$$\sum_{s,a,t} |P| \cdot x^t(s, a) \cdot R^t\big(s, a, |P| \cdot x^t(s, a)\big) = \sum_{s,a,t} \Big[ m^t_{s,a} \cdot |P|^2 \cdot \big(x^t(s, a)\big)^2 + c^t_{s,a} \cdot |P| \cdot x^t(s, a) \Big] \quad (7)$$
Thus, even when the reward function is linear, we have a quadratic objective. A general quadratic objective cannot be solved in polynomial time and typically requires non-linear solvers. However, in cases of resource congestion or collision interactions, the reward decreases with the number of agents and hence the reward function has a negative slope. In such cases, we can show that the objective is concave and hence has only one maximum.

Proposition 2. If all slopes $m^t_{s,a}$ are negative, then the objective provided in Equation 7 is concave.

Proof Sketch. We prove this proposition by showing that the Hessian matrix $H$ corresponding to the objective function is negative semidefinite. Let $F$ denote the objective function:
$$F = \sum_{s,a,t} \Big[ m^t_{s,a} \cdot |P|^2 \cdot \big(x^t(s, a)\big)^2 + c^t_{s,a} \cdot |P| \cdot x^t(s, a) \Big]$$
We first consider the diagonal element for the row corresponding to $\langle \hat{s}, \hat{a}, \hat{t} \rangle$:
$$H\big(\langle \hat{s}, \hat{a}, \hat{t} \rangle, \langle \hat{s}, \hat{a}, \hat{t} \rangle\big) = \frac{\partial^2 F}{\partial \big[x^{\hat{t}}(\hat{s}, \hat{a})\big]^2} = 2\, m^{\hat{t}}_{\hat{s},\hat{a}} \cdot |P|^2$$
We now consider a non-diagonal element in the row corresponding to $\langle \hat{s}, \hat{a}, \hat{t} \rangle$ and column corresponding to $\langle \tilde{s}, \tilde{a}, \tilde{t} \rangle$:
$$H\big(\langle \hat{s}, \hat{a}, \hat{t} \rangle, \langle \tilde{s}, \tilde{a}, \tilde{t} \rangle\big) = \frac{\partial^2 F}{\partial x^{\hat{t}}(\hat{s}, \hat{a})\, \partial x^{\tilde{t}}(\tilde{s}, \tilde{a})} = 0$$
Therefore, all the diagonal elements of $H$ are negative and all other elements are zero. Hence, for any vector $y$: $y^T H y \leq 0$. Thus $H$ is negative semidefinite and $F$ is concave. $\square$

Proposition 2 is important because concavity implies that the maximization problem in Algorithm 3 can be solved in polynomial time. More practically, the optimization can be solved using standard quadratic programming solvers such as CPLEX.

Algorithm 3: SOLVEDSPAIT-LINEAR($\langle m, c \rangle$)
Inputs: $\{m^t_{s,a}, c^t_{s,a}\}_{s,a,t}, \varphi, \delta$. Outputs: $x = \{x^t(s, a)\}_{t,s,a}$
$$\max \sum_{s,a,t} \Big[ m^t_{s,a} \cdot |P|^2 \cdot \big(x^t(s, a)\big)^2 + c^t_{s,a} \cdot |P| \cdot x^t(s, a) \Big]$$
$$\text{s.t.} \quad \sum_a x^t(s, a) - \sum_{s',a} x^{t-1}(s', a)\, \varphi^{t-1}(s', a, s) = \delta^t(s), \;\; \forall s, t$$
$$x^t(s, a) \geq 0, \;\forall s, a, t \geq 0; \qquad x^t(s, a) = 0, \;\forall s, a, t < 0$$
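The following is a minimal sketch of the concave quadratic program of Algorithm 3 expressed with the cvxpy modeling library; this is an assumption about tooling, since the paper solved its programs directly with CPLEX 12.2. The two-state instance, the random slopes, intercepts and transitions, and the initial belief are all hypothetical. The final loop extracts a shared randomized policy from the optimal flows, a standard occupancy-measure step that is consistent with Proposition 1 but not spelled out in the paper.

```python
import numpy as np
import cvxpy as cp

# Hypothetical instance: |S| = 2 states, |A| = 2 actions, T = 3 steps, |P| = 10 agents.
S, A, T, P = 2, 2, 3, 10
rng = np.random.default_rng(0)

m = -rng.uniform(0.5, 1.0, (T, S, A))      # negative slopes => concave objective (Prop. 2)
c = rng.uniform(5.0, 10.0, (T, S, A))      # intercepts
phi = rng.dirichlet(np.ones(S), (S, A))    # phi[s, a, s']: d-independent transitions
delta0 = np.array([0.6, 0.4])              # common initial belief over states

x = [cp.Variable((S, A), nonneg=True) for _ in range(T)]   # flow variables x^t(s, a)

constraints = []
for t in range(T):
    inflow = delta0 if t == 0 else cp.hstack(
        [cp.sum(cp.multiply(x[t - 1], phi[:, :, s])) for s in range(S)])
    constraints.append(cp.sum(x[t], axis=1) == inflow)      # flow preservation

# Objective of Algorithm 3: sum_{s,a,t} [ m |P|^2 x^2 + c |P| x ]
objective = cp.Maximize(sum(
    cp.sum(cp.multiply(m[t] * P**2, cp.square(x[t])) + cp.multiply(c[t] * P, x[t]))
    for t in range(T)))

prob = cp.Problem(objective, constraints)
prob.solve()

# Extract the shared randomized policy pi^t(a|s) = x^t(s,a) / sum_a x^t(s,a).
for t in range(T):
    flows = x[t].value
    policy = flows / flows.sum(axis=1, keepdims=True)
    print(f"t={t}\n{np.round(policy, 3)}")
```

Because all agents share the same initial belief, the single flow vector solved above induces one policy that every agent follows, which is exactly the symmetric joint policy guaranteed to be optimal by Proposition 1.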
### Piece Wise Constant (PWC) Transition Function

Starting from the assumptions in the previous subsection, we relax the assumption on the transition function. More specifically, we no longer assume that the transition function is independent of $d$; instead, we assume a piecewise constant dependency on $d$, i.e., $\forall s, a, s', t$ we have
$$\varphi^t(s, a, s', d) = \begin{cases} p^{t,1}(s, a, s'), & \text{if } \check{d}_1 \leq d \leq \hat{d}_1 \\ p^{t,2}(s, a, s'), & \text{if } \check{d}_2 \leq d \leq \hat{d}_2 \\ \quad \vdots & \quad \vdots \end{cases} \quad (8)$$
A key advantage of piecewise constant functions is their ability to closely approximate very general functions. The impact of the piecewise constant dependency on $d$ is on the flow preservation constraint for each $s, t$:
$$\sum_a x^t(s, a) - \sum_{s',a} x^{t-1}(s', a)\, \varphi^{t-1}\big(s', a, s, |P| \cdot x^{t-1}(s', a)\big) = \delta^t(s)$$
Since the transition function is no longer a constant, the terms $x^{t}(s', a) \cdot \varphi^{t}(s', a, s, |P| \cdot x^{t}(s', a))$ are products of two variables and hence are non-linear. In this section, we contribute novel mechanisms to linearize these terms.

We use a set of new variables $\{X^t(s', a, s)\}$ that represent the product terms, i.e.,
$$X^t(s', a, s) = x^t(s', a) \cdot \varphi^t\big(s', a, s, |P| \cdot x^t(s', a)\big) \quad (9)$$
To linearize these product terms, we first provide equivalent expressions for the new variables as follows:
$$X^t(s', a, s) = \sum_k X^{t,k}(s', a, s), \text{ where} \quad (10)$$
$$X^{t,k}(s', a, s) = \begin{cases} p^{t,k}(s', a, s) \cdot x^t(s', a), & \text{if } \check{d}_k \leq |P| \cdot x^t(s', a) \leq \hat{d}_k \\ 0, & \text{otherwise} \end{cases} \quad (11)$$
The above expressions require two linearization steps. Firstly, we need to linearize the conditional required to check whether the overall flow on state $s'$ and action $a$ ($= |P| \cdot x^t(s', a)$) belongs to an interval $[\check{d}_k, \hat{d}_k]$. The binary variables $y^{t,k}(s', a)$ are used to represent the satisfaction of this condition, i.e., $y^{t,k}(s', a) = 1$ implies that the value $|P| \cdot x^t(s', a)$ belongs to the interval $[\check{d}_k, \hat{d}_k]$. More specifically, the constraints that are used to achieve this assignment are as follows:
$$\sum_k y^{t,k}(s, a) = 1 \quad (12)$$
$$\check{d}_k - |P| \cdot x^t(s, a) \leq \big(1 - y^{t,k}(s, a)\big) \cdot M \quad (13)$$
$$|P| \cdot x^t(s, a) - \hat{d}_k \leq \big(1 - y^{t,k}(s, a)\big) \cdot M \quad (14)$$
Secondly, we need to set $X^{t,k}(s', a, s)$ to $p^{t,k}(s', a, s) \cdot x^t(s', a)$ if $y^{t,k}(s', a)$ is set to 1 and to zero otherwise. The linearized constraints corresponding to this conditional assignment are as follows:
$$X^{t,k}(s', a, s) \leq x^t(s', a) \cdot p^{t,k}(s', a, s) \quad (15)$$
$$X^{t,k}(s', a, s) \leq y^{t,k}(s', a) \cdot M \quad (16)$$
$$X^{t,k}(s', a, s) \geq x^t(s', a) \cdot p^{t,k}(s', a, s) - \big(1 - y^{t,k}(s', a)\big) \cdot M \quad (17)$$

Proposition 3. Constraints 12, 13 and 14 ensure that: (a) $y^{t,k}(s, a) = 0$ if $|P| \cdot x^t(s, a) \notin [\check{d}_k, \hat{d}_k]$; and (b) $y^{t,k}(s, a) = 1$ if $|P| \cdot x^t(s, a) \in [\check{d}_k, \hat{d}_k]$.

Proposition 4. Constraints 10, 15, 16 and 17 ensure that the definition of the $X$ variables in Equation 9 is satisfied.

Proposition 5. $M$ in constraints 16 and 17 can be set to 1 without violating their correctness.

Large values of $M$ imply longer run-times (Hooker 1995) and thus Proposition 5 is important in significantly reducing the run-time required to solve the optimization problem.

Algorithm 4: SOLVEDSPAIT-PWC()
Inputs: $\{c^{t,k}_{s,a}\}_{t,k,s,a}, \varphi, \delta$. Outputs: $x = \{x^t(s, a)\}_{t,s,a}$
$$\max \sum_{s,a,t,k} |P| \cdot Z^{t,k}(s, a)$$
$$\text{s.t.} \quad Z^{t,k}(s, a) \leq z^{t,k}(s, a) \cdot M; \qquad Z^{t,k}(s, a) \leq x^t(s, a) \cdot c^{t,k}_{s,a}$$
$$Z^{t,k}(s, a) \geq x^t(s, a) \cdot c^{t,k}_{s,a} - \big(1 - z^{t,k}(s, a)\big) \cdot M$$
$$\sum_a x^t(s, a) - \sum_{s',a} x^{t-1}(s', a)\, \varphi^{t-1}(s', a, s) = \delta^t(s), \;\; \forall s, t$$
$$\sum_k z^{t,k}(s, a) = 1; \qquad \check{d}_k - |P| \cdot x^t(s, a) \leq \big(1 - z^{t,k}(s, a)\big) \cdot M; \qquad |P| \cdot x^t(s, a) - \hat{d}_k \leq \big(1 - z^{t,k}(s, a)\big) \cdot M$$
$$x^t(s, a) \geq 0, \;\forall s, a, t \geq 0; \qquad x^t(s, a) = 0, \;\forall s, a, t < 0$$

Figure 2: Example of a PWC reward function (reward as a function of the number of agents $d(s,a)$).

### Piece Wise Constant (PWC) Rewards

We now consider piecewise constant reward functions (Figure 2) for $R(s, a, d)$, in a similar vein to the PWC transition function. We employ the same linearization techniques as the ones used for the piecewise constant transition function. While we only provide the algorithm for piecewise constant rewards in SOLVEDSPAIT-PWC(), the constraints introduced for piecewise constant transitions can be included directly in this algorithm.
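To see what the piece-selection encoding buys, the sketch below evaluates a PWC reward both directly and through the binary selection logic that Constraints 12 and 15-17 enforce in the MILP (here the selection is computed directly rather than left to a solver). The breakpoints, piece values, population size and test flows are hypothetical.

```python
# Illustrative only: the piece-selection encoding behind Algorithm 4 for a PWC
# reward, checked against direct evaluation. All numbers below are hypothetical.

P = 10                                       # number of agents
pieces = [                                   # (d_low, d_high, c^{t,k}_{s,a})
    (0, 3, 8.0),
    (4, 7, 5.0),
    (8, 10, 1.0),
]

def pwc_reward(d):
    """Direct evaluation of the PWC reward R^t(s, a, d)."""
    for d_lo, d_hi, c in pieces:
        if d_lo <= d <= d_hi:
            return c
    raise ValueError("d outside the modelled range")

def encoded_reward(x_flow):
    """Evaluation through the y/Z encoding: exactly one piece is selected (Eq. 12),
    and Z^{t,k} = c^{t,k} * x only for the selected piece (Eqs. 15-17)."""
    d = P * x_flow
    y = [1 if d_lo <= d <= d_hi else 0 for d_lo, d_hi, _ in pieces]
    assert sum(y) == 1                       # Constraint 12: exactly one active piece
    Z = [y_k * c * x_flow for y_k, (_, _, c) in zip(y, pieces)]
    return P * sum(Z)                        # objective contribution |P| * sum_k Z^{t,k}

for x_flow in (0.2, 0.55, 0.9):
    d = P * x_flow
    assert abs(encoded_reward(x_flow) - P * x_flow * pwc_reward(d)) < 1e-9
    print(x_flow, encoded_reward(x_flow))
```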
### Piece Wise Linear and Convex (PWLC) Rewards

We now generalize the reward function to be piecewise linear and convex (Figure 3[a]). Unlike the previous cases, where we provided exact optimization formulations, we are only able to provide a local-optimal approach for this case. Formally, the piecewise linear and convex reward function is specified as follows:
$$R^t(s, a, d) = \max_k \big\{ m^{t,k}_{s,a} \cdot d + c^{t,k}_{s,a} \big\} \quad (18)$$
Therefore, the objective reduces to:
$$\sum_{s,a,t} |P| \cdot x^t(s, a) \cdot \max_k \big\{ m^{t,k}_{s,a} \cdot |P| \cdot x^t(s, a) + c^{t,k}_{s,a} \big\} = \sum_{s,a,t} \max_k \Big[ m^{t,k}_{s,a} \cdot |P|^2 \cdot \big(x^t(s, a)\big)^2 + c^{t,k}_{s,a} \cdot |P| \cdot x^t(s, a) \Big]$$

Figure 3: (a) Example of a PWLC reward function; (b) maximum of two concave functions (both plotted against the number of agents $d(s,a)$).

Figure 4: Run-time performance (in milliseconds) of (a) SOLVEDSPAIT-LINEAR() against problem size; (b) SOLVEDSPAIT-PWC() against the number of piecewise components.

Figure 3[b] provides an example to illustrate that the maximum of multiple concave functions is typically non-concave and has multiple local maxima. Hence, well known convex or concave programming solvers cannot be used directly to solve this optimization problem. Therefore, we pursue an approximate approach that iteratively improves on the flow variables $x$ and the piecewise linear components $k$.

Algorithm 5: SOLVEDSPAIT-PWLC()
1: $k = \langle k^0_{0,0}, \ldots, k^t_{s,a}, \ldots \rangle \leftarrow$ GETRANDCOMPNNTS()
2: $x \leftarrow$ SOLVEDSPAIT-LINEAR($k$); $x_1 \leftarrow \emptyset$
3: while $x \neq x_1$ do
4: $\quad x_1 \leftarrow x$; $k \leftarrow$ GETBESTK($x$)
5: $\quad x \leftarrow$ SOLVEDSPAIT-LINEAR($k$)
6: return $x$

Algorithm 5 provides the approximate approach for solving this category of problems. Initially, we select a random linear component (slope $m$ and intercept $c$) from the available components for every state, action pair at every time step. On line 1, we obtain these components $k$ using the GETRANDCOMPNNTS() function. Given a fixed set of linear components $k$ for every state, action pair at each time step, the resulting problem has linear reward functions and hence we use the SOLVEDSPAIT-LINEAR() function of Algorithm 3 (line 2) to obtain the optimal flow $x$. We then find the best component vector $k$ corresponding to the flow vector $x$ on line 4 using the GETBESTK() function. This iterative improvement process over $k$ and $x$ (lines 4-5) is continued until the flow vector converges. In this algorithm, neither GETBESTK() nor SOLVEDSPAIT-LINEAR() will ever find a solution that has a lower value than the current solution, because the current solution is part of the search space. Each iteration thus does not reduce the solution quality and, since there is a finite number of PWLC components, the algorithm converges to a local optimum.

Figure 5: Convergence of the SOLVEDSPAIT-PWLC function on three different example problems (expected value against number of iterations).
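The paper does not give pseudocode for GETBESTK(); a natural reading, sketched below, is that for every $(t, s, a)$ it picks the component $k$ that maximizes the objective contribution $m^{t,k}_{s,a}|P|^2 x^2 + c^{t,k}_{s,a}|P| x$ at the current flows. The shapes and random data are hypothetical.

```python
import numpy as np

# Illustrative sketch of the GETBESTK(x) step of Algorithm 5: for every (t, s, a),
# pick the PWLC component k maximizing the contribution at the current flows x.
# Since contributions are separable once x is fixed, per-entry argmax maximizes
# the overall objective, so the chosen components can never decrease the value.

def get_best_k(x, m, c, P):
    """x: (T, S, A) flows; m, c: (T, S, A, K) component slopes and intercepts."""
    d = P * x[..., None]                  # number of agents, broadcast over components
    contrib = m * d ** 2 + c * d          # = |P| x * (m |P| x + c), per component k
    return contrib.argmax(axis=-1)        # (T, S, A) array of best component indices

T, S, A, K, P = 3, 2, 2, 4, 10            # hypothetical sizes
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 0.5, (T, S, A))
m = rng.uniform(-1.0, 1.0, (T, S, A, K))
c = rng.uniform(0.0, 5.0, (T, S, A, K))
print(get_best_k(x, m, c, P))
```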
### Discussion on Generic $I^R$ and $I^\varphi$ Structures

Our contributions above are provided for the case where $I^R_{s,a} = \{(s, a)\}$ and $I^\varphi_{s,a} = \{(s, a)\}$. We now focus on generic $I^R$ and $I^\varphi$ structures, where the reward and transition functions can depend on the sum of agent numbers in multiple different state, action pairs, and comment on the generality of our contributions in this context.
(a) SOLVEDSPAIT-PWC() with PWC reward and transition functions can be used with generic $I^\varphi$ and $I^R$ structures; no change in the approach is required.
(b) If we have a linear reward function with a generic $I^R_{s,a}$, the Hessian matrix of the objective is not negative semidefinite even with negative slopes. Hence, the problem is not concave. However, since the quadratic optimization problem contains products of the form $x^t(s, a) \cdot x^t(s', a')$ and each of the flow values is at most 1 (since these are finite horizon MDPs), separable programming (applied in the context of MDPs by (Ahmed et al. 2013)) can be used to provide tight approximations of the quadratic terms. More specifically,
$$x^t(s, a) \cdot x^t(s', a') = A^2 - B^2, \quad \text{where } A = \frac{x^t(s, a) + x^t(s', a')}{2}, \;\; B = \frac{x^t(s, a) - x^t(s', a')}{2}$$
The squares $A^2$ and $B^2$ can be approximated as follows, by dividing $[0, 1]$ into $W$ intervals with breakpoints $\{br_w\}$ and using SOS2 constraints:
$$A = \sum_w \lambda_w \cdot br_w; \qquad A^2 = \sum_w \lambda_w \cdot (br_w)^2; \qquad \sum_w \lambda_w = 1; \qquad \text{SOS2}(\{\lambda_w\})$$
and similarly for $B$. SOLVEDSPAIT-LINEAR(), and consequently SOLVEDSPAIT-PWLC(), is updated to include this approximation of the quadratic terms.
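The numerical effect of this separable-programming step can be checked directly, as in the sketch below: the bilinear term is rewritten as $A^2 - B^2$ and each square is replaced by a piecewise linear interpolation over the breakpoints, which is exactly the value the SOS2 representation produces when only two adjacent $\lambda_w$ are non-zero. The breakpoint count and the two flow values are hypothetical.

```python
import numpy as np

# Illustrative only: the A^2 - B^2 rewriting of a bilinear flow product and its
# piecewise linear (SOS2-style) approximation, evaluated directly here rather
# than inside a mathematical program. Breakpoints and flow values are hypothetical.

def pwl_square(v, breakpoints):
    """Piecewise linear approximation of v**2: interpolate the points (br, br^2) at v."""
    return np.interp(v, breakpoints, breakpoints ** 2)

W = 10
br = np.linspace(0.0, 1.0, W + 1)          # breakpoints {br_w} dividing [0, 1]

x, y = 0.37, 0.81                          # two flow values in [0, 1]
A, B = (x + y) / 2.0, (x - y) / 2.0        # x*y = A^2 - B^2
approx = pwl_square(A, br) - pwl_square(abs(B), br)   # B may be negative; B^2 = |B|^2

print(x * y, approx)                       # exact 0.2997 vs. a close piecewise estimate
```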
## Experimental Results

In this section, we demonstrate the following: (1) the run-time performance of the SOLVEDSPAIT-LINEAR() and SOLVEDSPAIT-PWC() approaches; (2) the run-time performance, scalability and solution quality of the local-optimal approach in SOLVEDSPAIT-PWLC(); and (3) a performance comparison against SMFU (Varakantham et al. 2012), which also exploits anonymous interactions, albeit in competitive settings.

We generated random D-SPAIT problems, where both the reward and transition functions were generated randomly. For the transition function, we varied the reachability (number of states with positive probability) of the states and generated random transition probabilities. For the reward function, we generated random numbers within a range while satisfying the assumptions of the specific categories of functions (e.g., negative slopes, monotonically non-increasing, etc.). We then used our optimization algorithms to solve these random D-SPAIT problems. All the linear and quadratic optimization problems were solved using the commercial optimization software CPLEX 12.2 on a 1.8 GHz Intel Core i5 machine with 8GB of 1600 MHz DDR3 RAM.

Figure 4(a) provides the run-time results for SOLVEDSPAIT-LINEAR(). A problem type is denoted by the cross product of the number of states (zones), the number of actions (zones) and the number of decision epochs (time horizon). The five problem types we considered for the linear reward functions were 20x20x24, 40x40x24, 60x60x24, 80x80x24 and 100x100x24. We randomly generated 25 instances for each problem type and each point in the graph of Figure 4(a) is an average run-time over 25 instances. The biggest problem was solved within 4 minutes.

The performance of the mixed integer linear program in SOLVEDSPAIT-PWC() is shown in Figure 4(b). We were able to solve problems with up to 30 states, 10 actions and 10 decision epochs, with 6 piecewise constant components per state, action pair (for both the reward and transition functions), within 20 minutes. Every point in the graph is averaged over 25 instances.

We show the performance of the SOLVEDSPAIT-PWLC function by considering 10 piecewise linear components for every state, action pair and time step of the reward function. We experimented with multiple sets of problems where the number of states, actions and time steps were varied from 20-100, 20-100 and 5-24 respectively. In all cases, the number of iterations required for convergence was less than 15 (i.e., SOLVEDSPAIT-LINEAR() was solved at most 15 times). Since SOLVEDSPAIT-PWLC does not provide optimal solutions, the key results are with respect to solution quality. While SOLVEDSPAIT-PWLC converges to local optima, a very important and practically interesting phenomenon is observed with respect to the quality of those local optima.

Figure 5 provides results on all our experiments with PWLC reward functions, with the number of components ranging from 5 to 20. The key results are: (a) The total number of iterations for convergence (the number of times the while loop in SOLVEDSPAIT-PWLC() is executed) varied between 9 and 14. However, the number of iterations required to be near a local optimum was only 3 or 4. (b) For each problem, we started with 10 random starting values of $k$ and, as witnessed in all three graphs, the starting solution quality has a very high variance. However, the algorithm converged to local optima that were very close to each other on all the problems (the 3 shown here and numerous others that we experimented with). While we do not know whether the global optimum is close to this set of local optima, this is a unique result for a local-optimal algorithm, especially since the problems were generated randomly.

Previously, SMFU was proposed by (Varakantham et al. 2012) to compute equilibrium solutions for stochastic decision-making problems in competitive settings. SMFU exploits the anonymity in interactions while computing equilibria, hence our comparison against it. We compare against SMFU and not against D-TREMOR (Velagapudi et al. 2011) because: SMFU employs shaping of the model based on the influences of other agents, similar to D-TREMOR; SMFU exploits anonymity in interactions and scales to problems with thousands of agents; and, unlike D-TREMOR, SMFU converges to a local optimum. Since SMFU's solution depends on the initial policy, its performance was averaged over multiple initializations of the starting policy. SOLVEDSPAIT-LINEAR() and SOLVEDSPAIT-PWLC() computed optimal policies with run-times (at least) an order of magnitude shorter than the run-time of SMFU. For instance, on the 80x80x24 problem (equivalent in size to the real world taxi fleet optimization problem in (Varakantham et al. 2012)), SOLVEDSPAIT-LINEAR generated optimal solutions in 70 seconds, whereas SMFU took close to 30 minutes. However, with respect to SOLVEDSPAIT-PWC(), SMFU computed solutions in run-times that were an order of magnitude shorter on large problems (e.g., 30x10x10). While SMFU returned optimal solutions in a few cases, overall it returned solutions that were around 60% of the optimal.

## Acknowledgements

This research is supported in part by the National Research Foundation (NRF) Singapore through the Singapore-MIT Alliance for Research and Technology (SMART) and its Future Urban Mobility (FM) Interdisciplinary Research Group.

## References

Ahmed, A.; Varakantham, P.; Adulyasak, Y.; and Jaillet, P. 2013. Regret based robust solutions for uncertain Markov decision processes. In NIPS.
Ahmed, A.; Varakantham, P.; and Cheng, S.-F. 2012. Decision support for agent populations in uncertain and congested environments. In UAI, 44-53.
Becker, R.; Zilberstein, S.; Lesser, V.; and Goldman, C. 2004. Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research 22:423-455.
Bernstein, D.; Givan, R.; Immerman, N.; and Zilberstein, S. 2002. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4):819-840.
Bernstein, D. S.; Hansen, E. A.; and Zilberstein, S. 2005. Bounded policy iteration for decentralized POMDPs. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, 1287-1292.
Hooker, J. 1995. Logic-based Benders decomposition. Mathematical Programming 96:2003.
Kumar, A.; Zilberstein, S.; and Toussaint, M. 2011. Scalable multiagent planning using probabilistic inference. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2140-2146.
Melo, F. S., and Veloso, M. M. 2011. Decentralized MDPs with sparse interactions. Artificial Intelligence 175(11):1757-1789.
Mostafa, H., and Lesser, V. 2012. Offline planning for communication by exploiting structured interactions in decentralized MDPs. In IAT.
Nair, R.; Varakantham, P.; Tambe, M.; and Yokoo, M. 2005. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 133-139.
Roughgarden, T., and Tardos, E. 2002. How bad is selfish routing? Journal of the ACM 49(2):236-259.
Seuken, S., and Zilberstein, S. 2007. Improved memory-bounded dynamic programming for decentralized POMDPs. In UAI.
Shieh, E.; Jain, M.; Jiang, A. X.; and Tambe, M. 2013. Efficiently solving joint activity based security games. In International Joint Conference on Artificial Intelligence (IJCAI).
Varakantham, P.; Kwak, J. Y.; Taylor, M.; Marecki, J.; Scerri, P.; and Tambe, M. 2009. Exploiting coordination locales in distributed POMDPs via social model shaping. In Nineteenth International Conference on Automated Planning and Scheduling.
Varakantham, P.; Cheng, S.-F.; Gordon, G.; and Ahmed, A. 2012. Decision support for agent populations in uncertain and congested environments. In AAAI, 1471-1477.
Velagapudi, P.; Varakantham, P.; Sycara, K.; and Scerri, P. 2011. Distributed model shaping for scaling to decentralized POMDPs with hundreds of agents. In AAMAS, 955-962.
Witwicki, S., and Durfee, E. 2012. Influence-based policy abstraction for weakly-coupled Dec-POMDPs. In ICAPS.
Yin, Z., and Tambe, M. 2011. Continuous time planning for multiagent teams with temporal constraints. In International Joint Conference on Artificial Intelligence (IJCAI).