# Preallocation and Planning Under Stochastic Resource Constraints

Frits de Nijs, Matthijs T. J. Spaan, Mathijs M. de Weerdt
{f.denijs, m.t.j.spaan, m.m.deweerdt}@tudelft.nl
Delft University of Technology, The Netherlands

## Abstract

Resource constraints frequently complicate multi-agent planning problems. Existing algorithms for resource-constrained, multi-agent planning problems rely on the assumption that the constraints are deterministic. However, resource constraints are frequently themselves subject to uncertainty from external influences. Uncertainty about constraints is especially challenging when agents must execute in an environment where communication is unreliable, making on-line coordination difficult. In those cases, it is a significant challenge to find allocations at plan time that remain coordinated with resource availability at run time. To address these limitations, we propose to extend algorithms for constrained multi-agent planning problems to handle stochastic resource constraints. We show how to factorize resource limit uncertainty and use this to develop novel algorithms to plan policies for stochastic constraints. We evaluate the algorithms on a search-and-rescue problem and on a power-constrained planning domain where the resource constraints are decided by nature. We show that plans taking into account all potential realizations of the constraint obtain significantly better utility than planning for the expectation, while causing fewer constraint violations.

## Introduction

Planning for future uncertainties is an effective tool to increase the utility of a system of multiple agents. Particularly when the actions of agents are restricted by scarce resources, planning for resource usage is an important challenge that many authors have addressed (Adelman and Mersereau 2008; Agrawal, Varakantham, and Yeoh 2016; De Nijs, Spaan, and De Weerdt 2015; Gordon et al. 2012; Meuleau et al. 1998; Wu and Durfee 2010; Yoo, Fitch, and Sukkarieh 2012). These approaches have in common that they consider uncertainty in state transitions, while assuming full knowledge about future resource constraints.

However, resource capacity may itself be subject to uncertainty. For example, the amount of power produced from renewable sources such as wind turbines is a stochastic quantity (Klöckl, Papaefthymiou, and Pinson 2008). Similarly, when only a subset of agents participate in a traffic congestion control system, the non-participants contribute to congestion stochastically (De Weerdt et al. 2013). Another source of resource uncertainty occurs when an agent's consumption itself is stochastic (Mausam et al. 2005; Schaffer, Clement, and Chien 2005). Nevertheless, no earlier work has addressed multi-agent planning for such stochastic resources.

In several application domains where multiple constrained agents must coordinate their actions, there may be known, fixed periods during which communication between them is impossible (such as with non-geostationary satellites), unadvisable (such as in warfare), or too uncertain (as in hazardous environments). In other domains the required response time for actors may be so short that planning and coordination need to be done a priori, such as in robot soccer, high-frequency trading in multiple stock markets, or protection control in electricity distribution networks.
In all of these situations, an approach is needed in which coordinated policies are computed for a number of sequential decisions that are taken without further communication. Therefore, in this work we focus on preallocation algorithms, which compute policies for a given plan horizon by allocating resources to agents a priori, thereby effectively decoupling the agents' policies so that they can be computed and executed independently. Decoupling necessarily introduces an error, as agents cannot respond to non-local realizations of uncertain transitions. In this work, however, we show how to permit effective decoupling even in the case of a tight and stochastic coupling constraint.

We extend Multi-agent Markov Decision Processes (MMDPs) with a model of the resource constraint realizations in a separate, orthogonal part of the state space. This enables us to formulate novel approaches based on two state-of-the-art planning algorithms that can deal with deterministic resource constraints. These algorithms represent different solution categories: an optimal preallocation mixed-integer linear program (Wu and Durfee 2010), which restricts worst-case consumption, and the constraint relaxation approach based on Constrained Markov Decision Processes (Altman 1999), which restricts average consumption. We evaluate the benefit of planning for stochastic resource constraints for both approaches by comparing to the state of the art, i.e., planning for the mean constraint level, on a coordinated search-and-rescue domain, demonstrating the need to handle stochastic resource constraints. Subsequently, we use a heater planning domain to demonstrate the scalability of the approximations and their reduced resource violation frequency in larger problems. We show that agents taking into account all potential realizations of the resource limit obtain significantly better policies. Finally, we show that the number of resource violations decreases further with more frequent replanning.

## Background

A Multi-agent Markov Decision Process (MMDP) models a system consisting of $n$ cooperative agents operating under uncertainty (Boutilier 1996). Time in a finite-horizon MMDP is discretized into $h$ time steps. At each step the state $s$ of the system describes all the relevant properties of all agents. We require that the set $S$ of possible discrete states of the system is finite and known. A decision or action $a = \langle a_1, \ldots, a_n \rangle$ of the agents describes for each agent $i$ the selected control input $a_i$. The finite set $A$ contains all potential joint actions. For any given state-action pair, the transition function $T : S \times A \times S \to [0, 1]$ gives the probability of reaching potential future state $s'$. The performance of the agents is measured by a reward function $R : S \times A \to \mathbb{R}$, which assigns a real-valued instantaneous reward to every state-action pair. The tuple $\langle S, A, T, R, h \rangle$ fully specifies an MMDP.

The goal of planning for an (M)MDP is to compute the best action to take in order to obtain the highest possible expected value, as defined through the Bellman equation (1957). The optimal expected value function $V$ is defined as

$$
V[h, s] = \max_{a \in A} R(s, a),
$$
$$
V[t, s] = \max_{a \in A} \Big( R(s, a) + Q[t, s, a] \Big), \quad 1 \le t < h.
$$
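To make the finite-horizon Bellman recursion above concrete, the following is a minimal sketch of tabular backward induction for an MMDP tuple $\langle S, A, T, R, h \rangle$. The dictionary-based data layout and the function name are illustrative assumptions for this sketch, not the authors' implementation; $Q[t, s, a]$ is computed as the expected next-step value under the transition model $T$.

```python
def value_iteration(S, A, T, R, h):
    """Finite-horizon backward induction for an MMDP <S, A, T, R, h>.

    S : list of (hashable) joint states
    A : list of (hashable) joint actions a = (a_1, ..., a_n)
    T : dict mapping (s, a, s_next) -> transition probability
    R : dict mapping (s, a) -> instantaneous reward
    h : planning horizon (time steps are indexed 1..h)

    Returns the value table V[t][s] and a greedy policy pi[t][s].
    """
    V = {t: {} for t in range(1, h + 1)}
    pi = {t: {} for t in range(1, h + 1)}

    # Base case at t = h: only the immediate reward matters.
    for s in S:
        best_a = max(A, key=lambda a: R[(s, a)])
        pi[h][s] = best_a
        V[h][s] = R[(s, best_a)]

    # Backward induction for 1 <= t < h.
    for t in range(h - 1, 0, -1):
        for s in S:
            def q(a):
                # Q[t, s, a]: expected value of the next step under T.
                return sum(T.get((s, a, s2), 0.0) * V[t + 1][s2] for s2 in S)

            best_a = max(A, key=lambda a: R[(s, a)] + q(a))
            pi[t][s] = best_a
            V[t][s] = R[(s, best_a)] + q(best_a)

    return V, pi
```

Note that the joint action set $A$ grows exponentially in the number of agents $n$, which is one reason the preallocation approaches discussed in the Introduction decouple the agents' policies rather than solving the joint MMDP directly.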