Published as a conference paper at ICLR 2021

RODE: LEARNING ROLES TO DECOMPOSE MULTI-AGENT TASKS

Tonghan Wang, Tarun Gupta, Anuj Mahajan, Bei Peng
Institute for Interdisciplinary Information Sciences, Tsinghua University
University of Oxford
tonghanwang1996@gmail.com
{tarun.gupta, anuj.mahajan, bei.peng}@cs.ox.ac.uk

Shimon Whiteson
University of Oxford
shimon.whiteson@cs.ox.ac.uk

Chongjie Zhang*
Tsinghua University
chongjie@tsinghua.edu.cn

*Equal advising

ABSTRACT

Role-based learning holds the promise of achieving scalable multi-agent learning by decomposing complex tasks using roles. However, it is largely unclear how to efficiently discover such a set of roles. To solve this problem, we propose to first decompose joint action spaces into restricted role action spaces by clustering actions according to their effects on the environment and other agents. Learning a role selector based on action effects makes role discovery much easier because it forms a bi-level learning hierarchy: the role selector searches in a smaller role space and at a lower temporal resolution, while role policies learn in significantly reduced primitive action-observation spaces. We further integrate information about action effects into the role policies to boost learning efficiency and policy generalization. By virtue of these advances, our method (1) outperforms the current state-of-the-art MARL algorithms on 9 of the 14 scenarios that comprise the challenging StarCraft II micromanagement benchmark and (2) achieves rapid transfer to new environments with three times the number of agents. Demonstrative videos can be viewed at https://sites.google.com/view/rode-marl.

1 INTRODUCTION

Cooperative multi-agent problems are ubiquitous in real-world applications, such as crewless aerial vehicles (Pham et al., 2018; Xu et al., 2018) and sensor networks (Zhang & Lesser, 2013). However, learning control policies for such systems remains a major challenge. Joint action learning (Claus & Boutilier, 1998) learns centralized policies conditioned on the full state, but this global information is often unavailable during execution due to partial observability or communication constraints. Independent learning (Tan, 1993) avoids this problem by learning decentralized policies, but suffers from non-stationarity during learning because it treats other learning agents as part of the environment.

The framework of centralized training with decentralized execution (CTDE) (Foerster et al., 2016; Gupta et al., 2017; Rashid et al., 2018) combines the advantages of these two paradigms. Decentralized policies are learned in a centralized manner so that they can share information, parameters, etc., without restriction during training. Although CTDE algorithms can solve many multi-agent problems (Mahajan et al., 2019; Das et al., 2019; Wang et al., 2020d), during training they must search in the joint action-observation space, which grows exponentially with the number of agents. This makes it difficult to learn efficiently when the number of agents is large (Samvelyan et al., 2019).

Humans cooperate in a more effective way. When dealing with complex tasks, instead of directly conducting a collective search in the full action-observation space, they typically decompose the task and let sub-groups of individuals learn to solve different sub-tasks (Smith, 1937; Butler, 2012).
Once the task is decomposed, the complexity of cooperative learning can be effectively reduced because individuals can focus on restricted sub-problems, each of which often involves a smaller action-observation space. Such potential scalability motivates the use of roles in multi-agent tasks, in which each role is associated with a certain sub-task and a corresponding policy.

The key question in realizing such scalable learning is how to come up with a set of roles that effectively decomposes the task. Previous work typically predefines the task decomposition and roles (Pavón & Gómez-Sanz, 2003; Cossentino et al., 2005; Spanoudakis & Moraitis, 2010; Bonjean et al., 2014). However, this requires prior knowledge that might not be available in practice and may prevent the learning methods from transferring to different environments. Therefore, to be practical, it is crucial for role-based methods to automatically learn an appropriate set of roles. However, learning roles from scratch might not be easier than learning without roles, as directly finding an optimal decomposition suffers from the same problem as other CTDE learning methods: searching in the large joint space with substantial exploration (Wang et al., 2020c).

To solve this problem, we propose a novel framework for learning ROles to DEcompose (RODE) multi-agent tasks. Our key insight is that, instead of learning roles from scratch, role discovery is easier if we first decompose joint action spaces according to action functionality. Intuitively, when cooperating with other agents, only a subset of actions that fulfill a certain functionality is needed under certain observations. For example, in football, a player who does not possess the ball only needs to explore how to move or sprint when attacking. In practice, we propose to first learn effect-based action representations and cluster actions into role action spaces according to their effects on the environment and other agents. Then, with knowledge of the effects of available actions, we train a role selector that determines the corresponding role observation spaces.

This design forms a bi-level learning framework. At the top level, a role selector coordinates role assignments in a smaller role space and at a lower temporal resolution. At the low level, role policies explore strategies in reduced primitive action-observation spaces. In this way, the learning complexity is significantly reduced by decomposing a multi-agent cooperation problem, both temporally and spatially, into several short-horizon learning problems with fewer agents. To further improve learning efficiency on the sub-problems, we condition role policies on the learned effect-based action representations, which improves the generalizability of role policies across actions.

We test RODE on StarCraft II micromanagement environments (Samvelyan et al., 2019). Results on this benchmark show that RODE establishes a new state of the art. In particular, RODE achieves the best performance on 9 of the 14 maps, including all 5 super hard maps and most hard maps. Visualizations of learned action representations, factored action spaces, and the dynamics of role selection shed further light on the superior performance of RODE. We also demonstrate that conditioning the role selector and role policies on action representations enables learned RODE policies to be transferred to tasks with different numbers of actions and agents, including tasks with three times as many agents.
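To make the clustering step concrete, below is a minimal sketch (not the paper's implementation) of how learned effect-based action representations could be partitioned into restricted role action spaces. The representation matrix, the number of roles, and the use of scikit-learn's k-means are illustrative assumptions.

```python
# Minimal sketch: partition actions into role action spaces by clustering
# their effect-based representations. Matrix shapes, the role count, and
# the choice of k-means are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def build_role_action_spaces(action_repr: np.ndarray, n_roles: int):
    """action_repr: (n_actions, d) matrix of learned action representations."""
    km = KMeans(n_clusters=n_roles, n_init=10, random_state=0).fit(action_repr)
    # Each role's restricted action space is one cluster of similar-effect actions.
    role_action_spaces = [np.flatnonzero(km.labels_ == j) for j in range(n_roles)]
    # A simple role representation: the mean of the cluster's action
    # representations, which a role selector could compare against agent embeddings.
    role_repr = np.stack([action_repr[idx].mean(axis=0) for idx in role_action_spaces])
    return role_action_spaces, role_repr

# Example with placeholder representations: 16 actions, 20-dim effect embeddings.
spaces, role_repr = build_role_action_spaces(np.random.randn(16, 20), n_roles=4)
```

Because the role selector then searches over only `n_roles` options rather than the full action set, and each role policy acts within one restricted cluster, both levels of the hierarchy face a much smaller search space.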
2 RODE LEARNING FRAMEWORK

In this section, we introduce the RODE learning framework. We consider fully cooperative multi-agent tasks that can be modelled as a Dec-POMDP (Oliehoek et al., 2016) consisting of a tuple $G = \langle I, S, A, P, R, \Omega, O, n, \gamma \rangle$, where $I$ is the finite set of $n$ agents, $\gamma \in [0, 1)$ is the discount factor, and $s \in S$ is the true state of the environment. At each timestep, each agent $i$ receives an observation $o_i \in \Omega$ drawn according to the observation function $O(s, i)$ and selects an action $a_i \in A$, forming a joint action $\boldsymbol{a} \in A^n$, which leads to a next state $s'$ according to the transition function $P(s' \mid s, \boldsymbol{a})$ and a reward $r = R(s, \boldsymbol{a})$ shared by all agents. Each agent has a local action-observation history $\tau_i \in \mathcal{T} \equiv (\Omega \times A)^*$.

Our idea is to learn to decompose a multi-agent cooperative task into a set of sub-tasks, each of which has a much smaller action-observation space. Each sub-task is associated with a role, and agents taking the same role collectively learn a role policy for solving the sub-task by sharing their learning. Formally, we propose the following definition of sub-tasks and roles.

Definition 1 (Role and Sub-Task). Given a cooperative multi-agent task $G = \langle I, S, A, P, R, \Omega, O, n, \gamma \rangle$, let $\Psi$ be a set of roles. A role $\rho_j \in \Psi$ is a tuple $\langle g_j, \pi_{\rho_j} \rangle$, where $g_j = \langle I_j, S, A_j, P, R, \Omega_j, O, \gamma \rangle$ is a sub-task and $I_j \subseteq I$, $\bigcup_j I_j = I$, and $I_j \cap I_k = \emptyset$ for $j \neq k$. $A_j$ is the action space of role $j$,

[Figure: overview of the RODE architecture, panels (a)-(c): an action encoder maps one-hot actions to action representations used to make predictions from $(o_i, \boldsymbol{a}_{-i})$; a role selector compares role representations via dot products; and role policies act over the actions of role $j$. Legend: linear network, vector, dot product.]
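As a companion to the figure above, here is a hedged PyTorch sketch of an effect-based action encoder: a one-hot action is embedded into a representation $z_a$, which, together with the agent's observation $o_i$ and the other agents' actions $\boldsymbol{a}_{-i}$, is trained to predict the next observation and reward. The layer sizes, loss, and tensor layout are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of an effect-based action encoder (assumed architecture):
# embed a one-hot action into z_a, then predict the environment's response
# (next observation and reward) from z_a, the agent's observation o_i, and
# the other agents' joint action a_{-i}.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionEncoder(nn.Module):
    def __init__(self, n_actions, obs_dim, others_dim, repr_dim=20, hidden_dim=64):
        super().__init__()
        self.embed = nn.Linear(n_actions, repr_dim)          # one-hot action -> z_a
        self.predict = nn.Sequential(
            nn.Linear(repr_dim + obs_dim + others_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim + 1),              # next observation + reward
        )

    def forward(self, action_onehot, obs, others_actions):
        z_a = self.embed(action_onehot)
        pred = self.predict(torch.cat([z_a, obs, others_actions], dim=-1))
        return z_a, pred[..., :-1], pred[..., -1]            # z_a, predicted o', predicted r

def representation_loss(model, batch):
    """Supervised prediction loss; minimizing it shapes z_a to reflect action effects."""
    _, next_obs_pred, reward_pred = model(
        batch["action_onehot"], batch["obs"], batch["others"]
    )
    return (F.mse_loss(next_obs_pred, batch["next_obs"])
            + F.mse_loss(reward_pred, batch["reward"]))
```

Under this reading, actions whose representations $z_a$ are close induce similar predicted effects, which is what makes clustering them into role action spaces (as sketched in Section 1) meaningful.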