# Growing Action Spaces

Gregory Farquhar 1 2, Laura Gustafson 3, Zeming Lin 3, Shimon Whiteson 1, Nicolas Usunier 3, Gabriel Synnaeve 3

1 University of Oxford. 2 Work done at Facebook AI Research. 3 Facebook AI Research. Correspondence to: Gregory Farquhar.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control, but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously, and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multi-agent action spaces.

1. Introduction

The value of curricula has been well established in machine learning, reinforcement learning, and in biological systems. When a desired behaviour is sufficiently complex, or the environment too unforgiving, it can be intractable to learn from scratch through random exploration. Instead, by starting small (Elman, 1993), an agent can build skills, representations, and a dataset of meaningful experiences that allow it to accelerate its learning. Such curricula can drastically improve sample efficiency (Bengio et al., 2009).

Typically, curriculum learning uses a progression of tasks or environments. Simple tasks that provide meaningful feedback to random agents are used first, and some schedule is used to introduce more challenging tasks later during training (Graves et al., 2017). However, in many contexts neither the agent nor the experimenter has such unimpeded control over the environment. In this work, we instead make use of curricula that are internal to the agent, simplifying the exploration problem without changing the environment.

In particular, we grow the size of the action space of reinforcement learning agents over the course of training. At the beginning of training, our agents use a severely restricted action space. When this restricted action space is appropriately chosen for the task, it helps exploration by guiding the agent towards rewards and meaningful experiences, and provides low-variance updates during learning. The action space is then grown progressively. Eventually, using the most unrestricted action space, the agents are able to find superior policies. Each action space is a strict superset of the more restricted ones.

In many cases, it is straightforward to identify a suitable hierarchy of action spaces that enables efficient exploration. For example, continuous action spaces can be discretised with increasing resolution. Curricula for coping with the large combinatorial action spaces induced by many agents can be obtained from the prior that nearby agents are more likely to need to coordinate. For example, in routing or traffic flow problems, nearby agents or nodes may wish to adopt similar local policies to alleviate global congestion.
Our curriculum learning paradigm is valuable whenever it is possible to identify a restricted action space in which random exploration leads to significantly more informative experiences than random exploration in the full action space, and when regularities in the action space allow learning from these experiences to transfer to the full task. We propose an approach that uses off-policy reinforcement learning to improve sample efficiency in this type of curriculum learning. Since data from exploration using a restricted action space is still valid in the Markov Decision Processes (MDPs) corresponding to the less restricted action spaces, we can learn value functions in the less restricted action spaces with off-action-space data collected by exploring in the restricted action space. In our approach, we learn value functions corresponding to each level of restriction simultaneously. We can use the relationships of these value functions to each other to accelerate learning further, by using the value estimates themselves as initialisations or as bootstrap targets for the less restricted action spaces, as well as by sharing learned state representations.

Empirically, we first demonstrate the efficacy of our approach in two simple control tasks, in which the resolution of discretised actions is progressively increased. We then tackle a more challenging set of problems with combinatorial action spaces, in the context of StarCraft micromanagement with large numbers of agents (50-100). Given the heuristic prior that nearby agents in a multi-agent setting are likely to need to coordinate, we use hierarchical clustering to impose a restricted action space on the agents. Agents in a cluster are restricted to take the same action, but we progressively increase the number of groups that can act independently of one another over the course of training. Our method substantially improves sample efficiency on a number of scenarios, outperforming learning any particular action space from scratch, a number of ablations, and an actor-critic baseline that learns a single value function for the behaviour policy, as in the work of Czarnecki et al. (2018). Code is available at https://github.com/TorchCraft/TorchCraftAI/tree/gas-micro.

2. Related work

Curriculum learning has a long history, appearing at least as early as the work of Selfridge et al. (1985) in reinforcement learning, and in the training of neural networks since Elman (1993). In supervised learning, one typically has control of the order in which data is presented to the learning algorithm. For learning with deep neural networks, Bengio et al. (2009) explored the use of curricula in computer vision and natural language processing. Many approaches use handcrafted schedules for task curricula, but others (Zaremba & Sutskever, 2014; Pentina et al., 2015; Graves et al., 2017) study diagnostics that can be used to automate the choice of task mixtures throughout training. In a self-supervised control setting, Murali et al. (2018) use sensitivity analysis to automatically define a curriculum over action dimensions and prioritise their search space.

In some reinforcement learning settings, it may also be possible to control the environment to induce a curriculum. With a resettable simulator, one can use a sequence of progressively more challenging initial states (Asada et al., 1996; Florensa et al., 2017). With a procedurally generated task, it is often possible to tune the difficulty of the environments (Tamar et al., 2016).
Similar curricula also appear often in hierarchical reinforcement learning, where skills can be learned in comparatively easy settings and then composed in more complex ways later (Singh, 1992). Taylor et al. (2007) use more general inter-task mappings to transfer Q-values between tasks that do not share state and action spaces. In adversarial settings, one may also induce a curriculum through self-play (Tesauro, 1995; Sukhbaatar et al., 2017; Silver et al., 2017). In this case, the learning agents themselves define the changing part of the environment.

A less invasive manipulation of the environment involves altering the reward function. Such reward shaping allows learning policies in an easier MDP, which can then be transferred to the more difficult sparse-reward task (Colombetti & Dorigo, 1992; Ng et al., 1999). It is also possible to learn reward shaping on simple tasks and transfer it to harder tasks in a curriculum (Konidaris & Barto, 2006).

In contrast, learning with increasingly complex function approximators does not require any control of the environment. In reinforcement learning, this has often taken the form of adaptively growing the resolution of the state space considered by a piecewise constant discretised approximation (Moore, 1994; Munos & Moore, 2002; Whiteson et al., 2007). Stanley & Miikkulainen (2004) study continual complexification in the context of coevolution, growing the complexity of neural network architectures through the course of training. These works progressively increase the capabilities of the agent, but not with respect to its available actions.

In the context of planning online with a model, a number of approaches use progressive widening to consider increasingly large action spaces over the course of search (Chaslot et al., 2008), including in planning for continuous action spaces (Couëtoux et al., 2011). These methods are typically applied in a quite different context, considering additional actions over the course of a search budget for planning rather than over a long period of curriculum learning, or in the model-free setting.

A recent related work tackling our domain is that of Czarnecki et al. (2018), who train mixtures of two policies with an actor-critic approach, learning a single value function for the current mixture of policies. The mixture contains a policy that may be harder to learn but has a higher performance ceiling, such as a policy with a larger action space as we consider in this work. The mixing coefficient is initialised to only support the simpler policy, and adapted via population based training (Jaderberg et al., 2017). In contrast, we simultaneously learn a different value function for each policy, and exploit the properties of the optimal value functions to induce additional structure on our models. We further use these properties to construct a scheme for off-action-space learning, which means our approach may be used in an off-policy setting. Empirically, in our settings, we find our approach to perform better and more consistently than an actor-critic algorithm modeled after Czarnecki et al. (2018), although we do not take on the significant additional computational requirements of population based training in any of our experiments.
A number of other works address the problem of generalisation and representation for value functions with large discrete action spaces, without explicitly addressing the resulting exploration problem (Dulac-Arnold et al., 2015; Pan et al., 2018; Van de Wiele et al., 2020). These approaches typically rely on action representations from prior knowledge. Such representations could be used in combination with our method to construct a hierarchy of action spaces with which to shape exploration by curriculum learning.

Finally, there are many other strategies for intelligent exploration in reinforcement learning, which often employ some notion of optimism in the face of uncertainty (Schmidhuber, 1991; Auer, 2002; Bellemare et al., 2016; Burda et al., 2018). Our work can be complementary to these methods, as restricted action spaces may allow a more efficient traversal of the state space, even when guided by a sophisticated exploration strategy, in order to find extrinsic reward. We study this further in Section 6.1.

3. Background

We formalise our problem as an MDP, specified by a tuple $\langle S, S_0, A, P, r, \gamma \rangle$. The sets of possible states and actions are given by $S$ and $A$, $P$ is the transition function that specifies the environment dynamics, $S_0$ is the initial state distribution, and $\gamma$ is a discount factor used to specify the discounted return $R = \sum_{t=0}^{T} \gamma^t r_t$ for an episode of length $T$. We wish our agent to maximise this return in expectation by learning a policy $\pi$ that maps states to actions. The state-action value function (Q-function) is given by $Q^\pi(s, a) = \mathbb{E}_\pi[R \mid s, a]$. The optimal Q-function $Q^*$ satisfies the Bellman optimality equation:

$$Q^*(s, a) = \mathcal{T} Q^*(s, a) = \mathbb{E}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right]. \qquad (1)$$

Q-learning (Watkins & Dayan, 1992) uses a sample-based approximation of the Bellman optimality operator $\mathcal{T}$ to iteratively improve an estimate of $Q^*$. Q-learning is an off-policy method, meaning that samples from any policy may be used to improve the value function estimate. We use this property to apply Q-learning for off-action-space learning, as described in the next section. In deep RL, the Q-function is approximated by a deep neural network trained by stochastic gradient descent to minimise the Q-learning TD-error (Mnih et al., 2015).

4. Curriculum learning with growing action spaces

4.1. Action space hierarchies

We first introduce some notation for restricted action spaces. In particular, for an MDP with unrestricted action space $A$ we define a set of $N$ action spaces $A_\ell$, $\ell \in \{0, \ldots, N-1\}$. Each action space is a subset of the next: $A_0 \subset A_1 \subset \ldots \subset A_{N-1} = A$. A policy restricted to actions $A_\ell$ is denoted $\pi_\ell(a \mid s)$. The optimal policy in this restricted policy class is $\pi^*_\ell(a \mid s)$, and its corresponding action-value and value functions are $Q^*_\ell(s, a)$ and $V^*_\ell(s) = \max_a Q^*_\ell(s, a)$.

Additionally, we define a hierarchy of actions by identifying, for every action $a \in A_\ell$ with $\ell > 0$, a parent action $\text{parent}_\ell(a)$ in the space $A_{\ell-1}$. Since action spaces are subsets of larger action spaces, for all $a \in A_{\ell-1}$ we have $\text{parent}_\ell(a) = a$, i.e., one child of each action is itself. Simple pieces of domain knowledge are often sufficient to define these hierarchies. For example, a discretised continuous action can identify its nearest neighbour in $A_{\ell-1}$ as its parent. In Section 5 we describe a possible hierarchy for high-dimensional action spaces in which each action dimension is associated with a spatial location, such as those that arise in multi-agent problems.

Defining these relationships requires a notion of similarity in the action space, and will be effective when similar actions (according to the similarity metric used to define the hierarchy) have similar values. We use simple heuristics to define our hierarchies (e.g., a Euclidean metric on a continuous action dimension or on the spatial locations of agents), but we could also try to learn these metrics from data. If action embeddings in a latent space (Tennenholtz & Mannor, 2019) can be learned from random behaviour or demonstrations, then we can cluster actions in the latent space and choose representative actions to form a coarse space. However, we leave this direction to future work.
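To make the parent relation concrete, here is a minimal Python sketch of the nearest-neighbour rule for a discretised continuous action. The specific discretisation (evenly spaced forces in [-1, 1], doubling in resolution per level) and the function names are illustrative assumptions, not the exact construction used in our experiments.

```python
import numpy as np

def make_discretisations(num_levels: int):
    """Hypothetical hierarchy: progressively finer discretisations of a
    scalar action in [-1, 1]. A_0 has 2 actions and each level doubles
    the resolution (illustrative scheme only)."""
    return [np.linspace(-1.0, 1.0, 2 ** (level + 1)) for level in range(num_levels)]

def parent(a_value: float, coarser_actions: np.ndarray) -> int:
    """Parent of a (finer) action: its nearest neighbour in the coarser
    action space A_{l-1}, as suggested in Section 4.1."""
    return int(np.argmin(np.abs(coarser_actions - a_value)))

action_spaces = make_discretisations(3)
# Parent index in A_1 of each action in A_2:
parents = [parent(a, action_spaces[1]) for a in action_spaces[2]]
print(parents)
```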
4.2. Off-action-space learning

We observe that the value function for an action space $A_\ell$ may be updated with transitions using actions drawn from its own action space, or from any more restricted action space, if we use an off-policy learning algorithm. The restricted transitions simply form a subset of the data required to learn the value functions of the less restricted action spaces. To exploit this insight, we simultaneously learn an estimated optimal value function $\hat{Q}^*_\ell(s, a)$ for each action space $A_\ell$, and use samples drawn from a behaviour policy based on a value function for low ℓ to directly train all of the higher-ℓ value functions.

At the beginning of each episode, we sample ℓ according to some distribution (the curriculum is induced by gradually increasing the mean of this distribution over the course of training). The experiences generated in that episode are used to update all of the $\hat{Q}^*_\ell(s, a)$. This off-action-space learning is a type of off-policy learning that enables efficient exploration by restricting behaviour to the low-ℓ regime. We sample at the beginning of the episode rather than at each timestep because, if the agent uses a high-ℓ action, it may enter a state that is inaccessible for a lower-ℓ policy, and we do not wish to force a low-ℓ value function to generalise to states that are only accessible at higher ℓ.

Since data from a restricted action space only supports a subset of the state-action space relevant for the value functions of less restricted action spaces, we hope that a suitable function approximator allows useful generalisation to the unexplored parts of the less restricted state-action space.

4.3. Value estimates

Note that:

$$V^*_i(s) \leq V^*_j(s) \quad \forall s \quad \text{if } i < j. \qquad (2)$$

This is because each action space is a strict subset of the larger ones, so the agent can always, in the worst case, fall back to a policy using a more restricted action space. This monotonicity intuitively recommends an iterative decomposition of the value estimates, in which $\hat{Q}^*_{\ell+1}(s, a)$ is estimated as a sum of $\hat{Q}^*_\ell(s, a)$ and some positive $\Delta_\ell(s, a)$. This is not immediately possible due to the mismatch in the support of each function. However, we can leverage a hierarchical structure in the action spaces when present, as described in Section 4.1. In this case we can use:

$$\hat{Q}^*_{\ell+1}(s, a) = \hat{Q}^*_\ell(s, \text{parent}_\ell(a)) + \Delta_\ell(s, a). \qquad (3)$$

This is a task-specific upsampling of the lower-ℓ value function to initialise the next value function. Of course, the utility of this adaptive initialisation depends on the choice of action-space hierarchy. Both $\hat{Q}^*_\ell(s, a)$ and $\Delta_\ell(s, a)$ are learned components. We could further regularise or restrict the functional form of $\Delta_\ell$ to ensure the positivity required for actions with matching support, i.e., when $\text{parent}_\ell(a) = a$. However, we did not find this to be valuable in our experiments, and simply initialise $\Delta_\ell$ to be small.
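The following tabular sketch illustrates how a single transition can update every level's value estimate (Section 4.2), and how Eq. (3) composes $\hat{Q}^*_{\ell+1}$ from $\hat{Q}^*_\ell$ and a learned $\Delta_\ell$. It is a simplification for exposition only; our implementation uses neural networks and n-step targets, and all names here (`GASQTables`, `actions_per_level`, etc.) are hypothetical.

```python
import numpy as np

class GASQTables:
    """Tabular sketch of GAS value estimates (the actual method uses neural
    networks; this only illustrates the update structure). q0[s, a] stores
    Q-hat_0(s, a); deltas[l][s, a] stores Delta_l(s, a), so that
    Q-hat_{l+1}(s, a) = Q-hat_l(s, parent_l(a)) + Delta_l(s, a)   (Eq. 3)."""

    def __init__(self, n_states, action_sizes, parents, gamma=0.99, lr=0.1):
        # parents[l] maps each action index in A_{l+1} to its parent index in A_l.
        self.q0 = np.zeros((n_states, action_sizes[0]))
        self.deltas = [np.zeros((n_states, size)) for size in action_sizes[1:]]
        self.parents = [np.asarray(p) for p in parents]
        self.gamma, self.lr = gamma, lr

    def q(self, level, s):
        """Compose Q-hat_level(s, .) iteratively from the coarser levels."""
        values = self.q0[s]
        for l in range(level):
            values = values[self.parents[l]] + self.deltas[l][s]
        return values

    def update(self, s, actions_per_level, r, s_next, done):
        """Off-action-space update: a transition collected while behaving at
        level l_b also trains every level l >= l_b, since the taken action
        belongs to those larger action spaces. actions_per_level[l] is the
        action's index in A_l, or None if the action is not in A_l."""
        for l, a in enumerate(actions_per_level):
            if a is None:
                continue
            bootstrap = 0.0 if done else self.gamma * self.q(l, s_next).max()
            td_error = r + bootstrap - self.q(l, s)[a]
            if l == 0:
                self.q0[s, a] += self.lr * td_error
            else:
                # Only the residual Delta_{l-1} is adjusted here; coarser
                # levels are trained by their own entries in this loop.
                self.deltas[l - 1][s, a] += self.lr * td_error
```

For example, an action taken while behaving with $A_0$ would be passed with its index in every $A_\ell$, so the same experience trains each $\hat{Q}^*_\ell$.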
The property (2) also implies a modified Bellman optimality equation:

$$Q^*_\ell(s, a) = \mathbb{E}\left[r(s, a) + \gamma \max_{i \leq \ell} \max_{a' \in A_i} Q^*_i(s', a')\right]. \qquad (4)$$

The maximisations over $i < \ell$ are redundant in their role as conditions on the optimal value function $Q^*_\ell$. However, the Bellman optimality equation also gives us the form of a Q-learning update, where the term in the expectation on the right-hand side is used as a target towards which to move an estimate of $Q^*$. When these estimates are inaccurate, the modified form of the Bellman equation may lead to different updates, allowing the Q-functions at higher ℓ to be bootstrapped from those at lower ℓ. We expect that policies with low ℓ are easier to learn, and that therefore the corresponding $\hat{Q}^*_\ell$ is of higher value and more accurate earlier in training than those at high ℓ. These high values could be picked up by the extra maximisation in the modified bootstrap, and thereby rapidly learned by the higher-ℓ value functions. Empirically, however, we find that using this form for the target in our loss function performs no better than just maximising over $\hat{Q}^*_\ell(s', a')$. We discuss the choice of target and these results in more detail in Section 6.

4.4. Representation

By sharing parameters between the function approximators of each $Q^*_\ell$, we can learn a joint state representation, which can then be iteratively decoded into estimates of $Q^*_\ell$ for each ℓ. This shared embedding can be iteratively refined by, e.g., additional network layers for each $\hat{Q}^*_\ell$, to maintain flexibility along with transfer of useful representations. This approach has had great success in improving the efficiency of many multi-task solutions using deep learning (Ruder, 2017).

4.5. Curriculum scheduling

We need to choose a schedule with which to increase the ℓ used by the behaviour policy over the course of training. Czarnecki et al. (2018) use population based training (Jaderberg et al., 2017) to choose a mixing parameter on the fly. However, this comes at significant computational cost, and optimises greedily for immediate performance gains. We use a simple linear schedule on a mixing parameter $\alpha \in [0, N-1]$. Initially $\alpha = 0$ and we always choose $\ell = 0$. Later, we pick $\ell = \lfloor \alpha \rfloor$ with probability $\lceil \alpha \rceil - \alpha$ and $\ell = \lceil \alpha \rceil$ with probability $\alpha - \lfloor \alpha \rfloor$ (e.g., if $\alpha = 1.1$, we choose $\ell = 1$ with 90% chance and $\ell = 2$ with 10% chance). This worked well empirically with little effort to tune. Many other strategies exist for tuning a curriculum automatically (such as those explored by Graves et al. (2017)), and could be beneficial, at the cost of additional overhead and algorithmic complexity.
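A minimal sketch of this scheduling rule follows. Only the floor/ceiling sampling of ℓ from α comes directly from the text; the shape of the linear ramp (the `warmup` and `growth_steps` parameters) is an assumed tuning choice.

```python
import math
import random

def sample_level(alpha: float) -> int:
    """Sample the behaviour level l from the mixing parameter alpha:
    l = floor(alpha) with probability ceil(alpha) - alpha,
    l = ceil(alpha)  with probability alpha - floor(alpha).
    E.g. alpha = 1.1 gives l = 1 with 90% chance and l = 2 with 10%."""
    lo, hi = math.floor(alpha), math.ceil(alpha)
    if lo == hi:
        return lo
    return hi if random.random() < (alpha - lo) else lo

def linear_alpha(step: int, warmup: int, growth_steps: int, n_levels: int) -> float:
    """Hypothetical linear schedule: hold alpha = 0 for `warmup` steps, then
    grow linearly to the most unrestricted level over `growth_steps` steps."""
    if step < warmup:
        return 0.0
    frac = min(1.0, (step - warmup) / growth_steps)
    return frac * (n_levels - 1)
```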
5. Growing action spaces for multi-agent control

In cooperative multi-agent control, the full action space allows each of $N$ agents to take actions from a set $A_{\text{agent}}$, resulting in an exponentially large action space of size $|A_{\text{agent}}|^N$. Random exploration in this action space is highly unlikely to produce sensible behaviours, so growing the action space as we propose is particularly valuable in this setting. One approach would be to limit the actions available to each agent, as done in our discretised continuous control experiments (Section 6.1) and those of Czarnecki et al. (2018). However, the joint action space would still be exponential in $N$. We propose instead to use hierarchical clustering and to assign the same action to nearby agents.

At the first level of the hierarchy, we treat the whole team as a single group, and all agents are constrained to take the same action. At the next level of the hierarchy, we split the agents into k groups using an unsupervised clustering algorithm, allowing each group to act independently. At each further level, every group is split once again into k smaller groups. In practice, we simply use k-means clustering based on the agents' spatial positions, but this can be easily extended to more complex hierarchies using other clustering approaches.

To estimate the value function, we compute a state-value score $\hat{V}(s)$ and a group-action delta $\Delta_\ell(s, a^g, g)$ for each group $g$ at each level ℓ. Then, we compute an estimated group-action value for each group, at each level, using a per-group form of (3):

$$\hat{Q}^*_{\ell+1}(s, a^g) = \hat{Q}^*_\ell(s, \text{parent}_\ell(a^g)) + \Delta_\ell(s, a^g, g).$$

We use $\hat{Q}^*_{-1}(s, \cdot) = \hat{V}(s)$ to initialise the iterative computation, similarly to the dueling architecture of Wang et al. (2015). The estimated value of the parent action is the estimated value of the entire parent group all taking the same action as the child group.

At each level ℓ we now have a set of group-action values. In effect, a multi-agent value-learning problem still remains at each level ℓ, but with a greatly reduced number of agents at low ℓ. There are several approaches to this value-estimation problem. One option would be to use the action-branching architecture of Tavakoli et al. (2018), which in effect carries out independent Q-learning (Tan, 1993) but shares a state representation between all agents. We instead choose to estimate the joint-action value at each level as the mean of the group-action values for the groups at that ℓ, as in the work of Sunehag et al. (2017), due to the success of this approach in cooperative multi-agent learning. A less restrictive representation, such as that proposed by Rashid et al. (2018), could help, but we leave this direction to future work.

A potential problem is that the clustering changes for every state, which may interfere with generalisation, as group-actions do not have consistent semantics. We address this in two ways. First, we include the clustering as part of the state, and the cluster centroids are re-initialised from the previous timestep for t > 0 to keep the cluster semantics approximately consistent. Second, we use a functional representation that produces group-action values that are broadly agnostic to the identity of the group. In particular, we compute a spatially resolved embedding and pool over the locations occupied by each group. See Figure 3 and Section 6.2 for more details.
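A sketch of the hierarchical k-means grouping and the value-decomposition style combination described above is given below. We use scikit-learn's KMeans purely for illustration; the warm-starting of centroids from the previous timestep is omitted, and all names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_groups(positions: np.ndarray, num_levels: int, k: int = 2):
    """Recursively split agents into k groups per level based on spatial
    position (a sketch of the clustering in Section 5). Returns
    group_ids[l][i] = group index of agent i at level l; with k groups per
    split, the parent of group id g at level l+1 is g // k, mirroring
    parent_l in the per-group form of Eq. (3)."""
    n = positions.shape[0]
    group_ids = [np.zeros(n, dtype=int)]            # level 0: one group
    for level in range(1, num_levels):
        prev = group_ids[-1]
        new = np.zeros(n, dtype=int)
        for g in np.unique(prev):
            idx = np.where(prev == g)[0]
            if len(idx) >= k:
                labels = KMeans(n_clusters=k, n_init=10).fit_predict(positions[idx])
            else:                                    # fewer agents than groups
                labels = np.arange(len(idx)) % k
            new[idx] = g * k + labels
        group_ids.append(new)
    return group_ids

def joint_value(group_action_values: np.ndarray) -> float:
    """Joint-action value at a level as the mean of its group-action values,
    following the value-decomposition approach mentioned in Section 5."""
    return float(np.mean(group_action_values))
```

With k = 2, level ℓ yields at most $2^\ell$ groups, matching the group counts used in our StarCraft experiments.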
6. Experiments

We investigate two classes of problems that have a natural hierarchy in the action space. First, simple control problems, where a coarse action discretisation can help accelerate exploration, and a fine action discretisation allows for a more optimal policy. Second, the cooperative multi-agent setting, discussed in Section 5, using large-scale StarCraft micromanagement scenarios.

6.1. Discretised continuous control

As a proof of concept, we look at two simple examples: versions of the classic Acrobot and Mountain Car environments with discretised action spaces. Both tasks have a sparse reward of +1 when the goal is reached, and we make the exploration problem more challenging by terminating episodes with a penalty of -1 if the goal is not reached within 500 timesteps. The normalised remaining time is concatenated to the state so it remains Markovian despite the time limit. There is a further actuation cost of $0.05 a^2$. At $A_0$, the actions apply a force of +1 and -1. At each subsequent $A_\ell$ with $\ell > 0$, each action is split into two children: one that is the same as the parent action, and one applying half the force. Thus, there are $2^{\ell+1}$ actions in $A_\ell$.
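The task modifications above can be made concrete with a small sketch. It is an approximation reconstructed from the description; the exact constants, action bookkeeping, and code organisation in our experiments may differ, and all names are hypothetical.

```python
import numpy as np

def forces_at_level(level: int) -> np.ndarray:
    """Force values for A_level in the discretised control tasks:
    A_0 = {+1, -1}; each action at the next level has two children, itself
    and a half-force version. The first half of the returned array keeps the
    parent forces and the second half halves them, so action i's parent
    index in the coarser space is i if i < len(coarser), else i - len(coarser)."""
    forces = np.array([1.0, -1.0])
    for _ in range(level):
        forces = np.concatenate([forces, 0.5 * forces])
    return forces

def step_reward(reached_goal: bool, t: int, force: float, max_t: int = 500):
    """Sparse reward structure described above: +1 on reaching the goal,
    -1 if the time limit expires first, and an actuation cost of 0.05 * a^2
    at every step. Returns (reward, done)."""
    reward = -0.05 * force ** 2
    if reached_goal:
        return reward + 1.0, True
    if t >= max_t:
        return reward - 1.0, True
    return reward, False

def augment_observation(obs: np.ndarray, t: int, max_t: int = 500) -> np.ndarray:
    """Concatenate the normalised remaining time so the time-limited task
    remains Markovian."""
    return np.concatenate([obs, [(max_t - t) / max_t]])
```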
Figure 1: Discretised continuous control with growing action spaces ((a) Acrobot, (b) Mountain Car). We report the mean and standard error (over 10 random seeds) of the returns during training, with a moving average over the past 20 episodes. A2 (slow ϵ) is an ablation of A2 that decays ϵ at a quarter the rate.

The results of our experiments are shown in Figure 1. Training with the lower resolutions $A_0$ and $A_1$ from scratch converges to finding the goal, but incurs significant actuation costs. Training with $A_2$ from scratch almost never finds the goal with ϵ-greedy exploration. We also tried decaying ϵ at a quarter of the rate (A2 slow ϵ), without success. In these cases, the policy converges to the one that minimises actuation costs, never finding the goal. Training with a growing action space explores to find the goal early, and then uses this experience to transition smoothly into a solution that still finds the goal, but takes a slower route that minimises actuation costs while achieving the objective.

We also run three ablations of our method, to study the impact of our algorithmic choices. The results of these experiments are shown in Figure 2(a, b).

Figure 2: Ablations of GAS in discretised continuous control ((a) Acrobot, (b) Mountain Car, (c) Acrobot, (d) Mountain Car).

The first ablation, GAS(2): ON-AC, investigates the impact of our off-action-space update. In this experiment, data sampled at a particular level ℓ is only used to train the Q-function for that level, rather than also being used for each higher-ℓ Q-function in the hierarchy. In the Acrobot environment, this ablation performs similarly to GAS(2), although its final performance dips slightly. In Mountain Car, training without the off-action-space update is less efficient and less stable.

The second ablation, GAS(2): SEP-Q, investigates the importance of iteratively computing the Q-values using Equation (3). In SEP-Q, each Q-function is computed separately, rather than using the action-value estimates for restricted action spaces as initialisations for the action-value estimates of their child actions. In both control tasks, this ablation performs significantly worse than the full algorithm. These ablations show the importance of efficiently using data collected during curriculum learning, in addition to shaping exploration via the curriculum of action spaces.

Finally, in GAS(2): Max Targets, we perform a maximisation over coarser action spaces to construct the target, as described in Section 4.3. This harms performance in the Mountain Car environment. One potential reason is that maximising over more potential targets increases the maximisation bias already present in Q-learning (Hasselt, 2010).

We also investigate a potentially complementary approach to exploration, with results shown in Figure 2(c, d). Intrinsic motivation that encourages the pursuit of novelty is another way to overcome sparse or misleading rewards such as those seen in these environments. We use Random Network Distillation (RND; Burda et al., 2018) as a representative algorithm in this category, with a grid search over the learning rate, intrinsic reward weight, and network architecture for the RND bonus. For Mountain Car, RND helps $A_2$ to find the goal more frequently. However, in each case, the inefficient traversal of the state space afforded by the full action space limits learning even with RND. RND is compatible with GAS, but offers no further benefits in these experiments, as GAS already learns quite quickly and finds the optimal policy.

6.2. Combinatorial action spaces: StarCraft battles

6.2.1. LARGE-SCALE STARCRAFT MICROMANAGEMENT

The real-time strategy game StarCraft and its sequel StarCraft II have emerged as popular platforms for benchmarking reinforcement learning algorithms (Synnaeve et al., 2016; Vinyals et al., 2017). Full game-play has been tackled by, e.g., Lee et al. (2018) and Vinyals et al. (2019), while other works focus on sub-problems such as micromanagement, the low-level control of units engaged in a battle between two armies (e.g., Usunier et al., 2016). Efforts to approach the former problem have required some subset of human demonstrations, hierarchical methods, and massive compute scale, so we focus on the latter as a more tractable benchmark to evaluate our methods.

Most previous work on RL benchmarking with StarCraft micromanagement is restricted to at most 20-30 units (Samvelyan et al., 2019; Usunier et al., 2016). In our experiments, we focus on much larger-scale micromanagement scenarios with 50-100 units on each side of the battle. To further increase the difficulty of these scenarios, in our setting the starting locations of the armies are randomised, and the opponent is controlled by scripted logic that holds its position until any agent-controlled unit is in range, and then focus-fires on the closest enemy. This increases the exploration challenge, as our agents need to learn to find the enemy first, while the opponent holds a strong defensive position.

The action space for each unit permits an attack-move or move action in eight cardinal directions, as well as a stop action that causes the unit to passively hold its position. In our experiments, we use k = 2 for k-means clustering and split down to at most four or eight groups. The maximum number of groups in an experiment with $A_\ell$ is $2^\ell$.

Although our approach is designed for off-policy learning, we follow the common practice of using n-step Q-learning to accelerate the propagation of values (Hessel et al., 2018). Our base algorithm uses the objective of n-step Q-learning from the work of Mnih et al. (2016), and collects data from multiple workers into a short queue, similarly to Espeholt et al. (2018). Full details can be found in the Appendix.

Figure 4: StarCraft micromanagement with growing action spaces. We report the mean and standard error (over 5 random seeds) of the evaluation winrate during training, with a moving average over the past 500 episodes.
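A generic sketch of the n-step Q-learning target computation for one rollout segment is shown below; it is not our exact implementation, and the function name and arguments are illustrative. In GAS, a separate target would be formed for each level ℓ, using that level's own bootstrap value.

```python
import numpy as np

def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """n-step Q-learning targets for a rollout segment:
    target_t = sum_{k=t}^{n-1} gamma^{k-t} r_k + gamma^{n-t} * max_a Q(s_n, a),
    computed backwards from the bootstrap value at the segment end
    (use bootstrap_value = 0.0 if the segment ends in a terminal state)."""
    targets = np.zeros(len(rewards))
    ret = bootstrap_value
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        targets[t] = ret
    return targets
```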
6.2.2. MODEL ARCHITECTURE

We propose an architecture to efficiently represent the value functions of the action-space hierarchy for this type of multi-agent control problem. The overall structure is shown in Figure 3.

Figure 3: Architecture for GAS with hierarchical clustering. For clarity, only two levels of hierarchy are shown. The dark shaded regions identify the locations that are pooled over before state-value or group-action scores are computed.

We start with the state of the scenario (1); ally units (blue in the figure) are split into two groups. From the state, features are extracted from the units and the map (see the Appendix for full details). These features are concatenated with a one-hot representation of the unit's group (for allied agents), and are embedded with a small MLP. A 2-D grid of embeddings is constructed by adding up the unit embeddings for all units in each cell of the grid (2). The embeddings are passed through a residual CNN to produce a final embedding (3), which is copied several times and decoded as follows. First, a state-value branch computes a scalar value by taking a global mean pooling (4) and passing the result through a 2-layer MLP (6). Then, for each ℓ, a masked mean pooling is used to produce an embedding for each group at that $A_\ell$, by masking out the positions in the spatial embedding where there are no units of that group (5a, 5b, 5c). A single evaluation MLP for each ℓ is used to decode this embedding into a group-action score (7a, 7b, 7c).

This architecture allows a shared state representation to be efficiently decoded into value-function contributions for groups of any size, at any level of restriction in the action space. It may be easily applied to any problem with a high-dimensional action space in which each action dimension is associated with a location in a 2-D plane. Our default approach for combining the outputs of this model is described in Section 5: each group's action-value is given by the sum of the state-value and the group-action scores for the group and its ancestors (8a, 8b). In the SEP-Q ablation for these experiments, each group's action-value is simply given by the state-value added to the group-action score, i.e., $\hat{Q}^*_\ell(s, a^g) = \hat{V}(s) + \Delta_\ell(s, a^g, g)$.
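The decoding heads described above can be sketched in PyTorch as follows: a global mean pooling feeds the state-value branch, and a masked mean pooling per group feeds a per-level evaluation MLP. Layer sizes, tensor layouts, and names are assumptions, and the rest of the architecture (unit featurisation, grid scatter, residual CNN) is omitted.

```python
import torch
import torch.nn as nn

class GASHeads(nn.Module):
    """Sketch of the decoding heads in Figure 3 (dimensions and layer
    sizes are illustrative, not the exact architecture)."""

    def __init__(self, channels: int, n_actions: int, n_levels: int, hidden: int = 64):
        super().__init__()
        self.value_head = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # One evaluation MLP per level, producing a score per group action.
        self.score_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(n_levels)])

    @staticmethod
    def masked_mean_pool(spatial, mask):
        """spatial: (B, C, H, W); mask: (B, G, H, W), 1 where a group's
        units are located. Returns (B, G, C) group embeddings."""
        mask = mask.float().unsqueeze(2)                              # (B, G, 1, H, W)
        summed = (spatial.unsqueeze(1) * mask).sum(dim=(-1, -2))      # (B, G, C)
        count = mask.sum(dim=(-1, -2)).clamp(min=1e-6)                # (B, G, 1)
        return summed / count

    def forward(self, spatial, group_masks):
        """group_masks[l]: (B, 2**l, H, W). Returns the state value and the
        group-action scores Delta_l(s, a, g) for every level."""
        state_value = self.value_head(spatial.mean(dim=(-1, -2)))     # (B, 1)
        scores = []
        for level, mask in enumerate(group_masks):
            group_emb = self.masked_mean_pool(spatial, mask)          # (B, G, C)
            scores.append(self.score_heads[level](group_emb))         # (B, G, A)
        return state_value, scores
```

Group-action values would then be composed by summing the state value with the scores of a group and its ancestors, per the per-group form of Eq. (3) in Section 5.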
6.2.3. RESULTS AND DISCUSSION

Figure 4 presents the results of our method, as well as a number of baselines and ablations, on a variety of micromanagement tasks. Our method is labeled GAS(ℓ), such that GAS(2) will grow from $A_0$ to $A_2$. Our primary baselines are policies trained with action spaces $A_0$ or $A_2$ from scratch. GAS(2) consistently outperforms both of these variants. Policies trained from scratch on $A_2$ struggle with exploration, in particular in the harder scenarios where the opponent has a numbers advantage. Policies trained from scratch on $A_0$ learn quickly, but plateau comparatively low, due to the limited ability of a single group to position effectively. GAS(2) benefits from the efficient exploration enabled by an initialisation at $A_0$, and uses the data gathered under this policy to transfer efficiently to $A_2$, enabling a higher asymptotic performance.

We also compare against a Mix&Match (MM) baseline using the actor-critic approach of Czarnecki et al. (2018), but adapted for our multi-agent setting and supporting a third level in the mixture of policies ($A_0$, $A_1$, $A_2$). We tuned hyperparameters for all algorithms on the easiest, fastest-training scenario (80 marines vs. 80 marines). On this scenario, MM learns faster but plateaus at the same level as GAS(2). MM underperforms on all other scenarios to varying degrees. Learning separate value functions for each $A_\ell$, as in our approach, appears to accelerate the transfer learning in the majority of settings. Another possible explanation is that MM may be more sensitive to hyperparameters. We do not use population based training to tune hyperparameters on the fly, which could otherwise help MM adapt to each scenario. However, GAS would presumably also benefit from population based training, at the cost of further computation and sample efficiency.

Although we kept the implementations as close as possible, the difference in overall algorithmic framework (Q-learning as opposed to stochastic actor-critic) can also be significant. One important factor is the use of ϵ-greedy exploration instead of a stochastic policy. Further study of this type of curriculum in both paradigms is likely warranted.

The policies learned by GAS exhibit good tactics. Control of separate groups is used to position our army so as to maximise the number of attacking units, by forming a wall or a concave that surrounds the enemy, and by coordinating a simultaneous assault. Figure 5 in the Appendix shows some example learned policies. In scenarios where MM fails to learn well, it typically falls into a local minimum of attacking head-on.

In each scenario, we also evaluate the GAS(2): ON-AC ablation that does not use our off-action-space update. This ablation performs somewhat worse on average, although the size of the impact varies. In most scenarios, it is indeed beneficial to accelerate learning for finer action spaces using data drawn from the off-action-space policy.

We present further ablations on two scenarios. The most striking failure is that of the SEP-Q ablation, which does not compose the value function as a sum of scores in the hierarchy: it is critical to ensure that values are well initialised as we move to less restricted action spaces. The Max Targets variant performs comparably to GAS(2). Because we use an n-step objective, which combines a partial on-policy return with the bootstrap target, the relative impact of the choice of target may be reduced.

Finally, we experiment with a higher ℓ. Unfortunately, asymptotic performance is degraded slightly once we use $A_3$ or higher. One potential reason is that it decreases the average group size, pushing against the limits of the spatial resolution that may be captured by our CNN architecture. Higher ℓ also increases the amount of time during which there are fewer units than groups, leaving certain groups empty and rendering our masked pooling operation degenerate. We do not see a fundamental limitation that should restrict the further growth of the action space, although we note that most hierarchical approaches in the literature avoid too many levels of depth. For example, Czarnecki et al. (2018) only mix between two sizes of action spaces, rather than the three we progress through in the majority of our GAS experiments.

7. Conclusion

In this work, we present an algorithm for growing action spaces with off-policy reinforcement learning to efficiently shape exploration. We learn value functions for all levels of a hierarchy of restricted action spaces simultaneously, and transfer data, value estimates, and representations from more restricted to less restricted action spaces. We also present a strategy for using this approach in multi-agent control. In discretised continuous control tasks and challenging multi-agent StarCraft micromanagement scenarios, we demonstrate empirically the effectiveness of our approach and the value of off-action-space learning.

To reduce reliance on domain knowledge for defining action-space hierarchies, it could be interesting to turn to learned action embeddings, as discussed in Section 4.1. Meta-optimisation is another avenue for learning how to restrict action spaces for efficient exploration. This work also highlights the importance of an active choice of interface between agent and environment, rather than committing dogmatically to an MDP as an unalterable problem statement.
Acknowledgments

We thank Danielle Rothermel, Daniel Gant, Jonas Gehring, Nicolas Carion, Daniel Haziza, Dexter Ju, and Vegard Mella for useful discussions and their work on the StarCraft codebase. This work was supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems, and received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713).

References

Asada, M., Noda, S., Tawaratsumida, S., and Hosoda, K. Purposive behavior acquisition for a real robot by vision-based reinforcement learning. Machine Learning, 23(2-3):279-303, 1996.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471-1479, 2016.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. In International Conference on Learning Representations, 2018.

Chaslot, G. M. J., Winands, M. H., van den Herik, H. J., Uiterwijk, J. W., and Bouzy, B. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(03):343-357, 2008.

Colombetti, M. and Dorigo, M. Robot shaping: developing situated agents through learning. International Computer Science Institute, 1992.

Couëtoux, A., Hoock, J.-B., Sokolovska, N., Teytaud, O., and Bonnard, N. Continuous upper confidence trees. In International Conference on Learning and Intelligent Optimization, pp. 433-445. Springer, 2011.

Czarnecki, W. M., Jayakumar, S. M., Jaderberg, M., Hasenclever, L., Teh, Y. W., Osindero, S., Heess, N., and Pascanu, R. Mix&Match - agent curricula for reinforcement learning. arXiv preprint arXiv:1806.01780, 2018.

Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.

Elman, J. L. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71-99, 1993.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.

Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1311-1320. JMLR.org, 2017.

Hasselt, H. V. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613-2621, 2010.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Konidaris, G. and Barto, A. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 489-496. ACM, 2006.

Lee, D., Tang, H., Zhang, J. O., Xu, H., Darrell, T., and Abbeel, P. Modular architecture for StarCraft II with deep reinforcement learning. In Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

Moore, A. W. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In Advances in Neural Information Processing Systems, pp. 711-718, 1994.

Munos, R. and Moore, A. Variable resolution discretization in optimal control. Machine Learning, 49(2-3):291-323, 2002.

Murali, A., Pinto, L., Gandhi, D., and Gupta, A. CASSL: Curriculum accelerated self-supervised learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6453-6460. IEEE, 2018.

Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278-287, 1999.

Pan, Y., Farahmand, A.-m., White, M., Nabi, S., Grover, P., and Nikovski, D. Reinforcement learning with function-valued action spaces for partial differential equation control, 2018.

Pentina, A., Sharmanska, V., and Lampert, C. H. Curriculum learning of multiple tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5492-5500, 2015.

Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.

Ruder, S. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.

Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. The StarCraft multi-agent challenge. 2019.

Schmidhuber, J. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222-227, 1991.

Selfridge, O. G., Sutton, R. S., and Barto, A. G. Training and tracking in robotics. In IJCAI, pp. 670-672, 1985.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

Singh, S. P. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3-4):323-339, 1992.

Stanley, K. O. and Miikkulainen, R. Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21:63-100, 2004.
Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A., and Fergus, R. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.

Synnaeve, G., Nardelli, N., Auvolat, A., Chintala, S., Lacroix, T., Lin, Z., Richoux, F., and Usunier, N. TorchCraft: a library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625, 2016.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154-2162, 2016.

Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, 1993.

Tavakoli, A., Pardo, F., and Kormushev, P. Action branching architectures for deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Taylor, M. E., Stone, P., and Liu, Y. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8(Sep):2125-2167, 2007.

Tennenholtz, G. and Mannor, S. The natural language of actions. In International Conference on Machine Learning, pp. 6196-6205, 2019.

Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

Usunier, N., Synnaeve, G., Lin, Z., and Chintala, S. Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks. arXiv preprint arXiv:1609.02993, 2016.

Van de Wiele, T., Warde-Farley, D., Mnih, A., and Mnih, V. Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116, 2020.

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.

Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D., and Silver, D. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.

Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

Whiteson, S., Taylor, M. E., Stone, P., et al. Adaptive tile coding for value function approximation. Computer Science Department, University of Texas at Austin, 2007.

Zaremba, W. and Sutskever, I. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.