# Strategic Attentive Writer for Learning Macro-Actions

Alexander (Sasha) Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, Koray Kavukcuoglu
Google DeepMind
{vezhnick,vmnih,jagapiou,osindero,gravesa,vinyals,korayk}@google.com

We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner, purely by interacting with an environment in a reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous sub-sequences by learning for how long the plan can be committed to, i.e. followed without replanning. Combining these properties, the proposed model, dubbed STRategic Attentive Writer (STRAW), can learn high-level, temporally abstracted macro-actions of varying lengths that are learnt solely from data without any prior information. These macro-actions enable both structured exploration and economic computation. We experimentally demonstrate that STRAW delivers strong improvements on several ATARI games by employing temporally extended planning strategies (e.g. Ms. Pacman and Frostbite). It is at the same time a general algorithm that can be applied to any sequence data. To that end, we also show that when trained on a text prediction task, STRAW naturally predicts frequent n-grams (instead of macro-actions), demonstrating the generality of the approach.

## 1 Introduction

Using reinforcement learning to train neural network controllers has recently led to rapid progress on a number of challenging control tasks [15, 17, 26]. Much of the success of these methods has been attributed to the ability of neural networks to learn useful abstractions or representations of the stream of observations, allowing the agents to generalize between similar states. Notably, these agents do not exploit another type of structure: the one present in the space of controls or policies. Indeed, not all sequences of low-level controls lead to interesting high-level behaviour, and an agent that can automatically discover useful macro-actions should be capable of more efficient exploration and learning. The discovery of such temporal abstractions has been a long-standing problem in both reinforcement learning and sequence prediction in general, yet no truly scalable and successful architectures exist.

We propose a new deep recurrent neural network architecture, dubbed STRategic Attentive Writer (STRAW), that is capable of learning macro-actions in a reinforcement learning setting. Unlike the vast majority of reinforcement learning approaches [15, 17, 26], which output a single action after each observation, STRAW maintains a multi-step action plan. STRAW periodically updates the plan based on observations and commits to the plan between the replanning decision points. The replanning decisions as well as the commonly occurring sequences of actions, i.e. macro-actions, are learned from rewards. To encourage exploration with macro-actions we introduce a noisy communication channel between a feature extractor (e.g. a convolutional neural network) and the planning modules, taking inspiration from recent developments in variational auto-encoders [11, 13, 24]. Injecting noise at this level of the network generates randomness in plan updates that cover multiple time steps and thereby creates the desired effect.
Our proposed architecture is a step towards more natural decision making, wherein one observation can generate a whole sequence of outputs if it is informative enough. This provides several important benefits. First and foremost, it facilitates structured exploration in reinforcement learning: as the network learns meaningful action patterns, it can use them to make longer exploratory steps in the state space [4]. Second, since the model does not need to process observations while it is committed to its action plan, it learns to allocate computation to key moments, thereby freeing up resources when the plan is being followed. Additionally, the acquisition of macro-actions can aid transfer and generalization to other related problems in the same domain (assuming that other problems from the domain share similar structure in terms of action-effects).

We evaluate STRAW on a subset of Atari games that require longer-term planning and show that it leads to substantial improvements in scores. We also demonstrate the generality of the STRAW architecture by training it on a text prediction task and show that it learns to use frequent n-grams as the macro-actions on this task. The following section reviews the related work. Section 3 defines the STRAW model formally. Section 4 describes the training procedure for both supervised and reinforcement learning cases. Section 5 presents the experimental evaluation of STRAW on 8 ATARI games, 2D maze navigation and next character prediction tasks. Section 6 concludes.

## 2 Related Work

Learning temporally extended actions and temporal abstraction in general are long-standing problems in reinforcement learning [5, 6, 7, 8, 12, 20, 21, 22, 23, 25, 27, 28, 30]. The options framework [21, 28] provides a general formulation. An option is a sub-policy with a termination condition, which takes in environment observations and outputs actions until the termination condition is met. An agent picks an option using its policy-over-options and subsequently follows it until termination, at which point the policy-over-options is queried again and the process continues. Notice that a macro-action is a particular, simpler instance of an option, where the action sequence (or a distribution over sequences) is decided at the time the macro-action is initiated. Options are typically learned using subgoals and pseudo-rewards that are provided explicitly [7, 8, 28]. For a simple, tabular case [30], each state can be used as a subgoal. Given the options, a policy-over-options can be learned using standard techniques by treating options as actions. Recently, [14, 29] have demonstrated that combining deep learning with pre-defined subgoals delivers promising results in challenging environments like Minecraft and Atari; however, subgoal discovery remains an unsolved problem. Another recent work [1] shows a theoretical possibility of learning options jointly with a policy-over-options by extending the policy gradient theorem to options, but the approach was only tested on a toy problem. In contrast, STRAW learns macro-actions and a policy over them in an end-to-end fashion from only the environment's reward signal and without resorting to explicit pseudo-rewards or hand-crafted subgoals. The macro-actions are represented implicitly inside the model, arising naturally from the interplay between action and commitment plans within the network.
Our experiments demonstrate that the model scales to a variety of tasks, from next character prediction in text to ATARI games.

## 3 The model

STRAW is a deep recurrent neural network with two modules. The first module translates environment observations into an action-plan, a state variable which represents an explicit stochastic plan of future actions. STRAW generates macro-actions by committing to the action-plan and following it without updating for a number of steps. The second module maintains a commitment-plan, a state variable that determines at which step the network terminates a macro-action and updates the action-plan. The action-plan is a matrix where one dimension corresponds to time and the other to the set of possible discrete actions. The elements of this matrix are proportional to the probability of taking the corresponding action at the corresponding time step. Similarly, the commitment-plan represents the probabilities of terminating a macro-action at each particular step. For updating both plans we use the attentive writing technique [11], which allows the network to focus on the parts of a plan where the current observation is informative of desired outputs. This section formally defines the model; we describe the way it is trained later in section 4.

The state of the network at time $t$ is comprised of the matrices $A^t \in \mathbb{R}^{A \times T}$ and $c^t \in \mathbb{R}^{1 \times T}$. Matrix $A^t$ is the action-plan. Each element $A^t_{a,\tau}$ is proportional to the probability of outputting action $a$ at time $t + \tau$. Here $A$ is the total number of possible actions and $T$ is the maximum time horizon of the plan. To generate an action at time $t$, the first column of $A^t$ (i.e. $A^t_0$) is transformed into a distribution over possible outputs by a SoftMax function. This distribution is then sampled to generate the action $a_t$. Thereby the content of $A^t$ corresponds to the plan of future actions as conceived at time $t$.

Figure 1: Schematic illustration of STRAW playing a maze navigation game. The input frames indicate maze geometry (black = corridor, blue = wall), the red dot corresponds to the position of the agent and the green dot to the goal, which it tries to reach. A frame is first passed through a convolutional network, acting as a feature extractor, and then into STRAW. The top two rows depict the plans $A$ and $c$. Given a gate $g_t$, STRAW either updates the plans (steps 1 and 4) or commits to them.

The single-row matrix $c^t$ represents the commitment-plan of the network. Let $g_t$ be a binary random variable distributed as follows: $g_t \sim c^{t-1}_1$. If $g_t = 1$ then at this step the plans will be updated, otherwise $g_t = 0$ means they will be committed to. Macro-actions are defined as the sequences of outputs $\{a_t\}_{t_1}^{t_2-1}$ produced by the network between steps where $g_t$ is "on", i.e. $g_{t_1} = g_{t_2} = 1$ and $g_t = 0$ for $t_1 < t < t_2$. During commitment the plans are rolled over to the next step using the matrix time-shift operator $\rho$, which shifts a matrix by removing the first column and appending a column filled with zeros to its rear. Applying $\rho$ to $A^t$ or $c^t$ reflects the advancement of time. Figure 1 illustrates the workflow. Notice that during commitment (steps 2 and 3) the network doesn't compute the forward pass, thereby saving computation.

**Attentive planning.** An important assumption that underpins the usage of macro-actions is that one observation reveals enough information to generate a sequence of actions. The complexity of the sequence and its length can vary dramatically, even within one environment.
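To make the plan-state mechanics concrete, here is a minimal NumPy sketch (not the authors' code) of the two state variables, the time-shift operator $\rho$, and action sampling from the first column of the action-plan. The array sizes and the constant 0.1 initialisation of the commitment-plan are illustrative assumptions only.

```python
import numpy as np

def rho(plan):
    """Time-shift operator: drop the first column and append a zero column at the rear."""
    return np.concatenate([plan[:, 1:], np.zeros((plan.shape[0], 1))], axis=1)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)

A_actions, T = 4, 10                           # number of actions and plan horizon (toy values)
action_plan = rng.normal(size=(A_actions, T))  # A^t: one row per action, one column per future step
commit_plan = np.full((1, T), 0.1)             # c^t: probability of replanning at each future step

# Sample the gate g_t from the first entry of the commitment-plan.
g_t = rng.random() < commit_plan[0, 0]

if not g_t:
    # Committed: just advance time, no forward pass through the planning module.
    action_plan = rho(action_plan)
    commit_plan = rho(commit_plan)
# (When g_t = 1 the plans would instead be rewritten attentively; see Algorithm 1 below.)

# The action at time t is sampled from the SoftMax over the first column of A^t.
a_t = rng.choice(A_actions, p=softmax(action_plan[:, 0]))
print("g_t =", int(g_t), "a_t =", a_t)
```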
Therefore the network has to focus on the part of the plan where the current observation is informative of the desired actions. To achieve this, we apply differentiable attentive reading and writing operations [11], where attention is defined over the temporal dimension. This technique was originally proposed for image generation; here instead it is used to update the plans $A^t$ and $c^t$. In the image domain, the attention operates over the spatial extent of an image, reading and writing pixel values. Here it operates over the temporal extent of a plan, and is used to read and write action probabilities. The differentiability of the attention model [11] makes it possible to train with standard backpropagation.

An array of Gaussian filters is applied to the plan, yielding a patch of smoothly varying location and zoom. Let $A$ be the total number of possible actions and $K$ be a parameter that determines the temporal resolution of the patch. A grid of $A \times K$ one-dimensional Gaussian filters is positioned on the plan by specifying the coordinates of the grid center and the stride distance between adjacent filters. The stride controls the zoom of the patch; that is, the larger the stride, the larger an area of the original plan will be visible in the attention patch, but the lower the effective resolution of the patch will be. The filtering is performed along the temporal dimension only. Let $\psi$ be a vector of attention parameters, i.e. grid position, stride, and standard deviation of the Gaussian filters. We define the attention operations as follows:

$$D = \mathrm{write}(p, \psi^A_t); \qquad \beta_t = \mathrm{read}(A^t, \psi^A_t) \tag{1}$$

The write operation takes in a patch $p \in \mathbb{R}^{A \times K}$ and attention parameters $\psi^A_t$. It produces a matrix $D$ of the same size as $A^t$, which contains the patch $p$ scaled and positioned according to $\psi^A_t$, with the rest set to 0. Analogously, the read operation takes the full plan $A^t$ together with attention parameters $\psi^A_t$ and outputs a read patch $\beta_t \in \mathbb{R}^{A \times K}$, which is extracted from $A^t$ according to $\psi^A_t$. We direct readers to [11] for details.

Figure 2: Schematic illustration of an action-plan update. Given $z_t$, the network produces an action-patch and attention parameters $\psi^A_t$. The write operation creates an update to the action-plan by scaling and shifting the action-patch according to $\psi^A_t$. The update is then added to $\rho(A^{t-1})$.

**Action-plan update.** Let $z_t$ be a feature representation (e.g. the output of a deep convolutional network) of an observation $x_t$. Given $z_t$, $g_t$ and the previous state $A^{t-1}$, STRAW computes an update to the action-plan using Algorithm 1. Here $f^\psi$ and $f^A$ are linear functions and $h$ is a two-layer perceptron. Figure 2 gives an illustration of an update to $A^t$.

**Algorithm 1: Action-plan update.** Input: $z_t$, $g_t$, $A^{t-1}$. Output: $A^t$, $a_t$.

- if $g_t = 1$ then
  - Compute attention parameters $\psi^A_t = f^\psi(z_t)$
  - Attentively read the current state of the action-plan: $\beta_t = \mathrm{read}(A^{t-1}, \psi^A_t)$
  - Compute the intermediate representation $\xi_t = h([\beta_t, z_t])$
  - Update $A^t = \rho(A^{t-1}) + g_t \cdot \mathrm{write}(f^A(\xi_t), \psi^A_t)$
- else ($g_t = 0$)
  - Update $A^t = \rho(A^{t-1})$  (just advance time)
- Sample an action $a_t \sim \mathrm{SoftMax}(A^t_0)$

**Commitment-plan update.** Now we introduce a module that partitions the action-plan into macro-actions by defining the temporal extent to which the current action-plan $A^t$ can be followed without re-planning. The commitment-plan $c^t$ is updated at the same time as the action-plan, i.e. when $g_t = 1$; otherwise it is rolled over to the next time step by the $\rho$ operator.
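The following NumPy sketch shows one plausible way to realise the temporal Gaussian read/write of Eq. (1) and the Algorithm 1 update; it follows the DRAW-style attention of [11] restricted to the time axis. The function names, the encoding of $\psi$ as a (center, stride, sigma) tuple, and the stand-in callables `f_psi`, `f_A` and `h` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def temporal_filterbank(T, K, center, stride, sigma):
    """K one-dimensional Gaussian filters over a plan of temporal length T (DRAW-style)."""
    mu = center + (np.arange(K) - K / 2.0 + 0.5) * stride       # filter centres along the time axis
    tau = np.arange(T)
    F = np.exp(-((tau[None, :] - mu[:, None]) ** 2) / (2.0 * sigma ** 2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)            # normalise each filter; shape (K, T)

def attentive_read(plan, psi, K):
    """read(A, psi): extract an A x K patch from the A x T plan at the location/zoom given by psi."""
    center, stride, sigma = psi
    F = temporal_filterbank(plan.shape[1], K, center, stride, sigma)
    return plan @ F.T                                           # (A, T) @ (T, K) -> (A, K)

def attentive_write(patch, T, psi):
    """write(p, psi): place an A x K patch onto an otherwise-zero A x T canvas."""
    center, stride, sigma = psi
    F = temporal_filterbank(T, patch.shape[1], center, stride, sigma)
    return patch @ F                                            # (A, K) @ (K, T) -> (A, T)

def rho(plan):
    """Time-shift operator used while committing."""
    return np.concatenate([plan[:, 1:], np.zeros((plan.shape[0], 1))], axis=1)

def action_plan_update(z, g, A_prev, f_psi, f_A, h, K=10):
    """One step of Algorithm 1; f_psi, f_A, h stand in for the learned linear maps / MLP,
    where f_A(xi) is assumed to return an A x K action-patch."""
    if g == 1:
        psi = f_psi(z)                                          # psi^A_t = f^psi(z_t)
        beta = attentive_read(A_prev, psi, K)                   # beta_t = read(A^{t-1}, psi^A_t)
        xi = h(np.concatenate([beta.ravel(), z]))               # xi_t = h([beta_t, z_t])
        return rho(A_prev) + g * attentive_write(f_A(xi), A_prev.shape[1], psi)
    return rho(A_prev)                                          # g = 0: just advance time
```

Because the filter responses are normalised and everything is composed of matrix products and exponentials, the read and write operations stay differentiable in the attention parameters, which is what allows the plan updates to be trained by backpropagation.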
Unlike the planning module, where $A^t$ is updated additively, $c^t$ is overwritten completely using the following equations:

$$g_t \sim c^{t-1}_1; \quad \text{if } g_t = 0 \text{ then } c^t = \rho(c^{t-1}), \text{ else:}$$
$$\psi^c_t = f^c([\psi^A_t, \xi_t]); \quad c^t = \mathrm{Sigmoid}(b + \mathrm{write}(e, \psi^c_t)) \tag{2}$$

Here the same attentive writing operation is used, but with only one Gaussian filter for attention over $c$. The patch $e$ is therefore just a scalar, which we fix to a high value (40 in our experiments). This high value of $e$ is chosen so that the attention parameters $\psi^c_t$ define a time step at which re-planning is guaranteed to happen. The vector $b$ has the same size as $c^t$ and is filled with a shared, learnable bias $b$, which defines the probability of re-planning earlier than the step implied by $\psi^c_t$. Notice that $g_t$ is used as a multiplicative gate in Algorithm 1. This allows for retroactive credit assignment during training, as gradients from the write operation at time $t + \tau$ flow directly into the commitment module through the state $c^t$; gradients are passed through the discrete $g_t$ by setting $\partial g_t / \partial c^{t-1}_1 = 1$ as proposed in [3]. Moreover, when $g_t = 0$ only the computationally cheap operator $\rho$ is invoked. Thereby more commitment significantly saves computation.

### 3.1 Structured exploration with macro-actions

The architecture defined above is capable of generating macro-actions. This section describes how to use macro-actions for structured exploration. We introduce STRAW-explorer (STRAWe), a version of STRAW with a noisy communication channel between the feature extractor (e.g. a convolutional neural network) and the STRAW planning modules. Let $\zeta_t$ be the activations of the last layer of the feature extractor. We use $\zeta_t$ to regress the parameters of a Gaussian distribution $Q(z_t|\zeta_t) = N(\mu(\zeta_t), I\sigma(\zeta_t))$ from which $z_t$ is sampled. Here $\mu$ is a vector and $\sigma$ is a scalar. Injecting noise at this level of the network generates randomness on the level of plan updates that cover multiple time steps. This effect is reinforced by commitment, which forces STRAWe to execute the plan and experience the outcome instead of rolling the update back on the next step. In section 5 we demonstrate how this significantly improves the score on games like Frostbite and Ms. Pacman.

## 4 Learning

The training loss of the model is defined as follows:

$$L = \sum_t \Big( L^{out}(A^t) + g_t\, \alpha\, \mathrm{KL}\big(Q(z_t|\zeta_t)\,\|\,P(z_t)\big) + \lambda\, c^t_1 \Big) \tag{3}$$

where $L^{out}$ is a domain-specific differentiable loss function defined over the network's output. For supervised problems, like next character prediction in text, $L^{out}$ can be defined as the negative log likelihood of the correct output. We discuss the reinforcement learning case later in this section. The two extra terms are regularisers. The first is a cost of communication through the noisy channel, defined as the KL divergence between the latent distribution $Q(z_t|\zeta_t)$ and some prior $P(z_t)$. Since the latent distribution is a Gaussian (sec. 3.1), a natural choice for the prior is a Gaussian with zero mean and unit standard deviation. The last term penalizes re-planning and encourages commitment.

For reinforcement learning we consider the standard setting where an agent interacts with an environment in discrete time. At each step $t$, the agent observes the state of the environment $x_t$ and selects an action $a_t$ from a finite set of possible actions. The environment responds with a new state $x_{t+1}$ and a scalar reward $r_t$. The process continues until the terminal state is reached, after which it restarts. The goal of the agent is to maximize the discounted return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$. The agent's behaviour is defined by its policy $\pi$, a mapping from state space into action space.
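As a concrete illustration of the noisy channel of section 3.1 and the per-step loss of Eq. (3), here is a small NumPy sketch. It assumes $\sigma$ is a scalar standard deviation shared across a 128-dimensional latent; the values of `alpha`, `lam` and the toy inputs are placeholders, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, sigma):
    """Reparameterised sample from Q(z|zeta) = N(mu, sigma^2 I), sigma treated as a scalar std."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2 I) || N(0, I) ): the communication cost of the noisy channel."""
    d = mu.size
    return 0.5 * (d * sigma**2 + np.dot(mu, mu) - d - 2.0 * d * np.log(sigma))

def per_step_loss(L_out, g_t, mu, sigma, c_first, alpha=1e-5, lam=1e-4):
    """One summand of Eq. (3): output loss + gated KL cost + re-planning penalty on c^t_1."""
    return L_out + g_t * alpha * kl_to_standard_normal(mu, sigma) + lam * c_first

# Toy usage with a 128-dimensional latent, the dimensionality used for STRAWe.
mu, sigma = rng.normal(size=128) * 0.1, 0.5
z_t = sample_z(mu, sigma)
print(per_step_loss(L_out=1.3, g_t=1, mu=mu, sigma=sigma, c_first=0.2))
```

Because the KL term is gated by $g_t$, communication is only paid for at re-planning steps, which is what couples the exploration noise to the macro-action boundaries.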
STRAW produces a distribution over possible actions (a stochastic policy) by passing the first column of the action-plan $A^t$ through a SoftMax function: $\pi(a_t|x_t; \theta) = \mathrm{SoftMax}(A^t_0)$. An action is then produced by sampling from the output distribution. We use the recently proposed Asynchronous Advantage Actor-Critic (A3C) method [18], which directly optimizes the policy of an agent. A3C requires a value function estimator $V(x_t)$ for variance reduction. This estimator can be produced in a number of ways, for example by a separate neural network. The most natural solution for our architecture is to create a value-plan containing the estimates. To keep the architecture simple and efficient, we simply add an auxiliary row to the action-plan which corresponds to the value function estimate. It participates in attentive reading and writing during the update, thereby sharing the temporal attention with the action-plan. The plan is then split into the action part and the estimator before the SoftMax is applied and an action is sampled. The policy gradient update for $L^{out}$ in $L$ is defined as follows:

$$\nabla_\theta L^{out} = \nabla_\theta \log \pi(a_t|x_t; \theta)\big(R_t - V(x_t; \theta)\big) + \beta\, \nabla_\theta H\big(\pi(a_t|x_t; \theta)\big) \tag{4}$$

Here $H(\pi(a_t|x_t; \theta))$ is the entropy of the policy, which stimulates the exploration of primitive actions. The network contains two random variables, $z_t$ and $g_t$, which we have to pass gradients through. For $z_t$ we employ the re-parametrization trick [13, 24]. For $g_t$, we set $\partial g_t / \partial c^{t-1}_1 = 1$ as proposed in [3].

## 5 Experiments

The goal of our experiments was to demonstrate that STRAW learns meaningful and useful macro-actions. We use three domains of increasing complexity: supervised next character prediction in text [10], 2D maze navigation and ATARI games [2].

### 5.1 Experimental setup

**Architecture.** The read and write patches are $A \times 10$ dimensional, and $h$ is a 2-layer perceptron with 64 hidden units. The time horizon is $T = 500$. For STRAWe (sec. 3.1) the Gaussian distribution for structured exploration is 128-dimensional. Ablative analysis for some of these choices is provided in section 5.5. The feature representation of the state space is particular to each domain: for 2D mazes and ATARI it is a convolutional neural network (CNN), and for text it is an LSTM [9]. We provide more details in the corresponding sections.

**Baselines.** The experiments employ two baselines: a simple feed-forward network (FF) and a recurrent LSTM network, both on top of a representation learned by a CNN. FF directly regresses the action probabilities and value function estimate from the feature representation. The LSTM [9] architecture is a widely used recurrent network and has been demonstrated to perform well on a suite of reinforcement learning problems [18]. It has 128 hidden units, and its inputs are the feature representation of an observation and the previous action of the agent. Action probabilities and the value function estimate are regressed from its hidden state.

**Optimization.** We use the A3C method [18] for all reinforcement learning experiments. It was shown to achieve state-of-the-art results on several challenging benchmarks [18]. We cut the trajectory and run backpropagation through time [19] after 40 forward passes of the network or if a terminal signal is received. The optimization process runs 32 asynchronous threads using shared RMSProp. There are 4 hyper-parameters in STRAW and 2 in the LSTM and FF baselines. For each method, we ran 200 experiments, each using randomly sampled hyperparameters. The learning rate and entropy penalty were sampled from a LogUniform($10^{-4}$, $10^{-3}$) interval.
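As a rough illustration of how the auxiliary value row and Eq. (4) fit together, here is a NumPy sketch (not the authors' code). It assumes the value estimate occupies the last row of the augmented plan; automatic differentiation and A3C's separate value-regression loss are omitted, and the advantage is held constant in the policy term as is standard for actor-critic updates.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def split_plan(plan_with_value):
    """The action-plan is augmented with one auxiliary row holding value estimates.
    Split the first column into action logits and the value prediction for the current step."""
    first_col = plan_with_value[:, 0]
    return first_col[:-1], first_col[-1]          # action logits, V(x_t)

def policy_gradient_surrogate(plan_with_value, a_t, R_t, beta=0.01):
    """Scalar surrogate whose gradient w.r.t. the plan matches the structure of Eq. (4):
    log pi(a_t | x_t) weighted by the advantage, plus an entropy bonus."""
    logits, value = split_plan(plan_with_value)
    pi = softmax(logits)
    advantage = R_t - value                        # treated as a constant in the policy term
    entropy = -np.sum(pi * np.log(pi + 1e-8))
    return np.log(pi[a_t] + 1e-8) * advantage + beta * entropy

# Toy usage: 4 actions plus 1 value row, horizon 10.
rng = np.random.default_rng(0)
plan = rng.normal(size=(5, 10))
print(policy_gradient_surrogate(plan, a_t=2, R_t=1.0))
```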
The learning rate is linearly annealed from its sampled value to 0. To explore STRAW's behaviour, we sample the coding cost $\alpha \sim$ LogUniform($10^{-7}$, $10^{-4}$) and the replanning penalty $\lambda \sim$ LogUniform($10^{-6}$, $10^{-2}$). For stability, we clip the advantage $R_t - V(x_t; \theta)$ (eq. 4) to $[-1, 1]$ for all methods, and for STRAW(e) we do not propagate gradients from the commitment module into the planning module through $\psi^A_t$ and $\xi_t$. We define a training epoch as one million observations.
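For completeness, a small sketch of the hyper-parameter sampling and advantage clipping described above; the dictionary keys are illustrative names, not the authors' configuration format.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_uniform(low, high):
    """Sample from LogUniform(low, high): uniform in log-space, as used for the random sweep."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

# One randomly sampled STRAW configuration (ranges taken from the text above).
config = {
    "learning_rate": log_uniform(1e-4, 1e-3),
    "entropy_penalty": log_uniform(1e-4, 1e-3),
    "coding_cost_alpha": log_uniform(1e-7, 1e-4),
    "replanning_penalty_lambda": log_uniform(1e-6, 1e-2),
}

# Advantage clipping used for stability (here applied to a toy advantage value).
clipped_advantage = np.clip(2.7, -1.0, 1.0)
print(config, clipped_advantage)
```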