# Relational recurrent neural networks

Adam Santoro*α, Ryan Faulkner*α, David Raposo*α, Jack Raeαβ, Mike Chrzanowskiα, Théophane Weberα, Daan Wierstraα, Oriol Vinyalsα, Razvan Pascanuα, Timothy Lillicrapαβ

*Equal Contribution. αDeepMind, London, United Kingdom. βCoMPLEX, Computer Science, University College London, London, United Kingdom.

{adamsantoro; rfaulk; draposo; jwrae; chrzanowskim; theophane; weirstra; vinyals; razp; countzero}@google.com

Abstract

Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected, i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module, a Relational Memory Core (RMC), which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (e.g. Mini Pacman), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.

1 Introduction

Humans use sophisticated memory systems to access and reason about important information regardless of when it was initially perceived [1, 2]. In neural network research many successful approaches to modeling sequential data also use memory systems, such as LSTMs [3] and memory-augmented neural networks generally [4-7]. Bolstered by augmented memory capacities, bounded computational costs over time, and an ability to deal with vanishing gradients, these networks learn to correlate events across time to become proficient at storing and retrieving information.

Here we propose that it is fruitful to consider memory interactions along with storage and retrieval. Although current models can learn to compartmentalize and relate distributed, vectorized memories, they are not biased towards doing so explicitly. We hypothesize that such a bias may allow a model to better understand how memories are related, and hence may give it a better capacity for relational reasoning over time. We begin by demonstrating that current models do indeed struggle in this domain by developing a toy task to stress relational reasoning of sequential information. Using a new Relational Memory Core (RMC), which uses multi-head dot product attention to allow memories to interact with each other, we solve and analyze this toy problem. We then apply the RMC to a suite of tasks that may profit from more explicit memory-memory interactions, and hence, a potentially increased capacity for relational reasoning across time: partially observed reinforcement learning tasks, program evaluation, and language modeling on the WikiText-103, Project Gutenberg, and GigaWord datasets.

2 Relational reasoning

We take relational reasoning to be the process of understanding the ways in which entities are connected and using this understanding to accomplish some higher order goal [8].
For example, consider sorting the distances of various trees to a park bench: the relations (distances) between the entities (trees and bench) are compared and contrasted to produce the solution, which could not be reached if one reasoned about the properties (positions) of each individual entity in isolation.

Since we can often quite fluidly define what constitutes an entity or a relation, one can imagine a spectrum of neural network inductive biases that can be cast in the language of relational reasoning.¹ For example, a convolutional kernel can be said to compute a relation (linear combination) of the entities (pixels) within a receptive field. Some previous approaches make the relational inductive bias more explicit: in message passing neural networks [e.g. 9-12], the nodes comprise the entities and relations are computed using learnable functions applied to nodes connected with an edge, or sometimes by reducing the relational function to a weighted sum of the source entities [e.g. 13, 14]. In Relation Networks [15-17] entities are obtained by exploiting spatial locality in the input image, and the model focuses on computing binary relations between each entity pair. Even further, some approaches emphasize that more capable reasoning may be possible by employing simple computational principles; by recognizing that relations might not always be tied to proximity in space, non-local computations may be better able to capture the relations between entities located far away from each other [18, 19].

¹ Indeed, in the broadest sense any multivariable function must be considered relational.

In the temporal domain relational reasoning could comprise a capacity to compare and contrast information seen at different points in time [20]. Here, attention mechanisms [e.g. 21, 22] implicitly perform some form of relational reasoning; if previous hidden states are interpreted as entities, then computing a weighted sum of entities using attention helps to remove the locality bias present in vanilla RNNs, allowing embeddings to be better related using content rather than proximity.

Since our current architectures solve complicated temporal tasks, they must have some capacity for temporal relational reasoning. However, it is unclear whether their inductive biases are limiting, and whether these limitations can be exposed with tasks demanding particular types of temporal relational reasoning. For example, memory-augmented neural networks [4-7] solve a compartmentalization problem with a slot-based memory matrix, but may have a harder time allowing memories to interact, or relate, with one another once they are encoded. LSTMs [3, 23], on the other hand, pack all information into a common hidden memory vector, potentially making compartmentalization and relational reasoning more difficult.

3 Model

Our guiding design principle is to provide an architectural backbone upon which a model can learn to compartmentalize information, and learn to compute interactions between compartmentalized information. To accomplish this we assemble building blocks from LSTMs, memory-augmented neural networks, and non-local networks (in particular, the Transformer seq2seq model [22]). Similar to memory-augmented architectures we consider a fixed set of memory slots; however, we allow for interactions between memory slots using an attention mechanism. As we will describe, in contrast to previous work we apply attention between memories at a single time step, and not across all previous representations computed from all previous observations.
3.1 Allowing memories to interact using multi-head dot product attention

We will first assume that we do not need to consider memory encoding; that is, that we already have some stored memories in a matrix $M$, with row-wise compartmentalized memories $m_i$. To allow memories to interact we employ multi-head dot product attention (MHDPA) [22], also known as self-attention. Using MHDPA, each memory will attend over all of the other memories, and will update its content based on the attended information.

[Figure 1 schematic: previous memory and input → queries, keys, and values → attention weights → row-wise softmax → weighted average of values → updated memory → MLP and gating → next memory (computation of gates not depicted).]

Figure 1: Relational Memory Core. (a) The RMC receives a previous memory matrix and input vector as inputs, which are passed to the MHDPA module, labeled with an "A". (b) Linear projections are computed for each memory slot, and the input vector, using row-wise shared weights $W^q$ for the queries, $W^k$ for the keys, and $W^v$ for the values. (c) The queries, keys, and values are then compiled into matrices and $\mathrm{softmax}(QK^T)V$ is computed. The output of this computation is a new memory where information is blended across memories based on their attention weights. An MLP is applied row-wise to the output of the MHDPA module (a), and the resultant memory matrix is gated and passed on as the core output or next memory state.

First, a simple linear projection is used to construct queries ($Q = MW^q$), keys ($K = MW^k$), and values ($V = MW^v$) for each memory (i.e. row $m_i$) in matrix $M$. Next, we use the queries, $Q$, to perform a scaled dot-product attention over the keys, $K$. The returned scalars can be put through a softmax function to produce a set of weights, which can then be used to return a weighted average of values from $V$ as $A(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimensionality of the key vectors, used as a scaling factor. Equivalently:

$$A_\theta(M) = \mathrm{softmax}\left(\frac{MW^q (MW^k)^T}{\sqrt{d_k}}\right) MW^v, \quad \text{where } \theta = (W^q, W^k, W^v) \tag{1}$$

The output of $A_\theta(M)$, which we will denote as $\tilde{M}$, is a matrix with the same dimensionality as $M$. $\tilde{M}$ can be interpreted as a proposed update to $M$, with each $\tilde{m}_i$ comprising information from memories $m_j$. Thus, in one step of attention each memory is updated with information originating from other memories, and it is up to the model to learn (via parameters $W^q$, $W^k$, and $W^v$) how to shuttle information from memory to memory.

As implied by the name, MHDPA uses multiple heads. We implement this by producing $h$ sets of queries, keys, and values, using unique parameters to compute a linear projection from the original memory for each head $h$. We then independently apply an attention operation for each head. For example, if $M$ is an $N \times F$ dimensional matrix and we employ two attention heads, then we compute $\tilde{M}^1 = A_\theta(M)$ and $\tilde{M}^2 = A_\phi(M)$, where $\tilde{M}^1$ and $\tilde{M}^2$ are $N \times F/2$ matrices, $\theta$ and $\phi$ denote unique parameters for the linear projections to produce the queries, keys, and values, and $\tilde{M} = [\tilde{M}^1 : \tilde{M}^2]$, where $[:]$ denotes column-wise concatenation. Intuitively, heads could be useful for letting a memory share different information, to different targets, using each head.
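To make the memory-memory interaction concrete, below is a minimal NumPy sketch of the multi-head attention in equation 1. It is not the authors' Sonnet implementation: per-head projections are realised by slicing the columns of single $W^q$, $W^k$, $W^v$ matrices (equivalent to using separate, smaller projection matrices per head), and the shapes and names in the example are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhdpa(M, Wq, Wk, Wv, num_heads):
    """Multi-head dot product attention over memory rows (cf. equation 1).

    M:  [N, F] memory matrix, one memory per row.
    Wq, Wk, Wv: [F, F] projections shared across rows; columns are split per head.
    Returns the [N, F] proposed memory update, M_tilde.
    """
    N, F = M.shape
    d_head = F // num_heads
    Q, K, V = M @ Wq, M @ Wk, M @ Wv                 # [N, F] each
    outputs = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, cols], K[:, cols], V[:, cols]
        w = softmax(q @ k.T / np.sqrt(d_head))       # [N, N] weights, row-wise softmax
        outputs.append(w @ v)                        # blend values across memories
    return np.concatenate(outputs, axis=-1)          # column-wise concatenation of heads

# Example: 4 memory slots of size 64, 2 heads.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 64))
Wq, Wk, Wv = (rng.normal(size=(64, 64)) * 0.1 for _ in range(3))
M_tilde = mhdpa(M, Wq, Wk, Wv, num_heads=2)
print(M_tilde.shape)  # (4, 64): same dimensionality as M
```

Slicing the projection columns per head means each head attends with its own lower-dimensional queries, keys, and values, matching the two-head $N \times F/2$ example described above.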
3.2 Encoding new memories

We assumed that we already had a matrix of memories $M$. Of course, memories instead need to be encoded as new inputs are received. Suppose then that $M$ is some randomly initialised memory. We can efficiently incorporate new information $x$ into $M$ with a simple modification to equation 1:

$$\tilde{M} = \mathrm{softmax}\left(\frac{MW^q \left([M; x]W^k\right)^T}{\sqrt{d_k}}\right) [M; x]W^v, \tag{2}$$

where we use $[M; x]$ to denote the row-wise concatenation of $M$ and $x$. Since we use $[M; x]$ when computing the keys and values, and only $M$ when computing the queries, $\tilde{M}$ is a matrix with the same dimensionality as $M$. Thus, equation 2 is a memory-size preserving attention operation that includes attention over the memories and the new observations. Notably, we use the same attention operation to efficiently compute memory interactions and to incorporate new information.

We also note the possible utility of this operation when the memory consists of a single vector rather than a matrix. In this case the model may learn to pick and choose which information from the input should be written into the vector memory state by learning how to attend to the input, conditioned on what is contained in the memory already. This is possible in LSTMs via the gates, though at a different granularity. We return to this idea, and the possible compartmentalization that can occur via the heads even in the single-memory-slot case, in the discussion.

3.3 Introducing recurrence and embedding into an LSTM

Suppose we have a temporal dimension with new observations at each timestep, $x_t$. Since $M$ and $\tilde{M}$ have the same dimensionality, we can naively introduce recurrence by first randomly initialising $M$, and then updating it with $\tilde{M}$ at each timestep. We chose to do this by embedding this update into an LSTM. Suppose the memory matrix $M$ can be interpreted as a matrix of cell states, usually denoted as $C$, for a 2-dimensional LSTM. We can make the operations on individual memories $m_i$ nearly identical to those in a normal LSTM cell state as follows (subscripts are overloaded to denote the row from a matrix and the timestep; e.g., $m_{i,t}$ is the $i$th row from $M$ at time $t$):

$$s_{i,t} = (h_{i,t-1}, m_{i,t-1}) \tag{3}$$
$$f_{i,t} = W^f x_t + U^f h_{i,t-1} + b^f \tag{4}$$
$$i_{i,t} = W^i x_t + U^i h_{i,t-1} + b^i \tag{5}$$
$$o_{i,t} = W^o x_t + U^o h_{i,t-1} + b^o \tag{6}$$
$$m_{i,t} = \sigma(f_{i,t} + b^f) \odot m_{i,t-1} + \sigma(i_{i,t}) \odot \underbrace{g_\psi(\tilde{m}_{i,t})} \tag{7}$$
$$h_{i,t} = \sigma(o_{i,t}) \odot \tanh(m_{i,t}) \tag{8}$$
$$s_{i,t+1} = (m_{i,t}, h_{i,t}) \tag{9}$$

The underbrace denotes the modification to a standard LSTM. In practice we did not find output gates necessary; please see the URL in the footnote for our TensorFlow implementation of this model in the Sonnet library², and for the exact formulation we used, including our choice for the $g_\psi$ function (briefly, we found a row/memory-wise MLP with layer normalisation to work best).

There is also an interesting opportunity to introduce a different kind of gating, which we call memory gating, which resembles previous gating ideas [24, 3]. Instead of producing scalar gates for each individual unit (unit gating), we can produce scalar gates for each memory row by converting $W^f$, $W^i$, $W^o$, $U^f$, $U^i$, and $U^o$ from weight matrices into weight vectors, and by replacing the element-wise product in the gating equations with scalar-vector multiplication.

Since parameters $W^f$, $W^i$, $W^o$, $U^f$, $U^i$, $U^o$, and $\psi$ are shared for each $m_i$, we can modify the number of memories without affecting the number of parameters. Thus, tuning the number of memories and the size of each memory can be used to balance the overall storage capacity (equal to the total number of units, or elements, in $M$) and the number of parameters (proportional to the dimensionality of $m_i$).
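As a rough illustration of how equations 2-9 fit together, here is a simplified single-step sketch in NumPy. Several choices here are assumptions rather than the paper's: a single attention head, per-unit (not per-memory) gating, the output gate is kept (the authors report not needing it), and $g_\psi$ is replaced by a one-layer tanh transform instead of the layer-normalised row-wise MLP they used; all parameter names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attend_over_memory_and_input(M, x, Wq, Wk, Wv):
    """Equation 2: queries come from M only, keys/values from [M; x],
    so the output keeps the shape of M while also attending to the input."""
    Mx = np.concatenate([M, x[None, :]], axis=0)      # row-wise concatenation [M; x]
    d_k = Wk.shape[1]
    w = softmax((M @ Wq) @ (Mx @ Wk).T / np.sqrt(d_k))
    return w @ (Mx @ Wv)                              # [N, F]

def rmc_step(M_prev, h_prev, x_t, p):
    """One simplified recurrent step (cf. equations 3-9), single head."""
    M_tilde = attend_over_memory_and_input(M_prev, x_t, p['Wq'], p['Wk'], p['Wv'])
    g = np.tanh(M_tilde @ p['Wg'] + p['bg'])          # stand-in for g_psi in eq. 7
    f = sigmoid(x_t @ p['Wf'] + h_prev @ p['Uf'] + p['bf'])  # forget gate (eq. 4, sigma folded in)
    i = sigmoid(x_t @ p['Wi'] + h_prev @ p['Ui'] + p['bi'])  # input gate (eq. 5)
    o = sigmoid(x_t @ p['Wo'] + h_prev @ p['Uo'] + p['bo'])  # output gate (eq. 6)
    M_next = f * M_prev + i * g                       # eq. 7
    h_next = o * np.tanh(M_next)                      # eq. 8
    return M_next, h_next                             # next state (eq. 9)

# Example wiring: 4 memories of size 32, parameters shared across memory rows.
N, F = 4, 32
rng = np.random.default_rng(1)
p = {k: rng.normal(size=(F, F)) * 0.1
     for k in ['Wq', 'Wk', 'Wv', 'Wg', 'Wf', 'Uf', 'Wi', 'Ui', 'Wo', 'Uo']}
p.update({k: np.zeros(F) for k in ['bg', 'bf', 'bi', 'bo']})
M, h = np.zeros((N, F)), np.zeros((N, F))
for x_t in rng.normal(size=(10, F)):                  # a short input sequence
    M, h = rmc_step(M, h, x_t, p)
```

Because the gate and attention parameters are shared across rows, changing N in this sketch changes the storage capacity without changing the parameter count, mirroring the trade-off described above.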
We find in our experiments that some tasks require more, but not necessarily larger, memories, while others, such as language modeling, require fewer, larger memories.

² https://github.com/deepmind/sonnet/blob/master/sonnet/python/modules/relational_memory.py

Thus, we have a number of tunable parameters: the number of memories, the size of each memory, the number of attention heads, the number of steps of attention, the gating method, and the post-attention processor $g_\psi$. In the appendix we list the exact configurations for each task.

[Figure 2 panels: Nth Farthest ("What is the Nth farthest from vector m?"); Program Evaluation (a short pseudo-code snippet and its printed result); Box World and Mini-Pacman (reinforcement learning); Language Modeling (sentence-completion examples).]

Figure 2: Tasks. We tested the RMC on a suite of supervised and reinforcement learning tasks. Notable are the Nth Farthest toy task and language modeling. In the former, the solution requires explicit relational reasoning since the model must sort distance relations between vectors, and not the vectors themselves. The latter tests the model on a large quantity of natural data and allows us to compare performance to well-tuned models.

4 Experiments

Here we briefly outline the tasks on which we applied the RMC, and direct the reader to the appendix for full details on each task and details on hyperparameter settings for the model.

4.1 Illustrative supervised tasks

Nth Farthest. The Nth Farthest task is designed to stress a capacity for relational reasoning across time. Inputs are a sequence of randomly sampled vectors, and targets are answers to a question of the form: "What is the nth farthest vector (in Euclidean distance) from vector m?", where the vector values, their IDs, n, and m are randomly sampled per sequence. It is not enough to simply encode and retrieve information as in a copy task. Instead, a model must compute all pairwise distance relations to the reference vector m, which might also lie in memory, or might not have even been provided as input yet. It must then implicitly sort these distances to produce the answer. We emphasize that the model must sort distance relations between vectors, and not the vectors themselves.

Program Evaluation. The Learning to Execute (LTE) dataset [25] consists of algorithmic snippets from a Turing complete programming language of pseudo-code, and is broken down into three categories: addition, control, and full program. Inputs are a sequence of characters over an alphanumeric vocabulary representing such snippets, and the target is a numeric sequence of characters that is the execution output for the given programmatic input. Given that the snippets involve symbolic manipulation of variables, we felt the task could strain a model's capacity for relational reasoning; since symbolic operators can be interpreted as defining a relation over the operands, successful learning could reflect an understanding of this relation. To also assess model performance on classical sequence tasks, we evaluated on memorization tasks, in which the output is simply a permuted form of the input rather than an evaluation from a set of operational instructions. See the appendix for further experimental details.
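As an illustration of the Nth Farthest setup, the snippet below generates one example. The exact sequence lengths, vector dimensionality, and input encoding (how IDs and the question are concatenated to each vector) are specified in the paper's appendix and are not reproduced here, so the defaults and the function name are placeholders.

```python
import numpy as np

def nth_farthest_example(num_vectors=8, dim=16, rng=None):
    """Generate one Nth Farthest example: a set of labelled vectors plus a
    question (n, m); the target is the ID of the vector that is the n-th
    farthest (in Euclidean distance) from the vector with ID m."""
    rng = rng or np.random.default_rng()
    vectors = rng.uniform(-1, 1, size=(num_vectors, dim))
    ids = rng.permutation(num_vectors)          # random ID attached to each vector
    n = rng.integers(1, num_vectors + 1)        # "n-th farthest"
    m = rng.integers(num_vectors)               # ID of the reference vector
    ref = vectors[ids == m][0]
    dists = np.linalg.norm(vectors - ref, axis=1)
    order = np.argsort(-dists)                  # indices sorted farthest-first
    target_id = ids[order[n - 1]]
    return vectors, ids, n, m, target_id

vecs, ids, n, m, target = nth_farthest_example()
print(f"Which vector is the {n}-th farthest from vector {m}? -> vector {target}")
```

The point the task stresses is visible here: the answer depends on sorting the distance relations to the reference vector, not on the stored vectors themselves.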
4.2 Reinforcement learning

Mini Pacman with viewport. We follow the formulation of Mini Pacman from [26]. Briefly, the agent navigates a maze to collect food while being chased by ghosts. However, we implement this task with a viewport: a 5 × 5 window surrounding the agent that comprises the perceptual input. The task is therefore partially observable, since the agent must navigate the space and take in information through this viewport. Thus, the agent must predict the dynamics of the ghosts in memory, and plan its navigation accordingly, also based on remembered information about which food has already been picked up. We also point the reader to the appendix for a description and results of another RL task called Box World, which demands relational reasoning in memory space.

4.3 Language Modeling

Finally, we investigate the task of word-based language modeling. We model the conditional probability p(wt|w