Metalearned Neural Memory

Tsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, Adam Trischler
Microsoft Research, Montréal, Québec, Canada
tsendsuren.munkhdalai@microsoft.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We augment recurrent neural networks with an external memory mechanism that builds upon recent progress in metalearning. We conceptualize this memory as a rapidly adaptable function that we parameterize as a deep neural network. Reading from the neural memory function amounts to pushing an input (the key vector) through the function to produce an output (the value vector). Writing to memory means changing the function; specifically, updating the parameters of the neural network to encode desired information. We leverage training and algorithmic techniques from metalearning to update the neural memory function in one shot. The proposed memory-augmented model achieves strong performance on a variety of learning problems, from supervised question answering to reinforcement learning.

1 Introduction

Many information processing tasks require memory, from sequential decision making to structured prediction. As such, a host of past and recent research has focused on augmenting statistical learning algorithms with memory modules that rapidly record task-relevant information [38, 8, 41]. A core desideratum for a memory module is the ability to store information such that it can be recalled from the same cue at later times; this reliability property has been called self-consistency [41].

Figure 1: Schematic illustration of the MNM model, showing the controller $g_\theta$ and its update procedure. Green and blue arrows indicate data flows for writing and reading operations, respectively. $k^r_t$, $k^w_t$, $v^w_t$, and $\beta_t$ denote the read-in key, write-in key, target value, and update rate vectors.

Furthermore, a memory should exhibit some degree of generalization, by recalling useful information for cues that have not been encountered before, or by recalling information associated with what was originally stored (the degree of association may depend on downstream tasks). Memory structures should also be efficient, by scaling gracefully with the quantity of information stored and by enabling fast read-write operations.

In the context of neural networks, one widely successful memory module is the soft look-up table [8, 46, 3]. This module stores high-dimensional key and value vectors in tabular format and is typically accessed by a controlled attention mechanism [3]. While broadly adopted, the soft look-up table has several shortcomings. Look-up tables are efficient to write, but they may grow without bound if information is stored naïvely in additional slots. Usually, their size is kept fixed and a more efficient writing mechanism is either learnt [8] or implemented heuristically [35]. The read operation common to most table-augmented models, which is based on soft attention, does not scale well in terms of the number of slots used or the dimensionality of the stored information [32]. Furthermore, soft look-up tables generalize only via convex combinations of stored values. This burdens the controller with estimating useful key and value representations.

In this paper, we seek to unify few-shot metalearning and memory. We introduce an external memory module that we conceptualize as a rapidly adaptable function parameterized as a deep neural network.
Reading from this neural memory amounts to a forward pass through the network: we push an input (the key vector) through the function to produce an output (the value vector). Writing to memory means updating the parameters of the neural network to encode desired information. We hypothesize that modelling memory as a neural function will offer compression and generalization beyond that of soft look-up tables: deep neural networks are powerful function approximators capable of both strong generalization and memorization [50, 12, 11], and their space overhead is constant.

For a neural network to operate as a useful memory module, it must be possible to record memories rapidly, i.e., to update the network in one shot based on a single datum. We address this challenge by borrowing techniques from few-shot learning through metalearning [1, 33, 6, 27]. Recent progress in this domain has shown how models can learn to implement data-efficient, gradient-descent-like update procedures that optimize their own parameters. To store information rapidly in memory, we propose a novel, layer-wise learned update rule. This update modifies the memory parameters to minimize the difference between the neural function's predicted output (in response to a key) and a target value. We find that our novel update rule offers faster convergence than the gradient-based update rules used commonly in the metalearning literature [7].

We combine our proposed neural memory module with an RNN controller and train the full model end-to-end (Figure 1). Our model learns to remember: we meta-train its reading and writing mechanisms to store information rapidly and reliably. Meta-training also promotes incremental storage, as discussed in §3.5. We demonstrate the effectiveness of our metalearned neural memory (MNM) on a diverse set of learning problems, which includes several algorithmic tasks, synthetic question answering on the bAbI dataset, and maze exploration via reinforcement learning. Our model achieves strong performance on all of these benchmarks.

2 Related Work

Several neural architectures have been proposed recently that combine a controller and a memory module. Neural Turing Machines (NTMs) extend recurrent neural networks (RNNs) with an external memory matrix [8]. The RNN controller interacts with this matrix using read and write heads. Despite the NTM's sophisticated architecture, it is unstable and difficult to train. A possible explanation is that its fixed-size matrix and the lack of a deallocation mechanism lead to information collisions in memory. Neural Semantic Encoders [28] address this drawback by way of a variable-size memory matrix and by writing new content to the most recently read memory entry. The Differentiable Neural Computer [9] maintains memory usage statistics to prevent information collision while still relying on a fixed-size memory matrix. Memory Networks [46, 40] circumvent capacity issues with a read-only memory matrix that scales with the number of inputs. The read-out functions of all these models and related variants (e.g., [20]) are based on a differentiable attention step [3] that takes a convex combination of all stored memory vectors. Unfortunately, this attention step does not scale well to large memory arrays [32].

Another line of work incorporates dynamic, so-called fast weights [14, 38, 37] into the recurrent connections of RNNs to serve as writable memory. For instance, the Hebbian learning rule [13] has been explored extensively in learning fast-weight matrices for memory [38, 2, 23, 26].
HyperNetworks generate context-dependent weights as dynamic scaling terms for the weights of an RNN [10] and are closely related to conditional normalization techniques [21, 5, 31]. Gradient-based fast weights have also been studied in the context of metalearning. Meta Networks [27] define fast and slow weight branches in a single layer and train a meta-learner that generates fast weights, while conditionally shifted neurons [29] map loss gradients to fast biases, in both cases for one-shot adaptation of a classification network. Our proposed memory controller adapts its neural memory model through a set of input and output pairs (called interaction vectors below) without directly interacting with the memory weights. Another related approach from the metalearning literature [16, 43, 35, 33, 24] is MAML [6]. MAML discovers a parameter initialization from which a few steps of gradient descent rapidly adapt a model to several related tasks.

Recently, [47, 48] extended the Sparse Distributed Memory (SDM) of [18] as a generative memory mechanism [49], wherein the content matrix is parameterized as a linear Gaussian model. Memory access then corresponds to an iterative inference procedure. Memory mechanisms based on iterative and/or neural functions, as in [47, 48] and this work, are also related to frameworks that cast memory as dynamical systems of attractors (for some background, see [42]).

3 Proposed Method

At a high level, our proposed memory-augmented model operates as follows. At each time step, the controller RNN receives an external input. Based on this input and its internal state, the controller produces a set of memory interaction vectors. In the process of reading, the controller passes a subset of these vectors, the read-in keys, to the neural memory function. The memory function outputs a read-out value (i.e., a memory recall) in response. In the process of writing, the controller updates the memory function based on the remaining subset of interaction vectors: the write-in keys and target values. We investigate two ways to bind keys to values in the neural memory: (i) by applying one step of modulated gradient descent to the memory function's parameters (§3.3); or (ii) by applying a learned local update rule to the parameters (§3.4). The parameter update reduces the error between the memory function's predicted output in response to the write-in key and the target value. It may be used to create a new association or strengthen existing associations in memory. Finally, based on its internal state and the memory read-out, the controller produces an output vector for use in some downstream task. We meta-train the controller and the neural memory end-to-end to learn effective memory access procedures (§3.5), and call the proposed model the Metalearned Neural Memory (MNM). In the sections below we describe its components in detail. Figure 1 illustrates the MNM model schematically.

3.1 The Controller

The controller is a function $g_\theta$ with parameters $\theta = \{W, b\}$. It uses the LSTM architecture [15] as its core. At each time step it takes in the current external input, $x_t \in \mathbb{R}^{d_i}$, along with the previous memory read-out value $v^r_{t-1} \in \mathbb{R}^{d_v}$ and hidden state $h_{t-1} \in \mathbb{R}^{d_h}$. It outputs a new hidden state: $h_t = \mathrm{LSTM}(x_t, v^r_{t-1}, h_{t-1})$. The controller also produces an output vector to pass to external modules (e.g., a classification layer) for use in a downstream task. The output is computed as $y_t = W_y[h_t; v^r_t] + b_y \in \mathbb{R}^{d_o}$ and depends on the memory read-out vector $v^r_t$.
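To make this interface concrete, here is a minimal PyTorch-style sketch of the controller step described above. The class name `MNMController`, the use of an `LSTMCell` over the concatenation $[x_t; v^r_{t-1}]$, and the argument names are our assumptions for illustration, not details taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class MNMController(nn.Module):
    """Sketch of the LSTM controller: consumes the input and the previous
    memory read-out, and emits a hidden state plus a task output."""
    def __init__(self, d_in, d_v, d_h, d_out):
        super().__init__()
        # The LSTM core sees the external input concatenated with the previous read-out value.
        self.lstm = nn.LSTMCell(d_in + d_v, d_h)
        # The task output y_t depends on the new hidden state and the current read-out v^r_t.
        self.out = nn.Linear(d_h + d_v, d_out)

    def forward(self, x_t, v_read_prev, state=None):
        h_t, c_t = self.lstm(torch.cat([x_t, v_read_prev], dim=-1), state)
        return h_t, (h_t, c_t)

    def output(self, h_t, v_read_t):
        return self.out(torch.cat([h_t, v_read_t], dim=-1))
```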
The read-out vector is computed by the memory function, as described in §3.2. From the controller's hidden state $h_t$ we obtain a set of interaction vectors for reading from and writing to the memory function. These include read-in keys $k^r_t \in \mathbb{R}^{d_k}$, write-in keys $k^w_t \in \mathbb{R}^{d_k}$, target values $v^w_t \in \mathbb{R}^{d_v}$, and a rate vector $\beta'_t \in \mathbb{R}^{d_k}$:

$$[k^r_{t,1}; \ldots; k^r_{t,H}; k^w_{t,1}; \ldots; k^w_{t,H}; v^w_{t,1}; \ldots; v^w_{t,H}; \beta'_t] = \tanh(W_v h_t + b_v). \quad (1)$$

The controller outputs H vectors of each interaction type, where H is the number of parallel interaction heads. The single rate vector is further projected down to a scalar and squashed into [0, 1]: $\beta_t = \mathrm{sigmoid}(W_\beta \beta'_t + b_\beta)$. The rate $\beta_t$ controls the strength with which the corresponding (key, value) pairs should be stored in memory. The write-in keys and target values determine the content to be stored, whereas the read-in keys are used to retrieve content from the memory. We use separate keys and values for reading and writing because the model interacts with its memory in two distinct modes at each time step: it reads information stored there at previous time steps that it deems useful to the task at hand, and it writes information related to the current input that it deems will be useful in the future. The rate $\beta_t$ enables the controller to influence the dynamics of the gradient-based and local update procedures that encode information in memory (§3.3 and §3.4).

3.2 The Memory Function

We model external memory as an adaptive function, $f_{\phi_t}$, parameterized as a feed-forward neural network with weights $\phi_t = \{M^l\}$. Note that these weights, unlike the controller parameters $\theta$, change rapidly from time step to time step and store associative bindings as the model encodes information. Reading from the memory corresponds to feeding the set of read-in keys through the memory function to generate a set of read-out values, $\{v^r_{t,i}\} = f_{\phi_t}(\{k^r_{t,i}\})$. At hidden layer $l$, the memory function's forward computation is defined as $z^l = \sigma(M^l z^{l-1})$, where $\sigma$ is a nonlinear activation function, $z^l \in \mathbb{R}^{D_l}$ is the layer's activation vector, and we have dropped the time-step index. We execute this computation in parallel on the set of H read-in keys $k^r_{t,i}$ at each time step, yielding H read-out value vectors. We take their mean to construct the single read-out value $v^r_t$ that feeds back into the controller and the output computation for $y_t$.

3.3 Writing to Memory with Gradient Descent

A write to the memory consists in rapidly binding the write-in keys $\{k^w_{t,i}\}$ to map to the target values $\{v^w_{t,i}\}$ in the parameters of the neural memory. One way to do this is by updating the memory parameters to optimize an objective that encourages binding. We denote this memory objective by $\mathcal{L}^{up}_t$ and in this work implement it as a simple mean-squared error:

$$\mathcal{L}^{up}_t = \sum_{i=1}^{H} \left\| f_{\phi_{t-1}}(k^w_{t,i}) - v^w_{t,i} \right\|^2_2 \quad (2)$$

We aim to encode the target values $\{v^w_{t,i}\}$ obtained from the controller by optimizing Eq. 2. We obtain a set of memory prediction values, $\{\hat{v}^w_{t,i}\} = f_{\phi_{t-1}}(\{k^w_{t,i}\})$, by feeding the controller's write-in keys through the memory function as parameterized at the previous time step (by $\phi_{t-1}$). The model binds the write-in keys to the target values at time $t$ by taking a gradient step to minimize $\mathcal{L}^{up}_t$:

$$\phi_t \leftarrow \phi_{t-1} - \beta_t \nabla_{\phi_{t-1}} \mathcal{L}^{up}_t. \quad (3)$$

Here, $\beta_t$ is the update rate obtained from the controller, which modulates the size of the gradient step. In principle, by diminishing the update rate, the controller can effectively avoid writing into memory, achieving an effect similar to a gated memory update [8].
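The read and write operations of §3.2 and §3.3 can be sketched in a few lines of PyTorch. The helper names (`memory_forward`, `write_gradient`) are ours; applying tanh at every layer (including the output) and summing the per-head errors without normalization in Eq. 2 are assumptions made for illustration, not details confirmed by the paper.

```python
import torch

def memory_forward(phi, keys):
    """Read from the neural memory: z^l = sigma(M^l z^{l-1}), with tanh at every layer
    (the treatment of the output layer is an assumption)."""
    z = keys
    for M in phi:
        z = torch.tanh(z @ M.t())
    return z

def write_gradient(phi, write_keys, target_values, beta_t):
    """Bind write-in keys to target values with one modulated gradient step (Eqs. 2-3)."""
    pred = memory_forward(phi, write_keys)                     # f_{phi_{t-1}}(k^w_{t,i})
    loss = ((pred - target_values) ** 2).sum()                 # L^up_t, the binding error (Eq. 2)
    grads = torch.autograd.grad(loss, phi, create_graph=True)  # keep the graph: meta-training differentiates through this write
    return [M - beta_t * g for M, g in zip(phi, grads)]        # phi_t <- phi_{t-1} - beta_t * grad (Eq. 3)

# Usage sketch: a 3-layer, 100-unit memory and H = 4 write heads.
phi = [(0.05 * torch.randn(100, 100)).requires_grad_() for _ in range(3)]
k_w, v_w = torch.randn(4, 100), torch.tanh(torch.randn(4, 100))
phi = write_gradient(phi, k_w, v_w, beta_t=torch.tensor(0.5))
v_r = memory_forward(phi, k_w)  # reading back the same keys now moves toward v_w
```

Because the write is a differentiable function of the keys, values, and rate, meta-training (§3.5) can back-propagate through it; this is why the sketch keeps the computation graph with `create_graph=True`.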
In experiments we find that the mean-squared error is an effective memory objective: minimizing it encodes the target values in the memory function, in the sense that they can be read out (approximately) by passing in the corresponding keys.

3.4 Writing to Memory with a Learned Local Update

Writing to memory with gradient descent poses challenges. Writing a new item to a look-up table can be as simple as adding an element to the corresponding array. In a neural memory, by contrast, multiple costly gradient steps may be required to store a key-value vector pair reliably in the parameters. Memory parameter updates are expensive because, in end-to-end training, they require computation of higher-order gradients (see §3.5; this issue is common in gradient-based metalearning). Sequential back-propagation of the memory error through the layers of the memory model also adds a computational bottleneck.

Other researchers have recognized this problem and proposed possible solutions. For example, the direct feedback alignment algorithm [30], a variant of the feedback alignment method [22], enables weight updates via error alignment with random skip connections. Because these feedback connections are neither computed nor used sequentially, updates can be parallelized for speed. However, fixed random feedback connections may be inefficient. The synthetic gradient methods [17] train a model to locally predict the error gradient, and this requires the true gradient as a target.

We propose a memory writing mechanism that is fast and gradient-free. The key idea is to represent each neural memory layer with decoupled forward computation and backward feedback prediction functions (BFPF) and to perform local updates to the memory layers. Unlike feedback alignment methods, the BFPF and the local update rules are meta-trained jointly. Concretely, for neural memory layer $l$, the BFPF is a function $q^l_\psi$ with trainable parameters $\psi^l$ that makes a prediction for an expected activation: $\bar{z}^l = q^l_\psi(v^w_t)$. We then adopt the perceptron learning rule [34] to update the layer locally:

$$M^l_t \leftarrow M^l_{t-1} - \beta^l_t (z^l - \bar{z}^l)(z^{l-1})^\top \quad (4)$$

where $\beta^l_t$ is the local update rate, which can be learned for each layer with the controller or separately with the BFPF $q^l_\psi$. The perceptron update rule uses the predicted activation as a true target and approximates the loss gradient w.r.t. $M^l_{t-1}$ via $\beta^l_t (z^l - \bar{z}^l)(z^{l-1})^\top$. Therefore, the (approximate) gradient is near zero when $z^l \approx \bar{z}^l$ and there are no changes to the weights. But if the predicted and forward activations do not match, we update the weights such that the predicted activations can be reconstructed from the weights given the activations $z^{l-1}$ from the previous layer $l-1$. The BFPF module thus first proposes a regression problem locally for each layer, and the perceptron learning mechanism then solves that local regression problem. One could use a different local update mechanism in place of the perceptron method. Note that it is helpful to choose an update mechanism that is differentiable w.r.t. its solution to the problem, since the BFPF module is trained to propose problems whose solutions minimize the meta and task losses (§3.5).

With the proposed local and gradient-free update rule, the neural memory writes to its weights in parallel and its computation graph need not be tracked during writing. This makes it straightforward to add complex structural biases, such as recurrence, into the neural memory itself.
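A minimal sketch of this local write, in the same PyTorch style as above, is given below. Here the BFPF $q^l_\psi$ is modelled as one linear layer per memory layer, and the per-head outer products of Eq. 4 are averaged over heads; both choices, as well as the constant update rate, are our assumptions for illustration rather than details from the paper.

```python
import torch
import torch.nn as nn

class LocalUpdateMemory(nn.Module):
    """Sketch of the gradient-free local write (Eq. 4): each layer pairs a forward
    weight M^l with a meta-trained backward feedback prediction function (BFPF) q^l_psi."""
    def __init__(self, dims):                     # e.g. dims = [100, 100, 100, 100]
        super().__init__()
        # Fast weights: re-initialized from fixed random values each episode,
        # deliberately not registered as meta-trained parameters.
        self.M = [0.05 * torch.randn(d_out, d_in) for d_in, d_out in zip(dims[:-1], dims[1:])]
        # One BFPF per layer predicts that layer's expected activation from the target value.
        self.bfpf = nn.ModuleList(nn.Linear(dims[-1], d_out) for d_out in dims[1:])

    def read(self, keys):
        z = keys                                   # forward pass: z^l = tanh(M^l z^{l-1})
        for M in self.M:
            z = torch.tanh(z @ M.t())
        return z

    def write(self, write_key, target_value, beta=0.5):
        """Layer-wise perceptron-style update; no backprop through the memory layers.
        beta would come from the controller or BFPF; a constant is used here for brevity."""
        z_prev = write_key
        for l, M in enumerate(self.M):
            z = torch.tanh(z_prev @ M.t())         # forward activation z^l
            z_bar = self.bfpf[l](target_value)     # BFPF prediction \bar{z}^l = q^l_psi(v^w_t)
            # M^l <- M^l - beta (z^l - \bar{z}^l) (z^{l-1})^T, averaged over the head dimension.
            delta = torch.einsum('bi,bj->ij', z - z_bar, z_prev) / z.shape[0]
            self.M[l] = M - beta * delta
            z_prev = z
```

Note that the fast weights `self.M` are plain tensors rather than registered parameters: they are rewritten within an episode, while only the BFPF parameters are meta-trained.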
The proposed approach can readily be applied in the few-shot learning setup as well. For example, we could utilize the learned local update method as an inner-loop adaptation mechanism in a model-agnostic way. We leave this to future work.

3.5 End-to-end Training via Meta and Task Objectives

It is important to note that the memory objective function, Eq. 2, and the memory updates, Eqs. 3 and 4, modify the neural memory function; these updates occur even at test time, as the model processes and records information (they require no external labels). We train the controller parameters $\theta$, the memory initialization $\phi_0$, and the BFPF parameters $\psi$ end-to-end through the combination of a task objective and a meta objective. The former, denoted by $\mathcal{L}^{task}$, is specific to the task the model performs, e.g., classification or regression. The meta objective, $\mathcal{L}^{meta}$, encourages the controller to learn effective update strategies for the neural memory that work well across tasks. These two objectives take account of the model's behavior over an episode, i.e., a sequence of time steps, $t = 1, \ldots, T$, in which the model performs some temporally extended task (like question answering or maze navigation).

For the meta objective, we again make use of the mean squared error between the memory prediction values and the target values, as in Eq. 2. For the meta objective, however, we obtain the prediction values from the updated neural function at each time step, $f_{\phi_t}$, after the update step in Eq. 3 or Eq. 4 has been applied. We also introduce a recall delay parameter, $\tau$:

$$\mathcal{L}^{meta} = \frac{1}{TH} \sum_{t=1}^{T} \sum_{\tau} \sum_{i=1}^{H} \lambda_\tau \left\| f_{\phi_t}(k^w_{t-\tau,i}) - v^w_{t-\tau,i} \right\|^2_2 \quad (5)$$

The sum over $\tau$ can be used to reinforce older memory values, and $\lambda_\tau$ is a decay weight for the importance of older (key, value) pairs. The latter can be used, for instance, to implement a kind of exponential decay. We found that $\lambda_\tau$ is task specific; in this work, we always set the maximum recall delay to zero ($\tau = 0$) to focus on reliable storage of new information.

At the end of each episode, we take gradients of the meta objective and the task objective with respect to the controller parameters, $\theta$, and use these for training the controller:

$$\theta \leftarrow \theta - \nabla_\theta \mathcal{L}^{task} - \nabla_\theta \mathcal{L}^{meta}. \quad (6)$$

These gradients propagate back through the memory-function updates (requiring higher-order gradients if using Eq. 3) so that the controller learns how to modify the memory parameters via the interaction vectors (Eq. 1) and the gradient steps or local updates (Eq. 3 or Eq. 4, respectively). We attempted to learn the memory initialization (similar to MAML) by updating the initial parameters $\phi_0$ w.r.t. the meta loss. We found that this led to severe overfitting. Therefore, we initialize the memory function tabula rasa at each task episode from a fixed random parameter set.

Optimizing the episodic task objective will often require the memory to recall information stored many time steps back, after newer information has also been written. This requirement, along with the fact that the optimization involves propagating gradients back through the memory update steps, promotes incremental learning in the memory function, because overwriting previously stored information would harm task performance.
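Putting the pieces together, the following sketch shows one meta-training episode with the task and meta objectives (recall delay fixed at $\tau = 0$), reusing `MNMController`, `memory_forward`, and `write_gradient` from the sketches above. The function name `train_episode`, the modules `proj` (a linear layer producing the Eq. 1 interaction vectors) and `beta_head` (projecting the rate vector to a scalar), the shared key/value dimensionality, and the read-before-write ordering within a step are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def train_episode(controller, proj, beta_head, phi0, inputs, labels, optimizer, H=4, d_k=100):
    """One meta-training episode (Eqs. 5-6). `proj`: nn.Linear(d_h, 3*H*d_k + d_k);
    `beta_head`: nn.Linear(d_k, 1); `inputs`/`labels`: per-step tensors of an episode.
    The optimizer holds the controller, proj, and beta_head parameters (theta)."""
    # Memory starts tabula rasa from a fixed random initialization; requires_grad_ only so the
    # one-step write can differentiate through it (phi0 itself is not given to the optimizer).
    phi = [M.clone().requires_grad_() for M in phi0]
    v_read, state = torch.zeros(1, d_k), None
    task_loss, meta_loss = 0.0, 0.0
    for x_t, y_t in zip(inputs, labels):                       # an episode of T steps
        h, state = controller(x_t, v_read, state)
        inter = torch.tanh(proj(h))                            # Eq. 1
        k_r, k_w, v_w, b_raw = torch.split(inter, [H * d_k, H * d_k, H * d_k, d_k], dim=-1)
        beta = torch.sigmoid(beta_head(b_raw))                 # scalar update rate in [0, 1]
        v_read = memory_forward(phi, k_r.reshape(H, d_k)).mean(0, keepdim=True)   # read, mean over heads
        phi = write_gradient(phi, k_w.reshape(H, d_k), v_w.reshape(H, d_k), beta) # write (or the local rule)
        task_loss = task_loss + F.cross_entropy(controller.output(h, v_read), y_t)
        # Eq. 5 with tau = 0: how well does the *updated* memory reproduce the just-written values?
        meta_loss = meta_loss + ((memory_forward(phi, k_w.reshape(H, d_k)) - v_w.reshape(H, d_k)) ** 2).mean()
    optimizer.zero_grad()
    (task_loss + meta_loss).backward()                         # Eq. 6: higher-order grads flow through the writes
    optimizer.step()
```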
4 Experimental Evaluation and Analysis

4.1 Algorithmic Tasks

We first introduce a synthetic dictionary inference task to test MNM's capacity to store and recall associated information. This can be considered a toy translation problem. To construct it, we randomly partition the 26 letters of the English alphabet evenly into a source and a target vocabulary, and define a random, bijective mapping F between the two sets. Following the few-shot learning setup, we then construct a support set of k source sequences with their corresponding target sequences. Each source sequence consists of l letters randomly sampled (with replacement) from the source vocabulary, which are then mapped to the corresponding target sequence using F. The objective of the task is to predict the target given a previously unseen source sequence whose composing letters have been observed in the support set. For example, after observing the support examples abc→def and tla→qzd, the model is expected to translate the input sequence tca to the output qfd. The difficulty of the task varies depending on the sequence length l and the size of the support set k. Longer sequences introduce long-term dependencies, whereas a larger number of observations requires efficient memory structure and compression. We constructed four different task instances with support set sizes of 4, 8, 12, and 16 and sequence lengths of 1, 4, 8, and 12.

We trained the MNM models with both gradient-based (MNM-g) and local memory updates (MNM-p). The models have a controller network with 100 hidden units and a three-layer feed-forward memory with 100 hidden units and tanh activation. We compare against two baseline models: a vanilla LSTM model and a memory-augmented model with a soft-attention look-up table as memory (LSTM+SALU). In the LSTM+SALU model, we replace the feed-forward neural memory with the look-up table, providing an unbounded memory to the LSTM controller. The training setup is given in Appendix A.

Figure 2: Training curves on the dictionary inference task.

Figure 2 shows the results for our dictionary inference task (averaged over 5 runs). All models solved the first task instance, and all memory-augmented models converged for the second case. As the task difficulty increased for the last two cases, only the MNM models converged and solved the task with zero error. In Figure 3, we compare the training wallclock time and the memory size of MNM(-g, -p) against the LSTM+SALU model on these task runs. When the input length is small, the LSTM+SALU model is faster than MNM-g and similar in speed to MNM-p, and has a smaller memory footprint than both. However, as soon as the input length exceeds the size of the MNM memory's hidden units, LSTM+SALU becomes less efficient: it exhibits quadratic growth, whereas the MNM models grow approximately linearly in wallclock time. Figure 3's left plot also demonstrates that learned local updates (MNM-p) confer significant speed benefits over gradient-based updates (MNM-g), since the former can be applied in parallel.

Figure 3: Model comparison over input lengths 4, 32, 96, and 192 (x-axes): training wallclock time (left) and memory size (right) for LSTM+SALU and the MNM models.

Figure 4: Training curves on programming tasks.

Table 1: Results on bAbI question answering.

| | EntNet (sentence-level) | TPR-RNN (sentence-level) | DNC (word-level) | SDNC (word-level) | MNM-g (word-level) | MNM-p (word-level) |
|---|---|---|---|---|---|---|
| Mean Error | 9.7 ± 2.6 | 1.34 ± 0.52 | 12.8 ± 4.7 | 6.4 ± 2.5 | 3.2 ± 0.5 | 0.55 ± 0.74 |
| Failed Tasks (> 5% error) | 5 ± 1.2 | 0.86 ± 1.11 | 8.2 ± 2.5 | 4.1 ± 1.6 | 1.3 ± 0.8 | 0.08 ± 0.28 |

Figure 5: Visualization of learned memory operations; panels plot cosine similarity with read-out values, target values, or write-in keys on the axes, as noted in the text. (a) At each question word (y-axis), the model recalls memory contents written for entities in the story (x-axis) that are most closely related to the question type (e.g., locations for where questions).
(b) Towards the end of a story (y-axis), the model learns to access and update the memory conditioned on structured information (e.g., location and character) memorized earlier in the story (x-axis).

We further evaluated these models on two standard memorization tasks: double copy and priority sort. As shown in Figure 4, the MNM models quickly solve the double copy task with input length 50. On the priority sort problem, the LSTM+SALU model demonstrated the strongest result. This is surprising, since the task was previously shown to be hard to solve [8]. It suggests that the unbounded memory table and the look-up operation are especially useful for sorting. The MNM models' strong performance across the suite of algorithmic tasks, which require precise recall of past information, indicates that MNM can store information reliably.

4.2 bAbI Question Answering

bAbI is a synthetic question-answering benchmark that has been widely adopted to evaluate long-term memory and reasoning [45]. It presents 20 reasoning tasks in total. We aim to solve all of them with one generic model. Previous memory-augmented models addressed the problem with sentence-level [40, 20] or word-level inputs [9, 32]. Solving the task based on word-level input is harder, but more general [36]. We trained word-level MNM models following the setup for the DNC [9]. The results are summarized in Table 1. The MNM model with the learned local update solved all the bAbI tasks with near zero error, outperforming the result of TPR-RNN [36] (previously the best model operating at the sentence level). It also outperformed the DNC by around 12% and the Sparse DNC [9] by around 6% in terms of mean error. We report the best results and all 12 runs of MNM-p in Appendix B. MNM-g with the gradient-based update solved 19 tasks, failing only on the basic induction task. We also attempted to train a word-level LSTM+SALU baseline as described in the previous section. However, multiple LSTM+SALU runs did not solve any task and converged to 77.5% mean error after 62K training iterations. With the same number of iterations, the MNM runs converged to a 9.5% mean error and solved 16.5 tasks on average. This suggests the importance of a deep neural memory and a learned memory access mechanism for reasoning.

Analyzing Learned Memory Operations: To understand the MNM more deeply, we analyzed its temporal dynamics through the similarity between the keys and values of its read/write operations as it processed bAbI. Here cosine distance is used as the similarity metric. We found that the neural memory exhibits readily interpretable structure as well as efficient self-organization. Intuitively, keys and values correspond to the locations and contents of memory read/write operations. Consequently, temporal comparisons of various combinations of these vectors have meaningful interpretations. For example, given two time steps $t_1 < t_2$, when comparing $v^r_{t_2}$ against $v^w_{t_1}$, higher similarity (brighter colors in Figure 5) indicates that the content stored at $t_1$ is being retrieved at $t_2$. We first applied this comparison to a bAbI story (x-axis in Figure 5a) and the corresponding question (y-axis), since they are read consecutively by the model. Upon reading the question word "where", the model successfully retrieves the location-related memories.
When the model reads in the character names, the retrieval is then filtered down to only the locations of the characters in question. Furthermore, the retrieval also appears to be closer to more recent locations, effectively modeling this strong prior in the data distribution of bAbI. Similarly, we analyzed the memory operations towards the end of a story (y-axis) and examined how the model uses the memory developed earlier (x-axis). Again comparing $v^r_{t_2}$ and $v^w_{t_1}$ (row 1 in Figure 5b), the bright vertical stripe at "hallway" indicates that the memory retrieval focuses more on Daniel's most recent location (while ignoring both his previous locations and the locations of other characters). In addition, $v^w_{t_2}$ and $v^w_{t_1}$ are compared in row 2 of Figure 5b, where the dark vertical stripes indicate that the memory is being updated aggressively with new contents whenever a new location is mentioned, potentially establishing new associations between locations and characters. In the comparison between $k^w_{t_2}$ and $k^w_{t_1}$ (row 3 in Figure 5b), two bright diagonals are observed in the sentences related to the matching character Daniel, suggesting that (a) the model has likely acquired an entity-based structure and (b) it is capable of leveraging this structure for efficient memory reuse. More examples can be found in the appendix.

Overall, the patterns above are salient and consistent, indicating our model's ability to disentangle objects and their roles from a story, and to use that information to dynamically establish and update associations in a structured, efficient manner, all of which are key to neural-symbolic reasoning [39] and effective generalization.

4.3 Maze Exploration by Reinforcement Learning

Figure 6: Training curves on the maze exploration task.

External memory may be helpful or even necessary for agents operating in partially observable environments, where current decision making depends on a sequence of past observations. However, memory mechanisms also add complexity to an agent's learning process [19], since the agent must learn to access its memory during experience. In our final set of experiments, we train an RL agent augmented with our metalearned neural memory. We wish to discover whether an MNM agent can perform well in a sequential decision making problem and use its memory to improve performance or sample efficiency.

We train MNM agents on a maze exploration task from the literature on meta-reinforcement learning [4, 44]. Specifically, we adopted the grid-world setup from [23]. In this task, for each episode, the agent explores a grid-based maze and attempts to reach a goal position. Reaching the goal earns the agent a reward of 10 and relocates it to a random position. The agent must then return to the goal to collect more reward before the episode terminates. The goal is invisible to the agent, and its position is chosen randomly at the beginning of each episode. Inputs to the agent are its surrounding 3×3 cells, the time step, and the reward from the previous time step. The agent receives a reward of -0.1 for hitting a wall and 0 otherwise. A 9×9 grid maze is shown in Figure 7 (Appendix A) for illustration. We trained agents on a 9×9 maze following the setup of [23] to provide a direct comparison with the differential plasticity agents of that work. We used the Advantage Actor-Critic (A2C) algorithm for optimization, a non-asynchronous variant of the A3C method [25]. The MNM agent has a neural memory and controller with 100 and 200 hidden units, respectively.
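For concreteness, the sketch below reconstructs the reward and observation structure described above in NumPy. It is an illustration of the setup adopted from [23], not the authors' environment code; details such as the four-action move set, the wall-bordered grid, and the class name `MazeEnv` are assumptions.

```python
import numpy as np

class MazeEnv:
    """Illustrative maze sketch: an invisible goal, +10 reward for reaching it (then the
    agent is relocated randomly), -0.1 for bumping into a wall, and 0 otherwise."""
    def __init__(self, grid, seed=0):
        self.grid = grid                          # 2-D numpy array: 1 = wall, 0 = free; border assumed to be walls
        self.rng = np.random.default_rng(seed)
        self.reset()

    def _random_free_cell(self):
        free = np.argwhere(self.grid == 0)
        return tuple(free[self.rng.integers(len(free))])

    def reset(self):
        self.goal = self._random_free_cell()      # invisible to the agent, fixed for the episode
        self.pos = self._random_free_cell()
        self.t, self.prev_reward = 0, 0.0
        return self._obs()

    def _obs(self):
        # Observation: the 3x3 neighbourhood around the agent, the time step, and the previous reward.
        r, c = self.pos
        patch = np.pad(self.grid, 1, constant_values=1)[r:r + 3, c:c + 3]
        return np.concatenate([patch.ravel(), [self.t, self.prev_reward]])

    def step(self, action):                       # action in {0: up, 1: down, 2: left, 3: right}
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        nr, nc = self.pos[0] + dr, self.pos[1] + dc
        if self.grid[nr, nc] == 1:
            reward = -0.1                         # hit a wall: stay in place
        else:
            self.pos, reward = (nr, nc), 0.0
            if self.pos == self.goal:
                reward = 10.0                     # goal reached: collect reward and relocate randomly
                self.pos = self._random_free_cell()
        self.t, self.prev_reward = self.t + 1, reward
        return self._obs(), reward
```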
The training curve for the 9×9 maze (averaged over 10 runs) is plotted in Figure 6, along with results from [23]. As can be seen, the agents with differential plasticity (denoted Plastic and Homogenous Plastic) converge to a reward of 175 after training on nearly 1M episodes. MNM, on the other hand, reaches the same reward in only 250K episodes, and it obtains a significantly higher final reward after 1M episodes. This result shows that the MNM fosters improved performance and sample efficiency in a sequential decision making scenario and, promisingly, that it can be trained in conjunction with an RL policy.

5 Conclusion

We cast external memory for neural models as a rapidly adaptable function, itself parameterized as a deep neural network. Our goal was for this memory mechanism to confer the benefits of deep neural networks: expressiveness, generalization, and constant space overhead. In order to write to a neural network memory rapidly, in one shot, and incrementally, such that newly stored information does not distort existing information, we adopted training and algorithmic techniques from metalearning. The proposed memory-augmented model, MNM, was shown to achieve strong performance on a wide variety of learning problems, from supervised question answering to reinforcement learning. Our learned local update algorithm can also be applied in setups other than the memory one. In future work, we will investigate different neural architectures for metalearned memory and the effects of recall delays in the meta objective.

Acknowledgements

We thank Thomas Miconi for sharing data. We thank Geoff Gordon for helpful comments and suggestions.

References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

[2] Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, pages 4331–4339, 2016.

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[4] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[5] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. CoRR, abs/1610.07629, 2(4):5, 2016.

[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[7] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.

[8] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

[9] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al.
Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.

[10] David Ha, Andrew Dai, and Quoc V Le. HyperNetworks. In ICLR 2017, 2017.

[11] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[12] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164–171, 1993.

[13] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology Press, 1949.

[14] Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 177–186, 1987.

[15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[16] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.

[17] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1627–1635. JMLR.org, 2017.

[18] Pentti Kanerva. Sparse distributed memory. MIT Press, 1988.

[19] Arbaaz Khan, Clark Zhang, Nikolay Atanasov, Konstantinos Karydis, Vijay Kumar, and Daniel D Lee. Memory augmented control networks. arXiv preprint arXiv:1709.05706, 2017.

[20] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.

[21] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4247–4255, 2015.

[22] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7, 2016.

[23] Thomas Miconi, Jeff Clune, and Kenneth O Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018.

[24] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. 2018.

[25] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[26] Tsendsuren Munkhdalai and Adam Trischler. Metalearning with Hebbian fast weights. arXiv preprint arXiv:1807.05076, 2018.

[27] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2554–2563, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[28] Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders.
In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 397–407. Association for Computational Linguistics, 2017.

[29] Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning, pages 3661–3670, 2018.

[30] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045, 2016.

[31] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.

[32] Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gregory Wayne, Alex Graves, and Timothy Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pages 3621–3629, 2016.

[33] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR 2017, 2017.

[34] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

[35] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.

[36] Imanol Schlag and Jürgen Schmidhuber. Learning to reason with third order tensor products. In Advances in Neural Information Processing Systems, pages 10003–10014, 2018.

[37] J Schmidhuber. Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In International Conference on Artificial Neural Networks, pages 460–463. Springer, 1993.

[38] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.

[39] Paul Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216, 1990.

[40] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.

[41] Wen Sun, Alina Beygelzimer, Hal Daumé III, John Langford, and Paul Mineiro. Contextual memory trees. arXiv preprint arXiv:1807.06473, 2018.

[42] Adam Trischler. A Computational Model for Episodic Memory Inspired by the Brain. PhD thesis, 2016.

[43] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[44] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

[45] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.

[46] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014.

[47] Yan Wu, Greg Wayne, Alex Graves, and Timothy Lillicrap. The Kanerva machine: A generative distributed memory. ICLR 2018, 2018.
[48] Yan Wu, Gregory Wayne, Karol Gregor, and Timothy Lillicrap. Learning attractor dynamics for generative memory. In Advances in Neural Information Processing Systems, pages 9401–9410, 2018.

[49] Richard S Zemel and Michael C Mozer. A generative model for attractor dynamics. In Advances in Neural Information Processing Systems, pages 80–88, 2000.

[50] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.