# Meta-Reinforcement Learning with Self-Modifying Networks

Mathieu Chalvidal (Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse, France; mathieu_chalvid@brown.edu), Thomas Serre (Carney Institute for Brain Science, Brown University, U.S.; thomas_serre@brown.edu), Rufin VanRullen (Centre de Recherche Cerveau & Cognition, CNRS, Université de Toulouse, France; rufin.vanrullen@cnrs.fr)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Deep Reinforcement Learning has demonstrated the potential of neural networks tuned with gradient descent for solving complex tasks in well-delimited environments. However, these neural systems are slow learners producing specialized agents with no mechanism to continue learning beyond their training curriculum. In contrast, biological synaptic plasticity is persistent and manifold, and has been hypothesized to play a key role in executive functions such as working memory and cognitive flexibility, potentially supporting more efficient and generic learning abilities. Inspired by this, we propose to build networks with dynamic weights, able to continually perform self-reflexive modification as a function of their current synaptic state and action-reward feedback, rather than relying on a fixed network configuration. The resulting model, MetODS (for Meta-Optimized Dynamical Synapses), is a broadly applicable meta-reinforcement learning system able to learn efficient and powerful control rules in the agent policy space. A single layer with dynamic synapses can perform one-shot learning, generalize navigation principles to unseen environments, and manifest a strong ability to learn adaptive motor policies.

1 Introduction

The algorithmic shift from hand-designed to learned features characterizing modern Deep Learning has been transformative for Reinforcement Learning (RL), making it possible to solve complex problems ranging from video games [1, 2] to multiplayer contests [3] or motor control [4, 5]. Yet, "deep" RL has mostly produced specialized agents unable to cope with rapid contextual changes or tasks with novel or compositional structure [6-8]. The vast majority of models have relied on gradient-based optimization to learn static network parameters adjusted during a predefined curriculum, arguably preventing the emergence of online adaptivity. A potential solution to this challenge is to meta-learn [9-11] computational mechanisms able to rapidly capture a task structure and automatically operate complex feedback control: Meta-Reinforcement Learning constitutes a promising direction to build more adaptive artificial systems [12] and to identify key neuroscience mechanisms that endow humans with their versatile learning abilities [13]. In this work, we draw inspiration from biological fast synaptic plasticity, hypothesized to orchestrate flexible cognitive functions according to context-dependent rules [14-17]. By tuning neuronal selectivity at fast time scales, from fast neural signaling (milliseconds) to experience-based learning (seconds and beyond), fast plasticity can in principle support many cognitive faculties including motor and executive control. From a dynamical system perspective, fast plasticity can serve as an efficient mechanism for information storage and manipulation and has led to modern theories of working memory [18-22].
Despite the fact that the magnitude of the synaptic gain variations may be small, such modifications are capable of profoundly altering the network transfer function [23] and constitute a plausible mechanism for rapidly converting reward and choice history into tuned neural functions [24]. From a machine learning perspective, despite having a long history [25-29], fast weights have most often been investigated in conjunction with recurrent neural activations [30-33] and rarely as a function of an external reward signal or of the current synaptic state itself. Recently, new proposals have shown that models with dynamic modulation related to fast weights can yield powerful meta-reinforcement learners [34-36]. In this work, we explore an original self-referential update rule that allows the model to form synaptic updates conditionally on information present in its own synaptic memory. Additionally, environmental reward is injected continually into the model as a rich feedback signal to drive the weight dynamics. These features endow our model with a unique recursive control scheme that supports the emergence of a self-contained reinforcement learning program.

Contribution: We demonstrate that a neural network trained to continually self-modify its weights as a function of sensory information and its own synaptic state can produce a powerful reinforcement learning program. The resulting general-purpose meta-RL agent, called MetODS (for Meta-Optimized Dynamical Synapses), is theoretically presented as a model-free approach performing stochastic feedback control in the policy space. In our experimental evaluation, we investigate the reinforcement learning strategies implemented by the model and demonstrate that a single layer with lightweight parametrization can implement a wide spectrum of cognitive functions, from one-shot learning to continuous motor control. We hope that MetODS inspires more work around self-optimizing neural networks.

The remainder of the paper is organised as follows: In Section 2 we introduce our mathematical formulation of the meta-RL problem, which motivates the MetODS computational principles presented in Section 3. In Section 4 we review previous approaches to meta-reinforcement learning and discuss other models of artificial fast plasticity and their relation to associative memory. In Section 5 we report experimental results in multiple contexts. Finally, in Section 6 we summarise the main advantages of MetODS and outline future work directions.

2 Background

2.1 Notation

Throughout, we refer to "tasks" as Markov decision processes (MDP) defined by the tuple $\tau = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \rho_0)$, where $\mathcal{S}$ and $\mathcal{A}$ are respectively the state and action sets, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition distribution measure associating a probability to each tuple (state, action, new state), $r : \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a bounded reward function, and $\rho_0$ is the initial state distribution. For simplicity, we consider finite-horizon MDPs with T time-steps, although our discussion can be extended to the infinite-horizon case as well as to partially observed MDPs. We further specify notation when needed by subscripting symbols with the corresponding task τ or time-step t.

2.2 Meta-Reinforcement learning as an optimal transport problem

Meta-Reinforcement learning considers the problem of building a program that generates a distribution $\mu_\pi$ of policies $\pi \in \Pi$ that are "adapted" to a distribution $\mu_\tau$ of tasks $\tau \in \mathcal{T}$ with respect to a given metric R.
For instance, $R(\tau, \pi)$ can be the expected cumulative reward for a task τ and policy π, i.e., where state transitions are governed by $s_{t+1} \sim \mathcal{P}_\tau(\cdot|s_t, a_t)$, actions are sampled according to the policy π, $a_t \sim \pi$, and the initial state $s_0$ follows the distribution $\rho_{0,\tau}$. Provided the existence of an optimal policy $\pi^*_\tau$ for any task $\tau \in \mathcal{T}$, we can define the distribution measure $\mu_{\pi^*}$ of these policies over Π. Arguably, an ideal system aims at associating to any task τ its optimal policy $\pi^*_\tau$, i.e., at finding the transport plan γ in the space $\Gamma(\mu_\mathcal{T}, \mu_{\pi^*})$ of coupling distributions with marginals $\mu_\mathcal{T}$ and $\mu_{\pi^*}$ that maximizes R:

$$\max_{\gamma \in \Gamma(\mu_\mathcal{T}, \mu_{\pi^*})} \mathbb{E}_{(\tau, \pi) \sim \gamma}\big[R(\tau, \pi)\big] \quad \text{where} \quad R(\tau, \pi) = \mathbb{E}_{\pi, \mathcal{P}_\tau}\Big[\sum_{t=0}^{T} r_\tau(a_t, s_t)\Big] \tag{1}$$

Figure 1: Meta-Reinforcement Learning as a transport problem and MetODS synaptic adaptation: a) Associating any task τ in $\mathcal{T}$ to its optimal policy $\pi^*_\tau$ in Π can be regarded as finding an optimal transport plan from $\mu_\mathcal{T}$ to $\mu_{\pi^*}$ with respect to the cost R. Finding this transport plan is generally an intractable problem. b) Meta-RL approximates a solution by defining a stochastic flow in the policy space Π, conditioned by the current task τ, that drives a prior distribution $\mu_{\pi_0}$ of policies $\pi_0$ towards a distribution $\mu^{\theta,\tau,t}_\pi$ of policies with high score R. c) Density and mean trajectories of the principal components of our model's dynamic weights over several episodes of the Harlow task (see Section 5.1) reveal this policy specialization. Two modes, colored with respect to whether the agent's initial guess was good or bad, emerge, corresponding to two different policies to solve the task.

Most generally, problem (1) is intractable, since $\mu_{\pi^*}$ is unknown or has no explicit form. Instead, previous approaches optimize a surrogate problem by defining a parametric specialization procedure which builds, for any task τ, a sequence $(\pi_t)_t$ of improving policies (see Fig. 1). Defining θ as the meta-parameters governing the evolution of the sequences $(\pi_t)_t$, and $\mu^{\theta,\tau,t}_\pi$ as the distribution measure of the policy $\pi_t$ after learning task τ during some period t, the optimization problem amounts to finding the meta-parameters θ that best adapt $\pi_t \sim \mu^{\theta,\tau,t}_\pi$ over the task distribution:

$$\max_\theta \ \mathbb{E}_{\tau \sim \mu_\mathcal{T}}\Big[\mathbb{E}_{\pi \sim \mu^{\theta,\tau,t}_\pi}\big[R(\tau, \pi)\big]\Big] \tag{2}$$

Equation (2) casts meta-reinforcement learning as the problem of building automatic policy control in Π. We discuss below desirable properties of such control, for which we show that our proposed meta-learnt synaptic rule has good potential.

Efficiency: How "fast" the distribution $\mu^{\theta,\tau,t}_\pi$ is transported towards a distribution of high-performing policies. Arguably, an efficient learning mechanism should require few interaction steps t with the environment to identify the task rule and adapt its policy accordingly. This can be seen as minimizing the agent's cumulative regret during learning [37]. This notion is also connected to the policy exploration-exploitation trade-off [38, 39], where the agent's learning program should foster rapid discovery of policy improvements while retaining performance gains acquired through exploration. We show that our learnt synaptic update rule is able to change the agent's transfer function drastically in a few updates, and to settle on a policy once the task structure has been identified, thus supporting one-shot learning of a task-contingent association rule in the Harlow task, adapting a motor policy in a few steps, or exploring original environments quickly in the Maze experiment.
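To make the surrogate objective (2) and the efficiency notion above concrete, the following is a toy, self-contained Monte Carlo illustration (not taken from the paper): tasks are two-armed bandits, the inner within-episode value update plays the role of the adaptation flow $\mu^{\theta,\tau,t}_\pi$, and the outer problem searches over a single meta-parameter θ (the inner step size). All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A toy 'task' tau: a two-armed bandit whose rewarded arm is drawn at random."""
    return int(rng.integers(2))

def adapted_return(theta, task, T=20):
    """Roll out T steps of a policy adapting online with inner step-size theta.
    Returns the episode return R(tau, pi) and the cumulative regret."""
    q = np.zeros(2)                          # fast, within-episode value estimate
    total = 0.0
    for _ in range(T):
        p = np.exp(q) / np.exp(q).sum()      # softmax policy pi_t
        a = rng.choice(2, p=p)
        r = 1.0 if a == task else 0.0
        q[a] += theta * (r - q[a])           # inner adaptation step
        total += r
    return total, T - total                  # the optimal policy earns 1 per step

def meta_objective(theta, n_tasks=500):
    """Monte Carlo estimate of E_{tau ~ mu_T} E_pi [ R(tau, pi) ], i.e. objective (2)."""
    return float(np.mean([adapted_return(theta, sample_task())[0] for _ in range(n_tasks)]))

# The outer (meta) problem searches over theta; a coarse grid stands in for it here.
print({th: round(meta_objective(th), 2) for th in (0.0, 0.3, 0.9)})
```

With θ = 0 the inner policy never adapts and the return stays near chance; larger θ adapts within the episode and reduces regret, which is exactly the efficiency criterion discussed above.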
Capacity: This property defines the learner's sensitivity to particular task structures and its ability to convert them into precise states in the policy space, which determines the achievable level of performance for a distribution of tasks $\mu_\tau$. This notion is linked to the sensitivity of the learner, i.e., how the agent captures and retains statistics and structures describing a task. Previous work has shown that memory-based meta-RL models operate a Bayesian task regression problem [40], and that further fostering task identification through diverse regularizations [41, 42, 39] benefits the performance of the optimized learning strategy. Because our mechanism is continual, it allows for constant tracking of environment information and policy updates, similar to memory-based models. We particularly test this property in the maze experiment in Section 5, showing that tuned online synaptic updates obtain the best capacity under systematic variation of the environment.

Generality: We refer here to the overall ability of the meta-learnt policy flow to drive $\mu^{\theta,\tau,t}_\pi$ towards high-performing policy regions for a diverse set of tasks (generic trainability), but also to how general the resulting reinforcement learning program is and how well it transfers to tasks unseen during training (transferability). In the former case, since the proposed synaptic mechanism is model-free, it allows for tackling diverse types of policy learning, from navigation to motor control. Arguably, to build reinforcing agents that learn in open situations, we should strive for generic and efficient computational mechanisms rather than learnt heuristics. Transferability corresponds to the ability of the meta-learned policy flow to yield improving updates even in unseen policy regions of the space Π or conditioned by unseen task properties: new states, actions and transitions, new reward profiles, etc. We show in a motor-control experiment using the Meta-World benchmark that meta-tuned synaptic updates are a potential candidate to produce a more systematic learner, agnostic to environment setting and reward profile. The generality property remains the hardest for current meta-RL approaches, demonstrating the importance of building more stable and invariant control principles.

3 MetODS: Meta-Optimized Dynamical Synapses

3.1 Learning reflexive weight updates

Figure 2: A layer of MetODS updates its weights by recursive applications of read and write operations based on neural activations $v^{(s)}$ and synaptic traces $W^{(s)}$.

What if a neural agent could adjust the rule driving the evolution of $\mu^{\theta,\tau,t}_\pi$ based on its own knowledge? In this work, we test a neural computational model that learns to compress the experience of a task τ into its dynamic weights $W_t$:

$$\forall t \leq T, \quad \pi(a|s, W_t) \sim \mu^{\theta,\tau,t}_\pi \tag{3}$$

Contrary to gradient-based rules, whose analytical expression is a static and predefined function of activations and error ($\Delta(W_t) = \frac{\partial \boldsymbol{a}}{\partial W} \cdot \frac{\partial \mathcal{L}}{\partial \boldsymbol{a}}$), we define an update rule that directly depends on the current weight state through a parameterized non-linear recursive scheme, making the expression of the update much more sensitive to the weight content itself:

$$\forall t \leq T, \quad \Delta(W_t) = F_\theta(W)\big|_{W = W_t} \tag{4}$$

This self-reflexive evolution of the synaptic configuration $W_t$ allows for online variation of the learning rule during adaptation for a given task τ and opens an original repertoire of dynamics for the policy evolution $\mu^{\theta,\tau,t}_\pi$. Specifically, at every time-step t, network inference and learning consist of recursive applications of the read and write operations that we define below.
Read-write operations: The core mechanism consists of two simple operations: a read operation that linearly projects the neuron activations v through the dynamic weights W followed by an element-wise non-linearity, and a write operation that builds a local weight update with element-wise weighting α:

$$\phi(W, v) = \sigma(W \cdot v) \ \ \text{(read)} \qquad \psi(v) = \alpha \odot (v \otimes v) \ \ \text{(write)} \tag{5}$$

Here, α is a matrix in $\mathbb{R}^{N \times N}$, $\otimes$ denotes the outer product, $\odot$ the element-wise multiplication, and σ is a non-linear activation function. The read and write operations are motivated by biological synaptic computation: reading corresponds to the non-linear response of the neuron population to a specific activation pattern; writing consists of an outer product that emulates a local Hebbian rule between neurons. The element-wise weighting α allows locally tuning synaptic plasticity at every connection, consistent with biology [43, 44], and generates a matrix update of potentially higher rank than the rank-one update of the classic Hebbian rule.

Recursive update: While a single iteration of these two operations can only retrieve a similar activation pattern (reading) or add unfiltered external information into the weights (writing), recursively applying these operations offers a much more potent computational mechanism that mixes information between the current neural activation and previous iterates. Starting from an initial activation pattern $v^{(0)}$ and the previous weight state $W^{(0)} = W_{t-1}$, the model recursively applies equations (5) for $s \in [\![1, S]\!]$ to $v^{(s)}$ and $W^{(s)}$ such that:

$$\text{for } s \in [\![1, S]\!]: \quad v^{(s)} = \sum_{l=0}^{s-1} \kappa^{(l)}_s v^{(l)} + \kappa^{(s)}_s \, \phi\big(W^{(s-1)}, v^{(s-1)}\big), \qquad W^{(s)} = \sum_{l=0}^{s-1} \beta^{(l)}_s W^{(l)} + \beta^{(s)}_s \, \psi\big(v^{(s-1)}\big) \tag{6}$$

The parameters $\kappa^{(l)}_s$ and $\beta^{(l)}_s$ are scalar values learnt along with the plasticity parameters α, and correspond to delayed contributions of previous patterns and synaptic states to the current operations. This is motivated by biological evidence of complex cascades of temporal modulation mechanisms over synaptic efficacy [45, 46]. Finally, $(v^{(S)}, W^{(S)})$ are respectively used as the activation for the next layer and as the new synaptic state $W_t$.

Computational interpretation: We note that if S = 1 in equation (6), the operation boils down to a simple Hebbian update with a synapse-specific weighting $\alpha_{i,j}$. This perspective makes MetODS an original form of modern Hopfield network [47] with hetero-associative memory that can dynamically access and edit stored representations driven by observations, rewards and actions. While pattern retrieval from Hopfield networks has a dense literature, our recursive scheme is an original proposal to learn automatic updates able to articulate representations across time-steps. We believe that this mechanism is particularly beneficial for meta-RL to filter external information with respect to past experience. The promising results shown in our experimental section suggest that such learnt updates can generate useful self-modifications to sequentially adapt to incoming information at runtime.

Algorithm 1: MetODS synaptic learning
1: Require: θ = [f, g, α, κ, β] and $W_0$
2: for $1 \leq t \leq T$ do
3:   $v^{(0)} \leftarrow f(s_t, a_{t-1}, r_{t-1})$
4:   $W^{(0)} \leftarrow W_{t-1}$
5:   for $1 \leq s \leq S$ do
6:     $v^{(s)} \leftarrow \sum_{l=0}^{s-1} \kappa^{(l)}_s v^{(l)} + \kappa^{(s)}_s \, \sigma(W^{(s-1)} \cdot v^{(s-1)})$
7:     $W^{(s)} \leftarrow \sum_{l=0}^{s-1} \beta^{(l)}_s W^{(l)} + \beta^{(s)}_s \, \alpha \odot (v^{(s-1)} \otimes v^{(s-1)})$
8:   end for
9:   $a_t, v_t \leftarrow g(v^{(S)})$
10:  $W_t \leftarrow W^{(S)}$
11: end for

In this work, we test a single dynamic layer for MetODS and leave the extension of synaptic plasticity to the full network for future work.
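The following PyTorch sketch transcribes equations (5)-(6) and Algorithm 1; it is a minimal illustration, not the authors' released implementation. The layer size, recursion depth, tanh non-linearity, initialization scheme, and the linear stand-ins for the embedding f and readout g are all assumptions for the example.

```python
# Minimal sketch of one MetODS layer (equations (5)-(6), Algorithm 1). Assumed sizes/init.
import torch
import torch.nn as nn

class MetODSLayer(nn.Module):
    def __init__(self, n: int = 64, depth: int = 3):
        super().__init__()
        self.n, self.S = n, depth
        self.alpha = nn.Parameter(0.01 * torch.randn(n, n))          # element-wise plasticity alpha
        self.kappa = nn.Parameter(0.1 * torch.randn(depth, depth + 1))  # kappa_s^(l), l <= s
        self.beta = nn.Parameter(0.1 * torch.randn(depth, depth + 1))   # beta_s^(l),  l <= s
        self.sigma = torch.tanh                                       # assumed non-linearity

    def forward(self, v0: torch.Tensor, W: torch.Tensor):
        """One time-step of recursive read/write updates. v0: (n,), W: (n, n)."""
        vs, Ws = [v0], [W]
        for s in range(1, self.S + 1):
            read = self.sigma(Ws[-1] @ vs[-1])                        # phi(W^(s-1), v^(s-1))
            write = self.alpha * torch.outer(vs[-1], vs[-1])          # psi(v^(s-1))
            v_new = sum(self.kappa[s - 1, l] * vs[l] for l in range(s)) + self.kappa[s - 1, s] * read
            W_new = sum(self.beta[s - 1, l] * Ws[l] for l in range(s)) + self.beta[s - 1, s] * write
            vs.append(v_new)
            Ws.append(W_new)
        return vs[-1], Ws[-1]                                         # (v^(S), W^(S))

# Usage within an episode (f and g below are illustrative linear stand-ins, not the paper's maps):
f = nn.Linear(10, 64)            # embeds the concatenated [s_t, a_{t-1}, r_{t-1}] into v^(0)
g = nn.Linear(64, 4 + 1)         # reads out action logits plus a value/advantage estimate
layer = MetODSLayer()
W = torch.zeros(64, 64)          # W_0 (could also be learned or sampled, as noted below)
x = torch.randn(10)              # placeholder input for one time-step
v, W = layer(f(x), W)            # W is carried over to the next time-step
out = g(v)
```

The weight matrix W is the only state carried across time-steps, which is the sense in which the synaptic configuration, rather than a persistent activation, stores the episode's experience.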
In order for the model to learn a credit assignment strategy, the state transition and previous reward information $[s_t, a_{t-1}, r_{t-1}]$ are embedded into a vector $v_t$ by a feedforward map f, as in previous meta-RL approaches [48, 49]. The action and advantage estimate are read out by a feedforward policy map g. We meta-learn the plasticity and update coefficients, as well as the embedding and read-out functions, altogether: θ = [f, g, α, κ, β]. Additionally, the initial synaptic configuration $W_0$ can be learnt, fixed a priori, or sampled from a specified distribution.

4 Related work

Meta-Reinforcement learning has recently flourished into several different approaches aiming at learning high-level strategies for capturing task rules and structures. A direct line of work consists in automatically meta-learning components or parameters of the RL arsenal to improve over heuristic settings [50-52]. Orthogonally, work building on the Turing-completeness of recurrent neural networks has shown that simple recurrent neural networks can be trained to store past information in their persistent activity state to inform current decisions, in such a way that the network implements a form of reinforcement learning over each episode [53, 48, 54]. It is believed that vanilla recurrent networks alone are not sufficient to meta-learn the efficient forms of episodic control found in biological agents [55, 13]. Hence, additional work has tried to enhance the system with a better episodic memory model [56-58] or by modeling a policy as an attention module over an explicitly stored set of past events [49]. Optimization-based approaches have tried to cast episodic adaptation as an explicit optimization procedure, either by treating the optimizer as a black-box system [59, 60] or by learning a synaptic configuration such that one or a few gradient steps are sufficient to adapt the input/output mapping to a specific task [61].

Artificial fast plasticity: Networks with dynamic weights that adapt as a function of neural activation have shown promising results over regular recurrent neural networks in handling sequential data [62, 29, 31, 63, 32, 34]. However, contrary to our work, these models postulate a persistent neural activity orchestrating the weights' evolution. On the contrary, we show that synaptic states are the sole persistent components needed to perform fast adaptation. Additionally, the possibility of optimizing synaptic dynamics with evolutionary strategies in randomly initialized networks [64] or through gradient descent [65] has been demonstrated, as well as in a time-continuous setting [66]. Recent results have shown that plasticity rules differentially tuned at the synapse level make it possible to dynamically edit and query the network's memory [31, 67]. However, another specificity of this work is that our model's synaptic rule is a function of reward and synaptic state, allowing weight dynamics to be driven conditionally on both an external feedback signal and the current model belief.

Associative memory: As discussed above, efficient memory storage and manipulation is a crucial feature for building rapidly learning agents. To improve over vanilla recurrent neural network policies [48], some models have augmented recurrent agents with content-addressable dictionaries able to reinstate previously encoded patterns given the current state [68-70, 13]. However, these slot-based memory systems are subject to interference with incoming inputs, and their memory cost grows linearly with experience.
Contrastingly, attractor networks can be learnt to produce fast compression of sensory information into a fixed-size tensorial representation [71, 72]. One class of such networks is Hopfield networks [73-76], which benefit from a large storage capacity [76], can possibly perform hetero-associative concept binding [77, 34] and produce fast and flexible information retrieval [47].

5 Experiments

In this section, we explore the potential of our meta-learnt synaptic update rule with respect to the three properties of the meta-RL problem exposed in Section 2: namely, 1) efficiency, 2) capacity and 3) generality of the produced learning algorithm. We compare it with three state-of-the-art meta-RL models based on different adaptation mechanisms: RL2 [48], a memory-based algorithm that trains a GRU cell to perform reinforcement learning within the hidden space of its recurrent unit; MAML [61], which performs online gradient descent on the weights of a three-layer MLP; and PEARL [78], which performs probabilistic task inference for conditioning the policy. Details for each experimental setting are further discussed in the S.I., and the code can be found at https://github.com/mathieuchal/metods2022.

5.1 Efficiency: One-shot reinforcement learning and rapid motor control

Figure 3: a-b) Schemas of the Harlow and MuJoCo Ant directional locomotion tasks. c-d) Evolution of accumulated reward over training. In the Harlow task, we conduct an ablation study by either reducing the number of recursive iterations (S = 1) or removing the trainable plasticity weights α, resulting in a sub-optimal policy. In Ant-dir, we compare our agent's training profile against MAML and RL2. e) We can interpret the learned policy in terms of a Hopfield energy adapting with experience. We show horizontally two reward profiles of different episodes and the energy $E_{W_t}(v_1, v_2) = v_1^\top W_t v_2$ along two principal components of the vector trajectory $v_t$. In the first episode, the error at the first presentation (red square) transforms the energy landscape, which changes the agent's policy, while in the other episode the model's belief does not change over time. Note the two modes of every energy map, which allow the model to handle the potential position permutation of the presented values. f) Average reward per time-step during a single episode of the Ant-dir task.

To first illustrate that learnt synaptic dynamics can support fast behavioral adaptation, we use a classic experiment from the neuroscience literature, originally presented by Harlow [79] and recently reintroduced in artificial meta-RL in [54], as well as a heavily benchmarked MuJoCo directional locomotion task (see Fig. 3). To behave optimally in both settings, the agent must quickly identify the task rule and implement a relevant policy. The Harlow task consists of five sequential presentations of two random values placed on a one-dimensional line, with random permutation of their positions, one of which the agent must select by reaching the corresponding position. One value is associated with a positive reward and the other with a negative reward. The five trials are presented in alternation with periods of fixation, where the agent should return to a neutral position between items. In the MuJoCo robotic Ant-dir experiment, a four-legged agent must produce a locomotive policy given a randomly rewarded direction.

Harlow: Since value locations are randomly permuted across presentations within one episode, the learner cannot develop a mechanistic strategy to reach high rewards based on initial position.
Instead, to reach the maximal expected reward over the episode, the agent needs to perform one-shot learning of the task-contingent association rule during the first presentation. We found that even a very small network of N = 20 neurons proved sufficient to solve the task perfectly. We investigated the synaptic mechanism encoding the agent's policy. A principal component analysis reveals a policy differentiation with respect to the outcome of the initial value choice, supported by synaptic specialization in only a few time-steps (see Figure 1-c). We can further interpret this adaptation in terms of sharp modifications of the Hopfield energy of the dynamic weights (Figure 3-e). Finally, the largest synaptic variations, measured by the sum of absolute synaptic changes, occur for states that carry a non-null reward signal (see S.I.). These results suggest that the recursive Hebbian update combined with reward feedback is sufficient to support one-shot reinforcement learning of the task association rule.

MuJoCo Ant-dir: We trained models to perform the task over an episode of 200 time-steps and found that MetODS can adapt in a few time-steps, similar to memory-based models such as RL2 (Figure 3-f), thanks to its continual adaptation mechanism. By design, MAML and PEARL do not present such a property, and they need multiple episodes before being able to perform adaptation correctly. We still report MAML performance after running its gradient adaptation at time-step t = 100. We further note that our agent's overall performance in a single episode is still superior to the MAML performance reported in [61, 49] when more episodes are accessible for training.

5.2 Capacity: Maze exploration task

Figure 4: a) Examples of generated maze configurations. The mazes consist of 8×8 pix. areas, with walls and obstacles randomly placed according to a variation of Prim's algorithm [80] and with the target location (star) randomly selected for the entire duration of the episode. The agent's receptive field is highlighted in red. b) Comparisons with MAML and RL2 and effect of element-wise plasticity α. c) Variations of MetODS' writing mechanism as well as of the recursion depth S yield different performances (see S.I. for further details).

We further tested the systematicity of the reinforcement program learnt by MetODS on a partially observable Markov decision process (POMDP): an agent must locate a target in a randomly generated maze while starting from random locations and observing only a small portion of its environment (depicted in Figure 4). While visual navigation has been previously explored in meta-RL [48, 49, 31], we here focus on the mnemonic component of navigation by complexifying the task in two ways: we reduce the agent's visual field to a small size of 3×3 pix. and randomize the agent's position after every reward encounter. The agent can take discrete actions in the set {up, down, left, right}, which move it accordingly by one coordinate. The agent's reward signal is solely received by hitting the target location, which yields a reward of 10. Each time the agent hits the target, its position is randomly reassigned on the map (orange) and the exploration resumes until 100 steps are accumulated, during which the agent must collect as much reward as possible. Note that the reward is invisible to the agent, and thus the agent only knows it has hit the reward location because of the activation of the reward input.
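For concreteness, below is a minimal sketch of maze generation with randomized Prim's algorithm, in the spirit of the procedure cited above [80]. The grid encoding, carving scheme and target placement are illustrative assumptions rather than the authors' exact variation (described in their S.I.).

```python
# Sketch of randomized Prim's maze generation on an 8x8 grid (assumed encoding: 1 wall, 0 free, 2 target).
import numpy as np

def prim_maze(size=8, seed=None):
    rng = np.random.default_rng(seed)
    maze = np.ones((size, size), dtype=int)
    start = (rng.integers(0, size, 2) // 2) * 2        # carve cells on even coordinates
    maze[tuple(start)] = 0
    frontier = [(tuple(start), (start[0] + dr, start[1] + dc))
                for dr, dc in [(-2, 0), (2, 0), (0, -2), (0, 2)]]
    while frontier:
        idx = rng.integers(len(frontier))
        (cr, cc), (nr, nc) = frontier.pop(idx)         # pick a random frontier edge
        if 0 <= nr < size and 0 <= nc < size and maze[nr, nc] == 1:
            maze[nr, nc] = 0
            maze[(cr + nr) // 2, (cc + nc) // 2] = 0   # knock down the wall in between
            frontier += [((nr, nc), (nr + dr, nc + dc))
                         for dr, dc in [(-2, 0), (2, 0), (0, -2), (0, 2)]]
    free = np.argwhere(maze == 0)
    maze[tuple(free[rng.integers(len(free))])] = 2     # place the target on a random free cell
    return maze

print(prim_maze(8, seed=0))
```

An episode as described above would then drop the agent on a random free cell, show it only the 3×3 window around its position, and re-sample its position each time it reaches the target cell.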
The reduced observability of the environment and the sparsity of the reward signal (most state transitions yield no reward) require the agent to perform logical binding between distant temporal events to navigate the maze. Again, this setting rules out PEARL, since its latent context encoding mechanism erases the temporal dependencies between state transitions that are crucial for efficient exploration. Despite having no particular inductive bias for efficient spatial exploration or path memorization, a strong policy emerges spontaneously from training.

| Agent | 1st rew. (steps, ↓)* | Success (↑) | Cum. Rew. (↑) | Cum. Rew. (larger maze) (↑) |
|---|---|---|---|---|
| Random | 96.8 ± 0.5 | 5% | 3.8 ± 8.9 | 3.7 ± 6.4 |
| MAML | 64.3 ± 39.3 | 45.2% | 14.95 ± 4.5 | 5.8 ± 10.3 |
| RL2 | 16.2 ± 1.1 | 96.2% | 77.7 ± 46.5 | 28.1 ± 29.7 |
| MetODS | 14.7 ± 1.4 | 96.6% | 86.5 ± 46.8 | 34.9 ± 34.9 |

Figure 5: Performance of meta-RL models tested at convergence (1e7 env. steps). MetODS better explores the maze, as measured by the average number of steps before the 1st reward and by the success rate in finding the reward at least once. It then better exploits the maze, as measured by the accumulated reward. (* We assign 100 to episodes with no reward encounter.)

Ablation study and variations: We explored the contribution of the different features combined in MetODS' update rule, showing that they all contribute to the final performance of our meta-learning model. First, we tested the importance of the element-wise tuning of plasticity in weight-based learning models and note that while it adversely affects MAML's gradient update, it greatly improves MetODS' performance, suggesting that the two models rely on different forms of weight updates. Second, we verified that augmenting the recursion depth S was beneficial to performance, consistent with the Harlow results. Third, we noted that transforming the rightmost vector of the writing equation in (5) with a linear projection ("Linear" in Figure 4, see S.I. for full experimental details) yields a major improvement, while a non-linear projection (MLP) does not improve performance. Finally, we additionally test the capability of the learnt navigation skills to generalize to a larger maze size of 10×10 pix., unseen during training. We show that MetODS is able to retain its advantage (see Figure 4 and Figure 5 for results).

5.3 Generality: Motor control

Finally, we test the generality of the reinforcement learning program learnt by our model on different continuous control tasks:

Meta-World: First, we use the dexterous manipulation benchmark proposed in [81], through the benchmark suite [82], in which a Sawyer robot is tasked with diverse operations. A full adaptation episode consists of N = 10 rollouts of 500 time-steps of the same task, across which the dynamic weights are carried over. Observations consist of the robot's joint angles and velocities, and the actions are its joint torques. We compare MetODS with baseline methods in terms of meta-training and meta-testing success rates for three settings: push, reach and ML-10. We show in Fig. 6 the meta-training results for all methods in the Meta-World environments. Due to computational resource constraints, we restrict our experiment to a budget of 10M steps per run. While this benchmark does not reflect the final performance of previous methods reported in [81] at 300M steps, we note that MetODS' test performance outperforms these methods early in training and keeps improving at 10M steps, potentially leaving room for improvement with additional training (see S.I. for additional discussion).
Finally, we note that all tested approaches performed modestly on the ML-10 test tasks, which highlights the limitation of current methods. We conjecture that this might be due to the absence, in the tested meta-learning algorithms, of inductive biases for sharing knowledge between tasks or fostering systematic exploration of the environment.

Robot impairment: We also tested the robustness of the reinforcement programs learnt by MetODS by evaluating the agent's ability to perform in a setting not seen during training: specifically, when its motor capabilities are partially impaired. We adopt the same experimental setting as Section 5.1 for the Ant and Cheetah robots and evaluate performance when freezing one of the robot's torques. We show that our model's policy retains a larger proportion of its performance compared to other approaches. These results suggest that fast synaptic dynamics are not only better suited to support fast adaptation of a motor policy in the continuous domain, but also implement a more robust reinforcement learning program when the agent's motor capabilities are impaired.

Figure 6: Left: Meta-training results for the Meta-World benchmarks. Subplots show task success rates over training time-steps. Average meta-test results for MetODS are shown as dotted lines. Right: Cumulative reward on the Ant and Cheetah directional locomotion tasks. For each condition, results are normalized against the best-performing policy.

6 Discussion

In this work, we introduce a novel meta-RL system, MetODS, which leverages a self-referential weight update mechanism for rapid specialization at the episodic level. Our approach is generic and supports discrete and continuous domains, giving rise to a promising repertoire of skills such as one-shot adaptation, spatial navigation or motor coordination. MetODS demonstrates that locally tuned synaptic updates whose form depends directly on the network configuration can be meta-learnt for solving complex reinforcement learning tasks. We conjecture that further tuning the hyperparameters, as well as combining MetODS with more sophisticated reinforcement learning techniques, can boost its performance. Generally, the success of the approach provides evidence for the benefits of fast plasticity in artificial neural networks and for the exploration of self-referential networks.

7 Broader Impact

Our work explores the emergence of reinforcement learning programs through fast synaptic plasticity. The proposed pieces of evidence that information memorization and manipulation, as well as behavioral specialization, can be supported by such computational principles 1) help question the functional role of such mechanisms observed in biology and 2) reaffirm that alternative paradigms to gradient descent might exist for efficient artificial neural network control. Additionally, the proposed method is of interest for interactive machine learning systems that operate in quickly changing environments and under uncertainty (Bayesian optimization, active learning and control theory). For instance, the meta-RL approach proposed in this work could be applied to brain-computer interfaces for tuning controllers to rapidly drifting neural signals. Improving medical applications and robot control promises positive impact; however, deeper theoretical understanding and careful deployment monitoring are required to avoid misuse.
8 Acknowledgments and Disclosure of Funding

This work was supported by ANR-3IA Artificial and Natural Intelligence Toulouse Institute (ANR-19-PI3A-0004), OSCI-DEEP ANR (ANR-19-NEUC-0004), ONR (N00014-19-1-2029), and NSF (IIS-1912280 and EAR-1925481). Additional support provided by the Carney Institute for Brain Science and the Center for Computation and Visualization (CCV) via NIH Office of the Director grant S10OD025181. The authors would like to thank the anonymous reviewers for their thorough comments and suggestions that led to an improved version of this work.

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. arXiv, abs/1312.5602, 2013.
[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[3] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
[4] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[5] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR (Poster), 2016.
[6] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
[7] Brenden Lake and Marco Baroni. Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks, 2018.
[8] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning, 2020.
[9] Jürgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28, 1997.
[10] Sebastian Thrun. Lifelong Learning Algorithms, pages 181–209. Springer US, Boston, MA, 1998.
[11] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.
[12] Jeff Clune. AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence. CoRR, abs/1905.10985, 2019.
[13] Matthew Botvinick, Sam Ritter, Jane X. Wang, Zeb Kurth-Nelson, Charles Blundell, and Demis Hassabis. Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5):408–422, 2019.
[14] S. J. Martin, P. D. Grimwood, and R. G. M. Morris. Synaptic plasticity and memory: An evaluation of the hypothesis. Annual Review of Neuroscience, 23(1):649–711, 2000.
[15] L. F. Abbott and Wade G. Regehr. Synaptic computation. Nature, 431(7010):796–803, 2004.
[16] Wade G. Regehr. Short-term presynaptic plasticity. Cold Spring Harbor Perspectives in Biology, 4(7):a005702, 2012.
[17] Natalia Caporale and Yang Dan.
Spike timing-dependent plasticity: A Hebbian learning rule. Annual Review of Neuroscience, 31(1):25–46, 2008. PMID: 18275283.
[18] Gianluigi Mongillo, Omri Barak, and Misha Tsodyks. Synaptic theory of working memory. Science, 319(5869):1543–1546, 2008.
[19] Nicolas Masse, Guangyu Yang, H. Song, Xiao-Jing Wang, and David Freedman. Circuit mechanisms for the maintenance and manipulation of information in working memory, 2018.
[20] Omri Barak and Misha Tsodyks. Working models of working memory. Current Opinion in Neurobiology, 25:20–24, 2014. Theoretical and computational neuroscience.
[21] Mark G. Stokes. 'Activity-silent' working memory in prefrontal cortex: a dynamic coding framework. Trends in Cognitive Sciences, 19(7):394–405, 2015.
[22] Sanjay G. Manohar, Nahid Zokaei, Sean J. Fallon, Tim P. Vogels, and Masud Husain. Neural mechanisms of attending to items in working memory. Neuroscience and Biobehavioral Reviews, 101:1–12, 2019.
[23] Pierre Yger, Marcel Stimberg, and Romain Brette. Fast learning with weak synaptic plasticity. Journal of Neuroscience, 35(39):13351–13362, 2015.
[24] Răzvan V. Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468–1502, 2007.
[25] Geoffrey E. Hinton and David C. Plaut. Using fast weights to deblur old memories. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, pages 177–186, 1987.
[26] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[27] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4:131–139, 1992.
[28] Christoph von der Malsburg. The Correlation Theory of Brain Function, pages 95–119. Springer New York, New York, NY, 1994.
[29] David Ha, Andrew Dai, and Quoc V. Le. HyperNetworks. arXiv preprint arXiv:1609.09106, 2016.
[30] Jimmy Ba, Geoffrey E. Hinton, Volodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. Advances in Neural Information Processing Systems, 29:4331–4339, 2016.
[31] Thomas Miconi, Kenneth Stanley, and Jeff Clune. Differentiable plasticity: training plastic neural networks with backpropagation. In International Conference on Machine Learning, pages 3559–3568. PMLR, 2018.
[32] Imanol Schlag and Jürgen Schmidhuber. Learning to reason with third-order tensor products. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 10003–10014, 2018.
[33] Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. Going beyond linear transformers with recurrent fast weight programmers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
[34] Imanol Schlag, Tsendsuren Munkhdalai, and Jürgen Schmidhuber. Learning associative inference using fast weight memory. In International Conference on Learning Representations, 2020.
[35] Elad Sarafian, Shai Keynan, and Sarit Kraus. Recomposing the reinforcement learning building blocks with hypernetworks. In International Conference on Machine Learning, pages 9301–9312. PMLR, 2021.
[36] Luckeciano C. Melo. Transformers are meta-reinforcement learners. In International Conference on Machine Learning, pages 15340–15359. PMLR, 2022.
[37] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
[38] Shin Ishii, Wako Yoshida, and Junichiro Yoshimoto. Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Networks, 15(4-6):665–687, 2002.
[39] Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. arXiv preprint arXiv:1910.08348, 2019.
[40] Pedro A. Ortega, Jane X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann, et al. Meta-learning of sequential strategies. arXiv preprint arXiv:1905.03030, 2019.
[41] Jan Humplik, Alexandre Galashov, Leonard Hasenclever, Pedro A. Ortega, Yee Whye Teh, and Nicolas Heess. Meta reinforcement learning as task inference. CoRR, abs/1905.06424, 2019.
[42] Danijar Hafner, Timothy P. Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. CoRR, abs/1912.01603, 2019.
[43] Larry F. Abbott and Sacha B. Nelson. Synaptic plasticity: taming the beast. Nature Neuroscience, 3(11):1178–1183, 2000.
[44] Tuning into diversity of homeostatic synaptic plasticity. Neuropharmacology, 78:31–37, 2014. Homeostatic Synaptic Plasticity.
[45] Chun Yun Chang. Impact of second messenger modulation on activity-dependent and basal properties of excitatory synapses. 2010.
[46] Rylan S. Larsen, Ikuko T. Smith, Jayalakshmi Miriyala, Ji Eun Han, Rebekah J. Corlew, Spencer L. Smith, and Benjamin D. Philpot. Synapse-specific control of experience-dependent plasticity by presynaptic NMDA receptors. Neuron, 83(4):879–893, 2014.
[47] Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Thomas Adler, David Kreil, Michael K. Kopp, et al. Hopfield networks is all you need. In International Conference on Learning Representations, 2020.
[48] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[49] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
[50] Rein Houthooft, Richard Y. Chen, Phillip Isola, Bradly C. Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5405–5414, 2018.
[51] Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 2402–2413, 2018.
[52] Abhishek Gupta, Russell Mendonca, Yu Xuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. Advances in Neural Information Processing Systems, 31:5302–5311, 2018.
[53] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
[54] Jane X. Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Demis Hassabis, and Matthew Botvinick.
Prefrontal cortex as a meta-reinforcement learning system. 2018.
[55] Máté Lengyel and Peter Dayan. Hippocampal contributions to control: The third way. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2008.
[56] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2016.
[57] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In International Conference on Machine Learning, pages 2827–2836. PMLR, 2017.
[58] Samuel Ritter, Jane X. Wang, Zeb Kurth-Nelson, and M. Botvinick. Episodic control as meta-reinforcement learning. bioRxiv, page 360537, 2018.
[59] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 1842–1850. JMLR.org, 2016.
[60] Sachin Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[61] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 2017.
[62] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International Conference on Machine Learning, pages 2554–2563. PMLR, 2017.
[63] Thomas Miconi, Aditya Rawal, Jeff Clune, and Kenneth O. Stanley. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. In International Conference on Learning Representations, 2018.
[64] Elias Najarro and Sebastian Risi. Meta-learning through Hebbian plasticity in random networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 20719–20731. Curran Associates, Inc., 2020.
[65] Thomas Miconi. Learning to learn with backpropagation of Hebbian plasticity. arXiv: Neural and Evolutionary Computing, 2016.
[66] Krzysztof Choromanski, Jared Davis, Valerii Likhosherstov, Xingyou Song, Jean-Jacques E. Slotine, Jacob Varley, Honglak Lee, Adrian Weller, and Vikas Sindhwani. An ode to an ODE. arXiv, abs/2006.11421, 2020.
[67] Mathieu Chalvidal, Matthew Ricci, Rufin VanRullen, and Thomas Serre. Go with the flow: Adaptive control for neural ODEs. In International Conference on Learning Representations, 2021.
[68] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
[69] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural Turing machines - revised, 2016.
[70] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2827–2836. PMLR, 2017.
[71] Sergey Bartunov, Jack Rae, Simon Osindero, and Timothy Lillicrap.
Meta-learning deep energy-based memory models. In International Conference on Learning Representations, 2019.
[72] Wei Zhang and Bowen Zhou. Learning to update auto-associative memory in recurrent neural networks for improving sequence memorization, 2017.
[73] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
[74] Pascal Koiran. Dynamics of discrete time, continuous state Hopfield networks. Neural Computation, 6(3):459–468, 1994.
[75] Mete Demircigil, Judith Heusel, Matthias Löwe, Sven Upgang, and Franck Vermet. On a model of associative memory with huge storage capacity. Journal of Statistical Physics, 2017.
[76] Dmitry Krotov and John J. Hopfield. Dense associative memory for pattern recognition. In NIPS, 2016.
[77] Imanol Schlag and Jürgen Schmidhuber. Learning to reason with third order tensor products. In NeurIPS, pages 10003–10014, 2018.
[78] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pages 5331–5340. PMLR, 2019.
[79] Harry Frederick Harlow. The formation of learning sets. Psychological Review, 56(1):51–65, 1949.
[80] Robert Clay Prim. Shortest connection networks and some generalizations. The Bell System Technical Journal, 36(6):1389–1401, 1957.
[81] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), 2019.
[82] The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.
[83] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[84] Michael Betancourt, Charles C. Margossian, and Vianey Leos-Barajas. The discrete adjoint method: Efficient derivatives for functions of discrete sequences. arXiv preprint arXiv:2002.00326, 2020.
[85] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR, 2016.
[86] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
[87] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018.