# Evolving Reinforcement Learning Algorithms

Published as a conference paper at ICLR 2021

John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, Aleksandra Faust
Research at Google, Mountain View, CA 94043, USA
jcoreyes@eecs.berkeley.edu, {yingjiemiao,daiyip,ereal,slevine,qvl,honglak,sandrafaust}@google.com

We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld-type tasks, and Atari games. The analysis of the learned algorithms' behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods.

1 INTRODUCTION

Designing new deep reinforcement learning algorithms that can efficiently solve a wide variety of problems generally requires a tremendous amount of manual effort. Learning to design reinforcement learning algorithms, or even small sub-components of algorithms, would help ease this burden and could result in better algorithms than researchers could design manually. Our work might then shift from designing these algorithms manually to designing the language and optimization methods for developing these algorithms automatically.

Reinforcement learning algorithms can be viewed as a procedure that maps an agent's experience to a policy that obtains high cumulative reward over the course of training. We formulate the problem of training an agent as one of meta-learning: an outer loop searches over the space of computational graphs or programs that compute the objective function for the agent to minimize, and an inner loop performs the updates using the learned loss function. The objective of the outer loop is to maximize the training return of the inner loop algorithm.

Our learned loss function should generalize across many different environments, instead of being specific to a particular domain. Thus, we design a search language based on genetic programming (Koza, 1993) that can express general symbolic loss functions which can be applied to any environment. Data typing and a generic interface to variables in the MDP allow the learned program to be domain-agnostic. This language also supports the use of neural network modules as subcomponents of the program, so that more complex neural network architectures can be realized.

Efficiently searching over the space of useful programs is generally difficult. For the outer loop optimization, we use regularized evolution (Real et al., 2019), a recent variant of classic evolutionary algorithms that employs tournament selection (Goldberg & Deb, 1991). This approach can scale with the number of compute nodes and has been shown to work for designing algorithms for supervised learning (Real et al., 2020). We adapt this method to automatically design algorithms for reinforcement learning.
While learning from scratch is generally less biased, encoding existing human knowledge into the learning process can speed up the optimization and also make the learned algorithm more interpretable. Because our search language expresses algorithms as a generalized computation graph, we can embed known RL algorithms in the graphs of the starting population of programs. We compare starting from scratch with bootstrapping off existing algorithms and find that while starting from scratch can learn existing algorithms, starting from existing knowledge leads to new RL algorithms which can outperform the initial programs.

Figure 1: Method overview. We use regularized evolution to evolve a population of RL algorithms. A mutator alters top-performing algorithms to produce a new algorithm. The performance of the algorithm is evaluated over a set of training environments and the population is updated. Our method can incorporate existing knowledge by starting the population from known RL algorithms instead of purely from scratch.

We learn two new RL algorithms which outperform existing algorithms in both sample efficiency and final performance on the training and test environments. The learned algorithms are domain-agnostic and generalize to new environments. Importantly, the training environments consist of a suite of discrete-action classical control tasks and gridworld-style environments, while the test environments include Atari games and are unlike anything seen during training.

The contribution of this paper is a method for searching over the space of RL algorithms, which we instantiate by developing a formal language that describes a broad class of value-based model-free reinforcement learning methods. Our search language enables us to embed existing algorithms into the starting graphs, which leads to faster learning and interpretable algorithms. We highlight two learned algorithms which generalize to completely new environments. Our analysis of the meta-learned programs shows that our method automatically discovers algorithms that share structure with recently proposed RL innovations, and empirically attain better performance than deep Q-learning methods.

2 RELATED WORK

Learning to learn is an established idea in supervised learning, including meta-learning with genetic programming (Schmidhuber, 1987; Holland, 1975; Koza, 1993), learning a neural network update rule (Bengio et al., 1991), and self-modifying RNNs (Schmidhuber, 1993). Genetic programming has been used to find new loss functions (Bengio et al., 1994; Trujillo & Olague, 2006). More recently, AutoML (Hutter et al., 2018) aims to automate the machine learning training process. Automated neural network architecture search (Stanley & Miikkulainen, 2002; Real et al., 2017; 2019; Liu et al., 2017; Zoph & Le, 2016; Elsken et al., 2018; Pham et al., 2018) has made large improvements in image classification. Instead of learning the architecture, AutoML-Zero (Real et al., 2020) learns the algorithm from scratch using basic mathematical operations. Our work shares similar ideas, but is applied to the RL setting and assumes additional primitives such as neural network modules. In contrast to AutoML-Zero, we learn computational graphs with the goal of automating RL algorithm design. Our learned RL algorithms generalize to new problems not seen in training.
Automating RL. While RL is used for AutoML (Zoph & Le, 2016; Zoph et al., 2018; Cai et al., 2018; Bello et al., 2017), automating RL itself has been somewhat limited. RL requires different design choices compared to supervised learning, including the formulation of the reward and of the policy update rule, all of which affect learning and performance and are usually chosen through trial and error. AutoRL addresses this gap by applying the AutoML framework from supervised learning to the MDP setting in RL. For example, evolutionary algorithms have been used to mutate the value or actor network weights (Whiteson & Stone, 2006; Khadka & Tumer, 2018), learn a task reward (Faust et al., 2019), tune hyperparameters (Tang & Choromanski, 2020; Franke et al., 2020), or search for a neural network architecture (Song et al., 2020; Franke et al., 2020). This paper focuses on task-agnostic RL update rules in the value-based RL setting which are both interpretable and generalizable.

Meta-learning in RL. Recent work has focused on few-shot task adaptation. Finn et al. (2017) and Finn & Levine (2018) meta-learn initial parameters which can quickly adapt to new tasks, while RL2 (Duan et al., 2016) and concurrent work (Wang et al., 2017) formulate RL itself as a learning problem that is learned with an RNN. The meta-learned component of these works is tuned to a particular domain or environment, in the form of neural network weights which cannot be used for completely new domains with potentially different-sized inputs. Neural Programmer-Interpreters (Reed & De Freitas, 2015; Pierrot et al., 2019) overcome the environment generalization challenge by learning hierarchical neural programs with domain-specific encoders for different environments. Here, the computational graph has a flexible architecture and generalizes across different environments. Learning RL algorithms or their components, such as a reward bonus or value update function, has been studied previously with meta-gradients (Kirsch et al., 2020; Chebotar et al., 2019; Oh et al., 2020), evolutionary strategies (Houthooft et al., 2018), and RNNs (Duan et al., 2016). Although our work also learns RL algorithms, the update rule is represented as a computation graph which includes both neural network modules and symbolic operators. One key benefit is that the resulting graph can be interpreted analytically and can optionally be initialized from known existing algorithms. Prior work that focuses on learning RL losses generalizes to different goals and initial conditions within a single environment (Houthooft et al., 2018), or learns a domain-invariant policy update rule that can generalize to new environments (Kirsch et al., 2020). Another approach searches over the space of curiosity programs using a similar language of DAGs with neural network modules (Alet et al., 2020a), performing the meta-training on a single environment. In contrast, our method is applied to learn general RL update rules and is meta-trained over a diverse set of environments.

3 LEARNING REINFORCEMENT LEARNING ALGORITHMS

In this section, we first describe the problem setup. An inner loop method Eval(L, E) evaluates a learned RL algorithm L on a given environment E. Given access to this procedure, the goal for the outer loop optimization is to learn an RL algorithm with high training return over a set of training environments.
We then describe the search language which enables the learning of general loss functions and the outer loop method which can efficiently search over this space.

3.1 PROBLEM SETUP

Figure 2: Visualization of an RL algorithm, DQN, as a computational graph which computes the loss $L = (Q(s_t, a_t) - (r_t + \gamma \max_a Q_{targ}(s_{t+1}, a)))^2$. Input nodes are in blue, parameter nodes in gray, operation nodes in orange, and the output in green.

We assume that the agent, parameterized with policy $\pi_\theta(a_t|s_t)$, outputs actions $a_t$ at each time step to an environment E and receives reward $r_t$ and next state $s_{t+1}$. Since we are focusing on discrete-action value-based RL methods, $\theta$ will be the parameters of a Q-value function, and the policy is obtained from the Q-value function using an $\epsilon$-greedy strategy. The agent saves this stream of transitions $(s_t, a_t, r_t, s_{t+1})$ to a replay buffer and continually updates the policy by minimizing a loss function $L(s_t, a_t, r_t, s_{t+1}, \theta, \gamma)$ over these transitions with gradient descent. Training occurs for a fixed number of M training episodes, where in each episode m the agent earns episode return $R_m = \sum_{t=0}^{T} r_t$. The performance of an algorithm for a given environment is summarized by the normalized average training return, $\frac{1}{M}\sum_{m=1}^{M} \frac{R_m - R_{min}}{R_{max} - R_{min}}$, where $R_{min}$ and $R_{max}$ are the minimum and maximum return for that environment. We assume these are known ahead of time. This inner loop evaluation procedure Eval(L, E) is outlined in Algorithm 1. To score an algorithm, we use the normalized average training return instead of the final behavior policy return because the former metric also factors in sample efficiency.

The goal of the meta-learner is to find the optimal loss function $L(s_t, a_t, r_t, s_{t+1}, \theta, \gamma)$ to optimize $\pi_\theta$ with maximal normalized average training return over the set of training environments. The full objective for the meta-learner is:

$$L^* = \arg\max_L \; \mathbb{E}_{E}\left[\text{Eval}(L, E)\right]$$

L is represented as a computational graph, which we describe in the next section.

3.2 SEARCH LANGUAGE

Our search language for the algorithm L should be expressive enough to represent existing algorithms while enabling the learning of new algorithms which can obtain good generalization performance across a wide range of environments. Similar to Alet et al. (2020a), we describe the RL algorithm as a general program with a domain-specific language, but we target updates to the policy rather than reward bonuses for exploration. Algorithms map transitions $(s_t, a_t, r_t, s_{t+1})$, policy parameters $\theta$, and the discount factor $\gamma$ into a scalar loss to be optimized with gradient descent. We express L as a computational graph or directed acyclic graph (DAG) of nodes with typed inputs and outputs. See Figure 2 for a visualization of DQN expressed in this form. Nodes are of several types:

Input nodes represent inputs to the program, and include elements from transitions $(s_t, a_t, r_t, s_{t+1})$ and constants, such as the discount factor $\gamma$.

Parameter nodes are neural network weights, which can map between various data types. For example, the weights for the Q-value network map an input node with state data type to a list of real numbers, one per action.

Operation nodes compute outputs given inputs from parent nodes. This includes applying parameter nodes, as well as basic math operators from linear algebra, probability, and statistics. A full list of operation nodes is provided in Appendix A.
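To make the graph representation concrete, the sketch below shows one way the typed node list above could encode the DQN loss of Figure 2 in Python. The `Node` class, the dictionary-based graph encoding, and the `q_net`/`q_target` callables are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

# Minimal sketch of a typed computational-graph node and of the DQN loss of
# Figure 2 expressed as such a graph. Node, evaluate_graph, q_net and q_target
# are illustrative assumptions, not the paper's actual code.

class Node:
    def __init__(self, op, inputs, out_type):
        self.op = op              # callable that computes this node's output
        self.inputs = inputs      # names of parent nodes or program inputs
        self.out_type = out_type  # one of: state, action, float, list, prob, vector

def evaluate_graph(nodes, program_inputs):
    """Evaluate nodes in topological (insertion) order; return the last node's value."""
    values = dict(program_inputs)  # s_t, a_t, r_t, s_tp1, gamma, ...
    last = None
    for name, node in nodes.items():
        last = values[name] = node.op(*[values[i] for i in node.inputs])
    return last                    # by convention, the scalar loss

def dqn_loss_graph(q_net, q_target):
    """DQN loss as a graph; q_net / q_target map a state to a list of Q-values."""
    return {
        "q_sa":   Node(lambda s, a: q_net(s)[a],      ["s_t", "a_t"], "float"),
        "q_next": Node(lambda s: np.max(q_target(s)), ["s_tp1"], "float"),
        "target": Node(lambda r, g, q: r + g * q,     ["r_t", "gamma", "q_next"], "float"),
        "delta":  Node(lambda q, y: q - y,            ["q_sa", "target"], "float"),
        "loss":   Node(lambda d: d ** 2,              ["delta"], "float"),
    }
```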
By default, we set the last node in the graph to compute the output of the program, which is the scalar loss function to be optimized. Importantly, the inputs and outputs of nodes are typed (state, action, vector, float, list, probability). This typing allows programs to be applied to any domain. It also restricts the space of programs to ones with valid typing, which reduces the search space.

Algorithm 1: Algorithm Evaluation, Eval(L, E)
1: Input: RL algorithm L, environment E, training episodes M
2: Initialize: Q-value parameters θ, target parameters θ′, empty replay buffer D
3: for m = 1 to M do
4:   for t = 0 to T do
5:     With probability ϵ, select a random action a_t,
6:     otherwise select a_t = argmax_a Q(s_t, a)
7:     Step environment: s_{t+1}, r_t ← E(a_t, s_t)
8:     D ← D ∪ {(s_t, a_t, r_t, s_{t+1})}
9:     Update parameters: θ ← θ − λ∇_θ L(s_t, a_t, r_t, s_{t+1}, θ, γ)
10:    Update target parameters θ′ ← θ periodically
11:  end for
12:  Compute episode return R_m = Σ_{t=0}^{T} r_t
13: end for
14: Output: normalized training performance (1/M) Σ_{m=1}^{M} (R_m − R_min)/(R_max − R_min)

Algorithm 2: Evolving RL Algorithms
1: Input: training environments {E}, hurdle environment E_h, hurdle threshold α, optional existing algorithm A
2: Initialize: population P of RL algorithms {L}, history H, randomized inputs I. If bootstrapping, initialize P with A.
3: Score each L in P with H[L].score ← Σ_E Eval(L, E)
4: for c = 0 to C do
5:   Sample tournament T ∼ Uniform(P)
6:   Parent algorithm L ← highest-scoring algorithm in T
7:   Child algorithm L′ ← Mutate(L)
8:   H[L′].hash ← Hash(L′(I))
9:   if H[L′].hash is new and Eval(L′, E_h) > α then
10:    H[L′].score ← Σ_E Eval(L′, E)
11:  end if
12:  Add L′ to population P
13:  Remove oldest L from population P
14: end for
15: Output: algorithm L with highest score

3.3 EVOLUTIONARY SEARCH METHOD

Evaluating thousands of programs over a range of complex environments is prohibitively expensive, especially if done serially. We adopt genetic programming (Koza, 1993) as the search method and use regularized evolution (Real et al., 2019), a variant of classic evolutionary algorithms that employs tournament selection (Goldberg & Deb, 1991). Regularized evolution has been shown to work for learning supervised learning algorithms (Real et al., 2020) and can be parallelized across compute nodes. Tournament selection keeps a population of P algorithms and improves the population through cycles. Each cycle picks a tournament of T < P algorithms at random and selects the best algorithm in the tournament as a parent. The parent is mutated into a child algorithm, which is added to the population while the oldest algorithm in the population is removed. We use a single type of mutation which first chooses which node in the graph to mutate and then replaces it with a random operation node whose inputs are drawn uniformly from all possible inputs.

There exists a combinatorially large number of graph configurations. Furthermore, evaluating a single graph, which means training the full inner loop RL algorithm, can take a large amount of time compared to the supervised learning setting. Speeding up the search and avoiding needless computation are needed to make the problem more tractable. We extend regularized evolution with several techniques, detailed below, to make the optimization more efficient. The full training procedure is outlined in Algorithm 2.
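The sketch below summarizes the outer loop of Algorithm 2 in plain Python. The helper callables (`mutate`, `evaluate`, `hash_outputs`) and the way a failed hurdle is scored are assumptions made for illustration; the distributed infrastructure described in Section 4.1 is omitted.

```python
import collections
import random

# Sketch of the regularized-evolution outer loop (Algorithm 2). mutate,
# evaluate and hash_outputs are assumed helpers; failed-hurdle handling is a
# simplification, not the paper's exact bookkeeping.

def regularized_evolution(init_population, train_envs, hurdle_env, alpha,
                          cycles, tournament_size, mutate, evaluate, hash_outputs):
    def full_score(program):
        return sum(evaluate(program, env) for env in train_envs)

    population = collections.deque((full_score(p), p) for p in init_population)
    history = {}  # functional hash -> cached score

    for _ in range(cycles):
        tournament = random.sample(list(population), tournament_size)
        _, parent = max(tournament, key=lambda sp: sp[0])     # best of the tournament
        child = mutate(parent)                                # replace one random node

        h = hash_outputs(child)                               # functional-equivalence check
        if h in history:
            score = history[h]                                # duplicate: reuse cached score
        elif evaluate(child, hurdle_env) <= alpha:
            score = float("-inf")                             # failed the early hurdle
        else:
            score = history[h] = full_score(child)            # full evaluation

        population.append((score, child))                     # add the child ...
        population.popleft()                                  # ... and age out the oldest

    return max(population, key=lambda sp: sp[0])[1]
```

In this sketch, bootstrapping simply corresponds to seeding `init_population` with graphs that encode the DQN loss instead of purely random graphs.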
Functional equivalence check (Real et al., 2020; Alet et al., 2020b). Before evaluating a program, we check whether it is functionally equivalent to any previously evaluated program. This check is done by hashing the concatenated output of the program on 10 randomized inputs. If a mutated program is functionally equivalent to an older program, we still add it to the population, but use the saved score of the older program. Since some nodes of the graph do not always contribute to the output, parts of the mutated program may eventually contribute to a functionally different program.

Early hurdles (So et al., 2019). We want poorly performing programs to terminate early so that we can avoid unneeded computation. We use the CartPole environment as an early hurdle environment E_h by training a program for a fixed number of episodes. If an algorithm performs poorly, episodes terminate in a small number of steps (as the pole falls rapidly), which quickly exhausts the number of training episodes. We treat Eval(L, E_h) < α as the threshold for poor performance, with α chosen empirically.

Program checks. We perform basic checks to rule out and skip training invalid programs. The loss function needs to be a scalar value, so we check whether the program output type is a float (type(L) = R). Additionally, we check whether each program is differentiable with respect to the policy parameters by checking whether a path exists in the graph between the output and the policy parameter node.

Learning from Scratch and Bootstrapping. Our method enables both learning from scratch and learning from existing knowledge by bootstrapping the initial algorithm population with existing algorithms. We learn algorithms from scratch by initializing the population of algorithms randomly. An algorithm is sampled by sampling each operation node sequentially in the DAG. For each node, an operation and valid inputs to that operation are sampled uniformly over all possible options. While learning from scratch might uncover completely new algorithms that differ substantially from existing methods, it can take longer to converge to a reasonable algorithm. We would like to incorporate the knowledge we do have of good algorithms to bootstrap our search from a better starting point. We initialize our graph with the loss function of DQN (Mnih et al., 2013) so that the first 7 nodes represent the standard DQN loss, while the remaining nodes are initialized randomly. During regularized evolution, these nodes are not frozen, so it is possible for the existing sub-graph to be completely replaced if a better solution is found.
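A compact sketch of the pruning checks above is shown below. The program interface (`run`, `output_type`, `output_node`, `policy_param_node`, `parents`) is an assumed abstraction for illustration, not the paper's API.

```python
import hashlib
import numpy as np

# Sketch of the pruning checks from Section 3.3; the program interface is assumed.

def functional_hash(program, randomized_inputs):
    """Hash the concatenated program outputs on a fixed set of randomized inputs."""
    outputs = np.concatenate([np.ravel(program.run(x)) for x in randomized_inputs])
    return hashlib.sha256(np.round(outputs, 6).tobytes()).hexdigest()

def passes_program_checks(program):
    """The loss must be a scalar and differentiable w.r.t. the policy parameters."""
    if program.output_type is not float:
        return False
    return _path_exists(program.output_node, program.policy_param_node)

def _path_exists(node, target):
    """Depth-first search from the output node back toward the parameter node."""
    if node is target:
        return True
    return any(_path_exists(parent, target) for parent in node.parents)
```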
4 LEARNED RL ALGORITHM RESULTS AND ANALYSIS

We discuss the training setup and results of our experiments. We highlight two learned algorithms with good generalization performance, DQNClipped and DQNReg, and analyze their structure.

4.1 TRAINING SETUP

Meta-Training details: We search over programs with a maximum of 20 nodes, not including input or parameter nodes. A full list of node types is provided in Appendix A. We use a population size of 300 and a tournament size of 25, choosing these parameters based on the ones used in Real et al. (2019). Mutations occur with probability 0.95; otherwise a new random program is sampled. The search is run over 300 CPUs for roughly 72 hours, at which point around 20,000 programs have been evaluated. The search is distributed so that any free CPU is allocated to a proposed individual, leaving no CPUs idle. Further meta-training details are in Appendix B.

Training environments: The choice of training environments greatly affects the learned algorithms and their generalization performance. At the same time, our training environments should not be too computationally expensive to run, as we will be evaluating thousands of RL algorithms. We use a range of 4 classical control tasks (CartPole, Acrobot, MountainCar, LunarLander) and a set of 12 multitask gridworld-style environments from MiniGrid (Chevalier-Boisvert et al., 2018). These environments are computationally cheap to run but are also chosen to cover a diverse set of situations. This includes dense and sparse reward, long time horizons, and tasks requiring solving a sequence of subgoals, such as picking up a key and unlocking a door. More details are in Appendix C.

The training environments always include CartPole as an initial hurdle. If an algorithm succeeds on CartPole (normalized training performance greater than 0.6), it then proceeds to a harder set of training environments. For our experiments, we choose these training environments by sampling a set of 3 environments and leave the rest as test environments. For learning from scratch, we also compare the effect of the number of training environments on the learned algorithm by comparing training on just CartPole versus training on CartPole and LunarLander.

RL Training details: For training the RL agent, we use the same hyperparameters across all training and test environments except as noted. All neural networks are MLPs of size (256, 256) with ReLU activations. We use the Adam optimizer with a learning rate of 0.0001. ϵ is decayed linearly from 1 to 0.05 over 1e3 steps for the classical control tasks and over 1e5 steps for the MiniGrid tasks.

Figure 3: (a) Learning curve; (b) performance histogram. Left: Meta-training performance over different numbers of environments, from scratch and with bootstrapping, plotted as RL evaluation performance (sum of normalized training return across the training environments) over the number of candidate algorithms. The shaded region represents one standard deviation over 10 random seeds. More training environments lead to better algorithms, and bootstrapping from DQN speeds up convergence and yields higher final performance. Right: Meta-training performance histogram for bootstrapped training. Many of the top programs have similar structure (Appendix D).

4.2 LEARNING CONVERGENCE

Figure 3a shows convergence over several training configurations. We find that at the end of training roughly 70% of proposed algorithms are functionally equivalent to a previously evaluated program, while early hurdles cut roughly another 40% of proposed non-duplicate programs.

Varying the number of training environments: We compare learning from scratch with a single training environment (CartPole) versus with two training environments (CartPole and LunarLander). While both experiments reach the maximum performance on these environments (Figure 3a), the learned algorithms are different. The two-environment training setup learns the known TD loss $L_{DQN} = (Q(s_t, a_t) - (r_t + \gamma \max_a Q_{targ}(s_{t+1}, a)))^2$, while the single-environment training setup learns a slight variation $L = (Q(s_t, a_t) - (r_t + \max_a Q_{targ}(s_{t+1}, a)))^2$ that does not use the discount, indicating that the range of difficulty of the training environments is important for learning algorithms which can generalize.

Learning from scratch versus bootstrapping: In Figure 3a, we compare training from scratch versus training from bootstrapping on four training environments (CartPole, KeyCorridorS3R1, Dynamic-Obstacles-6x6, DoorKey-5x5).
The training performance does not saturate, leaving room for improvement. Bootstrapping from DQN significantly improves both the convergence and the performance of meta-training, resulting in a 40% increase in final training performance.

4.3 LEARNED RL ALGORITHMS: DQNCLIPPED AND DQNREG

In this section, we discuss two particularly interesting loss functions that were learned by our method and that have good generalization performance on the test environments. Let $Y_t = r_t + \gamma \max_a Q_{targ}(s_{t+1}, a)$ and $\delta = Q(s_t, a_t) - Y_t$. The first loss function, DQNClipped, is

$$L_{DQNClipped} = \max\left[Q(s_t, a_t),\; \delta^2 + Y_t\right] + \max\left[Q(s_t, a_t) - Y_t,\; \gamma \left(\max_a Q_{targ}(s_{t+1}, a)\right)^2\right].$$

DQNClipped was trained by bootstrapping off DQN using three training environments (LunarLander, MiniGrid-Dynamic-Obstacles-5x5, MiniGrid-LavaGapS5). It outperforms DQN and double DQN (DDQN) (van Hasselt et al., 2015) on both the training and unseen environments (Figure 4). The intuition behind this loss function is that, if the Q-values become too large (when $Q(s_t, a_t) > \delta^2 + Y_t$), the loss acts to minimize $Q(s_t, a_t)$ instead of the normal $\delta^2$ loss. Alternatively, we can view this condition as $\delta = Q(s_t, a_t) - Y_t > \delta^2$: when δ is small enough, $Q(s_t, a_t)$ and $Y_t$ are relatively close and the loss is just to minimize $Q(s_t, a_t)$.

Figure 4: Performance of learned algorithms (DQNClipped and DQNReg) versus baselines (DQN and DDQN) on training and test environments, as measured by episode return over 10 training seeds. A dashed line indicates that the algorithm was meta-trained on that environment, while a solid line indicates a test environment. DQNReg can match or outperform the baselines on almost all the training and test environments. Shaded regions correspond to one standard deviation.

The second learned loss function, which we call DQNReg, is given by

$$L_{DQNReg} = 0.1 \cdot Q(s_t, a_t) + \delta^2.$$

DQNReg was trained by bootstrapping off DQN using three training environments (KeyCorridorS3R1, Dynamic-Obstacles-6x6, DoorKey-5x5). In comparison to DQNClipped, DQNReg directly regularizes the Q-values with a weighted term that is always active. We note that both of these loss functions modify the original DQN loss function to regularize the Q-values to be lower in value.

While DQNReg is quite simple, it matches or outperforms the baselines on all training and test environments, including classical control and MiniGrid. It does particularly well on a few test environments (SimpleCrossingS9N1, DoorKey-6x6, and Unlock) and solves the tasks when other methods fail to attain any reward. It is also much more stable, with lower variance between seeds, and more sample efficient on test environments (LavaGapS5, Empty-6x6, Empty-Random-5x5).
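For reference, the two learned losses can be written out for a single transition as below; `q_net` and `q_target` are assumed callables returning the vector of Q-values for a state, and this NumPy sketch ignores batching and automatic differentiation.

```python
import numpy as np

# The learned losses of Section 4.3 written out for one transition.
# q_net / q_target are assumed callables returning a vector of Q-values;
# this is an illustrative sketch, not the released implementation.

def dqn_reg_loss(q_net, q_target, s_t, a_t, r_t, s_tp1, gamma, k=0.1):
    y_t = r_t + gamma * np.max(q_target(s_tp1))
    delta = q_net(s_t)[a_t] - y_t
    return k * q_net(s_t)[a_t] + delta ** 2          # L_DQNReg

def dqn_clipped_loss(q_net, q_target, s_t, a_t, r_t, s_tp1, gamma):
    q_sa = q_net(s_t)[a_t]
    q_next = np.max(q_target(s_tp1))
    y_t = r_t + gamma * q_next
    delta = q_sa - y_t
    # max[Q, delta^2 + Y] + max[Q - Y, gamma * (max_a Q_targ)^2]
    return max(q_sa, delta ** 2 + y_t) + max(delta, gamma * q_next ** 2)
```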
| Env | DQN | DDQN | PPO | DQNReg |
|---|---|---|---|---|
| Asteroid | 1364.5 | 734.7 | 2097.5 | 2390.4 |
| Bowling | 50.4 | 68.1 | 40.1 | 80.5 |
| Boxing | 88.0 | 91.6 | 94.6 | 100.0 |
| Road Runner | 39544.0 | 44127.0 | 35466.0 | 65516.0 |

Table 1: Performance of the learned algorithm DQNReg against baselines on several Atari games. Baseline numbers are taken from the corresponding papers.

In Table 1, we evaluate DQNReg on a set of Atari games. We use the same architecture as in DQN (Mnih et al., 2013) and the same no-op evaluation procedure, which evaluates a trained policy every 1 million training steps over 200 test episodes. Even though meta-training was on computationally simple, non-image-based environments, we find that DQNReg can generalize to image-based environments and outperform baselines. The results for the baselines are taken from their respective papers (Mnih et al., 2013; van Hasselt et al., 2015; Schulman et al., 2017).

These learned algorithms are related to recently proposed RL algorithms: conservative Q-learning (CQL) (Kumar et al., 2020) and M-DQN (Vieillard et al., 2020). CQL learns a conservative Q-function by augmenting the standard Bellman error objective with a simple Q-value regularizer, $\log \sum_a \exp(Q(s_t, a)) - Q(s_t, a_t)$, which encourages the agent to stay close to the data distribution while maintaining a maximum-entropy policy. DQNReg similarly augments the standard objective with a Q-value regularizer, although it does so in a different direction by preventing overestimation. M-DQN modifies DQN by adding the scaled log-policy (using the softmax of the Q-values) to the immediate reward. Both of these methods can be seen as ways to regularize a value-based policy. This resemblance indicates that our method can automatically find useful structures that are currently being explored manually, and could be used to propose new areas for researchers to explore.

We find that the best-performing algorithms from the experiment which learned DQNReg are consistent and take the form $L = \delta^2 + k \cdot Q(s_t, a_t)$. This loss could use further analysis and investigation, possibly with environment-specific tuning of the parameter k. See Appendix D (Table 3) for details.

4.4 ANALYSIS OF LEARNED ALGORITHMS

Figure 5: Overestimated value estimates are generally problematic in value-based RL. Our method learns algorithms which regularize the Q-values, helping with overestimation. We compare the estimated Q-values of our learned algorithms and the baselines with the optimal ground-truth Q-values across several environments during training. The estimate is for taking action zero from the initial state of the environment. While DQN overestimates the Q-values, our learned algorithms DQNClipped and DQNReg underestimate them.

We analyze the learned algorithms to understand their beneficial effect on performance. In Figure 5, we compare the estimated Q-values for each algorithm. We see that DQN frequently overestimates the Q-values, while DDQN consistently underestimates the Q-values before converging to the ground-truth Q-values, which are computed with a manually designed optimal policy. DQNClipped behaves similarly to DDQN, in that it also consistently underestimates the Q-values and does so slightly more aggressively than DDQN. DQNReg significantly undershoots the Q-values and does not converge to the ground truth. Various works (van Hasselt et al., 2015; Haarnoja et al., 2018; Fujimoto et al., 2018) have shown that overestimated value estimates are problematic and that restricting the overestimation improves performance.

The loss function of DQNClipped is composed of the sum of two max operations, so we can analyze when each update rule is active. We interpret DQNClipped as $\max(v_1, v_2) + \max(v_3, v_4)$, where $v_1 = Q(s_t, a_t)$, $v_2 = \delta^2 + Y_t$, $v_3 = Q(s_t, a_t) - Y_t$, and $v_4 = \gamma(\max_a Q_{targ}(s_{t+1}, a))^2$, giving four cases: 1) $v_1 > v_2$ and $v_3 > v_4$; 2) $v_1 > v_2$ and $v_3 < v_4$; 3) $v_1 < v_2$ and $v_3 < v_4$; 4) $v_1 < v_2$ and $v_3 > v_4$. Case 2 corresponds to minimizing the Q-values. Case 3 corresponds to the normal DQN loss of $\delta^2$, since the parameters of $Q_{targ}$ are not updated during gradient descent.
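The four-case decomposition can be made explicit with a small helper that classifies which term of each max is active for a given transition; the variable names follow the text, and the function is only an illustration of the decomposition.

```python
# Sketch of the four-case decomposition of DQNClipped described above.
# Inputs are scalars for one transition; case labels follow the text.

def dqn_clipped_active_case(q_sa, y_t, gamma, q_next_max):
    delta = q_sa - y_t
    v1, v2 = q_sa, delta ** 2 + y_t            # first max:  max(v1, v2)
    v3, v4 = delta, gamma * q_next_max ** 2    # second max: max(v3, v4)
    if v1 > v2 and v3 > v4:
        return 1   # active terms: Q(s_t, a_t) + delta
    if v1 > v2 and v3 < v4:
        return 2   # only Q(s_t, a_t) carries a gradient: pure Q-value minimization
    if v1 < v2 and v3 < v4:
        return 3   # only delta^2 carries a gradient: the standard TD update
    return 4       # active terms: delta^2 + delta
```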
Figure 6: Our learned algorithm, DQNClipped, can be broken down into four update rules, where each rule is active under certain conditions. Case 3 corresponds to normal TD learning, while case 2 corresponds to minimizing the Q-values. Case 2 is more active in the beginning, when value overestimation is a problem, and then becomes less active as it is no longer needed.

In Figure 6, we plot the proportion of time each case is active during training. We see that case 3 is generally the most active, with a small dip in the beginning, and then stays around 95%. Meanwhile case 2, which regularizes the Q-values, has a small increase in the beginning and then decreases later, matching our analysis in Figure 5, which shows that DQNClipped strongly underestimates the Q-values in the beginning of training. This can be seen as a constrained optimization in which the amount of Q-value regularization is tuned accordingly. The regularization is stronger in the beginning of training, when overestimation is problematic ($Q(s_t, a_t) > \delta^2 + Y_t$), and gets weaker as $\delta^2$ gets smaller.

5 CONCLUSION

In this work, we have presented a method for learning reinforcement learning algorithms. We design a general language for representing algorithms which compute the loss function for value-based model-free RL agents to optimize. We highlight two learned algorithms which, although relatively simple, obtain good generalization performance over a wide range of environments. Our analysis of the learned algorithms sheds light on their benefit as regularization terms, which are similar to those in recently proposed algorithms. Our work is limited to discrete-action, value-based RL algorithms that are close to DQN, but could easily be expanded to express more general RL algorithms such as actor-critic or policy gradient methods. How actions are sampled from the policy could also be part of the search space. The set of environments we use for both training and testing could also be expanded to include a more diverse set of problem types. We leave these directions for future work.

ACKNOWLEDGEMENTS

We thank Luke Metz for helpful early discussions and feedback on the paper, Hanjun Dai for early discussions on related research ideas, and Xingyou Song, Krzysztof Choromanski, and Kevin Lee for help with infrastructure. We also thank Jongwook Choi for help with environment selection.

REFERENCES

Ferran Alet, M. F. Schneider, Tomas Lozano-Perez, and L. Kaelbling. Meta-learning curiosity algorithms. arXiv, abs/2003.05325, 2020a.

Ferran Alet, Martin F. Schneider, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Meta-learning curiosity algorithms. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=BygdyxHFDS.

Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 459–468. JMLR.org, 2017.

S. Bengio, Yoshua Bengio, and J. Cloutier. Use of genetic programming for the search of a new learning rule for neural networks. In Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, pp. 324–327, vol. 1, 1994.

Yoshua Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, ii:969, vol. 2, 1991.

Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 2787–2794, 2018.
Yevgen Chebotar, Artem Molchanov, Sarah Bechtle, Ludovic Righetti, F. Meier, and Gaurav S. Sukhatme. Meta-learning via learned loss. arXiv, abs/1906.05374, 2019.

Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv, abs/1611.02779, 2016.

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.

Aleksandra Faust, Anthony Francis, and Dar Mehta. Evolving rewards to automate reinforcement learning. arXiv preprint arXiv:1905.07628, 2019.

Chelsea Finn and S. Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv, abs/1710.11622, 2018.

Chelsea Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv, abs/1703.03400, 2017.

Jörg K. H. Franke, Gregor Köhler, André Biedenkapp, and Frank Hutter. Sample-efficient automated deep reinforcement learning. 2020.

Scott Fujimoto, H. V. Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv, abs/1802.09477, 2018.

David E. Goldberg and Kalyanmoy Deb. A comparative analysis of selection schemes used in genetic algorithms. FOGA, 1991.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.

John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, 1975. Second edition, 1992.

Rein Houthooft, Richard Y. Chen, Phillip Isola, Bradly C. Stadie, F. Wolski, Jonathan Ho, and P. Abbeel. Evolved policy gradients. arXiv, abs/1802.04821, 2018.

Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren (eds.). Automated Machine Learning: Methods, Systems, Challenges. Springer, 2018. In press, available at http://automl.org/book.

Shauharda Khadka and Kagan Tumer. Evolution-guided policy gradient in reinforcement learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 1188–1200. Curran Associates, Inc., 2018.

Louis Kirsch, Sjoerd van Steenkiste, and J. Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. arXiv, abs/1910.04098, 2020.

John Koza. Genetic programming: On the programming of computers by means of natural selection. In Complex Adaptive Systems, 1993.

Aviral Kumar, Aurick Zhou, G. Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv, abs/2006.04779, 2020.

Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan L. Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. CoRR, abs/1712.00559, 2017. URL http://arxiv.org/abs/1712.00559.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. arXiv, abs/1312.5602, 2013.

Junhyuk Oh, Matteo Hessel, Wojciech Marian Czarnecki, Zhongwen Xu, Hado van Hasselt, Satinder Singh, and David Silver. Discovering reinforcement learning algorithms. arXiv, abs/2007.08794, 2020.

Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018.

Thomas Pierrot, Guillaume Ligner, Scott Reed, Olivier Sigaud, Nicolas Perrin, Alexandre Laterre, David Kas, Karim Beguir, and Nando de Freitas. Learning compositional neural programs with recursive tree search and planning. NeurIPS, 2019.

Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 2902–2911. JMLR.org, 2017.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. In AAAI, volume abs/1802.01548, 2019.

Esteban Real, Chen Liang, David So, and Quoc V. Le. AutoML-Zero: Evolving machine learning algorithms from scratch. In ICML, volume abs/2003.03384, 2020.

Scott Reed and Nando de Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.

Juergen Schmidhuber. Evolutionary principles in self-referential learning. 1987.

Juergen Schmidhuber. A self-referential weight matrix. 1993.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

D. So, Chen Liang, and Quoc V. Le. The evolved transformer. arXiv, abs/1901.11117, 2019.

Xingyou Song, Krzysztof Choromanski, Jack Parker-Holder, Yunhao Tang, Wenbo Gao, Aldo Pacchiano, Tamas Sarlos, Deepali Jain, and Yuxiang Yang. Reinforcement learning with chromatic networks for compact architecture search. 2020.

Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

Yunhao Tang and Krzysztof Choromanski. Online hyper-parameter tuning in off-policy learning via evolutionary strategies, 2020.

L. Trujillo and G. Olague. Synthesis of interest point detectors through genetic programming. In GECCO '06, 2006.

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2015.

Nino Vieillard, Olivier Pietquin, and M. Geist. Munchausen reinforcement learning. arXiv, abs/2007.14430, 2020.

Jane X. Wang, Zeb Kurth-Nelson, Hubert Soyer, Joel Z. Leibo, Dhruva Tirumala, Rémi Munos, Charles Blundell, D. Kumaran, and Matt M. Botvinick. Learning to reinforcement learn. arXiv, abs/1611.05763, 2017.

S. Whiteson and P. Stone. Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7:877–917, 2006.

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2016.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018.
A SEARCH LANGUAGE DETAILS

Inputs and outputs of nodes in the computational graph have data types, which include state S, action Z, float R, list List[X], probability P, and vector V. The symbol X indicates that the type can be any of S, R, or V. We assume that vectors are of fixed length 32 and actions are integers. Operations broadcast, so that, for example, adding a float variable to a state variable results in the float being added to each element of the state. This typing allows the learned program to be domain-agnostic. The full list of operators is given below.

| Operation | Input Types | Output Type |
|---|---|---|
| Add | X, X | X |
| Subtract | X, X | X |
| Max | X, X | X |
| Min | X, X | X |
| DotProduct | X, X | R |
| Div | X, X | X |
| L2Distance | X, X | R |
| MaxList | List[R] | R |
| MinList | List[R] | R |
| ArgMaxList | List[R] | Z |
| SelectList | List[X], Z | X |
| MeanList | List[X] | X |
| VarianceList | List[X] | X |
| Log | X | X |
| Exp | X | X |
| Abs | X | X |
| (C)NN: S → List[R] | S | List[R] |
| (C)NN: S → R | S | R |
| (C)NN: V → V | V | V |
| Softmax | List[R] | P |
| KLDiv | P, P | R |
| Entropy | P | R |
| Constant | – | 1, 0.5, 0.2, 0.1, 0.01 |
| MultiplyTenth | X | X |
| Normal(0, 1) | – | R |
| Uniform(0, 1) | – | R |

B TRAINING DETAILS

We describe the training details and hyperparameters used. For all environments we use the Adam optimizer with a learning rate of 0.0001.

Common RL training details. All neural networks are MLPs of size (256, 256) with ReLU activations. For optimizing the Q-function parameters we use the Adam optimizer with a learning rate of 0.0001. The target update period is 100. These settings are used for all training and test environments.

Classical control environments. The value of ϵ is decayed linearly from 1 to 0.05 over 1000 steps. CartPole, Acrobot, and MountainCar are trained for 400 episodes, and LunarLander is trained for 1000 episodes.

MiniGrid environments. The value of ϵ is decayed linearly from 1 to 0.05 over 10^5 steps. During meta-training, MiniGrid environments are trained for 5 × 10^5 steps.

Atari environments. We use the same neural network architecture as in Mnih et al. (2013). The target update period is 1,000. The value of ϵ is decayed linearly from 1 to 0.1 over 10^6 steps. For evaluation, we use the no-op start condition as in Mnih et al. (2013), where the agent outputs the no-op action for x steps, with x a random integer drawn from [1, 30]. The evaluation policy uses an ϵ of 0.001 and is evaluated every 10^6 steps for 100 episodes. The best training snapshot is reported.
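As a small illustration of the schedules in this appendix, the sketch below implements the linear ϵ decay and ϵ-greedy action selection; the function names and the random-number handling are illustrative, not the paper's code.

```python
import numpy as np

# Linear epsilon decay and epsilon-greedy action selection as described in
# Appendix B (decay over 1e3 / 1e5 / 1e6 steps depending on the environment
# family, with an end value of 0.05 or 0.1). Function names are illustrative.

def epsilon_at(step, decay_steps, eps_start=1.0, eps_end=0.05):
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy_action(q_values, step, decay_steps, rng=None):
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon_at(step, decay_steps):
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action
```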
C ENVIRONMENT DETAILS

We describe the classical control environments below. CartPole and LunarLander are dense-reward tasks, while Acrobot and MountainCar are sparse-reward tasks.

| Task ID | Description |
|---|---|
| CartPole-v0 | The agent must balance a pole on top of a cart by applying a force of +1 or −1 to the cart. A reward of +1 is provided for each timestep the pole remains upright. |
| LunarLander-v2 | The agent controls a lander by firing one of four thrusters and must land it on the landing pad. |
| Acrobot-v1 | The goal is to swing a 2-link system upright to a given height by applying +1, 0, or −1 torque on the joint between the two links. |
| MountainCar-v0 | The goal is to drive up the mountain on the right by first driving back and forth to build up momentum. |

We describe the MiniGrid environments below. The input to the agent is a fully observed grid, encoded as an N×N×3 array where N is the grid size. The 1st channel contains the index of the object type at that location (out of 11 possible objects), the 2nd channel contains the color of the object (out of 6 possible colors), and the 3rd channel contains the orientation of the agent out of 4 cardinal directions. This encoding is then flattened and fed into an MLP. There are 7 possible actions (turn left, turn right, forward, pickup, drop, toggle, done). Unless stated otherwise, all tasks are sparse-reward tasks with a reward of 1 for completing the task. Max steps is set to 100. A size such as 5x5 in the environment name refers to a grid with width and height of 5 cells.

| Task ID | Description |
|---|---|
| KeyCorridorS3R1-v0 | The agent has to find a key hidden in one room and then use it to pick up an object behind a locked door in another room. This tests sequential subgoal completion. |
| LavaGapS5-v0 | The agent has to reach the green goal square without touching the lava, which terminates the episode with zero reward. This tests safety and safe exploration. |
| MultiRoom-N2-S4-v0 | The agent must open a door to get to the green goal square in the next room. |
| SimpleCrossingS9N1-v0 | The agent has to reach the green goal square on the other corner of the room and navigate around walls. |
| Empty-v0 | The agent has to reach the green goal square in an empty room. |
| Empty-Random-v0 | The agent has to reach the green goal square in an empty room but is initialized at a random location. |
| Dynamic-Obstacles-v0 | The agent has to reach the green goal square without colliding with any blue obstacles, which move around randomly. If the agent collides with an obstacle, it receives a reward of −1 and the episode terminates. |
| FourRooms-v0 | The agent must navigate a maze composed of four rooms. Both the agent and the goal square are randomly placed in any of the four rooms. |
| DoorKey-v0 | The agent must pick up a key to unlock a door to enter another room and get to the green goal square. |

D GRAPH DISTRIBUTION ANALYSIS

We look at the distribution of top-performing graphs and find similarities in their structure. This is summarized in Table 3, where we give the equations of learned algorithms of differing rank (if sorted by score). The best-performing algorithms from the experiment which learned DQNReg are all variants of adding Q(s_t, a_t) to the standard TD loss in some form, $\delta^2 + k \cdot Q(s_t, a_t)$. We think this kind of loss could use further investigation, and while we did not tune the value of k, it could also be tuned per environment. In Figure 3b, we show the distribution of scores for all non-duplicate programs that have been evaluated. We provide a full list of top-performing algorithms from a few of our experiments at https://github.com/jcoreyes/evolvingrl.

| Raw Equation | Simplified Equation | Score | Rank |
|---|---|---|---|
| $\delta^2 + 0.1\,Q(s_t,a_t) + r_t - (\gamma Q_{targ} - 0.1\,Q(s_t,a_t))$ | $\delta^2 + 0.2\,Q(s_t,a_t)$ | 3.905 | 2 |
| $\delta^2 + 0.1\,Q(s_t,a_t) - \gamma + Q_{targ}$ | $\delta^2 + 0.1\,Q(s_t,a_t)$ | 3.904 | 3 |
| $\delta^2 - (\gamma Q_{targ} - 0.1\,Q(s_t,a_t))$ | $\delta^2 + 0.1\,Q(s_t,a_t)$ | 3.903 | 4 |
| $\delta^2 + Q_{targ} + 0.1\,Q(s_t,a_t) - \gamma$ | $\delta^2 + 0.1\,Q(s_t,a_t)$ | 3.902 | 5 |
| $\delta^2 - (0.1\,Q(s_t,a_t) - Y_t)^2$ | $\delta^2 - (0.1\,Q(s_t,a_t) - Y_t)^2$ | 3.898 | 6 |
| $\delta^2 + ((r_t + \gamma Q_{targ} + Q(s_t,a_t)) - (\gamma \max(\gamma, 0.1\,Q(s_t,a_t)) - \gamma Q_{targ} - 0.1\,Q(s_t,a_t)))$ | NA | 3.846 | 11146 |
| $\delta^2 + (\delta^2 + 0.1\,Q(s_t,a_t))^2$ | NA | 3.65 | 12146 |
| $\delta^2 + Q(s_t,a_t)$ | $\delta^2 + Q(s_t,a_t)$ | 2.8 | 12446 |
| $\delta^2$ | $\delta^2$ | 2.28 | 13246 |

Table 3: Other programs learned in the experiment which learned DQNReg (rank 1, with score 3.907). Rank is determined by sorting scores in decreasing order.
Score is the sum of normalized RL training performance across four environments. The simplified equations contain only the terms relevant for minimizing the equation output. $Q_{targ}$ refers to $\max_a Q_{targ}(s_{t+1}, a)$.

E REPEATABILITY OF META-TRAINING

In Figure 7, we plot the meta-training performance for bootstrapping from DQN with four training environments (CartPole, KeyCorridorS3R1, Dynamic-Obstacles-6x6, DoorKey-5x5) over ten trials. Four out of the ten trials reach the maximum training performance. Two of these trials learn the same algorithm, DQNReg, while the other two top trials find different, less interpretable algorithms.

Figure 7: Meta-training performance for bootstrapping on 4 training environments over 10 random seeds.