# empirical_design_in_reinforcement_learning__3a433e08.pdf

Journal of Machine Learning Research 25 (2024) 1-63 Submitted 2/23; Revised10/24; Published 10/24

Empirical Design in Reinforcement Learning

Andrew Patterson ap3@ualberta.ca Samuel Neumann sfneuman@ualberta.ca Martha White whitem@ualberta.ca Adam White amw8@ualberta.ca Department of Computing Science and Alberta Machine Intelligence Institute (Amii) University of Alberta, Edmonton, Canada Canada CIFAR AI Chair

Editor: George Konidaris

Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and at times signiﬁcant computational resources. While compute resources available per dollar have continued to grow rapidly, so have the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conﬂict with the need for statistical evidence, especially when comparing algorithms. Recent studies have highlighted how popular algorithms are sensitive to hyperparameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyperparameters and experimenter bias. Throughout we highlight common mistakes found in the literature and the statistical consequences of those in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design.

Keywords: Reinforcement Learning, Empirical Methodology

1. Reinforcement Learning, an Empirical Science

Running a good experiment in reinforcement learning is diﬃcult. There are many decisions to be made: How long should you run your agent? Should you count the number of episodes or number of steps? Should performance be measured online or oﬄine with test trials? How should you measure and aggregate performance? Do we use rules of thumb to set hyperparameters or some systematic search? What are the right baseline algorithms to compare against? Which environments should you use? What does good learning even look like in a given environment? The answer to each question can greatly impact the credibility and utility of the result, ranging from insightful to down-right misleading.

c 2024 Patterson, Neumann, White, White.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0183.html.

Patterson, Neumann, White, White

The task of evaluating a reinforcement learning agent is complicated by the fundamental aspect that makes the problem interesting: an agent interacting with an environment. Unlike supervised learning, reinforcement learning experiments are online and interactive. The agent a program generates its own training data by interacting with the environment another program and the quality of the data depends on what the agent learned previously. This interaction makes fair comparisons and scientiﬁc reproducibility major challenges in reinforcement learning. Many of the ideas from classical machine learning such as train and test splits, overﬁtting, cross-validation, and model selection are either diﬀerent or non-existent in reinforcement learning. It is not surprising that the community is currently wrestling with the consequences of limited reproducibility, experimenter bias, unreliable algorithms, and exaggerated performance claims. The ﬁeld of reinforcement learning is experiencing rapid growth and many of the issues we see today are expected of a growing ﬁeld. Historically, the community was much smaller than other branches of machine learning and the scale of most experiments was limited: a large scale experiment may have consisted of a half dozen state dimensions and thousands of episodes. Before 2014 (with the development of DQN), researchers were likely to begin by replicating experiments from the Sutton and Barto textbook (2018) and then extending and innovating from there. This was perhaps serendipitous as Sutton and Barto spent decades reﬁning their experimental insights: learning from animal learning experiments and from teaching the textbook year after year. Most researchers were starting from an excellent empirical foundation. Historically, large scale experiments in reinforcement learning such as TD-Gammon (Tesauro, 1995) and work in robotics were demonstrations highlighting what was possible with reinforcement learning, without attempting to make strong scientiﬁc claims. This document makes a distinction between scientiﬁc studies in reinforcement learning and demonstrations of (impressive) engineered systems. Both play an important role in reinforcement learning research, but can be detrimental when conﬂated. Demonstrations can be seen as exploratory science, probing the edges of what is known or demonstrating the capabilities of existing algorithms. Scientiﬁc studies, on the other hand, aim to obtain a deeper understanding of our systems and algorithms; typically by posing clear and falsiﬁable hypotheses and controlling for confounding eﬀects. The aim should not be to show an algorithm is good, but rather understand an algorithm s properties, potentially relative to other algorithms.1

Given the immense growth of the ﬁeld, there is an emerging need to more clearly articulate and even develop better empirical practices in reinforcement learning. There is certainly a greater variety of researchers operating in reinforcement learning currently. Many come from other ﬁelds of machine learning and neuroscience, bringing with them diﬀerent expectations, practices, and rules of thumb. Much of the widely bemoaned poor empirical practices (Henderson et al., 2018; Jordan et al., 2020; Agarwal et al., 2021; Colas et al., 2018) could be due to mistakenly applying practices common in other communities. In this document we aim to provide a cookbook or how-to guide for running good experiments in reinforcement learning. We will walk the budding reinforcement learning

1. Of course, an important part of understanding is also theoretical analysis. Experiments and analysis go hand-in-hand, in that they both aim to provide understanding, using diﬀerent tools.This document is focused on good experiments, but does not suggest that this is the only route to understanding our algorithms.

Empirical Design in Reinforcement Learning

empiricist through important design decisions, common mistakes, and hidden biases. We will provide numerical examples of the consequences of bad decisions and illustrate what clear results and fair comparisons look like. In some cases we convey rules of thumb and sources of bias hard learned over decades of experience in the ﬁeld. Naturally, we can never cover all the key decisions and the list of bad practice will be incomplete and ever-growing. Regardless, our ambition is to provide (1) a reference on how to run good experiments in reinforcement learning for those new to the ﬁeld, and (2) additional insight and examples so that this cookbook may be useful to the seasoned researcher as well. We begin by discussing the complexity of a ﬁrst experiment one might run: evaluating a single agent on a single environment in Section 2. The section introduces key concepts about what to measure and how to aggregate performance in Section 2.1. Then we get more technical, looking at sources of variability in our experiments (Section 2.2), and how to make statistically signiﬁcant claims (Sections 2.4, 2.5). In Section 3 we address how to deal with hyperparameters in our experiments. After that we move to comparing multiple agents, and the additional nuances that arise in Section 4. In Section 5 we discuss some considerations when selecting environments for your experiments. In Section 6, we attempt to recreate a previous result and demonstrate how to use the strategies explained here to improve on the previous experiment design. We conclude with a summary of common errors made in experiments in Section 7.

Compare agents

Experiment details: select perf. measure & length

of exp. (e.g., episodic return

& 20k steps)

Verify: best A & B corresponds to

boundaries of

Construct hyper sets

(i.e., Gridsearch, Random)

Run experiment: A & B for each hyper,

Run experiment: A & B with hypers,

>=15 times each

Verify: shaded regions

large or overlapping?

Compute statistics

e.g., mean learning curve & Bootstrap CI

increase # of runs

Reasonable comparison of A & B

Hypers for A & B

{2i : i 2 { 8, . . . , 1}}

Figure 1: Two-stage approach to comparing two algorithms. The goal of this experimental workﬂow is to progress left to right. Good choices in the colored boxes should limit the experiment rerunning sometimes forced by the yellow decision diamonds. The same basic process is used with automatic hyperparameter optimization algorithms (HOA), except the HOA chooses what hyperparameters to test from a speciﬁed range (instead of a set) and sometimes early stopping is used (during the ﬁrst red Run experiment box). You should still use multiple runs to evaluate the hyperparameters within the HOA and after it is done you should check if the best hyperparameters are at the edge of the ranges. From there, one would proceed with Run experiment second red box and continue with the workﬂow. Each stage of the work ﬂow is discussed in detail in the text; we link each for convenience. Section 2: discusses how to decide on key experiment details (in 2.7), the basics on how to run an experiment (in 2.1), plot learning curves and conﬁdence intervals (in 2.5), and decide if we need more runs (in 2.6). Section 3: discusses how to construct hyperparameter sets, and how to determine if you need to expand the hyperparameter set (in 3.1). Section 4.4: discusses ways to compare two (or more) algorithms, including how to detect if the changes you have made to a baseline signiﬁcantly improve performance.

Patterson, Neumann, White, White

With such a large document, it is often useful to provide a high-level summary. Figure 1 provides a ﬂowchart visualizing an experimental procedure with references to associated sections. Speciﬁcally, we visualize a common two-stage procedure which represents sensible empirical workﬂow and should lead to reasonable conclusions. This document is educational, but does in fact contain a variety of new results. These results are used to support methodological proposals throughout this work. For easier reference, we list those novel ﬁndings in Appendix A. We also include a summary list of common errors and pitfalls at the end of this document, in Section 7.

2. Observational experiments

We start with a simple observational study of a single agent interacting with its environment. Because much experimentation and observation in reinforcement learning occurs in deterministic computer simulations, we as researchers have a far greater degree of control in the design of our experiments than most natural sciences. This control can often become distracting and cause us to lose sight of our original goals as empiricists. As such, throughout this paper we will reference parallels to other ﬁelds of empirical science in order to provide grounding for our empirical practices and to build intuition.

Imagine an animal learning laboratory where we study how quickly rats can learn to navigate a maze. We place an individual rat in the maze and record some demographic details; its height, weight, age, sex, and so on. We then observe the rat s behavior as it begins to explore its environment and learns to obtain a block of cheese at the end of the maze. Throughout, we measure time-to-completion, the number of wrong turns taken, and some qualitative measures such as the perceived frustration of the rat throughout its learning process. It is clear that some parameters of the individual rat will play a role in the time it takes for the rat to complete the maze. Highly ﬁt rats will likely complete the maze faster than particularly lazy rats, though the degree of inﬂuence is complex and unknown. Further, these rats are all individuals of the same subspecies and diﬀerences between the individuals are unknown to the researcher (i.e., the researcher does not observe the genetic make-up of each rat, nor does the researcher know each rat s detailed history of experiences before entering the lab). However, some elements of these rats are expected to remain consistent because they belong to the same subspecies for example, this subspecies is known to be generally smaller and more docile than other subspecies of domesticated rats.

The ﬁeld of reinforcement learning shares many similar motivations in both how and why we conduct experiments as in this example. One such similarity is in how we measure an agent s performance. However, this is only the ﬁrst level of analysis. The goal in animal learning is not to understand how quickly rats can complete mazes, but rather to use the maze task as a platform to probe the rat s ability to learn and explore. The second analogy is in how we obtain repeated trials. We can describe several characteristics of a particular subspecies of rat one subspecies is more docile, another generally larger and these characteristics inform us about general behavior of that subspecies. However, within a subspecies there is still variation across individuals. For example, although

Empirical Design in Reinforcement Learning

lab rats are generally docile, we might still encounter a particularly ornery rat. To understand and characterize an entire subspecies, the researcher often averages over these individuals. Likewise, we are typically interested in characterizing the performance of an algorithm, which itself will be used to instantiate many individual agents. Let us establish some terminology before diving into our ﬁrst set of experiments. Throughout this paper, we use the term agent to refer to a single entity interacting with its environment; analogous to an individual rat in a maze. We use the term algorithm to refer to a process which produces a set of agents, both by specifying initial conditions such as the initial weights of a neural network, and by specifying the learning rules by which the agent adapts to observations. Algorithms typically have conﬁguration parameters often called hyperparameters which modify the set of individuals produced by the algorithm. We call an algorithm with a particular hyperparameter conﬁguration a fully-speciﬁed algorithm (Jordan et al., 2020), which is analogous to a particular subspecies of rat such as Rattus norvegicus domestica, the common lab rat. Finally, an unspeciﬁed algorithm refers to an algorithm where some (or all) hyperparameters have not been conﬁgured. This is analogous to a species of rat, such as Rattus norvegicus.

2.1 Experiment one: a demonstration

The ﬁrst step towards designing an eﬀective experiment is identifying the scientiﬁc question (or hypothesis) of interest. In this ﬁrst section, we will focus on the simpler observational studies which typically seek to answer questions of the form How well does algorithm A perform on environment E? We will speciﬁcally investigate the Expected SARSA (or E-Sarsa) algorithm on a simple maze gridworld environment. Details of the learning algorithm and environment can be found in Appendix B. For now, it is suﬃcient to know that individual agents have suﬃcient learning capacity to eﬃciently ﬁnd a near optimal policy in this environment. Now that we know which agents we want to observe and on what environment, there are still many design decisions to be made. For example: (1) how many time steps will each trial or run contain? (2) if a terminal state is not reached after n steps, will we artiﬁcially terminate the episode? (3) if the task is episodic, how many episodes should we run? In general these choices should be made based on what you want to show. We will illustrate our thought process with an example experiment. Let us start with a common and simple case: running E-Sarsa on our maze for a ﬁxed budget of time-steps measuring online performance. The goal in this problem setting is to get high episodic returns, so the performance we report is the discounted return for an episode.2

We use a ﬁxed budget of agent-environment interactions, meaning we run E-Sarsa on the maze for n steps. If n is large enough, then the agent will reach a terminal state and begin a new episode many times. With a ﬁxed budget of steps, the agent will complete a variable number of episodes in n steps depending on how good the agent in this problem setting. We use a ﬁxed budget of steps to better reﬂect that we care about online performance, where each sample matters. It lets us ask: after n steps of environment interaction, how good

2. Many deep reinforcement learning algorithms are deployed with discounting even in undiscounted episodic cost-to-goal tasks. When this choice is made, it should be considered part of the internal mechanics of the agent: a hyperparameter inside the algorithm that induces additional discounting. This additional discount is not part of the environment and thus the performance metric should be the undiscounted return.

Patterson, Neumann, White, White

Figure 2: A single run of an Expected Sarsa agent on a simple, tabular maze environment. The return rate for this agent is M = 0.827. The agent has near optimal performance near the end of the curve, as it reliably reaches the goal in 15 to 17 steps, with the return hovering around 0.9917 = 0.84 to 0.9915 = 0.86.

is the agent s policy? If instead we use a ﬁxed number of episodes, then some agents can get much more experience for learning. For example, a model-based agent could thoroughly explore in the ﬁrst episode to learn its model, taking 9999 steps, whereas a model-free agent might ﬁnd the goal in 100 steps. On the next episode, the model-based agent might already have a near optimal policy, looking like it learned much faster. In contrast, if we had considered the number of environment steps, then we might not ﬁnd a big diﬀerence. It is not clear that our algorithms are designed to actually do this, but nonetheless we should measure what best reﬂects the goals of our experiment. One secondary advantage of using a budget is that it avoids highly variable runtimes. Under a ﬁxed budget of episodes, poorly performing agents can have very long episodes, resulting in much longer runtimes. See Machado et al. (2018) for further discussion on this in the context of Atari. Now let us consider how we can plot the learning curve for this agent, to see both how quickly it is learning as well as observe its ﬁnal performance at the end of learning. To plot the return at time step t, we use the return for the current episode Gj that began on timestep j 0, Gj .= Rj+1 + γRj+2 + . . . + γT 1Rj+T ,

where T is the (variable) length of the episode and j t j + T. The learning curve will be a piecewise linear step function, where the plotted performance will be the same for every step of the episode. For this particular environment, the rewards in this return are all zero except Rj+T = 1, making Gj = γT 1. Figure 2 shows the result of our ﬁrst simple experiment. We plot the return per step of our E-Sarsa Agent on the maze. The curve starts just below 0.2, because the agent randomly found the goal early in learning in under 200 steps (notice γ200 = 0.13). Then it is ﬂat at zero for some time as its next episode takes so long that the return is eﬀectively zero. The longer the ﬂat portion, the longer it took the agent to complete the episode. The curve has a

Empirical Design in Reinforcement Learning

step-proﬁle because of the way we plot episodic return versus steps and because the data only reﬂects the performance of a single run of the experiment. It will also be useful to be able to summarize this performance over time. In this case, because we care about online performance and so how much reward the agent receives while learning it is natural to report the average over the points in the curve, or the sum (often called the area under the curve). We call this average over the learning curve the return rate. Such summary or aggregate performance numbers M are particularly useful when we want to compare agents (Section 4) or reason about quality of hyperparameters (Section 3).

Other measures of performance and other aggregation functions for our learning curves can be considered. For simplicity in this work, we largely focus on measuring online episodic returns with this simple averaging aggregation. We discuss other choices in more depth in Appendix F.

We can see that this agent reached reasonable performance (near optimal) in this maze environment, though it is a bit hard to say if it learned quickly because we have no comparators. But, in any case, this was only one run! Perhaps we just got lucky? We need more independent evaluation of the agent to better characterize the performance.

2.2 Experiment two: characterizing variations in performance

Our ﬁrst result was demonstrative in nature, however, most of the time we are interested in results that capture the reliability of our algorithms. Why would our algorithms perform diﬀerently if we ran them more than once? There are two primary sources of variation across agents produced by an algorithm; think nature versus nurture. The ﬁrst source of variation we encounter occurs when we initialize a particular agent. Often, our function approximators particularly neural networks are randomly initialized, causing diﬀerences between agents even before data is observed. The second source of variation comes from diﬀerences in the stream of data that a particular agent observes throughout its life (e.g., due to stochasticity in the environment or diﬀerences in actions selected by diﬀerent agents). In Figure 3a we plot the best and worst performance of the E-Sarsa agent in the Simple Maze environment, highlighting a large variation in performance. The right side Figure 3b shows data from a real animal learning experiment: 10 diﬀerent rats running a water maze on 10 diﬀerent days.3 It is interesting how much variation we see in the rat data. It should not be too surprising to us that learning agents, even using the same algorithm (same species), could exhibit quite a bit of variability. Our eﬀorts as reinforcement learning empiricists are not that diﬀerent than animal learning researchers and thus the motivations for repeated trials is also similar. We do not make claims about the abilities of a particular species of rat from an individual run through a maze. Each rat will be slightly diﬀerent, due to random genetic factors and raising. Each time an individual is run through the maze things will be slightly diﬀerent: lighting, humidity,

3. It is common to exclude the data from poorly performing animals (after several retries). This practice has been adopted because scientists have identiﬁed genes that cause poor performance on particular tasks Vorhees and Williams (2006). In reinforcement learning, we have no such prior knowledge which easily justiﬁes removing data; dropping outliers should be done with caution.

Patterson, Neumann, White, White

0 20000 40000 60000 80000 100000 Steps

Discounted return

(a) Performance of reinforcement learning agents in a maze.

0 20 40 60 80 100 120 Episodes

Steps to goal

(b) Performance of real rats in a water maze

Figure 3: Understanding variability in agents. (a) 30 individual E-Sarsa agents on a simple, tabular maze environment. The thick black line shows the mean over individual agents over time. (b) Raw data from 10 rats running a water maze.

the way the researcher holds the rat, etc. This maps pretty well onto the diﬀerences we see in agent initialization (e.g., neural network initialization) and environment variability (perhaps simulated by sticky actions in Atari (Machado et al., 2018) or real-world issues like motor actuations).

2.3 Distributions matter!

The variability in performance across agents gives rise to a distribution over performance for our algorithm. If we run an algorithm on a single environment multiple times, then we are repeatedly sampling from a distribution: P(M) where M is the aggregate performance for a run. The stochasticity comes from changing the algorithm s random seed for initialization and decision-making, and by changing the environment s random seed for simulated noise and start states. If P is skewed or bimodal, then we might need more than a dozen independent runs to estimate the mean and variance. Worse, these simple sample statistics might be misleading. In practice P(M) can be skewed or multi-modal, as we can wee in Figure 4. The sample mean (blue solid line) suggests the average agent would achieve a return of approximately -425. However, a more complete description would be: any randomly selected agent is most likely going to receive a return of -450, while some agents will receive a return of -250.

2.4 Reporting variability in performance

The performance distribution above provides a rich performance summary of many individual agents, compared to simply averaging learning curves. These performance distributions, however, often require an infeasible amount of data to construct. Instead, we can consider other descriptive summary statistics of this distribution beyond the mean. One such pair of summary statistics are upper and lower percentiles, (a, b), that reﬂect the range of performance. Using percentiles, we can describe the range of performance values

Empirical Design in Reinforcement Learning

Figure 4: Performance distribution P(M) of DQN on Mountain Car

with stepsize=2 9. The performance M is the ﬁnal performance of

the agent after 100,000 steps of learning. The density around values for M represents the probability of an experimental trial yielding

a given outcome. For instance, if we run DQN for a single random seed we will most likely observe an agent that achieves a return of

approximately -450 by the end of learning. This density is estimated

using 1000 independent DQN agents using the gaussian_kde kernel

density estimation function in scipy.

that, say, 90% of observed agents achieve. However, because we are not observing all possible future agents, we have some uncertainty whether this range accurately reﬂects 90% of all possible agents produced by this algorithm. Tolerance intervals provide a distribution-agnostic way to summarize the range of an algorithm s performance while taking into account this uncertainty. Like a conﬁdence interval, we specify an acceptable level of uncertainty, 1 α (e.g. α = 0.05), then, unlike a conﬁdence interval, we specify a proportion of individual agents captured by the range (e.g. β = 0.9). Constructing a (α, β)-tolerance interval is rather straightforward; essentially, you compute the empirical upper and lower (1 β)/2 percentiles, then widen the interval based on the number of samples used. Many standard computing packages include support for tolerance intervals and we provide more details in Appendix C. Let us return to our experiment with DQN on Mountain Car and examine the variability. We ran DQN for 50 runs, varying the seed for the agent and environment together. We plot mean performance with an (α = 0.05, β = 0.9) tolerance interval in Figure 5a. The range in the plot is created by computing tolerance intervals for the 50 values at each time step t.

0 40k 80k 120k 160k 200k

(a) β = 0.9 with mean

0 40k 80k 120k 160k 200k

(b) β = 0.7 with mean

0 40k 80k 120k 160k 200k

(c) β = 0.7 with median Figure 5: Tolerance intervals for the discounted return of DQN on Mountain Car, over 50 runs with α = 0.05. Recall β speciﬁes the percentage of the distribution considered for the tolerance interval. (a) Tolerance interval with β = 0.9, with mean performance. (b) Tolerance interval with β = 0.7, with mean performance notice the later is tighter. (c) Tolerance interval with β = 0.7, with median performance.

Finally, we might also want to visualize the performance of the median agent instead of the mean performance over agents. The mean performance does not correspond to any of the agents in the 50 runs. It might be useful to know how a representative (median)

Patterson, Neumann, White, White

agent performed. We visualize this in Figure 5c. We take the average episodic return over learning in each run, to get M1, M2, . . . , M50, and pick the run j that is the median in M1, M2, . . . , M50. We plot the learning curve for this agent. We can see that this median agent exhibits considerably higher variation per timestep than the mean learning curve over agents this is not uncommon, actually! We also highlight that the tolerance interval is not centered around this median agent. Rather it is centered around the median over all agents taken at each timestep. Standard deviations and tolerance intervals reﬂect the variation in performance. As we get more and more runs, the sample standard deviation and tolerance intervals approach their true values: the true standard deviation and the true probability interval that captures β proportion of the population. They do not shrink to zero with more runs, unlike conﬁdence intervals, which we discuss next.

2.5 Reporting conﬁdence in performance estimates

The majority of results in the reinforcement learning literature report conﬁdence intervals which are very diﬀerent from what we have discussed so far. Mean and variance plots and tolerance intervals attempt to capture the variation in the distribution of performance. As you collect more and more data more runs you get more and more accurate summaries of the variation. A conﬁdence interval on the other hand has a diﬀerent goal: to capture how certain we are in our estimate of some statistic about agent performance. As you collect more and more data we become more and more conﬁdent that our estimate has converged. Conﬁdence intervals allow us to report our uncertainty in the mean estimate, for example. We obtain a conﬁdence interval (l, u) for a given conﬁdence level 1 α, where the interval is wider if we desire a higher conﬁdence: wider for α = 0.01 than α = 0.05. The interpretation is that there is a low likelihood α that the true mean falls outside (l, u). The uncertainty comes from the ﬁnite sample we observe, meaning our interval could have been diﬀerent had we seen a diﬀerent sample (diﬀerent runs). In other words, we can say that at least 1 α percentage of the time with a diﬀerent random sample, our interval would contain the true mean. It is possible that we were unlucky and have the ﬁnite sample where the true mean is not in our interval.4

To make such probabilistic statements, we have to make diﬀerent assumptions about the underlying distribution over performance. The distribution could be Gaussian (normally distributed), but more generally could be any distribution, such as the bimodal, skewed distribution in Figure 4. We cannot know ahead of time exactly what our distribution looks like. But, there is a wealth of literature on selecting appropriate conﬁdence interval approaches (see Japkowicz and Shah (2011) for a good reference).

4. Mean performance across runs is commonly reported in reinforcement learning, and so we discuss appropriate empirical procedures here for estimating it. It is important, however, to reﬂect for yourself on when it is useful or not useful to use this evaluation. As we saw in the previous section, the mean performance is not necessarily reﬂective of any agent. The behavior of each agent can actually be quite erratic, but the mean performance might be reasonably smooth. In some settings, we want each agent to behave reasonably; reporting just the mean does not allow us to see potentially poor or erratic behavior in these agents. In other settings, such as when the environment is highly stochastic and the primary cause of variability, then it might be sensible to report just the mean performance.

Empirical Design in Reinforcement Learning

A reasonable choice is to obtain a conﬁdence interval using the Student t-distribution (Student, 1908). This choice assumes the underlying distribution is approximately Gaussian. A typical recommendation before using this approach is to visualize the empirical distribution over your samples, to see if normality is a reasonable assumption a so-called graphical method.5 For example, we could plot the 50 sampled scalars M1, M2, . . . , M50 and see if they are concentrated around the mean value, or even use a package to plot an empirical distribution. More generically, if we assume we have n samples of performance, the Student t-distribution conﬁdence interval is of the form

M tα,n ˆσ n, M + tα,n ˆσ n

where M def = 1

i=1 Mi and ˆσ2 def = 1 n 1

i=1 (Mi M)2.

The multiplier tα,n depends on the conﬁdence level and the number of samples. For example, for α = 0.05, as the number of samples increases, tα,n gets closer to the typical 1.96 for Gaussian distributions.6 This multiplier can be obtained from the Student t-distribution table, or again computed using standard computing packages. As an example, for α = 0.05, with n = 3 (three samples, two degrees of freedom) we have tα,n = 4.303, for n = 10 we have tα,n = 2.262 and for n = 1000 we have tα,n = 1.962. The conﬁdence interval itself shrinks with more samples, because the standard error term ˆσ/ n goes to zero. In our setting, this means as we get more and more runs, our conﬁdence interval around our mean estimator M shrinks to zero until we can conﬁdently claim that we have an accurate estimate of the mean. In our our third experiment, we compute and plot a conﬁdence interval around the mean estimator instead of a tolerance interval using the same data as in the last section. We can see in Figure 6a that the conﬁdence interval with α = 0.05 is already quite tight, for these 50 runs. This shaded region reﬂects our uncertainty in our estimate of the mean, whereas the tolerance interval in Figure 5a reﬂects the variation around the mean (and so is wider and does not shrink to zero). We can ask how many samples n (or runs) we need before it is reasonable to compute this conﬁdence interval. We can actually obtain a statistically valid conﬁdence interval, even with only two samples! The interval itself will simply be wider, because tα,n will be larger as will the standard error. However, it is diﬃcult to gauge, with only two samples, if it is appropriate to make the assumption that the underlying distribution is Gaussian. You need to obtain enough runs to decide if the Student t-distribution conﬁdence interval is appropriate. In many cases we may not be conﬁdent that our performance distribution is Gaussian, or we might even believe it is not Gaussian. In that setting, a natural alternative to the Student s t conﬁdence interval is to use bootstrap-based statistics to generate conﬁdence intervals. The bootstrap procedure is simple, but not the same as boostrapping in TD

5. Alternatively, one could use a goodness of ﬁt test like the Kolmogorov Smirnov test or other statistical techniques. Many software packages such as Matlab, R, and SPSS have implementations you can use. 6. Using a Gaussian conﬁdence interval requires knowing the true variance. Because we have to estimate it from data, we actually have uncertainty in this part of our interval as well. Therefore, our interval is actually a bit wider than if we knew the true variance. With more samples, our estimate of the true variance (true standard deviation) becomes accurate and so the multiplier approaches the value of the multiplier we get if we knew the true variance.

Patterson, Neumann, White, White

0 40k 80k 120k 160k 200k

Time (a) α = 0.05 with Student s t

0 40k 80k 120k 160k 200k

(b) α = 0.3 with Student s t

0 40k 80k 120k 160k 200k

(c) α = 0.05 with bootstrap Figure 6: Conﬁdence intervals around the discounted return of DQN on Mountain Car averaged over 30 runs. (a) Student s t-distribution conﬁdence interval with α = 0.05. (b) Student s t-distribution conﬁdence interval with α = 0.3. (c) Bootstrap conﬁdence interval with α = 0.05, with mean performance.

learning. As before, we get n measures of performance for an algorithm, one for each of the n random seeds. We then generate a new dataset by resampling n values from the original dataset with replacement. Finally, we repeat this resampling procedure for a total of m times, with m usually very large (e.g. m = 10, 000). For each of these new datasets, we compute the statistic of interest (e.g. the mean) then measure the variability of that statistic over all m datasets. To create a 95% conﬁdence interval, we report the 0.025 percentile of the estimated statistic over all m estimates as a lower-bound and the 0.975 percentile as an upper-bound. A major advantage of bootstrap-based methods is that they often require very few assumptions about the underlying data. This advantage comes at a cost, however, because bootstrap methods generally require more data points to provide tight conﬁdence intervals. For this reason, our recommendation is to default to bootstrap-based methods for most comparisons, but to check the underlying distributions to see if more powerful methods such as Student t-distribution conﬁdence intervals can apply without breaking assumptions. We visualize the bootstrap conﬁdence interval using the above procedure in Figure 6c.

We did not seem to have to be so careful with tolerance intervals. But there too we had to account for uncertainty in our percentile estimates. The same distributional question arises. If we know we have an underlying Gaussian distribution, then we can get a better estimate of these percentiles with fewer runs. It is common, however, to default to distribution-free tolerance interval calculations, that make few assumptions about the data. We discuss this further in Appendix C.

2.6 Do we really need more runs?

Nobody wants to run their experiment longer than needed. Just increasing the number of runs makes our experiments take longer, and has real environmental costs. Ideally, one would hope that advanced statistical tools can save us. Unfortunately, there is an inherent trade-oﬀbetween making some assumptions and having tighter conﬁdence intervals, and avoiding assumptions and having potentially useless conﬁdence intervals. In this section, we highlight that we may need to do more runs, especially for the types of agents that we currently analyze because they can produce highly skewed performance distributions.

Empirical Design in Reinforcement Learning

(a) Conﬁdence intervals with 10 runs. (b) Conﬁdence intervals with 50 runs.

Figure 7: Conﬁdence intervals about the mean total discounted return of DQN on the Puddle World environment. Long-tail behavior in the performance distribution violates assumptions for the Student s t-distribution, whereas the percentile bootstrap fails when too few samples are used. Increasing the number of samples right subﬁgure (b) can resolve the issue for the percentile bootstrap. All of these intervals should be 95% conﬁdence intervals, except for the standard error. We include the standard error only to highlight that it can be much narrower than the corresponding 95% conﬁdence interval.In this instance, for n = 10, the standard error represents a Student-t conﬁdence interval with α = 0.5 (because tα=0.5,n=10 = 1) and for n = 50, we have α = 0.3 (tα=0.3,n=50 = 1).

Consider the following example. Imagine we have collected 10 runs of the DQN algorithm on the Puddle World environment and we wish to report DQN s average performance. We can compute a conﬁdence interval around this mean estimate, to reﬂect our uncertainty due to only having 10 runs. There are several assumptions we could make, resulting in diﬀerent conﬁdence intervals. We visualize several options in Figure 7. In this synthetic example, we can actually compute the true mean and check: did our conﬁdence intervals capture the mean? We can see the estimated average performance for these 10 runs is far from the true average performance of DQN in this environment. Further, four of the conﬁdence intervals provide overly optimistic ranges and fail to capture our estimate of true mean (from 250 samples). The sample Bernstein conﬁdence interval does reasonably capture the population mean in Figure 7; a natural conclusion might be to prefer such conﬁdence intervals which make minimal distributional assumptions. However, the Bernstein bound is highly conservative meaning provided conﬁdence intervals are incredibly wide even for large numbers of runs. Unfortunately, this inhibits truly understanding whether we have accurately captured the average performance in the case of Puddle World, the conﬁdence region covers a majority of the range of possible means requiring substantially more runs to present statistically meaningful results. In the example above, we required as many as 1000 runs to detect diﬀerences between DQN and an alternative algorithm using the sample Bernstein conﬁdence interval, while the percentile bootstrap required only 30 runs.

Patterson, Neumann, White, White

500 400 300 200 100

Unfortunately, these examples where conﬁdence intervals fail are far from rare. To see why they fail in this instance, we need only look at the performance distribution of DQN on Puddle World (see inset ﬁgure to the left). The distribution is long-tailed with some very low performing runs occurring with low probability (approximately 5% of the time). In this case, all 10 of our samples were high performing causing us to overestimate the average performance while simultaneously underestimating the variation in performance. Note the probability of receiving 10 samples near the right-side mode is approximately 60% not at all uncommon. The only way to resolve this issue is simply to collect more runs of DQN. For this speciﬁc example, we found that approximately 20 runs were suﬃcient to accurately reﬂect the high variation and obtain accurate bootstrap conﬁdence intervals, and 30 runs were suﬃcient to accurately estimate the average performance. This provides a clear example of how dangerous it can be to declare the variance is low based on a few runs. Generally, we may need a large number of runs to compare agents (Jordan et al., 2024), which we discuss more in Section 4.

One proposed solution to allow for a smaller number of runs, knowing that we have such distributions, is to instead report an estimate of the interquartile-mean (IQM) (Agarwal et al., 2021). The IQM takes the mean of the points in the interquartile range, namely between the 0.25 and 0.75 percentiles. This statistic is more robust to outliers, because these potentially larger magnitude values are not included in the mean calculation. The proposed estimate of the IQM is to drop the 25% highest samples (runs) and 25% lowest samples before computing the mean of the remaining 50% of the data (Agarwal et al., 2021).

As with any choice, we need to be cautious about whether the goal was to estimate the IQM, or whether we chose it because it provides a convenient way to reduce the number of runs. In some cases, it is a useful statistic, with the added beneﬁt of being a robust statistic. In other cases, it may not capture key properties of the algorithm that we care about. Take the DQN algorithm as example. We observed low-probability catastrophic failure events for DQN across nearly every environment we tested. In Lunar Lander, some agents would simply ﬂy oﬀinto oblivion, obtaining incredible amounts of negative reward until the episode was mercifully terminated due to episode cutoﬀs. In CliﬀWorld, some DQN agents would get stuck in a corner perpetually in every single episode, never learning to ﬁnd the goal. Even worse, a small subset of agents would learn to always jump into the cliﬀimmediately and obtain massive negative rewards. In this case, removing these outlier agents the agents whose performance do not conform to our pre-existing notions of how DQN should behave on simple environments is not helping to create a clearer picture of our algorithm, as we are simply ignoring its shortcomings.

Generally, IQM is most usefully applied across collections of environments. In Atari, some games are nearly trivial (e.g., Pong) and others are nearly impossible without prior knowledge (e.g., Montezuma s Revenge). Performance on trivial and impossible games can greatly skew performance measures, making it diﬃcult to compare diﬀerent agents. In such cases IQM makes a lot of sense and was the intention behind the origin design (personal communication with authors).

Empirical Design in Reinforcement Learning

(a) Mean performance. (b) Variance over runs. (c) Distribution over runs.

Figure 8: Performance of the EQRC algorithm in Mountain Car for three diﬀerent episode cutoﬀlengths: τ {200, 500, 10000}. Smaller cutoﬀs clip the impact of outlier performance values, causing variance to be signiﬁcantly under-reported and mean performance to be biased toward higher performance.

2.7 Deciding on the length of an experiment

An important empirical choice for understanding your agent is the number of steps of interaction. If you choose a smaller number of steps, then you are evaluating early learning performance. If you choose a larger number of steps, then you are evaluating if the agent can reach near-optimal performance within a reasonable number of samples. If you choose a very large number of steps, then you may be evaluating if your agent can reach near-optimal performance and stably remain at this performance. One way to gauge if the number of steps reﬂects early learning or ﬁnal performance is to examine the learning curve itself. If the curve is still increasing, then likely you are seeing early learning. If it has ﬂattened for many steps, then you are seeing the ﬁnal performance of the agent. To appropriately pick the number, you can run the agent for longer to gauge if it eventually levels oﬀ. Then you can report performance for a smaller number of steps to focus on early learning. An additional choice related to steps of interaction is artiﬁcial episode cutoﬀs. Cutoﬀs are used to prevent an episode from becoming too long. An agent may get stuck, never ﬁnding its way to the goal and so never terminate the episode. A cutoﬀinvolves teleporting the agent after max_steps back to the start state, without entering a terminal state. It is like picking up a robot that got stuck in a corner, and moving it back into a state from which it can learn. For this reason, cutoﬀs can interact with exploration, and in particular very aggressive short cutoﬀs may misrepresent the performance of agents that get frequently stuck. For example, in original implementations of Mountain Car, no early cutoﬀs were used and tile-coded agents would run for a few thousand steps in the ﬁrst episode on average (Sutton and Barto, 2018). In later implementations, an aggressive episode cutoﬀof 200 steps was introduced (Brockman et al., 2016). We show in Figure 8 that performance for an algorithm that uses neural networks called EQRC (Patterson et al., 2022b) is much better with more aggressive cutoﬀs, with much lower variability over runs.7

7. This change in default implementation illustrates the practice of problem adaptation making an easier version of the problem to alleviate challenges in modern algorithms, such as neural networks struggling to learn from ﬂat reward signals.

Patterson, Neumann, White, White

We might choose to run the agent for a ﬁxed number of steps, rather than a ﬁxed number of episodes. In that case, we do not need episode cutoﬀs and experiments will terminate in a reasonable amount of time. Cutoﬀs can, however, be useful for other reasons. The main reason is that, due to stochasticity in environments, many agents may have some interactions that lead the agent to get stuck eﬀectively bad luck. This issue may not only have to do with the agent, but also potentially with the environment. We can set the episode cutoﬀto a large number, that is less than the number of learning steps, to avoid having runs where the agent is stuck in one episode forever. With a large cutoﬀ, we are much less likely to make the problem too easy and can avoid signiﬁcantly skewing the results. But, we obtain some of the reduced variance in runs to facilitate evaluation.

Summary: We conclude each section with a summary of key actionable advice. Here, we summarize key points to consider when investigating an algorithm in one environment, assuming we have already speciﬁed the hyperparameters.

Key insights: evaluating a fully-speciﬁed algorithm

1. A single agent s performance is informative, however, consider plotting the performance of individual runs to gain additional understanding of your algorithm.

2. Agent behavior, with the same algorithm and environment, can be very diﬀerent across random seeds. Consider reporting this variability using sample standard deviations or tolerance intervals, rather than just reporting the mean or median performance.

3. Conﬁdence intervals around the mean primarily tell us about our certainty in our mean estimate, not about the variability in the underlying agents. If you primarily care about understanding mean performance, then report means with conﬁdence intervals. If you care about the behavior of each agent, not just the mean, then consider reporting tolerance intervals.

4. In general we advocate that you do not report standard errors. They are like a low-conﬁdence conﬁdence interval, and it is more sensible to decide on the conﬁdence interval you want to report.

5. The required number of runs depends on your performance distribution, which is unknown to you. It is clear that in almost all cases 5 runs is insuﬃcient to make strong claims, but even 30 runs can be insuﬃcient if the distribution is heavily skewed.

6. Consider reporting performance versus steps of Interaction, rather than performance versus episodes. This choice ensures every agent receives the same number of samples every run. (See Section 2.1)

7. Choose the number of steps of interaction to reﬂect early learning or ability to learn the optimal policy, rather than inheriting what was done in previous work.

8. Aggressive episode cutoﬀs can signiﬁcantly skew results.

There are many things to worry about just to run a good observational study of single agent. But let s gain some perspective here: why should this be easy? It s not easy for our animal learning scientist to conduct and document the study of a few rats in a maze.

Empirical Design in Reinforcement Learning

They have to worry about so many other things that we do not: how to physically handle the animals, if their personal scent impacts the animals, uncontrolled genetic variations, running in real time, extra dull rats, and so on. In contrast, we can carefully enumerate our sources of bias, programatically vary conditions, exercise perfectly repeatable interventions, control almost all relevant sources of variation, perfectly record all relevant information, and run millions of experiments hundreds of times faster than real-time. If we truly want to understand our algorithms and gather suﬃcient evidence for our claims, then we need to take the scientiﬁc enterprise seriously. Science is ﬁrst and foremost about understanding, not picking winners and losers.

3. Dealing with Hyperparameters

Almost all algorithms have hyperparameters. These are scalars that have to be selected by a person before running an experiment. Typical hyperparameters in reinforcement learning include stepsizes and other optimization parameters like momentum and batch sizes; the eligibility trace parameter or the horizon for n-step methods; the target net refresh rate; and even the function approximation architectures used, which themselves can have many diﬀerent hyperparameters (e.g., depth, number of nodes per layer, activation function, etc).8

The possible combinations of hyperparameters can be overwhelming. It is hard enough to properly evaluate an algorithm for a single hyperparameter combination, let alone having to consider this combinatorial space of algorithms. But, we can overcome this panic by stepping back and clarifying the goal of our experiment. There are three typical goals: (1) understand the behavior of an algorithm across settings of the hyperparameters, (2) optimizing hyperparameters to study an algorithm s idealized maximum performance, and (3) simulating a deployment scenario. As a ﬁeld, we generally have more understanding about how to study hyperparameter sensitivity (corresponding to setting 1), though of course empirical design here is also nuanced; we discuss this in Section 3.1. The second setting falls into the category of competitive machine learning. Given the ability to extensively tune the performance of a learning system, what is the maximum capability we can hope to achieve? Such studies are limited to speciﬁc problem settings, often called benchmark problems. Unfortunately, there are multiple challenges that arise in this setting. The primary challenge is statistical: it is hard to estimate the maximum value of a stochastic function, such as the maximum performance of a reinforcement learning system. Another challenge arises in making fair comparison to baselines in the competition, it is diﬃcult to ensure equal tuning eﬀort is given to each competitor learning system. We discuss this issue further in Section 3.2. The ﬁnal setting poses the greatest challenge. Unlike supervised learning, we do not have a general purpose approach to select hyperparameters. In supervised learning, cross-validation can be used with almost any algorithm to select hyperparameters according to generalization performance. It is not obvious how to use cross-validation in reinforcement learning (see a more in-depth discussion in Appendix D.3). An additional challenge comes from the

8. It is reasonable to expect that some of these hyperparameters like any learnable parameters should adapt with time. However, we can consider hyperparameter adaptation as part of the algorithm; our job, as empiricists, is to specify the hyper-hyperparameters for that adaptive algorithm. For this reason, we deﬁne hyperparameters as the initial values set at the beginning of the experiment.

Patterson, Neumann, White, White

variety of use cases for reinforcement learning, such as solving simulated problems versus interacting with the real-world; having access to lots of data but limited compute versus lots of compute and limited data; or settings where taking exploratory actions is safe versus unsafe. Each combination of these (and other) factors will lead to diﬀerent methodologies to select hyperparameters for a reinforcement learning algorithm. Understanding the performance of an algorithm in deployment depends not only on the algorithm and the hyperparameters tested, but also on the deployment scenario itself. We discuss this issue further in Section 3.3.

3.1 Understanding Hyperparameter Sensitivity

The goal of hyperparameter sensitivity analysis is to help us understand our algorithms. This is not about optimizing hyperparameters to support SOTA claims! These insights can help identify serious sensitivities that suggest improvements to the algorithm are needed, they can help us understand changes in behavior as we interpolate across a space of diﬀerent algorithms, or they can help identify hyperparameters which require joint tuning in order to provide good performance. As a classic example, the trace parameter λ in TD(λ) algorithms interpolates between Monte Carlo algorithms as λ 1 and the original TD algorithm as λ 0. A sensitivity study may reveal that λ = 0.9 is an optimal choice for an environment, however, this is only one useful piece of information that we can derive from the study. We may learn that performance becomes highly variable as λ 0 or that λ near 1 diverges. We might additionally learn that the performance suddenly drops oﬀoutside a narrow range of λ suggesting that this algorithm will be diﬃcult to tune in a new environment. In order to evaluate our algorithm with diﬀerent values of a hyperparameter, we need to collect enough data to provide reasonable estimates of our statistic of interest. Say we wish to report the average performance across agents produced by our algorithm for each hyperparameter value, then for every hyperparameter setting we need enough agents to actually estimate that average. This is no diﬀerent than Section 2.1 where we required multiple agents to evaluate an algorithm, except now we are evaluating multiple fully-speciﬁed algorithms; one for each setting of the hyperparameter of interest. In the most basic setting, this means we need N runs for every hyperparameter setting, H, for a total of N H runs. Clearly, this can become expensive quickly!

Dealing with a single hyperparameter. Once we have obtained an estimate of performance for each setting of our hyperparameter, we can summarize the performance of this partially-speciﬁed algorithm as in Figure 9. To create such a plot, we must ﬁrst decide on a range for the hyperparameter then specify how intermediate values are selected within that range. A common range for stepsizes is to use powers of 2, to systematically cover the space. If there is a clear bowl or U-shape to the resulting curve as in Figure 9 then this selection scheme was likely appropriate though the curve is not always U-shaped. If we observe sharp changes in performance like the V-shape in Figure 10 you may need to sample more densely within that region to better understand the range of appropriate values for the hyperparameter. These sharp changes often occur when the initial range of the hyperparameter is too large or when the distribution of tested values are concentrated around a region of poor performance that is you missed the good ones. If the best performance is at the one end of the range like in Figure 11 then this suggests the range was too narrow and may need to be systematically expanded.

Empirical Design in Reinforcement Learning

Now that we have our sensitivity curve, how do we interpret it? If the sensitivity curve is reasonably ﬂat the minimum performance is close to the maximum for a wide range of hyperparameter values, then we might say that this partially-speciﬁed algorithm is insensitive and so it will not be challenging to deﬁne a fully-speciﬁed algorithm for deployment. If the sensitivity curve indicates a large diﬀerence in performance within a narrow region of hyperparameter values, then we would say this partially-speciﬁed algorithm is highly sensitive and conclude that deﬁning a fully-speciﬁed version for deployment might be diﬃcult.

2 11 2 9 2 7

Performance

Figure 9: A good sensitivity curve that captures a wide range of the variable of interest and illustrates that performance changes smoothly as the hyperparameter changes.

2 11 2 9 2 7

Performance

Figure 10: A sensitivity curve where the range of tested values may be too wide, instead of being focused in the region of interest. We lose some information around the peak performance and the algorithm appears quite sensitive. This sensitivity might be an artifact of the plot testing insuﬃciently many values rather than a property of the algorithm.

2 12 2 11 2 10 2 9

Performance

Figure 11: A sensitivity curve where we potentially missed the best performance. The best performing hyperparameter may be outside the range or may be the endpoint of the range, but we cannot tell with the information as available.

Such comprehensive experiments can be expensive, and it can be tempting to test only a small number of hyperparameter settings. However, we always have to ask ourselves if we have compromised our empirical design. There is no point in running a ﬂawed experiment, even if it is more feasible in terms of computation. An alternative choice would be to ask research questions that better match our computational resources. Remember, some empirical questions are simply more challenging to answer than others. For example, it is challenging to convincingly show that an algorithm is insensitive this would require testing a wide-range of values with a large number of intermediate values, possibly with multiple environments.9 Similarly, it is diﬃcult to show empirically that an algorithm is divergent.

9. We could more thoughtfully sample the hyperparameters to get an accurate sensitivity curve. One could imagine an automatic procedure that starts broad, and iteratively samples hyperparameter choices between two where the values are very diﬀerent, or focuses on regions of higher performance where we want the curve to be more accurate (He et al., 2021; Parker-Holder et al., 2022; Eimer et al., 2023).

Patterson, Neumann, White, White

Example: Does TD diverge?

We know that TD can diverge under oﬀ-policy updating. In fact, we can prove that TD always diverges in very speciﬁc scenarios such as Baird s counterexample (Baird, 1995); though such theoretical characterizations may not always be possible and it is in these scenarios that we often rely on empirical evidence. For example, imagine you combine TD s update rule with a momentum term. Does this still diverge on Baird s counterexample? If there is no existing theory about the convergence of TD with momentum, you might turn to experiments to obtain some insights. Likely, we would start with the default hyperparameters for momentum (say setting the momentum term β = 0.9). We ﬁnd that this algorithm diverges! We test again with a diﬀerent random seed and observe divergence; and again for another random seed, and again, and again. . . After many such trials, we might conclude with high conﬁdence that TD with momentum always diverges on this environment. This conclusion, however, is limited by the choice β = 0.9. Does this conclusion hold for β = 0.99 or β = 0.1? To provide evidence that TD with momentum does not converge in this environment regardless of hyperparameter setting we must carefully sweep a dense set of values of β and show divergence for every tested value. Without theory or an inﬁnite number of runs we can never know with absolute certainty that TD with momentum always diverges. It is always possible that some untested random seed presents the data in exactly the right order such that TD with momentum converges, or possibly there is some perfect value of β where TD converges in this environment. With empiricism, we accumulate a body of evidence which supports our claim; the more random seeds we test and the more hyperparameter values we sweep, the more convincing our body of evidence.

Assessing overall hyperparameter sensitivity One of the primary goals of hyperparameter sensitivity studies is to understand how to improve our algorithms and develop those that are easier to tune. In addition to explicitly visualizing performance versus a speciﬁc hyperparameter, we can also attempt to assess overall how sensitive an algorithm is to its hyperparameters. One approach (Jordan et al., 2020) is to treat all hyperparameters as unknown values. We model these unknowns as random variables and draw sample hyperparameter conﬁgurations. For example, we might treat the stepsize, α, as an unknown value and sample α Uniform(0, 1). Each sampled α produces a fully-speciﬁed algorithm, for which we can obtain a measurement of performance. The amount that the performance changes as we change the hyperparameters then provides a measurement for the sensitivity of the under-speciﬁed algorithm on a given problem setting. This procedure is akin to performing a sensitivity study across all hyperparameters simultaneously. Because the space of hyperparameters is combinatorial, exhaustively sweeping across conﬁgurations to obtain a measure of sensitivity is typically impractical. Instead, this approach randomly samples points in the combinatorial space in order to compute the variation in performance over that space. This presents a tradeoﬀ: we can carefully and systematically investigate a small number of hyperparameters at a time, or we can try to investigate the entire space of hyperparameters with much less detail.

Bayesian optimization approaches provide such a strategy to ﬁnd the best hyperparameters, and so the ideas there could potentially be adapted for this goal. Nonetheless, such an algorithm would still require testing a large number of hyperparameters.

Empirical Design in Reinforcement Learning

Dealing with hyperparameters is an active area of research and thus our treatment of it here is necessarily limited. The interested reader can jump to Appendix D.1 for an expanded discussion on handling multiple hyperparameters.

3.2 Reporting Idealized Performance

When we ﬁrst introduce an algorithm, we may want to know how well it can perform, on an environment. If it performs poorly on an environment, even with extensively tuned hyperparameters for that environment, then there may be an issue with the algorithm. Further, the algorithm may be conceptually appealing, but have new hyperparameters that have not yet been well-studied. If an argument can be made that adaptive algorithms could be developed for these new hyperparameters, then it can be appropriate to evaluate the behavior of the algorithm under nearly-optimal hyperparameter settings. Though reporting performance (or behavior) for nearly-optimal hyperparameters is common in reinforcement learning, it has several serious pitfalls that you need to consider if you take this route. The ﬁrst diﬃculty is that estimating the maximum of a stochastic function is challenging and often requires a very large number of samples. In our experiments, this translates to needing many random seeds and signiﬁcant computational resources. The second is that many reinforcement learning algorithms are notoriously sensitive to their hyperparameters and often contain dozens of hyperparameters to tune recent algorithms make use of dozens of hyperparameters, not including those for the function approximator (Adkins et al., 2024). Finally, we have to be careful about comparing two algorithms under this idealized setting where hyperparameters are optimized. If we allow one algorithm to have more hyperparameter settings, then diﬀerences in performance can be due to maximizing over more hyperparameter settings rather than diﬀerences in the algorithm. We discuss these issues in more details below.

Maximization bias. Several statistical challenges present themselves when tuning hyperparameters for maximum performance. One challenge is maximization bias, which is a form of statistical bias that can arise when estimating a quantity of the form maxh E[G | h]. In our case, h represents a hyperparameter conﬁguration and G represents the performance of our algorithm for a given conﬁguration. Because we do not know E[G | h], the average performance of our algorithm for a given hyperparameter conﬁguration, we estimate it with samples, Gh E[G | h]. Then we estimate maxh E[G | h] using maxh Gh. This estimate if prone to maximization bias, because E[maxh Gh] maxh E[ Gh] = maxh E[G | h]. To gain a bit more intuition, consider the following example comparing two hyperparameter conﬁgurations h1, h2. Imagine that E[G | h1] = E[G | h2], but when we estimate their average performance using samples, Gh1 is an overestimate of E[G | h1]. When we maximize over our estimates Ghi, we will be reporting an overestimate for maxh E[G | h] meaning we are overstating the performance of our algorithm. This is similar to maximization bias we see in Q-learning! Now imagine instead of two conﬁgurations that are identical, we instead have 100 identical conﬁgurations, or 1000. The probability that we overestimate at least one conﬁguration increases as we increase the number of conﬁgurations estimated; thus

Patterson, Neumann, White, White

increasing the reported performance for the algorithm, maxh E[G | h]. That is, without changing the properties of the investigated algorithm itself, we can report increasingly higher performance by increasing the number of hyperparameter conﬁgurations we investigate! If this was an algorithm that we are proposing, then we would be doing our readers a disservice by overstating the beneﬁts of the algorithm. If this was a baseline algorithm, we would be doing ourselves a disservice by setting too high of standards and potentially ﬁltering out useful ideas.

Issues with the typical two-stage approach. One common approach to counteract the eﬀects of maximization bias is to use a two-stage methodology. In the ﬁrst stage, the researcher performs an extensive hyperparameter sweep in order to select the maximizing hyperparameters. In the second stage, the researcher then evaluates the best hyperparameter conﬁguration with a new set of random seeds typically far more random seeds than originally used for the selection process. Finally, we report the average performance for the second stage only. This approach provides an unbiased, conﬁdent estimate of performance for that chosen hyperparameter conﬁguration, because we are able to use more runs for that single hyperparameter conﬁguration. However, there are two key drawbacks to this approach. As pointed out by Jordan et al. (2020), this approach is wasteful of compute, because many samples of performance are thrown out during the hyperparameter selection stage. Additionally, this two-stage approach ignores uncertainty during hyperparameter selection. Say, for example, that we use 10 runs for every hyperparameter conﬁguration in stage one. Due to maximization bias, however, we may unintentionally select a suboptimal hyperparameter conﬁguration as being best. Then in the second stage, we will use many runs say 100 to evaluate this hyperparameter conﬁguration. Because we used so many runs in this second stage, our conﬁdence intervals about the mean performance will be tight we will be reasonably certain we have accurately captured the mean performance. This certainty, however, is unwarranted. While we may have accurately estimated the mean for the selected hyperparameter conﬁguration, we likely have not accurately estimated the mean for the best hyperparameter conﬁguration. As a result, this two-stage approach is likely to underestimate the maximum performance and is overconﬁdent in its estimate. Regardless, the two-stage approach is a reasonable place to start and is much better than one-stage tuning or tuning by hand" both common in the literature.

A new approach: bootstrapped two-stage tuning. One way to overcome maximization bias is to repeat the entire experimental procedure multiple times; wrap the two-stage process in an outer loop. After running the sweep over hyperparameters the ﬁrst time, we repeated this procedure a second time using a new set of random seeds. We will almost certainly obtain a new estimate for maximal performance and likely obtain a diﬀerent best hyperparameter conﬁguration that gives that maximal performance. We now have two diﬀerent estimates of maximal performance, giving us a sense of spread how much does this maximizing performance change if we use diﬀerent data to estimate it? Naturally, we can repeat this process many times say 100 and report the mean and spread of the results. Unfortunately, this simple procedure has a major downside: it is incredibly expensive. For every hyperparameter conﬁguration (say H conﬁgurations), we require N runs and we

Empirical Design in Reinforcement Learning

repeat the sweep-then-maximize procedure M times. In total, we run our learning algorithm H M N times in order to report an estimate of tuned performance. Fortunately, we can rely on the principles of bootstrapping and resampling in order to make this procedure much cheaper. For every hyperparameter conﬁguration, we collect N runs of our algorithm. Then, for each conﬁguration, we resample N of those N runs (with replacement) and compute a sample average. We select the maximizing sample average as a single estimate of tuned performance. Then we repeat the above procedure many times, resampling from the same H N dataset each time. Because we are capturing the variance across all hyperparameters, we do not need as many runs per hyperparameter. As a result, we can select a smaller N than we would typically require for sensitivity analysis; say N = 10 instead of N = 30.

Picking hyperparameter sets fairly. We now have a mechanism for measuring idealized performance, but still want to avoid the pitfall where we allow one algorithm to have many more hyperparameter settings than another. We want our results to reﬂect the utility of our algorithm, rather than reﬂecting performance when ﬁtting hyperparameters to a set of environments. Unfortunately, there is no an explicit, precise procedure here to obtain a perfectly fair experiment. The only way to avoid tricking ourselves is to attempt to be as fair as possible and to sincerely make choices that are justiﬁable. At a minimum, you should ensure all algorithms have the same number of hyperparameter settings that are tested. If one algorithm has two hyperparameters, α and β, and another only has η, then you have to sweep n values for η and n values for the cross-product of (α, β). For example, you might test η 2i for i { 6, 5, 4, 3, 2, 1} and test α 2i

for i { 5, 3, 2} and β {0.1, 0.5}, so that they both have 6 hyperparameter settings. If you are using a hyperparameter optimization algorithm, then you can optimization algorithm to ensure fairness. These optimization algorithms typically attempt to estimate the function mapping hyperparameter values to performance for a given agent-environment pair (e.g., using bayesian hyperparameter optimization in Optuna). On each iteration, the optimization algorithm samples a new hyperparameter conﬁguration based on the learned function so far (selecting the one believed to be best, with some exploration) and then runs an reinforcement learning experiment for some number of steps and repeating the experiment with diﬀerent random seeds to obtain a performance estimate. When using such a tool, we can ensure fairness by ensuring the optimization algorithm is run for a ﬁxed number of iterations and the same number of steps and seeds are used in evaluation. Automatic hyperparameter selection is a big topic and there are several great papers you can start with if you want to learn more (He et al., 2021; Parker-Holder et al., 2022; Eimer et al., 2023). We discuss fair comparisons of multiple agents in much more depth in Section 4.

3.3 Evaluating Algorithms for Deployment

This ﬁnal setting evaluates the performance of a reinforcement learning algorithm under some realistic constraints, typically modelling a speciﬁc deployment scenario. For this setting, we need an algorithmic approach to set hyperparameters, or we need hyperparameter-free algorithms. It is not possible to test many hyperparameters in most deployment scenarios.10

10. The one exception is when the ultimate goal is to solve a simulated environment. This is beyond the scope of this document, but addressed elsewhere (Parker-Holder et al., 2022; Eimer et al., 2023). It is

Patterson, Neumann, White, White

For example, if a reinforcement learning algorithm is being used to optimize data center cooling, it can be unrealistic to test even a handful of hyperparameters on the system. Instead, we want to deploy a fully-speciﬁed algorithm. One issue in reinforcement learning is that we do not have a general purpose algorithm for hyperparameter selection, unlike supervised learning. In supervised learning, we can use internal cross-validation to select hyperparameters: the best hyperparameters are selected by separating the data into training and validation, and using validation performance as a measure of generalization performance under those hyperparameters. We have no such equivalent procedure for reinforcement learning. We discuss why this is the case in Appendix D.3, and propose potential options for developing such algorithms for reinforcement learning. This ﬁnal setting is an open question in the reinforcement learning community. A common approach today is to use the default hyperparameters speciﬁed in released code-bases. Using these defaults is not unreasonable, if the goal is to compare two systems. In an empirical study, you may want to understand the performance of two code-bases across a variety of diﬀerent environments. This experiment is not about comparing the algorithms underlying those systems, but rather the systems themselves. It could help a practitioner decide which of the available code-bases might be more suitable for their application. Of course, the default hyperparameters in the code-base were likely set on a small set of simulation environments; so we should be cautious about how well the system will perform in deployment. Another strategy has been to tune hyperparameters on a subset of environments, and then ﬁx them for a larger set of environments. This practice was used in the Atari suite for example, where it was suggested to use ﬁve of the 57 games for hyperparameter tuning (Bellemare et al., 2013).This procedure could mimic a deployment scenario, where you have several simulated environments related to your real-world environment, on which hyperparameters can be tuned. However, as yet there is little understanding of how one might pick such tuning environments. Finally, a simple and relatively under-investigated approach is to learn a simulator of the real system and use the resultant simulator to search for or optimize the hyperparameters (Wang et al., 2022b). The advantage of this approach is the number of hyperparameters tested, in a sweep or bayesian hyperparameter optimization proceedure, is in no way limited by the deployment environment. The challenge of course is that the quality of the simulator is related to the deployment data available and, generally, learning accurate models that are useful under long rollouts remains a largely open challenge.

important to note that for simulated environments, even though we can use hyperparameter optimization (or sweeps), we then have to account for that as part of the algorithm and as part of the computational cost.

Empirical Design in Reinforcement Learning

Key insights: dealing with hyperparameters

1. Unthoughtful treatment of hyperparameters is one of the leading causes of bias in empirical reinforcement learning. The ﬁrst step is acknowledging this bias.

2. When designing an experiment, make a plan to select hyperparameters in a way that matches the goals of the experiment, rather than as a frustrating nuisance.

3. Hyperparameter sweeps are not a general-purpose method for ﬁnding hyperparameters; instead, we use sweeps in experiments to understand our algorithms.

4. For sensitivity plots, select ranges of hyperparameters mindfully. Ensure the range is wide enough; if you ﬁnd optimal parameters lie on the boundary, expand the range.

5. Visualizing hyperparameter sensitivity is itself a diﬃcult problem; be innovative about how to communicate this to the reader. Current options include two-dimensional parameter sensitivity curves over one hyperparameter and violin plots of all hyperparameters.

6. In general, it is better to avoid reporting performance for the best hyperparameters chosen in an environment. There are too many pitfalls, and you risk tricking yourself about the quality of your algorithm.

7. Default hyperparameter settings are not necessarily appropriate and certainly not always fair. See more in the next section about the pitfalls of untuned baselines.

4. Comparing the performance of multiple algorithms

Most if not all the things we worry about when investigating a single agent are relevant when investigating more than one. However, the concerns become more serious as we are often making a value judgement on the ranking or relations between multiple agents. The claims are inherently more nuanced and thus the standard of evidence and rigour required goes up a notch.

There are many reasons we ultimately compare multiple algorithms. The most common goal is to provide a performance ranking over multiple related algorithms in a given environment (or suite of environments). Implicitly, we are making the claim: if your environment is similar to this environment, here is the algorithm you should use . Supporting such claims well is rife with diﬃculties and we will dive into the details in Section 5.2.

An alternative reason to compare algorithms is to show that one algorithm (the baseline) suﬀers from some problem, but the newly proposed algorithm does not. A prototypical example is the introduction of Gradient TD algorithms (Sutton et al., 2009). The comparison starts by convincingly illustrating that the baseline algorithm (TD in this example) suﬀers from some problem by crafting a very speciﬁc and simple counterexample. Then we introduce a novel algorithm or modiﬁcation to the baseline and show that this modiﬁcation does not suﬀer the same problem.

Patterson, Neumann, White, White

In this section, we will explore challenges that arise for both types of comparison and discuss strategies for drawing reliable and robust conclusions. We will assume the reader has well understood the preceding sections on summarizing the performance of a single algorithm (Section 2.2) and understanding the impact of hyperparameters on an algorithm (Section 3) as these concerns become even more exaggerated in the multiple algorithm case.

4.1 Designer s curse: with great knowledge comes great responsibility

Evaluating your own algorithm is rife with bias. You know the ins and outs of your algorithm better than anyone on the planet. You have spent months working with diﬀerent versions of your algorithm, tuning them for performance, ﬁnding environments where your algorithm shines, and discarding ones where your algorithm does not. The baselines you eventually compare against likely have not received the same attention. Worse, you will not have the same detailed understanding of those baselines, nor knowledge of how to tune them well. There is nothing we can do about designer bias , but there are steps we can take to reduce your experimenter bias. You can counteract some of the bias, however, by spending extra eﬀort to get the baselines working well. It is better to risk giving the baselines a small advantage to mitigate over-claiming. Let us revisit the example of an experiment where we propose an algorithm to resolve some speciﬁc failure case. Implicitly, this experiment is actually making two independent claims. The ﬁrst claim is that the baseline algorithm experiences failure. As we saw in Section 3, it is exceptionally challenging to show that an algorithm has a consistent failure case. We can always ask: would this happen for a diﬀerent choice of hyperparameters? The second claim is that our proposed algorithm solves the failure case. Illustrating this second case can be much easier; we just need to ﬁnd a single hyperparameter setting that works! As a result, a truly fair comparison requires spending a signiﬁcant amount of time understanding the baseline algorithm while requiring far less time (experimentally) understanding the proposed algorithm. If we had not spent the time with the baseline, we could unintentionally build a strawman argument; one which does little good for the scientiﬁc understanding of learning algorithms. When we are benchmarking our algorithm, one of the easiest ways to protect against designer bias is to choose the right environments. We can let the authors of the baseline algorithm do the work here. If you are comparing your new algorithm to Soft Actor-Critic (SAC) (Haarnoja et al., 2018), then use an environment where prior work has shown SAC does well. It is reasonable to assume that the authors of SAC worked hard to tune their own algorithm that is designer bias working for you! It is important that you pair the environment choice with fully speciﬁed algorithms because our algorithms are not yet general. SAC with hyperparameters tuned for Half Cheetah, for example, is likely to not work well in Mountain Car. It is okay to try a baseline algorithm on a new environment, but then you will likely have to expend signiﬁcant energy tuning it (perhaps changing the network architecture, optimizer, etc.) and you really cannot be sure you will do a good job.

Empirical Design in Reinforcement Learning

Example: Untuned baseline.

The inset ﬁgure shows one example where an untuned agent is applied to a new environment, in this case Lunar Lander. Imagine we have introduced a new algorithm called Deep Q it is a simple action value learning agent with a Qlearning update and neural network function approximation (the blue line). Now imagine we grabbed a freely available DQN implementation (the black line) with a default set of hyperparameters tuned for some other environment. Comparing the blue and black lines might lead one to think Deep Q is much better than DQN on this task, but we are misleading ourselves. The red line represents DQN after retuning its hyperparameters for Lunar Lander. Much improved performance and ego checked! The moral of the story is simple: beware of untuned baselines.

4.2 The utility of calibration baselines: what does that line even mean anyway?

Sometimes comparing the performance of two algorithms is not enough because we lack context to understand the results. Returning to the Lundar Lander experiment above we might naturally wonder: are any of these algorithms doing well? How would a non-learning policy like random action-selection fair? What about a learning agent with a much simpler function approximator like SARSA(λ) with tile coding? Anecdotally, years ago when one of the authors submitted their ﬁrst paper on using reinforcement learning to play hearts, pesky Reviewer #2 said roughly: sure your agent is good compared to human players and the best search agent, but I don t have any basis to understand the performance. You should include results playing against a random agent . Reviewer #2 was right. We included the baseline, which ultimately showed our agent and the other baselines were very strong indeed, and it made our paper better. You should always ask yourself if such baselines not just SOTA algorithms could make your results easier for the non-expert to interpret. Picking the right calibration baseline depends on the research question, but generally there is a lot of ﬂexibility here. It is often good to think of both randomized and highperformance or even oracle baselines. Oracle baselines often have access to side information that your learning agent does not, setting an unobtainable but interesting performance bar. Alternatively, an oracle baseline might just operate under diﬀerent constraints than your new algorithm. For example, we often compare linear complexity oﬀ-policy TD algorithms against least squares TD (LSTD). LSTD is a quadratic algorithm that performs more compute per step than a linear algorithm and thus naturally we expect it to set a high performance bar. If your new linear oﬀ-policy TD algorithm approaches the performance of LSTD on several environments, you can be more conﬁdent your algorithm is learning eﬃciently and that your algorithm s improvement over other linear baselines is relevant. While these calibrations provide useful context for interpreting results, we must be cautious not to become over-reliant. For example, the Atari suite uses random performance and human-level performance as calibration baselines (Mnih et al., 2013). However, humans have rather diﬀerent constraints than our agents; requiring a screen to project light to our

Patterson, Neumann, White, White

eyes, having latency between our brains and ﬁngers, the often imperfect translation from muscle activation to controller input, and ﬁnally latency from controller to gaming system. Achieving human-level performance is certainly informative of learning capacity and progress; exceeding human-level performance by many orders of magnitude, however, may be a result of fewer constraints. As we move increasingly far from our calibration baselines, we must re-evaluate their utility.

4.3 Ranking algorithms (maybe don t)

A common goal is to show that your new algorithm is better than some prior work. However, providing supporting evidence towards this claim can be exceptionally diﬃcult. In fact, in all but the most rare cases, we might go so far as to say that providing such evidence is entirely infeasible! The space of problem settings where any given algorithm may be deployed is large and wholly unspeciﬁed; showing that any algorithm generally outperforms another across this space would be impossible without ﬁrst deﬁning the space. Even once this massive problem-space is well deﬁned, gathering suﬃcient evidence to cover the space is likely intractable. Instead, we tend to produce a ranking of algorithms on a small set of environments and hope our claims generalize. Let us discuss a bit more why this might be problematic. Even if we could show improved performance for a particular benchmark, this is likely not broadly informative. It is like trying to generalize from a biased distribution (say performance in Atari) with a few samples to make predictions about performance across the whole set (performance in all environments). Instead, we argue that insights about why an algorithm behaves diﬀerently tend to generalize. Scientiﬁc testing is one name for designing experiments to gain insight on why questions and is nicely discussed in Jordan et al. (2024). Scientiﬁc testing is often performed using synthetic environments, like counterexample MDPs or environments with carefully crafted properties such as Riverswim (Strehl and Littman, 2008). These synthetic environments allow us to reason about what situations will cause an algorithm to fail. It is up to domain experts, then, to identify if these failure modes apply to their problem setting.

Example: The deadly triad.

One such insight that has aged well is the deadly triad (Sutton and Barto, 2018). The deadly triad speciﬁes three conditions where TD-based algorithms often fail. Speciﬁcally, when the algorithm uses (1) bootstrapped estimates, (2) oﬀ-policy sampling, and (3) function approximation. Given this set of conditions, then, it is not hard to propose an environment where TD algorithms fail; many have been proposed in the literature (Baird, 1995; Kolter, 2011; Tsitsiklis and Van Roy, 1997). These conditions have been further reﬁned over the years to include properties of the environment (Kolter, 2011; Patterson et al., 2022b) and properties of the function approximator (Ghosh and Bellemare, 2020) allowing an even more crisp understanding of when algorithms might fail.

Another challenge that arises is that of fairness. When comparing the performance of multiple algorithms, many design choices must be made: setting hyperparameters, picking which environments to use, how long to run each experiment, and so on. Sometimes these choices have non-obvious and indirect impacts on the performance of each tested algorithm,

Empirical Design in Reinforcement Learning

often leading to latent unfairness in the experimental design. A noteworthy example is the experiment length used for the Atari benchmark. Shortening the length of experiments run on Atari dramatically changes the ranking of algorithms dramatically (Machado et al., 2018; Agarwal et al., 2021; Obando-Ceron et al., 2024). It remains unclear if Atari 100k simply favours agents that learn much more aggressively (e.g., larger learning-rate parameters) without risk of divergence, or if it captures something diﬀerent and interesting compared with the usual 200 million step setup. Sometimes it is not obvious how to maintain fairness across diﬀerent algorithms. For instance, if two algorithms have wildly diﬀerent architectures, it can be challenging to ensure representation capacity is comparable. Instead, an empiricist might ﬁx the representation capacity of one algorithm, while testing the other algorithms with increasingly large representations. One could then report the performance with diﬀerent representation sizes and thus give a more complete picture of the performance diﬀerence as representation capacity changes. Unfortunately, it can be expensive to exhaustively test an algorithm for multiple settings of a confounding variable in fact, this is eﬀectively a form of sensitivity analysis akin to Section 3. When designing fair comparisons, we often come back to focus on understanding our algorithms, rather than ranking them. We discuss this further in Section 5, where we discuss selecting and designing environments for experiments. If this cost is prohibitive, an alternative option would be to explicitly run an unfair experiment, providing an explicit disadvantage to our own proposal algorithm. Often, our proposed algorithm already has multiple sources of implicit unfairness: we spend more time on the implementation, we better understand the algorithm and when it may fail or succeed, we spend more time tuning the algorithm to our problem setting. When we provide an explicit source of unfairness such as a reduced representation capacity and our algorithm still outperforms competitors, we show a lower bound on the potential improvement provided by our algorithm, implying that an even greater degree of improvement would be observed under even more fair conditions. Naturally, we should be careful not to overclaim here; we cannot know to what extent our algorithm improves performance under more fair conditions.

4.4 Statistically signiﬁcant comparisons for two fully-speciﬁed algorithms

Imagine that we want to compare Algorithm A and Algorithm B, in Mountain Car, in terms of the online episodic return, in expectation across many runs. We may want to say that A is better than B, with high conﬁdence, on Mountain Car.11 One of the simplest strategies for comparing the two is to compute conﬁdence intervals, which we discussed for a fully-speciﬁed algorithm in Section 2.5. If the two intervals do not overlap, and the mean value for A is above B, then we can conclude that A is statistically signiﬁcantly better than B. However, this is a low powered test. In other words, depending on the number of runs, the conﬁdence intervals may overlap but a more powerful test, like the paired t-test, might have allowed us to conclude that A is statistically signiﬁcantly better than B. The primary reason is that pairing allows us to account for sources of variation due to the environment or initialization. For example, for one random seed, the agent may start in a diﬃcult start

11. Presumably we are verifying that A resolves an issue with B, that manifests on Mountain Car, rather than attempting to outperform B on Mountain Car.

Patterson, Neumann, White, White

(a) Individual performance of DQN and EQRC. (b) Performance diﬀerence D = GDQN GEQRC.

Figure 12: Diﬀerences can make a diﬀerence: in (a), we see the performance of two algorithms, DQN and EQRC (Ghiassian et al., 2020), on the Mountain Car environment. Shaded regions show individual conﬁdence intervals for each algorithm. In (b), we show the performance diﬀerence D = GDQN GEQRC. When D > 0, then DQN has better performance. When D < 0, then EQRC has better performance. And when the shaded region does not include the horizontal line at D = 0, then the reported diﬀerence is statistically signiﬁcant. Looking at the learning curves on the left (a), it looks like the performance of the two algorithms is nearly indistinguishable in early learning. Looking at diﬀerence curves on the right (b), we get a clearer picture.

state, impacting all future learning. If we compare the two agents for that seed, then we may ﬁnd that both performed poorly for that seed, but that the relative ranking remained the same. The small modiﬁcation to get the paired t-test is simply to look at diﬀerences in performance, rather than the performance itself.

We can leverage the same idea to visualize conﬁdence intervals for learning curves. Overloading terminology, let the performance of one algorithm be random variable A and the performance of the baseline be random variable B. Then we would report an interval around D = A B as opposed to reporting two intervals (l A, u A) and (l B, u B), as shown in Figure 12. Whenever the lower bound of the interval is greater than zero, our proposed algorithm outperforms the baseline with our stated level of conﬁdence. Notably, this has the additional advantage of removing one line and shaded region in learning curve plots at the cost of a slightly obfuscated visualization of performance. In the end, both plots may be desirable.

It may actually be preferable to use this conﬁdence interval on the diﬀerence rather than a hypothesis test because it provides more information. In particular, it highlights the magnitude of the diﬀerence between algorithms, called the eﬀect size, not just that they are diﬀerent. Most hypothesis tests are designed to answer the question: is the average performance of algorithm A better than the average performance of algorithm B? If there truly is a diﬀerence in performance between these algorithms, then with enough samples we will eventually be able to reject the null hypothesis and provide evidence that A is better than B at the p = 0.05 level. However, this ignores a very important nuance: the diﬀerence in performance may be negligibly small. A common solution to this problem is to include an appropriate measure of the eﬀect size for the chosen hypothesis test, for instance for a t-test

Empirical Design in Reinforcement Learning

one might include Cohen s d to measure the eﬀect size. However, such eﬀect size tests may not be available in all situations and can often be diﬃcult to interpret. Another common criticism of hypothesis tests are that their results are easily misinterpreted; in fact, this has led to some scientiﬁc publications banning the use of p-values (Wasserstein and Lazar, 2016). A common misinterpretation of the p-value is that p represents the error rate of the test. Rather, the p value says: if the null hypothesis were true, then the probability of seeing results at least this extreme due simply to randomness is p. However, naturally, we do not know whether the null hypothesis is true this is why we are performing the test and so we cannot know the true error rate from this alone. Recent statistical simulations suggest that, for sensible assumptions, the true error rate for a p-value of p = 0.05 may be as high as 26% (Sellke et al., 2001; Colquhoun, 2017). As the p-value decreases and the eﬀect size increases, this error rate quickly drops. The principle behind pairwise comparisons is also about controlling randomness whenever we can. The ability to replicate and reproduce results is core to scientiﬁc investigation. Computer simulations have a particular advantage of being nearly perfectly replicable. To do so, however, we must carefully control sources of variation: this means controlling the pseudorandom number generators responsible for simulating randomness in our environments and agents. A common approach is to set a global random seed for the entire simulation, impacting the random state of all randomized components simultaneously. However, it can often be beneﬁcial to retain independent random states for each randomized component of the experiment. We give an example below about how this could be useful when evaluating the addition of auxiliary losses, measuring a property called the stable rank.12

Example: Separating the random seed for the agent and environment.

Imagine you want to understand the impact of adding an auxiliary loss to your agent. Speciﬁcally, your hypothesis is that the stable rank of the agent s representation will decay less quickly when you use an additional neural network head to predict the next state. To test this hypothesis you start with a single environment say Puddle World select a learning algorithm with all hyperparameters speciﬁed except for the neural network architecture, then create two fully-speciﬁed algorithms by deﬁning a neural network architecture with the auxiliary head, and another without. Finally, you run each algorithm 30 times, measuring the stable rank of the neural network layers periodically through the run for each individual agent. Consider the code-design where you set a single global random seed for the entire simulation and you run the ﬁrst seed = 0 for both architectures. Because both neural networks have diﬀerent sizes and both are initialized randomly, each agent will call the random number generator a diﬀerent number of times at the end of agent initialization, both random number generators are in diﬀerent states. Then your code initializes the Puddle World state by randomly selecting a starting coordinate. However, because both RNGs are in diﬀerent states, both investigated agents likewise start in diﬀerent states. Because the starting state in Puddle World considerably changes the sequence of observed rewards, the agents for seed = 0 receive wildly diﬀerent data.

12. The stable rank is the ratio between the Frobenius norm and the spectral norm of the weights. It was introduced as a measurement of capacity, for instance Arora et al. (2018).

Patterson, Neumann, White, White

On average over several diﬀerent agents for each experimental condition, the eﬀects of diﬀering observation sequences will wash out and we will be able to detect if our auxiliary loss plays a role on stable rank. However, if we had used individual random states for both the agent initialization and the environment initialization, the two seed = 0 agents would have observed the exact same sequence of observations and would share some degree of variation due to these observations. Taking advantage of this pairing structure, we can detect diﬀerences in stable rank with far fewer agents by cancelling out nuisance sources of variation. This experimental design is called repeated measures.

4.5 Statistically signiﬁcant comparisons for multiple fully-speciﬁed algorithms

There are times we want to compare multiple agents. For example, there may be a baseline, an existing algorithm and your modiﬁcation of that algorithm. A sensible choice is still to choose an algorithm as a comparator, and plot diﬀerences to that algorithm for the rest. This approach mitigates some of the variance we might see due to the particular conﬁgurations in a run. We can report conﬁdence intervals for these diﬀerence curves, as above. However, we need to consider issues with multiple comparisons. Imagine that we compare ﬁve diﬀerent algorithms on a plot, all with performance diﬀerences to some baseline. We may mentally compare the ﬁve diﬀerent conﬁdence intervals, or run pairwise comparisons using hypothesis tests. If we treat each comparison independently, then the failure probability δ accumulates. To account for these multiple comparisons, the simplest solution is the Bonferroni correction (Dunn, 1961), that uses δ/5 to compute these accumulating probabilities to allow for a conﬁdence of 1 δ. Naturally, this requires even more runs (and more compute) to ensure statistical signiﬁcance. One of the simplest solutions to this issue is to narrow the scope of our experiment, and run fewer algorithms. We can limit the scope of baselines to those that have similar design philosophies or characteristics to our own proposed algorithm. For example, perhaps we propose a model-free policy-gradient algorithm and so choose to only compare to other model-free policy-gradient baselines. This restriction narrows the question to algorithmic improvements within the same class of algorithms, rather than also trying to show improvements for policy gradient algorithms over an entirely alternative class like model-based Q-learning algorithms. It is actually reasonable to separate this for clarity reasons, not just computational reasons. If, on the other hand, your empirical goal is to compare a large set of algorithms, then the answer is simply that you need to do more runs. It can be useful to investigate if policy gradient algorithms have advantages over an alternative class like model-based Q-learning algorithms. When answering such a broad question, naturally it requires more compute. This further highlights why it is useful to separate these concerns. It is worthwhile to have empirical studies and benchmark papers that attempt to run these more comprehensive experiments. This is a mountain of work! It is perhaps infeasible and not entirely helpful to also include such work within a paper focused on a new algorithm. Simply put, ever paper need not include a set of leaderboard-style results. Another common approach is to isolate the precise modiﬁcation made in the proposed algorithm. Perhaps the proposed algorithm introduces a new sampling strategy from replay buﬀers. Instead of comparing against a suite of diﬀerent learning rules, we can instead make

Empirical Design in Reinforcement Learning

a small number of pairwise comparison between similar algorithms. For instance, we can endow DQN and A3C with our proposed replay buﬀer and perform two pairwise comparisons against the original DQN and A3C respectively. We do not care if DQN with the novel replay buﬀer outperforms A3C with the old replay buﬀer, as many design variables change between these populations. By eliminating several such pairwise comparisons, we can signiﬁcantly decrease the probability of false negatives and number of samples to detect diﬀerences.

Key insights: comparing the performance of multiple algorithms

1. Remember, the seed is NOT a tuneable hyperparameter.

2. Be aware of your designer bias. Ultimately, you want to avoid misleading yourself, as well as the rest of the community.

3. Consider the use of baselines (e.g., random, oracle) to contextualize performance.

4. The purpose of comparing to an algorithm is primarily to show that your new approach resolves the issues that you claim it does. It is diﬃcult to claim that you have a generally better algorithm only using performance across several environments. Design experiments to provide insights rather than state-of-the-art claims.

5. We can leverage paired comparisons between algorithms, to reduce variability and make statistically sound conclusions with fewer runs.

6. Statistical intervals conﬁdence intervals and tolerance intervals provide a more interpretable alternative to hypothesis tests, when comparing two algorithms. By reporting intervals, we provide both a sense of eﬀect size and an estimated error rate.

7. Consider setting the random seed for the environment and agent separately. This repeated measures empirical design allows us to control for stochasticity in the agent (e.g., neural network initialization) separately from environment stochasticity (e.g., start state), ensuring pairwise comparisons share more similarities.

8. Comparing more than two algorithms requires more care, and we can run into multiple comparisons issues. Err on the side of designing experiments where a more focused questions is asked (i.e., scientiﬁc testing) comparing a smaller number of algorithms.

5. Selecting environments

The selection of environments is a critical part of the setup of your experiment. In other sciences, experiments are about the natural world and are naturally constrained. The natural world provides a rich suite of problems for us as well, such as robotics, chat bots, controlling power plants, trading, and more. These real-world problems are often where we want to eventually deploy our algorithms; simulation is where we can prototype. In simulation, it is diﬃcult to maintain the richness of the real-world. And worse, we may inadvertently design environments that are actually impossible or too easy.

Patterson, Neumann, White, White

When developing an algorithm, the ﬁrst step is usually to hypothesize a very simple environment and experiment where you are highly conﬁdent in the outcome. For example, to test an agent s ability to remember, we might design a small gridworld where the agent must remember that pressing a button unlocks a reward. This ﬁrst step provides a foundation for more complex experiments. If there are surprising results in this ﬁrst step as there very often are then it can be much more carefully understood before moving to a setting where it can be harder to isolate the issue. It is also a step where a surprising result can make you rethink the algorithm itself, thus providing conceptual clarity. We discuss such diagnostic environments in Section 5.1. After that, the next step is to evaluate the algorithm(s) on existing benchmark environments. We discuss the role of using benchmark environments further in Section 5.2. We conclude the section with a brief discussion about diﬃculties with creating new environments and about the potential utility of aggregating environments.

5.1 Diagnostic environments

The ﬁrst step in designing a diagnostic environment is to isolate the key issue your algorithm is designed to selve. If you can identify that issue, then you can design an environment that is deﬁned by that issue or where the issue is exaggerated enough to ﬁnd an eﬀect that is exactly what a diagnostic environment is. A classical diagnostic environment is Baird s counterexample (Baird, 1995). This 7-state environment was designed to show that temporal diﬀerence (TD) learning divergences under oﬀ-policy sampling. This diagnostic environment may seem like a counterexample, rather than an environment, however it is an ideal example of a diagnostic environment because it is precisely one where the experimenter should be quite sure of the outcome, divergence. Surprising outcomes have since come out of this environment, for example, we have found that incorporating certain adaptive stepsize algorithms or periodically ﬁxed targets (e.g. target networks) into TD seems to resolve this counterexample, though there is no theory that suggests these strategies should help. These diagnostic environments can have lasting impact on algorithm development. In some cases, Baird s little MDP has highlighted issues with algorithms that claim convergence guarantees; for example Emphatic TD should converge on Baird s counterexample in theory, but in practice the variance is so high that convergence is very poor and requires careful tuning (Sutton et al., 2016; Mahmood, 2017). Diagnostic environments are chosen based on the hypothesis you wish to test, which means that an existing diagnostic environment may already have been designed to isolate that property. For example, you might hypothesize that an algorithm is prone to settling on a suboptimal policy because it does not explore enough, so you might use Riverswim (Szita and Lorincz, 2008), which was designed speciﬁcally to test this.

5.2 Benchmark environments and challenge problems

Benchmark environments provide a useful tool to assess if there are issues with a new algorithm a sanity check. The word benchmark means a point of reference. The possible performance on these environments is well understood, due to previous results, making it easier to understand if you are doing better or worse than this reference. If you are doing

Empirical Design in Reinforcement Learning

notably worse than a known reasonable solution, then this could indicate an issue with your algorithm that should be addressed. For this reason, benchmark environments can remain useful for many years. Classic environments that are still commonly used include Mountain Car (Moore, 1990), Cartpole (Sutton and Barto, 2018), Puddle World (Sutton and Barto, 2018) and Acrobot (Sutton, 1996). These simple environments play a useful role because we understand very well how to get good performance in these environments, and so we can easily see when there are issues in new algorithms. For example, a SARSA(λ) agent with tile-coding can ﬁnd a good policy in Mountain Car in a dozen or so episodes. DQN using replay and target networks learns much more slowly, settles on a worse policy, exhibits more instability, and is sensitive to the target network refresh rate (Hernandez-Garcia and Sutton, 2019; Kim et al., 2019; Patterson et al., 2022a). Running DQN on this environment helps identify that there may be room for algorithmic improvement. Certain environments may start as challenge problems ones that we do not know how to solve well and eventually become benchmark environments. Two examples of such benchmarks are the Atari suite (Bellemare et al., 2013; Machado et al., 2018) and Mujoco environments (Todorov et al., 2012). A challenge problem plays a diﬀerent role than a benchmark environment. It highlights gaps in all of our algorithms; trying to ﬁll these gaps can drive algorithm development. If an environment is truly a challenge problem, then experiments are more exploratory or demonstrative. It can be suﬃcient to show that you can obtain good performance demonstrate something is possible without even comparing to any other algorithms or only including basic baselines. The results suggest that you can do something that was not possible before.13

As our algorithms improve on these challenge problems, they become benchmark environments. At some point, however, they can be in a confusing interim stage. We as yet do not have an understanding of how to obtain good performance, but algorithms have gone from being abysmal to merely mediocre. It is also at this stage that we start to overﬁt to these interim benchmark-challenge problem environments. The community tends uses these benchmark-challenge problems to gatekeep new algorithms: requiring new work to show clear improvements in these environments, with comparisons to the now growing list of algorithms that perform okay in these environments. There is a tendency to dismiss any experiments in simpler benchmark environments, since those environments are too easy . Eventually, this causes overﬁtting to the benchmarks to eke out ever smaller wins. It is an issue that happens in reinforcement learning, as well as on benchmark datasets in machine learning.14

13. This is not how Atari is now used in the reinforcement learning literature, but the ﬁrst experiments including the original DQN paper (Mnih et al., 2013) were very much of this form. 14. And machine learning is, of course, not the ﬁrst place to experience these pains. Benchmarking also became a serious issue in planning and search (Hooker, 1995). A particularly evocative quote is as follows. It would be absurd to ground structural engineering, for instance, solely on a series of competitions in which, say, entire bridges are built, each incorporating everything the designer knows about how to obtain the strongest bridge for the least cost. This would allow for only a few experiments a year, and it would be hard to extract useful knowledge from the experiments. But this is not unlike the current situation in algorithmic experimentation. Structural engineers must rely at least partly on knowledge that is obtained in controlled laboratory experiments (regarding properties of materials and so on), and it is no diﬀerent with software engineers. (Hooker, 1995)

Patterson, Neumann, White, White

This direction is typically not beneﬁcial for general algorithm development. Instead, it may be better to acknowledge this transition and begin using these environments as benchmarks sooner. This means it is not key to outperform whatever (overﬁt) solution is considered the current state-of-the-art, but rather to take a useful baseline with well-known performance as a sanity check on your algorithm. For all the above reasons, it is important to remember that most of our experiments in benchmark environments are to identify issues with our algorithms, rather than to make bold claims about state-of-the-art performance. It is clearly useful to identify and ﬁx issues in our algorithms using experiments. It is not as clearly useful to rank algorithms based on performance in small simulation environments. If there are stark and meaningful diﬀerences in well-known environments, then that success can and should be reported, with hopefully accompanying experiments to understand why. Inability to get stark improvements on benchmark problems, however, should not prevent pursuing an idea, nor gatekeeping others in pursuing ideas. Your experiments should highlight one setting where your new algorithm is demonstrably useful with a clear explanation of why. You can also demonstrate acceptable performance on benchmark environments, to show nothing is obviously broken.

5.3 New (or modiﬁed) environments are not always better

Designing environments is hard. It is also arguably something many of us have little expertise in. We learn a lot about developing algorithms, much less about developing environments. It is important to realize there are relatively few well-known environments that have stood the test of time, and the ones that do have been reﬁned and ﬁned tuned over years, if not decades. For example, Mountain Car was ﬁrst proposed as the Puck on a Hill environment by Andrew Moore, in his Ph D work (Moore, 1990). The environment featured continuous actions and a non-zero reward for reaching the goal. Years later, Sutton and Barto changed the dynamics representing the hill as a cosine wave, made the actions discrete, and the reward -1 per step. This later version became the standard for over 20 years, before AIGym (Brockman et al., 2016) introduced an aggressive episode cutoﬀof 200 steps to improve the performance of neural network learners. Take pause when deciding to invent a new environment and consider the costs. First you need to justify clearly why we need yet another environment. The biggest concern is that you may very well invent an environment that is invalid or needs further improvement. But practically speaking, you are making more work for yourself. If you use your new environment to highlight the merits of your new algorithm, then you must retune baseline algorithms to avoid reporting results with untuned baselines, as we have already discussed in Section 4.1. This can be a lot of work and is error prone. Exercise similar caution when modifying existing environments. These changes need to be justiﬁed, particularly as it means that older benchmark performance no longer applies, losing one of the key reasons to use a benchmark environment in the ﬁrst place. Sometimes the easy road and the more scholarly choice is to use an existing environment. These changes, or new environments, also might cause us to accidentally design environments in support of our algorithms. For example, we might run DQN on discrete-action Pendulum swing up, and ﬁnd under standard conﬁgurations it performs surprisingly poorly. We then might try a few changes, to maintain most of the essence of the environment but

Empirical Design in Reinforcement Learning

make learning more feasible. One such change could be introducing episode cutoﬀs and random start states to facilitate exploration. Now we have a new environment that has potentially been tuned to be easier for algorithms like DQN. If we test a policy gradient algorithm, or a completely new approach, then we may have inadvertently favored DQN. Co-evolving our environments and algorithms in this way is always dangerous. Note, this is not the same as designing new diagnostic MDPs. A diagnostic environment allows us to highlight the key issue being tackled. It is common for papers to introduce new diagnostic environments that may never be used again to make a clear conceptual point. One route for environment creation is to consider environment collections. For example, we may collect several classic environments like Mountain Car, Cartpole and Acrobot into a Classic Control set composed of environments with non-imaged based, low-dimensional observations. Or we may collect several Atari games that seem to be more diﬃcult in terms of exploration into an Atari Exploration set.15 We can then report performance in aggregate, across the set, as one macro-environment, without explicitly considering performance on each environment in the set. There are several beneﬁts to this approach. This design tests the performance of the algorithm on a set of related environments with a particular property, making conclusions less specialized to one speciﬁc environment. Of course, we still need to understand why our algorithm behaves the way it does. Using more environments does not allow us to conclude our algorithm will perform better, but it does at least show that the emergent phenomena arises in more than one environment. Additionally, we obtain this ability to test across environments with minimal increases in compute over testing in one environment. That is because we can do fewer runs within each environment, to still get a large number of runs in the macro-environment (collection). A related strategy, to test generalization, was proposed earlier by having systematic parameters that could be varied for an environment (Whiteson et al., 2009), such as gravity in Mountain Car.

Considering macro-environments has some beneﬁts, but there are also challenges in aggregating performance across environments (e.g., dealing with diﬀerent reward scales). We discuss this topic in more depth in Appendix E.

15. Even when just aggregating environments, we need to be careful that we have designed the collection to do what we think it does. For example, Taïga et al. (2020) showed how prior work focusing on so called hard exploration games in Atari caused researchers to miss that simple ϵ-greedy based algorithms were actually better than count-based algorithms across the whole Atari suite. Environment design is hard.

Patterson, Neumann, White, White

Key insights: environment selection

1. Small diagnostic environments can be used to highlight the properties of a algorithm called issue-oriented research. These diagnostic environments can provide conceptual clarity as well as a critical sanity check.

2. Designing environments is hard; do not assume a published environment is correct or sensible.

3. Proceed with caution when changing an existing environment and make sure you have a good justiﬁcation for not using the original standard.

4. It is not better to run on more environments than less environments. It depends on the empirical question being asked. If you ﬁnd yourself thinking you should run on one more environment, then make sure you ask yourself why and select the next environment deliberately rather than to simply increase the number of environments.

5. It is unlikely your algorithm will perform best on all environments. But, it should have the advantages you claim it has. Verify if your hypothesis is correct, and report both this good and bad performance.

6. It is not acceptable to run an incomplete experiment due to insuﬃcient resources or compute. The choice of environment non-trivially impacts the cost of an experiment. Design and run a meaningful experiment in an environment that is feasible for the resources you have available.

6. Case study: re-evaluating previous work

In this section, we attempt to reproduce a well-known result from the literature, along the way evaluating the original design choices and suggesting alternatives. In the end, we draw very diﬀerent conclusions! Speciﬁcally, we attempt to recreate the experiments of the Soft Actor-Critic (SAC) paper (Haarnoja et al., 2018). SAC is an actor-critic algorithm derived from the maximum-entropy reinforcement learning framework and has been shown to perform well on continuous control and robotic control tasks (Haarnoja et al., 2018, 2019).

We attempted to reproduce the comparison of SAC and Deep Deterministic Policy Gradient (DDPG) on the Half-Cheetah environment from AIGym. For the SAC implementation, we used the original SAC codebase16 (SAC-CB) with the tuned hyperparameters outlined by Haarnoja et al. (2018). For the DDPG baseline, we used the RLLab codebase17 (RLLab-CB) with the hyperparameters set to the defaults in the RLLab codebase, except that the batch size, replay buﬀer capacity, and network architectures were adjusted to match those used by SAC (Duan et al., 2016). The SAC paper did not outline the exact hyperparameters used for DDPG, but their github repository left some hints on the conﬁguration of the DDPG baseline

16. The original SAC codebase can be found at https://github.com/haarnoja/sac 17. The RLLab implementation can be found at https://github.com/rll/RLLab

Empirical Design in Reinforcement Learning

0 3 Timesteps (Millions)

Average Return

DDPG (SAC Env)

DDPG (RLLab Env)

Figure 13: Our ﬁrst attempt to recreate the experiments in the original SAC paper. In contrast to the original work which used only 5 runs, we use 30 runs. The variance in performance is higher and the average performance lower than reported in the original work. Solid lines denote average performance, and shaded regions denote minimum and maximum performance.

as well as the implementation used18. Two diﬀerent environment wrappers for Open AI Gym exist in these codebases, one in RLLab-CB and one in SAC-CB. Which wrapper Haarnoja et al. (2018) used for their experiments with DDPG is unclear, so we use both for DDPG. For SAC, we assume the wrapper in SAC-CB was used. We identiﬁed a possible bug in the environment wrapper for RLLab-CB, causing this wrapper to handle episode cutoﬀs improperly. Some algorithms in RLLab-CB can bootstrap on episode cutoﬀs. Such a practice considers the ﬁnal state due to the episode cutoﬀas a terminal state due to reaching a goal considering the cutoﬀstate to have a value of 0 in the TD-error. Even if this functionality is explicitly turned oﬀ, this incorrect bootstrapping can still occur19. We cannot know for sure, but DDPG may have had a signiﬁcant disadvantage in the original experiments. In Figure 13, the line labelled DDPG (RLLab Env) does suﬀer from this innocuous bug while the line labelled DDPG (SAC Env) does not. Figure 13 shows the mean learning curves with shaded regions as minimum and maximum performance as done in the original work. The mean performance of SAC over 30 runs is lower than that reported in the original paper, and the error bars here are larger than those reported in the original paper. Also, there is a performance diﬀerence between the two DDPG results, possibly due to the bug. Next, we examined the code-base more carefully, to understand why SAC is underperforming. In our previously described experiment, we attempted to reproduce the results of Haarnoja et al. (2018) using the experimental procedures described in the paper alone. Yet, several implementation choices in the code-base were not reported in the paper. First, the default implementation of some policies in SAC-CB use regularization (e.g., Gaussian policies). Second, many code examples in SAC-CB normalize actions to stay within the environmental action bounds. Finally, several code examples in SAC-CB use an initial random exploration

18. See https://github.com/rail-berkeley/softlearning/issues/27. It is mentioned that the original paper used the RLLab implementation of DDPG with similar hyperparameter settings as SAC where applicable. The RLLab implementation uses OU noise for DDPG. 19. This happens if the wrapped AIGym environment uses a number of steps per episode less than or equal to the number of steps per episode in the RLLab-CB environment wrapper.

Patterson, Neumann, White, White

phase actions are sampled from a uniform distribution over actions for the ﬁrst 10,000 steps. Policy regularization, action normalization, and random initial exploration are not reported in the paper. We expected initial random exploration phase was the most likely culprit for the disparity in performance reported by Haarnoja et al. (2018), simply because regularization and action normalization do not aﬀect the squashed Gaussian policy implementation in SAC-CB,

0 3 Timesteps (Millions)

Average Return

SAC (No EP)

DDPG (SAC Env)

Figure 14: Our second attempt to recreate the experiments in the original SAC paper. We again use 30 runs. The mean performance attained by SAC (EP) closely matches that reported in the original work, but the shaded regions are not as tight. This could be due to the fact that we used more seeds than the original work to evaluate the performance of SAC.

We re-ran the previous experiment using an initial exploration phase of 10,000 steps. From here on, we refer to an agent A with this initial exploration phase as A (EP) and without this initial exploration phase as A (No EP). Figure 14 shows the learning curves over 30 runs for this additional variant of the SAC agent, SAC (EP). The mean performance of this variant closely matches that reported by Haarnoja et al. (2018), although the variability in performance is still noticeably higher. This could be due to the fact that we used 30 seeds while the original work used only 5. Finally, previous work highlighted seed optimization in the RLLab code-base, meaning that results are reported by sweeping over seeds and reporting performance only for the best seeds (Islam et al., 2017). This seed optimization code is compatible with SAC-CB as well. As a ﬁnal attempt to reproduce the results of Haarnoja et al. (2018), we used seed optimization in the hyperparameter tuning process. As a note, this is bad practice; we only conduct seed search in the name of reproduction. In Figure 15, where we chose the best 5 seeds of 30 for each agent, the results more closely match those from the original paper. In particular, SAC (EP) with seed optimization most closely matches the results reported by Haarnoja et al. (2018). We now turn to running an experiment that more closely resembles the principles laid out in this document, especially with respect to reporting performance of tuned baselines. Although we do not know how Haarnoja et al. (2018) tuned the baseline algorithms in their experiments, we found that the performance of DDPG on Half Cheetah was under-reported in this work. In the experiments here, we use the tuned hyperparameters for DDPG as reported by Spinning Up baselines20. Since Gaussian noise is known to outperform Ornstein-Uhlenbeck

20. See https://spinningup.openai.com/en/latest/spinningup/bench.html

Empirical Design in Reinforcement Learning

0 3 Timesteps (Millions)

Average Return

SAC (EP, Seed Search)

SAC (No EP)

SAC (No EP, Seed Search)

Figure 15: When searching seeds, we were able to achieve average performance and shaded regions much closer to that reported in the original SAC work. Solid lines denote average performance, and shaded regions denote minimum and maximum performance.

(OU) noise in some cases (Fujimoto et al., 2018)21, we use uncorrelated, unbounded Gaussian noise for action exploration in DDPG instead of OU noise. Furthermore, we try both SAC and DDPG with an exploration phase at the beginning of the experiment, where an action is drawn uniformly randomly for the ﬁrst 10,000 steps. Similarly to previous experiments, we use the tuned hyperparameters reported by Haarnoja et al. (2018) for SAC. Figure 16 shows the mean learning curves with 95% bootstrap conﬁdence intervals for this tuned version of DDPG and SAC over 30 runs. By tuning the DDPG baseline, we achieved signiﬁcant improvements in performance over what was reported in the original SAC work. It seems No EP DDPG is competitive with and likely better than No EP SAC on Half Cheetah. Furthermore, we see that simply using an initial exploration phase can signiﬁcantly improve the performance of both agents on Half Cheetah.

0 3 Timesteps (Millions)

Average Return

SAC (No EP)

DDPG (No EP)

Figure 16: Our attempt to run an experiment that closely resembles the principles laid out in this document, particularly with respect to tuning baselines. For reference, the original experiments Haarnoja et al. (2018) conducted with SAC on Half Cheetah have been inset. DDPG has been tuned here for Half Cheetah. See text for details.

21. The beneﬁts of Gaussian noise were published in parallel with the SAC paper in 2018. It is therefore reasonable to assume that the authors of SAC did not know the advantages of Gaussian noise.

Patterson, Neumann, White, White

7. Common errors in reinforcement learning experiments

We conclude this paper by outlining a list of common errors in reinforcement learning experiments. Up to now, this document has focused largely on best practices. Naturally, it is jarring to focus too much on all the wrong choices that could be made. But both positive and negative examples of empirical design are needed to become bester empiricists. Nothing in life is black and white; what we identify below are things that generally should be avoided, but sometimes they might be appropriate. Finally, this is meant to be an ever evolving list that we will update continuously in an accompanying online blog post.

Averaging over 3 or 5 runs: sometimes it is OK to use as few as three runs. For example, if the agent and environment are both deterministic. In the vast majority of cases this is a risky practice. Even if we compute the variance across runs we cannot be sure we are not simply under-estimating the variance due to (good) luck.

Reusing code from another source (including hyperparameters) as is: qe are often tempted to simply download some code and run it to get a baseline for our plots or, even worse, simply reuse the data generated by the previous authors. People often defend this practice claiming it is fair and cheap, however, this only makes sense if (1) the implementation is trusted, (2) you know the performance is representative of the baseline algorithm, and (3) the comparisons that you are making are either a simple alteration of the prior code or you are comparing complete systems. In the ﬁrst two scenarios, if the code was written by the authors of the baseline algorithm, then you can typically trust that system-level design decisions will match their paper though not always, see Engstrom et al. (2019). The third condition is about isolating your changes to ensure diﬀerences in performance are due only to your changes (or noise), and not due to unrelated system-level changes. Naturally, if your scientiﬁc question is about system-level changes, then that third condition does not apply!

Untuned agents in ablations: we may take an existing agent and ablate components. The agent was potentially tuned with those components, and there is no reason to believe the same hyperparameters will be reasonable for the agent with components removed or added.

Not controlling seeds: in simulation, we get to control all sources of randomness we should take advantage of that! When we compare two experimental conditions (for instance comparing two algorithms), we should try to minimize the number of diﬀerences between the conditions: make sure random weight initializations are the same, make sure initial state in the environment is the same, make sure the same indices are sampled for the replay buﬀer, etc. These sources of randomness can amount to a huge amount of variance, but if both experimental conditions share some of the same variations then fancy analysis techniques can leverage this joint variance allowing for statistically signiﬁcant claims with fewer runs.

Discarding or replacing runs: sometimes the agent s parameters will diverge in one or more runs. Do not just remove this run or run it again with a diﬀerent random seed! Why did it diverge? Were the hyperparameters too aggressive and this is bad performance? Maybe the agent learned so well it broke the environment. In the case you known failure is due to bad performance, then you could report max performance. This should be done with care, especially if you are reporting normalized performance.

Empirical Design in Reinforcement Learning

Cutting oﬀepisodes early: terminating episodes early can have a large impact on exploration (making the problem easier) and likely the distribution of performance. If cutoﬀs are used, then they should be set to be relatively large. For example, in Mountain Car, an aggressive cutoﬀis 200 steps; a much more reasonable option is at least 10,000 steps.

Treating episode cutoﬀs as termination: setting γ = 0 and making an update when the episode is cutoﬀis incorrect. Consider a cost to goal problem with reward of -1 per step. Termination is good and states close to the terminal state have higher (less negative values). Updating states where episodes are cutoﬀ which may be nowhere near the goal state incorrectly increases the value of those states. Instead, either this last transition should be discarded or the algorithm should bootstrap oﬀthe ﬁnal state before the cutoﬀ. See Section 2.7.

Randomizing start states: the start states in an episodic MDP are part of the problem deﬁnition. In many cases they are chosen for very particular regions (e.g., the bottom of the hill in Mountain Car). Exercise caution when changing a problem speciﬁcation especially if your motivation is to make it easier for your agent.

HARK ing (hypothesis after results known): it is best to specify the setup of your experiment ahead of time and make a hypothesis. It is important to specify the expected outcome (e.g., my new algorithm will decrease the number of steps until the ﬁrst positive reward is observed). This is the scientiﬁc method. Do not do the following. Set up an experiment. Make no hypothesis. Run the experiment and look for a performance measure where your algorithm looks best compared to the others, continually re-running the experiment (possibly on diﬀerent environments each time) until your algorithm wins.

Environment overﬁtting: another pitfall is to produce algorithms overly specialized to a small number of environments. For example, you might take DQN and Pong and continually add new components and hyperparameters until you improve DQN on just Pong. This is unlikely to yield generally useful algorithmic innovations and little to no insight is generated. Another example is the focus on hard exploration games in Atari as extensively investigated by Taïga et al. (2020).

Overly complex algorithms: generally, we are interested in the simplest algorithms that work well, with the fewest components. The alternative a more complex algorithm with similar performance by construction must have components that contribute nothing. The contribution is ﬁguring out which components matter and why. The instinct when a algorithm is not working is to add rather than subtract. By focusing on why it is not working, rather than how to ﬁx it, you are less likely to fall into this pitfall.

Choosing γ incorrectly: one example is choosing γ = 1 for an episodic problem with only a non-zero reward at the goal. The algorithm has no incentive to terminate: an episode of length 10 or 10,000 have the same return. For cost-to-goal problems, with -1 reward per step, it is most sensible to set γ = 1, and then we can report number of steps to goal (not discounted steps to goal). The algorithm itself might still use a γ for learning purposes.

Patterson, Neumann, White, White

Reporting oﬄine performance while making conclusions about online performance: oﬄine performance typically means the agent is subjected to periodic test episodes where the agent is teleported to a new state, learning is paused and we measure the cumulative reward achieved. Online performance is simply cumulative reward achieved during regular operation during exploration and learning. Under oﬄine performance there is no penalty for exploration during learning, whereas in online learning there is and thus the agent must trade-oﬀexploration and exploitation. Machado et al. (2018) discusses this at length.

Invalid errorbars or shaded regions: ﬁrst, whatever method you used to construct intervals, you should understand what assumptions on the underlying data are being made. Second, do not just put error bars around anything. We have already shown how our error bars can be misleading. A ﬁrst warning sign to look out for is if your shaded regions are tighter than the variation in the mean across the learning curve.

Using random problems: they tend to not look much like the problems we care about (e.g., random MDPs). Use with caution.

Not reporting implementation details: one should be able to use your pseudo-code to implement your algorithm. Special tricks used to improve performance are part of the algorithm and should be reported clearly.

Not comparing against stupid baselines: can a much simpler agent or even a random policy solve your environment? Does a simpler agent outperform your agent? It does not matter if prior SOTA algorithms did not use these baselines, you should!

Running ineﬃcient code: this error might not seem pertinent to our experiments. However, if you run ineﬃcient code, it hinders your ability to run careful systematic experiments. It inadvertently causes you to change the question you want to answer. An important ﬁrst step when running experiments is to ensure you have optimized your code, so that you have more ﬂexibility to do more exploratory experiments and feasibly run the ﬁnal experiments for your paper. Always proﬁle your code!

Attempting to run an experiment beyond your computational budget: it is tempting to say it is not possible to do suﬃcient runs, or impossible to carefully test hyperparameters, due to lack of compute. So you conclude it is only feasible to do a smaller number of runs. However, this is not a valid claim. Let us return to our example of animal learning researchers and rats. If a lab can only aﬀord one rat, then it is not acceptable to run experiments using only this one rat. Instead, they simply have to do diﬀerent research. The same is true for us. If you do not have the resources to run a correct experiment (say in Atari), then the alternative is not to run an incorrect one. Instead, the alternative is to run a diﬀerent experiment entirely, one that is feasible and meaningful.

Gatekeeping with benchmark problems: this is more about reviewing than running good experiments, however, there is such a strong connection between the two we would be remiss not to discuss the perils of benchmarking. Experimental results should be evaluated in how they contribute understanding and insight. Algorithms tuned to benchmarks often have many small tricks. It is often unclear which tricks contribute to success. Demonstration results can be helpful especially to inspire progress but not all (or even most) experiments

Empirical Design in Reinforcement Learning

should be demonstration results. The goal of empirical reinforcement learning research is to create new knowledge, not pick winners!

8. Conclusion

The goal of this document was to provide a comprehensive overview of important empirical design decisions in reinforcement learning. The style is educational, with a focus on clear examples and conceptual reasoning for making sound decisions. In the ﬁrst few sections we focused on outlining best practices for evaluating reinforcement learning algorithms. We then revisited an existing result, as a case study, to highlight diﬀerent conclusions in light of these best practices. Much of the work focused on what to do, with some comments about what to avoid. For clarity, therefore, we concluded the work with a more explicit list of common errors to avoid. We hope this document will be useful to help newcomers to reinforcement learning, but also provide novel perspectives for any reinforcement learning empiricists.

Jacob Adkins, Michael Bowling, and Adam White. A method for evaluating hyperparameter sensitivity in reinforcement learning. In Finding the Frame: An RLC Workshop for Examining Conceptual Frameworks, 2024.

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34, 2021.

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254 263. PMLR, 2018.

Leemon Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. Machine Learning Proceedings, pages 30 37, 1995. doi: 10.1016/B978-1-55860-377-6. 50013-X.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artiﬁcial Intelligence Research, 47:253 279, June 2013. ISSN 1076-9757. doi: 10.1613/jair.3912.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. ar Xiv preprint ar Xiv:1606.01540, 2016.

Maxime Chevalier-Boisvert, Lucas Willems, and Pal Suman. Minimalistic Gridworld Environment (Mini Grid). Farama Foundation, 2018.

Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments. ar Xiv:1806.08295, July 2018.

David Colquhoun. The reproducibility of research and the misinterpretation of p-values. Royal society open science, 4(12):171085, 2017.

Patterson, Neumann, White, White

William C Dabney. Adaptive Step-Sizes for Reinforcement Learning. Ph D thesis, 2014.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329 1338. PMLR, 2016.

Olive Jean Dunn. Multiple comparisons among means. Journal of the American statistical association, 56(293):52 64, 1961.

Theresa Eimer, Marius Lindauer, and Roberta Raileanu. Hyperparameters in reinforcement learning and how to tune them. In International Conference on Machine Learning, pages 9104 9149. PMLR, 2023.

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Ma. Implementation Matters In Deep Policy Gradients: A Case Study On PPO and TRPO. International Conference on Learning Representations, page 14, 2019.

Philip J. Fleming and John J. Wallace. How not to lie with statistics: The correct way to summarize benchmark results. Communications of the ACM, 29(3):218 221, 1986.

Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning, 2018. ISSN 2640-3498.

Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, and Martha White. Gradient Temporal-Diﬀerence Learning with Regularized Corrections. International Conference on Machine Learning, 2020.

Dibya Ghosh and Marc G. Bellemare. Representations for Stable Oﬀ-Policy Reinforcement Learning. International Conference on Machine Learning, 2020.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Oﬀ-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International conference on machine learning, pages 1861 1870, August 2018.

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic Algorithms and Applications, January 2019.

Xin He, Kaiyong Zhao, and Xiaowen Chu. Auto ML: A survey of the state-of-the-art. Knowledge-Based Systems, 2021.

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep Reinforcement Learning that Matters. AAAI, page 8, 2018.

J. Fernando Hernandez-Garcia and Richard S. Sutton. Understanding multi-step deep reinforcement learning: A systematic study of the DQN target. ar Xiv preprint ar Xiv:1901.07510, 2019.

Empirical Design in Reinforcement Learning

John N. Hooker. Testing heuristics: We have it all wrong. Journal of heuristics, 1(1):33 42, 1995.

Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control. ar Xiv:1708.04133 [cs], August 2017.

Nathalie Japkowicz and Mohak Shah. Evaluating Learning Algorithms: A Classiﬁcation Perspective. Cambridge University Press, 2011.

Scott M. Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, and Philip S. Thomas. Evaluating the Performance of Reinforcement Learning Algorithms. International Conference on Machine Learning, June 2020.

Scott M Jordan, Adam White, Bruno Castro Da Silva, Martha White, and Philip S Thomas. Position: Benchmarking is limited in reinforcement learning research. 2024.

Seungchan Kim, Kavosh Asadi, Michael Littman, and George Konidaris. Deep Mellow: Removing the Need for a Target Network in Deep Q-Learning. International Joint Conference on Artiﬁcial Intelligence, pages 2733 2739, August 2019. doi: 10.24963/ijcai. 2019/379.

J Z Kolter. The Fixed Points of Oﬀ-Policy TD. Advances in Neural Information Processing Systems, page 9, 2011.

Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. Advances in neural information processing systems, 31, 2018.

Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artiﬁcial Intelligence Research, 61: 523 562, March 2018. ISSN 1076-9757. doi: 10.1613/jair.5699.

Ashique Mahmood. Incremental oﬀ-policy reinforcement learning algorithms. 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. page 9, 2013.

Andrew William Moore. Eﬃcient Memory-Based Learning for Robot Control. Ph D thesis, University of Cambridge, 1990.

Johan Obando-Ceron, João GM Araújo, Aaron Courville, and Pablo Samuel Castro. On the consistency of hyper-parameter selection in value-based deep reinforcement learning. ar Xiv preprint ar Xiv:2406.17523, 2024.

Jack Parker-Holder, Raghu Rajan, Xingyou Song, André Biedenkapp, Yingjie Miao, Theresa Eimer, Baohe Zhang, Vu Nguyen, Roberto Calandra, Aleksandra Faust, et al. Automated reinforcement learning (autorl): A survey and open problems. Journal of Artiﬁcial Intelligence Research, 74:517 568, 2022.

Patterson, Neumann, White, White

Andrew Patterson, Victor Liao, and Martha White. Robust losses for learning value functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1 12, 2022a. ISSN 1939-3539. doi: 10.1109/TPAMI.2022.3213503.

Andrew Patterson, Adam White, and Martha White. A Generalized Projected Bellman Error for Oﬀ-policy Value Estimation in Reinforcement Learning. Journal of Machine Learning Research, 2022b.

Thomas Sellke, M. J. Bayarri, and James O. Berger. Calibration of ρ values for testing precise null hypotheses. The American Statistician, 55(1):62 71, 2001.

Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309 1331, 2008.

Student. The probable error of a mean. Biometrika, pages 1 25, 1908.

Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, pages 1038 1044, 1996.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, November 2018. ISBN 978-0-262-35270-3.

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-diﬀerence learning with linear function approximation. International Conference on Machine Learning, pages 1 8, 2009. doi: 10.1145/1553374.1553501.

Richard S Sutton, A Rupam Mahmood, and Martha White. An Emphatic Approach to the Problem of Oﬀ-policy Temporal-Diﬀerence Learning. Journal of Machine Learning Research, page 29, 2016.

István Szita and András Lorincz. The many faces of optimism: A unifying approach. In Proceedings of the 25th International Conference on Machine Learning, pages 1048 1055, 2008.

Adrien Ali Taïga, William Fedus, Marlos C. Machado, Aaron Courville, and Marc G. Bellemare. On bonus-based exploration methods in the arcade learning environment. ICLR, October 2020.

Gerald Tesauro. Temporal diﬀerence learning and TD-Gammon. Communications of the ACM, 38(3):58 68, 1995.

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pages 5026 5033. IEEE, 2012.

J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-diﬀerence learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674 690, 1997. doi: 10.1109/9.580874.

Empirical Design in Reinforcement Learning

Charles V. Vorhees and Michael T. Williams. Morris water maze: Procedures for assessing spatial and related forms of learning and memory. Nature protocols, 1(2):848 858, 2006.

Han Wang, Erfan Miahi, Martha White, Marlos C. Machado, Zaheer Abbas, Raksha Kumaraswamy, Vincent Liu, and Adam White. Investigating the Properties of Neural Network Representations in Reinforcement Learning, March 2022a.

Han Wang, Archit Sakhadeo, Adam M White, James M Bell, Vincent Liu, Xutong Zhao, Puer Liu, Tadashi Kozuno, Alona Fyshe, and Martha White. No more pesky hyperparameters: Oﬄine hyperparameter tuning for RL. Transactions on Machine Learning Research, 2022b. ISSN 2835-8856.

Ronald L. Wasserstein and Nicole A. Lazar. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2):129 133, April 2016. ISSN 0003-1305. doi: 10.1080/00031305.2016.1154108.

Shimon Whiteson, Brian Tanner, Matthew E. Taylor, and Peter Stone. Generalized domains for empirical evaluations in reinforcement learning. Workshop on Evaluation Methods for Machine Learning at ICML, 2009.

Shimon Whiteson, Brian Tanner, Matthew E. Taylor, and Peter Stone. Protecting against evaluation overﬁtting in empirical reinforcement learning. 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 120 127, April 2011. ISSN 2325-1867. doi: 10.1109/ADPRL.2011.5967363.

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094 1100. PMLR, 2020.

Patterson, Neumann, White, White

Appendix A. Summary of Contributions

The primary goal of this work is to provide a detailed treatment of good empirical practices for reinforcement learning experiments, but along the way we have made several novel contributions. The main text contains a variety of new results, beyond just conceptually reasoning to justify our design choices and recommendations. In this section, we summarize these contributions in two lists, organized by (a) novel proposals and (2) empirical ﬁndings.

The novel methodological proposals include the following:

1. (Section 2.4) Tolerance intervals with median performance can better highlight the instability of an agent within a run, and variability across runs, than our standard approach of using means and conﬁdence intervals (see Figure 5c versus Figure 6).

2. (Section 3.2) An eﬃcient algorithm for estimating idealized performance (in Algorithm F.3), ﬁxing issues with maximization bias and the typical two-stage approach uses for hyperparameters.

3. (Section 4.4) Controlling the source of randomness to reduce variance, without introducing bias. This included (a) looking at diﬀerences in algorithm performance, to enable paired testing for algorithms and (b) separating the random seed for the agent and environment to better control sources of variability (see Example 4.4).

The novel empirical ﬁndings include the following:

1. (Section 2) The performance distributions for our agents can be skewed and bimodal (DQN on Mountain Car in Figure 4, DQN on Puddle World in Section 2.6).

2. (Section 2.7) Episode cutoﬀs can have a large impact on algorithm performance, as demonstrated in Figure 8a. This result highlights that the short cutoﬀs used are making environments easier for our agents.

3. (Section 2.6) A demonstration that common conﬁdence interval approaches can fail to capture the mean, for the long-tailed performance distribution of DQN on Puddle World (Figure 7).

4. (Section 3.1) Momentum likely does not resolve issues with divergence in TD, as a TD with momentum still diverges on Baird s counterexample (see Example 3.1).

5. (Section 3.2) If we run 1000 experiments of DQN on the Mountain Car environment and sweep stepsizes, target network refresh rates, and replay buﬀer sizes for every experiment, then approximately 96% of these experiments will over report the average performance for the best hyperparameter conﬁguration among those tested.

6. (Section 4.1) DQN with default hyperparameters can fail on even simple environments; with a small amount of tuning, it goes from failure to very good performance on Lunar Lander (see Example 4.1). This demonstrated the issue with using untuned baselines and making conclusions about the utility of a new approach.

Empirical Design in Reinforcement Learning

7. (Section 6) A thorough case study applying the methodologies laid out in this work. This case study highlighted issues with reproducing a previous result, issues in that previous experimental design that lead to misleading conclusions and a ﬁnal result that provided a more clear empirical picture of those algorithms (Soft Actor-Critic and DDPG).

Appendix B. Further experimental details

B.1 ESARSA and the Maze Gridworld

For this experiment, we combine the Expected SARSA algorithm (ESARSA) using an ϵ-greedy policy both as the bootstrapping target and as the behavior policy. The agent uses tile-coded features (Sutton and Barto, 2018) mapping the (x, y)-coordinates within the gridworld to a large binary feature vector. The state, action value function estimate is a linear function of the tile-coded features.

Hyperparameter Value Tiles 4 Tilings 8 Stepsize 0.1 ϵ 0.2 Experiment length 30k steps γ 0.99

The environment is shown in Figure 2. The objective is to learn the shortest path to the goal over repeated episodes. The actions are discrete, moving the agent a ﬁxed amount (plus noise) in the continuous two dimensional space. Actions that would move the agent outside the bounds of the world or into a wall cause no change in the state. The observation (and MDP state in this environment) is the x,y position of the agent. There is a ﬁxed start state and goal region, and the reward is +1 for reaching the goal region, which ends the episode, and the reward is zero otherwise. The discount is γ = 0.99. Notice that, for this reward speciﬁcation, the discount needs to be less than 1 to encourage the agent to reach the goal quickly: otherwise, taking 100 or 1000 steps would result in the same episodic return. For this simple environment we designed, we know that the optimal policy can get to the goal in 15 steps, meaning the optimal episodic return is 0.9915 = 0.86.

Appendix C. Computing tolerance intervals

Computing a tolerance interval is simple and is independent of the underlying distribution of the data. We deﬁne an (α, β)-tolerance interval as an interval that captures β proportion of future samples22 (e.g. β = 0.9) with a nominal error rate of α (e.g. 1 α = 0.95). Intuitively, to capture β proportion of future samples, we might think to report upper and lower percentiles of the sample data, such that the percentiles symmetrically capture a β proportion of samples. In the case that β = 0.9, this would correspond to an upper percentile

22. Note that tolerance intervals reason about the population and not the speciﬁc samples being used to compute the interval. Reasoning about the whole population allows us to make inferences about future samples, instead of simply describing past experimental results.

Patterson, Neumann, White, White

u = 0.95 and a lower percentile l = 0.05. A tolerance interval takes this a step further by including an uncertainty correction when we have received a small number of samples, how do we know the 95th percentile of the samples corresponds to the 95th percentile of the true distribution? Tolerance intervals, then, add a slight pessimism by widening the interval based on the number of observed samples. As we observe more samples, the uncertainty decreases and the interval approaches the naive percentile-based approach. To compute the uncertainty corrected percentiles, we make use of the inverse CDF of the binomial distribution. We want to ask the question: for each sample in our dataset, does this sample lie within the middle β proportion of the distribution? Our success rate for the binomial distribution, then, is β and our accepted error rate is α. The inverse CDF provides the number of samples ν that do not belong to the middle β proportion of the distribution. We distribute ν evenly across the top and bottom of the distribution, receiving indices of the sorted data l = ν

2 and u = n ν

2. Note that when ν is odd, these indices will no longer be integers. A common practice is to interpolate evenly between the adjacent indices, or to alternatively take the ﬂoor of the lower index l and the ceiling of the upper index u. The interpolation approach generally provides more accurate bounds for smaller sample sizes.

Appendix D. More about hyperparameter selection

In this section, we provide a more in-depth discussion on hyperparameter selection. It is a topic that could ﬁll an entire textbook, and so we opted to keep only the most basic information in the main body of this document. Here, we dive a bit deeper, to highlight a few other more advanced points.

D.1 Understanding hyperparameter sensitivity for multiple hyperparameters

Characterizing hyperparameter sensitivity with only one hyperparameter is relatively straightforward; it becomes more complex with multiple hyperparameters. The issues are that (1) there is a potentially combinatorial explosion, (2) there are likely interactions between hyperparameters and (3) visualization becomes more diﬃcult. As yet, there is no consensus strategy for understanding the performance of a partially-speciﬁed algorithm with multiple unknown hyperparameters, but we discuss a few here. A basic strategy that only shows variability across hyperparameters rather than interactions between them is to use violin plots. We visualize this in Figure 17, for multiple algorithms. The idea is to select a set of hyperparameter settings, compute performance for each setting, and report the distribution of performance over all settings. This allows us to compare algorithms with diﬀerent hyperparameters, and still ensure they get the same number of hyperparameter settings. The one important nuance here is that, unlike sensitivity plots, selecting a wider range of hyperparameters can be misleading. Visually, violin plots encourage us to assess variability across hyperparameters. If we set the range to be too wide for one algorithm (say our competitor), then it may look sensitive simply because I selected an unreasonable range. Fairly selecting ranges for the hyperparameters should be easier than selecting the hyperparameters themselves, though, and violin plots are a useful tool when assessing performance with multiple unknown hyperparameters.

Empirical Design in Reinforcement Learning

Figure 17: Visualizing the hyperparameter sensitivity of several algorithms on Lunar Landar.

To better understand relationships between hyperparameters, one plausible strategy is to collect a data set where hyperparameters are the independent variables and performance values are the dependent variables. We can then ﬁt a model to this dataset to describe relationships between hyperparameters as well as their relationship to performance. Using such a strategy and a linear regression model, we could ﬁnd for instance linear correlations between multiple hyperparameters. This strategy is related to the Auto ML and Bayesian optimization communities, though with the caveat that our interest is in the model itself while these communities generally use the model as a means to perform optimization (He et al., 2021). We could additionally adopt a classic randomized-control trial approach to understanding the impact of one hyperparameter while others are left as unknowns. With this strategy, we would test the hyperparameter of interest at multiple predeﬁned levels (a hyperparameter sweep) while all other hyperparameters are treated as nuisance variables and, in an RCT experiment, nuisance variables are randomized over. This strategy is highly related to a recently proposed approach (Jordan et al., 2020), with the slight diﬀerence that Jordan et al. (2020) treats all hyperparameters as nuisance variables while we allow one to be controlled.23

The design methodologies discussed in this section are constantly changing as the ﬁeld matures. Currently, we discuss treating hyperparameters as unknowns while performing scientiﬁc analysis and using statistical reasoning to understand their impact. However, as better hyperparameter optimization methodologies are developed and widely adopted, it will no longer be true that hyperparameters are strictly unknowns. Consider the case of supervised learning. Hyperparameters may be initially unknown but a procedure for selecting them is known: cross-validation. As such, it is generally far more accurate and less error-prone to use cross-validation to set hyperparameter values as part of the scientiﬁc analysis instead of treating them as nuisance. Lacking such a well-adopted strategy in reinforcement learning, in the most general case we should continue to consider hyperparameters as unknowns. It is important also to note that tuning over hyperparameters is a luxury we have in our experiments, rather than a general-purpose algorithm to set hyperparameters. We are not advocating here for the use of hyperparameter sweeps to select hyperparameters, as this is

23. In fact, RCT experiments can be easily extended to allow multiple hyperparameters to be controlled. Such an experiment design is often called response surface methodology or factorial design.

Patterson, Neumann, White, White

rarely a viable option for real-world problems, unless the environment is a simulator.24 Instead, the goal of hyperparameter sweeps in our experiment is to get insights into hyperparameter sensitivity, and to identify algorithms that are generally performant and relatively insensitive to their hyperparameters and so more suitable for deployment.

D.2 Developing algorithms and avoiding misleading yourself

Recall we had two possible goals in conducting an experiment: (1) understanding performance with respect to an algorithms hyperparameters, and (2) optimizing the hyperparameters for a speciﬁc problem setting. In this section we discuss something sort-of in between: selecting hyperparameters when you are developing a new algorithm. Similar to the last section, your goal here will be to understand the impact of your new hyperparameters. The primary diﬀerence, however, is that you will likely want to iterate to improve your algorithm. Further, you might be building on other algorithms that already have hyperparameters, and your modiﬁcation to the algorithm might interact with those existing hyperparameters.

Example: Your new algorithm

In some preliminary experiments, you realize that DQN is quite sensitive to its target network update frequency. These exploratory observations inspire you come up with a new algorithm to adapt the update frequency during learning. Your algorithm, however, has a new hyperparameter. Naturally, you want to see the behavior of your algorithm under ideal conditions, where you look at performance for a nearly optimal hyperparameter setting, found either use sweeps or other (smarter) hyperparameter optimization approaches.

This is a reasonable ﬁrst step while developing an algorithm we often want to know does this work at all? however we cannot stop our investigation here. First, it is likely you can improve on DQN by using its same hyperparameter settings for the environment, and tuning over this additional hyperparameter. You may falsely conclude that your algorithm provides beneﬁts, when in actuality the main aﬀect was the ability to optimize over an additional scalar. Second, this modiﬁcation to DQN, and your new hyperparameter, might interact with existing hyperparameter choices in an unexpected ways. In fact, such a result has recently been shown for GANs; many modern GAN architectures provide little or no improvement over a sensibly tuned baseline (Lucic et al., 2018).

Instead, we should immediately perform the follow-up step: examine the performance of your algorithm for multiple values for the new hyperparameter. This is a tricky and nuanced procedure, we cover many of the common cases above in Section 3.1. During the algorithm development cycle, it can be expensive to perform repeated sensitivity studies. We strongly recommend that this is done on smaller, meaningful problem settings where simulation is cheap. Such studies are often called pilot studies and are generally used to inform the design of more complete studies later in the process.

24. There are many groups pursuing algorithm development in reinforcement learning to solve simulation environments, like games or hard search problems. Our goal is to help empirical design more generally in reinforcement learning, not just for this more restricted problem setting, and so we do not consider speciﬁc approaches that can exploit simulators.

Empirical Design in Reinforcement Learning

D.3 Why can t we just use cross-validation?

A natural question is why we cannot handle hyperparameters in the same way they are handled in supervised learning: internal cross-validation. In this section, we explain why internal cross-validation does not directly apply to reinforcement learning. Let us start with a brief refresher of cross-validation. Imagine you have a dataset and want to learn a function f with any regression approach. It is typical to split up this dataset into training, validation and test, where you use the validation set to select hyperparameters and the test set to get an unbiased estimate of the accuracy of the function before ﬁnally deploying. For example, if we want to select a regularization penalty λ {0, 0.1, 0.5}, we learn the function with each regularization penalty on the training set and check which of the three produces best performance on the validation set. Let s say it is 0.1. Then we learn the function on the combined training and validation sets, with regularization penalty 0.1, to get our ﬁnal function f. Finally we evaluate f on the testing set before we deploy it into the real world. We should not have used the testing set at all during the hyperparameter selection or training phases. This procedure, however, is not the best choice for small datasets: you want an estimate of its generalization error, but do not want to split up your already small dataset into a training and testing and/or validation sets. Fortunately, you can use cross-validation to exactly avoid this. Consider ﬁrst a setting where we do not have hyperparameters, and simply want to avoid using a test set to evaluate f before deployment. The idea is to partition the dataset into k folds, and train on each subset of k 1 folds and test on the remaining fold. This procedure generates k functions f1, . . . , fk with corresponding estimates of error e1, . . . , ek. The average of these k errors provides a reasonable estimate of the generalization error of f, even though they are estimates of error for diﬀerent functions. The primary role of cross-validation is to estimate the generalization error of a function, without needing a hold-out set. The same idea can be used to select hyperparameters, in a procedure called internal cross-validation. The reason cross-validation can be used for this purpose is that it allows us to estimate the generalization error of each function learned with diﬀerent hyperparameters. The goal is to pick the function with the best generalization error. The algorithm needs to both (a) specify its own hyperparameters and (b) learn its weights with regression. We can see this as an expanded learning problem, with cross-validation used as the algorithm to identify the hyperparameters. We can use cross-validation on the given dataset to evaluate each hyperparameter setting and pick the one with lowest error. This complete algorithm A inputs a training set and outputs a function f. In summary, external cross-validation is used to estimate test performance before deployment and internal cross-validation is an algorithm. We might use external cross-validation to evaluate A, which itself uses internal cross-validation. External cross-validation gives an estimate of performance, before deploying the model f learned by A on the entire dataset. Note that these are completely separate, and we do not have to use external cross-validation if we use internal cross-validation. For example, to evaluate the ﬁnal function, we can use the more basic dataset split approach described above rather than external cross-validation. We could split our dataset into training and test, use algorithm A on the training set namely we use internal cross-validation on the training set to pick hyperparameters and then we

Patterson, Neumann, White, White

evaluate f on the test set. Similarly, we could use external cross-validation to evaluate our ﬁnal f and algorithm A internally could use a basic training and validation split to pick hyperparameters. This brief refresher should make it more clear why it is not straightforward to use cross-validation in online reinforcement learning. This idea does not directly extend to hyperparameter selection for online reinforcement learning because we evaluate a learning algorithm rather than a learned model. We can only see performance of the learning algorithm once it is in deployment (testing). There is no separate training phase, nor training data, for the online reinforcement learning setting. Overall, there is no obvious, out-of-the-box way to directly use cross-validation for hyperparameter selection.

D.4 Cross-validation-like procedures for the hyperparameter optimizer

We can, however, consider strategies for selecting hyperparameters inspired by these ideas from cross-validation. We propose one such view below. Imagine we care about how our online reinforcement learning agent performs when learning from scratch on a class of environments E none of the below will make any sense if we only care about performance in one environment. We only receive a subset of these environments, Egiven; let s say we have n such given environments. If we have hyperparameters to specify in our algorithm, then we can consider how to use Egiven to specify these hyperparameters. We can think of Egiven like a training set, but not for our algorithm, but rather the hyperparameter optimizer H. Each environment is like a training point. The hyperparameter optimizer H can train on these environment training points, to output a proposed set of hyperparameters ˆh: H(Egiven) = ˆh. These hyperparameters are then deployed, and we hope that they generalize to the larger class of environments. If we had to map to the supervised learning setting, the ˆh is like the deployed function f above. The ability to generalize well depends on the quality of the set Egiven just like in supervised learning and also on the hyperparameter optimization algorithm H. For example, H could be a simple grid search or H could be a Bayesian optimization algorithm. Likely these two algorithms will identify diﬀerent hyperparameters using the set Egiven, and so will result in diﬀerent generalization performance. A grid search algorithm estimates the performance of each hyperparameter choice by computing per-environment performance over multiple runs in each Egiven, and then aggregating that performance into one number. It selects ( learns ) the hyperparameter ˆh with the best cross-environment performance number. We can use a similar idea to (external) cross-validation to estimate how well these hyperparameters ˆh might generalize. How do we know if ˆh generalizes well to other environments? We can keep a hold-out set of environments to test generalization performance. We split Egiven into training and testing environments, and only give the training environments to H to produce ˆh. Then we can test our fully-speciﬁed reinforcement learning algorithm full-speciﬁed because it uses hyperparameters ˆh in the test environments to get a sense of performance. This is where the ideas behind cross-validation help. Just using training-testing splits is data ineﬃcient: we learn ˆh on an even smaller training set, potentially resulting in a ˆh that has worse generalization performance. Instead, we can estimate how well ˆh might do,

Empirical Design in Reinforcement Learning

by looking at generalization performance of all the hyperparameters trained on subsets of the environments. More speciﬁcally, imagine we have n environments. We run H on all n to get ˆh. Now we use (leave-one-out) cross-validation to evaluate ˆh. We run H on all the environments except environment 1, to get ˆh1, then on all the environments except environment 2 to get ˆh2, and so on. We get an estimate of performance for ˆh1 by running the reinforcement learning algorithm with hyperparameter setting ˆh1 on environment 1, to get p1. Then we do the same for ˆh2 on environment 2. The estimate of performance for ˆh is 1

n Pn i=1 pi. If instead we pick k < n folds, say if n = 10 and k = 5, then we would be training on 8 environments and testing on 2. Note that this evaluation procedure does not actually change what hyperparameters are deployed: we are still deploying ˆh. It just lets us get an estimate of the quality of these before deployment.25

The above was only using external cross-validation to estimate generalization performance. But if the hyperparameter optimizer itself has hyperparameters let s call them hyperhyperparameters for lack of a better term then we can exploit the fact that the crossvalidation-like approach above estimates generalization performance. We want to select the hyper-hyperparameters such that they produce hyperparameters ˆh that generalize best. For example, grid search might have a hyperparameter that is the number of runs or number of steps of interaction in the environment. These could be chosen using an internal cross-validation approach. In summary, if we treat Egiven as the dataset for our hyperparameter optimize H, then we can leverage ideas from cross-validation. There is no sensible mapping where internal cross-validation itself is used to select the hyperparameters of the online reinforcement learning algorithm. Internal cross-validation can be used to set the hyper-hyperparameters of our hyperparameter optimizer. If our hyperparameter optimizer does not have any hyperhyperprameters, then we simply run the hyperparameter optimizer on Egiven and deploy the hyperparameters that are found. We can use external cross-validation to estimate the quality of these hyperparameters. It is important to point out that the premise in this section is that we have access to a set of given environments before the algorithm is deployed in an environment in E; this requirement may be hard to satisfy. The standard assumption in supervised learning is that the training set is representative of the testing set. In reinforcement learning, if we are deploying an algorithm to control a physical system, we do not have a set of other physical systems where we can evaluate the agent ﬁrst. A practitioner might actually try to craft this set themselves. Maybe they scour the literature for environments that resemble their environment, such as environments in Mujoco or simulators for other real-world problems. They might then evaluate their algorithm and tune the hyperparameters in those environments. However, such a procedure has not been explicitly proposed in the literature, nor is it standard practice. It would be interesting to explore such ideas, and understand the pitfalls, to get to a similar place that supervised learning is in with cross-validation.

25. Note that this procedure has never been used, to the best of our knowledge, and so it is not clear that it enjoys the same nice properties as cross-validation. It might have unexpected sources of bias. We are not necessarily advocating that we use the above procedure, but rather trying to bring clarity by showing potential connections to standard algorithms in supervised learning.

Patterson, Neumann, White, White

Appendix E. Aggregating environments

Individual environments can also be grouped, to provide a macro-environment for which we examine performance of an algorithm. We run the algorithm on each environment in the group, but consider the aggregate performance across environments. By thinking of this environment grouping as a macro-environment, it encourages curation of reasonable groupings to test diﬀerent properties of the algorithms. If we group environments with low-dimensional inputs and have a separate group for image-based inputs, then we can test the algorithms separately on these two macro-environments to understand behavior on MDPs with these two diﬀerent properties. Grouping environments, therefore, can improve on benchmarks by making experiments more issue-oriented. It also has the additional beneﬁt that we can test the algorithm in a larger set of environments, without having to do as many runs for each environment. If there are 5 environments in the macro-environment, it may be suﬃcient to use 10 runs for each, giving a total of 50 runs for aggregate performance in the macro-environment. Because we are not making claims about performance within each environment, fewer runs per environment are acceptable. It can still be reasonable to visualize individual runs in the environments, for qualitative insights such as those related to instability within runs. Performance plots, however, should likely only be reported for the macro-environment. We visualize what performance could look like, in Figure 18 for each environment when there are too few runs, resulting in highly-overlapping errors bars, whereas aggregate performance across runs provides clear diﬀerences between the two algorithms. Reporting such aggregate performance, however, raises the issue of how to ensure performance is normalized across environments. Such normalization could be hand-designed by the experimenter, to ensure each macro-environment has a sensible distribution. For example, the experimenter may know the optimal expected return G and worst-case return G in each environment. The observed returns Gt could be normalized between 0 and 1 using

If this number is 1, the return is optimal; if it is zero, the return was the worst-case. Special care should be taken when using ratio-based scaling methods, as these can be sensitive to relative size of the scale (Fleming and Wallace, 1986; Jordan et al., 2020). In addition to creating such macro-environments, there are already several environment suites that have been designed to have related but diﬀerent environments. Examples include Atari (Mnih et al., 2013), Metaworld (Yu et al., 2020) and Mini Grid (Chevalier-Boisvert et al., 2018). Earlier work also considered minor variations of the same environment, such as Mountain Car with diﬀerent perturbations to transitions, observation transformations and start-state conditions (Whiteson et al., 2011).

Appendix F. Understanding our agents with multiple evaluation metrics

For most of this work we have been relatively agnostic to the choice of evaluation metric. In the initial section we discussed the typical evaluation metric: the return. After seeing how to measure this evaluation metric in one run for one agent, we discussed strategies to assess

Empirical Design in Reinforcement Learning

Figure 18: Comparing multiple agents on multiple environments small seed regime. Bootstrap conﬁdence intervals around average performance (y-axis) across environments (labeled Macro) versus for each independent environment. It is diﬃcult to assess an ordering of ﬁve algorithms when investigating environments individually, however, a clearer ordering emerges when measuring performance across environments jointly. Also notice the conﬁdence intervals for the Macro environment are much more narrow. All intervals on individual environments are constructed with 10 samples, while the aggregate intervals for the macro-environment use 60 samples each.

Patterson, Neumann, White, White

an algorithm across multiple runs by aggregate this evaluation metric across runs. These aggregation strategies were generic, and apply to other evaluation metrics. In this section, we discuss alternative evaluation metrics that could help you assess an algorithm. In some cases, the expected return is not a suitable evaluation metric. Instead, for example, you might want to know if your agent maintains a certain level of performance once reaching that level. More generally, using multiple metrics can provide a more complete picture of the properties of your algorithm. Some metrics may be behavioral, rather than performance-based. For example, it may be useful to understand state-visitation for an agent. We distinguish such behavioral metrics from performance metrics, and discuss options for both in the next two subsections.

F.1 Measuring the performance of an agent

In order to compare, rank, describe, and tune our algorithms thus far, we have described the performance of agent using the total episodic return obtained by that agent over time. In fact, the overwhelming majority of research in reinforcement learning has focused on describing how well an RL agent solves a given problem. However, there are many possible attributes of our agents which we can measure and many diﬀerent metrics we can consider for each attribute. A fruitful path towards understanding an agent or an algorithm is developing a multidimensional view of that agent s behaviors and internal processes. There have been countless metrics and attributes studied in the machine learning literature certainly far more than we could sensibly explore here however we will describe a few important choices which can serve as a starting point. Perhaps the most simple deviation from measuring the total episodic return of an agent is applying a weighting to each episodic return based on the number of learning steps within that episode. Consider an environment where an agent spends ten thousand steps to complete the ﬁrst episodic, such as the Mountain Car environment. When the agent starts the second episode, it has already accumulated ten thousand experiences and has likely performed ten thousand updates to its value function. Comparatively, an agent which solves the ﬁrst episode in only a few hundred steps, it starts its second episode with far less experience and far fewer updates. Comparing these two agents episode-by-episode means comparing agents with wildly diﬀering amounts of experience, which may not be a meaningful comparison! We could, instead, compare each agent at each update timestep, ignoring episode boundaries. Our performance measure, then would need to assign a performance value at every step. A simple proposal: for each timestep of an episode, record the total return observed for that episode. If an agent takes ﬁfty steps to complete an episode and yields a return of -100 at the end of the episode, then the agent would record a performance value of [ 100, 100, . . . , 100] repeated ﬁfty times once for every step of the episode. There are many interesting questions we can ask about the agent which are not directly tied to performance, such as: what representations are learned by our agent? Some attributes of interest include the capacity of the representation what functions can learned? the eﬃciency of the representation are the duplicate features? or the amount of interference observed by the representation for each new observation (Wang et al., 2022a). Although it may not be immediately clear that these additional measurements correlate with how well an agent solves a task, these can provide distinguishing information between individual agents.

Empirical Design in Reinforcement Learning

Over time, and with many such measurements, we might begin to notice patterns such as: agents with high degrees of interference tend to learn quickly, but fail to adapt to small changes in the environment. These correlations and relationships provide novel avenues for algorithm development.

F.2 Summarizing performance over time

The performance of an agent evolves with time, as do other attributes of our agents; we want our evaluation to capture this. On the other hand, it is likewise challenging to rank, compare, or tune algorithms when our performance metric is a stochastic process or a list of numbers. Quite often, we need to provide a scalar summary which describes the measurements that we have taken over time. A common choice referred to as area under the curve is to compute the total amount of an attribute scaled by the number of observations, the average over time. A persistent challenge in providing a single summary scalar, however, is that nuances in the learning curve will be lost. A time average of the learning curve is akin to ﬁtting a horizontal over the curve, which clearly loses the fact that the curve might increase or decrease steadily with time or might have sharp drops in performance on occasion. Time averages also ignore the fact that learning curves can often be broken into two (or more) phases, generally a rapid upward slope during the early learning process, then a progressive plateau towards the end of learning. Depending on the research question at hand, it can often be desireable to reason about only one of these phases, for instance by computing a time average over the last 10% of the learning curve in order to discuss where the algorithm generally plateaus. There has been some work automatically identifying these phases, by ﬁtting piecewise linear functions to the learning curve (Dabney, 2014, Chapter 3). As usual, there is no right answer, and as the experiment designer you need to make an appropriate choice for your setting. Once you have isolated the evaluation phase, it may be reasonable to use alternatives to the average to summarize this portion of the curve. One option is to evaluate the worst case performance during evaluation; does your agent consistently perform reasonably well or does it exhibit occasional large drops in performance? This evaluation may be more suitable for settings where it is key to maintain reasonable performance for almost all episodes such as in medical applications where the preference is to ensure most patients get a reasonable treatment, rather than obtaining good outcomes on average across patients. The average could be maximized by having very good treatments for some patients and very poor ones for others. As another example, if the agent is learning to land a helicopter, then we expect that it should never crash the helicopter in the evaluation phase. Reporting worst-case performance in the evaluation phase, rather than average, might better reﬂect the desired performance for this agent. One challenge in summarizing learning curves is that there are no agreed-upon deﬁnitions for learning speed and stability. We can provide strategies to operationalize these concepts, however without concrete deﬁnitions it can be challenging to ensure our proposed summaries perfectly reﬂect these properties. For instance, perhaps we deﬁne learning speed as how quickly the agent reaches a reasonable policy. To measure this, we can deﬁne a threshold of reasonable performance usually coming from domain knowledge and measure how many learning steps it takes to cross this threshold. Similarly, we might deﬁne stability as staying

Patterson, Neumann, White, White

above that threshold of performance once it has been achieved; measured by counting how often the agent dips below this threshold in the evaluation phase. Does the agent stay above this performance level, or does it often regress and require re-learning? These particular measures present a challenge, however. What if due to stochasticity our agent manages to cross our designated threshold of performance very early in learning, long before it has actually learned a sensible policy? We might spuriously decide this agent is unstable, when in reality if we had selected a slightly later cutoﬀpoint for the evaluation phase we would have seen this agent remain stable. To avoid this potential stochasticity, we can alter our measure to require, say, three consecutive steps above the threshold before we say the agent has crossed from the learning phase to the evaluation phase. The choice of 3 is not optimal, nor theoretically motivated for either stability or learning speed. Instead, it is a choice motivated by the fact that we would like to be robust to stochasticity, but do not want to use too many consecutive steps as then we are measuring reached reasonable performance and was stable rather than reached reasonable performance . These choices have to be made carefully, and potentially revisited. For this reason, it can be useful to provide summary performance metrics, but also provide learning curves even showing each run to allow for more detailed viewing for a reader that would like to dig deeper. Our job as authors is to provide insightful summaries without overloading the reader, so such detailed information may be better included in a supplement.

F.3 Oﬄine returns versus online returns

We have so far exclusively considered online returns. It is common, however, to report oﬄine returns. At each measurement step, we take the current policy and do multiple rollouts in the environment, to estimate it s current expected return. This estimate is used as the performance metric at that measurement step, plotted in the learning curve. Reporting online returns corresponds to the online learning setting, whereas plotting oﬄine returns corresponds to the pure exploration setting. In the online learning setting, the agent is faced with the exploration-exploitation dilemma. It is being evaluated by how much reward it receives while learning, and so it may want to take the action it currently thinks will achieve most reward (act greedily). But, it also needs to spend some time exploring, to ensure it has not settled on a suboptimal policy and so is missing out on more reward. An agent that balances these well will perform well according to online returns. For oﬄine returns, the agents behavior during learning is not evaluated. In that sense, it does not face the exploration-exploitation dilemma. Instead, the behavior is faced with the pure exploration question: what actions should it take to learn the target policy (near-optimal policy) on that generated data? The learning curve reﬂects: if the agent was able to have a pure learning phase for t steps, and then deployed its ﬁxed policy, here is how that ﬁxed policy would perform. Without looking carefully at a plot and its description, learning curves for these two settings can look similar. But these two settings are fundamentally diﬀerent. It is important to motivate why you chose online or oﬄine returns in evaluation.

Empirical Design in Reinforcement Learning

Key insights: understanding agents with multiple evaluation metrics

1. Multi-dimensional analysis is key for truly understanding algorithm performance.

2. Do consider multiple performance metrics to understand your algorithm. You need not report all of these metrics; an exploratory phase helps guide ﬁnal experiments.

3. Do not optimize over multiple performance metrics, simply to ﬁnd which measure results in your algorithm having better performance than another. After the exploratory phase, decide what you want to measure. For ﬁnal experiments reported in a paper, comparing your algorithm to others, you should consider additional/diﬀerent environments from those used in the exploratory phase.

4. Oﬄine and online returns are measuring very diﬀerent things. Make the choice between these judiciously.

5. There are no clear-cut answers for deﬁning and selecting evaluation metrics. You need to clearly justify your choices. A ﬁrst criteria for the choice is whether you can convince yourself.

6. Most of our experiments should include behavioural metrics. We often test our agents in simple environments, where the goal is to understand our algorithms rather than obtain a solution in that MDP.