# Large Batch Experience Replay

Thibault Lahire¹, Matthieu Geist², Emmanuel Rachelson¹

¹ISAE-SUPAERO, Université de Toulouse, France. ²Google Research, Brain Team. Correspondence to: Thibault Lahire.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## Abstract

Several algorithms have been proposed to sample the replay buffer of deep Reinforcement Learning (RL) agents non-uniformly in order to speed up learning, but very few theoretical foundations for these sampling schemes have been provided. Among others, Prioritized Experience Replay appears as a hyperparameter-sensitive heuristic, even though it can provide good performance. In this work, we cast the replay buffer sampling problem as an importance sampling one for estimating the gradient. This allows deriving the theoretically optimal sampling distribution, yielding the best theoretical convergence speed. Elaborating on the knowledge of the ideal sampling scheme, we exhibit new theoretical foundations of Prioritized Experience Replay. The optimal sampling distribution being intractable, we make several approximations that provide good results in practice and introduce, among others, LaBER (Large Batch Experience Replay), an easy-to-code and efficient method for sampling the replay buffer. LaBER, which can be combined with Deep Q-Networks, distributional RL agents or actor-critic methods, yields improved performance over a diverse range of Atari games and PyBullet environments, compared to the base agent it is implemented on and to other prioritization schemes.

## 1. Introduction

In deep Reinforcement Learning (Sutton & Barto, 2018, RL), neural network policies and value functions can be learnt with stochastic gradient descent algorithms (Robbins & Monro, 1951, SGD) that sample an experience replay memory (Lin, 1992). This replay memory, or replay buffer, stores the transitions encountered while interacting with the environment. SGD-based algorithms exploit such buffers to learn relevant functions, such as the Q-function in the case of Deep Q-Networks (Mnih et al., 2015, DQN), the return distribution for distributional approaches (Bellemare et al., 2017), or an actor and a critic (Lillicrap et al., 2016; Haarnoja et al., 2018) in the case of continuous state-action space problems. Most deep RL algorithms boil down to a sequence of SGD-based, supervised learning problems. Ideally, one would like to minimize the corresponding loss functions with as few gradient steps as possible.

Prioritized Experience Replay (Schaul et al., 2016, PER) was introduced for the DQN algorithm as a heuristic that accelerates the learning process, drawing inspiration from Prioritized Sweeping (Moore & Atkeson, 1993). Each transition in the replay buffer is assigned a priority proportional to its temporal difference (TD) error, and gradients are estimated by sampling according to these priorities. Later, PER was combined with other DQN improvements, including distributional RL (Bellemare et al., 2017; Hessel et al., 2018), and was applied to actor-critic methods (Wang & Ross, 2019). However, the reason why prioritizing on TD errors can provide better performance remains theoretically unexplained. For distributional value functions, using the loss value as a priority, as suggested by Hessel et al. (2018), also lacks foundations.
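To fix ideas before the formal treatment in Section 3, the short NumPy sketch below shows the generic mechanics of priority-proportional sampling: stored priorities are normalized into a distribution, a mini-batch is drawn from it, and importance weights of the form $1/(N p_i)$ are attached to correct for the non-uniform sampling. The buffer contents and sizes are placeholders, and the original PER actually uses annealed weights $(N p_i)^{-\beta}$ normalized by their maximum rather than the plain $1/(N p_i)$ correction derived later in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder replay buffer: N transitions, each with a stored priority
# (e.g. the absolute TD error observed the last time it was sampled).
N, B = 10_000, 32
priorities = rng.uniform(0.0, 1.0, size=N)

# Normalize priorities into a sampling distribution and draw a mini-batch.
probs = priorities / priorities.sum()
batch_idx = rng.choice(N, size=B, p=probs)

# Importance weights 1 / (N * p_i) correct the non-uniform sampling so that
# the mini-batch gradient estimate stays unbiased w.r.t. the uniform one.
weights = 1.0 / (N * probs[batch_idx])
```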
Lastly, in actor-critic methods, when two critics are used (Fujimoto et al., 2018; Haarnoja et al., 2018), two TD errors are available and the appropriate way to draw mini-batches has, again, not been clearly established. In this work, we provide theoretical foundations for the concept of prioritization used in RL and derive the associated algorithms, shedding light on existing approaches along the way.

SGD relies on the fact that the loss minimized when learning a neural network is an integral quantity over a certain distribution. Hence, the gradient of this loss is also such an integral, which can be approximated via Monte Carlo estimation with a finite number of samples. The variance of this Monte Carlo estimate can be reduced using importance sampling (Rubinstein & Kroese, 2016), yielding better convergence speed. Consequently, non-uniform sampling of mini-batches for gradient estimation can be cast as an importance sampling problem over the replay buffer, as in the work of Needell et al. (2014). The Supervised Learning literature provides links between sampling, the variance of the stochastic gradient estimate, and the convergence speed of the SGD algorithm (Wang et al., 2017): the smaller the variance of the stochastic gradient estimate, the faster the convergence. This is particularly appealing in the context of Approximate Dynamic Programming (ADP), which encompasses the vast majority of (deep) RL algorithms (e.g. DQN, DDPG (Lillicrap et al., 2016), SAC (Haarnoja et al., 2018) and their variations). ADP is very sensitive to approximation errors, and only a few noisy gradient steps are taken at each Dynamic Programming iteration in modern deep RL algorithms.

The optimal sampling distribution, proportional to the per-sample gradient norms, is intractable, and approximations have been proposed (Loshchilov & Hutter, 2016; Katharopoulos & Fleuret, 2018). We show that PER is a special case of such approximations in the context of ADP, and propose better sampling schemes, theoretically grounded and less sensitive to hyperparameter tuning, that naturally extend to any value function representation, including distributional Q-functions or twin critics. In this work, we show that the main issue with PER is its outdated priorities, and introduce the Large Batch Experience Replay (LaBER) algorithm, which constitutes our main algorithmic contribution.

This paper is structured as follows. Section 2 covers related work on prioritization for RL and importance sampling for SGD. Section 3 then casts the gradient estimation problem in the light of importance sampling and proposes algorithms to mimic the intractable optimal sampling distribution; along the way, it provides theoretical foundations for PER. Section 4 empirically evaluates the proposed sampling schemes, in particular the LaBER algorithm, and discusses each aspect of sampling separately, along with explanations and perspectives. Section 5 summarizes and concludes.

## 2. Background and Related Works

In the standard RL framework, one searches for the optimal control policy when interacting with a discrete-time system behaving as a Markov Decision Process (Puterman, 2014). At time step $t$, the system is in state $s_t \in S$ and, upon applying action $a_t \in A$, it transitions to a new state $s_{t+1}$ while receiving reward $r_t$. A policy $\pi$ is a function mapping states to distributions over actions, whose performance can be assessed through its Q-function $Q^\pi(s, a) = \mathbb{E}\left[\sum_t \gamma^t r_t \mid s_0 = s, a_0 = a, a_t \sim \pi(s_t)\right]$.
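For concreteness, the definition of $Q^\pi$ can be unpacked into a naive Monte Carlo estimator that rolls the policy out from $(s_0, a_0)$ and averages discounted returns. The sketch below assumes a hypothetical environment interface (`env.reset_to(state)` and `env.step(action)` returning `(next_state, reward, done)`); it is only meant to illustrate the expectation above, not to describe any algorithm from the paper.

```python
import numpy as np

def mc_q_estimate(env, policy, s0, a0, gamma=0.99, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of Q^pi(s0, a0) = E[sum_t gamma^t r_t | s0, a0, a_t ~ pi]."""
    returns = []
    for _ in range(n_rollouts):
        s, a = env.reset_to(s0), a0          # hypothetical reset-to-state interface
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(a)         # hypothetical step interface
            g += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)                    # policy maps states to actions
        returns.append(g)
    return float(np.mean(returns))
```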
The goal of RL is to find a policy $\pi^*$ that has the largest possible Q-function $Q^* = Q^{\pi^*}$. Policy $\pi$'s Q-function obeys the fixed-point equation $Q^\pi(s, a) = \mathbb{E}_{s', r}\left[r + \gamma Q^\pi(s', \pi(s'))\right]$. Similarly, the optimal Q-function obeys $Q^*(s, a) = \mathbb{E}_{s', r}\left[r + \gamma \max_{a'} Q^*(s', a')\right]$. These are called the Bellman evaluation and optimality equations, respectively. They are often summarized by introducing the evaluation and optimality operators on functions: $Q^\pi = T^\pi Q^\pi$ and $Q^* = T^* Q^*$, respectively. These operators are contraction mappings, which implies that the sequence $Q_{n+1} = T Q_n$ converges to the fixed point of the operator $T$: $Q^\pi$ for $T = T^\pi$, and $Q^*$ for $T = T^*$.

The optimality and evaluation equations are cornerstones of the vast majority of RL algorithms, as they underpin respectively the search for optimal value functions in Value Iteration methods and the evaluation of current policies in Policy Iteration and Policy Gradient methods (including actor-critic ones). Approximating $T Q_n$ is thus a key issue in Reinforcement Learning and gives rise to the family of Approximate Dynamic Programming (ADP) methods. ADP is known to suffer from the approximation errors incurred by the family of Q-functions used (Munos, 2003; 2005; Scherrer et al., 2012). In particular, it is well known that ADP does not converge, but rather that $\|Q_n - Q^*\| = O(\epsilon / (1 - \gamma)^2)$ for a large enough $n$, where $\epsilon$ is an upper bound on the approximation error of $Q_n$. Approximating $Q_{n+1}$ from samples of $T Q_n$ is a Supervised Learning problem with error at most $\epsilon$; therefore, reducing $\epsilon$ immediately translates into better ADP algorithms.

Deep Q-Networks (Mnih et al., 2015, DQN) is the approximate Value Iteration algorithm that uses a replay buffer of $N$ samples $(s, a, r, s')$, a deep neural network $Q_\theta$, and a few steps of gradient descent to approximate $T^* Q_n$. In this context, $Q_n$ is called a target network and is periodically replaced by $Q_\theta$. Specifically, at each training step, DQN aims to take a gradient step on the loss $L_n(\theta) = \|Q_\theta - T^* Q_n\|^2$, or more generally $L_n(\theta) = \int_{S \times A} \ell(Q_\theta(s, a), T^* Q_n(s, a)) \, d\rho(s, a)$, with $\rho$ the state-action data distribution. Since this loss is an integral quantity, it can be estimated by Monte Carlo sampling, which defines the empirical loss $\frac{1}{N} \sum_{i=1}^N \ell(Q_\theta(x_i), y_i)$, with $x_i = (s_i, a_i)$ and $y_i = r_i + \gamma \max_{a'} Q_n(s'_i, a')$. Minimizing this empirical loss by SGD amounts to drawing, at each step, a mini-batch of $B$ transitions from the replay buffer and taking a descent step $\theta_{t+1} = \theta_t - \eta d$ in the direction of the gradient estimate $d = \frac{1}{B} \sum_{i=1}^B \nabla_\theta \ell(Q_\theta(x_i), y_i)$, with learning rate $\eta$. Similar targets and loss functions have been proposed in the context of distributional RL (Dabney et al., 2018b;a) and for learning critics in policy gradient methods (Fujimoto et al., 2018; Haarnoja et al., 2018).

As is common in SGD algorithms, the mini-batch is drawn uniformly with replacement from the replay buffer of size $N$, yielding an unbiased estimate of $\nabla_\theta L_n(\theta)$. The question of sampling the replay buffer non-uniformly has been raised from an empirical perspective in RL and has led to heuristics such as PER and its variants (Horgan et al., 2018; Fujimoto et al., 2020). In order to put more emphasis on $(s, a)$-pairs that feature a large approximation error of $T^* Q_n$, PER assigns each transition of the replay buffer a priority $(|\delta_i| + c)^\alpha$ based on the TD error $\delta_i = Q_\theta(x_i) - y_i$, with $\alpha$ and $c$ two non-negative hyperparameters.
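A minimal PyTorch sketch of this target and priority computation is given below. `q_net` and `target_net` are assumed to map a batch of states to a `(batch, n_actions)` tensor of Q-values, `batch` is a hypothetical dictionary of tensors (`s`, `a`, `r`, `s_next`, `done`), and the values of `alpha` and `c` are illustrative defaults, not prescriptions from the paper.

```python
import torch

def td_errors_and_priorities(q_net, target_net, batch, gamma=0.99, alpha=0.6, c=1e-2):
    # DQN targets y_i = r_i + gamma * max_a' Q_n(s'_i, a'), with the usual
    # (1 - done) mask so that terminal transitions bootstrap to zero.
    with torch.no_grad():
        next_q = target_net(batch["s_next"]).max(dim=1).values
        y = batch["r"] + gamma * (1.0 - batch["done"]) * next_q

    # TD errors delta_i = Q_theta(s_i, a_i) - y_i.
    q_sa = q_net(batch["s"]).gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    delta = q_sa - y

    # PER-style priorities (|delta_i| + c) ** alpha.
    priorities = (delta.abs() + c) ** alpha
    return delta, priorities
```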
At each iteration, PER samples a mini-batch according to the probability distribution induced by the list of priorities, performs a gradient step and updates the priority of the selected samples. Even though PER lacks theoretical foundations, a large number of publications experimentally demonstrate its benefits. Notably, the ablation study of Hessel et al. (2018) showed PER to be one of the most critical improvements over DQN. Horgan et al. (2018) reuse the idea of prioritization according to the TD errors in a distributed framework of agents working in parallel. As a follow-up on PER, Fujimoto et al. (2020) propose new sampling schemes based on the TD errors and adjusted by the loss, whereas Gruslys et al. (2018) consider a prioritization on sequences of transitions. Other methods to select a mini-batch have been proposed. To emphasize recent experience and draw a mini-batch of size B, Zhang & Sutton (2017) sample uniformly B 1 transitions from the replay buffer and add the last experience tuple collected along the trajectory. Similarly, Wang & Ross (2019) have observed a faster convergence of the soft actor-critic algorithm (Haarnoja et al., 2018, SAC) by sampling more frequently the recent experiences collected, which can be seen as a smooth version of the sampling proposed by Zhang & Sutton (2017). Li et al. (2019) try to fill each mini-batch with samples taken from every region of the state-action space S A, thus emulating a uniform distribution across S A within the replay buffer. Conversely to our contribution, where we work on the replay buffer distribution as it is, Dis Cor Kumar et al. (2020) selects less often errorful target values, leading to better models. Despite titles mentioning experience replay, many papers are quite weakly related to the idea of non-uniform sampling of mini-batches. For instance, Hindsight Experience Replay (Andrychowicz et al., 2017) is an intrinsic motivation method. Similarly, Fedus et al. (2020) study the relations between the replay buffer size and the frequency of gradient steps, as well as the crucial importance of n-steps returns, but without considering the question of how to sample minibatches. In Supervised Learning, the links between non-uniform sampling of training sets, variance of the stochastic gradient estimate and speed of convergence have notably been studied by Needell et al. (2014); Zhao & Zhang (2015); Wang et al. (2017). As shown therein, the smaller the variance of the gradient estimate, the better the convergence speed. Importance sampling (Rubinstein & Kroese, 2016) can be used to reduce variance while keeping these estimates unbiased. Hence, non-uniform sampling schemes can be designed to reduce the variance of the estimate and accelerate the convergence. For an SGD update in its simplest form, the ideal sampling scheme p is proportional to the per-sample gradient norm (Needell et al., 2014). Computing the optimal sampling distribution requires computing all per-sample gradients. For large training sets (such as RL replay buffers), this task is prohibitively costly. Alain et al. (2016) deploy heavy computational resources to compute the optimal distribution, using clusters of GPUs. Another possibility is to shift from p , as developed by Loshchilov & Hutter (2016), where the sampling is proportional to a loss ranking. Ensuring convergence in convex cases, Stochastic Variance Reduced Gradient (Johnson & Zhang, 2013) is a state-of-the art algorithm using importance sampling. 
Recently, Katharopoulos & Fleuret (2018) proposed an upper bound on the per-sample gradient norm which is fast to compute and can be used as a surrogate for $p^*$.

## 3. Gradient Variance Minimization

### 3.1. Importance Sampling Distributions and Approximations in PER

SGD aims at minimizing the empirical loss, given the current replay buffer, as a proxy for the true loss. Hence, at each training step, plain SGD samples a mini-batch of $B$ transitions uniformly from the replay buffer, in order to approximate the gradient of the empirical loss, $\frac{1}{N} \sum_{i=1}^N \nabla_\theta \ell(Q_\theta(x_i), y_i)$. Let $u$ be the uniform discrete distribution over the items in the replay buffer. The gradient can then be written as an expectation over these items:

$$\frac{1}{N} \sum_{i=1}^N \nabla_\theta \ell(Q_\theta(x_i), y_i) = \sum_{i=1}^N u_i \nabla_\theta \ell(Q_\theta(x_i), y_i) = \mathbb{E}_{i \sim u}\left[\nabla_\theta \ell(Q_\theta(x_i), y_i)\right].$$

This expectation can be estimated by an (unbiased) empirical mean over a mini-batch of $B$ samples:

$$\mathbb{E}_{i \sim u}\left[\nabla_\theta \ell(Q_\theta(x_i), y_i)\right] \approx \frac{1}{B} \sum_{i=1}^B \nabla_\theta \ell(Q_\theta(x_i), y_i), \quad i \sim u.$$

Let $p$ be any probability distribution over the items of the replay buffer such that $p_i \neq 0$ for all $i$. Importance sampling gives alternate unbiased estimates of the empirical loss gradient:

$$\mathbb{E}_{i \sim u}\left[\nabla_\theta \ell(Q_\theta(x_i), y_i)\right] = \mathbb{E}_{i \sim p}\left[\frac{u_i}{p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i)\right] = \mathbb{E}_{i \sim p}\left[\frac{1}{N p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i)\right].$$

This can also be estimated by an empirical mean:

$$\mathbb{E}_{i \sim u}\left[\nabla_\theta \ell(Q_\theta(x_i), y_i)\right] \approx \frac{1}{B} \sum_{i=1}^B \frac{1}{N p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i), \quad i \sim p.$$

The update equation for $\theta$ becomes:

$$\theta_{t+1} = \theta_t - \eta \frac{1}{B} \sum_{i=1}^B \frac{1}{N p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i). \tag{1}$$

We define $G_i = w_i \nabla_\theta \ell(Q_\theta(x_i), y_i)$ with $w_i = 1/(N p_i)$ for any sampling scheme $p$. The gradient of the empirical loss is thus precisely $\mathbb{E}_{i \sim p}[G_i] = \sum_{i=1}^N p_i G_i$. Following the notations of Wang et al. (2017) or Katharopoulos & Fleuret (2018), let us define the convergence speed $S$ of SGD under a sampling scheme $p$ as

$$S(p) = -\mathbb{E}_{i \sim p}\left[\|\theta_{t+1} - \theta^*\|_2^2 - \|\theta_t - \theta^*\|_2^2\right].$$

Wang et al. (2017) show that

$$S(p) = 2\eta (\theta_t - \theta^*)^T \mathbb{E}_{i \sim p}[G_i] - \eta^2 \mathbb{E}_{i \sim p}[G_i^T G_i].$$

In this work, we call the term $\mathbb{E}_{i \sim p}[G_i^T G_i]$ the variance of the gradient estimate; it is linked to the covariance matrix $\mathrm{Var}_{i \sim p}[G_i]$ by $\mathbb{E}_{i \sim p}[G_i^T G_i] = \mathrm{Tr}(\mathrm{Var}_{i \sim p}[G_i]) + \mathbb{E}_{i \sim p}[G_i]^T \mathbb{E}_{i \sim p}[G_i]$. It is thus possible to improve the theoretical convergence speed by sampling from the distribution that minimizes $\mathbb{E}_{i \sim p}[G_i^T G_i]$. Indeed, the term $\mathbb{E}_{i \sim p}[G_i]$ is constant with respect to $p$, since $\mathbb{E}_{i \sim p}[G_i] = \sum_{i=1}^N p_i G_i = \sum_{i=1}^N p_i \frac{1}{N p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i) = \mathbb{E}_{i \sim u}[\nabla_\theta \ell(Q_\theta(x_i), y_i)]$, with $u$ the uniform distribution such that $u_i = 1/N$. The optimal distribution is $p^*_i \propto \|\nabla_\theta \ell(Q_\theta(x_i), y_i)\|_2$, the per-sample gradient norm. This derivation is recalled in Appendix A.

The optimal sampling scheme requires computing $\nabla_\theta \ell(Q_\theta(x_i), y_i)$ for all items in the replay buffer, which is too costly to be used in practice. Indeed, computing per-sample gradients requires a forward and a backward pass over the whole replay buffer before each gradient step. Starting from this observation, we first put what is done in PER into perspective, and then explore two new sampling strategies.

At a given time step, PER computes TD errors as priorities only for the samples selected in the mini-batch, and keeps the priorities of all other samples in the replay buffer unchanged. We first investigate whether the per-sample gradient norm can be safely approximated by the TD error. For notational simplicity, let $q_i$ denote the output $Q_\theta(x_i)$ of the Q-function network. Applying the chain rule to the loss gradient shows that the gradient norm is $\|\nabla_\theta \ell(q_i, y_i)\|_2 = |\partial \ell(q_i, y_i) / \partial q_i| \cdot \|\partial q_i / \partial \theta\|_2$. If $\ell$ is the L2-norm, then $\partial \ell(q_i, y_i) / \partial q_i$ is the TD error. Consequently, the per-sample gradient norm of the loss is the product of the TD error and of the norm of the gradient of the network output.
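A quick numerical sanity check of the two facts above (any valid $p$ gives the same expected gradient, and $p^*$ minimizes $\mathbb{E}_{i \sim p}[G_i^T G_i]$) can be written in a few lines of NumPy; the per-sample gradients below are random placeholders, not gradients of an actual Q-network.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-sample gradients g_i (one row per transition), standing in
# for grad_theta l(Q_theta(x_i), y_i) over a replay buffer of size N.
N, d = 500, 10
g = rng.normal(size=(N, d)) * rng.exponential(size=(N, 1))  # heterogeneous norms

def second_moment(p):
    """E_{i~p}[G_i^T G_i] with G_i = (1 / (N p_i)) * g_i."""
    G = g / (N * p[:, None])
    return float(np.sum(p * np.einsum("ij,ij->i", G, G)))

def mean_estimate(p):
    """E_{i~p}[G_i], which equals the uniform gradient for every valid p."""
    return (p[:, None] * (g / (N * p[:, None]))).sum(axis=0)

uniform = np.full(N, 1.0 / N)
p_star = np.linalg.norm(g, axis=1)
p_star /= p_star.sum()

# Both samplings give the same expected gradient (unbiasedness) ...
assert np.allclose(mean_estimate(uniform), mean_estimate(p_star))
# ... but p* yields a smaller second moment, hence a smaller variance of the
# gradient estimate and a larger convergence speed S(p).
print(second_moment(uniform), ">=", second_moment(p_star))
```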
Therefore, the TD error is a good approximation of the optimal sampling distribution if $\ell$ is the L2-norm and if the norm of $\partial q_i / \partial \theta = \nabla_\theta Q_\theta(x_i)$ is approximately constant across samples $x_i$. If one uses the Huber loss instead of the L2-norm, then the TD error $\delta_i$ is replaced by $\min(|\delta_i|, 1)$ (which is consistent with the common practice of PER, which uses the Huber loss and clips the TD errors). However, if the assumption that $\partial q_i / \partial \theta = \nabla_\theta Q_\theta(x_i)$ is approximately constant across samples does not hold, then the variance induced by sampling according to the TD errors can be higher than that of uniform sampling. We provide in Appendix C a simple counter-example demonstrating that the variance induced by the TD-error sampling scheme is uncontrolled and can even exceed that of uniform sampling.

Table 1. Summary of the main algorithms.

| Priority | Exact | Approximate |
| --- | --- | --- |
| Outdated | GER (this work) | PER |
| Up-to-date | LaBER (this work) | LaBER (this work) |

Besides being approximated by the TD errors, the per-sample gradient norms in PER are also outdated. Only the samples in the mini-batch receive a priority update at each time step (and this update already does not correspond to the loss gradient norm). All other samples retain priorities related to even older Q-functions and are thus even more outdated. As such, a priority can be arbitrarily old. Hence, the variance induced by the sampling scheme used in PER is unknown.

As we have just seen, PER relies on two approximations. We introduce two new sampling schemes that aim to remove these approximations, and we study which approximation is the most penalizing. First, we can work with exact but outdated gradients $\nabla_\theta \ell(Q_\theta(x_i), y_i)$, i.e. gradients computed for some past Q-function parameter $\theta_t$