# Large Batch Experience Replay

Thibault Lahire¹, Matthieu Geist², Emmanuel Rachelson¹

¹ISAE-SUPAERO, Université de Toulouse, France. ²Google Research, Brain Team. Correspondence to: Thibault Lahire.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## Abstract

Several algorithms have been proposed to sample the replay buffer of deep Reinforcement Learning (RL) agents non-uniformly in order to speed up learning, but very few theoretical foundations for these sampling schemes have been provided. Among others, Prioritized Experience Replay appears as a hyperparameter-sensitive heuristic, even though it can provide good performance. In this work, we cast the replay buffer sampling problem as an importance sampling one for estimating the gradient. This allows deriving the theoretically optimal sampling distribution, yielding the best theoretical convergence speed. Elaborating on the knowledge of the ideal sampling scheme, we exhibit new theoretical foundations of Prioritized Experience Replay. The optimal sampling distribution being intractable, we make several approximations that provide good results in practice and introduce, among others, LaBER (Large Batch Experience Replay), an easy-to-code and efficient method for sampling the replay buffer. LaBER, which can be combined with Deep Q-Networks, distributional RL agents or actor-critic methods, yields improved performance over a diverse range of Atari games and PyBullet environments, compared to the base agent it is implemented on and to other prioritization schemes.

## 1. Introduction

In deep Reinforcement Learning (Sutton & Barto, 2018, RL), neural network policies and value functions can be learnt with stochastic gradient descent algorithms (Robbins & Monro, 1951, SGD) that sample an experience replay memory (Lin, 1992). This replay memory, or replay buffer, stores the transitions encountered while interacting with the environment. SGD-based algorithms exploit such buffers to learn relevant functions, such as the Q-function in the case of Deep Q-Networks (Mnih et al., 2015, DQN), the return distribution for distributional approaches (Bellemare et al., 2017), or an actor and a critic (Lillicrap et al., 2016; Haarnoja et al., 2018) in the case of continuous state-action space problems. Most deep RL algorithms boil down to a sequence of SGD-based, supervised learning problems. Ideally, one would like to minimize the corresponding loss functions with as few gradient steps as possible.

Prioritized Experience Replay (Schaul et al., 2016, PER) was introduced for the DQN algorithm as a heuristic that accelerates the learning process, drawing inspiration from Prioritized Sweeping (Moore & Atkeson, 1993). Each transition in the replay buffer is assigned a priority proportional to its temporal difference (TD) error, and gradients are estimated by sampling according to these priorities. Later, PER was combined with other DQN improvements, including distributional RL (Bellemare et al., 2017; Hessel et al., 2018), and was applied to actor-critic methods (Wang & Ross, 2019). However, the reason why prioritizing on TD errors can provide better performance remains theoretically unexplained. For distributional value functions, using the loss value as a priority, as suggested by Hessel et al. (2018), also lacks foundations.
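To fix ideas before the formal treatment in Section 3, the short NumPy sketch below shows the generic mechanics of priority-proportional sampling: stored priorities are normalized into a distribution, a mini-batch is drawn from it, and importance weights of the form $1/(N p_i)$ are attached to correct for the non-uniform sampling. The buffer contents and sizes are placeholders, and the original PER actually uses annealed weights $(N p_i)^{-\beta}$ normalized by their maximum rather than the plain $1/(N p_i)$ correction derived later in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder replay buffer: N transitions, each with a stored priority
# (e.g. the absolute TD error observed the last time it was sampled).
N, B = 10_000, 32
priorities = rng.uniform(0.0, 1.0, size=N)

# Normalize priorities into a sampling distribution and draw a mini-batch.
probs = priorities / priorities.sum()
batch_idx = rng.choice(N, size=B, p=probs)

# Importance weights 1 / (N * p_i) correct the non-uniform sampling so that
# the mini-batch gradient estimate stays unbiased w.r.t. the uniform one.
weights = 1.0 / (N * probs[batch_idx])
```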
Lastly, in actor-critic methods, when two critics are used (Fujimoto et al., 2018; Haarnoja et al., 2018), two TD errors are available and the appropriate way to draw mini-batches has, again, not been clearly established. In this work, we provide theoretical foundations for the concept of prioritization used in RL and derive the associated algorithms, shedding light on existing approaches along the way.

SGD relies on the fact that the loss minimized when learning a neural network is an integral quantity over a certain distribution. Hence, the gradient of this loss is also such an integral, which can be approximated via Monte Carlo estimation with a finite number of samples. The variance of this Monte Carlo estimate can be reduced using importance sampling (Rubinstein & Kroese, 2016), yielding better convergence speed. Consequently, non-uniform sampling of mini-batches for gradient estimation can be cast as an importance sampling problem over the replay buffer, as in the work of Needell et al. (2014). The Supervised Learning literature provides links between sampling, the variance of the stochastic gradient estimate, and the convergence speed of the SGD algorithm (Wang et al., 2017): the smaller the variance of the stochastic gradient estimate, the faster the convergence. This is particularly appealing in the context of Approximate Dynamic Programming (ADP), which encompasses the vast majority of (deep) RL algorithms (e.g. DQN, DDPG (Lillicrap et al., 2016), SAC (Haarnoja et al., 2018) and their variations). ADP is very sensitive to approximation errors, and only a few noisy gradient steps are taken at each Dynamic Programming iteration in modern deep RL algorithms.

The optimal sampling distribution, proportional to the per-sample gradient norms, is intractable, and approximations have been proposed (Loshchilov & Hutter, 2016; Katharopoulos & Fleuret, 2018). We show that PER is a special case of such approximations in the context of ADP, and propose better sampling schemes, theoretically grounded and less sensitive to hyperparameter tuning, that naturally extend to any value function representation, including distributional Q-functions or twin critics. In this work, we show that the main issue with PER is its outdated priorities, and introduce the Large Batch Experience Replay (LaBER) algorithm, which constitutes our main algorithmic contribution.

This paper is structured as follows. Section 2 covers related work on prioritization for RL and importance sampling for SGD. Section 3 then casts the gradient estimation problem in the light of importance sampling and proposes algorithms to mimic the intractable optimal sampling distribution; along the way, it provides theoretical foundations for PER. Section 4 empirically evaluates the proposed sampling schemes, in particular the LaBER algorithm, and discusses each aspect of sampling separately, along with explanations and perspectives. Section 5 summarizes and concludes.

## 2. Background and Related Works

In the standard RL framework, one searches for the optimal control policy when interacting with a discrete-time system behaving as a Markov Decision Process (Puterman, 2014). At time step $t$, the system is in state $s_t \in S$ and, upon applying action $a_t \in A$, it transitions to a new state $s_{t+1}$ while receiving reward $r_t$. A policy $\pi$ is a function mapping states to distributions over actions, whose performance can be assessed through its Q-function $Q^\pi(s, a) = \mathbb{E}\left[\sum_t \gamma^t r_t \mid s_0 = s, a_0 = a, a_t \sim \pi(s_t)\right]$.
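For concreteness, the definition of $Q^\pi$ can be unpacked into a naive Monte Carlo estimator that rolls the policy out from $(s_0, a_0)$ and averages discounted returns. The sketch below assumes a hypothetical environment interface (`env.reset_to(state)` and `env.step(action)` returning `(next_state, reward, done)`); it is only meant to illustrate the expectation above, not to describe any algorithm from the paper.

```python
import numpy as np

def mc_q_estimate(env, policy, s0, a0, gamma=0.99, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of Q^pi(s0, a0) = E[sum_t gamma^t r_t | s0, a0, a_t ~ pi]."""
    returns = []
    for _ in range(n_rollouts):
        s, a = env.reset_to(s0), a0          # hypothetical reset-to-state interface
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(a)         # hypothetical step interface
            g += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)                    # policy maps states to actions
        returns.append(g)
    return float(np.mean(returns))
```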
The goal of RL is to find a policy $\pi^*$ that has the largest possible Q-function $Q^* = Q^{\pi^*}$. Policy $\pi$'s Q-function obeys the fixed-point equation $Q^\pi(s, a) = \mathbb{E}_{s', r}\left[r + \gamma Q^\pi(s', \pi(s'))\right]$. Similarly, the optimal Q-function obeys $Q^*(s, a) = \mathbb{E}_{s', r}\left[r + \gamma \max_{a'} Q^*(s', a')\right]$. These are called the Bellman evaluation and optimality equations, respectively. They are often summarized by introducing the evaluation and optimality operators on functions: $Q^\pi = T^\pi Q^\pi$ and $Q^* = T^* Q^*$, respectively. These operators are contraction mappings, which implies that the sequence $Q_{n+1} = T Q_n$ converges to the fixed point of the operator $T$: $Q^\pi$ for $T = T^\pi$, and $Q^*$ for $T = T^*$.

The optimality and evaluation equations are cornerstones of the vast majority of RL algorithms, as they underpin respectively the search for optimal value functions in Value Iteration methods and the evaluation of current policies in Policy Iteration and Policy Gradient methods (including actor-critic ones). Approximating $T Q_n$ is thus a key issue in Reinforcement Learning and gives rise to the family of Approximate Dynamic Programming (ADP) methods. ADP is known to suffer from the approximation errors incurred by the family of Q-functions used (Munos, 2003; 2005; Scherrer et al., 2012). In particular, it is well known that ADP does not converge, but rather that $\|Q_n - Q^*\| = O(\epsilon / (1 - \gamma)^2)$ for a large enough $n$, where $\epsilon$ is an upper bound on the approximation error of $Q_n$. Approximating $Q_{n+1}$ from samples of $T Q_n$ is a Supervised Learning problem with error at most $\epsilon$; therefore, reducing $\epsilon$ immediately translates into better ADP algorithms.

Deep Q-Networks (Mnih et al., 2015, DQN) is the approximate Value Iteration algorithm that uses a replay buffer of $N$ samples $(s, a, r, s')$, a deep neural network $Q_\theta$, and a few steps of gradient descent to approximate $T^* Q_n$. In this context, $Q_n$ is called a target network and is periodically replaced by $Q_\theta$. Specifically, at each training step, DQN aims to take a gradient step on the loss $L_n(\theta) = \|Q_\theta - T^* Q_n\|^2$, or more generally $L_n(\theta) = \int_{S \times A} \ell(Q_\theta(s, a), T^* Q_n(s, a)) \, d\rho(s, a)$, with $\rho$ the state-action data distribution. Since this loss is an integral quantity, it can be estimated by Monte Carlo sampling, which defines the empirical loss $\frac{1}{N} \sum_{i=1}^N \ell(Q_\theta(x_i), y_i)$, with $x_i = (s_i, a_i)$ and $y_i = r_i + \gamma \max_{a'} Q_n(s'_i, a')$. Minimizing this empirical loss by SGD amounts to drawing, at each step, a mini-batch of $B$ transitions from the replay buffer and taking a descent step $\theta_{t+1} = \theta_t - \eta d$ in the direction of the gradient estimate $d = \frac{1}{B} \sum_{i=1}^B \nabla_\theta \ell(Q_\theta(x_i), y_i)$, with learning rate $\eta$. Similar targets and loss functions have been proposed in the context of distributional RL (Dabney et al., 2018b;a) and for learning critics in policy gradient methods (Fujimoto et al., 2018; Haarnoja et al., 2018).

As is common in SGD algorithms, the mini-batch is drawn uniformly with replacement from the replay buffer of size $N$, yielding an unbiased estimate of $\nabla_\theta L_n(\theta)$. The question of sampling the replay buffer non-uniformly has been raised from an empirical perspective in RL and has led to heuristics such as PER and its variants (Horgan et al., 2018; Fujimoto et al., 2020). In order to put more emphasis on $(s, a)$-pairs that feature a large approximation error of $T^* Q_n$, PER assigns each transition of the replay buffer a priority $(|\delta_i| + c)^\alpha$ based on the TD error $\delta_i = Q_\theta(x_i) - y_i$, with $\alpha$ and $c$ two non-negative hyperparameters.
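A minimal PyTorch sketch of this target and priority computation is given below. `q_net` and `target_net` are assumed to map a batch of states to a `(batch, n_actions)` tensor of Q-values, `batch` is a hypothetical dictionary of tensors (`s`, `a`, `r`, `s_next`, `done`), and the values of `alpha` and `c` are illustrative defaults, not prescriptions from the paper.

```python
import torch

def td_errors_and_priorities(q_net, target_net, batch, gamma=0.99, alpha=0.6, c=1e-2):
    # DQN targets y_i = r_i + gamma * max_a' Q_n(s'_i, a'), with the usual
    # (1 - done) mask so that terminal transitions bootstrap to zero.
    with torch.no_grad():
        next_q = target_net(batch["s_next"]).max(dim=1).values
        y = batch["r"] + gamma * (1.0 - batch["done"]) * next_q

    # TD errors delta_i = Q_theta(s_i, a_i) - y_i.
    q_sa = q_net(batch["s"]).gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    delta = q_sa - y

    # PER-style priorities (|delta_i| + c) ** alpha.
    priorities = (delta.abs() + c) ** alpha
    return delta, priorities
```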
At each iteration, PER samples a mini-batch according to the probability distribution induced by the list of priorities, performs a gradient step and updates the priority of the selected samples. Even though PER lacks theoretical foundations, a large number of publications experimentally demonstrate its benefits. Notably, the ablation study of Hessel et al. (2018) showed PER to be one of the most critical improvements over DQN. Horgan et al. (2018) reuse the idea of prioritization according to the TD errors in a distributed framework of agents working in parallel. As a follow-up on PER, Fujimoto et al. (2020) propose new sampling schemes based on the TD errors and adjusted by the loss, whereas Gruslys et al. (2018) consider a prioritization on sequences of transitions. Other methods to select a mini-batch have been proposed. To emphasize recent experience and draw a mini-batch of size B, Zhang & Sutton (2017) sample uniformly B 1 transitions from the replay buffer and add the last experience tuple collected along the trajectory. Similarly, Wang & Ross (2019) have observed a faster convergence of the soft actor-critic algorithm (Haarnoja et al., 2018, SAC) by sampling more frequently the recent experiences collected, which can be seen as a smooth version of the sampling proposed by Zhang & Sutton (2017). Li et al. (2019) try to fill each mini-batch with samples taken from every region of the state-action space S A, thus emulating a uniform distribution across S A within the replay buffer. Conversely to our contribution, where we work on the replay buffer distribution as it is, Dis Cor Kumar et al. (2020) selects less often errorful target values, leading to better models. Despite titles mentioning experience replay, many papers are quite weakly related to the idea of non-uniform sampling of mini-batches. For instance, Hindsight Experience Replay (Andrychowicz et al., 2017) is an intrinsic motivation method. Similarly, Fedus et al. (2020) study the relations between the replay buffer size and the frequency of gradient steps, as well as the crucial importance of n-steps returns, but without considering the question of how to sample minibatches. In Supervised Learning, the links between non-uniform sampling of training sets, variance of the stochastic gradient estimate and speed of convergence have notably been studied by Needell et al. (2014); Zhao & Zhang (2015); Wang et al. (2017). As shown therein, the smaller the variance of the gradient estimate, the better the convergence speed. Importance sampling (Rubinstein & Kroese, 2016) can be used to reduce variance while keeping these estimates unbiased. Hence, non-uniform sampling schemes can be designed to reduce the variance of the estimate and accelerate the convergence. For an SGD update in its simplest form, the ideal sampling scheme p is proportional to the per-sample gradient norm (Needell et al., 2014). Computing the optimal sampling distribution requires computing all per-sample gradients. For large training sets (such as RL replay buffers), this task is prohibitively costly. Alain et al. (2016) deploy heavy computational resources to compute the optimal distribution, using clusters of GPUs. Another possibility is to shift from p , as developed by Loshchilov & Hutter (2016), where the sampling is proportional to a loss ranking. Ensuring convergence in convex cases, Stochastic Variance Reduced Gradient (Johnson & Zhang, 2013) is a state-of-the art algorithm using importance sampling. 
Recently, Katharopoulos & Fleuret (2018) proposed an upper bound on the per-sample gradient norm which is fast to compute and can be used as a surrogate for $p^*$.

## 3. Gradient Variance Minimization

### 3.1. Importance Sampling Distributions and Approximations in PER

SGD aims at minimizing the empirical loss, given the current replay buffer, as a proxy for the true loss. Hence, at each training step, plain SGD samples a mini-batch of $B$ transitions uniformly from the replay buffer, in order to approximate the gradient of the empirical loss, $\frac{1}{N} \sum_{i=1}^N \nabla_\theta \ell(Q_\theta(x_i), y_i)$. Let $u$ be the uniform discrete distribution over the items in the replay buffer. The gradient can then be written as an expectation over these items:

$$\frac{1}{N} \sum_{i=1}^N \nabla_\theta \ell(Q_\theta(x_i), y_i) = \sum_{i=1}^N u_i \nabla_\theta \ell(Q_\theta(x_i), y_i) = \mathbb{E}_{i \sim u}\left[\nabla_\theta \ell(Q_\theta(x_i), y_i)\right].$$

This expectation can be estimated by an (unbiased) empirical mean over a mini-batch of $B$ samples:

$$\mathbb{E}_{i \sim u}\left[\nabla_\theta \ell(Q_\theta(x_i), y_i)\right] \approx \frac{1}{B} \sum_{i=1}^B \nabla_\theta \ell(Q_\theta(x_i), y_i), \quad i \sim u.$$

Let $p$ be any probability distribution over the items of the replay buffer such that $p_i \neq 0$ for all $i$. Importance sampling gives alternate unbiased estimates of the empirical loss gradient:

$$\mathbb{E}_{i \sim u}\left[\nabla_\theta \ell(Q_\theta(x_i), y_i)\right] = \mathbb{E}_{i \sim p}\left[\frac{u_i}{p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i)\right] = \mathbb{E}_{i \sim p}\left[\frac{1}{N p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i)\right].$$

This can also be estimated by an empirical mean:

$$\mathbb{E}_{i \sim u}\left[\nabla_\theta \ell(Q_\theta(x_i), y_i)\right] \approx \frac{1}{B} \sum_{i=1}^B \frac{1}{N p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i), \quad i \sim p.$$

The update equation for $\theta$ becomes:

$$\theta_{t+1} = \theta_t - \eta \frac{1}{B} \sum_{i=1}^B \frac{1}{N p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i). \tag{1}$$

We define $G_i = w_i \nabla_\theta \ell(Q_\theta(x_i), y_i)$ with $w_i = 1/(N p_i)$ for any sampling scheme $p$. The gradient of the empirical loss is thus precisely $\mathbb{E}_{i \sim p}[G_i] = \sum_{i=1}^N p_i G_i$. Following the notations of Wang et al. (2017) or Katharopoulos & Fleuret (2018), let us define the convergence speed $S$ of SGD under a sampling scheme $p$ as

$$S(p) = -\mathbb{E}_{i \sim p}\left[\|\theta_{t+1} - \theta^*\|_2^2 - \|\theta_t - \theta^*\|_2^2\right].$$

Wang et al. (2017) show that

$$S(p) = 2\eta (\theta_t - \theta^*)^T \mathbb{E}_{i \sim p}[G_i] - \eta^2 \mathbb{E}_{i \sim p}[G_i^T G_i].$$

In this work, we call the term $\mathbb{E}_{i \sim p}[G_i^T G_i]$ the variance of the gradient estimate; it is linked to the covariance matrix $\mathrm{Var}_{i \sim p}[G_i]$ by $\mathbb{E}_{i \sim p}[G_i^T G_i] = \mathrm{Tr}(\mathrm{Var}_{i \sim p}[G_i]) + \mathbb{E}_{i \sim p}[G_i]^T \mathbb{E}_{i \sim p}[G_i]$. It is thus possible to improve the theoretical convergence speed by sampling from the distribution that minimizes $\mathbb{E}_{i \sim p}[G_i^T G_i]$. Indeed, the term $\mathbb{E}_{i \sim p}[G_i]$ is constant with respect to $p$, since $\mathbb{E}_{i \sim p}[G_i] = \sum_{i=1}^N p_i G_i = \sum_{i=1}^N p_i \frac{1}{N p_i} \nabla_\theta \ell(Q_\theta(x_i), y_i) = \mathbb{E}_{i \sim u}[\nabla_\theta \ell(Q_\theta(x_i), y_i)]$, with $u$ the uniform distribution such that $u_i = 1/N$. The optimal distribution is $p^*_i \propto \|\nabla_\theta \ell(Q_\theta(x_i), y_i)\|_2$, the per-sample gradient norm. This derivation is recalled in Appendix A.

The optimal sampling scheme requires computing $\nabla_\theta \ell(Q_\theta(x_i), y_i)$ for all items in the replay buffer, which is too costly to be used in practice. Indeed, computing per-sample gradients requires a forward and a backward pass over the whole replay buffer before each gradient step. Starting from this observation, we first put what is done in PER into perspective, and then explore two new sampling strategies.

At a given time step, PER computes TD errors as priorities only for the samples selected in the mini-batch, and keeps the priorities of all other samples in the replay buffer unchanged. We first investigate whether the per-sample gradient norm can be safely approximated by the TD error. For notational simplicity, let $q_i$ denote the output $Q_\theta(x_i)$ of the Q-function network. Applying the chain rule to the loss gradient shows that the gradient norm is $\|\nabla_\theta \ell(q_i, y_i)\|_2 = |\partial \ell(q_i, y_i) / \partial q_i| \cdot \|\partial q_i / \partial \theta\|_2$. If $\ell$ is the L2-norm, then $\partial \ell(q_i, y_i) / \partial q_i$ is the TD error. Consequently, the per-sample gradient norm of the loss is the product of the TD error and of the norm of the gradient of the network output.
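A quick numerical sanity check of the two facts above (any valid $p$ gives the same expected gradient, and $p^*$ minimizes $\mathbb{E}_{i \sim p}[G_i^T G_i]$) can be written in a few lines of NumPy; the per-sample gradients below are random placeholders, not gradients of an actual Q-network.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-sample gradients g_i (one row per transition), standing in
# for grad_theta l(Q_theta(x_i), y_i) over a replay buffer of size N.
N, d = 500, 10
g = rng.normal(size=(N, d)) * rng.exponential(size=(N, 1))  # heterogeneous norms

def second_moment(p):
    """E_{i~p}[G_i^T G_i] with G_i = (1 / (N p_i)) * g_i."""
    G = g / (N * p[:, None])
    return float(np.sum(p * np.einsum("ij,ij->i", G, G)))

def mean_estimate(p):
    """E_{i~p}[G_i], which equals the uniform gradient for every valid p."""
    return (p[:, None] * (g / (N * p[:, None]))).sum(axis=0)

uniform = np.full(N, 1.0 / N)
p_star = np.linalg.norm(g, axis=1)
p_star /= p_star.sum()

# Both samplings give the same expected gradient (unbiasedness) ...
assert np.allclose(mean_estimate(uniform), mean_estimate(p_star))
# ... but p* yields a smaller second moment, hence a smaller variance of the
# gradient estimate and a larger convergence speed S(p).
print(second_moment(uniform), ">=", second_moment(p_star))
```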
Therefore, the TD error is a good approximation of the optimal sampling distribution if $\ell$ is the L2-norm and if the norm of $\partial q_i / \partial \theta = \nabla_\theta Q_\theta(x_i)$ is approximately constant across samples $x_i$. If one uses the Huber loss instead of the L2-norm, then the TD error $\delta_i$ is replaced by $\min(|\delta_i|, 1)$ (which is consistent with the common practice of PER, which uses the Huber loss and clips the TD errors). However, if the assumption that $\partial q_i / \partial \theta = \nabla_\theta Q_\theta(x_i)$ is approximately constant across samples does not hold, then the variance induced by sampling according to the TD errors can be higher than that of uniform sampling. We provide in Appendix C a simple counter-example demonstrating that the variance induced by the TD-error sampling scheme is uncontrolled and can even exceed that of uniform sampling.

Table 1. Summary of the main algorithms.

| Priority | Exact | Approximate |
| --- | --- | --- |
| Outdated | GER (this work) | PER |
| Up-to-date | LaBER (this work) | LaBER (this work) |

Besides being approximated by the TD errors, the per-sample gradient norms in PER are also outdated. Only the samples in the mini-batch receive a priority update at each time step (and this update already does not correspond to the loss gradient norm). All other samples retain priorities related to even older Q-functions and are thus even more outdated. As such, a priority can be arbitrarily old. Hence, the variance induced by the sampling scheme used in PER is unknown.

As we have just seen, PER relies on two approximations. We introduce two new sampling schemes that aim to remove these approximations, and we study which approximation is the most penalizing. First, we can work with exact but outdated gradients $\nabla_\theta \ell(Q_\theta(x_i), y_i)$, i.e. gradients computed for some past Q-function parameter $\theta_t$