# Analysis of Stochastic Processes through Replay Buffers

Shirli Di Castro Shashua 1, Shie Mannor 1 2, Dotan Di Castro 3

1 Technion - Israel Institute of Technology, Haifa, Israel. 2 NVIDIA Research, Israel. 3 Bosch Center for AI, Haifa, Israel. Correspondence to: Shirli Di Castro Shashua.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Replay buffers are a key component in many reinforcement learning schemes. Yet, their theoretical properties are not fully understood. In this paper we analyze a system where a stochastic process $X$ is pushed into a replay buffer and then randomly sampled to generate a stochastic process $Y$ from the replay buffer. We provide an analysis of the properties of the sampled process, such as stationarity, Markovity and autocorrelation, in terms of the properties of the original process. Our theoretical analysis sheds light on why a replay buffer may be a good de-correlator. Our analysis provides theoretical tools for proving the convergence of replay-buffer-based algorithms, which are prevalent in reinforcement learning schemes.

1. Introduction

A replay buffer (RB) is a mechanism for saving previously generated data samples and for sampling data in off-policy reinforcement learning (RL) algorithms (Lin, 1993). The RB serves as a First-In-First-Out (FIFO) buffer with a fixed capacity, and it enables sampling mini-batches from previously saved data points. Its structure and sampling mechanism provide a unique characteristic: the RB acts as a de-correlator of data samples. Typically, the agent in RL algorithms encounters sequences of highly correlated states, and learning from these correlated data points may be problematic since many deep learning algorithms suffer from high estimation variance when data samples are dependent. Thus, a mechanism that de-correlates the input, such as the RB, can improve data efficiency and reduce sample complexity. Since its use in the DQN algorithm (Mnih et al., 2013), the RB mechanism has become popular in many off-policy RL algorithms (Lillicrap et al., 2015; Haarnoja et al., 2018).

Previous work has been done on the empirical benefits of RB usage (Fedus et al., 2020; Zhang & Sutton, 2017), but there is still a lack of theoretical understanding of how the RB mechanism works. Understanding the properties of RBs is crucial for convergence and finite-sample analysis of algorithms that use an RB in training. To the best of our knowledge, this is the first work to tackle these theoretical aspects.

In this work we focus on the following setup. We define a random process $X$ that is pushed into an RB of size $N$ and analyze the characteristics of the stochastic process of $K$ samples that is sampled from the RB. We analyze whether properties of the original random process, such as Markovity and stationarity, are maintained, and we quantify the auto-correlation and covariance of the sampled process (later denoted by $Y$) when possible. Our motivation comes from RL algorithms that use an RB. Specifically, we focus on the Markov chain induced by a given policy, but we note that the analysis in this paper is also relevant to general random processes that are kept in a FIFO queue. This is relevant for domains such as First-Come-First-Served queueing domains (Laguna & Marklund, 2013). Our goal is to provide analytical tools for analyzing algorithms that use RBs.
Our results can provide theoretical understanding of phenomena observed in experiments with RBs that have not been analyzed theoretically before. Our theory for RBs provides tools for proving convergence of RB-based RL algorithms. Our main contributions are:

1. Formulating RBs as random processes and analyzing their properties, such as stationarity, Markovity, ergodicity, auto-correlation and covariance.

2. Comparing the properties of the original random process that is pushed into the RB with those of the sampled process at the output of the RB. In particular, we prove that when sampling uniformly from the RB, the RB acts as a de-correlator between the sampled batches.

3. Connecting our RB theory to RL by demonstrating this connection through an RB-based actor-critic RL algorithm that samples $K$ transitions from an RB of size $N$ for updating its parameters. We prove, for the first time, the asymptotic convergence of such an RB-based actor-critic algorithm.

Figure 1. Replay buffer flow diagram: Process $X$ enters the RB, which stores $\{X_t, \ldots, X_{t-N+1}\}$ in positions $(1, \ldots, N)$, respectively. As time proceeds and $t > N$, old transitions are thrown away from the RB. At each time step $t$, a random subset $J_t$ of $K$ time steps is sampled from the RB. $W$ is simply $[RB, J]$. $Y$ is the process of averaging a function over $X$ at the times in the subset $J$. Lastly, the process $Z$ is simply a function applied to the variable $X$. Comparing $Y$ to $Z$, we can see that $Z$ can serve as an online update while $Y$ can serve as an RB-based update.

The paper is structured as follows. We begin by presenting the setup in Section 2. We then state our main results regarding RB properties in Section 3. In Section 4 we connect our RB theory to its use in RL and provide a convergence proof for an RB-based actor-critic algorithm. Afterward, in Section 5 we position our work within the existing literature, and we conclude in Section 6.

2. Setup for Replay Buffer Analysis

2.1. Replay Buffer Structure

Let $X \triangleq (X_t)_{t=0}^{\infty}$ be a stochastic process where the subscript $t$ indicates time. The samples are dynamically pushed into a Replay Buffer (RB; Lin, 1993; Mnih et al., 2013) of capacity $N$, i.e., it is a First-In-First-Out (FIFO) buffer that can hold the $N$ latest samples. We define the state of the RB at time $t$ as $RB_t = \{X_{t-N+1}, \ldots, X_t\}$. Suppose that the buffer cells are numbered from $1$ to $N$. The latest observation of $X$ is pushed into cell 1, the observation before it into cell 2, etc. When a new observation arrives, the observation in cell $n$ is pushed into cell $n+1$ for $1 \le n < N$, while the observation in cell $N$ is thrown away. The random process $RB = (RB_t)_{t=0}^{\infty}$ contains the last $N$ samples of $X$. The random process $Y$ is defined by drawing $K$ random samples (without replacement) out of the $N$ stored samples, applying a function $f(\cdot)$ whose values lie in $\mathbb{R}^{D \times 1}$ to each of them, and averaging, where $D$ is the dimension used by the algorithm².

¹ We note that $f(\cdot)$ may also depend on $t$, but we leave that out for the sake of simplicity. ² For example, in linear function approximation of actor-critic algorithms, $D$ is the dimension of the linear basis used by the critic to approximate the value function.

The function $f(\cdot)$ may correspond to a typical RL function that one usually finds in RL algorithms, such as linear function approximation, temporal difference, etc. (Bertsekas & Tsitsiklis, 1996). We elaborate on possible RL functions in Section 4.2.
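To make this setup concrete, the following is a minimal Python sketch (ours, not the authors' implementation) of the FIFO buffer $RB_t$, the unordered without-replacement sampling of a batch, and the averaged process $Y_t = \frac{1}{K}\sum_{j\in J_t} f(X_j)$. The AR(1)-style driving process and the choice $f(x) = x^2$ are arbitrary illustrations.

```python
import random
from collections import deque

import numpy as np


class FIFOReplayBuffer:
    """Minimal FIFO replay buffer of capacity N (illustrative sketch only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)  # cell 1 = newest, cell N = oldest

    def push(self, x):
        # The new sample enters cell 1; once full, the oldest sample is dropped.
        self.buffer.appendleft(x)

    def is_full(self):
        return len(self.buffer) == self.capacity

    def sample_indices(self, k):
        # Unordered sampling without replacement of K buffer positions.
        return random.sample(range(len(self.buffer)), k)

    def sample_y(self, k, f):
        # Y_t = (1/K) * sum_{j in J_t} f(X_j)
        idx = self.sample_indices(k)
        return np.mean([f(self.buffer[i]) for i in idx], axis=0)


# Usage: push a correlated scalar process X and read out the averaged process Y.
rb = FIFOReplayBuffer(capacity=100)
rng = np.random.default_rng(0)
x, ys = 0.0, []
for t in range(1000):
    x = 0.9 * x + rng.normal()                        # a stand-in Markov process X
    rb.push(x)
    if rb.is_full():
        ys.append(rb.sample_y(k=16, f=lambda s: s ** 2))  # f is any per-sample function
```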
2.2. Replay Buffer Sampling Method

We analyze the unordered sampling without replacement strategy from the RB. We note that other sampling methods may be analyzed, but we chose this specific sampling scheme due to its popularity in many deep reinforcement learning algorithms³. Let $\mathcal{N}$ be a set of indices, $\mathcal{N} = \{1, \ldots, N\}$, and let $\bar J$ be a subset of $K$ indices from $\mathcal{N}$. Given $N$ and $K$, let $\mathcal{C}_{N,K}$ be the set of all possible subsets $\bar J$ for specific $N$ and $K$. Then, the probability of sampling a subset $\bar J$ is
$$p^{N,K}_{\mathrm{binom}}(\bar J) = \frac{1}{\binom{N}{K}}, \qquad \bar J \in \mathcal{C}_{N,K},$$
where $\binom{N}{K} \triangleq \frac{N!}{(N-K)!\,K!}$ is the binomial coefficient.

2.3. Replay Buffer Related Processes

We denote the set of $K$ temporal indices of the samples from the RB by the random process $J$ (corresponding to a batch in deep learning), where $J_t = \{j_k\}_{k=1}^{K} \subseteq \{t-N+1, \ldots, t\}$⁴. Similarly, the corresponding process of $K$ RB indices is $\bar J$, where $\bar J_t \subseteq \{1, \ldots, N\}$. We remark that both $J_t$ and $\bar J_t$ contain identical information, but one refers to absolute time and the other to the indices of the RB. We define the random process $W_t \triangleq [RB_t, J_t]$, which holds the information both on the RB and on the sampling from it. For later usage, we define the process obtained by passing $X_t$ through the function $f(\cdot)$, with $Z_t \triangleq f(X_t)$. The resulting $Y_t$ has the structure $Y_t = \frac{1}{K}\sum_{j \in J_t} Z_j = \frac{1}{K}\sum_{j \in J_t} f(X_j)$. The stochastic process relations described above are visualized in Figure 1.

³ In Section 6 we briefly discuss future directions for other sampling schemes. ⁴ We note that in the first $K$ steps the batch is of size smaller than $K$, and in the first $N$ steps the RB is not full.

3. Replay Buffer Properties

In this section we analyze the properties of a random process $Y$ that is sampled from the RB and used in some RL algorithm. Specifically, we analyze stationarity, Markovity, ergodicity, auto-correlation, and covariance.

3.1. Stationarity, Markovity and Ergodicity

The following lemmas characterize the connection between different properties of the process $X$ that enters the RB and the properties of the processes $RB$ and $Y$. Stationarity is not a typical desired RL property, since we constantly strive to improve the policy (and thus the induced Markov chain), but we bring it here for the sake of completeness.

Lemma 1 (Stationarity). Let $X_t$ and $J_t$ be stationary processes. Then, $RB_t$ and $Y_t$ are stationary.

The proof for Lemma 1 is deferred to Section A.1 in the supplementary material. In the next lemma we analyze when the process $RB$ is Markovian. This property is important in RL analysis. Note that $Y_t$ is not necessarily Markovian; however, $W_t$ is Markovian. For this, with some abuse of notation, we define $X_{n_1}^{n_2}(t) \triangleq \{X_{t-n_2+1}, \ldots, X_{t-n_1+1} \mid RB_t\}$ as the set of realizations of random variables from the process $X$ stored in the RB at time $t$, from position $n_1$ to $n_2$.

Lemma 2 (Markovity). Let $X_t$ be a Markov process. Then: (1) $RB_t$ and $W_t$ are Markovian. (2) The transition probabilities of $RB_t$ for $t \ge N$ are
$$P(RB_{t+1} \mid RB_t) = \begin{cases} P(X_{t+1} \mid X_t) & \text{if } X_t \in X_1^1(t) \text{ and } RB_{t+1} = \{X_{t+1}\} \cup X_1^{N-1}(t), \\ 0 & \text{otherwise.} \end{cases}$$
If $\bar J_t$ is sampled according to unordered sampling without replacement, then the transition probabilities of $W_t$ for $t \ge N$ are
$$P(W_{t+1} \mid W_t) = \begin{cases} \frac{1}{\binom{N}{K}} P(X_{t+1} \mid X_t) & \text{if } \bar J_{t+1} \in \mathcal{C}_{N,K},\ X_t \in X_1^1(t) \text{ and } RB_{t+1} = \{X_{t+1}\} \cup X_1^{N-1}(t), \\ 0 & \text{otherwise.} \end{cases}$$

The proof for Lemma 2 is deferred to Section A.2 in the supplementary material. In RL, the properties of aperiodicity and irreducibility (which together form ergodicity; Norris, 1998) are crucial in many convergence proofs. The following lemma states that these properties are preserved when using an RB. We denote by $\mathrm{supp}(Y_t)$ the set of all values $Y_t$ can take.

Lemma 3 (Ergodicity). Let $X_t$ be a Markov process that is aperiodic and irreducible.
Then, $RB_t$ and $W_t$ are aperiodic and irreducible. Moreover, every point $y \in \mathrm{supp}(Y_t)$ is visited infinitely often.

The proof for Lemma 3 is deferred to Section A.3 in the supplementary material.

3.2. Auto-Correlation and Covariance

In this section we analyze the auto-correlation and covariance of the process $Y$, expressed in terms of the properties of the process $X$. When $X$ is stationary, the auto-correlation and covariance functions of $X$ are
$$R_X(\tau) = \mathbb{E}[X_t X_{t+\tau}], \qquad C_X(\tau) = \mathbb{E}\big[(X_t - \mathbb{E}[X_t])(X_{t+\tau} - \mathbb{E}[X_{t+\tau}])\big].$$
In the same way, the auto-correlation and covariance functions of the process $Y$ are denoted by $R_Y(\tau)$ and $C_Y(\tau)$, respectively. In the following theorem we prove the relationship between the auto-correlation and covariance functions of the processes $Y$ and $X$. For that, we need to define the distribution of all time differences between two batches of samples as follows.

Definition 1 (Time difference distribution between two batches of samples). Consider an RB of size $N$. Consider taking two different random permutations (batches), denoted by $J_a$ and $J_b$, both of length $K$, at two possibly different time points, $t_a$ and $t_b = t_a + \tau$. Let $\tau'$ be a random variable whose distribution $F_{\tau'}(\cdot)$ is the probability of each difference between a sample of $J_a$ and a sample of $J_b$. The support of $\tau'$ is $\tau - N + 1 \le \tau' \le \tau + N - 1$.

Theorem 1 (Auto-Correlation and Covariance). Let $\tau$ be the difference between two time steps of the process $Y$. Let $\mathbb{E}_{\tau'}$ be an expectation according to the distribution $F_{\tau'}(\cdot)$. Then:
$$R_Y(\tau) = \mathbb{E}_{\tau'}[R_Z(\tau')], \qquad C_Y(\tau) = \mathbb{E}_{\tau'}[C_Z(\tau')]. \tag{1}$$

The proof for Theorem 1 is in Section B.1 in the supplementary material. We note two things. First, we did not specify how the sampling from the RB is done; it is expressed by the random variable $\tau'$ from Definition 1, i.e., Eq. (1) is a general expression. Second, we express the correlation using the process $Z$ and not the process $X$ directly, but the auto-correlation and covariance of $Z$ can be computed directly in any practical case using the relation $Z_t = f(X_t)$. For the specific case of unordered sampling without replacement, we express the relation between the second moments of $Z$ and $Y$ explicitly through the distribution of $\tau'$.

Lemma 4 (Time difference distribution for uniform batches). The distribution of the random variable $\tau'$ for unordered sampling without replacement is
$$P(\tau') = \begin{cases} \dfrac{N - |d|}{N^2} & \tau' = \tau + d,\ d \in \{-N+1, \ldots, 0, \ldots, N-1\}, \\[4pt] 0 & \tau' < \tau - N + 1 \ \text{or}\ \tau' > \tau + N - 1. \end{cases}$$

The proof for Lemma 4 is in Section B.2 in the supplementary material. In the following corollary we state the exact dependence in the case of random sampling of $K$ samples from an RB of size $N$.

Corollary 1. Consider the process $Z$ where sampling is according to unordered sampling without replacement. Then, the auto-correlation and covariance of the process $Y$ are
$$R_Y(\tau) = \frac{1}{N^2}\sum_{d=-N+1}^{N-1}(N - |d|)\,R_Z(d + \tau), \qquad C_Y(\tau) = \frac{1}{N^2}\sum_{d=-N+1}^{N-1}(N - |d|)\,C_Z(d + \tau).$$

The proof for Corollary 1 is in Section B.3 in the supplementary material. We see that using an RB reduces the auto-correlation and covariance of the process $Z$ by a factor of $N$. Interestingly, this reduction is independent of $K$. This result proves the de-correlation effect of using RBs and provides an explicit expression for it.
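Before moving to the RL setting, Corollary 1 can be checked numerically. The following is a small simulation sketch, not taken from the paper: it assumes an AR(1) process for $X$, takes $f$ to be the identity (so $Z = X$), pushes the process through a FIFO window, samples batches uniformly without replacement, and compares the empirical covariance of $Y$ at a few lags with the triangular-weighted average of $C_Z$ predicted by Corollary 1. The buffer size, batch size, AR(1) coefficient and lags are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

N, K, rho, T = 50, 8, 0.9, 200_000      # buffer size, batch size, AR(1) coefficient, horizon

# The process X: an AR(1) chain; with f = identity we have Z = X.
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + rng.normal()

# The sampled process Y_t: mean of K samples drawn without replacement from the last N samples.
y = np.empty(T - N)
for t in range(N, T):
    window = x[t - N + 1 : t + 1]                          # RB_t = {X_{t-N+1}, ..., X_t}
    y[t - N] = window[rng.choice(N, size=K, replace=False)].mean()

def empirical_cov(a, lag):
    a = a - a.mean()
    return float(np.mean(a[: len(a) - lag] * a[lag:]))

c_z = lambda d: rho ** np.abs(d) / (1.0 - rho ** 2)        # theoretical C_Z(d) for this AR(1)
d = np.arange(-N + 1, N)

for tau in (1, 5, 10, 20):                                  # lags >= 1: the two batches are independent
    predicted = float(np.sum((N - np.abs(d)) * c_z(d + tau)) / N ** 2)   # Corollary 1
    print(f"tau={tau:2d}  empirical C_Y={empirical_cov(y, tau):7.4f}  predicted={predicted:7.4f}")
```

If Corollary 1 holds, the two printed columns should agree up to Monte-Carlo noise; the de-correlation relative to $C_Z(\tau)$ itself becomes more pronounced as $N$ grows large compared with the correlation time of $Z$.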
4. Replay Buffers in Reinforcement Learning

In the previous section we analyzed properties of stochastic processes that go through an RB. In this section we analyze RBs in RL. Stabilizing learning in modern off-policy deep RL algorithms, such as Deep Q-Networks (Mnih et al., 2013) or DDPG (Lillicrap et al., 2015), is based on saving past observed transitions in an RB. Even though its use is extensive, the theoretical understanding of the mechanism of sampling batches from an RB is still quite limited. This is our focus in this section.

We begin by describing the setup that will serve us in this section. Then we connect the random processes defined in Section 3 with common stochastic updates used in RL. We then describe an RB-based actor-critic algorithm that uses a batch of $K$ samples from the RB in each parameter update step. This type of algorithm serves as a basic example for popular usages of RBs in RL. We note that other versions of RB-based RL algorithms (such as deep RL algorithms, value-based algorithms, or discounted settings of the value function) can be analyzed with the stochastic-process tools we provide in this work. Finally, we present a full convergence proof for the RB-based actor-critic algorithm.

Despite its popularity, to the best of our knowledge there is only a handful of proofs that consider the RB in RL algorithm analysis (e.g., Di-Castro Shashua et al., 2021, or Lazic et al., 2021). Most of the convergence proofs for off-policy algorithms assume that a single sample is sampled from the RB. Di-Castro Shashua et al. (2021) proved for the first time the convergence of an RB-based algorithm. However, their algorithm and technical tools were focused on the sim-to-real challenge with multiple MDP environments, and they considered only a single-sample batch from the RB instead of $K$ samples (which complicates the proof). Therefore, for completeness and in order to focus on the RB properties, we provide a proof for RB-based algorithms with a single MDP environment and a batch of $K$ samples. Similarly to previous works, we consider here a setup with linear function approximation (Bertsekas & Tsitsiklis, 1996).

4.1. Setup for Markov Decision Process

An environment in RL is modeled as a Markov Decision Process (MDP; Puterman, 1994), where $\mathcal{S}$ and $\mathcal{A}$ are the state space and action space, respectively. We let $P(S'|S, A)$ denote the probability of transitioning from state $S \in \mathcal{S}$ to state $S' \in \mathcal{S}$ when applying action $A \in \mathcal{A}$. We consider a probabilistic policy $\pi_\theta(A|S)$, parameterized by $\theta \in \Theta \subseteq \mathbb{R}^d$, which expresses the probability of the agent choosing action $A$ given that it is in state $S$. The MDP measure $P(S'|S, A)$ and the policy measure $\pi_\theta(A|S)$ together induce a Markov Chain (MC) measure $P_\theta(S'|S)$ ($P_\theta$ in matrix form). We let $\mu_\theta$ denote the stationary distribution induced by the policy $\pi_\theta$. The reward function is denoted by $r(S, A)$. In this work we focus on the average reward setting⁵.

⁵ The discounted setting can be treated in a similar way to the current setup.

The goal of the agent is to find a policy that maximizes the average reward the agent receives during its interaction with the environment. Under an ergodicity assumption, the average reward over time eventually converges to the expected reward under the stationary distribution (Bertsekas, 2005):
$$\eta_\theta \triangleq \lim_{T \to \infty} \frac{1}{T}\sum_{t=0}^{T} r(S_t, A_t) = \mathbb{E}_{S \sim \mu_\theta, A \sim \pi_\theta}[r(S, A)]. \tag{2}$$
The state-value function evaluates the overall expected accumulated reward given a starting state $S$ and a policy $\pi_\theta$,
$$V^{\pi_\theta}(S) = \mathbb{E}\left[\sum_{t=0}^{\infty}\big(r(S_t, A_t) - \eta_\theta\big) \,\Big|\, S_0 = S\right],$$
where the actions follow the policy, $A_t \sim \pi_\theta(\cdot|S_t)$, and the next state follows the transition probability, $S_{t+1} \sim P(\cdot|S_t, A_t)$.
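The induced chain $P_\theta$, its stationary distribution $\mu_\theta$, and the average reward $\eta_\theta$ can be computed directly on a small MDP. The sketch below is purely illustrative (the transition kernel, rewards and policy are random placeholders, not objects from the paper):

```python
import numpy as np

# Toy illustration of Section 4.1: an MDP with |S| = 3 states and |A| = 2 actions,
# a fixed stochastic policy pi, the induced chain P_theta(S'|S) = sum_a pi(a|S) P(S'|S,a),
# its stationary distribution mu_theta, and the average reward eta_theta.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(-1, 1, size=(n_states, n_actions))                # r(s, a), with |r| <= 1
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi(a|s)

# Induced Markov chain P_theta and expected per-state reward r_theta(s).
P_theta = np.einsum("sa,sap->sp", pi, P)
r_theta = np.einsum("sa,sa->s", pi, r)

# Stationary distribution: left eigenvector of P_theta associated with eigenvalue 1.
evals, evecs = np.linalg.eig(P_theta.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()

eta_theta = mu @ r_theta   # eta_theta = E_{S ~ mu_theta, A ~ pi_theta}[r(S, A)]
print("mu_theta =", mu, " eta_theta =", eta_theta)
```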
Let $O = \{S, A, S'\}$ be a transition from the environment, and let $O_t$ be the transition at time $t$. The temporal difference error $\delta(O)$ (TD; Bertsekas & Tsitsiklis, 1996) is a random variable based on a single sampled transition from the RB,
$$\delta(O) = r(S, A) - \eta + \phi(S')^\top w - \phi(S)^\top w, \tag{4}$$
where $\hat V^{\pi_\theta}_w(S) = \phi(S)^\top w$ is a linear approximation of $V^{\pi_\theta}(S)$, $\phi(S) \in \mathbb{R}^d$ is a feature vector for state $S$, and $w \in \mathbb{R}^d$ is the critic parameter vector. We denote by $\Phi \in \mathbb{R}^{|\mathcal{S}| \times d}$ the matrix of all state feature vectors.

4.2. Replay Buffer as a Random Process in RL

In Section 3 we compared the properties of a general random process $X$ going through an RB with those of the resulting process $Y$. In the RL context we have $X_t \triangleq O_t$, meaning that our basic component is a single state-action-next-state transition observed at time $t$. In addition, we defined the process $Z_t \triangleq f(X_t)$, where $f(\cdot)$ is a general function. In RL, $f(\cdot)$ is commonly defined as the value function, the Q-function, the TD-error, the empirical average reward, the critic or actor gradients, or any other function that computes a desirable update based on an observed transition $O$. RL algorithms that use a single $f(O_t)$ computation in the parameter update step are commonly referred to as on-policy algorithms, since they update their parameters based only on the last observed transition of the Markov chain. See Figure 1 for a comparison between online updates and RB-based updates. Using the formulation of random processes presented in Section 3, we can characterize the online updates, based on the single last transition, as follows:
$$Z^{\mathrm{reward}}_t = f_{\mathrm{reward}}(O_t) = r(S_t, A_t) - \eta_t,$$
$$Z^{\mathrm{critic}}_t = f_{\mathrm{critic}}(O_t) = \delta(O_t)\,\phi(S_t),$$
$$Z^{\mathrm{actor}}_t = f_{\mathrm{actor}}(O_t) = \delta(O_t)\,\nabla_\theta \log \pi_\theta(A_t|S_t).$$

Algorithm 1 Linear Actor Critic with RB samples
1: Initialize replay buffer RB with size $N$.
2: Initialize actor parameters $\theta_0$, critic parameters $w_0$ and average reward estimator $\eta_0$.
3: Learning steps $\{\alpha^\eta_t\}$, $\{\alpha^w_t\}$, $\{\alpha^\theta_t\}$.
4: for $t = 0, \ldots$ do
5:   Interact with MDP $M$ according to policy $\pi_{\theta_t}$ and add the transition $\{S_t, A_t, r(S_t, A_t), S_{t+1}\}$ to $RB_t$.
6:   Sample $J_t$, a set of $K$ random time indices from $RB_t$. Denote the corresponding transitions by $\{O_j\}_{j \in J_t}$.
7:   $\delta(O_j) = r(S_j, A_j) - \eta_t + \phi(S'_j)^\top w_t - \phi(S_j)^\top w_t$
8:   Update average reward: $\eta_{t+1} = \eta_t + \alpha^\eta_t \big(\frac{1}{K}\sum_{j \in J_t} r(S_j, A_j) - \eta_t\big)$
9:   Update critic: $w_{t+1} = w_t + \alpha^w_t \frac{1}{K}\sum_{j \in J_t} \delta(O_j)\phi(S_j)$
10:  Update actor: $\theta_{t+1} = \Gamma\big(\theta_t - \alpha^\theta_t \frac{1}{K}\sum_{j \in J_t} \delta(O_j)\nabla_\theta \log \pi_\theta(A_j|S_j)\big)$
11: end for

When using RB-based off-policy algorithms, the parameter updates are computed over an average of $K$ functions, based on $K$ transitions that were sampled randomly from the last $N$ stored transitions. This exactly matches our definition of the process $Y$: $Y_t = \frac{1}{K}\sum_{j\in J_t} f(X_j) = \frac{1}{K}\sum_{j\in J_t} Z_j$. The following updates are typical in RB-based off-policy algorithms:
$$Y^{\mathrm{reward}}_t = \frac{1}{K}\sum_{j\in J_t} Z^{\mathrm{reward}}_j = \frac{1}{K}\sum_{j\in J_t} r(S_j, A_j) - \eta_t,$$
$$Y^{\mathrm{critic}}_t = \frac{1}{K}\sum_{j\in J_t} Z^{\mathrm{critic}}_j = \frac{1}{K}\sum_{j\in J_t} \delta(O_j)\,\phi(S_j),$$
$$Y^{\mathrm{actor}}_t = \frac{1}{K}\sum_{j\in J_t} Z^{\mathrm{actor}}_j = \frac{1}{K}\sum_{j\in J_t} \delta(O_j)\,\nabla_\theta \log \pi_\theta(A_j|S_j). \tag{5}$$

In Algorithm 1 we present a linear actor-critic algorithm based on RB samples, where the algorithm updates the actor and critic using a random batch of transitions from the RB. In Section 4.5 we show how the results from Section 3, regarding a random process that is pushed into the RB, and the definitions of the $X$ and $Y$ processes are helpful in proving the asymptotic convergence of this algorithm.

4.3. Linear Actor Critic with RB Samples Algorithm

The basic RB-based algorithm we analyze in this work is presented in Algorithm 1. We propose a two-time-scale linear actor-critic optimization scheme (similar to Konda & Tsitsiklis, 2000), which is an RB-based version of the algorithm of Bhatnagar et al. (2008).
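As a concrete illustration of Algorithm 1, the sketch below implements a tabular-feature special case in Python. It is not the authors' code: the toy MDP, one-hot features, softmax parameterization, step-size constants, and the box projection standing in for $\Gamma$ are all illustrative assumptions, and the sign of the actor step follows the listing above (flip it for a plain gradient-ascent convention).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative MDP (made up): 5 states, 2 actions.
nS, nA = 5, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # P[s, a, s']
R = rng.uniform(-1, 1, size=(nS, nA))             # r(s, a)
phi = np.eye(nS)                                  # one-hot features, d = |S| (assumption)
d = nS

def policy_probs(theta, s):
    # Softmax policy pi_theta(a|s), one parameter per (s, a) pair (assumption).
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

N, K = 200, 16                                     # buffer size and batch size
theta, w, eta = np.zeros((nS, nA)), np.zeros(d), 0.0
buffer, s = [], 0                                  # FIFO buffer of (s, a, r, s') tuples

for t in range(20_000):
    # Interact with the MDP and push the transition into the RB (FIFO of capacity N).
    a = rng.choice(nA, p=policy_probs(theta, s))
    s_next = rng.choice(nS, p=P[s, a])
    buffer.append((s, a, R[s, a], s_next))
    if len(buffer) > N:
        buffer.pop(0)
    s = s_next
    if len(buffer) < N:
        continue

    # Sample K positions uniformly without replacement (the batch J_t).
    batch = [buffer[i] for i in rng.choice(N, size=K, replace=False)]

    # Step sizes with alpha^theta = o(alpha^w), as in Assumption 5 (constants are assumptions).
    a_eta, a_w, a_th = 0.5 / (1 + t) ** 0.6, 1.0 / (1 + t) ** 0.6, 1.0 / (1 + t) ** 0.9

    # TD errors over the batch, Eq. (4): delta = r - eta + phi(s')^T w - phi(s)^T w.
    deltas = np.array([r - eta + phi[sn] @ w - phi[si] @ w for (si, ai, r, sn) in batch])

    eta += a_eta * (np.mean([r for (_, _, r, _) in batch]) - eta)                 # Y_t^reward step
    w += a_w * np.mean([dl * phi[si] for dl, (si, _, _, _) in zip(deltas, batch)], axis=0)

    # Actor step, Y_t^actor: averaged delta * grad log pi over the batch.
    g = np.zeros_like(theta)
    for dl, (si, ai, _, _) in zip(deltas, batch):
        grad_log = -policy_probs(theta, si)
        grad_log[ai] += 1.0
        g[si] += dl * grad_log / K
    theta = np.clip(theta - a_th * g, -10.0, 10.0)  # clip() plays the role of the projection Gamma
```

The step-size schedules above satisfy the two-time-scale requirement of Assumption 5, so in this sketch the average-reward and critic iterates track their fixed points for the slowly varying policy, mirroring the structure of the convergence argument in Section 4.5.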
Our algorithm is fully described by the random process Wt = [RBt, Jt] and by the algorithm updates Y reward t , Y critic t and Y actor t described in equation (5). Analysis of Stochastic Processes through Replay Buffers See Figure 2 for a visualized flow diagram of Algorithm 1. In Algorithm 1 we consider an environment, modeled as an MDP M, and we maintain a replay buffer RB with capacity N. The agent collects transitions {S, A, r(S, A), S } from the environment and stores them in the RB. We train the agent in an off-policy manner. At each time step t, the agent samples Jt a subset of K random time indices from RBt which defines the random transitions batch for optimizing the average reward, critic and actor parameters. Note that for the actor updates, we use a projection Γ( ) that projects any θ Rd to a compact set Θ whenever θ / Θ. 4.4. Expectations of Critic and Actor Updates in Algorithm 1 We divide the convergence analysis of Algorithm 1 into two parts. The first part, presented in this section, is unique to our paper - we describe in Lemmas 5 and 6 a closed form of the expectations of the actor and critic updates, based on a random batch of K transitions from the RB. In the second part, presented in Section 4.5, we use Stochastic Approximation tools for proving the algorithm updates convergence, based on the results from Lemmas 5 and 6. We note that Section 4.5 follows the steps of the convergence proofs presented by Di-Castro Shashua et al. (2021) and Bhatnagar et al. (2009). For time t n + 1 where 1 n N, we define the induced MC with a corresponding policy parameter θt n+1. For this parameter, we denote the corresponding state distribution vector ρt n+1 and a transition matrix Pt n+1 (both induced by the policy πθt n+1). Finally, we define the following diagonal matrix Dt n+1 diag(ρt n+1) and the reward vector rt n+1 with elements rt n+1(S) = P A πθt n+1(A|S)r(S, A). Based on these definitions we define n=1 Dt n+1 (Pt n+1 I) n=1 Dt n+1 (rt n+1 ηθe) . where I is the identity matrix and e is a vector of ones. Let Dθ diag(µθ) and define Cθ Dθ (Pθ I) , bθ Dθ (rθ ηθe) . (7) In our RB setting, since we have at time t a RB with the last N samples, Ct and bt in Equation (6) represent a superposition of all related elements of these samples. For proving the convergence of the critic, we assume the policy is fixed. Then, when t , ρt n+1 µθ for all index n. This means that the induced MC is one for all the samples in the RB, so the sum over N disappears for Cθ and bθ. The following two lemmas compute the expectation of the critic and actor updates when using a random batch of K samples. The expectations are over all possible random batches sampled from the RB. Recall that Jt {1, . . . , N} and CN,K is the set of all possible subsets J for specific N and K. These lemmas are the main results for proving convergence of RB-based RL algorithms. Lemma 5. Assume we have a RB with N transitions and we sample random K transitions from the RB. Then: E Jt CN,K,{Ot n+1}n Jt RBt n Jt δ(Ot n+1)ϕ(St n+1) = Φ CθΦw + Φ bθ, where Cθ and bθ are defined in (7). Lemma 6. Assume we have a RB with N transitions and we sample random K transitions from the RB. Then: E Jt CN,K,{Ot n+1}n Jt RBt n Jt δπθ(Ot n+1) θ log πθ(At n+1|St n+1) S µθ(S) ϕ(S) θwπθ θ V πθ(S) , where V πθ(S) = P A A πθ(A|S) r(S, A) ηθ + P S S P(S |S, A)ϕ(S ) wπθ . The proofs for Lemmas 5 and 6 are in sections C.1 and D.1, respectively, in the supplementary material. 4.5. 
Asymptotic Convergence of Algorithm 1 We are now ready to present the convergence theorems for the critic and actor in Algorithm 1. In the proof of our theorems we use tools from Stochastic Approximation (SA) (Kushner & Yin, 2003; Borkar, 2009; Bertsekas & Tsitsiklis, 1996), a standard tool in the literature for analyzing iterations of processes such as two time scale Actor-Critic in the context of RL. We showed in Lemma 2 that the process Wt = [RBt, Jt] of sampling K random transitions from the RB is a Markov process. In addition, we showed in Lemma 3 that if the original Markov chain is irreducible and aperiodic, then also the RB Markov process is irreducible and aperiodic. This property is required for the existence of unique stationary distribution and for proving the convergence of the iterations in Algorithm 1 using SA tools. We note that proving convergence for a general function approximation is hard. We choose to demonstrate the convergence for a linear function approximation (LFA; Bertsekas & Tsitsiklis, 1996), in order to keep the convergence proof as simple as possible while Analysis of Stochastic Processes through Replay Buffers Figure 2. Replay buffer in reinforcement learning flow diagram: The random processes described in Figure 1 are reflected in Algorithm 1. Here the random process that enters the RB is O which is a tuple of (S, A, S ). The RB stores the last N transitions {Ot, . . . Ot N 1} in positions (1, . . . , N), respectively. As time proceeds and t > N, old transition are thrown away from the RB. At each time step t, a random subset of K time steps is sampled from the RB and is denoted as Jt. W is simply [RB, J]. In Algorithm 1 we have three different updates, Y reward t , Y critic t and Y actor t , all are averages over functions of transitions sampled from the RB. Then the parameters are updated accordingly. Finally, the policy parameter θt+1 is used to sample the action in transition Ot+1 that later enters to the RB. focusing in the proof on the RB and random batches aspects of the algorithm. We present several assumptions that are necessary for proving the convergence of Algorithm 1. Assumption 4 is needed for the uniqueness of the convergence point of the critic. In addition, we choose a state S to be of value 0, i.e., V πθ(S ) = 0 (due to Assumption 2, S can be any of S S). Assumption 5 is required in order to get a with probability 1 using the SA convergence. In our actor-critic setup we need two time-scales convergence, thus, in this assumption the critic is a faster recursion than the actor. Assumption 1. 1. The set Θ is compact. 2. The reward |r( , )| 1 for all S S, A A. Assumption 2. For any policy πθ, the induced Markov chain of the MDP process {St}t 0 is irreducible and aperiodic. Assumption 3. For any state action pair (S, A), πθ(A|S) is continuously differentiable in the parameter θ. Assumption 4. 1. The matrix Φ has full rank. 2. The functions ϕ(S) are Liphschitz in S and bounded. 3. For every w Rd, Φw = e where e is a vector of ones. Assumption 5. The step-sizes {αη t }, {αw t }, {αθ t }, t 0 satisfy P t αη t = P t αw t = P t αθ t = , P t (αη t )2, P t (αw t )2, P t (αθ t )2 < and αθ t = o(αw t ). Now we are ready to prove the following theorems, regarding Algorithm 1. We note that Theorem 2 and 3 state the critic and actor convergence. Theorem 2. 
(Convergence of the Critic to a fixed point) Under Assumptions 1-5, for any given π and {ηt}, {wt} as in the updates in Algorithm 1, we have ηt ηθ and wt wπ with probability 1, where wπ is obtained as a unique solution to Φ CθΦw + Φ bθ = 0. The proof for Theorem 2 is in Section C in the supplementary material. It follows the proof for Lemma 5 in Bhatnagar et al. (2009). For establishing the convergence of the actor updates, we define additional terms. Let Z denote the set of asymptotically stable equilibria of the ODE θ = ˆΓ( θηθ) and let Zϵ be the ϵ-neighborhood of Z. We define ξπθ = P S µθ(S) ϕ(S) θwπθ θ V πθ(S) . Theorem 3. (Convergence of the actor) Under Assumptions 1-5, given ϵ > 0, δ > 0 such that for θt, t 0 obtained using Algorithm 1, if supθt ξπθt < δ, then θt Zϵ as t with probability one. The proof for Theorem 3 is in Section D in the supplementary material. It follows the proof for Theorem 2 in Bhatnagar et al. (2009). Analysis of Stochastic Processes through Replay Buffers 5. Related Work Actor critic algorithms analysis: The convergence analysis of our proposed RB-based actor critic algorithm is based on the Stochastic Approximation method (Kushner & Clark, 2012). Konda & Tsitsiklis (2000) proposed the actor-critic algorithm, and established the asymptotic convergence for the two time-scale actor-critic, with TD(λ) learning-based critic. Bhatnagar et al. (2009) proved the convergence result for the original actor-critic and natural actor-critic methods. Di Castro & Meir (2010) proposed a single time-scale actorcritic algorithm and proved its convergence. Works on finite sample analysis for actor critic algorithms (Wu et al., 2020; Zou et al., 2019; Dalal et al., 2018) analyze the case of last transition update and do not analyze the RB aspects in these algorithms. Recently, Di-Castro Shashua et al. (2021) proved for the first time the convergence of an RB-based actor critic algorithm. However, their algorithm and technical tools were focused on the sim-to-real challenge with multiple MDP environments, and they focused only on a single sample batch from the RB instead of K samples. We provide a proof for RB-based algorithms, with a single MDP environment and a batch of K samples. Replay Buffer analysis: Experience replay (Lin, 1993) is a central concept for achieving good performance in deep reinforcement learning. Deep RB-based algorithms such as deep Q-learning (DQN, Mnih et al., 2013), deep deterministic policy gradient (DDPG; Lillicrap et al., 2015), actor critic with experience replay (ACER; Wang et al., 2016), Twin Delayed Deep Deterministic policy gradient (TD3, Fujimoto et al., 2018), Soft Actor Critic (SAC, Haarnoja et al., 2018) and many others use RBs to improve performance and data efficiency. We focus mainly on works that provide some RB properties analysis. Zhang & Sutton (2017) and Liu & Zou (2018) study the effect of replay buffer size on the agent performance . Fedus et al. (2020) investigated through simulated experiments how the learning process is affected by the ratio of learning updates to experience collected. Other works focus on methods to prioritize samples in the RB and provide experimental results to emphasis performance improvement when using prioritized sampling from RB (Schaul et al., 2015; Pan et al., 2018; Zha et al., 2019; Horgan et al., 2018; Lahire et al., 2021). We, on the other hand, focus on the theoretical aspects of RB properties and provide convergence results for RB-based algorithms. Lazic et al. 
(2021) proposed a RB version for a regularized policy iteration algorithm. They provide an additional motivation for using RBs, in addition to the advantage of reduced temporal correlations: They claim that using RB in online learning in MDPs can approximate well the average of past value functions. Their analysis also suggests a new objective for sub-sampling or priority-sampling transitions in the RB, which differs priority-sampling objectives of previous work (Schaul et al., 2015). Regarding RB analysis in Deep RL algorithms, Fan et al. (2020) performed a finite sample analysis on DQN algorithm (Mnih et al., 2013). In their analysis, they simplified the technique of RB with an independence assumption and they replaced the distribution over random batches with a fixed distribution. These assumptions essentially reduce DQN to the neural fitted Q-iteration (FQI) algorithm (Riedmiller, 2005). In our work we focus on asymptotic convergence and analyze explicitly the distribution of random batches from the RB. 6. Conclusions In this work we analyzed RB and showed some basic random processes properties of it as ergodicity, stationarity, Markovity, correlation, and covariance. The latter two are of most interest since they can explain the success of modern RL algorithm based on RB. In addition, we developed theoretical tools of stochastic process analysis for replay buffers. We provided an example of how to use these tools to analyze the convergence of an RB-based actor critic algorithm. Similarly, other common RB-based algorithms in reinforcement learning such as DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015), TD3 (Fujimoto et al., 2018), SAC (Haarnoja et al., 2018) and many others can be analyzed now, using the tools we developed in this work. As a future research, we propose two directions that are of great interest and complete the analysis we provided in this work: 1. Spectrum analysis of the learning processes. Since we adopted an approach of Signals and Systems with random signals (Oppenheim et al., 1997; Porat, 2008), one can use spectrum analysis in order to discover instabilities or cycles in the learning process. 2. More complex RBs. There is a large experimental body of work that tries to propose different schemes of RBs. Some of them apply different independent on RL quantities sampling techniques while other apply dependent on RL quantities schemes (e.g., prioritized RB depends on the TD signal; Schaul et al., 2015). In this work we paved the first steps to apply analysis on such schemes (both dependent and independent). 7. Acknowledgements This work was partially supported by the Israel Science Foundation under contract 2199/20. Analysis of Stochastic Processes through Replay Buffers Bertsekas, D. Dynamic programming and optimal control. Athena scientific Belmont, MA, 2005. Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996. Bhatnagar, S. and Kumar, S. A simultaneous perturbation stochastic approximation-based actor-critic algorithm for markov decision processes. IEEE Transactions on Automatic Control, 49(4):592 598, 2004. Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. Incremental natural actor-critic algorithms. In Advances in neural information processing systems, pp. 105 112, 2008. Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor critic algorithms. Automatica, 45(11): 2471 2482, 2009. Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009. Borkar, V. S. 
and Meyn, S. P. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38 (2):447 469, 2000. Dalal, G., Sz or enyi, B., Thoppe, G., and Mannor, S. Finite sample analyses for td (0) with function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. Di Castro, D. and Meir, R. A convergent online single time scale actor critic algorithm. The Journal of Machine Learning Research, 11:367 410, 2010. Di-Castro Shashua, S., Di Castro, D., and Mannor, S. Sim and real: Better together. Advances in Neural Information Processing Systems, 34, 2021. Fan, J., Wang, Z., Xie, Y., and Yang, Z. A theoretical analysis of deep q-learning. In Learning for Dynamics and Control, pp. 486 489. PMLR, 2020. Fedus, W., Ramachandran, P., Agarwal, R., Bengio, Y., Larochelle, H., Rowland, M., and Dabney, W. Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061 3071. PMLR, 2020. Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587 1596. PMLR, 2018. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861 1870. PMLR, 2018. Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., and Silver, D. Distributed prioritized experience replay. ar Xiv preprint ar Xiv:1803.00933, 2018. Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008 1014. Citeseer, 2000. Kushner, H. and Yin, G. G. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003. Kushner, H. J. and Clark, D. S. Stochastic approximation methods for constrained and unconstrained systems, volume 26. Springer Science & Business Media, 2012. Laguna, M. and Marklund, J. Business process modeling, simulation, and design. Taylor & Francis, 2013. Lahire, T., Geist, M., and Rachelson, E. Large batch experience replay. ar Xiv preprint ar Xiv:2110.01528, 2021. Lazic, N., Yin, D., Abbasi-Yadkori, Y., and Szepesvari, C. Improved regret bound and experience replay in regularized policy iteration. ar Xiv preprint ar Xiv:2102.12611, 2021. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. Lin, L.-J. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993. Liu, R. and Zou, J. The effects of memory replay in reinforcement learning. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 478 485. IEEE, 2018. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. Norris, J. R. Markov chains. Number 2. Cambridge university press, 1998. Oppenheim, A. V., Willsky, A. S., Nawab, S. H., Hern andez, G. M., et al. Signals & systems. Pearson Educaci on, 1997. Analysis of Stochastic Processes through Replay Buffers Pan, Y., Zaheer, M., White, A., Patterson, A., and White, M. 
Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. ar Xiv preprint ar Xiv:1806.04624, 2018. Porat, B. Digital processing of random signals: theory and methods. Courier Dover Publications, 2008. Puterman, M. L. Markov Decision Processes. Wiley and Sons, 1994. Riedmiller, M. Neural fitted q iteration first experiences with a data efficient neural reinforcement learning method. In European conference on machine learning, pp. 317 328. Springer, 2005. Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. ar Xiv preprint ar Xiv:1511.05952, 2015. Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. ar Xiv preprint ar Xiv:1611.01224, 2016. Wu, Y., Zhang, W., Xu, P., and Gu, Q. A finite time analysis of two time-scale actor critic methods. ar Xiv preprint ar Xiv:2005.01350, 2020. Zha, D., Lai, K.-H., Zhou, K., and Hu, X. Experience replay optimization. ar Xiv preprint ar Xiv:1906.08387, 2019. Zhang, S. and Sutton, R. S. A deeper look at experience replay. ar Xiv preprint ar Xiv:1712.01275, 2017. Zou, S., Xu, T., and Liang, Y. Finite-sample analysis for sarsa with linear function approximation. ar Xiv preprint ar Xiv:1902.02234, 2019. Analysis of Stochastic Processes through Replay Buffers A. Proofs for Lemmas in Section 3 A.1. Proof of Lemma 1 Proof. Stationarity of RBt: Recall that stationarity (in the strong sense) means that for m = 1, 2, . . . , there are times (t1, t2, . . . , tm) such that for all τ Z FX(Xt1+τ, . . . , Xtm+τ) = FX(Xt1, . . . , Xtm), where FX(Xt1, . . . , Xtm) is the cumulative distribution. Then, FRB(RBt1+τ, . . . , RBtm+τ) (1) =FX(Xt1+τ N+1, . . . , Xt1+τ, Xt2+τ N+1, . . . , Xt2+τ, Xtm+τ N+1, . . . , Xtm+τ), (2) =FX(Xt1 N+1, . . . , Xt1, Xt2 N+1, . . . , Xt2, Xtm N+1, . . . , Xtm) (3) =FRB(RBt1, . . . , RBtm), where we use the RB definition in (1), stationarity of X in (2), and expressing RB based on X in (3). Stationarity of Yt: Similarly, for m = 1, 2, . . . , there are times (t1, t2, . . . , tm) such that for all τ Z FY (Yt1+τ, . . . , Ytm+τ) (1) =FX j Jt1+τ f(Xj), . . . , 1 j Jtm+τ f(Xj) Jt1+τ ,..., Jtm+τ F Jt1+τ ,..., Jtm+τ (j1, . . . , jm) j Jt1+τ f(Xj), . . . , 1 j Jtm+τ f(Xj) j1, . . . , jm Jt1,..., Jtm F Jt1,..., Jtm (j1, . . . , jm) f(Xj), . . . , 1 j Jtm f(Xj) j1, . . . , jm f(Xj), . . . , 1 j Jtm f(Xj) (5) =FY (Yt1, . . . , Ytm), where in (1) we use the process Y definition, in (2) we use iterated expectation, in (3) we use both the stationarity of X and J, in (4) we use again iterated expectation, and in (5) we use Y definition. A.2. Proof of Lemma 2 Proof. We first note that we use some abuse of notation when referring sometimes to random variables from processes X, RB, W and J the same as their realizations. However, to avoid an overhead of the proof, we keep the notations simple and short. Analysis of Stochastic Processes through Replay Buffers Proving Markovity requires that P(RBt+1|RBt, RBt 1, . . . , RB0) = P(RBt+1|RBt). (A.1) P(Wt+1|Wt, Wt 1, . . . , W0) = P(Wt+1|Wt). (A.2) We start with proving the Markovity of RBt. Let us denote Xn2 n1 (t) {Xt n2+1, . . . , Xt n1+1|RBt} as the set of realizations of the random variables from process X stored in the RB at time t in positions n1 to n2. Note that when a new transition is pushed to the RB into position n = 1, the oldest transition in position n = N is thrown away, and all the transitions in the RB move one index forward. 
We present here some observations regarding the RB that will help us through the proof: RBt = XN 1 (t) = {Xt N+1, . . . , Xt n+1 . . . , Xt} (RB definition). (A.3) XN 1 (t + 1) = {Xt+1} XN 1 1 (t) (A.4) XN 1 1 (t) XN 1 (t) (A.5) Xt XN 1 (t) (A.6) P(Xt+1|Xt, . . . X0) = P(Xt+1|Xt) (Since Xt is assumed to be Markovian). (A.7) P(a, b|c1, c2, . . .) = P(a|b, c1, c2, . . .) P(b|c1, c2, . . .) (Expressing joint probability (A.8) as a conditional probabilities product). P(a|b) = p(a) (If a and b are independent) . (A.9) Computing the l.h.s. of equation (A.1) yields P (RBt+1|RBt, . . . , RB0) (A.3) = P XN 1 (t + 1) XN 1 (t), . . . , XN 1 (0) (A.4) = P Xt+1, XN 1 1 (t) XN 1 (t), . . . , XN 1 (0) (A.8) = P Xt+1 XN 1 1 (t), XN 1 (t), . . . , XN 1 (0) P XN 1 1 (t) XN 1 (t), . . . , XN 1 (0) (A.5),(A.6),(A.7) = P (Xt+1|Xt) Similarly, computing the r.h.s of (A.1) yields P (RBt+1|RBt) (A.3) = P XN 1 (t + 1) XN 1 (t) (A.4) = P Xt+1, XN 1 1 (t) XN 1 (t) (A.8) = P Xt+1 XN 1 1 (t), XN 1 (t) P XN 1 1 (t) XN 1 (t) (A.5),(A.6),(A.7) = P (Xt+1|Xt) Both sides of (A.1) are equal and therefore RBt is Markovian. In addition we have that for t N: P (RBt+1|RBt) = ( P (Xt+1|Xt) if Xt X1 1(t) and RBt+1 = {Xt+1} XN 1 1 (t) 0 otherwise . Next, we prove the Markovity of Wt. Recall that Wt is defined as: Wt = [RBt, Jt] (A.10) where Jt {t N + 1, . . . , t} is a random subset of K time indices. By their definition, RBt and Jt are independent for all t. Computing the l.h.s. of equation (A.2) yields P (Wt+1|Wt, . . . , W0) (A.10) = P (RBt+1, Jt+1|RBt, Jt, . . . , RB0, J0) (A.8) = P (RBt+1|Jt+1, RBt, Jt, . . . , RB0, J0) P (Jt+1|RBt, Jt, . . . , RB0, J0) (A.1),(A.9) = P (RBt+1|RBt) P (Jt+1) (A.9) = P (RBt+1, Jt+1|RBt, Jt) (A.10) = P (Wt+1|Wt) Analysis of Stochastic Processes through Replay Buffers We have the required result in (A.2), therefore Wt is Markovian. In addition, If Jt+1 is sampled according to unordered sampling without replacement (defined in Section 2.2), then for t N: P (Wt+1|Wt) = P (RBt+1|RBt) P (Jt+1) = 1 ( N K)P (Xt+1|Xt) if Xt X1 1(t) and RBt+1 = {Xt+1} XN 1 1 (t) Jt+1 CN,K, 0 otherwise. A.3. Proof of Lemma 3 Proof. We prove by contradiction. Let us assume that the process RB is neither aperiodic nor irreducible. If it is periodic, then one of the indices in the RB is periodic. Without loss of generality, let us assume that this is the l delayed time-steps index. But since in this index we have a periodic process, i.e., it is the process X delayed in l steps, it contradicts the assumption that X is aperiodic. We prove irreducibility in a similar way. Since the process Y is a deterministic function of the process RB, it must be aperiodic and irreducible as well, otherwise it will contradict the aperiodicity and irreducibility of the process RB. Finally, since f( ) is a deterministic function, and since for each t, Yt is an image of an ergodic process Xt, i.e., each x Xt is visited infinitely often, and as a results each point y supp(Yt) of the image of f( ) is visited infinitely often, otherwise, it contradicts the deterministic nature of f( ) or the ergodicity of X. Analysis of Stochastic Processes through Replay Buffers B. Auto Correlation and Covariance proofs B.1. Proof of Theorem 1 Proof. Let Jt {1, . . . N} and Jt+τ {1, . . . N} be subsets of K indices each. We begin with calculating the autocorrelation of process Zt. 
RY (τ) = E[Yt Yt+τ] n Jt Zt n+1 1 m Jt+τ Zt+τ m+1 n Jt f(Xt n+1) 1 m Jt+τ f(Xt+τ m+1) (3) = E Jt, Jt+τ CN,K,{Xt n+1}n Jt RBt,{Xt+τ m+1}n Jt+τ RBt+τ n Jt f(Xt n+1) 1 m Jt+τ f(Xt+τ m+1) (4) = E Jt, Jt+τ CN,K E{Xt n+1}n Jt RBt,{Xt+τ m+1}n Jt+τ RBt+τ n Jt f(Xt n+1) 1 m Jt+τ f(Xt+τ m+1) (5) = E Jt, Jt+τ CN,K E{Xt n+1}n Jt RBt,{Xt+τ m+1}n Jt+τ RBt+τ En Jt,m Jt+τ [f(Xt n+1) f(Xt+τ m+1)] (6) = E Jt, Jt+τ CN,K h En Jt,m Jt+τ EXt n+1,Xt+τ m+1 [f(Xt n+1) f(Xt+τ m+1)] Jt, Jt+τ i (7) = E Jt, Jt+τ CN,K h En Jt,m Jt+τ E [Zt n+1Zt+τ m+1] Jt, Jt+τ i (8) = E Jt, Jt+τ CN,K h En Jt,m Jt+τ RZ(τ + n m) Jt, Jt+τ i (9) = Eτ Jτ [RZ(τ )] where in (1) we used the definition of Y using the indices subsets Jt and Jt+τ. In (2) used the definition of Z and in (3) we wrote the expectation explicitly. In (4) we used the conditional expectation and in (5) we wrote 1 K P n Jt f( ) and 1 K P m Jt+τ f( ) as an expectations since given the subsets Jt and Jt+τ, the probability of sampling index n or m from the RB is uniform and equals 1 K . In (6) we are left with the marginal expectations for every possible couple of indices. In (7) we used again the definition of Z and in (8) we used the definition of the auto-correlation function of Z. In (9) we defined τ = τ + n m to be the time difference between each couple of indices from Jt and Jt+τ. Note that τ Jτ where Jτ = {τ N + 1, . . . , τ + N 1}. The calculation for the covariance CY (τ) follows the same steps as we did for RY (τ). B.2. Proof of Lemma 4 Proof. Let Jt = { jk}K k=1 {N, . . . , 1} and Jt+τ = { lk}K k=1 {N, . . . , 1} be subsets of K indices each. Let Jt = {jk}K k=1 {t N + 1, . . . , t} and Jt+τ = {lk}K k=1 {t + τ N + 1, . . . , t + τ} be subsets of K time-steps each. The relations between these two subsets are: jk = n jk = t n + 1 and lk = m lk = t + τ m + 1. We saw in Section B.1 that we can move from these two subsets into the set of all possible differences Jτ. Recall that we defined τ = τ + n m to be the time difference between each couple of indices from Jt and Jt+τ. Note that τ Jτ where Jτ = {τ N + 1, . . . , τ + N 1}. Analysis of Stochastic Processes through Replay Buffers We now define some sets and multisets that will help us in the calculation of P(τ ). Definition 2. Let CN,K be the set of all possible permutations of choosing K samples out of N samples without replacement. Observe that: |CN,K| = N K Definition 3. Let LN,K be the set of all possible tuples of two batches of K samples chosen from a set of N samples. Observe that: |LN,K| = N K Definition 4. Let MN,K be the multiset of all possible two-sample tuple generated for all possible couple of batches in the set LN,K. Let MN,K be the unique set of MN,K: Observe that: |MN,K| = K2 N K | MN,K| = N 2 Definition 5. Let M(n, m) be the number of times the unique tuple (n, m) appears in the multiset MN,K. Observe that: M(n, m) = N 1 K 1 2 for all n {1, . . . , N}, m {τ, . . . , N + τ} Definition 6. Let DN,K,τ be the number of unique sample tuples which have the time difference such that: n m = τ τ Observe that: DN,K,τ = N |τ τ| Here we consider the unordered sampling without replacement (described in Section 2) for sampling Jt and Jt+τ and we would like to calculate the probability distribution for each time difference τ , that is P(τ ). We have total of K2 N K 2 such differences since we have N K possible permutations for each batch and in each permutation we have K time elements. 
We saw in the definitions above that from all K2 N K 2 possible two-sample couples we have N 2 unique sample couples, each of which has N 1 K 1 2 repetitions. From these N 2 couples, we have only N |τ τ| unique sample tuples (n, m) that holds n m = τ τ. We define d = τ τ, therefore N + 1 d N 1. We now can calculate P(τ ): N 1 K 1 2 (N |τ τ|) N 1 K 1 2 (N |d|) (2) = (N 1)! (N 1)! K! K! (N K)! (N K)! (N |d|) K2 (K 1)! (K 1)! (N K)! (N K)! N! N! (3) = N |d| where in (1) we substitute τ τ = d. In (2) we wrote explicitly the binomial terms. In (3) we canceled similar elements in the denominator and numerator. Notice that this probability formula is relevant only for τ N + 1 τ τ + N 1 and Analysis of Stochastic Processes through Replay Buffers other values of τ can not be reached form combining these two batches. Therefore, P(τ ) = 0 for τ N + 1 > τ and τ > τ + N 1. Interestingly, this proof shows how parameter K is canceled out, meaning this time difference distribution is independent on K. In addition, we can observe that the resulting distribution can be considered as a convolution of two rectangles, which represents the time limits of each batch and the uniform sampling, and the resulting convolution, a triangle which represents P(τ ). B.3. Proof of Corollary 1 Proof. Combining Theorem 1 and Lemma 4 we get: RY (τ) = Eτ [RZ(τ )] = X τ P(τ )RZ(τ ) 1= N 2 RZ(d + τ) where in (1) we used P(τ ) from Lemma 4 and changed the variables: d = τ τ for τ N + 1 τ τ + N 1. Similar development can be done to CY (τ). C. Proof of Theorem 2: Average reward and critic convergence Proof. Recall that our TD-error update Algorithm 1 is defined as δ(Oj) = r(Sj, Aj) η + ϕ(S j) w ϕ(Sj) w, where Oj = {Sj, Aj, r(Sj, Aj), S j}. In the critic update in Algorithm 1 we use an empirical mean of TD-errors of several sampled observations, denoted as {Oj}j J. Then, the critic update is defined as w = w + αw 1 j J δ(Oj)ϕ(Sj). where J is a random subset of K samples from RB with size N. Using the definition of the sampled random K indices J, instead of J, we can write the update as: w = w + αw 1 n J δ(Ot n+1)ϕ(St n+1). In this proof we follow the proof of Lemma 5 in Bhatnagar et al. (2009). Observe that the average reward and critic updates from Algorithm 1 can be written as ηt+1 = ηt + αη t F η t + M η t+1 (C.1) wt+1 = vt + αw t F w t + M w t+1 , (C.2) F η t E Jt CN,K,{Ot n+1}n Jt RBt n Jt r(St n+1, At n+1) η n Jt r(St n+1, At n+1) ηt F w t E Jt CN,K,{Ot n+1}n Jt RBt n Jt δ(Ot n+1)ϕ(St n+1) n Jt δ(Ot n+1)ϕ(St n+1) F w t Analysis of Stochastic Processes through Replay Buffers and Ft is a σ-algebra defined as Ft {ητ, wτ, M η τ , M w τ : τ t}. We use Theorem 2.2 of Borkar & Meyn (2000) to prove convergence of these iterates. Briefly, this theorem states that given an iteration as in (C.1) and (C.2), these iterations are bounded w.p.1 if Assumption 6. 1. F η t and F w t are Lipschitz, the functions F (η) = limσ F η(ση)/σ and F (w) = limσ F w(σw)/σ are Lipschitz, and F (η) and F (w) are asymptotically stable in the origin. 2. The sequences M η t+1 and M w t+1 are martingale difference noises and for some Cη 0 , Cw 0 E (M η t+1)2 Ft Cη 0 (1 + ηt 2) E (M w t+1)2 Ft Cw 0 (1 + wt 2). We begin with the average reward update in (C.1). The ODE describing its asymptotic behavior corresponds to η = E J CN,K,{Ot n+1}n J RB n Jt r(St n+1, At n+1) η F η is Lipschitz continuous in η. The function F (η) exists and satisfies F (η) = η. 
The origin is an asymptotically stable equilibrium for the ODE η = F (η) and the related Lyapunov function is given by η2/2. For the critic update, consider the ODE w = E J CN,K,{Ot n+1}n J RB n J δ(Ot n+1)ϕ(St n+1) In Lemma 5 we show that this ODE can be written as w = Φ CθΦw + Φ bθ, (C.4) where Cθ and bθ are defined in (7). F w is Lipschitz continuous in w and F (w) exists and satisfies F (w) = Φ CθΦw. Consider the system w = F (w) (C.5) In assumption 4 we assume that Φw = e for every w Rd. Therefore, the only asymptotically stable equilibrium for (C.5) is the origin (see the explanation in the proof of Lemma 5 in Bhatnagar et al. (2009)). Therefore, for all t 0 E (M η t+1)2 Ft Cη 0 (1 + ηt 2 + wt 2) E (M w t+1)2 Ft Cw 0 (1 + ηt 2 + wt 2) for some Cη 0 , Cw 0 < . M η t can be directly seen to be uniformly bounded almost surely. Thus, Assumptions (A1) and (A2) of Borkar & Meyn (2000) are satisfied for the average reward, TD-error, and critic updates. From Theorem 2.1 of Borkar & Meyn (2000), the average reward, TD-error, and critic iterates are uniformly bounded with probability one. Note that when t , (C.3) has ηθ defined as in (2) as its unique globally asymptotically stable equilibrium with V2(η) = (η ηθ)2 serving as the associated Lyapunov function. Next, suppose that w = wπ is a solution to the system Φ CθΦw = 0. Under Assumption 4, using the same arguments as in the proof of Lemma 5 in Bhatnagar et al. (2009), wπ is the unique globally asymptotically stable equilibrium of the ODE (C.4). Assumption 6 is now verified and under Assumption 5, the claim follows from Theorem 2.2, pp. 450 of (Borkar & Meyn, 2000). Analysis of Stochastic Processes through Replay Buffers C.1. Proof of Lemma 5 Proof. We compute the expectation of the critic update with linear function approximation according to Algorithm 1. In this proof, we focus on the Unordered sampling without replacement strategy for sampling batch of K transitions from the replay buffer (see Section 2.2 for this strategy probability distribution). Recall that n is a position in the RB and it corresponds to transition Ot n+1 = (St n+1, At n+1, S t n+1). We will use the notation of J {1, . . . , n, . . . , N} to refer the K indices sampled batches. In addition we will use the following observations: P(n| J, n J) = 1 K , P(n| J, n / J) = 0 (C.6) N , P(n / J) = 1 K P(n| J) = P(n J) P(n| J, n J) + P(n / J) P(n| J, n / J) = K N 1 K + 0 = 1 P( J) = 1 N K (C.8) |CN,K| = N K Now we can compute the desired expectation: E Jt CN,K,{Ot n+1}n Jt RBt n Jt δ(Ot n+1)ϕ(St n+1) = E Jt CN,K E{Ot n+1}n Jt RBt n Jt δ(Ot n+1)ϕ(St n+1) (C.6) = E Jt CN,K h E{Ot n+1}n Jt RBt En Jt [δ(Ot n+1)ϕ(St n+1)] Jt i 1= E Jt CN,K En Jt EOt n+1 [δ(Ot n+1)ϕ(St n+1)] Jt Jt CN,K P( Jt) n=1 P(n| Jt)EOt n+1 [δ(Ot n+1)ϕ(St n+1)] (C.7),(C.8) = X 1 N EOt n+1 [δ(Ot n+1)ϕ(St n+1)] (C.9) = 1 N n=1 EOt n+1 [δ(Ot n+1)ϕ(St n+1)] n=1 ESt n+1,At n+1,S t n+1 r(St n+1, At n+1) η + ϕ(S t n+1) w ϕ(St n+1) w ϕ(St n+1) where in (1) we are left with the marginal expectations for each observation, in (2) we wrote expectations explicitly and in (3) we used the definition of the TD-error in (4). Next, for time t n + 1 where 1 n N, we define the induced MC with a corresponding policy parameter θt n+1. For this parameter, we denote the corresponding state distribution vector ρt n+1 and a transition matrix Pt n+1 (both induced by the policy πθt n+1. In addition, we define the following diagonal matrix Dt n+1 diag(ρt n+1). 
Similarly to (Bertsekas & Tsitsiklis, 1996) Lemma 6.5, pp.298, we can substitute the inner expectation ESt n+1,At n+1,S t n+1 r(St n+1, At n+1) η + ϕ(S t n+1) w ϕ(St n+1) w ϕ(St n+1) = Φ Dt n+1 (Pt n+1 I) Φw + Φ Dt n+1(rt n+1 ηθe), (C.11) Analysis of Stochastic Processes through Replay Buffers where I is the |S| |S| identity matrix, e in |S| 1 vector of ones and rt n+1 is a |S| 1 vector defined as rt n+1(s) = P a πθt n+1(A|S)r(S, A). Combining equations (6), (C.10) and (C.11) yields Φ Dt n+1 (Pt n+1 I) Φw + Φ Dt n+1(rt n+1 ηθe) = Φ CtΦw + Φ bt, (C.12) In the limit, t and ρt n+1 µθ for all index n. Using Cθ and bθ defined in (7), (C.10) can be expressed as E Jt CN,K,{Ot n+1}n Jt RBt n Jt δ(Ot n+1)ϕ(St n+1) = Φ CθΦw + Φ bθ. (C.13) Analysis of Stochastic Processes through Replay Buffers D. Proof of Theorem 3: Actor convergence Proof. Recall that our TD-error update in Algorithm 1 is defined as δ(Oj) = r(Sj, Aj) η + ϕ(S j) w ϕ(Sj) w, where Oj = {Sj, Aj, r(Sj, Aj), S j}. In the actor update in Algorithm 1 we use an empirical mean of TD-errors of several sampled observations, denoted as {Oj}j J. Then, the actor update is defined as j J δ(Oj) log πθ(Aj|Sj) where J is a random subset of K samples from RB with size N. Using the definition of the sampled random K indices J, instead of J, we can write the update as: n J δ(Ot n+1) log πθ(At n+1|St n+1) In this proof we follow the proof of Theorem 2 in Bhatnagar et al. (2009). Let O = {S, A, S } and let δπ(O) = r(S, A) η + ϕ(S ) wπ ϕ(S) wπ, where wπ is the convergent parameter of the critic recursion with probability one (see its definition in the proof for Theorem 2). Observe that the actor parameter update from Algorithm 1 can be written as θt+1 = Γ θt αθ t δ(O) θ log πθ(A|S) + F θ t F θ t + N θt t N θt t = Γ θt αθ t M θ t+1 + (F θ t N θt t ) + N θt t F θ t E Jt CN,K,{Ot n+1}n Jt RBt n Jt δ(Ot n+1) θ log πθ(At n+1|St n+1) n Jt δ(Ot n+1) θ log πθ(At n+1|St n+1) F θ t N θ t E Jt CN,K,{Ot n+1}n Jt RBt n Jt δπθ(Ot n+1) θ log πθ(At n+1|St n+1) and Ft is a σ-algebra defined as Ft {ητ, wτ, θτ, M η τ , M w τ , M θ τ : τ t}. Since the critic converges along the faster timescale, from Theorem 2 it follows that F θ t N θt t = o(1). Now, let r=0 αθ r M θ r+1, t 1. The quantities δ(O) can be seen to be uniformly bounded since from the proof in Theorem 2, {ηt} and {wt} are bounded sequences. Therefore, using Assumption 5, {M2(t)} is a convergent martingale sequence (Bhatnagar & Kumar, 2004). Consider the actor update along the slower timescale corresponding to αθ t in Algorithm 1. Let w( ) be a vector field on a set Θ. Define another vector field: ˆΓ w(y) = lim0<η 0 Γ y+ηw(y) y η . In case this limit is not unique, we let ˆΓ w(y) be the set of all possible limit points (see pp. 191 of (Kushner & Clark, 2012)). Consider now the ODE E Jt CN,K,{Ot n+1}n Jt RBt n Jt δπθ(Ot n+1) θ log πθ(At n+1|St n+1) Analysis of Stochastic Processes through Replay Buffers Substituting the result from Lemma 6, the above ODE is analogous to θ = ˆΓ( θηθ + ξπθ) = ˆΓ N θ t (D.2) where ξπθ = P S µθ(S) ϕ(S) θwπθ θ V πθ(S) . Consider also an associated ODE: θ = ˆΓ θηθ (D.3) We now show that h1(θt) N θt t is Lipschitz continuous. Here wπθt corresponds to the weight vector to which the critic update converges along the faster timescale when the corresponding policy is πθt (see Theorem 2). Note that µθ(S), S S is continuously differentiable in θ and have bounded derivatives. Also, ηθt is continuously differentiable as well and has bounded derivative as can also be seen from (2). 
Further, wπθt can be seen to be continuously differentiable with bounded derivatives. Finally, 2πθt(A|S) exists and is bounded. Thus h1(θt) is a Lipschitz continuous function and the ODE (D.1) is well posed. Let Z denote the set of asymptotically stable equilibria of (D.3) i.e., the local minima of ηθ, and let Zϵ be the ϵ-neighborhood of Z. To complete the proof, we are left to show that as supθ ξπθ 0 (viz. δ 0), the trajectories of (D.2) converge to those of (D.3) uniformly on compacts for the same initial condition in both. This claim follows the same arguments as in the proof of Theorem 2 in Bhatnagar et al. (2009). D.1. Proof of Lemma 6 Proof. We compute the required expectation with linear function approximation according to Algorithm 1. Following the same steps when proving the expectation for the critic in Section C.1, we have: E Jt CN,K,{Ot n+1}n Jt RBt n Jt δπθ(Ot n+1) θ log πθ(At n+1|St n+1) n=1 ESt n+1,At n+1,S t n+1 h r(St n+1, At n+1) η + ϕ(S t n+1) w ϕ(St n+1) w θ log πθ(At n+1|St n+1) i Recall the definition of the state distribution vector ρt n+1 in Section 4.4. In the limit, t and ρt n+1 µθ for all index n, then: E Jt CN,K,{Ot n+1}n Jt RBt n Jt δπθ(Ot n+1) θ log πθ(At n+1|St n+1) S S µθ(S) X A A πθ(A|S) r(S, A) ηθ + X S S P(S |S, A)ϕ(S ) wπθ ϕ(S) wπθ ! θ log πθ(A|S) We define now the following term: V πθ(S) = X A A πθ(A|S) Qπθ(S, A) = X A A πθ(A|S) r(S, A) ηθ + X S S P(S |S, A)ϕ(S ) wπθ ! where V πθ(S) and Qπθ(S, A) correspond to policy πθ. Note that here, the convergent critic parameter wπθ is used. Let s look at the gradient of (D.4): Analysis of Stochastic Processes through Replay Buffers θ V πθ(S) = θ A A πθ(A|S) Qπθ(S, A) A A θπθ(A|S) r(S, A) ηθ + X S S P(S |S, A)ϕ(S ) wπθ ! A A πθ(A|S) S S P(S |S, A)ϕ(S ) θwπθ ! A A θπθ(A|S) r(S, A) ηθ + X S S P(S |S, A)ϕ(S ) wπθ ! A A πθ(A|S) X S S P(S |S, A)ϕ(S ) θwπθ Summing both sides over the stationary distribution µθ S µθ(S) θ V πθ(S) = X A A θπθ(A|S) r(S, A) ηθ + X S S P(S |S, A)ϕ(S ) wπθ ! A A πθ(A|S) X S S P(S |S, A)ϕ(S ) θwπθ ! = E Jt CN,K,{Ot n+1}n Jt RBt n Jt δπθ(Ot n+1) θ log πθ(At n+1|St n+1) A A πθ(A|S) X S S P(S |S, A)ϕ(S ) θwπθ θηθ = E Jt CN,K,{Ot n+1}n Jt RBt n Jt δπθ(Ot n+1) θ log πθ(At n+1|St n+1) A A πθ(A|S) X S S P(S |S, A)ϕ(S ) θvπθ θ V πθ(S) Since µθ is a stationary distribution, X A A πθ(A|S) X S S P(S |S, A)ϕ(S ) θwπθ = X S S Pθ(S |S)ϕ(S ) θwπθ S µθ(S)Pθ(S |S)ϕ(S ) θwπθ S µθ(S )ϕ(S ) θwπθ, θηθ = E Jt CN,K,{Ot n+1}n Jt RBt n Jt δπθ(Ot n+1) θ log πθ(At n+1|St n+1) S µθ(S) ϕ(S) θwπθ θ V πθ(S) The result follows immediately.