# Replicable Reinforcement Learning

Eric Eaton (University of Pennsylvania, Philadelphia, PA 19104, eeaton@seas.upenn.edu), Marcel Hussing (University of Pennsylvania, Philadelphia, PA 19104, mhussing@seas.upenn.edu), Michael Kearns (University of Pennsylvania, Philadelphia, PA 19104, mkearns@cis.upenn.edu), Jessica Sorrell (University of Pennsylvania, Philadelphia, PA 19104, jsorrell@seas.upenn.edu)

Abstract

The replicability crisis in the social, behavioral, and data sciences has led to the formulation of algorithmic frameworks for replicability, i.e., a requirement that an algorithm produce identical outputs (with high probability) when run on two different samples from the same underlying distribution. While the area is still in its infancy, provably replicable algorithms have been developed for many fundamental tasks in machine learning and statistics, including statistical query learning, the heavy hitters problem, and distribution testing. In this work we initiate the study of replicable reinforcement learning, providing a provably replicable algorithm for parallel value iteration, and a provably replicable version of R-max in the episodic setting. These are the first formal replicability results for control problems, which present different challenges for replication than batch learning settings.

1 Introduction

The growing prominence of machine learning (ML) and its widespread adoption across industries underscore the need for replicable research [Wagstaff, 2012, Pineau et al., 2021]. Many scientific fields have suffered from the same inability to reproduce the results of published studies [Begley and Ellis, 2012]. Replicability in ML requires not only the ability to reproduce published results [Wagstaff, 2012], as may be partially addressed by sharing code and data [Stodden et al., 2014], but also consistency in the results obtained from successive deployments of an ML algorithm in the same environment. However, the inherent variability and randomness present in ML pose challenges to achieving replicability, as these factors may cause significant variations in results. Building upon foundations of algorithmic stability [Bousquet and Elisseeff, 2002], recent work in learning theory has established rigorous definitions for the study of supervised learning [Impagliazzo et al., 2022] and bandit algorithms [Esfandiari et al., 2023a] that are provably replicable, meaning that algorithms produce identical outputs (with high probability) when executed on distinct data samples from the same underlying distribution. However, these results have not been extended to control problems such as reinforcement learning (RL), which have long been known to suffer from stability issues [White and Eldeib, 1994, Mannor et al., 2004, Islam et al., 2017, Henderson et al., 2018]. These stability issues have already sparked research into robustness for control problems, including RL [Khalil et al., 1996, Nilim and Ghaoui, 2005, Iyengar, 2005]. Nondeterministic environments and evaluation benchmarks, the randomness of the exploration process, and the sequential interaction of an RL agent with the environment all complicate the ability to make RL replicable. Our work is orthogonal to the robustness literature, and our goal is not to reduce
the effect of these inherent characteristics, for instance by decreasing the amount of exploration that an agent performs, but to develop replicable RL algorithms that support them. Toward this goal, we initiate the study of replicable RL and develop the first set of RL algorithms that are provably replicable. We contend that a fundamental theoretical study of replicability in RL can advance our understanding of which aspects of RL algorithms make replicability hard. In this work, we adopt a lens similar to that of Impagliazzo et al. [2022] and consider replicability as an algorithmic property that can be achieved simultaneously with exploration and exploitation. First, we show that it is possible to obtain a near-optimal, replicable policy given sufficiently many samples from every state in the environment. This notion is then naturally extended to replicable exploration. Our contributions can be summarized as follows. We provide two novel and efficient algorithms showing that stochastic, sample-based value iteration can be performed replicably, and that an agent can replicably explore the state space of an MDP while also finding a near-optimal policy. We experimentally validate that our algorithms require far fewer samples in practice than the theory suggests.

2 Preliminaries

2.1 Reinforcement learning

We consider the problem of solving a discounted Markov decision process (MDP) $M = (S, A, R, P, \gamma, \mu)$ with state space $S$, action space $A$, reward function $R$, transition kernel $P$, discount factor $\gamma$, and initial state distribution $\mu$. We assume that the size of the state space $|S|$ and the number of possible actions $|A|$ are finite and not too large. Further, we assume that the rewards for every state-action pair are deterministic, bounded, and known. Relaxing the assumptions on the reward function may not seem straightforward given our goal of replicable RL, as a stochastic reward would itself need to be estimated replicably. However, this case can be handled by our algorithms with minor modifications and only constant-factor overhead. The goal is to find a policy $\pi : S \to A$ that maximizes the cumulative discounted reward $J_h = \sum_{k=h}^{\infty} \gamma^{k-h} R(s_k, a_k)$. We use the typical definitions of the value and Q-value functions for the expected cumulative discounted return from a state or state-action pair, respectively:
$$V^{\pi}(s) = \mathbb{E}_{\pi, P}[J_h \mid s_h = s], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi, P}[J_h \mid s_h = s, a_h = a].$$
To illustrate the various difficulties that arise when trying to achieve replicability in RL, we consider two different settings that isolate different components of the problem.

Parallel sampling setting. First, we ask whether it is even possible to obtain a replicable policy from empirical samples without considering the challenges of exploration. For this, we adopt the setting of generative models GM, or more precisely, the parallel sampling setting. In the parallel sampling model, first introduced by Kearns and Singh [1998a], one has access to a parallel generative sampling subroutine PS(GM). A single call to PS(GM) returns, for every state-action pair $(s, a) \in S \times A$, a randomly sampled next state $s' \in S$ drawn from $P(s' \mid s, a)$. The key advantage is that this model separates learning from the quality of the exploration procedure.

Definition 2.1 (Generative model). Let $M$ denote an arbitrary MDP. Then a generative model $\mathrm{GM}((s, a))$ is a randomized algorithm that, given a state-action pair $(s, a) \in S \times A$, outputs a deterministic reward $R(s, a)$ and a next state $s'$ sampled from $P(\cdot \mid s, a)$.

Definition 2.2 (Parallel sampling). Let $M$ denote an arbitrary MDP.
Then a call to the parallel sampling subroutine PS(GM) returns exactly one sample $s'_i \sim \mathrm{GM}((s_i, a_i))$ for every state-action pair $(s_i, a_i) \in S \times A$ of $M$, using a generative model.

Episodic setting. The second setting we consider is one in which an algorithm does have to explore the MDP before it can obtain an optimal policy. More precisely, we consider an episodic setting where, in every episode $e \in \{1, 2, \ldots, E\}$, the agent starts in a state $s_0 \sim \mu$ and interacts with the environment for a fixed horizon $H$. At any step $h \in [1, H]$, the agent is in some state $s_h$, selects an action $a_h$, receives a reward $r_h$, and transitions to a new state $s_{h+1}$. Gathering a trajectory $\tau = (s_0, a_0, r_0, \ldots, s_H, a_H, r_H)$ of states, actions, and rewards under policy $\pi$ can be thought of as a draw from a distribution $\tau \sim P^{\pi}_M(\tau)$. We will omit the sub- and superscripts when clear from context. For consistency with the remaining analysis, we work with a $\gamma$-discounted version of the problem.

2.2 Replicability

We build on the recent framework of Impagliazzo et al. [2022], which considers replicability as a property of randomized algorithms that take as input a dataset sampled i.i.d. from an arbitrary distribution. They consider an algorithm to be replicable if, on two runs in which its internal randomness is fixed and its input data is resampled, it outputs the same result with high probability:

Definition 2.3 (Replicability). Fix a domain $X$ and target replicability parameter $\rho \in (0, 1)$. A randomized algorithm $A : X^n \to Y$ is $\rho$-replicable if for all distributions $D$ over $X$, randomizing over the internal randomness $r$ of $A$ and the choice of samples $S_1, S_2$, each of size $n$ drawn i.i.d. from $D$, we have
$$\Pr_{S_1, S_2, r}\left[A(S_1; r) = A(S_2; r)\right] \geq 1 - \rho .$$

Several key tools introduced by Impagliazzo et al. [2022] will prove useful or provide inspiration for the algorithms developed in this work. One of the key observations is that many of the computations in RL can be phrased as statistical queries, defined as follows:

Definition 2.4 (Statistical query, [Kearns, 1998]). Fix a distribution $D$ over $X$ and an accuracy parameter $\alpha \in (0, 1)$. A statistical query is a function $\varphi : X \to [0, 1]$, and a mechanism $M$ answers $\varphi$ with tolerance $\alpha$ on distribution $D$ if its answer $a$ satisfies $|a - \mathbb{E}_{x \sim D}[\varphi(x)]| \leq \alpha$.

We will make direct use of the replicable algorithm for answering statistical queries by Impagliazzo et al. [2022], which will be useful for obtaining replicable estimates of various quantities such as transition probabilities. We will refer to the replicable statistical query procedure as rSTAT. We note that Impagliazzo et al. [2022] also prove a lower bound on the sample complexity required for replicable statistical queries, showing that the result below is essentially tight.

Theorem 2.1 (Replicable statistical queries, Impagliazzo et al. [2022]). There is a $\rho$-replicable algorithm rSTAT such that for any distribution $D$ over $X$, replicability parameter $\rho \in (0, 1)$, accuracy parameter $\alpha \in (0, 1)$, failure parameter $\delta \in O(\rho)$, and query $\varphi : X \to [0, 1]$, letting $S$ be a sample of $n \in O\!\left(\frac{\log(1/\delta)}{(\rho - 2\delta)^2 \alpha^2}\right)$ elements drawn i.i.d. from $D$, we have that $a \leftarrow \mathrm{rSTAT}_{\alpha, \rho}(S, \varphi)$ satisfies $|a - \mathbb{E}_{x \sim D}[\varphi(x)]| \leq \alpha$, except with probability at most $\delta$ over the samples $S$.

At a very high level, rSTAT uses its sample to empirically estimate the expected value of the statistical query on the target distribution. It then uses its internal randomness to pick an evenly spaced set of canonical representatives from the $[0, 1]$ interval, and returns whichever canonical representative is closest to the empirical estimate.
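The following is a minimal Python sketch of this randomized-rounding idea. It is meant only to illustrate the mechanism: the grid width is set directly from α and the shared internal randomness is passed in as a NumPy generator, so the exact constants of Impagliazzo et al. [2022] are not reproduced.

```python
import numpy as np

def r_stat(sample, phi, alpha, rng):
    """Replicable statistical query via randomized rounding (illustrative sketch).

    sample : iterable of data points drawn i.i.d. from D
    phi    : query function mapping a data point into [0, 1]
    alpha  : target accuracy; canonical representatives are spaced ~2*alpha apart
    rng    : np.random.Generator seeded with the SHARED internal randomness r,
             so both runs of the algorithm draw the same grid offset
    """
    estimate = float(np.mean([phi(x) for x in sample]))  # empirical answer to the query
    width = 2.0 * alpha                                  # spacing of the canonical grid
    offset = rng.uniform(0.0, width)                     # random shift of the grid
    # Return the canonical representative closest to the empirical estimate.
    return offset + width * round((estimate - offset) / width)
```

Two runs that share the generator see the same grid, so as long as their empirical estimates fall into the same cell, which happens with high probability once the estimates concentrate, they return exactly the same canonical value.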
We note that the algorithm of Impagliazzo et al. [2022] for replicably answering statistical queries is not only sample efficient, but also computationally efficient, as it has runtime polynomial in $1/\alpha$, $1/\rho$, and $\log(1/\delta)$.

3 Replicable reinforcement learning

To define replicability for the RL setting, we can adapt Definition 2.3 more or less directly. The question that arises is which of the many RL objects should be made replicable. We separate the difficulty of replicability into three levels: replicability of the MDP, of the value function, and of the policy. Since these objects carry different amounts of information [Farahmand, 2011], the following relationships can be established. If we are able to replicably (and accurately) estimate an MDP, we can always replicably compute an (optimal) value function using standard techniques on our estimates, and from replicable value functions we can obtain the corresponding policies. Note that the converse is not true, as we lose information when going from MDP to value function and then to policy. As a result, we expect that replicable estimation of MDPs is the hardest setting in stochastic RL, followed by replicable value function estimation, and then policy estimation. For replicability of control problems, a sensible requirement is the production of identical policies, which are the ultimate object of primary interest. We would at least like to ensure that, with high probability, we obtain identical optimal policies across two runs of our RL procedures:

Definition 3.1 (Replicable policy estimation). Let $A$ be a policy estimation algorithm that outputs a policy $\hat\pi : S \to A$ given a set of trajectories $S$ sampled from an MDP. Algorithm $A$ is $\rho$-replicable if, given independently sampled trajectory sets $S_1$ and $S_2$ yielding policies $\hat\pi^{(1)}$ and $\hat\pi^{(2)}$, it holds for all states $s \in S$ and actions $a \in A$ that
$$\Pr_{S_1, S_2, r}\left[\hat\pi^{(1)}(a \mid s) = \hat\pi^{(2)}(a \mid s)\right] \geq 1 - \rho, \quad \text{where } \hat\pi^{(1)} \leftarrow A(S_1; r), \; \hat\pi^{(2)} \leftarrow A(S_2; r),$$
and $r$ represents the internal randomness of $A$. Trajectory sets $S_1$ and $S_2$ may be gathered from the environment during the execution of an RL algorithm.

Figure 1: The Grid World for our experiments (left) and two different policies generated by the Phased Q-learning algorithm on this gridworld (center and right). Following the first policy (center) more likely reaches the left goal, while following the second policy (right) more likely reaches the right goal. All states except the goals have 0 reward. The actions are up, down, left, and right; there is a 30% chance that, after choosing an action, the agent moves left or right of the target direction.

While this definition is the weakest we would like to achieve, the results we present in this paper provide stronger guarantees. Our Replicable Phased Value Iteration builds on Kearns and Singh [1998a] and ensures replicability of value functions, while our Replicable Episodic R-max follows Kearns and Singh [1998b], Brafman and Tennenholtz [2003] and provides replicability of full MDPs. Equivalent formal definitions for replicable value and MDP estimation are given in Appendix A. Current algorithms for sample-based RL problems struggle to satisfy Definition 3.1 and output different policies even in simple environments (see Figure 1). In some cases, this may not be problematic, since the resulting policies will still be $\varepsilon$-optimal, but in practice it is often hard to tell when that is the case.
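As a concrete reading of Definition 3.1, the hypothetical harness below (not part of our algorithms) estimates the probability on the left-hand side empirically: it fixes the internal randomness via a seed, feeds two freshly sampled trajectory sets to the same estimator, and checks whether the returned greedy policies agree on every state.

```python
import numpy as np

def two_run_agreement(estimate_policy, sample_trajectories, n_trials=100):
    """Empirical estimate of Pr[A(S1; r) = A(S2; r)] from Definition 3.1 (illustrative).

    estimate_policy(trajectories, rng) -> array with one greedy action per state
    sample_trajectories()              -> a fresh i.i.d. trajectory set from the MDP
    """
    agree = 0
    for trial in range(n_trials):
        S1, S2 = sample_trajectories(), sample_trajectories()  # resampled data
        rng1 = np.random.default_rng(trial)                    # shared internal
        rng2 = np.random.default_rng(trial)                    # randomness r
        pi1, pi2 = estimate_policy(S1, rng1), estimate_policy(S2, rng2)
        agree += int(np.array_equal(pi1, pi2))
    return agree / n_trials  # a rho-replicable estimator should give at least 1 - rho
```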
Fixing replicability will support the identification of problematic solutions and encourage procedures that yield more stable solutions in the long run. Varying policies can, for example, arise from sample uncertainty, insufficient state-space coverage, or differing exploration. In order to achieve replicability, all of the aforementioned challenges need to be addressed, which makes for an intricate but interesting problem. With this in mind, the next section introduces a first set of formally replicable algorithms that separate out some of these challenges.

4 Algorithms

4.1 Replicable phased value iteration

The first question we answer positively is whether it is even possible to achieve replicability when the samples are drawn i.i.d. from the same distribution. For this, we use the parallel sampling model described in Section 2. This model is well-suited for the task, as it allows us to analyze sample-based value iteration independently of the exploration policy that collects the samples. We provide a replicable version of indirect Phased Q-learning [Kearns and Singh, 1998a], which was later also referred to as Phased Value Iteration [Kakade, 2003]. In brief, the algorithm iterates $T$ times; at every iteration it makes $m$ calls to PS(GM), computes an approximate value estimate for every state, and performs one round of value updates. Kearns and Singh [1998a] provide the following Lemma 4.1, which establishes the optimality of the original procedure.

Lemma 4.1 (Phased Q-learning convergence, [Kearns and Singh, 1998a]). Suppose the number of calls to PS(GM) is chosen such that the value function estimates produced in every round by Phased Q-learning are sufficiently accurate. For any MDP $M$, Phased Q-learning converges to a policy $\hat\pi$ whose return is within $\varepsilon$ of that of the optimal policy $\pi^*$.

Our algorithm operates similarly, but we would like to achieve replicability on top of optimality. We use the randomized rounding procedure for statistical query estimation (rSTAT) provided by Impagliazzo et al. [2022] to compute the value estimates at every iteration. For this, we assume that the value function is normalized to the interval $[0, 1]$. A detailed description of our algorithm is provided in Algorithm 1. The Replicable Phased Value Iteration (rPVI) algorithm satisfies Definition 3.1 and produces $\varepsilon$-optimal policies. It goes even one step further and produces not only replicable policies but replicable value functions. This is formalized in the following Theorem 4.1.

Theorem 4.1. Let $\varepsilon \in (0, 1)$ be the accuracy and $\rho \in (0, 1)$ the replicability parameter. Let $\delta \in (0, 1)$ be the sample failure probability. Set the number of calls to PS(GM) at every iteration to
$$m \in O\!\left(\frac{\log^2(1/\varepsilon)\,|S|^2|A|^2}{\varepsilon^2(\rho - 2\delta)^2}\left(\log\frac{|S||A|}{\delta} + \log\log(1/\varepsilon)\right)\right),$$
where $O$ suppresses the dependence on $\gamma$. In two runs (1) and (2) with shared internal randomness, Algorithm 1 produces identical policies, i.e., $\Pr[\hat\pi^{(1)} = \hat\pi^{(2)}] \geq 1 - O(\rho)$. In every run, the produced policy $\hat\pi$ achieves return at most $\varepsilon$ less than that of the optimal policy $\pi^*$, with all but probability $O(\delta)$.

Proof Sketch. We give a sketch of the proof here and refer the reader to the full proof in Appendix B.2. Assume that we can obtain replicable and accurate estimates of the value function expectations from our rSTAT procedure. One can then show by induction that the algorithm consistently produces the same value functions in every iteration. Lemma 4.1 then guarantees convergence to an $\varepsilon$-optimal policy.
Finally, we use union and Chernoff bounds to pick a sample large enough for our rSTAT queries to be replicable and accurate, satisfying our assumption.

An interesting observation is that rPVI discretizes the space of values as a function of the $\varepsilon$ parameter and $\gamma$ (see Appendix B.2). As a result, replicability becomes harder for larger values of $\gamma$: the discretization intervals become smaller and we require more samples to obtain the same replicability parameter $\rho$. This is intuitive, as we need to account for more potential future states that might impact our estimates. The number of samples needed to compute a replicable value function is at most $O(\log^2(1/\varepsilon)|S|^2|A|^2/\rho^2)$ times larger than for computing a non-replicable one [Kearns and Singh, 1998a]. Still, a key observation of the original Phased Q-learning result was that it suffices for every state-action pair to have a sample size logarithmic in $|S||A|$, making the procedure cheaper than estimating the full transition dynamics of an MDP. The cost of replicability is the loss of this property. However, we note that rPVI does not yield replicable transition probability estimation. Using rSTAT queries to obtain transition estimates turns out to be significantly more expensive than the replicable value estimation done by Algorithm 1 (see Appendix B.2.1). Our results thus retain the principle that direct value estimation is much cheaper than estimating the full transition kernel, even in the presence of replicability.

Algorithm 1: Replicable Phased Value Iteration (rPVI)
    Parameters: accuracy ε, failure probability δ, replicability failure probability ρ
    Input: generative model GM
    Output: ε-optimal policy π̂
    Initialize Q̂_0(s, a) to 0 for all (s, a) ∈ S × A
    For all s ∈ S, let φ_Q(s) := max_a Q(s, a)
    for t = 0, ..., T − 1 do
        S ← (PS(GM))^m    // make m calls to PS(GM); store next states in a map from state-action pairs (s, a) to next states S[(s, a)]
        for (s, a) ∈ S × A do
            V̂(s') ← rSTAT(S[(s, a)], φ_{Q̂_t})
            Q̂_{t+1}(s, a) ← R(s, a) + γ V̂(s')
        end for
    end for
    return π̂ with π̂(s) = arg max_a Q̂_T(s, a)
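To complement the listing, here is a minimal Python sketch of the rPVI loop for a small tabular MDP. The transition tensor `P` stands in for the generative model, `r_stat` is the rounding routine sketched in Section 2.2, and the iteration count `T`, per-iteration sample size `m`, and accuracy `alpha` are illustrative placeholders rather than the values required by Theorem 4.1.

```python
import numpy as np

def parallel_sample(P, rng):
    """One call to PS(GM): one next state for every (s, a), drawn from P[s, a]."""
    S, A, _ = P.shape
    return np.array([[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)])

def rpvi(P, R, gamma, T, m, alpha, shared_seed, sample_seed):
    """Sketch of Algorithm 1. shared_seed drives the shared internal randomness of
    r_stat; sample_seed drives the (resampled) calls to PS(GM)."""
    S, A, _ = P.shape
    shared_rng = np.random.default_rng(shared_seed)
    sample_rng = np.random.default_rng(sample_seed)
    Q = np.zeros((S, A))
    for _ in range(T):
        batch = [parallel_sample(P, sample_rng) for _ in range(m)]  # m calls to PS(GM)
        V = Q.max(axis=1)                                           # phi_Q(s') = max_a Q(s', a)
        new_Q = np.empty_like(Q)
        for s in range(S):
            for a in range(A):
                next_states = [ns[s, a] for ns in batch]            # the multiset S[(s, a)]
                v_hat = r_stat(next_states, lambda sp: V[sp], alpha, shared_rng)
                new_Q[s, a] = R[s, a] + gamma * v_hat
        Q = new_Q
    return Q.argmax(axis=1), Q  # greedy policy and final value estimates
```

Because two runs with the same shared_seed consume shared_rng in the same deterministic loop order, they see identical rounding grids, which is the property the proof of Theorem 4.1 exploits.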
4.2 Replicable RL with exploration

Next, we consider the setting of episodic exploration. We show that, despite the stochastic nature of exploration, it is possible to guarantee replicability while still outputting an $\varepsilon$-optimal policy. We take the R-max algorithm of Brafman and Tennenholtz [2003] as the starting point for our replicable algorithm RepRMax (Algorithm 2). It proceeds in rounds where the agent interacts with the environment for multiple episodes.

Algorithm 2: Replicable Episodic R-max (RepRMax)
    Parameters: accuracy ε, accuracy failure probability δ, replicability failure probability ρ, horizon H
    Input: MDP M, maximum reward Rmax
    Output: ε-optimal policy π̂_{M̂_K}
    Initialize π̂_{M̂_K} to a random policy, and the state-visitation counters n(s, a) to 0
    Initialize K, the set collecting known state-action pairs, to the empty set
    Initialize S, the set collecting trajectories used for estimating transition probabilities, to the empty set
    Initialize M̂_K by P̂_K(s'|s, a) := 1[s' = s] for all (s, a, s') and R̂_K(s, a) := Rmax for all (s, a)
    i ← 1
    while π̂_{M̂_K} is not ε-optimal do
        Collect a sample of trajectories S_i ∼ P(τ)^m and add S_i to S
        K_i ← RepUpdateK(S_i, K, {n(s, a)}_{(s,a) ∈ S×A})    // identify new known states
        For all (s, a) ∈ K_i, let S[(s, a)] be the multiset of s' visited from (s, a) over all τ ∈ S
        For all s' ∈ S, let φ_{s'}(x) := 1[x = s']
        Update M̂_K for all (s, a) ∈ K_i: P̂_K(s'|s, a) := rSTAT(S[(s, a)], φ_{s'}), R̂_K(s, a) := R(s, a)
        K ← K ∪ K_i
        Compute π̂_{M̂_K} from M̂_K
    end while
    return π̂_{M̂_K}

The collection of trajectories encountered during exploration is used to incrementally build a model $\hat M$ of the underlying MDP $M$. The algorithm implicitly partitions the set of state-action pairs $S \times A$ into two groups: known and unknown. All $(s, a) \in S \times A$ are initialized as unknown. While a state-action pair is unknown, the model $\hat M$ maintains that $(s, a)$ is a self-loop with probability 1 and that $(s, a)$ has maximum reward, thereby promoting exploration of unknown states. After a state-action pair $(s, a)$ has been visited sufficiently many times, it is added to the collection of known states $K$, and its transition probabilities $\hat P$ and reward $\hat R$ are updated with an empirical approximation of $\hat P_K(s' \mid s, a)$ for all $s' \in S$ and the observed reward $R$, respectively. After every update, the policy $\hat\pi_{\hat M_K}$ is computed as the optimal policy of the current model estimate. While convergence of Algorithm 2 to an $\varepsilon$-optimal policy follows from familiar arguments [Brafman and Tennenholtz, 2003], proving replicability requires a great deal of additional care. To ensure that two runs of RepRMax (with shared internal randomness) converge to the same policy with high probability, we show something even stronger: we prove that two such runs will, with high probability, perform the same sequence of updates to their respective models $\hat M_K$ and policies $\hat\pi_{\hat M_K}$. To enforce this property, we introduce a subroutine (Algorithm 3) that replicably identifies state-action pairs that should be added to the collection of known states. Guaranteeing that at each iteration the set of known states $K$ is the same across two independent runs of the algorithm ensures that the models of the MDP $\hat M_K$, and consequently the policies $\hat\pi_{\hat M_K}$ learned at each iteration, will also be identical. To provide replicability, we want to avoid using a fixed threshold for the number of times a state-action pair $(s, a)$ must be visited before it is considered "known". Under small deviations in realized transitions, a fixed threshold might lead to some $(s, a)$ becoming known in one run of the algorithm and not in another. Instead, we use a randomized threshold. In a call to Algorithm 3, the sample drawn at that round is used to estimate the expected number of visits to $(s, a)$ in a single trajectory, for every $(s, a)$. This estimate is added to the count $n(s, a)$, which maintains the sum, over all iterations thus far, of the estimated expected visits to $(s, a)$ from a single trajectory of the policy $\hat\pi_{\hat M_K}$ at that iteration. A new threshold $k'$ is then sampled uniformly from $[k, k+w]$. If $n(s, a) \geq k'$, the pair is added to the set of known states $K$. From standard concentration arguments, we know that for two runs of Algorithm 2 with independent samples, the estimates of the total number of expected visits $n(s, a)$ will both be close to the true total number of expected visits, and therefore close to each other. Algorithm 3 will only make different decisions about adding an $(s, a)$ pair to the set of known states if the threshold $k'$ falls between the two estimated values of $n(s, a)$ from the two runs. Here, the concentration of $n(s, a)$ and the fact that $k'$ is randomized allow us to bound the probability that the threshold $k'$ falls between the two $n(s, a)$ values. We show in Theorem 4.2 that, as long as the sample size $m$ used to estimate expected visits and the window $w$ from which the randomized threshold is sampled are taken to be large enough, the update to the set of known states at each round is replicable.
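A minimal Python rendering of this thresholding step (formalized as Algorithm 3 below) may help fix ideas; `k_lo` and `w` stand for k and w, the shared `rng` supplies the internal randomness, and `counts` plays the role of n(s, a).

```python
import numpy as np

def rep_update_known(trajectories, known, counts, k_lo, w, rng):
    """Replicably decide which unknown (s, a) pairs become 'known' (illustrative sketch).

    trajectories : list of trajectories, each a list of (s, a) pairs
    known        : set of already-known (s, a) pairs
    counts       : dict mapping (s, a) to the running estimate n(s, a)
    """
    threshold = k_lo + rng.uniform(0.0, w)   # shared randomized threshold k'
    visits = {}
    for tau in trajectories:                 # empirical visit counts for this round
        for sa in tau:
            visits[sa] = visits.get(sa, 0) + 1
    newly_known = set()
    for sa, v in visits.items():
        if sa in known:
            continue
        counts[sa] = counts.get(sa, 0.0) + v / len(trajectories)  # add estimated expected visits
        if counts[sa] >= threshold:          # crossed the randomized threshold
            newly_known.add(sa)
    return newly_known
```

Two runs sharing `rng` draw the same threshold, so they can only disagree about a pair when that threshold happens to land between their two count estimates, an event whose probability is at most the gap between the estimates divided by w.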
Algorithm 3: RepUpdateK
    Parameters: accuracy failure probability δ, replicability failure probability ρ
    Input: sample of trajectories S_i, set of known states K, set of state-visit counts {n(s, a)}_{(s,a) ∈ S×A}
    Output: list of new known state-action pairs K_i
    K_i ← {(s, a) : (s, a) ∈ S × A and (s, a) ∉ K}
    k' ∼ U[k, k + w]
    for (s, a) ∈ K_i do
        ĉ_{s,a} ← (1/|S_i|) Σ_{τ ∈ S_i} Σ_{h=1}^{H} 1[(s_h, a_h) = (s, a)]
        n(s, a) ← n(s, a) + ĉ_{s,a}
        if n(s, a) < k' then
            Remove (s, a) from K_i
        end if
    end for
    return K_i

Now that we have an intuitive understanding of these intricacies, we prove convergence (Lemma 4.2) and replicability (Theorem 4.2) of Algorithm 2.

Lemma 4.2 (Convergence). Let $A$ denote Algorithm 2 with the parameter settings below. Let $\varepsilon \in (0, 1)$ be the accuracy parameter, $\rho \in (0, 1)$ the replicability parameter, and $\delta \in (0, 1)$ the sample failure probability, with $\delta < \rho/4$. Let $T \in \Theta\!\left(\frac{H|S||A|}{\varepsilon} + \frac{H^2 \log(1/\delta)}{\varepsilon^2}\right)$ be a bound on the number of iterations of Algorithm 2. Suppose $1 - \gamma > \frac{\varepsilon}{H|A|}$ and let $m \in O\!\left(\frac{|S|^2|A|^2 T^4 \log(1/\rho)}{\rho^2}\right)$ be the number of trajectories per iteration. Let $k = H$ be the lowest expected visit count of a state-action pair before it is known, and let $w \in O(k)$ define the window $[k, k + w]$ for sampling the randomized threshold $k'$. Then with all but probability $\delta$, after $T$ iterations, $A$ yields an $\varepsilon$-optimal policy.

The proof of convergence is similar to those found in Kearns and Singh [1998b] and Brafman and Tennenholtz [2003], so we defer it to Appendix B.3. We continue here with the final theorem statement, which summarizes the properties of our RepRMax algorithm.

Theorem 4.2. Let the parameters be set as in Lemma 4.2. Then with all but probability $\delta$, $A$ converges to an $\varepsilon$-optimal policy in $T$ iterations and samples $mT$ trajectories, each of length $H$, for a total sample complexity of
$$O\!\left(\frac{|S|^7|A|^7 H^6 \log(1/\rho)}{\rho^2 \varepsilon^5} + \frac{|S|^2|A|^2 H^{10} \log^5(1/\delta) \log(1/\rho)}{\varepsilon^{10}}\right).$$
Further, let $S_1$ and $S_2$ be two trajectory sets, independently sampled over two runs of $A$ with shared internal randomness, and let $\hat\pi^{(1)}_{\hat M_K}(a \mid s) \leftarrow A(S_1; r)$ and $\hat\pi^{(2)}_{\hat M_K}(a \mid s) \leftarrow A(S_2; r)$. Then
$$\Pr_{S_1, S_2, r}\left[\hat\pi^{(1)}_{\hat M_K}(a \mid s) = \hat\pi^{(2)}_{\hat M_K}(a \mid s)\right] \geq 1 - O(\rho).$$

Proof. Lemma 4.2 gives us that, for our settings of $k$ and $T$, Algorithm 2 converges to an $\varepsilon$-optimal policy in $T$ iterations, except with probability $\delta$. The sample complexity follows immediately from the bound on $T$ and the setting of $m$, so it remains to analyze replicability. Our analysis makes use of some additional shorthand. We use $\rho_K \in O(\rho/(T|S||A|))$ to denote the replicability parameter for the decision to add a single $(s, a)$ to $K$ in a single call to Algorithm 3. We similarly use $\rho_{SQ} \in O(\rho/(|S|^2|A|))$, $\alpha_{SQ} \in O(\varepsilon(1-\gamma)^2/|S|)$, and $\delta_{SQ} \in O(\delta/(|S|^2|A|))$ to denote the replicability, accuracy, and failure parameters for the rSTAT queries made during the updates to $\hat P(s' \mid s, a)$. We use $t \in O\!\left(\frac{w \rho_K}{T}\right) \subseteq O\!\left(\frac{H\rho}{|S||A|T^2}\right)$ to denote a high-probability bound on the difference between the empirical estimates of the expected visits to a given $(s, a)$ in a trajectory across two runs of Algorithm 3, i.e. $|\hat c^{(1)}_{s,a} - \hat c^{(2)}_{s,a}| \leq O(t)$. We are now ready to prove the following stronger claim:

Claim 4.1. If two runs of Algorithm 2 begin iteration $i$ with $\hat M^{(1)}_K = \hat M^{(2)}_K$, $\hat\pi^{(1)}_{\hat M_K} = \hat\pi^{(2)}_{\hat M_K}$, and $|n(s, a)^{(1)} - n(s, a)^{(2)}| \leq O(it)$ for all $(s, a)$, then at the end of iteration $i$ we have $\hat M^{(1)}_K = \hat M^{(2)}_K$, $\hat\pi^{(1)}_{\hat M_K} = \hat\pi^{(2)}_{\hat M_K}$, and $|n(s, a)^{(1)} - n(s, a)^{(2)}| \leq O(it + t)$ for all $(s, a)$, with all but probability $O(\rho_K|S||A| + \rho_{SQ}|K_i||S|)$.

We take the initialization of Algorithm 2 as the base case of our inductive proof. Before the first iteration, $\hat\pi_{\hat M_K}$ is initialized randomly, and shared internal randomness yields $\hat\pi^{(1)}_{\hat M_K} = \hat\pi^{(2)}_{\hat M_K}$.
We deterministically initialize $\hat M_K$ and all $n(s, a)$, so $\hat M^{(1)}_K = \hat M^{(2)}_K$ and $n(s, a)^{(1)} = n(s, a)^{(2)}$. Next, we prove the inductive step. We begin by showing that, at the end of the $i$th iteration, $|n(s, a)^{(1)} - n(s, a)^{(2)}| \leq O(it + t)$ for all $(s, a)$, with all but probability $O(\rho_K|S||A|)$. Our inductive hypothesis gives us that $|n(s, a)^{(1)} - n(s, a)^{(2)}| \leq O(it)$, so it suffices to show that, for a single $(s, a)$, $|\hat c^{(1)}_{s,a} - \hat c^{(2)}_{s,a}| \leq O(t)$ except with probability $O(\rho_K)$. To obtain high-probability bounds on $|\hat c^{(1)}_{s,a} - \hat c^{(2)}_{s,a}|$, we rely on our assumption that at the start of the iteration, $\hat\pi^{(1)}_{\hat M_K} = \hat\pi^{(2)}_{\hat M_K}$. It follows that, for every state-action pair $(s, a)$, the expected number of visits to $(s, a)$ in a single episode is the same for both runs. That is, for every $(s, a)$, defining $c_{s,a} := \mathbb{E}_{\tau \sim P(\tau)}\left[\sum_{h=1}^{H} \mathbb{1}[(s_h, a_h) = (s, a)]\right]$, we have $c^{(1)}_{s,a} = c^{(2)}_{s,a}$. For a particular $(s, a)$, Chernoff bounds applied to the average observed counts $\hat c^{(1)}_{s,a}$ and $\hat c^{(2)}_{s,a}$ show that they must both be close to their (shared) expectation with high probability. We draw a sample of $m \in O\!\left(\frac{|S|^2|A|^2 T^4 \log(1/\rho)}{\rho^2}\right)$ trajectories, which satisfies $m \in \Omega\!\left(\frac{H^2 \log(1/\rho_K)}{t^2}\right)$, and each $c_{s,a} \in [0, H]$, so except with probability $4\exp\!\left(-\frac{2t^2 m}{H^2}\right) \leq O(\rho_K)$,
$$|\hat c^{(1)}_{s,a} - \hat c^{(2)}_{s,a}| \leq \left|\frac{1}{m}\sum_{\tau \in S_1}\sum_{h=1}^{H} \mathbb{1}[\tau_h = (s, a)] - c^{(1)}_{s,a}\right| + \left|\frac{1}{m}\sum_{\tau \in S_2}\sum_{h=1}^{H} \mathbb{1}[\tau_h = (s, a)] - c^{(1)}_{s,a}\right| \leq O(t),$$
where $\tau_h := (s_h, a_h)$. Union bounding over all $s \in S$ and $a \in A$ shows that the stated bound holds for all $(s, a)$ except with probability $\rho_K|S||A|$. We now show that $\hat M^{(1)}_K = \hat M^{(2)}_K$ at the end of the iteration, except with probability $O(\rho_K|S||A| + \rho_{SQ}|K_i||S|)$. Observe that $\hat M^{(1)}_K = \hat M^{(2)}_K$ at the end of the iteration unless at least one of the following two events occurs: 1) $K^{(1)}_i \neq K^{(2)}_i$, i.e., the set of new known $(s, a)$ pairs differs across the two runs; 2) the updates to $\hat P_K(s' \mid s, a)$ and $R(s, a)$ differ for at least one $(s, a)$. The first event occurs exactly when $k'$ falls between $n(s, a)^{(1)} + \hat c^{(1)}_{s,a}$ and $n(s, a)^{(2)} + \hat c^{(2)}_{s,a}$ for some $(s, a)$. We have already shown that $|n(s, a)^{(1)} + \hat c^{(1)}_{s,a} - n(s, a)^{(2)} - \hat c^{(2)}_{s,a}| \leq O(it + t) \leq O(tT)$, for a single $(s, a)$, except with probability $O(\rho_K)$. We have sampled $k'$ uniformly at random from an interval of width $w$, so it follows that $\Pr_{k', S_1, S_2}[(s, a) \in K^{(1)}_i \,\triangle\, K^{(2)}_i] \leq O(\rho_K + tT/w)$. We took $t \leq \frac{w\rho_K}{T}$, so by a union bound over $S \times A$, the probability of the first event is at most $O(|S||A|\rho_K)$. To bound the probability of the second event conditioned on the first event not occurring, it suffices to bound the probability that the updates to $\hat P_K(s' \mid s, a)$ for $(s, a) \in K_i$ differ across the two runs. By the conditioning, we have $K^{(1)}_i = K^{(2)}_i$, so it suffices to show that each call to rSTAT returns the same value in both runs. Taking $\rho_{SQ}$, $\alpha_{SQ}$, and $\delta_{SQ}$ as the replicability, tolerance, and failure parameters, respectively, gives that a sample of size $s \in O\!\left(\frac{|S|^2 \log(1/\delta_{SQ})}{\varepsilon^2(\rho_{SQ} - 2\delta_{SQ})^2 (1-\gamma)^4}\right)$ is sufficient, by Theorem 2.1. Furthermore, we have assumed that $1 - \gamma > \frac{\varepsilon \log^{1/4}(1/\delta)}{H|A| \log^{1/4}(1/\rho)}$, $\delta_{SQ} < \rho_{SQ}/4$, and $\rho_{SQ} \in O(\rho/(|S|^2|A|))$, so a sample of size $s \in O\!\left(\frac{|S|^6|A|^6 H^4 \log(1/\rho)}{\varepsilon^4 \rho^2}\right)$ will also suffice. Each $(s, a)$ is added to $K_i$ only if it was visited at least $km$ times. We have taken $k = H$, $m \in O\!\left(\frac{|S|^2|A|^2 T^4 \log(1/\rho)}{\rho^2}\right)$, and $T \in \Omega\!\left(\frac{|S||A|H}{\varepsilon}\right)$. It follows that $km \in \Omega\!\left(\frac{|S|^6|A|^6 H^5 \log(1/\rho)}{\varepsilon^4 \rho^2}\right) \geq s$, and therefore $S[(s, a)]$ comprises at least $s$ i.i.d. samples from $P(\cdot \mid s, a)$, as desired. Union bounding over the $|K_i||S|$ queries in the $i$th iteration gives a bound of $|K_i||S|\rho_{SQ}$ on the probability of the second event, conditioned on the first event not happening. We now assemble our inductive argument into a proof of the theorem.
At the start of iteration $i$, the inductive hypothesis holds except with probability $\sum_{j=1}^{i-1}\left(\rho_K|S||A| + \rho_{SQ}|K_j||S|\right)$. Noting that $\sum_{j=1}^{T} |K_j| \leq |S||A|$, and recalling that we have taken replicability parameters $\rho_K \in O(\rho/(T|S||A|))$ and $\rho_{SQ} \in O(\rho/(|S|^2|A|))$, ensures that we achieve replicability parameter $\rho$ after the $T$ iterations of Algorithm 2.

4.3 Limitations

As mentioned previously, our bounds lose some of the properties that standard RL results provide, such as the ability to estimate value functions with only a logarithmic dependence on relevant parameters. We expect that some of the sample complexity overhead from achieving replicability is inevitable, as shown by the statistical query lower bound of Impagliazzo et al. [2022]. Nonetheless, we hope that future work can improve on the sample complexities of our algorithms. Our work is in part motivated by recent replicability concerns in deep RL [Islam et al., 2017, Henderson et al., 2018]. However, establishing formal guarantees in these highly complicated settings is often not easy. As such, our algorithms suffer the weakness that many theoretical results in RL have to deal with, namely their lack of immediate applicability to real-world problems. Yet, our empirical evaluation in Section 5 shows that there is hope for practical application.

5 Experiments

While our asymptotic bounds have sample complexity overhead from the introduction of replicability, we would like to analyze the actual requirements in practice. We introduce a simple MDP in Figure 1 that contains several ways of reaching the two goals. We analyze the impact of the number of calls to PS(GM) on replicability for rPVI. In theory, our dependence on the number of calls is not logarithmic with respect to $|S||A|$, but we would like to see whether we can draw a sample that is much smaller, perhaps even on the order of the logarithmic requirement. We choose accuracy $\varepsilon = 0.02$, failure rate $\delta = 0.001$, and replicability $\rho = 0.2$. The number of calls that would be required by standard Phased Q-learning is at most $m \approx 13000$ (ignoring $\gamma$ factors). We take several multiples of $m$ and measure the fraction of identical and unique value functions, treating the rSTAT parameter $\rho_{SQ}$ as a hyperparameter. The results, presented in Figure 2, reveal that the number of samples needed to replicably produce the same value function can be several orders of magnitude lower than suggested by our bounds, and that it is feasible to use a larger $\rho_{SQ}$ than theoretically required. This should allow us to scale to more complex problems in the future. The algorithm quickly produces a small set of value functions that may not be identical, but, with a little more data, minor differences are removed. Note that using a replicable procedure naturally incurs overhead, which is expected. However, the overhead is significantly smaller than the theoretically required sample size with its squared $|S||A|$ dependence. In the rSTAT procedure, taking smaller values of $\rho_{SQ}$ for a fixed sample should improve replicability at the cost of accuracy of query responses, by increasing the width of each subinterval of the partition so that there are fewer partition elements overall. The experiments highlight that, as long as sample sizes are sufficiently large and $\rho_{SQ}$ is chosen small enough, we achieve high replicability.

Figure 2: The rPVI algorithm evaluated on varying numbers of calls to PS(GM), with several values for the internal rSTAT parameter $\rho_{SQ}$. Results are provided across 150 runs with different random sampling seeds. The number of calls is set to constant-factor multiples of m = 13000. The dotted green line denotes the replicability threshold of $1 - \rho$. (Left) Fraction of identical value functions vs. number of calls to PS: the largest percentage of identical value functions across 150 runs; with more data, this quantity increases and the choice of $\rho_{SQ}$ becomes less important. (Right) Fraction of unique value functions vs. number of calls to PS: the percentage of unique value functions across 150 runs; varying $\rho_{SQ}$ has negligible impact, while more samples quickly reduce it. The results show that, in practice, the number of samples needed for replicability can be orders of magnitude lower than our bounds suggest.
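The protocol behind Figure 2 can be summarized by the sketch below, which reuses the `rpvi` routine sketched in Section 4.1; because that sketch exposes only the rSTAT accuracy parameter, the $\rho_{SQ}$ hyperparameter of the full procedure is not modeled here, and all constants are illustrative.

```python
from collections import Counter

import numpy as np

def replicability_sweep(P, R, gamma, m_base=13000, multiples=(1, 8, 16, 32),
                        n_runs=150, T=20, alpha=0.02, shared_seed=0):
    """Fraction of identical / unique value functions across runs, per sample-size
    multiple (sketch of the Section 5 protocol; constants are illustrative)."""
    identical, unique = {}, {}
    for mult in multiples:
        value_fns = []
        for run in range(n_runs):
            # Shared internal randomness is fixed across runs; only the PS(GM) samples differ.
            _, Q = rpvi(P, R, gamma, T=T, m=mult * m_base, alpha=alpha,
                        shared_seed=shared_seed, sample_seed=run)
            value_fns.append(tuple(np.round(Q, 12).ravel()))
        clusters = Counter(value_fns)
        identical[mult] = clusters.most_common(1)[0][1] / n_runs  # largest identical cluster
        unique[mult] = len(clusters) / n_runs                     # fraction of distinct outputs
    return identical, unique
```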
6 Related work

Our work builds upon the foundational ideas of Impagliazzo et al. [2022], who introduce formal notions of replicability that are strongly related to robustness, privacy, and generalization [Bun et al., 2023, Kalavasis et al., 2023]. Building on these formal definitions of replicability, researchers have provided algorithms for replicable bandits [Esfandiari et al., 2023a] and replicable clustering [Esfandiari et al., 2023b]. Ahn et al. [2022] introduce algorithms for convex optimization using a slightly different notion of replicability. Our paper presents the first results for formally replicable algorithms in a control setting. A concurrent and independent work by Karbasi et al. [2023] also studies formal replicability of reinforcement learning. They also study the setting of discounted tabular MDPs with access to a generative model, and show the same sample complexity upper bounds for achieving replicable policy estimation in this setting that we prove in our work. Additionally, they provide a matching lower bound. They go on to consider two relaxed notions of replicability that allow them to provide improved sample complexity upper bounds in the generative model setting. Our work instead considers a second setting, providing a first algorithm for replicable policy estimation in the episodic exploration setting. We also provide experimental validation of the practical feasibility of our Replicable Phased Value Iteration algorithm. From an RL perspective, our work is strongly related to understanding exploration in MDPs [Kearns and Singh, 1998b, Brafman and Tennenholtz, 2003, Kakade, 2003]. In the finite-horizon episodic setting, researchers made progress on upper bounds for exploration [Auer and Ortner, 2006, Auer et al., 2008, Jaksch et al., 2010] that ultimately led to a near-complete understanding of the problem [Azar et al., 2017, Zanette and Brunskill, 2019, Simchowitz and Jamieson, 2019]. Lower bounds are provided in other works [Dann and Brunskill, 2015, Osband and Van Roy, 2016]. Further, Jin et al. [2020] and Kaufmann et al. [2021] provide results on a reward-free framework that allows for the optimization of any reward function. While a good amount of progress has been made on understanding the base problem, the notion of replicability is not considered in any of these works. Given the connections between replicability and robustness, our work is related but orthogonal to the study of worst-case optimal policies and value functions.
These worst-case results are often obtained via the study of robust Markov decision processes, first introduced by Nilim and Ghaoui [2005] and Iyengar [2005]. One line of work has focused on relaxing assumptions and combatting conservativeness in robust MDPs [Wiesemann et al., 2013, Mannor et al., 2016, Petrik and Russel, 2019, Panaganti and Kalathil, 2022]. Others have focused on various new formulations such as distributional robustness [Xu and Mannor, 2010, Yu and Xu, 2016]. However, all of the above work focuses on understanding worst cases and finding policies that do not have to be replicable. Finally, our work is related to efforts in practical RL to ensure replicability, such as benchmark design [Guss et al., 2021, Mendez et al., 2022], robust implementation [Nagarajan et al., 2018, Seno and Imai, 2022], and evaluation [Lynnerup et al., 2020, Jordan et al., 2020, Agarwal et al., 2021].

7 Conclusion & future work

We introduced the notion of formal replicability to the field of RL and established several novel algorithms for replicable RL. While these first results may have sub-optimal sample complexities, they highlight the crucial fact that replicability in RL is hard and requires study of the various aspects that impact it. We hope that future work can alleviate some of these efficiency challenges. A general open question is whether replicable RL is simply harder by nature than standard RL. This question needs to be posed at various levels because, as we argue in Section 3, finding a replicable policy might be easier than requiring the value function to be replicable. Finally, we believe the development of replicable algorithms for other settings, such as the non-episodic setting, as well as practical applications, are of great importance.

Acknowledgments and Disclosure of Funding

The first and second authors are partially supported by the DARPA SAIL-ON program under contract HR001120C0040, the DARPA ShELL program under agreement HR00112190133, and the Army Research Office under MURI grant W911NF20-1-0080. The last author is supported in part by the Simons Collaboration on the Theory of Algorithmic Fairness, and NSF grant CCF-2217062.

References

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 29304–29320. Curran Associates, Inc., 2021.

Kwangjun Ahn, Prateek Jain, Ziwei Ji, Satyen Kale, Praneeth Netrapalli, and Gil I. Shamir. Reproducibility in optimization: Theoretical framework and limits. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.

Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006.

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning.
In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 06–11 Aug 2017.

C. Glenn Begley and Lee M. Ellis. Drug development: Raise standards for preclinical cancer research. Nature, 483(7391):531–533, March 2012.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, March 2003.

Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Satchit Sivakumar, and Jessica Sorrell. Stability is stable: Connections between replicability, privacy, and adaptive generalization. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 520–527. ACM, 2023.

Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2818–2826, Cambridge, MA, USA, 2015. MIT Press.

Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits. In The Eleventh International Conference on Learning Representations, 2023a.

Hossein Esfandiari, Amin Karbasi, Vahab Mirrokni, Grigoris Velegkas, and Felix Zhou. Replicable clustering, 2023b.

Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

William Hebgen Guss, Stephanie Milani, Nicholay Topin, Brandon Houghton, Sharada Mohanty, Andrew Melnik, Augustin Harter, Benoit Buschmaas, Bjarne Jaster, Christoph Berganski, Dennis Heitkamp, Marko Henning, Helge Ritter, Chengjie Wu, Xiaotian Hao, Yiming Lu, Hangyu Mao, Yihuan Mao, Chao Wang, Michal Opanowicz, Anssi Kanervisto, Yanick Schraner, Christian Scheller, Xiren Zhou, Lu Liu, Daichi Nishio, Toi Tsuneda, Karolis Ramanauskas, and Gabija Juceviciute. Towards robust and domain agnostic reinforcement learning competitions: MineRL 2020. In Hugo Jair Escalante and Katja Hofmann, editors, Proceedings of the NeurIPS 2020 Competition and Demonstration Track, volume 133 of Proceedings of Machine Learning Research, pages 233–252. PMLR, 06–12 Dec 2021.

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In AAAI'18/IAAI'18/EAAI'18. AAAI Press, 2018. ISBN 978-1-57735-800-8.

Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022, pages 818–831, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392648. doi: 10.1145/3519935.3519973.

Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. In Reproducibility in Machine Learning Workshop (ICML), 2017.

Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

Thomas Jaksch, Ronald Ortner, and Peter Auer.
Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.

Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4870–4879. PMLR, 13–18 Jul 2020.

Scott Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, and Philip Thomas. Evaluating the performance of reinforcement learning algorithms. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4962–4973. PMLR, 13–18 Jul 2020.

Sham M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, 2003.

Alkis Kalavasis, Amin Karbasi, Shay Moran, and Grigoris Velegkas. Statistical indistinguishability of learning algorithms, 2023.

Amin Karbasi, Grigoris Velegkas, Lin F. Yang, and Felix Zhou. Replicability in reinforcement learning, 2023.

Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 865–891. PMLR, 16–19 Mar 2021.

Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, November 1998.

Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, 11, 1998a.

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209–232, 1998b.

I. S. Khalil, J. C. Doyle, and K. Glover. Robust and Optimal Control. Prentice Hall, 1996.

Nicolai A. Lynnerup, Laura Nolling, Rasmus Hasle, and John Hallam. A survey on reproducibility by evaluating deep reinforcement learning algorithms on real-world robots. In Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura, editors, Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, pages 466–489. PMLR, 30 Oct–01 Nov 2020.

Shie Mannor, Duncan Simester, Peng Sun, and John N. Tsitsiklis. Bias and variance in value function estimation. In Proceedings of the Twenty-First International Conference on Machine Learning, page 72, New York, NY, USA, 2004. Association for Computing Machinery.

Shie Mannor, Ofir Mebel, and Huan Xu. Robust MDPs with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484–1509, 2016.

Jorge A. Mendez, Marcel Hussing, Meghna Gummadi, and Eric Eaton. CompoSuite: A compositional reinforcement learning benchmark. In 1st Conference on Lifelong Learning Agents, 2022.

Prabhat Nagarajan, Garrett Warnell, and Peter Stone. Deterministic implementations for reproducibility in deep reinforcement learning. In 2nd Reproducibility in Machine Learning Workshop at ICML 2018, Stockholm, Sweden, July 2018.

Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning, 2016.

Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model.
In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 9582–9602. PMLR, 28–30 Mar 2022.

Marek Petrik and Reazul Hasan Russel. Beyond confidence regions: Tight Bayesian ambiguity sets for robust MDPs. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d'Alché-Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22(164):1–20, 2021.

Takuma Seno and Michita Imai. d3rlpy: An offline deep reinforcement learning library. Journal of Machine Learning Research, 23(1), 2022.

Max Simchowitz and Kevin G. Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

V. Stodden, F. Leisch, and R. D. Peng. Implementing Reproducible Research. Chapman & Hall/CRC The R Series. Taylor & Francis, 2014. ISBN 9781466561595.

Kiri L. Wagstaff. Machine learning that matters. In Proceedings of the 29th International Conference on Machine Learning, pages 1851–1856, Madison, WI, USA, 2012. Omnipress.

Chelsea C. White and Hany K. Eldeib. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4):739–749, 1994.

Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.

Huan Xu and Shie Mannor. Distributionally robust Markov decision processes. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010.

Pengqian Yu and Huan Xu. Distributionally robust counterpart in Markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538–2543, 2016.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7304–7312. PMLR, 09–15 Jun 2019.