# Truncated Variance-Reduced Value Iteration

Yujia Jin (Stanford University, yujiajin@stanford.edu), Ishani Karmarkar (Stanford University, ishanik@stanford.edu), Aaron Sidford (Stanford University, sidford@stanford.edu), Jiayi Wang (Stanford University, jyw@stanford.edu)

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract. We provide faster randomized algorithms for computing an ε-optimal policy in a discounted Markov decision process with Atot state-action pairs, bounded rewards, and discount factor γ. We provide an Õ(Atot[(1−γ)^{-3}ε^{-2} + (1−γ)^{-2}])-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model which can be queried in O(1) time, and an Õ(s + Atot(1−γ)^{-2})-time algorithm in the offline setting, where the probability transition matrix is known and s-sparse. These results improve upon the prior state-of-the-art, which either ran in Õ(Atot[(1−γ)^{-3}ε^{-2} + (1−γ)^{-3}]) time ([1, 2]) in the sampling setting, in Õ(s + Atot(1−γ)^{-3}) time ([3]) in the offline setting, or in time at least quadratic in the number of states using interior point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [1, 2]. We provide a variant that carefully truncates the progress of its iterates to improve the variance of new variance-reduced sampling procedures that we introduce to implement the steps. Our method is essentially model-free and can be implemented in Õ(Atot) space when given generative model access. Consequently, our results take a step toward closing the sample-complexity gap between model-free and model-based methods.

1 Introduction

Markov decision processes (MDPs) are a fundamental mathematical model for decision making under uncertainty. They play a central role in reinforcement learning and in prominent problems in computational learning theory (see, e.g., [4, 5, 6, 7]). MDPs have been studied extensively for decades ([8, 9]), and there have been numerous algorithmic advances in efficiently optimizing them ([3, 1, 2, 10, 11, 12, 13, 14]).

In this paper, we consider the standard problem of optimizing a discounted Markov decision process (DMDP) M = (S, A, P, r, γ). We consider the tabular setting where there is a known finite set of states S and at each state s ∈ S there is a finite, non-empty set of actions A_s for an agent to choose from; A = {(s, a) : s ∈ S, a ∈ A_s} denotes the full set of state-action pairs and Atot := |A| ≥ |S|. The agent proceeds in rounds t = 0, 1, 2, .... In each round t, the agent is in state s_t ∈ S; chooses action a_t ∈ A_{s_t}, which yields a known reward r_t = r_{s_t, a_t} ∈ [0, 1]; and transitions to a random state s_{t+1} sampled (independently) from a (potentially) unknown distribution p_{a_t}(s_t) ∈ Δ^S for round t + 1, where p_a(s) is the (s, a)-th row of P ∈ [0, 1]^{A×S}. The goal is to compute an ε-optimal policy, where a (deterministic) policy π is a mapping from each state s ∈ S to an action π(s) ∈ A_s, and π is ε-optimal if for every initial state s_0 ∈ S the expected discounted reward of π, E[Σ_{t≥0} γ^t r_t], is at least v*_{s_0} − ε. Here, v*_{s_0} is the maximum expected discounted reward of any policy applied starting from initial state s_0, and v* ∈ R^S is called the optimal value of the MDP.
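For concreteness, the following is a minimal, illustrative sketch in Python/NumPy (not code from the paper) of the tabular objects defined above: a DMDP stored as a dense matrix P ∈ R^{Atot×|S|} and reward vector r ∈ R^{Atot}, together with exact evaluation of a deterministic policy via v_π = (I − γP_π)^{-1} r_π. All class and function names here are our own.

```python
import numpy as np

class TabularDMDP:
    """Tabular DMDP M = (S, A, P, r, gamma); rows of P are indexed by state-action pairs."""

    def __init__(self, P, r, gamma, sa_pairs):
        self.P = np.asarray(P, dtype=float)      # shape (Atot, |S|); each row is p_a(s) in the simplex
        self.r = np.asarray(r, dtype=float)      # shape (Atot,); rewards r_{s,a} in [0, 1]
        self.gamma = gamma                       # discount factor in (0, 1)
        self.sa_pairs = sa_pairs                 # list of (s, a) giving the row order of P and r
        self.num_states = self.P.shape[1]

    def rows_of_policy(self, pi):
        """Row index of (s, pi[s]) for each state s (pi maps state -> action)."""
        index = {sa: i for i, sa in enumerate(self.sa_pairs)}
        return np.array([index[(s, pi[s])] for s in range(self.num_states)])

    def policy_value(self, pi):
        """Exact values v_pi solving v_pi = r_pi + gamma * P_pi v_pi, i.e., T_pi(v_pi) = v_pi."""
        rows = self.rows_of_policy(pi)
        P_pi, r_pi = self.P[rows], self.r[rows]  # shape (|S|, |S|) and (|S|,)
        return np.linalg.solve(np.eye(self.num_states) - self.gamma * P_pi, r_pi)
```

A policy π is then ε-optimal exactly when its value vector is entrywise at least v* − ε; the algorithms in this paper never form (I − γP_π)^{-1} explicitly, since solving that linear system already costs more than the runtimes stated above.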
Excitingly, a line of work [15, 16, 2, 3, 17, 18] recently resolved the query complexity of solving DMDPs (up to polylogarithmic factors) in what we call the sample setting, where the transitions p_a(s) are accessible only through a generative model ([16]). A generative model is an oracle which, when queried with any s ∈ S and a ∈ A_s, returns a random s′ ∈ S sampled independently from p_a(s) [19]. It was shown in [18] that for all ε ∈ (0, (1−γ)^{-1}] there is an algorithm which computes an ε-optimal policy with probability 1 − δ using Õ(Atot(1−γ)^{-3}ε^{-2}) queries, where we use Õ(·) to hide polylogarithmic factors in Atot, ε^{-1}, (1−γ)^{-1}, and δ^{-1}. This result improved upon a prior result of [17], which achieved the same query complexity for ε ∈ [0, (1−γ)^{-1/2}], of [2], which achieved this query complexity for ε ∈ [0, 1], and of [16], which achieved it for ε ∈ [0, (|S|(1−γ))^{-1/2}]. This query complexity is known to be optimal in the worst case (up to polylogarithmic factors) due to lower bounds of [16] (and extensions of [20]), which established that the optimal query complexity for finding ε-optimal policies with probability 1 − δ is Ω(Atot(1−γ)^{-3}ε^{-2} log(Atot δ^{-1})).

Interestingly, the recent state-of-the-art results [17, 18] (as well as [16]) are model-based: they query the oracle for every state-action pair, use the resulting samples to build an empirical model of the MDP, and then solve this empirical model. State-of-the-art computational complexities for these methods are then achieved by applying high-accuracy algorithms for optimizing MDPs in what we call the offline setting, where the transition probabilities are known [2, 17]. Correspondingly, obtaining optimal query complexities for large ε, e.g., ε ≥ 1, comes with certain costs. This setting is of interest when the goal is to efficiently compute a coarse approximation of the optimal policy. Model-based methods use space Ω(Atot min((1−γ)^{-3}ε^{-2}, |S|)), rather than the Õ(Atot) memory used by model-free methods (e.g., [2, 3, 21]), which run stochastic, low-memory analogs of classic, popular algorithms for solving DMDPs (e.g., value iteration). Moreover, although state-of-the-art model-based methods use Ω(Atot(1−γ)^{-3}ε^{-2}) samples, the state-of-the-art runtime to compute the optimal policy is either Õ(Atot(1−γ)^{-3} max{1, ε^{-2}}) (using [2]) or has a larger polynomial dependence on Atot and |S| by using interior point methods (IPMs) for linear programming (see Section 1.1). Consequently, in the worst case, the runtime cost per sample is more than polylogarithmic for ε sufficiently larger than 1, and it is natural to ask whether this can be improved.

These costs are connected to the state-of-the-art runtimes for optimizing DMDPs in the offline setting. Ignoring IPMs (discussed in Section 1.1), the state-of-the-art runtime for optimizing a DMDP is Õ(nnz(P) + Atot(1−γ)^{-3}) due to [2], where nnz(P) denotes the number of non-zero entries in P, i.e., the number of triplets (s, a, s′) where taking action a ∈ A_s at state s ∈ S has a non-zero probability of transitioning to s′ ∈ S. This method is essentially model-free; it simply performs a variant of stochastic value iteration in which passes over P are used to reduce the variance of sampling, and it can be implemented in Õ(Atot) space given access to a generative model and the ability to multiply P with vectors. The difficulty of further improving the runtimes in the sample setting and of improving the performance of model-free methods seems connected to the difficulty of improving the additive Õ(Atot(1−γ)^{-3}) term in this runtime (see the discussion in Section 1.2).
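Throughout, both settings treat a generative-model query as a unit-cost operation. As a point of reference (not part of the paper's algorithms), the sketch below shows one standard way such an oracle can be supported when the s-sparse rows of P are available: precompute a cumulative-sum table per state-action pair in O(nnz(P)) time, then answer each query with a binary search. The binary search costs O(log |S|) per query (O(1) is possible with, e.g., the alias method), which is within the polylogarithmic factors hidden by Õ(·). All names here are illustrative.

```python
import numpy as np

class GenerativeModel:
    """Generative-model oracle built from the rows p_a(s) of a known transition matrix P.

    Preprocessing stores, for each state-action pair, the support and cumulative probabilities
    of its row; a query draws one next state from p_a(s).
    """

    def __init__(self, rows, seed=0):
        # rows: dict mapping (s, a) -> (support_states, probabilities), an s-sparse encoding of P.
        self.rng = np.random.default_rng(seed)
        self.table = {}
        for sa, (support, probs) in rows.items():
            probs = np.asarray(probs, dtype=float)
            self.table[sa] = (np.asarray(support), np.cumsum(probs))  # O(nnz of the row)

    def query(self, s, a):
        """Sample s' ~ p_a(s) via binary search on the cumulative sums."""
        support, cum = self.table[(s, a)]
        u = self.rng.random() * cum[-1]          # cum[-1] is ~1.0 for a probability row
        return int(support[np.searchsorted(cum, u)])
```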
In this paper, we ask whether these complexities can be improved. Is it possible to lower the memory requirements of near-optimal query algorithms for large ε? Can we improve the runtime for optimizing MDPs in the offline setting, and can we improve the computational cost per sample when computing optimal policies in DMDPs? More broadly, is it possible to close the sample-complexity gap between model-free and model-based methods for optimizing DMDPs?

1.1 Our results

In this paper, we show how to answer each of these motivating questions in the affirmative. We provide faster algorithms for optimizing DMDPs in both the sample and offline settings that are implementable in Õ(Atot) space provided suitable access to the input. In addition to computing ε-optimal policies, these methods also compute ε-optimal values: we call any v ∈ R^S a value vector and say that it is ε-optimal if ‖v − v*‖_∞ ≤ ε.

Here we present our main results on algorithms for solving DMDPs in the sample setting and in the offline setting and compare to prior work. For simplicity of comparison, we defer any discussion and comparison of DMDP algorithms that use general IPMs for linear programming to the end of this section. The state-of-the-art such IPM methods obtain improved running times but use Ω(|S|^2) space and Ω(|S|^2) time and rely on general-purpose linear system solvers. As such, they are perhaps qualitatively different from the more combinatorial or dynamic-programming-based methods, e.g., value iteration and stochastic value iteration, more commonly discussed in this introduction.

In the sample setting, our main result is an algorithm that uses Õ(Atot[(1−γ)^{-3}ε^{-2} + (1−γ)^{-2}]) samples and time and Õ(Atot) space. It improves upon the prior, non-IPM, state-of-the-art, which uses Õ(Atot[(1−γ)^{-3}ε^{-2} + (1−γ)^{-3}]) time [3], and nearly matches the state-of-the-art sample complexity for all ε = O((1−γ)^{-1/2}). See Table 2 for a more complete comparison.

Theorem 1.1. In the sample setting, there is an algorithm that uses Õ(Atot[(1−γ)^{-3}ε^{-2} + (1−γ)^{-2}]) samples and time and Õ(Atot) space, and computes an ε-optimal policy and ε-optimal values with probability 1 − δ.

Particularly excitingly, the algorithm in Theorem 1.1 runs in time nearly linear in the number of samples whenever ε = O((1−γ)^{-1/2}) and therefore, provided querying the oracle costs Ω(1), has a near-optimal runtime for such ε! Prior to this work, such a near-optimal, non-IPM, runtime (for non-trivially small γ) was only known for ε = O(1) ([2]). Similarly, Theorem 1.1 shows that there are model-free algorithms (which for our purposes we define as Õ(Atot)-space algorithms) that are nearly sample-optimal whenever ε = O((1−γ)^{-1/2}); previously this was only known for ε = O(1). As discussed in prior work ([18, 17]), this large-ε regime is potentially of particular importance in large-scale learning settings, where one would like to quickly compute a coarse approximation of the optimal policy.

In the offline setting, our main result is an algorithm that uses Õ(nnz(P) + Atot(1−γ)^{-2}) time. It improves upon the prior, non-IPM, state-of-the-art, which uses Õ(nnz(P) + Atot(1−γ)^{-3}) time ([2]). See Table 1 for a more complete comparison with prior work.

Theorem 1.2. In the offline setting, there is an algorithm that uses Õ(nnz(P) + Atot(1−γ)^{-2}) time, and computes an ε-optimal policy and ε-optimal values with probability 1 − δ.

The method of Theorem 1.2 runs in nearly linear time when (1−γ)^{-1} ≤ (nnz(P)/Atot)^{1/2}, i.e., when the discount factor is not too small relative to the average sparsity of the rows of the transition matrix.
Prior to this paper, such nearly linear, non-IPM, runtimes (for non-trivially small γ) were only known for (1−γ)^{-1} ≤ (nnz(P)/Atot)^{1/3} ([2]). Thus, Theorem 1.2 expands the set of DMDPs which can be solved in nearly linear time.

The space usage and input access for this offline algorithm differ from those of the algorithm in Theorem 1.1 in that the algorithm in Theorem 1.2 assumes that access to the transition matrix P is provided as input and uses this to compute matrix-vector products with value vectors. The algorithm in Theorem 1.2 also requires access to samples from the generative model; if access to the generative model is not provided as input, then, using the access to P, the algorithm can build an O(nnz(P))-size data structure so that queries to the generative model can be implemented in O(1) time (e.g., see the discussion in [2]). Hence, if matrix-vector products and queries to the generative model can be implemented in Õ(Atot) space, then so can the algorithm in Theorem 1.2.

Table 1: Running times to compute ε-optimal policies in the offline setting. In this table, E denotes an upper bound on the ergodicity of the MDP.

| Algorithm | Runtime | Space |
| --- | --- | --- |
| Value Iteration [22, 11] | Õ(nnz(P)(1−γ)^{-1}) | O(nnz(P)) |
| Empirical QVI [16] | Õ(nnz(P) + Atot(1−γ)^{-3}ε^{-2}) | O(nnz(P)) |
| Randomized Primal-Dual Method [23] | Õ(nnz(P) + E·Atot(1−γ)^{-4}ε^{-2}) | O(Atot) |
| High-Precision Variance-Reduced Value Iteration [2] | Õ(nnz(P) + Atot(1−γ)^{-3}) | O(Atot) |
| Algorithm 4 (This Paper) | Õ(nnz(P) + Atot(1−γ)^{-2}) | O(Atot) |

Exact DMDP Algorithms. In our comparison of offline DMDP algorithms in Table 1, we ignored poly(log(ε^{-1})) factors. Consequently, we did not distinguish between algorithms which solve DMDPs to high accuracy, i.e., depend on ε only polylogarithmically, and those which solve them exactly, i.e., have no dependence on ε. There is a line of work on designing such exact methods; the current state-of-the-art is policy iteration, which can be implemented in O(|S|^2 Atot^2 (1−γ)^{-1}) time ([13, 14]), and a combinatorial interior point method which can be implemented in O(Atot^4) time ([10]) with no dependence on ε. Note that these methods obtain improved runtime dependence on ε at the cost of larger dependencies on |S| and Atot.

Table 2: Query complexities to compute an ε-optimal policy in the sample setting. Merg denotes an upper bound on the MDP's ergodicity. Here, model-free refers to Õ(Atot)-space methods.

| Algorithm | Queries | ε range | Model-Free |
| --- | --- | --- | --- |
| Phased Q-Learning [15] | Õ(Atot(1−γ)^{-7}ε^{-2}) | (0, (1−γ)^{-1}] | Yes |
| Empirical QVI [16] | Õ(Atot(1−γ)^{-3}ε^{-2}) | (0, ((1−γ)|S|)^{-1/2}] | No |
| Sublinear Variance-Reduced Value Iteration [2] | Õ(Atot(1−γ)^{-4}ε^{-2}) | (0, (1−γ)^{-1/2}] | Yes |
| Sublinear Variance-Reduced Q-Value Iteration [3] | Õ(Atot(1−γ)^{-3}ε^{-2}) | (0, 1] | Yes |
| Randomized Primal-Dual Method [23] | Õ(Merg·Atot(1−γ)^{-4}ε^{-2}) | (0, (1−γ)^{-1}] | Yes |
| Empirical MDP + Planning [17] | Õ(Atot(1−γ)^{-3}ε^{-2}) | (0, (1−γ)^{-1/2}] | No |
| Perturbed Empirical MDP, Conservative Planning [18] | Õ(Atot(1−γ)^{-3}ε^{-2}) | (0, (1−γ)^{-1}] | No |
| Algorithm 5 (This Paper) | Õ(Atot(1−γ)^{-3}ε^{-2}) | (0, (1−γ)^{-1/2}] | Yes |

Comparison with IPM Approaches. In the offline setting, [2] showed how to reduce solving DMDPs to an ℓ1-regression problem in P ∈ R^{A×S}. For ℓ1 regression in a matrix A ∈ R^{n×d} with n > d, [12] provides an algorithm that runs in Õ(d^{0.5}(nnz(A) + d^2)) time, [24] provides an algorithm that runs in Õ(nd + d^{2.5}) time, and [25, 26, 27] yield an algorithm that runs in Õ(Atot^ω) time for the current value of the fast matrix multiplication exponent ω < 2.371552 [28]. These offline IPM approaches can be coupled with model-based approaches to yield algorithms in the sample setting.
[18] shows that, given a DMDP M, with Õ(Atot(1−γ)^{-3}ε^{-2}) queries to the generative model and time, one can construct a DMDP M̂ such that an optimal policy in M̂ is ε-optimal for M. Consequently, provided polynomial accuracy in computing the policy suffices, applying the IPMs to M̂ yields runtimes of Õ(nnz(P)·|S|^{1/2} + |S|^{2.5}) ([12]), Õ(Atot|S| + |S|^{2.5}) ([24]), and Õ(Atot^ω) ([25]). This combination of model-based and IPM-based approaches uses super-quadratic time and space, but it may yield better runtimes than Theorem 1.2 in certain regimes where γ is sufficiently large relative to |S| and Atot in the offline setting, or when, additionally, ε is sufficiently small relative to |S| and Atot in the sample setting.

1.2 Overview of approach

Here we provide an overview of our approach to proving Theorem 1.1 and Theorem 1.2. We motivate our approach from previous methods and discuss the main obstacles and insights needed to obtain our results. For simplicity, we focus on the problem of computing ε-optimal values and discuss computing ε-optimal policies at the end of this section.

Value iteration. Our approach stems from the classic value iteration method ([22, 11]) for computing ε-optimal values and from its more modern Q-value and stochastic counterparts ([16, 3, 29, 30, 31, 32]). As the name suggests, value iteration proceeds in iterations t = 0, 1, ..., computing values v^(t) ∈ R^S. Starting from an initial v^(0) ∈ R^S, in iteration t ≥ 1 the value vector v^(t) is computed as the result of applying the (Bellman) value operator T : R^S → R^S, i.e., v^(t) ← T(v^(t−1)), where

T(v)(s) := max_{a ∈ A_s} ( r_a(s) + γ p_a(s)^⊤ v )  for all s ∈ S and v ∈ R^S.   (1)

It is well known that the value operator is γ-contractive, i.e., ‖T(v) − v*‖_∞ ≤ γ‖v − v*‖_∞ for all v ∈ R^S ([11, 22, 2]). If we initialize v^(0) = 0, then since ‖v*‖_∞ ≤ (1−γ)^{-1} [22, 11], we see that ‖v^(t) − v*‖_∞ ≤ γ^t ‖v^(0) − v*‖_∞ ≤ γ^t(1−γ)^{-1} ≤ (1−γ)^{-1} exp(−t(1−γ)). Thus, the v^(t) are ε-optimal values for any t ≥ (1−γ)^{-1} log(ε^{-1}(1−γ)^{-1}). This yields an Õ(nnz(P)(1−γ)^{-1})-time algorithm in the offline setting.

Stochastic value iteration and variance reduction. To improve on the runtime of value iteration and to apply it in the sample setting, a line of work implements stochastic variants of value iteration ([16, 2, 3, 23, 17, 18]). These methods take approximate value iteration steps in which the expected utilities p_a(s)^⊤ v in (1) for each state-action pair are replaced by a stochastic estimate of the expected utilities. In particular, note that p_a(s)^⊤ v = E_{i∼p_a(s)}[v_i], i.e., the expected value of v_i where i is drawn from the distribution given by p_a(s). This is compatible with the sample setting, as computing v_i for i drawn from p_a(s) yields an unbiased estimate of p_a(s)^⊤ v with one query and O(1) time.

State-of-the-art model-free methods in the sample setting ([3]) and non-IPM runtimes in the offline setting ([2]) improve further by more carefully approximating the expected utilities p_a(s)^⊤ v of each state-action pair (s, a) ∈ A. Broadly, given an arbitrary v^(0), they first compute x ∈ R^A that approximates P v^(0), i.e., x_a(s) approximates [P v^(0)]_{(s,a)} = p_a(s)^⊤ v^(0) for all (s, a) ∈ A. In the offline setting, x = P v^(0) can be computed directly in O(nnz(P)) time. In the sample setting, x ≈ P v^(0) can be approximated to sufficient accuracy using multiple queries for each state-action pair. Then, in each iteration t ≥ 1 of the algorithm, fresh samples are taken to compute g^(t) ≈ P(v^(t−1) − v^(0)) and perform the following update:

v^(t)(s) ← max_{a ∈ A_s} ( r_a(s) + γ(x_a(s) + g_a^(t)(s)) )  for all s ∈ S.   (2)
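To make the two updates concrete, the following is a minimal Python/NumPy sketch (an illustration under the definitions above, not the paper's implementation) contrasting one exact Bellman step (1) with one epoch of the basic variance-reduced scheme: x = P v^(0) is computed once, and each subsequent step only re-estimates P(v − v^(0)) from fresh samples. It assumes each row of P sums to one; all names and parameters are illustrative.

```python
import numpy as np

def bellman_step(P, r, gamma, v, actions_of_state):
    """One exact step of value iteration, v <- T(v), as in (1)."""
    q = r + gamma * (P @ v)                               # q[row of (s,a)] = r_a(s) + gamma * p_a(s)^T v
    return np.array([q[rows].max() for rows in actions_of_state])

def variance_reduced_epoch(P, r, gamma, v0, actions_of_state, num_steps, batch, rng):
    """One epoch of basic variance-reduced value iteration, as in (2).

    x = P v0 is computed once; afterwards each step estimates only P (v - v0)
    by sampling next states i ~ p_a(s) and averaging (v - v0)[i].
    """
    x = P @ v0                                            # exact in the offline setting
    v = v0.copy()
    n_rows, n_states = P.shape
    for _ in range(num_steps):
        diff = v - v0
        g = np.empty(n_rows)
        for row in range(n_rows):                         # g[row] ~= p_a(s)^T (v - v0)
            samples = rng.choice(n_states, size=batch, p=P[row])
            g[row] = diff[samples].mean()
        q = r + gamma * (x + g)
        v = np.array([q[rows].max() for rows in actions_of_state])
    return v
```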
This approach is advantageous because the sampling error in estimating P(v^(t−1) − v^(0)) depends on the magnitude of v^(t−1) − v^(0). After approximately computing x, the remaining task of computing g^(t) ≈ P(v^(t−1) − v^(0)), so that x + g^(t) ≈ P v^(t−1), may be easier than the task of directly estimating P v^(t−1) (since v^(t−1) − v^(0) is smaller in magnitude than v^(t−1) entrywise). Due to the similarities of this approach to variance-reduced optimization methods, e.g., ([33, 34]), this technique is called variance reduction [2]. The works [2, 3] showed that if x is computed sufficiently accurately and v^(0) are α-optimal values, then applying (2) for t = Θ((1−γ)^{-1}) iterations yields v^(t) that is α/2-optimal in just Õ(Atot(1−γ)^{-3}) time and samples! [2] leverages this technique to compute ε-optimal values in the offline setting in Õ(nnz(P) + Atot(1−γ)^{-3}) time. [3] uses a similar approach to compute ε-optimal values in Õ(Atot[(1−γ)^{-3}ε^{-2} + (1−γ)^{-3}]) time and samples in the sample setting. A key difference between [2] and [3] is the accuracy to which they must approximate the initial utilities x ≈ P v^(0).

Recursive variance reduction. To improve upon the prior model-free approaches of [2, 3], we improve how exactly the variance reduction is performed. We perform a scheme similar to (2) and use essentially the same techniques as in [3, 2] toward estimating x. Where we differ from prior work is in how we estimate the change in approximate utilities g^(t) ≈ P(v^(t−1) − v^(0)). Rather than directly sampling to estimate this difference, we instead sample to estimate each individual P(v^(t) − v^(t−1)) and maintain the sum. Concretely, for t ≥ 1, we compute Δ^(t) such that

Δ^(t) ≈ P(v^(t) − v^(t−1))   (3)

so that these recursive approximations telescope. More precisely, setting g^(0) = 0, for t ≥ 1 we set

g^(t) ← g^(t−1) + Δ^(t−1) ≈ P(v^(t−2) − v^(0)) + P(v^(t−1) − v^(t−2)) = P(v^(t−1) − v^(0)).   (4)

This difference is perhaps similar to how methods such as SARAH ([34]) differ from SVRG ([33]); consequently, we call this approximation scheme recursive variance reduction. Interestingly, in contrast to the finite-sum setting considered in [33, 34], in our setting recursive variance reduction for solving DMDPs ultimately leads to direct quantitative improvements in worst-case complexity.

To analyze this recursive variance reduction method, we treat the error in g^(t) ≈ P(v^(t−1) − v^(0)) as a martingale and analyze it using Freedman's inequality [35] (as stated in [36]). The hope in applying this approach is that, by better bounding and reasoning about the changes in v^(t), better bounds on the sampling error can be obtained by leveraging structural properties of the iterates. Unfortunately, without further information about the change in v^(t), or a larger change to the analysis of variance-reduced value iteration, in the worst case the variance can be too large for this approach to work naively. Concretely, prior work ([2]) showed that it suffices to maintain that x + g^(t+1) approximates P v^(t) to additive O((1−γ)α) error. However, imagine that v* = α·1, v^(0) = 0, and in each iteration t one coordinate of v^(t) − v^(t−1) is Ω(α). If |S| ≥ (1−γ)^{-1} and p_a(s) has all entries O(1/|S|) for some (s, a) ∈ A, then the variance of each sample used to estimate p_a(s)^⊤(v^(t) − v^(t−1)) is Ω(α²/|S|) = Ω(α²(1−γ)). Applying Freedman's inequality, e.g., [36], and taking b samples in each of the O((1−γ)^{-1}) iterations would yield, roughly, ‖g^(t+1) − P(v^(t) − v^(0))‖_∞ = O(α√((1−γ)^{-1}·(1−γ)/b)) = O(α/√b). Consequently, b = Ω((1−γ)^{-2}) and Ω((1−γ)^{-3}) samples would be needed in total, i.e., there is no improvement.
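The estimator maintained in (3) and (4) — whose error is exactly the martingale discussed above — can be summarized by the following illustrative sketch (our own code, not the paper's; it assumes each row of P sums to one). The truncation introduced next is what keeps its per-sample variance small.

```python
import numpy as np

def recursive_vr_estimates(P, value_iterates, batch, rng):
    """Maintain g^(t) ~= P (v^(t-1) - v^(0)) by summing estimates of each increment, as in (3)-(4).

    value_iterates = [v0, v1, ..., vT]; only the increments v^(t) - v^(t-1) are ever sampled,
    so the per-sample variance scales with ||v^(t) - v^(t-1)||_inf rather than ||v^(t) - v^(0)||_inf.
    """
    n_rows, n_states = P.shape
    g = np.zeros(n_rows)                      # g^(1) = 0 since v^(0) - v^(0) = 0
    estimates = [g.copy()]
    for t in range(1, len(value_iterates)):
        increment = value_iterates[t] - value_iterates[t - 1]
        delta = np.empty(n_rows)
        for row in range(n_rows):             # Delta^(t)[row] ~= p_a(s)^T (v^(t) - v^(t-1))
            samples = rng.choice(n_states, size=batch, p=P[row])
            delta[row] = increment[samples].mean()
        g = g + delta                         # telescopes: g^(t+1) ~= P (v^(t) - v^(0))
        estimates.append(g.copy())
    return estimates
```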
Next, we discuss how we circumvent this obstacle by combining recursive variance reduction with a second algorithmic technique, which we call truncation.

Truncated value iteration. The key insight that makes our new recursive variance reduction scheme for value iteration yield faster runtimes is to modify the value iteration scheme itself. Recall from the previous paragraph that the challenging case for recursive variance reduction occurs when, for example, in every iteration a single coordinate of v changes by Ω(α). We observe that there is a simple modification one can make to value iteration to ensure that there is no such large change between iterations: simply truncate the change in each iteration so that no coordinate of v^(t) changes too much! To motivate our algorithm, consider the following truncated variant of value iteration, where

v^(t) = median( v^(t−1) − (1−γ)α·1, T(v^(t−1)), v^(t−1) + (1−γ)α·1 ),   (5)

and median applies the median of its arguments entrywise. In other words, suppose we apply value iteration but decrease, or truncate, the change from v^(t−1) to v^(t) so that it is no more than (1−γ)α in absolute value in any coordinate. Then, provided that v^(t−1) is α-optimal, we can show that it is still the case that ‖v^(t) − v*‖_∞ ≤ γ‖v^(t−1) − v*‖_∞. In other words, the worst-case progress of value iteration is unaffected! This follows immediately from the fact that ‖T(v^(t−1)) − v*‖_∞ ≤ γ‖v^(t−1) − v*‖_∞ in value iteration and the following simple technical lemma.

Lemma 1.3. For a, b, x ∈ R^n and γ, α > 0, let c := median{a − (1−γ)α·1, b, a + (1−γ)α·1}, where the median is applied entrywise. Then, if ‖b − x‖_∞ ≤ γ‖a − x‖_∞ and ‖a − x‖_∞ ≤ α, then ‖c − x‖_∞ ≤ γ‖a − x‖_∞.

Applying truncated value iteration, we know that ‖v^(t) − v^(t−1)‖_∞ ≤ (1−γ)α. In other words, the worst-case change in a coordinate has decreased by a factor of (1−γ)! We show that this smaller movement bound does indeed decrease the variance of the martingale arising in the aforementioned averaging scheme. We show that this truncation scheme, when combined with our recursive variance reduction scheme (4) for estimating P(v^(t) − v^(0)), reduces the total number of samples required to estimate this quantity and halve the error from Õ((1−γ)^{-3}) to just Õ((1−γ)^{-2}) per state-action pair.

Our method. Our algorithm applies stochastic truncated value iteration, using sampling to estimate each g^(t) ≈ P(v^(t) − v^(0)) as described. Some minor additional modifications are needed, however, to obtain our results. Perhaps the most substantial is our use of the monotonicity technique, as in prior work ([2, 3]). That is, we modify our method so that each v^(t) is always an underestimate of v* and the v^(t) increase monotonically as t increases. Thus, we only truncate the increase in the v^(t) (since they do not decrease, and the median operation in (5) reduces to a minimum in Lemma 1.3). Beyond simplifying this aspect of the algorithm, as in prior work, the monotonicity technique allows us to simultaneously compute an ε-approximate policy as well as an ε-optimal value vector. We do this by tracking the actions associated with changed v^(t) values, i.e., the argmax in (2), in a variable π^(t), which denotes the current estimated policy in iteration t of value iteration.

Algorithm 1: Sample(u, p, M, η)
Input: Value vector u ∈ R^S, p ∈ Δ^S, sample size M, and offset parameter η ≥ 0.
1  for each n ∈ [M] do
2    Choose i_n ∈ S independently with P{i_n = t} = p(t);
3  x̃ = (1/M) Σ_{n∈[M]} u(i_n);
4  σ̂ = (1/M) Σ_{n∈[M]} (u(i_n))² − x̃²;
5  x = x̃ − √(2η σ̂) − 4η^{3/4}‖u‖_∞ − (2/3)η‖u‖_∞;

Algorithm 2: ApxUtility(u, M, η)
Input: Value vector u ∈ R^S, sample size M, and offset parameter η ≥ 0.
1  for each (s, a) ∈ A do  // In the sample setting, p_a(s) is passed implicitly.
2    x_a(s) = Sample(u, p_a(s), M, η);
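The following Python sketch mirrors Algorithms 1 and 2 (illustrative only; the generative model is represented by explicit access to the distribution p, and the function names are our own). With η = 0 the routine is a plain empirical mean; with η > 0 the shift on the last line makes the output an underestimate of p^⊤u with high probability, which is the property used in Section 3.

```python
import numpy as np

def sample_estimate(u, p, M, eta, rng):
    """Sample(u, p, M, eta): estimate p^T u from M draws, shifted down by eta-dependent terms."""
    draws = rng.choice(len(u), size=M, p=p)          # i_1, ..., i_M ~ p
    x_tilde = u[draws].mean()
    sigma_hat = (u[draws] ** 2).mean() - x_tilde ** 2
    u_inf = np.abs(u).max()
    return (x_tilde - np.sqrt(2 * eta * max(sigma_hat, 0.0))   # Bernstein-type correction
            - 4 * eta ** 0.75 * u_inf - (2.0 / 3.0) * eta * u_inf)

def apx_utility(u, P, M, eta, rng):
    """ApxUtility(u, M, eta): apply Sample to every state-action pair (row of P)."""
    return np.array([sample_estimate(u, P[row], M, eta, rng) for row in range(P.shape[0])])
```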
Concretely, the monotonicity technique allows us to maintain the invariant that at each iteration t, the current value estimate v^(t) and policy estimate π^(t) satisfy the relation v^(t) ≤ T_{π^(t)}(v^(t)). Note that this ensures that the value of π^(t) (denoted v_{π^(t)}) is at least v^(t), because

v^(t) ≤ T_{π^(t)}(v^(t)) ≤ T²_{π^(t)}(v^(t)) ≤ ... ≤ T^∞_{π^(t)}(v^(t)) = v_{π^(t)}.

Thus, whenever v^(t) is an ε-optimal value vector, π^(t) is an (at least) ε-optimal policy. By computing the initial expected utilities x = P v^(0) exactly, we obtain our offline results. By carefully estimating x ≈ P v^(0) as in [3], we obtain our sampling results. Finally, building off of the analysis of [37] for deterministic or highly mixing MDPs, we also show that our method obtains even faster convergence guarantees under additional non-worst-case assumptions on the MDP structure.

1.3 Notation and paper outline

General notation. Calligraphic upper-case letters denote sets and operators, lowercase boldface letters denote vectors, and uppercase boldface letters (e.g., P, I) denote matrices. 0 and 1 denote the all-zeros and all-ones vectors, [m] := {1, ..., m}, and Δ^n := {x ∈ R^n : x ≥ 0 and ‖x‖_1 = 1} is the simplex. For v ∈ R^S, we use v_i or v(i) for the i-th entry of the vector v. For vectors v ∈ R^A, we use v_a(s) to denote the (s, a)-th entry of v, where (s, a) ∈ A. We use √v, v², |v| ∈ R^n for the elementwise square root, square, and absolute value of v, respectively, and max{u, v} and median{u, v, w} for the elementwise maximum and median, respectively. For v, x ∈ R^n, v ≤ x denotes that v(i) ≤ x(i) for each i ∈ [n] (and analogously for <, ≥, >). We call x ∈ R^n an α-underestimate of y ∈ R^n if y − α·1 ≤ x ≤ y for α ≥ 0 (see the discussion of monotonicity in Section 1.2 for motivation).

DMDP. As discussed, the objective in optimizing a DMDP is to find an ε-approximate policy π and values. For a policy π, we use T_π : R^S → R^S to denote the value operator associated with π, i.e., T_π(u)(s) := r_{π(s)}(s) + γ p_{π(s)}(s)^⊤ u for all value vectors u ∈ R^S and s ∈ S. We let v_π denote the unique value vector such that T_π(v_π) = v_π and define its variance as σ_{v_π} := P^π (v_π)² − (P^π v_π)², where P^π ∈ R^{S×S} is the matrix such that P^π_{s,s′} = P_{(s,π(s)), s′}. The optimal value vector v* ∈ R^S of the optimal policy π* is the unique vector with T(v*) = v*, and we let P* ∈ R^{S×S} := P^{π*}.

Outline. Section 2 presents our offline-setting results and Section 3 our sample-setting results. Section A discusses specialized settings where we can obtain even faster convergence guarantees. Omitted proofs are deferred to Appendix B.

2 Offline algorithm

In this section, we present our high-precision algorithm for finding an approximately optimal policy in the offline setting. We first define Sample (Algorithm 1), which approximately computes the product between p ∈ Δ^S and a value vector u ∈ R^S using samples from a generative model. The following lemma states some immediate estimation bounds on Sample using linearity and the fact that p ∈ Δ^S.

Lemma 2.1. Let x = Sample(u, p, M, 0) for p ∈ Δ^n, M ∈ Z_{>0}, ε > 0, and u ∈ R^S. Then, E[x] = p^⊤u, |x| ≤ ‖u‖_∞, and Var[x] ≤ (1/M)‖u‖²_∞.

We can naturally apply Sample to each state-action pair in M, as in the subroutine ApxUtility (Algorithm 2). If x = ApxUtility(u, M, η), then x_a(s) is an estimate of the expected utility of taking action a ∈ A_s from state s ∈ S (as discussed in Section 1.2).
When η > 0, this estimate may be shifted down to increase the probability that x underestimates the true expected utilities; we leverage this in Section 3 (see also the discussion of monotonicity in Section 1.2). The terms arising in the definition of x come from applying Bernstein's inequality (Theorem B.2) to guarantee that x is an underestimate of the true expected utilities, up to additive error governed by η, with high probability (see Lemma 3.1).

The following algorithm, TVRVI (Algorithm 3), takes as input an initial value vector v^(0) and policy π^(0) such that v^(0) is an α-underestimate of v*, along with an approximate offset vector x which is a β-underestimate of P v^(0). It runs L = O((1−γ)^{-1}) iterations of approximate value iteration, making one call to Sample (Algorithm 1) with a sample size of M = Õ((1−γ)^{-1}) in each iteration. The algorithm outputs v^(L), which we show is an α/2-underestimate of v* (Corollary 2.5).

TVRVI (Algorithm 3) is similar to variance-reduced value iteration [2] in that, in each iteration, we draw M samples and use Sample to maintain underestimates of p_a(s)^⊤(v^(ℓ) − v^(ℓ−1)) for each state-action pair (s, a). However, there are two key distinctions between TVRVI and variance-reduced value iteration [2] that enable our improvement. First, we use the recursive variance reduction technique described by (3) and (4), and second, we apply truncation (Line 7), which essentially implements the truncation described in Lemma 1.3. Lemma 2.2 below illustrates how these two techniques can be combined to bound the sample complexity needed to maintain approximate transitions p^⊤(w^(t) − w^(0)) for a general sequence of ℓ∞-bounded vectors {w^(i)}_{i=1}^{T}. The analysis leverages Freedman's inequality [35], as stated in [36] and restated in Theorem B.1.

Lemma 2.2. Let T ∈ Z_{>0} and w^(0), w^(1), ..., w^(T) ∈ R^S be such that ‖w^(i) − w^(i−1)‖_∞ ≤ τ for all i ∈ [T]. Then, for any p ∈ Δ^S, δ ∈ (0, 1), and M ≥ 2^8 T log(2/δ), with probability 1 − δ,

| p^⊤(w^(t) − w^(0)) − Σ_{i∈[t]} (1/M) Σ_{j∈[M]} Sample(w^(i) − w^(i−1), p, 1, 0) | ≤ τ/8   for all t ∈ [T].

Algorithm 3: TVRVI(v^(0), π^(0), x, α, δ)
Input: Initial values v^(0) ∈ R^S, which is an α-underestimate of v*.
Input: Initial policy π^(0) such that v^(0) ≤ T_{π^(0)}(v^(0)).
Input: Accuracy α ∈ [0, (1−γ)^{-1}] and failure probability δ ∈ (0, 1).
Input: Offsets x ∈ R^A;  // entrywise underestimate of P v^(0)
1  Initialize g^(1) ∈ R^A and ĝ^(1) ∈ R^A to 0;
2  L = ⌈log(8)(1−γ)^{-1}⌉ and M = ⌈L · 2^8 log(2Atot/δ)⌉;
3  for each iteration ℓ ∈ [L] do
4    Q = r + γ(x + ĝ^(ℓ));
5    v^(ℓ) = v^(ℓ−1) and π^(ℓ) = π^(ℓ−1);
6    for each state i ∈ S do
       // Compute truncated value update (and associated action)
7      ṽ^(ℓ)(i) = min{ max_{a∈A_i} Q_{i,a}, v^(ℓ−1)(i) + (1−γ)α } and π̃^(ℓ)_i = argmax_{a∈A_i} Q_{i,a};
       // Update value and policy if it improves
8      if ṽ^(ℓ)(i) ≥ v^(ℓ)(i) then v^(ℓ)(i) = ṽ^(ℓ)(i) and π^(ℓ)_i = π̃^(ℓ)_i;
     // Update for maintaining estimates of P(v^(ℓ) − v^(0))
9    Δ^(ℓ) = ApxUtility(v^(ℓ) − v^(ℓ−1), M, 0) and g^(ℓ+1) = g^(ℓ) + Δ^(ℓ);
     // Shift estimates so that ĝ^(ℓ+1) always underestimates p_a(s)^⊤(v^(ℓ) − v^(0))
10   ĝ^(ℓ+1) = g^(ℓ+1) − ((1−γ)α/8)·1;
11 return (v^(L), π^(L))

While it is unclear how to significantly improve the constant of 2^8 = 256 appearing in Lemma 2.2 (and consequently in Algorithm 3), we note that tightening these constants in the application of Freedman's inequality could be of practical interest.
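The sketch below renders the core of one TVRVI epoch in Python in the offline setting (so that the offsets x = P v^(0) are exact). It is illustrative rather than a faithful implementation: the constants and exact sample sizes of Algorithm 3 are omitted, a small fixed batch M is used for each increment estimate, and each row of P is assumed to sum to one; all function names are our own.

```python
import numpy as np

def tvrvi_epoch(P, r, gamma, v0, pi0, x, alpha, actions_of_state, M, rng):
    """One epoch of truncated variance-reduced value iteration (cf. Algorithm 3), illustrative only.

    v0 is an alpha-underestimate of v*, x is an entrywise underestimate of P v0, and each
    iteration (i) takes a truncated, monotone approximate Bellman step and (ii) updates the
    recursive estimate g of P (v - v0) from M fresh samples of the increment.
    """
    n_rows, n_states = P.shape
    L = int(np.ceil(np.log(8.0) / (1.0 - gamma)))
    g = np.zeros(n_rows)          # g^(l)     ~= P (v^(l-1) - v0)
    g_hat = np.zeros(n_rows)      # g_hat^(l)  : shifted copy, underestimates P (v^(l-1) - v0) w.h.p.
    v, pi = v0.copy(), list(pi0)
    for _ in range(L):
        q = r + gamma * (x + g_hat)
        v_prev = v.copy()
        for s, rows in enumerate(actions_of_state):
            best = rows[np.argmax(q[rows])]
            proposal = min(q[best], v_prev[s] + (1.0 - gamma) * alpha)  # truncated update (Line 7)
            if proposal >= v[s]:                                        # keep values monotone (Line 8)
                v[s], pi[s] = proposal, best
        increment = v - v_prev
        for row in range(n_rows):                                       # recursive VR update (Line 9)
            samples = rng.choice(n_states, size=M, p=P[row])
            g[row] += increment[samples].mean()
        g_hat = g - (1.0 - gamma) * alpha / 8.0                         # shift (Line 10)
    return v, pi
```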
By applying Lemma 2.2 to the iterates v^(ℓ) in TVRVI, the following Corollary 2.3 shows that we can maintain additive O((1−γ)α)-underestimates of the transitions p_a(s)^⊤(v^(ℓ) − v^(0)) using only Õ(L) samples (as opposed to the Õ(L²) samples required in [2]) per state-action pair.

Corollary 2.3. In TVRVI (Algorithm 3), with probability 1 − δ, in Lines 9, 10, and 2, for all s ∈ S, a ∈ A_s, and ℓ ∈ [L], we have |g^(ℓ)_a(s) − p_a(s)^⊤(v^(ℓ−1) − v^(0))| ≤ (1−γ)α/8, and therefore ĝ^(ℓ)_a(s) is a (1−γ)α/4-underestimate of p_a(s)^⊤(v^(ℓ−1) − v^(0)).

The following Lemma 2.4 shows that whenever the event in Corollary 2.3 holds, TVRVI (Algorithm 3) is approximately contractive and maintains monotonicity of the approximate values. By accumulating the error bounds in Lemma 2.4, we also obtain the following Corollary 2.5, which guarantees that TVRVI halves the error in the initial estimate v^(0).

Lemma 2.4. Suppose that for some β ∈ R^A_{≥0}, P v^(0) − β ≤ x ≤ P v^(0), and let β_{π*} ∈ R^S be defined as β_{π*}(s) := β_{π*(s)}(s) for each s ∈ S. Then, with probability 1 − δ, at the end of every iteration ℓ ∈ [L] (Line 3) of TVRVI(v^(0), π^(0), x, α, δ), the following hold for ξ := γ((1−γ)(α/4)·1 + β_{π*}):

v^(ℓ−1) ≤ v^(ℓ) ≤ T_{π^(ℓ)}(v^(ℓ)),   (6)

0 ≤ v* − v^(ℓ) ≤ max{ γP*(v* − v^(ℓ−1)) + ξ, γ(v* − v^(ℓ−1)) }.   (7)

Corollary 2.5. Suppose that for some α ≥ 0 and β ∈ R^A_{≥0}, P v^(0) − β ≤ x ≤ P v^(0); v^(0) is an α-underestimate of v*; and v^(0) ≤ T_{π^(0)}(v^(0)). Let β_{π*} ∈ R^S be defined as β_{π*}(s) := β_{π*(s)}(s) for each s ∈ S. Let (v^(L), π^(L)) = TVRVI(v^(0), π^(0), x, α, δ), and let L, M be as in Line 2. Define ξ := γ((1−γ)(α/4)·1 + β_{π*}). Then, with probability 1 − δ, 0 ≤ v* − v^(L) ≤ γ^L α·1 + (I − γP*)^{-1}ξ, and v^(L) ≤ T_{π^(L)}(v^(L)). In particular, if β = 0, then for L > log(8)(1−γ)^{-1} we can reduce the error in v^(0) by half: 0 ≤ v* − v^(L) ≤ (α/2)·1. Additionally, TVRVI is implementable with O(Atot·M·L) sample queries to the generative model and time, and O(Atot) space.

Theorem 1.2 now follows by recursively applying Corollary 2.5. OfflineTVRVI (Algorithm 4) provides the pseudocode for the algorithm guaranteed by Theorem 1.2.

Algorithm 4: OfflineTVRVI(ε, δ)
Input: Target precision ε and failure probability δ ∈ (0, 1).
1  K = ⌈log₂(ε^{-1}(1−γ)^{-1})⌉, v_0 = 0, π_0 is an arbitrary policy, and α_0 = (1−γ)^{-1};
2  for each iteration k ∈ [K] do
3    α_k = α_{k−1}/2 = 2^{-k}(1−γ)^{-1};
4    x = P v_{k−1};
5    (v_k, π_k) = TVRVI(v_{k−1}, π_{k−1}, x, α_{k−1}, δ/K);
6  return (v_K, π_K)

3 Sample setting algorithm

In this section, we show how to extend the analysis of the previous section to the sample setting, where we do not have explicit access to P. We follow a framework similar to that of [3] to show that we can instead estimate the offsets x in OfflineTVRVI by taking additional samples from the generative model. The pseudocode is given in SampleTVRVI (Algorithm 5). To analyze the algorithm, we first bound the error incurred when approximating the exact offsets x in Line 4 of OfflineTVRVI (Algorithm 4) with approximate offsets x ≈ P v_{k−1} computed by sampling from the generative model. The proof leverages Hoeffding's and Bernstein's inequalities and follows a structure similar to the proof of Lemma 5.1 of [3].

Algorithm 5: SampleTVRVI(ε, δ)
Input: Target precision ε and failure probability δ ∈ (0, 1).
1  K = ⌈log₂(ε^{-1}(1−γ)^{-1})⌉;
2  v_0 = 0, π_0 is an arbitrary policy, and α_0 = (1−γ)^{-1};
3  for each iteration k ∈ [K] do
4    α_k = α_{k−1}/2 = 2^{-k}(1−γ)^{-1};
5    N = ⌈6500(1−γ)^{-3} log(8Atot K δ^{-1})⌉;
6    N_{k−1} = ⌈N · max((1−γ), α^{-2}_{k−1})⌉;
7    η_{k−1} = N^{-1}_{k−1} log(8Atot K δ^{-1});
8    x_k = ApxUtility(v_{k−1}, N_{k−1}, η_{k−1});
9    (v_k, π_k) = TVRVI(v_{k−1}, π_{k−1}, x_k, α_{k−1}, δ/K);
10 return (v_K, π_K)

Lemma 3.1. Consider u ∈ R^S. Let x = ApxUtility(u, mAtot, η) with m ≥ log((1/2)δ^{-1}) and η = (mAtot)^{-1} log((1/2)δ^{-1}). Then, with probability 1 − δ,

P u − [ 2√(2η σ_{v*}) + (2√(2η)‖u − v*‖_∞ + 18η^{3/4}‖u‖_∞)·1 ] ≤ x ≤ P u.

Finally, to obtain our main result, Theorem 1.1, we utilize worst-case bounds on σ_{v*} from prior work [1] (see Lemma B.3 and Lemma B.4) and inductively apply Lemma 3.1 and Corollary 2.5.
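Putting the pieces together, the outer loops of Algorithms 4 and 5 simply halve the target accuracy and re-invoke TVRVI. A schematic Python rendering of the offline case is below, reusing the illustrative tvrvi_epoch helper sketched after Algorithm 3; constants and failure-probability bookkeeping from the paper are omitted, and all names are our own.

```python
import numpy as np

def offline_tvrvi(P, r, gamma, actions_of_state, eps, rng, M=64):
    """Schematic outer loop of Algorithm 4: halve the error alpha_k at every outer iteration."""
    n_states = P.shape[1]
    v = np.zeros(n_states)                          # v_0 = 0 is a (1-gamma)^{-1}-underestimate of v*
    pi = [rows[0] for rows in actions_of_state]     # arbitrary initial policy
    alpha = 1.0 / (1.0 - gamma)
    K = int(np.ceil(np.log2(1.0 / (eps * (1.0 - gamma)))))
    for _ in range(K):
        x = P @ v                                   # exact offsets in the offline setting (Line 4)
        v, pi = tvrvi_epoch(P, r, gamma, v, pi, x, alpha, actions_of_state, M, rng)
        alpha = alpha / 2.0                         # Corollary 2.5: the error is halved
    return v, pi
```

In the sample setting, the only change is that the exact product P @ v is replaced by the shifted sampling estimate apx_utility(v, P, N_k, eta_k, rng), as in Lines 5-8 of Algorithm 5.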
The constant of 6500 appearing in the initialization of N in Algorithm 5 arises, for technical reasons, from applying Bernstein's inequality, Hoeffding's inequality, a union bound over all K outer-loop iterations, and bounds on σ_{v*} from prior work [3] to prove Lemma 3.1. While it is unclear how to directly tighten this constant further, the proof of Lemma 3.1 shows that in the expression N = 6500(1−γ)^{-3} log(8Atot K δ^{-1}) there is a natural trade-off between the leading constant (in this case 6500) and the number of outer-loop iterations K. By increasing the number of outer-loop iterations K by a constant factor, one can relax the error requirements of each iteration (i.e., decrease N by constants at the cost of increased logarithmic dependence on |S| and Atot). Although not the primary focus of our work, such trade-offs might be of practical importance.

4 Conclusion

We provided faster and more space-efficient algorithms for solving DMDPs. We showed how to apply truncation and recursive variance reduction to improve upon prior variance-reduced value iteration methods. Ultimately, these techniques reduced an additive Õ((1−γ)^{-3}) term in the time and sample complexity of prior variance-reduced value iteration methods to Õ((1−γ)^{-2}). Natural open problems left by our work include exploring the practical implications of our techniques and exploring whether further runtime improvements are possible. For example, it may be of practical interest to explore whether there exist other analogs of truncation that do not need to limit the progress of individual steps of value iteration. Additionally, the question of whether the Õ((1−γ)^{-2}) term in our time and sample complexities can be further improved to Õ((1−γ)^{-1}) is a natural open problem; an affirmative answer to this question would yield the first near-optimal running times for solving a DMDP with a generative model for all ε and would fully bridge the sample-complexity gap between model-based and model-free methods. We hope this paper supports further study of these questions and of establishing the optimal runtime for solving MDPs.

Acknowledgements

Thank you to Yuxin Chen for interesting and motivating discussions about model-based methods in RL. Thank you to the anonymous reviewers for their helpful feedback. Yujia Jin and Ishani Karmarkar were funded in part by NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, and a PayPal research award. Aaron Sidford was funded in part by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, and a PayPal research award. Part of this work was conducted while visiting the Simons Institute for the Theory of Computing. Yujia Jin's contributions to the project occurred while she was a graduate student at Stanford.

References

[1] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. Naval Research Logistics (NRL), 70, 2023.
[2] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018.
[3] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.
[4] Qiying Hu and Wuyi Yue. Markov decision processes with their applications, volume 14. Springer Science & Business Media, 2007.
[5] Zeng Wei, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. Reinforcement learning to rank with Markov decision process. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.
[6] Thomas Degris, Olivier Sigaud, and Pierre-Henri Wuillemin. Learning the structure of factored Markov decision processes in reinforcement learning problems. 23rd International Conference on Machine Learning (ICML), 2006.
[7] Olivier Sigaud and Olivier Buffet. Markov decision processes in artificial intelligence. John Wiley & Sons, 2013.
[8] Martijn Van Otterlo and Marco Wiering. Reinforcement learning and Markov decision processes. Reinforcement Learning: State-of-the-Art, 2012.
[9] Martijn Van Otterlo. Markov decision processes: Concepts and algorithms. Course on Learning and Reasoning, 2009.
[10] Yinyu Ye. A new complexity result on solving the Markov decision problem. Mathematics of Operations Research, 30, 2005.
[11] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. 11th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 1995.
[12] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2014.
[13] Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36, 2011.
[14] Bruno Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. Advances in Neural Information Processing Systems 26 (NeurIPS), 2013.
[15] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems 11 (NeurIPS), 1998.
[16] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91, 2013.
[17] Alekh Agarwal, Sham M. Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. 33rd Annual Conference on Computational Learning Theory (COLT), 2020.
[18] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
[19] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003.
[20] Fei Feng, Wotao Yin, and Lin F. Yang. How does an approximate model help in reinforcement learning? arXiv preprint arXiv:1912.02986, 2019.
[21] Yujia Jin and Aaron Sidford. Efficiently solving MDPs with stochastic mirror descent. In 37th International Conference on Machine Learning (ICML), 2020.
[22] Paul Tseng. Solving H-horizon, stationary Markov decision problems in time proportional to log(H). Operations Research Letters, 9, 1990.
[23] Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear (sometimes sublinear) running time. Mathematics of Operations Research, 42, 2019.
[24] Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang.
Minimum cost flows, MDPs, and ℓ1-regression in nearly linear time for dense instances. In 53rd Annual ACM Symposium on Theory of Computing (STOC), 2021.
[25] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. Journal of the ACM, 2020.
[26] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 259-278. SIAM, 2020.
[27] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. A faster algorithm for solving general LPs. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 823-832, 2021.
[28] Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. New bounds for matrix multiplication: from alpha to omega. In 35th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2024.
[29] Pengqian Yu, William B. Haskell, and Huan Xu. Approximate value iteration for risk-aware Markov decision processes. IEEE Transactions on Automatic Control, 63, 2018.
[30] Mohand Hamadouche, Catherine Dezan, David Espes, and Kalinka Branco. Comparison of value iteration, policy iteration and Q-learning for solving decision-making problems. In 2021 International Conference on Unmanned Aircraft Systems (ICUAS), 2021.
[31] Christopher W. Zobel and William T. Scherer. An empirical study of policy convergence in Markov decision process value iteration. Computers & Operations Research, 32, 2005.
[32] Dileep Kalathil, Vivek S. Borkar, and Rahul Jain. Empirical Q-value iteration. Stochastic Systems, 11, 2021.
[33] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems 26 (NeurIPS), 2013.
[34] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. 34th International Conference on Machine Learning (ICML), 2017.
[35] David A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100-118, 1975.
[36] Joel A. Tropp. Freedman's inequality for matrix martingales. Electronic Communications in Probability, 16, 2011.
[37] Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in MDPs. In International Conference on Machine Learning (ICML), pages 5747-5755. PMLR, 2018.
[38] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. 36th International Conference on Machine Learning (ICML), 2019.

A Faster problem-dependent convergence

In this section, we propose a modified version of the SampleTVRVI algorithm, named ProblemDependentTVRVI. This algorithm adjusts the number of required samples based on the structure of the MDP under consideration. Inspired by [38], we then consider MDPs with small ranges of optimal values and the extreme case of highly mixing MDPs in which state transitions are sampled from a fixed distribution.

Note that in the proof of Theorem 1.1, the error during convergence caused by approximations of values is bounded by (I − γP*)^{-1}ξ_k for ξ_k ≤ ((1−γ)α_k/4)·1 + 2√(2η_k σ_{v*}) + (2√(2η_k)α_k + 18η_k^{3/4}‖v^(0)‖_∞)·1. In its proof, we upper bound the variance term ‖(I − γP*)^{-1}√σ_{v*}‖_∞ by 3(1−γ)^{-1.5} using Lemma B.4.
However, as α_k decreases and the variance term becomes dominant, a number of samples proportional to the size of the variance term suffices to control the error during each iteration. Given V which upper bounds ‖(I − γP*)^{-1}√σ_{v*}‖_∞, we can further refine SampleTVRVI to reduce the number of samples taken after an initial burn-in phase and obtain improved complexities when V is significantly small. Hence, we obtain the following Algorithm 6 and Theorem A.1.

Algorithm 6: ProblemDependentTVRVI(ε, δ, V)
Input: Target precision ε, failure probability δ ∈ (0, 1), and V ≥ ‖(I − γP*)^{-1}√σ_{v*}‖_∞.
1  K = ⌈log₂(ε^{-1}(1−γ)^{-1})⌉;
2  v_0 = 0, π_0 is an arbitrary policy, and α_0 = 1/(1−γ);
3  for each iteration k ∈ [K] do
4    α_k = α_{k−1}/2 = 2^{-k}(1−γ)^{-1};
5    if k < log₂(128(1−γ)^{-5}V^{-3}) then
6      N_{k−1} = ⌈6500(1−γ)^{-3} max((1−γ), α^{-2}_{k−1}) log(8Atot K δ^{-1})⌉;  // Burn-in phase
7    else
8      N_{k−1} = ⌈1024 α^{-2}_{k−1} V² log(8Atot K δ^{-1})⌉;  // Variance-dependent phase
9    η_{k−1} = N^{-1}_{k−1} log(8Atot K δ^{-1});
10   x_k = ApxUtility(v_{k−1}, N_{k−1}, η_{k−1});
11   (v_k, π_k) = TVRVI(v_{k−1}, π_{k−1}, x_k, α_{k−1}, δ/K);
12 return (v_K, π_K)

Theorem A.1. In the sample setting, there is an algorithm (Algorithm 6) that, given 3(1−γ)^{-1.5} ≥ V ≥ ‖(I − γP*)^{-1}√σ_{v*}‖_∞, uses Õ(Atot[ε^{-2}V² + (1−γ)^{-2}]) samples and time and Õ(Atot) space, and computes an ε-optimal policy and ε-optimal values with probability 1 − δ.

Proof. Let K, α_k, and (v_k, π_k) be as defined in Lines 1, 4, and 11 of ProblemDependentTVRVI(ε, δ, V). For the correctness of the algorithm, we first induct on k to show that for each k ∈ [K], with probability 1 − kδ/K,

0 ≤ v* − v_{π_k} ≤ v* − v_k ≤ α_k·1, and v_k ≤ T_{π_k}(v_k).

The base case is trivial, as 0 ≤ v* − v_{π_0} ≤ v* − v_0 ≤ (1−γ)^{-1}·1. For the inductive step, observe that by Lemma 3.1, with probability 1 − δ/K,

P v_{k−1} − [ 2√(2η_{k−1} σ_{v*}) + (2√(2η_{k−1}) α_{k−1} + 18η^{3/4}_{k−1}‖v_{k−1}‖_∞)·1 ] ≤ x_k ≤ P v_{k−1}.   (8)

Additionally, by the inductive hypothesis, with probability 1 − (k−1)δ/K,

0 ≤ v* − v_{π_{k−1}} ≤ v* − v_{k−1} ≤ α_{k−1}·1, and v_{k−1} ≤ T_{π_{k−1}}(v_{k−1}).   (9)

Thus, by a union bound, with probability 1 − kδ/K, both (8) and (9) hold. We condition on this event in the remainder of the inductive step. Now, we apply Corollary 2.5 with

β = 2√(2η_{k−1} σ_{v*}) + (2√(2η_{k−1}) α_{k−1} + 18η^{3/4}_{k−1}‖v_{k−1}‖_∞)·1.

Consequently, we have

0 ≤ v* − v_k ≤ γ^L α_{k−1}·1 + (I − γP*)^{-1}ξ_{k−1} ≤ (α_{k−1}/8)·1 + (I − γP*)^{-1}ξ_{k−1},

for ξ_{k−1} ≤ ((1−γ)α_{k−1}/4)·1 + 2√(2η_{k−1} σ_{v*}) + (2√(2η_{k−1}) α_{k−1} + 18η^{3/4}_{k−1}‖v_{k−1}‖_∞)·1, and v_k ≤ T_{π_k}(v_k). Note that (I − γP*)^{-1}·1 = (1/(1−γ))·1. Hence, if k < log₂(128(1−γ)^{-5}V^{-3}), we use Lemma B.4 along with this fact and the choice of η_{k−1} to obtain, identically to the proof of Theorem 1.1, that (I − γP*)^{-1}ξ_{k−1} ≤ (3α_{k−1}/8)·1. If instead k ≥ log₂(128(1−γ)^{-5}V^{-3}), then α_{k−1} ≤ 128(1−γ)^4 V^3 and η_{k−1} = α²_{k−1}/(1024 V²); consequently, bounding each term of (I − γP*)^{-1}ξ_{k−1} using V ≥ ‖(I − γP*)^{-1}√σ_{v*}‖_∞, (I − γP*)^{-1}·1 = (1/(1−γ))·1, and ‖v_{k−1}‖_∞ ≤ (1−γ)^{-1} again yields (I − γP*)^{-1}ξ_{k−1} ≤ (3α_{k−1}/8)·1. Therefore, in either case, v* − v_k ≤ (α_{k−1}/2)·1 = α_k·1. Moreover, we can use that v_k ≤ T_{π_k}(v_k) to see that v_k ≤ T_{π_k}(v_k) ≤ T²_{π_k}(v_k) ≤ ... ≤ T^∞_{π_k}(v_k) = v_{π_k} ≤ v*. This completes the inductive step.

Consequently, taking k = K = ⌈log₂(ε^{-1}(1−γ)^{-1})⌉ iterations of the outer loop, with probability 1 − δ we have 0 ≤ v* − v_{π_K} ≤ v* − v_K ≤ α_K·1 ≤ ε·1 and v_K ≤ T_{π_K}(v_K) ≤ T²_{π_K}(v_K) ≤ ... ≤ T^∞_{π_K}(v_K) = v_{π_K} ≤ v*; that is, v_K is an ε-optimal value vector and π_K is an ε-optimal policy. The total number of samples and time required is Õ(Atot[ε^{-2}V² + (1−γ)^{-2}]).
For the space complexity, note that the algorithm can be implemented to maintain only O(1) vectors in R^{Atot}.

Theorem A.1 yields improved complexities for solving MDPs when ‖(I − γP*)^{-1}√σ_{v*}‖_∞ is nontrivially bounded. Following [37], we mention two particular such settings where we can apply Theorem A.1 to obtain better problem-dependent sample and runtime bounds than Theorem 1.1.

Deterministic MDPs. For a deterministic MDP, each action deterministically transitions to a single state. That is, for all (s, a) ∈ A, p_a(s) = 1_{s′} (the indicator vector of some s′ ∈ S). In this case, σ_{v*} = 0. Consequently, if the MDP is deterministic, the algorithm converges with just Õ(Atot(1−γ)^{-2}) samples from the generative model and time (the ε^{-2}V² term vanishes). We note that in this setting of deterministic MDPs, there may be alternative approaches that obtain the same or better runtime and sample complexity.

Small range. Define the range of optimal values for an MDP as rng(v*) := max_{s∈S} v*_s − min_{s∈S} v*_s. Note that σ_{v*} ≤ rng(v*)²·1. So, ‖(I − γP*)^{-1}√σ_{v*}‖_∞ ≤ (1−γ)^{-1} rng(v*). Therefore, by Theorem A.1, given an approximate upper bound on ‖(I − γP*)^{-1}√σ_{v*}‖_∞, our algorithm is implementable with Õ(Atot[ε^{-2}(1−γ)^{-2} rng(v*)² + (1−γ)^{-2}]) samples and time.

Highly mixing domains. [37] showed that a contextual bandit problem can be modeled as an MDP where the next state is sampled from a fixed stationary distribution. Using the fact that the transition function is independent of the prior state and action, the authors of [38] show that rng(v*) ≤ 1 with a simple proof in their Appendix A.2. Hence, by the argument in the preceding paragraph, Õ(Atot[ε^{-2}(1−γ)^{-2} + (1−γ)^{-2}]) samples and time suffice in this setting.

B Omitted proofs from the main body

B.1 Omitted proof of Lemma 1.3

Lemma 1.3. For a, b, x ∈ R^n and γ, α > 0, let c := median{a − (1−γ)α·1, b, a + (1−γ)α·1}, where the median is applied entrywise. Then, if ‖b − x‖_∞ ≤ γ‖a − x‖_∞ and ‖a − x‖_∞ ≤ α, then ‖c − x‖_∞ ≤ γ‖a − x‖_∞.

Proof. Consider the i-th entry (c − x)_i. There are three cases. First, suppose a_i − (1−γ)α ≤ b_i ≤ a_i + (1−γ)α, so that c_i = b_i. Then |c_i − x_i| = |b_i − x_i| ≤ γ‖a − x‖_∞. Second, suppose b_i < a_i − (1−γ)α ≤ a_i + (1−γ)α, so that c_i = a_i − (1−γ)α. Then c_i − x_i ≥ b_i − x_i ≥ −‖b − x‖_∞ ≥ −γ‖a − x‖_∞. Meanwhile, c_i − x_i = a_i − (1−γ)α − x_i ≤ ‖a − x‖_∞ − (1−γ)α. Now, because ‖a − x‖_∞ ≤ α, we have (1−γ)‖a − x‖_∞ ≤ (1−γ)α. So ‖a − x‖_∞ − (1−γ)α ≤ γ‖a − x‖_∞. Lastly, suppose a_i − (1−γ)α ≤ a_i + (1−γ)α < b_i, so that c_i = a_i + (1−γ)α. Then c_i − x_i ≤ b_i − x_i ≤ ‖b − x‖_∞ ≤ γ‖a − x‖_∞. Meanwhile, c_i − x_i = a_i + (1−γ)α − x_i ≥ −‖a − x‖_∞ + (1−γ)α. Now, because ‖a − x‖_∞ ≤ α, we have (1−γ)‖a − x‖_∞ ≤ (1−γ)α. So −‖a − x‖_∞ + (1−γ)α ≥ −γ‖a − x‖_∞. Combining the three cases yields ‖c − x‖_∞ ≤ γ‖a − x‖_∞.

B.2 Omitted proofs from Section 2

First, we prove Lemma 2.1.

Lemma 2.1. Let x = Sample(u, p, M, 0) for p ∈ Δ^n, M ∈ Z_{>0}, ε > 0, and u ∈ R^S. Then, E[x] = p^⊤u, |x| ≤ ‖u‖_∞, and Var[x] ≤ (1/M)‖u‖²_∞.

Proof. The first statement follows from linearity of expectation and the second from the definitions. The third statement follows from independence and the fact that

Var[u(i_m)] = Σ_{i∈S} p_i u_i² − (p^⊤u)² ≤ Σ_{i∈S} p_i ‖u‖²_∞ = ‖u‖²_∞  for any m ∈ [M].

Next, we state Freedman's inequality [35], which we use to prove the following Lemma 2.2.

Theorem B.1 (Freedman's Inequality, restated from [36]). Consider a real-valued martingale {Y_k : k = 0, 1, ...} with difference sequence {X_k : k = 1, 2, ...} given by X_k = Y_k − Y_{k−1}. Assume that X_k ≤ R almost surely for k = 1, 2, .... Define the predictable quadratic variation process of the martingale W_k := Σ_{j=1}^{k} E[X_j² | X_1, ..., X_{j−1}]. Then, for all t ≥ 0 and σ² > 0,

P[ ∃ k ≥ 0 : Y_k ≥ t and W_k ≤ σ² ] ≤ exp( −t²/(2(σ² + Rt/3)) ).

Lemma 2.2. Let T ∈ Z_{>0} and w^(0), w^(1), ..., w^(T) ∈ R^S be such that ‖w^(i) − w^(i−1)‖_∞ ≤ τ for all i ∈ [T]. Then, for any p ∈ Δ^S, δ ∈ (0, 1), and M ≥ 2^8 T log(2/δ), with probability 1 − δ,

| p^⊤(w^(t) − w^(0)) − Σ_{i∈[t]} (1/M) Σ_{j∈[M]} Sample(w^(i) − w^(i−1), p, 1, 0) | ≤ τ/8   for all t ∈ [T].
Proof. For each i ∈ [T] and j ∈ [M], let X_{i,j} := ( Sample(w^(i) − w^(i−1), p, 1, 0) − p^⊤(w^(i) − w^(i−1)) ) / M. Since p ∈ Δ^S, Lemma 2.1 yields that |X_{i,j}| ≤ 2τ/M. Next, define Y_{t,k} := Σ_{i∈[t−1]} Σ_{j∈[M]} X_{i,j} + Σ_{j=1}^{k} X_{t,j}. The predictable quadratic variation process (as defined in Theorem B.1) is given by

W_{t,k} = Σ_{i∈[t−1]} Σ_{j∈[M]} E[ X²_{i,j} | X_{1,1:M}, ..., X_{i−1,1:M}, X_{i,1:j−1} ] + Σ_{j∈[k]} E[ X²_{t,j} | X_{1,1:M}, ..., X_{t−1,1:M}, X_{t,1:j−1} ]
  = (1/M²) ( Σ_{i∈[t−1]} Σ_{j∈[M]} Var[ Sample(w^(i) − w^(i−1), p, 1, 0) ] + Σ_{j∈[k]} Var[ Sample(w^(t) − w^(t−1), p, 1, 0) ] ) ≤ Tτ²/M,

where in the last line we used Lemma 2.1 to bound each variance by ‖w^(i) − w^(i−1)‖²_∞ ≤ τ². Now, by telescoping,

Y_{t,M} = Σ_{i∈[t]} (1/M) Σ_{j∈[M]} Sample(w^(i) − w^(i−1), p, 1, 0) − p^⊤(w^(t) − w^(0))   for all t ∈ [T].

Consequently, applying Theorem B.1 twice (once to Y_{t,M} and once to −Y_{t,M}) yields

P[ ∃ t ∈ [T] : |Y_{t,M}| ≥ τ/8 ] ≤ 2 exp( −(τ/8)² / (2(Tτ²/M + (2τ/M)(τ/8)/3)) ) ≤ δ,

where the last inequality follows from the choice M ≥ 2^8 T log(2/δ).

As an immediate corollary of Lemma 2.2, we obtain Corollary 2.3.

Corollary 2.3. In TVRVI (Algorithm 3), with probability 1 − δ, in Lines 9, 10, and 2, for all s ∈ S, a ∈ A_s, and ℓ ∈ [L], we have |g^(ℓ)_a(s) − p_a(s)^⊤(v^(ℓ−1) − v^(0))| ≤ (1−γ)α/8, and therefore ĝ^(ℓ)_a(s) is a (1−γ)α/4-underestimate of p_a(s)^⊤(v^(ℓ−1) − v^(0)).

Proof. Consider some s ∈ S and a ∈ A_s. Note that g^(ℓ)_a(s) is equal in distribution to Σ_{i∈[ℓ−1]} (1/M) Σ_{j∈[M]} Sample(v^(i) − v^(i−1), p_a(s), 1, 0), an estimate of p_a(s)^⊤(v^(ℓ−1) − v^(0)). Then, by Lemma 2.2 and a union bound, whenever M ≥ L·2^8 log(2Atot/δ), we have that with probability 1 − δ, for all (s, a) ∈ A and ℓ ∈ [L],

| g^(ℓ)_a(s) − p_a(s)^⊤(v^(ℓ−1) − v^(0)) | ≤ (1−γ)α/8,

and, conditioning on this event, we have p_a(s)^⊤(v^(ℓ−1) − v^(0)) − (1−γ)α/4 ≤ ĝ^(ℓ)_a(s) ≤ p_a(s)^⊤(v^(ℓ−1) − v^(0)) due to the shift in Line 10.

Conditioning on the event that the implication of Corollary 2.3 holds, we can prove the following Lemma 2.4.

Lemma 2.4. Suppose that for some β ∈ R^A_{≥0}, P v^(0) − β ≤ x ≤ P v^(0), and let β_{π*} ∈ R^S be defined as β_{π*}(s) := β_{π*(s)}(s) for each s ∈ S. Then, with probability 1 − δ, at the end of every iteration ℓ ∈ [L] (Line 3) of TVRVI(v^(0), π^(0), x, α, δ), the following hold for ξ := γ((1−γ)(α/4)·1 + β_{π*}):

v^(ℓ−1) ≤ v^(ℓ) ≤ T_{π^(ℓ)}(v^(ℓ)),   (6)

0 ≤ v* − v^(ℓ) ≤ max{ γP*(v* − v^(ℓ−1)) + ξ, γ(v* − v^(ℓ−1)) }.   (7)

Proof. In the remainder of this proof, condition on the event that the implications of Corollary 2.3 hold (which occurs with probability 1 − δ). By Lines 7 and 8 of Algorithm 3, for all ℓ ∈ [L], v^(ℓ−1) ≤ v^(ℓ) ≤ v^(ℓ−1) + (1−γ)α·1. This immediately implies the lower bound in (6). We prove the upper bound in (6) by induction. In the base case, when ℓ = 0, v^(0) ≤ T_{π^(0)}(v^(0)) holds by assumption. For the ℓ-th iteration, there are two cases. If v^(ℓ)(s) > v^(ℓ−1)(s) for s ∈ S, then

v^(ℓ)(s) ≤ r_{π^(ℓ)(s)}(s) + γ( x_{π^(ℓ)(s)}(s) + ĝ^(ℓ)_{π^(ℓ)(s)}(s) ) ≤ r_{π^(ℓ)(s)}(s) + γ p_{π^(ℓ)(s)}(s)^⊤ v^(ℓ−1)   (10)
  = T_{π^(ℓ)}(v^(ℓ−1))(s) ≤ T_{π^(ℓ)}(v^(ℓ))(s).

Otherwise, if v^(ℓ)(s) = v^(ℓ−1)(s), then by the inductive hypothesis, v^(ℓ)(s) = v^(ℓ−1)(s) ≤ T_{π^(ℓ−1)}(v^(ℓ−1))(s) ≤ T_{π^(ℓ)}(v^(ℓ))(s). This completes the proof of (6).

Next, we prove (7). For the lower bound, by induction and (10), we have that for each s ∈ S,

v^(ℓ)(s) ≤ max_{a∈A_s}{ r_a(s) + γ p_a(s)^⊤ v^(ℓ−1) } ≤ max_{a∈A_s}{ r_a(s) + γ p_a(s)^⊤ v* } = v*(s),

so min( ṽ^(ℓ), v^(ℓ−1) + (1−γ)α·1 ) ≤ v*. Next, we prove the upper bound of (7). For each (s, a) ∈ A and ℓ ∈ [L], let ξ^(ℓ)_a(s) := p_a(s)^⊤ v^(ℓ−1) − (x_a(s) + ĝ^(ℓ)_a(s)), and observe that

ξ^(ℓ)_a(s) = [ p_a(s)^⊤ v^(0) − x_a(s) ] + [ p_a(s)^⊤(v^(ℓ−1) − v^(0)) − ĝ^(ℓ)_a(s) ] ≤ β_a(s) + (1−γ)α/4.

Note that for any s ∈ S, writing ṽ^(ℓ)(s) for the untruncated update max_{a∈A_s} Q_{s,a},

(v* − ṽ^(ℓ))(s) ≤ max_{a∈A_s}[ r_a(s) + γ p_a(s)^⊤ v* ] − max_{a∈A_s}[ r_a(s) + γ(x_a(s) + ĝ^(ℓ)_a(s)) ]
  ≤ [ r_{π*(s)}(s) + γ(P* v*)(s) ] − max_{a∈A_s}[ r_a(s) + γ p_a(s)^⊤ v^(ℓ−1) − γ ξ^(ℓ)_a(s) ]
  ≤ [ r_{π*(s)}(s) + γ(P* v*)(s) ] − [ r_{π*(s)}(s) + γ(P* v^(ℓ−1))(s) − γ ξ^(ℓ)_{π*(s)}(s) ]
  ≤ γ( P*(v* − v^(ℓ−1)) )(s) + ξ(s).

Consequently, for all s ∈ S, (v* − ṽ^(ℓ))(s) ≤ γ(P*(v* − v^(ℓ−1)))(s) + ξ(s). Consider two cases for v^(ℓ)(s). First, if v^(ℓ)(s) = ṽ^(ℓ)(s) for some s ∈ S, then (v* − v^(ℓ))(s) ≤ γ(P*(v* − v^(ℓ−1)))(s) + ξ(s) holds immediately.
If not, then v^(ℓ)(s) = v^(ℓ−1)(s) + (1−γ)α ≤ ṽ^(ℓ)(s), and (6) guarantees that v* − v^(ℓ−1) ≤ v* − v^(0) ≤ α·1, which ensures that (1−γ)(v* − v^(ℓ−1))(s) ≤ (1−γ)α and yields the result, as

(v* − v^(ℓ))(s) = (v* − v^(ℓ−1))(s) − (1−γ)α ≤ γ(v* − v^(ℓ−1))(s).

We now inductively apply Lemma 2.4 to obtain Corollary 2.5, which allows us to bound the number of iterations required to halve the initial error in TVRVI.

Corollary 2.5. Suppose that for some α ≥ 0 and β ∈ R^A_{≥0}, P v^(0) − β ≤ x ≤ P v^(0); v^(0) is an α-underestimate of v*; and v^(0) ≤ T_{π^(0)}(v^(0)). Let β_{π*} ∈ R^S be defined as β_{π*}(s) := β_{π*(s)}(s) for each s ∈ S. Let (v^(L), π^(L)) = TVRVI(v^(0), π^(0), x, α, δ), and let L, M be as in Line 2. Define ξ := γ((1−γ)(α/4)·1 + β_{π*}). Then, with probability 1 − δ, 0 ≤ v* − v^(L) ≤ γ^L α·1 + (I − γP*)^{-1}ξ, and v^(L) ≤ T_{π^(L)}(v^(L)). In particular, if β = 0, then for L > log(8)(1−γ)^{-1} we can reduce the error in v^(0) by half: 0 ≤ v* − v^(L) ≤ (α/2)·1. Additionally, TVRVI is implementable with O(Atot·M·L) sample queries to the generative model and time, and O(Atot) space.

Proof. Condition on the event that the implications of Lemma 2.4 hold. First, we observe that 0 ≤ v* − v_{π^(L)} ≤ v* − v^(L) follows by monotonicity (Equation (6) of Lemma 2.4). Next, we show by induction that, for all ℓ and i ∈ S,

[v* − v^(ℓ)](i) ≤ [ γ^ℓ α·1 + Σ_{k=0}^{ℓ−1} γ^k (P*)^k ξ ](i),

which in particular implies v* − v^(L) ≤ γ^L α·1 + (I − γP*)^{-1}ξ. In the base case, when ℓ = 0, this is trivially true, as v* − v^(0) ≤ α·1 by assumption. Assume that the statement is true up to v^(ℓ−1). Now, by Lemma 2.4, we have two cases for [v* − v^(ℓ)](i).

First, suppose that [v* − v^(ℓ)](i) ≤ γ[v* − v^(ℓ−1)](i). Then, note that P* and ξ are entrywise non-negative, so [γ^ℓ (P*)^ℓ ξ](i) ≥ 0. By the inductive hypothesis and the fact that γ ∈ (0, 1), we have

[v* − v^(ℓ)](i) ≤ γ[ γ^{ℓ−1} α·1 + Σ_{k=0}^{ℓ−2} γ^k (P*)^k ξ ](i) ≤ [ γ^ℓ α·1 + Σ_{k=0}^{ℓ−1} γ^k (P*)^k ξ ](i).

Second, suppose instead that [v* − v^(ℓ)](i) ≤ γ[P*(v* − v^(ℓ−1))](i) + ξ(i). By monotonicity (Equation (6) of Lemma 2.4), we know that v* − v^(ℓ−1) ≥ 0. Moreover, P* is entrywise non-negative, and consequently we can use the inductive hypothesis as follows:

v* − v^(ℓ−1) ≤ γ^{ℓ−1} α·1 + Σ_{k=0}^{ℓ−2} γ^k (P*)^k ξ,  hence  P*(v* − v^(ℓ−1)) ≤ P*[ γ^{ℓ−1} α·1 + Σ_{k=0}^{ℓ−2} γ^k (P*)^k ξ ].

We can rearrange terms to obtain the following bound:

[v* − v^(ℓ)](i) ≤ γ^ℓ α[P*·1](i) + [ Σ_{k=0}^{ℓ−2} γ^{k+1} (P*)^{k+1} ξ ](i) + ξ(i) ≤ [ γ^ℓ α·1 + Σ_{k=0}^{ℓ−1} γ^k (P*)^k ξ ](i).

Consequently, by induction, the bound holds. When L > log(8)(1−γ)^{-1}, γ^L ≤ 1/8, and, if β = 0, we have

v* − v^(L) ≤ γ^L α·1 + (I − γP*)^{-1} γ(1−γ)(α/4)·1 ≤ γ^L α·1 + γ(α/4)·1 ≤ (α/2)·1.

Finally, the sample complexity and runtime follow from the algorithm pseudocode. For the space complexity, at each iteration ℓ of the outer for loop in TVRVI, the algorithm needs only to maintain ĝ^(ℓ), g^(ℓ) ∈ R^{Atot}, v^(ℓ) ∈ R^S, π^(ℓ), and at most M·Atot samples in invoking Sample.

Finally, we are ready to prove Theorem 1.2.

Theorem 1.2. In the offline setting, there is an algorithm that uses Õ(nnz(P) + Atot(1−γ)^{-2}) time, and computes an ε-optimal policy and ε-optimal values with probability 1 − δ.

Proof. To run OfflineTVRVI, we can implement a generative model from which we can draw samples with O(nnz(P)) pre-processing time, so that each query to the generative model requires O(1) time. For correctness, we induct on k to show that after each iteration k, 0 ≤ v* − v_{π_k} ≤ v* − v_k ≤ α_k·1 with probability 1 − kδ/K. In the base case, when k = 0, the bound is trivially true, as ‖v*‖_∞ ≤ (1−γ)^{-1}. Now, by applying Corollary 2.5 and a union bound, we see that with probability 1 − kδ/K, v* − v_k ≤ (α_{k−1}/2)·1 = α_k·1 whenever L > log(8)(1−γ)^{-1}. Thus, v_K satisfies the required guarantee whenever α_K ≤ ε, which is guaranteed by our choice of K. To see that π_k is an ε-optimal policy, we observe that Corollary 2.5 ensures v_k ≤ T_{π_k}(v_k) ≤ T²_{π_k}(v_k) ≤ ... ≤ T^∞_{π_k}(v_k) = v_{π_k} ≤ v*. For the runtime, the algorithm completes only K = Õ(1) iterations and can be implemented with O(1) calls to the offset oracle.
B.3 Omitted proofs from Section 3

Theorem B.2 (Hoeffding's Inequality and Bernstein's Inequality, restated from Lemmas E.1 and E.2 of [3]). Let $p \in \Delta^S$ be a probability vector, let $v \in \mathbb{R}^S$, and let $y := \frac{1}{m}\sum_{j=1}^{m} v(i_j)$, where the $i_j$ are random indices drawn independently such that $i_j = k$ with probability $p(k)$. Define $\sigma := p^\top v^2 - (p^\top v)^2$, where $v^2$ denotes the entrywise square. For any $\delta \in (0,1)$, the following hold, each with probability $1-\delta$:
(Hoeffding's Inequality) $\left|p^\top v - y\right| \le \|v\|_\infty \sqrt{2 m^{-1}\log(2\delta^{-1})}$,
(Bernstein's Inequality) $\left|p^\top v - y\right| \le \sqrt{2 m^{-1}\sigma\log(2\delta^{-1})} + (2/3)\, m^{-1}\|v\|_\infty\log(2\delta^{-1})$.

Theorem B.2 illustrates that the error in estimating $P u$ for some value vector $u$ depends on the variance $\sigma_u := P u^2 - (P u)^2 \in \mathbb{R}^{A}$. To bound this variance term, we appeal to the following two lemmas from [3].

Lemma B.3 (Lemma 5.2 of [3], restated). For any $u, v \in \mathbb{R}^S$, $\sqrt{\sigma_u} \le \sqrt{\sigma_v} + \|u - v\|_\infty\,\mathbf{1}$.

Lemma B.4 (Lemma C.1 of [3], restated). For any policy $\pi$, we have $\left\|\left(I - \gamma P^{\pi}\right)^{-1}\sqrt{\sigma_{v^{\pi}}}\right\|_\infty \le \sqrt{\frac{2(1+\gamma)}{\gamma^2(1-\gamma)^3}}$.

We can now bound the error in estimating $P u$ using ApxUtility$(u, N, \eta)$. The following Lemma 3.1 obtains such a bound by following a similar argument to that of Lemma 5.1 of [3].

Lemma 3.1. Consider $u \in \mathbb{R}^S$. Let $x = \mathrm{ApxUtility}(u, m A_{tot}, \eta)$ with $m \ge \log(2\delta^{-1})$ and $\eta = m^{-1}\log(2\delta^{-1})$. Then, with probability $1-\delta$,
$$P u - \left[2\sqrt{2\eta\,\sigma_{v^*}} + 2\sqrt{2\eta}\,\|u - v^*\|_\infty\,\mathbf{1} + 18\,\eta^{3/4}\|u\|_\infty\,\mathbf{1}\right] \le x \le P u.$$

Proof. Fix $s \in S$ and $a \in A_s$. Let $i_1, \ldots, i_N \in S$ be random indices such that $\mathbb{P}\{i_j = t\} = (p_a(s))(t)$ for each $j \in [N]$. Define the vectors $\hat x$ and $\hat\sigma$ as follows:
$$\hat x_a(s) := \frac{1}{N}\sum_{j=1}^{N} u(i_j) \quad \text{and} \quad \hat\sigma_a(s) := \frac{1}{N}\sum_{j=1}^{N}\left(u(i_j)\right)^2 - \left(\hat x_a(s)\right)^2.$$
From the pseudocode of ApxUtility (Algorithm 1), we see that
$$x = \hat x - \sqrt{2\eta\,\hat\sigma} - 4\,\eta^{3/4}\|u\|_\infty\,\mathbf{1} - (2/3)\,\eta\,\|u\|_\infty\,\mathbf{1}.$$
Now, by a union bound over all state-action pairs $(s,a)$ and Theorem B.2, we have that with probability $1 - \delta/2$, for each state-action pair $(s,a)$,
$$\left|\hat x - P u\right| \le \sqrt{3\eta}\,\|u\|_\infty\,\mathbf{1}, \qquad (11)$$
and with probability $1 - \delta/2$, for each state-action pair $(s,a)$,
$$\left|\hat\sigma_a(s) - \left(p_a(s)^\top u^2 - \left(p_a(s)^\top u\right)^2\right)\right| \le 4\,\|u\|_\infty^2\sqrt{2\eta}.$$
Consequently, by a union bound, the triangle inequality, and (11), we have that with probability $1-\delta$ both of the following hold:
$$\left|\hat x - P u\right| \le \sqrt{3\eta}\,\|u\|_\infty\,\mathbf{1} \quad \text{and} \quad \left|\hat\sigma - \sigma_u\right| \le 4\,\|u\|_\infty^2\sqrt{2\eta}\,\mathbf{1}. \qquad (12)$$
We condition on (12) in the remainder of the proof. By (12) and Lemma B.3, we have that for $\alpha := \|u - v^*\|_\infty$,
$$\sqrt{\hat\sigma} \le \sqrt{\sigma_u} + 2\,\|u\|_\infty\,(2\eta)^{1/4}\,\mathbf{1} \le \sqrt{\sigma_{v^*}} + \alpha\,\mathbf{1} + 2\,\|u\|_\infty\,(2\eta)^{1/4}\,\mathbf{1},$$
which, combined with the definition of $x$ and (11), implies that
$$P u - \left[2\sqrt{2\eta\,\sigma_{v^*}} + 2\sqrt{2\eta}\,\alpha\,\mathbf{1} + 18\,\eta^{3/4}\|u\|_\infty\,\mathbf{1}\right] \le x \le P u.$$
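The estimator analyzed in Lemma 3.1 admits a short per-row sketch. The Python below computes the shifted estimate $\hat x - \sqrt{2\eta\hat\sigma} - 4\eta^{3/4}\|u\|_\infty - (2/3)\eta\|u\|_\infty$ for a single row from i.i.d. samples; the function name, interface, and sampling call are illustrative assumptions rather than the paper's ApxUtility pseudocode.

    import numpy as np

    def apx_utility_row(u, p, m, eta, rng=None):
        """Shifted empirical estimate of p^T u for one state-action pair.

        u   : value vector, shape (S,)
        p   : transition probabilities for this state-action pair, shape (S,)
        m   : number of samples for this pair
        eta : confidence parameter (roughly log(2/delta) / m)
        """
        rng = rng if rng is not None else np.random.default_rng()
        idx = rng.choice(len(u), size=m, p=p)            # i.i.d. indices, P[i = k] = p(k)
        samples = u[idx]
        x_hat = samples.mean()                           # empirical estimate of p^T u
        sigma_hat = (samples ** 2).mean() - x_hat ** 2   # empirical variance proxy
        u_inf = np.abs(u).max()
        # Shift downward so that, on the high-probability event of Lemma 3.1,
        # the output does not exceed p^T u.
        return (x_hat
                - np.sqrt(2.0 * eta * max(sigma_hat, 0.0))
                - 4.0 * eta ** 0.75 * u_inf
                - (2.0 / 3.0) * eta * u_inf)

The downward shift is what makes the output usable as an underestimate offset: on the event of Lemma 3.1 it never overestimates $p^\top u$, while the Bernstein-style terms keep the amount of underestimation within the budget $\beta$ allowed by Corollary 2.5.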
Theorem 1.1. In the sample setting, there is an algorithm that uses $O\left(A_{tot}\left[(1-\gamma)^{-3}\varepsilon^{-2} + (1-\gamma)^{-2}\right]\right)$ samples and time and $O(A_{tot})$ space, and computes an $\varepsilon$-optimal policy and $\varepsilon$-optimal values with probability $1-\delta$.

Proof. Let $K$, $\alpha_k$, $(v_k, \pi_k)$, and $N_k$ be as defined in Lines 1, 4, 9, and 6 of SampleTVRVI$(\varepsilon, \delta)$. First, we show by induction that for each $k \in [K]$, with probability $1 - k\delta/K$,
$$0 \le v^* - v^{\pi_k} \le v^* - v_k \le \alpha_k\,\mathbf{1} \quad \text{and} \quad v_k \le T_{\pi_k}(v_k).$$
In the base case, when $k = 0$, the bound is trivially true because $0 \le v^* - v^{\pi_0} \le v^* - v_0 \le (1-\gamma)^{-1}\,\mathbf{1}$. Now, for the inductive step, by Lemma 3.1 we see that with probability $1 - \delta/K$,
$$P v_{k-1} - \left[2\sqrt{2\eta_{k-1}\,\sigma_{v^*}} + 2\sqrt{2\eta_{k-1}}\,\alpha_{k-1}\,\mathbf{1} + 18\,\eta_{k-1}^{3/4}\|v_{k-1}\|_\infty\,\mathbf{1}\right] \le x_k \le P v_{k-1}, \qquad (13)$$
and, by the inductive hypothesis, with probability $1 - (k-1)\delta/K$,
$$0 \le v^* - v^{\pi_{k-1}} \le v^* - v_{k-1} \le \alpha_{k-1}\,\mathbf{1} \quad \text{and} \quad v_{k-1} \le T_{\pi_{k-1}}(v_{k-1}). \qquad (14)$$
Consequently, by a union bound, with probability $1 - k\delta/K$, both (13) and (14) hold. Condition on this event for the remainder of the inductive step. Next, we can apply Corollary 2.5 with
$$\beta := 2\sqrt{2\eta_{k-1}\,\sigma_{v^*}} + 2\sqrt{2\eta_{k-1}}\,\alpha_{k-1}\,\mathbf{1} + 18\,\eta_{k-1}^{3/4}\|v_{k-1}\|_\infty\,\mathbf{1}$$
to obtain
$$0 \le v^* - v_k \le \gamma^L\alpha_{k-1}\,\mathbf{1} + \left(I - \gamma P^{\pi^*}\right)^{-1}\xi_{k-1} \le \frac{\alpha_{k-1}}{8}\,\mathbf{1} + \left(I - \gamma P^{\pi^*}\right)^{-1}\xi_{k-1}$$
for
$$\xi_{k-1} \le \frac{(1-\gamma)\alpha_{k-1}}{4}\,\mathbf{1} + 2\sqrt{2\eta_{k-1}\,\sigma_{v^*}} + 2\sqrt{2\eta_{k-1}}\,\alpha_{k-1}\,\mathbf{1} + 18\,\eta_{k-1}^{3/4}\|v_{k-1}\|_\infty\,\mathbf{1}.$$
By Lemma B.4 and the facts that $\eta_{k-1} \le \left(6500\,(1-\gamma)^{-3}\max\left\{(1-\gamma),\ \alpha_{k-1}^{-2}\right\}\right)^{-1}$ and $\left(I - \gamma P^{\pi^*}\right)^{-1}\mathbf{1} = (1-\gamma)^{-1}\,\mathbf{1}$, we obtain
$$\left(I - \gamma P^{\pi^*}\right)^{-1}\xi_{k-1} \le \frac{3}{8}\,\alpha_{k-1}\,\mathbf{1}.$$
Consequently, $v^* - v_k \le \left(\frac{1}{8} + \frac{3}{8}\right)\alpha_{k-1}\,\mathbf{1} = \frac{\alpha_{k-1}}{2}\,\mathbf{1} = \alpha_k\,\mathbf{1}$. To see that $\pi_k$ is also an $\alpha_k$-optimal policy, we observe that Corollary 2.5 also ensures that
$$v_k \le T_{\pi_k}(v_k) \le T^2_{\pi_k}(v_k) \le \cdots \le T^{\infty}_{\pi_k}(v_k) = v^{\pi_k} \le v^*.$$
This completes the inductive step. Consequently, for $k = K = \log_2\left(\varepsilon^{-1}(1-\gamma)^{-1}\right)$ iterations, $\varepsilon \ge \alpha_K \ge \varepsilon/4$, and with probability $1-\delta$, $v_K$ is an $\varepsilon$-optimal value and $\pi_K$ is an $\varepsilon$-optimal policy. For the runtime and sample complexity, note that the algorithm can be implemented using only $O(N_K) = O\left((1-\gamma)^{-3}\varepsilon^{-2} + (1-\gamma)^{-2}\right)$ samples and time per state-action pair. For the space complexity, note that the algorithm can be implemented to maintain only $O(1)$ vectors in $\mathbb{R}^{A_{tot}}$.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Yes, the abstract and introduction state our main results and improvements over previous work.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Yes, we discuss regimes where our result is optimal and where it may be suboptimal in the introduction. In the conclusion we also discuss directions for future work and problems left open by our work.
Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Sections 2 and 3 of our paper give a sketch of how we obtain our main theorems, and full proofs of all intermediate results as well as the full theorems can be found in the supplemental material/appendix.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper focuses on theoretical results and mathematical analysis and does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper focuses on theoretical results and mathematical analysis and does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper focuses on theory and does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper focuses on theoretical results and mathematical analysis and does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper focuses on theoretical results and mathematical analysis and does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: Yes, we have read and conformed to the NeurIPS Code of Ethics.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The paper focuses on foundational theory for solving MDPs and is not directly tied to any specific societal impacts (positive or negative). We do not expect any direct, immediate, substantial societal impacts.
Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper focuses on foundational theory and does not pose any such risks.
Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: We do not use any existing code/data/model assets because we do not have any experiments.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset.
The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper did not involve crowdsourcing or research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper did not involve crowdsourcing nor research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.