# Multiple-policy Evaluation via Density Estimation

Yilei Chen (1), Aldo Pacchiano (1,2), Ioannis Ch. Paschalidis (1)

(1) Boston University, Boston, USA. (2) Broad Institute of MIT and Harvard, Cambridge, USA. Correspondence to: Yilei Chen.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

We study the multiple-policy evaluation problem, in which we are given a set of $K$ policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to an accuracy $\epsilon$ with probability at least $1-\delta$. We propose an algorithm named CAESAR for this problem. Our approach is based on computing an approximately optimal sampling distribution and using the data sampled from it to perform the simultaneous estimation of the policy values. CAESAR has two phases. In the first phase, we produce coarse estimates of the visitation distributions of the target policies at a low-order sample complexity rate that scales with $O(1/\epsilon)$. In the second phase, we approximate the optimal sampling distribution and compute the importance-weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE (Nachum et al., 2019) objective. Up to low-order and logarithmic terms, CAESAR achieves a sample complexity of
$$O\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^{H}\max_{k\in[K]}\sum_{s,a}\frac{\left(d^{\pi_k}_h(s,a)\right)^2}{\mu^*_h(s,a)}\right),$$
where $d^{\pi}$ is the visitation distribution of policy $\pi$, $\mu^*$ is the optimal sampling distribution, and $H$ is the horizon.

1. Introduction

This paper delves into the problem of policy evaluation, a fundamental problem in RL (Sutton and Barto, 2018) whose goal is to estimate the expected total reward of a given policy. This process serves as an integral component in various RL methodologies, such as policy iteration and policy gradient approaches (Sutton et al., 1999), wherein the current policy undergoes evaluation followed by potential updates. Policy evaluation is also paramount in scenarios where, prior to deploying a trained policy, thorough evaluation is imperative to ensure its safety and efficacy.

Broadly speaking, there exist two scenarios in which the problem of policy evaluation has been considered, known as the online and offline data regimes. In online scenarios, a learner interacts sequentially with the environment and is tasked with using its online deployments to collect helpful data for policy evaluation. The simplest method for online policy evaluation is Monte-Carlo estimation (Fonteneau et al., 2013): one can collect multiple trajectories by following the target policy and use the empirical mean of the rewards as the estimator. These on-policy methods typically require executing the policy we want to estimate, which may be impractical or dangerous in many cases. For example, in a medical treatment scenario, implementing an untrustworthy policy can cause unfortunate consequences (Thapa et al., 2005). In these cases, offline policy evaluation may be preferable. In the offline case, the learner has access to a batch of data and is tasked with using it in the best way possible to estimate the value of a target policy. Many works focus on this problem leveraging various techniques, such as importance sampling, model-based estimation, and doubly-robust estimators (Yin and Wang, 2020; Jiang and Li, 2016; Yin et al., 2021; Xie et al., 2019; Li et al., 2015).
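For concreteness, the following is a minimal sketch of the Monte-Carlo (on-policy) estimator described above. The `env` interface (`reset`/`step`) and the `policy` array layout are hypothetical stand-ins introduced only for illustration; they are not objects defined in the paper.

```python
import numpy as np

def monte_carlo_value(env, policy, horizon, n_episodes, seed=0):
    """On-policy Monte-Carlo estimate of J(pi): the empirical mean of episode returns.

    Assumes env.reset() -> initial state and env.step(state, action, h) -> (next_state, reward),
    and policy[h][s] gives the action probabilities of pi_h(.|s); both are illustrative interfaces.
    """
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_episodes):
        s, total = env.reset(), 0.0
        for h in range(horizon):
            a = rng.choice(len(policy[h][s]), p=policy[h][s])  # a ~ pi_h(.|s)
            s, r = env.step(s, a, h)
            total += r
        returns.append(total)
    return float(np.mean(returns))
```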
In the context of offline evaluation, the theoretical guarantees depend on the overlap between the offline data distribution and the visitations of the evaluated policy (Xie et al., 2019; Yin and Wang, 2020; Duan et al., 2020). These coverage conditions, which ensure that the data logging distribution (Xie et al., 2022) adequately covers the state space, are typically captured by the ratio between the densities corresponding to the offline data distribution and the policy to evaluate, also known as importance ratios.

Motivated by applications where one has multiple policies to evaluate, e.g., multiple policies trained using different hyperparameters, Dann et al. (2023) considered multiple-policy evaluation, which aims to estimate the performance of a set of $K$ target policies instead of a single one. At first glance, multiple-policy evaluation does not pose challenges beyond single-policy evaluation, since one can always use $K$ instances of single-policy evaluation to evaluate the $K$ policies. However, since the sample complexity of this procedure scales linearly as a function of $K$, it can be extremely sample-inefficient, as it neglects potential similarities among the $K$ target policies. Dann et al. (2023) proposed an online algorithm that leverages the similarity among target policies based on an idea related to trajectory synthesis (Wang et al., 2020). The basic technique is that if more than one policy takes the same action at a certain state, then only one sample is needed at that state, which can be reused to synthesize trajectories for these policies. Their algorithm achieves an instance-dependent sample complexity, which gives much better results when the target policies have substantial overlap, but it requires estimation of generative models as an intermediate step, which can be impractical.

Different from Dann et al. (2023), in this work we tackle the multiple-policy evaluation problem from the offline perspective.[1] In the context of single-policy evaluation, an offline dataset is typically given and assumed to provide good coverage over the state space of the target policy. In our scenario, however, such a dataset does not exist. To overcome this issue, we design our algorithm around the idea of first computing a behavior distribution with enough coverage of the target policy set. Once this distribution is computed, independently and identically distributed (i.i.d.) samples from the behavior distribution can be used to estimate the values of the target policies using ideas inspired by the offline policy optimization literature. Briefly speaking, our algorithm consists of two phases:

1. Build coarse estimators of the policy visitation distributions and use them to compute a mixture policy that achieves a low visitation ratio with respect to all $K$ policies to evaluate.

2. Sample from this approximately optimal mixture policy and use these samples to construct mean reward estimators for all $K$ policies by importance weighting.

Coarse estimation refers to estimating a target value up to constant multiplicative accuracy (see Definition 4.1). We show that coarse estimation of the visitation distributions can be achieved at a cost that scales linearly, instead of quadratically, with the inverse of the accuracy parameter. This ensures that the sample cost in the first phase of our algorithm is of lower order and can therefore be considered negligible (see Section 4.1).
We then show that estimating the policy visitation distributions up to multiplicative accuracy is enough to find an approximately optimal behavior distribution that minimizes the maximum visitation ratio among all policies to estimate (see Section 4.2). In the second phase, the samples generated from this behavior distribution are used to estimate the target policy values via importance weighting. Since the importance weights are not known to sufficient accuracy, we propose the IDES, or Importance Density EStimation, algorithm (see Algorithm 1) for estimating these distribution ratios by minimizing a series of loss functions inspired by the DualDICE (Nachum et al., 2019) method (see Section 4.3). Combining these steps, we arrive at our main algorithm, CAESAR, or Coarse and Adaptive EStimation with Approximate Reweighing (see Algorithm 2), which achieves a high-probability finite sample complexity for the problem of multi-policy evaluation.

[1] We say offline here to emphasize that the final evaluation is conducted in an offline fashion rather than implying there is no interaction with the environment.

We summarize our contributions as follows:

- We propose a novel, sample-efficient algorithm, CAESAR, for the multiple-policy evaluation problem, which achieves a non-asymptotic, instance-dependent sample complexity. CAESAR provides new results and valuable insights to the existing literature while sharing several advantages compared to existing approaches (see Section 5).

- We introduce the technique of coarse estimation and demonstrate its effectiveness in solving the multiple-policy evaluation problem. We believe this technique has potential applications beyond the scope of this work.

- We propose an algorithm, IDES, as a subroutine of CAESAR to accurately estimate the marginal importance ratios by minimizing a carefully designed step-wise loss function. IDES is a non-trivial extension of DualDICE to finite-horizon Markov Decision Processes (MDPs).

- We propose a novel notion termed $\beta$-distance along with an algorithm, MARCH, which achieves coarse estimation of the visitation distribution for all deterministic policies with a sample complexity that scales polynomially in parameters such as the sizes of the state and action spaces, despite the exponentially large number of deterministic policies.

2. Related Work

There is a rich family of off-policy estimators for policy evaluation (Liu et al., 2018; Jiang and Li, 2016; Dai et al., 2020; Feng et al., 2021; Jiang and Huang, 2020), but none of them is effective in our setting. Importance sampling is a simple and popular method for off-policy evaluation, but it suffers variance that is exponential in the horizon (Liu et al., 2018). Marginalized importance sampling has been proposed to remove the exponential variance. However, existing works all focus on function approximation, which only produces approximately correct estimators (Dai et al., 2020), or are designed for the infinite-horizon case (Feng et al., 2021).
Another popular estimator is called model-based estimator, which evaluates the policy by estimating the transition function of the environment (Dann et al., 2019; Zanette and Brunskill, 2019). Yin and Wang (2020) provide a similar sample complexity to our results. However, there are some significant differences between their result and ours. First, our sampling distribution; calculated based on the coarse distribution estimator, is optimal. Second, our sample complexity is non-asymptotic while their result is asymptotic. Third, the true distributions appearing in our sample complexity can be replaced by known distribution estimators without inducing additional costs, that is, we can provide a known sample complexity while their result is always unknown since we do not know the true visitation distributions of target policies. The work that most aligns with ours is Dann et al. (2023), which proposed an on-policy algorithm based on the idea of trajectory synthesis. The authors propose the first instancedependent sample complexity analysis of the multiplepolicy evaluation problem. Different from their work, our algorithm uses off-policy evaluation based on importanceweighting and achieves a better sample complexity with simpler techniques and analysis. Concurrently, Russo and Pacchiano (2025) studied the multiple-policy evaluation problem in discounted settings from the lower bound perspective. Analogous to our two-stage pipeline, Amortila et al. (2024) proposed an exploration objective for downstream reward maximization, similar to our goal of computing an optimal sampling distribution. However, our approximate objective, based on coarse estimation is easier to solve, which is a significant contribution while they need layer-by-layer induction. They also introduced a loss function to estimate ratios, similar to how we estimate the importance densities. However, our ratios are defined differently from theirs which require distinct techniques. Several existing works have utilized estimates of visitation distributions up to multiplicative constant accuracy (Zhang and Zanette, 2023; Li et al., 2023). We formally formulate the technique of coarse estimation and present clean results that are ready to use in general scenarios while the existing results have more additional complex terms and are limited to the specific tasks. Moreover, our coarse estimation results are based on simple concentration inequalities and the multiplicative constant can be any value, leading to more flexible and elegant formulations. Our algorithm also uses some techniques modified from other works which we summarize here. Dual DICE is a technique for estimating distribution ratios by minimizing some loss functions proposed by (Nachum et al., 2019). We build on this idea and make some modifications to meet the need in our setting. Besides, we utilize stochastic gradient descent algorithms and their convergence rate for strongly-convex and smooth functions from the optimization literature (Hazan and Kale, 2011). Finally, we adopt the Median of Means estimator (Minsker, 2023) to convert in-expectation results to high-probability results. 3. Preliminaries Notations We denote the set {1, 2, . . . , N} by [N]. {Xn}N n=1 represents the set {X1, X2, . . . , XN}. Eπ[ ] denotes the expectation over the trajectories induced by policy π while Eµ[ ] represents the expectation over the trajectories sampled from distribution µ. O( ) hides constants, logarithmic and lower-order terms. 
We use $\mathbb{V}[X]$ to represent the variance of random variable $X$. $\Pi_{\mathrm{det}}$ is the set of all deterministic policies and $\mathrm{conv}(X)$ represents the convex hull of the set $X$. If not explicitly specified, $s, a, h, k$ stand for $s \in S$, $a \in A$, $h \in [H]$, $k \in [K]$. Finally, $\mathrm{Median}(\cdot)$ denotes the median of a set of numbers.

Reinforcement learning framework. We consider episodic tabular Markov Decision Processes (MDPs) defined by a tuple $\{S, A, H, \{P_h\}_{h=1}^H, \{r_h\}_{h=1}^H, \nu\}$, where $S$ and $A$ represent the state and action spaces, respectively, with $S$ and $A$ also denoting the respective cardinalities of these sets. $H$ is the horizon, which defines the number of steps the agent can take before the end of an episode. $P_h(\cdot|s,a) \in \Delta_S$ is the transition function, which represents the probability of transitioning to the next state if the agent takes action $a$ at state $s$. And $r_h(s,a)$ is the reward function, accounting for the reward the agent can collect by taking action $a$ at state $s$. In this work, we assume that the reward is deterministic and bounded, $r_h(s,a) \in [0,1]$, which is consistent with prior work (Dann et al., 2023). We denote the initial state distribution by $\nu \in \Delta_S$.

A policy $\pi = \{\pi_h\}_{h=1}^H$ is a mapping from the state space to the probability distribution space over the action space. $\pi_h(a|s)$ denotes the probability of taking action $a$ at state $s$ and step $h$. The value function $V^\pi_h(s)$ of a policy $\pi$ is the expected total reward the agent can receive by starting from step $h$, state $s$, and following the policy $\pi$, i.e., $V^\pi_h(s) = \mathbb{E}_\pi[\sum_{l=h}^H r_l \mid s_h = s]$. The performance $J(\pi)$ of a policy $\pi$ is defined as the expected total reward the agent can obtain. By the definition of the value function, we have $J(\pi) = V^\pi_1(s \mid s \sim \nu)$. For simplicity, in the following context, we use $V^\pi_1$ to denote $V^\pi_1(s \mid s \sim \nu)$.

The state visitation distribution $d^\pi_h(s)$ of a policy $\pi$ represents the probability of reaching state $s$ at step $h$ if the agent starts from a state sampled from the initial state distribution $\nu$ at step $l = 1$ and follows policy $\pi$ subsequently, i.e., $d^\pi_h(s) = \mathbb{P}[s_h = s \mid s_1 \sim \nu, \pi]$. Similarly, the state-action visitation distribution $d^\pi_h(s,a)$ is defined as $d^\pi_h(s,a) = d^\pi_h(s)\,\pi_h(a|s)$. Based on the definition of the visitation distribution, the performance of policy $\pi$ can also be expressed as $J(\pi) = V^\pi_1 = \sum_{h=1}^H \sum_{s,a} d^\pi_h(s,a)\, r_h(s,a)$.

Multiple-policy evaluation problem setup. In multiple-policy evaluation, we are given a set of known policies $\{\pi_k\}_{k=1}^K$ and a pair of parameters $(\epsilon, \delta)$. The objective is to evaluate the performance of these given policies such that, with probability at least $1-\delta$,
$$\forall \pi \in \{\pi_k\}_{k=1}^K, \quad |\hat{V}^\pi_1 - V^\pi_1| \le \epsilon,$$
where $\hat{V}^\pi_1$ is the performance estimator.
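To make the definitions above concrete, here is a minimal numpy sketch of how the visitation distributions and $J(\pi)$ of a policy can be computed for a known tabular MDP. The array layout for `P`, `pi`, `nu`, and `r` is an assumed encoding introduced only for illustration.

```python
import numpy as np

def visitation_distributions(P, pi, nu):
    """Forward recursion for the state-action visitation distributions of a policy.

    Assumed tabular encoding (illustrative):
      P[h]  : array (S, A, S), P[h][s, a, s'] = P_h(s' | s, a)
      pi[h] : array (S, A),    pi[h][s, a]    = pi_h(a | s)
      nu    : array (S,), initial state distribution.
    Returns d with d[h][s, a] = d^pi_{h+1}(s, a) (0-indexed steps).
    """
    H = len(P)
    d_state = nu.copy()                  # d^pi_1(s) = nu(s)
    d = []
    for h in range(H):
        d_sa = d_state[:, None] * pi[h]  # d_h(s, a) = d_h(s) * pi_h(a | s)
        d.append(d_sa)
        # d_{h+1}(s') = sum_{s,a} d_h(s, a) * P_h(s' | s, a)
        d_state = np.einsum("sa,sap->p", d_sa, P[h])
    return d

def policy_value(d, r):
    """J(pi) = sum_h sum_{s,a} d^pi_h(s, a) * r_h(s, a); r[h] has shape (S, A)."""
    return sum((d[h] * r[h]).sum() for h in range(len(d)))
```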
4. Main Results and Algorithm

In this section, we introduce our main algorithm, CAESAR, step-by-step and present the main results. CAESAR is sketched out in Algorithm 2 as well as in Figure 1 for easy reference. We aim to build a single sampling dataset with good coverage over the target policies, with which we can estimate the performance of all target policies simultaneously using importance weighting. To that end, we first coarsely estimate the visitation distributions of the target policies at the cost of a lower-order sample complexity. Based on these coarse distribution estimators, we can build an approximately optimal sampling distribution by solving a convex optimization problem. Finally, we utilize the idea of DualDICE (Nachum et al., 2019), with some modifications, to estimate the importance-weighting ratios.

Figure 1. The scheme of CAESAR. In Phase I, we collect $O(1/\epsilon)$ trajectories for each target policy $\pi_1, \ldots, \pi_K$ and obtain coarse estimators of their visitation distributions $\hat{d}^{\pi_1}, \ldots, \hat{d}^{\pi_K}$. Based on the coarse estimators, we can generate an approximately optimal sampling dataset which has good coverage over the visitations of the target policies. In Phase II, we sample data from the approximately optimal dataset and leverage the coarse estimators from Phase I to perform importance density estimation for each target policy by implementing IDES. With the estimated importance densities $\hat{w}^{\pi_1}, \ldots, \hat{w}^{\pi_K}$, we can output the final performance estimators $\hat{V}^{\pi_1}, \ldots, \hat{V}^{\pi_K}$.

4.1. Coarse estimation of visitation distributions

It is well known that to estimate a quantity up to $\epsilon$-accuracy, approximately $O(1/\epsilon^2)$ samples are needed, e.g., as indicated by Hoeffding's inequality. However, this does not serve our need since it is too expensive in terms of sample complexity: if we could afford to estimate the visitations of the target policies to $\epsilon$-accuracy, the value functions derived from these estimated visitations would already be sufficiently accurate, so nothing would be saved. At the same time, designing a sampling distribution that ensures good coverage of the target policy set requires certain knowledge about the visitations of the target policies. To that end, we introduce the concept of coarse estimators, defined below.

Definition 4.1 (Coarse Estimator). Given an accuracy $\epsilon$, an estimator $\hat{x}$ is a coarse estimator of the true value $x$ if it satisfies $|\hat{x} - x| \le \max\{\epsilon, |x|/c\}$, where $c$ is a constant.

We next show that coarse estimation of the visitation distributions can be achieved by paying a sample complexity of just $O(1/\epsilon)$, which is much cheaper than $O(1/\epsilon^2)$. And we will show in the next section that the coarse estimator provides us enough information to construct an approximately optimal sampling distribution that minimizes the maximum visitation ratio among all policies to estimate. The feasibility of computing these coarse estimators rests on the following lemma, which shows that estimating the mean of a Bernoulli random variable up to constant multiplicative accuracy only requires $O(1/\epsilon)$ samples.

Lemma 4.2. Let $Z_\ell$ be i.i.d. samples $Z_\ell \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p)$. Setting $t \ge \frac{C\log(C/\epsilon\delta)}{\epsilon}$ for some known constant $C > 0$, it follows that with probability at least $1-\delta$, the empirical mean estimator $\hat{p}_t = \frac{1}{t}\sum_{\ell=1}^t Z_\ell$ satisfies
$$|\hat{p}_t - p| \le \max\{\epsilon, p/4\}.$$

Let $\{(s^i_1, a^i_1), (s^i_2, a^i_2), \ldots, (s^i_H, a^i_H)\}$ denote a trajectory collected by following policy $\pi$. Then the random variable $Z_i(h,s,a)$ defined below is a Bernoulli random variable such that $Z_i(h,s,a) \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(d^\pi_h(s,a))$:
$$Z_i(h,s,a) = \begin{cases} 1, & (s^i_h, a^i_h) = (s,a), \\ 0, & \text{otherwise.} \end{cases}$$
We can construct such random variables for each $(h,s,a)$. Together with Lemma 4.2, we are able to coarsely estimate the visitation distributions of a policy by sampling $O(1/\epsilon)$ trajectories.

Proposition 4.3. With a number of trajectories $n \ge \frac{CK\log(CK/\epsilon\delta)}{\epsilon}$, we can compute estimators $\hat{d}^{\pi_k} = \{\hat{d}^{\pi_k}_h\}_{h=1}^H$ such that, with probability at least $1-\delta$,
$$|\hat{d}^{\pi_k}_h(s,a) - d^{\pi_k}_h(s,a)| \le \max\left\{\epsilon, \frac{d^{\pi_k}_h(s,a)}{4}\right\}, \quad \forall s, a, h, k.$$

In Appendix B, we propose an algorithm, MARCH (Multi-policy Approximation via Ratio-based Coarse Handling), which computes coarse estimators of the visitation distributions for all deterministic policies with sample complexity $O\left(\frac{\mathrm{poly}(H,S,A)}{\epsilon}\right)$. This is a significant result since the number of deterministic policies scales exponentially with the size of the state space and the horizon. The result is achieved through a novel analysis based on our proposed notion of $\beta$-distance. Due to space constraints, we refer readers to Appendix B for further details.
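The following is a small sketch of this coarse estimation step (Lemma 4.2 and Proposition 4.3) for a single target policy. The rollout interface `sample_trajectory()` and the constant `C = 32` are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def coarse_visitation_estimate(sample_trajectory, H, S, A, eps, delta, C=32):
    """Coarse estimator of d^pi_h(s, a) in the sense of Definition 4.1.

    sample_trajectory() is assumed to roll out the target policy once and return
    a list [(s_1, a_1), ..., (s_H, a_H)] with integer-coded states and actions.
    Only O(1/eps) trajectories are used, as in Lemma 4.2 / Proposition 4.3.
    """
    n = int(np.ceil(C * np.log(C / (eps * delta)) / eps))  # O(log(1/(eps*delta)) / eps) rollouts
    d_hat = np.zeros((H, S, A))
    for _ in range(n):
        for h, (s, a) in enumerate(sample_trajectory()):
            d_hat[h, s, a] += 1.0          # empirical mean of the Bernoulli Z_i(h, s, a)
    return d_hat / n
```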
Before proceeding to the next section, notice that states and actions with low estimated visitation probabilities can be ignored without inducing significant errors, based on the following lemma.

Lemma 4.4. Suppose we have an estimator $\hat{d}(s,a)$ of $d(s,a)$ such that $|\hat{d}(s,a) - d(s,a)| \le \max\{\epsilon', \frac{d(s,a)}{4}\}$. If $\hat{d}(s,a) \ge 5\epsilon'$, then $\max\{\epsilon', \frac{d(s,a)}{4}\} = \frac{d(s,a)}{4}$, and if $\hat{d}(s,a) \le 5\epsilon'$, then $d(s,a) \le 7\epsilon'$.

Lemma 4.4 tells us that if we set $\epsilon' = \frac{\epsilon}{14SA}$, the error induced by ignoring those state-action pairs which satisfy $\hat{d}(s,a) \le 5\epsilon'$ is at most $\frac{\epsilon}{2}$. For simplicity of presentation, we can set $\hat{d}^\pi_h(s,a) = d^\pi_h(s,a) = 0$ if $\hat{d}^\pi_h(s,a) < \frac{5\epsilon}{14SA}$. Finally, we have coarse estimators of the visitations of the target policies such that
$$|\hat{d}^{\pi_k}_h(s,a) - d^{\pi_k}_h(s,a)| \le \frac{d^{\pi_k}_h(s,a)}{4}, \quad \forall s, a, h, k. \tag{1}$$

4.2. Approximately optimal sampling distribution

In this section, we validate the claim from the previous section that the coarse estimators provide sufficient information (i.e., are accurate enough) to construct an approximately optimal sampling distribution that minimizes the maximum visitation ratio among all policies to estimate. Suppose that we have a dataset with distribution $\{\mu_h\}_{h=1}^H$, from which we can sample trajectories $\{s^i_1, a^i_1, s^i_2, a^i_2, \ldots, s^i_H, a^i_H\}_{i=1}^n$; then we can evaluate the expected total rewards of the target policies by importance weighting. Specifically,
$$\hat{V}^{\pi_k}_1 = \frac{1}{n}\sum_{i=1}^n X^{\pi_k}_i, \quad k \in [K], \tag{2}$$
where $X^{\pi_k}_i = \sum_{h=1}^H \frac{d^{\pi_k}_h(s^i_h, a^i_h)}{\mu_h(s^i_h, a^i_h)}\, r_h(s^i_h, a^i_h)$ is the importance-weighted total reward of trajectory $i$. Note that the above estimators rely on an unknown quantity, $d^{\pi_k}_h(s,a)$, i.e., the true visitation distribution. In the next section, we will show how to accurately estimate these importance-weighting ratios.

It can be shown that the variance of the value function estimator $X^{\pi_k}_i$ is bounded by (see Appendix A.2)
$$\mathbb{V}_\mu[X^{\pi_k}_i] \le H \sum_{h=1}^H \mathbb{E}_{(s_h,a_h)\sim d^{\pi_k}_h}\left[\frac{d^{\pi_k}_h(s_h, a_h)}{\mu_h(s_h, a_h)}\right]. \tag{3}$$

We aim to identify the optimal sampling distribution by minimizing the variance (3) of the value function estimator across all target policies, resulting in the following convex optimization problem:
$$\mu^*_h = \arg\min_{\mu_h \in \mathrm{conv}(D_h)} \max_{k\in[K]} \sum_{s,a} \frac{(d^{\pi_k}_h(s,a))^2}{\mu_h(s,a)}, \quad h \in [H]. \tag{4}$$
We constrain $\mu_h$ to the convex hull of $D_h = \{d^{\pi_k}_h : k \in [K]\}$, since in some cases the optimal $\mu$ may not be realized by any policy (see Appendix A.3). One can also set $D_h = \{d^\pi_h : \pi \in \Pi_{\mathrm{det}}\}$, which allows any feasible distribution, ensuring a globally optimal sampling distribution at the cost of more computation towards solving the optimization problem. We denote the optimal solution to (4) by $\mu^*_h = \sum_{k=1}^K \alpha^*_k d^{\pi_k}_h$.

It is impossible to solve (4) directly since the true visitations $d^{\pi_k}_h$ are unknown to us. Therefore, we utilize the coarse estimators obtained in the last section to replace these unknown distributions, which leads to the following approximate optimization problem:
$$\hat{\mu}^*_h = \arg\min_{\mu_h \in \mathrm{conv}(\hat{D}_h)} \max_{k\in[K]} \sum_{s,a} \frac{(\hat{d}^{\pi_k}_h(s,a))^2}{\mu_h(s,a)}, \quad h \in [H], \tag{5}$$
where $\hat{D}_h = \{\hat{d}^{\pi_k}_h : k \in [K]\}$. We denote the optimal solution to (5) by $\hat{\mu}^*_h = \sum_{k=1}^K \hat{\alpha}^*_k \hat{d}^{\pi_k}_h$. Consequently, our true sampling distribution from which trajectories are sampled is $\bar{\mu}^*_h = \sum_{k=1}^K \hat{\alpha}^*_k d^{\pi_k}_h$. And based on (1), we also have the relationship
$$|\bar{\mu}^*_h(s,a) - \hat{\mu}^*_h(s,a)| \le \frac{\bar{\mu}^*_h(s,a)}{4}. \tag{6}$$
Finally, we conclude this section by showing that $\bar{\mu}^*$ is approximately optimal.

Lemma 4.5. The sampling distribution $\bar{\mu}^*$ is approximately optimal in the sense that
$$\max_{k\in[K]} \sum_{s,a} \frac{(d^{\pi_k}_h(s,a))^2}{\bar{\mu}^*_h(s,a)} \le C \max_{k\in[K]} \sum_{s,a} \frac{(d^{\pi_k}_h(s,a))^2}{\mu^*_h(s,a)},$$
where $C$ is a constant and $\mu^*$ is the optimal solution to (4).
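As an illustration, one simple way to approximately solve the per-step problem (5) is projected subgradient descent over the mixture weights $\alpha$ on the probability simplex. The sketch below is one such solver under assumed inputs and step sizes; it is not the specific solver prescribed by the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def approx_optimal_mixture(d_hat, n_iters=2000, lr=0.05):
    """Projected-subgradient solver for the per-step problem (5).

    d_hat: array (K, S, A) of coarse estimators d_hat^{pi_k}_h for one fixed step h.
    Returns mixture weights alpha over the K target policies, so that
    mu_hat = sum_k alpha_k * d_hat_k plays the role of the approximate solution.
    """
    K = d_hat.shape[0]
    flat = d_hat.reshape(K, -1)
    alpha = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        mu = alpha @ flat                              # mu(s, a), flattened
        mask = mu > 0                                  # skip pairs no target policy visits
        ratios = (flat[:, mask] ** 2 / mu[mask]).sum(axis=1)
        k_star = np.argmax(ratios)                     # policy attaining the max
        # subgradient of max_k sum_{s,a} d_k^2 / mu with respect to alpha_j
        grad = -((flat[k_star, mask] ** 2 / mu[mask] ** 2) * flat[:, mask]).sum(axis=1)
        alpha = project_simplex(alpha - lr * grad)
    return alpha
```

The returned weights play the role of $\hat{\alpha}^*$; the actual sampling distribution is then obtained by executing the target policies mixed with these weights.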
4.3. Estimation of importance-weighting ratios

In the previous section, we constructed a sampling dataset with distribution $\bar{\mu}^*$, from which we can draw trajectories. The remaining challenge is that estimating the value function using (2) requires knowledge of the importance-weighting ratios. To address this, we introduce an algorithm, IDES, outlined in Algorithm 1, for estimating these ratios.

IDES is inspired by the idea of DualDICE (Nachum et al., 2019). DualDICE proposes the loss function
$$\ell^\pi(w) = \frac{1}{2}\,\mathbb{E}_{s,a\sim\mu}\left[w^2(s,a)\right] - \mathbb{E}_{s,a\sim d^\pi}\left[w(s,a)\right]. \tag{7}$$
The minimum of $\ell^\pi(\cdot)$ is achieved at $w^{\pi,*}(s,a) = \frac{d^\pi(s,a)}{\mu(s,a)}$, the importance-weighting ratio. By applying a variable transformation based on Bellman's equation, they derive a final loss function that eliminates the need for on-policy samples in infinite-horizon MDPs. The importance-weighting ratios are then obtained by optimizing this loss function. We extend this approach to finite-horizon MDPs by proposing a step-wise loss function, which we define below.

It is important to emphasize that IDES is not a trivial extension of DualDICE. While IDES adopts the quadratic loss function from DualDICE, it applies fundamentally different techniques and analyses. First, IDES leverages coarse distribution estimators to address the on-policy limitation of the second term in (7), whereas DualDICE relies on a variable transformation based on Bellman's equation, which is only applicable to infinite-horizon MDPs. Second, IDES employs a step-wise objective function, requiring sequential step-to-step optimization and analysis, whereas DualDICE formulates a single loss function. Third, although both IDES and DualDICE achieve a sample complexity of $O(C/\epsilon^2)$, the value of $C$ in DualDICE's bound is not sufficiently tight for our purposes, which involve deriving instance-dependent guarantees. In contrast, we offer a precise analysis linking the value of $C$ in IDES to the expected visitation ratios. Lastly, IDES provides high-probability results for visitation ratio estimation, whereas DualDICE's results hold only in expectation.

Algorithm 1: Importance Density Estimation (IDES)

Input: Horizon $H$, accuracy $\epsilon$, target policy $\pi$, coarse estimators $\{\hat{d}^\pi_h\}_{h=1}^H$ and $\{\hat{\mu}_h\}_{h=1}^H$, and sampling distribution $\bar{\mu}$.
1. Define feasible sets $\{D_h\}_{h=1}^H$ where $D_h(s,a) = [0, 2\hat{d}^\pi_h(s,a)]$.
2. Initialize $w^0_h = 0$ for $h = 1, \ldots, H$, and set $\bar{\mu}_0(s_0,a_0) = 1$, $P_0(s|s_0,a_0) = \nu(s)$, $\hat{w}_0 = \hat{\mu}_0 = 1$.
3. For $h = 1$ to $H$:
   (a) Set the iteration number of the optimization, $n_h = \frac{C' H^4}{\epsilon^2}\sum_{s,a}\left[\frac{(\hat{d}^\pi_h(s,a))^2}{\hat{\mu}_h(s,a)} + \frac{(\hat{d}^\pi_{h-1}(s,a))^2}{\hat{\mu}_{h-1}(s,a)}\right]$, where $C'$ is a known constant (see Appendix A.6).
   (b) For $i = 1$ to $n_h$:
       i. Sample $(s^i_h, a^i_h)$ from $\bar{\mu}_h$ and a transition $(\tilde{s}^i_{h-1}, \tilde{a}^i_{h-1}, \tilde{s}^i_h)$ from $\bar{\mu}_{h-1}$.
       ii. Calculate the stochastic gradient $g(w^{i-1}_h)$,
       $$g(w^{i-1}_h)(s,a) = \frac{w^{i-1}_h(s,a)}{\hat{\mu}_h(s,a)}\,\mathbb{I}(s^i_h = s, a^i_h = a) - \frac{\hat{w}_{h-1}(\tilde{s}^i_{h-1}, \tilde{a}^i_{h-1})}{\hat{\mu}_{h-1}(\tilde{s}^i_{h-1}, \tilde{a}^i_{h-1})}\,\pi(a|s)\,\mathbb{I}(\tilde{s}^i_h = s).$$
       iii. Update $w^i_h = \mathrm{Proj}_{w \in D_h}\{w^{i-1}_h - \eta^i_h\, g(w^{i-1}_h)\}$.
   (c) Output the estimator $\hat{w}_h = \frac{1}{\sum_{i=1}^{n_h} i}\sum_{i=1}^{n_h} i\, w^i_h$.
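As a companion to Algorithm 1, the following sketches its inner loop for a single step $h$ as plain projected SGD. The sampling callbacks, the array layout, and the $1/i$ step size are illustrative assumptions; the paper leaves the step-size schedule to the strongly convex SGD routine it invokes (Hazan and Kale, 2011).

```python
import numpy as np

def ides_step(sample_h, sample_prev, pi_h, d_hat_h, mu_hat_h, w_prev_over_mu, n_iters):
    """One step-h pass of IDES (Algorithm 1): projected SGD on the step-wise loss.

    Assumed (illustrative) interfaces:
      sample_h()    -> (s, a)            drawn from the sampling distribution at step h
      sample_prev() -> (s', a', s_next)  a transition drawn from the distribution at step h-1
      pi_h          : (S, A) array, pi_h[s, a] = pi_h(a | s)
      d_hat_h, mu_hat_h : (S, A) coarse estimators at step h
      w_prev_over_mu : (S, A) array holding w_hat_{h-1}(s, a) / mu_hat_{h-1}(s, a)
    Returns the weighted-average iterate w_hat_h.
    """
    S, A = d_hat_h.shape
    w = np.zeros((S, A))
    w_avg, weight_sum = np.zeros((S, A)), 0.0
    cap = 2.0 * d_hat_h                      # feasible set D_h(s, a) = [0, 2 * d_hat_h(s, a)]
    for i in range(1, n_iters + 1):
        s, a = sample_h()
        sp, ap, s_next = sample_prev()
        g = np.zeros((S, A))
        g[s, a] += w[s, a] / max(mu_hat_h[s, a], 1e-12)           # first term of the gradient
        g[s_next, :] -= w_prev_over_mu[sp, ap] * pi_h[s_next, :]  # second (cross-step) term
        w = np.clip(w - (1.0 / i) * g, 0.0, cap)                  # projected SGD update
        w_avg, weight_sum = w_avg + i * w, weight_sum + i         # weighted iterate averaging
    return w_avg / weight_sum
```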
Our step-wise loss function of a policy $\pi$ at each step $h$ is defined as
$$\ell^\pi_h(w) = \frac{1}{2}\,\mathbb{E}_{(s,a)\sim\bar{\mu}^*_h}\left[\frac{w^2(s,a)}{\hat{\mu}^*_h(s,a)}\right] - \mathbb{E}_{\substack{(s',a')\sim\bar{\mu}^*_{h-1} \\ s\sim P_{h-1}(\cdot|s',a')}}\left[\frac{\hat{w}_{h-1}(s',a')}{\hat{\mu}^*_{h-1}(s',a')}\sum_a w(s,a)\,\pi(a|s)\right],$$
where $\bar{\mu}^*_h = \sum_{k=1}^K \hat{\alpha}^*_k d^{\pi_k}_h$ is the sampling distribution and $\hat{\mu}^*_h = \sum_{k=1}^K \hat{\alpha}^*_k \hat{d}^{\pi_k}_h$ is the optimal solution to the approximate optimization problem (5) (see Section 4.2). For notational simplicity, we set $\bar{\mu}^*_0(s_0,a_0) = 1$, $P_0(s|s_0,a_0) = \nu(s)$, $\hat{w}_0 = \hat{\mu}^*_0 = 1$.

Although it may appear complex at first glance, this loss possesses favorable properties that allow us to derive the importance-weighting ratios iteratively, as formalized in the following two lemmas.

Lemma 4.6. Suppose we have an estimator $\hat{w}_{h-1}$ at step $h-1$ such that
$$\sum_{s,a}\left|\bar{\mu}^*_{h-1}(s,a)\frac{\hat{w}_{h-1}(s,a)}{\hat{\mu}^*_{h-1}(s,a)} - d^\pi_{h-1}(s,a)\right| \le \epsilon;$$
then by minimizing the loss function $\ell^\pi_h(w)$ at step $h$ until $\|\nabla\ell^\pi_h(\hat{w}_h)\|_1 \le \epsilon$, we have
$$\sum_{s,a}\left|\bar{\mu}^*_h(s,a)\frac{\hat{w}_h(s,a)}{\hat{\mu}^*_h(s,a)} - d^\pi_h(s,a)\right| \le 2\epsilon.$$

Lemma 4.6 shows that the importance-weighting ratio estimator from the previous step enables the estimation of the ratio at the current step, introducing only an additive error. Consequently, by optimizing iteratively, we can accurately estimate the importance-weighting ratios at all steps, as formalized in the following lemma.

Lemma 4.7. Implementing Algorithm 1 with $\pi_k$, we obtain importance density estimators $\frac{\hat{w}^{\pi_k}_h(s,a)}{\hat{\mu}^*_h(s,a)}$ such that
$$\mathbb{E}\left[\sum_{s,a}\left|\bar{\mu}^*_h(s,a)\frac{\hat{w}^{\pi_k}_h(s,a)}{\hat{\mu}^*_h(s,a)} - d^{\pi_k}_h(s,a)\right|\right] \le \frac{\epsilon}{4H}, \quad \forall h \in [H]. \tag{8}$$

The above result holds in expectation. To obtain a high-probability guarantee, we introduce a Median-of-Means (MoM) estimator (Minsker, 2023), formalized in the following lemma, along with a data-splitting technique. Together, these methods enable us to transform (8) into a high-probability result (see Appendix A.8).

Lemma 4.8. Let $x \in \mathbb{R}$ and suppose we have a stochastic estimator $\hat{x}$ such that $\mathbb{E}[|\hat{x} - x|] \le \frac{\epsilon}{4}$. Then, if we generate $N = O(\log(1/\delta))$ i.i.d. estimators $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_N\}$ and choose $\hat{x}_{\mathrm{MoM}} = \mathrm{Median}(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_N)$, we have with probability at least $1-\delta$ that $|\hat{x}_{\mathrm{MoM}} - x| \le \epsilon$.

Another noteworthy property of our loss function is that it is $\gamma$-strongly convex and $\xi$-smooth, where $\gamma = \min_{s,a}\frac{\bar{\mu}^*_h(s,a)}{\hat{\mu}^*_h(s,a)}$ and $\xi = \max_{s,a}\frac{\bar{\mu}^*_h(s,a)}{\hat{\mu}^*_h(s,a)}$. From property (6), it follows that $\gamma$ and $\xi$ are bounded from both sides, as is their ratio $\frac{\xi}{\gamma}$. This property plays a crucial role in analyzing the sample complexity of minimizing the loss function via stochastic gradient descent, which we discuss in Appendix A.6.

Algorithm 2: Coarse and Adaptive EStimation with Approximate Reweighing for Multi-Policy Evaluation (CAESAR)

Input: Accuracy $\epsilon$, confidence $\delta$, target policies $\{\pi_k\}_{k=1}^K$.
1. Coarsely estimate the visitation distributions of all target policies and obtain $\{\hat{d}^{\pi_k} : k \in [K]\}$.
2. Solve the approximate optimization problem (5) and construct the dataset $\bar{\mu}^*$.
3. Implement IDES (Algorithm 1) to obtain the importance-weighting ratios $\{\hat{w}^{\pi_k}\}_{k=1}^K$.
4. Build the final performance estimators $\{\hat{V}^{\pi_k}_1\}_{k=1}^K$ by (9).
Output: $\{\hat{V}^{\pi_k}_1\}_{k=1}^K$.

4.4. Main results

We are now ready to present our main sample complexity result for the multiple-policy evaluation problem, building on the findings from the previous sections. Using the importance-weighting ratio estimator $\frac{\hat{w}^{\pi_k}_h(s,a)}{\hat{\mu}^*_h(s,a)}$, we can evaluate the performance of the target policy $\pi_k$ by
$$\hat{V}^{\pi_k}_1 = \frac{1}{n}\sum_{i=1}^n\sum_{h=1}^H \frac{\hat{w}^{\pi_k}_h(s^i_h, a^i_h)}{\hat{\mu}^*_h(s^i_h, a^i_h)}\, r_h(s^i_h, a^i_h), \tag{9}$$
where $\{s^i_h, a^i_h\}_{i=1}^n$ are drawn from our approximately optimal sampling distribution $\bar{\mu}^*_h$.
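To make (9) and the Median-of-Means step of Lemma 4.8 concrete, here is a minimal sketch. The repeat count `8 * log(1/delta)` and the data layout are illustrative choices; the paper only specifies $N = O(\log(1/\delta))$ repetitions together with a data-splitting scheme.

```python
import numpy as np

def median_of_means(run_estimator, delta, n_repeats=None):
    """Median-of-Means wrapper in the spirit of Lemma 4.8.

    run_estimator() is assumed to return one independent estimate x_hat with
    E|x_hat - x| <= eps/4; repeating O(log(1/delta)) times and taking the median
    gives |x_hat_MoM - x| <= eps with probability at least 1 - delta.
    """
    if n_repeats is None:
        n_repeats = int(np.ceil(8 * np.log(1.0 / delta)))  # illustrative constant
    return float(np.median([run_estimator() for _ in range(n_repeats)]))

def value_estimate(trajectories, w_hat_over_mu, r):
    """Final estimator (9): importance-weighted average of rewards.

    trajectories: list of [(s_1, a_1), ..., (s_H, a_H)] drawn from the sampling distribution;
    w_hat_over_mu[h][s, a] holds w_hat^{pi_k}_h(s, a) / mu_hat_h(s, a); r[h][s, a] is the reward.
    """
    total = 0.0
    for traj in trajectories:
        for h, (s, a) in enumerate(traj):
            total += w_hat_over_mu[h][s, a] * r[h][s, a]
    return total / len(trajectories)
```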
The following theorem presents our main result on the sample complexity of CAESAR; we leave the detailed derivation to Appendix A.8.

Theorem 4.9. Implement Algorithm 2. Then, with probability at least $1-\delta$, for all target policies we have $|\hat{V}^{\pi_k}_1 - V^{\pi_k}_1| \le \epsilon$, and the total number of trajectories sampled is
$$O\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H \max_{k\in[K]}\sum_{s,a}\frac{(d^{\pi_k}_h(s,a))^2}{\mu^*_h(s,a)}\right),$$
where $\mu^*_h$ is the optimal solution to (4) (refer to Section 4.2). Furthermore, the above bound still holds if the unknown true visitation distributions $\{\{d^{\pi_k}\}_{k=1}^K, \mu^*\}$ are replaced by the coarse estimators $\{\{\hat{d}^{\pi_k}\}_{k=1}^K, \hat{\mu}^*\}$, which provides a concrete, computable value of the sample complexity.

We provide an instance-dependent result that characterizes the complexity of the multiple-policy evaluation problem based on the overlap of the visitation distributions of the target policies, aligning with intuition. In the special case where all target policies are identical, i.e., single-policy evaluation, our sample complexity is $O\left(\frac{\mathrm{poly}(H)}{\epsilon^2}\right)$, independent of $S$ and $A$, which is consistent with the result of Monte Carlo estimation. Furthermore, if the target policies are deterministic, we can establish an instance-independent upper bound on our sample complexity, as formalized in the following corollary.

Corollary 4.10. If the target policies to be evaluated are deterministic, the sample complexity of CAESAR is always bounded by $O\left(\frac{\mathrm{poly}(H)\, S^2 A}{\epsilon^2}\right)$.

Corollary 4.10 tells us that with at most $O\left(\frac{\mathrm{poly}(H)\, S^2 A}{\epsilon^2}\right)$ trajectories, by implementing CAESAR, we can evaluate all deterministic policies up to $\epsilon$ accuracy under any reward, which means we can identify an $\epsilon$-optimal policy for any reward. This is consistent with the result in Jin et al. (2020), which proposes a reward-free exploration algorithm with the same sample complexity of $O\left(\frac{\mathrm{poly}(H)\, S^2 A}{\epsilon^2}\right)$.

5. Discussions

In this section, we first compare our results with the existing work (Dann et al., 2023). Additionally, we explore the application of CAESAR to policy identification beyond policy evaluation.

5.1. Comparison with existing result

Dann et al. (2023) propose an algorithm for multiple-policy evaluation based on the idea of trajectory stitching and achieve an instance-dependent sample complexity (their bound, referred to as (11) below), stated in terms of $d^{\max}(s) = \max_{k\in[K]} d^{\pi_k}(s)$ and sets $K_h \subseteq S \times A$ that keep track of which state-action pairs at step $h$ are visited by the target policies in their trajectories. A significant issue with the result by Dann et al. (2023) is the presence of the unfavorable $\frac{1}{d^{\max}(s)}$ term, which can induce an undesirable dependency on $K$. To illustrate this, consider an example of an MDP with two layers: a single initial state $s_{1,1}$ in the first layer and two terminal states $s_{2,1}, s_{2,2}$ in the second layer. The transition function is the same for all actions, i.e., $P(s_{2,1}|s_{1,1}, a) = p$, and $p$ is sufficiently small. Agents only receive rewards at state $s_{2,1}$, regardless of the actions they take. Hence, to evaluate the performance of a policy under this MDP, it is sufficient to consider only the second layer. Now, suppose we have $K$ target policies to evaluate, where each policy takes a different action at state $s_{1,1}$ but the same action at any state in the second layer. Since the transition function at state $s_{1,1}$ is the same for any action, the visitation distributions of all target policies at state $s_{2,1}$ are identical. Given that $p$ is sufficiently small, the probability of $s_{2,1}$ being covered is $\mathbb{P}[s_{2,1} \in K_2] = 1 - (1-p)^K \approx pK$. According to the result (11) by Dann et al. (2023), the sample complexity in this scenario is $O\left(\frac{K}{\epsilon^2}\right)$, which depends on $K$.
In contrast, since the visitation distributions of all target policies at the second layer are identical, our result provides a sample complexity of $O\left(\frac{1}{\epsilon^2}\right)$ without any dependency on $K$.

Beyond sample complexity, our work tackles the problem from a different perspective, which complements the existing results. Our algorithm first constructs an approximately optimal dataset and then uses it to perform offline evaluation. In other words, we extend the offline evaluation framework to the multiple-policy setting. In contrast, Dann et al. (2023) evaluate policies in an online and on-policy manner.

5.2. Near-optimal policy identification

Besides policy evaluation, CAESAR can also be applied to identify a near-optimal policy. Fixing the high-probability factor, we denote the sample complexity of CAESAR by $O\left(\frac{\Theta(\Pi)}{\gamma^2}\right)$, where $\Pi$ is the set of policies to be evaluated and $\gamma$ is the estimation error. We provide a simple algorithm based on CAESAR in Appendix D that achieves an instance-dependent sample complexity $O\left(\max_{\gamma \ge \epsilon} \frac{\Theta(\Pi_\gamma)}{\gamma^2}\right)$ for identifying an $\epsilon$-optimal policy, where $\Pi_\gamma = \{\pi : V^*_1 - V^\pi_1 \le 8\gamma\}$. This result is interesting as it offers a different perspective beyond the existing gap-dependent results (Simchowitz and Jamieson, 2019; Dann et al., 2021). Furthermore, this result can be easily extended to the multi-reward setting. Due to space constraints, we leave the detailed discussion to Appendix D.

6. Conclusion and Future Work

In this work, we consider the problem of multi-policy evaluation. We propose an algorithm, CAESAR, based on computing an approximately optimal sampling dataset and using the data sampled from it to perform the simultaneous estimation of the policy values. The algorithm consists of three techniques. First, we obtain coarse distribution estimators at a lower-order sample cost. Second, based on the coarse estimators, we obtain an approximately optimal sampling dataset. Lastly, we propose a step-wise loss function to estimate the importance-weighting ratios.

Beyond the results of this work, there are still some open questions of interest. First, our sample complexity has a dependency on $H^4$, which is induced by the error propagation in the estimation of the importance-weighting ratios. We conjecture that a dependency on $H^2$ is possible by considering a comprehensive loss function instead of step-wise loss functions. Second, considering a reward-dependent and variance-aware sample complexity is also an interesting direction. Third, it is still a challenging problem to derive the lower bound for multiple-policy evaluation. Finally, we are interested to see what other uses the research community may find for coarse distribution estimation.

Acknowledgements

This work was supported in part by funding from the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, the NSF under grants CCF-2200052 and IIS-1914792, the ONR under grants N00014-19-1-2571 and N00014-21-1-2844, the NIH under grant 1UL1TR001430, the DOE under grant DE-AC02-05CH11231, the ARPA-E under grant DE-AR0001282, and the Boston University Kilachand Fund for Integrated Life Science and Engineering.

Impact Statement

This work aims to advance the field of Machine Learning by providing new insights and methodologies for multiple-policy evaluation in reinforcement learning. Our contributions enhance the theoretical understanding of policy evaluation.
While our research has broad implications, we do not identify any specific societal consequences that require particular emphasis at this time.

References

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638-1646. PMLR, 2014.

Philip Amortila, Dylan J. Foster, and Akshay Krishnamurthy. Scalable online exploration via coverability. arXiv preprint arXiv:2403.06571, 2024.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19-26. JMLR Workshop and Conference Proceedings, 2011.

Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, and Dale Schuurmans. CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33:9398-9411, 2020.

Chris Dann, Mohammad Ghavamzadeh, and Teodor V. Marinov. Multiple-policy high-confidence policy evaluation. In International Conference on Artificial Intelligence and Statistics, pages 9470-9487. PMLR, 2023.

Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507-1516. PMLR, 2019.

Christoph Dann, Teodor Vanislavov Marinov, Mehryar Mohri, and Julian Zimmert. Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 34:1-12, 2021.

Yaqi Duan, Zeyu Jia, and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701-2709. PMLR, 2020.

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447-1456. PMLR, 2018.

Yihao Feng, Ziyang Tang, Na Zhang, and Qiang Liu. Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. arXiv preprint arXiv:2103.05741, 2021.

Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208:383-416, 2013.

David A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100-118, 1975.

Josiah P. Hanna, Peter Stone, and Scott Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 538-546, 2017.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421-436. JMLR Workshop and Conference Proceedings, 2011.

Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33:2747-2758, 2020.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652-661. PMLR, 2016.

Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu.
Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870-4879. PMLR, 2020.

Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, and Yuxin Chen. Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning. Advances in Neural Information Processing Systems, 36:55582-55615, 2023.

Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608-616. PMLR, 2015.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.

Stanislav Minsker. Efficient median of means estimator. In The Thirty Sixth Annual Conference on Learning Theory, pages 5925-5933. PMLR, 2023.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019.

Alessio Russo and Aldo Pacchiano. Adaptive exploration for multi-reward multi-policy evaluation. arXiv preprint arXiv:2502.02516, 2025.

Max Simchowitz and Kevin G. Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. Advances in Neural Information Processing Systems, 32, 2019.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.

Devinder Thapa, In-Sung Jung, and Gi-Nam Wang. Agent based decision support system using reinforcement learning under emergency circumstances. In Advances in Natural Computation: First International Conference, ICNC 2005, Changsha, China, August 27-29, 2005, Proceedings, Part I, pages 888-892. Springer, 2005.

Ruosong Wang, Simon S. Du, Lin F. Yang, and Sham M. Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Advances in Neural Information Processing Systems, 32, 2019.

Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, and Sham M. Kakade. The role of coverage in online reinforcement learning. In The Eleventh International Conference on Learning Representations, 2022.

Ming Yin and Yu-Xiang Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948-3958. PMLR, 2020.

Ming Yin, Yu Bai, and Yu-Xiang Wang. Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1567-1575. PMLR, 2021.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304-7312. PMLR, 2019.

Ruiqi Zhang and Andrea Zanette. Policy finetuning in reinforcement learning via design of experiments using offline data. Advances in Neural Information Processing Systems, 36:59953-59995, 2023.

A. Proof of theorems and lemmas in Section 4

A.1.
Proof of Lemma 4.2 Our results relies on the following variant of Bernstein inequality for martingales, or Freedman s inequality (Freedman, 1975), as stated in e.g., (Agarwal et al., 2014; Beygelzimer et al., 2011). Lemma A.1 (Simplified Freedman s inequality). Let X1, ..., XT be a bounded martingale difference sequence with |Xℓ| R. For any δ (0, 1), and η (0, 1/R), with probability at least 1 δ , ℓ=1 Eℓ[X2 ℓ] + log(1/δ ) where Eℓ[ ] is the conditional expectation2 induced by conditioning on X1, , Xℓ 1. Lemma A.2 (Anytime Freedman). Let {Xt} t=1 be a bounded martingale difference sequence with |Xt| R for all t N. For any δ (0, 1), and η (0, 1/R), there exists a universal constant C > 0 such that for all t N simultaneously with probability at least 1 δ , ℓ=1 Eℓ[X2 ℓ] + C log(t/δ ) where Eℓ[ ] is the conditional expectation induced by conditioning on X1, , Xℓ 1. Proof. This result follows from Lemma A.1. Fix a time-index t and define δt = δ 12t2 . Lemma A.1 implies that with probability at least 1 δt, ℓ=1 Eℓ X2 ℓ + log(1/δt) A union bound implies that with probability at least 1 Pt ℓ=1 δt 1 δ , ℓ=1 Eℓ X2 ℓ + log(12t2/δ ) ℓ=1 Eℓ X2 ℓ + C log(t/δ ) holds for all t N. Inequality (i) holds because log(12t2/δ ) = O (log(tδ )). Proposition A.3. Let δ (0, 1), β (0, 1] and Z1, , ZT be an adapted sequence satisfying 0 Zℓ B for all ℓ N. There is a universal constant C > 0 such that, t=1 Et[Zt] 2 BC log(T/δ ) ℓ=1 Zℓ (1 + β) t=1 Et[Zt] + 2 BC log(T/δ ) with probability at least 1 2δ simultaneously for all T N. Proof. Consider the martingale difference sequence Xt = Zt Et[Zt]. Notice that |Xt| B. Using the inequality of 2We will use this notation to denote conditional expectations throughout this work. Multiple-policy Evaluation via Density Estimation Lemma A.2 we obtain that for all η (0, 1/B2) ℓ=1 Eℓ[X2 ℓ] + C log(t/δ ) (i) 2ηB2 t X ℓ=1 Eℓ[Zℓ] + C log(t/δ ) for all t N with probability at least 1 δ . Inequality (i) holds because Et[X2 t ] B2E[|Xt|] 2B2Et[Zt] for all t N. Setting η = β 2B2 and substituting Pt ℓ=1 Xℓ= Pt ℓ=1 Zℓ Eℓ[Zℓ], ℓ=1 Zℓ (1 + β) ℓ=1 Eℓ[Zℓ] + 2B2C log(t/δ ) with probability at least 1 δ . Now consider the martingale difference sequence X t = E[Zt] Zt and notice that |X t| B2. Using the inequality of Lemma A.2 we obtain for all η (0, 1/B2), ℓ=1 Eℓ[(X ℓ)2] + C log(t/δ ) ℓ=1 Eℓ[Zℓ] + C log(t/δ ) Setting η = β 2B2 and substituting Pt ℓ=1 X ℓ= Pt ℓ=1 E[Zℓ] Zℓwe have, ℓ=1 Zℓ+ 2B2C log(t/δ ) with probability at least 1 δ . Combining Equations 14 and 15 and using a union bound yields the desired result. Let the Zℓbe i.i.d. samples Zℓ i.i.d. Ber(p). The empirical mean estimator, bpt = 1 t Pt ℓ=1 Zℓsatisfies, (1 β)p 2C log(t/δ ) βt bpt (1 + β)p + 2C log(t/δ ) with probability at least 1 2δ for all t N where C > 0 is a (known) universal constant. Given ϵ > 0 set t 8C log(t/δ ) βϵ (notice the dependence of t on the RHS - this can be achieved by setting t C log(C/βϵδ ) βϵ for some (known) universal constant C > 0). In this case observe that, (1 β)p ϵ/8 bpt (1 + β)p + ϵ/8. Setting β = 1/8, 7p/8 ϵ/8 bpt 9p/8 + ϵ/8, so that, p bpt p/8 + ϵ/8, and bpt p p/8 + ϵ/8, which implies |bpt p| p/8 + ϵ/8 2 max(p/8, ϵ/8) = max(p/4, ϵ/4). Multiple-policy Evaluation via Density Estimation A.2. Derivation of the optimal sampling distribution Our performance estimator is, ˆV πk 1 = 1 dπk h (si h, ai h) µh(si h, ai h) r(si h, ai h), k [K]. Denote PH h=1 dπk h (si h,ai h) µh(si h,ai h) rh(si h, ai h) by Xi. 
And for simplicity, denote E(s1,a1) µ1,...,(s H,a H) µH by Eµ, the variance of our estimator is bounded by, Eµ[X2 i ] = Eµ dπk h (si h, ai h) µh(si h, ai h) rh(si h, ai h) dπk h (si h, ai h) µh(si h, ai h) rh(si h, ai h) dπk h (si h, ai h) µh(si h, ai h) " dπk h (si h, ai h) µh(si h, ai h) The first inequality holds by Cauchy Schwarz inequality. The second inequality holds due to the assumption rh(s, a) [0, 1]. Denote PH h=1 Edπk h dπk h (si h,ai h) µh(si h,ai h) by ρµ,k. Applying Bernstein s inequality, we have that with probability at least 1 δ and n samples, it holds, | ˆV πk 1 V πk 1 | 2Hρµ,k log(1/δ) n + 2Mk log(1/δ) where Mk = maxs1,a1,...,s H,a H PH h=1 dπk h (sh,ah) µh(sh,ah) rh(sh, ah). To achieve an ϵ accuracy of evaluation, we need samples, nµ,k 8Hρµ,k log(1/δ) ϵ2 + 4Mk log(1/δ) Take the union bound over all target policies, nµ 8H maxk [K] ρµ,k log(K/δ) ϵ2 + 4M log(K/δ) where M = maxk [K] Mk. We define the optimal sampling distribution µ as the one minimizing the higher order sample complexity, µ h = arg min µh max k [K] Edπk h (s,a) " dπk h (s, a) µh(s, a) = arg min µh max k [K] dπk h (s, a) 2 µh(s, a) , h = 1, . . . , H. Multiple-policy Evaluation via Density Estimation A.3. An example of unrealizable optimal sampling distribution Here, we give an example to illustrate the assertation that in some cases, the optimal sampling distribution cannot be realized by any policy. Consider such a MDP with two layers, in the first layer, there is a single initial state s1,1, in the second layer, there are two states s2,1, s2,2. The transition function at state s1,1 is identical for any action, P(s2,1|s1,1, a) = P(s2,2|s1,1, a) = 1 2. Hence, for any policy, the only realizable state visitation distribution at the second layer is d2(s2,1) = d2(s2,2) = 1 Suppose the target policies take K 2 different actions at state s2,1 while take the same action at state s2,2. By solving the optimization problem, we have the optimal sampling distribution at the second layer, µ 2(s2,1) = K2 1 + K2 , µ 2(s2,2) = 1 1 + K2 , which is clearly not realizable by any policy. A.4. Proof of Lemma 4.5 By property (1) and (6), we have 4 5 ˆdπk h (s, a) dπk h (s, a) 4 3 ˆdπk h (s, a) and 4 5 ˆµ h(s, a) µ h(s, a) 4 3 ˆµ h(s, a). Hence, (dπk h (s, a))2 µ h(s, a) 25 12 max k [K] ( ˆdπk h (s, a))2 Remember µ is of the form PK k=1 α kdπk. Let µ be PK k=1 α k ˆdπk. Then we have |µ µ | max{ϵ, µ 4 } and µ is in the feasible set ˆDh = { ˆdπk h : k [K]}. Since ˆµ is the optimal solution to the approximate optimization problem (5), we have, ( ˆdπk h (s, a))2 ˆµ h(s, a) max k [K] ( ˆdπk h (s, a))2 µ h(s, a) 25 12 max k [K] (dπk h (s, a))2 The second inequality is again based on the property of coarse estimators. Together, we have, (dπk h (s, a))2 µ h(s, a) C max k [K] (dπk h (s, a))2 µ h(s, a) . A.5. Proof of Lemma 4.6 Proof. The gradient of ℓπ h(w) is, w(s,a)ℓπ h(w) = µh(s, a) ˆµh(s, a)w(s, a) X s ,a µh 1(s , a )P(s|s , a )π(a|s) ˆwh 1(s , a ) ˆµh 1(s , a ) . 
Suppose by some SGD algorithm, we can converge to a point ˆwh such that the gradient of the loss function is less than ϵ, ℓπ h( ˆwh) 1 = X µh(s, a) ˆµh(s, a) ˆwh(s, a) X s ,a µh 1(s , a )P(s|s , a )π(a|s) ˆwh 1(s , a ) ˆµh 1(s , a ) Multiple-policy Evaluation via Density Estimation By decomposing, µh(s, a) ˆµh(s, a) ˆwh(s, a) X s ,a µh 1(s , a )P(s|s , a )π(a|s) ˆwh 1(s , a ) ˆµh 1(s , a ) µh(s, a) ˆµh(s, a) ˆwh(s, a) dπ h(s, a) + dπ h(s, a) X s ,a µh 1(s , a )P(s|s , a )π(a|s) ˆwh 1(s , a ) ˆµh 1(s , a ) µh(s, a) ˆµh(s, a) ˆwh(s, a) dπ h(s, a) dπ h(s, a) X s ,a µh 1(s , a )P(s|s , a )π(a|s) ˆwh 1(s , a ) ˆµh 1(s , a ) = µh(s, a) ˆwh(s, a) ˆµh(s, a) dπ h(s, a) s ,a P(s|s , a )π(a|s) dπ h 1(s , a ) µh 1(s , a ) ˆwh 1(s , a ) ˆµh 1(s , a ) Hence, we have, µh(s, a) ˆwh(s, a) ˆµh(s, a) dπ h(s, a) s ,a P(s|s , a )π(a|s) dπ h 1(s , a ) µh 1(s , a ) ˆwh 1(s , a ) ˆµh 1(s , a ) dπ h 1(s , a ) µh 1(s , a ) ˆwh 1(s , a ) ˆµh 1(s , a ) A.6. Proof of Lemma 4.7 Proof. The minimum w h of the loss function ℓπ h(w) is w h(s, a) = dπ h(s,a) µh(s,a) ˆµh(s, a) if ˆwh 1 achieves optimum. By the property of the coarse distribution estimator, we have, w h(s, a) = dπ h(s, a) µh(s, a) ˆµh(s, a) 4 3 ˆdπ h(s, a) 4 5 ˆµh(s, a) ˆµh(s, a) = 5 3 ˆdπ h(s, a). We can define a feasible set for the optimization problem, i.e. wh(s, a) [0, Dh(s, a)], Dh(s, a) = 2 ˆdπ h(s, a). Next, we analyse the variance of the stochastic gradient. We denote the stochastic gradient as gh(w), {si 1, ai 1, . . . , si H, ai H} a trajectory sampled from µh and {sj 1, aj 1, . . . , sj H, aj H} a trajectory sampled from µh 1. gh(w)(s, a) = w(s, a) ˆµh(s, a)I(si h = s, ai h = a) ˆwh 1(sj h 1, aj h 1) ˆµh 1(sj h 1, aj h 1) π(a|s)I(sj h = s). The variance bound becomes V[gh(w)] E[ gh(w) 2] X s,a µh(s, a) w(s, a) 2 + µh 1(s, a) ˆwh 1(s, a) ˆµh 1(s, a) ( ˆdπ h(s, a))2 ˆµh(s, a) + ( ˆdπ h 1(s, a))2 ˆµh 1(s, a) Multiple-policy Evaluation via Density Estimation where the last inequality is due to the bounded feasible set for w and the property of coarse distribution estimator µh(s, a) 4 3 ˆµh(s, a). Based on the error propagation lemma 4.6, if we can achieve ℓπ h( ˆwh) 1 ϵ 4H2 from step h = 1 to step h = H, then we have, µh(s, a) ˆwh(s, a) ˆµh(s, a) dπ h(s, a) ϵ 4H , h = 1, 2, . . . , H, which can enable us to build the final estimator of the performance of policy π with at most error ϵ. By the property of smoothness, to achieve ℓπ h( ˆwh) 1 ϵ 4H2 , we need to achieve ℓπ h( ˆwh) ℓπ h(w h) ϵ2 32ξH4 where ξ is the smoothness factor, because, ℓπ h( ˆwh) 2 1 2ξ(ℓπ h( ˆwh) ℓπ h(w h)) ϵ2 Lemma A.4. For a λ strongly convex loss function L(w) satisfying w D for some known D, there exists a stochastic gradient descent algorithm that can output ˆw after T iterations such that, E[L( ˆw) L(w )] 2G2 where G2 is the variance bound of the stochastic gradient. Invoke the convergence rate for strongly-convex and smooth loss functions, i.e. Lemma A.4, we have that the number of samples needed to achieve ℓπ h( ˆwh) ℓπ h(w h) ϵ2 32ξH4 is, We have shown in Section 4.3 that ξ 3, this nice property helps us to get rid of the undesired ratio of the smoothness factor and the strongly-convexity factor, i.e. maxs,a µ(s,a) mins,a µ(s,a) of the original loss function (7) which can be extremely bad. 
Replacing G2 by our variance bound (16), we have, ( ˆdπ h(s, a))2 ˆµh(s, a) + ( ˆdπ h 1(s, a))2 ˆµh 1(s, a) For each step h, we need the above number of trajectories, sum over h, we have the total sample complexity, ( ˆdπ h(s, a))2 To evaluate K policies, we need trajectories, h=1 max k [K] ( ˆdπk h (s, a))2 A.7. Proof of Lemma 4.8 Proof. By Markov s inequality, we have, P(|ˆµ µ| ϵ) E[|ˆµ µ|] Multiple-policy Evaluation via Density Estimation The event that |ˆµMo M µ| > ϵ belongs to the event where more than half estimators ˆµi are outside of the desired range |ˆµi µ| > ϵ, hence, we have, P(|ˆµMo M µ| > ϵ) P( i=1 I(|ˆµi µ| > ϵ) N Denote I(|ˆµi µ| > ϵ) by Zi and E[Zi] = p, P(|ˆµMo M µ| > ϵ) = P( i=1 (Zi p) 1 where the first inequality holds by Hoeffding s inequality and the second inequality holds due to p 1 4. Set δ = e N 8 , we have, with N = O(log(1/δ)), with probability at least 1 δ, it holds |ˆµMo M µ| ϵ. A.8. Proof of Theorem 4.9 Here, we explain how Theorem 4.9 is derived. We first show how the Median-of-Means (Mo M) estimator and data splitting technique can conveniently convert Lemma 4.7 to a version holds with high probability. For step h, Algorithm 1 can output a solution ˆwh such that E[ℓπ h( ˆwh) ℓπ h(w h)] ϵ2 32ξH4 . We can apply Lemma 4.8 on our algorithm which means that we can run the algorithm for N = O (log(1/δ)) times. Hence, we will get N solutions { ˆwh,1, ˆwh,2, . . . , ˆwh,N}. Set ˆwh,Mo M as the solution such that ℓπ h( ˆwh,Mo M) = Median(ℓπ h( ˆwh,1), ℓπ h( ˆwh,2), . . . , ℓπ h( ˆwh,N)). Based on Lemma 4.8, we have that with probability at least 1 δ, it holds ℓπ h( ˆwh,Mo M) ℓπ h(w h) ϵ2 32ξH4 . With a little abuse of notation, we just denote ˆwh,Mo M by ˆwh in the following content. Now we are ready to estimate the total expected rewards of target policies, With the importance weighting ratio estimator ˆ wh(s,a) ˆµh(s,a) from Algorithm 1, we can estimate the performance of policy πk, ˆV πk 1 = 1 ˆwπk h (si h, ai h) ˆµh(si h, ai h) rh(si h, ai h), (17) where {si h, ai h}n i=1 is sampled from µh. Lemma A.5. With samples n = O H2 ϵ2 PH h=1 maxk [K] P s,a ( ˆdπk h (s,a))2 , we have with probability at least 1 δ, | ˆV πk 1 V πk 1 | ϵ Proof. First, we can decompose the error | ˆV πk 1 V πk 1 | = | ˆV πk 1 E[ ˆV πk 1 ] + E[ ˆV πk 1 ] V πk 1 | | ˆV πk 1 E[ ˆV πk 1 ]| + |E[ ˆV πk 1 ] V πk 1 |. Then, by Bernstein s inequality, with samples n = O H2 ϵ2 PH h=1 maxk [K] P s,a ( ˆdπk h (s,a))2 | ˆV πk 1 E[ ˆV πk 1 ]| ϵ 4. Based Lemma 4.7, we have, |E[ ˆV πk 1 ] V πk 1 | ϵ Remember that in Section 4.1, we ignore those states and actions with low estimated visitation distribution for each target policy which induce at most ϵ 2 error. Combined with Lemma A.5, our estimator ˆV πk 1 finally achieves that with probability at least 1 δ, | ˆV πk 1 V πk 1 | ϵ, k [K]. And for sample complexity, in our algorithm, we need to sample data in three procedures. First, for the coarse estimation of the visitation distribution, we need O( 1 ϵ ) samples. Second, to estimate the importance-weighting ratio, we need Multiple-policy Evaluation via Density Estimation samples O H4 ϵ2 PH h=1 maxk [K] P s,a (dπk h (s,a))2 . Last, to build the final performance estimator (9), we need samples ϵ2 PH h=1 maxk [K] P s,a ( ˆdπk h (s,a))2 . 
Therefore, the total number of trajectories needed is
\[
O\Big(\frac{H^4}{\epsilon^2}\sum_{h=1}^H \max_{k\in[K]}\sum_{s,a}\frac{(\hat d^{\pi_k}_h(s,a))^2}{\hat\mu_h(s,a)}\Big).
\]
Moreover, notice that,
\[
\max_{k\in[K]}\sum_{s,a}\frac{(\hat d^{\pi_k}_h(s,a))^2}{\hat\mu_h(s,a)} \le \max_{k\in[K]}\sum_{s,a}\frac{(\hat d^{\pi_k}_h(s,a))^2}{\mu^*_h(s,a)} \le \frac{25}{16}\max_{k\in[K]}\sum_{s,a}\frac{(d^{\pi_k}_h(s,a))^2}{\mu^*_h(s,a)}, \qquad (18)
\]
where $\mu^*_h$ is the optimal solution of the optimization problem (4); the first inequality holds because $\hat\mu_h$ is the minimizer of the approximate optimization problem (5), and the second because $\hat d^\pi_h(s,a) \le \frac{5}{4} d^\pi_h(s,a)$. Based on (18), we can replace the coarse distribution estimators in the sample complexity bound by the exact ones,
\[
O\Big(\frac{H^4}{\epsilon^2}\sum_{h=1}^H \max_{k\in[K]}\sum_{s,a}\frac{(d^{\pi_k}_h(s,a))^2}{\mu^*_h(s,a)}\Big).
\]

A.9. Proof of Corollary 4.10

Let the sampling distribution $\bar\mu_h$ be $\frac{1}{SA}\sum_{s,a} d^{\pi_{s,a}}_h$, where $\pi_{s,a} = \arg\max_{k\in[K]} d^{\pi_k}_h(s,a)$. Since $\mu^*_h$ is the optimal solution and $\bar\mu_h$ is a feasible solution, and since $\bar\mu_h(s,a) \ge \frac{1}{SA}\max_{k\in[K]} d^{\pi_k}_h(s,a)$, we have, for all $h\in[H]$,
\[
\max_{k\in[K]}\sum_{s,a}\frac{(d^{\pi_k}_h(s,a))^2}{\mu^*_h(s,a)} \le \max_{k\in[K]}\sum_{s,a}\frac{(d^{\pi_k}_h(s,a))^2}{\bar\mu_h(s,a)} \le SA.
\]
Notice that there is a logarithmic term $\log(K)$ hidden in the $O$ notation, where $K$ is the number of target policies. If the target policies are deterministic, then $K$ is bounded by $A^{SH}$, which gives $\log(K) \le SH\log(A)$. Together, the sample complexity is bounded by $O\big(\frac{\mathrm{poly}(H)\,S^2 A \log(A)}{\epsilon^2}\big)$.

B. Lower order coarse estimation

Algorithm 3 Multi-policy Approximation via Ratio-based Coarse Handling (MARCH)
Input: Horizon $H$, accuracy $\epsilon$, policy $\pi$.
Coarsely estimate $d_1$ such that $\mathrm{dist}_{1+\beta}(\hat d_1, d_1) \le \epsilon$, where $\beta = \frac{1}{H}$.
for $h = 1$ to $H-1$ do
  1. Coarsely estimate $\mu_h$ such that $|\hat\mu_h(s,a) - \mu_h(s,a)| \le \max\{\epsilon', c\,\mu_h(s,a)\}$, where $\epsilon' = \frac{\epsilon}{2H^2S^2A^2}$ and $c = \frac{\beta}{2}$.
  2. Sample $\{s^i_h, a^i_h, s^i_{h+1}\}_{i=1}^n$ from $\mu_h$.
  3. Estimate $d_{h+1}(s,a)$ by $\hat d_{h+1}(s,a) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(s^i_{h+1} = s)\,\pi(a|s)\,\hat w_h(s^i_h, a^i_h)$.
end for
Output: $\{\hat d_h\}_{h=1}^H$.

In this section, we first present our algorithm MARCH for the coarse estimation of all deterministic policies and then analyse its sample complexity. MARCH is based on the algorithm EULER proposed by Zanette and Brunskill (2019).

Lemma B.1 (Theorem 3.3 in Jin et al. (2020)). Based on EULER, with sample complexity $O\big(\frac{\mathrm{poly}(H,S,A)}{\epsilon}\big)$ we can construct a policy cover which generates a dataset with distribution $\mu$ such that, with probability $1-\delta$, if $d^{\max}_h(s) \ge \frac{\epsilon}{SA}$, then
\[
\mu_h(s,a) \ge \frac{d^{\max}_h(s,a)}{2HSA}, \qquad (19)
\]
where $d^{\max}_h(s) = \max_\pi d^\pi_h(s)$ and $d^{\max}_h(s,a) = \max_\pi d^\pi_h(s,a)$.

With this dataset, we estimate the visitation distribution of a deterministic policy by step-to-step importance weighting,
\[
\hat d_{h+1}(s,a) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(s^i_{h+1} = s)\,\pi(a|s)\,\hat w_h(s^i_h, a^i_h),
\]
where $\{s^i_h, a^i_h, s^i_{h+1}\}_{i=1}^n$ are sampled from $\mu$ and $\hat w_h(s,a) = \frac{\hat d_h(s,a)}{\hat\mu_h(s,a)}$. We claim that MARCH coarsely estimates the visitation distributions of all deterministic policies while paying only a lower-order sample complexity, which is formalized in the following theorem.

Theorem B.2. Implement Algorithm 3 with $n = O\big(\frac{\mathrm{poly}(H,S,A)}{\epsilon}\big)$ trajectories. Then, with probability at least $1-\delta$, for any deterministic policy $\pi$,
\[
|\hat d^\pi_h(s,a) - d^\pi_h(s,a)| \le \max\Big\{\epsilon, \frac{d^\pi_h(s,a)}{4}\Big\}, \qquad \forall s\in\mathcal{S},\, a\in\mathcal{A},\, h\in[H],
\]
where $\hat d^\pi$ is the distribution estimator.

Proof. Our analysis is based on a notion of distance defined as follows.

Definition B.3 ($\beta$-distance). For $x, y \ge 0$ and $\beta \ge 1$, we define the $\beta$-distance as
\[
\mathrm{dist}_\beta(x, y) = \min_{\alpha\in[\frac{1}{\beta},\,\beta]} |\alpha x - y|.
\]
Correspondingly, for $x, y \in \mathbb{R}^n_{\ge 0}$, $\mathrm{dist}_\beta(x, y) = \sum_{i=1}^n \mathrm{dist}_\beta(x_i, y_i)$.
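Since the $\beta$-distance drives the entire analysis in this appendix, a small numerical sketch may help: the scalar minimisation in Definition B.3 has a closed form obtained by clipping the unconstrained minimiser $y/x$ to $[1/\beta, \beta]$. The function names below are illustrative.

```python
import numpy as np

def dist_beta(x, y, beta):
    """Scalar beta-distance: min over alpha in [1/beta, beta] of |alpha * x - y|, for x, y >= 0."""
    if x == 0.0:
        return y  # |alpha * 0 - y| = y for every alpha
    alpha = np.clip(y / x, 1.0 / beta, beta)  # unconstrained minimiser y/x, clipped to the interval
    return abs(alpha * x - y)

def dist_beta_vec(xs, ys, beta):
    """Vector beta-distance: sum of coordinate-wise beta-distances."""
    return sum(dist_beta(x, y, beta) for x, y in zip(xs, ys))

# e.g. dist_beta(1.0, 1.05, 1.1) == 0, since 1.05/1.0 lies inside [1/1.1, 1.1]
```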
Based on this definition, the following lemma collects some useful properties of the $\beta$-distance.

Lemma B.4. The $\beta$-distance possesses the following properties for $x, y, z, \gamma \ge 0$:
1. $\mathrm{dist}_\beta(\gamma x, \gamma y) = \gamma\,\mathrm{dist}_\beta(x, y)$; (20)
2. $\mathrm{dist}_\beta(x_1 + x_2, y_1 + y_2) \le \mathrm{dist}_\beta(x_1, y_1) + \mathrm{dist}_\beta(x_2, y_2)$; (21)
3. $\mathrm{dist}_{\beta_1\beta_2}(x, z) \le \mathrm{dist}_{\beta_1}(x, y)\,\beta_2 + \mathrm{dist}_{\beta_2}(y, z)$. (22)

Proof. See Appendix C.1.

The following lemma shows that if we can control the $\beta$-distance between $\hat x$ and $x$, then $\hat x$ is a coarse estimate of $x$.

Lemma B.5. Suppose $\mathrm{dist}_{1+\beta}(x, y) \le \epsilon$. Then it holds that
\[
|x - y| \le \beta y + \Big(1 + \frac{\beta}{1+\beta}\Big)\epsilon \le 2\max\Big\{\Big(1 + \frac{\beta}{1+\beta}\Big)\epsilon,\, \beta y\Big\}.
\]
Proof. See Appendix C.2.

The logic of the analysis is to show by induction that the $\beta$-distance between $\hat d_h$ and $d_h$ can be bounded at each layer; then, by Lemma B.5, $\{\hat d_h\}_{h=1}^H$ achieves coarse estimation. Suppose at layer $h$ we have $\hat d_h$ such that $\mathrm{dist}_{(1+\beta)^h}(\hat d_h, d_h) \le \epsilon_h$, where $\beta = \frac{1}{H}$. For notational simplicity we omit the superscript $\pi$; the analysis holds for any policy. We use importance weighting to estimate $\hat d_{h+1}$,
\[
\hat d_{h+1}(s,a) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(s^i_{h+1} = s)\,\pi(a|s)\,\hat w_h(s^i_h, a^i_h),
\]
where $\hat w_h(s,a) = \frac{\hat d_h(s,a)}{\hat\mu_h(s,a)}$. We also denote
\[
\bar d_{h+1}(s,a) = \mathbb{E}_{(s_h, a_h, s_{h+1})\sim\mu_h}\big[\mathbb{I}(s_{h+1} = s)\,\pi(a|s)\,\hat w_h(s_h, a_h)\big].
\]
By (22) in Lemma B.4, we have,
\[
\mathrm{dist}_{(1+\beta)^{h+2}}(\hat d_{h+1}, d_{h+1}) \le \underbrace{\mathrm{dist}_{1+\beta}(\hat d_{h+1}, \bar d_{h+1})\,(1+\beta)^{h+1}}_{(A)} + \underbrace{\mathrm{dist}_{(1+\beta)^{h+1}}(\bar d_{h+1}, d_{h+1})}_{(B)}. \qquad (23)
\]
Next, we show how to bound the two terms (A) and (B). Note that for $(s,h)$ with $d^{\max}_h(s) < \frac{\epsilon}{SA}$, the induced $\beta$-distance error is at most $\epsilon$; therefore we only need to discuss state-action pairs which satisfy Lemma B.1.

Bound of (A). We first state the following lemma, which tells us that the importance weights are upper bounded.

Lemma B.6. Based on the definition of $\mu$, the importance weights are upper bounded,
\[
w_h(s,a) = \frac{d_h(s,a)}{\mu_h(s,a)} \le 2HSA\,\frac{d_h(s,a)}{d^{\max}_h(s,a)} \le 2HSA.
\]
Hence, we can clip $\hat w_h(s,a)$ at $2HSA$ so that $\hat w_h(s,a) \le 2HSA$.

Let us define the random variable $Z_{h+1}(s,a) = \mathbb{I}(s_{h+1} = s)\,\pi(a|s)\,\hat w_h(s_h, a_h)$, so that $\hat d_{h+1}(s,a) = \frac{1}{n}\sum_{i=1}^n Z^i_{h+1}(s,a)$. Since $\hat w_h(s_h, a_h)$ is bounded by Lemma B.6, we have,
\[
\mathbb{V}[Z_{h+1}(s,a)] \le \mathbb{E}[Z_{h+1}(s,a)^2] \le 2HSA\,\mathbb{E}[Z_{h+1}(s,a)].
\]
By Bernstein's inequality, with probability at least $1-\delta$,
\[
|\hat d_{h+1}(s,a) - \mathbb{E}[\hat d_{h+1}(s,a)]| \le \sqrt{\frac{2\mathbb{V}[Z_{h+1}(s,a)]\log(1/\delta)}{n}} + \frac{2HSA\log(1/\delta)}{n} \le \sqrt{\frac{4HSA\,\mathbb{E}[\hat d_{h+1}(s,a)]\log(1/\delta)}{n}} + \frac{2HSA\log(1/\delta)}{n}.
\]
Hence, to achieve the estimation accuracy $|\hat d_{h+1}(s,a) - \mathbb{E}[\hat d_{h+1}(s,a)]| \le \max\{\epsilon_0, c\,\mathbb{E}[\hat d_{h+1}(s,a)]\}$ we need $n = O\big(\frac{HSA\log(1/\delta)}{c\,\epsilon_0}\big)$ samples. Based on the above analysis, and recalling that $\mathbb{E}[\hat d_{h+1}(s,a)] = \bar d_{h+1}(s,a)$ and $c = \frac{\beta}{2}$, we can achieve
\[
|\hat d_{h+1}(s,a) - \bar d_{h+1}(s,a)| \le \max\Big\{\epsilon_0, \frac{\beta}{2}\,\bar d_{h+1}(s,a)\Big\}
\]
at the cost of $O\big(\frac{HSA\log(1/\delta)}{c\,\epsilon_0}\big)$ samples.

We now show $\mathrm{dist}_{1+\beta}(\hat d_{h+1}, \bar d_{h+1}) \le SA\,\epsilon_0$. We discuss two cases,
1. $|\hat d_{h+1}(s,a) - \bar d_{h+1}(s,a)| \le \epsilon_0$; (24)
2. $|\hat d_{h+1}(s,a) - \bar d_{h+1}(s,a)| \le \frac{\beta}{2}\,\bar d_{h+1}(s,a)$. (25)

For those $(s,a)$ which satisfy (25), since $[1-\frac{\beta}{2}, 1+\frac{\beta}{2}] \subseteq [\frac{1}{1+\beta}, 1+\beta]$, by the definition of the $\beta$-distance we have,
\[
\mathrm{dist}_{1+\beta}(\hat d_{h+1}(s,a), \bar d_{h+1}(s,a)) = 0. \qquad (26)
\]
For the other $(s,a)$, which satisfy (24), we have,
\[
\mathrm{dist}_{1+\beta}(\hat d_{h+1}(s,a), \bar d_{h+1}(s,a)) \le |\hat d_{h+1}(s,a) - \bar d_{h+1}(s,a)| \le \epsilon_0.
\]
Since there are at most $SA$ state-action pairs, the error from the second case is at most $SA\,\epsilon_0$. Combining the two cases, we have $\mathrm{dist}_{1+\beta}(\hat d_{h+1}, \bar d_{h+1}) \le SA\,\epsilon_0$. Setting $\epsilon_0 = \frac{\epsilon}{SA}$, we have,
\[
(A) = \mathrm{dist}_{1+\beta}(\hat d_{h+1}, \bar d_{h+1})\,(1+\beta)^{h+1} \le (1+\beta)^{h+1}\epsilon, \qquad (27)
\]
and the sample complexity for this part is $O\big(\frac{(HSA)^2}{\epsilon}\big)$.
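To illustrate the estimator analysed in the bound of (A), here is a minimal sketch of one MARCH layer update (step 3 of Algorithm 3), with the importance weights clipped at $2HSA$ as in Lemma B.6; the array names and the sample format are assumptions for illustration only.

```python
import numpy as np

def march_layer_update(d_hat_h, mu_hat_h, policy_next, samples, H, S, A):
    """One step-to-step importance-weighting update of Algorithm 3 (sketch).

    d_hat_h     : (S, A) array, coarse estimate of d_h
    mu_hat_h    : (S, A) array, coarse estimate of the logging distribution mu_h
    policy_next : (S, A) array, pi(a|s) used at step h+1
    samples     : list of (s_h, a_h, s_next) transitions drawn from mu_h
    """
    w_hat = d_hat_h / np.maximum(mu_hat_h, 1e-12)   # w_hat_h = d_hat_h / mu_hat_h
    w_hat = np.minimum(w_hat, 2.0 * H * S * A)      # clip at 2HSA (Lemma B.6)
    d_hat_next = np.zeros((S, A))
    for s, a, s_next in samples:
        # add pi(a'|s_next) * w_hat_h(s, a) to every action a' of the observed next state
        d_hat_next[s_next, :] += policy_next[s_next, :] * w_hat[s, a]
    return d_hat_next / len(samples)
```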
Bound of (B). Next, we show how to bound term (B). Denote $\tilde d_h(s,a) = \mu_h(s,a)\,\frac{\hat d_h(s,a)}{\hat\mu_h(s,a)}$. We have,
\begin{align*}
(B) &= \mathrm{dist}_{(1+\beta)^{h+1}}(\bar d_{h+1}, d_{h+1}) = \sum_{s,a}\mathrm{dist}_{(1+\beta)^{h+1}}\big(\bar d_{h+1}(s,a), d_{h+1}(s,a)\big) \\
&= \sum_{s,a}\mathrm{dist}_{(1+\beta)^{h+1}}\Big(\sum_{s',a'} P^\pi_h(s,a|s',a')\,\tilde d_h(s',a'),\; \sum_{s',a'} P^\pi_h(s,a|s',a')\,d_h(s',a')\Big) \\
&\le \sum_{s,a}\sum_{s',a'}\mathrm{dist}_{(1+\beta)^{h+1}}\big(P^\pi_h(s,a|s',a')\,\tilde d_h(s',a'),\; P^\pi_h(s,a|s',a')\,d_h(s',a')\big) \\
&= \sum_{s,a}\sum_{s',a'} P^\pi_h(s,a|s',a')\,\mathrm{dist}_{(1+\beta)^{h+1}}\big(\tilde d_h(s',a'), d_h(s',a')\big) = \mathrm{dist}_{(1+\beta)^{h+1}}(\tilde d_h, d_h),
\end{align*}
where $P^\pi_h(s,a|s',a') = P(s|s',a')\pi(a|s)$, the first two equalities hold by definition, the inequality holds by (21) in Lemma B.4, the third equality holds by (20) in Lemma B.4, and the last one holds because $\sum_{s,a} P^\pi_h(s,a|s',a') = 1$.

Now we analyse $\mathrm{dist}_{(1+\beta)^{h+1}}(\tilde d_h, d_h)$:
\[
\mathrm{dist}_{(1+\beta)^{h+1}}(\tilde d_h, d_h) = \sum_{s,a}\mu_h(s,a)\,\mathrm{dist}_{(1+\beta)^{h+1}}\Big(\frac{\hat d_h(s,a)}{\hat\mu_h(s,a)}, \frac{d_h(s,a)}{\mu_h(s,a)}\Big).
\]
By coarse estimation, we have $|\hat\mu_h(s,a) - \mu_h(s,a)| \le \max\{\epsilon', c\,\mu_h(s,a)\}$. Similarly, we discuss two cases,
1. $|\hat\mu_h(s,a) - \mu_h(s,a)| \le \epsilon'$; (28)
2. $|\hat\mu_h(s,a) - \mu_h(s,a)| \le c\,\mu_h(s,a)$. (29)

For those $(s,a)$ which satisfy (28), by Lemma B.6 we have,
\[
\mathrm{dist}_{(1+\beta)^{h+1}}\Big(\frac{\hat d_h(s,a)}{\hat\mu_h(s,a)}, \frac{d_h(s,a)}{\mu_h(s,a)}\Big) \le \Big|\frac{\hat d_h(s,a)}{\hat\mu_h(s,a)} - \frac{d_h(s,a)}{\mu_h(s,a)}\Big| \le 2HSA.
\]
Hence, we have,
\[
\mathrm{dist}_{(1+\beta)^{h+1}}\big(\tilde d_h(s,a), d_h(s,a)\big) = \mu_h(s,a)\,\mathrm{dist}_{(1+\beta)^{h+1}}\Big(\frac{\hat d_h(s,a)}{\hat\mu_h(s,a)}, \frac{d_h(s,a)}{\mu_h(s,a)}\Big) \le 2HSA\,\mu_h(s,a) \le \frac{2HSA\,\epsilon'}{c},
\]
where the last inequality holds because in this case $c\,\mu_h(s,a) \le \epsilon'$.

Next, for those $(s,a)$ which satisfy (29), we have,
\[
(1-c)\,\frac{1}{\hat\mu_h(s,a)} \le \frac{1}{\mu_h(s,a)} \le (1+c)\,\frac{1}{\hat\mu_h(s,a)}.
\]
Since $c = \frac{\beta}{2}$ and $[1-\frac{\beta}{2}, 1+\frac{\beta}{2}] \subseteq [\frac{1}{1+\beta}, 1+\beta]$, by the definition of the $\beta$-distance we have,
\[
\mathrm{dist}_{1+\beta}\Big(\frac{1}{\hat\mu_h(s,a)}, \frac{1}{\mu_h(s,a)}\Big) = 0. \qquad (30)
\]
We assume by induction that $\mathrm{dist}_{(1+\beta)^h}(\hat d_h, d_h) \le \epsilon_h$; together with (30) and property (22), this gives,
\[
\sum_{(s,a)\text{ satisfying }(29)} \mu_h(s,a)\,\mathrm{dist}_{(1+\beta)^{h+1}}\Big(\frac{\hat d_h(s,a)}{\hat\mu_h(s,a)}, \frac{d_h(s,a)}{\mu_h(s,a)}\Big) \le \epsilon_h. \qquad (31)
\]
Combining the results of the two cases, we have,
\[
(B) \le \mathrm{dist}_{(1+\beta)^{h+1}}(\tilde d_h, d_h) \le \epsilon_h + 4H^2S^2A^2\epsilon'.
\]
Setting $\epsilon' = \frac{\epsilon}{4H^2S^2A^2}$, we have,
\[
(B) \le \epsilon_h + \epsilon \qquad (32)
\]
at the cost of $O\big(\frac{H^3S^2A^2}{\epsilon}\big)$ samples.

Now we are ready to bound the $\beta$-distance at layer $h+1$. Plugging (27) and (32) into (23), we have,
\[
\mathrm{dist}_{(1+\beta)^{h+2}}(\hat d_{h+1}, d_{h+1}) \le \mathrm{dist}_{1+\beta}(\hat d_{h+1}, \bar d_{h+1})\,(1+\beta)^{h+1} + \mathrm{dist}_{(1+\beta)^{h+1}}(\bar d_{h+1}, d_{h+1}) \le (1+\beta)^{h+1}\epsilon + \epsilon + \epsilon_h.
\]
Starting from $\mathrm{dist}_{1+\beta}(\hat d_1, d_1) \le \epsilon$, we have,
\[
\mathrm{dist}_{(1+\beta)^{2h-1}}(\hat d_h, d_h) \le h\epsilon + \epsilon\sum_{l=1}^{h}(1+\beta)^{2l}. \qquad (33)
\]
Remembering that $\beta = \frac{1}{H}$ and that $(1+\frac{1}{H})^h \le e$ for $h \le H$, we have,
\[
\mathrm{dist}_{e^2}(\hat d_h, d_h) \le H(1+e^2)\epsilon. \qquad (34)
\]
Recalling Lemma B.5 and based on (34), we have,
\[
|\hat d_h(s,a) - d_h(s,a)| \le 2\max\{H(1+e^2)\epsilon, (e^2-1)\,d_h(s,a)\}.
\]
By paying only a multiplicative constant, we can adjust the constants above to meet our needs, i.e., those stated in Theorem B.2.

C. Proof of lemmas in Section B

C.1. Proof of Lemma B.4

Proof. 1. The first property is immediate:
\[
\mathrm{dist}_\beta(\gamma x, \gamma y) = \min_{\alpha\in[\frac{1}{\beta},\beta]}|\alpha\gamma x - \gamma y| = \min_{\alpha\in[\frac{1}{\beta},\beta]}\gamma|\alpha x - y| = \gamma\,\mathrm{dist}_\beta(x, y).
\]
2. Let $\alpha_i$ be such that $\mathrm{dist}_\beta(x_i, y_i) = |\alpha_i x_i - y_i|$, $i = 1, 2$. Notice that $\alpha_3 = \alpha_1\frac{x_1}{x_1+x_2} + \alpha_2\frac{x_2}{x_1+x_2}$ lies between $\alpha_1$ and $\alpha_2$, so $\alpha_3 \in [\frac{1}{\beta},\beta]$, and $\alpha_3(x_1+x_2) = \alpha_1 x_1 + \alpha_2 x_2$. Therefore,
\begin{align*}
\mathrm{dist}_\beta(x_1+x_2, y_1+y_2) &= \min_{\alpha\in[\frac{1}{\beta},\beta]}|\alpha(x_1+x_2) - y_1 - y_2| \le |\alpha_3(x_1+x_2) - y_1 - y_2| \\
&= |\alpha_1 x_1 + \alpha_2 x_2 - y_1 - y_2| \le |\alpha_1 x_1 - y_1| + |\alpha_2 x_2 - y_2| = \mathrm{dist}_\beta(x_1, y_1) + \mathrm{dist}_\beta(x_2, y_2).
\end{align*}
The first inequality holds by the definition of the $\beta$-distance; the second is the triangle inequality.

3. We prove the third property through a case-by-case discussion.

(1). $\frac{x}{\beta_1\beta_2} \le z \le \beta_1\beta_2 x$. In this case the result is trivial, since $\mathrm{dist}_{\beta_1\beta_2}(x, z) = 0$ and the $\beta$-distance is always non-negative.

(2). $\beta_1\beta_2 x < z$. If $y \le x$, then,
\[
\mathrm{dist}_{\beta_1\beta_2}(x, z) \le \mathrm{dist}_{\beta_2}(x, z) \le \mathrm{dist}_{\beta_2}(y, z).
\]
We are done.
If $x < y \le \beta_1 x$, then $\mathrm{dist}_{\beta_1}(x, y) = 0$ and $z > \beta_1\beta_2 x \ge \beta_2 y$; hence,
\[
\mathrm{dist}_{\beta_2}(y, z) = z - \beta_2 y \ge z - \beta_1\beta_2 x = \mathrm{dist}_{\beta_1\beta_2}(x, z).
\]
We are done. If $y > \beta_1 x$ and $z \in [\frac{y}{\beta_2}, \beta_2 y]$, then,
\[
\mathrm{dist}_{\beta_1}(x, y)\,\beta_2 + \mathrm{dist}_{\beta_2}(y, z) = \beta_2(y - \beta_1 x) \ge z - \beta_1\beta_2 x = \mathrm{dist}_{\beta_1\beta_2}(x, z).
\]
We are done. If $y > \beta_1 x$ and $z \notin [\frac{y}{\beta_2}, \beta_2 y]$, then in both sub-cases ($z > \beta_2 y$ and $z < \frac{y}{\beta_2}$) one checks that
\[
\mathrm{dist}_{\beta_1}(x, y)\,\beta_2 + \mathrm{dist}_{\beta_2}(y, z) = \beta_2(y - \beta_1 x) + \mathrm{dist}_{\beta_2}(y, z) \ge z - \beta_1\beta_2 x = \mathrm{dist}_{\beta_1\beta_2}(x, z).
\]
We are done.

(3). $z < \frac{x}{\beta_1\beta_2}$. A symmetric analysis, replacing $\beta_1, \beta_2$ by $\frac{1}{\beta_1}, \frac{1}{\beta_2}$, gives
\[
\mathrm{dist}_{\beta_1\beta_2}(x, z) \le \frac{\mathrm{dist}_{\beta_1}(x, y)}{\beta_2} + \mathrm{dist}_{\beta_2}(y, z).
\]
Since $\beta_2 \ge 1$ and $\mathrm{dist}_{\beta_1}(x, y) \ge 0$, we have $\frac{\mathrm{dist}_{\beta_1}(x, y)}{\beta_2} \le \mathrm{dist}_{\beta_1}(x, y)\,\beta_2$; hence,
\[
\mathrm{dist}_{\beta_1\beta_2}(x, z) \le \mathrm{dist}_{\beta_1}(x, y)\,\beta_2 + \mathrm{dist}_{\beta_2}(y, z),
\]
which concludes the proof.

C.2. Proof of Lemma B.5

Proof. We prove the lemma through a case-by-case study.

(1). $x \le y$. If $\mathrm{dist}_{1+\beta}(x, y) = 0$, then $x(1+\beta) \ge y \ge x$; therefore,
\[
|x - y| = y - x \le \beta x \le \beta y.
\]
If $\mathrm{dist}_{1+\beta}(x, y) > 0$, then $\mathrm{dist}_{1+\beta}(x, y) = y - (1+\beta)x$; therefore,
\[
|x - y| = y - x = \mathrm{dist}_{1+\beta}(x, y) + \beta x \le \epsilon + \beta x \le \epsilon + \beta y.
\]
(2). $y < x$. If $\mathrm{dist}_{1+\beta}(x, y) = 0$, then $\frac{x}{1+\beta} \le y < x$; therefore,
\[
|x - y| = x - y \le x - \frac{x}{1+\beta} \le y(1+\beta)\Big(1 - \frac{1}{1+\beta}\Big) = \beta y.
\]
If $\mathrm{dist}_{1+\beta}(x, y) > 0$, then $y < \frac{x}{1+\beta} \le x$ and $\mathrm{dist}_{1+\beta}(x, y) = \frac{x}{1+\beta} - y$. Moreover, since $\mathrm{dist}_{1+\beta}(x, y) \le \epsilon$, we have $\frac{x}{1+\beta} \le \epsilon + y$. Therefore,
\[
|x - y| = x - y = \mathrm{dist}_{1+\beta}(x, y) + \Big(1 - \frac{1}{1+\beta}\Big)x = \mathrm{dist}_{1+\beta}(x, y) + \frac{\beta x}{1+\beta} \le \epsilon + \frac{\beta}{1+\beta}\epsilon + \beta y = \Big(1 + \frac{\beta}{1+\beta}\Big)\epsilon + \beta y.
\]
Combining the results above, we have,
\[
|x - y| \le \beta y + \Big(1 + \frac{\beta}{1+\beta}\Big)\epsilon \le 2\max\Big\{\Big(1 + \frac{\beta}{1+\beta}\Big)\epsilon,\, \beta y\Big\}.
\]

D. Discussion on policy identification

In this section, we discuss the application of CAESAR to the policy identification problem, its instance-dependent sample complexity, and some intuitions that relate it to existing gap-dependent results. We first provide a simple algorithm that uses CAESAR to identify an $\epsilon$-optimal policy. The core idea is that we can use CAESAR to evaluate all candidate policies up to some accuracy and then eliminate those with low estimated performance; by gradually decreasing the evaluation error, we eventually identify a near-optimal policy with high probability. For notational simplicity, fixing the high-probability factor, we denote the sample complexity of CAESAR by $\frac{\Theta(\Pi)}{\gamma^2}$, where $\Pi$ is the set of policies to be evaluated and $\gamma$ is the estimation error.

Algorithm 4 Policy Identification based on CAESAR
Input: Alg CAESAR, optimality factor $\epsilon$, candidate policy set $\Pi$.
for $i = 1$ to $\log_2(4/\epsilon)$ do
  1. Run CAESAR to evaluate the performance of the policies in $\Pi$ up to accuracy $\gamma = \frac{1}{2^i}$.
  2. Eliminate $\pi_i$ if $\exists \pi_j \in \Pi$ with $\hat V^{\pi_j}_1 - \hat V^{\pi_i}_1 > 2\gamma$; update $\Pi$.
end for
Output: Randomly pick $\pi_o$ from $\Pi$.

Theorem D.1. Implementing Algorithm 4, we have that, with probability at least $1-\delta$, $\pi_o$ is $\epsilon$-optimal, i.e., $V^*_1 - V^{\pi_o}_1 \le \epsilon$. Moreover, the instance-dependent sample complexity is $O\big(\max_{\gamma \ge \epsilon}\frac{\Theta(\Pi_\gamma)}{\gamma^2}\big)$, where $\Pi_\gamma = \{\pi : V^*_1 - V^\pi_1 \le 8\gamma\}$.

Proof. On the one hand, based on the elimination rule of the algorithm, when running CAESAR with evaluation error $\gamma$ the optimal policy $\pi^*$ is not eliminated with probability at least $1-\delta$, since
\[
\max_{\pi\in\Pi}\hat V^\pi_1 - \hat V^{\pi^*}_1 \le (V^*_1 + \gamma) - (V^*_1 - \gamma) = 2\gamma.
\]
On the other hand, if $V^*_1 - V^{\pi_i}_1 > 4\gamma$, then $\pi_i$ is eliminated with probability at least $1-\delta$, since
\[
\max_{\pi\in\Pi}\hat V^\pi_1 - \hat V^{\pi_i}_1 \ge (V^*_1 - \gamma) - (V^{\pi_i}_1 + \gamma) > 2\gamma.
\]
Therefore, by running Algorithm 4, the final policy set is not empty, and any policy $\pi$ in this set satisfies $V^*_1 - V^\pi_1 \le \epsilon$ with probability at least $1-\delta$. Next, we analyse the sample complexity of Algorithm 4.
Based on the above analysis, within every iteration of the algorithm we maintain a policy set containing only $8\gamma$-optimal policies, and we use CAESAR to evaluate the performance of these policies up to accuracy $\gamma$. By Theorem 4.9, the sample complexity of one iteration is $\frac{\Theta(\Pi_\gamma)}{\gamma^2}$. Therefore, summing over the $\log_2(4/\epsilon)$ iterations, the overall sample complexity is
\[
O\Big(\max_{\gamma \ge \epsilon}\frac{\Theta(\Pi_\gamma)}{\gamma^2}\Big),
\]
where the logarithmic number of iterations is absorbed in the $O(\cdot)$ notation.

This result is interesting because it provides a perspective beyond the existing gap-dependent results for policy identification, and the two results have some intuitive relations that may be of interest. Roughly speaking, to identify an $\epsilon$-optimal policy in an MDP, the gap-dependent complexity scales as
\[
O\Big(\sum_{h,s,a}\frac{H\log K}{\mathrm{gap}_h(s,a)}\Big), \qquad \text{where } \mathrm{gap}_h(s,a) = V^*_h(s) - Q^*_h(s,a).
\]
The value gap $\mathrm{gap}_h(s,a)$ quantifies how sub-optimal action $a$ is at state $s$. If the gap is small, it is difficult to distinguish and eliminate the sub-optimal action. At the same time, smaller gaps mean that there are more policies with performance similar to the optimal policy, i.e., the policy set $\Pi_\gamma$ is larger. Both our result and the gap-dependent result capture this intuition, and we conjecture that there exists a quantitative relationship between the two perspectives.

An interesting consequence of Theorem D.1 is obtained by applying the same algorithm to the multi-reward setting. A similar instance-dependent sample complexity $O\big(\max_{\gamma \ge \epsilon}\frac{\Theta(\Pi^R_\gamma)}{\gamma^2}\big)$ can be achieved, with the difference that $\Pi^R_\gamma$ contains the policies that are $8\gamma$-optimal for at least one reward function. This sample complexity captures the intrinsic difficulty of the problem through how similar the near-optimal policies under different rewards are, which is consistent with intuition.
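Finally, to make the elimination schedule of Algorithm 4 concrete, a minimal sketch follows; `caesar_evaluate` stands in for a call to CAESAR that returns value estimates for every policy in the current set, and is an assumed interface rather than the paper's code.

```python
import math
import random

def identify_policy(policies, caesar_evaluate, epsilon):
    """Successive elimination with CAESAR (sketch of Algorithm 4).

    policies        : list of (hashable) candidate policies
    caesar_evaluate : callable(policies, gamma) -> dict {policy: estimated value}, accurate to gamma
    epsilon         : target optimality factor
    """
    candidates = list(policies)
    for i in range(1, math.ceil(math.log2(4.0 / epsilon)) + 1):
        gamma = 0.5 ** i
        v_hat = caesar_evaluate(candidates, gamma)       # evaluate all survivors to accuracy gamma
        best = max(v_hat.values())
        # eliminate pi if some other policy beats it by more than 2 * gamma
        candidates = [pi for pi in candidates if best - v_hat[pi] <= 2.0 * gamma]
    return random.choice(candidates)
```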