Selective Uncertainty Propagation in Offline RL

Sanath Kumar Krishnamurthy1, Tanmay Gangwani2, Sumeet Katariya1, Branislav Kveton3, Shrey Modi4, Anshuka Rangi2
1Meta 2Amazon 3Adobe 4Indian Institute of Technology, Bombay
sanathsk@meta.com

We consider the finite-horizon offline reinforcement learning (RL) setting, and are motivated by the challenge of learning the policy at any step $h$ in dynamic programming (DP) algorithms. To learn this policy, it suffices to evaluate the treatment effect of deviating from the behavioral policy at step $h$ after having optimized the policy for all future steps. Since the policy at any step can affect next-state distributions, the resulting distribution-shift challenges can make this problem far more statistically hard than estimating such treatment effects in the stochastic contextual bandit setting. However, the hardness of many real-world RL instances lies between the two regimes. We develop a flexible and general method called selective uncertainty propagation for confidence interval construction that adapts to the hardness of the associated distribution-shift challenges. We show the benefits of our approach on toy environments and demonstrate the benefits of these techniques for offline policy learning.

1 Introduction

We study the finite-horizon offline reinforcement learning (RL) problem, focusing on algorithms that adapt to instance hardness. At a high level, we study algorithms that provide better guarantees for contextual bandit (CB)-like instances while being able to plan in more dynamic RL-like instances. Our work is motivated by real-world RL problems, such as user interaction with an e-commerce search engine (recommendation system). Here, the state can be a user query, and the action is the product recommendation from the engine.
When the user wants to buy a particular product, the user often enters only a single product query unrelated to the previous one, thus resembling a sequence of CB problems. On the other hand, when the user explores products, the exploration queries are related through the user's intent, and the recommendation system may want to steer the user toward the ideal product. Hence, this resembles the RL setting. This indicates the need to develop unified solutions that integrate CB and RL techniques while adapting to instance hardness. We now introduce the CB and RL frameworks in more detail. Stochastic contextual bandits (CBs) (Langford and Zhang 2008; Li et al. 2010) and finite-horizon reinforcement learning (RL) (Sutton 1988; Williams 1992; Sutton and Barto 1998) are two fundamental frameworks for decision-making under uncertainty. In stochastic CBs, the environment samples the context and corresponding rewards (for each action) from a fixed but unknown distribution; the agent then observes the context and learns to select the most rewarding action conditioned on the context. Finite-horizon RL is a generalization of CBs where contexts become states and a sequence of decisions is to be made over $H$ steps. As in the CB problem, at each step the agent observes the current state, selects an action conditioned on the current state, and receives a reward sampled by the environment from a corresponding conditional distribution. However, unlike the CB problem, while the initial state is sampled from a fixed but unknown distribution, the next state at any step depends on the current state and the agent's action. Hence, the agent can plan to attain high cumulative reward by learning to reach high-value future states.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Unfortunately, the fact that actions can influence future states implies that the agent needs to learn under state-distribution shifts, making the RL setting statistically much harder than CBs in the worst case. For example, Foster et al. (2021) show that the worst-case sample complexity of learning a non-trivial offline RL policy is either polynomial in the state-space size or exponential in other parameters.1 On the other hand, if actions do not influence next-state distributions at any step, the RL instance would be equivalent to solving $H$ stochastic CB instances. On such instances, offline bandit algorithms (Foster and Syrgkanis 2019) would enjoy a polynomial sample complexity for policy learning with no dependence on state-space size. Hence, for such instances, state-of-the-art offline RL algorithms such as pessimistic value function optimization (Jin, Yang, and Wang 2021) may be unnecessarily conservative. We formalize this dichotomy and show that the statistical hardness of offline RL instances can be captured by the size of actions' impact on the next state's distribution. To show this, we consider the high-level structure of dynamic programming (DP) algorithms for offline RL (e.g., Jin, Yang, and Wang 2021). DP algorithms construct a policy iteratively, starting from the policy for the final step and ending by constructing the policy for the first step. At any step $h$, DP algorithms can be viewed as selecting the policy at step $h$ that maximizes the treatment effect of deviating from the behavioral policy at step $h$ after having optimized the policy for all future steps. The goal of this paper is to estimate and construct good confidence intervals for this treatment effect at step $h$.

1 Foster et al. (2021) consider the discounted infinite-horizon offline RL formulation. However, one should expect similar lower bounds for the finite-horizon offline RL formulation.
Our primary focus is on confidence interval (CI) construction, which is motivated by the fact that many successful offline RL algorithms learn a policy that maximizes the lower bound of constructed CIs (Jin, Yang, and Wang 2021). To account for estimation errors from future steps, standard methods for CI construction at any step propagate uncertainty from future steps to the current step $h$. This paper seeks to construct better CIs that adapt to instance hardness by selectively propagating uncertainty. In cases where all actions have zero estimated impact on next-state distributions, our procedure does not propagate any uncertainty from later steps and still constructs valid CIs for the treatment effect of deviating from the behavioral policy at step $h$ after having optimized the policy for all future steps. It treats the instance like a CB problem, hence enjoying a polynomial sample complexity with no dependence on state-space size for treatment effect estimation. For more dynamic instances, our procedure must unavoidably propagate more uncertainty from future steps in order to continue constructing valid CIs. In this way, we adapt to the hardness of the instance for CI construction at any step. We also show the benefits of this approach for offline policy learning by proposing an algorithm that optimizes our constructed CIs. Simple simulations further support our claim.

Related Work: Both bandits and RL have been studied extensively (Lattimore and Szepesvári 2019; Sutton and Barto 1998; Foster and Rakhlin 2023). In bandits, the focus has been on achieving higher statistical efficiency by using the reward distribution of actions (Garivier and Cappé 2011), prior distribution of model parameters (Thompson 1933; Agrawal and Goyal 2012; Chapelle and Li 2012; Russo et al. 2018), parametric structure (Dani, Hayes, and Kakade 2008; Abbasi-Yadkori, Pál, and Szepesvári 2011; Agrawal and Goyal 2013), or agnostic methods (Agarwal et al. 2014).
In RL, the focus has been on different means of learning to plan for longer horizons, such as the value function (Sutton 1988), policy (Williams 1992), or their combination (Sutton et al. 2000). Just as in our work, causal inference insights have helped improve the statistical efficiency of both CB and RL algorithms (Krishnamurthy, Wu, and Syrgkanis 2018; Carranza, Krishnamurthy, and Athey 2023; Syrgkanis and Zhan 2023). However, bridging the gap between bandits and RL is an exciting and relatively underexplored research direction. One way to define this gap is to argue that in bandit-like environments, the state never changes once initially sampled. These bandit-like environments can be viewed as a special case of the situation where actions do not impact next-state distributions. With bridging this gap as one motivation, prior work (Zanette and Brunskill 2019; Yin and Wang 2021) has used variance-dependent Bernstein bounds to limit uncertainty propagation when there is a lack of next-step value function heterogeneity. Another approach is to define this gap in a binary fashion: either there is no impact of actions on next-state distributions, or we are in a dynamic MDP environment. In an online setting, Zhang, Gottesman, and Doshi-Velez (2022) develop hypothesis tests to differentiate between the two situations and then select the most appropriate exploration algorithm. While their higher-level framing is similar to ours and their approach is novel, their approach cannot outperform existing RL algorithms in MDP environments. By interpolating between the two regimes, we hope to outperform bandit and existing RL algorithms that either forgo planning or are too conservative in accounting for actions' impact on next-state distributions.

2 Preliminaries

Setting: We consider an episodic Markov Decision Process (MDP) setting with state space $X$, action space $A$, horizon $H$, and transition kernel $P = (P^{(h)})_{h=1}^H$.
At every episode, the environment samples a starting state $x^{(1)}$ and a set of realized rewards $r = (r^{(h)})_{h=1}^H$ from a fixed but unknown distribution $D$. Here $r^{(h)}$ is a map from $X \times A$ to $[0, 1]$. For any states $x, x' \in X$ and action $a \in A$, $P^{(h)}(x'|x, a)$ denotes the probability density of transitioning to state $x'$ conditional on taking action $a$ at state $x$ during step $h$. A trajectory $\tau$ is a sequence of states, actions, and rewards. That is, any trajectory $\tau$ is given by $\tau = (x^{(h)}, a^{(h)}, r^{(h)}(x^{(h)}, a^{(h)}))_{h=1}^H$. A policy $\pi$ is a sequence of $H$ action sampling kernels $\{\pi^{(h)}\}_{h=1}^H$, where $\pi^{(h)}(a|x)$ denotes the probability of sampling action $a$ at state $x$ during step $h$ under the policy $\pi$. We let $D(\pi)$ denote the induced distribution over trajectories under the policy $\pi$. For any policy $\pi$, we define the (state-) value function $V^{(h)}_\pi : X \to [0, H - h + 1]$ at each step $h \in [H]$ such that

$V^{(h)}_\pi(x) = E_{D(\pi)}\left[\sum_{i=h}^H r^{(i)}(x^{(i)}, a^{(i)}) \mid x^{(h)} = x\right]$. (1)

The value of policy $\pi$ is given by $E_D[V^{(1)}_\pi(x^{(1)})]$. We can take the expectation over $D$ instead of $D(\pi)$ here since the only random variable in $V^{(1)}_\pi(x^{(1)})$ is the initial state $x^{(1)}$, which does not depend on the choice of the policy $\pi$. For any step $h \in [H]$, we let $R^{(h)}$ be a function from $X \times A$ to $[0, 1]$ denoting the expected reward function for step $h$. That is, $R^{(h)}(x, a) = E_D[r^{(h)}(x, a)]$. With some abuse of notation, for any $x, x' \in X$, we let $R^{(h)}(x, \pi) = \sum_a \pi^{(h)}(a|x) R^{(h)}(x, a)$ and $P^{(h)}(x'|x, \pi) = \sum_a \pi^{(h)}(a|x) P^{(h)}(x'|x, a)$. That is, $R^{(h)}(x, \pi)$ is the expected reward at state $x$ and step $h$ under the policy $\pi$. Similarly, $P^{(h)}(x'|x, \pi)$ is the expected transition probability from $x$ to $x'$ at step $h$ under the policy $\pi$. For any step $h \in [H]$, we also let $V^{(h)}_{\max}$ denote a bound on the maximum value $V^{(h)}_\pi(x)$ can take for any state $x$ and policy $\pi$. It is also equivalent to define the value functions $V^{(h)}_\pi$ using the iterative definition in (2), where $V^{(H+1)}_\pi \equiv 0$:

$\forall h \in [H], x \in X, \quad V^{(h)}_\pi(x) = R^{(h)}(x, \pi) + \int_{x'} V^{(h+1)}_\pi(x') P^{(h)}(x'|x, \pi)$. (2)
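To make the iterative definition concrete, the recursion in (2) can be evaluated exactly by backward induction from $V^{(H+1)} \equiv 0$ in a small tabular MDP. The following sketch is purely illustrative: the randomly generated MDP, the policy, and all array names are placeholders, not part of the paper's setup.

```python
import numpy as np

# Illustrative tabular MDP: H steps, nX states, nA actions.
H, nX, nA = 3, 4, 2
rng = np.random.default_rng(0)
R = rng.uniform(size=(H, nX, nA))                   # R[h, x, a] = expected reward in [0, 1]
P = rng.dirichlet(np.ones(nX), size=(H, nX, nA))    # P[h, x, a, x'] = transition kernel
pi = rng.dirichlet(np.ones(nA), size=(H, nX))       # pi[h, x, a] = action probabilities

# Backward recursion from (2), with V^{(H+1)} = 0.
V = np.zeros((H + 1, nX))
for h in range(H - 1, -1, -1):
    R_pi = (pi[h] * R[h]).sum(axis=-1)              # R^{(h)}(x, pi)
    P_pi = np.einsum('xa,xay->xy', pi[h], P[h])     # P^{(h)}(x' | x, pi)
    V[h] = R_pi + P_pi @ V[h + 1]                   # V^{(h)} = R_pi + P_pi V^{(h+1)}
```

Each backward pass is one matrix-vector product per step, so the full sweep costs $O(H |X|^2 |A|)$ in the tabular case.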
Data Collection Process: In this paper, we focus on the offline setting (Levine et al. 2020) with training data collected under a behavioral policy $\pi_b$. Apart from the policy $\pi_b$, the learner only has access to a dataset $S$ consisting of $T$ trajectories sampled from the distribution $D(\pi_b)$, where $D(\pi_b)$ is the data sampling distribution induced by $\pi_b$. That is, $S = \{\tau_t\}_{t=1}^T$, where $\tau_t = (x^{(h)}_t, a^{(h)}_t, r^{(h)}_t(x^{(h)}_t, a^{(h)}_t))_{h=1}^H \sim D(\pi_b)$. Since the transitions in these trajectories are induced by the behavioral policy, for notational convenience, we let $P_b = (P_b^{(h)})_{h=1}^H$ denote the transition kernel under the policy $\pi_b$. That is, for any $x, x' \in X$, $P_b^{(h)}(x'|x) = P^{(h)}(x'|x, \pi_b)$.

2.1 Estimand of Interest

We now turn our attention to defining our target estimand, which refers to the specific quantity we aim to estimate. Consider a fixed policy $\pi$ and suppose we would like to estimate its value. Since estimating the value of the behavioral policy $\pi_b$ is easy (empirical average of total observed reward in each trajectory), we argue that it is sufficient to estimate $E_D[V^{(1)}_\pi(x^{(1)}) - V^{(1)}_{\pi_b}(x^{(1)})]$, the difference in values between the evaluation and behavioral policies. This difference can be further decomposed. For each step $h$, let $\pi_h = (\pi_b^{(1)}, \ldots, \pi_b^{(h-1)}, \pi^{(h)}, \ldots, \pi^{(H)})$ be the policy that follows the behavioral policy up to step $h - 1$ and then follows the evaluation policy. In (3), we decompose the difference in policy value between the evaluation and behavioral policies into the sum of differences in policy value between $\pi_h$ and $\pi_{h+1}$ for each step $h$:

$E_D[V^{(1)}_\pi(x^{(1)}) - V^{(1)}_{\pi_b}(x^{(1)})] \overset{(i)}{=} \sum_{h=1}^H E_D[V^{(1)}_{\pi_h}(x^{(1)}) - V^{(1)}_{\pi_{h+1}}(x^{(1)})] \overset{(ii)}{=} \sum_{h=1}^H E_{D(\pi_b)}[V^{(h)}_{\pi_h}(x^{(h)}) - V^{(h)}_{\pi_{h+1}}(x^{(h)})]$. (3)

Here (i) follows from telescoping and (ii) follows from the fact that the policies $\pi_h$ and $\pi_{h+1}$ agree with the behavioral policy for the first $h - 1$ steps. We let $\alpha^{(h)}_\pi$, the term corresponding to step $h$ in the above decomposition, be our estimand of interest.
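The telescoping identity (i) in (3) can be sanity-checked numerically. The sketch below builds a random tabular MDP (all quantities are illustrative placeholders), evaluates the mixed policies $\pi_h$ exactly, and confirms that the per-step value differences sum to the overall value difference between the evaluation and behavioral policies.

```python
import numpy as np

# Sanity check of the telescoping decomposition (3) on a random tabular MDP.
# All quantities are exact expectations; names are illustrative.
H, nX, nA = 3, 4, 2
rng = np.random.default_rng(1)
R = rng.uniform(size=(H, nX, nA))
P = rng.dirichlet(np.ones(nX), size=(H, nX, nA))
d0 = rng.dirichlet(np.ones(nX))                     # initial-state distribution
pi_e = rng.dirichlet(np.ones(nA), size=(H, nX))     # evaluation policy
pi_b = rng.dirichlet(np.ones(nA), size=(H, nX))     # behavioral policy

def value(policy):
    """V^{(1)} under `policy` (an (H, nX, nA) array), as a vector over states."""
    V = np.zeros(nX)
    for h in range(H - 1, -1, -1):
        R_pi = (policy[h] * R[h]).sum(-1)
        P_pi = np.einsum('xa,xay->xy', policy[h], P[h])
        V = R_pi + P_pi @ V
    return V

def mixed(h):
    """pi_h: behavioral policy before step h, evaluation policy from step h on."""
    return np.concatenate([pi_b[:h], pi_e[h:]])

lhs = d0 @ (value(pi_e) - value(pi_b))
rhs = sum(d0 @ (value(mixed(h)) - value(mixed(h + 1))) for h in range(H))
assert np.isclose(lhs, rhs)                          # decomposition (3) holds
```

Note that `mixed(0)` is the full evaluation policy and `mixed(H)` is the behavioral policy, so the inner differences telescope exactly.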
That is, our estimand $\alpha^{(h)}_\pi$ is the difference in value between the policies $\pi_h$ and $\pi_{h+1}$; these policies only differ in the current step $h$, which may cause a difference in immediate rewards and may also cause a difference in next-state distributions (affecting future rewards even if the policies at future steps are the same).

$\alpha^{(h)}_\pi = E_{D(\pi_b)}[V^{(h)}_{\pi_h}(x^{(h)}) - V^{(h)}_{\pi_{h+1}}(x^{(h)})]$ (4)

We now seek to justify $\alpha^{(h)}_\pi$ as an important estimand, and start by arguing that it is a reasonable estimand to care about. Note that, given the decomposition in (3), estimating and constructing CIs for $\{\alpha^{(h)}_\pi\}_{h=1}^H$ allows us to estimate and construct CIs for $E_D[V^{(1)}_\pi(x^{(1)}) - V^{(1)}_{\pi_b}(x^{(1)})]$ (the difference in policy value between the evaluation and behavioral policies) and thus allows us to estimate and construct CIs for $E_D[V^{(1)}_\pi(x^{(1)})]$ (the evaluation policy value). Beyond being an effective surrogate for policy evaluation, $\alpha^{(h)}_\pi$ is an important quantity to consider in dynamic programming (DP) algorithms. DP algorithms construct the policy for the final step ($\pi^{(H)}$) and iteratively construct policies for earlier steps. At step $h$, the policies at steps $h + 1$ to $H$ are already fixed/computed. Hence at this step, one can interpret DP algorithms as attempting to select $\pi^{(h)}$ in order to maximize $\alpha^{(h)}_\pi$; that is, to maximize the treatment effect of deviating from the behavioral policy at step $h$ after having optimized the policy for all future steps. Hence, for any step $h$, $\alpha^{(h)}_\pi$ is a helpful estimand to consider for decision-making at step $h$. Importantly for us, when actions at step $h$ do not affect next-state distributions, the problem of choosing a policy at step $h$ can be viewed as a CB problem. Helpfully, in this case, unlike policy value, $\alpha^{(h)}_\pi$ only depends on immediate rewards and can be estimated via offline stochastic CB techniques. However, when actions at step $h$ do influence next-state distributions, RL techniques are necessary for estimating $\alpha^{(h)}_\pi$.
Hence, beyond being a critical quantity for decision-making at step $h$, it is also a quantity that is amenable to interpolating between CB and RL techniques. Thus, our paper focuses on estimating and constructing tight confidence intervals (CIs) for this estimand $\alpha^{(h)}_\pi$.

3 Shift Model

Offline RL is more challenging than offline policy learning in the stochastic CB setting (Foster et al. 2021). The primary difference between the two settings is due to the state-distribution shift induced by deviating from the behavioral policy. Distribution shift makes any statistical learning problem challenging (Vergara et al. 2012; Bobu et al. 2018; Farshchian et al. 2018). Hence, methods that adapt to instance hardness must rely on some implicit or explicit approach to measure this state-distribution shift. To this end, we model the heterogeneous treatment effect (Künzel et al. 2019; Nie and Wager 2021) of actions on the next-state distribution and refer to this effect as the shift model. More precisely, we define the shift model $\Delta = (\Delta^{(h)})_{h=1}^H$ in (5):

$\forall (x, a), \quad \Delta^{(h)}(\cdot|x, a) = P^{(h)}(\cdot|x, a) - P_b^{(h)}(\cdot|x)$. (5)

Here $\Delta^{(h)}(x'|x, a)$ captures the shift in the probability of transitioning from $x$ to $x'$ due to selecting action $a$ at state $x$ instead of following the behavioral policy. With some abuse of notation, for any $x, x' \in X$, we let $\Delta^{(h)}(x'|x, \pi) = \sum_a \pi^{(h)}(a|x) \Delta^{(h)}(x'|x, a)$. That is, $\Delta^{(h)}(x'|x, \pi)$ is the expected shift (w.r.t. $P_b$) in the probability of transitioning from $x$ to $x'$ at step $h$ under the policy $\pi$. It is worth noting that shifts are bounded: for all $(x, a)$, since $\Delta^{(h)}(\cdot|x, a)$ is a difference of two state distributions, we have $\|\Delta^{(h)}(\cdot|x, a)\|_1 \le 2$ from the triangle inequality. We argue that shift helps capture instance hardness for estimating $\alpha^{(h)}_\pi$. To see this, we provide a shift-dependent expression for $\alpha^{(h)}_\pi$:

$\alpha^{(h)}_\pi \overset{(i)}{=} E_{D(\pi_b)}[V^{(h)}_{\pi_h}(x^{(h)}) - V^{(h)}_{\pi_{h+1}}(x^{(h)})] \overset{(ii)}{=} E_{D(\pi_b)}\left[R^{(h)}(x^{(h)}, \pi) - R^{(h)}(x^{(h)}, \pi_b) + \int_{x'} V^{(h+1)}_\pi(x') \Delta^{(h)}(x'|x^{(h)}, \pi)\right]$. (6)
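The shift-dependent expression (6) can likewise be verified numerically on a small tabular instance. In the sketch below (all names and the random MDP are illustrative placeholders), $\alpha^{(h)}_\pi$ computed directly from the mixed-policy values matches the immediate-reward difference plus the shift-weighted next-step value.

```python
import numpy as np

# Verify the shift-dependent expression (6) for alpha^{(h)} on a random tabular MDP.
H, nX, nA = 3, 4, 2
rng = np.random.default_rng(2)
R = rng.uniform(size=(H, nX, nA))
P = rng.dirichlet(np.ones(nX), size=(H, nX, nA))
d0 = rng.dirichlet(np.ones(nX))
pi_e = rng.dirichlet(np.ones(nA), size=(H, nX))      # evaluation policy
pi_b = rng.dirichlet(np.ones(nA), size=(H, nX))      # behavioral policy

def values(policy):
    """V[h] for every step h under `policy` (with V[H] = 0)."""
    V = np.zeros((H + 1, nX))
    for h in range(H - 1, -1, -1):
        P_pi = np.einsum('xa,xay->xy', policy[h], P[h])
        V[h] = (policy[h] * R[h]).sum(-1) + P_pi @ V[h + 1]
    return V

h = 1                                                # step of interest (0-indexed)
d_h = d0.copy()                                      # state distribution at step h under pi_b
for i in range(h):
    d_h = d_h @ np.einsum('xa,xay->xy', pi_b[i], P[i])

V_e = values(pi_e)                                   # pi_h agrees with pi_e from step h on
pi_h = np.concatenate([pi_b[:h], pi_e[h:]])
pi_h1 = np.concatenate([pi_b[:h + 1], pi_e[h + 1:]])
alpha_direct = d_h @ (values(pi_h)[h] - values(pi_h1)[h])

# Expression (6): immediate-reward difference plus shift-weighted next-step value.
R_diff = (pi_e[h] * R[h]).sum(-1) - (pi_b[h] * R[h]).sum(-1)
shift = (np.einsum('xa,xay->xy', pi_e[h], P[h])
         - np.einsum('xa,xay->xy', pi_b[h], P[h]))   # Delta^{(h)}(x'|x, pi_e)
alpha_shift = d_h @ (R_diff + shift @ V_e[h + 1])
assert np.isclose(alpha_direct, alpha_shift)
```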
Here (i) follows from (4) (the definition of $\alpha^{(h)}_\pi$), and (ii) follows from (2) and (5). Note that, in the final expression of (6), the first term can be estimated using stochastic CB techniques, and the dependence on the next-step value function is scaled by the size of the shift. This hints at the possibility of developing methods that interpolate between CB and RL techniques. More formally, in Section 4, we show that shift estimates enable us to adapt to the hardness of our setting when estimating and constructing CIs for $\alpha^{(h)}_\pi$.

4 Theory: Selective Propagation

In Section 2, we motivated and defined our estimand $\alpha^{(h)}_\pi$ (see (4)), which is the treatment effect of deviating from the behavioral policy at step $h$ after having already deviated from the behavioral policy for all future steps. We now present an approach to estimate and construct tight valid CIs for $\alpha^{(h)}_\pi$ with interval size adapting to instance hardness. Here, harder instances have larger next-state distribution shifts when deviating from the behavioral policy. When shifts are smaller, we can rely more on statistically efficient CB methods. However, when shifts are larger (the instance is more dynamic), we unavoidably must rely more on RL methods that account for worst-case distribution shifts. Our approach to estimate and construct tight valid CIs for $\alpha^{(h)}_\pi$ requires several inputs. These inputs, described in the following subsection, allow us to abstract away existing approaches to well-studied estimation problems in CB and RL settings. In Section 4.2, we describe how to combine these existing tools to achieve guarantees that adapt to instance hardness.

4.1 Inputs

Our method interpolates between existing tools for CB and RL settings by leveraging shift estimates. To simplify our analysis and generalize our results, we assume access to these estimates as inputs to our interpolation method.
In particular, we take as input: (1) an offline CB treatment effect estimate and corresponding CI, (2) optimistic and pessimistic offline RL value function estimates, and (3) shift estimates with average error bounds. As the quality of our inputs improves (potentially as better estimators get developed), the quality of our outputs will correspondingly improve. We now formally describe these inputs, requiring all the associated high-probability bounds to hold simultaneously with probability at least $1 - \delta_{in}$. We start by describing the first input, which is based on CB methods.

Input 1 (CB estimates): This input provides an estimate and CI for $\theta^{(h)}_\pi$ (formally defined in (7)), which is the average treatment effect on the immediate reward of deviating from the behavioral policy at step $h$:

$\theta^{(h)}_\pi = E_{D(\pi_b)}\left[R^{(h)}(x^{(h)}, \pi_h) - R^{(h)}(x^{(h)}, \pi_{h+1})\right]$ (7)

Since $\theta^{(h)}_\pi$ only depends on the immediate reward, well-established offline CB techniques (e.g., Dudik et al. 2014) can be used to estimate and construct CIs for the difference (in terms of immediate rewards) between these policies. We let $\hat\theta^{(h)}_\pi$ be our input estimate and let $\kappa^{(h)}_{\pi,\theta}$ be the input CI radius. That is, the confidence interval is given by (8):

$|\theta^{(h)}_\pi - \hat\theta^{(h)}_\pi| \le \kappa^{(h)}_{\pi,\theta}$ (8)

When deviating from the behavioral policy at step $h$ has no impact on next-state distributions, the estimate and CI for $\theta^{(h)}_\pi$ can be used as the estimate and CI for $\alpha^{(h)}_\pi$. However, when there is an impact on next-state distributions, valid estimation and CI construction for $\alpha^{(h)}_\pi$ requires us to propagate estimates and uncertainty from future steps to the current step. To enable this propagation, we take estimates for $V^{(h+1)}_\pi$ as our second input.
Input 2 (RL estimates): This input provides pessimistic, standard, and optimistic estimates for $V^{(h+1)}_\pi$, denoted by $\hat V^{(h+1)}_{\pi,p}$, $\hat V^{(h+1)}_\pi$, and $\hat V^{(h+1)}_{\pi,o}$ respectively, such that the ordering in (9) holds.2 Further, with high probability, we require that (10) holds; that is, the true value function is bounded by the pessimistic and optimistic value function estimates.

$\forall x, \quad 0 \le \hat V^{(h+1)}_{\pi,p}(x) \le \hat V^{(h+1)}_\pi(x) \le \hat V^{(h+1)}_{\pi,o}(x) \le V^{(h+1)}_{\max}$ (9)

$\forall x, \quad V^{(h+1)}_\pi(x) \in [\hat V^{(h+1)}_{\pi,p}(x), \hat V^{(h+1)}_{\pi,o}(x)]$ (10)

There is a large and growing literature on value function estimation in RL, including optimistic and pessimistic value function estimation designed to satisfy (10) (e.g., Martin et al. 2017; Wang et al. 2019; Jin, Yang, and Wang 2021). Thus, we can employ the most cutting-edge methods to construct these next-step value function estimates. Input 2 gave us estimates for $V^{(h+1)}_\pi$ (next-step value), which we may need to propagate to the current step $h$ when constructing an estimate and CI for $\alpha^{(h)}_\pi$. Since our goal is to interpolate between tight CB guarantees and always-valid RL guarantees, unlike traditional RL algorithms, we want to be selective in propagating next-step estimates/uncertainties. Our final input, shift estimates, allows us to propagate these estimates only when required, enabling our adaptation to instance hardness.

Input 3 (Shift estimates): This input provides an estimate for $\Delta^{(h)}$ (see (5)) and an associated error bound, denoted by $\hat\Delta^{(h)}$ and $\kappa^{(h)}_{\pi,\Delta}$ respectively, such that (11) holds (recall from Section 3 that $\Delta^{(h)}$ satisfies the same bound). We also require that (12) holds with high probability.

2 Note that (9) can be enforced during input construction. Recall that $V^{(h+1)}_{\max}$ denotes a bound on the maximum value $V^{(h+1)}_\pi(x)$ can take for any state $x$.
$\forall (x, a), \quad \|\hat\Delta^{(h)}(\cdot|x, a)\|_1 \le 2$ (11)

$E_{D(\pi_b)}\left[\|\hat\Delta^{(h)}(\cdot|x^{(h)}, \pi) - \Delta^{(h)}(\cdot|x^{(h)}, \pi)\|_1\right] \le \kappa^{(h)}_{\pi,\Delta}$ (12)

Since the true shift model is a function of the true transition model (see Section 3), shift can be estimated by first estimating the transition model and then calculating the treatment effect (due to deviating from the behavioral policy) of transitioning between any pair of states (see Moerland et al. 2023, for a survey on model-based RL and transition model estimation).3

4.2 Combining Inputs

We now have all our required input estimates, and can state our main result (Theorem 4.1) on constructing an estimate and CI for $\alpha^{(h)}_\pi$ in a way that adapts to instance hardness.

Theorem 4.1. Suppose we have: (1) CB inputs $(\hat\theta^{(h)}_\pi, \kappa^{(h)}_{\pi,\theta})$; (2) RL inputs $(\hat V^{(h+1)}_{\pi,p}, \hat V^{(h+1)}_\pi, \hat V^{(h+1)}_{\pi,o})$ satisfying (9); and (3) shift inputs $(\hat\Delta^{(h)}, \kappa^{(h)}_{\pi,\Delta})$ satisfying (11), such that (8), (10), and (12) hold with probability at least $1 - \delta_{in}$. Moreover, suppose we have a (holdout) dataset of $T$ trajectories $S = \{\tau_t\}_{t=1}^T$ sampled from the distribution $D(\pi_b)$ that were not used for constructing input estimates.4 Our estimate for $\alpha^{(h)}_\pi$ is then denoted by $\hat\alpha^{(h)}_\pi$ and given by (13):

$\hat\alpha^{(h)}_\pi = \hat\theta^{(h)}_\pi + \frac{1}{T} \sum_{t=1}^T \int_{x'} \hat V^{(h+1)}_\pi(x') \hat\Delta^{(h)}(x'|x^{(h)}_t, \pi)$ (13)

Now, for some fixed $\delta > 0$, with probability at least $1 - \delta - \delta_{in}$, the confidence interval in (14) holds:

$|\alpha^{(h)}_\pi - \hat\alpha^{(h)}_\pi| \le L^{(h)}_\pi$. (14)

Here $L^{(h)}_\pi$ is given by (15):

$L^{(h)}_\pi = \kappa^{(h)}_{\pi,\theta} + V^{(h+1)}_{\max} \kappa^{(h)}_{\pi,\Delta} + 6 V^{(h+1)}_{\max} \sqrt{\log(1/\delta)/T} + \int_{x'} \left|E_{D(\pi_b)}[\hat\Delta^{(h)}(x'|x^{(h)}, \pi)]\right| \Gamma^{(h+1)}_\pi(x')$ (15)

Here $\Gamma^{(h+1)}_\pi$ is the difference between the optimistic and pessimistic estimates; it captures the uncertainty in the next-step value function estimates. That is, for all $x \in X$, $\Gamma^{(h+1)}_\pi(x) = \hat V^{(h+1)}_{\pi,o}(x) - \hat V^{(h+1)}_{\pi,p}(x)$.

3 As a treatment effect model, shift may also be estimated via heterogeneous treatment effect estimators (e.g., Nie and Wager 2021; Künzel et al. 2019).
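To make the structure of (13) and (15) concrete, the sketch below combines synthetic stand-ins for the three inputs into the estimate and the interval width. Everything here is an illustrative placeholder (including the exact constant in the $\sqrt{\log(1/\delta)/T}$ concentration term), not the paper's actual estimators.

```python
import numpy as np

# Sketch: combining the three inputs into the estimate (13) and CI width (15).
# All inputs below are synthetic placeholders, not real estimators.
rng = np.random.default_rng(3)
nX, T, delta = 5, 200, 0.05
V_max = 2.0                                          # bound on V^{(h+1)}

theta_hat, kappa_theta = 0.10, 0.02                  # Input 1: CB estimate and CI radius
V_hat = rng.uniform(0, V_max, size=nX)               # Input 2: standard value estimate
V_p = np.maximum(V_hat - 0.3, 0.0)                   # pessimistic estimate
V_o = np.minimum(V_hat + 0.3, V_max)                 # optimistic estimate
Gamma = V_o - V_p                                    # per-state uncertainty
kappa_delta = 0.05                                   # Input 3: average shift-error bound
Delta_hat = rng.uniform(-0.1, 0.1, size=(T, nX))     # hat-Delta(x' | x_t, pi), per trajectory
Delta_hat -= Delta_hat.mean(axis=1, keepdims=True)   # each row sums to zero (a signed shift)

# Estimate (13): CB estimate plus selectively propagated next-step values.
alpha_hat = theta_hat + np.mean(Delta_hat @ V_hat)

# Width (15): in-distribution CB/shift errors, a fast concentration term,
# and point-wise RL uncertainty scaled by the estimated shift.
L = (kappa_theta
     + V_max * kappa_delta
     + 6 * V_max * np.sqrt(np.log(1 / delta) / T)
     + np.abs(Delta_hat.mean(axis=0)) @ Gamma)
```

Note how the last term vanishes when `Delta_hat` is identically zero, leaving only in-distribution and concentration terms, which is exactly the selective behavior the theorem describes.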
4 Utilizing a holdout set for estimating and constructing CIs for $\alpha^{(h)}_\pi$ allows us to treat our input estimates as fixed quantities (independent of the randomness in the sampled holdout dataset).

One of the advantages of in-distribution supervised learning is that excess risk bounds only depend on the complexity of the hypothesis class (and the number of training samples), with no dependence on the size of the feature space (see Shalev-Shwartz and Ben-David 2014). As discussed in Section 1, the statistical challenges of RL stem from the fact that learning under (state) distribution shifts is hard. For example, without additional assumptions, optimistic/pessimistic value function estimation has an unavoidable polynomial dependence on state-space size (Foster and Rakhlin 2023). Our goal is to avoid or minimize such dependencies when possible. The key benefit of Theorem 4.1 is that both our estimate ($\hat\alpha^{(h)}_\pi$) and our CI width ($L^{(h)}_\pi$) are selective in propagating/utilizing the RL estimates from Input 2, allowing us to suffer from the slower worst-case RL estimation rates only on harder instances. To better understand this, let us dig deeper into the terms in our CI width ($L^{(h)}_\pi$). Note that $\kappa^{(h)}_{\pi,\theta}$ and $\kappa^{(h)}_{\pi,\Delta}$ (see Inputs 1 and 3) bound errors averaged under the behavioral policy state distribution; that is, they bound in-distribution average errors. Hence, with appropriate inputs, the first two terms in $L^{(h)}_\pi$ shrink quickly with no dependence on state-space size (Dudik et al. 2014; Shalev-Shwartz and Ben-David 2014). The third term in $L^{(h)}_\pi$, which enjoys a $O(\sqrt{\log(1/\delta)/T})$ rate, also shrinks quickly to zero and has no dependence on state-space size. We now only need to discuss the fourth and final term in $L^{(h)}_\pi$. Unlike the previous terms, which bound in-distribution average errors, this term does depend on per-state (point-wise) errors ($\Gamma^{(h+1)}_\pi(x')$).
The reason RL algorithms seek to bound per-state errors is that these guarantees do not depend on the state distribution and are valid under any shift. We now illustrate how this robustness to state-distribution shift comes at the cost of larger error bounds, and argue that it is advantageous to scale these terms down with the estimated shift. First, as a sanity check, we show that this term is finite:

$\int_{x'} |\hat\Delta^{(h)}(x'|x^{(h)}_t, \pi)| \, \Gamma^{(h+1)}_\pi(x') \overset{(i)}{\le} \|\hat\Delta^{(h)}(\cdot|x^{(h)}_t, \pi)\|_1 \, \|\Gamma^{(h+1)}_\pi\|_\infty \overset{(ii)}{\le} 2 \|\Gamma^{(h+1)}_\pi\|_\infty$, (16)

where (i) follows from Hölder's inequality, and (ii) follows from (11). Now that we know this term is finite, we can argue that it shrinks to zero. Since $\Gamma^{(h+1)}_\pi(x')$ captures the size of per-state errors for pessimistic/optimistic value function estimates, we can expect this term to converge to zero in the limit with infinite data. The size of $\Gamma^{(h+1)}_\pi(x')$ must depend on how often states similar to $x'$ were visited at step $h + 1$ in the training data for the RL input. For simplicity, let us consider the scenario where all states are visited uniformly at step $h + 1$ under the distribution $D(\pi_b)$. In such a scenario, the frequency at which states similar to $x'$ were visited at step $h + 1$ would depend on some measure of the size of the state space $X$. This would imply that $\Gamma^{(h+1)}_\pi(x')$ shrinks at a rate that depends on some measure of the size of the state space $X$. That is, while these terms shrink, they shrink slowly. Hence per-state bounds, while independent of the state distribution, come at the cost of slower statistical rates. As shown in (Foster et al. 2021), such a dependence of confidence interval width on state-space size is unavoidable in the worst case. The key message of Theorem 4.1 is that we can move beyond this worst-case scenario by scaling these point-wise errors with the estimated shifts ($\hat\Delta^{(h)}$). For example, when $\hat\Delta^{(h)} \equiv 0$, the fourth term in $L^{(h)}_\pi$ is zero, allowing us to recover contextual bandit-style guarantees that are independent of state-space size.
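The two steps of (16) can be checked numerically for any signed shift with $\ell_1$ norm at most 2. The sketch below uses arbitrary illustrative vectors for the shift and the per-state uncertainty widths.

```python
import numpy as np

# Numeric check of the Holder-inequality bound (16) on a random signed shift.
rng = np.random.default_rng(6)
nX = 8
Delta = rng.uniform(-1, 1, size=nX)
Delta -= Delta.mean()                                # signed measure summing to zero
Delta *= 2 / max(np.abs(Delta).sum(), 2)             # enforce ||Delta||_1 <= 2, as in (11)
Gamma = rng.uniform(0, 1, size=nX)                   # per-state uncertainty widths

lhs = np.abs(Delta) @ Gamma
assert lhs <= np.abs(Delta).sum() * Gamma.max() + 1e-12   # step (i): Holder's inequality
assert lhs <= 2 * Gamma.max() + 1e-12                     # step (ii): ||Delta||_1 <= 2
```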
It is worth noting that, even when state-space size is not a concern, being selective about error/estimate propagation can improve the resulting interval widths.

5 Modifying Pessimistic Value Iteration

Pessimistic value iteration (PVI) (Jin, Yang, and Wang 2021) is a popular family of dynamic programming (DP) algorithms for offline RL. PVI can take as input an estimated reward model $\hat R$, an estimated transition model $\hat P$, and point-wise estimation uncertainty bonuses $b = (b^{(h)})_{h=1}^H$. No additional data is required. The DP procedure in PVI iterates over steps $h = H$ to $h = 1$. For any step of value function estimation, the bonuses $b = (b^{(h)})_{h=1}^H$ must bound the cumulative error in the estimated reward and transition model inputs.5 Hence, at any step $h$, (17) gives a valid pessimistic Q-value estimate $\hat Q^{(h)}_p$; here $\hat V^{(h+1)}_p$ is the pessimistic value function for the constructed policy starting from step $h + 1$ (computed in the previous dynamic programming step). At every step $h$, PVI selects the policy that maximizes the pessimistic Q-value function.

$\hat Q^{(h)}_p(\cdot, \cdot) = \hat R(\cdot, \cdot) + \int_{x'} \hat V^{(h+1)}_p(x') \hat P^{(h)}(x'|\cdot, \cdot) - b^{(h)}(\cdot, \cdot)$ (17)

Pessimistic Q-value maximization helps PVI avoid model exploitation; that is, PVI avoids picking policies with inaccurately high estimated values at step $h$ by penalizing uncertainty in the value function estimate. However, to do this, as we see in (17), PVI propagates all the uncertainty captured in future steps through $\hat V^{(h+1)}_p$. Depending on the instance, this may lead to larger-than-necessary uncertainty propagation for avoiding model exploitation. In particular, based on the results in Section 4, uncertainty from later steps does not always need to be fully propagated to estimate lower bounds for the effect of deviating from the behavioral policy at step $h$ (after fixing the policy for all future steps).
Since we can view maximizing this treatment effect as the goal of DP algorithms at any step $h$, maximizing the tighter lower bounds from Section 4 should allow us to do better on easier CB-like instances (while avoiding model exploitation).

In this section, we propose a modification of PVI called selectively pessimistic value iteration (SPVI; complete pseudo-code is available in Appendix B). The key modification is that, at any step $h$, SPVI maximizes the selectively pessimistic Q-value (18), which is the standard Q-value estimate ($\hat Q^{(h)}$) minus the bonus and the required uncertainty that needs to be propagated. Here $\hat\Delta$ is the induced shift model (that is, $\hat\Delta^{(h)}(\cdot|\cdot, \cdot) = \hat P^{(h)}(\cdot|\cdot, \cdot) - \hat P^{(h)}(\cdot|\cdot, \pi_b)$), and $\hat V^{(h+1)}_p$ and $\hat V^{(h+1)}_o$ are the pessimistic and optimistic value functions for the constructed policy starting from step $h + 1$ (computed in the previous dynamic programming step).

$\hat Q^{(h)}_{sp}(\cdot, \cdot) = \hat Q^{(h)}(\cdot, \cdot) - b^{(h)}(\cdot, \cdot) - \int_{x'} |\hat\Delta^{(h)}(x'|\cdot, \cdot)| \, (\hat V^{(h+1)}_o(x') - \hat V^{(h+1)}_p(x'))$ (18)

Justification: As we argued in earlier sections, one can view the goal at step $h$ as selecting $\pi^{(h)}$ in order to maximize $\alpha^{(h)}_\pi$. We construct a tight lower bound for $\alpha^{(h)}_\pi$ and argue that maximizing $\hat Q^{(h)}_{sp}$ maximizes this bound. As in standard pessimistic value estimation, we let the bonus $b^{(h)}(x, a)$ bound the total model estimation errors at $(x, a, h)$. Now, from the analysis in Theorem 4.1, we have that (19) is a valid lower bound on $\alpha^{(h)}_\pi$.

5 In finite-state MDPs, these bonuses can be count-based, where $b^{(h)}(x, a) = \beta \sqrt{1 / \max\{1, n_h(x, a)\}}$ and $n_h(x, a)$ is the number of times state $x$ and action $a$ are observed at step $h$ in the dataset $S$. Here $\beta$ is an algorithmic parameter. Several papers have extended these ideas to continuous state spaces (Bellemare et al. 2016; Osband et al. 2021).
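For a tabular instance, the selectively pessimistic Q-value (18) reduces to a few array operations. The sketch below uses synthetic model estimates (all placeholders, not output of any real estimator) to form the induced shift model $\hat\Delta$ and $\hat Q^{(h)}_{sp}$, and then acts greedily as SPVI would at step $h$.

```python
import numpy as np

# Sketch of the selectively pessimistic Q-value (18) for one step h.
# All model estimates below are synthetic placeholders.
rng = np.random.default_rng(4)
nX, nA = 4, 3
Q_hat = rng.uniform(0, 2, size=(nX, nA))             # standard Q-value estimate
b = rng.uniform(0, 0.2, size=(nX, nA))               # point-wise bonus
V_p = rng.uniform(0, 1, size=nX)                     # pessimistic next-step value
V_o = V_p + rng.uniform(0, 0.5, size=nX)             # optimistic next-step value
P_hat = rng.dirichlet(np.ones(nX), size=(nX, nA))    # estimated transitions P_hat[x, a, x']
pi_b = rng.dirichlet(np.ones(nA), size=nX)           # behavioral policy
P_b_hat = np.einsum('xa,xay->xy', pi_b, P_hat)       # induced behavioral transitions

# Induced shift model and selectively pessimistic Q-value (18).
Delta_hat = P_hat - P_b_hat[:, None, :]              # Delta_hat[x, a, x']
Q_sp = Q_hat - b - np.abs(Delta_hat) @ (V_o - V_p)

# SPVI then acts greedily w.r.t. Q_sp at this step.
pi_h = Q_sp.argmax(axis=1)
```

If an action leaves the next-state distribution unchanged relative to $\pi_b$, its row of `Delta_hat` is zero and no next-step uncertainty is subtracted for it, which is the selective behavior (18) is designed to achieve.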
$$\alpha^{(h)}_\pi = E_{D(\pi_b)}\!\left[V^{(h)}_{\pi_h}(x^{(h)}) - V^{(h)}_{\pi_{h+1}}(x^{(h)})\right] \geq E_{D(\pi_b)}\!\left[\hat{Q}^{(h)}_{sp}(x^{(h)}, \pi^{(h)}) - \hat{Q}^{(h)}(x^{(h)}, \pi_b^{(h)})\right] \tag{19}$$

Importantly, by maximizing this tight lower bound, we can do better on easier CB-like instances (while always avoiding model exploitation by penalizing uncertainty in estimating $\alpha^{(h)}_\pi$). This completes our justification of SPVI.

6 Simulation

To illustrate our insights, we consider a simple toy environment called Chain Bandit. At a high level, the environment has two chains: a top chain and a bottom chain. The environment also has three actions, given by a1, a2, a3. The top chain states are the most rewarding. The agent starts at (1, 0) (the first state in the top chain). At any state in the bottom chain, all the actions lead to the same transition (moving to the next state in the bottom chain); these are essentially bandit states. In the top chain, both a1 and a2 lead to the same transition (moving to the next state in the top chain), and a3 moves the agent to the next state in the bottom chain. In the top chain, the highest cumulative reward comes from never taking action a3; however, the highest immediate reward comes from selecting a3 (which makes planning beneficial in this environment). Note that at every state, action a3 is a sub-optimal action. See Appendix C for a complete description. The Chain Bandit environment is constructed to have both dynamic (some actions result in a different next-state distribution) and bandit-like (non-dynamic) elements. This ensures that, while planning and some estimate/uncertainty propagation are necessary, complete uncertainty propagation can be unnecessary for evaluating the policies of interest.
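The transition structure above can be sketched as a small environment class. The specific reward values below are our illustrative assumptions (the paper's exact rewards are in its Appendix C); only the transition structure follows the description:

```python
class ChainBandit:
    """Toy Chain Bandit sketch: a top chain and a bottom chain of length n.

    States are (chain, position): chain 0 = top (most rewarding), chain 1 = bottom.
    Actions a1/a2 (coded 0/1) advance along the current chain; a3 (coded 2)
    drops the agent from the top chain into the bottom chain. In the top chain,
    a3 pays a higher immediate reward but forfeits the larger per-step
    top-chain reward, so planning is needed to avoid it.
    """

    def __init__(self, n=3):
        self.n = n

    def reset(self):
        self.state = (0, 0)  # start at (1, 0) in the paper's 1-indexed notation
        return self.state

    def step(self, action):
        chain, pos = self.state
        if chain == 0:  # top chain
            reward = 2.0 if action == 2 else 1.0   # a3: higher immediate reward
            chain = 1 if action == 2 else 0        # a3 drops to the bottom chain
        else:           # bottom chain: all actions identical (bandit states)
            reward = 0.1
        pos = min(pos + 1, self.n - 1)
        self.state = (chain, pos)
        return self.state, reward
```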
Throughout this section, we consider Chain Bandit with a chain length of 3 and the following behavioral policy: at every state and step, $\pi_b$ selects action a3 with probability 0.8 and selects each of the other two actions with probability 0.1. From the data collected, we evaluate standard and selective uncertainty propagation on two tasks: (1) estimating upper/lower bounds for $\alpha^{(h)}_\pi$; and (2) offline policy learning. Since uncertainty propagation is the focus of this simulation, to make a fair comparison, both standard and selective propagation use the same tabular approach to estimate a (step-independent) reward/transition model, and use the same (step-independent) count-based bonuses to account for model estimation errors.^6 Here the bonus is given by $b(x,a) = \sqrt{\ln(|X|\,|A|\,H/\delta)/n(x,a)}$, where $n(x,a)$ is the number of times action $a$ was taken at state $x$, and the confidence parameter is $\delta = 0.05$. For the step $h$ and policy $\pi$ of interest, standard CIs for $\alpha^{(h)}_\pi$ are constructed using pessimistic/optimistic value estimates at step $h$ for policies $\pi_h$ and $\pi_{h+1}$, i.e., utilizing (20).

$$E_{D(\pi_b)}\!\left[\hat{V}^{(h)}_{\pi_h,o}(x^{(h)}) - \hat{V}^{(h)}_{\pi_{h+1},p}(x^{(h)})\right] \geq \alpha^{(h)}_\pi \geq E_{D(\pi_b)}\!\left[\hat{V}^{(h)}_{\pi_h,p}(x^{(h)}) - \hat{V}^{(h)}_{\pi_{h+1},o}(x^{(h)})\right] \tag{20}$$

For selective CIs, we use (19) to construct the lower bound and similarly construct the upper bound. Note that converting inequalities like (19) and (20) into empirical bounds is straightforward since our training data come from $D(\pi_b)$. Figure 1 plots CIs for both selective and standard uncertainty propagation as we vary the evaluation policy. As expected, the benefits over standard pessimism are larger when the next-state distribution shift is smaller, that is, when the evaluation policy is closer to the behavioral policy. In Figure 2, we plot the value of the learnt policy from various algorithms as we vary the number of training episodes. In particular, we compare SPVI, PVI (Jin, Yang, and Wang 2021), and pessimistic supervised learning (PSL).
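The count-based bonus above can be computed directly from visit counts in the data set; a small sketch (the helper name and interface are ours):

```python
import math
from collections import Counter

def count_bonus(counts, n_states, n_actions, horizon, delta=0.05):
    """Count-based bonus sketch: b(x, a) = sqrt(ln(|X| |A| H / delta) / n(x, a)).

    counts: Counter mapping (state, action) -> visit count in the data set.
    Unvisited pairs receive the maximal bonus (the count is floored at 1).
    """
    log_term = math.log(n_states * n_actions * horizon / delta)

    def b(x, a):
        n = max(1, counts[(x, a)])
        return math.sqrt(log_term / n)

    return b
```

Frequently visited pairs get small bonuses, so their estimated values are barely penalized, while rarely visited pairs are treated pessimistically.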
Here PSL refers to pessimistic bandit policy optimization applied to each step without planning. The Chain Bandit environment benefits from planning, so PSL performs poorly as expected. On all Chain Bandit simulations we tried, SPVI was by far the best-performing algorithm. We chose the behavioral policy described earlier because it is more disadvantageous for SPVI: since selective pessimism has an initial bias against policies that lead to significant shifts, we chose a highly sub-optimal behavioral policy (selecting a3 with probability 0.8). While this gives SPVI a worse start than PVI, SPVI eventually outperforms the other algorithms. We also run simulations for CI construction and policy learning on the standard Grid World (see Appendix D); since this is a very dynamic environment, both standard and selective propagation have similar performance.^7

^6 The model estimates and bonuses are not step dependent since the reward/transition functions in Chain Bandit are the same for all steps. Further, a tabular approach to reward (transition) estimation simply means using the average reward (average one-hot next-state vector) observed at any state-action pair as its reward (transition) estimate.

Figure 1: We plot CIs for $\alpha^{(2)}_\pi$ while varying the evaluation policy. These evaluation policies are parameterized by $\lambda \in [0, 1]$. For all states/steps, the probabilities of selecting a1, a2, and a3 are $(1-\lambda)/2$, $(1-\lambda)/2$, and $\lambda$ respectively. Note that the evaluation policy is the same as the behavioral policy for $\lambda = 0.8$. The number of training episodes is 10000, and the plots are averaged over 10 runs.

Figure 2: Policy learning with a bad behavioral policy.

7 Conclusion

We introduce selective propagation, a general approach to interpolate between CB and RL techniques, achieving guarantees that adapt to instance hardness.
Further developing this could impact real-world problems (e.g., recommendation systems, mHealth, EdTech) that lie in between the two settings.

^7 All algorithm runs take less than 2 minutes on a MacBook Pro M2 16GB.

References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems 24, 2312-2320.
Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. In Proceedings of the 31st International Conference on Machine Learning, 1638-1646.
Agrawal, S.; and Goyal, N. 2012. Analysis of Thompson Sampling for the Multi-Armed Bandit Problem. In Proceedings of the 25th Annual Conference on Learning Theory, 39.1-39.26.
Agrawal, S.; and Goyal, N. 2013. Thompson Sampling for Contextual Bandits with Linear Payoffs. In Proceedings of the 30th International Conference on Machine Learning, 127-135.
Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying Count-Based Exploration and Intrinsic Motivation. In Advances in Neural Information Processing Systems 29.
Bobu, A.; Tzeng, E.; Hoffman, J.; and Darrell, T. 2018. Adapting to Continuously Shifting Domains. Workshop paper.
Carranza, A. G.; Krishnamurthy, S. K.; and Athey, S. 2023. Flexible and Efficient Contextual Bandits with Heterogeneous Treatment Effect Oracles. In International Conference on Artificial Intelligence and Statistics, 7190-7212. PMLR.
Chapelle, O.; and Li, L. 2012. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems 24, 2249-2257.
Dani, V.; Hayes, T.; and Kakade, S. 2008. Stochastic Linear Optimization under Bandit Feedback. In Proceedings of the 21st Annual Conference on Learning Theory, 355-366.
Dudik, M.; Erhan, D.; Langford, J.; and Li, L. 2014. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29(4): 485-511.
Farshchian, A.; Gallego, J. A.; Cohen, J. P.; Bengio, Y.; Miller, L. E.; and Solla, S. A. 2018. Adversarial Domain Adaptation for Stable Brain-Machine Interfaces. arXiv preprint arXiv:1810.00045.
Foster, D. J.; Krishnamurthy, A.; Simchi-Levi, D.; and Xu, Y. 2021. Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation. arXiv preprint arXiv:2111.10919.
Foster, D. J.; and Rakhlin, A. 2023. Foundations of Reinforcement Learning and Interactive Decision Making. arXiv preprint arXiv:2312.16730.
Foster, D. J.; and Syrgkanis, V. 2019. Orthogonal Statistical Learning. arXiv preprint arXiv:1901.09036.
Garivier, A.; and Cappé, O. 2011. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. In Proceedings of the 24th Annual Conference on Learning Theory, 359-376.
Jin, Y.; Yang, Z.; and Wang, Z. 2021. Is Pessimism Provably Efficient for Offline RL? In International Conference on Machine Learning, 5084-5096. PMLR.
Krishnamurthy, A.; Wu, Z. S.; and Syrgkanis, V. 2018. Semiparametric Contextual Bandits. In International Conference on Machine Learning, 2776-2785. PMLR.
Künzel, S. R.; Sekhon, J. S.; Bickel, P. J.; and Yu, B. 2019. Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning. Proceedings of the National Academy of Sciences, 116(10): 4156-4165.
Langford, J.; and Zhang, T. 2008. The Epoch-Greedy Algorithm for Contextual Multi-Armed Bandits. In Advances in Neural Information Processing Systems 20, 817-824.
Lattimore, T.; and Szepesvári, C. 2019. Bandit Algorithms. Cambridge University Press.
Levine, S.; Kumar, A.; Tucker, G.; and Fu, J. 2020. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web.
Martin, J.; Sasikumar, S. N.; Everitt, T.; and Hutter, M. 2017. Count-Based Exploration in Feature Space for Reinforcement Learning. arXiv preprint arXiv:1706.08090.
Moerland, T. M.; Broekens, J.; Plaat, A.; Jonker, C. M.; et al. 2023. Model-Based Reinforcement Learning: A Survey. Foundations and Trends in Machine Learning, 16(1): 1-118.
Nie, X.; and Wager, S. 2021. Quasi-Oracle Estimation of Heterogeneous Treatment Effects. Biometrika, 108(2): 299-319.
Osband, I.; Wen, Z.; Asghari, S. M.; Dwaracherla, V.; Ibrahimi, M.; Lu, X.; and Van Roy, B. 2021. Epistemic Neural Networks. arXiv preprint arXiv:2107.08924.
Russo, D.; Van Roy, B.; Kazerouni, A.; Osband, I.; and Wen, Z. 2018. A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11(1): 1-96.
Shalev-Shwartz, S.; and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Sutton, R. 1988. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3: 9-44.
Sutton, R.; and Barto, A. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Sutton, R.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, 1057-1063.
Syrgkanis, V.; and Zhan, R. 2023. Post-Episodic Reinforcement Learning Inference. arXiv preprint arXiv:2302.08854.
Thompson, W. R. 1933. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3-4): 285-294.
Vergara, A.; Vembu, S.; Ayhan, T.; Ryan, M. A.; Homer, M. L.; and Huerta, R. 2012. Chemical Gas Sensor Drift Compensation Using Classifier Ensembles. Sensors and Actuators B: Chemical, 166: 320-329.
Wang, Y.; Wang, R.; Du, S. S.; and Krishnamurthy, A. 2019. Optimism in Reinforcement Learning with Generalized Linear Function Approximation. arXiv preprint arXiv:1912.04136.
Williams, R. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4): 229-256.
Yin, M.; and Wang, Y.-X. 2021. Towards Instance-Optimal Offline Reinforcement Learning with Pessimism. Advances in Neural Information Processing Systems, 34: 4065-4078.
Zanette, A.; and Brunskill, E. 2019. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge Using Value Function Bounds. In International Conference on Machine Learning, 7304-7312. PMLR.
Zhang, K. W.; Gottesman, O.; and Doshi-Velez, F. 2022. A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes. arXiv preprint arXiv:2208.00250.