IMPROVING OFFLINE RL BY BLENDING HEURISTICS

Published as a conference paper at ICLR 2024

Sinong Geng (Princeton University, Princeton, NJ); Aldo Pacchiano (Boston University and Broad Institute of MIT and Harvard, Boston, MA); Andrey Kolobov (Microsoft Research, Redmond, WA); Ching-An Cheng (Microsoft Research, Redmond, WA)

ABSTRACT

We propose Heuristic Blending (HUBL), a simple performance-improving technique for a broad class of offline RL algorithms based on value bootstrapping. HUBL modifies the Bellman operators used in these algorithms, partially replacing the bootstrapped values with heuristic ones that are estimated with Monte-Carlo returns. For trajectories with higher returns, HUBL relies more on the heuristic values and less on bootstrapping; otherwise, it leans more heavily on bootstrapping. HUBL is very easy to combine with many existing offline RL implementations: it simply relabels the offline dataset with adjusted rewards and discount factors. We derive a theory that explains HUBL's effect on offline RL as reducing offline RL's complexity and thus increasing its finite-sample performance. Furthermore, we empirically demonstrate that HUBL consistently improves the policy quality of four state-of-the-art bootstrapping-based offline RL algorithms (ATAC, CQL, TD3+BC, and IQL) by 9% on average over 27 datasets of the D4RL and Meta-World benchmarks.

1 INTRODUCTION

Offline reinforcement learning (RL) aims to learn decision-making strategies from static logged datasets (Lange et al., 2012; Fujimoto et al., 2019). It has attracted increased interest in recent years: the availability of large offline datasets is on the rise, while the online exploration required by alternative approaches such as online RL (Sutton and Barto, 2018) remains expensive and risky in many real-world applications, such as robotics and healthcare.

Figure 1: HUBL and offline RL. Base offline RL method: Input: data $D = \{(s, a, s', r, \gamma)\}$; for each iteration, sample $(s, a, r, s', \gamma) \sim D$ and perform a DP update using $(s, a, r, s', \gamma)$. HUBL + offline RL: Input: data $D = \{(s, a, s', r, \gamma)\}$; relabel $D$ with modified rewards $\tilde{r}$ and discounts $\tilde{\gamma}$ using heuristics $h$ to obtain $\tilde{D}$; for each iteration, sample $(s, a, \tilde{r}, s', \tilde{\gamma}) \sim \tilde{D}$ and perform a DP update using $(s, a, \tilde{r}, s', \tilde{\gamma})$.

Among offline RL algorithms, we focus on model-free approaches using dynamic programming with value bootstrapping. These algorithms, including the commonly used CQL (Kumar et al., 2020), TD3+BC (Fujimoto and Gu, 2021), IQL (Kostrikov et al., 2022), and ATAC (Cheng et al., 2022), have demonstrated strong performance in offline RL benchmarks (Fu et al., 2020; Gulcehre et al., 2020). They follow the actor-critic scheme and adopt the principle of pessimism in the face of uncertainty to optimize an agent via a performance lower bound that penalizes taking unfamiliar actions. Despite their strengths, existing model-free offline RL methods also have a major weakness: they do not perform consistently. An algorithm that does well on one dataset may struggle on another, sometimes even underperforming behavior cloning (see Tarasov et al. (2022) and Appendix D.2). These performance fluctuations stand in the way of applying even the strongest offline RL approaches to practical problems. In this work, we propose Heuristic Blending (HUBL), an easy-to-implement technique to address offline RL's performance inconsistency.
HUBL is an aide that operates in combination with a bootstrapping-based offline RL algorithm, using heuristic state value estimates¹ to modify the rewards and discounts in the dataset that the base offline RL algorithm consumes. Effectively, this modification blends heuristic values into dynamic programming to partially replace bootstrapping. Relying less on bootstrapping alleviates potential issues that bootstrapping causes and helps achieve more stable performance. Combining HUBL with an offline RL method is very simple, as summarized in Figure 1, and amounts to running this method on a version of the original dataset with modified rewards $\tilde{r}$ and discounts $\tilde{\gamma}$:
$$\tilde{r} = r + \gamma \lambda h \quad \text{and} \quad \tilde{\gamma} = \gamma(1 - \lambda),$$
where $\tilde{r}$ is the reward blended with a heuristic $h$, $\tilde{\gamma}$ is the reduced discount, and $\lambda \in [0, 1]$ is a blending factor representing the degree of trust towards the heuristic. We propose to set $h$ as the Monte-Carlo returns of the trajectories in the dataset for offline RL. Such heuristics are efficient and stable to compute, unlike bootstrapped Q-value estimates.

The blending factor $\lambda$ in HUBL can be trajectory-dependent. Intuitively, we want $\lambda$ to be large (relying more on the heuristic) at trajectories where the behavior policy that collected the dataset performs well, and small (relying more on bootstrapped Q-values) otherwise. We provide three practical designs for $\lambda$; they use only one hyperparameter, which, empirically, does not need active tuning.

We analyze HUBL's performance both theoretically and empirically. Theoretically, we provide a finite-sample performance bound for a tabular offline RL algorithm with HUBL by framing it as solving a reshaped Markov decision process (MDP). To our knowledge, this is the first theoretical result for RL with heuristics in the offline setting. Our analysis shows that HUBL performs a bias-regret trade-off. On the one hand, solving the reshaped MDP with a smaller discount factor requires less bootstrapping and is relatively easier, so the regret is smaller. On the other hand, HUBL induces bias due to reshaping the original MDP. Nonetheless, we demonstrate that the bias can be controlled by setting the $\lambda$ factor based on the above intuition, allowing HUBL to improve the performance of the base offline RL method. Empirically, we run HUBL with the four aforementioned offline RL methods (CQL, TD3+BC, IQL, and ATAC) and show that enhancing these state-of-the-art algorithms with HUBL can improve their performance by 9% on average across 27 datasets of D4RL (Fu et al., 2020) and Meta-World (Yu et al., 2020). Notably, on some datasets where the base offline RL method shows inconsistent performance, HUBL achieves more than 50% relative performance improvement.

2 RELATED WORK

Bootstrapping-based offline RL. A fundamental challenge of bootstrapping-based offline RL is the deadly triad (Sutton and Barto, 2018): a negative interference between 1) off-policy learning from data with limited support, 2) value bootstrapping, and 3) the function approximator. Modern offline RL algorithms such as CQL (Kumar et al., 2020), TD3+BC (Fujimoto and Gu, 2021), IQL (Kostrikov et al., 2022), ATAC (Cheng et al., 2022), PEVI (Jin et al., 2021), MOReL (Kidambi et al., 2020), and VI-LCB (Rashidinejad et al., 2021) employ pessimism to discourage the agent from taking actions unsupported by the data, which has proved to be an effective strategy to address the issue of limited support. However, they still suffer from a combination of errors in bootstrapping and function approximation.
HUBL aims to address this with stable-to-compute Monte-Carlo return heuristics and thus is a technique complementary to pessimism.

RL by blending multi-step returns. The idea of blending multi-step Monte-Carlo returns into the bootstrapping operator to reduce the degree of bootstrapping and thereby increase performance has a long history in RL. This technique has been widely used in temporal difference methods (Sutton and Barto, 2018), where the blending is achieved by modifying gradient updates (Seijen and Sutton, 2014; Sutton et al., 2016; Jiang et al., 2021) and reweighting observations (Imani et al., 2018). It has also been applied to improve Q-function estimation (Wright et al., 2013), online exploration (Ostrovski et al., 2017; Bellemare et al., 2016), and sensitivity to model misspecification (Zanette et al., 2021), and is especially effective for sparse-reward problems (Wilcox et al., 2022). In contrast to most of the aforementioned works, which focus on blending Monte-Carlo returns as part of an RL algorithm's online operation, HUBL is designed for the offline setting and acts as a simple data relabeling step. Recently, Wilcox et al. (2022) have also proposed the idea of data relabeling, but their design takes a max of multi-step returns and bootstrapped values and as a result tends to overestimate Q-functions. We observed this to be detrimental when the data has limited support.

¹In this work, a heuristic is a mapping from states to $\mathbb{R}$ (Mausam and Kolobov, 2012). See Section 3.2.

RL with heuristics. More generally, HUBL relates to the framework of blending heuristics (which might be estimating quantities other than a policy's value) into bootstrapping (Cheng et al., 2021; Hoeller et al., 2020; Bejjani et al., 2018; Zhong et al., 2013). However, the existing results focus only on the online case and do not conclusively show whether blending heuristics is valid in the offline case. For instance, the theoretical analysis in Cheng et al. (2021) breaks when applied to the offline setting, as we will demonstrate in Section 5. The major difference is that online approaches rely heavily on collecting new data with the learned policy, which is impossible in the offline case. In this paper, we employ a novel analysis technique to demonstrate that blending heuristics is effective even in offline RL. To the best of our knowledge, ours is the first work to extend heuristic blending to the offline setting with both rigorous theoretical analysis and empirical results demonstrating performance improvement. One key insight, inspired by the transition-dependent discount factor of White (2017), is the adoption of a trajectory-dependent blending factor $\lambda$, which is both a performance-improving design and a novel analysis technicality.

Discount regularization. HUBL modifies the reward with a heuristic and reduces the discount. Discount regularization, on the other hand, is a complexity reduction technique that reduces only the discount. The idea of simplifying decision-making problems by reducing discount factors can be traced back to Blackwell optimality in the known-MDP setting (Blackwell, 1962). Most existing results on discount regularization (Petrik and Scherrer, 2008; Jiang et al., 2015) study the MDP setting or the online RL setting (Van Seijen et al., 2019). Recently, Hu et al. (2022) have shown that discount regularization can also reduce the complexity of offline RL and serve as an extra source of pessimism.
However, as we will show, simply reducing the discount without the compensation of blending heuristics can be excessively pessimistic in the offline setting and introduce a large bias that hurts performance. In Section 6.1, we rigorously analyze this bias and empirically demonstrate the advantages of HUBL over discount regularization.

3 BACKGROUND

In this section, we define the problem setup of offline RL (Section 3.1), review dynamic programming with value bootstrapping, and briefly survey methods using heuristics (Section 3.2).

3.1 OFFLINE RL AND NOTATION

We consider offline RL in a Markov decision process (MDP) $M = (S, A, P, r, \gamma)$, where $S$ denotes the state space, $A$ denotes the action space, $P$ denotes the transition function, $r : S \times A \to [0, 1]$ denotes the reward function, and $\gamma \in [0, 1)$ denotes the discount factor. A decision-making policy $\pi$ is a mapping from $S$ to $A$, and its value function is defined as $V^{\pi}(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_t \sim \pi(\cdot|s_t)]$. In addition, we define $Q^{\pi}(s, a) := r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^{\pi}(s')]$ as its state-action value function (i.e., Q-function). We use $V^*$ to denote the value function of the optimal policy $\pi^*$, which is our performance target.

In addition, we introduce several definitions of distributions. First, we define the average state-action distribution of a policy $\pi$ starting from an initial state $s_0$ as $d^{\pi}(s, a; s_0) := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t d_t^{\pi}(s, a; s_0)$, where $d_t^{\pi}(s, a; s_0)$ is the state-action distribution at time $t$ generated by running policy $\pi$ from the initial state $s_0$. We assume that the MDP starts with a fixed initial state distribution $d_0$. With slight abuse of notation, we define the average state-action distribution starting from $d_0$ as $d^{\pi}(s, a) := d^{\pi}(s, a; d_0)$, and $d^{\pi}(s, a, s') := d^{\pi}(s, a) P(s'|s, a)$.

The objective of offline RL is to learn a well-performing policy $\hat{\pi}$ using a pre-collected offline dataset $D$. The agent has no knowledge of the MDP $M$ except the information contained in $D$, and it cannot perform online interactions with the environment to collect more data. We assume $D := \{\tau\}$ contains multiple trajectories collected by a behavior policy, where each $\tau = \{(s_t, a_t, r_t)\}_{t=1}^{T_\tau}$ denotes a trajectory of length $T_\tau$. Suppose these trajectories contain $N$ transition tuples in total. With abuse of notation, we also write $D := \{(s, a, s', r, \gamma)\}$, where the state $s$ and action $a$ follow a distribution $\mu(s, a)$ induced by the behavior policy, $s'$ is the state after each transition, $r$ is the reward at $(s, a)$, and $\gamma$ is the discount factor of the MDP. Note that the value of the discount $\gamma$ is the same in each tuple. We use $\Omega$ to denote the support of $\mu(s, a)$. We do not make the full-support assumption, in the sense that the dataset $D$ may not contain a tuple for every $(s, a, s')$ transition in the MDP.

3.2 DYNAMIC PROGRAMMING WITH BOOTSTRAPPING AND HEURISTICS

Bootstrapping. Many offline RL methods leverage dynamic programming with value bootstrapping. Given a policy $\pi$, recall that $Q^{\pi}$ and $V^{\pi}$ satisfy the Bellman equation:
$$Q^{\pi}(s, a) = r(s, a) + \underbrace{\gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^{\pi}(s')]}_{\text{(bootstrapping)}}. \quad (1)$$
These bootstrapping-based methods compute $Q^{\pi}(s, a)$ using an approximated version of (1): given sampled tuples $(s, a, s', \gamma, r)$, such methods minimize the difference between the two sides of (1) with both $Q^{\pi}$ and $V^{\pi}$ replaced by function approximators² (a minimal sketch of this objective is given below).
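The following is a minimal sketch of the squared Bellman-error objective that such approximated updates minimize. It is illustrative only and not the implementation of any particular base method; `q_net` and `v_net` stand for generic function approximators, and the batch layout is an assumption.

```python
import torch

def bellman_error_loss(q_net, v_net, batch):
    """Squared difference between the two sides of the Bellman equation (1),
    with Q^pi and V^pi replaced by function approximators."""
    s, a, r, s_next, gamma = batch                 # tensors sampled from the offline dataset D
    q = q_net(s, a)                                # left-hand side: Q(s, a)
    with torch.no_grad():                          # the bootstrapped target is not differentiated through
        target = r + gamma * v_net(s_next)         # right-hand side: r + gamma * V(s')
    return ((q - target) ** 2).mean()
```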
With limited offline data, learning the function approximator using bootstrapping can be challenging and yields inconsistent performance across different datasets (Dulac-Arnold et al., 2021; Sutton and Barto, 2018; Kumar et al., 2019), as we will also see later in the experiment section (Section 6).

Heuristics. A heuristic is a value function $h : S \to \mathbb{R}$ calculated using domain knowledge, Monte-Carlo averages, or pre-training. Heuristics are widely used in online RL, planning, and control to improve the performance of decision-making (Kolobov et al., 2010; Zhong et al., 2013; Bejjani et al., 2018; Hoeller et al., 2020; Cheng et al., 2021). In this paper, we focus on the offline setting and consider heuristics $h$ that approximate the value function of the behavior policy. They can be estimated from $D$ via Monte-Carlo methods.

4 HEURISTIC BLENDING (HUBL)

In this section we describe our main contribution, Heuristic Blending (HUBL), an algorithm that works in combination with bootstrapping-based offline RL methods and improves their performance.

4.1 MOTIVATION

Algorithm 1: HUBL + Offline RL
1: Input: Dataset $D = \{(s, a, s', r, \gamma)\}$
2: Compute $h_t$ for each trajectory in $D$
3: Compute $\lambda_t$ for each trajectory in $D$
4: Relabel $r$ and $\gamma$ by $h_t$ and $\lambda_t$ as $\tilde{r}$ and $\tilde{\gamma}$ and create $\tilde{D} = \{(s, a, s', \tilde{r}, \tilde{\gamma})\}$
5: $\hat{\pi} \leftarrow$ Offline RL on $\tilde{D}$

HUBL uses a heuristic computed as the Monte-Carlo return of the behavior policy in the training dataset $D$ to improve offline RL. It reduces an offline RL algorithm's bootstrapping at trajectories where the behavior policy performs well, i.e., where the value of the behavior policy (i.e., the heuristic value) is high. With a smaller amount of bootstrapping, it mitigates bootstrapping-induced issues with convergence stability and performance. In addition, since the extent of blending between the heuristic and bootstrapping is trajectory-dependent, HUBL introduces only limited performance bias to the base algorithm, and therefore can improve its performance overall.

4.2 ALGORITHM

As summarized in Algorithm 1, HUBL is easy to implement: first, relabel a base offline RL algorithm's training dataset $D = \{(s, a, s', r, \gamma)\}$ with modified rewards $\tilde{r}$ and discount factors $\tilde{\gamma}$, creating a new dataset $\tilde{D} := \{(s, a, s', \tilde{r}, \tilde{\gamma})\}$; next, run the base algorithm on $\tilde{D}$. The data relabeling is done in three steps.

Step 0: As a preparation step, we convert the data tuples in $D$ back to trajectories $\tau = \{(s_t, a_t, r_t)\}_{t=1}^{T_\tau}$. For the next two steps, we work on data trajectories instead of data tuples to compute heuristics and blending factors.

Step 1: Computing the heuristic $h_t$. We compute heuristics by Monte-Carlo returns. For each $\tau = \{(s_t, a_t, r_t)\}_{t=1}^{T_\tau} \in D$, we calculate the heuristic as³ $h_t = \sum_{k=t}^{T_\tau} \gamma^{k-t} r_k$ and update the data trajectory as $\tau \leftarrow \{(s_t, a_t, r_t, h_t)\}_{t=1}^{T_\tau}$. A sketch of this computation is given below.

²Q-learning-based offline RL in Kostrikov et al. (2022) uses a dynamic programming equation similar to (1) but with $V^{\pi}(s) = \max_{a \in A} Q^{\pi}(s, a)$.

³In our implementation, we also train a value function approximator to bootstrap at the end of a trajectory if the trajectory ends due to a timeout, as opposed to reaching the end of the problem horizon or a terminal state.
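The Monte-Carlo heuristic of Step 1 is a simple backward pass over each trajectory's rewards. The sketch below assumes a trajectory stored as a list of (s_t, a_t, r_t) tuples and omits the timeout bootstrapping mentioned in footnote 3; the names are illustrative.

```python
def compute_heuristics(trajectory, gamma):
    """Step 1: h_t = sum_{k=t}^{T} gamma^(k-t) * r_k for one trajectory."""
    rewards = [r for (_s, _a, r) in trajectory]
    h = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # h_t = r_t + gamma * h_{t+1}
        h[t] = running
    return h                                     # appended to the trajectory as h_t
```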
Step 2: Computing the blending factor $\lambda_t$. We append a scalar $\lambda_t = \lambda(\tau) \in [0, 1]$ at each time point $t$ of each trajectory as the blending factor, leading to $\tau \leftarrow \{(s_t, a_t, r_t, h_t, \lambda_t)\}_{t=1}^{T_\tau}$. $\lambda_t$ indicates the confidence in the heuristic on the trajectory. Intuitively, $\lambda_t$ decides the contribution of the heuristic relative to bootstrapped values in the dynamic programming update of the Q-function. We want $\lambda_t$ to be closer to 1 when the heuristic value $h_t$ is higher (i.e., at states where the heuristic is closer to the optimal Q-value), making offline RL rely more on the heuristic, and closer to zero when the heuristic is lower, making offline RL use more bootstrapping. We experiment with three different designs of $\lambda(\tau)$; a sketch of Steps 2 and 3 is given after this list.

Constant: As a baseline, we consider $\lambda(\tau) = \alpha \in [0, 1]$ for all trajectories. We show that, despite forcing the same heuristic weight for every state, this formulation already provides performance improvements.

Sigmoid: As an alternative, we use the sigmoid function to construct a trajectory-dependent blending function $\lambda(\tau) = \alpha \, \sigma\big(\sum_{t=1}^{T_\tau} h(s_t)/T_\tau\big)$, where $\alpha \in [0, 1]$ is a tunable constant and $\sigma$ is the sigmoid function. Thus, $\lambda(\tau)$ varies with the performance of the behavior policy over the data.

Rank: Similar to the sigmoid labeling function, we provide a rank labeling function $\lambda(\tau) = \alpha \sum_{\tau' \in D} \mathbb{1}\{\bar{h}(\tau') \le \bar{h}(\tau)\}/n$, where $n$ is the number of trajectories in $D$ and $\bar{h}(\tau) = \frac{1}{T_\tau}\sum_{t=1}^{T_\tau} h(s_t)$ is the average heuristic value of trajectory $\tau$.

Step 3: Relabeling $r$ and $\gamma$. Finally, we relabel the reward as $\tilde{r}$ and the discount factor as $\tilde{\gamma}$ in each tuple of $D$. To this end, we first convert the updated data trajectories $\{\{(s_t, a_t, r_t, h_t, \lambda_t)\}_{t=1}^{T_\tau}\}$ back into data tuples $\{(s, a, s', r, \gamma, h', \lambda')\}$, where $h'$ and $\lambda'$ denote the next-step heuristic and blending factor. Then, for each data tuple, we compute
$$\tilde{r} = r + \gamma \lambda' h' \quad \text{and} \quad \tilde{\gamma} = \gamma(1 - \lambda'), \quad (2)$$
to form a new dataset $\tilde{D} := \{(s, a, s', \tilde{r}, \tilde{\gamma})\}$. Intuitively, one can interpret $\tilde{r}$ as injecting a heuristic-dependent quantity $\gamma \lambda' h'$ into the original reward, and $\tilde{\gamma}$ as reducing bootstrapping by shrinking the original discount factor by a factor of $1 - \lambda'$. We formally justify the design of $\tilde{r}$ and $\tilde{\gamma}$ in Section 5.
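Below is the minimal sketch of Steps 2 and 3 referenced above. The data layout, function names, and the default α value are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import numpy as np

def blending_factors(trajs_h, design="rank", alpha=0.1):
    """Step 2: one lambda(tau) per trajectory; trajs_h[i] is the list of h_t for trajectory i."""
    h_bar = np.array([np.mean(h) for h in trajs_h])       # average heuristic per trajectory
    if design == "constant":
        return alpha * np.ones(len(trajs_h))
    if design == "sigmoid":
        return alpha / (1.0 + np.exp(-h_bar))             # alpha * sigmoid(mean_t h(s_t))
    # rank: fraction of trajectories whose average heuristic is not larger
    return alpha * np.array([(h_bar <= v).mean() for v in h_bar])

def relabel_trajectory(trajectory, h, lam, gamma):
    """Step 3: per-tuple relabeling following eq. (2), using the next-step h' and lambda'."""
    relabeled = []
    for t in range(len(trajectory) - 1):
        s, a, r = trajectory[t]
        s_next = trajectory[t + 1][0]
        h_next = h[t + 1]                                  # next-step heuristic h'
        r_tilde = r + gamma * lam * h_next                 # r~ = r + gamma * lambda' * h'
        g_tilde = gamma * (1.0 - lam)                      # gamma~ = gamma * (1 - lambda')
        relabeled.append((s, a, s_next, r_tilde, g_tilde))
    return relabeled                                       # tuples of D~ consumed by the base algorithm
```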
5 UNDERSTANDING HUBL

In this section, we take a deeper look into how HUBL works. At a high level, our theoretical analysis shows that the modification made by HUBL introduces a trade-off between bias and regret (the latter is similar to the variance of policy learning) into the base offline RL algorithm. While an important part of our analysis is for tabular settings (Section 5.3), we believe it still provides valuable intuition as to why reshaping rewards and discounts per (2) gives a performance boost to state-of-the-art offline RL algorithms in the continuous-state-space benchmarks in Section 6.

5.1 HUBL AS MDP RESHAPING

We analyze HUBL by viewing it as solving a reshaped MDP $\tilde{M} := (S, A, P, \tilde{r}, \tilde{\gamma})$ constructed by blending heuristics into the original MDP. To this end, we make a simplification by assuming that both the heuristic and the blending factor are functions of states. Specifically, let $\Omega$ denote the support of the data distribution; we suppose that some functions $h(\cdot) : \Omega \to \mathbb{R}$ and $\lambda(\cdot) : \Omega \to [0, 1]$ are given to HUBL. For analysis, we extend them to out-of-$\Omega$ states as below:
$$h(s) = \begin{cases} h(s) & \text{for } s \in \Omega \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad \lambda(s, s') = \begin{cases} \lambda(s') & \text{for } s, s' \in \Omega \\ 0 & \text{otherwise.} \end{cases} \quad (3)$$
We note that this extension is only for the purpose of analysis, since HUBL never uses the values of $h(\cdot)$ and $\lambda(\cdot)$ outside $\Omega$; the given $h$ can take any value out of $\Omega$ and the following theorems still hold. We define the reshaped MDP $\tilde{M}$ with the redefined reward function and discount factor
$$\tilde{r}(s, a) := r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[\lambda(s, s') h(s')\big] \quad \text{and} \quad \tilde{\gamma}(s, s') := \gamma\big(1 - \lambda(s, s')\big),$$
respectively. Note that we blend the original reward function with the expected heuristic and blending factor while adjusting the original discount factor correspondingly. The extent of blending is determined by the function $\lambda(\cdot)$. Notice that this reshaped MDP has a transition-dependent⁴ discount factor $\tilde{\gamma}(s, s')$, which is compatible with the generic unifying task specification of White (2017). This novel use of a transition-dependent discount factor is a key analysis technique for showing that blending heuristics is effective in the offline setting.

⁴A special case of trajectory-dependent discounts.

Reshaped Dynamic Programming. When solving this reshaped MDP by dynamic programming, the Bellman equation (1) changes accordingly into
$$\tilde{Q}^{\pi}(s, a) = \tilde{r}(s, a) + \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[\tilde{\gamma}(s, s') \tilde{V}^{\pi}(s')\big] = r(s, a) + \gamma \underbrace{\mathbb{E}_{s' \sim P(\cdot|s,a)}\big[(1 - \lambda(s, s')) \tilde{V}^{\pi}(s')\big]}_{\text{bootstrapping}} + \gamma \underbrace{\mathbb{E}_{s' \sim P(\cdot|s,a)}\big[\lambda(s, s') h(s')\big]}_{\text{heuristic}}. \quad (4)$$
Here $\tilde{Q}^{\pi}$ denotes the Q-function of policy $\pi$ in $\tilde{M}$, and $\tilde{V}^{\pi}$ denotes $\pi$'s value function in $\tilde{M}$. Compared to the original Bellman equation (1), it can be seen that $\lambda(\cdot)$ blends the heuristic with bootstrapping: the bigger $\lambda$'s values, the more bootstrapping is replaced by the heuristic.

The effect of solving HUBL's reshaped MDP $\tilde{M}$ is twofold. On the one hand, $\tilde{M}$ is different from the original MDP $M$: the optimal policy for $\tilde{M}$ may not be optimal for $M$, so solving $\tilde{M}$ could potentially lead to performance bias. On the other hand, $\tilde{M}$ has a smaller discount factor and thus is easier to solve than $M$, as the agent needs to plan over a shorter horizon. Therefore, we can think of applying HUBL to offline RL problems as performing a bias-variance trade-off, which reduces the learning variance due to bootstrapping at the cost of the bias due to using a suboptimal heuristic. We will explain this more concretely next.
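Before the formal decomposition, a small numerical check of (4) may help build intuition: evaluating a policy in the relabeled MDP (with $\tilde{r}$ and $\tilde{\gamma}$) yields the same fixed point as blending the heuristic into the bootstrapped backup directly. The 2-state MDP, the transition matrix, and the heuristic values below are made up purely for illustration.

```python
import numpy as np

# Toy check of the reshaped Bellman backup (eq. 4) on a small tabular MDP,
# with a constant blending factor and a state-dependent reward for simplicity.
S = 2
P = np.array([[0.8, 0.2], [0.3, 0.7]])   # P[s, s'] under the evaluated policy
r = np.array([1.0, 0.0])                  # r(s) under the evaluated policy
gamma, lam = 0.95, 0.3
h = np.array([5.0, 2.0])                  # heuristic values h(s')

# Relabeled quantities per eq. (2) / Section 5.1
r_tilde = r + gamma * lam * (P @ h)       # r~(s) = r(s) + gamma * E_{s'}[lam * h(s')]
g_tilde = gamma * (1 - lam)               # gamma~ = gamma * (1 - lam)

# Fixed point of the reshaped Bellman equation, V~ = r~ + gamma~ * P V~
V_tilde = np.linalg.solve(np.eye(S) - g_tilde * P, r_tilde)

# The blended backup written out directly as in eq. (4)
V_blend = r + gamma * (P @ ((1 - lam) * V_tilde)) + gamma * (P @ (lam * h))
assert np.allclose(V_tilde, V_blend)      # both forms of eq. (4) agree
print(V_tilde)
```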
5.2 BIAS-REGRET DECOMPOSITION

The insight that HUBL reshapes the Bellman equation used by the offline RL algorithm allows us to characterize HUBL's effects on policy learning. Namely, we use the modified Bellman equation (4) to decompose the performance of the learned policy into bias and regret terms.

Theorem 1. For any $h : \Omega \to \mathbb{R}$, $\lambda : \Omega \to [0, 1]$, and policy $\pi$, with $V^*$ as the value function of the optimal policy, it holds that $V^*(d_0) - V^{\pi}(d_0) = \mathrm{Bias}(\pi, h, \lambda) + \mathrm{Regret}(\pi, h, \lambda)$, where
$$\mathrm{Bias}(\pi, h, \lambda) := \frac{\gamma}{1-\gamma}\,\mathbb{E}_{(s,a,s') \sim d^{\pi^*}}\big[\lambda(s')\big(\tilde{V}^{\pi^*}(s') - h(s')\big) \mid s, s' \in \Omega\big],$$
$$\mathrm{Regret}(\pi, h, \lambda) := \tilde{V}^{\pi^*}(d_0) - \tilde{V}^{\pi}(d_0) + \frac{\gamma}{1-\gamma}\,\mathbb{E}_{(s,a,s') \sim d^{\pi}}\big[\lambda(s')\big(h(s') - \tilde{V}^{\pi}(s')\big) \mid s, s' \in \Omega\big].$$

The performance of $\pi$ depends on both bias and regret. The bias term describes the discrepancy caused by solving the reshaped MDP with $\lambda(\cdot)$. When $\lambda(s') = 0$, the bias becomes zero. The regret term describes the performance of the learned policy in the reshaped MDP. Intuitively, at states whose successors have bigger $\lambda(s')$ values, the reshaped MDP has a smaller discount factor and thus is easier to solve, which leads to smaller regret (i.e., smaller values of $\tilde{V}^{\pi^*}(d_0) - \tilde{V}^{\hat{\pi}}(d_0)$). Therefore, solving a HUBL-reshaped MDP induces a bias term but may generate a smaller regret.

Remark. The critical difference and novelty of Theorem 1 compared to existing theoretical results for RL with heuristics is that both the bias and the regret in Theorem 1 depend only on states in the data distribution support $\Omega$. This is crucial, because in the offline setting we have no access to observations beyond $\Omega$. In contrast, if in the preceding analysis we replaced Theorem 1 by, for example, Lemma A.1 of Cheng et al. (2021) with a constant $\lambda$, we would get the performance decomposition
$$V^*(d_0) - V^{\pi}(d_0) = \big(V^*(d_0) - \tilde{V}^{\pi^*}(d_0)\big) + \frac{\gamma\lambda}{1-\gamma}\mathbb{E}_{(s,a)\sim d^{\pi}}\mathbb{E}_{s'|s,a}\big[h(s') - \tilde{V}^{\pi^*}(s')\big] + (1-\lambda)\big(\tilde{V}^{\pi^*}(d_0) - \tilde{V}^{\pi}(d_0)\big) + \frac{\lambda}{1-\gamma}\big(\tilde{V}^{\pi^*}(d^{\pi}) - \tilde{V}^{\pi}(d^{\pi})\big).$$
This decomposition, however, suggests that the out-of-$\Omega$ values of $\lambda(s)$ or $h(s)$ are important to the performance of HUBL.

5.3 FINITE-SAMPLE ANALYSIS FOR BIAS AND REGRET

The finite-sample analysis of policy learning that we provide next illustrates more concretely how HUBL trades off bias and regret. Our analysis uses offline value iteration with lower confidence bound (VI-LCB) (Rashidinejad et al., 2021) as the base offline RL method. Following its original convention, we make some technical simplifications to keep the presentation clean. Specifically, we consider a tabular setting with $\lambda(s) = \alpha \in [0, 1]$ constant for $s \in \Omega$, which can be interpreted as the quality of the behavior policy averaged across states. Note that although $\lambda(s) = \alpha$ is a constant on $\Omega$, the reshaped MDP is still defined by $\lambda(s, s')$ in (3) (i.e., $\lambda(s, s') = \alpha$ if $s, s' \in \Omega$ and zero otherwise). The details of VI-LCB with HUBL are in Appendix C due to space limits. Theorem 2 summarizes our finite-sample results; its proof can be found in Appendix B.

[Figure 2: Relative improvement of HUBL with rank blending on 9 D4RL datasets (halfcheetah-med-exp-v2, halfcheetah-med-rep-v2, halfcheetah-med-v2, hopper-med-exp-v2, hopper-med-rep-v2, hopper-med-v2, walker2d-med-exp-v2, walker2d-med-rep-v2, walker2d-med-v2), one panel per base method (ATAC, CQL, IQL, TD3+BC); average relative improvement: 9%.]

Theorem 2. Under the setup described above, assume that the heuristic $h(\cdot)$ satisfies $h(s) = V^{\mu}(s)$ for any $s \in \Omega$. Then the bias and the regret in Theorem 1 are bounded by
$$\mathrm{Bias}(\hat{\pi}, \lambda) \le \frac{\gamma\lambda}{1-\gamma}\,\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}\big[V^*(s') - V^{\mu}(s') \mid s, s' \in \Omega\big],$$
$$\mathbb{E}_D[\mathrm{Regret}(\hat{\pi}, \lambda)] \le \min\left\{V_{\max},\ \sqrt{\frac{V_{\max}^2 (1-\gamma)|S|}{N(1-\gamma(1-\lambda))^4}\max_{s,a}\frac{d^{\pi^*}(s,a)}{\mu(s,a)}}\right\} + \frac{\gamma\lambda}{1-\gamma}\min\left\{V_{\max},\ \sqrt{\frac{V_{\max}^2 (1-\gamma)|S|}{N(1-\gamma(1-\lambda))^4}\max_{(s,a)\in\Omega}\frac{1}{\mu(s,a)}}\right\},$$
where $V_{\max}$ denotes a constant upper bound on the value function.

For the bias bound, the assumption $h(s) = V^{\mu}(s)$ for any $s \in \Omega$ is made for ease of presentation. If it does not hold, an additive error term can be introduced in the bias bound to capture the difference. For the regret bound, $\max_{s,a} \frac{d^{\pi^*}(s,a;d_0)}{\mu(s,a)}$ can be infinite, in which case the best regret bound is just $V_{\max}$. But when it is bounded, as assumed by existing works (Rashidinejad et al., 2021), our results demonstrate how $N$, $\gamma$, and $\lambda$ affect the regret bound.

The implications of Theorem 2 are threefold. First of all, it provides a finite-sample performance guarantee for HUBL with VI-LCB in the tabular setting. Compared with the performance bound of the original VI-LCB, $\min\Big\{V_{\max}, \sqrt{\frac{V_{\max}^2|S|}{(1-\gamma)^3 N}\max_{s,a}\frac{d^{\pi^*}(s,a)}{\mu(s,a)}}\Big\}$, HUBL shrinks the discount factor by $1-\lambda$ and thus potentially improves the performance while inducing bias. Second, Theorem 2 hints at the sources of HUBL's bias and regret. The bias is related to the performance of the behavior (data-collection) policy, characterized by $V^*(s) - V^{\mu}(s)$.
In the extreme case of data collected by an expert policy, the bias induced by HUBL is 0. The regret is affected by $\frac{d^{\pi^*}(s,a)}{\mu(s,a)}$, which describes the deviation of the optimal policy from the data distribution. Finally, Theorem 2 also provides guidance on how to construct a blending factor function $\lambda(\cdot)$. To reduce bias, $\lambda(s)$ should be small at states where $V^*(s) - V^{\mu}(s)$ is large. To reduce regret, $\lambda(\cdot)$ should be generally large but small at states where the learned policy is likely to deviate from the behavior policy. Therefore, an ideal $\lambda(\cdot)$ should be large when the behavior policy is close to optimal but small when it deviates from the optimal policy. This is consistent with our design principle in Section 4.2.

Extension to Model-based Offline RL. Conceptually, we can implement HUBL with model-based offline RL methods by directly relabeling the data, learning the heuristic-modified reward model, and then doing model-based planning with a smaller discount, following (2). Theorem 1 still applies to model-based algorithms. For Theorem 2, we expect that a similar analysis can be applied to model-based algorithms like MOReL. We defer this direction to future work and focus on improving the stability of a wide range of model-free offline RL methods.

[Figure 3: Relative improvement of HUBL with rank labeling on 18 Meta-World datasets (button-press-v2, push-back-v2, reach-v2, assembly-v2, handle-press-side-v2, and plate-slide-v2, each with noise levels 0.1, 0.5, and 1), one panel per base method (ATAC, CQL, IQL, TD3+BC); average relative improvement: 2%.]

6 EXPERIMENTS

We study 27 benchmark datasets from D4RL and Meta-World. We show that HUBL improves the performance of existing offline RL methods by 9% on average with easy modifications and simple hyperparameter tuning.
Remarkably, on some datasets where the base offline RL method performs inconsistently or poorly, HUBL can achieve more than 50% performance improvement. When the dataset is simple and the base offline RL method is already near-optimal, HUBL does not significantly harm the performance.

HUBL variants and base offline RL methods. We implement HUBL with four state-of-the-art offline RL algorithms as base methods: CQL (Kumar et al., 2020), TD3+BC (Fujimoto and Gu, 2021), IQL (Kostrikov et al., 2022), and ATAC (Cheng et al., 2022). For each base method, we compare the performance of its original version to its performance with HUBL running the three different blending strategies discussed in Section 4.2: constant, sigmoid, and rank. Thus, we experiment with 16 different methods in total. The implementation details of the base methods are in Appendix D.1.

Metrics. We use relative normalized score improvement, abbreviated as relative improvement, as a measure of HUBL's performance improvement. Specifically, for a given task and base method, we first compute the normalized score $r_{\text{base}}$ achieved by the base method, and then the normalized score $r_{\text{HUBL}}$ of HUBL. The relative normalized score improvement is defined as $\frac{r_{\text{HUBL}} - r_{\text{base}}}{|r_{\text{base}}|}$. We report the relative improvement of HUBL averaged over three seeds $\{0, 1, 10\}$ in this section, with standard deviations, base method performance, behavior cloning performance, and absolute normalized scores provided in Appendices D.2, D.5, and D.7.

Hyperparameter Tuning. For each dataset, the hyperparameters of the base methods are tuned over six different configurations suggested by the original papers. HUBL has one extra hyperparameter, $\alpha$, which is fixed for all the datasets but different for each base method. Specifically, $\alpha$ is selected from $\{0.001, 0.01, 0.1\}$ according to the relative improvement averaged over all the datasets. In practice, we notice that a single choice of $\alpha$ around 0.1 is sufficient for good performance across most base offline RL methods and datasets, as demonstrated by the sensitivity analysis in Appendix D.3.

6.1 PERFORMANCE IMPROVEMENT OF HUBL

We study 9 benchmark datasets in D4RL and 18 datasets for tasks in Meta-World, collected following the procedure detailed in Appendix D.4. The relative improvement of HUBL with the rank labeling method is reported in Figure 2 and Figure 3. First, despite being simple and needing little hyperparameter tuning, HUBL improves the performance of the base methods in most settings and only slightly hurts on some expert datasets where base offline RL methods already perform very well, with little room for improvement. Second, there are cases where HUBL achieves very significant relative improvement of more than 50%. Such large improvements happen on datasets where the base method shows inconsistent performance and underperforms other offline RL algorithms. HUBL resolves this inconsistent-performance issue by simply relabeling the data. Further, HUBL improves performance even on data with few expert trajectories, like hopper-med-rep-v2, walker2d-med-rep-v2, and hopper-med-v2, because HUBL conducts more bootstrapping and relies less on the heuristic on suboptimal trajectories (see Section 4.2).
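For reference, the relative improvement reported in Figures 2 and 3 and throughout this section is the straightforward computation below, restating the Metrics definition above (a trivial sketch; scores are assumed to be normalized scores averaged over seeds):

```python
def relative_improvement(score_hubl, score_base):
    """Relative normalized score improvement: (r_HUBL - r_base) / |r_base|."""
    return (score_hubl - score_base) / abs(score_base)
```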
6.2 ABLATION STUDIES

Table 1: Blending strategy comparison (number of cases, out of 108, in which each strategy is best).

          Sigmoid   Rank   Constant
Wins      30        46     32

Table 2: Average relative improvement of ablations without r̃, compared with HUBL.

ATAC                  Sigmoid   Rank    Constant
Ablation without r̃    -0.01     -0.02   -0.02
HUBL                   0.01      0.02    0.01

CQL                   Sigmoid   Rank    Constant
Ablation without r̃     0.04      0.04    0.04
HUBL                   0.11      0.16    0.13

IQL                   Sigmoid   Rank    Constant
Ablation without r̃    -0.04     -0.04   -0.07
HUBL                   0.09      0.09    0.06

TD3+BC                Sigmoid   Rank    Constant
Ablation without r̃    -0.01     -0.03   -0.04
HUBL                   0.06      0.09    0.10

Discount Relabeling. HUBL relabels both the reward and the discount factor, per (2). In contrast, existing methods like Hu et al. (2022) suggest that a lower discount factor alone, without blending heuristics into rewards, can in general improve offline RL. To assess the need for modifying both the discount factor and the rewards as in (2), we consider ablation methods that shrink only the discount factor to $\tilde{\gamma}$ without using $\tilde{r}$. The average relative improvement achieved by these ablations is reported and compared with that of HUBL in Table 2. HUBL consistently outperforms these ablations, which justifies HUBL's coordinated modification of both the reward and the discount factor. The advantage of HUBL is also consistent with what our theoretical analysis predicts. The considered ablations do not modify the rewards and thus are equivalent to solving
$$\bar{Q}^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[(1 - \lambda(s')) \bar{V}^{\pi}(s')\big].$$
Comparing this with (1), the solution will be consistently smaller than the true Q-function, inducing a pessimistic bias. Crucially, this bias is much more challenging to tackle than the bias induced by HUBL, because the former is inevitable even when the data comes from an expert policy.

Comparison of Blending Strategies. We present results for HUBL with rank blending, while the results of the other blending strategies (sigmoid and constant) are provided in Appendices D.2 and D.5. We notice that rank blending outperforms the other two. With 27 datasets and 4 base methods, we have 108 cases. In Table 1 we report the number of cases where a given blending strategy provides the best performance among the three. We can see that rank is favored on average.

We also experiment with HUBL in other settings, including on datasets collected by expert policies (Appendix D.6.1), with online RL methods (Appendix D.6.2), and in stochastic environments (Appendix D.6.3). HUBL shows consistent performance improvements in these settings.

7 CONCLUSION

In this work, we propose HUBL, a method for improving the performance of offline RL methods by blending heuristics with bootstrapping. HUBL is easy to implement and is generally applicable to a variety of offline RL methods. Empirically, we demonstrate the performance improvement of HUBL on 27 datasets in D4RL and Meta-World. We also provide a theoretical finite-sample performance bound for HUBL, which sheds light on HUBL's bias-regret trade-off and blending factor designs.

REFERENCES

Melda Alaluf, Giulia Crippa, Sinong Geng, Zijian Jing, Nikhil Krishnan, Sanjeev Kulkarni, Wyatt Navarro, Ronnie Sircar, and Jonathan Tang. Reinforcement learning paycheck optimization for multivariate financial goals. Risk & Decision Analysis, 2022.

Wissam Bejjani, Rafael Papallas, Matteo Leonetti, and Mehmet R Dogar. Planning with a receding horizon for manipulation in clutter using a learned value function. In 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), pages 1-9. IEEE, 2018.
Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems, 29, 2016.

David Blackwell. Discrete dynamic programming. The Annals of Mathematical Statistics, pages 719-726, 1962.

Ching-An Cheng, Andrey Kolobov, and Adith Swaminathan. Heuristic-guided reinforcement learning. Advances in Neural Information Processing Systems, 34:13550-13563, 2021.

Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. arXiv preprint arXiv:2202.02446, 2022.

Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465-472, 2011.

Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.

Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419-2468, 2021.

Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. 2020.

Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132-20145, 2021.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052-2062. PMLR, 2019.

S Geng, Z Kuang, PL Peissig, D Page, L Maursetter, and KE Hansen. Parathyroid hormone independently predicts fracture, vascular events, and death in patients with stage 3 and 4 chronic kidney disease. Osteoporosis International, 30:2019-2025, 2019a.

Sinong Geng, Zhaobin Kuang, Peggy Peissig, and David Page. Temporal Poisson square root graphical models. Proceedings of Machine Learning Research, 80:1714, 2018.

Sinong Geng, Minhao Yan, Mladen Kolar, and Sanmi Koyejo. Partially linear additive Gaussian graphical models. In International Conference on Machine Learning, pages 2180-2190. PMLR, 2019b.

Sinong Geng, Houssam Nassif, Carlos Manzanares, Max Reppen, and Ronnie Sircar. Deep PQR: Solving inverse reinforcement learning using anchor actions. In International Conference on Machine Learning, pages 3431-3441. PMLR, 2020.

Sinong Geng, Houssam Nassif, and Carlos A Manzanares. A data-driven state aggregation approach for dynamic discrete choice models. arXiv preprint arXiv:2304.04916, 2023.

Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S Merel, Daniel J Mankowitz, Cosmin Paduraru, et al. RL Unplugged: A suite of benchmarks for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:7248-7259, 2020.

David Hoeller, Farbod Farshidian, and Marco Hutter. Deep value model predictive control. In Conference on Robot Learning, pages 990-1004. PMLR, 2020.

Hao Hu, Yiqin Yang, Qianchuan Zhao, and Chongjie Zhang.
On the role of discount factor in offline reinforcement learning. In International Conference on Machine Learning, pages 9072-9098. PMLR, 2022.

Ehsan Imani, Eric Graves, and Martha White. An off-policy policy gradient theorem using emphatic weightings. Advances in Neural Information Processing Systems, 31, 2018.

Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181-1189, 2015.

Ray Jiang, Tom Zahavy, Zhongwen Xu, Adam White, Matteo Hessel, Charles Blundell, and Hado Van Hasselt. Emphatic algorithms for deep reinforcement learning. In International Conference on Machine Learning, pages 5023-5033. PMLR, 2021.

Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084-5096. PMLR, 2021.

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33:21810-21823, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Andrey Kolobov, Mausam, and Daniel S. Weld. Classical planning in MDP heuristics: With a little help from generalization. In AAAI, 2010.

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. In ICLR, 2022.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179-1191, 2020.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45-73. Springer, 2012.

Mausam and Andrey Kolobov. Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6:1-210, 2012.

Georg Ostrovski, Marc G Bellemare, Aäron Oord, and Rémi Munos. Count-based exploration with neural density models. In International Conference on Machine Learning, pages 2721-2730. PMLR, 2017.

Marek Petrik and Bruno Scherrer. Biasing approximate dynamic programming with a lower discount factor. Advances in Neural Information Processing Systems, 21, 2008.

Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702-11716, 2021.

Anton Maximilian Schaefer, Steffen Udluft, and Hans-Georg Zimmermann. A recurrent control neural network for data efficient reinforcement learning. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 151-157. IEEE, 2007.

Harm Seijen and Rich Sutton. True online TD(λ). In International Conference on Machine Learning, pages 692-700. PMLR, 2014.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning.
The Journal of Machine Learning Research, 17(1):2603-2631, 2016.

Phillip Swazinna, Steffen Udluft, and Thomas Runkler. Overcoming model bias for robust offline deep reinforcement learning. Engineering Applications of Artificial Intelligence, 104:104366, 2021.

Phillip Swazinna, Steffen Udluft, Daniel Hein, and Thomas Runkler. Comparing model-free and model-based algorithms for offline reinforcement learning. IFAC-PapersOnLine, 55(15):19-26, 2022.

Denis Tarasov, Alexander Nikulin, Dmitry Akimov, Vladislav Kurenkov, and Sergey Kolesnikov. CORL: Research-oriented deep offline reinforcement learning library. arXiv preprint arXiv:2210.07105, 2022.

Harm Van Seijen, Mehdi Fatemi, and Arash Tavakoli. Using a logarithmic mapping to enable lower discount factors in reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.

Martha White. Unifying task specification in reinforcement learning. In International Conference on Machine Learning, pages 3742-3750. PMLR, 2017.

Albert Wilcox, Ashwin Balakrishna, Jules Dedieu, Wyame Benslimane, Daniel Brown, and Ken Goldberg. Monte Carlo augmented actor-critic for sparse reward deep reinforcement learning from suboptimal demonstrations. arXiv preprint arXiv:2210.07432, 2022.

Robert Wright, Steven Loscalzo, Philip Dexter, and Lei Yu. Exploiting multi-step sample trajectories for approximate value iteration. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part I, pages 113-128. Springer, 2013.

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094-1100. PMLR, 2020.

Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimization and exploration with linear function approximation. In Conference on Learning Theory, pages 4473-4525. PMLR, 2021.

Mingyuan Zhong, Mikala Johnson, Yuval Tassa, Tom Erez, and Emanuel Todorov. Value function approximation and model predictive control. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 100-107. IEEE, 2013.

A EXTENDED RESULTS OF THEOREM 1

Below we prove the following statement, which is a restatement of Theorem 1.

Theorem 3. For any $\lambda : \Omega \to [0, 1]$ and any $h : \Omega \to \mathbb{R}$, it holds that
$$V^*(s_0) - V^{\hat\pi}(s_0) = \tilde{V}^{\pi^*}(s_0) - \tilde{V}^{\hat\pi}(s_0) \quad (5)$$
$$+ \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}\big[\lambda(s')\big(\tilde{V}^{\pi^*}(s') - h(s')\big) \mid s, s' \in \Omega\big] + \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\hat\pi}}\big[\lambda(s')\big(h(s') - \tilde{V}^{\hat\pi}(s')\big) \mid s, s' \in \Omega\big].$$
Specifically, the performance of $\pi$ depends on both bias and regret. The bias term describes the discrepancy caused by solving the reshaped MDP with $\lambda(\cdot)$. The regret term describes the performance of the learned policy in the reshaped MDP.

To prove Theorem 3, we first conduct a regret decomposition as
$$V^*(s_0) - V^{\hat\pi}(s_0) = \big(V^*(s_0) - \tilde{V}^{\pi^*}(s_0)\big) + \big(\tilde{V}^{\pi^*}(s_0) - \tilde{V}^{\hat\pi}(s_0)\big) + \big(\tilde{V}^{\hat\pi}(s_0) - V^{\hat\pi}(s_0)\big).$$
Then, we rewrite both $V^*(s_0) - \tilde{V}^{\pi^*}(s_0)$ and $\tilde{V}^{\hat\pi}(s_0) - V^{\hat\pi}(s_0)$ using the heuristic $h(s)$.

A.1 TECHNICAL LEMMAS

Before proceeding to the proof details of Theorem 3, we first prove a lemma on $V^{\pi}(s_0) - \tilde{V}^{\pi}(s_0)$. For a policy $\pi$ under the original MDP $M$ and the reshaped MDP $\tilde{M}$, we aim to quantify the value difference using the heuristic $h(s)$.

Lemma 4. For any policy $\pi$,
$$V^{\pi}(s_0) - \tilde{V}^{\pi}(s_0) = \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi}}\big[\lambda(s')\big(\tilde{V}^{\pi}(s') - h(s')\big) \mid s, s' \in \Omega\big].$$
Proof.
By the definition of $\lambda(s, s')$:
$$V^{\pi}(s_0) - \tilde{V}^{\pi}(s_0) = \frac{1}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi}}\big[r(s, a) + \gamma \tilde{V}^{\pi}(s') - \tilde{V}^{\pi}(s)\big]$$
$$= \frac{1}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi}}\big[r(s, a) + \gamma \tilde{V}^{\pi}(s') - r(s, a) - \gamma\lambda(s, s')h(s') - \gamma(1-\lambda(s, s'))\tilde{V}^{\pi}(s')\big]$$
$$= \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi}}\big[\lambda(s, s')\big(\tilde{V}^{\pi}(s') - h(s')\big)\big] = \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi}}\big[\lambda(s, s')\big(\tilde{V}^{\pi}(s') - h(s')\big) \mid s, s' \in \Omega\big]$$
$$= \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi}}\big[\lambda(s')\big(\tilde{V}^{\pi}(s') - h(s')\big) \mid s, s' \in \Omega\big].$$

A.2 PROOF OF THEOREM 3

To prove the theorem, we decompose the regret into the following terms:
$$V^*(s_0) - V^{\hat\pi}(s_0) = \big(V^*(s_0) - \tilde{V}^{\pi^*}(s_0)\big) + \big(\tilde{V}^{\pi^*}(s_0) - \tilde{V}^{\hat\pi}(s_0)\big) + \big(\tilde{V}^{\hat\pi}(s_0) - V^{\hat\pi}(s_0)\big).$$
We apply Lemma 4 to rewrite the first and the last terms as
$$V^*(s_0) - \tilde{V}^{\pi^*}(s_0) = \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}\big[\lambda(s')\big(\tilde{V}^{\pi^*}(s') - h(s')\big) \mid s, s' \in \Omega\big],$$
$$\tilde{V}^{\hat\pi}(s_0) - V^{\hat\pi}(s_0) = \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\hat\pi}}\big[\lambda(s')\big(h(s') - \tilde{V}^{\hat\pi}(s')\big) \mid s, s' \in \Omega\big].$$
Combining the two completes the proof.

B EXTENDED RESULTS FOR THEOREM 2

To prove Theorem 2, we separately bound the bias term
$$\mathrm{Bias}(\hat\pi, \lambda) := \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}\big[\lambda(s')\big(\tilde{V}^{\pi^*}(s') - h(s')\big) \mid s, s' \in \Omega\big],$$
and the regret term
$$\mathrm{Regret}(\pi, h, \lambda) := \tilde{V}^{\pi^*}(d_0) - \tilde{V}^{\pi}(d_0) + \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi}}\big[\lambda(s')\big(h(s') - \tilde{V}^{\pi}(s')\big) \mid s, s' \in \Omega\big].$$
Specifically, the bias is bounded by Lemma 6 and the regret is bounded by Lemma 9.

B.1 TECHNICAL LEMMAS

We now provide some technical lemmas for the proof of Theorem 2.

B.1.1 BIAS COMPONENT

We first prove the upper bound for the bias component.

Lemma 5. Assume $h(s) \le V^*(s)$ for all $s \in S$. Then $\tilde{V}^{\pi^*}(s) \le V^*(s)$ for all $s \in S$.

Proof. For any $s$, with $a = \pi^*(s)$,
$$\tilde{V}^{\pi^*}(s) = r(s, a) + \gamma\mathbb{E}_{s'|s,a}\big[\lambda(s, s')h(s') + (1-\lambda(s, s'))\tilde{V}^{\pi^*}(s')\big]$$
$$= V^*(s) + \gamma\mathbb{E}_{s'|s,a}\big[\lambda(s, s')h(s') + (1-\lambda(s, s'))\tilde{V}^{\pi^*}(s') - V^*(s')\big]$$
$$= V^*(s) + \gamma\mathbb{E}_{s'|s,a}\big[\lambda(s, s')\big(h(s') - V^*(s')\big) + (1-\lambda(s, s'))\big(\tilde{V}^{\pi^*}(s') - V^*(s')\big)\big]$$
$$\le V^*(s) + \gamma\mathbb{E}_{s'|s,a}\big[(1-\lambda(s, s'))\big(\tilde{V}^{\pi^*}(s') - V^*(s')\big)\big],$$
where in the inequality we used $h(s') \le V^*(s')$. Then, by a contraction argument, we can show $\tilde{V}^{\pi^*}(s) - V^*(s) \le 0$.

Lemma 6 (Bias Upper Bound). Under the assumptions of Theorem 2, the bias component can be bounded by:
$$\mathrm{Bias}(\hat\pi, \lambda) \le \frac{\lambda\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}\big[V^*(s') - V^{\mu}(s') \mid s, s' \in \Omega\big].$$
Proof. Recall the bias component
$$\mathrm{Bias}(\hat\pi, \lambda) := \frac{\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}\big[\lambda(s')\big(\tilde{V}^{\pi^*}(s') - h(s')\big) \mid s, s' \in \Omega\big].$$
Under the assumptions of Theorem 2, we derive
$$\mathrm{Bias}(\hat\pi, \lambda) = \frac{\lambda\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}\big[\tilde{V}^{\pi^*}(s') - V^*(s') + V^*(s') - V^{\mu}(s') \mid s, s' \in \Omega\big].$$
Next, by Lemma 5 we have $\frac{\lambda\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}[\tilde{V}^{\pi^*}(s') - V^*(s') \mid s, s' \in \Omega] \le 0$, and thus
$$\mathrm{Bias}(\hat\pi, \lambda) \le \frac{\lambda\gamma}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\pi^*}}\big[V^*(s') - V^{\mu}(s') \mid s, s' \in \Omega\big].$$

B.1.2 REGRET COMPONENT

Next, we focus on the regret component.

Lemma 7. Under the setup of Theorem 2, $\tilde{V}^{\mu}(s) = V^{\mu}(s)$ for $s \in \Omega$.

Proof. Since $\Omega$ is the support of $\mu$, this can be shown by the following: for $s \in \Omega$,
$$\tilde{V}^{\mu}(s) - V^{\mu}(s) = \mathbb{E}_{a\sim\mu(\cdot|s)}\mathbb{E}_{s'|s,a}\big[r(s,a) + \gamma\lambda(s,s')h(s') + \gamma(1-\lambda(s,s'))\tilde{V}^{\mu}(s') - r(s,a) - \gamma V^{\mu}(s')\big]$$
$$= \mathbb{E}_{a\sim\mu(\cdot|s)}\mathbb{E}_{s'|s,a}\big[r(s,a) + \gamma\lambda(s,s')h(s') + \gamma(1-\lambda(s,s'))\tilde{V}^{\mu}(s') - r(s,a) - \gamma V^{\mu}(s') \mid s' \in \Omega\big]$$
$$= \mathbb{E}_{a\sim\mu(\cdot|s)}\mathbb{E}_{s'|s,a}\big[r(s,a) + \gamma\lambda(s,s')V^{\mu}(s') + \gamma(1-\lambda(s,s'))\tilde{V}^{\mu}(s') - r(s,a) - \gamma V^{\mu}(s') \mid s' \in \Omega\big]$$
$$= \gamma\,\mathbb{E}_{a\sim\mu(\cdot|s)}\mathbb{E}_{s'|s,a}\big[(1-\lambda(s,s'))\big(\tilde{V}^{\mu}(s') - V^{\mu}(s')\big) \mid s' \in \Omega\big].$$
Since $\gamma < 1$, by a contraction argument we have $\tilde{V}^{\mu}(s) - V^{\mu}(s) = 0$ for $s \in \Omega$.

Lemma 8. The difference between $V^*(s)$ and $\tilde{V}^{\pi^*}(s)$ can be written as
$$V^*(s) - \tilde{V}^{\pi^*}(s) = \mathbb{E}_{\rho^{\pi^*}(s)}\Big[\sum_{t=1}^{\infty}\lambda(1-\lambda)^{t-1}\gamma^t\big(V^*(s_t) - h(s_t)\big)\Big],$$
where $\rho^{\pi^*}(s)$ denotes the trajectory distribution of running $\pi^*$ from $s$.

Proof. By the dynamic programming equations in both the original and the reshaped MDP,
$$V^*(s) - \tilde{V}^{\pi^*}(s) = \gamma(1-\lambda)\mathbb{E}_{a\sim\pi^*(\cdot;s)}\big[\mathbb{E}_{s'|s,a}[V^*(s') - \tilde{V}^{\pi^*}(s')]\big] + \gamma\lambda\,\mathbb{E}_{a\sim\pi^*(\cdot;s)}\big[\mathbb{E}_{s'|s,a}[V^*(s') - h(s')]\big]. \quad (6)$$
Then, applying (6) recursively,
$$V^*(s) - \tilde{V}^{\pi^*}(s) = \lambda\,\mathbb{E}_{\rho^{\pi^*}(s)}\Big[\sum_{t=1}^{\infty}(1-\lambda)^{t-1}\gamma^t\big(V^*(s_t) - h(s_t)\big)\Big].$$

Lemma 9 (Regret Upper Bound). The expected regret is bounded by
$$\mathbb{E}_D[\mathrm{Regret}(\hat\pi, \lambda(\cdot))] \le \min\left\{V_{\max},\ V_{\max}\sqrt{\frac{(1-\gamma)|S|\max_{s,a}\frac{d^{\pi^*}(s,a;d_0)}{\mu(s,a)}}{N(1-\gamma(1-\lambda))^4}}\right\} + \frac{\gamma\lambda}{1-\gamma}\min\left\{V_{\max},\ V_{\max}\sqrt{\frac{(1-\gamma)|S|\frac{1}{\min_{(s,a)\in\Omega}\mu(s,a)}}{N(1-\gamma(1-\lambda))^4}}\right\}.$$
Proof.
Under the setup of Theorem 2, we have
$$\mathrm{Regret}(\hat\pi, \lambda(\cdot)) = \tilde{V}^{\pi^*}(d_0) - \tilde{V}^{\hat\pi}(d_0) + \frac{\gamma\lambda}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\hat\pi}}\big[V^{\mu}(s') - \tilde{V}^{\hat\pi}(s') \mid s, s' \in \Omega\big].$$
Further, by Lemma 7, we can replace $V^{\mu}$ by $\tilde{V}^{\mu}$:
$$\mathrm{Regret}(\hat\pi, \lambda(\cdot)) = \tilde{V}^{\pi^*}(d_0) - \tilde{V}^{\hat\pi}(d_0) + \frac{\gamma\lambda}{1-\gamma}\mathbb{E}_{(s,a,s')\sim d^{\hat\pi}}\big[\tilde{V}^{\mu}(s') - \tilde{V}^{\hat\pi}(s') \mid s, s' \in \Omega\big]. \quad (7)$$
Then, by Lemma 10,
$$\mathbb{E}_D\big[\tilde{V}^{\pi^*}(d_0) - \tilde{V}^{\hat\pi}(d_0)\big] \le \min\left\{V_{\max},\ V_{\max}\sqrt{\frac{(1-\gamma)|S|\max_{s,a}\frac{d^{\pi^*}(s,a;d_0)}{\mu(s,a)}}{N(1-\gamma(1-\lambda))^4}}\right\}, \quad (8)$$
$$\mathbb{E}_D\big[\tilde{V}^{\mu}(d^{\hat\pi}) - \tilde{V}^{\hat\pi}(d^{\hat\pi})\big] \le \min\left\{V_{\max},\ V_{\max}\sqrt{\frac{(1-\gamma)|S|\max_{s,a}\frac{d^{\mu}(s,a;d^{\hat\pi})}{\mu(s,a)}}{N(1-\gamma(1-\lambda))^4}}\right\}.$$
Note that, by Lemma 11, $d^{\hat\pi}$ stays in $\Omega$. Therefore, we get $\max_{s,a}\frac{d^{\mu}(s,a;d^{\hat\pi})}{\mu(s,a)} \le \max_{(s,a)\in\Omega}\frac{d^{\mu}(s,a;d^{\hat\pi})}{\mu(s,a)} \le \frac{1}{\min_{(s,a)\in\Omega}\mu(s,a)}$, which leads to
$$\mathbb{E}_D\big[\tilde{V}^{\mu}(d^{\hat\pi}) - \tilde{V}^{\hat\pi}(d^{\hat\pi})\big] \le \min\left\{V_{\max},\ V_{\max}\sqrt{\frac{(1-\gamma)|S|\frac{1}{\min_{(s,a)\in\Omega}\mu(s,a)}}{N(1-\gamma(1-\lambda))^4}}\right\}. \quad (9)$$
Plugging (8) and (9) into (7), we derive
$$\mathbb{E}_D[\mathrm{Regret}(\hat\pi, \lambda(\cdot))] \le \min\left\{V_{\max},\ V_{\max}\sqrt{\frac{(1-\gamma)|S|\max_{s,a}\frac{d^{\pi^*}(s,a;d_0)}{\mu(s,a)}}{N(1-\gamma(1-\lambda))^4}}\right\} + \frac{\gamma\lambda}{1-\gamma}\min\left\{V_{\max},\ V_{\max}\sqrt{\frac{(1-\gamma)|S|\frac{1}{\min_{(s,a)\in\Omega}\mu(s,a)}}{N(1-\gamma(1-\lambda))^4}}\right\}.$$

B.2 PROOF OF THEOREM 2

The proof follows by combining Lemmas 6 and 9.

C VI-LCB WITH HUBL

We use offline value iteration with lower confidence bound (VI-LCB) (Rashidinejad et al., 2021) as the base algorithm to concretely analyze the effects of HUBL with finite samples in the tabular case.

C.1 ALGORITHM

We detail the procedure of HUBL when implemented with VI-LCB for the tabular setting. To start, we introduce several definitions used in the following algorithm and theoretical analysis. Without loss of generality, we assume that the states take values in $\{1, 2, \ldots, |S|\}$ and that the actions take values in $\{1, 2, \ldots, |A|\}$. Let $h$ be a $|S| \times 1$ vector denoting the heuristic function; we assume each component satisfies $h_s = V^{\mu}(s)$ for $s \in \Omega$. Let $\Lambda$ be a $|S| \times |S|$ matrix with $\Lambda_{s,s'} = \lambda$ if $s, s' \in \Omega$ and $\Lambda_{s,s'} = 0$ if $s \notin \Omega$ or $s' \notin \Omega$. We use $\circ$ to denote component-wise multiplication, and $\Lambda_{s,:}$ to denote a row of $\Lambda$ as a $|S| \times 1$ vector. With slight abuse of notation, we use $t$ to denote the iteration index in this section, with $T$ as the total number of iterations.

To conduct VI-LCB with HUBL, we follow Algorithm 3 of Rashidinejad et al. (2021) but with a modified updating rule. The procedure is summarized in Algorithm 2, where the key modification relative to the original VI-LCB is the update in Line 13. Specifically, at the $t$-th iteration, based on (2), we modify the updating rule of Algorithm 3 in Rashidinejad et al. (2021) into
$$Q_t(s, a) \leftarrow r_t(s, a) - b_t(s, a) + \gamma \hat{P}^t_{s,a} \cdot \big((I - \Lambda_{s,:}) \circ V_{t-1}\big) + \gamma \hat{P}^t_{s,a} \cdot \big(\Lambda_{s,:} \circ h\big).$$
Note that we introduce the heuristic through $\gamma \hat{P}^t_{s,a} \cdot (\Lambda_{s,:} \circ h)$, while reducing the bootstrapping through $\gamma \hat{P}^t_{s,a} \cdot ((I - \Lambda_{s,:}) \circ V_{t-1})$.

Algorithm 2: HUBL with VI-LCB
1: Input: Batch dataset $D$ and discount factor $\gamma$.
2: Set $T := \frac{\log N}{1 - \gamma}$.
3: Randomly split $D$ into $T + 1$ sets $D_t = \{(s_i, a_i, r_i, s'_i)\}$ for $t \in \{0, 1, \ldots, T\}$, where $D_0$ consists of $N/2$ observations and the other datasets have $N/(2T)$ observations each.
4: Set $m_0(s, a) := \sum_{i=1}^{m} \mathbb{1}\{(s_i, a_i) = (s, a)\}$ based on dataset $D_0$.
5: For all $a \in A$ and $s \in S$, initialize $Q_0(s, a) = 0$, $V_0(s) = 0$ and set $\pi_0(s) = \arg\max_a m_0(s, a)$.
6: for $t = 1, \ldots, T$ do
7:   Initialize $r_t(s, a) = 0$ and set $\hat{P}^t_{s,a}$ to be a random probability vector.
8:   Set $m_t(s, a) := \sum_{i=1}^{m} \mathbb{1}\{(s_i, a_i) = (s, a)\}$ based on dataset $D_t$.
9:   Compute the penalty $b_t(s, a)$ for $L = 2000 \log(2(T + 1)|S||A|N)$: $b_t(s, a) := V_{\max}\sqrt{\frac{L}{m_t(s, a) \vee 1}}$.
10:   for $(s, a) \in (S, A)$ do
11:     if $m_t(s, a) \ge 1$ then
12:       Set $\hat{P}^t_{s,a}$ to be the empirical transitions and $r_t(s, a)$ to be the empirical average of rewards.
13:     Set $Q_t(s, a) \leftarrow r_t(s, a) - b_t(s, a) + \gamma \hat{P}^t_{s,a} \cdot ((I - \Lambda_{s,:}) \circ V_{t-1}) + \gamma \hat{P}^t_{s,a} \cdot (\Lambda_{s,:} \circ h)$.
14:   Compute $V^{\mathrm{mid}}_t(s) \leftarrow \max_a Q_t(s, a)$ and $\pi^{\mathrm{mid}}_t(s) \leftarrow \arg\max_a Q_t(s, a)$.
15:   for $s \in S$ do
16:     if $V^{\mathrm{mid}}_t(s) \le V_{t-1}(s)$ then
17:       $V_t(s) \leftarrow V_{t-1}(s)$ and $\pi_t(s) \leftarrow \pi_{t-1}(s)$.
18:     else
19:       $V_t(s) \leftarrow V^{\mathrm{mid}}_t(s)$ and $\pi_t(s) \leftarrow \pi^{\mathrm{mid}}_t(s)$.
20: Return $\hat\pi := \pi_T$.
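The following is a simplified sketch of the HUBL-modified update in Line 13 of Algorithm 2 for the tabular setting. The array layout and function name are assumptions made for illustration; the data splitting, counts, and penalty computation of Algorithm 2 are taken as given inputs.

```python
import numpy as np

def hubl_vi_lcb_update(r_hat, b, P_hat, Lam, h, V_prev, gamma):
    """One sweep of the HUBL-modified VI-LCB update (Line 13 of Algorithm 2):
    Q_t(s,a) = r_hat(s,a) - b(s,a)
               + gamma * P_hat[s,a] . ((1 - Lam[s,:]) * V_prev)   # reduced bootstrapping
               + gamma * P_hat[s,a] . (Lam[s,:] * h)              # blended heuristic
    Shapes: r_hat, b are (S, A); P_hat is (S, A, S); Lam is (S, S); h, V_prev are (S,)."""
    S, A = r_hat.shape
    Q = np.zeros((S, A))
    for s in range(S):
        boot = (1.0 - Lam[s]) * V_prev      # (1 - lambda(s, s')) * V_{t-1}(s'), elementwise over s'
        heur = Lam[s] * h                    # lambda(s, s') * h(s')
        Q[s] = r_hat[s] - b[s] + gamma * (P_hat[s] @ (boot + heur))
    return Q
```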
C.2 REGRET ANALYSIS

In this section, we study the regret under the reshaped MDP constructed by HUBL. Specifically, we bound the regret of Algorithm 2 in Lemma 10.

Lemma 10 (Regret of VI-LCB with HUBL). Let the assumptions in Section C.3 be satisfied. Then, for any initial distribution $d_{\mathrm{init}}$ in $\Omega$, the regret of Algorithm 2 is bounded by
$$\mathbb{E}_D\big[\tilde{V}^{\pi^*}(d_{\mathrm{init}}) - \tilde{V}^{\hat\pi}(d_{\mathrm{init}})\big] \le \min\left\{V_{\max},\ V_{\max}\sqrt{\frac{(1-\gamma)|S|\max_{s,a}\frac{d^{\pi^*}(s,a;d_{\mathrm{init}})}{\mu(s,a)}}{N(1-\gamma(1-\lambda))^4}}\right\}.$$

C.3 NOTATION

We first provide some matrix notation for MDPs. We use $P^{\pi} \in \mathbb{R}^{|S||A| \times |S||A|}$ to denote the transition matrix induced by policy $\pi$, whose $(s, a) \to (s', a')$ element equals $P(s'|s, a)\pi(a'|s')$, and $d^{\pi} \in \mathbb{R}^{|S||A|}$ to denote a state-action distribution induced by policy $\pi$, whose $(s, a)$ element equals $d(s)\pi(a|s)$. Similarly, we use $\Lambda_Q \in \mathbb{R}^{|S||A| \times |S||A|}$ to denote an extended matrix version of $\Lambda$. Further, we focus on policies that stay in the data distribution support $\Omega$. Such policies are formally defined as $\{\pi \mid d^{\pi}(s, a; s_0) = 0 \text{ for any } (s, a) \notin \Omega \text{ and } s_0 \in \Omega\}$. We also define a clean event:
$$E_{\mathrm{MDP}} := \Big\{\forall s, a, t : \big|r(s, a) - r_t(s, a) + \gamma\big(P_{s,a} - \hat{P}^t_{s,a}\big) \cdot \big((I - \Lambda_{s,:}) \circ V_{t-1}\big)\big| \le b_t(s, a)\Big\}.$$

C.4 TECHNICAL LEMMAS

With the aforementioned definitions and assumptions, we provide the following lemmas.

Lemma 11. Let $\pi$ be a policy learned by Algorithm 2. Under the event $E_{\mathrm{cover}} := \{\Omega \subseteq D_0\}$, $\pi$ stays in $\Omega$ for any initial state in $\Omega$. The probability of $E_{\mathrm{cover}}$ is greater than $1 - |S|(1 - \min_{s \in \Omega}\mu(s))^{N/2}$.

Proof. Under the event $E_{\mathrm{cover}}$, we first fix $s \in \Omega$ and show that, given $s$, the learned policy $\pi$ does not take any action $a'$ with $(s, a') \notin \Omega$. By Algorithm 2, for any $(s, a') \notin \Omega$ and $t = 1, 2, \ldots, T$, we can derive
$$Q_t(s, a') \le -V_{\max}\sqrt{2000\log(2(T + 1)|S||A|N)} + \gamma V_{\max}.$$
Since $N$ and $T$ are larger than 1, we can conclude that $Q_t(s, a') \le 0$ for any $a'$ such that $(s, a') \notin \Omega$. Next we consider two cases.

Case I: If there exist $a$ with $(s, a) \in \Omega$ and $t \in \{1, 2, \ldots, T\}$ such that $Q_t(s, a) > 0$, we can conclude that $Q_T(s, a) \ge Q_t(s, a) > Q_T(s, a')$, which implies that the learned policy $\pi$ never takes an action $a'$ with $(s, a') \notin \Omega$ at state $s$.

Case II: If for every $a$ with $(s, a) \in \Omega$ and every $t = 1, 2, \ldots, T$ we have $Q_t(s, a) = 0$, then according to Algorithm 2, $\pi(s) = \arg\max_a m_0(s, a)$. By the definition of $E_{\mathrm{cover}}$, the policy $\pi$ stays in $\Omega$.

Next, we consider the probability of $E_{\mathrm{cover}}$. Given a state $s \in \Omega$, we define the event $E_s := \{m_0(s, a) = 0 \text{ for all } a \text{ such that } (s, a) \in \Omega\}$. The probability of $E_s$ can be bounded as $P(E_s) \le (1 - \mu(s))^{N/2}$. By the union bound, we derive $P(\neg E_{\mathrm{cover}}) \le |S|(1 - \min_{s \in \Omega}\mu(s))^{N/2}$. As a result, $\pi$ stays in $\Omega$ with probability greater than $1 - |S|(1 - \min_{s \in \Omega}\mu(s))^{N/2}$.

Lemma 12. Let $\pi$ be a policy that stays in $\Omega$. Then, with $d_{\mathrm{init}}$ as any initial distribution in $\Omega$,
$$\frac{d_{\mathrm{init}}(s)}{\mu(s, \pi(s))} \le \frac{\tilde{d}^{\pi}(s, \pi(s); d_{\mathrm{init}})/\mu(s, \pi(s))}{1 - \gamma(1-\lambda)}.$$
Proof. By definition, we have
$$\tilde{d}^{\pi}(s, \pi(s); d_{\mathrm{init}}) := \frac{1}{1-\gamma(1-\lambda)}\sum_{t=0}^{\infty}\gamma^t(1-\lambda)^t\, \mathbb{P}_t(S_t = s; \pi, d_{\mathrm{init}}).$$
Therefore, $d_{\mathrm{init}}(s) \le \frac{1}{1-\gamma(1-\lambda)}\tilde{d}^{\pi}(s, \pi(s); d_{\mathrm{init}})$, which finishes the proof.

Lemma 13. Let $v^{\pi}_k = d^{\pi}_{\mathrm{init}}\big(\gamma P^{\pi}(I - \Lambda_Q)\big)^k$. For a policy $\pi$ that stays in $\Omega$, the following equality holds: $v^{\pi}_k = d^{\pi}_{\mathrm{init}}\big(\gamma(1-\lambda)P^{\pi}\big)^k$.

Proof. Since $\pi$ stays in $\Omega$, $P^{\pi}$ only accesses the entries of $I - \Lambda_Q$ whose values equal $1-\lambda$. This observation finishes the proof.

Lemma 14. Let $\pi$ be a policy that stays in $\Omega$.
Lemma 14. Let π be a policy that stays in Ω. Under the event E_MDP, for all t = 1, 2, ..., T,

Ṽ(π) − Ṽ(π_t) ≤ V_max γ^t (1−λ)^t + 2 Σ_{i=1}^{t} E_{v^π_{t−i}}[ b_i(s, a) ].

Proof. The proof follows Lemma 2 of Rashidinejad et al. (2021), combined with Lemma 13.

Lemma 15. For any policy π that stays in Ω, the following inequality holds for any (s, a) ∈ Ω:

d^π(s, a; d_init) ≥ ( (1−γ) / (1 − γ(1−λ)) ) d̃^π(s, a; d_init).

Proof. By definition, we have

d̃^π(s, a; d_init) := (1 − γ(1−λ)) Σ_{t=0}^∞ γ^t (1−λ)^t P_t(S_t = s, A_t = a; π, d_init),
d^π(s, a; d_init) := (1 − γ) Σ_{t=0}^∞ γ^t P_t(S_t = s, A_t = a; π, d_init).

Therefore,

d̃^π(s, a; d_init) ≤ (1 − γ(1−λ)) Σ_{t=0}^∞ γ^t P_t(S_t = s, A_t = a; π, d_init) = ( (1 − γ(1−λ)) / (1−γ) ) d^π(s, a; d_init),

which is equivalent to the claimed inequality.

C.5 PROOF OF LEMMA 10

By the event E_cover, we can decompose the regret into

E_D[ Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ]
≤ E_D[ ( Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ) 1{E_cover} ] + E_D[ ( Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ) 1{E^c_cover} ]
≤ E_D[ ( Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ) 1{E_cover} ] + V_max |S| (1 − min_{s∈Ω} µ(s))^N,   (10)

where the second inequality is by Lemma 11. Next, we focus on E_D[(Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init)) 1{E_cover}]. Following the analysis in Section C.5 of Rashidinejad et al. (2021), we can derive

E_D[ ( Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ) 1{E_cover} ]
≤ E_D[ E_{s∼d_init}[ ( Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ) 1{∃ t ≤ T: m_t(s, π*(s)) = 0} 1{E_cover} ] ]   =: T1
+ E_D[ E_{s∼d_init}[ ( Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ) 1{∀ t ≤ T: m_t(s, π*(s)) ≥ 1} 1{E_MDP} 1{E_cover} ] ]   =: T2
+ E_D[ E_{s∼d_init}[ ( Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ) 1{∀ t ≤ T: m_t(s, π*(s)) ≥ 1} 1{E^c_MDP} 1{E_cover} ] ]   =: T3.

By Lemma 11, under the event E_cover, π̂ stays in Ω. In other words, Lemmas 12, 13, and 14 are all applicable to π̂. For the following analysis, we consider the case in which π* stays in Ω; we discuss the case in which Ω does not cover π* at the end of the proof.

By Lemma 12 and Section C.5.1 of Rashidinejad et al. (2021),

T1 ≤ 8 V_max |S| T^2 max_{s,a} d^{π*}(s,a; d_init)/µ(s,a) / ( 9 (1 − (1−λ)γ) N ).

By Lemma 14 and Section C.5.2 of Rashidinejad et al. (2021),

T2 ≤ V_max γ^T (1−λ)^T + 32 ( V_max / (1 − γ(1−λ)) ) √( 2 L |S| T max_{s,a} d^{π*}(s,a; d_init)/µ(s,a) / N ),

and T3 is bounded as in Section C.5.2 of Rashidinejad et al. (2021). Combining T1, T2, T3, and (10), with T = log N / (1 − γ(1−λ)), we obtain

E_D[ Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ] ≤ min{ V_max, V_max √( |S| max_{s,a} d^{π*}(s,a; d_init)/µ(s,a) / ( N (1 − γ(1−λ))^3 ) ) }.

Finally, by Lemma 15, we finish the proof:

E_D[ Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ] ≤ min{ V_max, V_max √( (1−γ)|S| max_{s,a} d^{π*}(s,a; d_init)/µ(s,a) / ( N (1 − γ(1−λ))^4 ) ) }.   (11)

Note that (11) is derived under the assumption that π* stays in Ω. However, it also holds when π* is not covered by Ω: in that case, max_{s,a} d^{π*}(s,a; d_init)/µ(s,a) = ∞ and E_D[ Ṽ^{π*}(d_init) − Ṽ^{π̂}(d_init) ] ≤ V_max.

D EXTENDED RESULTS FOR EXPERIMENTS

We conduct HUBL with four offline RL base methods on 27 datasets from D4RL and Meta-World. By blending heuristics with bootstrapping, HUBL reduces the complexity of decision-making and provides smaller regret while incurring limited bias. We demonstrate that HUBL is able to improve the performance of offline RL methods.

D.1 IMPLEMENTATION DETAILS

We implement the base offline RL methods with the code sources listed in Table 3.

Table 3: Code sources for base methods.
Base Method | Code Source
ATAC | https://github.com/chinganc/lightATAC
CQL | https://github.com/young-geng/CQL/tree/master/SimpleSAC
IQL | https://github.com/gwthomas/IQL-PyTorch/blob/main/README.md
TD3+BC | https://github.com/sfujim/TD3_BC

Experiments with ATAC, IQL, and TD3+BC are run on Standard F4S V2 nodes of Azure, and experiments with CQL are run on NC6S V2 nodes of Azure.
As suggested by the original implementations of the considered base offline RL methods, we use 3-layer fully connected neural networks for the policy, critic, and value networks, where each hidden layer has 256 neurons with ReLU activation and the output layer is linear. The first-order optimization is implemented by Adam (Kingma and Ba, 2014) with a minibatch size of 256. The learning rates are selected following the original implementations and are reported in Table 4.

Table 4: Learning rates
Base Method | Policy Network | Q Network
ATAC | 5 × 10^-7 | 5 × 10^-4
CQL | 3 × 10^-4 | 3 × 10^-4
IQL | 3 × 10^-4 | 3 × 10^-4
TD3+BC | 3 × 10^-4 | 3 × 10^-4

For each dataset, the hyperparameters of the base methods are tuned over six configurations suggested by the original papers. These configurations are summarized in Table 5.

Table 5: Hyperparameter configurations of base methods
Base Method | Hyperparameter | Values
ATAC | degree of pessimism β | {1.0, 4.0, 16.0, 0.25, 0.625, 10}
CQL | min Q weight | {5, 20, 80, 1.25, 0.3125, 10.0}
IQL | inverse temperature β | {6.5, 3.0, 10.0}
IQL | expectile parameter τ | {0.7, 0.9}
TD3+BC | strength of the regularizer λ | {2.5, 0.625, 10.0, 40.0, 0.15625, 1}

Meanwhile, HUBL has one extra hyperparameter, α, which is fixed across all datasets but differs for each base method. Specifically, α is selected from {0.001, 0.01, 0.1} according to the relative improvement averaged over all the datasets in one task. In the robustness analysis below, we show that the performance of HUBL is insensitive to the selection of α. For each configuration and each base method, we repeat experiments three times with seeds in {0, 1, 10}.
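The datasets are relabeled with heuristic values computed as Monte-Carlo returns of the logged trajectories (see also the discussion in Section E). As a concrete illustration, the sketch below shows one way such per-step heuristic labels could be computed; the function name and data layout are assumptions for exposition, not the released implementation.

```python
import numpy as np

def monte_carlo_heuristics(trajectories, gamma):
    """Illustrative sketch: label every step of each logged trajectory with its
    discounted Monte-Carlo return, to be used as the heuristic value h.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples
    gamma       : discount factor
    Returns a list of per-step heuristic arrays aligned with each trajectory.
    """
    all_h = []
    for traj in trajectories:
        rewards = np.array([r for (_, _, r) in traj], dtype=np.float64)
        h = np.zeros_like(rewards)
        running = 0.0
        # Backward pass: h_t = r_t + gamma * h_{t+1}.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            h[t] = running
        all_h.append(h)
    return all_h
```

Because these labels are plain discounted sums over each trajectory, they can be precomputed once per dataset before any base offline RL method is run.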
D.2 BASE PERFORMANCE AND STANDARD DEVIATIONS FOR D4RL DATASETS

We provide the performance of the base offline RL methods and standard deviations in Table 6.

Table 6: Relative normalized score improvement for HUBL (Sigmoid, Rank, and Constant labeling) and normalized scores of base methods on D4RL datasets (mean ± standard deviation).

(ATAC)
Dataset | Sigmoid | Rank | Constant | ATAC
halfcheetah-medium-expert-v2 | 0.02 ± 0.0 | 0.02 ± 0.01 | 0.02 ± 0.0 | 92.5 ± 0.73
halfcheetah-medium-replay-v2 | 0.01 ± 0.01 | 0.02 ± 0.01 | 0.02 ± 0.02 | 46.89 ± 0.25
halfcheetah-medium-v2 | 0.0 ± 0.01 | 0.0 ± 0.0 | 0.02 ± 0.01 | 52.54 ± 0.63
hopper-medium-expert-v2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 111.04 ± 0.13
hopper-medium-replay-v2 | 0.0 ± 0.0 | 0.0 ± 0.01 | 0.0 ± 0.0 | 101.23 ± 0.59
hopper-medium-v2 | 0.01 ± 0.04 | 0.02 ± 0.07 | 0.03 ± 0.15 | 85.59 ± 4.76
walker2d-medium-expert-v2 | 0.0 ± 0.0 | 0.02 ± 0.0 | 0.0 ± 0.01 | 111.99 ± 0.8
walker2d-medium-replay-v2 | 0.03 ± 0.03 | 0.08 ± 0.0 | 0.05 ± 0.08 | 84.22 ± 2.7
walker2d-medium-v2 | 0.01 ± 0.01 | 0.02 ± 0.05 | 0.02 ± 0.0 | 87.2 ± 0.67
Average Relative Improvement | 0.01 | 0.02 | 0.01 | -

(CQL)
Dataset | Sigmoid | Rank | Constant | CQL
halfcheetah-medium-expert-v2 | 0.02 ± 0.02 | 0.03 ± 0.06 | 0.02 ± 0.17 | 81.96 ± 9.93
halfcheetah-medium-replay-v2 | 0.1 ± 0.01 | 0.18 ± 0.02 | 0.13 ± 0.01 | 40.98 ± 0.81
halfcheetah-medium-v2 | 0.1 ± 0.01 | 0.02 ± 0.01 | 0.03 ± 0.01 | 51.05 ± 0.83
hopper-medium-expert-v2 | 0.42 ± 0.01 | 0.39 ± 0.07 | 0.38 ± 0.05 | 78.39 ± 14.51
hopper-medium-replay-v2 | 0.03 ± 0.01 | 0.01 ± 0.01 | 0.02 ± 0.01 | 97.21 ± 3.15
hopper-medium-v2 | 0.18 ± 0.14 | 0.15 ± 0.17 | 0.17 ± 0.06 | 70.27 ± 8.77
walker2d-medium-expert-v2 | 0.21 ± 0.0 | 0.22 ± 0.01 | 0.22 ± 0.0 | 89.38 ± 17.54
walker2d-medium-replay-v2 | 0.08 ± 0.27 | 0.38 ± 0.06 | 0.37 ± 0.04 | 63.35 ± 6.22
walker2d-medium-v2 | 0.05 ± 0.05 | 0.05 ± 0.03 | 0.03 ± 0.15 | 79.41 ± 1.82
Average Relative Improvement | 0.11 | 0.16 | 0.13 | -

(IQL)
Dataset | Sigmoid | Rank | Constant | IQL
halfcheetah-medium-expert-v2 | 0.0 ± 0.01 | 0.0 ± 0.0 | 0.05 ± 0.03 | 92.92 ± 0.5
halfcheetah-medium-replay-v2 | 0.03 ± 0.02 | 0.0 ± 0.02 | 0.04 ± 0.03 | 44.35 ± 0.21
halfcheetah-medium-v2 | 0.01 ± 0.0 | 0.0 ± 0.0 | 0.01 ± 0.01 | 49.61 ± 0.16
hopper-medium-expert-v2 | 0.04 ± 0.03 | 0.03 ± 0.02 | 0.09 ± 0.12 | 108.13 ± 2.69
hopper-medium-replay-v2 | 0.72 ± 0.09 | 0.58 ± 0.25 | 0.62 ± 0.27 | 53.98 ± 4.18
hopper-medium-v2 | 0.03 ± 0.03 | 0.02 ± 0.03 | 0.01 ± 0.15 | 61.03 ± 3.27
walker2d-medium-expert-v2 | 0.01 ± 0.0 | 0.01 ± 0.01 | 0.01 ± 0.0 | 109.59 ± 0.2
walker2d-medium-replay-v2 | 0.15 ± 0.12 | 0.21 ± 0.02 | 0.13 ± 0.04 | 73.57 ± 2.64
walker2d-medium-v2 | 0.03 ± 0.02 | 0.03 ± 0.01 | 0.04 ± 0.01 | 81.68 ± 2.48
Average Relative Improvement | 0.09 | 0.09 | 0.06 | -

(TD3+BC)
Dataset | Sigmoid | Rank | Constant | TD3+BC
halfcheetah-medium-expert-v2 | 0.02 ± 0.03 | 0.02 ± 0.03 | 0.02 ± 0.02 | 94.82 ± 0.33
halfcheetah-medium-replay-v2 | 0.1 ± 0.03 | 0.08 ± 0.16 | 0.0 ± 0.02 | 54.38 ± 1.24
halfcheetah-medium-v2 | 0.07 ± 0.02 | 0.04 ± 0.01 | 0.01 ± 0.01 | 64.63 ± 2.15
hopper-medium-expert-v2 | 0.06 ± 0.13 | 0.16 ± 0.2 | 0.03 ± 0.05 | 98.79 ± 12.57
hopper-medium-replay-v2 | 0.16 ± 0.08 | 0.21 ± 0.03 | 0.24 ± 0.03 | 81.43 ± 27.77
hopper-medium-v2 | 0.51 ± 0.17 | 0.7 ± 0.02 | 0.6 ± 0.19 | 59.12 ± 3.39
walker2d-medium-expert-v2 | 0.0 ± 0.01 | 0.01 ± 0.01 | 0.0 ± 0.0 | 111.87 ± 0.17
walker2d-medium-replay-v2 | 0.12 ± 0.04 | 0.09 ± 0.07 | 0.05 ± 0.02 | 84.95 ± 1.85
walker2d-medium-v2 | 0.01 ± 0.03 | 0.08 ± 0.01 | 0.0 ± 0.07 | 84.09 ± 1.94
Average Relative Improvement | 0.06 | 0.09 | 0.1 | -

Next, we study how the performance of HUBL varies over training epochs. Specifically, Figure 4 plots the normalized average returns over epochs for HUBL with TD3+BC on hopper-medium-v2.

Figure 4: Average normalized return of HUBL with TD3+BC on hopper-medium-v2. Panels: (a) HUBL with constant labeling, (b) HUBL with Sigmoid labeling, (c) HUBL with Rank labeling; x-axis: epoch (0–200), y-axis: normalized return.

D.3 ROBUSTNESS OF HUBL TO α

In this section, we demonstrate the robustness of HUBL to α. Specifically, we provide the average relative improvements of HUBL on the D4RL datasets in Table 7. Notice that most average relative improvements are positive, which shows that HUBL can improve performance under different values of α.
Table 7: Average relative improvements of HUBL under different values of α

(a) ATAC
α | Sigmoid | Rank | Constant
0.001 | 0.01 | 0.01 | 0
0.01 | 0.01 | 0.01 | 0.01
0.1 | 0 | 0.02 | 0.01

(b) CQL
α | Sigmoid | Rank | Constant
0.001 | 0.05 | 0.05 | 0.03
0.01 | 0.08 | 0.05 | 0.10
0.1 | 0.11 | 0.16 | 0.13

(c) IQL
α | Sigmoid | Rank | Constant
0.001 | 0.02 | 0 | -0.04
0.01 | 0 | -0.01 | -0.02
0.1 | 0.09 | 0.09 | 0.06

(d) TD3+BC
α | Sigmoid | Rank | Constant
0.001 | 0.01 | 0.01 | 0
0.01 | 0.01 | 0.01 | 0.01
0.1 | 0.06 | 0.09 | 0.06

The selected values of α are reported in Table 8.

Table 8: Selected values of α
Base Method | Sigmoid | Rank | Constant
ATAC | 0.01 | 0.01 | 0.1
CQL | 0.1 | 0.1 | 0.1
IQL | 0.1 | 0.1 | 0.1
TD3+BC | 0.01 | 0.1 | 0.1

D.4 DATA COLLECTION FOR META-WORLD

We collect data for the Meta-World tasks using normalized rewards with goal-oriented stopping. Specifically, given that the original rewards of Meta-World lie in [0, 10], we shift them by −10 and divide by 10, so that the normalized rewards take values in [−1, 0]. Then, we use the hand-scripted policy provided in Meta-World, with Gaussian noise levels in {0.1, 0.5, 1}, to collect 100 trajectories for each task. In this process, a trajectory ends if (i) it reaches the maximum trajectory length (150), or (ii) it achieves the goal (see the code sketch below). Note that we follow the same rule when testing the performance of a learned policy.

D.5 BASE PERFORMANCE AND STANDARD DEVIATIONS FOR META-WORLD DATASETS

We provide the performance of the base offline RL methods and standard deviations in Tables 9 and 10. We also consider behavior cloning (BC) as a baseline, and as a representative of imitation learning and inverse reinforcement learning methods (Geng et al., 2020; 2023; Fu et al., 2017). Since the behavior policy is the scripted policy with Gaussian noise, BC can effectively recover the scripted policy and is thus especially competitive.
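Before turning to the Meta-World results, the following is a minimal sketch of the data-collection rule from Section D.4 (reward normalization plus goal-oriented stopping). The environment and policy interfaces (reset/step, get_action, the success flag in info) and the use of the noise level as a Gaussian standard deviation are assumptions for illustration, not the exact Meta-World API or our released collection script.

```python
import numpy as np

def collect_trajectory(env, scripted_policy, noise_std, max_len=150):
    """Illustrative sketch: collect one trajectory with normalized rewards
    and goal-oriented stopping, under a gym-style environment interface."""
    traj = []
    obs = env.reset()
    for _ in range(max_len):
        # Perturb the hand-scripted action with Gaussian noise (assumed interface).
        action = scripted_policy.get_action(obs) + np.random.normal(
            0.0, noise_std, size=env.action_space.shape)
        next_obs, reward, _done, info = env.step(action)
        normalized_reward = (reward - 10.0) / 10.0  # map [0, 10] to [-1, 0]
        traj.append((obs, action, normalized_reward, next_obs))
        if info.get("success", 0):  # goal-oriented stopping
            break
        obs = next_obs
    return traj
```

The same stopping rule is applied when rolling out a learned policy for evaluation, so training and evaluation trajectories share the trajectory-length and termination conventions.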
Table 9: Relative normalized score improvement for HUBL and normalized scores of base methods on MW datasets (part I) Sigmoid Rank Constant ATAC BC button-press-v2 noise0.1 0.0 0.03 0.01 0.01 0.01 0.01 58.12 0.45 -58.79 button-press-v2 noise0.5 0.0 0.03 0.03 0.05 0.01 0.04 64.16 0.93 -59.58 button-press-v2 noise1 0.05 0.07 0.0 0.04 0.01 0.04 62.72 3.84 -61.24 push-back-v2 noise0.1 0.1 0.03 0.04 0.07 0.15 0.06 87.12 13.89 -104.85 push-back-v2 noise0.5 0.11 0.07 0.19 0.05 0.17 0.02 133.28 14.24 -134.63 push-back-v2 noise1 0.04 0.05 0.02 0.09 0.06 0.07 141.29 13.87 -143.63 reach-v2 noise0.1 0.0 0.06 0.0 0.05 0.0 0.07 17.19 0.76 -18.31 reach-v2 noise0.5 0.37 0.07 0.46 0.06 0.43 0.08 33.91 5.62 -24.57 reach-v2 noise1 0.39 0.05 0.35 0.24 0.5 0.05 46.58 3.98 -43.79 Average Relative Improvement 0.11 0.11 0.14 - - Sigmoid Rank Constant CQL BC button-press-v2 noise0.1 0.0 0.03 0.02 0.05 0.03 0.03 59.67 1.69 -58.79 button-press-v2 noise0.5 0.04 0.05 0.02 0.01 0.02 0.0 58.48 0.85 -59.58 button-press-v2 noise1 0.08 0.12 0.08 0.02 0.04 0.11 67.18 14.25 -61.24 push-back-v2 noise0.1 0.07 0.07 0.02 0.05 0.02 0.02 81.78 2.88 -104.85 push-back-v2 noise0.5 0.25 0.12 0.22 0.1 0.19 0.12 124.64 2.45 -134.63 push-back-v2 noise1 0.06 0.13 0.11 0.14 0.1 0.05 133.66 8.17 -143.63 reach-v2 noise0.1 0.04 0.17 0.01 0.05 0.04 0.02 18.5 2.52 -18.31 reach-v2 noise0.5 0.51 0.05 0.18 0.53 0.4 0.27 41.02 18.79 -24.57 reach-v2 noise1 0.8 0.02 0.35 0.15 0.81 0.03 107.14 32.75 -43.79 Average Relative Improvement 0.16 0.1 0.17 - - Sigmoid Rank Constant IQL BC button-press-v2 noise0.1 0.02 0.02 0.03 0.01 0.03 0.01 59.87 1.32 -58.79 button-press-v2 noise0.5 0.01 0.05 0.01 0.02 0.14 0.13 60.68 0.29 -59.58 button-press-v2 noise1 0.09 0.15 0.07 0.01 0.04 0.03 73.96 5.09 -61.24 push-back-v2 noise0.1 0.01 0.04 0.04 0.02 0.03 0.01 83.79 5.45 -104.85 push-back-v2 noise0.5 0.02 0.06 0.04 0.05 0.03 0.05 124.86 7.45 -134.63 push-back-v2 noise1 0.0 0.01 0.03 0.04 0.0 0.02 143.26 6.31 -143.63 reach-v2 noise0.1 0.03 0.03 0.06 0.03 0.04 0.07 17.37 0.87 -18.31 reach-v2 noise0.5 0.05 0.13 0.02 0.03 0.02 0.24 31.51 9.39 -24.57 reach-v2 noise1 0.12 0.42 0.05 0.09 0.01 0.37 42.23 7.4 -43.79 Average Relative Improvement 0.02 0.02 0.01 - - Sigmoid Rank Constant TD3+BC BC button-press-v2 noise0.1 0.25 0.0 0.23 0.0 0.23 0.02 77.68 22.14 -58.79 button-press-v2 noise0.5 0.04 0.05 0.04 0.06 0.09 0.07 59.54 1.68 -59.58 button-press-v2 noise1 0.02 0.07 0.12 0.23 0.1 0.12 81.03 5.79 -61.24 push-back-v2 noise0.1 0.08 0.01 0.01 0.02 0.05 0.04 85.44 5.05 -104.85 push-back-v2 noise0.5 0.07 0.07 0.05 0.14 0.05 0.1 127.95 9.82 -134.63 push-back-v2 noise1 0.01 0.04 0.01 0.06 0.04 0.02 139.72 15.6 -143.63 reach-v2 noise0.1 0.06 0.04 0.02 0.02 0.0 0.02 17.11 1.28 -18.31 reach-v2 noise0.5 0.12 0.19 0.25 0.14 0.15 0.18 40.69 9.66 -24.57 reach-v2 noise1 0.13 0.18 0.04 0.12 0.1 0.11 65.48 2.85 -43.79 Average Relative Improvement 0.07 0.07 0.06 - - Published as a conference paper at ICLR 2024 Table 10: Relative normalized score improvement for HUBL and normalized scores of base methods on MW datasets (part II) Sigmoid Rank Constant ATAC BC assembly-v2 noise0.1 0.01 0.03 0.01 0.04 0.0 0.08 70.22 4.61 -71.76 assembly-v2 noise0.5 0.01 0.02 0.03 0.02 0.08 0.0 135.77 1.58 -129.53 assembly-v2 noise1 0.01 0.01 0.02 0.01 0.04 0.01 140.12 2.45 -138.63 handle-press-side-v2 noise0.1 0.03 0.01 0.02 0.02 0.03 0.03 32.38 0.64 -36.28 handle-press-side-v2 noise0.5 0.0 0.03 0.02 0.03 0.01 0.03 33.04 0.96 -34.3 handle-press-side-v2 noise1 0.04 0.12 0.02 0.1 0.09 0.06 35.33 3.7 -35.72 
plate-slide-back-side-v2 noise0.1 0.0 0.04 0.03 0.02 0.02 0.05 37.07 0.87 -37.75 plate-slide-back-side-v2 noise0.5 0.07 0.41 0.17 0.29 0.15 0.02 68.9 10.92 -65.47 plate-slide-back-side-v2 noise1 0.17 0.3 0.43 0.02 0.44 0.11 97.9 25.46 -128.28 Average Relative Improvement 0.03 0.07 0.07 - - Sigmoid Rank Constant CQL BC assembly-v2 noise0.1 0.03 0.06 0.06 0.17 0.06 0.07 76.29 9.41 -71.76 assembly-v2 noise0.5 0.0 0.02 0.01 0.04 0.01 0.02 131.35 2.62 -129.53 assembly-v2 noise1 0.0 0.03 0.0 0.01 0.0 0.02 138.94 0.97 -138.63 handle-press-side-v2 noise0.1 0.11 0.07 0.02 0.07 0.02 0.19 34.47 4.54 -36.28 handle-press-side-v2 noise0.5 0.32 0.11 0.38 0.11 0.46 0.01 70.96 29.39 -34.3 handle-press-side-v2 noise1 0.15 0.12 0.13 0.2 0.4 0.05 55.3 33.42 -35.72 plate-slide-back-side-v2 noise0.1 0.05 0.03 0.0 0.11 0.03 0.11 36.31 2.33 -37.75 plate-slide-back-side-v2 noise0.5 0.03 0.09 0.09 0.47 0.13 0.5 36.72 2.56 -65.47 plate-slide-back-side-v2 noise1 0.36 0.31 0.37 0.12 0.34 0.18 67.63 20.56 -128.28 Average Relative Improvement 0.12 0.09 0.13 - - Sigmoid Rank Constant IQL BC assembly-v2 noise0.1 0.02 0.07 0.01 0.04 0.0 0.04 72.06 2.34 -71.76 assembly-v2 noise0.5 0.01 0.01 0.01 0.02 0.01 0.04 122.97 6.12 -129.53 ssembly-v2 noise1 0.01 0.0 0.01 0.03 0.0 0.03 129.01 1.62 -138.63 handle-press-side-v2 noise0.1 0.05 0.01 0.05 0.01 0.05 0.02 34.1 1.55 -36.28 handle-press-side-v2 noise0.5 0.06 0.06 0.05 0.07 0.07 0.04 37.28 0.25 -34.3 handle-press-side-v2 noise1 0.0 0.11 0.03 0.03 0.02 0.07 46.31 3.11 -35.72 plate-slide-back-side-v2 noise0.1 0.01 0.02 0.01 0.01 0.01 0.03 34.4 1.23 -37.75 plate-slide-back-side-v2 noise0.5 0.07 0.1 0.04 0.08 0.03 0.08 34.03 1.83 -65.47 plate-slide-back-side-v2 noise1 0.05 0.18 0.01 0.29 0.04 0.35 43.95 10.29 -128.28 Average Relative Improvement 0.01 0.01 0.01 - - Sigmoid Rank Constant TD3+BC BC assembly-v2 noise0.1 0.01 0.02 0.02 0.03 0.01 0.03 72.44 6.03 -71.76 assembly-v2 noise0.5 0.0 0.02 0.02 0.02 0.02 0.04 131.62 5.25 -129.53 assembly-v2 noise1 0.0 0.0 0.01 0.01 0.0 0.01 138.0 1.0 -138.63 handle-press-side-v2 noise0.1 0.07 0.02 0.05 0.02 0.05 0.02 35.0 2.14 -36.28 handle-press-side-v2 noise0.5 0.08 0.04 0.08 0.03 0.07 0.05 40.24 1.06 -34.3 handle-press-side-v2 noise1 0.08 0.08 0.06 0.08 0.13 0.17 52.22 5.29 -35.72 plate-slide-back-side-v2 noise0.1 0.16 0.06 0.09 0.05 0.03 0.05 39.31 1.52 -37.75 plate-slide-back-side-v2 noise0.5 0.17 0.24 0.1 0.33 0.07 0.32 40.17 1.64 -65.47 plate-slide-back-side-v2 noise1 0.19 0.35 0.35 0.28 0.07 0.24 91.74 31.24 -128.28 Average Relative Improvement 0.07 0.05 0.02 - - D.6 EXTENDED EXPERIMENTS In this section, we study the performance of HUBL in various of scenarios. D.6.1 HUBL ON EXPERT DATA We study the performance of HUBL on expert collected data. Specifically, we conduct experiments of HUBL with TD3+BC on expert datasets in D4RL. The results are reported in Table 11. Note that HUBL is able to provide performance improvements in most of the cases. But the room for improvement on such datasets is small for any offline learning technique. Therefore, due to finite sample error, HUBL may occasionally hurt performance empirically Published as a conference paper at ICLR 2024 Constant Rank Sigmoid Base Performance halfcheetah-expert-v2 1% 0% 1% 97.09 hopper-expert-v2 3% 3% 3% 108.46 walker2d-expert-v2 0% 1% 0% 110.14 Table 11: Relative improvement of HUBL for TD3+BC on expert data D.6.2 HUBL WITH ONLINE RL METHODS We provide experiments where we combine HUBL with DDPG on D4RL datasets, without any additional designs for the offline RL setting. 
The results are provided in Table 12.

Table 12: Relative improvement of HUBL for DDPG on D4RL datasets
Performance of DDPG | 13.43
Relative improvement of HUBL with sigmoid | 53%
Relative improvement of HUBL with constant | 74%
Relative improvement of HUBL with portion | 90%

Note that HUBL significantly improves DDPG's performance in relative terms. However, the absolute performance of DDPG+HUBL is much worse than that of the other methods considered in the main paper. This is expected, as HUBL is designed to augment offline RL algorithms, as opposed to making a general online RL algorithm (like DDPG) work offline. To see this more clearly, consider a toy example with three actions a1, a2, a3 at state s0, where a1 and a2 are out-of-support. Without additional designs for offline RL (such as pessimism), the policy may still take a1 or a2 regardless of how we set the heuristic, since the heuristic only affects the values of in-support actions and not the values of out-of-support actions. Thus, HUBL cannot turn an online RL algorithm into an offline RL algorithm; it can only enhance an algorithm that is already designed for offline RL.

D.6.3 HUBL IN STOCHASTIC ENVIRONMENTS

To test the performance of HUBL in stochastic environments, we construct a random version of walker2d-v2 by adding βϵ to the action passed to the environment, where ϵ is a uniform random variable taking values in [−1, 1] and β is a scalar controlling the amount of randomness. Note that we keep the scale of the noise in [−1, 1], and the action variable itself is normalized to [−1, 1] in D4RL. With the new environment, we regenerate the data using the original walker2d-medium-v2 policy and apply TD3+BC with HUBL. The results are presented in Table 13.

Table 13: Relative improvement of HUBL for TD3+BC on stochastic walker2d-medium-v2
β | Sigmoid | Rank | Constant | Base Performance
0.0001 | 6% | 6% | 4% | 66.99
0.001 | 6% | 8% | 7% | 66.73
0.01 | 4% | 4% | 6% | 64.23
1 | 4% | -4% | 5% | 28.09

The base performance decreases as the noise increases, as expected. However, HUBL is still able to provide some performance improvement in this stochastic setting. We also notice that the constant labeling function is more robust to the noise, which makes sense because the constant function is not sensitive to the heuristic estimation. With even more noise, however, the performance of HUBL would degrade further, which is a limitation of HUBL.

D.7 EXTENDED RESULTS FOR ABSOLUTE IMPROVEMENT

We report the experimental results in terms of absolute improvement of the normalized score for D4RL and Meta-World.

Table 14: Normalized score improvement for HUBL and normalized scores of base methods on D4RL datasets.
Sigmoid Rank Constant ATAC BC halfcheetah-medium-expert-v2 2.13 0.35 1.55 0.96 1.76 0.14 92.5 0.73 45.83 halfcheetah-medium-replay-v2 0.43 0.38 0.96 0.27 1.04 1.01 46.89 0.25 37.73 halfcheetah-medium-v2 0.05 0.71 0.16 0.25 0.85 0.35 52.54 0.63 42.92 hopper-medium-expert-v2 0.36 0.03 0.44 0.26 0.27 0.1 111.04 0.13 57.51 hopper-medium-replay-v2 0.01 0.48 0.21 0.77 0.26 0.32 101.23 0.59 32.22 hopper-medium-v2 0.44 3.28 1.57 6.09 2.35 12.85 85.59 4.76 53.81 walker2d-medium-expert-v2 0.08 0.4 1.72 0.5 0.26 0.6 111.99 0.8 105.3 walker2d-medium-replay-v2 2.76 2.23 7.15 0.31 4.38 6.53 84.22 2.7 22.78 walker2d-medium-v2 0.6 0.5 1.95 4.62 1.35 0.17 87.2 0.67 64.18 Average Improvement 0.75 1.31 0.87 - - Sigmoid Rank Constant CQL BC halfcheetah-medium-expert-v2 1.55 1.75 2.66 4.77 1.49 13.59 81.96 9.93 45.83 halfcheetah-medium-replay-v2 4.14 0.22 7.19 0.73 5.26 0.4 40.98 0.81 37.73 halfcheetah-medium-v2 5.12 0.43 0.86 0.72 1.35 0.68 51.05 0.83 42.92 hopper-medium-expert-v2 32.62 1.08 30.72 5.21 29.5 3.76 78.39 14.51 57.51 hopper-medium-replay-v2 3.29 0.79 0.69 1.31 2.09 0.75 97.21 3.15 32.22 hopper-medium-v2 12.81 10.16 10.87 11.68 12.01 4.51 70.27 8.77 53.81 walker2d-medium-expert-v2 19.14 0.42 19.74 0.82 19.36 0.14 89.38 17.54 105.3 walker2d-medium-replay-v2 4.84 17.24 24.07 3.58 23.4 2.24 63.35 6.22 22.78 walker2d-medium-v2 3.62 3.86 4.14 2.23 2.01 11.8 79.41 1.82 64.18 Average Improvement 8.54 11.02 9.64 - - Sigmoid Rank Constant IQL BC halfcheetah-medium-expert-v2 0.2 0.9 0.28 0.19 4.59 2.6 92.92 0.5 45.83 halfcheetah-medium-replay-v2 1.51 0.67 0.08 0.83 1.86 1.21 44.35 0.21 37.73 halfcheetah-medium-v2 0.28 0.13 0.06 0.23 0.68 0.36 49.61 0.16 42.92 hopper-medium-expert-v2 4.52 3.29 2.88 2.08 10.22 12.89 108.13 2.69 57.51 hopper-medium-replay-v2 38.66 4.79 31.57 13.41 33.41 14.48 53.98 4.18 32.22 hopper-medium-v2 1.78 2.05 1.27 1.69 0.68 8.99 61.03 3.27 53.81 walker2d-medium-expert-v2 1.53 0.31 1.06 1.01 0.99 0.28 109.59 0.2 105.3 walker2d-medium-replay-v2 10.78 9.11 15.49 1.61 9.2 3.04 73.57 2.64 22.78 walker2d-medium-v2 2.17 1.46 2.54 0.9 3.0 0.71 81.68 2.48 64.18 Average Improvement 5.42 5.4 3.17 - - Sigmoid Rank Constant TD3+BC BC halfcheetah-medium-expert-v2 1.52 2.92 2.01 2.51 1.74 2.29 94.82 0.33 45.83 halfcheetah-medium-replay-v2 5.48 1.47 4.41 8.83 0.24 0.82 54.38 1.24 37.73 halfcheetah-medium-v2 4.28 1.5 2.85 0.84 0.76 0.94 64.63 2.15 42.92 hopper-medium-expert-v2 5.8 13.25 15.45 19.55 2.61 4.79 98.79 12.57 57.51 hopper-medium-replay-v2 13.36 6.4 17.23 2.15 19.7 2.13 81.43 27.77 32.22 hopper-medium-v2 30.06 10.23 41.58 1.11 35.2 11.11 59.12 3.39 53.81 walker2d-medium-expert-v2 0.36 0.84 0.92 0.95 0.53 0.35 111.87 0.17 105.3 walker2d-medium-replay-v2 9.85 3.1 7.83 5.66 3.86 1.91 84.95 1.85 22.78 walker2d-medium-v2 0.75 2.86 6.81 1.13 0.37 5.76 84.09 1.94 64.18 Average Improvement 4.23 5.76 6.86 - - Published as a conference paper at ICLR 2024 Table 15: Normalized score improvement for HUBL and normalized scores of base methods on MW datasets (part I) Sigmoid Rank Constant ATAC BC button-press-v2 noise0.1 0.09 1.76 0.43 0.62 0.41 0.82 58.12 0.45 -58.79 button-press-v2 noise0.5 0.21 1.8 2.22 3.5 0.86 2.75 64.16 0.93 -59.58 button-press-v2 noise1 3.42 4.41 0.23 2.71 0.92 2.28 62.72 3.84 -61.24 push-back-v2 noise0.1 8.81 2.64 3.59 6.53 12.75 5.12 87.12 13.89 -104.85 push-back-v2 noise0.5 14.47 8.88 25.88 6.55 22.0 3.23 133.28 14.24 -134.63 push-back-v2 noise1 5.38 7.51 3.28 12.83 7.94 9.56 141.29 13.87 -143.63 reach-v2 noise0.1 0.03 1.08 0.04 0.9 0.0 1.14 17.19 0.76 -18.31 reach-v2 noise0.5 
12.53 2.27 15.6 1.9 14.49 2.68 33.91 5.62 -24.57 reach-v2 noise1 18.22 2.13 16.14 11.32 23.09 2.54 46.58 3.98 -43.79 Average Improvement 6.18 6.94 8.96 - - Sigmoid Rank Constant CQL BC button-press-v2 noise0.1 0.23 1.96 1.32 2.71 1.51 1.87 59.67 1.69 -58.79 button-press-v2 noise0.5 2.29 2.67 1.18 0.55 1.37 0.19 58.48 0.85 -59.58 button-press-v2 noise1 5.37 7.82 5.46 1.68 2.71 7.2 67.18 14.25 -61.24 push-back-v2 noise0.1 5.54 5.79 1.54 3.86 1.33 1.89 81.78 2.88 -104.85 push-back-v2 noise0.5 30.82 14.6 27.34 12.97 23.15 14.97 124.64 2.45 -134.63 push-back-v2 noise1 7.69 16.81 14.41 19.38 13.58 7.12 133.66 8.17 -143.63 reach-v2 noise0.1 0.77 3.1 0.22 0.98 0.67 0.41 18.5 2.52 -18.31 reach-v2 noise0.5 21.05 1.89 7.28 21.76 16.49 11.16 41.02 18.79 -24.57 reach-v2 noise1 86.19 2.49 37.15 15.77 86.65 3.53 107.14 32.75 -43.79 Average Improvement 14.67 10.34 15.93 - - Sigmoid Rank Constant IQL bc button-press-v2 noise0.1 0.53 0.6 1.87 0.75 1.83 0.58 59.87 1.32 -58.79 button-press-v2 noise0.5 1.09 1.25 0.63 1.16 8.45 7.68 60.68 0.29 -59.58 button-press-v2 noise1 1.61 8.16 5.17 0.72 3.19 2.58 73.96 5.09 -61.24 push-back-v2 noise0.1 2.93 3.28 3.11 1.89 2.47 0.59 83.79 5.45 -104.85 push-back-v2 noise0.5 0.39 11.3 4.62 6.24 4.16 6.09 124.86 7.45 -134.63 push-back-v2 noise1 1.82 5.35 3.71 5.69 0.35 3.1 143.26 6.31 -143.63 reach-v2 noise0.1 0.55 0.2 1.0 0.45 0.68 1.23 17.37 0.87 -18.31 reach-v2 noise0.5 2.48 4.95 0.58 0.95 0.58 7.66 31.51 9.39 -24.57 reach-v2 noise1 3.69 5.46 2.12 4.0 0.33 15.48 42.23 7.4 -43.79 Average Relative Improvement 0.41 2.04 0.71 - - Sigmoid Rank Constant TD3+BC BC button-press-v2 noise0.1 19.12 0.22 18.19 0.17 18.01 1.73 77.68 22.14 -58.79 button-press-v2 noise0.5 2.47 2.73 2.11 3.75 5.36 4.18 59.54 1.68 -59.58 button-press-v2 noise1 1.71 5.97 9.71 18.76 8.4 9.47 81.03 5.79 -61.24 push-back-v2 noise0.1 7.18 0.66 1.06 1.65 4.11 3.08 85.44 5.05 -104.85 push-back-v2 noise0.5 9.43 8.7 6.1 17.6 6.43 12.98 127.95 9.82 -134.63 push-back-v2 noise1 1.26 6.0 1.4 8.05 5.31 2.35 139.72 15.6 -143.63 reach-v2 noise0.1 0.99 0.66 0.28 0.31 0.07 0.38 17.11 1.28 -18.31 reach-v2 noise0.5 5.04 7.92 10.2 5.67 6.07 7.23 40.69 9.66 -24.57 reach-v2 noise1 8.4 12.06 2.34 7.98 6.74 7.36 65.48 2.85 -43.79 Average Improvement 5.41 4.94 4.34 - - Published as a conference paper at ICLR 2024 Table 16: Normalized score improvement for HUBL and normalized scores of base methods on MW datasets (part II) Sigmoid Rank Constant ATAC BC assembly-v2 noise0.1 0.99 2.28 1.0 2.73 0.23 5.65 70.22 4.61 -71.76 assembly-v2 noise0.5 1.05 2.43 3.59 3.02 11.13 6.31 135.77 1.58 -129.53 assembly-v2 noise1 1.29 1.99 2.47 1.95 5.75 1.07 140.12 2.45 -138.63 handle-press-side-v2 noise0.1 0.93 0.41 0.49 0.64 0.88 1.06 32.38 0.64 -36.28 handle-press-side-v2 noise0.5 0.08 0.86 0.74 0.94 0.35 1.06 33.04 0.96 -34.3 handle-press-side-v2 noise1 1.25 4.2 0.73 3.55 3.07 2.05 35.33 3.7 -35.72 plate-slide-back-side-v2 noise0.1 0.07 1.44 0.98 0.82 0.77 1.96 37.07 0.87 -37.75 plate-slide-back-side-v2 noise0.5 5.15 28.23 11.72 20.14 10.47 1.41 68.9 10.92 -65.47 plate-slide-back-side-v2 noise1 16.4 29.75 42.45 2.22 43.25 10.94 97.9 25.46 -128.28 Average Relative Improvement 2.29 6.64 7.43 - - Sigmoid Rank Constant CQL BC assembly-v2 noise0.1 2.37 4.62 4.2 12.62 4.36 5.55 76.29 9.41 -71.76 assembly-v2 noise0.5 0.57 2.09 1.44 5.55 1.21 2.87 131.35 2.62 -129.53 assembly-v2 noise1 0.01 3.97 0.35 2.07 0.56 2.56 138.94 0.97 -138.63 handle-press-side-v2 noise0.1 3.79 2.52 0.53 2.53 0.73 6.56 34.47 4.54 -36.28 handle-press-side-v2 noise0.5 23.05 7.99 26.8 
8.13 32.31 0.94 70.96 29.39 -34.3 handle-press-side-v2 noise1 8.33 6.37 6.94 10.92 22.01 2.89 55.3 33.42 -35.72 plate-slide-back-side-v2 noise0.1 1.83 1.21 0.02 3.82 0.95 3.95 36.31 2.33 -37.75 plate-slide-back-side-v2 noise0.5 0.96 3.25 3.16 17.37 4.94 18.35 36.72 2.56 -65.47 plate-slide-back-side-v2 noise1 24.46 21.13 24.72 8.42 23.04 12.37 67.63 20.56 -128.28 Average Relative Improvement 7.14 6.67 8.65 - - Sigmoid Rank Constant IQL BC assembly-v2 noise0.1 1.18 4.78 0.49 3.07 0.02 3.2 72.06 2.34 -71.76 assembly-v2 noise0.5 1.49 1.27 0.9 1.89 1.13 4.46 122.97 6.12 -129.53 assembly-v2 noise1 1.31 0.33 0.76 3.55 0.39 3.85 129.01 1.62 -138.63 handle-press-side-v2 noise0.1 1.56 0.5 1.67 0.39 1.66 0.53 34.1 1.55 -36.28 handle-press-side-v2 noise0.5 2.16 2.07 2.04 2.46 2.56 1.54 37.28 0.25 -34.3 handle-press-side-v2 noise1 0.2 5.06 1.33 1.52 1.12 3.27 46.31 3.11 -35.72 plate-slide-back-side-v2 noise0.1 0.24 0.76 0.18 0.48 0.42 1.12 34.4 1.23 -37.75 plate-slide-back-side-v2 noise0.5 2.3 3.47 1.3 2.69 0.92 2.7 34.03 1.83 -65.47 plate-slide-back-side-v2 noise1 2.33 7.7 0.63 12.54 1.8 15.22 43.95 10.29 -128.28 Average Relative Improvement 0.41 0.14 0.34 - - Sigmoid Rank Constant TD3+BC BC assembly-v2 noise0.1 1.08 1.69 1.16 2.35 0.71 1.95 72.44 6.03 -71.76 assembly-v2 noise0.5 0.56 3.01 2.16 3.05 2.6 5.8 131.62 5.25 -129.53 assembly-v2 noise1 0.48 0.6 0.71 1.39 0.03 1.92 138.0 1.0 -138.63 handle-press-side-v2 noise0.1 2.34 0.77 1.91 0.87 1.67 0.69 35.0 2.14 -36.28 handle-press-side-v2 noise0.5 3.25 1.66 3.17 1.04 2.97 1.82 40.24 1.06 -34.3 handle-press-side-v2 noise1 4.31 4.29 3.39 3.97 6.93 8.89 52.22 5.29 -35.72 plate-slide-back-side-v2 noise0.1 6.11 2.3 3.67 1.77 1.09 2.11 39.31 1.52 -37.75 plate-slide-back-side-v2 noise0.5 6.77 9.5 4.01 13.35 2.91 13.02 40.17 1.64 -65.47 plate-slide-back-side-v2 noise1 17.06 32.1 31.78 25.94 6.48 22.36 91.74 31.24 -128.28 Average Relative Improvement 4.14 4.46 1.65 - - E LIMITATIONS AND FUTURE WORK In our current framework, HUBL calculates heuristic values as Monte-Carlo returns. This is feasible in practical scenarios where data comes in the form of trajectories, but much of offline RL literature experiments with batch datasets that consist of disconnected transition tuples. For such datasets, HUBL requires other heuristic calculation strategies. Empirically, we have showed HUBL s utility on benchmarks where observations are full-information low-dimensional state vectors and state transitions are deterministic. While these benchmarks are standard in offline RL, evaluation in more stochastic domains where observations are partial and high-dimensional, such as robotics, will give a more complete picture of HUBL s utility. So would evaluating HUBL with model-based offline RL such as MORe L (Kidambi et al., 2020). Also, HUBL is not applicable to bootstrapping-free offline RL methods (Schaefer et al., 2007; Deisenroth and Rasmussen, 2011; Depeweg et al., 2016; Swazinna et al., 2021; 2022). On the theory side, our analysis focuses on state-dependent heuristics. In practice, though, heuristic estimates commonly depend on full trajectories, and our theory needs adjustments to account for this. Same goes for applying our theory to model-based offline RL such as MORe L. Published as a conference paper at ICLR 2024 For future direction, we aim to develop better heuristics and blending strategies for batch datasets and sparse-reward problems. 
We also plan to apply HUBL to other tasks in healthcare and finance (Alaluf et al., 2022; Geng et al., 2018; 2019a;b) where collecting data is especially expensive.