# automatic_reward_shaping_from_confounded_offline_data__f9da01f4.pdf Automatic Reward Shaping from Confounded Offline Data Mingxuan Li 1 Junzhe Zhang 2 Elias Bareinboim 1 Reward shaping has been demonstrated to be an effective technique for accelerating the learning process of reinforcement learning (RL) agents. While successful in empirical applications, the design of a good shaping function is less well understood in principle and thus often relies on domain expertise and manual design. To overcome this limitation, we propose a novel automated approach for designing reward functions from offline data, possibly contaminated with the unobserved confounding bias. We propose to use causal state value upper bounds calculated from offline datasets as a conservative optimistic estimation of the optimal state value, which is then used as state potentials in Potential-Based Reward Shaping (PBRS). When applying our shaping function to a model-free learner based on UCB principles, we show that it enjoys a better gapdependent regret bound than the learner without shaping. To the best of our knowledge, this is the first gap-dependent regret bound for PBRS in model-free learning with online exploration. Simulations support the theoretical findings. 1. Introduction Reward shaping is an effective technique for improving sample efficiency in reinforcement learning. It augments the original environment reward with extra learning signals such that the learner is guided towards high-rewarding states in the environment. Though if not designed properly, there is a risk of misleading the agent towards suboptimal policies (Saksida et al., 1997; Randløv & Alstrøm, 1998). Potential Based Reward Shaping (PBRS, Ng et al. (1999)) solves this problem by constructing the shaping functions as the difference between state potentials. In this way, the 1Causal AI Lab, Columbia University, New York, USA 2Department of Electrical Engineering and Computer Science, Syracuse University, New York, USA. Correspondence to: Mingxuan Li . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). set of optimal policies after reward shaping is guaranteed to be unchanged compared with the one without shaping. It has been followed in many practical applications since then. The core of PBRS is the design of potential functions, which denote the potential of a state or how good a state is. First, people generally rely on domain expertise to compose such potential functions. This dependence on experts knowledge could make designing potential functions expensive and time-consuming. One may even argue these extensive human efforts are against the promise of AI that domain experts should be free from handcrafting solutions. Moreover, this designing process endures risks of misspecification due to human biases, which in turn could slow down the training substantially or lead to erroneous agents (Ng et al., 1999; Randløv & Alstrøm, 1998; Pan et al., 2022). An ongoing line of research exists to simultaneously automate the process of learning the shaping function and training the online agent (Pathak et al., 2017; Yuan et al., 2023; Raileanu & Rockt aschel, 2020; Devidze et al., 2022; Zou et al., 2019; Ma et al., 2024). However, these methods involve nontrivial optimization procedures relying on prior parametric knowledge over the potential function. The risks of misspecification persist. In other words, designing potential functions is still a significant obstacle when applying PBRS. An alternative strategy is to learn the shaping function from previous offline data, possibly collected by different behavior policies or observing human operators interacting with the environment (Brys et al., 2015; Mezghani et al., 2022; Zhang et al., 2024). In their seminal work, Ng et al. (1999) noted that the optimal state value is a good candidate for potential functions, characterizing the state potentials in the underlying environment. In the field of reinforcement learning, the problem of evaluating optimal state value functions from past offline data has been studied under the rubrics of off-policy learning or batch learning (Sutton, 2018; Levine et al., 2020). Several algorithms and methods have been proposed, including Q-learning (Watkins, 1989; Watkins & Dayan, 1992), importance sampling (Swaminathan & Joachims, 2015; Jiang & Li, 2016), and temporal difference (Precup et al., 2000; Munos et al., 2016). These algorithms rely on the critical assumption that there is no unobserved confounding (NUC) in the offline data (Murphy, 2003; 2005). One might be able to enforce this assumption by deliberately controlling the behavior policies generat- Automatic Reward Shaping from Confounded Offline Data ing the data. However, in many practical applications, the NUC condition can be difficult to enforce and, consequently, does not necessarily hold. In these cases, directly applying standard off-policy learning could introduce bias in the estimation, leading to misspecified shaping functions. For example, when some input states to the behavior policy are not fully observed, these states could become unobserved confounders, introducing spurious correlations in the offline data. When NUC is violated, the effects of candidate policies are generally not identifiable, i.e., the model assumptions are insufficient to uniquely determine the value function from the offline data, regardless of the sample size (Pearl, 2009; Zhang & Bareinboim, 2019). This paper addresses the challenges of confounding bias in designing potential functions for reward shaping. We study the problem of constructing reward-shaping functions automatically via confounded offline data from a causal perspective. We focus on the environment of Confounded Markov Decision Processes (CMDPs) where unobserved confounders generally exist at each stage of decisionmaking Zhang & Bareinboim (2022); Tennenholtz et al. (2020). Under CMDPs, we demonstrate the difficulties of manually designing proper reward shaping functions with the presence of unobserved confounding, due to the mismatch in the observed input states. Then we show how one can extrapolate informative shaping functions from confounded offline data using partial causal identification techniques (Manski, 1989). Finally, we develop novel online reinforcement learning algorithms that can leverage the derived shaping functions to improve the learning performance in terms of the sample complexity. More specifically, our contributions are summarized as follows: We propose the first theoretically justified (Thm. 3.1), data-driven method (Algo. 2) for learning reward shaping functions from confounded offline data. We introduce a model-free UCB algorithm that can improve performance by leveraging confounded offline data using the derived shaping functions (Algo. 1). We derive a novel gap-dependent regret bound for the proposed UCB algorithm. Our analysis reveals how and under what condition (Def. 4.1) the derived shaping functions affect the learning efficiency of future online learners (Thm. 4.5). Due to limited space, related work, experiment details and all the proof are provided in Apps. A, D and H, respectively. Notations. We will consistently use capital letters (V ) to denote random variables, lowercase letters (v) for their values, and cursive V to denote their domains. Fix indices i, j N. We use bold capital letters (V ) to denote a set of random variables and let |V | denote its cardinality of set V . Finally, 1Z=z is an indicator function that returns 1 if event Z = z holds true; otherwise, it returns 0. 2. Challenges of Designing Reward Shaping in the Face of Unobserved Confounders We will focus on a sequential decision-making setting in the Markov Decision Process (MDP, Puterman (1994)) where the agent intervenes on a sequence of actions X1, . . . , XH in order to optimize the cumulative return over reward signals Y1, . . . , YH; H N is a finite horizon. Standard MDP formalism focuses on the perspective of the learners who could actively intervene in the environment. Consequently, the data collected from randomized experiments is free from the contamination of unobserved confounding bias and is generally assumed away in the model. However, when considering offline data collected by passive observation, the learner may not necessarily have deliberate control over the behavioral policy generating the data. Consequently, this could lead to confounding bias in various decision-making tasks, including off-policy learning (Kallus & Zhou, 2018; Lu et al., 2023; Zhang & Bareinboim, 2024), and imitation learning (Zhang et al., 2020; Kumor et al., 2021; Ruan et al., 2024). In this paper, we will consider an extended family of MDPs explicitly modeling the presence of unobserved confounders when generating offline data. Definition 2.1. A Confounded Markov Decision Process (CMDP) M is a tuple of S, X, Y, U, H, F, P where, S, X, Y are, respectively, the space of observed states, actions, and rewards; U is the space of unobserved exogenous noise; H N is a finite horizon; F is a set consisting of the transition function τh : S X U 7 S, behavioral policy βh : S U 7 X, and reward function rh : S X U 7 Y for every time step h = 1, . . . , H; P is a set of distributions Ph over the unobserved domain U for every time step h = 1, . . . , H. Consider a demonstrator agent interacting with a CMDP. For every time step h = 1, . . . , H, the nature draws an exogenous noise Uh from the distribution P(U); the demonstrator performs an action Xh βh(Sh, Uh), receives a subsequent reward Yh rh(Sh, Xh, Uh), and moves to the next state Sh+1 τh(Sh, Xh, Uh). The observed trajectories of the demonstrator (from the learner s perspective) are thus summarized as the observational distribution P( X, S, Y ).1 In the data-generating process described above, for every time step t, the exogenous noise Uh becomes an unobserved confounder affecting the action Xh, reward Yh, and next state Sh+1 simultaneously. Therefore, CMDP is also referred to MDP with Unobserved Confounders (MDPUC, Zhang & Bareinboim (2022)) and is a subclass of Con- 1We will consistently use X, S, Y to represent sequences {X1, . . . , XH}, {S1, . . . , SH} and {Y1, . . . , YH} Automatic Reward Shaping from Confounded Offline Data (a) CMDP, M (b) CMDP - Online Learning, Mπ Figure 1. (a) Causal diagram of the CMDP modeling the shaping function designing process; (b) Causal diagram of the CMDP modeling the online learning process under policy do(π). (a) Falling to walk. (b) Target task. Figure 2. Walking Robot Example. founded Partially Observed MDP (Shi et al., 2022; Miao et al., 2022; Bennett & Kallus, 2024) where Markov property holds. The following example, inspired by human walking dynamics (O Connor & Kuo, 2009; Wang & Srinivasan, 2014), demonstrates an instance of CMDP. Example 1 (Walking Robot). Consider a scenario where a robot learns to walk across a hallway. The agent can take two actions, small-step and big-step, denoted as Xh = 0 and Xh = 1, respectively. Whether the agent can move forward (Lh+1) depends on both the step size, Xh, and the current body stability status, Fh. It s defined as Lh+1 Lh + 1[( Fh (Xh = Uh)) (Fh Xh)]. The body stability is modeled as Fh+1 Xh if the body is in stable status Fh = 1 or Fh+1 Xh Uh if it is not stable Fh = 0 and needs to adjust step size accordingly. The required step size to stay stable is modeled as a uniform random noise Uh. The agent receives a reward Yh = 1 if it reaches the goal location or it moves forward. It receives a reward Yh = 1 if it cannot stabilize the body accordingly. Formally, Yh 1 if Lh+1 = G otherwise Yh 1[( Fh Xh = Uh) (Fh Xh)] 1[ Fh Xh Uh]. In this example, body stability Fh affects whether the robot can step forward. We will consider offline data generated by two demonstrators. When the robot is currently unstable Fh = 0, a competent demonstrator following a behavioral policy β(1) h : Xh Uh takes a step in the size of exactly the latent noise Uh. As a result, it will always transit from being unstable Fh = 0 to a stable status Fh+1 = Xh Uh = Uh Uh = 1. On the contrary, an incompetent demonstrator following β(2) X : Xh Uh will always attempt to remain unstable Fh+1 = Xh Uh = Uh Uh = 0 even when the opposite is preferable. Fig. 1a shows the graphical representation (i.e., causal diagram) describing the generative process generating the offline data in CMDPs. More specifically, nodes represent observed variables Xh, Sh, Yh, and arrows represent the functional relationships βh, τh, rh among them. The exogenous noise Uh is often not explicitly shown. However, bi-directed arrows Xh Yh and Xh Sh+1 indicate the presence of an unobserved confounder (UC) Uh affecting Xh, Yh and Sh+1, simultaneously. These bi-directed arrows characterize the spurious correlations among action Xh, reward Yh, and state Sh+1 in the offline data, violating the NUC condition. Such violations could lead to challenges in evaluating value function and reward shaping, which we will discuss in the next section. 2.1. Potential-Based Reward Shaping A policy π in a CMDP M is a sequence of decision rules πh : S 7 X, for every step h = 1, . . . , H, mapping from state to action. Similarly, πh(xh | sh) is a stochastic policy mapping from state space S to a distribution over action space X. An intervention do(π) is an operation that replaces the behavioral policy βh in model M with the decision rule πh for every step h. Let Mπ be the submodel induced by intervention do(π). Fig. 1b shows the graphical representation of the data generating process in the submodel Mπ; bi-directed arrows are now removed. The interventional distribution Pπ( X, S, Y ) is defined as the joint distribution over observed variables in Mπ, i.e., Pπ( x, s, y) = P(s1) πh(xh | sh) Th(sh, xh, sh+1)Rh(sh, xh, yh) (1) where the transition distribution Th and the reward distribution Rh are given by, for h = 1, . . . , H, Th(s, x, s ) = Z u 1s =τh(s,x,u)Ph(u) (2) Rh(s, x, y) = Z u 1y=rh(s,x,u)Ph(u) (3) For convenience, we write the reward function Rh(s, x) as the expected value P y y Rh(s, x, y). Automatic Reward Shaping from Confounded Offline Data For reinforcement learning tasks, the agent s goal is to learn an optimal policy π maximizing the cumulative reward in CMDP M, i.e., π = arg maxπ Eπ h PH h=1 Yh i . In analysis, we also evaluate the state value function V π h (s) = Eπ [P t=h Yt | Sh = s] following a policy π for every step h = 1, . . . , H. A state-action value function is defined as Qπ h(s, x) = Eπ [P t=h Yt | Sh = s, Xh = x]. Reward shaping is a popular line of techniques for incorporating domain knowledge during policy learning. Common approaches such as Potential-Based Reward Shaping (PBRS, Ng et al. (1999)) add supplemental signals to the reward function so that it would be easier to learn in future downstream tasks without affecting the optimality of the learned policy. The following proposition establishes the validity of PBRS in CMDP with a finite horizon. Proposition 2.2. For a CMDP M, let M be a CMDP obtained from M by replacing the reward function with the following, for every time step h = 1, . . . , H, r h := rh + ϕh(Sh+1) ϕh(Sh), (4) where rh is the reward function in M; ϕh( ) : S 7 R is a real valued potential function and ϕH(s) = 0. Then, That is, every optimal policy in M will also be an optimal policy in M , and vice versa. In other words, PBRS modifies the reward function in the system to encourage the learning agent to visit states with high potential in future return. At the same time, the agent is penalized for visiting states with low potential. More importantly, optimal policies remain invariant across this shaping process, i.e., every optimal policy in the after-shaping model M is guaranteed to be optimal in the original model M. Designing the potential function is a central problem when applying PBRS. The optimal state value functions V h (s) = V π h (s) (Ng et al., 1999) are a good candidate for measuring state potential. When the NUC condition holds, it is possible to compute such optimal value functions using standard offpolicy methods (Sutton, 2018). However, when unobserved confounding generally exists, evaluating value functions from confounded data could introduce bias in the shaping process, promoting states with low potential. The following example demonstrates such challenges. Example 2 (Confounded Potential Functions). Consider again the Walking Robot in Example 1. Computing the optimal state value function V h (Lh, Fh) from the groundtruth model M gives, for every step h = 1, . . . , H, V h (Lh = 0, Fh = 0) = 5.0, V h (Lh = 0, Fh = 1) = 5.5 If one sets the potential function ϕh := V h , the reward shaping will add the supplemental signal in favor of transitioning from an unstable state Fh = 0 to a stable state Fh+1 = 1, ϕh(Lh+1 = 0, Fh+1 = 1) ϕh(Lh = 0, Fh = 0) = 0.5 On the other hand, if one applies standard off-policy methods using offline data generated by the competent behavioral policy β(1) h , it leads to the following value function, V (1) h (Lh = 0, Fh = 0) = 10, V (1) h (Lh = 0, Fh = 1) = 10 This seems to suggest that being stable does not improve the robot s performance. If we set the potential function ϕh := V (1) h , the reward shaping will not encourage the robot to be stable. More interesting, if we compute the value function from offline data generated by the incompetent behavioral policy β(2) h , we have the following V (2) h (Lh = 0, Fh = 0) = 23, V (2) h (Lh = 0, Fh = 1) = 42 Being stable Fh = 1 appears to lead to lower future returns. If one sets the potential function ϕh := V (2) h , the reward shaping will even penalize the robot to transit from being unstable Fh = 0 to being stable Fh+1 = 1, which contradicts the underlying system dynamics. The above example shows that when unobserved confounders generally exist, naively computing potential functions from offline data could lead to biased evaluation. Shaping reward with a biased potential function could lead to sub-optimal performance, encouraging undesirable behaviors. Indeed, when the NUC does not hold, the transition function T and reward function R are generally not uniquely discernible from the offline data, regardless of the sample size (Pearl, 2009). For the remainder of this paper, we will introduce a robust procedure to design potential functions from confounded offline data, and how to leverage these potential functions for future online learning tasks. 3. Robust Potential Functions from Confounded Offline Data Our goal is to upper bound the optimal state value for online interventional agents, the ones that cannot utilize confounders, from potentially confounded offline datasets. Instead of attempting to identify the underlying reward and transition distribution under confounded datasets, we bound them for every s, x, s , h S X S [H] following similar partial identification strategies in (Manski, 1989), Th(s, x, s ) e Th(s, x, s )Ph(x|s) + Ph( x|s) (5) Rh(s, x) e Rh(s, x)Ph(x|s) + b Ph( x|s) (6) where e Rh(s, x) = E[Yh|Sh = s, Xh = x], e Th(s, x, s ) = Ph(Sh+1 = s |Sh = s, Xh = x) and Ph(x|s) = Ph(Xh = Automatic Reward Shaping from Confounded Offline Data x|Sh = s) are all empirical estimations from the offline dataset. b is a known upper bound on the reward signal, Yh b. Similar to the bounding approach developed in (Zhang & Bareinboim, 2024), we apply the above bounds to the Bellman Optimal Equation (Sutton & Barto, 2018) and arrive at the following Causal Bellman Optimal Equation that calculates the optimal state value optimistically from the confounded offline datasets. Theorem 3.1 (Causal Bellman Optimal Equation). For a CMDP environment M with reward Yh b, b R, the optimal value of interventional policies, V h (s), s S, is upper bounded by V h (s) V h(s) satisfying the Causal Bellman Optimality Equation, for every step h = 1, . . . , H, V h(s) = max x h Ph(x|s) e Rh(s, x) + E e Th[V h+1(s )] + Ph( x|s) b + max s V h+1(s ) i (7) and V H+1(s) = 0 for all state s S. Compared with the original Bellman Optimal Equation, ours accounts for the uncertainty brought by confounders in the offline dataset via an extra term, b + maxs V h+1(s ). This represents the best rewards that the agent could have achieved from those unselected actions, i.e., Ph( x|s). With the Causal Bellman Optimal Equation, we can robustly upper bound the optimal state values from a confounded offline dataset generated by CMDP M. When multiple offline datasets from different behavioral policies are available, each of them provides a unique perspective on the true underlying CMDP without necessarily overlapping state space coverage. Since the upper bounds calculated for each state from different datasets are all valid according to Thm. 3.1, taking the minimum yields the tightest estimation for the optimal state value. Corollary 3.2 (Unified Causal Optimal Upper Bound). Let the causal optimal value upper bound estimated from offline datasets D(i) be V (i), i = 1, 2, ..., N, the unified causal optimal upper bound is defined as V h(s) = mins D(i) h V (i) h (s) where s D(i) h means state s is visited at step h in D(i) and satisfies V h (s) V h(s), for all s. In actual computation, to avoid over-estimation brought by bootstrapping on the highest value state, we clip the optimistic bonus for the optimal transitions and calculate the value update as follows, Q (i) h (s, x) = P (i) h ( x|s) b + min n max s V (i) h+1(s ), ωh o + P (i) h (x|s) e R(i) h (s, x) + X s e T (i) h (s, x, s )V (i) h+1(s ) (8) where ωh = (H h)b is the maximum possible next state value an optimistic transition should receive in an episodic CMDP with horizon H. For state-action pairs that are not visited in the dataset, we will skip those state-action entries directly.2 To summarize, the overall algorithmic procedure to calculate causal bounds runs as standard value iteration (Sutton, 2018) except that we update the state values backwards in time, i.e., from V h+1 to V h by Eq. (8), skip unvisited state-action pairs and take the minimum according to Corol. 3.2 at last. Algo. 2 in App. C shows the full pseudo-code for approximating the optimal value upper bound from offline datasets. Now we can calculate the proposed state potentials for the Walking Robot problem and verify if it upper bounds the true optimal state values. Example 3 (Potential Functions Calculated for Robot Walking). Recall that running value iteration in the interventional policy space, we have the optimal state value for Example 1, V h (Lh = 0, Fh = 0) = 5.0, V h (Lh = 0, Fh = 1) = 5.5. As we have shown, directly calculating state value from offline datasets cannot yield informative potential functions. While with Corol. 3.2, we can extract the following potential functions from the same offline datasets, ϕh(Lh = 0, Fh = 0) = 9.0, ϕh(Lh = 0, Fh = 1) = 9.5, which perfectly upper bound the optimal state values and encourages the agent to stay in stable status. From the interventional policy space optimal value and our potential functions, we see that for interventional agents, stability is always preferred, which can be achieved via taking small steps (Xt = 0) every time. This also aligns well with the intuition that people glide with small steps when walking on slippery surfaces. See App. B for the the full confounded state values, optimal values and calculated bounds. 4. Efficient Online Learning with PBRS In this section, we will apply the derived potential function from the confounded offline data to improve the agent s performance in an online learning task. For a CMDP M, let ϕh, h = 1, . . . , H, be the potential functions derived in the previous section. We first apply reward shaping and obtain an augmented CMDP M following the procedure described in Prop. 2.2.3 In the augmented CMDP M , an online learning agent attempts to learn an optimal policy π by performing interventions for repeated episodes k = 1, . . . , K. For every episode k, the agent picks a policy πk, performs intervention do(πk), and receives subsequent outcomes. The cumulative regret after K > 0 episodes of interventions in the augment 2For a state not visited in any of the offline datasets, we can simply set the bound optimistically as V h(s) = (H h)b. 3This can be done by adding supplemental signal ϕh(Sh+1) ϕh(Sh) to every observed reward Yt. Automatic Reward Shaping from Confounded Offline Data CMDP M is defined as the sum of the gap between the optimal value function V 1 following an optimal π and the value function V πk 1 induced by policy πk. Formally, EM [Regret(K)] = EM k=1 V 1 (Sk 1 ) V πk 1 (Sk 1 ) A desirable property for the agent is to achieve a sublinear regret EM [Regret(K)] = o(K).4 This means that it is able to converge to an optimal policy π . We propose an online learning algorithm based on a modelfree learner, Q-UCB (Jin et al., 2018), to leverage the potential function ϕ extrapolated from offline data. Details of the algorithm is described in Algo. 1. Compared with the original Q-UCB, we make a few modifications for Q-UCB to work with PBRS: (1) Zero initializing the Q-values; (2) Using potential function dependent UCB bonus and value clipping; and finally, (3) Incorporating shaped reward during learning updates. See also App. F for the pseudo-code of the vanilla Q-UCB. As in the original Q-UCB, we have visitation counter, Nh(sh, xh), for the state-action pair (sh, xh) at step h. Note that the UCB bonus uses t as its denominator under the square root, which is the visitation count of (sh, xh) at the beginning of episode k. The counter is updated after assigning t, Nh(sh, xh) Nh(sh, xh) + 1. We use the same adaptive learning rate as αt = H+1 H+t and define ι = log (|S||X|T/p) where p = 2p is the probability of the event when the difference between learned and the optimal Q-values is bounded. As we shall see later in experiments (Fig. 4), using confounded values in Q-UCB Shaping usually impairs the training efficiency rather than helping. While using potential functions that upper bound the optimal state values, learning efficiency is significantly boosted. This observation also resonates with the optimism in the face of uncertainty (OFU) heuristics in reinforcement learning (Gupta et al., 2022; Brafman & Tennenholtz, 2002; Kearns & Singh, 2002). We formalize this observation as the following condition. Definition 4.1 (Conservative Optimism). The learned potential function is conservatively optimistic if it satisfies, V h (s) ϕh(s) H for all S, h S [H] where V h (s) = max π Π EMπ h =h Yh | s This condition ensures that the potential function upper bounds the optimal state value while being informative by not exceeding H (assuming Yh 1). Next, we will show that this is indeed a sufficient condition for guaranteed sample efficiency improvement under reward 4A function f(n) = o(g(n)) if for all c > 0, there exists k > 0 such that f(n) < cg(n) for all n k. shaping. We start with writing the learned Q-value under shaping at step h episode k following Algo. 1 as follows, Qk h(s, x) = i=1 αi t yh ϕh(s) + (ˆPki h V ki h+1 + ˆPki h ϕh+1)(s, x) + bi , where we use ˆPki h to denote the empirical transition of episode ki. That is, for the value of a function mapping from state space to real numbers, f : S 7 R,its value given the next state s in episode ki as input is denoted as, f(s ) = (ˆPki h f)(s, x). Similarly, we write the expected value of such function f over the transitions of an online CMDP under reward shaping, M ,as (Phf)(s, x) = EM π[f(s )|s, x]. αi t is the cumulative learning rates defined as α0 t = Πt j=1(1 αj) and αi t = αiΠt j=i+1(1 αj). Note that we also assume deterministic reward functions for notation wise simplicity (Jin et al., 2018). With the notations above, we bound the difference between the learned Q-value and the optimal Q-value with high probability in both direction. Lemma 4.2 (Bounded Differences Between Qk h and Q h). With probability at least 1 2p, the difference between the learned Q-value at the beginning of episode k and step h and the optimal Q-value can be bounded as 0 Qk h(s, x) Q h(s, x) α0 t( Q h(s, x)) + Pt i=1 αi t h (ˆPki h V ki h+1 ˆPki h V h+1)(s, x) i + 3bt where bt is the UCB bonus as in Algo. 1. The final expected regret bound can be built upon this bounded Q-value difference via a gap-dependent decomposition. We can then bridge this gap dependent regret decomposition with the difference between learned Q-values and optimal Q-values on non-optimal actions based on the fact that Qk h(s, x) Q h(s, x) h(s, x) (Lem. H.7), E [Regret(K)] = EM h=1 h(sk h, xk h) Qk h(sk h, xk h) Q h(sk h, xk h) + 2p TH. Utilizing the function clip [x|δ] = x 1[x δ], x, δ R introduced by Simchowitz & Jamieson (2019) and its properties (Lem. G.2, Lem. G.3), we can further write the difference between learned Q-values and optimal Q-values as a recursion, Qk h(sk h, xk h) Q h(sk h, xk h) clip 3bt (sk h, xk h) 2H i=1 αi t h (Qki h+1 Q h+1)(ski h+1, xki h+1) i , where we unify the optimality gaps of each state action pairs across time steps via (s, x) = minh h(s, x) = Automatic Reward Shaping from Confounded Offline Data Algorithm 1 Q-UCB Shaping 1: Input: Potential function ϕh(s), s, h. Const. c > 0. 2: Q1 h(s, x) = 0, Nh(s, x) = 0, (s, x, h) S X [H]. 3: Calculate maximum potential, ϕm = maxs ϕ(s). 4: for k = 1 to K do 5: Observe initial state s1. 6: for h = 1 to H do 7: Take action xh = arg maxx Qk h(sh, x). 8: Observe sh+1, yh. 9: Update visitation counter and UCB bonus, t = Nh(sh, xh) Nh(sh, xh)+1, bt = c p Hϕ2mι/t 10: Calculate shaped reward, y h = yh + ϕh+1(sh+1) ϕh(sh). 11: Update Q-value, Qk+1 h (sh, xh) = (1 αt)Qk h(sh, xh) + αt(y h + V k h+1(sh+1) + bt). 12: Update value function, V k+1 h (sh) = min{ϕh(sh), maxx Qk+1 h (sh, x)}. 13: end for 14: end for minh V h (s) Q h(s, x) where x is a suboptimal action. Though we can already solve the recursion and calculate the regret based on the property of the clip( ) function (Lem. G.3), a direct summation cannot reveal the benefits of reward shaping to the online learner. Inspired by Gupta et al. (2022), we denote the set of state-action pairs that are far from being even the second best choices as pseudosuboptimal state-action pairs based on the minimum optimal gap, (s, x). Definition 4.3 (Pseudo-Suboptimal State-Action Pairs). We define the set of pseudo-suboptimal state-action pairs as, Sub ={(s, x) S X | h, δh(s, x) V h (s)}, (10) where δh(s, x) = y ϕh(s)+2E[ϕh+1(s )|s, x]+ (s, x). With this set of pseudo-suboptimal state-action pairs, we show that the total number of visits made to such pairs during training can be bounded, contributing to a reduction on our final regret bound in Thm. 4.5. Lemma 4.4 (Bounded Number of Visits to Sub ). The number of visits to (s, x) Sub , t = N k h(s, x), is bounded by, t 16c2Hϕ2 mι 2(s,x) . By treating the summation of clipped terms in the regret differently with respect to whether the state action pairs belong to the set Sub or not, we arrive at a two-part regret bound that both demonstrates the efficiency gain of reward shaping and subsumes prior result on the gap-dependent regret of Q-UCB (Yang et al., 2021). Theorem 4.5 (Regret Bound of Algo. 1). Given a potential function ϕh( ), with its maximum value defined as ϕm = maxs,h ϕh(s), after running algorithm Algo. 1 for K episodes of length H, the expected regret is bounded by, H5 log (SAT) H6 log (SAT) where T = KH is totally number of steps; Sub is the set of pseudo-suboptimal state action pairs and (s, x) = minh h(s, x), for all h [H]. See App. H.3 for the full version regret bound and proof details. The first part of the regret is for the set of state-action pairs on which offline learned shaping functions would cut unnecessary over explorations resulting in an order of magnitude better dependence on the horizon factor H. This improved dependence on H even matches with state-of-theart Q-learning variant in the literature (Xu et al., 2021). The second part corresponds to the set of state-action pairs that our shaping function cannot decide when to stop exploring during learning matching the gap-dependent regret bound of the vanilla Q-UCB (Yang et al., 2021). 5. Experiments In this section, we show simulation results verifying that: (1) Q-UCB with our proposed shaping function enjoys better sample efficiency , and (2) the policy learned by our shaping pipeline at convergence is the optimal policy for an interventional agent. The baselines include vanilla Q-UCB (No Shaping), Q-UCB shaping with minimum state values learned from offline datasets (Shaping + Min Beh. Value), shaping with maximum offline state values (Shaping + Max Beh. Value), and shaping with average offline state values (Shaping + Avg Beh. Value). We test those algorithms in a series of customized windy Mini Grid environments (Zhang & Bareinboim, 2024; Chevalier-Boisvert et al., 2018). We use tabular state-action representations and the rewards are all deterministic in those environments. There is a step penalty of 0.1, +0.2 for getting a coin, 0 for reaching the goal, and 1 for touching the lava. The episode ends immediately when either a goal/lava grid is reached or the episode length hits the horizon limit. In windy grid worlds, the state transition is determined by the resultant force of both agent s action and the wind direction if the agent tries to move. For example, if the agent wants to move right when there is a north wind (wind blowing from the north), the agent will end up being in the lower right grid concerning its original location instead of the right-hand side grid. However, the wind direction can only be observed by more capable behavioral agents. In our collected offline datasets, wind direction is thus a hidden confounder forming a CMDP. See App. D for detailed experiment setups and more results. In Windy Empty World (Fig. 3a), there is a uniform wind Automatic Reward Shaping from Confounded Offline Data (a) Windy Empty World. (b) Lava Cross (easy). (c) Lava Cross (hard). (d) Lava Cross (maze). Figure 3. Windy grid worlds. The blue arrows and circles in the lower right of each grid denote the possible wind directions. The flag is the goal and the orange tile is the lava. The agent s task is to reach the goal quickly without touching lava and collect coins if possible. (a) Windy Empty World (b) Windy Lava Cross (easy) (c) Windy Lava Cross (hard) (d) Windy Lava Cross (maze) Figure 4. Cumulative regrets in windy grid worlds. Lower is better. Orange curve is ours. from each direction and a big chance for no wind (denoted as a hollow blue circle). As an interventional agent, there isn t much extra information it can utilize from the shaping function. Thus, in Fig. 4a, the cumulative regret of our method and other shaping baselines are on par with each other. Note that we do expect the vanilla Q-UCB not to perform well in all our windy grid worlds mainly because that it assumes reward being in [0, 1] and initializes overly optimistically as H. Yet, our Q-UCB Shaping (Algo. 1) lifts this restriction and learns well with simply zero initializations. (a) Easy mode (b) Hard mode (c) Maze mode Figure 5. Optimal polices learned by our proposed method. In both Windy Lava Cross easy mode and hard mode, there is a strong north wind and only a slight chance of being still across the map. We see from Fig. 3b and Fig. 3c that only the lower left L-shaped route is safe for an interventional agent. We provide offline datasets from both a conservative behavioral agent taking the safe route and an adventurous behavioral agent taking the risky northern route. The adventurous behavioral agent only moves when there is no wind. This fact is not reflected in the offline datasets though. When shaping with such values, the mixed learning signal from the adventurous agent slows down the learning process since the interventional agent cannot observe the wind and gets punished constantly by taking the risky northern route. Only our proposed shaping with causal offline upper bound provides the correct learning signal and helps the online interventional agent converge efficiently to the right optimal policy (Fig. 5a, Fig. 5b). In a more challenging Lava Cross maze (Fig. 3d), there are three coins. Though the lower left L-shaped route is still a safe option, there is a good chance for the agent to get the middle coin given the wind distribution Fig. 4d. The other two coins are traps that only behavioral agents sensing the wind could get. If the interventional agent is shaped with such confounded values, there is a high chance that it will be pushed into lava by the wind. As a result, we see in Fig. 4d that shaping with our causal upper bound helps the agent converge and find the right optimal policy (Fig. 5c). 6. Conclusion In this work we study the problem of designing potentialbased reward-shaping functions automatically with confounded offline datasets from a causal perspective. Though the optimal state value is a competitive candidate for state potentials, it s not easily accessible or identifiable from confounded offline datasets. We tackle this challenge by extending the Bellman Optimal Equation to confounded MDPs to robustly upper bound the optimal state values for online interventional agents. Then, we propose a modified model-free learner, Q-UCB Shaping, which has a better regret bound than the vanilla Q-UCB when using our proposed potentials that upper bound the optimal state values. The effectiveness of our method is also verified empirically. Automatic Reward Shaping from Confounded Offline Data Acknowledgements This research was supported in part by the NSF, ONR, AFOSR, Do E, Amazon, JP Morgan, and the Alfred P. Sloan Foundation. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. One major reason is that the potential-based reward-shaping framework we considered does not alternate the optimal policy of the agent, which is an inherent property of the problem. Our proposed method solely serves as an acceleration for training such agents. Ball, P. J., Smith, L. M., Kostrikov, I., and Levine, S. Efficient online reinforcement learning with offline data. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 1577 1594. PMLR, 2023. URL https://proceedings.mlr.press/v202/b all23a.html. Banach, S. Sur les op erations dans les ensembles abstraits et leur application aux equations int egrales. Fundamenta Mathematicae, 3:133 181, 1922. URL https://api. semanticscholar.org/Corpus ID:11854326 5. Bennett, A. and Kallus, N. Proximal reinforcement learning: Efficient off-policy evaluation in partially observed markov decision processes. Oper. Res., 72(3):1071 1086, 2024. doi: 10.1287/OPRE.2021.0781. URL https: //doi.org/10.1287/opre.2021.0781. Bertsekas, D. P. and Tsitsiklis, J. N. Parallel and distributed computation. Athena Scientific, Belmont, Massachusetts, 1989. URL https://api.semanticscholar. org/Corpus ID:60492634. Bhargava, P., Chitnis, R., Geramifard, A., Sodhani, S., and Zhang, A. Sequence modeling is a robust contender for offline reinforcement learning. arxiv, 2023. URL https: //doi.org/10.48550/ar Xiv.2305.14550. Brafman, R. I. and Tennenholtz, M. R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213 231, 2002. URL https://jmlr.org/papers/v3/brafm an02a.html. Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Taylor, M. E., and Now e, A. Reinforcement learning from demonstration through shaping. In Yang, Q. and Wooldridge, M. J. (eds.), Proceedings of the Twenty Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 3352 3358. AAAI Press, 2015. URL http://ijcai.org/Abstract/15/472. Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. A lyapunov theory for finite-sample guarantees of markovian stochastic approximation. Oper. Res., 72(4): 1352 1367, 2024. doi: 10.1287/OPRE.2022.0249. URL https://doi.org/10.1287/opre.2022.02 49. Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for gymnasium, 2018. URL https://github.com/Farama-Foundation/ Minigrid. Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2818 2826, 2015. URL https://proceedings.neurips.cc/paper /2015/hash/309fee4e541e51de2e41f21be bb342aa-Abstract.html. Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5713 5723, 2017. URL https://proceedings.neurips.cc/paper /2017/hash/17d8da815fa21c57af9829fb0 a869602-Abstract.html. Dann, C., Mansour, Y., Mohri, M., Sekhari, A., and Sridharan, K. Guarantees for epsilon-greedy reinforcement learning with function approximation. In Chaudhuri, K., Jegelka, S., Song, L., Szepesv ari, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 4666 4689. PMLR, 2022. URL https://proceedings.mlr.press/v162/d ann22a.html. Devidze, R., Kamalaruban, P., and Singla, A. Explorationguided reward shaping for reinforcement learning under Automatic Reward Shaping from Confounded Offline Data sparse rewards. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers .nips.cc/paper_files/paper/2022/hash /266c0f191b04cbbbe529016d0edc847e-Abs tract-Conference.html. Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2052 2062. PMLR, 2019. URL http://procee dings.mlr.press/v97/fujimoto19a.html. Guo, H., Cai, Q., Zhang, Y., Yang, Z., and Wang, Z. Provably efficient offline reinforcement learning for partially observable markov decision processes. In Chaudhuri, K., Jegelka, S., Song, L., Szepesv ari, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 8016 8038. PMLR, 2022. URL https://proceedings.mlr.press/v162/g uo22a.html. Gupta, A., Pacchiano, A., Zhai, Y., Kakade, S. M., and Levine, S. Unpacking reward shaping: Understanding the benefits of reward engineering on sample complexity. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/p aper_files/paper/2022/hash/6255f2234 9da5f2126dfc0b007075450-Abstract-Con ference.html. Harutyunyan, A., Devlin, S., Vrancx, P., and Now e, A. Expressing arbitrary reward functions as potential-based advice. In Bonet, B. and Koenig, S. (eds.), Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pp. 2652 2658. AAAI Press, 2015. URL http: //www.aaai.org/ocs/index.php/AAAI/AA AI15/paper/view/9893. Ibrahim, S., Mostafa, M., Jnadi, A., Salloum, H., and Osinenko, P. Comprehensive overview of reward engineering and shaping in advancing reinforcement learning applications. IEEE Access, 12:175473 175500, 2024. doi: 10.1109/ACCESS.2024.3504735. URL https: //doi.org/10.1109/ACCESS.2024.3504735. Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 652 661, New York, New York, USA, 20 22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/ jiang16.html. Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montr eal, Canada, pp. 4868 4878, 2018. URL https://proceedings.neurips. cc/paper/2018/hash/d3b1fb02964aa64e2 57f9f26a31f72cf-Abstract.html. Kakade, S. M. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003. Kallus, N. and Zhou, A. Confounding-robust policy improvement. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montr eal, Canada, pp. 9289 9299, 2018. URL https://proceedings.neurips.cc/paper /2018/hash/3a09a524440d44d7f19870070 a5ad42f-Abstract.html. Kang, B., Jie, Z., and Feng, J. Policy optimization with demonstrations. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm assan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 2474 2483. PMLR, 2018. URL http://proceedings.mlr. press/v80/kang18a.html. Kearns, M. J. and Singh, S. Finite-sample convergence rates for q-learning and indirect algorithms. In Kearns, M. J., Solla, S. A., and Cohn, D. A. (eds.), Advances in Neural Information Processing Systems 11, [NIPS Conference, Denver, Colorado, USA, November 30 - December 5, 1998], pp. 996 1002. The MIT Press, 1998. URL http: //papers.nips.cc/paper/1531-finite-s ample-convergence-rates-for-q-learnin g-and-indirect-algorithms. Automatic Reward Shaping from Confounded Offline Data Kearns, M. J. and Singh, S. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49(2-3):209 232, 2002. doi: 10.1023/A:1017984413808. URL http s://doi.org/10.1023/A:1017984413808. Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https: //proceedings.neurips.cc/paper/2020/ hash/0d2b2061826a5df3221116a5085a605 2-Abstract.html. Kumor, D., Zhang, J., and Bareinboim, E. Sequential causal imitation learning with unobserved confounders. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 14669 14680, 2021. URL https://proceedings.neurips.cc/paper /2021/hash/7b670d553471ad0fd7491c75b ad587ff-Abstract.html. Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Lu, M., Min, Y., Wang, Z., and Yang, Z. Pessimism in the face of confounders: Provably efficient offline reinforcement learning in partially observable markov decision processes. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023. URL https: //openreview.net/forum?id=Pbk BDQ5_Ub V. Ma, H., Sima, K., Vo, T. V., Fu, D., and Leong, T. Reward shaping for reinforcement learning with an assistant reward agent. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Open Review.net, 2024. URL https: //openreview.net/forum?id=a3XFF0PGLU. Manski, C. F. Nonparametric bounds on treatment effects. The American Economic Review, 80:319 323, 1989. URL https://api.semanticscholar.org/Corp us ID:153656329. Mezghani, L., Sukhbaatar, S., Bojanowski, P., Lazaric, A., and Alahari, K. Learning goal-conditioned policies offline with self-supervised reward shaping. In Liu, K., Kulic, D., and Ichnowski, J. (eds.), Conference on Robot Learning, Co RL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pp. 1401 1410. PMLR, 2022. URL https://proceedings.mlr.press/v205/m ezghani23a.html. Miao, R., Qi, Z., and Zhang, X. Off-policy evaluation for episodic partially observable markov decision processes under non-parametric models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http: //papers.nips.cc/paper_files/paper/2 022/hash/03dfa2a7755635f756b160e9f4c 6b789-Abstract-Conference.html. Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054 1062, 2016. Murphy, S. A. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(2):331 355, 2003. Murphy, S. A. A generalization error for q-learning. Journal of Machine Learning Research, 6:1073 1097, 2005. Nakamoto, M., Zhai, S., Singh, A., Mark, M. S., Ma, Y., Finn, C., Kumar, A., and Levine, S. Cal-ql: Calibrated offline RL pre-training for efficient online fine-tuning. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files /paper/2023/hash/c44a04289beaf0a7d96 8a94066a1d696-Abstract-Conference.ht ml. Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Bratko, I. and Dzeroski, S. (eds.), Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27 - 30, 1999, pp. 278 287. Morgan Kaufmann, 1999. Open AI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., Mc Grew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W., and Zhang, L. Solving rubik s cube with a robot hand. arxiv, 2019. URL http://arxiv.org/abs/1910 .07113. Automatic Reward Shaping from Confounded Offline Data O Connor, S. M. and Kuo, A. D. Direction-dependent control of balance during walking and standing. Journal of neurophysiology, 102 3:1411 9, 2009. URL https: //api.semanticscholar.org/Corpus ID:17 131326. Pan, A., Bhatia, K., and Steinhardt, J. The effects of reward misspecification: Mapping and mitigating misaligned models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022. URL https: //openreview.net/forum?id=JYtw Gw IL7ye. Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 2778 2787. PMLR, 2017. URL http://procee dings.mlr.press/v70/pathak17a.html. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2 edition, 2009. doi: 10.101 7/CBO9780511803161. Precup, D., Sutton, R. S., and Singh, S. P. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 759 766, 2000. Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1 edition, 1994. ISBN 978-0471-61977-2 978-0-470-31688-7. doi: 10.1002/978047 0316887. URL https://onlinelibrary.wile y.com/doi/book/10.1002/9780470316887. Raileanu, R. and Rockt aschel, T. RIDE: rewarding impactdriven exploration for procedurally-generated environments. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https: //openreview.net/forum?id=rkg-TJBFPB. Randløv, J. and Alstrøm, P. Learning to drive a bicycle using reinforcement learning and shaping. In Shavlik, J. W. (ed.), Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), Madison, Wisconsin, USA, July 24-27, 1998, pp. 463 471. Morgan Kaufmann, 1998. Ruan, K., Zhang, J., Di, X., and Bareinboim, E. Causal imitation for markov decision processes: A partial identification approach. Advances in neural information processing systems, 2024. Saksida, L. M., Raymond, S. M., and Touretzky, D. S. Shaping robot behavior using principles from instrumental conditioning. Robotics Auton. Syst., 22(3-4):231 249, 1997. doi: 10.1016/S0921-8890(97)00041-9. URL https://doi.org/10.1016/S0921-8890(97 )00041-9. Shi, C., Uehara, M., Huang, J., and Jiang, N. A minimax learning approach to off-policy evaluation in confounded partially observable markov decision processes. In Chaudhuri, K., Jegelka, S., Song, L., Szepesv ari, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 20057 20094. PMLR, 2022. URL https://proceedings.mlr.press/v162/s hi22f.html. Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 1151 1160, 2019. URL https: //proceedings.neurips.cc/paper/2019/ hash/10a5ab2db37feedfdeaab192ead4ac0 e-Abstract.html. Singh, S. P. and Yee, R. C. An upper bound on the loss from approximate optimal-value functions. Mach. Learn., 16 (3):227 233, 1994. doi: 10.1007/BF00993308. URL https://doi.org/10.1007/BF00993308. Song, Y., Zhou, Y., Sekhari, A., Bagnell, D., Krishnamurthy, A., and Sun, W. Hybrid RL: using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023. URL https://openreview.net/forum?id= yy Bis80i Uu U. Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In Cohen, W. W. and Moore, A. W. (eds.), Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, volume 148 of ACM International Conference Proceeding Series, pp. 881 888. ACM, 2006. doi: 10.1145/1143844.1143955. URL https: //doi.org/10.1145/1143844.1143955. Sutton, R. S. Reinforcement learning: An introduction. A Bradford Book, 2018. Automatic Reward Shaping from Confounded Offline Data Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. A Bradford Book, second edition, 2018. ISBN 0262039249. Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pp. 814 823, 2015. Tennenholtz, G., Shalit, U., and Mannor, S. Off-policy evaluation in partially observable environments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 10276 10283, 2020. Wang, Y. and Srinivasan, M. Stepping in the direction of the fall: the next foot placement can be predicted from current upper body state in steady-state walking. Biology Letters, 10, 2014. URL https://api.semantic scholar.org/Corpus ID:5972040. Wang, Y., Dong, K., Chen, X., and Wang, L. Q-learning with UCB exploration is sample efficient for infinitehorizon MDP. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https: //openreview.net/forum?id=Bkgl STNFDB. Watkins, C. J. C. H. Learning from delayed rewards. Ph D thesis, University of Cambridge England, 1989. Watkins, C. J. C. H. and Dayan, P. Technical note q-learning. Mach. Learn., 8:279 292, 1992. doi: 10.1007/BF009926 98. URL https://doi.org/10.1007/BF0099 2698. Wiewiora, E. Potential-based shaping and q-value initialization are equivalent. J. Artif. Intell. Res., 19:205 208, 2003. doi: 10.1613/JAIR.1190. URL https: //doi.org/10.1613/jair.1190. Wu, L., Zhang, Z., Haesaert, S., Ma, Z., and Sun, Z. Riskaware reward shaping of reinforcement learning agents for autonomous driving. In 49th Annual Conference of the IEEE Industrial Electronics Society, IECON 2023, Singapore, October 16-19, 2023, pp. 1 6. IEEE, 2023. doi: 10.1109/IECON51785.2023.10312462. URL https://doi.org/10.1109/IECON51785.2 023.10312462. Xu, H., Ma, T., and Du, S. S. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. In Belkin, M. and Kpotufe, S. (eds.), Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pp. 4438 4472. PMLR, 2021. URL http://proceedings.mlr.press/ v134/xu21a.html. Yang, K., Yang, L. F., and Du, S. S. Q-learning with logarithmic regret. In Banerjee, A. and Fukumizu, K. (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp. 1576 1584. PMLR, 2021. URL http://proceedings.mlr.press/v130/yan g21b.html. Yuan, M., Li, B., Jin, X., and Zeng, W. Automatic intrinsic reward shaping for exploration in deep reinforcement learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 40531 40554. PMLR, 2023. URL https://proceedings.mlr. press/v202/yuan23c.html. Zhang, J. and Bareinboim, E. Near-optimal reinforcement learning in dynamic treatment regimes. In Advances in Neural Information Processing Systems, pp. 13401 13411, 2019. Zhang, J. and Bareinboim, E. Can humans be out of the loop? In Sch olkopf, B., Uhler, C., and Zhang, K. (eds.), 1st Conference on Causal Learning and Reasoning, CLea R 2022, Sequoia Conference Center, Eureka, CA, USA, 11-13 April, 2022, volume 177 of Proceedings of Machine Learning Research, pp. 1010 1025. PMLR, 2022. URL https://proceedings.mlr.pres s/v177/zhang22a.html. Zhang, J. and Bareinboim, E. Eligibility traces for confounding robust off-policy evaluation. Technical Report R-105, Causal Artificial Intelligence Lab, Columbia University, May 2024. Zhang, J., Kumor, D., and Bareinboim, E. Causal imitation learning with unobserved confounders. Advances in neural information processing systems, 33:12263 12274, 2020. Zhang, Y., Qiu, R., Liu, J., and Wang, S. Roler: Effective reward shaping in offline reinforcement learning for recommender systems. In Serra, E. and Spezzano, F. (eds.), Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024, Boise, ID, USA, October 21-25, 2024, pp. 3269 3278. ACM, 2024. doi: 10.1145/3627673.3679633. URL ht tps://doi.org/10.1145/3627673.3679633. Zhu, Y., Wang, Z., Merel, J., Rusu, A. A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kram ar, J., Hadsell, R., de Freitas, N., and Heess, N. Reinforcement and imitation learning for diverse visuomotor skills. In Kress-Gazit, H., Srinivasa, S. S., Howard, T., and Atanasov, N. (eds.), Automatic Reward Shaping from Confounded Offline Data Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, June 26-30, 2018, 2018. doi: 10.15607/RSS.2018.XIV.009. URL http://www.roboticsproceedings.org/rs s14/p09.html. Zou, H., Ren, T., Yan, D., Su, H., and Zhu, J. Reward shaping via meta-learning. Co RR, abs/1901.09330, 2019. URL http://arxiv.org/abs/1901.09330. Automatic Reward Shaping from Confounded Offline Data A. Related Work Reward Shaping. Before Ng et al. (1999) proposed the Potential-Based Reward Shaping (PBRS), the idea of transforming and modifying rewards to better facilitates learning has been studied in various problem settings (Saksida et al., 1997; Randløv & Alstrøm, 1998). Ng et al. (1999) first formalized a shaping framework that guarantees the policy invariance under reward transformation. Though this policy invariance comes at a price that shaping functions are limited to certain forms of state potential functions. There are numerous successful applications of PBRS (Brys et al., 2015; Harutyunyan et al., 2015) but there are also a growing numbers of papers that focus on using carefully designed biased shaping functions (not following the PBRS framework) (Ibrahim et al., 2024). Such shaping functions have shown effectiveness in robots playing rubic cubes (Open AI et al., 2019), in autonomous driving (Wu et al., 2023) and more. People are also not satisfied with designing shaping functions manually and tries to learn shaping functions automatically, either serving as an exploration bonus or a supplement to the task rewards (Zou et al., 2019; Ma et al., 2024; Pathak et al., 2017; Yuan et al., 2023; Raileanu & Rockt aschel, 2020; Devidze et al., 2022). Off-Policy Evaluation and Learning. Off-policy learning has a long history in RL dating back to the classic algorithms of Q-learning (Watkins, 1989; Watkins & Dayan, 1992), importance sampling (Swaminathan & Joachims, 2015; Jiang & Li, 2016), and temporal difference (Precup et al., 2000; Munos et al., 2016). Recently, people also propose to utilize offline datasets to warm start the training (Nakamoto et al., 2023; Bhargava et al., 2023; Kumar et al., 2020), augmenting online training replay buffer (Song et al., 2023; Ball et al., 2023) or incorporating imitation loss with offline data (Kang et al., 2018; Zhu et al., 2018). However, these work rely on a critical assumption that there is no unobserved confounders in the environment. While this assumption is generally true when the off-policy data is collected by an interventional agent, offline datasets generated by potentially unknown sources can easily break this assumption (Levine et al., 2020). In recent years, there is also a growing interest in identifying policy values from confounded offline data (Shi et al., 2022; Miao et al., 2022; Guo et al., 2022; Bennett & Kallus, 2024) or partially identifying the policy values via bounding (Zhang & Bareinboim, 2024; 2019). Regret Analysis of Finite Horizon Episodic Tabular MDPs. Many of the prior work on regret analysis focus on the model-based setting where the transitions and reward distributions are estimated from collected data and planning is then executed in the learned model to extract the optimal policy (Singh & Yee, 1994; Kearns & Singh, 1998; Kakade, 2003; Dann & Brunskill, 2015; Dann et al., 2017; Simchowitz & Jamieson, 2019). In this work, we mainly focus on the regret analysis for model-free learners (Strehl et al., 2006; Jin et al., 2018; Wang et al., 2020; Xu et al., 2021; Chen et al., 2024). More specifically, our analysis focuses on a Q-learning variant, Q-UCB (Jin et al., 2018) that incorporates the UCB bonus into Q-learning showing model-free learners can also enjoy T regret as the model-based learners do. Later, Yang et al. (2021) provides a more fine-grained gap-dependent regret analysis on Q-UCB showing that it also enjoys a logarithmic dependency on total training steps T. B. Walking Robot Example Details From the environment set up, there are two ways of walking optimally. If the agent can observe the step size required to stabilize itself (Uh), it should always act as Xh = Uh when Fh = 0. Under this perfect behavioral policy, a walking process should be: (1) When the robot is stable, Fh = 1, it takes a big step Xh = 1 and transit into an unstable status, Fh+1 = 0. At this stage, the agent s location does not change, Lh+1 = Lh; then (2) the agent takes an appropriate size step, Xh = Uh, transits back to stable status Fh+2 = 1 and move forward by one step, Lh+2 = Lh+1 + 1. The whole process resembles human walking as a controlled falling process. This is the optimal behavioral policy. Formally, this is defined as β(1) h : Xh Uh if Fh = 0 else Xh 1. (In Example 2, the incompetent behavioral policy is defined as β(2) h : Xh Uh.) However, there is an alternative approach to walking steadily when the robot cannot determine the required step size Ut to stabilize itself. By taking small step size every time, Xt = 0, the agent always stays in stable status Ft = 1 and moves forward steadily in each time step. This is the optimal interventional policy. Formally, this is written as π(Xt|Lt, Ft) = 0. To see the quantitative difference between the optimal behavioral policy and optimal interventional policy, we calculated the optimal state values as shown in the below chart. Note that for each state with given Lt, Ft, the optimal behavioral state values are the same for all Ut. We can see from the table that while the optimal behavioral policy can achieve optimal rewards from any stability status, an interventional agent should only stay in the stable status due to observational mismatch. Automatic Reward Shaping from Confounded Offline Data Table 1. Optimal behavioral and interventional policy values. Behavioral Interventional Lt Ft = 0 Ft = 1 Ft = 0 Ft = 1 0 10. 10. 5.0 5.5 1 9. 9. 4.5 5. 2 8. 8. 4. 4.5 3 7. 7. 3.5 4. 4 6. 6. 3. 3.5 5 5. 5. 2.5 3. 6 4. 4. 2. 2.5 7 3. 3. 1.5 2. 8 2. 2. 1. 1.5 9 1. 1. 0. 1. 10 0. 0. 0. 0. In the meantime, state values of the behavioral policies estimated via Monte-Carlo simulation during the offline dataset collection are shown in Table 2. We see a clear trend that for the good behavioral policy π1, it does not show any preference Table 2. Confounded values of behavioral policies. Beh. Policy π1 Beh. Policy π2 Lt Ft = 0 Ft = 1 Ft = 0 Ft = 1 0 10. 10. -22.5 -41.7 1 9. 9. -21.8 -40.1 2 8. 8. -21.0 -38.4 3 7. 7. -20.25 -36.9 4 6. 6. -19.5 -34.5 5 5. 5. -18.75 -33.6 6 4. 4. -18.0 -32.5 7 3. 3. -17.3 -32.1 8 2. 2. -16.5 -31.5 9 1. 0. 0. 0. 10 0. 0. 0. 0. over the stability status while for a bad behavioral policy π2, it even penalizes being stable. Thus, both confounded values are not a good approximation for our optimal interventional state value. The final state potential learned by our method, on the other hand, is shown in Table 3. We see clearly that our proposed algorithm successfully learns an upper bound to the optimal interventional state values and also showing a clear tendency towards the stable status (Ft = 1), the same as the optimal interventional values do. C. Causal Upper Bound Algorithm We present the full pseudo-code for calculating causal upper bounds on optimal state values from multiple offline datasets in Algo. 2. D. Detailed Experiment Design and More Results In this section, we provide a detailed description of experiment setups and additional experiment results. Automatic Reward Shaping from Confounded Offline Data Table 3. Causal state value upper bound. Lt Ft = 0 Ft = 1 0 7.6 9.3 1 7.5 9.3 2 7.5 9.3 3 7.6 9.3 4 7.6 9.3 5 7.9 9.1 6 8.0 8.9 7 7.9 8.6 8 7.6 7.6 9 5.3 5.3 10 0. 0. Algorithm 2 Causal Upper Bound Potential 1: Input: Offline datasets, D(i), i = 1, ..., N and reward upper bound, b. 2: for i = 1 to N do 3: while Not Converged do 4: for h = H to 1 do 5: for state s S do 6: for action x X do 7: if s, x not visited in D(i) h then 8: Continue 9: end if 10: Calculate Q (i) h (s, x) by Eq. (8) 11: end for 12: Update V (i) h (s) = maxx D(i) h Q (i) h (x|s) 13: end for 14: end for 15: end while 16: end for 17: return V h(s) = mins D(i) h V (i) h (s), s, h S [H] D.1. Experiment Setups All of our experiment results are obtained from a 2021 Mac Book Pro with M1 chip and 32GB memory. The cumulative regret results are averaged over three random seeds. We implement the Q-UCB in a homogeneous way that there is no step indices for state spaces. Regarding the environment parameters, for Windy Empty World, the episode length is set to 15 while the Lava Cross series has a horizon of 20. To compensate for the hard exploration situation, we allow random initial starting states over the whole map walkable area. For training steps, we set a total of 100K environment steps for Windy Empty World and 20K for the Lava Cross series. We represent the wind distribution in those windy Grid World as a five-tuple meaning the probability mass for (WEST WIND, NORTH WIND, EAST WIND, SOUTH WIND, NO WIND). We list them as follows, Windy Empty World: (.1, .1, .1, .1, .6) Windy Lava Cross (easy): (0, .8, 0, 0, .2) Windy Lava Cross (hard): (0, .7, 0, 0, .3) Windy Lava Cross (maze): (0., .5, 0., 0., .5) if s[0] = 0 (.5, 0., 0., 0., .5) if s[1] = 7 (.15, .15, .15, .15, .4) otherwise Automatic Reward Shaping from Confounded Offline Data And for each environment, we have three different behavioral policies for collecting offline datasets. Here we list their brief descriptions as below, Windy Empty World: The first one is a good behavioral policy that stands still when the wind is blowing the agent away from the goal and go towards the goal when the wind direction is in its favor. The second one is a bad behavioral policy that follows the good policy half of the time while being fully random the other half. And the last one is a fully random policy. Windy Lava Cross (easy): The first one is a good behavioral policy that takes the L-shaped safe route staying tight to the lower and left side walls. The second one is a bad behavioral policy that tries to cross the upper side gateway when there is no wind. And the last one is again a fully random policy. Windy Lava Cross (hard): The first one is a good behavioral policy that takes the same safe route in the lower left side. The second one is a bad behavioral policy that collects those coins on the northern bank of the lava lake when there is no wind then reach the goal. And the last one is also a fully random policy. Windy Lava Cross (maze): The first one is a good behavioral policy that goes directly to the goal location. The second one is less preferred that it takes a bit detour to get the extra coin near the goal. And the last behavioral policy goes directly to the coin on upper right corner without reaching the goal at all. D.2. Extra Experiment Results Additionally, we have another Lava Cross maze with fewer coins. The safe route is still the lower left L-shaped route. The behavioral agents generating offline datasets in this environment include a conservative one taking the safe route, an adventurous one taking the detour to get an extra coin before reaching the goal, and a radical one only looking for the top right corner coin without even reaching the goal. As shown in Fig. 6a, only shaping with our proposed causal upper bound helps the online interventional agent converge efficiently to the right policy (Fig. 6c,Fig. 6b). We further provide the ratio of states where the learned policy is optimal in App. D.2. Note that in the table we included one more baseline which is a deep neural network based learner, BCQ (Fujimoto et al., 2019). Table 4. Optimal ratio across different environments. Higher is better. Algo./Env. Windy Empty World Lava Cross(easy) Lava Cross(hard) Lava Cross(maze) No Shaping Q-UCB 0.00 0.00 0.00 0.03 Shaping + Causal Bound(Ours) 0.76 0.70 0.81 0.90 Shaping + Min. Val. 0.72 0.49 0.30 0.24 Shaping + Max. Val. 0.73 0.52 0.59 0.33 Shaping + Avg. Val. 0.74 0.50 0.49 0.24 BCQ 0.79 0.50 0.67 0.50 Shaping + BCQ 0.88 0.31 0.22 0.03 (c) Learned Policy Figure 6. Lava Cross maze with fewer coins. Automatic Reward Shaping from Confounded Offline Data Figure 7. Wind distribution in the Lava Cross maze with fewer coins. E. Shaping and Initialization If we have the optimal state value, warm-starting the online learning by initializing state values with such optimal ones seems more straightforward than shaping. But their effects on learning are only the same under unrealistically strict conditions (Wiewiora, 2003). That is, given the same sequence of samples, the learned Q-values (Watkins & Dayan, 1992) after shaping is always less than the one learned without shaping by the amount of the potential functions. A direct implication of this result is that the policy learned under shaping at each time step is indifferent from the one learned without shaping, thus, learning efficiency is unchanged. Proposition E.1. Given a fixed sequence of samples, the policy distribution learned under shaping is equivalent to that learned without shaping but initialized with ϕ( ). There is also prior work showing that for the epsilon-greedy exploration strategy, potential-based reward shaping does not directly affect learning efficiency. Instead, how optimistic the initialization is compared with the potential functions affects the learning efficiency (Dann et al., 2022). While people also find that for model-based learners, certain forms of reward shaping can boost learning efficiency provably (Gupta et al., 2022). F. Original Q-UCB Algorithm We restate the pseudo-code for origianl Q-UCB in Algo. 3. For more details please see Jin et al. (2018). Algorithm 3 Q-UCB 1: Initialize Q1 h(s, x) H, Nh(s, x) = 0, for all (s, x, h) S X [H]. 2: for k = 1 to K do 3: Observe initial state s1. 4: for h = 1 to H do 5: Take action xh = arg maxx Qk h(sh, x) and observe sh+1, yh. 6: Update visitation counter and calculate UCB bonus, t = Nh(sh, xh) Nh(sh, xh) + 1, bt = c p H3ι/t 7: Update Q-value, Qk+1 h (sh, xh) = (1 αt)Qk h(sh, xh) + αt(yh + V k h+1(sh+1) + bt). 8: Update value function, V k+1 h (sh) = min{H, maxx Qk+1 h (sh, x)}. 9: end for 10: end for G. Useful Results for Proofs We will first restate some useful claims in the literature to be used later in our proofs for completeness. Lemma G.1 (Properties of Cumulative Learning Rates). Let αt = H+1 H+t . We define α0 t = Πt j=1(1 αj) and αi t = αiΠt j=i+1(1 αj). The following properties hold for αi t: Automatic Reward Shaping from Confounded Offline Data t Pt i=1 αi t t for every t 1. (b) maxi [t] αi t 2H t and Pt i=1(αi t)2 2H t for every t 1. (c) P t=i αi t = 1 + 1 H for every i > 1. In addition, we have (1)Pt i=1 αi t = 1 and α0 t = 0 for t 1; (2) Pt i=1 αi t = 0 and α0 t = 1 for t = 0. Proof. See Jin et al. (2018) App. B for details. Lemma G.2 (Property of the Clip Function). Let clip [x|δ] = x 1[x δ] (Simchowitz & Jamieson, 2019). For any three positive numbers a, b, c satisfying a + b c, and for any x (0, 1), the following holds, a + b clip h a ac i + (1 + x)b (11) Proof. See Claim A.8 in App. A from Xu et al. (2021) for details. Lemma G.3 (Bounded Summation for Clipped Function). The summation of a clipped function which scales proportionally to the inverse of the square root of the variable n has the following bound: n=1 clip c n Proof. See Claim A.13 in App. A from Xu et al. (2021) for details. H. Proof Details H.1. Potential-Based Reward Shaping in CMDP Proof This is for Prop. 2.2. Proof. Because CMDP also enjoys the Markov property, the overall proof procedure highly resembles the original one in Ng et al. (1999). Only to note that the optimal policy invariance is proved in the online learning sense, which is between M π and Mπ. H.2. Causal Bellman Optimal Equation Proof In this section, we derive the Causal Bellman Optimal Equation from the original Bellman Optimal Equation of MDPs. For stationary CMDPs, we also prove that our Causal Bellman Optimal Equation has a unique fixed point and the optimality of this unique fixed point. Theorem H.1 (Causal Bellman Optimal Equation (Thm. 3.1)). For a CMDP environment M with reward Yh b, b R, the optimal value of interventional policies, V h (s), s, is upper bounded by V (s) V h(s) satisfying the Causal Bellman Optimality Equation, V h(s) = max x h Ph(x|s) e Rh(s, x) + E e Th[V h+1(s )] + Ph( x|s) b + max s V h+1(s ) i (13) Proof. Starting from the Bellman Optimal Equation, the optimal state value function is given by, V h (s) = max x Rh(s, x) + X s Th(s, x, s )V h+1(s ) (14) Note that the actions here are done by an interventional agent, which is actually do(x) in the context of a CMDP. We swap in the causal bounds for interventional reward and transition distribution, V h (s) max x e Rh(s, x)Ph(x|s) + b Ph( x|s) + X s e Th(s, x, s )Ph(x|s)V h+1(s ) + Ph( x|s) max s V h+1(s ) (15) Automatic Reward Shaping from Confounded Offline Data where e Rh(s, x) = E[Yh|Sh = s, Xh = x], e Th is shorthand for e Th(s, x, s ) = P(Sh+1 = s |Sh = s, Xh = x) and P(x|s) = Ph(Xh = x|Sh = s) are estimated from the offline dataset. And b is a known upper bound on the reward signal, Yh b. In this step, we upper bound the next state transition by assuming the best case that for the action not taken with probability Ph( x|s), the agent transits with probability 1 the best possible next state, maxs V h+1(s ). Then after rearranging terms, we have, V h (s) max x e Rh(s, x) + X s e Th(s, x, s )V h+1(s ) + Ph( x|s) b + max s V h+1(s ) # And optimizing the value function w.r.t this inequality gives us an upper bound on the optimal state value, V h(s) max x e Rh(s, x) + X s e Th(s, x, s )V h+1(s ) + Ph( x|s) b + max s V h+1(s ) # More interestingly, we will show in Prop. H.2 that this will also converge to a unique fixed point in stationary CMDPs. Proposition H.2 (Convergence of Causal Bellman Optimal Equation in Stationary CMDPs). The Causal Bellman Optimality Equation converges to a unique fixed point which is also an upper bound on the optimal interventional state values under the assumption that P(s, x) > 0, s, x in stationary CMDPs. Proof. We will first show that the following Causal Bellman Optimal operator is a contraction mapping with respect to a weighted max norm. The major proof technique is from Bertsekas & Tsitsiklis (1989) (Sec 4.3.2). Then by Banach s fixed-point theorem (Banach, 1922), this Causal Bellman Optimal operator has a unique fixed point and updating any initial point iteratively will converge to it. Then we will show that this unique fixed point is indeed an upper bound on the optimal interventional state value. Let the operator T be, TV (s) = max x e R(s, x) + X s e T(s, x, s )V (s ) + P( x|s) b + max s V (s ) # For arbitrary value bound, V 1, V 2, we can bound their difference under one step update by, TV 1(s) TV 1(s) max x s e T(s, x, s ) V 1(s ) V 2(s ) + P( x|s) max s V 1(s ) V 2(s ) Note that we combine the terms together and extract max operator out because of its non-expansive property. Then we partition the state space as follows. Let S1 = {s } be the set of a sink state (in episodic CMDP, we always transit into the sink state after we reach the horizon limit). For k = 2, 3, ..., we define, Sk = {i|i / [ k [1,k 1] Sk min x max j S k [1,k 1] Sk T (i, x, j) > 0}. (20) The set Sk intuitively represents the set of states that can transit to states closer to the sink state even with adversarial action picks. Let Sm be the last of these sets that is nonempty. We will show that all those nonempty sets cover the whole state space. Suppose the set S = {i|i / m k=1Sk} is nonempty. Then by definition, for any state i S , there exists an action x such that T (i, x, j) = 0, j m k=1Sk. This means that state i cannot transit into the sink state with certain actions, which contradicts with our episodic setting that starting from any state-action pair, the episode will end at the horizon limit. Then we define the max norm with respect to which we will later show that the difference between value update is a contraction mapping. We set the weight vector w > 0 in a way that each of the entry wi, i S corresponds to a set Sk and all states inside the same set shares the same weight, that is, wi = yk if i Sk. Automatic Reward Shaping from Confounded Offline Data We will for now assume the following properties of sequence {yk}m k=1 hold, prove T is a contraction mapping, and then come back to show we can find such sequences. 1 = y1 < y2 < < ym (21) ym yk (1 ϵδ) + yk 1 yk ϵδ γ < 1, k = 2, ..., m (22) where ϵ = mink [2,m] minx mini Sk P k [1,k 1] Sk T (i, x, j) is the minimal one-step transition probability that a state from later set Sk transfer to a state that is closer to the sink state and δ = mins,x P(s, x) > 0 from the behavioral dataset,. From previous discussion and probability distribution validity property, we have ϵ (0, 1]. Let the initial difference between the value vectors be ||V 1 V 2||w c where c > 0 is a constant. Then we have, for any state s Sk(s) where k(s) means the index of the state set that s belongs to, V 1(s) V 2(s) cyk(s). (23) Now we can continue to write the value update difference as, |TV 1(s) TV 2(s)| cyk(s) 1 yk(s) max x s e T(s, x, s )yk(s ) + P( x|s) max s yk(s ) = 1 yk(s) max x s e T(s, x, s )yk(s ) + P( x|s) max s yk(s ) Given the fact that yk ym, k [1, m], |TV 1(s) TV 2(s)| cyk(s) 1 yk(s) max x s e T(s, x, s )yk(s ) + P( x|s)ym We can split the sum over next state s by differentiating whether it belongs to the sets closer to sink state than s or the sets further away from the sink state than s,w2 |TV 1(s) TV 2(s)| cyk(s) 1 yk(s) max x k [1,k(s) 1] Sk e T(s, x, s )yk(s ) k [k(s),m] Sk e T(s, x, s )yk(s ) + P( x|s)ym And by the property of {yk}m k=1, 1 = y1 < y2 < < ym and the fact that P s e T(s, x, s ) = 1, we have, |TV 1(s) TV 2(s)| cyk(s) 1 yk(s) max x (yk(s) 1 ym)P(x|s) X k [1,k(s) 1] Sk e T(s, x, s ) + ym P(x|s) + P( x|s)ym 1 yk(s) max x (yk(s) 1 ym)P(x|s) X k [1,k(s) 1] Sk e T(s, x, s ) yk(s) . (29) By definition, ϵ lower bounds the transition probability of P k [1,k(s) 1] Sk e T(s, x, s ), we have, |TV 1(s) TV 2(s)| cyk(s) 1 yk(s) max x P(x|s)(yk(s) 1 ym)ϵ + ym yk (1 ϵδ) + yk 1 Automatic Reward Shaping from Confounded Offline Data Thus, for the state value bound vectors, we have, ||TV 1 TV 2||w γc (33) for all V 1, V 2 satisfying ||V 1 V 2|| c. Thus, T is a contraction mapping with respect to a weighted max norm. Lastly, we will show that a sequence {yk}m k=1 satisfying Eq. (21) and Eq. (22) is realizable. Let y0 = 0, y1 = 1, and suppose that y1, y2, ..., yk have been chosen. If ϵδ = 1, we set yk+1 = yk + 1. If ϵδ < 1, we set yk+1 = 1/2(yk + zk) (34) zk = min i [1,k] yi + ϵδ 1 ϵδ (yi yi 1) . (35) And we can write zk recursively as, zk = min{zk 1, yk + ϵδ 1 ϵδ (yk yk 1)}. (36) We will first show by induction that this sequence satisfies Eq. (21). (1) Base Case: By definition, y0 < y1. (2) Induction Hypothesis: Assume yk 1 < yk holds. (3) Inductive Step: Our goal is to show that yk < yk+1. By definition, yk+1 = 1 2(yk + zk), the problem is reduced to whether yk+1 yk = zk yk > 0. Since yk = 1 2(yk 1 + zk 1) and we have yk 1 < yk. Thus, we have zk 1 > yk. And clearly we have yk + ϵ 1 ϵ(yk yk 1) > yk. By the recursive update rule, we have, zk = min{zk 1, yk + ϵ 1 ϵ(yk yk 1)} (37) As we have shown that both term are bigger than yk, we have zk > yk. Thus, the sequence {yk}m k=1 satisfies Eq. (21). For Eq. (22), we first notice that since zm 1 ym = ym ym 1 > 0, we have, zm = min{zm 1, ym + ϵ 1 ϵ(ym ym 1)} > ym. (38) By definition, zm is also calculated as, zm = min i [1,m] yi + ϵδ 1 ϵδ (yi yi 1) . (39) Swapping this definition into Eq. (38), we have, yk + ϵδ 1 ϵδ (yk yk 1) > ym (40) yk < 1 + ϵδ 1 ϵδ (1 yk 1 yk + ϵδ yk 1 yk < 1. (42) Thus, we have shown that a weight sequence {yk}m k=1 satisfying Eq. (21) and Eq. (22) is indeed realizable. Thus, T is a contraction mapping with respect to a realizable max norm. There exists a unique fixed point V when we optimize V with T iteratively till convergence. For the optimal state value vector V , we can also apply T iteratively until it converges to the fixed point V . By the update rule of T (Eq. (17)), V, V TV . Thus, we have V limk T k V = V where T k denotes applying T iteratively for k times. The fixed point of the Causal Bellman Optimal Equation is indeed an upper bound of the optimal state value vector. Automatic Reward Shaping from Confounded Offline Data H.3. Q-UCB with Shaping Regret Analysis Details We define the adaptive learning rate as αt = H+1 H+t and a shorthand notion ι = log (|S||X|T/p) where T = KH. We also have ϕ(s H+1) = 0 according to Prop. 2.2 and assume deterministic reward functions. Throughout the proof, we also assume conservative optimism condition is satisfied (Def. 4.1). We first present lemmas used in proving the regret bound of Algo. 1. Lemma H.3 (Concentration of Transition Weighted Bounded Functions). Let f be a function mapping from state space to a convex set of real values, S 7 [0, ϕm], ϕm R. With high probability, the following sum is bounded for all (s, x, h, k) S X [H] [K], i=1 αi t 1[ki < K] (ˆPki h f Phf)(s, x) bt where t = N k h(s, x) is the visitation count at the beginning of k-th episode and we define the index of the episode when state-action pair (s, x) is visited for the i-th time to be, ki = min {k [K]|k > ki 1 (sk h, xk h) = (s, x)} {K + 1} . (44) Proof. Note that ki = K +1 if (s, x) is not visited for the i-th time. The sequence {αi t 1[ki < K] (ˆPki h f Phf)(s, x)}t i=0 is a martingale difference sequence w.r.t filtration Fi 0 and we have |αi t 1[ki < K] (ˆPki h f Phf)(s, x)| αt i ϕm and α0 t 1[k0 < K] (ˆPk0 h f Phf)(s, x) = 0 since no visitation occurs at all when i = 0. By Azuma-Hoeffding inequality, we have that ϵ > 0, i=0 αi t 1[ki < K] (ˆPki h f Phf)(s, x)| ϵ) 2 exp 2 Pt i=0(αi t)2ϕ2m Thus, with probability p p SAT , | Pt i=0 αi t 1[ki < K] (ˆPki h f Phf)(s, x)| c ϕ2m log (SAT/p)H/t = bt 2 where c > 0 is a constant. By union bound over all (s, x, h, k) S X [H] [K], we conclude the proof for the claim. Lemma H.4 (Bounded Differences Between Qk h and Q h (Lem. 4.2)). The difference between the learned Q-value at the beginning of episode k and step h and the optimal Q-value can be bounded with high probability as follows, 0 Qk h(s, x) Q h(s, x) α0 t( Q h(s, x)) + i=1 αi t h (ˆPki h V ki h+1 ˆPki h V h+1)(s, x) i + 3bt, (46) where Ph denotes the expected transition w.r.t the true confounded MDP and ˆPk h denotes the sample transition experienced by the agent at episode k step h. Proof. First, we can rewrite the optimal Q-value as follows by the Bellman equation and Prop. 2.2, Q h(s, x) = α0 t Q h(s, x) + i=1 αi t(yh(s, x) ϕ(s) + (Ph V h+1 + Phϕ)(s, x). (47) The learned Q-value can be written as follows using the defined accumulative learning rates Lem. G.1, Qk h(s, x) = i=1 αi t yh ϕ(s) + (ˆPki h V ki h+1 + ˆPki h ϕ)(s, x) + bi . (48) Subtracting Eq. (47) from Eq. (48), we have, Qk h(s, x) Q h(s, x) (49) = α0 t( Q h(s, x)) + i=1 αi t (ˆPki h V ki h+1 Ph V h+1)(s, x) + (ˆPki h ϕ Phϕ)(s, x) + bi (50) = α0 t( Q h(s, x)) + i=1 αi t (ˆPki h V ki h+1 ˆPki h V h+1)(s, x) + (ˆPki h V h+1 Ph V h+1)(s, x) + (ˆPki h ϕ Phϕ)(s, x) + bi . Automatic Reward Shaping from Confounded Offline Data By Lem. H.3, we have that with probability at least 1 2p for all (s, x, h, k) S X [H] [K], Qk h(s, x) Q h(s, x) α0 t( Q h(s, x)) + i=1 αi t (ˆPki h V ki h+1 ˆPki h V h+1)(s, x) + bi + bt. (52) By Lem. G.1, we have, Qk h(s, x) Q h(s, x) α0 t( Q h(s, x)) + i=1 αi t h (ˆPki h V ki h+1 ˆPki h V h+1)(s, x) i + c p i + bt (53) α0 t( Q h(s, x)) + i=1 αi t h (ˆPki h V ki h+1 ˆPki h V h+1)(s, x) i + 3bt. (54) This concludes the proof for the right hand side. Then we will show that Qk h(s, x) Q h(s, x) 0 by induction on the steps of the episode from H to 1. Without losing generality, we assume t > 0 since when t = 0 it s straightforward to show this holds. (1) Base case: When h = H and for all k, it s the last time step and all future state values are zero by definition. We have, Qk h(s, x) Q h(s, x) = i=1 αi t (ˆPki h ϕ Phϕ)(s, x) + bi . (55) By Lem. H.3 and together with the discussion above for the other side, we know (ˆPki h ϕ Phϕ)(s, x) bt 2 with high probability. While by Lem. G.1, Pt i=1 αi tbi bt, thus, we have, Qk h(s, x) Q h(s, x) bt + ( bt 2 ) 0. (56) The base case holds. (2) Induction hypothesis: for h = h and for all k, Qk h(s, x) Q h(s, x) 0. (3) Induction step: Now we show that for h = h 1, the claim still holds. Recall that we can write the Q-value differences as follows when t > 0, Qk h(s, x) Q h(s, x) (57) i=1 αi t (ˆPki h V ki h+1 ˆPki h V h+1)(s, x) + (ˆPki h V h+1 Ph V h+1)(s, x) + (ˆPki h ϕ Phϕ)(s, x) + bi . (58) And by Lem. G.1 and similar discussions as in the base case, we are left with terms, Qk h(s, x) Q h(s, x) i=1 αi t(ˆPki h V ki h+1 ˆPki h V h+1)(s, x). (59) By Algo. 1, V k+1 h+1 (sh+1) = min{ϕ(sh+1), maxx Qk+1 h+1(sh+1, x)}. Thus, no matter which value V k+1 h+1 (sh+1) is getting, by the potential function property (ϕ V ) and the induction hypothesis, V k+1 h+1 (sh+1) V h+1(sh+1) holds. Thus, we have concluded the proof for the left hand side. Definition H.5 (Psudo-Suboptimal State-Action Pairs (Def. 4.3)). We define the set of pseudo-suboptimal state-action pairs to be the set that satisfies, Sub = {(s, x) S X| h [H], yh(s, x) ϕh(s) + 2(Phϕh+1)(s, x) + (s, x) V h (s)}, (60) where (s, x) = minh h(s, x) = minh(V h (s) Q h(s, x)). Automatic Reward Shaping from Confounded Offline Data Lemma H.6 (Bounded Number of Visits to Sub (Lem. 4.4)). The number of visits to (s, x) Sub , t = N k h(s, x), is bounded by, t 16c2Hϕ2 mι 2(s, x) . (61) Proof. When t exceeds the bound, t > 16c2Hϕ2 mι 2(s, x) , (62) by property in Lem. G.1, we have, t < (s, x). (63) This implies 2 Pt i=1 αi tbi < (s, x). And by Lem. H.3, with high probability, we have, i=1 αi t ˆPki h f i=1 αi t Phf + bt αi t Phf + bi where f is a bounded real value function f : S 7 [0, ϕm]. Apply this to the learned Q value Qk h and by the conservative optimism condition (Def. 4.1) that V ki h ϕ, Qk h(s, x) = i=1 αi t yh ϕh(s) + (ˆPki h V ki h+1 + ˆPki h ϕh+1)(s, x) + bi (65) i=1 αi t(yh ϕh(s) + 2(Phϕh+1)(s, x) + 2bi). (66) Plug 2 Pt i=1 αi tbi < (s, x) into this, Qk h(s, x) < i=1 αi t(yh ϕh(s) + 2(Phϕh+1)(s, x) + (s, x)). (68) For (s, x) Sub , by the definition of Sub and the optimistic property of the learned Q-value (Lem. H.4), we have, Qk h(s, x) < i=1 αi t V h (s) (69) = Q h(s, x ) (70) Qk h(s, x ). (71) This indicates that when the number of visits to the state action pairs in Sub , t, exceeds the proposed bound, its learned Q value will be upper bounded by the learned Q value of the optimal actions, and by the greedy action selection rule in Algo. 1, such actions will no longer be chosen going forward. Lemma H.7 (Lower Bound on Sub-optimal Action s Q-Value Differences). For actions that are suboptimal but being selected in the algorithm, x = x , with high probability, we have, Qk h(s, x) Q h(s, x) h(s, x) (72) Automatic Reward Shaping from Confounded Offline Data Proof. Because such actions are selected over the optimal, their learned Q-value must be larger. And by Lem. H.4, Qk h(s, x ) Q h(s, x ) with high probability. Thus, we have, Qk h(s, x) Q h(s, x) Qk h(s, x ) Q h(s, x) (73) Q h(s, x ) Q h(s, x) (74) = V h (s) Q h(s, x) (75) = h(s, x). (76) Now we are ready to prove the main theorem. Theorem H.8 (Regret Bound for Algo. 1. (Thm. 4.5)). Given a potential function ϕ( ), with its maximum value defined as ϕm, after running algorithm Algo. 1 for K episodes with H steps each, the expected regret is bounded by, H3ϕ2 m (s, x) + X H4ϕ2 m (s, x) where Sub is the set of pseudo suboptimal state action pairs and (s, x) = minh h(s, x), for all h [H]. Proof. By the definition of expected regret over trajectories τ, we can decompose it as follows, E [Regret(K)] = Eτ k=1 V 1 (sk 1) V πk 1 (sk 1) k=1 V 1 (sk 1) Q 1(sk 1, xk 1) + Q 1(sk 1, xk 1) V πk 1 (sk 1) k=1 1(sk 1, xk 1) + Es2[V 2 (s2) V πk 2 (s2)|πk, xk 1, sk 1] h=1 h(sk h, xk h) | πk # We can split trajectories by whether the high probability event in Lem. H.4 happens. When Lem. H.4 happens with at least 1 2p probability, Lem. H.7 is also satisfied and we have, E [Regret(K)] = Eτ h=1 h(sk h, xk h) | πk # Qk h(sk h, xk h) Q h(sk h, xk h) + 2p TH. (84) We can set failure probability to be p = 1 2T and by L.H.S of the inequality in Lem. H.4, E [Regret(K)] Qk h(sk h, xk h) Q h(sk h, xk h) + H. (85) Now, the regret bound boils down to the cumulative Q value differences. We revisit Lem. H.4 and expand the R.H.S with Bellman equations, Qk h(sk h, xk h) Q h(sk h, xk h) i=1 αi t h (ˆPki h V ki h+1 ˆPki h V h+1)(s, x) i + 3bt (86) i=1 αi t h (Qki h+1 Q h+1)(ski h+1, xki h+1) i + 3bt. (87) Automatic Reward Shaping from Confounded Offline Data By Lem. G.2, Lem. H.7 and the fact that (s, x) h(s, x), Qk h(sk h, xk h) Q h(sk h, xk h) clip 3bt (sk h, xk h) 2H i=1 αi t h (Qki h+1 Q h+1)(ski h+1, xki h+1) i . (88) Since Q H+1 = Qk H+1 = 0, we can expand the above recursion and solve for value difference of all episodes at step h, PK k=1 Qk h(sk h, xk h) Q h(sk h, xk h), as follows, k=1 Qk h(sk h, xk h) Q h(sk h, xk h) X k ,h h (1 + 1 H )2(h h) clip (sk h , xk h ) 2H k,h clip 3b Nk h (sk h, xk h) 2H The term (1 + 1 H )2 comes from both the coefficient (1 + 1 H ) and the fact that P i=1 αi t 1 + 1 H by Lem. G.1. Then we plug this term back into our regret bound (Eq. (85)), E [Regret(K)] e2H X k,h clip 3b Nk h (sk h, xk h) 2H We can rewrite the summations regarding the state action space and split the summation by whether (s, x) Sub or not, E [Regret(K)] e2H X k,h clip 3b Nk h (sk h, xk h) 2H NK h (s,x) X i=1 clip 3bi N K h (s,x) X i=1 clip 3bi Since from Lem. H.6, we know the total number of visits for each (s, x) Sub is bounded, we can apply this result on N K h (s, x) for (s, x) Sub . And for (s, x) Sub , we apply Lem. G.3 directly. Then the expected regret is bounded as follows, E [Regret(K)] 3e2H2 X NK h (s,x) X i=1 bi + e2H2 X 8H 9c2 Hϕ2 mι (s, x) + H (94) NK h (s,x) X 72c2e2ϕ2 m H4ι (s, x) + H (95) 2(s, x) + X 72c2e2ϕ2 m H4ι (s, x) + H (96) 12c2e2ϕ2 m H3ι (s, x) + X 72c2e2ϕ2 m H4ι (s, x) + H (97) ϕ2 m H3 log (SAT) ϕ2 m H4 log (SAT) Automatic Reward Shaping from Confounded Offline Data We can further simplify the regret bound by the fact that ϕ( ) H and arrive at the final bound, ϕ2 m H3 log (SAT) ϕ2 m H4 log (SAT) H5 log (SAT) H6 log (SAT)