# Off-Policy Evaluation for Human Feedback

Qitong Gao, Ge Gao, Juncheng Dong, Vahid Tarokh, Min Chi, Miroslav Pajic

Duke University, Durham, NC, USA; North Carolina State University, Raleigh, NC, USA. Contact: {qitong.gao, miroslav.pajic}@duke.edu.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

**Abstract.** Off-policy evaluation (OPE) is important for closing the gap between offline training and evaluation of reinforcement learning (RL), by estimating the performance and/or rank of target (evaluation) policies using only offline trajectories. It can improve the safety and efficiency of data collection and policy testing in situations where online deployments are expensive, such as healthcare. However, existing OPE methods fall short in estimating human feedback (HF) signals, as HF may be conditioned on multiple underlying factors and is only sparsely available, as opposed to agent-defined environmental rewards (used in policy optimization), which are usually determined by parametric functions or distributions. Consequently, the nature of HF signals makes extrapolating accurate OPE estimates challenging. To resolve this, we introduce an OPE for HF (OPEHF) framework that revives existing OPE methods in order to accurately evaluate HF signals. Specifically, we develop an immediate human reward (IHR) reconstruction approach, regularized by environmental knowledge distilled into a latent space that captures the underlying dynamics of state transitions as well as of issuing HF signals. Our approach has been tested in two real-world experiments, adaptive in-vivo neurostimulation and intelligent tutoring, as well as in a simulation environment (visual Q&A). Results show that our approach significantly improves the accuracy of estimated HF signals, compared to directly applying (variants of) existing OPE methods.

## 1 Introduction

Off-policy evaluation (OPE) aims to estimate the performance of reinforcement learning (RL) policies using only a fixed set of offline trajectories [61], i.e., without online deployments. It is considered a critical step in closing the gap between offline RL training and evaluation, for environments and systems where online data collection is expensive or unsafe. Specifically, OPE facilitates not only offline evaluation of the safety and efficacy of policies ahead of online deployment, but policy selection as well; this allows one to maximize efficiency when online data collection is possible, by identifying and deploying the policies that are more likely to result in higher returns. OPE has been used in various application domains including healthcare [68, 53, 23, 22], robotics [15, 18, 24], intelligent tutoring [64, 45, 17], and recommendation systems [50, 43].

The majority of existing OPE methods focus on evaluating policy performance defined over environmental reward functions, which are mainly designed for use in policy optimization (training). However, as an increasing number of offline RL frameworks are developed for human-involved systems [64, 45, 1, 48, 16], existing OPE methods lack the ability to estimate how human users would evaluate the policies, e.g., ratings provided by patients (on a scale of 1-10) over a procedure facilitated by automated surgical robots, since human feedback (HF) can be noisy and conditioned on various confounders that are difficult to capture explicitly [53, 7, 44]. For example, patient satisfaction with a specific diabetes therapy may vary across the cohort, depending on many subjective
factors, such as personal preferences and activity level of the day while participating in the therapy, in addition to the physiological signals (e.g., blood sugar level, body weight) that are more commonly used for determining environmental rewards toward policy optimization [70, 33, 21, 19]. Moreover, environmental rewards are sometimes discrete to ensure optimality of the learned policies [67], which further reduces their correlation with HF signals.

In this work, we introduce the OPE for human feedback (OPEHF) framework that revives existing OPE approaches in the context of evaluating HF from offline data. Specifically, we consider the challenging scenario where the HF signal is provided only at the end of each episode, i.e., no per-step HF signals, referred to as immediate human rewards (IHRs) below, are provided; this mirrors the common real-world situation where participants are allowed to rate a procedure only at the end of the study. The goal is to estimate the end-of-episode HF signals, also referred to as human returns, of the target (evaluation) policies, using a fixed set of offline trajectories collected under some behavioral policies. To facilitate OPEHF, we introduce an approach that first maps the human return back to a sequence of IHRs, over the horizon, for each trajectory. Specifically, this is done by optimizing an objective that consists of a necessary condition, that the cumulative discounted sum of the IHRs should equal the human return, as well as a regularization term that limits the discrepancy of the reconstructed IHRs over state-action pairs that are deemed similar in a latent representation space into which environmental transitions and rewards are encoded. Finally, this allows any existing OPE method to process the offline trajectories with reconstructed IHRs and estimate human returns under target policies.

Our main contributions are threefold. (i) We introduce a novel OPEHF framework that revives existing OPE methods toward accurately estimating highly sparse HF signals (provided only at the end of each episode) from offline trajectories, through IHR reconstruction. (ii) Our approach does not require the environmental rewards and the HF signals to be strongly correlated, benefiting from a design where both signals are encoded into a latent space that regularizes the objective for reconstructing IHRs; this is justified empirically in real-world experiments. (iii) Two real-world experiments, i.e., adaptive in-vivo neurostimulation for the treatment of Parkinson's disease and intelligent tutoring for college computer science students, as well as one simulation environment (visual Q&A), facilitate a thorough evaluation of our approach; varying degrees of correlation between the environmental rewards and HF signals exist across the environments, as does the varied coverage of the state-action space provided by offline data from sub-optimal behavioral policies, imposing different levels of challenge for OPEHF.

## 2 Off-Policy Evaluation for Human Feedback (OPEHF)

In this section, we introduce an OPEHF framework that allows existing OPE methods to be used to estimate the human returns that are available only at the end of each episode, with IHRs remaining unknown.
This is in contrast to the goal of classic OPE, which only estimates the environmental returns following the user-defined reward function used in the policy optimization phase. A brief overview of existing OPE methods can be found in Appendix C.

### 2.1 Problem Formulation

We first formulate the human-involved MDP (HMDP), which is a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \mathcal{R}^{\mathcal{H}}, s_0, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is the transition distribution, usually captured by probabilities $p(s_t|s_{t-1}, a_{t-1})$, $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the environmental reward function, $\mathcal{R}^{\mathcal{H}}(r^{\mathcal{H}}|s, a)$ is the human reward distribution from which the IHRs $r^{\mathcal{H}}_t \sim \mathcal{R}^{\mathcal{H}}(\cdot|s_t, a_t)$ are sampled, $s_0$ is the initial state sampled from the initial state distribution $p(s_0)$, and $\gamma \in [0, 1)$ is the discounting factor. Note that we set the IHRs to be determined probabilistically, as opposed to the environmental rewards $r_t = R(s_t, a_t)$ that are deterministic; this is due to the fact that many underlying factors may affect the feedback provided by humans [53, 7, 44], as we have also observed while performing human-involved experiments (see Appendix D). Finally, the agent interacts with the MDP following some policy $\pi(a|s)$ that defines the probabilities of taking action $a$ at state $s$. In this work, we make the following assumption over $R$ and $\mathcal{R}^{\mathcal{H}}$.

**Assumption 1 (Unknown IHRs).** We assume that the immediate environmental reward function $R$ is known and $R(s, a)$ can be obtained for any state-action pair in $\mathcal{S} \times \mathcal{A}$. Moreover, the IHR distribution $\mathcal{R}^{\mathcal{H}}$ is assumed to be unknown, i.e., $r^{\mathcal{H}} \sim \mathcal{R}^{\mathcal{H}}(\cdot|s, a)$ is unobservable, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Instead, the cumulative human return $G^{\mathcal{H}}_{0:T}$, defined over $\mathcal{R}^{\mathcal{H}}$, is given at the end of each trajectory, i.e., $G^{\mathcal{H}}_{0:T} = \sum_{t=0}^{T} \gamma^t r^{\mathcal{H}}_t$, with $T$ being the horizon and $r^{\mathcal{H}}_t \sim \mathcal{R}^{\mathcal{H}}(\cdot|s_t, a_t)$.

The assumption above follows the fact that human feedback (HF) is not available until the end of each episode, as opposed to immediate rewards that can be defined over the environment and evaluated for any $(s_t, a_t)$ pair at any time. This is especially true in environments such as healthcare, where the clinical treatment outcome is not foreseeable until a therapeutic cycle is completed, or in intelligent tutoring, where the overall gain for students over a semester is mostly reflected by the final grades. Note that although the setup can be generalized to the scenario where HF can be sparsely obtained over the horizon, we believe that issuing the HF only at the end of each trajectory leads to a more challenging setup for OPE. Consequently, the goal of OPEHF can be formulated as follows.

**Problem 1 (Objective of OPEHF).** Given offline trajectories collected by some behavioral policy $\beta$, $\rho^{\beta} = \{\tau^{(0)}, \tau^{(1)}, \ldots, \tau^{(N-1)} \,|\, a_t \sim \beta(a_t|s_t)\}$, with $\tau^{(i)} = [(s^{(i)}_0, a^{(i)}_0, r^{(i)}_0, r^{\mathcal{H}(i)}_0, s^{(i)}_1), \ldots, (s^{(i)}_{T-1}, a^{(i)}_{T-1}, r^{(i)}_{T-1}, r^{\mathcal{H}(i)}_{T-1}, s^{(i)}_T), G^{\mathcal{H}(i)}_{0:T}]$ being a single trajectory, $N$ the total number of offline trajectories, and the $r^{\mathcal{H}}_t$'s being unknown, the objective is to estimate the expected total human return over the unknown state-action visitation distribution $\rho^{\pi}$ of the target (evaluation) policy $\pi$, i.e., $\mathbb{E}_{(s,a) \sim \rho^{\pi},\, r^{\mathcal{H}} \sim \mathcal{R}^{\mathcal{H}}}\big[\sum_{t=0}^{T} \gamma^t r^{\mathcal{H}}_t\big]$.

### 2.2 Reconstruction of IHRs for OPEHF

We emphasize that the human returns are only issued at the end of each episode, with IHRs remaining unknown. One can set all IHRs from $t = 0$ to $t = T-2$ to be zero (i.e., $r^{\mathcal{H}}_{0:T-2} = 0$), and rescale the cumulative human return to be the IHR at the last step (i.e., $r^{\mathcal{H}}_{T-1} = G^{\mathcal{H}}_{0:T} / \gamma^{T-1}$), to allow the use of existing OPE methods toward OPEHF (Problem 1).
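To make the rescale approach concrete, the sketch below is a minimal illustration only (not the original implementation; function and argument names are assumptions): it places the entire discounted human return on the last step so that the discounted sum of the constructed IHRs matches the observed $G^{\mathcal{H}}_{0:T}$.

```python
import numpy as np

def rescale_ihrs(human_return: float, horizon: int, gamma: float) -> np.ndarray:
    """Rescale approach: set r^H_t = 0 for t < T-1 and put the whole human return,
    divided by gamma^(T-1), on the last step, so that sum_t gamma^t r^H_t = G^H_{0:T}."""
    ihrs = np.zeros(horizon)
    ihrs[-1] = human_return / (gamma ** (horizon - 1))
    return ihrs

# Hypothetical example: a 5-step episode with an end-of-episode human return of 7.0.
r_hat = rescale_ihrs(human_return=7.0, horizon=5, gamma=0.99)
assert np.isclose(np.sum(0.99 ** np.arange(5) * r_hat), 7.0)  # necessary condition holds
```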
However, the sparsity of the $r^{\mathcal{H}}$'s here may make it difficult for OPE to estimate the human returns of the target policies accurately. For OPEHF, we start by showing that, for the per-decision importance sampling (PDIS) method, a variance-reduction variant of the importance sampling (IS) family of OPE methods [61], if IHRs were available, they could reduce the variance of the estimate compared to the rescale approach above. Recall that the PDIS estimator follows $\hat{G}^{\pi}_{PDIS} = \frac{1}{N} \sum_{i=0}^{N-1} \sum_{t=0}^{T-1} \gamma^t \omega^{(i)}_{0:t} r^{\mathcal{H}(i)}_t$, where $\omega^{(i)}_{0:t} = \prod_{k=0}^{t} \frac{\pi(a^{(i)}_k|s^{(i)}_k)}{\beta(a^{(i)}_k|s^{(i)}_k)}$ are the PDIS weights for offline trajectory $\tau^{(i)}$. Moreover, the estimator of the rescale approach above (we call it the rescale approach instead of vanilla IS, as the underlying idea also generalizes to non-IS methods) is $\hat{G}^{\pi}_{Rescale} = \frac{1}{N} \sum_{i=0}^{N-1} \omega^{(i)}_{0:T-1} G^{\mathcal{H}(i)}_{0:T}$, which is equivalent to the vanilla IS estimator [61, 72]. We now show the variance-reduction property of $\hat{G}^{\pi}_{PDIS}$ in the context of OPEHF.

**Proposition 1.** Assume that (i) $\mathbb{E}[r^{\mathcal{H}}_t] \geq 0$, and (ii) given the horizon $T$ and any $1 \leq t + 1 \leq k \leq T$ of any offline trajectory $\tau$, $\omega_{0:k}$ and $r^{\mathcal{H}}_t \omega_{0:k}$ are positively correlated. Then $\mathbb{V}(\hat{G}^{\pi}_{PDIS}) \leq \mathbb{V}(\hat{G}^{\pi}_{Rescale})$, with $\mathbb{V}(\cdot)$ denoting the variance.

The proof can be found in Appendix A. Assumption (i) can be easily satisfied in the real world, as HF signals are usually quantified as positive values, e.g., ratings (1-10) provided by participants. Assumption (ii) is most likely to be satisfied when the target policies do not visit low-return regions substantially [46], which is a pre-requisite for testing RL policies in human-involved environments, as initial screening is usually required to filter out the policies that could potentially pose risks to participants [57].

Besides IS, doubly robust (DR) [71, 34, 69, 12] and fitted Q-evaluation (FQE) [40] methods require learning value functions. Sparsity of rewards (following the rescale approach above) in the offline dataset may lead to poorly learned value functions [74], considering that the offline data in OPE is usually fixed (i.e., no new samples can be added) and is often generated by behavioral policies that are sub-optimal, which results in limited coverage of the state-action space. Limited availability of environment-policy interactions (e.g., clinical trials) further reduces the scale of the exploration and therefore limits the information that can be leveraged toward obtaining accurate value function approximations.

Figure 1: (Left) Architecture of the variational latent model with human returns (VLM-H). (Mid) Illustration of the clustering behavior in the latent space using t-SNE visualization [73], where the encoded state-action pairs (output by the encoder of VLM-H) are in general clustered together if they are generated by policies with similar human returns (shown in the legend at the top left). (Right) Diagram summarizing the pipeline of the OPEHF framework.
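To make the two estimators above concrete, the following sketch is a minimal illustration under the notation of Section 2.2 (not the original implementation); `weights[i]` is assumed to be the precomputed array of per-step ratios $\pi(a_t|s_t)/\beta(a_t|s_t)$ for trajectory $i$, and `ihrs[i]` the per-step (reconstructed) IHRs.

```python
import numpy as np

def vanilla_is_estimate(weights, human_returns):
    """Rescale/vanilla-IS estimator: weight each episode's end-of-episode human return
    by the product of the per-step ratios over the whole trajectory (omega_{0:T-1})."""
    w_full = np.array([np.prod(w) for w in weights])
    return np.mean(w_full * human_returns)

def pdis_estimate(weights, ihrs, gamma):
    """PDIS estimator: discount and weight each per-step IHR by the cumulative ratio up to t."""
    estimates = []
    for w, r in zip(weights, ihrs):
        w_cum = np.cumprod(w)          # omega_{0:t} for t = 0, ..., T-1
        t = np.arange(len(r))
        estimates.append(np.sum(gamma ** t * w_cum * r))
    return np.mean(estimates)
```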
**Reconstruction of IHRs.** To address this challenge, our approach aims to project the end-of-episode human returns back to each environmental step, i.e., to learn a mapping $f_{\theta}(\tau, G^{\mathcal{H}}_{0:T}): (\mathcal{S} \times \mathcal{A})^T \times \mathbb{R} \rightarrow \mathbb{R}^T$, parameterized by $\theta$, that maximizes the sum of log-likelihoods of the estimated IHRs, $[\hat{r}^{\mathcal{H}}_0, \ldots, \hat{r}^{\mathcal{H}}_{T-1}] \sim f_{\theta}(\tau, G^{\mathcal{H}}_{0:T})$, following $\max_{\theta} \frac{1}{N} \sum_{i=0}^{N-1} \sum_{t=0}^{T-1} \log p\big(\hat{r}^{\mathcal{H}}_t = r^{\mathcal{H}(i)}_t \,\big|\, \theta, \tau^{(i)}, G^{\mathcal{H}(i)}_{0:T}\big)$, where $G^{\mathcal{H}(i)}_{0:T}$ and the $r^{\mathcal{H}(i)}_t$'s are respectively the human return and the IHRs (unknown) of the $i$-th trajectory in the offline dataset $\rho^{\beta}$, and $N$ is the total number of trajectories in $\rho^{\beta}$. Given that the objective above is intractable due to the unknown $r^{\mathcal{H}(i)}_t$'s, we introduce a surrogate objective

$$\max_{\theta} \frac{1}{N} \sum_{i=0}^{N-1} \Big[ \log p\Big( \sum_{t=0}^{T-1} \gamma^t \hat{r}^{\mathcal{H}}_t = G^{\mathcal{H}(i)}_{0:T} \,\Big|\, \theta, \tau^{(i)}, G^{\mathcal{H}(i)}_{0:T} \Big) - C\, \mathcal{L}_{regu}\big(\hat{r}^{\mathcal{H}}_{0:T-1} \,\big|\, \theta, \tau^{(i)}, G^{\mathcal{H}(i)}_{0:T}\big) \Big]. \tag{1}$$

Here, the first term is a necessary condition for the $\hat{r}^{\mathcal{H}}_t$'s to be valid estimates of the $r^{\mathcal{H}}_t$'s, as they should sum to $G^{\mathcal{H}}_{0:T}$. Since many solutions may exist if one optimizes only over the first term, the second term $\mathcal{L}_{regu}$ serves as a regularization that constrains the $\hat{r}^{\mathcal{H}}_t$'s to follow properties specific to their corresponding state-action pairs; e.g., $(s, a)$ pairs that are similar to each other in a representation space, defined over the state-action visitation space, tend to yield similar immediate rewards [18]. The detailed regularization technique is introduced in the sub-section below. Practically, we choose $f_{\theta}$ to be a bi-directional long short-term memory (LSTM) [32], since the reconstruction of IHRs can leverage information from both previous and subsequent steps as provided in the offline trajectories.

### 2.3 Reconstruction of IHRs over Latent Representations (RILR) for OPEHF

We now introduce the regularization technique for the reconstruction of IHRs, i.e., reconstructing IHRs over latent representations (RILR). Specifically, we leverage the representations captured by variational auto-encoders (VAEs) [35], learned over $\rho^{\beta}$, to regularize the reconstructed IHRs $\hat{r}^{\mathcal{H}}_t$. VAEs have been adapted toward learning a compact latent space over offline state-action visitations, facilitating both offline policy optimization [42, 81, 65, 27, 26, 28] and OPE [18]. In this work, we specifically consider building on the variational latent model (VLM) proposed in [18], since it was originally proposed to facilitate OPE, as opposed to others that mainly use knowledge captured in the latent space to improve sample efficiency for policy optimization. Moreover, the VLM has been shown to be effective for learning an expressive representation space, where the encoded state-action pairs are clustered well in the latent space, as measured by the difference over the returns of the policies from which the state-action pairs are sampled; see Figure 1 (mid), which uses t-SNE to visualize the encoded state-action pairs in trajectories collected from a visual Q&A environment (Appendix E). Note that the VLM originally does not account for HF signals (neither the $r^{\mathcal{H}}_t$'s nor the $G^{\mathcal{H}}_{0:T}$'s), so we introduce the variational latent model with human returns (VLM-H) below, building on the architecture introduced in [18]. VLM-H consists of a prior $p(z)$ over the latent variables $z \in \mathcal{Z} \subseteq \mathbb{R}^L$, with $\mathcal{Z}$ representing the latent space and $L$ its dimension, along with a variational encoder $q_{\psi}(z_t|z_{t-1}, a_{t-1}, s_t)$, a decoder $p_{\phi}(z_t, s_t, r_{t-1}|z_{t-1}, a_{t-1})$ for generating per-step transitions (over both the state-action and latent space), and a separate decoder $p_{\phi}(G^{\mathcal{H}}_{0:T}|z_T)$ for the reconstruction of the human returns at the end of each episode. The encoder and decoders are parameterized by $\psi$ and $\phi$, respectively. The overall architecture is illustrated in Figure 1 (left).
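The component structure just described can be summarized by a minimal PyTorch-style skeleton. This is a sketch only, not the authors' implementation; hidden sizes, module names, and the choice of simple feed-forward Gaussian heads are assumptions. Each encoder/decoder is a diagonal-Gaussian distribution whose parameters are produced by a small network.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Outputs a diagonal Gaussian over `out_dim` dimensions given an input vector."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, out_dim)
        self.log_std = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h = self.net(x)
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp())

class VLMH(nn.Module):
    """Skeleton of the variational latent model with human returns (VLM-H)."""
    def __init__(self, state_dim, action_dim, latent_dim):
        super().__init__()
        # Encoders q_psi(z_0 | s_0) and q_psi(z_t | z_{t-1}, a_{t-1}, s_t).
        self.enc_z0 = GaussianHead(state_dim, latent_dim)
        self.enc_zt = GaussianHead(latent_dim + action_dim + state_dim, latent_dim)
        # Decoders: latent transition, state, environmental reward, and human return.
        self.dec_trans = GaussianHead(latent_dim + action_dim, latent_dim)  # p_phi(z_t | z_{t-1}, a_{t-1})
        self.dec_state = GaussianHead(latent_dim, state_dim)                # p_phi(s_t | z_t)
        self.dec_reward = GaussianHead(latent_dim, 1)                       # p_phi(r_{t-1} | z_t)
        self.dec_return = GaussianHead(latent_dim, 1)                       # p_phi(G^H_{0:T} | z_T)
```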
**Trajectory inference (encoding).** VLM-H's encoder approximates the intractable posterior $p(z_t|z_{t-1}, a_{t-1}, s_t) = \frac{p(z_{t-1}, a_{t-1}, z_t, s_t)}{\int_{z_t \in \mathcal{Z}} p(z_{t-1}, a_{t-1}, z_t, s_t)\, dz_t}$, avoiding the need to integrate over the unknown latent space a priori. The inference (or encoding) process can be factorized as $q_{\psi}(z_{0:T}|s_{0:T}, a_{0:T-1}) = q_{\psi}(z_0|s_0) \prod_{t=1}^{T} q_{\psi}(z_t|z_{t-1}, a_{t-1}, s_t)$; here, $q_{\psi}(z_0|s_0)$ encodes initial states $s_0$ into latent variables $z_0$, and $q_{\psi}(z_t|z_{t-1}, a_{t-1}, s_t)$ captures all subsequent environmental transitions in the latent space over the $z_t$'s. In general, both $q_{\psi}$'s are represented as diagonal Gaussian distributions, with mean and variance determined by the neural network $\psi$, as in [18, 42, 27, 26, 28]. (This helps facilitate an orthogonal basis of the latent space, which would improve the expressiveness of the model.)

**Trajectory generation (decoding).** The generative (or decoding) process follows $p_{\phi}(z_{1:T}, s_{0:T}, r_{0:T-1}, G^{\mathcal{H}}_{0:T}|z_0, \pi) = p_{\phi}(G^{\mathcal{H}}_{0:T}|z_T) \prod_{t=1}^{T} p_{\phi}(z_t|z_{t-1}, a_{t-1})\, p_{\phi}(r_{t-1}|z_t) \prod_{t=0}^{T} p_{\phi}(s_t|z_t)$; here, $p_{\phi}(z_t|z_{t-1}, a_{t-1})$ enforces the transition of the latent variables $z_t$ over time, $p_{\phi}(s_t|z_t)$ and $p_{\phi}(r_{t-1}|z_t)$ are used to sample the states and immediate environmental rewards, and $p_{\phi}(G^{\mathcal{H}}_{0:T}|z_T)$ generates the human return issued at the end of each episode. Note that we still use the VLM-H to capture environmental rewards, allowing it to formulate a latent space that captures as much information about the dynamics underlying the environment as possible. All $p_{\phi}$'s are represented as diagonal Gaussians, with parameters determined by the network $\phi$. (If needed, one can project the states onto the orthogonal basis, to ensure that they follow a diagonal covariance.)

To train $\phi$ and $\psi$, one can maximize the evidence lower bound (ELBO) of the joint log-likelihood $\log p_{\phi}(s_{0:T}, r_{0:T-1}, G^{\mathcal{H}}_{0:T}|\phi, \psi, \rho^{\beta})$, i.e.,

$$\max_{\psi, \phi} \; \mathbb{E}_{q_{\psi}} \Big[ \log p_{\phi}(G^{\mathcal{H}}_{0:T}|z_T) + \sum_{t=0}^{T} \log p_{\phi}(s_t|z_t) + \sum_{t=1}^{T} \log p_{\phi}(r_{t-1}|z_t) - \mathrm{KL}\big( q_{\psi}(z_0|s_0)\,\|\, p(z_0) \big) - \sum_{t=1}^{T} \mathrm{KL}\big( q_{\psi}(z_t|z_{t-1}, a_{t-1}, s_t)\,\|\, p_{\phi}(z_t|z_{t-1}, a_{t-1}) \big) \Big]; \tag{2}$$

the first three terms are the log-likelihoods of reconstructing the human return, states, and environmental rewards, and the two terms that follow are Kullback-Leibler (KL) divergences [38] regularizing the inferred posterior $q_{\psi}$. The derivation of the ELBO can be found in Appendix B. In practice, if $\phi$ and $\psi$ are chosen to be recurrent networks, one can also regularize the hidden states of $\phi, \psi$ by including the additional regularization term introduced in [18].

**Regularizing the reconstruction of IHRs.** Existing works have shown that the latent space not only facilitates the generation of synthetic trajectories, but also that the latent encodings of state-action pairs form clusters, under some measures in the latent space [73], if they are rolled out from policies that lead to similar returns [42, 18]. As a result, we regularize $\hat{r}^{\mathcal{H}}_t$ following

$$\min_{\theta} \; \mathcal{L}_{regu}\big(\hat{r}^{\mathcal{H}}_t \,\big|\, \theta, \psi, s^{(i)}_{0:t}, a^{(i)}_{0:t-1}, G^{\mathcal{H}(i)}_{0:T}\big) = - \sum_{j \in \mathcal{J}} \log p\big(\hat{r}^{\mathcal{H}}_t = (1-\gamma)\, G^{\mathcal{H}(j)}_{0:T} \,\big|\, \theta, \psi, s^{(j)}_{0:t}, a^{(j)}_{0:t-1}, G^{\mathcal{H}(j)}_{0:T}\big) \tag{3}$$

for each step $t$; here, $(s^{(i)}_{0:t}, a^{(i)}_{0:t-1}) \in \tau^{(i)} \in \rho^{\beta}$, and $\mathcal{J} = \{j_0, \ldots, j_{K-1}\}$ are the indices of the offline trajectories that correspond to the latent encodings $\{ z^{(j_k)}_t \sim q_{\psi}(\cdot|s^{(j_k)}_{0:t}, a^{(j_k)}_{0:t-1}) \,|\, j_k \in \mathcal{J}, t \in [0, T-1] \}$ that are the $K$-nearest neighbours of the latent encoding $z^{(i)}_t$ pertaining to $(s^{(i)}_{0:t}, a^{(i)}_{0:t-1})$, defined over some similarity/distance function $d(\cdot\|\cdot)$, i.e.,

$$\mathcal{J} = \arg\min_{\{j_0, \ldots, j_{K-1}\}} \sum_{k=0}^{K-1} d\big(z^{(i)}_t \,\|\, z^{(j_k)}_t\big), \quad \text{s.t. the } z^{(j_k)}_t\text{'s corresponding } (s^{(j_k)}_{0:t}, a^{(j_k)}_{0:t-1}) \in \tau^{(j_k)} \in \rho^{\beta}. \tag{4}$$

In practice, we choose $d(\cdot\|\cdot)$ to follow stochastic neighbor embedding (SNE) similarities [73], as they have been shown effective for capturing Euclidean distances in high-dimensional space [75].
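The neighbor selection in (4) can be sketched as follows. This is an illustration only: the Gaussian-kernel form of the SNE-style similarity, the bandwidth `sigma`, and the array names are assumptions, not the paper's exact implementation. The per-step-normalized human returns $(1-\gamma)G^{\mathcal{H}(j)}_{0:T}$ of the selected neighbors then serve as regularization targets for $\hat{r}^{\mathcal{H}}_t$ in (3).

```python
import numpy as np

def sne_similarity(z_query, z_all, sigma=1.0):
    """Gaussian-kernel similarity between a query encoding and all candidate encodings,
    normalized over the candidates (an SNE-style conditional similarity)."""
    d2 = np.sum((z_all - z_query) ** 2, axis=1)
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    return p / np.sum(p)

def k_nearest_latent_neighbors(z_query, latents, k):
    """Indices J of the K trajectories whose step-t encodings are most similar to z_query."""
    sims = sne_similarity(z_query, latents)
    return np.argsort(-sims)[:k]

def regularization_targets(z_query, latents, human_returns, k, gamma):
    """Targets (1 - gamma) * G^H_{0:T}^(j) for the K nearest neighbors j in J."""
    neighbors = k_nearest_latent_neighbors(z_query, latents, k)
    return (1.0 - gamma) * human_returns[neighbors]
```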
**Overall objective of RILR for OPEHF.** As a result, by following (1) and leveraging $\mathcal{L}_{regu}$ from (3) above, the objective for reconstructing the IHRs is set to be

$$\max_{\theta} \frac{1}{N} \sum_{i=0}^{N-1} \Big[ \log p\Big( \sum_{t=0}^{T-1} \gamma^t \hat{r}^{\mathcal{H}}_t = G^{\mathcal{H}(i)}_{0:T} \,\Big|\, \theta, \tau^{(i)}, G^{\mathcal{H}(i)}_{0:T} \Big) - C \sum_{t=0}^{T-1} \mathcal{L}_{regu}\big(\hat{r}^{\mathcal{H}}_t \,\big|\, \theta, \psi, s^{(i)}_{0:t}, a^{(i)}_{0:t-1}, G^{\mathcal{H}(i)}_{0:T}\big) \Big]. \tag{5}$$

**Move from RILR to OPEHF.** In what follows, one can leverage any existing OPE method, taking as input the offline trajectories with the immediate environmental rewards $r_t$ replaced by the reconstructed IHRs $\hat{r}^{\mathcal{H}}_t$, to achieve OPEHF's objective (Problem 1). Moreover, our method does not require the IHRs to be correlated with the environmental rewards, as the VLM-H learns to reconstruct both by sampling from two independent distributions, $p_{\phi}(r_{t-1}|z_t)$ and $p_{\phi}(G^{\mathcal{H}}_{0:T}|z_T)$ respectively, following (2); this is also illustrated empirically in the experiments introduced below (Section 3), where exceedingly low correlations are found in specific scenarios. The overall pipeline summarizing our method is shown in Figure 1 (right).

## 3 Real-World Experiments with Human Participants

In this section, we validate the OPEHF framework introduced above over two real-world experiments, adaptive neurostimulation and intelligent tutoring. Specifically, we consider four types of OPE methods to be used as the downstream estimator following the RILR step (Section 2.3), including per-decision importance sampling (IS) with behavioral policy estimation [30], doubly robust (DR) [71], distribution correction estimation (DICE) [78], and fitted Q-evaluation (FQE) [40]. A brief overview of these methods can be found in Appendix C, and the specific implementations we use are documented in Appendix D. In Appendix E, we have also tested our method within a visual Q&A environment [10, 66], which follows mechanisms similar to the two real-world experiments, i.e., two types of return signals are considered, though no human participants are involved.

Figure 2: Setup of the neurostimulation experiments, as well as the formulation of offline trajectories. Environmental rewards and human returns are captured in streams 1 and 2-3, respectively. (The diagram depicts the internal pulse generator (IPG), a tablet running the RL policy, stimulation amplitudes, beta power, bradykinesia test results, tremor detection/severity, patient feedback (end-of-episode), and environmental/physiological feedback available every 2 s.)

**Baselines and Ablations.** The baselines include two variants for each of the OPE methods above, i.e., (i) the rescale approach discussed in Section 2.2, and (ii) another variant that sets all the IHRs equal to the environmental rewards at the corresponding steps, $r^{\mathcal{H}}_t = r_t$ for all $t \in [0, T-2]$, and then lets $r^{\mathcal{H}}_{T-1} = r_{T-1} + (G^{\mathcal{H}}_{0:T} - G_{0:T})/\gamma^{T-1}$, with $G_{0:T} = \sum_t \gamma^t r_t$ being the environmental return; this variant is referred to as fusion below, and it may perform better when strong correlations exist between environmental and human rewards, as it intrinsically decomposes the human returns into IHRs. Consequently, in each experiment below, we compare the performance of the OPEHF framework extending all four types of OPE methods above against the corresponding rescale and fusion baselines.
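A minimal sketch of the fusion baseline is shown below (illustrative only; variable names are assumptions): the per-step environmental rewards are kept as IHR surrogates, and the mismatch between the human and environmental returns is absorbed into the last step so that the discounted sum still equals $G^{\mathcal{H}}_{0:T}$.

```python
import numpy as np

def fusion_ihrs(env_rewards: np.ndarray, human_return: float, gamma: float) -> np.ndarray:
    """Fusion baseline: r^H_t = r_t for t < T-1, with the last step corrected so that
    sum_t gamma^t r^H_t equals the observed human return G^H_{0:T}."""
    T = len(env_rewards)
    discounts = gamma ** np.arange(T)
    env_return = np.sum(discounts * env_rewards)                      # G_{0:T}
    ihrs = env_rewards.astype(float)
    ihrs[-1] = env_rewards[-1] + (human_return - env_return) / (gamma ** (T - 1))
    return ihrs
```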
We also include the VLM-H as an ablation baseline, i.e., as a standalone model-based approach; this is achieved by sampling the estimated returns from the decoder, $\hat{G}^{\mathcal{H}}_{0:T} \sim p_{\phi}(G^{\mathcal{H}}_{0:T}|z_T)$.

**Metrics.** Following a recent OPE benchmark [15], three metrics are considered to validate the performance of each method: mean absolute error (MAE), rank correlation, and regret@1. Mathematical definitions can be found in Appendix D. Also, following [15], each method is evaluated over 3 random seeds, and the mean performance (with standard errors) is reported.

### 3.1 Adaptive Neurostimulation: Deep Brain Stimulation

Adaptive neurostimulation facilitates treatments for a variety of neurological disorders [4, 11, 13, 55]. Deep brain stimulation (DBS) is a type of neurostimulation used specifically for Parkinson's disease (PD), where an internal pulse generator (IPG), implanted under the collarbone, sends electrical stimuli to the basal ganglia (BG) area of the brain through invasive electrodes; Figure 2 illustrates the setup. Adaptive DBS aims to adjust the strength (amplitude) of the stimulus in real time, to respond to irregular neuronal activities caused by PD, leveraging the local field potentials (LFPs) as the immediate feedback signals, i.e., the environmental rewards. Existing works have leveraged RL for adaptive DBS over computational BG models [25, 20, 52, 59], using rewards defined over a physiological signal, the beta-band power spectral density of the LFPs (i.e., the beta power), since physiologically PD can lead to increased beta power due to the irregular neuronal activations it causes [39]. However, in clinical practice, the correlation between beta power and the level of satisfaction reported by the patients varies depending on the specific characteristics of each person, as PD can cause different types of symptoms over a wide range of severity [56, 37, 5, 76]. Such findings further justify the significance of evaluating HF/human returns in the real world using OPEHF.

Figure 3: Results from the adaptive neurostimulation experiment, i.e., deep brain stimulation (DBS). Each method is evaluated over the data collected from each patient, toward the corresponding target policies, respectively. The performance shown is averaged over all 4 human participants affected by Parkinson's disease (PD); the legend covers the fusion and rescale variants of each OPE method (e.g., DICE Fusion, DICE Rescale, FQE Rescale) as well as the VLM-H ablation. Raw performance over each patient can be found in Appendix D.

Table 1: Correlations between the environmental and human returns of the 6 target DBS policies associated with each PD patient.

| Patient # | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Pearson's | -0.396 | -0.477 | -0.599 | -0.275 |
| Spearman's | -0.2 | -0.6 | 0.086 | 0.086 |

In this experiment, we leverage OPEHF to estimate the feedback provided by 4 PD patients who participate in monthly clinical testing of RL policies trained to adapt the amplitudes of the stimuli toward reducing their PD symptoms, i.e., bradykinesia and tremor. A mixture of behavioral policies is used to collect the offline trajectories $\rho^{\beta}$. Specifically, at every step, the state $s_t$ is a historical sequence of LFPs capturing neuronal activities, and the action $a_t$ updates the amplitude of the stimulus to be sent (RL policies only adapt the stimulation amplitudes within a safe range determined by neurologists/neurosurgeons, making sure they will not lead to negative effects on participants). Then, an environmental reward $r_t = R(s_t, a_t)$ issues a penalty if the beta power computed from the latest LFPs is greater than some threshold (to promote treatment efficacy), as well as a penalty proportional to the amplitude of the stimulus being sent (to improve the battery life of the IPG).
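The structure of this environmental reward can be sketched as follows; this is a hedged illustration only, and the threshold and weighting coefficient shown are placeholders, not the values used in the clinical experiments.

```python
def dbs_environmental_reward(beta_power: float, stim_amplitude: float,
                             beta_threshold: float = 1.0,   # placeholder threshold
                             amp_weight: float = 0.1) -> float:
    """Environmental reward for adaptive DBS: penalize elevated beta power (treatment
    efficacy) and penalize large stimulation amplitudes (IPG battery life)."""
    efficacy_penalty = -1.0 if beta_power > beta_threshold else 0.0
    energy_penalty = -amp_weight * stim_amplitude
    return efficacy_penalty + energy_penalty
```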
At the end of each episode, the human return $G^{\mathcal{H}}_{0:T}$ is determined from three sources (weighted by 50%, 25%, and 25%, respectively): (i) a satisfaction rating (between 1-10) provided by the patient, (ii) hand grasp speed as a result of the bradykinesia test [63], and (iii) the level of tremor, calculated over data from a wearable accelerometer [60, 6]. Each session lasts more than 10 minutes, and each discrete step above corresponds to 2 seconds in the real world; thus, the horizon $T \approx 300$ (more details are provided in Appendix D). Approval of an Institutional Review Board (IRB) is obtained from Duke University Health System, as well as approval for the exceptional use of the DBS system from the US Food and Drug Administration (FDA). For each patient, OPEHF and the baselines are used to estimate the human returns of 6 target policies with varied performance. The ground-truth human return for each target policy is obtained as a result of extensive clinical testing following the same schema above, over more than 100 minutes.

Table 1 shows the Pearson's and Spearman's correlation coefficients [14], measuring the linear and rank correlations between the environmental returns $G_{0:T}$ and the human returns $G^{\mathcal{H}}_{0:T}$ over all the target DBS policies considered for each patient. Pearson's coefficients are all negative since the environmental reward function only issues penalties, while human returns are all captured by positive values. It can be observed that only weak-to-moderate degrees of linear correlation exist for all four patients, while the ranks between the $G_{0:T}$'s and $G^{\mathcal{H}}_{0:T}$'s are not preserved across patients; this highlights the need for leveraging OPEHF to estimate human returns, which is different from classic OPE that focuses on estimating environmental returns. The overall performance, averaged across the 4-patient cohort, is reported in Figure 3. Raw performance over every single patient can be found in Appendix D. It can be observed that our OPEHF framework significantly improves MAEs and ranks compared to the two baselines, for all 4 types of downstream OPE methods we considered (IS, DR, DICE, and FQE).

Figure 4: t-SNE visualization of the VLM-H encodings of the state-action pairs rolled out over DBS policies with different human returns (shown in the legend), for Patients #0-#3. It can be observed that distances among the encoded pairs associated with policies that lead to similar returns are in general smaller, justifying the RILR objective (5).

Table 2: Results from the intelligent tutoring experiment, i.e., performance achieved by our OPEHF framework compared to the ablation and baselines over all four types of downstream OPE estimators.

| Metric | IS Fusion | IS Rescale | IS OPEHF (ours) | DR Fusion | DR Rescale | DR OPEHF (ours) | Ablation: VLM-H |
|---|---|---|---|---|---|---|---|
| MAE | 0.7 ± 0.14 | 0.77 ± 0.08 | 0.57 ± 0.09 | 1.03 ± 0.07 | 1.03 ± 0.25 | 0.86 ± 0.04 | 1.00 ± 0.01 |
| Rank | 0.47 ± 0.11 | 0.4 ± 0.09 | 0.8 ± 0.09 | 0.33 ± 0.05 | 0.4 ± 0.0 | 0.53 ± 0.2 | 0.41 ± 0.25 |
| Regret@1 | 0.36 ± 0.16 | 0.36 ± 0.16 | 0.41 ± 0.04 | 0.41 ± 0.0 | 0.41 ± 0.0 | 0.41 ± 0.0 | 0.28 ± 0.19 |

| Metric | DICE Fusion | DICE Rescale | DICE OPEHF (ours) | FQE Fusion | FQE Rescale | FQE OPEHF (ours) |
|---|---|---|---|---|---|---|
| MAE | 3.19 ± 0.57 | 2.33 ± 0.59 | 1.01 ± 0.01 | 0.74 ± 0.07 | 0.98 ± 0.1 | 0.59 ± 0.1 |
| Rank | 0.47 ± 0.2 | 0.33 ± 0.2 | 0.53 ± 0.22 | 0.27 ± 0.14 | 0.4 ± 0.0 | 0.47 ± 0.05 |
| Regret@1 | 0.55 ± 0.06 | 0.45 ± 0.18 | 0.37 ± 0.15 | 0.36 ± 0.16 | 0.41 ± 0.0 | 0.41 ± 0.0 |
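The metrics reported in Figure 3 and Table 2 can be computed roughly as follows. This is a sketch following common conventions from the OPE benchmark literature [15]; the exact definitions and any normalization used in this paper are given in Appendix D.

```python
import numpy as np
from scipy.stats import spearmanr

def ope_metrics(estimated_returns: np.ndarray, true_returns: np.ndarray) -> dict:
    """MAE, Spearman rank correlation, and regret@1 between estimated and ground-truth
    (human) returns of the target policies."""
    mae = float(np.mean(np.abs(estimated_returns - true_returns)))
    rho, _ = spearmanr(estimated_returns, true_returns)
    # Regret@1: gap between the best policy's true return and the true return of the
    # policy ranked first by the estimates.
    best_by_estimate = int(np.argmax(estimated_returns))
    regret_at_1 = float(np.max(true_returns) - true_returns[best_by_estimate])
    return {"MAE": mae, "rank_correlation": float(rho), "regret@1": regret_at_1}
```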
Moreover, our method also significantly outperforms the VLM-H ablation in terms of these two metrics, as the VLM-H's performance is mainly determined by how well it can capture the underlying dynamics and returns. In contrast, our OPEHF framework not only leverages the latent representations learned by the VLM-H (for regularizing RILR), it also inherits the advantages intrinsically associated with the downstream estimators, e.g., the low-bias nature of IS or the low variance provided by DR. Moreover, the fusion baseline in general performs worse than the rescale baseline, as expected, since no strong correlations between environmental and human returns are found, as reported in Table 1. Note that the majority of the methods lead to similar (relatively low) regrets, as there exist a few policies that lead to human returns that are close for some patients (see the raw statistics in Appendix D). The reason is that all the policies to be extensively tested in clinics are subject to initial screening, where clinicians ensure they would not lead to undesired outcomes or pose significant risks to the patients; thus, the performance of some target policies tends to be close. Nonetheless, the low MAEs and high ranks achieved by our method show that it can effectively capture the subtle differences in returns resulting from other HF signals, i.e., levels of bradykinesia and tremor. Moreover, Figure 4 visualizes the VLM-H encodings over the trajectories collected from the 6 target DBS policies for each participant and shows that encoded pairs associated with policies that lead to similar returns are in general clustered together, which justifies the importance of leveraging the similarities over latent representations to regularize the reconstruction of IHRs, as in the RILR objective (5).

### 3.2 Intelligent Tutoring

Intelligent tutoring refers to a system in which students can actively interact with an autonomous tutoring agent that customizes the learning content, tests, etc., to improve engagement and learning outcomes [2, 64, 45]. OPEHF is important in such a setup for directly estimating the potential outcomes that could be obtained by students, as opposed to environmental rewards that are mostly discrete; see the detailed setup below. Existing works have explored this topic in the classic OPE setting in simulations [49, 54]. The system is deployed in an undergraduate-level introduction to probability and statistics course over 5 academic years at North Carolina State University, where the interaction logs obtained from 1,288 students who voluntarily opted in to this experiment are recorded. (An IRB approval is obtained from North Carolina State University; the use/testing of the intelligent tutoring system is overseen by a departmental committee, ensuring it does not risk the academic performance or privacy of the participants.) Specifically, each episode refers to a student working on a set of 12 problems (i.e., horizon $T = 12$), where the agent suggests that the student approach each problem through independent work or working with the hints provided, or directly provides the full solution (for studying purposes); these options constitute the action space of the agent. The states are characterized by 140 features extracted from the logs, designed by domain experts; they include, for example, the time spent on each problem and the correctness of the solution provided. In each step, an immediate environmental reward of +1 is issued if the answer submitted by the student for the current problem is at least 80% correct (auto-graded following pre-defined rubrics). A reward of 0 is issued if the grade is less than 80% or the agent chooses the action that directly displays the full solution.
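This discrete reward rule can be sketched as follows (illustrative only; the action label and argument names are assumptions, not identifiers from the deployed system).

```python
def tutoring_environmental_reward(grade: float, action: str) -> int:
    """+1 if the student's submission is at least 80% correct, unless the agent
    directly displayed the full solution; 0 otherwise."""
    if action == "show_solution":       # hypothetical action label
        return 0
    return 1 if grade >= 0.8 else 0
```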
Moreover, students are instructed to complete two exams, one before working on any problems and another after finishing all the problems. The normalized difference between the grades of the two exams constitutes the human return for each episode. More details are provided in Appendix D.

Table 3: Correlations between the environmental and human returns from data collected over each academic year.

| Year # | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Pearson's | 0.033 | 0.176 | 0.089 | 0.154 | 0.183 |
| Spearman's | 0.082 | 0.156 | 0.130 | 0.161 | 0.103 |

The intelligent tutoring agent follows different policies across academic years, where the data collected from the first 4 years (1,148 students total) constitute the offline trajectories $\rho^{\beta}$ (as a result of a mixture of behavioral policies). The 4 policies deployed in the 5th year (140 students total) serve as the target policies, whose ground-truth performance is determined by averaging the human returns of the episodes associated with each policy, respectively. Table 3 documents the Pearson's and Spearman's correlation coefficients between the environmental and human returns from data collected over each academic year, showing weak linear and rank correlations across all 5 years. Such low correlations are due to the fact that the environmental rewards are discrete and do not distinguish among the agent's choices, i.e., a +1 reward can be obtained either if the student works out a solution independently or by following hints, and a 0 reward is issued every time the agent chooses to display the solution, even if the student could have solved the problem. As a result, such a setup makes OPEHF more challenging, because human returns are only available at the end of each episode and the immediate environmental rewards do not carry substantial information toward extrapolating IHRs.

Table 2 documents the performance of OPEHF and the baselines toward estimating the human returns of the target policies. It can be observed that our OPEHF framework achieves state-of-the-art performance over all types of downstream OPE estimators considered. This result echoes the design of the VLM-H, where both environmental information (state transitions and rewards) and human returns are encoded into the latent space, which helps formulate a compact and expressive latent space for regularizing the downstream RILR objective (5). Moreover, it is important to use the latent information to guide the reconstruction of IHRs (as regularization in RILR), as opposed to using the VLM-H to predict human returns standalone, since limited convergence guarantees/error bounds can be provided for VAE-based latent models; this is illustrated in both Figure 3 and Table 2, where OPEHF largely outperforms the VLM-H ablation over MAE and rank.

## 4 Related Works

**OPE.** The majority of existing model-free OPE methods can be categorized into one of four types, i.e., IS, DR, DICE, and FQE. Recently, variants of IS and DR methods have been proposed for variance or bias reduction [34, 71, 12, 69], as well as adaptations toward unknown behavioral policies [30]. DICE methods are intrinsically designed to work with offline trajectories rolled out from a mixture of behavioral policies, and existing works have introduced DICE variants toward specific environmental setups [84, 83, 77, 78, 51, 9].
FQE extrapolates policy returns from approximated Q-values [31, 40, 36]. There also exist model-based OPE methods [82, 18] that first capture the dynamics underlying the environment and then estimate policy performance by rolling out trajectories under the target policies. A more detailed review of existing OPE methods can be found in Appendix C. Note that these OPE methods have been designed for estimating environmental returns. In contrast, the objective of OPEHF is to estimate human returns, which may not be strongly correlated with the environmental returns, as they are usually determined under different schemas.

**VAEs for OPE and offline RL.** There exists a long line of research developing latent models to capture the dynamics underlying environments in the context of offline RL as well as OPE. Specifically, PlaNet [27] uses recurrent neural networks to capture the transitions of latent variables over time. Latent representations learned by such VAE architectures have been used to augment the state space in offline policy optimization to improve sample efficiency, e.g., in Dreamer [26, 28], SOLAR [81], and SLAC [42]. On the other hand, LatCo [65] attempts to improve sample efficiency by searching in the latent space, which allows bypassing physical constraints. Also, MOPO [80], COMBO [79], and LOMPO [62] train latent models to quantify the confidence of the environmental transitions learned from offline data and prevent the policies from following transitions over uncertain regions during policy training. Given that such models are mostly designed for improving sample efficiency in policy optimization/training, we choose to leverage the architecture from [18] for RILR, as it is the first work that adapts latent models to the OPE setup.

**Reinforcement learning from human feedback (RLHF).** Recently, the concept of RLHF has been widely used to guide RL policy optimization with HF signals deemed more informative than the environmental rewards [8, 85, 47]. Specifically, these works leverage the ranked preferences provided by labelers to train a reward model, captured by feed-forward neural networks, that is fused with the environmental rewards to guide policy optimization. However, in this work, we focus on estimating the HF signals that serve as a direct evaluation of the RL policies used in human-involved experiments, such as the level of satisfaction (e.g., on a scale of 1-10) and the treatment outcome. The reason is that in many scenarios the participants cannot revisit the same procedure multiple times; e.g., patients may not undergo the same surgery several times and rank the experiences. More importantly, OPEHF's setup is critical when online testing of RL policies may even be prohibited without sufficient justification of safety and efficacy upfront, as illustrated by the experiments above.

**Reward shaping.** Although reward shaping methods [3, 58, 29] pursue the similar idea of decomposing delayed and/or sparse rewards (e.g., the human return) into immediate rewards, they fundamentally rely on transforming the MDP such that the value functions can be smoothly captured and high-return state-action pairs can be quickly identified and frequently re-visited. For example, RUDDER [3] leverages a transformed MDP that has expected future rewards equal to zero. Though the optimization objective is consistent between the pre- and post-transformed MDPs, this approach likely would not converge to an optimal policy in practice.
On the other hand, the performance (i.e., returns) of sub-optimal policies is not preserved across the two MDPs. This significantly limits the use of such methods toward OPE, which requires the returns resulting from sub-optimal policies to be estimated accurately. As a result, such methods are not directly applicable to the OPEHF problem we consider.

## 5 Conclusion and Future Works

Existing OPE methods fall short in estimating HF signals, as HF can depend on various confounders. Thus, in this work, we introduced the OPEHF framework that revives existing OPE methods for estimating human returns, through RILR. The framework was validated over two real-world experiments and one simulation environment, outperforming the baselines in all setups. Although in the future it could be possible to extend OPEHF to facilitate estimating the HF signals needed for updating the policies, similar to RLHF, we focused on policy evaluation, which helped to isolate the source of improvements, as policy optimization's performance may depend on multiple factors, such as the exploration techniques used as well as the objective/optimizer chosen for updating the policy. Moreover, this work mainly focuses on scenarios where the human returns are directly provided by the participants. Under the condition where the HF signals are provided by third parties (e.g., clinicians), non-trivial adaptations of this work may be needed to consider special cases such as conflicting HF signals provided by different sources.

## 6 Acknowledgements

This work is sponsored in part by the AFOSR under award number FA9550-19-1-0169, by the NIH UH3 NS103468 award, and by the NSF CNS-1652544, DUE-1726550, IIS-1651909, and DUE-2013502 awards, as well as the National AI Institute for Edge Computing Leveraging Next Generation Wireless Networks, Grant CNS-2112562. Investigational Summit RC+S systems and technical support were provided by Medtronic PLC. Apple Watches were provided by Rune Labs. We thank Stephen L. Schmidt and Jennifer J. Peters from the Duke University Department of Biomedical Engineering, as well as Katherine Genty from the Duke University Department of Neurosurgery, for their efforts overseeing the DBS experiments in the clinic.

## References

[1] Saminda Wishwajith Abeyruwan, Laura Graesser, David B D'Ambrosio, Avi Singh, Anish Shankar, Alex Bewley, Deepali Jain, Krzysztof Marcin Choromanski, and Pannag R Sanketi. i-Sim2Real: Reinforcement learning of robotic policies in tight human-robot interaction loops. In Conference on Robot Learning, pages 212-224. PMLR, 2023. [2] John R Anderson, C Franklin Boyle, and Brian J Reiser. Intelligent tutoring systems. Science, 228(4698):456-462, 1985. [3] Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. RUDDER: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems, 32, 2019. [4] Alim Louis Benabid. Deep brain stimulation for Parkinson's disease. Current Opinion in Neurobiology, 13(6):696-706, 2003. [5] Peter Brown, Antonio Oliviero, Paolo Mazzone, Angelo Insola, Pietro Tonali, and Vincenzo Di Lazzaro. Dopamine dependency of oscillations between subthalamic nucleus and pallidum in Parkinson's disease. Journal of Neuroscience, 21(3):1033-1038, 2001. [6] Witney Chen, Lowry Kirkby, Miro Kotzev, Patrick Song, Roee Gilron, and Brian Pepin. The role of large-scale data infrastructure in developing next-generation deep brain stimulation therapies. Frontiers in Human Neuroscience, 15:717401, 2021.
[7] Nicholas C Chesnaye, Vianda S Stel, Giovanni Tripepi, Friedo W Dekker, Edouard L Fu, Carmine Zoccali, and Kitty J Jager. An introduction to inverse probability of treatment weighting in observational research. Clinical Kidney Journal, 15(1):14 20, 2022. [8] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017. [9] Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, and Dale Schuurmans. Coindice: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33:9398 9411, 2020. [10] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [11] Günther Deuschl, Carmen Schade-Brittinger, Paul Krack, Jens Volkmann, Helmut Schäfer, Kai Bötzel, Christine Daniels, Angela Deutschländer, Ulrich Dillmann, Wilhelm Eisner, et al. A randomized trial of deep-brain stimulation for parkinson s disease. New England Journal of Medicine, 355(9):896 908, 2006. [12] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447 1456. PMLR, 2018. [13] Kenneth A Follett, Frances M Weaver, Matthew Stern, Kwan Hur, Crystal L Harris, Ping Luo, William J Marks Jr, Johannes Rothlind, Oren Sagher, Claudia Moy, et al. Pallidal versus subthalamic deep-brain stimulation for parkinson s disease. New England Journal of Medicine, 362(22):2077 2091, 2010. [14] David Freedman, Robert Pisani, and Roger Purves. Statistics (international student edition). Pisani, R. Purves, 4th edn. WW Norton & Company, New York, 2007. [15] Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, et al. Benchmarks for deep off-policy evaluation. In International Conference on Learning Representations, 2020. [16] Ge Gao, Qitong Gao, Xi Yang, Miroslav Pajic, and Min Chi. A reinforcement learning-informed pattern mining framework for multivariate time series classification. In International Joint Conference on Artificial Intelligence (IJCAI), 2022. [17] Ge Gao, Song Ju, Markel Sanz Ausin, and Min Chi. Hope: Human-centric off-policy evaluation for e-learning and healthcare. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2023. [18] Qitong Gao, Ge Gao, Min Chi, and Miroslav Pajic. Variational latent branching model for off-policy evaluation. In The Eleventh International Conference on Learning Representations (ICLR), 2023. [19] Qitong Gao, Davood Hajinezhad, Yan Zhang, Yiannis Kantaros, and Michael M Zavlanos. Reduced variance deep reinforcement learning with temporal logic specifications. In Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems, pages 237 248, 2019. [20] Qitong Gao, Michael Naumann, Ilija Jovanov, Vuk Lesi, Karthik Kamaravelu, Warren M Grill, and Miroslav Pajic. Model-based design of closed loop deep brain stimulation controller using reinforcement learning. In 2020 ACM/IEEE 11th International Conference on Cyber-Physical Systems (ICCPS), pages 108 118. IEEE, 2020. [21] Qitong Gao, Miroslav Pajic, and Michael M Zavlanos. 
Deep imitative reinforcement learning for temporal logic robot motion planning with noisy semantic observations. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 8490 8496. IEEE, 2020. [22] Qitong Gao, Stephen L Schmidt, Afsana Chowdhury, Guangyu Feng, Jennifer J Peters, Katherine Genty, Warren M Grill, Dennis A Turner, and Miroslav Pajic. Offline learning of closed-loop deep brain stimulation controllers for parkinson disease treatment. In Proceedings of the ACM/IEEE 14th International Conference on Cyber-Physical Systems (with CPS-Io T Week 2023), pages 44 55, 2023. [23] Qitong Gao, Stephen L Schmidt, Karthik Kamaravelu, Dennis A Turner, Warren M Grill, and Miroslav Pajic. Offline policy evaluation for learning-based deep brain stimulation controllers. In 2022 ACM/IEEE 13th International Conference on Cyber-Physical Systems (ICCPS), pages 80 91. IEEE, 2022. [24] Qitong Gao, Dong Wang, Joshua D Amason, Siyang Yuan, Chenyang Tao, Ricardo Henao, Majda Hadziahmetovic, Lawrence Carin, and Miroslav Pajic. Gradient importance learning for incomplete observations. International Conference on Learning Representations, 2022. [25] A. Guez, R. D. Vincent, M. Avoli, and J. Pineau. Adaptive treatment of epilepsy via batch-mode reinforcement learning. In AAAI, pages 1671 1678, 2008. [26] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020. [27] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning, pages 2555 2565. PMLR, 2019. [28] Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, 2020. [29] Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, and Jian Peng. Off-policy reinforcement learning with delayed rewards. In International Conference on Machine Learning, pages 8280 8303. PMLR, 2022. [30] Josiah Hanna, Scott Niekum, and Peter Stone. Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pages 2605 2613. PMLR, 2019. [31] Botao Hao, Xiang Ji, Yaqi Duan, Hao Lu, Csaba Szepesvari, and Mengdi Wang. Bootstrapping fitted q-evaluation for off-policy inference. In International Conference on Machine Learning, pages 4074 4084. PMLR, 2021. [32] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735 1780, 1997. [33] Yan Jia, John Burden, Tom Lawton, and Ibrahim Habli. Safe reinforcement learning for sepsis treatment. In 2020 IEEE International conference on healthcare informatics (ICHI), pages 1 7. IEEE. [34] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652 661. PMLR, 2016. [35] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. [36] Ilya Kostrikov and Ofir Nachum. Statistical bootstrapping for uncertainty estimation in offpolicy evaluation. ar Xiv preprint ar Xiv:2007.13609, 2020. [37] A.A. Kühn, A. Kupsch, GH. Schneider, and P Brown. Reduction in subthalamic 8 35 hz oscillatory activity correlates with clinical improvement in parkinson s disease. Euro. J. of Neuroscience, 23(7):1956 1960, 2006. 
[38] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79 86, 1951. [39] Alexis M Kuncel and Warren M Grill. Selection of stimulus parameters for deep brain stimulation. Clinical neurophysiology, 115(11):2431 2441, 2004. [40] Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703 3712. PMLR, 2019. [41] Pierre L Ecuyer and Bruno Tuffin. Approximate zero-variance simulation. In 2008 Winter Simulation Conference, pages 170 181. IEEE, 2008. [42] Alex Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33, 2020. [43] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM International Conference on Web Search and Data Mining, pages 297 306, 2011. [44] Christopher G Lis, Kamal Patel, and Digant Gupta. The relationship between patient satisfaction with service quality and survival in non-small cell lung cancer is self-rated health a potential confounder? Plo S one, 10(7):e0134617, 2015. [45] Evan Liu, Moritz Stephan, Allen Nie, Chris Piech, Emma Brunskill, and Chelsea Finn. Giving feedback on interactive student programs with meta-exploration. Advances in Neural Information Processing Systems, 35:36282 36294, 2022. [46] Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling. In International Conference on Machine Learning, pages 6184 6193. PMLR, 2020. [47] James Mac Glashan, Mark K Ho, Robert Loftin, Bei Peng, Guan Wang, David L Roberts, Matthew E Taylor, and Michael L Littman. Interactive learning from policy-dependent human feedback. In International Conference on Machine Learning, pages 2285 2294. PMLR, 2017. [48] Travis Mandel, Yun-En Liu, Emma Brunskill, and Zoran Popovi c. Where to add actions in human-in-the-loop reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. [49] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, volume 1077, 2014. [50] Rishabh Mehrotra, James Mc Inerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 2243 2251, 2018. [51] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32:2318 2328, 2019. [52] Vivek Nagaraj, Andrew Lamperski, and Theoden I Netoff. Seizure control in a computational model using a reinforcement learning stimulation paradigm. International Journal of Neural Systems, 27(07):1750012, 2017. [53] Hongseok Namkoong, Ramtin Keramati, Steve Yadlowsky, and Emma Brunskill. Off-policy policy evaluation for sequential decisions under unobserved confounding. Advances in Neural Information Processing Systems, 33:18819 18831, 2020. 
[54] Allen Nie, Yannis Flet-Berliac, Deon Jordan, William Steenbergen, and Emma Brunskill. Dataefficient pipeline for offline reinforcement learning with limited data. Advances in Neural Information Processing Systems, 35:14810 14823, 2022. [55] Michael S Okun. Deep-brain stimulation for parkinson s disease. New England Journal of Medicine, 367(16):1529 1538, 2012. [56] Michael S Okun. Deep-brain stimulation for parkinson s disease. New England Journal of Medicine, 367(16):1529 1538, 2012. [57] Bahram Parvinian, Christopher Scully, Hanniebey Wiyor, Allison Kumar, and Sandy Weininger. Regulatory considerations for physiological closed-loop controlled medical devices used for automated critical care: food and drug administration workshop discussion topics. Anesthesia and analgesia, 126(6):1916, 2018. [58] Vihang P Patil, Markus Hofmarcher, Marius-Constantin Dinu, Matthias Dorfer, Patrick M Blies, Johannes Brandstetter, Jose A Arjona-Medina, and Sepp Hochreiter. Align-rudder: Learning from few demonstrations by reward redistribution. ar Xiv preprint ar Xiv:2009.14108, 2020. [59] J. Pineau, A. Guez, Robert Vincent, Gabriella Panuccio, and Massimo Avoli. Treating epilepsy via adaptive neurostimulation: a reinforcement learning approach. Int. J. of Neural Sys., 19(04):227 240, 2009. [60] Rob Powers, Maryam Etezadi-Amoli, Edith M Arnold, Sara Kianian, Irida Mance, Maxsim Gibiansky, Dan Trietsch, Alexander Singh Alvarado, James D Kretlow, Todd M Herrington, et al. Smartwatch inertial sensors continuously monitor real-world motor fluctuations in parkinson s disease. Science translational medicine, 13(579):eabd7865, 2021. [61] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000. [62] Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, and Chelsea Finn. Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control, pages 1154 1168. PMLR, 2021. [63] Claudia Ramaker, Johan Marinus, Anne Margarethe Stiggelbout, and Bob Johannes Van Hilten. Systematic evaluation of rating scales for impairment and disability in parkinson s disease. Movement disorders, 17(5):867 876, 2002. [64] Sherry Ruan, Allen Nie, William Steenbergen, Jiayu He, JQ Zhang, Meng Guo, Yao Liu, Kyle Dang Nguyen, Catherine Y Wang, Rui Ying, et al. Reinforcement learning tutor better supported lower performers in a math task. ar Xiv preprint ar Xiv:2304.04933, 2023. [65] Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, and Sergey Levine. Model-based reinforcement learning via latent-space collocation. In International Conference on Machine Learning, pages 9190 9201. PMLR, 2021. [66] Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, and Sergey Levine. Offline rl for natural language generation with implicit language q learning. ar Xiv preprint ar Xiv:2206.11871, 2022. [67] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. [68] Shengpu Tang and Jenna Wiens. Model selection for offline reinforcement learning: Practical considerations for healthcare settings. In Machine Learning for Healthcare Conference, pages 2 35. PMLR, 2021. [69] Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In International Conference on Learning Representations, 2019. [70] Miguel Tejedor, Ashenafi Zebene Woldaregay, and Fred Godtliebsen. 
Reinforcement learning application in diabetes blood glucose control: A systematic review. Artificial intelligence in medicine, 104:101836, 2020. [71] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139 2148. PMLR, 2016. [72] Philip S Thomas. Safe reinforcement learning. 2015. [73] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [74] Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. ar Xiv preprint ar Xiv:1707.08817, 2017. [75] Martin Wattenberg, Fernanda Viégas, and Ian Johnson. How to use t-sne effectively. Distill, 1(10):e2, 2016. [76] Joshua K Wong, Günther Deuschl, Robin Wolke, Hagai Bergman, Muthuraman Muthuraman, Sergiu Groppa, Sameer A Sheth, Helen M Bronte-Stewart, Kevin B Wilkins, Matthew N Petrucci, et al. Proc. the 9th annual deep brain stimulation think tank: Advances in cutting edge technologies, artificial intelligence, neuromodulation, neuroethics, pain, interventional psychiatry, epilepsy, and traumatic brain injury. Frontiers in Human Neuroscience, page 25, 2022. [77] Mengjiao Yang, Bo Dai, Ofir Nachum, George Tucker, and Dale Schuurmans. Offline policy selection under uncertainty. In Deep RL Workshop Neur IPS 2021, 2021. [78] Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. Off-policy evaluation via the regularized lagrangian. Advances in Neural Information Processing Systems, 33:6551 6561, 2020. [79] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. Advances in neural information processing systems, 34:28954 28967, 2021. [80] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129 14142, 2020. [81] Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew Johnson, and Sergey Levine. Solar: Deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning, pages 7444 7453. PMLR, 2019. [82] Michael R Zhang, Thomas Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Mohammad Norouzi, et al. Autoregressive dynamics models for offline policy evaluation and optimization. In International Conference on Learning Representations, 2020. [83] Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. Gendice: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020. [84] Shangtong Zhang, Bo Liu, and Shimon Whiteson. Gradientdice: Rethinking generalized offline estimation of stationary values. In International Conference on Machine Learning, pages 11194 11203. PMLR, 2020. [85] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. ar Xiv preprint ar Xiv:1909.08593, 2019. 
## A Proofs for Section 2.2

Recall that the per-decision importance sampling (PDIS) [61] estimator follows $\hat{G}^{\pi}_{PDIS} = \frac{1}{N} \sum_{i=0}^{N-1} \sum_{t=0}^{T-1} \gamma^t \omega^{(i)}_{0:t} r^{\mathcal{H}(i)}_t$; here, $\omega^{(i)}_t = \frac{\pi(a^{(i)}_t|s^{(i)}_t)}{\hat{\beta}(a^{(i)}_t|s^{(i)}_t)}$, and henceforth $\omega^{(i)}_{0:t} = \prod_{k=0}^{t} \frac{\pi(a^{(i)}_k|s^{(i)}_k)}{\hat{\beta}(a^{(i)}_k|s^{(i)}_k)}$ are the PDIS weights for offline trajectory $\tau^{(i)}$. To simplify our notation, we consider the PDIS estimator defined over a single trajectory, i.e.,

$$\hat{G}^{\pi}_{PDIS} = \sum_{t=0}^{T-1} \gamma^t \omega_{0:t} r^{\mathcal{H}}_t, \tag{6}$$

as the results carry over to $N$ trajectories by multiplying with a factor $1/N$. We also omit the superscripted $(i)$'s in the rest of the proof for conciseness.

### A.1 Proof of Proposition 1

*Proof.* We start with a lemma from [46].

**Lemma 1 ([46]).** Given $X_t$ and $Y_t$ as two sequences of random variables. Then t