# Rating-Based Reinforcement Learning

Devin White¹, Mingkang Wu¹, Ellen Novoseller², Vernon J. Lawhern², Nicholas Waytowich², Yongcan Cao¹

¹University of Texas at San Antonio  ²DEVCOM Army Research Laboratory

## Abstract

This paper develops a novel rating-based reinforcement learning (RbRL) approach that uses human ratings to obtain human guidance in reinforcement learning. Different from the existing preference-based and ranking-based reinforcement learning paradigms, which are based on human relative preferences over sample pairs, the proposed rating-based reinforcement learning approach is based on human evaluation of individual trajectories without relative comparisons between sample pairs. The rating-based reinforcement learning approach builds on a new prediction model for human ratings and a novel multi-class loss function. We finally conduct several experimental studies based on synthetic ratings and real human ratings to evaluate the performance of the new rating-based reinforcement learning approach.

## Introduction

With the development of deep neural network theory and improvements in computing hardware, deep reinforcement learning (RL) has become capable of handling complex tasks with large state and/or action spaces (e.g., Go and Atari games) and yielding human-level or better-than-human-level performance (Silver et al. 2016; Mnih et al. 2015). Numerous approaches, such as DQN (Mnih et al. 2015), DDPG (Lillicrap et al. 2015), PPO (Schulman et al. 2017), and SAC (Haarnoja et al. 2018), have been developed to address challenges such as stability, exploration, and convergence for various applications (Li 2019) such as robotic control, autonomous driving, and gaming. Despite the important and fundamental advances behind these algorithms, one key obstacle to the wide application of deep RL is the required knowledge of a reward function, which is often unavailable in practical applications. Although human experts could design reward functions in some domains, the cost is high because human experts need to understand the relationship between the mission objective and state-action values and may need to spend extensive time adjusting reward parameters and trade-offs so as not to encounter adverse behaviors such as reward hacking (Amodei et al. 2016). Another approach is to utilize qualitative human inputs indirectly to learn a reward function, such that humans guide reward function design without directly handcrafting the reward. Existing work on reward learning includes inverse reinforcement learning (IRL) (Ziebart et al. 2008), preference-based reinforcement learning (PbRL) (Christiano et al. 2017), and the combination of demonstrations and relative preferences, e.g., learning from preferences over demonstrations (Brown et al. 2019).

Existing human-guided reward learning approaches have demonstrated effective performance in various tasks. However, they suffer from some key limitations. For example, IRL requires expert demonstrations and hence cannot be directly applied to tasks that are difficult for humans to demonstrate. PbRL is a practical approach to learning rewards for RL, since it is straightforward for humans to provide accurate relative preference information. Yet, RL from pairwise preferences suffers from some key disadvantages. First, each pairwise preference provides only a single bit of information, which can result in sample inefficiency.
In addition, due to their binary nature, standard preference queries do not indicate how much better or worse one sample is than another. Furthermore, because preference queries are relative, they cannot directly provide a global view of each sample's absolute quality (good vs. bad); for instance, if all choices shown to the user are of poor quality, the user cannot say, "A is better than B, but they're both bad!". Thus, a PbRL algorithm may be more easily trapped in a local optimum, and cannot know to what extent its performance approaches the user's goal. Finally, PbRL methods often require strict preferences, such that comparisons between similar-quality or incomparable trajectories cannot be used in reward learning. While some works use weak preference queries (Bıyık et al. 2020; Bıyık, Talati, and Sadigh 2022), in which the user can state that two choices are equally preferable, there is no way to specify the quality (good vs. poor) of such trajectories; thus, valuable information remains untapped.

The objective of this paper is to design a new rating-based RL (RbRL) approach that infers reward functions via multi-class human ratings. RbRL differs from IRL and PbRL in that it leverages human ratings on individual samples, whereas IRL uses demonstrations and PbRL uses relative pairwise comparisons. In each query, RbRL displays one trajectory to a human and requests the human to provide a discrete rating. The number of rating classes can be as low as two, e.g., "bad" and "good", and can be as high as desired. For example, when the number of rating classes is 5, the 5 possible human ratings could correspond to "very bad", "bad", "ok", "good", and "very good". It is worth mentioning that the statement "samples A and B are both rated as good" may provide more information than stating that "A and B are equally preferable", since the latter can be inferred from the former. However, "A and B are equally preferable" may be important information for fine-tuning. In addition, a person can also intentionally assign high ratings to samples that contain rare states, which would be beneficial for addressing the exploration issue (Ecoffet et al. 2019) in RL. For both PbRL and RbRL, obtaining good samples requires exploration, and both will suffer without any well-performing samples.

The main contributions of this paper are as follows. First, we propose a novel RbRL framework for reward function and policy learning from qualitative, absolute human evaluations. Second, we design a new multi-class cross-entropy loss function that accepts multi-class human ratings as the input. The new loss function is based on the computation of a relative episodic reward index and the design of a new multi-class probability distribution function based on this index. Third, we conduct several experimental studies to quantify the impact of the number of rating classes on the performance of RbRL, and compare RbRL and PbRL under both synthetic and real human feedback. Our studies suggest that (1) too few or too many rating classes can be disadvantageous, (2) RbRL can outperform PbRL under both synthetic and real human feedback, and (3) people find RbRL to be less demanding, discouraging, and frustrating than PbRL.

## Related Work

Inverse Reinforcement Learning (IRL) seeks to infer reward functions from demonstrations such that the learned reward functions generate behaviors that are similar to the demonstrations. Numerous IRL methods (Ng, Russell et al.
2000), such as maximum entropy IRL (Ziebart et al. 2008; Wulfmeier, Ondruska, and Posner 2015), nonlinear IRL (Finn, Levine, and Abbeel 2016), Bayesian IRL (Levine, Popovic, and Koltun 2011; Choi and Kim 2011, 2012), adversarial IRL (Fu, Luo, and Levine 2018), and behavioral cloning IRL (Szot et al. 2022), have been developed to infer reward functions. The need for demonstrations often makes these IRL methods costly, since human experts are needed to provide demonstrations.

Instead of requiring human demonstrations, PbRL (Wirth et al. 2017; Christiano et al. 2017; Ibarz et al. 2018; Liang et al. 2022; Zhan, Tao, and Cao 2021; Xu et al. 2020; Lee, Smith, and Abbeel 2021; Park et al. 2022) leverages human pairwise preferences over trajectory pairs to learn reward functions. Querying humans for pairwise preferences rather than demonstrations can dramatically save human time. In addition, by leveraging techniques such as adversarial neural networks (Zhan, Tao, and Cao 2021), additional human time can be saved by learning a well-performing model to predict human preferences. Another benefit of PbRL is that humans can provide preferences with respect to uncertainty to promote exploration (Liang et al. 2022). Despite these benefits, PbRL can be ineffective, especially for complex environments, because pairwise preferences only provide relative information rather than directly evaluating sample quality; while in some domains, sampled pairs may be selected carefully to infer global information, in practice, even if one sample is preferred over another, it does not necessarily mean that this sample is good. People can also have difficulty when comparing similar samples, thus taking more time and potentially yielding inaccurate preference labels. Notably, several works have sought to improve the sample efficiency of PbRL; for instance, PEBBLE (Lee, Smith, and Abbeel 2021) considers off-policy PbRL, and SURF (Park et al. 2022) explores data augmentations in PbRL. These contributions are orthogonal to ours, as they could straightforwardly be applied within our proposed RbRL framework.

Other methods for learning reward functions from humans include combining relative rankings and demonstrations, e.g., by inferring rewards via rankings over a pool of demonstrations (Brown et al. 2020, 2019; Brown, Goo, and Niekum 2020) to extrapolate better-than-demonstrator performance from the learned rewards, or first learning from demonstrations and then fine-tuning with preferences (Ibarz et al. 2018; Bıyık et al. 2022). Finally, in the TAMER framework (Knox and Stone 2009; Warnell et al. 2018; Celemin and Ruiz-del-Solar 2015), a person gives positive (encouraging) and negative (discouraging) feedback to an agent with respect to specific states and actions, instead of over entire trajectories. These methods generally take actions greedily with respect to the learned reward, which may not yield an optimal policy in continuous control settings.

## Problem Formulation

We consider a Markov decision process without reward (MDP\R) augmented with ratings, which is a tuple of the form $(S, A, T, \rho, \gamma, n)$. Here, $S$ is the set of states, $A$ is the set of possible actions, $T : S \times A \times S \to [0, 1]$ is a state transition probability function specifying the probability $p(s' \mid s, a)$ of reaching state $s' \in S$ after taking action $a$ in state $s$, $\rho : S \to [0, 1]$ specifies the initial state distribution, $\gamma$ is a discount factor, and $n$ is the number of rating classes.
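To make the problem setup concrete, the following minimal sketch shows one possible way to represent the MDP\R tuple and a rated trajectory segment in code. It is a hypothetical illustration: the class names, field names, and type choices are ours, not the paper's.

```python
# A minimal, hypothetical sketch of the MDP\R-with-ratings tuple (S, A, T, rho, gamma, n)
# and of a human-rated trajectory segment; names and types are illustrative only.
from dataclasses import dataclass
from typing import Callable, List, Sequence

State = Sequence[float]   # placeholder state representation
Action = Sequence[float]  # placeholder action representation

@dataclass
class MDPR:
    """Markov decision process without reward, augmented with n rating classes."""
    sample_transition: Callable[[State, Action], State]  # draws s' ~ T(. | s, a)
    sample_initial_state: Callable[[], State]            # draws s_1 ~ rho
    gamma: float                                         # discount factor
    n_rating_classes: int                                # n, e.g. 2 for {"bad", "good"}

@dataclass
class RatedSegment:
    """A length-k trajectory segment together with a discrete human rating."""
    states: List[State]    # s_1, ..., s_k
    actions: List[Action]  # a_1, ..., a_k
    rating: int            # c in {0, ..., n_rating_classes - 1}; 0 is the worst class
```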
The learning agent interacts with the environment through rollout trajectories, where a length-$k$ trajectory segment takes the form $(s_1, a_1, s_2, a_2, \ldots, s_k, a_k)$. A policy $\pi$ is a function that maps states to actions, such that $\pi(a \mid s)$ is the probability of taking action $a \in A$ in state $s \in S$. In traditional RL, the agent would receive a reward signal $r : S \times A \to \mathbb{R}$, mapping state-action pairs to a numerical reward, such that at time-step $t$, the algorithm receives a reward $r_t = r(s_t, a_t)$, where $(s_t, a_t)$ is the state-action pair at time $t$. Accordingly, the standard RL problem can be formulated as a search for the optimal policy $\pi^*$, where

$$\pi^* = \arg\max_\pi \sum_{t=0}^{\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[\gamma^t r(s_t, a_t)\right], \quad a_t \sim \pi(\cdot \mid s_t),$$

and $\rho_\pi$ is the marginal state-action distribution induced by the policy $\pi$.

Note that standard RL assumes the availability of the reward function $r$. When such a reward function is unavailable, standard RL and its variants may not be used to derive control policies. Instead, we assume that the user can assign any given trajectory segment $\tau = (s_1, a_1, \ldots, s_k, a_k)$ a rating in the set $\{0, 1, \ldots, n-1\}$ indicating the quality of that segment, where $0$ is the lowest possible rating, while $n-1$ is the highest possible rating. The algorithm presents a series of trajectory segments $\sigma$ to the human and receives corresponding human ratings. Let $X := \{(\sigma_i, c_i)\}_{i=1}^{l}$ be the dataset of observed human ratings, where $c_i \in \{0, \ldots, n-1\}$ is the rating class assigned to segment $\sigma_i$, and $l$ is the number of rated segments contained in $X$ at the given point during learning. Note that descriptive labels can also be given to the rating classes. For example, for $n = 4$ rating classes, we can call the rating class 0 "very bad", the rating class 1 "bad", the rating class 2 "good", and the rating class 3 "very good". With $n = 3$ rating classes, we can call the rating class 0 "bad", the rating class 1 "neutral", and class 2 "good".

## Rating-Based Reinforcement Learning

Different from the binary-class reward learning in Christiano et al. (2017), which utilizes relative human preferences between segment pairs, RbRL utilizes non-binary multi-class ratings for individual segments. We call this a multi-class reinforcement learning approach based on ratings.

### Modeling Reward and Return

Our approach learns a reward model $\hat{r} : S \times A \to \mathbb{R}$ that predicts state-action rewards $\hat{r}(s, a)$. We further define $\hat{R}(\sigma) := \sum_{t=1}^{k} \gamma^{t-1} \hat{r}(s_t, a_t)$ as the cumulative discounted reward, or the return, of a length-$k$ trajectory segment $\sigma$. A larger $\hat{R}(\sigma)$ corresponds to a higher predicted human rating for segment $\sigma$. Next, we define $\bar{R}(\sigma)$ as a function mapping a trajectory segment $\sigma$ to an estimated total discounted reward, normalized to fall in the interval $[0, 1]$ based on the dataset of rated trajectory segments $X$:

$$\bar{R}(\sigma) = \frac{\hat{R}(\sigma) - \min_{\sigma' \in X} \hat{R}(\sigma')}{\max_{\sigma' \in X} \hat{R}(\sigma') - \min_{\sigma' \in X} \hat{R}(\sigma')}. \tag{1}$$
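As a concrete illustration of this subsection, the sketch below implements a small reward network for $\hat{r}$, the discounted segment return $\hat{R}(\sigma)$, and the min-max normalization of Equation (1). The network architecture, hidden sizes, discount default, and the small constant in the denominator are our own assumptions rather than details from the paper.

```python
# A sketch (assumed architecture and hyperparameters) of the learned reward model r_hat,
# the segment return R_hat(sigma), and the normalized return R_bar(sigma) in Equation (1).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # states: (..., obs_dim), actions: (..., act_dim) -> per-step reward r_hat(s, a)
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)

def segment_return(model: RewardModel, states: torch.Tensor, actions: torch.Tensor,
                   gamma: float = 0.99) -> torch.Tensor:
    """R_hat(sigma) = sum_{t=1..k} gamma^(t-1) * r_hat(s_t, a_t) for a batch of segments.

    states: (batch, k, obs_dim), actions: (batch, k, act_dim) -> returns of shape (batch,).
    """
    rewards = model(states, actions)  # (batch, k)
    discounts = gamma ** torch.arange(rewards.shape[-1], device=rewards.device,
                                      dtype=rewards.dtype)
    return (rewards * discounts).sum(dim=-1)

def normalize_returns(returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Equation (1): min-max normalize R_hat over the current rated dataset X to [0, 1]."""
    r_min, r_max = returns.min(), returns.max()
    return (returns - r_min) / (r_max - r_min + eps)
```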
### Novel Rating-Based Cross-Entropy Loss Function

To construct a new (cross-entropy) loss function that can take multi-class human ratings as the input, we need to estimate the human's rating class predictions. In addition, the estimated rating class probabilities should belong to the interval $[0, 1]$ for the cross-entropy computation. We here propose a new multi-class cross-entropy loss given by

$$-\sum_{\sigma \in X} \left( \sum_{i=0}^{n-1} \mu_\sigma(i) \log Q_\sigma(i) \right), \tag{2}$$

where $X$ is the collected dataset of user-labeled segments, $\mu_\sigma(i)$ is an indicator that equals 1 when the user assigns rating $i$ to trajectory segment $\sigma$, and $Q_\sigma(i) \in [0, 1]$ is the estimated probability that the human assigns the segment $\sigma$ to the $i$th rating class. Next, we will model the probabilities $Q_\sigma(i)$ of the human choosing each rating class. Notably, we do this without comparing the segment $\sigma$ to other segments.

### Modeling Human Rating Probabilities

We next describe our model for $Q_\sigma(i)$ based on the normalized predicted returns $\bar{R}(\sigma)$. To model the probability that $\sigma$ belongs to a particular class, we will first model separations between the rating classes in reward space. We define rating class boundaries $\bar{R}_0, \bar{R}_1, \ldots, \bar{R}_n$ in the space of normalized trajectory returns such that $0 =: \bar{R}_0 \leq \bar{R}_1 \leq \cdots \leq \bar{R}_n := 1$. Then, if a segment $\sigma$ has normalized predicted return $\bar{R}(\sigma)$ such that $\bar{R}_i \leq \bar{R}(\sigma) \leq \bar{R}_{i+1}$, we wish to model that $\sigma$ belongs to rating class $i$ with the highest probability. For example, when the total number of rating classes is $n = 4$, we aim to model the lower and upper return bounds for rating classes 0, 1, 2, and 3, which, for instance, could respectively correspond to "very bad", "bad", "good", and "very good". In this case, if $0 \leq \bar{R}(\sigma) < \bar{R}_1$, then we would like our model to predict that $\sigma$ most likely belongs to class 0 ("very bad"), while if $\bar{R}_2 \leq \bar{R}(\sigma) < \bar{R}_3$, then our model should predict that $\sigma$ most likely belongs to class 2 ("good").

Given the rating category separations $\bar{R}_i$, we model $Q_\sigma(i)$ as a function of the normalized predicted returns $\bar{R}(\sigma)$:

$$Q_\sigma(i) = \frac{e^{-k(\bar{R}(\sigma) - \bar{R}_i)(\bar{R}(\sigma) - \bar{R}_{i+1})}}{\sum_{j=0}^{n-1} e^{-k(\bar{R}(\sigma) - \bar{R}_j)(\bar{R}(\sigma) - \bar{R}_{j+1})}}, \tag{3}$$

where $k$ is a hyperparameter modeling human label noisiness, and the denominator ensures that $\sum_{i=0}^{n-1} Q_\sigma(i) = 1$, i.e., that the class probabilities sum to 1. To gain intuition for Equation (3), note that when $\bar{R}(\sigma) \in (\bar{R}_i, \bar{R}_{i+1})$, such that the predicted return falls within rating class $i$'s predicted boundaries, then $(\bar{R}(\sigma) - \bar{R}_i)(\bar{R}(\sigma) - \bar{R}_{i+1}) \leq 0$ while $(\bar{R}(\sigma) - \bar{R}_j)(\bar{R}(\sigma) - \bar{R}_{j+1}) \geq 0$ for all $j \neq i$. This means that $Q_\sigma(i) \geq Q_\sigma(j)$ for all $j \neq i$, so that the model assigns category $i$ the highest class probability, as desired. Furthermore, we note that $Q_\sigma(i)$ is maximized when $\bar{R}(\sigma) = \frac{1}{2}(\bar{R}_i + \bar{R}_{i+1})$, such that the predicted return falls directly in the center of category $i$'s predicted range. As $\bar{R}(\sigma)$ moves increasingly further from $\frac{1}{2}(\bar{R}_i + \bar{R}_{i+1})$, the modeled probability $Q_\sigma(i)$ of class $i$ monotonically decreases. We next show how to compute the class boundaries $\bar{R}_i$, $i = 1, \ldots, n-1$.

### Modeling Boundaries between Rating Categories

Next, we discuss how to model the boundaries between rating categories, $0 =: \bar{R}_0 \leq \bar{R}_1 \leq \cdots \leq \bar{R}_n := 1$. This requires selecting the range, i.e., the upper and lower bounds of $\bar{R}$, corresponding to each rating class. We determine these boundary values based on the distribution of $\bar{R}(\sigma)$ for the trajectory segments $\sigma \in X$ and the number of observed samples in $X$ from each rating class. We select the $\bar{R}_i$ values such that the number of training data samples that the model assigns to each modeled rating class matches the number of samples in $X$ that the human assigned to that rating class. Note that this does not require the predicted ratings based on $\bar{R}(\sigma)$ to match the human ratings for $\sigma$ in the training data $X$, but ensures that the proportion of training segments that the model assigns to each rating class matches the proportion that the human assigned to that class.

This matching in rating class proportions is desirable for learning an appropriate reward function based on human preference, since different humans could give ratings in significantly different proportions depending on their preferences and latent reward functions, as modeled by $\hat{R}$. To define each $\bar{R}_i$ so that the number of samples in each modeled rating category reflects the numbers of ratings in the human data, we first sort the estimated returns $\bar{R}(\sigma)$ for all $\sigma \in X$ from lowest to highest, and label these sorted estimates as $\bar{R}^{(1)} \leq \bar{R}^{(2)} \leq \cdots \leq \bar{R}^{(l)}$, where $l$ is the cardinality of $X$. Denoting by $k_j$ the number of segments that the human assigned to rating class $j$, $j \in \{0, \ldots, n-1\}$, we can then model each category boundary $\bar{R}_i$, $i \notin \{0, n\}$ (since $\bar{R}_0 := 0$ and $\bar{R}_n := 1$ by definition), as follows:

$$\bar{R}_i = \frac{\bar{R}^{(k^{\mathrm{cum}}_{i-1})} + \bar{R}^{(1 + k^{\mathrm{cum}}_{i-1})}}{2}, \quad i \in \{1, 2, \ldots, n-1\}, \tag{4}$$

where $k^{\mathrm{cum}}_i := \sum_{j=0}^{i} k_j$ is the total number of segments that the human assigned to any rating category $j \leq i$. When the user has not assigned any ratings within a particular category, i.e., $k_i = 0$ for some $i$, then we define the upper bound for category $i$ as $\bar{R}_{i+1} := \bar{R}_i$. This definition guarantees that when all normalized return predictions $\bar{R}(\sigma)$, $\sigma \in X$, are distinct, our model places $k_0$ segments within the interval $[\bar{R}_0, \bar{R}_1)$, $k_i$ segments within each interval $(\bar{R}_i, \bar{R}_{i+1})$ for $1 \leq i \leq n-2$, and $k_{n-1}$ segments in $(\bar{R}_{n-1}, \bar{R}_n]$, and thus predicts that $k_i$ segments most likely have rating $i$.
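The sketch below puts Equations (2)–(4) together: it computes data-dependent class boundaries as in Equation (4), the class probabilities $Q_\sigma(i)$ as in Equation (3), and the multi-class cross-entropy loss of Equation (2), assuming normalized returns such as those produced by `normalize_returns` above. The default value of $k$, the use of a batch mean instead of a sum, and treating the boundaries as constants (detached from the gradient) are our implementation choices, not prescriptions from the paper.

```python
# A sketch of Equations (2)-(4): data-dependent class boundaries, the rating
# probabilities Q_sigma(i), and the multi-class cross-entropy loss. Batching,
# the default k, and detaching the boundaries are assumptions, not paper details.
import torch

def rating_boundaries(norm_returns: torch.Tensor, labels: torch.Tensor,
                      n_classes: int) -> torch.Tensor:
    """Equation (4): choose 0 = R_0 <= R_1 <= ... <= R_n = 1 so that the number of
    sorted training returns falling in each bin matches the human's per-class counts.

    norm_returns: (l,) normalized returns R_bar(sigma); labels: (l,) int64 ratings.
    """
    sorted_r = torch.sort(norm_returns.detach())[0]        # R^(1) <= ... <= R^(l)
    counts = torch.bincount(labels, minlength=n_classes)   # k_j for each class j
    kcum = torch.cumsum(counts, dim=0)                     # k_cum_i = k_0 + ... + k_i
    bounds = [0.0]                                         # R_0 := 0
    for i in range(1, n_classes):                          # interior boundaries
        m = int(kcum[i - 1])                               # k_cum_{i-1}
        if m <= 0:
            bounds.append(bounds[-1])                      # empty lower classes
        elif m >= len(sorted_r):
            bounds.append(1.0)                             # empty upper classes
        else:
            bounds.append(0.5 * float(sorted_r[m - 1] + sorted_r[m]))  # midpoint, Eq. (4)
    bounds.append(1.0)                                     # R_n := 1
    return torch.tensor(bounds, dtype=norm_returns.dtype, device=norm_returns.device)

def rating_log_probs(norm_returns: torch.Tensor, bounds: torch.Tensor,
                     k: float = 10.0) -> torch.Tensor:
    """Equation (3): softmax over -k * (R_bar - R_i) * (R_bar - R_{i+1}) across classes."""
    r = norm_returns.unsqueeze(-1)                         # (batch, 1)
    lower, upper = bounds[:-1], bounds[1:]                 # (n,), (n,)
    logits = -k * (r - lower) * (r - upper)                # (batch, n)
    return torch.log_softmax(logits, dim=-1)

def rating_loss(norm_returns: torch.Tensor, labels: torch.Tensor,
                bounds: torch.Tensor, k: float = 10.0) -> torch.Tensor:
    """Equation (2): cross-entropy between one-hot human ratings mu and Q_sigma
    (averaged over the batch here; the paper writes a sum over the dataset)."""
    logp = rating_log_probs(norm_returns, bounds, k)
    return -logp.gather(-1, labels.unsqueeze(-1)).mean()
```

In a training loop, one might recompute the boundaries for each batch of rated segments and then backpropagate `rating_loss` into the reward network through the normalized returns.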
## Synthetic Experiments

### Setup

We conduct synthetic experiments based on the setup in Lee et al. (2021) to evaluate RbRL relative to the PbRL baseline (Lee et al. 2021). The code can be found at https://rb.gy/tdpc4y. The goal is to learn to perform a task by obtaining feedback from a teacher, in this case a synthetic human. For the PbRL baseline, we generate synthetic feedback such that in each queried pair of segments, the segment with the higher ground-truth cumulative reward is preferred. In contrast to the synthetic preferences between sample pairs in Lee et al. (2021), RbRL was given synthetic ratings generated for individual samples, where these ratings were obtained by comparing the sample's ground-truth return to the ground-truth rating class boundaries. For simplicity, we selected these ground-truth rating class boundaries so that rating classes are evenly spaced in reward space.

For the synthetic PbRL experiments, we selected preference queries using the ensemble disagreement approach in Lee et al. (2021). We extend this method to select rating queries for the synthetic RbRL experiments, designing an ensemble-based approach as in Lee et al. (2021) to select trajectory segments for which to obtain synthetic ratings. First, we train a reward predictor ensemble and obtain the predicted reward for every candidate segment and ensemble member. We then select the segment with the largest standard deviation over the ensemble to receive a rating label. We study the Walker and Quadruped tasks in Lee et al. (2021), with 1000 and 2000 synthetic queries, respectively. For all synthetic experiments, the reward network parameters are optimized to minimize the cross-entropy loss (2) based on the respective batch of data via the computation of (3). We use the same neural network structures for both the reward predictor and control policy and the same hyperparameters as in Lee et al. (2021).
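Below is a sketch of the two synthetic-feedback ingredients just described: binning a segment's ground-truth return into evenly spaced rating classes, and selecting the next rating query as the candidate segment with the largest disagreement (standard deviation of predicted return) across a reward-model ensemble. The function names, the scalar/tensor interfaces, and the representation of the ensemble as a list of return predictors are our own assumptions.

```python
# A sketch of the synthetic rating generator (evenly spaced ground-truth class
# boundaries) and of disagreement-based query selection; interfaces are assumptions.
import numpy as np
import torch

def synthetic_rating(gt_return: float, r_min: float, r_max: float, n_classes: int) -> int:
    """Bin a segment's ground-truth return into one of n evenly spaced rating classes
    over [r_min, r_max], the achievable return range for a segment."""
    edges = np.linspace(r_min, r_max, n_classes + 1)
    # Digitizing against the interior edges gives a class index in {0, ..., n_classes - 1}.
    return int(np.clip(np.digitize(gt_return, edges[1:-1]), 0, n_classes - 1))

def select_query_by_disagreement(candidate_segments, return_predictors) -> int:
    """Return the index of the candidate segment whose predicted return varies most
    across the ensemble (largest standard deviation), in the spirit of the
    disagreement sampling in Lee et al. (2021).

    candidate_segments: list of (states, actions) tensors, each shaped (1, k, dim).
    return_predictors: list of callables mapping (states, actions) -> predicted return,
                       e.g. lambdas wrapping segment_return() from the earlier sketch.
    """
    stds = []
    with torch.no_grad():
        for states, actions in candidate_segments:
            preds = np.array([float(f(states, actions)) for f in return_predictors])
            stds.append(preds.std())
    return int(np.argmax(stds))
```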
Figure 1: Performance of RbRL in synthetic experiments for different $n$, compared to PbRL: mean reward ± standard error over 10 runs for Walker (top) and Quadruped (bottom).

### Results

Figure 1 shows the performance of RbRL for different numbers of rating classes (i.e., values of $n$) and PbRL for two environments from Lee et al. (2021): Walker and Quadruped. We observe that a higher number of rating classes yields better performance for Walker. In addition, RbRL with $n = 5, 6$ outperforms PbRL. However, for Quadruped, while RbRL with $n = 2, 3$ still outperforms PbRL, a higher number of rating classes decreases performance; this decrease may be caused by the selection of the rating class boundaries used to generate the synthetic feedback. The results indicate that RbRL is effective and can provide better performance than PbRL even if synthetic rating feedback is generated using reward thresholds that are evenly distributed, without further optimization of their selection. We expect further optimization of the boundaries used to generate synthetic feedback to yield improved performance. For our experiments, we defined the rating boundaries by finding the maximum possible reward range for a segment and evenly dividing it by the number of rating classes.

## Human Experiments

### Setup

We conduct all human experiments by following a setup similar to Christiano et al. (2017). In particular, our tests were approved by the UTSA IRB Office, including proper steps to ensure privacy and informed consent of all participants. The goal is to learn to perform a given task by obtaining feedback from a teacher, in this case a human. Different from PbRL in Christiano et al. (2017), which asks humans to provide their preferences between sample pairs, typically in the form of short video segments, RbRL asks humans to evaluate individual samples, also in the form of short video segments, and provide their ratings, e.g., "segment performance is good" or "segment performance is bad". For all human experiments, we trained a reward predictor by minimizing the cross-entropy loss (2) based on the respective batch of data via the computation of (3). We used the same neural network structures for both the reward predictor and control policy and the same hyperparameters as in Christiano et al. (2017).

### RbRL with Different Numbers of Rating Classes

To evaluate the impact of the number of rating classes $n$ on RbRL's performance, we first conduct tests in which a human expert (an author on the study) provides ratings with $n = 2, \ldots, 8$ in the Cheetah MuJoCo environment. In particular, three experiment runs were conducted for each $n \in \{2, 3, \ldots, 8\}$. Figure 2 shows the performance of RbRL for each $n$, where the solid bar represents the mean performance and the vertical line represents the standard error over 3 experiment runs. It can be observed that RbRL performs better for $n \in \{3, 4, \ldots, 7\}$ than for $n \in \{2, 8\}$, indicating that allowing more rating classes is typically beneficial. However, an overly large number of rating classes $n$ will lead to difficulties and inaccuracies in the human ratings, and hence $n$ must be set to a reasonable value. Indeed, for smaller $n$, one can more intuitively assign physical meanings to each rating class, whereas for overly large $n$, it becomes difficult to assign such physical meanings, and hence it will be more challenging for users to provide consistent ratings.
Figure 2: RbRL performance for different $n$ in a human experiment: performance in the Cheetah environment (mean ± standard error over 3 experiment runs).

### RbRL Human User Study

To evaluate the effectiveness of RbRL for non-expert users, we conducted an IRB-approved human user study. We conducted tests on 3 of the OpenAI Gym MuJoCo environments also used in Christiano et al. (2017): Swimmer, Hopper, and Cheetah. A total of 20 participants were recruited (7 for Cheetah, 7 for Swimmer, and 6 for Hopper). In our experiments, we provided a single 1-to-2-second video segment to query users for each rating, while we provided pairs of 1-to-2-second videos to obtain human pairwise preferences. For Cheetah, the goal is to move the agent to the right as fast as possible; this is the same goal encoded in the default hand-crafted environment reward. Similarly, the goal for Swimmer matches that of the default hand-crafted environment reward. However, for Hopper, we instructed users to teach the agent to perform a backflip, which differs from the goal encoded by the default hand-crafted environment reward. We chose to study the backflip task to see how well RbRL can learn new behaviors for which a reward is unknown. Thus, the performance of Cheetah and Swimmer can be evaluated via the hand-crafted environment rewards, while the Hopper task cannot be evaluated via its hand-crafted environment reward. For Hopper, the performance of RbRL will be evaluated by examining the agent's behavior when running the learned policies from RbRL.

During the user study, each participant performed two tests, one for RbRL and one for PbRL, in one of the three MuJoCo environments, both for $n = 2$ rating classes. To eliminate potential bias, we assigned each participant a randomized order in which to perform the PbRL and RbRL experiment runs. Because the participants had no prior knowledge of the MuJoCo environments tested, we provided sample videos to show desired and undesired behaviors so that the participants could better understand the task. Upon request, the participants could also conduct mock tests before we initiated human data collection. For each experiment run, the participant was given 30 minutes to give rating/preference labels. Once finished, the participant filled out a questionnaire about the tested algorithm. The participant was then given a 10-minute break before conducting the second test and completing the questionnaire about the other algorithm. Afterwards, the participant completed a questionnaire comparing the two algorithms. Policy and reward learning occurred during the 30 minutes in which the user answered queries, and then continued after the human stepped away until code execution reached 4 million environment timesteps.

#### Performance

Figure 3 shows the performance of PbRL and RbRL across the seven participants for the Cheetah and Swimmer tasks. We see that RbRL performs similarly to or better than PbRL. In particular, RbRL can learn quickly in both cases, as evidenced by the fast reward growth early during learning. Figure 3 additionally displays results when an expert (an author on the study) provided ratings and preferences for Cheetah and Swimmer. For consistency, the same expert tested PbRL and RbRL in each environment. We observe that for the expert trials, RbRL performs consistently better than PbRL given the same human time.
These results suggest that RbRL can outperform PbRL regardless of the user's environment domain knowledge. It can also be observed that the RbRL and PbRL trials with expert users outperform the trials in which feedback is given by non-experts.

Figure 3: Performance of RbRL and PbRL in the human user study: Cheetah (top) and Swimmer (bottom). For non-expert users, the plots show mean ± standard error over 7 users. The expert results are each over a single experiment run.

Although RbRL performs similarly to PbRL in the Cheetah task, we observed that some participants performed very poorly in this environment, perhaps due to a lack of understanding of the task. By analyzing the raw data of all participants for Cheetah and Swimmer, we can see that one of the trials for Cheetah (RbRL) performs very poorly (with the final reward less than 10). For all other tests, including both PbRL and RbRL, the final reward is in positive territory, usually more than 20. Hence, it may be more meaningful to evaluate the mean results for individuals who perform reasonably. Figure 4 shows the mean reward for the top 3 non-expert users at different iterations for Cheetah and Swimmer. It can be observed that RbRL consistently outperforms PbRL and learns the goal faster than PbRL. To compare PbRL and RbRL in the Hopper backflip task, we ran the learned policies for the 6 participants to generate videos. Videos for the best learned policies from PbRL and RbRL can be found at rb.gy/nt1qm6, and indicate that (1) both RbRL and PbRL can learn the backflip, and (2) the backflip learned via RbRL fits better with our understanding of a desired backflip.

Figure 4: RbRL and PbRL performance for the top 3 (non-expert) user study participants: mean reward ± standard error over the 3 experiment runs each for Cheetah (top) and Swimmer (bottom).

#### User Questionnaire Results

The previous results in this section focus on evaluating the performance of RbRL and PbRL via the ground-truth environment reward (Cheetah and Swimmer) and the learned behavior (Hopper). To understand how the non-expert users view their experience of giving ratings and preferences, we conducted a post-experiment user questionnaire. The questionnaire asked users for feedback about their experience supervising PbRL and RbRL and to compare PbRL and RbRL. Figure 5 displays the normalized survey results from the 20 user study participants. In particular, the top subfigure of Figure 5 shows the participants' responses with respect to their separate opinions about PbRL and RbRL. These responses suggest that PbRL is more demanding and difficult than RbRL, leading users to feel more insecure and discouraged than when using RbRL. The bottom subfigure of Figure 5 shows the survey responses when users were asked to compare PbRL and RbRL; these results confirm the above findings and also show that users perceive themselves as completing the task more successfully when providing ratings (RbRL). One interesting observation is that the participants prefer RbRL and PbRL equally, which differs from the other findings. However, one participant stated that he/she preferred PbRL because PbRL is more challenging, which is counter-intuitive.
This suggests that liking one algorithm more than the other is a very subjective concept, making the responses for this question less informative than those for the other survey questions.

Figure 5: Participants' responses to survey questions about RbRL and PbRL (survey items include mental demand, success, difficulty, and frustration). The blue bar indicates the median and the edges depict the 1st quartile (left) and 3rd quartile (right).

#### Human Time

We also conducted a quantitative analysis of human time effectiveness when humans were asked to give ratings and preferences. Figure 6 shows the average number of human queries provided in 30 minutes for Cheetah, Swimmer, Hopper, and for all three environments combined. It can be observed that the participants can provide more ratings than pairwise preferences in all environments, indicating that it is easier and more efficient to provide ratings than to provide pairwise preferences. On average, participants can provide approximately 14.03 ratings per minute, while they provide only 8.7 preferences per minute, which means that providing a preference requires about 62% more time than providing a rating (14.03/8.7 ≈ 1.6). For Cheetah, providing a preference requires over 100% more time than providing a rating, which is mainly due to the need to compare video pairs that are very similar. For Swimmer and Hopper, the environments and goals are somewhat more complicated. Hence, providing ratings can be slightly more challenging, but is still easier than providing pairwise preferences.

Figure 6: Number of queries provided in 30 minutes in our human user study (mean ± standard error).

## Discussion and Open Challenges

One key difference between PbRL and RbRL is the value of the acquired human data. Because ratings in RbRL are not relative, they have the potential to provide more global value than preferences, especially when queries are not carefully selected. For environments with large state-action spaces, ratings can provide more value for reward learning. One limitation of rating feedback is that the number of data samples in different rating classes can be very different, leading to imbalanced datasets. Reward learning in RbRL can be negatively impacted by this data imbalance issue (although our experiments still show the benefits of RbRL over PbRL). Hence, on-policy training with a large number of training steps may not help reward learning in RbRL because the collected human rating data can become very unbalanced. We expect that addressing the data imbalance issue would further improve RbRL performance. One challenge for RbRL is that ratings may not be given consistently during learning, especially considering users' attention spans and fatigue levels over time. Future work includes developing mechanisms to quantify users' consistency levels, the impact of user inconsistency, or solutions to user inconsistency. Another potential limitation of RbRL is that it learns a less refined reward function than PbRL because RbRL does not seek to distinguish between samples from the same rating class.
Hence, future work could integrate RbRL and PbRL to create a multi-phase learning strategy, where RbRL provides fast initial global learning while PbRL further refines performance via local queries based on sample pairs.

One open challenge is the lack of effective human interfaces in existing code bases. For example, in Lee et al. (2021), only synthetic human feedback is available. Although a human interface is available for the algorithm in Christiano et al. (2017), its use of Google Cloud makes it difficult to set up and operate efficiently. One of our future goals is to address this challenge by developing an effective human interface for reinforcement learning from human feedback, including preferences, ratings, and their variants.

Some detailed information was omitted in the paper due to space limitations. Please refer to https://arxiv.org/pdf/2307.16348.pdf for more details.

## Acknowledgements

The authors were supported in part by the Army Research Lab under grant W911NF2120232, the Army Research Office under grant W911NF2110103, and the Office of Naval Research under grant N000142212474. We thank Feng Tao, Van Ngo, and Gabriella Forbis for their helpful feedback, code, and tests.

## References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Bıyık, E.; Losey, D. P.; Palan, M.; Landolfi, N. C.; Shevchuk, G.; and Sadigh, D. 2022. Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences. The International Journal of Robotics Research, 41(1): 45–67.

Bıyık, E.; Palan, M.; Landolfi, N. C.; Losey, D. P.; Sadigh, D.; et al. 2020. Asking Easy Questions: A User-Friendly Approach to Active Reward Learning. In Conference on Robot Learning, 1177–1190.

Bıyık, E.; Talati, A.; and Sadigh, D. 2022. APReL: A library for active preference-based reward learning algorithms. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), 613–617.

Brown, D.; Coleman, R.; Srinivasan, R.; and Niekum, S. 2020. Safe imitation learning via fast Bayesian reward inference from preferences. In International Conference on Machine Learning, 1165–1177.

Brown, D.; Goo, W.; Nagarajan, P.; and Niekum, S. 2019. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In Proceedings of the International Conference on Machine Learning, 783–792.

Brown, D. S.; Goo, W.; and Niekum, S. 2020. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning, 330–359.

Celemin, C.; and Ruiz-del-Solar, J. 2015. COACH: Learning continuous actions from corrective advice communicated by humans. In 2015 International Conference on Advanced Robotics (ICAR), 581–586.

Choi, J.; and Kim, K.-E. 2011. MAP inference for Bayesian inverse reinforcement learning. Advances in Neural Information Processing Systems, 24.

Choi, J.; and Kim, K.-E. 2012. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. Advances in Neural Information Processing Systems, 25.

Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the International Conference on Machine Learning, 49–58.

Fu, J.; Luo, K.; and Levine, S. 2018. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. In International Conference on Learning Representations.

Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, 1861–1870.

Ibarz, B.; Leike, J.; Pohlen, T.; Irving, G.; Legg, S.; and Amodei, D. 2018. Reward learning from human preferences and demonstrations in Atari. Advances in Neural Information Processing Systems, 31.

Knox, W. B.; and Stone, P. 2009. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, 9–16.

Lee, K.; Smith, L.; Dragan, A.; and Abbeel, P. 2021. B-Pref: Benchmarking Preference-Based Reinforcement Learning. Advances in Neural Information Processing Systems.

Lee, K.; Smith, L. M.; and Abbeel, P. 2021. PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training. In International Conference on Machine Learning, 6152–6163.

Levine, S.; Popovic, Z.; and Koltun, V. 2011. Nonlinear inverse reinforcement learning with Gaussian processes. Advances in Neural Information Processing Systems, 24.

Li, Y. 2019. Reinforcement learning applications. arXiv preprint arXiv:1908.06973.

Liang, X.; Shu, K.; Lee, K.; and Abbeel, P. 2022. Reward Uncertainty for Exploration in Preference-based Reinforcement Learning. In International Conference on Learning Representations.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. International Conference on Learning Representations.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533.

Ng, A. Y.; Russell, S. J.; et al. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2.

Park, J.; Seo, Y.; Shin, J.; Lee, H.; Abbeel, P.; and Lee, K. 2022. SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning. In International Conference on Learning Representations.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484.

Szot, A.; Zhang, A.; Batra, D.; Kira, Z.; and Meier, F. 2022. BC-IRL: Learning Generalizable Reward Functions from Demonstrations. In The Eleventh International Conference on Learning Representations.
Warnell, G.; Waytowich, N.; Lawhern, V.; and Stone, P. 2018. Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Wirth, C.; Akrour, R.; Neumann, G.; Fürnkranz, J.; et al. 2017. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136): 1–46.

Wulfmeier, M.; Ondruska, P.; and Posner, I. 2015. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888.

Xu, Y.; Wang, R.; Yang, L.; Singh, A.; and Dubrawski, A. 2020. Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33: 18784–18794.

Zhan, H.; Tao, F.; and Cao, Y. 2021. Human-guided Robot Behavior Learning: A GAN-assisted Preference-based Reinforcement Learning Approach. IEEE Robotics and Automation Letters, 6(2): 3545–3552.

Ziebart, B. D.; Maas, A.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence.