Preference-based Reinforcement Learning with Finite-Time Guarantees

Yichong Xu¹, Ruosong Wang², Lin F. Yang³, Aarti Singh², Artur Dubrawski²
¹Microsoft, ²Carnegie Mellon University, ³University of California, Los Angeles
yicxu@microsoft.com, {ruosongw,aarti,awd}@cs.cmu.edu, linyang@ee.ucla.edu

*Work done while at Carnegie Mellon University.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Preference-based Reinforcement Learning (PbRL) replaces reward values in traditional reinforcement learning with preferences, to better elicit human opinion on the target objective, especially when numerical reward values are hard to design or interpret. Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy. In this paper, we present the first finite-time analysis for general PbRL problems. We first show that a unique optimal policy may not exist if preferences over trajectories are deterministic. If preferences are stochastic, and the preference probability relates to the hidden reward values, we present algorithms for PbRL, both with and without a simulator, that are able to identify the best policy up to accuracy ε with high probability. Our method explores the state space by navigating to under-explored states, and solves PbRL using a combination of dueling bandits and policy search. Experiments show the efficacy of our method when it is applied to real-world problems.

1 Introduction

In reinforcement learning (RL), an agent typically interacts with an unknown environment to maximize the cumulative reward. It is often assumed that the agent has access to numerical reward values. In practice, however, reward functions might not be readily available or might be hard to design, and hand-crafted rewards can lead to undesired behaviors such as reward hacking [8, 1]. On the other hand, preference feedback is often straightforward to specify in many RL applications, especially those involving human evaluations. Such preferences help shape the reward function and avoid unexpected behaviors. Preference-based Reinforcement Learning (PbRL, [28]) is a framework for solving RL using preferences, and has been widely applied in multiple areas including robot teaching [20, 19, 11], game playing [29, 31], and clinical trials [36].

Despite its wide applicability, the theoretical understanding of PbRL is largely open. To the best of our knowledge, the only prior work with a provable theoretical guarantee is the recent work by Novoseller et al. [25]. They proposed the Double Posterior Sampling (DPS) method, which uses Bayesian linear regression to derive posteriors on the reward values and the transition distribution. Combined with Thompson sampling, DPS has an asymptotic regret rate sublinear in T (the number of time steps). However, this rate relies on the asymptotic convergence of the estimates of the reward and transition functions, whose complexity could be exponential in the time horizon H. Also, the Thompson sampling method in [25] can be very time-consuming, making the algorithm applicable only to MDPs with a few states. To fill this gap, we naturally ask the following question:

Is it possible to derive efficient algorithms for PbRL with finite-time guarantees?

While traditional value-based RL has been studied extensively, including recently [6, 35, 22], PbRL is much harder to solve than value-based RL.
Most efficient algorithms for value-based RL utilize the value function and the Bellman update, both of which are unavailable in PbRL: the reward values are hidden and are identifiable only up to shifts. Even in simple tabular settings, we cannot obtain unbiased estimates of the Q values, since any constant offset of the reward function results in the same preferences. Therefore traditional RL algorithms (such as Q-learning or value iteration) are generally not applicable to PbRL.

Our Contributions. We give an affirmative answer to our main question above, under general assumptions on the preference distribution. We study conditions under which PbRL can recover the optimal policy for an MDP. In particular, we show that when comparisons between trajectories are noiseless, there exists an MDP such that the induced preferences between policies are not transitive; i.e., there is no unique optimal policy in the preference sense (Proposition 1). Hence, we base our method and analysis on a general assumption on the preference probabilities (Assumption 1), which generalizes the linear link function assumption in [25]. We develop provably efficient algorithms to find ε-optimal policies for PbRL, with or without a simulator. Our method is based on a synthetic reward function, similar to recent literature on RL with rich observations [14, 24] and reward-free RL [23]. We combine this reward-free exploration with dueling bandit algorithms to perform policy search. To the best of our knowledge, this is the first PbRL algorithm with finite-time theoretical guarantees. Our method is general enough to incorporate many previous value-based RL algorithms and dueling bandit methods as subroutines. We test our algorithm against previous baselines in simulated environments. Our results show that our algorithm can beat previous baselines while being very simple to implement.

Related Work. We refer readers to [28] for a complete overview of PbRL. In the PbRL framework, there are several ways that one can obtain preferences. We can obtain preferences on i) trajectories, where the labeler tells which of two trajectories is more rewarding; ii) actions, where for a fixed state s, the labeler tells which action gives a better action-value function; or iii) states, where similarly, the labeler tells which state gives a better value function. In this paper, we mainly study preferences over trajectories, which is also the most prevalent PbRL scenario in the literature. Our method is potentially applicable to other forms of preferences as well.

PbRL is related to several settings in multi-armed bandits. Dueling bandits [33, 10] is essentially the one-state version of PbRL, and has been extensively studied in the literature [34, 17, 16, 32]. However, PbRL is significantly harder, because in PbRL the observation (preference) is based on the sum of rewards over a trajectory rather than on individual reward values. For the same reason, PbRL is also close to combinatorial bandits with full-bandit feedback [12, 2, 26]. Although lower bounds for these bandit problems extend to PbRL, developing PbRL algorithms is significantly harder, since we are not free to choose an arbitrary combination of state-action pairs.

2 Problem Setup

MDP Formulation. Consider a finite-horizon Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, H, r, p, s_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $H$ is the number of steps, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function², $p: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the (random) state transition function, and $s_0$ is the starting state. For simplicity we assume $S \ge H$.
We consider finite MDPs with $|\mathcal{S}| = S$ and $|\mathcal{A}| = A$. We start an episode from the initial state $s_0$ and take actions to obtain a trajectory $\tau = \{(s_h, a_h)\}_{h=0}^{H-1}$, following a policy $\pi: \mathcal{S} \to \mathcal{A}$. We slightly overload the notation and use $r(\tau) = \sum_{(s,a) \in \tau} r(s, a)$ to denote the total reward of $\tau$. For any policy $\pi$, let $\tau_h(\pi, s)$ be a (random) trajectory obtained by executing $\pi$ starting at state $s$ from step $h$ to the end. Following prior works [14, 24], we further assume that the state space can be partitioned into $H$ disjoint sets $\mathcal{S} = \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_H$, where $\mathcal{S}_h$ denotes the set of possible states at step $h$. Let $\Pi = \{\pi : \mathcal{S} \to \mathcal{A}\}$ be the set of policies³. We use $\Pi^H$ to denote the set of non-stationary policies; here a policy $\pi = (\pi_1, \ldots, \pi_H) \in \Pi^H$ executes $\pi_h$ in step $h$ for $h \in [H]$⁴. Also, let $\pi_{h_1:h_2}$ be the restriction of a policy $\pi \in \Pi^H$ to steps $h_1, h_1+1, \ldots, h_2$. We use the value function $v^\pi_{h_0}(s) = \mathbb{E}\left[\sum_{h=h_0}^{H-1} r(s_h, \pi(s_h)) \,\middle|\, \pi, s_{h_0} = s\right]$ to denote the expected reward of policy $\pi$ starting at state $s$ in step $h_0$; for simplicity let $v^\pi = v^\pi_0$. Let $\pi^* = \arg\max_{\pi \in \Pi^H} v^\pi_0(s_0)$ denote the optimal (non-stationary) policy. We assume that $r(s, a) \in [0, 1]$ for every $(s, a) \in \mathcal{S} \times \mathcal{A}$, and that $v^*(s) = v^{\pi^*}(s) \in [0, 1]$ for every state $s$.

²For simplicity, here we assume the reward function to be deterministic.
³We consider deterministic policies for simplicity, but our results carry over to random policies easily.
⁴We follow the standard notation $[H] = \{0, 1, \ldots, H-1\}$.

Preferences on Trajectories. In PbRL, the reward $r(s, a)$ is hidden and is not observable during the learning process, although we define the value function and the optimal policy based on $r$. Instead, the learning agent can query to compare two trajectories $\tau$ and $\tau'$, and obtains a (randomized) preference $\tau \succ \tau'$ or $\tau' \succ \tau$, based on $r(\tau)$ and $r(\tau')$. We also assume that we can compare partial trajectories: we can compare $\tau = \{(s_h, a_h)\}_{h=h_0}^{H-1}$ and $\tau' = \{(s'_h, a'_h)\}_{h=h_0}^{H-1}$ for any $h_0 \in [H]$. Let $\phi(\tau, \tau') = \Pr[\tau \succ \tau'] - 1/2$ denote the preference between $\tau$ and $\tau'$.
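To make the interaction protocol concrete, the following minimal Python sketch spells out one way to represent trajectories and the stochastic comparison oracle described above. All names, the `env.step` simulator call, and the reward-table representation are our own illustrative assumptions rather than constructs from the paper; the only property the setup requires is that the oracle's outcome distribution depends on the two total rewards $r(\tau)$ and $r(\tau')$.

```python
import random
from typing import Callable, Dict, List, Tuple

# Illustrative types: a trajectory tau is a list of (state, action) pairs.
State, Action = int, int
Trajectory = List[Tuple[State, Action]]

def total_reward(traj: Trajectory, reward: Dict[Tuple[State, Action], float]) -> float:
    """r(tau): sum of the hidden rewards r(s, a) along the trajectory.
    The learner never calls this directly; only the comparison oracle does."""
    return sum(reward[(s, a)] for s, a in traj)

def rollout(env, policy: Callable[[State, int], Action],
            start_state: State, start_step: int) -> Trajectory:
    """tau_h(pi, s): execute `policy` from `start_state` at step `start_step`
    to the horizon. `env.step(s, a, h)` is a hypothetical simulator call
    returning the next state; `env.horizon` is H."""
    traj, s = [], start_state
    for h in range(start_step, env.horizon):
        a = policy(s, h)
        traj.append((s, a))
        s = env.step(s, a, h)
    return traj

def make_comparison_oracle(reward: Dict[Tuple[State, Action], float],
                           link: Callable[[float], float]):
    """Build a stochastic preference oracle: prefer(tau1, tau2) is True with
    probability link(r(tau1) - r(tau2)). Any link with link(0) = 1/2 that
    increases with the reward gap fits this setup; a concrete choice (the
    Bradley-Terry-Luce link) appears in the experiments of Section 5."""
    def prefer(traj1: Trajectory, traj2: Trajectory) -> bool:
        gap = total_reward(traj1, reward) - total_reward(traj2, reward)
        return random.random() < link(gap)
    return prefer
```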
PAC learning and sample complexity. We consider PAC-style algorithms: an algorithm needs to output a policy $\hat{\pi}$ such that $v^{\hat{\pi}}(s_0) \ge v^*(s_0) - \epsilon$ with probability at least 1 − δ. In PbRL, comparisons are collected from human workers, while trajectories are obtained by interacting with the environment, so obtaining comparisons is often more expensive than exploring the state space. We therefore account separately for the sample complexity in terms of the number of total exploration steps and the number of comparisons. Formally, let $SC_s(\epsilon, \delta)$ be the number of exploration steps needed to obtain an ε-optimal policy with probability 1 − δ, and $SC_p$ the number of preferences (comparisons) needed in this process. We omit ε, δ from the sample complexities when they are clear from context.

Dueling Bandit Algorithms. Our proposed algorithms use a PAC dueling bandit algorithm as a subroutine to compare policies starting at the same state. Dueling bandits [33] have been well studied in the literature; examples of PAC dueling bandit algorithms include Beat-the-Mean [34], KNOCKOUT [17], and OPT-Maximize [16]. We formally define a dueling bandit algorithm below.

Definition 1 ((ε, δ)-correct Dueling Bandit Algorithm). Let ε > 0, δ > 0. $M$ is an (ε, δ)-correct PAC dueling bandit algorithm if, for any given set of arms $X$ with $|X| = K$: i) $M$ runs for at most $\Psi(\epsilon, \delta)\,\epsilon^{-\alpha}$ steps, where $\Psi(\epsilon, \delta) = \mathrm{poly}(K, \log(1/\epsilon), \log(1/\delta))$ and $\alpha \ge 1$; ii) in every step, $M$ proposes two arms $a, a'$ to compare; iii) upon completion, $M$ returns an arm $\hat{a}$ such that $\Pr[\hat{a} \succ a] \ge 1/2 - \epsilon$ for every arm $a \in X$, with probability at least 1 − δ.

One important feature of existing PAC dueling bandit algorithms is whether they require knowing ε before they start: algorithms like KNOCKOUT and OPT-Maximize [17, 16] cannot start without knowledge of ε, while Beat-the-Mean [34] does not need to know ε to start but can return an ε-optimal arm when given the correct budget. We write $M(X, \epsilon, \delta)$ for an algorithm with access to arm set $X$, accuracy ε, and success probability 1 − δ; we write $M(X, \delta)$ for a dueling bandit algorithm that is run without the accuracy ε.
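Since the rest of the paper treats $M$ as a black box satisfying Definition 1, it may help to fix an explicit interface for that black box. The sketch below is one possible such interface (the class and method names are ours, not the paper's); the later PPS and PEPS sketches interact with $M$ only through these four methods.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

class PACDuelingBandit(ABC):
    """One possible interface for an (epsilon, delta)-correct PAC dueling
    bandit algorithm M in the sense of Definition 1 (names are illustrative).

    The caller repeatedly asks which pair of arms to compare (`propose`),
    performs the comparison itself, and reports the winner (`update`). After
    at most Psi(epsilon, delta) * epsilon**(-alpha) comparisons, `finished()`
    becomes True and `recommend()` returns an arm that, with probability at
    least 1 - delta, loses to no other arm by more than epsilon."""

    def __init__(self, arms: List, epsilon: Optional[float], delta: float):
        # epsilon may be None for algorithms (e.g. Beat-the-Mean style) that
        # can run without a pre-set target accuracy, written M(X, delta) above.
        self.arms, self.epsilon, self.delta = list(arms), epsilon, delta

    @abstractmethod
    def propose(self) -> Tuple[int, int]:
        """Indices of the two arms to compare next."""

    @abstractmethod
    def update(self, i: int, j: int, i_won: bool) -> None:
        """Report the outcome of comparing arms i and j."""

    @abstractmethod
    def finished(self) -> bool:
        """True once the PAC stopping condition (or the budget) is reached."""

    @abstractmethod
    def recommend(self) -> int:
        """Index of the estimated near-optimal arm."""
```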
Results in the value-based RL and bandit settings. In traditional RL, where we can observe the reward value at every step, the minimax rate is still largely open [21]⁵. The upper bound in, e.g., [6] translates to a step complexity of $O\!\left(\frac{H^3 S A}{\epsilon^2}\right)$ to recover an ε-optimal policy, but due to the scaling of the rewards the lower bound of [13] translates to $\Omega\!\left(\frac{HSA}{\epsilon^2}\right)$. Very recently, Wang et al. [27] showed an upper bound of $O\!\left(\frac{H S^3 A^2}{\epsilon^3}\right)$, suggesting that the optimal $H$ dependence might be linear. It is straightforward to show that the lower bound in [13] translates to a step complexity of $\Omega\!\left(\frac{HSA}{\epsilon^2}\right)$ and a comparison complexity of $\Omega\!\left(\frac{SA}{\epsilon^2}\right)$ for PbRL. Lastly, we mention that the lower bounds for combinatorial bandits with full-bandit feedback [12] transfer to lower bounds of the same scale for PbRL.

⁵Note that some prior works assume $r(s, a) \in [0, 1/H]$ [6, 35], which is different from our setting.

2.1 Preference Probabilities

As in ranking and dueling bandits, a major question when using preferences is how to define the winner. One common assumption is the existence of a Condorcet winner: suppose there exists an item (in our case, a policy) that beats every other item with probability greater than 1/2. However, in PbRL, because we compare trajectories, preferences might not reflect the true relation between policies. For example, assume that the comparisons are perfect, i.e., $\tau_1 \succ \tau_2$ whenever $r(\tau_1) > r(\tau_2)$ and vice versa. Now suppose policy $\pi_1$ obtains a reward of 1 with probability 0.1 and 0 otherwise, while $\pi_2$ obtains a reward of 0.01 all the time. Then a trajectory from $\pi_1$ is preferred to a trajectory from $\pi_2$ only with probability 0.1, even though $v^{\pi_1} = 0.1 > 0.01 = v^{\pi_2}$. Extending this argument, we can show that non-transitive relations may exist between policies:

Proposition 1. Slightly overloading notation, for any $h \in [H]$ and $s \in \mathcal{S}_h$, let $\phi_s(\pi_1, \pi_2) = \Pr[\tau_h(\pi_1, s) \succ \tau_h(\pi_2, s)] - 1/2$ denote the preference between policies $\pi_1$ and $\pi_2$ when starting at $s$ in step $h$. Suppose comparisons are noiseless. Then there exist an MDP and policies $\pi_0, \pi_1, \pi_2$ such that for some state $s \in \mathcal{S}$, $\phi_s(\pi_0, \pi_1)$, $\phi_s(\pi_1, \pi_2)$, and $\phi_s(\pi_2, \pi_0)$ are all less than 0.

Proposition 1 shows that assumptions on the preference distribution over individual trajectories alone cannot lead to a unique solution for PbRL.⁶

⁶One might consider other winner notions when a Condorcet winner policy does not exist, e.g., the von Neumann winner [15]. However, finding such winners usually involves handling an exponential number of policies and is out of the scope of the current paper.
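To see how such cycles can arise, here is a small self-contained numerical illustration (our own intransitive-dice-style construction at the level of trajectory-reward distributions; it is not the MDP built in the proof of Proposition 1). Each policy's trajectory reward is drawn uniformly from three values, all three policies have the same expected reward, yet under noiseless comparisons each policy in the cycle beats the previous one with probability 5/9:

```python
from itertools import product

# Hypothetical trajectory-reward distributions for three policies; each listed
# value occurs with probability 1/3. The construction is ours, for illustration.
rewards = {
    "pi0": [1, 6, 8],
    "pi1": [2, 4, 9],
    "pi2": [3, 5, 7],
}

def phi(p: str, q: str) -> float:
    """phi_s(p, q) = Pr[a trajectory of p beats a trajectory of q] - 1/2,
    under noiseless comparisons (the higher total reward always wins)."""
    wins = sum(rp > rq for rp, rq in product(rewards[p], rewards[q]))
    return wins / (len(rewards[p]) * len(rewards[q])) - 0.5

# All three quantities below equal -1/18: pi1 beats pi0, pi2 beats pi1, and
# pi0 beats pi2, so there is a preference cycle and no Condorcet winner.
for p, q in [("pi0", "pi1"), ("pi1", "pi2"), ("pi2", "pi0")]:
    print(f"phi({p}, {q}) = {phi(p, q):+.3f}")
```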
Instead, since our target is an optimal policy, we make our assumption directly on the preferences between policies:

Assumption 1. There exists a universal constant $C_0 > 0$ such that for any policies $\pi_1, \pi_2$ and any state $s \in \mathcal{S}$ with $v^{\pi_1}(s) - v^{\pi_2}(s) > 0$, we have $\phi_s(\pi_1, \pi_2) \ge C_0\left(v^{\pi_1}(s) - v^{\pi_2}(s)\right)$.

That is, the preference probability between two policies is lower bounded in terms of their value function difference. Recall that $\phi_s(\pi_1, \pi_2)$ is defined through the probability that a random trajectory from $\pi_1$ beats one from $\pi_2$; we do not make any assumption on the comparison outcome for any individual pair of trajectories. Assumption 1 ensures that a unique Condorcet winner (which is also the reward-maximizing policy) exists. Note that Assumption 1 also holds under many comparison models for $\phi_s$, such as the BTL or Thurstone model. Following previous literature on dueling bandits [34, 17], we also define the following properties of preferences:

Definition 2. We say the preferences satisfy the following properties if the corresponding inequality holds for any $h$, any $s \in \mathcal{S}_h$, and any three policies $\pi_1, \pi_2, \pi_3$ with $v^{\pi_1}_h(s) > v^{\pi_2}_h(s) > v^{\pi_3}_h(s)$:
Strong Stochastic Transitivity (SST): $\phi_s(\pi_1, \pi_3) \ge \max\{\phi_s(\pi_1, \pi_2), \phi_s(\pi_2, \pi_3)\}$;
Stochastic Triangle Inequality (STI): $\phi_s(\pi_1, \pi_3) \le \phi_s(\pi_1, \pi_2) + \phi_s(\pi_2, \pi_3)$.

These properties are not essential for our algorithms, but they are required by some of the dueling bandit algorithms that we use as subroutines. To relate policy preferences to reward preferences, we show the next proposition covering several special cases:

Proposition 2. If either of the following holds, then the preferences satisfy Assumption 1, SST, and STI: i) there exists a constant $C$ such that for every pair of trajectories $\tau, \tau'$ we have $\phi(\tau, \tau') = C\,(r(\tau) - r(\tau'))$; ii) the transitions are deterministic, and $\phi(\tau, \tau') = c$ whenever $r(\tau) > r(\tau')$, for some constant $c \in (0, 1/2]$.

The first condition in Proposition 2 is the same as the assumption in [25], so our Assumption 1 is a generalization of theirs. We also note that although we focus on preferences over trajectories, preferences between policies, as in Assumption 1, are also used in practice [18].

3 PbRL with a Simulator

In this section, we assume access to a simulator (generator): for any state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$, we can execute $a$ at $s$ and obtain a next state $s' \sim p(\cdot \mid s, a)$. We first introduce our algorithm and then present the theoretical results; we also show, via a lower bound, that our comparison complexity is almost optimal. Although value-based RL with a simulator has been well studied in the literature [3, 4], methods like value iteration do not easily extend to PbRL, because the reward values are hidden. Instead, we base our method on dynamic programming and policy search [7]: by running a dueling bandit algorithm at each state, one can determine the corresponding optimal action. The resulting algorithm, Preference-based Policy Search (PPS), is presented in Algorithm 1.

Algorithm 1 PPS: Preference-based Policy Search
Require: dueling bandit algorithm $M$, dueling accuracy $\epsilon_1$, sampling number $N_0$, success probability δ
1: Initialize $\hat{\pi} \in \Pi^H$ randomly
2: for $h = H-1, \ldots, 0$ do
3:  for $s_h \in \mathcal{S}_h$ do
4:   Start an instance of $M(\mathcal{A}, \epsilon_1, \delta/S)$
5:   for $n = 1, 2, \ldots, N_0$ do
6:    Receive the next query $(a, a')$ from $M$
7:    Roll out $a \circ \hat{\pi}_{h+1:H}$ from $s_h$ to get trajectory $\tau$
8:    Roll out $a' \circ \hat{\pi}_{h+1:H}$ from $s_h$ to get trajectory $\tau'$
9:    Compare $\tau$ and $\tau'$ and return the result to $M$
10:   end for
11:   if $M$ has finished then
12:    Update $\hat{\pi}_h(s_h)$ to the optimal action according to $M$
13:   end if
14:  end for
15: end for
Ensure: policy $\hat{\pi}$

Proceeding from step $H-1$ down to step 0, PPS solves a dueling bandit problem at every state $s_h \in \mathcal{S}_h$: the arms are the actions $a \in \mathcal{A}$, and the reward of arm $a$ is given by the policy $a \circ \hat{\pi}_{h+1:H}$, where $\hat{\pi}$ is the estimated best policy after step $h+1$ and $\circ$ stands for concatenation of policies. By allocating an error of $O(\epsilon/H)$ to every state, we obtain the following guarantee:

Theorem 3. Suppose the preference distribution satisfies Assumption 1. Let $\epsilon_1 = \frac{C_0 \epsilon}{H}$ and $N_0 = \Psi(\epsilon_1, \delta/S)\,\epsilon_1^{-\alpha}$, where $C_0$ is defined in Assumption 1. Algorithm 1 returns an ε-optimal policy with probability 1 − δ using $O\!\left(\frac{H^{\alpha+1} S\, \Psi(A, \epsilon/H, \delta/S)}{\epsilon^{\alpha}}\right)$ simulator steps and $O\!\left(\frac{H^{\alpha} S\, \Psi(A, \epsilon/H, \delta/S)}{\epsilon^{\alpha}}\right)$ comparisons.

We can plug in existing algorithms to instantiate Theorem 3. For example, under SST, OPT-Maximize [16] achieves the minimax optimal rate for dueling bandits; in particular, it has a comparison complexity of $O\!\left(\frac{K \log(1/\delta)}{\epsilon^2}\right)$ for selecting an ε-optimal arm with probability 1 − δ among $K$ arms. Plugging in this rate, we obtain the following corollary:

Corollary 4. Suppose the preference distribution satisfies Assumption 1 and SST. Using OPT-Maximize as $M$, Algorithm 1 returns an ε-optimal policy with probability 1 − δ using $O\!\left(\frac{H^3 S A}{\epsilon^2}\log(S/\delta)\right)$ simulator steps and $O\!\left(\frac{H^2 S A}{\epsilon^2}\log(S/\delta)\right)$ comparisons.

The proof of Theorem 3 follows from the performance difference lemma, combined with the properties of $M$. Our result is similar to existing results for traditional RL with a simulator: for example, for infinite-horizon MDPs with discount factor γ, [5] shows a minimax rate of $O\!\left(\frac{SA}{\epsilon^2 (1-\gamma)^3}\right)$ on the step complexity. This matches the rate in Corollary 4 when taking $H = \frac{1}{1-\gamma}$ effective steps.
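A compact way to read Algorithm 1 is as backward induction in which the Bellman backup at each state is replaced by a dueling-bandit tournament over the actions. The Python sketch below follows that control flow using the illustrative interfaces from the earlier snippets (`rollout`, the preference oracle, and `PACDuelingBandit`); it is a sketch of the structure under those assumptions, not the implementation analyzed in the paper.

```python
def pps(env, prefer, make_dueler, eps1, delta, n_budget):
    """Sketch of PPS (Algorithm 1). Assumed interfaces (ours, not the paper's):
      env.horizon = H; env.actions: list of actions; env.states_at(h): states in S_h;
      prefer(tau, tau2) -> bool: the trajectory comparison oracle;
      make_dueler(arms, eps, delta) -> an (eps, delta)-PAC dueling bandit."""
    H = env.horizon
    # pi_hat[h] maps each state at step h to an action, initialized arbitrarily.
    pi_hat = [{s: env.actions[0] for s in env.states_at(h)} for h in range(H)]

    def tail_policy(first_action, h):
        """The concatenated policy a o pi_hat_{h+1:H}: play `first_action`
        at step h, then follow the current estimate pi_hat."""
        return lambda s, step: first_action if step == h else pi_hat[step][s]

    for h in range(H - 1, -1, -1):                 # backward over steps H-1, ..., 0
        for s in env.states_at(h):
            dueler = make_dueler(env.actions, eps1, delta)
            for _ in range(n_budget):              # comparison budget N_0 per state
                if dueler.finished():
                    break
                i, j = dueler.propose()
                tau_i = rollout(env, tail_policy(env.actions[i], h), s, h)
                tau_j = rollout(env, tail_policy(env.actions[j], h), s, h)
                dueler.update(i, j, prefer(tau_i, tau_j))
            if dueler.finished():                  # as in Steps 11-13 of Algorithm 1
                pi_hat[h][s] = env.actions[dueler.recommend()]
    return pi_hat
```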
4 Combining Exploration and Policy Search for General PbRL

In this section, we present our main result for PbRL without a simulator. RL without a simulator is a challenging problem even in the traditional value-based setting: we have to explore the state space efficiently to find the optimal policy. Existing works in value-based RL typically derive an upper bound on the Q-value function and apply the Bellman equation to improve the policy iteratively [6, 22, 35]. However, for PbRL, since the reward values are hidden, we cannot apply traditional value iteration and Q-learning methods. To get around this problem, we use a synthetic reward function to guide the exploration. We present our main algorithm in Section 4.1, along with its theoretical analysis, and discuss relations to prior work in Section 4.2.

4.1 Preference-based Exploration and Policy Search (PEPS)

We call our algorithm Preference-based Exploration and Policy Search (PEPS) and present it in Algorithm 2. As the name suggests, the algorithm combines exploration and policy search.

Algorithm 2 PEPS: Preference-based Exploration and Policy Search
Require: dueling bandit algorithm $M$, exploration budget $N_0$, success probability δ
1: Initialize $\hat{\pi} \in \Pi^H$ randomly
2: for $h = H, H-1, \ldots, 1$ do
3:  for $s_h \in \mathcal{S}_h$ do
4:   Let $r_{s_h}(s, a) = \mathbb{1}\{s = s_h\}$ for all $s \in \mathcal{S}, a \in \mathcal{A}$
5:   Start an instance of EULER$(r_{s_h}, N_0, \delta/(4S))$
6:   Start an instance of $M(\mathcal{A}, \delta/(4S))$
7:   Get the next query $(a, a')$ from $M$
8:   $U \leftarrow \emptyset$
9:   for $n \in [N_0]$ do
10:   Obtain a policy $\hat{\pi}_n$ from EULER and execute $\hat{\pi}_n$ until step $h$
11:   Return the trajectory and reward to EULER
12:   if the current state is $s_h$ then
13:    Let $\pi = a \circ \hat{\pi}_{h+1:H}$ if $|U| = 0$, and $\pi = a' \circ \hat{\pi}_{h+1:H}$ otherwise
14:    Execute $\pi$ until step $H$ and obtain trajectory $\tau$
15:    $U \leftarrow U \cup \{\tau\}$
16:   end if
17:   if $|U| = 2$ then
18:    Compare the two trajectories in $U$, return the result to $M$, and reset $U \leftarrow \emptyset$
19:    $(a, a') \leftarrow$ the next query from $M$
20:   end if
21:   If $M$ has finished, break
22:  end for
23:  if $M$ has finished then
24:   Update $\hat{\pi}_h(s_h)$ to the optimal action according to $M$
25:  end if
26: end for
27: end for
Ensure: policy $\hat{\pi}$

For every step $h \in [H]$ and $s_h \in \mathcal{S}_h$, we set up an artificial reward function $r_{s_h}(s, a) = \mathbb{1}\{s = s_h\}$; i.e., the agent receives a reward of 1 once it reaches $s_h$ and 0 everywhere else (Step 4). This reward function is also used in the recent reward-free learning literature [14, 24, 23]. But rather than using it to obtain a policy cover (which is costly in both time and space), we use it to help the dueling bandit algorithm. We can use an arbitrary tabular RL algorithm to optimize $r_{s_h}$; here we use EULER [35] with a budget of $N_0$ and success probability $\delta/(4S)$. We chose EULER mainly for its beneficial regret guarantees, but other algorithms can also be used in practice. We also start an instance of $M$, the dueling bandit algorithm, without setting the target accuracy; one way to achieve this is to use a pre-defined accuracy for $M$, another is to use algorithms like Beat-the-Mean (see Theorems 5 and 7, respectively). Once we reach $s_h$ (Step 12), we execute the queried action $a$ or $a'$ from $M$, followed by the current best policy $\hat{\pi}$, as in PPS. If we have collected two trajectories, we compare them and feed the result back to $M$. Upon finishing the $N_0$ exploration episodes, we update the current best policy $\hat{\pi}$ according to $M$. We return $\hat{\pi}$ once we have explored every state $s \in \mathcal{S}$.
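The sketch below mirrors the control flow of Algorithm 2 under the same illustrative interfaces as in the PPS sketch. EULER is abstracted behind a generic tabular RL learner (`make_explorer`), since, as noted above, other algorithms with similar guarantees can be substituted; the interface names and the exact bookkeeping are our own assumptions, not the paper's implementation.

```python
def peps(env, prefer, make_dueler, make_explorer, n_budget, delta):
    """Sketch of PEPS (Algorithm 2). Assumed interfaces (ours):
      make_explorer(reward_fn, budget, delta): a tabular RL learner (EULER in the
        paper) with next_policy() and update(prefix, synthetic_reward);
      make_dueler(arms, delta): a dueling bandit run without a preset accuracy;
      env additionally provides reset() -> s0 and num_states; the rest
      (env.step, env.horizon, env.states_at, rollout, prefer) is as before."""
    H = env.horizon
    pi_hat = [{s: env.actions[0] for s in env.states_at(h)} for h in range(H)]

    def tail_policy(first_action, h):
        return lambda s, step: first_action if step == h else pi_hat[step][s]

    for h in range(H - 1, -1, -1):                 # backward over steps (0-indexed here)
        for s_target in env.states_at(h):
            # Synthetic reward: 1 for reaching s_target, 0 elsewhere (Step 4).
            reach_reward = lambda s, a, tgt=s_target: 1.0 if s == tgt else 0.0
            explorer = make_explorer(reach_reward, n_budget, delta / (4 * env.num_states))
            dueler = make_dueler(env.actions, delta / (4 * env.num_states))
            i, j = dueler.propose()                # first query from M (Step 7)
            pending = []                           # the trajectory set U (Step 8)
            for _ in range(n_budget):              # N_0 exploration episodes
                if dueler.finished():
                    break
                # Navigate toward s_target for the first h steps (Steps 10-11).
                explore_pi = explorer.next_policy()
                prefix, s = [], env.reset()
                for step in range(h):
                    a = explore_pi(s, step)
                    prefix.append((s, a))
                    s = env.step(s, a, step)
                explorer.update(prefix, float(s == s_target))
                if s == s_target:                  # reached the target state (Step 12)
                    arm = i if not pending else j
                    pending.append(rollout(env, tail_policy(env.actions[arm], h), s, h))
                if len(pending) == 2:              # feed one comparison to M (Steps 17-19)
                    dueler.update(i, j, prefer(pending[0], pending[1]))
                    pending = []
                    i, j = dueler.propose()
            # Record M's current recommendation at s_target (Steps 23-24).
            pi_hat[h][s_target] = env.actions[dueler.recommend()]
    return pi_hat
```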
Throughout this section, we use $\iota = \log\frac{SAH}{\epsilon\delta}$ to denote log factors. We present two versions of the guarantee with different sample complexities: the first has a lower comparison complexity, while the second has a lower total step complexity. The difference comes from setting a slightly different stopping target for $M$ in Step 21. In the first version below, we let $M$ finish when it finds an $O(\epsilon/H)$-optimal policy.

Theorem 5. Suppose the preference distribution satisfies Assumption 1. There exists a constant $c_0$ such that the following holds. Let $N_0 = c_0\left(\frac{H^{\alpha} S\, \Psi(A, \epsilon/H, \delta/(4S))}{\epsilon^{\alpha+1}} + \frac{S^3 A H^2 \iota^3}{\epsilon}\right)$ (recall $\iota = \log\frac{SAH}{\epsilon\delta}$). By setting the target accuracy of $M$ to $\frac{C_0 \epsilon}{2H}$, PEPS obtains an ε-optimal policy with probability 1 − δ using step complexity
$O(HSN_0) = O\!\left(\frac{H^{\alpha+1} S^2\, \Psi(A, \epsilon/H, \delta/(4S))}{\epsilon^{\alpha+1}} + \frac{S^4 A H^3 \iota^3}{\epsilon}\right)$
and comparison complexity $O\!\left(\frac{H^{\alpha} S\, \Psi(A, \epsilon/H, \delta/(4S))}{\epsilon^{\alpha}}\right)$.

Since the target accuracy is set before the algorithm starts, any PAC dueling bandit algorithm can be used to instantiate Theorem 5. Similarly to Theorem 3, we can plug in OPT-Maximize [16] to obtain the following corollary:

Corollary 6. Suppose the preference distribution satisfies Assumption 1 and SST. There exists a constant $c_0$ such that the following holds. Let $N_0 = c_0\left(\frac{S H^2 A \log(S/\delta)}{\epsilon^3} + \frac{S^3 A H^2 \iota^3}{\epsilon}\right)$. Using OPT-Maximize with accuracy $\epsilon/H$ as $M$, PEPS obtains an ε-optimal policy with probability 1 − δ using step complexity
$O(HSN_0) = O\!\left(\frac{H^3 S^2 A \log(S/\delta)}{\epsilon^3} + \frac{S^4 A H^3 \iota^3}{\epsilon}\right)$
and comparison complexity $O\!\left(\frac{H^2 S A \log(S/\delta)}{\epsilon^2}\right)$.

In the second version, we simply do not set any performance target for $M$; it keeps exploring until all $N_0$ episodes are finished. Therefore we never break in Step 21, and $M$ is always treated as finished in Step 23. Since different states are reached different numbers of times, we cannot pre-set a target accuracy for $M$; effectively, we explore every state $s$ to accuracy $\epsilon_s = O\!\left(\frac{\epsilon}{\left(\mu(s)\, S\, H^{\alpha-1}\right)^{1/\alpha}}\right)$ (see the proof in the appendix for details), where $\mu(s)$ is the maximum probability of reaching $s$ under any policy. Although this version leads to a slightly higher comparison complexity, it achieves a better step complexity, since all exploration steps are used efficiently. We have the following result:

Theorem 7. Suppose the preference distribution satisfies Assumption 1. There exists a constant $c_0$ such that the following holds. Let $N_0 = c_0\left(\frac{H^{\alpha-1} S\, \Psi(A, \frac{\epsilon}{HS}, \delta/(4S))}{\epsilon^{\alpha}} + \frac{S^3 A H^2 \iota^3}{\epsilon}\right)$. PEPS obtains an ε-optimal policy with probability 1 − δ using step complexity
$O(HSN_0) = O\!\left(\frac{H^{\alpha} S^2\, \Psi(A, \frac{\epsilon}{HS}, \delta/(4S))}{\epsilon^{\alpha}} + \frac{S^4 A H^3 \iota^3}{\epsilon}\right)$
and comparison complexity $O\!\left(\frac{H^{\alpha-1} S^2\, \Psi(A, \frac{\epsilon}{HS}, \delta/(4S))}{\epsilon^{\alpha}}\right)$.

We need to instantiate Theorem 7 with an $M$ that does not need a pre-set accuracy. Under SST and STI, we can use Beat-the-Mean [34], which returns an ε-optimal arm among $K$ arms with probability 1 − δ if it runs for $\Omega\!\left(\frac{K}{\epsilon^2}\log\frac{K}{\epsilon\delta}\right)$ steps. We therefore obtain the following corollary:

Corollary 8. Suppose the preference distribution satisfies Assumption 1, SST, and STI. There exists a constant $c_0$ such that the following holds. Let $N_0 = c_0\left(\frac{H S A \iota}{\epsilon^2} + \frac{S^3 A H^2 \iota^3}{\epsilon}\right)$. Using Beat-the-Mean as $M$, PEPS obtains an ε-optimal policy with probability 1 − δ using step complexity
$O(HSN_0) = O\!\left(\frac{H^2 S^2 A \iota}{\epsilon^2} + \frac{S^4 A H^3 \iota^3}{\epsilon}\right)$
and comparison complexity $O\!\left(\frac{H S^2 A \iota}{\epsilon^2}\right)$.
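As a sanity check on how the corollary follows from the theorem, the following back-of-envelope instantiation (ours) plugs Beat-the-Mean into Theorem 7, assuming $\alpha = 2$ and $\Psi(K, \cdot, \cdot) = O(K\iota)$ once its log factors are absorbed into $\iota$:

```latex
% Instantiating Theorem 7 with Beat-the-Mean: alpha = 2, Psi(A, ., .) = O(A * iota).
\begin{align*}
N_0 &= c_0\!\left(\frac{H^{\alpha-1} S\,\Psi\!\left(A,\tfrac{\epsilon}{HS},\tfrac{\delta}{4S}\right)}{\epsilon^{\alpha}}
      + \frac{S^3 A H^2 \iota^3}{\epsilon}\right)
     = O\!\left(\frac{H S A \iota}{\epsilon^{2}} + \frac{S^3 A H^2 \iota^3}{\epsilon}\right), \\
O(HSN_0) &= O\!\left(\frac{H^{2} S^{2} A \iota}{\epsilon^{2}} + \frac{S^{4} A H^{3} \iota^3}{\epsilon}\right)
  \quad\text{(step complexity)}, \\
\frac{H^{\alpha-1} S^{2}\,\Psi\!\left(A,\tfrac{\epsilon}{HS},\tfrac{\delta}{4S}\right)}{\epsilon^{\alpha}}
  &= O\!\left(\frac{H S^{2} A \iota}{\epsilon^{2}}\right)
  \quad\text{(comparison complexity)},
\end{align*}
```

which matches the rates stated in Corollary 8.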
4.2 Discussion

Comparing Corollaries 6 and 8. Considering only the leading terms, the step complexity in Corollary 8 is better by a factor⁷ of $\tilde{O}(H/\epsilon)$, but the comparison complexity is worse by a factor of $\tilde{O}(S/H)$ (recall that we assume $S \ge H$). The two results therefore depict a tradeoff between step complexity and comparison complexity.

⁷We use $\tilde{O}$ to ignore log factors.

Comparing with lower bounds and value-based RL. Corollary 6 has the same comparison complexity as Corollary 4. The lower bound for comparison complexity is $\Omega\!\left(\frac{SA}{\epsilon^2}\right)$, an $O(H^2)$ factor away from our result. Our comparison complexity also matches the current best upper bound in value-based RL (in terms of the number of episodes), which shows that our result is close to optimal. Compared with the $\Omega\!\left(\frac{HSA}{\epsilon^2}\right)$ lower bound on step complexity, Corollary 8 has an additional factor of $\tilde{O}(HS)$. The current best upper bound in value-based RL is $O\!\left(\frac{H^3 S A}{\epsilon^2}\right)$, which is $\tilde{O}(H/S)$ times our result (recall that we assume $H \le S$).

Comparing with previous RL methods using policy covers. Some literature on reward-free learning and policy covers [14, 24, 23] uses a similar form of synthetic reward function to ours. [14, 24] consider computing a policy cover by exploring the state space using a synthetic reward similar to that of PEPS. However, our results are generally not comparable, since they assume $\mu(s) \ge \eta$ for some $\eta > 0$ and their results depend on $\eta$; our result does not need to depend on such an $\eta$. Closest to our method is the recent work of [23], which considers reward-free learning with rewards unknown during exploration. However, their method cannot be applied to PbRL, since we do not have the reward values even in the planning phase. We note that the $O(S^2)$ dependence in the step complexity is also common in these prior works. Nevertheless, our rates have a better $H$ dependence, because we do not need to compute the policy cover explicitly.

Fixed budget setting. While we present PPS and PEPS in the fixed-confidence setting (with a given ε), one can easily adapt them to the fixed-budget setting (with a given number of episodes $N$) by dividing $N$ evenly among the states. We present a fixed-budget version in the appendix.

Two-phase extension of PEPS. A drawback of Theorem 7 is that it requires $M$ to work without a pre-set target accuracy, which limits the algorithms we can use to instantiate $M$. In the appendix, we present a two-phase version of PEPS that allows for an arbitrary PAC algorithm $M$, with slightly improved log factors in the guarantee.

[Figure 1: Experiment results comparing PEPS to the baselines (DPS & EPMC). Each panel plots average reward against the total budget (episodes): (a) Grid World, c = 1; (b) Grid World, c = 0.001; (c) Random MDP, c = 1; (d) Random MDP, c = 0.001.]

5 Experiments

We performed experiments in synthetic environments to compare PEPS with previous baselines. We consider two environments:

Grid World: We implemented a simple Grid World on a 4×4 grid. The agent goes from the upper left corner to the lower right corner and can choose to go right or go down at each step. We randomly place a reward of 1/3 on three blocks in the grid, and the maximal total reward is 2/3.

Random MDP: We followed the method in [25] but adapted it to our setting. We consider an MDP with 20 states and 5 steps, with 4 states in each step. The transitions are sampled from a Dirichlet prior (with all parameters set to 0.1), and the rewards are sampled from an exponential prior with scale parameter λ = 5. The rewards are then shifted and normalized so that the minimum reward is 0 and the mean reward is 1.

Compared methods. We compared against two baselines: i) DPS [25], where we used the version with Gaussian process regression, since it obtains the best results in their experiments; and ii) EPMC [30], which uses preferences to simulate a Q-function. We followed the default parameter settings for both DPS and EPMC. Details of the algorithms and hyperparameter settings are included in the appendix.

Experiment Setup. Our baselines are not directly comparable to PEPS, since their goal is to minimize regret rather than to recover an ε-optimal policy. However, we can easily adapt all the algorithms (including PEPS) to the fixed-budget setting for optimal policy recovery. For the baselines, we ran them for $N$ episodes and then evaluated the current best policy. For PEPS, we used the fixed-budget version described in detail in the appendix. For both environments, we varied the budget $N \in [2\bar{S}, 8\bar{S}]$, where $\bar{S}$ is the number of non-terminating states. The comparisons are generated following the Bradley-Terry-Luce model [9]: $\Pr[\tau_1 \succ \tau_2] = \frac{1}{1 + \exp\left(-(r(\tau_1) - r(\tau_2))/c\right)}$, with $c$ being either 0.001 or 1. In the first setting of $c$, the preferences are very close to deterministic, while comparisons between equal rewards are uniformly random; in the latter setting, the preferences are close to linear in the reward difference.
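For concreteness, the simulated labeler used in the experiments can be written as a particular link for the comparison oracle sketched in Section 2 (a sketch under the stated Bradley-Terry-Luce model; the function names follow our earlier illustrative snippets):

```python
import math

def btl_link(c: float):
    """Bradley-Terry-Luce link used in the experiments:
    Pr[tau1 > tau2] = 1 / (1 + exp(-(r(tau1) - r(tau2)) / c)).
    Small c (0.001) makes preferences nearly deterministic, and exactly uniform
    when the two rewards tie; c = 1 makes them roughly linear in the gap."""
    def link(gap: float) -> float:
        z = gap / c
        # Numerically stable logistic: with c = 0.001 the exponent can be huge,
        # so never call exp on a large positive argument.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        return math.exp(z) / (1.0 + math.exp(z))
    return link

# Usage with the oracle factory from the Section 2 sketch (`reward` is the hidden
# reward table of the environment, a hypothetical object in these sketches):
#   prefer = make_comparison_oracle(reward, btl_link(0.001))
```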
We repeated each experiment 32 times and computed the mean and standard deviation.

Results. The results are summarized in Figure 1. Overall, PEPS outperforms both baselines in both environments. In Grid World, while all three methods exhibit relatively high variance when c = 1, for c = 0.001 PEPS almost always recovers the exact optimal policy. For the Random MDP, PEPS outperforms both baselines by a large margin (larger than two standard deviations). We note that EPMC learns very slowly and barely improves as the budget increases, which is consistent with the observations in [25]. One potential reason that PEPS outperforms the baselines is its exploration method: both DPS and EPMC need to estimate the Q-function well in order to explore efficiently. This estimation can take time exponential in H, and it is not computationally feasible to run them until the Q-function converges. As a result, both DPS and EPMC explore almost randomly and cannot recover the optimal policy. Our method, on the other hand, uses a dueling bandit algorithm to force exploration, so it guarantees that at least the states with high reach probability are assigned their optimal actions.

6 Conclusion

We analyze the theoretical foundations of the PbRL framework in detail and present the first finite-time analysis for general PbRL problems. Under reasonable assumptions on the preferences, the proposed PEPS method recovers an ε-optimal policy with finite-time guarantees. Experiments show the potential efficacy of our method in practice and that it can work well in simulated environments. Future work includes i) testing PEPS in other RL environments and applications, ii) developing algorithms for PbRL with finite-time regret guarantees, iii) algorithms for the case when Assumption 1 holds only with some error, and iv) PbRL in non-tabular settings.

Broader Impact Discussion

Reinforcement learning is increasingly being used in a myriad of decision-making applications, ranging from robotics and drone control, to computer games, to tutoring systems, to clinical trials in medicine, to social decision making. Many of these applications have goals that are hard to capture mathematically using a single reward definition, and the common practice of reward hacking [8, 1] (trying to achieve the best performance under the specified metric) often leads to undesired behaviors with respect to the original goal of the application. Preference-based RL provides an appealing alternative in which the user can specify preferences over alternative trajectories without hard-coding a single reward metric, and in this way avoid unexpected behavior caused by misspecified reward metrics. This paper substantially expands the theoretical understanding of preference-based RL by providing the first finite-time guarantees, and it could help broaden the applicability of this learning modality across multiple areas of its potential beneficial use, in particular where it is currently considered too expensive to set up and deploy.

References

[1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
[2] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
[3] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. 2011.
[4] M. G. Azar, R. Munos, and B. Kappen. On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461, 2012.
[5] M. G. Azar, R. Munos, and H. J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
[6] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.
[7] J. A. Bagnell, S. M. Kakade, J. G. Schneider, and A. Y. Ng. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, pages 831–838, 2004.
[8] F. Berkenkamp, A. Krause, and A. P. Schoellig. Bayesian optimization with safety constraints: Safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450, 2016.
[9] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[10] R. Busa-Fekete, E. Hüllermeier, and A. E. Mesaoudi-Paul. Preference-based online learning with dueling bandits: A survey. arXiv preprint arXiv:1807.11398, 2018.
[11] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
[12] A. Cohen, T. Hazan, and T. Koren. Tight bounds for bandit combinatorial optimization. arXiv preprint arXiv:1702.07539, 2017.
[13] C. Dann and E. Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
[14] S. S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudík, and J. Langford. Provably efficient RL with rich observations via latent state decoding. arXiv preprint arXiv:1901.09018, 2019.
[15] M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual dueling bandits. arXiv preprint arXiv:1502.06362, 2015.
[16] M. Falahatgar, Y. Hao, A. Orlitsky, V. Pichapati, and V. Ravindrakumar. Maxing and ranking with few assumptions. In Advances in Neural Information Processing Systems, pages 7060–7070, 2017.
[17] M. Falahatgar, A. Orlitsky, V. Pichapati, and A. T. Suresh. Maximum selection and ranking under noisy comparisons. In Proceedings of the 34th International Conference on Machine Learning, pages 1088–1096. JMLR.org, 2017.
[18] J. Fürnkranz, E. Hüllermeier, W. Cheng, and S.-H. Park. Preference-based reinforcement learning: A formal framework and a policy iteration algorithm. Machine Learning, 89(1-2):123–156, 2012.
[19] A. Jain, S. Sharma, T. Joachims, and A. Saxena. Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research, 34(10):1296–1313, 2015.
[20] A. Jain, B. Wojcik, T. Joachims, and A. Saxena. Learning trajectory preferences for manipulators via iterative improvement. In Advances in Neural Information Processing Systems, pages 575–583, 2013.
[21] N. Jiang and A. Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference on Learning Theory, pages 3395–3398, 2018.
[22] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765, 2018.
[23] C. Jin, A. Krishnamurthy, M. Simchowitz, and T. Yu. Reward-free exploration for reinforcement learning. arXiv preprint arXiv:2002.02794, 2020.
[24] D. Misra, M. Henaff, A. Krishnamurthy, and J. Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. arXiv preprint arXiv:1911.05815, 2019.
[25] E. R. Novoseller, Y. Sui, Y. Yue, and J. W. Burdick. Dueling posterior sampling for preference-based reinforcement learning. 2019.
[26] I. Rejwan and Y. Mansour. Top-k combinatorial bandits with full-bandit feedback. In Algorithmic Learning Theory, pages 752–776, 2020.
[27] R. Wang, S. S. Du, L. F. Yang, and S. M. Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.
[28] C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.
[29] C. Wirth and J. Fürnkranz. First steps towards learning from game annotations. 2012.
[30] C. Wirth and J. Fürnkranz. EPMC: Every visit preference Monte Carlo for reinforcement learning. In Asian Conference on Machine Learning, pages 483–497, 2013.
[31] C. Wirth and J. Fürnkranz. On learning from game annotations. IEEE Transactions on Computational Intelligence and AI in Games, 7(3):304–316, 2015.
[32] Y. Xu, A. Joshi, A. Singh, and A. Dubrawski. Zeroth order non-convex optimization with dueling-choice bandits. arXiv preprint arXiv:1911.00980, 2019.
[33] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
[34] Y. Yue and T. Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 241–248, 2011.
[35] A. Zanette and E. Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.
[36] Y. Zhao, D. Zeng, M. A. Socinski, and M. R. Kosorok. Reinforcement learning strategies for clinical trials in non-small cell lung cancer. Biometrics, 67(4):1422–1433, 2011.