# Policy Learning Using Weak Supervision

Jingkang Wang 1,2, Hongyi Guo 3, Zhaowei Zhu 4, Yang Liu 4
1 University of Toronto, 2 Vector Institute, 3 Northwestern University, 4 UC Santa Cruz
wangjk@cs.toronto.edu, hongyiguo2025@u.northwestern.edu, zwzhu@ucsc.edu, yangliu@ucsc.edu

Most existing policy learning solutions require the learning agents to receive high-quality supervision signals, such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is usually infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages the available cheap weak supervision to perform policy learning efficiently. To handle this problem, we treat the weak supervision as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreements). Our approach explicitly punishes a policy for overfitting to the weak supervision. In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high.

1 Introduction

Recent breakthroughs in policy learning (PL) open up the possibility of applying reinforcement learning (RL) or behavioral cloning (BC) in real-world applications such as robotics [1, 2] and self-driving [3, 4]. Most existing works require agents to receive high-quality supervision signals, e.g., rewards or expert demonstrations, which are either infeasible or expensive to obtain in practice [5, 6]. The outputs of reward functions in RL are subject to multiple kinds of randomness. For example, the reward collected from sensors on a robot may be biased and have inherent noise due to physical conditions such as temperature and lighting [7, 8, 9]. For human-defined rewards, different human instructors might provide drastically different feedback, leading to biased rewards [10]. Besides, the demonstrations provided by an expert in behavioral cloning (BC) are often imperfect due to limited resources and environment noise [11, 12, 13]. Therefore, learning from weak supervision signals, such as noisy rewards [7] or low-quality demonstrations produced by untrustworthy experts [12, 14], is one of the outstanding challenges that prevent a wider application of PL.

Although some works have explored these topics separately in their specific domains [7, 15, 14, 16], a unified solution for robust policy learning in imperfect situations is still lacking. Moreover, these approaches often require knowledge of the noise model as well as the corruption level of the supervision signals. To handle these challenges, we first formulate a meta-framework to study RL/BC with weak supervision, which we call weakly supervised policy learning. Then we propose a theoretically principled solution, PeerPL, to perform efficient policy learning using the available weak supervision without requiring noise rates. Our solution is inspired by peer loss [17], a recently proposed loss function for learning with noisy labels that does not require the specification of noise rates. In peer loss, the noisy labels are treated as a peer agent's supervision.
This loss function explicitly punishes the classifier for simply agreeing with the noisy labels, and instead rewards it for a "correlated agreement" (CA). We adopt a similar idea: we treat the weak supervision as noisy information coming from an imperfect peer agent, and evaluate the learning agent's policy based on a correlated agreement (CA) with the weak supervision signals. Compared to standard reward and evaluation functions that encourage simple agreements with the supervision, our approach punishes "over-agreement" to avoid overfitting to the weak supervision, which offers us a family of solutions that do not require prior knowledge of the corruption level of the supervision signals.

To summarize, the contributions of this paper are: (1) we provide a unified formulation of weakly supervised policy learning problems; (2) we propose PeerPL, a new way to perform policy evaluation for RL/BC tasks, and demonstrate how it adapts to challenging tasks including RL with noisy rewards and BC from weak demonstrations; (3) PeerPL is theoretically guaranteed to recover the optimal policy, as if the supervision were clean and of high quality; (4) experimental results show strong evidence that PeerPL brings significant improvements over state-of-the-art solutions. Code is available online at: https://github.com/wangjksjtu/PeerPL.

1.1 Related Work

Learning with Noisy Supervision  Learning from noisy supervision is a widely explored topic. The seminal work [18] first proposed an unbiased surrogate loss function to recover the true loss from the noisy label distribution, given knowledge of the label noise rates. Follow-up works offered ways to estimate the noise level from model predictions [19, 20, 21, 22, 23, 24, 25, 26, 27] or from label consensuses of nearby representations [28]. Recent works also studied this problem in sequential settings, including federated bandits [29] and RL [7]. The former assumes the noise can be offset by averaging rewards from multiple agents. [7] designs a statistics-based estimation algorithm for the noise rates in observed rewards, which can be inefficient, especially when the state-action space is huge. Moreover, the estimation error can accumulate and amplify in sequential problems. Inspired by recent advances in peer loss [17, 30, 31], our solution is able to recover true supervision signals without requiring a priori specification of the noise rates.

Behavioral Cloning (BC)  Standard BC [32, 33] tackles the sequential decision-making problem by imitating the expert actions using supervised learning. Specifically, it aims to minimize the one-step deviation error over the expert trajectory without reasoning about the sequential consequences of actions. Therefore, the agent suffers from compounding errors when there is a mismatch between the demonstrations and the real states encountered [33, 34, 35]. Recent works introduce data augmentation [36], value-based regularization [37], or inverse dynamics models [38, 39] to encourage learning long-horizon behaviors. Despite being simple and straightforward, BC has been widely investigated in a range of application domains [40, 41] and often yields competitive performance [42, 37]. Our framework complements the current BC literature by introducing a strategy for learning from weak demonstrations (e.g., noisy or produced by a poorly trained agent) and provides theoretical guarantees on how to recover the clean policy under mild assumptions [43].
Correlated Agreement  In [44, 45], a correlated agreement (CA) type of mechanism is proposed to evaluate the correlations between agents' reports. In addition to encouraging a certain agreement between agents' reports, CA also punishes over-agreement when two agents always report identically. Recently, [17, 30, 25] adapted a similar idea to learning with noisy labels, thereby removing the burden of estimating noise rates. We consider the more challenging sequential decision-making problem and study the convergence rates under noisy supervision signals.

2 Policy Learning from Weak Supervision

We begin by reviewing conventional reinforcement learning and behavioral cloning with clean supervision signals. Then we introduce the weak supervision problem in policy learning and define two concrete instantiations: (1) RL with noisy rewards and (2) BC using weak expert demonstrations.

2.1 Overview of Policy Learning

The goal of policy learning (PL) is to learn a policy that the agent can follow to perform a series of actions in a stateful environment. For reinforcement learning, the interactive environment is characterized as an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma \rangle$. At each time $t$, the agent in state $s_t \in \mathcal{S}$ takes an action $a_t \in \mathcal{A}$ by following the policy $\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and potentially receives a reward $r(s_t, a_t) \in \mathcal{R}$. Then the agent transfers to the next state $s_{t+1}$ according to a transition probability function $\mathbb{P}$.

Figure 1: Illustration of weakly supervised policy learning and our PeerPL solution with correlated agreement (CA), covering BC with a weak expert, RL with perturbed rewards, and policy co-training. We use $\tilde{Y}$ to denote a weak supervision signal, be it a noisy reward or a noisy demonstration. Eva stands for an evaluation function; the peer agent corresponds to the weak supervision.

We denote the generated trajectory $\tau = \{(s_t, a_t, r_t)\}_{t=0}^{T}$, where $T$ is a finite or infinite horizon. RL algorithms aim to maximize the expected reward over the trajectory induced by the policy: $J_{\text{clean}}(\pi) = \mathbb{E}_{(s_t, a_t, r_t) \sim \tau}\big[\sum_{t=0}^{T} \gamma^t r_t\big]$, where $\gamma \in (0, 1]$ is the discount factor.

Another popular policy learning method is behavioral cloning. Let $\pi(\cdot|s)$ denote the distribution over actions formed by $\pi$, and $\pi(a|s)$ the probability of choosing action $a$ given state $s$ and policy $\pi$. The goal of BC is to mimic the expert policy $\pi_E$ through a set of demonstrations $D_E = \{(s_i, a_i)\}_{i=1}^{N}$ drawn from a distribution $\mathcal{D}_E$, where $(s_i, a_i)$ is a state-action pair sampled from the expert trajectory and $a_i \sim \pi_E(\cdot|s_i)$. Training a policy $\pi$ with standard BC then corresponds to maximizing the following log-likelihood: $J_{\text{clean}}(\pi) = \mathbb{E}_{(s,a) \sim \mathcal{D}_E}[\log \pi(a|s)]$.

In both RL and BC, the learning agent receives supervision either through the (clean) reward $r$ by interacting with the environment or through the expert policy $\pi_E$ as observable demonstrations. Considering a particular policy class $\Pi$, the optimal policy is then defined as $\pi^* = \arg\max_{\pi \in \Pi} J_{\text{clean}}(\pi)$: $\pi^*$ obtains the maximum expected reward over the horizon $T$ in RL and corresponds to the clean expert policy $\pi_E$ in BC. In practice, one can also combine the RL and BC approaches to take advantage of both learning paradigms [46, 47, 15, 43]. Specifically, a recent hybrid framework called policy co-training [43] will be considered in this paper.
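To make the two clean objectives concrete, here is a minimal Python sketch of Monte-Carlo estimates of $J_{\text{clean}}$ for RL (the discounted return of one sampled trajectory) and for BC (the mean log-likelihood of expert actions). The function names and the small tabular policy format are illustrative assumptions, not part of the original method.

```python
import numpy as np

def rl_objective(rewards, gamma=0.99):
    """Monte-Carlo estimate of J_clean(pi) for one trajectory: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def bc_objective(policy_probs, demo_actions):
    """BC objective: mean log-likelihood of expert actions under the policy.
    policy_probs[i, a] approximates pi(a | s_i); demo_actions[i] is the expert action at s_i."""
    idx = np.arange(len(demo_actions))
    return np.mean(np.log(policy_probs[idx, demo_actions] + 1e-12))

# Toy usage: a 3-step trajectory and 3 demonstrated state-action pairs.
print(rl_objective([1.0, 0.0, 1.0]))
print(bc_objective(np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]), np.array([0, 1, 0])))
```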
2.2 Weak Supervision in Policy Learning

The weak supervision signal $\tilde{Y}$ could be a noisy reward $\tilde{r}$ for RL or a noisy action $\tilde{a}$ from an imperfect expert policy $\tilde{\pi}_E$ for BC, which are noisy versions of the corresponding high-quality supervision signals. See more details below.

RL with Noisy Reward  Consider a finite MDP $\tilde{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{F}, \mathbb{P}, \gamma \rangle$ with noisy reward channels [7], where $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and the noisy reward $\tilde{r}$ is generated following a certain function $\mathcal{F}: \mathcal{R} \to \tilde{\mathcal{R}}$. Denote the trajectory a policy $\pi$ generates via interacting with $\tilde{\mathcal{M}}$ as $\tilde{\tau}$. Assume the reward is discrete and has $|\mathcal{R}|$ levels. The noisy reward can be characterized via an unknown matrix $C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$, where each entry $c^{\text{RL}}_{j,k}$ indicates the flipping probability of generating a possibly different outcome: $c^{\text{RL}}_{j,k} = \mathbb{P}(\tilde{r}_t = R_k \mid r_t = R_j)$. We call $r$ the true reward and $\tilde{r}$ the noisy reward.

BC with Weak Demonstration  Instead of observing the true expert demonstrations generated according to $\pi_E$, denote the available weak demonstrations by $\{(s_i, \tilde{a}_i)\}_{i=1}^{N}$, where $\tilde{a}_i$ is the noisy expert action drawn according to a random variable $\tilde{a}_i = \tilde{\pi}_E(s_i) \sim \tilde{\pi}_E(\cdot|s_i)$, and each state-action pair $(s_i, \tilde{a}_i)$ is sampled from a distribution $\tilde{\mathcal{D}}_E$. Note there may exist two sources of randomness in generating $\tilde{a}_i$: the uncertainty in the true policy $\pi_E$ and the noise from the imperfect policy $\tilde{\pi}_E$. In particular, we do not consider the former randomness in the theoretical analyses: given the output distribution $\pi_E(\cdot|s_i)$, only one deterministic action $\pi_E(s_i)$ is taken by the expert. This is because, with uncertainty in the true expert actions, it is hard to distinguish a clean case with true expert actions from the weak supervision case without additional knowledge. Similar assumptions are also adopted in [23, 28]. The noisy action is modeled by an unknown confusion matrix $C^{\text{BC}}_{|\mathcal{A}| \times |\mathcal{A}|}$, where each entry $c^{\text{BC}}_{j,k}$ indicates the flipping probability of taking a sub-optimal action that differs from $\pi_E(s)$: $c^{\text{BC}}_{j,k} = \mathbb{P}(\tilde{\pi}_E(s) = A_k \mid \pi_E(s) = A_j)$, where $A_k$ and $A_j$ denote the $k$-th and $j$-th actions from the action space $\mathcal{A}$. In the above definition, we assume the noisy action $\tilde{a}_i$ is independent of the state $s$ given the deterministic expert action $\pi_E(s)$, i.e., $\mathbb{P}(\tilde{a}_i \mid \pi_E(s_i)) = \mathbb{P}(\tilde{a}_i \mid s_i, \pi_E(s_i))$. We aim to recover $\pi^*$ as if we were able to access the high-quality expert demonstrations from $\pi_E$ instead of $\tilde{\pi}_E$.

Knowledge of C  Recall that $C$ ($C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$ or $C^{\text{BC}}_{|\mathcal{A}| \times |\mathcal{A}|}$) is unknown in practice. While recent works estimate this matrix [26, 23, 28] in supervised classification problems, it is still challenging to generalize them to a sequential setting [7]. When $C$ is not perfectly estimated, the estimation error may lead to unexpected state-action pairs, and the error in the reward estimates then accumulates during sequential learning. Besides, estimating $C$ incurs an extra computational burden. In contrast, our method avoids these issues since it requires no knowledge of $C$, leading to more robust policy learning algorithms.

Learning Goal  With full supervision, both RL and BC can converge to the optimal policy $\pi^*$. However, when only weak supervision is available, a learning agent with an over-parameterized model such as a deep neural network will easily memorize the weak supervision and learn a biased policy [48]. In our meta-framework, instead of converging to such a biased policy, we focus on learning the optimal policy $\pi^*$ with only a weak supervision sequence, denoted as $\{((s_t, a_t), \tilde{Y}_t)\}_{t=1}^{T}$ (RL) or $\{((s_i, a_i), \tilde{Y}_i)\}_{i=1}^{N}$ (BC).
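The confusion-matrix noise model above can be simulated with a few lines of Python. The following is a minimal sketch for generating synthetic weak supervision; the helper name `corrupt` and the symmetric binary-reward example are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(value, levels, confusion):
    """Sample a noisy observation: row j of `confusion` is P(observe level k | true level j)."""
    j = levels.index(value)
    return levels[rng.choice(len(levels), p=confusion[j])]

# Binary reward {-1, +1} with symmetric flipping rate e = 0.2 (the setting used in Section 6.1).
reward_levels = [-1.0, 1.0]
e = 0.2
C_rl = np.array([[1 - e, e],
                 [e, 1 - e]])
noisy_reward = corrupt(1.0, reward_levels, C_rl)

# The same construction yields weak BC actions via a confusion matrix C^BC over the action space.
action_levels = [0, 1]
C_bc = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
noisy_action = corrupt(0, action_levels, C_bc)
```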
3 PeerPL: Weakly Supervised PL via Correlated Agreement

To deal with weak supervision in PL, we propose a unified and theoretically principled framework, PeerPL. We treat the weak supervision as information coming from a peer agent, and then evaluate the policy using a certain type of correlated agreement function between the learning policy and the peer agent's information.

3.1 A Unified Evaluation Function

We use an evaluation function $\text{Eva}_\pi((s_i, a_i), \tilde{Y}_i)$ to evaluate a policy $\pi$ at the agent's state-action pair $(s_i, a_i)$ using the weak supervision $\tilde{Y}_i$. For RL, Eva is the instance-wise measure (negative loss) for different RL algorithms, which is a function of the noisy reward $\tilde{r}$ received at $(s_i, a_i)$. In the BC setting, Eva is the loss used to evaluate the action $a_i$ taken by the agent given the expert's demonstration $\tilde{a}_i$. Note that the larger Eva is at $(s_i, a_i)$, the better the policy follows the supervision $\tilde{Y}_i$. Specifically, we have $\text{Eva}^{\text{RL}} = -\ell(\pi, (s, a, \tilde{r}))$ (RL) and $\text{Eva}^{\text{BC}} = \log \pi(\tilde{a}|s)$ (BC), where the RL loss function $\ell$ can be the temporal difference error [49, 50] or the policy gradient loss [51]. Furthermore, we let $J(\pi)$ denote the function that evaluates policy $\pi$ under a set of state-action pairs with weak supervision sequence $\{((s_i, a_i), \tilde{Y}_i)\}_{i=1}^{N}$:

$J(\pi) = \mathbb{E}_{(s,a)}\big[\text{Eva}_\pi((s, a), \tilde{Y})\big].$

The goal of weakly supervised policy learning is then to recover the optimal policy $\pi^*$ as if we received the clean supervision $Y$. Note that directly maximizing $J(\pi)$ might result in sub-optimal performance due to the weak supervision. The above unified notation is only for better delivery of our framework; we still treat PL as a sequential decision problem.

3.2 Overview of the Idea: Correlated Agreement with Weak Supervision

We first present the general idea of our PeerPL framework using a concept named correlated agreement (CA). For each weakly supervised sample $((s_i, a_i), \tilde{Y}_i)$, we randomly sample (with replacement) two other peer samples, indexed by $j$ and $k$. Then we take the state-action pair $(s_j, a_j)$ of sample $j$ and the supervision signal $\tilde{Y}_k$ of sample $k$, and evaluate $((s_i, a_i), \tilde{Y}_i)$ as follows:

CA with Weak Supervision: $\text{Eva}_\pi\big((s_i, a_i), \tilde{Y}_i\big) - \text{Eva}_\pi\big((s_j, a_j), \tilde{Y}_k\big).$

This operation is illustrated in Figure 1. We further give intuitions and a toy example below.

Intuition  The first term above encourages an agreement with the weak supervision (that a policy agrees with the corresponding supervision), while the second term punishes a blind and over-agreement that happens when the agent's policy always matches the weak supervision even on randomly paired traces (noise). The randomly paired instances $j, k$ help us perform this check. Note that our mechanism does not require knowledge of $C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$ or $C^{\text{BC}}_{|\mathcal{A}| \times |\mathcal{A}|}$, and offers a prior-knowledge-free way to learn effectively with weak supervision.

Toy Example  Consider a toy BC setting where the policy fully memorizes the weak supervision and outputs the same sequence of actions given the same sequence of states, i.e., weak supervision: $\tilde{a}_1 = \tilde{a}_2 = \tilde{a}_3 = 1, \tilde{a}_4 = 0$; policy outputs: $a_1 = a_2 = a_3 = 1, a_4 = 0$. Let $\text{Eva}((s_i, a_i), \tilde{a}_i) = 1$ if the policy output agrees with the weak demonstration ($a_i = \tilde{a}_i$), and 0 otherwise. When the policy fully memorizes the weak supervision, we have:

Without CA: $\mathbb{E}[\text{Eva}((s_i, a_i), \tilde{a}_i)] = 1$;  With CA: $\mathbb{E}[\text{Eva}((s_i, a_i), \tilde{a}_i) - \text{Eva}((s_j, a_j), \tilde{a}_k)] = 0.375$,

where $0.375 = 1 - (0.75^2 + 0.25^2)$ is obtained by considering the probability that the randomly paired $a_j$ and $\tilde{a}_k$ match each other. The above example shows that a full agreement with the weak supervision will instead be punished. In what follows, we showcase two concrete implementations, PeerRL (peer reinforcement learning) and PeerBC (peer behavioral cloning), and provide algorithms and theoretical guarantees under weak supervision.
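The toy example can be checked numerically with a short sketch of the CA evaluation. The 0-1 agreement score and the sampling-with-replacement loop below are a minimal illustration (the variable names are ours), not the exact implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def agree(policy_action, weak_label):
    """Toy 0-1 evaluation: 1 if the policy output matches the weak supervision."""
    return float(policy_action == weak_label)

def ca_score(policy_actions, weak_labels):
    """Average CA evaluation: Eva(i, y~_i) - Eva(j, y~_k) with j, k drawn with replacement."""
    n = len(weak_labels)
    total = 0.0
    for i in range(n):
        j, k = rng.integers(n), rng.integers(n)
        total += agree(policy_actions[i], weak_labels[i]) - agree(policy_actions[j], weak_labels[k])
    return total / n

# Toy example from the text: the policy fully memorizes the weak labels (three 1s, one 0).
weak = [1, 1, 1, 0]
estimate = np.mean([ca_score(weak, weak) for _ in range(5000)])
print(estimate)  # close to 1 - (0.75**2 + 0.25**2) = 0.375 in expectation
```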
4 PeerRL: Peer Reinforcement Learning

We propose the following objective function, based on CA, to punish the over-agreement of a parametric policy $\pi_\theta$:

$J_{\text{RL}}(\pi_\theta) = \mathbb{E}\big[\text{Eva}^{\text{RL}}_{\pi_\theta}((s_i, a_i), \tilde{r}_i)\big] - \xi \, \mathbb{E}\big[\text{Eva}^{\text{RL}}_{\pi_\theta}((s_j, a_j), \tilde{r}_k)\big], \quad (1)$

$\text{where } \text{Eva}^{\text{RL}} = -\ell(\pi_\theta, (s, a, \tilde{r})). \quad (2)$

In (1), the first expectation is taken over $(s_i, a_i, \tilde{r}_i) \sim \tilde{\tau}$ and the second one is taken over $(s_j, a_j, \tilde{r}_j) \sim \tilde{\tau}$, $(s_k, a_k, \tilde{r}_k) \sim \tilde{\tau}$, where $\tilde{\tau}$ is the trajectory specified by the noisy reward function $\tilde{r}$. Recall that $j, k$ denote two randomly and independently sampled instances. The loss function $\ell$ depends on the employed RL algorithm, e.g., the temporal difference error [49, 50] or the policy gradient loss [51]. The learning sequence is encoded in $\tilde{\tau}$. The objective $J_{\text{RL}}(\pi_\theta)$ represents the accumulated peer RL reward. The parameter $\xi \geq 0$ balances the penalty for blind agreements induced by CA.

4.1 Peer Reward

In what follows, we consider Q-Learning [52] as the underlying learning algorithm, where $-\ell(\pi, (s, a, \tilde{r})) = \tilde{r}(s, a)$, and demonstrate that the CA mechanism provides strong guarantees for Q-Learning when only the noisy reward is observed. For clarity, we define the peer RL reward:

Peer Reward: $\tilde{r}_{\text{peer}}(s, a) = \tilde{r}(s, a) - \xi \, \tilde{r}_0,$

where $\tilde{r}_0 \sim_{\pi_{\text{sample}}} \{\tilde{r}(s, a) \mid s \in \mathcal{S}, a \in \mathcal{A}\}$ is a reward sampled over all state-action pairs according to a fixed policy $\pi_{\text{sample}}$. Note the sampling policy $\pi_{\text{sample}}$ is independent of $\pi$, and the choice of $\pi_{\text{sample}}$ does not affect our theoretical results. We adopt a random sampling strategy in practice. The parameter $\xi \geq 0$ balances the noisy reward and the punishment for blind agreement (with $\tilde{r}_0$). We set $\xi = 1$ (for the binary case) in the following analysis and treat each $(s, a)$ equally when sampling $\tilde{r}_0$. In experiments, we find $\tilde{r}_{\text{peer}}$ is not sensitive to the choice of $\xi$ and keep $\xi$ constant within each run.

Robustness to Noisy Rewards  We now show that the peer reward $\tilde{r}_{\text{peer}}$ offers an affine transformation of the true reward in expectation, which guarantees that our PeerRL algorithm converges to $\pi^*$. Consider the binary reward setting ($r_+$ and $r_-$) and denote the error in $\tilde{r}$ as $e_+ = \mathbb{P}(\tilde{r} = r_- \mid r = r_+)$ and $e_- = \mathbb{P}(\tilde{r} = r_+ \mid r = r_-)$ (a simplification of $C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$ in the binary setting).

Lemma 1. Let $r \in [0, R_{\max}]$ be a bounded reward and $\xi = 1$. Assume $1 - e_- - e_+ > 0$. We have:

$\mathbb{E}[\tilde{r}_{\text{peer}}(s, a)] = (1 - e_- - e_+)\, \mathbb{E}[r_{\text{peer}}(s, a)] = (1 - e_- - e_+)\, \mathbb{E}[r(s, a)] + \text{const},$

where $r_{\text{peer}}(s, a) = r(s, a) - r_0$ is the peer RL reward when observing the true reward $r$, and $r_0$ is the true reward corresponding to $\tilde{r}_0$.

Lemma 1 shows that, by subtracting the peer penalty term $\tilde{r}_0$ from the noisy reward $\tilde{r}(s, a)$, $\tilde{r}_{\text{peer}}(s, a)$ recovers the clean, true reward $r(s, a)$ in expectation (up to an affine transformation). Based on Lemma 1, we prove in Theorem A1 that the Q-learning agent converges to the optimal policy $\pi^*$ w.p. 1 with peer rewards, without requiring any knowledge of the corruption in rewards ($C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$), as opposed to previous work [7] that requires such knowledge. Moreover, we prove in Theorem A2 that to guarantee convergence to $\pi^*$, the number of samples needed by our approach is no more than $O(1/(1 - e_- - e_+)^2)$ times the number needed when the RL agent observes true rewards perfectly (see Appendix A).

Extension  Even though we only present an analysis of the binary case for Q-Learning, our approach is rather generic and is ready to be plugged into modern DRL algorithms. We provide multi-reward extensions and implementations with DQN [49] and policy gradient [51] in Appendix A.
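As a concrete but simplified illustration of how the peer reward plugs into Q-Learning, the sketch below subtracts a randomly sampled observed reward from the noisy reward before the TD update. Sampling $\tilde{r}_0$ from a pool of previously observed noisy rewards is an approximation of the fixed sampling policy described above, and the class and hyperparameter names are our own assumptions.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

class PeerQAgent:
    """Tabular Q-learning driven by the peer reward r_peer = r_noisy - xi * r0 (a sketch)."""

    def __init__(self, n_actions, xi=0.2, lr=0.1, gamma=0.99):
        self.Q = defaultdict(lambda: np.zeros(n_actions))
        self.reward_pool = []        # noisy rewards observed over visited (s, a) pairs
        self.xi, self.lr, self.gamma = xi, lr, gamma

    def act(self, s, eps=0.1):
        # epsilon-greedy action selection
        if rng.random() < eps:
            return int(rng.integers(len(self.Q[s])))
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, noisy_r, s_next):
        self.reward_pool.append(noisy_r)
        r0 = rng.choice(self.reward_pool)          # randomly sampled peer reward
        r_peer = noisy_r - self.xi * r0            # punish blind agreement with the noise
        td_target = r_peer + self.gamma * self.Q[s_next].max()
        self.Q[s][a] += self.lr * (td_target - self.Q[s][a])
```

The analysis above sets $\xi = 1$ for the binary-reward case, while the deep RL experiments in Section 6 use smaller values such as $\xi = 0.2$.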
4.2 Why does Peer Reward Work?

Compared with the noisy reward, the proposed peer variant is a less biased estimate of the true reward (Benefit-1). In addition, PeerRL helps break unstable tie states, which can encourage the agent to explore in the early stage [53] (Benefit-2).

Benefit-1: PeerRL reduces the bias  We highlight that the biased noise model considered here is rather generic, departing from previous noise assumptions such as zero-mean Gaussian noise [8, 9]. In zero-mean noise models the major focus is on variance reduction, so adding the random term $\tilde{r}_0$ increases the variance and results in a worse estimate. However, in the discrete biased noise model [18], bias correction also plays an important role, especially when the noise rate is high [7]. Similarly to the peer reward (Lemma 1), the expectation of the noisy reward writes as:

$\mathbb{E}[\tilde{r}(s, a)] = (1 - e_- - e_+)\,\mathbb{E}[r(s, a)] + e_- r_+ + e_+ r_- = (1 - e_- - e_+)\,\mathbb{E}[r(s, a)] + \text{const}.$

But the constant in the peer reward has less effect on the true reward $r$, especially when the noise rate is high. To see this:

noisy reward: $\mathbb{E}[\tilde{r}(s, a)] = \eta\Big(\mathbb{E}[r(s, a)] + \tfrac{e_+}{1 - e_- - e_+}\, r_- + \tfrac{e_-}{1 - e_- - e_+}\, r_+\Big),$

peer reward: $\mathbb{E}[\tilde{r}_{\text{peer}}(s, a)] = \eta\Big(\mathbb{E}[r(s, a)] - (1 - p_{\text{peer}})\, r_- - p_{\text{peer}}\, r_+\Big),$

where $\eta = 1 - e_- - e_+ > 0$ and $p_{\text{peer}} \in [0, 1]$ denotes the probability that the sampling policy sees a reward $r_+$ overall. Since the magnitudes of the noise terms $\tfrac{e_-}{1 - e_- - e_+}$ and $\tfrac{e_+}{1 - e_- - e_+}$ can become much larger than $1 - p_{\text{peer}}$ and $p_{\text{peer}}$ in a high-noise regime, the term $\tfrac{e_-}{1 - e_- - e_+} r_+ + \tfrac{e_+}{1 - e_- - e_+} r_-$ dilutes the informativeness of $\mathbb{E}[r(s, a)]$. On the contrary, $\mathbb{E}[\tilde{r}_{\text{peer}}(s, a)]$ contains only a moderate constant offset and thus preserves more useful training signal from the true reward in practice (see the numerical sketch at the end of this subsection). In summary, although the peer reward (similar to the surrogate reward in previous literature [7]) increases the variance (no free lunch), it leads to a better estimate of the true reward due to its lower bias.

Benefit-2: PeerRL helps break ties  For RL, tie states are states whose rewards are identical; they are less informative as they serve neither as positive nor as negative examples. Due to the discrete nature of the noise model, adding a randomly sampled penalty term helps break the tie states and turns them into either positive or negative examples, which can encourage exploration in the early stage; this shares intuition with RL exploration methods [53]. It has also been demonstrated that reducing uncertainty, a.k.a. pushing toward confident predictions, makes learning robust to weak supervision in the supervised setting [17, 54]. On the other hand, it is known that positive examples are sparse yet important in RL. To leverage these useful experiences sufficiently, experience replay [55, 56] was invented to store and up-sample positive examples for faster convergence. Tie breaking potentially provides an alternative way to access more positive examples.

To illustrate the tie-breaking phenomenon when using the peer reward, we consider a two-state Markov process (no actions) with bounded Gaussian noise and see how well we can infer which state is better by correcting the reward signals. We collect two observations for each state and conduct $10^4$ trials to calculate the success rate of inferring which state has the larger return ("Correct" in the table below):

|          | Correct | Tie  | Incorrect |
|----------|---------|------|-----------|
| Baseline | 54.6%   | 5.6% | 39.8%     |
| PeerRL   | 58.0%   | 0.3% | 41.7%     |

As we can see, PeerRL exploits the "discreteness" of the reward, breaking ties to obtain more examples with good-quality supervision. More examples with varied noise models (bounded continuous noise, discrete noise) are deferred to Appendix B.
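A small numerical check of the Benefit-1 argument above, as a sketch: the binary reward values $r_- = 0$, $r_+ = 1$, the symmetric noise rates, and $p_{\text{peer}} = 0.5$ are illustrative choices. It compares the constant offset attached to the noisy reward with the bounded offset attached to the peer reward as the noise rate grows.

```python
# Magnitude of the constant offset attached to E[r(s, a)] (after factoring out the common
# scale eta = 1 - e_minus - e_plus) under the noisy reward versus the peer reward,
# for binary rewards r_- = 0 and r_+ = 1.
def noisy_offset(e_minus, e_plus, r_minus=0.0, r_plus=1.0):
    eta = 1.0 - e_minus - e_plus
    return (e_minus / eta) * r_plus + (e_plus / eta) * r_minus   # blows up as eta -> 0

def peer_offset(p_peer, r_minus=0.0, r_plus=1.0):
    return (1.0 - p_peer) * r_minus + p_peer * r_plus            # bounded by the max reward for any noise rate

for e in (0.1, 0.3, 0.45):
    print(f"e = {e:.2f}: noisy offset = {noisy_offset(e, e):.3f}, peer offset = {peer_offset(0.5):.3f}")
```

For symmetric noise the noisy-reward offset equals $e/(1-2e)$, which grows without bound as $e$ approaches 0.5, whereas the peer-reward offset stays at 0.5 regardless of the noise rate.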
Figure 2: Learning curves of DDQN on CartPole-v0 with true reward ($r$), noisy reward ($\tilde{r}$), surrogate reward [7] ($\hat{r}$), and peer reward ($\tilde{r}_{\text{peer}}$, $\xi = 0.2$); panels: (a) $e = 0.2$, (b) $e = 0.4$.

Figure 3: Learning curves of DDPG [57] on Pendulum with true reward ($r$), noisy reward ($\tilde{r}$), and peer reward ($\tilde{r}_{\text{peer}}$, $\xi = 0.2$); panels: (a) $e = 0.2$, (b) $e = 0.4$.

5 PeerBC: Peer Behavioral Cloning

Similarly, we present our CA solution in the setting of behavioral cloning (PeerBC). In BC, the supervision is given by the weak expert's noisy trajectory. At each iteration, the agent learns under the weak supervision $\tilde{a}$, and the training samples are generated from the distribution $\tilde{\mathcal{D}}_E$ determined by the weak expert. The $\text{Eva}^{\text{BC}}$ function evaluates the agent policy $\pi_\theta$, parameterized by $\theta$, against the weak trajectory $\{(s_i, \tilde{a}_i)\}_{i=1}^{N}$ using $-\ell(\pi_\theta, (s_i, \tilde{a}_i))$, where $\ell$ is an arbitrary classification loss. Taking the cross-entropy for instance, the objective of PeerBC is:

$J_{\text{BC}}(\pi_\theta) = \mathbb{E}\big[\text{Eva}^{\text{BC}}_{\pi_\theta}((s_i, a_i), \tilde{a}_i)\big] - \xi \, \mathbb{E}\big[\text{Eva}^{\text{BC}}_{\pi_\theta}((s_j, a_j), \tilde{a}_k)\big], \quad (3)$

$\text{where } \text{Eva}^{\text{BC}} = \log \pi_\theta(\tilde{a}|s). \quad (4)$

In (3), the first expectation is taken over $(s_i, \tilde{a}_i) \sim \tilde{\mathcal{D}}_E$, $a_i \sim \pi_\theta(\cdot|s_i)$, and the second is taken over $(s_j, \tilde{a}_j) \sim \tilde{\mathcal{D}}_E$, $a_j \sim \pi_\theta(\cdot|s_j)$, $(s_k, \tilde{a}_k) \sim \tilde{\mathcal{D}}_E$, $a_k \sim \pi_\theta(\cdot|s_k)$. Again, the second $\text{Eva}^{\text{BC}}$ term in $J_{\text{BC}}$ serves the purpose of punishing over-agreement with the weak demonstrations. Similarly, $\xi \geq 0$ is a parameter that balances the penalty for blind agreements.

Robustness to Noisy Demonstrations  We prove that the policy learned by PeerBC converges to the expert policy when a sufficient amount of weak demonstrations is observed. We focus on the binary action setting for the theoretical analyses, where the action space is $\mathcal{A} = \{A_+, A_-\}$ and the weakness or noise of the weak expert $\tilde{\pi}_E$ is quantified by $e_+ = \mathbb{P}(\tilde{\pi}_E(s) = A_- \mid \pi_E(s) = A_+)$ and $e_- = \mathbb{P}(\tilde{\pi}_E(s) = A_+ \mid \pi_E(s) = A_-)$. Let $\pi_{\tilde{D}_E}$ be the optimal policy for maximizing the objective in (3) with the imperfect demonstrations $\tilde{D}_E$ (a particular set of $N$ i.i.d. imperfect demonstrations). Note $\ell(\cdot)$ is specified as the 0-1 loss: $\ell(\pi(s), \tilde{a}) = 1$ when $\pi(s) \neq \tilde{a}$, otherwise $\ell(\pi(s), \tilde{a}) = 0$. We have the following upper bound on the error rate.

Theorem 1. Denote by $R_{\pi_{\tilde{D}_E}} := \mathbb{P}_{(s,a) \sim \mathcal{D}_E}\big(\pi_{\tilde{D}_E}(s) \neq a\big)$ the error rate for PeerBC. When $e_+ + e_- < 1$, with probability at least $1 - \delta$, it is upper-bounded by a term proportional to $\frac{1 + \xi}{1 - e_- - e_+}$ that vanishes as the number of demonstrations $N$ grows.

Theorem 1 states that as long as sufficiently many weak demonstrations are observed, i.e., $N$ is sufficiently large, the policy learned by PeerBC converges to the clean expert policy $\pi_E(s)$.
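In practice, the PeerBC objective amounts to a small change to the standard cross-entropy BC loss: subtract, with weight $\xi$, the cross-entropy evaluated on randomly re-paired states and weak actions. The PyTorch-style sketch below is one illustrative reading of Eqs. (3)-(4); the function name and the random re-pairing via index tensors are our assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def peer_bc_loss(logits, weak_actions, xi=0.5):
    """PeerBC loss (negative of Eq. (3)): cross-entropy on the true pairing minus
    xi times the cross-entropy on randomly re-paired (state, weak action) samples."""
    n = weak_actions.shape[0]
    j = torch.randint(n, (n,))                              # peer index for the state/logit side
    k = torch.randint(n, (n,))                              # peer index for the weak-supervision side
    ce = F.cross_entropy(logits, weak_actions)              # -E[log pi(a~_i | s_i)]
    peer_ce = F.cross_entropy(logits[j], weak_actions[k])   # -E[log pi(a~_k | s_j)]
    return ce - xi * peer_ce

# Usage sketch: logits = policy_network(states); loss = peer_bc_loss(logits, weak_actions, xi=0.5)
```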
Peer Policy Co-Training  Our discussion of BC allows us to study a more challenging co-training task [43]. Given a finite MDP $\mathcal{M}$, there are two agents that receive partial observations, and we let $\pi^A$ and $\pi^B$ denote the policies of agents A and B. The two agents are trained jointly to learn from rewards and from noisy demonstrations provided by each other (e.g., during the preliminary training phase). By symmetry, we consider the case where agent A learns with the demonstrations from B on sampled trajectories, so B effectively serves as a noisy version of the expert policy. Following [43], we assume a mapping function $f^{A \to B}$ exists that transforms states under view A into view B. Denote by $\tau^A = \{(s^A_i, a^A_i)\}_i$ the trajectory that A generates via interacting with the partial world $\mathcal{M}^A$. Then B replaces each action $a^A_i$ with its own selection $\tilde{a}^B_i = \pi^B(f^{A \to B}(s^A_i))$ as the weak supervision. To recover the clean expert policy, we adapt the BC peer evaluation term to the co-learning objective $J_{\text{CT}}$, in which the first expectation is taken over $(s^A_i) \sim \tau^A$ with $\tilde{a}^B_i = \pi^B(f^{A \to B}(s^A_i))$, and the second is taken over $(s^A_j) \sim \tau^A$ and $(s^A_k) \sim \tau^A$ with $\tilde{a}^B_k = \pi^B(f^{A \to B}(s^A_k))$; $\ell$ is the loss function defined in Eqn. (4) to measure the policy difference, and $\text{Eva}^{\text{RL}}$ and $\text{Eva}^{\text{BC}}$ are defined in Eqn. (2) and (4), respectively. The full algorithm, PeerCT, is provided in Algorithm 1. We omit a detailed discussion of the convergence of PeerCT; it can be viewed as a straightforward extension of Theorem 1 to the context of co-training.

Algorithm 1: Peer policy co-training (PeerCT)
Require: views A, B; MDPs $\mathcal{M}^A$, $\mathcal{M}^B$; policies $\pi^A$, $\pi^B$; mapping functions $f^{A \to B}$, $f^{B \to A}$ that map states from one view to the other; CA coefficient $\xi$; step size $\beta$ for the policy update.
1: repeat
2:   Run $\pi^A$ to generate trajectories $\tau^A = \{(s^A_i, a^A_i)\}_i$.
3:   Run $\pi^B$ to generate trajectories $\tau^B = \{(s^B_j, a^B_j)\}_j$.
4:   The agents label the trajectories for each other.
5:   Update the policies: $\pi^{\{A,B\}} \leftarrow \pi^{\{A,B\}} + \beta \, \nabla J_{\text{CT}}(\pi^{\{A,B\}})$.
6: until convergence

Figure 4: Learning curves of BC on Atari (panel (d): Freeway): standard BC, PeerBC (ours), and the expert.

6 Experiments

We evaluate our solution on three challenging weakly supervised PL problems. Experiments on control games and Atari show that, without any prior knowledge of the noise, our approach is able to leverage weak supervision more effectively.

Experiment Setup & Baselines  We evaluate PeerPL on a wide variety of control and Atari games. For RL with noisy rewards, we add synthetic noise to the reward signals and compare with previous work [7], where an unbiased estimator of the true reward is constructed by approximating the confusion matrix. For BC from weak demonstrations, we adopt PPO agents that are not fully converged as the weak experts and unroll their trajectories. We also consider a standard policy co-training setting [43] without any synthetic noise added and compare PeerCT with the single-view training paradigm and CoPiEr [43].

6.1 PeerRL with Noisy Reward

CartPole-v0: We first evaluate our method in the RL-with-noisy-reward setting. Following [7], we consider the binary reward $\{-1, 1\}$ for CartPole, where symmetric noise is synthesized with different error rates $e = e_- = e_+$. We choose the DQN [49] and DDQN [50] algorithms and train the models for 10,000 steps. We repeat each experiment 10 times with different random seeds and leave additional results to Appendix D.

Figure 2 shows the learning curves of DDQN with different approaches in noisy environments ($\xi = 0.2$).¹ Since the number of training steps is fixed, the faster the algorithm converges, the fewer total episodes the agent goes through, so its learning curve lies further to the left. The proposed peer reward outperforms the other baselines significantly, even in a high-noise regime (e.g., $e = 0.4$). Table 1 provides quantitative results on the average reward $R_{\text{avg}}$ and the total episodes $N_{\text{epi}}$. We find that the agents with peer reward consistently achieve a larger $R_{\text{avg}}$ (less generalization error) and a smaller $N_{\text{epi}}$ (faster convergence).

¹ We analyzed the sensitivity of $\xi$ and found the algorithm performs reasonably when $\xi \in (0.1, 0.4)$. More insights and experiments with varied $\xi$ are deferred to Appendix D.

Table 1: Numerical performance of DDQN on CartPole with true reward ($r$), noisy reward ($\tilde{r}$), surrogate reward $\hat{r}$ [7], and peer reward $\tilde{r}_{\text{peer}}$ ($\xi = 0.2$). $R_{\text{avg}}$ denotes the average reward per episode after convergence, the higher (↑) the better; $N_{\text{epi}}$ denotes the total episodes involved in 10,000 steps, the lower (↓) the better. Note $0 \leq e < 0.5$.
| Reward | $e=0.1$: $R_{\text{avg}}$ ↑ | $N_{\text{epi}}$ ↓ | $e=0.2$: $R_{\text{avg}}$ ↑ | $N_{\text{epi}}$ ↓ | $e=0.3$: $R_{\text{avg}}$ ↑ | $N_{\text{epi}}$ ↓ | $e=0.4$: $R_{\text{avg}}$ ↑ | $N_{\text{epi}}$ ↓ |
|---|---|---|---|---|---|---|---|---|
| $r$ | 195.6 ± 3.1 | 101.2 ± 3.2 | 195.6 ± 3.1 | 101.2 ± 3.2 | 195.6 ± 3.1 | 101.2 ± 3.2 | 195.2 ± 3.0 | 101.2 ± 3.3 |
| $\tilde{r}$ | 185.2 ± 15.6 | 114.6 ± 6.0 | 168.8 ± 13.6 | 123.9 ± 9.6 | 177.1 ± 11.2 | 133.2 ± 9.1 | 185.5 ± 10.9 | 163.1 ± 11.0 |
| $\hat{r}$ | 183.9 ± 10.4 | 110.6 ± 6.7 | 165.1 ± 18.2 | 113.9 ± 9.6 | 192.2 ± 10.9 | 115.5 ± 4.3 | 179.2 ± 6.6 | 125.8 ± 9.6 |
| $\tilde{r}_{\text{peer}}$ | 198.5 ± 2.3 | 86.2 ± 5.0 | 195.5 ± 9.1 | 85.3 ± 5.4 | 174.1 ± 32.5 | 88.8 ± 6.3 | 191.8 ± 8.5 | 106.9 ± 9.2 |

Figure 5: Policy co-training on control/Atari games ((a) Acrobot, (b) CartPole, (d) Breakout among the panels): single view, CoPiEr [43], and PeerCT (ours).

Pendulum: We further conduct experiments on a continuous-control task, Pendulum, where the goal is to keep a frictionless pendulum standing up. Since the rewards in Pendulum are continuous, $r \in (-16.3, 0.0]$, we discretize them into 17 intervals, $(-17, -16], (-16, -15], \ldots, (-1, 0]$, with each value approximated by the maximum point of its interval. We test DDPG [57] with uniform noise in this environment, following [7]. In Figure 3, the RL agents with the proposed CA objective successfully converge to the optimal policy under different amounts of noise. On the contrary, the agents trained on noisy rewards suffer from the biased noise, especially in the high-noise regime.

Analysis of the benefits of PeerRL  More surprisingly, we observe that on CartPole the agents with peer reward even converge faster than the ones observing the true reward perfectly when the noise rate $e$ is small. This suggests that benefits other than the noise reduction we primarily focus on may further promote the peer reward. We hypothesize this is because (1) the peer penalty term breaks tie states (Benefit-2 in Section 4.2) and encourages exploration in RL; (2) PeerRL scales the reward signals appropriately for easier learning; and (3) the human-specified "true" reward might itself be imperfect, which again leads to a weak supervision scenario. We emphasize that the advantage of recovering from noisy reward signals is non-negligible, especially in a high-noise regime (e.g., $e = 0.4$ in Figures 2 and 3).

6.2 PeerBC from Weak Demonstrations

Atari: In the BC setting, we evaluate our approach on four vision-based Atari games. For each environment, we train an imperfect RL model with the PPO [58] algorithm. Here, "imperfect" means the training is terminated before convergence, when the performance is about 70%-90% as good as that of the fully converged model. We then collect imperfect demonstrations using this expert model and generate 100 trajectories for each environment. The results are reported under three random seeds. Figure 4 shows that our approach outperforms standard BC and even the expert it learns from. Note that during the whole training process, the agent never learns by interacting directly with the environment but only has access to the expert trajectories. We therefore attribute this performance gain to PeerBC's strong ability to learn from weak supervision. The peer term we add not only provably eliminates the effect of noise but also extracts useful strategies from the demonstrations. As shown in Table 2, our approach consistently outperforms the expert and standard BC. We provide a sensitivity analysis of $\xi$ in Appendix D.

Comparison with imitation learning baselines  We further extend the empirical study to imitation learning (IL) algorithms on CartPole-v1. To collect weak demonstrations, we train a PPO agent for 50k iterations, which is not fully converged.
As shown in Figure 6, standard IL algorithms such as BC, AIRL [37], or GAIL [59] cannot handle noisy demonstrations well and lead to sub-optimal performance. Our PeerBC brings an 18% improvement over standard BC by penalizing blind agreement with the weak demonstrations. We remark that the performance of PeerBC is worse than that of DAgger due to the notorious distribution shift issue. To further improve performance, we train PeerBC in the DAgger fashion (Peer-DAgger) by querying the imperfect expert to augment the training set. Not surprisingly, Peer-DAgger surpasses DAgger by a large margin, which indicates that our framework is widely applicable and successfully recovers the true supervision signals. Adapting the PeerPL idea to more IL algorithms such as GAIL [59] and DART [35], together with a rigorous analysis, is left as future work.

Figure 6: Comparison of imitation learning approaches on CartPole-v1 with an imperfect expert.

Table 2: BC from weak demonstrations. PeerBC successfully recovers better policies than the expert.

| Method | Pong | Boxing | Enduro | Freeway | Lift (↑) |
|---|---|---|---|---|---|
| Expert | 15.1 ± 6.6 | 67.5 ± 8.5 | 150.1 ± 23.0 | 21.9 ± 1.7 | - |
| Standard BC | 14.7 ± 3.2 | 56.2 ± 7.7 | 138.9 ± 14.1 | 22.0 ± 1.3 | -6.6% |
| PeerBC ($\xi$ = 0.2) | 18.8 ± 0.6 | 67.2 ± 8.4 | 177.9 ± 29.3 | 22.5 ± 0.6 | +11.3% |
| PeerBC ($\xi$ = 0.5) | 16.6 ± 4.0 | 75.6 ± 5.4 | 230.9 ± 73.0 | 22.4 ± 1.3 | +19.5% |
| PeerBC ($\xi$ = 1.0) | 16.7 ± 4.3 | 69.7 ± 4.7 | 230.4 ± 61.6 | 8.9 ± 4.9 | +2.0% |
| Fully converged PPO | 20.9 ± 0.3 | 89.3 ± 5.4 | 389.6 ± 216.9 | 33.3 ± 0.8 | - |

Analysis of the benefits of PeerBC  Similarly, the performance improvement of PeerBC might be coupled with multiple possible factors. (1) The imperfect expert model can be viewed as a noisy version of the fully converged agent, since there are less-visited states on which the model's selected actions contain noise. (2) The improvement might be brought about by biasing against high-entropy policies; thus PeerBC is useful when the true policy itself is deterministic. We provide more discussion of the second factor in Appendix D.5.

6.3 PeerCT for Standard Policy Co-training

Continuous Control/Atari: Finally, we verify the effectiveness of the PeerCT algorithm in the policy co-training setting [43]. This setting is more challenging since the states are partially observable and each agent needs to imitate the other agent's behavior, which is highly biased and imperfect. Note that we adopt exactly the same setting as [43] without any synthetic noise added. This implies the potential of our approach to deal with natural noise in real-world applications. Following [43], we mask the first and second dimensions of the state vector, respectively, to create two views for co-training on the classic control games (Acrobot and CartPole). Similarly, on the Atari games the agent either removes all even-index coordinates of the state vector (view A) or removes all odd-index ones (view B).

Table 3: Comparison with single-view training and CoPiEr [43] on standard policy co-training.

| Method | View | Acrobot | CartPole | Pong | Breakout |
|---|---|---|---|---|---|
| Single View | A | 136.6 ± 15.6 | 172.8 ± 5.5 | 17.8 ± 0.6 | 148.0 ± 16.5 |
| Single View | B | 126.4 ± 8.0 | 186.7 ± 8.1 | 17.7 ± 0.5 | 137.8 ± 12.5 |
| CoPiEr | A | 136.2 ± 5.2 | 174.1 ± 5.1 | 16.8 ± 0.5 | 107.5 ± 5.8 |
| CoPiEr | B | 131.5 ± 4.5 | 174.3 ± 5.4 | 16.5 ± 0.2 | 82.7 ± 6.9 |
| PeerCT | A | 87.0 ± 3.9 | 188.8 ± 2.7 | 20.5 ± 0.4 | 263.6 ± 36.0 |
| PeerCT | B | 87.1 ± 6.3 | 184.7 ± 3.9 | 20.4 ± 0.5 | 268.6 ± 33.6 |

As shown in Table 3 and Figure 5, the PeerCT algorithm consistently outperforms single-view training and the CoPiEr algorithm on both the control games ($\xi = 0.5$ in Figures 5a, 5b) and the Atari games ($\xi = 0.2$ in Figures 5c, 5d).
In most cases, our approach leads to faster convergence and lower generalization error than CoPiEr, showing that our way of leveraging information from the peer agent enables the recovery of useful knowledge from highly imperfect supervision.

7 Conclusion

We have proposed PeerPL, a weakly supervised policy learning framework that unifies a series of RL/BC problems with low-quality supervision signals. In PeerPL, instead of blindly memorizing the weak supervision, we evaluate a learning policy's correlated agreement with the weak supervision. We demonstrate how our method adapts to RL/BC and the hybrid co-training tasks, and provide analyses of the convergence rate and sample complexity. Our current theorems focus on a specific discrete noise model. Future work may extend them to more general noise scenarios and evaluate our method on real RL/BC systems, such as robotics and self-driving.

Acknowledgement

We sincerely thank the anonymous reviewers for their insightful suggestions. Our final version benefited substantially from the discussions with Reviewer 58fG. In particular, the tie-breaking analysis together with the code snippet was designed and contributed by Reviewer 58fG. This work is partially supported by the National Science Foundation (NSF) under grant IIS-2007951 and the Office of Naval Research under grant N00014-20-1-2240. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[2] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, and Lei Zhang. Solving Rubik's cube with a robot hand. CoRR, abs/1910.07113, 2019.
[3] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[4] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1-9. IEEE, 2018.
[5] Vibhu Agarwal, Tanya Podchiyska, Juan M Banda, Veena Goel, Tiffany I Leung, Evan P Minty, Timothy E Sweeney, Elsie Gyang, and Nigam H Shah. Learning statistical models of phenotypes using noisy labeled training data. Journal of the American Medical Informatics Association, 23(6):1166-1173, 2016.
[6] Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018.
[7] Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. In AAAI, 2020.
[8] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In IJCAI, pages 4705-4713, 2017.
[9] Joshua Romoff, Alexandre Piché, Peter Henderson, Vincent François-Lavet, and Joelle Pineau. Reward estimation for variance reduction in deep reinforcement learning.
In ICLR (Workshop). OpenReview.net, 2018.
[10] Robert Loftin, Bei Peng, James MacGlashan, Michael L Littman, Matthew E Taylor, Jeff Huang, and David L Roberts. Learning something from nothing: Leveraging implicit human feedback strategies. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication, pages 607-612. IEEE, 2014.
[11] Michael Laskey, Jonathan Lee, Roy Fox, Anca D. Dragan, and Ken Goldberg. DART: Noise injection for robust imitation learning. In CoRL, volume 78 of Proceedings of Machine Learning Research, pages 143-156. PMLR, 2017.
[12] Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imitation learning from imperfect demonstration. In International Conference on Machine Learning, pages 6818-6827. PMLR, 2019.
[13] Siddharth Reddy, Anca D. Dragan, and Sergey Levine. SQIL: Imitation learning via reinforcement learning with sparse rewards. In ICLR. OpenReview.net, 2020.
[14] Fumihiro Sasaki and Ryota Yamashina. Behavioral cloning from noisy demonstrations. In International Conference on Learning Representations, 2020.
[15] Xiaoxiao Guo, Shiyu Chang, Mo Yu, Gerald Tesauro, and Murray Campbell. Hybrid reinforcement learning with expert state sequences. In AAAI, pages 3739-3746. AAAI Press, 2019.
[16] Lisa Lee, Benjamin Eysenbach, Ruslan Salakhutdinov, Shixiang Gu, and Chelsea Finn. Weakly-supervised reinforcement learning for controllable behavior, 2020.
[17] Yang Liu and Hongyi Guo. Peer loss functions: Learning from noisy labels without knowing noise rates. ICML, abs/1910.03231, 2020.
[18] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196-1204, 2013.
[19] Clayton Scott, Gilles Blanchard, Gregory Handy, Sara Pozzi, and Marek Flaska. Classification with asymmetric label noise: Consistency and maximal denoising. In COLT, pages 489-511, 2013.
[20] Clayton Scott. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, 2015.
[21] Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014.
[22] Brendan van Rooyen and Robert C Williamson. Learning in the presence of corruption. arXiv preprint arXiv:1504.00091, 2015.
[23] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447-461, 2015.
[24] Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, and Bob Williamson. Learning from corrupted binary labels via class-probability estimation. In ICML, pages 125-134, 2015.
[25] Zhaowei Zhu, Tongliang Liu, and Yang Liu. A second-order approach to learning with instance-dependent label noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10113-10123, 2021.
[26] Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373-1411, 2021.
[27] Xuefeng Li, Tongliang Liu, Bo Han, Gang Niu, and Masashi Sugiyama. Provably end-to-end label-noise learning without anchor points. arXiv preprint arXiv:2102.02400, 2021.
[28] Zhaowei Zhu, Yiwen Song, and Yang Liu. Clusterability as an alternative to anchor points when learning with noisy labels. arXiv preprint arXiv:2102.05291, 2021.
[29] Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu.
Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pages 3-4, 2021.
[30] Jiaheng Wei and Yang Liu. When optimizing $f$-divergence is robust with label noise. In International Conference on Learning Representations, 2021.
[31] Yang Liu. Understanding instance-level label noise: Disparate impacts and treatments. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 6725-6735. PMLR, 2021.
[32] Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88-97, 1991.
[33] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661-668, 2010.
[34] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627-635, 2011.
[35] Michael Laskey, Jonathan Lee, Roy Fox, Anca D. Dragan, and Ken Goldberg. DART: Noise injection for robust imitation learning. In CoRL, volume 78 of Proceedings of Machine Learning Research, pages 143-156. PMLR, 2017.
[36] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.
[37] Siddharth Reddy, Anca D. Dragan, and Sergey Levine. SQIL: Imitation learning via reinforcement learning with sparse rewards. 2019.
[38] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In IJCAI, pages 4950-4957. ijcai.org, 2018.
[39] Juarez Monteiro, Nathan Gavenski, Roger Granada, Felipe Meneguzzi, and Rodrigo Coelho Barros. Augmented behavioral cloning from observation. CoRR, abs/2004.13529, 2020.
[40] Alessandro Giusti, Jerome Guzzi, Dan C. Ciresan, Fang-Lin He, Juan P. Rodriguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jurgen Schmidhuber, Gianni Di Caro, Davide Scaramuzza, and Luca M. Gambardella. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661-667, 2016.
[41] Niels Justesen and Sebastian Risi. Learning macromanagement in StarCraft from replays using deep learning. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pages 162-169. IEEE, 2017.
[42] Wael Farag and Zakaria Saleh. Behavior cloning for autonomous driving using convolutional neural networks. In 2018 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), 2018.
[43] Jialin Song, Ravi Lanka, Yisong Yue, and Masahiro Ono. Co-training for policy learning. In UAI, page 441. AUAI Press, 2019.
[44] Anirban Dasgupta and Arpita Ghosh. Crowdsourced judgement elicitation with endogenous proficiency. In Proceedings of the 22nd International Conference on World Wide Web, pages 319-330, 2013.
[45] Victor Shnayder, Arpit Agarwal, Rafael M. Frongillo, and David C. Parkes. Informed truthfulness in multi-task peer prediction. In EC, pages 179-196. ACM, 2016.
[46] Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E. Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping.
In IJCAI, pages 3352-3358. AAAI Press, 2015.
[47] Todd Hester, Matej Vecerík, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep q-learning from demonstrations. In AAAI, pages 3223-3230. AAAI Press, 2018.
[48] Yang Liu. The importance of understanding instance-level noisy labels. arXiv preprint arXiv:2102.05336, 2021.
[49] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
[50] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, volume 48, pages 1995-2003, 2016.
[51] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057-1063. The MIT Press, 1999.
[52] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. In Machine Learning, pages 279-292, 1992.
[53] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pages 2778-2787. PMLR, 2017.
[54] Hao Cheng, Zhaowei Zhu, Xingyu Li, Yifei Gong, Xing Sun, and Yang Liu. Learning with instance-dependent label noise: A sample sieve approach. In International Conference on Learning Representations, 2021.
[55] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICLR (Poster), 2016.
[56] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, pages 2094-2100, 2016.
[57] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
[58] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[59] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4572-4580, 2016.
[60] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, pages 278-287. Morgan Kaufmann, 1999.
[61] John Asmuth, Michael L. Littman, and Robert Zinkov. Potential-based shaping in model-based reinforcement learning. In AAAI, pages 604-609. AAAI Press, 2008.
[62] John Von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior (commemorative edition). Princeton University Press, 2007.
[63] Tommi S. Jaakkola, Michael I. Jordan, and Satinder P. Singh. Convergence of stochastic iterative dynamic programming algorithms. In NIPS, pages 703-710, 1993.
[64] John N. Tsitsiklis. Asynchronous stochastic approximation and q-learning. Machine Learning, 16(3):185-202, 1994.
[65] Michael J. Kearns and Satinder P. Singh. Finite-sample convergence rates for q-learning and indirect algorithms. In NIPS, pages 996-1002, 1998.
[66] Michael J. Kearns and Satinder P. Singh. Bias-variance error bounds for temporal difference updates. In COLT, pages 142-147, 2000.
[67] Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng.
A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In IJCAI, pages 1324-1331, 1999.
[68] Sham Machandranath Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, 2003.