Published as a conference paper at ICLR 2021

BEHAVIORAL CLONING FROM NOISY DEMONSTRATIONS

Fumihiro Sasaki & Ryota Yamashina
Ricoh Company, Ltd.
{fumihiro.fs.sasaki,ryohta.yamashina}@jp.ricoh.com

ABSTRACT
We consider the problem of learning an optimal expert behavior policy given noisy demonstrations that contain observations of both optimal and non-optimal expert behaviors. Popular imitation learning algorithms, such as generative adversarial imitation learning, assume that (clean) demonstrations are given from optimal expert policies only, and thus often fail to imitate the optimal expert behaviors given the noisy demonstrations. Prior works that address the problem require (1) learning policies through environment interactions in the same fashion as reinforcement learning, and (2) annotating each demonstration with confidence scores or rankings. However, such environment interactions and annotations in real-world settings take an impractically long training time and significant human effort. In this paper, we propose an imitation learning algorithm to address the problem without any environment interactions or annotations associated with the non-optimal demonstrations. The proposed algorithm learns ensemble policies with a generalized behavioral cloning (BC) objective function in which we exploit another policy already learned by BC. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than those learned by BC.

1 INTRODUCTION
Imitation learning (IL) has become a widely used approach to obtaining autonomous robotic control systems. IL is often more applicable to real-world problems than reinforcement learning (RL), since providing expert demonstrations is often easier than designing the appropriate rewards that RL requires. Several IL methods involve RL (Ziebart et al., 2008; Ng et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016). Those IL methods inherit the sample complexity of RL in terms of environment interactions during training. This complexity restricts their applicability to real-world problems, since environment interactions in real-world settings often take a long time and can cause damage to the robot or the environment. Therefore, we are interested in IL methods that do not require environment interactions, such as behavioral cloning (BC) (Pomerleau, 1991), which learns an expert policy in a supervised fashion.

BC, like popular IL methods such as generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016), assumes the expert demonstrations are optimal. Unfortunately, it is often difficult to obtain optimal demonstrations for many real-world tasks, because the expert operating the robot often makes mistakes for various reasons, such as the difficulty of the task, difficulty in handling the controller, limited observability of the environment, or the presence of distractions. The mistakes include unnecessary and/or incorrect operations for achieving the tasks. Given such noisy expert demonstrations, which contain records of both optimal and non-optimal behavior, BC as well as the popular IL methods fail to imitate the optimal policy due to the optimality assumption on the demonstrations, as shown in (Wu et al., 2019). A naive solution to cope with the noisy demonstrations is discarding the non-optimal demonstrations among the ones that were already collected.
This screening process is often impractical because it involves significant human effort. Most recent IL works assume settings where a very limited number of clean expert demonstrations, composed only of optimal behavior records, are available. Those methods are also vulnerable to the noisy demonstrations due to the optimality assumption on the demonstrations. Thus they implicitly assume such an impractical screening process when applied to real-world problems, where a number of noisy demonstrations beyond the clean ones can easily be obtained. There have been IL methods addressing the noisy demonstrations. Instead of the screening process, they require annotating each demonstration with confidence scores (Wu et al., 2019) or rankings (Brown et al., 2019). Even though they cope well with the noisy demonstrations to obtain the optimal behavior policies, such annotation costs significant human effort, as the screening does. Hence, we desire IL methods that cope well with the noisy demonstrations, which can easily be obtained in real-world settings, without any screening or annotation processes associated with the non-optimal behaviors.

In this paper, we propose a novel imitation learning algorithm to address the noisy demonstrations. The proposed algorithm requires neither (1) environment interactions during training nor (2) screening or annotation processes associated with the non-optimality of the expert behaviors. Our algorithm learns ensemble policies with a generalized BC objective function in which we exploit another policy already learned by BC. Experimental results show that the proposed algorithm can learn policies that are much closer to the optimal than ones learned by BC.

2 RELATED WORKS
A wide variety of IL methods have been proposed in the last few decades. BC (Pomerleau, 1991) is the simplest IL method among those, and thus BC could be the first IL option when enough clean demonstrations are available. Ross & Bagnell (2010) theoretically pointed out a downside of BC referred to as compounding error: the small errors of learners trained by BC can compound over time and bring about a deterioration of their performance. On the other hand, experimental results in (Sasaki et al., 2018) show that BC, given clean demonstrations of sufficient amount, can easily obtain the optimal behavior even for complex continuous control tasks. Hence, the effect of the compounding error is negligible in practice if the amount of clean demonstrations is sufficient. However, even if the amount of demonstrations is large, BC cannot obtain the optimal policy given the noisy demonstrations, due to the optimality assumption on the demonstrations. Other widely used IL approaches are inverse reinforcement learning (IRL) (Ziebart et al., 2008; Ng et al., 2000; Abbeel & Ng, 2004) and adversarial imitation learning (AIL) (Ho & Ermon, 2016). Since those approaches also assume the optimality of the demonstrations, they are likewise unable to obtain the optimal policy given the noisy demonstrations, as shown in (Wu et al., 2019). As we will show in Section 6, our algorithm can successfully learn near-optimal policies if noisy demonstrations of sufficient amount are given.

There have been several works that address the noisy demonstrations (Wu et al., 2019; Brown et al., 2019; Tangkaratt et al., 2019; Kaiser et al., 1995; Grollman & Billard, 2012; Kim et al., 2013).
Those works address the noisy demonstrations by screening the non-optimal demonstrations with heuristic non-optimality assessments (Kaiser et al., 1995), by annotations associated with the non-optimality (Wu et al., 2019; Brown et al., 2019; Grollman & Billard, 2012), or by training through environment interactions (Kim et al., 2013; Wu et al., 2019; Brown et al., 2019; Tangkaratt et al., 2019). Our algorithm requires no screening processes, no annotations associated with the non-optimality, and no environment interactions during training.

Offline RL methods (Lange et al., 2012; Fujimoto et al., 2019; Kumar et al., 2020) train learner agents without any environment interactions and allow the training dataset to contain non-optimal trajectories, as in our problem setting. A drawback of offline RL methods for real-world applications is the requirement to design reward functions, which often involves significant human effort, since those methods assume that the reward for each state-action pair is known. As with standard IL methods, our algorithm does not require designing reward functions.

Disagreement-regularized imitation learning (DRIL) (Brantley et al., 2019) is a state-of-the-art IL algorithm which employs an ensemble of policies as our algorithm does. The aim of employing the ensemble differs between DRIL and our algorithm. DRIL uses the disagreement in predictions made by the policies in the ensemble to evaluate whether the states observed while training the learner are ones observed in the expert demonstrations. In contrast, our algorithm uses the ensemble to encourage the learner to take optimal actions at each state, as described in 5.3. In addition, DRIL fundamentally requires environment interactions during training whereas our algorithm does not.

3 PRELIMINARIES AND PROBLEM SETUP
In this work, we consider an episodic fixed-horizon Markov decision process (MDP) formalized as a tuple $\{S, A, P, R, d_0, T\}$, where $S$ is a set of states, $A$ is a set of possible actions agents can take, $P : S \times A \times S \to [0, 1]$ is the transition probability, $R : S \times A \to [0, 1]$ is a reward function, $d_0 : S \to [0, 1]$ is a distribution over initial states, and $T$ is the episode horizon. The agent's behavior is defined by a stochastic policy $\pi : S \times A \to [0, 1]$, and $\Pi$ denotes the set of stochastic policies. The expected one-step immediate reward for a policy $\pi$ given a state $s$ is defined as $R^\pi(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}[R(s, a)]$. Let $d^\pi_t$ and $d^\pi = \frac{1}{T}\sum_{t=1}^{T} d^\pi_t$ denote the distribution over states at time step $t$ and the average distribution over $T$ time steps induced by $\pi$, respectively. The distribution $d^\pi_1$ at the first step corresponds to $d_0$ for any $\pi$. When following a policy $\pi$ throughout an episode, the expected one-step immediate reward at time step $t$ and the expected $T$-step reward are defined as $R^\pi_t = \mathbb{E}_{s\sim d^\pi_t, a\sim\pi(\cdot|s)}[R(s, a)] = \mathbb{E}_{s\sim d^\pi_t}[R^\pi(s)]$ and $J(\pi, R) = \sum_{t=1}^{T} R^\pi_t = T\,\mathbb{E}_{s\sim d^\pi}[R^\pi(s)]$, respectively. We refer to $J(\pi, R)$ as the on-policy expected $T$-step reward. We also consider another $T$-step reward defined by $J_\beta(\pi, R) = T\,\mathbb{E}_{s\sim d^\beta}[R^\pi(s)]$, which we call the off-policy expected $T$-step reward, where $\beta \in \Pi$ is a policy that can differ from $\pi$.
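To make the on-policy vs. off-policy distinction concrete, the following is a minimal Monte Carlo sketch (ours, not from the paper) of estimating $J(\pi, R)$ and $J_\beta(\pi, R)$ in a gym-style episodic environment with the classic `reset`/`step` API. The `env.reward(s, a)` helper is hypothetical and is used only because the off-policy quantity evaluates $R^\pi(s)$ on states visited by $\beta$; the learning setting of this paper never observes $R$, so this is purely illustrative.

```python
import numpy as np

def estimate_returns(env, pi, beta, T, n_episodes=100):
    """Monte Carlo estimates of the on-policy T-step reward J(pi, R) and the
    off-policy T-step reward J_beta(pi, R). `pi(s)` and `beta(s)` are assumed
    to sample an action given a state."""
    on_policy, off_policy = [], []
    for _ in range(n_episodes):
        # On-policy: roll out pi itself, so states are drawn from d^pi_t.
        s, ret = env.reset(), 0.0
        for _ in range(T):
            s, r, done, _ = env.step(pi(s))
            ret += r
            if done:
                break
        on_policy.append(ret)

        # Off-policy: states are generated by beta, but the one-step reward is
        # evaluated for an action drawn from pi (R^pi(s) under d^beta_t).
        s, ret = env.reset(), 0.0
        for _ in range(T):
            ret += env.reward(s, pi(s))   # hypothetical helper exposing R(s, a)
            s, _, done, _ = env.step(beta(s))
            if done:
                break
        off_policy.append(ret)
    return np.mean(on_policy), np.mean(off_policy)
```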
In our problem setting, the reward function $R$ is not given. Instead, we observe noisy demonstrations. We refer to the agent that generates the noisy demonstrations as the noisy expert. The decision process thus becomes MDP$\setminus\{R\}$, as in common imitation learning settings, and our problem can be formalized as finding an optimal policy in MDP$\setminus\{R\}$. Here we refer to the true expert policy $\pi^*_e$ as the one that takes the optimal (thus not noisy) behavior in episodic tasks. We make the following four assumptions to further formalize our problem setting:

Assumption 1. The $T$-step expected rewards satisfy $J(\pi, R) \le J(\pi^*_e, R)$, $J_\beta(\pi, R) \le J_\beta(\pi^*_e, R)$, and $J_\beta(\pi^*_e, R) \le J(\pi^*_e, R)$ for any non-optimal policies $\pi, \beta \in \Pi \setminus \{\pi^*_e\}$.

Assumption 2. With a small probability $\epsilon$, which we call the non-optimal probability, the policy $\pi_e$ the noisy expert follows during demonstrations is sampled at each time step as $\pi_e = \pi \sim p_\Pi$ if $z \le \epsilon$ with $z \sim U(0, 1)$, and $\pi_e = \pi^*_e$ otherwise, where $p_\Pi$ is an unknown distribution over the set of policies, $z$ is a random variable, and $U(0, 1)$ is the uniform distribution on $[0, 1]$.

Assumption 3. The reward $R^{\pi_e}_t$ is at least zero if the noisy expert has followed a policy $\pi \in \Pi \setminus \{\pi^*_e\}$ once or more so far; otherwise $R^{\pi_e}_t = \mathbb{E}_{s\sim d^{\pi_e}_t}[\epsilon\,\mathbb{E}_{\pi\sim p_\Pi}[R^\pi(s)] + (1-\epsilon)R^{\pi^*_e}(s)]$.

Assumption 4. The sequence $\{R^{\pi_e}_1, \ldots, R^{\pi_e}_T\}$ has the monotonically decreasing property $R^{\pi_e}_t \ge R^{\pi_e}_{t+1}$.

Assumption 1 indicates that both the on-policy and off-policy expected $T$-step rewards following $\pi^*_e$ are always greater than or equal to those following any other policies. In other words, we assume the true expert policy is an optimal one in the MDP, and the agent following that policy behaves so that the expected immediate rewards at any states are maximized. Under Assumption 1, the problem we would like to solve in this work can be stated as learning a parameterized policy $\pi_\theta$ to maximize its on-policy expected $T$-step reward $J(\pi_\theta, R)$ towards $J(\pi^*_e, R)$.

Assumption 2 indicates that the noisy expert occasionally adopts non-optimal policies, which results in the noisy demonstrations, due to random events, such as the presence of distractions, associated with the random variable $z$. The noisy expert will visit states that would never be visited by the true expert if the noisy expert has followed non-optimal policies even once. Assumption 3 indicates that those states are less rewarded and their rewards are at least zero. Assumption 3 also indicates that the noisy demonstrations contain a number of episodes in which the noisy expert has reached the same state $s$ and has adopted both $\pi^*_e$ and $\pi \in \Pi \setminus \{\pi^*_e\}$ with the probability $\epsilon$. Assumption 4 indicates that, since the probability that the noisy expert consecutively follows $\pi^*_e$ decreases as the time step increases according to Assumption 2, the divergence between $d^{\pi_e}_t$ and $d^{\pi^*_e}_t$ grows as the time step $t$ increases, and thus the one-step expected immediate reward $R^{\pi_e}_t = \mathbb{E}_{s\sim d^{\pi_e}_t, a\sim\pi_e(\cdot|s)}[R(s, a)]$ decreases as $t$ increases.

4 ANALYSIS OF PERFORMANCE DETERIORATION
In this section, we first describe the BC objective in 4.1. Then we analyze why the learner trained by BC deteriorates in performance when given the noisy demonstrations, from the expected $T$-step reward maximization and KL-divergence minimization perspectives in 4.2 and 4.3, respectively.

4.1 BEHAVIORAL CLONING OBJECTIVE
Let $\pi_\theta \in \Pi$ be a learner policy parameterized by $\theta$ to be optimized by IL algorithms. The common BC objective is as follows:
$$ \arg\max_\theta \; \mathbb{E}_{s\sim d^{\pi_e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)]. \qquad (1) $$
The objective (1) aims to mimic the expert behavior that follows $\pi_e$. It can be interpreted as maximizing the expected one-step immediate reward $R^{\pi_\theta}(s)$ towards $R^{\pi_e}(s)$ at each state $s \sim d^{\pi_e}$. Since the state distribution $d^{\pi_e}$ is not induced by $\pi_\theta$, it can also be said that (1) maximizes the off-policy expected $T$-step reward $J_{\pi_e}(\pi_\theta, R)$ towards $J(\pi_e, R)$.
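As a concrete reference point for the analysis below, here is a minimal PyTorch-style sketch of objective (1), assuming the noisy demonstrations are given as batched tensors of states and actions and that `policy` is a module mapping states to a torch action distribution (e.g. a Gaussian for continuous control); the function names are illustrative, not taken from the paper's code.

```python
import torch

def bc_loss(policy, states, actions):
    """Negative of objective (1): the empirical negative log-likelihood of the
    demonstrated actions under the learner policy pi_theta."""
    dist = policy(states)                        # e.g. torch.distributions.Normal
    return -dist.log_prob(actions).sum(dim=-1).mean()

def bc_step(policy, optimizer, states, actions):
    """One gradient step of behavioral cloning on a minibatch."""
    optimizer.zero_grad()
    loss = bc_loss(policy, states, actions)
    loss.backward()
    optimizer.step()
    return loss.item()
```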
4.2 THE EXPECTED T-STEP REWARD MAXIMIZATION
We obtain a lower bound of the on-policy expected $T$-step reward for the noisy expert policy in almost the same way as Theorem 2.1 in (Ross & Bagnell, 2010), where a lower bound was shown for learner policies given the clean expert demonstrations.

Theorem 1. If Assumptions 1 - 4 hold, $J(\pi_e, R)$ has the following lower bound:
$$ J(\pi_e, R) \ge \frac{1}{T}\sum_{t=0}^{T-1}(1-\epsilon)^t \, \mathbb{E}_{\pi\sim p_\Pi}[J_{\pi_e}(\pi, R)]. \qquad (2) $$
The detailed derivation can be found in Appendix A.1.

Assume that the learner policy $\pi_\theta$ has a probability of non-optimal behavior of at most $\hat\epsilon = \epsilon + \zeta$ as the result of BC, where $\zeta \in [0, 1-\epsilon]$ is an additional probability of non-optimal behavior due to the loss remaining in (1). Note that $\zeta$ may be greater than zero, owing to the difficulty of optimizing (1), even if $\epsilon = 0$. The learner following $\pi_\theta$ with $\hat\epsilon$ can be regarded as another noisy expert who samples a policy at each time step as $\pi_\theta = \pi \sim p_{\pi_\theta}$ if $z \le \hat\epsilon$ with $z \sim U(0, 1)$, and $\pi_\theta = \pi^*_e$ otherwise, where $p_{\pi_\theta}$ is a (special) distribution from which the same policy is always sampled. By replacing $\epsilon$ and $p_\Pi$ with $\hat\epsilon$ and $p_{\pi_\theta}$ in Theorem 1, we obtain the following corollary.

Corollary 1. If Assumptions 1 - 4 hold and the policy $\pi_\theta$ has a probability of non-optimal behavior $\hat\epsilon = \epsilon + \zeta$, then $J(\pi_\theta, R)$ has the following lower bound:
$$ J(\pi_\theta, R) \ge \frac{1}{T}\sum_{t=0}^{T-1}(1-\hat\epsilon)^t \, J_{\pi_e}(\pi_\theta, R). \qquad (3) $$

Recall that the BC objective (1) maximizes $J_{\pi_e}(\pi_\theta, R)$. If $\hat\epsilon = 0$, Corollary 1 indicates that the on-policy expected $T$-step reward $J(\pi_\theta, R)$, which corresponds to the actual learner performance, is boosted by maximizing $J_{\pi_e}(\pi_\theta, R)$ through the optimization of the BC objective (1). On the other hand, if $\epsilon > 0$ and thus $\hat\epsilon > 0$, the first factor on the RHS of (3) becomes much smaller as $\epsilon$ becomes larger. Corollary 1 thus shows that the non-optimal probability $\epsilon$ of the noisy expert significantly negates the improvement of the learner performance $J(\pi_\theta, R)$ by BC, even if $\zeta$ can be sufficiently minimized through the optimization. Hence, the learner trained by BC is not able to boost its performance enough if the noisy demonstrations are given.
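To see how quickly the first factor on the RHS of (3) shrinks, the short calculation below (an illustration we add here, not an experiment from the paper) evaluates $\frac{1}{T}\sum_{t=0}^{T-1}(1-\hat\epsilon)^t$, which equals $(1-(1-\hat\epsilon)^T)/(\hat\epsilon T)$ for $\hat\epsilon > 0$, at a horizon of $T = 1000$.

```python
def bound_coefficient(eps_hat, T):
    """First factor on the RHS of (3): (1/T) * sum_{t=0}^{T-1} (1 - eps_hat)^t."""
    return sum((1.0 - eps_hat) ** t for t in range(T)) / T

T = 1000
for eps_hat in (0.0, 0.1, 0.3, 0.5):
    print(eps_hat, round(bound_coefficient(eps_hat, T), 4))
# eps_hat = 0.0 gives 1.0, 0.1 gives ~0.01, 0.3 gives ~0.0033, 0.5 gives ~0.002,
# so even a modest non-optimal probability almost removes the guarantee that
# maximizing J_{pi_e}(pi_theta, R) improves J(pi_theta, R).
```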
4.3 KL DIVERGENCE MINIMIZATION
Let $S^{\pi_e}$ be the set of states observed in the noisy demonstrations. $S^{\pi_e}$ can be thought of as the domain of the (empirical) state distribution $d^{\pi_e}$. $S^{\pi_e}$ can be defined with two sets of states as $S^{\pi_e} = S^{\pi_e}_{e} \cup S^{\pi_e}_{e+}$, where $S^{\pi_e}_{e}$ contains the states that are observed if the noisy expert has followed a policy $\pi \in \Pi \setminus \{\pi^*_e\}$ once or more so far in the episode, and $S^{\pi_e}_{e+}$ contains the states at which the noisy expert follows a policy $\pi \in \Pi \setminus \{\pi^*_e\}$ for the first time in the episode. Under Assumption 3, the rewards $R^{\pi_e}_t$ for the states $s \in S^{\pi_e}_{e}$ are at least zero, whereas $R^{\pi_e}_t = \mathbb{E}_{s\sim d^{\pi_e}_t}[\epsilon\,\mathbb{E}_{\pi\sim p_\Pi}[R^\pi(s)] + (1-\epsilon)R^{\pi^*_e}(s)]$ for the states $s \in S^{\pi_e}_{e+}$. Note that the noisy expert adopts $\pi \in \Pi \setminus \{\pi^*_e\}$ with probability $\epsilon$ at the states $s \in S^{\pi_e}_{e+}$. Let $d^{\pi_e}_{e}$ and $d^{\pi_e}_{e+}$ be the state distributions the noisy expert policy induces on $S^{\pi_e}_{e}$ and $S^{\pi_e}_{e+}$, respectively. Then we can define $d^{\pi_e}$ as a mixture of those distributions:
$$ d^{\pi_e}(s) = \alpha\, d^{\pi_e}_{e}(s) + \beta\, d^{\pi_e}_{e+}(s), \qquad (4) $$
where $\alpha$ and $\beta$ are the ratios at which the noisy expert entered states belonging to $S^{\pi_e}_{e}$ and $S^{\pi_e}_{e+}$ during the demonstrations, respectively, and $\alpha + \beta = 1$ is satisfied. Using Equation (4), an upper bound of the objective function in Equation (1) is derived as follows:
$$ \mathbb{E}_{s\sim d^{\pi_e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)] \le -\alpha\,\Omega_{e}(\theta) - \beta\,\Omega_{e+}(\theta), \qquad (5) $$
$$ \Omega_{e}(\theta) = \mathbb{E}_{s\sim d^{\pi_e}_{e}}[D_{KL}[\pi_e(\cdot|s)\,\|\,\pi_\theta(\cdot|s)]], \qquad (6) $$
$$ \Omega_{e+}(\theta) = \epsilon\,\mathbb{E}_{s\sim d^{\pi_e}_{e+},\,\pi\sim p_\Pi}[D_{KL}[\pi(\cdot|s)\,\|\,\pi_\theta(\cdot|s)]] + (1-\epsilon)\,\mathbb{E}_{s\sim d^{\pi_e}_{e+}}[D_{KL}[\pi^*_e(\cdot|s)\,\|\,\pi_\theta(\cdot|s)]], \qquad (7) $$
where $D_{KL}$ is the forward Kullback-Leibler (KL) divergence. The full derivation can be found in Appendix A.2. Inequality (5) shows that the BC objective (1) with the noisy demonstrations amounts to minimizing a sum of KL divergences. The first term on the RHS of (7) leads the learner to imitate some non-optimal behaviors, whereas the second term leads it to learn $\pi^*_e$ on the same states. Maximizing the RHS of (5), i.e., minimizing $\Omega_{e+}(\theta)$ in (7), is difficult because minimizing KL divergences with different target distributions at the same time is difficult in general. The first term on the RHS of (7) thus works as a noisy regularizer, with coefficient $\epsilon$, that confuses the learner about learning $\pi^*_e$. The difficulty in the optimization due to the noisy regularizer increases $\zeta$ as $\epsilon$ increases.

As mentioned in 4.1 and 4.2, BC maximizes $J_{\pi_e}(\pi_\theta, R)$ towards $J(\pi_e, R)$. Hence, minimizing $\Omega_{e}(\theta)$ in (6) corresponds to maximizing $\mathbb{E}_{s\sim d^{\pi_e}_{e}}[R^{\pi_\theta}(s)]$ towards $\mathbb{E}_{s\sim d^{\pi_e}_{e}}[R^{\pi_e}(s)]$. Since the rewards $R^{\pi_e}(s)$ are at least zero for the states $s \sim d^{\pi_e}_{e}$ according to Assumption 3 and the definition of $S^{\pi_e}_{e}$, $\mathbb{E}_{s\sim d^{\pi_e}_{e}}[R^{\pi_\theta}(s)]$ is only guaranteed to be at least zero by minimizing $\Omega_{e}(\theta)$. Hence $J_{\pi_e}(\pi_\theta, R)$ can stay close to zero as the ratio $\alpha$ increases, and $\alpha$ increases as the non-optimal probability $\epsilon$ increases. Thus, the larger the probability $\epsilon$ is, the more difficult it is to boost the learner performance by BC. If the influence of the noisy regularizer can be reduced, the probability that the learner follows $\pi^*_e$ at the states $s \in S^{\pi_e}_{e+}$ will increase. In addition, as that probability increases, the ratio (corresponding to $\alpha$) of the states $s \in S^{\pi_e}_{e}$ will decrease. Thus, it can be said that the more often the learner follows $\pi^*_e$ at the states $s \in S^{\pi_e}_{e+}$, the more reward $R^{\pi^*_e}(s)$ the learner obtains according to Assumption 3. To summarize the above analysis, reducing the influence of the noisy regularizer for the states $s \in S^{\pi_e}_{e+}$, which leads the learner to imitate some non-optimal behaviors, might boost the learner performance.

5 ALGORITHM
The analyses in Section 4 describe how the learner trained by standard BC deteriorates in performance when the noisy demonstrations are given. Based on both analyses in 4.2 and 4.3, the learner performance will be boosted if the learner imitates the optimal policy $\pi^*_e$ but not the non-optimal ones $\pi \in \Pi \setminus \{\pi^*_e\}$ for the states $s \in S^{\pi_e}_{e+}$. In other words, the learner performance will be boosted if $\hat\epsilon$ of the learner can be reduced. In this section, we first propose our algorithm, which avoids learning $\pi \in \Pi \setminus \{\pi^*_e\}$ while learning $\pi^*_e$, in 5.1. Then we describe how our algorithm avoids learning $\pi \in \Pi \setminus \{\pi^*_e\}$ from the mode-seeking and reward-maximization perspectives in 5.2 and 5.3, respectively. We lastly provide limitations of our algorithm in 5.4.

5.1 PROPOSED ALGORITHM
We consider a generalization of the BC objective as follows:
$$ \arg\max_\theta \; \mathbb{E}_{s\sim d^{\pi_e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)\,\hat R(s, a)], \qquad (8) $$
where $\hat R : S \times A \to [0, 1]$ denotes an arbitrary function which can differ from $R$. If $\hat R(s, a) = 1$ for all $(s, a) \in S \times A$, the objective (8) corresponds to the BC objective (1). If $\int_A \hat R(s, a)\,da = 1$ for all $s \in S$, then $\hat R(s, a)$ can be interpreted as weights for the action samples obtained from the demonstrations, so that actions are sampled according to their relative weights. The objective (8) can also be viewed as that of the off-policy actor-critic (Off-PAC) algorithm¹ (Degris et al., 2012) with reward function $\hat R(s, a)$ and a zero discount factor.

¹Although Off-PAC multiplies $\log \pi_\theta(a|s)$ by a density ratio $\pi_e(s|a)/\pi_\theta(s|a)$, $\pi_\theta(s|a)$ is empirically approximated to be one in popular off-policy RL algorithms such as DDPG (Lillicrap et al., 2015).

Let $\pi_{\theta_1}, \pi_{\theta_2}, \ldots, \pi_{\theta_K}$ be $K$ parameterized policies with different initial parameters $\theta_1, \theta_2, \ldots, \theta_K$, and let $\pi_\theta(a|s) = \sum_{k=1}^{K} \pi_{\theta_k}(a|s)/K$ denote an ensemble of the parameterized policies with parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_K\}$. Let $\pi_{\theta_{old}}$ be a parameterized policy with parameters $\theta_{old}$ that was already optimized with the noisy demonstrations. The main idea of our algorithm is to reuse the old policy $\pi_{\theta_{old}}$ as $\hat R(s, a)$ in the generalized BC objective (8):
$$ \arg\max_\theta \; \mathbb{E}_{s\sim d^{\pi_e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)\,\pi_{\theta_{old}}(a|s)]. \qquad (9) $$
The overview of our algorithm is described in Algorithm 1.

Algorithm 1 Behavioral Cloning from Noisy Demonstrations
1: Given the expert demonstrations $D$.
2: Set $\hat R(s, a) = 1$ for $(s, a) \in D$.
3: Split $D$ into $K$ disjoint sets $\{D_1, D_2, \ldots, D_K\}$.
4: for iteration = 1, M do
5:   for k = 1, K do
6:     Initialize parameters $\theta_k$.
7:     for l = 1, L do
8:       Sample a random minibatch of $N$ state-action pairs $(s_n, a_n)$ from $D_k$.
9:       Calculate the sampled gradient $\frac{1}{N}\sum_{n=1}^{N} \nabla_{\theta_k} \log \pi_{\theta_k}(a_n|s_n)\,\hat R(s_n, a_n)$.
10:      Update $\theta_k$ by gradient ascent using the sampled gradient.
11:    end for
12:  end for
13:  Copy $\pi_{\theta_{old}} \leftarrow \pi_\theta$.
14:  Set $\hat R(s, a) = \pi_{\theta_{old}}(a|s)$ for $(s, a) \in D$.
15: end for
16: return $\pi_\theta$.
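The following is a minimal PyTorch-style sketch of Algorithm 1 under simplifying assumptions: the demonstrations are in-memory tensors, each policy is a module returning a torch action distribution over continuous actions (e.g. a Gaussian policy like the one described in Appendix A.5), and `make_policy` is an illustrative constructor rather than code from the paper. The learning-rate scaling of Appendix A.5 is omitted.

```python
import copy
import torch

def train_bcnd(states, actions, make_policy, K=5, M=5, L=500, batch=128, lr=1e-4):
    """Sketch of Algorithm 1. `states` and `actions` are tensors holding the
    noisy demonstrations D; `make_policy()` returns a freshly initialized
    module whose forward pass maps states to a torch action distribution."""
    n = len(states)
    r_hat = torch.ones(n)                         # step 2: R_hat(s, a) = 1 on D
    chunks = torch.chunk(torch.randperm(n), K)    # step 3: split D into K disjoint sets

    ensemble = []
    for _ in range(M):                            # outer iterations
        ensemble = [make_policy() for _ in range(K)]   # step 6: re-initialize theta_k
        for k, policy in enumerate(ensemble):
            opt = torch.optim.Adam(policy.parameters(), lr=lr)
            idx = chunks[k]
            for _ in range(L):                    # steps 7-11: weighted BC on D_k
                b = idx[torch.randint(len(idx), (batch,))]
                logp = policy(states[b]).log_prob(actions[b]).sum(dim=-1)
                loss = -(logp * r_hat[b]).mean()  # gradient ascent on objective (8)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # steps 13-14: pi_theta_old <- pi_theta, then R_hat(s, a) = pi_theta_old(a|s)
        with torch.no_grad():
            old = [copy.deepcopy(p) for p in ensemble]
            densities = torch.stack([p(states).log_prob(actions).sum(dim=-1).exp()
                                     for p in old])
            r_hat = densities.mean(dim=0)         # ensemble average of the action densities
    return ensemble                               # pi_theta(a|s) = mean_k pi_theta_k(a|s)
```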
5.2 WEIGHTED ACTION SAMPLING FOR $\pi^*_e$ MODE SEEKING
Since $\pi_{\theta_{old}}$ satisfies $\int_A \pi_{\theta_{old}}(a|s)\,da = 1$ for all $s \in S$, $\pi_{\theta_{old}}$ can be interpreted as the weights for weighted action sampling. We explain below the weighted action sampling procedure of our algorithm on $S^{\pi_e}_{e+}$. Figure 1 depicts a toy example of the sampling procedure. The distribution of the noisy expert actions on $S^{\pi_e}_{e+}$ is a mixture of two distributions, as shown in Equation (7). If $\epsilon$ is sufficiently small, $\pi_\theta$ is optimized so that its mode is closer to that of $\pi^*_e$ than to that of $\pi \in \Pi \setminus \{\pi^*_e\}$, according to the mode-seeking properties of the forward KL divergence (Ghasemipour et al., 2020). Given the sampling weights $\pi_{\theta_{old}}(a|s) = \pi_\theta(a|s)$ for the empirical action samples, the weighted action distribution is distorted so that its mode also gets closer to the mode of $\pi^*_e$. By iteratively distorting the weighted action distribution with the same procedure, its mode fits near the mode of $\pi^*_e$. The weights for actions sampled from $\pi \in \Pi \setminus \{\pi^*_e\}$ eventually become much smaller, and thus the learner does not learn $\pi \in \Pi \setminus \{\pi^*_e\}$. The mode-seeking procedure of our algorithm is analogous to the mean shift algorithm (Fukunaga & Hostetler, 1975), in that the mode of $\pi_\theta$ shifts towards that of $\pi^*_e$ by minimizing the KL divergence between $\pi_\theta$ and the weighted action distribution.

Figure 1: A toy example of the weighted action sampling procedure at each iteration (iterations 1, 5, 10, and 15) in our algorithm when given a state $s \in S^{\pi_e}_{e+}$. On both rows, the horizontal lines are the action domains. The left and right dotted lines on the top row depict $\pi(a|s)$, $\pi \in \Pi \setminus \{\pi^*_e\}$, and $\pi^*_e(a|s)$, respectively. The dotted lines on the bottom row depict the mixture distribution $\pi_e(a|s) = \epsilon\pi(a|s) + (1-\epsilon)\pi^*_e(a|s)$ with $\epsilon = 0.4$. The solid lines on the top row depict $\pi_\theta(a|s)$ optimized with objective (8) at each iteration. The solid lines on the bottom row depict distributions that draw the actions already drawn by $\pi_e(a|s)$ in the noisy demonstrations according to the current importance weight $\pi_\theta(a|s)$ at each iteration. $\pi_\theta(a|s)$ is optimized at each iteration so that the weighted distribution at the previous iteration is the target distribution.

5.3 REWARD MAXIMIZATION
As with the Off-PAC objective, the objective (9) maximizes the expected (one-step) reward $\hat R(s, a) = \pi_{\theta_{old}}(a|s)$. Recall that the learner policy $\pi_\theta(a|s) = \sum_{k=1}^{K} \pi_{\theta_k}(a|s)/K$ is an ensemble of the parameterized policies in our algorithm. Following the work in (Perrone, 1993), we obtain
$$ \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}_{s\sim d^{\pi_e}, a\sim\pi_e(\cdot|s)}[\log \pi_{\theta_k}(a|s)\,\hat R(s, a)] \le \mathbb{E}_{s\sim d^{\pi_e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)\,\hat R(s, a)], \qquad (10) $$
where we use Jensen's inequality with the concavity of the logarithm: $\frac{1}{K}\sum_{k=1}^{K} \log \pi_{\theta_k}(a|s) \le \log \pi_\theta(a|s)$. Inequality (10) indicates that the ensemble of the policies $\pi_{\theta_1}, \pi_{\theta_2}, \ldots, \pi_{\theta_K}$, each of which was learned with (8), attains a value of the objective function in (8) greater than or equal to the average value over the policies in the ensemble. As mentioned in 5.2, $\hat R(s, a) = \pi_{\theta_{old}}(a|s)$ becomes higher near the mode of $\pi^*_e$. Thus, forming $\pi_\theta$ as the ensemble further encourages shifting its mode to that of $\pi^*_e$ and avoiding learning $\pi \in \Pi \setminus \{\pi^*_e\}$.

5.4 LIMITATIONS
Our algorithm has three limitations. First, it requires $K \times M$ times the computational cost of BC, where $M$ is the number of iterations in Algorithm 1. Second, the compounding error due to the probability of non-optimal behavior $\zeta$ still remains unless sufficient amounts of demonstrations are given. Lastly, $\pi_\theta$ fits to $\pi \in \Pi \setminus \{\pi^*_e\}$ rather than $\pi^*_e$ if the major mode of $\epsilon\pi(a|s) + (1-\epsilon)\pi^*_e(a|s)$ is nearer to the mode of $\pi(a|s)$ than to that of $\pi^*_e$. This may be caused by a higher kurtosis of $\pi(a|s)$ or by large values of $\epsilon$.

6 EXPERIMENTS
In our experiments, we aim to answer the following three questions:
Q1. Does our algorithm improve the learner performance more than BC given the noisy demonstrations?
Q2. Can the compounding error due to $\zeta$ be reduced as the number of noisy demonstrations increases?
Q3. Is our algorithm competitive with existing IL methods if both annotations associated with the non-optimality and environment interactions are allowed?

6.1 EXPERIMENTAL SETUP
To answer Q1 and Q2, we evaluated our algorithm against BC on four continuous control tasks simulated with the MuJoCo physics simulator (Todorov et al., 2012). We train an agent on each task by the proximal policy optimization (PPO) algorithm (Schulman et al., 2017) using the rewards defined in OpenAI Gym (Brockman et al., 2016), and use the resulting stochastic policy as the true expert policy $\pi^*_e$. We generate the noisy expert demonstrations using $\pi^*_e$ while randomly adopting non-optimal policies $\pi$ with the non-optimal probability $\epsilon$. The non-optimal policies $\pi$ are selected from uniform distributions $a \sim U(-u, u)$, Gaussian distributions $a \sim N(a^*, I)$ with $a^* \sim \pi^*_e(\cdot|s)$, or the deterministic policy $a = 0$, where $u \in \mathbb{R}^{|A|}$ denotes the all-ones vector and $I \in \mathbb{R}^{|A|\times|A|}$ denotes the identity matrix. $\epsilon$ is selected from {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}. The noisy expert takes actions following $\pi^*_e$ if $z \ge \epsilon$, and otherwise following $\pi$, which is fixed to the selected one throughout an episode, where $z \sim U(0, 1)$. Each noisy demonstration with the selected $\epsilon$ consists of $N$ state-action pairs, where $N$ is selected from {5000, 10000, 50000, 100000}. Then we perform our algorithm as well as BC to train the learners using each noisy demonstration. We also conducted the same experiments on four low-dimensional discrete control tasks (see Appendix A.4).
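Below is a minimal sketch (ours, not the authors' data-collection code) of the demonstration-generation protocol just described, assuming a gym-style environment with the classic `reset`/`step` API and a stochastic expert policy callable `expert(s)`.

```python
import numpy as np

def generate_noisy_demo(env, expert, eps, n_pairs, kind="uniform"):
    """Collect state-action pairs from a noisy expert: at each step the agent
    follows the true expert policy with probability 1 - eps and otherwise a
    non-optimal policy, whose type is fixed within an episode."""
    act_dim = env.action_space.shape[0]
    if kind == "uniform":                 # a ~ U(-u, u)
        non_optimal = lambda s: np.random.uniform(-1.0, 1.0, act_dim)
    elif kind == "gaussian":              # a ~ N(a*, I), a* ~ pi*_e(.|s)
        non_optimal = lambda s: expert(s) + np.random.normal(0.0, 1.0, act_dim)
    else:                                 # deterministic non-optimal policy a = 0
        non_optimal = lambda s: np.zeros(act_dim)

    demo = []
    while len(demo) < n_pairs:
        s, done = env.reset(), False
        while not done and len(demo) < n_pairs:
            a = expert(s) if np.random.rand() >= eps else non_optimal(s)
            demo.append((s, a))
            s, _, done, _ = env.step(a)
    return demo
```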
To answer Q3, we evaluated our algorithm against IC-GAIL (Wu et al., 2019), 2IWIL (Wu et al., 2019), T-REX (Brown et al., 2019), GAIL and DRIL on three continuous control tasks. IC-GAIL, 2IWIL and T-REX require both annotations associated with the non-optimality and environment interactions. GAIL and DRIL require environment interactions for training, but they do not address the noisy demonstration problem. The true expert policy $\pi^*_e$ is obtained in the same way as mentioned above. The non-optimal policies $\pi$ are fixed to $a \sim U(-u, u)$. We generate noisy expert demonstrations consisting of 10000 state-action pairs for each $\epsilon \in$ {0.05, 0.1, 0.15, ..., 1.0}. Then we perform our algorithm and the baselines using all the noisy demonstrations. The detailed description of this experimental setup can be found in Appendix A.3.

In both experiments, the performance of the learners is measured by the cumulative reward earned in an episode. The cumulative reward is normalized with those earned by $\pi^*_e$ and a random policy $a \sim U(-u, u)$, so that 1.0 and 0.0 indicate the performance of $\pi^*_e$ and of the random policy, respectively. We run five experiments on each task and setup, and measure the mean and standard deviation of the normalized cumulative rewards for each learner over the five experiments. In all experiments, we set the number of policies in the ensemble learner policy $\pi_\theta$ to $K = 5$ and the number of iterations to $M = 5$. The implementation details of our algorithm can be found in Appendix A.5.

6.2 RESULTS
Figure 2 depicts the experimental results against BC. Over all tasks, our algorithm obtains much better learner performance than BC-Single, which is a single (thus not ensemble) policy learned by BC. This suggests that the policies learned by our algorithm are closer to $\pi^*_e$ than those learned by BC. The compounding error due to $\zeta$ is expected to be reduced as the number of demonstrations increases. Whereas BC-Ensemble, which denotes the ensemble of policies learned by BC, yields significant performance gains over BC-Single, increasing the number of noisy demonstrations has little effect on the performance of the learner trained by BC-Ensemble, as shown in Figure 2(D). This indicates that BC-Ensemble cannot reduce the compounding error due to $\epsilon$. On the other hand, our algorithm can boost the learner performance up to that of $\pi^*_e$ as the number of demonstrations increases. This suggests that our algorithm can reduce the compounding error due to both $\epsilon$ and $\zeta$ if sufficient amounts of the noisy expert demonstrations are given, as is the case for BC with the clean expert demonstrations. The results with the deterministic non-optimal policy $\pi \in \Pi \setminus \{\pi^*_e\}$, which always takes the action $a = 0$, are worse than those with the other non-optimal policies. This corresponds to the limitation of our algorithm mentioned in 5.4, since the major mode of $\epsilon\pi(a|s) + (1-\epsilon)\pi^*_e(a|s)$ might be around $a = 0$. We also conducted ablation experiments where the number of policies $K$ is selected from {1, 5} in our algorithm; see Appendix A.6 for details. The ablation results show that the learner obtains better performance as $K$ increases. In addition, the performance of the learner trained by our algorithm is significantly better than that of BC-Single even when $K = 1$.
This suggests that our algorithm improves the learner performance not only by the ensemble approach but also by using the old policies $\pi_{\theta_{old}}$.

Figure 2: (A)-(C) The performance of policies vs. $\epsilon$ given 50000 state-action pairs of the noisy expert demonstrations on Ant-v2, HalfCheetah-v2, Hopper-v2, and Walker2d-v2, where the non-optimal policies $\pi \in \Pi \setminus \{\pi^*_e\}$ are (A) $U(-u, u)$, (B) $N(a^*, I)$ with $a^* \sim \pi^*_e(\cdot|s)$, and (C) the deterministic policy $a = 0$, respectively. (D) The performance of policies vs. the number of state-action pairs $N$ of the noisy demonstrations with $\epsilon = 0.3$, where $\pi(a|s) = U(-u, u)$. BC-Single is a policy learned by BC. BC-Ensemble is an ensemble of policies, each of which was learned by BC. Shaded regions indicate the standard deviation over five experiments.

Table 1 shows the experimental results against IC-GAIL, 2IWIL, T-REX, GAIL and DRIL. Over all tasks, 2IWIL and our algorithm can successfully obtain the true expert performance while the others cannot. This suggests that our algorithm can obtain results competitive with existing IL methods even though the annotations and the environment interactions are not used.

Table 1: The experimental results against IL methods that require the environment interactions (mean and standard deviation of the normalized cumulative rewards).

                 IC-GAIL         2IWIL           T-REX           GAIL            DRIL            Ours
Ant-v2           0.631 ± 0.162   1.042 ± 0.021   0.586 ± 0.124   0.003 ± 0.004   1.071 ± 0.023   1.055 ± 0.053
HalfCheetah-v2   0.941 ± 0.103   1.024 ± 0.059   0.001 ± 0.113   0.106 ± 0.003   0.065 ± 0.006   1.093 ± 0.092
Hopper-v2        1.233 ± 0.152   1.223 ± 0.135   0.441 ± 0.219   0.000 ± 0.001   0.910 ± 0.099   1.003 ± 0.045

7 CONCLUSION
In this paper, we proposed an imitation learning algorithm to cope with noisy expert demonstrations. Experimental results showed that our algorithm can learn behavior policies that are much closer to the true expert policies than those learned by BC. Since our algorithm copes well with the noisy expert demonstrations while requiring neither environment interactions nor annotations associated with the non-optimal demonstrations, it is more applicable to real-world problems than the prior works. Although our algorithm has a few limitations, as mentioned in 5.4, we believe the analysis of performance deterioration detailed in Section 4 is a step towards solving the noisy demonstration problem. In future work, we will consider the setting where the probability of non-optimal behavior is state-dependent, which occurs in the real world more often than the state-independent case considered in this paper.

REFERENCES
Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on machine learning, pp. 1, 2004.
Kianté Brantley, Wen Sun, and Mikael Henaff. Disagreement-regularized imitation learning. In International Conference on Learning Representations, 2019.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Daniel S Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations.
arXiv preprint arXiv:1904.06387, 2019.
Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052-2062, 2019.
Keinosuke Fukunaga and Larry Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32-40, 1975.
Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pp. 1259-1277. PMLR, 2020.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249-256, 2010.
Daniel H Grollman and Aude G Billard. Robot learning from failed demonstrations. International Journal of Social Robotics, 4(4):331-342, 2012.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565-4573, 2016.
Michael Kaiser, Holger Friedrich, and Rüdiger Dillmann. Obtaining good performance from a bad teacher. In Programming by Demonstration vs. Learning from Examples Workshop at ML, volume 95, 1995.
Beomjoon Kim, Amir-massoud Farahmand, Joelle Pineau, and Doina Precup. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859-2867, 2013.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pp. 45-73. Springer, 2012.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pp. 2, 2000.
Michael Peter Perrone. Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, 1993.
Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88-97, 1991.
Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661-668, 2010.
Fumihiro Sasaki, Tetsuya Yohira, and Atsuo Kawaguchi. Sample efficient imitation learning for continuous control. In International Conference on Learning Representations, 2018.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Voot Tangkaratt, Bo Han, Mohammad Emtiyaz Khan, and Masashi Sugiyama. VILD: Variational imitation learning with diverse-quality demonstrations. arXiv preprint arXiv:1909.06769, 2019.
Emanuel Todorov, Tom Erez, and Yuval Tassa.
MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.
Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imitation learning from imperfect demonstration. arXiv preprint arXiv:1901.09387, 2019.
Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433-1438. Chicago, IL, USA, 2008.

A.1 DETAILED DERIVATION OF THEOREM 1
Proof. Let $q_t = (1-\epsilon)^t$ denote the probability that the noisy expert consecutively follows $\pi^*_e$ in the first $t$ steps, and let $\chi = \sum_{t=1}^{T} q_{t-1}$ denote the sum of $q_{t-1}$ over time steps. Then we obtain:
$$ J(\pi_e, R) \ge \sum_{t=1}^{T} \left[ q_{t-1}\, R^{\pi_e}_t + (1 - q_{t-1}) \cdot 0 \right] \qquad (11) $$
$$ \ge \frac{\chi}{T} \sum_{t=1}^{T} \mathbb{E}_{s\sim d^{\pi_e}_t}\left[ \epsilon\,\mathbb{E}_{\pi\sim p_\Pi}[R^\pi(s)] + (1-\epsilon)R^{\pi^*_e}(s) \right] \qquad (12) $$
$$ = \frac{\chi}{T} \left( \epsilon\,\mathbb{E}_{\pi\sim p_\Pi}[J_{\pi_e}(\pi, R)] + (1-\epsilon)J_{\pi_e}(\pi^*_e, R) \right) $$
$$ \ge \frac{\chi}{T} \left( \epsilon\,\mathbb{E}_{\pi\sim p_\Pi}[J_{\pi_e}(\pi, R)] + (1-\epsilon)\,\mathbb{E}_{\pi\sim p_\Pi}[J_{\pi_e}(\pi, R)] \right) \qquad (13) $$
$$ = \frac{1}{T} \sum_{t=0}^{T-1} (1-\epsilon)^t\, \mathbb{E}_{\pi\sim p_\Pi}[J_{\pi_e}(\pi, R)]. $$
The first inequality (11) follows from Assumptions 2 and 3. The second inequality (12) follows from Chebyshev's sum inequality together with the monotonically decreasing property of Assumption 4. The third inequality (13) follows from Assumption 1: $J_\beta(\pi, R) \le J_\beta(\pi^*_e, R)$ for any $\pi, \beta \in \Pi \setminus \{\pi^*_e\}$.

A.2 DETAILED DERIVATION OF THE KL DIVERGENCES
From the definition in (4), we obtain:
$$ \mathbb{E}_{s\sim d^{\pi_e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)] = \alpha\,\mathbb{E}_{s\sim d^{\pi_e}_{e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)] \qquad (14) $$
$$ \qquad + \beta\,\mathbb{E}_{s\sim d^{\pi_e}_{e+}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)]. \qquad (15) $$
The forward Kullback-Leibler (KL) divergence $D_{KL}$ between $\pi_e$ and $\pi_\theta$ over a state distribution $d^{\pi_e}$ can be written as $\mathbb{E}_{s\sim d^{\pi_e}}[D_{KL}(\pi_e(\cdot|s)\|\pi_\theta(\cdot|s))] = -\mathbb{E}_{s\sim d^{\pi_e}}[\mathbb{E}_{a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)] + H[\pi_e(\cdot|s)]]$, where $H$ denotes the entropy. Since $H[\pi_e(\cdot|s)]$ always takes a positive value and is not associated with $\theta$, we obtain the inequality $\mathbb{E}_{s\sim d^{\pi_e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)] \le -\mathbb{E}_{s\sim d^{\pi_e}}[D_{KL}(\pi_e(\cdot|s)\|\pi_\theta(\cdot|s))]$. The same goes for the term (14):
$$ \alpha\,\mathbb{E}_{s\sim d^{\pi_e}_{e}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)] \le -\alpha\,\mathbb{E}_{s\sim d^{\pi_e}_{e}}[D_{KL}(\pi_e(\cdot|s)\|\pi_\theta(\cdot|s))]. \qquad (16) $$
Since $\pi_e$ adopts both $\pi^*_e$ and $\pi \in \Pi \setminus \{\pi^*_e\}$ following the probability $\epsilon$, the term (15) can be expanded as:
$$ \beta\,\mathbb{E}_{s\sim d^{\pi_e}_{e+}, a\sim\pi_e(\cdot|s)}[\log \pi_\theta(a|s)] = \beta\,\mathbb{E}_{s\sim d^{\pi_e}_{e+}}\left[ \epsilon\,\mathbb{E}_{\pi\sim p_\Pi, a\sim\pi(\cdot|s)}[\log \pi_\theta(a|s)] + (1-\epsilon)\,\mathbb{E}_{a\sim\pi^*_e(\cdot|s)}[\log \pi_\theta(a|s)] \right] $$
$$ \le -\beta\left( \epsilon\,\mathbb{E}_{s\sim d^{\pi_e}_{e+}, \pi\sim p_\Pi}[D_{KL}(\pi(\cdot|s)\|\pi_\theta(\cdot|s))] + (1-\epsilon)\,\mathbb{E}_{s\sim d^{\pi_e}_{e+}}[D_{KL}(\pi^*_e(\cdot|s)\|\pi_\theta(\cdot|s))] \right). $$

A.3 DETAILED DESCRIPTION OF THE EXPERIMENTAL SETUP
We annotate confidence scores for the noisy demonstrations so that the confidence is one if the demonstrations are obtained with $\epsilon = 0$ and zero otherwise. The confidence scores are used by IC-GAIL as well as 2IWIL. We use publicly available code² for the implementation of both IC-GAIL and 2IWIL. We follow the training procedure of both methods as described in Section 5 of (Wu et al., 2019). We annotate rankings for the noisy demonstrations so that smaller $\epsilon$ corresponds to a higher ranking. Then we train the learner by T-REX given the ranked demonstration data.
²https://github.com/kristery/Imitation-Learning-from-Imperfect-Demonstration
We use publicly available code³ for the implementation of T-REX. For training the learner with GAIL and DRIL, we use all noisy demonstrations without any screening process. We use publicly available code⁴ for the implementation of GAIL and DRIL.
³https://github.com/hiwonjoon/ICML2019-TREX
⁴https://github.com/xkianteb/dril

A.4 EXPERIMENTAL RESULTS ON DISCRETE CONTROL TASKS
Figure 3 shows the experimental results on four discrete control tasks. Over all tasks, our algorithm obtains much better results than BC.

Figure 3 (panels: CartPole-v1, Acrobot-v1, MountainCar-v0, LunarLander-v2; vertical axis: normalized cumulative reward; horizontal axis: probability of non-optimal behavior): The performance of policies vs. $\epsilon$ given 50000 state-action pairs of the noisy expert demonstrations, where the non-optimal policies $\pi \in \Pi \setminus \{\pi^*_e\}$ select actions uniformly at random. BC-Single is a policy learned by BC. BC-Ensemble is an ensemble of policies, each of which was learned by BC. Shaded regions indicate the standard deviation over five experiments.

A.5 IMPLEMENTATION DETAILS OF OUR ALGORITHM
We implement our algorithm using $K$ neural networks with two hidden layers to represent the policies $\pi_{\theta_1}, \pi_{\theta_2}, \ldots, \pi_{\theta_K}$ in the ensemble. The input to the networks is a vector representation of the state. Each neural network has 100 hidden units in each hidden layer followed by a hyperbolic tangent nonlinearity, and the dimensionality of its final output corresponds to that of the action space. The final output is followed by a softmax function in the discrete control tasks. For the continuous control tasks, the final output represents the mean of a Gaussian policy $\pi_{\theta_k} = N(\mu_{\theta_k}(s), \sigma^2_{\theta_k})$, where $\sigma^2_{\theta_k}$ is implemented as a trainable vector independent of the networks. The neural network architecture for the policy trained by BC is the same as that of a single policy in our algorithm. We employ Adam (Kingma & Ba, 2014) for learning the parameters with a learning rate of $\eta \cdot 10^{-4}$, where $\eta = K / \sum_{k=1}^{K} \pi_{\theta_k^{old}}(\mu_{\theta_k^{old}}(s)|s)$ is a scaling parameter. The parameter $\eta$ scales $\hat R = \pi_{\theta_{old}}(a|s)$ to avoid slow training due to small values of $\pi_{\theta_{old}}(a|s)$. The parameters in all layers are initialized by Xavier initialization (Glorot & Bengio, 2010). The mini-batch size and the number of training epochs are 128 and 500, respectively.
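A PyTorch-style sketch of a single ensemble member for the continuous control tasks, matching the architecture described above (two tanh hidden layers of 100 units, a Gaussian mean head, and a trainable state-independent variance); class and argument names are our own, and the Adam learning-rate scaling by $\eta$ is omitted.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """One member pi_{theta_k} of the ensemble for continuous control: a
    two-hidden-layer tanh MLP outputs the Gaussian mean, and the variance is a
    trainable vector independent of the network, as described in A.5."""
    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        for m in self.mean:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)   # Xavier initialization
                nn.init.zeros_(m.bias)

    def forward(self, states):
        return torch.distributions.Normal(self.mean(states), self.log_std.exp())
```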
A.6 ABLATION EXPERIMENTS
We conducted ablation experiments to evaluate how the number of policies $K$ in the ensemble policy $\pi_\theta$ as well as the number of policies $K_{old}$ used in the old ensemble policy $\pi_{\theta_{old}}$ affect the performance. Table 2 summarizes the ablation results. Even if our algorithm uses $K = 1$ as BC-Single does, the results of our algorithm are better than BC. It indicates that the weighted action sampling described in 5.2 works to avoid learning the non-optimal policies without relying on the ensemble approach. The same goes with $K = 5$: our algorithm with $K = 5$ and $K_{old} = 1$ obtains much better performance than BC-Ensemble with $K = 5$. This result also supports that the weighted action sampling works. The learner performance with fixed $K$ increases as $K_{old}$ increases. Similarly, the learner performance with fixed $K_{old}$ increases as $K$ increases. It suggests that both $K$ and $K_{old}$ affect the performance of our algorithm.

Table 2: The performance of policies in the ablation experiment. The number of state-action pairs of the noisy expert demonstrations is $N = 50000$. The non-optimal policy $\pi \in \Pi \setminus \{\pi^*_e\}$ is $U(0, I)$. BC-Single is a policy learned by BC. BC-Ensemble is an ensemble of five policies, each of which was learned by BC. $K$ denotes the number of policies in the ensemble policy $\pi_\theta$. $K_{old}$ denotes the number of policies used in the old ensemble policy $\pi_{\theta_{old}}$. The mean and standard deviation of the normalized cumulative rewards over three experiments are reported.

                           Ant-v2          HalfCheetah-v2   Hopper-v2       Walker2d-v2     Average
BC-Single                  0.149 ± 0.001   0.305 ± 0.006    0.258 ± 0.017   0.071 ± 0.004   0.196 ± 0.105
BC-Ensemble (K = 5)        0.664 ± 0.043   0.459 ± 0.014    0.352 ± 0.028   0.279 ± 0.039   0.438 ± 0.167
Ours (K = 1, K_old = 1)    0.272 ± 0.184   0.505 ± 0.279    0.405 ± 0.206   0.281 ± 0.306   0.366 ± 0.111
Ours (K = 5, K_old = 1)    0.903 ± 0.048   1.057 ± 0.008    0.602 ± 0.130   0.345 ± 0.049   0.731 ± 0.320
Ours (K = 1, K_old = 5)    0.517 ± 0.015   0.907 ± 0.007    0.778 ± 0.093   0.414 ± 0.085   0.654 ± 0.227
Ours (K = 5, K_old = 5)    0.995 ± 0.053   1.058 ± 0.053    0.573 ± 0.079   0.364 ± 0.044   0.747 ± 0.334