# Imitation Learning from Imperfect Demonstration

Yueh-Hua Wu (1,2), Nontawat Charoenphakdee (3,2), Han Bao (3,2), Voot Tangkaratt (2), Masashi Sugiyama (2,3)

(1) National Taiwan University, Taiwan. (2) RIKEN Center for Advanced Intelligence Project, Japan. (3) The University of Tokyo, Japan. Correspondence to: Yueh-Hua Wu.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

## Abstract

Imitation learning (IL) aims to learn an optimal policy from demonstrations. However, such demonstrations are often imperfect since collecting optimal ones is costly. To learn effectively from imperfect demonstrations, we propose a novel approach that utilizes confidence scores, which describe the quality of demonstrations. More specifically, we propose two confidence-based IL methods, namely two-step importance weighting IL (2IWIL) and generative adversarial IL with imperfect demonstration and confidence (IC-GAIL). We show, both theoretically and empirically, that confidence scores given to only a small portion of sub-optimal demonstrations significantly improve the performance of IL.

## 1. Introduction

Imitation learning (IL) has attracted great interest because obtaining demonstrations is usually easier than designing a reward function, the signal that instructs agents to complete the desired task. Ill-designed reward functions often lead to unexpected behaviors (Everitt & Hutter, 2016; Dewey, 2014; Amodei et al., 2016). There are two main approaches to IL: behavioral cloning (BC) (Schaal, 1999), which adopts supervised learning to train an action predictor directly from demonstration data, and apprenticeship learning (AL), which attempts to find a policy that is better than the expert policy for a class of cost functions (Abbeel & Ng, 2004). Although BC can be trained directly with supervised learning, it cannot imitate the expert policy without a large amount of demonstration data because it ignores the transition dynamics of the environment (Ross et al., 2011). In contrast, AL approaches learn by interacting with the environment and optimize objectives such as maximum entropy (Ziebart et al., 2008). A state-of-the-art approach, generative adversarial imitation learning (GAIL), was proposed by Ho & Ermon (2016). The method learns an optimal policy by performing occupancy measure matching (Syed et al., 2008). An advantage of the matching approach is that it is robust to demonstrations generated from a stochastic policy. Based on the concept proposed in GAIL, variants have recently been developed for different problem settings (Li et al., 2017; Kostrikov et al., 2019).

Although GAIL is able to learn an optimal policy from optimal demonstrations, applying IL to real-world tasks requires taking the difficulty of obtaining such demonstration data into consideration. The works mentioned above usually assume that demonstrations from an optimal policy (either deterministic or stochastic) are available, which is rarely the case in practice: most accessible demonstrations are imperfect or even come from different policies. For instance, to train an agent to play basketball from game-play videos of the National Basketball Association, we should be aware that there are 14.3 turnovers per game on average (https://www.basketball-reference.com/leagues/NBA_stats.html), not to mention other kinds of mistakes that may not be recorded.
The difficulty of obtaining optimal demonstrations can be attributed to limited attention and the presence of distractions, which make it hard for humans to follow optimal policies all the time. As a result, some parts of the demonstrations may be optimal while others are not. To mitigate this problem, we propose to use confidence scores, which indicate the probability that a given demonstration is optimal. An intuitive way to collect confidence scores is crowdsourcing, where each instance is hard-labeled as optimal or non-optimal by multiple labelers; the proportion of labelers giving a particular label can then be used as the confidence, since it corresponds to the probability that the instance has this label. Another example is when it is difficult to judge whether a demonstration is optimal: a labeler may give a score discretized to 0, 0.1, ..., 1, which we can use as confidence. Since the obtained confidence may not be perfect, to show the practicality of our methods we use estimated confidence instead of the ground-truth values in our experiments. In addition, we conduct experiments with different levels of noise to further justify robustness to noisy labelers.

To further reduce the additional cost of learning an optimal policy, we consider a more realistic setting in which only part of the given demonstrations is equipped with confidence. The goal of this work is thus to utilize imperfect demonstrations where some are equipped with confidence and some are not (we refer to demonstrations without confidence as unlabeled demonstrations).

In this work, we consider the setting where the given imperfect demonstrations are a mixture of optimal and non-optimal demonstrations. This setting is common when demonstrations are collected via crowdsourcing (Serban et al., 2017; Hu et al., 2018; Shah et al., 2018) or from different sources such as videos (Tokmakov et al., 2017; Pathak et al., 2017; Supancic III & Ramanan, 2017; Yeung et al., 2017; Liu et al., 2018), where demonstrations can be generated by different policies. We propose two methods, two-step importance weighting imitation learning (2IWIL) and generative adversarial imitation learning with imperfect demonstration and confidence (IC-GAIL), both based on the idea of reweighting but from different perspectives. To utilize both confidence and unlabeled data, 2IWIL predicts confidence scores for the unlabeled data by optimizing a proposed objective based on empirical risk minimization (ERM) (Vapnik, 1998), which is flexible with respect to loss functions, models, and optimizers. IC-GAIL, on the other hand, instead of reweighting to the optimal distribution and performing GAIL with the reweighted samples, reweights to the non-optimal distribution and matches the optimal occupancy measure based on our mixture-distribution setting. Since the derived objective of IC-GAIL depends on the proportion of optimal demonstrations in the demonstration mixture, IC-GAIL empirically converges more slowly than 2IWIL but achieves better final performance, which forms a trade-off between the two methods. We show that the proposed methods are both theoretically and practically sound.

## 2. Related work

In this section, we provide a brief survey of methods that make use of non-optimal demonstrations and of semi-supervised classification with confidence data.
### 2.1. Learning from non-optimal demonstrations

Learning from non-optimal demonstrations is not new in the IL and reinforcement learning (RL) literature, but previous works utilized different kinds of information to learn a better policy. Distance minimization inverse RL (DM-IRL) (Burchfiel et al., 2016) utilizes a feature function of states and assumes that the true reward function is linear in the features. The feedback from humans is an estimate of the accumulated reward, which is harder to give than confidence because multiple reward functions may correspond to the same optimal policy. Semi-supervised IRL (SSIRL) (Valko et al., 2012) extends the IRL method of Abbeel & Ng (2004), where the reward function is learned by matching the feature expectations of the optimal demonstrations. The difference from Abbeel & Ng (2004) is that in SSIRL, optimal trajectories and sub-optimal trajectories from other performers are given, and a transductive SVM (Schölkopf et al., 1999) is used in place of the vanilla SVM to recognize optimal trajectories among the sub-optimal ones. In our setting, confidence scores are given instead of optimal demonstrations. DM-IRL and SSIRL are not suitable for high-dimensional problems due to their dependence on the linearity of reward functions and on good feature engineering.

### 2.2. Semi-supervised classification with confidence data

In our 2IWIL method, we train a probabilistic classifier with confidence and unlabeled data by optimizing a proposed ERM objective. There are similar settings such as semi-supervised classification (Chapelle et al., 2006), where a few hard-labeled data y ∈ {0, 1} and some unlabeled data are given. Zhou et al. (2014) proposed to use hard-labeled instances to estimate confidence scores for unlabeled samples using Gaussian mixture models and principal component analysis. Similarly, for an input instance x, Wang et al. (2013) obtained an upper bound on the confidence Pr(y = 1 | x) from hard-labeled instances and a kernel density estimator, and treated the upper bound as an estimate of the probabilistic class label. Another related scheme was considered by El-Zahhar & El-Gayar (2010), who treated soft labels z ∈ [0, 1] as fuzzy inputs and proposed a classification approach based on k-nearest neighbors; this method is difficult to scale to high-dimensional tasks and lacks theoretical guarantees. Ishida et al. (2018) proposed another scheme that trains a classifier only from positive data equipped with confidence. Our proposed method 2IWIL also trains a classifier with confidence scores of given demonstrations; nevertheless, 2IWIL can train a classifier from fewer confidence data with the aid of a large amount of unlabeled data.

## 3. Background

In this section, we provide the background on RL and GAIL.

### 3.1. Reinforcement learning

We consider the standard Markov decision process (MDP) (Sutton & Barto, 1998). An MDP is represented by a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P(s_{t+1} | s_t, a_t) is the transition density of state s_{t+1} at time step t+1 given action a_t taken in state s_t at time step t, R(s, a) is the reward function, and γ ∈ (0, 1) is the discount factor. A stochastic policy π(a|s) is a density over actions a given state s. The performance of π is evaluated in the γ-discounted infinite-horizon setting, as an expectation over the trajectories generated by π:

$$\mathbb{E}_{\pi}[R(s, a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right], \tag{1}$$

where the expectation on the right-hand side is taken over the densities p_0(s_0), P(s_{t+1} | s_t, a_t), and π(a_t | s_t) for all time steps t. Reinforcement learning algorithms (Sutton & Barto, 1998) aim to maximize Eq. (1) with respect to π.
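To make Eq. (1) concrete, the following minimal sketch estimates the discounted return of a policy by Monte Carlo rollouts with a truncated horizon. The toy transition, reward, and policy below are our own placeholders and are not part of the paper; only the discounting logic mirrors the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99  # discount factor

def step(state, action):
    """Toy stand-ins for the transition density P(s'|s,a) and reward R(s,a)."""
    next_state = state + action + rng.normal(scale=0.1)
    reward = -abs(next_state)          # toy reward: stay close to the origin
    return next_state, reward

def policy(state):
    """A toy stochastic policy pi(a|s)."""
    return float(np.clip(-0.5 * state + rng.normal(scale=0.1), -1.0, 1.0))

def discounted_return(horizon=2000):
    """One-rollout estimate of E_pi[sum_t gamma^t R(s_t, a_t)], Eq. (1)."""
    state, total, discount = 0.0, 0.0, 1.0
    for _ in range(horizon):           # truncates the infinite horizon
        action = policy(state)
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
    return total

returns = [discounted_return() for _ in range(20)]
print(f"estimated discounted return: {np.mean(returns):.3f} +/- {np.std(returns):.3f}")
```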
To characterize the distribution of state-action pairs generated by an arbitrary policy π, the occupancy measure is defined as follows.

Definition 3.1 (Puterman (1994)). Define the occupancy measure ρ_π : S × A → ℝ as

$$\rho_{\pi}(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s \mid \pi), \tag{2}$$

where Pr(s_t = s | π) is the probability density of state s at time step t when following policy π.

The occupancy measure ρ_π(s, a) can be interpreted as an unnormalized density of state-action pairs. It plays an important role in the IL literature because of the following one-to-one correspondence with the policy.

Theorem 3.2 (Theorem 2 of Syed et al. (2008)). Suppose ρ is the occupancy measure for π_ρ(a|s) ≜ ρ(s, a) / Σ_{a'} ρ(s, a'). Then π_ρ is the only policy whose occupancy measure is ρ.

In this work, we also define the normalized occupancy measure p(s, a):

$$p(s, a) \triangleq \frac{\rho(s, a)}{\sum_{s, a} \rho(s, a)} = \frac{\rho(s, a)}{\sum_{s, a} \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s \mid \pi)} = \frac{\rho(s, a)}{\sum_{t=0}^{\infty} \gamma^{t}} = (1 - \gamma)\,\rho(s, a).$$

The normalized occupancy measure can be interpreted as the probability density of state-action pairs that an agent experiences in the environment with policy π.

### 3.2. Generative adversarial imitation learning (GAIL)

The problem setting of IL is that, given trajectories {(s_i, a_i)}_{i=1}^{n} generated by an expert π_E, we are interested in optimizing the agent policy π_θ to recover the expert policy π_E using {(s_i, a_i)}_{i=1}^{n} and the MDP tuple without the reward function R. GAIL (Ho & Ermon, 2016) is a state-of-the-art IL method that performs occupancy measure matching to learn a parameterized policy. Occupancy measure matching minimizes the objective d(ρ_{π_E}, ρ_{π_θ}), where d is a distance function. The key idea behind GAIL is to use generative adversarial training to alternately estimate this distance and minimize it. To be precise, the distance is the Jensen-Shannon divergence (JSD), which is estimated by solving a binary classification problem. This leads to the following min-max optimization problem:

$$\min_{\theta} \max_{w} \; \mathbb{E}_{(s, a) \sim p_{\theta}}[\log D_w(s, a)] + \mathbb{E}_{(s, a) \sim p_{\mathrm{opt}}}[\log(1 - D_w(s, a))], \tag{3}$$

where p_θ and p_opt are the normalized occupancy measures of π_θ and π_opt, respectively. D_w is called a discriminator, and it can be shown that if the discriminator has infinite capacity, the global optimum of Eq. (3) corresponds to the JSD up to a constant (Goodfellow et al., 2014). To update the agent policy π_θ, GAIL treats -log D_w(s, a) as a reward signal, and the agent can be updated with RL methods such as trust region policy optimization (TRPO) (Schulman et al., 2015). A weakness of GAIL is that if the given demonstrations are non-optimal, the learned policy will be non-optimal as well.
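The following is a minimal PyTorch sketch, not the authors' implementation, of a single GAIL discriminator update and the induced policy reward under the sign convention of Eq. (3): D_w is pushed toward 1 on agent pairs and toward 0 on demonstration pairs, and the policy is then trained with reward -log D_w(s, a). The network size and the random batches standing in for p_θ and p_opt are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_act_dim = 8                                   # dim of concatenated (s, a)
disc = nn.Sequential(nn.Linear(obs_act_dim, 64), nn.Tanh(),
                     nn.Linear(64, 1))            # outputs the logit of D_w
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

agent_batch = torch.randn(128, obs_act_dim)       # (s, a) ~ p_theta (placeholder)
expert_batch = torch.randn(128, obs_act_dim)      # (s, a) ~ p_opt  (placeholder)

# Discriminator step: maximize E_agent[log D] + E_expert[log(1 - D)],
# i.e. minimize the BCE with label 1 for agent data and 0 for expert data.
loss = bce(disc(agent_batch), torch.ones(128, 1)) + \
       bce(disc(expert_batch), torch.zeros(128, 1))
opt.zero_grad(); loss.backward(); opt.step()

# Policy step (done with TRPO in the paper): treat -log D_w(s, a) as the reward.
with torch.no_grad():
    reward = -F.logsigmoid(disc(agent_batch))     # -log D_w(s, a)
print(reward.shape)  # one surrogate reward per agent state-action pair
```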
The second approach is IC-GAIL, which forgoes learning a classifier and learns an optimal policy by performing occupancy measure matching with the unlabeled demonstration data. Details of the derivations and proofs in this section can be found in the Appendix.

### 4.1. Problem setting

First, we formalize the problem setting considered in this paper. For conciseness, in what follows we write x in place of (s, a). Consider the case where the given imperfect demonstrations are sampled from an optimal policy π_opt and non-optimal policies Π = {π_i}_{i=1}^{n}. Denote the corresponding normalized occupancy measures of π_opt and Π by p_opt and {p_i}_{i=1}^{n}, respectively. The normalized occupancy measure p(x) of a state-action pair x is therefore the weighted sum of p_opt and {p_i}_{i=1}^{n}:

$$p(x) = \alpha\, p_{\mathrm{opt}}(x) + \sum_{i=1}^{n} \nu_i\, p_i(x) = \alpha\, p_{\mathrm{opt}}(x) + (1 - \alpha)\, p_{\mathrm{non}}(x),$$

where $\alpha + \sum_{i=1}^{n} \nu_i = 1$ and $p_{\mathrm{non}}(x) = \frac{1}{1-\alpha}\sum_{i=1}^{n} \nu_i\, p_i(x)$. We may further follow traditional classification notation by defining p_opt(x) ≜ p(x | y = +1) and p_non(x) ≜ p(x | y = -1), where y = +1 indicates that x is drawn from the occupancy measure of the optimal policy and y = -1 indicates the non-optimal policies. Here, α = Pr(y = +1) is the class-prior probability of the optimal policy. We further assume that an oracle labels state-action pairs in the demonstration data with confidence scores r(x) ≜ p(y = +1 | x). Based on this, the normalized occupancy measure of the optimal policy can be expressed by Bayes' rule as

$$p(x \mid y = +1) = \frac{r(x)\, p(x)}{\alpha}. \tag{4}$$

We assume that labeling state-action pairs by the oracle can be costly, so only some pairs are labeled with confidence. More precisely, we obtain the demonstration datasets

$$\mathcal{D}_c \triangleq \{(x_{c,i}, r_i)\}_{i=1}^{n_c} \overset{\mathrm{i.i.d.}}{\sim} q(x, r), \qquad \mathcal{D}_u \triangleq \{x_{u,i}\}_{i=1}^{n_u} \overset{\mathrm{i.i.d.}}{\sim} p(x),$$

where q(x, r) = p(x) p_r(r | x) and p_r(r_i | x) = δ(r_i - r(x)) is a delta distribution. Our goal is to handle the case where D_c is scarce: we want to learn the optimal policy π_opt from D_c and D_u jointly.
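The following minimal numerical check, ours and not from the paper, illustrates the problem setting with 1-D Gaussians standing in for p_opt and p_non: samples are drawn from the mixture p, the oracle confidence r(x) = p(y = +1 | x) is computed by Bayes' rule as in Eq. (4), and the class prior α is recovered as the mean confidence, since E_p[r(x)] = Pr(y = +1) = α.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.3                                     # true class prior Pr(y = +1)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Draw "state-action features" x from the mixture p = alpha*p_opt + (1-alpha)*p_non,
# with 1-D Gaussians standing in for the two normalized occupancy measures.
y = rng.random(100_000) < alpha
x = np.where(y, rng.normal(+1.0, 1.0, y.size), rng.normal(-1.0, 1.0, y.size))

# Oracle confidence r(x) = p(y = +1 | x) via Bayes' rule (Eq. (4) rearranged).
r = alpha * gauss_pdf(x, +1.0) / (alpha * gauss_pdf(x, +1.0)
                                  + (1 - alpha) * gauss_pdf(x, -1.0))

# Only a small portion is labeled with confidence (D_c); the rest is unlabeled (D_u).
n_c = 500
D_c_x, D_c_r = x[:n_c], r[:n_c]
D_u = x[n_c:]

alpha_hat = D_c_r.mean()     # class-prior estimate used by both proposed methods
print(f"true alpha = {alpha}, estimated alpha = {alpha_hat:.3f}")
```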
### 4.2. Two-step importance weighting imitation learning

We first propose an approach based on importance sampling. By Eq. (4), the GAIL objective in Eq. (3) can be rewritten as

$$\min_{\theta} \max_{w} \; \mathbb{E}_{x \sim p_{\theta}}[\log D_w(x)] + \mathbb{E}_{x, r \sim q}\!\left[\frac{r}{\alpha} \log(1 - D_w(x))\right]. \tag{5}$$

In practice, we may estimate the class prior α by the mean of the confidence scores. Although we could reweight the confidence data D_c to match the optimal distribution, the number of confidence data is limited, which makes accurate sample estimation difficult. To make full use of the unlabeled data, the key idea is to identify confidence scores for the unlabeled data D_u and to reweight both the confidence data and the unlabeled data. To this end, we train a probabilistic classifier from confidence and unlabeled data; we call this learning problem semi-conf (SC) classification.

Let us first consider a standard binary classification problem that classifies samples into p_opt (y = +1) and p_non (y = -1). Let g: ℝ^d → ℝ be a prediction function and ℓ: ℝ → ℝ_+ be a loss function. The optimal classifier can be learned by minimizing the following risk:

$$R_{\mathrm{PN}, \ell}(g) = \alpha\, \mathbb{E}_{x \sim p_{\mathrm{opt}}}[\ell(g(x))] + (1 - \alpha)\, \mathbb{E}_{x \sim p_{\mathrm{non}}}[\ell(-g(x))], \tag{6}$$

where PN stands for "positive-negative". However, since we only have samples from the mixture distribution p rather than samples drawn separately from p_opt and p_non, it is not straightforward to obtain a sample estimate of the risk in Eq. (6). To overcome this issue, the following theorem expresses the risk in an alternative form that can be estimated from D_c and D_u alone.

Theorem 4.1. The classification risk (6) can be equivalently expressed as

$$R_{\mathrm{SC}, \ell}(g) = \mathbb{E}_{x, r \sim q}\big[r\big(\ell(g(x)) - \ell(-g(x))\big) + (1 - \beta)\,\ell(-g(x))\big] + \mathbb{E}_{x \sim p}\big[\beta\, \ell(-g(x))\big], \tag{7}$$

where β ∈ [0, 1] is an arbitrary weight.

Thus, we can obtain a probabilistic classifier by minimizing Eq. (7), which can be estimated with only D_c and D_u. Once we obtain the prediction function g, we use it to assign confidence scores to D_u. For the prediction function g to estimate confidence accurately, the loss function ℓ in Eq. (7) should come from the class of strictly proper composite losses (Buja et al., 2005; Reid & Williamson, 2010). Many losses such as the squared loss, logistic loss, and exponential loss are proper composite. For example, if we obtain g*_log by minimizing the logistic loss ℓ_log(z) = log(1 + exp(-z)), we can obtain confidence scores by passing the prediction outputs through the sigmoid link, $\hat{p}(y = +1 \mid x) = [1 + \exp(-g^{*}_{\log}(x))]^{-1}$ (Reid & Williamson, 2010). On the other hand, the hinge loss cannot be applied, since it is not a proper composite loss and cannot estimate confidence reliably (Bartlett & Tewari, 2007; Reid & Williamson, 2010). Therefore, we can obtain a probabilistic classifier from a prediction function g learned with a strictly proper composite loss.

After obtaining the probabilistic classifier, we optimize the importance-weighted objective in Eq. (5), where both D_c and D_u are used to estimate the second expectation. We summarize this training procedure in Algorithm 1.

Algorithm 1 (2IWIL)
1. Input: expert trajectories and confidence D_c = {(x_{c,i}, r_i)}_{i=1}^{n_c}, D_u = {x_{u,i}}_{i=1}^{n_u}
2. Estimate the class prior by α̂ = (1/n_c) Σ_{i=1}^{n_c} r_i
3. Train a probabilistic classifier by minimizing Eq. (7) with β = n_u / (n_u + n_c)
4. Predict confidence scores {r̂_{u,i}}_{i=1}^{n_u} for {x_{u,i}}_{i=1}^{n_u}
5. for i = 0, 1, 2, ... do
6.   Sample trajectories {x_i}_{i=1}^{n_a} ∼ π_θ
7.   Update the discriminator parameters by maximizing Eq. (5)
8.   Update π_θ with reward -log D_w(x) using TRPO
9. end for

Next, we discuss the choice of the combination coefficient β. Since the empirical estimator $\hat{R}_{\mathrm{SC},\ell}(g)$ of Eq. (7) is unbiased for any β, it is natural to find the minimum-variance estimator among them. The following proposition gives the optimal β in terms of the estimator variance.

Proposition 4.2 (variance minimality). Let σ_cov denote the covariance between $n_c^{-1}\sum_{i=1}^{n_c} r_i\{\ell(g(x_{c,i})) - \ell(-g(x_{c,i}))\}$ and $n_c^{-1}\sum_{i=1}^{n_c} \ell(-g(x_{c,i}))$. For a fixed g, the estimator $\hat{R}_{\mathrm{SC},\ell}(g)$ has the minimum variance when

$$\beta = \mathrm{clip}_{[0,1]}\!\left(\frac{n_u}{n_c + n_u} + \frac{\sigma_{\mathrm{cov}}}{\mathrm{Var}(\ell(-g(x)))} \cdot \frac{n_c n_u}{n_c + n_u}\right),$$

where clip_{[l,u]}(v) ≜ max{l, min{v, u}}.

Thus, β lies in (0, 1) when the covariance σ_cov is not too large. Whenever β ≠ 0, the unlabeled data D_u helps the classifier by reducing the empirical variance when Eq. (7) is adopted. However, computing the variance-minimizing β is computationally inefficient, since it involves computing σ_cov and Var(ℓ(-g(x))). In practice, we use β = n_u / (n_c + n_u) in all experiments, assuming that the covariance is small enough.

In our preliminary experiments, we sometimes observed that the empirical estimate $\hat{R}_{\mathrm{SC},\ell}$ of Eq. (7) became negative and led to overfitting. We mitigate this phenomenon by employing a simple yet highly effective technique from Kiryo et al. (2017), which was proposed to solve a similar overfitting problem (see the Appendix for implementation details).
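A minimal PyTorch sketch of the classifier-training step of 2IWIL: it minimizes an empirical version of Eq. (7) with the logistic loss and β = n_u/(n_c + n_u), then reads off confidences through the sigmoid link. The random batches are placeholders, and the non-negative clamp is one plausible way to apply the correction of Kiryo et al. (2017) mentioned above; the exact grouping of clamped terms is our assumption, not the authors' recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 8                                             # dimension of x = (s, a)
g = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

def logistic_loss(z):                             # l(z) = log(1 + exp(-z))
    return F.softplus(-z)

x_c = torch.randn(256, d); r_c = torch.rand(256, 1)   # D_c: (x, r) pairs (placeholder)
x_u = torch.randn(2048, d)                            # D_u: unlabeled x (placeholder)
beta = x_u.shape[0] / (x_u.shape[0] + x_c.shape[0])   # beta = n_u / (n_c + n_u)

for _ in range(200):
    g_c, g_u = g(x_c), g(x_u)
    # Empirical Eq. (7), split into a positive-class part and a negative-class part.
    pos_part = (r_c * logistic_loss(g_c)).mean()
    neg_part = ((1.0 - beta - r_c) * logistic_loss(-g_c)).mean() \
               + beta * logistic_loss(-g_u).mean()
    # neg_part equals (1 - alpha) * E_non[l(-g)] >= 0 in expectation; clamp it
    # at zero as a Kiryo-style non-negative correction (our assumption).
    risk = pos_part + torch.clamp(neg_part, min=0.0)
    opt.zero_grad(); risk.backward(); opt.step()

# With the logistic loss, confidences follow from the sigmoid link.
with torch.no_grad():
    r_u_hat = torch.sigmoid(g(x_u))               # predicted p(y = +1 | x) for D_u
print(r_u_hat[:5].squeeze())
```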
#### 4.2.1. Theoretical analysis

Below, we show that the estimation error of Eq. (7) can be bounded. This means that its minimizer is asymptotically equivalent to the minimizer of the standard classification risk R_{PN,ℓ}, which provides a consistent estimator of p(y = +1 | x). We state the estimation error bound in terms of the Rademacher complexity (Bartlett & Mendelson, 2002); let $\mathfrak{R}_n(\mathcal{G})$ denote the Rademacher complexity of a function class $\mathcal{G}$ with sample size n.

Theorem 4.3. Let $\mathcal{G}$ be the hypothesis class we use. Assume that the loss function ℓ is ρ_ℓ-Lipschitz continuous and that there exists a constant C_ℓ > 0 such that $\sup_{x \in \mathcal{X},\, y \in \{\pm 1\}} |\ell(y\, g(x))| \le C_\ell$ for any $g \in \mathcal{G}$. Let $\hat{g} \in \arg\min_{g \in \mathcal{G}} \hat{R}_{\mathrm{SC},\ell}(g)$ and $g^{*} \in \arg\min_{g \in \mathcal{G}} R_{\mathrm{SC},\ell}(g)$. For δ ∈ (0, 1), with probability at least 1 - δ over repeated sampling of the data used to train $\hat{g}$,

$$R_{\mathrm{SC},\ell}(\hat{g}) - R_{\mathrm{SC},\ell}(g^{*}) \le 16\rho_{\ell}\big((3 - \beta)\,\mathfrak{R}_{n_c}(\mathcal{G}) + \beta\,\mathfrak{R}_{n_u}(\mathcal{G})\big) + C_{\delta, n_c, n_u},$$

where $C_{\delta, n_c, n_u}$ is a deviation term, depending on C_ℓ and δ, that vanishes at rate $O(1/\sqrt{n_c} + 1/\sqrt{n_u})$; its exact form is given in the Appendix.

Thus, we may safely obtain a probabilistic classifier by minimizing $\hat{R}_{\mathrm{SC},\ell}$, which gives a consistent estimator.

### 4.3. IC-GAIL

Since 2IWIL is a two-step approach that first gathers more confidence data and then performs importance sampling, errors may accumulate over the two steps and degrade performance. We therefore propose IC-GAIL, which is trained end-to-end and performs occupancy measure matching with the optimal normalized occupancy measure p_opt directly. Recall that p = α p_opt + (1 - α) p_non. Our key idea is to minimize the divergence between p and p′, where p′ = α p_θ + (1 - α) p_non. Intuitively, the divergence between p_θ and p_opt is minimized if that between p and p′ is minimized. For the Jensen-Shannon divergence, this intuition is justified by the following theorem.

Theorem 4.4. Let

$$V(\pi_\theta, D_w) = \mathbb{E}_{x \sim p}[\log(1 - D_w(x))] + \mathbb{E}_{x \sim p'}[\log D_w(x)],$$

and let C(π_θ) = max_w V(π_θ, D_w). Then V(π_θ, D_w) is maximized when D_w = p′ / (p + p′) (≜ D*_w), and its maximum value is C(π_θ) = -log 4 + 2 JSD(p ∥ p′). Thus, C(π_θ) is minimized if and only if p_θ = p_opt almost everywhere.

Theorem 4.4 implies that the optimal policy can be found by solving the following objective:

$$\min_{\theta} \max_{w} \; \mathbb{E}_{x \sim p}[\log(1 - D_w(x))] + \mathbb{E}_{x \sim p'}[\log D_w(x)]. \tag{8}$$

The expectation in the first term can be approximated from D_u, while the expectation in the second term is the weighted sum of expectations over p_θ and p_non. Data D_a = {x_{a,i}}_{i=1}^{n_a} sampled from p_θ can be obtained by executing the current policy π_θ. However, we cannot directly obtain samples from p_non, since it is unknown. To overcome this issue, we establish the following theorem.

Theorem 4.5. V(π_θ, D_w) can be transformed into Ṽ(π_θ, D_w), defined as

$$\tilde{V}(\pi_\theta, D_w) = \mathbb{E}_{x \sim p}[\log(1 - D_w(x))] + \alpha\, \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + \mathbb{E}_{x, r \sim q}[(1 - r) \log D_w(x)]. \tag{9}$$

We can approximate Eq. (9) given finite samples D_c, D_u, and D_a. In practice, we solve this optimization problem by alternating gradient updates with respect to θ and w.
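For concreteness, here is a minimal PyTorch sketch, not the released code, of the empirical discriminator objective in Eq. (9). The batches are placeholders: x_u plays the role of D_u (samples from p), (x_c, r_c) of D_c, and x_a of agent rollouts from p_θ; α is estimated by the mean confidence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 8
disc = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, 1))  # logit of D_w
opt_w = torch.optim.Adam(disc.parameters(), lr=3e-4)

x_u = torch.randn(512, d)                             # unlabeled demos (from p)
x_c, r_c = torch.randn(128, d), torch.rand(128, 1)    # confidence-labeled demos
x_a = torch.randn(512, d)                             # agent samples (from p_theta)
alpha_hat = r_c.mean()                                # class-prior estimate

log_D   = lambda x: F.logsigmoid(disc(x))             # log D_w(x)
log_1mD = lambda x: F.logsigmoid(-disc(x))            # log(1 - D_w(x))

# Empirical version of Eq. (9); the discriminator ascends it, so we minimize -V.
v_tilde = log_1mD(x_u).mean() \
        + alpha_hat * log_D(x_a).mean() \
        + ((1.0 - r_c) * log_D(x_c)).mean()
opt_w.zero_grad(); (-v_tilde).backward(); opt_w.step()

# The policy step then uses -log D_w(x) on agent samples as its surrogate reward.
with torch.no_grad():
    surrogate_reward = -log_D(x_a)
print(surrogate_reward.mean().item())
```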
#### 4.3.1. Theoretical analysis

In this subsection, we show that the estimation error of Eq. (9) can be bounded for a fixed agent policy π_θ. Let V̂(π_θ, D_w) be the empirical estimate of Eq. (9).

Theorem 4.6. Let W be the parameter space for training the discriminator and D_W ≜ {D_w | w ∈ W} be its hypothesis space. Assume that there exists a constant C_L > 0 such that |log D_w(x)| ≤ C_L and |log(1 - D_w(x))| ≤ C_L for any x ∈ X and w ∈ W, and that both log D_w(x) and log(1 - D_w(x)) have Lipschitz norms no more than ρ_L > 0 for any w ∈ W. For a fixed agent policy π_θ, let {x_{a,i}}_{i=1}^{n_a} be a sample generated from π_θ, $D_{\hat{w}} \in \arg\max_{w \in W} \hat{V}(\pi_\theta, D_w)$, and $D_{w^{*}} \in \arg\max_{w \in W} V(\pi_\theta, D_w)$. Then, for δ ∈ (0, 1), the following holds with probability at least 1 - δ:

$$V(\pi_\theta, D_{w^{*}}) - V(\pi_\theta, D_{\hat{w}}) \le 16\rho_{L}\big(\mathfrak{R}_{n_u}(\mathcal{D}_W) + \alpha\,\mathfrak{R}_{n_a}(\mathcal{D}_W) + \mathfrak{R}_{n_c}(\mathcal{D}_W)\big) + C'_{\delta, n_u, n_a, n_c},$$

where $C'_{\delta, n_u, n_a, n_c}$ is a deviation term, depending on C_L and δ, that vanishes at rate $O(1/\sqrt{n_u} + 1/\sqrt{n_a} + 1/\sqrt{n_c})$; its exact form is given in the Appendix.

Theorem 4.6 guarantees that the empirical estimate of Eq. (9) provides a consistent maximizer with respect to the original objective in Eq. (8) at each step of discriminator training.

#### 4.3.2. Practical implementation of IC-GAIL

Even though Eq. (9) is theoretically supported, when the class prior α is low the influence of the agent on the discriminator training becomes marginal. This issue can be mitigated by thresholding α in Eq. (9) as follows:

$$\min_{\theta} \max_{w} \; \mathbb{E}_{x \sim p}[\log(1 - D_w(x))] + \lambda\, \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + (1 - \lambda)\, \mathbb{E}_{x, r \sim q}\!\left[\frac{1 - r}{1 - \alpha} \log D_w(x)\right], \tag{10}$$

where λ = max{τ, α} and τ ∈ (0, 1]. The training procedure of IC-GAIL is summarized in Algorithm 2; a code-level sketch follows the listing. Note that when τ = 1, Eq. (10) reduces to Eq. (3) with the whole demonstration mixture treated as expert data, and thus learns a sub-optimal policy.

Algorithm 2 (IC-GAIL)
1. Input: expert trajectories, confidence, and weight threshold {x_{u,i}}_{i=1}^{n_u}, {(x_{c,i}, r_i)}_{i=1}^{n_c}, τ
2. Estimate the class prior by α̂ = (1/n_c) Σ_{i=1}^{n_c} r_i
3. Set λ = max{τ, α̂}
4. for i = 0, 1, 2, ... do
5.   Sample trajectories {x_i}_{i=1}^{n_a} ∼ π_θ
6.   Update the discriminator parameters by maximizing Eq. (10)
7.   Update π_θ with reward -log D_w(x) using TRPO
8. end for
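Below is a training-loop skeleton, our sketch rather than the released implementation, mirroring Algorithm 2 with the thresholded objective of Eq. (10). Environment rollouts and the TRPO policy step are stubbed out; the (1 - r)/(1 - α) reweighting inside the confidence term follows the form of Eq. (10) above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, tau = 8, 0.7
disc = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, 1))
opt_w = torch.optim.Adam(disc.parameters(), lr=3e-4)

x_u = torch.randn(2048, d)                          # D_u (placeholder)
x_c, r_c = torch.randn(512, d), torch.rand(512, 1)  # D_c (placeholder)
alpha_hat = r_c.mean().item()                       # line 2 of Algorithm 2
lam = max(tau, alpha_hat)                           # lambda = max{tau, alpha_hat}

def sample_agent_pairs(n):                          # stub for rolling out pi_theta
    return torch.randn(n, d)

log_D   = lambda x: F.logsigmoid(disc(x))
log_1mD = lambda x: F.logsigmoid(-disc(x))

for epoch in range(10):
    x_a = sample_agent_pairs(512)
    # Eq. (10): agent term weighted by lambda; confidence term weighted by
    # (1 - lambda) and normalized by (1 - alpha_hat).
    v = log_1mD(x_u).mean() \
        + lam * log_D(x_a).mean() \
        + (1.0 - lam) * (((1.0 - r_c) / (1.0 - alpha_hat)) * log_D(x_c)).mean()
    opt_w.zero_grad(); (-v).backward(); opt_w.step()

    with torch.no_grad():
        rewards = -log_D(x_a)                       # per-pair reward for the policy
    # ... update pi_theta with TRPO using `rewards` (omitted) ...
```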
### 4.4. Discussion

To understand the difference between 2IWIL and IC-GAIL, we discuss it from three perspectives: unlabeled data, confidence data, and the class prior.

Role of unlabeled data: Unlabeled data plays different roles in the two methods. In 2IWIL, unlabeled data reduces the variance of the empirical risk estimator, as shown in Proposition 4.2. In IC-GAIL, in addition to enabling more accurate estimation, the unlabeled data plays a role similar to guided exploration (Kang et al., 2018). We may view confidence information in the imperfect demonstration setting as analogous to a reward function, since both allow agents to learn an optimal policy in IL and RL, respectively. Likewise, scarce confidence data is analogous to a sparse reward function. Even though a small number of confidence data and a sparse reward function do not bias objectives such as Eqs. (5) and (1), they cause practical issues such as a lack of information to guide exploration. To mitigate this problem, we imitate the sub-optimal demonstrations and use the confidence information to refine the learned policy, similarly to Kang et al. (2018), who imitate a sub-optimal policy to guide RL algorithms in the sparse-reward setting.

Role of confidence data: In 2IWIL, confidence data is used both to train the classifier and to reweight toward p_opt, which results in a two-step training scheme in which error accumulates across the prediction phase and the occupancy measure matching phase. IC-GAIL instead compensates for the p_non portion of the given imperfect demonstrations by mimicking the composition of p. The advantage of IC-GAIL over 2IWIL is that its end-to-end training scheme avoids the prediction error.

Influence of the class prior α: In 2IWIL, the class prior in Eq. (5) serves as a normalizing constant so that the weight r(x)/α used for reweighting p to p_opt has unit mean; consequently, α does not affect the convergence of the agent policy. In IC-GAIL, on the other hand, the term with respect to the agent p_θ is directly scaled by α in Eq. (9). To see the influence, we can expand the reward function induced by the optimal discriminator:

$$-\log D^{*}_{w}(x) = -\log \frac{\frac{\alpha}{1-\alpha}\, p_{\theta} + p_{\mathrm{non}}}{\frac{\alpha}{1-\alpha}\,(p_{\mathrm{opt}} + p_{\theta}) + 2\, p_{\mathrm{non}}},$$

which shows that the agent term is scaled by α/(1 - α). The reward function therefore tends toward a constant when α is small, and the agent learns more slowly than in 2IWIL, where the reward function is -log(p_θ / (p_θ + p_opt)).

## 5. Experiments

In this section, we aim to answer the following questions. (1) Do 2IWIL and IC-GAIL allow agents to learn near-optimal policies when limited confidence information is given? (2) Are the methods robust when the given confidence is less accurate? (3) Does more unlabeled data result in better performance in terms of average return? These questions are discussed in Secs. 5.1, 5.2, and 5.3, respectively.

Setup. To collect demonstration data, we train an optimal policy (π_opt) using TRPO (Schulman et al., 2015) and select two intermediate policies (π_1 and π_2). The three policies are used to generate the same number of state-action pairs. In real-world tasks, the confidence should be given by human labelers; we simulate such labelers with a probabilistic classifier p′(y = +1 | x) pre-trained on demonstration data and randomly choose 20% of the demonstration data to be labeled with confidence scores r(x) = p′(y = +1 | x). We compare the proposed methods against three baselines. Denote D^x_c ≜ {x_{c,i}}_{i=1}^{n_c}, D^r_c ≜ {r_i}_{i=1}^{n_c}, and D^x_u ≜ D_u. GAIL (U+C) takes all state-action pairs as input without considering confidence. To show whether reweighting with Eq. (5) makes a difference, GAIL (C) and GAIL (Reweight) use the same state-action pairs D^x_c, but GAIL (Reweight) additionally utilizes the confidence information D^r_c for reweighting. The baselines and the proposed methods are summarized in Table 1.

Table 1. Comparison between the proposed methods (IC-GAIL and 2IWIL) and the baselines.

| Method | Input | Objective |
| --- | --- | --- |
| IC-GAIL | D_u, D_c | Eq. (9) |
| 2IWIL | D_u, D_c | Eq. (7) |
| GAIL (U+C) | D_u, D^x_c | Eq. (3) |
| GAIL (C) | D^x_c | Eq. (3) |
| GAIL (Reweight) | D_c | Eq. (5) |

To assess our methods, we conduct experiments on MuJoCo (Todorov et al., 2012). Each experiment is performed with five random seeds. The hyperparameter τ of IC-GAIL is set to 0.7 for all tasks. To show performance relative to the optimal policy that we try to imitate, the cumulative reward is normalized by that of the optimal policy and of a uniformly random policy, so that 1.0 indicates the optimal policy and 0.0 the random one. Due to the space limit, we defer implementation details, the performance of the optimal and random policies, the specification of each task, and the uncropped figures for Ant-v2 to the Appendix.
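The following small sketch, based on our reading of the setup rather than on the released code, shows the evaluation conventions used in the experiments: a pre-trained classifier stands in for human labelers, roughly 20% of the demonstration pairs keep their confidence scores, Gaussian noise optionally models imperfect labelers (Sec. 5.2), and returns are mapped to the [random, optimal] = [0, 1] scale used in the figures. Clipping the noisy confidences to [0, 1] is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_labels(x, pretrained_confidence, labeled_frac=0.2, noise_std=0.0):
    """Split demonstrations into a confidence-labeled set D_c and an unlabeled set D_u."""
    r = pretrained_confidence(x)                      # p'(y = +1 | x) from a classifier
    r = np.clip(r + rng.normal(0.0, noise_std, r.shape), 0.0, 1.0)  # clip: our assumption
    labeled = rng.random(len(x)) < labeled_frac       # ~20% keep their scores
    return (x[labeled], r[labeled]), x[~labeled]      # D_c, D_u

def normalized_return(ret, ret_random, ret_optimal):
    """Map raw returns onto the scale where 0.0 is random and 1.0 is optimal."""
    return (ret - ret_random) / (ret_optimal - ret_random)

# Toy usage with a fake feature matrix and a fake confidence model.
x = rng.normal(size=(1000, 8))
fake_conf = lambda x: 1.0 / (1.0 + np.exp(-x[:, 0]))  # placeholder classifier
(D_c_x, D_c_r), D_u = simulate_labels(x, fake_conf, noise_std=0.1)
print(len(D_c_x), len(D_u), normalized_return(2500.0, 100.0, 4000.0))
```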
### 5.1. Performance comparison

Figure 1. Learning curves of 2IWIL and IC-GAIL versus the baselines given imperfect demonstrations on HalfCheetah-v2, Ant-v2, Hopper-v2, Swimmer-v2, and Walker2d-v2. The x-axis is the number of training epochs and the shaded area indicates the standard error.

The average return against training epochs in Fig. 1 shows that the proposed IC-GAIL and 2IWIL outperform the other baselines by a large margin. Owing to the experimental setup above, the class prior of the optimal demonstration distribution is around 33%. To interpret the results, we emphasize that our experiments are in an incomplete-optimality setting, in which the confidence data alone are not enough to learn the optimal policy, as indicated by the GAIL (Reweight) baseline. Since the difficulty of each task varies, we use a different total number of demonstrations n_c + n_u for different tasks. Our contribution is that, in addition to the confidence, our methods are able to utilize the demonstration mixture (sub-optimal demonstrations) and learn near-optimal policies.

We observe that IC-GAIL converges more slowly than 2IWIL. As discussed in Section 4.4, this can be attributed to the fact that the agent term in Eq. (10) is scaled by 0.7 as specified by τ, which decreases the influence of the agent policy in updating the discriminator. The faster convergence of 2IWIL can be an advantage over IC-GAIL when interactions with the environment are expensive. Even though the objective of IC-GAIL becomes biased by not using the class prior α, it still converges to near-optimal policies in four tasks. In Walker2d-v2, the improvement of our methods is not as significant as in the other tasks. We conjecture that this is caused by insufficient confidence information: the GAIL (Reweight) baseline in Walker2d-v2 gradually converges to 0.2, whereas in the other tasks it achieves a performance of at least 0.4. In HalfCheetah-v2, we observe that the discriminator gets stuck in a local maximum in the middle of learning, which affects all methods significantly. The GAIL (Reweight) baseline surpasses GAIL (C) in all tasks, which shows that reweighting enables the agent to learn policies with higher average return; however, since the number of confidence instances is small, the information is not enough to recover the optimal policies. GAIL (U+C) is standard GAIL without confidence information; although this baseline uses the same number of demonstrations n_c + n_u as our proposed methods, the performance difference is significant due to our use of confidence.

### 5.2. Robustness to Gaussian noise in confidence

In practice, the oracle that gives confidence scores is typically a human labeler, who may not be able to label confidence accurately all the time. To investigate the robustness of our approaches to noise in the confidence scores, we conduct an additional experiment on Ant-v2 in which Gaussian noise is added to the confidence scores: r(x) = p′(y = +1 | x) + ε, where ε ∼ N(0, σ²).

Figure 2. Learning curves of the proposed methods with different standard deviations of Gaussian noise added to the confidence scores. The numbers in the legend (0.0, 0.1, 0.2) indicate the standard deviation of the Gaussian noise.

Fig. 2 shows the performance of our methods in this noisy-confidence scenario. Both methods are quite robust to noisy confidence, which suggests that they can cope with human labelers who do not always assign confidence scores correctly.

### 5.3. Influence of unlabeled data

In this experiment, we evaluate the performance of both 2IWIL and IC-GAIL with different numbers of unlabeled data to verify whether unlabeled data is useful.

Figure 3. Learning curves of the proposed methods with different numbers of unlabeled data. The numbers in the legend give the proportion of unlabeled data used as demonstrations; 1.0 corresponds to the data used in Fig. 1.

As shown in Fig. 3, the performance of both methods grows as the number of unlabeled data increases, which confirms our motivation that unlabeled data can improve the performance of imitation learning when confidence data is scarce. As discussed in Sec. 4.4, the different roles of unlabeled data in the two proposed methods result in dissimilar learning curves with respect to the amount of unlabeled data.
## 6. Conclusion

In this work, we proposed two general approaches, IC-GAIL and 2IWIL, which allow the agent to utilize both confidence and unlabeled data in imitation learning. The setting considered in this paper commonly arises in real-world scenarios because collecting optimal demonstrations is normally costly. In 2IWIL, we utilized unlabeled data to derive a risk estimator and obtained the variance-minimizing choice of the combination coefficient β; 2IWIL predicts confidence scores for unlabeled data and matches the optimal occupancy measure via the GAIL objective with importance sampling. For IC-GAIL, we showed that the agent learns an optimal policy by matching a mixture of normalized occupancy measures p′ with the normalized occupancy measure p of the given demonstrations. Empirically, we conducted extensive experiments showing that our methods outperform the baselines by a large margin, that they are robust to noise, and that the amount of unlabeled data is positively correlated with performance.

The proposed approaches are general and can easily be extended to other IL and IRL methods (Li et al., 2017; Fu et al., 2018; Kostrikov et al., 2019). For future work, we may extend them to applications such as discrete sequence generation, because the confidence in our work can be treated as a property indicator. For instance, to generate soluble chemicals we may not have enough examples of soluble chemicals, whereas the Crippen function (Crippen & Snow, 1990) can readily be used to evaluate solubility and serve as the confidence.

## Acknowledgement

We thank Zhenghang Cui for helpful discussion. MS was supported by KAKENHI 17H00757, NC was supported by a MEXT scholarship, and HB was supported by JST ACT-I Grant Number JPMJPR18UI, Japan.

## References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In ICML, pp. 1–8, 2004.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Bartlett, P. L. and Tewari, A. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775–790, 2007.

Buja, A., Stuetzle, W., and Shen, Y. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, 2005.

Burchfiel, B., Tomasi, C., and Parr, R. Distance minimization for reward learning from scored trajectories. In AAAI, pp. 3330–3336, 2016.

Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 2006.

Crippen, G. M. and Snow, M. E. A 1.8 Å resolution potential function for protein folding. Biopolymers: Original Research on Biomolecules, 29(10-11):1479–1489, 1990.

Dewey, D. Reinforcement learning and the reward engineering principle. In AAAI Spring Symposium Series, 2014.
El-Zahhar, M. M. and El-Gayar, N. F. A semi-supervised learning approach for soft labeled data. In International Conference on Intelligent Systems Design and Applications, pp. 1136–1141, 2010.

Everitt, T. and Hutter, M. Avoiding wireheading with value reinforcement learning. pp. 12–22, 2016.

Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, pp. 2672–2680, 2014.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In NeurIPS, pp. 4565–4573, 2016.

Hu, Z., Liang, Y., Zhang, J., Li, Z., and Liu, Y. Inference aided reinforcement learning for incentive mechanism design in crowdsourcing. In NeurIPS, pp. 5508–5518, 2018.

Ishida, T., Niu, G., and Sugiyama, M. Binary classification from positive-confidence data. In NeurIPS, pp. 5919–5930, 2018.

Kang, B., Jie, Z., and Feng, J. Policy optimization with demonstrations. In ICML, pp. 2474–2483, 2018.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1675–1685, 2017.

Kostrikov, I., Agrawal, K. K., Levine, S., and Tompson, J. Addressing sample inefficiency and reward bias in inverse reinforcement learning. In ICLR, 2019.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

Li, Y., Song, J., and Ermon, S. InfoGAIL: Interpretable imitation learning from visual demonstrations. In NeurIPS, pp. 3812–3822, 2017.

Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In ICRA, pp. 1118–1125, 2018.

McDiarmid, C. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.

Pathak, D., Girshick, R. B., Dollár, P., Darrell, T., and Hariharan, B. Learning features by watching objects move. In CVPR, 2017.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.

Reid, M. D. and Williamson, R. C. Composite binary losses. Journal of Machine Learning Research, 11:2387–2422, 2010.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.

Schaal, S. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.

Schölkopf, B., Burges, C. J. C., and Smola, A. J. Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In ICML, pp. 1889–1897, 2015.

Serban, I. V., Sankar, C., Germain, M., Zhang, S., Lin, Z., Subramanian, S., Kim, T., Pieper, M., Chandar, S., Ke, N. R., et al. A deep reinforcement learning chatbot. CoRR, abs/1709.02349, 2017.

Shah, P., Hakkani-Tur, D., Liu, B., and Tur, G. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 41–51, 2018.
Supancic III, J. S. and Ramanan, D. Tracking as online decision-making: Learning a policy from streaming videos with reinforcement learning. In ICCV, pp. 322–331, 2017.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning, volume 135. MIT Press, 1998.

Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In ICML, pp. 1032–1039, 2008.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IROS, pp. 5026–5033, 2012.

Tokmakov, P., Alahari, K., and Schmid, C. Learning motion patterns in videos. In CVPR, pp. 531–539, 2017.

Valko, M., Ghavamzadeh, M., and Lazaric, A. Semi-supervised apprenticeship learning. In EWRL, pp. 131–142, 2012.

Vapnik, V. Statistical Learning Theory, volume 3. Wiley, New York, 1998.

Wang, W., Wang, Y., Chen, F., and Sowmya, A. A weakly supervised approach for object detection based on soft-label boosting. In IEEE Workshop on Applications of Computer Vision, pp. 331–338, 2013.

Yeung, S., Ramanathan, V., Russakovsky, O., Shen, L., Mori, G., and Fei-Fei, L. Learning to learn from noisy web videos. In CVPR, pp. 7455–7463, 2017.

Zhou, D., Quost, B., and Frémont, V. Soft label based semi-supervised boosting for classification and object recognition. In International Conference on Control Automation Robotics & Vision, pp. 1062–1067, 2014.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, pp. 1433–1438, 2008.