# Variational Imitation Learning with Diverse-quality Demonstrations

Voot Tangkaratt 1, Bo Han 2 1, Mohammad Emtiyaz Khan 1, Masashi Sugiyama 1 3

1 RIKEN Center for Advanced Intelligence Project, Japan. 2 Department of Computer Science, Hong Kong Baptist University, Hong Kong. 3 Department of Complexity Science and Engineering, The University of Tokyo, Japan. Correspondence to: Voot Tangkaratt.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Learning from demonstrations can be challenging when the quality of demonstrations is diverse, and even more so when the quality is unknown and there is no additional information to estimate it. We propose a new method for imitation learning in such scenarios. We show that simple quality-estimation approaches may fail due to compounding error, and fix this issue by jointly estimating both the quality and the reward using a variational approach. Our method is easy to implement within reinforcement-learning frameworks and achieves state-of-the-art performance on continuous-control benchmarks. Our work enables scalable and data-efficient imitation learning under more realistic settings than before.

1. Introduction

Sequential decision making aims to learn a policy that makes good decisions (Puterman, 1994). Imitation Learning (IL) is a specific case that learns such a policy from demonstrations (Schaal, 1999), and it performs well when high-quality demonstrations from experts are available (Ho & Ermon, 2016; Fu et al., 2018; Peng et al., 2019). In reality, however, the quality of demonstrations can be diverse, i.e., high- and low-quality demonstrations are mixed. This scenario typically arises when collecting demonstrations from experts is costly, e.g., in robotics, where experts must have domain-specific knowledge (Mandlekar et al., 2018; Osa et al., 2018). Unfortunately, learning from diverse-quality demonstrations is challenging, because low-quality demonstrations often degrade learning performance and, e.g., in robotics, may even cause damage to robots (Shiarlis et al., 2016). In this paper, we propose a new method to solve this learning problem under the assumption that the diversity is caused by noise densities.

Learning from diverse-quality demonstrations becomes less challenging when the quality of demonstrations is known. In such scenarios, we can use data-cleaning techniques to remove low-quality demonstrations (Han et al., 2011), or use multi-modal approaches to learn good policies that correspond to high-quality demonstrations (Li et al., 2017; Wang et al., 2017). In other scenarios, experts may not provide the quality directly but instead provide additional information about it. With such information, learning is still relatively easy, since the quality can be estimated from confidence scores (Wu et al., 2019), ranking scores (Brown et al., 2019), or a small number of high-quality demonstrations (Audiffren et al., 2015). However, these scenarios assume the availability of experts who provide the quality or additional information. Our goal in this paper is to go beyond these scenarios and perform IL under a more realistic setting where such experts are not required. We propose a new method for IL with diverse-quality demonstrations that models the quality with a probabilistic graphical model under a noise-density assumption.
We show that simple quality-estimation approaches may fail due to compounding error, and fix this issue by estimating the quality along with a reward function that represents the intention behind the experts' decision making. To handle large state-action spaces, we use a variational approach, which can be easily implemented within reinforcement-learning frameworks (Sutton & Barto, 1998) and scales to large state-action spaces by using neural networks. We also propose importance sampling to improve the data-efficiency of our method. The final method is called Variational IL with Diverse-quality demonstrations (VILD). Experiments on continuous-control tasks demonstrate that VILD is robust against diverse-quality demonstrations and achieves state-of-the-art performance.

2. IL with Diverse-quality Demonstrations

Before delving into our main contribution, we first give background on RL and IL. Then, we formulate a new IL setting called IL with diverse-quality demonstrations and discuss deficiencies of existing methods.

[Figure 1: graphical models of (a) expert demonstrations and (b) diverse-quality demonstrations.] Figure 1. Graphical models describing $p^*(\tau_{sa}, k)$ and $p_d(\tau_{su}, k)$ of expert demonstrations and diverse-quality demonstrations, respectively. Shaded and unshaded nodes indicate observed and unobserved random variables, respectively. $s_t \in \mathcal{S}$ is a state with transition density $p_s(s_{t+1}|s_t, a_t)$, $a_t \in \mathcal{A}$ is an action with density $\pi^*(a_t|s_t)$, $u_t \in \mathcal{A}$ is a noisy action with density $p_n(u_t|s_t, a_t, k)$, and $k \in \{1, \dots, K\}$ is an identification number with distribution $\nu(k)$. Actions $a_t$ are unobserved in Figure 1(b) because they are not executed in the MDP.

2.1. Reinforcement Learning

Reinforcement Learning (RL) (Sutton & Barto, 1998) aims to learn an optimal policy of a Markov decision process (MDP) (Puterman, 1994). We consider a finite-horizon continuous MDP defined by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p_s(s'|s,a), \mu(s_1), r(s,a))$ with a state $s_t \in \mathcal{S} \subseteq \mathbb{R}^{d_s}$, an action $a_t \in \mathcal{A} \subseteq \mathbb{R}^{d_a}$, a transition probability density $p_s(s_{t+1}|s_t, a_t)$, an initial state density $\mu(s_1)$, and a reward function $r : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$, where the subscript $t \in \{1, \dots, T\}$ denotes the time step. We denote by $\tau_{sa} = (s_{1:T+1}, a_{1:T})$ a (finite-horizon) trajectory of $s_t$ and $a_t$. The decision making of an agent is determined by a policy $\pi(a_t|s_t)$, which is a conditional probability density of an action given a state. RL seeks an optimal policy $\pi^*(a_t|s_t)$ that maximizes the expected cumulative reward $\mathbb{E}_{p_\pi}[\sum_{t=1}^{T} r(s_t, a_t)]$, where $p_\pi(\tau_{sa}) = \mu(s_1) \prod_{t=1}^{T} p_s(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t)$ is the trajectory probability density induced by $\pi$. A major limitation of RL is that it relies on the reward function, which may be unavailable in practice (Schaal, 1999).

2.2. Imitation Learning

Imitation Learning (IL) was proposed to address the above limitation of RL (Schaal, 1999; Ng & Russell, 2000). IL aims to learn the optimal policy from demonstrations that encode information about the optimal policy, without using the reward function $r$ of the MDP. A common setting of most IL methods is IL with expert demonstrations. Namely, demonstrations are collected by $K \geq 1$ demonstrators who execute actions $a_t$ drawn from $\pi^*(a_t|s_t)$ at every state $s_t$. A graphical model describing this data-collection process is depicted in Figure 1(a), where a random variable $k \in \{1, \dots, K\}$ denotes each demonstrator's identification number and $\nu(k)$ denotes the probability of collecting a demonstration from the $k$-th demonstrator.
Under this assumption, expert demonstrations $\{(\tau_{sa}, k)_n\}_{n=1}^{N}$ are regarded as drawn independently from

$p^*(\tau_{sa}, k) = \nu(k)\, \mu(s_1) \prod_{t=1}^{T} p_s(s_{t+1}|s_t, a_t)\, \pi^*(a_t|s_t).$   (1)

Note that $k$ can be omitted since $k$ and $\tau_{sa}$ are independent. IL has been shown to work well in benchmark tasks (Ho & Ermon, 2016; Fu et al., 2018; Peng et al., 2019), but it has rarely been used in practice (Silver et al., 2012; Schroecker et al., 2019). One of the main reasons is that most methods assume the availability of high-quality demonstrations collected from experts according to Eq. (1). In practice, high-quality demonstrations are often too costly, and even when we obtain them, their number is often too small to accurately learn the optimal policy (Osa et al., 2018).

2.3. Diverse-quality Demonstrations

To make IL more practical, we consider IL with diverse-quality demonstrations, where demonstrations are collected from demonstrators with different levels of expertise. Such demonstrations can be obtained cheaply via crowdsourcing (Mandlekar et al., 2018), but learning the optimal policy from them is challenging, as will be discussed below.

In this paper, we consider the following noise-density assumption on diverse-quality demonstrations. Namely, we assume that at each time step $t$, demonstrators execute a noisy action $u_t \sim p_n(u_t|s_t, a_t, k)$ where $u_t \in \mathcal{A}$, instead of the action $a_t \sim \pi^*(a_t|s_t)$. A graphical model describing this process is depicted in Figure 1(b). Under this assumption, diverse-quality demonstrations $\{(\tau_{su}, k)_n\}_{n=1}^{N}$ are regarded as drawn from the probability density

$p_d(\tau_{su}, k) = \nu(k)\, \mu(s_1) \prod_{t=1}^{T} p_s(s_{t+1}|s_t, u_t) \int_{\mathcal{A}} \pi^*(a_t|s_t)\, p_n(u_t|s_t, a_t, k)\, \mathrm{d}a_t,$   (2)

where $\tau_{su} = (s_{1:T+1}, u_{1:T})$ is a trajectory of $s_t$ and $u_t$. Indeed, the noise density $p_n$ determines the level of demonstrator expertise as well as the quality of the demonstrations. The goal of IL with diverse-quality demonstrations is to learn the optimal policy using the dataset $\{(\tau_{su}, k)_n\}_{n=1}^{N}$. We emphasize that identification numbers do not contain information about the quality and do not need to be provided by experts. When the number is not given, a simple strategy is to set $k = n$ and $K = N$, which corresponds to assuming that each demonstrator collects one demonstration.

2.4. The Deficiency of Existing Methods

Methods for high-quality demonstrations in Section 2.2 are unsuitable for the diverse-quality demonstrations of Section 2.3 due to the difference between $p^*$ and $p_d$. Specifically, by comparing $p^*$ and $p_d$, we can see that these methods would learn a policy $\hat{\pi}$ that averages over noise densities, i.e., $\hat{\pi}(u_t|s_t) = \sum_{k=1}^{K} \nu(k) \int_{\mathcal{A}} \pi^*(a_t|s_t)\, p_n(u_t|s_t, a_t, k)\, \mathrm{d}a_t$. This averaging policy clearly differs from the optimal policy.

Multi-modal IL methods (Li et al., 2017; Hausman et al., 2017; Wang et al., 2017) are also unsuitable for diverse-quality demonstrations. Specifically, these methods aim to learn a multi-modal policy whose different modalities estimate different policies. They are suitable for diverse demonstrations collected by experts with different optimal policies, because different modalities simply estimate the optimal policy of different experts. However, they become unsuitable with diverse-quality demonstrations, because some modalities estimate the policies of amateurs. For this reason, it is crucial to choose good modalities that estimate experts' policies, but doing so typically requires knowing the quality of demonstrations.
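To make the noise-density assumption concrete, below is a minimal sketch (not the authors' code) of the data-generating process in Eq. (2), using Gaussian noise as in the benchmark experiments of Section 4.1; `env`, `expert_policy`, and `sigmas` are hypothetical stand-ins for the MDP, the optimal policy $\pi^*$, and the per-demonstrator noise scales.

```python
import numpy as np

def collect_diverse_quality_demos(env, expert_policy, sigmas, n_trajs, horizon, rng):
    """Return a list of (trajectory, k) pairs drawn approximately from p_d(tau_su, k)."""
    K = len(sigmas)                      # number of demonstrators
    dataset = []
    for _ in range(n_trajs):
        k = rng.integers(K)              # k ~ nu(k), assumed uniform here
        s = env.reset()
        traj = []
        for _ in range(horizon):
            a = expert_policy(s)                              # a_t ~ pi*(a_t | s_t)
            u = a + sigmas[k] * rng.standard_normal(a.shape)  # u_t ~ p_n(u_t | s_t, a_t, k)
            s_next, _, done, _ = env.step(u)                  # the MDP only sees the noisy action u_t
            traj.append((s, u))                               # a_t is never observed in the data
            s = s_next
            if done:
                break
        dataset.append((traj, k))
    return dataset
```

Running behavior cloning on the pooled $(s_t, u_t)$ pairs of such a dataset would fit the noise-averaged policy $\hat{\pi}$ discussed above rather than $\pi^*$, which is exactly the deficiency described in this section.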
In supervised learning, a well-known approach for handling diverse-quality data is to estimate the quality of the data (Angluin & Laird, 1988; Raykar et al., 2010). Based on this approach, the quality of demonstrations may be estimated by using a parameterized model $p_{\theta,\omega}$ to estimate $p_d$ as follows:

$p_{\theta,\omega}(\tau_{su}, k) = \nu(k)\, \mu(s_1) \prod_{t=1}^{T} p_s(s_{t+1}|s_t, u_t) \int_{\mathcal{A}} \pi_\theta(a_t|s_t)\, p_\omega(u_t|s_t, a_t, k)\, \mathrm{d}a_t.$   (3)

The parameters $\theta$ and $\omega$ can be learned by a regression method (see Appendix B.2), where $\pi_\theta$ estimates the optimal policy and $p_\omega$ estimates the noise density. However, this approach suffers from the issue of compounding error (Ross & Bagnell, 2010) and tends to perform poorly at test time. Namely, regression methods assume that data distributions are identical during training and testing. However, data distributions in IL depend on policies (Puterman, 1994), which leads to discrepancies between the data distributions during training and testing. Due to this, compounding error can occur during testing, where prediction errors grow over future time steps because of the changing data distribution. Our goal is to tackle IL with diverse-quality demonstrations under this realistic setting (i.e., experts are unavailable), while avoiding these deficiencies.

3. VILD: A Robust Method for Diverse-quality Demonstrations

This section presents VILD, a robust method for tackling the challenges posed by diverse-quality demonstrations. Specifically, we build a parameterized model that explicitly describes the noise density and a reward function (Section 3.1), and estimate its parameters by a variational approach (Section 3.2), which can be implemented easily with RL (Section 3.3). We also improve data-efficiency by using importance sampling (Section 3.4). Mathematical derivations are provided in Appendix A.

3.1. Modeling Diverse-quality Demonstrations

Our key idea for overcoming the challenge of diverse-quality demonstrations is to estimate the quality of demonstrations. To avoid the deficiency of the model $p_{\theta,\omega}$ in Eq. (3), we utilize inverse RL (IRL) (Ng & Russell, 2000), where we learn a reward function from diverse-quality demonstrations. IL problems can be solved by a combination of IRL and RL, where we learn a reward function by IRL and then learn a policy from the reward function by RL.1 This combination avoids the issue of compounding error, since the policy is learned by RL, which takes into account the dependency between the data distribution and the policy (Ho & Ermon, 2016).

1 IRL differs from RL: IRL learns a reward function from demonstrations, whereas RL learns a policy from a reward function.

Specifically, our parameterized model for estimating $p_d$ is based on the model of maximum entropy IRL (ME-IRL) (Ziebart et al., 2010), which learns a reward function from expert demonstrations by using the model $p_\phi(\tau_{sa}) \propto \mu(s_1) \prod_{t=1}^{T} p_s(s_{t+1}|s_t, a_t)\, e^{r_\phi(s_t, a_t)}$, where $\phi$ is the parameter. Based on this model, we propose to learn the reward function and noise density via

$p_{\phi,\omega}(\tau_{su}, k) = \frac{1}{Z_{\phi,\omega}}\, \nu(k)\, \mu(s_1) \prod_{t=1}^{T} p_s(s_{t+1}|s_t, u_t) \int_{\mathcal{A}} e^{r_\phi(s_t, a_t)}\, p_\omega(u_t|s_t, a_t, k)\, \mathrm{d}a_t,$   (4)

where $\phi$ and $\omega$ are parameters, and $Z_{\phi,\omega}$ is the normalization term ensuring that $p_{\phi,\omega}$ integrates to one. By comparing $p_{\phi,\omega}$ to $p_d$, the reward parameter $\phi$ should be learned so that the cumulative reward is proportional to the joint probability density of actions given by the optimal policy, i.e., $e^{\sum_{t=1}^{T} r_\phi(s_t, a_t)} \propto \prod_{t=1}^{T} \pi^*(a_t|s_t)$. In other words, the cumulative reward is large for trajectories induced by the optimal policy.
Therefore, the optimal policy can be learned by maximizing the reward $r_\phi$ under the transition probability $p_s$. Meanwhile, the model $p_\omega$ estimates the noise density $p_n$, and the estimated level of demonstrator expertise can be determined from $p_\omega$.

To learn the parameters of this model, we propose to minimize the KL divergence from the data distribution to the model: $\min_{\phi,\omega} \mathrm{KL}(p_d \,\|\, p_{\phi,\omega})$. By ignoring constants and letting $l_{\phi,\omega}(s_t, a_t, u_t, k) = r_\phi(s_t, a_t) + \log p_\omega(u_t|s_t, a_t, k)$, minimizing the KL divergence is equivalent to solving

$\max_{\phi,\omega}\; f(\phi, \omega) - g(\phi, \omega),$   (5)

$f(\phi, \omega) = \mathbb{E}_{p_d}\Big[ \sum_{t=1}^{T} \log \int_{\mathcal{A}} e^{l_{\phi,\omega}(s_t, a_t, u_t, k)}\, \mathrm{d}a_t \Big],$   (6)

$g(\phi, \omega) = \log Z_{\phi,\omega}.$   (7)

Solving this maximization requires computing integrals over both the state space $\mathcal{S}$ (contained in $g$) and the action space $\mathcal{A}$. Computing these integrals is feasible for small state-action spaces, but infeasible for large ones. To scale our model up to large state-action spaces, we leverage a variational approach in the following.

3.2. A Variational Approach for Parameter Estimation

The central idea of the variational approach is to lower-bound an integral by the Jensen inequality and a variational distribution (Jordan et al., 1999). The main benefit of the approach is that the integral can be computed via an expectation over an optimal variational distribution; this makes it easier to solve the optimization problem. However, finding the optimal variational distribution usually requires solving a sub-optimization problem.

Before we proceed, notice that the difference $f(\phi, \omega) - g(\phi, \omega)$ is not a jointly concave function of the integrals, and this prohibits applying the Jensen inequality to the difference directly. However, we can separately lower-bound $f$ and $g$ by the Jensen inequality, since they are concave functions of their corresponding integrals. Specifically, a variational distribution $q_\psi(a_t|s_t, u_t, k)$ with parameter $\psi$ yields the inequality

$f(\phi, \omega) \geq \mathbb{E}_{p_d}\Big[ \sum_{t=1}^{T} \mathbb{E}_{q_\psi}[\, l_{\phi,\omega}(s_t, a_t, u_t, k) \,] + H_t(q_\psi) \Big] =: F(\phi, \omega, \psi),$   (8)

where we define $H_t(q_\psi) = -\mathbb{E}_{q_\psi}[\log q_\psi(a_t|s_t, u_t, k)]$. It can be verified that $f(\phi, \omega) = \max_\psi F(\phi, \omega, \psi)$. Meanwhile, a variational distribution $q_\theta(a_t, u_t|s_t, k)$ with parameter $\theta$ yields the inequality

$g(\phi, \omega) \geq \mathbb{E}_{\bar{q}_\theta}\Big[ \sum_{t=1}^{T} l_{\phi,\omega}(s_t, a_t, u_t, k) \Big] + H(\bar{q}_\theta) =: G(\phi, \omega, \theta),$   (9)

where $H(\bar{q}_\theta) = -\mathbb{E}_{\bar{q}_\theta}\big[ \sum_{t=1}^{T} \log q_\theta(a_t, u_t|s_t, k) \big]$, $\bar{q}_\theta(\tau_{sau}, k) = \nu(k)\, \mu(s_1) \prod_{t=1}^{T} p_s(s_{t+1}|s_t, u_t)\, q_\theta(a_t, u_t|s_t, k)$, and $\tau_{sau} = (s_{1:T+1}, a_{1:T}, u_{1:T})$. The lower bound $G$ resembles the objective function of maximum entropy RL (Ziebart et al., 2010). Based on the optimality results of maximum entropy RL, it can be verified that $g(\phi, \omega) = \max_\theta G(\phi, \omega, \theta)$. Variational distributions $q_{\psi^*}$ and $q_{\theta^*}$ that maximize the lower bounds ($F$ and $G$, respectively) are called optimal variational distributions. By using the variational approach, Eq. (5) can be written as

$\max_{\phi,\omega,\psi} \min_\theta\; F(\phi, \omega, \psi) - G(\phi, \omega, \theta).$   (10)

It is feasible to solve Eq. (10) for large state-action spaces, since $F$ and $G$ are defined as expectations and can be optimized straightforwardly. In practice, we represent the variational distributions by parameterized functions (e.g., neural networks), and solve the optimization by stochastic gradient methods where expectations are approximated using mini-batch samples (Ranganath et al., 2014).

3.3. Choices of Density Models in Practice

In practice, we need to specify density models in our optimization (Eq. (10)). For continuous-control tasks, we use

$p_\omega(u_t|s_t, a_t, k) = \mathcal{N}(u_t \,|\, a_t, C_\omega(k)),$   (11)

$q_\theta(a_t, u_t|s_t, k) = q_\theta(a_t|s_t)\, \mathcal{N}(u_t \,|\, a_t, \Sigma_k),$   (12)

where $\mathcal{N}(a \,|\, b, C)$ denotes a Gaussian distribution with mean $b$ and covariance $C$, and $\Sigma_k$ is a hyper-parameter.
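As a concrete companion to Eqs. (11)-(12), the following minimal sketch (not the authors' implementation; it assumes diagonal covariances and a `Sigma` shared across demonstrators) shows the two Gaussian log-densities and how the learned $C_\omega(k)$ can be read off as an expertise estimate. `log_C_omega` is a hypothetical $(K, d_a)$ array of learnable per-demonstrator log-variances.

```python
import numpy as np

def log_p_omega(u, a, k, log_C_omega):
    """log p_omega(u | s, a, k) = log N(u | a, C_omega(k)), Eq. (11)."""
    var = np.exp(log_C_omega[k])                     # diagonal of C_omega(k)
    return -0.5 * np.sum((u - a) ** 2 / var + np.log(2.0 * np.pi * var))

def log_q_theta_joint(log_q_theta_a, a, u, Sigma):
    """log q_theta(a, u | s, k) = log q_theta(a | s) + log N(u | a, Sigma_k), Eq. (12)."""
    return log_q_theta_a - 0.5 * np.sum((u - a) ** 2 / Sigma + np.log(2.0 * np.pi * Sigma))

def estimated_expertise(log_C_omega):
    """Rank demonstrators: small C_omega(k) (large inverse variance) means high expertise."""
    return np.exp(-log_C_omega).sum(axis=1)          # = ||vec(C_omega^{-1}(k))||_1 for diagonal C
```

The `estimated_expertise` values are exactly the quantities that define the sampling distribution $\tilde{\nu}(k)$ used for importance sampling in Section 3.4.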
We use the Gaussian distribution for $p_\omega$ to incorporate a prior assumption that the noise density $p_n$ tends to be Gaussian. Indeed, the covariance $C_\omega(k)$ gives an estimated expertise of the $k$-th demonstrator: the covariance is small for high-expertise demonstrators and large for low-expertise demonstrators.2 Meanwhile, the choice of $q_\theta(a_t, u_t|s_t, k)$ in Eq. (12) enables using RL to optimize $\theta$, as described below. The choices of $q_\psi(a_t|s_t, u_t, k)$ and $q_\theta(a_t|s_t)$ are flexible; we use Gaussians, which are common for distributions over continuous actions (Duan et al., 2016), but other choices such as beta distributions can be used (Chou et al., 2017).

2 Different choices of $p_\omega$ incorporate different prior assumptions. For example, a Laplace distribution may be used to model demonstrations with outliers (see Appendix A.4) (Murphy, 2013).

With the above density models, Eq. (10) is equivalent to

$\max_{\phi,\omega,\psi} \min_\theta\; L(\phi, \omega, \psi, \theta),$   (13)

$L(\phi, \omega, \psi, \theta) = \mathbb{E}_{p_d}\Big[ \sum_{t=1}^{T} \mathbb{E}_{q_\psi}\big[ r_\phi(s_t, a_t) - \tfrac{1}{2}\| u_t - a_t \|^2_{C_\omega^{-1}(k)} \big] + H_t(q_\psi) \Big] - \mathbb{E}_{\tilde{q}_\theta}\Big[ \sum_{t=1}^{T} r_\phi(s_t, a_t) \Big] - H(\tilde{q}_\theta) - \tfrac{T}{2}\, \mathbb{E}_\nu\big[ \mathrm{Tr}(C_\omega^{-1}(k)\, \Sigma_k) \big].$

Here, $\tilde{q}_\theta(\tau_{sa}) = \mu(s_1) \prod_{t=1}^{T} \tilde{p}_s(s_{t+1}|s_t, a_t)\, q_\theta(a_t|s_t)$ and $\tilde{p}_s(s_{t+1}|s_t, a_t) = \mathbb{E}_\nu\big[ \int_{\mathcal{A}} p_s(s_{t+1}|s_t, u_t)\, \mathcal{N}(u_t|a_t, \Sigma_k)\, \mathrm{d}u_t \big]$. Recall that the optimal policy may be learned by maximizing the reward $r_\phi$ under the transition probability $p_s$. As can be seen, minimizing $L$ w.r.t. $\theta$ is equivalent to solving maximum entropy RL with reward $r_\phi$ and transition probability $\tilde{p}_s$. The discrepancy between $p_s$ and $\tilde{p}_s$ is determined by $\Sigma_k$: a smaller value of $\Sigma_k$ yields less discrepancy. Therefore, by choosing a reasonably small value of $\Sigma_k$, we can optimize $\theta$ by RL to obtain $q_\theta(a_t|s_t)$, which estimates the optimal policy. This is advantageous because we can use state-of-the-art RL methods without significant modifications to their implementations.

To sum up, VILD solves Eq. (13) to learn the policy $q_\theta(a_t|s_t)$, where $\theta$ is optimized by RL with reward $r_\phi$, while $\phi$, $\omega$, and $\psi$ are optimized by stochastic gradient methods such as Adam (Kingma & Ba, 2015). Algorithm 1 shows the pseudo-code of VILD. We use a diagonal matrix for $C_\omega(k)$ and also include a regularization term $\mathcal{L}(\omega) = T\, \mathbb{E}_\nu[\log |C_\omega^{-1}(k)|] / 2$ to penalize overly large values of $C_\omega(k)$. Note that $L$ already includes the penalty $-\tfrac{T}{2}\mathbb{E}_\nu[\mathrm{Tr}(C_\omega^{-1}(k)\, \Sigma_k)]$, but its strength is too small because $\Sigma_k$ is chosen to be small. Similarly to prior work (Ho & Ermon, 2016), we implement VILD using feed-forward neural networks with two hidden layers and use Monte-Carlo estimation to approximate expectations. We also pre-train the Gaussian mean of $q_\psi$ to obtain reasonable initial predictions; we perform least-squares regression for 1000 gradient steps with target value $u_t$. More implementation details are given in Appendix C.3

3 Source code: www.github.com/voot-t/vild_code

3.4. Importance Sampling for Reward Learning

To improve the convergence rate of VILD when optimizing $\phi$, we use importance sampling (IS). Specifically, the gradient $\nabla_\phi L(\phi, \omega, \psi, \theta)$ indicates that $\phi$ needs to maximize the expected cumulative reward achieved by $p_d$ and $q_\psi$, and at the same time minimize the expected cumulative reward achieved by $\tilde{q}_\theta$. However, low-quality demonstrations drawn from $p_d$ often yield low reward values, which are not informative for the maximization. For this reason, stochastic gradients estimated from these demonstrations tend to be uninformative, which leads to slow convergence and poor data-efficiency. To avoid estimating such uninformative gradients, we use IS to estimate gradients using high-quality demonstrations, which are sampled with high probability.
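Before detailing the IS scheme, the sketch below illustrates how the demonstration-side term of Eq. (13) and the regularizer $\mathcal{L}(\omega)$ might be estimated on a mini-batch with a single reparameterized sample from $q_\psi$ (one sample is also what we use in practice; cf. Section 3.5). It is an illustration only, not the actual implementation: `reward_net`, `q_psi`, and `log_c_omega` are hypothetical stand-ins, the policy-side terms of Eq. (13) are handled by the RL update of Algorithm 1, and the trace penalty is omitted for brevity.

```python
import torch

def demo_side_loss(reward_net, q_psi, log_c_omega, s, u, k, T):
    """Negative mini-batch estimate of the demonstration-side term of Eq. (13) plus L(omega)."""
    mean, log_std = q_psi(s, u, k)                  # Gaussian q_psi(a | s, u, k)
    std = log_std.exp()
    a = mean + std * torch.randn_like(std)          # one reparameterized sample from q_psi
    # Entropy H_t(q_psi) of a diagonal Gaussian: sum_i 0.5 * (1 + log(2*pi*sigma_i^2)).
    entropy = (0.5 * (1.0 + torch.log(2 * torch.pi * std ** 2))).sum(dim=-1)

    inv_var = torch.exp(-log_c_omega[k])            # diagonal of C_omega^{-1}(k)
    mahalanobis = 0.5 * ((u - a) ** 2 * inv_var).sum(dim=-1)   # 0.5 * ||u - a||^2_{C^{-1}}

    term = reward_net(s, a).squeeze(-1) - mahalanobis + entropy
    # Regularizer L(omega) = (T/2) * E_nu[log |C_omega^{-1}(k)|], approximated over the batch.
    reg = 0.5 * T * (-log_c_omega[k]).sum(dim=-1)

    # The paper maximizes (term + reg) w.r.t. phi, omega, psi; a standard optimizer such as
    # Adam would therefore minimize the negated mean returned here.
    return -(term + reg).mean()
```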
Briefly, IS is a technique for estimating an expectation over a distribution by using samples from a different distribution (Robert & Casella, 2005). For VILD, we sample $k$ from a distribution $\tilde{\nu}(k) \propto \| \mathrm{vec}(C_\omega^{-1}(k)) \|_1$, which assigns high probabilities to $k$ with high expertise (i.e., small $C_\omega(k)$). With this distribution, the estimated gradients tend to be more informative for reward learning. To reduce the sampling bias, we use a truncated importance weight $w(k) = \min(\nu(k)/\tilde{\nu}(k), 1)$, which leads to the IS gradient

$\nabla^{\mathrm{IS}}_\phi L(\phi, \omega, \psi, \theta) = \mathbb{E}_{\tilde{p}_d}\Big[ w(k) \sum_{t=1}^{T} \mathbb{E}_{q_\psi}\big[ \nabla_\phi r_\phi(s_t, a_t) \big] \Big] - \mathbb{E}_{\tilde{q}_\theta}\Big[ \sum_{t=1}^{T} \nabla_\phi r_\phi(s_t, a_t) \Big],$   (14)

where $\tilde{p}_d(\tau_{su}, k)$ is defined similarly to $p_d(\tau_{su}, k)$ in Eq. (2) but with $\tilde{\nu}(k)$ instead of $\nu(k)$. To obtain mini-batch samples from $\tilde{p}_d$, we sample $k$ from $\tilde{\nu}(k)$ and then uniformly sample demonstrations associated with $k$ from the dataset $\{(\tau_{su}, k)_n\}_{n=1}^{N}$. Computing $w(k)$ requires $\nu(k)$, which can be estimated accurately since $k$ is a discrete random variable. For simplicity, we assume a uniform $\nu(k)$.

We note that the gradient in Eq. (14) is biased when $\nu(k)/\tilde{\nu}(k) > 1$. Nonetheless, the bias may improve robustness against model misspecification, i.e., when the Gaussian model $p_\omega$ in Eq. (11) cannot exactly represent the noise density $p_n$. Specifically, the optimal solution of Eq. (13) may yield a poor policy when the model is misspecified. In such cases, informative biases can be introduced such that the solution has desirable properties.4 For VILD, a desirable property is that the reward function yields relatively large values for high-expertise demonstrators. This is precisely the consequence of using $\tilde{\nu}(k)$ for reward learning. Note that the usefulness of these biases still depends on the relative accuracy of the estimated covariance.

4 For instance, $\ell_2$-regularization introduces biases to obtain a solution with a small $\ell_2$-norm (Hastie et al., 2001).

Algorithm 1 VILD: Variational Imitation Learning with Diverse-quality demonstrations
1: Input: diverse-quality demonstrations $\{(\tau_{su}, k)_n\}_{n=1}^{N} \sim p_d(\tau_{su}, k)$ and a replay buffer $\mathcal{B}$.
2: Pre-train $q_\psi(a_t|s_t, u_t, k)$ by least-squares regression (see Appendix C).
3: while not converged do
4:   while $|\mathcal{B}| < B$ with batch size $B$ do
5:     Sample $a_t \sim q_\theta(a_t|s_t)$ and $\epsilon_t \sim \mathcal{N}(\epsilon_t \,|\, 0, \Sigma_k)$, observe $s'_t \sim p_s(s'_t \,|\, s_t, a_t + \epsilon_t)$, and add $(s_t, a_t, s'_t)$ to $\mathcal{B}$.
6:   end while
7:   Update $q_\psi$ by an estimate of $\nabla_\psi L(\phi, \omega, \psi, \theta)$.
8:   Update $p_\omega$ by an estimate of $\nabla_\omega L(\phi, \omega, \psi, \theta) + \nabla_\omega \mathcal{L}(\omega)$.
9:   Update $r_\phi$ by an estimate of $\nabla^{\mathrm{IS}}_\phi L(\phi, \omega, \psi, \theta)$.
10:  Update $q_\theta$ by an RL method (e.g., TRPO, SAC, or PPO) with reward function $r_\phi$.
11: end while

3.5. Discussion

In this section, we discuss the computational costs of VILD and a connection between VILD and maximum entropy IRL.

Computational costs. VILD does not incur large additional computational costs compared to prior methods. Specifically, the additional costs of VILD comprise the cost of computing gradients w.r.t. $\omega$ and $\psi$ and the cost of sampling from $q_\psi$. For $\omega$, the cost of computing gradients is very low because $C_\omega(k)$ is a diagonal matrix. For $\psi$, the cost of computing gradients depends on the size of the neural network, and the cost of sampling depends on the number of samples drawn for Monte-Carlo estimation. In our experiments, we draw one sample from $q_\psi$ to reduce the cost and use antithetic sampling to reduce estimation variance (Robert & Casella, 2005). Overall, the additional costs of VILD are relatively low compared to the cost of collecting transition samples from the MDP, which is the main computational burden of many IL methods.
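Returning to the importance-sampling scheme of Section 3.4, here is a minimal sketch (not the authors' code) of how a demonstrator index $k$ could be drawn from $\tilde{\nu}(k)$ and the truncated weight $w(k)$ computed, assuming a diagonal $C_\omega(k)$ and a uniform $\nu(k)$ as in the paper.

```python
import numpy as np

def sample_demonstrator_is(log_c_omega, rng):
    """Sample k ~ nu_tilde(k) proportional to ||vec(C_omega^{-1}(k))||_1 and return (k, w(k)).

    `log_c_omega` is a hypothetical (K, d_a) array of per-demonstrator diagonal log-variances.
    """
    K = log_c_omega.shape[0]
    inv_cov_l1 = np.exp(-log_c_omega).sum(axis=1)   # ||vec(C_omega^{-1}(k))||_1, diagonal case
    nu_tilde = inv_cov_l1 / inv_cov_l1.sum()        # proposal distribution over demonstrators
    nu = np.full(K, 1.0 / K)                        # uniform nu(k), as assumed in the paper
    k = rng.choice(K, p=nu_tilde)                   # high-expertise k is sampled more often
    w = min(nu[k] / nu_tilde[k], 1.0)               # truncated importance weight w(k)
    return k, w
```

A mini-batch for the first expectation in Eq. (14) is then formed by uniformly drawing transitions from the demonstrations of the sampled $k$, with their reward-gradient contribution scaled by $w(k)$.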
Relation to maximum entropy IRL. The model of VILD is based on the model of maximum entropy IRL (ME-IRL) (Ziebart et al., 2010), and VILD is closely related to ME-IRL. Specifically, VILD reduces to ME-IRL under the assumption that all demonstrations are high-quality. This assumption is equivalent to letting $q_\psi$ and $p_\omega$ be Dirac deltas: $q_\psi(a_t|s_t, u_t, k) = \delta(a_t - u_t)$ and $p_\omega(u_t|s_t, a_t, k) = \delta(u_t - a_t)$. In this case, the optimization in Eq. (10) is equivalent to

$\max_\phi \min_\theta\; \mathbb{E}_{p_d}\Big[ \sum_{t=1}^{T} r_\phi(s_t, a_t) \Big] - \mathbb{E}_{q_\theta}\Big[ \sum_{t=1}^{T} r_\phi(s_t, a_t) \Big] - H(q_\theta),$   (15)

where $q_\theta(\tau_{sa}) = \mu(s_1) \prod_{t=1}^{T} p_s(s_{t+1}|s_t, a_t)\, q_\theta(a_t|s_t)$. Note that $\psi$ and $\omega$ do not appear in this objective because $q_\psi$ and $p_\omega$ are Dirac deltas without parameters. This objective is equivalent to that of ME-IRL. In practice, when all demonstrations have high quality, we expect VILD to estimate small covariances for all demonstrators, which implies that $q_\psi(a_t|s_t, u_t, k) \to \delta(a_t - u_t)$ and $p_\omega(u_t|s_t, a_t, k) \to \delta(u_t - a_t)$. Based on this, we conjecture that VILD performs comparably to ME-IRL given high-quality demonstrations. Our experiment in Appendix D.4 supports this conjecture.

4. Experiments

We experimentally evaluate VILD (with IS and without IS) on continuous-control tasks. Performance is evaluated using the cumulative ground-truth reward along trajectories collected by the learned policies (Ho & Ermon, 2016). We report the mean and standard error computed over 5 trials.

4.1. Comparison in Continuous-control Benchmarks

In this section, we evaluate VILD on continuous-control benchmark tasks (Brockman et al., 2016) (HalfCheetah, Ant, Walker2d, and Humanoid) under scenarios where the Gaussian model of $p_\omega$ is correct. Specifically, for each task, we generate two datasets using two types of Gaussian noise density: $p_n(u_t|s_t, a_t, k) = \mathcal{N}(u_t|a_t, \sigma_k^2)$ (time-action independent), and $p_n(u_t|s_t, a_t, k) = \mathcal{N}(u_t|a_t, \sigma_k^2(a, t))$ (time-action dependent). For each dataset, we use a pre-trained $\pi^*$ and $K = 10$ demonstrators to generate approximately 10000 state-action pairs.

4.1.1. Comparison Against RL-based Methods

First, we compare VILD against RL-based methods that use RL to optimize a policy. These methods are GAIL (Ho & Ermon, 2016), AIRL (Fu et al., 2018), VAIL (Peng et al., 2019), ME-IRL (Ziebart et al., 2010), and InfoGAIL (Li et al., 2017). These existing methods are well-known in IL, but they do not take diverse quality into account. We use TRPO (Schulman et al., 2015) as the RL method, except on the Humanoid task, where we use SAC (Haarnoja et al., 2018) since TRPO does not perform well there. For InfoGAIL, a multi-modal IL method that learns a context-dependent policy, we report the performance averaged over all contexts and the performance with the best context (denoted by InfoGAIL (best)). Note that in InfoGAIL, the modalities of the multi-modal policy are chosen based on the value of the context. The number of contexts is set to $K$.

Figure 2 shows the performance against the number of transition samples collected by RL. The results show that VILD with IS achieves state-of-the-art performance and outperforms the rest overall. VILD without IS also tends to outperform existing methods in terms of final performance. However, it is outperformed by VILD with IS, except on the Humanoid task with time-action independent noise density (Figure 2(a)). This is perhaps because the bias from IS may have a negative effect when the model choice is correct.
Nonetheless, the overall good performance of VILD with IS demonstrates that it is more robust against diverse-quality demonstrations than existing methods. On the contrary, existing methods perform poorly, except on the Humanoid task, where all methods except GAIL and VAIL achieve statistically comparable performance according to a t-test. This result implies that diverse-quality demonstrations in this task may not have strong negative effects on performance, perhaps because amateurs in this task perform relatively well compared to amateurs in other tasks (see Appendix D.1).

[Figure 2: learning curves (cumulative reward vs. number of transition samples) on HalfCheetah, Ant, Walker2d, and Humanoid for (a) time-action independent and (b) time-action dependent noise densities.] Figure 2. Comparison on continuous-control benchmarks against RL-based methods that do not take diverse quality into account. VILD-based methods perform overall better than the rest. Demonstrations are artificially generated. (VILD (IS) and VILD (w/o IS) denote VILD with and without IS, respectively. Horizontal dots denote the performance of 5 demonstrators. Shaded areas denote standard errors.)

We also evaluate the accuracy of quality estimation in VILD by comparing the estimated covariance $C_\omega(k)$ against the ground-truth covariance $\sigma_k^2$ of the noise density. Figures for this comparison are given in Appendix D.1 due to space limitations. The results show that the estimation is quite accurate. Nonetheless, the estimation tends to be less accurate for low-expertise demonstrators. A reason for this phenomenon is that low-quality demonstrations are highly dissimilar, which makes quality estimation more challenging.

4.1.2. Comparison Against SL-based Methods

Next, we compare VILD against supervised-learning (SL)-based methods, namely behavior cloning (BC) (Pomerleau, 1988), Co-teaching, and BC with diverse-quality demonstrations (BC-D). Specifically, BC performs regression without taking diverse-quality data into account. Co-teaching is a regression extension of a recent classification method (Han et al., 2018) that is robust against diverse-quality data. BC-D takes diverse-quality data into account by performing regression based on the simple model $p_{\theta,\omega}$. We compare the performance of these methods against the performance of VILD with IS in the last 100 iterations of Figure 2.

Figure 3 shows the performance of these SL-based methods against the number of gradient steps. The results for the time-action dependent noise density are similar and given in Appendix D.1. As seen, SL-based methods perform very poorly and their final performance is much worse than that of VILD with IS. In particular, on the Humanoid task, which has the largest state-action space, SL-based methods could not improve upon the initial policy at all. Notice that the performance of SL-based methods sharply degrades as training progresses.
We conjecture that this degradation is due to compounding error caused by overfitting. Specifically, these methods may learn reasonably good policies early on (e.g., in Ant and Walker2d), but the policies overfit to the training data as training progresses. During testing, these overfitted policies may make incorrect predictions, which cause compounding error. In addition, diverse-quality demonstrations make the issue more severe, since neural networks tend to overfit to low-quality data (Arpit et al., 2017). For these reasons, BC performs poorly, as it suffers from both compounding error and diverse-quality demonstrations. Meanwhile, Co-teaching and BC-D are quite robust against diverse-quality demonstrations, but they still suffer from compounding error and perform worse than VILD with IS. Overall, these results indicate that SL methods for diverse-quality data are not suitable for diverse-quality demonstrations.

[Figure 3: learning curves (cumulative reward vs. gradient steps) of SL-based methods on HalfCheetah, Ant, Walker2d, and Humanoid.] Figure 3. Comparison on continuous-control benchmarks against supervised-learning-based methods. BC-D and Co-teaching take diverse quality into account, while BC does not. VILD with IS (red horizontal lines) clearly performs better than these methods. Demonstrations are artificially generated by time-action independent noise density.

[Figure 4: learning curves of InfoGAIL on Pendulum, one curve per context value.] Figure 4. Performance of InfoGAIL on Pendulum with different values of context. Clearly, choosing a good value of context is crucial for InfoGAIL.

[Figure 5: learning curves on Lunar Lander.] Figure 5. Comparison on Lunar Lander against GAIL and InfoGAIL. The model of VILD is incorrect, but VILD with IS still outperforms the comparison methods.

[Figure 6: estimated covariance for demonstrators k = 1, ..., 10 on Lunar Lander.] Figure 6. Results of quality estimation by VILD in Lunar Lander. The estimated covariance $C_\omega(k)$ yields relatively accurate quality for each demonstrator.

4.1.3. Investigating InfoGAIL in the Pendulum Task

From Figure 2, we can see that InfoGAIL performs poorly when its performance is averaged over all contexts. Using the best context improves its performance, but the improvement is quite mild. We investigated this phenomenon and found that the learned multi-modal policy yields similar performance for all contexts (see Appendix D.1), which implies that InfoGAIL fails to learn a good multi-modal policy. This is perhaps because learning a multi-modal policy is challenging in large state-action spaces.

To verify our claim that choosing good modalities is crucial for multi-modal IL (Section 2.4), we perform an experiment on a Pendulum task. This task has a much smaller state-action space, and we expect InfoGAIL to learn a good multi-modal policy there. Figure 4 shows the performance of InfoGAIL for different values of context (denoted by different colors). As seen, the performance of InfoGAIL crucially depends on the value of the context: a well-chosen context yields a policy with good performance, whereas a poorly-chosen context yields a policy with poor performance.
Indeed, averaging these policies over all contexts yields a policy with average performance. This result supports our claim that choosing good modalities is crucial for multi-modal IL methods. However, recall that doing so is typically difficult when the quality of demonstrations is unknown. In our experiments, good modalities could be chosen based on performance, but this is not possible when a ground-truth reward function for performance evaluation is unavailable.

4.2. Robustness Against Incorrect Model Choices

Next, we evaluate the robustness of VILD against incorrect model choices. Specifically, we evaluate VILD when $p_n$ is not Gaussian. We consider a Lunar Lander task, where an optimal policy is available for generating high-quality demonstrations (Brockman et al., 2016). To generate diverse-quality demonstrations, we perturb the parameters of the optimal policy using half-Gaussian distributions with variance depending on $k$. We use $K = 10$ to generate a dataset with approximately 20000 state-action pairs. We compare VILD against GAIL and InfoGAIL; we expect other RL-based methods to perform similarly to GAIL, based on the benchmark results. We use PPO (Schulman et al., 2017) as the RL method. We use a log-sigmoid reward function for VILD to make the comparison against GAIL fair (see Appendix D.2).

Figure 5 shows the performance. It can be seen that VILD with IS outperforms the comparison methods and learns a good policy. This result indicates that VILD with IS is robust against incorrect model choices. On the other hand, VILD without IS does not perform as well as VILD with IS. The discrepancy between them is perhaps due to the IS biases, which enable VILD with IS to learn a better solution. Meanwhile, GAIL and InfoGAIL do not perform well. Using the best context can improve the performance of InfoGAIL, but its performance is still poor compared to VILD with IS.

Figure 6 shows the results of quality estimation by VILD. The quality estimation is reasonably accurate under this scenario. Namely, the value of $C_\omega(k)$ for high-expertise demonstrators (i.e., $k = 1, 2, 3$) is relatively smaller than that for low-expertise demonstrators (i.e., $k = 8, 9, 10$). Note that we cannot directly evaluate the quality estimation against the ground truth because the noise density is not a Gaussian distribution.

[Figure 7: snapshots of the Robosuite Reacher task.] Figure 7. Robosuite Reacher task. Rewards are inversely proportional to the distance between the end-effector and the red object. The depicted trajectory is obtained by VILD with IS (left to right, top to bottom).

[Figure 8: learning curves (cumulative reward vs. transition samples) of RL-based methods on Robosuite Reacher.] Figure 8. Comparison on Robosuite Reacher against RL-based methods using real-world demonstrations. VILD with IS performs overall better than methods that do not take diversity into account.

[Figure 9: learning curves of InfoGAIL on Robosuite Reacher, one curve per context value.] Figure 9. Performance of InfoGAIL on Robosuite Reacher with different values of context. The performance of InfoGAIL is highly unstable.

4.3. Robustness Against Real-world Demonstrations

Lastly, we evaluate the robustness of VILD against real-world demonstrations collected by crowdsourcing (Mandlekar et al., 2018).
While the public datasets were collected for Assembly tasks on the Robosuite platform (Fan et al., 2018), we consider a Reacher task, where demonstrations from the Assembly tasks are clipped when the robot's end-effector contacts the object. We use a Reacher dataset with approximately 5000 state-action pairs. We evaluate RL-based methods using TRPO as the RL method. For VILD, we use a log-sigmoid reward function, which improves the performance.

Figure 7 shows the task and a trajectory obtained by VILD with IS, while Figure 8 shows the performance obtained by collecting 5 million transition samples for RL training. VILD with IS clearly outperforms the comparison methods except InfoGAIL (best). Meanwhile, VILD without IS tends to outperform existing methods except VAIL, InfoGAIL, and InfoGAIL (best). Overall, the results demonstrate that, given 5 million transition samples, VILD with IS is more robust against real-world demonstrations than methods that do not take diversity into account and than InfoGAIL without the best context. Note that the final performance of InfoGAIL (best) is comparable to that of VILD with IS, and InfoGAIL (best) learns faster. Nonetheless, InfoGAIL (best) is unstable, as its performance fluctuates between good and poor. This instability can be observed for most values of context, as shown in Figure 9. This is perhaps due to the large state-action space, which makes learning a multi-modal policy challenging.

5. Conclusion

This paper explored a realistic setting in IL where demonstrations have diverse quality. We showed the deficiency of existing methods, and proposed a robust method called VILD, which learns both the reward function and the noise density by using a variational approach. Empirical evaluations on continuous-control tasks demonstrated that our work enables scalable and data-efficient IL in this setting. In this work, we considered the noise-density assumption, where the quality is determined by noise. In future work, we will consider different assumptions for determining the quality.

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. BH was supported by the Early Career Scheme (ECS) through the Research Grants Council of Hong Kong under Grant No. 22200720, an HKBU Tier-1 Start-up Grant, and an HKBU CSD Start-up Grant. MS was supported by KAKENHI 17H00757.

References

Angluin, D. and Laird, P. Learning from noisy examples. Machine Learning, 1988.

Arpit, D., Jastrzebski, S. K., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A. C., Bengio, Y., and Lacoste-Julien, S. A closer look at memorization in deep networks. In International Conference on Machine Learning, 2017.

Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. Maximum entropy semi-supervised inverse reinforcement learning. In International Joint Conferences on Artificial Intelligence, 2015.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. CoRR, abs/1606.01540, 2016.

Brown, D. S., Goo, W., Nagarajan, P., and Niekum, S. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, 2019.

Chou, P.-W., Maturana, D., and Scherer, S. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International Conference on Machine Learning, 2017.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329-1338, 2016.

Fan, L., Zhu, Y., Zhu, J., Liu, Z., Zeng, O., Gupta, A., Creus-Costa, J., Savarese, S., and Fei-Fei, L. SURREAL: Open-source reinforcement learning framework and robot manipulation benchmark. In Conference on Robot Learning, 2018.

Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, 2018.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I. W., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, 2018.

Han, J., Kamber, M., and Pei, J. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann, 2011.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer New York Inc., 2001.

Hausman, K., Chebotar, Y., Schaal, S., Sukhatme, G. S., and Lim, J. J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, 2017.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 2016.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 1999.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Li, Y., Song, J., and Ermon, S. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, 2017.

Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., Savarese, S., and Fei-Fei, L. ROBOTURK: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, 2018.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. 2013.

Ng, A. Y. and Russell, S. J. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, 2000.

Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., and Peters, J. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 2018.

Peng, X. B., Kanazawa, A., Toyer, S., Abbeel, P., and Levine, S. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. In International Conference on Learning Representations, 2019.

Pomerleau, D. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, 1988.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1994.

Ranganath, R., Gerrish, S., and Blei, D. M. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.

Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. Journal of Machine Learning Research, 2010.

Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. 2005.
Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics, 2010.

Schaal, S. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 1999.

Schroecker, Y., Vecerik, M., and Scholz, J. Generative predecessor models for sample-efficient imitation learning. In International Conference on Learning Representations, 2019.

Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In International Conference on Machine Learning, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Shiarlis, K., Messias, J. V., and Whiteson, S. Inverse reinforcement learning from failure. In International Conference on Autonomous Agents & Multiagent Systems, 2016.

Silver, D., Bagnell, J. A., and Stentz, A. Learning autonomous driving styles and maneuvers from expert demonstration. In International Symposium on Experimental Robotics, 2012.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.

Wang, Z., Merel, J., Reed, S. E., de Freitas, N., Wayne, G., and Heess, N. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, 2017.

Wu, Y., Charoenphakdee, N., Bao, H., Tangkaratt, V., and Sugiyama, M. Imitation learning from imperfect demonstration. In International Conference on Machine Learning, 2019.

Ziebart, B. D., Bagnell, J. A., and Dey, A. K. Modeling interaction via the principle of maximum causal entropy. In International Conference on Machine Learning, 2010.