Imitation Learning from Purified Demonstrations

Yunke Wang 1, Minjing Dong 2, Yukun Zhao 1, Bo Du 1, Chang Xu 3

1 School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence and Wuhan Institute of Data Intelligence, Wuhan University, China. 2 Department of Computer Science, City University of Hong Kong, China. 3 School of Computer Science, Faculty of Engineering, The University of Sydney, Australia. Correspondence to: Bo Du, Chang Xu.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Imitation learning has emerged as a promising approach for addressing sequential decision-making problems, under the assumption that expert demonstrations are optimal. However, in real-world scenarios, most demonstrations are often imperfect, which challenges the effectiveness of imitation learning. While existing research has focused on optimizing with imperfect demonstrations, the training typically requires a certain proportion of optimal demonstrations to guarantee performance. To tackle these problems, we propose to first purify the potential noise in imperfect demonstrations and subsequently conduct imitation learning from these purified demonstrations. Motivated by the success of diffusion models, we introduce a two-step purification via the diffusion process. In the first step, we apply a forward diffusion process to smooth the potential noise in imperfect demonstrations by introducing additional noise. Subsequently, a reverse generative process is utilized to recover the optimal demonstrations from the diffused ones. We provide theoretical evidence supporting our approach, demonstrating that the distance between the purified and optimal demonstration can be bounded. Empirical results on MuJoCo and RoboSuite demonstrate the effectiveness of our method from different aspects.

1. Introduction

Reinforcement Learning (RL) (Sutton & Barto, 2018; Kaelbling et al., 1996) has achieved significant success in addressing sequential decision-making problems (Silver et al., 2016; Van Hasselt et al., 2016; Zha et al., 2021). The core tenet of RL lies in learning an optimal policy by rewarding an agent's actions during its interaction with the environment. The strategic formulation of the reward is essential to recovering the best policy. However, the intricacy of reward engineering often poses significant challenges in real-world tasks, occasionally leading to the failure of RL algorithms (Amodei et al., 2016; Dulac-Arnold et al., 2021). In response to these challenges, an alternative approach to policy learning is Imitation Learning (IL) (Abbeel & Ng, 2004; Hussein et al., 2017; Cai et al., 2021), a learning framework that utilizes expert behaviors to guide agent learning. One fundamental IL approach is Behavioral Cloning (BC) (Torabi et al., 2018; Sasaki & Yamashina, 2021), in which the agent observes the actions of the expert and learns a policy by directly minimizing the action probability discrepancy via supervised learning. This offline training manner has been proven to suffer from compounding error when the agent executes the policy, leading it to drift to new and dangerous states (Xu et al., 2020; 2021).
In contrast, Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016; Fu et al., 2018; Dadashi et al., 2020; Cai et al., 2023; Wang et al., 2023b) has revealed that imitation learning can be framed as a problem of matching state-action occupancy measures, resulting in a more accurate policy. Based on the framework of Generative Adversarial Nets (GAN) (Goodfellow et al., 2014), the discriminator in GAIL is introduced to distinguish demonstrations from expert policy and agent policy, yet the agent policy tries its best to generate behaviors that cheat the judgment of the discriminator. Current imitation learning approaches have shown promising results under the premise that expert demonstrations exhibit high-quality performance. Nonetheless, acquiring optimal demonstrations can often be costly in practical realworld applications. In many cases, the demonstrations available are not optimal, leading to the problem of imperfect demonstrations in imitation learning. Under such a situation, imitation learning algorithms are prone to failure when expert demonstrations are characterized by noise. Hence, the question of how to effectively learn an optimal policy from imperfect demonstrations becomes central to bridging the application gap of imitation learning from simulated environments to real-world tasks. This issue is critical for enhancing the adaptability and effectiveness of imitation Imitation Learning from Purified Demonstrations learning methods in various practical scenarios. Confidence-based methods have been shown to be effective when addressing the issue of imperfect demonstrations in imitation learning. In WGAIL (Wang et al., 2021b) and SAIL (Wang et al., 2021a), confidence estimation for each demonstration is tied to the discriminator during the adversarial training phase. BCND (Sasaki & Yamashina, 2021) proposes that the agent s policy itself can estimate confidence. Rather than predicting the confidence directly through either policy or discriminator, CAIL (Zhang et al., 2021) considers confidence as a learnable parameter and jointly optimizes it with imitation learning methods. The key of confidence-based methods lies in how to derive proper confidence for each expert demonstration. However, most existing methods integrate confidence estimation into the training of IL (Beliaev et al., 2022), leading to a bi-level optimization problem. This bi-level optimization can become unstable and hard to converge during the training of imitation learning, resulting in a breakdown of the confidence estimation. Furthermore, the aforementioned methods are specifically designed to be compatible with either BC or GAIL. This specificity limits their flexibility and restricts their application with other methods. Rather than integrating the handling of imperfect demonstrations into IL training, we propose an approach where the purification of imperfect demonstrations is performed first. Subsequently, imitation learning is carried out with purified demonstrations. Based on this idea, we introduce Diffusion Purified Imitation Learning (DP-IL), which utilizes the forward and reverse diffusion processes to recover the optimal demonstrations from imperfect ones. By incorporating the diffusion process into denoising imperfect demonstrations, we provide theoretical analysis on the effectiveness of noise elimination during the forward diffusion process as well as the reduction of the gap between the optimal and purified distribution. 
The distance between the purified and optimal demonstration can also be bounded. We show that the purified demonstrations can be used in both online and offline imitation learning methods. Experimental results in Mu Jo Co (Todorov et al., 2012) and Robo Suite (Zhu et al., 2020) demonstrate the effectiveness of our method from different aspects. 2. Related Work 2.1. Imitation Learning from Imperfect Demonstrations When dealing with imperfect demonstrations, confidencebased IL methods estimate the weight for imperfect demonstrations to address their importance to agent learning. 2IWIL (Wu et al., 2019) and IC-GAIL (Wu et al., 2019) first investigate the effectiveness of weighting schemes in imitation learning with imperfect demonstrations. However, these approaches relied on manually labeled confidence, which is challenging to obtain in practical scenarios. To relax this requirement, subsequent works have proposed alternative methods. For instance, DWBC (Xu et al., 2022) and DICE-based methods (Kim et al., 2021; Chang et al., 2021; Ma et al., 2022; Yu et al., 2023; Li et al., 2024) leverage a small fraction of known optimal demonstrations to infer the weights for the remaining supplementary demonstrations. CAIL (Zhang et al., 2021) utilizes partially ranked trajectories to guide confidence estimation. Recent works also focus on estimating confidence without exposing too much prior information. WGAIL (Wang et al., 2021b) and SAIL (Wang et al., 2021a) connect confidence estimation to the discriminator during training. BCND (Sasaki & Yamashina, 2021) use the policy network to indicate confidence. However, these approaches have two primary limitations. First, the weight estimation and imitation learning build up a bi-level optimization problem, which can be challenging to converge. Secondly, most methods rely on having a certain proportion of optimal demonstrations, which may not be feasible in practical settings. Preference-based methods (Christiano et al., 2017; Ibarz et al., 2018) have also proven effective for policy learning from imperfect demonstrations. Using human preference can avoid complex reward engineering, thus making policy learning more practical. T-REX (Brown et al., 2019) focuses on extrapolating a reward function based on ranked trajectories. By effectively capturing the rankings, the learned reward function provides valuable feedback to guide the agent s learning process. T-REX only requires precise rankings of trajectories, yet does not set constraints on data quality. As a result, T-REX can achieve satisfactory performance even in scenarios where optimal trajectories are unavailable. D-REX (Brown et al., 2020) introduces relaxation to the ranking constraint of T-REX. Initially, it learns a pre-trained policy through behavioral cloning and subsequently generates ranked trajectories by injecting varying levels of noise into the actions. D-REX uses the same way to learn the reward as T-REX with ranked trajectories. To address potential ranking errors, SSRR (Chen et al., 2021) proposes a novel reward function structure. By mitigating the adverse effects arising from ranking inaccuracies, SSRR enhances the reliability of the learning process. 2.2. Diffusion Model in Imitation Learning Diffusion models have demonstrated significant potential in generative tasks. Some works have explored the use of diffusion models either to directly model the policy network or to serve as the discriminator within Adversarial Imitation Learning (AIL) frameworks. 
For example, the policy network in (Chi et al., 2023; Pearce et al., 2022; Reuss et al., 2023) is defined as a conditional diffusion model that refines noise into an action conditioned on the given state. DiffAIL (Wang et al., 2024) models state-action pairs with an unconditional diffusion model and uses the diffusion loss as part of the discriminator's learning objective. Despite achieving state-of-the-art performance in some benchmark settings, these methods do not address the imperfect demonstration issue in imitation learning. In (Wang et al., 2023a), the diffusion model is trained with expert demonstrations and used to enhance the generalization of the policy trained by BC. Although both (Wang et al., 2023a) and our method first train a diffusion model on expert demonstrations, the diffusion model is used for different purposes.

3. Preliminary

Before diving into our method, we briefly review the definition of the Markov Decision Process (MDP) and Imitation Learning with Distribution Matching.

3.1. Markov Decision Process (MDP)

The MDP is popular for formulating reinforcement learning (RL) (Puterman, 1994) and imitation learning (IL) problems. An MDP normally consists of six basic elements M = (S, A, P, R, γ, µ0), where S is a set of states, A is a set of actions, P : S × A × S → [0, 1] is the stochastic transition probability from the current state s to the next state s′, R : S × A → R is the reward obtained by the agent when taking action a in state s, γ ∈ [0, 1] is the discount rate, and µ0 : S → [0, 1] denotes the initial state distribution.

Definition 1. For any policy π(a|s) : S → A, there is a one-to-one correspondence between π and its occupancy measure ρπ : S × A → [0, 1], which is formulated as

ρπ(s, a) = (1 − γ) π(a|s) Σ_{t=0}^{∞} γ^t Pr(st = s | π), (1)

where Pr(st = s | π) denotes the probability density of state s at timestep t following policy π.

3.2. Imitation Learning via Distribution Matching

The field of Imitation Learning (IL) is concerned with optimizing an agent's behavior in a given environment by utilizing expert demonstrations. Given expert demonstrations De sampled from the expert policy πe, imitation learning methods aim to let the agent policy πθ replicate the expert behavior. Distribution Matching (DM) approaches attempt to match the agent's state-action distribution ρπθ with that of the expert ρπe by minimizing an f-divergence (Nowozin et al., 2016; Ke et al., 2019),

θ* = arg min_θ Df(ρπe(s, a), ρπθ(s, a)), (2)

where f : R+ → R is a convex, lower semi-continuous function satisfying f(1) = 0. Different choices of the f-divergence recover different imitation learning methods. For example, the KL divergence and the Jensen-Shannon divergence recover Behavioral Cloning (BC) (Sasaki & Yamashina, 2021) and Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016), respectively.
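To make the distribution-matching view concrete, the following is a minimal sketch (our own illustration, not the authors' released code) of the KL instance of Eq. 2, i.e., Behavioral Cloning by maximum likelihood on expert state-action pairs. The Gaussian policy with two 100-unit Tanh layers loosely follows the architecture described in Appendix B.2; all names, sizes and the placeholder data are illustrative assumptions.

```python
# Minimal sketch of Eq. 2 with the KL divergence: Behavioral Cloning as
# maximum likelihood of expert actions under a Gaussian policy.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

def bc_loss(policy, states, actions):
    # Negative log-likelihood of demonstrated actions: the KL instance of Eq. 2.
    return -policy(states).log_prob(actions).sum(dim=-1).mean()

# Usage with random placeholder demonstrations:
policy = GaussianPolicy(state_dim=11, action_dim=3)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
states, actions = torch.randn(256, 11), torch.randn(256, 3)
loss = bc_loss(policy, states, actions)
opt.zero_grad(); loss.backward(); opt.step()
```

The same occupancy-matching view underlies GAIL, where the f-divergence is estimated adversarially with a discriminator rather than in closed form.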
4. Methodology

Imitation learning achieves promising results in benchmark tasks under two non-trivial assumptions: (i) expert demonstrations are sampled from an optimal policy, and (ii) expert demonstrations are sufficient to cover the expert distribution. Since these two assumptions may not always hold in practice, our setting falls into the problem of imitation learning with imperfect demonstrations, where we have access to a small fraction of optimal demonstrations and a large fraction of supplementary sub-optimal demonstrations.

To tackle this problem, we propose a two-step method named Diffusion Purified Imitation Learning (DP-IL), in which sub-optimal demonstrations are purified via a combination of forward and reverse diffusion processes. Furthermore, we provide theoretical results on the optimal reverse point that achieves the best performance.

4.1. General Objective

We begin by formalizing the problem of incorporating supplementary sub-optimal demonstrations into imitation learning. For simplicity, we denote the optimal expert distribution as ρπo and the sub-optimal expert distribution as ρπs. In this setting, we have access to a limited number of optimal demonstrations Do = {(si, ai)}_{i=1}^{no} ∼ ρπo(s, a) and a large number of supplementary sub-optimal demonstrations Ds = {(si, ai)}_{i=1}^{ns} ∼ ρπs(s, a). Our objective is to leverage both types of demonstrations, De = Do ∪ Ds, to learn a good policy π for the agent. Typically, Do alone is insufficient for effective policy learning (Kim et al., 2021; Xu et al., 2022), while the involvement of Ds could significantly hurt the optimization of imitation learning algorithms.

To tackle this problem, a potential solution is to purify the supplementary sub-optimal demonstrations Ds under the guidance of the optimal demonstrations Do. Specifically, assuming that ρπs(s, a) can be purified through a distribution transformation F, the agent policy πθ can subsequently learn from the purified demonstrations. To summarize, the objective can be formulated as

min_θ Df(F*(ρπs(s, a)), ρπθ(s, a)) (3)
s.t. F* = arg min_F H(ρπo(s, a), F(ρπs(s, a))), (4)

where H denotes a distance measure between distributions. We assume that the sub-optimality within supplementary demonstrations mainly comes from potential perturbations δ during collection, which form the sub-optimal expert distribution. Thus, taking F as a denoising function is a potential way to close the gap between the optimal and sub-optimal expert distributions. Motivated by the success of diffusion models in adversarial purification (Nie et al., 2022), we introduce a Diffusion Purified Imitation Learning (DP-IL) algorithm, which eliminates the potential noise in ρπs through a two-step diffusion process. The first step involves a forward diffusion process, in which noise is gradually added to the sub-optimal demonstrations over small timesteps. This results in diffused demonstrations that eliminate potential perturbation patterns in the sub-optimal expert distribution ρπs while preserving the original semantic information. In the second step, a reverse-time diffusion process is applied to reconstruct the purified demonstrations from the diffused ones.

4.2. Purification via Diffusion Process

We formulate F as a two-step purification as in the diffusion model, which includes forward and reverse diffusion processes. For the diffusion model, we adopt the widely used DDPM (Ho et al., 2020). For simplicity, we use x to represent the state-action pair (s, a) in the following subsections, since we regard it as a whole during the purification process.

Training the Diffusion Model with Optimal Demos. Since the transformation F is a combination of forward and reverse diffusion processes, we first train a diffusion model ϵϕ on optimal demonstrations xo from ρπo. To simplify the formulation, we concatenate the state-action pair to construct the latent variable x for the diffusion model ϵϕ(xi, i).
Subsequently, we inject noise ϵ into xo at step i, where i indexes the Markov chain in DDPM and can be viewed as the level of noise. The goal of the diffusion model is to reverse the diffusion process (i.e., to denoise), yielding the learning objective

min_ϕ E_{xo, ϵ, i} ‖ϵ − ϵϕ(√ᾱi xo + √(1 − ᾱi) ϵ, i)‖², (5)

where i is sampled from a uniform distribution, ϵ is Gaussian noise sampled from N(0, I), αi = 1 − βi, βi is the noise schedule, which monotonically increases with i, and ᾱi = ∏_{s=1}^{i} αs.

Purifying Sub-optimal Demonstrations. After obtaining the trained diffusion model ϵϕ, we purify sub-optimal demonstrations xs sampled from ρπs with ϵϕ. Supposing the reverse point ir ∈ {0, ..., N − 1} is the intermediate step used in the diffusion purification process, the forward diffusion can be calculated as

x̂ir = √ᾱir xs + √(1 − ᾱir) ϵ, ϵ ∼ N(0, I). (6)

Then, the reverse diffusion process is used to denoise x̂ir to x̂o from timestep ir to 0. The reverse diffusion at every step is defined as

x̂i−1 = (1/√αi) ( x̂i − ((1 − αi)/√(1 − ᾱi)) ϵϕ(x̂i, i) ) + √βi z, z ∼ N(0, I). (7)

Starting from the intermediate step x̂ir, we recover x̂0 via Eq. 7, which is the purified demonstration corresponding to xs.

Choice of the optimal ir. Intuitively, setting a larger ir helps to smooth the potential noise within sub-optimal demonstrations in the forward diffusion. When ir = N − 1, the purified demonstrations can be seen as samples randomly drawn from the diffusion model ϵϕ. However, training such a diffusion model ϵϕ exclusively with limited optimal demonstrations may result in generated demonstrations lacking diversity and coverage, encountering similar issues as imitation learning solely from Do. A properly chosen ir achieves a good trade-off between denoising (Eq. 6) and maintaining the structure of the sub-optimal demonstrations (Eq. 7), and thus helps the agent achieve better performance. Notice that DDPM is a discretization of the VP-SDE (Song et al., 2021b), and the step i in DDPM is a discrete value that varies from 0 to N − 1. In the theoretical analysis below, we extend DDPM to the VP-SDE and scale the timestep to t ∈ [0, 1]. Hence, the problem is to find an optimal t*r.
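For concreteness, a minimal sketch of the two-step purification (Eqs. 5-7; cf. Algorithms 1-2 in the appendix) is given below. It is our own illustration under stated assumptions: eps_model stands for any trained noise predictor ϵϕ(x, i); the schedule (N = 1000, betas linear from 1e-4 to 0.02) follows Appendix B.1, and everything else (names, shapes) is illustrative.

```python
import torch

N = 1000
betas = torch.linspace(1e-4, 0.02, N)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def diffusion_loss(eps_model, x_o):
    """Training objective (Eq. 5) on a batch of optimal demonstrations x_o: [B, d]."""
    i = torch.randint(0, N, (x_o.shape[0],))
    eps = torch.randn_like(x_o)
    a_bar = alpha_bars[i].unsqueeze(-1)
    x_i = a_bar.sqrt() * x_o + (1 - a_bar).sqrt() * eps
    return ((eps - eps_model(x_i, i)) ** 2).mean()

@torch.no_grad()
def purify(eps_model, x_s, i_r):
    """Forward-diffuse sub-optimal demos to step i_r (Eq. 6), then reverse to 0 (Eq. 7)."""
    a_bar = alpha_bars[i_r]
    x = a_bar.sqrt() * x_s + (1 - a_bar).sqrt() * torch.randn_like(x_s)
    for i in range(i_r, 0, -1):
        eps_hat = eps_model(x, torch.full((x.shape[0],), i, dtype=torch.long))
        x = (x - (1 - alphas[i]) / (1 - alpha_bars[i]).sqrt() * eps_hat) / alphas[i].sqrt()
        if i > 1:  # no extra noise is injected at the final reverse step
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x
```

The purified batch can then simply be concatenated with Do and passed to a standard BC or GAIL learner, which is the only change DP-IL makes to the downstream imitation algorithm.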
4.3. Theoretical Analysis

In the above subsection, we proposed a two-step purification to tackle supplementary sub-optimal demonstrations in imitation learning. Since the choice of the optimal t*r remains unknown, we further provide theoretical analysis on (i) the impact of t*r on the purification (Theorem 1 and Theorem 2), and (ii) how different levels of noise affect the choice of t*r (Theorem 4).

We first analyze the purification effectiveness with respect to t during the forward diffusion process. Considering the optimal expert distribution ρπo and the sub-optimal one ρπs, we show that the gap between the diffused distributions ρπo,t and ρπs,t becomes smaller with increasing timestep t, which implies that the potential perturbations can be smoothed via the gradually added noise during the forward process.

Theorem 1. Let {xt}_{t∈[0,1]} be samples in the forward diffusion process. If we denote ρπo,t(x) and ρπs,t(x) as the respective distributions of xt when xo,0 ∼ ρπo,t=0(x) and xs,0 ∼ ρπs,t=0(x), we then have

ς = −(1/2) ∫ ρπo,t(x) βt ‖∇x log ρπo,t(x) − ∇x log ρπs,t(x)‖₂² dx ≤ 0, (8)

where ς = ∂DKL(ρπo,t(x)‖ρπs,t(x))/∂t denotes the derivative of the KL divergence between ρπo,t(x) and ρπs,t(x) with respect to t.

Theorem 1 is derived from DiffPure (Nie et al., 2022). The difference is that we start from the forward diffusion step in DDPM and extend discrete steps to continuous ones. The Fisher divergence Dfisher(p‖q) between two distributions p and q is defined as ∫ p ‖∇ log p − ∇ log q‖₂² dx. Therefore, we have ς = −(βt/2) Dfisher(ρπo,t(x)‖ρπs,t(x)) ≤ 0. The equality is only satisfied when ρπo,t=1(x) = ρπs,t=1(x). Notice that ς ≤ 0 indicates that the KL divergence between ρπo,t(x) and ρπs,t(x) monotonically decreases from t = 0 to t = 1 during the forward diffusion. In other words, with a relatively larger t in the forward diffusion process, the divergence between the diffused optimal and sub-optimal expert distributions can be minimized.

The monotonic decrease of DKL(ρπo,t(x)‖ρπs,t(x)) with respect to t implies that the diffused optimal and sub-optimal expert distributions become closer as t increases. Suppose there exists a κ such that the difference between the two diffused distributions becomes negligible when DKL(ρπo,t(x)‖ρπs,t(x)) ≤ κ. We can then find a minimum tr ∈ [0, 1] that achieves this. Starting from ρπs,tr(x), we can stochastically recover ρπo(x) at t = 0 through the reverse diffusion process. We further provide theoretical evidence that the distance between the purified demonstration and its optimal version can be bounded via the two-step purification.

Theorem 2. Suppose the sub-optimal demonstration xs is perturbed by noise δ compared to the optimal demonstration xo, and let x̂o be the purified demonstration of xs. The L2 distance between xo and x̂o (i.e., ‖xo − x̂o‖) is bounded within the interval [max{0, λ}, Λ] with probability at least 1 − σ, where

λ(ζ, tr, ‖δ‖, σ) = ζ(tr) C̲gd − ‖δ‖ − √(e^{2ζ(tr)} − 1) Cσ,d, (9)
Λ(ζ, tr, ‖δ‖, σ) = ζ(tr) C̄gd + ‖δ‖ + √(e^{2ζ(tr)} − 1) Cσ,d, (10)

and C̲gd ≤ ‖2∇x log ρt(x)‖ ≤ C̄gd, ζ(tr) = ∫_0^{tr} (1/2) βs ds, Cσ,d = √(2d + 4√(2d ln(2/σ))), and σ ≥ 2e^{−d/8}.

As shown in Theorem 2, the L2 distance between the optimal demonstration xo and the purified demonstration x̂o can be bounded by terms related to the intermediate timestep tr. According to Theorem 1, tr should be set to a relatively large value to guarantee that the potential perturbations are removed in the diffused distribution; however, tr cannot be arbitrarily large.

Proposition 3. λ(ζ, tr, ‖δ‖, σ) and Λ(ζ, tr, ‖δ‖, σ) are monotonically increasing with respect to tr.

Proposition 3 suggests that as tr increases, the bounds on ‖xo − x̂o‖ increase as well. Thus, a relatively small tr is required to obtain smaller lower and upper bounds, i.e., a smaller L2 distance. Together with the analysis in Theorem 1, there exists a trade-off on tr between perturbation purification and recovery performance. An optimal timestep t*r is expected to meet two requirements: the potential perturbation is eliminated in the diffused distribution, while the L2 distance to the corresponding optimal demonstration is also kept small.

Theorem 4. There exists an inflection point tp that satisfies λ''(tp) = 0, which makes λ(tr) increase smoothly before tp while increasing rapidly after tp. The optimal t*r is around the inflection point tp, or equivalently around the solution of the equation

2ζ(tr)(e^{2ζ(tr)} − 1)^{3/2} + (e^{2ζ(tr)} − 1)^{1/2} = C̲gd / Cσ,d. (11)

With probability 1 − σ, there is a positive correlation between ‖δ‖ and t*r.

As demonstrated in Theorem 4, the lower bound λ(tr) exhibits a smooth increase initially, implying that selecting a larger tr before the inflection point tp is unlikely to significantly impact λ(tr). Combined with the analysis in Theorem 1, which shows that a larger tr better smooths the perturbation, we conclude that the optimal t*r should be in the vicinity of the inflection point tp. Furthermore, Theorem 4 establishes a positive correlation between ‖δ‖ and t*r, suggesting that for more substantial perturbations within sub-optimal demonstrations, the optimal t*r tends to be relatively larger. Empirical results from the experiments align with and support this observation. We further provide a sensitivity analysis of σ in Theorem 5. The proofs of all theorems are provided in the appendix.
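As a rough numerical companion to this analysis (our own illustration, not part of the paper), the snippet below evaluates how strongly a demonstration is diffused at the reverse points ir reported later in the experiments, using the noise schedule specified in Appendix B.1. A larger ir = N·tr yields a larger ζ(tr), i.e., more smoothing of the perturbation (Theorem 1) but a looser recovery bound (Theorem 2).

```python
# Illustrative only: the diffused sample at step i_r is
#   sqrt(alpha_bar_{i_r}) * x_s + sqrt(1 - alpha_bar_{i_r}) * eps,
# and in the VP-SDE view sqrt(alpha_bar_{i_r}) ~= exp(-zeta(t_r)).
# Schedule follows Appendix B.1; the i_r values are those reported in Table 1.
import numpy as np

N = 1000
betas = np.linspace(1e-4, 0.02, N)
alpha_bar = np.cumprod(1.0 - betas)

for i_r in [1, 5, 10, 30, 50, 100]:
    signal = np.sqrt(alpha_bar[i_r - 1])       # weight on the original demonstration
    noise = np.sqrt(1.0 - alpha_bar[i_r - 1])  # weight on the injected Gaussian noise
    zeta = -np.log(signal)                     # discrete proxy for zeta(t_r)
    print(f"i_r={i_r:4d}  signal={signal:.4f}  noise={noise:.4f}  zeta~{zeta:.5f}")
```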
5. Experiments

In this section, we conduct extensive experiments to verify the effectiveness of DP-IL on MuJoCo (Todorov et al., 2012) and RoboSuite (Zhu et al., 2020) against a range of compared methods. The experimental results demonstrate the advantage of DP-IL from different aspects.

Benchmarking. We first conduct experiments on the MuJoCo benchmarks in OpenAI Gym (Brockman et al., 2016). Four popular tasks (Ant-v2, HalfCheetah-v2, Walker2d-v2 and Hopper-v2) are used to evaluate DP-IL. Performance on the MuJoCo benchmark is measured by the average cumulative ground-truth reward (higher is better). Following common practice, we repeat each experiment for 5 trials with different random seeds. Additionally, to verify the robustness of DP-IL with real-world human-operated demonstrations, we also conduct experiments on a robot control task on the RoboSuite platform.

Table 1. Performance of DP-BC and compared offline imitation learning methods on four MuJoCo tasks with different sub-optimal demonstrations. L1, L2 and L3 denote different levels of perturbation within sub-optimal demonstrations. We also provide the best reverse step i*r of the diffusion purification, where ir = N·tr.

Task | Demos | BC-opt | BC-all | BCND | DWBC | DemoDICE | DP-BC | i*r
Ant-v2 | D1-L1 | 29±18 | -22±33 | 28±28 | -156±95 | -72±8 | 261±54 | 100
Ant-v2 | D1-L2 | - | 292±88 | 199±37 | 132±39 | -3±14 | 803±114 | 50
Ant-v2 | D1-L3 | - | 1500±298 | 1951±311 | 1669±244 | 94±4 | 2547±118 | 10
Ant-v2 | D2-L1 | - | 1089±237 | 1539±150 | 1545±128 | 262±152 | 1402±151 | 30
Ant-v2 | D2-L2 | - | 918±232 | 1514±232 | 1725±214 | 480±139 | 1982±111 | 10
Ant-v2 | D2-L3 | - | 1356±137 | 2232±324 | 2484±29 | 2127±260 | 3414±40 | 5
HalfCheetah-v2 | D1-L1 | -254±103 | 1262±202 | 1267±202 | 618±204 | 22±106 | 1365±147 | 10
HalfCheetah-v2 | D1-L2 | - | 1264±255 | 1862±119 | 1069±264 | 622±159 | 2440±274 | 10
HalfCheetah-v2 | D1-L3 | - | 859±233 | 834±249 | 694±190 | 768±233 | 4042±80 | 10
HalfCheetah-v2 | D2-L1 | - | 1314±194 | 1498±246 | 671±178 | -64±105 | 1530±39 | 30
HalfCheetah-v2 | D2-L2 | - | 2789±14 | 2241±245 | 2676±20 | 1771±158 | 2714±15 | 30
HalfCheetah-v2 | D2-L3 | - | 1780±296 | 1990±408 | 4722±37 | 4508±59 | 4748±40 | 10
Walker2d-v2 | D1-L1 | -4±0 | 21±2 | 102±10 | 1039±164 | 365±39 | 1697±219 | 10
Walker2d-v2 | D1-L2 | - | 1551±87 | 1635±319 | 1617±152 | 1560±291 | 1722±297 | 10
Walker2d-v2 | D1-L3 | - | 815±277 | 192±37 | 2322±369 | 583±108 | 3020±466 | 5
Walker2d-v2 | D2-L1 | - | 500±245 | 686±301 | 1656±186 | 1749±259 | 2208±160 | 30
Walker2d-v2 | D2-L2 | - | 578±273 | 1018±347 | 2114±21 | 1818±342 | 3162±20 | 5
Walker2d-v2 | D2-L3 | - | 702±244 | 2574±358 | 2637±134 | 1765±528 | 3076±205 | 3
Hopper-v2 | D1-L1 | 320±0 | 743±83 | 590±10 | 761±12 | 995±119 | 1000±42 | 10
Hopper-v2 | D1-L2 | - | 2044±157 | 2119±206 | 826±22 | 2059±4 | 3145±11 | 5
Hopper-v2 | D1-L3 | - | 3090±37 | 2883±26 | 1140±27 | 3207±7 | 1825±137 | 1
Hopper-v2 | D2-L1 | - | 2191±118 | 2165±181 | 761±122 | 2336±3 | 2408±113 | 10
Hopper-v2 | D2-L2 | - | 2266±3 | 2256±2 | 826±2 | 2265±8 | 2323±2 | 10
Hopper-v2 | D2-L3 | - | 2712±6 | 3093±10 | 1140±27 | 3306±2 | 2265±171 | 5

Source of Demonstrations. In our setting, we have access to a limited number of optimal demonstrations Do and a large number of supplementary sub-optimal demonstrations Ds. To form Do, an optimal policy πo(a|s) trained by TRPO (Schulman et al., 2015) is used to sample optimal demonstrations.
As for Ds, we use two different ways to generate sub-optimal behaviors, following existing works (Tangkaratt et al., 2020; Wu et al., 2019):

D1: We add Gaussian noise to the optimal action a* of πo to form πs. The action of πs is modeled as a ∼ N(a*, δ), and we choose δ = [0.6, 0.4, 0.25] to form Ds of varying quality.

D2: We save checkpoints during the RL training as the sub-optimal policy πs.

In addition, since we use the optimal demonstrations Do to train the diffusion model, we include Do in the training of the compared methods for a fair comparison.

Compared Methods. Purified demonstrations are applied to both an offline imitation learning method (BC) and an online imitation learning method (GAIL) to verify the generality of DP-IL. For offline imitation learning, we compare our method against several state-of-the-art offline methods, including BCND (Sasaki & Yamashina, 2021), DWBC (Xu et al., 2022) and DemoDICE (Kim et al., 2021). For online imitation learning, we compare our method with several confidence-based methods, including 2IWIL/IC-GAIL (Wu et al., 2019) and WGAIL (Wang et al., 2021b). Further details (e.g., the implementation of DP-IL and the compared methods, data quality, and more results) can be found in the appendix.
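For concreteness, a minimal sketch of the D1 collection scheme described above is given below. It is our own illustration: the environment and optimal-policy handles are assumed (e.g., a Gym MuJoCo task with the classic step API and a TRPO-trained policy), and all names are illustrative.

```python
# Sketch of D1: the sub-optimal policy executes a ~ N(a*, delta) around the
# optimal action a* of a trained policy pi_o.
import numpy as np

def collect_suboptimal_demos(env, optimal_policy, delta, n_pairs=10000, seed=0):
    rng = np.random.default_rng(seed)
    demos, state = [], env.reset()
    while len(demos) < n_pairs:
        a_star = optimal_policy(state)                           # optimal action a*
        action = a_star + rng.normal(0.0, delta, a_star.shape)   # a ~ N(a*, delta)
        demos.append((state, action))
        state, _, done, _ = env.step(action)                     # classic Gym step API
        if done:
            state = env.reset()
    return demos

# D1-L1, D1-L2 and D1-L3 correspond to delta = 0.6, 0.4 and 0.25, respectively.
```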
5.1. Evaluations on MuJoCo

We evaluate the effectiveness of DP-BC under the offline IL setting, and the results are presented in Table 1. As introduced in the experimental setup, we use two different ways to generate sub-optimal demonstrations (i.e., D1 and D2). BC-opt means we only use Do for BC training, while BC-all means we use both Do and Ds to train the policy network. Our findings align with the results reported in (Kim et al., 2021), showing that using only Do can sometimes yield worse performance than BC-all. This can be attributed to the insufficient coverage of the optimal demonstration space by a limited number of optimal demonstrations. BC-all benefits from a larger training dataset and can outperform BC-opt in certain cases. However, the challenge of sub-optimal demonstrations still hampers BC's ability to achieve optimal performance. BCND outperforms the baseline in some cases due to its self-weighting strategy. However, this strategy is unstable when dealing with a high ratio of sub-optimal demonstrations, and we observe that it sometimes collapses. DWBC and DemoDICE perform well on D2 demonstrations with relatively small perturbations (e.g., L3); however, they are more likely to fail when dealing with sub-optimal demonstrations with strong perturbations. In contrast, our method purifies sub-optimal demonstrations before conducting BC, which consistently improves the subsequent performance of behavioral cloning across all types of sub-optimal demonstrations compared to other methods. This demonstrates the efficacy of our approach in enhancing the performance of BC. Furthermore, the optimal reverse point i*r is provided in the table, and the reported results of DP-BC correspond to this i*r. As L1, L2 and L3 represent sub-optimal demonstrations with progressively less perturbation, Theorem 4 predicts a corresponding decrease in the optimal reverse point i*r. This pattern is generally observed, illustrating the theorem's validity. The impact of the optimal i*r will be discussed in the ablation study in Section 5.2.

Figure 1. The training curves of DP-GAIL and other online imitation learning methods with D1-L1 demonstrations. The x-axis is the number of interactions with the environment and the shaded area indicates the standard error.

Besides offline imitation learning, we also utilize diffusion-purified demonstrations for GAIL training and compare DP-GAIL with other state-of-the-art online imitation learning methods that also address the imperfect demonstration issue. The training curves of DP-GAIL and the compared methods are depicted in Figure 1. The curves clearly demonstrate the superiority of DP-GAIL over the compared methods. This shows the efficacy of noise purification in improving the performance of online imitation learning methods.

5.2. Ablation Study

Different Types of Noise. In D1, sub-optimal demonstrations are formed by adding Gaussian noise to the optimal action. It is interesting to further explore other types of noise.

Table 2. The performance of DP-BC with demonstrations corrupted by different types of noise.

Task | Demos | BC | BCND | DP-BC
Ant | Uni-L1 | 209±138 | 66±17 | 343±16
Ant | Uni-L2 | 396±42 | 333±60 | 480±15
Ant | Uni-L3 | 704±97 | 1472±229 | 1925±227
HalfCheetah | Uni-L1 | 704±71 | 692±37 | 733±8
HalfCheetah | Uni-L2 | 922±207 | 705±179 | 1248±54
HalfCheetah | Uni-L3 | 1805±15 | 1933±21 | 2268±11
Ant | S&P-L1 | 209±138 | 686±45 | 744±17
Ant | S&P-L2 | 396±42 | 477±62 | 508±12
Ant | S&P-L3 | 704±97 | 755±143 | 689±37
HalfCheetah | S&P-L1 | 2143±10 | 2042±113 | 2731±18
HalfCheetah | S&P-L2 | 1964±195 | 1873±270 | 3550±18
HalfCheetah | S&P-L3 | 579±280 | 1160±295 | 2820±383

Table 3. The performance of DP-BC on the Ant-v2 task when defining different F in Eq. 3. G-Filter, M-Filter and Med-Filter denote using a Gaussian, Mean and Median filter as F to denoise sub-optimal demonstrations.

Demos | BC | DP-BC | G-Filter | M-Filter | Med-Filter
D1-L1 | -22 | 261 | 242 | 182 | 138
D1-L2 | 292 | 803 | 729 | 736 | 437
D1-L3 | 1500 | 2547 | 771 | 1786 | 455

We consider adding uniform noise and salt-and-pepper noise. Similar to D1, we set δ = [0.25, 0.4, 0.6] to form L1 to L3 demonstrations. From Table 2, we observe that DP-BC performs best in most cases. This suggests the robustness of DP-BC when dealing with different kinds of noise. Intuitively, the noise added to the data is overwhelmed by the accumulating Gaussian noise during the forward diffusion process, which ultimately makes it less prominent.

Comparison with Other Purification Methods. The transformation F(·) is defined as a combination of the forward and reverse diffusion processes to denoise sub-optimal demonstrations. Apart from the diffusion model, various other denoising methods can be employed as the transformation F. Since filters are commonly used to smooth noisy data, we utilize three different filters (mean, median, and Gaussian) as F to denoise the imperfect demonstrations. The results are presented in Table 3. It is evident from the table that our method outperforms the filter-based methods on all types of sub-optimal demonstrations (i.e., D1-L1, D1-L2 and D1-L3). Among the filter-based methods, the Mean filter achieves the best performance. Another notable finding is that when the noise level is high in the sub-optimal demonstrations (e.g., L1), filter-based methods tend to maintain a greater advantage over the baseline.

Multiple Demonstrators. In addition to collecting sub-optimal demonstrations from a single demonstrator, we also investigate the performance of DP-BC when confronted with sub-optimal demonstrations sampled from multiple demonstrators.

[Figure 2 plots the average return (×1e3) of DP-BC on Ant-v2, HalfCheetah-v2, Walker2d-v2 and Hopper-v2 for demonstration levels L1-L3 under different reverse points (e.g., ir = 50 and ir = 100); plot omitted.]
Figure 2. Impact of the diffusion time tr with demonstrations of different optimality.

Table 4. The performance of DP-BC with mixed demonstrations (i.e., D1-L1, D1-L2, and D1-L3).

Task | BC | BCND | DWBC | DemoDICE | DP-BC
Ant | 193 | 411 | 339 | -53 | 601
HalfCheetah | 1255 | 921 | 265 | 344 | 3146
Walker2d | 169 | 441 | 290 | 488 | 1390
Hopper | 587 | 564 | 541 | 605 | 655

The results are presented in Table 4, where the mixed demonstrations denote a combination of L1, L2, and L3 demonstrations. As shown in the table, DP-BC with demonstrations sampled from a mixture of datasets also consistently outperforms the baseline and the other compared methods by a substantial margin.

Impact of the optimal ir. As discussed in Section 4.3, ir is an important hyper-parameter for achieving the trade-off between smoothing the noise and keeping the semantic information. To investigate how the choice of ir affects the effectiveness of noise purification, we conducted experiments on demonstrations with different levels of optimality. Our empirical results confirm the theoretical findings of Theorem 1, suggesting that demonstrations with less optimality require a relatively larger timestep tr to smooth the noise. For example, on the Ant-v2 task, the optimal ir is 100 for D1-L1 demonstrations while the optimal ir is 10 for D1-L3 demonstrations. As the quality of the demonstrations improves (e.g., from L1 to L3), we observe a gradual decrease in the optimal value of ir. This leads to a smaller value of the bound, since ζ(tr) increases monotonically with tr (and ir = N·tr). Hence, diffusion-purified imitation learning should perform better with a smaller ir in such a case, which is consistent with the results in Figure 2.

5.3. Evaluations on the RoboSuite Platform

We also evaluate the robustness of DP-BC on the RoboSuite platform (Zhu et al., 2020) with real-world demonstrations. We consider a reaching task within the Nut Assembly task with the Sawyer robot. In this task, the goal for the robot is to move a nut close to the peg; it earns a higher reward if the nut is closer to the peg.

Table 5. Success rate of DP-BC on the RoboSuite platform with human demonstrations.

Method | BC | BCND | DWBC | DP-BC (Ours)
Success Rate | 0.76 | 0.56 | 0.82 | 0.86

ir | 1 | 3 | 5 | 10
DP-BC | 0.78 | 0.86 | 0.74 | 0.60

Figure 3. Visualization of the Sawyer Nut Assembly task on the RoboSuite platform and the quality of the human demonstrations.

During the reaching phase, the Sawyer robot's arm operates with the gripper in a fully open state, not actively attempting to grasp any objects. Its role is merely to move the nut across the plane. As such, the grasp command is not engaged, which prevents the robot's arm from inserting the nut into the peg. Hence, an episode in the reaching task completes only when the timestep reaches the horizon length. We use real-world demonstrations collected by human operators from RoboTurk (Mandlekar et al., 2018). The demonstrations contain 10 trajectories of comparable length (around 500 state-action pairs each). Based on the cumulative rewards of the trajectories shown in Figure 3, we regard the two trajectories with larger rewards (e.g., > 300) as optimal demonstrations, and the remaining trajectories are regarded as supplementary sub-optimal demonstrations. In this reaching task, we report the success rate rather than the reward to assess the learned policy. We set the termination condition of an episode to be when the distance between the nut and the peg is less than 0.1. Based on this criterion, we provide the success rates of different methods, which are calculated over 50 sampled trajectories.
The results are shown in Table 5, and we can observe that the proposed method has higher success rate Imitation Learning from Purified Demonstrations than compared methods like BCND and DWBC. We also conduct an ablation study on the timestep ir in this task and we find that setting ir = 3 leads to the best performance. The results presented in Table 8 also support our theorem about there exists a trade-off in finding an optimal ir. To further evaluate the relationship between i r and different levels of sub-optimal demonstrations, we divided the suboptimal demonstrations into two groups and evaluate their performance in the appendix (Table 8). 6. Conclusion In this paper, we propose to tackle the imperfect demonstrations issue by conducting a two-step purification process to eliminate the potential noises based on the diffusion process. The distance gap between optimal and sub-optimal expert distributions can be minimized after forward diffusion. The purified demonstrations can then be recovered from diffused one via reverse diffusion. Additionally, we provide sufficient theoretical analysis to indicate the impact of the reverse point on the purification. The proposed method can be easily adopted in existing imitation learning frameworks, such as GAIL and BC, to alleviate the effect of sub-optimal expert demonstrations. We conduct extensive experiments on Mu Jo Co and Robo Suite with different types of sub-optimal demonstrations to evaluate the effectiveness of diffusion purification. The comparison results demonstrate the superiority of DP-IL over other baselines. Impact Statement Our work enables agents to learn from a broader range of imperfect data. There are many potential societal consequences of our work, and the safety of learning from such purified data remains an area worthy of further investigation. Acknowledgements This work was supported by the National Key Research and Development Program of China 2023YFC2705700, National Natural Science Foundation of China under Grants 62225113, the Innovative Research Group Project of Hubei Province under Grants 2024AFA017, the Australian Research Council under Projects DP210101859 and FT230100549, and the City U APRC Project No. 9610680. Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. ACM, 2004. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schul- man, J., and Man e, D. Concrete problems in ai safety. ar Xiv preprint ar Xiv:1606.06565, 2016. Beliaev, M., Shih, A., Ermon, S., Sadigh, D., and Pedarsani, R. Imitation learning by estimating expertise of demonstrators. In International Conference on Machine Learning, pp. 1732 1748. PMLR, 2022. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016. Brown, D. S., Goo, W., Nagarajan, P., and Niekum, S. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. ar Xiv preprint ar Xiv:1904.06387, 2019. Brown, D. S., Goo, W., and Niekum, S. Better-thandemonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning, pp. 330 359, 2020. Cai, X.-Q., Ding, Y.-X., Jiang, Y., and Zhou, Z.-H. Imitation learning from pixel-level demonstrations by hashreward. In Proceedings of the 20th International Conference on Autonomous Agents and Multi Agent Systems, pp. 279 287, 2021. 
Cai, X.-Q., Ding, Y.-X., Chen, Z., Jiang, Y., Sugiyama, M., and Zhou, Z.-H. Seeing differently, acting similarly: Heterogeneously observable imitation learning. In The Eleventh International Conference on Learning Representations, 2023. Chang, J., Uehara, M., Sreenivas, D., Kidambi, R., and Sun, W. Mitigating covariate shift in imitation learning via offline data with partial coverage. Advances in Neural Information Processing Systems, 34:965 979, 2021. Chen, L., Paleja, R., and Gombolay, M. Learning from suboptimal demonstration via self-supervised reward regression. In Conference on Robot Learning, pp. 1262 1277. PMLR, 2021. Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. ar Xiv preprint ar Xiv:2303.04137, 2023. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299 4307, 2017. Dadashi, R., Hussenot, L., Geist, M., and Pietquin, O. Primal wasserstein imitation learning. In International Conference on Learning Representations, 2020. Imitation Learning from Purified Demonstrations Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419 2468, 2021. Fu, J., Luo, K., and Levine, S. Learning robust rewards with adverserial inverse reinforcement learning. In International Conference on Learning Representations, 2018. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565 4573, 2016. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1 35, 2017. Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. Reward learning from human preferences and demonstrations in atari. Advances in neural information processing systems, 31, 2018. Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237 285, 1996. Ke, L., Barnes, M., Sun, W., Lee, G., Choudhury, S., and Srinivasa, S. Imitation learning as f-divergence minimization. ar Xiv preprint ar Xiv:1905.12888, 2019. Kim, G.-H., Seo, S., Lee, J., Jeon, W., Hwang, H., Yang, H., and Kim, K.-E. Demodice: Offline imitation learning with supplementary imperfect demonstrations. In International Conference on Learning Representations, 2021. Li, Z., Xu, T., Qin, Z., Yu, Y., and Luo, Z.-Q. Imitation learning from imperfection: Theoretical justifications and algorithms. Advances in Neural Information Processing Systems, 36, 2024. Ma, Y., Shen, A., Jayaraman, D., and Bastani, O. Versatile offline imitation from observations and examples via regularized state-occupancy matching. In International Conference on Machine Learning, pp. 14639 14663. PMLR, 2022. Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., et al. 
Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pp. 879 893. PMLR, 2018. Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and Anandkumar, A. Diffusion models for adversarial purification. In International Conference on Machine Learning, pp. 16805 16827. PMLR, 2022. Nowozin, S., Cseke, B., and Tomioka, R. f-gan: Training generative neural samplers using variational divergence minimization. Advances in neural information processing systems, 29, 2016. Pearce, T., Rashid, T., Kanervisto, A., Bignell, D., Sun, M., Georgescu, R., Macua, S. V., Tan, S. Z., Momennejad, I., Hofmann, K., et al. Imitating human behaviour with diffusion models. In Deep Reinforcement Learning Workshop Neur IPS 2022, 2022. Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994. ISBN 978-0-471619772. doi: 10.1002/9780470316887. Reuss, M., Li, M., Jia, X., and Lioutikov, R. Goalconditioned imitation learning using score-based diffusion policies. ar Xiv preprint ar Xiv:2304.02532, 2023. S arkk a, S. and Solin, A. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019. Sasaki, F. and Yamashina, R. Behavioral cloning from noisy demonstrations. In International Conference on Learning Representations, 2021. Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889 1897, 2015. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34: 1415 1428, 2021a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. Imitation Learning from Purified Demonstrations Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Tangkaratt, V., Han, B., Khan, M. E., and Sugiyama, M. Variational imitation learning with diverse-quality demonstrations. In Proceedings of the 37th International Conference on Machine Learning, pp. 9407 9417, 2020. Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012. Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 4950 4957, 2018. Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016. Wang, B., Wu, G., Pang, T., Zhang, Y., and Yin, Y. Diffail: Diffusion adversarial imitation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 15447 15455, 2024. Wang, H.-C., Chen, S.-F., Hsu, M.-H., Lai, C.-M., and Sun, S.-H. Diffusion model-augmented behavioral cloning. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023a. Wang, Y., Xu, C., and Du, B. 
Robust adversarial imitation learning via adaptively-selected demonstrations. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pp. 3155 3161, 2021a. Wang, Y., Xu, C., Du, B., and Lee, H. Learning to weight imperfect demonstrations. In International Conference on Machine Learning, pp. 10961 10970. PMLR, 2021b. Wang, Y., Du, B., and Xu, C. Unlabeled imperfect demonstrations in adversarial imitation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 10262 10270, 2023b. Wu, Y.-H., Charoenphakdee, N., Bao, H., Tangkaratt, V., and Sugiyama, M. Imitation learning from imperfect demonstration. In International Conference on Machine Learning, pp. 6818 6827. PMLR, 2019. Xu, H., Zhan, X., Yin, H., and Qin, H. Discriminatorweighted offline imitation learning from suboptimal demonstrations. In International Conference on Machine Learning, pp. 24725 24742. PMLR, 2022. Xu, T., Li, Z., and Yu, Y. Error bounds of imitating policies and environments. Advances in Neural Information Processing Systems, 33:15737 15749, 2020. Xu, T., Li, Z., and Yu, Y. Error bounds of imitating policies and environments for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6968 6980, 2021. Yu, L., Yu, T., Song, J., Neiswanger, W., and Ermon, S. Offline imitation learning with suboptimal demonstrations via relaxed distribution matching. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pp. 11016 11024, 2023. Zha, D., Xie, J., Ma, W., Zhang, S., Lian, X., Hu, X., and Liu, J. Douzero: Mastering doudizhu with self-play deep reinforcement learning. In International Conference on Machine Learning, pp. 12333 12344. PMLR, 2021. Zhang, S., Cao, Z., Sadigh, D., and Sui, Y. Confidenceaware imitation learning from demonstrations with varying optimality. Advances in Neural Information Processing Systems, 34:12340 12350, 2021. Zhu, Y., Wong, J., Mandlekar, A., and Mart ın-Mart ın, R. robosuite: A modular simulation framework and benchmark for robot learning. In ar Xiv preprint ar Xiv:2009.12293, 2020. Imitation Learning from Purified Demonstrations A.1. Proof of Theorem 1 Theorem 1. Let {xt}t {0,1} be the forward diffusion process in Eq. 6. If we denote ρπo,t(x) and ρπs,t(x) as the respective distributions of xt when x0 ρπo,t=0(x) and x0 ρπs,t=0(x), we then have, Z ρπo,t(x)βt|| x log ρπo,t(x) x log ρπs,t(x)||2 2dx 0. (12) where ς = DKL(ρπo,t(x)||ρπs,t(x)) t denotes the derivative of t to the KL divergence between ρπo,t(x) and ρπs,t(x). Proof. Following the proof in (Song et al., 2021a), we first make two assumptions on ρπo,t(x) and ρπs,t(x). Supposing both ρπo,t(x) and ρπs,t(x) are smooth and fast decaying functions, we have lim x ρπo,t(x) x log ρπo,t(x) = 0 (13) lim x ρπs,t(x) x log ρπs,t(x) = 0 (14) The forward diffusion process in Eq. 6 is a discrete Markov chain, which can be written as 1 βi xi 1 + p βi ϵi, i = 1, ..., N, (15) where ϵi N(0, I). To obtain a process of continuous transformation in time, we can rewrite Eq. 15 as N N ϵ(i+1)/N, i = 0, ..., N 1, (16) where { βi/N = Nβi}N 1 i=0 . In the limit of N , { βi/N = Nβi}N 1 i=0 and {Xi/N}N 1 i=0 become sequences of function {βt}1 t=0 and {xt}1 t=0. Let t = 1 N , t = i N , we can rewrite Eq. 16 as 1 βt+ t t xt + p βt+ t t ϵt (17) 2βt+ t t) xt + p βt+ t t ϵt + o(βt t) (18) In the limit of t 0, we have xt+ t xt = 1 2βt txt + p βt t ϵt (19) In other words, we can transform Eq. 
6 to the following VP-SDE, where w is a standard Wiener process. The Fokker Planck equation for the VP-SDE above describes the time-evolution of the stochastic process s associated probability density function, and is given by 2βtρπo,t(x) x log ρπo,t(x) + 1 2βtxρπo,t(x) (21) = x ho(x, t)ρπo,t(x) , (22) where ho(x, t) is defined as ho(x, t) = 1 2βt x log ρπo,t(x) + 1 2βtx. Then, we denote ς = DKL(ρπo,t(x)||ρπs,t(x)) Imitation Learning from Purified Demonstrations can be re-written as follows, Z ρπo,t(x) log ρπo,t(x) ρπs,t(x)dx (23) = Z ρπo,t(x) t log ρπo,t(x) ρπs,t(x)dx + Z ρπo,t(x) t dx Z ρπo,t(x) ρπs,t(x) ρπs,t(x) (i) = Z x ho(x, t)ρπo,t(x) log ρπo,t(x) ρπs,t(x)dx Z ρπo,t(x) ρπs,t(x) x hs(x, t)ρπs,t(x) (25) (ii) = Z ρπo,t(x)[h T o (x, t) h T s (x, t)][ x log ρπo,t(x) x log ρπs,t(x)] (26) Z ρπo,t(x)βt| x log ρπo,t(x) x log ρπs,t(x)|2 2dx (27) 2 Ex ρπo,t(x)(| x log ρπo,t(x) x log ρπs,t(x)|2 2) (28) where (i) can be obtained by Eq. 13 and Eq. 14, (ii) can be obtained by integration by parts. A.2. Proof of Theorem 2 Theorem 2. Supposing the sub-optimal demonstration xs perturbed by noise δ compared to the optimal demonstration xo, and ˆxo is a purified demonstration of xs, the L2 distance between xo and ˆxo (i.e., xo ˆxo ) is bounded within the interval [max{0, λ}, Λ] with a probability of at least 1 σ, where λ(ζ, tr, δ, σ) = ζ(tr) Cgd δ p e2ζ(tr) 1Cσ,d, (30) Λ(ζ, tr, δ, σ) = ζ(tr)Cgd + δ + p e2ζ(tr) 1Cσ,d, (31) and Cgd 2 x log ρt(x) Cgd, ζ(tr) = R tr 0 1 2βsds, Cσ,d = σ d ), σ 2e d/8. Proof. Following the proof in (Nie et al., 2022), we extend the distance bound to both upper and lower bounds. The sub-optimal demonstration xs can be conceptualized as a perturbation of the optimal demonstration xo, with a small disturbance δ applied to each state-action pair. Furthermore, let xt denote the demonstration obtained after subjecting xs to a forward diffusion process satisfies 1 γ(t)ϵ1, (32) where γ(t) = exp( R t 0 βsds) and ϵ1 N(0, Id), the L2 distance between the purified demonstrations ˆxo and its related optimal demonstration xo can be upper bounded as ˆxo xo = xt + (ˆxo xt) xo (33) 2βt[x + 2 x log ρt(x)]dt + Z 0 βtdw xo (34) 2βtxdt + Z 0 βtdw x + Z 0 t βt x log ρt(x)dt , (35) where Eq. 34 follows with the definition of reverse-time diffusion and Eq. 35 can be obtained via triangle inequality. Similarly, it can be lower bounded as t βt x log ρt(x)dt xt + Z 0 2βtxdt + Z 0 βtdw x (36) The sum of the first 3 terms in Eq. 34 is a time-varying Ornstein-Uhlenbeck process with a negative time increment that starts from t = t to t = 0 with the initial value set to xt. Denote by x 0 its solution, from (S arkk a & Solin, 2019) we know Imitation Learning from Purified Demonstrations x 0 follows a Gaussian distribution, where its mean µ(0) and covariance matrix Σ(0) are the solutions of the following two differential equations, respectively, dt = βtΣ + βt Id, (37) with the initial conditions µ(t) = xt and Σ(t) = 0. By solving these two differential equations, we have that conditioned on xt, x 0 N(eζ(tr)xt, (e2ζ(tr) 1)Id), where ζ(tr) = R t 0 1 2βsds. 1 γ(t)ϵ1 (38) =e ζ(tr)(xo + δ) + p 1 e 2ζ(tr)ϵ1 (39) Using the reparameterization trick, we have, x 0 xo = eζ(tr)xt + p e2ζ(tr) 1ϵ2 xo (40) = eζ(tr) e ζ(tr)(xo + δ) + p 1 e 2ζ(tr)ϵ1 + p e2ζ(tr) 1ϵ2 xo (41) e2ζ(tr) 1(ϵ1 + ϵ2) + δ (42) 2(e2ζ(tr) 1)ϵ + δ (43) Since ϵ1 and ϵ2 are independent and taken from the distribution N(0, Id), ϵ = ϵ1+ϵ2 2 N(0, Id). 
Assuming that Cgd 2 x log ρt(x) Cgd, we have, ˆxo xo x 0 xo + Z 0 tr βt log ρt(x))dt (44) 2(e2ζ(tr) 1)ϵ + δ + ζ(tr)Cgd (45) 2(e2ζ(tr) 1)ϵ + δ + ζ(tr)Cgd (46) Similarly, we also have, tr βt log ρt(x)dt x 0 x (47) ζ(tr) Cgd q 2(e2ζ(tr) 1)ϵ + δ (48) ζ(tr) Cgd q 2(e2ζ(tr) 1)ϵ δ (49) Since ϵ 2 = Z2 1 + Z2 2 + + Z2 d X 2(d), where Z2 i N(0, 1). It can be regarded as a sub-exponential random variable with parameters (ν, α) = (2 d, 4), we have n=1 Z2 n 1| t 2e dt2/8, for all t (0, 1) (50) Let σ = 2e dt2/8, we have By integrating the results above, we can derive Eqs. 30, 31. These two equations present the lower and upper bounds of the L2 distance between ˆxo and xo with the probability of at least 1 σ, where σ 2e d/8. Imitation Learning from Purified Demonstrations A.3. Proof of Proposition 3 Proposition 3. λ(ζ, tr, δ, σ) and Λ(ζ, tr, δ, σ) are monotonically increasing with respect to tr. Proof. The derivative of λ(tr) with respect to tr can be written as λ (tr) = ζ (tr)(e2ζ(tr) 1) 1/2Cσ,d + ζ (tr) Cgd (52) With the assumption that the lower bound is non-negative, we have Cgd 1 ζ(tr)( δ + p e2ζ(tr) 1Cσ,d) (53) By substituting Eq. 53 into Eq. 52, we have, λ (tr) ζ (tr) ζ(tr) (e2ζ(tr) 1) 1/2[Cσ,d(e2ζ(tr) ζ(tr) 1) + p e2ζ(tr) 1 δ ] (54) Given that ζ(tr) 0, ζ (tr) = βtr 0 and e2ζ(tr) ζ(tr) 1 0, it follows that λ (tr) 0. Considering the upper bound Λ(tr) = ζ(tr)Cgd + δ + e2ζ(tr) 1Cσ,d, and noting that Λ(tr) increases monotonically as a function of ζ(tr), and since ζ(tr) itself increases monotonically with tr, it can be deduced that Λ(tr) is also a monotonically increasing function of tr. A.4. Proof of Theorem 4 Theorem 4. There exists a inflection point tp that satisfies λ (tp) = 0, which makes λ(tr) increases smoothly before tp while increasing rapidly after tp. The optimal t r is around the inflection point tp, or is around the solution of the equation ζ (tr) (e2ζ(tr) 1) 3/2 + (e2ζ(tr) 1) 1/2 = Cgd Cσ,d . (55) With a probability of 1 σ, there is a positive correlation between δ and t r. Proof. We aim to identify a critical point tp where the rate of growth of the function experiences a significant increase. In mathematical words, λ(tr) should satisfies λ (tr) 0, λ (tp) = 0 and λ (tr) < 0 for tr < tp, λ (tr) > 0 for tr > tp. The second derivative of λ(tr) can be expressed as: λ (tr) = h ζ (tr) + (ζ (tr))2(e2ζ(tr) 1) i (e2ζ(tr) 1) 1/2Cσ,d + ζ (tr) Cgd (56) Therefore, λ (tr) = 0 if and only if Eq. 55 holds. Now, we can analyze the correlation between δ and t r. Note that Cgd is the lower bound of 2 x log ρt(x) . With the increase of disturbance δ , Cgd will subsequently increase, causing the right-hand side of Eq. 55 to increase. In our setting, ζ(tr) = 1 4kt2 r. Therefore, Eq. 55 can be rewritten as: 2kt2 r(e 1 2 kt2 r 1) 3/2 + (e 1 2 kt2 r 1) 1/2 = 2ζ(tr)(e2ζ(tr) 1) 3/2 + (e2ζ(tr) 1) 1/2 = Cgd Cσ,d (57) The left-hand side of Eq. 55 is monotonically increasing with respect to ζ(tr) in the interval [0, 1]. Since ζ(tr) is also a monotonically increasing function with respect to tr, within a certain range, the left-hand side of the equation is a monotonically increasing function with respect to tr. In other words, an increase in δ will lead to an increase in t r, indicating a positive correlation between δ and t r. A.5. Sensitive Analysis of σ Imitation Learning from Purified Demonstrations 0.00 0.01 0.02 0.03 0.04 0.05 Elasticity E , C , d for Various d d = 14 d = 23 d = 119 Figure 4. The corresponding curve of the elasticity Eσ,Cσ,d with respect to σ. 
A.5. Sensitivity Analysis of σ

The lower bound of $\|x_o - \hat{x}_o\|$ can be regarded as a function of four parameters: $t_r$, $\sigma$, $d$, and $\delta$. However, once the optimal demonstration $x_o$ and the corresponding sub-optimal demonstration $x_s$ are provided, $\delta$ and $d$ are fixed. Furthermore, considering that different demonstrations exhibit varying tolerances for sub-optimal behavior, the magnitude of $\sigma$ can be set according to the actual scenario; in most cases, $\sigma \in [0.0001, 0.05]$. Therefore, $\lambda$ is indeed an increasing function of $t_r$ once a pair of demonstrations and a confidence level $1-\sigma$ are specified. However, it can be observed that among these four parameters, only $\sigma$ is subjectively determined by humans. We hope that, for the same pair of optimal and sub-optimal demonstrations, different subjective choices of $\sigma$ have little impact on the finally calculated value of $t_r$. We provide a new theorem below to show that different confidence levels $\sigma$ have little impact on the choice of $t_r$.

Theorem 5. The selection of $t_r^*$ exhibits low sensitivity to the magnitude of $\sigma$, indicating that, for a purified demonstration, configuring different confidence levels has minimal impact on $t_r^*$. Additionally, as the dimension $d$ of the demonstration increases, the sensitivity of $t_r^*$ to $\sigma$ further decreases.

Proof. First of all, we introduce elasticity, an indicator that measures the extent to which a change in one variable affects another variable. Define the elasticity $E_{x,y}$ of variables $x$ and $y$ as

$$E_{x,y} = \frac{d(\ln y)}{d(\ln x)}. \quad (58)$$

We already know that $C_{\sigma,d} = \sqrt{2d + 4\sqrt{2d\ln(2/\sigma)}}$, which gives

$$E_{\sigma, C_{\sigma,d}} = \frac{1}{\sqrt{2d\ln(2/\sigma)} + 4\ln(2/\sigma)}. \quad (59)$$

Since $1-\sigma$ represents the confidence level, in most cases the value of $\sigma$ ranges from 0.0001 to 0.05. Note that the higher the accuracy requirement for the demonstration, the smaller the selected $\sigma$. In our experiments, the dimension of the data is 14, 23 and 119, and the corresponding curves of the elasticity $E_{\sigma, C_{\sigma,d}}$ are shown in Figure 4. From the figure, we can find that $E_{\sigma, C_{\sigma,d}} < 0.0113$ for every $d$, which reflects that $C_{\sigma,d}$ is not sensitive to the change of $\sigma$ when $\sigma$ ranges from 0.0001 to 0.05.

Figure 4. The curve of the elasticity $E_{\sigma, C_{\sigma,d}}$ with respect to $\sigma$ for $d = 14$, $23$, and $119$.

In light of the analysis derived from Eq. 59, it is evident that with increasing complexity of the demonstration, as indicated by a larger data dimension $d$, the sensitivity of $C_{\sigma,d}$ to variations in $\sigma$ (measured by $E_{\sigma, C_{\sigma,d}}$) exhibits a notable decrease. This observation implies that in scenarios involving more intricate demonstrations, particularly when $\sigma$ lies within the range of 0.0001 to 0.05, the selection of $\sigma$ becomes substantially less consequential. This diminishing relevance of the $\sigma$ selection is a critical consideration in the context of increasingly complex data dimensions, as it underscores that, when dealing with high-dimensional real-world data, the choice of different $\sigma$ has a negligible impact on the optimal $t_r^*$.
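As a quick numerical companion to Eq. 59 (a sketch only; the closed form of $C_{\sigma,d}$ is taken as reconstructed in the proof of Theorem 2), the elasticity can be evaluated directly for the dimensions used in the experiments:

```python
import numpy as np

# Elasticity of C_sigma_d with respect to sigma, following Eq. 59.
# The dimensions 14, 23 and 119 are those mentioned for the experiments;
# the sigma grid spans the 0.0001 to 0.05 range discussed above.
def elasticity(sigma, d):
    log_term = np.log(2.0 / sigma)
    return 1.0 / (np.sqrt(2.0 * d * log_term) + 4.0 * log_term)

sigmas = np.logspace(np.log10(1e-4), np.log10(0.05), 6)
for d in (14, 23, 119):
    values = elasticity(sigmas, d)
    # Larger d shrinks the elasticity, i.e., C_sigma_d reacts less to sigma.
    print(f"d = {d:3d}: elasticity in [{values.min():.4f}, {values.max():.4f}] over this sigma range")
```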
B. Experimental Details

Our source code and training data will be available at https://github.com/yunke-wang/dp-il.

B.1. Implementation Details of Diffusion Model

The architecture of the diffusion model $\epsilon_\phi$ is five linear layers with a 0.2 dropout ratio, batch normalization, and ReLU non-linear activation, and the size of the hidden dimension is 1024. We implement the above two algorithms based on this repo (https://github.com/abarankab/DDPM). The training epoch is set to 10000 and the learning rate of $\epsilon_\phi$ is set to 1e-4. We set $N = 1000$ for all experiments and set the forward process variances to constants increasing linearly from $\beta_1 = 1\mathrm{e}{-4}$ to $\beta_N = 0.02$. The pseudo code of the diffusion model's training and purification is available in Algorithm 1 and Algorithm 2.

Algorithm 1 Diffusion Model Training
1: repeat
2:   Sample optimal demonstrations $x_o \sim D_o$
3:   $i \sim \mathrm{Uniform}(\{0, \ldots, N-1\})$
4:   $\epsilon \sim \mathcal{N}(0, I)$
5:   Optimize the diffusion model $\epsilon_\phi$ by taking a gradient descent step on
6:   $\nabla_\phi \big\| \epsilon - \epsilon_\phi\big(\sqrt{\bar{\alpha}_i}\, x_o + \sqrt{1 - \bar{\alpha}_i}\, \epsilon,\ i\big) \big\|^2$
7: until converged

Algorithm 2 Diffusion Purification
1: Sample sub-optimal demonstrations $x_s \sim D_s$;
2: Calculate $x_{s,i_r}$ via forward diffusion:
3:   $x_{s,i_r} = \sqrt{\bar{\alpha}_{i_r}}\, x_s + \sqrt{1 - \bar{\alpha}_{i_r}}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$.
4: for $i = i_r, \ldots, 1$ do
5:   Calculate $x_{s,i-1}$ via reverse diffusion:
6:   $x_{s,i-1} = \frac{1}{\sqrt{\alpha_i}}\Big( x_{s,i} - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}}\, \epsilon_\phi(x_{s,i}, i) \Big)$
7: end for
8: Return $\hat{x}_{s,0}$
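The following PyTorch sketch mirrors the setup above under stated assumptions: the network is a minimal stand-in for $\epsilon_\phi$ (five linear layers, hidden width 1024, batch normalization, ReLU, dropout 0.2), the schedule is the linear one from B.1, and the two functions follow Algorithms 1 and 2. The timestep conditioning and the dimension x_dim are simplifications for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

N = 1000
betas = torch.linspace(1e-4, 0.02, N)            # beta_1 ... beta_N (B.1)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)        # \bar{alpha}_i

class EpsilonPhi(nn.Module):
    """Simplified stand-in for the noise predictor: 5 linear layers,
    hidden width 1024, BatchNorm + ReLU + Dropout(0.2); the timestep is
    appended to the input as a single scalar feature (an assumption)."""
    def __init__(self, x_dim, hidden=1024):
        super().__init__()
        dims = [x_dim + 1] + [hidden] * 4
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                       nn.ReLU(), nn.Dropout(0.2)]
        layers.append(nn.Linear(hidden, x_dim))   # fifth linear layer
        self.net = nn.Sequential(*layers)

    def forward(self, x, i):
        t = i.float().unsqueeze(-1) / N            # crude timestep encoding
        return self.net(torch.cat([x, t], dim=-1))

def train_step(model, optimizer, x_o):
    """One gradient step of Algorithm 1 on a batch of optimal demonstrations."""
    i = torch.randint(0, N, (x_o.size(0),))
    eps = torch.randn_like(x_o)
    a_bar = alpha_bars[i].unsqueeze(-1)
    x_noisy = a_bar.sqrt() * x_o + (1 - a_bar).sqrt() * eps
    loss = ((eps - model(x_noisy, i)) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

@torch.no_grad()
def purify(model, x_s, i_r):
    """Algorithm 2: diffuse sub-optimal demonstrations to step i_r, then
    run the deterministic reverse updates back to step 0."""
    model.eval()
    a_bar = alpha_bars[i_r - 1]
    x = a_bar.sqrt() * x_s + (1 - a_bar).sqrt() * torch.randn_like(x_s)
    for i in range(i_r, 0, -1):
        idx = torch.full((x.size(0),), i - 1)
        coef = (1 - alphas[i - 1]) / (1 - alpha_bars[i - 1]).sqrt()
        x = (x - coef * model(x, idx)) / alphas[i - 1].sqrt()
    return x

# Usage sketch: x_dim = 23 would match e.g. HalfCheetah-v2's (s, a) pair.
model = EpsilonPhi(x_dim=23)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
demo = torch.randn(256, 23)                       # placeholder demonstrations
train_step(model, opt, demo)
purified = purify(model, demo, i_r=10)
```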
B.2. Implementation Details of Imitation Learning

We use a deep neural network with two 100 × 100 fully connected layers and Tanh activations to parameterize the policy. To output continuous actions, the agent policy adopts a Gaussian strategy: the policy network outputs the mean and standard deviation of the action, and the continuous action is sampled from the normal distribution formed by this mean and standard deviation. For online imitation learning methods, the discriminator and the value function use the same architecture as the policy network. In offline imitation learning, the policy is trained with batch size 256 and the total number of epochs is set to 1000. For online imitation learning, the learning rate of the discriminator $D_\psi$ and the critic $r_\psi$ is set to $3 \times 10^{-4}$. Five updates on the discriminator follow one update on the policy network in each iteration. For the value function, the learning rate is set to $3 \times 10^{-4}$ and three training updates are used in each iteration. We use the on-policy method TRPO (Schulman et al., 2015) as the RL step in online imitation learning; its learning rate is set to $3 \times 10^{-4}$ with batch size 5000. The discount rate $\gamma$ of the sampled trajectory is set to 0.995, and the GAE parameter $\tau$ is set to 0.97.

B.3. Data Collection

We provide six supplementary sub-optimal demonstration datasets to evaluate the performance of DP-IL on 4 MuJoCo tasks. After training an optimal policy $\pi_o$ by TRPO, we use two different ways to collect sub-optimal demonstration data in our experiments. The quality of the sub-optimal demonstrations is provided in Table 6.

The first way to collect sub-optimal demonstrations is to add Gaussian noise $\xi$ of different scales to the action $a^*$ of $\pi_o$ to form a sub-optimal policy $\pi_s$. The action of $\pi_s$ is modeled as $a \sim \mathcal{N}(a^*, \xi^2)$ and we choose $\xi = [0.6, 0.4, 0.25]$ to generate the sub-optimal demonstration datasets (i.e., D1-L1, D1-L2 and D1-L3). This way of collecting sub-optimal demonstrations has been used in several works (Sasaki & Yamashina, 2021; Tangkaratt et al., 2020); a minimal sketch of this collection scheme is given after Table 6.

The second way to collect sub-optimal demonstrations is to save checkpoints during the RL training of $\pi_s(a|s)$. In our experiments, the RL training is conducted with 5M interactions with the environment. We save 3 checkpoints at 1M, 1.5M and 3M interactions to sample the D2-L1, D2-L2 and D2-L3 datasets. This way of collecting sub-optimal demonstrations has been investigated in (Wang et al., 2021b; Wu et al., 2019).

Table 6. The quality of demonstrations in 4 MuJoCo tasks, measured by the average cumulative reward of trajectories.

Task            S      A     D1-L1   D1-L2   D1-L3   D2-L1   D2-L2   D2-L3   Expert
Ant-v2          R^111  R^8   -73     227     3514    1062    1560    2649    4349
HalfCheetah-v2  R^17   R^6   567     1090    1853    1491    2217    3263    4624
Walker2d-v2     R^17   R^6   523     467     4362    1699    2717    4152    4963
Hopper-v2       R^11   R^3   699     1037    3229    734     3362    2597    3594
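Below is a minimal sketch of the first collection scheme. The environment and policy objects, the gym-style API, and the rollout length are hypothetical placeholders; only the Gaussian perturbation $a \sim \mathcal{N}(a^*, \xi^2)$ with $\xi \in \{0.6, 0.4, 0.25\}$ comes from the description above.

```python
import numpy as np

# Sketch of the first data-collection scheme in B.3: perturb the optimal
# policy's action with Gaussian noise of scale xi to obtain a sub-optimal
# policy. `env` and `optimal_policy` are placeholders for a gym-style
# environment and a trained expert; they are not part of the paper's code.
XI_LEVELS = {"D1-L1": 0.6, "D1-L2": 0.4, "D1-L3": 0.25}

def collect_suboptimal(env, optimal_policy, xi, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    state, trajectory = env.reset(), []
    for _ in range(n_steps):
        a_star = optimal_policy(state)               # optimal action a*
        action = rng.normal(loc=a_star, scale=xi)    # a ~ N(a*, xi^2)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action))
        state = env.reset() if done else next_state
    return trajectory

# e.g. datasets = {name: collect_suboptimal(env, pi_o, xi) for name, xi in XI_LEVELS.items()}
```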
B.4. Compared Methods

We compare DP-IL with several state-of-the-art methods for imitation learning with imperfect demonstrations. Specifically, several offline imitation learning methods (i.e., BCND (Sasaki & Yamashina, 2021), DWBC (Xu et al., 2022) and DemoDICE (Kim et al., 2021)) and online imitation learning methods (i.e., 2IWIL (Wu et al., 2019), IC-GAIL (Wu et al., 2019) and WGAIL (Wang et al., 2021b)) are compared. We briefly review the details of the methods compared against DP-IL in our experiments below.

BCND, DWBC and DemoDICE. We follow the instructions in BCND's paper to implement BCND. Note that, for a fair comparison with other offline imitation learning methods, we do not use ensemble policies for BCND. For DWBC, we adapt their released code1 to conduct experiments. For DemoDICE, we adapt the implementation in the SMODICE repo2 to conduct experiments.

2IWIL and IC-GAIL. We re-implement 2IWIL/IC-GAIL based on the official implementation3. In 2IWIL and IC-GAIL, a fraction of the imperfect expert demonstrations are labeled with confidence (i.e., $D_l = \{(s_i, a_i), r_i\}_{i=1}^{n_l}$), while the remaining demonstrations are unlabeled (i.e., $D_u = \{(s_i, a_i)\}_{i=1}^{n_u}$). Since we have no access to the confidence score of each state-action pair in our setting, we use the normalized reward of the demonstrator as the confidence score for its related demonstrations. In the experiments, we choose 20% of the demonstrations as labeled to train a semi-supervised classifier and then predict confidence for the other 80% unlabeled demonstrations.

WGAIL. WGAIL is proposed to estimate confidence within the GAIL framework without auxiliary information. The confidence $w(s, a)$ of each demonstration is calculated as $\big[(1/D_w(s, a) - 1)\,\pi_\theta(a|s)\big]^{-\frac{1}{\beta+1}}$. The confidence estimation and GAIL training interact during training. Following the official implementation4, $\beta$ is set to 1.

1 https://github.com/ryanxhr/DWBC
2 https://github.com/JasonMa2016/SMODICE/tree/main
3 https://github.com/kristery/Imitation-Learning-from-Imperfect-Demonstration
4 https://github.com/yunke-wang/WGAIL

B.5. Additional Experimental Results

B.5.1. IMPACT OF REVERSE POINT $i_r$ IN ROBOSUITE

We also include RoboSuite results to offer a more complete understanding of how $i_r$ impacts demonstrations of varying quality. In RoboSuite, we utilized 10 trajectories as expert demonstrations, as shown in Table 7. Among these, two high-reward trajectories (ID 6 and 10) were considered optimal demonstrations, while the rest were designated as sub-optimal demonstrations. To assess the impact of different demonstration qualities, we divided the sub-optimal demonstrations into two groups, Robo-L1 and Robo-L2.

Table 7. The length and quality of 10 trajectories in RoboSuite.

Traj ID   1    2    3    4    5    6    7    8    9    10
Length    500  502  502  505  506  507  507  509  511  511
Reward    276  120  203  102  253  312  103  81   180  318

For Robo-L1, we selected four trajectories with rewards below 150 (ID 2, 4, 7, and 8) as sub-optimal demonstrations. For Robo-L2, we chose four trajectories with rewards ranging from 150 to 300 (ID 1, 3, 5, and 9) as sub-optimal demonstrations. Clearly, Robo-L2 demonstrates higher quality compared to Robo-L1. We assessed the influence of the timestep $i_r$ on both Robo-L1 and Robo-L2. The results, shown in Table 8, exhibit similar trends to those observed in the MuJoCo tasks.

Table 8. Impact of $i_r$ in the RoboSuite platform with human demonstrations.

i_r              1     3     5     10    30
DP-BC (Robo-L1)  0.60  0.74  0.80  0.64  0.48
DP-BC (Robo-L2)  0.82  0.78  0.76  0.76  0.52

B.6. Impact of reverse point $i_r$ in MuJoCo

We have provided an ablation study on $i_r$ with D1 demonstrations in Figure 2, and we provide full results for $i_r$ with both D1 and D2 demonstrations in Table 9. From the table, we can observe that in most cases (19 out of 24), setting $i_r$ within the 5–10 range consistently leads to optimal or near-optimal performance. These cases span various tasks and demonstration qualities, which indicates that an $i_r$ value in this range can serve as a robust default choice that yields satisfactory results in most situations.

Table 9. Performance of DP-BC when setting different $i_r$, measured by the average cumulative reward of 10 trajectories. The results are related to Figure 2.

Task            i_r   D1-L1       D1-L2       D1-L3       D2-L1       D2-L2       D2-L3
Ant-v2          100   261 ± 54    380 ± 62    486 ± 76    199 ± 66    342 ± 14    328 ± 88
                50    202 ± 51    803 ± 114   1073 ± 172  339 ± 27    720 ± 139   1776 ± 123
                30    47 ± 16     719 ± 51    1890 ± 156  1572 ± 160  1833 ± 30   3064 ± 34
                10    -100 ± 51   499 ± 77    2547 ± 118  1402 ± 151  1982 ± 111  3260 ± 34
                5     -133 ± 55   490 ± 125   2331 ± 248  1105 ± 263  1984 ± 103  3414 ± 40
HalfCheetah-v2  100   556 ± 102   640 ± 108   1115 ± 181  1123 ± 41   1791 ± 13   1393 ± 175
                50    797 ± 185   1896 ± 222  2733 ± 113  1468 ± 82   2421 ± 5    3126 ± 306
                30    895 ± 73    1880 ± 326  2899 ± 257  1530 ± 39   2714 ± 15   4434 ± 240
                10    1365 ± 147  2440 ± 274  4042 ± 80   1394 ± 103  2749 ± 15   4748 ± 40
                5     708 ± 255   2061 ± 373  3758 ± 351  1469 ± 253  2469 ± 253  4660 ± 14
Walker2d-v2     100   260 ± 2     241 ± 12    311 ± 9     259 ± 1     258 ± 1     273 ± 19
                50    463 ± 40    259 ± 11    553 ± 89    549 ± 196   360 ± 14    1085 ± 251
                30    415 ± 56    978 ± 96    1398 ± 333  2208 ± 160  1742 ± 253  2512 ± 212
                10    1697 ± 219  1267 ± 145  1474 ± 265  1764 ± 143  1915 ± 384  1984 ± 365
                5     1467 ± 253  1722 ± 297  3020 ± 466  1917 ± 151  3162 ± 20   3076 ± 205
Hopper-v2       100   98 ± 2      461 ± 11    420 ± 1     215 ± 1     151 ± 3     355 ± 25
                50    516 ± 0     815 ± 16    379 ± 0     1283 ± 109  631 ± 4     342 ± 29
                30    688 ± 33    649 ± 43    517 ± 1     2075 ± 159  904 ± 6     1353 ± 128
                10    1000 ± 42   2752 ± 38   678 ± 56    2308 ± 113  2323 ± 2    1740 ± 175
                5     580 ± 60    3145 ± 11   1508 ± 272  1835 ± 216  2297 ± 34   2265 ± 171

While in most cases an $i_r$ setting within the 5–10 range yields optimal performance, we can also observe that the selection of $i_r$ is connected to the task. Among the 4 MuJoCo tasks used in the experiments, Ant-v2 is the most challenging, followed by HalfCheetah-v2 and Walker2d-v2, with Hopper-v2 being the easiest. We can observe that the average optimal $i_r$ is relatively higher for Ant-v2 (for instance, around 100 for D1-L1) and smaller for Hopper-v2, where an optimal $i_r$ is as low as 5 for all datasets. This suggests that setting a relatively higher $i_r$ for more difficult tasks is reasonable. Following the results presented in Table 9, an empirical rule is to set $i_r$ from 5 to 10 for simple tasks like Hopper-v2, and from 10 to 50 for challenging tasks like Ant-v2.