# Robust Policy Learning via Offline Skill Diffusion

Woo Kyung Kim, Minjong Yoo, Honguk Woo*
Department of Computer Science and Engineering, Sungkyunkwan University
{kwk2696, mjyoo2, hwoo}@skku.edu

*Honguk Woo is the corresponding author.
Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Skill-based reinforcement learning (RL) approaches have shown considerable promise, especially in solving long-horizon tasks via hierarchical structures. These skills, learned task-agnostically from offline datasets, can accelerate the policy learning process for new tasks. Yet, the application of these skills in different domains remains restricted due to their inherent dependency on the datasets, which poses a challenge when attempting to learn a skill-based policy via RL for a target domain different from the datasets' domains. In this paper, we present a novel offline skill learning framework, DuSkill, which employs a guided diffusion model to generate versatile skills extended from the limited skills in the datasets, thereby enhancing the robustness of policy learning for tasks in different domains. Specifically, we devise a guided diffusion-based skill decoder in conjunction with hierarchical encoding to disentangle the skill embedding space into two distinct representations, one encapsulating domain-invariant behaviors and the other delineating the factors that induce domain variations in the behaviors. Our DuSkill framework enhances the diversity of skills learned offline, thus accelerating the learning of high-level policies for different domains. Through experiments, we show that DuSkill outperforms other skill-based imitation learning and RL algorithms on several long-horizon tasks, demonstrating its benefits in few-shot imitation and online RL.

## 1 Introduction

Skill-based learning has demonstrated its potential to accelerate adaptation to complex long-horizon tasks by leveraging skill representations pretrained on behavior patterns from offline datasets. However, existing approaches in skill-based reinforcement learning (RL) (e.g., Pertsch, Lee, and Lim 2020; Pertsch et al. 2021) and skill-based few-shot imitation learning (e.g., Hakhamaneshi et al. 2022; Du et al. 2023) often operate under the premise that the target domain for a downstream task was present during skill pretraining. Thus, policies learned with the pretrained skills might yield sub-optimal performance, particularly when the target domain diverges from the domains of the given datasets.

Figure 1: Concept of Offline Skill Diffusion: When a downstream task belongs to a domain different from those of the training datasets, conventional skill-based learning approaches struggle to learn and choose suitable skills. In contrast, our offline skill diffusion expands the skill diversity beyond the training datasets, enabling the execution of compatible skills for the downstream task. The skills are discretely represented for visual illustration.

As shown on the left side of Figure 1, where each small circle denotes a specific skill, conventional skill learning approaches might experience low performance in the high-level policy for a downstream task if the task calls for skills that differ from the pretrained ones.
For instance, suppose that robotic manipulation skills are learned from datasets collected in a safety-first domain; then, they might lean heavily towards slow-speed manipulation. In that case, a high-level policy learned with these skills might fail to adapt efficiently to a downstream task that involves stringent time constraints. These situations often arise in environments encompassing diverse domains, as a single task can require different skills depending on the domain it is in. The dependency of conventional skill-based learning approaches on the specificity of the datasets exacerbates these challenges. Moreover, it is practically difficult to obtain comprehensive datasets that span all potential skills for diverse domains.

To tackle these challenges in skill-based learning, we take a novel approach, offline skill diffusion, aiming to broaden skill diversity beyond mere imitation from the datasets. Given that diffusion models have been recognized for their efficacy in generating human-like images with conditional values (Rombach et al. 2022; Ho and Salimans 2022; Preechakul et al. 2022), we leverage a guided diffusion model for the skill decoder to generate diverse skills. The right side of Figure 1 illustrates the benefits of our framework, where the red dotted quadrangle represents an expanded skill set encompassing all the skills required to solve the downstream task in a different domain.

In this paper, we present the DuSkill framework, designed to generate diverse skills for downstream tasks that can belong to domains different from the source domains of the training datasets. Recognizing that certain aspects of a skill remain consistent despite domain variations, we view each skill as a composition of domain-invariant and domain-variant features. We then employ a guided diffusion-based decoder along with a hierarchical domain encoder to effectively disentangle each skill into two separate embedding spaces. The hierarchical domain encoder systematically segments skills into domain-invariant and domain-variant embeddings by conditioning only the lower encoder on domain variations. With two distinct embeddings, the conditional generation process of the guided diffusion-based decoder enables distinct modulation of each embedding, thereby facilitating the execution of a wide range of skills in different domains.

For downstream tasks in different domains, we train a high-level policy which produces both domain-invariant and domain-variant embeddings. The high-level policy operates alongside the frozen guided diffusion-based decoder, which encompasses the knowledge necessary for generating a broad range of skills adaptable to various domains. As such, our proposed framework stands apart from existing skill-based learning approaches (Pertsch, Lee, and Lim 2020; Pertsch et al. 2021), as it enables the generation of diverse skills that extend beyond the training datasets.

The contributions of our work are summarized as follows.
- We present the DuSkill framework, which facilitates robust skill-based policy learning for downstream tasks in different domains.
- We develop the offline skill diffusion model, incorporating the hierarchical domain encoder and guided diffusion-based decoder. The model enables diverse skill generation that extends beyond the training datasets.
- We test the framework with long-horizon tasks in various domains, demonstrating its capability to adapt to different domains in both few-shot imitation and online RL settings.

## 2 Preliminaries and Problem Formulation

Given the training datasets $D = \{\tau_i\}_{i=1}^{n}$, where each trajectory $\tau_i$ is represented as a sequence of state-action pairs $\{(s_t, a_t)\}_{t=1}^{H}$, the objective of skill representation learning is to accelerate adaptation to long-horizon tasks by leveraging pretrained skill representations. Here, a skill is defined as a sequence of $h$ consecutive actions $\mathbf{a} = \{a_t, ..., a_{t+h}\}$ (Pertsch, Lee, and Lim 2020; Hakhamaneshi et al. 2022). Through the joint learning of a skill encoder $q(z|\mathbf{a})$ and a skill decoder $\epsilon(\mathbf{a}|z)$ on the datasets, we obtain a skill embedding $z \in Z$. The learning objective is to optimize the evidence lower bound (ELBO), which consists of a reconstruction term and a regularization term,

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{\mathbf{a} \sim D}\left[-\log \epsilon(\mathbf{a}|z) + \beta D_{\text{KL}}(p(z)\,\|\,q(z|\mathbf{a}))\right] \quad (1)$$

where $D_{\text{KL}}$ is the Kullback-Leibler (KL) divergence, $p(z)$ is a prior following a unit Gaussian distribution $\mathcal{N}(0, I)$, and $\beta$ is a hyperparameter for regularization (Higgins et al. 2016). This skill representation is then leveraged to accelerate downstream task adaptation.

Problem Formulation. In our problem formulation, we operate under the premise that training datasets $D$ are available. As the skills required for a task might vary depending on the domain it belongs to, we further assume that the datasets are collected from multiple source domains. The variations across the domains are parameterized by $\omega \in \Omega$, and this parameter information is included in the datasets. We then aim to tackle downstream tasks in a range of domains distinct from the source domains. Successfully achieving this requires more than merely imitating the skills present in the given datasets, due to the varied nature of the tasks across different domains.

We formulate a downstream task as a goal-conditioned Markov decision process (MDP) $\mathcal{M}$ combined with its domain parameter $\omega \in \Omega$. The goal-conditioned MDP $\mathcal{M}$ is denoted as $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \mathcal{G}, \gamma, \mu_0)$, where $s \in \mathcal{S}$ is a state space, $a \in \mathcal{A}$ is an action space, $\mathcal{G}$ is a goal space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \Omega \rightarrow [0, 1]$ is a transition probability, $r: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \times \Omega \rightarrow \mathbb{R}$ is a reward function, $\gamma \in [0, 1]$ is a discount factor, and $\mu_0: \mathcal{S} \rightarrow [0, 1]$ is an initial state distribution. The domain variations may affect either the reward function or the transition probability of the MDP, while the goal space remains consistent. For instance, at the top of Figure 1, a task may have the same goal (slide the puck into the goalposts), but the domain related to the time constraint results in different reward functions. Our objective is then to maximize the discounted cumulative sum of rewards for downstream tasks in different domains through a high-level policy $\pi(z|s)$,

$$\pi^* = \arg\max_{\pi}\ \mathbb{E}_{z \sim \pi(\cdot|s),\, \mathbf{a} \sim \epsilon(\cdot|z)}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right] \quad (2)$$

where $\epsilon(\mathbf{a}|z)$ is a skill decoder and $T$ is the maximum length of an episode.
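For concreteness, the conventional skill-VAE objective in (1) can be written as the following minimal PyTorch-style sketch; the module names, network sizes, and unit-variance Gaussian likelihood are illustrative assumptions rather than the exact architecture used in this work.

```python
# Minimal sketch of the skill-VAE objective in Eq. (1); module names,
# network sizes, and the unit-variance Gaussian likelihood are assumptions.
import torch
import torch.nn as nn

class SkillVAE(nn.Module):
    def __init__(self, action_dim: int, horizon: int, latent_dim: int = 10):
        super().__init__()
        flat = action_dim * horizon
        # Skill encoder q(z|a): h-step action sequence -> Gaussian over z.
        self.encoder = nn.Sequential(nn.Linear(flat, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * latent_dim))
        # Skill decoder eps(a|z): latent skill -> reconstructed action sequence.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, flat))

    def loss(self, actions: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
        # actions: (batch, horizon, action_dim)
        a = actions.flatten(1)
        mu, log_std = self.encoder(a).chunk(2, dim=-1)
        std = log_std.exp()
        z = mu + std * torch.randn_like(std)             # reparameterization trick
        recon = self.decoder(z)
        recon_loss = ((recon - a) ** 2).sum(-1).mean()   # -log eps(a|z) up to a constant
        # Closed-form KL between q(z|a) and the unit Gaussian prior.
        kl = 0.5 * (mu ** 2 + std ** 2 - 2 * log_std - 1).sum(-1).mean()
        return recon_loss + beta * kl

# Example: loss = SkillVAE(action_dim=4, horizon=10).loss(torch.randn(32, 10, 4))
```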
## 3 Approach

### 3.1 DuSkill Framework

To facilitate diverse skill generation, we propose the DuSkill framework consisting of two main phases: (i) the offline skill diffusion phase and (ii) the downstream policy learning phase, as illustrated in Figure 2. In the offline skill diffusion phase, we employ a guided diffusion-based decoder in conjunction with a hierarchical domain encoder to disentangle skills into domain-invariant and domain-variant embeddings. Specifically, the hierarchical domain encoder consists of a domain-invariant encoder and a domain-variant encoder. The domain-invariant encoder processes a sequence of states and actions, producing the domain-invariant embedding. Meanwhile, the domain-variant encoder takes the domain-invariant embedding and the domain parameterization as input to produce the domain-variant embedding. In this way, the encoders play distinct roles: the domain-invariant encoder is responsible for encapsulating the features necessary to reconstruct fundamental action sequences pertinent to achieving goals, while the domain-variant encoder is tailored to capture the features related to domain variations.

Figure 2: Offline Skill Diffusion and Downstream Policy Learning in DuSkill: (i) In the offline skill diffusion phase, a skill is decomposed into domain-invariant and domain-variant embeddings, which are then combined through the guided diffusion-based decoder to generate diverse skills. (ii) In the downstream policy learning phase, a high-level policy is learned for a task in different domains either through few-shot imitation or online RL.

To effectively disentangle skill features via the hierarchical domain encoder, we utilize a guided diffusion-based decoder with two components, one conditioned on the domain-invariant embedding and the other focused on the domain-variant embedding. This conditional generation mechanism allows the domain-invariant and domain-variant embeddings to exert distinct influences on separate aspects of the generated action sequences. By employing the hierarchical domain encoder and the guided diffusion-based decoder, our framework is capable of generating diverse action sequences that encompass different combinations of domain-invariant and domain-variant features.

In the downstream policy learning phase, we exploit the disentangled embeddings via a high-level policy, which produces both domain-invariant and domain-variant embeddings. These embeddings are then fed into the frozen guided diffusion-based decoder, which generates executable skills. In this phase, we consider few-shot imitation and online RL adaptation scenarios, where the high-level policy adapts to tasks in different domains through either fine-tuning on a limited number of trajectories or online RL interactions.

### 3.2 Offline Skill Diffusion

Hierarchical domain encoder. To disentangle domain-invariant and domain-variant features from skills, we introduce a hierarchical encoding approach. This enables learning in two distinct embedding spaces: the domain-invariant embedding space $Z_\rho$ and the domain-variant embedding space $Z_\sigma$. Specifically, we employ a domain-invariant encoder $q_\rho$ which maps a sequence of states and actions to the domain-invariant embedding, and a domain-variant encoder $q_\sigma$ which maps the domain-invariant embedding and the domain parameterization $\omega$ to the domain-variant embedding, i.e.,

$$z_\rho \sim q_\rho(\cdot|\mathbf{s}, \mathbf{a}), \quad z_\sigma \sim q_\sigma(\cdot|z_\rho, \omega) \quad (3)$$

where $\mathbf{s} = \{s_t, ..., s_{t+h}\}$ is a sequence of states, $\mathbf{a} = \{a_t, ..., a_{t+h}\}$ is a sequence of actions, $z_\rho \in Z_\rho$ is the domain-invariant embedding, and $z_\sigma \in Z_\sigma$ is the domain-variant embedding.
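The factorization in (3) can be illustrated with a minimal sketch of the two encoders, where $q_\rho$ consumes the state-action window and $q_\sigma$ consumes $(z_\rho, \omega)$; the Gaussian heads, hidden sizes, and embedding dimensions are assumptions for illustration only, not the settings used in this work.

```python
# Sketch of the hierarchical domain encoder in Eq. (3): q_rho maps the
# state-action window to z_rho, and q_sigma maps (z_rho, omega) to z_sigma.
# Architectures and embedding dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * out_dim))

    def forward(self, x: torch.Tensor):
        mu, log_std = self.net(x).chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)   # reparameterized sample
        return z, mu, log_std                           # params are used for the KL terms

class HierarchicalDomainEncoder(nn.Module):
    def __init__(self, state_dim, action_dim, horizon, omega_dim,
                 z_rho_dim=10, z_sigma_dim=4):
        super().__init__()
        self.q_rho = GaussianHead((state_dim + action_dim) * horizon, z_rho_dim)
        self.q_sigma = GaussianHead(z_rho_dim + omega_dim, z_sigma_dim)

    def forward(self, states, actions, omega):
        # states: (B, h, state_dim); actions: (B, h, action_dim); omega: (B, omega_dim)
        sa = torch.cat([states, actions], dim=-1).flatten(1)
        z_rho, *_ = self.q_rho(sa)                                     # domain-invariant
        z_sigma, *_ = self.q_sigma(torch.cat([z_rho, omega], dim=-1))  # domain-variant
        return z_rho, z_sigma
```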
To optimize the encoders and the skill decoder $\epsilon$, we employ the evidence lower bound (ELBO) loss based on (1), similar to (Pertsch et al. 2021; Hakhamaneshi et al. 2022), i.e.,

$$\mathcal{L}_{\text{HVAE}} = \mathbb{E}_{(\mathbf{s},\mathbf{a}) \sim D}\Big[\sum_{t} -\log \epsilon(a_t|s_t, z_\rho, z_\sigma) + \beta_\rho D_{\text{KL}}(p(z_\rho)\,\|\,q_\rho(z_\rho|\mathbf{s}, \mathbf{a})) + \beta_\sigma D_{\text{KL}}(p(z_\sigma|z_\rho)\,\|\,q(z_\sigma|z_\rho, \mathbf{s}, \mathbf{a}))\Big] \quad (4)$$

where $q(z_\sigma|z_\rho, \mathbf{s}, \mathbf{a})$ is a naive model of the domain-variant encoder, and $\beta_\rho$ and $\beta_\sigma$ are regularization hyperparameters. The priors $p(z_\rho)$ and $p(z_\sigma|z_\rho)$ are set to be unit Gaussian. To disentangle skill features, we modify the regularization term of the domain-variant encoder in (4) as

$$\mathcal{L}_{\text{aspect}} = \mathbb{E}_{(\mathbf{s},\mathbf{a}) \sim D}\left[\log \frac{q(z_\sigma|z_\rho, \mathbf{s}, \mathbf{a})}{q_\sigma(z_\sigma|z_\rho, \omega)}\right]. \quad (5)$$

This loss term enables the domain-variant encoder to construct the distinct embedding space $Z_\sigma$, capturing domain-variant features conditioned on the domain-invariant embedding. By combining (4) and (5), we rewrite the loss as

$$\mathcal{L}_{\text{DHVAE}} = \mathbb{E}_{(\mathbf{s},\mathbf{a}) \sim D}\Big[\sum_{t} -\log \epsilon(a_t|s_t, z_\rho, z_\sigma) + \beta_\rho D_{\text{KL}}(p(z_\rho)\,\|\,q_\rho(z_\rho|\mathbf{s}, \mathbf{a})) + \beta_\sigma D_{\text{KL}}(p(z_\sigma|z_\rho)\,\|\,q_\sigma(z_\sigma|z_\rho, \omega))\Big]. \quad (6)$$

For downstream policy learning, we employ a domain-invariant prior $p_\rho$ and a domain-variant prior $p_\sigma$,

$$z_\rho \sim p_\rho(\cdot|s_t), \quad z_\sigma \sim p_\sigma(\cdot|z_\rho). \quad (7)$$

The domain-invariant prior $p_\rho$ is conditioned on the current state $s_t$, facilitating the selection of a suitable domain-invariant embedding, while the domain-variant prior $p_\sigma$ provides a prior distribution over the domain-variant embedding. By designing the domain-variant prior to be conditioned solely on the domain-invariant embedding, our model achieves the flexibility to explore a wide range of parameterizations across different domains. These priors are jointly trained with the hierarchical domain encoder by minimizing the KL divergence to each respective encoder,

$$\mathcal{L}_{\text{prior}} = \mathbb{E}_{(\mathbf{s},\mathbf{a}) \sim D}\left[D_{\text{KL}}(p_\rho(z_\rho|s_t)\,\|\,q_\rho(z_\rho|\mathbf{s}, \mathbf{a})) + D_{\text{KL}}(p_\sigma(z_\sigma|z_\rho)\,\|\,q_\sigma(z_\sigma|z_\rho, \omega))\right]. \quad (8)$$

Figure 3: DuSkill Framework: In (i-1), the hierarchical domain encoder disentangles the domain-invariant and domain-variant embeddings. At the same time, the domain-invariant and domain-variant priors are jointly learned with these encoders. For diverse skill generation, the encoders are trained in conjunction with the guided diffusion-based decoder in (i-2). Here, the domain-invariant decoder and domain-variant decoder are responsible for reconstructing actions based on the domain-invariant and domain-variant embeddings, respectively. In (ii), the high-level policy is learned to solve the task in a different domain either through few-shot imitation or online RL.

Guided diffusion-based decoder. To generate diverse skills from the domain-invariant embedding $z_\rho$ and the domain-variant embedding $z_\sigma$, we employ a diffusion model (Ho, Jain, and Abbeel 2020; Pearce et al. 2023; Wang, Hunt, and Zhou 2023; Ajay et al. 2023). In particular, we adopt the denoising diffusion probabilistic model (DDPM) (Ho, Jain, and Abbeel 2020) to represent our skill decoder. The decoder reconstructs an action $a_t$ from a noisy input $x_K \sim \mathcal{N}(0, I)$ by sequentially predicting $x_{K-1}, x_{K-2}, ..., x_0 (= a_t)$, with each iteration being a marginally denoised version of its predecessor. During the training phase, the noisy input $x_k$ is generated by progressively adding Gaussian noise to the original action $a_t (= x_0)$ over $K$ steps as

$$x_k = \sqrt{\bar{\alpha}_k}\, x_0 + \sqrt{1 - \bar{\alpha}_k}\, \eta \quad (9)$$

where $\eta \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_k$ is determined by the variance schedule. As our skill decoder is designed to predict the noise $\eta$, the reconstruction loss in (6) becomes an L2 distance between the decoder's output and the noise,

$$\mathcal{L}_{\text{rec}} = \mathbb{E}_{k \sim [1,K],\, \eta \sim \mathcal{N}(0,I)}\left[\|\epsilon(x_k, k, s_t, z_\rho, z_\sigma) - \eta\|_2^2\right]. \quad (10)$$
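A minimal sketch of this training step, combining the forward noising in (9) with the noise-prediction loss in (10), is given below; the linear variance schedule, the step count, and the decoder's call signature are assumptions, not the settings used in our experiments.

```python
# Sketch of one training step behind Eqs. (9)-(10): noise a clean action and
# regress the conditioned decoder's output onto the added noise. The linear
# schedule, step count, and decoder signature are assumptions.
import torch
import torch.nn.functional as F

K = 50                                              # total denoising steps
betas = torch.linspace(1e-4, 2e-2, K)               # variance schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)       # cumulative \bar{alpha}_k

def diffusion_loss(decoder, a0, s_t, z_rho, z_sigma):
    """a0: clean actions (B, action_dim); decoder predicts the injected noise."""
    B = a0.shape[0]
    k = torch.randint(0, K, (B,))                    # random step per sample
    eta = torch.randn_like(a0)                       # Gaussian noise
    ab = alpha_bar[k].unsqueeze(-1)
    x_k = ab.sqrt() * a0 + (1.0 - ab).sqrt() * eta   # Eq. (9): forward noising
    eta_hat = decoder(x_k, k, s_t, z_rho, z_sigma)   # conditioned noise prediction
    return F.mse_loss(eta_hat, eta)                  # Eq. (10): L2 to the noise
```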
To further align with the objective in (6), we slightly modify classifier-free guidance (Ho and Salimans 2022) to divide the decoder into two separate parts: a domain-invariant decoder $\epsilon_\rho(x_k, k, s_t, z_\rho)$ and a domain-variant decoder $\epsilon_\sigma(x_k, k, s_t, z_\sigma)$. Thus, our decoder is redefined as the combination of the two, i.e.,

$$\epsilon(x_k, k, s_t, z_\rho, z_\sigma) := (1 - \delta)\,\epsilon_\rho(x_k, k, s_t, z_\rho) + \delta\,\epsilon_\sigma(x_k, k, s_t, z_\sigma) \quad (11)$$

where $\delta > 0$ serves as a guidance weight that determines the degree of adjustment towards the domain-variant decoder. This approach allows the domain-invariant decoder to reconstruct actions that consistently execute the designated task across domain features. Simultaneously, the domain-variant decoder is seamlessly incorporated by generating guidance that encapsulates the domain-variant features.

In our framework, we employ the hierarchical domain encoder to establish distinct embedding spaces for domain-invariant and domain-variant features, optimized by the loss defined in (6). Concurrently, the respective priors for the downstream task are jointly trained, optimized by the loss in (8), as depicted in Figure 3 (i-1). Furthermore, to effectively disentangle the domain-invariant and domain-variant embeddings, we jointly train the guided diffusion-based decoder in conjunction with the encoders, optimized by the loss in (10), as illustrated in Figure 3 (i-2). Algorithm 1 lists the learning procedure of DuSkill.

Algorithm 1: Offline Skill Diffusion
Input: training datasets $D$, total denoising steps $K$, guidance weight $\delta$, hyperparameters $\beta_\rho$, $\beta_\sigma$
1: Initialize encoders $q_\rho$, $q_\sigma$, priors $p_\rho$, $p_\sigma$, decoders $\epsilon_\rho$, $\epsilon_\sigma$
2: while not converged do
3:   Sample a batch $\{(\mathbf{s}, \mathbf{a}, \omega)\}_i \sim D$
4:   Update $q_\rho$ and $q_\sigma$ using $\mathcal{L}_{\text{DHVAE}}$ in (6)
5:   Update $p_\rho$ and $p_\sigma$ using $\mathcal{L}_{\text{prior}}$ in (8)
6:   Update $\epsilon_\rho$ and $\epsilon_\sigma$ using $\mathcal{L}_{\text{rec}}$ in (10)
7: end while
8: return $q_\rho$, $q_\sigma$, $p_\rho$, $p_\sigma$, $\epsilon_\rho$, $\epsilon_\sigma$

### 3.3 Downstream Policy Learning

For efficient policy learning on downstream tasks, we adopt a hierarchical learning scheme akin to other skill-based approaches (Pertsch, Lee, and Lim 2020; Hakhamaneshi et al. 2022). In this scheme, the high-level policy produces a skill embedding as output rather than directly generating executable actions. Specifically, we train a policy $\pi(z_\rho, z_\sigma|s_t)$, designed to align with the hierarchical domain encoder, which generates domain-invariant and domain-variant embeddings,

$$\pi(z_\rho, z_\sigma|s_t) = \pi_\rho(z_\rho|s_t)\, \pi_\sigma(z_\sigma|z_\rho). \quad (12)$$

Subsequently, these embeddings are fed into the learned guided diffusion-based decoder, which remains frozen, to decode them into a sequence of executable actions. To decode an action from the guided diffusion-based decoder, the process starts by sampling a noisy input $x_K \sim \mathcal{N}(0, I)$. The decoder then iteratively denoises the input while being conditioned on the disentangled embeddings to generate an action,

$$x_{k-1} = \frac{1}{\sqrt{\alpha_k}}\left(x_k - \frac{1 - \alpha_k}{\sqrt{1 - \bar{\alpha}_k}}\, \epsilon(x_k, k, s_t, z_\rho, z_\sigma)\right) + \zeta_k \eta \quad (13)$$

where $\eta \sim \mathcal{N}(0, I)$, and $\zeta_k$ and $\alpha_k$ are parameters of the variance schedule (with $\bar{\alpha}_k = \prod_{j \le k} \alpha_j$). As it is empirically observed that low-temperature sampling leads to improved performance, similar to (Ajay et al. 2023), we set $\zeta_k = 0$.
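The decoding procedure, combining the guided mixture in (11) with the deterministic update in (13) at $\zeta_k = 0$, can be sketched as follows; the decoder signatures, the default guidance weight, and the schedule handling are illustrative assumptions rather than our exact implementation.

```python
# Sketch of skill decoding with the guided combination in Eq. (11) and the
# reverse update in Eq. (13) at zeta_k = 0. Decoder signatures, the default
# guidance weight, and the schedule handling are assumptions.
import torch

@torch.no_grad()
def decode_action(eps_rho, eps_sigma, s_t, z_rho, z_sigma, betas,
                  action_dim, delta=0.5):
    """Start from x_K ~ N(0, I) and iteratively denoise into an action."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(s_t.shape[0], action_dim)                       # x_K ~ N(0, I)
    for k in reversed(range(len(betas))):
        k_idx = torch.full((s_t.shape[0],), k, dtype=torch.long)
        # Eq. (11): guided mixture of the domain-invariant and -variant decoders
        eta_hat = ((1.0 - delta) * eps_rho(x, k_idx, s_t, z_rho)
                   + delta * eps_sigma(x, k_idx, s_t, z_sigma))
        # Eq. (13) with zeta_k = 0: deterministic low-temperature update
        x = (x - (1.0 - alphas[k]) / (1.0 - alpha_bar[k]).sqrt() * eta_hat) \
            / alphas[k].sqrt()
    return x  # executable action a_t

# At downstream time, the high-level policy of Eq. (12) would supply the
# embeddings, e.g. z_rho = pi_rho(s_t); z_sigma = pi_sigma(z_rho);
# a_t = decode_action(eps_rho, eps_sigma, s_t, z_rho, z_sigma, betas, action_dim)
```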
For downstream task adaptation, we explore two different scenarios: few-shot imitation and online RL adaptation. For few-shot imitation, we initialize the high-level policy with the learned domain-invariant and domain-variant priors, and then fine-tune the high-level policy using (10) along with the learned guided diffusion-based decoder. Likewise, for online RL adaptation, we adopt the soft actor-critic (SAC) algorithm, where the learned priors guide the high-level policy as in (Pertsch, Lee, and Lim 2020). In both scenarios, we freeze the decoder and fine-tune only the high-level policy. Even in the absence of decoder updates, our framework attains robust performance on downstream tasks in different domains, as demonstrated in Section 4.2. Figure 3 (ii) illustrates the procedure of downstream policy learning.

## 4 Evaluations

### 4.1 Experiment Settings

Environments. For evaluation, we use the multi-stage Meta-World, implemented on top of the Meta-World simulated benchmark (Yu et al. 2019). Each multi-stage task is composed of a sequence of existing Meta-World tasks (sub-tasks). In these multi-stage tasks, an agent is required to maneuver a robotic arm to complete a series of sub-tasks, such as slide puck and close drawer. To emulate different domains in the environment, we deliberately add variations to either reward functions or transition dynamics. Specifically, we modify the reward function by setting time constraints (speed domains) and by incorporating energy usage (energy domains). Furthermore, to emulate varying transition dynamics in the environment, we manipulate kinematic parameters such as wind (wind domains).

Datasets. For the datasets, we implement several rule-based expert policies tailored to each domain-specific environment. For the offline training datasets, we collect 20 trajectories for each source domain (of 6-16 domains). For the few-shot imitation datasets, we collect another 3 trajectories for each target domain.

Baselines. We implement several imitation learning and skill-based RL algorithms as baselines.
- BC (Esmaili, Sammut, and Shirazi 1998) is a widely used supervised behavior cloning method. A policy is learned on the training datasets and then fine-tuned for the downstream tasks.
- SPiRL (Pertsch, Lee, and Lim 2020) is a state-of-the-art skill-based RL algorithm that employs a hierarchical structure to embed skills into a latent space, thereby accelerating downstream adaptation.
- SPiRL-c is a variant of SPiRL that uses the closed-loop skill decoder used in (Pertsch et al. 2021).
- FIST (Hakhamaneshi et al. 2022) is a state-of-the-art few-shot skill-based imitation algorithm that employs a semi-parametric approach for skill determination.

For few-shot imitation, the baselines BC, SPiRL, and SPiRL-c are pretrained on the training datasets and then fine-tuned on the few-shot imitation datasets.

### 4.2 Few-shot Imitation Performance

Table 1 compares the few-shot imitation performance, in achieved rewards, of our framework (DuSkill) and the baselines (BC, SPiRL, SPiRL-c, FIST). Among the baselines, those denoted with an asterisk (*) fine-tune both the high-level policy and the decoder, while those without an asterisk fine-tune only the high-level policy. An important distinction is that DuSkill exclusively fine-tunes the high-level policy. Based on the domain disparities between the datasets and the downstream tasks, we categorize them into four different levels. At the source level, downstream tasks remain in the source domains.
As we move to higher levels, the domain disparities become more pronounced.

| Domain | Level | BC | SPiRL* | SPiRL-c | SPiRL-c* | FIST* | DuSkill |
|---|---|---|---|---|---|---|---|
| Speed | Source | 2.66 ± 0.73 | 2.16 ± 0.63 | 3.62 ± 0.35 | 3.72 ± 0.19 | 3.98 ± 0.01 | 3.99 ± 0.00 |
| | Level 1 | 2.71 ± 0.28 | 1.83 ± 0.50 | 3.41 ± 0.50 | 3.51 ± 0.25 | 3.86 ± 0.05 | 3.92 ± 0.07 |
| | Level 2 | 2.62 ± 0.54 | 2.16 ± 0.62 | 3.33 ± 0.44 | 3.24 ± 0.23 | 3.51 ± 0.12 | 3.83 ± 0.06 |
| | Level 3 | 2.22 ± 0.23 | 2.22 ± 0.49 | 2.87 ± 0.47 | 3.12 ± 0.22 | 3.28 ± 0.19 | 3.81 ± 0.06 |
| Energy | Source | 1.49 ± 0.35 | 0.67 ± 0.07 | 2.83 ± 0.39 | 1.60 ± 0.33 | 3.08 ± 0.19 | 3.85 ± 0.08 |
| | Level 1 | 1.26 ± 0.32 | 0.56 ± 0.07 | 1.90 ± 0.45 | 1.19 ± 0.13 | 2.53 ± 0.25 | 3.87 ± 0.09 |
| | Level 2 | 0.90 ± 0.15 | 0.47 ± 0.08 | 1.77 ± 0.26 | 0.91 ± 0.19 | 1.85 ± 0.34 | 3.71 ± 0.10 |
| | Level 3 | 0.53 ± 0.09 | 1.69 ± 0.25 | 0.89 ± 0.18 | 2.02 ± 0.24 | 1.04 ± 0.20 | 3.67 ± 0.14 |
| Wind | Source | 3.32 ± 0.41 | 3.78 ± 0.10 | 2.83 ± 0.39 | 3.92 ± 0.10 | 4.00 ± 0.00 | 3.98 ± 0.02 |
| | Level 1 | 2.99 ± 0.52 | 3.14 ± 0.54 | 1.90 ± 0.45 | 3.71 ± 0.29 | 3.89 ± 0.15 | 3.78 ± 0.13 |
| | Level 2 | 2.19 ± 0.42 | 2.62 ± 0.33 | 1.77 ± 0.26 | 3.24 ± 0.24 | 3.24 ± 0.24 | 3.48 ± 0.21 |
| | Level 3 | 2.41 ± 0.45 | 2.63 ± 0.46 | 0.89 ± 0.18 | 3.04 ± 0.29 | 3.04 ± 0.29 | 3.44 ± 0.18 |

Table 1: Few-shot Imitation Performance in Multi-stage Meta-World: The performance of the baselines and DuSkill is measured in achieved rewards. For each domain, we categorize the domain disparity between the training datasets and the downstream tasks into four levels. Baselines marked with an asterisk (*) fine-tune both the high-level policy and the decoder, while the baselines without an asterisk and DuSkill fine-tune only the high-level policy. Each is evaluated with 3 random seeds, and the highest performance is highlighted in bold.

As shown, the baselines exhibit a notable decline in performance as the domain dissimilarity increases from the source level to level 3, where even the most competitive baseline FIST suffers an average degradation of 25.3%. In contrast, DuSkill consistently maintains robust performance across all domains and levels, with an average degradation of only 7.6%.

In this experiment, SPiRL-c demonstrates relatively low performance, primarily because its decoder can only generate skills present in its training data. Consequently, it struggles to attain robust performance when relying solely on fine-tuning the high-level policy. Meanwhile, SPiRL-c* is expected to yield higher performance than SPiRL-c, as it fine-tunes both the high-level policy and the decoder; yet, in some cases, SPiRL-c surpasses SPiRL-c*. This is because fine-tuning the entire model with a few samples might cause a covariate shift, a phenomenon commonly observed in few-shot imitation studies (Hakhamaneshi et al. 2022; Nasiriany et al. 2022). FIST* adopts a different strategy, fine-tuning the entire model and using a semi-parametric method to retrieve from the training datasets the future state $s_{t+H}$ that it aims to reach. While the semi-parametric method leads to improved performance for tasks in the source domains, FIST* is prone to fail in producing skills for downstream tasks in different domains, because the training datasets do not cover the skills required for those domains. In contrast, DuSkill disentangles the domain-invariant and -variant features to effectively generate skills through the guided diffusion-based decoder. This allows for robust performance in few-shot adaptation across different domains while fine-tuning only the high-level policy.

### 4.3 Analysis
Online RL. Table 2 compares the online RL adaptation performance in reward achieved by our framework (DuSkill) and the baselines (BC+SAC, SPiRL, SPiRL-c). Both the baselines and DuSkill fine-tune the high-level policy via the SAC algorithm. Here, we categorize domain disparities into source and target domains, where the target domains correspond to level 3 in Table 1. For more stable learning in the target domains, we warm up the high-level policy with a single trajectory for DuSkill and the baselines. The performance gap between SPiRL-c and DuSkill is not remarkable in source-domain tasks, as expected. In contrast, DuSkill exhibits superior performance compared to the baselines for tasks in different domains, outperforming SPiRL-c by 89.16%. This highlights the capability of our guided diffusion-based decoder to generate diverse skills that extend beyond the limitations of the given datasets.

| Level | BC+SAC | SPiRL-c | DuSkill |
|---|---|---|---|
| Source | 0.00 ± 0.00 | 4.00 ± 0.00 | 4.00 ± 0.00 |
| Target | 0.00 ± 0.00 | 0.36 ± 0.02 | 3.32 ± 0.20 |

Table 2: Online RL Performance in the Speed Domain

Embedding Visualization. Here, we verify the efficacy of our DuSkill framework in disentangling embeddings. Figure 4 visualizes the domain-invariant and domain-variant embeddings generated by DuSkill separately in two distinct speed domains (fast and slow). In the figure, the labels T1 to T4 correspond to sub-tasks numbered from 1 to 4. Regarding the domain-invariant embeddings, we observe that identical tasks are paired together, thereby establishing the domain-invariant knowledge. On the other hand, the domain-variant embeddings are grouped with respect to the domains, implying the encapsulation of domain-variant knowledge. This indicates the effectiveness of our offline skill diffusion process, which disentangles domain-invariant and -variant features.

Figure 4: Visualization of Domain-invariant and Domain-variant Embeddings

Sample Efficiency. Figure 5 shows the performance with respect to the samples (or timesteps) used by DuSkill and the baselines (SPiRL-c, SPiRL-c*, FIST) for downstream policy learning in few-shot imitation and online RL scenarios. For few-shot imitation learning, we test with different numbers of few-shot trajectories (1-20). As shown in Figure 5(a), DuSkill exhibits robust performance with only a 4.89% drop from 1 to 20 trajectories, whereas the most competitive baseline FIST shows a notable performance drop of 13.96%. Furthermore, as shown in Figure 5(b), DuSkill efficiently adapts to the downstream task in online settings, enhancing performance with only 50k samples, while SPiRL-c rarely achieves improvement with those samples.

Figure 5: Sample Efficiency of Downstream Policy Learning: (a) Few-shot imitation, (b) Online RL

Ablation on Skill Diffusion. Table 3 provides an ablation study of DuSkill, focusing on the impact of the hierarchical embedding structure and the guided diffusion-based decoder. In this study, we conduct few-shot imitation scenarios with two ablated variants: DU is a variant of SPiRL-c that utilizes a naive diffusion model for the skill decoder, while HDU employs the hierarchical embedding structure like DuSkill together with a naive diffusion model for the skill decoder. The results show that the combination of the hierarchical domain encoder and the guided diffusion-based decoder in DuSkill yields improved performance compared to the other variants, with 8.77-8.92% average performance gains.
This is not surprising, since the guided diffusion-based decoder promotes the learning of disentangled representations, as the domain-invariant and domain-variant decoders are conditioned on their respective embeddings to generate executable actions effectively. Therefore, employing both the hierarchical embedding structure and the guided diffusion-based decoder is crucial for achieving better few-shot imitation learning performance.

| Domain | DU | HDU | DuSkill |
|---|---|---|---|
| Speed | 3.59 ± 0.19 | 3.49 ± 0.68 | 3.81 ± 0.06 |
| Energy | 3.16 ± 0.35 | 3.39 ± 0.36 | 3.67 ± 0.14 |
| Wind | 3.19 ± 0.17 | 3.08 ± 0.36 | 3.44 ± 0.18 |

Table 3: Performance by Encoder and Decoder Types

## 5 Related Work

Skill-based Learning. To leverage offline datasets for long-horizon tasks, hierarchical skill representation learning techniques have been investigated in the context of online RL (Pertsch, Lee, and Lim 2020; Pertsch et al. 2021) and imitation learning (Hakhamaneshi et al. 2022; Nasiriany et al. 2022; Du et al. 2023). Pertsch, Lee, and Lim (2020) proposed a hierarchical skill learning structure that accelerates downstream task adaptation by guiding a high-level policy with learned skill priors. Meanwhile, Hakhamaneshi et al. (2022) exploited a semi-parametric approach within this hierarchical skill structure, focusing on few-shot imitation. In our work, we also utilize skill embedding techniques, but we tackle the challenge of adapting to downstream tasks in different domains. Unlike the prior work, our DuSkill adopts a robust generative model, namely a diffusion model, together with disentangled skill embeddings, enabling the effective generation of diverse skills offline.

Diffusion for RL. Given the remarkable success of diffusion models in the field of computer vision (Ho, Jain, and Abbeel 2020; Rombach et al. 2022), their application has been extended to sequential decision-making problems in recent years. Pearce et al. (2023) utilized diffusion models to imitate human demonstrations, capitalizing on their capacity to represent highly multi-modal data. Wang, Hunt, and Zhou (2023) adopted diffusion models in the context of offline RL to implement policy regularization. Furthermore, Liang et al. (2023) leveraged diffusion models to generate diverse synthetic trajectories from limited training data, aiming at self-evolving offline RL for goal-conditioned RL tasks. Recently, Ajay et al. (2023) proposed a general framework for sequential decision-making problems using diffusion models, which allows for the dynamic recombination of behaviors at test time by conditioning the diffusion models on several factors such as returns, constraints, or skills. To the best of our knowledge, our DuSkill framework is the first to integrate a diffusion model with skill embedding techniques, providing a novel hierarchical RL method that generates diverse skills from limited datasets and achieves robust performance on downstream tasks in different domains.

## 6 Conclusion

In this work, we presented the DuSkill framework, a novel approach designed to bridge the gap between the given datasets and downstream tasks that lie in different domains. We devised the offline skill diffusion scheme, which employs the guided diffusion-based decoder in conjunction with the hierarchical encoders to effectively disentangle domain-invariant and -variant features from skills. This enables the generation of diverse skills capable of addressing tasks in different domains.
Our framework stands apart from existing skill-based learning approaches, which are typically limited to adapting within the domains encountered during skill pretraining. In future work, we aim to extend our framework to address challenging cross-domain situations with significant domain shifts, such as entirely different tasks, robot embodiment variations, or different simulation environments.

## Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-01045, 2022-0-00043, 2019-0-00421, 2021-0-00875) and by the National Research Foundation of Korea (NRF) grant funded by the MSIT (No. RS-2023-00213118).

## References

Ajay, A.; Du, Y.; Gupta, A.; Tenenbaum, J. B.; Jaakkola, T. S.; and Agrawal, P. 2023. Is Conditional Generative Modeling All You Need for Decision-Making? In Proceedings of the 11th International Conference on Learning Representations.

Du, M.; Nair, S.; Sadigh, D.; and Finn, C. 2023. Behavior Retrieval: Few-Shot Imitation Learning by Querying Unlabeled Datasets. arXiv:2304.08742.

Esmaili, N.; Sammut, C.; and Shirazi, G. M. 1998. Behavioural Cloning in Control of a Dynamic System. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2904-2909.

Hakhamaneshi, K.; Zhao, R.; Zhan, A.; Abbeel, P.; and Laskin, M. 2022. Hierarchical Few-Shot Imitation with Skill Transition Models. In Proceedings of the 10th International Conference on Learning Representations.

Higgins, I.; Matthey, L.; Pal, A.; Burgess, C. P.; Glorot, X.; Botvinick, M. M.; Mohamed, S.; and Lerchner, A. 2016. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the 4th International Conference on Learning Representations.

Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In Proceedings of the 34th Conference on Neural Information Processing Systems.

Ho, J.; and Salimans, T. 2022. Classifier-Free Diffusion Guidance. arXiv:2207.12598.

Liang, Z.; Mu, Y.; Ding, M.; Ni, F.; Tomizuka, M.; and Luo, P. 2023. AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners. In Proceedings of the 40th International Conference on Machine Learning.

Nasiriany, S.; Gao, T.; Mandlekar, A.; and Zhu, Y. 2022. Learning and Retrieval from Prior Data for Skill-based Imitation Learning. In Proceedings of the 6th Conference on Robot Learning, 2181-2204.

Pearce, T.; Rashid, T.; Kanervisto, A.; Bignell, D.; Sun, M.; Georgescu, R.; Macua, S. V.; Tan, S. Z.; Momennejad, I.; Hofmann, K.; and Devlin, S. 2023. Imitating Human Behaviour with Diffusion Models. In Proceedings of the 11th International Conference on Learning Representations.

Pertsch, K.; Lee, Y.; and Lim, J. J. 2020. Accelerating Reinforcement Learning with Learned Skill Priors. In Proceedings of the 4th Conference on Robot Learning, 188-204.

Pertsch, K.; Lee, Y.; Wu, Y.; and Lim, J. J. 2021. Demonstration-Guided Reinforcement Learning with Learned Skills. In Proceedings of the 5th Conference on Robot Learning, 729-739.

Preechakul, K.; Chatthee, N.; Wizadwongsa, S.; and Suwajanakorn, S. 2022. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 10609-10619.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 10674-10685.

Wang, Z.; Hunt, J. J.; and Zhou, M. 2023. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning. In Proceedings of the 11th International Conference on Learning Representations.

Yu, T.; Quillen, D.; He, Z.; Julian, R.; Hausman, K.; Finn, C.; and Levine, S. 2019. Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning. In Proceedings of the 3rd Conference on Robot Learning, 1094-1100.