Published as a conference paper at ICLR 2024

EFFICIENT PLANNING WITH LATENT DIFFUSION

Wenhao Li
School of Software Engineering, Tongji University
Shanghai, 201804, China
liwenhao@cuhk.edu.cn

ABSTRACT

Temporal abstraction and efficient planning pose significant challenges in offline reinforcement learning, mainly when dealing with domains that involve temporally extended tasks and delayed sparse rewards. Existing methods typically plan in the raw action space and can be inefficient and inflexible. Latent action spaces offer a more flexible paradigm, capturing only possible actions within the behavior policy support and decoupling the temporal structure between planning and modeling. However, current latent-action-based methods are limited to discrete spaces and require expensive planning steps. This paper presents a unified framework for continuous latent action space representation learning and planning by leveraging latent, score-based diffusion models. We establish the theoretical equivalence between planning in the latent action space and energy-guided sampling with a pretrained diffusion model and incorporate a novel sequence-level exact sampling method. Our proposed method, Latent Diffuser, demonstrates competitive performance on low-dimensional locomotion control tasks and surpasses existing methods in higher-dimensional tasks.

1 INTRODUCTION

A considerable volume of samples gathered by operational systems gives rise to the problem of offline reinforcement learning (RL), specifically, the recovery of high-performing policies without additional environmental exploration (Wu et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021; 2022; Ghosh et al., 2022). However, domains that encompass temporally extended tasks and severely delayed sparse rewards can present a formidable challenge for standard offline approaches (Li et al., 2015; Ren et al., 2021; Li et al., 2023). Analogous to the online setting, an emergent objective in offline RL involves the development of effective hierarchical methodologies that can obtain temporally extended lower-level primitives, subsequently facilitating the construction of a higher-level policy operating at a more abstract temporal scale (Ajay et al., 2021; Pertsch et al., 2021; Villecroze et al., 2022; Rosete-Beas et al., 2022; Rao et al., 2022; Yang et al., 2023).

Within the hierarchical framework, current offline RL approaches can be broadly categorized into model-free and model-based. The former conceptualizes the higher-level policy optimization as an auxiliary offline RL problem (Liu et al., 2020; Liu & Sun, 2022; Ma et al., 2022; Kipf et al., 2019; Ajay et al., 2021; Rosete-Beas et al., 2022). In contrast, the latter encompasses planning in the higher-level policy space by generating future trajectories through a dynamics model of the environment, either predefined or learned (Li et al., 2022; Co-Reyes et al., 2018; Lynch et al., 2020; Lee et al., 2022; Venkatraman, 2023). Concerning lower-level primitive learning, these two methods exhibit similarities and are typically modeled as goal-conditioned or skill-based imitation learning or offline RL problems. However, the instabilities arising in offline hierarchical RL methodologies due to the deadly triad (Sutton & Barto, 2018; Van Hasselt et al., 2018), restricted data access (Fujimoto et al., 2019; Kumar et al., 2020), and sparse rewards (Andrychowicz et al., 2017; Ma et al., 2022) remain unaddressed.
This spawns another subset of model-based approaches, along with more effective hierarchical variants, that endeavor to resolve these problems from a sequence modeling viewpoint (Chen et al., 2021; Janner et al., 2021; 2022; Ajay et al., 2023). Irrespective of whether a method is model-free or model-based, it adheres to the traditional setting, wherein planning occurs in the raw action space of the Markov Decision Process (MDP). Although seemingly intuitive, planning in the raw action space can be inefficient and inflexible (Wang et al., 2020; Yang et al., 2021; Jiang et al., 2023). Challenges include ensuring model accuracy across the entire space and the constraint of being tied to the environment's temporal structure. Conversely, human planning offers enhanced flexibility through temporal abstractions, high-level actions, backward planning, and incremental refinement.

Drawing motivation from TAP (Jiang et al., 2023), we put forth the notion of the latent action. Planning within the domain of latent actions delivers a twofold advantage compared to planning with raw actions. First, it encompasses only plausible actions under behavior policy support, yielding a reduced space regardless of the raw action space's dimensionality and preventing the exploitation of model frailties. Second, it permits the separation of the temporal structure between planning and modeling, thus enabling a more adaptable and efficient planning process unconstrained by specific transitions. These dual benefits render latent-action-based approaches naturally superior to extant methodologies when handling temporally extended offline tasks.

Nevertheless, two shortcomings of TAP inhibit its ability to serve as a general and practical framework. First, TAP is confined to discrete latent action spaces. In real-world contexts, agents are likely to carry out not merely a narrow, discrete assortment of tasks but a broader spectrum of behaviors (Co-Reyes et al., 2018). This introduces a predicament: should a minor skill modification be necessary, such as opening a drawer by seizing the handle from top to bottom instead of bottom to top, a completely novel set of demonstrations or reward functions might be mandated for behavior acquisition. Second, once the latent action space has been ascertained, TAP necessitates a distinct, resource-intensive planning phase for generating reward-maximizing policies. The price of planning consequently restricts latent actions to discrete domains.

To tackle these limitations, this paper proposes a novel framework, Latent Diffuser, by concurrently modeling continuous latent action space representation learning and latent-action-based planning as a conditional generative problem within the latent domain. Specifically, Latent Diffuser employs unsupervised techniques to discern the latent action space by utilizing score-based diffusion models (SDMs) (Song et al., 2021; Nichol & Dhariwal, 2021; Ho & Salimans, 2022) within the latent space, in conjunction with a variational autoencoder (VAE) framework (Kingma & Welling, 2014; Rezende et al., 2014; Vahdat et al., 2021). We first segment the input trajectories, map each segment to the latent action space (which is itself learned), and apply the SDM to the latent sequence. Subsequently, the SDM is entrusted with approximating the distribution over the offline trajectory embeddings, conditioned on the related return values.
Planning, or reward-maximizing trajectory synthesis, is realized by initially producing latent actions through sampling from a simple base distribution, followed by iterative, conditional denoising, and eventually translating the latent actions into the trajectory space using a decoder. In other words, Latent Diffuser can be regarded as a VAE equipped with an SDM prior (Vahdat et al., 2021). Theoretically, we demonstrate that planning in the domain of latent actions is tantamount to energy-guided sampling using a pretrained diffusion behavior model. Exact energy-guided sampling is essential to carry out high-quality and efficient planning. To achieve this objective, we modify QGPO (Lu et al., 2023) to realize exact sampling at the sequence level. Comprehensive numerical results on low-dimensional locomotion control tasks reveal that Latent Diffuser exhibits competitive performance against robust baselines and outperforms them on tasks of greater dimensionality.

Our main contributions encompass: 1) developing a unified framework for continuous latent action space representation learning and planning that delivers flexibility and efficiency in temporally extended offline decision-making; 2) a theoretical derivation confirming the equivalence between planning in the latent action space and energy-guided sampling with a pretrained diffusion model, together with an innovative sequence-level exact sampling technique; and 3) numerical experiments exhibiting the competitive performance of Latent Diffuser and its applicability across a range of low- and high-dimensional continuous control tasks.

2 RELATED WORK

Owing to spatial constraints, this section briefly presents the domain most pertinent to Latent Diffuser: offline RL or imitation learning (IL) based on a hierarchical structure. In terms of algorithmic specificity, existing techniques can be broadly classified into goal-based and skill-based methods (Pateria et al., 2021). For further related literature, including but not limited to model-based RL, action representation learning, offline RL, and RL as sequence modeling, kindly refer to Appendix C and the appropriate citations within those works.

Goal-based approaches primarily concentrate on attaining a designated state. The vital aspect of such techniques concerns the selection or creation of subgoals, which reside in the raw state space. Once the higher-level subgoal is ascertained, the lower-level policy is generally acquired through standard IL methods or offline RL, based on a subgoal-augmented/conditioned policy, a universal value function (Schaul et al., 2015), or their combination. In extant methods, the subgoal is either predefined (Zhou et al., 2019; Xie et al., 2021; Ma et al., 2021), chosen based on heuristics (Ding et al., 2014; Guo & Zhai, 2016; Pateria et al., 2020; Mandlekar et al., 2020), or generated via planning or an additional offline RL technique (Liu et al., 2020; Liu & Sun, 2022; Li et al., 2022; Ma et al., 2022). Moreover, some methods (Eysenbach et al., 2019; Paul et al., 2019; Lai et al., 2020; Kujanpää et al., 2023) are solely offline during the subgoal selection or generation process. This paper also pertains to the options framework (Sutton et al., 1999; Stolle & Precup, 2002; Bacon et al., 2017; Wulfmeier et al., 2021; Salter et al., 2022; Villecroze et al., 2022), as both the (continuous) latent actions of Latent Diffuser and (discrete) options introduce a mechanism for temporal abstraction.
Skill-based methods embody higher-level skills as low-dimensional latent codes. In this context, a skill signifies a subtask's policy, semantically representing the capability to perform something adeptly (Pateria et al., 2021). Analogous to goal-based approaches, once the higher-level skill is identified, the lower-level skill-conditioned policy is generally acquired through standard IL or offline RL methods. More precisely, a few works utilize predefined skills (Nasiriany et al., 2022; Fatemi et al., 2022). The majority of studies employ a two- or multi-phase training framework: initially, state sequences are projected into continuous latent variables (i.e., skills) via unsupervised learning; next, optimal skills are generated based on offline RL (Kipf et al., 2019; Pertsch et al., 2021; Ajay et al., 2021; Rosete-Beas et al., 2022; Lee et al., 2022; Venkatraman, 2023) or planning¹ (Co-Reyes et al., 2018; Lynch et al., 2020; Lee et al., 2022; Venkatraman, 2023) in the skill space.

¹ It is important to note that planning is only feasible when the environment model is known or can be sampled from. Consequently, some of these works focus on online RL tasks, while others first learn an additional environment model from the offline dataset and then plan in the skill space.

Figure 1: The physical meaning of the goal-conditioned policy, skill, and latent action (corresponding to 2 timesteps in the raw MDP). The red diamond represents a particular (goal) state, the gray, dotted diamond is a placeholder, and the red circle denotes any state.

In contrast with the aforementioned hierarchical methodologies, Latent Diffuser first learns a more compact latent action space and subsequently employs the latent actions to make decisions. As demonstrated in Figure 1, a latent action differs not only from the goal-conditioned policy, which pertains to the trajectory of reaching a particular state, but also from the skill, which relates to the trajectory of completing a specific (multi-step) state transition. The latent action additionally corresponds to the agent's received reward and the subsequent expected return. The unique physical implications of the latent action and the methodology utilized by Latent Diffuser render the proposed method advantageous in several ways. 1) The future information in the latent action allows the algorithm to execute more efficient planning. 2) Unlike existing works, wherein multiple optimization objectives and the full coupling or separation of representation learning and decision-making (RL or planning) lead to intricate training processes and reduced training efficiency, Latent Diffuser exhibits end-to-end training and unifies representation learning, sampling, and planning.

3 PROBLEM FORMULATION

In this paper, we approach the offline RL problem as a sequence modeling task, in alignment with previous work (Janner et al., 2022; Ajay et al., 2023; Li et al., 2023). The following subsection delineates the specifics of sequence modeling, or more accurately, the conditional generative modeling paradigm. We examine a trajectory, τ, of length T, sampled from an MDP with a fixed stochastic behavior policy. This trajectory comprises (refer to the Appendix for more modeling selections of τ) a series of states, actions, rewards, and reward-to-go values, $G_t := \sum_{i=t}^{T} \gamma^{i-t} r_i$, as proxies for future cumulative rewards:

$$\tau := \left(s_1, a_1, r_1, G_1, s_2, a_2, r_2, G_2, \ldots, s_T, a_T, r_T, G_T\right).$$
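As a concrete illustration of this trajectory layout, the hedged sketch below assembles per-timestep tokens x_t = (s_t, a_t, r_t, G_t) from a raw episode; the function names and dimensions are illustrative assumptions and not part of the paper's released implementation.

```python
# Hedged sketch: assembling trajectory tokens x_t = (s_t, a_t, r_t, G_t).
# All function and variable names here are illustrative, not the paper's code.
import numpy as np

def reward_to_go(rewards: np.ndarray, gamma: float) -> np.ndarray:
    """G_t = sum_{i=t}^{T} gamma^(i-t) * r_i, computed by a backward recursion."""
    G = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def build_trajectory_tokens(states, actions, rewards, gamma=0.99):
    """Concatenate (s_t, a_t, r_t, G_t) into one token per timestep."""
    G = reward_to_go(np.asarray(rewards, dtype=np.float64), gamma)
    tokens = [
        np.concatenate([s, a, [r], [g]])
        for s, a, r, g in zip(states, actions, rewards, G)
    ]
    return np.stack(tokens)  # shape: (T, dim_s + dim_a + 2)

# Tiny usage example with made-up dimensions.
T, dim_s, dim_a = 5, 3, 2
rng = np.random.default_rng(0)
tau = build_trajectory_tokens(rng.normal(size=(T, dim_s)),
                              rng.normal(size=(T, dim_a)),
                              rng.normal(size=T))
print(tau.shape)  # (5, 7)
```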
It is crucial to note that the definition of τ diverges from that in prior studies (Janner et al., 2022; Ajay et al., 2023; Li et al., 2023), as each timestep now contains both the reward and the reward-to-go value. This modification has been specifically engineered to facilitate the subsequent learning of latent action spaces. Sequential decision-making is then formulated as the standard problem of conditional generative modeling:

$$\max_\theta \; \mathbb{E}_{\tau_0 \sim \mathcal{D}} \left[ \log p_\theta\left(\tau_0 \mid y(\tau_0)\right) \right], \tag{1}$$

where τ0 := τ. The objective is to estimate the conditional trajectory distribution with pθ so as to enable planning, i.e., generating the desired trajectory τ0 based on the information y(τ0). Existing instances of y may encompass the return (Janner et al., 2022; Li et al., 2023), the constraints met by the trajectory (Ajay et al., 2023; Li et al., 2023), or the skill demonstrated in the trajectory (Ajay et al., 2023). The generative model is constructed in accordance with the conditional diffusion process:

$$q\left(\tau_{k+1} \mid \tau_k\right), \qquad p_\theta\left(\tau_{k-1} \mid \tau_k, y(\tau_0)\right). \tag{2}$$

As per standard convention, q signifies the forward noising process while pθ represents the reverse denoising process (Ajay et al., 2023).

Latent Actions. We introduce the concept of the latent action (Figure 1) proposed in TAP (Jiang et al., 2023). TAP models the optimal conditional trajectory distribution p*(τ | s1, z) using a series of latent variables, z := (z1, . . . , zM). Assuming that the state and latent variables (s1, z) can be deterministically mapped to the trajectory τ, one obtains p*(τ | s1, z) := p(s1) 1(τ = h(s1, z)) π*(z | s1). The terms z and π*(z | s1) are subsequently referred to as the latent actions and the optimal latent policy, respectively. In a deterministic MDP, the trajectory h(s1, z) corresponding to any latent actions z with π*(z | s1) > 0 constitutes an optimal executable plan, implying that the optimal trajectory can be recovered by following the latent actions z from the initial state s1. Consequently, planning within the latent action space Z facilitates the discovery of a desired, optimal trajectory. TAP, however, remains restricted to discrete latent action spaces and necessitates independent, resource-intensive planning. Motivated by these limitations, we present a unified framework that integrates representation learning and planning for continuous latent actions via latent, score-based diffusion models.

4 ALGORITHM FRAMEWORK

This section provides a comprehensive elaboration of the model components and design choices, such as the network architecture, the loss functions, and the details of training and planning. By unifying the representation learning and planning of latent actions through a latent diffusion model and an exact energy-guided sampling technique, Latent Diffuser achieves effective decision-making for temporally extended, sparse-reward tasks. Specifically, we first explore representation learning for the latent action in Section 4.1, followed by a detailed discussion of planning with energy-guided sampling in Section 4.2, and close with an algorithm summary in Section 4.3.

Figure 2: Representation learning for latent action with the latent score-based diffusion model. (Blocks: an encoder implemented as a causal Transformer with position embeddings, a latent diffusion model, and modular state, action, reward, and return decoders with position embeddings.)
4.1 REPRESENTATION LEARNING FOR LATENT ACTION

The latent action space allows for a more compact, efficient, and adaptable method by effectively capturing the behavior policy support and detaching the temporal structure, thus providing innate benefits in handling temporally extended offline tasks. As indicated in Section 3, before proceeding to planning, we must first learn a continuous latent action space. For this purpose, we propose Latent Diffuser based on a latent diffusion model (LDM) (Vahdat et al., 2021), as depicted in Figure 2. Latent Diffuser is constituted by an encoder qϕ(z0 | s1, τ), a score-based prior pθ(z0 | s1), and a decoder pψ(τ | s1, z0). In accordance with Vahdat et al. (2021), we train Latent Diffuser by minimizing a variational upper bound on the negative trajectory log-likelihood −log p(τ | s1), meaning that the information y(τ) in Equation (1) is instantiated as the initial state s1:

$$\begin{aligned}
\mathcal{L}(s_1, \tau, \phi, \theta, \psi) &= \mathbb{E}_{q_\phi(z_0 \mid s_1, \tau)}\left[-\log p_\psi\left(\tau \mid s_1, z_0\right)\right] + \mathrm{KL}\left(q_\phi\left(z_0 \mid s_1, \tau\right) \,\|\, p_\theta\left(z_0 \mid s_1\right)\right) \\
&= \mathbb{E}_{q_\phi(z_0 \mid s_1, \tau)}\left[-\log p_\psi\left(\tau \mid s_1, z_0\right)\right] + \mathbb{E}_{q_\phi(z_0 \mid s_1, \tau)}\left[\log q_\phi\left(z_0 \mid s_1, \tau\right)\right] + \mathbb{E}_{q_\phi(z_0 \mid s_1, \tau)}\left[-\log p_\theta\left(z_0 \mid s_1\right)\right]
\end{aligned} \tag{3}$$

utilizing a VAE approach (Kingma & Welling, 2014; Rezende et al., 2014), wherein qϕ(z0 | s1, τ) approximates the true posterior p(z0 | s1, τ). This paper employs Equation (3) with the KL divergence decomposed into entropy and cross-entropy terms. The reconstruction and entropy terms are easily estimated for any explicit encoder as long as the reparameterization trick is applicable (Kingma & Welling, 2014). The challenging aspect of training Latent Diffuser pertains to the cross-entropy term, which involves the score-based prior. Unlike Vahdat et al. (2021), who address this challenge by simultaneously learning an encoder/decoder architecture alongside a score-based prior, we adopt a simpler yet efficacious approach (Rombach et al., 2022) by training a VAE {qϕ, pψ} and a score-based diffusion model {pθ} consecutively on the offline dataset Dτ. This does not necessitate a delicate balancing of reconstruction and generative capabilities.

Encoder qϕ and Decoder pψ. We use an encoder design almost identical to that of TAP (Jiang et al., 2023). Specifically, we treat xt := (st, at, rt, Gt) as a single token. The encoder qϕ processes the tokens xt using a GPT-2-style Transformer², yielding T feature vectors, where T is the episode horizon. Subsequently, we apply a 1-dimensional max pooling with a kernel size and stride of L, followed by a linear layer, to generate T/L latent actions. Moreover, different from the TAP decoder architecture, we adopt a modular design. More concretely, each latent action is tiled L times to match the number of input/output tokens T. We then concatenate the initial state s1 and the latent action and apply a linear projection to provide state information to the decoder. After adding positional embeddings, the decoder reconstructs the trajectory τ̂ := (x̂1, x̂2, . . . , x̂T), with x̂t := (ŝt, ât, r̂t, Ĝt). To enhance the decoder's representation ability, we design separate decoder modules for the different elements of xt, as shown in Figure 2. Note that the action decoder is designed as an inverse dynamics model (Agrawal et al., 2015; Pathak et al., 2017), in a manner similar to Ajay et al. (2023) and Li et al. (2023), with the aim of generating raw action sequences from the state sequences.

² Different from the causal Transformer used in TAP; see the Appendix for more discussion.
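A minimal sketch of this encoder/decoder pair is given below. It assumes a generic PyTorch Transformer encoder in place of the GPT-2-style blocks, collapses the modular decoders into simple heads, and uses illustrative layer sizes and names throughout; it is meant only to make the tiling, pooling, and inverse-dynamics action head concrete, not to reproduce the authors' architecture.

```python
# Hedged sketch of the latent-action VAE described above (PyTorch).
# Layer sizes, the attention mask, and all names are illustrative assumptions.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Token sequence (B, T, dim_x) -> T/L latent actions via Transformer + max-pool."""
    def __init__(self, dim_x, dim_z, L, d_model=128, n_layers=3, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(dim_x, d_model)
        self.pos = nn.Parameter(torch.zeros(1, 1024, d_model))  # assumes T <= 1024
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.pool = nn.MaxPool1d(kernel_size=L, stride=L)        # temporal abstraction
        self.to_mu = nn.Linear(d_model, dim_z)
        self.to_logvar = nn.Linear(d_model, dim_z)

    def forward(self, x):                                        # x: (B, T, dim_x)
        h = self.embed(x) + self.pos[:, : x.size(1)]
        h = self.transformer(h)                                  # GPT-2-style blocks in the paper
        h = self.pool(h.transpose(1, 2)).transpose(1, 2)         # (B, T/L, d_model)
        return self.to_mu(h), self.to_logvar(h)

class ModularDecoder(nn.Module):
    """Tile each latent L times, condition on s_1, and decode each element of x_t."""
    def __init__(self, dim_s, dim_a, dim_z, L, d_model=128):
        super().__init__()
        self.L = L
        self.inp = nn.Linear(dim_z + dim_s, d_model)
        self.pos = nn.Parameter(torch.zeros(1, 1024, d_model))
        self.state_head = nn.Linear(d_model, dim_s)
        self.reward_head = nn.Linear(d_model, 1)
        self.return_head = nn.Linear(d_model, 1)
        # Action head as an inverse dynamics model over consecutive decoded states.
        self.action_head = nn.Sequential(nn.Linear(2 * dim_s, d_model), nn.ReLU(),
                                         nn.Linear(d_model, dim_a))

    def forward(self, z, s1):                                    # z: (B, M, dim_z), s1: (B, dim_s)
        z_tiled = z.repeat_interleave(self.L, dim=1)             # (B, T, dim_z) with T = M * L
        cond = s1.unsqueeze(1).expand(-1, z_tiled.size(1), -1)
        h = self.inp(torch.cat([z_tiled, cond], dim=-1)) + self.pos[:, : z_tiled.size(1)]
        s_hat = self.state_head(h)
        r_hat = self.reward_head(h)
        G_hat = self.return_head(h)
        s_next = torch.cat([s_hat[:, 1:], s_hat[:, -1:]], dim=1) # pad the last step
        a_hat = self.action_head(torch.cat([s_hat, s_next], dim=-1))
        return s_hat, a_hat, r_hat, G_hat
```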
The training of the encoder and decoders entails a reconstruction loss, computed as the mean squared error between input trajectories {τ} and reconstructed trajectories {τ̂}, coupled with a low-weighted (≈ 10⁻⁶) Kullback-Leibler penalty towards a standard normal on the learned latent actions, akin to standard VAE approaches (Kingma & Welling, 2014; Rezende et al., 2014). This prevents arbitrary scaling of the latent action space.

Score-based Prior pθ. Having trained the VAE {qϕ, pψ}, we now have access to a compact latent action space. Distinct from the VAE's adoption of a uniform prior or TAP's utilization of an autoregressive, parameterized prior over latent actions, Latent Diffuser employs a score-based one. Thus, by harnessing the diffusion-sampling-as-planning framework, we seamlessly transform planning into conditional diffusion sampling, ultimately circumventing the need for an independent, costly planning stage. Concretely, the score-based prior is modeled as a conditional, score-based diffusion probabilistic model, parameterized using a temporal U-Net architecture (Janner et al., 2022; Ajay et al., 2023). This architecture effectively treats a sequence of noised latent actions zk as an image, where the height represents a single latent action's dimension and the width signifies the number of latent actions. The conditioning information y(z) := s1 is then projected using a multi-layer perceptron (MLP). The training of the score-based prior is formulated as a standard score-matching problem, detailed in Appendix B.2.

4.2 PLANNING WITH ENERGY-GUIDED SAMPLING

Upon acquiring the latent action space, we are able to effectively address temporally extended offline tasks using planning. Intriguingly, when examined from a probabilistic standpoint, sampling the optimal latent action sequence coincides with a guided diffusion sampling problem (Lu et al., 2023), wherein the guidance is shaped by an (unnormalized) energy function. By adopting a diffusion-sampling-as-planning framework (Janner et al., 2022), we can perform planning through conditional sampling with the pretrained Latent Diffuser, without necessitating further costly planning steps (Janner et al., 2021; Jiang et al., 2023). This renders Latent Diffuser a holistic framework that seamlessly consolidates representation learning and planning within the latent action space. In the following, the equivalence between optimal latent action sampling and energy-guided diffusion sampling is demonstrated, followed by the introduction of a practical sampling algorithm to facilitate efficient planning.

Planning is Energy-Guided Diffusion Sampling. Considering a deterministic mapping from τ to z, achieved by the learned encoder qϕ, the following theorem (refer to Appendix I.1 for the proof) is derived for the optimal latent policy defined in Section 3:

Theorem 1 (Optimal latent policy). Given an initial state s1, the optimal latent policy satisfies
$$\pi^*(z \mid s_1) \;\propto\; \mu(z \mid s_1)\, e^{\beta \sum_{t=1}^{T} Q_\zeta(s_t, a_t)},$$
wherein µ(z | s1) represents the behavior latent policy and Qζ(·, ·) refers to the estimated Q-value function. β ≥ 0 signifies the inverse temperature controlling the energy strength.
By rewriting p0 := π*, q0 := µ, and z0 := z, we can reformulate optimal planning as the following diffusion sampling problem:

$$p_0(z_0 \mid s_1) \;\propto\; q_0(z_0 \mid s_1)\, \exp\left(\beta\, \mathcal{E}\left(h(z_0, s_1)\right)\right), \tag{4}$$

where $\mathcal{E}(h(z_0, s_1)) := \sum_{t=1}^{T} Q_\zeta(s_t, a_t)$ and h(z0, s1) denotes the pretrained decoder pψ. The behavior latent policy q0(z0 | s1) is modeled by the pretrained Latent Diffuser. We then adopt diffusion-sampling-as-planning to generate desired (e.g., reward-maximizing) latent actions z0. Concretely, q0 := q and p0 := p at diffusion timestep k = 0. A forward diffusion process is then constructed to simultaneously diffuse q0 and p0 into an identical noise distribution, where $p_{k0}(z_k \mid z_0, s_1) := q_{k0}(z_k \mid z_0, s_1) = \mathcal{N}\left(z_k \mid \alpha_k z_0, \sigma_k^2 I\right)$. Based on (Lu et al., 2023, Theorem 3.1), the marginal distributions qk and pk of the noised latent actions zk at diffusion timestep k satisfy

$$p_k(z_k \mid s_1) \;\propto\; q_k(z_k \mid s_1)\, \exp\left(\mathcal{E}_k\left(h(z_k, s_1)\right)\right), \tag{5}$$

where $\mathcal{E}_k(h(z_k, s_1)) = \beta\, \mathcal{E}(h(z_0, s_1))$ when k = 0, and $\mathcal{E}_k(h(z_k, s_1)) = \log \mathbb{E}_{q_{0k}(z_0 \mid z_k)}\left[\exp\left(\beta\, \mathcal{E}(h(z_0, s_1))\right)\right]$ when k > 0. We then need to estimate the score function of pk(zk | s1). Following the derivation of Lu et al. (2023), the score function satisfies

$$\nabla_{z_k} \log p_k\left(z_k \mid s_1\right) = \nabla_{z_k} \log q_k\left(z_k \mid s_1\right) + \nabla_{z_k} \mathcal{E}_k\left(h(z_k, s_1)\right).$$

Consequently, optimal planning has been formulated as energy-guided sampling within the latent action space, with $\nabla_{z_k} \mathcal{E}_k(h(z_k, s_1))$ as the desired guidance.

Practical Sampling Method. Estimating the target score function $\nabla_{z_k} \log p_k(z_k \mid s_1)$ is non-trivial because of the intractable energy guidance $\nabla_{z_k} \mathcal{E}_k(h(z_k, s_1))$. We borrow the energy-guided sampling method proposed in Lu et al. (2023) and devise a sequence-level, exact sampling method by training a total of three neural networks: (1) a diffusion model to model the behavior latent policy q0(z0 | s1); (2) a state-action value function Qζ(s, a) to define the intermediate energy function E(h(z0, s1)); and (3) a time-dependent energy model fη(zk, s1, k) to estimate Ek(h(zk, s1)) and guide the diffusion sampling process. Recall that we already have (1) a diffusion model, i.e., the score-based prior pθ(z0 | s1), and (2) a state-action value function Qζ(s, a), i.e., the return decoder. According to Lu et al. (2023, Theorem 3.2), the only remaining component, the time-dependent energy model fη(zk, s1, k), can be trained by minimizing the following contrastive loss:

$$\min_\eta\; \mathbb{E}_{p(k, s_1)}\, \mathbb{E}_{\prod_{i=1}^{M} q\left(z_0^{(i)} \mid s_1\right) p\left(\epsilon^{(i)}\right)} \left[ -\sum_{i=1}^{M} \frac{e^{\beta \mathcal{E}\left(h\left(z_0^{(i)}, s_1\right)\right)}}{\sum_{j=1}^{M} e^{\beta \mathcal{E}\left(h\left(z_0^{(j)}, s_1\right)\right)}} \log \frac{e^{f_\eta\left(z_k^{(i)}, s_1, k\right)}}{\sum_{j=1}^{M} e^{f_\eta\left(z_k^{(j)}, s_1, k\right)}} \right], \tag{6}$$

where k ∼ U(0, K), $z_k = \alpha_k z_0 + \sigma_k \epsilon$, and ε ∼ N(0, I). To estimate the true latent action distribution q(z0 | s1) in Equation (6), we utilize the pretrained encoder qϕ and score-based prior pθ to generate, by diffusion sampling, M support latent actions $\{\hat{z}_0^{(i)}\}_{i=1}^{M}$ for each initial state s1. The contrastive loss in Equation (6) is then estimated by

$$\min_\eta\; \mathbb{E}_{k, s_1, \epsilon} \left[ -\sum_{i=1}^{M} \frac{e^{\beta \mathcal{E}\left(h\left(\hat{z}_0^{(i)}, s_1\right)\right)}}{\sum_{j=1}^{M} e^{\beta \mathcal{E}\left(h\left(\hat{z}_0^{(j)}, s_1\right)\right)}} \log \frac{e^{f_\eta\left(\hat{z}_k^{(i)}, s_1, k\right)}}{\sum_{j=1}^{M} e^{f_\eta\left(\hat{z}_k^{(j)}, s_1, k\right)}} \right], \tag{7}$$

where $\hat{z}_0^{(i)}, \hat{z}_0^{(j)}$ correspond to the support latent actions for each initial state s1.
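To make the practical sampling method concrete, the hedged sketch below implements the contrastive objective of Equation (7) over M support latents and the resulting guided score. The callables `prior_eps`, `f_eta`, and `energy_E0` are placeholders for the score-based prior's noise network, the time-dependent energy model, and the decoded cumulative value E(h(z_0, s_1)); the standard relation between the noise prediction and the score, ∇ log q_k = −ε_θ/σ_k, is assumed. This is not verbatim code from the paper or from QGPO.

```python
# Hedged sketch: contrastive energy training (Eq. 7) and the guided score.
import torch
import torch.nn.functional as F

def cep_loss(f_eta, z0_support, s1, alpha_k, sigma_k, k, energy_E0, beta):
    """Contrastive loss over M support latents per initial state (Eq. 7).
    z0_support: (B, M, dim_z); s1: (B, dim_s); alpha_k, sigma_k: scalars for step k."""
    eps = torch.randn_like(z0_support)
    zk = alpha_k * z0_support + sigma_k * eps                 # perturb support latents
    with torch.no_grad():
        w = F.softmax(beta * energy_E0(z0_support, s1), dim=1)  # (B, M) soft labels
    logits = f_eta(zk, s1, k)                                  # (B, M) unnormalized energies
    return -(w * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def guided_score(prior_eps, f_eta, zk, s1, k, sigma_k):
    """grad log p_k = grad log q_k + grad f_eta, with grad log q_k = -eps_theta / sigma_k."""
    zk = zk.detach().requires_grad_(True)                      # zk: (B, dim_z)
    grad_logq = -prior_eps(zk, s1, k) / sigma_k                # score of the behavior prior
    energy = f_eta(zk.unsqueeze(1), s1, k).sum()               # scalar for autograd
    grad_energy = torch.autograd.grad(energy, zk)[0]
    return grad_logq + grad_energy
```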
4.3 ALGORITHM SUMMARY

In general, the training phase of Latent Diffuser is composed of three parts, corresponding to the training of the encoder and decoders {qϕ, pψ}, the score-based prior pθ, and the intermediate energy model fη, as shown in Algorithm 1. Throughout the training process, two distinct datasets are employed: the first is a standard offline RL dataset, D, which encompasses trajectories sampled from behavior policies, whereas the second consists of the support latent actions for each initial state s1 ∈ D, generated by the pretrained VAE, i.e., the encoder, score-based prior, and decoders.

Algorithm 1 Latent Diffuser: Efficient Planning with Latent Diffusion
  Initialize the latent diffusion model, i.e., the encoder qϕ, the score-based prior pθ, and the decoder pψ, as well as the intermediate energy model fη
  for each gradient step do  ▷ Training the encoder and decoders
    Sample B1 trajectories τ from the offline dataset D
    Generate reconstructed trajectories τ̂ with the encoder qϕ and decoder pψ
    Update {ϕ, ψ} based on the standard VAE loss
  end for
  for each gradient step do  ▷ Training the score-based prior
    Sample B2 trajectories τ from the offline dataset D
    Sample B2 Gaussian noises ε from N(0, I) and B2 timesteps k from U(0, K)
    Generate latent actions z0 with the pretrained encoder qϕ and decoder pψ
    Perturb z0 according to zk := αk z0 + σk ε
    Update {θ} with the standard score-matching loss in Appendix B.2
  end for
  for each initial state s1 in the offline dataset D do  ▷ Generating the support latent actions
    Sample M support latent actions {ẑ_0^(i)}_{i=1}^M from the pretrained score-based prior pθ
  end for
  for each gradient step do  ▷ Training the intermediate energy model
    Sample B3 initial states s1 from the offline dataset D
    Sample B3 Gaussian noises ε from N(0, I) and B3 timesteps k from U(0, K)
    Retrieve the support latent actions {ẑ_0^(i)}_{i=1}^M for each s1
    Perturb ẑ_0^(i) according to ẑ_k^(i) := αk ẑ_0^(i) + σk ε
    Update {η} based on the contrastive loss in Equation (7)
  end for

Moreover, optimal planning amounts to conducting conditional diffusion sampling based on the score-based prior and the intermediate energy model. Formally, generation employs a reverse denoising process at each diffusion timestep k, utilizing the score function ∇_{z_k} log p_k(z_k | s_1) assembled from the score function of the score-based prior, ∇_{z_k} log q_k(z_k | s_1), and the intermediate energy model, ∇_{z_k} E_k(h(z_k, s_1)), along with the state and action decoders pψ(τ | s1, z0) to map the sampled latent actions z0 back to the original trajectory space. Explicitly, the generative process is p(s1, z0, τ) = p0(z0 | s1) pψ(τ | s1, z0). To avoid the accumulation of errors during sampling, we adopt the receding horizon control used in existing methods (Ajay et al., 2023; Li et al., 2023).
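The resulting deployment loop can be summarized by the hedged sketch below. Here `plan_latent_actions` and `decode_first_action` stand for the energy-guided sampler and the trained decoders described above (both placeholders), the environment interface follows the classic Gym API, and the call to `env.get_normalized_score` assumes the D4RL wrapper used in the experiments of the next section.

```python
# Hedged sketch of receding-horizon evaluation on a D4RL task.
# `plan_latent_actions` / `decode_first_action` are placeholders, not the authors' code.
import gym
import d4rl  # noqa: F401  (registers the offline environments; assumes the old Gym API)

def evaluate(env_name, plan_latent_actions, decode_first_action, episodes=5):
    env = gym.make(env_name)
    scores = []
    for _ in range(episodes):
        s, done, ep_return = env.reset(), False, 0.0
        while not done:
            z0 = plan_latent_actions(s)        # guided reverse diffusion from state s
            a = decode_first_action(z0, s)     # execute only the first raw action
            s, r, done, _ = env.step(a)        # then replan (receding horizon)
            ep_return += r
        scores.append(100.0 * env.get_normalized_score(ep_return))
    return sum(scores) / len(scores)
```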
5 EXPERIMENTS

This section aims to assess the efficacy of Latent Diffuser on temporally extended offline tasks in comparison with current state-of-the-art offline RL methods that integrate hierarchical structures, as well as conditional generation models. The empirical evaluation encompasses three task categories derived from D4RL (Fu et al., 2020): Gym locomotion control, Adroit, and AntMaze. The Gym locomotion tasks function as a proof of concept in the lower-dimensional realm, to ascertain whether Latent Diffuser is capable of accurately reconstructing trajectories for decision-making and control purposes. Subsequently, Latent Diffuser is evaluated on Adroit, a suite with significant state and action dimensionality, as well as within the AntMaze environment, which represents a sparse-reward continuous-control challenge on a series of extensive long-horizon maps (Li et al., 2023). The subsequent sections describe and examine the performance on these tasks and their respective baselines individually. Scores within 5% of the maximum per task are emphasized in bold (Kostrikov et al., 2022).

5.1 PROOF-OF-CONCEPT: GYM LOCOMOTION CONTROL

Baselines. Initially, an outline of the baselines is provided: CQL (Kumar et al., 2020), IQL (Kostrikov et al., 2022), D-QL (Wang et al., 2023), and QGPO (Lu et al., 2023) are all model-free offline RL methods. MoReL (Kidambi et al., 2020) is a model-based offline RL method. DT (Chen et al., 2021), TT (Janner et al., 2021), Diffuser (Janner et al., 2022), and DD (Ajay et al., 2023) address offline RL tasks via conditional generative modeling. Finally, TAP (Jiang et al., 2023) and HDMI (Li et al., 2023) employ a hierarchical framework grounded in generative modeling. Due to spatial constraints, only the algorithms with the highest performance rankings are displayed here; for a comprehensive comparison, please refer to the appendix.

Table 1: Performance on Gym locomotion control in terms of normalized average returns. Results correspond to the mean and standard error over 5 planning seeds.

| Dataset | Environment | CQL | TT | DD | D-QL | TAP | QGPO | HDMI | LD |
|---|---|---|---|---|---|---|---|---|---|
| Med-Expert | HalfCheetah | 91.6 | 95 | 90.6 ± 1.3 | 96.8 ± 0.3 | 91.8 ± 0.8 | 93.5 ± 0.3 | 92.1 ± 1.4 | 95.2 ± 0.2 |
| Med-Expert | Hopper | 105.4 | 110.0 | 111.8 ± 1.8 | 111.1 ± 1.3 | 105.5 ± 1.7 | 108.0 ± 2.5 | 113.5 ± 0.9 | 112.9 ± 0.3 |
| Med-Expert | Walker2d | 108.8 | 101.9 | 108.8 ± 1.7 | 110.1 ± 0.3 | 107.4 ± 0.9 | 110.7 ± 0.6 | 107.9 ± 1.2 | 111.3 ± 0.2 |
| Medium | HalfCheetah | 44.0 | 46.9 | 49.1 ± 1.0 | 51.1 ± 0.5 | 45.0 ± 0.1 | 54.1 ± 0.4 | 48.0 ± 0.9 | 53.6 ± 0.4 |
| Medium | Hopper | 58.5 | 61.1 | 79.3 ± 3.6 | 90.5 ± 4.6 | 63.4 ± 1.4 | 98.0 ± 2.6 | 76.4 ± 2.6 | 98.5 ± 0.7 |
| Medium | Walker2d | 72.5 | 79 | 82.5 ± 1.4 | 87.0 ± 0.9 | 64.9 ± 2.1 | 86.0 ± 0.7 | 79.9 ± 1.8 | 86.3 ± 0.9 |
| Med-Replay | HalfCheetah | 45.5 | 41.9 | 39.3 ± 4.1 | 47.8 ± 0.3 | 40.8 ± 0.6 | 47.6 ± 1.4 | 44.9 ± 2.0 | 47.3 ± 1.2 |
| Med-Replay | Hopper | 95 | 91.5 | 100 ± 0.7 | 101.3 ± 0.6 | 87.3 ± 2.3 | 96.9 ± 2.6 | 99.6 ± 1.5 | 100.4 ± 0.5 |
| Med-Replay | Walker2d | 77.2 | 82.6 | 75 ± 4.3 | 95.5 ± 1.5 | 66.8 ± 3.1 | 84.4 ± 4.1 | 80.7 ± 2.1 | 82.6 ± 2.1 |
| Average | | 77.6 | 78.9 | 81.8 | 88.0 | 82.5 | 86.6 | 82.6 | 87.5 |

Table 1 shows that Latent Diffuser surpasses specifically designed offline RL methods in the majority of tasks. Furthermore, the performance discrepancy between Latent Diffuser and two-stage algorithms, such as TAP and HDMI, underscores the benefits of the proposed framework, which unifies latent action space representation learning and planning.

5.2 HIGH-DIMENSIONAL MDP: ADROIT

Baselines. Taking into account the large action dimensionality of the Adroit tasks, only baselines that perform well in the previous experiment are evaluated. Additionally, D-QL requires 50 repeated samplings by default for action generation (Wang et al., 2023), which would result in substantial training overhead for high-dimensional action tasks. Consequently, to ensure a fair comparison, D-QL is configured to use only 1 sample, akin to QGPO (Lu et al., 2023); this variant is denoted D-QL@1.

Table 2 demonstrates that the advantages of Latent Diffuser become even more pronounced in high-dimensional tasks. Furthermore, a marked decrease in the performance of sequence modeling methods is observed.
Table 2: Adroit results. These tasks have high action dimensionality (24 degrees of freedom).

| Dataset | Environment | CQL | TT | DD | D-QL@1 | TAP | QGPO | HDMI | LD |
|---|---|---|---|---|---|---|---|---|---|
| Human | Pen | 37.5 | 36.4 | 64.1 ± 9.0 | 66.0 ± 8.3 | 76.5 ± 8.5 | 73.9 ± 8.6 | 66.2 ± 8.8 | 79.0 ± 8.1 |
| Human | Hammer | 4.4 | 0.8 | 1.0 ± 0.1 | 1.3 ± 0.1 | 1.4 ± 0.1 | 1.4 ± 0.1 | 1.2 ± 0.1 | 4.6 ± 0.1 |
| Human | Door | 9.9 | 0.1 | 6.9 ± 1.2 | 8.0 ± 1.2 | 8.8 ± 1.1 | 8.5 ± 1.2 | 7.1 ± 1.1 | 9.8 ± 1.0 |
| Human | Relocate | 0.2 | 0.0 | 0.2 ± 0.1 | 0.2 ± 0.1 | 0.2 ± 0.1 | 0.2 ± 0.1 | 0.1 ± 0.1 | 0.2 ± 0.1 |
| Cloned | Pen | 39.2 | 11.4 | 47.7 ± 9.2 | 49.3 ± 8.0 | 57.4 ± 8.7 | 54.2 ± 9.0 | 48.3 ± 8.9 | 60.7 ± 9.1 |
| Cloned | Hammer | 2.1 | 0.5 | 0.9 ± 0.1 | 1.1 ± 0.1 | 1.2 ± 0.1 | 1.1 ± 0.1 | 1.0 ± 0.1 | 4.2 ± 0.1 |
| Cloned | Door | 0.4 | -0.1 | 9.0 ± 1.6 | 10.6 ± 1.7 | 11.7 ± 1.5 | 11.2 ± 1.4 | 9.3 ± 1.6 | 12.0 ± 1.6 |
| Cloned | Relocate | -0.1 | -0.1 | -0.2 ± 0.0 | -0.2 ± 0.0 | -0.2 ± 0.0 | -0.2 ± 0.0 | -0.1 ± 0.0 | -0.1 ± 0.0 |
| Expert | Pen | 107.0 | 72.0 | 107.6 ± 7.6 | 112.6 ± 8.1 | 127.4 ± 7.7 | 119.1 ± 8.1 | 109.5 ± 8.0 | 131.2 ± 7.3 |
| Expert | Hammer | 86.7 | 15.5 | 106.7 ± 1.8 | 114.8 ± 1.7 | 127.6 ± 1.7 | 123.2 ± 1.8 | 111.8 ± 1.7 | 132.5 ± 1.8 |
| Expert | Door | 101.5 | 94.1 | 87.0 ± 0.8 | 93.7 ± 0.8 | 104.8 ± 0.8 | 98.8 ± 0.8 | 85.9 ± 0.9 | 111.9 ± 0.8 |
| Expert | Relocate | 95.0 | 10.3 | 87.5 ± 2.8 | 95.2 ± 2.8 | 105.8 ± 2.7 | 102.5 ± 2.8 | 91.3 ± 2.6 | 109.5 ± 2.8 |
| Average (w/o expert) | | 11.7 | 6.1 | 16.2 | 17.1 | 19.6 | 18.79 | 16.6 | 21.3 |
| Average (w/ expert) | | 40.3 | 20.1 | 43.2 | 46.1 | 51.9 | 49.5 | 44.3 | 54.6 |

Two primary factors are identified: first, larger action dimensions force tokenization- and autoregression-based techniques (such as TT) to process increasingly lengthy sequences; second, DD and HDMI employ an inverse dynamics model to generate actions independently, and the expansion in action dimension renders that model fitting process more challenging.

5.3 LONG-HORIZON CONTINUOUS CONTROL: ANTMAZE

Baselines. To validate the benefits of latent actions in longer-horizon tasks, an additional comparison is made with hierarchical offline RL methods designed explicitly for long-horizon tasks: CompILE (Kipf et al., 2019), GoFAR (Ma et al., 2022), and HiGoC (Li et al., 2022). Meanwhile, CQL and TT are removed due to their inability to perform well on the high-dimensional Adroit tasks.

Table 3: AntMaze performance in terms of the mean and standard error over 5 planning seeds.

| Environment | CompILE | GoFAR | HiGoC | DD | D-QL@1 | TAP | QGPO | HDMI | LD |
|---|---|---|---|---|---|---|---|---|---|
| AntMaze-Play U-Maze-3 | 41.2 ± 3.6 | 38.5 ± 2.2 | 31.2 ± 3.2 | 73.1 ± 2.5 | 52.9 ± 4.1 | 82.2 ± 2.1 | 59.3 ± 1.3 | 86.1 ± 2.4 | 85.4 ± 1.9 |
| AntMaze-Diverse U-Maze-3 | 23.5 ± 1.8 | 25.1 ± 3.1 | 25.5 ± 1.6 | 49.2 ± 3.1 | 32.5 ± 5.9 | 69.8 ± 0.5 | 38.5 ± 2.6 | 73.7 ± 1.1 | 75.6 ± 2.1 |
| AntMaze-Diverse Large-2 | - | - | - | 46.8 ± 4.4 | - | 69.2 ± 3.2 | - | 71.5 ± 3.5 | 75.8 ± 2.0 |
| Single-task Average | 32.4 | 31.8 | 28.4 | 56.4 | 39.0 | 73.7 | 45.4 | 77.1 | 78.9 |
| Multi AntMaze-Diverse Large-2 | - | - | - | 45.2 ± 4.9 | - | 71.6 ± 3.3 | - | 73.6 ± 3.8 | 73.3 ± 2.6 |
| Multi-task Average | - | - | - | 45.2 | - | 71.6 | - | 73.6 | 73.3 |

Table 3 highlights that sequence-modeling-based hierarchical methods significantly surpass RL-based approaches. Moreover, Latent Diffuser demonstrates performance comparable to two-stage techniques such as TAP and HDMI while requiring only end-to-end training.

6 CONCLUSIONS

In this work, we present a novel approach, Latent Diffuser, for tackling temporally extended offline tasks, addressing the limitations of previous state-of-the-art offline reinforcement learning methods and conditional generation models in handling high-dimensional, long-horizon tasks. Latent Diffuser is capable of end-to-end learning for both the representation of and planning with latent actions, delivering a unified, comprehensive solution for offline decision-making and control. Numerical results on Gym locomotion control, Adroit, and AntMaze demonstrate the effectiveness of Latent Diffuser in comparison with existing hierarchical- and planning-based offline methods.
The performance gains are particularly noticeable in high-dimensional and long-horizon tasks, illustrating the advantages of Latent Diffuser in addressing these challenging scenarios.

ACKNOWLEDGMENTS

We extend our heartfelt gratitude to Professor Hongyuan Zha for his enlightening discussions. This work was supported in part by the Postdoctoral Science Foundation of China (2022M723039).

REFERENCES

Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In ICCV, 2015.

Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In ICLR, 2021.

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B Tenenbaum, Tommi S Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? In ICLR, 2023.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NeurIPS, 2017.

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, 2017.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In NeurIPS, 2021.

John Co-Reyes, Yu Xuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and Sergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. In ICML, 2018.

XIAO Ding, Yi-tong LI, and SHI Chuan. Autonomic discovery of subgoals in hierarchical reinforcement learning. The Journal of China Universities of Posts and Telecommunications, 21(5):94–104, 2014.

Ben Eysenbach, Russ R Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning. In NeurIPS, 2019.

Mehdi Fatemi, Mary Wu, Jeremy Petch, Walter Nelson, Stuart J Connolly, Alexander Benz, Anthony Carnicelli, and Marzyeh Ghassemi. Semi-Markov offline reinforcement learning for healthcare. In CHIL, 2022.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In ICML, 2019.

Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, and Sergey Levine. Offline RL policies should be trained to be adaptive. In ICML, 2022.

Xiaobo Guo and Yan Zhai. K-means clustering based reinforcement learning algorithm for automatic control in robots. International Journal of Simulation Systems, Science & Technology, 17(24):6.1–6.6, 2016.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In NeurIPS, 2021.

Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In ICML, 2022.

Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. In ICLR, 2023.
Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MoReL: Model-based offline reinforcement learning. In NeurIPS, 2020.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. Stat, 1050:10, 2014.

Thomas Kipf, Yujia Li, Hanjun Dai, Vinicius Zambaldi, Alvaro Sanchez-Gonzalez, Edward Grefenstette, Pushmeet Kohli, and Peter Battaglia. CompILE: Compositional imitation learning and execution. In ICML, 2019.

Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement learning with Fisher divergence critic regularization. In ICML, 2021.

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. In ICLR, 2022.

Kalle Kujanpää, Joni Pajarinen, and Alexander Ilin. Hierarchical imitation learning with vector quantized models. In ICML, 2023.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In NeurIPS, 2020.

Yaqing Lai, Wufan Wang, Yunjie Yang, Jihong Zhu, and Minchi Kuang. Hindsight planner. In AAMAS, 2020.

Taeyoon Lee, Donghyun Sung, Kyoungyeon Choi, Choongin Lee, Changwoo Park, and Keunjun Choi. Learning dynamic manipulation skills from haptic-play. arXiv preprint arXiv:2207.14007, 2022.

Jinning Li, Chen Tang, Masayoshi Tomizuka, and Wei Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning. arXiv preprint arXiv:2205.11790, 2022.

Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In AISTATS, 2015.

Wenhao Li, Xiangfeng Wang, Bo Jin, and Hongyuan Zha. Hierarchical diffusion for offline decision making. In ICML, 2023.

Jianfeng Liu, Feiyang Pan, and Ling Luo. GoChat: Goal-oriented chatbots with hierarchical reinforcement learning. In SIGIR, 2020.

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In ICML, 2016.

Shaofan Liu and Shiliang Sun. Safe offline reinforcement learning through hierarchical policies. In PAKDD, 2022.

Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In ICML, 2023.

Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In CoRL, 2020.

Yecheng Jason Ma, Jason Yan, Dinesh Jayaraman, and Osbert Bastani. Offline goal-conditioned reinforcement learning via f-advantage regression. In NeurIPS, 2022.

Yi Ma, Xiaotian Hao, Jianye Hao, Jiawen Lu, Xing Liu, Tong Xialiang, Mingxuan Yuan, Zhigang Li, Jie Tang, and Zhaopeng Meng. A hierarchical reinforcement learning based optimization framework for large-scale dynamic pickup and delivery problems. In NeurIPS, 2021.

Ajay Mandlekar, Fabio Ramos, Byron Boots, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Dieter Fox. IRIS: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. In ICRA, 2020.

Soroush Nasiriany, Huihan Liu, and Yuke Zhu. Augmenting reinforcement learning with behavior primitives for diverse manipulation tasks. In ICRA, 2022.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.

Shubham Pateria, Budhitama Subagdja, and Ah Hwee Tan. Hierarchical reinforcement learning with integrated discovery of salient subgoals. In AAMAS, 2020.
Shubham Pateria, Budhitama Subagdja, Ah-hwee Tan, and Chai Quek. Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys, 54(5):1–35, 2021.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.

Sujoy Paul, Jeroen Vanbaar, and Amit Roy-Chowdhury. Learning from trajectories via subgoal discovery. In NeurIPS, 2019.

Karl Pertsch, Youngwoon Lee, Yue Wu, and Joseph J. Lim. Demonstration-guided reinforcement learning with learned skills. In CoRL, 2021.

Dushyant Rao, Fereshteh Sadeghi, Leonard Hasenclever, Markus Wulfmeier, Martina Zambelli, Giulia Vezzani, Dhruva Tirumala, Yusuf Aytar, Josh Merel, Nicolas Heess, et al. Learning transferable motor skills with hierarchical latent mixture policies. In ICLR, 2022.

Tongzheng Ren, Jialian Li, Bo Dai, Simon S Du, and Sujay Sanghavi. Nearly horizon-free offline reinforcement learning. In NeurIPS, 2021.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task-agnostic offline reinforcement learning. In CoRL, 2022.

Sasha Salter, Markus Wulfmeier, Dhruva Tirumala, Nicolas Heess, Martin Riedmiller, Raia Hadsell, and Dushyant Rao. MO2: Model-based offline options. In CoLLAs, 2022.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In ICML, 2015.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.

Yang Song. Generative modeling by estimating gradients of the data distribution. yang-song.net, May 2021. URL https://yang-song.net/blog/2021/score/.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.

Martin Stolle and Doina Precup. Learning options in reinforcement learning. In Abstraction, Reformulation, and Approximation: 5th International Symposium, SARA 2002, Kananaskis, Alberta, Canada, August 2–4, 2002, Proceedings 5, pp. 212–223. Springer, 2002.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In NeurIPS, 2021.

Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.

Siddarth Venkatraman. Latent skill models for offline reinforcement learning. Master's thesis, Carnegie Mellon University, Pittsburgh, PA, 2023.

Valentin Villecroze, Harry Braviner, Panteha Naderian, Chris Maddison, and Gabriel Loaiza-Ganem. Bayesian nonparametrics for offline skill discovery. In ICML, 2022.

Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning search space partition for black-box optimization using Monte Carlo tree search. In NeurIPS, 2020.
Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. In ICLR, 2023.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Markus Wulfmeier, Dushyant Rao, Roland Hafner, Thomas Lampe, Abbas Abdolmaleki, Tim Hertweck, Michael Neunert, Dhruva Tirumala, Noah Siegel, Nicolas Heess, et al. Data-efficient hindsight off-policy option learning. In ICML, 2021.

Ruobing Xie, Shaoliang Zhang, Rui Wang, Feng Xia, and Leyu Lin. Hierarchical reinforcement learning for integrated recommendation. In AAAI, 2021.

Kevin Yang, Tianjun Zhang, Chris Cummins, Brandon Cui, Benoit Steiner, Linnan Wang, Joseph E Gonzalez, Dan Klein, and Yuandong Tian. Learning space partitions for path planning. In NeurIPS, 2021.

Yiqin Yang, Hao Hu, Wenzhe Li, Siyuan Li, Jun Yang, Qianchuan Zhao, and Chongjie Zhang. Flow to control: Offline reinforcement learning with lossless primitive discovery. In AAAI, 2023.

Guojing Zhou, Hamoon Azizsoltani, Markel Sanz Ausin, Tiffany Barnes, and Min Chi. Hierarchical reinforcement learning for pedagogical policy induction. In IJCAI, 2019.

SUPPLEMENTARY MATERIAL

Table of Contents
A Limitations and Societal Impact
B Preliminaries
  B.1 Offline Reinforcement Learning
  B.2 Diffusion Probabilistic Models
C Missing Related Work
  C.1 Model-based Reinforcement Learning
  C.2 Action Representation Learning
  C.3 Offline Reinforcement Learning
  C.4 Reinforcement Learning as Sequence Modeling
  C.5 Controllable Sampling with Generative Models
D Missing Results and Analyses
  D.1 Proof-of-Concept Example: maze-2d-open
  D.2 Main Results
E Implementation and Training Details
  E.1 Baseline Details
  E.2 Implementation Details
F Modeling Selection
  F.1 Planning in the Raw Action Space
  F.2 Planning in the Skill Space
G Ablation Studies
H Latent Action Visualization
I Missing Derivations
  I.1 Proof of Theorem 1
  I.2 Proof of Theorem 2 and Theorem 3

A LIMITATIONS AND SOCIETAL IMPACT

Limitations. Latent Diffuser, analogous to other diffusion-based methods for offline decision-making, exhibits a protracted inference time owing to the iterative nature of the sampling process. This challenge could be alleviated through the adoption of approaches that enable accelerated sampling (Lu et al., 2022a;b) or by distilling these diffusion models into alternative methods necessitating fewer sampling iterations (Song et al., 2023).
Additionally, similar to TAP (Jiang et al., 2023), our empirical findings on continuous control with deterministic dynamics indicate that Latent Diffuser can manage epistemic uncertainty. However, the efficacy of Latent Diffuser in addressing tasks characterized by stochastic dynamics without modification remains unascertained. Furthermore, a deficiency of our methodology is the requirement that both the latent step L and the planning horizon H of the latent action remain constant. We hope that facilitating adaptive variation of these hyperparameters may enhance performance.

Societal Impact. As with other deep generative modeling techniques, the energy-guided diffusion sampling employed in this paper possesses the potential to generate harmful content and may perpetuate and exacerbate pre-existing undesirable biases present in the offline dataset.

B PRELIMINARIES

B.1 OFFLINE REINFORCEMENT LEARNING

In general, reinforcement learning (RL) represents the problem of sequential decision-making through a Markov Decision Process M = (S, A, P, r, γ), encompassing a state space S and an action space A. Given states s, s′ ∈ S and an action a ∈ A, the transition probability function is expressed as P(s′ | s, a) : S × A × S → [0, 1] and the reward function is defined by r(s, a, s′) : S × A × S → R. The discount factor is denoted by γ ∈ (0, 1]. The policy is represented as π : S × A → [0, 1], indicating the probability of taking action a in state s as π(a | s). For timestep t ∈ [1, T], the cumulative discounted reward, also referred to as the reward-to-go, is given by $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$. The principal objective of online RL is to determine a policy π that maximizes

$$J = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[\sum_{t=1}^{T} \gamma^{t-1} r\left(s_t, a_t, s_{t+1}\right)\right]$$

via learning from transitions (s, a, r, s′) during environment interaction (Sutton & Barto, 2018). Conversely, in offline RL, a static dataset D, collected by a behavior policy πµ, is employed for acquiring a policy π that optimizes J for subsequent application in the interactive environment. The behavior policy πµ can either constitute a single policy or an amalgamation of various policies; however, it remains inaccessible. The data is presumed to be collected trajectory-wise, as represented by $\mathcal{D} = \{\tau_i\}_{i=1}^{|\mathcal{D}|}$, where $\tau = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{T}$.

B.2 DIFFUSION PROBABILISTIC MODELS

This section introduces the diffusion probabilistic model within the context of Latent Diffuser. Diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020) constitute a likelihood-based generative framework for learning data distributions q(z) from offline datasets expressed as D := {zi}, wherein the index i denotes a specific sample within the dataset (Song, 2021) and zi are the latent actions encoded by the pretrained encoder qϕ. A core concept within diffusion probabilistic models lies in the representation of the (Stein) score function (Liu et al., 2016), which does not necessitate a tractable normalizing constant (also referred to as the partition function). The discrete-time generation procedure encompasses a designed forward noising (or diffusion) process $q(z_{k+1} \mid z_k) := \mathcal{N}\left(z_{k+1}; \sqrt{\alpha_k}\, z_k, (1 - \alpha_k) I\right)$ at (forward) diffusion timestep k, coupled with a learnable reverse denoising (or diffusion) process $p_\theta(z_{k-1} \mid z_k) := \mathcal{N}\left(z_{k-1} \mid \mu_\theta(z_k, k), \Sigma_k\right)$ at (backward) diffusion timestep k.
N(µ, Σ) signifies a Gaussian distribution characterized by mean µ and variance Σ, and αk ∈ R establishes the variance schedule. In order to ensure consistency with the main-text notation, we slightly overload symbols and denote $\alpha_k := \sqrt{\prod_{i=1}^{k} \alpha_i}$ and $\sigma_k := \sqrt{1 - \prod_{i=1}^{k} \alpha_i}$, so that the noised sample can be written as $z_k = \alpha_k z_0 + \sigma_k \epsilon$. z0 := z corresponds to a sample in D; z1, z2, . . . , zK−1 signify the latent variables, or noised latent actions; and zK ∼ N(0, I) for judiciously selected αk values and a sufficiently large K. Commencing with Gaussian noise, samples undergo iterative generation via a sequence of denoising steps. An optimizable and tractable variational lower bound on log pθ serves to train the denoising operator, with a simplified surrogate loss proposed by Ho et al. (2020):

$$\mathcal{L}_{\text{denoise}}(\theta) := \mathbb{E}_{k \sim [1, K],\, z_0 \sim q,\, \epsilon \sim \mathcal{N}(0, I)}\left[\left\| \epsilon - \epsilon_\theta(z_k, k) \right\|^2\right]. \tag{8}$$

The predicted noise ϵθ(zk, k), parameterized by a deep neural network, emulates the noise ε ∼ N(0, I) added to the dataset sample z0 to yield the noisy zk in the noising process.

Conditional Diffusion Probabilistic Models. Intriguingly, the conditional distribution q(z | y(z)) facilitates sample generation under the condition y(z). Within the context of this paper, y(z) is instantiated as the initial state s1. The equivalence between diffusion probabilistic models and score matching (Song et al., 2021) reveals that $\epsilon_\theta(z_k, k) \propto -\nabla_{z_k} \log p(z_k)$, giving rise to two categorically equivalent methodologies for conditional sampling with diffusion probabilistic models: the classifier-guided technique (Nichol & Dhariwal, 2021) and the classifier-free technique (Ho & Salimans, 2022); the latter is employed in our work. This method modifies the preliminary training configuration, learning both a conditional noise model ϵθ(zk, s1, k) and an unconditional noise model ϵθ(zk, k). The unconditional noise is realized as the conditional noise ϵθ(zk, ∅, k), with a placeholder ∅ replacing s1; when y(z) = ∅, the entries of the placeholder embedding are zeroed out. The perturbed noise $\hat{\epsilon}_k := \epsilon_\theta(z_k, k) + \omega\left(\epsilon_\theta(z_k, s_1, k) - \epsilon_\theta(z_k, k)\right)$ is subsequently employed to generate samples. Additionally, we adopt low-temperature sampling in the denoising process to ensure higher-quality latent actions (Ajay et al., 2023). Concretely, we compute µk−1 and Σk−1 from the noised latent actions zk and the perturbed noise $\hat{\epsilon}_k$, and subsequently sample $z_{k-1} \sim \mathcal{N}(\mu_{k-1}, \alpha \Sigma_{k-1})$ with the variance scaled by α ∈ [0, 1).

C MISSING RELATED WORK

C.1 MODEL-BASED REINFORCEMENT LEARNING

Latent Diffuser is part of a research trajectory focused on model-based reinforcement learning (RL) (Sutton, 1990; Janner et al., 2019; Schrittwieser et al., 2020; Lu et al., 2021; Eysenbach et al., 2022; Suh et al., 2023), as it makes decisions by forecasting future outcomes. These approaches frequently employ predictions in the raw Markov Decision Process (MDP), which entails that models accept the current raw state and action as input and output probability distributions over subsequent states and rewards. Hafner et al. (2019), Ozair et al. (2021), Hafner et al. (2021), Hafner et al. (2023), and Chitnis et al. (2023) propose acquiring a latent state space in conjunction with a dynamics function. Contrarily, in their cases, the action space accessible to the planner remains identical to that of the raw MDP, and the execution of the plan maintains its connection to the original temporal structure of the environment.

C.2 ACTION REPRESENTATION LEARNING
C MISSING RELATED WORK

C.1 MODEL-BASED REINFORCEMENT LEARNING

Latent Diffuser is incorporated into a research trajectory focused on model-based reinforcement learning (RL) (Sutton, 1990; Janner et al., 2019; Schrittwieser et al., 2020; Lu et al., 2021; Eysenbach et al., 2022; Suh et al., 2023), as it makes decisions by forecasting future outcomes. These approaches frequently employ predictions in the raw Markov Decision Process (MDP), meaning that models accept the current raw state and action as input and output probability distributions over subsequent states and rewards. Hafner et al. (2019), Ozair et al. (2021), Hafner et al. (2021), Hafner et al. (2023), and Chitnis et al. (2023) proposed acquiring a latent state space in conjunction with a dynamics function. However, in their cases, the action space accessible to the planner remains identical to that of the raw MDP, and the execution of the plan remains tied to the original temporal structure of the environment.

C.2 ACTION REPRESENTATION LEARNING

The concept of learning a representation for actions and conducting RL within a latent action space has been investigated in the context of model-free RL (Merel et al., 2019; Allshire et al., 2021; Zhou et al., 2021; Chen et al., 2022; Peng et al., 2022; Dadashi et al., 2022). In contrast to Latent Diffuser, where the latent action space is utilized to promote efficacy and robustness in planning, the motivations for obtaining a latent action space in model-free approaches vary, yet the underlying objective centers on providing policy constraints. For instance, Merel et al. (2019) and Peng et al. (2022) implement this concept for humanoid control to ensure the derived policies resemble low-level human demonstration behavior and can thus be regarded as natural. Zhou et al. (2021) and Chen et al. (2022) employ latent actions to prevent out-of-distribution (OOD) actions within the offline RL framework. Dadashi et al. (2022) proposes adopting a discrete latent action space to facilitate the application of methods designed explicitly for discrete action spaces to continuous cases. In the teleoperation literature, Karamcheti et al. (2021) and Losey et al. (2022) embed high-dimensional robotic actions into lower-dimensional, human-controllable latent actions. Additionally, several works focus on learning action representations for improved planning efficiency. Wang et al. (2020) and Yang et al. (2021) learn action representations for on-the-fly learning, applicable to black-box optimization and path planning scenarios. Despite high-level similarities, these papers assume prior knowledge of environment dynamics. TAP (Jiang et al., 2023) extends this framework into the offline RL domain, where the actual environmental dynamics remain unknown, necessitating joint learning of the dynamics model and the latent action representation. Nevertheless, TAP is constrained to a discrete latent action space and demands costly additional planning. Latent Diffuser achieves representation learning and planning for continuous latent actions in an end-to-end manner by leveraging the latent diffusion model.

C.3 OFFLINE REINFORCEMENT LEARNING

Latent Diffuser is devised for offline RL (Ernst et al., 2005; Levine et al., 2020), which precludes the use of online experiences for policy improvement. A principal hurdle in offline RL is preventing the learned policy from selecting out-of-distribution (OOD) actions, so as to avoid exploiting inaccuracies in the value function and model. Conservatism (Kumar et al., 2020; Kidambi et al., 2020b; Fujimoto & Gu, 2021; Lu et al., 2022c; Kostrikov et al., 2022) has been proposed as a standard solution to this challenge. Latent Diffuser inherently prevents OOD actions via planning in a learned latent action space. In pursuit of adherence to a potentially diverse behavior policy, recent works have identified diffusion models as powerful generative tools, which generally surpass preceding generative approaches such as Gaussian policies (Peng et al., 2019; Wang et al., 2020b; Nair et al., 2020) and variational autoencoders (VAEs) (Fujimoto et al., 2019; Wang et al., 2021) for behavior modeling. Different methods adopt distinct strategies for generating actions that maximize the learned Q-functions. Diffusion-QL (Wang et al., 2023) uses gradients of the Q-function with respect to actions produced by the behavior diffusion policy to guide generated actions towards higher Q-value regions.
SfBC (Chen et al., 2023b) and Diffusion-QL employ a similar idea, resampling actions from multiple behavioral action candidates with predicted Q-values serving as the sampling weights. Ada et al. (2023) incorporates a state-reconstruction-loss-based regularization term within diffusion-based policy training, consequently bolstering generalization to OOD states. Alternative works (Goo & Niekum, 2022; Pearce et al., 2023; Block et al., 2023; Suh et al., 2023) deploy diffusion models solely for behavior cloning or planning, rendering Q-value maximization unnecessary. Contrary to the works above, which remain aligned with the RL paradigm, Latent Diffuser addresses offline RL challenges through sequence modeling (refer to the subsequent section). Compared to RL-based offline methodologies, sequence modeling offers benefits for temporally extended and sparse- or delayed-reward tasks.

C.4 REINFORCEMENT LEARNING AS SEQUENCE MODELING

Latent Diffuser stems from an emerging body of research that conceptualizes RL as a sequence modeling problem (Bhargava et al., 2023). Depending on the model skeleton, this literature may be classified into two primary categories. The first category comprises models that leverage a GPT-2 (Radford et al., 2019) style Transformer architecture (Vaswani et al., 2017), also referred to as causal transformers, for the autoregressive modeling of states, actions, rewards, and returns, ultimately converting predictive capabilities into a policy. Examples include Decision Transformer (DT, Chen et al. 2021) and Zheng et al. (2022), which apply an Upside-Down RL technique (Schmidhuber, 2019) under both offline and online RL settings, and Trajectory Transformer (TT, Janner et al. 2021), which employs planning to obtain optimal trajectories that maximize return. Chen et al. (2023a) introduced a non-autoregressive planning algorithm based on energy minimization, while Jia et al. (2023) enhanced generalization to unseen tasks through refined in-context example design. Lastly, Wu et al. (2023) addresses trajectory stitching challenges by adjusting the history length employed in DT. The second category features models based on a score-based diffusion process for the non-autoregressive modeling of state and action trajectories. Different methods select various conditional samplers to generate actions that maximize the return. Diffuser (Janner et al., 2022) emulates the classifier-guidance methodology (Nichol & Dhariwal, 2021), employing the guidance methods delineated in Appendix F.1. Alternatively, Decision Diffuser (Ajay et al., 2023) and its derivatives (Li et al., 2023; Hu et al., 2023) explore classifier-free guidance (Ho & Salimans, 2022). Extensions of this concept to multi-task settings are presented by He et al. (2023) and Ni et al. (2023), while Liang et al. (2023) utilizes the diffusion model as a sample generator for unseen tasks, thus improving generalization capabilities. Latent Diffuser offers a more efficient planning solution that enables these sequence modeling algorithms to navigate complex action spaces effectively.

C.5 CONTROLLABLE SAMPLING WITH GENERATIVE MODELS

Latent Diffuser produces trajectories corresponding to optimal policies by employing controllable sampling within diffusion models. Current methods for facilitating controllable generation in diffusion models primarily emphasize conditional guidance.
Such approaches leverage a pretrained diffusion model to define the prior distribution $q(x)$ and strive to obtain samples from the distribution proportional to $q(x)\exp(-\beta \mathcal{E}(x))$. Graikos et al. (2022) introduces a training-free sampling technique, which finds application in approximating solutions to traveling salesman problems. Poole et al. (2023) capitalizes on a pretrained 2D diffusion model and optimizes 3D parameters for generating 3D shapes. Kawar et al. (2022) and Chung et al. (2023) exploit pretrained diffusion models for addressing linear and specific non-linear inverse problems, such as image restoration, deblurring, and denoising. Diffuser (Janner et al., 2022) and Decision Diffuser (Ajay et al., 2023) use pretrained diffusion models to solve the offline RL problem. Zhao et al. (2022) and Bao et al. (2023) employ human-crafted intermediate energy guidance for tasks including image-to-image translation and inverse molecular design. In a more recent development, Lu et al. (2023) presents a comprehensive framework for incorporating human control within the sampling process of diffusion models.

Remark   Recently, Venkatraman (2023) proposed a novel algorithm called LDCQ, which is very similar to Latent Diffuser. Specifically, LDCQ also introduces a latent diffusion model to learn a latent action space. Unlike the latent action used by Latent Diffuser, the learned latent action space in LDCQ belongs to the skill space, similar to what is described in Appendix F.2. Additionally, LDCQ does not perform planning within the learned skill space but instead uses model-free TD-learning methods to choose the optimal skill at each timestep and obtains the final action using a decoder. In summary, LDCQ and Latent Diffuser represent two orthogonal approaches to utilizing latent diffusion models in offline RL: the former still adopts the RL framework to model offline RL problems, while Latent Diffuser approaches the problem from a conditional generative perspective.

D MISSING RESULTS AND ANALYSES

D.1 PROOF-OF-CONCEPT EXAMPLE: MAZE-2D-OPEN

Figure 3: Proof-of-concept example. (a) Suboptimal trajectories. (b) Stitched near-optimal trajectories. We demonstrate the importance of planning through an experiment designed by Decision Diffuser (Ajay et al., 2023, DD; Appendix A.1). The diffusion model achieves trajectory stitching, a process essential for handling a large number of suboptimal trajectories, through implicit planning.

Offline datasets for most tasks contain a large number of suboptimal trajectories. To learn better policies rather than merely performing behavior cloning, an algorithm must be capable of trajectory stitching. To validate whether Latent Diffuser can achieve trajectory stitching through implicit planning, we adopt the same experimental setup as Ajay et al. (2023, Appendix A.1). In the maze-2D-open environment, the objective is to navigate towards the target area situated on the right side, with the reward being the negative distance to this target area. The training dataset is composed of 500 trajectories originating from the left side and terminating at the bottom side, as well as 500 trajectories starting from the bottom side and ending at the right side. Each trajectory is constrained to a maximum length of 50.
At test time, the agent begins on the left side and aims to reach the right side as efficiently as possible. As demonstrated in Figure 3 and consistent with the findings of Ajay et al. (2023, Appendix A.1), Latent Diffuser can effectively stitch trajectories from the training dataset to produce trajectories that traverse from the left side to the right side in (near) straight lines.

D.2 MAIN RESULTS

The performance in Gym locomotion control, in terms of normalized average returns, of all baselines is shown in Table 4.

Table 4: The performance in Gym locomotion control in terms of normalized average returns of all baselines. Results correspond to the mean and standard error over 5 planning seeds.

| Dataset | Environment | CQL | IQL | DT | TT | MoReL | Diffuser |
| Med-Expert | HalfCheetah | 91.6 | 86.7 | 86.8 | 95 | 53.3 | 79.8 |
| Med-Expert | Hopper | 105.4 | 91.5 | 107.6 | 110.0 | 108.7 | 107.2 |
| Med-Expert | Walker2d | 108.8 | 109.6 | 108.1 | 101.9 | 95.6 | 108.4 |
| Medium | HalfCheetah | 44.0 | 47.4 | 42.6 | 46.9 | 42.1 | 44.2 |
| Medium | Hopper | 58.5 | 66.3 | 67.6 | 61.1 | 95.4 | 58.5 |
| Medium | Walker2d | 72.5 | 78.3 | 74.0 | 79 | 77.8 | 79.7 |
| Med-Replay | HalfCheetah | 45.5 | 44.2 | 36.6 | 41.9 | 40.2 | 42.2 |
| Med-Replay | Hopper | 95 | 94.7 | 82.7 | 91.5 | 93.6 | 96.8 |
| Med-Replay | Walker2d | 77.2 | 73.9 | 66.6 | 82.6 | 49.8 | 61.2 |
| Average | | 77.6 | 77 | 74.7 | 78.9 | 72.9 | 75.3 |

| Dataset | Environment | DD | D-QL | TAP | QGPO | HDMI | Latent Diffuser |
| Med-Expert | HalfCheetah | 90.6 ± 1.3 | 96.8 ± 0.3 | 91.8 ± 0.8 | 93.5 ± 0.3 | 92.1 ± 1.4 | 95.2 ± 0.2 |
| Med-Expert | Hopper | 111.8 ± 1.8 | 111.1 ± 1.3 | 105.5 ± 1.7 | 108.0 ± 2.5 | 113.5 ± 0.9 | 112.9 ± 0.3 |
| Med-Expert | Walker2d | 108.8 ± 1.7 | 110.1 ± 0.3 | 107.4 ± 0.9 | 110.7 ± 0.6 | 107.9 ± 1.2 | 111.3 ± 0.2 |
| Medium | HalfCheetah | 49.1 ± 1.0 | 51.1 ± 0.5 | 45.0 ± 0.1 | 54.1 ± 0.4 | 48.0 ± 0.9 | 53.6 ± 0.4 |
| Medium | Hopper | 79.3 ± 3.6 | 90.5 ± 4.6 | 63.4 ± 1.4 | 98.0 ± 2.6 | 76.4 ± 2.6 | 98.5 ± 0.7 |
| Medium | Walker2d | 82.5 ± 1.4 | 87.0 ± 0.9 | 64.9 ± 2.1 | 86.0 ± 0.7 | 79.9 ± 1.8 | 86.3 ± 0.9 |
| Med-Replay | HalfCheetah | 39.3 ± 4.1 | 47.8 ± 0.3 | 40.8 ± 0.6 | 47.6 ± 1.4 | 44.9 ± 2.0 | 47.3 ± 1.2 |
| Med-Replay | Hopper | 100 ± 0.7 | 101.3 ± 0.6 | 87.3 ± 2.3 | 96.9 ± 2.6 | 99.6 ± 1.5 | 100.4 ± 0.5 |
| Med-Replay | Walker2d | 75 ± 4.3 | 95.5 ± 1.5 | 66.8 ± 3.1 | 84.4 ± 4.1 | 80.7 ± 2.1 | 82.6 ± 2.1 |
| Average | | 81.8 | 88.0 | 82.5 | 86.6 | 82.6 | 87.5 |

This section then provides a more detailed analysis of the performance differences among the baselines across the various tasks. To provide an intuitive comparison of the different algorithms, we classify them from three perspectives (planning, hierarchy, and generative) according to the classification shown in Table 5.

Firstly, the Gym locomotion task has a long horizon, dense rewards, and low action dimensionality, making it a standard benchmark for offline RL. The results in Table 4 show that generative methods based on diffusion models generally perform better. The community currently attributes this to diffusion models' more powerful representation capabilities when modeling complex policies or environment models. However, Latent Diffuser does not fully demonstrate its advantages in the low-dimensional action space. Although Latent Diffuser approaches SOTA performance on this task, this is mainly due to a better diffusion sampling method, as supported by the solid performance of the QGPO method. Due to the dense rewards, planning- and hierarchy-based methods, such as TAP and HDMI, do not achieve the best results.

Secondly, the Adroit task is characterized by a high-dimensional action space. This leads to the best performance for TAP and Latent Diffuser (see Table 2), the two methods based on latent actions, which experimentally verifies the effectiveness of latent actions.
Additionally, generative methods based on diffusion models generally exhibit better performance. However, due to the shorter horizon of the Adroit task, the HDMI method, which is based on planning and hierarchy, does not achieve the best performance.

Lastly, the AntMaze task has a longer horizon and very sparse rewards. This gives latent actions ample room to demonstrate their benefits (see Table 3). Moreover, methods based on planning and hierarchy, such as HDMI, also achieve good results. In this task, non-generative methods based on planning and hierarchy, such as CompILE and GoFAR, approach the performance of generative methods without planning and hierarchy (D-QL).

The Performance Gap Between TAP and Latent Diffuser   For TAP and Latent Diffuser, the performance gap between them on the expert datasets is smaller than on the other datasets in the Adroit and Gym locomotion tasks. Our analysis suggests that the primary source of this performance gap is the proportion of suboptimal trajectories in the dataset. In non-expert datasets, the proportion of suboptimal trajectories is more significant. To learn the optimal policy from such a dataset, the algorithm needs trajectory stitching ability, i.e., the ability to splice segments of suboptimal trajectories into an optimal trajectory. On the one hand, most current offline RL methods are based on a dynamic programming framework that learns a Q-function. However, these methods require the function class of the Q-function to satisfy Bellman completeness to achieve good performance, and designing a function class with Bellman completeness is very challenging (Zhou et al., 2023). On the other hand, Ajay et al. (2023, Appendix A.1) found that generative methods based on diffusion models possess implicit dynamic programming capabilities. These methods use the powerful representation ability of diffusion models to bypass Bellman completeness and achieve trajectory stitching, which allows them to perform well on datasets containing more suboptimal trajectories. Latent Diffuser is a generative method based on a diffusion model, while TAP is not, leading to a larger performance gap between the two on non-expert datasets. On expert datasets, however, Latent Diffuser's advantage cannot be demonstrated.

E IMPLEMENTATION AND TRAINING DETAILS

In the following subsections, we delineate the hyperparameter configurations and training methodologies employed in the numerical experiments for both the baseline models and the proposed Latent Diffuser. Additionally, we supply references for the performance metrics of prior evaluations conducted on standardized tasks for the baseline models. Each task undergoes assessment with a total of 5 distinct training seeds, evaluated over a span of 20 episodes. Adhering to the established evaluation protocols of TT (Janner et al., 2021) and IQL (Kostrikov et al., 2022), the v2 versions of the datasets are employed for the locomotion control experiments, whereas the v0 versions are utilized for the remaining tasks.

E.1 BASELINE DETAILS

Before discussing the specific baseline implementation details, we first make a simple comparison of all baselines on the 3 tasks from 3 perspectives: whether planning is introduced, whether a hierarchical structure is involved, and whether generative learning is introduced, as shown in Table 5.

Table 5: Comparison of different baselines at three levels. ● means inclusive, ○ means exclusive, and ◐ means a cheaper approximation.
CompILE | CQL | IQL | D-QL | D-QL@1 | QGPO | Diffuser | DD
Planning: ○ ○ ○ ○ ○ ○ ◐ ◐
Hierarchy: ○ ○ ○ ○ ○ ○ ○
Generative: ○ ○ ○

DT | TT | MoReL | HiGoC | GoFAR | TAP | HDMI | Latent Diffuser
Planning: ○ ◐ ◐
Hierarchy: ○ ○ ○
Generative: ○ ○ ○

E.1.1 GYM LOCOMOTION CONTROL

The results of CQL in Table 1 and Table 4 are reported in (Kostrikov et al., 2022, Table 1);
The results of IQL in Table 4 are reported in (Kostrikov et al., 2022, Table 1);
The results of DT in Table 4 are reported in (Chen et al., 2021, Table 2);
The results of TT in Table 1 and Table 4 are reported in (Janner et al., 2021, Table 1);
The results of MoReL in Table 4 are reported in (Kidambi et al., 2020, Table 2);
The results of Diffuser in Table 4 are reported in (Janner et al., 2022, Table 2);
The results of DD in Table 1 and Table 4 are reported in (Ajay et al., 2023, Table 1);
The results of D-QL in Table 1 and Table 4 are reported in (Wang et al., 2023, Table 1);
The results of TAP in Table 1 and Table 4 are reported in (Jiang et al., 2023, Table 1);
The results of QGPO in Table 1 and Table 4 are reported in (Lu et al., 2023, Table 2);
The results of HDMI in Table 1 and Table 4 are reported in (Li et al., 2023, Table 3).

E.1.2 ADROIT ENVIRONMENT

The results of CQL in Table 2 are reported in (Kostrikov et al., 2022, Table 1);
The results of DD in Table 2 are generated by using the official repository3 from the original paper (Ajay et al., 2023) with default hyperparameters;
The results of D-QL@1 in Table 2 are generated by using the official repository4 from the original paper (Wang et al., 2023) with default hyperparameters;
The results of QGPO in Table 2 are generated by using the official repository5 from the original paper (Lu et al., 2023) with default hyperparameters;
The results of HDMI in Table 2 are generated by re-implementing the algorithm from the original paper (Li et al., 2023) with default hyperparameters;
The results of TT in Table 2 are reported in (Janner et al., 2021, Table 1);
The results of TAP in Table 2 are reported in (Jiang et al., 2023, Table 1).

It is essential to mention that D-QL employs a resampling procedure for evaluation. More precisely, during evaluation, the acquired policy first produces 50 distinct action candidates and then selects the single action with the highest Q-value for execution. We empirically find that this strategy is critical for achieving satisfactory performance on Adroit tasks. Nonetheless, the technique poses challenges in accurately representing the quality of the initially sampled actions prior to the resampling procedure. Additionally, due to the high dimensionality of the action space, it incurs considerable computational overhead. As a result, the resampling process has been eliminated from the evaluation, utilizing a single action candidate (referred to as D-QL@1), akin to QGPO (Lu et al., 2023).

E.1.3 ANTMAZE ENVIRONMENT

The results of CompILE, GoFAR, DD, and HDMI in Table 3 are reported in (Li et al., 2023, Table 2);
The results of HiGoC in Table 3 are generated by re-implementing HiGoC (Li et al., 2022) based on CQL6 and cVAE7, tuning over two hyperparameters: the learning rate in [3e-4, 1e-3] and the weight of the KL regularization in [0.05, 0.2];
The results of D-QL@1 in Table 3 are generated by using the official repository from the original paper (Wang et al., 2023) with default hyperparameters.
The results of TAP in Table 3 are generated by using the official repository8 from the original paper (Jiang et al., 2023) with default hyperparameters;
The results of QGPO in Table 3 are generated by using the official repository from the original paper (Lu et al., 2023) with default hyperparameters.

3https://github.com/anuragajay/decision-diffuser/tree/main/code.
4https://github.com/Zhendong-Wang/Diffusion-Policies-for-Offline-RL.
5https://github.com/thu-ml/CEP-energy-guided-diffusion.
6https://github.com/aviralkumar2907/CQL.
7https://github.com/timbmg/VAE-CVAE-MNIST.
8http://github.com/ZhengyaoJiang/latentplan.

E.2 IMPLEMENTATION DETAILS

The forthcoming release of the complete source code will be subject to the Creative Commons Attribution 4.0 License (CC BY), with the exception of the Gym locomotion control, Adroit, and AntMaze datasets, which retain their respective licensing arrangements. The computational infrastructure consists of two servers, each possessing 256 GB of system memory and a pair of NVIDIA GeForce RTX 3090 graphics processing units equipped with 24 GB of video memory.

E.2.1 REPRESENTATION LEARNING FOR LATENT ACTION

Encoder and Decoder.   The score-based prior model is significantly influenced by the architecture of the bottleneck, encompassing both the encoder and the decoder, which subsequently impacts its application for planning. Adhering to Decision Transformer (Chen et al., 2021) and Trajectory Transformer (Janner et al., 2021), TAP (Jiang et al., 2023) employs a GPT-2-style transformer incorporating causal masking within its encoder and decoder. Consequently, information from future tokens does not propagate backward to their preceding counterparts. However, this conventional design remains prevalent in sequence modeling without guaranteeing optimality. For instance, one could invert the masking order in the decoder, thus rendering the planning goal-based. A detailed examination of the autoregressive (causal) versus simultaneous generation of optimal action sequences can be found in (Janner et al., 2022, Section 3.1). The discourse presented in Janner et al. (2022) remains germane to the context of the latent action space. In particular, it is reasonable to assume that latent action generation adheres to causality, whereby subsequent latent actions are contingent upon previous and current latent actions. However, decision-making or optimal control may exhibit anti-causality, as the subsequent latent action may rely on future information, such as future rewards. In general RL scenarios, the dependence on future information originates from the presumption of future optimality, with the intent of developing a dynamic programming recursion. This notion is reflected by the future optimality variables $\mathcal{O}_{t:T}$ present in the action distribution $\log p(a_t \mid s_t, \mathcal{O}_{t:T})$ (Levine, 2018). The aforementioned analysis lends support to the diffusion-sampling-as-planning framework. As a result, causal masking is eliminated from the GPT-2-style encoder and state decoder design within the Latent Diffuser. Additionally, the action decoder is implemented as a 2-layer MLP with 512 hidden units and ReLU activation functions, effectively constituting an inverse dynamics model. Concurrently, a 3-layer MLP with 1024 hidden units and ReLU activation functions represents the reward and return decoder.
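For concreteness, the following is a minimal sketch of these two decoder heads, assuming an inverse-dynamics parameterization that predicts $a_t$ from $(s_t, s_{t+1})$ and a joint reward/return head conditioned on $(s_t, a_t)$; the input choices, output dimensions, and class names are illustrative assumptions, not the released implementation.

```python
# Sketch of the decoder heads described above (illustrative assumptions only).
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, n_hidden):
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class LatentActionDecoders(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        # Action decoder: 2-layer MLP, 512 hidden units, acting as an inverse
        # dynamics model (here assumed to take consecutive states as input).
        self.action_decoder = mlp(2 * state_dim, 512, action_dim, n_hidden=2)
        # Reward and return(-to-go) decoder: 3-layer MLP, 1024 hidden units
        # (here assumed to take the state-action pair as input).
        self.reward_return_decoder = mlp(state_dim + action_dim, 1024, 2, n_hidden=3)

    def forward(self, s_t, s_next, a_t):
        action_pred = self.action_decoder(torch.cat([s_t, s_next], dim=-1))
        reward_and_return = self.reward_return_decoder(torch.cat([s_t, a_t], dim=-1))
        return action_pred, reward_and_return
```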
The action decoder is trained with the Adam optimizer, using a learning rate of 2e-4 and a batch size of 32 over 2e6 training steps. The reward and return decoder is likewise trained with the Adam optimizer, using a learning rate of 2e-4 and a batch size of 64 over 1e6 training steps.

Score-based Prior.   Consistent with DD (Ajay et al., 2023), the score-based prior model is characterized by a temporal U-Net9 (Janner et al., 2022) architecture comprising a series of 6 repeated residual blocks. Within each block, two sequential temporal convolutions are implemented, followed by group normalization (Wu & He, 2018) and the Mish activation function (Misra, 2019). Distinct 2-layer MLPs, each with 256 hidden units and the Mish activation function, yield 128-dimensional timestep and condition embeddings, which are concatenated and added to the first temporal convolution's activations within each block. We employ the Adam optimization algorithm with a learning rate of 2e-4, a batch size of 32, and 2e6 training iterations. The probability, denoted by p, of excluding the conditioning information s1 is set to 0.25, and K = 100 diffusion steps are executed.

9https://github.com/jannerm/diffuser.

E.2.2 PLANNING WITH ENERGY-GUIDED SAMPLING

The energy guidance model is formulated as a 4-layer MLP with 256 hidden units and SiLU activation functions (Hendrycks & Gimpel, 2016). It undergoes training for 1e6 gradient-based steps, using the Adam optimizer with a learning rate of 3e-4 and a batch size of 256. For the Gym locomotion control and Adroit tasks, the dimension of the latent action set, M, is set to 16, whereas for the AntMaze tasks it is fixed at 32. In all tasks, β is set to 3. The planning horizon, H, is set to 40 for the Gym locomotion control and Adroit tasks, and 100 for the AntMaze tasks. The guidance scale, w, is selected from {1.2, 1.4, 1.6, 1.8}, with the specific value depending on the task. The low-temperature sampling parameter, α, is set to 0.5, and the context length, C, is set to 20.

F MODELING SELECTION

In this section, we explore how different problem modeling choices alter the Latent Diffuser framework. Specifically, by examining various representations of the latent action, we investigate the performance of Latent Diffuser within both the raw action space and the skill space. Subsequently, by comparing it against existing works, we distill some key observations on how distinct diffusion sampling techniques affect the efficacy of planning.

Figure 4: Results of different modeling selections, where the height of each bar is the mean normalized score on different tasks.

F.1 PLANNING IN THE RAW ACTION SPACE

The Latent Diffuser framework can be equally implemented within the raw action space; in this case, one merely substitutes the latent diffusion model with any diffusion model that estimates raw trajectory distributions, such as those employed in Diffuser (Janner et al., 2022) and Decision Diffuser (Ajay et al., 2023). To be precise, we can deduce the following theorem, akin to Theorem 1, within the raw action space:

Theorem 2 (Optimal policy). Given an initial state $s_1$, the optimal policy satisfies: $\pi^*(\tau \mid s_1) \propto \mu(\tau \mid s_1)\, e^{\beta \sum_{t=1}^{T} Q_\zeta(s_t, a_t)}$, wherein $\tau := (s_1, a_1, \ldots, s_T, a_T)$, $\mu(\tau \mid s_1)$ represents the behavior policy, and $Q_\zeta(\cdot, \cdot)$ refers to the estimated Q-value function.
$\beta \geq 0$ signifies the inverse temperature controlling the strength of the energy guidance. By rewriting $p_0 := \pi^*$, $q_0 := \mu$, and $\tau_0 := \tau$, we can reformulate optimal planning as the following diffusion sampling problem:

$$p_0(\tau_0 \mid s_1) \propto q_0(\tau_0 \mid s_1) \exp\big(-\beta \mathcal{E}(\tau_0, s_1)\big), \quad (9)$$

where $\mathcal{E}(\tau_0, s_1) := -\sum_{t=1}^{T} Q_\zeta(s_t, a_t)$. Similarly, the time-dependent energy model $f_\eta(\tau_k, s_1, k)$ can then be trained by minimizing the following contrastive loss:

$$\min_\eta \; \mathbb{E}_{p(k, s_1)}\, \mathbb{E}_{\prod_{i=1}^{M} q(\tau_0^{(i)} \mid s_1)\, p(\epsilon^{(i)})} \left[ -\sum_{i=1}^{M} \frac{e^{-\beta \mathcal{E}_0(h(\tau_0^{(i)}, s_1))}}{\sum_{j=1}^{M} e^{-\beta \mathcal{E}_0(h(\tau_0^{(j)}, s_1))}} \log \frac{e^{f_\eta(\tau_k^{(i)}, s_1, k)}}{\sum_{j=1}^{M} e^{f_\eta(\tau_k^{(j)}, s_1, k)}} \right], \quad (10)$$

where $k \sim \mathcal{U}(0, K)$, $\tau_k = \alpha_k \tau_0 + \sigma_k \epsilon$, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. To estimate the true distribution $q(\tau_0 \mid s_1)$ in Equation (10), we can utilize the diffusion model adopted in Diffuser or Decision Diffuser to generate $M$ support trajectories $\{\hat\tau_0^{(i)}\}_{i=1}^{M}$ for each initial state $s_1$ by diffusion sampling. The contrastive loss in Equation (10) is then estimated by:

$$\min_\eta \; \mathbb{E}_{k, s_1, \epsilon} \left[ -\sum_{i=1}^{M} \frac{e^{-\beta \mathcal{E}_0(h(\hat\tau_0^{(i)}, s_1))}}{\sum_{j=1}^{M} e^{-\beta \mathcal{E}_0(h(\hat\tau_0^{(j)}, s_1))}} \log \frac{e^{f_\eta(\hat\tau_k^{(i)}, s_1, k)}}{\sum_{j=1}^{M} e^{f_\eta(\hat\tau_k^{(j)}, s_1, k)}} \right], \quad (11)$$

where $\hat\tau_0^{(i)}, \hat\tau_0^{(j)}$ correspond to the support trajectories for each initial state $s_1$. The training procedure is shown in Algorithm 2.

Algorithm 2 Efficient Planning in the Raw Action Space
  Initialize the diffusion model $p_\theta$ and the intermediate energy model $f_\eta$
  for each gradient step do   ▷ Training the diffusion model
    Sample $B_1$ trajectories $\tau$ from the offline dataset $\mathcal{D}$
    Sample $B_1$ Gaussian noises $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ and $B_1$ timesteps $k \sim \mathcal{U}(0, K)$
    Perturb $\tau_0$ according to $\tau_k := \alpha_k \tau_0 + \sigma_k \epsilon$
    Update $\{\theta\}$ with the standard score-matching loss in Appendix B.2
  end for
  for each initial state $s_1$ in the offline dataset $\mathcal{D}$ do   ▷ Generating the support trajectories
    Sample $M$ support trajectories $\{\hat\tau^{(i)}\}_{i=1}^{M}$ from the pretrained diffusion model $p_\theta$
  end for
  for each gradient step do   ▷ Training the intermediate energy model
    Sample $B_2$ initial states $s_1$ from the offline dataset $\mathcal{D}$
    Sample $B_2$ Gaussian noises $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ and $B_2$ timesteps $k \sim \mathcal{U}(0, K)$
    Retrieve the support trajectories $\{\hat\tau_0^{(i)}\}_{i=1}^{M}$ for each $s_1$
    Perturb $\hat\tau_0^{(i)}$ according to $\hat\tau_k^{(i)} := \alpha_k \hat\tau_0^{(i)} + \sigma_k \epsilon$
    Update $\{\eta\}$ based on the contrastive loss in Equation (11)
  end for

As shown in Figure 4, a pronounced performance degradation manifests when planning in the raw action space compared to the latent action space. This phenomenon is even more pronounced in longer-horizon tasks, such as AntMaze.
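As a concrete illustration of the last loop in Algorithm 2, the following is a minimal sketch of the contrastive estimator in Equation (11). The tensor shapes, the `energy_fn` callable (playing the role of $\mathcal{E}_0 \circ h$), and the interface of the energy network `f_eta` are illustrative assumptions rather than the released implementation.

```python
# Sketch of the contrastive loss of Equation (11) over M support trajectories
# per initial state (illustrative assumptions only).
import torch
import torch.nn.functional as F

def contrastive_energy_loss(f_eta, tau0_support, s1, k, alpha, sigma,
                            energy_fn, beta=3.0):
    """tau0_support: [B, M, D] support trajectories; s1: [B, S]; k: [B]."""
    B, M, _ = tau0_support.shape
    eps = torch.randn_like(tau0_support)
    a, s = alpha[k].view(B, 1, 1), sigma[k].view(B, 1, 1)
    tau_k = a * tau0_support + s * eps                       # perturb each candidate
    with torch.no_grad():
        # Soft labels: softmax of -beta * E_0 over the M candidates per state.
        weights = F.softmax(-beta * energy_fn(tau0_support, s1), dim=-1)   # [B, M]
    logits = f_eta(tau_k, s1.unsqueeze(1).expand(B, M, -1), k)             # [B, M]
    return -(weights * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```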
Connection with Diffuser   In Diffuser (Janner et al., 2022a), $\mathcal{E}(\tau_0, s_1)$ is defined as the return of $\tau_0$. Additionally, Diffuser uses a mean-squared-error (MSE) objective to train the energy model $f_\eta(\tau_t, s_1, t)$ and uses its gradient for energy guidance (Lu et al., 2023b). The training objective is:

$$\min_\eta \; \mathbb{E}_{q_{0t}(\tau_0, \tau_t, s_1)} \left[ \big\| f_\eta(\tau_t, s_1, t) - \mathcal{E}(\tau_0, s_1) \big\|_2^2 \right]. \quad (12)$$

Given unlimited model capacity, the optimal $f_\eta$ satisfies:

$$f_\eta^{\text{MSE}}(\tau_t, s_1, t) = \mathbb{E}_{q_{0t}(\tau_0 \mid \tau_t, s_1)}\big[\mathcal{E}(\tau_0, s_1)\big]. \quad (13)$$

However, according to Lu et al. (2023, Section 4.1), the true energy function satisfies

$$\mathcal{E}_t(\tau_t, s_1) = -\log \mathbb{E}_{q_{0t}(\tau_0 \mid \tau_t, s_1)}\big[e^{-\mathcal{E}(\tau_0, s_1)}\big] \neq \mathbb{E}_{q_{0t}(\tau_0 \mid \tau_t, s_1)}\big[\mathcal{E}(\tau_0, s_1)\big] = f_\eta^{\text{MSE}}(\tau_t, s_1, t), \quad (14)$$

and the equality only holds when $t = 0$. Therefore, the MSE energy function $f_\eta^{\text{MSE}}$ is inexact for all $t > 0$. Moreover, Lu et al. (2023) also show that the gradient of $f_\eta^{\text{MSE}}$ is inexact with respect to the true gradient $\nabla_{\tau_t} \mathcal{E}_t(\tau_t, s_1)$. We replace the definition of $\mathcal{E}(z_0, s_1)$ in Latent Diffuser with the return (or cumulative rewards) employed in Diffuser, culminating in the numerical results depicted in Figure 4. It is imperative to note that incorporating the return into Latent Diffuser contravenes Theorem 1. As discernible in the figure, a conspicuous performance degradation ensues from the replacement. While the return bears resemblance to the cumulative state-action value, the latter exhibits lower variance. This advantage becomes increasingly pronounced in longer-horizon tasks.

In addition, we can alter the sampling method employed by Diffuser, thereby rendering it an exact sampling technique. More concretely, we add an exponential activation to the original MSE-based loss in Equation (12), which is named E-MSE in (Lu et al., 2023b):

$$\min_\eta \; \mathbb{E}_{t, z_0, z_t}\left[ \big\| \exp\big(f_\eta(z_t, t)\big) - \exp\big(-\beta \mathcal{E}(z_0)\big) \big\|_2^2 \right].$$

In Latent Diffuser, we substitute the contrastive loss with E-MSE, with results depicted in Figure 4. Although E-MSE belongs to the realm of exact sampling methods, its inherent exponential terms precipitate significant numerical instability during training. Evidently, from Figure 4, the employment of E-MSE culminates in a conspicuous decline in performance, a finding that resonates with the conclusions drawn in Lu et al. (2023b, Appendix H).

F.2 PLANNING IN THE SKILL SPACE

The Latent Diffuser framework can be equally implemented within other variants of the latent action space. Consider the following simplified trajectory $\tau_{\text{sim}}$ of length $T$, sampled from an MDP with a fixed stochastic behavior policy and consisting of a sequence of states and actions:

$$\tau_{\text{sim}} := (s_1, a_1, s_2, a_2, \ldots, s_T). \quad (15)$$

Under this setting, the concept of a latent action aligns perfectly with the definition of a skill as delineated in prevailing works, although we persist in utilizing the Latent Diffuser framework for efficient planning within the skill space.

Figure 5: Skill modeling with the score-based diffusion probabilistic model (components: causal transformer encoder with position embedding, diffusion model, state/action/reward/return decoders with position embedding, and a stop-gradient operation).

More concretely, we can simply substitute the latent diffusion model with a variant of the latent diffusion model, shown in Figure 5, that estimates simplified trajectory distributions. Similarly, we can also deduce a theorem, akin to Theorem 1, within the skill space:

Theorem 3 (Optimal skill-based policy). Given an initial state $s_1$, the optimal skill-based policy satisfies: $\pi^*(\tau_{\text{sim}} \mid s_1) \propto \mu(\tau_{\text{sim}} \mid s_1)\, e^{\beta \sum_{t=1}^{T} Q_\zeta(s_t, a_t)}$, wherein $\mu(\tau_{\text{sim}} \mid s_1)$ represents the behavior policy and $Q_\zeta(\cdot, \cdot)$ refers to the estimated Q-value function. $\beta \geq 0$ signifies the inverse temperature controlling the strength of the energy guidance.

Subsequently, we can employ an algorithm nearly identical to Algorithm 1 for both model training and the sampling of optimal trajectories. To ascertain the efficacy of Latent Diffuser in skill-space planning, we exchange the latent diffusion model depicted in Figure 2 with the one shown in Figure 5. The experimental results can be observed in Figure 4. Evident from the illustration, the lack of encoding of future information (i.e., the reward and return) precipitates a significant decline in skill-space planning performance compared to that within the latent action space. Similarly, this circumstance becomes increasingly pronounced in longer-horizon tasks accompanied by sparse rewards.

Figure 6: Results of ablation studies, where the height of each bar is the mean normalized score on different tasks.
G ABLATION STUDIES

In this section, ablation studies are performed on crucial hyperparameters within the Latent Diffuser, specifically focusing on the latent steps (i.e., the number of timesteps of the original trajectory that each latent action corresponds to), the planning horizon (referring to latent rather than raw actions), the inverse temperature (which controls the intensity of the energy guidance), and the diffusion steps (which affect the quality of the reconstructed trajectories). Numerical experiments were conducted (refer to Figure 6) on the Med-Expert dataset within the Gym locomotion control tasks, yielding the following significant findings:

Latent Steps   The Latent Diffuser's planning occurs in a latent action space featuring temporal abstraction. When a single latent action is sampled, L transitional steps extending the raw trajectory can be decoded. This design enhances planning efficiency as it reduces the number of unrolling steps to 1/L of the original; hence, the search space size is exponentially diminished. Nonetheless, the repercussions of this design on decision-making remain uncertain. Therefore, we evaluated the Latent Diffuser employing varying latent steps L. The red bars in Figure 6 demonstrate that reducing the latent steps L to 1 leads to a substantial performance degradation. We conjecture that this performance decline is attributable to VAE overfitting, as a higher prediction error has been observed with the reduced latent step, similar to TAP (Jiang et al., 2023).

Inverse Temperature   The energy guidance effect is regulated by the inverse temperature; decreasing values yield sampled trajectories more aligned with the behavior policy, while elevated values amplify the influence of the energy guidance. As displayed by the pink bars in Figure 6, the Latent Diffuser's performance noticeably deteriorates when the inverse temperature is comparatively low. Conversely, only a slight decline occurs as the value increases substantially. We propose two plausible explanations: firstly, overwhelming energy guidance may generate discrepancies between the trajectory distribution guided by the energy and the one induced by the behavior policy, negatively impacting generation quality; secondly, the energy guidance originates from an estimated intermediate energy model, which is inherently prone to overfitting throughout training, leading to inaccuracies in the estimated energy and ultimately degrading the sampling quality.

Planning Horizon and Diffusion Steps   As demonstrated by the blue and yellow bars in Figure 6, the Latent Diffuser exhibits low sensitivity to variations in the planning horizon and diffusion steps. Moreover, the conclusion regarding the planning horizon may be task-specific, since dense-reward locomotion control may necessitate shorter-horizon reasoning than more intricate decision-making problems. The ablations of MuZero (Hamrick et al., 2021) and TAP (Jiang et al., 2023) further reveal that real-time planning is not as beneficial in more reactive tasks, such as Atari and locomotion control.

Transformers   In the design of Latent Diffuser, we follow the settings of most existing generative methods (such as DT, TT, DD, and TAP) for the encoder part, using GPT-2-style causal transformers for parameterization. Beyond this, another reason lies in the modeling of latent actions. In the default setting, a latent action encodes multiple timesteps of states, actions, rewards, and reward-to-go.
We believe causal transformers make the learned latent action representations more predictive, for example, predicting actions based on the state, and predicting rewards and reward-to-go based on the state and action. This predictive ability, similar to that of model-based methods, makes the learned latent action representations more conducive to high-quality planning. To verify this point, we conduct comparative experiments using non-causal transformers. Specifically, we remove the causal mask from the transformer. This means that, during encoding and decoding, we allow the model to use the information of the entire sub-trajectory to reconstruct any element within that sub-trajectory, for example, reconstructing a state using information from other states. As shown in Figure 6, the experimental results reveal a significant performance degradation for Latent Diffuser. Furthermore, we find that using a non-causal transformer yields performance close to that obtained when the latent step equals 1. These results are consistent with our previous analysis: with a non-causal transformer, the model is also prone to overfitting, causing the learned latent action representations to carry less information and lose a certain degree of predictability.

Figure 7: The average runtime spent on a single decision for the generative-model-based baselines.

Runtime   To eliminate the influence of different algorithm implementation logic on runtime and focus on the model itself, we follow the settings of previous work and record, over 50 repetitions, the average time each baseline takes to produce the final action from the current input state. In the interest of fairness, we only compare the runtime of the generative methods. The final results are shown in Figure 7. As can be seen from the figure, the runtime of Latent Diffuser is at the average level, and the time required to make one decision is about 0.5 seconds, similar to DD. Although sampling efficiency has always been a weakness of diffusion models, we adopted the warm-start technique proposed by Diffuser, which significantly shortens the sampling time without affecting performance. D-QL has the longest runtime, requiring multiple samples (50) to select the best result. HDMI significantly increases runtime because it is a two-level method requiring two diffusion sampling passes to make one decision. The TT method has a longer runtime due to its tokenized data processing, which requires long autoregressive sequence generation before producing an action. TAP and D-QL@1 have the shortest runtimes: the former uses beam search for planning, which can be completed quickly with a predetermined budget, though planning effectiveness is also constrained by that budget; the latter only needs to generate a one-timestep action rather than a sequence, so its runtime is also short. However, its final performance is significantly lower due to the lack of a planning step.

H LATENT ACTION VISUALIZATION

In order to gain a more intuitive understanding of the latent action space learned by Latent Diffuser, this section presents a visualization of the latent actions and the corresponding trajectories obtained by decoding them. Specifically, we use the fully trained Latent Diffuser on the Hopper task to sample 5 trajectories and apply the t-SNE method to reduce the dimensionality of the latent actions associated with these 5 trajectories for visualization, as shown in Figure 8. For the visualization of the trajectories, a random trajectory is selected.
To facilitate presentation (due to the similarity of the robot's shape at adjacent timesteps), we downsample this trajectory by taking one latent action out of every 5 and overlay the trajectory images obtained by decoding 3 adjacent latent actions with different opacities, yielding the trajectory displayed in Figure 8.

Figure 8: Visualization of latent actions and decoded trajectories (trajectory panels shown at 140, 200, 180, 260, 450, and 950 timesteps).

It can be seen from the figure that the latent action space learned by Latent Diffuser is a more compact action space, which to some extent captures a certain type of macro-action or skill.

I MISSING DERIVATIONS

I.1 PROOF OF THEOREM 1

Proof. Previous works (Peters et al., 2010; Peng et al., 2019) formulate offline RL as constrained policy optimization:

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}_\mu}\left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} A_\zeta(s, a) - \frac{1}{\beta} D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s)\big) \right],$$

where $A_\zeta$ is the action evaluation model, which indicates the quality of the decision $(s, a)$ by estimating the advantage function $A^\pi(s, a) := Q^\pi(s, a) - V^\pi(s)$ of the current policy $\pi$, and $\beta$ is an inverse temperature coefficient. The first term performs policy optimization, while the second term enforces the policy constraint. It is shown that the optimal policy $\pi^*$ satisfies (Peng et al., 2019):

$$\pi^*(a \mid s) \propto \mu(a \mid s)\, e^{\beta A_\zeta(s, a)}.$$

Since $V^\pi(s)$ does not depend on the action $a$, the above formula can be further simplified to:

$$\pi^*(a \mid s) \propto \mu(a \mid s)\, e^{\beta A_\zeta(s, a)} \propto \mu(a \mid s)\, e^{\beta Q_\zeta(s, a)},$$

where $Q_\zeta$ is the action evaluation model, which indicates the quality of the decision $(s, a)$ by estimating the Q-value function $Q^\pi(s, a)$ of the current policy $\pi$. To simplify notation, we reuse $\zeta$ to parameterize the estimated Q-value function. Furthermore, we can extend the above conclusion from the raw action space and the single-step level to the latent action space and the trajectory level:

$$\begin{aligned}
\pi^*(z \mid s_1) &= p(s_1) \prod_{t=0}^{T-1} \pi^*(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\, R(r_t \mid s_t, a_t, s_{t+1})\, G(G_t \mid s_t, a_t) \\
&\propto p(s_1) \prod_{t=0}^{T-1} \mu(a_t \mid s_t)\, \exp\big(\beta Q_\zeta(s_t, a_t)\big)\, p(s_{t+1} \mid s_t, a_t)\, R(r_t \mid s_t, a_t, s_{t+1})\, G(G_t \mid s_t, a_t) \\
&= p(s_1) \prod_{t=0}^{T-1} \mu(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\, R(r_t \mid s_t, a_t, s_{t+1})\, G(G_t \mid s_t, a_t) \prod_{t=0}^{T-1} \exp\big(\beta Q_\zeta(s_t, a_t)\big) \\
&= \mu(z \mid s_1) \exp\Big(\beta \sum_{t=0}^{T-1} Q_\zeta(s_t, a_t)\Big) = \mu(z \mid s_1)\, e^{\beta \sum_{t=0}^{T-1} Q_\zeta(s_t, a_t)}.
\end{aligned}$$

I.2 PROOF OF THEOREM 2 AND THEOREM 3

The proof procedure for Theorem 2 and Theorem 3 is similar to that of Theorem 1.

REFERENCES FOR SUPPLEMENTARY MATERIAL

Suzan Ece Ada, Erhan Oztop, and Emre Ugur. Diffusion policies for out-of-distribution generalization in offline reinforcement learning. arXiv preprint arXiv:2307.04726, 2023.

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B Tenenbaum, Tommi S Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? In ICLR, 2023.

Arthur Allshire, Roberto Martín-Martín, Charles Lin, Shawn Manuel, Silvio Savarese, and Animesh Garg. Laser: Learning a latent action space for efficient reinforcement learning. In ICRA, 2021.

Fan Bao, Min Zhao, Zhongkai Hao, Peiyao Li, Chongxuan Li, and Jun Zhu. Equivariant energy-guided SDE for inverse molecular design. In ICLR, 2023.

Prajjwal Bhargava, Rohan Chitnis, Alborz Geramifard, Shagun Sodhani, and Amy Zhang. Sequence modeling is a robust contender for offline reinforcement learning. arXiv preprint arXiv:2305.14550, 2023.

Adam Block, Daniel Pfrommer, and Max Simchowitz. Imitating complex trajectories: Bridging low-level stability and high-level behavior. arXiv preprint arXiv:2307.14619, 2023.
Hongyi Chen, Yilun Du, Yiye Chen, Joshua B. Tenenbaum, and Patricio A. Vela. Planning with sequence models through iterative energy minimization. In ICLR, 2023a.

Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling. In ICLR, 2023b.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In NeurIPS, 2021.

Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Jianhao Wang, Alex Yuan Gao, Wenzhe Li, Liang Bin, Chelsea Finn, and Chongjie Zhang. Lapo: Latent-variable advantage-weighted policy optimization for offline reinforcement learning. In NeurIPS, 2022.

Rohan Chitnis, Yingchen Xu, Bobak Hashemi, Lucas Lehnert, Urun Dogan, Zheqing Zhu, and Olivier Delalleau. Iql-td-mpc: Implicit q-learning for hierarchical model predictive control. arXiv preprint arXiv:2306.00867, 2023.

Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In ICLR, 2023.

Robert Dadashi, Léonard Hussenot, Damien Vincent, Sertan Girgin, Anton Raichuk, Matthieu Geist, and Olivier Pietquin. Continuous control with action quantization from demonstrations. In ICML, 2022.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005.

Benjamin Eysenbach, Alexander Khazatsky, Sergey Levine, and Russ R Salakhutdinov. Mismatched no more: Joint model-policy optimization for model-based rl. In NeurIPS, 2022.

Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In NeurIPS, 2021.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In ICML, 2019.

Wonjoon Goo and Scott Niekum. Know your boundaries: The necessity of explicit behavioral cloning in offline rl. arXiv preprint arXiv:2206.00695, 2022.

Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. In NeurIPS, 2022.

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019.

Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In ICLR, 2021.

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

Jessica B Hamrick, Abram L. Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Holger Buesing, Petar Veličković, and Theophane Weber. On the role of planning in model-based deep reinforcement learning. In ICLR, 2021.

Haoran He, Chenjia Bai, Kang Xu, Zhuoran Yang, Weinan Zhang, Dong Wang, Bin Zhao, and Xuelong Li. Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning. arXiv preprint arXiv:2305.18459, 2023.

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Jifeng Hu, Yanchao Sun, Sili Huang, Si Yuan Guo, Hechang Chen, Li Shen, Lichao Sun, Yi Chang, and Dacheng Tao.
Instructed diffuser with temporal condition guidance for offline reinforcement learning. arXiv preprint arXiv:2306.04875, 2023.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In NeurIPS, 2019.

Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In NeurIPS, 2021.

Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In ICML, 2022a.

Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In ICML, 2022b.

Zhiwei Jia, Fangchen Liu, Vineet Thumuluri, Linghao Chen, Zhiao Huang, and Hao Su. Chain-of-thought predictive control. arXiv preprint arXiv:2304.00776, 2023.

Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. In ICLR, 2023.

Siddharth Karamcheti, Albert J Zhai, Dylan P Losey, and Dorsa Sadigh. Learning visually guided latent actions for assistive teleoperation. In L4DC, 2021.

Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In NeurIPS, 2022.

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. In NeurIPS, 2020a.

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. In NeurIPS, 2020b.

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In ICLR, 2022.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In NeurIPS, 2020.

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Wenhao Li, Xiangfeng Wang, Bo Jin, and Hongyuan Zha. Hierarchical diffusion for offline decision making. In ICML, 2023.

Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, and Ping Luo. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. In ICML, 2023.

Dylan P Losey, Hong Jun Jeon, Mengxi Li, Krishnan Srinivasan, Ajay Mandlekar, Animesh Garg, Jeannette Bohg, and Dorsa Sadigh. Learning latent actions to control assistive robots. Autonomous Robots, 46(1):115–147, 2022.

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In NeurIPS, 2022a.

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b.

Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In ICML, 2023a.

Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In ICML, 2023b.
Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, and Stephen J Roberts. Revisiting design choices in offline model-based reinforcement learning. In ICLR, 2022c.

Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-space planning. In ICLR, 2021.

J Merel, L Hasenclever, A Galashov, A Ahuja, V Pham, G Wayne, Y Teh, and N Heess. Neural probabilistic motor primitives for humanoid control. In ICLR, 2019.

Diganta Misra. Mish: A self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681, 2019.

Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

Fei Ni, Jianye Hao, Yao Mu, Yifu Yuan, Yan Zheng, Bin Wang, and Zhixuan Liang. Metadiffuser: Diffusion model as conditional planner for offline meta-rl. In ICML, 2023.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.

Sherjil Ozair, Yazhe Li, Ali Razavi, Ioannis Antonoglou, Aaron Van Den Oord, and Oriol Vinyals. Vector quantized models for planning. In ICML, 2021.

Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, and Sam Devlin. Imitating human behaviour with diffusion models. In ICLR, 2023.

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions on Graphics (TOG), 41(4):1–17, 2022.

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, 2010.

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019.

Juergen Schmidhuber. Reinforcement learning upside down: Don't predict rewards -- just map them to actions. arXiv preprint arXiv:1912.02875, 2019.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023.

HJ Suh, Glen Chou, Hongkai Dai, Lujie Yang, Abhishek Gupta, and Russ Tedrake. Fighting uncertainty with gradients: Offline reinforcement learning via diffusion score matching. arXiv preprint arXiv:2306.14079, 2023.

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

Siddarth Venkatraman. Latent skill models for offline reinforcement learning. Master's thesis, Carnegie Mellon University, Pittsburgh, PA, 2023.
Jianhao Wang, Wenzhe Li, Haozhe Jiang, Guangxiang Zhu, Siyuan Li, and Chongjie Zhang. Offline reinforcement learning with reverse model-based imagination. In NeurIPS, 2021.

Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning search space partition for black-box optimization using monte carlo tree search. In NeurIPS, 2020a.

Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. In ICLR, 2023.

Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. In NeurIPS, 2020b.

Yueh-Hua Wu, Xiaolong Wang, and Masashi Hamaya. Elastic decision transformer. arXiv preprint arXiv:2307.02484, 2023.

Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.

Kevin Yang, Tianjun Zhang, Chris Cummins, Brandon Cui, Benoit Steiner, Linnan Wang, Joseph E Gonzalez, Dan Klein, and Yuandong Tian. Learning space partitions for path planning. In NeurIPS, 2021.

Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. In NeurIPS, 2022.

Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. In ICML, 2022.

Wenxuan Zhou, Sujay Bajracharya, and David Held. Plas: Latent action space for offline reinforcement learning. In CoRL, 2021.

Zhaoyi Zhou, Chuning Zhu, Runlong Zhou, Qiwen Cui, Abhishek Gupta, and Simon Shaolei Du. Free from bellman completeness: Trajectory stitching via model-based return-conditioned supervised learning. arXiv preprint arXiv:2310.19308, 2023.