# Guide Your Agent with Adaptive Multimodal Rewards

Changyeon Kim¹, Younggyo Seo², Hao Liu³, Lisa Lee⁴, Jinwoo Shin¹, Honglak Lee⁵,⁶, Kimin Lee¹

¹KAIST ²Dyson Robot Learning Lab ³UC Berkeley ⁴Google DeepMind ⁵University of Michigan ⁶LG AI Research

Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. This work presents Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to calculate the similarity between visual observations and natural language instructions in the pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy using expert demonstrations labeled with multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization. This results in superior generalization performance even when faced with unseen text instructions, compared to existing text-conditioned policies. To improve the quality of the rewards, we also introduce a fine-tuning method for the pre-trained multimodal encoders, further enhancing performance. Video demonstrations and source code are available on the project website: https://sites.google.com/view/2023arp.

## 1 Introduction

Imitation learning (IL) has achieved promising results in learning behaviors directly from expert demonstrations, reducing the necessity for costly and potentially dangerous interactions with environments [32, 59]. These approaches have recently been applied to learn control policies directly from pixel observations [7, 36, 57]. However, IL policies frequently struggle to generalize to new environments, often resulting in a lack of meaningful behavior [14, 58, 68, 79] due to overfitting to various aspects of the training data.

Several approaches have been proposed to train IL agents capable of adapting to unseen environments and tasks. These approaches include conditioning on a single expert demonstration [18, 20], utilizing a video of a human demonstration [6, 76], and incorporating a goal image [16, 23]. However, such prior methods assume that information about target behaviors in test environments is available to the agent, which is impractical in many real-world problems.

One alternative approach for improving generalization performance is to guide the agent with natural language, i.e., training agents conditioned on language instructions [7, 50, 72]. Recent studies have indeed demonstrated that text-conditioned policies incorporating large pre-trained multimodal models [22, 56] exhibit strong generalization abilities [46, 66]. However, simply relying on text representations may fail to provide helpful information to agents in challenging scenarios. For example, consider a text-conditioned policy (see Figure 1a) trained to collect a coin, which is positioned at the end of the map, following the text instruction "collect a coin". When we deploy the learned agent to test environments where the coin's location is randomized, it often fails to collect the coin. This is because, when relying solely on expert demonstrations, the agent might mistakenly think that the goal is to navigate to the end of the level (see supporting results in Section 4.1).
This example shows that a simple text-conditioned policy fails to fully exploit the provided text instruction and suffers from goal misgeneralization (i.e., pursuing undesired goals even when trained with a correct specification) [15, 64].

Figure 1: (a) Comparison of ARP with the text-conditioned baseline. ARP utilizes the similarity between visual observations and text instructions in the pre-trained multimodal representation space as a reward and then trains a return-conditioned policy using demonstrations with multimodal reward labels. (b) Multimodal reward curves of the fine-tuned CLIP [56] on a trajectory from the CoinRun environment (train env: coin is at the far right; test env: coin is at the middle). The multimodal reward consistently increases as the agent approaches the goal, and this trend remains consistent across the training and test environments, suggesting the potential to guide agents toward target objects in test environments.

In this paper, we introduce Adaptive Return-conditioned Policy (ARP), a novel IL method designed to enhance generalization capabilities. Our main idea is to measure the similarity between visual observations and natural language task descriptions in the pre-trained multimodal embedding space (such as CLIP [56]) and use it as a reward signal. Subsequently, we train a return-conditioned policy using demonstrations annotated with these multimodal reward labels. Unlike prior IL work that relies on static text representations [46, 51, 67], our trained policies make decisions based on multimodal reward signals computed at each timestep (see the bottom part of Figure 1a).

We find that our multimodal reward can provide a consistent signal to the agent in both training and test environments (see Figure 1b). This consistency helps prevent agents from pursuing unintended goals (i.e., it mitigates goal misgeneralization) and thus improves generalization performance when compared to text-conditioned policies. Furthermore, we introduce a fine-tuning scheme that adapts pre-trained multimodal encoders using in-domain data (i.e., expert demonstrations) to enhance the quality of the reward signal. We demonstrate that when using rewards derived from fine-tuned encoders, the agent exhibits superior generalization performance in test environments compared to the agent with frozen encoders. Notably, we also observe that ARP effectively guides agents in test environments with unseen text instructions associated with new objects of unseen colors and shapes (see supporting results in Section 4.3).

In summary, our key contributions are as follows:

- We propose Adaptive Return-conditioned Policy (ARP), a novel IL framework that trains a return-conditioned policy using adaptive multimodal rewards from pre-trained encoders.
- We introduce a fine-tuning scheme that adapts pre-trained CLIP models using in-domain expert demonstrations to improve the quality of multimodal rewards.
- We show that our framework effectively mitigates goal misgeneralization, resulting in better generalization when compared to text-conditioned baselines.
- We further show that ARP can execute unseen text instructions associated with new objects of unseen colors and shapes.
- We demonstrate that our method exhibits generalization performance comparable to baselines that consume goal images from test environments, even though our method relies solely on natural language instructions.

Source code and expert demonstrations used for our experiments are available at https://github.com/csmile-1006/ARP.git.

## 2 Related Work

**Generalization in imitation learning.** Addressing the challenge of generalization in imitation learning is crucial for deploying trained agents in real-world scenarios. Previous approaches have shown improvements in generalization to test environments by conditioning agents on a robot demonstration [18, 20], a video of a human performing the desired task [76, 6], or a goal image [16, 44, 23]. However, these approaches have a disadvantage: they can be impractical to adopt in real-world scenarios where information about target behaviors in test environments is not guaranteed. In this work, we propose an efficient yet effective method for achieving generalization even in the absence of specific information about test environments. We accomplish this by leveraging a multimodal reward computed from current visual observations and task instructions in the pre-trained multimodal embedding space.

**Pre-trained representations for reinforcement learning and imitation learning.** Recently, there has been growing interest in leveraging pre-trained representations for robot learning algorithms that benefit from large-scale data [54, 73, 61, 52]. In particular, language-conditioned agents have seen significant advancements by leveraging pre-trained vision-language models [46, 66, 78, 38], drawing inspiration from the effectiveness of multimodal representation learning techniques like CLIP [56]. For example, InstructRL [46] utilizes a pre-trained multimodal encoder [22] to encode the alignment between multiple camera observations and text instructions and trains a transformer-based behavior cloning policy using the encoded representations. In contrast, our work utilizes the similarity between visual observations and text instructions in the pre-trained multimodal embedding space in the form of a reward to adaptively guide agents in test environments. We discuss related work in more detail in Appendix B.

## 3 Method

In this section, we introduce Adaptive Return-conditioned Policy (ARP), a novel IL framework for enhancing generalization ability using multimodal rewards from pre-trained encoders. We first describe the problem setup in Section 3.1. Section 3.2 introduces how we define the multimodal reward in the pre-trained CLIP embedding space and use it for training return-conditioned policies. Additionally, in Section 3.3 we propose a new fine-tuning scheme that adapts pre-trained multimodal encoders with in-domain data to enhance the quality of the rewards.

### 3.1 Preliminaries

We consider the visual imitation learning (IL) framework, where an agent learns to solve a target task from expert demonstrations containing visual observations.
We assume access to a dataset $D = \{\tau_i\}_{i=1}^{N}$ consisting of $N$ expert trajectories $\tau = (o_0, a_0, \dots, o_T, a_T)$, where $o$ denotes the visual observation, $a$ the action, and $T$ the maximum timestep. These expert demonstrations are utilized to train the policy via behavior cloning. As a single visual observation is not sufficient to fully describe the underlying state of the task, we approximate the current state by stacking consecutive past observations, following common practice [53, 74]. We also assume that a text instruction $x \in X$ describing how to achieve the goal of the task is given in addition to the expert demonstrations.

The standard approach to utilizing this text instruction is to train a text-conditioned policy $\pi(a_t \mid o_{\le t}, x)$. It has been observed that utilizing pre-trained multimodal encoders (like CLIP [56] and M3AE [22]) is very effective in modeling this text-conditioned policy [46, 49, 66, 67]. However, as shown in the upper part of Figure 1a, these approaches provide the same text representation regardless of changes in visual observations. Consequently, they would not provide the agent with adaptive signals when encountering previously unseen environments. To address this limitation, we propose an alternative framework that leverages $x$ to compute a similarity with the current visual observation within the pre-trained multimodal embedding space. We then employ this similarity as a reward signal. This approach allows the reward value to be adjusted as the visual observation changes, providing the agent with an adaptive signal (see Figure 1b).

### 3.2 Adaptive Return-Conditioned Policy

**Multimodal reward.** To provide the agent with more detailed task information that adapts over timesteps, we propose to use the visual-text alignment score from pre-trained multimodal encoders. Specifically, we compute the alignment score between the visual observation at the current timestep $t$ and the text instruction $x$ in the pre-trained multimodal embedding space as follows:

$$r_{\phi,\psi}(o_t, x) = s\big(f^{\text{vis}}_{\phi}(o_t), f^{\text{txt}}_{\psi}(x)\big). \quad (1)$$

Here, $s$ represents a similarity metric in the representation space of the pre-trained encoders: a visual encoder $f^{\text{vis}}_{\phi}$ parameterized by $\phi$ and a text encoder $f^{\text{txt}}_{\psi}$ parameterized by $\psi$. While our method is compatible with any multimodal encoders and metric, we adopt the cosine similarity between CLIP [56] text and visual embeddings in this work. We label each expert state-action trajectory as $\tau' = (R_0, o_0, a_0, \dots, R_T, o_T, a_T)$, where $R_t = \sum_{i=t}^{T} r_{\phi,\psi}(o_i, x)$ denotes the multimodal return for the rest of the trajectory at timestep $t$ (we assume a discount factor $\gamma = 1$ in our experiments; our method can also be applied with a discount factor less than 1). The set of return-labeled demonstrations is denoted as $D' = \{\tau'_i\}_{i=1}^{N}$.

**Return-conditioned policy.** Using the return-labeled demonstrations $D'$, we train a return-conditioned policy $\pi_\theta(a_t \mid o_{\le t}, R_t)$ parameterized by $\theta$ by minimizing the following objective:

$$\mathcal{L}_{\pi}(\theta) = \mathbb{E}_{\tau' \sim D'}\Big[\sum_{t \le T} l\big(\pi_\theta(a_t \mid o_{\le t}, R_t), a_t\big)\Big].$$

Here, $l$ represents the loss function, which is the cross-entropy loss when the action space is discrete and the mean squared error when it is continuous.

The main advantage of our method lies in its adaptability at deployment, as it adjusts to multimodal rewards computed in test environments (see Figure 1a). At test time, our trained policy predicts the action $a_t$ based on the target multimodal return $R_t$ and the observation $o_t$. Since the target return $R_t$ is recursively updated based on the multimodal reward $r_t$, it provides a timestep-wise signal to the agent, enabling it to adapt its behavior accordingly.
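
To make the pipeline concrete, below is a minimal Python sketch (not the authors' implementation) of the steps described above: computing the multimodal reward of Eq. (1) with the open-source CLIP package, labeling a trajectory with multimodal returns, and training and rolling out a return-conditioned policy. The `policy(obs, returns)` interface and the return-to-go style update that subtracts the observed reward from the target return are illustrative assumptions.

```python
# Minimal sketch under the assumptions stated above; not the authors' code.
import clip  # https://github.com/openai/CLIP
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # frozen (or later fine-tuned) CLIP


@torch.no_grad()
def multimodal_reward(obs_images, instruction):
    """Eq. (1): cosine similarity between CLIP visual and text embeddings.

    obs_images: list of PIL images (one per timestep); instruction: str.
    Returns a (T,) tensor with one reward per timestep.
    """
    images = torch.stack([preprocess(o) for o in obs_images]).to(device)
    img_emb = F.normalize(model.encode_image(images).float(), dim=-1)
    txt_emb = F.normalize(
        model.encode_text(clip.tokenize([instruction]).to(device)).float(), dim=-1
    )
    return img_emb @ txt_emb.squeeze(0)  # r_t = cos(f_vis(o_t), f_txt(x))


def label_returns(rewards):
    """Multimodal return-to-go R_t = sum_{i=t}^{T} r_i (discount factor 1)."""
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])


def bc_loss(policy, obs, returns, actions):
    """Return-conditioned behavior cloning loss (cross-entropy for discrete actions)."""
    logits = policy(obs, returns)  # pi_theta(a_t | o_<=t, R_t); hypothetical interface
    return F.cross_entropy(logits, actions)


def rollout_step(policy, obs, obs_image, target_return, instruction):
    """One test-time step: act, then decrease the target return by the
    multimodal reward observed at this step (return-to-go style update)."""
    action = policy(obs, target_return).argmax(dim=-1)
    r = multimodal_reward([obs_image], instruction)[0]
    return action, target_return - r
```
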
We find that this signal effectively prevents the agent from pursuing undesired goals (see Section 4.1 and Section 4.2), and it also enhances generalization performance in environments with unseen text instructions associated with objects having previously unseen configurations (as discussed in Section 4.3). In our experiments, we implement two different types of ARP: one using the Decision Transformer (DT) [8], referred to as ARP-DT, and one using the Recurrent State-Space Model (RSSM) [25], referred to as ARP-RSSM. Further details of the proposed architectures are provided in Appendix A.

### 3.3 Fine-Tuning Pre-trained Multimodal Encoders

Despite the effectiveness of our method with pre-trained CLIP multimodal representations, there may be a domain gap between the images used for pre-training and the visual observations available from the environment. This domain gap can sometimes lead to unreliable, misleading reward signals. To address this issue, we propose a fine-tuning scheme for the pre-trained multimodal encoders $(f^{\text{vis}}_{\phi}, f^{\text{txt}}_{\psi})$ using the in-domain dataset (expert demonstrations) $D$ in order to improve the quality of multimodal rewards. Specifically, we propose fine-tuning objectives based on the following two desiderata: the reward should (i) remain consistent across similar timesteps and (ii) be robust to visual distractions.

Figure 2: Environments from the OpenAI Procgen benchmarks [10] used in our experiments: (a) CoinRun, (b) Maze I, and (c) Maze II. We train our agents using expert demonstrations collected in environments with multiple visual variations. We then perform evaluations on environments from unseen levels with target objects in unseen locations. See Section 4.1 for more details.

**Temporal smoothness.** To encourage consistency of the multimodal reward over timesteps, we adopt the objective of value implicit pre-training (VIP) [52], which aims to learn smooth reward functions from action-free videos. The main idea of VIP is to (i) capture long-range dependencies by attracting the representations of the first and goal frames and (ii) inject local smoothness by encouraging the distance between intermediate frames to represent progress toward the goal. We extend this idea to our multimodal setup by replacing the goal frame with the text instruction $x$ describing the task objective and using our multimodal reward $r_{\phi,\psi}$ as below:

$$\mathcal{L}_{\text{VIP}}(\phi, \psi) = \underbrace{(1 - \gamma)\, \mathbb{E}_{o_1 \sim O_1}\big[-r_{\phi,\psi}(o_1, x)\big]}_{\text{long-range dependency loss}} + \underbrace{\log \mathbb{E}_{(o_t, o_{t+1}) \sim D}\big[\exp\big(r_{\phi,\psi}(o_t, x) + 1 - \gamma\, r_{\phi,\psi}(o_{t+1}, x)\big)\big]}_{\text{local smoothness loss}}.$$

Here, $O_1$ denotes the set of initial visual observations in $D$. One can see that the local smoothness loss is a one-step temporal difference loss, which recursively trains $r_{\phi,\psi}(o_t, x)$ to regress $-1 + \gamma\, r_{\phi,\psi}(o_{t+1}, x)$. This then induces the reward to represent the remaining steps to achieve the text-specified goal $x$ [31], making rewards from consecutive observations smooth.
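
Below is a compact sketch of this temporal-smoothness term (not the authors' code), written against generic PyTorch CLIP encoder modules `f_vis` and `f_txt` that are being fine-tuned; the batch conventions and the choice of the VIP discount `gamma` are illustrative assumptions. The β-weighted inverse dynamics term introduced in the next paragraph would be added on top of this loss.

```python
# Minimal sketch of the VIP-style temporal smoothness objective above; the
# encoder modules and batching are assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F


def multimodal_reward(f_vis, f_txt, obs, txt_tokens):
    """r_{phi,psi}(o, x): cosine similarity in the shared CLIP embedding space."""
    v = F.normalize(f_vis(obs), dim=-1)          # (B, d)
    t = F.normalize(f_txt(txt_tokens), dim=-1)   # (1, d), broadcast over the batch
    return (v * t).sum(dim=-1)                   # (B,)


def vip_loss(f_vis, f_txt, o_first, o_t, o_tp1, txt_tokens, gamma=0.98):
    """L_VIP = (1 - gamma) * E[-r(o_1, x)]
             + log E[exp(r(o_t, x) + 1 - gamma * r(o_{t+1}, x))]."""
    long_range = -(1.0 - gamma) * multimodal_reward(f_vis, f_txt, o_first, txt_tokens).mean()
    td_gap = (
        multimodal_reward(f_vis, f_txt, o_t, txt_tokens)
        + 1.0
        - gamma * multimodal_reward(f_vis, f_txt, o_tp1, txt_tokens)
    )
    # log-mean-exp implements the log E[exp(.)] of the local smoothness term
    local_smooth = torch.logsumexp(td_gap, dim=0) - torch.log(
        torch.tensor(float(td_gap.numel()), device=td_gap.device)
    )
    return long_range + local_smooth
```
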
**Robustness to visual distractions.** To further encourage our multimodal reward to be robust to visual distractions that should not affect the agent (e.g., changing textures or backgrounds), we introduce an inverse dynamics model (IDM) objective [55, 33, 41]:

$$\mathcal{L}_{\text{IDM}}(\phi, \psi) = \mathbb{E}_{(o_t, o_{t+1}, a_t) \sim D}\Big[l\big(g\big(f^{\text{vis}}_{\phi}(o_t), f^{\text{vis}}_{\phi}(o_{t+1}), f^{\text{txt}}_{\psi}(x)\big), a_t\big)\Big], \quad (4)$$

where $g(\cdot)$ denotes the prediction layer that outputs $\hat{a}_t$, the predicted estimate of $a_t$, and $l$ represents the loss function, which is the cross-entropy loss when the action space is discrete and the mean squared error when it is continuous. By learning to predict the actions taken by the agent from observations at consecutive timesteps, the fine-tuned encoders learn to ignore aspects of the observations that should not affect the agent.

**Fine-tuning objective.** We combine the VIP loss and the IDM loss as the training objective for fine-tuning the pre-trained multimodal encoders in our model:

$$\mathcal{L}_{\text{FT}}(\phi, \psi) = \mathcal{L}_{\text{VIP}}(\phi, \psi) + \beta\, \mathcal{L}_{\text{IDM}}(\phi, \psi),$$

where $\beta$ is a scale hyperparameter. We find that both objectives synergistically contribute to improving the performance (see Table 7 for supporting experiments).

## 4 Experiments

We design our experiments to investigate the following questions:

1. Can our method prevent agents from pursuing undesired goals in test environments? (see Section 4.1 and Section 4.2)
2. Can ARP follow unseen text instructions? (see Section 4.3)
3. Is ARP comparable to goal image-conditioned policies? (see Section 4.4)
4. Can ARP induce well-aligned representations in test environments? (see Section 4.5)
5. What is the effect of each component in our framework? (see Section 4.6)

### 4.1 Procgen Experiments

**Environments.** We evaluate our method on three different environments proposed in Di Langosco et al. [15], which are variants derived from the OpenAI Procgen benchmarks [10]. We assess the generalization ability of trained agents when faced with test environments that cannot be solved without following the true task success conditions.

Figure 3: Expert-normalized scores on training/test environments. The results show the mean and standard deviation averaged over three runs. ARP-DT denotes the model that uses pre-trained CLIP representations, and ARP-DT+ denotes the model that uses fine-tuned CLIP representations (see Section 3.3) for computing the multimodal reward.

**CoinRun:** The training dataset consists of expert demonstrations where the agent collects a coin that is consistently positioned on the far right of the map, and the text instruction is "The goal is to collect the coin." Note that the agent may mistakenly interpret the goal as proceeding to the end of the level, as this also leads to reaching the coin when relying solely on the expert demonstrations. We evaluate the agent in environments where the coin's location is randomized (see Figure 2a) to verify that the trained agent truly follows the intended objective.

**Maze I:** The training dataset consists of expert demonstrations where the agent reaches a yellow cheese that is always located at the top right corner, and the text instruction is "Navigate a maze to collect the yellow cheese." The agent may misinterpret the goal as proceeding to the far right corner, as this also results in reaching the yellow cheese when relying only on expert demonstrations.
To verify that the trained agent follows the intended objective, we assess the trained agents in a test environment where the cheese is placed at a random position (see Figure 2b).

**Maze II:** The training dataset consists of expert demonstrations where the agent approaches a yellow diagonal line located at a random position, and the text instruction is "Navigate a maze to collect the line." The agent might misinterpret the goal as reaching an object with a yellow color, because this also leads to collecting the object with a line shape when relying only on expert demonstrations. For evaluation, we consider a modified environment with two objects: a yellow gem and a red diagonal line. The goal of the agent is to reach the diagonal line, regardless of its color, to verify that the agent truly follows the intended objective (see Figure 2c).

**Implementation.** For all experiments, we utilize the open-sourced pre-trained CLIP model (https://github.com/openai/CLIP) with the ViT-B/16 architecture to generate multimodal rewards. Our return-conditioned policy is implemented based on the official implementation of InstructRL [46], and implementation details are the same unless otherwise specified. To collect the expert demonstrations used as training data, we first train PPG [11] agents for 200M timesteps per task on 500 training levels that exhibit ample visual variations. We then gather 500 rollouts for CoinRun and 1000 rollouts for Maze in the training environments. All models are trained for 50 epochs on two GPUs with a batch size of 64 and a context length of 4. Our code and datasets are available at https://github.com/csmile-1006/ARP.git. Further training details, including hyperparameter settings, can be found in Appendix C.

**Evaluation.** We evaluate the zero-shot performance of trained agents in test environments from different levels (i.e., different map layouts and backgrounds) where the target object is either placed in unseen locations or has an unseen shape. To quantify the performance of trained agents, we report expert-normalized scores on both training and test environments. For training performance, we measure the average success rate of trained agents over 100 rollouts in training environments and divide it by the average success rate of the expert PPG agent used to collect demonstrations. For test performance, we train a separate expert PPG agent in test environments and compute expert-normalized scores in the same manner.

Figure 4: (a) Image observations of training and test environments for the Pick Up Cup task in the RLBench benchmark [34]. (b) Success rates on both training and test environments. The results represent the mean and standard deviation over four different seeds. ARP-RSSM denotes the model that uses frozen CLIP representations for computing the multimodal reward, and ARP-RSSM+ denotes the model that incorporates the fine-tuning scheme in Section 3.3.

**Baseline and our method.** As a baseline, we consider InstructRL [46], one of the state-of-the-art text-conditioned policies. InstructRL utilizes a transformer-based policy and pre-trained M3AE [22] representations for encoding visual observations and text instructions. For our method, we use a return-conditioned policy based on the Decision Transformer (DT) [8, 43] architecture, denoted as ARP-DT (see Appendix A for details).
We consider two variations: the model that uses frozen CLIP representations (denoted as ARP-DT) and the model that uses fine-tuned CLIP representations (denoted as ARP-DT+) for computing the multimodal reward. We use the same M3AE model to encode visual observations and the same transformer architecture for policy training. The main difference is that our model conditions on a sequence that includes the multimodal return, while the baseline conditions on static text representations concatenated with visual representations.

**Comparison with language-conditioned agents.** Figure 3 shows that our method significantly outperforms InstructRL in all three tasks. In particular, ARP-DT outperforms InstructRL in test environments while achieving similar training performance. This result implies that our method effectively guides the agent away from pursuing unintended goals through the adaptive multimodal reward signal, thereby mitigating goal misgeneralization. Moreover, we observe that ARP-DT+, which uses the multimodal reward from the fine-tuned CLIP model, achieves superior performance to ARP-DT. Considering that the only difference between ARP-DT and ARP-DT+ is the multimodal reward they use, this result shows that improving the quality of the reward can lead to better generalization performance.

### 4.2 RLBench Experiments

**Environment.** We also demonstrate the effectiveness of our framework on RLBench [34], a standard benchmark for vision-based robotic manipulation. Specifically, we focus on the Pick Up Cup task, where the robot arm is instructed to grasp and lift the cup. We train agents using 100 expert demonstrations collected from environments where the position of the target cup changes above the cyan-colored line in each episode (see the upper part of Figure 4a). We then evaluate the agents in a test environment where the target cup is positioned below the cyan-colored line (see the lower part of Figure 4a). The natural language instruction $x$ used is "grasp the red cup and lift it off the surface with the robotic arm." For evaluation, we measure the average success rate over 500 episodes in which the object position varies in each episode.

**Setup.** As a baseline, we consider MV-MWM [62], which first trains a multi-view autoencoder by reconstructing patches from randomly masked viewpoints and subsequently learns a world model based on the autoencoder representations. We use the same procedure for training the multi-view autoencoders for both our method and the baseline. The main difference is that while MV-MWM does not use any text instruction as an input, our method trains a policy that is additionally conditioned on the multimodal return. In our experiments, we closely follow the experimental setup and implementation of the imitation learning experiments in MV-MWM. Specifically, we adopt a single-view control setup where representation learning is conducted using images from both the front and wrist cameras, but world model learning is performed solely with the front camera. For our method, we train the return-conditioned policy based on the recurrent state-space model (RSSM) [25], denoted as ARP-RSSM (see Appendix A for more details). We consider two variants of this model: the model utilizing frozen CLIP representations (referred to as ARP-RSSM) and the model that employs fine-tuned CLIP representations (referred to as ARP-RSSM+). To compute multimodal rewards using both frozen and fine-tuned CLIP, we employ the same setup as in the Procgen experiments. Additional details are in Appendix D.

**Results.** Figure 4b shows the enhanced generalization performance of ARP-RSSM+ agents in test environments, with the success rate increasing from 20.37% to 50.93%. This result implies that our method facilitates the agent in reaching target cups at unseen locations by employing adaptive rewards. Conversely, ARP-RSSM, which uses frozen CLIP representations, demonstrates performance similar to MV-MWM in both training and test environments, unlike the result in Section 4.1. We suspect this is because achieving target goals in RLBench robotic manipulation tasks requires more fine-grained control than in game-like environments.

### 4.3 Generalization to Unseen Instructions

Figure 5: Test environments used for the experiments in Section 4.3: (a) CoinRun-bluegem and (b) Maze III.

Table 1: Expert-normalized scores on CoinRun-bluegem test environments (see Figure 5a).

| Model | Test Performance |
| --- | --- |
| InstructRL | 63.99% ± 3.07% |
| ARP-DT (Ours) | 77.05% ± 2.09% |
| ARP-DT+ (Ours) | 79.06% ± 6.69% |

We also evaluate our method in test environments where the agent is required to reach a different object with an unseen shape, color, and location by following unseen language instructions associated with this new object. First, we train agents in environments with the objective of collecting a yellow coin, which is always positioned in the far right corner, and the trained agents are tested on unseen environments where the target object changes to a blue gem and the target object's location is randomized. This new environment is referred to as CoinRun-bluegem (see Figure 5a), and we provide the unseen instruction "The goal is to collect the blue gem." to the agents. Table 1 shows that our method significantly outperforms the text-conditioned policy (InstructRL) even in CoinRun-bluegem. This result indicates that our multimodal reward can provide adaptive signals for reaching target objects even when their color and shape change.

Table 2: Expert-normalized scores on Maze III test environments (see Figure 5b).

| Model | Test Performance |
| --- | --- |
| InstructRL | 21.21% ± 1.52% |
| ARP-DT (Ours) | 33.33% ± 4.01% |
| ARP-DT+ (Ours) | 38.38% ± 3.15% |

In addition, we verify the effectiveness of our multimodal reward in distinguishing similar-looking distractors and guiding the agent to the correct goal. To this end, we train agents using demonstrations from Maze II environments, where the objective is to collect the yellow line. Trained agents are tested in an augmented version of the Maze II test environments: we place a yellow gem, a red diagonal line, and a red straight line at random positions in the map (denoted as Maze III in Figure 5b), and instruct the trained agent to reach the red diagonal line ($x$ = "Navigate a maze to collect the red diagonal line."). Table 2 shows that our method outperforms the baseline in Maze III, indicating that our multimodal reward can provide adaptive signals for achieving goals by distinguishing distractors.

### 4.4 Comparison with Goal-Conditioned Agents

We compare our method with goal-conditioned methods, assuming the availability of goal images in both training and test environments. First, it is essential to note that the suggested baselines rely on additional information from the test environment because they assume the presence of a goal image at test time. In contrast, our method relies solely on natural language instructions and does not necessitate any extra information about the test environment.

Figure 6: Expert-normalized scores on training/test environments.
The results show the mean and standard deviation averaged over three runs. ARP-DT shows comparable or even better generalization ability compared to goal-conditioned baselines.

As baselines, we consider a goal-conditioned version of InstructRL (denoted as GC-InstructRL), which uses visual observations concatenated with a goal image at each timestep. We also consider a variant of our algorithm that uses the distance between the CLIP visual representations of the current observation and the goal image as the reward (denoted as GC-DT). Figure 6 illustrates the training and test performance of the goal-conditioned baselines and ARP-DT. First, we observe that GC-DT outperforms GC-InstructRL in all test environments. Note that how the goal image is utilized is the only distinction between GC-DT and GC-InstructRL. This result suggests that our return-conditioned policy helps enhance generalization performance. Additionally, we find that ARP-DT demonstrates results comparable to GC-DT and even surpasses GC-InstructRL in all three tasks. Importantly, it should be emphasized that while the goal-conditioned baselines rely on a goal image from the test environment (which can be challenging to provide in real-world scenarios), ARP-DT relies solely on the natural language instruction for the task. These findings highlight the potential of our method to be applicable in real-world scenarios where the agent cannot acquire information from the test environment.

### 4.5 Embedding Analysis

To support the effectiveness of our framework in generalization, we analyze whether our proposed method can induce meaningful abstractions in test environments. Our experimental design aims to address the key requirements for improved generalization in test environments: (i) the agent should consistently assign similar representations to similar behaviors even when the map configuration is changed, and (ii) the agent should effectively differentiate between goal-reaching behaviors and misleading behaviors. To this end, we measure the cycle-consistency of hidden representations from trained agents following [3, 42]. For two trajectories $\tau^1$ and $\tau^2$ with the same length $N$, we first choose $i \le N$ and find its nearest neighbor j = arg minj N ||h(o1 i, a1