Cross-Domain Policy Adaptation by Capturing Representation Mismatch
Jiafei Lyu 1 Chenjia Bai 2 Jingwen Yang 3 Zongqing Lu 4 5 Xiu Li 1
It is vital to learn effective policies that can be transferred to different domains with dynamics discrepancies in reinforcement learning (RL). In this paper, we consider dynamics adaptation settings where there exists a dynamics mismatch between the source domain and the target domain, and one has access to sufficient source domain data but only limited interactions with the target domain. Existing methods address this problem by learning domain classifiers, performing data filtering from a value discrepancy perspective, etc. Instead, we tackle this challenge from a decoupled representation learning perspective. We perform representation learning only in the target domain and measure the representation deviations on transitions from the source domain, which we show can be a signal of dynamics mismatch. We also show that the representation deviation upper bounds the performance difference of a given policy between the source domain and the target domain, which motivates us to adopt representation deviation as a reward penalty. The produced representations are not involved in either the policy or the value function, but only serve as a reward penalizer. We conduct extensive experiments on environments with kinematic and morphology mismatch, and the results show that our method exhibits strong performance on many tasks. Our code is publicly available at https://github.com/dmksjfl/PAR.
1. Introduction
Alice is interested in learning cooking. She bought a new set of cookware recently that is different from the one she
1Tsinghua Shenzhen International Graduate School, Tsinghua University 2Shanghai Artificial Intelligence Laboratory 3Tencent IEG 4School of Computer Science, Peking University 5Beijing Academy of Artificial Intelligence. Correspondence to: Xiu Li.
Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
used before. She soon uses the new cookware expertly. As this example conveys, human beings are able to quickly transfer learned policies to similar tasks. Such capability is also expected in reinforcement learning (RL) agents. Unfortunately, RL algorithms are known to require a vast number of interactions to learn meaningful policies (Silver et al., 2016; Lyu et al., 2023). A bare fact is that sometimes only limited interactions with the environment (target domain) are feasible, because a large number of interactions may be expensive and time-consuming in scenarios like robotics (Cutler & How, 2015; Kober et al., 2013), autonomous driving (Kiran et al., 2020; Osinski et al., 2019), etc. Nevertheless, we may simultaneously have access to another structurally similar source domain where experience is cheaper to gather, e.g., a simulator. Since the source domain can be biased, a dynamics mismatch between the two domains may persist. It then necessitates developing algorithms that perform well in the target domain, given a source domain with some dynamics discrepancies.
Note that there are numerous studies concerning policy adaptation, such as system identification (Yu et al., 2017; Clavera et al., 2018) and domain randomization (Slaoui et al., 2019; Tobin et al., 2017; Peng et al., 2017). These methods often rely on demonstrations from the target domain (Kim et al., 2019), the distributions from which the simulator parameters are sampled, a manipulable simulator (Chebotar et al., 2018), etc. We lift these requirements and consider learning policies with sufficient source domain data (either online or offline) and limited online interactions with the target domain. This setting is also referred to as off-dynamics RL (Eysenbach et al., 2021) or online dynamics adaptation (Xu et al., 2023). Existing methods tackle this problem by learning domain classifiers (Eysenbach et al., 2021), filtering source domain data that share similar value estimates with target domain data (Xu et al., 2023), etc.
In this paper, we study the cross-domain policy adaptation problem where only transition dynamics between the source domain and the target domain differ. The state space, action space, as well as the reward function, are kept unchanged. Unlike prior works, we address this issue from a representation learning perspective. Our motivation is that the dynamics mismatch between the source domain and the target domain can be captured by representation deviations of transitions from the two domains, which is grounded by our
Figure 1. Illustration of PAR. We train encoders f, g merely with target domain data and utilize them to modify rewards from the source domain with measured representation deviations. Afterward, the downstream SAC algorithm can learn from transitions from both domains.
theoretical analysis. We further show concrete performance bounds given either online or offline source domain, where we observe that representation deviation upper bounds the performance difference of any given policy between the source domain and the target domain. Motivated by the theoretical findings, we deem that representation mismatch between two domains can be used as a reward penalizer to fulfill dynamics-aware policy adaptation.
For practical usage, we propose Policy Adaptation by Representation mismatch, dubbed the PAR algorithm. Our approach trains a state encoder and a state-action encoder only in the target domain to capture its latent dynamics structure, and then leverages the learned encoders to produce representations of transitions from the source domain. We evaluate deviations between representations of the state-action pair and the next state, and use the resulting representation deviations to penalize source domain rewards, as depicted in Figure 1. Intuitively, the penalty is large if the transition deviates far from the target domain, and vice versa. In this way, the agent can benefit more from dynamics-consistent transitions and de-emphasize others. It is worth noting that the representation learning is decoupled from policy and value function training since the representations are not involved in them. Empirical results in environments with kinematic and morphology shifts show that our method notably beats previous strong baselines on many tasks in both online and offline source domain settings.
2. Related Work
Domain Adaptation in RL. Generalizing or transferring policies across varied domains remains a critical issue in RL, where domains may differ in terms of agent embodiment (Liu et al., 2022b; Zhang et al., 2021c), transition dynamics (Eysenbach et al., 2021; Viano et al., 2020), observation space (Gamrian & Goldberg, 2018; Bousmalis et al., 2018; Ge et al., 2022; Zhang et al., 2021b; Hansen et al., 2021),
etc. We focus on policy adaptation under dynamics discrepancies between the two domains. Prior works mainly address this issue via system identification (Clavera et al., 2018; Zhou et al., 2019; Du et al., 2021; Xie et al., 2022), domain randomization (Slaoui et al., 2019; Mehta et al., 2019; Vuong et al., 2019; Jiang et al., 2023), meta-RL (Nagabandi et al., 2018; Raileanu et al., 2020; Arndt et al., 2019; Wu et al., 2022), or by leveraging expert demonstrations from the target domain (Kim et al., 2019; Hejna et al., 2020; Fickinger et al., 2022; Raychaudhuri et al., 2021). Though effective, these methods depend on a model of the environment, expert trajectories gathered in the target domain, or a proper choice of randomized parameters. In contrast, we dismiss these requirements and study the dynamics adaptation problem (Xu et al., 2023), where only a small amount of online interaction with the target domain is allowed and a source domain with sufficient data can be accessed. Under this setting, many approaches have been developed, such as directly optimizing the parameters of the simulator to calibrate the dynamics of the source domain (Farchy et al., 2013; Zhu et al., 2017; Collins et al., 2020; Chebotar et al., 2018; Ramos et al., 2019); however, this requires a manipulable simulator. There are also attempts to use expressive models to learn the dynamics change (Golemo et al., 2018; Hwangbo et al., 2019; Xiong et al., 2023), and action transformation methods that learn dynamics models of the two domains and utilize them to modify transitions from the source domain (Hanna et al., 2021; Desai et al., 2020), although it is difficult to learn accurate dynamics models (Malik et al., 2019; Lyu et al., 2022b). Another line of research trains domain classifiers and tries to close the dynamics gap by either reward modification (Eysenbach et al., 2021; Liu et al., 2022a) or importance weighting (Niu et al., 2022). A recent work (Xu et al., 2023) bridges the dynamics gap by selectively sharing transitions from the source domain that have similar value estimates as those in the target domain. Unlike these methods, we capture dynamics discrepancy by measuring representation mismatch. It is worth noting that in this work we only consider policy adaptation across domains that have the same state space and action space. Our method can also generalize to settings where the target domain has a different state space or action space by incorporating extra components or modules as in prior works (Barekatain et al., 2019; You et al., 2022; Gui et al., 2023).
Representation Learning in RL. Representation learning is an important research topic in computer vision (Bengio et al., 2012; Kolesnikov et al., 2019; He et al., 2015). In the context of RL, representation learning is actively explored in image-based tasks (Kostrikov et al., 2020; Yarats et al., 2022; Liu et al., 2021; Cetin et al., 2022), aiming at extracting useful features from information-redundant images by contrastive learning (Srinivas et al., 2020; Eysenbach et al., 2022; Stooke et al., 2021; Zhu et al., 2020), MDP homo-
morphisms (van der Pol et al., 2020; Rezaei-Shoshtari et al., 2022), bisimulation (Ferns et al., 2011; Zhang et al., 2021a), self-predictive learning (Schwarzer et al., 2021; Tang et al., 2022; Kim et al., 2022), etc. Representation learning can also be found in model-based RL methods that rely on latent dynamics (Karl et al., 2016; Rafailov et al., 2020; Hafner et al., 2019; Hansen et al., 2022). In state-based tasks, it also spans successor representations (Barreto et al., 2016; Fujimoto et al., 2021; Machado et al., 2023), learning state-action representations (Ota et al., 2020; Fujimoto et al., 2023) and action representations (Whitney et al., 2020; Chandak et al., 2019) for improving sample efficiency, etc. We capture latent dynamics information by learning state-action representations, but we differ from previous approaches in that we use them for detecting dynamics mismatch.
3. Preliminaries
We formulate reinforcement learning (RL) problems as a Markov Decision Process (MDP), specified by the 5-tuple $M = (S, A, P, r, \gamma)$, where $S$ is the state space, $A$ is the action space, $P$ denotes the transition dynamics, $r : S \times A \to \mathbb{R}$ is the scalar reward signal, and $\gamma \in [0, 1)$ is the discount factor. The objective of RL is to find a policy $\pi : S \to \Delta(A)$ that maximizes the discounted cumulative return $\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$. We consider access to a source domain $M_{\rm src} = (S, A, P_{\rm src}, r, \gamma)$ and a target domain $M_{\rm tar} = (S, A, P_{\rm tar}, r, \gamma)$ that share the state space and action space, and only differ in their transition dynamics. We assume the rewards are bounded, i.e., $|r(s, a)| \le r_{\max}, \forall s, a$.
In the rest of the paper, we specify the transition dynamics in a domain $M$ as $P_M$ (e.g., $P_{M_{\rm src}}$ is the transition dynamics in the source domain). We denote $\rho^\pi_M(s, a) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t P^\pi_{M,t}(s)\pi(a|s)$ as the normalized probability that a policy $\pi$ encounters the state-action pair $(s, a)$, where $P^\pi_{M,t}(s)$ is the probability that the policy $\pi$ encounters the state $s$ at timestep $t$ in the domain $M$. The expected return of a policy $\pi$ in MDP $M$ can then be written as $J_M(\pi) = \mathbb{E}_{s,a\sim\rho^\pi_M}[r(s, a)]$.
Notations: $I(X; Y)$ denotes the mutual information between two random variables $X, Y$. $H(X)$ is the entropy of the random variable $X$. $\Delta(\cdot)$ is the probability simplex.
4. Dynamics Adaptation by Representation Mismatch
In this section, we start by theoretically unpacking the equivalence between the representation mismatch and the dynamics mismatch. We further show performance bounds of a policy between the target domain and either an online or an offline source domain, where the representation mismatch appears in the lower bound of the performance difference. Empowered by the theoretical results, we leverage the representation mismatch
to penalize source domain data and propose our practical algorithm for dynamics-aware policy adaptation.
4.1. Theoretical Analysis
Before moving to our theoretical results, we need to impose the following assumption, which can be generally satisfied in practice (e.g., deep RL). We defer the detailed discussion on the rationality of this assumption to Section 6.3.
Assumption 4.1 (One-to-one Representation Mapping). For any state-action pair $(s, a)$ and its latent representation $z$, they construct a one-to-one mapping from the original state-action joint space $S \times A$ to the latent space $Z$.
Our first result in Theorem 4.2 establishes a connection between mutual information and the representation deviation of transitions from different domains. Due to space limits, all proofs are deferred to Appendix A.
Theorem 4.2. For any $(s, a)$, denote its representation as $z$, and suppose $s'_{\rm src} \sim P_{M_{\rm src}}(\cdot\,|\,s, a)$, $s'_{\rm tar} \sim P_{M_{\rm tar}}(\cdot\,|\,s, a)$. Denote $h(z; s'_{\rm src}, s'_{\rm tar}) = I(z; s'_{\rm tar}) - I(z; s'_{\rm src})$; then measuring $h(z; s'_{\rm src}, s'_{\rm tar})$ is equivalent to measuring the representation deviation $D_{\rm KL}\big(P(z\,|\,s'_{\rm tar})\,\|\,P(z\,|\,s'_{\rm src})\big)$.
Remark. The defined function $h(z; s'_{\rm src}, s'_{\rm tar})$ measures the difference between the embedded target domain information and source domain information in $z$. This theorem illustrates that such a difference is equivalent to the KL divergence between the distributions of $z$ given the source domain next state and the target domain next state, respectively. Intuitively, $h(z; s'_{\rm src}, s'_{\rm tar})$ approaches 0 if the distribution of $s'_{\rm src}$ is close to that of $s'_{\rm tar}$. If we enforce $z$ to contain only target domain knowledge, $h(z; s'_{\rm src}, s'_{\rm tar})$ can be large if the dynamics mismatch between data from the two domains is large, incurring a large $D_{\rm KL}\big(P(z\,|\,s'_{\rm tar})\,\|\,P(z\,|\,s'_{\rm src})\big)$. Naturally, one may think of using this representation deviation term as evidence of dynamics mismatch.
Below, we show that the representation deviation can strictly reflect the dynamics discrepancy between the two domains.

Theorem 4.3. Measuring the representation deviation between the source domain and the target domain is equivalent to measuring the dynamics mismatch between the two domains. Formally, we can derive that $D_{\rm KL}\big(P(z\,|\,s'_{\rm tar})\,\|\,P(z\,|\,s'_{\rm src})\big) = D_{\rm KL}\big(P(s'_{\rm tar}\,|\,z)\,\|\,P(s'_{\rm src}\,|\,z)\big) + H(s'_{\rm tar}) - H(s'_{\rm src})$.
The above theorem conveys the rationality of detecting dynamics shifts with the aid of the representation mismatch. This is appealing, as representations can contain rich information and capture hidden features, and learning in the latent space is effective (Hansen et al., 2022). To see how representation mismatch affects the performance of the agent, we derive a novel performance bound of a policy given an online target domain and an online source domain in Theorem 4.4.

Theorem 4.4 (Online performance bound). Denote $M_{\rm src}$, $M_{\rm tar}$ as the source domain and the target domain, respectively. Then the return difference of any policy $\pi$ between $M_{\rm src}$ and $M_{\rm tar}$ is bounded:

$$J_{M_{\rm tar}}(\pi) \ge J_{M_{\rm src}}(\pi) - \underbrace{\frac{2\gamma r_{\max}}{(1-\gamma)^2}\,\mathbb{E}_{\rho^{\pi}_{M_{\rm src}}}\Big[D_{\rm KL}\big(P(z\,|\,s'_{\rm src})\,\|\,P(z\,|\,s'_{\rm tar})\big)\Big]}_{\text{(a): representation mismatch}} - \underbrace{\frac{2\gamma r_{\max}}{(1-\gamma)^2}\,\mathbb{E}_{\rho^{\pi}_{M_{\rm src}}}\Big[\big|H(s'_{\rm src}) - H(s'_{\rm tar})\big|\Big]}_{\text{(b): state distribution deviation}}$$
Remark. The above bound indicates that the performance difference of a policy $\pi$ across domains is determined by the representation mismatch term (a) and the state distribution deviation term (b). Since both domains are fixed, the entropies of their state distributions are constants, and term (b) is accordingly also a constant. Term (b) characterizes the inherent performance difference of a policy in the two domains and vanishes if the two domains are identical.
Moreover, if the source domain is offline (i.e., one can only have access to a static offline source domain dataset), we can derive a similar bound as shown below.
Theorem 4.5 (Offline performance bound). Denote the empirical policy distribution in the offline dataset $D$ from the source domain $M_{\rm src}$ as $\pi_D(a|s) := \frac{\sum_{(s,a)\in D}\mathbb{1}(s,a)}{\sum_{s\in D}\mathbb{1}(s)}$. Then the return difference of any policy $\pi$ between the source domain $M_{\rm src}$ and the target domain $M_{\rm tar}$ is bounded:

$$J_{M_{\rm tar}}(\pi) \ge J_{M_{\rm src}}(\pi) - \underbrace{\frac{4 r_{\max}}{(1-\gamma)^2}\,\mathbb{E}_{\rho^{\pi_D}_{M_{\rm src}},\,P_{M_{\rm src}}}\big[D_{\rm TV}(\pi_D\,\|\,\pi)\big]}_{\text{(a): policy deviation}} - \underbrace{\frac{2\gamma r_{\max}}{(1-\gamma)^2}\,\mathbb{E}_{\rho^{\pi_D}_{M_{\rm src}}}\Big[D_{\rm KL}\big(P(z\,|\,s'_{\rm src})\,\|\,P(z\,|\,s'_{\rm tar})\big)\Big]}_{\text{(b): representation mismatch}} - \underbrace{\frac{2\gamma r_{\max}}{(1-\gamma)^2}\,\mathbb{E}_{\rho^{\pi_D}_{M_{\rm src}}}\Big[\big|H(s'_{\rm src}) - H(s'_{\rm tar})\big|\Big]}_{\text{(c): state distribution deviation}}$$
Remark. This theorem also explicates the importance of the representation mismatch term (b) in the lower bound, similar to Theorem 4.4, but it additionally highlights the role of the policy deviation term (a). Evidently, controlling the policy deviation term matters when the source domain is offline.
Theorems 4.4 and 4.5 motivate us to use the representation mismatch term as a reward penalty to encourage dynamics-consistent transitions, because the core factor that affects the bound, with either an online or an offline source domain, turns out to be the representation mismatch term.
4.2. Practical Algorithm
To acquire representations of the transitions, we train a state encoder $f_\psi(s)$ parameterized by $\psi$ to produce $z_1$, the representation of the state $s$, along with a state-action encoder $g_\xi(z, a)$ parameterized by $\xi$ that receives the state representation $z_1$ and the action as inputs and outputs the state-action representation $z_2$. By letting $z_2$ be close to the representation of the next state, we realize latent dynamics consistency (Hansen et al., 2022; Ye et al., 2021). The objective function for learning these encoders gives:
$$\mathcal{L}(\psi, \xi) = \mathbb{E}_{(s,a,s')\sim D}\Big[\big(g_\xi(f_\psi(s), a) - \mathrm{SG}(f_\psi(s'))\big)^2\Big], \quad (1)$$

where $D$ is the replay buffer, and SG denotes the stop-gradient operator. Similar objectives are adopted in prior works (Ota et al., 2020; Fujimoto et al., 2023). A central difference is that we only use the representations for measuring representation mismatch, instead of involving them in policy or value function training. One can also utilize a distinct objective, as long as it can embed the latent dynamics information (see Section 6). It is worth noting that both $f$ and $g$ are deterministic to fulfill Assumption 4.1.
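To make the objective concrete, below is a minimal PyTorch sketch of Equation 1. The network widths and the latent dimension of 256 (cf. Table 2 in the appendix) are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """f_psi: deterministic mapping from a state s to its representation z1."""
    def __init__(self, state_dim: int, latent_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

class StateActionEncoder(nn.Module):
    """g_xi: deterministic mapping from (z1, a) to a state-action representation z2."""
    def __init__(self, latent_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z1: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z1, a], dim=-1))

def encoder_loss(f: StateEncoder, g: StateActionEncoder,
                 s: torch.Tensor, a: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
    """Equation 1: g(f(s), a) should match the stop-gradient representation of s'."""
    z2 = g(f(s), a)
    target = f(s_next).detach()  # SG(.): stop gradient on the next-state representation
    return ((z2 - target) ** 2).sum(dim=-1).mean()
```

In PAR, this loss is minimized only on target-domain batches, so f and g encode target-domain dynamics exclusively.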
Given insights from Theorem 4.2, we deem that the representations ought to embed more information about the target domain and de-emphasize source domain knowledge, such that the representation deviations can be a better proxy of dynamics shifts. This prompts us to train the state encoder and the state-action encoder only in the target domain, and to evaluate the representation deviations upon samples from the source domain. We then penalize the source domain rewards with the calculated deviations, i.e., for any transition $(s_{\rm src}, a_{\rm src}, r_{\rm src}, s'_{\rm src})$ from the source domain, we modify its reward to
$$\hat{r}_{\rm src} = r_{\rm src} - \beta \big[g_\xi(f_\psi(s_{\rm src}), a_{\rm src}) - f_\psi(s'_{\rm src})\big]^2, \quad (2)$$
where $\beta \in \mathbb{R}$ is a hyperparameter. This penalty generally captures the representation mismatch between the source domain and the target domain. $g_\xi(f_\psi(s_{\rm src}), a_{\rm src})$ represents the state-action representation in the target domain, since the two domains share the state space and action space and $f, g$ only encode target domain information; it approximates the representation of the next state $s'_{\rm tar}$ that $(s_{\rm src}, a_{\rm src})$ would incur in the target domain. $f_\psi(s'_{\rm src})$, instead, denotes the representation of $s'_{\rm src}$ from the source domain. A larger penalty will be allocated if the source domain data deviates too much from the dynamics of the target domain, and vice versa. Consequently, the agent can focus more on dynamics-consistent transitions and achieve better performance. Hence, such a penalty matches our theoretical results.
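A corresponding sketch of the reward correction in Equation 2, reusing the encoders from the previous sketch; the default β value here is purely illustrative (the paper treats β as a task-dependent hyperparameter).

```python
@torch.no_grad()
def penalize_source_rewards(f, g, s_src, a_src, r_src, s_next_src, beta: float = 0.5):
    """Equation 2: subtract the representation deviation from source-domain rewards.
    f and g were trained only on target-domain data, so a large deviation signals
    a transition that is inconsistent with the target-domain dynamics."""
    z2 = g(f(s_src), a_src)            # predicted target-domain next-state representation
    z_next = f(s_next_src)             # representation of the observed source-domain s'
    deviation = ((z2 - z_next) ** 2).sum(dim=-1)
    return r_src - beta * deviation
```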
Formally, we introduce our novel method for cross-domain policy adaptation, Policy Adaptation by Representation Mismatch, tagged PAR algorithm. We use SAC (Haarnoja et al.,
2018) as the base algorithm, and aim at training value functions (a.k.a. critics) $Q_{\theta_1}(s, a)$, $Q_{\theta_2}(s, a)$ parameterized by $\theta_1, \theta_2$, and a policy (a.k.a. actor) $\pi_\phi$ parameterized by $\phi$. Denote $D_{\rm src}$, $D_{\rm tar}$ as the replay buffers of the source domain and the target domain, and let the rewards in $D_{\rm src}$ be corrected as $\hat{r}_{\rm src}$; then the objective function for training the value functions gives:
$$\mathcal{L}_{\rm critic} = \mathbb{E}_{(s,a,r,s')\sim D_{\rm src}\cup D_{\rm tar}}\big[(Q_{\theta_i}(s, a) - y)^2\big], \quad (3)$$
where $i \in \{1, 2\}$ and $y$ is the target value, which gives
$$y = r + \gamma\Big(\min_{i=1,2} Q_{\theta'_i}(s', a') - \alpha \log \pi_\phi(a'\,|\,s')\Big), \quad (4)$$
where $\theta'_i, i \in \{1, 2\}$ are the parameters of the target networks, $\alpha \in \mathbb{R}^+$, and $a' \sim \pi_\phi(\cdot\,|\,s')$. How PAR updates its policy depends on whether the source domain is online or offline. We consider both conditions and discuss them below.
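For completeness, a sketch of the critic update in Equations 3 and 4; `policy.sample` returning an action and its log-probability is an assumed interface, not the paper's code.

```python
def critic_target(policy, q1_targ, q2_targ, r, s_next, gamma: float, alpha: float):
    """Equation 4: soft Bellman backup with clipped double Q-learning.
    r is the penalized reward (Eq. 2) for source-domain samples."""
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)          # a' ~ pi_phi(.|s')
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        return r + gamma * (q_next - alpha * logp_next)

def critic_loss(q1, q2, s, a, y):
    """Equation 3: mean-squared Bellman error on the mixed batch D_src U D_tar."""
    return ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
```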
Online PAR. If the source domain is online, then the policy objective function gives:
$$\mathcal{L}^{\rm on}_{\rm actor} = \mathbb{E}_{s\sim D_{\rm src}\cup D_{\rm tar},\, a\sim\pi_\phi(\cdot|s)}\Big[\min_{i=1,2} Q_{\theta_i}(s, a) - \alpha \log \pi_\phi(a\,|\,s)\Big]. \quad (5)$$
Offline PAR. Given an offline source domain, the deviation between the learned policy and the source domain behavior policy $\pi_{D_{\rm src}}$ ought to be considered based on Theorem 4.5. We then incorporate a behavior cloning term into the objective function of the policy, similar to Fujimoto & Gu (2021). This term injects conservatism into policy learning on a fixed dataset and is necessary to mitigate the extrapolation error (Fujimoto et al., 2019), a challenge that is widely studied in offline RL (Levine et al., 2020; Kumar et al., 2020; Lyu et al., 2022c;a; Kostrikov et al., 2022). The policy objective function then yields:
$$\mathcal{L}^{\rm off}_{\rm actor} = -\,\mathbb{E}_{(s,a)\sim D_{\rm src},\, a'\sim\pi_\phi(\cdot|s)}\big[(a' - a)^2\big] + \lambda\, \mathcal{L}^{\rm on}_{\rm actor}, \quad (6)$$

where $\lambda = \nu \Big/ \frac{1}{N}\sum_{(s_j, a_j)}\big|\min_{i=1,2} Q_{\theta_i}(s_j, a_j)\big|$ is the normalization term that balances behavior cloning and maximizing the value function, $\nu \in \mathbb{R}^+$ is a hyperparameter, and $\mathcal{L}^{\rm on}_{\rm actor}$ is the policy objective of online PAR in Equation 5. The behavior cloning term ensures that the learned policy stays close to the data-collecting policy of the source domain dataset. We summarize in Algorithm 1 the abstracted pseudocode of PAR, and defer the full pseudocodes to Appendix C.
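A sketch of the offline actor update in Equation 6, written as a loss to minimize (i.e., the negative of the maximized objective); the value ν = 2.5 and the `policy.sample` interface are illustrative assumptions.

```python
def offline_actor_loss(policy, q1, q2, s, a_data, alpha: float, nu: float = 2.5):
    """Equation 6 (negated for a minimizer): behavior cloning toward the dataset
    actions plus a lambda-weighted SAC term (Eq. 5) on the same states."""
    a_pi, logp = policy.sample(s)                          # a' ~ pi_phi(.|s)
    q_pi = torch.min(q1(s, a_pi), q2(s, a_pi))
    lam = nu / q_pi.abs().mean().detach()                  # normalizes the Q magnitude
    bc = ((a_pi - a_data) ** 2).sum(dim=-1).mean()         # behavior cloning term
    sac = (q_pi - alpha * logp).mean()                     # online PAR objective (Eq. 5)
    return bc - lam * sac
```

Dropping the behavior cloning term and sampling states from both buffers recovers the online actor update in Equation 5.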
5. Experiments
In this section, we examine the effectiveness of our proposed method by conducting experiments on environments with kinematic and morphology discrepancies. We also extensively investigate the performance of our method under the
Algorithm 1 PAR (Abstracted Version)
Input: Source domain $M_{\rm src}$, target domain $M_{\rm tar}$, target domain interaction interval $F$, batch size $N$
1: Initialize policy $\pi_\phi$, value functions $\{Q_{\theta_i}\}_{i=1,2}$ and target networks $\{Q_{\theta'_i}\}_{i=1,2}$, replay buffers $\{D_{\rm src}, D_{\rm tar}\}$
2: for i = 1, 2, ... do
3:   (online) Collect $(s_{\rm src}, a_{\rm src}, r_{\rm src}, s'_{\rm src})$ in $M_{\rm src}$ and store it, $D_{\rm src} \leftarrow D_{\rm src} \cup \{(s_{\rm src}, a_{\rm src}, r_{\rm src}, s'_{\rm src})\}$
4:   if i % F == 0 then
5:     Interact with $M_{\rm tar}$ and get $(s_{\rm tar}, a_{\rm tar}, r_{\rm tar}, s'_{\rm tar})$. $D_{\rm tar} \leftarrow D_{\rm tar} \cup \{(s_{\rm tar}, a_{\rm tar}, r_{\rm tar}, s'_{\rm tar})\}$
6:   end if
7:   Sample N transitions from $D_{\rm tar}$
8:   Train encoders in the target domain via Equation 1
9:   Sample N transitions from $D_{\rm src}$
10:  Modify source domain rewards with Equation 2
11:  Update critics by minimizing Equation 3
12:  (online) Update actor by maximizing Equation 5
13:  (offline) Update actor by maximizing Equation 6
14:  Update target networks
15: end for
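The abstracted loop of Algorithm 1 can be sketched as follows for the online source domain; `ReplayBuffer`-like buffers, `agent`, and `encoders` are hypothetical wrappers around the SAC networks and the encoders f, g, and a Gymnasium-style environment API is assumed.

```python
def train_par_online(src_env, tar_env, agent, encoders, buffers,
                     F: int = 10, batch_size: int = 256, total_steps: int = 1_000_000):
    """Sketch of Algorithm 1 (online source domain)."""
    D_src, D_tar = buffers                                  # hypothetical replay buffers
    s_src, _ = src_env.reset()
    s_tar, _ = tar_env.reset()
    for step in range(1, total_steps + 1):
        # collect one source-domain transition per step
        a = agent.act(s_src)
        s2, r, terminated, truncated, _ = src_env.step(a)
        D_src.add(s_src, a, r, s2)
        s_src = src_env.reset()[0] if (terminated or truncated) else s2

        # interact with the target domain only every F steps
        if step % F == 0:
            a = agent.act(s_tar)
            s2, r, terminated, truncated, _ = tar_env.step(a)
            D_tar.add(s_tar, a, r, s2)
            s_tar = tar_env.reset()[0] if (terminated or truncated) else s2

        # 1) train f, g on target-domain data only (Equation 1)
        encoders.update(D_tar.sample(batch_size))
        # 2) penalize sampled source rewards (Equation 2), then update
        #    critics (Equation 3) and actor (Equation 5) on both batches
        src_batch = encoders.penalize(D_src.sample(batch_size))
        agent.update(src_batch, D_tar.sample(batch_size))
```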
offline source domain and different qualities of the offline datasets. Moreover, we empirically analyze the influence of the important hyperparameters in PAR.
5.1. Results with Online Source Domain
For the empirical evaluation of policy adaptation capabilities, we use four environments (halfcheetah, hopper, walker, ant) from OpenAI Gym (Brockman et al., 2016) as source domains and modify their dynamics following Xu et al. (2023) to serve as target domains. The modifications include kinematic and morphology shifts: we simulate broken joints of the robot by limiting the rotation angle of its joints (kinematic shifts), and we clip the size of some limbs of the simulated robot to realize morphology shifts. Please see details of the environment setting in Appendix D.1.
We compare PAR against the following baselines: SAC-tar (Haarnoja et al., 2018), which trains the SAC agent merely in the target domain for 10^5 environmental steps; DARC (Eysenbach et al., 2021), which trains domain classifiers to estimate the dynamics discrepancy and leverages it to correct source domain rewards; DARC-weight, a variant of DARC that adopts the dynamics discrepancy term as importance sampling weights when updating critics; VGDF (Xu et al., 2023), a recent state-of-the-art method that filters transitions in the source domain that share similar value estimates as those in the target domain; SAC-tune, which trains the SAC agent in the source domain for 1M steps and fine-tunes it in the target domain with 10^5 transitions. For online experiments, we allow all algorithms to interact for 1M environmental steps with the source domain, but only 10^5
[Figure 2 panels — kinematic shifts (top): halfcheetah (broken back thigh), hopper (broken joints), walker (broken right foot), ant (broken hips); morphology shifts (bottom): halfcheetah (no thighs), hopper (big head), walker (no right thigh), ant (short feet). x-axis: Environment Steps (×10^5). Legend: PAR, VGDF, DARC, DARC-weight, SAC-tar, SAC-tune.]
Figure 2. Adaptation performance comparison when the source domain is online. The curves depict the test performance of each algorithm in the target domain under kinematic shifts (top) and morphology shifts (bottom). The modification to the environment is specified in the parentheses of the task name. The solid lines are the average returns over 5 different random seeds and the shaded region captures the standard deviation. The dashed line of SAC-tune denotes its final performance after fine-tuning for 10^5 steps.
steps in the target domain (i.e., the target domain interaction interval F = 10). All algorithms are run with five random seeds. We defer implementation details to Appendix D.2.
We summarize the comparison results in Figure 2. Note that the evaluated environments are quite challenging, and baselines like DARC struggle to obtain good performance. Based on the curves, PAR outperforms SAC-tar on all of the tasks, indicating that our method successfully boosts the performance of the agent in the target domain by extracting useful knowledge from sufficient source domain data. Notably, PAR achieves the best performance on 6 out of 8 tasks, often surpassing baselines by a large margin. On the remaining two tasks, PAR achieves competitive performance against VGDF. PAR achieves 2x sample efficiency compared to the best baseline method on tasks like halfcheetah (no thighs), ant (short feet), etc. Furthermore, PAR beats the fine-tuning method SAC-tune on 7 out of 8 tasks. These results altogether illustrate the advantages of our method.
5.2. Evaluations under Offline Source Domain
There exist circumstances where no real-time interaction with the source domain is available, but a previously gathered source domain dataset is. We therefore investigate how our method behaves under this setting, and how the quality of the dataset affects the performance. To that end, we adopt the -v2 datasets of the four environments (halfcheetah, hopper, walker, ant) from D4RL (Fu et al., 2020) with three quality levels (medium, medium-replay, medium-expert). This induces a total of 24 tasks.
We consider four baselines for comparison: CQL-0 (Kumar
et al., 2020), which trains a CQL agent solely on the source offline dataset and then directly deploys the learned policy in the target domain in a zero-shot manner; CQL+SAC, which updates on the offline source domain data with the CQL loss and on the online target domain data with the SAC loss; H2O (Niu et al., 2022), which trains domain classifiers to estimate the dynamics gap and uses it as an importance sampling weight for the Bellman error of data from the source domain dataset; VGDF+BC (Xu et al., 2023), which incorporates an additional behavior cloning term into vanilla VGDF, similar to PAR. All algorithms have a limited budget of 10^5 interactions with the target domain. The implementation details can be found in Appendix D.2.
We present the comparison results in Table 1. We observe that PAR also achieves superior performance given offline source domain datasets, surpassing baseline methods on 17 out of 24 tasks. It is worth mentioning that PAR is the only method that obtains meaningful performance on halfcheetah (no thighs) with the medium-expert dataset, approximately 4x the performance of the strongest baseline. PAR is also the only method that generally gains better performance on many tasks with higher-quality datasets; for example, although PAR has unsatisfying performance on hopper (big head) with the medium-level source domain dataset, its performance given the medium-expert source domain dataset is good. In contrast, methods like VGDF and H2O perform worse given medium-expert datasets than with medium-replay or medium datasets. These results collectively show the superiority of PAR and shed light on capturing representation mismatch for cross-domain policy adaptation.
Table 1. Performance comparison when the source domain is offline, i.e., only static source domain datasets are available. We report the mean return in conjunction with standard deviation in the target domain under different dataset qualities of the source domain data (medium, medium-replay, medium-expert). The results are averaged over 5 varied random seeds. We bold and highlight the best cell.
| Dataset Type | Task Name | CQL-0 | CQL+SAC | H2O | VGDF+BC | PAR (ours) |
| --- | --- | --- | --- | --- | --- | --- |
| medium | halfcheetah (broken back thigh) | 1128±156 | 3967±204 | 5450±194 | 4834±250 | **5686±603** |
| medium | halfcheetah (no thighs) | 361±29 | 1184±211 | 2863±209 | 3910±160 | **5768±117** |
| medium | hopper (broken joints) | 155±19 | 498±73 | 2467±323 | 2785±75 | **2825±112** |
| medium | hopper (big head) | 399±5 | 496±53 | 1451±480 | **3060±60** | 1450±143 |
| medium | walker (broken right foot) | 1453±412 | 1877±1040 | 3309±418 | 3000±388 | **3683±211** |
| medium | walker (no right thigh) | 975±131 | 1262±363 | 2225±546 | **3293±306** | 2899±841 |
| medium | ant (broken hips) | 1230±99 | -1814±431 | 2704±253 | 1713±366 | **3324±72** |
| medium | ant (short feet) | 1839±137 | -807±255 | 3892±85 | 3120±469 | **4886±97** |
| medium-replay | halfcheetah (broken back thigh) | 655±226 | 3868±295 | 5103±35 | **5398±360** | 5227±445 |
| medium-replay | halfcheetah (no thighs) | 398±63 | 575±619 | 3225±66 | 4271±162 | **5161±46** |
| medium-replay | hopper (broken joints) | 1018±6 | 686±60 | 2325±193 | 2242±1057 | **2376±777** |
| medium-replay | hopper (big head) | 365±7 | 556±222 | **1854±647** | 566±90 | 1336±419 |
| medium-replay | walker (broken right foot) | 156±175 | 1018±22 | **3536±431** | 2901±1101 | 3128±1084 |
| medium-replay | walker (no right thigh) | 337±189 | 1465±696 | **4254±207** | 2057±921 | 1249±706 |
| medium-replay | ant (broken hips) | 882±28 | -1609±425 | 2497±190 | 2437±286 | **2977±186** |
| medium-replay | ant (short feet) | 1294±191 | -1369±476 | 3782±382 | 4493±82 | **4791±102** |
| medium-expert | halfcheetah (broken back thigh) | 843±510 | **4283±180** | 4100±211 | 3580±1801 | 3741±378 |
| medium-expert | halfcheetah (no thighs) | 322±81 | 1669±439 | 1938±473 | 2740±297 | **10517±476** |
| medium-expert | hopper (broken joints) | 458±441 | 1147±595 | 2587±252 | 2144±938 | **2838±339** |
| medium-expert | hopper (big head) | 460±50 | 547±96 | 1156±574 | 2155±1182 | **2676±585** |
| medium-expert | walker (broken right foot) | 813±459 | 2431±782 | 2254±710 | 1540±926 | **4211±196** |
| medium-expert | walker (no right thigh) | 698±194 | 1547±346 | 2835±826 | 2047±1100 | **4006±1070** |
| medium-expert | ant (broken hips) | 321±373 | 304±1458 | 2178±799 | 1868±321 | **3113±501** |
| medium-expert | ant (short feet) | 1816±224 | -812±105 | 3511±441 | 1821±516 | **4902±34** |
5.3. Parameter Study
Now we investigate the influence of two critical hyperparameters in PAR: the reward penalty coefficient β and the target domain interaction interval F. Owing to the page limit, please check more experimental results in Appendix E.
Penalty coefficient β. β controls the scale of the measured representation mismatch. Intuitively, the agent will struggle to perform well if β is too large, and may fail to distinguish source domain samples with inconsistent dynamics if β is too small. To examine its impact, we conduct experiments on two tasks with online source domains, halfcheetah (broken back thigh) and walker (no right thigh). We evaluate PAR across β ∈ {0, 0.1, 0.5, 1.0, 2.0}, and show the results in Figure 3(a). We find that setting β = 0 (i.e., no representation mismatch penalty) usually incurs worse final performance, especially on the halfcheetah task, verifying the necessity of the reward modification term. Figure 3(a) also illustrates that the optimal β can be task-dependent. We believe this is because different tasks have distinct inherent structures like rewards and state spaces. PAR exhibits some robustness to β, although a large β may incur a performance drop on some tasks, e.g., the walker task.
Target domain interaction interval F. F decides how frequently the agent interacts with the target domain. Fol-
[Figure 3(a) panels: halfcheetah (broken back thigh), walker (no right thigh); legend: β ∈ {0, 0.1, 0.5, 1.0, 2.0}; x-axis: Environment Steps (×10^5).]
(a) Penalty coefficient β.
[Figure 3(b) panels: halfcheetah (broken back thigh), walker (no right thigh); legend: F ∈ {2, 5, 10, 20}; x-axis: Environment Steps (×10^5).]
(b) Target domain interaction interval F.
Figure 3. Parameter study of (a) reward penalty coefficient β, (b) target domain interaction interval F. Results are averaged over 5 seeds and the shaded region denotes the standard deviation.
lowing Section 5.1, only 10^5 interactions with the target domain are permitted. We employ F ∈ {2, 5, 10, 20} and summarize the results in Figure 3(b), which show that PAR
[Figure 4: runtime in hours of DARC, DARC-weight, VGDF, and PAR.]
Figure 4. Runtime comparison of different methods.
generally benefits from more source domain data incurred by a larger F, indicating that PAR can exploit dynamics-consistent transitions and realize efficient policy adaptation to another domain. We simply use F = 10 by default.
5.4. Runtime Comparison
Furthermore, we compare the runtime of PAR against baselines. All methods are run on the halfcheetah (broken back thigh) task on a single GPU. The results in Figure 4 show that PAR is highly efficient in runtime thanks to training in the latent space with one state encoder and one state-action encoder. DARC and its variant have slightly larger training costs. VGDF consumes the most training time because it trains an ensemble of dynamics models in the original state space following model-based RL (Janner et al., 2019).
6. Discussions
In this section, we discuss whether the performance of PAR is largely affected if we use another representation learning objective, and why PAR beats DARC. We also explain the validity of our assumption. We believe these discussions provide a better understanding of our method.
6.1. PAR with a Different Objective
We investigate how PAR behaves with a representation learning objective different from Equation 1. Such an objective still needs to capture the latent dynamics information. To that end, we consider the following objective, where g now receives the true state s (instead of its representation) and the action a as inputs, and no stop-gradient operator is required:
$$\mathcal{L}'(\psi, \xi) = \mathbb{E}_{(s,a,s')\sim D}\Big[\big(g_\xi(s, a) - f_\psi(s')\big)^2\Big]. \quad (7)$$
Importantly, both f and g are optimized with this objective. Equation 7 also guarantees latent dynamics consistency. We tag this variant as PAR-B. To see how PAR-B competes against vanilla PAR, we conduct experiments on four tasks with kinematic and morphology mismatch. We report their final mean performance in the target domain in Figure 5,
where only a marginal return difference is observed between PAR and PAR-B, implying that another objective can also be valid as long as it embeds the latent dynamics information.
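For reference, a one-function sketch of the PAR-B objective in Equation 7, where `g_sa` is an assumed encoder that consumes the raw state and action directly.

```python
def parb_encoder_loss(f, g_sa, s, a, s_next):
    """Equation 7 (PAR-B): no stop-gradient, and g_sa takes the raw state,
    so both encoders are trained through the same consistency term."""
    z2 = g_sa(s, a)
    return ((z2 - f(s_next)) ** 2).sum(dim=-1).mean()
```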
[Figure 5 panels: halfcheetah (broken back thigh), halfcheetah (no thighs), hopper (broken joints), hopper (big head).]
Figure 5. Performance comparison between PAR and PAR-B.
6.2. Why PAR Outperforms DARC?
It is vital to address why PAR significantly outperforms DARC on numerous online tasks, given that DARC also corrects source domain rewards (see Figure 2). We stress that DARC learns domain classifiers by leveraging both source domain data and target domain data and estimates the dynamics gap, which can be interpreted as how likely the measured source domain transition belongs to the target domain. However, if the transition deviates far from the target domain, the estimated gap $\log \frac{P_{M_{\rm tar}}(s'|s,a)}{P_{M_{\rm src}}(s'|s,a)}$ can be large and negatively affect policy learning, which is similar in spirit to DARC's overly pessimistic issue criticized by Xu et al. (2023). PAR, instead, captures representation mismatch by training encoders only with target domain data and evaluating representation deviations upon source domain data. We claim that PAR produces more appropriate reward penalties.
To verify our claim, we log the reward penalties calculated by DARC and PAR, and summarize the results in Figure 6. The reward penalty of PAR is large at first, while it decreases with more interactions, meaning that PAR uncovers more dynamics-consistent samples from the source domain. Note that the penalty from PAR tends to converge to a small number (not 0). However, the penalty from DARC is inconsistent across the two tasks, i.e., it approaches 0 on the halfcheetah task while becoming large on the walker task. The results clearly indicate that capturing representation mismatch is a better choice.
6.3. On the Rationality of the Assumption
In Assumption 4.1, we assume a one-to-one mapping between $S \times A$ and $Z$. A one-to-one mapping mathematically indicates that the mapping is injective (not necessarily surjective).
[Figure 6 panels: halfcheetah (broken back thigh), walker (no right thigh); x-axis: Environment Steps (×10^6); y-axis: Reward penalty.]
Figure 6. Reward penalty comparison between DARC and PAR. We record the average reward penalty across 5 seeds when training each method. The shaded region denotes the standard deviation.
That is, we only require that there exist a unique z in the latent space corresponding to each specific (s, a) tuple. To satisfy this assumption, we first employ a deterministic state encoder f and a deterministic state-action encoder g for representation learning, i.e., f constructs a deterministic mapping from S to Z and g is a deterministic mapping from $S \times A$ to Z. It remains to decide whether the mapped representation is unique. Note that it is the user's choice which representation learning approach and which latent representation space to use; one can surely choose a representation method and representation space that let the assumption hold. With our adopted representation learning formula in Equation 1, it is less likely that two distinct (s, a) tuples are mapped into the same latent vector, because that would indicate they share the same dynamics transition information (since Equation 1 realizes latent dynamics consistency). To further mitigate this concern, the dimension of the state-action representation in PAR is much larger (it is set to 256, as shown in Table 2 in the appendix) than that of the input state and action vectors. We believe these points explain the rationality of the assumption.
7. Conclusion and Limitations
In this paper, we study how to effectively adapt policies to another domain with dynamics discrepancies. We propose a novel algorithm, Policy Adaptation by Representation Mismatch (PAR), which captures the representation mismatch between the source domain and the target domain, and employs the resulting representation deviation to modify source domain rewards. Our method is motivated and supported by rigorous theoretical analysis. Experimental results demonstrate that PAR achieves strong performance and outperforms recent strong baselines under scenarios like kinematic shifts and morphology mismatch, regardless of whether the source domain is online or offline.
Despite the effectiveness of our method, we have to admit that there exist some limitations of our work. First, one may need to manually choose the best β in practice. Second, PAR behaves less satisfyingly on some (though not all) medium-replay source domain datasets, suggesting that it may be hard for PAR to handle datasets with large diversity.
For future work, it would be interesting to design mechanisms that adaptively tune β, and to enable PAR to consistently acquire good performance when provided with highly diverse datasets.
Acknowledgements
This work was supported by the STI 2030-Major Projects under Grant 2021ZD0201404 and the NSFC under Grant 62250068. This work was done when Jiafei Lyu worked as an intern at Tencent IEG. The authors thank Liangpeng Zhang for providing advice on the draft of this work. The authors also would like to thank the anonymous reviewers for their valuable comments on our manuscript.
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References

Arndt, K., Hazara, M., Ghadirzadeh, A., and Kyrki, V. Meta reinforcement learning for sim-to-real domain adaptation. In IEEE International Conference on Robotics and Automation, 2019.
Barekatain, M., Yonetani, R., and Hamaya, M. Multipolar: Multi-source policy aggregation for transfer reinforcement learning between diverse environmental dynamics. arXiv preprint arXiv:1909.13111, 2019.
Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., Silver, D., and Hasselt, H. V. Successor features for transfer in reinforcement learning. arXiv, abs/1606.05312, 2016.
Bengio, Y., Courville, A. C., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1798–1828, 2012.
Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M., Kalakrishnan, M., Downs, L., Ibarz, J., Pastor, P., Konolige, K., Levine, S., and Vanhoucke, V. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In IEEE International Conference on Robotics and Automation, 2018.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv, abs/1606.01540, 2016.
Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Cetin, E., Ball, P. J., Roberts, S., and Çeliktutan, O. Stabilizing off-policy deep reinforcement learning from pixels. In International Conference on Machine Learning, 2022.
Chandak, Y., Theocharous, G., Kostas, J. E., Jordan, S. M., and Thomas, P. S. Learning action representations for reinforcement learning. In International Conference on Machine Learning, 2019.
Chebotar, Y., Handa, A., Makoviychuk, V., Macklin, M., Issac, J., Ratliff, N. D., and Fox, D. Closing the simto-real loop: Adapting simulation randomization with real world experience. In International Conference on Robotics and Automation, 2018.
Clavera, I., Nagabandi, A., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learning to adapt: Meta-learning for model-based control. arXiv, abs/1803.11347, 2018.
Collins, J. J., Brown, R., Leitner, J., and Howard, D. Traversing the reality gap via simulator tuning. arXiv, abs/2003.01369, 2020.
Csiszár, I. and Körner, J. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
Cutler, M. and How, J. P. Efficient Reinforcement Learning for Robots using Informative Simulated Priors. In IEEE International Conference on Robotics and Automation, 2015.
Desai, S., Durugkar, I., Karnan, H., Warnell, G., Hanna, J., and Stone, P. An imitation from observation approach to transfer learning with dynamics mismatch. In Neural Information Processing Systems, 2020.
Du, Y., Watkins, O., Darrell, T., Abbeel, P., and Pathak, D. Auto-tuned sim-to-real transfer. In IEEE International Conference on Robotics and Automation, 2021.
Eysenbach, B., Chaudhari, S., Asawa, S., Levine, S., and Salakhutdinov, R. Off-dynamics reinforcement learning: Training for transfer with domain classifiers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=eqBwg3AcIAK.
Eysenbach, B., Zhang, T., Salakhutdinov, R., and Levine, S. Contrastive learning as goal-conditioned reinforcement learning. arXiv, abs/2206.07568, 2022.
Farchy, A., Barrett, S., MacAlpine, P., and Stone, P. Humanoid robots learning to walk faster: from the real world to simulation and back. In Adaptive Agents and Multi-Agent Systems, 2013.
Ferns, N., Panangaden, P., and Precup, D. Bisimulation metrics for continuous markov decision processes. SIAM Journal on Computing, 40(6):1662–1714, 2011.
Fickinger, A., Cohen, S., Russell, S., and Amos, B. Cross-domain imitation learning via optimal transport. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=xP3cPq2hQC.
Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv, abs/2004.07219, 2020.
Fujimoto, S. and Gu, S. S. A Minimalist Approach to Offline Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Fujimoto, S., Meger, D., and Precup, D. Off-Policy Deep Reinforcement Learning without Exploration. In International Conference on Machine Learning (ICML), 2019.
Fujimoto, S., Meger, D., and Precup, D. A deep reinforcement learning approach to marginalized importance sampling with the successor representation. In International Conference on Machine Learning, 2021.
Fujimoto, S., Chang, W.-D., Smith, E. J., Gu, S. S., Precup, D., and Meger, D. For SALE: State-action representation learning for deep reinforcement learning. arXiv, abs/2306.02451, 2023.
Gamrian, S. and Goldberg, Y. Transfer learning for related reinforcement learning tasks via image-to-image translation. In International Conference on Machine Learning, 2018.
Ge, Y., Macaluso, A., Li, E. L., Luo, P., and Wang, X. Policy adaptation from foundation model feedback. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
Golemo, F., Taïga, A. A., Courville, A. C., and Oudeyer, P.-Y. Sim-to-real transfer with neural-augmented robot simulation. In Conference on Robot Learning, 2018.
Gui, H., Pang, S., Yu, S., Qiao, S., Qi, Y., He, X., Wang, M., and Zhai, X. Cross-domain policy adaptation with dynamics alignment. Neural Networks, 167:104–117, 2023.
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905, 2018.
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for
planning from pixels. In International conference on machine learning, 2019.
Hanna, J. P., Desai, S., Karnan, H., Warnell, G. A., and Stone, P. Grounded action transformation for sim-to-real reinforcement learning. Machine Learning, 110:2469–2499, 2021.
Hansen, N., Jangir, R., Sun, Y., Alenyà, G., Abbeel, P., Efros, A. A., Pinto, L., and Wang, X. Self-supervised policy adaptation during deployment. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=o_V-MjyyGV_.
Hansen, N., Wang, X., and Su, H. Temporal Difference Learning for Model Predictive Control. In International Conference on Machine Learning, 2022.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Hejna, D. J., Abbeel, P., and Pinto, L. Hierarchically decoupled imitation for morphological transfer. Ar Xiv, abs/2003.01709, 2020.
Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., and Hutter, M. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4, 2019.
Janner, M., Fu, J., Zhang, M., and Levine, S. When to Trust Your Model: Model-Based Policy Optimization. In Advances in Neural Information Processing Systems (Neur IPS), 2019.
Jiang, Y., Li, C., Dai, W., Zou, J., and Xiong, H. Variance reduced domain randomization for reinforcement learning with policy gradient. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46:1031–1048, 2023.
Karl, M., Sölch, M., Bayer, J., and van der Smagt, P. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv, abs/1605.06432, 2016.
Kim, K., Gu, Y., Song, J., Zhao, S., and Ermon, S. Domain adaptive imitation learning. In International Conference on Machine Learning, 2019.
Kim, K., Ha, J., and Kim, Y. Self-predictive dynamics for generalization of vision-based reinforcement learning. In International Joint Conference on Artificial Intelligence, 2022.
Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6980.
Kiran, B. R., Sobh, I., Talpaert, V., Mannion, P., Sallab, A. A. A., Yogamani, S. K., and Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23:4909–4926, 2020.
Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32:1238–1274, 2013.
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big transfer (bit): General visual representation learning. In European Conference on Computer Vision, 2019.
Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv, abs/2004.13649, 2020.
Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=68n2s9ZJWF8.
Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative Q-Learning for Offline Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv, abs/2005.01643, 2020.
Liu, G., Zhang, C., Zhao, L., Qin, T., Zhu, J., Jian, L., Yu, N., and Liu, T.-Y. Return-based contrastive representation learning for reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=_TM6rT7tXke.
Liu, J., Hongyin, Z., and Wang, D. DARA: Dynamics-aware reward augmentation in offline reinforcement learning. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=9SDQB3b68K.
Liu, X., Pathak, D., and Kitani, K. M. Revolver: Continuous evolutionary models for robot-to-robot policy transfer. In International Conference on Machine Learning, 2022b.
Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In
International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJe1E2R5KX.
Lyu, J., Gong, A., Wan, L., Lu, Z., and Li, X. State advantage weighting for offline RL. In 3rd Offline RL Workshop: Offline RL as a Launchpad, 2022a. URL https://openreview.net/forum?id=2rOD_UQfvl.
Lyu, J., Li, X., and Lu, Z. Double check your state before trusting it: Confidence-aware bidirectional offline model-based imagination. In Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=3e3IQMLDSLP.
Lyu, J., Ma, X., Li, X., and Lu, Z. Mildly conservative q-learning for offline reinforcement learning. In Neural Information Processing Systems, 2022c.
Lyu, J., Wan, L., Lu, Z., and Li, X. Off-policy RL algorithms can be sample-efficient for continuous control via sample multiple reuse. arXiv, abs/2305.18443, 2023.
Machado, M. C., Barreto, A., Precup, D., and Bowling, M. Temporal abstraction in reinforcement learning with the successor representation. Journal of Machine Learning Research, 24(80):1–69, 2023.
Malik, A., Kuleshov, V., Song, J., Nemer, D., Seymour, H., and Ermon, S. Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning, 2019.
Mehta, B., Diaz, M., Golemo, F., Pal, C. J., and Paull, L. Active domain randomization. arXiv, abs/1904.04762, 2019.
Nagabandi, A., Clavera, I., Liu, S., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018.
Niu, H., Sharma, S., Qiu, Y., Li, M., Zhou, G., Hu, J., and Zhan, X. When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=zXE8iFOZKw.
Osinski, B., Jakubowski, A., Milos, P., Ziecina, P., Galias, C., Homoceanu, S., and Michalewski, H. Simulation-based reinforcement learning for real-world autonomous driving. In IEEE International Conference on Robotics and Automation, 2019.
Ota, K., Oiki, T., Jha, D., Mariyama, T., and Nikovski, D. Can increasing input dimensionality improve deep
reinforcement learning? In International conference on machine learning, 2020.
Pan, F., He, J., Tu, D., and He, Q. Trust the Model When It Is Confident: Masked Model-based Actor-Critic. In Advances in Neural Information Processing Systems, 2020.
Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2017.
Qiao, Z., Lyu, J., and Li, X. The primacy bias in model-based RL. arXiv, abs/2310.15017, 2023.
Rafailov, R., Yu, T., Rajeswaran, A., and Finn, C. Offline reinforcement learning from images with latent space models. In Conference on Learning for Dynamics & Control, 2020.
Raileanu, R., Goldstein, M., and Szlam, A. Fast adaptation to new environments via policy-dynamics value functions. In International Conference on Machine Learning, 2020.
Ramos, F. T., Possas, R., and Fox, D. BayesSim: adaptive domain randomization via probabilistic inference for robotics simulators. arXiv, abs/1906.01728, 2019.
Raychaudhuri, D. S., Paul, S., Vanbaar, J., and Roy-Chowdhury, A. K. Cross-domain imitation from observations. In International Conference on Machine Learning, 2021.
Rezaei-Shoshtari, S., Zhao, R., Panangaden, P., Meger, D., and Precup, D. Continuous MDP homomorphisms and homomorphic policy gradient. In Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Adl-fs-8OzL.
Schwarzer, M., Anand, A., Goel, R., Hjelm, R. D., Courville, A., and Bachman, P. Data-Efficient Reinforcement Learning with Self-Predictive Representations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=uCQfPZwRaUu.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529:484–489, 2016.
Slaoui, R. B., Clements, W. R., Foerster, J. N., and Toth, S. Robust domain randomization for reinforcement learning. arXiv, abs/1910.10537, 2019.
Srinivas, A., Laskin, M., and Abbeel, P. CURL: Contrastive Unsupervised Representations for Reinforcement Learning. In International Conference on Machine Learning, 2020.
Stooke, A., Lee, K., Abbeel, P., and Laskin, M. Decoupling representation learning from reinforcement learning. In International Conference on Machine Learning, 2021.
Tang, Y., Guo, Z. D., Richemond, P. H., Pires, B. A., Chandak, Y., Munos, R., Rowland, M., Azar, M. G., Lan, C. L., Lyle, C., Gyorgy, A., Thakoor, S., Dabney, W., Piot, B., Calandriello, D., and Valko, M. Understanding self-predictive learning for reinforcement learning. In International Conference on Machine Learning, 2022.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
van der Pol, E., Worrall, D. E., van Hoof, H., Oliehoek, F. A., and Welling, M. MDP homomorphic networks: Group symmetries in reinforcement learning. In Neural Information Processing Systems, 2020.
Viano, L., Huang, Y.-T., Kamalaruban, P., and Cevher, V. Robust inverse reinforcement learning under transition dynamics mismatch. In Neural Information Processing Systems, 2020.
Vuong, Q. H., Vikram, S., Su, H., Gao, S., and Christensen, H. I. How to pick the domain randomization parameters for sim-to-real transfer of reinforcement learning policies? arXiv preprint arXiv:1903.11774, 2019.
Whitney, W., Agarwal, R., Cho, K., and Gupta, A. Dynamics-aware embeddings. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJgZGeHFPH.
Wu, Z., Xie, Y., Lian, W., Wang, C., Guo, Y., Chen, J., Schaal, S., and Tomizuka, M. Zero-shot policy transfer with disentangled task representation of meta-reinforcement learning. In IEEE International Conference on Robotics and Automation, 2022.
Xie, A., Sodhani, S., Finn, C., Pineau, J., and Zhang, A. Robust policy learning over multiple uncertainty sets. In International Conference on Machine Learning, 2022.
Xiong, Z., Beck, J., and Whiteson, S. Universal morphology control via contextual modulation. In International Conference on Machine Learning, 2023.
Xu, K., Bai, C., Ma, X., Wang, D., Zhao, B., Wang, Z., Li, X., and Li, W. Cross-domain policy adaptation via value-guided data filtering. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=qdM260dXsa.
Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=_SJ-_yyes8.
Ye, W., Liu, S., Kurutach, T., Abbeel, P., and Gao, Y. Mastering Atari Games with Limited Data. In Advances in Neural Information Processing Systems, 2021.
You, H., Yang, T., Zheng, Y., Hao, J., and Taylor, M. E. Cross-domain adaptive transfer reinforcement learning based on state-action correspondence. In Uncertainty in Artificial Intelligence, 2022.
Yu, W., Liu, C. K., and Turk, G. Preparing for the unknown: Learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453, 2017.
Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=-2FCwDKRREu.
Zhang, G., Zhong, L., Lee, Y., and Lim, J. J. Policy transfer across visual and dynamics domain gaps via iterative grounding. arXiv preprint arXiv:2107.00339, 2021b.
Zhang, Q., Xiao, T., Efros, A. A., Pinto, L., and Wang, X. Learning cross-domain correspondence for control with dynamics cycle-consistency. In International Conference on Learning Representations, 2021c. URL https://openreview.net/forum?id=QIRlze3I6hX.
Zhou, W., Pinto, L., and Gupta, A. Environment probing interaction policies. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryl8-3AcFX.
Zhu, J., Xia, Y., Wu, L., Deng, J., Zhou, W., Qin, T., and Li, H. Masked contrastive representation learning for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:3421–3433, 2020.
Zhu, S., Kimmel, A., Bekris, K. E., and Boularias, A. Fast model identification via physics engines for data-efficient policy search. In International Joint Conference on Artificial Intelligence, 2017.
A. Missing Proofs
In this section, we formally present all the missing proofs from the main text. For better readability, we restate theorems in the appendix. We also need some lemmas, which can be found in Appendix B.
A.1. Proof of Theorem 4.2
Theorem A.1. For any $(s, a)$, denote its representation as $z$, and suppose $s'_{src} \sim P_{M_{src}}(\cdot|s,a)$ and $s'_{tar} \sim P_{M_{tar}}(\cdot|s,a)$. Denote $h(z; s'_{src}, s'_{tar}) = I(z; s'_{tar}) - I(z; s'_{src})$. Then measuring $h(z; s'_{src}, s'_{tar})$ is equivalent to measuring the representation deviation $D_{KL}\left(P(z|s'_{tar}) \,\|\, P(z|s'_{src})\right)$.
Proof. By the definition of mutual information, we have
\begin{align*}
h(z; s'_{src}, s'_{tar}) &= I(z; s'_{tar}) - I(z; s'_{src}) \\
&= \int_{\mathcal{Z}}\int_{\mathcal{S}} P(z, s'_{tar}) \log \frac{P(z, s'_{tar})}{P(z)P(s'_{tar})}\, dz\, ds'_{tar} - \int_{\mathcal{Z}}\int_{\mathcal{S}} P(z, s'_{src}) \log \frac{P(z, s'_{src})}{P(z)P(s'_{src})}\, dz\, ds'_{src} \\
&= \int_{\mathcal{Z}}\int_{\mathcal{S}} P(z, s'_{tar}) \log \frac{P(z|s'_{tar})}{P(z)}\, dz\, ds'_{tar} - \int_{\mathcal{Z}}\int_{\mathcal{S}} P(z, s'_{src}) \log \frac{P(z|s'_{src})}{P(z)}\, dz\, ds'_{src} \\
&= \int_{\mathcal{Z}}\int_{\mathcal{S}}\int_{\mathcal{S}} P(z, s'_{tar}, s'_{src}) \log \frac{P(z|s'_{tar})}{P(z)}\, dz\, ds'_{tar}\, ds'_{src} - \int_{\mathcal{Z}}\int_{\mathcal{S}}\int_{\mathcal{S}} P(z, s'_{src}, s'_{tar}) \log \frac{P(z|s'_{src})}{P(z)}\, dz\, ds'_{src}\, ds'_{tar} \\
&= \int_{\mathcal{Z}}\int_{\mathcal{S}}\int_{\mathcal{S}} P(z, s'_{tar}, s'_{src}) \log \frac{P(z|s'_{tar})}{P(z|s'_{src})}\, dz\, ds'_{tar}\, ds'_{src} \\
&= D_{KL}\left(P(z|s'_{tar}) \,\|\, P(z|s'_{src})\right). \qquad \text{(by the definition of the Kullback--Leibler divergence)}
\end{align*}
We then conclude that measuring the defined function $h(z; s'_{src}, s'_{tar})$ is equivalent to measuring the KL divergence between $P(z|s'_{tar})$ and $P(z|s'_{src})$, i.e., the deviation of the representations given the target domain state and the source domain state, respectively. Note that the definition of the KL divergence already involves expectations over $s'_{src}$ and $s'_{tar}$; one can equivalently write $\mathbb{E}_{s'_{src}, s'_{tar}}\left[D_{KL}\left(P(z|s'_{tar}) \,\|\, P(z|s'_{src})\right)\right]$, which does not affect the result.
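As a concrete illustration of this quantity, the minimal sketch below computes $D_{KL}\left(P(z|s'_{tar}) \,\|\, P(z|s'_{src})\right)$ under an assumed diagonal-Gaussian parameterization of the two conditional representation distributions; the Gaussian assumption, the example numbers, and the helper name `gaussian_kl` are illustrative only and are not taken from the paper.

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(N(mu_p, std_p^2) || N(mu_q, std_q^2)) for diagonal Gaussians, summed over latent dims."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return np.sum(np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5)

# Hypothetical encoder outputs for the same (s, a) conditioned on the target / source next state.
mu_tar, std_tar = np.array([0.2, -0.1]), np.array([0.3, 0.4])
mu_src, std_src = np.array([0.5,  0.3]), np.array([0.3, 0.4])

# Representation deviation D_KL(P(z|s'_tar) || P(z|s'_src)).
print(gaussian_kl(mu_tar, std_tar, mu_src, std_src))
```

By the argument above, a large value of this deviation for a source-domain transition signals that the source dynamics disagree with the target dynamics at that state-action pair.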
A.2. Proof of Theorem 4.3
Theorem A.2. Measuring the representation deviation between the source domain and the target domain is equivalent to measuring the dynamics mismatch between the two domains. Formally, we can derive that $D_{KL}\left(P(z|s'_{tar}) \,\|\, P(z|s'_{src})\right) = D_{KL}\left(P(s'_{tar}|z) \,\|\, P(s'_{src}|z)\right) + H(s'_{tar}) - H(s'_{src})$.
Proof. We would like to establish a connection between the representation deviations in the two domains and the dynamics discrepancies between the two domains. We achieve this by rewriting the defined function $h(z; s'_{src}, s'_{tar})$ as follows,
\begin{align*}
h(z; s'_{src}, s'_{tar}) &= I(z; s'_{tar}) - I(z; s'_{src}) \\
&= \int_{\mathcal{Z}}\int_{\mathcal{S}} P(z, s'_{tar}) \log \frac{P(z, s'_{tar})}{P(z)P(s'_{tar})}\, dz\, ds'_{tar} - \int_{\mathcal{Z}}\int_{\mathcal{S}} P(z, s'_{src}) \log \frac{P(z, s'_{src})}{P(z)P(s'_{src})}\, dz\, ds'_{src} \\
&= \int_{\mathcal{Z}}\int_{\mathcal{S}} P(z, s'_{tar}) \log \frac{P(s'_{tar}|z)}{P(s'_{tar})}\, dz\, ds'_{tar} - \int_{\mathcal{Z}}\int_{\mathcal{S}} P(z, s'_{src}) \log \frac{P(s'_{src}|z)}{P(s'_{src})}\, dz\, ds'_{src} \\
&= \int_{\mathcal{Z}}\int_{\mathcal{S}}\int_{\mathcal{S}} P(z, s'_{tar}, s'_{src}) \log \frac{P(s'_{tar}|z)}{P(s'_{tar})}\, dz\, ds'_{tar}\, ds'_{src} - \int_{\mathcal{Z}}\int_{\mathcal{S}}\int_{\mathcal{S}} P(z, s'_{src}, s'_{tar}) \log \frac{P(s'_{src}|z)}{P(s'_{src})}\, dz\, ds'_{src}\, ds'_{tar} \\
&= \int_{\mathcal{Z}}\int_{\mathcal{S}}\int_{\mathcal{S}} P(z, s'_{tar}, s'_{src}) \log \frac{P(s'_{tar}|z)}{P(s'_{src}|z)}\, dz\, ds'_{tar}\, ds'_{src} - \int_{\mathcal{S}} P(s'_{tar}) \log P(s'_{tar})\, ds'_{tar} + \int_{\mathcal{S}} P(s'_{src}) \log P(s'_{src})\, ds'_{src} \\
&= D_{KL}\left(P(s'_{tar}|z) \,\|\, P(s'_{src}|z)\right) + H(s'_{tar}) - H(s'_{src}).
\end{align*}
One can see that the defined function is also connected to the dynamics discrepancy term $D_{KL}\left(P(s'_{tar}|z) \,\|\, P(s'_{src}|z)\right)$, together with two entropy terms. Nevertheless, we observe that the source domain and the target domain are specified
and fixed, and their state distributions are also fixed, indicating that the entropy terms are constants. Then by using the conclusion from Theorem 4.2, we have
$$
\underbrace{D_{KL}\left(P(z|s'_{tar}) \,\|\, P(z|s'_{src})\right)}_{\text{representation deviation}}
= \underbrace{D_{KL}\left(P(s'_{tar}|z) \,\|\, P(s'_{src}|z)\right)}_{\text{dynamics deviation}}
+ \underbrace{H(s'_{tar}) - H(s'_{src})}_{\text{constants}}. \tag{8}
$$
Hence, we conclude that measuring representation deviations between two domains is equivalent to measuring the dynamics mismatch.
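In practice, Equation 8 motivates using a representation-level discrepancy as a proxy for the dynamics mismatch. The sketch below is a simplified illustration of this idea: a state encoder $f$ and a state-action encoder $g$ are trained only on target-domain transitions, and source-domain transitions are then scored by their latent prediction error. The module shapes, hidden sizes, and function names are assumptions for illustration and do not reproduce the released implementation.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """State encoder f: maps a state s to a latent representation z."""
    def __init__(self, s_dim, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))

    def forward(self, s):
        return self.net(s)

class TransitionEncoder(nn.Module):
    """State-action encoder g: predicts the latent of the next state from (f(s), a)."""
    def __init__(self, a_dim, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + a_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def encoder_loss(f, g, s, a, s_next):
    """Latent consistency loss on target-domain transitions only.
    The stop-gradient on f(s') keeps the prediction target fixed during the update."""
    pred = g(f(s), a)
    target = f(s_next).detach()
    return ((pred - target) ** 2).sum(dim=-1).mean()

def representation_mismatch(f, g, s, a, s_next):
    """Latent prediction error on source-domain transitions, used as a dynamics-mismatch signal."""
    with torch.no_grad():
        return ((g(f(s), a) - f(s_next)) ** 2).sum(dim=-1)
```

Source-domain rewards can then be penalized as $\hat{r}_{src} = r_{src} - \beta \cdot \texttt{representation\_mismatch}(\cdot)$ with a penalty coefficient $\beta$, mirroring how the representation deviation enters the bounds below as a reward penalizer.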
A.3. Proof of Theorem 4.4
Theorem A.3 (Online performance bound). Denote $M_{src}$ and $M_{tar}$ as the source domain and the target domain, respectively. Then the return difference of any policy $\pi$ between $M_{src}$ and $M_{tar}$ is bounded:
$$
\left|J_{M_{tar}}(\pi) - J_{M_{src}}(\pi)\right| \le \underbrace{\frac{2\gamma r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}}\left[\sqrt{D_{KL}\left(P(z|s'_{src}) \,\|\, P(z|s'_{tar})\right)}\right]}_{(a):\ \text{representation mismatch}}
+ \underbrace{\frac{2\gamma r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}}\left[\sqrt{\left|H(s'_{src}) - H(s'_{tar})\right|}\right]}_{(b):\ \text{state distribution deviation}}.
$$
Proof. To show this theorem, we reiterate Assumption 4.1 from the main text, i.e., the state-action pair $(s,a)$ and its corresponding representation $z$ form a one-to-one mapping from the original space $\mathcal{S} \times \mathcal{A}$ to the latent space $\mathcal{Z}$. This indicates that we can construct a pseudo probability distribution given the representation $z$ that coincides with the transition dynamics of the system, i.e., $P(s'_{src}|z) = P(s'_{src}|s,a) = P_{M_{src}}(\cdot|s,a)$ and $P(s'_{tar}|z) = P(s'_{tar}|s,a) = P_{M_{tar}}(\cdot|s,a)$ for all $s, a$.
Recall that the value function $V(s)$ estimates the expected return given the state $s$, and the state-action value function $Q(s,a)$ estimates the expected return given the state $s$ and action $a$. Since the rewards are bounded, we have $|V(s)| \le \frac{r_{\max}}{1-\gamma}$ and $|Q(s,a)| \le \frac{r_{\max}}{1-\gamma}$ for all $s, a$. We denote the value functions under policy $\pi$ and MDP $M$ as $V^{\pi}_{M}(s)$ and $Q^{\pi}_{M}(s,a)$, respectively.
By using Lemma B.1, we have
\begin{align*}
J_{M_{src}}(\pi) - J_{M_{tar}}(\pi) &= \frac{\gamma}{1-\gamma}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[\int_{\mathcal{S}} P_{M_{src}}(s'|s,a)\, V^{\pi}_{M_{tar}}(s')\, ds' - \int_{\mathcal{S}} P_{M_{tar}}(s'|s,a)\, V^{\pi}_{M_{tar}}(s')\, ds'\right] \\
&= \frac{\gamma}{1-\gamma}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[\int_{\mathcal{S}} \left(P_{M_{src}}(s'|s,a) - P_{M_{tar}}(s'|s,a)\right) V^{\pi}_{M_{tar}}(s')\, ds'\right],
\end{align*}
and therefore
\begin{align*}
\left|J_{M_{src}}(\pi) - J_{M_{tar}}(\pi)\right| &\le \frac{\gamma}{1-\gamma}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[\left|\int_{\mathcal{S}} \left(P_{M_{src}}(s'|s,a) - P_{M_{tar}}(s'|s,a)\right) V^{\pi}_{M_{tar}}(s')\, ds'\right|\right] \\
&\le \frac{\gamma}{1-\gamma}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[\int_{\mathcal{S}} \left|P_{M_{src}}(s'|s,a) - P_{M_{tar}}(s'|s,a)\right| \left|V^{\pi}_{M_{tar}}(s')\right| ds'\right] \\
&\le \frac{\gamma r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[\int_{\mathcal{S}} \left|P_{M_{src}}(s'|s,a) - P_{M_{tar}}(s'|s,a)\right| ds'\right] \\
&= \frac{2\gamma r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[D_{TV}\left(P_{M_{src}}(\cdot|s,a) \,\|\, P_{M_{tar}}(\cdot|s,a)\right)\right] \\
&= \frac{2\gamma r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[D_{TV}\left(P(s'_{src}|z) \,\|\, P(s'_{tar}|z)\right)\right] \\
&\overset{(i)}{\le} \frac{2\gamma r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[\sqrt{\tfrac{1}{2} D_{KL}\left(P(s'_{src}|z) \,\|\, P(s'_{tar}|z)\right)}\right] \\
&\le \frac{2\gamma r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[\sqrt{D_{KL}\left(P(z|s'_{src}) \,\|\, P(z|s'_{tar})\right)}\right] + \frac{2\gamma r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi}_{M_{src}}(s,a)}\left[\sqrt{\left|H(s'_{src}) - H(s'_{tar})\right|}\right],
\end{align*}
where $D_{TV}(p \,\|\, q)$ denotes the total variation distance between two distributions $p$ and $q$, the inequality $(i)$ follows from Pinsker's inequality (Csiszár & Körner, 2011), and the last step uses Equation 8 (with the roles of the source and target domains exchanged) together with the triangle inequality. Then we conclude the proof.
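For readability, the last step can be unpacked as follows; this is merely a restatement of the inequalities already invoked above, not an additional assumption:
$$
D_{KL}\left(P(s'_{src}|z) \,\|\, P(s'_{tar}|z)\right) \le D_{KL}\left(P(z|s'_{src}) \,\|\, P(z|s'_{tar})\right) + \left|H(s'_{src}) - H(s'_{tar})\right|,
\qquad
\sqrt{x + y} \le \sqrt{x} + \sqrt{y}\ \ \text{for } x, y \ge 0,
$$
so that $\sqrt{\tfrac{1}{2} D_{KL}\left(P(s'_{src}|z) \,\|\, P(s'_{tar}|z)\right)} \le \sqrt{D_{KL}\left(P(z|s'_{src}) \,\|\, P(z|s'_{tar})\right)} + \sqrt{\left|H(s'_{src}) - H(s'_{tar})\right|}$.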
A.4. Proof of Theorem 4.5
Theorem A.4 (Offline performance bound). Denote the empirical policy distribution in the offline dataset $\mathcal{D}$ collected from the source domain $M_{src}$ as $\pi_{\mathcal{D}}(a|s) := \frac{\sum_{(s,a) \in \mathcal{D}} \mathbf{1}(s,a)}{\sum_{s \in \mathcal{D}} \mathbf{1}(s)}$. Then the return difference of any policy $\pi$ between the source domain $M_{src}$ and the target domain $M_{tar}$ is bounded:
$$
\left|J_{M_{tar}}(\pi) - J_{M_{src}}(\pi)\right| \le \underbrace{\frac{4 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}},\, P_{M_{src}}}\left[D_{TV}\left(\pi_{\mathcal{D}} \,\|\, \pi\right)\right]}_{(a):\ \text{policy deviation}}
+ \underbrace{\frac{2 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}}\left[\sqrt{D_{KL}\left(P(z|s'_{src}) \,\|\, P(z|s'_{tar})\right)}\right]}_{(b):\ \text{representation mismatch}}
+ \underbrace{\frac{2 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}}\left[\sqrt{\left|H(s'_{src}) - H(s'_{tar})\right|}\right]}_{(c):\ \text{state distribution deviation}}.
$$
Proof. Since it is infeasible to directly interact with the source domain and we only have access to the empirical policy distribution $\pi_{\mathcal{D}}$ of the offline dataset, we bound the performance difference by introducing the term $J_{M_{src}}(\pi_{\mathcal{D}})$. We have
$$
J_{M_{tar}}(\pi) - J_{M_{src}}(\pi) = \underbrace{\left(J_{M_{tar}}(\pi) - J_{M_{src}}(\pi_{\mathcal{D}})\right)}_{(a)} + \underbrace{\left(J_{M_{src}}(\pi_{\mathcal{D}}) - J_{M_{src}}(\pi)\right)}_{(b)}.
$$
The term (a) depicts the performance of the learned policy in the target domain against the performance of the data-collecting policy in the offline dataset, and the term (b) measures the performance deviation between the learned policy and the behavior policy in the source domain. We first bound term (b). By using Lemma B.3, we have
\begin{align*}
J_{M_{src}}(\pi_{\mathcal{D}}) - J_{M_{src}}(\pi) &= \frac{1}{1-\gamma}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a),\, s' \sim P_{M_{src}}(\cdot|s,a)}\left[\mathbb{E}_{a' \sim \pi_{\mathcal{D}}}\left[Q^{\pi}_{M_{src}}(s',a')\right] - \mathbb{E}_{a' \sim \pi}\left[Q^{\pi}_{M_{src}}(s',a')\right]\right],
\end{align*}
and hence
\begin{align*}
\left|J_{M_{src}}(\pi_{\mathcal{D}}) - J_{M_{src}}(\pi)\right| &\le \frac{1}{1-\gamma}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a),\, s' \sim P_{M_{src}}(\cdot|s,a)}\left[\left|\mathbb{E}_{a' \sim \pi_{\mathcal{D}}}\left[Q^{\pi}_{M_{src}}(s',a')\right] - \mathbb{E}_{a' \sim \pi}\left[Q^{\pi}_{M_{src}}(s',a')\right]\right|\right] \\
&= \frac{1}{1-\gamma}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a),\, s' \sim P_{M_{src}}(\cdot|s,a)}\left[\left|\int_{\mathcal{A}} \left(\pi_{\mathcal{D}}(a'|s') - \pi(a'|s')\right) Q^{\pi}_{M_{src}}(s',a')\, da'\right|\right] \\
&\le \frac{r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a),\, s' \sim P_{M_{src}}(\cdot|s,a)}\left[\int_{\mathcal{A}} \left|\pi_{\mathcal{D}}(a'|s') - \pi(a'|s')\right| da'\right] \\
&= \frac{2 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a),\, s' \sim P_{M_{src}}(\cdot|s,a)}\left[D_{TV}\left(\pi_{\mathcal{D}}(\cdot|s') \,\|\, \pi(\cdot|s')\right)\right].
\end{align*}
It remains to bound term (a). By using Lemma B.2, we have
\begin{align*}
J_{M_{tar}}(\pi) - J_{M_{src}}(\pi_{\mathcal{D}}) = -\frac{1}{1-\gamma}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a)}\Big[ &\underbrace{\left(\mathbb{E}_{s'_{src} \sim P_{M_{src}},\, a' \sim \pi_{\mathcal{D}}}\left[Q^{\pi}_{M_{tar}}(s'_{src}, a')\right] - \mathbb{E}_{s'_{src} \sim P_{M_{src}},\, a' \sim \pi}\left[Q^{\pi}_{M_{tar}}(s'_{src}, a')\right]\right)}_{(c)} \\
+ &\underbrace{\left(\mathbb{E}_{s'_{src} \sim P_{M_{src}},\, a' \sim \pi}\left[Q^{\pi}_{M_{tar}}(s'_{src}, a')\right] - \mathbb{E}_{s'_{tar} \sim P_{M_{tar}},\, a' \sim \pi}\left[Q^{\pi}_{M_{tar}}(s'_{tar}, a')\right]\right)}_{(d)} \Big],
\end{align*}
where Lemma B.2 is applied with $M_1 = M_{src}$, $\pi_1 = \pi_{\mathcal{D}}$, $M_2 = M_{tar}$, and $\pi_2 = \pi$.
We bound term (c) as follows:
$$
|(c)| = \left|\mathbb{E}_{s'_{src} \sim P_{M_{src}}}\left[\int_{\mathcal{A}} \left(\pi_{\mathcal{D}}(a'|s'_{src}) - \pi(a'|s'_{src})\right) Q^{\pi}_{M_{tar}}(s'_{src}, a')\, da'\right]\right|
\le \mathbb{E}_{s'_{src} \sim P_{M_{src}}}\left[\int_{\mathcal{A}} \left|\pi_{\mathcal{D}}(a'|s'_{src}) - \pi(a'|s'_{src})\right| \left|Q^{\pi}_{M_{tar}}(s'_{src}, a')\right| da'\right]
\le \frac{2 r_{\max}}{1-\gamma}\, \mathbb{E}_{s' \sim P_{M_{src}}}\left[D_{TV}\left(\pi_{\mathcal{D}}(\cdot|s') \,\|\, \pi(\cdot|s')\right)\right].
$$
Finally, we bound term (d).
$$
|(d)| = \left|\int_{\mathcal{S}} \left(P_{M_{src}}(s'|s,a) - P_{M_{tar}}(s'|s,a)\right) \mathbb{E}_{a' \sim \pi}\left[Q^{\pi}_{M_{tar}}(s', a')\right] ds'\right|
\le \int_{\mathcal{S}} \left|P_{M_{src}}(s'|s,a) - P_{M_{tar}}(s'|s,a)\right| \left|\mathbb{E}_{a' \sim \pi}\left[Q^{\pi}_{M_{tar}}(s', a')\right]\right| ds'
\le \frac{r_{\max}}{1-\gamma} \int_{\mathcal{S}} \left|P_{M_{src}}(s'|s,a) - P_{M_{tar}}(s'|s,a)\right| ds'
= \frac{2 r_{\max}}{1-\gamma}\, D_{TV}\left(P_{M_{src}}(\cdot|s,a) \,\|\, P_{M_{tar}}(\cdot|s,a)\right).
$$
Then, we get the bound for term (a):
$$
\left|J_{M_{tar}}(\pi) - J_{M_{src}}(\pi_{\mathcal{D}})\right| \le \frac{2 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a),\, s' \sim P_{M_{src}}}\left[D_{TV}\left(\pi_{\mathcal{D}}(\cdot|s') \,\|\, \pi(\cdot|s')\right)\right] + \frac{2 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a)}\left[D_{TV}\left(P_{M_{src}}(\cdot|s,a) \,\|\, P_{M_{tar}}(\cdot|s,a)\right)\right].
$$
Combining the bounds for term (a) and term (b) via the triangle inequality, we have
$$
\left|J_{M_{tar}}(\pi) - J_{M_{src}}(\pi)\right| \le \frac{4 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a),\, s' \sim P_{M_{src}}}\left[D_{TV}\left(\pi_{\mathcal{D}}(\cdot|s') \,\|\, \pi(\cdot|s')\right)\right] + \frac{2 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a)}\left[D_{TV}\left(P_{M_{src}}(\cdot|s,a) \,\|\, P_{M_{tar}}(\cdot|s,a)\right)\right].
$$
Following the same procedure as in the proof of Theorem 4.4 in Appendix A.3, we convert the dynamics discrepancy term into the representation mismatch term, which yields the following bound:
$$
\left|J_{M_{tar}}(\pi) - J_{M_{src}}(\pi)\right| \le \frac{4 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a),\, s' \sim P_{M_{src}}}\left[D_{TV}\left(\pi_{\mathcal{D}}(\cdot|s') \,\|\, \pi(\cdot|s')\right)\right] + \frac{2 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a)}\left[\sqrt{D_{KL}\left(P(z|s'_{src}) \,\|\, P(z|s'_{tar})\right)}\right] + \frac{2 r_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{\rho^{\pi_{\mathcal{D}}}_{M_{src}}(s,a)}\left[\sqrt{\left|H(s'_{src}) - H(s'_{tar})\right|}\right].
$$
B. Useful Lemmas
Lemma B.1 (Telescoping lemma). Denote $M_1 = (\mathcal{S}, \mathcal{A}, P_1, r, \gamma)$ and $M_2 = (\mathcal{S}, \mathcal{A}, P_2, r, \gamma)$ as two MDPs that only differ in their transition dynamics. Then for any policy $\pi$, we have
$$
J_{M_1}(\pi) - J_{M_2}(\pi) = \frac{\gamma}{1-\gamma}\, \mathbb{E}_{\rho^{\pi}_{M_1}(s,a)}\left[\mathbb{E}_{s' \sim P_1}\left[V^{\pi}_{M_2}(s')\right] - \mathbb{E}_{s' \sim P_2}\left[V^{\pi}_{M_2}(s')\right]\right]. \tag{10}
$$
Proof. This is Lemma 4.3 in Luo et al. (2019); please refer to the proof therein.
Lemma B.2 (Extended telescoping lemma). Denote $M_1 = (\mathcal{S}, \mathcal{A}, P_1, r, \gamma)$ and $M_2 = (\mathcal{S}, \mathcal{A}, P_2, r, \gamma)$ as two MDPs that only differ in their transition dynamics. Suppose we have two policies $\pi_1, \pi_2$; then we can reach the following conclusion:
$$
J_{M_1}(\pi_1) - J_{M_2}(\pi_2) = \frac{1}{1-\gamma}\, \mathbb{E}_{\rho^{\pi_1}_{M_1}(s,a)}\left[\mathbb{E}_{s' \sim P_1,\, a' \sim \pi_1}\left[Q^{\pi_2}_{M_2}(s',a')\right] - \mathbb{E}_{s' \sim P_2,\, a' \sim \pi_2}\left[Q^{\pi_2}_{M_2}(s',a')\right]\right]. \tag{11}
$$
Proof. This is Lemma C.2 in Xu et al. (2023); please refer to the proof therein.
Lemma B.3. Denote $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ as the underlying MDP. Suppose we have two policies $\pi_1, \pi_2$; then the performance difference of these two policies in $M$ satisfies
$$
J_{M}(\pi_1) - J_{M}(\pi_2) = \frac{1}{1-\gamma}\, \mathbb{E}_{\rho^{\pi_1}_{M}(s,a),\, s' \sim P}\left[\mathbb{E}_{a' \sim \pi_1}\left[Q^{\pi_2}_{M}(s',a')\right] - \mathbb{E}_{a' \sim \pi_2}\left[Q^{\pi_2}_{M}(s',a')\right]\right]. \tag{12}
$$
Proof. Similar to (Luo et al., 2019), we use a telescoping sum to prove the result. Denote Wj as the expected return when deploying π1 in the MDP M for the first j steps and then switching to policy π2, i.e.,
$$
W_j = \mathbb{E}_{\substack{a_t \sim \pi_1\ \text{for}\ t < j, \\ a_t \sim \pi_2\ \text{for}\ t \ge j}}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right],
$$
so that $W_0 = J_M(\pi_2)$ and $\lim_{j \to \infty} W_j = J_M(\pi_1)$.
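The telescoping step can then be sketched as follows; this sketch follows the standard argument of Luo et al. (2019) under the definition of $W_j$ above, rather than reproducing the omitted derivation verbatim:
$$
J_M(\pi_1) - J_M(\pi_2) = \sum_{j=0}^{\infty} \left(W_{j+1} - W_j\right), \qquad
W_{j+1} - W_j = \gamma^{j}\, \mathbb{E}_{s_j}\left[\mathbb{E}_{a \sim \pi_1}\left[Q^{\pi_2}_{M}(s_j, a)\right] - \mathbb{E}_{a \sim \pi_2}\left[Q^{\pi_2}_{M}(s_j, a)\right]\right],
$$
where $s_j$ denotes the state visited at step $j$ when following $\pi_1$ in $M$; summing the $\gamma^j$-weighted terms and absorbing the weights into the discounted occupancy recovers the $\frac{1}{1-\gamma}$ factor in Equation 12.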
hopper (big head): The head size of the robot is modified as shown below:
# head size
Algorithm 3 Policy Adaptation by Representation Mismatch (offline version)
Input: Source domain $M_{src}$, target domain $M_{tar}$, target domain interaction interval $F$, batch size $N$, maximum gradient step $T_{\max}$, reward penalty coefficient $\beta$, normalization coefficient $\nu$, temperature (for SAC) $\alpha$, target update rate $\tau$, source domain offline dataset $\mathcal{D}_{\rm off}$.
1: Initialize policy $\pi_{\phi}$, value functions $\{Q_{\theta_i}\}_{i=1,2}$ and target networks $\{Q_{\theta'_i}\}_{i=1,2}$, source domain replay buffer $\mathcal{D}_{src} \leftarrow \mathcal{D}_{\rm off}$, and target domain replay buffer $\mathcal{D}_{tar} \leftarrow \varnothing$. Initialize the state encoder $f$ and the state-action encoder $g$ with parameters $\psi, \xi$, respectively
2: for $i = 1, 2, \ldots, T_{\max}$ do
3:   if $i \,\%\, F == 0$ then
4:     Given $s_{tar}$ in $M_{tar}$, execute $a_{tar}$ using the policy $\pi_{\phi}$ and get $(s_{tar}, a_{tar}, r_{tar}, s'_{tar})$
5:     Store the transition in the target replay buffer, $\mathcal{D}_{tar} \leftarrow \mathcal{D}_{tar} \cup \{(s_{tar}, a_{tar}, r_{tar}, s'_{tar})\}$
6:   end if
7:   Sample $N$ transitions $d_{tar} = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^{N}$ from $\mathcal{D}_{tar}$
8:   Train the encoders $f, g$ on the target domain data by minimizing $\frac{1}{N}\sum_{d_{tar}} \left(g_{\xi}(f_{\psi}(s), a) - \mathrm{SG}(f_{\psi}(s'))\right)^2$
9:   Sample $N$ transitions $d_{src} = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^{N}$ from $\mathcal{D}_{src}$
10:  Modify the source domain rewards into $\hat{r}_{src} = r_{src} - \beta \left[g_{\xi}(f_{\psi}(s_{src}), a_{src}) - f_{\psi}(s'_{src})\right]^2$
11:  Calculate target values $y = r + \gamma \left(\min_{i=1,2} Q_{\theta'_i}(s', a') - \alpha \log \pi_{\phi}(a'|s')\right)$, where $a' \sim \pi_{\phi}(\cdot|s')$
12:  Update the critics by minimizing $\frac{1}{2N}\sum_{d_{src} \cup d_{tar}} (Q_{\theta_i} - y)^2$ for $i \in \{1, 2\}$
13:  Update the actor by maximizing $\frac{\lambda}{2N}\sum_{d_{src} \cup d_{tar},\, \tilde{a} \sim \pi_{\phi}(\cdot|s)} \left[\min_{i=1,2} Q_{\theta_i}(s, \tilde{a}) - \alpha \log \pi_{\phi}(\tilde{a}|s)\right] - \frac{1}{N}\sum_{d_{src}} (\tilde{a} - a)^2$, where $\tilde{a} \sim \pi_{\phi}(\cdot|s)$ and $\lambda = \nu \big/ \frac{1}{2N}\sum_{d_{src} \cup d_{tar}} \left|\min_{i=1,2} Q_{\theta_i}(s, \tilde{a})\right|$
14:  Update the target networks: $\theta'_i \leftarrow \tau \theta_i + (1-\tau)\theta'_i$
15: end for
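For concreteness, a compact sketch of one gradient step of this offline variant is given below. It mirrors steps 8-13 in spirit but simplifies many details: the `actor.sample` interface, the use of exactly two critics, the optimizer wiring, the omission of termination masking, and the exact normalization of $\lambda$ are all assumptions, and the function names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def par_offline_step(batch_src, batch_tar, f, g, actor, critics, critic_targets,
                     enc_opt, actor_opt, critic_opt, beta=1.0, nu=2.5,
                     alpha=0.2, gamma=0.99, tau=0.005):
    """One illustrative update: encoder training on target data, penalized rewards on source data,
    SAC-style critic/actor updates with a behavior-cloning term on the source batch."""
    s_t, a_t, r_t, ns_t = batch_tar            # target-domain minibatch
    s_s, a_s, r_s, ns_s = batch_src            # source-domain minibatch

    # (step 8) latent dynamics consistency loss on target data only (stop-gradient on f(s'))
    enc_loss = ((g(f(s_t), a_t) - f(ns_t).detach()) ** 2).sum(-1).mean()
    enc_opt.zero_grad(); enc_loss.backward(); enc_opt.step()

    # (step 10) penalize source rewards by the representation mismatch
    with torch.no_grad():
        mismatch = ((g(f(s_s), a_s) - f(ns_s)) ** 2).sum(-1, keepdim=True)
        r_s = r_s - beta * mismatch

    # merge batches for the critic and actor updates
    s = torch.cat([s_s, s_t]); a = torch.cat([a_s, a_t])
    r = torch.cat([r_s, r_t]); ns = torch.cat([ns_s, ns_t])

    # (steps 11-12) SAC-style critic update with clipped double Q targets (two target critics assumed)
    with torch.no_grad():
        na, na_logp = actor.sample(ns)
        tq = torch.min(*[qt(ns, na) for qt in critic_targets]) - alpha * na_logp
        y = r + gamma * tq
    critic_loss = sum(F.mse_loss(qc(s, a), y) for qc in critics)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # (step 13) actor update: normalized Q term plus behavior cloning on the source batch
    pa, pa_logp = actor.sample(s)
    q_pi = torch.min(*[qc(s, pa) for qc in critics])
    lam = nu / q_pi.abs().mean().detach()
    bc = ((actor.sample(s_s)[0] - a_s) ** 2).mean()
    actor_loss = -(lam * (q_pi - alpha * pa_logp)).mean() + bc
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # (step 14) Polyak averaging of the target critics
    for qc, qt in zip(critics, critic_targets):
        for p, pt in zip(qc.parameters(), qt.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)
```

The $\lambda$-normalized Q term together with the behavior-cloning penalty on the source batch follows the TD3+BC-style scaling suggested by the $\nu$ coefficient in step 13; the exact normalization used in the released code may differ.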
ant (short feet): The size of the ant robot's feet on its front two legs is modified into the following parameters:
# leg 1 # leg 2