# CUP: Critic-Guided Policy Reuse

Jin Zhang¹, Siyuan Li², Chongjie Zhang¹
¹Institute for Interdisciplinary Information Sciences, Tsinghua University, China
²School of Computer Science and Technology, Harbin Institute of Technology, China
jin-zhan20@mails.tsinghua.edu.cn, lisiyuan199511@gmail.com, chongjie@tsinghua.edu.cn

The ability to reuse previous policies is an important aspect of human intelligence. To achieve efficient policy reuse, a Deep Reinforcement Learning (DRL) agent needs to decide when to reuse and which source policies to reuse. Previous methods solve this problem by introducing extra components to the underlying algorithm, such as hierarchical high-level policies over source policies, or estimations of source policies' value functions on the target task. However, training these components induces either optimization non-stationarity or heavy sampling cost, significantly impairing the effectiveness of transfer. To tackle this problem, we propose a novel policy reuse algorithm called Critic-gUided Policy reuse (CUP), which avoids training any extra components and efficiently reuses source policies. CUP utilizes the critic, a common component in actor-critic methods, to evaluate and choose source policies. At each state, CUP chooses the source policy that has the largest one-step improvement over the current target policy, and forms a guidance policy. The guidance policy is theoretically guaranteed to be a monotonic improvement over the current target policy. Then the target policy is regularized to imitate the guidance policy to perform efficient policy search. Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.

## 1 Introduction

Human intelligence can solve new tasks quickly by reusing previous policies (Guberman & Greenfield, 1991). Despite remarkable success, current Deep Reinforcement Learning (DRL) agents lack this knowledge transfer ability (Silver et al., 2017; Vinyals et al., 2019; Ceron & Castro, 2021), leading to enormous computation and sampling cost. As a consequence, a large number of works have studied the problem of policy reuse in DRL, i.e., how to efficiently reuse source policies to speed up target policy learning (Fernández & Veloso, 2006; Barreto et al., 2018; Li et al., 2019; Yang et al., 2020b).

A fundamental challenge in policy reuse is: how does an agent with access to multiple source policies decide when and where to use them (Fernández & Veloso, 2006; Kurenkov et al., 2020; Cheng et al., 2020)? Previous methods solve this problem by introducing additional components to the underlying DRL algorithm, such as hierarchical high-level policies over source policies (Li et al., 2018, 2019; Yang et al., 2020b), or estimations of source policies' value functions on the target task (Barreto et al., 2017, 2018; Cheng et al., 2020). However, training these components significantly impairs the effectiveness of transfer, as hierarchical structures induce optimization non-stationarity (Pateria et al., 2021), and estimating the value functions of every source policy is computationally expensive and incurs a high sampling cost. Thus, the objective of this study is to address the question:

*Can we achieve efficient transfer without training additional components?*
Notice that actor-critic methods (Lillicrap et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018) learn a critic that approximates the actor's Q function and serves as a natural way to evaluate policies. Based on this observation, we propose a novel policy reuse algorithm that utilizes the critic to choose source policies. The proposed algorithm, called Critic-gUided Policy reuse (CUP), avoids training any additional components and achieves efficient transfer. At each state, CUP chooses the source policy that has the largest one-step improvement over the current target policy, thus forming a guidance policy. Then CUP guides learning by regularizing the target policy to imitate the guidance policy. This approach has the following advantages. First, the one-step improvement can be estimated simply by querying the critic, so no additional components need to be trained. Second, the guidance policy is theoretically guaranteed to be a monotonic improvement over the current target policy, which ensures that CUP can reuse the source policies to improve the current target policy. Finally, CUP is conceptually simple and easy to implement, introducing very few hyper-parameters to the underlying algorithm.

We evaluate CUP on Meta-World (Yu et al., 2020), a popular reinforcement learning benchmark composed of multiple robot arm manipulation tasks. Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.

## 2 Preliminaries

Reinforcement learning (RL) deals with Markov Decision Processes (MDPs). An MDP can be modelled by a tuple $(S, A, r, p, \gamma)$, with state space $S$, action space $A$, reward function $r(s, a)$, transition function $p(s'|s, a)$, and discount factor $\gamma$ (Sutton & Barto, 2018). In this study, we focus on MDPs with continuous action spaces. RL's objective is to find a policy $\pi(a|s)$ that maximizes the cumulative discounted return $R(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.

While CUP is generally applicable to a wide range of actor-critic algorithms, in this work we use SAC (Haarnoja et al., 2018) as the underlying algorithm. The soft Q function and soft V function (Haarnoja et al., 2017) of a policy $\pi$ are defined as:

$$Q_\pi(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim p(\cdot|s,a)}\left[ V_\pi(s') \right], \tag{1}$$
$$V_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ Q_\pi(s, a) - \alpha \log \pi(a|s) \right], \tag{2}$$

where $\alpha > 0$ is the entropy weight. SAC's loss functions are defined as:

$$L_{critic}(Q_\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[ \left( Q_\theta(s, a) - \left( r + \gamma V_{\bar{\theta}}(s') \right) \right)^2 \right],$$
$$L_{actor}(\pi_\phi) = \mathbb{E}_{s \sim D}\, \mathbb{E}_{a \sim \pi_\phi(\cdot|s)}\left[ \alpha \log \pi_\phi(a|s) - Q_\theta(s, a) \right], \tag{3}$$
$$L_{entropy}(\alpha) = \mathbb{E}_{s \sim D}\, \mathbb{E}_{a \sim \pi_\phi(\cdot|s)}\left[ -\alpha \log \pi_\phi(a|s) - \alpha H \right],$$

where $D$ is the replay buffer, $H$ is a hyper-parameter representing the target entropy, $\theta$ and $\phi$ are network parameters, $\bar{\theta}$ are the target network's parameters, and $V_{\bar{\theta}}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ Q_{\bar{\theta}}(s, a) - \alpha \log \pi(a|s) \right]$ is the target soft value function.

We define the soft expected advantage of an action probability distribution $\pi_i(\cdot|s)$ over a policy $\pi_j$ at state $s$ as:

$$EA_{\pi_j}(s, \pi_i) = \mathbb{E}_{a \sim \pi_i(\cdot|s)}\left[ Q_{\pi_j}(s, a) - \alpha \log \pi_i(a|s) \right] - V_{\pi_j}(s). \tag{4}$$

$EA_{\pi_j}(s, \pi_i)$ measures the one-step performance improvement brought by following $\pi_i$ instead of $\pi_j$ at state $s$, and following $\pi_j$ afterwards.

The field of policy reuse focuses on solving a target MDP $M$ efficiently by transferring knowledge from a set of source policies $\{\pi_1, \pi_2, \ldots, \pi_n\}$. We denote the target policy learned on $M$ at iteration $t$ as $\pi^t_{tar}$, and its corresponding soft Q function as $Q_{\pi^t_{tar}}$. In this work, we assume that the source policies and the target policy share the same state and action spaces.
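To make Eq. (4) concrete, the sketch below estimates the soft expected advantage from a learned critic by Monte-Carlo sampling. It is an illustrative sketch rather than the authors' implementation; the `critic(states, actions)` and `policy.sample(states)` interfaces, as well as the `num_samples` parameter, are assumptions made for this example.

```python
# Illustrative sketch (not the authors' code): estimating the soft expected
# advantage EA_{pi_j}(s, pi_i) of Eq. (4) with a learned critic. The interfaces
# critic(states, actions) -> Q(s, a) and policy.sample(states) -> (a, log_prob)
# are assumptions for the purpose of this example.
import torch


def soft_value_estimate(critic, policy, states, alpha, num_samples=4):
    """Monte-Carlo estimate of E_{a~pi(.|s)}[Q(s, a) - alpha * log pi(a|s)]."""
    vals = []
    for _ in range(num_samples):
        actions, log_probs = policy.sample(states)
        vals.append(critic(states, actions) - alpha * log_probs)
    return torch.stack(vals).mean(dim=0)  # shape: (batch,)


def soft_expected_advantage(critic, policy_i, policy_j, states, alpha, num_samples=4):
    """EA_{pi_j}(s, pi_i): one-step gain of following pi_i instead of pi_j at s."""
    return (soft_value_estimate(critic, policy_i, states, alpha, num_samples)
            - soft_value_estimate(critic, policy_j, states, alpha, num_samples))
```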
## 3 Critic-Guided Policy Reuse

This section presents CUP, an efficient policy reuse algorithm that does not require training any additional components. CUP is built upon actor-critic methods. In each iteration, CUP uses the critic to form a guidance policy from the source policies and the current target policy. Then CUP guides policy search by regularizing the target policy to imitate the guidance policy. Section 3.1 presents how to form a guidance policy by aggregating source policies through the critic, and proves that the guidance policy is guaranteed to be a monotonic improvement over the current target policy. We also prove that the target policy is theoretically guaranteed to improve by imitating the guidance policy. Section 3.2 presents the overall framework of CUP.

### 3.1 Critic-Guided Source Policy Aggregation

CUP utilizes the action probabilities proposed by source policies to improve the current target policy, and forms a guidance policy. At iteration $t$ of target policy learning, for each state $s$, the agent has access to a set of candidate action probability distributions proposed by the $n$ source policies and the current target policy: $\Pi^s_t = \{\pi_1(\cdot|s), \pi_2(\cdot|s), \ldots, \pi_n(\cdot|s), \pi^t_{tar}(\cdot|s)\}$. The guidance policy $\pi^t_g$ can be formed by combining the action probability distributions that have the largest soft expected advantage over $\pi^t_{tar}$ at each state $s$:

$$\pi^t_g(\cdot|s) = \arg\max_{\pi(\cdot|s) \in \Pi^s_t} EA_{\pi^t_{tar}}(s, \pi) = \arg\max_{\pi(\cdot|s) \in \Pi^s_t} \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ Q_{\pi^t_{tar}}(s, a) - \alpha \log \pi(a|s) \right] \quad \text{for all } s \in S. \tag{5}$$

The second equality holds because adding $V_{\pi^t_{tar}}(s)$ to all soft expected advantages does not affect the result of the $\arg\max$ operator. Eq. (5) implies that at each state, we can choose which source policy to follow simply by querying its expected soft Q value under $\pi^t_{tar}$. Note that with function approximation, the exact soft Q value cannot be acquired. The following theorem enables us to form the guidance policy with an approximated soft Q function, and guarantees that the guidance policy is a monotonic improvement over the current target policy.

**Theorem 1** Let $\tilde{Q}_{\pi^t_{tar}}$ be an approximation of $Q_{\pi^t_{tar}}$ such that

$$\left| \tilde{Q}_{\pi^t_{tar}}(s, a) - Q_{\pi^t_{tar}}(s, a) \right| \le \epsilon \quad \text{for all } s \in S, a \in A. \tag{6}$$

Define

$$\tilde{\pi}^t_g(\cdot|s) = \arg\max_{\pi(\cdot|s) \in \Pi^s_t} \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ \tilde{Q}_{\pi^t_{tar}}(s, a) - \alpha \log \pi(a|s) \right] \quad \text{for all } s \in S. \tag{7}$$

Then

$$V_{\tilde{\pi}^t_g}(s) \ge V_{\pi^t_{tar}}(s) - \frac{2\epsilon}{1 - \gamma} \quad \text{for all } s \in S. \tag{8}$$

Theorem 1 provides a way to choose source policies using an approximation of the current target policy's soft Q value. As SAC learns such an approximation, the guidance policy can be formed without training any additional components. The next question is: how do we incorporate the guidance policy $\tilde{\pi}^t_g$ into target policy learning? The following theorem demonstrates that policy improvement can be guaranteed if the target policy is optimized to stay close to the guidance policy.

**Theorem 2** If

$$D_{KL}\left( \pi^{t+1}_{tar}(\cdot|s) \,\|\, \tilde{\pi}^t_g(\cdot|s) \right) \le \delta \quad \text{for all } s \in S, \tag{9}$$

then

$$V_{\pi^{t+1}_{tar}}(s) \ge V_{\pi^t_{tar}}(s) - \frac{\sqrt{2 \ln 2\, \delta}\,\left( \tilde{R}_{max} + \alpha H^{t+1}_{max} \right)}{(1 - \gamma)^2} - \frac{2\epsilon + \alpha \tilde{H}_{max}}{1 - \gamma} \quad \text{for all } s \in S, \tag{10}$$

where $\tilde{R}_{max} = \max_{s,a} |r(s, a)|$ is the largest possible absolute value of the reward, $H^{t+1}_{max} = \max_s H(\pi^{t+1}_{tar}(\cdot|s))$ is the largest entropy of $\pi^{t+1}_{tar}$, and $\tilde{H}_{max} = \max_s \left| H(\pi^t_{tar}(\cdot|s)) - H(\pi^{t+1}_{tar}(\cdot|s)) \right|$ is the largest possible absolute difference of the policy entropy.

According to Theorem 2, the target policy can be improved by minimizing the KL divergence between the target policy and the guidance policy. Thus we can use the KL divergence as an auxiliary loss to guide target policy learning. Proofs of this section are deferred to Appendix B.1 and Appendix B.2.
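As a concrete illustration of Eq. (7), the following sketch forms the guidance policy for a batch of states by scoring every candidate with the approximate critic and taking a per-state argmax. It reuses the hypothetical `soft_value_estimate` helper sketched in Section 2 and assumes the same illustrative policy/critic interfaces; it is not the authors' released code.

```python
# Illustrative sketch of Eq. (7): per-state selection of the guidance policy.
# Candidates are the n source policies plus the current target policy; for each
# state we pick the candidate with the largest estimated E_a[Q~(s, a) - alpha *
# log pi(a|s)], which is equivalent to the largest soft expected advantage.
# Interfaces are assumed as in the earlier soft_value_estimate sketch.
import torch


def form_guidance_indices(critic, source_policies, target_policy, states, alpha,
                          num_samples=4):
    candidates = list(source_policies) + [target_policy]
    scores = torch.stack(
        [soft_value_estimate(critic, pi, states, alpha, num_samples)
         for pi in candidates],
        dim=1,
    )  # shape: (batch, n + 1)
    # Index n corresponds to the target policy itself (i.e., no source policy
    # improves on it at that state).
    return scores.argmax(dim=1)
```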
Theorem 1 and Theorem 2 can be extended to common hard value functions (deferred to Appendix B.3), so CUP is also applicable to actor-critic algorithms that use hard Bellman updates, such as A3C (Mnih et al., 2016).

### 3.2 CUP Framework

Figure 1: CUP framework. In each iteration, CUP first forms a guidance policy by querying the critic, then guides policy learning by adding a KL regularization to policy search.

In this subsection we present the overall framework of CUP. As shown in Fig. 1, at each iteration $t$, CUP first forms a guidance policy $\tilde{\pi}^t_g$ according to Eq. (7), then provides additional guidance to policy search by regularizing the target policy $\pi^{t+1}_{tar}$ to imitate $\tilde{\pi}^t_g$ (Wu et al., 2019; Fujimoto & Gu, 2021). Specifically, CUP minimizes the following loss to optimize $\pi^{t+1}_{tar}$:

$$L_{CUP}(\pi^{t+1}_{tar}) = L_{actor}(\pi^{t+1}_{tar}) + \mathbb{E}_{s \sim D}\left[ \beta_s\, D_{KL}\left( \pi^{t+1}_{tar}(\cdot|s) \,\|\, \tilde{\pi}^t_g(\cdot|s) \right) \right], \tag{11}$$

where $L_{actor}$ is the original actor loss defined in Eq. (3), and $\beta_s > 0$ is a hyper-parameter controlling the weight of regularization. In practice, we find that using a fixed regularization weight has two problems. First, it is difficult to balance the scale between $L_{actor}$ and the regularization term, because $L_{actor}$ grows as the Q value gets larger. Second, a fixed weight cannot reflect the agent's confidence in $\tilde{\pi}^t_g$. For example, when no source policies have positive soft expected advantages, $\tilde{\pi}^t_g = \pi^t_{tar}$. Then the agent should not imitate $\tilde{\pi}^t_g$ anymore, as $\tilde{\pi}^t_g$ cannot provide any guidance to further improve performance. Noticing that the soft expected advantage serves as a natural confidence measure, we weight the KL divergence with the corresponding soft expected advantage at that state:

$$\beta_s = \beta_1 \min\left( \widetilde{EA}_{\pi^t_{tar}}(s, \tilde{\pi}^t_g),\; \beta_2 \left| \tilde{V}_{\pi^t_{tar}}(s) \right| \right), \tag{12}$$

where $\widetilde{EA}_{\pi^t_{tar}}(s, \tilde{\pi}^t_g) = \mathbb{E}_{a \sim \tilde{\pi}^t_g(\cdot|s)}\left[ \tilde{Q}_{\pi^t_{tar}}(s, a) - \alpha \log \tilde{\pi}^t_g(a|s) \right] - \tilde{V}_{\pi^t_{tar}}(s)$ is the approximated soft expected advantage, $\beta_1, \beta_2 > 0$ are two hyper-parameters, and $\tilde{V}_{\pi^t_{tar}}(s) = \mathbb{E}_{a \sim \pi^t_{tar}(\cdot|s)}\left[ \tilde{Q}_{\pi^t_{tar}}(s, a) - \alpha \log \pi^t_{tar}(a|s) \right]$ is the approximated soft value function. This adaptive regularization weight automatically balances the two losses, and ignores the regularization term at states where $\tilde{\pi}^t_g$ cannot improve over $\pi^t_{tar}$ anymore. We further clip the expected advantage from above with $\beta_2 |\tilde{V}_{\pi^t_{tar}}(s)|$ to avoid the agent becoming overly confident in $\tilde{\pi}^t_g$ due to the function approximation error $\epsilon$. CUP's pseudo-code is presented in Alg. 1 (the modifications CUP makes to SAC are marked in red in the original paper). Additional implementation details are deferred to Appendix D.1.

Algorithm 1 CUP

    Require: Source policies {π_1, π_2, ..., π_n}, hyper-parameters λ_{θ_1}, λ_{θ_2}, λ_π, λ_α, τ, H, β_1, β_2
    Initialize replay buffer D
    Initialize actor π_φ, entropy weight α, critics Q_{θ_1}, Q_{θ_2}, target networks Q_{θ̄_1} ← Q_{θ_1}, Q_{θ̄_2} ← Q_{θ_2}
    while not done do
        for each environment step do
            a_t ~ π_φ,  s_{t+1} ~ p(s_{t+1}|s_t, a_t)
            D ← D ∪ {s_t, a_t, r(s_t, a_t), s_{t+1}}
        end for
        for each gradient step do
            Sample minibatch b from D
            Query source policies' action probabilities {π_1(·|s), π_2(·|s), ..., π_n(·|s)} for states in b
            Compute expected advantages according to Eq. (4), form π̃^t_g according to Eq. (7)
            θ_i ← θ_i − λ_{θ_i} ∇̂_{θ_i} L_critic(Q_{θ_i})  for i ∈ {1, 2}
            φ ← φ − λ_π ∇̂_φ L_CUP(π_φ)
            α ← α − λ_α ∇̂_α L_entropy(α)
            θ̄_i ← τ θ_i + (1 − τ) θ̄_i  for i ∈ {1, 2}
        end for
    end while

## 4 Experiments

We evaluate CUP on Meta-World (Yu et al., 2020), a popular reinforcement learning benchmark composed of multiple robot manipulation tasks.
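For completeness, here is a hedged sketch of how the regularized actor update of Eqs. (11) and (12) might look on top of a SAC-style actor loss. It is not the released implementation: the `guidance_policy` wrapper (assumed to dispatch each state to the candidate selected by the critic, as in the earlier selection sketch), the `rsample`/`log_prob` interfaces, and the single-sample estimates of the value and KL terms are all assumptions made for illustration.

```python
# Illustrative sketch of the CUP actor objective (Eqs. (11)-(12)), not the
# authors' released code. target_policy.rsample (reparameterized sampling),
# guidance_policy.sample / guidance_policy.log_prob, and critic are assumed
# interfaces; the KL term is estimated from samples of the target policy.
import torch


def cup_actor_loss(critic, target_policy, guidance_policy, states, alpha,
                   beta1, beta2):
    # SAC actor loss: E_s E_{a~pi_phi}[alpha * log pi_phi(a|s) - Q(s, a)]
    actions, log_probs = target_policy.rsample(states)
    sac_loss = (alpha * log_probs - critic(states, actions)).mean()

    with torch.no_grad():
        # Rough estimates of the target policy's soft value and the guidance
        # policy's soft expected advantage, used as the adaptive weight beta_s
        # of Eq. (12); treated as constants for the gradient step.
        v_tar = critic(states, actions) - alpha * log_probs
        g_actions, g_log_probs = guidance_policy.sample(states)
        ea_g = critic(states, g_actions) - alpha * g_log_probs - v_tar
        beta_s = beta1 * torch.minimum(ea_g, beta2 * v_tar.abs())

    # Sample-based KL(pi_tar || pi_g) ~ E_{a~pi_tar}[log pi_tar(a|s) - log pi_g(a|s)]
    kl_estimate = log_probs - guidance_policy.log_prob(states, actions)
    return sac_loss + (beta_s * kl_estimate).mean()
```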
These tasks are both correlated (performed by the same Sawyer robot arm) and distinct (interacting with different objects and having different reward functions), and serve as a proper evaluation benchmark for policy reuse. The source policies are obtained by training on three representative tasks: Reach, Push, and Pick-Place. We choose several complex tasks as target tasks, including Hammer, Peg-Insert-Side, Push-Wall, Pick-Place-Wall, Push-Back, and Shelf-Place. Among these target tasks, Hammer and Peg-Insert-Side require interacting with objects unseen in the source tasks. In Push-Wall and Pick-Place-Wall, there is a wall between the object and the goal. In Push-Back, the goal distribution is different from Push. In Shelf-Place, the robot is required to put a block on a shelf, and the shelf is unseen in the source tasks. Video demonstrations of these tasks are available at https://meta-world.github.io/. Similar to the settings in Yang et al. (2020a), in our experiments the goal position is randomly reset at the start of every episode. Code is available at https://github.com/NagisaZj/CUP.

### 4.1 Transfer Performance on Meta-World

We compare against several representative baseline algorithms, including HAAR (Li et al., 2019), PTF (Yang et al., 2020b), MULTIPOLAR (Barekatain et al., 2021), and MAMBA (Cheng et al., 2020). Among these algorithms, HAAR and PTF learn hierarchical high-level policies over source policies. MAMBA aggregates source policies' V functions to form a baseline function, and performs policy improvement over the baseline function. MULTIPOLAR learns a weighted sum of source policies' action probabilities, and learns an additional network to predict residuals. We also compare against the original SAC algorithm. All results are averaged over six random seeds.

Figure 2: Evaluation of CUP and several baselines on various Meta-World tasks. Dashed areas represent 95% bootstrapped confidence intervals. CUP achieves substantially better performance than baseline algorithms.

As shown in Figure 2, CUP is the only algorithm that achieves efficient transfer on all six tasks, significantly outperforming the original SAC algorithm. HAAR has good jump-start performance on Push-Wall and Pick-Place-Wall, but fails to improve further due to the optimization non-stationarity induced by jointly training high-level and low-level policies. MULTIPOLAR achieves comparable performance on Push-Wall and Peg-Insert-Side, because the Push source policy is useful on Push-Wall (implied by HAAR's good jump-start performance), and learning residuals on Peg-Insert-Side is easier (implied by SAC's fast learning). In Pick-Place-Wall, the Pick-Place source policy is useful, but the residual is difficult to learn, so MULTIPOLAR does not work. For the remaining three tasks, the source policies are less useful, and MULTIPOLAR fails on these tasks. PTF fails because its hierarchical policy only gets updated when the agent chooses actions similar to one of the source policies, which is quite rare when the source and target tasks are distinct. MAMBA fails because estimating all source policies' V functions accurately is sample-inefficient. Algorithm performance evaluated by success rate is deferred to Appendix E.1.

### 4.2 Analyzing the Guidance Policy

This subsection provides visualizations of CUP's source policy selection. Fig. 3 shows the percentages of each source policy being selected throughout training on Push-Wall.
At early stages of training, the source policies are selected frequently as they have positive expected advantages, which means that they can be used to improve the current target policy. As training proceeds and the target policy becomes better, the source policies are selected less frequently. Among the three source policies, Push is chosen more frequently than the other two, as it is more related to the target task. Figure 4 presents the source policies' expected advantages over an episode at convergence in Pick-Place-Wall. The Push source policy and the Reach source policy almost always have negative expected advantages, which implies that these two source policies can hardly improve the current target policy anymore. Meanwhile, the Pick-Place source policy has expected advantages close to zero after 100 environment steps, which implies that the Pick-Place source policy is close to the target policy at these steps. Analyses on all six tasks as well as analyses of HAAR's source policy selection are deferred to Appendix E.2 and Appendix E.6, respectively.

Figure 3: Percentages of source policies being selected by CUP during training on Push-Wall. The green dashed line represents the target policy's success rate on the task.

Figure 4: Expected advantages of source policies at convergence on Pick-Place-Wall. The horizontal axis represents the environment steps of an episode.

### 4.3 Ablation Study

This subsection evaluates CUP's sensitivity to hyper-parameter settings and to the number of source policies. We also evaluate CUP's robustness against random source policies, which do not provide meaningful candidate actions for solving target tasks.

#### 4.3.1 Hyper-Parameter Sensitivity

For all the experiments in Section 4.1, we use the same set of hyper-parameters, which indicates that CUP is generally applicable to a wide range of tasks without task-specific fine-tuning. CUP introduces only two additional hyper-parameters to the underlying SAC algorithm, and we further test CUP's sensitivity to these additional hyper-parameters. As shown in Fig. 5, CUP is generally robust to the choice of hyper-parameters and achieves stable performance.

#### 4.3.2 Number of Source Policies

We evaluate CUP as well as the baseline algorithms on a larger source policy set. We add three policies to the original source policy set, which solve three simple tasks: Drawer-Close, Push-Wall, and Coffee-Button. This forms a source policy set composed of six policies. As shown in Fig. 6, CUP is still the only algorithm that solves all six target tasks efficiently. MULTIPOLAR suffers from a decrease in performance, which indicates that learning the weighted sum of source policies' actions becomes more difficult as the number of source policies grows. The remaining baseline algorithms perform similarly to their counterparts using three source policies. Fig. 7 provides a more direct comparison of CUP's performance with different numbers of source policies. CUP is able to utilize the additional source policies to further improve its performance, especially on Pick-Place-Wall and Peg-Insert-Side. Further detailed analysis is deferred to Appendix E.3.

#### 4.3.3 Interference of Random Source Policies

To evaluate the efficiency of CUP's critic-guided source policy aggregation, we add random policies to the set of source policies. As shown in Fig. 8(a), adding up to 3 random source policies does not affect CUP's performance.
This indicates that CUP can efficiently choose which source policy to follow even when many of the source policies are not meaningful. Adding 4 or 5 random source policies leads to a slight drop in performance. This drop occurs because, as the number of random policies grows, more random actions are sampled, and taking the argmax over these actions' expected advantages is more likely to be affected by errors in value estimation.

Figure 5: Ablation studies on a wide range of hyper-parameters. CUP performs well on a wide range of hyper-parameters.

Figure 6: Performance of CUP and baseline algorithms on various Meta-World tasks, with a set of six source policies.

To further investigate CUP's ability to ignore unsuitable source policies, we design another transfer setting that consists of two additional source policy sets. The first set consists of three random policies that are useless for the target task, and the second set adds the Reach policy to the first set. As demonstrated in Fig. 8(b), when none of the source policies are useful, CUP performs similarly to the original SAC, and its sample efficiency is almost unaffected by the useless source policies. When there exists a useful source policy, CUP can efficiently utilize it to improve performance, even if there are many useless source policies.

Figure 7: Comparison of CUP's performance with different numbers of source policies.

Figure 8: Ablation studies on CUP's sensitivity to useless source policies. (a) Adding up to 3 random policies to the source policy set does not affect CUP's performance. (b) Ablation study in a setting where most source policies are useless. If none of the source policies are useful (3 Random Sources), CUP performs similarly to the original SAC. Even if only one of the four source policies is useful (3 Random Sources + Reach), CUP is still able to efficiently utilize the useful source policy to improve learning performance.

## 5 Related Work

**Policy reuse.** A series of works on policy reuse utilize source policies for exploration in value-based algorithms (Fernández & Veloso, 2006; Li & Zhang, 2018; Gimelfarb et al., 2021), but they are not applicable to policy gradient methods due to the off-policyness problem (Fujimoto et al., 2019). AC-Teach (Kurenkov et al., 2020) mitigates this problem by improving the actor over the behavior policy's value estimation, but still fails in more complex tasks. One branch of methods trains hierarchical high-level policies over source policies. CAPS (Li et al., 2018) guarantees the optimality of the hierarchical policies by adding primitive skills to the low-level policy set, but is inapplicable to MDPs with continuous action spaces. HAAR (Li et al., 2019) fine-tunes low-level policies to ensure optimality, but joint training of high-level and low-level policies induces optimization non-stationarity (Pateria et al., 2021). PTF (Yang et al., 2020b) trains a hierarchical policy, which is imitated by the target policy. However, the hierarchical policy only gets updated when the target policy chooses actions similar to one of the source policies, so PTF fails in complex tasks with large action spaces. Another branch of works aggregates source policies via their Q functions or V functions on the target task. Barreto et al. (2017) and Barreto et al. (2018) focus on the situation where source tasks and target tasks share the same dynamics, and aggregate source policies by choosing the policy that has the largest Q value at each state.
They use successor features to mitigate the heavy computational cost of estimating Q functions for all source policies. MAMBA (Cheng et al., 2020) forms a baseline function by aggregating source policies' V functions, and guides policy search by improving the policy over the baseline function. Finally, MULTIPOLAR (Barekatain et al., 2021) learns a weighted sum over source policies' actions, and learns an auxiliary network to predict residuals around the aggregated actions. MULTIPOLAR is computationally expensive, as it requires querying all the source policies at every sampling step. Our proposed method, CUP, focuses on the setting of learning continuous-action MDPs with actor-critic methods. CUP is both computationally and sample efficient, as it does not require training any additional components.

**Policy regularization.** Adding regularization to policy optimization is a common approach to inducing prior knowledge into policy learning. Distral (Teh et al., 2017) achieves inter-task transfer by imitating an average policy distilled from policies of related tasks. In offline RL, policy regularization serves as a common technique to keep the policy close to the behavior policy used to collect the dataset (Wu et al., 2019; Nair et al., 2020; Fujimoto & Gu, 2021). CUP uses policy regularization as a means to provide additional guidance to policy search with the guidance policy.

## 6 Conclusion

In this study, we address the problem of reusing source policies without training any additional components. By utilizing the critic as a natural evaluation of source policies, we propose CUP, an efficient policy reuse algorithm that requires no additional components. CUP is conceptually simple, easy to implement, and has theoretical guarantees. Empirical results demonstrate that CUP achieves efficient transfer on a wide range of tasks. As for future work, CUP assumes that all source policies and the target policy share the same state and action spaces, which limits CUP's application to more general scenarios. One possible future direction is to take inspiration from previous works that map the state and action spaces of an MDP to another MDP with a similar high-level structure (Wan et al., 2020; Zhang et al., 2020; Heng et al., 2022; van der Pol et al., 2020b,a). Another interesting direction is to incorporate CUP into the continual learning setting (Rolnick et al., 2019; Khetarpal et al., 2020), in which an agent gradually enriches its source policy set in an online manner.

## Acknowledgements

This work is supported in part by the Science and Technology Innovation 2030 New Generation Artificial Intelligence Major Project (No. 2018AAA0100904), the National Natural Science Foundation of China (62176135), and the China Academy of Launch Vehicle Technology (CALT2022-18).

## References

Barekatain, M., Yonetani, R., and Hamaya, M. Multipolar: Multi-source policy aggregation for transfer reinforcement learning between diverse environmental dynamics. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 3108-3116, 2021.

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Zidek, A., and Munos, R. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, pp. 501-510. PMLR, 2018.
Ceron, J. S. O. and Castro, P. S. Revisiting Rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In International Conference on Machine Learning, pp. 1373-1383. PMLR, 2021.

Cheng, C.-A., Kolobov, A., and Agarwal, A. Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems, 33:5587-5598, 2020.

Fedotov, A. A., Harremoës, P., and Topsoe, F. Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6):1491-1498, 2003.

Fernández, F. and Veloso, M. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 720-727, 2006.

Fujimoto, S. and Gu, S. S. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.

Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587-1596. PMLR, 2018.

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052-2062. PMLR, 2019.

Gimelfarb, M., Sanner, S., and Lee, C.-G. Contextual policy transfer in reinforcement learning domains via deep mixtures-of-experts. In Uncertainty in Artificial Intelligence, pp. 1787-1797. PMLR, 2021.

Guberman, S. R. and Greenfield, P. M. Learning and transfer in everyday cognition. Cognitive Development, 6(3):233-260, 1991.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352-1361. PMLR, 2017.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

Heng, Y., Yang, T., Zheng, Y., Hao, J., and Taylor, M. E. Cross-domain adaptive transfer reinforcement learning based on state-action correspondence. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.

Khetarpal, K., Riemer, M., Rish, I., and Precup, D. Towards continual reinforcement learning: A review and perspectives. arXiv preprint arXiv:2012.13490, 2020.

Kurenkov, A., Mandlekar, A., Martin-Martin, R., Savarese, S., and Garg, A. AC-Teach: A Bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. In Conference on Robot Learning, pp. 717-734. PMLR, 2020.

Li, S. and Zhang, C. An optimal online method of selecting source policies for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Li, S., Gu, F., Zhu, G., and Zhang, C. Context-aware policy reuse. arXiv preprint arXiv:1806.03793, 2018.

Li, S., Wang, R., Tang, M., and Zhang, C. Hierarchical reinforcement learning with advantage-based auxiliary rewards. Advances in Neural Information Processing Systems, 32, 2019.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In ICLR (Poster), 2016.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937. PMLR, 2016.

Nair, A., Gupta, A., Dalal, M., and Levine, S. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

Ostrovski, G., Castro, P. S., and Dabney, W. The difficulty of passive learning in deep reinforcement learning. Advances in Neural Information Processing Systems, 34:23283-23295, 2021.

Pateria, S., Subagdja, B., Tan, A.-h., and Quek, C. Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys (CSUR), 54(5):1-35, 2021.

Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., and Wayne, G. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.

Sodhani, S., Zhang, A., and Pineau, J. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp. 9767-9779. PMLR, 2021.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

van der Pol, E., Kipf, T., Oliehoek, F. A., and Welling, M. Plannable approximations to MDP homomorphisms: Equivariance under actions. arXiv preprint arXiv:2002.11963, 2020a.

van der Pol, E., Worrall, D., van Hoof, H., Oliehoek, F., and Welling, M. MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33:4199-4210, 2020b.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354, 2019.

Wan, M., Gangwani, T., and Peng, J. Mutual information based knowledge transfer under state-action dimension mismatch. arXiv preprint arXiv:2006.07041, 2020.

Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Yang, R., Xu, H., Wu, Y., and Wang, X. Multi-task reinforcement learning with soft modularization. Advances in Neural Information Processing Systems, 33:4767-4777, 2020a.

Yang, T., Hao, J., Meng, Z., Zhang, Z., Hu, Y., Chen, Y., Fan, C., Wang, W., Wang, Z., and Peng, J. Efficient deep reinforcement learning through policy transfer. In AAMAS, pp. 2053-2055, 2020b.

Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094-1100. PMLR, 2020.

Zhang, Q., Xiao, T., Efros, A. A., Pinto, L., and Wang, X. Learning cross-domain correspondence for control with dynamics cycle-consistency. arXiv preprint arXiv:2012.09811, 2020.