# Hybrid Policy Optimization from Imperfect Demonstrations

Hanlin Yang (Sun Yat-sen University), Chao Yu (Sun Yat-sen University, corresponding author: yuchao3@mail.sysu.edu.cn), Peng Sun (ByteDance), Siji Chen (Sun Yat-sen University)

37th Conference on Neural Information Processing Systems (NeurIPS 2023). Code is available at https://github.com/joenghl/HYPO.

## Abstract

Exploration is one of the main challenges in Reinforcement Learning (RL), especially in environments with sparse rewards. Learning from Demonstrations (LfD) is a promising approach to this problem because it leverages expert demonstrations. However, high-quality expert demonstrations are usually costly or even impossible to collect in real-world applications. In this work, we propose a novel RL algorithm called HYbrid Policy Optimization (HYPO), which uses a small number of imperfect demonstrations to accelerate an agent's online learning. The key idea is to train an offline guider policy with imitation learning that instructs an online agent policy to explore efficiently. Through mutual updates of the guider policy and the agent policy, the agent can leverage suboptimal demonstrations for efficient exploration while avoiding the conservative policy induced by imperfect demonstrations. Empirical results show that HYPO significantly outperforms several baselines on a variety of challenging tasks, including MuJoCo with sparse rewards, Google Research Football, and the AirSim drone simulation.

## 1 Introduction

Reinforcement Learning (RL) (Sutton & Barto, 2018) plays an important role in solving real-world control problems with large state-action spaces. In RL, an agent learns a decision-making policy through interaction with the environment, based on a reward function that encodes the agent's learning goal. However, pre-defining a reward function relies heavily on human knowledge, which varies from person to person, and sometimes even a carefully hand-tuned reward function can lead to undesired or hazardous policies (Devidze et al., 2021). A more intuitive way to define rewards is therefore to rely on task completion, e.g., the distance a robot walks or scoring a goal in a football game. While such sparse rewards are unbiased with respect to the learning goal, learning from them can be challenging because exploration in a large problem space is inefficient.

In recent years, there has been intensive research interest in Learning from Demonstrations (LfD) as a promising approach to sparse-reward tasks that leverages expert demonstrations. Methods like DDPGfD (Vecerík et al., 2017), DQfD (Hester et al., 2018), and AWAC (Nair et al., 2020) normally require demonstrations consisting of complete trajectories with states, actions, and rewards, and are thus not applicable in real-world settings where only state and action information can be observed. Other methods based on Imitation Learning (IL), such as Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016) and Adversarial Inverse Reinforcement Learning (AIRL) (Fu et al., 2017), usually require demonstrations generated by an optimal expert, which are difficult to obtain in complex tasks.
More recent approaches such as Policy Optimization with Demonstrations (POfD) (Kang et al., 2018) and Learning Online with Guidance Offline (LOGO) (Rengarajan et al., 2022) can result in an excessively conservative policy (Nair et al., 2020) due to the negative impact of imperfect demonstrations. Specifically, once the agent performs better than the suboptimal expert policy, simply using the demonstration-guided reward or distilling knowledge from the suboptimal policy (as in POfD and LOGO) makes the online learning agent excessively conservative.

In this paper, we propose HYbrid Policy Optimization (HYPO), a novel RL algorithm that achieves near-optimal performance in sparse-reward environments using only a small number of imperfect demonstrations with low quality and incomplete trajectories. The basic idea is to learn an offline imitation policy (the guider) that guides the learning of an online policy (the agent), which interacts with the environment. The two policies are updated mutually so that the agent learns from both the expert demonstrations and the environment more efficiently. The key insight of HYPO is that the guider imitates the expert to guide the agent during the initial stage, and the agent then learns to outperform the expert through online interaction with the environment. We evaluate HYPO on a wide variety of control tasks, including MuJoCo with sparse rewards, Google Research Football, and an AirSim UAV simulation. We also investigate how the number and quality of demonstrations influence the performance of HYPO and the baseline methods. The results show that 1) HYPO learns successful policies efficiently in these challenging tasks, and 2) HYPO is minimally affected by the number and quality of the demonstrations.

## 2 Related Work

### 2.1 Offline RL

Offline RL (Levine et al., 2020) learns policies using only offline data, without any online interaction with the environment. However, offline RL methods such as BCQ (Fujimoto et al., 2019), BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019), and Fisher-BRC (Kostrikov et al., 2021) still suffer from distributional shift, i.e., the policy is trained under one distribution but evaluated on a different one. A common way to mitigate this issue is to constrain the policy distribution to stay close to the policy that generated the offline data, but this constraint can result in excessively conservative policies. In contrast, HYPO avoids the distributional shift problem because the offline-trained guider never interacts with the environment; it is only used to guide the online agent toward more efficient exploration. In addition, unlike existing offline RL approaches that normally require a large amount of data with complete trajectories, and thus have limited applicability when collecting many perfect demonstrations is time-consuming or infeasible, HYPO can learn a near-optimal policy given only a small number of incomplete demonstrations generated by a suboptimal policy.

### 2.2 Imitation Learning

IL methods are designed to mimic the behaviors of experts. As one of the most well-known IL algorithms, Behavior Cloning (BC) learns a policy by directly minimizing the discrepancy between the agent and the expert in the demonstration data. However, BC has been shown to suffer from compounding errors and an inability to handle distributional shift (Ross et al., 2011). As a recent BC-based method, DWBC (Xu et al., 2022) trains a purely offline BC task to attain expert performance by using a discriminator that distinguishes optimal expert data from suboptimal data.
Different from DWBC, HYPO can learn policies that outperform the expert by combining online interaction with the environment and offline guidance from the guider. Moreover, unlike BC-based methods that require optimal demonstrations, HYPO is capable of learning from imperfect demonstrations only.

As another type of IL, Inverse RL (IRL) (Ng et al., 2000) learns a reward function that explains the expert behavior and then uses this reward function to guide the agent's learning process. Some popular approaches, such as GAIL (Ho & Ermon, 2016) and AIRL (Fu et al., 2017), use adversarial training to learn a policy that is similar to the expert policy while being robust to distributional shift. However, unlike HYPO, all these methods assume that the expert actions are optimal.

### 2.3 Learning from Demonstrations

The goal of LfD is to use demonstration data to aid the agent's online learning and thereby overcome the exploration problem in RL, especially in environments with sparse rewards (Schaal, 1996). DQfD (Hester et al., 2018) and DDPGfD (Vecerík et al., 2017) rely on complete offline expert demonstrations with associated rewards to accelerate online learning in discrete and continuous action spaces, respectively. POfD (Kang et al., 2018) combines the benefits of imitation learning and RL to learn from both demonstration data and online interaction, using a shaped reward that averages the original environment reward and a demonstration-guided reward. However, because the environment reward is much sparser than the demonstration-guided reward, such a mixed reward may cause the learning objective to deviate from the optimal policy. Another LfD approach is LOGO (Rengarajan et al., 2022), which pre-trains an expert policy to guide the learning of the agent and uses a trust region controlled by a hyper-parameter to restrict the agent's updates. Unlike LOGO, whose final policy may be excessively conservative due to the fixed suboptimal expert, HYPO enables the agent to outperform the expert by updating the guider and the agent mutually.

## 3 Preliminaries

### 3.1 Markov Decision Process

We consider the standard Markov Decision Process (MDP) (Sutton & Barto, 2018) as the mathematical framework for modeling sequential decision-making problems. An MDP is defined by a tuple $\langle \mathcal{S}, \mathcal{A}, P, r, d_0, \gamma \rangle$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability function, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $d_0$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount factor. An agent interacts with the environment according to a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, which maps each state to a distribution over actions. The performance of $\pi$ is evaluated by its expected discounted return $J(\pi)$:

$$J(\pi) = \mathbb{E}_{\pi}\big[ r(s, a) \big] = \mathbb{E}_{(s_0, a_0, s_1, \ldots)}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right], \tag{1}$$

where $(s_0, a_0, s_1, \ldots)$ is a trajectory generated by interacting with the environment, i.e., $s_0 \sim d_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The value function of a policy is defined as $V_\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s \right]$, i.e., the expected cumulative discounted reward obtained by policy $\pi$ from state $s$. Correspondingly, the action-value function is $Q_\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a \right]$, and the advantage function is $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$.
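To make these definitions concrete, here is a minimal Python sketch (the function names and the toy trajectory are illustrative, not from the paper's code) that computes the discounted return of a single trajectory and a Monte Carlo estimate of $V_\pi(s_0)$ by averaging returns over sampled rollouts.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for a single trajectory, as in the definition of J(pi)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def mc_value_estimate(reward_trajectories, gamma=0.99):
    """Monte Carlo estimate of V_pi(s0): average discounted return over rollouts."""
    return np.mean([discounted_return(traj, gamma) for traj in reward_trajectories])

# Toy usage: a sparse reward of +1 arrives only at the final step of a 10-step trajectory.
sparse_traj = [0.0] * 9 + [1.0]
print(discounted_return(sparse_traj))          # 0.99**9, roughly 0.914
print(mc_value_estimate([sparse_traj] * 5))    # same value, averaged over 5 rollouts
```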
The objective of an RL algorithm is to discover the optimal policy $\pi^*$ that maximizes the expected discounted return $J(\pi)$, i.e., $\pi^* = \arg\max_\pi J(\pi)$.

### 3.2 Learning with Offline Demonstrations

BC and GAIL are well-known baseline methods for learning with offline demonstrations. The difference is that BC is a purely offline method without any interaction with the environment, while GAIL is an online learning method that generates and learns from trajectories.

**BC.** BC matches the distribution over actions, with the objective

$$\mathcal{L}_\pi = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[ \log \pi(a \mid s) \big], \tag{2}$$

where $\mathcal{D}$ is the set of offline demonstrations. Optimizing objective (2) makes the action distribution of $\pi_b$ match that of the expert policy that generated the state-action pairs in $\mathcal{D}$.

**GAIL.** The discriminator aims to distinguish the offline expert demonstrations from the trajectories sampled by an online agent. The objective of the discriminator in GAIL is

$$\mathcal{L}_d^{\mathrm{GAIL}} = \min_d\; -\Big( \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log d(s,a)\big] + \mathbb{E}_{(s,a)\sim\mathcal{B}}\big[\log\big(1-d(s,a)\big)\big] \Big), \tag{3}$$

where $\mathcal{D}$ is the set of offline demonstrations, the replay buffer $\mathcal{B}$ consists of trajectories sampled by the agent, and $d : \mathcal{S} \times \mathcal{A} \to (0, 1)$ is the discriminative classifier. Optimizing objective (3) makes the learned discriminator assign 1 to all transitions from the offline demonstrations $\mathcal{D}$ and 0 to all transitions from $\mathcal{B}$. During training, the discriminator assigns a large reward to the agent when $d(s,a)$ is close to 1 for $(s,a) \sim \mathcal{B}$, which is inspired by generative adversarial training. The discriminator can thus be interpreted as a new reward function $r(s, a) = -\log\big(1 - d(s, a)\big)$ that provides a learning signal to the agent. In this way, the agent receives a larger reward when its trajectories look more like the expert's, driving the adversarial game against the discriminator.

## 4 Hybrid Policy Optimization with Imperfect Demonstrations

Figure 1: Our proposed method provides a suboptimal expert direction (red curve) for the agent, which prevents it from pointless exploration (brown curves) in sparse-reward environments.

Before presenting the detailed description of our method, we use Figure 1 to illustrate the overall motivation. In the early learning stage, the agent usually suffers from pointless exploration (brown curves) due to the sparsity of the reward signal. At this stage, the agent is generally worse than the expert and can therefore gain beneficial guidance from the expert demonstrations, attaining a reasonable performance close to the expert's. In the later learning stages, we expect the agent to surpass the expert through exploration in the environment, so staying too close to the expert would hinder policy improvement. However, without constraining the deviation from the expert, the agent can still suffer performance collapse due to the pointless exploration caused by the sparse reward. We tackle these challenges by updating the two policies, the guider and the agent, mutually, so as to provide the agent with appropriate guidance. In this way, the agent is able to outperform the expert and achieve near-optimal performance.

Our goal is to develop an algorithm that leverages imperfect demonstrations generated by a suboptimal expert to boost online learning in sparse-reward environments. We assume that there is a suboptimal expert policy $\pi_e$ to which we have no access; all we can use are its demonstrations, which take the form $\mathcal{D} = \{\tau^i\}_{i=1}^{n}$, where $\tau^i = (s^i_1, a^i_1, \ldots, s^i_t, a^i_t)$ and $\tau^i \sim \pi_e$.
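For reference, here is a hedged PyTorch sketch of the two baseline objectives from Subsection 3.2, applied to demonstration data of exactly the $(s, a)$-only form described above. The network interfaces (a `policy` returning a `torch.distributions` object, a `disc` returning logits for concatenated state-action pairs) are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bc_loss(policy, demo_states, demo_actions):
    """Eq. (2): negative log-likelihood of demonstration actions under the policy."""
    dist = policy(demo_states)                 # assumed to return a torch.distributions object
    return -dist.log_prob(demo_actions).mean()

def gail_discriminator_loss(disc, demo_sa, agent_sa):
    """Eq. (3): push d -> 1 on demonstration pairs and d -> 0 on agent pairs."""
    demo_logits = disc(demo_sa)
    agent_logits = disc(agent_sa)
    loss_demo = F.binary_cross_entropy_with_logits(demo_logits, torch.ones_like(demo_logits))
    loss_agent = F.binary_cross_entropy_with_logits(agent_logits, torch.zeros_like(agent_logits))
    return loss_demo + loss_agent

def gail_reward(disc, agent_sa):
    """Induced reward r(s, a) = -log(1 - d(s, a)); larger when the agent looks expert-like."""
    d = torch.sigmoid(disc(agent_sa))
    return -torch.log(1.0 - d + 1e-8)
```

Minimizing `gail_discriminator_loss` is equivalent to optimizing objective (3), while `gail_reward` is the induced reward that trains the agent against the discriminator.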
We make the following reasonable and indispensable assumption concerning the quality of the expert policy:

**Assumption 1.** In the initial learning stages, the agent can obtain higher advantage values than its current policy $\pi$ if it acts according to the expert policy $\pi_e$:

$$\mathbb{E}_{a_e \sim \pi_e,\, a \sim \pi}\big[ A_\pi(s, a_e) - A_\pi(s, a) \big] \ge \xi > 0, \quad \forall s \in \mathcal{S}. \tag{4}$$

The assumption states that taking actions according to the expert provides a higher advantage than taking actions according to the current policy $\pi$. This is reasonable because the expert policy performs much better than an untrained policy in the initial learning stages.

To simplify the analysis and the formulation below, we use the term "expert" to refer to the suboptimal expert unless the prefix "optimal" is used. The offline and online policies are denoted by $\pi_b$ and $\hat{\pi}$, respectively. The demonstrations consisting of trajectories generated by the suboptimal expert are denoted by $\mathcal{D}$, while the replay buffer consisting of trajectories sampled by the agent is denoted by $\mathcal{B}$. The discriminator is $d : \mathcal{S} \times \mathcal{A} \to (0, 1)$.

The three key components of HYPO are the discriminator, the offline guider, and the online agent. The discriminator controls the learning objective of the offline guider by distinguishing whether trajectories are generated by the expert or the agent. The offline guider performs a BC task that provides the agent with adaptive guidance by learning dynamically from both the expert and the agent. The online agent policy interacts with the environment and constantly distills knowledge from the guider policy in order to outperform the expert.

### 4.1 Semi-supervised Discriminator Learning

The GAIL discriminator described in Subsection 3.2 suffers from an overfitting problem: a well-trained discriminator may overfit to the suboptimal expert (Zolna et al., 2021). Concretely, as the agent policy $\hat{\pi}$ improves, the behaviors of the agent become increasingly similar to those of the expert. Because the expert is suboptimal, it is crucial to avoid overfitting to it, since doing so can prevent the agent from converging to optimal performance. To address this issue, we formulate the discriminator objective as a positive-unlabeled (PU) reward learning problem (Xu & Denil, 2021), which allows us to train the discriminator by treating the agent trajectories as an unlabeled mixture of expert-like and agent-only behavior. In this way, the guider can also learn from the agent, since the agent's trajectories are not purely negative data but a mixture of positive and negative data. The formulated discriminator objective is:

$$\min_d\; -\Big( \eta\, \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log d(s,a)\big] + \mathbb{E}_{(s,a)\sim\mathcal{B}}\big[\log\big(1-d(s,a)\big)\big] - \eta\, \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log\big(1-d(s,a)\big)\big] \Big), \tag{5}$$

where $\mathcal{D}$ denotes the offline demonstrations, $\mathcal{B}$ the replay buffer consisting of trajectories sampled by the agent, and $\eta \in [0, 1]$ the positive class prior, which is assumed to be known.

Furthermore, it is difficult to distinguish between the expert's and the agent's trajectories relying only on the state-action pairs. To mitigate this issue, an additional signal $\log \pi_b$ is added to the input of the discriminator, leading to the final discriminator objective:

$$\mathcal{L}_d = -\Big( \eta\, \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log d(s,a,\log\pi_b)\big] + \mathbb{E}_{(s,a)\sim\mathcal{B}}\big[\log\big(1-d(s,a,\log\pi_b)\big)\big] - \eta\, \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log\big(1-d(s,a,\log\pi_b)\big)\big] \Big). \tag{6}$$

Specifically, for state-action pairs $(s, a)$ originating from the agent's trajectories, $\pi_b(a \mid s)$ assigns large probabilities to the agent's actions in the corresponding states; conversely, for state-action pairs $(s, a)$ derived from the expert's trajectories, $\pi_b(a \mid s)$ assigns small probabilities to the expert's actions in the corresponding states.

**Remark on $\eta$.** It is worth noting that, unlike the standard positive-unlabeled learning setup, the positive class prior $\eta$ in our setting changes as online policy learning progresses and the distribution of states in the replay buffer evolves. Specifically, $\eta$ is set to increase as learning progresses.
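A possible PyTorch implementation of the PU discriminator loss (6) is sketched below. The way the extra $\log \pi_b$ feature is concatenated to the discriminator input and the interfaces of `disc` and `policy_b` are assumptions for illustration; $\eta$ is passed in externally so that it can be annealed upward, matching the remark above.

```python
import torch

def pu_discriminator_loss(disc, policy_b, demo_s, demo_a, agent_s, agent_a, eta):
    """PU loss of Eq. (6): demonstrations D are labeled positives, the replay
    buffer B is an unlabeled mixture with positive class prior eta."""
    eps = 1e-8

    def d_prob(states, actions):
        # The discriminator also receives log pi_b(a|s) as an extra input feature.
        logp_b = policy_b(states).log_prob(actions).detach().unsqueeze(-1)
        return torch.sigmoid(disc(torch.cat([states, actions, logp_b], dim=-1)))

    d_demo = d_prob(demo_s, demo_a)
    d_agent = d_prob(agent_s, agent_a)

    loss = -(eta * torch.log(d_demo + eps).mean()
             + torch.log(1.0 - d_agent + eps).mean()
             - eta * torch.log(1.0 - d_demo + eps).mean())
    return loss
```

In practice $\eta$ would be increased over training, reflecting that the share of expert-like behavior in the replay buffer grows as the agent improves.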
### 4.2 Adaptive Target for Offline Imitation

The objective of the BC task is to mimic the expert policy by matching the conditional distribution $\pi(\cdot \mid s)$ over actions. Consequently, BC methods cannot learn a policy that outperforms the expert without additional guidance. In HYPO, $\pi_b$ can use the information in the trajectories sampled by the online policy $\hat{\pi}$ to learn from the agent, which helps $\pi_b$ attain a better performance than the original expert. Specifically, we use two adaptive functional weights, $F_{\mathrm{Expert}}(d)$ and $G_{\mathrm{Agent}}(d)$, to determine the learning objective of $\pi_b$ dynamically:

$$\mathcal{L}_{\pi_b} = -\Big( \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log\pi_b(a\mid s)\, F\big(d(s,a,\log\pi_b)\big)\big] + \mathbb{E}_{(s,a)\sim\mathcal{B}}\big[\log\pi_b(a\mid s)\, G\big(d(s,a,\log\pi_b)\big)\big] \Big), \tag{7}$$

where $F$ and $G$, denoting $F_{\mathrm{Expert}}(d)$ and $G_{\mathrm{Agent}}(d)$ respectively, are the functional weights that determine the objective of $\pi_b$. The functional weights are expected to push $\pi_b$ toward the expert in the initial stage, and to increase $\mathcal{L}_d$ so that the discriminator stays robust throughout learning. Since the performance of the online policy varies greatly, ranging from random to near-optimal, the discriminator must be robust enough to handle these changes. Inspired by adversarial robustness (Carlini et al., 2019) and the weighted discriminator (Xu et al., 2022), we use $\pi_b$ to maximize $\mathcal{L}_d$, improving the robustness of the discriminator by minimizing its worst-case error. Consequently, the functional weights take the form

$$F_{\mathrm{Expert}}(d) = \alpha - \frac{\eta}{d\,(1-d)}, \qquad G_{\mathrm{Agent}}(d) = \frac{1}{1-d}, \tag{8}$$

where $\alpha$ is a weight factor. Refer to Appendix B for the detailed derivation.

Figure 2: The trend of the weight $F_{\mathrm{Expert}}(d)$. Deeper blue indicates a lower weight value. $\eta$ increases during training, and $d$ varies from 0.1 to 0.9 (clipped).

To understand the trend of the functional weight $F_{\mathrm{Expert}}(d)$ intuitively, we visualize it in Figure 2. For transitions from $\mathcal{D}$, the weight is insensitive to $d$ when $\eta$ is small, since the proportion of positive samples among the $(s, a)$ pairs generated by the agent is low. In this case, $\pi_b$ provides the online agent with guidance by learning toward the expert. As training progresses, the proportion $\eta$ of positive samples in the agent trajectories gradually increases, and $d$ begins to dominate the changes of $F_{\mathrm{Expert}}(d)$. In this case, $F_{\mathrm{Expert}}(d)$ becomes sensitive to $d$ and decreases, since $d$ is large when $(s, a)$ comes from the expert, so $\pi_b$ shifts toward learning from the agent. The other weight, $G_{\mathrm{Agent}}$, is large when the $(s, a)$ pairs sampled by the agent $\hat{\pi}$ are similar to the expert's, and changes little while $d$ remains low.
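As an illustration of Eqs. (7) and (8) as reconstructed here, the sketch below computes the adaptive weights and the resulting weighted BC loss for the guider. Treating the discriminator outputs as fixed (detached) weights, as well as the choice of `alpha` and the clipping range of $d$, are simplifying assumptions for exposition rather than the paper's exact implementation.

```python
import torch

def adaptive_weights(d, eta, alpha=2.0, d_min=0.1, d_max=0.9):
    """Eq. (8): F_Expert(d) = alpha - eta / (d (1 - d)),  G_Agent(d) = 1 / (1 - d).
    d is clipped to a fixed range for numerical stability, as in Figure 2."""
    d = d.clamp(d_min, d_max)
    f_expert = alpha - eta / (d * (1.0 - d))
    g_agent = 1.0 / (1.0 - d)
    return f_expert, g_agent

def guider_loss(policy_b, d_demo, d_agent, demo_s, demo_a, agent_s, agent_a, eta, alpha=2.0):
    """Eq. (7): weighted BC on expert data (weight F) and on agent data (weight G).
    d_demo and d_agent are 1-D tensors of discriminator probabilities."""
    f_expert, _ = adaptive_weights(d_demo.detach(), eta, alpha)
    _, g_agent = adaptive_weights(d_agent.detach(), eta, alpha)

    logp_demo = policy_b(demo_s).log_prob(demo_a)
    logp_agent = policy_b(agent_s).log_prob(agent_a)

    return -(f_expert * logp_demo).mean() - (g_agent * logp_agent).mean()
```

Note that this sketch treats the weights as constants during the backward pass; the full method additionally lets $\pi_b$ increase $\mathcal{L}_d$ through the discriminator's $\log \pi_b$ input, as described above.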
### 4.3 Performance Improvement of Online Learning

A useful reward signal is necessary for online RL algorithms to learn efficiently. However, in sparse-reward environments the agent cannot reach states with non-zero reward without additional guidance. To this end, we use an offline guider to provide the agent with guidance, which significantly boosts the exploration of the agent, especially in the initial stages of training. Specifically, the online agent policy $\hat{\pi}$, which learns from the sparse environment rewards, constantly distills knowledge from the guider policy $\pi_b$. This is implemented by adding a corrective term to the original PPO (Schulman et al., 2017) policy update:

$$J^{\mathrm{HYPO}}_{\hat{\pi}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta) A_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big) A_t \big) - C\, D_{\mathrm{KL}}\big(\hat{\pi} \,\|\, \pi_b\big) \Big], \tag{9}$$

where $r_t(\theta)$ is the PPO probability ratio and $C$ is a decreasing coefficient on the KL-divergence. In Eq. (9) we append a constraint on the KL-divergence between the guider policy $\pi_b$ and the agent policy $\hat{\pi}$, so that the agent constantly distills knowledge from the guider. Moreover, the decreasing coefficient $C$ prevents the agent from adopting an excessively conservative policy.
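A minimal PyTorch sketch of the agent update in Eq. (9) follows: the standard clipped PPO surrogate plus a KL(agent || guider) penalty whose coefficient plays the role of the decreasing $C$. The interfaces of `agent_pi` and `guider_pi` and the use of a per-state analytic KL are illustrative assumptions.

```python
import torch

def hypo_policy_loss(agent_pi, guider_pi, states, actions, old_log_probs,
                     advantages, kl_coef, clip_eps=0.2):
    """Clipped PPO surrogate with an extra KL(agent || guider) penalty, cf. Eq. (9).
    `kl_coef` corresponds to the decreasing coefficient C."""
    dist_agent = agent_pi(states)
    log_probs = dist_agent.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)          # r_t(theta)

    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages)

    with torch.no_grad():
        dist_guider = guider_pi(states)                   # guider provides a fixed target
    kl = torch.distributions.kl_divergence(dist_agent, dist_guider)

    # Maximize the surrogate while penalizing deviation from the guider,
    # i.e., minimize the negative of the objective in Eq. (9).
    return -(surrogate - kl_coef * kl).mean()
```

As the coefficient decays toward zero, the penalty vanishes and the update reduces to plain PPO, which is what eventually allows the agent to deviate from, and surpass, the guider.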
Next, we theoretically analyze how the additional constraint influences the agent. We consider two cases: 1) the agent performs worse than the expert, which corresponds to Assumption 1 being satisfied, and 2) the agent performs better than the expert, which means Assumption 1 is no longer satisfied.

**Case 1.** In the initial stage of training, the offline guider policy $\pi_b$ can be approximately treated as the expert policy, since the BC method is more efficient than policy gradient (PG) and $F_{\mathrm{Expert}}$ stays large, which drives $\pi_b$ to update toward the expert. The performance improvement guarantee of the online agent under Assumption 1 is then given as follows:

**Proposition 1.** Let $\tilde{\pi}$ be a policy that satisfies Assumption 1. Then, for policy $\hat{\pi}$,

$$J_R(\hat{\pi}) - J_R(\tilde{\pi}) \ge (1-\gamma)^{-1}\xi - (1-\gamma)^{-1}\epsilon_{R,\tilde{\pi}}\, \sqrt{2\, D^{\hat{\pi}}_{\mathrm{KL}}(\hat{\pi}, \pi_b)}, \tag{10}$$

where $\epsilon_{R,\tilde{\pi}} = \max_{s,a}\,\lvert A^{\tilde{\pi}}_R(s, a)\rvert$. Proof: refer to Appendix B.

It is reasonable to assume that $\hat{\pi}$ satisfies Assumption 1 in the initial stage of learning, since the agent's current policy $\hat{\pi}$ is learned from scratch. Minimizing $D^{\hat{\pi}}_{\mathrm{KL}}(\hat{\pi}, \pi_b)$ then yields a non-negative lower bound in Eq. (10), which means that we can accelerate the agent's online learning by minimizing the KL-divergence between $\hat{\pi}$ and $\pi_b$. This process can be described as the agent learning from the guider to attain expert performance, which in turn constantly obtains non-zero rewards from the sparse environment.

**Case 2.** As training progresses, the agent reaches expert performance efficiently, which can be achieved by most previous imitation methods. However, we want to investigate how the agent can surpass the expert and achieve near-optimal performance in sparse-reward tasks, which corresponds to Assumption 1 no longer holding. If the guidance is simply removed from the agent, i.e., the agent no longer distills knowledge from the guider, the bootstrap error and the uncertain reward signal can degrade the performance of the agent policy. If we instead use the expert policy to restrict the agent, as other LfD methods do, the agent policy becomes excessively conservative, making it hard to achieve near-optimal performance. Therefore, we address this issue by changing the learning target of the guider from the expert to the agent, which reduces the KL-divergence between $\pi_b$ and $\hat{\pi}$. The policy improvement lower bound when Assumption 1 is not satisfied is given as follows:

**Proposition 2.** For policy $\hat{\pi}$ and any policy $\tilde{\pi}$,

$$J_R(\hat{\pi}) - J_R(\tilde{\pi}) \ge -3 R_{\max} \sqrt{2\, D^{\max}_{\mathrm{KL}}(\hat{\pi}, \tilde{\pi})}, \tag{11}$$

where $R_{\max} = \max_{s,a}\,\lvert R(s, a)\rvert$. Proof: refer to Appendix B.

The guider's update toward the agent reduces the KL-divergence term, which improves the policy improvement bound in Eq. (11). The mutual update of the guider policy and the agent policy is the core feature of the HYPO algorithm. In this way, the agent can leverage a few suboptimal demonstrations for efficient exploration while avoiding the negative impact of low-quality data.

## 5 Experiments

In this section, we investigate whether HYPO can achieve near-optimal performance in extremely sparse reward environments by overcoming the restriction of imperfect demonstrations, and how the number and quality of the trajectories influence the performance of HYPO. To comprehensively assess our method, we first perform an exhaustive evaluation of HYPO on MuJoCo (Todorov et al., 2012) with sparse rewards and on Google Research Football (GRF) (Kurach et al., 2020), which has a huge policy space and only a sparse score reward. We also evaluate HYPO on an Unmanned Aerial Vehicle (UAV) task (https://github.com/sunghoonhong/AirsimDRL) based on Unreal Engine and AirSim (Shah et al., 2018) to show the effectiveness of HYPO on more challenging high-fidelity control tasks. All the mentioned environments are shown in Figure 3.

Figure 3: MuJoCo (left: Hopper, Walker, HalfCheetah, Ant) is a set of popular continuous control environments with tasks of varying difficulty. Google Research Football (top right) is a novel RL environment where agents are trained to play football in an advanced, physics-based 3D simulation. AirSim (bottom right) is a simulator built on Unreal Engine that offers physically and visually realistic simulations.

Figure 4: MuJoCo simulation results. The x-axis is the number of samples. The y-axis is the average episode return, scaled so that the expert achieves 100 and a random policy achieves 0.

Figure 5: Final performance of the baseline algorithms trained with demonstrations of varying number and quality of trajectories. HYPO is minimally affected by the quality of the data.

### 5.1 MuJoCo Simulation

We first sparsify the built-in dense rewards of the MuJoCo tasks to evaluate the methods in sparse-reward environments. Specifically, a reward of +1 is provided only after the agent moves forward over a specific distance. We compare HYPO with the following baselines: (1) Expert, which applies PPO with the original dense rewards to obtain the optimal return; (2) Demo, a suboptimal expert taken from an early stage of training; (3) PPO, which trains PPO directly with sparse rewards; (4) GAIL (Ho & Ermon, 2016), which uses a discriminator to provide a demonstration-guided reward for training; (5) POfD (Kang et al., 2018), which uses a weighted combination of the environment reward and the demonstration-guided reward; and (6) LOGO (Rengarajan et al., 2022), which merges a policy improvement step and a policy guidance step. All demonstrations used for training are generated by Demo. Refer to Appendix C for more experimental details.
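To illustrate the reward sparsification just described, here is a hedged Gymnasium-style wrapper sketch that emits +1 each time the agent's forward progress passes another fixed distance threshold; the threshold value and the `x_position` info key are assumptions about the setup, not the paper's exact configuration.

```python
import gymnasium as gym

class SparseForwardReward(gym.Wrapper):
    """Emit +1 only when the agent has moved forward another `threshold` meters."""

    def __init__(self, env, threshold=1.0):
        super().__init__(env)
        self.threshold = threshold
        self.milestone = threshold

    def reset(self, **kwargs):
        self.milestone = self.threshold
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        x_pos = info.get("x_position", 0.0)   # MuJoCo locomotion envs typically report this in `info`
        reward = 0.0
        if x_pos >= self.milestone:
            reward = 1.0
            self.milestone += self.threshold
        return obs, reward, terminated, truncated, info

# Illustrative usage:
# env = SparseForwardReward(gym.make("HalfCheetah-v4"), threshold=1.0)
```

Any of the locomotion tasks in Figure 3 could be wrapped this way to obtain a comparable sparse-reward setting.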
As shown in Figure 4, PPO fails to learn a useful policy in any environment due to the lack of guidance from demonstrations. GAIL only reaches the level of the suboptimal expert policy, since it can only mimic the policy that generated the demonstrations. POfD attains a higher return than GAIL in HalfCheetah but similar returns in the other tasks. LOGO achieves relatively better performance due to its decaying trust region, which lets the policy gradually escape the influence of the suboptimal data; however, the conservative policy still impedes learning efficiency in the later stages. As expected, HYPO outperforms all of the above methods in every environment by a large margin, which validates the effectiveness of HYPO in learning near-optimal policies in extremely sparse reward environments while overcoming the restriction of imperfect demonstrations.

We also investigate how the number and quality (in terms of cumulative return) of demonstrations influence the final performance. As shown in Figure 5, HYPO achieves far better performance than the other methods, even with a small number of imperfect trajectories. Refer to Appendix F for more details about the performance of LOGO.

### 5.2 Google Research Football and AirSim Simulation

Figure 6: Results in GRF and AirSim. The left y-axis is for GRF and the right y-axis for AirSim.

The GRF task is to control a single player who must cooperate with teammates to break a specific defensive line formation and score. This environment provides two types of reward settings: the sparse score reward and the dense checkpoint reward. AirSim is a high-fidelity environment in which a UAV must fly to a terminal point while avoiding all obstacles. This task also provides a sparse reward setting (for flying a specific distance) and a dense reward setting (for maintaining a certain forward flight speed). The expert baseline is trained in the dense reward setting, while the imperfect demonstration data is generated by a partially trained expert. The results in Figure 6 show that HYPO can achieve near-optimal performance in these challenging tasks with imperfect demonstrations.

## 6 Conclusion and Outlook

In this paper, we investigated how to accelerate an agent's online learning with imperfect demonstrations in sparse-reward environments. We introduced HYPO, a novel RL algorithm that attains near-optimal performance in sparse-reward settings by avoiding excessively conservative policies. The highlight of HYPO lies in its capability to learn an offline guider and an online agent, updating the two policies mutually so that the agent learns from the demonstrations and the environment more efficiently. Experiments in various environments, including MuJoCo, Google Research Football, and an AirSim UAV simulation, demonstrate that HYPO greatly improves learning efficiency in sparse-reward tasks with imperfect data. Our future work is to extend HYPO to multi-agent scenarios, where agents need to learn coordinated policies with sparse rewards. We also plan to investigate the potential of HYPO in more real-world applications, such as autonomous driving, where sparse rewards are a common challenge.

## Acknowledgments and Disclosure of Funding

This work was supported by an SYSU-ByteDance Research Project.

## References

Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber, J., Tsipras, D., Goodfellow, I., Madry, A., and Kurakin, A. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.

Devidze, R., Radanovic, G., Kamalaruban, P., and Singla, A. Explicable reward design for reinforcement learning agents. Advances in Neural Information Processing Systems, 34:20118–20131, 2021.
Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. PMLR, 2019.

Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., et al. Deep Q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Ho, J. and Ermon, S. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 2016.

Kang, B., Jie, Z., and Feng, J. Policy optimization with demonstrations. In International Conference on Machine Learning, pp. 2469–2478. PMLR, 2018.

Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. Offline reinforcement learning with Fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774–5783. PMLR, 2021.

Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy Q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.

Kurach, K., Raichuk, A., Stańczyk, P., Zając, M., Bachem, O., Espeholt, L., Riquelme, C., Vincent, D., Michalski, M., Bousquet, O., et al. Google Research Football: A novel reinforcement learning environment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 4501–4510, 2020.

Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Nair, A., Gupta, A., Dalal, M., and Levine, S. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

Ng, A. Y., Russell, S., et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pp. 2, 2000.

Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., and Shakkottai, S. Reinforcement learning with sparse rewards using guidance from offline demonstration. arXiv preprint arXiv:2202.04628, 2022.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635. JMLR Workshop and Conference Proceedings, 2011.

Schaal, S. Learning from demonstration. Advances in Neural Information Processing Systems, 9, 1996.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shah, S., Dey, D., Lovett, C., and Kapoor, A. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference, pp. 621–635. Springer, 2018.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

Vecerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N. M. O., Rothörl, T., Lampe, T., and Riedmiller, M. A. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Xu, D. and Denil, M. Positive-unlabeled reward learning. In Conference on Robot Learning, pp. 205–219. PMLR, 2021.

Xu, H., Zhan, X., Yin, H., and Qin, H. Discriminator-weighted offline imitation learning from suboptimal demonstrations. In International Conference on Machine Learning, pp. 24725–24742. PMLR, 2022.

Zolna, K., Reed, S., Novikov, A., Colmenarejo, S. G., Budden, D., Cabi, S., Denil, M., de Freitas, N., and Wang, Z. Task-relevant adversarial imitation learning. In Conference on Robot Learning, pp. 247–263. PMLR, 2021.