Published as a conference paper at ICLR 2023

GUARDED POLICY OPTIMIZATION WITH IMPERFECT ONLINE DEMONSTRATIONS

Zhenghai Xue1, Zhenghao Peng2, Quanyi Li3, Zhihan Liu4, Bolei Zhou2
1Nanyang Technological University, Singapore, 2University of California, Los Angeles, 3The University of Edinburgh, 4Northwestern University

ABSTRACT

The Teacher-Student Framework (TSF) is a reinforcement learning setting where a teacher agent guards the training of a student agent by intervening and providing online demonstrations. Assumed to be optimal, the teacher policy has the perfect timing and capability to intervene in the learning process of the student agent, providing safety guarantee and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy in terms of higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C.

1 INTRODUCTION

In Reinforcement Learning (RL), the Teacher-Student Framework (TSF) (Zimmer et al., 2014; Kelly et al., 2019) incorporates well-performing neural controllers or human experts as teacher policies in the learning process of autonomous agents. At each step, the teacher guards the free exploration of the student by intervening when a specific intervention criterion holds. Online data collected from both the teacher policy and the student policy is saved into the replay buffer and exploited with imitation learning or off-policy RL algorithms. Such a guarded policy optimization pipeline can either provide safety guarantee (Peng et al., 2021) or facilitate efficient exploration (Torrey & Taylor, 2013). The majority of RL methods in the TSF assume the availability of a well-performing teacher policy (Spencer et al., 2020; Torrey & Taylor, 2013) so that the student can properly learn from the teacher's demonstration about how to act in the environment. The teacher intervention is triggered when the student acts differently from the teacher (Peng et al., 2021) or when the teacher finds the current state worth exploring (Chisari et al., 2021). This is similar to imitation learning, where the training outcome is significantly affected by the quality of demonstrations (Kumar et al., 2020; Fujimoto et al., 2019). Thus, with current TSF methods, if the teacher is incapable of providing high-quality demonstrations, the student will be misguided and its final performance will be upper-bounded by the performance of the teacher.
However, it is time-consuming or even impossible to obtain a well-performing teacher in many real-world applications such as object manipulation with robot arms (Yu et al., 2020a) and autonomous driving (Li et al., 2022a). As a result, current TSF methods behave poorly with a less capable teacher. In the real world, the coach of Usain Bolt does not necessarily need to run faster than Usain Bolt. Is it possible to develop a new interactive learning scheme where a student can outperform the teacher while retaining the safety guarantee provided by it? In this work we develop a new guarded policy optimization method called Teacher-Student Shared Control (TS2C). It follows the setting of a teacher policy and a learning student policy, but relaxes the requirement of high-quality demonstrations from the teacher. A new intervention mechanism is designed: rather than triggering intervention based on the similarity between the actions of teacher and student, the intervention is now determined by a trajectory-based value estimator. The student is allowed to conduct an action that deviates from the teacher's, as long as its expected return is promising. By relaxing the intervention criterion from step-wise action similarity to trajectory-based value estimation, the student has the freedom to act differently when the teacher fails to provide correct demonstration and thus has the potential to outperform the imperfect teacher. We conduct theoretical analysis and show that in previous TSF methods the quality of the online data-collecting policy is upper-bounded by the performance of the teacher policy. In contrast, TS2C is not limited by the imperfect teacher in its upper-bound performance, while still retaining a lower-bound performance and safety guarantee. Experiments on various continuous control environments show that under the newly proposed method, the student policy can be optimized efficiently and safely with teachers at different performance levels, while other TSF algorithms are largely bounded by the teacher's performance. Furthermore, the student policies trained under the proposed TS2C substantially outperform all baseline methods in terms of higher efficiency and lower test-time cost, supporting our theoretical analysis.

2 BACKGROUND

2.1 RELATED WORK

The Teacher-Student Framework. The idea of transferring knowledge from a teacher policy to a student policy has been explored in reinforcement learning (Zimmer et al., 2014). It improves the learning efficiency of the student policy by leveraging a pretrained teacher policy, usually by adding an auxiliary loss that encourages the student policy to stay close to the teacher policy (Schmitt et al., 2018; Traoré et al., 2019). Though our method follows the teacher-student transfer framework, an optimal teacher is not a necessity. During training, agents are fully controlled by either the student (Traoré et al., 2019; Schmitt et al., 2018) or the teacher policy (Rusu et al., 2016), while our method follows intervention-based RL where a mixed policy controls the agent. Other attempts to relax the need for well-performing teacher models include student-student transfer (Lin et al., 2017; Lai et al., 2020), in which heterogeneous agents exchange knowledge through mutual regularisation (Zhao & Hospedales, 2021; Peng et al., 2020).

Learning from Demonstrations. Another way to exploit the teacher policy is to collect static demonstration data from it.
The learning agent regards the demonstrations as optimal transitions to imitate. If the data is provided without reward signals, the agent can learn by imitating the teacher's policy distribution (Ly & Akhloufi, 2020), matching the trajectory distribution (Ho & Ermon, 2016; Xu et al., 2019), or learning a parameterized reward function with inverse reinforcement learning (Abbeel & Ng, 2004; Fu et al., 2017). With additional reward signals, agents can perform Bellman updates pessimistically, as most offline reinforcement learning algorithms do (Levine et al., 2020). The conservative Bellman update can be performed either by restricting the overestimation of Q-function learning (Fujimoto et al., 2019; Kumar et al., 2020) or by involving model-based uncertainty estimation (Yu et al., 2020b; Chen et al., 2021b). In contrast to offline learning from demonstrations, in this work we focus on the online deployment of teacher policies with teacher-student shared control and show its superiority in reducing the state distributional shift, improving efficiency, and ensuring training-time safety.

Intervention-based Reinforcement Learning. Intervention-based RL enables both the expert and the learning agent to generate online samples in the environment. The switch between policies can be random (Ross et al., 2011), rule-based (Parnichkun et al., 2022) or determined by the expert, either through the manual intervention of human participants (Abel et al., 2017; Chisari et al., 2021; Li et al., 2022b) or by referring to the policy distribution of a parameterized expert (Peng et al., 2021). More delicate switching algorithms include RCMP (da Silva et al., 2020), which asks for expert advice when the learner's action has high estimated uncertainty. RCMP only works for agents with discrete action spaces, while we investigate continuous action spaces in this paper. Also, Ross & Bagnell (2014) and Sun et al. (2017) query the expert to obtain the optimal value function, which is used to guide the expert intervention. These switching mechanisms assume the expert policy to be optimal, while our proposed algorithm can make use of a suboptimal expert policy. To exploit samples collected with different policies, Ross et al. (2011) and Kelly et al. (2019) compute a behavior cloning loss on samples where the expert policy is in control and discard those generated by the learner. Other algorithms (Mandlekar et al., 2020; Chisari et al., 2021) assign positive labels to expert samples and compute a policy gradient loss based on the pseudo reward. Some other research works focus on provable safety guarantees under shared control (Peng et al., 2021; Wagener et al., 2021), while we provide an additional lower-bound guarantee on the accumulated reward for our method.

2.2 NOTATIONS

We consider an infinite-horizon Markov decision process (MDP), defined by the tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma, d_0 \rangle$ consisting of a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, the state transition probability distribution $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, the reward function $R: \mathcal{S} \times \mathcal{A} \to [R_{\min}, R_{\max}]$, the discount factor $\gamma \in (0, 1)$ and the initial state distribution $d_0: \mathcal{S} \to [0, 1]$. Unless otherwise stated, $\pi$ denotes a stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$. The state-action value and state value functions of $\pi$ are defined as $Q^{\pi}(s, a) = \mathbb{E}_{s_0 = s,\, a_0 = a,\, a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$ and $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)} Q^{\pi}(s, a)$. The optimal policy is expected to maximize the accumulated return $J(\pi) = \mathbb{E}_{s \sim d_0} V^{\pi}(s)$.
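As a concrete reference for these definitions, the snippet below estimates the discounted return of a policy and the objective $J(\pi)$ by Monte Carlo rollouts. It is an illustrative sketch rather than part of the paper's implementation; it assumes a Gymnasium-style environment with `reset()`/`step()` and a `policy` callable mapping a state to an action.

```python
import numpy as np

def rollout_return(env, policy, gamma=0.99, max_steps=1000):
    """Monte Carlo estimate of one episode's discounted return under `policy`."""
    state, _ = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                                  # a_t ~ pi(.|s_t)
        state, reward, terminated, truncated, _ = env.step(action)
        ret += discount * reward                                # gamma^t * R(s_t, a_t)
        discount *= gamma
        if terminated or truncated:
            break
    return ret

def estimate_objective(env, policy, episodes=20, gamma=0.99):
    """Estimate J(pi) = E_{s_0 ~ d_0}[V^pi(s_0)] by averaging episode returns."""
    return float(np.mean([rollout_return(env, policy, gamma) for _ in range(episodes)]))
```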
The Teacher-Student Framework (TSF) models the shared control system as the combination of a teacher policy πt, which is pretrained and fixed, and a student policy πs to be learned. The actual actions applied to the agent are deduced from a mixed policy of πt and πs, where πt starts generating actions when intervention happens. The details of the intervention mechanism are described in Sec. 3.2. The goal of the TSF is to improve the training efficiency and safety of πs with the involvement of πt. The discrepancy between πt and πs on state s, termed the policy discrepancy, is the L1-norm of the output difference: $\|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1 = \int_{\mathcal{A}} |\pi_t(a|s) - \pi_s(a|s)|\, da$. We define the discounted state distribution under policy π as $d^{\pi}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s; \pi, d_0)$, where $\Pr(s_t = s; \pi, d_0)$ is the state visitation probability. The state distribution discrepancy is defined as the L1-norm of the difference between the discounted state distributions induced by two policies: $\|d^{\pi_t} - d^{\pi_s}\|_1 = \int_{\mathcal{S}} |d^{\pi_t}(s) - d^{\pi_s}(s)|\, ds$.

3 GUARDED POLICY OPTIMIZATION WITH ONLINE DEMONSTRATIONS

Figure 1: Overview of the proposed teacher-student shared control method. Both student and teacher policies are in the training loop and the shared control occurs based on the intervention function.

Fig. 1 shows an overview of our proposed method. In addition to the conventional single-agent RL setting, we include a teacher policy πt in the training loop. The term teacher only indicates that the role of this policy is to help the student training. No assumption on the optimality of the teacher is needed. The teacher policy is first used to do warmup rollouts and train a value estimator. During the training of the student policy, both πs and πt receive the current state s from the environment. They propose actions as and at, and then a value-based intervention function T(s) determines which action should be taken and applied to the environment. The student policy is then updated with data collected through such intervention. We first give a theoretical analysis on the general setting of intervention-based RL in Sec. 3.1. We then discuss the properties of different forms of the intervention function T in Sec. 3.2. Based on these analyses, we propose a new algorithm for teacher-student shared control in Sec. 3.3. All the proofs in this section are included in Appendix A.1.

3.1 ANALYSIS ON INTERVENTION-BASED RL

In intervention-based RL, the teacher policy and the student policy act together and become a mixed behavior policy πb. The intervention function T(s) determines which policy is in charge. Let T(s) = 1 denote that the teacher policy πt takes control and T(s) = 0 otherwise. Then πb can be represented as $\pi_b(\cdot|s) = T(s)\,\pi_t(\cdot|s) + (1 - T(s))\,\pi_s(\cdot|s)$. One issue with the joint control is that the student policy πs is trained with samples collected by the behavior policy πb, whose action distribution is not always aligned with πs. A large state distribution discrepancy between the two policies, $\|d^{\pi_b} - d^{\pi_s}\|_1$, can cause distributional shift and ruin the training. A similar problem exists in behavior cloning (BC), though in BC no intervention is involved and πs learns from samples all collected by the teacher policy πt. To analyze the state distribution discrepancy in BC, we first introduce a useful lemma (Achiam et al., 2017).

Lemma 3.1.
The state distribution discrepancy between the teacher policy πt and the student policy πs is bounded by their expected policy discrepancy:
$$\|d^{\pi_t} - d^{\pi_s}\|_1 \le \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_t}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1. \quad (1)$$

We apply the lemma to the setting of intervention-based RL and derive a bound for $\|d^{\pi_b} - d^{\pi_s}\|_1$.

Theorem 3.2. For any behavior policy πb deduced by a teacher policy πt, a student policy πs and an intervention function T(s), the state distribution discrepancy between πb and πs is bounded by
$$\|d^{\pi_b} - d^{\pi_s}\|_1 \le \frac{\beta\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1, \quad (2)$$
where $\beta = \frac{\mathbb{E}_{s \sim d^{\pi_b}}\left[T(s)\, \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1\right]}{\mathbb{E}_{s \sim d^{\pi_b}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1} \in [0, 1]$ is the expected intervention rate weighted by the policy discrepancy.

Both Eq. 1 and Eq. 2 bound the state distribution discrepancy by the difference in per-state policy distributions, but the upper bound with intervention is squeezed by the intervention rate β. In practical algorithms, β can be minimized to reduce the state distribution discrepancy and thus relieve the performance drop during test time. Based on Thm. 3.2, we further prove in Appendix A.1 that under the setting of intervention-based RL, the accumulated returns of the behavior policy J(πb) and the student policy J(πs) can be similarly related. The analysis in this section does not assume a certain form of the intervention function T(s). Our analysis provides insight into the feasibility and efficiency of all previous algorithms in intervention-based RL (Kelly et al., 2019; Peng et al., 2021; Chisari et al., 2021). In the following section, we will examine different forms of intervention functions and investigate their properties and performance bounds, especially with imperfect online demonstrations.

3.2 LEARNING FROM IMPERFECT DEMONSTRATIONS

A straightforward idea to design the intervention function is to intervene when the student acts differently from the teacher. We model such a process with the action-based intervention function Taction(s):
$$T_{\text{action}}(s) = \begin{cases} 1 & \text{if } \mathbb{E}_{a \sim \pi_t(\cdot|s)}\left[\log \pi_s(a|s)\right] < \varepsilon, \\ 0 & \text{otherwise,} \end{cases} \quad (3)$$
wherein ε > 0 is a predefined parameter. A similar intervention function is used in EGPO (Peng et al., 2021), where the student's action is replaced by the teacher's if the student's action has low probability under the teacher's policy distribution. To measure the effectiveness of a certain form of intervention function, we examine the return of the behavior policy J(πb). With Taction(s) defined in Eq. 3 we can bound J(πb) with the following theorem.

Theorem 3.3. With the action-based intervention function Taction(s), the return of the behavior policy J(πb) is lower and upper bounded by
$$J(\pi_t) - \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2} \sqrt{\bar{H} - \varepsilon} \;\le\; J(\pi_b) \;\le\; J(\pi_t) + \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2} \sqrt{\bar{H} - \varepsilon}, \quad (4)$$
where $R_{\max}$ is the maximal possible reward, $\bar{H} = \mathbb{E}_{s \sim d^{\pi_b}} H(\pi_t(\cdot|s))$ is the average entropy of the teacher policy during shared control, and β is the weighted intervention rate in Thm. 3.2.

The theorem shows that J(πb) can be lower bounded by the return of the teacher policy πt minus an extra term relating to the entropy of the teacher policy. It implies that the action-based intervention function Taction is indeed helpful in providing training data with high return. We discuss the tightness of Thm. 3.3 and give an intuitive interpretation of the influence of $\bar{H}$ and ε in Appendix A.2.

Figure 2: In an autonomous driving scenario, the ego vehicle is the blue one on the left, following the gray vehicle on the right. The upper trajectory is proposed by the student to overtake and the lower trajectory is proposed by the teacher to keep following.
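Before turning to the limitations of this criterion, the sketch below illustrates one shared-control step with the mixed behavior policy $\pi_b(\cdot|s) = T(s)\pi_t(\cdot|s) + (1-T(s))\pi_s(\cdot|s)$ and the action-based rule of Eq. 3. It is an illustrative sketch rather than the paper's code: the policy objects with `sample(state, n)` and `log_prob(state, actions)` methods, and the concrete threshold, are assumptions.

```python
import torch

def action_based_intervention(teacher, student, state, eps, n_samples=16):
    """Eq. 3: T_action(s) = 1 if E_{a ~ pi_t(.|s)}[log pi_s(a|s)] < eps."""
    with torch.no_grad():
        teacher_actions = teacher.sample(state, n_samples)         # a ~ pi_t(.|s)
        avg_log_likelihood = student.log_prob(state, teacher_actions).mean()
    return avg_log_likelihood.item() < eps

def shared_control_step(teacher, student, state, eps):
    """One step of the mixed policy pi_b: the teacher acts whenever T_action(s) = 1."""
    takeover = action_based_intervention(teacher, student, state, eps)
    acting_policy = teacher if takeover else student
    action = acting_policy.sample(state, 1)[0]
    return action, takeover   # logging `takeover` gives an estimate of the rate beta
```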
A drawback of the action-based intervention function is its strong assumption of an optimal teacher, which is not always feasible. If we instead employ a suboptimal teacher, the behavior policy is held back by the upper bound in Eq. 4. We illustrate this phenomenon with the example in Fig. 2, where a slow vehicle in gray is driving in front of the ego vehicle in blue. The student policy is aggressive and would like to overtake the gray vehicle to reach the destination faster, while the teacher intends to follow the vehicle conservatively. Therefore, πs and πt will propose different actions in the current state, leading to Taction = 1 according to Eq. 3. The mixed policy with shared control will always choose to follow the front vehicle and the agent can never accomplish a successful overtake. To empower the student to outperform a suboptimal teacher policy, we investigate a new form of intervention function that encapsulates long-term value estimation into the decision of intervention, designed as follows:
$$T_{\text{value}}(s) = \begin{cases} 1 & \text{if } V^{\pi_t}(s) - \mathbb{E}_{a \sim \pi_s(\cdot|s)} Q^{\pi_t}(s, a) > \varepsilon, \\ 0 & \text{otherwise,} \end{cases} \quad (5)$$
where ε > 0 is a predefined parameter. By using this intervention function, the teacher tolerates the student's action as long as the teacher cannot outperform the student by more than ε in expected return. Tvalue no longer expects the student to imitate the teacher policy step by step. Instead, it makes the decision on the basis of long-term return. Taking the trajectories in Fig. 2 again as an example, if the overtaking behavior has a high return, the student's action will be preferred under Tvalue and the conservative teacher will not intervene. So with the value-based intervention function, the agent's exploration ability will not be limited by a suboptimal teacher. Nevertheless, the lower-bound performance guarantee of the behavior policy πb still holds, shown as follows.

Theorem 3.4. With the value-based intervention function Tvalue(s) defined in Eq. 5, the return of the behavior policy πb is lower bounded by
$$J(\pi_b) \ge J(\pi_t) - \frac{(1-\beta)\varepsilon}{1-\gamma}. \quad (6)$$

In safety-critical scenarios, the step-wise training cost c(s, a), i.e., the penalty on safety violations during training, can be regarded as a negative reward. We define $\hat{r}(s, a) = r(s, a) - \eta c(s, a)$ as the combined reward, where η is a weighting hyperparameter. $\hat{V}$, $\hat{Q}$ and $\hat{T}_{\text{value}}$ are defined analogously by substituting r with $\hat{r}$ in the original definitions. Then we have the following corollary related to the expected cumulative training cost, defined by $C(\pi) = \mathbb{E}_{s_0 \sim d_0,\, a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t c(s_t, a_t) \right]$.

Corollary 3.5. With the safety-critical value-based intervention function $\hat{T}_{\text{value}}(s)$, the expected cumulative training cost of the behavior policy πb is upper bounded by
$$C(\pi_b) \le C(\pi_t) + \frac{(1-\beta)\varepsilon}{\eta(1-\gamma)} + \frac{1}{\eta}\left[J(\pi_b) - J(\pi_t)\right]. \quad (7)$$

In Eq. 7 the upper bound of the behavior policy's training cost consists of three terms: the cost of the teacher policy, the intervention threshold ε multiplied by coefficients, and the superiority of πb over πt in cumulative reward. The first two terms are similar to those in Eq. 6 and the third term reflects a trade-off between training safety and efficient exploration, which can be adjusted by the hyperparameter η.

Comparing the lower-bound performance guarantees of the action-based and value-based intervention functions (Eq. 4 and Eq. 6), the performance gap between πb and πt can in both cases be bounded with respect to the intervention threshold ε and the discount factor γ. The difference is that the performance gap when using Taction is of order $O\!\left(\frac{1}{(1-\gamma)^2}\right)$ while the gap with Tvalue is of order $O\!\left(\frac{1}{1-\gamma}\right)$. This implies that, in theory, value-based intervention leads to a better lower-bound performance guarantee. In terms of training safety, the value-based intervention function Tvalue also provides a tighter safety bound of order $O\!\left(\frac{1}{1-\gamma}\right)$, in contrast to the $O\!\left(\frac{1}{(1-\gamma)^2}\right)$ bound of the action-based intervention function (see Theorem 1 in (Peng et al., 2021)). We show in Sec. 4.3 that the theoretical advantages of Tvalue in training safety and efficiency can both be verified empirically.
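For illustration, the following sketch implements the value-based criterion of Eq. 5 with the teacher's Q-function. It is a simplified, hypothetical sketch rather than the paper's released code: the `teacher_q` network and the policies' `sample` interface are assumptions, $V^{\pi_t}(s)$ is approximated by averaging $Q^{\pi_t}(s,a)$ over actions sampled from the teacher, and the practical criterion used by TS2C (Eq. 9 in Sec. 3.3) additionally averages over an ensemble of teacher Q-networks and checks their variance. The default threshold mirrors the ε1 value reported in the hyper-parameter table in Appendix C.

```python
import torch

def value_based_intervention(teacher_q, teacher, student, state, eps=1.2, n_samples=16):
    """Eq. 5: T_value(s) = 1 if V^{pi_t}(s) - E_{a ~ pi_s(.|s)} Q^{pi_t}(s, a) > eps."""
    with torch.no_grad():
        teacher_actions = teacher.sample(state, n_samples)       # a ~ pi_t(.|s)
        student_actions = student.sample(state, n_samples)       # a ~ pi_s(.|s)
        v_teacher = teacher_q(state, teacher_actions).mean()     # ~= V^{pi_t}(s)
        q_student = teacher_q(state, student_actions).mean()     # ~= E_{a~pi_s} Q^{pi_t}(s, a)
    return (v_teacher - q_student).item() > eps
```

In contrast to the action-based rule sketched earlier, the teacher only takes over when its own actions are expected to yield a return at least ε higher than the student's, so a promising but dissimilar student action is left untouched.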
3.3 IMPLEMENTATION

Justified by the aforementioned advantages of the value-based intervention function, we propose a practical algorithm called Teacher-Student Shared Control (TS2C). Its workflow is listed in Appendix B. To obtain the teacher Q-network $Q^{\pi_t}$ used in the value-based intervention function in Eq. 5, we roll out the teacher policy πt and collect training samples during the warmup period. Gaussian noise is added to the teacher's policy distribution to increase the state coverage during warmup. With limited training data the Q-network may fail to provide accurate estimation when encountering previously unseen states. We propose to use a teacher Q-ensemble based on the idea of ensembling Q-networks (Chen et al., 2021a). A set of ensembled teacher Q-networks $Q_{\phi}$ with the same architecture and different initialization weights are built and trained with the same data. To learn $Q_{\phi}$ we follow the standard procedure in (Chen et al., 2021a) and optimize the following loss:
$$L(\phi) = \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \left( y - \mathrm{Mean}\, Q_{\phi}(s, a) \right)^2 \right], \quad (8)$$
where $y = \mathbb{E}_{s' \sim \mathcal{D},\, a' \sim \pi_t(\cdot|s') + \mathcal{N}(0, \sigma)} \left[ r + \gamma\, \mathrm{Mean}\, Q_{\phi}(s', a') \right]$ is the Bellman target and $\mathcal{D}$ is the replay buffer storing transitions $\{(s, a, r, s')\}$. The teacher will intervene when Tvalue returns 1 or when the output variance of the ensembled Q-networks surpasses a threshold, which means the agent is exploring unknown regions and requires guarding. We also use $Q_{\phi}$ to compute the state-value functions in Eq. 5, leading to the following practical intervention function:
$$T_{\text{TS2C}}(s) = \begin{cases} 1 & \text{if } \mathrm{Mean}\left[\mathbb{E}_{a \sim \pi_t(\cdot|s)} Q_{\phi}(s, a) - \mathbb{E}_{a \sim \pi_s(\cdot|s)} Q_{\phi}(s, a)\right] > \varepsilon_1 \text{ or } \mathrm{Var}\left[\mathbb{E}_{a \sim \pi_s(\cdot|s)} Q_{\phi}(s, a)\right] > \varepsilon_2, \\ 0 & \text{otherwise.} \end{cases} \quad (9)$$

Eq. 2 shows that the distributional shift and the performance gap to the oracle can be reduced with a smaller β, i.e., less teacher intervention. Therefore, we minimize the amount of teacher intervention by adding a negative reward to the transitions one step before the teacher intervention. Incorporating intervention minimization, we use the following loss function to update the student's Q-network parameterized by ψ:
$$L(\psi) = \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \left( y - Q_{\psi}(s, a) \right)^2 \right], \quad (10)$$
where $y = \mathbb{E}_{s' \sim \mathcal{D},\, a' \sim \pi_b(\cdot|s')} \left[ r - \lambda T_{\text{TS2C}}(s') + \gamma \left( Q_{\psi}(s', a') - \alpha \log \pi_b(a'|s') \right) \right]$ is the soft Bellman target with intervention minimization. λ is the hyperparameter controlling the intervention minimization. α is the coefficient for maximum-entropy learning, updated in the same way as in Soft Actor-Critic (SAC) (Haarnoja et al., 2018). To update the student's policy network parameterized by θ, we apply the objective used in SAC:
$$L(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi_{\theta}(\cdot|s)} \left[ \alpha \log \pi_{\theta}(a|s) - Q_{\psi}(s, a) \right]. \quad (11)$$

4 EXPERIMENTS

We conduct experiments to investigate the following questions: (1) Can agents trained with TS2C achieve super-teacher performance with imperfect teacher policies while outperforming other methods in the Teacher-Student Framework (TSF)? (2) Can TS2C provide safety guarantee and improve training efficiency compared to algorithms without teacher intervention?
(3) Is TS2C robust to different environments and to teacher policies trained with different algorithms?

Figure 3: Comparison between our method TS2C and other algorithms with teacher policies providing online demonstrations (top row: test reward; bottom row: training cost). Importance refers to the Importance Advising algorithm. For each column, the involved teacher policy has high, medium, and low performance respectively.

To answer questions (1) and (2), we conduct preliminary training with the PPO algorithm (Schulman et al., 2017) and save checkpoints at different timesteps. Policies at different stages of PPO training are used as teacher policies in TS2C and the other algorithms in the TSF. With regard to question (3), we use agents trained with PPO (Schulman et al., 2017), SAC (Haarnoja et al., 2018) and behavior cloning as teacher policies from different sources.

4.1 ENVIRONMENT SETUP

The majority of the experiments are conducted on the lightweight driving simulator MetaDrive (Li et al., 2022a). One concern with TSF algorithms is that the student may simply record the teacher's actions and overfit the training environment. MetaDrive can test the generalizability of learned agents on unseen driving environments with its capability to generate an unlimited number of scenes with various road networks and traffic flows. We choose 100 scenes for training and 50 held-out scenes for testing. Examples of the traffic scenes from MetaDrive are shown in Appendix C. In MetaDrive, the objective is to drive the ego vehicle to the destination without dangerous behaviors such as crashing into other vehicles. The reward function consists of a dense reward proportional to the vehicle speed and the driving distance, and a terminal +20 reward when the ego vehicle reaches the destination. The training cost is increased by 1 whenever the ego vehicle crashes or drives out of the lane. To evaluate TS2C's performance in different environments, we also conduct experiments in several environments of the MuJoCo simulator (Todorov et al., 2012).

4.2 BASELINES AND IMPLEMENTATION DETAILS

Two sets of algorithms are selected as baselines. One includes traditional RL and IL algorithms without the TSF. By comparing with these methods we can demonstrate how TS2C improves the efficiency and safety of training. The other set contains previous algorithms within the TSF, including Importance Advising (Torrey & Taylor, 2013) and EGPO (Peng et al., 2021). The original Importance Advising uses an intervention function based on the range of the Q-function: $I(s) = \max_{a \in \mathcal{A}} Q_D(s, a) - \min_{a \in \mathcal{A}} Q_D(s, a)$, where $Q_D$ is the Q-table of the teacher policy. Such a Q-table is not applicable in the MetaDrive simulator with continuous state and action spaces. In practice, we sample N actions from the teacher's policy distribution and compute their Q-values on a certain state.
The intervention happens if the range, i.e., the maximum value minus the minimum value, surpasses a certain threshold ε. The EGPO algorithm uses an intervention function similar to the action-based intervention function introduced in Sec. 3.2. All algorithms are trained with 4 different random seeds. In all figures the solid line is the average value across different seeds and the shaded area indicates the standard deviation. We leave detailed information on the experiments and the results of ablation studies on hyperparameters to Appendix C.

Figure 4: Figures (a) and (b) show the comparison of efficiency and safety between TS2C and baseline algorithms without teacher policies providing online demonstrations. Figure (c) shows the comparison of the average intervention rate between TS2C and two baseline algorithms in the TSF.

Figure 5: Performance comparison between our method TS2C and baseline algorithms on three environments from MuJoCo (Pendulum, Hopper, and Walker2d).

4.3 RESULTS

Super-teacher performance and better safety guarantee. The training results with three different levels of teacher policy can be seen in Fig. 3. The first row shows that the performance of TS2C is not limited by the imperfect teacher policies. It converges within 200k steps, independent of the performance of the teacher. EGPO and Importance Advising are clearly bounded by teacher-medium and teacher-low, performing much worse than TS2C with imperfect teachers. The second row of Fig. 3 shows that TS2C has lower training cost than both algorithms. Compared to EGPO and Importance Advising, the intervention mechanism in TS2C is better designed and leads to better behaviors.

Better performance with TSF. The results of comparing TS2C with baseline methods without the TSF can be seen in Fig. 4(a)(b). We use the teacher policy with a medium level of performance to train the student in TS2C. It achieves better performance and lower training cost than the baseline algorithms SAC, PPO, and BC. The comparative results show the effectiveness of incorporating teacher policies in online training. The behavior cloning algorithm does not involve online sampling in the training process, so it has zero training cost.

Extension to different environments and teacher policies. The performance of TS2C in different MuJoCo environments and with different sources of teacher policy is presented in Fig. 5 and Fig. 6, respectively. The figures show that TS2C generalizes to different environments. It can also make use of teacher policies from different sources and achieve super-teacher performance consistently. Our TS2C algorithm can outperform SAC in all three MuJoCo environments taken into consideration. On the other hand, though the EGPO algorithm has the best performance in the Pendulum environment, it struggles in the other two environments, namely Hopper and Walker.
Figure 6: Performance comparison between our method TS2C and baseline algorithms with teacher policies providing online demonstrations. The teacher policies are trained by PPO, SAC, and behavior cloning respectively.

4.4 EFFECTS OF INTERVENTION FUNCTIONS

Figure 7: Visualization of the trajectories resulting from different intervention mechanisms. The trajectories of irrelevant traffic vehicles are marked orange. As shown in the green trajectory, action-based intervention makes the car follow the front vehicle. Value-based intervention instead can learn the overtaking behavior, as shown in the blue trajectory.

We further investigate the intervention behaviors under different intervention functions. As shown in Fig. 4(c), the average intervention rate $\mathbb{E}_{s \sim d^{\pi_b}} T(s)$ of TS2C drops quickly as soon as the student policy takes control. The teacher policy only intervenes at the few states where it can propose actions with higher value than the student's. The intervention rate of EGPO remains high due to the action-based intervention function: the teacher intervenes whenever the student acts differently. We also show the different outcomes of action-based and value-based intervention functions with screenshots from the MetaDrive simulator. In Fig. 7 the ego vehicle happens to drive behind a traffic vehicle whose trajectory is marked orange. With action-based intervention the teacher takes control and keeps following the front vehicle, as shown in the green trajectory. In contrast, with the value-based intervention the student policy proposes to turn left and overtake the front vehicle, as in the blue trajectory. Such an action has higher return and therefore is tolerated by T_TS2C, leading to a better agent trajectory.

5 CONCLUSION AND DISCUSSION

In this work, we conduct theoretical analysis on intervention-based RL algorithms in the Teacher-Student Framework. It is found that while the intervention mechanism has better properties than some imitation learning methods, using an action-based intervention function limits the performance of the student policy. We then propose TS2C, a value-based intervention scheme for online policy optimization with imperfect teachers. We provide theoretical guarantees on its exploration ability and safety. Experiments show that the proposed TS2C method achieves consistent performance independent of the teacher policy being used. Our work brings progress and potential impact to relevant topics such as active learning, human-in-the-loop methods, and safety-critical applications.

Limitations. The proposed algorithm assumes the agent can access environment rewards, and thus defines the intervention function based on value estimations. It may not work in tasks where reward signals are inaccessible. This limitation could be tackled by considering reward-free settings and employing unsupervised skill discovery (Eysenbach et al., 2019; Aubret et al., 2019). These methods provide proxy reward functions that can be used in teacher intervention.

REFERENCES

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning.
In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1, 2004.

David Abel, John Salvatier, Andreas Stuhlmüller, and Owain Evans. Agent-agnostic human-in-the-loop reinforcement learning. arXiv preprint arXiv:1701.04079, 2017.

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 22-31. PMLR, 2017.

Arthur Aubret, Laetitia Matignon, and Salima Hassas. A survey on intrinsic motivation in reinforcement learning. CoRR, abs/1908.06976, 2019.

Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double Q-learning: Learning fast without a model. In International Conference on Learning Representations, 2021a.

Xiong-Hui Chen, Yang Yu, Qingyang Li, Fan-Ming Luo, Zhiwei Qin, Wenjie Shang, and Jieping Ye. Offline model-based adaptable policy learning. Advances in Neural Information Processing Systems, 34, 2021b.

Eugenio Chisari, Tim Welschehold, Joschka Boedecker, Wolfram Burgard, and Abhinav Valada. Correct me if I am wrong: Interactive learning for robotic manipulation. arXiv preprint arXiv:2110.03316, 2021.

Felipe Leno da Silva, Pablo Hernandez-Leal, Bilal Kartal, and Matthew Taylor. Uncertainty-aware action advising for deep reinforcement learning agents. In AAAI, 2020.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In ICLR (Poster). OpenReview.net, 2019.

Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052-2062. PMLR, 2019.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861-1870. PMLR, 2018.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, pp. 4565-4573, 2016.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In NeurIPS, pp. 12498-12509, 2019.

Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J Kochenderfer. HG-DAgger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8077-8083. IEEE, 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.

Kwei-Herng Lai, Daochen Zha, Yuening Li, and Xia Hu. Dual policy distillation. In IJCAI, 2020.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022a.

Quanyi Li, Zhenghao Peng, and Bolei Zhou. Efficient learning of safe driving policy via human-AI copilot optimization. In ICLR. OpenReview.net, 2022b.

Kaixiang Lin, Shu Wang, and Jiayu Zhou.
Collaborative deep reinforcement learning. arXiv preprint arXiv:1702.05796, 2017.

Abdoulaye O Ly and Moulay Akhloufi. Learning to drive by imitation: An overview of deep behavior cloning methods. IEEE Transactions on Intelligent Vehicles, 6(2):195-209, 2020.

Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Human-in-the-loop imitation learning using remote teleoperation. arXiv preprint arXiv:2012.06733, 2020.

Rom Parnichkun, Matthew N Dailey, and Atsushi Yamashita. ReIL: A framework for reinforced intervention-based imitation learning. arXiv preprint arXiv:2203.15390, 2022.

Zhenghao Peng, Hao Sun, and Bolei Zhou. Non-local policy optimization via diversity-regularized collaborative exploration. arXiv preprint arXiv:2006.07781, 2020.

Zhenghao Peng, Quanyi Li, Chunxiao Liu, and Bolei Zhou. Safe driving via expert guided policy optimization. In 5th Annual Conference on Robot Learning, 2021.

Stéphane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. CoRR, abs/1406.5979, 2014.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627-635. JMLR Workshop and Conference Proceedings, 2011.

Andrei A Rusu, Sergio Gomez Colmenarejo, Çağlar Gülçehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In ICLR. OpenReview.net, 2016.

Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835, 2018.

John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pp. 1889-1897. JMLR.org, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Jonathan Spencer, Sanjiban Choudhury, Matthew Barnes, Matthew Schmittle, Mung Chiang, Peter Ramadge, and Siddhartha Srinivasa. Learning from interventions: Human-robot interaction as both explicit and implicit feedback. In Robotics: Science and Systems (RSS), 2020.

Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 3309-3318. PMLR, 2017.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, pp. 5026-5033. IEEE, 2012.

Lisa Torrey and Matthew Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS '13, pp. 1053-1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

René Traoré, Hugo Caselles-Dupré, Timothée Lesort, Te Sun, Guanghang Cai, David Filliat, and Natalia Díaz-Rodríguez. DisCoRL: Continual reinforcement learning via policy distillation. In NeurIPS Workshop on Deep Reinforcement Learning, 2019.

Nolan Wagener, Byron Boots, and Ching-An Cheng.
Safe reinforcement learning using advantage-based intervention. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 10630-10640. PMLR, 2021.

Tian Xu, Ziniu Li, and Yang Yu. On value discrepancy of imitation learning. arXiv preprint arXiv:1911.07027, 2019.

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094-1100. PMLR, 2020a.

Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129-14142, 2020b.

Chenyang Zhao and Timothy Hospedales. Robust domain randomised reinforcement learning through peer-to-peer distillation. In Proceedings of The 13th Asian Conference on Machine Learning, volume 157 of Proceedings of Machine Learning Research, pp. 1237-1252. PMLR, 2021.

Matthieu Zimmer, Paolo Viappiani, and Paul Weng. Teacher-student framework: a reinforcement learning approach. In AAMAS Workshop Autonomous Robots and Multirobot Systems, 2014.

A THEOREMS IN TS2C

A.1 DETAILED PROOF

We start the proof with the restatement of Lem. 3.1 in Sec. 3.1.

Lemma A.1 (Lemma 4.1 in (Xu et al., 2019)).
$$\|d^{\pi} - d^{\pi'}\|_1 \le \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi}} \|\pi(\cdot|s) - \pi'(\cdot|s)\|_1. \quad (12)$$

Thm. 3.2 can be derived by substituting π and π' in Lem. A.1 with πb and πs.

Theorem A.2 (Restatement of Thm. 3.2). For any behavior policy πb deduced by a teacher policy πt, a student policy πs and an intervention function T(s), the state distribution discrepancy between πb and πs is bounded by the policy discrepancy and the intervention rate:
$$\|d^{\pi_b} - d^{\pi_s}\|_1 \le \frac{\beta\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1, \quad (13)$$
where $\beta = \frac{\mathbb{E}_{s \sim d^{\pi_b}} \|T(s)[\pi_t(\cdot|s) - \pi_s(\cdot|s)]\|_1}{\mathbb{E}_{s \sim d^{\pi_b}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1}$ is the weighted expected intervention rate.

Proof.
$$\begin{aligned} \|d^{\pi_b} - d^{\pi_s}\|_1 &\le \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_b(\cdot|s) - \pi_s(\cdot|s)\|_1 \\ &= \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_b}} \|T(s)\pi_t(\cdot|s) + (1 - T(s))\pi_s(\cdot|s) - \pi_s(\cdot|s)\|_1 \\ &= \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_b}} \|T(s)[\pi_t(\cdot|s) - \pi_s(\cdot|s)]\|_1 \\ &= \frac{\beta\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1. \end{aligned}$$

Based on Thm. 3.2, we further prove that under the setting of shared control, the performance gap of πs to the optimal policy π* can be bounded by the gap between the teacher policy πt and π*, together with the teacher-student policy difference. Therefore, training with the trajectories collected by the mixed policy πb optimizes an upper bound of the student's suboptimality. The following lemma is helpful in doing this.

Lemma A.3.
$$|J(\pi) - J(\pi')| \le \frac{R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi}} \|\pi(\cdot|s) - \pi'(\cdot|s)\|_1. \quad (15)$$

Proof. It is a direct combination of Lemma 4.2 and Lemma 4.3 in (Xu et al., 2019).

Theorem A.4. For any behavior policy πb consisting of a teacher policy πt, a student policy πs and an intervention function T(s), the suboptimality of the student policy is bounded by
$$|J(\pi^*) - J(\pi_s)| \le \frac{\beta R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1 + |J(\pi^*) - J(\pi_b)|. \quad (16)$$

Proof.
$$\begin{aligned} |J(\pi_b) - J(\pi_s)| &\le \frac{R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_b(\cdot|s) - \pi_s(\cdot|s)\|_1 \\ &= \frac{R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|T(s)\pi_t(\cdot|s) + (1 - T(s))\pi_s(\cdot|s) - \pi_s(\cdot|s)\|_1 \\ &= \frac{R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|T(s)[\pi_t(\cdot|s) - \pi_s(\cdot|s)]\|_1 \\ &= \frac{\beta R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1. \end{aligned}$$
Then
$$|J(\pi^*) - J(\pi_s)| \le |J(\pi_b) - J(\pi_s)| + |J(\pi^*) - J(\pi_b)| \le \frac{\beta R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_t(\cdot|s) - \pi_s(\cdot|s)\|_1 + |J(\pi^*) - J(\pi_b)|.$$

Theorem A.5 (Restatement of Thm. 3.3).
With the action-based intervention function Taction(s), the return of the behavior policy J(πb) is lower and upper bounded by
$$J(\pi_t) - \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2} \sqrt{\bar{H} - \varepsilon} \;\le\; J(\pi_b) \;\le\; J(\pi_t) + \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2} \sqrt{\bar{H} - \varepsilon},$$
where $R_{\max} = \max_{s, a} r(s, a)$ is the maximal possible reward and $\bar{H} = \mathbb{E}_{s \sim d^{\pi_b}} H(\pi_t(\cdot|s))$ is the average entropy of the teacher policy during shared control.

Proof.
$$\begin{aligned} |J(\pi_b) - J(\pi_t)| &\le \frac{R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_b(\cdot|s) - \pi_t(\cdot|s)\|_1 \\ &= \frac{R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|T(s)\pi_t(\cdot|s) + (1 - T(s))\pi_s(\cdot|s) - \pi_t(\cdot|s)\|_1 \\ &= \frac{(1-\beta) R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \|\pi_s(\cdot|s) - \pi_t(\cdot|s)\|_1 \\ &\le \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \sqrt{D_{\mathrm{KL}}\!\left(\pi_t(\cdot|s)\,\|\,\pi_s(\cdot|s)\right)} \\ &= \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi_b}} \sqrt{\mathbb{E}_{a \sim \pi_t(\cdot|s)}\left[\log \pi_t(a|s) - \log \pi_s(a|s)\right]} \\ &\le \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2} \sqrt{\bar{H} - \varepsilon}. \end{aligned}$$
Therefore, we obtain
$$J(\pi_t) - \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2} \sqrt{\bar{H} - \varepsilon} \le J(\pi_b) \le J(\pi_t) + \frac{\sqrt{2}\,(1-\beta) R_{\max}}{(1-\gamma)^2} \sqrt{\bar{H} - \varepsilon},$$
which concludes the proof.

To prove Thm. 3.4, we introduce a useful lemma from (Schulman et al., 2015).

Lemma A.6.
$$J(\pi) = J(\pi') + \mathbb{E}_{(s_t, a_t) \sim \tau_{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi'}(s_t, a_t)\right].$$

Theorem A.7 (Restatement of Thm. 3.4). With the value-based intervention function Tvalue(s) defined in Eq. 5, the return of the behavior policy πb is lower bounded by
$$J(\pi_b) \ge J(\pi_t) - \frac{(1-\beta)\varepsilon}{1-\gamma}. \quad (22)$$

Proof.
$$\begin{aligned} J(\pi_b) - J(\pi_t) &= \mathbb{E}_{(s_n, a_n) \sim \tau_{\pi_b}}\left[\sum_{n=0}^{\infty} \gamma^n A^{\pi_t}(s_n, a_n)\right] \\ &= \mathbb{E}_{(s_n, a_n) \sim \tau_{\pi_b}}\left[\sum_{n=0}^{\infty} \gamma^n \left(Q^{\pi_t}(s_n, a_n) - V^{\pi_t}(s_n)\right)\right] \\ &= \mathbb{E}_{s_n \sim \tau_{\pi_b}}\left[\sum_{n=0}^{\infty} \gamma^n \left(\mathbb{E}_{a \sim \pi_b(\cdot|s_n)} Q^{\pi_t}(s_n, a) - V^{\pi_t}(s_n)\right)\right] \\ &= \mathbb{E}_{s_n \sim \tau_{\pi_b}}\left[\sum_{n=0}^{\infty} \gamma^n \left(T(s_n)\, \mathbb{E}_{a \sim \pi_t(\cdot|s_n)} Q^{\pi_t}(s_n, a) + (1 - T(s_n))\, \mathbb{E}_{a \sim \pi_s(\cdot|s_n)} Q^{\pi_t}(s_n, a) - V^{\pi_t}(s_n)\right)\right] \\ &= \mathbb{E}_{s_n \sim \tau_{\pi_b}}\left[\sum_{n=0}^{\infty} \gamma^n (1 - T(s_n)) \left(\mathbb{E}_{a \sim \pi_s(\cdot|s_n)} Q^{\pi_t}(s_n, a) - V^{\pi_t}(s_n)\right)\right] \\ &= (1-\beta)\, \mathbb{E}_{s_n \sim \tau_{\pi_b}}\left[\sum_{n=0}^{\infty} \gamma^n \left(\mathbb{E}_{a \sim \pi_s(\cdot|s_n)} Q^{\pi_t}(s_n, a) - V^{\pi_t}(s_n)\right)\right] \\ &\ge -\frac{(1-\beta)\varepsilon}{1-\gamma}, \quad (23) \end{aligned}$$
which concludes the proof.

Then we prove the corollary related to safety-critical scenarios.

Corollary A.8 (Restatement of Cor. 3.5). With the safety-critical value-based intervention function $\hat{T}_{\text{value}}(s)$, the expected cumulative training cost of the behavior policy πb is upper bounded by
$$C(\pi_b) \le C(\pi_t) + \frac{(1-\beta)\varepsilon}{\eta(1-\gamma)} + \frac{1}{\eta}\left[J(\pi_b) - J(\pi_t)\right]. \quad (24)$$

Proof. We define the expected return under policy π with the combined reward $\hat{r}$ as $\hat{J}(\pi)$; therefore
$$\hat{J}(\pi) = \mathbb{E}_{s_0 \sim d_0,\, a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^t \hat{r}(s_t, a_t)\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) - \eta c(s_t, a_t)\right)\right] = J(\pi) - \eta C(\pi). \quad (25)$$
According to Thm. A.7, under $\hat{T}_{\text{value}}(s)$ we have
$$\hat{J}(\pi_b) \ge \hat{J}(\pi_t) - \frac{(1-\beta)\varepsilon}{1-\gamma}. \quad (26)$$
Eq. 24 can be immediately proved by combining Eq. 25 and Eq. 26.

A.2 DISCUSSIONS ON THE RESULTS

In Thm. 3.3, the average entropy of the teacher policy $\bar{H}$ and the threshold for action-based intervention ε are included in the bound. We provide intuitive interpretations of the influence of $\bar{H}$ and ε here. For reference, the action-based intervention function sets Taction(s) = 1 when $\mathbb{E}_{a \sim \pi_t(\cdot|s)}[\log \pi_s(a|s)] < \varepsilon$. According to Thm. 3.3, a larger ε leads to a smaller discrepancy between the returns of the behavior and teacher policies. This is because ε is the threshold for the action-based intervention function: if the action likelihood is less than ε, the teacher policy will take over control. A larger ε means more teacher intervention, constraining the behavior policy to be closer to the teacher policy, which leads to a smaller discrepancy in their returns. The influence of $\bar{H}$ can be similarly analyzed. A larger $\bar{H}$ leads to a larger return discrepancy. Intuitively, this is because with higher entropy the teacher policy tends to have a more averaged or multi-modal distribution over the action space, so the policy distributions of the student and teacher are more likely to overlap, leading to a higher action likelihood. In turn, the intervention criterion is less likely to be satisfied, leading to fewer teacher interventions. In general, the intuitive interpretation of Thm. 3.3 indicates that if we would like a larger return discrepancy, i.e.,
a larger performance upper bound as well as a smaller lower bound, we should use a smaller intervention threshold and a teacher policy with higher entropy, and vice versa. Thm. 3.3 has a gap with the actual algorithm in that the algorithm uses a value-based intervention function, which is based on Thm. 3.4. Nevertheless, the intuitive interpretation may enlighten future work on how to choose a proper teacher policy in teacher-student shared control.

With respect to tightness, Thm. 3.3 has a squared planning horizon $\frac{1}{(1-\gamma)^2}$ in the discrepancy term. This is in accordance with many previous works (Thm. 1 in (Xu et al., 2019), Thm. 4.1 in (Janner et al., 2019) and Thm. 1 in (Schulman et al., 2015)), which include $(1-\gamma)^2$ in the denominator when it comes to differences of the cumulative return, given the difference in the action distribution. The order of $\frac{1}{1-\gamma}$ in Thm. 3.3 is tight, which dominates the gap in accumulated return. Nevertheless, the other constant terms, e.g., $R_{\max}$ and the average entropy, can be tightened given some additional assumptions. We did not derive a tighter bound since the derivation would not be related to the main contribution of this paper, which is the new type of intervention function. Thm. 3.3 and Thm. 3.4 in their current forms are enough to demonstrate that the value-based intervention function has the advantage of providing more efficient exploration and a better safety guarantee compared with the action-based intervention function.

B THE ALGORITHM

The workflow of TS2C during training is shown in Alg. 1.

Algorithm 1 The workflow of TS2C during training
1: Input: warmup steps W; scale of warmup noise σ; training steps N; teacher policy πt.
2: Initialize the student policy πθs, a set of parameterized Q-functions for the teacher policy Qϕ, the parameterized Q-function for the student policy Qψ, and the replay buffer D.
3: for i = 1 to W do
4:     Observe state si and sample ai ∼ πt(·|si) + N(0, σ).
5:     Step the environment with ai and store the tuple (si, ai, ri, si+1) in D.
6:     Update ϕ with the temporal-difference loss in Eq. 8.
7: end for
8: for i = 1 to N do
9:     Observe state si and sample at ∼ πt(·|si), as ∼ πθs(·|si).
10:    Compute TTS2C(si) with Eq. 9, the behavior policy πb(·|si), and the behavior action ab.
11:    Step the environment with ab and store the tuple (si, ab, ri, si+1, Tvalue(si+1)) in D.
12:    Update ψ in the student Q-function with the loss in Eq. 10.
13:    Update θ in the student policy with the loss in Eq. 11.
14: end for

C ADDITIONAL EXPERIMENT DEMONSTRATIONS

C.1 DEMONSTRATIONS OF DRIVING SCENARIOS

The demonstrations of several driving scenarios are shown in Fig. 8. We provide a demonstrative video showing the agent behavior trained with PPO and with our TS2C algorithm in the supplementary materials.

Figure 8: Four examples of the traffic scenes in MetaDrive.

C.2 HYPER-PARAMETERS

The hyper-parameters used in the experiments are shown in the following tables. In the TS2C algorithm, larger values of the intervention thresholds ε1 and ε2 lead to a stricter intervention criterion, so fewer steps are controlled by the teacher. In order to control the policy distribution discrepancy, we choose ε1 and ε2 to ensure that the average intervention rate is less than 5%. Nevertheless, different values of ε1 in the intervention function have little influence on the algorithm's performance, as shown in Fig. 11. The coefficient for intervention minimization λ is simply set to 1. If used in other environments, it may need some adjustment to fit the reward scale.
The coefficient for maximum-entropy learning α is updated during training as in the SAC algorithm. The number of warmup timesteps is empirically chosen so that the expert value function can be properly trained. Other parameters follow the settings in EGPO (Peng et al., 2021). The hyper-parameters of the other algorithms follow their original settings.

Table 1: TS2C (Ours)
Hyper-parameter | Value
Discount Factor γ | 0.99
τ for target network update | 0.005
Learning Rate | 0.0001
Environmental horizon T | 2000
Warmup Timesteps W | 50000
# of Ensembled Value-Functions N | 10
Variance of Gaussian Noise C | 0.5
Intervention Minimization Ratio λ | 1
Value-based Intervention Threshold ε1 | 1.2
Value-based Intervention Threshold ε2 | 2.5
Activation Function | ReLU
Hidden Layer Sizes | [256, 256]

Table 2: EGPO (Peng et al., 2021)
Hyper-parameter | Value
Discount Factor γ | 0.99
τ for target network update | 0.005
Learning Rate | 0.0001
Environmental horizon T | 2000
Steps before Learning starts | 10000
Intervention Occurrence Limit C | 20
Number of Online Evaluation Episodes | 5
Kp | 5
Ki | 0.01
Kd | 0.1
CQL Loss Temperature β | 3.0
Activation Function | ReLU
Hidden Layer Sizes | [256, 256]

Table 3: Importance Advising (Torrey & Taylor, 2013)
Hyper-parameter | Value
Discount Factor γ | 0.99
τ for target network update | 0.005
Learning Rate | 0.0001
Environmental horizon T | 2000
Warmup Timesteps W | 50000
# of Actions Sampled N | 10
Variance of Gaussian Noise C | 0.5
Range-based Intervention Threshold ε | 2.8
Activation Function | ReLU
Hidden Layer Sizes | [256, 256]

Table 4: SAC (Haarnoja et al., 2018)
Hyper-parameter | Value
Discount Factor γ | 0.99
τ for Target Network Update | 0.005
Learning Rate | 0.0001
Environmental Horizon T | 2000
Steps before Learning starts | 10000
Activation Function | ReLU
Hidden Layer Sizes | [256, 256]

D ADDITIONAL EXPERIMENT RESULTS

D.1 ADDITIONAL PERFORMANCE COMPARISONS ON METADRIVE

In Fig. 9, we show the results of TS2C trained with various levels of teachers compared with baseline algorithms without shared control. Apart from Fig. 4 in the main paper presenting the training results of TS2C with the medium level of teacher policy, here we present the performance of TS2C trained with the high, medium and low levels of teacher policy. The value-based intervention proposed by TS2C can utilize all these teacher policies, leading to safer and more efficient training compared to traditional RL algorithms. Fig. 10 shows the results with different levels of teacher policy. Besides the testing reward and the training cost shown in Fig. 3 of the main paper, we show the training reward and the test success rate of TS2C compared with baseline methods within the Teacher-Student Framework (TSF). Our TS2C algorithm still achieves the best performance among baseline algorithms when evaluated with these two metrics.

Figure 9: Comparison of training cost and test reward between our method TS2C and other algorithms without shared control.
Figure 11: Ablation studies for different variance thresholds (2.0, 2.5, 3.0), the intervention cost (IC), ensembled value networks (EVN), and different intervention functions (value-based vs. action-based), measured by test reward at 100k steps.

Figure 10: Comparison of training reward and test success rate between our method TS2C and other algorithms with shared control.

D.2 ABLATION STUDIES

We conduct ablation studies and present the results in Fig. 11. We find that the intervention cost and the ensembled value networks are important to the algorithm's performance, while different variance thresholds in the intervention function have little influence. Also, TS2C with the action-based intervention function behaves poorly, in accordance with the theoretical analysis in Sec. 3.2.

D.3 DISCUSSIONS ON EXPERIMENT RESULTS

In Fig. 5, our TS2C algorithm can outperform SAC in all three MuJoCo environments taken into consideration. On the other hand, though the EGPO algorithm has the best performance in the Pendulum environment, it struggles in the other two environments, namely Hopper and Walker. This is because the action space of the Pendulum environment is only one-dimensional. In this simple environment, the action-based intervention of the EGPO algorithm is effective: the policy only needs slight adjustments based on the imperfect teacher to work properly. In other words, the distance between the optimal action and the teacher action is small. However, in more complex environments like Hopper and Walker, the distance between the two is large. As the action-based intervention is too restrictive, the EGPO algorithm based on such intervention fails to achieve good performance. In Fig. 6, the performance of EGPO with a SAC policy as the teacher policy is very poor. This is because the employed SAC teacher is less stochastic than the PPO policy. The student's actions have lower likelihood under the teacher's action distribution and are less tolerated by the action-based intervention function in EGPO, leading to a high intervention rate and consequently a large distributional shift. Our proposed TS2C algorithm does not access the teacher's internal action distribution and instead intervenes based on the state-action values of the teacher policy, so it is robust to the stochasticity of the teacher policy.