# Self-Imitation Learning via Generalized Lower Bound Q-learning

Yunhao Tang, Columbia University, yt2541@columbia.edu

Abstract. Self-imitation learning motivated by lower-bound Q-learning is a novel and effective approach for off-policy learning. In this work, we propose an n-step lower bound which generalizes the original return-based lower-bound Q-learning, and introduce a new family of self-imitation learning algorithms. To provide a formal motivation for the potential performance gains of self-imitation learning, we show that n-step lower bound Q-learning achieves a trade-off between fixed point bias and contraction rate, drawing close connections to the popular uncorrected n-step Q-learning. We finally show that n-step lower bound Q-learning is a more robust alternative to return-based self-imitation learning and uncorrected n-step updates over a wide range of continuous control benchmark tasks. The implementation is available at https://github.com/robintyh1/nstep-sil.

1 Introduction

Learning with off-policy data is of central importance to scalable reinforcement learning (RL). The traditional framework of off-policy learning is based on importance sampling (IS): for example, in policy evaluation, given trajectories $(x_t, a_t, r_t)_{t \ge 0}$ generated under a behavior policy $\mu$, the objective is to evaluate the Q-function $Q^\pi(x_0, a_0)$ of a target policy $\pi$. The naive IS estimator involves products of the ratios $\pi(a_t \mid x_t)/\mu(a_t \mid x_t)$ and is infeasible in practice due to high variance. To control the variance, a line of prior work has focused on operator-based estimation to avoid full IS products, which reduces the estimation procedure to repeated iterations of off-policy evaluation operators [1-3]. Each iteration of the operator requires only local IS ratios, which greatly stabilizes the update. More formally, such operators $\mathcal{T}$ are designed so that their fixed point is the target Q-function, $\mathcal{T} Q^\pi = Q^\pi$. As such, these operators are unbiased and conducive to theoretical analysis.

However, a large body of prior work has observed that certain biased operators tend to have significant empirical advantages [4-6]. One notable example is the uncorrected n-step operator, which directly bootstraps from n-step target trajectories without IS corrections [4]. The removal of all IS ratios biases the estimate, but allows the learning signal to be propagated over a longer horizon (in Section 2, we characterize such effects as contraction rates). Indeed, when behavior trajectories are unlikely under the current policy, small IS ratios $\pi(a_t \mid x_t)/\mu(a_t \mid x_t)$ quickly cut off the learning signal. In general, there is a trade-off between fixed point bias and contraction rate. Empirical findings suggest that it might be desirable to introduce bias in exchange for faster contractions in practice [7].

Recently, self-imitation learning (SIL) has been developed as a family of novel off-policy algorithms which facilitate efficient learning from highly off-policy data [8-10]. In its original form, SIL is motivated as lower bound Q-learning [11]. In particular, let $Q_L(x, a) \le Q^{\pi^\ast}(x, a)$ denote a lower bound of the optimal Q-function $Q^{\pi^\ast}$. Optimizing auxiliary losses which encourage $Q_\theta(x, a) \ge Q_L(x, a)$ could significantly speed up learning with the trained Q-function $Q_\theta(x, a)$.
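To make such an auxiliary loss concrete, below is a minimal PyTorch-style sketch of a lower-bound penalty; this is our own illustration rather than the released implementation, and the names `q_net`, `lower_bounds` are hypothetical. The clamp at zero gives a one-sided hinge, so only predictions that fall below the lower bound are penalized.

```python
import torch

def lower_bound_aux_loss(q_net, states, actions, lower_bounds):
    """Penalize Q_theta(x, a) only when it falls below the lower bound Q_L(x, a).

    q_net        : callable mapping (states, actions) -> Q-value predictions, shape [B]
    lower_bounds : estimates of Q_L(x, a), shape [B] (e.g. observed returns)
    """
    q_pred = q_net(states, actions)                     # Q_theta(x, a)
    shortfall = (lower_bounds - q_pred).clamp(min=0.0)  # one-sided gap below the bound
    return 0.5 * (shortfall ** 2).mean()                # squared hinge penalty
```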
Such auxiliary losses could be extended to actor-critic algorithms with stochastic policies [8]: SIL suggests optimizing a policy $\pi_\theta(a \mid x)$ by maximizing an objective similar to $[Q^\mu(x, a) - V^{\pi_\theta}(x)]_+ \log \pi_\theta(a \mid x)$, where $V^{\pi_\theta}(x)$ is the value function of policy $\pi_\theta$ and $[x]_+ := \max(0, x)$. The update is intuitively reasonable: if a certain action $a$ is high-performing under the behavior policy $\mu$, such that $Q^\mu(x, a) > V^{\pi_\theta}(x)$, the policy $\pi_\theta(a \mid x)$ should imitate that action.

At a high level, SIL is similar to the uncorrected n-step update in several aspects. With no explicit IS ratios, both methods let off-policy learning signals propagate over long horizons without being cut off. As a result, both methods are biased due to the absence of proper corrections, and could be seen as trading off fixed point bias for fast contractions.

Main idea. In this paper, we make several theoretical and empirical contributions.

Generalized SIL. In Section 3, we propose generalized SIL, which strictly extends the original SIL formulation [8]. Generalized SIL provides additional flexibility and advantages over the original SIL: it learns from partial trajectories and bootstraps with a learned Q-function, and it applies to both stochastic and deterministic actor-critic algorithms.

Trade-off. In Section 4, we formalize the trade-offs of SIL. We show that generalized SIL trades off contraction rates with fixed point bias in a similar way to uncorrected n-step [7]. Unlike uncorrected n-step, for which the fixed point bias could be either positive or negative, the operator for SIL induces positive bias, which fits the motivation of SIL to move towards optimal Q-functions.

Empirical. In Section 5, we show that generalized SIL outperforms alternative baseline algorithms.

2 Background

Consider the standard formulation of a Markov decision process (MDP). At a discrete time $t \ge 0$, an agent is in state $x_t \in \mathcal{X}$, takes action $a_t \in \mathcal{A}$, receives a reward $r_t = r(x_t, a_t) \in \mathbb{R}$ and transitions to a next state $x_{t+1} \sim p(\cdot \mid x_t, a_t) \in \mathcal{X}$. A policy $\pi(a \mid x): \mathcal{X} \mapsto \mathcal{P}(\mathcal{A})$ defines a map from states to distributions over actions. The standard objective of RL is to maximize the expected cumulative discounted return $J(\pi) := \mathbb{E}_\pi[\sum_{t \ge 0} \gamma^t r_t]$ with a discount factor $\gamma \in (0, 1)$. Let $Q^\pi(x, a)$ denote the Q-function under policy $\pi$ and $Q^\pi \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}|}$ its vector form. Denote the Bellman operator as $\mathcal{T}^\pi$ and the optimality operator as $\mathcal{T}^\ast$ [12]. Let $\pi^\ast$ be the optimal policy, i.e. $\pi^\ast = \arg\max_\pi J(\pi)$. It follows that $Q^\pi$ and $Q^{\pi^\ast}$ are the unique fixed points of $\mathcal{T}^\pi$ and $\mathcal{T}^\ast$ respectively [13].

Popular RL algorithms are primarily motivated by the fixed point properties of the Q-functions (or value functions): in general, given a parameterized Q-function $Q_\theta(x, a)$, the algorithms proceed by minimizing an empirical Bellman error loss $\min_\theta \mathbb{E}_{(x,a)}[(Q_\theta(x, a) - \mathcal{T} Q_\theta(x, a))^2]$ with operator $\mathcal{T}$. Algorithms differ in the distribution over sampled $(x, a)$ and the operator $\mathcal{T}$. For example, Q-learning sets the operator $\mathcal{T} = \mathcal{T}^\ast$ for value iteration and the samples $(x, a)$ come from an experience replay buffer [14]; actor-critic algorithms set the operator $\mathcal{T} = \mathcal{T}^{\pi_\theta}$ for policy iteration and iteratively update the policy $\pi_\theta$ for improvement, where the data $(x, a)$ could be either on-policy or off-policy [15-18].

2.1 Elements of trade-offs in Off-policy Reinforcement Learning

Here we introduce elements essential to characterizing the trade-offs of generic operators $\mathcal{T}$ in off-policy RL. For a complete review, please see [7].
Take off-policy evaluation as an example: the data are generated under a behavior policy $\mu$ while the target is to evaluate $Q^\pi$. Consider a generic operator $\mathcal{T}$ and assume that it has a fixed point, denoted $\tilde{Q}$. Define the contraction rate of the operator as $\Gamma(\mathcal{T}) := \sup_{Q_1 \neq Q_2} \|\mathcal{T} Q_1 - \mathcal{T} Q_2\| / \|Q_1 - Q_2\|$. Intuitively, operators with a small contraction rate should contract quickly to the fixed point. In practical algorithms, the quantity $\mathcal{T} Q(x, a)$ is approximated via stochastic estimation, denoted as $\hat{\mathcal{T}} Q(x, a)$. All of the above allows us to define the bias and variance of an operator,

$$B(\mathcal{T}) := \|\tilde{Q} - Q^\pi\|_2^2, \qquad V(\mathcal{T}) := \mathbb{E}_\mu\big[\|\hat{\mathcal{T}} Q - \mathcal{T} Q\|_2^2\big],$$

the latter evaluated at a Q-function $Q$. Note that all these quantities depend on the underlying MDP $M$, though when the context is clear we omit this dependency in the notation. Ideally, we seek an operator $\mathcal{T}$ with small bias, small variance and a small contraction rate. However, these three aspects cannot be optimized simultaneously over a general class of MDPs $\mathcal{M}$,

$$\sup_{M \in \mathcal{M}} \Big\{ B(\mathcal{T}) + \sqrt{V(\mathcal{T})} + \frac{2 r_{\max}}{1 - \gamma}\, \Gamma(\mathcal{T}) \Big\} \ge I(\mathcal{M}), \qquad (1)$$

where $r_{\max} := \max_{x,a} r(x, a)$ and $I(\mathcal{M})$ is an information-theoretic lower bound [7]. This inequality characterizes the fundamental trade-offs of these three quantities in off-policy learning. Importantly, we note that though the variance $V(\mathcal{T})$ is part of the trade-off, it is often not a major focus of algorithmic design [4, 7]. We speculate this is partly because in practice the variance could be reduced via e.g. large training batch sizes, while the bias and contraction rate do not improve with similar techniques. As a result, we henceforth focus on the trade-off between bias and contraction rate.

2.2 Trading off bias and contraction rate

Off-policy operators with unbiased fixed points, $B(\mathcal{T}) = 0$, are usually more conducive to theoretical analysis [3, 7]. For example, Retrace operators $\mathcal{R}^{\pi,\mu}_c$ are a family of off-policy evaluation operators indexed by trace coefficients $c(x, a)$. When $c(x, a) \le \pi(a \mid x)/\mu(a \mid x)$, these operators are unbiased in that $\mathcal{R}^{\pi,\mu}_c Q^\pi = Q^\pi$, resulting in $B(\mathcal{R}^{\pi,\mu}_c) = 0$. One popular choice is $c(x, a) = \min\{\bar{c}, \pi(a \mid x)/\mu(a \mid x)\}$, such that the operator also controls the variance $V(\mathcal{R}^{\pi,\mu}_c)$ via $\bar{c}$ [3].

However, many prior empirical results suggest that bias is not a major bottleneck in practice. For example, the uncorrected n-step update is a popular technique which greatly improves DQN [14], where the RL agent applies the operator $\mathcal{T}^{\pi,\mu}_{\text{nstep}} := (\mathcal{T}^\mu)^{n-1}\mathcal{T}^\pi$, with $\pi, \mu$ the target and behavior policies respectively [5, 6]. Note that since $\mathcal{T}^{\pi,\mu}_{\text{nstep}} Q^\pi \neq Q^\pi$, the n-step operator is biased, $B(\mathcal{T}^{\pi,\mu}_{\text{nstep}}) > 0$ [7]. However, its contraction rate is small due to the uncorrected updates, $\Gamma(\mathcal{T}^{\pi,\mu}_{\text{nstep}}) \le \gamma^n$. On the other hand, though Retrace operators have unbiased fixed points, their contraction rates are typically high due to small IS ratios, which cut off the signal early and fail to bootstrap over long horizons. The relative importance of contraction rate over bias is confirmed by the empirical observation that n-step often performs significantly better than Retrace in challenging domains [6, 7]. Such observations also motivate trading off bias and contraction rates in an adaptive way [7].

2.3 Self-imitation Learning

Maximum entropy RL. SIL is established under the framework of maximum-entropy RL [19-23], where the reward is augmented by an entropy term $r_{\text{ent}}(x, a) := r(x, a) + c\,\mathcal{H}^\pi(x)$, with $\mathcal{H}^\pi(x)$ the entropy of policy $\pi$ at state $x$, weighted by a constant $c > 0$. Accordingly, the Q-function is $Q^\pi_{\text{ent}}(x_0, a_0) := \mathbb{E}_\pi\big[r_0 + \sum_{t \ge 1} \gamma^t (r_t + c\,\mathcal{H}^\pi(x_t))\big]$.
The maximum-entropy RL objective is $J_{\text{ent}}(\pi) := \mathbb{E}_\pi\big[\sum_{t \ge 0} \gamma^t (r_t + c\,\mathcal{H}^\pi(x_t))\big]$. Similar to standard RL, we denote the optimal policy $\pi^\ast_{\text{ent}} = \arg\max_\pi J_{\text{ent}}(\pi)$ and its Q-function $Q^{\pi^\ast_{\text{ent}}}_{\text{ent}}(x, a)$.

Lower bound Q-learning. Lower bound Q-learning is motivated by the following inequality [8],

$$Q^{\pi^\ast_{\text{ent}}}_{\text{ent}}(x, a) \ge Q^\mu_{\text{ent}}(x, a) = \mathbb{E}_\mu\Big[r_0 + \sum_{t \ge 1} \gamma^t \big(r_t + c\,\mathcal{H}^\mu(x_t)\big)\Big], \qquad (2)$$

where $\mu$ is an arbitrary behavior policy. Lower bound Q-learning optimizes the following objective with the parameterized Q-function $Q_\theta(x, a)$,

$$\min_\theta \mathbb{E}_{\mathcal{D}}\big[([Q^\mu_{\text{ent}}(x, a) - Q_\theta(x, a)]_+)^2\big], \qquad (3)$$

where $[x]_+ := \max(x, 0)$. The intuition of Eqn. (3) is that the Q-function $Q_\theta(x, a)$ obtains learning signals from all trajectories such that $Q^\mu_{\text{ent}}(x, a) > Q_\theta(x, a) \approx Q^{\pi_\theta}(x, a)$, i.e. trajectories which perform better than the current policy $\pi_\theta$. In practice $Q^\mu_{\text{ent}}(x, a)$ could be estimated via a single trajectory, $Q^\mu_{\text{ent}}(x, a) \approx \hat{R}^\mu(x, a) := r_0 + \sum_{t \ge 1} \gamma^t (r_t + c\,\mathcal{H}^\mu(x_t))$. Though in Eqn. (3) one could plug in $\hat{R}^\mu(x, a)$ in place of $Q^\mu_{\text{ent}}(x, a)$ [8, 10], this introduces bias due to the double-sample issue [24], especially when $\hat{R}^\mu(x, a)$ has high variance, either due to the dynamics or a stochastic policy.

SIL with stochastic actor-critic. SIL further focuses on actor-critic algorithms where the Q-function is parameterized by a value function and a stochastic policy, $Q_\theta(x, a) := V_\theta(x) + c \log \pi_\theta(a \mid x)$. Taking gradients of the loss in Eqn. (3) with respect to $\theta$ yields the following losses for the value function and the policy; the full SIL loss is $L_{\text{sil}}(\theta) = L_{\text{value}}(\theta) + L_{\text{policy}}(\theta)$, with

$$L_{\text{value}}(\theta) = \tfrac{1}{2}\big([\hat{R}^\mu(x, a) - V_\theta(x)]_+\big)^2, \qquad L_{\text{policy}}(\theta) = -\log \pi_\theta(a \mid x)\,[\hat{R}^\mu(x, a) - V_\theta(x)]_+. \qquad (4)$$

3 Generalized Self-Imitation Learning

3.1 Generalized Lower Bounds for Optimal Q-functions

To generalize the formulation of SIL, we seek generalized lower bounds on the optimal Q-functions. Practical lower bounds should possess several desiderata: (P.1) they can be estimated using off-policy partial trajectories; (P.2) they can bootstrap from learned Q-functions. In standard actor-critic algorithms, partial trajectories are generated by the behavior policy $\mu$ (for example, see [25, 18, 26]), and the algorithm maintains an estimate of the Q-function of the current policy $\pi$. The following theorem states a general lower bound for the max-entropy optimal Q-function $Q^{\pi^\ast_{\text{ent}}}_{\text{ent}}$. Analogous results on generalized lower bounds of the optimal value function $V^{\pi^\ast}$ could be similarly derived; we leave the details to Theorem 3 in Appendix C.

Theorem 1. (proof in Appendix A) Let $\pi^\ast_{\text{ent}}$ be the optimal policy and $Q^{\pi^\ast_{\text{ent}}}_{\text{ent}}$ its Q-function under the maximum entropy RL formulation. Given a partial trajectory $(x_t, a_t)_{t=0}^{n}$, the following inequality holds for any $n$,

$$Q^{\pi^\ast_{\text{ent}}}_{\text{ent}}(x_0, a_0) \ge L^{\pi,\mu,n}_{\text{ent}}(x_0, a_0) := \mathbb{E}_\mu\Big[r_0 + \gamma c\,\mathcal{H}^\mu(x_1) + \sum_{t=1}^{n-1} \gamma^t \big(r_t + \gamma c\,\mathcal{H}^\mu(x_{t+1})\big) + \gamma^n Q^\pi_{\text{ent}}(x_n, a_n)\Big].$$

By letting $c = 0$, we derive a generalized lower bound for the standard optimal Q-function $Q^{\pi^\ast}$.

Lemma 1. Let $\pi^\ast$ be the optimal policy and $Q^{\pi^\ast}$ its Q-function under standard RL. Given a partial trajectory $(x_t, a_t)_{t=0}^{n}$, the following inequality holds for any $n$,

$$Q^{\pi^\ast}(x_0, a_0) \ge L^{\pi,\mu,n}(x_0, a_0) := \mathbb{E}_\mu\Big[\sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n Q^\pi(x_n, a_n)\Big]. \qquad (6)$$

When $\pi = \pi^\ast$, Lemma 1 reduces to the lower bounds applied in [11]. We see that the n-step lower bounds $L^{\pi,\mu,n}_{\text{ent}}$ satisfy both desiderata (P.1) and (P.2): $L^{\pi,\mu,n}_{\text{ent}}$ can be estimated on a single trajectory and bootstraps from a learned Q-function $Q_\theta(x, a) \approx Q^\pi(x, a)$. When $n \to \infty$, $L^{\pi,\mu,n}_{\text{ent}} \to Q^\mu_{\text{ent}}$ and we arrive at the lower bound employed by the original SIL [8].
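To illustrate Lemma 1, here is a minimal NumPy sketch (our own illustration, not the released code) of how the n-step lower bound target could be estimated from a single off-policy partial trajectory; `boot_q` stands in for the learned bootstrap value $Q_\theta(x_n, a_n) \approx Q^\pi(x_n, a_n)$, and the function name is hypothetical.

```python
import numpy as np

def nstep_lower_bound_target(rewards, boot_q, gamma):
    """Single-sample estimate of L^{pi,mu,n}(x_0, a_0) from Lemma 1 (c = 0).

    rewards : r_0, ..., r_{n-1} observed along the behavior trajectory
    boot_q  : bootstrap value Q_theta(x_n, a_n) at the n-th step
    gamma   : discount factor in (0, 1)
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)                  # 1, gamma, ..., gamma^{n-1}
    return np.dot(discounts, rewards) + gamma ** n * boot_q

# Example: a 5-step partial trajectory with a bootstrap value of 10.0
target = nstep_lower_bound_target(rewards=[1.0, 0.0, 0.5, 0.0, 2.0],
                                  boot_q=10.0, gamma=0.99)
```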
The original SIL does not satisfy (P.1) and (P.2): the estimate of $Q^\mu$ requires full trajectories from finished episodes and does not bootstrap from learned Q-functions. In addition, because the lower bound $L^{\pi,\mu,n}_{\text{ent}}(x, a)$ bootstraps Q-functions at a finite step $n$, we expect it to partially mitigate the double-sample bias of $\hat{R}^\mu(x, a)$. Also, as the policy $\pi$ improves over time, the Q-function $Q^\pi(x, a)$ increases and the bound $L^{\pi,\mu,n}$ improves as well. The standard SIL, on the contrary, does not enjoy such advantages.

3.2 Generalized Self-Imitation Learning

Generalized SIL with stochastic actor-critic. We describe generalized SIL for actor-critic algorithms. As developed in Section 2.3, such algorithms maintain a parameterized stochastic policy $\pi_\theta(a \mid x)$ and value function $V_\theta(x)$. Let $\hat{L}^{\pi,\mu,n}_{\text{ent}}(x, a)$ denote the sample estimate of the n-step lower bound; the loss functions are

$$L^{(n)}_{\text{value}}(\theta) = \tfrac{1}{2}\big([\hat{L}^{\pi,\mu,n}_{\text{ent}}(x, a) - V_\theta(x)]_+\big)^2, \qquad L^{(n)}_{\text{policy}}(\theta) = -\log \pi_\theta(a \mid x)\,[\hat{L}^{\pi,\mu,n}_{\text{ent}}(x, a) - V_\theta(x)]_+. \qquad (7)$$

Note that the loss functions in Eqn. (7) introduce updates very similar to A2C [25]. Indeed, when removing the threshold function $[x]_+$ and setting the data distribution to be on-policy, $\mu = \pi$, we recover the n-step A2C objective.

Generalized SIL with deterministic actor-critic. For continuous control, temporal difference (TD) learning and deterministic policy gradients have proven highly sample efficient and high-performing [15, 27, 23]. By construction, the generalized n-step lower bound $L^{\pi,\mu,n}_{\text{ent}}$ adopts n-step TD-learning and should naturally benefit the aforementioned algorithms. Such algorithms maintain a parameterized Q-function $Q_\theta(x, a)$, which could be directly updated via the following loss,

$$L^{(n)}_{\text{qvalue}}(\theta) = \tfrac{1}{2}\big([\hat{L}^{\pi,\mu,n}_{\text{ent}}(x, a) - Q_\theta(x, a)]_+\big)^2. \qquad (8)$$

Interestingly, note that the update in Eqn. (8) is similar to the n-step Q-learning update [4, 5] up to the threshold function $[x]_+$. In Section 4, we discuss their formal connections in detail.

Prioritized experience replay. Prior work on prioritized experience replay [28, 29] proposed to sample tuples $(x_t, a_t, r_t)$ from the replay buffer $\mathcal{D}$ with probability proportional to Bellman errors. We provide a straightforward extension by sampling proportional to the lower bound loss $[\hat{L}^{\pi,\mu,n}_{\text{ent}}(x, a) - Q_\theta(x, a)]_+$. This reduces to the sampling scheme in SIL [8] when letting $n \to \infty$.

4 Trade-offs with Lower Bound Q-learning

When applying SIL in practice, its induced loss functions are optimized jointly with the base loss functions [8]: in the case of stochastic actor-critic, the full loss function is $L(\theta) := L_{\text{ac}}(\theta) + L_{\text{sil}}(\theta)$, where $L_{\text{ac}}(\theta)$ is the original actor-critic loss function [25]. The parameter is then updated via the gradient descent step $\theta \leftarrow \theta - \nabla_\theta L(\theta)$ (up to a learning rate). This makes it difficult to analyze the behavior of SIL beyond the plain motivation of Q-function lower bounds. Though a comprehensive analysis of SIL might be elusive due to its empirical nature, we formalize the lower bound arguments via RL operators and draw connections with n-step Q-learning. Below, we present results for standard RL.

4.1 Operators for Generalized Lower Bound Q-learning

First, we formalize the mathematical operator of SIL. Let $Q \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}|}$ be a vector-valued Q-function. Given some behavior policy $\mu$, define the operator $\mathcal{T}_{\text{sil}} Q(x, a) := Q(x, a) + [Q^\mu(x, a) - Q(x, a)]_+$, where $[x]_+ := \max(x, 0)$. This operator captures the defining feature of practical lower bound Q-learning [8], where the Q-function $Q(x, a)$ receives learning signals only when $Q^\mu(x, a) > Q(x, a)$.
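As a small tabular illustration (our own sketch, not from the paper), $\mathcal{T}_{\text{sil}}$ can be written as an elementwise update that lifts $Q$ toward $Q^\mu$ wherever $Q^\mu(x, a) > Q(x, a)$ and leaves the remaining entries untouched; the arrays below are hypothetical toy values.

```python
import numpy as np

def sil_operator(q, q_mu):
    """Tabular lower-bound (SIL) operator: T_sil Q = Q + [Q^mu - Q]_+.

    q, q_mu : arrays of shape [num_states, num_actions] holding Q(x, a)
              and the behavior-policy Q-function Q^mu(x, a).
    """
    return q + np.maximum(q_mu - q, 0.0)   # only entries with Q^mu > Q change

# Toy example: entries of Q below Q^mu are lifted up to Q^mu, others are kept.
q    = np.array([[0.0, 2.0], [1.5, -1.0]])
q_mu = np.array([[1.0, 1.0], [1.0,  1.0]])
print(sil_operator(q, q_mu))   # [[1.0, 2.0], [1.5, 1.0]]
```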
For generalized SIL, we similarly define $\mathcal{T}_{n,\text{sil}} Q(x, a) := Q(x, a) + [(\mathcal{T}^\mu)^{n-1}\mathcal{T}^\pi Q(x, a) - Q(x, a)]_+$, where $Q(x, a)$ is updated when $(\mathcal{T}^\mu)^{n-1}\mathcal{T}^\pi Q(x, a) > Q(x, a)$, as suggested by Eqn. (7)-(8). In practice, lower bound Q-learning is applied alongside other main iterative algorithms. Henceforth, we focus on policy iteration algorithms with the Bellman operator $\mathcal{T}^\pi$ along with its n-step variant $(\mathcal{T}^\mu)^{n-1}\mathcal{T}^\pi$. Though practical deep RL implementations adopt additive loss functions, for theoretical analysis we consider a convex combination of these three operators, with coefficients $\alpha, \beta \in [0, 1]$,

$$\mathcal{T}^{\alpha,\beta}_{n,\text{sil}} := (1 - \beta)\,\mathcal{T}^\pi + (1 - \alpha)\beta\,\mathcal{T}_{n,\text{sil}} + \alpha\beta\,(\mathcal{T}^\mu)^{n-1}\mathcal{T}^\pi. \qquad (9)$$

4.2 Properties of the operators

Theorem 2. (proof in Appendix B) Let $\pi, \mu$ be the target and behavior policy respectively. Then the following results hold:

Contraction rate. $\Gamma(\mathcal{T}^{\alpha,\beta}_{n,\text{sil}}) \le (1 - \beta)\gamma + (1 - \alpha)\beta + \alpha\beta\gamma^n$. The operator is always contractive for $\alpha \in [0, 1]$, $\beta \in [0, 1)$. When $\alpha > \frac{1 - \gamma}{1 - \gamma^n}$, we have for any $\beta \in (0, 1)$ that $\Gamma(\mathcal{T}^{\alpha,\beta}_{n,\text{sil}}) \le \gamma' < \gamma$ for some $\gamma'$.

Fixed point bias. $\mathcal{T}^{\alpha,\beta}_{n,\text{sil}}$ has a unique fixed point $Q_{\alpha,\beta}$ for any $\alpha \in [0, 1]$, $\beta \in [0, 1)$ such that $(1 - \alpha)\beta < 1$. This fixed point satisfies the bounds $Q_{\eta\pi + (1-\eta)\mu^{n-1}\pi} \le Q_{\alpha,\beta} \le Q^{\pi^\ast}$, where $Q_{\eta\pi + (1-\eta)\mu^{n-1}\pi}$ is the unique fixed point of the operator $\eta\,\mathcal{T}^\pi + (1 - \eta)\,(\mathcal{T}^\mu)^{n-1}\mathcal{T}^\pi$ with $\eta = \frac{1 - \beta}{1 - \beta + \alpha\beta}$.

To highlight the connections between uncorrected n-step and SIL, we discuss two special cases.

When $\alpha = 1$, $\mathcal{T}^{\alpha,\beta}_{n,\text{sil}}$ removes all the lower bound components and reduces to $(1 - \beta)\mathcal{T}^\pi + \beta(\mathcal{T}^\mu)^{n-1}\mathcal{T}^\pi$. This recovers the trade-off results discussed in [7]: when $\beta = 1$, the operator becomes the uncorrected n-step update with the smallest possible contraction rate, $\Gamma(\mathcal{T}^{\alpha,\beta}_{n,\text{sil}}) \le \gamma^n$, but the fixed point $Q_{\alpha,\beta}$ is biased. In general, there is no lower bound on the fixed point, so its value could be arbitrary depending on both $\pi$ and $\mu$.

When $\alpha \in (\frac{1 - \gamma}{1 - \gamma^n}, 1]$, $\mathcal{T}^{\alpha,\beta}_{n,\text{sil}}$ incorporates the lower bound operator. Importantly, unlike uncorrected n-step, the fixed point is now lower bounded, $Q_{\alpha,\beta} \ge Q_{\eta\pi + (1-\eta)\mu^{n-1}\pi}$. Because such a fixed point bias is lower bounded, we call it positive bias. Adjusting $\alpha$ creates a trade-off between the contraction rate and the positive fixed point bias. In addition, the fixed point bias is safe in that it is upper bounded by the optimal Q-function, $Q_{\alpha,\beta} \le Q^{\pi^\ast}$, which might be a desirable property in cases where over-estimation bias hurts practical performance [30, 27]. In Section 5, we will see that such positive fixed point bias is beneficial to empirical performance, as similarly observed in [11, 8, 10]. Though $\mathcal{T}^{\alpha,\beta}_{n,\text{sil}}$ does not contract as fast as the uncorrected n-step operator $(\mathcal{T}^\mu)^{n-1}\mathcal{T}^\pi$, it still achieves a bound on the contraction rate strictly smaller than that of $\mathcal{T}^\pi$. As such, generalized SIL also enjoys fast contractions relative to the baseline algorithm.

Figure 1: Bias of Q-function networks with twin-delayed deep deterministic policy gradient (TD3) variants on the Walker Stand task.

Empirical evaluation of Q-function bias. To validate the statements made in Theorem 2 on the bias of Q-functions, we test with TD3 for an empirical evaluation [27]. At a given time in training, the bias at a pair $(x, a)$ is calculated as the difference between the Q-function network prediction and an unbiased Monte-Carlo estimate of the Q-function for the current policy $\pi$, i.e. $Q_\theta(x, a) - \hat{Q}^\pi(x, a)$ (see Footnote 1 below). Figure 1 shows the mean ± 0.5 std of such bias over time, with the mean and std computed over visited state-action pairs under $\pi$. In general, the bias of TD3 is small, which is consistent with observations made in [27].
The bias of TD3 with uncorrected $n = 5$-step spreads over a wider range near zero, indicating significant non-zero bias on both sides. For TD3 with generalized SIL ($n = 5$), the bias is also spread out, but the mean bias is significantly greater than zero. This implies that SIL generally induces a positive bias in the fixed point. In summary, these observations confirm that neural network based Q-functions $Q_\theta(x, a)$ display biases similar to those introduced by the corresponding exact operators.

Footnote 1: By definition, the bias should be the difference between the fixed point of the operator and the target $Q^\pi$. Since TD3 employs heavy replay during training, we expect the Q-function to be close to the fixed point. Because both the dynamics and the policy are deterministic, a one-sample estimate of the Q-function is accurate enough to approximate the true Q-function, $\hat{Q}^\pi \approx Q^\pi$. Hence the bias is approximated by $Q_\theta - \hat{Q}^\pi$.

5 Experiments

We seek to address the following questions in the experiments: (1) Does generalized SIL entail performance gains on both deterministic and stochastic actor-critic algorithms? (2) How do the design choices (e.g. hyper-parameters, prioritized replay) of generalized SIL impact its performance?

Benchmark tasks. For benchmark tasks, we focus on state-based continuous control. In order to assess the strengths of different algorithmic variants, we consider similar tasks (Walker, Cheetah and Ant) with different simulation backends from OpenAI gym [31], DeepMind Control Suite [32] and the Bullet Physics Engine [33]. These backends differ in many aspects, e.g. the dimensions of the observation and action spaces, the transition dynamics and the reward functions. With such a wide range of variation, we seek to validate algorithmic gains with sufficient robustness across domains. There are a total of 8 distinct simulated control tasks, with details in Appendix D.

5.1 Deterministic actor-critic

Baselines. We choose TD3 [27] as the baseline algorithm, which employs a deterministic actor $\pi_\phi(x)$. TD3 builds on deep deterministic policy gradient (DDPG) [15] and alleviates the over-estimation bias in DDPG via delayed updates and double critics, similar to double Q-learning [34, 30]. Through a comparison of DDPG and TD3 combined with generalized SIL, we will see that over-estimation bias makes the advantages of lower bound Q-learning much less significant. To incorporate generalized SIL into TD3, we adopt an additive loss function: let $L^{(n)}_{\text{TD3}}(\theta)$ be the n-step TD3 loss function and $L^{(m)}_{\text{sil}}(\theta)$ be the m-step generalized SIL loss. The full loss is $L(\theta) := L^{(n)}_{\text{TD3}}(\theta) + \eta L^{(m)}_{\text{sil}}(\theta)$ with some $\eta \ge 0$. We will use this general loss template to describe the algorithmic variants compared below.

Return-based SIL for TD3. A straightforward extension of SIL [8] and optimality tightening [11] to deterministic actor-critic algorithms is to estimate the return $\hat{R}^\mu(x_t, a_t) := \sum_{t' \ge t} \gamma^{t' - t} r_{t'}$ on a single trajectory $(x_t, a_t, r_t)_{t \ge 0}$ and minimize the lower bound objective $([\hat{R}^\mu(x, a) - Q_\theta(x, a)]_+)^2$. Note that since both the policy and the transitions are deterministic (for the benchmarks listed above), the one-sample estimate of the return is exact, in that $\hat{R}^\mu(x, a) = R^\mu(x, a) = Q^\mu(x, a)$. In this case, return-based SIL is exactly equivalent to generalized SIL with $n \to \infty$.
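To make the additive loss template concrete, here is a minimal PyTorch-style sketch of the critic part of $L(\theta) = L^{(n)}_{\text{TD3}}(\theta) + \eta L^{(m)}_{\text{sil}}(\theta)$. This is our own illustration rather than the released implementation; the `q_net` interface and batch keys are hypothetical, and the actor losses are omitted.

```python
import torch

def combined_td3_sil_critic_loss(q_net, batch, eta=0.1):
    """Additive critic loss L = L_TD3 + eta * L_sil (sketch).

    batch is assumed to hold:
      obs, act      : sampled state-action pairs
      td3_target    : usual (n-step) TD target for the TD3 critic regression
      sil_target    : m-step lower bound estimate \hat{L}^{pi,mu,m}(x, a)
    """
    q_pred = q_net(batch["obs"], batch["act"])

    # Standard TD3 critic loss: two-sided regression toward the TD target.
    td3_loss = 0.5 * ((q_pred - batch["td3_target"]) ** 2).mean()

    # Generalized SIL loss: one-sided penalty, active only when the
    # lower-bound target exceeds the current prediction (cf. Eqn. (8)).
    sil_loss = 0.5 * ((batch["sil_target"] - q_pred).clamp(min=0.0) ** 2).mean()

    return td3_loss + eta * sil_loss
```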
Evaluations. We provide evaluations on a few standard benchmark tasks in Figure 2, as well as their variants with delayed rewards. To facilitate attributing the training performance to the various components of generalized SIL, we compare with a few algorithmic variants: 1-step TD3 ($n = 1, \eta = 0$); 5-step TD3 ($n = 5, \eta = 0$); TD3 with 5-step generalized SIL ($n = 1, \eta = 0.1, m = 5$); TD3 with return-based SIL ($n = 1, \eta = 0.1, m = \infty$). Importantly, note that the weighting coefficient is fixed at $\eta = 0.1$ for all cases of generalized SIL. The training results of selected algorithms are shown in Figure 2. We show the final performance of all baselines in Table 1 in Appendix D.

Figure 2: Standard evaluations on 8 benchmark tasks (panels include DMWalker Run, DMWalker Stand, DMWalker Walk, DMCheetah Run, HalfCheetah and HalfCheetah(B), among others). Different colors represent different algorithmic variants. Each curve shows the mean ± 0.5 std of evaluation performance during training, averaged across 3 random seeds. The x-axis shows the time steps and the y-axis shows the cumulative returns. Observe that 5-step generalized SIL (blue) generally outperforms other baselines. Tasks prefixed with DM are from the DeepMind Control Suite, and tasks with (B) are from Bullet.

We make several observations: (1) For uncorrected n-step, the best $n$ is task dependent. However, 5-step generalized SIL consistently improves the performance over uncorrected n-step TD3 baselines. (2) SIL losses generally accelerate the optimization. Indeed, both generalized SIL and return-based SIL generally perform better than pure TD3 algorithms. (3) The advantage of generalized SIL goes beyond the n-step bootstrap. Because n-step generalized SIL is similar to n-step updates, it is reasonable to speculate that the performance gains of SIL are partly attributable to n-step updates. From the significant advantage of generalized SIL relative to plain n-step updates, we see that its performance gains also come from the lower bound technique. (4) n-step SIL with $n = 5$ works the best. With $n = 1$, SIL does not benefit from bootstrapping partial trajectories over long horizons; with $n = \infty$, SIL does not benefit from bootstrapped values at all. As discussed in Section 3, the n-step bootstrap provides benefits in (i) variance reduction (replacing the discounted sum of rewards by a value function) and (ii) tightened bounds. In deterministic environments with deterministic policies, advantage (ii) accounts for most of the performance gains.

5.2 Ablation study for deterministic actor-critic

Please refer to Table 1 in Appendix D for a summary of ablation experiments over SIL variants. Here, we focus on discussions of the ablation results.

Horizon parameter n. In our experience, we find that $n = 5$ works reasonably well, though other nearby values might work as well. To clarify the effect of the extremes of $n$: at one extreme, $n = 1$ and SIL does not benefit from trajectory-based learning, generally underperforming $n = 5$; at the other, $n = \infty$ and return-based SIL does not provide as significant a speed-up as $n = 5$.

Prioritized experience replay. In general, prioritized replay has two hyper-parameters: $\alpha$ for the degree of prioritized sampling and $\beta$ for the degree of IS corrections [28]. For generalized SIL, we adopt $\alpha = 0.6, \beta = 0.1$ as in [8]. We also consider variants where the tuples are sampled according to the priority but the IS weights are not corrected ($\alpha = 0.6, \beta = 0.0$), and where there is no prioritized sampling ($\alpha = \beta = 0.0$). The results are reported in Table 2 in Appendix D. We observe that generalized SIL works the best when both prioritized sampling and IS corrections are present.
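For concreteness, here is a minimal NumPy sketch (our own, with hypothetical buffer fields) of prioritized sampling where the priorities are the lower-bound losses $[\hat{L}^{\pi,\mu,n}_{\text{ent}}(x, a) - Q_\theta(x, a)]_+$ and the importance weights follow the usual $(\alpha, \beta)$ scheme of [28].

```python
import numpy as np

def sample_with_sil_priorities(lb_targets, q_preds, batch_size,
                               alpha=0.6, beta=0.1, eps=1e-6):
    """Sample buffer indices with probability proportional to the
    (exponentiated) lower-bound loss, and return IS correction weights.

    lb_targets : lower-bound targets \hat{L}^{pi,mu,n}(x, a) for buffer entries
    q_preds    : current Q_theta(x, a) predictions for the same entries
    """
    priorities = np.maximum(lb_targets - q_preds, 0.0) + eps  # [L - Q]_+
    probs = priorities ** alpha
    probs /= probs.sum()

    idx = np.random.choice(len(probs), size=batch_size, p=probs)

    # Importance-sampling weights correct for the non-uniform sampling.
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights
```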
Over-estimation bias. Algorithms with over-estimation bias (e.g. DDPG) do not benefit from the lower bound loss as much as TD3 does, as shown by additional results in Appendix D. We speculate that this is because the construction assumes the Q-function network $Q_\theta(x, a)$ to be a close approximation of the Q-function $Q^\pi(x, a)$; in cases where over-estimation bias is severe, this assumption does not hold. As a result, the performance is potentially harmed instead of improved by the uncontrolled positive bias [30, 27]. This contrasts with the controlled positive bias of SIL, which improves the performance.

5.3 Stochastic actor-critic

Baselines. For the stochastic actor-critic algorithm, we adopt proximal policy optimization (PPO) [18]. Unlike critic-based algorithms such as TD3, PPO estimates gradients using near on-policy samples.

Delayed reward environments. Delayed reward environments test an algorithm's capability to tackle delayed feedback in the form of sparse rewards [8]. In particular, a standard benchmark environment returns a dense reward $r_t$ at each step $t$. Consider instead accumulating the reward over $k$ consecutive steps and returning the sum at the end of every $k$ steps, i.e. $r'_t = 0$ if $t \bmod k \neq 0$ and $r'_t = \sum_{\tau = t - k + 1}^{t} r_\tau$ if $t \bmod k = 0$.

Evaluations. We compare three baselines: PPO, PPO with SIL [8], and PPO with generalized SIL with $n = 5$. We train these variants on a set of OpenAI gym tasks with delayed rewards, where the delays are $k \in \{1, 5, 10, 20\}$. Please refer to Appendix D for further details of the algorithms. The final performance of the algorithms after training ($5 \times 10^6$ steps for HalfCheetah and $10^7$ for the others) is shown in Figure 3. We make several observations: (1) The performance of PPO is generally inferior to its generalized SIL or SIL extensions. This implies the necessity of carrying out SIL in general, as observed in [8]. (2) The performance of generalized SIL with $n = 5$ differs depending on the task. SIL works significantly better with Ant, while generalized SIL works better with Humanoid. Since SIL is the special case $n \to \infty$, this implies potential benefits from adapting $n$ to each task.

Figure 3: Standard evaluations on 4 benchmark OpenAI gym tasks (HalfCheetah, Ant, Walker2d, Humanoid). Different colors represent different algorithmic variants. Each curve shows the mean ± 0.5 std of evaluation performance at the end of training, averaged across 5 random seeds. The x-axis shows the reward delay and the y-axis shows the cumulative returns. The ticks {1, 5, 10, 20} show the delays, and the x-axis of the plotted data is slightly shifted for better visualization.

6 Further Discussions on Related Work

Over-estimation bias in Q-learning. Q-learning and TD-learning are popular algorithms for RL [35, 36]. Due to the max operator, sampled updates of Q-learning naturally incur over-estimation bias, which potentially leads to unstable learning. To mitigate the bias, prior work has considered double Q-learning [34, 30], explicit bias correction [37], linear combinations of double Q-learning and Q-learning [38], bootstrapping from past predictions [39] and using an ensemble of Q-functions [40]. Similar ideas have been applied to actor-critic algorithms [27]. While it is conventionally believed that over-estimation bias hurts performance, [40] provides concrete examples where estimation bias (under- or over-estimation) could accelerate learning. In practice, for certain RL environments where rewards are sparse, it is desirable to introduce positive bias to encourage exploration [8].
Learning from off-policy data. Off-policy learning is crucial for modern RL algorithms [41, 42]. At the core of many off-policy learning algorithms [1, 43, 44, 3, 45, 46], importance sampling (IS) corrects for the distributional mismatch between the behavior policy $\mu$ and the target policy $\pi$, generating unbiased updates. Despite the theoretical foundations, IS-based algorithms often underperform empirically motivated algorithms such as n-step updates [4-6]. In general, uncorrected n-step algorithms could be interpreted as trading off fast contractions against fixed point bias [7], which seems to have a significant practical effect. In addition to potentially better performance, uncorrected n-step updates also do not require access to e.g. $\mu(a \mid x)$. This enables learning with truly arbitrary off-policy data. Built on top of n-step updates, we propose generalized n-step SIL, which intentionally introduces a positive bias into the fixed point, effectively filtering out behavior data with poor performance. This idea of learning from well-performing off-policy data is rooted in algorithmic paradigms such as behavior cloning [47] and inverse RL [48], and is more recently instantiated by e.g. episodic control [49, 50], lower bound Q-learning [11] and SIL [8-10].

7 Conclusion

We have proposed generalized n-step lower bound Q-learning, a strict generalization of return-based lower bound Q-learning and the corresponding self-imitation learning algorithm [8]. We have drawn close connections between n-step lower bound Q-learning and uncorrected n-step updates: both techniques achieve performance gains by invoking a trade-off between the contraction rate and the fixed point bias of the evaluation operators. Empirically, we observe that the positive bias induced by lower bound Q-learning provides more consistent improvements than arbitrary n-step bias. It is of interest to study in general what bias could be beneficial to policy optimization, and how to exploit such bias in practical RL algorithms.

8 Broader Impact

Algorithms which learn from off-policy samples are critical for the application of RL to more impactful real life domains such as autonomous driving and health care. Our work provides insights into SIL and its close connections to popular off-policy learning techniques such as n-step Q-learning. We believe our work is a positive step towards a better understanding of efficient off-policy RL algorithms, which paves the way for future research into important applications.

9 Acknowledgements

The author thanks Mark Rowland and Tadashi Kozuno for insightful discussions about this project.

References

[1] Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In ICML, pages 417-424, 2001.

[2] Anna Harutyunyan, Marc G Bellemare, Tom Stepleton, and Rémi Munos. Q(λ) with off-policy corrections. In International Conference on Algorithmic Learning Theory, pages 305-320. Springer, 2016.

[3] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054-1062, 2016.

[4] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[5] Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.

[6] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. 2018.

[7] Mark Rowland, Will Dabney, and Rémi Munos. Adaptive trade-offs in off-policy learning. arXiv preprint arXiv:1910.07478, 2019.

[8] Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.

[9] Tanmay Gangwani, Qiang Liu, and Jian Peng. Learning self-imitating diverse policies. arXiv preprint arXiv:1805.10309, 2018.

[10] Yijie Guo, Jongwook Choi, Marcin Moczulski, Samy Bengio, Mohammad Norouzi, and Honglak Lee. Efficient exploration with self-imitation learning via trajectory-conditioned policy. arXiv preprint arXiv:1907.10247, 2019.

[11] Frank S He, Yang Liu, Alexander G Schwing, and Jian Peng. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.

[12] Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679-684, 1957.

[13] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057-1063, 2000.

[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[16] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

[17] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

[18] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[19] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.

[20] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.

[21] Kavosh Asadi and Michael L Littman. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pages 243-252, 2017.

[22] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775-2785, 2017.

[23] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[24] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30-37. Elsevier, 1995.
[25] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937, 2016.

[26] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

[27] Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

[28] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[29] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

[30] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[31] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[32] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

[33] Erwin Coumans. Bullet physics engine. Open Source Software: http://bulletphysics.org, 1(3):84, 2010.

[34] Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613-2621, 2010.

[35] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

[36] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. In Proceedings of 1995 34th IEEE Conference on Decision and Control, volume 1, pages 560-564. IEEE, 1995.

[37] Donghun Lee, Boris Defourny, and Warren B Powell. Bias-corrected Q-learning to control max-operator bias in Q-learning. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 93-99. IEEE, 2013.

[38] Zongzhang Zhang, Zhiyuan Pan, and Mykel J Kochenderfer. Weighted double Q-learning. In IJCAI, pages 3455-3461, 2017.

[39] Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pages 176-185. PMLR, 2017.

[40] Qingfeng Lan, Yangchen Pan, Alona Fyshe, and Martha White. Maxmin Q-learning: Controlling the estimation bias of Q-learning. arXiv preprint arXiv:2002.06487, 2020.

[41] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[42] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1-103, 2010.

[43] Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485-511, 2014.

[44] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139-2148, 2016.
[45] Ashique Rupam Mahmood, Huizhen Yu, and Richard S Sutton. Multi-step off-policy learning without importance sampling ratios. arXiv preprint arXiv:1702.03006, 2017.

[46] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.

[47] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627-635, 2011.

[48] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.

[49] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.

[50] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.

[51] Joshua Achiam. OpenAI Spinning Up. GitHub repository, 2018.

[52] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[53] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[54] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.