# On the Estimation Bias in Double Q-Learning

Zhizhou Ren1, Guangxiang Zhu2, Hao Hu2, Beining Han2, Jianglun Chen2, Chongjie Zhang2
1 Department of Computer Science, University of Illinois at Urbana-Champaign
2 Institute for Interdisciplinary Information Sciences, Tsinghua University
zhizhour@illinois.edu, guangxiangzhu@outlook.com, {hu-h19, hbn18, chen-jl18}@mails.tsinghua.edu.cn, chongjie@tsinghua.edu.cn
(Work done while Zhizhou was an undergraduate at Tsinghua University.)

Abstract

Double Q-learning is a classical method for reducing overestimation bias, which is caused by taking maximum estimated values in the Bellman operation. Its variants in the deep Q-learning paradigm have shown great promise in producing reliable value prediction and improving learning performance. However, as shown by prior work, double Q-learning is not fully unbiased and suffers from underestimation bias. In this paper, we show that such underestimation bias may lead to multiple non-optimal fixed points under an approximate Bellman operator. To address the concerns of converging to non-optimal stationary solutions, we propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning. This approach leverages an approximate dynamic programming to bound the target value. We extensively evaluate our proposed method on the Atari benchmark tasks and demonstrate its significant improvement over baseline algorithms.

1 Introduction

Value-based reinforcement learning with neural networks as function approximators has become a widely used paradigm and shown great promise in solving complicated decision-making problems in various real-world applications, including robotics control (Lillicrap et al., 2016), molecular structure design (Zhou et al., 2019), and recommendation systems (Chen et al., 2018). Towards understanding the foundation of these successes, investigating the algorithmic properties of deep-learning-based value function approximation has attracted growing attention in recent years (Van Hasselt et al., 2018; Fu et al., 2019; Achiam et al., 2019; Dong et al., 2020).

One phenomenon of interest is that Q-learning (Watkins, 1989) is known to suffer from overestimation issues, since it takes a maximum operator over a set of estimated action values. Compared with underestimation errors, overestimation errors are more likely to be propagated through greedy action selection, which leads to an overestimation bias in value prediction (Thrun and Schwartz, 1993). This over-optimistic behavior of decision making has also been investigated in the literature of management science (Smith and Winkler, 2006) and economics (Thaler, 1988).

In deep Q-learning algorithms, one major source of value estimation errors comes from the optimization procedure. Although a deep neural network may have sufficient expressive power to represent an accurate value function, the back-end optimization problem is hard to solve. As a result of computational considerations, stochastic gradient descent is almost the default choice for training deep Q-networks. As pointed out by Riedmiller (2005) and Van Hasselt et al. (2018), a mini-batch gradient update may have unpredictable effects on state-action pairs outside the training batch. The high variance of gradient estimation by such stochastic methods leads to an unavoidable approximation error in value prediction, which cannot be eliminated by simply increasing the sample size and
network capacity. Through the maximum operator in the Q-learning paradigm, such approximation error would propagate and accumulate to form an overestimation bias. In practice, a dramatic overestimation can be observed (Van Hasselt et al., 2016) even though most benchmark environments are nearly deterministic (Brockman et al., 2016).

Double Q-learning (Van Hasselt, 2010) is a classical method to reduce the risk of overestimation; it is a specific variant of the double estimator (Stone, 1974) in the Q-learning paradigm. Instead of taking the greedy maximum values, it uses a second value function to construct an independent action-value evaluation as a form of cross-validation. Under proper assumptions, double Q-learning was proven to slightly underestimate rather than overestimate the maximum expected values (Van Hasselt, 2010). This technique has become a default implementation for stabilizing deep Q-learning algorithms (Hessel et al., 2018). In continuous control domains, a famous variant named clipped double Q-learning (Fujimoto et al., 2018) also shows great success in reducing the accumulation of errors in actor-critic methods (Haarnoja et al., 2018; Kalashnikov et al., 2018).

To understand the algorithmic properties of double Q-learning and its variants, most prior work focuses on the characterization of the one-step estimation bias, i.e., the expected deviation from target values in a single step of Bellman operation (Lan et al., 2020; Chen et al., 2021). In this paper, we present a different perspective on how these one-step errors accumulate in stationary solutions. We first review a widely used analytical model introduced by Thrun and Schwartz (1993) and reveal the fact that, due to the perturbation of approximation error, both double Q-learning and clipped double Q-learning have multiple approximate fixed points in this model. This result raises the concern that double Q-learning may easily get stuck in some local stationary regions and become inefficient in searching for the optimal policy. Motivated by this finding, we propose a novel value estimator, named the doubly bounded estimator, which utilizes an abstracted dynamic programming as a lower-bound estimation to rule out the potential non-optimal fixed points. The proposed method is easy to combine with other existing techniques such as clipped double Q-learning. We extensively evaluate our approach on a variety of Atari benchmark tasks and demonstrate significant improvement over baseline algorithms in terms of sample efficiency and convergence performance.

2 Background

A Markov Decision Process (MDP; Bellman, 1957) is a classical framework to formalize an agent-environment interaction system, which can be defined as a tuple $M = \langle S, A, P, R, \gamma \rangle$. We use $S$ and $A$ to denote the state and action space, respectively. $P(s'|s, a)$ and $R(s, a)$ denote the transition and reward functions, which are initially unknown to the agent. $\gamma$ is the discount factor. The goal of reinforcement learning is to construct a policy $\pi: S \to A$ maximizing the cumulative rewards

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \,\middle|\, s_0 = s,\ s_{t+1} \sim P(\cdot \mid s_t, \pi(s_t))\right].$$

Another quantity of interest can be defined through the Bellman equation $Q^{\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^{\pi}(s')]$. The optimal value function $Q^*$ corresponds to the unique solution of the Bellman optimality equation, $Q^*(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[\max_{a' \in A} Q^*(s', a')]$.
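As a point of reference for the function-approximation discussion that follows, the sketch below illustrates these definitions with plain tabular value iteration. It is an illustrative sketch only; the two-state MDP in it is a made-up placeholder and not an environment from this paper.

```python
# Minimal tabular value iteration: repeatedly apply the Bellman optimality update
# Q(s,a) <- R(s,a) + gamma * E_{s'}[max_{a'} Q(s',a')] until it converges to Q*.
import numpy as np

def value_iteration(P, R, gamma=0.99, num_iters=2000):
    """P: transition tensor [S, A, S'] with rows summing to 1; R: reward matrix [S, A]."""
    num_states, num_actions = R.shape
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        V = Q.max(axis=1)           # V(s') = max_{a'} Q(s', a')
        Q = R + gamma * (P @ V)     # expectation over next states
    return Q

# Placeholder 2-state, 2-action deterministic MDP (arbitrary numbers).
P = np.zeros((2, 2, 2))
P[0, 0, 1] = P[0, 1, 0] = P[1, 0, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0], [1.0, 0.0]])
print(value_iteration(P, R))  # approaches the unique solution Q* of the Bellman optimality equation
```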
Q-learning algorithms are based on the Bellman optimality operator $\mathcal{T}$ stated as follows:

$$(\mathcal{T} Q)(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\max_{a' \in A} Q(s', a')\right]. \tag{1}$$

By iterating this operator, value iteration is proved to converge to the optimal value function $Q^*$. To extend Q-learning methods to real-world applications, function approximation is indispensable for dealing with high-dimensional state spaces. Deep Q-learning (Mnih et al., 2015) considers a sample-based objective function and deploys an iterative optimization framework

$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a' \in A} Q_{\theta_t}(s', a') - Q_{\theta}(s, a)\right)^2\right], \tag{2}$$

in which $\Theta$ denotes the parameter space of the value network, and $\theta_0 \in \Theta$ is initialized by some predetermined method. $(s, a, r, s')$ is sampled from a data distribution $D$ that changes during exploration. With infinite samples and a sufficiently rich function class, the update rule stated in Eq. (2) is asymptotically equivalent to applying the Bellman optimality operator $\mathcal{T}$, but the underlying optimization is usually inefficient in practice. In deep Q-learning, Eq. (2) is optimized by mini-batch gradient descent, and thus its value estimation suffers from unavoidable approximation errors.

3 On the Effects of Underestimation Bias in Double Q-Learning

In this section, we first revisit a common analytical model used by previous work for studying estimation bias (Thrun and Schwartz, 1993; Lan et al., 2020), in which double Q-learning is known to have underestimation bias. Based on this analytical model, we show that its underestimation bias could make double Q-learning have multiple fixed-point solutions under an approximate Bellman optimality operator. This result suggests that double Q-learning may have extra non-optimal stationary solutions under the effects of the approximation error.

3.1 Modeling Approximation Error in Q-Learning

In Q-learning with function approximation, the ground-truth Bellman optimality operator $\mathcal{T}$ is approximated by a regression problem through Bellman error minimization (see Eq. (1) and Eq. (2)), which may suffer from unavoidable approximation errors. Following Thrun and Schwartz (1993) and Lan et al. (2020), we formalize the underlying approximation errors as a set of random noises $e^{(t)}(s, a)$ on the regression outcomes:

$$Q^{(t+1)}(s, a) = (\mathcal{T} Q^{(t)})(s, a) + e^{(t)}(s, a). \tag{3}$$

In this model, double Q-learning (Van Hasselt, 2010) can be modeled by two estimator instances $\{Q^{(t)}_i\}_{i \in \{1,2\}}$ with separate noise terms $\{e^{(t)}_i\}_{i \in \{1,2\}}$. For simplification, we introduce a policy function $\pi^{(t)}(s) = \arg\max_a Q^{(t)}_1(s, a)$ to rewrite the state-value function as follows:

$$V^{(t)}(s) = Q^{(t)}_2\big(s, \pi^{(t)}(s)\big), \qquad \pi^{(t)}(s) = \arg\max_{a \in A} Q^{(t)}_1(s, a),$$
$$\forall i \in \{1, 2\}, \quad Q^{(t+1)}_i(s, a) = \underbrace{R(s, a) + \gamma \mathbb{E}_{s'}\big[V^{(t)}(s')\big]}_{\text{target value}} + \underbrace{e^{(t)}_i(s, a)}_{\text{approximation error}}. \tag{4}$$

A minor difference of Eq. (4) from the definition of double Q-learning given by Van Hasselt (2010) is the usage of a unified target value $V^{(t)}(s')$ for both estimators. This simplification does not affect the derived implications and is also implemented by advanced variants of double Q-learning (Fujimoto et al., 2018; Lan et al., 2020).

To establish a unified framework for analysis, we use a stochastic operator $\widetilde{\mathcal{T}}$ to denote the Q-iteration procedure $Q^{(t+1)} \leftarrow \widetilde{\mathcal{T}} Q^{(t)}$, e.g., the update rules stated in Eq. (3) and Eq. (4). We call such an operator $\widetilde{\mathcal{T}}$ a stochastic Bellman operator, since it approximates the ground-truth Bellman optimality operator $\mathcal{T}$ and carries some noise due to approximation errors. Note that, as shown in Eq.
(4), the target value can be constructed using only the state-value function $V^{(t)}$. We can therefore define the stationary points of the state values $V^{(t)}$ as the fixed points of a stochastic Bellman operator $\widetilde{\mathcal{T}}$.

Definition 1 (Approximate Fixed Points). Let $\widetilde{\mathcal{T}}$ denote a stochastic Bellman operator, such as those stated in Eq. (3) and Eq. (4). A state-value function $V$ is regarded as an approximate fixed point under a stochastic Bellman operator $\widetilde{\mathcal{T}}$ if it satisfies $\mathbb{E}[\widetilde{\mathcal{T}} V] = V$, where $\widetilde{\mathcal{T}} V$ denotes the output state-value function obtained by applying the Bellman operator $\widetilde{\mathcal{T}}$ to $V$.

Remark. In prior work (Thrun and Schwartz, 1993), value estimation bias is defined as the expected one-step deviation with respect to the ground-truth Bellman operator, i.e., $\mathbb{E}[(\widetilde{\mathcal{T}} V^{(t)})(s)] - (\mathcal{T} V^{(t)})(s)$. The approximate fixed points stated in Definition 1 characterize the accumulation of estimation biases in stationary solutions. In Appendix A.2, we prove the existence of such fixed points as the following statement.

Proposition 1. Assume the probability density functions of the noise terms $\{e(s, a)\}$ are continuous. The stochastic Bellman operators defined by Eq. (3) and Eq. (4) must have approximate fixed points in arbitrary MDPs.

Figure 1: (a) A simple construction: a simple infinite-horizon MDP where double Q-learning as stated in Eq. (4) has multiple approximate fixed points. $R_{i,j}$ is a shorthand of $R(s_i, a_j)$; the diagram specifies $R_{0,1} = 1$, $R_{1,0} = R_{1,1} = 1$, $\epsilon = 1.0$, and $\gamma = 0.99$. (b) Numerical solutions of fixed points: the fixed points produced by double Q-learning in the MDP presented above, listed in the table below. $\pi$ denotes the expected policy generated by the corresponding fixed point under the perturbation of the noise $e(s, a)$; a formal description of $\pi$ refers to Definition 2 in Appendix A.3. (c) Visualizing non-monotonicity: the relation between the input state value $V^{(t)}(s_0)$ and the expected output state value $\mathbb{E}[V^{(t+1)}(s_0)]$ generated by double Q-learning in the constructed MDP, in which we assume $V^{(t)}(s_1) = 100$.

| $V(s_0)$ | $V(s_1)$ | $\pi(a_0 \mid s_0)$ |
| --- | --- | --- |
| 100.162 | 100.0 | 62.2% |
| 101.159 | 100.0 | 92.9% |
| 110.0 | 100.0 | 100.0% |

3.2 Existence of Multiple Approximate Fixed Points in Double Q-Learning Algorithms

Given the definition of the approximate fixed point, a natural question is whether such fixed points are unique. Recall that the optimal value function $Q^*$ is the unique solution of the Bellman optimality equation, which is the foundation of Q-learning algorithms. However, in this section, we show that, under the effects of the approximation error, the approximate fixed points of double Q-learning may not be unique.

An Illustrative Example. Figure 1a presents a simple MDP in which double Q-learning as stated in Eq. (4) has multiple approximate fixed points. For simplicity, this MDP is set to be fully deterministic and contains only two states. All actions in state $s_1$ lead to a self-loop and produce a unit reward signal. In state $s_0$, executing action $a_0$ results in a self-loop with a slightly larger reward signal than choosing action $a_1$, which leads to state $s_1$. The only challenge for decision making in this MDP is to distinguish the outcomes of executing actions $a_0$ and $a_1$ in state $s_0$. To make the example more accessible, we assume the approximation errors $\{e^{(t)}(s, a)\}_{t,s,a}$ are a set of independent random noises sampled from a uniform distribution $\mathrm{Uniform}(-\epsilon, \epsilon)$. This simplification is also adopted by Thrun and Schwartz (1993) and Lan et al. (2020) in case studies.
Here, we select the noise magnitude $\epsilon = 1.0$ and the discount factor $\gamma = 0.99$ to balance the scale of the involved quantities. Solving the equation $\mathbb{E}[\widetilde{\mathcal{T}} V] = V$ according to the definition of the approximate fixed point (see Definition 1), we obtain the numerical solutions presented in Table 1b. There are three different fixed-point solutions. The first thing to notice is that the optimal fixed point $V^*$ is retained in this MDP (see the last row of Table 1b), since the noise magnitude $\epsilon = 1.0$ is much smaller than the optimality gap $Q^*(s_0, a_0) - Q^*(s_0, a_1) = 10$. The other two fixed points are non-optimal and very close to $Q^*(s_0, a_1) = 100$. Intuitively, under the perturbation of the approximation error, the agent cannot identify the correct maximum-value action for policy improvement in these situations, which is the cause of such non-optimal fixed points. To formalize these implications, we present a sufficient condition for the existence of multiple extra fixed points.

Mathematical Condition. Note that the definition of a stochastic Bellman operator can be decoupled into two parts: (1) computing the target values $\mathcal{T} Q^{(t)}$ according to the given MDP; (2) performing an imprecise regression and some algorithm-specific computations to obtain $Q^{(t+1)}$. The first part is defined by the MDP, and the second part is the algorithmic procedure. From this perspective, we can define the input of a learning algorithm as the set of ground-truth target values $\{(\mathcal{T} Q^{(t)})(s, a)\}_{s,a}$. Based on this notation, a sufficient condition for the existence of multiple fixed points is stated as follows.

Proposition 2. Let $f_s(\{(\mathcal{T} Q)(s, a)\}_{a \in A}) = \mathbb{E}[(\widetilde{\mathcal{T}} V)(s)]$ denote the expected output value of a learning algorithm on state $s$. Assume $f_s(\cdot)$ is differentiable. If the algorithmic procedure $f_s(\cdot)$ satisfies Eq. (5), there exists an MDP such that it has multiple approximate fixed points:

$$\exists s,\ \exists i,\ \exists X \in \mathbb{R}^{|A|}, \quad \frac{\partial}{\partial x_i} f_s(X) > 1, \tag{5}$$

where $X = \{x_i\}_{i=1}^{|A|}$ denotes the input of the function $f_s$.

The proof of Proposition 2 is deferred to Appendix A.4. This proposition suggests that, in order to determine whether a Q-learning algorithm may have multiple fixed points, we need to check whether its expected output values can change dramatically with a slight alteration of the inputs. Taking the constructed MDP as an example, Figure 1c visualizes the relation between the input state value $V^{(t)}(s_0)$ and the expected output state value $\mathbb{E}[V^{(t+1)}(s_0)]$, assuming $V^{(t)}(s_1) = 100$ has converged to its stationary point. The minimum of the output value is located where $V^{(t)}(s_0)$ is slightly smaller than $V^{(t)}(s_1)$, since the expected policy derived from $\widetilde{\mathcal{T}} V^{(t)}$ then has a remarkable probability of choosing sub-optimal actions. This local minimum suffers from the most dramatic underestimation along the whole curve, and the underestimation eventually vanishes as the value of $V^{(t)}(s_0)$ increases. During this process, a large magnitude of the first-order derivative can be found, which meets the condition stated in Eq. (5).

Remark. In Appendix A.5, we show that clipped double Q-learning, a popular variant of double Q-learning, has multiple fixed points in an MDP slightly modified from Figure 1a. Besides, the condition presented in Proposition 2 does not hold for standard Q-learning, which uses a single maximum operator (see Proposition 6 in the Appendix). It remains an open question whether standard Q-learning with overestimation bias has multiple fixed points.
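The numerical solutions in Table 1b can be reproduced (up to Monte-Carlo error) by directly estimating the expected double-Q output of Eq. (4) in the constructed MDP. The sketch below is our own illustration rather than code from the paper; it assumes $R(s_0, a_0) = 1.1$, which is consistent with the reported optimum $V^*(s_0) = 1.1/(1-\gamma) = 110$, and takes the remaining rewards and the noise model from Figure 1a.

```python
# Monte-Carlo estimate of E[(T~ V)(s0)] for the double-Q operator of Eq. (4) on
# the two-state MDP of Figure 1a. Assumption: R(s0, a0) = 1.1 (consistent with
# V*(s0) = 110); R(s0, a1) = R(s1, a0) = R(s1, a1) = 1, gamma = 0.99, eps = 1.0.
import numpy as np

gamma, eps, n_samples = 0.99, 1.0, 200_000
rng = np.random.default_rng(0)

def expected_double_q_output(v0, v1=100.0):
    """E[V^{(t+1)}(s0)] given V^{(t)}(s0) = v0 and V^{(t)}(s1) = v1."""
    # Ground-truth targets (T Q)(s0, .): a0 self-loops, a1 moves to s1.
    targets = np.array([1.1 + gamma * v0, 1.0 + gamma * v1])
    e1 = rng.uniform(-eps, eps, size=(n_samples, 2))   # noise on estimator Q1
    e2 = rng.uniform(-eps, eps, size=(n_samples, 2))   # independent noise on Q2
    q1, q2 = targets + e1, targets + e2
    greedy = q1.argmax(axis=1)                         # action selection by Q1
    return q2[np.arange(n_samples), greedy].mean()     # evaluation by Q2, as in Eq. (4)

# Scan the input value V(s0) with V(s1) fixed at its stationary value (cf. Figure 1c).
for v0 in (99.0, 100.0, 100.162, 101.159, 110.0):
    print(v0, round(expected_double_q_output(v0), 3))
# Inputs whose output equals the input are approximate fixed points (Definition 1);
# near v0 = 100 the output curve crosses the diagonal more than once.
```

Under these assumptions, the scan recovers outputs matching the inputs at values close to those listed in Table 1b, while the optimal solution $V(s_0) = 110$ remains a fixed point.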
3.3 Diagnosing Non-Optimal Fixed Points

In this section, we first characterize the properties of the extra non-optimal fixed points of double Q-learning in the analytical model. We then discuss their connections to the literature on stochastic optimization, which motivates our proposed algorithm in section 4.

Underestimated Solutions. The first notable property is that the non-optimal fixed points of double Q-learning do not overestimate the true maximum values. More specifically, every fixed-point solution can be characterized as the ground-truth value of some stochastic policy, as follows:

Proposition 3 (Fixed-Point Characterization). Assume the noise terms $e_1$ and $e_2$ are independently generated in the double estimator stated in Eq. (4). Every approximate fixed point $V$ is equal to the ground-truth value function $V^{\pi}$ with respect to a stochastic policy $\pi$.

The proof of Proposition 3 is deferred to Appendix A.3. In addition, the corresponding stochastic policy $\pi$ can be interpreted as

$$\pi(a|s) = \mathbb{P}\Bigg[a = \arg\max_{a' \in A} \Big(\underbrace{R(s, a') + \gamma \mathbb{E}_{s'}[V(s')]}_{(\mathcal{T} Q)(s, a')} + e(s, a')\Big)\Bigg],$$

which is the expected policy generated by the corresponding fixed point along with the random noise $e(s, a')$. This stochastic policy, named the induced policy, provides a snapshot of how the agent behaves and evolves around these approximate fixed points. To deliver intuitions, we provide an analogical explanation in the context of optimization through the following arguments.

Analogy with Saddle Points. Taking the third column of Table 1b as an example, due to the existence of the approximation error, the induced policy $\pi$ suffers from a remarkable uncertainty in determining the best action in state $s_0$. Around such non-optimal fixed points, the greedy action selection may be disrupted by the approximation error and deviate from the correct direction for policy improvement. These approximate fixed points are not necessarily strongly stationary solutions, but they may seriously hurt learning efficiency. If we imagine each iteration of target updating as a step of gradient update for Bellman error minimization, the non-optimal fixed points correspond to saddle points in the context of optimization. Just as stochastic gradient descent may be trapped at saddle points, the Bellman operation with approximation error may get stuck around non-optimal approximate fixed points. Please refer to section 5.1 for a visualization of a concrete example.

Escaping from Saddle Points. In the literature on non-convex optimization, the most famous approach to escaping saddle points is perturbed gradient descent (Ge et al., 2015; Jin et al., 2017). Recall that, although gradient directions are ambiguous around saddle points, saddle points are not strongly convergent solutions. Perturbation mechanisms with certain properties can help the optimizer escape non-optimal saddle points. Although these methods cannot be directly applied to double Q-learning, since the Bellman operation is not an exact gradient descent, they motivate us to construct a specific perturbation as guidance. In section 4, we introduce a perturbed target-updating mechanism that uses an external value estimation to rule out non-optimal fixed points.

4 Doubly Bounded Q-Learning through Abstracted Dynamic Programming

As discussed in the last section, the underestimation bias of double Q-learning may lead to multiple non-optimal fixed points in the analytical model.
A major source of such underestimation is the inherent approximation error caused by imprecise optimization. Motivated by the literature on escaping saddle points, we introduce a novel method, named Doubly Bounded Q-learning, which integrates two different value estimators to reduce the negative effects of underestimation.

4.1 Algorithmic Framework

As discussed in section 3.3, the geometry of the non-optimal approximate fixed points of double Q-learning is similar to that of saddle points in non-convex optimization. The theory of escaping saddle points suggests that a well-shaped perturbation mechanism can help remove non-optimal saddle points from the optimization landscape (Ge et al., 2015; Jin et al., 2017). To realize this idea in the specific context of iterative Bellman error minimization, we propose to integrate a second value estimator, based on a different learning paradigm, as an external auxiliary signal to rule out non-optimal approximate fixed points of double Q-learning. To give an overview, we first revisit two value estimation paradigms:

1. Bootstrapping Estimator: As the default implementation of most temporal-difference learning algorithms, the target value $y^{\mathrm{Boots}}$ of a transition sample $(s_t, a_t, r_t, s_{t+1})$ is computed by bootstrapping the latest value function back-up $V_{\theta_{\mathrm{target}}}$, parameterized by $\theta_{\mathrm{target}}$, on the successor state $s_{t+1}$:
$$y^{\mathrm{Boots}}_{\theta_{\mathrm{target}}}(s_t, a_t) = r_t + \gamma V_{\theta_{\mathrm{target}}}(s_{t+1}),$$
where the computation of $V_{\theta_{\mathrm{target}}}$ differs across algorithms (e.g., different variants of double Q-learning).

2. Dynamic Programming Estimator: Another approach to estimating state-action values is to apply dynamic programming in an abstracted MDP (Li et al., 2006) constructed from the collected dataset. By utilizing a state aggregation function $\phi(s)$, we can discretize a complex environment into a manageable tabular MDP. The reward and transition functions of the abstracted MDP are estimated from the collected samples in the dataset. An alternative target value $y^{\mathrm{DP}}$ is computed as
$$y^{\mathrm{DP}}(s_t, a_t) = r_t + \gamma V^{\mathrm{DP}}(\phi(s_{t+1})), \tag{6}$$
where $V^{\mathrm{DP}}$ corresponds to the optimal value function of the abstracted MDP.

The advantages and bottlenecks of these two types of value estimators lie in different aspects of error control. The generalizability of function approximators is the major strength of the bootstrapping estimator, but on the other hand, the hardness of the back-end optimization causes considerable approximation error and leads to the issues discussed in section 3. The tabular representation of the dynamic programming estimator does not suffer from systematic approximation error during optimization, but its performance relies on the accuracy of state aggregation and on the sampling error in transition estimation.

Doubly Bounded Estimator. To establish a trade-off between the considerations underlying the above two value estimators, we propose an integrated estimator, named the doubly bounded estimator, which takes the maximum value over the two basis estimation methods:

$$y^{\mathrm{DB}}_{\theta_{\mathrm{target}}}(s_t, a_t) = \max\Big\{ y^{\mathrm{Boots}}_{\theta_{\mathrm{target}}}(s_t, a_t),\ y^{\mathrm{DP}}(s_t, a_t) \Big\}. \tag{7}$$

The target values $y^{\mathrm{DB}}_{\theta_{\mathrm{target}}}$ are used to train the parameterized value function $Q_{\theta}$ by minimizing

$$L(\theta; \theta_{\mathrm{target}}) = \mathbb{E}_{(s_t, a_t) \sim D}\Big[\big(Q_{\theta}(s_t, a_t) - y^{\mathrm{DB}}_{\theta_{\mathrm{target}}}(s_t, a_t)\big)^2\Big],$$

where $D$ denotes the experience buffer. Note that this estimator maintains two value functions using different data structures.
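As a concrete illustration, the sketch below assembles the doubly bounded target of Eq. (7) for a batch of transitions. It is a minimal stand-in rather than the authors' released implementation: `q_target_net.state_value`, `phi`, and `v_dp_table` are assumed placeholder interfaces for the target network of the chosen double Q-learning variant, the state-aggregation function, and the tabular value function of the abstracted MDP, and the handling of terminal transitions and of abstract states missing from the table is our own convention.

```python
# Minimal sketch of the doubly bounded target (Eq. (7)); interfaces are assumed placeholders.
import numpy as np

def doubly_bounded_targets(batch, q_target_net, v_dp_table, phi, gamma=0.99):
    """batch: dict of arrays 'reward', 'next_state', 'done' (one entry per transition)."""
    r, s_next, done = batch["reward"], batch["next_state"], batch["done"]
    # Bootstrapping estimator: y_Boots = r + gamma * V_target(s'), where V_target
    # is computed by the chosen variant (double or clipped double Q-learning).
    v_boots = q_target_net.state_value(s_next)
    y_boots = np.where(done, r, r + gamma * v_boots)
    # Dynamic programming estimator on the abstracted MDP: y_DP = r + gamma * V_DP(phi(s')).
    # Abstract states absent from the table fall back to -inf so they never win the max.
    v_dp = np.array([v_dp_table.get(phi(s), -np.inf) for s in s_next])
    y_dp = np.where(done, r, r + gamma * v_dp)
    # Doubly bounded target: element-wise maximum of the two estimators.
    return np.maximum(y_boots, y_dp)
```

The returned targets are then regressed by $Q_{\theta}$ with the squared loss $L(\theta; \theta_{\mathrm{target}})$ above.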
$Q_{\theta}$ is the major value function, which is used to generate the behavior policy for both exploration and evaluation. $V^{\mathrm{DP}}$ is an auxiliary value function computed by the abstracted dynamic programming and stored in a discrete table. The only functionality of $V^{\mathrm{DP}}$ is to compute the auxiliary target value $y^{\mathrm{DP}}$ used in Eq. (7) during training.

Remark. The name doubly bounded refers to the following intuitive motivation: assume both basis estimators, $y^{\mathrm{Boots}}$ and $y^{\mathrm{DP}}$, are implemented by their conservative variants and do not tend to overestimate values. The doubly bounded target value $y^{\mathrm{DB}}(s_t, a_t)$ then becomes a good estimate if either basis estimator provides an accurate value prediction for the given $(s_t, a_t)$. The outcomes of abstracted dynamic programming can help the bootstrapping estimator escape the non-optimal fixed points of double Q-learning, while the function approximator used by the bootstrapping estimator extends the generalizability of discretization-based state aggregation. The learning procedure can make progress if either estimator identifies the correct direction for policy improvement.

Practical Implementation. To make sure the dynamic programming estimator does not overestimate the true values, we implement a tabular version of batch-constrained Q-learning (BCQ; Fujimoto et al., 2019) to obtain a conservative estimation. The abstracted MDP is constructed by a simple state aggregation based on low-resolution discretization, i.e., we only aggregate states that cannot be distinguished by visual information. We follow the suggestions given by Fujimoto et al. (2019) and Liu et al. (2020) and prune the unseen state-action pairs in the abstracted MDP. The reward and transition functions of the remaining state-action pairs are estimated by averaging the collected samples. A detailed description is deferred to Appendix B.5.
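To make the construction above concrete, the following sketch builds an abstracted MDP from a replay dataset with an aggregation function `phi` and runs value iteration restricted to observed state-action pairs, in the spirit of the tabular batch-constrained estimation described above. It is a simplified illustration under our own assumptions (no termination handling, unvisited next abstract states default to a value of zero), not the implementation detailed in Appendix B.5.

```python
# Simplified sketch of the abstracted dynamic programming estimator; see the
# hedges in the paragraph above (this is not the Appendix B.5 implementation).
from collections import defaultdict

def abstracted_dp(dataset, phi, gamma=0.99, num_iters=500):
    """dataset: iterable of (s, a, r, s_next) transitions collected by the agent."""
    count = defaultdict(int)                             # visits of abstract (state, action) pairs
    reward_sum = defaultdict(float)                      # accumulated rewards per pair
    next_count = defaultdict(lambda: defaultdict(int))   # empirical transition counts
    for s, a, r, s_next in dataset:
        x, x_next = phi(s), phi(s_next)
        count[(x, a)] += 1
        reward_sum[(x, a)] += r
        next_count[(x, a)][x_next] += 1

    # Only actions actually observed in an abstract state are considered, i.e.,
    # unseen state-action pairs are pruned as suggested above.
    actions = defaultdict(list)
    for (x, a) in count:
        actions[x].append(a)

    V = {x: 0.0 for x in actions}
    for _ in range(num_iters):
        for x, acts in actions.items():
            V[x] = max(
                (reward_sum[(x, a)]
                 + gamma * sum(m * V.get(x2, 0.0) for x2, m in next_count[(x, a)].items()))
                / count[(x, a)]
                for a in acts
            )
    return V  # tabular V_DP, queried as V_DP[phi(s')] when forming y_DP in Eq. (6)
```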
4.2 Underlying Bias-Variance Trade-Off

In general, no existing approach can completely eliminate the estimation bias in Q-learning algorithms. Our proposed method also involves an underlying bias-variance trade-off.

Provable Benefits on Variance Reduction. The algorithmic structure of the proposed doubly bounded estimator can be formalized as a stochastic Bellman operator $\widetilde{\mathcal{T}}^{\mathrm{DB}}$:

$$(\widetilde{\mathcal{T}}^{\mathrm{DB}} V)(s) = \max\Big\{ (\widetilde{\mathcal{T}}^{\mathrm{Boots}} V)(s),\ V^{\mathrm{DP}}(s) \Big\}, \tag{8}$$

where $\widetilde{\mathcal{T}}^{\mathrm{Boots}}$ is the stochastic Bellman operator corresponding to the back-end bootstrapping estimator (e.g., Eq. (4)), and $V^{\mathrm{DP}}$ is an arbitrary deterministic value estimator, such as abstracted dynamic programming. The benefit in terms of variance reduction is characterized by the following proposition.

Proposition 4. Given an arbitrary stochastic operator $\widetilde{\mathcal{T}}^{\mathrm{Boots}}$ and a deterministic estimator $V^{\mathrm{DP}}$,
$$\forall V,\ \forall s \in S, \quad \mathrm{Var}\big[(\widetilde{\mathcal{T}}^{\mathrm{DB}} V)(s)\big] \le \mathrm{Var}\big[(\widetilde{\mathcal{T}}^{\mathrm{Boots}} V)(s)\big],$$
where $(\widetilde{\mathcal{T}}^{\mathrm{DB}} V)(s)$ is defined as in Eq. (8).

The proof of Proposition 4 is deferred to Appendix A.6. The intuition behind this statement is that, with a deterministic lower-bound cut-off, the variance of the resulting target values is reduced, which may help improve the stability of training.

Trade-Off between Different Biases. In general, the proposed doubly bounded estimator does not have a rigorous guarantee of bias reduction, since the behavior of abstracted dynamic programming depends on the properties of the tested environments and the accuracy of state aggregation. In the most unfavorable case, if the dynamic programming component carries a large magnitude of error, the lower-bounded objective would propagate high-value errors and increase the risk of overestimation. To address these concerns, we propose to implement a conservative approximate dynamic programming, as discussed in the previous section. The asymptotic behavior of batch-constrained Q-learning does not tend to overestimate extrapolated values (Liu et al., 2020). The major remaining risk of the dynamic programming module is induced by the state aggregation, which is a classical problem (Li et al., 2006). The experimental analysis in section 5.2 demonstrates that the error carried by abstracted dynamic programming is acceptable, and the method works well in most benchmark tasks.

5 Experiments

In this section, we conduct experiments to demonstrate the effectiveness of our proposed method.¹ We first run our method on the two-state MDP example discussed in the previous sections to visualize its algorithmic effects. We then set up a performance comparison on the Atari benchmark with several baseline algorithms. A detailed description of the experiment settings is deferred to Appendix B.

¹ Our code is available at https://github.com/Stilwell-Git/Doubly-Bounded-Q-Learning.

Figure 2: Visualization of the probability density of $V(s_0)$ learned by double Q-learning (panels show 5, 10, 15, and 20 epochs). In this section, all plotted distributions are estimated from $10^3$ runs with random seeds. We utilize the seaborn.distplot package to plot the kernel density estimation curves and histogram bins.

Figure 3: Visualization of the probability density of $V(s_0)$ learned by doubly bounded Q-learning with an imprecise dynamic programming planner (panels show $V^{\mathrm{DP}}(s_0) \in \{99.0, 99.5, 100.0, 100.5\}$ after 3 epochs). $V^{\mathrm{DP}}(s_0)$ denotes the value estimation given by dynamic programming, which serves as a lower bound in our method.

5.1 Tabular Experiments on a Two-State MDP

To investigate whether the empirical behavior of our proposed method matches its design purpose, we compare the behaviors of double Q-learning and our method on the two-state MDP example presented in Figure 1a. In this MDP, the non-optimal fixed points get stuck around $V(s_0) \approx 100$, and the optimal solution has $V(s_0) = 110$. In this tabular experiment, we implement double Q-learning and our algorithm with table-based Q-value functions. The Q-values are updated by iteratively applying the Bellman operators presented in Eq. (4) and Eq. (8). To approximate practical scenarios, we simulate the approximation error by entry-wise Gaussian noise $\mathcal{N}(0, 0.5)$. Since the stochastic process induced by such noise suffers from high variance, we perform the soft update $Q^{(t+1)} = (1-\alpha) Q^{(t)} + \alpha (\widetilde{\mathcal{T}} Q^{(t)})$ to make the visualization clear, in which $\alpha$ plays the role of the learning rate in practice. We consider $\alpha = 10^{-2}$ for all experiments presented in this section. In this setting, we denote one epoch as $\frac{1}{\alpha(1-\gamma)} = 10^4$ iterations. A detailed description of the tabular implementation is deferred to Appendix B.2.
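For reference, the tabular procedure just described can be sketched as follows. This is our own condensed illustration rather than the Appendix B.2 code; it assumes $R(s_0, a_0) = 1.1$ (consistent with $V^*(s_0) = 110$), interprets $\mathcal{N}(0, 0.5)$ as a standard deviation of $0.5$, and applies the doubly bounded cut-off of Eq. (8) to state $s_0$ only, as in Figure 3.

```python
# Sketch of the tabular two-state experiment (assumptions noted above).
import numpy as np

gamma, alpha, noise_std = 0.99, 1e-2, 0.5
rng = np.random.default_rng(0)

def run(num_iters=200_000, v_dp_s0=None):
    """Soft-updated double Q-learning with simulated Gaussian approximation error."""
    Q1, Q2 = np.zeros((2, 2)), np.zeros((2, 2))     # Q_i[s, a] for states {s0, s1}
    R = np.array([[1.1, 1.0], [1.0, 1.0]])          # R(s0, a0) = 1.1 is an assumption
    next_state = np.array([[0, 1], [1, 1]])         # deterministic transitions of Figure 1a
    for _ in range(num_iters):
        greedy = Q1.argmax(axis=1)
        V = Q2[np.arange(2), greedy]                # double-Q state values, as in Eq. (4)
        if v_dp_s0 is not None:
            V[0] = max(V[0], v_dp_s0)               # lower-bound cut-off on s0 (cf. Eq. (8))
        target = R + gamma * V[next_state]          # noiseless Bellman targets
        Q1 = (1 - alpha) * Q1 + alpha * (target + rng.normal(0, noise_std, (2, 2)))
        Q2 = (1 - alpha) * Q2 + alpha * (target + rng.normal(0, noise_std, (2, 2)))
    return Q2[0, Q1[0].argmax()]                    # resulting estimate of V(s0)

print(run())               # may linger near 100, the non-optimal region (cf. Figure 2)
print(run(v_dp_s0=100.5))  # the imprecise lower bound helps escape towards 110 (cf. Figure 3)
```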
We investigate the performance of our method with an imprecise dynamic programming module, in which we only apply a lower bound for state $s_0$ with values $V^{\mathrm{DP}}(s_0) \in \{99.0, 99.5, 100.0, 100.5\}$. The experiment results presented in Figure 2 and Figure 3 support our claims in the previous sections:

1. The properties of the non-optimal fixed points are similar to those of saddle points. These extra fixed points are relatively stationary regions but not truly static. A series of lucky noises can help the agent escape from non-optimal stationary regions, but this procedure may take many iterations. As shown in Figure 2, double Q-learning may get stuck in non-optimal solutions, and it escapes these non-optimal regions only at a very slow rate. After 20 epochs (i.e., $2 \times 10^5$ iterations), nearly a third of the runs still cannot find the optimal solution.

2. The design of our doubly bounded estimator acts like a perturbation on value learning. As shown in Figure 3, when the estimation given by dynamic programming is slightly higher than the non-optimal fixed points, such as $V^{\mathrm{DP}}(s_0) = 100.5$, it is sufficient to help the agent escape from non-optimal stationary solutions. A tricky observation is that $V^{\mathrm{DP}}(s_0) = 100$ also seems to work. This is because cutting off a zero-mean noise leads to a slight overestimation, which makes the effective value of the lower bound $V^{\mathrm{DP}}(s_0)$ slightly larger.

5.2 Performance Comparison on Atari Benchmark

To demonstrate the superiority of our proposed method, Doubly Bounded Q-Learning through Abstracted Dynamic Programming (DB-ADP), we compare it with six variants of deep Q-networks as baselines, including DQN (Mnih et al., 2015), double DQN (DDQN; Van Hasselt et al., 2016), dueling DDQN (Wang et al., 2016), averaged DQN (Anschel et al., 2017), maxmin DQN (Lan et al., 2020), and clipped double DQN adapted from Fujimoto et al. (2018).

Figure 4: Learning curves on a suite of Atari benchmark tasks (panels include Bank Heist, Battle Zone, Frostbite, Road Runner, Star Gunner, and Wizard Of Wor, each plotting the average episode return against training timesteps, with legend entries DB-ADP-C (ours), DB-ADP (ours), Clipped DDQN, DDQN, Dueling DDQN, Averaged DQN, Maxmin DQN, and DQN). DB-ADP and DB-ADP-C refer to our proposed approach built upon double Q-learning and clipped double Q-learning, respectively.

Our proposed doubly bounded target estimation $y^{\mathrm{DB}}$ is built upon the two types of bootstrapping estimators that have a clear incentive toward underestimation, i.e., double Q-learning and clipped double Q-learning. We denote these two variants as DB-ADP and DB-ADP-C, according to whether our proposed method is used without or with clipped double Q-learning. As shown in Figure 4, the proposed doubly bounded estimator shows great promise in boosting the performance of double Q-learning algorithms. The improvement can be observed both in terms of sample efficiency and final performance.
Another notable observation is that, although clipped double Q-learning can hardly improve the performance over double DQN on its own, it significantly improves performance through our proposed approach in most environments (i.e., DB-ADP-C vs. DB-ADP in Figure 4). This improvement should be credited to the conservative property of clipped double Q-learning (Fujimoto et al., 2019), which may reduce the propagation of the errors carried by abstracted dynamic programming.

5.3 Variance Reduction on Target Values

To support the theoretical claims of Proposition 4, we conduct an experiment to demonstrate the ability of the doubly bounded estimator to reduce variance. We evaluate the standard deviation of the target values with respect to training networks using different sequences of training batches. Table 5a presents the evaluation results for our proposed methods and baseline algorithms. The †-version corresponds to an ablation study, where we train the network using our proposed approach but evaluate the target values computed by the bootstrapping estimators, i.e., using the target value formula of double DQN or clipped double DQN. As shown in Table 5a, the standard deviation of target values is significantly reduced by our approaches, which matches our theoretical analysis in Proposition 4. It demonstrates a strength of our approach in improving training stability. A detailed description of the evaluation metric is deferred to Appendix B.

| TASK NAME | DB-ADP-C | DB-ADP-C† | CDDQN |
| --- | --- | --- | --- |
| ALIEN | 0.006 | 0.008 | 0.010 |
| BANKHEIST | 0.009 | 0.010 | 0.010 |
| QBERT | 0.008 | 0.010 | 0.011 |

| TASK NAME | DB-ADP | DB-ADP† | DDQN |
| --- | --- | --- | --- |
| ALIEN | 0.008 | 0.009 | 0.012 |
| BANKHEIST | 0.009 | 0.011 | 0.013 |
| QBERT | 0.009 | 0.011 | 0.012 |

Figure 5: (a) Variance reduction on target values: evaluating the standard deviation of target values w.r.t. different training batches. The presented amounts are normalized by the value scale of the corresponding runs; † refers to ablation studies. (b) Ablation study on dynamic programming: an ablation study on the individual performance of the dynamic programming module (learning curves of average episode return against timesteps, including Bank Heist, comparing DB-ADP-C (ours), DB-ADP (ours), ADP (ablation), Clipped DDQN, and DDQN).

5.4 An Ablation Study on the Dynamic Programming Module

To support the claim that the dynamic programming estimator is an auxiliary module for improving the strength of double Q-learning, we conduct an ablation study to investigate the individual performance of dynamic programming. Formally, we exclude Bellman error minimization from the training procedure and directly optimize the following objective to distill the results of dynamic programming into a generalizable parametric agent:

$$L_{\mathrm{ADP}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\Big[\big(Q_{\theta}(s_t, a_t) - y^{\mathrm{DP}}(s_t, a_t)\big)^2\Big],$$

where $y^{\mathrm{DP}}(s_t, a_t)$ denotes the target value computed directly by dynamic programming. As shown in Figure 5b, without integrating with the bootstrapping estimator, the abstracted dynamic programming by itself cannot outperform deep Q-learning algorithms. This indicates that, in our proposed framework, the two basis estimators are complementary to each other.

6 Related Work

Correcting the estimation bias of Q-learning is a long-standing problem that has motivated a series of approaches (Lee et al., 2013; D'Eramo et al., 2016; Chen, 2020; Zhang and Huang, 2020), especially following the methodology of double Q-learning (Zhang et al., 2017; Zhu and Rigotti, 2021).
The most representative algorithm, clipped double Q-learning (Fujimoto et al., 2018), has become the default implementation of most advanced actor-critic algorithms (Haarnoja et al., 2018). Based on clipped double Q-learning, several methods have been investigated to reduce its underestimation and achieve promising performance (Ciosek et al., 2019; Li and Hou, 2019). Other recent advances usually focus on using ensemble methods to further reduce the error magnitude (Lan et al., 2020; Kuznetsov et al., 2020; Chen et al., 2021). Statistical analysis of double Q-learning is also an active area (Weng et al., 2020; Xiong et al., 2020) that deserves future study.

Besides the variants of double Q-learning, using the softmax operator in Bellman operations is another effective approach to reducing the effects of approximation error (Fox et al., 2016; Asadi and Littman, 2017; Song et al., 2019; Kim et al., 2019; Pan et al., 2019). The characteristic of our approach is the usage of an approximate dynamic programming. Our analysis also provides theoretical support for memory-based approaches, such as episodic control (Blundell et al., 2016; Pritzel et al., 2017; Lin et al., 2018; Zhu et al., 2020; Hu et al., 2021), which are usually designed for near-deterministic environments. Instead of using an explicit planner, Fujita et al. (2020) adopt the trajectory return as a lower bound for value estimation. This simple technique also shows promise in improving the efficiency of continuous control with clipped double Q-learning.

7 Conclusion

In this paper, we reveal an interesting fact that, under the effects of approximation error, double Q-learning may have multiple non-optimal fixed points. The main cause of such non-optimal fixed points is the underestimation bias of double Q-learning. Regarding this issue, we provide analysis to characterize what kind of Bellman operators may suffer from the same problem and how the agent may behave around these fixed points. To address the potential risk of converging to non-optimal solutions, we propose doubly bounded Q-learning to reduce the underestimation in double Q-learning. The main idea of this approach is to leverage an abstracted dynamic programming as a second value estimator to rule out non-optimal fixed points. The experiments show that the proposed method holds great promise in improving both sample efficiency and convergence performance, achieving a significant improvement over baseline algorithms.

Acknowledgments and Disclosure of Funding

The authors would like to thank Kefan Dong for insightful discussions. This work is supported in part by the Science and Technology Innovation 2030 New Generation Artificial Intelligence Major Project (No. 2018AAA0100904), a grant from the Institute of Guo Qiang, Tsinghua University, and a grant from the Turing AI Institute of Nanjing.

References

Joshua Achiam, Ethan Knight, and Pieter Abbeel. Towards characterizing divergence in deep Q-learning. arXiv preprint arXiv:1903.08894, 2019.
Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pages 176-185. PMLR, 2017.
Kavosh Asadi and Michael L Littman. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pages 243-252, 2017.
Richard Bellman. Dynamic programming. Princeton University Press, 89:92, 1957.
Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
Luitzen Egbertus Jan Brouwer. Über Abbildung von Mannigfaltigkeiten. Mathematische Annalen, 71(1):97-115, 1911.
Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.
Gang Chen. Decorrelated double Q-learning. arXiv preprint arXiv:2006.06956, 2020.
Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1187-1196, 2018.
Xinyue Chen, Che Wang, Zijian Zhou, and Keith Ross. Randomized ensembled double Q-learning: Learning fast without a model. In International Conference on Learning Representations, 2021.
Kamil Ciosek, Quan Vuong, Robert Loftin, and Katja Hofmann. Better exploration with optimistic actor critic. In Advances in Neural Information Processing Systems, pages 1785-1796, 2019.
Kefan Dong, Yuping Luo, Tianhe Yu, Chelsea Finn, and Tengyu Ma. On the expressivity of neural networks for deep reinforcement learning. In International Conference on Machine Learning, pages 2627-2637. PMLR, 2020.
Carlo D'Eramo, Marcello Restelli, and Alessandro Nuara. Estimating maximum expected value through Gaussian approximation. In International Conference on Machine Learning, pages 1032-1040. PMLR, 2016.
Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 202-211, 2016.
Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing bottlenecks in deep Q-learning algorithms. In International Conference on Machine Learning, pages 2021-2030, 2019.
Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587-1596, 2018.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052-2062, 2019.
Yasuhiro Fujita, Kota Uenishi, Avinash Ummadisingu, Prabhat Nagarajan, Shimpei Masuda, and Mario Ynocente Castro. Distributed reinforcement learning of targeted grasping with active vision for mobile manipulators. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9712-9719. IEEE, 2020.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: Online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797-842. PMLR, 2015.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861-1870, 2018.
Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Hao Hu, Jianing Ye, Zhizhou Ren, Guangxiang Zhu, and Chongjie Zhang. Generalizable episodic memory for deep reinforcement learning. arXiv preprint arXiv:2103.06469, 2021.
Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181-1189, 2015.
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724-1732. PMLR, 2017.
Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning. PMLR, 2018.
Richard M Karp and Michael O Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.
Seungchan Kim, Kavosh Asadi, Michael Littman, and George Konidaris. DeepMellow: Removing the need for a target network in deep Q-learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2733-2739, 2019.
Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In International Conference on Machine Learning, 2020.
Qingfeng Lan, Yangchen Pan, Alona Fyshe, and Martha White. Maxmin Q-learning: Controlling the estimation bias of Q-learning. In International Conference on Learning Representations, 2020.
Donghun Lee, Boris Defourny, and Warren B Powell. Bias-corrected Q-learning to control max-operator bias in Q-learning. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 93-99. IEEE, 2013.
Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for MDPs. ISAIM, 4:5, 2006.
Zhunan Li and Xinwen Hou. Mixing update Q-value for deep reinforcement learning. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1-6. IEEE, 2019.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
Zichuan Lin, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. Episodic memory deep Q-networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2433-2439, 2018.
Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch reinforcement learning without great exploration. In Advances in Neural Information Processing Systems, 2020.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103-130, 1993.
Ling Pan, Qingpeng Cai, Qi Meng, Wei Chen, Longbo Huang, and Tie-Yan Liu. Reinforcement learning with dynamic Boltzmann softmax updates. arXiv preprint arXiv:1903.05926, 2019.
Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning, pages 2827-2836, 2017.
Martin Riedmiller. Neural fitted Q iteration: First experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317-328. Springer, 2005.
James E Smith and Robert L Winkler. The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311-322, 2006.
Zhao Song, Ron Parr, and Lawrence Carin. Revisiting the softmax Bellman operator: New benefits and new perspective. In International Conference on Machine Learning, pages 5916-5925, 2019.
Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111-133, 1974.
Richard H Thaler. The winner's curse. The Journal of Economic Perspectives, 2(1):191-202, 1988.
Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale, NJ, 1993.
Hado Van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613-2621, 2010.
Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2094-2100, 2016.
Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995-2003, 2016.
Chris Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989.
Wentao Weng, Harsh Gupta, Niao He, Lei Ying, and R Srikant. The mean-squared error of double Q-learning. In Advances in Neural Information Processing Systems, volume 33, 2020.
Huaqing Xiong, Lin Zhao, Yingbin Liang, and Wei Zhang. Finite-time analysis for double Q-learning. In Advances in Neural Information Processing Systems, volume 33, 2020.
Huihui Zhang and Wu Huang. Deep reinforcement learning with adaptive combined critics. 2020.
Zongzhang Zhang, Zhiyuan Pan, and Mykel J Kochenderfer. Weighted double Q-learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3455-3461, 2017.
Zhenpeng Zhou, Steven Kearnes, Li Li, Richard N Zare, and Patrick Riley. Optimization of molecules via deep reinforcement learning. Scientific Reports, 9(1):1-10, 2019.
Guangxiang Zhu, Zichuan Lin, Guangwen Yang, and Chongjie Zhang. Episodic reinforcement learning with associative memory. In International Conference on Learning Representations, 2020.
Rong Zhu and Mattia Rigotti. Self-correcting Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11185-11192, 2021.