# Finite-Time Analysis for Double Q-learning

Huaqing Xiong^1, Lin Zhao^2, Yingbin Liang^1, and Wei Zhang^{3,4}

^1 The Ohio State University; ^2 National University of Singapore; ^3 Southern University of Science and Technology; ^4 Peng Cheng Laboratory

^1 {xiong.309, liang.889}@osu.edu; ^2 elezhli@nus.edu.sg; ^{3,4} zhangw3@sustech.edu.cn

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in Hasselt (2010) overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}}+\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where $\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.

## 1 Introduction

Q-learning is one of the most successful classes of reinforcement learning (RL) algorithms, which aims at finding the optimal action-value function or Q-function (and thus the associated optimal policy) via off-policy data samples. The Q-learning algorithm was first proposed by Watkins and Dayan (1992), and since then it has been widely used in various applications including robotics (Tai and Liu, 2016), autonomous driving (Okuyama et al., 2018), and video games (Mnih et al., 2015), to name a few. The theoretical performance of Q-learning has also been intensively explored. Its asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al. (1994); Borkar and Meyn (2000); Melo (2001); Lee and He (2019). The non-asymptotic (i.e., finite-time) convergence rate of Q-learning was first obtained in Szepesvári (1998), and has been further studied in (Even-Dar and Mansour, 2003; Shah and Xie, 2018; Wainwright, 2019; Beck and Srikant, 2012; Chen et al., 2020) for synchronous Q-learning and in (Even-Dar and Mansour, 2003; Qu and Wierman, 2020) for asynchronous Q-learning.

One major weakness of Q-learning arises in practice due to the large overestimation of the action-value function (Hasselt, 2010; Hasselt et al., 2016). Practical implementations of Q-learning use the maximum sampled Q-function to estimate the maximum expected Q-function (where the expectation is taken over the randomness of the reward). Such an estimation often yields a large positive bias error (Hasselt, 2010), and causes Q-learning to perform rather poorly.
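To make this bias concrete, here is a minimal, self-contained illustration (ours, not from the paper): all actions of a state have the same true value, yet the maximum over noisy sample-mean estimates is biased upward, whereas the double-estimator trick underlying double Q-learning (select the maximizing action with one estimator, evaluate it with the other) removes the upward bias.

```python
import numpy as np

# Minimal illustration (not from the paper) of the overestimation bias:
# every action has true value 0, but max_a of noisy sample means is biased
# upward, while the cross-estimator evaluation used by double Q-learning is not.
rng = np.random.default_rng(0)
n_actions, n_samples, n_trials = 10, 5, 10000

single, double = [], []
for _ in range(n_trials):
    # Two independent sets of noisy reward samples for the same actions.
    qa = rng.normal(0.0, 1.0, size=(n_actions, n_samples)).mean(axis=1)
    qb = rng.normal(0.0, 1.0, size=(n_actions, n_samples)).mean(axis=1)
    single.append(qa.max())           # single estimator: max of sample means
    double.append(qb[qa.argmax()])    # double estimator: argmax with qa, value from qb

print(f"single-estimator bias: {np.mean(single):+.3f}")  # noticeably positive
print(f"double-estimator bias: {np.mean(double):+.3f}")  # close to zero
```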
To address this issue, double Q-learning was proposed in Hasselt (2010). It keeps two Q-estimators (i.e., estimators for Q-functions), one for estimating the maximum Q-function value and the other for the update, and randomly switches the roles of the two Q-estimators throughout the iterations. It was shown in Hasselt (2010) that such an algorithm effectively overcomes the overestimation issue of vanilla Q-learning. In Hasselt et al. (2016), double Q-learning was further demonstrated to substantially improve the performance of Q-learning with deep neural networks (DQNs) for playing Atari 2600 games. It has inspired many variants (Zhang et al., 2017; Abed-alguni and Ottom, 2018), has found numerous applications (Zhang et al., 2018a,b), and has become one of the most common techniques for applying Q-learning-type algorithms (Hessel et al., 2018).

Despite its tremendous empirical success and popularity in practice, the theoretical understanding of double Q-learning is rather limited. Only the asymptotic convergence was provided in Hasselt (2010); Weng et al. (2020c). There has been no non-asymptotic result on how fast double Q-learning converges. From the technical standpoint, such a finite-time analysis for double Q-learning does not follow readily from those for vanilla Q-learning, because it involves two randomly updated Q-estimators, and the coupling between these two random paths significantly complicates the analysis. This goes much beyond the existing techniques for analyzing vanilla Q-learning, which handle the random update of a single Q-estimator. Thus, the goal of this paper is to develop new finite-time analysis techniques that handle the inter-connected two random path updates in double Q-learning and provide the convergence rate.

### 1.1 Our contributions

The main contribution of this paper lies in providing the first finite-time analysis for double Q-learning with both the synchronous and asynchronous implementations.

We show that synchronous double Q-learning with a learning rate $\alpha_t = 1/t^\omega$ (where $\omega\in(0,1)$) attains an $\epsilon$-accurate global optimum with probability at least $1-\delta$ by taking
$$\Omega\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\ln\frac{|\mathcal S||\mathcal A|}{(1-\gamma)^7\epsilon^2\delta}\right)^{\frac{1}{\omega}}+\left(\frac{1}{1-\gamma}\ln\frac{1}{(1-\gamma)^2\epsilon}\right)^{\frac{1}{1-\omega}}\right)$$
iterations, where $\gamma\in(0,1)$ is the discount factor, and $|\mathcal S|$ and $|\mathcal A|$ are the sizes of the state space and action space, respectively. We further show that under the same accuracy and high-probability requirements, asynchronous double Q-learning takes
$$\Omega\left(\left(\frac{L^4}{(1-\gamma)^6\epsilon^2}\ln\frac{|\mathcal S||\mathcal A|L^4}{(1-\gamma)^7\epsilon^2\delta}\right)^{\frac{1}{\omega}}+\left(\frac{L^2}{1-\gamma}\ln\frac{1}{(1-\gamma)^2\epsilon}\right)^{\frac{1}{1-\omega}}\right)$$
iterations, where $L$ is the covering number specified by the exploration strategy.

Our results corroborate the design goal of double Q-learning, which opts for better accuracy by making less aggressive progress during the execution in order to avoid overestimation. Specifically, our results imply that in the high-accuracy regime, double Q-learning achieves the same convergence rate as vanilla Q-learning in terms of the order-level dependence on $\epsilon$, which further indicates that the high-accuracy design of double Q-learning dominates the less aggressive progress in such a regime. In the low-accuracy regime, which is not what double Q-learning is designed for, the cautious progress of double Q-learning yields a slightly weaker convergence rate than Q-learning in terms of the dependence on $1-\gamma$.

From the technical standpoint, our proof develops new techniques beyond the existing finite-time analysis of vanilla Q-learning with a single random iteration path.
More specifically, we model the double Q-learning algorithm as two alternating stochastic approximation (SA) problems, where one SA captures the error propagation between the two Q-estimators, and the other captures the error dynamics between the Q-estimator and the global optimum. For the first SA, we develop new techniques to provide finite-time bounds on the two inter-related stochastic iterations of Q-functions. We then develop new tools to bound the convergence of the Bernoulli-controlled stochastic iterations of the second SA conditioned on the first SA.

### 1.2 Related work

Due to the rapidly growing literature on Q-learning, we review only the theoretical results that are highly relevant to our work.

Q-learning was first proposed in Watkins and Dayan (1992) for finite state-action spaces. Its asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al. (1994); Borkar and Meyn (2000); Melo (2001) through studying various general SA algorithms that include Q-learning as a special case. Along this line, Lee and He (2019) characterized Q-learning as a switched linear system and applied the results of Borkar and Meyn (2000) to show asymptotic convergence, which was also extended to other Q-learning variants.

Another line of research focuses on the finite-time analysis of Q-learning, which can capture the convergence rate. Such non-asymptotic results were first obtained in Szepesvári (1998). A more comprehensive work (Even-Dar and Mansour, 2003) provided finite-time results for both synchronous and asynchronous Q-learning. Both Szepesvári (1998) and Even-Dar and Mansour (2003) showed that with linear learning rates, the convergence rate of Q-learning can be exponentially slow as a function of $\frac{1}{1-\gamma}$. To handle this, the so-called rescaled linear learning rate was introduced to avoid such an exponential dependence in synchronous Q-learning (Wainwright, 2019; Chen et al., 2020) and asynchronous Q-learning (Qu and Wierman, 2020). The finite-time convergence of Q-learning was also analyzed with constant step sizes (Beck and Srikant, 2012; Chen et al., 2020; Li et al., 2020). Moreover, the polynomial learning rate, which is also the focus of this work, was investigated for both synchronous (Even-Dar and Mansour, 2003; Wainwright, 2019) and asynchronous Q-learning (Even-Dar and Mansour, 2003). In addition, it is worth mentioning that Shah and Xie (2018) applied the nearest-neighbor approach to handle MDPs with infinite state spaces.

In contrast to the above extensive studies of vanilla Q-learning, the theoretical understanding of double Q-learning is limited. The only asymptotic convergence guarantee was provided by Hasselt (2010). A concurrent work (Weng et al., 2020c) studied the mean-square error of double Q-learning under the assumption that it converges to a unique fixed point. Neither of the aforementioned papers provides a non-asymptotic (i.e., finite-time) analysis of how fast double Q-learning converges. This paper provides the first finite-time analysis for double Q-learning.

The vanilla Q-learning algorithm has also been studied in the function approximation setting, i.e., when the Q-function is approximated by a class of parameterized functions. In contrast to the tabular case, even with linear function approximation, Q-learning has been shown not to converge in general (Baird, 1995).
Strong assumptions are typically imposed to guarantee the convergence of Q-learning with function approximation (Bertsekas and Tsitsiklis, 1996; Zou et al., 2019; Chen et al., 2019; Du et al., 2019; Xu and Gu, 2019; Cai et al., 2019; Weng et al., 2020a,b). Regarding double Q-learning, it remains an open question how to design double Q-learning algorithms under function approximation and under what conditions they have theoretically guaranteed convergence.

## 2 Preliminaries on Q-learning and Double Q-learning

In this section, we introduce the Q-learning and double Q-learning algorithms.

### 2.1 Q-learning

We consider a $\gamma$-discounted Markov decision process (MDP) with a finite state space $\mathcal S$ and a finite action space $\mathcal A$. The transition probability of the MDP is given by $P:\mathcal S\times\mathcal A\times\mathcal S\to[0,1]$; that is, $P(\cdot|s,a)$ denotes the probability distribution of the next state given the current state $s$ and action $a$. We consider a random reward function $R_t$ at time $t$ drawn from a fixed distribution $\phi:\mathcal S\times\mathcal A\times\mathcal S\mapsto\mathbb{R}$, where $\mathbb{E}[R_t(s,a,s')] = R_{ss'}^{a}$ and $s'$ denotes the next state starting from $(s,a)$. In addition, we assume $|R_t|\le R_{\max}$. A policy $\pi := \pi(\cdot|s)$ characterizes the conditional probability distribution over the action space $\mathcal A$ given each state $s\in\mathcal S$. The action-value function (i.e., Q-function) $Q^\pi\in\mathbb{R}^{|\mathcal S|\times|\mathcal A|}$ for a given policy $\pi$ is defined as
$$Q^\pi(s,a) := \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_t(s,\pi(s),s')\,\Big|\, s_0=s,\ a_0=a\right] = \mathbb{E}_{s'\sim P(\cdot|s,a),\, a'\sim\pi(\cdot|s')}\left[R_{ss'}^{a} + \gamma Q^\pi(s',a')\right], \tag{1}$$
where $\gamma\in(0,1)$ is the discount factor. Q-learning aims to find the Q-function of an optimal policy $\pi^*$ that maximizes the accumulated reward. The existence of such a $\pi^*$ has been proved in classical MDP theory (Bertsekas and Tsitsiklis, 1996). The corresponding optimal Q-function, denoted by $Q^*$, is known to be the unique fixed point of the Bellman operator $\mathcal T$ given by
$$\mathcal T Q(s,a) = \mathbb{E}_{s'\sim P(\cdot|s,a)}\left[R_{ss'}^{a} + \gamma\max_{a'\in U(s')} Q(s',a')\right], \tag{2}$$
where $U(s')\subseteq\mathcal A$ is the admissible set of actions at state $s'$. It can be shown that the Bellman operator $\mathcal T$ is $\gamma$-contractive in the supremum norm $\|Q\| := \max_{s,a}|Q(s,a)|$, i.e., it satisfies
$$\|\mathcal T Q - \mathcal T Q'\| \le \gamma\|Q - Q'\|. \tag{3}$$
The goal of Q-learning is to find $Q^*$, which further yields $\pi^*(s) = \arg\max_{a\in U(s)} Q^*(s,a)$. In practice, however, exact evaluation of the Bellman operator (2) is usually infeasible due to the lack of knowledge of the transition kernel of the MDP and the randomness of the reward. Instead, Q-learning draws random samples to estimate the Bellman operator and iteratively learns $Q^*$ as
$$Q_{t+1}(s,a) = (1-\alpha_t(s,a))Q_t(s,a) + \alpha_t(s,a)\left(R_t(s,a,s') + \gamma\max_{a'\in U(s')} Q_t(s',a')\right), \tag{4}$$
where $R_t$ is the sampled reward, $s'$ is sampled from the transition probability given $(s,a)$, and $\alpha_t(s,a)\in(0,1]$ denotes the learning rate.

### 2.2 Double Q-learning

Although Q-learning is a commonly used RL algorithm to find the optimal policy, it can suffer from overestimation in practice (Smith and Winkler, 2006). To overcome this issue, Hasselt (2010) proposed double Q-learning, given in Algorithm 1.

**Algorithm 1** Synchronous Double Q-learning (Hasselt, 2010)
1: **Input:** initial $Q^A_1$, $Q^B_1$.
2: **for** $t = 1, 2, \ldots, T$ **do**
3: Assign learning rate $\alpha_t$.
4: Randomly choose either UPDATE(A) or UPDATE(B), each with probability 0.5.
5: **for** each $(s,a)$ **do**
6: Observe $s'\sim P(\cdot|s,a)$ and sample $R_t(s,a,s')$.
7: **if** UPDATE(A) **then**
8: Obtain $a^* = \arg\max_{a'} Q^A_t(s',a')$
9: $Q^A_{t+1}(s,a) = Q^A_t(s,a) + \alpha_t(s,a)\big(R_t(s,a,s') + \gamma Q^B_t(s',a^*) - Q^A_t(s,a)\big)$
10: **else if** UPDATE(B) **then**
11: Obtain $b^* = \arg\max_{b'} Q^B_t(s',b')$
12: $Q^B_{t+1}(s,a) = Q^B_t(s,a) + \alpha_t(s,a)\big(R_t(s,a,s') + \gamma Q^A_t(s',b^*) - Q^B_t(s,a)\big)$
13: **end if**
14: **end for**
15: **end for**
16: **Output:** $Q^A_T$ (or $Q^B_T$).

Double Q-learning maintains two Q-estimators (i.e., Q-tables): $Q^A$ and $Q^B$. At each iteration of Algorithm 1, one Q-table is randomly chosen to be updated. The chosen Q-table generates a greedy optimal action, and the other Q-table is used to estimate the corresponding Bellman operator for updating the chosen table. Specifically, if $Q^A$ is chosen to be updated, we use $Q^A$ to obtain the optimal action $a^*$ and then estimate the corresponding Bellman operator using $Q^B$. As shown in Hasselt (2010), $\mathbb{E}[Q^B(s',a^*)]$ is likely smaller than $\max_a \mathbb{E}[Q^A(s',a)]$, where the expectation is taken over the randomness of the reward for the same state-action pair. In this way, the two-estimator framework of double Q-learning can effectively reduce the overestimation.

**Synchronous and asynchronous double Q-learning:** In this paper, we study the finite-time convergence rate of double Q-learning in two different settings: synchronous and asynchronous implementations. For synchronous double Q-learning (as shown in Algorithm 1), all the state-action pairs of the chosen Q-estimator are visited simultaneously at each iteration. For the asynchronous case, only one state-action pair is updated in the chosen Q-table. Specifically, in the latter case, we sample a trajectory $\{(s_t,a_t,R_t,i_t)\}_{t=0}^{\infty}$ under a certain exploration strategy, where $i_t\in\{A,B\}$ denotes the index of the chosen Q-table at time $t$. Then the two Q-tables are updated according to the following rule:
$$Q^i_{t+1}(s,a) = \begin{cases} Q^i_t(s,a), & (s,a)\neq(s_t,a_t) \text{ or } i\neq i_t;\\[3pt] (1-\alpha_t(s,a))Q^i_t(s,a) + \alpha_t(s,a)\Big(R_t(s,a,s') + \gamma Q^{i^c}_t\big(s',\arg\max_{a'\in U(s')} Q^i_t(s',a')\big)\Big), & \text{otherwise}, \end{cases}$$
where $i^c = \{A,B\}\setminus i$.

We next provide the boundedness property of the Q-estimators and the errors in the following lemma, which is typically necessary for the finite-time analysis.

**Lemma 1.** For either synchronous or asynchronous double Q-learning, let $Q^i_t(s,a)$ be the value of either Q-table for a state-action pair $(s,a)$ at iteration $t$. Suppose $\|Q^i_0\|\le\frac{R_{\max}}{1-\gamma}$. Then we have $\|Q^i_t\|\le\frac{R_{\max}}{1-\gamma}$ and $\|Q^i_t - Q^*\|\le V_{\max}$ for all $t\ge 0$, where $V_{\max} := \frac{2R_{\max}}{1-\gamma}$.

Lemma 1 can be proved by an induction argument using the triangle inequality and the uniform boundedness of the reward function; see Appendix B.

## 3 Main results

We present our finite-time analysis for synchronous and asynchronous double Q-learning in this section, followed by a sketch of the proof for the synchronous case, which captures our main techniques. The detailed proofs of all the results are provided in the Supplementary Materials.

### 3.1 Synchronous double Q-learning

Since the update of the two Q-estimators is symmetric, we can characterize the convergence rate of either Q-estimator, e.g., $Q^A$, to the global optimum $Q^*$. To this end, we first derive two important properties of double Q-learning that are crucial to our finite-time convergence analysis. The first property captures the stochastic error $\|Q^B_t - Q^A_t\|$ between the two Q-estimators. Since double Q-learning updates alternate between these two estimators, such an error process must decay to zero in order for double Q-learning to converge.
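As a toy illustration of Algorithm 1 (a sketch under assumed parameters, not the experimental setup of the paper), the following snippet runs synchronous double Q-learning on a small randomly generated MDP and reports both $\|Q^B_T - Q^A_T\|$ and $\|Q^A_T - Q^*\|$; the MDP sizes, reward noise, and learning-rate exponent are all hypothetical choices.

```python
import numpy as np

# Toy run of Algorithm 1 on a small random MDP (all parameters are illustrative).
rng = np.random.default_rng(0)
nS, nA, gamma, omega, T = 5, 3, 0.8, 0.8, 10000

P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # transition kernel P(.|s,a), shape (nS, nA, nS)
R = rng.uniform(0.0, 1.0, size=(nS, nA, nS))      # mean rewards R^a_{ss'}

# Reference Q* from value iteration with the exact Bellman operator (2).
Q_star = np.zeros((nS, nA))
for _ in range(2000):
    Q_star = np.einsum('ijk,ijk->ij', P, R + gamma * Q_star.max(axis=1)[None, None, :])

QA = np.zeros((nS, nA))
QB = np.zeros((nS, nA))
for t in range(1, T + 1):
    alpha = 1.0 / t**omega                        # polynomial learning rate alpha_t = 1/t^omega
    update_A = rng.random() < 0.5                 # UPDATE(A) or UPDATE(B), each w.p. 0.5
    for s in range(nS):
        for a in range(nA):
            s_next = rng.choice(nS, p=P[s, a])               # s' ~ P(.|s,a)
            r = R[s, a, s_next] + rng.normal(0.0, 0.5)       # noisy reward sample
            if update_A:
                a_star = QA[s_next].argmax()                 # greedy action from Q^A
                QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
            else:
                b_star = QB[s_next].argmax()                 # greedy action from Q^B
                QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])

print("||Q^B - Q^A|| =", np.abs(QB - QA).max())
print("||Q^A - Q*||  =", np.abs(QA - Q_star).max())
```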
Furthermore, how fast such an error converges determines the overall convergence rate of double Q-learning. The following proposition (an informal restatement of Proposition 1 in Appendix C.1) shows that such an error process can be bounded block-wise by an exponentially decreasing sequence $G_q = (1-\xi)^q V_{\max}$, $q = 0,1,2,\ldots$, for some $\xi\in(0,1)$. Conceptually, as illustrated in Figure 1, such an error process is upper-bounded by the blue-colored piecewise-linear curve.

**Proposition 1.** (Informal) Consider synchronous double Q-learning under a polynomial learning rate $\alpha_t = \frac{1}{t^\omega}$ with $\omega\in(0,1)$. We divide the time horizon into blocks $[\hat\tau_q,\hat\tau_{q+1})$ for $q\ge 0$, where $\hat\tau_0 = 0$ and $\hat\tau_{q+1} = \hat\tau_q + c_1\hat\tau_q^\omega$ with some $c_1 > 0$. Fix $\hat\epsilon > 0$. Then for any $n$ such that $G_n\ge\hat\epsilon$ and under certain conditions on $\hat\tau_1$ (see Appendix C.1), we have
$$\mathbb{P}\left(\forall q\in[0,n],\ \forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\ \|Q^B_t - Q^A_t\|\le G_{q+1}\right)\ge 1 - c_2 n\exp\left(-c_3\hat\tau_1^{\omega}\hat\epsilon^2\right),$$
where the positive constants $c_2$ and $c_3$ are specified in Appendix C.1.

Proposition 1 implies that the two Q-estimators approach each other asymptotically, but it does not necessarily imply that they converge to the optimal action-value function $Q^*$. The next proposition (an informal restatement of Proposition 2 in Appendix C.2) shows that, as long as the high-probability event in Proposition 1 holds, the error process $\|Q^A_t - Q^*\|$ between either Q-estimator (say $Q^A$) and the optimal Q-function can be bounded block-wise by an exponentially decreasing sequence $D_k$, $k = 0,1,2,\ldots$, which is proportional to $(1-\beta)^k V_{\max}$ with $\beta\in(0,1)$ and a scaling factor $\sigma$ specified in Appendix C.2. Conceptually, as illustrated in Figure 1, such an error process is upper-bounded by the yellow-colored piecewise-linear curve.

*Figure 1: Illustration of the sequence $\{G_k\}_{k\ge 0}$ as a block-wise upper bound on $\|Q^B_t - Q^A_t\|$, and the sequence $\{D_k\}_{k\ge 0}$ as a block-wise upper bound on $\|Q^A_t - Q^*\|$ conditioned on the first upper-bound event. See Appendix A for a numerical example illustrating the block-wise upper bounds.*

**Proposition 2.** (Informal) Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = \frac{1}{t^\omega}$ with $\omega\in(0,1)$. We divide the time horizon into blocks $[\tau_k,\tau_{k+1})$ for $k\ge 0$, where $\tau_0 = 0$ and $\tau_{k+1} = \tau_k + c_4\tau_k^\omega$ with some $c_4 > 0$. Fix $\epsilon > 0$. Then for any $m$ such that $D_m\ge\epsilon$ and under certain conditions on $\tau_1$ (see Appendix C.2), we have
$$\mathbb{P}\left(\forall k\in[0,m],\ \forall t\in[\tau_{k+1},\tau_{k+2}),\ \|Q^A_t - Q^*\|\le D_{k+1}\,\Big|\, E, F\right)\ge 1 - c_5 m\exp\left(-c_6\tau_1^{\omega}\epsilon^2\right),$$
where $E$ and $F$ denote certain events defined in (12) and (13) in Appendix C.2, and the positive constants $c_4$, $c_5$, and $c_6$ are specified in Appendix C.2.

As illustrated in Figure 1, the two block sequences $\{\hat\tau_q\}_{q\ge 0}$ in Proposition 1 and $\{\tau_q\}_{q\ge 0}$ in Proposition 2 can be chosen to coincide with each other. Combining the above two properties with further mathematical arguments yields the following main theorem, which characterizes the convergence rate of double Q-learning. We provide a proof sketch for Theorem 1 in Section 3.3, which explains the main steps for obtaining the supporting Propositions 1 and 2 and how they further yield the main theorem.

**Theorem 1.** Fix $\epsilon > 0$ and $\gamma\in(1/3,1)$. Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = \frac{1}{t^\omega}$ with $\omega\in(0,1)$. Let $Q^A_T(s,a)$ be the value of $Q^A$ for a state-action pair $(s,a)$ at time $T$. Then we have $\mathbb{P}(\|Q^A_T - Q^*\|\le\epsilon)\ge 1-\delta$, given that
$$T = \Omega\left(\left(\frac{V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{|\mathcal S||\mathcal A|V_{\max}^2}{(1-\gamma)^5\epsilon^2\delta}\right)^{\frac{1}{\omega}} + \left(\frac{1}{1-\gamma}\ln\frac{V_{\max}}{(1-\gamma)\epsilon}\right)^{\frac{1}{1-\omega}}\right), \tag{5}$$
where $V_{\max} = \frac{2R_{\max}}{1-\gamma}$.

Theorem 1 provides the finite-time convergence guarantee in a high-probability sense for synchronous double Q-learning.
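Before interpreting the bound, the following small sketch (ours; the absolute constants hidden by the $\Omega(\cdot)$ notation are dropped and $R_{\max}=1$ is assumed, so the numbers are only indicative) evaluates the two terms of (5) to show how the learning-rate exponent $\omega$ trades them off: increasing $\omega$ shrinks the first ($\epsilon$-dominated) term but inflates the second.

```python
import numpy as np

# Indicative evaluation of the two terms in bound (5) of Theorem 1
# (constants hidden by the Omega notation are dropped; R_max = 1 is assumed).
def iteration_bound_terms(eps, gamma, delta, omega, nS, nA):
    V = 2.0 / (1.0 - gamma)                       # V_max = 2 R_max / (1 - gamma)
    term1 = (V**2 / ((1 - gamma)**4 * eps**2)
             * np.log(nS * nA * V**2 / ((1 - gamma)**5 * eps**2 * delta)))**(1.0 / omega)
    term2 = ((1.0 / (1 - gamma)) * np.log(V / ((1 - gamma) * eps)))**(1.0 / (1 - omega))
    return term1, term2

for omega in (0.5, 6.0 / 7.0, 0.95):
    t1, t2 = iteration_bound_terms(eps=0.1, gamma=0.9, delta=0.05, omega=omega, nS=10, nA=5)
    print(f"omega = {omega:.2f}:  term1 ~ {t1:.2e},  term2 ~ {t2:.2e}")
```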
Specifically, double Q-learning attains an $\epsilon$-accurate optimal Q-function with high probability with at most
$$\Omega\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\ln\frac{1}{(1-\gamma)^7\epsilon^2}\right)^{\frac{1}{\omega}} + \left(\frac{1}{1-\gamma}\ln\frac{1}{(1-\gamma)^2\epsilon}\right)^{\frac{1}{1-\omega}}\right)$$
iterations. Such a result can be further understood by considering the following two regimes. In the high-accuracy regime, in which $\epsilon\lesssim 1-\gamma$, the dependence on $\epsilon$ dominates, and the time complexity is given by $\left(\frac{1}{\epsilon^2}\right)^{\frac{1}{\omega}}$, which is optimized as $\omega$ approaches 1. In the low-accuracy regime, in which $\epsilon\gtrsim 1-\gamma$, the dependence on $\frac{1}{1-\gamma}$ dominates, and the time complexity is optimized at $\omega = \frac{6}{7}$, which yields
$$T = \Omega\left(\frac{1}{(1-\gamma)^7\epsilon^{7/3}} + \frac{1}{(1-\gamma)^7}\right) = \Omega\left(\frac{1}{(1-\gamma)^7\epsilon^{7/3}}\right).$$

Furthermore, Theorem 1 corroborates the design effectiveness of double Q-learning, which overcomes the overestimation issue and hence achieves better accuracy by making less aggressive progress in each update. Specifically, a comparison of Theorem 1 with the time complexity bounds of vanilla synchronous Q-learning under a polynomial learning rate in Even-Dar and Mansour (2003) and Wainwright (2019) indicates that in the high-accuracy regime, double Q-learning achieves the same convergence rate as vanilla Q-learning in terms of the order-level dependence on $\epsilon$. Clearly, the design of double Q-learning for high accuracy dominates the performance. In the low-accuracy regime (which is not what double Q-learning is designed for), double Q-learning achieves a slightly weaker convergence rate than vanilla Q-learning in Even-Dar and Mansour (2003); Wainwright (2019) in terms of the dependence on $1-\gamma$, because its less aggressive progress dominates the performance.

### 3.2 Asynchronous Double Q-learning

In this subsection, we study asynchronous double Q-learning and provide its finite-time convergence result. Differently from synchronous double Q-learning, in which all state-action pairs are visited for each update of the chosen Q-estimator, asynchronous double Q-learning visits only one state-action pair for each update of the chosen Q-estimator. Therefore, we make the following standard assumption on the exploration strategy (Even-Dar and Mansour, 2003):

**Assumption 1.** (Covering number) There exists a covering number $L$ such that, within any $L$ consecutive updates of either the $Q^A$ or the $Q^B$ estimator, all the state-action pairs of the chosen Q-estimator are visited at least once.

Such conditions on the exploration are usually necessary for the finite-time analysis of asynchronous Q-learning. The same assumption was adopted in Even-Dar and Mansour (2003), and Qu and Wierman (2020) proposed a mixing-time condition in the same spirit. Assumption 1 essentially requires the sampling strategy to have good visitation coverage over all state-action pairs. Specifically, Assumption 1 guarantees that any $L$ consecutive updates of $Q^A$ visit each state-action pair of $Q^A$ at least once, and the same holds for $Q^B$. Since $2L$ iterations of asynchronous double Q-learning must include at least $L$ updates of either $Q^A$ or $Q^B$, Assumption 1 further implies that any state-action pair $(s,a)$ must be visited at least once during $2L$ iterations of the algorithm. In fact, our analysis allows a certain relaxation of Assumption 1 by only requiring each state-action pair to be visited during an interval with a certain probability. In such a case, we can also derive a finite-time bound by additionally dealing with a conditional probability.
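To make Assumption 1 concrete, the following sketch (our illustration; the MDP, the uniformly random behavior policy, and the estimation procedure are all hypothetical) empirically estimates a covering number $L$ for the $Q^A$-updates along one sampled trajectory, as the smallest window of consecutive $Q^A$-updates that always contains every state-action pair.

```python
import numpy as np

# Empirical illustration of Assumption 1 (all choices here are hypothetical):
# estimate a covering number L for the Q^A-updates along one sampled trajectory
# generated by a uniformly random behavior policy on a small random MDP.
rng = np.random.default_rng(1)
nS, nA, T = 4, 2, 5000
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # random transition kernel

s = 0
visits_A = []                                     # (s, a) visited when Q^A is the chosen table
for t in range(T):
    a = int(rng.integers(nA))                     # uniform exploration
    if rng.random() < 0.5:                        # i_t = A with probability 0.5
        visits_A.append((s, a))
    s = rng.choice(nS, p=P[s, a])                 # environment transition

def covering_number(visits, n_pairs):
    # Smallest L such that, empirically, every window of L consecutive updates
    # of the chosen estimator visits all n_pairs state-action pairs.
    L = 0
    for i in range(len(visits)):
        seen = set()
        for j in range(i, len(visits)):
            seen.add(visits[j])
            if len(seen) == n_pairs:
                L = max(L, j - i + 1)
                break
        else:
            break                                 # trailing suffix never covers all pairs
    return L

print("empirical covering number L ~", covering_number(visits_A, nS * nA))
```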
Next, we provide the finite-time result for asynchronous double Q-learning in the following theorem.

**Theorem 2.** Fix $\epsilon > 0$ and $\gamma\in(1/3,1)$. Consider asynchronous double Q-learning under a polynomial learning rate $\alpha_t = \frac{1}{t^\omega}$ with $\omega\in(0,1)$. Suppose Assumption 1 holds. Let $Q^A_T(s,a)$ be the value of $Q^A$ for a state-action pair $(s,a)$ at time $T$. Then we have $\mathbb{P}(\|Q^A_T - Q^*\|\le\epsilon)\ge 1-\delta$, given that
$$T = \Omega\left(\left(\frac{L^4 V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{|\mathcal S||\mathcal A|L^4 V_{\max}^2}{(1-\gamma)^5\epsilon^2\delta}\right)^{\frac{1}{\omega}} + \left(\frac{L^2}{1-\gamma}\ln\frac{\gamma V_{\max}}{(1-\gamma)\epsilon}\right)^{\frac{1}{1-\omega}}\right). \tag{6}$$

A comparison of Theorems 1 and 2 indicates that the finite-time result of asynchronous double Q-learning matches that of synchronous double Q-learning in the order-level dependence on $\frac{1}{1-\gamma}$ and $\frac{1}{\epsilon}$. The difference lies in the extra dependence on the covering number $L$ in Theorem 2. Since synchronous double Q-learning visits all state-action pairs (i.e., takes $|\mathcal S||\mathcal A|$ sample updates) at each iteration, whereas asynchronous double Q-learning visits only one state-action pair (i.e., takes only one sample update) at each iteration, a more reasonable comparison between the two should be in terms of the overall sample complexity. In this sense, synchronous and asynchronous double Q-learning have sample complexities of $|\mathcal S||\mathcal A|T$ (where $T$ is given in (5)) and $T$ (where $T$ is given in (6)), respectively. Since in general $L\ge|\mathcal S||\mathcal A|$, synchronous double Q-learning is more efficient than asynchronous double Q-learning in terms of the overall sample complexity.

### 3.3 Proof Sketch of Theorem 1

In this subsection, we provide an outline of the technical proof of Theorem 1 and summarize the key ideas behind it. The detailed proof can be found in Appendix C. Although we adopt the epoch-wise analysis technique that has been used for studying vanilla Q-learning in Bertsekas and Tsitsiklis (1996); Even-Dar and Mansour (2003), the analysis of double Q-learning requires handling two inter-connected SAs, so that the analysis of a single SA in Even-Dar and Mansour (2003) cannot be directly applied. Thus, we need to develop new tools to handle this new challenge.

Our goal is to study the finite-time convergence of the error $\|Q^A_t - Q^*\|$ between one Q-estimator and the optimal Q-function (this is without loss of generality due to the symmetry of the two estimators). To this end, our proof includes: (a) Part I, which analyzes the stochastic error propagation between the two Q-estimators, $\|Q^B_t - Q^A_t\|$; (b) Part II, which analyzes the error dynamics between one Q-estimator and the optimum, $\|Q^A_t - Q^*\|$, conditioned on the error event in Part I; and (c) Part III, which bounds the unconditional error $\|Q^A_t - Q^*\|$. We describe each of the three parts in more detail below.

**Part I: Bounding $\|Q^B_t - Q^A_t\|$ (see Proposition 1).** The main idea is to upper bound $\|Q^B_t - Q^A_t\|$ block-wise by a decreasing sequence $\{G_q\}_{q\ge 0}$ with high probability, where each block $q$ (with $q\ge 0$) is defined by $t\in[\hat\tau_q,\hat\tau_{q+1})$. The proof consists of the following four steps.

Step 1 (see Lemma 2): We characterize the dynamics of $u^{BA}_t(s,a) := Q^B_t(s,a) - Q^A_t(s,a)$ as an SA algorithm as follows:
$$u^{BA}_{t+1}(s,a) = (1-\alpha_t)u^{BA}_t(s,a) + \alpha_t\big(h_t(s,a) + z_t(s,a)\big),$$
where $h_t$ is a contractive mapping of $u^{BA}_t$, and $z_t$ is a martingale difference sequence.

Step 2 (see Lemma 3): We derive lower and upper bounds on $u^{BA}_t$ via two sequences $X_{t;\hat\tau_q}$ and $Z_{t;\hat\tau_q}$ as follows:
$$-X_{t;\hat\tau_q}(s,a) + Z_{t;\hat\tau_q}(s,a) \le u^{BA}_t(s,a) \le X_{t;\hat\tau_q}(s,a) + Z_{t;\hat\tau_q}(s,a),$$
for any $t\ge\hat\tau_q$, any state-action pair $(s,a)\in\mathcal S\times\mathcal A$, and any $q\ge 0$, where $X_{t;\hat\tau_q}$ is deterministic and driven by $G_q$, and $Z_{t;\hat\tau_q}$ is stochastic and driven by the martingale difference sequence $z_t$.

Step 3 (see Lemma 5 and Lemma 6): We bound $u^{BA}_t(s,a)$ block-wise using an induction argument. Namely, we prove that $\|u^{BA}_t\|\le G_q$ for $t\in[\hat\tau_q,\hat\tau_{q+1})$ holds for all $q\ge 0$.
By induction, we first observe that $\|u^{BA}_t\|\le G_0$ holds for $q = 0$. Given any state-action pair $(s,a)$, we assume that $|u^{BA}_t(s,a)|\le G_q$ holds for $t\in[\hat\tau_q,\hat\tau_{q+1})$. We then show that $|u^{BA}_t(s,a)|\le G_{q+1}$ holds for $t\in[\hat\tau_{q+1},\hat\tau_{q+2})$, which follows by bounding $X_{t;\hat\tau_q}$ and $Z_{t;\hat\tau_q}$ separately in Lemma 5 and Lemma 6, respectively.

Step 4 (see Appendix C.1.4): We apply a union bound (Lemma 8) to obtain the block-wise bound for all state-action pairs and all blocks.

**Part II: Conditionally bounding $\|Q^A_t - Q^*\|$ (see Proposition 2).** We upper bound $\|Q^A_t - Q^*\|$ block-wise by a decreasing sequence $\{D_k\}_{k\ge 0}$ conditioned on the following two events: Event $E$: $\|u^{BA}_t\|$ is upper bounded properly (see (12) in Appendix C.2); and Event $F$: there are sufficiently many updates of $Q^A_t$ in each block (see (13) in Appendix C.2). The proof of Proposition 2 consists of the following four steps.

Step 1 (see Lemma 10): We design a special relationship (illustrated in Figure 1) between the block-wise bounds $\{G_q\}_{q\ge 0}$ and $\{D_k\}_{k\ge 0}$ and their block separations.

Step 2 (see Lemma 11): We characterize the dynamics of the iteration residual $r_t(s,a) := Q^A_t(s,a) - Q^*(s,a)$ as an SA algorithm as follows: when $Q^A$ is chosen to be updated at iteration $t$,
$$r_{t+1}(s,a) = (1-\alpha_t)r_t(s,a) + \alpha_t\big(\mathcal T Q^A_t(s,a) - Q^*(s,a)\big) + \alpha_t w_t(s,a) + \alpha_t\gamma u^{BA}_t(s',a^*),$$
where $w_t(s,a)$ is the error between the Bellman operator and the sample-based empirical estimator, and is thus a martingale difference sequence, and $u^{BA}_t$ has been defined in Part I.

Step 3 (see Lemma 12): We provide upper and lower bounds on $r_t$ via two sequences $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ as follows:
$$-Y_{t;\tau_k}(s,a) + W_{t;\tau_k}(s,a) \le r_t(s,a) \le Y_{t;\tau_k}(s,a) + W_{t;\tau_k}(s,a),$$
for all $t\ge\tau_k$, all state-action pairs $(s,a)\in\mathcal S\times\mathcal A$, and all $k\ge 0$, where $Y_{t;\tau_k}$ is deterministic and driven by $D_k$, and $W_{t;\tau_k}$ is stochastic and driven by the martingale difference sequence $w_t$. In particular, if $Q^A_t$ is not updated at some iteration, then the sequences $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ keep the same values as in the previous iteration.

Step 4 (see Lemma 13, Lemma 14 and Appendix C.2.4): Similarly to Steps 3 and 4 in Part I, we conditionally bound $\|r_t\|\le D_k$ for $t\in[\tau_k,\tau_{k+1})$ and $k\ge 0$ by bounding $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ and then taking a union bound.

**Part III: Bounding $\|Q^A_t - Q^*\|$ (see Appendix C.3).** We combine the results in the first two parts and provide a high-probability bound on $r_t$ with further probabilistic arguments, which exploit the high-probability bounds on $\mathbb{P}(E)$ in Proposition 1 and $\mathbb{P}(F)$ in Lemma 15.

## 4 Conclusion

In this paper, we provide the first finite-time results for double Q-learning, which characterize how fast double Q-learning converges under both synchronous and asynchronous implementations. For the synchronous case, we show that it achieves an $\epsilon$-accurate optimal Q-function with probability at least $1-\delta$ by taking
$$\Omega\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\ln\frac{|\mathcal S||\mathcal A|}{(1-\gamma)^7\epsilon^2\delta}\right)^{\frac{1}{\omega}} + \left(\frac{1}{1-\gamma}\ln\frac{1}{(1-\gamma)^2\epsilon}\right)^{\frac{1}{1-\omega}}\right)$$
iterations. A similar scaling order on $\frac{1}{1-\gamma}$ and $\frac{1}{\epsilon}$ also applies to asynchronous double Q-learning, but with an extra dependence on the covering number. We develop new techniques to bound the error between two correlated stochastic processes, which may be of independent interest.

## Broader Impact

Reinforcement learning has achieved great success in areas such as robotics and game playing, and has thus aroused broad interest and more potential real-world applications. Double Q-learning is a commonly used technique in deep reinforcement learning to improve the implementation stability and speed of deep Q-learning.
In this paper, we provided a fundamental analysis of the convergence rate of double Q-learning, which theoretically justifies the empirical success of double Q-learning in practice. Such a theory also provides practitioners with a desirable performance guarantee to further develop this technique into various transferable technologies.

## Acknowledgements

The work was supported in part by the U.S. National Science Foundation under the grant CCF-1761506, National University of Singapore startup grant R-263-000-E60-133, National Natural Science Foundation of China (Grant No. 62073159), and the Shenzhen Science and Technology Program (Grant No. JCYJ20200109141601708).

## References

Abed-alguni, B. H. and Ottom, M. A. (2018). Double delayed Q-learning. International Journal of Artificial Intelligence, 16(2):41-59.

Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357-367.

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30-37. Elsevier.

Beck, C. L. and Srikant, R. (2012). Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12):1203-1208.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming, volume 5. Athena Scientific.

Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447-469.

Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems (NeurIPS), pages 11312-11322.

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874.

Chen, Z., Zhang, S., Doan, T. T., Maguluri, S. T., and Clarke, J.-P. (2019). Finite-time analysis of Q-learning with linear function approximation. arXiv preprint arXiv:1905.11425.

Du, S. S., Luo, Y., Wang, R., and Zhang, H. (2019). Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems (NeurIPS), pages 8058-8068.

Even-Dar, E. and Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1-25.

Hasselt, H. V. (2010). Double Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 2613-2621.

Hasselt, H. V., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proc. AAAI Conference on Artificial Intelligence (AAAI).

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence (AAAI).

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 703-710.

Lee, D. and He, N. (2019). A unified switching system perspective and ODE analysis of Q-learning algorithms. arXiv preprint arXiv:1912.02270.

Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. arXiv preprint arXiv:2006.03041.

Melo, F. S. (2001). Convergence of Q-learning: A simple proof.
Institute of Systems and Robotics, Tech. Rep., pages 1-4.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.

Okuyama, T., Gonsalves, T., and Upadhay, J. (2018). Autonomous driving system based on deep Q-learnig. In Proc. IEEE International Conference on Intelligent Autonomous Systems (ICoIAS), pages 201-205.

Qu, G. and Wierman, A. (2020). Finite-time analysis of asynchronous stochastic approximation and Q-learning. arXiv preprint arXiv:2002.00260.

Shah, D. and Xie, Q. (2018). Q-learning with nearest neighbors. In Advances in Neural Information Processing Systems (NeurIPS), pages 3111-3121.

Smith, J. E. and Winkler, R. L. (2006). The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311-322.

Szepesvári, C. (1998). The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 1064-1070.

Tai, L. and Liu, M. (2016). A robot exploration strategy based on Q-learning network. In Proc. IEEE International Conference on Real-time Computing and Robotics (RCAR), pages 57-62.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202.

Wainwright, M. J. (2019). Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for Q-learning. arXiv preprint arXiv:1905.06265.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279-292.

Weng, B., Xiong, H., Liang, Y., and Zhang, W. (2020a). Analysis of Q-learning with adaptation and momentum restart for gradient descent. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), pages 3051-3057.

Weng, B., Xiong, H., Zhao, L., Liang, Y., and Zhang, W. (2020b). Momentum Q-learning with finite-sample convergence guarantee. arXiv preprint arXiv:2007.15418.

Weng, W., Gupta, H., He, N., Ying, L., and Srikant, R. (2020c). Provably-efficient double Q-learning. arXiv preprint arXiv:2007.05034.

Xu, P. and Gu, Q. (2019). A finite-time analysis of Q-learning with neural network function approximation. arXiv preprint arXiv:1912.04511.

Zhang, Q., Lin, M., Yang, L. T., Chen, Z., Khan, S. U., and Li, P. (2018a). A double deep Q-learning model for energy-efficient edge scheduling. IEEE Transactions on Services Computing, 12(5):739-749.

Zhang, Y., Sun, P., Yin, Y., Lin, L., and Wang, X. (2018b). Human-like autonomous vehicle speed control by deep reinforcement learning with double Q-learning. In Proc. IEEE Intelligent Vehicles Symposium (IV), pages 1251-1256.

Zhang, Z., Pan, Z., and Kochenderfer, M. J. (2017). Weighted double Q-learning. In International Joint Conferences on Artificial Intelligence, pages 3455-3461.

Zou, S., Xu, T., and Liang, Y. (2019). Finite-sample analysis for SARSA with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 8665-8675.