# Regularized Q-learning through Robust Averaging

Peter Schmitt-Förster¹  Tobias Sutter¹

¹Department of Computer and Information Science, University of Konstanz, Germany. Correspondence to: Peter Schmitt-Förster.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.

1. Introduction

The optimal policy of a Markov Decision Process (MDP) is characterized via the dynamic programming equations introduced by Bellman (1957). While these dynamic programming equations critically depend on the underlying model, model-free reinforcement learning (RL) aims to learn these equations by interacting with the environment without any knowledge of the underlying model (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 2018; Meyn, 2022). There are two fundamentally different notions of interacting with the unknown environment. The first one is referred to as the synchronous setting, which assumes sample access to a generative model (or simulator), where the estimates are updated simultaneously across all state-action pairs in every iteration step. The second concept concerns an asynchronous setting, where one only has access to a single trajectory of data generated under some fixed policy. A learning algorithm then updates its estimate of a single state-action pair using one state transition from the sampled trajectory in every step. In this paper, we focus on the asynchronous setting, which is considerably more challenging than the synchronous setting due to the Markovian nature of its sampling process.

One of the most popular RL algorithms is Q-learning (Watkins, 1989; Watkins & Dayan, 1992), which iteratively learns the value function and hence the corresponding optimal policy of an MDP with unknown transition kernel. When designing an RL algorithm, there are various desirable properties such an algorithm should have, including (i) convergence to an optimal policy, (ii) efficient computation, and (iii) good performance of the learned policy after finitely many iterations. Watkins Q-learning is known to converge to the optimal policy under relatively mild conditions (Tsitsiklis, 1994) and a finite-time analysis is also available (Even-Dar & Mansour, 2001; Beck & Srikant, 2012; Qu & Wierman, 2020). Moreover, its simple update rule requires only a single maximization over the action space per step.
The simplicity of Watkins Q-learning, however, comes at the cost of introducing an overestimation bias (Thrun & Schwartz, 1993; van Hasselt, 2010), which can severely impede the quality of the learned policy (Szita & L orincz, 2008; Strehl et al., 2009; Thrun & Schwartz, 1993; van Hasselt, 2010). It has been experimentally demonstrated that both, overestimation and underestimation bias, may improve learning performance, depending on the environment at hand (see Sutton & Barto (2018, Chapter 6.7) and Lan et al. (2020) for a detailed explanation). Therefore, deriving a Q-learning method equipped with the possibility to precisely control the (overand under-)estimation bias is desirable. Related Work. In the last decade, several Q-learning variants have been proposed to improve the weakness of Watkins Q-learning while aiming to admit the desirable properties (i)-(iii). We discuss approaches that are most relevant to our work. Double Q-learning (van Hasselt, 2010) mitigates the overestimation bias of Watkins Q-learning by introducing a double estimator for the maximum expected value term. While Double Q-learning is known to converge Regularized Q-learning through Robust Averaging to the optimal policy and has a similar computational cost per iteration to Q-learning, it, unfortunately, introduces an underestimation bias which, depending on the environment considered, can be equally undesirable as the overestimation bias of Watkins Q-learning (Lan et al., 2020). Maxmin Q-learning (Lan et al., 2020) works with N stateaction estimates, where N denotes a parameter, and chooses the smallest estimate to select the maximum action. Maxmin Q-learning allows to control the estimation bias via the parameter N. However, it generally requires a large value of N to remove the overestimation bias. Conceptually, Maxmin Q-learning is related to our approach, where we select a maximum action based on the average current Q-function and then consider a worst-case ball around this average value. Similarly, REDQ (Chen et al., 2021) updates N Q-functions based on a max-action, min-function step over a sampled subset of size M out of the N Q-functions and allows to control the overand underestimation bias. It further incorporates multiple randomized update steps in each iteration which results in good sample efficiency. REDQ performs well in complicated non-tabular settings (including continuous state/action spaces). In the tabular setting, Maxmin Q-learning and REDQ are equipped with an asymptotic convergence proof. Averaged-DQN (Anschel et al., 2017) is a simple extension to the Deep Q-learning (DQN) algorithm (Mnih et al., 2015), based on averaging previously learned Q-value estimates, which leads to a more stable training procedure and typically results in an improved performance by reducing the approximation error variance in the target values. Averaged-DQN and other regularized variants of DQN, such as Munchausen reinforcement learning (Vieillard et al., 2020), use a deep learning architecture and are not equipped with any theoretical guarantees about convergence or quality of the learned policy (Mehta & Meyn, 2020). In general, averaging in Q-learning is a well-known variance reduction method. A specific form of variance-reduced Q-learning is presented in Wainwright (2019), where it is shown that the presented algorithm has minimax optimal sample complexity. Regularized Q-learning (Lim et al., 2022) studies a modified Q-learning algorithm that converges when linear function approximation is used. 
It is shown that simply adding an appropriate regularization term ensures convergence of the algorithm in settings where the vanilla variant does not converge due to the linear function approximation used. A slightly different objective, when modifying Q-learning schemes, is to robustify them against environment shifts, i.e., settings where the environment, in which the policy is trained, is different from the environment in which the policy will be deployed. A popular approach is to consider a distributionally robust Q-learning model, where the resulting Q-function converges to the optimal Qvalue that is robust against small shifts in the environment, see Liu et al. (2022) for KL-based ambiguity sets and for distributionally robust formulations using the Wasserstein distance Neufeld & Sester (2022). The recent paper from Liu et al. (2022) presents a distributionally robust Q-learning methodology, where the resulting Q-function converges to the optimal Q-value that is robust against small shifts in the environment. Mehta & Meyn (2020), Meyn (2022), Lu et al. (2022), and Lu & Meyn (2023) have introduced a new class of Q-learning algorithms called convex Q-learning which exploit a convex reformulation of the Bellman equation via the well-known linear programming approach to dynamic programming (Hern andez-Lerma & Lasserre, 1996; 1999). Contribution. In this paper, we introduce a new Qlearning variant called Regularized Q-learning through Robust Averaging (2RA), which combines regularization and averaging. The proposed method has two parameters, ρ > 0 quantifying the level of robustness/regularization introduced and N N, which describes the number of state-action estimates used to form the empirical average. Centered around this new Q-learning variant, our main contributions can be summarized as follows: We present a tractable formulation of the proposed 2RA Q-learning where the computational cost per iteration is comparable to Watkins Q-learning. For any choice of N and for any positive sequence of regularization parameters {ρn}n N such that limn ρn = 0, we prove that the proposed 2RA Q-learning asymptotically converges to the true Q-function, see Theorem 3.1. We show how the choice of the two parameters ρ and N allow us to control the level of estimation bias in 2RA Q-learning, see Theorem 3.2, and show that as N our proposed estimation scheme becomes unbiased. We prove that under certain technical assumptions, the asymptotic mean-squared error of 2RA Q-learning is equal to the asymptotic mean-squared error of Watkins Qlearning, provided that we choose the learning rate of our method N-times larger than that of Watkins Q-learning, see Theorem 3.3. This theoretical insight allows practitioners to start with an initial guess of the learning rate for the proposed method that is N-times larger than that of standard Q-learning. We demonstrate that the theoretical properties of 2RA can be numerically reproduced in synthetic MDP settings. In more practical experiments from the Open AI gym suite (Brockman et al., 2016) we show that, even when implementations require deviations from out theoretically required assumptions, 2RA Q-learning has good performance and mostly outperforms other Q-learning variants. Detailed proofs are relegated to the Appendix A. Regularized Q-learning through Robust Averaging 2. Problem Setting Consider an MDP given by a six-tuple (S, A, P, r, γ, s0) comprising a finite state space S = {1, . . . , S}, a finite action space A = {1, . . . 
, A}, a transition kernel P : S A (S), a reward-per-stage function r : S A R, a discount factor γ (0, 1), and a deterministic initial state s0 S. Note that (S, A, P, r, γ, s0) describes a controlled discrete-time stochastic system, where the state and the action applied at time t are denoted as random variables St and At, respectively. If the system is in state st S at time t and action at A is applied, then an immediate reward of r(st, at) is incurred, and the system moves to state st+1 at time t + 1 with probability P(st+1|st, at). Thus, P( |st, at) represents the distribution of St+1 conditional on St = st and At = at. It is often convenient to represent the transition kernel as a matrix P RSA S. Actions are usually chosen according to a policy that prescribes a random action at time t depending on the state history up to time t and the action history up to time t 1. Throughout the paper, we restrict attention to stationary Markov policies, which are described by a stochastic kernel π : S (A), that is, π(at|st) denotes the probability of choosing action at while being in state st. We denote by Π the space of all stationary Markov policies. Given a stationary policy π and an initial condition s0, it is well-known that there exists a unique probability measure Pπ s0 defined on the canonical sample space Ω= (S A) equipped with its power set σ-field F = 2Ω, such that for all t N we have (see Hern andez-Lerma & Lasserre (1996, Section 2.2) for further details) Pπ s0(S0 = s0) = 1, Pπ s0(St+1 = st+1|St = st, At = at) = P(st+1|st, at) st, st+1 S, at A, Pπ s0(At = at|St = st) = π(at|st) st S, at A. To keep the notation simple, in the following, we denote Pπ s0 by P and the corresponding expectation operator by E. Then, the ultimate goal is to find an optimal policy π which leads to the largest expected infinite-horizon reward, i.e., π arg max π Π t=0 γt E[r(St, At)]. (1) An optimal (deterministic) policy can be alternatively obtained from the optimal Q-function as π ( |s) = δa ( ), where a arg maxa A Q (s, a) and the optimal Qfunction satisfies the Bellman equation (Bertsekas & Tsitsiklis, 1996), i.e., (s, a) S A Q (s, a) = r(s, a)+γ X s S P(s |s, a) max a A Q (s , a ). (2) Solving for the Q-function via (2) requires the knowledge of the underlying transition kernel P and reward function r, objects which in reinforcement learning problems generally are not known. In this work, we focus on the so-called asynchronous RL setting, where the Q-function is learned from a single trajectory of data which we assume to be generated from a fixed behavioral policy leading to state-action pairs {(S1, A1), . . . , (Sn, An), . . . }. The standard asynchronous Q-learning algorithm (Weng et al., 2020) can then be expressed as Qn+1(Sn, An) = Qn(Sn, An) + αQL n r(Sn, An) + γ max a A Qn(Sn+1, a ) Qn(Sn, An) , (3) where αQL n (0, 1] is the learning rate. It has been shown (Tsitsiklis, 1994; Szepesv ari, 1997; Qu & Wierman, 2020; Lee & He, 2020) that if each state is updated infinitely often and each action is tried an infinite number of times in each state, convergence to the optimal Q-function can be obtained. That is, for a learning rate satisfying P n=0 αQL n = and P n=0(αQL n )2 < , Q-learning converges P-almost surely to an optimal solution of the Bellman equation (2), i.e., limn Qn = Q P-almost surely. To simplify notation, we introduce the state-action variables Xn = (Sn, An) and define X = S A. As the state space in practice is often large, the Q-functions are commonly approximated via fewer basis functions. 
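To make the update rule (3) concrete before moving to function approximation, here is a minimal Python sketch of tabular asynchronous Q-learning along a single trajectory. The environment interface (`reset`, `step`) and the uniform behavioral policy are illustrative assumptions on our part, not part of the paper; the step-size schedule anticipates the one used in the experiments of Section 4.

```python
import numpy as np

def watkins_q_learning(env, n_states, n_actions, gamma=0.9,
                       n_steps=100_000, alpha0=0.01, w_alpha=1e5, seed=0):
    """Tabular asynchronous Q-learning, cf. update rule (3).

    `env` is a hypothetical interface with `reset() -> s` and
    `step(s, a) -> (s_next, reward)`; actions are sampled from a uniform
    behavioral policy so that every state-action pair is visited infinitely
    often (as required for convergence).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for n in range(n_steps):
        a = int(rng.integers(n_actions))            # uniform behavioral policy
        s_next, r = env.step(s, a)
        alpha = alpha0 * w_alpha / (n + w_alpha)    # decaying learning rate
        # TD target with a max over actions at the next state; this max is
        # the source of the overestimation bias discussed below.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q
```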
When interpreting the Q-function as a vector on R|X|, we use Q Φ θ , θ Rd, Φ = ϕ(s1, a1), . . . , ϕ(s|S|, a|A|) Rd |X|, (4) where with slight abuse of notation we denote by ϕ(s, a) Rd the given feature vectors associated with the pairs s = si and a = ai for i {1, . . . |X|}. Clearly, by choosing d = |X| and the canonical feature vectors ϕ(s, a) = P|X| i=1 ei 1{(s,a)=(si,ai)} for i = 1, . . . , |X|, the approximation (4) is exact, which is referred to as the tabular setting. For our convergence results that only hold in the tabular setting, we will use this representation. In this linear function approximation formulation, the standard asynchronous Q-learning (3) can be expressed as so-called Q(0)-learning (Melo et al., 2008; Meyn, 2022), which is given as θn+1 = θn + αQL n b(Xn) A1(Xn)θn + E(Xn, Sn+1, θn) , (5) where b(Xn) = ϕ(Xn)r(Xn), A1(Xn) = ϕ(Xn)ϕ(Xn) , E(Xn, Sn+1, θn) = γϕ(Xn) maxa A ϕ(Sn+1, a ) θn. It is well-known (van Hasselt, 2010), that the term E in the Q-learning (5) introduces an overestimation bias when compared to (2), since max a A E[ϕ(s, a ) θn] E[max a A ϕ(s, a ) θn] s S, (6) Regularized Q-learning through Robust Averaging where the expectation is with respect to the random variable θn defined according to (5) and the inequality is due to Jensen. A common method to mitigate this overestimation bias is to modify Q-learning to the so-called Double Qlearning (van Hasselt, 2010), which using the linear function approximation, can be expressed in the form (5), where θn = θA n θB n , b(Xn) = βnϕ(Xn)r(Xn) (1 βn)ϕ(Xn)r(Xn) A1(Xn)= βnϕ(Xn)ϕ(Xn) 0 0 (1 βn)ϕ(Xn)ϕ(Xn) E(Xn, Sn+1, θn) = βnγϕ(Xn)ϕ(Sn+1, πθA n (Sn+1)) θB n (1 βn)γϕ(Xn)ϕ(Sn+1, πθB n (Sn+1)) θA n βn are i.i.d. Bernoulli random variables with equal probability, and πθ(s) = arg maxa A ϕ(s, a) θ. While Double Q-learning avoids an overestimation bias, it introduces an underestimation bias (van Hasselt, 2010, Lemma 1), as each component of E satisfies1 E[ϕ(s, πθA n (s)) θB n ] max a A E[ϕ(s, a ) θA n ] s S. It can be directly seen by the Jensen inequality (6) that in the special case of θA = θB the inequality above becomes an equality. An inherent difficulty with the overestimation bias of standard Q-learning (resp. the underestimation bias of Double Q-learning) is that it cannot be controlled, i.e., the level of under-(resp. over-)estimation bias depending on the problem considered can be significant. In the following, we present a Q-learning method where the level of estimation bias can be precisely adjusted via a hyperparameter. 3. Regularization through Robust Averaging We propose the 2RA Q-learning method defined by the update rule θ(i) n+1 = θ(i) n + αnβ(i) n (b(Xn) A1(Xn)θ(i) n + Eρ(Xn, Sn+1, bθN,n)), for i = 1, . . . , N, (7) where βn is a generalized i.i.d. Bernoulli random variable on {1, . . . , N} with equal probability for each component i, i.e., P(βn = ei) = 1/N for all i = 1, . . . , N, where ei is the ith unit vector on RN and αn is the learning rate. 2RA Q-learning (7) is based on the estimator defined for all x X, s S, θ Rd as Eρ(x, s , θ) = γϕ(x) max a A min θ Bρ(θ) ϕ(s , a ) θ , (8) where ρ 0 is a given parameter and the ambiguity (or uncertainty) set Bρ(θ) is assumed to be of the form Bρ(θ) = 1Analogously we have E[ϕ(s, πθB n (s)) θA n ] maxa A E[ϕ(s, a ) θB n ] for all s S. {θ Rd : θ θ 2 2 ρ} with its center θ being the empirical average i=1 θ(i) n . 
(9) The intuition behind the proposed 2RA Q-learning (7) is to mitigate the overestimation bias of Watkins Q-learning by approximating the term maxa A EP[ϕ(s, a ) θn], where we consider θn as a Rd-valued random variable distributed according to P, via the distributionally robust model max a A min Q Bρ(b PN,n) EQ[ϕ(s, a ) θn], (10) where Bρ(b PN,n) is a set of probability measures centered around the empirical distribution b PN,n = 1 N PN i=1 δθ(i) n . When considering the ambiguity set Bρ(b PN,n) as the ball of all distributions that have a fixed diagonal covariance and a 2-Wasserstein distance to b PN,n of at most ρ, then by (Nguyen et al., 2021, Theorem 2) the distributionally robust model (10) directly corresponds to our estimator (8). Running the 2RA Q-learning (7) requires an evaluation of the estimator Eρ(Xn, Sn+1, bθN,n) given by the optimization problem (8), which admits a closed form expression. Lemma 3.1 (Estimator computation). The estimator defined in (8) is equivalently expressed as Eρ(x, s , θ) = γϕ(x) max a A ϕ(s , a ) θ ρ ϕ(s , a ) 2 . Proof. According to Bertsimas & Tsitsiklis (1997, Lemma 9.2), for any c, θ Rd and positive definite matrix H Rd d the optimization problem min θ Rd{c θ : (θ θ) H 1(θ θ) ρ} admits a closed-form solution θ = θ r ρ c Hc Hc. Therefore, by setting c = ϕ(s , a ) and H to be the identity matrix, an optimizer in (8) is θ = θ ρ ϕ(s , a ) 2 ϕ(s , a ), which completes the proof. In the tabular setting 2RA Q-learning (7) even for N = 2 is different from Double Q-learning (van Hasselt, 2010). However, in the special case where ρ = 0 and N = 1, our method collapses to Watkins Q-learning (5). In our proposed 2RA Q-learning, we ve made a modification to the term E(Xn, Sn+1, θn) from Wattkins Q-learning (5). Regularized Q-learning through Robust Averaging It is now replaced with a regularized version, denoted as Eρ based on Lemma 3.1. This adjustment is combined with the averaging property using bθN,n. The regularization term ρ ϕ(s , a ) 2 can be interpreted as negative UCB bonus term that discourages exploration, which has been considered for linear MDPs in (Jin et al., 2019), see also (Qian et al., 2019). In the remainder of this section, we theoretically investigate 2RA Q-learning (7) and, in particular, study how to choose the two regularization parameters ρ, N, and the learning rate αn. In Section 3.1, we show what properties the regularization term ρ should satisfy such that 2RA Q-learning (7) asymptotically converges to the optimal Q-function. This convergence is independent of the number N of Q-function estimates. Section 3.2 shows how the two terms ρ and N can be exploited to control the estimation bias of the presented scheme. Finally, Section 3.3 studies the convergence rate via the notion of the asymptotic mean squared error and, in particular, shows how to choose the learning rate αn as compared to Watkins Q-learning. 3.1. Asymptotic Convergence The 2RA Q-learning (7) for the tabular setting actually converges almost surely to the optimal Q-function satisfying the Bellman equation (2), provided the radius ρ is chosen appropriately. Theorem 3.1 (Asymptotic convergence). Consider the tabular setting where d = |X|, Φ is the canonical basis and let {ρn}n N be a sequence of non-negative numbers such that limn ρn = 0. Moreover, assume that (i) The learning rates satisfy αn(s, a) (0, 1], P n=0 αn(s, a) = , P n=0 α2 n(s, a) < and αn(s, a) = 0 unless (s, a) = (Sn, An). (ii) The reward r is bounded. 
Then, for any N N, 2RA Q-learning (7) converges to the optimal Q-function Q , i.e., limn Φ bθN,n = Q P-almost surely. Note that Φ bθN,n is our learned 2RA Q-function under the canonical basis describing the tabular setting. 3.2. Estimation Bias We now focus on the estimation bias of 2RA Q-learning induced by the term Eρ in (8). While Watkins Q-learning suffers from the mentioned overestimation bias, the proposed 2RA Q-learning (7) allows us to control the estimation bias via the parameters ρ and N. We show that for ρ > 0 with high probability, 2RA Q-learning generates an underestimation bias, somewhat similar to Double Q-learning. However, in contrast to Double Q-learning, we can control the level of underestimation via the parameter ρ. Moreover, the second parameter N, describing the number of action-value estimates, further allows us to control the estimation bias. Theorem 3.2 (Estimation bias). (i) Consider any N, n N, ρ 0 and i {1, . . . , N}. If E[θ(i) n ] Bρ(bθN,n), then the robust estimator defined in (8) provides an underestimation where the level of underestimation is controlled by ρ, i.e., for all x X, s S 0 γϕ(x) max a A E[ϕ(s , a ) θ(i) n ] E[Eρ(x, s , bθN,n)] ργϕ(x) max a A ϕ(s , a ) 2. (ii) Let θ(i) be initialized with the same value for i = 1, . . . , N. For any ρ 0 and n N, the robust estimator (8) satisfies for all x X, s S lim N E[Eρ(x, s , bθN,n)] = γϕ(x) max a A n E[ϕ(s , a ) θ(i) n ] ρ ϕ(s , a ) 2 o . Assertion (ii) directly implies that for ρ = 0, the robust estimator (8) is unbiased in the limit as N . We can alternatively show this via Theorem 3.2(i). By following the proof of Theorem 3.2, we can show that for any ρ > 0 lim N P E[θ(i) n ] Bρ(bθN,n) = 1, (11) i.e., for any given ρ if N is chosen large enough, we can expect the assumption of Assertion (i) of Theorem 3.2 to hold. To derive (11) note that in the proof of Theorem 3.2, we show (see (23) and apply Hoelder s inequality) that lim N E[ bθN,n E[θ(i) n ] ] = 0. Markov s inequality states that for any ρ > 0 P bθN,n E[θ(i) n ] > ρ 1 ρE h bθN,n E[θ(i) n ] i , which directly implies (11). Corollary 3.1 (Vanishing estimation bias). Under the assumptions of Theorem 3.1 and a regularization sequence {ρn}n N R+ such that limn ρn = 0, for any N N and i {1, . . . , N} lim n E[Eρn(x, s , bθN,n)] = lim n γϕ(x) max a A E[ϕ(s , a ) θ(i) n ] x X, s S. Theorem 3.2 allows us to interpret the choice of regularization {ρn}n N in a non-asymptotic manner. Regularized Q-learning through Robust Averaging Remark 3.1 (Selection of parameter ρn). We have shown in Theorem 3.1 that convergence of 2RA Q-learning (7) requires a sequence {ρn}n N such that limn ρn = 0. Theorem 3.2 provides insights into how the specific decay of ρn determines the resulting performance of 2RA Q-learning. More precisely, Theorem 3.2 describes the inherent tradeoff in the selection of ρn: choosing larger values of ρn increases the probability that E[θ(i) n ] Bρn(bθN,n) which guarantees an underestimation bias. On the other hand, choosing smaller values of ρn decrease the level of underestimation bias but potentially introduce an overestimation bias as the probability that E[θ(i) n ] / Bρn(bθN,n) increases. We further comment on the choice of regularization ρn in the numerical experiments, Section 4. Remark 3.2 (Selection of parameter N). The convergence of 2RA Q-learning holds for any choice of N; see Theorem 3.1. Moreover, Theorem 3.2 states that increasing N decreases the estimation bias. 
Choosing the parameter N too large, however, when using a learning rate that is N-times the learning rate of Watkins Q-learning (according to Theorem 3.3), can lead to numerical instability. Therefore, in practice, a trade-off must be made when selecting N.

3.3. Asymptotic Mean-Squared Error

We have shown in Theorem 3.1 that 2RA Q-learning (7) asymptotically converges to the optimal Q-function. This section investigates the convergence rate via the so-called asymptotic mean-squared error. Throughout this section, we consider a tabular setting and assume without loss of generality that the optimal Q-function is such that θ⋆ = 0. If θ⋆ ≠ 0, the results still hold by subtracting θ⋆ from the Q-learning estimators, see Devraj & Meyn (2017). Given the 2RA Q-learning and the corresponding estimator $\widehat{\theta}_{N,n}$ as introduced in (9), we define its asymptotic mean-squared error (AMSE) as the limit of a scaled covariance

$$\mathrm{AMSE}(\widehat{\theta}_N) = \lim_{n\to\infty} n\, \mathbb{E}\big[\widehat{\theta}_{N,n}^{\top}\widehat{\theta}_{N,n}\big] = \lim_{n\to\infty} n\, \mathbb{E}\big[\|\widehat{\theta}_{N,n}\|_2^2\big].$$

Our analysis also discusses the choice of the learning rate in 2RA Q-learning compared to the learning rate of Watkins Q-learning.

Figure 1: Baird's Example. All methods use an initial learning rate of α0 = 0.01, wα = 10^5, and γ = 0.8. All 2RA agents additionally use wρ = 10^3. The reward function has values sampled uniformly at random from [−0.05, 0.05]. All results are averaged over 100 consecutive experiments. (a) Baird's example environment with the feature vectors for each state-action pair. (b) Comparison of the AMSE of Watkins Q-learning, Double Q-learning, and Maxmin Q-learning with N = 10, where 2RA Q-learning uses initial ρ0 = 0.5 and N = 10. (c) Comparison of the AMSE of 2RA Q-learning with N = 10 but different initial values ρ0. (d) Experiment showing the MSE in terms of mean and standard deviation for different values of N with ρ0 = 0.5.

Theorem 3.3 (AMSE for 2RA Q-learning). Consider a setting where the regularization sequence $\{\rho_n\}_{n\in\mathbb{N}} \subset \mathbb{R}_+$ is such that $\lim_{n\to\infty} \rho_n = 0$ and $N \in \mathbb{N}$. Let $\alpha_n^{\mathrm{QL}} = g/n$ be the learning rate of Watkins Q-learning (5) and consider the 2RA Q-learning (7) with learning rate $\alpha_n = N g/n$, where g is a positive constant². Then, there exists some $g_0 > 0$ such that for any $g > g_0$

$$\mathrm{AMSE}(\widehat{\theta}_N) = \mathrm{AMSE}(\theta^{\mathrm{QL}}),$$

where $\{\theta_n^{\mathrm{QL}}\}_{n\in\mathbb{N}}$ is a sequence generated by the Watkins Q-learning algorithm.

Remark 3.3 (Assumption on learning rate). As pointed out by Weng et al. (2020), the condition $g > g_0$ in the tabular setting reduces to $g > \frac{1}{\mu_{\min}(1-\gamma)}$, where $\mu_{\min}$ denotes the minimum entry of the stationary distribution of the state.

4. Numerical Results

We numerically³ compare our presented 2RA Q-learning (7) with Watkins Q-learning (3), van Hasselt's Double Q-learning (van Hasselt, 2010), and with Maxmin Q-learning (Lan et al., 2020). First, we look at Baird's Example (Baird, 1995), then we consider arbitrary, randomly generated MDP environments with fixed rewards, and last, the Cart Pole example (Barto et al., 1983; Brockman et al., 2016).
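Before turning to the individual experiments, the following is a minimal sketch of the quantities at the core of 2RA Q-learning: the closed-form robust estimator of Lemma 3.1 and a single tabular 2RA update, cf. (7) and its tabular form (13) in Appendix A. All function and variable names are ours, and the schedules mirror the ones described below; this is an illustration under the stated tabular assumption rather than the authors' reference implementation.

```python
import numpy as np

def robust_max_term(Phi_next, theta_bar, rho, gamma):
    """Closed-form estimator of Lemma 3.1 (without the leading phi(x) factor):
    gamma * max_a' { phi(s', a')^T theta_bar - rho * ||phi(s', a')||_2 }.

    Phi_next: array of shape (n_actions, d) stacking phi(s', a') for all a'.
    theta_bar: empirical average of the N parameter estimates, cf. (9).
    """
    scores = Phi_next @ theta_bar - rho * np.linalg.norm(Phi_next, axis=1)
    return gamma * scores.max()

def two_ra_tabular_step(Q, s, a, r, s_next, n, gamma=0.9, alpha0=0.01,
                        w_alpha=1e5, rho0=0.5, w_rho=1e3, rng=None):
    """One tabular 2RA update; Q has shape (N, |S|, |A|).

    In the tabular case ||phi(s', a')||_2 = 1, so the robust estimator
    reduces to max_a' Qbar_n(s', a') - rho_n, cf. (13).
    """
    rng = rng or np.random.default_rng()
    N = Q.shape[0]
    alpha = N * alpha0 * w_alpha / (n + w_alpha)  # N-times larger rate, Thm 3.3
    rho = rho0 * w_rho / (n + w_rho)              # vanishing radius, Thm 3.1
    i = int(rng.integers(N))                      # beta_n selects one estimate
    Q_bar = Q.mean(axis=0)                        # empirical average (9)
    robust_target = r + gamma * (Q_bar[s_next].max() - rho)
    Q[i, s, a] += alpha * (robust_target - Q[i, s, a])
    return Q
```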
In all experiments⁴, we choose a step size $\alpha_n = \frac{N \alpha_0 w_\alpha}{n + w_\alpha}$, where N is the number of state-action estimates used in the respective learning method, $w_\alpha > 0$ is a weight parameter, and $\alpha_0$ is the initial step size. The decay rate for the regularization parameter $\rho_n$, as required for convergence (see Theorem 3.1), is chosen to be either $\rho_n = \frac{\rho_0 w_\rho}{n + w_\rho}$ or $\rho_n = \frac{\rho_0 w_\rho}{n^2 + w_\rho}$, with exact parameters given for each experiment and a more detailed evaluation at the end of this section.

²We assume that our starting index for n is large enough such that Ng/n < 1.
³Here: github.com/2RAQ/code
⁴Except for REDQ, since the learning rate would be too high for the multiple updates per step.

Baird's Example. We consider the setting described in Weng et al. (2020). The environment (Figure 1a) has six states and two actions. Under action one, the transition probability to any of the six states is 1/6, and action two results in a deterministic transition to state six. These transition dynamics are independent of the state at which an action is chosen. Therefore, the trajectories on which the methods are updated can be obtained from a random uniform behavioral policy that allows every state to be visited. The feature vectors ϕ(s, a) are constructed as shown in Figure 1a, where each $e_i \in \mathbb{R}^{12}$ is the ith unit vector. In this setting, it is known that the optimal policy is unique (Weng et al., 2020), and our theoretical results apply. All θ(i) are initialized, as in Weng et al. (2020), uniformly at random with values in [0, 2]. Figures 1b and 1c have a log-scaled y-axis to emphasize the smaller differences between models as they converge. The first important observation is that all learning methods converge to the same AMSE, which is in line with Theorem 3.3. An exception is Maxmin Q-learning, for which, however, no theoretical statement regarding the expected behavior of its AMSE is made. Higher values of ρ increase the convergence speed in the early learning phase, as shown in Figure 1c. However, if ρ is too large or its relative decay too slow, learning eventually slows down (or even temporarily worsens) as large values of ρ make the update steps too big. For the proposed choice of ρ, our 2RA method outperforms the other learning methods in the first part of the learning process without getting significantly slower in the long run. Only the AMSE of Variance Reduced Q-learning outperforms all other methods, which appears to be caused by the specific instance of Baird's experiment (compare the results of the Random Environment). Our next experiment shows that 2RA and Maxmin Q-learning are sensitive towards different environments, which prefer over-, under-, or no estimation bias at all. Figure 1d uses a non-log-scaled y-axis to ensure the size of the standard errors is comparable. It can be observed how increasing the parameter N reduces the standard error of the learning across multiple experiments. However, it is also apparent that the marginal utility of each additional θ estimate decreases fast.

Random Environment. This experiment visualizes how different learning methods, with a fixed set of hyperparameters, behave under changes in the environment's transition dynamics. For |A| = 3 and |S| = 10, we consider a random environment that is described by a transition probability matrix which, for each pair s ∈ S and a ∈ A, is drawn from a Dirichlet distribution with uniform parameter 0.1. Analogously, we draw a distribution of the initial states s0.
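As an illustration of how such a random environment can be drawn, the sketch below samples each row of the transition kernel and the initial-state distribution from a Dirichlet distribution with uniform parameter 0.1; the function name and interface are our own.

```python
import numpy as np

def sample_random_mdp(n_states=10, n_actions=3, dirichlet_param=0.1, seed=0):
    """Draw a random transition kernel P and an initial-state distribution.

    P[s, a] is a distribution over next states, sampled from
    Dirichlet(0.1, ..., 0.1); the initial-state distribution is drawn
    analogously.
    """
    rng = np.random.default_rng(seed)
    alpha = np.full(n_states, dirichlet_param)
    P = rng.dirichlet(alpha, size=(n_states, n_actions))  # shape (S, A, S)
    p0 = rng.dirichlet(alpha)                             # distribution of s0
    return P, p0
```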
Similar to Baird's example, these MDPs are ergodic and random uniform behavioral policies can be used to generate trajectories based on which updates are performed. We further consider a quadratic reward function $r(s, a) = q s^2 - p a^2$ for all possible environments, where $p, q \in \mathbb{R}_+$ are such that p < q. Therefore, different environments have the same reward function but different transition dynamics. For our experiments, we chose q = 0.1 and p = 0.01. Each environment of Figure 2 is randomly drawn, in sequence, from the same random seed. The resulting dynamics vary significantly between different environments as only one constant selection of hyperparameters is used, but 2RA Q-learning consistently outperforms the other methods in the early stages of learning. With the exception of Maxmin, the other benchmarks perform similarly to Watkins Q-learning.

Figure 2: Random Environment. All methods use an initial learning rate of α0 = 0.01, wα = 10^5, γ = 0.9, and all θ(i) initialized as zero. Maxmin as well as 2RA Q-learning have N = 10, and 2RA agents additionally use ρ0 = 50 and wρ = 10^4. The plots show the first six randomly drawn environments, and all results are averaged over 100 consecutive experiments. A broader plot of the first 20 random environments is provided in Figure 4 in Appendix C.

A further observation is that the better 2RA Q-learning performs in an environment, the better Double Q-learning performs. This indicates that these are environments where an underestimation bias is beneficial (Lan et al., 2020), with the strength of the effect varying with the drawn transition dynamics.

Cart Pole. The well-known Cart Pole environment (Brockman et al., 2016) serves as a more practical application while still using the same linear function approximation model from the previous experiments in combination with a discretized Cart Pole state space. For each timestep in which the agent can keep the pole within an allowed deviation from an upright angle and the cart within an allowed deviation from its starting position along the horizontal axis, it receives a reward of +1. An episode ends if either one of the thresholds for allowed deviation is broken. Since a random uniform behavioral policy would not enable visits to all regions of the state space, the latest updated policy combined with ϵ-greedy exploration is used to generate the next timestep, which is then used to update the model; updates are applied after each timestep. We compare the different learning algorithms based on how many training episodes are required to solve the Cart Pole task. The task is considered to be solved if, during the evaluation, the average reward over 100 episodes with a maximum allowed step count of 210 reaches or exceeds 195. Across 1000 experiments, the number of episodes until the task is solved (hit times) is collected for each learning method. Methods that, on average, solve the environment with fewer training episodes are ranked higher in the performance comparison. As Cart Pole benefits from a high learning rate, an initial α0 = 0.4 is chosen and decayed per episode e, as compared to the decay per timestep of the previous experiments, such that $\alpha_e = \frac{\alpha_0 w_\alpha}{e + w_\alpha}$.
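For concreteness, the decay schedules used across the experiments can be written as the following small helpers, a direct transcription of the formulas above with our own function names: the per-step and per-episode step sizes and the two choices for the vanishing radius ρ_n.

```python
def alpha_per_step(n, alpha0=0.01, w_alpha=1e5, N=1):
    """Step size after n environment steps: N * alpha0 * w_alpha / (n + w_alpha)."""
    return N * alpha0 * w_alpha / (n + w_alpha)

def alpha_per_episode(e, alpha0=0.4, w_alpha=100):
    """Cart Pole step size after e episodes: alpha0 * w_alpha / (e + w_alpha)."""
    return alpha0 * w_alpha / (e + w_alpha)

def rho_linear_decay(n, rho0, w_rho):
    """Regularization radius rho_n = rho0 * w_rho / (n + w_rho)."""
    return rho0 * w_rho / (n + w_rho)

def rho_quadratic_decay(n, rho0, w_rho):
    """Faster decay rho_n = rho0 * w_rho / (n**2 + w_rho)."""
    return rho0 * w_rho / (n ** 2 + w_rho)
```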
Comparing the hit time distributions of the different algorithms shows that the 2RA mean performance is better than that of Double Q-learning, which outperforms both Watkins and Maxmin Q-learning by a significant margin. Cart Pole appears to benefit from the underestimation bias introduced by Double and 2RA Q-learning. This is consistent with the previous experiments, where the good performance of 2RA Q-learning correlates with good performance of Double Q-learning. Since this experiment has deterministically initialized θ0 as well as deterministic state transitions, REDQ and Variance Reduced Q-learning are not comparable in this setting. In Appendix C.2, we provide an additional example, where we test 2RA Q-learning when used with neural network Q-function approximation, applied to the Lunar Lander environment (Brockman et al., 2016). Also there, 2RA Q-learning shows good performance, despite the fact that our theoretical results do not apply.

Figure 3: Cart Pole, 1000 experiments. All methods use an initial learning rate of α0 = 0.4, wα = 100, γ = 0.999, and all θ(i) initialized as zero. Maxmin, as well as 2RA Q-learning, have N = 8. 2RA further uses ρ0 = 150 and wρ = 10^4. All algorithms are evaluated after every 50 episodes and recorded if the average evaluation reward reaches or exceeds 195. (a) Distribution of each algorithm's hit times. (b) Mean and standard deviation of hit times:

| Algorithm | Mean hit time ± std |
|---|---|
| Watkins Q-Learning | 457.35 ± 128.18 |
| Double Q-Learning | 401.89 ± 144.43 |
| Maxmin Q-Learning | 645.02 ± 270.01 |
| Averaged Q-Learning | 404.09 ± 124.19 |
| 2RA Q-Learning | 386.19 ± 133.47 |

5. Discussion and Conclusion

In this work, we proposed 2RA Q-learning and showed that it enables control of the estimation bias via the parameters N and ρ while maintaining the same asymptotic convergence guarantees as Double and Watkins Q-learning. In practice, the control of the estimation bias enables faster convergence to a good-performing policy in finitely many steps, which is caused by the intrinsic property of environments to favor an over-, an under-, or no estimation bias at all. Therefore, determining the optimal bias adjustment is highly dependent on the specific environment, and a rigorous analysis of environments' bias preferences is not yet available. To account for this, 2RA Q-learning provides two additional tuning parameters that can be used to fine-tune learning for these environment preferences. This level of control, combined with computational costs comparable to existing methods, makes 2RA Q-learning a valuable addition to the RL tool belt. The conducted numerical experiments for various settings corroborate our theoretical findings and highlight that 2RA Q-learning generally performs well and mostly outperforms other Q-learning variants.

Acknowledgements

This work was supported by the DFG in the Cluster of Excellence EXC 2117 "Centre for the Advanced Study of Collective Behaviour" (Project-ID 390829875).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. More specifically, this work is theoretical and not intended to be directly applicable to a real-world application.
Yet, there is a clear real-world motivation. It is widely believed that to move any learning algorithm (including Q-learning) out of the labs into our lives, we need better and more statistical guarantees and clear theoretical understanding. At the moment, great progress is made into this area, but the vast majority of tools is tailor-made to the application, which means that practitioners need to understand all of these different tools to safely apply them. Our work aims to assist practitioners by presenting a general framework for controlling the estimation bias of Q-learning. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Man e, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Vi egas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. Tensor Flow: Largescale machine learning on heterogeneous systems, 2015. Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, 2017. Baird, L. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, 1995. Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. Transactions on Systems, Man, and Cybernetics, 1983. Beck, C. and Srikant, R. Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12), 2012. Bellman, R. Dynamic Programming. Princeton University Press, 1957. Regularized Q-learning through Robust Averaging Bertsekas, D. P. and Tsitsiklis, J. Neuro-Dynamic Programming. Athena Scientific, 1996. Bertsimas, D. and Tsitsiklis, J. N. Introduction to Linear Optimization. Athena Scientific, 1997. Billingsley, P. Convergence of probability measures. John Wiley & Sons Inc., 1999. Bohrnstedt, G. W. and Goldberger, A. S. On the exact covariance of products of random variables. Journal of the American Statistical Association, 64(328), 1969. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016. Brockwell, P. J. Time series: Theory and methods. Springer Verlag, 1991. Chen, S., Devraj, A. M., Busic, A., and Meyn, S. P. Explicit Mean-Square Error Bounds for Monte-Carlo and Linear Stochastic Approximation. In International Conference on Artificial Intelligence and Statistics, 2020. Chen, X., Wang, C., Zhou, Z., and Ross, K. W. Randomized ensembled Double Q-Learning: Learning fast without a model. In International Conference on Learning Representations, 2021. Devraj, A. M. and Meyn, S. P. Fastest convergence for Q-learning. Ar Xiv preprint, abs/1707.03770, 2017. Durrett, R. Probability: Theory and Examples. Cambridge University Press, 2010. Even-Dar, E. and Mansour, Y. Learning Rates for QLearning. In Annual Conference on Computational Learning Theory, 2001. Gosavi, A. Boundedness of iterates in q-learning. Systems & Control Letters, 55(4):347 349, 2006. ISSN 0167-6911. He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on Image Net Classification. In International Conference on Computer Vision, 2015. Hern andez-Lerma, O. and Lasserre, J. 
Further topics on discrete-time Markov control processes. Springer, 1999. Hern andez-Lerma, O. and Lasserre, J. B. Discrete-time Markov control processes: Basic optimality criteria. Springer, 1996. Huber, P. J. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1), 1964. Jaakkola, T., Jordan, M. I., and Singh, S. P. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1994. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. ar Xiv preprint, ar Xiv:1907.05388, 2019. Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015. Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin Qlearning: Controlling the Estimation Bias of Q-learning. In International Conference on Learning Representations, 2020. Lee, D. and He, N. A Unified Switching System Perspective and Convergence Analysis of Q-Learning Algorithms. In Annual Conference on Neural Information Processing, 2020. Lim, H.-D., Kim, D. W., and Lee, D. Regularized Qlearning. ar Xiv preprint, 2202.05404, 2022. Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. Distributionally robust Q-learning. In International Conference on Machine Learning, 2022. Lu, F. and Meyn, S. Convex q learning in a stochastic environment: Extended version. Ar Xiv preprint, abs/2309.05105, 2023. Lu, F., Mehta, P., Meyn, S., and Neu, G. Sufficient exploration for convex Q-learning. ar Xiv preprint, 2210.09409, 2022. Mehta, P. G. and Meyn, S. P. Convex Q-learning, part 1: Deterministic optimal control. ar Xiv preprint, 2008.03559, 2020. Melo, F. S., Meyn, S. P., and Ribeiro, M. I. An analysis of reinforcement learning with function approximation. In International Conference on Machine Learning, 2008. Meyn, S. Control Systems and Reinforcement Learning. Cambridge University Press, 2022. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015. Neufeld, A. and Sester, J. Robust Q-learning algorithm for markov decision processes under wasserstein uncertainty. ar Xiv preprint, 2210.00898, 2022. Regularized Q-learning through Robust Averaging Nguyen, V. A., Abadeh, S. S., Filipovi c, D., and Kuhn, D. Mean-Covariance Robust Risk Measurement. Ar Xiv preprint, abs/2112.09959, 2021. Qian, J., Fruit, R., Pirotta, M., and Lazaric, A. Exploration bonus for regret minimization in discrete and continuous average reward mdps. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Qu, G. and Wierman, A. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In Conference on Learning Theory, 2020. Singh, S. P., Jaakkola, T. S., Littman, M. L., and Szepesv ari, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3), 2000. Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(84), 2009. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018. Szepesv ari, C. The asymptotic convergence-rate of Qlearning. 
In Advances in Neural Information Processing Systems, 1997. Szita, I. and L orincz, A. The many faces of optimism: a unifying approach. In International Conference on Machine Learning, 2008. Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Connectionist Models Summer School, 1993. Tsitsiklis, J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 1994. van Hasselt, H. Double Q-learning. In Annual Conference on Neural Information Processing Systems, 2010. Vieillard, N., Pietquin, O., and Geist, M. Munchausen Reinforcement Learning. In nnual Conference on Neural Information Processing Systems, 2020. Wainwright, M. J. Variance-reduced Q-learning is minimax optimal. ar Xiv preprint, 1906.04697, 2019. Watkins, C. Learning from delayed rewards. Ph D thesis, King s College, Cambridge, 1989. Watkins, C. J. C. H. and Dayan, P. Q-learning. Machine Learning, 8(3), 1992. Weng, W., Gupta, H., He, N., Ying, L., and Srikant, R. The Mean-Squared Error of Double Q-Learning. In Annual Conference on Neural Information Processing Systems, 2020. Regularized Q-learning through Robust Averaging A.1. Asymptotic Convergence The proof of Theorem 3.1 is based on the following technical stochastic approximation result. We denote by w a weighted maximum norm with weight w = (w1, . . . , wd), wi > 0. If x Rd, then x w = maxi |xi| Lemma A.1. (Singh et al., 2000, Lemma 1) Consider a stochastic process {(αn, n, Fn)}n 0, where αn, n, Fn : X R satisfy the equations n+1(x) = (1 αn(x)) n(x) + αn(x)Fn(x), x X, n = 0, 1, . . . (12) Let Pn be a sequence of increasing σ-fields such that α0 and 0 are P0-measurable and αn, n and Fn 1 are Pnmeasurable for n = 1, 2, . . . . Assume the following hold (i) the set X is finite; (ii) αn(x) (0, 1], P n=1 αn(x) = , P n=1 α2 n(x) < almost surely; (iii) E[Fn( )|Pn] w κ n w + cn, where κ [0, 1) and cn converges to zero almost surely; (iv) Var(Fn(x)|Pn) K(1 + n w)2, where K is some constant. Then, n converges to zero almost surely as n . Proof of Theorem 3.1. The proof builds up on the convergence results of SARSA (Singh et al., 2000), Double Q-learning (van Hasselt, 2010) and uses Lemma A.1 as a key ingredient. For the convenience of notation, we carry out the proof in the Q-function notation. That is, in the tabular setting, where no function approximation is applied, by invoking Lemma 3.1, the proposed 2RA Q-learning (7) is expressed as Q(i) n+1(Sn, An) = Q(i) n (Sn, An) + αnβ(i) n r(Sn, An) + γ(max a A b QN,n(Sn+1, a ) ρn) Q(i) n (Sn, An) , (13) where b QN,n(s, a) = 1 N PN i=1 Q(i) n (s, a) for s S, a A. In the following, we fix an arbitrary index i {1, . . . , N} and with regard to Lemma A.1, we define Pn as the σ-field generated by {Sn, An, αn, . . . , S0, A0, α0, Q(1) 0 , . . . , Q(N) 0 }, X = S A, n = Q(i) n Q and Fn(Sn, An) = r(Sn, An) + γ(maxa A b QN,n(Sn+1, a ) ρn) Q (Sn, An). Then, 2RA Q-learning (7) can be expressed as an instance of (12). To apply Lemma A.1, we need to ensure its assumptions are satisfied. Assumption (i) and (ii) clearly hold. To show Assumption (iii), we note that the term Fn can be alternatively expressed as Fn(Sn, An) = F Q(i) n (Sn, An) + γ max a A b QN,n(Sn+1, a ) max a A Q(i) n (Sn+1, a ) ρn with F Q(i) n (Sn, An) = r(Sn, An) + γ max a A Q(i) n (Sn+1, a ) Q (Sn, An), (15) where the term F Q(i) n (Sn, An) corresponds to Watkins Q-learning for Q(i) n . 
Therefore, it is well-known (Jaakkola et al., 1994, Theorem 2) that E[F Q(i) n ( )|Pn] w γ n , which via (14) implies that E[Fn( )|Pn] w γ n + cn, where cn = γE max a A b QN,n(Sn+1, a ) max a A Q(i) n (Sn+1, a ) ρn | Pn It remains to show that cn converges to zero almost surely. Recall that by assumption limn ρn = 0. We define δ(i) n (s, a) = Q(i) n (s, a) b QN,n(s, a) Regularized Q-learning through Robust Averaging and will show that limn δ(i) n ( , ) = 0 almost surely. The reverse triangle inequality applied to the -norm implies that P-almost surely max a A Q(i) n (Sn+1, a ) max a A b QN,n(Sn+1, a ) = 0 and hence that cn converges to zero almost surely. We distinguish two cases. First, we consider an update on component i, then by (13) δ(i) n+1(s, a) = Q(i) n+1(s, a) b QN,n+1(s, a) = Q(i) n (s, a) + αn r(s, a) + γ max a A b QN,n(s , a ) γ ρn Q(i) n (s, a) b QN,n(s, a) Q(i) n+1(s, a) Q(i) n (s, a) = δ(i) n (s, a) + αn αn r(s, a) + γ max a A b QN,n(s , a ) γ ρn Q(i) n (s, a) , where the second equality follows from the decomposition 1 N PN j=1 Q(j) n+1(s, a) = 1 N (P j =i Q(j) n + Q(i) n+1). The third equality then uses our proposed Q-learning update formula (13) given as Q(i) n+1(s, a) = Q(i) n (s, a) + αn(r(s, a) + γ maxa A b QN,n(s , a ) γ ρn Q(i) n (s, a)). On the other hand, if the update is performed on component j = i, then δ(i) n+1(s, a) = Q(i) n+1(s, a) b QN,n+1(s, a) = Q(i) n (s, a) b QN,n(s, a) 1 N (Q(j) n+1(s, a) Q(j) n (s, a)) = δ(i) n (s, a) αn N (r(s, a) + γ max a A b QN,n(s , a ) γ ρn Q(j) n (s, a)). Hence, in total, we get E[δ(i) n+1(s, a)|Pn] N E δ(i) n (s, a) + N 1 N αn(r(s, a) + γ max a A b QN,n(s , a ) γ ρn Q(i) n (s, a))|Pn j =i E δ(i) n (s, a) αn N (r(s, a) + γ max a A b QN,n(s , a ) γ ρn Q(j) n (s, a))|Pn = E h (1 αn N )δ(i) n (s, a)|Pn i N )E h δ(i) n (s, a)|Pn 1 i , where the second equality follows from the observation that P j =i Q(j) n = N b QN,n Q(i) n . Recall that δ(i) n (s, a) = E[δ(i) n (s, a)|Pn 1] for any n. Hence, we have derived the update rule δ(i) n+1(s, a) = (1 αn N )δ(i) n (s, a), (17) which directly implies that limn δ(i) n (s, a) = 0 almost surely for all (s, a) S A. Since S and A are finite sets this implies that limn δ(i) n = 0 almost surely as desired. Hence, limn cn = 0 almost surely, which ensures Assumption (iii). We finally show that Assumption (iv) holds. Again we use the decomposition (14). Since the reward r is assumed to be bounded, it is known (Singh et al., 2000) that Var(F Q(i) n (x)|Pn) K(1 + n w)2, (18) Regularized Q-learning through Robust Averaging where again n = Q(i) n Q . We next show that there exists a constant K2 such that Var max a A b QN,n(Sn+1, a ) max a A Q(i) n (Sn+1, a )|Pn K2(1 + n w)2. (19) By the Cauchy-Schwarz inequality (18) and (19) imply Assumption (iv). To establish (19), we show that the Q-functions Q(i) n are bounded for any n and any i. Such results are well-known for classical Q-learning, see (Gosavi, 2006). For our modified Q-learning, we can show it via a simple contradiction argument. Suppose for the sake of contradiction there is an index i such that limn Q(i) n = . We first consider the setting, where there is an s S and a A such that limn Q(i) n (s , a ) = . So there exists n N such that for all n n we have Q(j) n (s , a) Q(i) n (s , a ) for all a A and for all j = 1, . . . , N. 
Therefore, max a A b QN,n(s , a) Q(i) n (s , a ) max a A Q(i) n (s , a), which implies Q(i) n+1(s, a) = Q(i) n (s, a) + αn(r(s, a) + γ max a A b QN,n(s , a ) Q(i) n (s, a)) (20a) Q(i) n (s, a) + αn(r(s, a) + γ max a A Q(i) n (s , a ) Q(i) n (s, a)), (20b) for any s S and a A. When considering Sn = s, An = a and Sn+1 = s , we see that the upper bound (20b) is Watkins Q-learning which leads to a bounded Q-function, so Q(i) n+1(s, a) needs to be bounded for all s, a which is a contradiction. The case where there is an s S and a A such that limn Q(i) n (s , a ) = follows analogously. Consequently, the Q-functions Q(i) n for any n and any i are bounded and (19) indeed holds, which implies Assumption (iv). A.2. Estimation Bias Proof of Theorem 3.2 (Estimation Bias). To prove Assertion (i), note that according to the definition of the robust estimator (8) for any E[θ(i) n ] Bρ(bθN,n) it must hold that Eρ(x, s , bθN,n) γϕ(x) max a A ϕ(s , a ) E[θ(i) n ] = γϕ(x) max a A E[ϕ(s , a ) θ(i) n ] x X, s S, which implies E[Eρ(x, s , bθN,n)] γϕ(x) max a A E[ϕ(s , a ) θ(i) n ] x X, s S. (21) A lower bound can be derived as for all x X, s S E[Eρ(x, s , bθN,n)] = γϕ(x)E max a A n ϕ(s , a ) bθN,n ρ ϕ(s , a ) 2 o (22a) γϕ(x)E max a A ϕ(s , a ) bθN,n ργϕ(x) max a A ϕ(s , a ) 2 (22b) γϕ(x) max a A E h ϕ(s , a ) bθN,n i ργϕ(x) max a A ϕ(s , a ) 2 (22c) = γϕ(x) max a A E h ϕ(s , a ) θ(i) n i ργϕ(x) max a A ϕ(s , a ) 2, (22d) where the first equality is due to Lemma 3.1. The first inequality follows from splitting the maximization or reverse triangle inequality. The second inequality follows from a Jensen step as explained in (6). The second equality uses the fact that all θ(i) n follow the same distribution. Combining (21) and (22) implies Assertion (i). To prove Assertion (ii), we first claim that for any n N lim N bθN,n = E[θ(i) n ], (23) where the convergence is in mean square, i.e., lim N E[ bθN,n E[θ(i) n ] 2] = 0. This in particular implies that bθN,n converges to E[θ(i)] in distribution. Our second claim states that for any x X and s S, the function Eρ(x, s , θ ) is Regularized Q-learning through Robust Averaging uniformly continuous in θ . Equipped with these two claims, lim N E[Eρ(x, s , bθN,n)] = E[Eρ(x, s , E[θ(i) n ])] (24a) = Eρ(x, s , E[θ(i) n ]) (24b) = γϕ(x) max a A n E[ϕ(s , a ) θ(i) n ] ρ ϕ(s , a ) 2 o , (24c) where the first equality holds due to the Portmanteau Theorem (Billingsley, 1999, Theorem 2.1), since bθN,n converges to E[θ(i) n ] in distribution and the function Eρ(x, s , θ ) is uniformly continuous in θ . The second equality is true as the function Eρ is deterministic, and the last equality is implied by Lemma 3.1. It, therefore, remains to show the two claims. To show (23), we recall that by definition i=1 θ(i) n . By symmetry of the 2RA Q-learning (7) the variables θ(i) n for all i {1, . . . , N} have the same distribution, but are correlated. We can exploit the weak correlations via the following result. Lemma A.2. (Brockwell, 1991, Theorem 7.1.1) Let {Y (i)} be a stationary process with mean µ and autocovariance function γ( ) defined as γ(N) = cov(Y (i+N), Y (i)) for any i N. Then, i=0 Y (i) µ = 0 if lim N γ(N) = 0. The 2RA Q-learning (7) is given as θ(i) n+1 =(1 αnβ(i) n A1(Xn))θ(i) n +γαnβ(i) n ϕ(Xn) max a A n ϕ(Sn+1, a ) bθN,n ρ ϕ(Sn+1, a ) 2 o + αnβ(i) n b(Xn), i = 1, . . . , N. (25) We claim that for any s, t {1, 2, . . . , N} and for any n N0 cov(θ(s) n , θ(t) n ) C O(1/N), (26) where C Rd d is some constant matrix. Then, according to Lemma A.2, we obtain (23). 
It, therefore, remains to show (26), which we do by induction over n. The initial condition θ(i) 0 = θ0 is assumed to be some deterministic value for all i, then by using the update rule (25) and recalling that bθN,0 = θ0 cov(θ(s) 1 , θ(t) 1 ) = cov(v + β(s) 1 w, v + β(t) 1 w) = cov(β(s) 1 w, β(t) 1 w) = E[β(s) 1 β(t) 1 ww ] E[β(s) 1 w]E[β(t) 1 w] = E[β(s) 1 β(t) 1 ]E[ww ] E[β(s) 1 ]E[w]E[β(s) 1 ]E[w] N 2 E[w]E[w] where v = θ0, w = αn A1(X0)θ0+γϕ(X0) maxa A{ϕ(S1, a ) θ0 ρ ϕ(S1, a ) 2}+b(X0) and we use the fact that E[β(s) 1 β(t) 1 ] = 0 and E[β(s) 1 ] = 1 N . Moreover, we use C = E[w]E[w] . To proceed with the induction step, assume that for any k, ℓ {1, 2, . . . , N} we have cov(θ(k) n , θ(ℓ) n ) = C O(1/N) and show that for any k, ℓ {1, 2, . . . , N} cov(θ(k) n+1, θ(ℓ) n+1) C O(1/N). (27) Regularized Q-learning through Robust Averaging Applying the update rule (25) leads to cov(θ(k) n+1, θ(ℓ) n+1) = cov (1 αnβ(k) n A1(Xn))θ(k) n + β(k) n αn γϕ(Xn)Mn + b(Xn) , (1 αnβ(ℓ) n A1(Xn))θ(ℓ) n + β(ℓ) n αn γϕ(Xn)Mn + b(Xn) = cov (1 αnβ(k) n A1(Xn))θ(k) n , (1 αnβ(ℓ) n A1(Xn))θ(ℓ) n + cov (1 αnβ(k) n A1(Xn))θ(k) n , β(ℓ) n αn (γϕ(Xn)Mn + b(Xn)) + cov (1 αnβ(ℓ) n A1(Xn))θ(ℓ) n , β(k) n αn (γϕ(Xn)Mn + b(Xn)) + cov β(k) n αn (γϕ(Xn)Mn + b(Xn)) , β(ℓ) n αn (γϕ(Xn)Mn + b(Xn)) , where Mn = maxa A{ϕ(Sn+1, a ) bθN,n ρ ϕ(Sn+1, a ) 2}. We can show that each covariance term from above is of the order O(1/N). For this, we recall that the properties of βn and its independence with respect to Xn ensure that cov(β(ℓ) n , β(k) n ) = O(1/N) and cov(β(ℓ) n , f(Xn)) = 0 (29a) for any bounded function f : X R. Moreover, for any k = ℓ cov(β(ℓ) n f(Xn), β(k) n f(Xn)) = E[β(ℓ) n β(k) n f(Xn)f(Xn) ] E[β(k) n f(Xn)]E[β(ℓ) n f(Xn) ] N 2 E[f(Xn)] 2 2 O(1/N). (29b) For any other bounded function g : X R, we get cov(β(ℓ) n f(Xn), g(Xn)) = E[β(ℓ) n f(Xn)g(Xn)] E[β(ℓ) n f(Xn)]E[g(Xn)] N E[f(Xn)g(Xn)] 1 N E[f(Xn)]E[g(Xn)] = O(1/N). (29c) Similarly, we obtain for any s, t {1, 2, . . . , N} cov(θ(k) n , β(ℓ) n f(Xn)) = C O(1/N), (29d) cov(θ(k) n , g(Xn)bθN,n) = 1 ℓ=1 cov(θ(k) n , g(Xn)θ(ℓ) n ), whereby by using the results of Bohrnstedt & Goldberger (1969), we can show that cov(θ(k) n , g(Xn)θ(ℓ) n ) = C O(1/N). Therefore, cov(θ(k) n , g(Xn)bθN,n) = C O(1/N). (29e) Finally, equipped with (27) and (29) by exploiting the results of Bohrnstedt & Goldberger (1969) and continuing with (28), we show cov(θ(k) n+1, θ(ℓ) n+1) = C O(1/N), which completes the induction step. We, therefore, have shown (26) and hence (23). Regarding our second claim, we show that for any x X and s S, the function Eρ(x, s , θ) is uniformly continuous in θ. For any fixed x X and s S the the function Eρ(x, s , θ) can be expressed as Eρ(x, s , θ) = γϕ(x) max a A ϕ(s , a ) θ ρ ϕ(s , a ) 2 = γϕ(x)φ(θ, s ), Regularized Q-learning through Robust Averaging where φ(θ, s ) = maxa A ϕ(s , a ) θ ρ ϕ(s , a ) 2 . To show that Eρ(x, s , θ) is uniformly continuous in θ, it remains to show that φ is uniformly continuous in θ. Clearly, φ is Lipschtiz continuous in θ, as max{ } max{ } max{ }, which is the reversed triangle inequality for the -norm. This implies the desired uniform continuity and hence hence Eρ(x, s , θ) is uniformly continuous in θ. Having shown both claims completes the proof. Proof of Corollay 3.1. Recall that according to Theorem 3.1 for any i {1, . . . , N} we have limn θ(i) n = θ almost surely and accordingly limn bθN,n = θ almost surely. By the definition of Eρn in (8) γϕ(x) max a A E[ϕ(s , a ) θ ] = E[E0(x, s , θ )] x X, s S. 
Proof of Corollary 3.1. Recall that according to Theorem 3.1, for any $i\in\{1,\ldots,N\}$ we have $\lim_{n\to\infty}\theta^{(i)}_n = \theta^\star$ almost surely and accordingly $\lim_{n\to\infty}\widehat\theta_{N,n} = \theta^\star$ almost surely. By the definition of $\mathcal{E}_{\rho_n}$ in (8),

$\gamma\phi(x)\max_{a'\in\mathcal{A}}\mathbb{E}[\phi(s',a')^\top\theta^\star] = \mathbb{E}[\mathcal{E}_0(x,s',\theta^\star)] \quad \forall x\in\mathcal{X},\ s'\in\mathcal{S}.$  (30)

Therefore, for all $x\in\mathcal{X}$, $s'\in\mathcal{S}$,

$\lim_{n\to\infty}\mathbb{E}\big[\mathcal{E}_{\rho_n}(x,s',\widehat\theta_{N,n})\big] = \mathbb{E}\big[\lim_{n\to\infty}\mathcal{E}_{\rho_n}(x,s',\widehat\theta_{N,n})\big]$  (31a)
$= \mathbb{E}\big[\lim_{n\to\infty}\gamma\phi(x)\max_{a'\in\mathcal{A}}\big\{\phi(s',a')^\top\widehat\theta_{N,n} - \rho_n\|\phi(s',a')\|_2\big\}\big]$  (31b)
$= \mathbb{E}\big[\gamma\phi(x)\max_{a'\in\mathcal{A}}\phi(s',a')^\top\theta^\star\big]$  (31c)
$= \mathbb{E}[\mathcal{E}_0(x,s',\theta^\star)]$  (31d)
$= \gamma\phi(x)\max_{a'\in\mathcal{A}}\mathbb{E}[\phi(s',a')^\top\theta^\star]$  (31e)
$= \gamma\phi(x)\max_{a'\in\mathcal{A}}\lim_{n\to\infty}\mathbb{E}[\phi(s',a')^\top\theta^{(i)}_n]$  (31f)
$= \lim_{n\to\infty}\gamma\phi(x)\max_{a'\in\mathcal{A}}\mathbb{E}[\phi(s',a')^\top\theta^{(i)}_n],$  (31g)

where the equality (31a) follows from the bounded convergence theorem. The equality (31b) is due to Lemma 3.1. In (31c), we use the fact that the limit and maximum can be interchanged as the maximum is over a finite set, that $\widehat\theta_{N,n}$ converges to $\theta^\star$ due to Theorem 3.1, and that $\rho_n\to 0$. The step (31d) uses the definition of $\mathcal{E}_0$. The equality (31e) is due to (30), and (31f) uses that $\theta^{(i)}_n$ converges to $\theta^\star$ due to Theorem 3.1 together with the bounded convergence theorem. Finally, (31g) uses again the fact that the limit and maximum can be interchanged as the maximum is over a finite set.

A.3. Asymptotic Mean-Squared Error

Proof of Theorem 3.3 (AMSE for 2RA Q-learning). Our proof is inspired by the recent treatment of Double Q-learning (Weng et al., 2020) and by Devraj & Meyn (2017), who analyze the asymptotic properties of Q-learning. The key idea is to recall that, from the proof of Theorem 3.1, we know that $\theta^{(i)}_n - \theta^\star \to 0$ as $n\to\infty$ for any $i\in\{1,2,\ldots,N\}$. Hence, we can express the AMSE of $\theta^{(i)}$ alternatively as

$\mathrm{AMSE}(\theta^{(i)}) = \lim_{n\to\infty} n\,\mathbb{E}\big[(\theta^{(i)}_n - \theta^\star)^\top(\theta^{(i)}_n - \theta^\star)\big] = \mathrm{tr}\Big(\lim_{n\to\infty} n\,\mathbb{E}\big[(\theta^{(i)}_n - \theta^\star)(\theta^{(i)}_n - \theta^\star)^\top\big]\Big) = \mathrm{tr}(V),$

where the matrix $V = \lim_{n\to\infty} n\,\mathbb{E}[(\theta^{(i)}_n - \theta^\star)(\theta^{(i)}_n - \theta^\star)^\top]$ is called the asymptotic covariance. It has been shown in Devraj & Meyn (2017) that the asymptotic covariance of Watkins Q-learning (5) can be studied via the linearized counterpart, given as

$\theta^{\mathrm{QL}}_{n+1} = \theta^{\mathrm{QL}}_n + \alpha^{\mathrm{QL}}_n\phi(X_n)\big(r(X_n) + \gamma\phi(S_{n+1},\pi^\star(S_{n+1}))^\top\theta^{\mathrm{QL}}_n - \phi(X_n)^\top\theta^{\mathrm{QL}}_n\big),$

where $\pi^\star$ is the optimal policy based on $\theta^\star$. Using similar arguments from Devraj & Meyn (2017) and Weng et al. (2020), we can show that the asymptotic variance of 2RA Q-learning (7), which is defined as $\lim_{n\to\infty} n\,\mathbb{E}[(\widehat\theta_{N,n} - \theta^\star)(\widehat\theta_{N,n} - \theta^\star)^\top]$, can be studied by considering the linearized recursion with $\rho_n = 0$, given as

$\theta^{(i)}_{n+1} = \theta^{(i)}_n + \alpha_n\beta^{(i)}_n\big(b(X_n) - A_1(X_n)\theta^{(i)}_n + \underbrace{\gamma\phi(X_n)\phi(S_{n+1},\pi^\star(S_{n+1}))^\top}_{=A_2(Z_n)}\,\widehat\theta_{N,n}\big), \quad i = 1,\ldots,N,$  (32)

where $Z_n = (X_n, S_{n+1})$, $b(X_n) = \phi(X_n)r(X_n)$ and $A_1(X_n) = \phi(X_n)\phi(X_n)^\top$. We formally justify this linearization argument in Lemma B.1. Using the more compact notation $\boldsymbol\theta_n = (\theta^{(1)\top}_n, \ldots, \theta^{(N)\top}_n)^\top$ and $\beta_n = (\beta^{(1)}_n, \ldots, \beta^{(N)}_n)$, and choosing the learning rate $\alpha_n = \alpha^{\mathrm{QL}}_n N$, the update equation (32) can be expressed in a standard form as

$\boldsymbol\theta_{n+1} = \boldsymbol\theta_n + \alpha^{\mathrm{QL}}_n\big(\mathbf{b}(X_n) + \mathbf{A}_2(Z_n)\boldsymbol\theta_n - \mathbf{A}_1(X_n)\boldsymbol\theta_n\big),$  (33)

where

$\mathbf{A}_1(X_n) = N\,\mathrm{diag}\big(\beta^{(1)}_n A_1(X_n), \ldots, \beta^{(N)}_n A_1(X_n)\big), \quad
\mathbf{A}_2(Z_n) = \begin{pmatrix} \beta^{(1)}_n A_2(Z_n) & \cdots & \beta^{(1)}_n A_2(Z_n) \\ \beta^{(2)}_n A_2(Z_n) & \cdots & \beta^{(2)}_n A_2(Z_n) \\ \vdots & & \vdots \\ \beta^{(N)}_n A_2(Z_n) & \cdots & \beta^{(N)}_n A_2(Z_n) \end{pmatrix}, \quad
\mathbf{b}(X_n) = \begin{pmatrix} N\beta^{(1)}_n b(X_n) \\ \vdots \\ N\beta^{(N)}_n b(X_n) \end{pmatrix}.$  (34)

Let $\mu$ denote the invariant distribution of the Markov chain $\{X_n\}_{n\in\mathbb{N}}$ and let $D$ be a diagonal matrix with entries $D_{ii} = \mu_i$ for all $i = 1,\ldots,|\mathcal{X}|$. Then, when considering $X_\infty$ as a random variable under the stationary distribution, we introduce

$\bar A_1 = \mathbb{E}[A_1(X_\infty)] = \Phi D\Phi^\top, \quad \bar A_2 = \mathbb{E}[A_2(Z_\infty)] = \gamma\Phi D P S_{\pi^\star}\Phi^\top, \quad \bar{\mathbf{A}}_1 = \mathbb{E}[\mathbf{A}_1(X_\infty)], \quad \bar{\mathbf{A}}_2 = \mathbb{E}[\mathbf{A}_2(Z_\infty)], \quad \bar{\mathbf{b}} = \mathbb{E}[\mathbf{b}(X_\infty)],$  (35)

where $S_{\pi^\star}$ is the action selection matrix of the optimal policy $\pi^\star$ such that $S_{\pi^\star}(s, (s,\pi^\star(s))) = 1$ for $s\in\mathcal{S}$ and $\Phi$ is defined in (4). With these variables at hand, we define $\bar{\mathbf{A}} = \bar{\mathbf{A}}_2 - \bar{\mathbf{A}}_1$, i.e.,

$\bar{\mathbf{A}} = \begin{pmatrix} \frac{1}{N}\bar A_2 - \bar A_1 & \frac{1}{N}\bar A_2 & \cdots & \frac{1}{N}\bar A_2 \\ \frac{1}{N}\bar A_2 & \frac{1}{N}\bar A_2 - \bar A_1 & \cdots & \frac{1}{N}\bar A_2 \\ \vdots & & \ddots & \vdots \\ \frac{1}{N}\bar A_2 & \frac{1}{N}\bar A_2 & \cdots & \frac{1}{N}\bar A_2 - \bar A_1 \end{pmatrix}.$
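The block structure of $\bar{\mathbf{A}}$ displayed above can be assembled with Kronecker products; below is a minimal numpy sketch in which $\bar A_1$ and $\bar A_2$ are placeholder $d\times d$ matrices rather than the actual MDP quantities.

```python
import numpy as np

def stacked_mean_matrix(A1_bar, A2_bar, N):
    """Builds bar(bold A) = bar(bold A2) - bar(bold A1): every block equals
    A2_bar / N, and each diagonal block additionally subtracts A1_bar."""
    return np.kron(np.ones((N, N)), A2_bar / N) - np.kron(np.eye(N), A1_bar)

# tiny example with placeholder 2x2 matrices
d, N = 2, 3
rng = np.random.default_rng(2)
A1_bar, A2_bar = rng.normal(size=(d, d)), rng.normal(size=(d, d))
print(stacked_mean_matrix(A1_bar, A2_bar, N).shape)  # (N*d, N*d)
```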
Moreover, we introduce

$\Sigma_{\mathbf{b}} = \mathbb{E}[\mathbf{b}(X_0)\mathbf{b}(X_0)^\top] + \sum_{n\ge 1}\mathbb{E}\big[\mathbf{b}(X_n)\mathbf{b}(X_0)^\top + \mathbf{b}(X_0)\mathbf{b}(X_n)^\top\big].$

According to the definition of $\mathbf{b}$, we get the block-diagonal structure

$\mathbb{E}[\mathbf{b}(X_0)\mathbf{b}(X_0)^\top] = N\,\mathrm{diag}\big(\mathbb{E}[b(X_0)b(X_0)^\top], \ldots, \mathbb{E}[b(X_0)b(X_0)^\top]\big),$

where the expectation is in steady-state. Moreover,

$\mathbb{E}[\mathbf{b}(X_n)\mathbf{b}(X_0)^\top] = \begin{pmatrix} \mathbb{E}[b(X_n)b(X_0)^\top] & \cdots & \mathbb{E}[b(X_n)b(X_0)^\top] \\ \vdots & & \vdots \\ \mathbb{E}[b(X_n)b(X_0)^\top] & \cdots & \mathbb{E}[b(X_n)b(X_0)^\top] \end{pmatrix},$

which eventually leads to a matrix of the form

$\Sigma_{\mathbf{b}} = \begin{pmatrix} N\mathbb{E}[b(X_0)b(X_0)^\top] + 2B_2 & 2B_2 & \cdots & 2B_2 \\ 2B_2 & N\mathbb{E}[b(X_0)b(X_0)^\top] + 2B_2 & \cdots & 2B_2 \\ \vdots & & \ddots & \vdots \\ 2B_2 & 2B_2 & \cdots & N\mathbb{E}[b(X_0)b(X_0)^\top] + 2B_2 \end{pmatrix},$  (36)

where we introduce the variables $B_2 = \frac{1}{2}\sum_{n\ge 1}\mathbb{E}[b(X_n)b(X_0)^\top + b(X_0)b(X_n)^\top]$ and $B_1 = \mathbb{E}[b(X_0)b(X_0)^\top] + B_2$. We define

$g_0 = \inf\big\{g\ge 0 : g\,\max\{\lambda_{\max}(\bar A), \lambda_{\max}(\bar{\mathbf{A}})\} < -\tfrac{1}{2}\big\},$

where $\lambda_{\max}(\cdot)$ denotes the largest real part of the eigenvalues, and note that $g_0$ exists, since both $\bar A = \bar A_2 - \bar A_1$ and $\bar{\mathbf{A}}$ are Hurwitz, as the corresponding Q-learning variants converge (Theorem 3.1). As a result, for any $g > g_0$ the matrix $\frac{1}{2}I + g\bar{\mathbf{A}}$ is Hurwitz and hence the Lyapunov equation

$\Big(\tfrac{1}{2}I + g\bar{\mathbf{A}}\Big)\Sigma + \Sigma\Big(\tfrac{1}{2}I + g\bar{\mathbf{A}}\Big)^\top + g^2\Sigma_{\mathbf{b}} = 0$  (37)

has a unique solution, which describes the AMSE of our proposed method (see Chen et al. (2020) and Weng et al. (2020, Theorem 1)), i.e., $\Sigma = \lim_{n\to\infty} n\,\mathbb{E}[(\boldsymbol\theta_n - \boldsymbol\theta^\star)(\boldsymbol\theta_n - \boldsymbol\theta^\star)^\top]$. Due to the symmetry of the proposed scheme, the matrix $\Sigma$ consists of diagonal blocks equal to $V = \lim_{n\to\infty} n\,\mathbb{E}[(\theta^{(i)}_n - \theta^\star)(\theta^{(i)}_n - \theta^\star)^\top]$ and off-diagonal blocks $C = \lim_{n\to\infty} n\,\mathbb{E}[(\theta^{(i)}_n - \theta^\star)(\theta^{(j)}_n - \theta^\star)^\top]$ for $i\ne j$. Therefore, summing the first row of matrices in (37) and using (36) leads to

$V + (N-1)C + g\big(V + (N-1)C\big)\big(\bar A_2 - \bar A_1\big)^\top + g\big(\bar A_2 - \bar A_1\big)\big(V + (N-1)C\big) + g^2 N(B_1 + B_2) = 0.$  (38)

Due to the definition of $g_0$, the matrix $\frac{1}{2}I + g\bar A$ is Hurwitz and hence the Lyapunov equation

$\Big(\tfrac{1}{2}I + g(\bar A_2 - \bar A_1)\Big)\Sigma^{\mathrm{QL}} + \Sigma^{\mathrm{QL}}\Big(\tfrac{1}{2}I + g(\bar A_2 - \bar A_1)\Big)^\top + g^2(B_1 + B_2) = 0$  (39)

has a unique solution, which is denoted by $\Sigma^{\mathrm{QL}}$ and describes the AMSE of Watkins Q-learning, i.e., $\Sigma^{\mathrm{QL}} = \lim_{n\to\infty} n\,\mathbb{E}[(\theta^{\mathrm{QL}}_n - \theta^\star)(\theta^{\mathrm{QL}}_n - \theta^\star)^\top]$, see Weng et al. (2020). By comparing (39) with (38) and recalling that the solutions are unique, we obtain $\Sigma^{\mathrm{QL}} = \frac{V + (N-1)C}{N}$. Finally,

$\mathrm{AMSE}(\widehat\theta_N) = \lim_{n\to\infty} n\,\mathbb{E}\big[\|\widehat\theta_{N,n} - \theta^\star\|_2^2\big]$  (40a)
$= \frac{1}{N^2}\lim_{n\to\infty} n\,\mathbb{E}\Big[\Big\|\sum_{i=1}^{N}(\theta^{(i)}_n - \theta^\star)\Big\|_2^2\Big]$  (40b)
$= \frac{1}{N^2}\big(N\,\mathrm{tr}(V) + N(N-1)\,\mathrm{tr}(C)\big)$  (40c)
$= \mathrm{tr}\Big(\frac{V + (N-1)C}{N}\Big)$  (40d)
$= \mathrm{tr}(\Sigma^{\mathrm{QL}})$  (40e)
$= \mathrm{AMSE}(\theta^{\mathrm{QL}}).$  (40f)
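Numerically, Lyapunov equations of the form (37) and (39) can be solved with standard routines; the following minimal scipy sketch uses a placeholder Hurwitz drift matrix and a placeholder noise covariance in place of $\bar{\mathbf{A}}$ and $\Sigma_{\mathbf{b}}$, and reports $\mathrm{tr}(\Sigma)$ as the corresponding AMSE value.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(3)
dim, g = 4, 1.0

# Placeholder stable drift matrix and noise covariance (stand-ins for the
# quantities built in (35)-(36); not the actual MDP quantities).
M = rng.normal(size=(dim, dim))
A_bar = -(M @ M.T) - np.eye(dim)          # Hurwitz by construction
Sigma_b = np.eye(dim)

# Solve (1/2 I + g A_bar) Sigma + Sigma (1/2 I + g A_bar)^T + g^2 Sigma_b = 0
lhs = 0.5 * np.eye(dim) + g * A_bar
Sigma = solve_continuous_lyapunov(lhs, -g**2 * Sigma_b)
print(np.trace(Sigma))                    # AMSE = tr(Sigma)
```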
B. Linearization results

The proof of Theorem 3.3 uses a key result stating that the asymptotic mean-squared error of the proposed 2RA Q-learning method (7) can alternatively be characterized by the linearized recursion (32). This result is a modification of the analysis for Double Q-learning derived in (Weng et al., 2020, Appendix A). To derive a formal statement, we introduce the key tool to analyze the linearization (32), which is the ODE counterpart of the 2RA Q-learning scheme (7), given as

$\dot\theta^{(i)}(t) = \lim_{n\to\infty} g\,\mathbb{E}\Big[\phi(X_n)r(X_n) - \phi(X_n)\phi(X_n)^\top\theta^{(i)}(t) + \gamma\phi(X_n)\max_{a'\in\mathcal{A}}\phi(S_{n+1},a')^\top\widehat\theta_N(t)\Big],$  (41)

where $\widehat\theta_N(t) = \frac{1}{N}\sum_{i=1}^{N}\theta^{(i)}(t)$ and where we have used $\lim_{n\to\infty}\rho_n = 0$. With the help of the ODE (41), we can justify why working with the linearized system (32) allows us to quantify the AMSE for the 2RA Q-learning (7). This justification requires the following assumption. We use the vectorized notation $\boldsymbol\theta = (\theta^{(1)\top}, \ldots, \theta^{(N)\top})^\top$ and $\boldsymbol\theta^\star = (\theta^{\star\top}, \ldots, \theta^{\star\top})^\top$, where $\theta^\star\in\mathbb{R}^d$ is the optimal solution to the underlying MDP.

Assumption B.1 (Linearization). We stipulate that for any $N\in\mathbb{N}$:
(i) the process $\theta^{(i)}(t)$ described by the ODE (41) has a globally asymptotically stable equilibrium $\theta^{(i)}_\infty$ for any $i = 1,\ldots,N$;
(ii) the optimal policy $\pi^\star$ of the underlying MDP is unique;
(iii) the sequence of random variables $\{n\|\boldsymbol\theta_n - \boldsymbol\theta^\star\|_2^2,\ n\ge 1\}$ is uniformly integrable.

Sufficient conditions for Assumption B.1(i), when using linear function approximation and in the setting $N = 1$, are studied in Melo et al. (2008); Lee & He (2020). Assumption B.1(ii) is standard in many theoretical treatments of Q-learning, and Assumption B.1(iii) for $N = 1$ has been established, see Durrett (2010, Theorem 5.5.2) and Devraj & Meyn (2017).

Lemma B.1 (Linearization). Let $\{\boldsymbol\theta_n\}_{n\in\mathbb{N}}$ be a sequence generated by the 2RA Q-learning (7) and let $\{\tilde{\boldsymbol\theta}_n\}_{n\in\mathbb{N}}$ be a sequence generated by its linearized counterpart (32). Under Assumption B.1, we have

$\lim_{n\to\infty} n\,\mathbb{E}[\|\boldsymbol\theta_n - \boldsymbol\theta^\star\|_2^2] = \lim_{n\to\infty} n\,\mathbb{E}[\|\tilde{\boldsymbol\theta}_n - \boldsymbol\theta^\star\|_2^2].$

Proof. In a first step, we show that the ODE (41), for any $i = 1,\ldots,N$, has a unique globally asymptotically stable equilibrium given as $\theta^{(i)}_\infty = \theta^\star$, where $\theta^\star$ is the limit of Watkins Q-learning (Weng et al., 2020, Equation (22)). Note that by Assumption B.1(i), the ODE (41) has a unique globally asymptotically stable equilibrium that we denote as $\theta^{(i)}_\infty$ for all $i = 1,\ldots,N$. By symmetry, any permutation of it is a globally asymptotically stable equilibrium too. Hence, by uniqueness we must have $\theta^{(i)}_\infty = \theta^{(j)}_\infty$ for all $i,j\in\{1,\ldots,N\}$. Since all equilibrium points are identical, we recover the equilibrium point of the ODE describing Watkins Q-learning, i.e., $\theta^{(i)}_\infty = \theta^\star$ for all $i = 1,\ldots,N$.

Next, in order to apply (Weng et al., 2020, Theorem 3) with respect to the ODE (41), we define for any $i = 1,\ldots,N$

$w^{(i)}(\boldsymbol\theta(t)) = \lim_{n\to\infty} g\,\mathbb{E}\Big[\phi(X_n)r(X_n) - \phi(X_n)\phi(X_n)^\top\theta^{(i)}(t) + \gamma\phi(X_n)\max_{a'\in\mathcal{A}}\phi(S_{n+1},a')^\top\widehat\theta_N(t)\Big].$

With the vector notation $w(\boldsymbol\theta) = (w^{(1)}(\boldsymbol\theta)^\top, \ldots, w^{(N)}(\boldsymbol\theta)^\top)^\top$ and by plugging in the globally asymptotically stable equilibrium from above, we obtain

$w(\boldsymbol\theta^\star) = g\,\bar{\mathbf{b}} + g(\bar{\mathbf{A}}_2 - \bar{\mathbf{A}}_1)\boldsymbol\theta^\star,$  (42)

where $\bar{\mathbf{b}}$, $\bar{\mathbf{A}}_1$ and $\bar{\mathbf{A}}_2$ have been introduced in (35). Note that (42) corresponds to the ODE of the linearized 2RA Q-learning (32) at the point $\boldsymbol\theta^\star$. We aim to show that $\partial_{\boldsymbol\theta} w(\boldsymbol\theta^\star) = g(\bar{\mathbf{A}}_2 - \bar{\mathbf{A}}_1)$. This result follows from observing that there exists $\varepsilon > 0$ such that for any $\boldsymbol\theta$ with $\|\boldsymbol\theta - \boldsymbol\theta^\star\| \le \varepsilon$ it holds that $w(\boldsymbol\theta) = g\bar{\mathbf{b}} + g(\bar{\mathbf{A}}_2 - \bar{\mathbf{A}}_1)\boldsymbol\theta$. To see why this is the case, by following Weng et al. (2020), note that for the optimal policy $\pi^\star$ corresponding to $\theta^\star$ we can define

$\omega = \min_{(s,a)\in\mathcal{X}:\ a\ne\pi^\star(s)}\big\{\phi(s,\pi^\star(s))^\top\theta^\star - \phi(s,a)^\top\theta^\star\big\} > 0,$

where the strict positivity follows from the uniqueness of $\pi^\star$. Choose $\varepsilon = \frac{\omega}{3}\|\Phi\|^{-1}$ and consider any $\theta^{(i)}\in\mathbb{R}^d$ such that $\|\theta^{(i)} - \theta^\star\| \le \varepsilon$. Then, for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$ with $a\ne\pi^\star(s)$, it holds that

$\phi(s,\pi^\star(s))^\top\theta^{(i)} - \phi(s,a)^\top\theta^{(i)} \ge \phi(s,\pi^\star(s))^\top\theta^\star - \phi(s,a)^\top\theta^\star - 2\|\Phi\|\,\|\theta^{(i)} - \theta^\star\| \ge \omega - \tfrac{2}{3}\omega > 0,$

which implies $\pi^\star = \pi_\theta$, i.e., for any $\theta$ such that $\|\theta - \theta^\star\| \le \varepsilon$, the corresponding greedy policy $\pi_\theta$ is optimal. Since $\|\theta^{(i)} - \theta^\star\| \le \|\boldsymbol\theta - \boldsymbol\theta^\star\|$, it indeed holds for any $\boldsymbol\theta$ such that $\|\boldsymbol\theta - \boldsymbol\theta^\star\| \le \varepsilon$ that $w(\boldsymbol\theta) = g\bar{\mathbf{b}} + g(\bar{\mathbf{A}}_2 - \bar{\mathbf{A}}_1)\boldsymbol\theta$. So we have shown

$\partial_{\boldsymbol\theta} w(\boldsymbol\theta^\star) = g(\bar{\mathbf{A}}_2 - \bar{\mathbf{A}}_1).$  (43)
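As a numerical sanity check of the ε–ω argument above (a sketch with synthetic random features and a synthetic θ*, not the quantities of an actual MDP), one can verify that the greedy policy is unchanged for perturbations of norm ε = ω/(3‖Φ‖), where ‖Φ‖ is taken here as the largest feature 2-norm.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, d = 6, 4, 8
Phi = rng.normal(size=(n_states, n_actions, d))    # phi(s, a) feature vectors
theta_star = rng.normal(size=d)

values = Phi @ theta_star                          # shape (n_states, n_actions)
pi_star = values.argmax(axis=1)
# optimality gap omega and radius epsilon = omega / (3 * max_a ||phi(s, a)||_2)
sorted_vals = np.sort(values, axis=1)
omega = np.min(sorted_vals[:, -1] - sorted_vals[:, -2])
eps = omega / (3 * np.linalg.norm(Phi, axis=2).max())

for _ in range(1000):
    delta = rng.normal(size=d)
    delta *= eps / np.linalg.norm(delta)           # perturbation of norm exactly eps
    assert np.all((Phi @ (theta_star + delta)).argmax(axis=1) == pi_star)
print("greedy policy unchanged within the epsilon ball")
```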
In the final step, we define

$W^{(i)}(Z_n) = \mathbb{E}\Big[\phi(X_n)r(X_n) - \phi(X_n)\phi(X_n)^\top\theta^\star + \gamma\phi(X_n)\max_{a'\in\mathcal{A}}\phi(S_{n+1},a')^\top\theta^\star\Big],$

with the notation from (34), and denote its vectorized version in a compact form as $\mathbf{W}(Z_n) = (W^{(1)}(Z_n)^\top, \ldots, W^{(N)}(Z_n)^\top)^\top$. We then note that $\mathbf{W}(Z_n) = \mathbf{b}(Z_n) + (\mathbf{A}_2(Z_n) - \mathbf{A}_1(Z_n))\boldsymbol\theta^\star$. Then, by the definition of the asymptotic covariance,

$\sum_{n=-\infty}^{\infty}\mathbb{E}\big[(g\,\mathbf{W}(Z_n) - w(\boldsymbol\theta^\star))(g\,\mathbf{W}(Z_1) - w(\boldsymbol\theta^\star))^\top\big] = g^2\sum_{n=-\infty}^{\infty}\mathbb{E}\big[\mathbf{W}(Z_n)\mathbf{W}(Z_1)^\top\big].$  (44)

From the two results (43) and (44), by invoking (Weng et al., 2020, Theorem 3), we obtain

$\sqrt{n}\,(\boldsymbol\theta_n - \boldsymbol\theta^\star) \xrightarrow{d} \mathcal{N}(0,\Sigma),$  (45)

where $\Sigma$ is the unique solution to the Lyapunov equation (37). By applying the Continuous Mapping Theorem to (45), we get

$n\,\|\boldsymbol\theta_n - \boldsymbol\theta^\star\|_2^2 \xrightarrow{d} \|X\|_2^2, \quad X\sim\mathcal{N}(0,\Sigma).$  (46)

Finally, combining (46) with Assumption B.1(iii), according to (Durrett, 2010, Theorem 5.5.2), ensures that

$\lim_{n\to\infty} n\,\mathbb{E}[\|\boldsymbol\theta_n - \boldsymbol\theta^\star\|_2^2] = \mathbb{E}[\|X\|_2^2] = \mathrm{tr}(\Sigma).$

When considering the linearization (32) instead of the 2RA Q-learning (7), by following the same lines without the need to linearize, we obtain

$\lim_{n\to\infty} n\,\mathbb{E}[\|\tilde{\boldsymbol\theta}_n - \boldsymbol\theta^\star\|_2^2] = \mathbb{E}[\|X\|_2^2] = \mathrm{tr}(\Sigma),$

where again $\Sigma$ is the unique solution to the Lyapunov equation (37). This completes the proof.

C. Additional Numerical Results

C.1. Additional Plots for Random Environment Experiment

We provide some additional plots of the random experiment described in Section 4.

[Figure 4 omitted: learning curves for the first 20 randomly drawn environments; legend: Averaged with K=10, RED with N=10 & M=2, Variance Reduced with D=10, 2RA with N=10 & ρ0=50.]

Figure 4: Random Environment. All methods use an initial learning rate of α0 = 0.01, wα = 10^5, γ = 0.9, and all θ^{(i)} initialized as zero. Maxmin as well as 2RA Q-learning have N = 10, and 2RA agents additionally use ρ0 = 50 and wρ = 10^4. The plots show the first 20 randomly drawn environments.

C.2. Neural Network Function Approximation

As an additional experiment, to test 2RA Q-learning when used with neural network Q-function approximation implemented in TensorFlow (Abadi et al., 2015), the Lunar Lander environment (Brockman et al., 2016) is chosen. A lander receives a large positive reward for landing in a designated area, a large negative reward for crashing, and a small negative reward for firing a thruster. Similar to the Cart Pole experiment, the latest updated policy with ϵ-greedy exploration is used to generate the next timestep on which the model is updated, since not all states may be reached by a uniformly random policy. The environment is considered solved if the average evaluation reward over 100 episodes reaches or exceeds 200.

(a) Distribution of hit times [histogram over 300 to 2,000 training episodes for Watkins, Double, Maxmin, and 2RA Q-learning; omitted].

(b) Mean and std of hit times:

Algorithm            Mean hit time ± std
Watkins Q-Learning   785 ± 281.38
Double Q-Learning    791 ± 268.64
Maxmin Q-Learning    770 ± 252.19
2RA Q-Learning       767 ± 276.24

Figure 5: Lunar Lander, 100 experiments. All methods use a learning rate of α = 0.0002 and a discount factor of γ = 0.99. Maxmin, as well as 2RA Q-learning, have N = 5. 2RA further uses ρ0 = 25 and wρ = 10^4. All algorithms are evaluated every 50 episodes and recorded once the average evaluation reward reaches or exceeds 200. (a) shows the distribution of each algorithm's hit times and (b) lists the respective mean hit times and corresponding standard deviations.

Analogously to the Cart Pole experiment, the different algorithms are compared based on how many training episodes are required to solve the Lunar Lander environment, where fewer training episodes until the environment is solved result in a higher performance ranking for the corresponding model. Instead of the θ^{(i)}, a small neural network with two hidden ReLU (He et al., 2015) layers and N sets of weights are used. Averaging operations that were performed on the θ^{(i)} in the linear function approximation scenarios are now performed on the sets of weights of the neural network.
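A minimal sketch of this layer-wise averaging over N weight sets (plain numpy arrays standing in for the network's layer weights; all names and shapes are illustrative placeholders):

```python
import numpy as np

def average_weight_sets(weight_sets):
    """weight_sets: list of N weight lists, one per network copy; each weight
    list holds the per-layer arrays in the same order. Returns the layer-wise
    average, playing the role of theta_hat_N in the linear setting."""
    return [np.mean(np.stack(layer_copies, axis=0), axis=0)
            for layer_copies in zip(*weight_sets)]

# toy example: N = 3 copies of a two-layer network
rng = np.random.default_rng(5)
copies = [[rng.normal(size=(8, 4)), rng.normal(size=(4,))] for _ in range(3)]
avg = average_weight_sets(copies)
print([w.shape for w in avg])  # [(8, 4), (4,)]
```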
All models are trained with a Huber loss (Huber, 1964) and the Adam optimizer (Kingma & Ba, 2015) using a learning rate of α0 = 0.0002. The different learning methods are implemented as plainly and as close to the theory as possible. Updates are therefore applied after every timestep and on that single timestep only. Comparing the hit times shows only small differences between the learning methods, with Maxmin and 2RA Q-learning leading the field by a small margin. Future work could implement and test the method with more contemporary training pipelines, such as the incorporation of experience replay, to analyze whether such optimizations amplify the differences in performance or merely apply a uniform shift to all learning methods.
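For concreteness, a single-transition gradient step of the kind described above could be sketched as follows; `q_net`, `state`, `action`, and `target_value` are hypothetical placeholders, and the snippet is a minimal TensorFlow sketch rather than the authors' implementation.

```python
import tensorflow as tf

huber = tf.keras.losses.Huber()
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4)

def single_transition_update(q_net, state, action, target_value):
    """One gradient step on a single transition (no replay buffer)."""
    with tf.GradientTape() as tape:
        q_values = q_net(state[None, :])        # state: flat observation vector
        q_sa = q_values[0, action]              # Q(s, a) for the taken action
        loss = huber(tf.reshape(target_value, [1]), tf.reshape(q_sa, [1]))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```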