# ADDQ: Adaptive Distributional Double Q-Learning

Leif Döring¹, Benedikt Wille¹, Maximilian Birr¹, Mihail Bîrsan², Martin Slowik¹

¹Institute of Mathematics, University of Mannheim, Germany. ²Department of Mathematics and Computer Science, Freie Universität Berlin, Germany. Correspondence to: Leif Döring. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

**Abstract.** Bias problems in the estimation of Q-values are a well-known obstacle that slows down convergence of Q-learning and actor-critic methods. Part of the success of modern RL algorithms is due to direct or indirect overestimation reduction mechanisms. We propose an easy-to-implement method, built on top of distributional reinforcement learning (DRL) algorithms, that deals with overestimation in a locally adaptive way. Our framework is simple to implement: existing distributional algorithms can be improved with a few lines of code. We provide theoretical evidence and use double Q-learning to show how to include locally adaptive overestimation control in existing algorithms. Experiments are provided for tabular, Atari, and MuJoCo environments.

## 1. Introduction

A fundamental building block of many modern reinforcement learning (RL) algorithms is Watkins' Q-learning (QL) (Watkins & Dayan, 1992). In each round the agent observes a new reward signal and updates the currently estimated state-action function by combining the new reward signal with the best currently estimated action in the next step. Unfortunately, the update rule involves a maximum, and maxima suffer both from overestimation bias and function approximation uncertainty. Thus, estimated Q-values are initially far too large. Although convergence of Q-learning and its variants can be proved rigorously in tabular cases, convergence is often visible only after millions of iterations. In the context of Q-learning we refer to the seminal paper (Thrun & Schwartz, 1993). The overestimation effect is harmful not only for simple QL and variants with function approximation such as DQN (Mnih et al., 2015), but also for critic estimation in actor-critic methods such as soft actor-critic (SAC) (Haarnoja et al., 2018).

Motivated by statistical approaches to estimating the expectation of maxima of random variables, the concept of double Q-learning (DQL) was introduced in (van Hasselt, 2010). Instead of using one set of random variables, two independent sets are used: one to detect the maximal index, the other to evaluate the random variable corresponding to that index. For Q-learning this translates into keeping track of two copies of the Q-matrix that are alternated in order to detect the best action and evaluate the corresponding Q-value; a small simulation below illustrates the statistical idea. DQL and its actor-critic variants reduce the overestimation (see for instance Figure 2 in (Fujimoto et al., 2018)) and sometimes even underestimate Q-values. This can be seen, for example, in a simple chain MDP (Example 6.7 in (Sutton & Barto, 2018); see also (Lan et al., 2020) for the underestimation effect in the same example). (Fujimoto et al., 2018) argue that overestimation should be addressed particularly in state-action regions with high uncertainty. Indirectly, this is taken into account by ensemble methods such as Maxmin QL (Lan et al., 2020), which use ensembles of more than two Q-estimators.
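To illustrate the statistical effect behind the double estimator described above, here is a minimal NumPy simulation (a sketch, not code from the paper; all sizes are arbitrary choices). A single sample set that both selects and evaluates the maximal index is positively biased, while the double estimator selects with one set and evaluates with an independent one:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 10, 20, 5000       # arbitrary illustration sizes
mu = np.zeros(k)                  # equal true means, so max_i E[X_i] = 0

single, double = [], []
for _ in range(trials):
    # two independent sample sets, as in double Q-learning
    means_a = rng.normal(mu, 1.0, size=(n, k)).mean(axis=0)
    means_b = rng.normal(mu, 1.0, size=(n, k)).mean(axis=0)
    single.append(means_a.max())               # one set selects and evaluates: biased up
    double.append(means_b[means_a.argmax()])   # set A selects, set B evaluates

print(f"single-estimator bias: {np.mean(single):+.3f}")  # clearly positive
print(f"double-estimator bias: {np.mean(double):+.3f}")  # near zero here
```

With unequal true means the double estimator can become negatively biased, matching the underestimation effects discussed in the text.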
Different ensemble estimators take the minimum, full ensemble averages (Anschel et al., 2017), or random ensemble averages (Chen et al., 2021). A related line of research uses uncertainty-based RL, also with the goal of reducing overestimation; see for instance (Wu et al., 2021), (Ghasemipour et al., 2022).

There is no rule for whether QL, DQL, or any ensemble variant works well on a given environment. Algorithms sometimes perform well and sometimes fail. This article proposes a novel approach: take QL and (at least) one other method and control whether the QL update should be replaced by, or mixed with, another (underestimating) method. The control must be locally adaptive, as the need to manage the bias depends on the local uncertainty (aleatoric and epistemic randomness, function approximation). For an approach similar in spirit to ours, see (Dorka et al., 2021). We show theoretically how distributional RL helps the agent identify the need for overestimation control. As a test case, we combine QL and DQL using a local weighting that we call ADDQ. Convergence of ADDQ is proved, and experiments are performed on tabular, Atari, and MuJoCo environments.

## 2. Q-learning and the overestimation problem

### 2.1. Tabular Q-learning

Let us fix a (discrete) Markov decision model $(S, A, R, p)$, where $S$ is a finite state space, $A$ a finite space of allowed actions, $R$ the reward space, and $p$ a transition kernel describing the distribution of the reward $r$ and the new state $s'$ when action $a$ is played in state $s$. Given a time-stationary policy $\pi$, a Markov kernel on $S \times A$, there is a Markov reward process $(S_t, A_t, R_t)$ with transitions
$$\mathbb{P}^\pi(R_t = r, S_{t+1} = s', A_{t+1} = a' \mid S_t = s, A_t = a) = \pi(a' \mid s')\, p(r, s' \mid s, a).$$
The goal of the agent in reinforcement learning is to use rollouts of the MDP to find a policy that maximizes the expected discounted reward
$$Q^\pi(s, a) = \mathbb{E}^\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, S_0 = s, A_0 = a\Big].$$
The discount factor $\gamma \in (0, 1)$ is fixed. In the discrete setting with $S$ and $A$ finite it is well known that optimal stationary policies exist and can be found as greedy policies obtained from the unique solution matrix $Q^*$ of $TQ = Q$. The non-linear operator
$$(TQ)(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(R \times \{s'\} \mid s, a) \max_{a' \in A} Q(s', a')$$
is called Bellman's optimality operator; it is a max-norm contraction on the $S \times A$ matrices. Using Banach's fixed point theorem, the solution can in principle be found by iteratively applying $T$ to some initial matrix $Q_0$. The drawback of this approach is the need to know the operator $T$, and thus the transitions $p$. Using standard stochastic approximation, the fixed point $Q^*$ can instead be approximated by
$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \big(r + \gamma \max_{a'} Q(s', a')\big),$$
called Q-learning (QL). The state-action pairs can be chosen synchronously or asynchronously using rollouts. Typically, to update at $(s, a)$ a one-step sample $(r, s')$ is obtained from $p(\cdot \mid s, a)$, and the step sizes $\alpha$ are assumed to satisfy the Robbins-Monro conditions. The exploration (choices of $(s, a)$ to be updated) can be on-policy (using the Q-estimates) or off-policy (using a behavior policy); the only requirement is infinitely many visits to all state-action pairs. The recursively defined matrix sequence $(Q_t)$ was proved to converge in the tabular setting, see e.g. (Tsitsiklis, 1994).

### 2.2. The overestimation problem

Even though QL converges to $Q^*$ as the number of updates goes to infinity, the convergence is very slow.
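For concreteness, the following is a minimal sketch of the tabular QL update just described. It assumes a hypothetical Gymnasium-style environment `env` with discrete states and actions; none of this code is from the paper:

```python
import random
from collections import defaultdict

def q_learning(env, n_steps, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                # Q[(s, a)], zero-initialized
    visits = defaultdict(int)
    s, _ = env.reset()
    actions = list(range(env.action_space.n))
    for _ in range(n_steps):
        # epsilon-greedy behavior policy (other exploration rules also work)
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s2, r, terminated, truncated, _ = env.step(a)
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]      # 1/#visits satisfies Robbins-Monro
        # max-target: the source of the overestimation discussed next
        target = r if terminated else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = env.reset()[0] if terminated or truncated else s2
    return Q
```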
One of the known sources is the so-called overestimation problem of QL. The algorithm does not provide unbiased estimates of $Q^*(s, a)$; instead, the estimates $Q_t(s, a)$ tend to overestimate $Q^*(s, a)$. A statistical explanation rests on the simple fact that the point estimator $\max\{\hat{X}_1, \dots, \hat{X}_n\}$ is not an unbiased estimator of $\max\{\mathbb{E}[X_1], \dots, \mathbb{E}[X_n]\}$; it is positively biased. Thus, the update targets $r + \gamma \max_{a'} Q_t(s', a')$ can be seen as overestimating the true Bellman optimality operator at each step.

The consequences of overestimation are less obvious than they seem at first sight. Shifting all Q-values by the same amount globally has no effect, neither for Q-based exploration purposes nor for best-action selection. It is the local difference in overestimation that must be avoided so as not to confuse the agent. Thus, it is crucial to understand the root causes of overestimation, and then to mitigate it by taking alternative updates to QL where needed. There is little quantitative understanding of the overestimation; for some rough bounds see (van Hasselt, 2011). We used tools from probability theory to compute estimation bounds for a simple explanatory example.

*Figure 1. Two-sided bandit MDP: start in $s_0$; each side gives reward $r = 0$, followed by one of $k_1$ actions with $\mathcal{N}(\mu_1, \sigma_1^2)$ rewards (left) or $k_2$ actions with $\mathcal{N}(\mu_2, \sigma_2^2)$ rewards (right); gray boxes are terminal.*

The example is intriguing, as it is simple but hard to learn, even more so if one side is replaced by a chain of decisions. Each side gives a reward (for simplicity $0$) followed by a Gaussian reward from one of $k$ actions. QL and DQL can both fail badly (each on one side) for the same parameter configuration. If $\mu_1 > 0 > \mu_2$, then the optimal action in $s_0$ is "left". If $\sigma_2$ (and/or $k_2$) is large compared to $\sigma_1$ (and/or $k_1$), then the overestimated Q-values of QL will confuse the agent and lead it to believe that "right" is optimal. Similarly, underestimation can make the agent believe "down" is optimal. Our approach of learning locally whether to use QL or DQL (or another variant) can mitigate this problem by learning to use QL on one side and DQL on the other. The method is motivated by two theoretical results: a lower bound on overestimation (Proposition 2.1) and a computation with DRL connecting estimated sample variances and overestimation (Proposition 2.2).

If step sizes are chosen in the common way as $\alpha_t(s, a) = \frac{1}{T_{s,a}(t)}$, with $T_{s,a}(t)$ the number of visits at $(s, a)$ up to time $t$, then results on sums and maxima of Gaussian random variables can be used to prove a lower bound on the expected overestimation of the true value $\gamma\mu$.

**Proposition 2.1.** If the left side has been explored $Nk_1$ times and the exploration was sufficiently exploratory (see Theorem A.2), then the Q-estimate at $(s_0, \text{left})$ has bias at least
$$\gamma\sigma_1\sqrt{\frac{\log k_1}{\pi N \log 2}},$$
and analogously at $(s_0, \text{right})$.

The lower bound quantifies the idea that uncertainty forces overestimation and that the bias decreases only slowly over time. A proof is given in Appendix A. Since the estimate is relatively tight, one gets a feeling for how many reward samples are needed to obtain sufficiently precise estimates of the Q-values so that the agent makes the right decision.

There has been considerable interest in recent years in understanding the sources of uncertainty in RL. While many sources of uncertainty exist, they are often categorized as either aleatoric or epistemic. Aleatoric uncertainty is given by the model and cannot be reduced by more data or learning. In RL this is the uncertainty implied by the random variables governing rewards and transitions.
Epistemic uncertainty refers to uncertainty that could in principle be reduced using more data and better algorithms (including better function approximation). Epistemic uncertainty in RL is induced by all random variables used to run the learning procedure (exploration, replay buffer, etc.) and by function approximation in deep RL. Keeping the different sources of uncertainty in mind is useful for identifying algorithmic potential for improvement, but also theoretical limitations. It is also important to realize that aleatoric and epistemic randomness strongly influence each other: if a reward has large variance (aleatoric uncertainty), then estimating it from samples creates more epistemic uncertainty. For estimating expectations, this is due to the central limit theorem. In deep RL one of the major sources of epistemic uncertainty is function approximation. As in (Thrun & Schwartz, 1993) we could add independent error noise to our analysis to model function approximation. If the error noise is assumed Gaussian, then our results readily extend by adding the additional noise variance. Since the modeling assumption of independent noise is rather special, we refrain from studying the effect of function approximation on epistemic uncertainty. Instead, we now show how to use the additional information from distributional QL to deal with overestimation in a local way.

### 2.3. Tabular distributional Q-learning

To make this article as self-contained as possible we give a minimal overview of DRL. For a concise treatment we refer to the recent book (Bellemare et al., 2023). Given a Markov decision model and a stationary policy $\pi$, (Rowland et al., 2018) define the return distribution function as
$$\eta^\pi(s, a)(B) := \mathbb{P}^\pi\Big(\sum_{t=0}^{\infty} \gamma^t R_t \in B \,\Big|\, S_0 = s, A_0 = a\Big) \quad \text{for } B \in \mathcal{B}(\mathbb{R}).$$
The expectation of the measure $\eta^\pi(s, a)$ is the classical state-action value function $Q^\pi(s, a)$. In contrast to ordinary RL, the target in DRL is to learn return distributions instead of only return expectations. There have been many theoretical articles on DRL (Bellemare et al., 2017; Dabney et al., 2018; Rowland et al., 2018; Lyle et al., 2019; Bellemare et al., 2023; Rowland et al., 2023b;a) establishing distributional Bellman operators, contractivity, convergence proofs of dynamic programming, etc. It was shown in (Bellemare et al., 2017; 2023) that the return distribution function is the unique solution to $\eta^\pi = T^\pi \eta^\pi$, where $T^\pi : \mathcal{P}(\mathbb{R})^{S \times A} \to \mathcal{P}(\mathbb{R})^{S \times A}$ is the distributional Bellman operator defined as
$$(T^\pi \eta)(s, a) = \sum_{(r, s', a') \in R \times S \times A} b_{r,\gamma\#}\eta(s', a')\, \pi(a' \mid s')\, p(s', r \mid s, a)$$
with bootstrap function $b_{r,\gamma}(z) = r + \gamma z$ and push-forward of measures $f_\#\nu(B) := \nu(f^{-1}(B))$. In essence, sample-based dynamic programming can also be carried out in the distributional sense, replacing the classical Bellman operator by its distributional analogue. Distributional QL proceeds similarly to classical expectation QL:
$$\eta(s, a) \leftarrow (1 - \alpha)\, \eta(s, a) + \alpha\, b_{r,\gamma\#}\eta(s', a^*), \quad \text{with } a^* = \operatorname*{argmax}_{a'} Q(s', a'),$$
where $Q(s', a')$ are the expectations of the probability measures $\eta(s', a')$. To work with DRL algorithmically, parametrizations $\mathcal{F}$ of measures need to be used. Distributional learning algorithms in practice then work similarly to deep learning algorithms, alternating Bellman operators and function class projections, see Algorithm 1.
**Algorithm 1** Distributional Q-learning update step

```
Require: proxy η for η*, pair (s, a) to be updated
Determine step-size α
Sample reward/next state (r, s')
# Compute target
a* ← argmax_{a'} E_{Z ~ η(s', a')}[Z]
η̂' ← b_{r,γ#} η(s', a*)
# Project target back onto support
η̂ ← Π_F(η̂')
# Move η(s, a) towards the target, for tabular RL e.g.
η(s, a) ← (1 − α) η(s, a) + α η̂
```

There are two simple parametrizations that have been used frequently: the categorical parametrization (a fixed number of atoms with variable weights at fixed locations) and the quantile parametrization (a fixed number of atoms with fixed weights but variable locations). For the categorical parametrization a set of $m$ evenly spaced locations $\theta_1 < \dots < \theta_m$ needs to be fixed; the categorical measures are then defined by
$$\mathcal{F}_{C,m} = \Big\{\sum_{i=1}^m p_i \delta_{\theta_i} \,:\, p_i \geq 0, \ \sum_{i=1}^m p_i = 1\Big\}.$$
That is, measures in $\mathcal{F}_{C,m}$ are parametrized by an $m$-dimensional probability vector with weights for the $m$ fixed atoms. In contrast, the quantile parametrization
$$\mathcal{F}_{Q,m} = \Big\{\sum_{i=1}^m \tfrac{1}{m}\delta_{\theta_i} \,:\, \theta_i \in \mathbb{R}\Big\}$$
fixes the weights to $\frac{1}{m}$ with variable atom locations.

### 2.4. Overestimation mitigation using distributional RL

We follow the insight from Proposition 2.1 that large uncertainty, aleatoric (large $\sigma$) as well as epistemic (small $N$; additional $\sigma$ if function approximation is modeled as Gaussian error), implies large overestimation of Q-values. While we skip function approximation in our theoretical considerations, we are as precise as possible in the simplest tabular settings. The estimate from Proposition 2.1 suggests replacing QL updates at $(s, a)$ if the uncertainty is large (compared to other actions). Unfortunately, in standard QL the agent has no direct access to such information to adjust the update rule at $(s, a)$. This is where our idea comes into play: DRL gives the agent exactly the needed information through distributional insight into the return estimate $\eta_t(s, a)$.

It is crucial to note that DRL learns random measures: $\eta(s, a)$ is a probability measure that depends on the random samples used in the updates, so it has an expectation and a variance that are both random in terms of the random samples. The situation becomes tricky as we are going to take variances of the variances. To avoid confusion, an analogy to statistics is used: we speak of sample averages $M$ (resp. sample variances $S^2$) of $\eta(s, a)$, and of expectation $\mathbb{E}$ (resp. variance $\mathbb{V}$) for the integrals against the randomness induced by the probability space behind all random variables.

We now use the bandit MDP from above to explain why the DRL agent has access to uncertainty estimates during the learning process. To allow concrete computations with distributions, we use a particularly simple QL update mechanism (the one also used for the estimates in (van Hasselt, 2011)): first explore all actions $N$ times, then propagate to $(s_0, \text{left})$. In fact, this is nothing but distributional QL with cyclic exploration and the target-matrix trick (Mnih et al., 2015), as we explain in Appendix A. The obtained estimate of $\eta^*(s_0, \text{left})$ will be denoted by $\hat{\eta}(s_0, \text{left})$.

**Proposition 2.2.** The sample variance of $\hat{\eta}(s_0, \text{left})$ (analogously for "right") after $Nk_1$ steps is $\frac{\sigma_1^2}{N-1}\chi^2_{N-1}$-distributed. Its expectation is $\sigma_1^2$, its variance is $\frac{2\sigma_1^4}{N-1}$.

A proof is given in Appendix A. It is surprising that the sample variance distribution can be identified explicitly as chi-squared, unlike the sample expectations (maxima of independent Gaussians).
This is a consequence of the well-known fact in statistics that sample variances of Gaussians are independent of sample expectations. Thus, if return estimates are Gaussian, the max-operation of QL (with respect to sample averages) is only delicate for sample expectations, not for sample variances. We emphasize that, in contrast to QL, the agent in distributional QL does have access to the $\frac{\sigma^2}{N-1}\chi^2_{N-1}$-distributed sample variance by computing sums over the atoms! Hence, the agent can make use of an unbiased estimate of the aleatoric uncertainty $\sigma^2$, which is confounded by epistemic uncertainty that decreases with $N$.

Implications of our theoretical considerations: we propose the following locally adaptive overestimation mitigation method. (i) Use distributional QL. (ii) At every update compute the sample variance of the current return estimate $\eta_t(s, a)$. (iii) If the sample variance is large, replace the QL update by another update (e.g. DQL, an ensemble update, or a mixture). Since "large" has no absolute meaning in RL, we compare variances among all actions in $s$ and reduce according to the relative sample variance.

**Exploration and overestimation control with ensembles and double Q-learning.** Known algorithms that directly try to better estimate the Q-values require a particular algorithmic architecture. Most use an ensemble of Q-copies which are then combined by taking minima (Lan et al., 2020), averages (Peer et al., 2021), or averages with random choices (Chen et al., 2021). Ensemble methods are promising in theory (assuming independent ensembles) but more problematic in deep RL, as storage constraints force ensembles to be parametrized by the same neural network. The optimal number of copies (a hyperparameter) varies across environments; some choices work well, others fail. Our approach is different: we suggest using QL where it works well (small sample variance) and another algorithm where QL fails (large sample variance). We could combine QL with ensemble methods, but for this article we decided to combine QL with DQL, somewhat in the spirit of weighted DQL (Zhang et al., 2017) but with completely different weights; we compare to weighted DQL in Appendix C.2. For DQL (van Hasselt, 2010) two copies $Q^A$ and $Q^B$ are stored. The update mechanism is similar to QL, where the matrix to be updated is chosen randomly in every step. The main difference is the target used; the matrices $Q^A$ and $Q^B$ are flipped:
$$Q^{A/B}(s, a) \leftarrow (1 - \alpha)\, Q^{A/B}(s, a) + \alpha\big(r + \gamma Q^{B/A}(s', z^*)\big), \quad \text{with } z^* = \operatorname*{argmax}_{a'} Q^{A/B}(s', a').$$
Here and in the following we write $Q^{A/B}$ to allow either the choice of $Q^A$ or $Q^B$. Double Q-learning strongly reduces the overestimation but sometimes leads to severe underestimation.

The use of sample variances of estimated return distributions is not new to RL. For bandits the simplest example is the UCB exploration bonus for unknown variances. In the context of RL, uncertainty-dependent exploration has been applied using distributional RL (see for instance (Mavrin et al., 2019), (Moerland et al., 2018)). The use of sample variances in distributional RL to locally mitigate overestimation is, to the best of our knowledge, new. In order not to mix up effects we stick to the standard exploration choices.

## 3. ADDQ: Tabular setting

Based on the theoretical insight we now introduce a concrete method. We use DQL updates as an alternative to QL updates when the agent expects QL updates to be harmful (large estimated uncertainty by means of sample variance).
The weighted DQL approach we suggest is readily implemented into existing DRL implementations.

### 3.1. Weighted DQL: From SARSA-trick to ADDQ

To derive the algorithm, let us recall the SARSA convergence proof of (Singh et al., 2000), which was also used to prove convergence for DQL (van Hasselt, 2010) and variants such as clipped Q-learning (Fujimoto et al., 2018) and Maxmin (Lan et al., 2020). The idea is to add a clever zero, adding and subtracting what is missing from the QL update, thus writing the algorithm as QL with a bias. If the bias can be proved to vanish, a comparison to QL implies convergence to $Q^*$:
$$\underbrace{(1 - \alpha)\, Q^{A/B}(s, a) + \alpha\big(r + \gamma Q^{A/B}(s', z^*)\big)}_{\text{Q-learning update}} + \underbrace{\alpha\big(\gamma Q^{B/A}(s', z^*) - \gamma Q^{A/B}(s', z^*)\big)}_{=:\, b^{A/B}(s, a)},$$
with $z^* = \operatorname*{argmax}_{a'} Q^{A/B}(s', a')$. Taking this point of view, there are plenty of possibilities to modify DQL to achieve more or less over-/underestimation. As an example, choosing the negative bias terms
$$b^{A/B}_{\text{clip}} = \alpha \min\big\{\gamma Q^{B/A}(s', z^*) - \gamma Q^{A/B}(s', z^*),\, 0\big\}$$
yields so-called clipped Q-learning, introduced as part of TD3 in (Fujimoto et al., 2018). As motivated in Section 2.4, we suggest a locally adaptive overestimation control. The main insight is as follows: one can locally interpolate between QL and DQL by multiplying the bias terms with locally adaptive weights,
$$\tilde{b}^{A/B}(s, a) := \big(1 - \underbrace{\beta^{A/B}(s, a)}_{\text{new}}\big)\, b^{A/B}(s, a).$$
Replacing the bias terms $b$ by $\tilde{b}$ generalizes the update. The aggressive choice $\beta = 1$ results in QL updates (overestimation), $\beta = 0$ results in DQL updates (tendency of underestimation). The choices of $\beta$ suggested in the present article are motivated by the propositions of Sections 2.2 and 2.4: if a lot of uncertainty is present, the algorithm uses small $\beta$, otherwise large $\beta$. The aggressive choices are not necessarily the best; in our experiments below, softer choices were more effective. Since overestimation is a priori not problematic for the learning process (if all estimates are equally overestimated the best action does not change; only skewed overestimation slows down the learning), we suggest a choice of $\beta$ that takes the local structure into account, normalizing variances over possible actions.

### 3.2. New algorithm: locally adaptive distributional double Q-learning

Following the ideas above we introduce ADDQ, integrating locally adaptive overestimation mitigation into DQL. The pseudo-code in Algorithm 2 extends distributional RL pseudo-code to include the double estimator and adaptive weights. For the tabular target update we follow the measure-mixture approach of (Rowland et al., 2018), Section 8. Changes to the code for other target updates (for instance gradient steps minimizing a KL loss) are straightforward.

**Algorithm 2** ADDQ update step

```
Require: proxies η^A, η^B for η*, pair (s, a) to be updated
Determine step-size α
Sample reward/next state (r, s')
Randomly choose Update(A) or Update(B)
if Update(A) then
    # Compute target with locally adapted weight
    a* ← argmax_{a'} E_{Z ~ η^A(s', a')}[Z]
    Determine weight β ∈ [0, 1] based on η^A_old, η^B_old
    ν ← (1 − β) η^B(s', a*) + β η^A(s', a*)
    η̂^A' ← b_{r,γ#} ν
    # Project target back onto support
    η̂^A ← Π_F(η̂^A')
    # Move η(s, a) towards the target, for tabular RL e.g.
    η^A(s, a) ← (1 − α) η^A(s, a) + α η̂^A
end if
if Update(B) then
    Proceed analogously with A and B exchanged
end if
```

The algorithm is a combination of distributional QL and distributional DQL: keeping $\beta$ constant equal to $1$ is distributional QL, constant $0$ distributional DQL.
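In scalar (non-distributional) form, the $\beta$-interpolated update of Algorithm 2 can be sketched as follows. This is a sketch, not the paper's implementation: the Q-tables are plain dictionaries and `beta_fn` is a hypothetical callable supplying the local weight (the paper computes it from relative sample variances, see Eq. (1) below):

```python
import random

def addq_scalar_update(QA, QB, s, a, r, s2, actions, gamma, alpha, beta_fn):
    # randomly pick which copy to update; the other provides the DQL part
    Q, Qother = (QA, QB) if random.random() < 0.5 else (QB, QA)
    z = max(actions, key=lambda b: Q[(s2, b)])   # argmax w.r.t. the updated copy
    beta = beta_fn(s2, z)                        # hypothetical local weight in [0, 1]
    # beta = 1 recovers Q-learning, beta = 0 recovers double Q-learning
    mix = beta * Q[(s2, z)] + (1 - beta) * Qother[(s2, z)]
    Q[(s, a)] += alpha * (r + gamma * mix - Q[(s, a)])
```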
The key is to choose $\beta$ depending on the uncertainty that drives the skewed QL overestimation for different actions. Extending arguments from the literature, notably the convergence proof of (Rowland et al., 2018) for categorical Q-learning with stochastic approximation target update and the SARSA trick of (Singh et al., 2000), we prove convergence of Algorithm 2 for categorical measure parametrizations:

**Theorem 3.1.** Let $\eta^A_0, \eta^B_0$ be initial return distribution functions supported within $[\theta_1, \theta_m]$. If rewards are bounded in $[R_{\min}, R_{\max}]$ with $[\frac{R_{\min}}{1-\gamma}, \frac{R_{\max}}{1-\gamma}] \subseteq [\theta_1, \theta_m]$, step sizes fulfill the Robbins-Monro conditions, $\eta^A$ or $\eta^B$ are updated randomly, and the sequences $(\beta^A_t)_{t \in \mathbb{N}}, (\beta^B_t)_{t \in \mathbb{N}}$ only depend on the past and fulfill $\lim_{t\to\infty} |\beta^A_t - \beta^B_t| = 0$ almost surely, then the induced Q-values converge almost surely towards $Q^*$. If additionally the MDP has a unique optimal policy $\pi^*$, then $(\eta^A_t), (\eta^B_t)$ converge almost surely in $\ell^2$ to some $\eta^*_C \in \mathcal{F}_{C,m}$, and the greedy policy with respect to $\eta^*_C$ is $\pi^*$.

According to Theorem 3.1, symmetric sequences $\beta^A = \beta^B$ that may depend on the past distributions $\eta^A, \eta^B$ yield convergence. A particularly simple choice compares the spreads of $\eta^{A/B}$ locally as a local measure of uncertainty.

*Figure 2. Grid world. First column: biases summed over all state-action pairs. Other columns highlight the relation between sample variance (bottom) and the effect on Q-value estimation (top). For more experiments and comparisons to Maxmin, EBQL, REDQ see Appendix C.*

**An exemplary choice of adaptive β.** As motivated earlier, QL suffers from overestimation in the presence of uncertainty (function approximation, aleatoric randomness, epistemic randomness), which is reflected in the sample variance (or other measures of spread) of the discrete (random) distributions $\eta^{A/B}_t$. As uniform overestimation is not particularly troubling (adding a constant to all Q-values does no harm at all), the learning process is slowed down by differences in sample variances across allowed actions. There are many other possibilities to choose $\beta$, but we fixed this example for all our experiments. For a finite atomic measure $\nu = \sum_{i=1}^n p_i \delta_{a_i}$ define the sample mean $M(\nu) = \sum_{i=1}^n p_i a_i$ and the sample variance $S^2(\nu) = \sum_{i=1}^n p_i (a_i - M(\nu))^2$. Now define
$$S^2_{s,a} := \tfrac{1}{2}\big(S^2(\eta^A_t(s, a)) + S^2(\eta^B_t(s, a))\big), \qquad \bar{S}^2_s := \frac{1}{|A|}\sum_{a \in A} S^2_{s,a}, \qquad S^2_{\text{rel}}(s, a) := \frac{S^2_{s,a}}{\bar{S}^2_s}.$$
According to our computations in the two-sided bandit model, evenly distributed relative sample variances (i.e. all values around 1) correspond to balanced overestimation effects; otherwise, overestimation is unbalanced. We define locally adaptive weights for the next update at $(s, a)$:
$$\beta(s, a) = \begin{cases} 0.75 & : S^2_{\text{rel}}(s, a) < 0.75 \\ 0.5 & : S^2_{\text{rel}}(s, a) \in [0.75, 1.25] \\ 0.25 & : S^2_{\text{rel}}(s, a) > 1.25 \end{cases} \tag{1}$$
For algorithmic simplicity one could also use squared deviations from the median instead of the mean, as the median can be read off directly for measures in $\mathcal{F}_{C,m}$ and $\mathcal{F}_{Q,m}$. This choice combines QL and DQL rather softly; more aggressive choices shrink the middle interval and move the weights towards $\beta = 1$ and $\beta = 0$. In practice (tabular, all Atari, all MuJoCo), the choice worked well without any tuning. The thresholds in the definition of $\beta$ can be seen as hyperparameters of the algorithm. We leave the development of adaptive thresholds for future research.
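A hedged sketch of the weight computation in Equation (1) for categorical return estimates; the function names and the assumption that `eta_a`/`eta_b` map state-action pairs to probability vectors over the fixed atom locations are ours, not the paper's:

```python
import numpy as np

def beta_weight(eta_a, eta_b, s, actions, atoms, lo=0.75, hi=1.25):
    def spread(p):                        # variance of a categorical measure
        m = np.dot(p, atoms)
        return np.dot(p, (atoms - m) ** 2)
    s2 = {a: 0.5 * (spread(eta_a[(s, a)]) + spread(eta_b[(s, a)]))
          for a in actions}               # averaged over the two copies
    s2_bar = np.mean(list(s2.values()))   # normalizing constant over actions
    def beta(a):
        rel = s2[a] / s2_bar              # relative sample variance
        return 0.75 if rel < lo else (0.25 if rel > hi else 0.5)
    return beta
```

The division by `s2_bar` assumes at least one action has non-zero spread; a real implementation would guard against zero.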
### 3.3. A grid world example

To show the advantages of ADDQ we present a grid world that highlights, in a more complicated and more realistic way, the main effects of the bandit MDP. For more details, see Appendix C; in particular, in Appendix C.3 we compare ADDQ to other algorithms for overestimation reduction.

*Figure 3. Grid world: start in S, goal in G, fake goal in F; high-stochasticity, low-reward area in gray.*

The deep RL experiments presented below use essentially deterministic environments with uncertainty mainly from function approximation. We therefore provide a grid world with complicated stochasticity that, depending on the parameters, is hard for QL and for DQL. The high-stochasticity region with low rewards confuses the QL agent: the agent strongly overestimates suboptimal Q-values that lead into the gray area, and within the gray area Q-values are overestimated for actions that stay there. Hence, all $\varepsilon$-greedy exploration mechanisms spend much time in the gray area. On the other hand, DQL is motivated by estimators of iid random variables and underestimates strongly if the action-values for different actions are unevenly distributed. Thus, the DQL agent gets confused by the local deviations caused by the fake goal and the stochastic region. What results in overestimation for QL results in underestimation for DQL. In contrast, ADDQ locally combines the update rules of QL and DQL and reduces the estimation biases substantially. In Appendix C we show experimentally that the choice of thresholds in $\beta$ is relatively harmless; more or less aggressive updates still perform well. In contrast, constant $\beta$ and the non-distributional choice from (Zhang et al., 2017) with $c = 10$ have larger biases. The plots in Figure 2 show a few key insights, with more in Appendix C. First, the estimation bias of ADDQ is much smaller than those of QL and DQL, most importantly in the complicated states 1, 4, 6, 7. The reason ADDQ estimates the Q-values better is the adaptive choice to prefer QL or DQL, depending on (relatively) large variances.

### 3.4. ADDQ adaptation for distributional DQN

Our DRL-based locally adaptive overestimation mitigation can be integrated into existing DRL code with a few extra lines. In the following we describe the integration into C51 (Bellemare et al., 2017). We run experiments on Atari environments from the Arcade Learning Environment (Bellemare et al., 2013) using the Gymnasium API (Towers et al., 2023). The algorithms are few-line modifications based on the RL Baselines3 Zoo (Raffin, 2020) training framework, without any further tuning of hyperparameters.

The C51 algorithm obtained its name from using a categorical representation of return distributions with $m = 51$ atoms. The weights of the parametrization are given by feedforward neural networks following the DQN architecture (Mnih et al., 2015): the state $s$ serves as input, and the last layer outputs $m = 51$ logits for each action, followed by a softmax returning probability weights. We write $\eta_\omega(s, a) = \sum_{i=1}^m p_i(s, a; \omega)\, \delta_{\theta_i}$, where $\omega$ comprises the online network's weights. We write $\bar{\eta}_{\bar{\omega}}$ for a delayed target network which is kept constant and overwritten, every e.g. 10000 steps, with the parameters of the online network. The corresponding expectation is denoted by $Q_\omega(s, a) = \sum_{i=1}^m p_i(s, a; \omega)\, \theta_i$. Given a transition $(s, a, r, s')$ the projected target is
$$\hat{\eta} = \Pi_C\big(b_{r,\gamma\#}\bar{\eta}_{\bar{\omega}}(s', a^*)\big) =: \sum_{i=1}^m \hat{p}_i \delta_{\theta_i}, \quad \text{where } a^* = \operatorname*{argmax}_{a'} Q_{\bar{\omega}}(s', a').$$
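The projection $\Pi_C$ is the standard C51 operation of splitting the mass of each shifted atom between its two neighbouring support points. A self-contained NumPy sketch, assuming evenly spaced atoms as in the setup above (illustration only, not the paper's code):

```python
import numpy as np

def categorical_projection(atoms, probs, r, gamma, terminated=False):
    """Project b_{r,gamma}#eta, given by weights `probs` on `atoms`,
    back onto the fixed support `atoms` (the Pi_C step)."""
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    tz = np.clip(r + (0.0 if terminated else gamma) * atoms, v_min, v_max)
    b = (tz - v_min) / dz                          # fractional index on the grid
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    out = np.zeros_like(probs)
    exact = (lo == hi)                             # shifted atom hits a grid point
    np.add.at(out, lo[exact], probs[exact])
    np.add.at(out, lo[~exact], probs[~exact] * (hi - b)[~exact])
    np.add.at(out, hi[~exact], probs[~exact] * (b - lo)[~exact])
    return out
```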
Gradient descent on the weights with respect to some loss (here: cross-entropy) is used to move the distribution $\eta_\omega(s, a)$ towards the target distribution $\hat{\eta}$. As in the tabular setting, the variants with overestimation reduction intervene in the target definition. We keep track of two independently initialized online networks, denoted $\omega^A, \omega^B$, and a pair of respective target networks $\bar{\omega}^A, \bar{\omega}^B$. For each gradient step we simulate a vector of random variables of the same size as the batch, each element determining which of the two estimators is updated from the transition at the same position in the batch. Accordingly, we use twice the batch size for these methods, so that on average the same number of transitions is used per estimator and gradient step as in the single-estimator case. We can now describe how to modify the targets for a given transition $(s, a, r, s')$. With the placeholder $\Gamma$ in $\hat{\eta}^{A/B} = \Pi_C(b_{r,\gamma\#}\Gamma^{A/B})$, only the placeholder is modified for the different algorithms:

- **Double C51:** Set $\Gamma^{A/B} = \bar{\eta}_{\bar{\omega}^{B/A}}(s', z^*)$ with $z^* = \operatorname{argmax}_{a'} Q_{\bar{\omega}^{A/B}}(s', a')$.
- **Clipped C51:** Inspired by (Fujimoto et al., 2018).¹ Set $\Gamma^{A/B} = \bar{\eta}_{\bar{\omega}^{X}}(s', z^*)$ with $z^* = \operatorname{argmax}_{a'} Q_{\bar{\omega}^{A/B}}(s', a')$ and $X = \operatorname{argmin}_{c \in \{A,B\}} Q_{\bar{\omega}^c}(s', z^*)$.
- **ADDQ (us):** The ADDQ target uses $\Gamma^{A/B} = \beta\, \bar{\eta}_{\bar{\omega}^{A/B}}(s', z^*) + (1-\beta)\, \bar{\eta}_{\bar{\omega}^{B/A}}(s', z^*)$, where $z^* = \operatorname{argmax}_{a'} Q_{\bar{\omega}^{A/B}}(s', a')$. The locally adaptive weights $\beta$ may depend on entire state-action return distributions. For the experiments we used the choice from Equation (1), based on the online networks $\eta_{\omega^A}, \eta_{\omega^B}$.

¹ TD3 (Fujimoto et al., 2018) introduced clipping in an actor-critic setting, where the action is given by the actor. In our C51 adaptation, we select the greedy action based on the target network.

Results for two Atari environments and an RLiable comparison (Agarwal et al., 2021) are presented in Figure 4 and more extensively in Appendix E. The experiments show that ADDQ is more stable than QL and DQL (it never fails completely) and achieves higher scores in different metrics.

### 3.5. ADDQ adaptation for QRDQN

Modifications and results for quantile DRL (Dabney et al., 2018) are similar to the categorical setting. The setup is explained in Appendix D; for a quick view, two experiments and an RLiable plot for probability of improvement can be found in Figure 4. More experimental results are provided in Appendix E. ADDQ is more stable (it never fails completely) and achieves higher scores in different metrics.

### 3.6. ADDQ adaptation for quantile SAC

Compared to Atari environments, it is known that overestimation in MuJoCo environments is more severe. Using the double estimator in critic estimation is not enough; the clipping trick of TD3 (Fujimoto et al., 2018) greatly improves performance. Recall that, in the formulation of biased Q-learning of Section 3.1, clipping uses the bias $\min\{Q^{B/A} - Q^{A/B}, 0\}$ instead of $Q^{B/A} - Q^{A/B}$. Thus, using our distributional overestimation identification to combine QL and DQL does not by itself yield an algorithm competitive with SAC/TD3 or algorithms with more refined overestimation control such as REDQ (Chen et al., 2021) or TQC (Kuznetsov et al., 2020). For completeness of the article we still consider the ADDQ effect with the standard single and double estimators. We used the base implementations from Stable-Baselines3 to study SAC with quantile regression: with a single critic estimator (we call this QRSAC), with double estimator, with clipped double estimator (QB-SAC in the terminology of (Kuznetsov et al., 2020)), and with ADDQ estimator (ADQRSAC). Experiments are shown in Figure 5 and more extensively in Appendix E. As expected, ADDQ improves QRSAC and double QRSAC but is not competitive with clipping.
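To make the deep target variants of Sections 3.4-3.6 concrete, here is a hedged PyTorch sketch of the placeholder $\Gamma$ for a batch updating copy A. The function name and tensor shapes are assumptions (probabilities `[batch, actions, atoms]`, expectations `[batch, actions]`), not the paper's implementation:

```python
import torch

def gamma_placeholder(p_tA, p_tB, q_tA, q_tB, beta, mode="addq"):
    z = q_tA.argmax(dim=1)                                  # greedy action of copy A
    idx = z.view(-1, 1, 1).expand(-1, 1, p_tA.size(-1))
    gA = p_tA.gather(1, idx).squeeze(1)                     # eta_A(s', z*)
    gB = p_tB.gather(1, idx).squeeze(1)                     # eta_B(s', z*)
    if mode == "double":
        return gB                                           # double C51
    if mode == "clipped":                                   # evaluate with the smaller Q
        use_b = q_tB.gather(1, z.unsqueeze(1)) < q_tA.gather(1, z.unsqueeze(1))
        return torch.where(use_b, gB, gA)
    b = beta.view(-1, 1)                                    # per-sample local weights
    return b * gA + (1 - b) * gB                            # ADDQ mixture
```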
We leave for future work experimenting with distributional-RL-based overestimation identification for REDQ and TQC.

*Figure 4. Our method ADDQ (blue) compared to distributional DQN (orange), with double estimator (green) and clipped double estimator (red). First row with categorical, second row with quantile parametrization. Learning curves are given for two environments, 8 more in Appendix E. We used 10 seeds. RLiable probability-of-improvement plots are for all 10 environments, more metrics in Appendix E.*

*Figure 5. For completeness: learning curves for two MuJoCo environments and RLiable probability of improvement on 5 environments. Adapting locally (blue, us) the double critic estimator (green) and the direct estimator (orange) cannot (by far) reach the performance of the clipping estimator (red). Nonetheless, ADDQ improves the distributional direct and double critic estimators with almost no change to the code. Runs are averaged over 10 seeds; more details can be found in Appendix E.*

## 4. Summary, limitations, and future work

Built on theoretical insight for a bandit MDP, we suggest using sample variances in distributional RL to mitigate the overestimation of QL. Our approach does not introduce a novel estimation procedure for Q-values; instead, it combines known estimators and tries to use the better one. Using DRL, the agent at state-action pair $(s, a)$ has access to next-state information that predicts the overestimation of QL updates and accordingly prefers QL or an alternative. The approach can be incorporated into different estimation methods (e.g. (random) ensemble methods, truncation methods). For the present article we decided to improve DQL, leading to our algorithm ADDQ. The algorithm has the expected feature that, in contrast to Q/DQL, it does not fail completely on some environments. Probability of improvement and normalized scores using the RLiable library show a clear improvement of ADDQ over the underlying deep Q/DQL.

While ADDQ improves QL and DQL in different settings, there is future work to be done. (i) Our probabilistic calculations can clearly be extended; it is quite likely that results from Gaussian processes can be used for computations in more general settings. (ii) The exemplary choice of $\beta$ can be improved; it would be interesting to replace the hyperparameter thresholds by an adaptive, learnable choice. (iii) The MuJoCo simulation study was only included for completeness; it would have been very surprising if a local adaptation of single and double critic estimates could improve the clipped critic estimate (or even better algorithms such as REDQ or TQC). It is interesting future research to include local overestimation mitigation into REDQ (making the chosen ensemble number depend locally on sample variances) or TQC (making the number of truncated atoms depend locally on sample variances). (iv) Use sample variances to perform target network updates after non-constant numbers of steps.

The code used in our experiments can be found on GitHub: https://github.com/BommeHD/ADDQ.git.

**Acknowledgement.** The authors acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG.
## Impact statement

This paper presents work whose goal is to advance the field of machine learning, in particular reinforcement learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Adler, R. J. and Taylor, J. E. Random Fields and Geometry. Springer Monographs in Mathematics. Springer, New York, 2007. ISBN 978-0-387-48112-8.

Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., and Bellemare, M. Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems, volume 34, pp. 29304-29320. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f514cec81cb148559cf475e7426eed5e-Paper.pdf.

Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 176-185. PMLR, 2017. URL https://proceedings.mlr.press/v70/anschel17a.html.

Badia, A., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, D., and Blundell, C. Agent57: Outperforming the Atari human benchmark, 2020. URL http://arxiv.org/abs/2003.13350.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 47(1):253-279, May 2013. ISSN 1076-9757.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pp. 449-458. JMLR.org, 2017.

Bellemare, M. G., Dabney, W., and Rowland, M. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming, volume 3 of Optimization and Neural Computation Series. Athena Scientific, 1996. ISBN 1886529108.

Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. Dopamine: A research framework for deep reinforcement learning. Preprint, arXiv:1812.06110, 2018. URL http://arxiv.org/abs/1812.06110.

Chen, X., Wang, C., Zhou, Z., and Ross, K. Randomized ensembled double Q-learning: Learning fast without a model. Preprint, arXiv:2101.05982, 2021. URL http://arxiv.org/abs/2101.05982.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI'18. AAAI Press, 2018. ISBN 978-1-57735-800-8.

Dorka, N., Welschehold, T., Boedecker, J., and Burgard, W. Adaptively calibrated critic estimates for deep reinforcement learning. IEEE Robotics and Automation Letters, 8:624-631, 2021. URL https://api.semanticscholar.org/CorpusID:244527082.

Fernique, X. Régularité des trajectoires des fonctions aléatoires gaussiennes. In École d'Été de Probabilités de Saint-Flour, IV-1974, volume 480 of Lecture Notes in Mathematics, pp. 1-96. Springer, Berlin-New York, 1975.

Fujimoto, S., van Hoof, H., and Meger, D.
Addressing function approximation error in actor-critic methods. Preprint, arXiv:1802.09477, 2018. URL http://arxiv.org/abs/1802.09477.

Ghasemipour, K., Gu, S. S., and Nachum, O. Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS'22. Curran Associates Inc., 2022. ISBN 9781713871088.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861-1870. PMLR, 2018.

Kuznetsov, A., Shvechikov, P., Grishin, A., and Vetrov, D. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020.

Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin Q-learning: Controlling the estimation bias of Q-learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia. OpenReview.net, 2020. URL https://openreview.net/forum?id=Bkg0u3Etwr.

Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4504-4511, 2019. doi: 10.1609/aaai.v33i01.33014504. URL https://ojs.aaai.org/index.php/AAAI/article/view/4365.

Mavrin, B., Yao, H., Kong, L., Wu, K., and Yu, Y. Distributional reinforcement learning for efficient exploration. In ICML, volume 97 of Proceedings of Machine Learning Research, pp. 4424-4434. PMLR, 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015. doi: 10.1038/nature14236.

Moerland, T., Broekens, J., and Jonker, C. The potential of the return distribution for exploration in RL, 2018. URL http://arxiv.org/abs/1806.04242.

Peer, O., Tessler, C., Merlis, N., and Meir, R. Ensemble bootstrapping for Q-learning. In International Conference on Machine Learning, 2021. URL https://api.semanticscholar.org/CorpusID:232076148.

Quan, J. and Ostrovski, G. DQN Zoo: Reference implementations of DQN-based agents, 2020. URL http://github.com/deepmind/dqn_zoo.

Raffin, A. RL Baselines3 Zoo. https://github.com/DLR-RM/rl-baselines3-zoo, 2020.

Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1-8, 2021. URL http://jmlr.org/papers/v22/20-1364.html.

Rowland, M., Bellemare, M. G., Dabney, W., Munos, R., and Teh, Y. W. An analysis of categorical distributional reinforcement learning. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 29-37. PMLR, 2018.
URL https://proceedings.mlr.press/v84/rowland18a.html.

Rowland, M., Munos, R., Azar, M. G., Tang, Y., Ostrovski, G., Harutyunyan, A., Tuyls, K., Bellemare, M. G., and Dabney, W. An analysis of quantile temporal-difference learning. Preprint, arXiv:2301.04462, 2023a. URL http://arxiv.org/abs/2301.04462.

Rowland, M., Tang, Y., Lyle, C., Munos, R., Bellemare, M. G., and Dabney, W. The statistical benefits of quantile temporal-difference learning for value estimation. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023b.

Singh, S., Jaakkola, T., Littman, M., and Szepesvári, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287-308, 2000. doi: 10.1023/A:1007678930559.

Sudakov, V. N. Gaussian random processes, and measures of solid angles in Hilbert space. Dokl. Akad. Nauk SSSR, 197:43-45, 1971. ISSN 0002-3264.

Sudakov, V. N. Geometric problems of the theory of infinite-dimensional probability distributions. Trudy Mat. Inst. Steklov., 141:191, 1976. ISSN 0371-9685.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html.

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, pp. 255-263. Lawrence Erlbaum, 1993. URL http://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1993_1/thrun_sebastian_1993_1.pdf.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012. doi: 10.1109/IROS.2012.6386109.

Towers, M., Terry, J. K., Kwiatkowski, A., Balis, J. U., Cola, G. d., Deleu, T., Goulão, M., Kallinteris, A., KG, A., Krimmel, M., Perez-Vicente, R., Pierré, A., Schulhoff, S., Tai, J. J., Shen, A. T. J., and Younis, O. G. Gymnasium, March 2023. URL https://zenodo.org/record/8127025.

Tsitsiklis, J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202, 1994. doi: 10.1023/A:1022689125041.

van Handel, R. Probability in High Dimensions. Princeton University, lecture notes, 2016.

van Hasselt, H. Double Q-learning. In Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper_files/paper/2010/file/091d584fced301b442654dd8c23b3fc9-Paper.pdf.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. Preprint, arXiv:1509.06461, 2015. URL http://arxiv.org/abs/1509.06461.

van Hasselt, H. P. Insights in Reinforcement Learning: Formal Analysis and Empirical Evaluation of Temporal-Difference Learning Algorithms. PhD thesis, Universiteit Utrecht, January 2011. URL http://homepages.cwi.nl/~hasselt/papers/Insights_in_Reinforcement_Learning_Hado_van_Hasselt.pdf.

Watkins, C. J. C. H. and Dayan, P. Q-learning. Machine Learning, 8(3):279-292, May 1992. ISSN 1573-0565. doi: 10.1007/BF00992698.

Wu, Y., Zhai, S., Srivastava, N., Susskind, J. M., Zhang, J., Salakhutdinov, R., and Goh, H.
Uncertainty weighted actor-critic for offline reinforcement learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11319-11328. PMLR, 2021. URL https://proceedings.mlr.press/v139/wu21i.html.

Zhang, Z., Pan, Z., and Kochenderfer, M. J. Weighted double Q-learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3455-3461, 2017. doi: 10.24963/ijcai.2017/483.

## A. Theoretical backup from probability theory

Theoretical backing for all known overestimation reduction algorithms is rather weak; the update rule of Q-learning makes precise computations very difficult. As we discuss below, the stochastic approximation rule requires computations with sums of random variables, which is not well compatible with computations of maxima of random variables. Justifications for overestimation reduction algorithms are typically based on qualitative arguments for the estimation bias of expectations $\mathbb{E}[\max\{X_1, \dots, X_K\}]$ of maxima of independent random variables, even though the random variables appearing in Q-learning maxima are clearly not independent. We will give explicit estimates using probabilistic insights for sums and maxima of independent random variables. We should make very clear that there are different factors in the overestimation problem that are captured in the algorithmic approach of ADDQ. To provide some rigorous theoretical evidence, we focus in this section on the tabular setting without function approximation error. We refer the reader to the seminal article (Thrun & Schwartz, 1993) for some simplified computations on the overestimation effect caused by approximation errors.

### A.1. A lower bound computation for the overestimation bias in an episodic bandit MDP (Proof of Proposition 2.1)

The two-sided bandit MDP from the main text can be analyzed side by side. Without loss of generality we thus study the overestimation of the left side, which is essentially the bandit MDP that has appeared quite a bit in the literature on overestimation of Q-learning (see (van Hasselt, 2011), Example 6.7 of (Sutton & Barto, 2018), or (Lan et al., 2020)). Before stating our results on the overestimation bias, we state a technical lemma based on the Sudakov-Fernique inequality, see (Sudakov, 1971; 1976; Fernique, 1975), a comparison inequality for Gaussian processes.

**Lemma A.1.** Let $k \in \mathbb{N}$ and suppose $(X^i_j)_{j \in \mathbb{N},\, i \in \{1,\dots,k\}}$ are iid $\mathcal{N}(\mu, \sigma^2)$. Further, let $n_1, \dots, n_k \in \mathbb{N}$ such that $n_1 + \dots + n_k \leq kn$ and set $\bar{\gamma} := \sigma^2 \max_{1 \leq i \neq i' \leq k} \big(\tfrac{1}{n_i} + \tfrac{1}{n_{i'}} - \tfrac{2}{n}\big)$. Then,
$$\mathbb{E}\Big[\max_{1 \leq i \leq k} \frac{1}{n_i}\sum_{j=1}^{n_i} X^i_j\Big] \;\geq\; \mathbb{E}\Big[\max_{1 \leq i \leq k} \frac{1}{n}\sum_{j=1}^{n} X^i_j\Big] - \sqrt{2\bar{\gamma} \ln k}. \tag{2}$$
In particular, if additionally $\tfrac{1}{n_i} + \tfrac{1}{n_{i'}} \geq \tfrac{2}{n}$ for all $i, i' \in \{1, \dots, k\}$ with $i \neq i'$, then
$$\mathbb{E}\Big[\max_{1 \leq i \leq k} \frac{1}{n}\sum_{j=1}^{n} X^i_j\Big] \;\leq\; \mathbb{E}\Big[\max_{1 \leq i \leq k} \frac{1}{n_i}\sum_{j=1}^{n_i} X^i_j\Big]. \tag{3}$$

*Proof.* Given $n_1, \dots, n_k \in \mathbb{N}$ with $n_1 + \dots + n_k \leq kn$, set $Y^i := n^{-1}\sum_{j=1}^{n} X^i_j$ and $Z^i := n_i^{-1}\sum_{j=1}^{n_i} X^i_j$ for $i \in \{1, \dots, k\}$. Clearly, $Y = (Y^1, \dots, Y^k)$ and $Z = (Z^1, \dots, Z^k)$ are Gaussian vectors with $\mathbb{E}[Y^i] = \mathbb{E}[Z^i]$ for all $i \in \{1, \dots, k\}$, and
$$\gamma^Y_{i,i'} := \mathbb{E}\big[(Y^i - Y^{i'})^2\big] = \frac{2\sigma^2}{n} \quad \text{and} \quad \gamma^Z_{i,i'} := \mathbb{E}\big[(Z^i - Z^{i'})^2\big] = \sigma^2\Big(\frac{1}{n_i} + \frac{1}{n_{i'}}\Big)$$
for all $i, i' \in \{1, \dots, k\}$ with $i \neq i'$. Thus, (2) is an immediate consequence of (Adler & Taylor, 2007, Theorem 2.2.5, Eq. (2.2.11)). If, in addition, $\tfrac{1}{n_i} + \tfrac{1}{n_{i'}} \geq \tfrac{2}{n}$ for all $i \neq i'$, then $\gamma^Y_{i,i'} \leq \gamma^Z_{i,i'}$, and (3) follows from (Adler & Taylor, 2007, Theorem 2.2.5, Eq. (2.2.12)).
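A quick Monte-Carlo illustration of inequality (3); all sizes are arbitrary choices, and with $n = 20$ the unequal counts below satisfy the pairwise condition $\tfrac{1}{n_i} + \tfrac{1}{n_j} \geq \tfrac{2}{n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, trials = 5, 20, 10000
unequal = [10, 12, 14, 16, 18]   # sum 70 <= k*n; pairwise 1/n_i + 1/n_j >= 2/n

def exp_max_of_means(counts):
    # expected maximum over k sample means with the given per-arm sample counts
    return np.mean([max(rng.normal(0.0, 1.0, c).mean() for c in counts)
                    for _ in range(trials)])

print("balanced  :", exp_max_of_means([n] * k))  # smaller expected maximum
print("unbalanced:", exp_max_of_means(unequal))  # larger, as (3) predicts
```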
We are now in a position to give a lower bound on the overestimation bias for all exploration policies under the typical $\frac{1}{\#\text{visits}}$ step-size schedule. This extends the overestimation analysis of (van Hasselt, 2011), which was based on a simple synchronous exploration mechanism. The step-size schedule is crucial for our probabilistic analysis, as it turns the analysis into a computation with sums and maxima of independent random variables. The sum structure is a consequence of the simple fact that $z_t := \frac{1}{t}\sum_{i=1}^{t} a_{i-1}$ solves the stochastic approximation recursion $z_{t+1} = (1 - \alpha_t) z_t + \alpha_t a_t$, $z_0 = 0$, with $\alpha_t = \frac{1}{t+1}$. Here is a formal lower bound on the overestimation bias in the episodic bandit MDP. As mentioned above, we only consider exploration of the left side of the bandit MDP.

**Theorem A.2.** Suppose $Q_0$ is initialized as the zero matrix and step sizes are chosen as $\alpha_t(s, a) = \frac{1}{T_{s,a}(t)}$, with $T_{s,a}(t)$ the number of updates of $Q(s, a)$. If rewards are $\mathcal{N}(\mu, \sigma^2)$-distributed, then every sufficiently exploratory exploration rule leads to an overestimation bias of at least
$$\mathbb{E}[Q_{Nk}(s_0, \text{left})] - Q^*(s_0, \text{left}) \;\geq\; \gamma\sigma\sqrt{\frac{\log k}{\pi N \log 2}}.$$
By sufficiently exploratory we mean that $\frac{1}{n_a(t)} + \frac{1}{n_{a'}(t)} \geq \frac{2}{N}$ for all actions $a \neq a'$, with $n_a(t) = T_{s_1,a}(t)$.

In simple words, the expected overestimation in $n$ episodes of Q-learning on the left side of the bandit MDP is at least of the order $\sigma\sqrt{k\log(k)/n}$. Higher variance and more actions obviously lead to stronger overestimation.

*Proof.* In contrast to (van Hasselt, 2011) we do not assume synchronous exploration of all actions in each episode. Instead, we give a comparison argument that reduces sufficiently exploratory exploration to simple cyclic exploration. The approach is motivated by the target network (here: target matrix) trick from DQN (Mnih et al., 2015). With target networks the update works as follows:
$$Q_{t+1}(s, a) \leftarrow (1 - \alpha_t)\, Q_t(s, a) + \alpha_t\big(r + \gamma \max_{a'} \bar{Q}_t(s', a')\big),$$
where $\bar{Q}$ is kept constant for a fixed number of steps and then updated from the current Q-matrix. Although the target network was introduced in DQN to reduce overfitting in function approximation, it also reduces the overestimation of Q-values. We use the latter effect locally in this proof to construct a lower bound. The target matrix is kept constant (the zero matrix) at $s_0$ until the last update. We show that (i) this target-matrix trick with cyclic exploration yields a lower bound for the overestimation bias of standard Q-learning with a sufficiently exploratory exploration strategy, and (ii) it allows for computations using elements of probability theory. In what follows we denote by $X^a_i$ the reward obtained when choosing $a$ for the $i$-th time. By assumption all $(X^a_i)$ are iid.

Step 1: In the first step we show that cyclic exploration (playing one action after the other) for $N$ rounds ($Nk$ steps in total), followed by one update at $(s_0, \text{left})$, minimizes the overestimation among Q-learning estimators $\hat{Q}^{\text{cyc,tar}}(s_0, \text{left})$ that explore each action at most $N$ times. For the claim we compare arbitrary Q-learning with the cyclic variant. Recalling the Q-learning update (with arbitrary exploration rule) and the step-size schedule shows that
$$Q_t(s_1, a) = \frac{1}{T_{s_1,a}(t)}\sum_{j=1}^{T_{s_1,a}(t)} X^a_j \sim \mathcal{N}\Big(\mu, \frac{\sigma^2}{T_{s_1,a}(t)}\Big).$$
Since $s_0$ is explored once per episode, the step-size schedule of regular Q-learning yields
$$Q_{Nk}(s_0, \text{left}) = \frac{1}{N}\sum_{t=1}^{N} \gamma \max_{a} Q_t(s_1, a)$$
for the $N$-th episode, where $Q_t(s_1, \cdot)$ denotes the estimate at the time of the $t$-th update of $(s_0, \text{left})$. We next show that $\mathbb{E}[Q_{Nk}(s_0, \text{left})] \geq \mathbb{E}[\hat{Q}^{\text{cyc,tar}}(s_0, \text{left})]$.
Denoting $n_{a}(t) = T_{s_1,a}(t)$ and using Lemma A.1 we obtain
$$\begin{aligned}
\mathbb{E}[Q_{Nk}(s_0, \text{left})] &= \mathbb{E}\Big[\frac{1}{N}\sum_{t=1}^{N} \gamma \max\big\{Q_t(s_1, a_1), \dots, Q_t(s_1, a_k)\big\}\Big] \\
&= \frac{1}{N}\sum_{t=1}^{N} \gamma\, \mathbb{E}\Big[\max\Big\{\frac{1}{n_{a_1}(t)}\sum_{j=1}^{n_{a_1}(t)} X^{a_1}_j, \dots, \frac{1}{n_{a_k}(t)}\sum_{j=1}^{n_{a_k}(t)} X^{a_k}_j\Big\}\Big] \\
&\geq \frac{1}{N}\sum_{t=1}^{N} \gamma\, \mathbb{E}\Big[\max\Big\{\frac{1}{N}\sum_{j=1}^{N} X^{a_1}_j, \dots, \frac{1}{N}\sum_{j=1}^{N} X^{a_k}_j\Big\}\Big] \\
&= \gamma\, \mathbb{E}\Big[\max\Big\{\frac{1}{N}\sum_{j=1}^{N} X^{a_1}_j, \dots, \frac{1}{N}\sum_{j=1}^{N} X^{a_k}_j\Big\}\Big] = \mathbb{E}[\hat{Q}^{\text{cyc,tar}}(s_0, \text{left})].
\end{aligned}$$

Step 2: We now analyze the overestimation bias for Q-learning with the target-matrix trick after $Nk$ steps. To deduce the claim we use a fact about the expectation of the maximum of independent $\mathcal{N}(\mu, \sigma^2)$-distributed random variables:
$$\mu + \sigma\sqrt{\frac{\log(k)}{\pi \log(2)}} \;\leq\; \mathbb{E}[\max\{X_1, \dots, X_k\}] \;\leq\; \mu + \sigma\sqrt{2\log(k)}.$$
The inequality can for instance be found in (van Handel, 2016). According to the step-size schedule, in cyclic exploration every action is played $N$ times and
$$Q^{\text{cyc,tar}}_{Nk}(s_1, a) = \frac{1}{N}\sum_{i=1}^{N} X^a_i$$
for independent $\mathcal{N}(\mu, \sigma^2)$-distributed reward samples. Thus, $Q_{Nk}(s_1, a_1), \dots, Q_{Nk}(s_1, a_k)$ are iid $\mathcal{N}(\mu, \frac{\sigma^2}{N})$-distributed. The final update after $N$ episodes sets
$$\hat{Q}^{\text{cyc,tar}}(s_0, \text{left}) = \gamma \max\{Q_{Nk}(s_1, a_1), \dots, Q_{Nk}(s_1, a_k)\} \overset{d}{=} \gamma \max\{Z_1, \dots, Z_k\}$$
for $Z_1, \dots, Z_k$ iid $\mathcal{N}(\mu, \frac{\sigma^2}{N})$. The claim follows from the lower bound on the expectation of maxima of independent Gaussians.

The computations give a number of quantitative insights. First, it becomes very clear why explicit computations for Q-learning are complicated: the stochastic approximation update is closely related to sums (exactly equal to sums for $\frac{1}{\#\text{visits}}$ step sizes), while the update target involves a maximum. Unfortunately, sums and maxima of random variables do not get along well. We thus performed computations for Gaussian reward distributions, for which sums are again Gaussian. Maxima of Gaussian random variables (processes) are a well-studied field in mathematics, and estimates can be derived. Finally, the computations show how variances influence the overestimation problem: stochastic rewards with high variance are more problematic than rewards with small variance. In the next section we study distributional RL in the same simple problem.

### A.2. On the use of variance in local overestimation control via distributional RL (Proof of Proposition 2.2)

Our computations for the bandit MDP above suggest that uncertainty at next states (aleatoric uncertainty of rewards, epistemic uncertainty through small $N$) leads to larger overestimation. Thus, algorithms mitigating the overestimation problem locally at $(s, a)$ must somehow use uncertainty at next states $s'$. In the following proposition we analyze the distributional Q-learning update scheme (compare Section 8 of (Rowland et al., 2018))
$$\eta(s, a) \leftarrow (1 - \alpha)\, \eta(s, a) + \alpha\, b_{r,\gamma\#}\eta(s', a^*), \quad \text{with } a^* = \operatorname*{argmax}_{a'} Q(s', a'),$$
where $Q(s', a')$ are the (random) sample means of the estimated return distributions $\eta(s', a')$. The bootstrap function is $b_{r,\gamma}(z) = r + \gamma z$, and $f_\#\nu(B) := \nu(f^{-1}(B))$ denotes the push-forward of measures. Recall from the main text that distributional QL learns random measures: $\eta(s, a)$ itself is a probability measure that depends on the random samples used in the updates. As a probability measure, $\eta(s, a)$ has an expectation and a variance that are both random in terms of the random samples. The situation becomes a bit tricky as we are going to take variances of the variances.
A.2. On the use of variance in local overestimation control via distributional RL (Proof of Proposition 2.2)

Our computations for the bandit MDP above suggest that uncertainty at next states (aleatoric uncertainty of rewards, epistemic uncertainty through small $N$) leads to larger overestimation. Thus, algorithms mitigating the overestimation problem locally at $(s,a)$ must somehow use uncertainty at next states $s'$. In the following proposition we analyze the distributional Q-learning update scheme (compare Section 8 of (Rowland et al., 2018))
$$\eta(s,a) \leftarrow (1-\alpha)\eta(s,a) + \alpha\, b_{r,\gamma\#}\eta(s',a^*), \qquad a^* = \operatorname*{argmax}_{a'} Q(s',a'),$$
where $Q(s',a')$ are the (random) sample means of the estimated return distributions $\eta(s',a')$. The bootstrap function is $b_{r,\gamma}(z) = r + \gamma z$ and $f_\#\nu(B) := \nu(f^{-1}(B))$ denotes the push-forward of measures. Recall from the main text that distributional QL learns random measures: $\eta(s,a)$ itself is a probability measure that depends on the random samples used in the updates. As a probability measure, $\eta(s,a)$ has an expectation and a variance that are both random in terms of the random samples. The situation becomes a bit tricky as we are going to take variances of the variances. To avoid confusion we use an analogy to statistics and speak of (random) sample averages $M$ (resp. sample variances $S^2$) of $\eta(s,a)$, and of expectation $\mathbb E$ (resp. variance $\mathbb V$) for the integrals against the true randomness induced by the probability space behind all appearing random variables.

Motivated by the previous section we only study cyclic exploration with target measure $\delta_0$, i.e. all actions in $s_1$ are explored $N$ times before bootstrapping the estimates to $s_0$. Using the standard 1/#visits step-size schedule allows us to carry out fairly explicit computations using tools from probability theory. Most importantly, we can derive the exact distribution of the sample variance of $\eta(s_0,\text{left})$, the state-action pair at risk of overestimation. The insight obtained from the computation motivates our ADDQ algorithm.

Proposition A.3. Consider the bandit MDP introduced above with $\mathcal N(\mu,\sigma^2)$ reward distributions and $k$ actions. Denote by $\hat\eta^{\mathrm{cyc,tar}}(s_0,\text{left})$ the return distribution of distributional Q-learning with cyclic exploration, step-size schedule $\alpha_t(s,a) = \frac{1}{T_{s,a}(t)}$, and target distribution $\delta_0$; more precisely, all arms are explored $N$ times before the first and only update at $(s_0,\text{left})$. It then follows that
$$S^2\big(\hat\eta^{\mathrm{cyc,tar}}(s_0,\text{left})\big) \;\sim\; \frac{\gamma^2\sigma^2}{N-1}\,\chi^2_{N-1},$$
where $\chi^2_n$ denotes the chi-squared distribution with $n$ degrees of freedom.

It is interesting to note that while there is a simple distributional expression for the sample variance, there is no simple expression for the distribution of the sample mean $\gamma\max\{\frac1N\sum_{i=1}^N X^{a_1}_i,\dots,\frac1N\sum_{i=1}^N X^{a_k}_i\}$, which is the maximum of independent Gaussians.

Proof. For the proof we need to spell out the distributional Q-learning update and then use basic results from statistics on sample means and sample variances. Let us first understand the distributional update at the last step. Similar to scalar stochastic approximation schemes, the $\frac{1}{n+1}$ step-size schedule also gives the distributional estimator
$$\hat\eta^{\mathrm{cyc,tar}}(s_1,a) = \frac1N\sum_{i=1}^N \delta_{X^a_i} \qquad (4)$$
for the pre-terminal state $s_1$ after $N$ explorations of action $a$. To see why, write $\eta_n := \frac1n\sum_{i=1}^n \delta_{X^a_i}$ to get
$$\eta_{n+1} = \frac{1}{n+1}\sum_{i=1}^{n+1}\delta_{X^a_i} = \Big(1-\frac{1}{n+1}\Big)\frac1n\sum_{i=1}^{n}\delta_{X^a_i} + \frac{1}{n+1}\delta_{X^a_{n+1}} = (1-\alpha_n)\eta_n + \alpha_n\delta_{X^a_{n+1}},$$
and recall that there is no max-term in the update for pre-terminal states. Note that the equal weights $\frac1N$ are a consequence of the step-size schedule. To compute the return distribution at $s_0$ for action left we need to identify $a^*$. For that sake we must identify the sample means of $\hat\eta^{\mathrm{cyc,tar}}(s_1,a)$:
$$\hat Q^{\mathrm{cyc,tar}}(s_1,a) = \frac1N\sum_{i=1}^N X^a_i \;\sim\; \mathcal N(\mu,\sigma^2/N).$$
Now we choose $a^* = \operatorname{argmax}_a \hat Q^{\mathrm{cyc,tar}}(s_1,a)$ and set $\hat\eta^{\mathrm{cyc,tar}}(s_0,\text{left}) = b_{0,\gamma\#}\hat\eta^{\mathrm{cyc,tar}}(s_1,a^*) =: \gamma X$. While we do not know much about $X$ (it is the empirical distribution of the set of Gaussians $X^{a^*}_1,\dots,X^{a^*}_N$ with maximal sum), we know the exact distribution of the sample variance for the following reason. The sample variance of every $\hat\eta^{\mathrm{cyc,tar}}(s_1,a)$ is nothing but what is called the sample variance of the iid observations $X^a_1,\dots,X^a_N$ in statistics. If $M_a := \frac1N\sum_{i=1}^N X^a_i$ is the sample mean and $S^2_a := \frac{1}{N-1}\sum_{i=1}^N (X^a_i - M_a)^2$ the sample variance, then it is well-known that
- $\mathbb E[S^2_a] = \sigma^2$,
- $S^2_a$ is $\frac{\sigma^2}{N-1}\chi^2_{N-1}$-distributed,
- $M_a$ and $S^2_a$ are independent.
The third property implies that if $S^2_{a_1},\dots,S^2_{a_k}$ are independent sample variances and $a^*$ is chosen to maximize the sample means, then still $S^2_{a^*} \sim \frac{\sigma^2}{N-1}\chi^2_{N-1}$, while $M_{a^*}$ is not distributed as $M_a$. As a consequence, even though we cannot compute the distribution of the sample mean of $\hat\eta^{\mathrm{cyc,tar}}(s_0,\text{left})$, we can compute the distribution of its sample variance as $\frac{\gamma^2\sigma^2}{N-1}\chi^2_{N-1}$. The expectation of $S^2_{a^*}$ is $\mathbb E[S^2_{a^*}] = \sigma^2$, while the variance is $\mathbb V(S^2_{a^*}) = \frac{2\sigma^4}{N-1}$. $\square$
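The distributional statement of Proposition A.3 can also be checked numerically. The following small sketch (ours, with illustrative names) confirms that selecting the arm with maximal sample mean leaves the law of the sample variance untouched:

```python
import numpy as np

def selected_arm_variance(k=10, N=20, mu=0.0, sigma=1.0, runs=50000, seed=1):
    """Draw N rewards per arm, select the arm with maximal sample mean, and
    record its unbiased sample variance. By the independence of sample mean
    and sample variance for Gaussians, the selection does not distort the
    variance law sigma^2/(N-1) * chi^2_{N-1}."""
    rng = np.random.default_rng(seed)
    X = rng.normal(mu, sigma, size=(runs, k, N))
    sel = X.mean(axis=2).argmax(axis=1)               # arm with maximal sample mean
    S2 = X.var(axis=2, ddof=1)[np.arange(runs), sel]  # its sample variance
    return S2

S2 = selected_arm_variance()
print("E[S^2] ~", S2.mean(), "(theory: sigma^2 = 1)")
print("V[S^2] ~", S2.var(), "(theory: 2*sigma^4/(N-1) =", 2 / 19, ")")
```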
Note that one might be tempted to believe the sample variance is independent of the number $k$ of actions. This is not the case, as the total number $Nk$ of exploration steps needed for $N$ explorations per arm depends on $k$. Alternatively, one might denote the total number of episodes by $n$ and replace $N$ by $n/k$.

We now come back to the motivation of our ADDQ algorithm, in particular the choice of $\beta$. Let us consider the two-sided bandit MDP with $\mathcal N(\mu_1,\sigma_1^2)$ (resp. $\mathcal N(\mu_2,\sigma_2^2)$) distributed rewards. Suppose $\mu_1 > \mu_2$ are small but $\sigma_1^2 \ll \sigma_2^2$ and/or $k_1 \ll k_2$. The MDP is delicate, as the following can happen using QL (overestimation) and DQL (global overestimation reduction). In QL the agent will believe for a long time that right is the optimal action in $s_0$. In DQL the agent will believe for a long time that down is the optimal action, as both non-trivial Q-values can be underestimated to be negative. It is thus more reasonable to mitigate overestimation locally instead of globally. This is what ADDQ achieves through the choice of $\beta$. To use our explicit computations above, suppose the agent explores both sides with cyclic exploration and target-matrix updates, and assume both sides have been explored for a total number of $n$ episodes each. The agent knows the estimates $\hat\eta(s_0,\text{left})$ (resp. $\hat\eta(s_0,\text{right})$) and the corresponding Q-values (sample means of the return distributions) $\hat Q(s_0,\text{left})$ (resp. $\hat Q(s_0,\text{right})$). By the overestimation study of the previous section, the agent has unwittingly overestimated the true Q-values by the order of $\sigma_1\sqrt{k_1\log(k_1)/n}$ (resp. $\sigma_2\sqrt{k_2\log(k_2)/n}$) and will believe right is the correct action. A smart agent, however, also takes into account the sample variances, which are known to the agent and which (as we showed) are of the order $\sigma_1^2$ (resp. $\sigma_2^2$), with concentration depending on $k_1$ (resp. $k_2$). The local overestimation control of $\beta$ (based on relative sample variances) from (1) thus compares $\sigma_1^2$ to $\sigma_2^2$ and suggests mitigating the overestimation of $\hat Q(s_0,\text{right})$. In our algorithm we use double Q-learning to mitigate; the same idea can of course be integrated into other algorithms (such as changing the number of truncation atoms in truncated quantile critics (Kuznetsov et al., 2020)).

B. Experimental confirmation of theoretical results for the two-sided bandit MDP

We provide an experimental analysis of the two-sided bandit MDP for which we proved theoretical results on the overestimation. For reimplementation purposes we collect here all required information on the environment and the training for all plots.

Environment:
- $\mu_1 = -0.1$, $\mu_2 = 0.1$, $k_2 = 5$, $\sigma_2 = 1$; the correct decision is thus moving to the right in the start state.

Distributional properties for ADDQ:
- categorical parametrization, 51 atoms equally spaced on $[-3,3]$,
- initialization as $\delta_0$.

Algorithmic choices:
- $\beta$ is chosen according to (1),
- step-size schedule $\alpha_t(s,a) = \frac{1}{T_{s,a}(t)}$, with $T_{s,a}(t)$ the number of visits to $(s,a)$ up to time $t$, i.e. $\frac1n$ counted state-action-wise,
- exploration: either $\epsilon$-greedy with $\epsilon$ linearly decreasing from 1 to 0.1 in 10000 steps, then constant (E), or uniform random (U).

Policy evaluation:
- evaluation of 3 steps with a frequency of every 500 steps using the current greedy policy; correct action rates refer to the greedy action.

In order to demonstrate the proven growth of the overestimation of Q-values by QL in the number of arms and in the variance on the left side, we keep one of the two quantities fixed and compare different values of the other.
- First experiment: $\sigma_1 = 5$ fixed, iterating over $k_1 = 5, 10, 15, 20$, denoted as K in the legend.
- Second experiment: $k_1 = 10$ fixed, iterating over $\sigma_1 = 2, 4, 6, 8$, denoted as S in the legend.

The plots also demonstrate that:
- ADDQ is much better in terms of bias, leveraging the local information given by the (relative) variances,
- the variances and relative variances at state 0 capture the real variances given by the MDP,
- although our theorems assume a sufficiently exploratory policy, the results seem to generalize to the much more commonly used $\epsilon$-greedy setting.

Figure 6. Comparing ADDQ and QL on the two-sided bandit MDP with different numbers of arms on the left side and different exploration settings.

Figure 7. Comparing ADDQ and QL on the two-sided bandit MDP with different variances on the left side and different exploration settings.

C. Grid world details

C.1. Experiment from the main text

For reimplementation purposes we collect here all required information on the environment and the training for all plots.

Environment:
- white low-stochasticity, high-average-reward region: rewards $-0.05$, $+0.05$ with equal probabilities,
- gray high-stochasticity, low-average-reward region: rewards $-2.1$, $+2$ with equal probabilities,
- goal: deterministic reward of 1; fake goal: deterministic reward of 0.65.

Distributional properties for Categorical Q, Categorical Double, and ADDQ:
- categorical parametrization, 51 atoms equally spaced on $[-3,3]$,
- initialization as $\delta_0$.

Algorithmic choices:
- $\beta$ is chosen according to (1),
- step-size schedule $\alpha_t(s,a) = \frac{1}{T_{s,a}(t)}$, with $T_{s,a}(t)$ the number of visits to $(s,a)$ up to time $t$, i.e. $\frac1n$ counted state-action-wise,
- exploration: $\varepsilon$-greedy with $\varepsilon$ linearly decreasing from 1 to 0.1 in 10000 steps, then constant.

Policy evaluation:
- evaluation of 6 steps with a frequency of every 500 steps; correct action rates refer to the greedy action.

For completeness we give plots pairing learning curves for Q-values with the corresponding sample variances. We give all state-action combinations for the interesting states 1 and 4 next to the fake goal, 6 and 7 next to the region with high stochasticity, and 10 and 14 before leaving the region with high stochasticity.

Figure 8. Additional plots.

Finally, the following plot demonstrates that, in this grid world example, the relative variances are also strongly determined by the variances of the next state.

C.2. Ablation study

In the following, on the same environment as before, we experiment with different choices of $\beta$ and compare with weighted DQL (Zhang et al., 2017) with $c = 10$ (WDQ), the standard choice from that paper. The names of the $\beta$ choices are composed as follows:
- (optional) first two letters: left-tilted (lt), right-tilted (rt),
- first/third letter: neutral (n), aggressive (a), conservative (c),
- final digit: the number of intervals in the definition of $\beta$ (3 or 5).

The choices aggressive, conservative, and neutral refer to the trade-off between no interpolation (just choosing which algorithm's update to take) and softening the interpolation, with neutral in between the two and corresponding to the choice presented in the main body.
Left- and right-tilted refer to shifting the intervals into which the relative variance falls when choosing the interpolation coefficient; left-tilted favors the Q update, right-tilted the DQ update. The concrete choices are (writing $S^2_{\mathrm{rel}} = S^2_{\mathrm{rel}}(s,a)$ for short):

Three intervals:
- n3: $\beta = 0.75$ if $S^2_{\mathrm{rel}} < 0.75$, $\;0.5$ on $[0.75, 1.25]$, $\;0.25$ if $S^2_{\mathrm{rel}} > 1.25$
- a3: $\beta = 1$ if $S^2_{\mathrm{rel}} < 0.99$, $\;0.5$ on $[0.99, 1.01]$, $\;0$ if $S^2_{\mathrm{rel}} > 1.01$
- c3: $\beta = 0.6$ if $S^2_{\mathrm{rel}} < 0.6$, $\;0.5$ on $[0.6, 1.4]$, $\;0.4$ if $S^2_{\mathrm{rel}} > 1.4$
- ltn3: $\beta = 0.75$ if $S^2_{\mathrm{rel}} < 1.25$, $\;0.5$ on $[1.25, 1.75]$, $\;0.25$ if $S^2_{\mathrm{rel}} > 1.75$
- lta3: $\beta = 1$ if $S^2_{\mathrm{rel}} < 1.49$, $\;0.5$ on $[1.49, 1.51]$, $\;0$ if $S^2_{\mathrm{rel}} > 1.51$
- ltc3: $\beta = 0.6$ if $S^2_{\mathrm{rel}} < 1.1$, $\;0.5$ on $[1.1, 1.9]$, $\;0.4$ if $S^2_{\mathrm{rel}} > 1.9$
- rtn3: $\beta = 0.75$ if $S^2_{\mathrm{rel}} < 0.25$, $\;0.5$ on $[0.25, 0.75]$, $\;0.25$ if $S^2_{\mathrm{rel}} > 0.75$
- rta3: $\beta = 1$ if $S^2_{\mathrm{rel}} < 0.49$, $\;0.5$ on $[0.49, 0.51]$, $\;0$ if $S^2_{\mathrm{rel}} > 0.51$
- rtc3: $\beta = 0.6$ if $S^2_{\mathrm{rel}} < 0.1$, $\;0.5$ on $[0.1, 0.9]$, $\;0.4$ if $S^2_{\mathrm{rel}} > 0.9$

Five intervals:
- n5: $\beta = 1$ if $S^2_{\mathrm{rel}} \le 0.25$, $\;0.75$ on $(0.25, 0.75)$, $\;0.5$ on $[0.75, 1.25]$, $\;0.25$ on $(1.25, 1.75)$, $\;0$ if $S^2_{\mathrm{rel}} \ge 1.75$
- a5: $\beta = 1$ if $S^2_{\mathrm{rel}} \le 0.99$, $\;0.75$ on $(0.99, 0.995)$, $\;0.5$ on $[0.995, 1.005]$, $\;0.25$ on $(1.005, 1.01)$, $\;0$ if $S^2_{\mathrm{rel}} \ge 1.01$
- c5: $\beta = 0.7$ if $S^2_{\mathrm{rel}} \le 0.1$, $\;0.6$ on $(0.1, 0.7)$, $\;0.5$ on $[0.7, 1.3]$, $\;0.4$ on $(1.3, 1.9)$, $\;0.3$ if $S^2_{\mathrm{rel}} \ge 1.9$
- ltn5: $\beta = 1$ if $S^2_{\mathrm{rel}} \le 0.75$, $\;0.75$ on $(0.75, 1.25)$, $\;0.5$ on $[1.25, 1.75]$, $\;0.25$ on $(1.75, 2.25)$, $\;0$ if $S^2_{\mathrm{rel}} \ge 2.25$
- lta5: $\beta = 1$ if $S^2_{\mathrm{rel}} \le 1.49$, $\;0.75$ on $(1.49, 1.495)$, $\;0.5$ on $[1.495, 1.505]$, $\;0.25$ on $(1.505, 1.51)$, $\;0$ if $S^2_{\mathrm{rel}} \ge 1.51$
- ltc5: $\beta = 0.7$ if $S^2_{\mathrm{rel}} \le 0.6$, $\;0.6$ on $(0.6, 1.2)$, $\;0.5$ on $[1.2, 1.8]$, $\;0.4$ on $(1.8, 2.4)$, $\;0.3$ if $S^2_{\mathrm{rel}} \ge 2.4$
- rtn5: $\beta = 1$ if $S^2_{\mathrm{rel}} \le -0.25$, $\;0.75$ on $(-0.25, 0.25)$, $\;0.5$ on $[0.25, 0.75]$, $\;0.25$ on $(0.75, 1.25)$, $\;0$ if $S^2_{\mathrm{rel}} \ge 1.25$
- rta5: $\beta = 1$ if $S^2_{\mathrm{rel}} \le 0.49$, $\;0.75$ on $(0.49, 0.495)$, $\;0.5$ on $[0.495, 0.505]$, $\;0.25$ on $(0.505, 0.51)$, $\;0$ if $S^2_{\mathrm{rel}} \ge 0.51$
- rtc5: $\beta = 0.7$ if $S^2_{\mathrm{rel}} \le -0.4$, $\;0.6$ on $(-0.4, 0.2)$, $\;0.5$ on $[0.2, 0.8]$, $\;0.4$ on $(0.8, 1.4)$, $\;0.3$ if $S^2_{\mathrm{rel}} \ge 1.4$
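As an illustration (ours, not the paper's implementation), each of these schedules is just a piecewise-constant function of the relative sample variance; only the thresholds and values above are taken from the list, everything else is illustrative:

```python
def make_beta(thresholds, values):
    """Piecewise-constant beta schedule: values[i] is used below thresholds[i],
    values[-1] above the last threshold (boundary ties handled loosely)."""
    def beta(s2_rel):
        for thr, val in zip(thresholds, values):
            if s2_rel < thr:
                return val
        return values[-1]
    return beta

beta_n3 = make_beta([0.75, 1.25], [0.75, 0.5, 0.25])                  # neutral, 3 intervals
beta_a3 = make_beta([0.99, 1.01], [1.0, 0.5, 0.0])                    # aggressive, 3 intervals
beta_c5 = make_beta([0.1, 0.7, 1.3, 1.9], [0.7, 0.6, 0.5, 0.4, 0.3])  # conservative, 5 intervals

print(beta_n3(0.5), beta_n3(1.0), beta_n3(2.0))  # -> 0.75 0.5 0.25
```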
Figure 9. Ablation study plots regarding bias improvement. State 1 is adjacent to the fake goal, state 6 adjacent to the stochastic region.

It turns out that:
- the choice of thresholds in $\beta$ (a hyperparameter of the algorithm) is harmless: results do not vary a lot, and conservative choices seem to work especially well,
- weighted double Q-learning, an algorithm that is similar to ADDQ but with a non-distributional choice of $\beta$, is improved by our locally adaptive, distributional-RL-based choice of $\beta$.

C.3. Comparison to other algorithms

We compared ADDQ with other bias-reduction methods on the same grid world environment, using the hyperparameters from the respective papers:
- Maxmin with 2, 4, 6, and 8 ensembles (Lan et al., 2020),
- Ensemble Bootstrapped QL (EBQL) with 3, 7, 10, and 15 ensembles (Peer et al., 2021),
- Randomized Ensemble DQL (REDQ) with 3 and 5 ensembles and random update subsets of size 1 and 2 (Chen et al., 2021).

It turns out that ADDQ decreases the estimation bias more strongly than those algorithms while using only two ensembles.

Figure 10. Comparison to Maxmin.

Figure 11. Comparison to EBQL.

Figure 12. Comparison to REDQ.

D. ADDQ adaptation for QRDQN - setup

To avoid too much duplication in the main text, we moved the adaptation of ADDQ to the quantile setup (Dabney et al., 2018) to the appendix. In this section we explain how to adapt the ADDQ idea to QRDQN; experimental results are provided in the next section. The categorical approach has multiple disadvantages, most notably that rewards and the fixed atom positions must be compatible. The categorical algorithm was included in the main text to keep notation simple. For quantile distributional RL the return distributions are parametrized by
$$F_{Q,m} = \Big\{\frac1m\sum_{i=1}^m \delta_{\theta_i} : \theta_i \in \mathbb R\Big\}.$$
In contrast to the categorical setting, the positions of the atoms are not fixed, but the weights of all atoms are set equal. The computation of the target is equal to the categorical version (except using the projection on $F_{Q,m}$ instead of $F_{C,m}$). The update step is a gradient step towards the Wasserstein projection on $F_{Q,m}$ of the target distribution $\hat\eta$, that is, a gradient step in the quantile Huber-loss minimization
$$\min_{\hat\theta^A_1(s,a),\dots,\hat\theta^A_m(s,a)}\; \sum_{i=1}^m \mathbb E_{Z\sim\hat\eta}\big[\rho^\kappa_{\tau_i}\big(Z - \hat\theta^A_i(s,a)\big)\big],$$
with quantile mid-points $\tau_i = \frac{2i-1}{2m}$ and
$$\rho^\kappa_\tau(u) = \begin{cases} |\tau - 1_{u<0}|\,\frac12 u^2 & : |u| \le \kappa,\\[2pt] |\tau - 1_{u<0}|\,\kappa\big(|u| - \frac12\kappa\big) & : |u| > \kappa.\end{cases}$$
The quantile Q-learning modification of classical DQN (Mnih et al., 2015) is called QRDQN. Using the same network architecture as C51, QRDQN approximates return distributions using the quantile representation; the last layer therefore outputs the $m$ quantile locations for each action. In the quantile setup we write $\eta_\omega(s,a) = \frac1m\sum_{i=1}^m \delta_{\theta_i(s,a;\omega)}$ with induced mean values $Q_\omega(s,a) = \frac1m\sum_{i=1}^m \theta_i(s,a;\omega)$. Given a sample transition $(s,a,r,s')$, the network parameters are updated via gradient descent with respect to the loss function
$$\sum_{i,j=1}^m \rho^1_{\tau_i}\big(r + \gamma\theta_j(s',z^*;\bar\omega) - \theta_i(s,a;\omega)\big), \qquad z^* = \operatorname*{argmax}_{a'} Q_{\bar\omega}(s',a'), \qquad (5)$$
with quantile mid-points $\tau_i = \frac{2i-1}{2m}$.
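For concreteness, a minimal numpy sketch (ours) of the quantile Huber loss and the pairwise loss (5); we average over the target atoms to realize the expectation over $Z\sim\hat\eta$, and all names are illustrative:

```python
import numpy as np

def quantile_huber_loss(theta, targets, kappa=1.0):
    """Pairwise quantile Huber loss between m predicted quantile locations
    theta (shape (m,)) and m' target atoms (shape (m',))."""
    m = theta.shape[0]
    tau = (2.0 * np.arange(1, m + 1) - 1.0) / (2.0 * m)  # quantile mid-points
    u = targets[None, :] - theta[:, None]                # TD errors, shape (m, m')
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    return (np.abs(tau[:, None] - (u < 0.0)) * huber).mean(axis=1).sum()

# toy usage: pull predicted quantiles towards bootstrapped atoms r + gamma * theta'
theta = np.linspace(-1.0, 1.0, 5)
targets = 0.1 + 0.99 * np.linspace(-0.5, 1.5, 5)
print(quantile_huber_loss(theta, targets))
```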
In what follows we turn three known double variants of DQN with overestimation reduction into QRDQN variants and compare them on several Arcade environments to our algorithms. We use double DQN (van Hasselt et al., 2015), a Q-learning adaptation of the clipping trick included in TD3 (Fujimoto et al., 2018), a quantile version of our ADDQ algorithm, and an additional variant of ADDQ. Using the Stable-Baselines3 framework (Raffin et al., 2021), there is very little that must be modified. Return distributions $\eta^A$ and $\eta^B$ are parametrized by two independently initialized neural networks, denoted by $\omega^A$ and $\omega^B$. As in DQN we use delayed target networks, one for A and one for B, indicated with an additional bar. For each gradient step we simulate a vector of random variables of the same size as the batch, each element determining which of the two estimators is updated with the transition at the same position in the batch. Accordingly, we use twice the batch size for these methods, so that on average the same number of transitions per gradient step is used for each estimator as in the single-estimator case. The only difference between the algorithms is the target used to update the neural networks. As when modifying C51 implementations, we only modify the target return distributions $b_{r,\gamma\#}\eta_{\bar\omega}(s',z^*)$ for an appropriate action $z^*$. In the quantile setup these are given by the locations of their atoms:
$$\Gamma = \{r + \gamma\theta_j(s',z^*;\bar\omega) : j = 1,\dots,m\}.$$
We again use the compact A/B notation to indicate how the update applies to $\Gamma^A$ and $\Gamma^B$.

- Double QRDQN: $\Gamma^{A/B} = \eta_{\bar\omega^{B/A}}(s',z^*)$, where $z^* = \operatorname{argmax}_{a'} Q_{\bar\omega^{A/B}}(s',a')$.
- Clipped QRDQN: $\Gamma^{A/B} = \eta_{\bar\omega^{X}}(s',z^*)$, where $z^* = \operatorname{argmax}_{a'} Q_{\bar\omega^{A/B}}(s',a')$ and $X = \operatorname{argmin}_{c\in\{A,B\}} Q_{\bar\omega^{c}}(s',z^*)$.
- ADDQ (ours): $\Gamma^{A/B} = \beta\,\eta_{\bar\omega^{A/B}}(s',z^*) + (1-\beta)\,\eta_{\bar\omega^{B/A}}(s',z^*)$, where $z^* = \operatorname{argmax}_{a'} Q_{\bar\omega^{A/B}}(s',a')$.

The weights $\beta = \beta(s,a;\omega)$ are essentially arbitrary and can depend on the current estimated return distributions $\eta^A$ and $\eta^B$. For the experiments we take the same choice (1) as used for the tabular and the categorical settings. Experimental results are presented in the next section.
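Since the quantile Huber loss is an expectation over the target distribution, the mixture target of ADDQ needs no extra projection: the loss against $\beta\,\eta_{\bar\omega^{A}} + (1-\beta)\,\eta_{\bar\omega^{B}}$ is the $\beta$-weighted sum of the two individual losses. A sketch (ours, reusing quantile_huber_loss from the previous snippet; shapes and names are illustrative):

```python
def addq_quantile_loss(theta, atoms_a, atoms_b, beta, r, gamma, kappa=1.0):
    """Loss of the A-network against the ADDQ mixture target: by linearity of
    the expectation over the target distribution, the loss against
    beta*eta_A + (1-beta)*eta_B is the beta-weighted sum of the losses
    against the two bootstrapped atom sets."""
    za = r + gamma * atoms_a  # atoms of the A target network at (s', z*)
    zb = r + gamma * atoms_b  # atoms of the B target network at (s', z*)
    return (beta * quantile_huber_loss(theta, za, kappa)
            + (1.0 - beta) * quantile_huber_loss(theta, zb, kappa))
```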
E. Deep reinforcement learning experiments

To ensure a fair comparison we modified the algorithms C51 (Bellemare et al., 2017) and QRDQN (Dabney et al., 2018) within the Stable-Baselines3 framework (Raffin et al., 2021). The C51 implementation has been added to this framework by adapting from the Dopamine framework (Castro et al., 2018) and the DQN Zoo (Quan & Ostrovski, 2020). We run Atari environments from the Arcade Learning Environment (Bellemare et al., 2013) and MuJoCo environments (Todorov et al., 2012), both using the Gymnasium API (Towers et al., 2023). We run the experiments via the RL Baselines3 Zoo (Raffin, 2020) training framework. The experiments were executed on an HPC cluster with NVIDIA Tesla V100 and NVIDIA A100 GPUs. The replay buffer takes around 57 GB of memory on Atari environments and less than 7 GB on MuJoCo environments. Training was interrupted every 50000 steps, and 10 evaluation episodes on 10 evaluation environments without exploration were performed. The plots below show the mean total reward (sum of all rewards) averaged over 10 seeds, with standard errors across seeds as the shaded regions. To improve visibility a rolling window of size 4 is applied. Atari runs took less than 48 hours for 20 million train steps and periodic evaluations, MuJoCo runs less than 36 hours for 10 million train steps (Humanoid) and periodic evaluations. Note that one timestep in the Atari environments corresponds to 4 frames, which are stacked together; this corresponds to repeating every action 4 times in the actual game. Therefore 20 million timesteps correspond to 80 million frames. Additionally, a small ablation study comprising 10 seeds on one evaluation environment with some of the choices of $\beta$ detailed in Appendix C.2 has been conducted.

E.1. Full experimental results for the categorical parametrization

As in (Bellemare et al., 2017) we use 51 atoms for all C51 variants.

Figure 13. Learning curves on 10 Atari environments, averaged over 10 seeds.

We additionally provide plots using the RLiable library (Agarwal et al., 2021).

Figure 14. RLiable probability of improvement plot.

Figure 15. RLiable human normalized scores plot (based on (Badia et al., 2020)).

E.2. Full experimental results for the quantile parametrization

Figure 16. Learning curves on 10 Atari environments, averaged over 10 seeds.

Figure 17. RLiable probability of improvement plot.

Figure 18. RLiable human normalized scores plot (based on (Badia et al., 2020)).

E.3. Full experimental results for the actor-critic setting

Figure 19. Learning curves on 5 MuJoCo environments, averaged over 10 seeds.

Figure 20. RLiable probability of improvement plot.

Figure 21. RLiable normalized scores plot, based on the highest and lowest performance on each environment.

F. Convergence proof in the tabular categorical setup

In this section we give a convergence proof for the adaptive distributional double-Q algorithm in the simplest setting, the categorical setting. The proof is based on known arguments from the literature and requires some modifications to work in our generality. Since many papers only sketched proofs, we decided to spell out all details.

Remark F.1 (Notation and short recap). The Cramér distance $\ell_2$ for probability distributions $\nu,\nu' \in \mathcal P(\mathbb R)$ is given by
$$\ell_2(\nu,\nu') = \Big(\int_{\mathbb R} |F_\nu(z) - F_{\nu'}(z)|^2\, dz\Big)^{1/2}.$$
Following (Rowland et al., 2018; Bellemare et al., 2023), the supremum extension of a probability metric $d$ between two return distribution functions $\eta,\eta' \in \mathcal P(\mathbb R)^{S\times A}$ is denoted by
$$\bar d(\eta,\eta') = \sup_{(s,a)\in S\times A} d\big(\eta(s,a),\eta'(s,a)\big).$$
The iterates $\eta_{k+1} = \Pi_C T^\pi \eta_k$ converge to the unique fixed point in $F^{S\times A}_{C,m}$ with respect to $\bar\ell_2$ by Banach's fixed point theorem. This follows from the contraction property
$$\bar\ell_2(\Pi_C T^\pi\eta,\ \Pi_C T^\pi\eta') \le \sqrt{\gamma}\,\bar\ell_2(\eta,\eta'), \qquad (6)$$
compare (Rowland et al., 2018; Bellemare et al., 2023).
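For categorical distributions on a common atom grid, the Cramér distance of Remark F.1 has a closed form, since the CDF difference is piecewise constant between atoms. A small sketch (ours, with illustrative names):

```python
import numpy as np

def cramer_distance(p, q, atoms):
    """l2 (Cramer) distance between two categorical distributions p and q on
    the same increasing atom grid: their CDF difference is piecewise constant,
    so the integral reduces to a finite sum over the gaps between atoms."""
    Fp, Fq = np.cumsum(p), np.cumsum(q)
    gaps = np.diff(atoms)
    return np.sqrt(np.sum((Fp[:-1] - Fq[:-1]) ** 2 * gaps))

atoms = np.linspace(-3.0, 3.0, 51)  # the 51-atom grid used in the experiments
p = np.zeros(51); p[25] = 1.0       # the delta_0 initialization
q = np.full(51, 1.0 / 51)           # uniform weights over the atoms
print(cramer_distance(p, q, atoms))
```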
Theorem F.2 (Convergence of adaptive distributional Q-learning in the categorical setting). Given some initial return distribution functions $\eta^A_0, \eta^B_0$ supported within $[\theta_1,\theta_m]$, the induced Q-values, i.e. the expected values of the return distributions $(\eta^A_t), (\eta^B_t)$ recursively defined by Algorithm 1, converge almost surely towards $Q^*$ if the following conditions are satisfied:
1. the step sizes $\alpha_t(s,a)$ almost surely fulfill the Robbins-Monro conditions $\sum_{t=0}^\infty \alpha_t(s,a) = \infty$ and $\sum_{t=0}^\infty \alpha_t^2(s,a) < \infty$,
2. rewards are bounded in $[R_{\min}, R_{\max}]$ and $[\frac{R_{\min}}{1-\gamma}, \frac{R_{\max}}{1-\gamma}] \subseteq [\theta_1,\theta_m]$,
3. the choice of updating $\eta^A$ or $\eta^B$ is random and independent of all previous random variables,
4. the sequences $(\beta^A_t)_{t\in\mathbb N}, (\beta^B_t)_{t\in\mathbb N}$ only depend on the past and fulfill $\lim_{t\to\infty}|\beta^A_t - \beta^B_t| = 0$ almost surely.

If additionally the MDP has a unique optimal policy $\pi^*$, then $(\eta^A_t), (\eta^B_t)$ converge almost surely in $\bar\ell_2$ to some limit $\eta^*_C \in F^{S\times A}_{C,m}$, and the greedy policy with respect to $\eta^*_C$ is the optimal policy.

Note that the algorithm and the proof use $\beta^{A/B}_{t+1}(s,a)$ with index $t+1$ when updating $\eta^{A/B}_t$. This is to emphasize that in general the parameter is allowed to depend on $S_{t+1}$ and the respective greedy action, i.e. it must only be $\mathcal F_{t+1}$-measurable. To portray this generality, in the following we will only write $\beta^{A/B}_{t+1}$ without referencing a state-action pair. The simplest way to guarantee the assumptions on the adaptive parameters $\beta^A, \beta^B$ is to choose them equal.

As in (van Hasselt, 2010), the proof is based on the following stochastic approximation result; see also (Bertsekas & Tsitsiklis, 1996), Proposition 4.5.

Lemma F.3 ((Singh et al., 2000), Lemma 1). Suppose $(\Omega,\mathcal A,\mathbb P,(\mathcal F_n))$ is a filtered probability space on which all appearing random variables are defined. Suppose that
1. $(F_n)_{n\in\mathbb N} \subseteq \mathbb R^d$ is a stochastic process with coordinates $F_{i,n}$, $i = 1,\dots,d$, such that $F_n$ is $\mathcal F_{n+1}$-measurable and, for all $i = 1,\dots,d$,
$$\|\mathbb E[F_n\,|\,\mathcal F_n]\| \le \kappa\|X_n\| + c_n \quad\text{and}\quad \mathbb V[F_{i,n}\,|\,\mathcal F_n] \le K(1 + \kappa\|X_n\|)^2, \quad n \ge 1,$$
where $\kappa \in [0,1)$, $(c_n)_{n\in\mathbb N} \subseteq \mathbb R_+$ is an adapted stochastic process that converges to 0 almost surely, and $K > 0$ is some constant,
2. the non-negative stochastic process $(\alpha_n)_{n\in\mathbb N} \subseteq \mathbb R^d$, with coordinates $\alpha_{i,n} \in [0,1]$ for $i = 1,\dots,d$, is adapted with $\sum_{n=1}^\infty \alpha_{i,n} = \infty$ and $\sum_{n=1}^\infty \alpha_{i,n}^2 < \infty$ a.s.

Then, for any $\mathcal F_0$-measurable initial condition $X_0$, the stochastic process $(X_n)_{n\in\mathbb N} \subseteq \mathbb R^d$ with coordinates $X_{i,n}$, recursively defined by
$$X_{i,n+1} = (1-\alpha_{i,n})X_{i,n} + \alpha_{i,n}F_{i,n}, \quad n\in\mathbb N,$$
converges almost surely to zero.

Furthermore, we follow (Rowland et al., 2018) by first showing the convergence of the mean values to $Q^*$ and afterwards showing convergence of the return distribution functions, under the assumption of a unique optimal policy, by coupling with policy evaluation. The convergence of the latter is easier to prove and we will do so at the end.

Lemma F.4 (Adaptive double categorical temporal difference for policy evaluation). Given some initial return distribution functions $\eta^A_0, \eta^B_0$ supported within $[\theta_1,\theta_m]$ and a stationary policy $\pi \in \Pi_S$, the return distribution functions $(\eta^A_t), (\eta^B_t)$ recursively defined by Algorithm 1, but with $a^* \sim \pi(\cdot\,;S_{t+1})$ instead, converge almost surely towards the unique fixed point $\eta_C \in \mathcal P(\mathbb R)^{S\times A}$ of the operator $\Pi_C T^\pi$ with respect to $\bar\ell_2$, if the following conditions are satisfied:
1. the step sizes $\alpha_t(s,a)$ fulfill the Robbins-Monro conditions $\sum_{t=0}^\infty \alpha_t(s,a) = \infty$ and $\sum_{t=0}^\infty \alpha_t^2(s,a) < \infty$,
2. rewards are bounded in $[R_{\min}, R_{\max}]$ and $[\frac{R_{\min}}{1-\gamma}, \frac{R_{\max}}{1-\gamma}] \subseteq [\theta_1,\theta_m]$,
3. the choice of updating $\eta^A$ or $\eta^B$ is random and independent of all other previous random variables.

The above result is only relevant for the proof of Theorem F.2, as policy evaluation with a double estimator is not of independent interest. Note that convergence of categorical temporal difference learning for policy evaluation (in the single-estimator case) has been proven in (Rowland et al., 2018), Theorem 2, mimicking (Tsitsiklis, 1994), Theorem 2, and in (Bellemare et al., 2023), Theorem 6.12, applying (Tsitsiklis, 1994), Theorem 3, or (Bertsekas & Tsitsiklis, 1996), Proposition 4.5.

Lemma F.5. Let $(\alpha_t)_{t\in\mathbb N_0}$ be a sequence fulfilling the Robbins-Monro conditions and $(Y_t)_{t\in\mathbb N_0}$ an iid sequence of Bernoulli(0.5) random variables, i.e. $\mathbb P(Y_t = 1) = \mathbb P(Y_t = 0) = 0.5$ for all $t\in\mathbb N_0$. Then $(\alpha_t Y_t)_{t\in\mathbb N_0}$ also fulfills the Robbins-Monro conditions.
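Before the proof, a quick numerical illustration (ours): with $\alpha_t = 1/t$, the masked sums of Lemma F.5 indeed keep diverging (logarithmically), while the squared sums stay bounded:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10**6
alpha = 1.0 / np.arange(1, T + 1)  # Robbins-Monro step sizes alpha_t = 1/t
Y = rng.integers(0, 2, size=T)     # Bernoulli(0.5) masking
print("sum alpha_t * Y_t    :", (alpha * Y).sum())         # grows like 0.5*log(T)
print("sum (alpha_t * Y_t)^2:", ((alpha * Y) ** 2).sum())  # stays below pi^2/6
```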
Proof. The almost sure convergence of the summed squares is obviously fulfilled due to
$$\sum_{t=0}^\infty (\alpha_t Y_t)^2 \le \sum_{t=0}^\infty \alpha_t^2 < \infty \quad\text{almost surely.}$$
Due to the independence of each $Y_t$ from $\{Y_n \,|\, n\in\mathbb N_0,\ n\neq t\}$ as well as from $\alpha = (\alpha_t)_{t=0}^\infty$, we consider a two-stage experiment, where we first draw the sequence $\alpha = (\alpha_t)_{t=0}^\infty$ and then, independently of this realization, sample the iid sequence $Y = (Y_t)_{t=0}^\infty$. Due to the independence, the joint law of $\alpha$ and $Y$ is the product measure. Consider the product space $(\Omega,\mathcal F,\mathbb P) = (\Omega_\alpha\times\Omega_Y,\ \mathcal F_\alpha\otimes\mathcal F_Y,\ \mathbb P_\alpha\otimes\mathbb P_Y)$, where $\Omega_\alpha,\Omega_Y = [0,1]^{\mathbb N}$ and $\mathcal F_\alpha,\mathcal F_Y = \mathcal B([0,1])^{\otimes\mathbb N}$. Then, using that $\sum_{t=0}^\infty \alpha_t = \infty$ $\mathbb P_\alpha$-almost surely, we have
$$\mathbb P\Big(\sum_{t=0}^\infty \alpha_t Y_t = \infty\Big) = \int_{\Omega_\alpha} \mathbb P_Y\Big(\sum_{t=0}^\infty \alpha_t Y_t = \infty\Big)\,d\mathbb P_\alpha(\alpha) = \int_{\{\alpha:\,\sum_t\alpha_t=\infty\}} \mathbb P_Y\Big(\sum_{t=0}^\infty \alpha_t Y_t = \infty\Big)\,d\mathbb P_\alpha(\alpha) \overset{(a)}{=} \int_{\{\alpha:\,\sum_t\alpha_t=\infty\}} 1\,d\mathbb P_\alpha(\alpha) = 1,$$
where (a) can be seen as follows. Consider any deterministic sequence $(b_t)\subseteq[0,1]$ fulfilling $\sum_{t=0}^\infty b_t = \infty$. Then
$$\infty = \sum_{t=0}^\infty b_t = \sum_{t=0}^\infty b_t Y_t + \sum_{t=0}^\infty b_t 1_{Y_t=0}.$$
Now notice that $A = \sum_t b_t Y_t$ and $B = \sum_t b_t 1_{Y_t=0}$ are identically distributed, and since the sum of $A$ and $B$ is always infinite, almost surely at least one of them is infinite. Given the identical distribution, we infer $\mathbb P_Y(\sum_t b_t Y_t = \infty) > 0$. But since $(b_t Y_t)$ is an independent sequence of random variables and the event that the infinite sum diverges lies in the tail sigma-algebra, the Kolmogorov 0-1 law yields $\mathbb P_Y(\sum_t b_t Y_t = \infty) = 1$. $\square$

Remark F.6. As outlined in (Rowland et al., 2018), proof of Proposition 1, denoting by $\mathcal M(\mathbb R)$ the space of all finite signed measures on $(\mathbb R,\mathcal B(\mathbb R))$, the subspace
$$\mathcal M_0(\mathbb R) := \Big\{\nu\in\mathcal M(\mathbb R)\ \Big|\ \nu(\mathbb R) = 0,\ \int_{\mathbb R} F_\nu(x)^2\,dx < \infty\Big\},$$
where $F_\nu(x) = \nu((-\infty,x))$ for $x\in\mathbb R$, is isometrically isomorphic to a subspace of the Hilbert space $L^2(\mathbb R)$ with inner product given by
$$\langle\nu_1,\nu_2\rangle_{\ell_2} = \int_{\mathbb R} F_{\nu_1}(x)F_{\nu_2}(x)\,dx.$$
Then the affine translation $\delta_0 + \mathcal M_0$ is also a Hilbert space endowed with the same inner product. It contains the probability measures $\nu\in\mathcal P(\mathbb R)$ satisfying
$$\int_{-\infty}^0 F_\nu(x)^2\,dx < \infty \quad\text{and}\quad \int_0^\infty (1-F_\nu(x))^2\,dx < \infty.$$
To see this, consider $\mu = \nu - \delta_0$, which fulfills $F_\mu(x) = F_\nu(x)$ for $x < 0$ and $F_\mu(x) = F_\nu(x) - 1$ for $x \ge 0$; hence $\mu\in\mathcal M_0$. The two conditions ensure that the tails decay fast enough. Note that the inner product induces a norm through $\|\nu\|^2_{\ell_2} = \langle\nu,\nu\rangle$, and we have $\ell_2(\nu_1,\nu_2) = \|\nu_1-\nu_2\|_{\ell_2}$. In the following proof we will make use of the relationship
$$\ell_2^2(\nu_1+\nu_2,\ \nu'_1+\nu'_2) = \|\nu_1-\nu'_1\|^2_{\ell_2} + \|\nu_2-\nu'_2\|^2_{\ell_2} + 2\langle\nu_1-\nu'_1,\ \nu_2-\nu'_2\rangle_{\ell_2},$$
which holds by bilinearity of the inner product.

Proof of Theorem F.2. Step 1: Convergence of the mean values to $Q^*$. The proof mainly follows (Rowland et al., 2018) and (van Hasselt, 2010). Let the filtration be given by $\mathcal F_t = \sigma(\eta^A_0,\eta^B_0,s_0,a_0,\alpha_0,R_0,S_1,Y_1,\beta^A_1,\beta^B_1,\dots,s_t,a_t,\alpha_t)$, where $(Y_n)_{n\in\mathbb N}$ is an iid sequence of Bernoulli(0.5) random variables, independent of all other appearing random variables, such that A is updated when $Y_{n+1} = 1$. Denote the expected values of the return distributions by $Q^A_t(s,a) = \mathbb E_{R\sim\eta^A_t(s,a)}[R]$ and, overloading notation, let us further write $\mathbb E[\nu]$ for the expected value $\mathbb E_{R\sim\nu}[R]$ of a probability distribution $\nu\in\mathcal P(\mathbb R)$. We first consider how the expected values evolve. Due to the symmetry of the updates it is sufficient to show convergence of $Q^A_t$ to $Q^*$. It is implied that $\alpha_t(s,a) = 0$ for $(s,a)\neq(s_t,a_t)$. Further, define
$$X_t(s_t,a_t) := Q^A_t(s_t,a_t) - Q^*(s_t,a_t),$$
$$F_t(s_t,a_t) := 1_{Y_{t+1}=1}\Big(R_t + \gamma\big(\beta^A_{t+1}Q^A_t(S_{t+1},a^*) + (1-\beta^A_{t+1})Q^B_t(S_{t+1},a^*)\big) - Q^*(s_t,a_t)\Big) + 1_{Y_{t+1}=0}X_t(s_t,a_t),$$
$$F_t(s,a) := 0 \quad\text{whenever } (s,a)\neq(s_t,a_t),$$
with $a^* = \operatorname{argmax}_{a'\in A_{S_{t+1}}} Q^A_t(S_{t+1},a')$. According to (Lyle et al., 2019), Proposition 1, the projection $\Pi_C$ is mean-preserving, i.e. $\mathbb E[\Pi_C\nu] = \mathbb E[\nu]$ when $\nu$ is a distribution supported within $[\theta_1,\theta_m]$. This is the case for every $\hat\eta$ as in Algorithm 1, which can be seen as follows. Assume $\eta^A_t(s_t,a_t), \eta^B_t(s_t,a_t)\in F_{C,m}$. Then also $\nu = \beta^A_{t+1}\eta^A_t(S_{t+1},a^*) + (1-\beta^A_{t+1})\eta^B_t(S_{t+1},a^*)\in F_{C,m}$, and we may write $\nu = \sum_{i=1}^m p_i\delta_{\theta_i}$ for some $p_i$. Then $\hat\eta := b_{R_t,\gamma\#}\nu = \sum_{i=1}^m p_i\delta_{R_t+\gamma\theta_i}$. But now $\theta_1 \le \frac{R_{\min}}{1-\gamma}$ and $\theta_m \ge \frac{R_{\max}}{1-\gamma}$ (Assumption 2) guarantee that $\theta_1 \le R_t+\gamma\theta_i \le \theta_m$ for all $i\in\{1,\dots,m\}$, so $\hat\eta$ is supported within $[\theta_1,\theta_m]$. Similarly, for a realized transition $(R_t,S_{t+1}) = (r_t,s_{t+1})$, we have for the expected value of the distribution
$$\mathbb E\Big[b_{r_t,\gamma\#}\big(\beta^A_{t+1}\eta^A_t(s_{t+1},a^*) + (1-\beta^A_{t+1})\eta^B_t(s_{t+1},a^*)\big)\Big] = r_t + \gamma\big(\beta^A_{t+1}Q^A_t(s_{t+1},a^*) + (1-\beta^A_{t+1})Q^B_t(s_{t+1},a^*)\big).$$
Hence, the expected values of the return distributions $\eta^A_t$, subtracted by $Q^*$, indeed evolve as
$$X_{t+1}(s,a) = (1-\alpha_t(s,a))X_t(s,a) + \alpha_t(s,a)F_t(s,a).$$
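This mean-value recursion can be simulated directly. The following minimal sketch (ours) runs the coin-flip double update on the bandit MDP with a fixed $\beta = 1/2$ and 1/#visits step sizes; Algorithm 1 additionally maintains full categorical return distributions and an adaptive $\beta$, both of which we drop here:

```python
import numpy as np

def addq_means_bandit(k=10, mu=0.0, sigma=1.0, gamma=0.99,
                      episodes=100000, beta=0.5, seed=0):
    """Coin-flip double estimator on the bandit MDP (s0 -left-> s1, k arms,
    then terminal) with 1/#visits step sizes (covered by Lemma F.5, since
    each estimator is only updated on roughly half the visits) and a fixed
    mixing weight beta. The true value is Q*(s0, left) = gamma * mu."""
    rng = np.random.default_rng(seed)
    QA1, QB1 = np.zeros(k), np.zeros(k)  # estimates at s1
    QA0 = QB0 = 0.0                      # estimates at (s0, left)
    n1, n0 = np.zeros(k), 0
    for _ in range(episodes):
        a = rng.integers(k)              # explore an arm of s1 uniformly
        r = rng.normal(mu, sigma)
        n1[a] += 1
        if rng.integers(2):
            QA1[a] += (r - QA1[a]) / n1[a]
        else:
            QB1[a] += (r - QB1[a]) / n1[a]
        n0 += 1                          # one update at (s0, left) per episode
        if rng.integers(2):
            a_star = QA1.argmax()
            target = gamma * (beta * QA1[a_star] + (1 - beta) * QB1[a_star])
            QA0 += (target - QA0) / n0
        else:
            b_star = QB1.argmax()
            target = gamma * (beta * QB1[b_star] + (1 - beta) * QA1[b_star])
            QB0 += (target - QB0) / n0
    return QA0

# the bias of the final estimate interpolates between QL (beta=1) and DQL (beta=0)
print("Q^A(s0,left):", addq_means_bandit(), "(true value 0.0)")
```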
We now proceed similarly as in (van Hasselt, 2010) to show that the conditions of Lemma F.3 are satisfied. We first show that $\mathbb V[F_t(s,a)\,|\,\mathcal F_t]$ is bounded for all $(s,a)\in S\times A$ and therefore satisfies $\mathbb V[F_t(s,a)\,|\,\mathcal F_t] \le K(1+\kappa\|X_t\|)^2$ as required. Since the rewards were assumed to be bounded, there is an $\tilde R > 0$ such that $|r|,|\theta_1|,|\theta_m| \le \tilde R$ for all rewards $r$. Hence, we have
$$|F_t(s_t,a_t)| \le \Big|R_t + \gamma\big(\beta^A_{t+1}Q^A_t(S_{t+1},a^*) + (1-\beta^A_{t+1})Q^B_t(S_{t+1},a^*)\big) - Q^*(s_t,a_t)\Big| + |X_t(s_t,a_t)| \le 3\tilde R + \frac{2\tilde R}{1-\gamma}.$$
Next, we need to show that $\|\mathbb E[F_t\,|\,\mathcal F_t]\| \le \kappa\|X_t\| + c_t$. Let us therefore decompose
$$F_t(s_t,a_t) = 1_{Y_{t+1}=1}\Big(F^Q_t(s_t,a_t) + \gamma(1-\beta^A_{t+1})\big(Q^B_t(S_{t+1},a^*) - Q^A_t(S_{t+1},a^*)\big)\Big) + 1_{Y_{t+1}=0}X_t(s_t,a_t)$$
with $F^Q_t(s_t,a_t) := R_t + \gamma Q^A_t(S_{t+1},a^*) - Q^*(s_t,a_t)$. This yields
$$\big|\mathbb E[1_{Y_{t+1}=1}F^Q_t(s_t,a_t) + 1_{Y_{t+1}=0}X_t(s_t,a_t)\,|\,\mathcal F_t]\big| = \Big|\tfrac12\big(\mathbb E[R_t+\gamma Q^A_t(S_{t+1},a^*)\,|\,\mathcal F_t] - Q^*(s_t,a_t)\big) + \tfrac12 X_t(s_t,a_t)\Big| \le \tfrac12\big|(TQ^A_t)(s_t,a_t) - (TQ^*)(s_t,a_t)\big| + \tfrac12|X_t(s_t,a_t)| \le \underbrace{\tfrac12(\gamma+1)}_{<1}\|X_t\|,$$
since the Bellman optimality operator is a $\gamma$-contraction. Subsequently, it only remains to show that
$$c_t := \Big|\mathbb E\big[1_{Y_{t+1}=1}\gamma(1-\beta^A_{t+1})\big(Q^B_t(S_{t+1},a^*) - Q^A_t(S_{t+1},a^*)\big)\,\big|\,\mathcal F_t\big]\Big|$$
goes to zero almost surely. This is immediate if we verify that $X^{BA}_t(s,a) := Q^B_t(s,a) - Q^A_t(s,a)$ goes to zero almost surely for all $(s,a)\in S\times A$, which will be achieved by another application of Lemma F.3. Writing $b^* = \operatorname{argmax}_{a'\in A_{S_{n+1}}} Q^B_n(S_{n+1},a')$, we infer that
$$X^{BA}_{n+1}(s_n,a_n) = (1-\alpha_n(s_n,a_n))X^{BA}_n(s_n,a_n) + \alpha_n(s_n,a_n)\tilde F_n(s_n,a_n)$$
with
$$\tilde F_n(s_n,a_n) = 1_{Y_{n+1}=0}\Big(R_n + \gamma\big(\beta^B_{n+1}Q^B_n(S_{n+1},b^*) + (1-\beta^B_{n+1})Q^A_n(S_{n+1},b^*)\big)\Big) - 1_{Y_{n+1}=1}\Big(R_n + \gamma\big(\beta^A_{n+1}Q^A_n(S_{n+1},a^*) + (1-\beta^A_{n+1})Q^B_n(S_{n+1},a^*)\big)\Big) + 1_{Y_{n+1}=1}Q^B_n(s_n,a_n) - 1_{Y_{n+1}=0}Q^A_n(s_n,a_n).$$
Now, using that $Q^B_n(s_n,a_n), Q^A_n(s_n,a_n), X^{BA}_n(s_n,a_n), \alpha_n(s_n,a_n)$ are $\mathcal F_n$-measurable and $Y_{n+1}$ is independent of $\mathcal F_n$, the conditional expectation satisfies
$$\big|\mathbb E[\tilde F_n(s_n,a_n)\,|\,\mathcal F_n]\big| \le \tfrac12\gamma\Big(\big|\mathbb E\big[\beta^B_{n+1}\big(Q^B_n(S_{n+1},b^*) - Q^A_n(S_{n+1},a^*)\big)\,\big|\,\mathcal F_n\big]\big| + \big|\mathbb E\big[(1-\beta^B_{n+1})\big(Q^A_n(S_{n+1},b^*) - Q^B_n(S_{n+1},a^*)\big)\,\big|\,\mathcal F_n\big]\big| + \big|\mathbb E\big[(\beta^B_{n+1}-\beta^A_{n+1})Q^A_n(S_{n+1},a^*)\,\big|\,\mathcal F_n\big]\big| + \big|\mathbb E\big[\big((1-\beta^B_{n+1})-(1-\beta^A_{n+1})\big)Q^B_n(S_{n+1},a^*)\,\big|\,\mathcal F_n\big]\big|\Big) + \tfrac12\big|Q^B_n(s_n,a_n) - Q^A_n(s_n,a_n)\big|.$$
Now if $\mathbb E[Q^B_n(S_{n+1},b^*)\,|\,\mathcal F_n] \ge \mathbb E[Q^A_n(S_{n+1},a^*)\,|\,\mathcal F_n]$, by definition of $a^*$ we have $Q^A_n(S_{n+1},a^*) = \max_{a\in A_{S_{n+1}}}Q^A_n(S_{n+1},a) \ge Q^A_n(S_{n+1},b^*)$ and therefore
$$\big|\mathbb E[Q^B_n(S_{n+1},b^*) - Q^A_n(S_{n+1},a^*)\,|\,\mathcal F_n]\big| = \mathbb E[Q^B_n(S_{n+1},b^*) - Q^A_n(S_{n+1},a^*)\,|\,\mathcal F_n] \le \mathbb E[Q^B_n(S_{n+1},b^*) - Q^A_n(S_{n+1},b^*)\,|\,\mathcal F_n] \le \|X^{BA}_n\|.$$
Analogously, if $\mathbb E[Q^B_n(S_{n+1},b^*)\,|\,\mathcal F_n] < \mathbb E[Q^A_n(S_{n+1},a^*)\,|\,\mathcal F_n]$, then by definition of $b^*$
$$\big|\mathbb E[Q^B_n(S_{n+1},b^*) - Q^A_n(S_{n+1},a^*)\,|\,\mathcal F_n]\big| = \mathbb E[Q^A_n(S_{n+1},a^*) - Q^B_n(S_{n+1},b^*)\,|\,\mathcal F_n] \le \mathbb E[Q^A_n(S_{n+1},a^*) - Q^B_n(S_{n+1},a^*)\,|\,\mathcal F_n] \le \|X^{BA}_n\|.$$
Similarly, by distinguishing cases, one shows that $\big|\mathbb E[Q^A_n(S_{n+1},b^*) - Q^B_n(S_{n+1},a^*)\,|\,\mathcal F_n]\big| \le \|X^{BA}_n\|$. Combining the above yields
$$\big|\mathbb E[\tilde F_n(s_n,a_n)\,|\,\mathcal F_n]\big| \le \underbrace{\tfrac12(\gamma+1)}_{<1}\|X^{BA}_n\| + \underbrace{\tfrac12\gamma\,\mathbb E\Big[|\beta^B_{n+1}-\beta^A_{n+1}|\big(\underbrace{|Q^A_n(S_{n+1},a^*)|}_{\le\tilde R<\infty} + \underbrace{|Q^B_n(S_{n+1},a^*)|}_{\le\tilde R<\infty}\big)\,\Big|\,\mathcal F_n\Big]}_{=:\,\tilde c_n\,\to\,0},$$
since $|\beta^A_n - \beta^B_n|$ converges to 0 for $n\to\infty$ by Assumption 4. Hence, we invoke Lemma F.3 to obtain convergence of $X^{BA}_t$ to zero, and thus, with another application of Lemma F.3, $X_t(s,a)$ converges to zero, which finally implies that $Q^A_t(s,a)$ (and also $Q^B_t(s,a)$) converges to $Q^*(s,a)$ almost surely for every $(s,a)\in S\times A$.
Since $S, A$ are finite, for every $\varepsilon > 0$ there exists a random variable $N > 0$ such that for all $t > N$ we have $\max_{z\in\{A,B\}}\|Q^z_t - Q^*\| < \varepsilon$ almost surely.

Step 2: Convergence of the return distributions. Suppose the MDP has a unique optimal policy $\pi^*$. Following (Rowland et al., 2018), we take $\varepsilon$ to be half the minimal action gap of the optimal action-value function $Q^* = Q^{\pi^*}$, i.e.
$$\varepsilon = \frac12\min_{s\in S}\Big(Q^{\pi^*}(s,\pi^*(s)) - \max_{a\neq\pi^*(s)}Q^{\pi^*}(s,a)\Big),$$
which is greater than zero by the uniqueness assumption. Hence, denoting the action of the deterministic optimal policy in state $s$ by $\pi^*(s)$, we get
$$\operatorname*{argmax}_a Q^A_t(s,a) = \operatorname*{argmax}_a Q^B_t(s,a) = \pi^*(s) \quad\text{for all } t > N.$$
For some initial condition $\tilde\eta_0\in F^{S\times A}_{C,m}$, let now $\tilde\eta_k$ be the iterates created by a double categorical policy evaluation algorithm for the optimal policy $\pi^*$, i.e.
$$\tilde\eta^A_{k+1}(s_k,a_k) = \big(1 - 1_{Y_{k+1}=1}\alpha_k(s_k,a_k)\big)\tilde\eta^A_k(s_k,a_k) + 1_{Y_{k+1}=1}\alpha_k(s_k,a_k)\,\Pi_C\,b_{R_k,\gamma\#}\Big(\beta^A_{k+1}\tilde\eta^A_k\big(S_{k+1},\pi^*(S_{k+1})\big) + (1-\beta^A_{k+1})\tilde\eta^B_k\big(S_{k+1},\pi^*(S_{k+1})\big)\Big),$$
$$\tilde\eta^A_{k+1}(s,a) = \tilde\eta^A_k(s,a) \quad\text{for } (s,a)\neq(s_k,a_k),$$
and analogously for $\tilde\eta^B$. Note that the appearing $Y_k,\alpha_k,\beta^A_k,\beta^B_k$ are chosen to be the same as in the control case above. Then $\tilde\eta^A,\tilde\eta^B$ converge almost surely to the unique fixed point $\eta^*_C$ of the projected operator $\Pi_C T^{\pi^*}$ with respect to $\bar\ell_2$ by Lemma F.4. Similarly to (Rowland et al., 2018), we now proceed by a coupling argument. Denote by $\pi^A_k,\pi^B_k$ any greedy selection rule with respect to $\eta^A_k$ and $\eta^B_k$, and set $A_k = \{\pi^A_n = \pi^B_n = \pi^* \text{ for all } n\ge k\}$. Then $A_k\subseteq A_{k+1}$, and by the above explanation we have $\mathbb P(A_k)\to 1$. Additionally, let $B$ be the event of probability 1 on which the (double) policy evaluation algorithm converges. On the event $B\cap A_k$ we have $\bar\ell_2^2(\tilde\eta^A_n,\eta^*_C)\to 0$ and $\bar\ell_2^2(\tilde\eta^B_n,\eta^*_C)\to 0$. By the triangle inequality it then suffices to show $\bar\ell_2(\eta^A_n,\tilde\eta^A_n)\to 0$ and $\bar\ell_2(\eta^B_n,\tilde\eta^B_n)\to 0$ on this event too, since then the theorem follows from $\mathbb P(B\cap A_k)\to 1$. To prove this we will again apply Lemma F.3, this time with $d = 2|S||A|$, where we identify
$$X_n := \Big(\ell_2^2\big(\eta^A_n(s,a),\tilde\eta^A_n(s,a)\big),\ \ell_2^2\big(\eta^B_n(s,a),\tilde\eta^B_n(s,a)\big)\Big)_{(s,a)\in S\times A}.$$
Additionally, we expand the filtration to $\tilde{\mathcal F}_n = \sigma(\mathcal F_n, Y_{n+1})$ and define $\tilde\alpha^A_n(s,a) = \alpha_n(s,a)1_{Y_{n+1}=1}$ and $\tilde\alpha^B_n(s,a) = \alpha_n(s,a)1_{Y_{n+1}=0}$. By Lemma F.5 these step-size sequences still fulfill the Robbins-Monro conditions. Then, writing
$$\nu^A = \beta^A_{n+1}\eta^A_n\big(S_{n+1},\pi^*(S_{n+1})\big) + (1-\beta^A_{n+1})\eta^B_n\big(S_{n+1},\pi^*(S_{n+1})\big),$$
$$\tilde\nu^A = \beta^A_{n+1}\tilde\eta^A_n\big(S_{n+1},\pi^*(S_{n+1})\big) + (1-\beta^A_{n+1})\tilde\eta^B_n\big(S_{n+1},\pi^*(S_{n+1})\big)$$
for short, for $n\ge k$, on $A_k$ we have
$$\ell_2^2\big(\eta^A_{n+1}(s_n,a_n),\tilde\eta^A_{n+1}(s_n,a_n)\big) = (1-\tilde\alpha^A_n(s_n,a_n))^2\,\big\|\eta^A_n(s_n,a_n)-\tilde\eta^A_n(s_n,a_n)\big\|^2_{\ell_2} + \tilde\alpha^A_n(s_n,a_n)^2\,\big\|\Pi_C(b_{R_n,\gamma\#}\nu^A) - \Pi_C(b_{R_n,\gamma\#}\tilde\nu^A)\big\|^2_{\ell_2} + 2(1-\tilde\alpha^A_n(s_n,a_n))\tilde\alpha^A_n(s_n,a_n)\,\big\langle \eta^A_n(s_n,a_n)-\tilde\eta^A_n(s_n,a_n),\ \Pi_C(b_{R_n,\gamma\#}\nu^A) - \Pi_C(b_{R_n,\gamma\#}\tilde\nu^A)\big\rangle_{\ell_2}.$$
This can be rewritten in terms of Lemma F.3 as
$$X^A_{n+1}(s_n,a_n) = (1-\zeta^A_n(s_n,a_n))X^A_n(s_n,a_n) + \zeta^A_n(s_n,a_n)F^A_n(s_n,a_n)$$
with $\zeta^A_n(s_n,a_n) = 2\tilde\alpha^A_n(s_n,a_n) - \tilde\alpha^A_n(s_n,a_n)^2$ and
$$F^A_n(s_n,a_n) = \frac{1}{\zeta^A_n(s_n,a_n)}\Big(\tilde\alpha^A_n(s_n,a_n)^2\,\big\|\Pi_C(b_{R_n,\gamma\#}\nu^A) - \Pi_C(b_{R_n,\gamma\#}\tilde\nu^A)\big\|^2_{\ell_2} + 2(1-\tilde\alpha^A_n(s_n,a_n))\tilde\alpha^A_n(s_n,a_n)\,\big\langle \eta^A_n(s_n,a_n)-\tilde\eta^A_n(s_n,a_n),\ \Pi_C(b_{R_n,\gamma\#}\nu^A) - \Pi_C(b_{R_n,\gamma\#}\tilde\nu^A)\big\rangle_{\ell_2}\Big)$$
and $F^A_n(s,a) = 0$ if $(s,a)\neq(s_n,a_n)$. Note that $\zeta^A_n(s_n,a_n) > 0$ and that
$$\sum_{n=1}^\infty \zeta^A_n(s_n,a_n) = \sum_{n=1}^\infty\big(2\tilde\alpha^A_n(s_n,a_n) - \tilde\alpha^A_n(s_n,a_n)^2\big) = \infty \quad\text{a.s.},$$
$$\sum_{n=1}^\infty \zeta^A_n(s_n,a_n)^2 = \sum_{n=1}^\infty\big(4\tilde\alpha^A_n(s_n,a_n)^2 - 4\tilde\alpha^A_n(s_n,a_n)^3 + \tilde\alpha^A_n(s_n,a_n)^4\big) < \infty \quad\text{a.s.} \qquad (7)$$
Finally, we have
$$|F^A_n(s_n,a_n)| \le \frac{1}{\zeta^A_n(s_n,a_n)}\Big(\tilde\alpha^A_n(s_n,a_n)^2\,\gamma\,\ell_2^2\big(\beta^A_{n+1}\eta^A_n + (1-\beta^A_{n+1})\eta^B_n,\ \beta^A_{n+1}\tilde\eta^A_n + (1-\beta^A_{n+1})\tilde\eta^B_n\big) + 2(1-\tilde\alpha^A_n(s_n,a_n))\tilde\alpha^A_n(s_n,a_n)\,\gamma\,\big|\big\langle \eta^A_n-\tilde\eta^A_n,\ \beta^A_{n+1}\eta^A_n + (1-\beta^A_{n+1})\eta^B_n - \beta^A_{n+1}\tilde\eta^A_n - (1-\beta^A_{n+1})\tilde\eta^B_n\big\rangle_{\ell_2}\big|\Big)$$
$$\le \frac{1}{\zeta^A_n(s_n,a_n)}\Big(\tilde\alpha^A_n(s_n,a_n)^2\,\gamma\max_{z\in\{A,B\}}\ell_2^2(\eta^z_n,\tilde\eta^z_n) + 2(1-\tilde\alpha^A_n(s_n,a_n))\tilde\alpha^A_n(s_n,a_n)\,\gamma\max_{z\in\{A,B\}}\ell_2^2(\eta^z_n,\tilde\eta^z_n)\Big)$$
$$= \frac{\tilde\alpha^A_n(s_n,a_n)^2 + 2(1-\tilde\alpha^A_n(s_n,a_n))\tilde\alpha^A_n(s_n,a_n)}{2\tilde\alpha^A_n(s_n,a_n) - \tilde\alpha^A_n(s_n,a_n)^2}\,\gamma\max_{z\in\{A,B\}}\ell_2^2(\eta^z_n,\tilde\eta^z_n) = \gamma\max_{z\in\{A,B\}}\ell_2^2(\eta^z_n,\tilde\eta^z_n) = \gamma\|X_n\|,$$
where we used regularity and $\frac12$-homogeneity of the $\ell_2$ metric as described in (Bellemare et al., 2023), Section 4.6, as well as that $\Pi_C$ is a non-expansion in $\ell_2$ and that
$$|\langle u, \beta u + (1-\beta)v\rangle| \le \beta\langle u,u\rangle + (1-\beta)|\langle u,v\rangle| \le \beta\max(\|u\|^2,\|v\|^2) + (1-\beta)\|u\|\,\|v\| \le \max(\|u\|^2,\|v\|^2)$$
by the Cauchy-Schwarz inequality. Further, by the above the variance also fulfills
$$\mathbb V[F^A_n(s_n,a_n)\,|\,\tilde{\mathcal F}_n] \le \mathbb E[F^A_n(s_n,a_n)^2\,|\,\tilde{\mathcal F}_n] \le \Big(\gamma\max_{z\in\{A,B\}}\ell_2^2(\eta^z_n,\tilde\eta^z_n)\Big)^2 \le \gamma^2\sup_{\eta,\eta'\in F^{S\times A}_{C,m}}\ell_2^4(\eta,\eta') < \infty.$$
Therefore, by Lemma F.3 we obtain the convergence $\ell_2(\eta^A_n,\tilde\eta^A_n)\to 0$ and $\ell_2(\eta^B_n,\tilde\eta^B_n)\to 0$ on $A_k$. As already described above, this results in $\bar\ell_2(\eta^A_n,\eta^*_C)\to 0$ and $\bar\ell_2(\eta^B_n,\eta^*_C)\to 0$ almost surely. $\square$

Proof of Lemma F.4. Let the filtration be given by $\tilde{\mathcal F}_t = \sigma(\eta^A_0,\eta^B_0,s_0,a_0,\alpha_0,R_0,S_1,Y_1,\beta^A_1,\beta^B_1,\dots,s_t,a_t,\alpha_t,Y_{t+1})$, where $(Y_n)_{n\in\mathbb N}$ is an iid sequence of Bernoulli(0.5) random variables, independent of all other appearing random variables, such that A is updated when $Y_{n+1} = 1$. To clarify, abbreviating
$$\nu^A = \beta^A_{t+1}\eta^A_t(S_{t+1},A_{t+1}) + (1-\beta^A_{t+1})\eta^B_t(S_{t+1},A_{t+1}),$$
$$\nu^B = \beta^B_{t+1}\eta^B_t(S_{t+1},A_{t+1}) + (1-\beta^B_{t+1})\eta^A_t(S_{t+1},A_{t+1}),$$
where $A_{t+1}\sim\pi(\cdot\,;S_{t+1})$, we are confronted with the updates
$$\eta^A_{t+1}(s,a) = \eta^A_t(s,a) + \alpha_t(s,a)1_{Y_{t+1}=1}\big(\Pi_C(b_{R_t,\gamma\#}\nu^A) - \eta^A_t(s,a)\big),$$
$$\eta^B_{t+1}(s,a) = \eta^B_t(s,a) + \alpha_t(s,a)1_{Y_{t+1}=0}\big(\Pi_C(b_{R_t,\gamma\#}\nu^B) - \eta^B_t(s,a)\big).$$
As in the proof above, define $\tilde\alpha^A_n(s,a) = \alpha_n(s,a)1_{Y_{n+1}=1}$ and $\tilde\alpha^B_n(s,a) = \alpha_n(s,a)1_{Y_{n+1}=0}$. By Lemma F.5 these step-size sequences still fulfill the Robbins-Monro conditions. Also note that, as in Step 2 of the proof of Theorem F.2, $Y_{t+1}$ is $\tilde{\mathcal F}_t$-measurable and hence so is $\tilde\alpha^{A/B}_t$. In order to align this with Lemma F.3, we rewrite
$$X^A_{t+1}(s,a) = \ell_2^2\big(\eta^A_{t+1}(s,a),\eta_C(s,a)\big) = (1-\tilde\alpha^A_t(s,a))^2\,\big\|\eta^A_t(s,a)-\eta_C(s,a)\big\|^2_{\ell_2} + \tilde\alpha^A_t(s,a)^2\,\big\|\Pi_C(b_{R_t,\gamma\#}\nu^A) - \eta_C(s,a)\big\|^2_{\ell_2} + 2(1-\tilde\alpha^A_t(s,a))\tilde\alpha^A_t(s,a)\,\big\langle \eta^A_t(s,a)-\eta_C(s,a),\ \Pi_C(b_{R_t,\gamma\#}\nu^A) - \eta_C(s,a)\big\rangle_{\ell_2} = (1-\zeta^A_t(s,a))X^A_t(s,a) + \zeta^A_t(s,a)F^A_t(s,a)$$
with $\zeta^A_t(s,a) = 2\tilde\alpha^A_t(s,a) - \tilde\alpha^A_t(s,a)^2$, $X_t := \big(\ell_2^2(\eta^A_t,\eta_C),\ \ell_2^2(\eta^B_t,\eta_C)\big)$, and
$$F^A_t(s,a) = \frac{1}{\zeta^A_t(s,a)}1_{\tilde\alpha^A_t(s,a)>0}\Big(\tilde\alpha^A_t(s,a)^2\,\ell_2^2\big(\Pi_C(b_{R_t,\gamma\#}\nu^A),\eta_C(s,a)\big) + 2(1-\tilde\alpha^A_t(s,a))\tilde\alpha^A_t(s,a)\,\big\langle \eta^A_t(s,a)-\eta_C(s,a),\ \Pi_C(b_{R_t,\gamma\#}\nu^A) - \eta_C(s,a)\big\rangle_{\ell_2}\Big).$$
As in Equation (7), the sequence $\zeta^A_t(s,a)$ fulfills the Robbins-Monro conditions. Additionally, note that there exists $\tilde K > 0$ such that $\ell_2^2(\Pi_C(b_{R_t,\gamma\#}\nu^A),\eta_C(s,a)) < \tilde K$ independently of $s,a,t$. Further, observe that
$$c_t := \max_{z\in\{A,B\}}\frac{1}{\zeta^z_t(s,a)}1_{\tilde\alpha^z_t(s,a)>0}\,\tilde\alpha^z_t(s,a)^2\,\tilde K \to 0 \quad\text{for } t\to\infty \text{ almost surely.}$$
We use that $\Pi_C$ is mean-preserving ((Lyle et al., 2019), Proposition 1) for discrete distributions supported within $[\theta_1,\theta_m]$, which is satisfied by $b_{R_t,\gamma\#}\nu^A$ due to Assumption 2 and $\nu^A\in F_{C,m}$.
Together with the fact that $\Pi_C T^\pi$ is a $\sqrt\gamma$-contraction with respect to $\bar\ell_2$ and the Cauchy-Schwarz inequality, we have
$$\big|\mathbb E\big[\big\langle \eta^A_t(s,a)-\eta_C(s,a),\ \Pi_C(b_{R_t,\gamma\#}\nu^A) - \eta_C(s,a)\big\rangle_{\ell_2}\,\big|\,\tilde{\mathcal F}_t\big]\big| = \big|\big\langle \eta^A_t(s,a)-\eta_C(s,a),\ \mathbb E[\Pi_C(b_{R_t,\gamma\#}\nu^A)\,|\,\tilde{\mathcal F}_t] - \eta_C(s,a)\big\rangle_{\ell_2}\big| = \big|\big\langle \eta^A_t(s,a)-\eta_C(s,a),\ \Pi_C T^\pi\big(\beta^A_{t+1}\eta^A_t + (1-\beta^A_{t+1})\eta^B_t\big)(s,a) - (\Pi_C T^\pi\eta_C)(s,a)\big\rangle_{\ell_2}\big|$$
$$\le \gamma\,\big|\big\langle \eta^A_t-\eta_C,\ \big(\beta^A_{t+1}\eta^A_t + (1-\beta^A_{t+1})\eta^B_t\big) - \eta_C\big\rangle_{\ell_2}\big| \le \gamma\Big(\beta^A_{t+1}\,\ell_2^2(\eta^A_t,\eta_C) + (1-\beta^A_{t+1})\big|\langle \eta^A_t-\eta_C,\ \eta^B_t-\eta_C\rangle_{\ell_2}\big|\Big) \le \gamma\Big(\beta^A_{t+1}\max_{z\in\{A,B\}}\ell_2^2(\eta^z_t,\eta_C) + (1-\beta^A_{t+1})\,\|\eta^A_t-\eta_C\|_{\ell_2}\|\eta^B_t-\eta_C\|_{\ell_2}\Big) \le \gamma\max_{z\in\{A,B\}}\ell_2^2(\eta^z_t,\eta_C).$$
In total, this yields
$$\big|\mathbb E[F^A_t(s,a)\,|\,\tilde{\mathcal F}_t]\big| \le \frac{1}{\zeta^A_t(s,a)}1_{\tilde\alpha^A_t(s,a)>0}\,\tilde\alpha^A_t(s,a)^2\tilde K + \frac{1}{\zeta^A_t(s,a)}1_{\tilde\alpha^A_t(s,a)>0}\,2(1-\tilde\alpha^A_t(s,a))\tilde\alpha^A_t(s,a)\,\gamma\,\|X_t\| \le c_t + \gamma\|X_t\|.$$
Since $\ell_2(\eta,\eta') < \tilde K$ for every $\eta,\eta'\in F^{S\times A}_{C,m}$ and some $\tilde K > 0$, the conditional variance $\mathbb V[F^A_t\,|\,\tilde{\mathcal F}_t]$ can be bounded uniformly in $t$. Therefore, the requirements of Lemma F.3 are fulfilled, and its application yields $X^A_t(s,a) = \ell_2^2(\eta^A_t(s,a),\eta_C(s,a)) \to 0$ and $X^B_t(s,a) = \ell_2^2(\eta^B_t(s,a),\eta_C(s,a)) \to 0$. Hence, $\eta^A_t,\eta^B_t$ converge to $\eta_C$ with respect to $\bar\ell_2$. $\square$