# One-Step Distributional Reinforcement Learning

Published in Transactions on Machine Learning Research (08/2023)

Mastane Achab (mastane.achab@tii.ae), Reda Alami, Yasser Abdelaziz Dahou Djilali, Kirill Fedyanin. Technology Innovation Institute, 9639 Masdar City, Abu Dhabi, United Arab Emirates.
Eric Moulines. Ecole polytechnique, Palaiseau, France.
Reviewed on OpenReview: https://openreview.net/forum?id=ZPMf53vE1L

Reinforcement learning (RL) allows an agent interacting sequentially with an environment to maximize its long-term expected return. In the distributional RL (DistrRL) paradigm, the agent goes beyond the expected value and captures the underlying probability distribution of the return across all time steps. DistrRL algorithms have led to improved empirical performance. Nevertheless, the theory of DistrRL is still not fully understood, especially in the control case. In this paper, we present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework, which encompasses only the randomness induced by the one-step dynamics of the environment. Contrary to DistrRL, we show that our approach comes with a unified theory for both policy evaluation and control. Indeed, we propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis. The proposed approach compares favorably with categorical DistrRL on various environments.

1 Introduction

In reinforcement learning (RL), a decision-maker or agent sequentially interacts with an unknown and uncertain environment in order to optimize some performance criterion (Sutton & Barto, 2018). At each time step, the agent observes the current state of the environment, then takes an action that influences both its immediate reward and the next state. In other words, RL refers to learning through trial and error a strategy (or policy) mapping states to actions that maximizes the long-term cumulative reward (or return): this is the so-called control task. More precisely, this return is a random variable due to the random transitions across states, and classical RL only focuses on its expected value. On the other hand, the policy evaluation task aims at assessing the quality of any given policy (not necessarily optimal as in control) by computing its expected return in each initial state, also called the value function. For both evaluation and control, when the model (i.e., the reward function and the transition probabilities between states) is known, these value functions can be seen as fixed points of certain operators and computed by dynamic programming (DP) under the Markov decision process (MDP) formalism (Puterman, 2014). Nevertheless, the model in RL is typically unknown and the agent can only approximate the DP approach based on empirical trajectories. The TD(0) algorithm for policy evaluation and Q-learning for control, respectively introduced in (Sutton, 1988) and (Watkins & Dayan, 1992), are flagship examples of the RL paradigm. Formal convergence guarantees were provided for both of these methods; see (Dayan, 1992; Tsitsiklis, 1994; Jaakkola et al., 1993). In many RL applications, the number of states is very large and thus prevents the use of the aforementioned tabular RL algorithms. In such a situation, one should rather use function approximation to approximate the value functions, as achieved by the DQN algorithm (Mnih et al., 2013; 2015), which combines ideas from Q-learning and deep learning.
More recently, the distributional reinforcement learning (DistrRL) framework was proposed by Bellemare et al. (2017); see also (Morimura et al., 2010a;b). In DistrRL, the agent is trained to model the whole probability distribution of the return, not just its expectation. In this new paradigm, several distributional procedures were proposed as extensions of classic (non-distributional) RL methods, leading to improved empirical performance. In most cases, a DistrRL algorithm is composed of two main ingredients: (i) a parametric family of distributions serving as proxies for the distributional returns, and (ii) a metric measuring approximation errors between original distributions and parametric proxies.

Related work. We recall a few existing DistrRL approaches, all of which consider mixtures of Dirac measures as their parametric family of proxy distributions. Indeed, the categorical DistrRL (CDRL) parametrization used in the C51 algorithm proposed in (Bellemare et al., 2017) was shown in (Rowland et al., 2018) to correspond to orthogonal projections derived from the Cramér distance¹. The CDRL approach learns the categorical probabilities $p_1(x,a), \dots, p_K(x,a)$ of a distribution $\sum_{k=1}^K p_k(x,a)\,\delta_{z_k}$ with fixed predefined support values $z_1, \dots, z_K$. On the other hand, the quantile regression approach was proposed in (Dabney et al., 2018), where the support $\{Q_1(x,a), \dots, Q_N(x,a)\}$ of the proxy distribution $\frac{1}{N}\sum_{i=1}^N \delta_{Q_i(x,a)}$ is learned for fixed uniform probabilities $\frac{1}{N}$ over the $N$ atoms. In (Dabney et al., 2018), these atoms result from Wasserstein-1 projections and correspond to quantiles of the unprojected distribution. Achab & Neu (2021) and Achab et al. (2022) later investigated the Wasserstein-2 setup, leading to the study of conditional value-at-risk measures. For policy evaluation, the convergence analyses of CDRL and of the quantile approach were respectively derived in (Rowland et al., 2018) and (Rowland et al., 2023). Nevertheless, the theory of DistrRL is more challenging in the control case because the corresponding operator is not a contraction² and does not necessarily admit a fixed point. For that reason, Rowland et al. (2018) proved the convergence of CDRL for control only under the restrictive assumption that the optimal policy is unique. Hence, it seems natural to ask the following question: "Is there another formulation of DistrRL in which both the evaluation and the control tasks lead to contractive operators?" The answer provided by this paper is: yes, via a one-step approach! Contrary to classic DistrRL, which deals with the randomness across all time steps, the one-step variant (originally proposed by Achab (2020)) only cares about the one-step dynamics of the environment.

¹The Cramér distance $\ell_2$ between two probability distributions $\nu_1, \nu_2$ with CDFs $F_1, F_2$ is equal to $\ell_2(\nu_1, \nu_2) = \sqrt{\int_{\mathbb{R}} (F_1(x) - F_2(x))^2\, dx}$.
²A function mapping a metric space to itself is called a $\gamma$-contraction if it is Lipschitz continuous with Lipschitz constant $\gamma < 1$.

Contributions. Our main contributions are as follows:
- We introduce a one-step variant of the DistrRL framework, which we call one-step distributional reinforcement learning (OS-DistrRL).
- We show that our new method solves the well-known instability issue of DistrRL in the control case.
- We provide a unified almost-sure convergence analysis for both evaluation and control.
- We experimentally show the competitive performance of our new (deep learning-enhanced) algorithms in various environments.

The paper is organized as follows. In Section 2, we recall a few standard RL tools and notations as well as their DistrRL generalization. Then, our one-step approach is defined in Section 3.
Section 4 introduces new OS-DistrRL algorithms along with theoretical convergence guarantees. Finally, numerical experiments are provided for illustration purposes in Section 5. The main proofs are deferred to the Supplementary Material.

Notations. The indicator function of any event $E$ is denoted by $\mathbb{I}\{E\}$. We let $\mathcal{P}_b(\mathbb{R})$ be the set of probability measures on $\mathbb{R}$ having bounded support, and $\mathcal{P}(E)$ the set of probability mass functions on any finite set $E$, whose cardinality is denoted by $|E|$. The support of any discrete distribution $q \in \mathcal{P}(E)$ is $\mathrm{support}(q) = \{y \in E : q(y) > 0\}$; the supremum norm of any function $h : E \to \mathbb{R}$ is $\|h\|_\infty = \max_{y \in E} |h(y)|$. The cumulative distribution function (CDF) of a probability measure $\nu \in \mathcal{P}_b(\mathbb{R})$ is the mapping $F(z) = \mathbb{P}_{Z \sim \nu}(Z \le z)$ ($z \in \mathbb{R}$), and we denote its generalized inverse distribution function (a.k.a. quantile function) by $F^{-1} : \tau \in (0,1] \mapsto \inf\{z \in \mathbb{R} : F(z) \ge \tau\}$. Given $\nu_1, \nu_2 \in \mathcal{P}_b(\mathbb{R})$ with respective CDFs $F_1, F_2$, we write $\nu_2 \preceq \nu_1$ and say that $\nu_1$ stochastically dominates $\nu_2$ if $F_1(z) \le F_2(z)$ for all $z \in \mathbb{R}$. For any probability measure $\nu \in \mathcal{P}_b(\mathbb{R})$ and measurable function $f : \mathbb{R} \to \mathbb{R}$, the pushforward measure $f\#\nu$ is defined for any Borel set $A \subseteq \mathbb{R}$ by $f\#\nu(A) = \nu(f^{-1}(A)) = \nu(\{z \in \mathbb{R} : f(z) \in A\})$. In this article, we only need the affine case $f_{r_0,\gamma}(z) = r_0 + \gamma z$ (with $r_0 \in \mathbb{R}$, $\gamma \in [0,1)$), for which $f_{r_0,\gamma}\#\nu \in \mathcal{P}_b(\mathbb{R})$ and $f_{r_0,\gamma}\#\nu(A) = \nu(\{\frac{z - r_0}{\gamma} : z \in A\})$ if $\gamma \neq 0$, or $f_{r_0,\gamma}\#\nu = \delta_{r_0}$ is the Dirac measure at $r_0$ if $\gamma = 0$.

2 Background on distributional reinforcement learning

In this section, we recall some standard notations, tools and algorithms used in RL and DistrRL.

2.1 Markov decision process

Throughout the paper, we consider a Markov decision process (MDP) characterized by the tuple $(\mathcal{X}, \mathcal{A}, P, r, \gamma)$ with finite state space $\mathcal{X}$, finite action space $\mathcal{A}$, transition kernel $P : \mathcal{X} \times \mathcal{A} \to \mathcal{P}(\mathcal{X})$, reward function $r : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}$, and discount factor $0 \le \gamma < 1$. If the agent takes some action $a \in \mathcal{A}$ while the environment is in state $x \in \mathcal{X}$, then the next state $X_1$ is sampled from the distribution $P(\cdot|x,a)$ and the immediate reward is equal to $r(x,a,X_1)$. In the discounted MDP setting, the agent seeks a policy $\pi : \mathcal{X} \to \mathcal{P}(\mathcal{A})$ maximizing its expected long-term return for each pair $(x,a)$ of initial state and action:

$$Q^\pi(x,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(X_t, A_t, X_{t+1}) \;\Big|\; X_0 = x,\, A_0 = a\right],$$

where $X_{t+1} \sim P(\cdot|X_t, A_t)$ and $A_{t+1} \sim \pi(\cdot|X_{t+1})$. $Q^\pi(x,a)$ and $V^\pi(x) = \sum_{a \in \mathcal{A}} \pi(a|x)\, Q^\pi(x,a)$ are respectively called the state-action value function and the value function of the policy $\pi$. Further, each of these functions can be seen as the unique fixed point of a so-called Bellman operator (Bellman, 1966), which we denote by $T^\pi$ for Q-functions. For any $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, the image of $Q$ by $T^\pi$ is another Q-function given by:

$$(T^\pi Q)(x,a) = \sum_{x'} P(x'|x,a) \left[ r(x,a,x') + \gamma \sum_{a'} \pi(a'|x')\, Q(x',a') \right].$$

This Bellman operator $T^\pi$ has several nice properties: in particular, it is a $\gamma$-contraction in $\|\cdot\|_\infty$ and thus admits a unique fixed point (by Banach's fixed point theorem), namely $Q^\pi = T^\pi Q^\pi$. It is also well known from (Bellman, 1966) that there always exists at least one policy $\pi^*$ that is optimal uniformly over all initial conditions $(x,a)$:

$$Q^*(x,a) := Q^{\pi^*}(x,a) = \sup_\pi Q^\pi(x,a) \quad \text{and} \quad V^*(x) := \max_a Q^*(x,a) = V^{\pi^*}(x) = \sup_\pi V^\pi(x).$$
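To fix ideas, here is a minimal NumPy sketch (our own illustrative code, not the authors') of the tabular operator $T^\pi$ and of the fixed-point iteration that computes $Q^\pi$; the array layout and the helper names `bellman_operator` and `q_pi` are assumptions of ours.

```python
import numpy as np

# Illustrative tabular conventions (our own, not from the paper):
# P[x, a, x'] = P(x'|x, a), r[x, a, x'] = r(x, a, x'), pi[x, a] = pi(a|x).

def bellman_operator(Q, P, r, pi, gamma):
    """Apply the Bellman operator T^pi to a Q-table of shape (X, A)."""
    V = (pi * Q).sum(axis=1)                      # V(x') = sum_a' pi(a'|x') Q(x', a')
    return (P * (r + gamma * V[None, None, :])).sum(axis=2)

def q_pi(P, r, pi, gamma, tol=1e-10):
    """Fixed point Q^pi = T^pi Q^pi, reached by iterating the gamma-contraction."""
    Q = np.zeros(P.shape[:2])
    while True:
        Q_next = bellman_operator(Q, P, r, pi, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:      # Banach iterates converge geometrically
            return Q_next
        Q = Q_next
```

Because $T^\pi$ is a $\gamma$-contraction in $\|\cdot\|_\infty$, the loop above converges from any initialization, which is exactly the dynamic programming viewpoint recalled in the introduction.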
Similarly, this optimal Q-function $Q^*$ is the unique fixed point of an operator $T$ called the Bellman optimality operator, defined by:

$$(TQ)(x,a) = \sum_{x'} P(x'|x,a) \left[ r(x,a,x') + \gamma \max_{a'} Q(x',a') \right],$$

which is also a $\gamma$-contraction in $\|\cdot\|_\infty$. Noteworthy, knowing $Q^*$ is sufficient to behave optimally: indeed, a policy $\pi^*$ is optimal if and only if, in every state $x$, $\mathrm{support}(\pi^*(\cdot|x)) \subseteq \arg\max_a Q^*(x,a)$.

2.2 The distributional Bellman operator

In distributional RL, we replace scalar-valued functions $Q$ by functions $\mu$ taking values that are entire probability distributions: $\mu^{(x,a)} \in \mathcal{P}_b(\mathbb{R})$ for each pair $(x,a)$. In other words, $\mu$ is a collection of distributions indexed by states and actions. We recall below the definition of the distributional Bellman operator, which generalizes the Bellman operator to distributions.

Definition 2.1 (Distributional Bellman operator, Bellemare et al. (2017)). Let $\pi$ be a policy. The distributional Bellman operator $\mathcal{T}^\pi : \mathcal{P}_b(\mathbb{R})^{\mathcal{X}\times\mathcal{A}} \to \mathcal{P}_b(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$ is defined for any distribution function $\mu = (\mu^{(x,a)})_{x,a}$ by

$$(\mathcal{T}^\pi \mu)^{(x,a)} = \sum_{x',a'} P(x'|x,a)\, \pi(a'|x')\, f_{r(x,a,x'),\gamma}\#\mu^{(x',a')}.$$

Figure 1: Markov decision process with two states $\mathcal{X} = \{x_1, x_2\}$, two actions $\mathcal{A} = \{a_1, a_2\}$ and reward function independent of the next state: $r(x,a,\cdot) \equiv r(x,a)$. The transitions and rewards shown in the diagram are: $P(x_1|x_1,a_1) = 1$, $P(x_2|x_1,a_1) = 0$, $r(x_1,a_1) = 1$; $P(x_1|x_1,a_2) = P(x_2|x_1,a_2) = 0.5$, $r(x_1,a_2) = 0.5$; $P(x_2|x_2,a_1) = 1$, $P(x_1|x_2,a_1) = 0$, $r(x_2,a_1) = 2$; $P(x_1|x_2,a_2) = P(x_2|x_2,a_2) = 0.5$, $r(x_2,a_2) = 2.5$.

We know from Bellemare et al. (2017) that $\mathcal{T}^\pi$ is a $\gamma$-contraction in the maximal $p$-Wasserstein metric³ $\overline{W}_p(\mu_1, \mu_2) = \max_{(x,a) \in \mathcal{X}\times\mathcal{A}} W_p(\mu_1^{(x,a)}, \mu_2^{(x,a)})$ at any order $p \in [1, +\infty]$. Consequently, $\mathcal{T}^\pi$ has a unique fixed point $\mu_\pi = (\mu_\pi^{(x,a)})_{x,a}$ equal to the collection of the probability distributions of the returns:

$$\mu_\pi^{(x,a)} = \mathrm{Distr}\left(\sum_{t=0}^{\infty} \gamma^t\, r(X_t, A_t, X_{t+1}) \;\Big|\; X_0 = x,\, A_0 = a\right).$$

³For $p \ge 1$, we recall that the $p$-Wasserstein distance between two probability distributions $\nu_1, \nu_2$ on $\mathbb{R}$ with CDFs $F_1, F_2$ is defined as $W_p(\nu_1, \nu_2) = \big(\int_{\tau=0}^1 |F_1^{-1}(\tau) - F_2^{-1}(\tau)|^p\, d\tau\big)^{1/p}$. If $p = \infty$, $W_\infty(\nu_1, \nu_2) = \sup_{\tau \in (0,1)} |F_1^{-1}(\tau) - F_2^{-1}(\tau)|$.

Applying $\mathcal{T}^\pi$ may be prohibitive in terms of space complexity: if $\mu$ is discrete, then $\mathcal{T}^\pi \mu$ is still discrete but with up to $|\mathcal{X}| \cdot |\mathcal{A}|$ times more atoms, as shown in the following example.

Example 2.2. Let $\mu = (\mu^{(x,a)})_{x,a}$ be the collection of the atomic distributions $\sum_{k=1}^K p_k(x,a)\,\delta_{Q_k(x,a)}$, where $K \ge 1$, $p_k(x,a) \ge 0$ and $p_1(x,a) + \dots + p_K(x,a) = 1$. Then, $\mathcal{T}^\pi \mu$ is also a collection of atomic distributions with (at most) $|\mathcal{X}| \cdot |\mathcal{A}|$ times more atoms:

$$(\mathcal{T}^\pi \mu)^{(x,a)} = \sum_{x',a'} P(x'|x,a)\, \pi(a'|x') \sum_{k=1}^K p_k(x',a')\, \delta_{r(x,a,x') + \gamma Q_k(x',a')},$$

where we stress that the entire expression $r(x,a,x') + \gamma Q_k(x',a')$ is the location of the Dirac mass.

Projected operators. Motivated by this space complexity issue, projected DistrRL operators were proposed to ensure a predefined and fixed space complexity budget. A projected DistrRL operator is the composition of a DistrRL operator with a projection onto some parametric family of distributions. The Cramér distance projection $\Pi_C$ has been considered in (Bellemare et al., 2017; Rowland et al., 2018; Bellemare et al., 2019): we recall its definition below.

Definition 2.3 (Cramér projection). Let $K \ge 2$ and $z_1 < \dots < z_K$ be real numbers defining the support of the categorical distributions. The Cramér projection is then defined, for all $z \in \mathbb{R}$, by:

$$\Pi_C(\delta_z) = \begin{cases} \delta_{z_1} & \text{if } z \le z_1, \\ \dfrac{z_{j+1} - z}{z_{j+1} - z_j}\,\delta_{z_j} + \dfrac{z - z_j}{z_{j+1} - z_j}\,\delta_{z_{j+1}} & \text{if } z_j < z \le z_{j+1}, \\ \delta_{z_K} & \text{if } z > z_K, \end{cases}$$

and extended linearly to mixtures of Dirac measures. We will also apply $\Pi_C$ entrywise to collections of distributions: if $\mu = (\mu^{(x,a)})_{x,a}$, then $\Pi_C\mu = (\Pi_C\mu^{(x,a)})_{x,a}$.
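Definition 2.3 translates into a few lines of NumPy. The sketch below (hypothetical helper names, our own code) represents a categorical distribution by its probability vector over the fixed support; the binary search also shows why projecting a single Dirac is cheap, a point we return to in Table 2.

```python
import numpy as np

def cramer_project_dirac(z, support):
    """Cramer projection of a Dirac mass delta_z onto z_1 < ... < z_K
    (Definition 2.3). Returns a probability vector over the K atoms."""
    K = len(support)
    p = np.zeros(K)
    if z <= support[0]:
        p[0] = 1.0
    elif z >= support[-1]:
        p[-1] = 1.0
    else:
        j = np.searchsorted(support, z) - 1       # z_j < z <= z_{j+1}, O(log K)
        w = (z - support[j]) / (support[j + 1] - support[j])
        p[j], p[j + 1] = 1.0 - w, w               # split mass between neighbors
    return p

def cramer_project_mixture(atoms, probs, support):
    """Linear extension to a mixture sum_k probs[k] * delta_{atoms[k]}."""
    return sum(q * cramer_project_dirac(z, support) for z, q in zip(atoms, probs))
```

One can check on examples that the projection preserves the mean whenever $z \in [z_1, z_K]$, which is exactly the property (2) recalled below.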
Figure 2: Instability of categorical distributional control compared to the convergence of our one-step approach. The two leftmost plots show the iterations of the Cramér-projected full optimality operator $\Pi_C \mathcal{T}^*$ (panels (a) CDRL at $(x_1,a_1)$ and (d) CDRL at $(x_1,a_2)$), the center plots are obtained with our one-step operator $\Pi_C T$ (panels (b) and (e)), and the rightmost plots correspond to the Q-functions $Q(x,a) = \sum_{k=1}^4 p_k(x,a)\, z_k$ for both methods (panels (c) and (f)).

The Cramér projection satisfies an important mean-preserving property: for any discrete distribution $q$ with support included in the interval $[z_1, z_K]$,

$$\mathbb{E}_{Z \sim \Pi_C(q)}[Z] = \mathbb{E}_{Z \sim q}[Z]. \tag{2}$$

In this line of work, the proxy distributions $\sum_{k=1}^K p_k(x,a)\,\delta_{z_k}$ are parametrized by the categorical probabilities $p_1(x,a), \dots, p_K(x,a)$. In (Dabney et al., 2018), the Wasserstein-1 projection $\Pi_{W_1}$ onto atomic distributions $\frac{1}{N}\sum_{i=1}^N \delta_{Q_i(x,a)}$ is used. For this specific projection, the atoms that best approximate some CDF $F_{x,a}$ are obtained through quantile regression, with $Q_i(x,a) = F_{x,a}^{-1}\big(\frac{2i-1}{2N}\big)$.

2.3 Instability of distributional control

Bellemare et al. (2017) also proposed DistrRL optimality operators $\mathcal{T}^*$, defined such that $\mathcal{T}^*\mu = \mathcal{T}^{\pi_G}\mu$ where $\pi_G$ is a greedy policy with respect to the expectation of $\mu$. The authors showed in their Propositions 1-2 that, unfortunately, $\mathcal{T}^*$ is not a contraction and does not necessarily admit a fixed point. For that reason, the convergence analysis of the control case in DistrRL is more challenging than for the evaluation task. Rowland et al. (2018) and Bellemare et al. (2023) (see chapter 7.4, in particular Theorem 7.9) circumvent this issue by reducing the control task to evaluation under the assumption that the optimal policy is unique. We start our investigation by verifying that this uniqueness hypothesis is critical to ensure convergence in DistrRL control. For that purpose, we choose a toy example, fully described in Section 5, with (infinitely) many optimal policies, and we run a CDRL procedure on it. As expected, we observe in Figure 2 the unstable behavior of the DistrRL paradigm for the control task. Indeed, as explained in chapter 7.5 of Bellemare et al. (2023), multiple optimal policies can produce different distributions that are then mixed by DistrRL control algorithms, which might never converge. In the next section, we introduce our new one-step distributional approach, which converges even in this situation since all optimal policies will be shown to share the same fixed-point distribution.

3 One-step distributional operators

This section introduces the building blocks of our new framework. The main goal here is to show that our one-step approach is theoretically sound, though less ambitious than full DistrRL. The main benefit of this simplification is that it allows us to derive, in the next section, a convergence result holding for both evaluation and control.
In particular, in the control case we will not need any additional assumption such as the uniqueness of the optimal policy, as in CDRL (Rowland et al., 2018).

3.1 Formal definitions

Let us now define the one-step DistrRL operators. Intuitively, they are similar to the full DistrRL operator $\mathcal{T}^\pi$, except that they average out the randomness after the first random transition and are thus oblivious to the randomness induced by the remaining time steps.

Definition 3.1 (One-step distributional Bellman operator). Let $\pi$ be a policy. The one-step distributional Bellman operator $T^\pi : \mathcal{P}_b(\mathbb{R})^{\mathcal{X}\times\mathcal{A}} \to \mathcal{P}_b(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$ is defined for any distribution function $\mu$ by

$$(T^\pi \mu)^{(x,a)} = \sum_{x'} P(x'|x,a)\, \delta_{r(x,a,x') + \gamma \sum_{a'} \pi(a'|x')\, \mathbb{E}_{Z \sim \mu^{(x',a')}}[Z]},$$

where we stress that the location of the Dirac mass is $r(x,a,x') + \gamma \sum_{a'} \pi(a'|x')\, \mathbb{E}_{Z \sim \mu^{(x',a')}}[Z]$.

Definition 3.2 (One-step distributional Bellman optimality operator). The one-step distributional Bellman optimality operator $T : \mathcal{P}_b(\mathbb{R})^{\mathcal{X}\times\mathcal{A}} \to \mathcal{P}_b(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$ is defined for any $\mu$ by

$$(T \mu)^{(x,a)} = \sum_{x'} P(x'|x,a)\, \delta_{r(x,a,x') + \gamma \max_{a'} \mathbb{E}_{Z \sim \mu^{(x',a')}}[Z]}.$$

Example 3.3. Let $\mu$ be the same discrete distribution function as in Example 2.2. Then,

$$(T^\pi \mu)^{(x,a)} = \sum_{x'} P(x'|x,a)\, \delta_{r(x,a,x') + \gamma \sum_{a'} \pi(a'|x')\, Q(x',a')} \quad \text{and} \quad (T \mu)^{(x,a)} = \sum_{x'} P(x'|x,a)\, \delta_{r(x,a,x') + \gamma \max_{a'} Q(x',a')},$$

where $Q(x',a') = \sum_{k=1}^K p_k(x',a')\, Q_k(x',a')$. Similarly to $\mathcal{T}^\pi$, the one-step operators $T$ and $T^\pi$ lead to a space complexity issue by producing distributions with a number $|\mathcal{X}|$ of atoms, which can be too demanding for a large state space.

3.2 Main properties

Let us now discuss some key properties satisfied by our new one-step distributional operators. We first show that, similarly to the non-distributional setting, our approach comes with contractive operators in both evaluation and control.

Proposition 3.1 (Contractivity). Let $\pi$ be a policy. (i) For any $1 \le p \le \infty$, the one-step operators $T^\pi$ and $T$ are $\gamma$-contractions in $\overline{W}_p$. (ii) The Cramér-projected one-step operators $\Pi_C T^\pi$ and $\Pi_C T$ are $\gamma$-contractions in $\overline{W}_1$.

We stress that Proposition 3.1 highly contrasts with classic DistrRL, where the control operator $\mathcal{T}^*$ is not a contraction in any metric. A major consequence of contractivity is the existence and uniqueness of fixed points, whose explicit formulas are gathered in the next proposition and in Table 1.

Proposition 3.2 (Fixed points). Let $\pi$ be a policy and consider $\Pi_C$ from Definition 2.3.
(i) The unique fixed point of $T^\pi$ is $\nu_\pi$ given by $\nu_\pi^{(x,a)} = \sum_{x'} P(x'|x,a)\, \delta_{r(x,a,x') + \gamma V^\pi(x')}$.
(ii) The unique fixed point of $T$ is $\nu^*$ given by $\nu^{*(x,a)} = \sum_{x'} P(x'|x,a)\, \delta_{r(x,a,x') + \gamma V^*(x')}$.
(iii) If $z_1 \le r(x,a,x') + \gamma V^\pi(x') \le z_K$ for all triplets $(x,a,x')$, then the unique fixed point of $\Pi_C T^\pi$ is $\eta_\pi = \Pi_C(\nu_\pi)$.
(iv) If $z_1 \le r(x,a,x') + \gamma V^*(x') \le z_K$ for all triplets $(x,a,x')$, then the unique fixed point of $\Pi_C T$ is $\eta^* = \Pi_C(\nu^*)$.

Figure 3: Illustration of Examples 2.2-3.3 in the two-states two-actions MDP depicted in Figure 1, with discount factor $\gamma = \frac{1}{2}$, for the stochastic policy $\pi(a|x) = \frac{1}{2}$ for all $x, a$. The probability distribution $\mu_j^{(x_1,a_2)}$ with histogram represented in red (resp. in blue) is obtained by applying $j$ successive times the one-step distributional operator $T^\pi$ (resp. the full distributional operator $\mathcal{T}^\pi$) to the initial distributions $\mu_0^{(x,a)} = \delta_0$ (panels: (a) $\mu_0^{(x_1,a_2)}$, (b) $\mu_2^{(x_1,a_2)}$, (c) $\mu_4^{(x_1,a_2)}$). The red one-step histogram has a constant number of 2 atoms, corresponding to the number of states. Meanwhile, the blue full histogram approximates a continuous probability density with an increasing number of atoms along iterations.
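Proposition 3.2 makes the one-step fixed points computable by ordinary (non-distributional) dynamic programming. The sketch below (our own illustration, reusing the hypothetical `q_pi` helper from the Section 2.1 sketch) builds $\nu_\pi$ explicitly; note that each $\nu_\pi^{(x,a)}$ carries at most $|\mathcal{X}|$ atoms, as claimed.

```python
import numpy as np

def one_step_fixed_point(P, r, pi, gamma):
    """nu_pi from Proposition 3.2-(i): for each (x, a), the atoms are located at
    r(x, a, x') + gamma * V^pi(x'), weighted by P(x'|x, a)."""
    Q = q_pi(P, r, pi, gamma)                     # classic evaluation fixed point
    V = (pi * Q).sum(axis=1)                      # V^pi(x) = sum_a pi(a|x) Q(x, a)
    X, A = Q.shape
    return {(x, a): (r[x, a] + gamma * V,         # atom locations, one per x'
                     P[x, a])                     # atom probabilities P(.|x, a)
            for x in range(X) for a in range(A)}
```

The control fixed point $\nu^*$ would be obtained the same way, with $V^*$ in place of $V^\pi$.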
The proof of Proposition 3.2 (point (i), for instance) involves replacing, inside each Dirac, the quantity $\sum_{x''} P(x''|x',a')\,(r(x',a',x'') + \gamma V^\pi(x''))$ with the simpler Q-value $Q^\pi(x',a')$. This substitution is permissible due to the unique fixed-point characterization of $V^\pi$ and $Q^\pi$. Interestingly, Proposition 3.2-(iii)-(iv) shows that, in our one-step framework, the fixed point of a projected operator is simply the projection of the fixed point of the unprojected operator. Although this fact seems natural, it is not necessarily true in DistrRL. Indeed, the proof of Proposition 3 in (Rowland et al., 2018) suggests that the fixed point of $\Pi_C \mathcal{T}^\pi$ is a worse approximation of $\mu_\pi$ than $\Pi_C \mu_\pi$ is, by a multiplicative factor $\sqrt{1/(1-\gamma)}$ in terms of Cramér distance. Furthermore, we stress that two policies sharing the same value function but with potentially distinct return distributions and risk levels (e.g., different variances) cannot be distinguished via the one-step approach since, by Proposition 3.2-(i), they necessarily share the same fixed point. This observation highlights a notable limitation of one-step DistrRL: it is unable to distinguish between two policies with identical value functions that would, however, yield distinct fixed points in the context of full DistrRL. Equipped with our projected one-step operators, we derive in the next section variants of the CDRL algorithms.

4 One-step DistrRL algorithms

This section introduces new categorical algorithms based on the Cramér projection, together with formal convergence guarantees.

| | DistrRL | One-step DistrRL |
| --- | --- | --- |
| Evaluation | $\mathrm{Distr}\big(\sum_{t=0}^{\infty} \gamma^t r(X_t, A_t, X_{t+1})\big)$ | $\sum_{x'} P(x' \mid x, a)\, \delta_{r(x,a,x') + \gamma V^\pi(x')}$ |
| Control | does not necessarily exist | $\sum_{x'} P(x' \mid x, a)\, \delta_{r(x,a,x') + \gamma V^*(x')}$ |

Table 1: Comparison of the fixed points of the distributional Bellman operators in DistrRL versus one-step DistrRL.

| | CDRL | One-step CDRL |
| --- | --- | --- |
| Categorical target | $\Pi_C\big(\sum_{k=1}^K p_{t,k}(x_{t+1}, a^*)\, \delta_{r_t + \gamma z_k}\big)$ | $\Pi_C\big(\delta_{r_t + \gamma Q_t(x_{t+1}, a^*)}\big)$ |
| Time complexity | $O(K \log K)$ | $O(\log K)$ |

Table 2: Comparison of the categorical targets in CDRL versus one-step CDRL. The notation $a^*$ refers to a greedy action, namely $a^* \in \arg\max_{a'} Q_t(x_{t+1}, a')$.

4.1 One-Step CDRL

We propose two algorithms, for policy evaluation and control respectively, described in Algorithm 1. These new categorical methods are derived from the stochastic approximation of the projected operators $\Pi_C T^\pi$ and $\Pi_C T$. For each state-action pair $(x,a)$, both methods learn a discrete probability distribution $p_1(x,a), \dots, p_K(x,a)$ over some fixed support $z_1 < \dots < z_K$. As in classic RL algorithms, we consider a sequence of stepsizes $\alpha_t(x,a) \ge 0$ indexed by states, actions and time steps $t \ge 0$. At each time $t$, we perform a mixture update between the current distribution and a distributional target, which is the Cramér projection of a single Dirac mass located at the TD(0) or Q-learning target, computed from a single transition $(x_t, a_t, r_t, x_{t+1})$. The only difference with tabular CDRL lies in the target, as shown in Table 2. Contrary to the original CDRL target, ours remains the same whatever greedy action $a^*$ we choose inside the set $\arg\max_{a'} Q_t(x_{t+1}, a')$: this explains why our approach is stable even when the optimal policy is not unique, as illustrated in Figure 2.
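To make the comparison in Table 2 concrete, here is a small NumPy sketch (our own illustration, reusing the hypothetical helpers `cramer_project_dirac` and `cramer_project_mixture` from the Section 2.2 sketch) of both categorical targets computed from a single transition.

```python
import numpy as np

def cdrl_target(p_next, rt, gamma, support):
    """CDRL categorical target (Table 2, left column): shift the whole support
    and project every atom r_t + gamma*z_k back onto z_1 < ... < z_K.
    One projection per atom, hence the O(K log K) cost noted in the paper
    (this naive sketch is O(K) per atom for clarity)."""
    return cramer_project_mixture(rt + gamma * support, p_next, support)

def one_step_target(q_next, rt, gamma, support):
    """One-step CDRL target (Table 2, right column): a single Dirac at the
    Q-learning target r_t + gamma * max_a' Q_t(x_{t+1}, a'), so only one
    binary-search projection is needed."""
    return cramer_project_dirac(rt + gamma * np.max(q_next), support)
```

Note that `one_step_target` depends on the greedy action only through the value $\max_{a'} Q_t(x_{t+1}, a')$, which is the code-level counterpart of the stability argument above.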
Moreover, our categorical target is faster to compute than in CDRL, where the time complexity $O(K \log K)$ pays the price of inserting every atom $r_t + \gamma z_k$ into the sorted array $(z_1, \dots, z_K)$.

Algorithm 1: Tabular one-step categorical DistrRL
  Input: $\eta_t^{(x,a)} = \sum_{k=1}^K p_{t,k}(x,a)\,\delta_{z_k}$ for all $(x,a)$
  Sample transition: $(x_t, a_t, r_t, x_{t+1})$
  Estimate Q-values: $Q_t(x_{t+1}, a) \leftarrow \sum_{k=1}^K p_{t,k}(x_{t+1}, a)\, z_k$
  if policy evaluation then $\hat\eta_t^{(x_t,a_t)} \leftarrow \Pi_C\big(\delta_{r_t + \gamma \sum_{a'} \pi(a'|x_{t+1})\, Q_t(x_{t+1}, a')}\big) = \sum_{k=1}^K \hat p_{t,k}\,\delta_{z_k}$
  else if control then $\hat\eta_t^{(x_t,a_t)} \leftarrow \Pi_C\big(\delta_{r_t + \gamma \max_{a'} Q_t(x_{t+1}, a')}\big) = \sum_{k=1}^K \hat p_{t,k}\,\delta_{z_k}$
  end if
  Mixture update: $\eta_{t+1}^{(x_t,a_t)} \leftarrow (1 - \alpha_t(x_t,a_t))\,\eta_t^{(x_t,a_t)} + \alpha_t(x_t,a_t)\,\hat\eta_t^{(x_t,a_t)}$, and $\eta_{t+1}^{(x,a)} \leftarrow \eta_t^{(x,a)}$ for all $(x,a) \neq (x_t,a_t)$
  Output: $\eta_{t+1}$

Before moving forward to the convergence analysis of Algorithm 1, we propose its deep RL counterpart in Algorithm 2, based on the minimization of a Kullback-Leibler (KL) loss.

Deep one-step CDRL. The main challenge in a non-tabular context is to learn the distributions in a compact and efficient way. For that purpose, we use the same deep categorical approach as in C51 (Bellemare et al., 2017). As shown in Table 1, the one-step method aims at learning much simpler distributions (necessarily atomic) than full DistrRL (typically, continuous distributions carrying more information). This suggests choosing a smaller number of categories, e.g., $K = 4$ as used in Section 5, compared to what is commonly used in CDRL. Due to its similarity with C51, we call this new algorithm "OS-C51".

Algorithm 2: OS-C51 (single update)
  Input: categorical distributions $\eta_\theta^{(x,a)} = \sum_{k=1}^K p_{\theta,k}(x,a)\,\delta_{z_k}$ and a transition $(x_t, a_t, r_t, x_{t+1})$
  Compute Q-function in next state: $Q(x_{t+1}, a') \leftarrow \sum_{k=1}^K p_{\theta,k}(x_{t+1}, a')\, z_k$
  Compute categorical target: $\hat\eta^{(x_t,a_t)} \leftarrow \Pi_C\big(\delta_{r_t + \gamma \max_{a'} Q(x_{t+1}, a')}\big)$
  Output: $\mathrm{KL}\big(\hat\eta^{(x_t,a_t)} \,\|\, \eta_\theta^{(x_t,a_t)}\big)$

4.2 Convergence analysis

We now provide convergence guarantees for our tabular one-step DistrRL algorithms. The major difference with the analysis of CDRL (Rowland et al., 2018) is that we do not require the uniqueness of the optimal policy in the case of control. We also rely on the existing analysis of non-distributional RL (Dayan, 1992; Tsitsiklis, 1994; Jaakkola et al., 1993). In particular, we require the following standard assumption.

Assumption 4.1. For any pair $(x,a) \in \mathcal{X}\times\mathcal{A}$, the stepsizes $(\alpha_t(x,a))_{t \ge 0}$ satisfy the Robbins-Monro conditions: $\sum_{t=0}^{\infty} \alpha_t(x,a) = \infty$ and $\sum_{t=0}^{\infty} \alpha_t(x,a)^2 < \infty$ almost surely.

Equipped with stepsizes $(\alpha_t(x,a))$ satisfying the Robbins-Monro conditions, we are now ready to state our main theoretical contribution: namely, the convergence of Algorithm 1.

Theorem 4.1. Consider Algorithm 1 and assume that Assumption 4.1 holds for the stepsizes $(\alpha_t(x,a))_{t,x,a}$.
(i) In the case of evaluation of a policy $\pi$, $\overline{W}_1(\eta_t, \eta_\pi) \to 0$ as $t \to \infty$ almost surely, where $\eta_\pi$ is defined in Proposition 3.2-(iii).
(ii) In the case of control, $\overline{W}_1(\eta_t, \eta^*) \to 0$ as $t \to \infty$ almost surely, where $\eta^*$ is defined in Proposition 3.2-(iv).

The proof of Theorem 4.1 is deferred to the Supplementary Material: it follows the same steps as the proofs of Theorem 2 in (Tsitsiklis, 1994) and Theorem 1 in (Rowland et al., 2018). Notably, our analysis remains the same for evaluation and control, contrary to (Rowland et al., 2018), where the control case requires additional assumptions (namely, uniqueness of the optimal policy and, for all $t \ge 0$ and all $1 \le k \le K$ such that $p_{t,k}(x_{t+1}, a^*) \neq 0$, $r_t + \gamma z_k \in [z_1, z_K]$ almost surely).
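Before turning to experiments, here is one possible NumPy implementation of the tabular update in Algorithm 1 (a sketch under our own array conventions, with `cramer_project_dirac` from the Section 2.2 sketch; not the authors' code).

```python
import numpy as np

def os_cdrl_update(p, transition, support, alpha, gamma, pi=None):
    """One iteration of (our reading of) Algorithm 1. p has shape (X, A, K) and
    holds the categorical probabilities p_{t,k}(x, a); it is updated in place.
    pi is None for control; otherwise pi[x, a] gives the evaluated policy."""
    x, a, rt, x_next = transition
    q_next = p[x_next] @ support                  # Q_t(x_{t+1}, .) = sum_k p_k z_k
    if pi is None:                                # control: Q-learning-style target
        g = rt + gamma * q_next.max()
    else:                                         # evaluation of the policy pi
        g = rt + gamma * pi[x_next] @ q_next
    target = cramer_project_dirac(g, support)     # projected single-Dirac target
    p[x, a] = (1 - alpha) * p[x, a] + alpha * target
```

By the mean-preserving property (Eq. (2)), taking the expectation of both sides of the mixture update recovers exactly the TD(0) or Q-learning update on $Q_t$, as observed again in Section 5.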
5 Numerical experiments

In this section, we present numerical experiments on both tabular and Atari game environments.

5.1 Tabular setting

We describe our tabular experiments: first in a dynamic programming context, i.e., knowing the transition kernel $P$ and the reward function $r$, then in the Frozen Lake environment by only observing empirical transitions.

Figure 4: Learning dynamics of the atomic probability distribution $(p_{t,k}(x,a))_{1 \le k \le K=3}$ produced by Algorithm 1 (control) on the Frozen Lake environment across 100000 iterations, with support $(z_1, z_2, z_3) = (0, 10, 20)$, constant stepsize $\alpha = 0.6$ and $\varepsilon$-greedy exploration. The results are averaged over 100 seeds. Panels: (a) probabilities at $(x_5, a_3)$; (b) probabilities $p_{t,1}, p_{t,2}, p_{t,3}$ at $(x_{11}, a_1)$; (c) the error $\|Q_t - Q^*\|_2^2$.

Distributional dynamic programming. Figures 2-3 are obtained by exact dynamic programming in the MDP of Figure 1 with $\gamma = \frac{1}{2}$: in this specific case, all policies are optimal. As discussed in Subsection 2.3, this handcrafted example reveals the instability of classic CDRL in Figure 2, contrary to our one-step approach; for both methods we consider $K = 4$ categories with $(z_1, z_2, z_3, z_4) = (0, 1.9, 2.1, 10)$. We point out that our results are sensitive to this specific choice of support. Indeed, always taking the action $a_1$ in state $x_1$ results in a total discounted distributional return reduced to a Dirac mass at 2. However, taking the action $a_2$ leads to the same expected value $Q^*(x_1, a_2) = 2$ but with non-zero variance. Thus, CDRL targets can occasionally (when taking action $a_2$) fall outside the selected interval $[1.9, 2.1]$, leading to the depicted instability. Figure 3 illustrates the space complexity issue of DistrRL without projection: while the number of atoms is multiplied by at most $|\mathcal{X}||\mathcal{A}|$ after each application of $\mathcal{T}^\pi$, our one-step operators produce distributions with (at most) as many atoms as there are states, i.e., two in this example.

Frozen Lake. We also consider the Frozen Lake environment from OpenAI Gym (Brockman et al., 2016) with discount factor $\gamma = 0.95$. It is characterized by 16 states $x_1, \dots, x_{16}$ and four actions $a_1, \dots, a_4$. Moreover, this is a stochastic environment, i.e., the transition probabilities $P(x'|x,a)$ are not all equal to either 0 or 1, which justifies a distributional approach. The Frozen Lake environment is a gridworld game (4x4 grid) where an agent moves on slippery ice (stochastic transitions) towards a goal. The agent can take 4 actions (up, down, left, right), aiming to reach the goal state from the starting state while avoiding holes, and only receives a reward of 1 upon reaching the goal, and 0 otherwise. Due to the slippery attribute of the environment, the transitions are stochastic, meaning the agent may not always end up in the state it intended to move to. For example, if the agent chooses to move right, it may actually move up, right, or down with equal probability 1/3 due to the slippery ice. In Figure 4, we plot the iterates $p_{t,k}(x,a)$ over 100000 steps generated by our tabular Algorithm 1 with $K = 3$ atoms and $(z_1, z_2, z_3) = (0, 10, 20)$, constant stepsize $\alpha = 0.6$ and $\varepsilon$-greedy exploration with $\varepsilon$ exponentially decaying from 1 to 0.25. We average the results over 100 seeds. Given a pair $(x,a)$, we observe the joint convergence of the three probabilities. As in CDRL, and because of the mean-preserving property (Eq. (2)), the average $Q_t(x,a) = \sum_{k=1}^3 p_{t,k}(x,a)\, z_k$ coincides with the Q-learning iterates and converges to $Q^*$.
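A hypothetical driver loop for this Frozen Lake experiment could look as follows (Gymnasium API and the `os_cdrl_update` sketch from Section 4.1 assumed; the decay constant is our own rough choice, and for brevity we do not cut the bootstrap at terminal states, unlike a careful implementation).

```python
import gymnasium as gym
import numpy as np

# Hypothetical driver (our own sketch); hyperparameters follow the text above.
env = gym.make("FrozenLake-v1", is_slippery=True)
X, A = env.observation_space.n, env.action_space.n
support = np.array([0.0, 10.0, 20.0])             # (z_1, z_2, z_3) from the text
p = np.full((X, A, len(support)), 1.0 / len(support))  # uniform initialization
alpha, gamma, eps = 0.6, 0.95, 1.0

obs, _ = env.reset()
for t in range(100_000):
    q = p[obs] @ support                          # Q_t(x, .) from the categoricals
    a = env.action_space.sample() if np.random.rand() < eps else int(q.argmax())
    obs_next, r, terminated, truncated, _ = env.step(a)
    os_cdrl_update(p, (obs, a, r, obs_next), support, alpha, gamma)  # control
    eps = max(0.25, eps * 0.99997)                # rough exponential decay to 0.25
    obs = obs_next if not (terminated or truncated) else env.reset()[0]
```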
5.2 Atari games

For the experiments on Atari games (Bellemare et al., 2013), we implement⁴ the OS-C51 agent on top of the popular CleanRL codebase (Huang et al., 2022). We compare the OS-C51 algorithm against C51 for two different numbers of atoms: $K = 4$ or $K = 51$. In each of the two cases, we choose the support to be evenly spread over the interval $[-10, 10]$: $z_1 = -10 < \dots < z_K = 10$. We use the same architecture as C51: a deep neural network, parameterized by $\theta$, takes an observation as input and outputs a vector of $K$ logits. We run DQN and C51 with the default hyperparameters from CleanRL; for our new OS-C51 method, we take the same hyperparameters as in C51 due to their similarities. All three methods are based on a deep neural network with 3 convolutional layers followed by 2 fully connected layers with ReLU activation functions.

⁴Here: https://github.com/mastane/cleanrl

Figure 5: Average evaluation episodic return over 10 million frames for three Atari games: (a) BeamRiderNoFrameskip-v4, (b) BreakoutNoFrameskip-v4, (c) PongNoFrameskip-v4. Each panel compares DQN, C51 (4 atoms), C51 (51 atoms), OS-C51 (4 atoms) and OS-C51 (51 atoms).

All methods use the same linearly decaying $\varepsilon$-greedy exploration, and the transitions are collected in a replay memory buffer of size 1000000. In particular, for DQN we use the Adam optimizer (Kingma & Ba, 2014) with learning rate equal to 0.0001 and batch size of 32. For the optimization of C51 and OS-C51, we use the Adam optimizer with learning rate set to 0.00025 and batch size equal to 32. The results are averaged over 5 seeds. As shown in Figure 5, the performance of OS-C51 (with only four atoms) is comparable with C51 on the BeamRider and Pong games. However, there seems to be a significant advantage for C51 on Breakout: this could be attributed to C51's objective of approximating distributions with more complex and richer information, potentially offering a deeper understanding and better representation of the game dynamics. Finally, we highlight that we did not use "sticky actions", where a chosen action is randomly repeated for several consecutive frames, introducing a form of environmental stochasticity. This could be an interesting source of stochasticity to explore in future experimental investigations, potentially adding complexity and richness to the learning process.

6 Conclusion

We proposed new distributional RL algorithms that naturally extend TD(0) and Q-learning in the tabular setting. The main novelty in our approach is the use of new one-step distributional operators circumventing the instability issues of DistrRL control. We provided both a theoretical convergence analysis and an empirical proof of concept. A significant limitation of one-step DistrRL, as underscored in our study, is its inability to distinguish between two policies with identical value functions but differing distributions. This points to a crucial advantage of full DistrRL, which can capture and exploit these subtleties.
Future research could investigate the generalization of our method based on the dynamics of several successive steps instead of a single one. Investigating this multi-step extension would likely reveal further important insights into the dynamics of RL processes. Another promising direction for future research lies in exploring the possible connection between full DistrRL and a multi-step approach with infinitely many steps. Such an investigation could offer a deeper understanding of the relationship between these methods and help to clarify the distinct benefits and trade-offs involved.

References

Mastane Achab. Ranking and risk-aware reinforcement learning. PhD thesis, Institut polytechnique de Paris, 2020.

Mastane Achab and Gergely Neu. Robustness and risk management via distributional dynamic programming. arXiv preprint arXiv:2112.15430, 2021.

Mastane Achab, Reda Alami, Yasser Abdelaziz Dahou Djilali, Kirill Fedyanin, Eric Moulines, and Maxim Panov. Distributional deep Q-learning with CVaR regression. In Deep Reinforcement Learning Workshop, NeurIPS 2022, 2022.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449-458. PMLR, 2017.

Marc G Bellemare, Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra. Distributional reinforcement learning with linear function approximation. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2203-2211. PMLR, 2019.

Marc G Bellemare, Will Dabney, and Mark Rowland. Distributional reinforcement learning. MIT Press, 2023.

Richard Bellman. Dynamic programming. Science, 153(3731):34-37, 1966.

François Bolley. Separability and completeness for the Wasserstein distance. In Séminaire de probabilités XLI, pp. 371-377. Springer, 2008.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Peter Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8(3):341-362, 1992.

Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G.M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1-18, 2022.

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. Advances in Neural Information Processing Systems, 6, 1993.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In ICML, 2010a.

Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Parametric return density estimation for reinforcement learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 368-375, 2010b.

Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.

Mark Rowland, Marc Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 29-37. PMLR, 2018.

Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G Bellemare, and Will Dabney. An analysis of quantile temporal-difference learning. arXiv preprint arXiv:2301.04462, 2023.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202, 1994.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279-292, 1992.

A Proof of Proposition 3.1

(i) Contraction in Wasserstein distance. Let $1 \le p < \infty$ and let $\mu_1, \mu_2 \in \mathcal{P}_b(\mathbb{R})^{\mathcal{X}\times\mathcal{A}}$ be two distribution functions. Denoting by $Q_1(x,a)$ and $Q_2(x,a)$ the respective expectations of $\mu_1^{(x,a)}$ and $\mu_2^{(x,a)}$, we have for any pair $(x,a)$:

$$\begin{aligned}
W_p^p\big((T\mu_1)^{(x,a)}, (T\mu_2)^{(x,a)}\big)
&= W_p^p\Big(\sum_{x'} P(x'|x,a)\,\delta_{r(x,a,x')+\gamma\max_{a'}Q_1(x',a')},\ \sum_{x'} P(x'|x,a)\,\delta_{r(x,a,x')+\gamma\max_{a'}Q_2(x',a')}\Big) \\
&\le \sum_{x'} P(x'|x,a)\, W_p^p\big(\delta_{r(x,a,x')+\gamma\max_{a'}Q_1(x',a')},\ \delta_{r(x,a,x')+\gamma\max_{a'}Q_2(x',a')}\big) \\
&= \gamma^p \sum_{x'} P(x'|x,a)\, \big|\max_{a'}Q_1(x',a') - \max_{a'}Q_2(x',a')\big|^p \\
&\le \gamma^p \sum_{x'} P(x'|x,a)\, \max_{a'} |Q_1(x',a') - Q_2(x',a')|^p \\
&= \gamma^p \sum_{x'} P(x'|x,a)\, \max_{a'} \Big|\int_{\tau=0}^1 \big(F_{1,x',a'}^{-1}(\tau) - F_{2,x',a'}^{-1}(\tau)\big)\,d\tau\Big|^p \\
&\le \gamma^p \sum_{x'} P(x'|x,a)\, \max_{a'} \underbrace{\int_{\tau=0}^1 \big|F_{1,x',a'}^{-1}(\tau) - F_{2,x',a'}^{-1}(\tau)\big|^p\,d\tau}_{W_p^p(\mu_1^{(x',a')},\,\mu_2^{(x',a')})} \quad \text{(by Jensen's inequality)} \\
&\le \gamma^p\, \overline{W}_p^p(\mu_1, \mu_2),
\end{aligned}$$

where $F_{1,x,a}$ and $F_{2,x,a}$ denote the respective CDFs of $\mu_1^{(x,a)}$ and $\mu_2^{(x,a)}$, and the first inequality follows from the interpretation of the Wasserstein distance as an infimum over couplings. Indeed, we have upper bounded

$$W_p^p\big((T\mu_1)^{(x,a)}, (T\mu_2)^{(x,a)}\big) = \inf_{\lambda \in \Lambda} \mathbb{E}_{(Z_1,Z_2)\sim\lambda}\big[|Z_1 - Z_2|^p\big],$$

where $\Lambda$ denotes the set of all couplings of $(T\mu_1)^{(x,a)}$ and $(T\mu_2)^{(x,a)}$, via the specific coupling $\lambda_0$ such that

$$\lambda_0\Big(\big\{r(x,a,x') + \gamma\max_{a'}Q_1(x',a')\big\} \times \big\{r(x,a,x') + \gamma\max_{a'}Q_2(x',a')\big\}\Big) = P(x'|x,a), \quad \forall x'.$$

Hence, by taking the supremum over all $(x,a)$, we deduce that $T$ is a $\gamma$-contraction in $\overline{W}_p$: $\overline{W}_p(T\mu_1, T\mu_2) \le \gamma\, \overline{W}_p(\mu_1, \mu_2)$.

Similarly, for $p = \infty$, the same coupling $\lambda_0$ (whose essential supremum is attained on the support of $P(\cdot|x,a)$) yields

$$\begin{aligned}
W_\infty\big((T\mu_1)^{(x,a)}, (T\mu_2)^{(x,a)}\big)
&\le \gamma \max_{x':\,P(x'|x,a)>0} \big|\max_{a'}Q_1(x',a') - \max_{a'}Q_2(x',a')\big| \\
&\le \gamma \max_{x',a'} |Q_1(x',a') - Q_2(x',a')| = \gamma \max_{x',a'} \Big|\int_{\tau=0}^1 \big(F_{1,x',a'}^{-1}(\tau) - F_{2,x',a'}^{-1}(\tau)\big)\,d\tau\Big| \\
&\le \gamma \max_{x',a'} \sup_{\tau \in (0,1)} \big|F_{1,x',a'}^{-1}(\tau) - F_{2,x',a'}^{-1}(\tau)\big| \le \gamma\, \overline{W}_\infty(\mu_1, \mu_2).
\end{aligned}$$
For policy evaluation, one can follow the same proof with the following additional step:

$$\Big|\sum_{a'} \pi(a'|x')\big(Q_1(x',a') - Q_2(x',a')\big)\Big| \le \sum_{a'} \pi(a'|x')\,\big|Q_1(x',a') - Q_2(x',a')\big| \le \max_{a'} \big|Q_1(x',a') - Q_2(x',a')\big|.$$

(ii) Let us write:

$$\begin{aligned}
W_1\big((\Pi_C T\mu_1)^{(x,a)}, (\Pi_C T\mu_2)^{(x,a)}\big)
&= W_1\Big(\sum_{x'} P(x'|x,a)\,\Pi_C\delta_{r(x,a,x')+\gamma\max_{a'}Q_1(x',a')},\ \sum_{x'} P(x'|x,a)\,\Pi_C\delta_{r(x,a,x')+\gamma\max_{a'}Q_2(x',a')}\Big) \\
&\le \sum_{x'} P(x'|x,a)\, W_1\big(\Pi_C\delta_{r(x,a,x')+\gamma\max_{a'}Q_1(x',a')},\ \Pi_C\delta_{r(x,a,x')+\gamma\max_{a'}Q_2(x',a')}\big) \\
&\le \gamma \sum_{x'} P(x'|x,a)\, \big|\max_{a'}Q_1(x',a') - \max_{a'}Q_2(x',a')\big| \\
&\le \gamma \sum_{x'} P(x'|x,a)\, \max_{a'} |Q_1(x',a') - Q_2(x',a')| \\
&\le \gamma \sum_{x'} P(x'|x,a)\, \max_{a'} \underbrace{\int_{\tau=0}^1 \big|F_{1,x',a'}^{-1}(\tau) - F_{2,x',a'}^{-1}(\tau)\big|\,d\tau}_{W_1(\mu_1^{(x',a')},\,\mu_2^{(x',a')})} \le \gamma\, \overline{W}_1(\mu_1, \mu_2),
\end{aligned}$$

where the second inequality follows from Lemma A.1. Taking the supremum over all $(x,a)$ concludes the proof. The proof for $\Pi_C T^\pi$ is similar.

Lemma A.1. Let $a, b$ be two real numbers. Then $W_1(\Pi_C\delta_a, \Pi_C\delta_b) \le |a - b|$.

Proof. We proceed by exhaustion of all possible (redundant) cases, up to permuting $a$ and $b$:
(i) $a, b$ are both outside the interval $(z_1, z_K]$: the proof is trivial in this case;
(ii) $a \in (z_1, z_K]$, $b \in \{z_1, \dots, z_K\}$ and $a \le b$;
(iii) $a \in (z_1, z_K]$, $b \in \{z_1, \dots, z_K\}$ and $a > b$;
(iv) $a \in (z_1, z_K]$ and $b \notin (z_1, z_K]$;
(v) $a, b$ are both inside the interval $(z_1, z_K]$.

Proof in case (ii). We assume that $b$ belongs to the support, $b \ge a$, and $a$ lies inside the interval $(z_1, z_K]$: $z_j < a \le z_{j+1}$ and $b = z_{k+1}$ with $1 \le j \le k \le K$. Then we have:

$$W_1(\Pi_C\delta_a, \Pi_C\delta_b) = \frac{z_{j+1} - a}{z_{j+1} - z_j}(z_{k+1} - z_j) + \frac{a - z_j}{z_{j+1} - z_j}(z_{k+1} - z_{j+1}) = z_{k+1} - z_j - (a - z_j) = z_{k+1} - a = |a - b|.$$

Proof in case (iii). Similar to (ii), except that $b = z_k$ with $k \le j$:

$$W_1(\Pi_C\delta_a, \Pi_C\delta_b) = \frac{z_{j+1} - a}{z_{j+1} - z_j}(z_j - z_k) + \frac{a - z_j}{z_{j+1} - z_j}(z_{j+1} - z_k) = z_j - z_k + (a - z_j) = a - z_k = |a - b|.$$

Proof in case (iv). Now, if $b$ is outside the interval $(z_1, z_K]$, say $b \le z_1$, then $\Pi_C\delta_b = \delta_{z_1}$ and we deduce from the previous computation that $W_1(\Pi_C\delta_a, \Pi_C\delta_b) = a - z_1 \le |a - b|$.

Proof in case (v). Finally, if both $a$ and $b$ lie in $(z_1, z_K]$, with $z_j < a \le z_{j+1}$,

$$W_1(\Pi_C\delta_a, \Pi_C\delta_b) = W_1\Big(\tfrac{z_{j+1}-a}{z_{j+1}-z_j}\,\delta_{z_j} + \tfrac{a-z_j}{z_{j+1}-z_j}\,\delta_{z_{j+1}},\ \Pi_C\delta_b\Big) \le \tfrac{z_{j+1}-a}{z_{j+1}-z_j}\, W_1(\delta_{z_j}, \Pi_C\delta_b) + \tfrac{a-z_j}{z_{j+1}-z_j}\, W_1(\delta_{z_{j+1}}, \Pi_C\delta_b) = \tfrac{z_{j+1}-a}{z_{j+1}-z_j}\, |b - z_j| + \tfrac{a-z_j}{z_{j+1}-z_j}\, |b - z_{j+1}|, \tag{3}$$

where we used cases (ii) and (iii) in the last equality. Then, if $b \ge z_{j+1}$, Eq. (3) implies

$$W_1(\Pi_C\delta_a, \Pi_C\delta_b) \le \frac{z_{j+1}-a}{z_{j+1}-z_j}(b - z_j) + \frac{a-z_j}{z_{j+1}-z_j}(b - z_{j+1}) = b - a.$$

Symmetrically, if $b \le z_j$, Eq. (3) implies

$$W_1(\Pi_C\delta_a, \Pi_C\delta_b) \le \frac{z_{j+1}-a}{z_{j+1}-z_j}(z_j - b) + \frac{a-z_j}{z_{j+1}-z_j}(z_{j+1} - b) = a - b.$$

Lastly, if $a, b$ both belong to the same segment $(z_j, z_{j+1}]$, say $a \le b$, we have by direct computation (the two projected CDFs differ only on $[z_j, z_{j+1})$):

$$W_1(\Pi_C\delta_a, \Pi_C\delta_b) = (z_{j+1} - z_j)\,\frac{(z_{j+1} - a) - (z_{j+1} - b)}{z_{j+1} - z_j} = b - a.$$
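Lemma A.1 states that the Cramér projection is 1-Lipschitz on Dirac masses for $W_1$. As a quick numerical sanity check (not part of the paper's argument), we can test it with the `cramer_project_dirac` helper from the Section 2.2 sketch, using the closed form $W_1(\nu_1,\nu_2) = \int |F_1 - F_2|$ for distributions on a common finite support.

```python
import numpy as np

def w1_categorical(p, q, support):
    """W1 between two categorical laws on a common sorted support: the integral
    of |F_p - F_q|, which is piecewise constant between consecutive atoms."""
    gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(gap * np.diff(support)))

support = np.array([0.0, 1.9, 2.1, 10.0])         # support used in Section 5
rng = np.random.default_rng(0)
for _ in range(1000):
    a, b = rng.uniform(-1.0, 11.0, size=2)
    w1 = w1_categorical(cramer_project_dirac(a, support),
                        cramer_project_dirac(b, support), support)
    assert w1 <= abs(a - b) + 1e-9                # Lemma A.1 holds numerically
```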
B Proof of Proposition 3.2

Given the completeness of the Wasserstein space (see Bolley (2008)), by combining Proposition 3.1 with Banach's fixed point theorem, we deduce the existence and uniqueness of the fixed points.

(i) We verify, using that the expectation of $\nu_\pi^{(x',a')}$ is $\sum_{x''} P(x''|x',a')\,(r(x',a',x'') + \gamma V^\pi(x'')) = Q^\pi(x',a')$:

$$(T^\pi\nu_\pi)^{(x,a)} = \sum_{x'} P(x'|x,a)\,\delta_{r(x,a,x') + \gamma\sum_{a'}\pi(a'|x')\sum_{x''}P(x''|x',a')(r(x',a',x'')+\gamma V^\pi(x''))} = \sum_{x'} P(x'|x,a)\,\delta_{r(x,a,x') + \gamma\sum_{a'}\pi(a'|x')\,Q^\pi(x',a')} = \nu_\pi^{(x,a)}.$$

(ii) For the control case,

$$(T\nu^*)^{(x,a)} = \sum_{x'} P(x'|x,a)\,\delta_{r(x,a,x') + \gamma\max_{a'}\sum_{x''}P(x''|x',a')(r(x',a',x'')+\gamma V^*(x''))} = \sum_{x'} P(x'|x,a)\,\delta_{r(x,a,x') + \gamma\max_{a'}Q^*(x',a')} = \nu^{*(x,a)}.$$

(iii) We start by computing:

$$(T^\pi\,\Pi_C\nu_\pi)^{(x,a)} = \sum_{x'} P(x'|x,a)\,\delta_{r(x,a,x') + \gamma\sum_{a'}\pi(a'|x')\,Q^\pi(x',a')} = \nu_\pi^{(x,a)},$$

where we used the mean-preserving property of the Cramér projection (see Eq. (2)). Then we deduce that $\Pi_C T^\pi\,\Pi_C\nu_\pi = \Pi_C\nu_\pi = \eta_\pi$.

(iv) Similarly, one can show that $\eta^*$ is the fixed point of $\Pi_C T$.

C Proof of Theorem 4.1

We follow the same steps as in the proofs of Theorem 2 in (Tsitsiklis, 1994) and Theorem 1 in (Rowland et al., 2018). We focus on the control case (ii), the proof being similar for policy evaluation. Let us define, for each $(x,a) \in \mathcal{X}\times\mathcal{A}$: $L_0^{(x,a)} = \delta_{z_1}$, $U_0^{(x,a)} = \delta_{z_K}$, and for $j \ge 0$,

$$L_{j+1}^{(x,a)} = \tfrac{1}{2}L_j^{(x,a)} + \tfrac{1}{2}(\Pi_C T L_j)^{(x,a)} \quad \text{and} \quad U_{j+1}^{(x,a)} = \tfrac{1}{2}U_j^{(x,a)} + \tfrac{1}{2}(\Pi_C T U_j)^{(x,a)}.$$

We now state the two following lemmas and explain how they allow us to conclude before proving them.

Lemma C.1. (i) In terms of entrywise stochastic dominance, it holds for all $j \ge 0$: $L_j \preceq L_{j+1}$ and $U_{j+1} \preceq U_j$. (ii) $(L_j)$ and $(U_j)$ both converge to $\eta^*$ in $\overline{W}_1$.

Lemma C.2. Given $j \ge 0$, there exists a random time $T_j \ge 0$ such that $L_j \preceq \eta_t \preceq U_j$ for all $t \ge T_j$, almost surely.

From Lemma C.1-(ii), let $\epsilon > 0$ and take $j \ge 0$ large enough so that $\max\{\overline{W}_1(L_j, \eta^*), \overline{W}_1(U_j, \eta^*)\} < \epsilon$. Then it follows from Lemma C.2, followed by a triangle inequality, that

$$\overline{W}_1(\eta_t, L_j) \le \overline{W}_1(L_j, U_j) \le \overline{W}_1(L_j, \eta^*) + \overline{W}_1(U_j, \eta^*) < 2\epsilon,$$

for all $t \ge T_j$, almost surely. Finally,

$$\overline{W}_1(\eta_t, \eta^*) \le \overline{W}_1(\eta_t, L_j) + \overline{W}_1(L_j, U_j) + \overline{W}_1(U_j, \eta^*) < 5\epsilon,$$

which implies Theorem 4.1. It remains to prove Lemmas C.1 and C.2.

C.1 Proof of Lemma C.1

(i) Let us show by induction that $U_{j+1} \preceq U_j$. First, the base case $U_1^{(x,a)} \preceq U_0^{(x,a)} = \delta_{z_K}$ is true because $U_1^{(x,a)}$ is supported in $\{z_1, \dots, z_K\}$. Now, let us assume that $U_{j+1} \preceq U_j$ for some $j \ge 0$. We recall from Proposition 5 in (Rowland et al., 2018) that $\Pi_C$ is a monotone map for element-wise stochastic dominance; moreover, it is easy to see that both $T^\pi$ and $T$ are monotone too. Then, by monotonicity of $\Pi_C T$, it holds that $\Pi_C T U_{j+1} \preceq \Pi_C T U_j$ and hence

$$U_{j+2}^{(x,a)} = \tfrac{1}{2}U_{j+1}^{(x,a)} + \tfrac{1}{2}(\Pi_C T U_{j+1})^{(x,a)} \preceq \tfrac{1}{2}U_j^{(x,a)} + \tfrac{1}{2}(\Pi_C T U_j)^{(x,a)} = U_{j+1}^{(x,a)},$$

which proves the induction step. Symmetrically, one can show that $L_j \preceq L_{j+1}$.

(ii) Similarly to Lemma 7 in (Rowland et al., 2018), we formulate the following general result.

Lemma C.3. Let $(\nu_k)_{k=0}^{\infty}$ be a sequence of probability distributions over $\{z_1, \dots, z_K\}$ such that $\nu_{k+1} \preceq \nu_k$ for all $k \ge 0$. Then there exists a limit probability distribution $\nu_{\lim}$ over $\{z_1, \dots, z_K\}$ such that $\nu_k \to \nu_{\lim}$ in $W_1$.

Proof. Let us denote by $F_k$ the CDF of $\nu_k$ and by $F_k^{-1}$ its quantile function, valued in $\{z_1, \dots, z_K\}$. The assumption that $\nu_k$ stochastically dominates $\nu_{k+1}$ reformulates as $F_{k+1}^{-1}(\tau) \le F_k^{-1}(\tau)$ for all $0 < \tau \le 1$. Hence, for each $\tau \in (0,1]$, the sequence $F_k^{-1}(\tau)$ is non-increasing and lower bounded by $z_1$. Therefore this sequence converges: $F_k^{-1}(\tau) \to H_{\lim}(\tau)$. As $F_k^{-1}(\tau)$ can only take discrete values, there exists $N(\tau)$ such that for all $k \ge N(\tau)$, $F_k^{-1}(\tau) = H_{\lim}(\tau)$. The limit function $H_{\lim}$ is thus non-decreasing, valued in $\{z_1, \dots, z_K\}$ and right-continuous with left limits. It is therefore the quantile function of a probability distribution $\nu_{\lim}$ supported on $\{z_1, \dots, z_K\}$. Let us now show that $W_1(\nu_k, \nu_{\lim}) \to 0$. Denoting by $F_{\lim}$ the CDF of $\nu_{\lim}$ and $p_0 = 0$, $p_1 = F_{\lim}(z_1), \dots, p_K = F_{\lim}(z_K) = 1$, we have:

$$W_1(\nu_k, \nu_{\lim}) = \int_{\tau=0}^1 |F_k^{-1}(\tau) - H_{\lim}(\tau)|\,d\tau = \sum_{1 \le i \le K:\, p_i \neq p_{i-1}} \int_{\tau=p_{i-1}}^{p_i} |F_k^{-1}(\tau) - z_i|\,d\tau. \tag{4}$$

Let $\underline{p} = \min_{1 \le i \le K:\, p_i \neq p_{i-1}} (p_i - p_{i-1})$, $\overline{z} = z_K - z_1$ and $0 < \epsilon < \underline{p}/2$. Then, for all $k \ge \max_{1 \le i \le K:\, p_i \neq p_{i-1}} \max\{N(p_{i-1}+\epsilon), N(p_i-\epsilon)\}$, $F_k^{-1}(\tau)$ is constant equal to $z_i$ for any $\tau \in [p_{i-1}+\epsilon, p_i-\epsilon]$, and we have:

$$W_1(\nu_k, \nu_{\lim}) = \sum_{i:\, p_i \neq p_{i-1}} \left( \int_{p_{i-1}}^{p_{i-1}+\epsilon} \underbrace{|F_k^{-1}(\tau) - z_i|}_{\le\, \overline{z}}\,d\tau + \int_{p_{i-1}+\epsilon}^{p_i-\epsilon} \underbrace{|F_k^{-1}(\tau) - z_i|}_{=\,0}\,d\tau + \int_{p_i-\epsilon}^{p_i} \underbrace{|F_k^{-1}(\tau) - z_i|}_{\le\, \overline{z}}\,d\tau \right) \le 2K\epsilon\,\overline{z},$$

which proves the result.
By applying Lemma C.3 to the sequence $(U_k^{(x,a)})_{k \ge 0}$ for each pair $(x,a)$, we deduce the convergence of $(U_k)$ towards some limit distribution function $U_{\lim}$ in $\overline{W}_1$. Finally, by continuity of $\Pi_C T$ for the metric $\overline{W}_1$, this limit must verify

$$\tfrac{1}{2}U_{\lim} + \tfrac{1}{2}\Pi_C T U_{\lim} = U_{\lim}, \quad \text{i.e.,} \quad U_{\lim} = \Pi_C T U_{\lim},$$

from which we deduce that $U_{\lim} = \eta^*$ by uniqueness. The proof that $(L_j)$ converges to $\eta^*$ in $\overline{W}_1$ is analogous.

C.2 Proof of Lemma C.2

Let us prove Lemma C.2 by induction. The base case $j = 0$ is true because, for all $t \ge T_0 = 0$, $\eta_t^{(x,a)}$ is supported on $\{z_1, \dots, z_K\}$ and so $L_0^{(x,a)} = \delta_{z_1} \preceq \eta_t^{(x,a)} \preceq \delta_{z_K} = U_0^{(x,a)}$.

Then, let us assume that the result is true for some $j \ge 0$, i.e., there exists a random time $T_j$ such that $L_j \preceq \eta_t \preceq U_j$ for all $t \ge T_j$ almost surely. Let us now show that there exists a random time $T_{j+1}$ such that $\eta_t \preceq U_{j+1}$ for all $t \ge T_{j+1}$ almost surely (the proof for $L_{j+1} \preceq \eta_t$ is analogous). For each state-action pair $(x,a)$, we define $H_{T_j}^{(x,a)} = U_j^{(x,a)}$ and $W_{T_j}^{(x,a)}$ the zero measure, i.e., $W_{T_j}^{(x,a)}(A) = 0$ for all Borel sets $A \subseteq \mathbb{R}$. Then, for $t \ge T_j$,

$$H_{t+1}^{(x,a)} = (1 - \alpha_t(x,a))\,H_t^{(x,a)} + \alpha_t(x,a)\,(\Pi_C T U_j)^{(x,a)}$$

and

$$W_{t+1}^{(x,a)} = (1 - \alpha_t(x,a))\,W_t^{(x,a)} + \alpha_t(x,a)\big[\Pi_C\big(\delta_{r(x,a,x') + \gamma\max_{a'}Q_t(x',a')}\big) - (\Pi_C T\eta_t)^{(x,a)}\big], \tag{5}$$

where $x' \sim P(\cdot|x,a)$ and $Q_t(x',a') = \sum_{k=1}^K p_{t,k}(x',a')\, z_k$. Moreover, if $(x,a) = (x_t, a_t)$, then $x' = x_{t+1}$ and $r(x_t, a_t, x_{t+1}) = r_t$. A consequence of Eq. (5) is that $W_t^{(x,a)}$ is a signed measure on $\mathbb{R}$ with total mass equal to zero: $W_t^{(x,a)}(\mathbb{R}) = 0$ for all $t \ge T_j$.

Let us now prove by (another) induction that $\eta_t^{(x,a)} \preceq H_t^{(x,a)} + W_t^{(x,a)}$ for all $t \ge T_j$ and all $(x,a) \in \mathcal{X}\times\mathcal{A}$, almost surely. For $t = T_j$, it is true by the assumption that $\eta_{T_j} \preceq U_j$ almost surely. Then, suppose that $\eta_t \preceq H_t + W_t$ almost surely for some $t \ge T_j$. Recalling that $\alpha_t(x,a) = 0$ for all $(x,a) \neq (x_t, a_t)$, it holds that:

$$\begin{aligned}
\eta_{t+1}^{(x,a)} &= (1-\alpha_t(x,a))\,\eta_t^{(x,a)} + \alpha_t(x,a)\,\Pi_C\big(\delta_{r(x,a,x')+\gamma\max_{a'}Q_t(x',a')}\big) \\
&= (1-\alpha_t(x,a))\,\eta_t^{(x,a)} + \alpha_t(x,a)\,(\Pi_C T\eta_t)^{(x,a)} + \alpha_t(x,a)\big[\Pi_C\big(\delta_{r(x,a,x')+\gamma\max_{a'}Q_t(x',a')}\big) - (\Pi_C T\eta_t)^{(x,a)}\big] \\
&\preceq (1-\alpha_t(x,a))\,(H_t + W_t)^{(x,a)} + \alpha_t(x,a)\,(\Pi_C T U_j)^{(x,a)} + \alpha_t(x,a)\big[\Pi_C\big(\delta_{r(x,a,x')+\gamma\max_{a'}Q_t(x',a')}\big) - (\Pi_C T\eta_t)^{(x,a)}\big] \\
&= H_{t+1}^{(x,a)} + W_{t+1}^{(x,a)}, \tag{6}
\end{aligned}$$

where the inequality comes from the assumptions $\eta_t \preceq H_t + W_t$ and $\eta_t \preceq U_j$, combined with the monotonicity of $\Pi_C T$. Now, observe that $H_t$ can be expressed explicitly as follows:

$$H_t^{(x,a)} = \left(\prod_{t'=T_j}^{t-1}(1-\alpha_{t'}(x,a))\right)U_j^{(x,a)} + \left(1 - \prod_{t'=T_j}^{t-1}(1-\alpha_{t'}(x,a))\right)(\Pi_C T U_j)^{(x,a)}.$$

Since, by Assumption 4.1, it holds that $\sum_{t'=0}^{\infty} \alpha_{t'}(x,a) = \infty$ for all $(x,a)$ almost surely, we deduce that there exists a random time $\widetilde{T}_{j+1} \ge T_j$ such that $\prod_{t'=T_j}^{t-1}(1-\alpha_{t'}(x,a)) \le \frac{1}{4}$ for all $(x,a)$ and all $t \ge \widetilde{T}_{j+1}$, almost surely. Then, because Lemma C.1-(i) implies that $\Pi_C T U_j \preceq U_j$, we have for all $t \ge \widetilde{T}_{j+1}$:

$$\eta_t \preceq H_t + W_t \preceq \tfrac{1}{4}U_j + \tfrac{3}{4}\Pi_C T U_j + W_t = \tfrac{1}{2}U_j + \tfrac{1}{2}\Pi_C T U_j + W_t - \tfrac{1}{4}(U_j - \Pi_C T U_j) = U_{j+1} + W_t - \tfrac{1}{4}(U_j - \Pi_C T U_j). \tag{7}$$

We point out that the random noise term appearing in the definition of $W_{t+1}^{(x,a)}$ has zero mean: for all $1 \le k \le K$,

$$\mathbb{E}_{x' \sim P(\cdot|x,a)}\Big[\Big(\Pi_C\big(\delta_{r(x,a,x')+\gamma\max_{a'}Q_t(x',a')}\big) - (\Pi_C T\eta_t)^{(x,a)}\Big)\big((-\infty, z_k]\big)\Big] = 0. \tag{8}$$

Eq. (8) implies, via a classic stochastic approximation argument under Assumption 4.1, that $W_t^{(x,a)}((-\infty, z_k]) \to 0$ almost surely, for all $(x,a)$ and $k \in \{1, \dots, K\}$. Finally, we take $T_{j+1} \ge \widetilde{T}_{j+1}$ large enough so that $|W_t^{(x,a)}((-\infty, z_k])| \le \frac{\Delta}{4}$ for all $t \ge T_{j+1}$ and all $(x,a)$, almost surely, where $\Delta$ is the smallest nonzero value among the quantities

$$\big|\big(U_j^{(x,a)} - (\Pi_C T U_j)^{(x,a)}\big)\big((-\infty, z_k]\big)\big|, \quad x \in \mathcal{X},\ a \in \mathcal{A},\ 1 \le k \le K,$$

which concludes the proof by using Eq. (7). Indeed, if $U_j^{(x,a)}((-\infty, z_k]) = (\Pi_C T U_j)^{(x,a)}((-\infty, z_k])$, then $U_j^{(x,a)}((-\infty, z_k]) = U_{j+1}^{(x,a)}((-\infty, z_k])$.