# distributional_reinforcement_learning_via_moment_matching__09bf05fa.pdf

Distributional Reinforcement Learning via Moment Matching

Thanh Nguyen-Tang, Sunil Gupta, Svetha Venkatesh Applied Artiﬁcial Intelligence Institute (A2I2), Deakin University, Australia {thanhnt, sunil.gupta, svetha.venkatesh}@deakin.edu.au

We consider the problem of learning a set of probability distributions from the empirical Bellman dynamics in distributional reinforcement learning (RL), a class of state-of-the-art methods that estimate the distribution, as opposed to only the expectation, of the total return. We formulate a method that learns a ﬁnite set of statistics from each return distribution via neural networks, as in the distributional RL literature. Existing distributional RL methods however constrain the learned statistics to predeﬁned functional forms of the return distribution which is both restrictive in representation and difﬁcult in maintaining the predeﬁned statistics. Instead, we learn unrestricted statistics, i.e., deterministic (pseudo-)samples, of the return distribution by leveraging a technique from hypothesis testing known as maximum mean discrepancy (MMD), which leads to a simpler objective amenable to backpropagation. Our method can be interpreted as implicitly matching all orders of moments between a return distribution and its Bellman target. We establish sufﬁcient conditions for the contraction of the distributional Bellman operator and provide ﬁnitesample analysis for the deterministic samples in distribution approximation. Experiments on the suite of Atari games show that our method outperforms the distributional RL baselines and sets a new record in the Atari games for non-distributed agents.

Introduction A fundamental aspect in reinforcement learning (RL) is the value of an action in a state which is formulated as the expected value of the return, i.e., the expected value of the discounted sum of rewards when the agent follows a policy starting in that state and executes that action (Sutton, Barto et al. 1998). Learning this expected action-value via Bellman s equation (Bellman 1957) is central to valuebased RL such as temporal-difference (TD) learning (Sutton 1988), SARSA (Rummery and Niranjan 1994), and Qlearning (Watkins and Dayan 1992). Recently, however, approaches known as distributional RL that aim at learning the distribution of the return have shown to be highly effective in practice (Morimura et al. 2010b,a; Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b,a; Yang et al. 2019). Despite many algorithmic variants with impressive practical performance (Bellemare, Dabney, and Munos 2017;

Copyright 2021, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

Dabney et al. 2018b,a; Yang et al. 2019), they all share the same characteristic that they explicitly learn a set of statistics of predeﬁned functional forms to approximate a return distribution. Using predeﬁned statistics can limit the learning due to the statistic constraints it imposes and the difﬁculty to maintain such predeﬁned statistics. In this paper, we propose to address these limitations by instead learning a set of unrestricted statistics, i.e., deterministic (pseudo-)samples, of a return distribution that can be evolved into any functional form. We observe that the deterministic samples can be deterministically learned to simulate a return distribution by utilizing an idea from statistical hypothesis testing known as maximum mean discrepancy (MMD). This novel perspective requires a careful design of algorithm and a further understanding of distributional RL associated with MMD. Leveraging this perspective, we are able to provide a novel algorithm to eschew the predeﬁned statistic limitations in distributional RL and give theoretical understanding of distributional RL within this perspective. Our approach is also conceptually amenable for natural extension along the lines of recent modelling improvements to distributional RL brought by Implicit Quantile Networks (IQN) (Dabney et al. 2018a), and Fully parameterized Quantile Function (FQF) (Yang et al. 2019). Speciﬁcally, our key contributions are

1. We provide a novel approach to distributional RL using deterministic samples via MMD that addresses the limitations in the existing distributional RL; 2. We provide theoretical understanding of distributional RL within our framework, speciﬁcally the contraction property of the distributional Bellman operator and the nonasymptotic convergence of the approximate distribution from deterministic samples; 3. We demonstrate the practical effectiveness of our framework in both tabular RL and large-scale experiments where our method outperforms the standard distributional RL methods and even establishes a new record in the Atari games for non-distributed agents.

Outline. After carefully reviewing relevant background and related works, and discussing their limitations, we present our novel algorithmic approach to address these issues followed by theoretical analysis. We then present the experiments to conﬁrm the effectiveness of our approach empirically and conclude our work.

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Background and Related Works

Expected RL

In a standard RL setting, an agent interacts with an environment via a Markov Decision Process (S, A, R, P, γ) (Puterman 2014) where S and A denote state and action spaces, resp., R the reward measure, P( |s, a) the transition kernel measure, and γ [0, 1) a discount factor. A policy π( |s) maps a state to a distribution over the action space. Given a policy π, the discounted sum of future rewards following policy π is the random variable

t=0 γt R(st, at), (1)

where s0 = s, a0 = a, st P( |st 1, at 1), at π( |st), and R(st, at) R( |st, at). The goal in expected RL is to ﬁnd an optimal policy π that maximizes the action-value function Qπ(s, a) := E[Zπ(s, a)]. A common approach is to ﬁnd the unique ﬁxed point Q = Qπ of the Bellman optimality operator (Bellman 1957) T : RS A RS A deﬁned by

TQ(s, a) := E[R(s, a)] + γEP [max a Q(s , a )], (s, a).

A standard approach to this end is Q-learning (Watkins and Dayan 1992) which maintains an estimate Qθ of the optimal action-value function Q and iteratively improves the estimation via the Bellman backup

Qθ(s, a) E[R(s, a)] + γEP [max a Qθ(s , a )].

Deep Q-Network (DQN) (Mnih et al. 2015) achieves human-level performance on the Atari benchmark by leveraging a convolutional neural network to represent Qθ while using a replay buffer and a target network to update Qθ.

Additional Notations

Let X Rd be an open set. Let P(X) be the set of Borel probability measures on X. Let P(X)S A be the Cartesian product of P(X) indexed by S A. For any α 0, let Pα(X) := {p P(X) : R

X x αp(dx) < }. When d = 1, let mn(p) := R

X xnp(dx) be the n-th order moment of a distribution p P(X), and let

P (X) = p P(X) : lim sup n |mn(p)|1/n

Note that if X is a bounded domain in R, then P (X) = P(X). Denote by δz the Dirac measure, i.e., the point mass, at z. Denote by n the n-dimensional simplex.

Distributional RL

Instead of estimating only the expectation Qπ of Zπ, distributional RL methods (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b,a; Rowland et al. 2018; Yang et al. 2019) explicitly estimate the return distribution µπ = law(Zπ) as an auxiliary task. Empirically, this auxiliary

task has been shown to signiﬁcantly improve the performance in the Atari benchmark. Theoretically, in the policy evaluation setting, the distributional version of the Bellman operator is a contraction in the p-Wasserstein metric (Bellemare, Dabney, and Munos 2017) and Cr amer distance (Rowland et al. 2018) (but not in total variation distance (Chung and Sobel 1987), Kullback-Leibler divergence and Komogorov-Smirnov distance (Bellemare, Dabney, and Munos 2017)). The contraction implies the uniqueness of the ﬁxed point of the distributional Bellman operator. In control settings with tabular function approximations, distributional RL has a well-behaved asymptotic convergence in Cr amer distance when the return distributions are parameterized by categorical distributions (Rowland et al. 2018). Bellemare et al. (2019) establish the asymptotic convergence of distributional RL in policy evaluation in linear function approximations. Lyle, Bellemare, and Castro (2019) examine behavioural differences between distributional RL and expected RL, aligning the success of the former with nonlinear function approximations.

Categorical Distributional RL (CDRL) CDRL (Bellemare, Dabney, and Munos 2017) approximates a distribution η by a categorical distribution ˆη = PN i=1 θiδzi where z1 z2 ... z N is a set of ﬁxed supports and {θi}N i=1 are learnable probabilities. The learnable probabilities {θi}N i=1 are found in such way that ˆη is a projection of η onto {PN i=1 piδzi : {pi}N i=1 N} w.r.t. the Cr amer distance (Rowland et al. 2018). In practice, C51 (Bellemare, Dabney, and Munos 2017), an instance of CDRL with N = 51, has shown to perform favorably in Atari games.

Quantile Regression Distributional RL (QRDRL) QRDRL (Dabney et al. 2018b) approximates a distribution η by a mixture of Diracs ˆη = 1 N PN i=1 δθi where {θi}N i=1 are learnable in such a way that ˆη is a projection of η on { 1

N PN i=1 δzi : {zi}N i=1 RN} w.r.t. to the 1-Wasserstein distance. Consequently, θi = F 1 η ( 2i 1

2N ) where F 1 η is the inverse cumulative distribution function of η. Since the quantile values {F 1 η ( 2i 1

2N )} at the ﬁxed quantiles { 2i 1

2N } is a minimizer of an asymmetric quantile loss from quantile regression literature (thus the name QRDRL) and the quantile loss is compatible with stochastic gradient descent (SGD), the quantile loss is used for QRDRL in practice. QRDQN-1 (Dabney et al. 2018b), an instance of QRDRL with Huber loss, performs favorably empirically in Atari games.

Other Distributional Methods Some recent distributional RL methods have made modelling improvements to QRDRL. Two typical improvements are from Implicit Quantile Networks (IQN) (Dabney et al. 2018a), and Fully parameterized Quantile Function (FQF) (Yang et al. 2019). IQN uses implicit models to represent the quantile values {θi} in QRDRL, i.e., instead of being represented by ﬁxed network outputs, {θi} are the outputs of a differentiable function (e.g., neural networks) on the samples from a base sampling distribution (e.g., uniform). FQF further improves IQN by optimizing the locations of the base samples for IQN, instead of using random base samples as in IQN, i.e.,

both quantiles and quantile values are learnable in FQF.

Predeﬁned Statistic Principle Formally, a statistic is any functional ζ : P(X) R that maps a distribution p P(X) to a scalar ζ(p), e.g., the expectation ζ(p) = R

X xp(dx) is a common statistic in RL. Here, we formally refer to a predeﬁned statistic as the one whose functional form is speciﬁed before the statistic is learned. In contrast, an unrestricted statistic does not subscribe to any speciﬁc functional form (e.g., the median of a distribution η is a predeﬁned statistic as its functional form is predeﬁned via F 1 η ( 1

2) while any empirical sample z η can be considered an unrestricted statistic of η). Though CDRL and QRDRL are two different variants of distributional RL methodology, they share a unifying characteristic that they both explicitly learn a ﬁnite set of predeﬁned statistics, i.e., statistics of predeﬁned functional forms (Rowland et al. 2019). We refer to this as predeﬁned statistic principle. This is clear for QRDRL as the statistics to be learned about a distribution η are {ζ1, ..., ζN} where

ζi(η) := F 1 η (2i 1

N ), i {1, ..., N}.

It is a bit more subtle for CDRL. It can be shown in (Rowland et al. 2019) that CDRL is equivalent to learning the statistics {ζ1, ..., ζN 1} where

ζi(η) := EZ η

1{Z<zi} + 1{zi Z<zi+1} zi+1 Z zi+1 zi

Learning predeﬁned statistics as in CDRL and QRDRL however can suffer two limitations in (i) statistic representation and (ii) difﬁculty in maintaining the predeﬁned statistics. Regarding (i), given the same ﬁxed budget of N statistics to approximate a return distribution η, CDRL restrictively associates the statistic budget to N ﬁxed supports {zi}N i=1 while QRDRL constrains the budget to N quantile values at speciﬁc quantiles. Instead, the statistic budget should be freely learned into any form as long as it could simulate the target distribution η sensibly. Regarding (ii), the ﬁxed supports in CDRL require a highly involved projection step to be able to use KL divergence as the Bellman backup changes the distribution supports; QRDRL requires that the statistics must satisfy the constraints for valid quantile values at speciﬁc quantiles, e.g., the statistics to be learned are order statistics. In fact, QR-DQN (Dabney et al. 2018b), a typical instance of QRDRL, implicitly maintains the order statistics via an asymmetric quantile loss but still does not guarantee the monotonicity of the obtained quantile estimates. A further notice regarding (ii) recognized in (Rowland et al. 2019) is that since in practice we do not observe the environment dynamic but only samples of it, a naive update to learn the predeﬁned statistics using such samples can collapse the approximate distribution due to the different natures of samples and statistics (in fact, (Rowland et al. 2019) proposes imputation strategies to overcome this problem). Instead, the statistics to be learned should be free of all such difﬁculties to reduce the learning burden. One might say that IQN/FQF (discussed in the previous subsection) can help QRDRL overcome these limitations. While the modeling improvements in IQN/FQF are

practically effective, IQN/FQF however still embrace the predeﬁned statistic principle above as they built upon QRDRL with an improved modelling capacity. In this work we propose an alternative approach to distributional RL that directly eschews the predeﬁned statistic principle used in the prior distributional RL methods, i.e., the ﬁnite set of statistics in our approach can be evolved into any functional form and thus also reduces the need to maintain any statistic constraints. If we informally view improvements to CDRL/QRDRL into two dimensions: either modelling dimension or statistic dimension, IQN/FQF lie in the modelling dimension while our work belongs to the statistic dimension. We notice that this does not necessarily mean one approach is better than the other, but rather two orthogonal approaches where the modelling improvements in IQN/FQF can naturally apply to our work to further improve the modeling capacity. We leave these modeling extensions to the future work and focus the present work only on unrestricted statistics with the simplest modelling choice as possible.

Distributional RL via Moment Matching Maximum Mean Discrepancy Let F be a reproducing kernel Hilbert space (RKHS) associated with a continuous kernel k( , ) on X. Consider p, q P(X), and let Z and W be two random variables with distributions p and q, respectively. The maximum mean discrepancy (MMD) (Gretton et al. 2012) between p and q is deﬁned as

MMD(p, q; F) := sup f F: f F 1 (E[f(Z)] E[f(W)])

= E[k(Z, Z )] + E[k(W, W )] 2E[k(Z, W)] 1/2

where ψp := R

X k(x, )p(dx) is the Bochner integral, i.e., the mean embedding of p into F (Smola et al. 2007), and Z (resp. W ) is a random variable with distribution p (resp. q) and is independent of Z (resp. W). In sequel, we interchangeably refer to MMD by MMD(p, q; F), MMD(p, q; k), or MMD(p, q) if the context is clear.

Empirical Approximation Given empirical samples {zi}N i=1 p and {wi}M i=1 q, MMD admits a simple empirical estimate as

MMD2 b({zi}, {wi}; k) = 1 N 2 X

i,j k(zi, zj)

i,j k(wi, wj) 2 NM

i,j k(zi, wj).

Though there is also a simple unbiased estimate of MMD, the biased estimate MMDb has smaller variance in practice and thus is adopted in our work.

Problem Setting Consider d = 1. For any policy π, let µπ = law(Zπ) be the law (distribution) of the return r.v. Zπ as deﬁned in Eq. (1).

Algorithm 1: Generic MMDRL update

Require: Number of particles N, kernel k, discount factor γ [0, 1] Input: Sample transition (s, a, r, s )

1 if Policy evaluation then

2 a π( |s )

4 else if Control setting then

5 a arg maxa A 1 N PN i=1 Zθ(s , a )i

7 ˆTZi r + γZθ (s , a )i, 1 i N

Output: MMD2 b {Zθ(s, a)i}N i=1, { ˆTZi}N i=1; k

The distributional Bellman operator T π (Bellemare, Dabney, and Munos 2017) speciﬁes the relation of different return distributions across state-action pairs along the Bellman dynamic; that is, for any µ P(X)S A, and any (s, a) S A,

T πµ(s, a) := Z

X (fγ,r)#µ(s , a )R(dr|s, a)π(da |s )P(ds |s, a),

where fγ,r(z) := r + γz, z and (fγ,r)#µ(s , a ) is the pushforward measure of µ(s , a ) by fγ,r. Note that µπ is the ﬁxed point of T π, i.e., T πµπ = µπ. We are interested in the problem of learning µπ P(X)S A via the distributional Bellman operator T π.

Algorithmic Approach In practical setting, we must approximate the return distribution via a ﬁnite set of statistics as the space of Borel probability measures is inﬁnite-dimensional. Let Zθ(s, a) := {Zθ(s, a)i}N i=1 be a set of parameterized statistics of µπ(s, a) where θ represents the parameters of the model, e.g., neural networks. Instead of restricting Zθ(s, a) to predeﬁned statistic functionals, we model unrestricted statistics, i.e., deterministic samples where each Zθ(s, a)i can be evolved into any form of statistics and we use the Dirac mixture ˆµθ(s, a) = 1 N PN i=1 δZθ(s,a)i to approximate µπ(s, a). We refer to the deterministic samples Zθ(s, a) as particles, and our goal is reduced into learning the particles Zθ(s, a) to approximate µπ(s, a). To this end, the particles Zθ(s, a) is deterministically evolved to minimize the MMD distance between the approximate distribution and its distributional Bellman target. Algorithm 1 below presents the generic update in our approach, namely MMDRL.

Intuition The MMDRL reduces into the standard TD or Q-learning when N = 1. For N > 1, the objective MMDb when used with SGD and Gaussian kernels k(x, y) = exp( |x y|2/h) contributes in two ways: (i) The term 1 N 2 P i,j k(Zθ(s, a)i, Zθ(s, a)j) serves as a repulsive force that pushes the particles {Zθ(s, a)i} away from each other, preventing them from collapsing into a single mode, with force proportional to 2

he (Zθ(s,a)i Zθ(s,a)j)2/h|Zθ(s, a)i

Zθ(s, a)j|; (ii) the term 2

N 2 P i,j k(Zθ(s, a)i, ˆTZj) acts as an attractive force which pulls the particles {Zθ(s, a)i} closer to their target particles { ˆTZi}. This can also be intuitively viewed as a two-sample counterpart to Stein point variational inference (Liu and Wang 2016; Chen et al. 2018).

Particle Representation We can easily extend MMDRL to DQN-like architecture to create a novel deep RL, namely MMDQN (Algorithm 3 in Appendix B). In this work, we explicitly represent the particles {Zθ(s, a)i}N i in MMDQN via ﬁxed N network outputs as in QR-DQN (Dabney et al. 2018b) for simplicity (details in Appendix B). We emphasize that modeling improvements from IQN (Dabney et al. 2018a) and FQF (Yang et al. 2019) can be naturally applied to MMDQN: we can implicitly generate {Zθ(s, a)i}N i via applying a neural network function to N samples of a base sampling distribution (e.g., normal or uniform distribution) as in IQN, or we can use the proposal network in FQF to learn the weights of each Dirac components in MMDQN instead of using equal weights 1/N.

Theoretical Analysis

Here we provide theoretical understanding of MMDRL. Before that, we deﬁne the notion of supremum MMD, a MMD counterpart to the supremum Wasserstein in (Bellemare, Dabney, and Munos 2017), to work on P(X)S A.

Deﬁnition 1. Supremum MMD is a functional P(X)S A P(X)S A R deﬁned by

MMD (µ, ν; k) := sup (s,a) S A MMD(µ(s, a), ν(s, a); k)

for any µ, ν P(X)S A.

We are concerned with the following questions:

1. Metric property: When does MMD induce a metric on P(X)S A?

2. Contraction property: When is T π a contraction in MMD ?

3. Convergence property: How fast do the particles returned by minimizing MMD approach the target distribution it approximates?

The metric property in the ﬁrst question ensures that MMD is a meaningful test to distinguish two return distributions on P(X)S A. The contraction property in the second question guarantees that following from Banach s ﬁxed point theorem (Banach 1922), T π has a unique ﬁxed point which is µπ. In addition, starting with an arbitrary point µ0 P(X)X A, T π T π ... T πµ0 converges at an exponential rate to µπ in MMD . We provide sufﬁcient conditions to answer the ﬁrst two questions and derive the convergence rate of the optimal particles in approximating a target distribution for the third question. A short answer is that the ﬁrst two properties highly depend on the underlying kernel k, and the particles returned by minimizing MMD enjoy a rate O(1/ n) regardless of the dimension d of the underlying space X.

Metric Property

Proposition 1. Let P(S) P(S) be some (Borel) subset of the space of the Borel probability measures. If MMD is a metric on P(X), then MMD is also a metric on P(X)S A.

Proof. See Appendix A.1.

Theorem 1 below provides sufﬁcient conditions for MMD to induce a metric on P(X). Theorem 1. We have 1. If the underlying kernel k is characteristic (i.e., the induced Bochner integral ψp is injective), e.g., Gaussian kernels, then MMD is a metric on P(X) (Fukumizu et al. 2007; Gretton et al. 2012). 2. Deﬁne unrectiﬁed kernels kα(x, y) := x y α, α R, x, y X. Then MMD( , ; kα) is a metric on Pα(X) for all α (0, 2) but not a metric for α = 2 (Sz ekely 2003). 3. MMD associated with the so-called exp-prod kernel k(x, y) = exp( xy

σ2 ) for any σ > 0 is a metric on P (X).

Proof. See Appendix A.2.

Contraction Property We analyze the contraction of T π for several important classes of kernels. One such class is shift invariant and scale sensitive kernels. A kernel k( , ) is said to be shift invariant if k(x + c, y + c) = k(x, y), x, y, c X; it is said to be scale sensitive with order α > 0 if k(cx, cy) = |c|αk(x, y), x, y X and c R. For example, the unrectiﬁed kernel kα considered in Theorem 1 is both shift invariant and scale sensitive with order α while Gaussian kernels are only shift invariant. Theorem 2. We have 1. If the underlying kernel is k = P i I ciki where each component kernel ki is both shift invariant and scale sensitive with order αi > 0, ci 0, and I is a (possibly inﬁnite) index set, then T π is a γα /2-contraction in MMD where α := mini I αi. 2. T π is not a contraction in MMD associated with either Gaussian kernels or exp-prod kernels k(x, y) = exp( xy

Proof. See Appendix A.3.

Practical Consideration Theorem 2 provides a negative result for the commonly used Gaussian kernel. In practice, however, we found that Gaussian kernels can promote to match the moments between two distributions and have better empirical performance as compared to the other kernels analyzed in this section. In fact, MMD associated with Gaussian kernels k(x, y) = exp( (x y)2/(2σ2)) can be decomposed into

MMD2(µ, ν; k) =

1 σ2nn! ( mn(µ) mn(ν))2

where mn(µ) = Ex µ h e x2/(2σ2)xni , and similarly for mn(ν). This indicates that MMD associated with Gaussian

kernels approximately performs moment matching (scaled with a factor e x2/(2σ2) for each moment term).

Convergence Rate of Distribution Approximation We justify the goodness of the particles obtained via minimizing MMD in terms of approximating a target distribution.

Theorem 3. Let X Rd and P P(X). For any n N, let {xi}n i=1 X be a set of n deterministic points such that {xi}n i=1 arg inf{ xi}n i=1 MMD( 1

n Pn i=1 δ xi, P; F). Then, Pn := 1

n Pn i=1 δxi converges to P at a rate of O(1/ n) in the sense that for any function h in the unit ball of F, we have

X h(x)d Pn(x) Z

X h(x)d P(x) = O(1/ n).

Remark. MMD enjoys a convergence rate of O(n 1/2) 1 regardless of the underlying dimension d while 1-Wasserstein distance has a convergence rate of O(n 1/d) (if d > 2) (Fournier and Guillin 2015), which is slower for large d.

Proof. We ﬁrst present two relevant results below (whose proofs are deferred to Appendix A.4) from which the theorem can follow.

Proposition 2. Let (Xi)n i=1 be n i.i.d. samples of some distribution P. We have

i=1 δXi, P; F

= Op(1/ n),

where Op denotes big-O in probability.

Lemma 1. Let (an)n N R and (Xn)n N R be sequences of deterministic variables and of random variables, respectively, such that for all n, |an| |Xn| almost surely (a.s.). Then, if Xn = Op(f(n)) for some function f(n) > 0, we have an = O(f(n)).

It follows from the Cauchy-Schwartz inequality in F that for any function h in the unit ball of F, we have

X h(x)d Pn(x) Z

X h(x)d P(x)

X k(x, )d Pn(x) Z

X k(x, )d P(x) F

i=1 δxi, P; F

Now by letting Xn = MMD( 1

n Pn i=1 δ xi, P; F), an = MMD( 1

n Pn i=1 δxi, P; F), and f(n) = 1/ n, and noting that an Xn, n, Proposition 2, Lemma 1 and Eq. (2) immediately imply Theorem 3.

1In fact, the convergence rate can be improved to O(1/n) using kernel herding technique (Chen, Welling, and Smola 2010).

Figure 1: Performance of different methods in approximating the optimal policy s return distribution at the initial state in the chain environment of various chain lengths K = {1, 2, ..., 15}. The distribution approximation is evaluated in terms of how well a method can approximate the k-th central moment (except k = 1 means the expectation) of the target distribution. 95% C.I. with 30 seeds. A variant (Gaussian-MMDRL) of our proposed MMDRL matches the MC rollouts (representing ground truth) much better than QRDRL.

Experimental Results We ﬁrst present results with a tabular version of MMDRL to illustrate its behaviour in distribution approximation task. We then combine the MMDRL update to the DQN-style architecture to create a novel deep RL algorithm namely MMDQN, and evaluate it on the Atari-57 games. We give full details of the architectures and hyperparameters used in the experiments in the Appendix B. 2

Tabular Policy Evaluation We empirically evaluate that MMDRL with Gaussian kernels (Gaussian-MMDRL) can approximately learn the moments of a policy s return distribution as compared to the MMDRL with unrectiﬁed kernels (unrectiﬁed-MMDRL) and the baseline QRDRL. We use a variant of the classic chain environment (Rowland et al. 2019) . The chain environment of length K is a chain of K states s0, ..., s K 1 where s0 is the initial state and s K 1 is the terminal state (see Figure 2). In each state, there are only two possible actions: (i) forward, which moves the agent one step to the right with probability 0.9 and to s0 with probability 0.1, or (ii) backward, which transitions the agent to s0 with probability 0.9 and one step to the right with probability 0.1. The agent receives reward 1 when transitioning to the initial state s0, reward 1 when reaching the terminal state s K 1, and 0 otherwise. The discount factor is γ = 0.9. We estimate µ 0 the return distribution at the initial state of the optimal policy π which selects forward action in every state. The longer the chain length K, the more stochastic the optimal policy s return distribution at s0. We use 10, 000 Monte Carlo rollouts under policy π to compute the central moments of µ 0 as ground truth values. Each method uses only N = 30 samples to approximate the target distribution µ 0 (more algorithm details are presented in Algorithm 2 in Appendix B). The result is presented in Figure 1. While all the methods approximate the expectation of the target distribution well, their approximation qualities differentiate greatly when it comes to higher order moments. Gaussian-MMDRL, though with only N = 30 particles, can approximate higher order

2Our ofﬁcial code is available at https://github.com/ thanhnguyentang/mmdrl.

moments more reasonably in this example whereas the rest highly suffer from underestimation. We also experimented with the kernel considered in Theorem 1.3 in this tabular experiment and the Atari game experiment (next part) but found that it is highly inferior to the other kernel choices (even though it has an exact moment matching form as compared to Gaussian kernels, see Appendix A.2) thus we did not include it (we speculate that the shift invariance of Gaussian kernels seems effective when interacting with transition samples from the Bellman dynamics).

𝑠0 𝑠1 𝑠𝐾 2 𝑠𝐾 1 𝑟= 1

Figure 2: An illustration of a variant of the classic chain MDP with the chain length K.

Atari Games To demonstrate the effectiveness of MMDRL at scale, we combine the MMDRL in Algorithm 1 with DQN-like architecture to obtain a deep RL agent namely MMDQN (Algorithm 3 in Appendix B). Speciﬁcally in this work, we used the same architecture as QR-DQN (Dabney et al. 2018b) for simplicity but more advanced modeling improvements from IQN (Dabney et al. 2018a) and FQF (Yang et al. 2019) can naturally be used in combination to our framework.

Evaluation Protocol We evaluated our algorithm on 55 3 Atari 2600 games (Bellemare et al. 2013) following the standard training and evaluation procedures (Mnih et al. 2015; van Hasselt, Guez, and Silver 2016) (the full details are in appendix B). We computed human normalized scores for each agent per game. From the human normalized scores for an agent across all games, we extracted three statistics for the agent s performance: the median, the mean and the number of games where the agent s performance is above the human expert s performance.

3We failed to include Defender and Surround games using Open AI and Dopamine framework.

(a) Different kernel bandwidths (at ﬁxed N = 200).

(b) Different values of N (at ﬁxed h = mix).

Figure 3: The sensitivity of (the human-normalized scores of) MMDQN in the 6 tuning games with respect to: (a) the kernel choices (Gaussian kernels with different bandwidths h and unrectiﬁed kernels), and (b) the number of particles N. Here h = mix indicates the mixture of bandwidth values in {1, 2, ..., 10}. All curves are smoothed over 5 consecutive iterations. 95% C.I. for the h = mix, N = 200, and unrectiﬁed kernel curves (3 seeds) and 1 seed for the other curves.

Figure 4: Percentage improvement per-game of MMDQN over QR-DQN-1.

Mean Median >Human >DQN DQN 221% 79% 24 0 PRIOR. 580% 124% 39 48 C51 701% 178% 40 50 QR-DQN-1 902% 193% 41 54 RAINBOW 1213% 227% 42 52 IQN 1112% 218% 39 54 FQF 1426% 272% 44 54 MMDQN 1969% 213% 41 55

Table 1: Mean and median of best human-normalized scores across 55 Atari 2600 games. The results for MMDQN are averaged over 3 seeds and the reference results are from (Yang et al. 2019).

Baselines We categorize the baselines into two groups. The ﬁrst group contains comparable methods: DQN, PRIOR., C51, and QR-DQN-1, where DQN (Mnih et al. 2015) and PRIOR. (prioritized experience replay (Schaul et al. 2016)) are classic baselines. The second group includes reference methods: RAINBOW (Hessel et al. 2018), IQN, and FQF, which contain algorithmic/modeling improvements orthogonal to MMDQN: RAINBOW combines C51 with prioritized replay and n-step update while IQN and FQF contain modeling improvements as described in the related work section. Since in this work we used the same architecture as QR-DQN and C51 for MMDQN, we directly compare MMDQN with the ﬁrst group while including the second group for reference.

Hyperparameter Setting For fair comparison with QRDQN, we used the same hyperparameters: N = 200, Adam optimizer (Kingma and Ba 2015) with lr = 0.00005, ϵADAM = 0.01/32. We used ϵ-greedy policy with ϵ being decayed at the same rate as in DQN but to a lower value ϵ = 0.01 as commonly used by the distributional RL methods. We used a target network to compute the distributional Bellman target as with DQN. Our implementation is based on Open AI Gym (Brockman et al. 2016) and the Dopamine framework (Castro et al. 2018).

Kernel Selection We used Gaussian kernels kh(x, y) = exp (x y)2/h where h > 0. The kernel bandwidth h is crucial to the statistical quality of MMD: overestimated bandwidth results in a ﬂat kernel while underestimated one makes the decision boundary highly irregular. We utilize the kernel mixture trick in (Li, Swersky, and Zemel 2015) which is a mixture of K kernels covering a range of bandwidths k(x, y) = PK i=1 khi(x, y). The Gaussian kernel with a bandwidth mixture yields much a better performance than that with individual bandwidth and unrectiﬁed kernels in 6 tuning games: Breakout, Assault, Asterix, Ms Pacman, Qbert, and Beam Rider (see Figure 3 (a)). Figure 3 (b) shows the sensitivity of MMDQN in terms of the number of particles N in the 6 tuning games where too small N adversely affects the performance. The main empirical result is provided in Table 1 where we compute the mean and median of best human normalized scores across 55 Atari games in the 30 no-op evaluation setting (the full raw scores for each game are provided in Appendix C). The table shows that MMDQN signiﬁcantly outperforms the comparable methods in the ﬁrst group (DQN, PRIOR., C51 and QR-DQN-1) in all metrics though it shares the same network architecture with C51 and QR-DQN-1. Although we did not include any orthogonal algorithmic/modelling improvements from the reference methods to MMDQN, MMDQN still performs comparably with these methods and even achieve a state-of-the-art mean human-normalized score. In Figure 4 we also provide the percentage improvement per-game of MMDQN over QRDQN-1 where MMDQN offers signiﬁcant gains over QRDQN-1 in a large array of games.

Acknowledgements This research was partially funded by the Australian Government through the Australian Research Council (ARC). Prof. Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006). We would also like to thank our anonymous reviewers (from Neur IPS 20 and AAAI 21) for the constructive feedback, Will Dabney (Deep Mind) for the raw result data of QR-DQN-1 and the useful comments on our earlier draft, and Mengyan Zhang (Australian National University) for the interesting discussion on order statistics.

References Banach, S. 1922. Sur Les Operations Dans Les Ensembles Abstraits et Leur Application Aux Equations Integrales. Fundamenta Mathematicae 3 133 181.

Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A Distributional Perspective on Reinforcement Learning. In ICML, volume 70 of Proceedings of Machine Learning Research, 449 458. PMLR.

Bellemare, M. G.; Le Roux, N.; Castro, P. S.; and Moitra, S. 2019. Distributional reinforcement learning with linear function approximation. In The 22nd International Conference on Artiﬁcial Intelligence and Statistics, 2203 2211. PMLR.

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artiﬁcial Intelligence Research 47: 253 279. ISSN 1076-9757. doi: 10.1613/jair.3912.

Bellman, R. 1957. Dynamic Programming. Princeton, NJ, USA: Princeton University Press, 1 edition.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym. ar Xiv preprint ar Xiv:1606.01540 .

Castro, P. S.; Moitra, S.; Gelada, C.; Kumar, S.; and Bellemare, M. G. 2018. Dopamine: A Research Framework for Deep Reinforcement Learning. Co RR abs/1812.06110.

Chen, W. Y.; Mackey, L. W.; Gorham, J.; Briol, F.; and Oates, C. J. 2018. Stein Points. In ICML, volume 80 of Proceedings of Machine Learning Research, 843 852. PMLR.

Chen, Y.; Welling, M.; and Smola, A. J. 2010. Super Samples from Kernel Herding. In UAI, 109 116. AUAI Press.

Chung, K.; and Sobel, M. J. 1987. Discounted MDP s: distribution functions and exponential utility maximization. Siam Journal on Control and Optimization 25: 49 62.

Dabney, W.; Ostrovski, G.; Silver, D.; and Munos, R. 2018a. Implicit Quantile Networks for Distributional Reinforcement Learning. In ICML, volume 80 of Proceedings of Machine Learning Research, 1104 1113. PMLR.

Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2018b. Distributional Reinforcement Learning With Quantile Regression. In AAAI, 2892 2901. AAAI Press.

Fournier, N.; and Guillin, A. 2015. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 162(3-4): 707 738. Fukumizu, K.; Gretton, A.; Sun, X.; and Sch olkopf, B. 2007. Kernel Measures of Conditional Dependence. In NIPS, 489 496. Curran Associates, Inc. Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Sch olkopf, B.; and Smola, A. J. 2012. A Kernel Two-Sample Test. J. Mach. Learn. Res. 13: 723 773. Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M. G.; and Silver, D. 2018. Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 3215 3222. AAAI Press. Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In ICLR (Poster). Li, Y.; Swersky, K.; and Zemel, R. S. 2015. Generative Moment Matching Networks. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, 1718 1727. JMLR.org. Liu, Q.; and Wang, D. 2016. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In NIPS, 2370 2378. Lyle, C.; Bellemare, M. G.; and Castro, P. S. 2019. A Comparative Analysis of Expected and Distributional Reinforcement Learning. In AAAI, 4504 4511. AAAI Press. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529 533. Morimura, T.; Sugiyama, M.; Kashima, H.; Hachiya, H.; and Tanaka, T. 2010a. Nonparametric Return Distribution Approximation for Reinforcement Learning. In ICML, 799 806. Omnipress. Morimura, T.; Sugiyama, M.; Kashima, H.; Hachiya, H.; and Tanaka, T. 2010b. Parametric Return Density Estimation for Reinforcement Learning. In UAI, 368 375. AUAI Press. Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Rowland, M.; Bellemare, M. G.; Dabney, W.; Munos, R.; and Teh, Y. W. 2018. An Analysis of Categorical Distributional Reinforcement Learning. In AISTATS, volume 84 of Proceedings of Machine Learning Research, 29 37. PMLR. Rowland, M.; Dadashi, R.; Kumar, S.; Munos, R.; Bellemare, M. G.; and Dabney, W. 2019. Statistics and Samples in Distributional Reinforcement Learning. In ICML, volume 97 of Proceedings of Machine Learning Research, 5528 5536. PMLR. Rummery, G. A.; and Niranjan, M. 1994. On-line Qlearning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK. Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2016. Prioritized Experience Replay. In ICLR (Poster).

Smola, A. J.; Gretton, A.; Song, L.; and Sch olkopf, B. 2007. A Hilbert Space Embedding for Distributions. In ALT, volume 4754 of Lecture Notes in Computer Science, 13 31. Springer. Sutton, R. S. 1988. Learning to Predict by the Methods of Temporal Differences. Mach. Learn. 3: 9 44. Sutton, R. S.; Barto, A. G.; et al. 1998. Introduction to reinforcement learning, volume 135. MIT press Cambridge. Sz ekely, G. J. 2003. E-statistics: The energy of statistical samples. Bowling Green State University, Department of Mathematics and Statistics Technical Report 3(05): 1 18. van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep Reinforcement Learning with Double Q-Learning. In AAAI, 2094 2100. AAAI Press. Watkins, C. J. C. H.; and Dayan, P. 1992. Q-learning. In Machine Learning, 279 292. Yang, D.; Zhao, L.; Lin, Z.; Qin, T.; Bian, J.; and Liu, T.-Y. 2019. Fully Parameterized Quantile Function for Distributional Reinforcement Learning. In Advances in Neural Information Processing Systems, 6190 6199.