# Foundations of Multivariate Distributional Reinforcement Learning

Harley Wiltzer (Mila, Québec AI Institute; McGill University) harley.wiltzer@mail.mcgill.ca
Jesse Farebrother (Mila, Québec AI Institute; McGill University) jfarebro@cs.mcgill.ca
Arthur Gretton (Google DeepMind; Gatsby Unit, University College London) gretton@google.com
Mark Rowland (Google DeepMind) markrowland@google.com

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

## Abstract

In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning. This work introduces the first oracle-free and computationally-tractable algorithms for provably convergent multivariate distributional dynamic programming and temporal-difference learning. Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension. Surprisingly, when the reward dimension is larger than 1, we show that standard analysis of categorical TD learning fails, which we resolve with a novel projection onto the space of mass-1 signed measures. Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice.

## 1 Introduction

Distributional reinforcement learning [DRL; MSK+10, BDM17b, BDR23] focuses on the idea of learning probability distributions of an agent's random return, rather than the classical approach of learning only its mean. This has been highly effective in combination with deep reinforcement learning [YZL+19, BCC+20, WBK+22], and DRL has found applications in risk-sensitive decision making [LM22, KEF23], neuroscience [DKNU+20], and multi-agent settings [ROH+21, SLL21].

In general, research in distributional reinforcement learning has focused on the classical setting of a scalar reward function. However, prior non-distributional approaches to multi-objective RL [RVWD13, HRB+22] and transfer learning [BDM+17a, BHB+20] model value functions of multivariate cumulants,[^1] rather than a scalar reward. Having learnt such a multivariate value function, it is then possible to perform zero-shot evaluation and policy improvement for any scalar reward signal contained in the span of the coordinates of the multivariate cumulants, opening up a variety of applications in transfer learning, and multi-objective and constrained RL.

Multivariate distributional RL combines these two ideas, and aims to learn the full probability distribution of returns given a multivariate cumulant function. Successfully learning the multivariate reward distribution opens up a variety of unique possibilities, such as zero-shot return distribution estimation [WFG+24] and risk-sensitive policy improvement [CZZ+24]. Pioneering works have already proposed algorithms for multivariate distributional RL.

[^1]: Cumulants refer to accumulated quantities in RL (e.g., rewards or multivariate rewards), not to be confused with statistical cumulants.
While these works all demonstrate benefits from the proposed algorithmic approaches, each suffers from separate drawbacks, such as not modelling the full joint distribution [GBSL21], lacking theoretical guarantees [FSMT19, ZCZ+21], or requiring a maximum-likelihood optimisation oracle for implementation [WUS23]. Concurrently, the work of [LK24] analyzed algorithms for DRL with Banach-space-valued rewards, and provided convergence guarantees for dynamic programming with non-parametric (intractable) distribution models.

Our central contribution in this paper is to propose algorithms for dynamic programming and temporal-difference learning in multivariate DRL which are computationally tractable and theoretically justified, with convergence guarantees. We show that reward dimensions strictly larger than 1 introduce new computational and statistical challenges. To resolve these challenges, we introduce multiple novel algorithmic techniques, including a randomized dynamic programming operator for efficiently approximating projected updates with high probability, and a novel TD-learning algorithm operating on mass-1 signed measures. These new techniques recover existing bounds even in the scalar reward case, and provide new insights into the behavior of distributional RL algorithms as a function of the reward dimension.

## 2 Background

We consider a Markov decision process with Polish state space $\mathcal{X}$, action space $\mathcal{A}$, transition kernel $P : \mathcal{X} \times \mathcal{A} \to \mathscr{P}(\mathcal{X})$, and discount factor $\gamma \in [0, 1)$. Unlike the standard RL setting, we consider a vector-valued reward function $r : \mathcal{X} \to [0, R_{\max}]^d$, as in the literature on successor features [BDM+17a]. Given a policy $\pi : \mathcal{X} \to \mathscr{P}(\mathcal{A})$, we write the policy-conditioned transition kernel as $P^\pi(\cdot \mid x) = \int_{\mathcal{A}} P(\cdot \mid x, a)\, \pi(\mathrm{d}a \mid x)$.

**Multivariate return distributions.** We write $(X_t)_{t \ge 0}$ for a trajectory generated by setting $X_0 = x$ and, for each $t \ge 0$, $X_{t+1} \sim P^\pi(\cdot \mid X_t)$. The return obtained along this trajectory is then defined by $G^\pi(x) = \sum_{t=0}^{\infty} \gamma^t r(X_t)$, and the (multi-)return distribution function is $\eta^\pi(x) = \mathrm{Law}(G^\pi(x))$.

**Zero-shot evaluation.** An intriguing prospect of estimating multivariate return distributions is the ability to predict (scalar) return distributions for any reward function in the span of the cumulants. Indeed, [ZCZ+21, CZZ+24] show that for any reward function $\tilde r : x \mapsto \langle r(x), w\rangle$ with $w \in \mathbb{R}^d$,

$$\langle G^\pi(x), w\rangle \overset{\text{law}}{=} \sum_{t \ge 0} \gamma^t\, \tilde r(X_t) \quad \text{for } X_0 = x.$$

Likewise, one might consider $r(x) = \delta_x$, in which case $G^\pi(x)$ corresponds to the per-trajectory discounted state visitation measure, and [WFG+24] demonstrated methods for learning the distribution of $G^\pi$ to infer the return distribution for any bounded deterministic reward function.

**Multivariate distributional Bellman equation.** It was shown in [ZCZ+21] that multi-return distributions obey a distributional Bellman equation, similar to the scalar case [MSK+10, BDM17b]. Define the multivariate distributional Bellman operator $\mathcal{T}^\pi : \mathscr{P}(\mathbb{R}^d)^{\mathcal{X}} \to \mathscr{P}(\mathbb{R}^d)^{\mathcal{X}}$ by

$$(\mathcal{T}^\pi \eta)(x) = \mathbb{E}_{X' \sim P^\pi(\cdot \mid x)}\left[ (\mathrm{b}_{r(x),\gamma})_{\#}\, \eta(X') \right], \tag{1}$$

where $\mathrm{b}_{y,\gamma}(z) = y + \gamma z$ and $f_{\#}\mu = \mu \circ f^{-1}$ is the pushforward of a measure $\mu$ through a measurable function $f$. In particular, [ZCZ+21] showed that $\eta^\pi$ satisfies the multivariate distributional Bellman equation $\mathcal{T}^\pi \eta^\pi = \eta^\pi$, and that $\mathcal{T}^\pi$ is a $\gamma$-contraction in $\overline{W}_p$, where $\overline{W}_p(\eta_1, \eta_2) = \sup_{x \in \mathcal{X}} W_p(\eta_1(x), \eta_2(x))$ and $W_p$ is the $p$-Wasserstein metric [Vil09]. This suggests a convergent scheme for approximating $\eta^\pi$ in $\overline{W}_p$ by distributional dynamic programming, that is, computing the iterates $\eta_{k+1} = \mathcal{T}^\pi \eta_k$, following Banach's fixed point theorem.
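To see concretely why exact iteration of (1) is problematic, the following minimal sketch applies the Bellman operator to finitely supported multi-return distributions represented as lists of (atom, probability) pairs. The support of each distribution grows multiplicatively with every application, which motivates the approximations discussed next. The data layout and names here are illustrative assumptions, not from the paper.

```python
import numpy as np

def bellman_backup(eta, P_pi, r, gamma):
    """One exact application of the operator T^pi in Equation (1).

    eta: dict mapping state index -> list of (atom, prob) pairs, atom a (d,) array.
    P_pi: (S, S) row-stochastic policy-conditioned transition matrix.
    r: (S, d) array of cumulants.
    """
    new_eta = {}
    for x in eta:
        atoms = []
        for x_next, p_trans in enumerate(P_pi[x]):
            if p_trans == 0.0:
                continue
            for z, p_z in eta[x_next]:
                # pushforward through b_{r(x), gamma}: z -> r(x) + gamma * z
                atoms.append((r[x] + gamma * z, p_trans * p_z))
        new_eta[x] = atoms  # support size grows with each application
    return new_eta
```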
**Approximating multivariate return distributions.** In practice, however, the iterates $\eta_{k+1} = \mathcal{T}^\pi \eta_k$ cannot be computed efficiently, because the size of the support of $\eta_k$ may increase exponentially with $k$. A variety of alternative approaches that aim to circumvent this computational difficulty have been considered [FSMT19, ZCZ+21, WUS23]. Many of these approaches have proven effective in combination with deep reinforcement learning, though as tabular algorithms they either lack theoretical guarantees or rely on oracles for solving possibly intractable optimisation problems. A more complete account of multivariate DRL is given in Appendix A. A central motivation of our work is the development of computationally-tractable algorithms for multivariate distributional RL with theoretical guarantees.

**Maximum mean discrepancies.** A core tool in the development of our proposed algorithms, as well as some prior work [NTGV20, ZCZ+21], is the notion of distance over probability distributions given by maximum mean discrepancies [MMD; GBR+12]. A maximum mean discrepancy $\mathrm{MMD}_\kappa : \mathscr{P}(\mathcal{Y}) \times \mathscr{P}(\mathcal{Y}) \to \mathbb{R}_+$ assigns a notion of distance to pairs of probability distributions, and is parametrised via a choice of kernel $\kappa : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, according to

$$\mathrm{MMD}_\kappa(p, q) = \left( \mathbb{E}_{(Y_1, Y_2) \sim p \otimes p}[\kappa(Y_1, Y_2)] - 2\, \mathbb{E}_{(Y, Z) \sim p \otimes q}[\kappa(Y, Z)] + \mathbb{E}_{(Z_1, Z_2) \sim q \otimes q}[\kappa(Z_1, Z_2)] \right)^{1/2}.$$

A useful alternative perspective on MMD is that the choice of kernel $\kappa$ induces a reproducing kernel Hilbert space (RKHS) of functions $\mathcal{H}$, namely the closure of the span of functions of the form $z \mapsto \kappa(y, z)$ for each $y \in \mathcal{Y}$, with respect to the norm $\|\cdot\|_{\mathcal{H}}$ induced by the inner product $\langle \kappa(y_1, \cdot), \kappa(y_2, \cdot)\rangle = \kappa(y_1, y_2)$. With this interpretation, $\mathrm{MMD}_\kappa(p, q)$ is equal to $\|\mu_p - \mu_q\|_{\mathcal{H}}$, where $\mu_p = \int_{\mathcal{Y}} \kappa(\cdot, y)\, p(\mathrm{d}y) \in \mathcal{H}$ is the mean embedding of $p$ (similarly for $\mu_q$). When $p \mapsto \mu_p$ is injective, the kernel $\kappa$ is called characteristic, and $\mathrm{MMD}_\kappa$ is then a proper metric on $\mathscr{P}(\mathcal{Y})$ [GBR+12]. In the remainder of this work, we assume that all spaces of measures are over compact sets $\mathcal{Y}$; thus, with continuous kernels, we are ensured that distances between probability measures are bounded. When comparing return distributions, this is achieved by asserting that rewards are bounded.

We conclude this section by recalling a particular family of kernels, introduced in [SSGF13], that will be particularly useful for our analysis. These are the kernels induced by semimetrics.

**Definition 1.** Let $\mathcal{Y}$ be a nonempty set, and consider a function $\rho : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$. Then $\rho$ is called a semimetric if it is symmetric and $\rho(y_1, y_2) = 0 \iff y_1 = y_2$. Additionally, $\rho$ is said to have strong negative type if $\int \rho\, \mathrm{d}([p - q] \times [p - q]) < 0$ whenever $p, q \in \mathscr{P}(\mathcal{Y})$ with $p \ne q$.

Notably, certain semimetrics naturally induce characteristic kernels and probability metrics.

**Theorem 1** ([SSGF13, Proposition 29]). Let $\rho$ be a semimetric of strong negative type on a compact set $\mathcal{Y}$, in the sense that $\int \rho\, \mathrm{d}([p - q] \times [p - q]) < 0$ whenever $p \ne q$ are probability measures on $\mathcal{Y}$. Moreover, let $\kappa : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ denote the kernel induced by $\rho$, that is, $\kappa(y_1, y_2) = \frac{1}{2}\left(\rho(y_1, y_0) + \rho(y_2, y_0) - \rho(y_1, y_2)\right)$ for some $y_0 \in \mathcal{Y}$. Then $\kappa$ is characteristic, so $\mathrm{MMD}_\kappa$ is a metric.

**Remark 1.** An important class of semimetrics are the functions $\rho_\alpha : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ given by $\rho_\alpha(y_1, y_2) = \|y_1 - y_2\|_2^\alpha$ for $\alpha \in (0, 2)$. It is known that these semimetrics have strong negative type, and thus the kernels $\kappa_\alpha$ induced by $\rho_\alpha$ are characteristic [SR13, SSGF13]. The resulting metric $\mathrm{MMD}_{\kappa_\alpha}$ is known as the energy distance.
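As a concrete reference for these quantities, the sketch below estimates the squared MMD under the energy-distance kernel of Remark 1 from samples, using the equivalent form $\mathrm{MMD}^2_\kappa(p,q) = \mathbb{E}\|Y - Z\|^\alpha - \tfrac12(\mathbb{E}\|Y - Y'\|^\alpha + \mathbb{E}\|Z - Z'\|^\alpha)$ (derived in the proof of Lemma 7 in Appendix B), under which the base point $y_0$ cancels. The function names are ours, not the paper's.

```python
import numpy as np

def energy_mmd_sq(x, y, alpha=1.0):
    """Estimate of MMD^2 under the kernel induced by rho(y1, y2) = ||y1 - y2||_2^alpha.

    x: (n, d) samples from p; y: (m, d) samples from q.
    """
    def mean_rho(a, b):
        # average of ||a_i - b_j||_2^alpha over all pairs
        diff = a[:, None, :] - b[None, :, :]
        return (np.linalg.norm(diff, axis=-1) ** alpha).mean()

    return mean_rho(x, y) - 0.5 * (mean_rho(x, x) + mean_rho(y, y))
```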
## 3 Multivariate Distributional Dynamic Programming

To warm up, we begin by demonstrating that the (multivariate) distributional Bellman operator is indeed contractive in a supremal form $\overline{\mathrm{MMD}}_\kappa$ of MMD, given by $\overline{\mathrm{MMD}}_\kappa(\eta_1, \eta_2) = \sup_{x \in \mathcal{X}} \mathrm{MMD}_\kappa(\eta_1(x), \eta_2(x))$, for a particular class of kernels $\kappa$. Our first theorem generalizes the analogous results of [NTGV20] in the scalar case to multivariate cumulants. The proof of Theorem 2, as well as the proofs of all remaining results, are deferred to Appendix B.

**Theorem 2** (Convergent MMD dynamic programming for the multi-return distribution function). Let $\kappa$ be a kernel induced by a semimetric $\rho$ on $[0, (1-\gamma)^{-1}R_{\max}]^d$ with strong negative type, satisfying

1. *Shift-invariance.* For any $z \in \mathbb{R}^d$, $\rho(z + y_1, z + y_2) = \rho(y_1, y_2)$.
2. *Homogeneity.* There exists $c > 0$ for which $\rho(\gamma y_1, \gamma y_2) = \gamma^c \rho(y_1, y_2)$ for any $\gamma \in [0, 1)$.

Consider the sequence $\{\eta_k\}_{k=1}^{\infty}$ given by $\eta_{k+1} = \mathcal{T}^\pi \eta_k$. Then $\eta_k \to \eta^\pi$ at a geometric rate of $\gamma^{c/2}$ in $\overline{\mathrm{MMD}}_\kappa$, as long as $\overline{\mathrm{MMD}}_\kappa(\eta_0, \eta^\pi) \le C < \infty$.

Notably, the energy distance kernels $\kappa_\alpha$ satisfy the conditions of Theorem 2: $\rho_\alpha$ is shift-invariant, and $\rho_\alpha(\gamma y_1, \gamma y_2) = \gamma^\alpha \rho_\alpha(y_1, y_2)$ by the homogeneity of the Euclidean norm, so $\mathcal{T}^\pi$ is a $\gamma^{\alpha/2}$-contraction in the energy distances. This generalizes the analogous result of [NTGV20] in the one-dimensional case.

While Theorem 2 illustrates a method for approximating $\eta^\pi$ in MMD, it leaves a lot to be desired. Firstly, even in tabular MDPs, just as in the case of scalar distributional RL, return distribution functions have infinitely many degrees of freedom, precluding a tractable exact representation. As such, it will be necessary to study approximate, finite parameterizations of the return distribution functions, requiring more careful convergence analysis. Moreover, in RL it is generally assumed that the transition kernel and reward function are not known analytically; we only have access to sampled state transitions and cumulants. Thus, $\mathcal{T}^\pi$ cannot be represented or computed exactly, and instead we must study algorithms for approximating $\eta^\pi$ from samples. We provide algorithms for resolving both of these concerns (the former in Section 5 and the latter in Section 6), where we illustrate the difficulties that arise once the cumulant dimension exceeds unity.

## 4 Particle-Based Multivariate Distributional Dynamic Programming

Our first algorithmic contribution is inspired by the empirically successful equally-weighted particle (EWP) representations of multivariate return distributions employed by [ZCZ+21].

**Temporal-difference learning with EWP representations.** EWP representations, expressed by the class $\mathcal{C}_{\mathrm{EWP},m}$, are defined by

$$\mathcal{C}_{\mathrm{EWP},m} = (C_{\mathrm{EWP},m})^{\mathcal{X}}, \qquad C_{\mathrm{EWP},m} = \left\{ \frac{1}{m}\sum_{i=1}^{m} \delta_{\theta_i} : \theta_i \in \mathbb{R}^d \right\}. \tag{2}$$

For simplicity, we consider the case here where at each state $x$, the multi-return distribution is approximated by $N(x) = m$ atoms. We can represent $\eta \in \mathcal{C}_{\mathrm{EWP},m}$ by $\eta(x) = \frac{1}{m}\sum_{i=1}^{m}\delta_{\theta_i(x)}$ for $\theta_i : \mathcal{X} \to \mathbb{R}^d$. The work of [ZCZ+21] introduced a TD-learning algorithm for learning a $\mathcal{C}_{\mathrm{EWP},m}$ representation of $\eta^\pi$, computing iterates of the particles $(\theta^{(k)}_i)_{i=1}^{m}$ according to

$$\theta^{(k+1)}_i(x) = \theta^{(k)}_i(x) - \lambda_k\, \nabla_{\theta_i(x)}\, \mathrm{MMD}^2_\kappa\!\left( \frac{1}{m}\sum_{j=1}^{m} \delta_{\theta^{(k)}_j(x)},\ \frac{1}{m}\sum_{j=1}^{m} \delta_{r(x) + \gamma \bar\theta^{(k)}_j(X')} \right) \tag{3}$$

for step sizes $(\lambda_k)_{k \ge 0}$ and sampled next states $X' \sim P^\pi(\cdot \mid x)$, where $\bar\theta = \text{stop-gradient}(\theta^{(k)})$ is a copy of $\theta^{(k)}$ that does not propagate gradients. Despite the empirical success of this method in combination with deep learning, no convergence analysis has been established, owing to the nonconvexity of the MMD objective with respect to the particle locations.
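For concreteness, a sketch of one particle update of Equation (3) follows, with the gradient of the squared energy-distance MMD computed in closed form from $\nabla_a \|a - b\|^\alpha = \alpha\|a - b\|^{\alpha - 2}(a - b)$. The helper names and the tabular data layout are our own assumptions.

```python
import numpy as np

def ewp_mmd_grad(theta, targets, alpha=1.0, eps=1e-8):
    """Gradient of MMD^2(1/m sum_i delta_{theta_i}, 1/m sum_j delta_{targets_j})
    w.r.t. the particle locations theta, for the energy kernel of Remark 1.

    theta, targets: (m, d) arrays. targets_j = r(x) + gamma * theta_j(X') are
    treated as constants (the stop-gradient in Equation (3)).
    """
    def grad_sum(a, b):
        # sum_j grad_{a_i} ||a_i - b_j||^alpha, guarding coincident points
        diff = a[:, None, :] - b[None, :, :]                   # (m, n, d)
        dist = np.maximum(np.linalg.norm(diff, axis=-1), eps)  # (m, n)
        w = alpha * dist ** (alpha - 2.0)
        return (w[..., None] * diff).sum(axis=1)               # (m, d)

    m, n = len(theta), len(targets)
    # derivative of E rho(Y, Z) minus derivative of E rho(Y, Y') / 2
    return grad_sum(theta, targets) / (m * n) - grad_sum(theta, theta) / (m * m)

# One EWP-TD step at a sampled transition (x, r, x'), theta a dict of (m, d) arrays:
# theta[x] = theta[x] - step_size * ewp_mmd_grad(theta[x], r + gamma * theta[x_next])
```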
In this section we aim to understand to what extent analysis is possible for dynamic programming and temporal-difference learning algorithms based on the EWP representations in Equation (2).

**Dynamic programming with EWP representations.** As is often the case in approximate distributional dynamic programming [RBD+18, RMA+24], we have $\mathcal{T}^\pi \mathcal{C}_{\mathrm{EWP},m} \not\subseteq \mathcal{C}_{\mathrm{EWP},m}$; in words, the distributional Bellman operator does not map EWP representations to themselves. Concretely, as long as there exists a state $x$ at which the support of $P^\pi(\cdot \mid x)$ is not a singleton, $(\mathcal{T}^\pi\eta)(x)$ will consist of more than $m$ atoms even when $\eta \in \mathcal{C}_{\mathrm{EWP},m}$; and secondly, if $P(\cdot \mid x)$ is not uniform, $(\mathcal{T}^\pi\eta)(x)$ will not consist of equally-weighted particles. Consequently, to obtain a DP algorithm over EWP representations, we must consider a projected operator of the form $\Pi_{\mathrm{EWP}}\mathcal{T}^\pi$, for a projection $\Pi_{\mathrm{EWP}} : \mathscr{P}(\mathbb{R}^d)^{\mathcal{X}} \to \mathcal{C}_{\mathrm{EWP},m}$. A natural choice for this projection is the operator that maps each multi-return distribution to a closest element of $C_{\mathrm{EWP},m}$ in MMD,

$$(\Pi^m_{\mathrm{EWP},\kappa}\eta)(x) \in \operatorname*{argmin}_{p \in C_{\mathrm{EWP},m}} \mathrm{MMD}_\kappa(p, \eta(x)). \tag{4}$$

Unfortunately, even in the scalar-reward ($d = 1$) case, the operator $\Pi^m_{\mathrm{EWP},\kappa}$ is problematic: $(\Pi^m_{\mathrm{EWP},\kappa}\eta)(x)$ is not uniquely defined, and $\Pi^m_{\mathrm{EWP},\kappa}$ is not a non-expansion in $\mathrm{MMD}_\kappa$ [LB22, RMA+24]. These pathologies present significant complications when analyzing even the convergence of dynamic programming routines for learning an EWP representation of the multi-return distribution; in particular, it is not even clear that $\Pi^m_{\mathrm{EWP},\kappa}\mathcal{T}^\pi$ has a fixed point (let alone a unique one). Another complication arises due to the computational difficulty of computing the projection (4): even in the case where $\eta(x)$ has finite support for each state $x$, the projection $(\Pi^m_{\mathrm{EWP},\kappa}\eta)(x)$ is very similar to clustering, which can be intractable to compute exactly for large $m$ [She21]. Thus, the argmin projection in Equation (4) cannot be used directly to obtain a tractable DP algorithm.

**Randomised dynamic programming.** Towards this end, we introduce a tractable randomized dynamic programming algorithm for the EWP representation, using a randomized proxy $\mathrm{BootProj}^\pi_{\kappa,m}$ for $\Pi^m_{\mathrm{EWP},\kappa}\mathcal{T}^\pi$ that produces accurate return distribution estimates with high probability. The method produces the following iterates,

$$\eta_{k+1}(x) = \mathrm{BootProj}^\pi_{\kappa,m}\eta_k(x) := \frac{1}{m}\sum_{i=1}^{m}\delta_{r(x) + \gamma Z_i}, \qquad Z_i \sim \eta_k(X_i), \quad X_i \overset{\text{iid}}{\sim} P^\pi(\cdot \mid x). \tag{5}$$

A similar algorithm for categorical representations was discussed in concurrent work [LK24] without convergence analysis. The intuition is that, particularly for large $m$, the Monte Carlo error associated with the sample-based approximation to $(\mathcal{T}^\pi\eta)(x)$ is small, and we can therefore expect the DP process, though randomised, to be accurate with high probability. This is summarised by a key theoretical result of this section; our proof of this result in the appendix provides a general approach to proving convergence for algorithms using arbitrary accurate approximations to (4) that we expect to be useful in future work.

**Theorem 3.** Consider a kernel $\kappa$ induced by the semimetric $\rho(x, y) = \|x - y\|_2^\alpha$ with $\alpha \in (0, 2)$, and suppose rewards are bounded in each dimension within $[0, R_{\max}]$. For any $\eta_0$ such that $\overline{\mathrm{MMD}}_\kappa(\eta_0, \eta^\pi) \le D < \infty$, and any $\delta > 0$, for the sequence $(\eta_k)_{k \ge 0}$ defined in Equation (5), with probability at least $1 - \delta$ we have

$$\overline{\mathrm{MMD}}_\kappa(\eta_K, \eta^\pi) \le \widetilde{O}\left( \frac{d^{\alpha/2} R^\alpha_{\max}}{(1 - \gamma^{\alpha/2})(1 - \gamma)^\alpha} \sqrt{\frac{\log(|\mathcal{X}|\delta^{-1})}{m}} \right),$$

where $\eta_{k+1} = \mathrm{BootProj}^\pi_{\kappa,m}\eta_k$ and $K = \lceil \log m / \log \gamma^{-\alpha} \rceil$, and where $\widetilde{O}$ omits logarithmic factors in $m$.
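A tabular sketch of one sweep of the randomised operator in Equation (5) is given below; the data layout (a dict of particle arrays, a row-stochastic transition matrix) is an assumption for illustration.

```python
import numpy as np

def bootstrap_dp_sweep(eta, P_pi, r, gamma, m, rng):
    """One application of BootProj^pi_{kappa,m} from Equation (5), at every state.

    eta: dict mapping state index -> (m, d) array of equally-weighted particles.
    P_pi: (S, S) row-stochastic policy-conditioned transition matrix.
    r: (S, d) array of cumulants; rng: np.random.Generator.
    """
    new_eta = {}
    for x in eta:
        # X_i ~ P^pi(. | x) i.i.d., then Z_i ~ eta_k(X_i) by sampling a particle
        next_states = rng.choice(len(P_pi[x]), size=m, p=P_pi[x])
        Z = np.stack([eta[x_next][rng.integers(m)] for x_next in next_states])
        new_eta[x] = r[x] + gamma * Z  # pushforward through b_{r(x), gamma}
    return new_eta
```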
This shows that our novel randomised DP algorithm with EWP representations can tractably compute accurate approximations to the true multivariate return distributions, with only polynomial dependence on the dimension $d$. Appendix C illustrates explicitly how this procedure is more memory efficient than unprojected EWP dynamic programming. However, the guarantees associated with this algorithm hold only with high probability, and are weaker than the pointwise convergence guarantees of one-dimensional distributional DP algorithms [RBD+18, RMA+24, BDR23]. Consequently, these guarantees do not provide a clear understanding of the EWP-TD method described at the beginning of this section. In the sequel, however, we introduce DP and TD algorithms based on categorical representations, for which we derive dynamic programming and TD-learning convergence bounds.

The proof of Theorem 3 hinges on the following proposition, which demonstrates that convergence of projected EWP dynamic programming is controlled by how far return distributions are transported under the projection map.

**Proposition 1** (Convergence of EWP Dynamic Programming). Consider a kernel satisfying the hypotheses of Theorem 2, suppose rewards are globally bounded in each dimension in $[0, R_{\max}]$, and let $\{\Pi^{(k)}_{\kappa,m}\}_{k \ge 0}$ be a sequence of maps $\Pi^{(k)}_{\kappa,m} : \mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)^{\mathcal{X}} \to \mathcal{C}_{\mathrm{EWP},m}$ satisfying, for every $\eta$ and every $x \in \mathcal{X}$,

$$\mathrm{MMD}_\kappa\left((\Pi^{(k)}_{\kappa,m}\eta)(x), \eta(x)\right) \le f(d, m) < \infty \qquad \forall k \ge 0. \tag{6}$$

Then the iterates $(\eta_k)_{k \ge 0}$ given by $\eta_{k+1} = \Pi^{(k)}_{\kappa,m}\mathcal{T}^\pi\eta_k$ with $\overline{\mathrm{MMD}}_\kappa(\eta_0, \eta^\pi) = D < \infty$ converge to a set $\eta^m_{\mathrm{EWP},\kappa} \subseteq B\left(\eta^\pi, (1 - \gamma^{c/2})^{-1}f(d, m)\right)$ in $\overline{\mathrm{MMD}}_\kappa$, where $B$ denotes the closed ball in $\overline{\mathrm{MMD}}_\kappa$,

$$B(\eta, R) \triangleq \left\{ \eta' \in \mathscr{P}(\mathbb{R}^d)^{\mathcal{X}} : \overline{\mathrm{MMD}}_\kappa(\eta, \eta') \le R \right\}.$$

As an immediate corollary of Proposition 1 and Theorem 3, we can derive an error rate for projected dynamic programming with $\Pi^m_{\mathrm{EWP},\kappa}$ as well.

**Corollary 1.** For any kernel $\kappa$ satisfying the hypotheses of Theorem 3, and for any $\eta_0 \in \mathcal{C}_{\mathrm{EWP},m}$ for which $\overline{\mathrm{MMD}}_\kappa(\eta_0, \eta^\pi) \le D < \infty$, the iterates $\eta_{k+1} = \Pi^m_{\mathrm{EWP},\kappa}\mathcal{T}^\pi\eta_k$ converge to a set $\eta^m_{\mathrm{EWP},\kappa} \subseteq \mathcal{C}_{\mathrm{EWP},m}$, where

$$\eta^m_{\mathrm{EWP},\kappa} \subseteq B\left( \eta^\pi,\ \frac{6\, d^{\alpha/2} R^\alpha_{\max}}{(1 - \gamma^{\alpha/2})(1 - \gamma)^\alpha \sqrt{m}} \right).$$

## 5 Categorical Multivariate Distributional Dynamic Programming

Our next contribution is the introduction of a convergent multivariate distributional dynamic programming algorithm based on a categorical representation of return distribution functions, generalizing the algorithms and analysis of [RBD+18] to the multivariate setting.

**Categorical representations.** As outlined above, to model the multi-return distribution function in practice, it is necessary to restrict each multi-return distribution to a finitely-parameterized class. In this work, we take inspiration from successful distributional RL algorithms [BDM17b, RBD+18] and employ a categorical representation. The work of [WUS23] proposed a categorical representation for multivariate DRL, but their categorical projection was not justified theoretically, and it required a particular choice of fixed support. We propose a novel categorical representation with a finite (possibly state-dependent) support $R(x) = \{\xi(x)_i\}_{i=1}^{N(x)} \subset \mathbb{R}^d$, that models the multi-return distribution function $\eta$ such that $\eta(x) \in \Delta_{R(x)}$ for each $x \in \mathcal{X}$. The notation $\xi(x)_i$ simply refers to the $i$th support point at state $x$ specified by $R$, and $\Delta_A$ denotes the probability simplex on the finite set $A$. We refer to the mapping $R$ as the support map,[^2] and we denote the class of multi-return distribution functions under the corresponding categorical representation as $\mathcal{C}_R = \prod_{x \in \mathcal{X}} \Delta_{R(x)}$.

[^2]: In many applications, the most natural support map is constant across the state space.

**Categorical projection.** Once again, the distributional Bellman operator is not generally closed over $\mathcal{C}_R$; that is, $\mathcal{T}^\pi \mathcal{C}_R \not\subseteq \mathcal{C}_R$.
As such, we cannot directly employ the procedure described in Theorem 2; rather, we need to project applications of $\mathcal{T}^\pi$ back onto $\mathcal{C}_R$. Roughly, we need an operator $\Pi : \mathscr{P}(\mathbb{R}^d)^{\mathcal{X}} \to \mathcal{C}_R$ for which $\Pi|_{\mathcal{C}_R} = \mathrm{id}$. Given such an operator, as in the literature on categorical distributional RL [BDM17b, RBD+18], we will study the convergence of the iterates $\eta_{k+1} = \Pi\mathcal{T}^\pi\eta_k$. Projection operators used in the scalar categorical distributional RL literature are specific to distributions over $\mathbb{R}$, so we must introduce a new projection. We propose a projection similar to (4),

$$(\Pi^R_{\mathrm{C},\kappa}\eta)(x) = \operatorname*{arginf}_{p \in \Delta_{R(x)}} \mathrm{MMD}_\kappa(p, \eta(x)). \tag{7}$$

We will now verify that $\Pi^R_{\mathrm{C},\kappa}$ is well-defined, and that it satisfies the aforementioned properties.

**Lemma 1.** Let $\kappa$ be a kernel induced by a semimetric $\rho$ on $[0, (1-\gamma)^{-1}R_{\max}]^d$ with strong negative type (cf. Theorem 1). Then $\Pi^R_{\mathrm{C},\kappa}$ is well-defined, $\mathrm{Ran}(\Pi^R_{\mathrm{C},\kappa}) \subseteq \mathcal{C}_R$, and $\Pi^R_{\mathrm{C},\kappa}|_{\mathcal{C}_R} = \mathrm{id}$.

It is worth noting that beyond simply ensuring the well-posedness of the projection $\Pi^R_{\mathrm{C},\kappa}$, Lemma 1 also certifies an efficient algorithm for computing the projection, namely, by solving the appropriate quadratic program (QP), as observed by [SZS+08]. We give pseudocode for computing the projected Bellman operator $\Pi^R_{\mathrm{C},\kappa}\mathcal{T}^\pi$ with a QP solver QPSolve in Algorithm 1.

```
Algorithm 1: Projected Categorical Dynamic Programming
Require: support map R, kernel κ, transition kernel P^π, reward function r, discount γ
Require: return distribution function η ∈ C_R
for each state x ∈ X do
    (T^π η)_x ← Σ_{x'} Σ_{ξ ∈ R(x')} P^π(x' | x) η_{x'}(ξ) δ_{r(x) + γξ}       ▷ Bellman backup
    (K_x)_{i,j} ← κ(ξ_i, ξ_j) for each (ξ_i, ξ_j) ∈ R(x)²                      ▷ Gram matrix
    q_j ← Σ_{ξ ∈ supp((T^π η)_x)} (T^π η)_x(ξ) κ(ξ_j, ξ) for each ξ_j ∈ R(x)   ▷ target embedding
    p ← QPSolve( min_{p ∈ ℝ^{|R(x)|}} pᵀ K_x p − 2 pᵀ q  s.t.  p ≥ 0, Σ_i p_i = 1 )
    (Π^R_{C,κ} T^π η)_x ← Σ_{ξ_i ∈ R(x)} p_i δ_{ξ_i}
end for
return Π^R_{C,κ} T^π η
```

**Lemma 2.** Under the conditions of Lemma 1, $\Pi^R_{\mathrm{C},\kappa}$ is a nonexpansion in $\overline{\mathrm{MMD}}_\kappa$. That is, for any $\eta_1, \eta_2 \in \mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)^{\mathcal{X}}$, we have $\overline{\mathrm{MMD}}_\kappa(\Pi^R_{\mathrm{C},\kappa}\eta_1, \Pi^R_{\mathrm{C},\kappa}\eta_2) \le \overline{\mathrm{MMD}}_\kappa(\eta_1, \eta_2)$.

**Categorical multivariate distributional dynamic programming.** As an immediate consequence of Lemma 2, projected dynamic programming under the projection $\Pi^R_{\mathrm{C},\kappa}$ is convergent: since $\mathcal{T}^\pi$ is a contraction in $\overline{\mathrm{MMD}}_\kappa$ and $\Pi^R_{\mathrm{C},\kappa}$ is a nonexpansion in $\overline{\mathrm{MMD}}_\kappa$, the projected operator $\Pi^R_{\mathrm{C},\kappa}\mathcal{T}^\pi$ is a contraction in $\overline{\mathrm{MMD}}_\kappa$; a standard invocation of the Banach fixed point theorem, appealing to the completeness of $(\mathcal{C}_R, \overline{\mathrm{MMD}}_\kappa)$, certifies that repeated application of $\Pi^R_{\mathrm{C},\kappa}\mathcal{T}^\pi$ results in convergence to a unique fixed point.

**Corollary 2.** Let $\kappa$ be a kernel satisfying the conditions of Theorem 2. Then for any $\eta_0 \in \mathcal{C}_R$, the iterates $\{\eta_k\}_{k=1}^{\infty}$ given by $\eta_{k+1} = \Pi^R_{\mathrm{C},\kappa}\mathcal{T}^\pi\eta_k$ converge geometrically to a unique fixed point.

Beyond the result of Theorem 2, Corollary 2 illustrates an algorithm for estimating $\eta^\pi$ in $\overline{\mathrm{MMD}}_\kappa$, given knowledge of the transition kernel and the reward function, which is computationally tractable in tabular MDPs. Indeed, the iterates $(\eta_k)_{k \ge 0}$ all lie in $\mathcal{C}_R$, having finitely-many degrees of freedom. Algorithm 1 outlines a computationally tractable procedure for learning the fixed point $\eta^\pi_{\mathrm{C},\kappa}$ in this setting. We note that the work of [WUS23] provided an alternative multivariate categorical algorithm, which was not analyzed theoretically. Moreover, our method provides the additional ability to support arbitrary state-dependent support maps, while theirs requires support maps to be uniform grids.
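The QP in Algorithm 1 is small (one $|R(x)|$-dimensional problem per state) and can be handed to any off-the-shelf solver. Below is a self-contained sketch of the projection step, using SciPy's SLSQP solver in place of the abstract QPSolve; a dedicated QP solver would be faster in practice, and all names here are our own.

```python
import numpy as np
from scipy.optimize import minimize

def project_categorical(support, target_atoms, target_probs, alpha=1.0):
    """MMD projection of a finitely supported measure onto the simplex over
    `support`, by minimising p^T K p - 2 p^T q as in Lemma 1 / Algorithm 1,
    for the energy-distance kernel of Remark 1."""
    def rho(u, v):
        return np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1) ** alpha

    y0 = support[:1]  # base point of the induced kernel from Theorem 1
    def kernel(u, v):
        return 0.5 * (rho(u, y0) + rho(v, y0).T - rho(u, v))

    K = kernel(support, support)                      # Gram matrix K_x
    q = kernel(support, target_atoms) @ target_probs  # embedding of (T^pi eta)_x
    n = len(support)
    res = minimize(
        fun=lambda p: p @ K @ p - 2.0 * p @ q,
        x0=np.full(n, 1.0 / n),
        jac=lambda p: 2.0 * (K @ p - q),
        bounds=[(0.0, None)] * n,                                      # p >= 0
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],  # mass 1
        method="SLSQP",
    )
    return res.x  # projected probabilities over the support points
```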
**Accurate approximations.** We now provide bounds showing that the fixed point $\eta^\pi_{\mathrm{C},\kappa}$ from Corollary 2 can be made arbitrarily accurate by increasing the number of atoms. To derive a bound on the quality of the fixed point, we provide a reduction, via partitioning the space of returns, to the covering number of this space. Proceeding, we define a class of partitions $\mathscr{P}_{R(x)}$, where each $\mathcal{P} \in \mathscr{P}_{R(x)}$ satisfies

1. $|\mathcal{P}| = N(x)$;
2. For any $\theta_1, \theta_2 \in \mathcal{P}$, either $\theta_1 \cap \theta_2 = \emptyset$ or $\theta_1 = \theta_2$;
3. $\bigcup_{\theta \in \mathcal{P}} \theta = [0, (1-\gamma)^{-1}R_{\max}]^d$;
4. Each element $\theta_i \in \mathcal{P}$ contains exactly one element $z_i \in R(x)$.

For any partition $\mathcal{P}$, we define a notion of mesh size relative to a kernel $\kappa$ induced by a semimetric $\rho$,

$$\mathrm{mesh}(\mathcal{P}; \kappa) = \max_{\theta \in \mathcal{P}}\ \sup_{y_1, y_2 \in \theta} \rho(y_1, y_2). \tag{8}$$

The following result confirms that $\eta^\pi_{\mathrm{C},\kappa}$ recovers $\eta^\pi$ as the mesh decreases.

**Theorem 4.** Let $\kappa$ be a kernel induced by a $c$-homogeneous and shift-invariant semimetric $\rho$ conforming to the conditions of Theorem 2. Then the fixed point $\eta^\pi_{\mathrm{C},\kappa}$ of $\Pi^R_{\mathrm{C},\kappa}\mathcal{T}^\pi$ satisfies

$$\overline{\mathrm{MMD}}_\kappa(\eta^\pi_{\mathrm{C},\kappa}, \eta^\pi) \le \frac{1}{1 - \gamma^{c/2}}\ \sup_{x \in \mathcal{X}}\ \inf_{\mathcal{P} \in \mathscr{P}_{R(x)}} \sqrt{\mathrm{mesh}(\mathcal{P}; \kappa)}. \tag{9}$$

Thus, for any sequence of supports $\{R(x)_m\}_{m \ge 1}$ for which the maximal distance (in $\rho$) between any two points in $R(x)_m$ tends to $0$ as $m \to \infty$, the fixed point $\eta^\pi_{\mathrm{C},\kappa}$ approximates $\eta^\pi$ to arbitrary precision for large enough $m$. The next corollary illustrates this in a familiar setting.

**Corollary 3.** Let $R(x) = U_m$, where $U_m$ is a set of $m$ uniformly-spaced support points on $[0, (1-\gamma)^{-1}R_{\max}]^d$. For $\kappa$ induced by the semimetric $\rho(x, y) = \|x - y\|_2^\alpha$ with $\alpha \in (0, 2)$,

$$\overline{\mathrm{MMD}}_\kappa(\eta^\pi_{\mathrm{C},\kappa}, \eta^\pi) \le \frac{1}{(1 - \gamma^{\alpha/2})(1 - \gamma)^{\alpha/2}} \cdot \frac{d^{\alpha/4}\, R^{\alpha/2}_{\max}}{(m^{1/d} - 2)^{\alpha/2}}.$$

With $\alpha = 1$ and $d = 1$, the MMD in Corollary 3 is equivalent to the Cramér metric [SR13], the setting in which categorical (scalar) distributional dynamic programming is well understood. Our rate matches the known $\Theta(m^{-1/2})$ rate shown by [RBD+18] in this setting. Thus, our results offer a new perspective on categorical DRL, and naturally generalize the theory to the multivariate setting.

Theorem 4 relies on the following lemma about the approximation quality of the categorical MMD projection, which may be of independent interest.

**Lemma 3.** Let $\kappa$ be a kernel satisfying the conditions of Lemma 1, and for any finite $R \subset \mathbb{R}^d$, define $\Pi : \mathscr{P}(\mathbb{R}^d) \to \Delta_R$ via $\Pi p = \operatorname{arginf}_{q \in \Delta_R} \mathrm{MMD}_\kappa(p, q)$. Then $\mathrm{MMD}^2_\kappa(\Pi p, p) \le \inf_{\mathcal{P} \in \mathscr{P}_R} \mathrm{mesh}(\mathcal{P}; \kappa)$.

At this stage, we have shown definitively that categorical dynamic programming converges in the multivariate case. In the sequel, we build on these results to provide a convergent multivariate categorical TD-learning algorithm.

*Figure 1: Distributional SMs and associated predicted return distributions with the categorical (left) and EWP (right) representations. Simplex plots denote the distributional SM. Histograms denote the associated return distributions, predicted from a pair of held-out reward functions.*

### 5.1 Simulation: The Distributional Successor Measure

As a preliminary example, we consider 3-state MDPs with the cumulants $r(i) = (1 - \gamma)e_i$, $i \in [3]$, for $e_i$ the $i$th basis vector. In this setting, $\eta^\pi$ encodes the distribution over trajectory-wise discounted state occupancies, which was discussed in the recent work of [WFG+24] and called the distributional successor measure (DSM). In particular, [WFG+24] showed that $x \mapsto \mathrm{Law}(\langle G_x, r\rangle)$ for $G_x \sim \eta^\pi(x)$ is the return distribution function for any scalar reward function $r$. Figure 1 shows that the projected categorical dynamic programming algorithm accurately approximates the distribution over discounted state occupancies as well as distributions over returns on held-out reward functions.

## 6 Multivariate Distributional TD-Learning

Next, we devise an algorithm for approximating the multi-return distribution function when the transition kernel and reward function are not known, and are observed only through samples.
Indeed, this is a strong motivation for TD-learning algorithms [Sut88], wherein state transition data alone is used to solve the Bellman equation. In this section, we devise a TD-learning algorithm for multivariate DRL, leveraging our results on categorical dynamic programming in $\overline{\mathrm{MMD}}_\kappa$.

**Relaxation to signed measures.** In the $d = 1$ setting, the categorical projection presented above is known to be affine [RBD+18], making scalar categorical TD-learning amenable to common techniques in stochastic approximation theory. However, the projection is not affine when $d \ge 2$; we give an explicit example in Appendix D. Thus, we relax the categorical representation to include signed measures, which provides us with an affine projection [BRCM19]; this is crucial for proving our main result, Theorem 6. We denote by $\mathcal{M}_1(\mathcal{Y})$ the set of all signed measures $\mu$ over $\mathcal{Y}$ with $\mu(\mathcal{Y}) = 1$. We begin by noting that the MMD endows $\mathcal{M}_1(\mathcal{Y})$ with a metric structure.

**Lemma 4.** Let $\kappa : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a characteristic kernel over some space $\mathcal{Y}$. Then $\mathrm{MMD}_\kappa : \mathcal{M}_1(\mathcal{Y}) \times \mathcal{M}_1(\mathcal{Y}) \to \mathbb{R}_+$ given by $(p, q) \mapsto \|\mu_p - \mu_q\|_{\mathcal{H}}$ defines a metric on $\mathcal{M}_1(\mathcal{Y})$, where $\mu_p$ denotes the usual mean embedding of $p$ and $\mathcal{H}$ is the RKHS with kernel $\kappa$.

We define the relaxed projection $\Pi^R_{\mathrm{SC},\kappa} : \mathcal{M}_1([0, (1-\gamma)^{-1}R_{\max}]^d)^{\mathcal{X}} \to \prod_{x \in \mathcal{X}} \mathcal{M}_1(R(x)) =: \mathcal{S}_R$,

$$\left(\Pi^R_{\mathrm{SC},\kappa}\eta\right)(x) = \operatorname*{arginf}_{p \in \mathcal{M}_1(R(x))} \mathrm{MMD}_\kappa(p, \eta(x)). \tag{10}$$

Note that (10) is very similar to the definition of the categorical MMD projection in (7); the only difference is that the optimization occurs over the larger class of mass-1 signed measures. It is also worth noting that the distributional Bellman operator can be applied directly to signed measures, which yields the following convenient result.

**Lemma 5.** Under the conditions of Corollary 2, the projected operator $\Pi^R_{\mathrm{SC},\kappa}\mathcal{T}^\pi : \mathcal{S}_R \to \mathcal{S}_R$ is affine, is contractive with contraction factor $\gamma^{c/2}$, and has a unique fixed point $\eta^\pi_{\mathrm{SC},\kappa}$.

While we have relaxed the projection, the fixed point $\eta^\pi_{\mathrm{SC},\kappa}$ is a good approximation of $\eta^\pi$.

**Theorem 5.** Under the conditions of Lemma 5, we have that

1. $\overline{\mathrm{MMD}}_\kappa(\eta^\pi_{\mathrm{SC},\kappa}, \eta^\pi) \le \frac{1}{1 - \gamma^{c/2}}\ \sup_{x \in \mathcal{X}}\ \inf_{\mathcal{P} \in \mathscr{P}_{R(x)}} \sqrt{\mathrm{mesh}(\mathcal{P}; \kappa)}$; and
2. $\overline{\mathrm{MMD}}_\kappa(\Pi^R_{\mathrm{C},\kappa}\eta^\pi_{\mathrm{SC},\kappa}, \eta^\pi) \le \left(1 + \frac{1}{1 - \gamma^{c/2}}\right) \sup_{x \in \mathcal{X}}\ \inf_{\mathcal{P} \in \mathscr{P}_{R(x)}} \sqrt{\mathrm{mesh}(\mathcal{P}; \kappa)}$.

Notably, the second statement of Theorem 5 states that projecting $\eta^\pi_{\mathrm{SC},\kappa}$ back onto the space of multi-return distribution functions yields approximately the same error as $\eta^\pi_{\mathrm{C},\kappa}$ when $\gamma$ is near 1.

In the remainder of the section, we assume access to a stream of MDP transitions $\{T_t\}_{t=1}^{\infty} \subset \mathcal{X} \times \mathcal{A} \times \mathbb{R}^d \times \mathcal{X}$ consisting of elements $T_t = (X_t, A_t, R_t, X'_t)$ with the following structure,

$$X_t \sim \mathbb{P}(\cdot \mid \mathcal{F}_{t-1}), \qquad A_t \sim \pi(\cdot \mid X_t), \qquad R_t = r(X_t), \qquad X'_t \sim P(\cdot \mid X_t, A_t), \tag{11}$$

where $\mathbb{P}$ is some probability measure and $\{\mathcal{F}_t\}_{t=1}^{\infty}$ is the canonical filtration $\mathcal{F}_t = \sigma(\{T_s\}_{s=1}^{t})$. Based on these transitions, we can define stochastic distributional Bellman backups by

$$\widehat{\mathcal{T}}^\pi_t \eta(x) = \begin{cases} (\mathrm{b}_{R_t,\gamma})_{\#}\, \eta(X'_t) & x = X_t \\ \eta(x) & \text{otherwise}, \end{cases} \tag{12}$$

which notably can be computed exactly without knowledge of $P, r$. Due to the stronger convergence guarantees shown for projected multivariate distributional dynamic programming, we introduce an asynchronous incremental algorithm leveraging the categorical representation, which produces iterates $\{\widehat\eta_t\}_{t=1}^{\infty}$ according to

$$\widehat\eta_{t+1} = (1 - \alpha_t)\widehat\eta_t + \alpha_t\, \Pi^R_{\mathrm{SC},\kappa}\widehat{\mathcal{T}}^\pi_t \widehat\eta_t \tag{13}$$

for $\widehat\eta_0 \in \mathcal{C}_R$, where $\{\alpha_t\}_{t=1}^{\infty}$ is any sequence of (possibly) random step sizes adapted to the filtration $\{\mathcal{F}_t\}_{t=1}^{\infty}$. The iterates of (13) closely resemble those of classic stochastic approximation algorithms [RM51] and particularly asynchronous TD-learning algorithms [JJS93, Tsi94, BT96], but with iterates taking values in the space of state-indexed signed measures.
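Because the constraint set $\mathcal{M}_1(R(x))$ drops nonnegativity, the projection (10) has a closed form from the KKT conditions of the quadratic program in Lemma 1 with only the mass constraint retained, which is what makes the update (13) affine. Below is a sketch of one TD update under these definitions, assuming a state-independent support with precomputed Gram matrices; all names are ours.

```python
import numpy as np

def signed_projection(K, b):
    """Projection onto mass-1 signed measures, Equation (10): the minimiser of
    p^T K p - 2 p^T b subject only to 1^T p = 1. The KKT conditions give
    p = K^{-1}(b + lam * 1), with lam chosen so the masses sum to one."""
    ones = np.ones(len(b))
    Kinv_b = np.linalg.solve(K, b)
    Kinv_1 = np.linalg.solve(K, ones)
    lam = (1.0 - ones @ Kinv_b) / (ones @ Kinv_1)
    return Kinv_b + lam * Kinv_1

def signed_cat_td_update(p_x, p_next, K, K_target, alpha_t):
    """One iterate of Equation (13) at the visited state X_t = x.

    p_x, p_next: signed mass-1 weight vectors at x and X'_t over support {xi_i};
    K[i, j] = kappa(xi_i, xi_j); K_target[i, j] = kappa(xi_i, R_t + gamma * xi_j),
    so K_target @ p_next embeds the stochastic target (b_{R_t,gamma})_# eta(X'_t).
    """
    target = signed_projection(K, K_target @ p_next)
    return (1.0 - alpha_t) * p_x + alpha_t * target
```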
Indeed, our next result draws on the techniques from these works to establish convergence of TD-learning on $\mathcal{S}_R$ representations.

**Theorem 6.** For a kernel $\kappa$ induced by a semimetric $\rho$ of strong negative type, the sequence $\{\widehat\eta_t\}_{t=1}^{\infty}$ given by (11)–(13) converges to $\eta^\pi_{\mathrm{SC},\kappa}$ with probability 1.

### 6.1 Simulations: Distributional Successor Features

To illustrate the behavior of our categorical TD algorithm, we employ it to learn the multi-return distributions for several tabular MDPs with random cumulants. We focus on the case of 2- and 3-dimensional cumulants, which is the setting studied in recent works on multivariate distributional RL [ZCZ+21, WUS23]. Interpreting the multi-return distributions as joint distributions over successor features (SFs) [BDM+17a], we additionally evaluate the return distributions for random reward functions in the span of the cumulants. We compare our categorical TD approach with a tabular implementation of the EWP TD algorithm of [ZCZ+21], for which no convergence bounds are known.

*Figure 2: Error of zero-shot return distribution predictions over random MDPs, measured by Cramér distance, and showing 95% confidence intervals. Both panels plot Cramér distance against the number of atoms; panel (a) compares EWP-TD, Cat-TD, and Signed-Cat-TD, and panel (b) compares EWP-TD and Signed-Cat-TD.*

Figure 2a compares TD-learning approaches based on their ability to accurately infer (scalar) return distributions on held-out reward functions, averaged over 100 random MDPs, with transitions drawn from Dirichlet priors and 2-dimensional cumulants drawn from uniform priors. The performance of the categorical algorithms sharply increases as the number of atoms increases. On the other hand, the EWP TD algorithm performs well with few atoms, but does not improve very much with higher-resolution representations. We posit this is due to the algorithm getting stuck in local minima, given the non-convexity of the EWP MMD objective. This hypothesis is supported as well by Figure 3, which depicts the learned distributional SFs and return distribution predictions. In particular, we observe that the learned particle locations in the EWP TD approach tend to cluster in some areas or get stuck in low-density regions, which suggests the presence of a local optimum. On the other hand, our provably-convergent categorical TD method learns a high-fidelity quantization of the true multi-return distributions.

*Figure 3: Distributional SFs and predicted return distributions with m = 400 atoms, in a random MDP with known rectangular bound on cumulants. Left: Categorical TD. Right: EWP TD.*

Naturally, however, the benefits of the poly(d) bounds for EWP suggested by Theorem 3 become more apparent as we increase the cumulant dimension. Figure 2b repeats the experiment of Figure 2a with $d = 3$, using randomized support points for the categorical algorithm to avoid a cubic growth in the cardinality of the supports. Notably, our method is the first capable of supporting such unstructured supports. While the categorical TD approach can still outperform EWP, a much larger number of atoms is required. This is unsurprising in light of our theoretical results.
## 7 Conclusion

We have provided the first provably convergent and computationally tractable algorithms for learning multivariate return distributions in tabular MDPs. Our theoretical results include convergence guarantees that indicate how accuracy scales with the number of particles $m$ used in distribution representations, and interestingly motivate the use of signed measures to develop provably convergent TD algorithms. While it is difficult to scale categorical representations to high-dimensional cumulants, our algorithm is highly performant in the low-$d$ setting, which has been the focus of recent work in multivariate distributional RL. Notably, even the $d = 2$ setting has important applications: efforts in safe RL depend on distinguishing a cost signal from a reward signal (see, e.g., [YSTS23]), which can be modeled by bivariate distributional RL. In this setting, our method can easily be scaled to large state spaces by approximating the categorical signed measures with neural networks; an illustrated example is given in Appendix F. On the other hand, the prospect of learning multi-return distributions for high-dimensional cumulants also has many important applications, such as modeling close approximations to distributional successor measures [WFG+24] for zero-shot risk-sensitive policy evaluation. In this setting, we believe EWP-based multivariate DRL will be highly impactful. Our results concerning EWP dynamic programming provide promising evidence that the accuracy of EWP representations scales gracefully with $d$ for a fixed number of atoms. Thus, we believe that understanding convergence of EWP TD-learning algorithms is a very interesting and important open problem for future investigation.

## Acknowledgements

The authors would like to thank Yunhao Tang, Tyler Kastner, Arnav Jain, Yash Jhaveri, Will Dabney, David Meger, and Marc Bellemare for their helpful feedback, as well as insightful suggestions from anonymous reviewers. This work was supported by the Fonds de Recherche du Québec, the Natural Sciences and Engineering Research Council of Canada, and the compute resources provided by Mila (mila.quebec).

## References

[BB96] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.

[BBC+21] Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. Efficient and modular implicit differentiation. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[BCC+20] Marc G. Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C. Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77–82, December 2020.

[BDM+17a] André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[BDM17b] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

[BDR23] Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. The MIT Press, 2023.

[BEKS17] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
[BFH+18] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[BHB+20] André Barreto, Shaobo Hou, Diana Borsa, David Silver, and Doina Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences (PNAS), 117(48):30079–30087, 2020.

[BRCM19] Marc G. Bellemare, Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra. Distributional reinforcement learning with linear function approximation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[BT96] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[CGH+96] Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5:329–359, 1996.

[CZZ+24] Xin-Qiang Cai, Pushi Zhang, Li Zhao, Jiang Bian, Masashi Sugiyama, and Ashley Llorens. Distributional Pareto-optimal multi-objective reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[DKNU+20] Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis Hassabis, Rémi Munos, and Matthew Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792):671–675, 2020.

[DRBM18] Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence, 2018.

[FSMT19] Dror Freirich, Tzahi Shimkin, Ron Meir, and Aviv Tamar. Distributional multivariate policy evaluation and exploration with the Bellman GAN. In International Conference on Machine Learning (ICML), 2019.

[GBR+12] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research (JMLR), 13(25):723–773, 2012.

[GBSL21] Michael Gimelfarb, André Barreto, Scott Sanner, and Chi-Guhn Lee. Risk-aware transfer in reinforcement learning using successor features. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[HRB+22] Conor F. Hayes, Roxana Radulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel de Oliveira Ramos, Marcello Restelli, Peter Vamplew, and Diederik M. Roijers. A practical guide to multi-objective reinforcement learning and planning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2022.

[JJS93] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 1993.

[KEF23] Tyler Kastner, Murat A. Erdogdu, and Amir-massoud Farahmand. Distributional model equivalence for risk-sensitive reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[Lax02] Peter D. Lax. Functional Analysis. John Wiley & Sons, 2002.

[LB22] Alix Lhéritier and Nicolas Bondoux. A Cramér distance perspective on quantile regression based distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
[LK24] Dong Neuck Lee and Michael R. Kosorok. Off-policy reinforcement learning with high dimensional reward. CoRR, abs/2408.07660, 2024.

[LM22] Shiau Hong Lim and Ilyas Malik. Distributional reinforcement learning for risk-sensitive policies. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[MSK+10] Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In International Conference on Machine Learning (ICML), 2010.

[NTGV20] Thanh Nguyen-Tang, Sunil Gupta, and Svetha Venkatesh. Distributional reinforcement learning via moment matching. In AAAI Conference on Artificial Intelligence, 2020.

[RBD+18] Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

[RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, September 1951.

[RMA+24] Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, and Will Dabney. An analysis of quantile temporal-difference learning. Journal of Machine Learning Research (JMLR), 25(163):1–47, 2024.

[ROH+21] Mark Rowland, Shayegan Omidshafiei, Daniel Hennes, Will Dabney, Andrew Jaegle, Paul Muller, Julien Pérolat, and Karl Tuyls. Temporal difference and return optimism in cooperative multi-agent reinforcement learning. In AAMAS ALA Workshop, 2021.

[RVWD13] Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research (JAIR), 48:67–113, 2013.

[Sch00] Bernhard Schölkopf. The kernel trick for distances. In Advances in Neural Information Processing Systems (NeurIPS), 2000.

[SFS24] Eiki Shimizu, Kenji Fukumizu, and Dino Sejdinovic. Neural-kernel conditional mean embeddings. In International Conference on Machine Learning (ICML), 2024.

[She21] Vladimir Shenmaier. On the complexity of the geometric median problem with outliers. CoRR, abs/2112.00519, 2021.

[SLL21] Wei-Fang Sun, Cheng-Kuang Lee, and Chun-Yi Lee. DFAC framework: Factorizing the value function via quantile mixture for multi-agent distributional Q-learning. In International Conference on Machine Learning (ICML), 2021.

[SR13] Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, August 2013.

[SSGF13] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5), October 2013.

[Sut88] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

[SZS+08] Le Song, Xinhua Zhang, Alex Smola, Arthur Gretton, and Bernhard Schölkopf. Tailoring density estimation via reproducing kernel moment matching. In International Conference on Machine Learning (ICML), 2008.

[Tsi94] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.

[TSM17] Ilya Tolstikhin, Bharath K. Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research (JMLR), 18(86):1–47, 2017.

[TVR97] John N. Tsitsiklis and Benjamin
Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

[Vil09] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.

[VN02] Jan van Neerven. Approximating Bochner integrals by Riemann sums. Indagationes Mathematicae, 13(2):197–208, June 2002.

[WBK+22] Peter R. Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik Subramanian, Thomas J. Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert, Florian Fuchs, et al. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature, 602(7896):223–228, 2022.

[WFG+24] Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Yunhao Tang, André Barreto, Will Dabney, Marc G. Bellemare, and Mark Rowland. A distributional analogue to the successor representation. In International Conference on Machine Learning (ICML), 2024.

[WUS23] Runzhe Wu, Masatoshi Uehara, and Wen Sun. Distributional offline policy evaluation with predictive error guarantees. In International Conference on Machine Learning (ICML), 2023.

[YSTS23] Qisong Yang, Thiago D. Simão, Simon H. Tindemans, and Matthijs T. J. Spaan. Safety-constrained reinforcement learning with a distributional safety critic. Machine Learning, 112(3):859–887, 2023.

[YZL+19] Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[ZCZ+21] Pushi Zhang, Xiaoyu Chen, Li Zhao, Wei Xiong, Tao Qin, and Tie-Yan Liu. Distributional reinforcement learning for multi-dimensional reward functions. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

## Appendix Contents

- A In-Depth Summary of Related Work
- B Proofs
  - B.1 Multivariate Distributional Dynamic Programming: Section 3
  - B.2 EWP Dynamic Programming: Section 4
  - B.3 Categorical Dynamic Programming: Section 5
    - B.3.1 Quality of the Categorical Fixed Point
  - B.4 Categorical TD Learning: Section 6
    - B.4.1 The Signed Measure Relaxation
    - B.4.2 Convergence of Categorical TD Learning
- C Memory Efficiency of Randomized EWP Dynamic Programming
- D Nonlinearity of the Categorical MMD Projection
- E Experiment Details
- F Neural Multivariate Distributional TD-Learning

## A In-Depth Summary of Related Work

In Sections 1 and 2, we provided a high-level synopsis of the state of existing work in multivariate distributional RL. In this section, we elaborate further.

**Analysis techniques.** Our results in this paper mostly draw on the analysis of one-dimensional distributional RL algorithms such as categorical and quantile dynamic programming, as well as their temporal-difference learning counterparts [RBD+18, DRBM18, RMA+24, BDR23]. The proof techniques in these works are themselves related to contraction-based arguments for reinforcement learning with function approximation [Tsi94, BT96, TVR97].

**Multivariate distributional RL algorithms.**
Several prior works have contributed algorithms for multivariate distributional reinforcement learning, along with empirical demonstrations and theoretical analysis, though as we note in the main paper, the approaches proposed in this paper are the first algorithms with strong theoretical guarantees and efficient tabular implementations. [FSMT19] propose a deep-learning-based approach in which generative adversarial networks are used to model multivariate return distributions, and use these predictions to inform exploration strategies. [ZCZ+21] propose the TD algorithm combining equally-weighted particle representations and an MMD loss that we recall in Equation (3). They demonstrate the effectiveness of this algorithm in combination with deep learning function approximators, though they do not analyze it. [WUS23] propose a family of algorithms for multivariate distributional RL termed fitted likelihood evaluation (FLE). These methods mirror LSTD algorithms [BB96], iteratively minimising a batch objective function (in this case, a negative log-likelihood, NLL) over a growing dataset. [WUS23] demonstrate empirically that these algorithms are performant in low-dimensional settings, and provide theoretical analysis for FLE algorithms, assuming an oracle which can approximately optimise the NLL objective at each algorithm step. [SFS24] also propose a TD-learning algorithm for one-dimensional distributional RL using categorical support and an MMD-based loss. They demonstrate strong performance of this algorithm in classic RL domains such as CartPole and MountainCar, but do not analyze the algorithm. Our analysis in this paper suggests our novel relaxation to mass-1 signed measures may be crucial to obtaining a straightforwardly analyzable TD algorithm.

Finally, the concurrent work of [LK24] studied distributional Bellman operators for Banach-space-valued reward functions. Their work focuses on analyzing how well the fixed point of a finite-dimensional multivariate distributional Bellman equation can approximate the fixed point of a distributional Banach-space-valued Bellman equation. In contrast, our work only studies finite-dimensional reward functions, but provides explicit convergence rates and approximation bounds when the distribution representations are finite-dimensional, unlike [LK24]. Moreover, [LK24] considers an algorithm similar to that discussed in Theorem 3 but for categorical representations, though its convergence is not proved. Furthermore, [LK24] did not prove convergence of any TD-learning algorithms, although they did propose some TD-learning algorithms which achieved interesting results in simulation.

## B Proofs

### B.1 Multivariate Distributional Dynamic Programming: Section 3

In this section, we state some lemmas building up to the proof of Theorem 2. These lemmas generalize corresponding results of [NTGV20] that were specific to the scalar reward setting. We begin with a lemma that demonstrates a notion of convexity for the MMD induced by a conditionally positive definite kernel.

**Lemma 6.** Let $(p_a)_{a \in I} \subset \mathscr{P}(\mathcal{Y})$ and $(q_a)_{a \in I} \subset \mathscr{P}(\mathcal{Y})$ be collections of probability measures indexed by an index set $I$, and suppose $T \in \mathscr{P}(I)$. Then for any characteristic kernel $\kappa$, the following holds,

$$\mathrm{MMD}_\kappa\left(\mathbb{E}_{a \sim T}[p_a],\ \mathbb{E}_{a \sim T}[q_a]\right) \le \sup_{a \in I}\mathrm{MMD}_\kappa(p_a, q_a).$$

*Proof.* It is known from [Sch00] that characteristic kernels generate RKHSs $\mathcal{H}$ into which probability measures can be embedded.
As such, it holds that $\mathrm{MMD}_\kappa(p, q) = \|\mu_p - \mu_q\|$, where $\|\cdot\|$ is the norm in the Hilbert space $\mathcal{H}$, $\mu_p$ is the mean embedding of $p$ (that is, the unique element of $\mathcal{H}$ such that $\mathbb{E}_{y \sim p}[f(y)] = \langle f, \mu_p\rangle$ for every $f \in \mathcal{H}$), and $\langle\cdot,\cdot\rangle$ is the inner product in $\mathcal{H}$. Let $Tp \triangleq \mathbb{E}_{a \sim T}[p_a]$ and define $Tq$ analogously. We claim that $\mu_{Tp} = \mathbb{E}_{a \sim T}[\mu_{p_a}] \triangleq T\mu_p$. To see this, let $f \in \mathcal{H}$, and observe that

$$\mathbb{E}_{y \sim Tp}[f(y)] = \int f(y)\, Tp(\mathrm{d}y) = \iint f(y)\, T(\mathrm{d}a)\, p_a(\mathrm{d}y) = \iint f(y)\, p_a(\mathrm{d}y)\, T(\mathrm{d}a) = \int \langle f, \mu_{p_a}\rangle\, T(\mathrm{d}a) = \left\langle f, \int \mu_{p_a}\, T(\mathrm{d}a)\right\rangle,$$

where the third step invokes Fubini's theorem, and the penultimate step leverages the linearity of the inner product. Notably, $T$ acts as a linear operator on mean embeddings. As a result, we see that

$$\mathrm{MMD}_\kappa(Tp, Tq) = \|\mu_{Tp} - \mu_{Tq}\| = \|T\mu_p - T\mu_q\| = \left\|\int_I (\mu_{p_a} - \mu_{q_a})\, T(\mathrm{d}a)\right\| \le \int \|\mu_{p_a} - \mu_{q_a}\|\, T(\mathrm{d}a) \le \sup_{a \in I}\|\mu_{p_a} - \mu_{q_a}\| = \sup_{a \in I}\mathrm{MMD}_\kappa(p_a, q_a),$$

where the penultimate inequality is due to Jensen's inequality, and the final inequality holds since $\sup_{a \in I}\|\mu_{p_a} - \mu_{q_a}\|$ upper bounds the integrand, and the integral is a monotone operator. ∎

Next, we show how the MMD under the kernels hypothesized in Theorem 2 behaves under affine transformations of random variables.

**Lemma 7.** Let $\kappa$ be a kernel induced by a semimetric $\rho$ of strong negative type defined over a compact subset $\mathcal{Y} \subset \mathbb{R}^d$ that is both shift-invariant and $c$-homogeneous (cf. Theorem 2). Then for any $a \in \mathcal{Y}$ and $p, q \in \mathscr{P}(\mathcal{Y})$,

$$\mathrm{MMD}_\kappa\left((\mathrm{b}_{a,\gamma})_{\#} p, (\mathrm{b}_{a,\gamma})_{\#} q\right) \le \gamma^{c/2}\, \mathrm{MMD}_\kappa(p, q).$$

*Proof.* It is known [GBR+12] that the MMD can be expressed in terms of expected kernel evaluations, according to

$$\mathrm{MMD}^2_\kappa(p, q) = \mathbb{E}[\kappa(Y, Y')] + \mathbb{E}[\kappa(Z, Z')] - 2\,\mathbb{E}[\kappa(Y, Z)], \qquad (Y, Y', Z, Z') \sim p \otimes p \otimes q \otimes q.$$

Substituting the induced kernel $\kappa(y_1, y_2) = \frac{1}{2}(\rho(y_1, y_0) + \rho(y_2, y_0) - \rho(y_1, y_2))$ termwise,

$$\mathbb{E}[\kappa(Y, Y')] = \tfrac{1}{2}\,\mathbb{E}[\rho(Y, y_0) + \rho(Y', y_0) - \rho(Y, Y')], \quad \mathbb{E}[\kappa(Z, Z')] = \tfrac{1}{2}\,\mathbb{E}[\rho(Z, y_0) + \rho(Z', y_0) - \rho(Z, Z')], \quad \mathbb{E}[\kappa(Y, Z)] = \tfrac{1}{2}\,\mathbb{E}[\rho(Y, y_0) + \rho(Z, y_0) - \rho(Y, Z)],$$

so that

$$\mathrm{MMD}^2_\kappa(p, q) = \mathbb{E}[\rho(Y, Z)] - \tfrac{1}{2}\left(\mathbb{E}[\rho(Y, Y')] + \mathbb{E}[\rho(Z, Z')]\right),$$

where the last step invokes the definition of a kernel induced by a semimetric, and the linearity of expectation. Then, defining $\tilde Y, \tilde Y'$ as independent samples from $(\mathrm{b}_{a,\gamma})_{\#} p$ and $\tilde Z, \tilde Z'$ as independent samples from $(\mathrm{b}_{a,\gamma})_{\#} q$, we have

$$\begin{aligned}
\mathrm{MMD}^2_\kappa((\mathrm{b}_{a,\gamma})_{\#} p, (\mathrm{b}_{a,\gamma})_{\#} q) &= \mathbb{E}[\rho(\tilde Y, \tilde Z)] - \tfrac{1}{2}\left(\mathbb{E}[\rho(\tilde Y, \tilde Y')] + \mathbb{E}[\rho(\tilde Z, \tilde Z')]\right) \\
&= \mathbb{E}[\rho(a + \gamma Y, a + \gamma Z)] - \tfrac{1}{2}\left(\mathbb{E}[\rho(a + \gamma Y, a + \gamma Y')] + \mathbb{E}[\rho(a + \gamma Z, a + \gamma Z')]\right) \\
&= \mathbb{E}[\rho(\gamma Y, \gamma Z)] - \tfrac{1}{2}\left(\mathbb{E}[\rho(\gamma Y, \gamma Y')] + \mathbb{E}[\rho(\gamma Z, \gamma Z')]\right) \\
&= \gamma^c\left(\mathbb{E}[\rho(Y, Z)] - \tfrac{1}{2}\left(\mathbb{E}[\rho(Y, Y')] + \mathbb{E}[\rho(Z, Z')]\right)\right) = \gamma^c\, \mathrm{MMD}^2_\kappa(p, q),
\end{aligned}$$

where the second step is a change of variables, the third step invokes the shift-invariance of $\rho$, and the fourth step invokes the homogeneity of $\rho$. Thus, it follows that $\mathrm{MMD}_\kappa((\mathrm{b}_{a,\gamma})_{\#} p, (\mathrm{b}_{a,\gamma})_{\#} q) \le \gamma^{c/2}\mathrm{MMD}_\kappa(p, q)$. ∎

We are now ready to prove the main result of this section.

**Theorem 2** (Convergent MMD dynamic programming for the multi-return distribution function). Let $\kappa$ be a kernel induced by a semimetric $\rho$ on $[0, (1-\gamma)^{-1}R_{\max}]^d$ with strong negative type, satisfying

1. *Shift-invariance.* For any $z \in \mathbb{R}^d$, $\rho(z + y_1, z + y_2) = \rho(y_1, y_2)$.
2. *Homogeneity.* There exists $c > 0$ for which $\rho(\gamma y_1, \gamma y_2) = \gamma^c \rho(y_1, y_2)$ for any $\gamma \in [0, 1)$.

Consider the sequence $\{\eta_k\}_{k=1}^{\infty}$ given by $\eta_{k+1} = \mathcal{T}^\pi \eta_k$. Then $\eta_k \to \eta^\pi$ at a geometric rate of $\gamma^{c/2}$ in $\overline{\mathrm{MMD}}_\kappa$, as long as $\overline{\mathrm{MMD}}_\kappa(\eta_0, \eta^\pi) \le C < \infty$.

*Proof.* We begin by showing that the distributional Bellman operator $\mathcal{T}^\pi$ is contractive in $\overline{\mathrm{MMD}}_\kappa$. We have

$$\overline{\mathrm{MMD}}_\kappa(\mathcal{T}^\pi\eta_1, \mathcal{T}^\pi\eta_2) = \sup_{x \in \mathcal{X}} \mathrm{MMD}_\kappa\left((\mathcal{T}^\pi\eta_1)(x), (\mathcal{T}^\pi\eta_2)(x)\right) = \sup_{x \in \mathcal{X}} \mathrm{MMD}_\kappa\left( \mathbb{E}_{x' \sim P^\pi(\cdot \mid x)}\left[(\mathrm{b}_{r(x),\gamma})_{\#}\eta_1(x')\right],\ \mathbb{E}_{x' \sim P^\pi(\cdot \mid x)}\left[(\mathrm{b}_{r(x),\gamma})_{\#}\eta_2(x')\right] \right).$$

We apply Lemma 6 with $I = \mathcal{X}$ and $T = P^\pi(\cdot \mid x)$, yielding

$$\overline{\mathrm{MMD}}_\kappa(\mathcal{T}^\pi\eta_1, \mathcal{T}^\pi\eta_2) \le \sup_{x \in \mathcal{X}}\ \sup_{x' \in \mathcal{X}} \mathrm{MMD}_\kappa\left((\mathrm{b}_{r(x),\gamma})_{\#}\eta_1(x'), (\mathrm{b}_{r(x),\gamma})_{\#}\eta_2(x')\right).$$

Next, invoking Lemma 7 with the shift-invariance and $c$-homogeneity of $\rho$, we have

$$\overline{\mathrm{MMD}}_\kappa(\mathcal{T}^\pi\eta_1, \mathcal{T}^\pi\eta_2) \le \gamma^{c/2}\sup_{x \in \mathcal{X}}\ \sup_{x' \in \mathcal{X}} \mathrm{MMD}_\kappa(\eta_1(x'), \eta_2(x')) = \gamma^{c/2}\sup_{x \in \mathcal{X}}\mathrm{MMD}_\kappa(\eta_1(x), \eta_2(x)) = \gamma^{c/2}\,\overline{\mathrm{MMD}}_\kappa(\eta_1, \eta_2).$$
It follows that

$$\overline{\mathrm{MMD}}_\kappa(\eta_{k+1}, \eta^\pi) = \overline{\mathrm{MMD}}_\kappa(\mathcal{T}^\pi\eta_k, \mathcal{T}^\pi\eta^\pi) \le \gamma^{c/2}\,\overline{\mathrm{MMD}}_\kappa(\eta_k, \eta^\pi),$$

since $\eta^\pi$ is a fixed point of $\mathcal{T}^\pi$. Continuing, we see that

$$\overline{\mathrm{MMD}}_\kappa(\eta_k, \eta^\pi) \le \gamma^{kc/2}\,\overline{\mathrm{MMD}}_\kappa(\eta_0, \eta^\pi) \le \gamma^{kc/2}\, C = O(\gamma^{kc/2}) = o(1).$$

Since $\overline{\mathrm{MMD}}_\kappa$ is a metric on $\mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)^{\mathcal{X}}$ for any characteristic kernel $\kappa$, it follows that $\eta_k$ approaches $\eta^\pi$ at a geometric rate. ∎

### B.2 EWP Dynamic Programming: Section 4

In this section, we provide the proof of Theorem 3. To do so, we prove an abstract, general result regarding any projection mapping that approximates the argmin MMD projection given in Equation (4).

**Proposition 1** (Convergence of EWP Dynamic Programming; restated). Consider a kernel satisfying the hypotheses of Theorem 2, suppose rewards are globally bounded in each dimension in $[0, R_{\max}]$, and let $\{\Pi^{(k)}_{\kappa,m}\}_{k \ge 0}$ be a sequence of maps $\mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)^{\mathcal{X}} \to \mathcal{C}_{\mathrm{EWP},m}$ satisfying (6). Then the iterates $(\eta_k)_{k \ge 0}$ given by $\eta_{k+1} = \Pi^{(k)}_{\kappa,m}\mathcal{T}^\pi\eta_k$ with $\overline{\mathrm{MMD}}_\kappa(\eta_0, \eta^\pi) = D < \infty$ converge to a set $\eta^m_{\mathrm{EWP},\kappa} \subseteq B(\eta^\pi, (1 - \gamma^{c/2})^{-1}f(d, m))$ in $\overline{\mathrm{MMD}}_\kappa$, where $B$ denotes the closed ball in $\overline{\mathrm{MMD}}_\kappa$.

*Proof.* Let $\Delta_k = \overline{\mathrm{MMD}}_\kappa(\eta_k, \eta^\pi)$. Then we have

$$\Delta_k = \overline{\mathrm{MMD}}_\kappa(\Pi^{(k)}_{\kappa,m}\mathcal{T}^\pi\eta_{k-1}, \mathcal{T}^\pi\eta^\pi) \le \overline{\mathrm{MMD}}_\kappa(\Pi^{(k)}_{\kappa,m}\mathcal{T}^\pi\eta_{k-1}, \mathcal{T}^\pi\eta_{k-1}) + \overline{\mathrm{MMD}}_\kappa(\mathcal{T}^\pi\eta_{k-1}, \mathcal{T}^\pi\eta^\pi) \le f(d, m) + \gamma^{c/2}\,\overline{\mathrm{MMD}}_\kappa(\eta_{k-1}, \eta^\pi),$$

so $\Delta_k \le f(d, m) + \gamma^{c/2}\Delta_{k-1}$, where the first step invokes the identity that $\eta^\pi$ is the fixed point of $\mathcal{T}^\pi$ (which is well-defined by Theorem 2), the second step leverages the triangle inequality, and the third step follows by the definition of $f(d, m)$ and the contractivity of $\mathcal{T}^\pi$ established in Theorem 2. Unrolling the recurrence above, we have

$$\overline{\mathrm{MMD}}_\kappa(\eta_k, \eta^\pi) = \Delta_k \le \gamma^{ck/2}\Delta_0 + \frac{f(d, m)}{1 - \gamma^{c/2}} \le \gamma^{ck/2} D + \frac{f(d, m)}{1 - \gamma^{c/2}}.$$

As such, as $k \to \infty$, the iterates $\eta_k$ approach $B\left(\eta^\pi, \frac{f(d, m)}{1 - \gamma^{c/2}}\right)$, proving our claim. ∎

Proposition 1, despite its simplicity, reveals an interesting fact: given a good enough approximate MMD projection $\Pi_{\kappa,m}$, in the sense that $f(d, m)$ decays quickly with $m$, the dynamic programming iterates $(\eta_k)_{k \ge 0}$ will eventually be contained in an (arbitrarily) small neighborhood of $\eta^\pi$. The next result provides an example consequence of this abstract result, and establishes that $m = \mathrm{poly}(d)$ atoms are enough for convergence to an arbitrarily small set with projected distributional dynamic programming under the EWP representation. Finally, we can now prove Theorem 3, which we restate below for convenience.

**Theorem 3** (restated). Consider a kernel $\kappa$ induced by the semimetric $\rho(x, y) = \|x - y\|_2^\alpha$ with $\alpha \in (0, 2)$, and suppose rewards are bounded in each dimension within $[0, R_{\max}]$. For any $\eta_0$ such that $\overline{\mathrm{MMD}}_\kappa(\eta_0, \eta^\pi) \le D < \infty$, and any $\delta > 0$, for the sequence $(\eta_k)_{k \ge 0}$ defined in Equation (5), with probability at least $1 - \delta$ we have

$$\overline{\mathrm{MMD}}_\kappa(\eta_K, \eta^\pi) \le \widetilde{O}\left( \frac{d^{\alpha/2} R^\alpha_{\max}}{(1 - \gamma^{\alpha/2})(1 - \gamma)^\alpha} \sqrt{\frac{\log(|\mathcal{X}|\delta^{-1})}{m}} \right),$$

where $\eta_{k+1} = \mathrm{BootProj}^\pi_{\kappa,m}\eta_k$ and $K = \lceil \log m / \log \gamma^{-\alpha} \rceil$, and where $\widetilde{O}$ omits logarithmic factors in $m$.

*Proof.* For each $x \in \mathcal{X}$ and $k \in [K]$, denote by $E_{x,k}$ the event given by

$$E_{x,k} = \left\{ \mathrm{MMD}_\kappa\left(\mathrm{BootProj}^\pi_{\kappa,m}\eta_k(x), \mathcal{T}^\pi\eta_k(x)\right) \le \frac{2 d^{\alpha/2} R^\alpha_{\max}}{(1-\gamma)^\alpha\sqrt{m}} + \frac{4 d^{\alpha/2} R^\alpha_{\max}}{(1-\gamma)^\alpha}\sqrt{\frac{\log \delta'^{-1}}{m}} =: f(d, m; \delta') \right\},$$

for $\delta' > 0$ a constant to be chosen shortly. Moreover, with $E = \bigcap_{(x,k) \in \mathcal{X} \times [K]} E_{x,k}$, it holds that under $E$, $\overline{\mathrm{MMD}}_\kappa(\mathrm{BootProj}^\pi_{\kappa,m}\eta_k, \mathcal{T}^\pi\eta_k) \le f(d, m; \delta')$ for all $k \in [K]$. Following the proof of Proposition 1, we have that, conditioned on $E$,

$$\overline{\mathrm{MMD}}_\kappa(\eta_K, \eta^\pi) \le \gamma^{\alpha K/2} D + \frac{f(d, m; \delta')}{1 - \gamma^{\alpha/2}} \le \frac{2 d^{\alpha/2} R^\alpha_{\max}}{(1-\gamma)^\alpha\sqrt{m}} + \frac{f(d, m; \delta')}{1 - \gamma^{\alpha/2}},$$

by the choice of $K$. Now, by [TSM17, Proposition A.1], each event $E_{x,k}$ holds with probability at least $1 - \delta'$, since each $(\mathrm{BootProj}^\pi_{\kappa,m}\eta_k)(x)$ is generated independently by sampling $m$ independent draws from the distribution $\mathcal{T}^\pi\eta_k(x)$. Therefore, event $E$ holds with probability at least $1 - |\mathcal{X}|K\delta'$.
Finally, we can now prove Theorem 3, which we restate below for convenience.

Theorem 3. Consider a kernel $\kappa$ induced by the semimetric $\rho(x, y) = \|x - y\|_2^\alpha$ with $\alpha \in (0, 2)$, and suppose rewards are bounded in each dimension within $[0, R_{\max}]$. For any $\eta_0$ such that $\mathrm{MMD}_\kappa(\eta_0, \eta^\pi) \le D < \infty$, and any $\delta > 0$, for the sequence $(\eta_k)_{k\ge 0}$ defined in Equation (5), with probability at least $1 - \delta$ we have

$$\mathrm{MMD}_\kappa(\eta_K, \eta^\pi) \le \widetilde O\!\left(\frac{d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma^{\alpha/2})(1-\gamma)^\alpha\sqrt m}\sqrt{\log(|X|\delta^{-1})}\right),$$

where $\eta_{k+1} = \mathrm{BootProj}^\pi_{\kappa,m}\eta_k$, $K = \lceil\log m/\log\gamma^{-\alpha}\rceil$, and $\widetilde O$ omits logarithmic factors in $m$.

Proof. For each $x \in X$ and $k \in [K]$, denote by $E_{x,k}$ the event given by

$$E_{x,k} = \left\{\mathrm{MMD}_\kappa\!\left(\mathrm{BootProj}^\pi_{\kappa,m}\eta_k(x),\ T^\pi\eta_k(x)\right) \le \frac{2d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma)^\alpha\sqrt m} + \frac{4d^{\alpha/2}R^\alpha_{\max}\sqrt{\log\delta'^{-1}}}{(1-\gamma)^\alpha\sqrt m} =: f(d, m; \delta')\right\},$$

for $\delta' > 0$ a constant to be chosen shortly. Moreover, with $E = \bigcap_{(x,k)\in X\times[K]}E_{x,k}$, it holds that under $E$, $\mathrm{MMD}_\kappa(\mathrm{BootProj}^\pi_{\kappa,m}\eta_k, T^\pi\eta_k) \le f(d, m; \delta')$ for all $k \in [K]$. Following the proof of Proposition 1, we have that, conditioned on $E$,

$$\mathrm{MMD}_\kappa(\eta_k, \eta^\pi) \le \gamma^{\alpha k/2}D + \frac{f(d, m; \delta')}{1-\gamma^{\alpha/2}}.$$

Now, by [TSM17, Proposition A.1], event $E_{x,k}$ holds with probability at least $1 - \delta'$, since each $(\mathrm{BootProj}^\pi_{\kappa,m}\eta_k)(x)$ is generated independently by sampling $m$ independent draws from the distribution $T^\pi\eta_k$. Therefore, event $E$ holds with probability at least $1 - |X|K\delta'$. Choosing $\delta' = \delta/(|X|K)$, and noting that with this choice of $K$ the term $\gamma^{\alpha K/2}D \le D/\sqrt m$ is absorbed into the leading term, we have that, with probability at least $1 - \delta$,

$$\mathrm{MMD}_\kappa(\eta_K, \eta^\pi) \le \frac{2d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma)^\alpha\sqrt m} + \frac{2d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma^{\alpha/2})(1-\gamma)^\alpha\sqrt m}\left(1 + 2\sqrt{\log(|X|K\delta^{-1})}\right).$$

As such, there exist universal constants $C_0, C_1 \in \mathbb{R}_+$ such that

$$\mathrm{MMD}_\kappa(\eta_K, \eta^\pi) \le C_1\,\frac{d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma^{\alpha/2})(1-\gamma)^\alpha\sqrt m}\sqrt{\log(|X|K\delta^{-1})} \le C_0\,\frac{d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma^{\alpha/2})(1-\gamma)^\alpha\sqrt m}\sqrt{\log|X| + \log\frac{\log m}{\log\gamma^{-\alpha}} + \log\delta^{-1}}. \tag{14}$$

Corollary 1. For any kernel $\kappa$ satisfying the hypotheses of Theorem 3, and for any $\eta_0 \in \mathcal{C}_{\mathrm{EWP},m}$ for which $\mathrm{MMD}_\kappa(\eta_0, \eta^\pi) \le D < \infty$, the iterates $\eta_{k+1} = \Pi^m_{\mathrm{EWP},\kappa}T^\pi\eta_k$ converge to a set $\eta^m_{\mathrm{EWP},\kappa} \subseteq \mathcal{C}_{\mathrm{EWP},m}$, where

$$\eta^m_{\mathrm{EWP},\kappa} \subseteq B\!\left(\eta^\pi,\ \frac{6d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma^{\alpha/2})(1-\gamma)^\alpha\sqrt m}\right).$$

Proof. Proposition 1 shows that projected EWP dynamic programming converges to a set with radius controlled by the quantity $f(d, m)$ that upper bounds the distance between $\eta_k(x)$ and $\Pi^{(k)}_{\kappa,m}\eta_k(x)$ at the worst state $x$. In the proof of Theorem 3, we saw that with nonzero probability, the randomized projections satisfy $f(d, m) \le \frac{6d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma)^\alpha\sqrt m}$. Thus, there exists a projection satisfying this bound. Since the projection $\Pi^m_{\mathrm{EWP},\kappa}$ is, by definition, the projection with the smallest possible $f(d, m)$, the claim follows directly by Proposition 1.

B.3 Categorical Dynamic Programming: Section 5

Lemma 1. Let $\kappa$ be a kernel induced by a semimetric $\rho$ on $[0, (1-\gamma)^{-1}R_{\max}]^d$ with strong negative type (cf. Theorem 1). Then $\Pi^R_{C,\kappa}$ is well-defined, $\mathrm{Ran}(\Pi^R_{C,\kappa}) \subseteq \mathcal{C}_R$, and $\Pi^R_{C,\kappa}|_{\mathcal{C}_R} = \mathrm{id}$.

Proof. Firstly, note that for each $x \in X$, the set of probability distributions supported on $R(x)$ is a bounded, closed subset of a finite-dimensional space. Thus, it is compact, and by the continuity of the MMD, the infimum in (7) is attained. Following the technique of [SZS+08], we establish that $\Pi^R_{C,\kappa}$ can be computed as the solution to a particular quadratic program with convex constraints. Let $K \in \mathbb{R}^{N(x)\times N(x)}$ denote the matrix with entries $K_{i,j} = \kappa(\xi(x)_i, \xi(x)_j)$. Since $\kappa$ is a positive definite kernel when $\kappa$ is characteristic [GBR+12], it follows that $K$ is a positive definite matrix. Then, for any $\varrho \in \mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)$, we have

$$\operatorname*{arginf}_{p}\,\mathrm{MMD}_\kappa(p, \varrho) = \operatorname*{arginf}_{p}\,\mathrm{MMD}^2_\kappa(p, \varrho) = \operatorname*{arginf}_{p}\left\{\sum_{i,j=1}^{N(x)}p_ip_j\kappa(\xi(x)_i, \xi(x)_j) - 2\sum_{i=1}^{N(x)}p_i\underbrace{\int\kappa(\xi(x)_i, y)\,\varrho(\mathrm{d}y)}_{=:\,b_i} + M(\kappa, R, \varrho)\right\} = \operatorname*{arginf}_{p}\left\{p^\top Kp - 2p^\top b\right\},$$

where the arginf ranges over probability vectors $p$ on $R(x)$, $b \in \mathbb{R}^{N(x)}$, and $M(\kappa, R, \varrho)$ is independent of $p$, so it does not influence the minimization. Now, since $K$ is positive definite (by virtue of $\kappa$ being characteristic) and the probability simplex over $R(x)$ is a closed convex subset of $\mathbb{R}^{N(x)}$, it is well known that there is a unique optimum, and the infimum above is attained. Therefore, $\Pi^R_{C,\kappa}$ is indeed well-defined, and its range is contained in $\mathcal{C}_R$, confirming the first two claims. Finally, since $\Pi^R_{C,\kappa}$ is well-defined and since $\mathrm{MMD}_\kappa$ is nonnegative and separates points, $\Pi^R_{C,\kappa}$ must map distributions supported on $R(x)$ to themselves; this is because $\mathrm{MMD}_\kappa(p, p) = 0$ for the kernels we consider.
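The quadratic program above is small and can be handed to any off-the-shelf convex solver. A minimal sketch under the energy-distance kernel with reference point $0$; we use SciPy's SLSQP for clarity, whereas the paper's experiments used a dedicated QP solver via JAXopt.

```python
import numpy as np
from scipy.optimize import minimize

def energy_kernel(u, v, alpha=1.0):
    """kappa(x, y) = (rho(x, 0) + rho(y, 0) - rho(x, y)) / 2, rho = ||.||_2**alpha."""
    n = lambda a: np.linalg.norm(a, axis=-1) ** alpha
    return 0.5 * (n(u)[:, None] + n(v)[None, :] - n(u[:, None, :] - v[None, :, :]))

def categorical_projection(atoms, samples, alpha=1.0):
    """Lemma 1's QP: project the empirical measure on `samples` onto
    probability vectors over `atoms`, minimizing p^T K p - 2 p^T b."""
    K = energy_kernel(atoms, atoms, alpha)
    b = energy_kernel(atoms, samples, alpha).mean(axis=1)   # b_i = E_varrho kappa(z_i, .)
    res = minimize(lambda p: p @ K @ p - 2 * p @ b,
                   np.full(len(atoms), 1 / len(atoms)),
                   jac=lambda p: 2 * (K @ p - b),
                   method="SLSQP", bounds=[(0, None)] * len(atoms),
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])
    return res.x
```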
Lemma 2. Under the conditions of Lemma 1, $\Pi^R_{C,\kappa}$ is a nonexpansion in $\mathrm{MMD}_\kappa$. That is, for any $\eta_1, \eta_2 \in \mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)^X$, we have $\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta_1, \Pi^R_{C,\kappa}\eta_2) \le \mathrm{MMD}_\kappa(\eta_1, \eta_2)$.

Proof. Fix any $x \in X$ and denote $M(x) = \{\mu_p \in \mathcal{H} : p \in \mathscr{P}(R(x))\}$, where $\mathcal{H}$ is the RKHS induced by the kernel $\kappa$ and $\mu_p$ denotes the mean embedding of $p$ in this RKHS. It is simple to verify that $p \mapsto \mu_p$ is linear: for any $p, q \in \mathscr{P}(\mathbb{R}^d)$ and $\alpha, \beta \in \mathbb{R}$, for all $f \in \mathcal{H}$ with $\|f\| = 1$ we have

$$\langle f, \mu_{\alpha p + \beta q}\rangle = \int f(y)\,[\alpha p + \beta q](\mathrm{d}y) = \alpha\int f(y)\,p(\mathrm{d}y) + \beta\int f(y)\,q(\mathrm{d}y) = \langle f, \alpha\mu_p + \beta\mu_q\rangle,$$

which implies that $\mu_{\alpha p + \beta q} = \alpha\mu_p + \beta\mu_q$. As a consequence, $M(x)$ inherits convexity from $\mathscr{P}(R(x))$. We claim that $M(x)$ is closed as a subset of $\mathcal{H}$. Since $p \mapsto \mu_p$ is an injection [GBR+12] and, by Lemma 1, there is a unique $q \in \mathscr{P}(R(x))$ minimizing $\mathrm{MMD}_\kappa(p, q)$, there is a unique $\mu_q \in M(x)$ attaining the infimum of $\|\mu_p - \cdot\|$ over $M(x)$. Let $\mu \in \mathcal{H}\setminus M(x)$. Then there exists $\mu_q \in M(x)$ such that $\|\mu_q - \mu\| = \inf_{\nu\in M(x)}\|\mu - \nu\|$, and since $\mu_q \ne \mu$, it follows that $\inf_{\nu\in M(x)}\|\nu - \mu\| = \epsilon > 0$. Since this is true for any $\mu \in \mathcal{H}\setminus M(x)$, it follows that $\mathcal{H}\setminus M(x)$ is open, so $M(x)$ is closed.

We will now show that $\eta(x) \mapsto \Pi^R_{C,\kappa}\eta(x)$ is a nonexpansion in $\mathcal{H}$. Let $\eta_1, \eta_2 \in \mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)^X$, and denote by $\mu_1(x), \mu_2(x)$ the mean embeddings of $\eta_1(x), \eta_2(x)$. We slightly abuse notation and write $\Pi^R_{C,\kappa}\mu_i(x)$ to denote the mean embedding of $\Pi^R_{C,\kappa}\eta_i(x)$. Since $M(x)$ is convex, for any $\iota(x) \in M(x)$ and $\lambda \in [0, 1]$ we have

$$\mathrm{MMD}_\kappa(\eta_1(x), \Pi^R_{C,\kappa}\eta_1(x))^2 = \|\mu_1(x) - \Pi^R_{C,\kappa}\mu_1(x)\|^2 \le \|\mu_1(x) - (\lambda\iota(x) + (1-\lambda)\Pi^R_{C,\kappa}\mu_1(x))\|^2 = \|\mu_1(x) - \Pi^R_{C,\kappa}\mu_1(x) - \lambda(\iota(x) - \Pi^R_{C,\kappa}\mu_1(x))\|^2.$$

Now, by expanding the squared norms and taking $\lambda \to 0$, since $M(x)$ is closed we have

$$\langle\mu_1(x) - \Pi^R_{C,\kappa}\mu_1(x),\ \iota_1(x) - \Pi^R_{C,\kappa}\mu_1(x)\rangle \le 0 \qquad \forall\,\iota_1(x) \in M(x),$$
$$\langle\Pi^R_{C,\kappa}\mu_2(x) - \mu_2(x),\ \Pi^R_{C,\kappa}\mu_2(x) - \iota_2(x)\rangle \le 0 \qquad \forall\,\iota_2(x) \in M(x),$$

where the second inequality follows by applying the same logic to $\mu_2(x)$. Choosing $\iota_1(x) = \Pi^R_{C,\kappa}\mu_2(x)$ and $\iota_2(x) = \Pi^R_{C,\kappa}\mu_1(x) \in M(x)$ and adding these inequalities yields

$$\langle\mu_1(x) - \mu_2(x) + (\Pi^R_{C,\kappa}\mu_2(x) - \Pi^R_{C,\kappa}\mu_1(x)),\ \Pi^R_{C,\kappa}\mu_2(x) - \Pi^R_{C,\kappa}\mu_1(x)\rangle \le 0.$$

Expanding, we see that

$$\begin{aligned}
\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta_1(x), \Pi^R_{C,\kappa}\eta_2(x))^2 &= \|\Pi^R_{C,\kappa}\mu_2(x) - \Pi^R_{C,\kappa}\mu_1(x)\|^2 \le \langle\mu_2(x) - \mu_1(x),\ \Pi^R_{C,\kappa}\mu_2(x) - \Pi^R_{C,\kappa}\mu_1(x)\rangle \\
&\le \|\mu_2(x) - \mu_1(x)\|\,\|\Pi^R_{C,\kappa}\mu_2(x) - \Pi^R_{C,\kappa}\mu_1(x)\| = \mathrm{MMD}_\kappa(\eta_1(x), \eta_2(x))\,\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta_1(x), \Pi^R_{C,\kappa}\eta_2(x)),
\end{aligned}$$

so that $\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta_1(x), \Pi^R_{C,\kappa}\eta_2(x)) \le \mathrm{MMD}_\kappa(\eta_1(x), \eta_2(x))$, confirming that $\eta(x) \mapsto \Pi^R_{C,\kappa}\eta(x)$ is a nonexpansion. It follows that

$$\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta_1, \Pi^R_{C,\kappa}\eta_2) = \sup_{x\in X}\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta_1(x), \Pi^R_{C,\kappa}\eta_2(x)) \le \sup_{x\in X}\mathrm{MMD}_\kappa(\eta_1(x), \eta_2(x)) = \mathrm{MMD}_\kappa(\eta_1, \eta_2).$$

Corollary 2. Let $\kappa$ be a kernel satisfying the conditions of Theorem 2. Then for any $\eta_0 \in \mathcal{C}_R$, the iterates $\{\eta_k\}_{k=1}^\infty$ given by $\eta_{k+1} = \Pi^R_{C,\kappa}T^\pi\eta_k$ converge geometrically to a unique fixed point.

Proof. Combining Theorem 2 and Lemma 2, we see that

$$\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}T^\pi\eta_1, \Pi^R_{C,\kappa}T^\pi\eta_2) \le \mathrm{MMD}_\kappa(T^\pi\eta_1, T^\pi\eta_2) \le \gamma^{c/2}\,\mathrm{MMD}_\kappa(\eta_1, \eta_2)$$

for some $c > 0$. Thus, $\Pi^R_{C,\kappa}T^\pi$ is a contraction on $(\mathcal{C}_R, \mathrm{MMD}_\kappa)$. If $\mathcal{H}$ is the RKHS induced by $\kappa$, we showed in Lemma 2 that $\mathcal{C}_R$ is isometric to a product of closed, convex subsets of $\mathcal{H}$. Therefore, by the completeness of $\mathcal{H}$, $\mathcal{C}_R$ is isometric to a complete subspace, and consequently $\mathcal{C}_R$ is complete under the metric $\mathrm{MMD}_\kappa$. Invoking the Banach fixed-point theorem, it follows that $\Pi^R_{C,\kappa}T^\pi$ has a unique fixed point $\eta^\pi_{C,\kappa}$, and $\eta_k \to \eta^\pi_{C,\kappa}$ geometrically.
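To illustrate Corollary 2, the following sketch runs projected categorical dynamic programming on a hypothetical 2-state MDP and tracks the MMD gap between successive iterates, which decays geometrically. It reuses `energy_kernel` from the previous sketch; the MDP, grid, and constants are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def project(K, b, n):
    """Categorical projection as the QP of Lemma 1 (shared support)."""
    res = minimize(lambda p: p @ K @ p - 2 * p @ b, np.full(n, 1 / n),
                   jac=lambda p: 2 * (K @ p - b),
                   method="SLSQP", bounds=[(0, None)] * n,
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])
    return res.x

gamma = 0.8
P = np.array([[0.9, 0.1], [0.2, 0.8]])             # policy-conditioned transitions
r = np.array([[0.0, 1.0], [1.0, 0.0]])             # 2-dimensional cumulants
g = np.linspace(0, 1 / (1 - gamma), 5)
Z = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)  # shared grid support R(x)
K = energy_kernel(Z, Z)
probs = np.full((2, len(Z)), 1 / len(Z))            # eta_0 uniform on the grid

for k in range(25):
    mix = P @ probs                                 # target weights, mixed over X'
    new = np.empty_like(probs)
    for x in range(2):
        b = energy_kernel(Z, r[x] + gamma * Z) @ mix[x]   # embedding of (T eta)(x)
        new[x] = project(K, b, len(Z))
    d = probs - new
    gap = max(np.sqrt(max(d[x] @ K @ d[x], 0.0)) for x in range(2))
    probs = new
print("final successive-iterate MMD gap:", gap)
```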
B.3.1 Quality of the Categorical Fixed Point

As we saw in our analysis of multivariate DRL with EWP representations, the distance between a distribution and its projection (as a function of $m, d$) plays a major role in controlling the approximation error in projected distributional dynamic programming. Before proving the main results of this section, we begin by analyzing this quantity by reducing it to the largest distance between points among certain partitions of the space of returns.

Lemma 3. Let $\kappa$ be a kernel satisfying the conditions of Lemma 1, and for any finite $R \subset \mathbb{R}^d$, define $\Pi : \mathscr{P}(\mathbb{R}^d) \to \mathscr{P}(R)$ via $\Pi p = \operatorname*{arginf}_{q\in\mathscr{P}(R)}\mathrm{MMD}_\kappa(p, q)$. Then

$$\mathrm{MMD}^2_\kappa(\Pi p, p) \le \inf_{P\in\mathcal{P}_R}\mathrm{mesh}(P; \kappa).$$

Proof. Our proof proceeds by establishing approximation bounds of Riemann sums to the Bochner integral $\mu_p$, similar to [VN02]. Let $P \in \mathcal{P}_R$. Abusing notation to denote by $\Pi p_i$ the probability of the $i$th atom of the discrete support under $\Pi p$, we have

$$\mathrm{MMD}^2_\kappa(p, \Pi p) = \|\mu_{\Pi p} - \mu_p\|^2 = \left\|\int\kappa(\cdot, y)\,\Pi p(\mathrm{d}y) - \int\kappa(\cdot, y)\,p(\mathrm{d}y)\right\|^2 = \left\|\sum_i\kappa(\cdot, z_i)\Pi p_i - \int\kappa(\cdot, y)\,p(\mathrm{d}y)\right\|^2,$$

where $R = \{z_i\}_{i=1}^n$ for some $n \in \mathbb{N}$. Since $\Pi p$ optimizes the MMD over all probability vectors on $R$, for the vector $q$ with $q_i = p(\theta_i)$, where $\theta_i$ is the $i$th element of $P$, we have

$$\begin{aligned}
\mathrm{MMD}^2_\kappa(p, \Pi p) &\le \left\|\sum_i\kappa(\cdot, z_i)p(\theta_i) - \int\kappa(\cdot, y)\,p(\mathrm{d}y)\right\|^2 = \left\|\sum_i\int_{\theta_i}(\kappa(\cdot, z_i) - \kappa(\cdot, y))\,p(\mathrm{d}y)\right\|^2 \\
&\le \left(\sum_i\sup_{y_1,y_2\in\theta_i}\|\kappa(\cdot, y_1) - \kappa(\cdot, y_2)\|\,p(\theta_i)\right)^2 \le \sup_{\theta\in P}\sup_{y_1,y_2\in\theta}\|\kappa(\cdot, y_1) - \kappa(\cdot, y_2)\|^2.
\end{aligned}$$

It was shown by [SSGF13] that $z \mapsto \kappa(\cdot, z)$ is an isometry from $(\mathbb{R}^d, \rho^{1/2})$ to $\mathcal{H}$, where $\mathcal{H}$ is the RKHS induced by $\kappa$. Thus, we have

$$\mathrm{MMD}^2_\kappa(p, \Pi p) \le \sup_{\theta\in P}\sup_{y_1,y_2\in\theta}\rho(y_1, y_2) = \mathrm{mesh}(P; \kappa).$$

Since this is true for any partition $P \in \mathcal{P}_R$, the claim follows by taking the infimum over $\mathcal{P}_R$.

We now move on to the main results.

Theorem 4. Let $\kappa$ be a kernel induced by a $c$-homogeneous and shift-invariant semimetric $\rho$ conforming to the conditions of Theorem 2. Then the fixed point $\eta^\pi_{C,\kappa}$ of $\Pi^R_{C,\kappa}T^\pi$ satisfies

$$\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) \le \frac{1}{1-\gamma^{c/2}}\sup_{x\in X}\inf_{P\in\mathcal{P}_{R(x)}}\sqrt{\mathrm{mesh}(P; \kappa)}. \tag{9}$$

Proof. The proof begins in a similar manner to [RBD+18, Proposition 3]. Given that $\Pi^R_{C,\kappa}$ is a nonexpansion as shown in Lemma 2, we have

$$\begin{aligned}
\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) &\le \mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \Pi^R_{C,\kappa}\eta^\pi) + \mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi) \\
&\stackrel{(a)}{=} \mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}T^\pi\eta^\pi_{C,\kappa}, \Pi^R_{C,\kappa}T^\pi\eta^\pi) + \mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi) \\
&\stackrel{(b)}{\le} \mathrm{MMD}_\kappa(T^\pi\eta^\pi_{C,\kappa}, T^\pi\eta^\pi) + \mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi) \\
&\stackrel{(c)}{\le} \gamma^{c/2}\,\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) + \mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi),
\end{aligned}$$

whence $\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) \le \frac{1}{1-\gamma^{c/2}}\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi)$. Here, (a) leverages the fact that $\eta^\pi_{C,\kappa}$ is the fixed point of $\Pi^R_{C,\kappa}T^\pi$ and that $\eta^\pi$ is the fixed point of $T^\pi$, (b) follows since $\Pi^R_{C,\kappa}$ is a nonexpansion by Lemma 2, and (c) follows by the contractivity of $T^\pi$ established in Theorem 2. Finally, by Lemma 3, we have

$$\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) \le \frac{1}{1-\gamma^{c/2}}\sup_{x\in X}\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi(x), \eta^\pi(x)) \le \frac{1}{1-\gamma^{c/2}}\sup_{x\in X}\inf_{P\in\mathcal{P}_{R(x)}}\sqrt{\mathrm{mesh}(P; \kappa)}.$$

Finally, we explicitly derive a convergence rate for a particular support map under the energy-distance kernels.

Corollary 3. Let $R(x) = U_m$, where $U_m$ is a set of $m$ uniformly-spaced support points on $[0, (1-\gamma)^{-1}R_{\max}]^d$. For $\kappa$ induced by the semimetric $\rho(x, y) = \|x - y\|_2^\alpha$ with $\alpha \in (0, 2)$,

$$\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) \le \frac{1}{(1-\gamma^{\alpha/2})(1-\gamma)^{\alpha/2}}\cdot\frac{d^{\alpha/4}R^{\alpha/2}_{\max}}{(m^{1/d}-2)^{\alpha/2}}.$$

Proof. We begin by bounding $\mathrm{mesh}(P; \kappa)$. Assume $m = n^d$ for some $n \in \mathbb{N}$. We consider a partition $P \in \mathcal{P}_{U_m}$ consisting of $d$-dimensional hypercubes with side length $(1-\gamma)^{-1}R_{\max}/(n-1)$. By definition of $U_m$, it is clear that these hypercubes cover $[0, (1-\gamma)^{-1}R_{\max}]^d$ and each contain exactly one support point. Now, for each $\theta \in P$, we have

$$\sup_{y_1,y_2\in\theta}\rho(y_1, y_2) \le \left\|y - \left(y + \frac{R_{\max}}{(1-\gamma)(n-1)}\mathbf{1}\right)\right\|_2^\alpha = (1-\gamma)^{-\alpha}\left(\sum_{i=1}^d\frac{R^2_{\max}}{(n-1)^2}\right)^{\alpha/2} = \frac{d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma)^\alpha(n-1)^\alpha},$$

where $\mathbf{1}$ is the vector of all ones, and $y$ is any element of $\theta$. Since this bound holds for any $\theta \in P$, invoking Theorem 4 yields

$$\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) \le \frac{1}{1-\gamma^{\alpha/2}}\sup_{x\in X}\inf_{P\in\mathcal{P}_{U_m}}\sqrt{\mathrm{mesh}(P; \kappa)} \le \frac{1}{1-\gamma^{\alpha/2}}\sqrt{\frac{d^{\alpha/2}R^\alpha_{\max}}{(1-\gamma)^\alpha(n-1)^\alpha}} = \frac{1}{(1-\gamma^{\alpha/2})(1-\gamma)^{\alpha/2}}\cdot\frac{d^{\alpha/4}R^{\alpha/2}_{\max}}{(m^{1/d}-1)^{\alpha/2}}.$$

If instead $m \in ((n-1)^d, n^d)$, we omit all but $(n-1)^d$ of the support points, and achieve

$$\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) \le \frac{1}{(1-\gamma^{\alpha/2})(1-\gamma)^{\alpha/2}}\cdot\frac{d^{\alpha/4}R^{\alpha/2}_{\max}}{(\lfloor m^{1/d}\rfloor-1)^{\alpha/2}}.$$

Alternatively, since $\lfloor m^{1/d}\rfloor - 1 \ge m^{1/d} - 2$, we may write

$$\mathrm{MMD}_\kappa(\eta^\pi_{C,\kappa}, \eta^\pi) \le \frac{1}{(1-\gamma^{\alpha/2})(1-\gamma)^{\alpha/2}}\cdot\frac{d^{\alpha/4}R^{\alpha/2}_{\max}}{(m^{1/d}-2)^{\alpha/2}}.$$
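The dependence on $m$ and $d$ in Corollary 3 can be tabulated directly. The sketch below evaluates the bound for a few reward dimensions (the formula is exactly the corollary's; the parameter choices are illustrative), making visible how many atoms a uniform grid needs as $d$ grows.

```python
import numpy as np

def fixed_point_bound(m, d, gamma, alpha=1.0, r_max=1.0):
    """Corollary 3's bound on MMD(eta^pi_C, eta^pi) for a uniform grid of m
    atoms in d dimensions (requires m**(1/d) > 2)."""
    return (d ** (alpha / 4) * r_max ** (alpha / 2)) / (
        (1 - gamma ** (alpha / 2)) * (1 - gamma) ** (alpha / 2)
        * (m ** (1 / d) - 2) ** (alpha / 2))

for d in (1, 2, 4):
    for m in (10 ** 2, 10 ** 4, 10 ** 6):
        if m ** (1 / d) > 3:   # stay well inside the bound's regime
            print(f"d={d}  m={m:>8}  bound={fixed_point_bound(m, d, gamma=0.9):.3f}")
```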
B.4 Categorical TD Learning: Section 6

In this section, we prove results leading up to and including the convergence of the categorical TD-learning algorithm over mass-1 signed measures. First, in Section B.4.1, we show that $\mathrm{MMD}_\kappa$ is in fact a metric on the space of mass-1 signed measures, and establish that the multivariate distributional Bellman operator is contractive under these distribution representations. Subsequently, in Section B.4.2, we analyze the temporal difference learning algorithm, leveraging the results from Section B.4.1.

B.4.1 The Signed Measure Relaxation

We begin by establishing that $\mathrm{MMD}_\kappa$ is a metric on $\mathscr{M}_1(\mathcal{Y})$ for spaces $\mathcal{Y}$ under which $\mathrm{MMD}_\kappa$ is a metric on $\mathscr{P}(\mathcal{Y})$.

Lemma 4. Let $\kappa : \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ be a characteristic kernel over some space $\mathcal{Y}$. Then $\mathrm{MMD}_\kappa : \mathscr{M}_1(\mathcal{Y})\times\mathscr{M}_1(\mathcal{Y})\to\mathbb{R}_+$ given by $(p, q)\mapsto\|\mu_p - \mu_q\|_{\mathcal{H}}$ defines a metric on $\mathscr{M}_1(\mathcal{Y})$, where $\mu_p$ denotes the usual mean embedding of $p$ and $\mathcal{H}$ is the RKHS with kernel $\kappa$.

Proof. Naturally, since $\mathrm{MMD}_\kappa$ is given by a norm, it is non-negative, symmetric, and satisfies the triangle inequality. We must show that $\mathrm{MMD}_\kappa(p, q) = 0 \iff p = q$. Firstly, it is clear that $\mathrm{MMD}_\kappa(p, p) = 0$ by the positive homogeneity of the norm. It remains to show that $\mathrm{MMD}_\kappa(p, q) = 0 \implies p = q$. Let $p \ne q \in \mathscr{M}_1(\mathcal{Y})$, and, for the sake of contradiction, assume that $\mathrm{MMD}_\kappa(p, q) = 0$. We will show that this implies that $\mathrm{MMD}_\kappa(P, Q) = 0$ for a pair of distinct probability measures, which is a contradiction, since $\mathrm{MMD}_\kappa$ with characteristic $\kappa$ is known to be a metric on $\mathscr{P}(\mathcal{Y})$.

By the Hahn decomposition theorem, we may write $p = p^+ - p^-$ for non-negative measures $p^+, p^-$. Since $p$ has total mass 1, for some $a \in \mathbb{R}_+$ we may express $p = (a+1)\bar p^+ - a\bar p^-$, where $\bar p^+, \bar p^- \in \mathscr{P}(\mathcal{Y})$. Likewise, there exist $b \in \mathbb{R}_+$ and probability measures $\bar q^+, \bar q^-$ for which $q = (b+1)\bar q^+ - b\bar q^-$. Since $\mathrm{MMD}_\kappa(p, q) = 0$ by hypothesis, and by linearity of $p\mapsto\mu_p$, we have

$$\begin{aligned}
0 = \|\mu_p - \mu_q\|_{\mathcal{H}} &= \left\|(a+1)\mu_{\bar p^+} - a\mu_{\bar p^-} + b\mu_{\bar q^-} - (b+1)\mu_{\bar q^+}\right\|_{\mathcal{H}} \\
&= (a+b+1)\left\|\lambda\mu_{\bar p^+} + (1-\lambda)\mu_{\bar q^-} - \left(\lambda'\mu_{\bar p^-} + (1-\lambda')\mu_{\bar q^+}\right)\right\|_{\mathcal{H}}, \qquad \lambda = \frac{a+1}{a+b+1},\ \lambda' = \frac{a}{a+b+1} \\
&=: (a+b+1)\,\|\mu_P - \mu_Q\|_{\mathcal{H}},
\end{aligned}$$

where $P = \lambda\bar p^+ + (1-\lambda)\bar q^-$ and $Q = \lambda'\bar p^- + (1-\lambda')\bar q^+$ are convex combinations of probability measures, and are therefore probability measures themselves. Since $\kappa$ is characteristic, $\mathrm{MMD}_\kappa(P, Q) = 0$ forces $P = Q$, that is, $\lambda\bar p^+ - \lambda'\bar p^- = (1-\lambda')\bar q^+ - (1-\lambda)\bar q^-$; multiplying through by $(a+b+1)$ gives $(a+1)\bar p^+ - a\bar p^- = (b+1)\bar q^+ - b\bar q^-$, i.e., $p = q$, which contradicts our hypothesis. Therefore, $\mathrm{MMD}_\kappa(p, q) = 0 \iff p = q$ for any $p, q \in \mathscr{M}_1(\mathcal{Y})$, and it follows that $\mathrm{MMD}_\kappa$ is a metric.

Next, we show that the distributional Bellman operator is contractive on the space of mass-1 signed measure return distribution representations.

Lemma 5. Under the conditions of Corollary 2, the projected operator $\Pi^R_{SC,\kappa}T^\pi : \mathcal{S}_R\to\mathcal{S}_R$ is affine, is contractive with contraction factor $\gamma^{c/2}$, and has a unique fixed point $\eta^\pi_{SC,\kappa}$.

Proof. Indeed, $\Pi^R_{SC,\kappa}$ is, in a sense, a simpler operator than $\Pi^R_{C,\kappa}$. Since $\mathscr{M}_1(R(x))$ is an affine subspace of $\mathscr{M}_1(\mathbb{R}^d)$, it holds that $\Pi^R_{SC,\kappa}$ is simply a Hilbertian projection, which is known to be affine and a nonexpansion [Lax02]. Moreover, since $T^\pi$ acts identically on $\mathscr{M}_1(\mathbb{R}^d)$ as it does on $\mathscr{P}(\mathbb{R}^d)$, it immediately follows that $T^\pi$ is a $\gamma^{c/2}$-contraction on the $\mathscr{M}_1(\mathbb{R}^d)$ representations, inheriting the result from Theorem 2. Thus, we have that for any $\eta_1, \eta_2 \in \mathcal{S}_R$,

$$\mathrm{MMD}_\kappa(\Pi^R_{SC,\kappa}T^\pi\eta_1, \Pi^R_{SC,\kappa}T^\pi\eta_2) \le \mathrm{MMD}_\kappa(T^\pi\eta_1, T^\pi\eta_2) \le \gamma^{c/2}\,\mathrm{MMD}_\kappa(\eta_1, \eta_2),$$

confirming that the projected operator is a contraction. Since $\mathrm{MMD}_\kappa$ is a metric on $\mathscr{M}_1(R(x))$ for each $x \in X$, it follows that $\mathrm{MMD}_\kappa$ is a metric on $\mathcal{S}_R$. The existence and uniqueness of the fixed point $\eta^\pi_{SC,\kappa}$ follow by the Banach fixed-point theorem.

Finally, we show that the fixed point of distributional dynamic programming with signed measure representations is roughly as accurate as $\eta^\pi_{C,\kappa}$.

Theorem 5. Under the conditions of Lemma 5, we have that

1. $\mathrm{MMD}_\kappa(\eta^\pi_{SC,\kappa}, \eta^\pi) \le \frac{1}{1-\gamma^{c/2}}\sup_{x\in X}\inf_{P\in\mathcal{P}_{R(x)}}\sqrt{\mathrm{mesh}(P;\kappa)}$; and
2. $\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi_{SC,\kappa}, \eta^\pi) \le \left(1 + \frac{1}{1-\gamma^{c/2}}\right)\sup_{x\in X}\inf_{P\in\mathcal{P}_{R(x)}}\sqrt{\mathrm{mesh}(P;\kappa)}$.

Proof. Since $\Pi^R_{SC,\kappa}$ is a nonexpansion in $\mathrm{MMD}_\kappa$ by Lemma 5, following the procedure of Theorem 4, we have

$$\mathrm{MMD}_\kappa(\eta^\pi_{SC,\kappa}, \eta^\pi) \le \frac{1}{1-\gamma^{c/2}}\mathrm{MMD}_\kappa(\Pi^R_{SC,\kappa}\eta^\pi, \eta^\pi).$$

Note that $\Pi^R_{SC,\kappa}\eta^\pi$ identifies the closest point (in $\mathrm{MMD}_\kappa$) to $\eta^\pi$ in $\mathcal{S}_R \triangleq \prod_{x\in X}\mathscr{M}_1(R(x))$, and $\Pi^R_{C,\kappa}\eta^\pi$ identifies the closest point to $\eta^\pi$ in $\mathcal{C}_R \triangleq \prod_{x\in X}\mathscr{P}(R(x))$. Since it is clear that $\mathcal{C}_R \subseteq \mathcal{S}_R$, it follows that

$$\mathrm{MMD}_\kappa(\eta^\pi_{SC,\kappa}, \eta^\pi) \le \frac{1}{1-\gamma^{c/2}}\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi).$$

The first statement then directly follows, since it was shown in Lemma 3 that $\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi) \le \sup_{x\in X}\inf_{P\in\mathcal{P}_{R(x)}}\sqrt{\mathrm{mesh}(P;\kappa)}$. To prove the second statement, we apply the triangle inequality to yield

$$\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi_{SC,\kappa}, \eta^\pi) \le \mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi_{SC,\kappa}, \Pi^R_{C,\kappa}\eta^\pi) + \mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi) \le \mathrm{MMD}_\kappa(\eta^\pi_{SC,\kappa}, \eta^\pi) + \mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi),$$

where the second step leverages the fact that $\Pi^R_{C,\kappa}$ is a nonexpansion in $\mathrm{MMD}_\kappa$ by Lemma 2. Applying the conclusion of the first statement as well as the bound on $\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi, \eta^\pi)$, we have

$$\mathrm{MMD}_\kappa(\Pi^R_{C,\kappa}\eta^\pi_{SC,\kappa}, \eta^\pi) \le \frac{1}{1-\gamma^{c/2}}\sup_{x\in X}\inf_{P\in\mathcal{P}_{R(x)}}\sqrt{\mathrm{mesh}(P;\kappa)} + \sup_{x\in X}\inf_{P\in\mathcal{P}_{R(x)}}\sqrt{\mathrm{mesh}(P;\kappa)} = \left(1 + \frac{1}{1-\gamma^{c/2}}\right)\sup_{x\in X}\inf_{P\in\mathcal{P}_{R(x)}}\sqrt{\mathrm{mesh}(P;\kappa)}.$$
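A practical consequence of Lemma 5 is that the signed projection has a closed form: dropping the nonnegativity constraint from Lemma 1's quadratic program leaves an equality-constrained least-squares problem, solvable with one linear system. A sketch, assuming the Gram matrix $K$ is positive definite:

```python
import numpy as np

def signed_categorical_projection(K, b):
    """Closed-form MMD projection onto mass-1 *signed* vectors (cf. Lemma 5):
    minimize p^T K p - 2 p^T b subject only to sum(p) = 1. The map b -> p is
    affine, which is the property the TD analysis in Section B.4 relies on.
    K: (N, N) positive definite kernel Gram matrix on the support points;
    b: (N,) embedding of the target measure, b_i = integral kappa(z_i, .) d(target)."""
    ones = np.ones(len(b))
    Kinv_b = np.linalg.solve(K, b)
    Kinv_1 = np.linalg.solve(K, ones)
    nu = (1.0 - ones @ Kinv_b) / (ones @ Kinv_1)   # Lagrange multiplier for sum(p) = 1
    return Kinv_b + nu * Kinv_1
```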
B.4.2 Convergence of Categorical TD Learning

Convergence of the proposed categorical TD-learning algorithm relies on studying the iterates through an isometry to an affine subspace of $\prod_{x\in X}\mathbb{R}^{N(x)}$. This affine subspace consists of the "state-conditioned signed probability vectors". We define $\mathbb{R}^n_1$ as an affine subspace of $\mathbb{R}^n$ for any $n \in \mathbb{N}$ according to

$$\mathbb{R}^n_1 \triangleq \left\{v \in \mathbb{R}^n : \sum_{i=1}^n v_i = 1\right\}. \tag{15}$$

We note that any element $\eta$ of $\mathcal{S}_R$ can be encoded in $\prod_{x\in X}\mathbb{R}^{N(x)}_1$ by expressing $\eta(x)$ as the sequence of signed masses associated to each atom of $R(x)$. In Lemma 8, we exhibit an isometry $I$ between $\mathcal{S}_R$ and $\prod_{x\in X}\mathbb{R}^{N(x)}_1$.

Lemma 8. Let $\kappa$ be a characteristic kernel. There exists an affine isometric isomorphism $I$ between $\mathcal{S}_R$ and the affine subspace $\prod_{x\in X}\mathbb{R}^{N(x)}_1$ (cf. (15)).

Proof. Since $\kappa$ is characteristic, it is positive definite [GBR+12]. Thus, for each $x \in X$, define $K_x \in \mathbb{R}^{N(x)\times N(x)}$ according to $(K_x)_{i,j} = \kappa(z_i, z_j)$, where $R(x) = \{z_k\}_{k=1}^{N(x)}$. Then each $K_x$ is positive definite since $\kappa$ is a positive definite kernel. Let $p, q \in \mathscr{M}_1(R(x))$, and define $P \in \mathbb{R}^{N(x)}$ and $Q \in \mathbb{R}^{N(x)}$ such that $P_i = p(\{z_i\})$ and $Q_i = q(\{z_i\})$. Then, we have

$$\begin{aligned}
\mathrm{MMD}^2_\kappa(p, q) = \|\mu_p - \mu_q\|^2_{\mathcal{H}} &= \left\|\sum_{i=1}^{N(x)}\kappa(\cdot, z_i)p(\{z_i\}) - \sum_{i=1}^{N(x)}\kappa(\cdot, z_i)q(\{z_i\})\right\|^2_{\mathcal{H}} = \left\langle\sum_{i=1}^{N(x)}\kappa(\cdot, z_i)(P_i - Q_i),\ \sum_{j=1}^{N(x)}\kappa(\cdot, z_j)(P_j - Q_j)\right\rangle \\
&= \sum_{i,j=1}^{N(x)}(P_i - Q_i)(P_j - Q_j)\kappa(z_i, z_j) = (P - Q)^\top K_x(P - Q) = \|P - Q\|^2_{K_x}.
\end{aligned}$$

Since $K_x$ is positive definite, $\|\cdot\|_{K_x}$ is a norm on $\mathbb{R}^{N(x)}$. Therefore, the map $I^1_x : (\mathscr{M}_1(R(x)), \mathrm{MMD}_\kappa)\to(\mathbb{R}^{N(x)}, \|\cdot\|_{K_x})$ given by $I^1_x(p) = P$ is an isometric isomorphism onto the affine subspace of $\mathbb{R}^{N(x)}$ with entries summing to 1, which we denote $\mathbb{R}^{N(x)}_1$. Moreover, since $(\mathbb{R}^{N(x)}, \|\cdot\|_{K_x})$ is a finite-dimensional Hilbert space, it is well known that there exists an isometric isomorphism $I^2_x : (\mathbb{R}^{N(x)}_1, \|\cdot\|_{K_x})\to\mathbb{R}^{N(x)}_1$ with the usual $L^2$ norm. Thus, $I_x = I^2_x\circ I^1_x : (\mathscr{M}_1(R(x)), \mathrm{MMD}_\kappa)\to\mathbb{R}^{N(x)}_1$ is an affine isometric isomorphism. Consequently, it follows that $I : (\mathcal{S}_R, \mathrm{MMD}_\kappa)\to\prod_{x\in X}\mathbb{R}^{N(x)}_1$ given by $(I\eta)(x) = I_x(\eta(x))$ is an affine isometric isomorphism, where the product space is equipped with the norm $\|v\|_{2,\infty} = \sup_{x\in X}\|v(x)\|_2$.

Lemma 9. Define the operator $U^\pi : \prod_{x\in X}\mathbb{R}^{N(x)}_1\to\prod_{x\in X}\mathbb{R}^{N(x)}_1$ by $U^\pi = I\Pi^R_{SC,\kappa}T^\pi I^{-1}$, where $I$ is the isometry of Lemma 8. Let $\{U_t\}_{t=1}^\infty$ be given by $U_t = I\eta_t$, where $\{\eta_t\}_{t=1}^\infty$ are the dynamic programming iterates $\eta_{t+1} = \Pi^R_{SC,\kappa}T^\pi\eta_t$. Then $U_{t+1} = U^\pi U_t$. Moreover, $U^\pi$ is contractive whenever $\Pi^R_{SC,\kappa}T^\pi$ is, and in this case, $U_t \to U^\star$, where $U^\star$ is the unique fixed point of $U^\pi$.

Proof. By definition, we have

$$U_{t+1} = I\eta_{t+1} = I(\Pi^R_{SC,\kappa}T^\pi\eta_t) = I\Pi^R_{SC,\kappa}T^\pi(I^{-1}U_t) = U^\pi U_t,$$

which proves the first claim. Moreover, for $U_1 = I\eta_1$ and $U_2 = I\eta_2$, we have

$$\|U^\pi U_1 - U^\pi U_2\|_{2,\infty} = \|I\Pi^R_{SC,\kappa}T^\pi\eta_1 - I\Pi^R_{SC,\kappa}T^\pi\eta_2\|_{2,\infty} = \mathrm{MMD}_\kappa(\Pi^R_{SC,\kappa}T^\pi\eta_1, \Pi^R_{SC,\kappa}T^\pi\eta_2),$$

where the second step transforms the $\|\cdot\|_{2,\infty}$ norm to $\mathrm{MMD}_\kappa$ since $I$ is an isometry between those metric spaces. Therefore, if $\Pi^R_{SC,\kappa}T^\pi$ is contractive with contraction factor $\beta \in (0, 1)$, we have

$$\|U^\pi U_1 - U^\pi U_2\|_{2,\infty} \le \beta\,\mathrm{MMD}_\kappa(\eta_1, \eta_2) = \beta\,\|I\eta_1 - I\eta_2\|_{2,\infty} = \beta\,\|U_1 - U_2\|_{2,\infty},$$

so that $U^\pi$ has the same contraction factor as $\Pi^R_{SC,\kappa}T^\pi$. Consequently, by the Banach fixed-point theorem, $U^\pi$ has a unique fixed point $U^\star$ whenever $\Pi^R_{SC,\kappa}T^\pi$ is contractive, and $U_t \to U^\star$ at the same rate as $\eta_t \to \eta^\pi_{SC,\kappa}$.
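The isometry of Lemma 8 is also easy to realize numerically: writing $K_x = LL^\top$ via a Cholesky factorization, the map $P\mapsto L^\top P$ converts the $\|\cdot\|_{K_x}$ norm into the Euclidean norm. A small check; the support points and mass vectors below are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.uniform(size=(6, 2))                           # support points R(x)
D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
norms = np.linalg.norm(Z, axis=1)
K = 0.5 * (norms[:, None] + norms[None, :] - D)        # energy kernel, alpha = 1
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(Z)))     # small jitter for stability

P, Q = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))  # two mass-1 vectors
lhs = np.sqrt((P - Q) @ K @ (P - Q))                   # MMD between the measures
rhs = np.linalg.norm(L.T @ (P - Q))                    # Euclidean norm after embedding
print(np.allclose(lhs, rhs))                           # True
```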
The main theorem in this section is that temporal difference learning on the finite-dimensional representations $I(\eta_t)$ converges.

Proposition 2 (Convergence of categorical temporal difference learning). Let $\{U_t\}_{t=1}^\infty \subset \prod_{x\in X}\mathbb{R}^{N(x)}_1$ be given by $U_t = I\hat\eta_t$, and let $\kappa$ be a kernel satisfying the conditions of Theorem 2. Suppose that, for each $x \in X$, the states $\{X_t\}_{t=1}^\infty$ and step sizes $\{\alpha_t\}_{t=1}^\infty$ satisfy the Robbins-Monro conditions

$$\sum_{t=0}^\infty\alpha_t\mathbb{1}_{[X_t=x]} = \infty, \qquad \sum_{t=0}^\infty\alpha^2_t\mathbb{1}_{[X_t=x]} < \infty.$$

Then, with probability 1, $U_k \to U^\star$, where $U^\star$ is the fixed point of $U^\pi$.

The proof of this result is a natural extension of the convergence analysis of categorical TD learning given in [BDR23] to the multivariate return setting under the supremal MMD metric. The analysis hinges on the following general lemma.

Lemma 10 ([BDR23, Theorem 6.9]). Let $O : (\mathbb{R}^n)^X\to(\mathbb{R}^n)^X$ be a contractive operator with respect to $\|\cdot\|_{2,\infty}$ with fixed point $Z^\star$, and let $(\Omega, \mathcal{F}, \{\mathcal{F}_k\}_{k=1}^\infty, \mathbb{P})$ be a filtered probability space. Define a map $\hat O : (\mathbb{R}^n)^X\times X\times\Omega\to(\mathbb{R}^n)^X$ such that

$$\mathbb{E}_{\mathbb{P}}\!\left[\hat O(Z, X, \omega)(x)\,\middle|\,X = x\right] = (OZ)(x). \tag{16}$$

For a stochastic process $\{\xi_k\}_{k=1}^\infty$ adapted to $\{\mathcal{F}_k\}_{k=1}^\infty$ with $\xi_k = (X_k, \omega_k)$, consider a sequence $\{Z_k\}_{k=1}^\infty \subset (\mathbb{R}^n)^X$ given by

$$Z_{k+1}(x) = \begin{cases}(1-\alpha_k)Z_k(x) + \alpha_k\hat O(Z_k, X_k, \omega_k)(x) & X_k = x\\ Z_k(x) & \text{otherwise},\end{cases} \tag{17}$$

where $\{\alpha_k\}_{k=1}^\infty$ is adapted to $\{\mathcal{F}_k\}_{k=1}^\infty$ and satisfies the Robbins-Monro conditions for each $x \in X$,

$$\sum_{k=1}^\infty\alpha_k\mathbb{1}_{[X_k=x]} = \infty, \qquad \sum_{k=1}^\infty\alpha^2_k\mathbb{1}_{[X_k=x]} < \infty.$$

Finally, for the processes $\{w(x)_k\}_{k=1}^\infty$, where $w(x)_k = (\hat O(Z_k, X_k, \omega_k)(x) - (OZ_k)(X_k))\mathbb{1}_{[X_k=x]}$, assume the following moment condition holds:

$$\mathbb{E}_{\mathbb{P}}\!\left[\|w(x)_k\|^2\,\middle|\,\mathcal{F}_k\right] \le C_1 + C_2\|Z_k\|^2_{2,\infty} \tag{18}$$

for finite universal constants $C_1, C_2$. Then, with probability 1, $Z_k \to Z^\star$.

The operator $O$ of Lemma 10 will be substituted with $U^\pi$, governing the dynamics of the encoded iterates of the multi-return distribution. The stochastic operator $\hat O$ will be substituted with the corresponding stochastic TD operator for $U^\pi$, given by

$$\hat U^\pi(U, x_1, (R, x_2))(x) = \begin{cases}I_x\!\left(\Pi^R_{SC,\kappa}(b_{R,\gamma})_\sharp(I^{-1}U)(x_2)\right) & x_1 = x\\ U(x) & \text{otherwise}.\end{cases} \tag{19}$$

This corresponds to applying a Bellman backup with a stochastic reward $R$ and next state $x_2$, followed by projecting back onto $\mathcal{S}_R$, and applying the isometry back to $\prod_{x\in X}\mathbb{R}^{N(x)}_1$.
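For concreteness, the sketch below spells out one application of the stochastic operator (19) followed by the step-size mixing of (17), in the tabular case with a shared support; it reuses `signed_categorical_projection` from the sketch in Section B.4.1, and the names and array conventions are ours.

```python
import numpy as np

def signed_td_update(U, x, r, x_next, alpha_k, Z, K, gamma):
    """One stochastic signed categorical TD update (cf. (17) and (19)).
    U: (S, N) signed mass vectors on shared support Z (rows sum to 1);
    K: Gram matrix of the energy kernel (reference point 0) on Z."""
    n = lambda a: np.linalg.norm(a, axis=-1)
    tgt = r + gamma * Z                                 # atoms of b_{r,gamma} pushforward
    Kt = 0.5 * (n(Z)[:, None] + n(tgt)[None, :] - n(Z[:, None] - tgt[None, :]))
    b = Kt @ U[x_next]                                  # embedding of the backup target
    target = signed_categorical_projection(K, b)        # affine projection onto mass-1 vectors
    U = U.copy()
    U[x] = (1 - alpha_k) * U[x] + alpha_k * target      # update only the visited state
    return U
```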
Proof of Proposition 2. Let $n = \max_{x\in X}N(x)$. Note that for any $m \le n$, $\mathbb{R}^m$ can be isometrically embedded into $\mathbb{R}^n$ by zero-padding. Thus, $\prod_{x\in X}\mathbb{R}^{N(x)}_1$ can be isometrically embedded into $(\mathbb{R}^n_1)^X$, so without loss of generality, we will assume that $N(x) \equiv n$. Define the map $\hat U^\pi : (\mathbb{R}^n_1)^X\times X\times(\mathbb{R}^d\times X)\to(\mathbb{R}^n_1)^X$ according to

$$(\hat U^\pi(U, x_1, (R, x_2)))(x) = \begin{cases}I_x\!\left(\Pi^R_{SC,\kappa}(b_{R,\gamma})_\sharp(I^{-1}U)(x_2)\right) & x_1 = x\\ U(x) & \text{otherwise}.\end{cases} \tag{20}$$

Then, defining $\hat U^\pi_kU = \hat U^\pi(U, X_k, (R_k, X'_k))$ with $(X_k, A_k, R_k, X'_k) = T_k \sim \mathcal{P}$ as in (11), we have

$$U_{k+1}(x) = (I\hat T^\pi\hat\eta_k)(x) = I\!\left(\mathbb{1}_{[X_k=x]}\Pi^R_{SC,\kappa}(b_{R_k,\gamma})_\sharp\hat\eta_k(X'_k) + \mathbb{1}_{[X_k\ne x]}\hat\eta_k\right)(x) = \mathbb{1}_{[X_k=x]}I_x\Pi^R_{SC,\kappa}(b_{R_k,\gamma})_\sharp\hat\eta_k(X'_k) + \mathbb{1}_{[X_k\ne x]}U_k(x) = (\hat U^\pi_kU_k)(x).$$

Note that, since $\Pi^R_{SC,\kappa}$ is a Hilbert projection onto an affine subspace, it is affine [Lax02]. Consequently, $\hat U^\pi$ is an unbiased estimator of the operator $U^\pi$ in the following sense:

$$\mathbb{E}_{\mathbb{P}}\!\left[\hat U^\pi(U, X_k, (R_k, X'_k))(x)\,\middle|\,X_k = x\right] = \mathbb{E}_{\mathbb{P}}\!\left[I_x\Pi^R_{SC,\kappa}(b_{R_k,\gamma})_\sharp(I^{-1}U)(X'_k)\right] = I_x\Pi^R_{SC,\kappa}\mathbb{E}_{X'_k\sim P^\pi(\cdot|x)}\!\left[(b_{r(x),\gamma})_\sharp(I^{-1}U)(X'_k)\right] = I_x(\Pi^R_{SC,\kappa}T^\pi I^{-1}U)(x) = (U^\pi U)(x),$$

where the first step invokes the affineness of $\Pi^R_{SC,\kappa}$, the second step invokes the linearity of the isometry $I$ established in Lemma 8, and the third step is due to the definition of $T^\pi$. As a result, we see that the conditions (16) and (17) of Lemma 10 are satisfied by $\hat U^\pi$, the iterates $\{U_k\}_{k=1}^\infty$, and the step sizes $\{\alpha_k\}_{k=1}^\infty$. Moreover, for $w_k(x)$ defined by

$$w_k(x) = \left(\hat U^\pi(U_k, X_k, (R_k, X'_k))(x) - (U^\pi U_k)(X_k)\right)\mathbb{1}_{[X_k=x]},$$

we have $\|w_k(x)\|^2 \le C_1 + C_2\|U_k(x)\|^2$ for universal constants $C_1, C_2$; this is shown in Lemma 11. As such, the condition (18) is satisfied. Finally, since $U^\pi$ inherits contractivity from $\Pi^R_{SC,\kappa}T^\pi$ as shown in Lemma 5, we may invoke Lemma 10, which ensures that $U_k \to U^\star$ with probability 1, where $U^\star = U^\pi U^\star$ is a unique fixed point.

Theorem 6. For a kernel $\kappa$ induced by a semimetric $\rho$ of strong negative type, the sequence $\{\hat\eta_t\}_{t=1}^\infty$ given by (11)-(13) converges to $\eta^\pi_{SC,\kappa}$ with probability 1.

Proof. By Proposition 2, the sequence $\{U_t\}_{t=1}^\infty$ with $U_t = I\hat\eta_t$ converges to a unique fixed point $U^\star$ with probability 1. Note that $I^{-1}U^\star = I^{-1}U^\pi U^\star = \Pi^R_{SC,\kappa}T^\pi(I^{-1}U^\star)$. Therefore, $I^{-1}U^\star$ is a fixed point of $\Pi^R_{SC,\kappa}T^\pi$. Since it was shown in Lemma 5 that $\Pi^R_{SC,\kappa}T^\pi$ has a unique fixed point, it follows that $I^{-1}U^\star = \eta^\pi_{SC,\kappa}$. Since $I$ is an isometry, $\hat\eta_t = I^{-1}U_t \to I^{-1}U^\star$ with probability 1, so indeed $\hat\eta_t \to \eta^\pi_{SC,\kappa}$ with probability 1.

To conclude, we prove Lemma 11, which was invoked in the proof of Proposition 2.

Lemma 11. Under the conditions of Proposition 2, it holds that for any $x \in X$ and $U \in \prod_{x\in X}\mathbb{R}^{N(x)}_1$,

$$\mathbb{E}_{X\sim P^\pi(\cdot|x)}\left\|U^\pi U(x) - \hat U^\pi(U, x, (R(x), X))(x)\right\|^2 \le C_1 + C_2\|U(x)\|^2$$

for finite constants $C_1, C_2 \in \mathbb{R}_+$.

Proof. Since $I$ is an isometry, we have that

$$\|U^\pi U(x) - \hat U^\pi(U, x, (r, x'))(x)\|^2 = \left\|\Pi^R_{SC,\kappa}T^\pi(I^{-1}U)(x) - \Pi^R_{SC,\kappa}(b_{r,\gamma})_\sharp(I^{-1}U)(x')\right\|^2_{\mathcal{H}},$$

where $\mathcal{H}$ is the RKHS induced by the kernel $\kappa$ and measures are identified with their mean embeddings. Moreover, since $\Pi^R_{SC,\kappa}$ is a nonexpansion in $\mathcal{H}$ as argued in Lemma 5, we have that

$$\mathbb{E}_{X\sim P^\pi(\cdot|x)}\left\|U^\pi U(x) - \hat U^\pi(U, x, (R(x), X))(x)\right\|^2 \le \mathbb{E}_{X\sim P^\pi(\cdot|x)}\!\left[\left\|T^\pi(I^{-1}U)(x) - (b_{R,\gamma})_\sharp(I^{-1}U)(X)\right\|^2_{\mathcal{H}}\right] \le 2\underbrace{\left\|T^\pi(I^{-1}U)(x)\right\|^2_{\mathcal{H}}}_{A} + 2\underbrace{\mathbb{E}_{X\sim P^\pi(\cdot|x)}\!\left[\left\|(b_{R,\gamma})_\sharp(I^{-1}U)(X)\right\|^2_{\mathcal{H}}\right]}_{B}.$$

Proceeding, we will bound the terms $A, B$ (absolute constants from the inequality $\|u - v\|^2 \le 2\|u\|^2 + 2\|v\|^2$ are absorbed into the constants $D$ below). To bound $A$, we simply have

$$A \le 2\left\|T^\pi(I^{-1}U)(x) - \eta^\pi(x)\right\|^2_{\mathcal{H}} + 2\left\|\eta^\pi(x)\right\|^2_{\mathcal{H}} \le 2\gamma^{c/2}\left\|(I^{-1}U)(x) - \eta^\pi(x)\right\|^2_{\mathcal{H}} + 2\left\|\eta^\pi(x)\right\|^2_{\mathcal{H}},$$

where we invoke the contraction of $T^\pi$ in $\mathrm{MMD}_\kappa$ from Theorem 2. Note that $\eta^\pi(x) \in \mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)$, so it follows that $\|\eta^\pi(x)\|^2_{\mathcal{H}} \le D_{1,1}$ for some constant $D_{1,1}$, since the kernel $\kappa$ is bounded on compact domains. Expanding the norm of the difference above yields

$$A \le (1 + \gamma^{c/2})D_{1,1} + D_{1,2}\|U(x)\|^2$$

for a finite constant $D_{1,2}$, again invoking the isometry $I$ in the last step. Our bound for $B$ is similar. Choose any $x' \in \operatorname{supp}P^\pi(\cdot\,|\,x)$. We consider the operator $\tilde T_{x'} : \mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)^X \to \mathscr{P}([0, (1-\gamma)^{-1}R_{\max}]^d)^X$ given by $(\tilde T_{x'}\eta)(x) = (b_{R(x),\gamma})_\sharp\eta(x')$. This operator is a contraction in $\mathrm{MMD}_\kappa$, and correspondingly has a fixed point $\eta^\pi_{x'}$. To see this, we note that $\tilde T_{x'}$ is simply a special case of $T^\pi$ for the case $P^\pi(\cdot\,|\,x) = \delta_{x'}$, so the contractivity and existence of the fixed point are inherited from Theorem 2. Proceeding in a manner similar to the bound on $A$, we have

$$\left\|(b_{R,\gamma})_\sharp(I^{-1}U)(x')\right\|^2_{\mathcal{H}} = \left\|(\tilde T_{x'}I^{-1}U)(x)\right\|^2_{\mathcal{H}} \le 2\left\|(\tilde T_{x'}I^{-1}U)(x) - \eta^\pi_{x'}(x)\right\|^2_{\mathcal{H}} + 2\left\|\eta^\pi_{x'}(x)\right\|^2_{\mathcal{H}} \le (1 + \gamma^{c/2})D_{2,1} + D_{2,2}\|U(x)\|^2,$$

where the final step mirrors the bound on $A$. Therefore, we have shown that

$$\mathbb{E}_{X\sim P^\pi(\cdot|x)}\left\|U^\pi U(x) - \hat U^\pi(U, x, (R(x), X))(x)\right\|^2 \le 2(1 + \gamma^{c/2})(D_{1,1} + D_{2,1}) + 2(D_{1,2} + D_{2,2})\|U(x)\|^2,$$

completing the proof.
C Memory Efficiency of Randomized EWP Dynamic Programming

In Section 4, we argued for the necessity of considering a projection operator in EWP dynamic programming. While we provided a randomized projection, Theorem 3 requires that we apply only a finite number of DP iterations. Thus, one might ask whether, given that we apply only finitely many iterations, naive unprojected EWP dynamic programming can produce accurate enough approximations of $\eta^\pi$ without costing too much in memory. In this section, we demonstrate that, in fact, the algorithm described in Theorem 3 can approximate $\eta^\pi$ to any desired accuracy with many fewer particles.

Suppose our goal is to derive some $\eta$ such that $\mathrm{MMD}_\kappa(\eta, \eta^\pi) \le \epsilon$ for some $\epsilon > 0$. We will derive bounds on the number of particles required to attain such an approximation with unprojected EWP dynamic programming (denoting the number of particles $m_{\mathrm{unproj}}$) as well as with our algorithm described in Theorem 3 (denoting the number of particles $m_{\mathrm{proj}}$). In both cases, we will compute iterates starting with some $\eta_0 \in \mathcal{C}_{\mathrm{EWP},m}$ with $\mathrm{MMD}_\kappa(\eta_0, \eta^\pi) \le D < \infty$. For simplicity, we will consider the energy-distance kernel with $\alpha = 1$. The remainder of this section shows that the dependence of the number of atoms on both $\epsilon$ and $|X|$ is substantially worse in the unprojected case (that is, $m_{\mathrm{proj}} \ll m_{\mathrm{unproj}}$ for large state spaces or low error tolerance). We demonstrate this with concrete lower bounds on $m_{\mathrm{unproj}}$ and upper bounds on $m_{\mathrm{proj}}$ below; note that these bounds are not optimized for tightness or generality, and are instead aimed to provide straightforward evidence of our core points above.

We will begin by bounding $m_{\mathrm{unproj}}$. In the best case, $\eta_0(x)$ is supported on one particle for each $x$. If any state can be reached from any other state in the MDP with non-zero probability, then applying the distributional Bellman operator to $\eta_0$ will result in $\eta_1(x)$ having support on $|X|$ atoms at each state $x$ (due to the mixture over successor states in the Bellman backup). Consequently, the iterate $\eta_k(x)$ will be supported on $|X|^k$ atoms. Since $\mathrm{MMD}_\kappa(\eta_k, \eta^\pi) \le \gamma^{k/2}D$ by Theorem 2, we require $K \ge 2\log(D/\epsilon)/\log(1/\gamma)$ iterations to ensure that $\mathrm{MMD}_\kappa(\eta_K, \eta^\pi) \le \epsilon$. Thus, we have

$$m_{\mathrm{unproj}} \ge |X|^{2\log(D/\epsilon)/\log(1/\gamma)}.$$

On the other hand, the following lemma bounds $m_{\mathrm{proj}}$; we prove the lemma at the end of this section.

Lemma 12. Let $\eta_{m_{\mathrm{proj}}}$ denote the output of the projected EWP algorithm described by Theorem 3 with $m = m_{\mathrm{proj}}$ particles. Then under the assumptions of Theorem 3, and with the energy-distance kernel with $\alpha = 1$, $\mathrm{MMD}_\kappa(\eta_{m_{\mathrm{proj}}}, \eta^\pi) \le \epsilon$ is achievable with

$$m_{\mathrm{proj}} \le \Theta\!\left(\epsilon^{-2}\,\frac{dR^2_{\max}}{(1-\gamma^{1/2})^2(1-\gamma)^2}\,\mathrm{polylog}\!\left(\tfrac{1}{\delta}, |X|, d, R_{\max}, \tfrac{1}{1-\gamma}\right)\right).$$

For any fixed MDP with $|X| \ge 4$ and $\gamma \ge 1/2$, we have that

$$m_{\mathrm{unproj}} \ge \exp\!\left(2\log|X|\log\epsilon^{-1}\right)\exp\!\left(2\log|X|\log D\right) = \exp\!\left(2\log|X|\log D\right)\epsilon^{-2\log|X|} = \Omega\!\left(\epsilon^{-2\log|X|}\right),$$

since $D > 0$ and does not depend on $\epsilon$. Meanwhile, we have $m_{\mathrm{proj}} \le \Theta(\epsilon^{-2}\,\mathrm{polylog}(\epsilon^{-1}))$ by Lemma 12, indicating a much more graceful dependence on $\epsilon$ relative to the unprojected algorithm. On the other hand, for any fixed tolerance $\epsilon \le \gamma D$, we immediately have $m_{\mathrm{unproj}} = \Omega(|X|^2)$, while $m_{\mathrm{proj}} = \Theta(d\,\mathrm{polylog}(d, |X|))$. In the worst case, we may have $d = \Theta(|X|)$ (any larger $d$ will induce linearly dependent cumulants). Thus, we have

$$\frac{m_{\mathrm{proj}}}{m_{\mathrm{unproj}}} \le \begin{cases}\widetilde O(|X|^{-1}) & d = \omega(1)\\ \widetilde O(|X|^{-2}) & d = \Theta(1),\end{cases}$$

so the projected algorithm scales much more gracefully with $|X|$ as well.
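The gap between the two bounds is easy to appreciate numerically. The sketch below tabulates both, with all constants set to 1 and polylogarithmic factors dropped, so the outputs only indicate scaling rather than exact particle requirements.

```python
import numpy as np

def m_unprojected(eps, n_states, gamma=0.5, D=1.0):
    """Lower-bound scaling: support multiplies by |X| at each of K backups."""
    K = int(np.ceil(2 * np.log(D / eps) / np.log(1 / gamma)))
    return n_states ** K

def m_projected(eps, d=2, gamma=0.5, r_max=1.0):
    """Upper-bound scaling from Lemma 12, constants set to 1, polylogs dropped."""
    c = d * r_max ** 2 / ((1 - gamma ** 0.5) ** 2 * (1 - gamma) ** 2)
    return int(np.ceil(c / eps ** 2))

for eps in (0.5, 0.1, 0.02):
    print(f"eps={eps}: unprojected ~ {m_unprojected(eps, n_states=10):.1e}, "
          f"projected ~ {m_projected(eps):.1e}")
```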
Proof of Lemma 12. Finally, we prove Lemma 12, which determines the number of atoms required to achieve an $\epsilon$-accurate return distribution estimate with the algorithm of Theorem 3. We restate the lemma for convenience.

Lemma 12. Let $\eta_{m_{\mathrm{proj}}}$ denote the output of the projected EWP algorithm described by Theorem 3 with $m = m_{\mathrm{proj}}$ particles. Then under the assumptions of Theorem 3, and with the energy-distance kernel with $\alpha = 1$, $\mathrm{MMD}_\kappa(\eta_{m_{\mathrm{proj}}}, \eta^\pi) \le \epsilon$ is achievable with

$$m_{\mathrm{proj}} \le \Theta\!\left(\epsilon^{-2}\,\frac{dR^2_{\max}}{(1-\gamma^{1/2})^2(1-\gamma)^2}\,\mathrm{polylog}\!\left(\tfrac{1}{\delta}, |X|, d, R_{\max}, \tfrac{1}{1-\gamma}\right)\right).$$

Proof. Note that, by Theorem 3, increasing $m_{\mathrm{proj}}$ can only decrease the error $\epsilon$ as long as $m_{\mathrm{proj}} \ge 1$. Therefore, as shown in (14) in the proof of Theorem 3, there exists a universal constant $C_0 > 0$ such that the achievable error satisfies

$$\epsilon := C_0\,\frac{c_1}{\sqrt{m_{\mathrm{proj}}}}\left(c_2 + \log m_{\mathrm{proj}}\right), \qquad c_1 = \frac{\sqrt d\,R_{\max}}{(1-\gamma^{1/2})(1-\gamma)},$$

where $c_2$ collects the terms of (14) that are polylogarithmic in $\delta^{-1}, |X|, d, R_{\max}, (1-\gamma)^{-1}$. Now, we write $c_3 = C_0c_1c_2$, $c_4 = C_0c_1$, and $u := \sqrt{m_{\mathrm{proj}}}$, yielding

$$\epsilon = \frac{c_3 + c_4\log u^2}{u} = \frac{c_3 + 2c_4\log u}{u}.$$

Then, after isolating the logarithmic term and exponentiating, we see that

$$u = \exp\!\left(\frac{u\epsilon - c_3}{2c_4}\right).$$

We now rearrange this expression and invoke the identity $W(z)e^{W(z)} = z$, where $W$ is a Lambert $W$-function [CGH+96]:

$$ue^{-u\epsilon/(2c_4)} = e^{-c_3/(2c_4)} \implies \left(-\frac{u\epsilon}{2c_4}\right)e^{-u\epsilon/(2c_4)} = -\frac{\epsilon}{2c_4}e^{-c_3/(2c_4)} = -\frac{\epsilon}{2c_4}e^{-c_2/2}, \qquad z := -\frac{u\epsilon}{2c_4},$$

so that $ze^z = -\frac{\epsilon}{2c_4}e^{-c_2/2}$. There are two branches of the Lambert $W$-function on the reals, namely $W_0$ and $W_{-1}$. These two branches satisfy $W_0(ze^z) = z$ when $z \ge -1$ and $W_{-1}(ze^z) = z$ when $z \le -1$. In our case, we know that $z$ is negative, and it is known [CGH+96] that $|W_0(z)| \le 1$ when $z \in [-1, 0]$. Consequently, when $z \ge -1$, we have $|{-u\epsilon/(2c_4)}| \le 1$, and substituting $m_{\mathrm{proj}} = u^2$, we have

$$m_{\mathrm{proj}} \le \frac{4c_4^2}{\epsilon^2} \qquad \text{when } z \ge -1. \tag{23}$$

On the other hand, when $z \le -1$, we have

$$m_{\mathrm{proj}} = \frac{4c_4^2}{\epsilon^2}\,W^2_{-1}\!\left(-\frac{\epsilon}{2c_4}e^{-c_2/2}\right).$$

Since it is known [CGH+96, Equations 4.19, 4.20] that $W^2_{-1}(-z) \le \mathrm{polylog}(1/z)$, incorporating (23), we have that

$$m_{\mathrm{proj}} \le \frac{4c_4^2}{\epsilon^2}\max\!\left\{1,\ \mathrm{polylog}\!\left(2c_4e^{c_2/2}\epsilon^{-1}\right)\right\} \le \frac{4C_0^2\,dR^2_{\max}}{(1-\gamma^{1/2})^2(1-\gamma)^2\epsilon^2}\,\mathrm{polylog}\!\left(\tfrac{1}{\delta}, |X|, d, R_{\max}, \tfrac{1}{1-\gamma}\right). \tag{21}$$

The upper bound given above will generally not be an integer. However, increasing $m_{\mathrm{proj}}$ can only improve the approximation error, as shown in Theorem 3, since $\log m/\sqrt m$ decreases monotonically when $m > 7$. So, we can round $m_{\mathrm{proj}}$ up to the nearest integer (or round it down when $m \le 7$), incurring a penalty of at most one atom. It follows that the randomized EWP dynamic programming algorithm of Theorem 3, run with $m_{\mathrm{proj}}$ given by (21), produces a return distribution function $\eta_{m_{\mathrm{proj}}}$ for which $\mathrm{MMD}_\kappa(\eta_{m_{\mathrm{proj}}}, \eta^\pi) \le \epsilon$.

D Nonlinearity of the Categorical MMD Projection

In Section 6, we noted that the categorical projection $\Pi^R_{C,\kappa}$ is non-affine. Here, we provide an explicit example certifying this phenomenon. We consider a single-state MDP, since the nonlinearity issue is independent of the cardinality of the state space (the projection is applied to each state-conditioned distribution independently). We take $R = \{0, \ldots, 3\}^2$, and consider the kernel $\kappa$ induced by $\rho(x, y) = \|x - y\|_2$; the resulting MMD is known as the energy distance, which is what we used in our experiments. We consider two distributions, $p_1 = \delta_{(1.5, 1.5)}$ and $p_2 = \delta_{(2.5, 0)}$. We take $\lambda = 0.8$ and compare $q_1 = \Pi^R_{C,\kappa}(\lambda p_1 + (1-\lambda)p_2)$ with $q_2 = \lambda\Pi^R_{C,\kappa}p_1 + (1-\lambda)\Pi^R_{C,\kappa}p_2$, and we note that $q_1 \ne q_2$, confirming that $\Pi^R_{C,\kappa}$ is not an affine map.
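This certificate can be reproduced numerically with the categorical projection sketch from Appendix B.3 (exact probabilities will vary slightly with solver tolerance):

```python
import numpy as np

# Reuses `categorical_projection` from the Appendix B.3 sketch.
G = np.stack(np.meshgrid(np.arange(4.0), np.arange(4.0)), -1).reshape(-1, 2)  # {0,...,3}^2
p1 = np.array([[1.5, 1.5]])
p2 = np.array([[2.5, 0.0]])
lam = 0.8
# 0.8 p1 + 0.2 p2 as an empirical measure: four copies of p1's atom, one of p2's.
q_mix = categorical_projection(G, np.concatenate([np.repeat(p1, 4, 0), p2]))
q_sep = lam * categorical_projection(G, p1) + (1 - lam) * categorical_projection(G, p2)
print(np.abs(q_mix - q_sep).max() > 1e-3)   # True: the projection is not affine
```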
The results are tabulated in Table 1; note the non-negligible differences in probability between $q_1$ and $q_2$ at several atoms (e.g., $(2, 1)$ and $(3, 1)$).

Table 1: Certificate that $\Pi^R_{C,\kappa}$ is not affine.

Support point $\xi \in R$ | $q_1(\xi)$ | $q_2(\xi)$
(0, 0) | 0      | 0
(0, 1) | 0      | 0
(0, 2) | 0      | 0
(0, 3) | 0      | 0
(1, 0) | 0      | 0
(1, 1) | 0.1999 | 0.2057
(1, 2) | 0.1999 | 0.1959
(1, 3) | 0      | 0
(2, 0) | 0.0937 | 0.07957
(2, 1) | 0.2062 | 0.2413
(2, 2) | 0.1999 | 0.2026
(2, 3) | 0      | 0
(3, 0) | 0.0937 | 0.0787
(3, 1) | 0.0063 | 0
(3, 2) | 0      | 0
(3, 3) | 0      | 0

E Experiment Details

TD-learning experiments were conducted on an NVIDIA A100 80GB GPU to parallelize experiments. Methods were implemented in JAX [BFH+18], particularly with the help of JAXopt [BBC+21] for vectorizing QP solutions; this was helpful for computing the categorical projections discussed in this work. SGD was used for optimization, with an annealed learning rate schedule $(\lambda_k)_{k\ge 0}$ given by $\lambda_k = k^{-3/5}$, satisfying the conditions of Lemma 10. Experiments with constant learning rates yielded similar results but were less stable; this validates that the choice of learning rate schedule did not impede learning. The dynamic programming experiments were implemented in the Julia programming language [BEKS17]. In all experiments, we used the kernel induced by $\rho(x, y) = \|x - y\|_2$ with reference point $0$ for MMD optimization; this corresponds to the energy distance, and satisfies the requisite assumptions for convergent multivariate distributional dynamic programming outlined in Theorem 2.

F Neural Multivariate Distributional TD-Learning

For the sake of illustration, in this section we demonstrate that the signed categorical TD-learning algorithm presented in Section 6 can be scaled to continuous state spaces with neural networks. We consider an environment with visual (pixel) observations of a car in a parking lot; an example observation is shown in Figure 4.

[Figure 4: Example state in the parking environment, with the parking spot labeled.]

Here, we consider 2-dimensional cumulants, where the first dimension tracks the $x$ coordinate of the car, and the second dimension is an indicator that is 1 if and only if the car is parked in the parking spot. We learn a multivariate return distribution function with transitions sampled from trajectories that navigate around the obstacle to the parking spot. Notably, the successor features (the expectation of the multivariate return distribution) will be zero in the first dimension, since the set of trajectories is horizontally symmetric. Thus, from the successor features alone, one cannot distinguish the observed policy from one that traverses straight through the obstacle! Fortunately, when modeling a distribution over multivariate returns, we should see that the support of the multivariate return distribution does not include points with a vanishing first dimension.

To learn the multivariate return distribution function from images, we use a convolutional neural architecture as shown in Figure 5. Notably, we simply use convolutional networks to model the signed masses for the fixed atoms of the categorical representation. The projection $\Pi^R_{SC,\kappa}$ is computed by a QP solver as discussed in Section 5, and is applied only to the target distributions (thus we do not backpropagate through it).

[Figure 5: Neural architecture for modeling multi-return distributions from images.]

We compared the multi-return distributions learned by our signed categorical TD method with those of [ZCZ+21]. Our results are shown in Figure 6. We see that both TD-learning methods accurately estimate the distribution over multivariate returns, indicating that no multivariate return will have a vanishing lateral component. Qualitatively, we see that the EWP algorithm appears to be stuck in a local optimum, with some particles lying in regions of low probability mass. Moreover, on the right side of Figure 6, we show predicted return distributions for two randomly sampled reward vectors, and quantitatively evaluate the two methods. The leftmost reward vector incentivizes the agent to take paths conservatively avoiding the obstacle on the left. The rightmost reward vector incentivizes the agent to get to the parking spot as quickly as possible. We see that the EWP TD-learning algorithm of [ZCZ+21] more accurately estimates the return distribution corresponding to the latter reward vector, while our signed categorical TD algorithm more accurately estimates the return distribution corresponding to the former. In both cases, both methods produce accurate estimations.

[Figure 6: Multi-return distributions learned by signed categorical TD ("Deep Signed Cat-TD") and EWP TD ("Deep EWP-TD") over the lateral and parked features, as well as predicted return distributions for two randomly sampled reward functions; the reported Cramér distances are 0.0823 (EWP) vs. 0.0806 (Cat) for the first reward and 0.0088 (EWP) vs. 0.0091 (Cat) for the second.]
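As an aside, the zero-shot predictions in the right-hand panels of Figure 6 require no additional machinery at evaluation time: given samples of the multivariate return $G^\pi(x)$ and a weight vector $w$, samples of the scalar return $\langle G^\pi(x), w\rangle$ are just inner products. A sketch with synthetic placeholder return samples:

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(512, 2))          # stand-in for learned multivariate return samples
w = rng.normal(size=2)                 # randomly sampled reward weight vector
scalar_returns = G @ w                 # samples of the scalar return <G, w>
print(np.quantile(scalar_returns, [0.1, 0.5, 0.9]))
```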
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: As claimed, we provided convergence results for multivariate DRL algorithms, and tested them through simulations.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We have explicitly stated all assumptions, and our simulation section and conclusion explicitly discuss the limitations of our proposed approach.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All technical results are proved rigorously in the appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Algorithms are described explicitly, and the experimental setup is detailed.
5. Open Access to Data and Code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Code will be provided.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Hyperparameters are disclosed.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: 95% confidence intervals are shown in relevant plots.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: This is discussed in the appendix.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We adhere to the code of ethics and preserve anonymity.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The contributions are theoretical in nature, and do not have any immediate societal impacts.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper does not pose any such risks.

12. Licenses for Existing Assets
Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.
13. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release any new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.