# Metrics and Continuity in Reinforcement Learning

Charline Le Lan,*¹ Marc G. Bellemare,² Pablo Samuel Castro²
¹University of Oxford, ²Google Research, Brain Team
charline.lelan@stats.ox.ac.uk, {bellemare,psc}@google.com

*Work performed while a Google Student Researcher.

In most practical applications of reinforcement learning, it is untenable to maintain direct estimates for individual states; in continuous-state systems, it is impossible. Instead, researchers often leverage state similarity (whether explicitly or implicitly) to build models that can generalize well from a limited set of samples. The notion of state similarity used, and the neighbourhoods and topologies it induces, is thus of crucial importance, as it will directly affect the performance of the algorithms. Indeed, a number of recent works introduce algorithms assuming the existence of well-behaved neighbourhoods, but leave the full specification of such topologies for future work. In this paper we introduce a unified formalism for defining these topologies through the lens of metrics. We establish a hierarchy amongst these metrics and demonstrate their theoretical implications on the Markov Decision Process specifying the reinforcement learning problem. We complement our theoretical results with empirical evaluations showcasing the differences between the metrics considered.

## Introduction

A simple principle for generalization in reinforcement learning is to require that similar states be assigned similar predictions. State aggregation implements a coarse version of this principle by using a notion of similarity to group states together. A finer implementation is to use the similarity in an adaptive fashion, for example by means of a nearest-neighbour scheme over representative states. This approach is classically employed in the design of algorithms for continuous state spaces, where the fundamental assumption is the existence of a metric characterizing the real-valued distance between states.

To illustrate this idea, consider the three similarity metrics depicted in Figure 1. The metric $d_1$ isolates each state, the metric $d_3$ groups together all states, while the metric $d_2$ aggregates states based on the similarity in their long-term dynamics. In terms of generalization, $d_1$ would not be expected to generalize well, as new states cannot leverage knowledge from previous states; $d_3$ can cheaply generalize to new states, but at the expense of accuracy; $d_2$, on the other hand, seems to strike a good balance between the two extremes.

Figure 1: A simple five-state MDP (top) with the neighbourhoods induced by three metrics: an identity metric which isolates each state ($d_1$); a metric which captures behavioral proximity ($d_2$); and a metric which is not able to distinguish states ($d_3$). The yellow circles represent $\epsilon$-balls in the corresponding metric spaces. The bottom row indicates the $V^*$ values for each state.

In this paper we study the effectiveness of behavioural metrics at providing a good notion of state similarity. We call behavioural metrics the class of metrics derived from properties of the environment, typically measuring differences in reward and transition functions.
Since the introduction of bisimulation metrics (Ferns, Panangaden, and Precup 2004, 2005), a number of behavioural metrics have emerged with additional desirable properties, including lax bisimulation (Taylor, Precup, and Panagaden 2009; Castro and Precup 2010) and $\pi$-bisimulation metrics (Castro 2020). Behavioural metrics are of particular interest in the context of understanding generalization, since they directly encode the differences in action-conditional outcomes between states, and hence allow us to make meaningful statements about the relationship between these states.

We focus on the interplay between behavioural metrics and the continuity properties they induce on various functions of interest in reinforcement learning. Returning to our example, $V^*$ is only continuous with respect to $d_1$ and $d_2$. The continuity of a set of functions (with respect to a given metric) is assumed in most theoretical results for continuous state spaces, such as uniform continuity of the transition function (Kakade, Kearns, and Langford 2003), Lipschitz continuity of all Q-functions of policies (Pazis and Parr 2013), and Lipschitz continuity of the rewards and transitions (Zhao and Zhu 2014; Ok, Proutiere, and Tranos 2018) or of the optimal Q-function (Song and Sun 2019; Touati, Taiga, and Bellemare 2020; Sinclair, Banerjee, and Yu 2019). We find that behavioural metrics support these algorithms to varying degrees: the original bisimulation metric, for example, provides fewer guarantees than what is required by some near-optimal exploration algorithms (Pazis and Parr 2013). These results are particularly significant given that behavioural metrics form a relatively privileged group: any metric that enables generalization must in some sense reflect the structure of interactions within the environment and hence act like a behavioural metric.

Our aim is to unify representations of state spaces and the notion of continuity via a taxonomy of metrics. Our first contribution is a general result about the continuity relationships of different functions of the MDP (Theorem 1). Gelada et al. (2019) (resp. Norets (2010)) proved the uniform Lipschitz continuity of the optimal action-value function (resp. local continuity of the optimal value function) given the uniform Lipschitz continuity (resp. local continuity) of the reward and transition functions, and Rachelson and Lagoudakis (2010) showed the uniform Lipschitz continuity of the value function given the uniform Lipschitz continuity of the action-value function in the case of deterministic policies. Theorem 1 is a more comprehensive result about the different components of the MDP (reward and transition functions, value and action-value functions), covering a spectrum of continuity notions (local and uniform continuity, local and uniform Lipschitz continuity) and applicable with stochastic policies; it also provides counterexamples demonstrating that these relationships are one-way implications.

Our second contribution is to demonstrate that different metrics lead to different notions of continuity for different classes of functions (Section Continuity: Prior Metrics, Section Value-Based Metrics, and Table 2). We first study metrics that have been introduced in the literature (presented in Section Prior Metrics and Abstractions). While Li, Walsh, and Littman (2006) provide a unified treatment of some of these metrics, they do not analyse these abstractions through the lens of continuity.
Using our taxonomy, we find that most commonly discussed metrics are actually poorly suited for algorithms that convert representations into values, so we introduce new metrics to overcome this shortcoming (Section Value-Based Metrics). We also analyse the relationships between the topologies induced by all the metrics in our taxonomy (Theorem 2). Finally, we present an empirical evaluation that supports our taxonomy and shows the importance of the choice of a neighbourhood in reinforcement learning algorithms (Section Empirical Evaluation).

## Background

We consider an agent interacting with an environment, modelled as a Markov Decision Process (MDP) $M = \langle S, A, R, P, \gamma \rangle$ (Puterman 1994). Here $S$ is a continuous state space with Borel $\sigma$-algebra $\Sigma$ and $A$ a discrete set of actions. Denoting by $\Delta(X)$ the set of probability distributions over $X$, we also have that $P : S \times A \to \Delta(S)$ is the transition function, $R : S \times A \to [0, R_{\max}]$ is the measurable reward function, and $\gamma \in [0, 1)$ is the discount factor. We write $P^a_s$ to denote the next-state distribution over $S$ resulting from selecting action $a$ in $s$, and write $R^a_s$ for the corresponding reward.

A stationary policy $\pi : S \to \Delta(A)$ is a mapping from states to distributions over actions, describing a particular way of interacting with the environment. We denote the set of all policies by $\Pi$. For any policy $\pi \in \Pi$, the value function $V^\pi(s)$ measures the expected discounted sum of rewards received when starting from state $s \in S$ and acting according to $\pi$:

$$V^\pi(s) := \mathbb{E}\Big[ \sum_{t \ge 0} \gamma^t R^{a_t}_{s_t} \;\Big|\; s_0 = s,\ a_t \sim \pi(\cdot \mid s_t) \Big].$$

The maximum attainable value is $V_{\max} := \frac{R_{\max}}{1 - \gamma}$. The value function satisfies Bellman's equation:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ R^a_s + \gamma\, \mathbb{E}_{s' \sim P^a_s} V^\pi(s') \big].$$

The state-action value function or Q-function $Q^\pi$ describes the expected discounted sum of rewards when action $a \in A$ is selected from the starting state $s$, and satisfies the recurrence

$$Q^\pi(s, a) = R^a_s + \gamma\, \mathbb{E}_{s' \sim P^a_s} V^\pi(s').$$

A policy $\pi$ is said to be optimal if it maximizes the value function at all states: $V^\pi(s) = \max_{\pi' \in \Pi} V^{\pi'}(s)$ for all $s \in S$. The existence of an optimal policy is guaranteed in both finite and infinite state spaces. We will denote this policy $\pi^* \in \Pi$. The corresponding value function and Q-function are denoted respectively $V^*$ and $Q^*$.

## Metrics, Topologies, and Continuity

We begin by recalling standard definitions regarding metrics and continuity, two concepts central to our work.

Definition 1 (Royden, 1968). A metric space $\langle X, d \rangle$ is a nonempty set $X$ of elements (called points) together with a real-valued function $d$ defined on $X \times X$ such that for all $x$, $y$, and $z$ in $X$: $d(x, y) \ge 0$; $d(x, y) = 0$ if and only if $x = y$; $d(x, y) = d(y, x)$; and $d(x, y) \le d(x, z) + d(z, y)$. The function $d$ is called a metric. A pseudo-metric $d$ is a metric with the second condition replaced by the weaker condition $x = y \implies d(x, y) = 0$. In what follows, we will often use "metric" to stand for "pseudo-metric" for brevity.

A metric $d$ is useful for our purpose as it quantifies, in a real-valued sense, the relationship between states of the environment. Given a state $s$, a natural question is: what other states are similar to it? The notion of a topology gives a formal answer.

Definition 2 (Sutherland, 2009). A metric space $\langle X, d \rangle$ induces a topology $(X, T_d)$ defined as the collection of open subsets of $X$; specifically, the subsets $U \subseteq X$ that satisfy the property that for each $x \in U$, there exists $\epsilon > 0$ such that the $\epsilon$-neighbourhood $B_d(x, \epsilon) = \{ y \in X \mid d(y, x) < \epsilon \} \subseteq U$.

Let $(X, T)$ and $(X, T')$ be two topologies on the same space $X$.
We say that $T$ is coarser than $T'$, or equivalently that $T'$ is finer than $T$, if $T \subseteq T'$.

Given two similar states under a metric $d$, we are interested in knowing how functions of these states behave. In the introductory example, we asked specifically: how does the optimal value function behave for similar states? This leads us to the notion of functional continuity. Given a function $f : X \to Y$ between a metric space $(X, d_X)$ and a metric space $(Y, d_Y)$:

- Local continuity (LC): $f$ is locally continuous at $x \in X$ if for any $\epsilon > 0$, there exists $\delta_{x,\epsilon} > 0$ such that for all $x' \in X$, $d_X(x, x') < \delta_{x,\epsilon} \implies d_Y(f(x), f(x')) < \epsilon$. $f$ is said to be locally continuous on $X$ if it is continuous at every point $x \in X$.
- Uniform continuity (UC): $f$ is uniformly continuous on $X$ when given any $\epsilon > 0$, there exists $\delta_\epsilon > 0$ such that for all $x, x' \in X$, $d_X(x, x') < \delta_\epsilon \implies d_Y(f(x), f(x')) < \epsilon$.
- Local Lipschitz continuity (LLC): $f$ is locally Lipschitz continuous at $x \in X$ if there exist $\delta_x > 0$ and $K_x > 0$ such that for all $x', x'' \in B_{d_X}(x, \delta_x)$, $d_Y(f(x'), f(x'')) \le K_x\, d_X(x', x'')$.
- Uniform Lipschitz continuity (ULC): $f$ is uniformly Lipschitz continuous if there exists $K > 0$ such that for all $x, x' \in X$ we have $d_Y(f(x), f(x')) \le K\, d_X(x, x')$.

The relationship between these different forms of continuity is summarized by the following diagram, where an arrow indicates implication; for example, any function that is ULC is also UC:

$$\text{ULC} \Longrightarrow \text{UC} \Longrightarrow \text{LC}, \qquad \text{ULC} \Longrightarrow \text{LLC} \Longrightarrow \text{LC}. \tag{1}$$

Here, we are interested in functions of states and state-action pairs. Knowing whether a particular function $f$ possesses some continuity property $p$ under a metric $d$ informs us on how well we can extrapolate the value $f(s)$ to other states; in other words, it informs us on the generalization properties of $d$.

## Prior Metrics and Abstractions

The simplest structure is to associate states to distinct groups, what is often called state aggregation (Bertsekas 2011). This gives rise to an equivalence relation, which we interpret as a discrete pseudo-metric, that is, a metric taking a countable range of values.

Definition 3. An equivalence relation $E \subseteq X \times X$ induces a discrete pseudo-metric $e_E$ where $e_E(x, x') = 0$ if $(x, x') \in E$, and $1$ otherwise.

Throughout the text, we will use $e$ to denote discrete pseudo-metrics. Two extremal examples of metrics are the identity metric $e_I : S \times S \to \{0, 1\}$, induced by the identity relation $I = \{(s, t) \in S \times S \mid s = t\}$ (e.g. $d_1$ in Figure 1), and the trivial metric $e_T : S \times S \to \{0\}$ that collapses all states together (e.g. $d_3$ in Figure 1).

In between these extremes, $\eta$-abstractions (Li, Walsh, and Littman 2006; Abel, Hershkowitz, and Littman 2017) are functions $\varphi : S \to \hat S$ that aggregate states which are mapped close to each other by a function $f$. That is, given a threshold $\eta \ge 0$ and $f : S \times A \to \mathbb{R}$,

$$\varphi_{f,\eta}(s) = \varphi_{f,\eta}(t) \implies |f(s, a) - f(t, a)| \le \eta.$$

We list a few choices for $f$ along with the name of the abstraction we will refer to throughout this text in Table 1.

| $f$ | $\varphi_{f,\eta}$ |
| --- | --- |
| $Q^*$ | approximate $Q^*$ function abstraction ($\eta \ge 0$) / $Q^*$-irrelevance ($\eta = 0$) |
| $R$ and $P$ | approximate model abstraction ($\eta \ge 0$) / model-irrelevance ($\eta = 0$) |
| $Q^\pi$ | $Q^\pi$-irrelevance abstraction ($\eta = 0$) |
| $\max_A Q^*$ | $a^*$-irrelevance abstraction ($\eta = 0$) |

Table 1: Different types of state abstractions.

$\eta$-abstractions are defined in terms of a particular function of direct relevance to the agent. However, it is not immediately clear whether these abstractions are descriptive and, more specifically, what kind of continuity properties they support.
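To make the $\eta$-abstraction concrete, here is a minimal illustrative sketch (ours, not from the paper; function and variable names are our own) that greedily builds $\varphi_{f,\eta}$ over a finite state set, given tabulated values $f(s, a)$, together with the induced discrete pseudo-metric $e_{\varphi_{f,\eta}}$.

```python
# Illustrative sketch (not the authors' code): a greedy eta-abstraction over a
# finite state space and the discrete pseudo-metric it induces.
import numpy as np

def eta_abstraction(f_values: np.ndarray, eta: float) -> np.ndarray:
    """Group states so that phi(s) == phi(t) implies max_a |f(s,a) - f(t,a)| <= eta.

    f_values has shape [n_states, n_actions]; returns phi, an array mapping
    each state to the index of its abstract state."""
    n_states = f_values.shape[0]
    phi = np.full(n_states, -1, dtype=int)
    clusters = []  # lists of ground states per abstract state
    for s in range(n_states):
        for k, members in enumerate(clusters):
            # Only join a cluster if s is within eta of *every* member, so the
            # defining implication of an eta-abstraction holds for all pairs.
            if all(np.max(np.abs(f_values[s] - f_values[t])) <= eta for t in members):
                members.append(s)
                phi[s] = k
                break
        if phi[s] == -1:
            clusters.append([s])
            phi[s] = len(clusters) - 1
    return phi

def e_phi(phi: np.ndarray, s: int, t: int) -> float:
    """Discrete pseudo-metric: 0 if the two states are aggregated, 1 otherwise."""
    return 0.0 if phi[s] == phi[t] else 1.0

# With eta = 0 and f = Q*, this recovers a Q*-irrelevance abstraction; eta > 0
# gives an approximate Q* abstraction (cf. Table 1).
rng = np.random.default_rng(0)
f = rng.random((6, 2))            # stand-in for f(s, a), e.g. Q*(s, a)
phi = eta_abstraction(f, eta=0.3)
print(phi, e_phi(phi, 0, 1))
```

Lemma 5 below makes precise that only the $\eta = 0$ case yields continuity guarantees for $f$ with respect to $e_{\varphi_{f,\eta}}$.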
An alternative is to relate states based on the outcomes that arise from different choices, starting in these states. These are bisimulation relations (Givan, Dean, and Greig 2003).

Definition 4. Let $E \subseteq S \times S$ be an equivalence relation, $S_E$ the quotient space, and $\Sigma(E)$ the $\Sigma$-measurable sets closed under $E$. Whenever $(s, t) \in E$:

- Bisimulation relation (Givan, Dean, and Greig, 2003). Behavioral indistinguishability under equal actions; namely, if for any action $a \in A$, $R^a_s = R^a_t$ and $P^a_s(X) = P^a_t(X)$ for all $X \in \Sigma(E)$, we call $E$ a bisimulation relation. We denote the largest bisimulation relation by $\sim$, and its corresponding discrete metric by $e_\sim$.
- Lax-bisimulation relation (Taylor, Precup, and Panagaden, 2009). Behavioral indistinguishability under matching actions; namely, if for any action $a \in A$ from state $s$ there is an action $b \in A$ from state $t$ such that $R^a_s = R^b_t$ and $P^a_s(X) = P^b_t(X)$ for all $X \in \Sigma(E)$, and vice-versa, we call $E$ a lax-bisimulation relation. We denote the largest lax-bisimulation relation by $\sim_{lax}$, and its corresponding discrete metric by $e_{lax}$.
- $\pi$-bisimulation relation (Castro, 2020). Behavioral indistinguishability under a fixed policy; namely, given a policy $\pi \in \Pi$, if $\sum_{a \in A} \pi(a \mid s) R^a_s = \sum_{a \in A} \pi(a \mid t) R^a_t$ and $\sum_{a \in A} \pi(a \mid s) P^a_s(X) = \sum_{a \in A} \pi(a \mid t) P^a_t(X)$ for all $X \in \Sigma(E)$, we call $E$ a $\pi$-bisimulation relation. We denote the largest $\pi$-bisimulation relation by $\sim_\pi$, and its corresponding discrete metric by $e_\pi$.

A bisimulation metric is the continuous generalization of a bisimulation relation. Formally, $d$ is a bisimulation metric if its kernel is equivalent to the bisimulation relation. The canonical bisimulation metric (Ferns, Panangaden, and Precup 2005) is constructed from the Wasserstein distance between probability distributions.

Definition 5. Let $(Y, d_Y)$ be a metric space with Borel $\sigma$-algebra $\Sigma$. The Wasserstein distance (Villani 2008) between two probability measures $P$ and $Q$ on $Y$, under a given metric $d_Y$, is given by

$$W_{d_Y}(P, Q) = \inf_{\lambda \in \Gamma(P, Q)} \mathbb{E}_{(x, y) \sim \lambda}\big[ d_Y(x, y) \big],$$

where $\Gamma(P, Q)$ is the set of couplings between $P$ and $Q$.

Lemma 1 (Ferns, Panangaden, and Precup, 2005). Let $\mathbb{M}$ be the space of state pseudo-metrics and define the functional $F : \mathbb{M} \to \mathbb{M}$ as

$$F(d)(x, y) = \max_{a \in A} \big( |R^a_x - R^a_y| + \gamma\, W_d(P^a_x, P^a_y) \big).$$

Then $F$ has a least fixed point $d_\sim$, and $d_\sim$ is a bisimulation metric.

In words, bisimulation metrics arise as the fixed points of an operator on the space of pseudo-metrics. Lax bisimulation metrics $d_{lax}$ and $\pi$-bisimulation metrics $d_\pi$ can be defined in an analogous fashion; for succinctness, their formal definitions are included in the appendix.
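On a finite MDP, the fixed point of Lemma 1 can be computed directly by iterating $F$, solving each Wasserstein distance as a small optimal-transport linear program. The sketch below is our own illustration under the assumption of tabular rewards `R[s, a]` and transitions `P[s, a, s']`; it is not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): the bisimulation metric of a
# small finite MDP via fixed-point iteration on
#   F(d)(x, y) = max_a ( |R(x,a) - R(y,a)| + gamma * W_d(P(x,a,.), P(y,a,.)) ),
# with W_d computed as an optimal-transport linear program.
import numpy as np
from scipy.optimize import linprog

def wasserstein(p, q, d):
    """W_d(p, q) for discrete distributions p, q over n points, ground metric d."""
    n = len(p)
    # Variables: a coupling lambda[i, j] >= 0, flattened row-major.
    cost = d.reshape(-1)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row marginals: sum_j lambda[i, j] = p[i]
        A_eq[n + i, i::n] = 1.0            # column marginals: sum_i lambda[i, j] = q[i]
    b_eq = np.concatenate([p, q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

def bisimulation_metric(R, P, gamma, tol=1e-6):
    """R: [S, A] rewards, P: [S, A, S] transition probabilities."""
    n_s, n_a = R.shape
    d = np.zeros((n_s, n_s))
    while True:
        d_next = np.zeros_like(d)
        for x in range(n_s):
            for y in range(x + 1, n_s):
                vals = [abs(R[x, a] - R[y, a]) + gamma * wasserstein(P[x, a], P[y, a], d)
                        for a in range(n_a)]
                d_next[x, y] = d_next[y, x] = max(vals)
        if np.max(np.abs(d_next - d)) < tol:
            return d_next
        d = d_next
```

Since $F$ is a contraction, the iteration converges; the $|S|^2 |A|$ transport problems per sweep are what drive the steep (though polynomial) complexities reported later in Table 2.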
## Continuity Relationships

Our first result characterizes the continuity relationships between key functions of the MDP. The theorem considers different forms of continuity and relates how the continuity of one function implies another. While the particular case of uniform Lipschitz continuity of $Q^*$ (resp. local continuity of $V^*$) from $P + R$ has been remarked on before by Gelada et al. (2019) (resp. Norets (2010)), as well as the case of uniform Lipschitz continuity of $V^\pi$ given the uniform Lipschitz continuity of $Q^\pi$ for stochastic policies $\pi$ (Rachelson and Lagoudakis 2010), to the best of our knowledge this is the first comprehensive treatment of the topic, in particular providing counterexamples.

Theorem 1. If we decompose the Cartesian product $S \times A$ as $d_{S \times A}\big((s, a), (s', a')\big) = d_S(s, s') + d_A(a, a')$ with $d_A$ the identity metric, the LC, UC and LLC relationships between $P$, $R$, $V^\pi$, $V^*$, $Q^\pi$ and $Q^*$ are given by the following diagram.

(Diagram 3: directed graph of continuity implications between $P + R$, $Q^\pi$, $Q^*$, $V^\pi$ and $V^*$, with some arrows labelled $\pi$-cont.)

A directed arrow $f \to g$ indicates that function $g$ is continuous whenever $f$ is continuous. Labels on arrows indicate conditions that are necessary for that implication to hold. $P + R$ stands for continuity of both $P$ and $R$; $\pi$-cont indicates continuity of $\pi : S \to \Delta(A)$. An absence of a directed arrow indicates that there exists a counterexample proving that the implication does not hold. In the ULC case, the previous relationships also hold with the following additional assumptions: $\gamma L_P < 1$ for $P + R \to Q^*$, and $\gamma L_P(1 + L_\pi) < 1$ for $P + R \xrightarrow{\pi\text{-cont}} Q^\pi$, where $L_P$ and $L_\pi$ are the Lipschitz constants of $P$ and $\pi$, respectively.

Proof. All proofs and counterexamples are provided in the appendix.

The arrows are transitive and apply for all forms of continuity illustrated in Diagram (1); for example, if we have ULC for $Q^*$, this implies we have LC for $V^*$. This diagram is useful when evaluating metrics, as it clarifies the strongest (or weakest) form of continuity one can demonstrate. When considering deterministic policies, we can notice that the $\pi$-continuity mentioned in Theorem 1 is very restrictive, as the following lemma shows.

Lemma 2. If a deterministic policy $\pi : S \to A$ is continuous, $S$ is connected¹ and $A$ is discrete, then $\pi$ is globally constant.

¹ A connected space is a topological space that cannot be represented as the union of two or more disjoint non-empty open subsets.

## Taxonomy of Metrics

We now study how different metrics support the continuity of functions relevant to reinforcement learning and the relationship between their induced topologies. While the taxonomy we present here is of independent interest, it also provides a clear theoretical foundation on which to build results regarding metric-respecting embeddings (Gelada et al. 2019; Zhang et al. 2020).

### Continuity: Prior Metrics

We begin the exposition by considering the continuity induced by discrete metrics. These enable us to analyze the properties of some representations found in the literature. The extremes of our metric hierarchy are the identity metric $e_I$ and the trivial metric $e_T$, which respectively render all functions and only constant functions continuous, and were represented by $d_1$ and $d_3$ in the introductory example.

Lemma 3 (Identity metric). $e_I$ induces the finest topology on $S$, made of all possible subsets of $S$. Let $(Y, d_Y)$ be any metric space. Any function $h$ (resp. any bounded $h$) $: (S, e_I) \to (Y, d_Y)$ is LC and UC (resp. ULC).

Lemma 4 (Trivial metric). $e_T$ induces the coarsest topology on $S$, consisting solely of $\{\emptyset, S\}$. Let $(Y, d_Y)$ be any metric space. Any function $h : (S, e_T) \to (Y, d_Y)$ is LC, UC and ULC iff $h$ is constant.

We can also construct a discrete metric from any state aggregation $\varphi : S \to \hat S$ as $e_\varphi(s, t) = e_I(\varphi(s), \varphi(t))$, that is, $e_\varphi(s, t) = 0$ if $\varphi(s) = \varphi(t)$, and $1$ otherwise. However, as stated below, $\eta$-abstractions do not guarantee continuity except in the trivial case where $\eta = 0$.

Lemma 5. If $\eta = 0$, then any function $f$ (resp. bounded function $f$) $: (S, d_S) \to (Y, d_Y)$ is LC and UC (resp. ULC) with respect to the pseudo-metric $e_{\varphi_{f,\eta}}$. However, given a function $f$ and $\eta > 0$, there exists an $\eta$-abstraction $\varphi_{f,\eta}$ such that $f$ is not continuous with respect to $e_{\varphi_{f,\eta}}$.

Unlike the discrete metrics defined by $\eta$-abstractions, both bisimulation metrics and the metric induced by the bisimulation relation support continuity of the optimal value function.

Lemma 6. $Q^*$ (resp. $Q^\pi$) is ULC with Lipschitz constant 1 with respect to $d_\sim$ (resp. $d_\pi$).

Corollary 1. $Q^*$ (resp. $Q^\pi$) is ULC with Lipschitz constant $V_{\max}$ with respect to $e_\sim$ (resp. $e_\pi$).

We note that Ferns, Panangaden, and Precup (2004) proved a weaker statement involving $V^*$ (resp. Castro, Panangaden, and Precup (2009), $V^\pi$).
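The following is a sketch (ours, not the appendix proof) of the standard argument behind Lemma 6 for $Q^*$, assuming the fixed-point characterization of $d_\sim$ from Lemma 1, the 1-Lipschitzness of $V^*$ with respect to $d_\sim$ (which follows by induction over value-iteration iterates), and Kantorovich-Rubinstein duality for the Wasserstein distance:

```latex
\begin{align*}
|Q^*(s,a) - Q^*(t,a)|
  &\le |R^a_s - R^a_t|
     + \gamma \left| \mathbb{E}_{s' \sim P^a_s} V^*(s') - \mathbb{E}_{t' \sim P^a_t} V^*(t') \right| \\
  &\le |R^a_s - R^a_t| + \gamma\, W_{d_\sim}\!\left(P^a_s, P^a_t\right)
     \qquad \text{(duality, $V^*$ 1-Lipschitz w.r.t.\ $d_\sim$)} \\
  &\le \max_{a' \in A}\left( |R^{a'}_s - R^{a'}_t|
     + \gamma\, W_{d_\sim}(P^{a'}_s, P^{a'}_t) \right)
   = F(d_\sim)(s,t) = d_\sim(s,t).
\end{align*}
```

Taking the maximum over $a$ bounds a value-difference pseudo-metric by $d_\sim$ pointwise and, since pointwise domination of one metric by another implies a coarser induced topology, comparisons of this form are the mechanism behind the orderings of Theorem 2 below.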
To summarize, metrics that are too coarse may fail to provide the requisite continuity of reinforcement learning functions. Bisimulation metrics are particularly desirable as they achieve a certain degree of coarseness while preserving continuity. In practice, however, Ferns, Panangaden, and Precup's bisimulation metric is difficult to compute and estimate, and tends to be conservative: as long as two states can be distinguished by action sequences, bisimulation will keep them apart.

### Value-Based Metrics

As an alternative to bisimulation metrics, we consider simple metrics constructed from value functions and study their continuity offerings. These metrics are simple in that they are defined in terms of differences between values, or functions of values, at the states being compared. The last metric, $d_{\widetilde\Delta}$, is particularly appealing as it can be approximated, as we describe below. Under this metric, all Q-functions are Lipschitz continuous, supporting some of the more demanding continuous-state exploration algorithms (Pazis and Parr 2013).

Lemma 7. For a given MDP, let $Q^\pi$ be the Q-function of policy $\pi$, and $Q^*$ the optimal Q-function. The following are continuous pseudo-metrics:

1. $d_{\Delta^*}(s, s') = \max_{a \in A} |Q^*(s, a) - Q^*(s', a)|$
2. $d_{\Delta^\pi}(s, s') = \max_{a \in A} |Q^\pi(s, a) - Q^\pi(s', a)|$
3. $d_{\widetilde\Delta}(s, s') = \max_{\pi \in \Pi, a \in A} |Q^\pi(s, a) - Q^\pi(s', a)|$

$Q^*$ (resp. $Q^\pi$) is ULC with Lipschitz constant 1 with respect to $d_{\Delta^*}$ (resp. $d_{\Delta^\pi}$). $Q^\pi$ is ULC with Lipschitz constant 1 with respect to $d_{\widetilde\Delta}$ for any $\pi \in \Pi$.

Remark. When $S$ is finite, the number of policies to consider to compute $d_{\widetilde\Delta}$ is finite:

$$d_{\widetilde\Delta}(s, s') = \max_{\pi \in \Pi, a \in A} |Q^\pi(s, a) - Q^\pi(s', a)| = \max_{\pi \in \Pi_{\mathrm{AVF}}, a \in A} |Q^\pi(s, a) - Q^\pi(s', a)|,$$

where $\Pi_{\mathrm{AVF}}$ is the finite set of extremal policies corresponding to Adversarial Value Functions (AVFs) (Bellemare et al. 2019).

$d_{\widetilde\Delta}$ provides strong continuity of the value function for all policies, unlike any other metric that has been used in the literature. Since computing $d_{\widetilde\Delta}$ is computationally expensive, we will approximate it by the pseudo-metric

$$\hat d_{\mathrm{AVF}(n)}(s, s') = \max_{\pi \in \hat\Pi_{\mathrm{AVF}(n)}, a \in A} |Q^\pi(s, a) - Q^\pi(s', a)|,$$

where $\hat\Pi_{\mathrm{AVF}(n)}$ is a set of $n$ policies sampled from the set of extremal policies $\Pi_{\mathrm{AVF}}$.
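The value-based metrics above are straightforward to compute on a finite MDP once the relevant Q-functions are available. The sketch below (ours, not the authors' code) computes $d_{\Delta^*}$ from $Q^*$ and a sampled variant of $d_{\widetilde\Delta}$; for simplicity it samples random deterministic policies rather than the extremal AVF policies used for $\hat d_{\mathrm{AVF}(n)}$ in the paper, so it is only a stand-in for that construction.

```python
# Illustrative sketch (not the authors' code): the value-based pseudo-metrics
# of Lemma 7 on a finite MDP with R: [S, A] and P: [S, A, S].
import numpy as np

def q_star(R, P, gamma, n_iter=1000):
    """Optimal Q-function by value iteration."""
    Q = np.zeros_like(R)
    for _ in range(n_iter):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

def q_pi(R, P, gamma, pi):
    """Q-function of a deterministic policy pi (array of action indices)."""
    n_s = R.shape[0]
    R_pi = R[np.arange(n_s), pi]
    P_pi = P[np.arange(n_s), pi]
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)  # V^pi = (I - gamma P^pi)^-1 R^pi
    return R + gamma * P @ V

def d_from_q(Q):
    """d(s, s') = max_a |Q(s, a) - Q(s', a)|, returned as an [S, S] matrix."""
    return np.abs(Q[:, None, :] - Q[None, :, :]).max(axis=-1)

def d_over_sampled_policies(R, P, gamma, n_policies, rng):
    """Max over sampled policies of d_from_q(Q^pi); a stand-in for d_AVF(n)."""
    n_s, n_a = R.shape
    d = np.zeros((n_s, n_s))
    for _ in range(n_policies):
        pi = rng.integers(n_a, size=n_s)      # random deterministic policy
        d = np.maximum(d, d_from_q(q_pi(R, P, gamma, pi)))
    return d

# d_Delta* is simply:
#   d_star = d_from_q(q_star(R, P, gamma))
```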
### Categorizing Metrics, Continuity and Complexity

We now formally present in Theorem 2 the topological relationships between the different metrics. This hierarchy is important for generalization purposes as it provides a comparison between the shapes of the different neighbourhoods which serve as a basis for RL algorithms on continuous state spaces.

Theorem 2. The relationships between the topologies induced by the metrics in Table 2 are given by the following diagram. We write $d_1 \preceq d_2$ when $T_{d_1} \subseteq T_{d_2}$, that is, when $T_{d_1}$ is coarser than $T_{d_2}$. Here $d$ denotes any arbitrary metric.

(Diagram: partial order, under $\preceq$, over the topologies induced by $e_I$, $e_\sim$, $e_{lax}$, $e_\pi$, $e_T$, $d_\sim$, $d_{lax}$, $d_\pi$, $d_{\Delta^*}$, $d_{\Delta^\pi}$, $d_{\widetilde\Delta}$, $\hat d_{\mathrm{AVF}(n)}$ and an arbitrary metric $d$.)

Proof. All proofs can be found in the appendix.

The relation $d_{lax} \preceq d_\sim$ was shown by Taylor, Precup, and Panagaden (2009), but not expressed in topological terms. We summarize in Table 2 the continuity results mentioned throughout this section and supplement them with the continuity of the lax-bisimulation metric proven in Taylor, Precup, and Panagaden (2009). To avoid over-cluttering the table, we only specify the strongest form of functional continuity according to Theorem 1. As an additional key differentiator, we also note the complexity of computing these metrics from a full model of the environment, which gives some indication of the difficulty of performing state abstraction. Proofs are provided in the appendix.

From a computational point of view, all continuous metrics can be approximated using deep learning techniques, which makes them even more attractive for building representations. Atari 2600 experiments by Castro (2020) show that $\pi$-bisimulation metrics do perform well in larger domains. This is also supported by Zhang et al. (2020), who use an encoder architecture to learn a representation that respects the bisimulation metric.

| Metric | LC | UC | ULC | LLC | Complexity |
| --- | --- | --- | --- | --- | --- |
| Discrete metric $e_I$ | $Y^S$ | $Y^S$ | $B(Y^S)$ | $B_L(Y^S)$ | $O(|S|)$ |
| Trivial metric $e_T$ | $\{y\}^S$ | $\{y\}^S$ | $\{y\}^S$ | $\{y\}^S$ | $O(1)$ |
| Model-irrelevance | $P, R$ | $P, R$ | $P, R$ | $P, R$ | |
| $Q^\pi$-irrelevance | $Q^\pi$ | $Q^\pi$ | $Q^\pi$ | $Q^\pi$ | |
| $Q^*$-irrelevance | $Q^*$ | $Q^*$ | $Q^*$ | $Q^*$ | |
| $a^*$-irrelevance | $Q^*$ | $Q^*$ | $Q^*$ | $Q^*$ | |
| Approx. abstraction | - | - | - | - | |
| $e_\sim$ | $Q^*$ | $Q^*$ | $Q^*$ | $Q^*$ | $O(|A||S|^3)$ |
| $d_\sim$ | $Q^*$ | $Q^*$ | $Q^*$ | $Q^*$ | $O(|A||S|^5 \log|S| \ln\delta)$ |
| $e_\pi$ | $Q^\pi$ | $Q^\pi$ | $Q^\pi$ | $Q^\pi$ | $O(|S|^3)$ |
| $d_\pi$ | $Q^\pi$ | $Q^\pi$ | $Q^\pi$ | $Q^\pi$ | $O(|S|^5 \log|S| \ln\delta)$ |
| $e_{lax}$ | $V^*$ | $V^*$ | $V^*$ | $V^*$ | $O(|A|^2|S|^3)$ |
| $d_{lax}$ | $V^*$ | $V^*$ | $V^*$ | $V^*$ | $O(|A|^2|S|^5 \log|S| \ln\delta)$ |
| $d_{\Delta^*}$ | $Q^*$ | $Q^*$ | $Q^*$ | $Q^*$ | $O\big(|S|^2|A| \frac{\log(R_{\max}^{-1}\delta(1-\gamma))}{\log\gamma}\big)$ |
| $d_{\Delta^\pi}$ | $Q^\pi$ | $Q^\pi$ | $Q^\pi$ | $Q^\pi$ | $O\big(|S|^2|A| \frac{\log(R_{\max}^{-1}\delta(1-\gamma))}{\log\gamma}\big)$ |
| $d_{\widetilde\Delta}$ | $Q^\pi, \forall\pi \in \Pi$ | $Q^\pi, \forall\pi \in \Pi$ | $Q^\pi, \forall\pi \in \Pi$ | $Q^\pi, \forall\pi \in \Pi$ | NP-hard? (Bellemare et al. 2019) |

Table 2: Categorization of state metrics, their continuity implications, and their complexity (when known). The notation $\{y\}^S$ denotes any function $h : S \to Y$ that is constant; $Y^S$ refers to all functions $h : S \to Y$; $B(Y^S)$ (resp. $B_L(Y^S)$) is a bounded (resp. locally bounded) function $h : S \to Y$. "-" denotes an absence of LC, UC, ULC and LLC. In the complexity column, $\delta$ is the desired accuracy.

## Empirical Evaluation

We now conduct an empirical evaluation to quantify the magnitude of the effects studied in the previous sections. Specifically, we are interested in how approximations derived from different metrics impact the performance of basic reinforcement learning procedures. We consider two kinds of approximations, state aggregation and nearest neighbour, which we combine with six representative metrics: $e_\sim$, $e_{lax}$, $d_\sim$, $d_{lax}$, $d_{\Delta^*}$, and $\hat d_{\mathrm{AVF}(50)}$.

We conduct our experiments on Garnet MDPs, which are a class of randomly generated MDPs (Archibald, McKinnon, and Thomas 1995; Piot, Geist, and Pietquin 2014). Specifically, a Garnet MDP $\mathrm{Garnet}(n_S, n_A)$ is parameterized by two values, the number of states $n_S$ and the number of actions $n_A$, and is generated as follows:

1. The branching factor $b_{s,a}$ of each transition $P^a_s$ is sampled uniformly from $\{1, \dots, n_S\}$.
2. $b_{s,a}$ states are picked uniformly at random from $S$ and assigned a random value in $[0, 1]$; these values are then normalized to produce a proper distribution $P^a_s$.
3. Each $R^a_s$ is sampled uniformly in $[0, 1]$.

The use of Garnet MDPs grants us a less biased comparison of the different metrics than if we were to pick a few specific MDPs. Nonetheless, we do provide extra experiments on a set of Grid World tasks in the appendix. The code used to produce all these experiments is open-sourced.²

² Code available at https://github.com/google-research/google-research/tree/master/rl_metrics_aaai2021

### Generalizing the Value Function V*

We begin by studying the approximation error that arises when extrapolating the optimal value function $V^*$ from a subset of states. Specifically, given a subsampling fraction $f \in [0, 1]$, we sample $|S| \cdot f$ states and call this set $\kappa$. For each unknown state $s \in S \setminus \kappa$, we find its nearest known neighbour according to metric $d$: $NN(s) = \arg\min_{t \in \kappa} d(s, t)$. We then define the approximate optimal value function as $\hat V^*(s) = V^*(NN(s))$, and report the approximation error in Figure 2 (left).
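A minimal sketch of this setup (ours, not the released code, which may differ in details such as tie-breaking, the discount factor, and the subsampling scheme): generate a Garnet MDP following the three steps above, compute $V^*$, and measure the nearest-neighbour extrapolation error under a given pairwise metric.

```python
# Illustrative sketch (not the authors' code): Garnet MDP generation and the
# 1-nearest-neighbour generalization error of V* under a metric d.
import numpy as np

def garnet(n_states, n_actions, rng):
    """Garnet(n_S, n_A): random rewards and sparse random transitions."""
    R = rng.random((n_states, n_actions))
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            b = rng.integers(1, n_states + 1)                  # branching factor
            nxt = rng.choice(n_states, size=b, replace=False)  # successor states
            w = rng.random(b)
            P[s, a, nxt] = w / w.sum()                         # normalized weights
    return R, P

def v_star(R, P, gamma, n_iter=1000):
    V = np.zeros(R.shape[0])
    for _ in range(n_iter):
        V = (R + gamma * P @ V).max(axis=1)
    return V

def nn_value_error(V, d, known):
    """Mean absolute error when each unknown state copies the V* value of its
    nearest known state under the pairwise metric matrix d."""
    n_s = len(V)
    unknown = np.setdiff1d(np.arange(n_s), known)
    nearest = known[np.argmin(d[np.ix_(unknown, known)], axis=1)]
    return np.mean(np.abs(V[unknown] - V[nearest]))

rng = np.random.default_rng(0)
R, P = garnet(200, 5, rng)
V = v_star(R, P, gamma=0.9)                  # gamma chosen arbitrarily for the example
d_identity = 1.0 - np.eye(200)               # e_I as a pairwise matrix, for illustration
known = rng.choice(200, size=50, replace=False)
print(nn_value_error(V, d_identity, known))
```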
This experiment gives us insight into how amenable the different metrics are to transferring value estimates across states; effectively, their generalization capabilities. According to Theorem 2, the two discrete metrics $e_\sim$ and $e_{lax}$ induce finer topologies than their four continuous counterparts. With most states isolated from each other in these two representations, $e_\sim$ and $e_{lax}$ perform poorly. The three continuous metrics $d_\sim$, $d_{lax}$ and $d_{\Delta^*}$ all guarantee Lipschitz continuity of $V^*$, while $\hat d_{\mathrm{AVF}(50)}$ is approximately $V^*$-Lipschitz continuous. However, $d_{lax}$ (resp. $d_{\Delta^*}$) produces a coarser (resp. approximately coarser) topology than $d_\sim$ (resp. $\hat d_{\mathrm{AVF}(50)}$) (see Theorem 2). This is reflected in their better generalization error compared to the latter two metrics. Additionally, the lax bisimulation metric $d_{lax}$ outperforms $d_{\Delta^*}$ substantially, which can be explained by noting that $d_{lax}$ measures distances between two states under independent action choices, contrary to all other metrics.

### Generalizing the Q-function Q*

We now illustrate the continuity (or absence thereof) of $Q^*$ with respect to the different metrics. In Figure 2 (center), we perform a similar experiment as the previous one, still using a 1-nearest-neighbour scheme but now extrapolating $Q^*$. As expected, we find that metrics that do not support $Q^*$-continuity, including $d_{lax}$, cannot generalize from a subset of states, and their average error decreases linearly. In contrast, the three other metrics are able to generalize. Naturally, $d_{\Delta^*}$, which aggregates states based on $Q^*$, performs particularly well. However, we note that $d_{\Delta^*}$ also outperforms the bisimulation metric $d_\sim$, highlighting the latter's conservativeness, which tends to separate states more. By our earlier argument regarding $d_{lax}$, this suggests there may be a class of functions, not represented in Table 2, which is continuous under $d_{\Delta^*}$ but not $\hat d_{\mathrm{AVF}(50)}$.

Figure 2: Errors when approximating the optimal value function (left) and optimal Q-function (center) via nearest neighbours, and errors when performing value iteration on aggregated states (right). The x-axes show the number of known states (left, center) and the number of aggregate states (right). Curves for $e_\sim$ and $e_{lax}$ cover each other on all of the plots. Averaged over 100 Garnet MDPs with 200 states and 5 actions, with 50 independent runs for each (to account for subsampling differences). Confidence intervals were very small due to the large number of runs, so they are not shown.

### Approximate Value Iteration

As a final experiment, we perform approximate value iteration using a state aggregation $\varphi$ derived from one of the metrics. For each metric, we perform 10 different aggregations using a k-median algorithm, ranging from one aggregate state to 200 aggregate states. For a given aggregate state $c$, let $\hat Q(c, a)$ stand for its associated Q-value. The approximate value iteration update is

$$\hat Q_{k+1}(c, a) \leftarrow \frac{1}{|\varphi^{-1}(c)|} \sum_{s \in \varphi^{-1}(c)} \Big[ R^a_s + \gamma\, \mathbb{E}_{s' \sim P^a_s} \max_{a' \in A} \hat Q_k(\varphi(s'), a') \Big].$$

We can then measure the error induced by our aggregation via

$$\max_{a \in A} \frac{1}{|S|} \sum_{s \in S} \big| Q^*(s, a) - \hat Q_k(\varphi(s), a) \big|,$$

which we display in the rightmost panel of Figure 2.
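A sketch of this procedure (ours, not the authors' code): given any aggregation $\varphi$ mapping ground states to aggregate states, which in the paper is obtained with a k-median algorithm under the chosen metric, run the averaged backup above and compare against $Q^*$.

```python
# Illustrative sketch (not the authors' code): approximate value iteration on
# an aggregation phi: S -> {0, ..., n_agg - 1}, and its error against Q*.
import numpy as np

def aggregated_value_iteration(R, P, gamma, phi, n_agg, n_iter=500):
    """R: [S, A], P: [S, A, S], phi: [S] array of aggregate-state indices."""
    Q_agg = np.zeros((n_agg, R.shape[1]))
    for _ in range(n_iter):
        # Back up every ground state through the aggregate Q-function ...
        backup = R + gamma * P @ Q_agg[phi].max(axis=1)          # shape [S, A]
        # ... then average the backups of the ground states in each aggregate state.
        for c in range(n_agg):
            members = np.flatnonzero(phi == c)
            if members.size > 0:
                Q_agg[c] = backup[members].mean(axis=0)
    return Q_agg

def aggregation_error(Q_star, Q_agg, phi):
    """max_a (1/|S|) sum_s |Q*(s, a) - Q_agg(phi(s), a)|."""
    return np.max(np.mean(np.abs(Q_star - Q_agg[phi]), axis=0))
```

A simple way to obtain $\varphi$ from a pairwise metric matrix is to pick $k$ medoid states (for instance greedily or by k-median) and map every state to its closest medoid under the metric.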
As in our second experiment, the metrics that do not support $Q^*$-continuity fail to give good abstractions for approximate value iteration. As for $e_\sim$, the topology induced by this metric is too fine (Theorem 2), leading to poor generalisation results. The performance of $d_{\Delta^*}$ is consistent with Theorem 2, which states that it induces the coarsest topology. However, although it is known that $Q^*$-continuity is sufficient for approximate value iteration (Li, Walsh, and Littman 2006), it is somewhat surprising that $d_{\Delta^*}$ outperforms $\hat d_{\mathrm{AVF}(50)}$, since $\hat d_{\mathrm{AVF}(50)}$ is an approximant of $d_{\widetilde\Delta}$ that is designed to provide continuity with respect to all policies, so it might be expected to yield better approximations at intermediate iterations. Despite this, $\hat d_{\mathrm{AVF}(50)}$ still serves as an interesting and tractable surrogate for $d_{\widetilde\Delta}$.

Behavioural metrics are important both for evaluating the quality of a given state representation and for learning such a representation. We saw that approximate abstractions and equivalence relations are insufficient for continuous-state RL problems, because they either do not support the continuity of common RL functions or induce very fine representations on the state space, leading to poor generalization. Continuous behavioural metrics go one step further by considering the structure of the MDP in their construction and inducing coarser topologies than their discrete counterparts; however, within that class we still find that not all metrics are equally useful. The original bisimulation metric of Ferns, Panangaden, and Precup (2004), for example, is too conservative and has a rather fine topology. This is confirmed by our experiments in Figure 2, where it performs poorly overall. The lax bisimulation metric guarantees the continuity of $V^*$, which makes it suitable for transferring optimal values between states, but fails to preserve continuity of $Q^*$. Together with our analysis, this suggests the $d_{\Delta^*}$ and $d_{\widetilde\Delta}$ metrics are interesting candidates when generalising within a neighbourhood. $d_{\widetilde\Delta}$ is useful when we do not know the value improvement path the algorithm will be following (Dabney et al. 2020). Despite being approximated from a finite number of policies, the performance of $\hat d_{\mathrm{AVF}(n)}$ reflects the fact that it respects, in some sense, the entire space of policies spanned by policy iteration, which makes it useful in practice. One advantage of this metric is that it is built from value functions, which are defined on a per-state basis; this makes it amenable to online approximations. In contrast, bisimulation metrics are only defined for pairs of states, which makes them difficult to approximate in an online fashion, specifically due to the difficulty of estimating the Wasserstein metric on every update. Finally, extending our analysis to partially observable systems is an interesting area for future work. Although Castro, Panangaden, and Precup (2009) proposed various equivalence relations for partially observable systems, there has been little work on defining proper metrics for these systems.

## Acknowledgements

The authors would like to thank Sheheryar Zaidi, Adam Foster and Abe Ng for insightful discussions about topology and functional analysis; Carles Gelada, John D. Martin, Dibya Ghosh, Ahmed Touati, Rishabh Agarwal, Marlos Machado and the whole Google Brain team in Montreal for helpful discussions; and Robert Dadashi for a conversation about polytopes. We also thank Lihong Li and the anonymous reviewers for useful feedback on this paper. We would also like to thank the Python community (Van Rossum and Drake Jr 1995; Oliphant 2007) and in particular NumPy (Oliphant 2006; Walt, Colbert, and Varoquaux 2011; Harris et al. 2020), TensorFlow (Abadi et al. 2016), SciPy (Jones et al. 2001), Matplotlib (Hunter 2007) and Gin-Config (https://github.com/google/gin-config).
## Broader Impact

This work lies in the realm of foundational RL, in that it contributes to the fundamental understanding and development of reinforcement learning algorithms and theory. As such, while we agree on the importance of this discussion, our work is quite far removed from ethical issues and potential societal consequences.

## References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283.

Abel, D.; Hershkowitz, D. E.; and Littman, M. L. 2017. Near optimal behavior via approximate state abstraction. arXiv preprint arXiv:1701.04113.

Archibald, T. W.; McKinnon, K. I. M.; and Thomas, L. C. 1995. On the Generation of Markov Decision Processes. The Journal of the Operational Research Society 46(3): 354-361.

Bellemare, M.; Dabney, W.; Dadashi, R.; Ali Taiga, A.; Castro, P. S.; Le Roux, N.; Schuurmans, D.; Lattimore, T.; and Lyle, C. 2019. A Geometric Perspective on Optimal Representations for Reinforcement Learning. In Advances in Neural Information Processing Systems 32, 4360-4371.

Bertsekas, D. P. 2011. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications 9(3): 310-335.

Castro, P. S. 2020. Scalable methods for computing state similarity in deterministic Markov Decision Processes. In Proceedings of the AAAI Conference on Artificial Intelligence.

Castro, P. S.; Panangaden, P.; and Precup, D. 2009. Notions of state equivalence under partial observability. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09).

Castro, P. S.; and Precup, D. 2010. Using bisimulation for policy transfer in MDPs. In Twenty-Fourth AAAI Conference on Artificial Intelligence.

Dabney, W.; Barreto, A.; Rowland, M.; Dadashi, R.; Quan, J.; Bellemare, M. G.; and Silver, D. 2020. The Value Improvement Path: Towards Better Representations for Reinforcement Learning. arXiv preprint arXiv:2006.02243.

Dadashi, R.; Taïga, A. A.; Roux, N. L.; Schuurmans, D.; and Bellemare, M. G. 2019. The value function polytope in reinforcement learning. arXiv preprint arXiv:1901.11524.

Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 162-169. AUAI Press.

Ferns, N.; Panangaden, P.; and Precup, D. 2005. Metrics for Markov Decision Processes with Infinite State Spaces. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, UAI 2005.

Gelada, C.; Kumar, S.; Buckman, J.; Nachum, O.; and Bellemare, M. G. 2019. DeepMDP: Learning Continuous Latent Space Models for Representation Learning. In Proceedings of the International Conference on Machine Learning.

Givan, R.; Dean, T.; and Greig, M. 2003. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147(1-2): 163-223.

Harris, C. R.; Millman, K. J.; van der Walt, S. J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N. J.; et al. 2020. Array programming with NumPy. Nature 585(7825): 357-362.

Hunter, J. D. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9(3): 90-95.

Jones, E.; Oliphant, T.; Peterson, P.; et al. 2001. SciPy: Open source scientific tools for Python. http://www.scipy.org/

Kakade, S.; Kearns, M. J.; and Langford, J. 2003. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 306-312.

Li, L.; Walsh, T. J.; and Littman, M. L. 2006. Towards a Unified Theory of State Abstraction for MDPs. In ISAIM.

Machado, M.; Bellemare, M.; and Bowling, M. 2017. A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Norets, A. 2010. Continuity and differentiability of expected value functions in dynamic discrete choice models. Quantitative Economics 1(2): 305-322.

Ok, J.; Proutiere, A.; and Tranos, D. 2018. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, 8874-8882.

Oliphant, T. E. 2006. A guide to NumPy, volume 1. Trelgol Publishing USA.

Oliphant, T. E. 2007. Python for scientific computing. Computing in Science & Engineering 9(3): 10-20.

Pazis, J.; and Parr, R. 2013. PAC optimal exploration in continuous space Markov decision processes. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Piot, B.; Geist, M.; and Pietquin, O. 2014. Difference of Convex Functions Programming for Reinforcement Learning. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27, 2519-2527. Curran Associates, Inc.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.

Rachelson, E.; and Lagoudakis, M. G. 2010. On the locality of action domination in sequential decision making. In International Symposium on Artificial Intelligence and Mathematics, ISAIM 2010, Fort Lauderdale, Florida, USA, January 6-8, 2010.

Royden, H. 1968. Real Analysis. Upper Saddle River, New Jersey: Prentice Hall, 3rd edition. ISBN 0024041513.

Sinclair, S. R.; Banerjee, S.; and Yu, C. L. 2019. Adaptive Discretization for Episodic Reinforcement Learning in Metric Spaces. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3(3): 1-44.

Solway, A.; Diuk, C.; Córdova, N.; Yee, D.; Barto, A. G.; Niv, Y.; and Botvinick, M. M. 2014. Optimal Behavioral Hierarchy. PLoS Computational Biology 10(8): e1003779.

Song, Z.; and Sun, W. 2019. Efficient model-free reinforcement learning in metric spaces. arXiv preprint arXiv:1905.00475.

Sutherland, W. A. 2009. Introduction to Metric and Topological Spaces. Oxford University Press.

Sutton, R.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112: 181-211.

Taylor, J.; Precup, D.; and Panagaden, P. 2009. Bounding performance loss in approximate MDP homomorphisms. In Advances in Neural Information Processing Systems, 1649-1656.

Touati, A.; Taiga, A. A.; and Bellemare, M. G. 2020. Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces. arXiv preprint arXiv:2003.04069.

Van Rossum, G.; and Drake Jr, F. L. 1995. Python reference manual. Centrum voor Wiskunde en Informatica, Amsterdam.

Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.

Walt, S. v. d.; Colbert, S. C.; and Varoquaux, G. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13(2): 22-30.

Zhang, A.; McAllister, R.; Calandra, R.; Gal, Y.; and Levine, S. 2020. Learning Invariant Representations for Reinforcement Learning without Reconstruction. arXiv preprint arXiv:2006.10742.

Zhao, D.; and Zhu, Y. 2014. MEC: A near-optimal online reinforcement learning algorithm for continuous deterministic systems. IEEE Transactions on Neural Networks and Learning Systems 26(2): 346-356.