# Measuring Structural Similarities in Finite MDPs

Hao Wang¹, Shaokang Dong² and Ling Shao¹
¹Inception Institute of Artificial Intelligence, UAE
²State Key Laboratory for Novel Software Technology, Nanjing University, China
¹{hao.wang, ling.shao}@inceptioniai.org  ²shaokangdong@smail.nju.edu.cn
∗Part of this work was done when this author stayed at the State Key Laboratory for Novel Software Technology, Nanjing University.

In this paper, we investigate the structural similarities within a finite Markov decision process (MDP). We view a finite MDP as a heterogeneous directed bipartite graph and propose novel measures for the state and action similarities, in a mutually reinforced manner. We prove that the state similarity is a metric and the action similarity is a pseudometric. We also establish the connection between the proposed similarity measures and the optimal values of the MDP. Extensive experiments show that the proposed measures are effective.

1 Introduction

The Markov decision process (MDP) is a useful mathematical model to support decision making, with practical applications in many areas such as intelligent control systems, finance, energy management, and online advertising [Cai et al., 2017; Han et al., 2016]. A finite MDP models the interaction between an agent and the outside environment. The environment is described as a set of discrete states, and on each state the agent has a set of available actions. By observing the current state and deciding which action to take, the agent may change the state of the environment. Driven by a properly designed reward scheme, the agent may thus develop a good policy for making wise decisions in the environment.

In this paper, we investigate how to measure the state and action similarities in a finite MDP. Such similarities are important as they may provide a principled way of designing solutions in various other research areas, such as transfer in reinforcement learning [Taylor and Stone, 2009], which aims to use past learning experiences to accelerate a current learning process, and MDP abstraction [Abel, 2019], where the goal is to construct an abstracted MDP with a smaller state-action space while still maintaining certain properties of the original MDP.

The research most related to ours is the bisimulation metric, proposed in [Ferns et al., 2004], which smoothly extends the notion of bisimulation [Givan et al., 2003]. Briefly, bisimulation defines an equivalence relation between states and thereby induces a partition of the state space into equivalence classes. Whenever the same action is taken, bisimilar states perform exactly the same (probabilistic) transition into other equivalence classes and receive exactly the same reward. Based on the observation that two states are similar if the same action has similar effects, [Ferns et al., 2004] developed the bisimulation metric between states to extend the rigorous notion of bisimulation. Nonetheless, one problem with the bisimulation metric is that it overlooks the role of actions. Consider the MDP example in Fig. 1(a). The bisimulation metric cannot capture the fact that taking action b on state u is, to some extent, similar to taking action a on state v. In this paper, we tackle this problem by explicitly considering the similarity between different actions.

[Figure 1: An MDP example M (a) and its graph representation G_M (b). Labels on the edges are of the form (action/)transition probability/reward, e.g., b/0.5/0.7 and a/0.45/0.5.]
Another related notion is the MDP homomorphism [Ravindran and Barto, 2003], which consists of two algebraic mappings, one between states and the other between actions. The two mappings must exactly preserve the rewards and transition probabilities. It has been pointed out that homomorphisms are often too strict and computationally difficult to be practically useful [Taylor and Stone, 2011]. [Sorg and Singh, 2009] proposed the soft homomorphism, allowing the state mapping to be probabilistic as long as it exactly preserves transition probabilities and rewards in expectation. We argue that the soft homomorphism is still too rigid in many practical situations. For example, there is no nontrivial (soft) homomorphism in the example of Fig. 1(a), even though states u and v are intuitively quite similar.

Instead of the algebraic view adopted by bisimulation and homomorphism, in this paper we take a graph-theoretical perspective to reveal the structural similarities within a finite MDP. Our methodology originates in the studies on measuring the structural-contextual similarities between nodes in a general graph. [Jeh and Widom, 2002] developed SimRank, whose basic idea is that two nodes are similar if and only if their neighbors are similar. Although SimRank is not directly applicable to MDPs, as it is unaware of MDP-specific characteristics, we are nevertheless able to tailor its idea to develop similarity measures in MDPs.

To sum up, the contributions of this paper are the following.
1. We propose a heterogeneous bipartite graph representation for finite MDPs (Sec. 3). The graph contains state and action nodes interconnected by decision and transition edges. With such a representation, the roles of states and actions are properly captured.
2. We propose recursive definitions for the state and action similarities (Sec. 4.1). The similarities can be efficiently computed by an iterative algorithm (Sec. 4.2).
3. We prove that the induced distance measures have good metric properties (Sec. 4.3). We also show that the proposed measures can be used to bound the difference between optimal values of the MDP (Sec. 4.4).
4. We show, via extensive experiments on randomly generated MDPs, that the proposed measures are effective in capturing structural similarities (Sec. 5).

2 Related Work

2.1 Transfer in Reinforcement Learning

Transfer learning aims at using past learning experience to accelerate the learning of a current task. Many transfer learning techniques are based on the notion of task similarity. [Lazaric et al., 2008] measured the similarity between two MDPs from a sample-oriented perspective. [Ramamoorthy et al., 2013] proposed a value-preserving Lipschitz metric between two MDPs over the same state-action space, calculated as the maximum possible difference between the reward and state transition functions. [Ammar et al., 2014] used a restricted Boltzmann machine to measure the distance between two batches of samples from two MDPs, which was then used as the distance between the two MDPs. [Sinapov et al., 2015] represented MDPs as feature vectors and trained a regression model to evaluate the closeness of two MDPs. [Song et al., 2016] recently employed several metrics to measure the distance between two MDPs over the same state-action space.
The above works mainly focus on inter-task similarities rather than structural similarities within one single task.

2.2 MDP Abstraction

The research field of MDP abstraction is also relevant to our work. [Li et al., 2006] proposed a unified theory of state abstraction, covering bisimulation [Givan et al., 2003], homomorphism [Ravindran and Barto, 2003], policy irrelevance [Jong and Stone, 2005], etc. The focus of the theory was mainly on property preservation during the process of state abstraction; structural similarities (especially similarities between actions) were not fully explored. On the other hand, there have been several research works on planning and learning with abstract MDPs [Abel et al., 2019; Gopalan et al., 2017]. Their focus is different from ours. However, our work may contribute to this area by introducing a novel perspective towards MDP abstraction.

2.3 Structural Similarity in Graphs

Besides the SimRank measure [Jeh and Widom, 2002] introduced in Sec. 1, node-to-node proximities in graphs have been extensively studied in the literature. [Jeh and Widom, 2003] suggested using personalized PageRank, which is essentially an asymmetric random-walk distance. Since SimRank is counterintuitive and inflexible in certain cases, many studies have been done to improve it. [Xi et al., 2005] proposed SimFusion, based on a unified relationship-matrix representation of the graph. [Antonellis et al., 2008] revised SimRank into Simrank++ by weighting the neighbors of graph nodes. [Zhao et al., 2009] proposed P-Rank, extending SimRank to work for information networks. [Lin et al., 2012] introduced maximum matching into the similarity measure and thereby developed MatchSim. Recently, [Jin et al., 2014] proposed a metric, RoleSim, that complies with a set of admissible properties. As with SimRank, the above-mentioned measures are not directly applicable to our problem because they are not tailored to finite MDPs.

3 Graph Representation of MDPs

In the literature, an MDP is typically described as a tuple M = (S, A, T, R). S and A are the finite sets of states and actions, respectively. In particular, for any s ∈ S, we denote by A_s ⊆ A the set of available actions on state s. T : S × A × S → [0, 1] and R : S × A × S → [0, 1] are the state transition function and the reward function, respectively;¹ so T(s, a, s′) gives the probability of ending up on state s′ after taking action a on state s, and R(s, a, s′) is the reward for such a state transition.

¹In principle, the value of R may take arbitrary reals. Nonetheless, in most practical cases, it is reasonable to assume that R is both lower and upper bounded and can thus be rescaled into [0, 1] without sacrificing the representation power of the MDP model.

One problem with this algebraic representation of an MDP is that it does not distinguish between actions that have the same name but are taken on different states. To tackle this problem, we instead consider the following graph-theoretical representation of MDPs. The MDP graph for an MDP M = (S, A, T, R) is defined as G_M = (V, Λ, E, Ψ, p, r), a heterogeneous directed bipartite graph with two types of nodes, the state nodes (V) and the action nodes (Λ). E is the set of decision edges from state nodes to action nodes, and Ψ contains all the transition edges from action nodes to state nodes. While the decision edges are unweighted, each transition edge (α, v) ∈ Ψ is weighted by a transition probability p(α, v) and a reward r(α, v). The following procedure constructs the graph G_M from a given MDP M (a code sketch of the construction is given after the list).

1. For each s ∈ S, create a new node v_s into V.
2. For each s ∈ S and each a ∈ A_s, create a new node α_a into Λ and a new edge (v_s, α_a) into E.
3. For each (s, a, s′) ∈ S × A × S such that T(s, a, s′) > 0, create a new edge (α_a, v_{s′}) into Ψ; set p(α_a, v_{s′}) = T(s, a, s′) and r(α_a, v_{s′}) = R(s, a, s′).
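To make the construction concrete, here is a minimal Python sketch. It is our own illustration rather than code from the paper: the dictionary-based encoding of T and R, the function name, and the choice of keying action nodes by the pair (s, a) are assumptions we introduce so that actions with the same name on different states become distinct nodes, as intended.

```python
def build_mdp_graph(S, A_s, T, R):
    """Construct the bipartite MDP graph G_M = (V, Lambda, E, Psi, p, r).

    S   : iterable of states.
    A_s : dict mapping each state s to its set of available actions.
    T, R: dicts mapping (s, a, s2) to transition probability / reward.
    """
    V = set(S)             # state nodes (step 1)
    Lam, E = set(), set()  # action nodes and decision edges (step 2)
    Psi = set()            # transition edges (step 3)
    p, r = {}, {}          # weights on the transition edges

    for s in S:
        for a in A_s.get(s, ()):
            alpha = (s, a)            # one action node per (state, action) pair
            Lam.add(alpha)
            E.add((s, alpha))
            for s2 in S:
                prob = T.get((s, a, s2), 0.0)
                if prob > 0.0:        # keep only edges with positive probability
                    Psi.add((alpha, s2))
                    p[alpha, s2] = prob
                    r[alpha, s2] = R.get((s, a, s2), 0.0)
    return V, Lam, E, Psi, p, r
```

The out-neighbor sets N_x used in Sec. 4 can then be read off directly from E and Ψ.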
It is clear that there is a one-to-one correspondence between an MDP M and its graph G_M. The graph G_M is always bipartite, with |V| = |S| and |Λ| = |E| = Σ_{s∈S} |A_s|. Fig. 1(b) shows the graph of the MDP in Fig. 1(a), which contains 6 state nodes, 4 action nodes, 4 decision edges, and 9 transition edges. Note that the 4 action nodes distinguish the two actions a, b ∈ A taken on states u and v.

4 The Structural Similarities

For any node x ∈ V ∪ Λ in a given MDP graph G_M = (V, Λ, E, Ψ, p, r), let N_x be the set of all the out-neighbors of x. Note that G_M is bipartite; thus the out-neighbors of a state node are always action nodes, whereas those of an action node are always state nodes (see Fig. 1(b)). We now define the state similarity σ_S and the action similarity σ_A. The goal is to make the induced distance measures

$$\delta_S(u, v) \overset{\mathrm{def}}{=} 1 - \sigma_S(u, v), \quad u, v \in V, \tag{1}$$
$$\delta_A(\alpha, \beta) \overset{\mathrm{def}}{=} 1 - \sigma_A(\alpha, \beta), \quad \alpha, \beta \in \Lambda, \tag{2}$$

have desirable properties.

4.1 The Recursive Similarity Measure

We adopt the basic idea of SimRank [Jeh and Widom, 2002] to define σ_S and σ_A; namely, that two nodes are similar if and only if their neighbors are similar. The idea is implemented as a recursion.

The Base Cases. As the base cases, we define

$$\delta_S(u, v) \overset{\mathrm{def}}{=} \begin{cases} 0, & \text{if } u = v, \\ 1, & \text{if } u \text{ or } v \text{, but not both, is absorbing}, \\ d_{u,v}, & \text{if both } u \text{ and } v \text{ are absorbing}. \end{cases}$$

Here, a state is absorbing if it has no out-neighbors, which is typically a goal state in practice. A configuration of d_{u,v} ∈ [0, 1] is thus an application-dependent description of the relationships among goal states. Two special cases are d_{u,v} ≡ 1 and d_{u,v} ≡ 0, indicating that any two goal states should be regarded as completely different or as identical, respectively.

The Recursion for State Similarity. For any two state nodes u, v ∈ V, the similarity σ_S(u, v) is essentially the similarity between their out-neighbors N_u and N_v, which are two sets of action nodes. The similarity between N_u and N_v should in turn be based on the pairwise similarity σ_A(α, β), for every α ∈ N_u and β ∈ N_v. Note that simply averaging over all the pairwise similarities σ_A(α, β) (as SimRank does) would prevent δ_S from being a metric, since the triangle inequality is compromised. Hence, we consider using the Hausdorff distance [Delfour and Zolésio, 2011].² Specifically, given an action node α and a set of action nodes N, the distance between α and N, abusing the symbol δ_A, is δ_A(α, N) = min_{β∈N} δ_A(α, β). The Hausdorff distance, given δ_A, is then the maximum of all element-to-set distances:

$$\delta_{\mathrm{Haus}}(N_u, N_v; \delta_A) = \max_{\alpha \in N_u,\, \beta \in N_v} \{\, \delta_A(\alpha, N_v),\ \delta_A(\beta, N_u) \,\}. \tag{3}$$

²Other options do exist. For example, the minimum pairwise distance (mindist), min_{α,β} δ_A(α, β), is a metric. Yet, one problem is that a single small δ_A(α, β) dominates all the other pairwise distances even if they are all very large. This may be inconsistent with the intuition that similar states should have similar decision options.

The Hausdorff distance δ_Haus(N_u, N_v; δ_A) = ϵ provides an upper bound ϵ on the pairwise distances between the actions available on states u and v. Specifically, within a distance ϵ from any action in N_u (resp. N_v), there is always another action from N_v (resp. N_u); ϵ thereby bounds the overall dissimilarity between N_u and N_v. Using δ_Haus, we obtain

$$\sigma_S(u, v) = C_S \left(1 - \delta_{\mathrm{Haus}}(N_u, N_v; \delta_A)\right), \tag{4}$$

where u, v ∈ V are two distinct non-absorbing states and 0 < C_S < 1 is a constant discounting the impact of the neighbors N_u and N_v on the pair of state nodes (u, v).
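As a hedged illustration of Eqs. 3-4 in the same Python style (the table layout of δ_A as a dictionary keyed by action-node pairs, and the helper names, are our own assumptions, not the paper's), the state-similarity update could be sketched as:

```python
def hausdorff(N_u, N_v, delta_A):
    """Hausdorff distance between two sets of action nodes (Eq. 3),
    given a symmetric table of pairwise action distances delta_A[a, b]."""
    u_to_v = max(min(delta_A[a, b] for b in N_v) for a in N_u)  # each alpha finds its nearest beta
    v_to_u = max(min(delta_A[a, b] for a in N_u) for b in N_v)  # each beta finds its nearest alpha
    return max(u_to_v, v_to_u)

def state_similarity(u, v, N, delta_A, C_S=0.95):
    """One step of the recursion for sigma_S (Eq. 4), for two distinct
    non-absorbing state nodes u, v with out-neighbor sets N[u], N[v]."""
    return C_S * (1.0 - hausdorff(N[u], N[v], delta_A))
```

The base cases of δ_S (identical states, absorbing states, and goal-state pairs) are handled by the initialization of the iterative algorithm in Sec. 4.2 rather than by this helper.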
The Recursion for Action Similarity. The state similarity σ_S relies on a well-defined δ_A (see Eq. 4). As shown in Fig. 1(b), in an MDP graph an action node itself conveys limited information; what really matters are its consequences, or effects. An action node α is essentially both a distribution p_α = p(α, ·) over the state nodes (probabilistic transition) and a distribution r_α = r(α, ·) over [0, 1] (stochastic reward). The distance between rewards is relatively simple:

$$\delta_{\mathrm{rwd}}(\alpha, \beta) = \left| \mathbb{E}[r_\alpha] - \mathbb{E}[r_\beta] \right|, \tag{5}$$

i.e., it is the difference between their expectations. The rest of the task involves measuring the distance between two probability distributions over state nodes. To that end, we employ the earth mover's distance (EMD) [Rubner et al., 1998]:³

$$\delta_{\mathrm{EMD}}(p_\alpha, p_\beta; \delta_S) = \min_{F} \sum_{u \in N_\alpha} \sum_{v \in N_\beta} f_{u,v}\, \delta_S(u, v)$$
$$\text{s.t. } \forall u, v \in V:\quad f_{u,v} \ge 0, \quad \sum_{v \in V} f_{u,v} = p(\alpha, u), \quad \sum_{u \in V} f_{u,v} = p(\beta, v). \tag{6}$$

EMD quantifies the effort of transforming one distribution into another by moving the "earth" (probabilities, in our case) around. The value f_{u,v} in the matrix F is the earth moved from the state node u ∈ N_α to the state node v ∈ N_β; such a movement incurs a cost of δ_S(u, v). δ_EMD(p_α, p_β; δ_S) is thus the minimum possible effort for transforming the distribution p_α into the distribution p_β. The recursion for σ_A, based on δ_EMD, is designed to be

$$\sigma_A(\alpha, \beta) = 1 - (1 - C_A)\, \delta_{\mathrm{rwd}}(\alpha, \beta) - C_A\, \delta_{\mathrm{EMD}}(p_\alpha, p_\beta; \delta_S), \tag{7}$$

where 0 < C_A < 1 is a parameter weighting the importance of the reward similarity against the transition similarity.

³There are other statistical distance measures (e.g., the Kullback-Leibler divergence, Hellinger distance, Jensen-Shannon divergence, etc. [Venturini, 2015]). They are not applicable here, as they require a fixed matching between N_α and N_β to give nontrivial evaluations.
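Continuing the sketch, Eqs. 5-7 can be prototyped as below. This is again our own illustration: the small transportation problem of Eq. 6 is solved with a generic linear-programming solver (scipy.optimize.linprog), whereas the algorithm of Sec. 4.2 uses a successive-shortest-path routine; p_succ[α] is assumed to hold the successor distribution p(α, ·) as a dict, and exp_r[α] the precomputed expectation E[r_α].

```python
import numpy as np
from scipy.optimize import linprog

def emd(p_alpha, p_beta, delta_S):
    """Earth mover's distance between two transition distributions (Eq. 6).
    p_alpha, p_beta : dicts mapping successor state nodes to probabilities.
    delta_S         : table of pairwise state distances delta_S[u, v]."""
    U, W = list(p_alpha), list(p_beta)
    n, m = len(U), len(W)
    cost = np.array([delta_S[u, w] for u in U for w in W])  # flattened n*m cost matrix
    A_eq, b_eq = [], []
    for i, u in enumerate(U):        # row sums: sum_w f[u, w] = p_alpha[u]
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(p_alpha[u])
    for j, w in enumerate(W):        # column sums: sum_u f[u, w] = p_beta[w]
        col = np.zeros(n * m); col[j::m] = 1.0
        A_eq.append(col); b_eq.append(p_beta[w])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    return float(res.fun)

def action_similarity(alpha, beta, p_succ, exp_r, delta_S, C_A=0.95):
    """One step of the recursion for sigma_A (Eqs. 5 and 7)."""
    d_rwd = abs(exp_r[alpha] - exp_r[beta])                  # Eq. 5
    return 1.0 - (1.0 - C_A) * d_rwd - C_A * emd(p_succ[alpha], p_succ[beta], delta_S)
```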
4.2 Computation

Algorithm 1 shows the iterative algorithm for computing σ*_S and σ*_A by simulating the recursion of Sec. 4.1.

Algorithm 1: Structural similarities of an MDP graph
Input: MDP graph G_M = (V, Λ, E, Ψ, p, r)
Parameter: discount factors C_S, C_A ∈ (0, 1)
Output: solution (σ*_S, σ*_A) to the recursion of Sec. 4.1
1: S ← I_{|V|×|V|}, A ← I_{|Λ|×|Λ|}   // initialization & base cases
2: repeat   // iterative computation
3:   for all α ∈ N_u and β ∈ N_v (u, v ∈ V, u ≠ v) do
4:     d ← EMD(p_α, p_β; G_M, 1 − S)
5:     compute A_{α,β} using Eq. 7 with C_A, d and S
6:   for all u, v ∈ V with N_u ≠ ∅ and N_v ≠ ∅ do
7:     compute S_{u,v} using Eq. 4 with C_S and A
8: until S and A converge
9: return (σ*_S, σ*_A) ← (S, A)

The algorithm is mostly straightforward except for Line 4, where a subroutine EMD is invoked to compute δ_EMD(p_α, p_β; δ_S). The computation of the EMD can be seen as solving an instance of the minimum-cost network flow (MCF) problem, which can be done using, for example, the successive shortest path (SSP) algorithm [Jewell, 1962].

Space and time complexity. Given the graph G_M = (V, Λ, E, Ψ, p, r) of the MDP M = (S, A, T, R), Algorithm 1 requires Θ(|V|² + |Λ|²) = O(|S|²|A|²) space to store S and A. SSP takes O(K²_max) working memory, where K_max ≤ |V| is the maximum out-degree of action nodes in G_M. Given a predefined precision ϵ (e.g., ϵ = 0.001), SSP is guaranteed to terminate in O((1/ϵ²)(K²_max + K_max log K_max)) = O(K²_max/ϵ²) time, using Dijkstra's algorithm with a Fibonacci heap.⁴ Each iteration of Algorithm 1 (Lines 3-7) makes Θ(|Λ|²) calls to SSP. Computing the Hausdorff distances takes Θ(|V|² L²_max) time, where L_max ≤ |A| is the maximum out-degree of state nodes in G_M. The overall time cost is therefore O(N|S|²|A|²K²_max/ϵ²), where N is the number of iterations before convergence.

⁴It is known that such a worst-case analysis is too pessimistic. [Brunsch et al., 2013] developed a smoothed analysis of SSP to better explain its practical efficiency. The discussion is, however, out of the scope of this paper as it does not affect how we use SSP.

4.3 Mathematical Properties

We now prove that σ*_S and σ*_A are well defined by showing that Algorithm 1 always terminates. Let S^(k) and A^(k) be the versions of the matrices S and A after the k-th execution of Lines 3-7 of Algorithm 1 (k = 1, 2, ...). In addition, let S^(0) and A^(0) be the contents of S and A right before the algorithm enters the main loop. Let the symbol ⪯ denote the entrywise less-than-or-equal-to relationship between matrices.

Lemma 1 (Boundedness). For all k ≥ 0, 0 ⪯ S^(k) ⪯ 1 and 0 ⪯ A^(k) ⪯ 1.

Lemma 2 (Monotonicity). For all k ≥ 0, S^(k) ⪯ S^(k+1) and A^(k) ⪯ A^(k+1).

Sketch of Proof. Lemmata 1 & 2 can be proved by induction, using the boundedness of δ_Haus and δ_EMD, as well as their monotonicity with respect to δ_A and δ_S, respectively.

Theorem 1 (Unique Existence & Nontrivialness). Algorithm 1 terminates correctly with the unique solution (σ*_S, σ*_A) to the recursion of Sec. 4.1. The solution is nontrivial in the sense that σ*_S ≢ 1 and σ*_A ≢ 1, even if there is no absorbing state in the input MDP graph G_M.

Proof. It is a direct corollary of Lemmata 1 & 2 that lim_{k→∞} S^(k) = σ*_S ∈ [0, 1] and lim_{k→∞} A^(k) = σ*_A ∈ [0, 1]. To see the nontrivialness of (σ*_S, σ*_A), it suffices to verify that σ_S ≡ 1 or σ_A ≡ 1 cannot constitute the solution, due to the discount factor 0 < C_S < 1.

Next, we consider the matrices D^(k) := 1 − S^(k) and L^(k) := 1 − A^(k) for k = 0, 1, 2, ....

Lemma 3 (Triangle Inequality). The triangle inequalities D^(k)_{u,v} ≤ D^(k)_{u,w} + D^(k)_{w,v} and L^(k)_{α,β} ≤ L^(k)_{α,γ} + L^(k)_{γ,β} hold for arbitrary state nodes u, v, w ∈ V, arbitrary action nodes α, β, γ ∈ Λ and every k ≥ 0, provided the configuration of the base cases, d_{u,v}, observes the triangle inequality (i.e., for any absorbing states u, v and w, d_{u,v} ≤ d_{u,w} + d_{w,v}).

Sketch of Proof. This proof is also done by induction. Verifying the claim for L^(0) and D^(0) is trivial, given the initial configuration of the d_{u,v} values. The induction can then be completed since δ_Haus(·, ·; L^(k)) and δ_EMD(·, ·; D^(k)) are metrics.

Lemma 3 immediately leads to the following result.

Theorem 2 (Metric Properties). If (σ*_S, σ*_A) is the solution to the recursion of Sec. 4.1, formulated on the MDP graph G_M = (V, Λ, E, Ψ, p, r), then δ*_S = 1 − σ*_S is a metric on V and δ*_A = 1 − σ*_A is a pseudometric on Λ.

Proof. It is clear that both δ*_S and δ*_A are nonnegative and symmetric. The triangle inequalities of δ*_S and δ*_A are obtained by taking limits on both sides of the inequalities of Lemma 3. Then, δ*_S is a metric because δ*_S(u, v) = 0 if and only if u = v. Indeed, for any u ≠ v, σ*_S(u, v) < 1 due to the discount factor C_S, even if δ_Haus(N_u, N_v; δ*_A) = 0 (see Eq. 4). In contrast, it is possible that δ*_A(α, β) = 0 for two different actions α ≠ β ∈ Λ, which happens when r_α ≡ r_β and p_α ≡ p_β.
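For concreteness, the outer loop of Algorithm 1, whose convergence is guaranteed by Lemmata 1 and 2, could be prototyped as follows, reusing the helpers sketched above. This is again our own illustration, not the paper's implementation: p_succ regroups the edge weights p(α, ·) per action node, exp_r holds the expected one-step rewards, convergence is tested with a simple tolerance, and the base case d_{u,v} ≡ 1 is assumed for distinct absorbing states.

```python
def structural_similarities(V, Lam, N_state, N_action, p_succ, exp_r,
                            C_S=0.95, C_A=0.95, tol=1e-3):
    """Iterative computation of (sigma_S, sigma_A), mirroring Algorithm 1.

    N_state[u]      : action nodes available on state u (empty if u is absorbing).
    N_action[alpha] : successor state nodes of action node alpha.
    p_succ[alpha]   : successor distribution p(alpha, .) as a dict.
    exp_r[alpha]    : expected one-step reward E[r_alpha]."""
    # Line 1: identity initialization, i.e. similarity 1 on the diagonal and 0
    # elsewhere (distance 1; this realizes the base case d_{u,v} = 1).
    sig_S = {(u, v): 1.0 if u == v else 0.0 for u in V for v in V}
    sig_A = {(a, b): 1.0 if a == b else 0.0 for a in Lam for b in Lam}
    while True:                                           # Lines 2-8
        delta_S = {k: 1.0 - s for k, s in sig_S.items()}
        new_A = {(a, b): action_similarity(a, b, p_succ, exp_r, delta_S, C_A)
                 for a in Lam for b in Lam}               # Lines 3-5
        delta_A = {k: 1.0 - s for k, s in new_A.items()}
        new_S = dict(sig_S)
        for u in V:                                       # Lines 6-7
            for v in V:
                if u != v and N_state[u] and N_state[v]:
                    new_S[u, v] = state_similarity(u, v, N_state, delta_A, C_S)
        change = max(max(abs(new_S[k] - sig_S[k]) for k in sig_S),
                     max(abs(new_A[k] - sig_A[k]) for k in sig_A))
        sig_S, sig_A = new_S, new_A
        if change < tol:                                  # Line 8: convergence
            return sig_S, sig_A                           # Line 9
```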
4.4 Bounding Differences Between Optimal Values

Given an MDP graph G_M = (V, Λ, E, Ψ, p, r) and an arbitrary initial state u_0 ∈ V, following a probabilistic policy π : V × Λ → [0, 1], there will be a trajectory of state transitions (whose length could potentially be infinite):

$$u_0 \xrightarrow[r_1]{\;\alpha_0 = \pi(u_0)\;} u_1 \xrightarrow[r_2]{\;\alpha_1 = \pi(u_1)\;} u_2 \xrightarrow[r_3]{\;\alpha_2 = \pi(u_2)\;} \cdots$$

Given a discount factor ρ ∈ (0, 1), the state value of u ∈ V under policy π, written V^π_u, is the expected total accumulative return starting from the state node u, i.e.,

$$V^\pi_u \overset{\mathrm{def}}{=} \mathbb{E}_\pi\!\left[\sum_{k=1}^{\infty} \rho^{k-1} r_k \;\Big|\; u_0 = u\right]. \tag{8}$$

Similarly, the action value of α ∈ Λ under policy π is

$$Q^\pi_\alpha \overset{\mathrm{def}}{=} \mathbb{E}_\pi\!\left[\sum_{k=1}^{\infty} \rho^{k-1} r_k \;\Big|\; \alpha_0 = \alpha\right]. \tag{9}$$

Now, consider the optimal value functions V* and Q* under the optimal policy π*. The well-known Bellman equations [Sutton and Barto, 2018] state that

$$V^*_u = \max_{\alpha \in N_u} Q^*_\alpha, \qquad Q^*_\alpha = \sum_{u' \in N_\alpha} p(\alpha, u')\left(r(\alpha, u') + \rho V^*_{u'}\right).$$

We show that the proposed distance measures, δ*_S and δ*_A, can be used to bound the difference between the optimal values.

Theorem 3 (Bounds on Differences of Optimal Values). For any state nodes u, v ∈ V and any action nodes α, β ∈ Λ, if we choose C_S = 1 and C_A = ρ, then we have⁵

$$|V^*_u - V^*_v| \le \frac{1}{1-\rho}\, \delta^*_S(u, v), \qquad |Q^*_\alpha - Q^*_\beta| \le \frac{1}{1-\rho}\, \delta^*_A(\alpha, \beta).$$

⁵The main purpose of C_S is to theoretically guarantee a nontrivial (σ*_S, σ*_A); any C_S < 1 serves this purpose. Here, it is convenient to choose a C_S very close to 1 such that it can be ignored in Eq. 4.

Proof. Consider the sequences V^(k) and Q^(k) (k ≥ 0):

$$V^{(k)}_u = \max_{\alpha \in N_u} Q^{(k)}_\alpha, \qquad Q^{(k)}_\alpha = \sum_{u' \in N_\alpha} p(\alpha, u')\left(r(\alpha, u') + \rho V^{(k-1)}_{u'}\right).$$

It suffices to first prove by induction that, for any k, |V^{(k)}_u − V^{(k)}_v| ≤ (1/(1−ρ)) D^(k)_{u,v} and |Q^{(k)}_α − Q^{(k)}_β| ≤ (1/(1−ρ)) L^(k)_{α,β}, and then take limits on both sides of the inequalities. After verifying the case k = 0, we see that

$$\left|Q^{(k+1)}_\alpha - Q^{(k+1)}_\beta\right| \le \delta_{\mathrm{rwd}}(\alpha, \beta) + \rho \left| \sum_{u \in V} \left(p(\alpha, u) - p(\beta, u)\right) V^{(k)}_u \right|.$$

Due to the induction hypothesis |V^{(k)}_u − V^{(k)}_v| ≤ (1/(1−ρ)) D^(k)_{u,v}, the second term is exactly the dual problem of Eq. 6 with cost function (1/(1−ρ)) D^(k). Hence,

$$\left|Q^{(k+1)}_\alpha - Q^{(k+1)}_\beta\right| \overset{\text{(I.H.)}}{\le} \delta_{\mathrm{rwd}}(\alpha, \beta) + \rho\, \delta_{\mathrm{EMD}}\!\left(p_\alpha, p_\beta; \tfrac{1}{1-\rho} D^{(k)}\right) = \tfrac{1}{1-\rho} L^{(k+1)}_{\alpha,\beta}. \quad \text{(Line 5 of Algorithm 1)}$$

Define d^(k)(α, β) = |Q^(k)_α − Q^(k)_β| for k ≥ 0, and let α* = arg max_{α∈N_u} Q^(k+1)_α and β* = arg max_{β∈N_v} Q^(k+1)_β. It can be shown that either α* finds β* as its closest counterpart with respect to the distance measure d^(k+1), or the other way around. Hence,

$$\left|V^{(k+1)}_u - V^{(k+1)}_v\right| = d^{(k+1)}(\alpha^*, \beta^*) \le \delta_{\mathrm{Haus}}\!\left(N_u, N_v; d^{(k+1)}\right) \le \delta_{\mathrm{Haus}}\!\left(N_u, N_v; \tfrac{1}{1-\rho} L^{(k+1)}\right) = \tfrac{1}{1-\rho} D^{(k+1)}_{u,v}. \quad \text{(Line 7 of Algorithm 1)}$$

Note that, since the reward function r is bounded within [0, 1], there is a trivial upper bound of Σ_{k=0}^∞ ρ^k = 1/(1−ρ) on both V* and Q* (see Eqs. 8-9). Thm. 3 reveals a connection between the proposed similarity/distance measures and the optimal values by tightening this trivial upper bound.
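Thm. 3 can be checked empirically by computing the optimal values on the graph with standard value iteration over the Bellman equations above and comparing |V*_u − V*_v| against δ*_S(u, v)/(1 − ρ). The sketch below is our own illustration, using the same data layout as the earlier sketches; absorbing states (empty N_state[u]) are assumed to keep the value 0.

```python
def optimal_values(V, N_state, N_action, p_succ, r, rho=0.95, tol=1e-8):
    """Value iteration on the MDP graph (Bellman optimality equations, Sec. 4.4).
    r[alpha, u2] is the reward on the transition edge (alpha, u2)."""
    V_star = {u: 0.0 for u in V}
    while True:
        Q_star = {a: sum(p_succ[a][u2] * (r[a, u2] + rho * V_star[u2])
                         for u2 in N_action[a])
                  for u in V for a in N_state[u]}
        V_new = {u: max((Q_star[a] for a in N_state[u]), default=0.0) for u in V}
        if max(abs(V_new[u] - V_star[u]) for u in V) < tol:
            return V_new, Q_star
        V_star = V_new
```

With (sig_S, sig_A) from the iteration sketch above (using C_A = ρ and C_S close to 1), one can then verify, pair by pair, that abs(V_star[u] - V_star[v]) <= (1 - sig_S[u, v]) / (1 - rho).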
5 Experiments

We experimentally evaluate the proposed similarity measures using a series of designed MDPs. The state space of each MDP is an n × n grid of cells, as shown in Fig. 2(a), where n is an odd integer. States are indexed with natural numbers in a left-to-right and bottom-to-top manner. There is a sole goal state g ∈ S, indexed (n² − 1)/2, at the center of the grid. The action space consists of 4 actions intended towards 4 different directions: left, up, right, and down. On any state, an action is available as long as its intention is physically possible. If action a ∈ A_s is taken on state s ∈ S, then it has a τ_{s,a} probability of going along the intended direction and a 1 − τ_{s,a} probability of entering a random adjacent state. After entering a state s′, the agent receives a random reward of r_{s,a,s′}. We set r_{s,a,g} ≡ 1 for any s ∈ S and a ∈ A.

[Figure 2: An n × n grid (n = 3) as an MDP M = (S, A, T, R). (a) The state space. (b) Available actions.]

5.1 A Case Study

We first carry out a simple case study to show the ability of our measures to capture structural information in MDPs. We use the 3 × 3 grid shown in Fig. 2, in which τ_{s,a} ≡ 0.9 for any s ∈ S and a ∈ A, and r_{s,a,s′} ≡ 0 whenever s′ ≠ g. As shown in Fig. 2(b), in this 9-state MDP there are 20 available actions. Note that states 0, 2, 6, and 8, together with the eight actions highlighted in bold, are structurally symmetric. The same is true for states 1, 3, 5, and 7.

We compare our results on this 9-state MDP with those of d_bis, the bisimulation metric using EMD [Ferns et al., 2004]. d_bis requires that every state have the same set of available actions. To meet this requirement, for any action whose intention is physically impossible, we define its EMD to other actions to be 1. In all the experiments, we set C_S = C_A = 0.95 for our solution, and c_R = 0.05 and c_T = 0.95 for d_bis.

[Figure 3: Visualization of the state distance matrices (values in [0, 1]). (a) Our solution δ*_S. (b) Bisimulation metric d_bis.]

Fig. 3 visualizes the resulting (upper-triangular) matrices of the distances between the 9 states. As can be seen, our results show a clear structural pattern in the distance matrix. The smallest distances are δ*_S(0, 2) = δ*_S(0, 6) = 0.27489 and δ*_S(1, 3) = δ*_S(1, 5) = 0.27492, followed by δ*_S(1, 7) = 0.28627 and δ*_S(0, 8) = 0.29873. In contrast, d_bis generates very large distance values, with 0.95 being the minimum. In particular, with d_bis(1, 7) = d_bis(3, 5) = 0.995, d_bis fails to capture certain structural similarities in the MDP.

5.2 Distribution of Distance Values

We now analyze the distribution of the calculated distance values. We generate a series of n × n grid MDPs {M_n} for n = 11, 13, ..., 33. In each MDP M_n, for every s ∈ S and a ∈ A_s, we randomly draw the success probability τ_{s,a} from the Gaussian distribution N(0.9, 0.05²) and the reward r_{s,a,s′} from the uniform distribution U(0, 0.01) when s′ ≠ g. The reward for entering the goal state remains 1. The size of the state space of {M_n} ranges from 121 to 1,089, which adequately supports practical applications of finite MDPs.

On each MDP M_n, Algorithm 1 calculates two similarity matrices, S^n and A^n, with parameters C_S = C_A = 0.95, from which we obtain two distance matrices, D^n_S = 1 − S^n and D^n_A = 1 − A^n. We also calculate a matrix D^n_bis for d_bis, using parameters c_R = 0.05 and c_T = 0.95. For each of D^n_S, D^n_bis, and D^n_A, we fit the values with a Beta distribution. Fig. 4 shows the fitted probability density functions (PDFs). There are clearly three series of PDFs, fitted from D^n_S, D^n_bis (Fig. 4(a)) and D^n_A (Fig. 4(b)), for n = 11, 13, ..., 33. In each series, a darker line corresponds to a larger state-action space. As can be seen, the D_S and D_A series have consistent trends. The D_bis series is heavily skewed towards 1, while the D_S and D_A series are flatter. This indicates that our measures generate more reasonably distributed distance values, whereas the bisimulation metric constantly underestimates the structural similarities in {M_n}.

[Figure 4: PDFs of the fitted Beta distributions. (a) Fitted from D_S and D_bis. (b) Fitted from D_A.]
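For reproducibility, the random grid MDPs of Sec. 5.2 can be generated along the following lines. The sketch encodes our own reading of the setup: the goal state is treated as absorbing (consistent with the count of 20 available actions on the 3 × 3 grid), the residual probability 1 − τ_{s,a} is split uniformly over the other adjacent cells, and the function name and index arithmetic are ours.

```python
import numpy as np

def make_grid_mdp(n, rng=None):
    """n x n grid MDP of Sec. 5: states indexed left-to-right, bottom-to-top,
    goal at the centre cell (n^2 - 1) / 2, actions left/up/right/down where
    physically possible, tau_{s,a} ~ N(0.9, 0.05^2), non-goal rewards ~ U(0, 0.01),
    and reward 1 for entering the goal."""
    rng = rng or np.random.default_rng(0)
    goal = (n * n - 1) // 2
    S = list(range(n * n))
    A_s, T, R = {}, {}, {}

    def neighbours(s):
        x, y = s % n, s // n
        nbr = {}
        if x > 0:     nbr['left'] = s - 1
        if x < n - 1: nbr['right'] = s + 1
        if y < n - 1: nbr['up'] = s + n
        if y > 0:     nbr['down'] = s - n
        return nbr

    for s in S:
        if s == goal:                      # goal state: absorbing, no actions (assumption)
            A_s[s] = set()
            continue
        nbr = neighbours(s)
        A_s[s] = set(nbr)
        for a, target in nbr.items():
            tau = float(np.clip(rng.normal(0.9, 0.05), 0.0, 1.0))
            others = [t for t in nbr.values() if t != target]
            for s2 in nbr.values():
                T[s, a, s2] = tau if s2 == target else (1.0 - tau) / len(others)
                R[s, a, s2] = 1.0 if s2 == goal else rng.uniform(0.0, 0.01)
    return S, A_s, T, R
```

Feeding the result into build_mdp_graph from Sec. 3 then yields the graph on which Algorithm 1 (or the iteration sketch above) runs.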
5.3 Effect of Parameters

Our measures rely on two parameters, C_S and C_A, so we perform experiments to investigate the impact of each. We randomly generate a moderate-sized MDP, M_15, using the same methodology as in Sec. 5.2. We first fix C_S = 0.95 and vary C_A from 0.80 to 0.99 with a step size of 0.01; then, we swap the roles of C_S and C_A. For each pair (C_S, C_A), we run Algorithm 1 to obtain the state similarity matrix S, from which we calculate the mean and the standard deviation of the distance matrix D = 1 − S.

[Figure 5: Effect of the parameters C_A and C_S on M_15. (a) Mean distance vs. C_A (with C_S = 0.95). (b) Mean distance vs. C_S (with C_A = 0.95).]

Fig. 5 illustrates the impact of each parameter. As C_A increases, the mean distance value also increases and the distribution becomes more concentrated. It is worth mentioning that, as suggested by Thm. 3, the value of C_A is closely related to the discount factor ρ, which is determined by the application and for which 0.95 is a common choice [Sutton and Barto, 2018]. In contrast, C_S shows the opposite impact. As C_S increases, the mean distance decreases, though it remains around 0.8, and the distribution becomes flatter. This is consistent with the implication of Thm. 3 that C_S should be close to 1.

6 Conclusion and Future Work

In this paper, we studied the structural similarities within a finite MDP. By representing MDPs as graphs, we described the similarities between states and actions by measuring the proximity between graph nodes. We proved the metric properties of the proposed measures, based on which we also derived upper bounds on the difference of optimal values. Extensive experiments showed the advantages of our measures. In the future, we plan to accelerate the computation of the proposed measures via parallelism, which is commonly used to handle large graphs. We are also interested in similarity search queries on MDPs, which aim to efficiently identify the most similar pairs of states/actions.

References

[Abel et al., 2019] David Abel, Dilip Arumugam, Kavosh Asadi, Yuu Jinnai, Michael L. Littman, and Lawson L.S. Wong. State abstraction as compression in apprenticeship learning. In AAAI, 2019.
[Abel, 2019] David Abel. A theory of state abstraction for reinforcement learning. In AAAI, Doctoral Consortium, 2019.
[Ammar et al., 2014] Haitham Bou Ammar, Eric Eaton, Matthew E. Taylor, Decebal Constantin Mocanu, Kurt Driessens, Gerhard Weiss, and Karl Tuyls. An automated measure of MDP similarity for transfer reinforcement learning. In AAAI, 2014.
[Antonellis et al., 2008] Ioannis Antonellis, Hector Garcia-Molina, and Chi Chao Chang. Simrank++: query rewriting through link analysis of the click graph. In VLDB, 2008.
[Brunsch et al., 2013] Tobias Brunsch, Kamiel Cornelissen, Bodo Manthey, and Heiko Röglin. Smoothed analysis of the successive shortest path algorithm. In SODA, 2013.
[Cai et al., 2017] Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. Real-time bidding by reinforcement learning in display advertising. In WSDM, 2017.
[Delfour and Zolésio, 2011] Michel C. Delfour and Jean-Paul Zolésio. Shapes and Geometries: Metrics, Analysis, Differential Calculus, and Optimization. Advances in Design and Control. SIAM, 2nd edition, 2011.
[Ferns et al., 2004] Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In UAI, 2004.
[Givan et al., 2003] Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147:163-223, 2003.
[Gopalan et al., 2017] Nakul Gopalan, Marie desJardins, Michael L. Littman, James MacGlashan, Shawn Squire, Stefanie Tellex, John Winder, and Lawson L.S. Wong. Planning with abstract Markov decision processes. In ICAPS, 2017.
[Han et al., 2016] Zhenhua Han, Haisheng Tan, Guihai Chen, Rui Wang, Yifan Chen, and Francis C.M. Lau. Dynamic virtual machine management via approximate Markov decision process. In INFOCOM, 2016.
[Jeh and Widom, 2002] Glen Jeh and Jennifer Widom. SimRank: a measure of structural-context similarity. In KDD, 2002.
[Jeh and Widom, 2003] Glen Jeh and Jennifer Widom. Scaling personalized web search. In WWW, 2003.
[Jewell, 1962] William S. Jewell. Optimal flow through networks. Operations Research, 10(4):476-499, 1962.
[Jin et al., 2014] Ruoming Jin, Victor E. Lee, and Longjie Li. Scalable and axiomatic ranking of network role similarity. TKDD, 8(1):Article No. 3, 2014.
[Jong and Stone, 2005] Nicholas K. Jong and Peter Stone. State abstraction discovery from irrelevant state variables. In IJCAI, 2005.
[Lazaric et al., 2008] Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples in batch reinforcement learning. In ICML, 2008.
[Li et al., 2006] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. In ISAIM, 2006.
[Lin et al., 2012] Zhenjiang Lin, Michael R. Lyu, and Irwin King. MatchSim: a novel similarity measure based on maximum neighborhood matching. KAIS, 32(1):141-166, 2012.
[Ramamoorthy et al., 2013] Subramanian Ramamoorthy, M. M. Hassan Mahmud, Majd Hawasly, and Benjamin Rosman. Clustering Markov decision processes for continual transfer. Technical report, University of Edinburgh, 2013.
[Ravindran and Barto, 2003] Balaraman Ravindran and Andrew G. Barto. SMDP homomorphisms: an algebraic approach to abstraction in semi-Markov decision processes. In IJCAI, 2003.
[Rubner et al., 1998] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with applications to image databases. In ICCV, 1998.
[Sinapov et al., 2015] Jivko Sinapov, Sanmit Narvekar, Matteo Leonetti, and Peter Stone. Learning inter-task transferability in the absence of target task samples. In AAMAS, 2015.
[Song et al., 2016] Jinhua Song, Yang Gao, Hao Wang, and Bo An. Measuring the distance between finite Markov decision processes. In AAMAS, 2016.
[Sorg and Singh, 2009] Jonathan Sorg and Satinder Singh. Transfer via soft homomorphisms. In AAMAS, 2009.
[Sutton and Barto, 2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
[Taylor and Stone, 2009] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: a survey. JMLR, 10:1633-1685, 2009.
[Taylor and Stone, 2011] Matthew E. Taylor and Peter Stone. An introduction to intertask transfer for reinforcement learning. AI Magazine, 32(1):15-34, 2011.
[Venturini, 2015] Gabriel Martos Venturini. Statistical distances and probability metrics for multivariate data, ensembles and probability distributions. PhD thesis, Universidad Carlos III de Madrid, 2015.
[Xi et al., 2005] Wensi Xi, Edward A. Fox, Weiguo Fan, Benyu Zhang, Zheng Chen, Jun Yan, and Dong Zhuang. SimFusion: measuring similarity using unified relationship matrix. In SIGIR, 2005.
[Zhao et al., 2009] Peixiang Zhao, Jiawei Han, and Yizhou Sun. P-Rank: a comprehensive structural similarity measure over information networks. In CIKM, 2009.