# Reward Identification in Inverse Reinforcement Learning

Kuno Kim¹, Kirankumar Shiragur¹, Shivam Garg¹, Stefano Ermon¹

*Equal contribution. ¹Department of Computer Science, Stanford University, Palo Alto, USA. Correspondence to: Kuno Kim.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

We study the problem of reward identifiability in the context of Inverse Reinforcement Learning (IRL). The reward identifiability question is critical to answer when reasoning about the effectiveness of using Markov Decision Processes (MDPs) as computational models of real-world decision makers in order to understand complex decision-making behavior and perform counterfactual reasoning. While identifiability has been acknowledged as a fundamental theoretical question in IRL, little is known about the types of MDPs for which rewards are identifiable, or even whether such MDPs exist. In this work, we formalize the reward identification problem in IRL and study how identifiability relates to properties of the MDP model. For deterministic MDP models with the MaxEnt RL objective, we prove necessary and sufficient conditions for identifiability. Building on these results, we present efficient algorithms for testing whether or not an MDP model is identifiable.

## 1. Introduction

Inverse Reinforcement Learning (IRL) is the process of estimating a reward function from demonstrations of optimal behavior. In general, the demonstrator is assumed to behave optimally with respect to an underlying Markov Decision Process (MDP). A key component of the MDP is the reward function, which encapsulates the underlying incentives driving the behavior of the optimal demonstrator. A fundamental, yet unsolved, question in the field of IRL is reward identifiability: given the optimal behavior, can the reward motivating the behavior be identified up to a reasonable equivalence class?

The reward identifiability question is heavily motivated by real-world use cases of IRL. One such area is in applying MDPs to build computational models (Niv, 2009) of real-world, rational decision makers such as investors (Dixit et al., 1994; Rust, 1994), farmers (Nielsen and Kristensen, 2015), doctors (Heckman and Navarro, 2007), and animals (Montague et al., 1995; Doya and Sejnowski, 1998). Given demonstrations of how these decision makers behave, one can apply IRL to extract the underlying reward, which can then be studied to better interpret and understand economic, healthcare, and ecological systems. In order for the MDP model to be an effective modeling choice, it should be sufficiently flexible, so that there exists an MDP that induces complex, realistic behaviors, and identifiable, so that the extracted reward function can be interpreted. If the MDP model is unidentifiable, then a large number of arbitrarily different rewards rationalize the demonstrations equally well under the MDP model. As a result, no meaningful understanding of the decision maker can be obtained by IRL. Another important application of IRL is in counterfactual reasoning (Kalouptsidi et al., 2015; Christensen and Connault, 2019), such as a financial institution attempting to predict the change in behavior of its customer base in response to changes in the economic climate, amongst other environmental factors (Rust, 1994).
Assuming that the only factor that has changed is the environment and not the customers' incentives, IRL can be used to deduce a set of plausible reward functions describing the customers' incentives, and the modeler can choose a reward to re-optimize in a different environment that simulates the change in economic climate. In this scenario, identifiability is a desirable property, since oftentimes the modeler will examine the extracted rewards and choose the one which is most likely to transfer well to the new environment. If the MDP model is unidentifiable, it is unlikely that the modeler can effectively select a transfer reward, as a large set of vastly different plausible rewards cannot be interpreted.

Despite the importance of the reward identifiability question, it is heavily under-explored in the Machine Learning (ML) literature. Little is known about the types of MDP models for which the rewards are identifiable, or even whether such MDP models exist. Many prior works raise concerns that IRL is an ill-posed problem due to identifiability issues (Ng et al., 2000; Ziebart et al., 2011; Ziebart, 2010; Dvijotham and Todorov, 2010; Fu et al., 2017; Geng et al., 2020), often providing the example that a constant reward rationalizes any optimal behavior. This example gives the wrong impression that all MDP models are always unidentifiable; in fact, there are several RL frameworks, such as the Maximum Entropy RL (MaxEnt RL) framework, in which constant rewards cannot rationalize all behaviors. To the best of our knowledge, no prior works in the IRL literature have formally studied the reward identifiability problem.

In this work, we formalize the reward identification problem in IRL and study how identifiability relates to properties of the MDP model. For deterministic MDP models with the MaxEnt RL objective, we prove necessary and sufficient conditions for identifiability. Building on these results, we present efficient algorithms for testing whether or not an MDP model is identifiable.

## 2. Preliminaries

Let $\Delta(S)$ denote the set of probability measures over a set $S$. An MDP $\mathcal{M}$ is a finite sequence (tuple) $\mathcal{M} := (X, A, P, P_0, r, \gamma, T)$ where $X$ is the discrete state space, $A$ is the discrete action (decision) space, $P : X \times A \to \Delta(X)$ is the transition kernel, $P_0 \in \Delta(X)$ is the initial state distribution, $r : X \times A \to \mathbb{R}$, with $r \in \mathcal{R}$, is the reward function in a reward family $\mathcal{R}$, and $T \geq 0$ is the time horizon. Let $\oplus$ denote the concatenation operator between finite sequences $A_n = (a_0, \ldots, a_n)$ and $B_m = (b_0, \ldots, b_m)$, so that $A_n \oplus B_m = (a_0, \ldots, a_n, b_0, \ldots, b_m)$. For integers $i \leq j$, we use the indexing notation $A_i = (a_i)$ and $A_{i:j} = (a_i, a_{i+1}, \ldots, a_j)$. We decompose an MDP into its domain $d := (X, A, P, P_0, \gamma) \in \mathcal{D}$, reward $r \in \mathcal{R}$, and horizon $T \in \mathbb{N}$, so that $\mathcal{M} := d \oplus (r) \oplus (T)$. Intuitively, the domain $d$ characterizes the physical embodiment of the decision-making agent as well as the external environment dynamics, the reward $r$ encapsulates the desired optimal behaviors for a task, and the time horizon $T$ defines how many decisions the agent gets to make to accomplish a task. A policy is a stochastic function $\pi : X \to \Delta(A)$, with $\pi \in \Pi$, where $\Pi$ denotes the set of considered policies. A trajectory of length $k$ is a sequence of state-actions, i.e. $\tau = (x_t, a_t)_{t=0}^{k}$ where for all $0 \leq t \leq k$, $x_t \in X$ and $a_t \in A$. The trajectory distribution of a policy $\pi$ executed in a domain $d$ with time horizon $T$ is denoted

$$p(\tau; \pi, d, T) = P_0(x_0)\,\pi(a_0 \mid x_0) \prod_{t=1}^{T} \pi(a_t \mid x_t)\, P(x_t \mid x_{t-1}, a_{t-1}).$$
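To make the notation concrete, here is a minimal illustrative sketch (not from the original paper) of evaluating $p(\tau; \pi, d, T)$ for a small tabular domain; the array layout, the toy domain, and the helper name `trajectory_prob` are assumptions made for illustration.

```python
# Minimal sketch (not from the paper): evaluating the trajectory distribution
# p(tau; pi, d, T) = P0(x0) pi(a0|x0) prod_t pi(a_t|x_t) P(x_t|x_{t-1}, a_{t-1})
# for a small tabular domain. Array shapes and names are illustrative assumptions.
import numpy as np

def trajectory_prob(traj, P0, P, pi):
    """traj: list of (state, action) pairs; P0: (|X|,); P: (|X|,|A|,|X|); pi: (|X|,|A|)."""
    x0, a0 = traj[0]
    prob = P0[x0] * pi[x0, a0]
    for (x_prev, a_prev), (x, a) in zip(traj[:-1], traj[1:]):
        prob *= P[x_prev, a_prev, x] * pi[x, a]
    return prob

# Toy domain with 2 states, 2 actions, deterministic dynamics (action a moves to state a).
P0 = np.array([1.0, 0.0])
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 1] = 1.0
pi = np.full((2, 2), 0.5)  # uniform random policy
print(trajectory_prob([(0, 1), (1, 1), (1, 0)], P0, P, pi))  # P0(0) * 0.5^3 = 0.125
```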
When there is no ambiguity we will omit $d, T$ and simply write $p(\tau; \pi)$. We will denote by $X^0$ the set of feasible initial states, i.e. $x \in X^0 \Leftrightarrow P_0(x) > 0$, and by $\mathcal{T}[x, d, T]$ the set of feasible trajectories of length $T$ in domain $d$ starting from an initial state $x \in X^0$, so $\tau' \in \mathcal{T}[x, d, T]$ if the initial state of $\tau'$ is $x$ and there exists a policy that can sample it, i.e. $p(\tau'; \pi) > 0$. When $x$ is omitted, $\mathcal{T}[d, T] = \bigcup_{x \in X^0} \mathcal{T}[x, d, T]$ denotes the set of all feasible trajectories.

Figure 1. Graphical representation of MDP models, where $d$ is the domain, $J$ is the learning objective, $r$ is the reward, $T$ is the time horizon, $\pi^*$ is the optimal policy for $(d, r, T, J)$, and $\tau$ is a trajectory sampled from $\pi^*$.

The learning objective $J : \Pi \times \mathcal{R} \times \mathcal{D} \times \mathbb{N} \to \mathbb{R}$, with $J \in \mathcal{J}$, is a reward-dependent metric of policy performance that has a unique global maximum with respect to $\pi$ for any $(d, r, T)$. An optimal policy $\pi^*$ for an RL task $(d, r, T, J)$ satisfies $\pi^* = \arg\max_{\pi \in \Pi} J(\pi; d, r, T)$. Note that we restrict ourselves to well-behaved learning objectives that induce a unique optimal policy so that identifiability is well-defined later. For example, the Maximum Entropy RL (MaxEnt RL) objective

$$J_{\text{MaxEnt}}(\pi; d, r, T) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^t r(x_t, a_t) - \log \pi(a_t \mid x_t)\right]$$

satisfies the unique-maximizer property, while the standard RL objective, i.e. $\mathbb{E}_{\pi}[\sum_{t=0}^{T} \gamma^t r(x_t, a_t)]$, does not. For compactness, we overload notation and write $r(\tau) = \sum_{t=0}^{T} \gamma^t r(x_t, a_t)$. We can then define an RL task to be a sequence $\mathcal{M} \oplus (J) := (d, r, T, J)$, i.e. an MDP with a corresponding learning objective, which fully defines the RL problem to be solved. We define the corresponding optimal trajectory distribution for an RL task as $p_r(\tau; d, T, J) := p(\tau; \pi^*, d, T)$. Again, when there is no ambiguity we will omit $d, T$ and simply write $p_r(\tau)$. For some RL tasks, $p_r$ can be written down explicitly as a function of the reward $r$. For example, when $J$ is the MaxEnt RL objective from before and the transitions are deterministic, $p_r(\tau) = e^{r(\tau)} / \sum_{\tau' \in \mathcal{T}[d,T]} e^{r(\tau')}$.

An MDP model $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J] := \{p_r(\,\cdot\,; d, T, J) : r \in \mathcal{R}\}$ is a family of optimal trajectory distributions parametrized by the reward $r$. Note that the unknown parameter is the reward $r$, while $(d, T, J)$ are assumed to be known. The data-generating process for an MDP model is solving an RL problem with respect to $\mathcal{M}$ and then sampling trajectories from the optimal policy $\pi^*$ (see Figure 1).

Inverse Reinforcement Learning (IRL) seeks to invert the map $r \mapsto p_r(\cdot)$ given samples, or demonstrations, from $p_r$. IRL is an ill-posed problem when $r \mapsto p_r$ is not injective with respect to $r$, i.e. many underlying rewards rationalize the distribution of observable data. In this work, we are interested in this question of identifiability: when is it possible to invert $r \mapsto p_r(\cdot)$ up to a reasonable equivalence class given knowledge of $p_r$ (e.g. by having an infinite number of demonstrations) and $(d, T, J)$?
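The deterministic MaxEnt expression above is simply a softmax over feasible trajectory returns. The sketch below (an illustration, not the paper's code; the toy deterministic domain and the helpers `feasible_trajs` and `maxent_traj_dist` are assumptions) enumerates $\mathcal{T}[d, T]$ for a tiny domain with $\gamma = 1$ and computes $p_r$.

```python
# Minimal sketch (assumptions: small deterministic tabular domain, gamma = 1):
# for deterministic dynamics the MaxEnt optimal trajectory distribution is
# p_r(tau) = exp(r(tau)) / sum_{tau'} exp(r(tau')), a softmax over feasible
# trajectory returns. We enumerate T[d, T] by expanding action sequences.
import itertools
import numpy as np

def feasible_trajs(x0, next_state, n_actions, T):
    """Enumerate feasible trajectories (x_0, a_0), ..., (x_T, a_T) starting from x0."""
    trajs = []
    for actions in itertools.product(range(n_actions), repeat=T + 1):
        x, traj = x0, []
        for a in actions:
            traj.append((x, a))
            x = next_state[x, a]          # deterministic transition
        trajs.append(tuple(traj))
    return trajs

def maxent_traj_dist(trajs, r):
    returns = np.array([sum(r[x, a] for x, a in tau) for tau in trajs])
    w = np.exp(returns - returns.max())   # subtract max for numerical stability
    return w / w.sum()

# Toy 2-state, 2-action deterministic domain: action a moves to state a.
next_state = np.array([[0, 1], [0, 1]])
r = np.array([[0.0, 1.0], [0.0, 2.0]])    # illustrative reward table
trajs = feasible_trajs(x0=0, next_state=next_state, n_actions=2, T=2)
p = maxent_traj_dist(trajs, r)
print(trajs[int(np.argmax(p))])           # highest-return trajectory is most likely
```

Adding a constant to every trajectory return leaves this distribution unchanged, which foreshadows why identifiability can only be hoped for up to such shifts.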
In the following section, we will begin by formalizing reward identifiability.

## 3. Identifiability

In general, a statistical model $\mathcal{P} = \{p_\theta \mid \theta \in \Theta\}$ for some parameter family $\Theta$ is said to be identifiable if the map $\theta \mapsto p_\theta$ is injective, i.e. $\theta_1 = \theta_2 \Leftrightarrow p_{\theta_1} = p_{\theta_2}$. However, this standard definition of identifiability is not directly applicable to MDP models, since most RL objectives will yield the same optimal policy when the rewards are additively shifted by a constant. Thus, we first propose a definition of identifiability adapted to MDP models. We use $\cong$ to denote an equivalence relation on $\mathcal{R}$, and denote the corresponding equivalence class of $r$ by $[r] = \{\hat{r} \in \mathcal{R} \mid \hat{r} \cong r\}$. All proofs of theorems, propositions, and examples are deferred to the Appendix.

**Definition 1** (Identifiability). An MDP model $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J] = \{p_r(\,\cdot\,; d, T, J) \mid r \in \mathcal{R}\}$ is identifiable up to an equivalence relation $\cong$ if for all $r, \hat{r} \in \mathcal{R}$,
$$r \cong \hat{r} \;\Leftrightarrow\; p_r = p_{\hat{r}}.$$

We will consider two specific equivalence relations and derive different levels of identifiability from them. The first is trajectory equivalence:
$$r \cong_\tau \hat{r} \;\Leftrightarrow\; \forall x \in X^0,\ \forall \tau', \tau'' \in \mathcal{T}[x, d, T]: \quad \hat{r}(\tau') - r(\tau') = \hat{r}(\tau'') - r(\tau'') \tag{1}$$

Two rewards are trajectory equivalent, i.e. $r \cong_\tau \hat{r}$, if they are equal up to a constant after discounted summing over state-action pairs in trajectories starting from the same initial state. In other words, the two rewards represent the same preferences over trajectories. The second is state-action equivalence:
$$r \cong_{x,a} \hat{r} \;\Leftrightarrow\; \forall (x', a'), (x'', a'') \in X \times A: \quad \hat{r}(x', a') - r(x', a') = \hat{r}(x'', a'') - r(x'', a'') \tag{2}$$

Two rewards are state-action equivalent, i.e. $r \cong_{x,a} \hat{r}$, if they are equal up to a constant at the state-action level. In other words, the two rewards represent the same preferences over state-actions. We will use $[r]_\tau = \{\hat{r} \in \mathcal{R} \mid \hat{r} \cong_\tau r\}$ and $[r]_{x,a} = \{\hat{r} \in \mathcal{R} \mid \hat{r} \cong_{x,a} r\}$ to denote trajectory and state-action equivalence classes, respectively.

Before defining different levels of identifiability, we introduce a notion of proper MDP models, which we will focus on for the remainder of this work.

**Definition 2** (Proper Models). An MDP model $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J] = \{p_r(\,\cdot\,; d, T, J) \mid r \in \mathcal{R}\}$ is proper if for all $r, \hat{r} \in \mathcal{R}$,
$$r \cong_\tau \hat{r} \;\Rightarrow\; p_r = p_{\hat{r}}.$$

A proper MDP model is one that yields the same optimal behavior, i.e. $p_r = p_{\hat{r}}$, when the rewards are trajectory equivalent. An MDP model with most RL objectives $J$ will be proper, since $J$ generally takes the form $\mathbb{E}_\pi[r(\tau)] + f(\pi)$ for some regularization function $f : \Pi \to \mathbb{R}$. For example, MaxEnt MDP models are proper (see Example 1).

**Example 1.** Let $J_{\text{MaxEnt}}$ be the MaxEnt RL objective. Then, the MaxEnt MDP model $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J_{\text{MaxEnt}}]$ is proper.

We are now ready to define different levels of identifiability, starting with weak identifiability (Definition 3).

**Definition 3** (Weak Identifiability). An MDP model $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is weakly identifiable if it is identifiable up to $\cong_\tau$, i.e. trajectory equivalence.

If an MDP model is weakly identifiable, then one can identify rewards at the trajectory level given the optimal behavior. Put differently, a modeler can deduce, from the demonstrator's behavior, the demonstrator's true preferences over trajectories starting from the same initial state. We now define a stronger notion of identifiability in Definition 4.

**Definition 4** (Strong Identifiability). An MDP model is strongly identifiable if it is identifiable up to rewards shifted by a constant, i.e. $\cong_{x,a}$.

If an MDP model is strongly identifiable, then one can identify rewards at the most granular level, i.e. state-actions, given the optimal behavior. Intuitively, the best we could hope for is to have MDP models be identifiable up to state-action equivalence, since the optimal behavior should not change when state-action rewards are shifted by a constant. Indeed, Proposition 1 shows that, for proper MDP models, strong identifiability is a strictly stronger notion of identifiability than weak identifiability.

**Proposition 1.** A proper MDP model is strongly identifiable only if it is weakly identifiable.
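As a concrete illustration of the two relations, the following sketch (not from the paper; the toy deterministic domain and the names `trajectory_equivalent` and `state_action_equivalent` are assumptions) checks Eq. (1) and Eq. (2) numerically by enumerating feasible trajectories with $\gamma = 1$.

```python
# Minimal sketch (not from the paper): numerically checking the two equivalence
# relations from Section 3 on a small deterministic domain where feasible
# trajectories can be enumerated. Names and the toy domain are assumptions.
import itertools
import numpy as np

def traj_return(tau, r):
    return sum(r[x, a] for x, a in tau)  # gamma = 1 for simplicity

def trajectory_equivalent(r, r_hat, trajs_by_x0, tol=1e-9):
    # Eq. (1): within each initial state, r_hat(tau) - r(tau) is a constant.
    for trajs in trajs_by_x0.values():
        diffs = [traj_return(t, r_hat) - traj_return(t, r) for t in trajs]
        if max(diffs) - min(diffs) > tol:
            return False
    return True

def state_action_equivalent(r, r_hat, tol=1e-9):
    return np.ptp(r_hat - r) <= tol      # constant shift at the state-action level

# Toy domain: 2 states, 2 actions, action a deterministically moves to state a.
def enumerate_trajs(x0, T=2, n_actions=2):
    out = []
    for acts in itertools.product(range(n_actions), repeat=T + 1):
        x, tau = x0, []
        for a in acts:
            tau.append((x, a))
            x = a
        out.append(tau)
    return out

trajs_by_x0 = {0: enumerate_trajs(0)}
r = np.array([[0.0, 1.0], [0.0, 2.0]])
r_hat = r + 3.0                          # constant shift: equivalent under both relations
print(trajectory_equivalent(r, r_hat, trajs_by_x0), state_action_equivalent(r, r_hat))
```

A constant shift passes both checks, while perturbing the reward of a single state-action generally passes neither.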
In the following sections, we will characterize MDP models that satisfy different levels of identifiability, starting with weak identifiability and eventually moving up to strong identifiability.

## 4. Weak Identifiability

### 4.1. Deterministic MaxEnt MDP Models

As shown by Proposition 1, an MDP model must first be weakly identifiable in order to be strongly identifiable. In this section, we first show that under the widely employed MaxEnt RL learning objective, $J_{\text{MaxEnt}}$, any MDP model whose domain $d$ has deterministic dynamics is weakly identifiable.

**Theorem 1.** Let $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J_{\text{MaxEnt}}]$ be a MaxEnt MDP model and $\mathcal{R} \subseteq \{r \mid r : X \times A \to \mathbb{R}\}$ be any set of rewards. Then, for all domains $d := (X, A, P, P_0, \gamma)$ consisting of deterministic transition dynamics, i.e. $\forall (x, a),\ |\mathrm{supp}(P(\cdot \mid x, a))| = 1$, a deterministic initial state, i.e. $|\mathrm{supp}(P_0)| = 1$, and $T \geq 0$, $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J_{\text{MaxEnt}}]$ is weakly identifiable.

For such deterministic MaxEnt MDP models, the optimal trajectory distribution is given analytically by (Ziebart et al., 2011)
$$p_r(\tau) = \frac{e^{r(\tau)}}{Z}, \quad \text{where } Z = \sum_{\tau' \in \mathcal{T}[d,T]} e^{r(\tau')}. \tag{3}$$

Eq. 3 provides the intuition behind Theorem 1: $p_r(\tau)$ is an energy-based model (EBM) with the energy function set to the additive inverse of the trajectory reward, $-r(\tau)$. We may then apply the result that two EBMs are equal if and only if their energy functions differ by a constant (see the Appendix for a formal proof). Intuitively, the demonstrator has a strong degree of control over its future when the dynamics are deterministic, as it can choose to sample any feasible trajectory at will without interference from the randomness of the dynamics. Thus, the optimal trajectory distribution can be viewed as a noiseless manifestation of the demonstrator's trajectory preferences, which makes it possible to uniquely identify the trajectory-level rewards. On the contrary, the following section provides examples of stochastic MDPs for which the corresponding MDP model is not identifiable, as a result of the randomness of the environment interfering with the agent realizing trajectories according to its true preferences.

### 4.2. Common Misconceptions about Stochastic MDP Models and Weak Identifiability

A common misconception in the IRL literature is that the trajectory distribution of the MaxEnt optimal policy for stochastic dynamics is always equal to
$$p_r(\tau) \propto e^{r(\tau)}\, P_0(x_0) \prod_{t=1}^{T} P(x_t \mid x_{t-1}, a_{t-1}), \tag{4}$$
which is also simply an EBM, as the dynamics terms can be factored into the exponential. If the misconception were true, then one could also prove that all stochastic MDP models are weakly identifiable, using the same proof as for Theorem 1. As a counterexample to Eq. 4, consider an MDP with uniform random dynamics, i.e. $\forall (x, a, x'),\ P(x' \mid x, a) = 1/|X|$. Consider any reward function $r$ that is state-only, i.e. $\forall (x, a, a'),\ r(x, a) = r(x, a')$, and constant everywhere except for one higher-reward state $x^*$, i.e. $\forall (x, a) \in (X \setminus \{x^*\}) \times A,\ r(x, a) = c$ and $\forall a,\ r(x^*, a) = c + 10$. It is clear that the MaxEnt optimal policy for this MDP is a uniform random policy, since any other policy would have lower entropy yet obtain the same expected rewards, because the environment dynamics force uniform state visitation. However, $p_r(\tau)$ from Eq. 4 exponentially prefers trajectories with higher reward and is thus not the distribution attained by a uniform random policy. We can therefore conclude that the distribution in Eq. 4 is not always attainable in stochastic environments. We can also see that by setting $\hat{r}$ to be the constant reward, i.e. $\forall (x, a),\ \hat{r}(x, a) = c$, we obtain two rewards $r, \hat{r}$ that are not trajectory equivalent: $\hat{r}$ has constant trajectory rewards, while $r$ assigns a higher trajectory reward to trajectories that visit $x^*$ than to those that do not. (Note that both types of trajectories are feasible since the transition dynamics are fully supported on the next state.) Yet $r$ and $\hat{r}$ induce the same uniform trajectory distribution, $p_r = p_{\hat{r}}$. Thus, this counterexample also serves to show that $p_r = p_{\hat{r}} \not\Rightarrow r \cong_\tau \hat{r}$ and that not all stochastic MDP models are weakly identifiable.
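The following is a small numerical illustration of this counterexample. It is a sketch under stated assumptions: a finite-horizon soft (MaxEnt) backward recursion with $\gamma = 1$; the helper `maxent_policies` and the toy sizes are not from the paper. With uniform dynamics, the expected next-step soft value is identical for every state-action, so the recursion returns the uniform policy even though the reward is not constant.

```python
# Numerical illustration of the Section 4.2 counterexample (a sketch under our
# stated assumptions: finite-horizon soft/MaxEnt backward recursion, gamma = 1;
# the helper name and toy sizes are not from the paper).
import numpy as np

def maxent_policies(P, r, T):
    """P: (|X|,|A|,|X|) transition kernel, r: (|X|,|A|). Returns per-step policies."""
    n_states, n_actions = r.shape
    V = np.zeros(n_states)               # soft value at step t+1
    policies = []
    for _ in range(T + 1):               # backward recursion
        Q = r + P @ V                    # Q_t(x,a) = r(x,a) + E_{x'}[V_{t+1}(x')]
        V = np.log(np.exp(Q).sum(axis=1))
        pi = np.exp(Q - V[:, None])      # pi_t(a|x) proportional to exp(Q_t(x,a))
        policies.insert(0, pi)
    return policies

n_states, n_actions, T = 4, 3, 5
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # uniform random dynamics
r = np.zeros((n_states, n_actions))
r[2, :] = 10.0                           # one higher-reward state, state-only reward
for pi in maxent_policies(P, r, T):
    assert np.allclose(pi, 1.0 / n_actions)  # optimal policy is uniform random
print("MaxEnt optimal policy is uniform despite the non-constant reward.")
```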
We leave further characterization of weak identifiability for stochastic MDP models to future work.

## 5. Strong Identifiability

### 5.1. Domain Graphs

The key conceptual idea that will be used throughout the remaining sections is to embed the domain of an MDP model into a graph and reason about how properties of the graph relate to identifiability of the MDP model. We first define domain graphs.

**Definition 5.** A domain graph for a domain $d = (X, A, P, P_0, \gamma)$ is a tuple $G_d := (V_d, E_d, V_d^0)$ where:
1. $V_d := X \times A$ are the vertices.
2. $V_d^0 := \{(x, a) \mid P_0(x) > 0\}$ are the initial vertices.
3. $E_d := \{e := (v, v') = ((x, a), (x', a')) \mid v, v' \in V_d,\ P(x' \mid x, a) > 0\}$ are the edges.

In words, the domain graph has a vertex for each state-action pair and a directed edge between vertices if the corresponding transition occurs with positive probability under the domain dynamics. We refer to the source and destination vertex of a directed edge $e = (v, v')$ by $e_s = v$ and $e_d = v'$, respectively. The initial vertex set $V_d^0$ is the set of vertices that are feasible under the initial state distribution.
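As a sketch of Definition 5 (with data-structure choices made here for illustration; `domain_graph` and the tabular array layout are assumptions), the domain graph can be built directly from a tabular kernel $P$ and initial distribution $P_0$:

```python
# Sketch of Definition 5 (illustrative data-structure choices): build the domain
# graph G_d = (V_d, E_d, V_d^0) from tabular dynamics P and initial distribution P0.
import numpy as np

def domain_graph(P, P0):
    """P: (|X|,|A|,|X|) transition kernel, P0: (|X|,). Returns (V, E, V0)."""
    n_states, n_actions, _ = P.shape
    V = [(x, a) for x in range(n_states) for a in range(n_actions)]
    V0 = [(x, a) for (x, a) in V if P0[x] > 0]
    E = [((x, a), (x2, a2))
         for (x, a) in V for (x2, a2) in V
         if P[x, a, x2] > 0]
    return V, E, V0

# Toy deterministic domain: action a moves to state a; start in state 0.
P = np.zeros((2, 2, 2))
for x in range(2):
    for a in range(2):
        P[x, a, a] = 1.0
P0 = np.array([1.0, 0.0])
V, E, V0 = domain_graph(P, P0)
print(len(V), len(E), V0)  # 4 vertices, 8 edges, initial vertices (0, 0) and (0, 1)
```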
We define L0(v) = {v} and for V Vd, Lk(V ) = S Intuitively, the size of the layers |Lk(v)| should grow with k as vertices further away from the initial vertices in V 0 d become reachable, although this is not strictly true, e.g bipartite graphs where certain vertices can only be reached in a odd or even number of steps. An important family of domain graphs are those that are coverable. Definition 7. A vertex v 2 Vd is said to be t-covering for t 1 if Lt(v) = Vd. We say that a domain graph Gd is tcoverable (or just coverable) if there exists an initial vertex v 2 V 0 d that is t-covering. In words, if a domain graph is t-coverable, there exists some time t at which all vertices can be reached. Intuitively, a domain with a coverable domain graph is one in which the agent can sample trajectories that terminate at a diverse set of state-actions. 5.2. From Weak to Strong Identifiability In this section, we show the conditions under which a weakly identifiable MDP model can become strongly identi- Proposition 2. Let PMDP[R; d, T, J] be an MDP model that is weakly identifiable. Then, it is strongly identifiable if and only if for all r, ˆr 2 R, (r = ˆr) ) (r =x,a ˆr). In other words, 8r 2 R, [r] [r]x,a. In words, a weakly identifiable MDP model is strongly identifiable if trajectory equivalence implies state-action equivalence. Along with domain graphs, Proposition 2 will be used in the later sections to prove necessary and sufficient conditions for strong identifiability. Intuitively, Proposition 2 suggests that in order for an MDP model to be strongly identifiable, the set of feasible trajectories in the MDP should be diverse enough so that the stateaction rewards can be deduced from the trajectory rewards. To formalize this intuition, we introduce a linear systems perspective of identifiability. For simplicity of thought, consider an MDP with only one initial state, i.e |V 0 d |= 1. A path (trajectory) matrix A[d, k] is a matrix, of size |Z[d, k]| |Vd|, whose rows correspond to frequency counts of each node encountered on the feasible paths, i.e for some enumeration of the vertices Vd and paths Z[d, k], Aij[d, k] is the number of times the jth vertex (state-action) v(j) = (x, a)(j) 2 Vd was visited by the ith path (i) = (i) 2 Z[d, k]. When the discount factor γ = 1, the reward of the ith path is simply r( (i)) = P 0 j |Vd| Aij[d, k]r(v(j)) = P 0 j |X A| Aij[d, k]r((x, a)(j)). In vector form, we may denote rx,a = (r(v(j)))0 j |Vd| and r = (r( (j)))0 j |Z[d,k]|. Then, A[d, k]rx,a = r (5) We see that the trajectory rewards are a linear transformation of the state-action rewards with the transformation matrix depending on the set of feasible paths (trajectories). Proposition 2 states that a weakly identifiable model is strongly identifiable if the state-action rewards can be identified from trajectory rewards, which, for γ = 1, is an equivalent statement to saying that, for the linear system of Eq. 5 should have a unique solution (when treating rx,a as an unknown and r as known). Thus, a weakly identifiable model is strongly identifiable if the trajectory (path) matrix is full rank, since this condition guarantees a unique solution to Eq 5. Corollary 1. Let PMDP[R; d, T, J] be an MDP model that is weakly identifiable, R be the set of all rewards, |X 0|= 1, and γ = 1. Then, it is strongly identifiable if and only if rank(A[d, T]) = |X A| In order for A[d, T] to be full rank, the there should be a sufficient number of "linearly independent" paths (trajectories) in the feasible path set. 
Thus, intuitively, domain graphs that generate a sufficiently diverse collection of paths are likely to yield strongly identifiable MDP models. This intuition is captured in the strong identification theorem presented in the following section.

### 5.3. Conditions for Strong Identification

We now state the strong identifiability results, starting with a necessary and sufficient condition for strong identifiability when the domain graph of the MDP model is strongly connected. Indeed, there is a wide pool of real-world domains that have strongly connected domain graphs, including physics environments, e.g. MuJoCo (Todorov et al., 2012) and ThreeDWorld (Gan et al., 2020), navigation environments, e.g. household robot navigation (Mo et al., 2018) and mazes (Brockman et al., 2016), and some investment environments (Rust, 1994). We also provide sufficient conditions for strong identifiability for domain graphs with weaker connectivity.

**Theorem 2** (Strong Identification Condition). For all $(d, r, T, J)$ such that the MDP model $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is proper and $G_d$ is strongly connected:

(Sufficiency) $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is weakly identifiable, $G_d$ is $T_0$-coverable, and $T \geq 2T_0$ $\Rightarrow$ $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is strongly identifiable.

(Necessity) $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is strongly identifiable $\Rightarrow$ $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is weakly identifiable and $G_d$ is coverable.

In short, Theorem 2 states that coverability of the domain graph is a necessary and sufficient condition for strong identifiability. Intuitively, coverable domain graphs have a diverse collection of feasible paths (trajectories), since for every vertex (state-action) there exists a trajectory that terminates at that vertex. Recalling the linear-systems perspective on identifiability from Section 5.2, Theorem 2 shows that coverability indeed guarantees the existence of a threshold time horizon $2T_0$ above which there are enough linearly independent feasible trajectories to make the trajectory matrix $A[d, T]$ full rank, and thus to uniquely solve for $\mathbf{r}_{x,a}$. On the contrary, for a non-coverable domain graph there does not exist any time horizon at which the MDP model is strongly identifiable.

To better picture the types of strongly connected domain graphs that are coverable, we first need to recognize that aperiodicity and coverability are equivalent for strongly connected graphs.

**Proposition 3.** Let $G_d$ be strongly connected. Then, $G_d$ is aperiodic if and only if it is coverable.

Figure 2. Examples of domain graphs $G_d$. On the left is a domain graph for an MDP model that is not strongly identifiable, due to the graph being a $k$-partite cycle (vertices in the same partition have the same color), which is a periodic graph. The right shows a domain graph for a strongly identifiable MDP model, due to the graph being aperiodic as a result of the self-loop state.

The intuition behind Proposition 3 is that if a graph is aperiodic, then there exist cycles of coprime lengths which can be traversed an appropriate number of times before heading to a vertex of choice to terminate at. Since all natural numbers above a threshold can be expressed as a positive linear combination of coprime natural numbers (Denardo, 1977), there must exist a time horizon above which the feasible paths can terminate at any vertex. For the converse, if a graph is coverable at $T$, then it is also coverable at $T + 1, T + 2, \ldots$.
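To make Proposition 3 concrete, the sketch below (toy adjacency lists chosen here to loosely mirror the two Figure 2 examples; `layer` and `coverable` are assumed names) computes layers $L_k(v)$ directly and checks coverability: the 3-partite cycle is never coverable, while adding a single self-loop makes it coverable.

```python
# Sketch illustrating Proposition 3 on two Figure-2-style graphs (toy adjacency
# lists chosen for illustration): a 3-partite cycle (periodic, never coverable)
# versus the same graph with one self-loop added (aperiodic, eventually coverable).
def layer(adj, v0, k):
    """L_k(v0): vertices reachable from v0 in exactly k steps."""
    frontier = {v0}
    for _ in range(k):
        frontier = {w for v in frontier for w in adj[v]}
    return frontier

def coverable(adj, v0, max_k):
    return any(layer(adj, v0, k) == set(adj) for k in range(1, max_k + 1))

cycle = {0: [1], 1: [2], 2: [0]}                  # 3-partite cycle: periodic
with_self_loop = {0: [0, 1], 1: [2], 2: [0]}      # self-loop at 0: aperiodic
print(coverable(cycle, 0, max_k=20))              # False: layers have size 1 forever
print(coverable(with_self_loop, 0, max_k=20))     # True: some L_t(0) equals all vertices
```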
Therefore, there is a cycle of length $T$ that goes from an initial vertex back to itself, as well as a cycle of length $T + 1$ that does the same. Since $T$ and $T + 1$ are coprime, the graph is aperiodic. Corollary 2 immediately follows by combining the results of Theorem 2 and Proposition 3.

**Corollary 2** (Strong Identification Condition). For all $(d, r, T, J)$ such that $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is a proper MDP model and $G_d$ is strongly connected:

(Sufficiency) $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is weakly identifiable and $G_d$ is aperiodic $\Rightarrow$ $\exists T_0 \geq 0$ such that $\forall T \geq T_0$, $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is strongly identifiable.

(Necessity) $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is strongly identifiable $\Rightarrow$ $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is weakly identifiable and $G_d$ is aperiodic.

Theorem 2 and Corollary 2 show that strong identifiability, coverability, and aperiodicity are all equivalent properties for strongly connected domain graphs. This result is encouraging, since the requirement of aperiodicity is fairly weak; there need only exist two cycles $C_1, C_2$ of coprime lengths, such as $k$ and $k + 1$. In fact, if there is any vertex with a self-loop, i.e. $\exists (x, a)$ such that $x \in \mathrm{supp}(P(\cdot \mid x, a))$, the graph is aperiodic. For domains such as physics environments, it is reasonable to expect that there exists a static state with an action that exerts no external force on the environment, maintaining the static state. Periodic graphs specifically have the topology of a $k$-partite cycle, where vertices in each partition do not have edges between each other and vertices in a partition can only be reached at periodic time intervals. See Figure 2 for examples of different types of domain graphs.

### 5.4. Strong Identifiability Test for Strongly Connected Domain Graphs

Building on the results of Theorem 2 and Corollary 2, we present a simple algorithm which tests for strong identifiability by checking whether the underlying domain graph of the MDP model is aperiodic. Algorithm 1 builds off the PeriodFinder algorithm of Denardo (1977), which takes as input a directed graph and returns the greatest common divisor of the lengths of every cycle in the graph.

**Algorithm 1** Strong Identifiability Test for MDP Models with Strongly Connected Domain Graphs

Procedure MDPIdTest($\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$):
1. Construct a domain graph $G_d = (V_d, E_d, V_d^0)$ from $d$.
2. Set gcd = PeriodFinder($V_d$, $E_d$) (Denardo, 1977).
3. Return gcd == 1.

MDPIdTest takes as input an MDP model, constructs a domain graph from $d$, and checks whether or not the domain graph is aperiodic. Algorithm 1 is correct and inherits the time and space efficiency of PeriodFinder (Denardo, 1977), as shown in Theorem 3.

**Theorem 3.** Let $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ be a weakly identifiable MDP model and $G_d$ be strongly connected. Then:

(Correctness) MDPIdTest($\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$) returns 1 (True) if and only if $\exists T$ such that $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is strongly identifiable.

(Efficiency) MDPIdTest runs with time and space complexity $O(|E_d|)$.
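A minimal sketch of the Algorithm 1 test is given below. It does not reproduce Denardo's (1977) PeriodFinder pseudocode; instead, a standard BFS-based period computation (an implementation choice made here for illustration) stands in for it, returning the gcd of cycle lengths of a strongly connected graph.

```python
# Sketch of the Algorithm 1 test (MDPIdTest) for strongly connected domain
# graphs. A standard BFS-based period computation stands in for the
# PeriodFinder subroutine of Denardo (1977); this is an illustrative
# implementation, not the paper's pseudocode.
from collections import deque
from math import gcd

def period(adj):
    """gcd of cycle lengths of a strongly connected directed graph (adjacency lists)."""
    root = next(iter(adj))
    level = {root: 0}
    g = 0
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
            else:
                # every edge closing a walk contributes level[u] + 1 - level[v]
                g = gcd(g, level[u] + 1 - level[v])
    return abs(g)

def mdp_id_test(adj):
    # For proper, weakly identifiable models with strongly connected G_d,
    # aperiodicity <=> some horizon T makes the model strongly identifiable (Corollary 2).
    return period(adj) == 1

print(mdp_id_test({0: [1], 1: [2], 2: [0]}))       # False: 3-partite cycle
print(mdp_id_test({0: [0, 1], 1: [2], 2: [0]}))    # True: self-loop makes it aperiodic
```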
In the next section we show how to test for strong identifiability when the domain graph is not strongly connected.

### 5.5. Strong Identifiability Sufficiency Test for Graphs with Weaker Connectivity

Even when the domain graph is not strongly connected, the sufficient condition from Theorem 2 still holds, as shown in Corollary 3.

**Corollary 3** (Strong Identification Condition). For all $(d, r, T, J)$ such that the MDP model $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is proper:

(Sufficiency) $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is weakly identifiable, $G_d$ is $T_0$-coverable, and $T \geq 2T_0$ $\Rightarrow$ $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is strongly identifiable.

While coverability is still a sufficient condition, it is no longer a necessary condition for strong identifiability, and Proposition 3 does not hold either. Thus, we propose an algorithm that tests coverability directly as a sufficiency test for strong identifiability.

**Algorithm 2** Strong Identifiability Sufficiency Test for General MDP Models

Procedure MDPCoverTest($\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$):
1. Construct a transition matrix $M$ from $d$, with $M_{ij} = P(v^{(j)} \mid v^{(i)})$ where $P(x', a' \mid x, a) = P(x' \mid x, a)$.
2. Compute $M^{|V_d|^2}$.
3. If the rows of $M^{|V_d|^2}$ corresponding to the initial vertices contain only non-zero entries, return 1; else return 0.

We show in Theorem 4 that MDPCoverTest from Algorithm 2 is correct and efficient.

**Theorem 4.** Let $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ be a weakly identifiable MDP model. Then:

(Correctness) If MDPCoverTest($\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$) returns 1 (True), then $\exists T_0$ such that $\forall T \geq T_0$, $\mathcal{P}_{\text{MDP}}[\mathcal{R}; d, T, J]$ is strongly identifiable.

(Efficiency) MDPCoverTest runs with time complexity $O(|V_d|^3 \log |V_d|)$ and space complexity $O(|V_d|^2)$.

In Algorithm 2, non-zero entries in the $k$th power of the transition matrix indicate vertices in the $k$th layer of the domain graph. We check the $|V_d|^2$ power since the covering horizon is upper bounded by $|V_d|^2$ (see the proof of Theorem 4 in the Appendix).
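A sketch of the Algorithm 2 test follows, with implementation choices made here for illustration rather than taken from the paper: it uses the boolean vertex-to-vertex adjacency matrix instead of the stochastic matrix $M$, and reaches the $|V_d|^2$ power by exponentiation by squaring, re-binarizing after each multiply to avoid overflow; the non-zero pattern, which is all the test needs, is unaffected.

```python
# Sketch of the Algorithm 2 sufficiency test (MDPCoverTest). Implementation
# choices here are illustrative: a boolean (0/1) vertex-to-vertex adjacency
# matrix and exponentiation by squaring to reach the |V_d|^2 power, with
# re-binarization after each multiply to avoid integer overflow.
import numpy as np

def reach_exact_steps(adj, k):
    """Boolean matrix power: entry [u, v] is True iff v is reachable from u in exactly k steps."""
    result = np.eye(len(adj), dtype=bool)
    base = adj.astype(bool)
    while k:
        if k & 1:
            result = (result.astype(int) @ base.astype(int)) > 0
        base = (base.astype(int) @ base.astype(int)) > 0
        k >>= 1
    return result

def mdp_cover_test(P, P0):
    """P: (|X|,|A|,|X|) transition kernel, P0: (|X|,). Returns True if G_d is coverable."""
    n_states, n_actions, _ = P.shape
    n = n_states * n_actions                                # |V_d|, vertex (x,a) -> index x*|A| + a
    adj = np.repeat((P > 0).reshape(n, n_states), n_actions, axis=1)
    layers = reach_exact_steps(adj, n * n)                  # covering horizon <= |V_d|^2
    init = np.repeat(P0 > 0, n_actions)                     # initial vertices: P0(x) > 0
    return bool(layers[init].all(axis=1).any())             # some initial vertex reaches every vertex

# Deterministic 2-state, 2-action domain where action a moves to state a.
P = np.zeros((2, 2, 2))
for x in range(2):
    for a in range(2):
        P[x, a, a] = 1.0
print(mdp_cover_test(P, np.array([1.0, 0.0])))              # True: sufficient for strong identifiability
```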
We leave deriving conditions that are both necessary and sufficient for strong identifiability of non-strongly-connected graphs to future work.

## 6. Related Work

In the Machine Learning (ML) literature, IRL (Ng et al., 2000; Ramachandran and Amir, 2007; Ziebart et al., 2011; Boularias et al., 2011) has been studied for two main purposes. The first is to run RL with the estimated reward in order to perform imitation (Ng et al., 2000; Finn et al., 2016) or apprenticeship learning (Abbeel and Ng, 2004). IRL is an alternative to direct policy imitation (Pomerleau, 1991; Ho and Ermon, 2016; Kostrikov et al., 2020; Kim et al., 2020a;b), with the motivation being that the reward, in some cases, may be easier to estimate from fewer demonstrations than the policy. The second purpose is to run RL with the estimated reward in a different environment in order to transfer policies (Fu et al., 2017). The intuition is that rewards are likely to transfer more readily across environments compared to policies.

Many prior works in IRL (Ng et al., 2000; Ziebart et al., 2011; Ziebart, 2010; Dvijotham and Todorov, 2010; Fu et al., 2017; Amin et al., 2017; Geng et al., 2020) have touched upon the identifiability problem, but none have formally addressed it. Although not widely known in the IRL community, the field of econometrics has a rich body of work on identifying Dynamic Discrete Choice (DDC) models (Rust, 1994; Arcidiacono and Miller, 2011; 2020; Abbring and Daljord, 2020), which is an equivalent problem to IRL. The main result is that for infinite-horizon MDP models with stochastic dynamics fully supported on the state space, rewards are not strongly identifiable. The DDC literature also separately explores identification for counterfactual reasoning (Kalouptsidi et al., 2015; Christensen and Connault, 2019), which has various application areas such as healthcare (Heckman and Navarro, 2007) and retail competition (Arcidiacono and Miller, 2011). The seemingly unrelated area of network tomography is also related in the sense that it seeks to identify edge-level quantities in a graph from path-level measurements. Although the specific formulation has subtle differences, such as the paths being constrained to start and end on designated "measurement nodes", several works have studied necessary and sufficient conditions for identification of edge-level quantities (Ren and Dong, 2016), and several works have proposed algorithms to find a set of linearly independent paths that represent equations which can be solved to identify edge quantities (Gopalan and Ramasubramanian, 2011; Ren and Dong, 2016; Ma et al., 2013).

## 7. Discussion and Future Work

In this work we have formalized the reward identification problem in IRL, shown that deterministic MaxEnt MDP models are strongly identifiable if and only if the corresponding domain graph is aperiodic, and presented algorithms for testing identifiability in $O(|E_d|)$ time and space. A summary of our characterization of varying levels of identifiability is presented in Figure 3.

Figure 3. Venn diagram of MDP models with varying levels of identifiability. Weakly Identifiable (WI) models, such as deterministic MaxEnt MDP models, are a strict subset of proper models; a stochastic MaxEnt MDP model with uniform random dynamics is an example of a proper model that is not WI (Section 4.2). Proper and Strongly Identifiable (SI) models are a strict subset of WI models; a deterministic MaxEnt MDP model with a $k$-partite cycle domain graph (Figure 2) is an example of a WI model that is not SI. Finally, proper, SI, and Strongly Connected (SC) models are equivalent to models with a coverable, or equivalently aperiodic, domain graph, such as a graph with a vertex that has a self-loop (Figure 2).

The usefulness of IRL in real-world applications, e.g. using MDPs as computational models of decision makers for interpretation and counterfactual reasoning, depends heavily on whether the underlying rewards can be identified. Thus, the main practical guidance our theory and algorithms provide is a formal framework to evaluate which modeling assumptions and domains are suitable for IRL. Imagine yourself as a neuroscientist studying the behavior of male mice navigating a maze with various stimuli: cheeses, scents of females, and traps. You seek to apply IRL to understand the cognitive processes of the mice by inferring their underlying utility. Important questions to ask are: how should I design the maze so that the mice's utility can be inferred from behavioral data? What type of MDP modeling assumptions should I make? By applying Theorems 1 and 2, we can conclude that a MaxEnt MDP model is appropriate and that the maze should be designed so that a mouse has the option to remain static (having a self-loop state guarantees strong identifiability).

There are a number of important open questions left to answer. First, when are stochastic MDP models weakly identifiable, if ever?
The answer to this question can then be combined with our strong identifiability results to characterize when stochastic MDP models are strongly identifiable. Second, what are the necessary and sufficient conditions for strong identifiability when the domain graph is not strongly connected? Do there exist efficient algorithms for testing identifiability in more weakly connected domain graphs?

## Acknowledgements

This research was supported by Sony, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), and AFOSR (FA9550-19-1-0024).

## References

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.

Jaap H. Abbring and Øystein Daljord. Identifying the discount factor in dynamic discrete choice models. Quantitative Economics, 11(2):471–501, 2020.

Kareem Amin, Nan Jiang, and Satinder Singh. Repeated inverse reinforcement learning. arXiv preprint arXiv:1705.05427, 2017.

Peter Arcidiacono and Robert A. Miller. Conditional choice probability estimation of dynamic discrete choice models with unobserved heterogeneity. Econometrica, 79(6):1823–1867, 2011.

Peter Arcidiacono and Robert A. Miller. Identifying dynamic discrete choice models off short panels. Journal of Econometrics, 215(2):473–485, 2020.

Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 182–189. JMLR Workshop and Conference Proceedings, 2011.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Timothy Christensen and Benjamin Connault. Counterfactual sensitivity and robustness. arXiv preprint arXiv:1904.00989, 2019.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2009.

Eric V. Denardo. Periods of connected networks and powers of nonnegative matrices. Mathematics of Operations Research, 2(1):20–24, 1977.

Avinash K. Dixit and Robert S. Pindyck. Investment under Uncertainty. Princeton University Press, 1994.

Kenji Doya and Terrence J. Sejnowski. A computational model of birdsong learning by auditory experience and auditory feedback. In Central Auditory Processing and Neural Modeling, pages 77–88. Springer, 1998.

Krishnamurthy Dvijotham and Emanuel Todorov. Inverse optimal control with linearly-solvable MDPs. In ICML, 2010.

Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.

Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

Chuang Gan, Jeremy Schwartz, Seth Alter, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, Abhishek Bhandwaldar, Nick Haber, Megumi Sano, et al. ThreeDWorld: A platform for interactive multi-modal physical simulation. arXiv preprint arXiv:2007.04954, 2020.

Sinong Geng, Houssam Nassif, Carlos Manzanares, Max Reppen, and Ronnie Sircar. Deep PQR: Solving inverse reinforcement learning using anchor actions. In International Conference on Machine Learning, pages 3431–3441. PMLR, 2020.
Abishek Gopalan and Srinivasan Ramasubramanian. On identifying additive link metrics using linearly independent cycles and paths. IEEE/ACM Transactions on Networking, 20(3):906–916, 2011.

James J. Heckman and Salvador Navarro. Dynamic discrete choice and dynamic treatment effects. Journal of Econometrics, 136(2):341–396, 2007.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

J. P. Jarvis and Douglas R. Shier. Graph-theoretic analysis of finite Markov chains. Applied Mathematical Modeling: A Multidisciplinary Approach, page 85, 1999.

Myrto Kalouptsidi, Paul T. Scott, and Eduardo Souza-Rodrigues. Identification of counterfactuals in dynamic discrete choice models. Technical report, National Bureau of Economic Research, 2015.

Kuno Kim, Yihong Gu, Jiaming Song, Shengjia Zhao, and Stefano Ermon. Domain adaptive imitation learning. In International Conference on Machine Learning, pages 5286–5295. PMLR, 2020a.

Kuno Kim, Akshat Jindal, Yang Song, Jiaming Song, Yanan Sui, and Stefano Ermon. Imitation with neural density models. arXiv preprint arXiv:2010.09808, 2020b.

Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. 2020.

Liang Ma, Ting He, Kin K. Leung, Don Towsley, and Ananthram Swami. Efficient identification of additive link metrics via network tomography. In 2013 IEEE 33rd International Conference on Distributed Computing Systems, pages 581–590. IEEE, 2013.

Kaichun Mo, Haoxiang Li, Zhe Lin, and Joon-Young Lee. The AdobeIndoorNav dataset: Towards deep reinforcement learning based real-world indoor robot visual navigation. 2018.

P. Read Montague, Peter Dayan, Christophe Person, and Terrence J. Sejnowski. Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377(6551):725–728, 1995.

Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.

Lars Relund Nielsen and Anders Ringgaard Kristensen. Markov decision processes to model livestock systems. In Handbook of Operations Research in Agriculture and the Agri-Food Industry, pages 419–454. Springer, 2015.

Yael Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3):139–154, 2009.

Dean A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991. ISSN 0899-7667.

Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591, 2007.

Wei Ren and Wei Dong. Robust network tomography: K-identifiability and monitor assignment. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016.

John Rust. Structural estimation of Markov decision processes. Handbook of Econometrics, 4:3081–3143, 1994.

E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012. doi: 10.1109/IROS.2012.6386109.

Brian D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.

Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, 2008.
Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Maximum causal entropy correlated equilibria for Markov games. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '11, pages 207–214, Richland, SC, 2011. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9780982657157.