# Multi-View Decision Processes: The Helper-AI Problem

Christos Dimitrakakis (Chalmers University of Technology & University of Lille), David C. Parkes (Harvard University), Goran Radanovic (Harvard University), Paul Tylkin (Harvard University)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

We consider a two-player sequential game in which agents have the same reward function but may disagree on the transition probabilities of an underlying Markovian model of the world. By committing to play a specific policy, the agent with the correct model can steer the behavior of the other agent and seek to improve utility. We model this setting as a multi-view decision process, which we use to formally analyze the positive effect of steering policies. Furthermore, we develop an algorithm for computing the agents' achievable joint policy, and we experimentally show that it can lead to a large utility increase when the agents' models diverge.

1 Introduction

In the past decade, we have been witnessing the fulfillment of Licklider's profound vision on AI [Licklider, 1960]: "Man-computer symbiosis is an expected development in cooperative interaction between men and electronic computers." Needless to say, such a collaboration between humans and AIs is natural in many real-world AI problems. As a motivating example, consider the case of autonomous vehicles, where a human driver can override the AI driver if needed. With advances in AI, the human will benefit most if she allows the AI agent to assume control and drive optimally. However, this might not be achievable: due to human behavioral biases, such as over-weighting the importance of rare events, the human might incorrectly override the AI. In this way, the misaligned models of the two drivers can lead to a decrease in utility.

In general, this problem may occur whenever two agents disagree on their view of reality, even if they cooperate to achieve a common goal. Formalizing this setting leads to a class of sequential multi-agent decision problems that extend stochastic games. While in a stochastic game there is an underlying transition kernel on which all agents (players) agree, the same is not necessarily true in the described scenario. Each agent may have a different transition model. We focus on a leader-follower setting in which the leader commits to a policy that the follower then best-responds to, according to the follower's model. Mapped to our motivating example, this would mean that the AI driver is aware of human behavioral biases and takes them into account when deciding how to drive.

To incorporate both sequential and stochastic aspects, we model this as a multi-view decision process. Our multi-view decision process is based on an MDP model, with two, possibly different, transition kernels. One of the agents, hereafter denoted as P1, is assumed to have the correct transition kernel and is chosen to be the leader of the Stackelberg game: it commits to a policy that the second agent (P2) best-responds to according to its own model. The agents have the same reward function, and are in this sense cooperative. In an application setting, while the human (P2) may not be a planner, we motivate our set-up as modeling the endpoint of an adaptive process that leads P2 to adopt a best response to the policy of P1. Using the multi-view decision process, we analyze the effect of P2's imperfect model on the achieved utility.
We place an upper bound on the utility loss due to this, and also provide a lower bound on how much P1 gains by knowing P2's model. One of our main analysis tools is the amount of influence an agent has, i.e. how much its actions affect the transition probabilities, both according to its own model and according to the model of the other agent. We also develop an algorithm, extending backwards induction for simultaneous-move sequential games [cf. Bošanský et al., 2016], to compute a pair of policies that constitute a subgame perfect equilibrium.

In our experiments, we introduce intervention games as a way to construct example scenarios. In an intervention game, an AI and a human share control of a process, and the human can intervene to override the AI's actions but suffers some cost in doing so. This allows us to derive a multi-view process from any single-agent MDP. We consider two domains: the intervention-game variant of the shelter-food game introduced by Guo et al. [2013], and an autonomous driving problem that we introduce here. Our results show that the proposed approach provides a large increase in utility in each domain, thus overcoming the deficiencies of P2's model when the latter model is known to the AI.

1.1 Related work

Environment design [Zhang et al., 2009, Zhang and Parkes, 2008] is a related problem, where a first agent seeks to modulate the behavior of a second agent. However, the interaction between agents occurs through finding a good modification of the second agent's reward function: the AI observes a human performing a task, and uses inverse reinforcement learning [Ng et al., 2000] to estimate the human's reward function. It can then assign extrinsic reward to different states in order to improve the human's policy. A similar problem in single-agent reinforcement learning is how to use internal rewards to improve the performance of a computationally bounded reinforcement learning agent [Sorg et al., 2010]. For example, even a myopic agent can maximize expected utility over a long time horizon if augmented with appropriately designed internal rewards. Our model differs from these prior works in that the interaction between a helper agent and a second agent is through taking actions in the same environment as the second agent.

In cooperative inverse reinforcement learning [Hadfield-Menell et al., 2016], an AI wants to cooperate with a human but does not initially understand the task. While their framework allows for simultaneous moves of the AI and the human, they only apply it to two-stage games, where the human demonstrates a policy in the first stage and the AI imitates it in the second stage. They show that the human should take into account the AI's best response when providing demonstrations, and develop an algorithm for computing an appropriate demonstration policy. Our focus is on joint actions in a multi-period, uncertain environment, rather than teaching. The model of Amir et al. [2016] is also different, in that it considers the problem of how a teacher can optimally give advice to a sub-optimal learner, and is thus focused on communication and adaptation rather than interaction through actions. Finally, Elmalech et al. [2015] consider an advice-giving AI in single-shot games, where the human has an incorrect model. They experimentally find that performance improves when the AI heuristically models human expectations when giving advice. We find that this also holds in our more general setting.
We cannot use standard methods for computing optimal strategies in stochastic games [Bošanský et al., 2015, Zinkevich et al., 2005], as the two agents have different models of the transitions between states. At the other extreme, a very general formalism for representing agent beliefs, such as that of Gal and Pfeffer [2008], is not well suited either, because we have a Stackelberg setting and the problem of the follower is standard. Our approach is to extend backwards induction [cf. Bošanský et al., 2016, Sec. 4] to the case of misaligned models in order to obtain a subgame perfect policy for the AI.

Paper organization. Section 2 formalises the setting and its basic properties, and provides a lower bound on the improvement P1 obtains when P2's model is known. Section 3 introduces a backwards induction algorithm, while Section 4 discusses the experimental results. We conclude with Section 5. Finally, Appendix A collects all the proofs, additional technical material and experimental details.

2 The Setting and Basic Properties

We consider a sequential stochastic game between two agents, P1 and P2, who disagree on the underlying model of the world, with the i-th agent's model being µi, but who share the same reward function. More formally:

Definition 1 (Multi-view decision process (MVDP)). A multi-view decision process G = ⟨S, A, σ1, σ2, µ1, µ2, ρ, γ⟩ is a game between two agents, P1 and P2, who share the same reward function. The game has a state space S, with S ≜ |S|, action space A = A1 × A2, with A ≜ |A|, starting state distributions σi, transition kernels µi, reward function¹ ρ : S → [0, 1], and discount factor γ ∈ [0, 1]. At time t, the agents observe the state st, take a joint action at = (at,1, at,2) and receive reward rt = ρ(st). However, the two agents may have a different view of the game, with agent i modelling the transition probabilities of the process as µi(st+1 | st, at), the probability of the next state st+1 given the current state st and joint action at. Each agent's actions are drawn from a policy πi, which may be an arbitrary behavioral policy, fixed at the start of the game.

For a given policy pair π = (π1, π2), with πi ∈ Πi and Π ≜ Π1 × Π2, the respective payoff from the point of view of the i-th agent, ui : Π → ℝ, is defined to be:

$$u_i(\pi) = \mathbb{E}^{\pi}_{\mu_i}[U \mid s_1 \sim \sigma], \qquad U \triangleq \sum_{t=1}^{\infty} \gamma^{t-1} \rho(s_t). \tag{2.1}$$

For simplicity of presentation, we define the reward rt = ρ(st) at time t as a function of the state only, although an extension to state-action reward functions is trivial. The reward, as well as the utility U (the discounted sum of rewards over time), is the same for both agents for a given sequence of states. However, the payoff of agent i is its expected utility under model µi, and so can differ between the agents.

Any two-player stochastic game can be cast as an MVDP:

Lemma 1. Any two-player general-sum stochastic game (SG) can be reduced to a two-player MVDP in polynomial time and space.

The proof of Lemma 1 is in Appendix A.
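
To make Definition 1 and equation (2.1) concrete, here is a minimal sketch (ours, not code from the paper) of a finite MVDP with explicit transition tensors, together with the payoff ui(π) of a fixed pair of Markov policies, computed by solving the linear system of the induced Markov chain. The class and method names are our own assumptions; agents are indexed 0 and 1 in code for P1 and P2.

```python
import numpy as np

class MVDP:
    """Finite multi-view decision process (illustrative container, our own naming).

    mu[i] has shape (S, A1, A2, S'): agent i's transition model.
    rho has shape (S,): state-dependent reward in [0, 1].
    sigma has shape (S,): starting-state distribution.
    """
    def __init__(self, mu1, mu2, rho, gamma, sigma):
        self.mu = [np.asarray(mu1), np.asarray(mu2)]
        self.rho = np.asarray(rho)
        self.gamma = gamma
        self.sigma = np.asarray(sigma)

    def payoff(self, i, pi1, pi2):
        """u_i(pi) of Eq. (2.1): expected discounted return under agent i's model mu_i,
        for Markov policies pi1 of shape (S, A1) and pi2 of shape (S, A2)."""
        mu = self.mu[i]
        S = mu.shape[0]
        # State-to-state kernel induced by the joint policy under model mu_i.
        P = np.einsum('sa,sb,sabt->st', pi1, pi2, mu)
        # V satisfies V = rho + gamma * P V, i.e. V = (I - gamma P)^{-1} rho.
        V = np.linalg.solve(np.eye(S) - self.gamma * P, self.rho)
        return float(self.sigma @ V)
```

Because the two payoffs u1 and u2 are computed from the same reward but different kernels, the same joint policy generally yields different values for the two agents, which is the source of the misalignment analyzed below.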
2.1 Stackelberg setting

We consider optimal policies from the point of view of P1, who is trying to assist a misguided P2. For simplicity, we restrict our attention to the Stackelberg setting, i.e. where P1 commits to a specific policy π1 at the start of the game. This simplifies the problem for P2, who can then play the optimal response according to its own model of the world. We begin by defining the (potentially unachievable) optimal joint policy, where both policies are chosen to maximise the same utility function.

Definition 2 (Optimal joint policy). A joint policy π̃ is optimal under σ and µ1 iff u1(π̃) ≥ u1(π) for all π ∈ Π. We furthermore use ũ1 ≜ u1(π̃) to refer to the value of the jointly optimal policy.

This value may not be achievable, even though the two agents share a reward function, as the second agent's model does not agree with the first agent's, and so their expected utilities are different. To model this, we define the Stackelberg utility of policy π1 for the first agent as

$$u^{St}_1(\pi_1) \triangleq u_1(\pi_1, \pi^B_2(\pi_1)), \qquad \pi^B_2(\pi_1) = \arg\max_{\pi_2 \in \Pi_2} u_2(\pi_1, \pi_2), \tag{2.2}$$

i.e. the value of the policy when the second agent best-responds to agent one's policy under the second agent's model.²

¹ For simplicity we consider state-dependent rewards bounded in [0, 1]. Our results are easily generalizable to ρ : S × A → [0, 1], and, through scaling by a factor of B and shifting by b, to any reward function with range in [b, b + B].
² If there is no unique best response, we define the utility in terms of the worst-case best response.

The following defines the highest utility that P1 can achieve.

Definition 3 (Optimal policy). The optimal policy for P1, denoted by π*1, is the one maximizing the Stackelberg utility, i.e. u^St_1(π*1) ≥ u^St_1(π1) for all π1 ∈ Π1, and we use u*1 ≜ u^St_1(π*1) to refer to the value of this optimal policy.

In the remainder of the technical discussion, we characterize P1 policies in terms of how much worse they are than the jointly optimal policy, as well as how much better they can be than the policy that blithely assumes that P2 shares the same model. We start with some observations about the nature of the game when one agent fixes its policy, and we argue how the difference between the models of the two agents affects the utility functions. We then combine this with a definition of influence to obtain bounds on the loss due to the difference in the models.

When agent i fixes a Markov policy πi, the game is an MDP for agent j. However, if agent i's policy is not Markovian, the resulting game is not an MDP on the original state space. We show that if P1 acts as if P2 has the correct transition kernel, then the resulting joint policy has value bounded in terms of the L1 norm between the true kernel and P2's actual kernel. We begin by establishing a simple inequality showing that knowledge of the model µ2 is beneficial for P1.

Lemma 2. For any MVDP, the utility of the jointly optimal policy is at least that of the (achievable) optimal policy, which is in turn at least that of the policy that assumes µ2 = µ1:

$$u_1(\tilde\pi) \geq u^{St}_1(\pi^*_1) \geq u^{St}_1(\tilde\pi_1). \tag{2.3}$$

Proof. The first inequality follows from the definition of the jointly optimal policy and u^St_1. For the second inequality, note that the middle term is a maximizer of the right-hand side.

Consequently, P1 must be able to do (weakly) better if it knows µ2 than if it just assumes that µ2 = µ1. However, this does not tell us how much (if any) improvement we can obtain.
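
As a concrete illustration of equation (2.2), the sketch below (again ours, reusing the hypothetical MVDP container from the previous sketch) computes P2's best response to a fixed Markov policy π1 by value iteration in the MDP that π1 induces under µ2, and then evaluates the resulting pair under µ1 to obtain the Stackelberg utility. It simplifies in two ways: P2 is restricted to Markov policies, and ties between best responses are broken arbitrarily rather than pessimistically (footnote 2 uses the worst-case best response).

```python
import numpy as np

def best_response_p2(mvdp, pi1, iters=2000, tol=1e-10):
    """Greedy deterministic best response of P2, under its own model mu_2,
    to a fixed Markov policy pi1 of P1, computed by value iteration."""
    mu2 = mvdp.mu[1]
    S, A1, A2, _ = mu2.shape
    # MDP over P2's actions induced by pi1 under mu_2.
    P2_kernel = np.einsum('sa,sabt->sbt', pi1, mu2)      # shape (S, A2, S')
    V = np.zeros(S)
    for _ in range(iters):
        Q = mvdp.rho[:, None] + mvdp.gamma * P2_kernel @ V   # shape (S, A2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    greedy = Q.argmax(axis=1)            # ties broken arbitrarily, not worst-case
    pi2 = np.zeros((S, A2))
    pi2[np.arange(S), greedy] = 1.0
    return pi2

def stackelberg_utility(mvdp, pi1):
    """u_1^{St}(pi1) of Eq. (2.2): P1's payoff under mu_1 when P2 best-responds under mu_2."""
    pi2 = best_response_p2(mvdp, pi1)
    return mvdp.payoff(0, pi1, pi2)
```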
Our idea is to see what policy π1 we would need to play in order to make P2 play π̃2, and to measure the distance of this policy from π̃1. To obtain a useful bound, we need a measure of how much P1 must deviate from π̃1 in order for P2 to play π̃2. For this, we define the notion of influence. This will capture the amount by which an agent i can affect the game in the eyes of agent j. In particular, it is the maximal amount by which agent i can affect the transition distribution of agent j by changing its own action at each state s.

Definition 4 (Influence). The influence of agent i on the transition distribution of model µj is defined as the vector:

$$I_{i,j}(s) \triangleq \max_{a_{t,-i}} \max_{a_{t,i}, a'_{t,i}} \left\lVert \mu_j(\cdot \mid s_t = s, a_{t,i}, a_{t,-i}) - \mu_j(\cdot \mid s_t = s, a'_{t,i}, a_{t,-i}) \right\rVert_1, \tag{2.4}$$

where the norm is over the difference in next-state distributions over st+1 induced by the two choices of agent i's action.

Thus, I1,1 describes the actual influence of P1 on the transition probabilities, while I1,2 describes the influence of P1 as perceived by P2. We use influence to define a µ-dependent distance between policies, capturing the effect of an altered policy on the model:

Definition 5 (Policy distance). The distance between policies πi, π′i under model µj is:

$$\lVert \pi_i - \pi'_i \rVert_{\mu_j} \triangleq \max_{s \in S} \lVert \pi_i(\cdot \mid s) - \pi'_i(\cdot \mid s) \rVert_1 \, I_{i,j}(s). \tag{2.5}$$

These two definitions result in the following Lipschitz condition on the utility function, whose proof can be found in Appendix A.

Lemma 3. For any fixed π2, and any π1, π′1:

$$u_i(\pi_1, \pi_2) \leq u_i(\pi'_1, \pi_2) + \lVert \pi_1 - \pi'_1 \rVert_{\mu_i} \frac{\gamma}{(1-\gamma)^2},$$

with a symmetric result holding for any fixed policy π1 and any pair π2, π′2.

Lemma 3 bounds the change in agent i's payoff due to a change in policy by P1. As shall be seen in the next section, it allows us to analyze how close the utility we can achieve comes to that of the jointly optimal policy, and how much can be gained by not naively assuming that the model of P2 is the same as P1's.

2.2 Optimality

In this section, we illuminate the relationship between different types of policies. First, we show that if P1 simply assumes µ2 = µ1, it suffers only a bounded loss relative to the jointly optimal policy. Subsequently, we prove that knowing µ2 allows P1 to find an improved policy.

Lemma 4. Consider the optimal policy π̃1 for the modified game G̃ = ⟨S, A, σ1, σ1, µ1, µ1, ρ, γ⟩, in which P2's model is correct. Then π̃1 is Markov and achieves utility ũ in G̃, while its utility in G is:

$$u^{St}_1(\tilde\pi_1) \geq \tilde{u} - \frac{2\,\lVert \mu_1 - \mu_2 \rVert_1}{(1-\gamma)^2}, \qquad \lVert \mu_1 - \mu_2 \rVert_1 \triangleq \max_{s_t, a_t} \lVert \mu_1(\cdot \mid s_t, a_t) - \mu_2(\cdot \mid s_t, a_t) \rVert_1.$$

As this bound depends on the maximum over all state-action pairs, we refine it in terms of the influence of each agent's actions. This also allows us to measure the loss in terms of the difference between P2's actual and desired response, rather than the difference between the two models, which can be much larger.

Corollary 1. If P2's best response to π̃1 is π^B_2(π̃1) ≠ π̃2, then our loss relative to the jointly optimal policy is bounded by

$$u_1(\tilde\pi_1, \tilde\pi_2) - u_1(\tilde\pi_1, \pi^B_2(\tilde\pi_1)) \leq \lVert \pi^B_2(\tilde\pi_1) - \tilde\pi_2 \rVert_{\mu_1} \frac{\gamma}{(1-\gamma)^2}.$$

Proof. This follows from Lemma 3 by fixing π̃1 for the policy pair π^B_2(π̃1), π̃2 under µ1.

While the previous corollary gave an upper bound on the loss we incur if we ignore the beliefs of P2, we can bound the loss of the optimal Stackelberg policy in the same way:

Corollary 2. The difference between the jointly optimal utility u1(π̃1, π̃2) and the optimal Stackelberg utility u^St_1(π*1) is bounded by

$$u_1(\tilde\pi_1, \tilde\pi_2) - u^{St}_1(\pi^*_1) \leq \lVert \pi^B_2(\tilde\pi_1) - \tilde\pi_2 \rVert_{\mu_1} \frac{\gamma}{(1-\gamma)^2}.$$

Proof. The result follows directly from Corollary 1 and Lemma 2.

This bound is not very informative by itself, as it does not suggest an advantage for the optimal Stackelberg policy. Instead, we can use Lemma 3 to lower bound the increase in utility obtained relative to just playing the optimistic policy π̃1. We start by observing that when P2 responds with some π̂2 to π̃1, P1 could improve upon this by playing π̂1 = π^B_1(π̂2), the best response to π̂2, if P1 could somehow force P2 to stick to π̂2. We can define

$$\Delta \triangleq u_1(\hat\pi_1, \hat\pi_2) - u_1(\tilde\pi_1, \hat\pi_2) \tag{2.6}$$

to be the potential advantage from switching to π̂1.
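
The influence and policy-distance quantities driving the bounds in this section are straightforward to compute for finite models. The sketch below (ours, not from the paper, again reusing the hypothetical MVDP container) evaluates the influence Ii,j(s) of Definition 4 and the policy distance of Definition 5 directly from the transition tensors; agent indices are 0-based in code (0 for P1, 1 for P2).

```python
import numpy as np

def influence(mvdp, i, j):
    """I_{i,j}(s) of Definition 4: the largest L1 change in agent j's next-state
    distribution that agent i can cause at state s by switching its own action,
    maximized over the other agent's action."""
    mu = mvdp.mu[j]                        # shape (S, A1, A2, S')
    if i == 1:                             # put agent i's action on the second axis
        mu = mu.transpose(0, 2, 1, 3)
    S, Ai, Aother, _ = mu.shape
    I = np.zeros(S)
    for s in range(S):
        for b in range(Aother):            # other agent's action a_{t,-i}
            for a in range(Ai):            # agent i's action a_{t,i}
                for a_alt in range(Ai):    # alternative action a'_{t,i}
                    diff = np.abs(mu[s, a, b] - mu[s, a_alt, b]).sum()
                    I[s] = max(I[s], diff)
    return I

def policy_distance(mvdp, i, j, pi_i, pi_i_alt):
    """||pi_i - pi_i'||_{mu_j} of Definition 5, for Markov policies of shape (S, A_i)."""
    I = influence(mvdp, i, j)
    per_state_l1 = np.abs(pi_i - pi_i_alt).sum(axis=1)   # ||pi_i(.|s) - pi_i'(.|s)||_1
    return float(np.max(per_state_l1 * I))
```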
Theorem 1 characterizes how close to the advantage Δ P1 can get by playing a stochastic policy π^α_1(a | s) ≜ α π̃1(a | s) + (1 − α) π̂1(a | s), while ensuring that P2 sticks to π̂2.

Theorem 1 (A sufficient condition for an advantage over the naive policy). Let π̂2 = π^B_2(π̃1) be the response of P2 to the optimistic policy π̃1, and assume Δ > 0. Then we can obtain an advantage of at least

$$\Delta - \frac{\gamma\, \lVert \tilde\pi_1 - \hat\pi_1 \rVert_{\mu_1}}{(1-\gamma)^2} + \frac{\delta\, \lVert \tilde\pi_1 - \hat\pi_1 \rVert_{\mu_1}}{2\, \lVert \tilde\pi_1 - \hat\pi_1 \rVert_{\mu_2}}, \tag{2.7}$$

where δ ≜ u2(π̃1, π̂2) − max_{π2 ≠ π̂2} u2(π̃1, π2) is the gap between π̂2 and all other deterministic policies of P2 when P1 plays π̃1.

We have shown that knowledge of µ2 allows P1 to obtain improved policies compared to simply assuming µ2 = µ1, and that this improvement depends on both the real and the perceived effect of a change in P1's policy. In the next section we develop an efficient dynamic programming algorithm for finding a good policy for P1.

3 Algorithms for the Stackelberg Setting

In the Stackelberg setting, we assume that P1 commits to a policy π1, and that this policy is observed by P2. Because of this, it is sufficient for P2 to use a Markov policy, which can be calculated in time polynomial in the number of states and actions. However, there is a polynomial reduction from stochastic games to MVDPs (Lemma 1), and since Letchford et al. [2012] show that computing optimal commitment strategies in stochastic games is NP-hard, the planning problem for MVDPs is also NP-hard. A further difficulty is that dominating policies, in the MDP sense, may not exist in MVDPs.

Definition 6 (Dominating policies). A dominating policy π satisfies V^π(s) ≥ V^{π′}(s) for all s ∈ S and all policies π′, where V^π(s) = E^π[U | s0 = s].

Dominating policies have the nice property that they are also optimal for any starting distribution σ. However, dominating stationary Markov policies need not exist in our setting.

Theorem 2. A dominating stationary Markov policy may not exist in a given MVDP.

The proof of this theorem is given by a counterexample in Appendix A, where the optimal policy depends on the history of previously visited states. In the trivial case when µ1 = µ2, the problem reduces to a Markov decision process, which can be solved in O(S²A) time [Mansour and Singh, 1999, Littman et al., 1995]. Generally, however, the commitment by P1 creates new dependencies that render the problem inherently non-Markovian with respect to the state st, and thus harder to solve. In particular, even though the dynamics of the environment are Markovian with respect to the state st, the MVDP in the Stackelberg setting only becomes Markov with respect to the hyper-state ηt = (st, πt:T,1), where πt:T,1 is the commitment by P1 for steps t, . . . , T. To see that the game is non-Markovian, we only need to consider a single transition from st to st+1. P2's action depends not only on the action at,1 of P1, but also on the expected utility the agent will obtain in the future, which in turn depends on πt:T,1. Consequently, the state st is not a sufficient statistic for the Stackelberg game.

3.1 Backwards Induction

These difficulties aside, we now describe a backwards induction algorithm for approximately solving MVDPs. The algorithm can be seen as a generalization of the backwards induction algorithm for simultaneous-move stochastic games [cf. Bošanský et al., 2016] to the case of disagreement on the transition distribution. In our setting, at stage t of the interaction, P2 has observed the current state st and also knows the commitment of P1 for all future periods.
P2 now chooses the action

$$a^*_{t,2}(\pi_1) \in \arg\max_{a_{t,2}} \Big[ \rho(s_t) + \gamma \sum_{a_{t,1}} \pi_1(a_{t,1} \mid s_t) \sum_{s_{t+1}} \mu_2(s_{t+1} \mid s_t, a_{t,1}, a_{t,2})\, V_{2,t+1}(s_{t+1}) \Big]. \tag{3.1}$$

Thus, for every state, there is a well-defined continuation for P2. Now, P1 needs to choose an action. This can be done easily, since we know P2's continuation, and so we can define a value for each state-action-action triplet for either agent:

$$Q_{i,t}(s_t, a_{t,1}, a_{t,2}) = \rho(s_t) + \gamma \sum_{s_{t+1}} \mu_i(s_{t+1} \mid s_t, a_{t,1}, a_{t,2})\, V_{i,t+1}(s_{t+1}).$$

As the agents act simultaneously, the policy of P1 needs to be stochastic. The local optimization problem can be formulated as a set of linear programs (LPs), one for each action a2 ∈ A2:

$$\max_{\pi_1(\cdot \mid s)} \sum_{a_1} \pi_1(a_1 \mid s)\, Q_{1,t}(s, a_1, a_2)$$
$$\text{s.t.}\quad \sum_{a_1} \pi_1(a_1 \mid s)\, Q_{2,t}(s, a_1, a_2) \geq \sum_{a_1} \pi_1(a_1 \mid s)\, Q_{2,t}(s, a_1, \hat a_2) \quad \forall\, \hat a_2 \in A_2,$$
$$0 \leq \pi_1(\hat a_1 \mid s) \leq 1 \quad \forall\, \hat a_1 \in A_1, \qquad \sum_{a_1} \pi_1(a_1 \mid s) = 1.$$

Each LP yields the best possible policy at time t subject to forcing P2 to play a2. From these, we select the best one. At the end, given the transitions (µ1, µ2) and the time horizon T, the algorithm returns an approximately optimal joint policy (π*1, π*2) for the MVDP. The complete pseudocode is given in Appendix C, Algorithm 1.

As this solves a finite-horizon problem, the policy is inherently non-stationary. In addition, because there is no guarantee that a dominating policy exists, we may never obtain a stationary policy (see below). However, we can extract a stationary policy from the policies played at individual time steps t, and select the one with the highest expected utility. We can also obtain a version of the algorithm that produces a deterministic policy, by replacing the linear program with a maximization over P1's actions.

Optimality. The policies obtained using this algorithm are subgame perfect, up to the time horizon adopted for backward induction; i.e. the continuation policies are optimal (taking into account the possibly incorrect transition kernel of P2) off the equilibrium path. As a dominating Markov policy may not exist, the algorithm may not converge to a stationary policy in the infinite-horizon discounted setting, similarly to the cyclic equilibria examined by Zinkevich et al. [2005]. This is because the commitment of P1 affects the current action of P2, and hence the effective transition matrix for P1. More precisely, the transition actually depends on the future joint policy πt+1:T, because this determines the value Q2,t and so the policy of P2. Thus, the Bellman optimality condition does not hold, as the optimal continuation may depend on previous decisions.
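
To make the per-stage computation concrete, the following sketch implements a simplified finite-horizon backwards-induction pass in the spirit of the description above; it is not the paper's Algorithm 1 (Appendix C), and it reuses the hypothetical MVDP container from earlier. For each state it builds the two Q-tables, solves one LP per candidate action a2 of P2 with scipy.optimize.linprog, and keeps the candidate that is best for P1. The worst-case treatment of non-unique best responses and the extraction of a stationary policy are omitted.

```python
import numpy as np
from scipy.optimize import linprog

def backwards_induction(mvdp, T):
    """One finite-horizon backwards-induction pass (illustrative sketch).
    Returns per-stage policies: pi1[t] of shape (S, A1), and pi2[t] of shape (S,)
    holding the index of P2's action at each state."""
    mu1, mu2, rho, gamma = mvdp.mu[0], mvdp.mu[1], mvdp.rho, mvdp.gamma
    S, A1, A2, _ = mu1.shape
    V1 = np.zeros(S)                       # P1's continuation values under mu_1
    V2 = np.zeros(S)                       # P2's continuation values under mu_2
    pi1 = np.zeros((T, S, A1))
    pi2 = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        Q1 = rho[:, None, None] + gamma * mu1 @ V1     # shape (S, A1, A2)
        Q2 = rho[:, None, None] + gamma * mu2 @ V2
        for s in range(S):
            best_val, best_a2, best_p = -np.inf, 0, np.ones(A1) / A1
            for a2 in range(A2):
                # max_p  sum_a1 p[a1] Q1[s, a1, a2]
                # s.t.   sum_a1 p[a1] (Q2[s, a1, b2] - Q2[s, a1, a2]) <= 0  for all b2
                #        p in the probability simplex
                c = -Q1[s, :, a2]
                A_ub = np.stack([Q2[s, :, b2] - Q2[s, :, a2] for b2 in range(A2)])
                res = linprog(c, A_ub=A_ub, b_ub=np.zeros(A2),
                              A_eq=np.ones((1, A1)), b_eq=[1.0],
                              bounds=[(0.0, 1.0)] * A1, method="highs")
                if res.success and -res.fun > best_val:
                    best_val, best_a2, best_p = -res.fun, a2, res.x
            pi1[t, s], pi2[t, s] = best_p, best_a2
            V1[s] = best_p @ Q1[s, :, best_a2]
            V2[s] = best_p @ Q2[s, :, best_a2]
    return pi1, pi2
```

A deterministic variant of this sketch replaces the LP with a maximization over P1's actions, and a stationary policy can be extracted by evaluating each per-stage policy with the payoff routine from the first sketch and keeping the best.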
4 Experiments

We focus on a natural subclass of multi-view decision processes, which we call intervention games. Therein, a human and an AI have joint control of a system, and the human can override the AI's actions at a cost. As an example, consider semi-autonomous driving, where the human always has the option to override the AI's decisions. The cost represents the additional effort of human intervention; if there were no cost, the human might always prefer to assume manual control and ignore the AI.

Definition 7 (c-intervention game). An MVDP is a c-intervention game if all of P2's actions override those of P1, apart from the null action a0 ∈ A2, which has no effect:

$$\mu_1(s_{t+1} \mid s_t, a_{t,1}, a_{t,2}) = \mu_1(s_{t+1} \mid s_t, a'_{t,1}, a_{t,2}) \qquad \forall\, a_{t,1}, a'_{t,1} \in A_1,\ a_{t,2} \neq a_0. \tag{4.1}$$

In addition, the agents subtract a cost c(s) > 0 from the reward rt = ρ(st) whenever P2 takes an action other than a0.

Any MDP with action space A′ and reward function ρ′ : S → [0, 1] can be converted into a c-intervention game, and modeled as an MVDP, with action space A = A1 × A2, where A1 = A′, A2 = A1 ∪ {a0}, and, for a1 ∈ A1, a2 ∈ A2, we write a = (a1, a2) ∈ A. Setting

$$r_{\mathrm{MIN}} = \min_{s \in S,\, a_2 \in A_2} \rho'(s) - c(s), \tag{4.2}$$
$$r_{\mathrm{MAX}} = \max_{s \in S,\, a_2 \in A_2} \rho'(s), \tag{4.3}$$

the reward function³ is ρ : S × A → [0, 1], with

$$\rho(s, a) = \frac{\rho'(s) - c(s)\, \mathbb{I}\{a_2 \neq a_0\} - r_{\mathrm{MIN}}}{r_{\mathrm{MAX}} - r_{\mathrm{MIN}}}. \tag{4.4}$$

³ Note that although our original definition used a state-only reward function, we are using a state-action reward function here.

The reward function in the MVDP is defined so that it also has range [0, 1].
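
As a sketch of this construction (ours; the function name and array conventions are assumptions), the code below builds the two joint transition tensors and the normalized state-action reward of equations (4.1)-(4.4) from a base MDP kernel, the human's possibly incorrect kernel, and an intervention cost. Because the reward here is state-action (footnote 3), it returns raw arrays rather than the state-reward MVDP container used in the earlier sketches.

```python
import numpy as np

def make_intervention_game(P_true, P_human, rho_base, cost):
    """Build a c-intervention game (Eqs. 4.1-4.4) from a base MDP.

    P_true, P_human : (S, A', S') kernels (AI's correct model, human's model).
    rho_base        : (S,) base reward rho' in [0, 1].
    cost            : (S,) intervention cost c(s) > 0.

    Returns mu1, mu2 of shape (S, A1, A2, S') with A1 = A', A2 = A' + 1
    (the last P2 action is the null action a0), and reward of shape (S, A1, A2).
    """
    S, A, _ = P_true.shape
    A1, A2 = A, A + 1                        # a0 is the last P2 action

    def joint(P):
        mu = np.zeros((S, A1, A2, S))
        mu[:, :, A2 - 1, :] = P              # a0: P1's own action takes effect
        for a2 in range(A):                  # any other a2 overrides a1 (Eq. 4.1)
            mu[:, :, a2, :] = P[:, a2, :][:, None, :]
        return mu

    mu1, mu2 = joint(P_true), joint(P_human)
    # State-action reward of Eq. 4.4, rescaled to [0, 1].
    r_min = np.min(rho_base - cost)
    r_max = np.max(rho_base)
    intervene = np.ones(A2)
    intervene[A2 - 1] = 0.0                  # indicator that a2 != a0
    raw = rho_base[:, None] - np.outer(cost, intervene)          # shape (S, A2)
    reward = np.repeat(((raw - r_min) / (r_max - r_min))[:, None, :], A1, axis=1)
    return mu1, mu2, reward
```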
Algorithms and scenarios. We consider a main scenario, as well as three variant scenarios, with different assumptions about the AI's model. In the main scenario, the human has an incorrect model of the world, which the AI knows. For this scenario, we consider three types of AI policies:

PURE: The AI only uses deterministic Markov policies.
MIXED: The AI may use stochastic Markov policies.
STAT: As above, but the best instantaneous deterministic policy found by PURE over the first 25 time-steps is used as a stationary Markov policy (running for the same time horizon as PURE).

We also have three variant scenarios of AI and human behaviour:

OPT: Both the AI and the human have the correct model of the world.
NAIVE: The AI assumes that the human's model is correct.
HUMAN: Both agents use the incorrect human model to take actions. This is equivalent to the human having full control without any intervention cost.

In all of these variants, the AI uses a MIXED policy.

[Figure 1: Illustrations and experimental results for the multilane highway and food and shelter domains: (a) Multilane Highway, (b) Highway: Error, (c) Highway: Cost, (d) Food and Shelter, (e) Food and Shelter: Error, (f) Food and Shelter: Cost. Plots (b,e) show the effect of varying the error in the human's transition kernel with fixed intervention cost. Plots (c,f) show the effect of varying the intervention cost for a fixed error in the human's transition kernel.]

We consider two simulated problem domains in which to evaluate our methods. The first is a multilane highway scenario, where the human and AI have shared control of a car; the second is a food and shelter domain where they must collect food and maintain a shelter. In all cases, we use a finite time horizon of 100 steps and a discount factor of γ = 0.95.

Multilane Highway. In this domain, a car is under joint control of an AI agent and a human, with the human able to override the AI's actions at any time. There are multiple lanes in a highway, with varying levels of risk and speed (faster lanes are more risky). Within each lane, there is some probability of having an accident. However, the human overestimates this probability, and so wants to travel in a slower lane than is optimal. We denote the starting state by A, the destination state by B, and, for lane i, intermediate states Ci1, . . . , CiJ, where J is the number of intermediate states in a lane, and an accident state D. See Figure 1(a) for an illustration of the domain and for the simulation results.

In the plots, the error parameter represents a factor by which the human is wrong in assessing the accident probability (assumed to be small), while the cost parameter determines both the cost of safety (slow driving) in different lanes and the cost of the human intervening in those lanes. The latter is because our experimental model couples the cost of intervention with the safety cost. The rewards range from -10 to 10. More details are provided in the Appendix (Section B).

Food and Shelter Domain. The food and shelter domain [Guo et al., 2013] involves an agent simultaneously trying to find randomly placed food (in one of the top five locations) while maintaining a shelter. With positive probability at each time step, the shelter can collapse if it is not maintained. There is a negative reward for the shelter collapsing and a positive reward for finding food (food reappears whenever it is found). In order to exercise the abilities of our modeling, we make the original setting more complex by increasing the size of the grid to 5 x 5 and allowing diagonal moves. For our MVDP setting, we give the AI the correct model but assume the human overestimates the movement error probabilities. Furthermore, the human believes that diagonal movements are more prone to error. See Figure 1(d) for an illustration of the domain and for the simulation results. In the plots, the error parameter determines how skewed the human's belief about the error is towards the uniform distribution, while the cost parameter determines the cost of intervention. The rewards range from -1 to 1. More details are provided in the Appendix (Section B).

Results. In the simulations, when we vary the error parameter, we keep the cost parameter constant (0.15 for the multilane highway domain and 0.1 for the food and shelter domain); conversely, when we vary the cost, we keep the error constant (25 for the multilane highway domain and 0.25 for the food and shelter domain). Overall, the results show that PURE, MIXED and STAT perform considerably better than NAIVE and HUMAN. Furthermore, for low costs, HUMAN is better than NAIVE. The reason is that in NAIVE the human agent overrides the AI, which is more costly than having the AI perform the same policy (as it effectively does in HUMAN). Therefore, simply assuming that the human has the correct model not only leads to a larger error than knowing the human's model, but can also be worse than simply adopting the human's erroneous model when making decisions.

As the cost of intervention increases, the utilities approach the jointly optimal one (the OPT scenario), with the exception of the utility in scenario HUMAN. This is not surprising, since the intervention cost has an important tempering effect: the human is less likely to take over control if interventions are costly. When the human error is small, the utility approaches that of the jointly optimal policy; increasing error leads to larger deviations from the optimal utility. Of the three algorithms (PURE, MIXED and STAT), MIXED obtains slightly better performance, showing the additional benefit of allowing stochastic policies. PURE and STAT have quite similar performance, which indicates that in most cases the backwards induction algorithm converges to a stationary policy.

5 Conclusion

We have introduced the framework of multi-view decision processes to model value-alignment problems in human-AI collaboration.
In this problem, an AI and a human act in the same environment and share the same reward function, but the human may have an incorrect world model. We analyze the effect of knowledge of the human's world model on the policy selected by the AI. More precisely, we developed a dynamic programming algorithm, and gave simulation results demonstrating that an AI using this algorithm can adopt a useful policy in simple environments even when the human adopts an incorrect model. This is important for modern applications involving close cooperation between humans and AI, such as home robots or automated vehicles, where the human can choose to intervene but may do so erroneously. Although backwards induction is efficient for discrete state and action spaces, it cannot usefully be applied to the continuous case; we would like to develop stochastic gradient algorithms for this case. More generally, we see a number of immediate extensions to MVDPs: estimating the human's world model, studying a setting in which the human is learning to respond to the actions of the AI, and moving away from Stackelberg to the case of no commitment.

Acknowledgements. The research has received funding from: the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement 608743, the Swedish national science foundation (VR), the Future of Life Institute, the SEAS Tom Kat fund, and an SNSF Early Postdoc Mobility fellowship.

References

Ofra Amir, Ece Kamar, Andrey Kolobov, and Barbara Grosz. Interactive teaching strategies for agent training. In IJCAI 2016, 2016.
Branislav Bošanský, Simina Brânzei, Kristoffer Arnsfelt Hansen, Peter Bro Miltersen, and Troels Bjerre Sørensen. Computation of Stackelberg equilibria of finite sequential games. 2015.
Branislav Bošanský, Viliam Lisý, Marc Lanctot, Jiří Čermák, and Mark H. M. Winands. Algorithms for computing strategies in two-player simultaneous move games. Artificial Intelligence, 237:1-40, 2016.
Avshalom Elmalech, David Sarne, Avi Rosenfeld, and Eden Shalom Erez. When suboptimal rules. In AAAI, pages 1313-1319, 2015.
Eyal Even-Dar and Yishay Mansour. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, COLT/Kernel 2003, Lecture Notes in Computer Science, pages 581-594, Washington, DC, USA, 2003. Springer.
Ya'akov Gal and Avi Pfeffer. Networks of influence diagrams: A formalism for representing agents' beliefs and decision-making processes. Journal of Artificial Intelligence Research, 33(1):109-147, 2008.
Xiaoxiao Guo, Satinder Singh, and Richard L. Lewis. Reward mapping for transfer in long-lived agents. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2130-2138. 2013.
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning, 2016.
Joshua Letchford, Liam MacDermed, Vincent Conitzer, Ronald Parr, and Charles L. Isbell. Computing optimal strategies to commit to in stochastic games. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI'12, 2012.
J. C. R. Licklider. Man-computer symbiosis. IRE Transactions on Human Factors in Electronics, 1:4-11, 1960.
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 394-402. Morgan Kaufmann Publishers Inc., 1995.
Yishay Mansour and Satinder Singh. On the complexity of policy iteration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 401-408. Morgan Kaufmann Publishers Inc., 1999.
Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663-670, 2000.
Jonathan Sorg, Satinder P. Singh, and Richard L. Lewis. Internal rewards mitigate agent boundedness. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1007-1014, 2010.
Haoqi Zhang and David C. Parkes. Value-based policy teaching with active indirect elicitation. In Proc. 23rd AAAI Conference on Artificial Intelligence (AAAI'08), pages 208-214, Chicago, IL, July 2008.
Haoqi Zhang, David C. Parkes, and Yiling Chen. Policy teaching through reward function learning. In 10th ACM Electronic Commerce Conference (EC'09), pages 295-304, 2009.
Martin Zinkevich, Amy Greenwald, and Michael Littman. Cyclic equilibria in Markov games. In Advances in Neural Information Processing Systems, 2005.