Should Robots be Obedient?

Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell
University of California, Berkeley
{smilli,dhm,anca,russell}@berkeley.edu

Abstract

Intuitively, obedience (following the orders that a human gives) seems like a good property for a robot to have. But we humans are not perfect, and we may give orders that are not well aligned with our preferences. We show that when a human is not perfectly rational, a robot that tries to infer and act according to the human's underlying preferences can always perform better than a robot that simply follows the human's literal orders. Thus, there is a tradeoff between the obedience of a robot and the value it can attain for its owner. We investigate how this tradeoff is impacted by the way the robot infers the human's preferences, showing that some methods err more on the side of obedience than others. We then analyze how performance degrades when the robot has a misspecified model of the features that the human cares about or of the level of rationality of the human. Finally, we study how robots can start detecting such model misspecification. Overall, our work suggests that there might be a middle ground in which robots intelligently decide when to obey human orders, but err on the side of obedience.

1 Introduction

Should robots be obedient? The reflexive answer to this question is yes. A coffee-making robot that doesn't listen to your coffee order is not likely to sell well. Highly capable autonomous systems that don't obey human commands run substantially higher risks, ranging from property damage to loss of life [Asaro, 2006; Lewis, 2014] to potentially catastrophic threats to humanity [Bostrom, 2014; Russell et al., 2015]. Indeed, there are several recent examples of research that considers the problem of building agents that at the very least obey shutdown commands [Soares et al., 2015; Orseau and Armstrong, 2016; Hadfield-Menell et al., 2017].

However, in the long term, making systems blindly obedient doesn't seem right either. A self-driving car should certainly defer to its owner when she tries taking over because it's driving too fast in the snow. But on the other hand, the car shouldn't let a child accidentally turn on the manual driving mode.

Figure 1: (Left) The blindly obedient robot always follows H's order. (Right) An IRL-R computes an estimate of H's preferences and picks the action optimal for this estimate.

The suggestion that it might sometimes be better for an autonomous system to be disobedient is not new [Weld and Etzioni, 1994; Scheutz and Crowell, 2007]. For example, this is the idea behind "Do What I Mean" systems [Teitelman, 1970] that attempt to act based on the user's intent rather than the user's literal order. A key contribution of this paper is to formalize this idea, so that we can study properties of obedience in AI systems. Specifically, we focus on investigating how the tradeoff between the robot's level of obedience and the value it attains for its owner is affected by the rationality of the human, the way the robot learns about the human's preferences over time, and the accuracy of the robot's model of the human. We argue that these properties are likely to have a predictable effect on the robot's obedience and the value it attains.

We start with a model of the interaction between a human H and robot R (we use "robot" to refer to any autonomous system) that enables us to formalize R's level of obedience (Section 2). H and R are cooperative, but H knows the reward parameters θ and R does not.
H can order R to take an action and R can decide whether to obey or not. We show that if R tries to infer θ from H's orders and then acts by optimizing its estimate of θ, then it can always do better than a blindly obedient robot when H is not perfectly rational (Section 3). Thus, forcing R to be blindly obedient does not come for free: it requires giving up the potential to surpass human performance.

We cast the problem of estimating θ from H's orders as an inverse reinforcement learning (IRL) problem [Ng et al., 2000; Abbeel and Ng, 2004]. We analyze the obedience and value attained by robots with different estimates for θ (Section 4). In particular, we show that a robot that uses a maximum likelihood estimate (MLE) of θ is more obedient to H's first order than any other robot.

Finally, we examine how R's value and obedience are impacted when it has a misspecified model of H's policy or of θ (Section 5). We find that when R uses the MLE it is robust to misspecification of H's rationality level (i.e., it takes the same actions that it would have taken with the true model), whereas with the optimal policy it is not. This suggests that we may want to use policies that are alternatives to the optimal one because they are more robust to model misspecification. If R is missing features of θ, then it is less obedient than it should be, whereas with extra, irrelevant features R is more obedient. This suggests that, to ensure that R errs on the side of obedience, we should equip it with a more complex model. When R has extra features, it still attains more value than a blindly obedient robot. But if R is missing features, then it is possible for R to be better off being obedient. We use the fact that with the MLE R should nearly always obey H's first order (as proved in Section 4) to enable R to detect when it is missing features and, when it is, fall back to obedience.

Overall, we conclude that in the long term we should aim for R to intelligently decide when to obey H or not, since with a perfect model R can always do better than being blindly obedient. But our analysis also shows that R's value and obedience can easily be impacted by model misspecification. So in the meantime, it is critical to ensure that our approximations err on the side of obedience and are robust to model misspecification.

2 Human-Robot Interaction Model

Suppose H is supervising R in a task. At each step H can order R to take an action, but R chooses whether to listen or not. We wish to analyze R's incentive to obey H given that:
1. H and R are cooperative (have a shared reward)
2. H knows the reward parameters, but R does not
3. R can learn about the reward through H's orders
4. H may act suboptimally

We first contribute a general model for this type of interaction, which we call a supervision POMDP. Then we add a simplifying assumption that makes this model clearer to analyze while still maintaining the above properties, and we focus on this simplified version for the rest of the paper.

Supervision POMDP. At each step in a supervision POMDP, H first orders R to take a particular action and then R executes an action it chooses. The POMDP is described by a tuple M = ⟨S, Θ, A, R, T, P_0, γ⟩. S is a set of world states. Θ is a set of static reward parameters. The hidden state space of the POMDP is S × Θ, and at each step R observes the current world state and H's order. A is R's set of actions.
R : S × A × Θ → ℝ is a parametrized, bounded function that maps a world state, the robot's action, and the reward parameters to the reward. T : S × A × S → [0, 1] returns the probability of transitioning to a state given the previous state and the robot's action. P_0 : S × Θ → [0, 1] is a distribution over the initial world state and reward parameters. γ ∈ [0, 1) is the discount factor.

We assume that there is a (bounded) featurization of state-action pairs φ : S × A → ℝ^d and that the reward function is a linear combination of the reward parameters θ ∈ Θ and these features: R(s, a) = θ^T φ(s, a). For clarity, we write A as A^H when we mean H's orders and as A^R when we mean R's actions. H's policy π_H is Markovian: π_H : S × Θ × A^H → [0, 1]. R's policy can depend on the history of previous states, orders, and actions: π_R : [S × A^H × A^R]* × S × A^H → A^R.

Human and Robot. Let Q(s, a; θ) be the Q-value function under the optimal policy for the reward parametrized by θ. A rational human follows the policy

π*_H(s, a; θ) = 1 if a = argmax_a′ Q(s, a′; θ), and 0 otherwise.

A noisily rational human follows the policy

π_H(s, a; θ, β) ∝ exp(Q(s, a; θ)/β)    (1)

β is the rationality parameter. As β → 0, H becomes rational (π_H → π*_H). And as β → ∞, H becomes completely random (π_H → Unif(A)).

Let h = (s_1, o_1), . . . , (s_n, o_n) be the history of past states and orders, where (s_n, o_n) is the current state and order. A blindly obedient robot's policy is to always follow the human's order:

π^O_R(h) = o_n

An IRL robot, IRL-R, is one whose policy is to act optimally with respect to an estimate, θ̂_n(h), of θ:

π_R(h) = argmax_a Q(s_n, a; θ̂_n(h))    (2)

Simplification to Repeated Game. For the rest of the paper, unless otherwise noted, we focus on a simpler repeated game in which each state is independent of the next, i.e., T(s, a, s′) is independent of s and a. The repeated game eliminates any exploration-exploitation tradeoff: Q(s, a; θ̂_n) = θ̂_n^T φ(s, a). But it still maintains the properties listed at the beginning of this section, allowing us to more clearly analyze their effects.

3 Justifying Autonomy

In this section we show that there exists a tradeoff between the performance of a robot and its obedience. This provides a justification for why one might want a robot that isn't obedient: robots that are sometimes disobedient perform better than robots that are blindly obedient.

We define R's obedience, O, as the probability that R follows H's order:

O_n = P(π_R(h) = o_n)

Figure 2: Autonomy advantage Δ (left) and obedience O (right) over time; the right panel also plots P(optimal order).

To study how much of an advantage (or disadvantage) H gains from R, we define the autonomy advantage, Δ, as the expected extra reward R receives over following H's order:

Δ_n = E[R(s_n, π_R(h)) − R(s_n, o_n)]

We will drop the subscript on O_n and Δ_n when talking about properties that hold ∀n. We will also use R_n(π) to denote the reward of policy π at step n, and φ_n(a) = φ(s_n, a).

Remark 1. For the robot to gain any advantage from being autonomous, it must sometimes be disobedient: Δ > 0 ⟹ O < 1. This is because Δ = 0 whenever R is obedient. This captures the fact that a blindly obedient R is limited by H's decision-making ability.
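To make O and Δ concrete, here is a minimal Monte Carlo sketch of one step of the repeated game (not from the paper; the sizes, the seed, and the choice of θ̂ are illustrative). It compares a blindly obedient robot with an IRL-R that acts on a fixed estimate θ̂, here taken to be the true θ, so the run illustrates Remark 1: the only way to get Δ > 0 is to sometimes disobey.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEATURES, BETA = 10, 10, 2.0   # illustrative sizes; BETA is H's rationality

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def one_step(theta, theta_hat, obedient):
    """One round of the repeated game: H gives a noisily rational order (Equation 1),
    R either obeys it or optimizes its estimate theta_hat (Equation 2)."""
    phi = rng.normal(size=(N_ACTIONS, N_FEATURES))            # phi_n(a) for every action a
    rewards = phi @ theta                                     # true rewards theta^T phi_n(a)
    order = rng.choice(N_ACTIONS, p=softmax(rewards / BETA))  # H's order o_n
    action = order if obedient else int(np.argmax(phi @ theta_hat))
    return rewards[action] - rewards[order], action == order  # (sample of Delta_n, obeyed?)

theta = rng.normal(size=N_FEATURES)
for obedient in (True, False):
    samples = [one_step(theta, theta, obedient) for _ in range(5000)]
    delta = np.mean([d for d, _ in samples])
    obed = np.mean([same for _, same in samples])
    print(f"obedient={obedient}:  Delta ~ {delta:+.3f}   O ~ {obed:.3f}")
```

The obedient robot gets Δ = 0 and O = 1 by construction; the IRL-R with a perfect estimate gets Δ > 0 but O < 1. The rest of the paper is about what happens when θ̂_n must instead be learned from the orders themselves.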
However, if R follows a type of IRL policy, then R is guaranteed a positive advantage when H is not rational. The next theorem states this formally.

Theorem 1. The optimal robot, R*, is an IRL-R whose policy π*_R has θ̂ equal to the posterior mean of θ. R* is guaranteed a nonnegative advantage on each round: ∀n Δ_n ≥ 0, with equality if and only if ∀n π*_R = π^O_R.

Proof. When each step is independent of the next, R's optimal policy is to pick the action that is optimal for the current step [Kaelbling et al., 1996]. This results in R picking the action that is optimal for the posterior mean,

π*_R(h) = argmax_a E[φ_n(a)^T θ | h] = argmax_a φ_n(a)^T E[θ | h]

By definition E[R_n(π*_R)] ≥ E[R_n(π^O_R)]. Thus, ∀n Δ_n = E[R_n(π*_R) − R_n(π^O_R)] ≥ 0. Also, by definition, ∀n Δ_n = 0 ⟺ π*_R = π^O_R.

In addition to the optimal R* being an IRL-R, the following IRL-Rs also converge to the maximum possible autonomy advantage.

Theorem 2. Let Δ*_n = E[R_n(π*_H) − R_n(π_H)] be the maximum possible autonomy advantage and O*_n = P(R_n(π*_H) = R_n(π_H)) be the probability that H's order is optimal. Assume that when there are multiple optimal actions R picks H's order if it is optimal. If π_R is an IRL-R policy (Equation 2) and θ̂_n is strongly consistent, i.e., P(θ̂_n = θ) → 1, then Δ_n − Δ*_n → 0 and O_n − O*_n → 0.

Proof.

Δ_n − Δ*_n = E[R_n(π_R) − R_n(π*_H) | θ̂_n = θ] P(θ̂_n = θ) + E[R_n(π_R) − R_n(π*_H) | θ̂_n ≠ θ] P(θ̂_n ≠ θ) → 0

because the first term is zero (R acts optimally when θ̂_n = θ) and E[R_n(π_R) − R_n(π*_H) | θ̂_n ≠ θ] is bounded while P(θ̂_n ≠ θ) → 0. Similarly,

O_n − O*_n = P(π_R(h) = o_n) − P(R_n(π*_H) = R_n(π_H))
= P(π_R(h) = o_n | θ̂_n = θ) P(θ̂_n = θ) + P(π_R(h) = o_n | θ̂_n ≠ θ) P(θ̂_n ≠ θ) − P(R_n(π*_H) = R_n(π_H))
→ P(R_n(π*_H) = R_n(π_H)) − P(R_n(π*_H) = R_n(π_H)) = 0

Remark 2. In the limit, Δ_n is higher for less optimal humans (humans with a lower expected reward E[R(s_n, o_n)]).

Theorem 3. The optimal robot R* is blindly obedient if and only if H is rational: π*_R = π^O_R ⟺ π_H = π*_H.

Proof. Let O(h) = {θ ∈ Θ : o_i = argmax_a θ^T φ_i(a), i = 1, . . . , n} be the subset of Θ for which o_1, . . . , o_n are optimal. If H is rational, then R*'s posterior only has support over O(h). So,

E[R_n(a) | h] = ∫_{θ ∈ O(h)} θ^T φ_n(a) P(θ | h) dθ ≤ ∫_{θ ∈ O(h)} θ^T φ_n(o_n) P(θ | h) dθ = E[R_n(o_n) | h]

Thus, H is rational ⟹ π*_R = π^O_R.

R* is an IRL-R where θ̂_n is the posterior mean. If the prior puts non-zero mass on the true θ, then the posterior mean is consistent [Diaconis and Freedman, 1986]. Thus, by Theorem 2, Δ_n − Δ*_n → 0. Therefore if ∀n Δ_n = 0, then Δ*_n → 0, which implies that P(π_H = π*_H) → 1. When π_H is stationary this means that H is rational. Thus, π*_R = π^O_R ⟹ H is rational.

We have shown that making R blindly obedient does not come for free. A positive Δ requires being sometimes disobedient (Remark 1). Under the optimal policy R is guaranteed a positive Δ when H is not rational. And in the limit, R converges to the maximum possible advantage. Furthermore, the more suboptimal H is, the more of an advantage R eventually earns (Remark 2). Thus, making R blindly obedient requires giving up on this potential Δ > 0. However, as Theorem 2 points out, as n → ∞ R also only listens to H's order when it is optimal. Thus, Δ and O come at a tradeoff. Autonomy advantage requires giving up obedience, and obedience requires giving up autonomy advantage.

Figure 3: Autonomy advantage Δ as a function of rationality β at steps 0, 10, 100, and 500, together with the maximum advantage. When H is more irrational, Δ converges to a higher value, but at a slower rate.

4 Approximations via IRL

R* is an IRL-R with θ̂ equal to the posterior mean, i.e., R* performs Bayesian IRL [Ramachandran and Amir, 2007]. However, as others have noted, Bayesian IRL can be very expensive in complex environments [Michini and How, 2012]. We could instead approximate R* by using a less expensive IRL algorithm. Furthermore, by Theorem 2 we can still guarantee convergence to optimal behavior.

Simpler choices for θ̂ include the maximum-a-posteriori (MAP) estimate, which has previously been suggested as an alternative to Bayesian IRL [Choi and Kim, 2011], or the maximum likelihood estimate (MLE). If H is noisily rational (Equation 1) and β = 1, then the MLE is equivalent to Maximum Entropy IRL [Ziebart et al., 2008].
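As a concrete illustration of this estimator (a sketch under the repeated-game and noisily-rational assumptions, not code from the paper; function names and step sizes are mine), the following fits θ̂ by gradient ascent on the log-likelihood Σ_i log π_H(o_i | s_i; θ, β), which is concave in θ:

```python
import numpy as np

rng = np.random.default_rng(1)
N_ACT, N_FEAT, BETA, N_ORDERS = 10, 10, 2.0, 200   # illustrative sizes

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

# Synthetic data: noisily rational orders for random feature matrices (repeated game).
theta_true = rng.normal(size=N_FEAT)
Phis = rng.normal(size=(N_ORDERS, N_ACT, N_FEAT))
orders = [rng.choice(N_ACT, p=softmax(P @ theta_true / BETA)) for P in Phis]

def fit_mle(Phis, orders, beta, iters=1000, lr=0.5):
    """Gradient ascent on the concave log-likelihood sum_i log softmax(Phi_i theta / beta)[o_i]."""
    theta = np.zeros(Phis[0].shape[1])
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for Phi, o in zip(Phis, orders):
            p = softmax(Phi @ theta / beta)
            grad += (Phi[o] - p @ Phi) / beta   # gradient of one order's log-likelihood
        theta += lr * grad / len(orders)
    return theta

theta_hat = fit_mle(Phis, orders, BETA)
cos = theta_hat @ theta_true / (np.linalg.norm(theta_hat) * np.linalg.norm(theta_true))
print("cosine similarity between theta_hat and the true theta:", round(cos, 3))
```

With β = 1 this objective matches the Maximum Entropy IRL objective for the repeated game, and Theorem 5 below implies that using the wrong β only rescales θ̂, which leaves MLE-R's actions unchanged.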
Although Theorem 2 allows us to justify approximations at the limit, it is also important to ensure that R's early behavior is not dangerous. Specifically, we may want R to err on the side of obedience early on. To investigate this we first prove a necessary property for any IRL-R to follow H's order:

Lemma 1. (Undominated necessary) Call o_n undominated if there exists θ ∈ Θ such that o_n is optimal, i.e., o_n = argmax_a θ^T φ(s_n, a). It is necessary for o_n to be undominated for an IRL-R to execute o_n.

Proof. R executes a = argmax_a θ̂_n^T φ(s_n, a), so it is not possible for R to execute o_n if there is no choice of θ̂_n that makes o_n optimal.

This can happen when one action dominates another action in value. For example, suppose Θ = ℝ^2 and there are three actions with features φ(s, a_1) = [−1, −1], φ(s, a_2) = [0, 0], φ(s, a_3) = [1, 1]. If H picks a_2, then there is no θ ∈ Θ that makes a_2 optimal, and thus R will never follow a_2.

One basic property we may want R to have is for it to listen to H early on. The next theorem looks at what we can guarantee about R's obedience to the first order when H is noisily rational.

Theorem 4. (Obedience to noisily rational H on 1st order)
(a) When Θ = ℝ^d the MLE does not exist after one order. But if we constrain the norm of θ̂ to not be too large, then we can ensure that R follows an undominated o_1. In particular, ∃K such that when R plans using the MLE θ̂ over Θ′ = {θ ∈ Θ : ‖θ‖_2 ≤ K}, R executes o_1 if and only if o_1 is undominated.
(b) If any IRL robot follows o_1, so does MLE-R. In particular, if R* follows o_1, so does MLE-R.
(c) If R uses the MAP or posterior mean, it is not guaranteed to follow an undominated o_1. Furthermore, even if R* follows o_1, MAP-R is not guaranteed to follow o_1.

Proof. (a) The "only if" direction holds by Lemma 1. Suppose o_1 is undominated. Then there exists θ′ such that o_1 is optimal for θ′. o_1 is still optimal for a scaled version, cθ′. As c → ∞, π_H(o_1; cθ′) → 1, but never reaches it. Thus, the MLE does not exist. However, since π_H(o_1; cθ′) monotonically increases towards 1, ∃C such that for c > C, π_H(o_1; cθ′) > 0.5. If K > C‖θ′‖, then o_1 will be optimal under the MLE θ̂_1 because π_H(o_1; θ̂_1) > 0.5 and R executes a = argmax_a θ̂^T φ(a) = argmax_a π_H(a; θ̂). Therefore, in practice we can simply use the MLE while constraining ‖θ‖_2 to be less than some very large number.
(b) From Lemma 1, if any IRL-R follows o_1, then o_1 is undominated. Then by (a) MLE-R follows o_1.
(c) For space we omit explicit counterexamples, but both statements hold because we can construct adversarial priors for which o_1 is suboptimal for the posterior mean, and for which o_1 is optimal for the posterior mean but not for the MAP.

Theorem 4 suggests that, at least at the beginning, when R uses the MLE it errs on the side of giving us "the benefit of the doubt", which is exactly what we would want out of an approximation.

Figures 2a and 2b plot Δ and O for an IRL robot that uses the MLE. As expected, R gains more reward than a blindly obedient one (Δ > 0), eventually converging to the maximum autonomy advantage (Figure 2a). On the other hand, as R learns about θ, its obedience also decreases, until eventually it only listens to the human when she gives the optimal order (Figure 2b).
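As an aside on Lemma 1, the undominated condition is straightforward to check computationally. The sketch below is my own illustration (not from the paper): it encodes "there exists θ with θ^T(φ(o) − φ(a)) > 0 for all a ≠ o" as a linear-program feasibility problem (by scale invariance the strict inequalities can be replaced by a margin of 1) and applies it to the three-action example from Lemma 1.

```python
import numpy as np
from scipy.optimize import linprog

def is_undominated(phi, o):
    """True if some theta makes action o strictly optimal, i.e.
    theta^T (phi[o] - phi[a]) >= 1 for every a != o (margin 1 by scale invariance)."""
    diffs = np.array([phi[a] - phi[o] for a in range(len(phi)) if a != o])
    res = linprog(c=np.zeros(phi.shape[1]),          # pure feasibility problem, no objective
                  A_ub=diffs, b_ub=-np.ones(len(diffs)),
                  bounds=[(None, None)] * phi.shape[1])
    return res.success

# Example from Lemma 1: a2 = [0, 0] is sandwiched between a1 and a3 and is never optimal.
phi = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
for o in range(3):
    print(f"order a{o + 1} undominated: {is_undominated(phi, o)}")
```

Any order flagged as dominated here will never be executed by any IRL-R, no matter what estimate θ̂_n it forms.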
As pointed out in Remark 2, Δ is eventually higher for more irrational humans. However, a more irrational human also provides noisier evidence of θ, so the rate of convergence of Δ is also slower. So, although initially Δ may be lower for a more irrational H, in the long run there is more to gain from being autonomous when interacting with a more irrational human. Figure 3 shows this empirically.

All experiments in this paper use the following parameters unless otherwise noted. At the start of each episode θ ∼ N(0, I) and at each step φ_n(a) ∼ N(0, I). There are 10 actions, 10 features, and β = 2. (All experiments can be replicated using the Jupyter notebook available at http://github.com/smilli/obedience.)

Finally, even with good approximations we may still have good reason to be hesitant about disobedient robots. The naive analysis presented so far assumes that R's models are perfect, but it is almost certain that R's models of complex things like human preferences and behavior will be incorrect. By Lemma 1, R will not obey even the first order made by H if there is no θ ∈ Θ that makes H's order optimal. So clearly, it is possible to have disastrous effects by having an incorrect model of Θ. In the next section we look at how misspecification of possible human preferences (Θ) and human behavior (π_H) can cause the robot to be overconfident and in turn less obedient than it should be. The autonomy advantage can easily become the rebellion regret.

5 Model Misspecification

Incorrect Model of Human Behavior. Having an incorrect model of H's rationality (β) does not change the actions of MLE-R, but does change the actions of R*.

Theorem 5. (Incorrect model of human policy) Let β_0 be H's true rationality and β′ be the rationality that R believes H has. Let θ̂ and θ̂′ be R's estimates under the true model and the misspecified model, respectively. Call R robust if its actions under β′ are the same as its actions under β_0.
(a) MLE-R is robust.
(b) R* is not robust.

Proof. (a) The log-likelihood l(h | θ) is concave in η = θ/β. So, θ̂′_n = (β′/β_0) θ̂_n. This does not change R's action: argmax_a θ̂′_n^T φ_n(a) = argmax_a θ̂_n^T φ_n(a).
(b) Counterexamples can be constructed based on the fact that as β → 0, H becomes rational, but as β → ∞, H becomes completely random. Thus, the likelihood will win over the prior for β → 0, but not when β → ∞.

MLE-R is more robust than the optimal R*. This suggests a reason beyond computational savings for using approximations: the approximations may be more robust to misspecification than the optimal policy.

Remark 3. Theorem 5 may give us insight into why Maximum Entropy IRL (which is the MLE with β = 1) works well in practice. In simple environments where noisy rationality can be used as a model of human behavior, getting the level of noisiness right doesn't matter.

Incorrect Model of Human Preferences. The simplest way that H's preferences may be misspecified is through the featurization of θ. Suppose θ ∈ Θ = ℝ^d. R believes that Θ = ℝ^{d′}. R may be missing features (d′ < d) or may have irrelevant features (d′ > d). R observes a d′-dimensional feature vector for each action: φ_n(a) ∼ N(0, I_{d′×d′}). The true θ depends on only the first d features, but R estimates θ̂ ∈ ℝ^{d′}.
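The following is a minimal simulation of this misspecification setup (my own sketch, not the paper's notebook; the gradient-ascent MLE, round counts, and dimensions are illustrative choices). It runs MLE-R with too few, exactly the right, and too many features and reports late-run averages of Δ and O:

```python
import numpy as np

rng = np.random.default_rng(2)
N_ACT, D_TRUE, BETA, N_ROUNDS = 10, 10, 2.0, 100

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def fit_mle(Phis, orders, beta, iters=100, lr=0.5):
    """Gradient ascent on the concave log-likelihood of the noisily rational orders."""
    theta = np.zeros(Phis[0].shape[1])
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for Phi, o in zip(Phis, orders):
            p = softmax(Phi @ theta / beta)
            grad += (Phi[o] - p @ Phi) / beta
        theta += lr * grad / len(orders)
    return theta

def run(d_robot):
    """MLE-R that models Theta = R^{d_robot}; the true theta uses the first D_TRUE features."""
    theta_true = rng.normal(size=D_TRUE)
    d_world = max(D_TRUE, d_robot)           # extra dimensions carry no true reward
    Phis, orders, delta, obey = [], [], [], []
    for _ in range(N_ROUNDS):
        Phi = rng.normal(size=(N_ACT, d_world))
        rewards = Phi[:, :D_TRUE] @ theta_true
        o = rng.choice(N_ACT, p=softmax(rewards / BETA))   # noisily rational order
        Phis.append(Phi[:, :d_robot]); orders.append(o)    # R only sees d_robot features
        theta_hat = fit_mle(Phis, orders, BETA)
        a = int(np.argmax(Phi[:, :d_robot] @ theta_hat))   # MLE-R's action
        delta.append(rewards[a] - rewards[o]); obey.append(a == o)
    return np.mean(delta[-30:]), np.mean(obey[-30:])

for d_robot, label in [(5, "missing"), (10, "correct"), (15, "added")]:
    d, o = run(d_robot)
    print(f"{label:>8} features: Delta ~ {d:+.3f}   O ~ {o:.3f}")
```

Per Figure 4, extra features are relatively harmless while missing features make R less obedient than it should be and can drive Δ negative; the sketch is only meant to show how such a comparison is set up.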
Figure 4: Δ and O when Θ is misspecified, for robots with missing, correct, and added features.

Figure 4 shows how Δ and O change over time as a function of the number of features for an MLE-R. When R has irrelevant features it still achieves a positive Δ (and still converges to the maximum because θ̂ remains consistent over a superset of Θ). But if R is missing features, then Δ may be negative, and thus R would be better off being blindly obedient instead. Furthermore, when R contains extra features it is more obedient than it would be with the true model. But if R is missing features, then it is less obedient than it should be. This suggests that to ensure R errs on the side of obedience we should err on the side of giving R a more complex model.

Detecting Misspecification. If R has the wrong model of Θ, R may be better off being obedient. In the remainder of this section we look at how R can detect that it is missing features and, when it is, act obediently.

Remark 4. (Policy mixing) We can make R more obedient, while maintaining convergence to the maximum advantage, by mixing R's policy π^I_R with a blindly obedient policy:

π_R(h) = 1{δ_n = 0} π^O_R(h) + 1{δ_n = 1} π^I_R(h)

where P(δ_n = 0) = c_n and P(δ_n = 1) = 1 − c_n, with 1 ≥ c_n ≥ 0 and c_n → 0. In particular, we can have an initial burn-in period where R is blindly obedient for a finite number of rounds before switching to π^I_R.

By Theorem 4 we know MLE-R will always obey H's first order if it is undominated. This means that for MLE-R, O_1 should be close to one if dominated orders are expected to be rare. As pointed out in Remark 4, we can have an initial burn-in period where R always obeys H. Let R have a burn-in obedience period of B rounds. R uses this burn-in period to calculate the sample obedience on the first order:

Ō_1 = (1/B) Σ_{i=1}^{B} 1{argmax_a θ̂_1(h_i)^T φ_i(a) = o_i}

where θ̂_1(h_i) denotes the first-order MLE computed from (s_i, o_i) alone. If Ō_1 is not close to one, then it is likely that R has the wrong model of Θ, and R would be better off just being obedient. So, we can choose some small ϵ and make R's policy:

π_R(h) = o_n if n ≤ B;
         o_n if n > B and Ō_1 < 1 − ϵ;
         argmax_a θ̂_n^T φ_n(a) if n > B and Ō_1 ≥ 1 − ϵ    (3)

Figure 5: (Detecting misspecification) The bold line shows the R that tries to detect missing features (Equation 3), as compared to MLE-R (which is also shown in Figure 4).

Figure 5 shows the Δ of this robot as compared to the MLE-R from Figure 4 after using the first ten orders as a burn-in period. This R achieves higher Δ than MLE-R when missing features and still does as well as MLE-R when it isn't missing features.

Note that this strategy relies on the fact that MLE-R has the property of always following an undominated first order. If R were using the optimal policy, it is unclear what kind of simple property we could use to detect missing features. This gives us another reason for using an approximation: we may be able to leverage its properties to detect misspecification.
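A sketch of the detection rule in Equation 3 is given below. It is my own illustration rather than the paper's implementation: the burn-in length, the threshold, the norm bound, and the single-order MLE routine (projected gradient ascent, following Theorem 4a's norm constraint) are all assumptions, and θ̂_1(h_i) is interpreted as the first-order MLE computed from order i alone.

```python
import numpy as np

rng = np.random.default_rng(3)
N_ACT, BETA, B, EPS, K = 10, 2.0, 10, 0.2, 100.0   # burn-in B, threshold eps, norm bound K

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def single_order_mle(Phi, o, beta, iters=500, lr=0.3):
    """Norm-constrained MLE from a single order (cf. Theorem 4a): projected gradient ascent."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        p = softmax(Phi @ theta / beta)
        theta += lr * (Phi[o] - p @ Phi) / beta
        norm = np.linalg.norm(theta)
        if norm > K:
            theta *= K / norm                      # project back onto the ball ||theta|| <= K
    return theta

def burn_in_check(d_true, d_robot):
    """Obey for B rounds, estimate the sample first-order obedience, decide whether to keep obeying."""
    theta_true = rng.normal(size=d_true)
    agree = []
    for _ in range(B):
        Phi = rng.normal(size=(N_ACT, max(d_true, d_robot)))
        o = rng.choice(N_ACT, p=softmax(Phi[:, :d_true] @ theta_true / BETA))
        Phi_obs = Phi[:, :d_robot]                         # the features R actually sees
        theta_hat = single_order_mle(Phi_obs, o, BETA)     # MLE from this order alone
        agree.append(int(np.argmax(Phi_obs @ theta_hat)) == o)
    o_bar = np.mean(agree)                                 # sample first-order obedience
    return o_bar, o_bar < 1 - EPS                          # True -> stay blindly obedient

for d_robot, label in [(2, "missing"), (10, "correct")]:
    o_bar, stay_obedient = burn_in_check(10, d_robot)
    print(f"{label:>8} features: sample O_1 = {o_bar:.2f}   stay obedient = {stay_obedient}")
```

When R's feature set can explain the orders, the single-order MLE nearly always agrees with them (Theorem 4a), so R goes on to act autonomously; when relevant features are missing, agreement drops and R falls back to obedience.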
6 Related Work

Ensuring Obedience. There are several recent examples of research that aims to provably ensure that H can interrupt R [Soares et al., 2015; Orseau and Armstrong, 2016; Hadfield-Menell et al., 2017]. [Hadfield-Menell et al., 2017] show that R's obedience depends on a tradeoff between R's uncertainty about θ and H's rationality. However, they considered R's uncertainty in the abstract. In practice R would need to learn about θ through H's behavior. Our work analyzes how the way R learns about θ impacts its performance and obedience.

Intent Inference for Assistance. Instead of just being blindly obedient, an autonomous system can infer H's intention and actively assist H in achieving it. "Do What I Mean" software packages interpret the intent behind what a programmer wrote to automatically correct programming errors [Teitelman, 1970]. When a user uses a telepointer, network lag can cause jitter in her cursor's path. [Gutwin et al., 2003] address this by displaying a prediction of the user's desired path, rather than the actual cursor path. Similarly, in assistive teleoperation, the robot does not directly execute H's (potentially noisy) input. It instead acts based on an inference of H's intent. In [Dragan and Srinivasa, 2012] R acts according to an arbitration between H's policy and R's prediction of H's policy. Like our work, [Javdani et al., 2015] formalize assistive teleoperation as a POMDP in which H's goals are unknown, and try to optimize an inference of H's goal. While assistive teleoperation a priori assumes that R should act assistively, we show that under model misspecification sometimes it is better for R to simply defer to H, and we contribute a method to decide between active assistance and blind obedience (Remark 4).

Inverse Reinforcement Learning. We use inverse reinforcement learning [Ng et al., 2000; Abbeel and Ng, 2004] to infer θ from H's orders. We analyze how different IRL algorithms affect autonomy advantage and obedience, properties not previously studied in the literature. In addition, we analyze how model misspecification of the featurization of the reward parameters or of H's rationality impacts autonomy advantage and obedience.

IRL algorithms typically assume that H is rational or noisily rational. We show that Maximum Entropy IRL [Ziebart et al., 2008] is robust to misspecification of a noisily rational H's rationality (β). However, humans are not truly noisily rational, and in the future it is important to investigate other models of humans in IRL and their potential misspecifications. [Evans et al., 2016] take a step in this direction and model H as temporally inconsistent and potentially having false beliefs. In addition, IRL assumes that H acts without awareness of R's presence; cooperative inverse reinforcement learning [Hadfield-Menell et al., 2016] relaxes this assumption by modeling the interaction between H and R as a two-player cooperative game.

7 Conclusion

To summarize our key takeaways:

1. (Δ > 0) If H is not rational, then R can always attain a positive Δ. Thus, forcing R to be blindly obedient requires giving up on a positive Δ.
2. (Δ vs O) There exists a tradeoff between Δ and O. At the limit R attains the maximum Δ, but only obeys H's order when it is the optimal action.
3. (MLE-R) When H is noisily rational, MLE-R is at least as obedient as any other IRL-R to H's first order. This suggests that the MLE is a good approximation to R* because it errs on the side of obedience.
4. (Wrong β) MLE-R is robust to having the wrong model of the human's rationality (β), but R* is not. This suggests that we may not want to use the optimal policy because it may not be very robust to misspecification.
5. (Wrong Θ) If R has extra features, it is more obedient than with the true model, whereas if it is missing features, then it is less obedient. If R has extra features, it will still converge to the maximum Δ. But if R is missing features, it is sometimes better for R to be obedient. This implies that erring on the side of extra features is far better than erring on the side of fewer features.
6. (Detecting wrong Θ) We can detect missing features by checking how likely MLE-R is to follow the first order.

Overall, our analysis suggests that in the long term we should aim to create robots that intelligently decide when to follow orders, but in the meantime it is crucial to ensure that these robots err on the side of obedience and are robust to misspecified models.

Acknowledgements

We thank Daniel Filan for feedback on an early draft.

References

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, page 1. ACM, 2004.
[Asaro, 2006] Peter M Asaro. What Should We Want From a Robot Ethic? International Review of Information Ethics, 6(12):9–16, 2006.
[Bostrom, 2014] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. OUP Oxford, 2014.
[Choi and Kim, 2011] Jaedeug Choi and Kee-Eung Kim. MAP Inference for Bayesian Inverse Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 1989–1997, 2011.
[Diaconis and Freedman, 1986] Persi Diaconis and David Freedman. On the consistency of Bayes estimates. The Annals of Statistics, pages 1–26, 1986.
[Dragan and Srinivasa, 2012] Anca D Dragan and Siddhartha S Srinivasa. Formalizing assistive teleoperation. MIT Press, July, 2012.
[Evans et al., 2016] Owain Evans, Andreas Stuhlmüller, and Noah D Goodman. Learning the preferences of ignorant, inconsistent agents. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 323–329. AAAI Press, 2016.
[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative Inverse Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
[Hadfield-Menell et al., 2017] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The Off-Switch Game. In IJCAI, 2017.
[Javdani et al., 2015] Shervin Javdani, Siddhartha S Srinivasa, and J Andrew Bagnell. Shared autonomy via hindsight optimization. In Robotics: Science and Systems, 2015.
[Kaelbling et al., 1996] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[Lewis, 2014] John Lewis. The Case for Regulating Fully Autonomous Weapons. Yale Law Journal, 124:1309, 2014.
[Michini and How, 2012] Bernard Michini and Jonathan P How. Improving the Efficiency of Bayesian Inverse Reinforcement Learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 3651–3656. IEEE, 2012.
[Ng et al., 2000] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.
[Orseau and Armstrong, 2016] Laurent Orseau and Stuart Armstrong. Safely Interruptible Agents. In UAI, 2016.
[Ramachandran and Amir, 2007] Deepak Ramachandran and Eyal Amir. Bayesian Inverse Reinforcement Learning. In IJCAI, 2007.
[Russell et al., 2015] Stuart Russell, Daniel Dewey, and Max Tegmark. Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine, 36(4):105–114, 2015.
[Scheutz and Crowell, 2007] Matthias Scheutz and Charles Crowell. The Burden of Embodied Autonomy: Some Reflections on the Social and Ethical Implications of Autonomous Robots. In Workshop on Roboethics at the International Conference on Robotics and Automation, Rome, 2007.
[Soares et al., 2015] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[Teitelman, 1970] Warren Teitelman. Toward a Programming Laboratory. Software Engineering Techniques, page 108, 1970.
[Weld and Etzioni, 1994] Daniel Weld and Oren Etzioni. The First Law of Robotics (a call to arms). In AAAI, volume 94, pages 1042–1047, 1994.
[Ziebart et al., 2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.