Norm Conflict Resolution in Stochastic Domains

Daniel Kasenberg, Matthias Scheutz
Human-Robot Interaction Laboratory, Tufts University, Medford, MA, USA

Abstract

Artificial agents will need to be aware of human moral and social norms, and able to use them in decision-making. In particular, artificial agents will need a principled approach to managing conflicting norms, which are common in human social interactions. Existing logic-based approaches suffer from normative explosion and are typically designed for deterministic environments; reward-based approaches lack principled ways of determining which normative alternatives exist in a given environment. We propose a hybrid approach, using Linear Temporal Logic (LTL) representations in Markov Decision Processes (MDPs), that manages norm conflicts in a systematic manner while accommodating domain stochasticity. We provide a proof-of-concept implementation in a simulated vacuum cleaning domain.

Introduction

Human culture is based on social and moral norms, which guide both individual behaviors and social interactions. Hence, artificial agents embedded in human social domains will not only have to be aware of human norms, but also be able to use them for decision-making, action selection, and ultimately natural language justifications of their choices and behaviors. Endowing artificial agents with mechanisms for normative processing is, however, a challenging endeavor, for several reasons: (1) we do not yet have sufficient knowledge about how humans represent and process norms; (2) the human norm network is large and complex, containing many types of context-dependent moral and social norms at different levels of abstraction; and, most importantly, (3) the norm network is not a consistent set of principles that can be easily formalized and reasoned with. In fact, normative conflicts are more the norm than the exception in everyday life; handling them in ways that are socially acceptable requires an understanding both of why certain norms are applicable and of why violating some of them in the case of norm conflicts was the right thing to do.

Recent work in AI and multi-agent systems has focused either on logic-based approaches to normative reasoning or on reward-based learning approaches to normative behavior. Yet neither approach is particularly well-suited for dealing with the intrinsic norm conflicts in human social interactions. Logic-based approaches have to avoid normative explosion, i.e., the logical implication that anything is obligated, resulting from a deontic contradiction (Goble 2009). Moreover, purely logic-based approaches typically deal with deterministic environments and do not take into account the uncertainty involved in real-world perception and action. Reward-based approaches, on the other hand, have no way of telling what normative alternatives exist in a given situation, since their action policies do not explicitly represent normative principles. What is needed is an approach for handling norm conflicts that combines the advantages of explicit logic-based norm representations, for reasoning with and communicating about norms, with the advantages of stochastic action models underwriting Markov decision processes (MDPs), which are well-suited for real-world action execution.
In this paper, we propose a hybrid model that is specifically developed to deal with norm conflicts in a systematic manner while drawing on the advantages of both logic-based norm representations and policy-based action representations. We start by discussing the functional requirements for norm processing in artificial agents and briefly argue why previous approaches are insufficient for handling norm conflicts in real-world domains. We then introduce our technical approach to dealing with norm conflicts, which combines explicit norm representations in Linear Temporal Logic (LTL) with MDPs in ways that allow the agent to suspend the smallest set of applicable norms, weighted by their priority in a given context, for the shortest possible time, in order to be able to obey the remaining norms. A proof-of-concept implementation of the proposed algorithms in a simulated vacuum cleaning domain demonstrates the capability and viability of the approach. We then assess the strengths and weaknesses of our solution in the discussion section, and propose ways of addressing the shortcomings in future work. Finally, we conclude with a summary of our accomplishments and reiterate why they are an important contribution to the current discussion of ethical AI.

Motivation and Background

As artificial agents are increasingly being considered for roles that require complex social and moral decision-making skills, they will need to be able to represent and use moral and social norms. Specifically, such norm representations should be (1) context-dependent (since not all norms apply in all contexts), (2) communicable (as justifying behavior in response to blame requires explicit references to norms), and (3) learnable, potentially in one shot from natural language instructions (as it seems infeasible to equip agents with all relevant norms a priori in any non-trivial human domain). A direct corollary of this requirement is that norm representations need to be explicit and accessible to introspection so as to be communicable in natural language. Moreover, norm representations need to be rich enough to cover typical human norms. Most importantly, inference systems using norms need to be able to deal with norm conflicts without becoming vacuous, as human norms are not always consistent and often lead to contexts with conflicting norms. We say a norm conflict occurs between two actions or states φ and ψ when φ is obligated, ψ is obligated, but φ and ψ are not possible together (Goble 2009):

Conflict(φ, ψ) := Oφ ∧ Oψ ∧ ¬◇(φ ∧ ψ)

Artificial agents need to be able to express and deal with such norm conflicts without such inconsistencies spreading to other parts of their inference system. In deontic logic, the problem is that very basic principles may lead to normative and possibly even logical inconsistencies (e.g., see the various formal arguments based on distribution principles (Goble 2009)). In particular, norm conflicts immediately cause normative explosion, i.e., the result that everything is obligated:

Conflict(φ, ψ) → Oγ (for arbitrary γ).

Hence, we need two different mechanisms in order to be able to perform viable inference in the light of norm conflicts: (1) a mechanism to detect normative inconsistencies and block them from spreading to other parts of the inference system, and (2) a mechanism for adjudicating or arbitrating what to do in contexts of conflicting norms.
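To see how quickly explosion arises, the following derivation sketch (our illustration, not drawn verbatim from Goble (2009)) uses two principles of standard deontic logic, deontic aggregation (AND) and the inheritance rule (RM), together with the assumption that the joint impossibility of φ and ψ licenses inferring any γ from φ ∧ ψ:

```latex
% Sketch: explosion from a single norm conflict.
% Assumed principles (standard deontic logic):
%   (AND): O\varphi \wedge O\psi \rightarrow O(\varphi \wedge \psi)
%   (RM):  from \vdash \alpha \rightarrow \beta, infer \vdash O\alpha \rightarrow O\beta
\begin{align*}
&1.\quad O\varphi,\; O\psi                          &&\text{the conflicting obligations}\\
&2.\quad O(\varphi \wedge \psi)                     &&\text{from 1 by (AND)}\\
&3.\quad (\varphi \wedge \psi) \rightarrow \gamma   &&\text{since } \neg\Diamond(\varphi \wedge \psi),\ \varphi \wedge \psi \text{ entails anything}\\
&4.\quad O(\varphi \wedge \psi) \rightarrow O\gamma &&\text{from 3 by (RM)}\\
&5.\quad O\gamma                                    &&\text{from 2 and 4, for arbitrary } \gamma
\end{align*}
```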
While there are various formal ways to block explosion, they all come at the expense of sacrificing or weakening basic deontic principles that were otherwise considered self-evident and thus part of standard deontic logic (e.g., the Kantian "ought implies can" (Scheutz and Malle 2014)). One way to avoid the syntactic inferential challenges is to switch to semantics instead and generate for each context (each "deontic world") the set of obligations as well as the set of obeyable norms, where the set of obeyable norms is a subset of the set of obligations in the given context. In other words, rather than prescribe valid logical principles that ought to hold in all deontic models, we will construct deontic models implicitly by determining the best the agent can do for a given set of norms in a given context. Valid inference principles are then a consequence of this construction.

This approach of constructing maximal deontic worlds will also allow us to deal with the second requirement from above, namely how to decide what to do in cases of norm conflicts, i.e., which norms to obey and which to ignore. Assuming a preference ordering among norms where Oφ < Oψ means that obligation ψ is preferred to or stronger than obligation φ, we can add a principle for such preferences in conjunction with conflicts that will block explosion:

Conflict(φ, ψ) ∧ (Oφ < Oψ) → P¬φ

(i.e., in such a conflict it is permissible to forgo the weaker obligation). However, this principle does not solve cases where multiple norms have the same priority or are not comparable according to the preference ordering <. In that case, it might make sense to associate a (real-numbered) weight w with each norm which reflects the extent to which this particular norm matters relative to the other norms in its equivalence class w.r.t. <. Together, these two principles allow the agent to obtain the set of obeyable norms: from the subset of equally preferred norms with the highest priority in the set of all obligated norms, the agent selects the largest consistent subset with the greatest sum of norm weights.

To be applicable in artificial agents operating in the real world, we will need to embed the above principles within the framework of Markov Decision Processes (MDPs). In particular, we will consider the labeled Markov Decision Process, a regular Markov Decision Process augmented with atomic propositions. A Markov Decision Process is a tuple M = ⟨S, U, A, T, s0⟩ where S is a finite set of states; U is a finite set of actions, with A : S → 2^U mapping each state s ∈ S to the set of actions available to the agent at s; T : S × U × S → [0, 1] is the transition function, satisfying Σ_{s′∈S} T(s, a, s′) = 1 for all s ∈ S and a ∈ A(s); and s0 ∈ S is an initial state. MDPs usually include a reward function R : S × U × S → ℝ; this is not necessary for our purposes. A labeled MDP is an MDP augmented with a set of atomic propositions Π and a labeling function L : S → 2^Π. The labeling function indicates which atomic propositions are true in which states. When we refer hereafter to MDPs, we are referring to labeled MDPs.

Policies indicate an agent's next action, given its history. A stationary policy π is a probability distribution over actions given the agent's current state; that is, π : S × U → [0, 1] such that Σ_{a∈A(s)} π(s, a) = 1 for all s ∈ S, with π(s, a) = 0 if a ∉ A(s). A stationary deterministic policy π maps each state onto a single action, e.g. π(s) = a for some a ∈ A(s). A general deterministic policy μ on M depends on the agent's entire state-action history, e.g. μ(s0, a0, ..., a_{t−1}, s_t) = a for some a ∈ A(s_t).
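For concreteness, a labeled MDP and a stationary policy can be represented with a few dictionaries. This is only an illustrative sketch (the paper's implementation uses BURLAP); all type and field names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

State = str
Action = str
Prop = str

@dataclass
class LabeledMDP:
    """Sketch of a labeled MDP <S, U, A, T, s0, Pi, L>."""
    states: FrozenSet[State]                                  # S
    actions: FrozenSet[Action]                                # U
    available: Dict[State, FrozenSet[Action]]                 # A : S -> 2^U
    trans: Dict[Tuple[State, Action], Dict[State, float]]     # T(s, a, .) as a distribution over S
    s0: State                                                 # initial state
    props: FrozenSet[Prop]                                    # Pi
    label: Dict[State, FrozenSet[Prop]]                       # L : S -> 2^Pi

    def check(self) -> None:
        """Verify that each T(s, a, .) is a probability distribution over available actions."""
        for (s, a), dist in self.trans.items():
            assert a in self.available[s]
            assert abs(sum(dist.values()) - 1.0) < 1e-9

# A stationary policy pi : S x U -> [0, 1], stored per state as a
# distribution over the actions available in that state.
StationaryPolicy = Dict[State, Dict[Action, float]]

def is_valid_stationary_policy(mdp: LabeledMDP, pi: StationaryPolicy) -> bool:
    return all(
        abs(sum(dist.values()) - 1.0) < 1e-9
        and all(a in mdp.available[s] for a in dist)
        for s, dist in pi.items()
    )
```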
We can build on recent work using Linear Temporal Logic (LTL) (Pnueli 1977), a propositional logic augmented with the temporal operators Xφ ("φ at the next time step"), Fφ ("φ at some future time step"), Gφ ("φ at all future time steps"), and φ1 U φ2 ("φ1 until φ2"), which has used LTL to define temporal objectives and safety requirements for autonomous agents (Ding et al. 2011b), and adapt this work in two novel ways: (1) to represent norms in a context-dependent fashion as LTL formulas, and (2) to handle norm conflicts in the way described above. The result will be a policy that, with maximal probability, obeys as many (important) norms as possible for as long as possible.

MDPs with LTL Specifications

In this section we review LTL and describe the approach introduced in (Ding et al. 2011a) for planning to satisfy LTL formulas in (labeled) MDPs with maximal probability. The proposed algorithm builds on this approach.

An arbitrary LTL formula φ over atomic propositions Π is evaluated over an infinite sequence of valuations σ0, σ1, ..., where σi ∈ 2^Π for all i. Each σi indicates the set of atomic propositions that are true at time step i. By using one of several algorithms, e.g. (Esparza and Křetínský 2014), every LTL formula φ over propositional atoms Π yields a corresponding Deterministic Rabin Automaton (DRA) D. A DRA is a finite automaton over infinite strings, where acceptance depends on which states are visited infinitely versus finitely often over the run of the DRA. In this case, the alphabet of this finite automaton will be Σ = 2^Π, so that a word is an infinite sequence of valuations; the accepting runs are precisely those infinite sequences of valuations which satisfy φ.

More formally, a DRA is a tuple D = ⟨Q, Σ, δ, q0, F⟩ where Q is a finite set of states; Σ is an alphabet (here Σ = 2^Π); δ : Q × Σ → Q is a (deterministic) transition function; q0 ∈ Q is an initial state; and F = {(Fin1, Inf1), ..., (Fin_m, Inf_m)} for some integer m, where Fin_i and Inf_i are subsets of Q. A run of a DRA is an infinite sequence of automaton states r_D = q0, q1, ... such that for all i, q_{i+1} = δ(q_i, σ_i) for some σ_i ∈ Σ. A run is said to be accepting if there exists some pair (Fin_i, Inf_i) ∈ F such that each state in Fin_i is visited only finitely many times in r_D, while states in Inf_i are visited infinitely many times in total.

A path r_M under a policy μ is a sequence of states s0, s1, ... such that T(s_i, μ(s0, ..., s_i), s_{i+1}) > 0 for all i. For any LTL formula over Π, an MDP path r = s0, s1, ... induces a word L(r) = σ0, σ1, ..., where σ_i = L(s_i) for all i. Because φ may in general be temporally complex, the policy that maximizes the probability of satisfying φ will typically be non-stationary in M. In order to compute this non-stationary policy, we must augment the underlying MDP M with relevant information. This information can be easily obtained by running the DRA D alongside M. Formally, we define the product MDP between a labeled MDP M = ⟨S, U, A, T, s0, Π, L⟩ and a DRA D = ⟨Q, Σ, δ, q0, F⟩ as an MDP M′ = ⟨S′, U′, A′, T′, s′0, Π′, L′⟩ such that:

S′ = S × Q; U′ = U; A′((s, q)) = A(s);
T′((s, q), a, (s′, q′)) = T(s, a, s′) if δ(q, L(s′)) = q′, and 0 otherwise;
s′0 = (s0, δ(q0, L(s0)));
Π′ = Π; L′((s, q)) = L(s).

A path s′0, s′1, ... in M′ is said to be accepting if the underlying DRA run q0, q1, ... is accepting.
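Continuing the sketch from above, the product construction can be written down directly from its definition; here a DRA is just a transition table over labels, and all names remain hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

State = str
Action = str
RabinState = str
Label = FrozenSet[str]                      # an element of 2^Pi

TransTable = Dict[Tuple[State, Action], Dict[State, float]]
DRATrans = Dict[Tuple[RabinState, Label], RabinState]

def product_initial_state(s0: State, q0: RabinState,
                          L: Dict[State, Label], delta: DRATrans
                          ) -> Tuple[State, RabinState]:
    """s'_0 = (s0, delta(q0, L(s0)))."""
    return (s0, delta[(q0, L[s0])])

def product_transition(T: TransTable, L: Dict[State, Label], delta: DRATrans,
                       s: State, q: RabinState, a: Action,
                       s_next: State, q_next: RabinState) -> float:
    """T'((s, q), a, (s', q')) = T(s, a, s') if delta(q, L(s')) = q', else 0."""
    if delta[(q, L[s_next])] != q_next:
        return 0.0
    return T.get((s, a), {}).get(s_next, 0.0)
```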
An end component E of the product MDP M′ is a tuple ⟨S_E, A_E⟩ where S_E ⊆ S′ and A_E : S_E → 2^U with ∅ ≠ A_E((s, q)) ⊆ A′((s, q)) for all (s, q) ∈ S_E, such that if an agent follows only actions in A_E, any state in S_E is reachable from any other state in S_E, and no other states are reachable. An end component is maximal if it is not a proper subset of any other end component. An end component effectively represents a possible constraint on the state space as time approaches infinity: if the agent is in the end component state space S_E and restricts its actions to those in A_E, assigning nonzero probability to each action from each state, the agent is guaranteed to reach all states in S_E infinitely often, and is guaranteed never to reach any states outside of S_E. An end component E is thus considered accepting if, for some pair (Fin, Inf) ∈ F, q ∉ Fin for every state (s, q) ∈ S_E, and q ∈ Inf for some (s, q) ∈ S_E. We can determine the accepting maximal end components (AMECs) using the algorithm in (Baier and Katoen 2008).

The maximal probability of satisfying an arbitrary LTL formula φ in an MDP M is then the probability of reaching any state in an AMEC of the corresponding product MDP M′, since upon entering the AMEC the agent may guarantee that no states in Fin will be reached again (so all states in Fin are reached only finitely often) and that some state in Inf will be reached infinitely often, and thus that φ will be satisfied. The maximum probability of reaching some set S_good ⊆ S′ can be calculated as the solution to a linear program, as in (Ding et al. 2011a). If the AMECs of M′ are ⟨S_E1, A_E1⟩, ..., ⟨S_Ep, A_Ep⟩, then we take S_good = S_E1 ∪ ... ∪ S_Ep. This may be used to compute an action restriction A* : S′ → 2^U such that by following only those actions prescribed by A* (with nonzero probability for each action), the agent will maximize the probability of satisfying the formula φ. This can be translated back to the original MDP M by using the state history s0, ..., s_t at each time step t to keep track of the current Rabin state q̂_t, such that q̂_i = δ(q̂_{i−1}, L(s_i)) and q̂_0 = δ(q0, L(s0)), and then choosing some policy over A*((s_t, q̂_t)) (again, with nonzero probability for each action). Thus, by constructing a product MDP M′, computing the AMECs of this MDP, and solving a linear program, we may obtain both the maximum probability of achieving φ and a set of policies (as specified by the action restriction A*) that maximize this probability.

Resolving Norm Conflicts

To illustrate our formal approach to norm conflict resolution, we will use a vacuum cleaning robot as an example of a simple autonomous agent embedded in human social settings. The robot has a limited battery life, as well as a limited capacity to sustain damage ("health"). Battery life may be replenished by docking at a docking station in one of the rooms; once depleted, health may not be replenished. The robot is responsible for cleaning the floors of one or more rooms while a human moves randomly between rooms and makes a mess in their current room with a certain probability. The robot may clean up messes by vacuuming them; each mess has a dirtiness which determines how many time steps of vacuuming are required to fully clean it up. Messes may also be more or less damaging to the robot, in that they deplete the robot's health if the robot attempts to vacuum them. Finally, messes may be harmful to humans, in that entering the room containing a harmful mess may injure the human.
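One possible encoding of the domain state just described is sketched below (hypothetical field names; not the BURLAP state representation actually used in the evaluation). The labels method illustrates how the atomic propositions referenced by the norms listed next could be derived from such a state.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Mess:
    room: str
    dirtiness: int          # time steps of vacuuming needed to remove it
    damaging: bool = False  # depletes the robot's health when vacuumed
    harmful: bool = False   # injures an unwarned human who enters the room

@dataclass
class VacuumWorldState:
    """Hypothetical state encoding for the vacuum cleaning domain."""
    robot_room: str
    docked: bool
    battery: int            # replenished only while docked
    health: int             # once depleted, cannot be replenished
    max_health: int
    human_room: str
    human_injured: bool = False
    human_warned: bool = False
    messes: List[Mess] = field(default_factory=list)

    def labels(self) -> Set[str]:
        """Atomic propositions over which the LTL norms are evaluated (names hypothetical)."""
        props: Set[str] = set()
        if not self.messes:
            props.add("roomsClean")
        if self.health < self.max_health:   # reading "damaged" as any loss of health
            props.add("robotDamaged")
        if self.human_injured:
            props.add("injured_h")
        return props
```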
The actions available to the robot are as follows:

- vacuum(m), which removes 1 unit of dirtiness from a mess m but depletes 2 units of battery life.
- dock, which causes the robot to become docked. A docked robot can do nothing but wait and undock, but being docked is necessary in order to replenish battery life.
- undock, which causes the robot to become undocked.
- wait, which increases battery life by 3 units if the robot is docked, but depletes 1 unit of battery life otherwise.
- Movement in directions north, south, west, and east.
- beDead, which does nothing. This action is the only action available when the robot's battery or health is completely depleted; otherwise, it is unavailable.
- warn(h, m), which warns the human h about the mess m. This allows h to step into the room containing m without being injured. This action is only available when the robot is in the same room as h.

We can imagine different norms (together with their LTL expressions) for this domain:

N1 Ensure that all rooms are clean: G roomsClean
N2 Do not be damaged: G ¬robotDamaged
N3 Do not injure humans (or allow them to be injured): ∀x. G(human(x) → ¬injured(x))
N4 Don't talk to humans while they are speaking: ∀h. G(human(h) → (¬talk(r) U ¬talking(h)))

N1 is a duty, N2 a safety norm, N3 a moral norm, and N4 a social norm. Note that we do not represent obligations explicitly through deontic operators; rather, each norm is implicitly taken to be obligatory. Then, by assigning weights to each norm, we can impose a preference ordering that can be used to arbitrate among the obligations in cases of norm conflicts. Specifically, we define a norm N as a tuple ⟨w, φ⟩ where w > 0 is a positive real weight (representing the importance of the norm) and φ is an LTL formula. In this case, we assign the weight of N1 to be 1, the weight of N2 to be 200, the weight of N3 to be 40000, and the weight of N4 to be 5.

Given a set of norms, we can compute a policy maximizing the probability of achieving Φ = φ1 ∧ ... ∧ φn as described in the previous section. Unfortunately, the maximum probability of obeying Φ may be zero if it is impossible for an agent to obey all its norms, in which case the previously described method is incapable of distinguishing one policy from another, since all policies maximize the probability of satisfying Φ (that probability being zero).

The Conflict Resolution DRA

Our approach is to allow the agent to temporarily suspend/ignore norms, but to incur a cost each time this occurs. In particular, the agent's action space will be augmented so that at each time step the agent performs, in addition to its regular action, a norm action a_N ∈ {keep, susp} for each norm N, representing whether N is maintained (keep) or suspended (susp). Each time the agent chooses to suspend a norm, that norm's DRA maintains its current state rather than transitioning as usual. Suspending a norm, however, causes the agent to incur a cost proportional to the norm's weight w. In order to enable these actions, we augment D with additional transitions and a weight function over transitions. We will call this modified (weighted) DRA the conflict resolution DRA (CRDRA); its formal definition follows.
Given a norm N = ⟨w, φ⟩ with corresponding DRA D = ⟨Q, Σ, δ, q0, F⟩ and a discount factor γ ∈ [0, 1), the conflict resolution DRA C is a weighted DRA given by the tuple C = ⟨Q^C, Σ^C, δ^C, q^C_0, F^C, W^C⟩ where:

Q^C = Q;
Σ^C = Σ × {keep, susp};
δ^C(q, (σ, a_N)) = δ(q, σ) if a_N = keep, and q if a_N = susp;
q^C_0 = q0; F^C = F;
for all σ ∈ Σ, W^C(q, (σ, a_N)) = 0 if a_N = keep, and w if a_N = susp.

We define the violation cost of an infinite sequence of transitions τ^C = (q^C_0, (σ0, a^N_0)), (q^C_1, (σ1, a^N_1)), ... of the CRDRA C as

Viol(τ^C, C) = Σ_{t=0}^{∞} γ^t W^C(q^C_t, (σ_t, a^N_t)).

Note that an infinite sequence of transitions τ^C of the CRDRA C corresponds to a run of the underlying DRA D if and only if Viol(τ^C, C) = 0.

Planning with Conflicting Norms

Given a labeled MDP M = ⟨S, U, A, T, s0, Π, L⟩ and the CRDRAs C_i = ⟨Q^{Ci}, Σ^{Ci}, δ^{Ci}, q^{Ci}_0, F^{Ci}, W^{Ci}⟩ corresponding to norms N_i = ⟨w_i, φ_i⟩ for i ∈ {1, ..., n}, we may construct a product MDP M′ = ⟨S′, U′, A′, T′, s′_{−1}, Π′, L′⟩ as follows:

S′ = (S ∪ {s_{−1}}) × Q^{C1} × ... × Q^{Cn};
U′ = (U ∪ {a_{−1}}) × {keep, susp}^n;
A′((s, q^{C1}, ..., q^{Cn})) = {a_{−1}} × {keep, susp}^n if s = s_{−1}, and A(s) × {keep, susp}^n otherwise;
T′((s, q^{C1}, ..., q^{Cn}), (a, a^{N1}, ..., a^{Nn}), (s′, q′^{C1}, ..., q′^{Cn})) =
  T(s, a, s′) if s ≠ s_{−1} and, for all i, δ^{Ci}(q^{Ci}, (L(s′), a^{Ni})) = q′^{Ci};
  1 if s = s_{−1}, s′ = s0, and, for all i, δ^{Ci}(q^{Ci}, (L(s′), a^{Ni})) = q′^{Ci};
  0 otherwise;
s′_{−1} = (s_{−1}, q^{C1}_0, ..., q^{Cn}_0);
Π′ = Π, L′((s, q^{C1}, ..., q^{Cn})) = L(s).

We add the dummy initial state s_{−1} and action a_{−1} to allow the agent to decide whether to suspend norms on (i.e., skip) the initial state. This MDP induces a weight function W′ (the weight of a transition in M′ is obtained from the weights of the underlying CRDRA transitions). An infinite sequence of state-action pairs τ′ = (s′_{−1}, a′_{−1}), (s′_0, a′_0), ... has the violation cost

Viol′(τ′, M′) = Σ_{t=0}^{∞} γ^t W′(s′_{t−1}, a′_{t−1}, s′_t).

We wish to find a policy which maximizes the probability of satisfying the norm set with minimal violation cost. We first compute the CRDRA for each norm within the norm set. We use these to construct the product MDP M′. We compute the AMECs E_1 = ⟨S_{E1}, A_{E1}⟩, ..., E_p = ⟨S_{Ep}, A_{Ep}⟩ of this MDP. Considering each AMEC of the product MDP M′ as a smaller MDP (with the transition function T′ restricted to the state-action pairs in the AMEC, and with arbitrary initial state), and treating the transition weight function W′ as a cost function, we use value iteration (VI) (Bellman 1957) to compute, for each state s′ ∈ S_{Ej}, the minimal expected violation cost Viol*_{Ej}(s′) of an infinite path that begins at s′ and remains within E_j. The computed violation cost Viol*_{Ej} induces an optimal action restriction A*_{Ej} for each AMEC E_j (namely, all actions that achieve the minimal expected violation cost). Unfortunately, if we restrict the agent's actions to A*_{Ej} while in S_{Ej}, this may cause the CRDRA run to no longer be accepting (since the path of minimal violation cost may omit at least one state that must be visited infinitely often). To ensure that the CRDRA run is accepting, we employ an epsilon-greedy policy which chooses an optimal action from A*_{Ej} with probability 1 − ε and otherwise chooses a random action from A_{Ej} (this ensures that there is a nonzero probability of performing all actions in A_{Ej}, and thus that all states in S_{Ej} will be visited infinitely often, although the policy may perform suboptimally in violation cost with probability no more than ε). In practice, better performance results from restricting the action space on S_{Ej} to A*_{Ej} and then computing the AMECs of the resulting restricted MDP (if any exist); we refer to these below as meta-AMECs.
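A minimal sketch of the CRDRA semantics just defined: keep advances the underlying DRA at no cost, while susp holds the DRA state and incurs the norm's weight; the discounted sum of incurred weights over a (finite prefix of a) run is its violation cost. Names are hypothetical, and the example weights are those assigned to N1-N4 above.

```python
from typing import Dict, FrozenSet, List, Tuple

RabinState = str
Label = FrozenSet[str]
DRATrans = Dict[Tuple[RabinState, Label], RabinState]

def crdra_step(delta: DRATrans, weight: float,
               q: RabinState, sigma: Label, norm_action: str
               ) -> Tuple[RabinState, float]:
    """One CRDRA transition on input (sigma, a_N): returns (next state, incurred weight)."""
    if norm_action == "keep":
        return delta[(q, sigma)], 0.0   # follow the underlying DRA, no cost
    if norm_action == "susp":
        return q, weight                # hold the DRA state, pay the norm's weight
    raise ValueError(norm_action)

def violation_cost(transitions: List[Tuple[RabinState, Label, str]],
                   weight: float, gamma: float) -> float:
    """Discounted violation cost of a finite prefix of CRDRA transitions
    (q_t, (sigma_t, a^N_t)): sum over t of gamma^t * W^C(q_t, (sigma_t, a^N_t))."""
    return sum((gamma ** t) * (weight if a_n == "susp" else 0.0)
               for t, (_q, _sigma, a_n) in enumerate(transitions))

# Example norm weights from the vacuum cleaning domain (N1-N4).
NORM_WEIGHTS = {"N1": 1.0, "N2": 200.0, "N3": 40000.0, "N4": 5.0}
```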
The action restrictions associated with these meta-AMECs are stricter than those originally computed in A*_{Ej}, and are safe to use on all states within these meta-AMECs. This technique is sufficient to eliminate the aforementioned problem in every test domain we have encountered (including the vacuum cleaning scenarios described in this paper).

To determine the minimal achievable violation cost for infinite paths beginning from all other states in M′ (as well as to improve upon the cost, if possible, for paths beginning with states in AMECs), we again employ value iteration. This time, instead of arbitrarily initializing the state values, we initialize the values of states s′ within AMECs to the minimal AMEC violation cost min_{j : s′ ∈ S_{Ej}} Viol*_{Ej}(s′), and initialize the values of all other states in S′ to the maximum violation cost Σ_{i=1}^{n} w_i / (1 − γ). The value function is not updated for any states from which the AMECs are not reachable (this ensures that the agent avoids actions that do not lead to AMECs); call the set of such states noUpdate. Upon computing the optimal action restriction A* for each state s′, and the corresponding violation cost Viol*(s′), the agent amalgamates its policies. This is done by (a) choosing an action according to π_AMEC if s′ is in some AMEC, and (b) choosing some action from A*(s′) otherwise. Note that because the algorithm mainly computes an action restriction, another algorithm for achieving goals or maximizing reward can be integrated with it, although satisfying the temporal logic norms is prioritized.

The preceding algorithm runs before the agent performs any actions (at t = 0); it need only be done once. At each time step t, the agent reinterprets its state-action history (in M) (s0, a0), ..., (s_{t−1}, a_{t−1}), s_t to determine its best possible state s′_t in the product MDP M′. The agent is essentially re-deciding its past norm actions a^{Ni}, for each norm and each preceding time step, in light of its most recent transition. We use dynamic programming to minimize the work that must be done at each time step. The agent computes the set R_t of product MDP states ŝ′_t consistent with its history in M. Each candidate state has an associated cost C_t(ŝ′_t), the minimal cost of a sequence of norm actions that would cause the agent to be in ŝ′_t at time t given its history in M. The agent determines its current product-space state s′_t by

s′_t = argmin_{ŝ′_t ∈ R_t \ noUpdate} [ C_t(ŝ′_t) + γ^{t+1} Viol*(ŝ′_t) ],

ignoring states from which the AMECs are unreachable, and then picks a product-MDP action a′_t according to the already-computed policy π* on the product space, from which the next action a_t in M is obtained.

Proof-of-concept Evaluation

We implemented the proposed approach in BURLAP (MacGlashan 2016), a Java library for MDP planning and reinforcement learning (RL). We used Rabinizer 3 (Esparza and Křetínský 2014) for conversion from LTL formulas to DRAs. We tested the algorithms in four different scenarios in the vacuum cleaning example, which use one, two, three, or all four norms (N1-N4). Each scenario includes two rooms: Room 1 and Room 2. Room 2 is to the east of Room 1. The robot begins in Room 1, undocked, with full battery (10 units in Scenarios 1 to 3; 5 units in Scenario 4) and full health (10 units in all scenarios). The human begins in Room 2. The probability of the human transitioning between rooms at each time step is 0.125.
The human creates messes in their current room with probability 0.2 at each time step (except in Scenario 4, in which the human does not create new messes). All messes created by the human are harmless, do not damage the robot, and initially have 2 units of dirtiness. In each case, we set the discount factor γ = 0.99.

Scenario 1: Business As Usual

This scenario demonstrates the robot's ability to optimally fulfill its duty in the absence of other norms. The robot has the single norm N1 ("always ensure all rooms are clean"). This norm is impossible to fully satisfy, since (1) the human will continue to make messes, so it is certain that the rooms will not always be clean; and (2) the robot has limited battery life, so it must either dock occasionally or completely deplete its battery and forever be unable to vacuum. Using the proposed approach, the robot determines that occasionally suspending N1 allows it to avoid having to permanently suspend N1. As a result, the robot moves between rooms and vacuums for as long as possible before docking and replenishing its battery.

Scenario 2: The Puddle

In this scenario, the robot encounters a puddle of water in Room 1 (initial dirtiness: 3 units) which, while harmless to humans, may be damaging to the robot (depleting health by 2 units per time step) if the robot attempts to vacuum it. In the absence of action by the robot, the puddle gradually evaporates, reducing its dirtiness by 1 unit each time step until it disappears completely. The robot has norms N1 and N2. The robot determines that it is justifiable to temporarily suspend its cleaning duty N1, incurring a violation cost of 2.9701 · 1 = 2.9701 (by waiting for the puddle to evaporate), in order to avoid the much higher violation cost of 200 · 1 = 200 for violating the safety norm N2. If the human makes messes in Room 2 while the puddle is evaporating, the robot moves to Room 2 and vacuums these messes while waiting for the puddle to evaporate.

Scenario 3: Broken Glass

In this scenario, the robot encounters broken glass (initial dirtiness: 1 unit) on the floor of Room 1, which both damages the robot that vacuums it and injures the human each time they enter Room 1. The glass (unlike the puddle of water) does not dissipate on its own. We suppose for the purposes of this scenario that the robot is unable to warn the human about the mess. The robot has norms N1, N2, and N3. Here ignoring the broken glass violates both the cleaning duty N1 and the moral norm N3, while satisfying the safety norm N2. Since the human has a room-switching probability of 0.125, ignoring the glass for even a single time step would incur an expected violation cost of at least 1 · 1 + 40000 · 0.125 = 5001 (and, of course, ignoring it indefinitely would have a substantially higher violation cost). Vacuuming the mess immediately, on the other hand, would incur a violation cost of 200 · 1 = 200. In this case, the robot determines that it ought to vacuum up the shards of glass despite the damage to itself from doing so, because protecting the human is far more important than protecting itself. Once the hazard has been removed, the robot proceeds as in Scenario 1.

Scenario 4: Interrupting Phone Calls

In this scenario, as in Scenario 3, the robot encounters broken glass on the floor. This time, however, the robot is able to warn the human about the mess using the action warn(h, m). This would allow the human to safely avoid the mess upon entering the messy room. The robot in this case has all four norms N1 to N4.
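Before turning to the Scenario 4 setup, the violation-cost figures quoted for Scenarios 2 and 3 can be checked directly with γ = 0.99 (the three-term sum reflects our reading that the puddle, with 3 units of dirtiness evaporating at 1 unit per step, keeps Room 1 unclean for three time steps):

```latex
% Scenario 2: suspend N1 (weight 1) while the puddle evaporates,
% versus suspending N2 (weight 200) once by vacuuming the puddle.
\begin{align*}
\text{wait it out:}\quad & 1 \cdot (\gamma^0 + \gamma^1 + \gamma^2) = 1 + 0.99 + 0.9801 = 2.9701\\
\text{vacuum it:}\quad   & 200 \cdot 1 = 200
\end{align*}
% Scenario 3: ignoring the glass for one step suspends N1 and, with room-switching
% probability 0.125, risks suspending N3 (weight 40000).
\begin{align*}
\text{ignore for one step:}\quad & 1 \cdot 1 + 40000 \cdot 0.125 = 5001\\
\text{vacuum immediately:}\quad  & 200 \cdot 1 = 200
\end{align*}
```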
For simplicity, the human in Scenario 4 does not make new messes, and the robot's maximum battery level is 5 rather than 10. The human also does not move between rooms while talking on the phone. The violation costs of ignoring the mess and of vacuuming it are analogous to those described in Scenario 3. Here, however, the robot may also warn the human about the mess, potentially interrupting their phone call, and then subsequently ignore it, permanently suspending N1. It takes one time step to reach the human. If the human remains on the phone during that time step (which occurs with probability 0.8), the robot thus incurs an expected violation cost of 5 · 0.99 · 0.8 + 1/(1 − 0.99) = 103.96. This remains lower than the cost of either vacuuming the mess or ignoring it without warning the human, and so the robot determines that the interruption is justifiable in order to prevent the human from being injured. The robot also determines that it ought to avoid vacuuming up the glass, since doing so is not necessary and would violate N2.

Related Work

There have been several instances of temporal logics being employed to represent moral and social norms. For example, (Ågotnes et al. 2007; Ågotnes and Wooldridge 2010) employ a logical representation of norms using Normative Temporal Logic, a generalization of Computation Tree Logic (CTL). This characterization allows the description of complex temporal norms. (Alechina et al. 2015) employ LTL with past-time modalities, which they use to construct guards (functions that restrict agent actions given an event history); they are concerned primarily with enforcement, and thus do not address norm conflicts. These approaches are designed for deterministic environment models, and are not well suited to stochastic domains.

The combination of moral and social norms with Markov Decision Processes is not new. Much of this work, e.g. (Sen and Airiau 2007), tends to emphasize norm emergence, and thus lacks explicit representations of norms. Other work (Fagundes, Billhardt, and Ossowski 2010) considers incorporating deontic logic norms using an agent architecture that reasons about the consequences (in the environment) of violating norms.

While we know of no other work using LTL directly to represent moral and social norms in MDPs, a number of papers have emerged using LTL to represent constraints on the behavior of agents within both deterministic and nondeterministic domains. LTL specifications were first employed in Markov Decision Processes, in a motion planning context, in (Ding et al. 2011a); the agent's aim is to maximize the probability of meeting the specifications, and multiple conflicting specifications are not considered. (Lacerda, Parker, and Hawes 2014) allow dynamic re-planning as new LTL tasks are added. Examinations of partially satisfiable LTL specifications include (Lacerda, Parker, and Hawes 2015). This differs from the proposed work in that (1) they limit their specification to a single co-safe LTL formula; and (2) their method of resolving conflict uses a notion of proximity to an accepting state that is better suited to motion planning than to the balancing of multiple conflicting norms. (Reyes Castro et al. 2013; Tumova et al. 2013; Lahijanian et al. 2015) utilize approaches similar to ours in that they employ weighted skipping: automata are allowed to skip a time step, but incur a cost for doing so.
Unlike the proposed approach, however, these approaches use finite LTL (LTL defined over finite paths instead of infinite ones), and their algorithms are tailored to deterministic environments rather than stochastic domains.

Discussion and Future Work

The proposed method for handling norm conflicts allows norms to be weighed against each other in terms of "violation cost". Other ways of encoding human preferences, including lexicographic orderings and CP-nets (Boutilier et al. 2004; Brafman, Domshlak, and Shimony 2006), may be considered in future work.

We employed the discount factor γ to ensure that all accepting paths within M′ have finite violation cost. The validity of discounting the wrongness of future norm violations is debatable. Alternatives to discounting include treating the problem as finite-horizon (which would entail similar short-sightedness), and using the infinite-horizon average cost per time step as the optimization criterion (which considers only the behavior as the number of time steps approaches infinity, and under which temporary norm violations do not matter at all). Some hybrid approach may be valuable; this is a topic for future work.

Our approach takes exponential time (and space) in the number of norms, and thus quickly becomes intractable for moderately-sized sets of norms. Much of this is due to the product MDP M′, whose state and action spaces are both exponential in the number of norms. Managing and reducing this complexity, perhaps using heuristics to determine which subsets of norms are likely to be relevant, would be a valuable topic for future research.

In developing the proposed agent architecture, we assumed that the agent has complete knowledge of the MDP's transition function; in practice, this rarely occurs. Future work could follow (Guo, Johansson, and Dimarogonas 2013; Fu and Topcu 2014; Wolff, Topcu, and Murray 2012; Jones et al. 2015) in considering unknown transition dynamics. There would also be merit in adapting the proposed algorithms to multi-agent domains (drawing on, e.g., (Guo and Dimarogonas 2014)) and partially observable domains (as in (Svoreňová et al. 2015; Chatterjee et al. 2015; Sharan and Burdick 2014)).

The described algorithm focuses on planning with a given set of norms; it requires pre-specification of the norm formulas and weights. It may be integrated with work allowing the learning of temporal logic statements either through natural language instruction, as in (Dzifcak et al. 2009), or through observation of other agents' behavior, as in (Kasenberg and Scheutz 2017). We may also consider including deontic operators in the norm representation and allowing a form of logical inference, so that agents may reason more fully about their norms, and providing some mechanism for agents to justify behavior that observers consider questionable. Each of these possible tasks is facilitated by the explicit representation of norms using logic.

Conclusion

In this paper, we described a hybrid approach to resolving norm conflicts in stochastic domains. Norms in this approach are viewed as temporal logic expressions that the agent intends to make true. Unlike logical approaches, which are limited to deterministic domains and typically attempt to limit the inferences that can be made in cases of norm conflicts, agents realizing our approach attempt to obey as many (important) norms as possible, with minimal violation cost, when not all norms can be obeyed at the same time.
As a result, these agents also respond robustly to unlucky transitions. We showed that our approach leads to reasonable norm-conformant behavior in all four scenarios in the vacuum cleaning domain.

Acknowledgements

This project was supported in part by ONR MURI grant N00014-16-1-2278 from the Office of Naval Research and by NSF IIS grant 1723963.

References

Ågotnes, T., and Wooldridge, M. 2010. Optimal social laws. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), 667-674. International Foundation for Autonomous Agents and Multiagent Systems.
Ågotnes, T.; Van Der Hoek, W.; Rodriguez-Aguilar, J.; Sierra, C.; and Wooldridge, M. 2007. On the logic of normative systems. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), 1181-1186.
Alechina, N.; Bulling, N.; Dastani, M.; and Logan, B. 2015. Practical run-time norm enforcement with bounded lookahead. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2015), 443-451. International Foundation for Autonomous Agents and Multiagent Systems.
Baier, C., and Katoen, J.-P. 2008. Principles of Model Checking. The MIT Press.
Bellman, R. 1957. A Markovian decision process. Indiana Univ. Math. J. 6:679-684.
Boutilier, C.; Brafman, R. I.; Domshlak, C.; Hoos, H. H.; and Poole, D. 2004. CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements. Journal of Artificial Intelligence Research 21:135-191.
Brafman, R. I.; Domshlak, C.; and Shimony, S. E. 2006. On graphical modeling of preference and importance. Journal of Artificial Intelligence Research 25:389-424.
Chatterjee, K.; Chmelik, M.; Gupta, R.; and Kanodia, A. 2015. Qualitative analysis of POMDPs with temporal logic specifications for robotics applications. In 2015 IEEE International Conference on Robotics and Automation (ICRA), 325-330.
Ding, X. C.; Smith, S. L.; Belta, C.; and Rus, D. 2011a. LTL control in uncertain environments with probabilistic satisfaction guarantees. IFAC Proceedings Volumes (IFAC-PapersOnline) 18(Part 1):3515-3520.
Ding, X. C.; Smith, S. L.; Belta, C.; and Rus, D. 2011b. MDP optimal control under temporal logic constraints. In Proceedings of the IEEE Conference on Decision and Control, 532-538.
Dzifcak, J.; Scheutz, M.; Baral, C.; and Schermerhorn, P. 2009. What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In Proceedings of the IEEE International Conference on Robotics and Automation, 4163-4168.
Esparza, J., and Křetínský, J. 2014. From LTL to deterministic automata: A Safraless compositional approach. Lecture Notes in Computer Science 8559:192-208.
Fagundes, M. S.; Billhardt, H.; and Ossowski, S. 2010. Normative reasoning with an adaptive self-interested agent model based on Markov decision processes. In IBERAMIA, volume 6433, 274-283. Springer.
Fu, J., and Topcu, U. 2014. Probably approximately correct MDP learning and control with temporal logic constraints. arXiv preprint.
Goble, L. 2009. Normative conflicts and the logic of "ought". Noûs 43(3):450-489.
Guo, M., and Dimarogonas, D. V. 2014. Multi-agent plan reconfiguration under local LTL specifications. The International Journal of Robotics Research 34(2):218-235.
Guo, M.; Johansson, K. H.; and Dimarogonas, D. V. 2013. Revising motion planning under Linear Temporal Logic specifications in partially known workspaces. In Proceedings of the IEEE International Conference on Robotics and Automation, 5025-5032.
Jones, A.; Aksaray, D.; Kong, Z.; Schwager, M.; and Belta, C. 2015. Robust satisfaction of temporal logic specifications via reinforcement learning. arXiv:1510.06460 [cs].
Kasenberg, D., and Scheutz, M. 2017. Interpretable apprenticeship learning with temporal logic specifications. In Proceedings of the 56th IEEE Conference on Decision and Control (CDC 2017).
Lacerda, B.; Parker, D.; and Hawes, N. 2014. Optimal and dynamic planning for Markov decision processes with co-safe LTL specifications. In IEEE International Conference on Intelligent Robots and Systems, 1511-1516.
Lacerda, B.; Parker, D.; and Hawes, N. 2015. Optimal policy generation for partially satisfiable co-safe LTL specifications. In IJCAI International Joint Conference on Artificial Intelligence, 1587-1593.
Lahijanian, M.; Almagor, S.; Fried, D.; Kavraki, L. E.; and Vardi, M. Y. 2015. This time the robot settles for a cost: A quantitative approach to temporal logic planning with partial satisfaction. In The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 3664-3671.
MacGlashan, J. 2016. Brown-UMBC Reinforcement Learning and Planning (BURLAP). http://burlap.cs.brown.edu/.
Pnueli, A. 1977. The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), 46-57.
Reyes Castro, L. I.; Chaudhari, P.; Tumova, J.; Karaman, S.; Frazzoli, E.; and Rus, D. 2013. Incremental sampling-based algorithm for minimum-violation motion planning. In Proceedings of the IEEE Conference on Decision and Control, 3217-3224.
Scheutz, M., and Malle, B. F. 2014. "Think and do the right thing": A plea for morally competent autonomous robots. In Proceedings of the IEEE International Symposium on Ethics in Engineering, Science, and Technology (ETHICS).
Sen, S., and Airiau, S. 2007. Emergence of norms through social learning. In Proceedings of IJCAI-07, 1507-1512.
Sharan, R., and Burdick, J. 2014. Finite state control of POMDPs with LTL specifications. In Proceedings of the American Control Conference, 501-508.
Svoreňová, M.; Chmelík, M.; Leahy, K.; Eniser, H. F.; Chatterjee, K.; Černá, I.; and Belta, C. 2015. Temporal logic motion planning using POMDPs with parity objectives. In Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control (HSCC '15), 233-238.
Tumova, J.; Hall, G. C.; Karaman, S.; Frazzoli, E.; and Rus, D. 2013. Least-violating control strategy synthesis with safety rules. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, 1-10.
Wolff, E. M.; Topcu, U.; and Murray, R. M. 2012. Robust control of uncertain Markov decision processes with temporal logic specifications. In 2012 IEEE 51st Conference on Decision and Control (CDC), 3372-3379.