# Online Decision-Making for Scalable Autonomous Systems

Kyle Hollins Wray¹,², Stefan J. Witwicki², and Shlomo Zilberstein¹
¹ College of Information and Computer Sciences, University of Massachusetts, Amherst, MA 01002
² Nissan Research Center - Silicon Valley, Sunnyvale, CA 94089
wray@cs.umass.edu, stefan.witwicki@nissan-usa.com, shlomo@cs.umass.edu

## Abstract

We present a general formal model called MODIA that tackles a central challenge for autonomous vehicles (AVs): the ability to interact with an unspecified, large number of world entities. In MODIA, a collection of possible decision-problems (DPs), known a priori, are instantiated online and executed as decision-components (DCs), unknown a priori. To combine the individual action recommendations of the DCs into a single action, we propose the lexicographic executor action function (LEAF) mechanism. We analyze the complexity of MODIA and establish LEAF's relation to regret minimization. Finally, we implement MODIA and LEAF using collections of partially observable Markov decision process (POMDP) DPs, and use them for complex AV intersection decision-making. We evaluate the approach in six scenarios within a realistic vehicle simulator and present its use on an AV prototype.

## 1 Introduction

There has been substantial progress with planning under uncertainty in partially observable, but fully modeled, worlds. However, few effective formalisms have been proposed for planning in open worlds with an unspecified, large number of objects. This remains a key challenge for autonomous systems, particularly for autonomous vehicles (AVs). AV research has advanced rapidly since the DARPA Grand Challenge [Thrun et al., 2006], which acted as a catalyst for subsequent work on low-level sensing [Sivaraman and Trivedi, 2013] and control [Dolgov et al., 2010], as well as high-level route planning [Wray et al., 2016a]. A critical missing component for enabling autonomy in long-term urban deployments is mid-level intersection decision-making (e.g., the second-to-second stop, yield, edge, or go decisions). As in many robotic domains, the primary challenges include the sheer complexity of real-world problems, the wide variety of possible scenarios that can arise, and the unbounded number of multi-step problems that will actually be encountered, perhaps simultaneously. These factors have limited the deployment of existing methods for mid-level decision-making [Ulbrich and Maurer, 2013; Brechtel et al., 2014; Bai et al., 2015; Jo et al., 2015]. We present a scalable, realistic solution, with strong mathematical foundations, via decomposition into problem-specific decision-components.

Our primary motivation is to provide a general solution for AV decision-making at any intersection, including n-way stops, yields, left turns at green traffic lights, right turns at red traffic lights, etc. In this domain, the AV approaches the intersection knowing only the static features from the map, such as road, crosswalk, and traffic controller information. Any number of vehicles and pedestrians can arrive and interact around the intersection, all potentially relevant to decision-making and unknown a priori. The AV must make mid-level decisions, using very limited hardware resources, including when to stop, yield, edge forward, or go, based on all possible interactions among all vehicles including the AV itself.
Vehicles can be occluded, requiring information-gathering actions based on a belief maintained under partial observability. Pedestrians can jaywalk, so forward motion may be taken only under strong confidence that they will not cross. Uncertainty regarding priority and right-of-way exists and must be handled under stochastic changes. Vehicles and pedestrians can block one another's motion, and AV-related blocking conflicts must be discovered and resolved via motion-based negotiation.

We provide a general solution for domains concerning multiple online decision-components with interacting actions (MODIA). For the particularly difficult AV intersection decision domain, MODIA considers all vehicles and pedestrians as separate individual decision-components. Each component is a partially observable Markov decision process (POMDP) that maintains its own belief for that particular component problem and proposes an action to take at each time step. MODIA then employs an executor function as an action aggregator to determine the actual action taken by the AV. This decomposition enables a tractable POMDP solution, benefiting from powerful belief-based reasoning while growing only linearly in the number of encountered problems.

The primary contributions include: a formal definition of MODIA (Section 3), a rigorous analysis of its complexity and regret-minimization properties (Section 4), an AV intersection decision-making MODIA solution (Section 5), and an evaluation of the approach in simulation as well as integration with a real AV (Section 6). We begin with a review of POMDPs (Section 2), and conclude with a survey of related work (Section 7) and final reflections (Section 8).

## 2 Background Material

A partially observable Markov decision process (POMDP) is represented by the tuple $\langle S, A, \Omega, T, O, R \rangle$ [Kaelbling et al., 1998]. $S$ is a finite set of states. $A$ is a finite set of actions. $\Omega$ is a finite set of observations. $T : S \times A \times S \to [0,1]$ is a state transition function such that $T(s, a, s') = \Pr(s' \mid s, a)$. $O : A \times S \times \Omega \to [0,1]$ is an observation function such that $O(a, s', \omega) = \Pr(\omega \mid a, s')$. $R : S \times A \to \mathbb{R}$ is a reward function.

The agent does not observe the true state of the system; instead, it makes observations while maintaining a belief over the true state, denoted $b \in \triangle^{|S|}$. (Note: $\triangle^n$ is the standard $n$-simplex.) Given action $a \in A$ and subsequent observation $\omega \in \Omega$, belief $b$ is updated to $b'$ with:

$$b'(s') = \eta \, O(a, s', \omega) \sum_{s \in S} T(s, a, s') \, b(s)$$

for all $s' \in S$, with normalizing constant $\eta$. A policy maps beliefs to actions, $\pi : \triangle^{|S|} \to A$. The value function $V^\pi : \triangle^{|S|} \to \mathbb{R}$ for a belief is the expected reward given a fixed policy $\pi$, a discount factor $\gamma \in [0,1]$, and a horizon $h$. It is also useful to define the Q-value of belief $b$ given action $a$ as $Q : \triangle^{|S|} \times A \to \mathbb{R}$ with $V^\pi(b) = Q(b, \pi(b))$. Since $V^\pi$ is piecewise linear and convex, we describe it using sets of $\alpha$-vectors $\Gamma = \{\alpha_1, \dots, \alpha_r\}$, with each $\alpha_i = [\alpha_i(s_1), \dots, \alpha_i(s_n)]^T$ and $\alpha_i(s)$ denoting the value of state $s \in S$. The objective is to find an optimal policy $\pi^*$ that maximizes $V^\pi$, denoted $V^*$. Given an initial belief $b^0$, $V^*$ can be iteratively computed for a time step $t$, expanding beliefs at each update resulting in belief $b$, by maximizing over $a \in A$:

$$\sum_{s \in S} b(s) R(s, a) + \sum_{\omega \in \Omega} \max_{\alpha \in \Gamma^{t-1}} \sum_{s \in S} b(s) \, V^t_{s a \omega \alpha}$$

with $V^t_{s a \omega \alpha} = \gamma \sum_{s' \in S} O(a, s', \omega) \, T(s, a, s') \, \alpha(s')$; for $s \in S$, $\alpha^0(s) = \underline{R}/(1 - \gamma)$ in $\Gamma^0 = \{\alpha^0\}$, with $\underline{R} = \min_{s,a} R(s, a)$.
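To make the belief update and $\alpha$-vector policy concrete, the following is a minimal Python (NumPy) sketch. The array layouts, function names, and the representation of $\Gamma$ as (vector, action) pairs are our own assumptions for exposition, not code from the paper.

```python
import numpy as np

def belief_update(b, a, omega, T, O):
    """One exact POMDP belief update:
        b'(s') = eta * O(a, s', omega) * sum_s T(s, a, s') * b(s).

    b     : belief over states, shape (|S|,)
    a     : action index
    omega : observation index
    T     : T[s, a, s'] = Pr(s' | s, a), shape (|S|, |A|, |S|)
    O     : O[a, s', omega] = Pr(omega | a, s'), shape (|A|, |S|, |Omega|)
    """
    predicted = T[:, a, :].T @ b                # sum_s T(s, a, s') b(s), shape (|S|,)
    unnormalized = O[a, :, omega] * predicted   # weight by observation likelihood
    eta = unnormalized.sum()                    # normalizing constant eta
    if eta == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / eta

def greedy_action(b, Gamma):
    """Extract a policy from alpha-vectors: Gamma is a list of
    (alpha, action) pairs; pi(b) is the action of the alpha-vector
    maximizing the inner product b . alpha."""
    best_alpha, best_action = max(Gamma, key=lambda pair: b @ pair[0])
    return best_action
```

Pairing each $\alpha$-vector with an action reflects how point-based solvers typically work: each backup that produces an $\alpha$-vector records the maximizing action, so evaluating $\max_{\alpha \in \Gamma} b \cdot \alpha$ yields both $V(b)$ and $\pi(b)$.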
## 3 Problem Formulation

We begin with a general problem description that considers a single autonomous agent that encounters any number of decision problems online during execution. This paper focuses on collections of POMDPs, primarily because of their generality, for self-consistency of presentation, and due to space limitations; the model can be generalized to other decision-making frameworks in the natural way. Finally, Figure 1 depicts a complete MODIA example for AVs and is referenced throughout this section for each concept.

### 3.1 Decision-Making with MODIA

The multiple online decision-components with interacting actions (MODIA) model describes a realistic single-agent online decision-making scenario defined by the tuple $\langle \mathbf{P}, A \rangle$. $\mathbf{P} = \{P_1, \dots, P_k\}$ are decision-problems (DPs) that could be encountered during execution. For this paper, each $P_i \in \mathbf{P}$ is a POMDP with $P_i = \langle S_i, A_i, \Omega_i, T_i, O_i, R_i \rangle$ (Section 2), starting from an initial belief $b^0_i \in \triangle^{|S_i|}$. We consider discrete time steps $t \in \mathbb{N}$ over the agent's entire lifetime. $A = \{a_1, \dots, a_z\}$ are the $z$ primary actions: the true actions taken by the agent that affect the state of the external system environment. Importantly, only $\mathbf{P}$ and $A$ are known offline a priori.

**AV Example.** Figure 1 has two pre-solved intersection decision-problems: a single vehicle ($P_1$) or pedestrian ($P_2$). Each is a POMDP with actions (recommendations) "stop" or "go". The primary actions $A$ for the AV are also "stop" or "go".

Online, the DPs are instantiated based on what the agent experiences in the external system environment. Because multiple decision-making models (e.g., POMDPs) are actually executed in real applications, there is no complete model of which, when, or how many DPs are instantiated, or even of how long they remain relevant. Formally, the online instantiations in MODIA are defined by the tuple $\langle \mathbf{C}, \phi, \tau \rangle$. Over the agent's lifetime, there are $n$ DP instantiations called decision-components (DCs), denoted $\mathbf{C} = \{C_1, \dots, C_n\}$, with both $\mathbf{C}$ and $n$ unknown a priori. Let $\phi : \mathbf{C} \to \mathbf{P}$ denote the DP for each instantiation. Let $\tau : \mathbf{C} \to \mathbb{N} \times \mathbb{N}$ give the two time steps at which each DC is instantiated and terminated. For notational convenience, for all $C_i \in \mathbf{C}$, let $\tau_s(C_i)$ and $\tau_e(C_i)$ be the start and end times; we have $\tau_s(C_i) < \tau_e(C_i)$. Without loss of generality, we also assume for $i < j$ that $\tau_s(C_i) \le \tau_s(C_j)$; that is, DCs are indexed by their order of instantiation.
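To ground these definitions, the sketch below mirrors the $\langle \mathbf{C}, \phi, \tau \rangle$ bookkeeping in Python. All class and field names are hypothetical choices of ours for illustration, not the paper's implementation; the executor function that merges the DC recommendations (LEAF) is treated later in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class DecisionProblem:
    """A decision-problem (DP) P_i, known a priori. A full POMDP tuple
    <S_i, A_i, Omega_i, T_i, O_i, R_i> would sit behind `policy`."""
    name: str                      # e.g., "vehicle" (P1) or "pedestrian" (P2)
    policy: Callable               # maps a belief to a recommended action
    initial_belief: List[float]    # b_i^0

@dataclass
class DecisionComponent:
    """A decision-component (DC) C_i: an online instantiation of a DP."""
    dp: DecisionProblem            # phi(C_i): which DP this instantiates
    belief: List[float]            # this DC's own belief
    t_start: int                   # tau_s(C_i)
    t_end: Optional[int] = None    # tau_e(C_i), unknown until termination

class Modia:
    """Bookkeeping for the online tuple <C, phi, tau>: the set C of DCs
    grows during execution; n = |C| is unknown a priori."""
    def __init__(self, dps):
        self.dps = {p.name: p for p in dps}  # P = {P_1, ..., P_k}, fixed offline
        self.dcs = []                        # C, indexed by instantiation order

    def instantiate(self, dp_name, t):
        """Create a DC when a new entity (vehicle, pedestrian) is detected."""
        dp = self.dps[dp_name]
        self.dcs.append(DecisionComponent(dp, list(dp.initial_belief), t_start=t))

    def recommendations(self):
        """One action recommendation per active DC. An executor function
        (e.g., LEAF) aggregates these into a single primary action in A."""
        return [dc.dp.policy(dc.belief) for dc in self.dcs if dc.t_end is None]
```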