# Online Decision-Making for Scalable Autonomous Systems

Kyle Hollins Wray¹,², Stefan J. Witwicki², and Shlomo Zilberstein¹
¹ College of Information and Computer Sciences, University of Massachusetts, Amherst, MA 01002
² Nissan Research Center - Silicon Valley, Sunnyvale, CA 94089
wray@cs.umass.edu, stefan.witwicki@nissan-usa.com, shlomo@cs.umass.edu

## Abstract

We present a general formal model called MODIA that tackles a central challenge for autonomous vehicles (AVs): the ability to interact with an unspecified, large number of world entities. In MODIA, a collection of possible decision-problems (DPs), known a priori, are instantiated online and executed as decision-components (DCs), unknown a priori. To combine the individual action recommendations of the DCs into a single action, we propose the lexicographic executor action function (LEAF) mechanism. We analyze the complexity of MODIA and establish LEAF's relation to regret minimization. Finally, we implement MODIA and LEAF using collections of partially observable Markov decision process (POMDP) DPs, and use them for complex AV intersection decision-making. We evaluate the approach in six scenarios within a realistic vehicle simulator and present its use on an AV prototype.

## 1 Introduction

There has been substantial progress with planning under uncertainty in partially observable, but fully modeled, worlds. However, few effective formalisms have been proposed for planning in open worlds with an unspecified, large number of objects. This remains a key challenge for autonomous systems, particularly for autonomous vehicles (AVs). AV research has advanced rapidly since the DARPA Grand Challenge [Thrun et al., 2006], which acted as a catalyst for subsequent work on low-level sensing [Sivaraman and Trivedi, 2013] and control [Dolgov et al., 2010], as well as high-level route planning [Wray et al., 2016a]. A critical missing component for enabling autonomy in long-term urban deployments is mid-level intersection decision-making (e.g., the second-to-second stop, yield, edge, or go decisions). As in many robotic domains, the primary challenges include the sheer complexity of real-world problems, the wide variety of possible scenarios that can arise, and the unbounded number of multi-step problems that will actually be encountered, perhaps simultaneously. These factors have limited the deployment of existing methods for mid-level decision-making [Ulbrich and Maurer, 2013; Brechtel et al., 2014; Bai et al., 2015; Jo et al., 2015]. We present a scalable, realistic solution, with strong mathematical foundations, via decomposition into problem-specific decision-components.

Our primary motivation is to provide a general solution for AV decision-making at any intersection, including n-way stops, yields, left turns at green traffic lights, right turns at red traffic lights, etc. In this domain, the AV approaches the intersection knowing only the static features from the map, such as road, crosswalk, and traffic controller information. Any number of vehicles and pedestrians can arrive and interact around the intersection, all potentially relevant to decision-making and unknown a priori. The AV must make mid-level decisions, using very limited hardware resources, including when to stop, yield, edge forward, or go, based on all possible interactions among all vehicles including the AV itself.
Vehicles can be occluded, requiring information-gathering actions based on a belief maintained under partial observability. Pedestrians can jaywalk, so forward motion may be taken only under strong confidence that they will not cross. Uncertainty regarding priority and right-of-way exists and must be handled under stochastic changes. Vehicles and pedestrians can block one another's motion, and AV-related blocking conflicts must be discovered and resolved via motion-based negotiation.

We provide a general solution for domains concerning multiple online decision-components with interacting actions (MODIA). For the particularly difficult AV intersection decision domain, MODIA considers all vehicles and pedestrians as separate individual decision-components. Each component is a partially observable Markov decision process (POMDP) that maintains its own belief for that particular component problem and proposes an action to take at each time step. MODIA then employs an executor function as an action aggregator to determine the actual action taken by the AV. This decomposition enables a tractable POMDP solution, benefiting from powerful belief-based reasoning while growing only linearly in the number of encountered problems.

The primary contributions include: a formal definition of MODIA (Section 3), a rigorous analysis of its complexity and regret-minimization properties (Section 4), an AV intersection decision-making MODIA solution (Section 5), and an evaluation of the approach in simulation as well as integration with a real AV (Section 6). We begin with a review of POMDPs (Section 2), and conclude with a survey of related work (Section 7) and final reflections (Section 8).

## 2 Background Material

A partially observable Markov decision process (POMDP) is represented by the tuple $\langle S, A, \Omega, T, O, R \rangle$ [Kaelbling et al., 1998]. $S$ is a finite set of states. $A$ is a finite set of actions. $\Omega$ is a finite set of observations. $T : S \times A \times S \to [0,1]$ is a state transition function such that $T(s, a, s') = \Pr(s' \mid s, a)$. $O : A \times S \times \Omega \to [0,1]$ is an observation function such that $O(a, s', \omega) = \Pr(\omega \mid a, s')$. $R : S \times A \to \mathbb{R}$ is a reward function.

The agent does not observe the true state of the system; instead, it makes observations while maintaining a belief over the true state, denoted $b \in \triangle^{|S|}$. (Note: $\triangle^n$ is the standard $n$-simplex.) Given action $a \in A$ and subsequent observation $\omega \in \Omega$, belief $b$ is updated to $b'$ with:

$$b'(s') = \eta \, O(a, s', \omega) \sum_{s \in S} T(s, a, s') \, b(s)$$

for all $s' \in S$, with normalizing constant $\eta$. A policy maps beliefs to actions, $\pi : \triangle^{|S|} \to A$. The value function $V^\pi : \triangle^{|S|} \to \mathbb{R}$ for a belief is the expected reward given a fixed policy $\pi$, a discount factor $\gamma \in [0,1]$, and a horizon $h$. It is also useful to define the Q-value of belief $b$ given action $a$ as $Q : \triangle^{|S|} \times A \to \mathbb{R}$ with $V^\pi(b) = Q(b, \pi(b))$. Since $V^\pi$ is piecewise linear and convex, we describe it using sets of $\alpha$-vectors $\Gamma = \{\alpha_1, \dots, \alpha_r\}$, with each $\alpha_i = [\alpha_i(s_1), \dots, \alpha_i(s_n)]^T$ and $\alpha_i(s)$ denoting the value of state $s \in S$. The objective is to find an optimal policy $\pi^*$ that maximizes $V^\pi$, denoted $V^*$. Given an initial belief $b^0$, $V^*$ can be iteratively computed for a time step $t$, expanding beliefs at each update resulting in belief $b$, by maximizing over $a \in A$:

$$\sum_{s \in S} b(s) R(s, a) + \sum_{\omega \in \Omega} \max_{\alpha \in \Gamma^{t-1}} \sum_{s \in S} b(s) \, V^t_{s a \omega \alpha}$$

with $V^t_{s a \omega \alpha} = \gamma \sum_{s' \in S} O(a, s', \omega) \, T(s, a, s') \, \alpha(s')$; for $s \in S$, $\alpha^0(s) = \underline{R}/(1 - \gamma)$ in $\Gamma^0 = \{\alpha^0\}$, with $\underline{R} = \min_{s,a} R(s, a)$.
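To make the belief update and $\alpha$-vector policy concrete, the following is a minimal Python (NumPy) sketch. The array layouts, function names, and the representation of $\Gamma$ as (vector, action) pairs are our own assumptions for exposition, not code from the paper.

```python
import numpy as np

def belief_update(b, a, omega, T, O):
    """One exact POMDP belief update:
        b'(s') = eta * O(a, s', omega) * sum_s T(s, a, s') * b(s).

    b     : belief over states, shape (|S|,)
    a     : action index
    omega : observation index
    T     : T[s, a, s'] = Pr(s' | s, a), shape (|S|, |A|, |S|)
    O     : O[a, s', omega] = Pr(omega | a, s'), shape (|A|, |S|, |Omega|)
    """
    predicted = T[:, a, :].T @ b                # sum_s T(s, a, s') b(s), shape (|S|,)
    unnormalized = O[a, :, omega] * predicted   # weight by observation likelihood
    eta = unnormalized.sum()                    # normalizing constant eta
    if eta == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / eta

def greedy_action(b, Gamma):
    """Extract a policy from alpha-vectors: Gamma is a list of
    (alpha, action) pairs; pi(b) is the action of the alpha-vector
    maximizing the inner product b . alpha."""
    best_alpha, best_action = max(Gamma, key=lambda pair: b @ pair[0])
    return best_action
```

Pairing each $\alpha$-vector with an action reflects how point-based solvers typically work: each backup that produces an $\alpha$-vector records the maximizing action, so evaluating $\max_{\alpha \in \Gamma} b \cdot \alpha$ yields both $V(b)$ and $\pi(b)$.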
## 3 Problem Formulation

We begin with a general problem description that considers a single autonomous agent that encounters any number of decision problems online during execution. This paper focuses on collections of POMDPs, primarily because of their generality, for self-consistency of presentation, and due to space limitations; the model can be generalized to other decision-making frameworks in the natural way. Finally, Figure 1 depicts a complete MODIA example for AVs and is referenced throughout this section for each concept.

### 3.1 Decision-Making with MODIA

The multiple online decision-components with interacting actions (MODIA) model describes a realistic single-agent online decision-making scenario defined by the tuple $\langle \mathbf{P}, A \rangle$. $\mathbf{P} = \{P_1, \dots, P_k\}$ are decision-problems (DPs) that could be encountered during execution. For this paper, each $P_i \in \mathbf{P}$ is a POMDP with $P_i = \langle S_i, A_i, \Omega_i, T_i, O_i, R_i \rangle$ (Section 2), starting from an initial belief $b^0_i \in \triangle^{|S_i|}$. We consider discrete time steps $t \in \mathbb{N}$ over the agent's entire lifetime. $A = \{a_1, \dots, a_z\}$ are the $z$ primary actions: the true actions taken by the agent that affect the state of the external system environment. Importantly, only $\mathbf{P}$ and $A$ are known offline a priori.

**AV Example.** Figure 1 has two pre-solved intersection decision-problems: a single vehicle ($P_1$) or pedestrian ($P_2$). Each is a POMDP with actions (recommendations) "stop" or "go". The primary actions $A$ for the AV are also "stop" or "go".

Online, the DPs are instantiated based on what the agent experiences in the external system environment. Because multiple decision-making models (e.g., POMDPs) are actually executed in real applications, there is no complete model of which, when, or how many DPs are instantiated, or even of how long they remain relevant. Formally, the online instantiations in MODIA are defined by the tuple $\langle \mathbf{C}, \phi, \tau \rangle$. Over the agent's lifetime, there are $n$ DP instantiations called decision-components (DCs), denoted $\mathbf{C} = \{C_1, \dots, C_n\}$, with both $\mathbf{C}$ and $n$ unknown a priori. Let $\phi : \mathbf{C} \to \mathbf{P}$ denote the DP for each instantiation. Let $\tau : \mathbf{C} \to \mathbb{N} \times \mathbb{N}$ give the two time steps at which each DC is instantiated and terminated. For notational convenience, for all $C_i \in \mathbf{C}$, let $\tau_s(C_i)$ and $\tau_e(C_i)$ be the start and end times; we have $\tau_s(C_i) < \tau_e(C_i)$. Without loss of generality, we also assume for $i < j$ that $\tau_s(C_i) \le \tau_s(C_j)$; that is, DCs are indexed by their order of instantiation.
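To ground these definitions, the sketch below mirrors the $\langle \mathbf{C}, \phi, \tau \rangle$ bookkeeping in Python. All class and field names are hypothetical choices of ours for illustration, not the paper's implementation; the executor function that merges the DC recommendations (LEAF) is treated later in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class DecisionProblem:
    """A decision-problem (DP) P_i, known a priori. A full POMDP tuple
    <S_i, A_i, Omega_i, T_i, O_i, R_i> would sit behind `policy`."""
    name: str                      # e.g., "vehicle" (P1) or "pedestrian" (P2)
    policy: Callable               # maps a belief to a recommended action
    initial_belief: List[float]    # b_i^0

@dataclass
class DecisionComponent:
    """A decision-component (DC) C_i: an online instantiation of a DP."""
    dp: DecisionProblem            # phi(C_i): which DP this instantiates
    belief: List[float]            # this DC's own belief
    t_start: int                   # tau_s(C_i)
    t_end: Optional[int] = None    # tau_e(C_i), unknown until termination

class Modia:
    """Bookkeeping for the online tuple <C, phi, tau>: the set C of DCs
    grows during execution; n = |C| is unknown a priori."""
    def __init__(self, dps):
        self.dps = {p.name: p for p in dps}  # P = {P_1, ..., P_k}, fixed offline
        self.dcs = []                        # C, indexed by instantiation order

    def instantiate(self, dp_name, t):
        """Create a DC when a new entity (vehicle, pedestrian) is detected."""
        dp = self.dps[dp_name]
        self.dcs.append(DecisionComponent(dp, list(dp.initial_belief), t_start=t))

    def recommendations(self):
        """One action recommendation per active DC. An executor function
        (e.g., LEAF) aggregates these into a single primary action in A."""
        return [dc.dp.policy(dc.belief) for dc in self.dcs if dc.t_end is None]
```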