# Robust Finite-State Controllers for Uncertain POMDPs

Murat Cubuktepe¹, Nils Jansen², Sebastian Junges³, Ahmadreza Marandi⁴, Marnix Suilen², Ufuk Topcu¹

¹Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, USA
²Department of Software Science, Radboud University, The Netherlands
³Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA
⁴Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, The Netherlands
{mcubuktepe, utopcu}@utexas.edu, n.jansen@science.ru.nl, sjunges@berkeley.edu, a.marandi@tue.nl, marnix.suilen@ru.nl

## Abstract

Uncertain partially observable Markov decision processes (uPOMDPs) allow the probabilistic transition and observation functions of standard POMDPs to belong to a so-called uncertainty set. Such uncertainty, referred to as epistemic uncertainty, captures uncountable sets of probability distributions caused by, for instance, a lack of available data. We develop an algorithm to compute finite-memory policies for uPOMDPs that robustly satisfy specifications against any admissible distribution. In general, computing such policies is theoretically and practically intractable. We provide an efficient solution to this problem in four steps. (1) We state the underlying problem as a nonconvex optimization problem with infinitely many constraints. (2) A dedicated dualization scheme yields a dual problem that is still nonconvex but has finitely many constraints. (3) We linearize this dual problem and (4) solve the resulting finite linear program to obtain locally optimal solutions to the original problem. The resulting problem formulation is exponentially smaller than those resulting from existing methods. We demonstrate the applicability of our algorithm using large instances of an aircraft collision-avoidance scenario and a novel spacecraft motion planning case study.

## Introduction

In sequential decision making, one has to account for various sources of uncertainty. Examples of such sources are lossy long-range communication channels with a spacecraft orbiting the earth, or the expected responses of a human operator in a decision support system (Kochenderfer 2015). Moreover, sensor limitations may lead to imperfect information about the current state of the system. The standard models to reason about decision-making under uncertainty and imperfect information are partially observable Markov decision processes (POMDPs) (Kaelbling, Littman, and Cassandra 1998).

The likelihood of uncertain events, like a message loss in communication channels or a specific response by a human operator, is often only estimated from data. Yet, such likelihoods enter a POMDP model as concrete probabilities: POMDPs, in fact, require precise transition and observation probabilities. Uncertain POMDPs (uPOMDPs) remove this assumption by incorporating uncertainty sets of probabilities (Burns and Brock 2007). These sets are part of the transition and observation model in uPOMDPs and take the form of, for example, probability intervals or likelihood functions (Givan, Leach, and Dean 2000; Nilim and Ghaoui 2005). Such uncertainty is commonly called epistemic, as it is induced by a lack of precise knowledge, as opposed to aleatoric uncertainty, which is due to stochasticity.
**Motivating example.** Consider a robust spacecraft motion planning system that serves as decision support for a human operator. This system delivers advice on switching to a different orbit or avoiding close encounters with other objects in space. Existing POMDP models assume that a fixed probability captures the responsiveness of the human (Kochenderfer 2015). However, a single probability may not faithfully capture the responsiveness of multiple operators. Instead of a single probability, we use bounds on the actual value of this probability and obtain a uPOMDP model. A policy for the decision support system necessarily considers all potential probability distributions and may need to predict the current location of the spacecraft based on its past trajectory. Therefore, it requires robustness against the uncertainty, and memory to store past (observation) data.

**Problem formulation.** The general problem we address is: for a uPOMDP, compute a policy that is robust against all possible probabilities from the uncertainty set. Even without uncertainties, the problem is undecidable (Madani, Hanks, and Condon 1999). Most research has focused either on belief-based approaches or on computing finite-state controllers (FSCs) (Meuleau et al. 1999; Amato, Bernstein, and Zilberstein 2010). Our approach extends the latter and computes robust FSCs representing robust policies.

## Contribution and Approach

We develop a novel solution for efficiently computing robust policies for uPOMDPs using robust convex optimization techniques. The method solves uPOMDPs with hundreds of thousands of states, which is out of reach for the state of the art.

**Finite memory.** The first step of our method is to obtain a tractable representation of memory. As a concise representation of uPOMDP policies with finite memory, we utilize FSCs. First, we build the product of the FSC memory and the uPOMDP. Computing a policy with finite memory for the original uPOMDP is equivalent to computing a memoryless policy on the product (Junges et al. 2018). While the problem complexity remains NP-hard (Vlassis, Littman, and Barber 2012), the problem size grows polynomially in the size of the FSC. Existing methods cannot cope with that blow-up, as our experiments demonstrate.

**Semi-infinite problem.** We state the uPOMDP problem as a semi-infinite nonconvex optimization problem with infinitely many constraints. A tailored dualization scheme translates this problem into a finite optimization problem with a linear increase in the number of variables. Combining this dualization with a linear-time transformation of the uPOMDP into a so-called simple uPOMDP (Junges et al. 2018) ensures exact solutions to the original problem. The exact solutions and the moderate increase in the problem size contrast with an over-approximative solution computed using an exponentially larger encoding proposed in (Suilen et al. 2020).

**Finite nonconvex problem.** The resulting finite optimization problem is still nonconvex and infeasible to solve exactly (Alizadeh and Goldfarb 2003; Lobo et al. 1998). We adapt a sequential convex programming (SCP) scheme (Mao et al. 2018) that iteratively linearizes the problem around previous solutions. Three extensions help to alleviate the errors introduced by the linearization, rendering the method sound for the uPOMDP problem.

**Numerical experiments.** We demonstrate the applicability of the approach using two case studies.
Notably, we introduce a novel robust spacecraft motion planning scenario. We show that our method scales to significantly larger models than (Suilen et al. 2020). This scalability advantage allows more precise models and adding memory to the policies.

## Related Work

Uncertain MDPs with full observability have been extensively studied, see for instance (Wolff, Topcu, and Murray 2012; Puggelli et al. 2013; Hahn et al. 2017). For uPOMDPs, (Suilen et al. 2020) is also based on convex optimization. However, their resulting optimization problems are exponentially larger than ours, and they only consider memoryless policies. (Burns and Brock 2007) utilizes sampling-based methods and (Osogami 2015) employs a robust value iteration on the belief space of the uPOMDP. (Ahmadi et al. 2018) uses sum-of-squares optimization to verify uPOMDPs against temporal logic specifications. The work in (Itoh and Nakamura 2007) assumes distributions over the probability values of the uncertainty set. Finally, (Chamie and Mostafa 2018) considers a convexified belief space and computes a policy that is robust over this space.

## Background and Formal Problem

**Uncertain POMDPs.** The set of all probability distributions over a finite or countably infinite set $X$ is $Distr(X)$.

**Definition 1 (uPOMDP).** An uncertain partially observable Markov decision process (uPOMDP) is a tuple $\mathcal{M} = (S, s_I, A, \mathbb{I}, \mathcal{P}, Z, O, R)$ with a finite set $S$ of states, an initial state $s_I \in S$, a finite set $A$ of actions, a set $\mathbb{I}$ of probability intervals, an uncertain transition function $\mathcal{P}\colon S \times A \times S \to \mathbb{I}$, a finite set $Z$ of observations, an uncertain observation function $O\colon S \times Z \to \mathbb{I}$, and a reward function $R\colon S \times A \to \mathbb{R}_{\geq 0}$.

Nominal probabilities are point intervals where the upper and lower bounds coincide. If all probabilities are nominal, the model is a (nominal) POMDP. We see uPOMDPs as sets of nominal POMDPs that vary only in their transition function. For a transition function $P\colon S \times A \times S \to \mathbb{R}$, we write $P \in \mathcal{P}$ if for all $s, s' \in S$ and $\alpha \in A$ we have $P(s, \alpha, s') \in \mathcal{P}(s, \alpha, s')$ and $P(s, \alpha, \cdot)$ is a probability distribution over $S$. Finally, a fully observable uPOMDP, where each state has a unique observation, is an uncertain MDP.

**Policies.** An observation-based policy $\sigma\colon (Z \times A)^* \times Z \to Distr(A)$ for a uPOMDP maps a trace, i.e., a sequence of observations and actions, to a distribution over actions. A finite-state controller (FSC) consists of a finite set of memory states and two functions. The action mapping $\gamma(n, z)$ takes an FSC memory state $n$ and an observation $z$, and returns a distribution over uPOMDP actions. To change a memory state, the memory update $\eta(n, z, \alpha)$ returns a distribution over memory states and depends on the action $\alpha$ selected by $\gamma$. An FSC induces an observation-based policy by following a joint execution of these two functions upon a trace of the uPOMDP. An FSC is memoryless if there is a single memory state. Such FSCs encode policies $\sigma\colon Z \to Distr(A)$.
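As an illustration of the FSC definition above (our own sketch, not the authors' code; the memory, observation, and action names are made up), the following Python snippet represents the action mapping γ and the memory update η as dictionaries and samples one step of the induced observation-based policy.

```python
# Minimal sketch (ours) of a finite-state controller: an action mapping
# gamma(n, z) and a memory update eta(n, z, alpha), both returning distributions.
from dataclasses import dataclass
from typing import Dict, List, Tuple
import random

Distr = Dict[str, float]  # maps an outcome to its probability


def _sample(d: Distr, rng: random.Random) -> str:
    outcomes, probs = zip(*d.items())
    return rng.choices(outcomes, weights=probs, k=1)[0]


@dataclass
class FSC:
    memory_states: List[str]
    gamma: Dict[Tuple[str, str], Distr]     # (memory n, observation z) -> Distr(actions)
    eta: Dict[Tuple[str, str, str], Distr]  # (n, z, action alpha)      -> Distr(memory states)

    def act(self, n: str, z: str, rng: random.Random) -> Tuple[str, str]:
        """Sample an action and the next memory state upon observing z in memory state n."""
        alpha = _sample(self.gamma[(n, z)], rng)
        n_next = _sample(self.eta[(n, z, alpha)], rng)
        return alpha, n_next


# A memoryless FSC has a single memory state, so gamma reduces to a map
# from observations to distributions over actions (sigma: Z -> Distr(A)).
fsc = FSC(
    memory_states=["n0"],
    gamma={("n0", "z0"): {"a0": 0.7, "a1": 0.3}, ("n0", "z1"): {"a0": 1.0, "a1": 0.0}},
    eta={("n0", z, a): {"n0": 1.0} for z in ["z0", "z1"] for a in ["a0", "a1"]},
)
print(fsc.act("n0", "z0", random.Random(0)))
```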
**Specifications.** We constrain the undiscounted expected reward (the value) of a policy for a uPOMDP using specifications: for a POMDP $\mathcal{M}$ and a set of goal states $G$, the specification $\varphi = \mathbb{E}_{\geq \kappa}(\lozenge G)$ states that the expected reward accumulated before reaching $G$ is at least $\kappa$. For brevity, we require that the POMDP has no dead-ends, i.e., that under every policy we eventually reach $G$. Reachability specifications to a subset of $G$ and discounted rewards are special cases (Puterman 1994).

**Satisfying specifications.** A policy $\sigma$ satisfies a specification $\varphi = \mathbb{E}_{\geq \kappa}(\lozenge G)$ if the expected reward to reach $G$ induced by $\sigma$ is at least $\kappa$. The POMDP $\mathcal{M}[P]$ denotes the instantiated uPOMDP $\mathcal{M}$ with a fixed transition function $P \in \mathcal{P}$. A policy robustly satisfies $\varphi$ for the uPOMDP $\mathcal{M}$ if it does so for all $\mathcal{M}[P]$ with $P \in \mathcal{P}$. Thus, a (robust) policy for uPOMDPs accounts for all possible instantiations $P \in \mathcal{P}$.

**Formal problem statement.** Given a uPOMDP $\mathcal{M}$ and a specification $\varphi$, compute an FSC that yields an observation-based policy $\sigma$ which robustly satisfies $\varphi$ for $\mathcal{M}$.

**Assumptions.** We compute the product of the memory update of an FSC and the uPOMDP and reduce the problem to computing policies without memory (Junges et al. 2018). Therefore, we focus in the following on memoryless policies and show the impact of finite memory in our examples. For the correctness of our method, instantiations are not allowed to remove transitions. That is, for any interval we either require the lower bound to be strictly greater than 0, such that the transition exists in all instantiations, or both upper and lower bound to be equal to 0, such that the transition exists in no instantiation. This assumption is standard and is, for instance, used in (Wiesemann, Kuhn, and Rustem 2013).

## Optimization Problem for uPOMDPs

In this section, we reformulate our problem statement as a semi-infinite nonconvex optimization problem with finitely many variables but infinitely many constraints. We utilize simple uPOMDPs, defined below, as the dualization discussed in the next section is exact (only) for simple uPOMDPs. We adopt a small extension of (PO)MDPs in which only a subset of the actions is available in a state, i.e., the transition function should be interpreted as a partial function. We denote the set of actions available at state $s$ by $A(s)$. We ensure that states sharing an observation share the set of available actions. Moreover, we translate the observation function to be deterministic and without uncertainty, i.e., of the form $O\colon S \to Z$, by expanding the state space (Chatterjee et al. 2016).

**Definition 2 (Binary/Simple uPOMDP).** A uPOMDP is binary if $|A(s)| \leq 2$ for all $s \in S$. A binary uPOMDP is simple if for all $s \in S$ the following holds: $|A(s)| = 2$ implies $\forall \alpha \in A(s)\ \exists s' \in S\colon \mathcal{P}(s, \alpha, s') = 1$.

Simple uPOMDPs separate the states with action choices from the states with uncertain outcomes. All uPOMDPs can be transformed into simple uPOMDPs; we refer to (Junges et al. 2018) for such a transformation. The transformation preserves the optimal expected reward of a uPOMDP. Let $S_a$ denote the states with action choices, and $S_u$ the states with uncertain outcomes.

We now introduce the optimization problem with the nonnegative reward variables $\{r_s \geq 0 \mid s \in S\}$, denoting the expected reward before reaching the goal set $G$ from state $s$, and the positive variables $\{\sigma_{s,\alpha} > 0 \mid s \in S, \alpha \in A(s)\}$, denoting the probability of taking action $\alpha$ in state $s$ under the policy. Note that we only consider policies where $\sigma_{s,\alpha} > 0$ for all states $s$ and actions $\alpha \in A(s)$, such that applying the policy to the uPOMDP does not change the underlying graph.

maximize
$$r_{s_I} \tag{1}$$
subject to
$$r_{s_I} \geq \kappa, \qquad \forall s \in G\colon\ r_s = 0, \tag{2}$$
$$\forall s \in S\colon\ \sum_{\alpha \in A(s)} \sigma_{s,\alpha} = 1, \tag{3}$$
$$\forall s, s' \in S \text{ s.t. } O(s) = O(s'),\ \forall \alpha \in A(s)\colon\ \sigma_{s,\alpha} = \sigma_{s',\alpha}, \tag{4}$$
$$\forall s \in S_a\colon\ r_s \leq \sum_{\alpha \in A(s)} \sigma_{s,\alpha} \Big( R(s,\alpha) + \sum_{s' \in S} P(s,\alpha,s')\, r_{s'} \Big), \tag{5}$$
$$\forall s \in S_u,\ \forall P \in \mathcal{P}\colon\ r_s \leq R(s) + \sum_{s' \in S} P(s,s')\, r_{s'}. \tag{6}$$

The objective (1) is to maximize the expected reward $r_{s_I}$ at the initial state. The constraints (2) encode the specification requirement and set the expected reward to 0 in the states of the goal set $G$. We ensure that the policy is a valid probability distribution in each state by (3). Next, (4) ensures that the policy is observation-based. We encode the computation of expected rewards for states with action choices by (5) and for states with uncertain outcomes by (6). In (6), we omit the unique action in the transition function $P(s,s')$ and the reward function $R(s)$ for states with uncertain outcomes.
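To see why (6) is semi-infinite, consider the following toy sketch (ours; the uncertain state, its two successors, and all numbers are made up). Every admissible instantiation of the interval-bounded transition probabilities contributes its own inequality, and the sketch only samples finitely many of the uncountably many constraints.

```python
# Sketch (ours, not the authors' code): constraint (6) at a single uncertain
# state s with two successors s1, s2 whose probabilities lie in intervals.
# Every admissible instantiation P in the uncertainty set yields one inequality
#   r_s <= R(s) + P(s,s1)*r_s1 + P(s,s2)*r_s2  --  infinitely many in total.
import numpy as np

R_s = 1.0                      # immediate reward R(s) (toy value)
r = {"s1": 4.0, "s2": 10.0}    # candidate reward variables of the successors
r_s = 5.5                      # candidate reward variable of s
lo = {"s1": 0.5, "s2": 0.1}    # interval lower bounds a_{s,s'}
hi = {"s1": 0.9, "s2": 0.5}    # interval upper bounds b_{s,s'}

# Admissible u1 = P(s,s1): within its own interval and such that
# u2 = 1 - u1 also lies in its interval.
u1_min = max(lo["s1"], 1.0 - hi["s2"])
u1_max = min(hi["s1"], 1.0 - lo["s2"])

for u1 in np.linspace(u1_min, u1_max, 11):   # finitely many samples of infinitely many constraints
    u2 = 1.0 - u1
    rhs = R_s + u1 * r["s1"] + u2 * r["s2"]
    print(f"P(s,s1)={u1:.2f}: r_s <= {rhs:.2f} -> {'ok' if r_s <= rhs else 'violated'}")

# The worst case is attained at a vertex of the interval polytope; the
# dualization in the next section replaces the whole family of constraints
# by finitely many linear ones without any sampling.
```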
Let us consider some properties of the optimization problem. First, the functions in (5) are quadratic: the policy variables $\sigma_{s,\alpha}$ are multiplied with the reward variables $r_{s'}$. In general, these constraints are nonconvex, and we later linearize them. Second, the values of the transition probabilities $P(s,s')$ for $s \in S_u$, $s' \in S$ in (6) belong to continuous intervals. Therefore, there are infinitely many constraints over a finite set of reward variables. These constraints are similar to the LP formulation for MDPs (Puterman 1994) and are affine; they contain no policy variables.

## Dualization of Semi-Infinite Constraints

In this section, we transform the semi-infinite optimization problem into a finite optimization problem using dualization for simple uPOMDPs. Specifically, we translate the semi-infinite constraints from (6) into finitely many constraints.

### Dualization for Robust Linear Programs

We summarize robust LPs with polytopic uncertainty (Bertsimas, Brown, and Caramanis 2011; Ben-Tal, El Ghaoui, and Nemirovski 2009). The idea for solving robust LPs with such uncertainty is essential in our approach.

**Robust LPs.** A robust LP with the variable $x \in \mathbb{R}^n$ is

$$\text{minimize } c^\top x \quad \text{subject to } (d + u)^\top x \leq e \quad \forall u \in \mathcal{U},$$

where $c, d \in \mathbb{R}^n$ and $e \in \mathbb{R}$ are given, $u \in \mathbb{R}^n$ is the uncertain parameter, and $\mathcal{U}$ is the uncertainty set. We assume that the set $\mathcal{U}$ is a convex polytope, i.e., an $n$-dimensional set defined by the linear inequalities $Cu + g \geq 0$ for $C \in \mathbb{R}^{m \times n}$ and $g \in \mathbb{R}^m$.

**Duality.** For simplicity, we explain the idea on a single robust inequality; it generalizes to multiple robust inequalities. With $\mathcal{U}$ defined by the linear inequalities above, duality can be used to obtain a tractable reformulation of the robust LP. Specifically, we write the Lagrangian for the maximization of $u^\top x$ over $\mathcal{U}$ with the dual variable $\mu \geq 0$ as

$$L(u, \mu) = u^\top x + \mu^\top (Cu + g).$$

By taking the supremum over $u$, we obtain

$$\sup_{u} L(u, \mu) = \begin{cases} +\infty & \text{if } C^\top \mu + x \neq 0,\\ \mu^\top g & \text{if } C^\top \mu + x = 0.\end{cases}$$

All constraints are linear, which implies strong duality, i.e.,

$$\sup_{u \in \mathcal{U}} u^\top x = \inf_{\mu \geq 0} \{\mu^\top g \mid C^\top \mu + x = 0\}.$$

Since these optimization problems are linear, the optimal value is attained at a feasible solution; therefore, we can replace $\sup$ and $\inf$ with $\max$ and $\min$. The semi-infinite inequality with polytopic uncertainty is thus equivalent to the following linear constraints:

$$d^\top x + \mu^\top g \leq e, \qquad C^\top \mu + x = 0, \qquad \mu \geq 0.$$

### Dualization for Simple uPOMDPs

We now describe the dualization step that we use to obtain a finite optimization problem for simple uPOMDPs.

**Uncertainty polytopes.** First, we formalize the uncertainty sets for simple uPOMDPs. For a state $s \in S_u$, we assume that there are $|S|$ intervals $[a_{s,s'}, b_{s,s'}] \in \mathbb{I}$, $s' \in S$, that describe the uncertain probability of transitioning to each successor state of $s$. This setting is equivalent to (Suilen et al. 2020). The following constraints ensure that the uncertain transition function $P$ is well-defined at a state $s$:

$$\forall s' \in S\colon\ a_{s,s'} \leq u_{s,s'} \leq b_{s,s'}, \qquad \sum_{s' \in S} u_{s,s'} = 1.$$

We denote these constraints in matrix form as $C_s u + g_s \leq 0$.
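As a concrete illustration (ours, not the paper's artifact; the helper `interval_polytope` and the toy numbers are assumptions), the following sketch builds $C_s$ and $g_s$ for interval-bounded successor probabilities and checks numerically, with scipy, that the dual bound used in the dualization step below coincides with the worst-case value $\min_{u} u^\top q$, as strong duality promises.

```python
# Sketch (ours): matrix form C_s u + g_s <= 0 of the interval constraints
#   a_{s,s'} <= u_{s,s'} <= b_{s,s'},  sum_{s'} u_{s,s'} = 1,
# and a numerical check of strong duality:
#   min { q^T u : C_s u + g_s <= 0 }  =  max { g_s^T mu : C_s^T mu = -q, mu >= 0 }.
import numpy as np
from scipy.optimize import linprog


def interval_polytope(a: np.ndarray, b: np.ndarray):
    """Return (C, g) with {u : C u + g <= 0} = {u : a <= u <= b, sum(u) = 1}."""
    n = len(a)
    I, ones = np.eye(n), np.ones((1, n))
    C = np.vstack([-I, I, ones, -ones])          # -u <= -a, u <= b, sum u <= 1, -sum u <= -1
    g = np.concatenate([a, -b, [-1.0], [1.0]])
    return C, g


a = np.array([0.1, 0.2, 0.3])    # toy interval lower bounds a_{s,s'}
b = np.array([0.5, 0.6, 0.7])    # toy interval upper bounds b_{s,s'}
q = np.array([4.0, 10.0, 1.0])   # reward variables r_{s'} of the successor states

C, g = interval_polytope(a, b)

# Worst-case (minimal) expected successor reward over the uncertainty set.
primal = linprog(c=q, A_ub=C, b_ub=-g, bounds=[(None, None)] * len(q))

# Dual bound of the form used in constraint (7): max g^T mu s.t. C^T mu = -q, mu >= 0.
dual = linprog(c=-g, A_eq=C.T, b_eq=-q, bounds=[(0, None)] * len(g))

print("worst case :", primal.fun)    # min_u q^T u
print("dual bound :", -dual.fun)     # max_mu g^T mu
assert abs(primal.fun - (-dual.fun)) < 1e-6
```

In practice, as noted below, most interval bounds are 0, so the matrices $C_s$ are sparse and only the few actual successor states of $s$ need to be represented.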
**Dualization.** After obtaining the matrices $C_s$ and vectors $g_s$ characterizing the uncertainty set of each state, we directly use dualization to transform the inequalities in (6) into

$$\forall s \in S_u\colon\quad r_s \leq R(s) + \mu_s^\top g_s, \qquad C_s^\top \mu_s + q = 0, \qquad \mu_s \geq 0, \tag{7}$$

where $\mu_s \in \mathbb{R}^{|S|}$ is the dual variable of the constraint $C_s u + g_s \leq 0$ and $q$ is the $|S|$-dimensional vector collecting the reward variables $r_{s'}$ for the states $s' \in S$. After this step, the problem with the objective (1) and the constraints (2)–(5) and (7) has finitely many variables and constraints. In the worst case, the number of dual variables in (7) is $|S| \cdot |S_u|$ and the number of constraints is $2 \cdot |S| \cdot |S_u| + |S_u|$, which yields only a factor-$|S|$ increase in the number of variables compared to the semi-infinite optimization problem above. In practice, most of the values $a_{s,s'}$ and $b_{s,s'}$ are 0, meaning there are few successor states for each state $s$, and we exploit this structure in our numerical examples.

## Linearizing the Finite Nonconvex Problem

In this section, we discuss our algorithm to solve the (finite but nonconvex) dual problem from the previous section. Our method is based on a sequential convex programming (SCP) scheme (Yuan 2015; Mao et al. 2018). SCP iteratively computes a locally optimal solution to the dual problem by approximating it as an LP. In every iteration, this approximation is obtained by linearizing the quadratic functions around a previous solution. The resulting LP does not necessarily have a feasible solution, and its optimal solution does not necessarily correspond to a feasible solution of the original dual problem. We alleviate this drawback by adding three components to our approach.

**Linearizing.** To linearize the dual problem, we only need to linearize the quadratic functions in (5), repeated below:

$$r_s \leq \sum_{\alpha \in A(s)} \sigma_{s,\alpha} \Big( R(s,\alpha) + \sum_{s' \in S} P(s,\alpha,s')\, r_{s'} \Big).$$

The constraint is quadratic in $r_{s'}$ and $\sigma_{s,\alpha}$. For compact notation, let $h(s,\alpha,s')$ be the quadratic function above for given $s \in S_a$, $s' \in S$, and $\alpha \in A(s)$:

$$h(s,\alpha,s') = \sigma_{s,\alpha} \cdot P(s,\alpha,s') \cdot r_{s'},$$

and let $d = P(s,\alpha,s')$, $y = \sigma_{s,\alpha}$, and $z = r_{s'}$. Let $\hat{y}, \hat{z}$ denote an arbitrary assignment to $y$ and $z$. We linearize $h(s,\alpha,s')$ around $\hat{y}, \hat{z}$ as

$$h_{\mathit{aff}}(s,\alpha,s') = d\big(\hat{y}\hat{z} + \hat{y}(z - \hat{z}) + \hat{z}(y - \hat{y})\big) = d\big(\hat{y}\,z + \hat{z}\,y - \hat{y}\hat{z}\big).$$

The resulting function $h_{\mathit{aff}}(s,\alpha,s')$ is affine in $y$ and $z$ (that is, in $\sigma_{s,\alpha}$ and $r_{s'}$). After the linearization step, we replace (5) by

$$\forall s \in S_a\colon\quad r_s \leq \sum_{\alpha \in A(s)} \Big( \sigma_{s,\alpha} R(s,\alpha) + \sum_{s' \in S} h_{\mathit{aff}}(s,\alpha,s') \Big). \tag{8}$$

Recall that the linearized problem may be infeasible, or the optimal solution of the linearized problem may no longer be feasible for the dual problem. We alleviate this issue as follows. First, we add penalty variables to the linearized constraints to ensure that the resulting LP is always feasible, and we penalize violations by adding these variables to the objective. Second, we include trust regions around the previous solution to ensure that we do not deviate too much from that solution. Finally, we incorporate a robust verification step to ensure that the obtained policy satisfies the specification. This verification step is exact and alleviates the potential approximation errors introduced by the linearization.

**Penalty variables.** We add a nonnegative penalty variable $k_s$ for all $s \in S_a$ to the constraints in (8), which yields

$$\forall s \in S_a\colon\quad r_s \leq k_s + \sum_{\alpha \in A(s)} \Big( \sigma_{s,\alpha} R(s,\alpha) + \sum_{s' \in S} h_{\mathit{aff}}(s,\alpha,s') \Big). \tag{9}$$

When $k_s$ is sufficiently large, these constraints always admit a feasible solution. We also add a penalty variable to the constraint $r_{s_I} \geq \kappa$.
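To make the linearization concrete, here is a small sketch (ours; the toy numbers are arbitrary) of $h$ and $h_{\mathit{aff}}$. At the linearization point the two coincide; away from it, the affine model deviates, which is exactly the error that the penalty variables $k_s$ in (9) and the trust regions below absorb.

```python
# Sketch (ours): linearization of the bilinear term h(s,a,s') = sigma * P * r
# around a previous solution (sigma_hat, r_hat), as used in constraints (8)/(9).
def h(d: float, y: float, z: float) -> float:
    """Exact bilinear term: d*y*z with d = P(s,a,s'), y = sigma_{s,a}, z = r_{s'}."""
    return d * y * z


def h_aff(d: float, y: float, z: float, y_hat: float, z_hat: float) -> float:
    """First-order expansion of h around (y_hat, z_hat): d*(y_hat*z + z_hat*y - y_hat*z_hat)."""
    return d * (y_hat * z + z_hat * y - y_hat * z_hat)


d, y_hat, z_hat = 0.8, 0.5, 10.0                                  # toy P(s,a,s'), sigma_hat, r_hat
print(h(d, y_hat, z_hat), h_aff(d, y_hat, z_hat, y_hat, z_hat))   # identical at the point
print(h(d, 0.6, 12.0), h_aff(d, 0.6, 12.0, y_hat, z_hat))         # approximation away from it
```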
**Trust regions.** We use trust regions by adding the following constraints to the linearized problem:

$$\forall s \in S\colon\quad \hat{r}_s / \delta' \leq r_s \leq \hat{r}_s \cdot \delta', \tag{10}$$
$$\forall s \in S_a,\ \forall \alpha \in A(s)\colon\quad \hat{\sigma}_{s,\alpha} / \delta' \leq \sigma_{s,\alpha} \leq \hat{\sigma}_{s,\alpha} \cdot \delta', \tag{11}$$

where $\delta' = \delta + 1$, $\delta > 0$ is the size of the trust region, which restricts the set of feasible policies, and $\hat{r}_s$ and $\hat{\sigma}_{s,\alpha}$ denote the values of the reward and policy variables around which we linearize.

**Extended LP.** Combining these steps, we state the resulting finite LP for a fixed but arbitrary assignment to $\hat{r}_s$ and $\hat{\sigma}_{s,\alpha}$ in the definition of $h_{\mathit{aff}}$, a penalty parameter $\tau > 0$, and a trust region $\delta > 0$:

maximize
$$r_{s_I} - \tau \sum_{s \in S_a} k_s \tag{12}$$
subject to
$$r_{s_I} + k_{s_I} \geq \kappa, \qquad \forall s \in G\colon\ r_s = 0, \tag{13}$$
$$\forall s \in S\colon\ \sum_{\alpha \in A(s)} \sigma_{s,\alpha} = 1, \tag{14}$$
$$\forall s, s' \in S \text{ s.t. } O(s) = O(s'),\ \forall \alpha \in A(s)\colon\ \sigma_{s,\alpha} = \sigma_{s',\alpha}, \tag{15}$$
$$\forall s \in S_a\colon\ r_s \leq k_s + \sum_{\alpha \in A(s)} \Big( \sigma_{s,\alpha} R(s,\alpha) + \sum_{s' \in S} h_{\mathit{aff}}(s,\alpha,s') \Big), \tag{16}$$
$$\forall s \in S_u\colon\ r_s \leq R(s) + \mu_s^\top g_s, \quad C_s^\top \mu_s + q = 0, \quad \mu_s \geq 0, \tag{17}$$
$$\forall s \in S\colon\ \hat{r}_s / \delta' \leq r_s \leq \hat{r}_s \cdot \delta', \tag{18}$$
$$\forall s \in S_a,\ \forall \alpha \in A(s)\colon\ \hat{\sigma}_{s,\alpha} / \delta' \leq \sigma_{s,\alpha} \leq \hat{\sigma}_{s,\alpha} \cdot \delta'. \tag{19}$$

**Complete algorithm.** We detail our SCP method in Algorithm 1. Its basic idea is to find a solution to the LP (12)–(19) that is also feasible for the nonconvex optimization problem with the objective (1) and the constraints (2)–(5) and (7). We do this by an iterative procedure. We start with an initial guess of a policy and a trust region radius δ > 0. (Step 1) We verify the uncertain MC that combines the uPOMDP and the fixed policy σ to obtain the resulting values of the reward variables $r_s$ and an objective value β. This step is not present in existing SCP methods and is our third improvement; it ensures that the returned policy robustly satisfies the specification. If the uncertain MC indeed satisfies the specification (Step 2 can be skipped in the first iteration), we return the policy σ. Otherwise, we check whether the obtained expected reward β improves on $\beta_{\mathrm{old}}$. If so, we accept the solution and enlarge the trust region by multiplying δ with γ > 1. If not, we reject the solution, contract the trust region by dividing δ by γ, and re-solve the linearized problem at the previous solution. Then (Step 3) we solve the LP (12)–(19) with the current parameters: we linearize around the previous policy variables $\hat{\sigma}_{s,\alpha}$ and reward variables $\hat{r}_s$ and solve with the parameters δ and τ to obtain an optimal solution. We iterate this procedure until a feasible solution is found or the radius of the trust region falls below a threshold ω > 0. If the trust region size falls below ω, we return the policy σ even though it does not satisfy the expected reward specification. In such cases, we can run Algorithm 1 with a different initial assignment.

Algorithm 1: Sequential convex programming with trust regions for solving uncertain POMDPs
Input: uPOMDP, specification threshold κ, γ > 1, ω > 0
Initialize: trust region δ, penalty weight τ, policy σ, β_old = 0
1: while δ > ω do
2:     Verify the uncertain MC induced by the policy σ (Step 1)
3:     Extract the reward variables r_s and the objective value β
4:     if κ ≤ β then (Step 2)
5:         return the policy σ (found solution)
6:     else if β > β_old then
7:         ∀s: r̂_s ← r_s; σ̂ ← σ; β_old ← β (accept iteration)
8:         δ ← δ · γ (enlarge trust region)
9:     else
10:        δ ← δ / γ (reject iteration, contract trust region)
11:    end if
12:    Linearize (5) around σ̂, r̂_s (Step 3)
13:    Solve the resulting LP (12)–(19)
14:    Extract the optimal solution as the policy σ
15: end while
16: return the policy σ
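The control flow of Algorithm 1 can be summarized by the following Python sketch (ours). The callbacks `verify_robust` and `solve_linearized_lp` are placeholders for Storm's robust verification of the induced uncertain MC and for solving the LP (12)–(19); they, like all names here, are assumptions of this illustration rather than the paper's implementation.

```python
# Sketch (ours) of Algorithm 1: SCP with trust regions. The two callbacks are
# placeholders: verify_robust(sigma) is assumed to return the worst-case
# objective value beta and the reward values of the uncertain MC induced by
# sigma, and solve_linearized_lp(...) is assumed to solve the LP (12)-(19)
# linearized around (sigma_hat, r_hat) and return the new policy.
from typing import Callable, Dict, Tuple

Policy = Dict[Tuple[str, str], float]   # (state, action) -> probability
Rewards = Dict[str, float]              # state -> reward variable value


def scp_trust_region(
    verify_robust: Callable[[Policy], Tuple[float, Rewards]],
    solve_linearized_lp: Callable[[Policy, Rewards, float], Policy],
    sigma: Policy,
    kappa: float,                       # specification threshold
    delta: float = 1.5,                 # initial trust-region size
    gamma: float = 1.5,                 # growth/contraction factor (> 1)
    omega: float = 1e-4,                # termination threshold on the trust region
) -> Policy:
    beta_old = 0.0
    sigma_hat: Policy = dict(sigma)
    r_hat: Rewards = {}
    while delta > omega:
        beta, r = verify_robust(sigma)              # Step 1: exact robust verification
        if kappa <= beta:                           # Step 2: specification satisfied
            return sigma
        if beta > beta_old:                         # accept: move the linearization point
            sigma_hat, r_hat, beta_old = dict(sigma), dict(r), beta
            delta *= gamma                          # enlarge trust region
        else:                                       # reject: keep the previous point
            delta /= gamma                          # contract trust region
        # Step 3: linearize (5) around (sigma_hat, r_hat) and solve the LP (12)-(19)
        sigma = solve_linearized_lp(sigma_hat, r_hat, delta)
    return sigma                                    # trust region exhausted; spec not certified
```

The robust verification call at the top of each iteration is what makes the returned policy sound: a policy is only returned as "found" after the exact robust check, not on the basis of the linearized LP alone.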
## Numerical Examples

We evaluate the new SCP-based approach to the uPOMDP problem on two case studies. Additionally, we compare with (Suilen et al. 2020).

**Setting.** We use the verification tool Storm 1.6.2 (Dehnert et al. 2017) to build uPOMDPs. The experiments were performed on an Intel Core i9-9900u 2.50 GHz CPU with 64 GB of RAM, using Gurobi 9.0 (Gurobi Optimization 2020) as the LP solver and Storm's robust verification algorithm. We initialize $\hat{\sigma}$ to be uniform over all actions. The algorithm parameters are $\tau = 10^4$, $\delta = 1.5$, $\gamma = 1.5$, and $\omega = 10^{-4}$.

### Spacecraft Motion Planning

This case study considers the robust spacecraft motion planning system (Frey et al. 2017; Hobbs and Feron 2020) mentioned in the introduction. The spacecraft orbits the earth along a set of predefined natural motion trajectories (NMTs) (Kim et al. 2007). While the spacecraft follows its current NMT, it does not consume fuel. Upon an imminent close encounter with other objects in space, the spacecraft may be directed to switch into a nearby NMT at the cost of a certain fuel usage. We consider two objectives: (1) to maximize the probability of avoiding a close encounter with other objects and (2) to minimize the fuel consumption, both while successfully finishing an orbiting cycle. Uncertainty enters the problem in the form of potential sensing and actuating errors. In particular, there is uncertainty about the spacecraft position, the location of other objects in the current orbit, and the failure rate of switching to a nearby NMT.

**Model.** We encode the problem as a uPOMDP with two-dimensional state variables: the NMT $n \in \{1, \ldots, 36\}$ and the (discretized) time index $i \in \{1, \ldots, I\}$ within a fixed NMT. We use different values of the resolution $I$ in the examples. Every combination of $n, i$ defines an associated point in 3-dimensional space. The transition model is built as follows. In each state, there are two types of actions: (1) staying in the current NMT, which increments the time index by 1, and (2) switching to a different NMT if two locations are close to each other. More precisely, we allow a transfer between the two points in space defined by $(n, i)$ and $(n', i')$ if the distance between the associated points is less than 250 km. A switching action may fail; in particular, the spacecraft may transfer to an unintended nearby orbit. The observation model contains 1440 different observations of the current location in orbit, which give partial information about the current NMT and time index. Specifically, for each NMT we can only observe the location up to an accuracy of 40 different locations per orbit; points that are close to each other have the same observation.

**Variants.** We consider four benchmarks. S1 is our standard model with a discretized trajectory of $I = 200$ time indices. S2 denotes an extension of S1 where the considered policy is an FSC with 5 memory states. S3 uses a higher resolution ($I = 600$). Finally, S4 is an extension of S3 where the policy is an FSC with 2 memory states. In all models, we use an uncertain probability of switching into the intended orbit and of correctly locating the objects, given by the intervals [0.50, 0.95]. The four benchmarks have 36 048, 349 480, 108 000, and 342 750 states as well as 65 263, 698 960, 195 573, and 665 073 transitions, respectively. In this example, the objective is to maximize the probability of avoiding a close encounter with objects in the orbit while successfully finishing a cycle.

**Memory yields best policies.** Fig. 1a shows the convergence of the reachability probabilities for each model, specifically the probability versus the runtime in seconds.
First, we observe that after 20 minutes of computation, using the larger uPOMDPs with a higher resolution or with memory in the policy yields superior policies. Second, the policy with memory is superior to the policy without memory. Finally, we observe that larger models indeed require more time per iteration, which means that on the smaller uPOMDP S1 the algorithm converges faster to its optimum.

Figure 1: Computational effort versus the performance of the different policies for the spacecraft motion planning case study (probability or expected cost versus time elapsed in seconds). (a) Obtained probability of avoiding close encounters between the spacecraft and other objects in the orbit. (b) The performance of the policies obtained from the nominal models applied to the uncertain models (dashed lines). (c) The obtained expected cost of successfully finishing an orbit.

Figure 2: The obtained trajectory from a policy (in red) that finishes an orbit around the origin. The initial location is highlighted by a big red circle, the other objects are depicted as black spheres, together with all NMTs used as part of the trajectory. (a) Trajectory from a memoryless policy. (b) Trajectory from a policy with 5 memory states.

**Comparing policies.** Fig. 2 shows a comparison of policies by depicting the resulting spacecraft trajectories. In particular, we show the trajectories of two different policies: a memoryless policy (one memory state) computed on S1 in Fig. 2a, and a policy with finite memory (five memory states) computed on S2 in Fig. 2b. The trajectory of the memoryless policy switches its orbit 17 times, whereas the trajectory of the finite-memory policy switches its orbit only 4 times. Additionally, the average length of the trajectory under the finite-memory policy is 188, compared to 375 for the memoryless one. These results demonstrate the utility of finite memory for improving the reachability probability and minimizing the number of switches.

**Robust policies are more robust.** We demonstrate the utility of computing robust policies against uncertainty in Fig. 1b. Intuitively, we compute policies on nominal models and apply them to the uncertain models. We report results for the nominal transition probabilities of the four considered models, where we ran Algorithm 1 on the nominal models until reaching the optimal value. The performance of the policies on the nominal models is shown with solid lines, and their performance on the uncertain models with dashed lines. When we extract these policies and apply them to the uPOMDP, they perform significantly worse and fail to satisfy the objective in each case. The results clearly demonstrate the trade-offs between computing policies for nominal and for uncertain problems. In each case, the computation time is roughly an order of magnitude larger, yet the resulting policies are better: we observe that the probability of a close encounter with another object increases by up to 60 percentage points if we do not consider the uncertainty in the model.

**Expected energy consumption.** Finally, we consider an example in which there is a cost for switching orbits. The objective is to minimize the cost of successfully finishing a cycle in orbit. We obtain the cost of each switching action according to the parameters in (Frey et al. 2017).
Additionally, we define the cost of a close encounter to be 10 000 N·s (Newton-seconds, a unit of impulse that relates to the fuel consumption), a much higher cost than any switching action. We reduce the uncertainty in the model by setting the previously mentioned intervals to [0.90, 0.95]. In particular, the worst-case probability of correctly detecting objects is now higher than before, reducing the probability of close encounters with those objects. Fig. 1c shows the convergence of the costs for each model. The costs of the resulting policies for the models S1 and S2 are 178 and 153 N·s, respectively. We ran into numerical trouble during the robust verification step for S3 and S4. Similar to the previous example, the results demonstrate the utility of finite-memory policies in reducing the fuel cost of the spacecraft.

### Aircraft Collision Avoidance

We consider robust aircraft collision avoidance (Kochenderfer 2015). The objective is to maximize the probability of avoiding a collision with an intruder aircraft while taking into account sensor errors and uncertainty in its future paths.

**Model.** We consider a two-dimensional example with a state space consisting of (1) the position $(x, y)$ of the own aircraft relative to the intruder and (2) the speed $(\dot{x}, \dot{y})$ of the intruder relative to the own aircraft. We grid-discretize the relative position into 900 cells of 100 × 100 feet, and the relative speed into 100 cells by discretizing each of $\dot{x}$ and $\dot{y}$ into 10 points between −1000 and 1000 feet/minute. We use time steps of 1 second. In each time step, the action choice reflects changing the acceleration of the own aircraft in two dimensions, while the intruder acceleration is chosen probabilistically; the state is then updated accordingly. The probabilistic changes by the intruder are given by uncertain intervals $I$, generalizing the model from (Kochenderfer 2015), where fixed values are assumed. We vary $I$ below. Additionally, the probability of the pilot being responsive at a given time is given by the interval [0.70, 0.90]. The own aircraft cannot precisely observe the intruder's position, and the accuracy of the observation is quantized to 300 feet. The model has 476 009 states and 866 889 transitions. The specification is to maximize the probability of reaching the target while maintaining a distance of 400 feet in either direction.

**Variants.** We consider three ranges of intruder behavior in our example. The intruder A1 has an uncertain probability of applying an acceleration in one particular direction given by the interval $I = [0.2, 0.8]$, while we have $I = [0.3, 0.7]$ and $I = [0.4, 0.6]$ for A2 and A3, respectively.

Figure 3: The probability of safely finishing the mission without collision in the aircraft collision example (probability versus time elapsed in seconds).

**Results.** We show the convergence of the method for each intruder in Fig. 3. We compute a locally optimal policy within minutes in each case. For the three models, the probability of satisfying the specification is 0.997, 0.976, and 0.76, respectively. Last, we extract the policy that counters intruders within A3 and apply it to the uPOMDP defined for A1, which has more uncertainty than A3.
Then, the resulting reachability probability is only 0.55, which shows the utility of considering robustness against potentially adversarial intruders.

### Comparing Performance

We compare our approach (SCP) with an implementation of (Suilen et al. 2020), which is based on an exponentially large quadratically constrained quadratic program (QCQP). Let us first revisit the spacecraft problem. The QCQP method runs out of memory on all models except S1. For S1, QCQP indeed has a small advantage over SCP, cf. Fig. 4a. We remark that the policies computed with SCP on S2, S3, and S4 are superior to the policy computed by QCQP on S1. For the aircraft problem, the QCQP again runs out of memory on A1, A2, and A3. For comparison, we construct the models A4 and A5 by a coarser discretization (and varying uncertainty). These models have 18 718 states and are smaller versions of A1 and A2. The performance is plotted in Fig. 4b. For A4, QCQP yields a slightly better policy but takes more time, whereas on A5, SCP finds a better policy faster. We conclude that the QCQP approach of (Suilen et al. 2020) is comparable on small models but does not scale to large models.

Figure 4: Comparing SCP with QCQP (probability versus time elapsed in seconds). (a) On model S1. (b) On the smaller models A4 and A5.

## Conclusion

We presented a new approach to computing robust policies for uncertain POMDPs. The experiments showed that we are able to apply our method, based on convex optimization, to well-known benchmarks with varying levels of uncertainty. Future work will involve incorporating learning-based techniques.

## Acknowledgments

M. Cubuktepe and U. Topcu were supported in part by the grants ARL ACC-APG-RTP W911NF, NASA 80NSSC19K0209, and AFRL FA9550-19-1-0169. N. Jansen and M. Suilen were supported in part by the grants NWO OCENW.KLEIN.187 "Provably Correct Policies for Uncertain Partially Observable Markov Decision Processes", NWA.1160.18.238 "PrimaVera", and NWA.BIGDATA.2019.002 "EXoDuS - EXplainable Data Science". S. Junges was supported in part by NSF grants 1545126 (VeHICaL) and 1646208, by the DARPA Assured Autonomy program, by Berkeley Deep Drive, and by Toyota under the iCyPhy center.

## References

Ahmadi, M.; Cubuktepe, M.; Jansen, N.; and Topcu, U. 2018. Verification of Uncertain POMDPs Using Barrier Certificates. In Allerton, 115–122. IEEE.

Alizadeh, F.; and Goldfarb, D. 2003. Second-Order Cone Programming. Mathematical Programming 95(1): 3–51.

Amato, C.; Bernstein, D. S.; and Zilberstein, S. 2010. Optimizing Fixed-size Stochastic Controllers for POMDPs and Decentralized POMDPs. AAMAS 21(3): 293–320.

Ben-Tal, A.; El Ghaoui, L.; and Nemirovski, A. 2009. Robust Optimization, volume 28. Princeton University Press.

Bertsimas, D.; Brown, D. B.; and Caramanis, C. 2011. Theory and Applications of Robust Optimization. SIAM Review 53(3): 464–501.

Burns, B.; and Brock, O. 2007. Sampling-Based Motion Planning with Sensing Uncertainty. In ICRA, 3313–3318. IEEE.

Chamie, M. E.; and Mostafa, H. 2018. Robust Action Selection in Partially Observable Markov Decision Processes with Model Uncertainty. In CDC, 5586–5591. IEEE.

Chatterjee, K.; Chmelík, M.; Gupta, R.; and Kanodia, A. 2016. Optimal Cost Almost-sure Reachability in POMDPs. Artif. Intell. 234: 26–48.

Dehnert, C.; Junges, S.; Katoen, J.; and Volk, M. 2017. A Storm is Coming: A Modern Probabilistic Model Checker. In CAV (2), volume 10427 of LNCS, 592–600. Springer.

Frey, G. R.; Petersen, C. D.; Leve, F. A.; Kolmanovsky, I. V.; and Girard, A. R. 2017. Constrained Spacecraft Relative Motion Planning Exploiting Periodic Natural Motion Trajectories and Invariance. Journal of Guidance, Control, and Dynamics 40(12): 3100–3115.
Givan, R.; Leach, S.; and Dean, T. 2000. Bounded-Parameter Markov Decision Processes. Artif. Intell. 122(1-2): 71–109.

Gurobi Optimization, LLC. 2020. Gurobi Optimizer Reference Manual. URL http://www.gurobi.com.

Hahn, E. M.; Hashemi, V.; Hermanns, H.; Lahijanian, M.; and Turrini, A. 2017. Multi-Objective Robust Strategy Synthesis for Interval Markov Decision Processes. In QEST, volume 10503 of LNCS, 207–223. Springer.

Hobbs, K. L.; and Feron, E. M. 2020. A Taxonomy for Aerospace Collision Avoidance with Implications for Automation in Space Traffic Management. In AIAA Scitech 2020 Forum, 0877.

Itoh, H.; and Nakamura, K. 2007. Partially Observable Markov Decision Processes with Imprecise Parameters. Artif. Intell. 171(8): 453–490.

Junges, S.; Jansen, N.; Wimmer, R.; Quatmann, T.; Winterer, L.; Katoen, J.; and Becker, B. 2018. Finite-State Controllers of POMDPs using Parameter Synthesis. In UAI, 519–529.

Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence 101(1-2): 99–134.

Kim, S. C.; Shepperd, S. W.; Norris, H. L.; Goldberg, H. R.; and Wallace, M. S. 2007. Mission Design and Trajectory Analysis for Inspection of a Host Spacecraft by a Microsatellite. In 2007 IEEE Aerospace Conference, 1–23. IEEE.

Kochenderfer, M. J. 2015. Decision Making Under Uncertainty: Theory and Application. MIT Press.

Lobo, M. S.; Vandenberghe, L.; Boyd, S.; and Lebret, H. 1998. Applications of Second-Order Cone Programming. Linear Algebra and its Applications 284(1-3): 193–228.

Madani, O.; Hanks, S.; and Condon, A. 1999. On the Undecidability of Probabilistic Planning and Infinite-Horizon Partially Observable Markov Decision Problems. In AAAI, 541–548. AAAI Press.

Mao, Y.; Szmuk, M.; Xu, X.; and Acikmese, B. 2018. Successive Convexification: A Superlinearly Convergent Algorithm for Non-convex Optimal Control Problems. arXiv preprint arXiv:1804.06539.

Meuleau, N.; Kim, K.-E.; Kaelbling, L. P.; and Cassandra, A. R. 1999. Solving POMDPs by Searching the Space of Finite Policies. In UAI, 417–426.

Nilim, A.; and Ghaoui, L. E. 2005. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research 53(5): 780–798.

Osogami, T. 2015. Robust Partially Observable Markov Decision Process. In ICML, volume 37, 106–115.

Puggelli, A.; Li, W.; Sangiovanni-Vincentelli, A. L.; and Seshia, S. A. 2013. Polynomial-Time Verification of PCTL Properties of MDPs with Convex Uncertainties. In CAV, volume 8044 of LNCS, 527–542. Springer.

Puterman, M. L. 1994. Markov Decision Processes. John Wiley and Sons.

Suilen, M.; Jansen, N.; Cubuktepe, M.; and Topcu, U. 2020. Robust Policy Synthesis for Uncertain POMDPs via Convex Optimization. In IJCAI, 4113–4120. ijcai.org.

Vlassis, N.; Littman, M. L.; and Barber, D. 2012. On the Computational Complexity of Stochastic Controller Optimization in POMDPs. ACM Trans. on Computation Theory 4(4): 12:1–12:8.

Wiesemann, W.; Kuhn, D.; and Rustem, B. 2013. Robust Markov Decision Processes. Mathematics of Operations Research 38(1): 153–183.

Wolff, E. M.; Topcu, U.; and Murray, R. M. 2012. Robust Control of Uncertain Markov Decision Processes with Temporal Logic Specifications. In CDC, 3372–3379. IEEE.

Yuan, Y.-x. 2015. Recent Advances in Trust Region Algorithms. Mathematical Programming 151(1): 249–281.