# Model Predictive Control with Reach-avoid Analysis

Dejin Ren¹,², Wanli Lu³, Jidong Lv³, Lijun Zhang¹,² and Bai Xue¹,²

¹State Key Lab. of Computer Science, Institute of Software, CAS, Beijing, China
²University of Chinese Academy of Sciences, Beijing, China
³National Engineering Research Center of Rail Transportation Operation and Control System, Beijing Jiaotong University, Beijing, China

{rendj, zhanglj, xuebai}@ios.ac.cn, {luwanli, jdlv}@bjtu.edu.cn

## Abstract

In this paper we investigate the optimal controller synthesis problem, so that the system under the controller can reach a specified target set while satisfying given constraints. Existing model predictive control (MPC) methods learn from a set of discrete states visited by previous (sub-)optimized trajectories and thus result in computationally expensive mixed-integer nonlinear optimization. In this paper a novel MPC method is proposed based on reach-avoid analysis to solve the controller synthesis problem iteratively. The reach-avoid analysis is concerned with computing a reach-avoid set, which is a set of initial states from which the system can reach the target set successfully. It not only provides terminal constraints, which ensure feasibility of MPC, but also expands the discrete states used in existing methods into a continuous set (i.e., a reach-avoid set) and thus leads to nonlinear optimization which is more computationally tractable online due to the absence of integer variables. Finally, we evaluate the proposed method and compare it with state-of-the-art ones on several examples.

## 1 Introduction

Control synthesis is a fundamental problem which automatically constructs a control strategy that induces a system to exhibit a desired behavior. Due to their ability to handle control and state constraints and to yield high-performance control systems [Camacho and Alba, 2013], control design methods based on MPC have gained great popularity and found wide acceptance in industrial applications, ranging from autonomous driving [Verschueren et al., 2014; Kabzan et al., 2019] to large-scale interconnected power systems [Ernst et al., 2008; Mohamed et al., 2011].

In MPC design methods one issue is to guarantee feasibility of the successive optimization problems [Scokaert and Rawlings, 1999]. Because MPC is greedy in nature, i.e., it only searches for the optimal strategy within a finite horizon, an MPC controller may steer the state to a region starting from where the violation of hard state constraints cannot be avoided. Although this feasibility issue could be solved by using a sufficiently long prediction horizon, we may not be able to afford the computational overhead due to limited computational resources. Consequently, several solutions towards the feasibility issue have been proposed [Zheng and Morari, 1995; Zeng et al., 2021; Ma et al., 2021].

Besides the feasibility issue of satisfying all hard state constraints [Zheng and Morari, 1995; Zeng et al., 2021; Ma et al., 2021], a system is also desired to achieve a certain performance objective in an optimal sense. In the existing literature, stability performance of approaching an equilibrium within the MPC framework is extensively studied.
One common solution for achieving stability is to add Lyapunov functions as terminal costs and/or corresponding invariant sets as terminal sets [Michalska and Mayne, 1993; Limon et al., 2005; Limón et al., 2006; Mhaskar et al., 2006; de la Peña and Christofides, 2008; Wu et al., 2019; Grandia et al., 2020]. This has motivated significant research on applications of this control design to nonlinear processes [Yao and Shekhar, 2021]. However, in many real applications stability is too demanding a requirement. For example, a spacecraft rendezvous may require the chaser vehicle to be at a certain position relative to the target, moving towards it with a certain velocity. All of these quantities are specified with some tolerance, forming a target region in the state space. When the chaser enters that region a physical connection can be made and the maneuver is complete. However, since the target velocity is non-zero, the region is not invariant and stability cannot be achieved. Regarding this practical issue, the formulation in this paper replaces the notion of stability with reachability: given a target set, the system should reach the target set successfully in finite time. To the best of our knowledge, the learning model predictive control (LMPC) proposed in [Rosolia and Borrelli, 2017], which utilizes historical data to improve suboptimal controllers iteratively, is the only method within the MPC framework which can solve this problem. However, it leads to mixed-integer nonlinear programming problems which are fundamentally challenging to solve.

In this work, we consider control tasks where the goal is to steer the system from a starting configuration to a target set in finite time, while satisfying state and input constraints. The control task is formalized as a reach-avoid optimization problem. For solving the optimization problem, we propose a novel learning-based MPC algorithm which fuses iterative learning control (ILC) [Arimoto et al., 1984] and the reach-avoid analysis in [Zhao et al., 2022]. The proposed algorithm utilizes MPC to iteratively improve upon a suboptimal controller. Based on a suboptimal controller obtained in the previous iteration, the reach-avoid analysis computes a set of initial states from which the closed-loop system can reach the target set without violating the state and input constraints. In our algorithm, the reach-avoid analysis not only provides terminal constraints which ensure feasibility of the MPC, but also expands the viability space of the system and thus facilitates exploring better trajectories with lower costs. Finally, several examples demonstrate the benefits of our algorithm over state-of-the-art ones.

The closest work to the present one is [Rosolia and Borrelli, 2017]. Due to the use of a set of discrete states visited by previous (sub-)optimized trajectories, computationally expensive mixed-integer nonlinear programming problems have to be addressed online in the proposed LMPC, limiting its application in practice. Our method is inspired by [Rosolia and Borrelli, 2017].
However, it incorporates the recently developed reach-avoid analysis technique and expands the discrete states into a continuous (reach-avoid) set, which not only facilitates exploring better trajectories but also leads to a novel MPC formulation that involves solving more computationally tractable nonlinear programming problems online. Besides, this work is also close to works on model-based safe reinforcement learning, such as [Thomas et al., 2021; Luo and Ma, 2021; Wen and Topcu, 2018]. However, these studies prioritize enforcing the safety objective rather than achieving a joint objective of safety and reachability.

This paper is structured as follows. In Section 2 the reach-avoid problem of interest is introduced. Then, we detail our algorithm for solving the reach-avoid problem in Section 3, and evaluate it on several examples in Section 4. Finally, we conclude this paper in Section 5.

## 2 Preliminaries

In this section we introduce the reach-avoid optimization problem of interest. Throughout this paper, $\mathbb{R}^n$ and $\mathbb{N}$ denote the set of $n$-dimensional real vectors and the set of non-negative integers, respectively. $\mathbb{R}[\cdot]$ denotes the ring of polynomials in the variables given by the argument. $\mathcal{P}[x]$ represents the set of sum-of-squares polynomials over the variable $x$, i.e., $\mathcal{P}[x] = \{p \in \mathbb{R}[x] \mid p = \sum_{i=1}^{k} q_i^2,\ q_i \in \mathbb{R}[x],\ i = 1, 2, \ldots, k\}$.

### 2.1 Problem Statement

Consider a discrete-time dynamical system

$$
x(t+1) = f(x(t), u(t)), \quad t \in \mathbb{N}, \tag{1}
$$

where $f(\cdot): \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is the system dynamics, $x(\cdot): \mathbb{N} \to \mathbb{R}^n$ is the state and $u(\cdot): \mathbb{N} \to U \subseteq \mathbb{R}^m$ is the control input.

**Definition 1.** A control policy is a sequence of control inputs $\pi = \{u(i)\}_{i \in \mathbb{N}}$, where $u(i) \in U$. Furthermore, we define $\Pi$ as the set of all control policies.

Given a safe set $X = \{x \in \mathbb{R}^n \mid w(x) \le 0\}$ and a target set $\mathcal{T} = \{x \in \mathbb{R}^n \mid g(x) \le 0\}$ with $\mathcal{T} \subseteq X$, a control policy $\pi \in \Pi$ is safe with respect to a state $x_0$ if it drives system (1), starting from $x(0) = x_0$, to reach the target set $\mathcal{T}$ in finite time while staying inside the safe set $X$ before the first target hitting time. The associated cost of reaching the target set $\mathcal{T}$ safely is defined below.

**Definition 2.** Given a state $x(0) = x_0 \in X \setminus \mathcal{T}$ and a safe policy $\pi = \{u(i)\}_{i \in \mathbb{N}} \in \Pi$, the cost with respect to the state $x_0$ and the policy $\pi$ is

$$
\sum_{k=0}^{L-1} h(x(k), u(k)), \tag{2}
$$

where $x(k+1) = f(x(k), u(k))$, $L = \min\{i \in \mathbb{N} \mid x(i) \in \mathcal{T}\}$, and $h(\cdot, \cdot)$ is continuous and satisfies

$$
h(x, u) \ge 0, \quad \forall (x, u) \in X \times U. \tag{3}
$$

In this paper, we would like to synthesize a safe control policy $\pi \in \Pi$ with respect to a specified initial state $x_0$ such that system (1) enters the target set $\mathcal{T}$ with minimal cost in finite time while staying inside the safe set $X$ before the first target hitting time.

**Problem 1.** We attempt to solve the following reach-avoid optimization problem:

$$
\begin{aligned}
J^*(x_0) = \min_{T,\, u(i),\, i = 0, \ldots, T-1} \; & \sum_{i=0}^{T-1} h(x(i), u(i)) \\
\text{s.t.} \quad & x(i+1) = f(x(i), u(i)), \quad x(0) = x_0, \\
& u(i) \in U, \; x(i) \in X, \; x(T) \in \mathcal{T}, \\
& i = 0, \ldots, T-1, \; T \in \mathbb{N},
\end{aligned} \tag{4}
$$

where $T$ is the first time of hitting the target set $\mathcal{T}$.

Due to uncertainty about the target hitting time, it is challenging to solve optimization (4) directly. However, the computational complexity can be reduced when searching for a feasible sub-optimal solution. In this paper we adopt a learning strategy to solve (4), which iteratively improves upon already known suboptimal policies, as the LMPC algorithm in [Rosolia and Borrelli, 2017] does. Consequently, an initial feasible policy is needed, as formulated in Assumption 1.
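To make the ingredients of Definition 2 and Problem 1 concrete, the following Python sketch rolls system (1) forward under a candidate feedback policy and accumulates the stage cost up to the first target hitting time. The dynamics, sets, stage cost and feedback gains below are illustrative placeholders assumed for this sketch only; they are not the systems studied in Section 4.

```python
import numpy as np

def rollout_cost(f, h, in_safe, in_target, x0, policy, max_steps=2000):
    """Cost of a feedback policy in the sense of Definition 2: roll out
    x(k+1) = f(x(k), u(k)) and sum h(x(k), u(k)) up to the first target
    hitting time L.  Returns (cost, trajectory); the cost is np.inf if the
    trajectory leaves the safe set X or never reaches the target set."""
    x = np.asarray(x0, dtype=float)
    cost, traj = 0.0, [x]
    for _ in range(max_steps):
        if in_target(x):          # first hitting time L reached
            return cost, traj
        if not in_safe(x):        # left X before reaching T: policy is unsafe
            return np.inf, traj
        u = policy(x)
        cost += h(x, u)           # stage cost h(x(k), u(k))
        x = f(x, u)
        traj.append(x)
    return np.inf, traj           # target not reached within max_steps

# Illustrative placeholder: a double-integrator-like system with a saturated
# linear feedback (all numbers below are assumptions of this sketch).
dt = 0.1
f = lambda x, u: np.array([x[0] + dt * x[1], x[1] + dt * u])
h = lambda x, u: float(x @ x + u * u)
in_safe = lambda x: abs(x[0]) <= 8.0
in_target = lambda x: float(x @ x) <= 0.5 ** 2
policy = lambda x: float(np.clip(-0.04 * x[0] - 0.1 * x[1], -0.5, 0.5))

cost, traj = rollout_cost(f, h, in_safe, in_target, [1.0, 0.0], policy)
print(f"roll-out cost = {cost:.3f}, hitting time L = {len(traj) - 1}")
```

A policy is safe with respect to $x_0$ exactly when such a roll-out returns a finite cost, i.e., the target is reached without leaving $X$.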
**Assumption 1.** Assume that an initial policy $\pi^0 = \{u^0(i)\}_{i \in \mathbb{N}}$, which can drive system (1) from $x_0$ to the target set $\mathcal{T}$ in finite time safely, is available. The corresponding trajectory of system (1) can be obtained and is denoted by $\{x^0(i)\}_{0 \le i \le L_0}$, where $x^0(i+1) = f(x^0(i), u^0(i))$, $i = 0, \ldots, L_0 - 1$, $x^0(0) = x_0$, and $x^0(L_0) \in \mathcal{T}$.

**Remark 1.** The availability of a feasible control policy is not restrictive in practice for a number of applications. For instance, with race cars one can always run a path at very low speed to obtain a control policy.

### 2.2 Guidance-barrier Functions

In this subsection we introduce guidance-barrier functions. They not only provide terminal constraints in our MPC method, escorting system (1) to the target set $\mathcal{T}$ safely, but also generate a set which carves out a viability space within which system (1) can reach the target set $\mathcal{T}$ safely.

**Definition 3.** Given the safe set $X$, the target set $\mathcal{T}$ and a factor $\lambda \in (1, \infty)$, a bounded function $v(x): Y \to \mathbb{R}$ is a guidance-barrier function of system (1) with the feedback controller $\hat{u}(\cdot): X \to U$ if it satisfies the following constraints:

$$
\begin{cases}
v(f(x, \hat{u}(x))) \ge \lambda\, v(x), & \forall x \in X \setminus \mathcal{T}, \\
v(x) \le 0, & \forall x \in Y \setminus X, \\
v(x) \le M, & \forall x \in \mathcal{T}, \\
v(x_0) > 0,
\end{cases} \tag{5}
$$

where $Y = \{y \mid y = f(x, u),\ u \in U,\ x \in X\} \cup X$ and $M$ is a user-defined positive number.

When $f(x, \hat{u}(x))$ is polynomial over $x$ and $X$ is a semi-algebraic set, i.e., $f(x, \hat{u}(x)), w(x) \in \mathbb{R}[x]$, a set $Y$ of the form $\{x \mid w_0(x) \le 0\}$ with $w_0(x) \in \mathbb{R}[x]$ can be obtained using program (3) in [Zhao et al., 2022].

**Remark 2.** It is worth remarking that if (5) holds, then for system (1) with $\hat{u}(\cdot): X \to U$ the induced trajectory starting from $x_0$ hits the target set $\mathcal{T}$ within a bounded amount of time, less than or equal to $\log_\lambda \frac{M}{v(x_0)}$. (This follows from $\lambda^T v(x_0) \le v(x(T)) \le M$, where $T$ is the first hitting time of $\mathcal{T}$.)

A reach-avoid set $\mathcal{R} = \{x \in X \mid v(x) > 0\}$ can be computed via solving constraint (5); it is a set of states from which there exists a control policy $\pi \in \Pi$ driving system (1) to enter the target set $\mathcal{T}$ in finite time while staying inside the safe set $X$ before the first target hitting time.

**Theorem 1.** Given the safe set $X$, the target set $\mathcal{T}$ and a factor $\lambda \in (1, \infty)$, if $v(x): Y \to \mathbb{R}$ is a guidance-barrier function of system (1) with the feedback controller $\hat{u}(\cdot): X \to U$, then $\mathcal{R} = \{x \in X \mid v(x) > 0\}$ is a reach-avoid set.

Theorem 1 is assured by Corollary 1 in [Zhao et al., 2022]. Due to space limitations, we omit the proof here.

**Remark 3.** One admissible control policy $\pi \in \Pi$ such that system (1) satisfies the reach-avoid specification can be constructed from the feedback controller $\hat{u}(\cdot): X \to U$: when system (1) is in state $x(i) \in \mathcal{R}$, the corresponding control action is $u(i) = \hat{u}(x(i))$. We denote such a control policy by $\pi_{\hat{u}}$ in the rest of this paper.
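Conditions (5) are pointwise constraints over sets, so a candidate pair $(v, \hat{u})$ can be sanity-checked on sampled states before (or instead of) computing a formal certificate. The sketch below performs such a sampled check; the function names, the sampler and the candidate inputs are assumptions of this sketch, and passing the check does not certify (5): for polynomial data the sum-of-squares program (6) in Section 3 provides a sound certificate.

```python
import numpy as np

def check_guidance_barrier(v, f, u_hat, lam, M, x0,
                           sample_Y, in_X, in_T, n=20000, seed=0):
    """Sampled (non-exhaustive) check of the four conditions in (5):
       (i)   v(f(x, u_hat(x))) >= lam * v(x)   for sampled x in X \\ T,
       (ii)  v(x) <= 0                          for sampled x in Y \\ X,
       (iii) v(x) <= M                          for sampled x in T,
       (iv)  v(x0) > 0.
    `sample_Y(rng)` draws a point from (a superset of) Y; membership in the
    relevant subsets is decided by the indicator functions `in_X` and `in_T`."""
    rng = np.random.default_rng(seed)
    xs = [sample_Y(rng) for _ in range(n)]
    growth = all(v(f(x, u_hat(x))) >= lam * v(x)
                 for x in xs if in_X(x) and not in_T(x))
    nonpos_outside_X = all(v(x) <= 0 for x in xs if not in_X(x))
    bounded_on_T = all(v(x) <= M for x in xs if in_T(x))
    positive_at_x0 = v(np.asarray(x0, dtype=float)) > 0
    return {"growth_on_X_minus_T": growth,
            "nonpositive_on_Y_minus_X": nonpos_outside_X,
            "bounded_by_M_on_T": bounded_on_T,
            "positive_at_x0": positive_at_x0}
```

A check of this kind is only a falsification aid; Theorem 1 together with the computation in [Zhao et al., 2022] is what actually guarantees that $\mathcal{R}$ is a reach-avoid set.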
## 3 Reach-avoid Model Predictive Control

In this section we elucidate our learning-based algorithm for solving optimization (4) in Problem 1, which is built upon a so-called reach-avoid model predictive control (RAMPC). The proposed RAMPC is constructed based on a guidance-barrier function. Our RAMPC algorithm is iterative, and each iteration mainly consists of three steps. The first step is to synthesize a feedback controller by interpolating the suboptimal state-control pairs obtained in the previous iteration. Then, a guidance-barrier function satisfying (5) with the synthesized feedback controller is computed. Finally, based on the computed guidance-barrier function, an MPC controller, together with its resulting state trajectory, is generated online. The framework of the algorithm is summarized in Alg. 1.

**Algorithm 1** The framework for solving optimization (4).

- **Require:** system (1); initial state $x_0$; safe set $X$; target set $\mathcal{T}$; control input set $U$; feasible state-control trajectory $\{(x^0(i), u^0(i))\}_{i=0}^{L_0-1}$, of which the cost is $J^0(x_0) = \sum_{i=0}^{L_0-1} h(x^0(i), u^0(i))$; factor $\lambda$ and bound $M$ in (5); maximum iteration number $K$; prediction horizon $N$ in RAMPC; termination threshold $\xi > 0$.
- **Ensure:** Return $J^*(x_0)$.
- **for** $j = 0 : K$ **do**
  1. apply interpolation techniques to compute a feedback controller $\hat{u}^j(\cdot): X \to U$ based on the $j$-th state-control trajectory $\{(x^j(i), u^j(i))\}_{i=0}^{L_j-1}$;
  2. compute $v^j(x)$ via solving (5) with $\hat{u}(x) = \hat{u}^j(x)$;
  3. solve the RAMPC optimization, constructed with $v^j(x)$, to obtain a state-control pair $\{(x^{j+1}(i), u^{j+1}(i))\}_{i=0}^{L_{j+1}-1}$ and the cost $J^{j+1}(x_0)$;
  - **if** $|J^{j+1}(x_0) - J^j(x_0)| \le \xi$ **then** return $J^*(x_0) = J^{j+1}(x_0)$; **end if**
- **end for**

For solving (5) in Alg. 1, when $f(x, \hat{u}(x))$ is polynomial over $x$ and $Y$, $X$ and $\mathcal{T}$ are semi-algebraic sets, i.e., $f(x, \hat{u}(x)), w_0(x), w(x), g(x) \in \mathbb{R}[x]$, the problem of solving constraint (5) can be transformed into the semi-definite programming problem (6):

$$
\begin{cases}
v(f(x, \hat{u}(x))) - \lambda v(x) + s_1(x) w(x) - s_2(x) g(x) \in \mathcal{P}[x], \\
-v(x) + s_3(x) w_0(x) - s_4(x) w(x) \in \mathcal{P}[x], \\
M - v(x) + s_5(x) g(x) \in \mathcal{P}[x], \\
v(x_0) > 0,
\end{cases} \tag{6}
$$

where $s_j(x) \in \mathcal{P}[x]$, $j = 1, \ldots, 5$, and $v(x) \in \mathbb{R}[x]$. Otherwise, sample-based approaches in the context of randomized algorithms [Tempo et al., 2013] for robust convex optimization can be employed to solve constraint (5). The basis of these approaches is the Almost Lyapunov condition [Liu et al., 2020], which allows the Lyapunov conditions to be violated in restricted subsets of the space while still ensuring stability properties. Although results from these approaches lack rigorous guarantees, their empirical performance demonstrates their practicality in the existing literature (e.g., [Chang and Gao, 2021]). We will revisit this in our future work.

**Remark 4.** In Alg. 1, the computations in the first two steps, i.e., synthesizing feedback controllers and solving (5), can be carried out offline. Online computations occur only in the third step, which generates MPC controllers.

In the following subsection we introduce the RAMPC optimization in Alg. 1.

### 3.1 Reach-avoid Model Predictive Control

This subsection introduces the RAMPC optimization in Alg. 1, which involves solving the following MPC optimization problem online:

$$
\begin{aligned}
J^{\mathrm{RAMPC},j}_{t \to t+N}(x^j_t) = \min_{u_{k|t}} \; & \sum_{k=0}^{N-1} h(x^j_{k|t}, u^j_{k|t}) + Q^j(x^j_{N|t}, \pi_{\hat{u}^{j-1}}) \\
\text{s.t.} \quad & x^j_{k+1|t} = f(x^j_{k|t}, u^j_{k|t}), \; u^j_{k|t} \in U, \; x^j_{k|t} \in X, \\
& \quad k = 0, 1, \ldots, N-1, \\
& v^{j-1}(x^j_{N|t}) \ge \begin{cases} \lambda^N v^{j-1}(x_0), & \text{if } t = 0, \\ \lambda\, v^{j-1}(x^j_{N|t-1}), & \text{otherwise}, \end{cases} \\
& x^j_{0|t} = x^j_t,
\end{aligned} \tag{7}
$$

where the superscript $j$ is the iteration index, $x^j_{k|t}$ is the state predicted $k$ time steps ahead, computed at time $t$ and initialized at $x^j_{0|t} = x^j_t$ with $x^j_0 = x_0$ (and similarly for $u^j_{k|t}$), and $Q^j(x^j_{N|t}, \pi_{\hat{u}^{j-1}}) = \sum_{i=0}^{L-1} h(x^j_{i+N|t}, \hat{u}^{j-1}(x^j_{i+N|t}))$ is the cost with respect to the state $x^j_{N|t}$ and the control policy $\pi_{\hat{u}^{j-1}}$, the sum running until the roll-out under $\pi_{\hat{u}^{j-1}}$ first enters $\mathcal{T}$.

In (7), the terminal constraint

$$
v^{j-1}(x^j_{N|t}) \ge \begin{cases} \lambda^N v^{j-1}(x_0), & \text{if } t = 0, \\ \lambda\, v^{j-1}(x^j_{N|t-1}), & \text{otherwise}, \end{cases} \tag{8}
$$

guarantees that the terminal state $x^j_{N|t}$ lies in the reach-avoid set $\mathcal{R}^{j-1} = \{x \in X \mid v^{j-1}(x) > 0\}$, which carves out a larger continuous viability space (visualized as the cyan region in Fig. 1) within which system (1) is capable of achieving the reach-avoid objective. This is different from the LMPC method in [Rosolia and Borrelli, 2017], which restricts terminal states to previously explored discrete states (visualized as pink points in Fig. 1) and thus leads to computationally demanding mixed-integer nonlinear optimization.
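For concreteness, the following sketch sets up one online step of (7) for a system with a scalar input, using SciPy's SLSQP solver; the paper's experiments solve (7) with YALMIP in MATLAB, so this is only an illustration. All argument names (`f`, `h`, `v_prev`, `Q_term`, `safe_margin`, `thresh`) are placeholders supplied by the user, and `thresh` stands for the right-hand side of the terminal constraint (8).

```python
import numpy as np
from scipy.optimize import minimize

def rampc_step(f, h, v_prev, Q_term, x_t, thresh, N, u_lo, u_hi, safe_margin):
    """One online RAMPC step (a sketch of optimization (7), scalar input).

    Decision variables: the input sequence u_{0|t}, ..., u_{N-1|t}.
    `v_prev` is the guidance-barrier function v^{j-1} of the previous
    iteration, `thresh` the right-hand side of the terminal constraint (8),
    `Q_term` an (approximate) terminal cost, and `safe_margin(x) >= 0`
    encodes membership of x in the safe set X."""

    def predict(u_seq):
        xs = [np.asarray(x_t, dtype=float)]
        for u in u_seq:
            xs.append(f(xs[-1], float(u)))
        return xs                                  # x_{0|t}, ..., x_{N|t}

    def objective(u_seq):
        xs = predict(u_seq)
        return sum(h(xs[k], float(u_seq[k])) for k in range(N)) + Q_term(xs[N])

    constraints = [
        # terminal constraint (8): v^{j-1}(x_{N|t}) >= thresh
        {"type": "ineq", "fun": lambda u: v_prev(predict(u)[-1]) - thresh},
        # state constraints x_{k|t} in X for k = 0, ..., N-1
        {"type": "ineq",
         "fun": lambda u: np.array([safe_margin(x) for x in predict(u)[:-1]])},
    ]
    res = minimize(objective, np.zeros(N), method="SLSQP",
                   bounds=[(u_lo, u_hi)] * N, constraints=constraints)
    # Receding horizon: only the first input u^{*,j}_{0|t} is applied, cf. (10).
    return float(res.x[0]), res
```

At run time the returned input is applied to the plant, the resulting successor state becomes $x^j_{t+1}$, and the step is repeated until a predicted state enters $\mathcal{T}$.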
As to the practical computation of the cost $Q^j(\cdot, \pi_{\hat{u}^{j-1}}): \mathcal{R}^{j-1} \to \mathbb{R}$, we give a detailed introduction in Subsection 3.2.

*Figure 1: An illustration of the continuous (reach-avoid) sets in our RAMPC method and the discrete states in the LMPC method; the green region denotes a target set.*

Let

$$
u^{*,j}_{0:N|t} = \{u^{*,j}_{i|t}\}_{i=0}^{N-1} \quad \text{and} \quad x^{*,j}_{0:N|t} = \{x^{*,j}_{i|t}\}_{i=0}^{N} \tag{9}
$$

be the optimal solution to (7) at time $t$ and $J^{\mathrm{RAMPC},j}_{t \to t+N}(x^j_t)$ the corresponding optimal cost. Then, at time $t$ of iteration $j$, the first element $u^j(t) = u^{*,j}_{0|t}$ of $u^{*,j}_{0:N|t}$ is applied to system (1), and the state of system (1) becomes

$$
x^j_{t+1} = x^{*,j}_{1|t} = f(x^j(t), u^j(t)), \tag{10}
$$

where $x^j(t) = x^{*,j}_{0|t}$; this state is then used to update the initial state in (7) for subsequent computations. Once there exist $t \in \mathbb{N}$ and $l \in \mathbb{N}$ such that $x^j_{l|t} \in \mathcal{T}$, we terminate the computations in this iteration: from the state $x^{*,j}_{0|t}$ on, we apply the control actions $\{u^{*,j}_{i|t}\}_{i=0}^{l-1}$ successively to system (1).

Let $\pi^j = \{u^j(t)\}_{0 \le t \le L_j - 1}$ be the control policy applied to system (1) by solving optimization (7). The resulting state-control pair $\{(x^j(i), u^j(i))\}_{0 \le i \le L_j - 1}$, where $x^j(L_j) \in \mathcal{T}$, is thereby obtained. We can conclude that the performance of $\pi^j$ is no worse than that of $\pi^{j-1}$, i.e., $J^{j-1}(x_0) = \sum_{i=0}^{L_{j-1}-1} h(x^{j-1}(i), u^{j-1}(i)) \ge \sum_{i=0}^{L_j-1} h(x^j(i), u^j(i)) = J^j(x_0)$. This conclusion is justified by Theorem 3.

Under Assumption 2, Theorem 2 shows that the RAMPC (7) is feasible and that system (1) with controllers computed by solving (7) satisfies the reach-avoid property, and Theorem 3 shows that the $j$-th iteration cost $J^j(x_0)$ is non-increasing as $j$ increases in RAMPC (7). Their proofs can be found in the appendix of the extended version [Ren et al., 2023].

**Assumption 2.** At each iteration $0 \le j \le K$ the computed reach-avoid set $\mathcal{R}^j = \{x \in X \mid v^j(x) > 0\}$ obtained via solving (5) is non-empty. In addition, we assume that it includes the state trajectory $\{x^j(i)\}_{i=0}^{L_j-1}$. (This assumption can be realized by interpolating the feedback controller $\hat{u}^j(\cdot): X \to U$ such that $\hat{u}^j(x^j(i)) = u^j(i)$, $i = 0, \ldots, L_j - 1$. The requirement mainly serves our theoretical analysis, since it guarantees that our algorithm iteratively improves the policy (Theorem 3); in practice it is not indispensable for efficient computation.)

**Theorem 2.** For each iteration $j \le K$, the RAMPC (7) is feasible for $1 \le t \le L_j - 1$; moreover, system (1) with controllers obtained by solving RAMPC (7) reaches the target set $\mathcal{T}$ within the time interval $[0, \log_\lambda \frac{M}{v^0(x_0)}]$ while satisfying all state and input constraints.

**Theorem 3.** Consider system (1) with controllers obtained by solving (7). Then the iteration cost $J^j(x_0)$ is non-increasing in the iteration index $j$.

### 3.2 Estimating Terminal Costs via Scenario Optimization

In practice, the exact terminal cost $Q^j(\cdot, \pi_{\hat{u}^{j-1}}): \mathcal{R}^{j-1} \to \mathbb{R}$ is challenging, if not impossible, to obtain. In this subsection, we present a linear programming method based on scenario optimization [Calafiore and Campi, 2006] to compute an approximation of the terminal cost $Q^j(\cdot, \pi_{\hat{u}^{j-1}}): \mathcal{R}^{j-1} \to \mathbb{R}$ in the probably approximately correct (PAC) sense.

Let $\{x_i\}_{i=1}^{\mathcal{N}}$ be independent samples extracted from $\mathcal{R}^{j-1} = \{x \in X \mid v^{j-1}(x) > 0\}$ according to the uniform distribution $\mathbb{P}$, and let $Q^j(x_i, \pi_{\hat{u}^{j-1}})$ be the corresponding cost of the roll-out from $x_i$ under the controller $\pi_{\hat{u}^{j-1}}$. This cost can simply be computed by summing up the stage cost along the realized closed-loop trajectory until the target set $\mathcal{T}$ is reached. Finally, using the set of sampled states and corresponding costs $\{(x_i, Q^j(x_i, \pi_{\hat{u}^{j-1}}))\}_{i=1}^{\mathcal{N}}$, we approximate the terminal cost $Q^j(\cdot, \pi_{\hat{u}^{j-1}}): \mathcal{R}^{j-1} \to \mathbb{R}$ using scenario optimization [Calafiore and Campi, 2006].
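The data-generation step just described (uniform samples from $\mathcal{R}^{j-1}$ and roll-out costs under $\pi_{\hat{u}^{j-1}}$) can be sketched as follows. Rejection sampling over a bounding box is merely one simple way to draw the samples, and all function and argument names are assumptions of this sketch.

```python
import numpy as np

def sample_terminal_costs(f, h, u_hat, v_prev, in_safe, in_target,
                          box_lo, box_hi, n_samples, max_steps=5000, seed=0):
    """Draw states (approximately) uniformly from R^{j-1} = {x in X | v^{j-1}(x) > 0}
    by rejection sampling over a bounding box of X, and compute the roll-out cost
    Q^j(x_i, pi_{u_hat}) by simulating the closed loop until the target set is hit."""
    rng = np.random.default_rng(seed)
    data, attempts = [], 0
    while len(data) < n_samples and attempts < 1000 * n_samples:
        attempts += 1
        x = rng.uniform(box_lo, box_hi)
        if not in_safe(x) or v_prev(x) <= 0:     # reject: not in the reach-avoid set
            continue
        cost, xk = 0.0, x.copy()
        for _ in range(max_steps):
            if in_target(xk):
                data.append((x, cost))           # sample state and its roll-out cost
                break
            u = u_hat(xk)
            cost += h(xk, u)
            xk = f(xk, u)
    return data                                  # [(x_i, Q^j(x_i, pi)), ...]
```

The resulting pairs are exactly the data fed into the linear program (11) described next.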
A linearly parameterized model template $Q^j_a(c_1, \ldots, c_l, x)$, $l \ge 1$, is utilized, which is linear in the unknown parameters $c = (c_1, \ldots, c_l)$ but can be nonlinear over $x$. Then we construct the following linear program over $c$ based on the family of given data $\{(x_i, Q^j(x_i, \pi_{\hat{u}^{j-1}}))\}_{i=1}^{\mathcal{N}}$:

$$
\begin{aligned}
\min_{c,\, \delta} \; & \delta \\
\text{s.t.} \quad & Q^j_a(c, x_i) - Q^j(x_i, \pi_{\hat{u}^{j-1}}) \le \delta, \\
& Q^j(x_i, \pi_{\hat{u}^{j-1}}) - Q^j_a(c, x_i) \le \delta, \\
& -U_c \le c_k \le U_c, \; k = 1, \ldots, l, \quad i = 1, \ldots, \mathcal{N}, \quad 0 \le \delta,
\end{aligned} \tag{11}
$$

where $U_c \ge 0$ is a pre-specified upper bound for the parameters $c_k$, $k = 1, \ldots, l$. Denote the optimal solution to (11) by $(c^*, \delta^*)$. The discrepancy between $Q^j_a(c^*, x)$ and $Q^j(x, \pi_{\hat{u}^{j-1}})$ is formally characterized by two parameters: the error probability $\epsilon \in (0, 1)$ and the confidence level $\beta \in (0, 1)$.

**Theorem 4.** Let $(c^*, \delta^*)$ be an optimal solution to (11), $\epsilon \in (0, 1)$, $\beta \in (0, 1)$ and $\mathcal{N} \ge \frac{2}{\epsilon}\big(\ln \frac{1}{\beta} + l + 1\big)$. Then, with at least $1 - \beta$ confidence,

$$
\mathbb{P}\big(\{x \in \mathcal{R}^{j-1} \mid |Q^j_a(c^*, x) - Q^j(x, \pi_{\hat{u}^{j-1}})| \le \delta^*\}\big) \ge 1 - \epsilon.
$$

Thus, we relax optimization (7) into the following form:

$$
\begin{aligned}
J^{\mathrm{RAMPC},j}_{a,\, t \to t+N}(x^j_t) = \min_{u_{k|t}} \; & \sum_{k=0}^{N-1} h(x_{k|t}, u_{k|t}) + Q^{j-1}_a(c^*, x_{N|t}) \\
\text{s.t.} \quad & x_{k+1|t} = f(x_{k|t}, u_{k|t}), \; u_{k|t} \in U, \; x_{k|t} \in X, \\
& \quad k = 0, 1, \ldots, N-1, \\
& v^{j-1}(x_{N|t}) \ge \begin{cases} \lambda^N v^{j-1}(x_0), & \text{if } t = 0, \\ \lambda\, v^{j-1}(x_{N|t-1}), & \text{otherwise}, \end{cases} \\
& x_{0|t} = x^j_t,
\end{aligned} \tag{12}
$$

where $Q^{j-1}_a(c^*, x_{N|t})$ is the approximate terminal cost at the state $x_{N|t}$, obtained via solving (11).

**Proposition 1.** With at least $1 - \beta$ confidence, $|J^{\mathrm{RAMPC},j}_{a,\, t \to t+N}(x^j_t) - J^{\mathrm{RAMPC},j}_{t \to t+N}(x^j_t)| \le \delta^*$ holds with probability larger than or equal to $1 - \epsilon$, i.e., $\mathbb{P}\big(\{x^j_t \in \mathcal{R}^{j-1} \mid |J^{\mathrm{RAMPC},j}_{a,\, t \to t+N}(x^j_t) - J^{\mathrm{RAMPC},j}_{t \to t+N}(x^j_t)| \le \delta^*\}\big) \ge 1 - \epsilon$.

Relying on (12), we summarize our RAMPC algorithm for solving optimization (4) in Algorithm 2.
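Before stating Algorithm 2, here is a sketch of how the fitting problem (11) reduces to a standard linear program solvable with `scipy.optimize.linprog`; the feature map, the bound $U_c$ and the data are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def fit_terminal_cost(features, xs, ys, Uc):
    """Solve (11): minimize delta over (c, delta) subject to
       |phi(x_i)^T c - y_i| <= delta,  -Uc <= c_k <= Uc,  delta >= 0,
    where y_i = Q^j(x_i, pi) are the sampled roll-out costs and
    `features(x)` returns the basis vector phi(x) of the chosen template."""
    Phi = np.array([features(x) for x in xs])      # N x l basis matrix
    y = np.asarray(ys, dtype=float)
    n, l = Phi.shape
    obj = np.zeros(l + 1)                          # decision vector z = [c, delta]
    obj[-1] = 1.0                                  # objective: minimize delta
    # Phi c - delta <= y   and   -Phi c - delta <= -y  (the two rows of (11))
    A_ub = np.vstack([np.hstack([Phi, -np.ones((n, 1))]),
                      np.hstack([-Phi, -np.ones((n, 1))])])
    b_ub = np.concatenate([y, -y])
    bounds = [(-Uc, Uc)] * l + [(0.0, None)]       # box on c, delta >= 0
    sol = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    c_star, delta_star = sol.x[:l], sol.x[-1]
    return c_star, delta_star

# Hypothetical quadratic template in (p, v), matching the form used in Example 1.
quad_features = lambda x: np.array([1.0, x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])
```

The pair $(c^*, \delta^*)$ is then plugged into the approximate terminal cost used in (12).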
**Algorithm 2** The RAMPC algorithm for solving (4).

- **Require:** system (1) with an initial state $x_0$; a safe set $X$; a target set $\mathcal{T}$; a control input set $U$; a feasible state-control trajectory $\{(x^0(i), u^0(i))\}_{i=0}^{L_0-1}$, of which the corresponding cost is $J^0(x_0) = \sum_{i=0}^{L_0-1} h(x^0(i), u^0(i))$; factor $\lambda$ and bound $M$ in (5); iteration number $K$; prediction horizon $N$ and termination threshold $\xi > 0$; probability error $\epsilon$ and confidence level $\delta$ in the PAC approximations.
- **Ensure:** Return $J^*(x_0)$.
- **for** $j = 0 : K$ **do**
  1. apply interpolation techniques to compute $\hat{u}^j(\cdot): X \to U$ for the $j$-th state-control trajectory $\{(x^j(i), u^j(i))\}_{i=0}^{L_j-1}$;
  2. obtain $\mathcal{R}^j = \{x \in X \mid v^j(x) > 0\}$ via solving (5) with $\hat{u}(x) = \hat{u}^j(x)$;
  3. compute the PAC terminal cost $Q^j_a(\cdot): \mathcal{R}^j \to \mathbb{R}$ via solving optimization (11);
  4. solve the MPC optimization (12) with $v^j$ to obtain a state-control pair $\{(x^{j+1}(i), u^{j+1}(i))\}_{i=0}^{L_{j+1}-1}$ and the cost $J^{j+1}(x_0)$;
  - **if** $|J^{j+1}(x_0) - J^j(x_0)| \le \xi$ **then** return $J^*(x_0) = J^{j+1}(x_0)$; **end if**
- **end for**

## 4 Examples

In this section we evaluate our RAMPC algorithm, i.e., Alg. 2, and compare it with the LMPC algorithm in [Rosolia and Borrelli, 2017] on several examples. All the experiments were run in MATLAB 2022b on a 12th Gen Intel(R) Core(TM) i9-12900K CPU with 64 GB of RAM. Constraint (5) is solved by encoding it into sum-of-squares constraints, which are handled by the semi-definite programming solver MOSEK; the nonlinear programs (7) and the mixed-integer nonlinear programs in the LMPC algorithm in [Rosolia and Borrelli, 2017] are solved using YALMIP [Lofberg, 2004]. In addition, at each iteration $j \ge 1$ we take $\hat{u}^j(\cdot)$ to be a linear function that interpolates the $j$-th state-control trajectory $\{(x^j(i), u^j(i))\}_{i=0}^{L_j-1}$ (a least-squares sketch of such a fit is given after Example 1). The configuration parameters in Alg. 2 for all examples are shown in Table 1.

| Example | $\lambda$ | $M$ | $N$ | $K$ | $\xi$ | $\delta$ | $\epsilon$ |
|---|---|---|---|---|---|---|---|
| Ex. 1 | 1.001 | 1 | 4 | 8 | 0.1 | 0.1 | 0.1 |
| Ex. 2 | 1.001 | 1 | 3 | 8 | 0.1 | 0.05 | 0.05 |
| Ex. 3 | 1.001 | 1 | 4 | 6 | 0.002 | 0.1 | 0.1 |

*Table 1: Configuration parameters in Alg. 2 for the examples.*

**Example 1.** Consider the drone system from [Rosolia and Borrelli, 2017],

$$
x_{t+1} = \begin{bmatrix} 1 & dt \\ 0 & 1 \end{bmatrix} x_t + B u_t,
$$

where $x_t = (p_t, v_t)^\top$, $p_t$ and $v_t$ are respectively the position and velocity of the drone at time $t$, $u_t$ is the control input and $dt$ is the control interval, which is equal to 0.1. We assume $X = \{(p, v) \mid p^2/8^2 - 1 \le 0\}$, initial state $x_0 = (4, 6)^\top$, target set $\mathcal{T} = \{(p, v) \mid p^2 + v^2 - 0.5^2 \le 0\}$ and control input set $U = \{u \mid -0.5 \le u \le 0.5\}$. The set $Y = \{(p, v) \mid p^2/8^2 - 2 \le 0\}$ in constraint (5) is obtained by solving a semi-definite program as in [Zhao et al., 2022]. The cost in optimization (4) is $h(x, u) = \|x\|_2^2 + \|u\|_2^2$. For simplicity of presentation we do not list the initial state-control trajectory, which is a long sequence; it is induced by the feedback controller $\hat{u}^0(p, v) = -0.04 p - 0.1 v$. For scenario optimization (11), we use a template of the polynomial form $Q^j_a(c, p, v) = c_1 + c_2 p + c_3 v + c_4 p v + c_5 p^2 + c_6 v^2$ and $\mathcal{N} = 207$ sampled points, the number of which is computed via Theorem 4.

We compare our RAMPC algorithm with the LMPC one in [Rosolia and Borrelli, 2017] under the same prediction horizon and termination conditions (i.e., $N$ and $\xi$). Since this system is linear, instead of using a set of explored discrete states, the terminal constraint in the LMPC algorithm can be constructed from their convex hull, resulting in nonlinear programs instead of mixed-integer nonlinear programs, as mentioned in [Rosolia and Borrelli, 2017]. The trajectories generated by both methods are shown in Figure 2. The iteration costs and computation times at each iteration of RAMPC and LMPC are presented in Table 2. It is observed that the iteration costs in our algorithm drop more quickly than those in the LMPC one. Also, our RAMPC algorithm is more efficient: RAMPC terminates after 2 iterations with a computation time of 6.3304 s, whereas LMPC converges more slowly and takes 5 iterations with a computation time of 8.2259 s.

| Iteration | RAMPC cost | LMPC cost | RAMPC time (s) | LMPC time (s) |
|---|---|---|---|---|
| 0 | 369.8267 | 369.8267 | - | - |
| 1 | 215.1007 | 222.2482 | 3.3204 | 3.1741 |
| 2 | 215.1002 | 217.3604 | 3.0100 | 1.5634 |
| 3 | - | 215.3008 | - | 1.1980 |
| 4 | - | 215.1003 | - | 1.1023 |
| 5 | - | 215.1044 | - | 1.1881 |
| total time | | | 6.3304 | 8.2259 |

*Table 2: Iteration cost and computation time for Example 1.*

*Figure 2: Trajectories in Example 1 (green, black and cyan curves: $\mathcal{T}$, $X$ and $\mathcal{R}^0$).*
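As mentioned above, $\hat{u}^j$ is taken to be a linear function of the state fitted to the $j$-th state-control trajectory. The paper does not spell out the exact fitting procedure, so the following least-squares sketch of an affine feedback fit is only one simple way to realize step 1 of Alg. 2.

```python
import numpy as np

def fit_linear_feedback(xs, us):
    """Least-squares fit of an affine feedback u = K x + b to a state-control
    trajectory {(x^j(i), u^j(i))}; one simple realization of step 1 of Alg. 2."""
    X = np.hstack([np.asarray(xs, dtype=float),
                   np.ones((len(xs), 1))])          # states with a constant feature
    U = np.asarray(us, dtype=float).reshape(len(us), -1)
    theta, *_ = np.linalg.lstsq(X, U, rcond=None)   # (n + 1) x m coefficient matrix
    K, b = theta[:-1].T, theta[-1]
    return lambda x: K @ np.asarray(x, dtype=float) + b
```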
**Example 2.** Consider the Euler discretization, with time step $\Delta t = 0.05$, of the controlled reversed-time Van der Pol oscillator in [Drazin and Drazin, 1992]:

$$
\begin{aligned}
x_1(t+1) &= x_1(t) - \Delta t\, x_2(t), \\
x_2(t+1) &= x_2(t) - \Delta t \big((1 - x_1^2(t))\, x_2(t) - x_1(t)\big) + u(t),
\end{aligned}
$$

with the safe set $X$, initial state $x_0 = (1.2, 1)^\top$, target set $\mathcal{T} = \{(x_1, x_2) \mid x_1^2 + x_2^2 - 0.2^2 \le 0\}$ and input set $U = \{u \mid -0.5 \le u \le 0.5\}$. In (5), the set $Y \supseteq X$ is computed by solving a semi-definite program as in [Zhao et al., 2022]. In addition, the cost $h(x, u) = \|x\|_2^2 + \|u\|_2^2$, where $x = (x_1, x_2)^\top$, is adopted. The initial trajectory is induced by the controller $\hat{u}^0(x) \equiv 0$. For scenario optimization (11), we use a template of the polynomial form $Q^j_a(c, x) = c_1 + c_2 x_1 + c_3 x_2 + c_4 x_1 x_2 + c_5 x_1^2 + c_6 x_2^2$ and $\mathcal{N} = 428$ samples, the number of which is computed via Theorem 4. LMPC uses the same prediction horizon and termination conditions (i.e., $N$ and $\xi$).

Some trajectories generated by our RAMPC algorithm and by the LMPC algorithm are visualized in Figure 3. Our RAMPC algorithm terminates after the third iteration, whereas the LMPC one has not terminated after eight iterations. Figure 3 shows that the initial trajectory and the trajectory generated by the first iteration of the LMPC algorithm are too close to be distinguishable within the first few time steps, which further indicates that the controller generated in the first iteration of the LMPC algorithm may not improve upon the performance induced by the initial control policy. Besides, it is observed that the eighth trajectory from the LMPC algorithm is not smooth and exhibits strong oscillations, which are undesirable in practice. In contrast, the trajectory generated by our algorithm is smoother.

Table 3 summarizes the iteration costs and the computation times at each iteration of our RAMPC algorithm and the LMPC one. The iteration cost induced by the initial trajectory is 64.3087; after three iterations our RAMPC algorithm reduces the cost to 29.2824 with a computation time of 31.2841 s. In contrast, the LMPC algorithm needs more than 8 iterations and almost 3 hours of computation to achieve the same cost. This striking contrast demonstrates that our RAMPC algorithm can solve optimization (4) much more efficiently in some cases.

| Iteration | RAMPC cost | LMPC cost | RAMPC time (s) | LMPC time (s) |
|---|---|---|---|---|
| 0 | 64.3087 | 64.3087 | - | - |
| 1 | 30.9693 | 40.0714 | 10.6583 | 45.0 |
| 2 | 29.2919 | 39.3270 | 10.2248 | 59.4 |
| 3 | 29.2824 | 38.5239 | 10.4010 | 558.5 |
| 4 | - | 37.6402 | - | 1298 |
| 5 | - | 36.7088 | - | 1653.3 |
| 6 | - | 35.7561 | - | 2033.6 |
| 7 | - | 35.2567 | - | 2115.2 |
| 8 | - | 34.9022 | - | 1706.1 |
| total time | | | 31.2841 | 9469.1 |

*Table 3: Iteration cost and computation time for Example 2.*

*Figure 3: Trajectories in Example 2 (green, black and cyan curves: $\mathcal{T}$, $X$ and $\mathcal{R}^0$).*
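As a small illustration, the initial feasible trajectory of this example (under $\hat{u}^0 \equiv 0$) can be reproduced by simply iterating the Euler dynamics as written above until the target set is reached; the loop cap and the printed summary are assumptions of this sketch.

```python
import numpy as np

DT = 0.05  # Euler time step of Example 2

def vdp_step(x, u):
    """One Euler step of the reversed-time Van der Pol dynamics given above."""
    x1, x2 = x
    return np.array([x1 - DT * x2,
                     x2 - DT * ((1.0 - x1 ** 2) * x2 - x1) + u])

# Initial trajectory induced by the zero controller u_hat0(x) = 0.
x, traj = np.array([1.2, 1.0]), []
while float(x @ x) > 0.2 ** 2 and len(traj) < 5000:   # stop once T is reached
    traj.append(x)
    x = vdp_step(x, 0.0)
cost = sum(float(s @ s) for s in traj)                 # h(x, 0) = ||x||^2 summed
print(f"steps = {len(traj)}, roll-out cost = {cost:.4f}")
```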
**Example 3.** Consider the Euler discretization, with time step $\Delta t = 0.1$, of the controlled 3D Van der Pol oscillator from [Korda et al., 2014]:

$$
\begin{aligned}
x_1(t+1) &= x_1(t) + \Delta t \big({-2}\, x_2(t)\big), \\
x_2(t+1) &= x_2(t) + \Delta t \big(0.8 x_1(t) - 2.1 x_2(t) + x_3(t) + 10 x_1^2(t) x_2(t)\big), \\
x_3(t+1) &= x_3(t) + \Delta t \big({-x_3(t)} + x_3^3(t)\big) + u(t),
\end{aligned}
$$

with the safe set $X = \{(x_1, x_2, x_3) \mid x_1^2 + x_2^2 + x_3^2 - 0.5^2 \le 0\}$, initial state $x_0 = (0.2, 0.4, 0.1)^\top$, target set $\mathcal{T} = \{(x_1, x_2, x_3) \mid x_1^2 + x_2^2 + x_3^2 - 0.1^2 \le 0\}$ and input set $U = \{u \mid -2 \le u \le 2\}$. The set $Y = \{(x_1, x_2, x_3) \mid x_1^2 + x_2^2 + x_3^2 - 0.5 \le 0\}$ is used in (5) and is obtained by solving a semi-definite program, as described in [Zhao et al., 2022]. We use the cost function $h(x, u) = \|x\|_2^2 + \|u\|_2^2$, where $x = (x_1, x_2, x_3)^\top$. The initial trajectory is generated by the controller $\hat{u}^0(x) \equiv 0$. The template of the polynomial form $Q^j_a(c, x) = c_1 + c_2 x_1 + c_3 x_2 + c_4 x_3 + c_5 x_1 x_2 + c_6 x_1 x_3 + c_7 x_2 x_3 + c_8 x_1^2 + c_9 x_2^2 + c_{10} x_3^2$ and the sample number $\mathcal{N} = 267$ computed through Theorem 4 are adopted for scenario optimization (11). The prediction horizon $N$ and termination threshold $\xi$ are the same for LMPC.

In this example, LMPC is unable to achieve the desired task of reducing the iteration cost and generating a suitable controller. In contrast, our RAMPC algorithm demonstrates good performance and generates an effective controller. Table 4 summarizes the iteration costs and computation times at each iteration of our RAMPC algorithm. The initial trajectory and the trajectories generated in the first and last iterations of our RAMPC algorithm are visualized in Figure 4.

| Iteration | Iteration cost | Time cost (s) |
|---|---|---|
| 0 | 1.3489 | - |
| 1 | 0.8344 | 7.8186 |
| 2 | 0.8295 | 7.8022 |
| 3 | 0.8291 | 7.7416 |
| total time | | 23.3623 |

*Table 4: Iteration cost and computation time of the RAMPC algorithm for Example 3.*

*Figure 4: Trajectories in Example 3.*

**Discussion of configuration parameters.** Here we give a brief discussion of the configuration parameters $\lambda, M, K, \xi, N, \delta, \epsilon$ based on the experiments (some of which are presented in the appendix of the extended version [Ren et al., 2023]).

- The less conservative the computed reach-avoid set in each iteration is, the smaller the iteration cost. Therefore, $\lambda$ is recommended to be as close to one as possible, but it cannot equal one; for details, please refer to [Zhao et al., 2022].
- The upper bound $M$ can be any positive number, since if $v(x)$ satisfies constraint (5) with $M = M'$, then $v(x)/M'$ also satisfies constraint (5) with $M = 1$.
- $K$ and $\xi$, especially $\xi$, affect the number of iterations and thus the quality of the controllers. Generally, a larger $K$ and/or a smaller $\xi$ improves the control performance, but the computational burden increases.
- The approximation error for the terminal costs is formally characterized in the PAC sense using $\epsilon$ and $\delta$. These two parameters, together with the number of parameters $c$, determine the minimal number of samples for approximating the terminal costs according to Theorem 4. Smaller $\epsilon$ and $\delta$ lead to more accurate approximations in the PAC sense; however, regarding the overfitting issue and the computational burden of solving (11), too small $\epsilon$ and $\delta$ are not desirable in practice.
- The prediction horizon $N$ controls how far into the future the MPC predicts the system response. As in other MPC schemes, a longer horizon generally increases the control performance, but requires an increasingly powerful computing platform (due to solving (7) online), excluding certain control applications. In practice, an appropriate choice strongly depends on the computational resources and real-time requirements. In addition, it is worth remarking that the costs obtained by RAMPC are not sensitive to $N$ in some cases; for details, please refer to the experimental results in the appendix of the extended version [Ren et al., 2023].

## 5 Conclusion

In this paper we proposed a novel RAMPC algorithm for solving a reach-avoid optimization problem. The RAMPC algorithm is built upon MPC and reach-avoid analysis. Rather than solving computationally intractable mixed-integer nonlinear optimization online, it addresses computationally more tractable nonlinear optimization. Moreover, due to the incorporation of reach-avoid analysis, which expands the viability space, the RAMPC algorithm can provide better controllers within a smaller number of iterations.
Several numerical examples demonstrated the advantages of our RAMPC algorithm.

## Acknowledgments

This research was supported by the NSFC under grant No. 61836005, the CAS Project for Young Scientists in Basic Research under grant No. YSBR-040, the NSFC under grant No. 52272329, and the CAS Pioneer Hundred Talents Program.

## References

[Arimoto et al., 1984] Suguru Arimoto, Sadao Kawamura, and Fumio Miyazaki. Bettering operation of robots by learning. Journal of Robotic Systems, 1(2):123-140, 1984.

[Calafiore and Campi, 2006] Giuseppe Carlo Calafiore and Marco C. Campi. The scenario approach to robust control design. IEEE Transactions on Automatic Control, 51(5):742-753, 2006.

[Camacho and Alba, 2013] Eduardo F. Camacho and Carlos Bordons Alba. Model Predictive Control. Springer Science & Business Media, 2013.

[Chang and Gao, 2021] Ya-Chien Chang and Sicun Gao. Stabilizing neural control using self-learned almost Lyapunov critics. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation, pages 1803-1809. IEEE, 2021.

[de la Peña and Christofides, 2008] David Muñoz de la Peña and Panagiotis D. Christofides. Lyapunov-based model predictive control of nonlinear systems subject to data losses. IEEE Transactions on Automatic Control, 53(9):2076-2089, 2008.

[Drazin and Drazin, 1992] Philip G. Drazin and Philip Drazin Drazin. Nonlinear Systems. Number 10. Cambridge University Press, 1992.

[Ernst et al., 2008] Damien Ernst, Mevludin Glavic, Florin Capitanescu, and Louis Wehenkel. Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):517-529, 2008.

[Grandia et al., 2020] Ruben Grandia, Andrew J. Taylor, Andrew Singletary, Marco Hutter, and Aaron D. Ames. Nonlinear model predictive control of robotic systems with control Lyapunov functions. arXiv preprint arXiv:2006.01229, 2020.

[Kabzan et al., 2019] Juraj Kabzan, Lukas Hewing, Alexander Liniger, and Melanie N. Zeilinger. Learning-based model predictive control for autonomous racing. IEEE Robotics and Automation Letters, 4(4):3363-3370, 2019.

[Korda et al., 2014] Milan Korda, Didier Henrion, and Colin N. Jones. Controller design and region of attraction estimation for nonlinear dynamical systems. IFAC Proceedings Volumes, 47(3):2310-2316, 2014.

[Limon et al., 2005] Daniel Limon, Teodoro Alamo, and Eduardo F. Camacho. Enlarging the domain of attraction of MPC controllers. Automatica, 41(4):629-635, 2005.

[Limón et al., 2006] Daniel Limón, Teodoro Alamo, Francisco Salas, and Eduardo F. Camacho. On the stability of constrained MPC without terminal constraint. IEEE Transactions on Automatic Control, 51(5):832-836, 2006.

[Liu et al., 2020] Shenyu Liu, Daniel Liberzon, and Vadim Zharnitsky. Almost Lyapunov functions for nonlinear systems. Automatica, 113:108758, 2020.

[Lofberg, 2004] Johan Lofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation (IEEE Cat. No. 04CH37508), pages 284-289. IEEE, 2004.

[Luo and Ma, 2021] Yuping Luo and Tengyu Ma. Learning barrier certificates: Towards safe reinforcement learning with zero training-time violations. Advances in Neural Information Processing Systems, 34:25621-25632, 2021.

[Ma et al., 2021] Haitong Ma, Xiangteng Zhang, Shengbo Eben Li, Ziyu Lin, Yao Lyu, and Sifa Zheng.
Feasibility enhancement of constrained receding horizon control using generalized control barrier function. In Proceedings of the 4th IEEE International Conference on Industrial Cyber-Physical Systems, pages 551-557. IEEE, 2021.

[Mhaskar et al., 2006] Prashant Mhaskar, Nael H. El-Farra, and Panagiotis D. Christofides. Stabilization of nonlinear systems with state and control constraints using Lyapunov-based predictive control. Systems & Control Letters, 55(8):650-659, 2006.

[Michalska and Mayne, 1993] Hannah Michalska and David Q. Mayne. Robust receding horizon control of constrained nonlinear systems. IEEE Transactions on Automatic Control, 38(11):1623-1633, 1993.

[Mohamed et al., 2011] T. H. Mohamed, H. Bevrani, A. A. Hassan, and T. Hiyama. Decentralized model predictive based load frequency control in an interconnected power system. Energy Conversion and Management, 52(2):1208-1214, 2011.

[Ren et al., 2023] Dejin Ren, Wanli Lu, Jidong Lv, Lijun Zhang, and Bai Xue. Model predictive control with reach-avoid analysis. arXiv preprint arXiv:2305.08712, 2023.

[Rosolia and Borrelli, 2017] Ugo Rosolia and Francesco Borrelli. Learning model predictive control for iterative tasks. A data-driven control framework. IEEE Transactions on Automatic Control, 63(7):1883-1896, 2017.

[Scokaert and Rawlings, 1999] Pierre O. M. Scokaert and James B. Rawlings. Feasibility issues in linear model predictive control. AIChE Journal, 45(8):1649-1659, 1999.

[Tempo et al., 2013] Roberto Tempo, Giuseppe Calafiore, and Fabrizio Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems: With Applications. Springer, 2013.

[Thomas et al., 2021] Garrett Thomas, Yuping Luo, and Tengyu Ma. Safe reinforcement learning by imagining the near future. Advances in Neural Information Processing Systems, 34:13859-13869, 2021.

[Verschueren et al., 2014] Robin Verschueren, Stijn De Bruyne, Mario Zanon, Janick V. Frasch, and Moritz Diehl. Towards time-optimal race car driving using nonlinear MPC in real-time. In Proceedings of the 53rd IEEE Conference on Decision and Control, pages 2505-2510. IEEE, 2014.

[Wen and Topcu, 2018] Min Wen and Ufuk Topcu. Constrained cross-entropy method for safe reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.

[Wu et al., 2019] Zhe Wu, Fahad Albalawi, Zhihao Zhang, Junfeng Zhang, Helen Durand, and Panagiotis D. Christofides. Control Lyapunov-barrier function-based model predictive control of nonlinear systems. Automatica, 109:108508, 2019.

[Yao and Shekhar, 2021] Ye Yao and Divyanshu Kumar Shekhar. State of the art review on model predictive control (MPC) in heating ventilation and air-conditioning (HVAC) field. Building and Environment, 200:107952, 2021.

[Zeng et al., 2021] Jun Zeng, Bike Zhang, and Koushil Sreenath. Safety-critical model predictive control with discrete-time control barrier function. In Proceedings of the 2021 American Control Conference, pages 3882-3889. IEEE, 2021.

[Zhao et al., 2022] Changyuan Zhao, Shuyuan Zhang, Lei Wang, and Bai Xue. Inner-approximating robust reach-avoid sets for discrete-time polynomial dynamical systems. IEEE Transactions on Automatic Control, 2022.

[Zheng and Morari, 1995] Alex Zheng and Manfred Morari. Stability of model predictive control with mixed constraints. IEEE Transactions on Automatic Control, 40(10):1818-1823, 1995.