Published as a conference paper at ICLR 2020

INFINITE-HORIZON DIFFERENTIABLE MODEL PREDICTIVE CONTROL

Sebastian East1,2, Marco Gallieri1, Jonathan Masci1, Jan Koutník1 & Mark Cannon2
1NNAISENSE, Lugano, Switzerland
2Department of Engineering Science, University of Oxford, Oxford, UK
{sebastian.east,mark.cannon}@eng.ox.ac.uk
{marco,jonathan,jan}@nnaisense.com

ABSTRACT

This paper proposes a differentiable linear quadratic Model Predictive Control (MPC) framework for safe imitation learning. The infinite-horizon cost is enforced using a terminal cost function obtained from the discrete-time algebraic Riccati equation (DARE), so that the learned controller can be proven to be stabilizing in closed-loop. A central contribution is the derivation of the analytical derivative of the solution of the DARE, thereby allowing the use of differentiation-based learning methods. A further contribution is the structure of the MPC optimization problem: an augmented Lagrangian method ensures that the MPC optimization is feasible throughout training whilst enforcing hard constraints on state and input, and a pre-stabilizing controller ensures that the MPC solution and derivatives are accurate at each iteration. The learning capabilities of the framework are demonstrated in a set of numerical studies.

1 INTRODUCTION

Imitation Learning (IL, Osa et al., 2018) aims at reproducing an existing control policy by means of a function approximator and can be used, for instance, to hot-start reinforcement learning. Effective learning and generalisation to unseen data are paramount to IL success, especially in safety-critical applications.

Model Predictive Control (MPC, Maciejowski, 2000; Camacho & Bordons, 2007; Rawlings & Mayne, 2009; Kouvaritakis & Cannon, 2015; Gallieri, 2016; Borrelli et al., 2017; Raković & Levine, 2019) is the most successful advanced control methodology for systems with hard safety constraints. At each time step, a finite-horizon forecast is made from a predictive model of the system and the optimal actions are computed, generally relying on convex constrained Quadratic Programming (QP, Boyd & Vandenberghe, 2004; Bemporad et al., 2000). Stability of the MPC in closed loop with the physical system requires the solution of a simpler unconstrained infinite-horizon control problem (Mayne et al., 2000), which results in a value function (terminal cost and constraint) and a candidate terminal controller to be accounted for in the MPC forecasting. For Linear Time Invariant (LTI) models and quadratic costs, this means solving (offline) a Riccati equation (Kalman, 2001) or a linear matrix inequality (Boyd et al., 1994). Under these conditions, an MPC controller will effectively control a system, up to a certain accuracy, provided that uncertainties in the model dynamics are limited (Limon et al., 2009).

Inaccuracies in the MPC predictions can reduce its effectiveness (and robustness) as the forecast diverges from the physical system trajectory over long horizons. This is particularly critical in applications with both short- and long-term dynamics, and it is generally addressed, for instance in robust MPC (Richards, 2004; Raković et al., 2012), by using a controller to pre-stabilise the predictions.

This paper presents an infinite-horizon differentiable linear quadratic MPC that can be learned using gradient-based methods.
In particular, the learning method uses an MPC controller where the terminal cost and terminal policy are the solution of an unconstrained infinite-horizon Linear Quadratic Regulator (LQR). A closed-form solution for the derivative of the Discrete-time Algebraic Riccati Equation (DARE) associated with the LQR is presented so that the stationary solution of the forward pass is fully differentiable. This method allows analytical results from control theory to be used to determine the stabilizing properties of the learned controller when implemented in closed-loop. Once the unconstrained LQR is computed, the predictive model is pre-stabilised using a linear state-feedback controller to improve the conditioning of the QP and the numerical accuracy of the MPC solution and gradients. The proposed algorithm successfully learns an MPC with both local stability and intrinsic robustness guarantees under small model uncertainties.

Contributions  This paper provides a framework for correctly learning an infinite-horizon, LTI quadratic MPC using recent developments in differentiable QPs (Amos & Kolter, 2017) and principles from optimal control (Blanchini & Miani, 2007). A primary contribution is that the Discrete-time Algebraic Riccati Equation (DARE) is used to provide infinite-horizon optimality and stability, and an analytical derivative of the solution of the DARE is derived so that differentiation-based optimization can be used for training. This connects known results on MPC stability (Limon et al., 2003; 2009) and on infinite-horizon optimality (Scokaert & Rawlings, 1998) to imitation learning (Osa et al., 2018). A further contribution is the MPC control formulation: a pre-stabilizing linear state-feedback controller is implemented from the solution of the DARE, and the total control input is then obtained as a perturbation of the feedback control law from the solution of a convex QP. The pre-stabilizing controller ensures that the QP is well conditioned and promotes a highly accurate global solution, which in turn ensures that the gradients calculated in the backward pass are accurate. Additionally, an augmented Lagrangian penalty method is used to enforce constraints on state and control input. This approach ensures that the hard constraints are strictly enforced if the penalty term is sufficiently large, and also guarantees that the MPC problem is feasible throughout the training process. These contributions are in contrast to Amos et al. (2018), which did not consider state constraints, and implemented a differential dynamic programming (Tassa et al., 2014) method to solve the MPC optimization for which convergence could not be guaranteed. The framework is implemented on a set of second order mass-spring-damper systems and a vehicle platooning model, where it is demonstrated that the infinite-horizon cost can be learned and the hard constraints can be guaranteed using a short finite prediction horizon.

Notation  $I_n$ := the $n \times n$ identity matrix. $O_{m \times n}$ := the $m \times n$ matrix of zeros. $0_n$ := a vector of $n$ zeros. $1_n$ := a vector of $n$ ones. All inequalities $\le$ and $\ge$ are considered element-wise in the context of vectors. $\rho(A)$ := the largest absolute eigenvalue of a given matrix $A$. $\mathrm{vec} : \mathbb{R}^{m \times n} \to \mathbb{R}^{mn}$ is defined as $\mathrm{vec}([c_1 \cdots c_n]) := (c_1^\top, \dots, c_n^\top)^\top$, i.e. the columns of a matrix stacked into a vector. For a matrix $A \in \mathbb{R}^{m \times n}$, the permutation matrix $V_{m,n} \in \mathbb{R}^{mn \times mn}$ is implicitly defined by $V_{m,n}\,\mathrm{vec}(A) := \mathrm{vec}(A^\top)$. The Kronecker product, $\otimes$, is defined as in (Magnus & Neudecker, 1999, pp. 440).
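The vectorisation and commutation-matrix identities above are used repeatedly in the DARE derivative of Section 3; the following minimal NumPy sketch (not part of the paper, for illustration only) checks them numerically:

```python
import numpy as np

def vec(A):
    """Stack the columns of A into a vector (column-major order)."""
    return A.reshape(-1, order="F")

def commutation_matrix(m, n):
    """Permutation matrix V_{m,n} satisfying V_{m,n} @ vec(A) == vec(A.T) for A of shape (m, n)."""
    V = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # vec(A)[j*m + i] = A[i, j]  and  vec(A.T)[i*n + j] = A[i, j]
            V[i * n + j, j * m + i] = 1.0
    return V

# Quick check of the identities used throughout the paper.
m, n = 3, 2
A = np.random.randn(m, n)
V = commutation_matrix(m, n)
assert np.allclose(V @ vec(A), vec(A.T))

# vec(A X B) = (B^T kron A) vec(X): the identity behind the DARE differentials.
X = np.random.randn(n, n)
B = np.random.randn(n, m)
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))
```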
2 DIFFERENTIABLE MPC

Linear quadratic MPC  This paper considers linear time invariant systems of the form
$$x_{t+dt} = Ax_t + Bu_t, \qquad (1)$$
where $x_t \in \mathbb{R}^n$ is the system state, $u_t \in \mathbb{R}^m$ is the control input, $A \in \mathbb{R}^{n \times n}$ is the state transition matrix, $B \in \mathbb{R}^{n \times m}$ is the input matrix, $t \in \mathbb{R}$ is the time, and $dt \in \mathbb{R}$ is the timestep (assumed constant). The control problem for this system is to determine the sequence of values of $u_t$ that achieves a desired level of performance (e.g. stability, frequency response, etc.), and when the system is subject to hard constraints on control input, $u_t \in \mathcal{U}$, and state, $x_t \in \mathcal{X}$ (or a combination of both), a well studied framework for controller synthesis is MPC. The principle of MPC is that the system's control input and state are optimized over a finite prediction horizon, then the first element of the obtained control sequence is implemented at the current time step and the process is repeated ad infinitum. For linear MPC it is common to use a quadratic stage cost and box constraints on state and control ($\underline{x} \le x_k \le \bar{x}$ and $\underline{u} \le u_k \le \bar{u}$, where $\underline{u} \le 0 \le \bar{u}$), so that at each time index $t$ the vector of optimized control variables $\hat{u}^*$ is obtained from
$$\begin{aligned}
\hat{u}^*_{0:N} = \operatorname*{argmin}_{\hat{u}}\ & \sum_{k=0}^{N-1}\tfrac{1}{2}\hat{u}_k^\top R\hat{u}_k + \sum_{k=1}^{N-1}\tfrac{1}{2}\hat{x}_k^\top Q\hat{x}_k + \tfrac{1}{2}\hat{x}_N^\top Q_N\hat{x}_N + k_u\sum_{k=0}^{N-1}1_m^\top r_k + k_x\sum_{k=1}^{N}1_n^\top s_k \\
\text{s.t.}\ & \hat{x}_0 = x_t, \quad \hat{x}_{k+1} = A\hat{x}_k + B\hat{u}_k, \quad k \in \{0,\dots,N-1\}, \\
& \underline{u} - r_k \le \hat{u}_k \le \bar{u} + r_k \ \text{and}\ r_k \ge 0, \quad k \in \{0,\dots,N-1\}, \\
& \underline{x} - s_k \le \hat{x}_k \le \bar{x} + s_k \ \text{and}\ s_k \ge 0, \quad k \in \{1,\dots,N\},
\end{aligned} \qquad (2)$$
where $\hat{u}_{0:N}$ is the predicted control trajectory, $\hat{x}$ is the predicted state trajectory, $R \in \mathbb{R}^{m \times m} \succ 0$ represents the stage cost on control input, $Q \in \mathbb{R}^{n \times n} \succeq 0$ represents the stage cost on state, $Q_N \in \mathbb{R}^{n \times n} \succeq 0$ represents the terminal cost on state, $N \in \mathbb{N}$ is the prediction horizon, $r_k \in \mathbb{R}^m$ are slack variables for the control constraint, $s_k \in \mathbb{R}^n$ are slack variables for the state constraint, and $k_u > 0$ and $k_x > 0$ represent the cost of control and state constraint violations. The variables $s$ and $r$ enforce the box constraints on state and control using the augmented Lagrangian method (Nocedal & Wright, 2006, §17.2), and it can be shown that for sufficiently high $k_x$ and $k_u$ the constraints $\underline{x} \le x_k \le \bar{x}$ and $\underline{u} \le u_k \le \bar{u}$ can be exactly guaranteed (Kerrigan & Maciejowski, 2000) (i.e. $s = r = 0$). The benefit of this approach is that it ensures that the MPC optimization is feasible at each iteration of the learning process, whilst still ensuring that the constraints are 'hard'. To close the MPC control loop, at each timestep, $t$, the first element of the optimized control sequence, $\hat{u}^*_0$, is implemented as $u_t$.

Pre-stabilised MPC  If the control input is decomposed into $u_t = Kx_t + \delta u_t$, where $K \in \mathbb{R}^{m \times n}$ is a stabilizing linear state-feedback matrix and $\delta u_t$ is a perturbation to the feedback control, system (1) becomes
$$x_{t+dt} = (A + BK)x_t + B\delta u_t, \qquad (3)$$
and problem (2) becomes
$$\begin{aligned}
\delta\hat{u}^*_{0:N} = \operatorname*{argmin}_{\delta\hat{u}}\ & \sum_{k=0}^{N-1}\tfrac{1}{2}(K\hat{x}_k + \delta\hat{u}_k)^\top R(K\hat{x}_k + \delta\hat{u}_k) + \sum_{k=1}^{N-1}\tfrac{1}{2}\hat{x}_k^\top Q\hat{x}_k + \tfrac{1}{2}\hat{x}_N^\top Q_N\hat{x}_N + k_u\sum_{k=0}^{N-1}1_m^\top r_k + k_x\sum_{k=1}^{N}1_n^\top s_k \\
\text{s.t.}\ & \hat{x}_0 = x_t, \quad \hat{x}_{k+1} = (A + BK)\hat{x}_k + B\delta\hat{u}_k, \quad k \in \{0,\dots,N-1\}, \\
& \underline{u} - r_k \le K\hat{x}_k + \delta\hat{u}_k \le \bar{u} + r_k \ \text{and}\ r_k \ge 0, \quad k \in \{0,\dots,N-1\}, \\
& \underline{x} - s_k \le \hat{x}_k \le \bar{x} + s_k \ \text{and}\ s_k \ge 0, \quad k \in \{1,\dots,N\},
\end{aligned} \qquad (4)$$
so that $\hat{u}^*_0 = Kx_t + \delta\hat{u}^*_0$ is implemented as $u_t$. Using this decomposition, system (3) controlled with the solution of (4) is precisely equal to system (1) controlled with the solution of (2), but problem (4) is preferable from a computational standpoint if $A$ is open-loop unstable (i.e. $\rho(A) > 1$) and $N$ is large, as this can lead to poor conditioning of the matrices defined in Appendix A. This is important in the context of differentiable MPC, as if $A$ is being learned then there may be no bounds on its eigenvalues at any given iteration.
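As a concrete illustration of the soft-constrained problem (2), the following CVXPY sketch (not from the paper; the problem data are illustrative) builds the finite-horizon QP with the slack variables r, s and penalty weights k_u, k_x, and solves it with OSQP:

```python
import cvxpy as cp
import numpy as np

# Illustrative problem data (not the values used in the paper's experiments).
n, m, N = 2, 1, 6
A = np.array([[1.0, 0.2], [-0.2, 0.96]])
B = np.array([[0.02], [0.2]])
Q, R, QN = np.eye(n), 2.0 * np.eye(m), np.eye(n)
x_lb, x_ub = np.array([-1.0, -10.0]), np.array([1.0, 10.0])
u_lb, u_ub = np.array([-0.5]), np.array([0.5])
k_u, k_x = 100.0, 100.0
x_t = np.array([0.0, 3.0])

x = cp.Variable((n, N + 1))
u = cp.Variable((m, N))
r = cp.Variable((m, N), nonneg=True)   # input-constraint slacks
s = cp.Variable((n, N), nonneg=True)   # state-constraint slacks (k = 1..N)

cost = 0.5 * cp.quad_form(x[:, N], QN)
constr = [x[:, 0] == x_t]
for k in range(N):
    cost += 0.5 * cp.quad_form(u[:, k], R) + k_u * cp.sum(r[:, k]) + k_x * cp.sum(s[:, k])
    if k >= 1:
        cost += 0.5 * cp.quad_form(x[:, k], Q)
    constr += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],
               u_lb - r[:, k] <= u[:, k], u[:, k] <= u_ub + r[:, k],
               x_lb - s[:, k] <= x[:, k + 1], x[:, k + 1] <= x_ub + s[:, k]]

prob = cp.Problem(cp.Minimize(cost), constr)
prob.solve(solver=cp.OSQP, eps_abs=1e-10, eps_rel=1e-10)
u0 = u.value[:, 0]   # first element of the optimized sequence, applied as u_t
```

With sufficiently large k_u and k_x the slacks return to zero at the optimum, so the box constraints are enforced exactly while the problem remains feasible for any x_t.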
MPC derivative  Problems (2) and (4) can be rearranged into the QP form (details in Appendix A)
$$z^* = \operatorname*{argmin}_{z}\ \tfrac{1}{2}z^\top Hz + q^\top z \quad \text{s.t.} \quad l \le Mz \le u. \qquad (5)$$
When $z^*$ is uniquely defined by (5), it can also be considered as the solution of an implicit function defined by the Karush-Kuhn-Tucker (KKT) conditions, and in Amos & Kolter (2017) it was demonstrated that it is possible to differentiate through this function to obtain the derivatives of $z^*$ with respect to the parameters $H$, $q$, $l$, $M$, and $u$.¹ The MPC controller can then be used as a layer in a neural network, and backpropagation can be used to determine the derivatives of an imitation cost function with respect to the MPC parameters $Q$, $R$, $A$, $B$, $\underline{u}$, $\bar{u}$, $\underline{x}$, $\bar{x}$, $k_x$ and $k_u$.

¹Note that (5) differs from the form presented in Amos & Kolter (2017), and is instead the form of problem solved by the OSQP solver used in this paper. Appendix B demonstrates how to differentiate (5) using the solution returned by OSQP.

Imitation Learning  A possible use case of the derivative of a model predictive controller is imitation learning, where a subset of {cost function, system dynamics, constraints} is learned from observations of a system being controlled by an 'expert'. Imitation learning can be performed by minimizing the loss
$$\mathcal{L} = \frac{1}{T}\sum_{t=0}^{T-1}\left(\left\|u_{t:t+Ndt} - \hat{u}^*_{0:N}(x_t)\right\|_2^2 + \beta\left\|\hat{w}_t\right\|_2^2\right), \qquad (6)$$
where $u_t$ is the measured control input, $\hat{u}^*_{0:N}(x_t)$ is the full MPC solution, and $\beta \ge 0$ is a hyperparameter. It is assumed that both the learning algorithm and the MPC controller have completely precise measurements of both the state and control input. The first term of (6) is the control imitation loss, and the second term penalises the one-step-ahead prediction error $\hat{w}_t = Ax_t + Bu_t - x_{t+dt}$. In practice, the prediction error loss might not be needed for the MPC to be learned correctly; however, its use can be instrumental for stability, as discussed in the next section.
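A minimal PyTorch sketch of loss (6); `mpc_layer` is a hypothetical handle to the differentiable MPC controller (its name and signature are not from the paper):

```python
import torch

def imitation_loss(mpc_layer, A, B, x_seq, u_seq, N, beta=0.0):
    """Imitation loss (6): control-matching term plus one-step prediction error.

    mpc_layer(x) is assumed to be a differentiable function (e.g. built from the
    QP and DARE derivatives of this paper) returning the MPC control trajectory
    of shape (N, m) for a state x; the name and interface are illustrative.
    A, B: torch tensors of the (learned) model; x_seq: (T+1, n) observed states;
    u_seq: (T+N, m) observed expert controls."""
    T = x_seq.shape[0] - 1
    loss = 0.0
    for t in range(T):
        u_hat = mpc_layer(x_seq[t])                          # (N, m) predicted controls
        u_target = u_seq[t:t + N]                            # expert controls over the horizon
        w_hat = A @ x_seq[t] + B @ u_seq[t] - x_seq[t + 1]   # one-step model error
        loss = loss + (u_target - u_hat).pow(2).sum() + beta * w_hat.pow(2).sum()
    return loss / T
```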
3 TERMINAL COST FOR INFINITE HORIZON

Terminal cost  The infinite-horizon discrete-time Linear Quadratic Regulator (LQR, Kalman, 2001) is given by the state feedback gain
$$K = -(R + B^\top PB)^{-1}B^\top PA, \qquad (7)$$
where $P$ is obtained as a solution of the DARE
$$P = A^\top PA - A^\top PB(R + B^\top PB)^{-1}B^\top PA + Q. \qquad (8)$$
The principle of the approach presented in this paper is that the MPC controller (2,4) is implemented with $Q_N = P$. Proposition 1 summarises the relevant properties of the proposed MPC, building on classic MPC results from Scokaert & Rawlings (1998); Limon et al. (2003; 2009).

Proposition 1. Consider the MPC problem (4) with $Q_N = P$, where $P$ and $K$ solve (7-8). Define $V^*_N(x)$ as the optimal objective in (4) with $x_t = x$. Denote the optimal stage cost with $x_t = x$ as $\ell(x, \hat{u}^*_0(x)) = x^\top Qx + \hat{u}^*_0(x)^\top R\hat{u}^*_0(x)$. Then, for the closed-loop system, it follows that:

1. For any $N \ge 1$, there exists a closed and bounded set, $\bar\Omega_N$, such that, if $x_0 \in \bar\Omega_N$ and $\hat{w}_t = 0$, $\forall t \ge 0$, then the MPC solution is infinite-horizon optimal for any $\bar N \ge N$.

2. There exist positive scalars $d$, $\alpha$, such that, for any $N \ge 1$, if $\hat{w}_t = 0$, $\forall t \ge 0$, then the MPC constraints are feasible, $\forall t \ge 0$, and the origin is asymptotically stable $\forall x_0 \in \Gamma_N$, with
$$\Gamma_N = \{x \in \mathbb{R}^n : V^*_N(x) \le \ell(x, \hat{u}^*_0(x)) + (N-1)d + \alpha\}. \qquad (9)$$

3. There exists a scalar, $\mu > 0$, such that, for any $N \ge 1$, the MPC constraints are robustly feasible, $\forall t \ge 0$, and the system is Input-to-State Stable (ISS) $\forall x_0 \in \Gamma_N$ given an additive model error, $\hat{w}$, such that $\|\hat{w}_t\| \le \mu$, $\forall t \ge 0$. In other words:
$$V^*_N(x_{t+dt}) - V^*_N(x_t) \le -\ell(x_t, \hat{u}^*_0(x_t)) + \sigma(\|\hat{w}_t\|),$$
for some strictly increasing, bounded function, $\sigma(\cdot)$, with $\sigma(0) = 0$.

4. The QP matrices, $H$, $M$, and the vector $q$, in (5), have finite norms for any $N \ge 1$.

Proof. Proof of Proposition 1 is given in Appendix C.

Implications  Proposition 1 has some important implications. First, point 1 implies that there exists a state-dependent finite horizon length, N, which is sufficient to make the MPC problem infinite-horizon optimal. This N can be upper bounded for a closed and bounded set of feasible states, $\bar\Omega_N$. Scokaert & Rawlings (1998) proposed an iterative search that increases the horizon until optimality is verified; a similar algorithm is discussed in Appendix E, where learning is completed with a large horizon which is then iteratively reduced afterwards, although it is not implemented in this paper. Points 2 and 3 state that the MPC can provide stability and constraint satisfaction, hence safety, if the model error is small. This also applies to small errors in the QP solution. Finally, point 4 states that the QP matrices have finite norm when the system dynamics are pre-stabilised using the LQR gain², so the MPC problem is well conditioned and can be solved reliably to high accuracy, even over long horizons. If the open-loop system is unstable then the terms of the matrices in Appendix A for the standard form are unbounded, so the QP solution may be poorly conditioned and the result inaccurate for long horizons. This can in turn invalidate the results of Amos & Kolter (2017), which assume that the KKT conditions are exactly satisfied in order to compute the gradients.

²Note that any stabilising gain would be acceptable for the purpose of QP conditioning only.

DARE Derivative  In order to implement $Q_N = P$ in a differentiable imitation learning framework such as that presented in Section 2, the solution of the DARE is differentiated as follows.

Proposition 2. Let $P$ be the stabilizing solution of (8), and assume that $Z_1^{-1}$ and $(R + B^\top PB)^{-1}$ exist. Then the Jacobians of the implicit function defined by (8) are given by
$$\frac{\partial\,\mathrm{vec}\,P}{\partial\,\mathrm{vec}\,A} = Z_1^{-1}Z_2, \quad \frac{\partial\,\mathrm{vec}\,P}{\partial\,\mathrm{vec}\,B} = Z_1^{-1}Z_3, \quad \frac{\partial\,\mathrm{vec}\,P}{\partial\,\mathrm{vec}\,Q} = Z_1^{-1}Z_4, \quad \frac{\partial\,\mathrm{vec}\,P}{\partial\,\mathrm{vec}\,R} = Z_1^{-1}Z_5,$$
where $Z_1, \dots, Z_5$ are defined by
$$\begin{aligned}
Z_1 &:= I_{n^2} - (A^\top \otimes A^\top)\left[I_{n^2} - (PBM_2B^\top \otimes I_n) - (I_n \otimes PBM_2B^\top) + (PB \otimes PB)(M_2 \otimes M_2)(B^\top \otimes B^\top)\right] \\
Z_2 &:= (V_{n,n} + I_{n^2})(I_n \otimes A^\top M_1) \\
Z_3 &:= (A^\top \otimes A^\top)\left[(PB \otimes PB)(M_2 \otimes M_2)(I_{m^2} + V_{m,m})(I_m \otimes B^\top P) - (I_{n^2} + V_{n,n})(PBM_2 \otimes P)\right] \\
Z_4 &:= I_{n^2} \\
Z_5 &:= (A^\top \otimes A^\top)(PB \otimes PB)(M_2 \otimes M_2),
\end{aligned}$$
and $M_1$, $M_2$, $M_3$ are defined by
$$M_1 := P - PBM_2B^\top P, \qquad M_2 := M_3^{-1}, \qquad M_3 := R + B^\top PB.$$

Proof. The proof of Proposition 2 is given in Appendix D.

The sensitivity of the DARE solution has been investigated in the context of robustness to perturbations in the input matrices, e.g. Sun (1998); Konstantinov et al. (1993), and the analytical derivative of the continuous-time algebraic Riccati equation was derived in Brewer (1977) by differentiating the exponential of the Hamiltonian matrix, but to the best of the authors' knowledge this is the first presentation of an analytic derivative of the DARE using the differential calculus approach of Magnus & Neudecker (1999).
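A direct NumPy/SciPy implementation of Proposition 2 (a sketch, not the paper's code), using scipy.linalg.solve_discrete_are for the forward pass and the Z matrices above for the Jacobians:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def commutation_matrix(m, n):
    V = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            V[i * n + j, j * m + i] = 1.0
    return V

def dare_jacobians(A, B, Q, R):
    """Jacobians of vec(P) w.r.t. vec(A), vec(B), vec(Q), vec(R) (Proposition 2)."""
    n, m = B.shape
    P = solve_discrete_are(A, B, Q, R)          # stabilizing DARE solution (8)
    M3 = R + B.T @ P @ B
    M2 = np.linalg.inv(M3)
    M1 = P - P @ B @ M2 @ B.T @ P
    In2, Vnn, Vmm = np.eye(n * n), commutation_matrix(n, n), commutation_matrix(m, m)
    AA, PB = np.kron(A.T, A.T), P @ B
    PBMB = PB @ M2 @ B.T
    Z1 = In2 - AA @ (In2 - np.kron(PBMB, np.eye(n)) - np.kron(np.eye(n), PBMB)
                     + np.kron(PB, PB) @ np.kron(M2, M2) @ np.kron(B.T, B.T))
    Z2 = (Vnn + In2) @ np.kron(np.eye(n), A.T @ M1)
    Z3 = AA @ (np.kron(PB, PB) @ np.kron(M2, M2) @ (np.eye(m * m) + Vmm) @ np.kron(np.eye(m), B.T @ P)
               - (In2 + Vnn) @ np.kron(PB @ M2, P))
    Z5 = AA @ np.kron(PB, PB) @ np.kron(M2, M2)
    Z1inv = np.linalg.inv(Z1)
    return Z1inv @ Z2, Z1inv @ Z3, Z1inv, Z1inv @ Z5   # Z4 = I, so dP/dQ = Z1^{-1}

# Illustrative finite-difference check on a random (generically stabilizable) system.
rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((3, 3))
B = rng.standard_normal((3, 2))
Q, R = np.eye(3), np.eye(2)
JA, _, _, _ = dare_jacobians(A, B, Q, R)
eps = 1e-6
E = np.zeros((3, 3)); E[0, 1] = eps                  # perturb A[0, 1], i.e. vec(A) index 3
num = (solve_discrete_are(A + E, B, Q, R) - solve_discrete_are(A, B, Q, R)) / eps
print(np.max(np.abs(JA[:, 3] - num.reshape(-1, order="F"))))   # should be small (~1e-5)
```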
Algorithm 1: Infinite-horizon MPC Learning
  In: M \ S, N > 0, β ≥ 0, N_e > 0.   Out: S
  for i = 0 ... N_e do
    Forward pass:
      (K, P) ← DARE (7-8) solution
      Q_N ← P
      û*_{0:N} ← MPC QP (3-5) solution
      L ← imitation loss (6)
    Backward pass:
      Differentiate loss (6)
      Differentiate MPC QP solution, û*_{0:N}, using Appendix B
      Differentiate DARE solution, (P, K), using Proposition 2
    Update step:
      S ← gradient-based step

Algorithm overview  Algorithm 1 presents the overall procedure for learning a subset, S, of the MPC controller parameters, M = {A, B, Q, R, $\underline{x}$, $\bar{x}$, $\underline{u}$, $\bar{u}$, $k_u$, $k_x$}, with the key steps of the forward and backward passes of a gradient-based optimization method. In each forward pass the MPC terminal cost matrix, Q_N, and the pre-stabilizing controller, K, are set from the solution of the DARE; the DARE and MPC QP solutions are then differentiated in the backward pass to obtain the gradients. Note that the horizon, N, is not differentiable, and that learning the entire set M simultaneously is challenging in general.

4 NUMERICAL EXPERIMENTS

In this section the performance of the algorithm was demonstrated through numerical experiments in two test cases: firstly on a set of second order mass-spring-damper models, to provide a performance baseline in an easily interpretable setting, and then on a vehicle platooning problem, to investigate a higher-dimensional real-world application.

4.1 MASS-SPRING-DAMPER

Model & Expert  Expert data was generated using a mass-spring-damper model parameterized by a mass, $m > 0$, damping coefficient, $c \in \mathbb{R}$, stiffness, $k \in \mathbb{R}$, and timestep $dt > 0$, where
$$A = \exp(A_c\,dt), \quad A_c = \begin{bmatrix} 0 & 1 \\ -\tfrac{k}{m} & -\tfrac{c}{m} \end{bmatrix}, \quad B = (A - I_n)A_c^{-1}B_c, \quad B_c = \begin{bmatrix} 0 \\ \tfrac{1}{m} \end{bmatrix},$$
so that $x_t \in \mathbb{R}^2$ is the position and velocity of the mass, and $u_t \in \mathbb{R}$ is a force applied to the mass.

Table 1: Damping coefficient c used to generate the seven imitation systems.
  System:  1     2     3     4     5     6     7
  c:       1     0.5   0.1  -0.1  -0.3  -0.5  -0.6

Seven models were created with m = 1, k = 1, and dt = 0.2, and c was varied as shown in Table 1 to affect the open-loop stability of the models (c > 0 stable, c < 0 unstable). The expert data was then generated by simulating each of the systems from the initial condition $x_0 = (0, 3)$ in closed-loop with an infinite-horizon MPC controller (i.e. the horizon was increased until the open-loop state predictions matched the closed-loop response), using $Q = \mathrm{diag}([1, 1])$, $R = 2$, $(\underline{u}, \bar{u}) = (-\infty, 0.5)$, $\underline{x} = (-1, -\infty)$, and $\bar{x} = (1, \infty)$. The constraint set was chosen so that the constraints on both state and control input were strongly active at the solution whilst ensuring that the expert MPC optimization was feasible. The values $k_u = k_x = 100$ were found to be sufficient to enforce the hard constraints and were used for all experiments. It is important to note that the approach of Amos et al. (2018) cannot be used reliably for even this simple example, as it does not consider state constraints, and when hard constraints are added to the method it fails in general because the optimization problem becomes infeasible in the forward pass at some time t.
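The discretization above is the standard zero-order-hold construction; a minimal SciPy sketch (illustrative only, using the nominal parameters m = k = 1, dt = 0.2):

```python
import numpy as np
from scipy.linalg import expm

def msd_discrete(m=1.0, c=1.0, k=1.0, dt=0.2):
    """Exact zero-order-hold discretization of the mass-spring-damper model."""
    Ac = np.array([[0.0, 1.0], [-k / m, -c / m]])
    Bc = np.array([[0.0], [1.0 / m]])
    A = expm(Ac * dt)
    B = (A - np.eye(2)) @ np.linalg.inv(Ac) @ Bc
    return A, B

A, B = msd_discrete()                         # system 1 from Table 1
print(np.abs(np.linalg.eigvals(A)).max())     # spectral radius < 1: open-loop stable
```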
Figure 1: Mass-spring-damper. Imitation loss and model loss at each iteration of the training process. Top row: imitation loss. Bottom row: model loss given by $\|\mathrm{vec}(A) - \mathrm{vec}(A_j)\|_2^2$, where $A_j$ is the learned model at iteration $j$ and $A$ is the correct model. The model loss was not used as part of the training process, and is shown only to indicate whether the model is converging correctly.

Learning  The learner and expert shared all system and controller information apart from the state transition matrix A, which was learned, and the MPC horizon length, which was implemented as each of N ∈ {2, 3, 6} in three separate experiments. A was initialized with the correct state transition matrix plus a uniformly distributed pseudorandom perturbation in the interval [-0.5, 0.5] added to each element. The learner was supplied with the first 50 elements of the closed-loop state trajectory and corresponding controls as a batch of inputs, and was trained to minimize the imitation loss (6) with β = 0, i.e. the state dynamics were learned using predicted control trajectories only, and the state transitions were not made available to the learner (this is the same approach used in Amos et al., 2018). The experiments were implemented in PyTorch 1.2.0 using the built-in Adam optimizer (Kingma & Ba, 2014) for 1000 steps with default parameters. The MPC optimization problems were solved for the expert and learner using OSQP (Stellato et al., 2017) with settings (eps_abs=1E-10, eps_rel=1E-10, eps_prim_inf=1E-10, eps_dual_inf=1E-10).

Results  Figure 1 shows the imitation and model loss at each of the 1000 optimization iterations for each of the tested horizon lengths. It can be seen that for all of the generated systems the imitation loss converges to a low value, although this is a local minimum in general. In most cases the learned model converges to a close approximation of the real model, although as the problem is non-convex this cannot be guaranteed, and there are some cases in which the model does not converge correctly. This occurred exclusively for N = 2, where neither system 2 nor system 4 converges to the correct dynamics. Additionally, it can be seen that both the imitation loss and the model loss converge faster as the prediction horizon is increased. This suggests that a longer learning horizon improves the learning capabilities of the method, but there is not sufficient data to demonstrate this conclusively.

Figure 2: Mass-spring-damper. Closed-loop trajectories using the expert and learned controllers. Trajectories are only shown for x1 (position), but x2 (velocity) can be inferred. Expert controllers are shown with solid lines, and learned controllers with dotted lines. The hard state constraints are shown in the red regions.

To test generalization performance, each of the systems was re-initialized with initial condition $x_0 = (0.5, 2)$ and simulated in closed loop using the learned controller for each horizon length. The results are compared in Figure 2 against the same systems controlled with an infinite-horizon MPC controller. The primary observation is that as the learned MPC horizon is increased to N = 6, the closed-loop trajectories converge to the expert trajectories, indicating that the infinite-horizon cost has been learned (when using the infinite-horizon cost with no model mismatch or disturbance, the predicted MPC trajectory is exactly the same as the closed-loop trajectory), and that the state constraints are guaranteed for N ≥ 4. Furthermore, it can be seen that the learned controllers are stabilizing, even for the shortest horizon and the most unstable open-loop systems. This is also the case for systems 2 and 4, where the incorrect dynamics were learned, although in this case the state constraints are not guaranteed for N = 2.
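The generalization test above is a standard receding-horizon rollout; a minimal sketch (not from the paper), with `mpc_control` standing in as a hypothetical handle to the learned controller:

```python
import numpy as np

def rollout(A, B, mpc_control, x0, steps=50):
    """Receding-horizon simulation: apply the first element of the MPC solution
    at every step. `mpc_control(x)` is a placeholder for the learned controller
    (it should return the optimized control sequence for state x)."""
    x, xs, us = np.asarray(x0, dtype=float), [], []
    for _ in range(steps):
        u = np.atleast_1d(mpc_control(x)[0])   # first element of the optimized sequence
        xs.append(x.copy()); us.append(u.copy())
        x = A @ x + B @ u                      # closed-loop update (1)
    return np.array(xs), np.array(us)

# e.g. x_traj, u_traj = rollout(A, B, learned_mpc, x0=np.array([0.5, 2.0]))
```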
4.2 VEHICLE PLATOONING

Figure 3: Platoon model. $n_v$ vehicles with one degree of freedom each, where $y$ is the longitudinal displacement.

Model & Expert  Vehicle platoon control is a problem that has been studied using control theory (e.g. Zheng et al., 2019), but here it is demonstrated that a safe, stabilizing controller can be learned from examples of vehicles driving in formation. Figure 3 shows an illustration of a platoon of $n_v$ vehicles for which the objective is to stabilize the relative longitudinal positions of each vehicle to the steady-state conditions $y_i - y_{i-1} \to y_{ss}$ and $\dot{y}_i - \dot{y}_{i-1} \to 0$ $\forall i$, subject to the hard constraint that the relative position of the vehicles is never lower than a safe threshold, $y_i - y_{i-1} \ge \underline{y}$ $\forall i$, and that the vehicles' ability to brake and accelerate is constrained by $b \le \ddot{y}_i \le a$ $\forall i$, where $b < 0 < a$ (note that only the relative positions and velocities of the vehicles are considered, as the global position and velocity of the platoon can be controlled separately by adding an equal perturbation to each element of $y$). In Appendix F it is shown that this can be modelled as a discrete-time LTI system. Expert data was generated from the model with $n_v = 10$ vehicles, so that $x_t \in \mathbb{R}^{18}$ and $u_t \in \mathbb{R}^{10}$. 20 instances were generated using random feasible initial conditions with $y_{ss} = 30$ m and $\underline{y} = 10$ m, and then simulated for 20 s in time intervals of $dt = 0.7$ s with an infinite-horizon MPC controller, using $Q = I_n$ and $R = 2I_m$.
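A minimal sketch of the relative-displacement LTI model used here (derived in Appendix F), assuming each vehicle is a point mass with acceleration input; the exact state ordering is illustrative:

```python
import numpy as np

def platoon_model(nv=10, dt=0.7):
    """Discrete-time LTI model of the relative platoon dynamics (see Appendix F).

    State: relative displacements (y_i - y_{i-1} - y_ss) stacked above relative
    velocities; input: vehicle accelerations. The state ordering is illustrative."""
    p = nv - 1
    # Difference operator: row i maps accelerations to d^2/dt^2 (y_{i+1} - y_i).
    D = np.zeros((p, nv))
    for i in range(p):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    # Double-integrator structure, discretized assuming piecewise-constant input.
    A = np.block([[np.eye(p), dt * np.eye(p)], [np.zeros((p, p)), np.eye(p)]])
    B = np.vstack([0.5 * dt**2 * D, dt * D])
    return A, B

A, B = platoon_model()
print(A.shape, B.shape)   # (18, 18) (18, 10) for nv = 10
```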
Learning  The learner and expert shared all system and controller information apart from the cost matrices Q and R, which were learned, and the MPC horizon length, which was implemented as each of N ∈ {5, 10, 15, 20} in four separate experiments. The matrices Q and R were initialized as random diagonal matrices with each element uniformly distributed in the interval [0, 3], and the diagonal structure was maintained throughout training. 500 training iterations were used; otherwise the learning process (loss function, learning rate, etc.) was the same as in Section 4.1.

Figure 4: Vehicle platooning. Imitation loss and cost function loss at each iteration of the training process. Left: imitation loss. Right: cost function loss given by $\|\mathrm{vec}(Q) - \mathrm{vec}(Q_j)\|_2^2 + \|\mathrm{vec}(R) - \mathrm{vec}(R_j)\|_2^2$, where Q and R are the correct cost matrices and $Q_j$ and $R_j$ are the cost matrices at iteration $j$.

Results  Figure 4 shows the imitation and cost function losses at each of the 500 optimization iterations for each of the tested horizon lengths and initial conditions. As with the mass-spring-damper experiments, it is suggested that a longer prediction horizon improves training, as the imitation loss generally converges to a lower value for the examples with N ∈ {15, 20}, but only convergence to a local minimum is achieved in general. The cost error also does not converge in general (although better convergence is again observed for the longer horizon lengths); however, for this learning problem there is a manifold of matrices Q and R with the same minimizing argument, so divergence of the cost error does not necessarily indicate that the learned cost function is 'incorrect'. Furthermore, in this case the model is known exactly, so the closed-loop infinite-horizon properties can be obtained even without the correct cost function.

Figure 5 shows the model simulated from the same initial condition in closed loop using a learned controller for each of the horizon lengths, together with the error between the MPC state predictions and the ensuing closed-loop behaviour. All of the controllers are observed to successfully satisfy the hard constraints on vehicle separation, and all converge to the correct steady-state vehicle separation. The differences between the prediction capabilities of the controllers are highlighted by the state prediction errors, and it can be seen that for N = 20 the state predictions match the ensuing behaviour, indicating that the infinite-horizon cost is being used and that closed-loop stability is guaranteed, even without the use of a terminal constraint set. It is also demonstrated for N < 20 that the largest errors occur from predictions made at times when the state constraints are active, suggesting that these controllers deviate from their predictions to satisfy the constraints at later intervals.

Figure 5: Vehicle platooning. Closed-loop simulation and prediction error for all horizon lengths. Top row: closed-loop simulation, where each shaded region is the safe separation distance for each vehicle. Bottom row: prediction error given by $x_{[t:t+N]} - \hat{x}_t$, where $\hat{x}_t$ is the state trajectory predicted by the MPC at time $t$.

4.3 LIMITATIONS

The above approach is limited in scope to LTI systems, and a more comprehensive solution would cover linear time-varying systems (for which the MPC is still obtained from the solution of a QP). In this case the infinite-horizon cost cannot be obtained from the solution of the DARE, and the extension of the methods presented in this paper to time-varying or non-linear models is non-trivial (see Appendix G for further discussion). Additionally, the derivative of the DARE in Proposition 2 involves multiple Kronecker products and matrix inversions (including of an $n^2 \times n^2$ matrix) that do not scale well to large state and control dimensions, although the dynamics of physical systems can usually be reasonably approximated with only a few tens of variables, so this may not become an issue in practice. The algorithm also requires a stabilizing solution of the DARE to exist; theories for the existence of stabilizing solutions are non-trivial (e.g. Ran & Vreugdenhil, 1988), and it is not immediately obvious how to enforce their existence throughout the training process (stabilizability can be encouraged using the one-step-ahead term in (6)).

ACKNOWLEDGMENTS

The authors are grateful to Brandon Amos for providing support using his differentiable QP tool (https://github.com/locuslab/optnet) in the preliminary work for this project (all of the methods presented in this paper were developed independently).

REFERENCES

Brandon Amos and J. Zico Kolter. OptNet: Differentiable Optimization as a Layer in Neural Networks. arXiv:1703.00443 [cs, math, stat], March 2017. URL http://arxiv.org/abs/1703.00443.

Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter. Differentiable MPC for End-to-end Planning and Control. In Advances in Neural Information Processing Systems 31, pp. 8289-8300. Curran Associates, Inc., 2018.

A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos. The explicit solution of model predictive control via multiparametric quadratic programming. In Proceedings of the 2000 American Control Conference (ACC). IEEE, 2000. doi: 10.1109/acc.2000.876624.
Franco Blanchini and Stefano Miani. Set-Theoretic Methods in Control (Systems & Control: Foundations & Applications). Birkhäuser, 2007. ISBN 0817632557.

R. V. Bobiti. Sampling driven stability domains computation and predictive control of constrained nonlinear systems. PhD thesis, 2017. URL https://pure.tue.nl/ws/files/78458403/20171025_Bobiti.pdf.

Francesco Borrelli, Alberto Bemporad, and Manfred Morari. Predictive Control for Linear and Hybrid Systems. Cambridge University Press, 2017. ISBN 1107016886.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. ISBN 0521833787.

Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear Matrix Inequalities in System & Control Theory (Studies in Applied Mathematics, Volume 15). Society for Industrial & Applied Mathematics, 1994. ISBN 089871334X.

J. Brewer. The derivative of the Riccati matrix with respect to a matrix. IEEE Transactions on Automatic Control, 22(6):980-983, December 1977. doi: 10.1109/TAC.1977.1101656.

E. F. Camacho and C. Bordons. Model Predictive Control. Springer London, 2007. doi: 10.1007/978-0-85729-398-5.

Marco Gallieri. Lasso-MPC - Predictive Control with ℓ1-Regularised Least Squares. Springer International Publishing, 2016. doi: 10.1007/978-3-319-27963-3.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012. ISBN 9780521548236.

Vlad Ionescu and Martin Weiss. On computing the stabilizing solution of the discrete-time Riccati equation. Linear Algebra and its Applications, 174:229-238, 1992. ISSN 0024-3795. doi: 10.1016/0024-3795(92)90053-D.

Rudolf Kalman. Contribution to the theory of optimal control. Bol. Soc. Mat. Mexicana, 5, 2001.

Eric C. Kerrigan and Jan M. Maciejowski. Soft constraints and exact penalty functions in model predictive control. In Proc. UKACC International Conference (Control 2000), 2000.

H. K. Khalil. Nonlinear Systems. Pearson Education, 3rd edition, 2014.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2014.

M. M. Konstantinov, P. Petkov, and N. D. Christov. Perturbation analysis of the discrete Riccati equation. Kybernetika, 29(1):18-29, 1993.

B. Kouvaritakis and M. Cannon. Model Predictive Control: Classical, Robust and Stochastic. Advanced Textbooks in Control and Signal Processing, Springer, London, 2015.

D. Limon, T. Alamo, and E. F. Camacho. Stable constrained MPC without terminal constraint. In Proceedings of the 2003 American Control Conference, volume 6, pp. 4893-4898, 2003.

D. Limon, T. Alamo, D. M. Raimondo, D. Muñoz de la Peña, J. M. Bravo, A. Ferramosca, and E. F. Camacho. Input-to-State Stability: A Unifying Framework for Robust Model Predictive Control. In Nonlinear Model Predictive Control, pp. 1-26. Springer Berlin Heidelberg, 2009. doi: 10.1007/978-3-642-01094-1_1.

Jan Maciejowski. Predictive Control with Constraints. Prentice Hall, 2000. ISBN 0201398230.

Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley, second edition, 1999. ISBN 047198633X.

D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. M. Scokaert.
Constrained model predictive control: Stability and optimality. Automatica, 36(6):789-814, 2000.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer Verlag, 2006.

Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An Algorithmic Perspective on Imitation Learning. Foundations and Trends in Robotics, 7(1-2):1-179, 2018. ISSN 1935-8253. doi: 10.1561/2300000053.

S. V. Raković, B. Kouvaritakis, R. Findeisen, and M. Cannon. Homothetic tube model predictive control. Automatica, 48:1631-1638, 2012. doi: 10.1016/j.automatica.2012.05.003.

Saša V. Raković and William S. Levine (eds.). Handbook of Model Predictive Control. Springer International Publishing, 2019. doi: 10.1007/978-3-319-77489-3.

A. C. M. Ran and R. Vreugdenhil. Existence and comparison theorems for algebraic Riccati equations for continuous- and discrete-time systems. Linear Algebra and its Applications, 99:63-83, 1988. ISSN 0024-3795. doi: 10.1016/0024-3795(88)90125-5.

J. B. Rawlings and D. Q. Mayne. Model Predictive Control: Theory and Design. Nob Hill Publishing, 2009. ISBN 0975937707.

A. G. Richards. Robust Constrained Model Predictive Control. PhD thesis, MIT, 2004.

P. O. M. Scokaert and J. B. Rawlings. Constrained linear quadratic regulation. IEEE Transactions on Automatic Control, 43(8):1163-1169, 1998. doi: 10.1109/9.704994.

B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting solver for quadratic programs. arXiv e-prints, November 2017.

J. Sun. Sensitivity analysis of the discrete-time algebraic Riccati equation. Linear Algebra and its Applications, 275-276:595-615, 1998. ISSN 0024-3795. doi: 10.1016/S0024-3795(97)10017-9.

Y. Tassa, N. Mansard, and E. Todorov. Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1168-1175, May 2014. doi: 10.1109/ICRA.2014.6907001.

Y. Zheng, Y. Bian, S. Li, and S. E. Li. Cooperative control of heterogeneous connected vehicles with directed acyclic interactions. IEEE Intelligent Transportation Systems Magazine, 2019. doi: 10.1109/MITS.2018.2889654.

A MPC QUADRATIC PROGRAM

Problem (2) is equivalent to (5) with decision variable $z = (\hat{u}_{0:N-1}, r_{0:N-1}, s_{1:N})$ and
$$H = \begin{bmatrix} \bar{R} + \Psi^\top\bar{Q}\Psi & & \\ & O_{Nm \times Nm} & \\ & & O_{Nn \times Nn} \end{bmatrix}, \qquad q = \begin{bmatrix} \Psi^\top\bar{Q}\Phi x_t \\ k_u 1_{Nm} \\ k_x 1_{Nn} \end{bmatrix},$$
$$M = \begin{bmatrix} I_{Nm} & -I_{Nm} & \\ I_{Nm} & I_{Nm} & \\ & I_{Nm} & \\ \Psi & & -I_{Nn} \\ \Psi & & I_{Nn} \\ & & I_{Nn} \end{bmatrix}, \qquad l = \begin{bmatrix} -\infty \cdot 1_{Nm} \\ 1_N \otimes \underline{u} \\ 0_{Nm} \\ -\infty \cdot 1_{Nn} \\ 1_N \otimes \underline{x} - \Phi x_t \\ 0_{Nn} \end{bmatrix}, \qquad u = \begin{bmatrix} 1_N \otimes \bar{u} \\ +\infty \cdot 1_{Nm} \\ +\infty \cdot 1_{Nm} \\ 1_N \otimes \bar{x} - \Phi x_t \\ +\infty \cdot 1_{Nn} \\ +\infty \cdot 1_{Nn} \end{bmatrix},$$
where
$$\bar{Q} = \begin{bmatrix} Q & & & \\ & \ddots & & \\ & & Q & \\ & & & Q_N \end{bmatrix}, \quad \bar{R} = I_N \otimes R, \quad \Phi = \begin{bmatrix} A \\ A^2 \\ \vdots \\ A^N \end{bmatrix}, \quad \Psi = \begin{bmatrix} B & & & \\ AB & B & & \\ \vdots & & \ddots & \\ A^{N-1}B & \cdots & AB & B \end{bmatrix}$$
are of conformal dimensions. Using the above, problem (4) is then equivalent to (5) with $z = (\delta\hat{u}_{0:N-1}, r_{0:N-1}, s_{1:N})$ and
$$H = \begin{bmatrix} (\hat{K}\hat\Psi + I_{Nm})^\top\bar{R}(\hat{K}\hat\Psi + I_{Nm}) + \hat\Psi^\top\hat{Q}\hat\Psi & & \\ & O_{Nm \times Nm} & \\ & & O_{Nn \times Nn} \end{bmatrix}, \qquad q = \begin{bmatrix} \left(\hat{K}^\top\bar{R}(\hat{K}\hat\Psi + I_{Nm}) + \hat{Q}\hat\Psi\right)^\top\hat\Phi x_t \\ k_u 1_{Nm} \\ k_x 1_{Nn} \end{bmatrix},$$
$$M = \begin{bmatrix} \hat{K}\hat\Psi + I_{Nm} & -I_{Nm} & \\ \hat{K}\hat\Psi + I_{Nm} & I_{Nm} & \\ & I_{Nm} & \\ \Psi & & -I_{Nn} \\ \Psi & & I_{Nn} \\ & & I_{Nn} \end{bmatrix}, \qquad l = \begin{bmatrix} -\infty \cdot 1_{Nm} \\ 1_N \otimes \underline{u} - \hat{K}\hat\Phi x_t \\ 0_{Nm} \\ -\infty \cdot 1_{Nn} \\ 1_N \otimes \underline{x} - \Phi x_t \\ 0_{Nn} \end{bmatrix}, \qquad u = \begin{bmatrix} 1_N \otimes \bar{u} - \hat{K}\hat\Phi x_t \\ +\infty \cdot 1_{Nm} \\ +\infty \cdot 1_{Nm} \\ 1_N \otimes \bar{x} - \Phi x_t \\ +\infty \cdot 1_{Nn} \\ +\infty \cdot 1_{Nn} \end{bmatrix},$$
where
$$\Phi = \begin{bmatrix} (A + BK) \\ \vdots \\ (A + BK)^N \end{bmatrix}, \quad \Psi = \begin{bmatrix} B & & \\ (A + BK)B & B & \\ \vdots & & \ddots \\ (A + BK)^{N-1}B & \cdots & B \end{bmatrix}, \quad \hat{Q} = \begin{bmatrix} O_{n \times n} & \\ & \bar{Q} \end{bmatrix}, \quad \hat\Phi = \begin{bmatrix} I_n \\ \Phi \end{bmatrix}, \quad \hat\Psi = \begin{bmatrix} O_{n \times Nm} \\ \Psi \end{bmatrix}, \quad \hat{K} = \begin{bmatrix} I_N \otimes K & O_{Nm \times n} \end{bmatrix}$$
are of conformal dimensions.
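A sketch of how the condensed matrices for problem (2) can be assembled and passed to OSQP (illustrative only; the block ordering follows the un-stabilised form above, and the solver calls are commented out because the problem data must be supplied):

```python
import numpy as np
import scipy.sparse as sp
import osqp

def build_qp(A, B, Q, QN, R, x_lb, x_ub, u_lb, u_ub, k_u, k_x, N, x0):
    """Condensed QP (5) for problem (2); z = (u_0..u_{N-1}, r, s)."""
    n, m = B.shape
    # Prediction matrices Phi, Psi such that x_hat = Phi x0 + Psi u_hat.
    Phi = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(N)])
    Psi = np.zeros((N * n, N * m))
    for i in range(N):
        for j in range(i + 1):
            Psi[i*n:(i+1)*n, j*m:(j+1)*m] = np.linalg.matrix_power(A, i - j) @ B
    Qbar = sp.block_diag([Q] * (N - 1) + [QN]).toarray()
    Rbar = np.kron(np.eye(N), R)
    H = sp.block_diag([Rbar + Psi.T @ Qbar @ Psi,
                       np.zeros((N*m, N*m)), np.zeros((N*n, N*n))])
    q = np.concatenate([Psi.T @ Qbar @ Phi @ x0, k_u*np.ones(N*m), k_x*np.ones(N*n)])
    I_u, I_x, O = np.eye(N*m), np.eye(N*n), np.zeros
    M = np.block([
        [I_u, -I_u, O((N*m, N*n))], [I_u, I_u, O((N*m, N*n))], [O((N*m, N*m)), I_u, O((N*m, N*n))],
        [Psi, O((N*n, N*m)), -I_x], [Psi, O((N*n, N*m)), I_x], [O((N*n, N*m)), O((N*n, N*m)), I_x]])
    inf = np.inf
    l = np.concatenate([-inf*np.ones(N*m), np.tile(u_lb, N), np.zeros(N*m),
                        -inf*np.ones(N*n), np.tile(x_lb, N) - Phi @ x0, np.zeros(N*n)])
    u = np.concatenate([np.tile(u_ub, N), inf*np.ones(N*m), inf*np.ones(N*m),
                        np.tile(x_ub, N) - Phi @ x0, inf*np.ones(N*n), inf*np.ones(N*n)])
    return sp.csc_matrix(H), q, sp.csc_matrix(M), l, u

# Solve with OSQP at the tolerances used in the experiments, e.g.:
# H, q, M, l, u = build_qp(A, B, Q, QN, R, x_lb, x_ub, u_lb, u_ub, 100.0, 100.0, N, x0)
# prob = osqp.OSQP()
# prob.setup(H, q, M, l, u, eps_abs=1e-10, eps_rel=1e-10, verbose=False)
# res = prob.solve()
# u0 = res.x[:B.shape[1]]   # first control move
```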
B OSQP DERIVATIVES

OSQP solves quadratic programs of the form (5), and returns values for $z^*$, $y^*$, and $s^*$ that satisfy
$$Hz^* + q + M^\top y^* = 0, \qquad s^* \in \mathcal{C}, \qquad y^* \in N_{\mathcal{C}}(s^*),$$
(Stellato et al., 2017, §2), where $\mathcal{C}$ is the set $\{s : l \le s \le u\}$, and $N_{\mathcal{C}}$ is the normal cone of $\mathcal{C}$. The values of $y^*$ that are returned by the solver can be used to determine whether the constraints are strongly active at the solution: $y^*_i = 0$ indicates that the constraints $l_i \le M_iz^*$ and $M_iz^* \le u_i$ are inactive, $y^*_i > 0$ indicates that $M_iz^* \le u_i$ is strongly active, and $y^*_i < 0$ indicates that $l_i \le M_iz^*$ is strongly active. The solution can therefore be completely characterised by the KKT system
$$\begin{bmatrix} H & M_{\mathcal{U}}^\top & M_{\mathcal{L}}^\top \\ M_{\mathcal{U}} & & \\ M_{\mathcal{L}} & & \end{bmatrix}\begin{bmatrix} z^* \\ y^*_{\mathcal{U}} \\ y^*_{\mathcal{L}} \end{bmatrix} = \begin{bmatrix} -q \\ u_{\mathcal{U}} \\ l_{\mathcal{L}} \end{bmatrix}, \qquad (10)$$
where $\mathcal{U} = \{i : y^*_i > 0\}$ and $\mathcal{L} = \{i : y^*_i < 0\}$, and the notation $M_{\mathcal{S}}$ indicates a matrix consisting of the rows $i \in \mathcal{S}$ of a given matrix $M$, and $v_{\mathcal{S}}$ indicates a vector consisting of the elements $i \in \mathcal{S}$ of a given vector $v$. Equation (10) can then be differentiated using the techniques detailed in (Amos & Kolter, 2017, §3).

C PROOF OF PROPOSITION 1

Proof (Proposition 1). The first point follows from Scokaert & Rawlings (1998). The next two points of Proposition 1 stem from the results in Limon et al. (2003; 2009). In particular, the closed-loop is Lipschitz since the model is linear and the controller is the solution of a strictly convex QP. Moreover, the LQR provides a contractive terminal set. The final point follows from the fact that $(A + BK)^N$ has eigenvalues in the unit circle for all $N \ge 1$. Proof of point 4 is concluded by inspection of the QP matrices (Appendix A) and by application of Theorem 5.6.12, page 298, of Horn & Johnson (2012), which states that, given a bound, $\rho$, on the spectral radius, there exists a matrix norm which is also less than $\rho$.

D PROOF OF PROPOSITION 2

Proof (Proposition 2). If a stabilizing solution ($\rho(A + BK) < 1$) to (8) exists, it is unique (Ionescu & Weiss, 1992, Proposition 1), and the DARE can therefore be considered an implicit function of $A$, $B$, $Q$, and $R$. Using the assumption that $(R + B^\top PB)^{-1}$ exists, it can be concluded that $Z_1, \dots, Z_5$ and $M_1$, $M_2$, $M_3$ exist (the Kronecker product and matrix addition, subtraction, and multiplication always exist). Equation (8) can be written as
$$P = A^\top M_1 A + Q, \qquad (11)$$
which is differentiable, and $M_1$, $M_2$, $M_3$ are also differentiable. Differentials are taken for (11) and each of $M_1$, $M_2$, $M_3$ as
$$\begin{aligned}
\mathrm{d\,vec}\,P &= (V_{n,n} + I_{n^2})(I_n \otimes A^\top M_1)\,\mathrm{d\,vec}\,A + (A^\top \otimes A^\top)\,\mathrm{d\,vec}\,M_1 + \mathrm{d\,vec}\,Q \\
\mathrm{d\,vec}\,M_1 &= \left[I_{n^2} - (PBM_2B^\top \otimes I_n) - (I_n \otimes PBM_2B^\top)\right]\mathrm{d\,vec}\,P - (PB \otimes PB)\,\mathrm{d\,vec}\,M_2 - (I_{n^2} + V_{n,n})(PBM_2 \otimes P)\,\mathrm{d\,vec}\,B \\
\mathrm{d\,vec}\,M_2 &= -(M_2 \otimes M_2)\,\mathrm{d\,vec}\,M_3 \\
\mathrm{d\,vec}\,M_3 &= \mathrm{d\,vec}\,R + (B^\top \otimes B^\top)\,\mathrm{d\,vec}\,P + (I_{m^2} + V_{m,m})(I_m \otimes B^\top P)\,\mathrm{d\,vec}\,B,
\end{aligned}$$
then these can be combined using the differential chain rule (Magnus & Neudecker, 1999, Theorem 18.2) to obtain
$$Z_1\,\mathrm{d\,vec}\,P = Z_2\,\mathrm{d\,vec}\,A + Z_3\,\mathrm{d\,vec}\,B + Z_4\,\mathrm{d\,vec}\,Q + Z_5\,\mathrm{d\,vec}\,R.$$
The Jacobians, as defined in Proposition 2, therefore exist if $Z_1^{-1}$ exists.
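A sketch of the differentiation of the KKT system (10) described in Appendix B, assuming the primal/dual solution returned by OSQP; for brevity only the Jacobian with respect to q is formed (the other parameters enter through the same linear system):

```python
import numpy as np

def dz_dq(H, M, z, y, tol=1e-8):
    """Jacobian of the QP solution z* with respect to q, via the KKT system (10).

    H, M: QP matrices from (5); z, y: primal/dual solution returned by the solver.
    Rows of M with y_i > 0 (upper bound active) and y_i < 0 (lower bound active)
    enter the KKT system; differentiating (10) with only q perturbed gives
    K [dz; dy] = [-dq; 0], so dz*/dq is minus the top-left block of K^{-1}.
    (Assumes the active-set KKT matrix is nonsingular.)"""
    act = np.where(np.abs(y) > tol)[0]          # strongly active constraints
    Ma = M[act, :]
    nz, na = H.shape[0], len(act)
    K = np.block([[H, Ma.T], [Ma, np.zeros((na, na))]])
    Kinv = np.linalg.inv(K)
    return -Kinv[:nz, :nz]                      # d z* / d q
```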
E VERIFICATION AND REDUCTION OF THE PREDICTION HORIZON

Algorithm 2: MPC horizon verification and reduction
  In: N > 0, X_0 ⊆ X, M, (P, K) from (7-8), ε > 0, n_s > 0, η ∈ (0, 1).   Out: N, X̄
  X̄ ← X_0
  while X̄ ≠ ∅ do
    N′ ← N
    while N′ > 0 do
      X_sample ← n_s uniform state samples, s.t. x ∈ X̄
      δû* ← solution of MPC QP (3-5), ∀ x ∈ X_sample
      if ‖δû*_k(x)‖ ≤ ε, ∀ k ≤ N′, ∀ x ∈ X_sample then
        return TRUE
      N′ ← N′ − 1
    X̄ ← η X̄
  Procedure failed: N ← N + 1, go to Algorithm 1

A method is proposed for the reduction of the MPC prediction horizon after imitation learning. The idea is to be able to reproduce the infinite-horizon optimal MPC up to a tolerance ε with high probability. To do so, we check that, for a candidate horizon N, the MPC action deltas, δû*_k, satisfy ‖δû*_k‖ ≤ ε for all k ≤ N. This means that the optimal action is equal to the LQR up to a tolerance ε. In order to provide a high-probability guarantee of this condition, we propose the use of a probabilistic verification approach, similar to Bobiti (2017). This is described in Algorithm 2. In particular, the condition is checked on a high number, n_s, of initial states. These states are sampled uniformly from a set of interest, X̄, which can be either the state constraint set X or an estimate of the region of attraction, Γ_N. If verified, this set is a region of attraction for the system with high probability. The relationship between the number of samples and the verification probability is discussed in (Bobiti, 2017, Chapter 5). The algorithm also checks whether the infinite-horizon condition has been reached for the N used during training. Finally, a line search for a suitable X̄ is proposed using a scaling factor η ∈ (0, 1). In particular, the initial set is downscaled until either a horizon is found or the set becomes empty. In the latter case the search fails and the procedure returns to the training algorithm with an increased N. Notably, the proposed algorithm does not require explicit computation of the terminal set in which the LQR is invariant, and it could also be used for non-linear MPC if an infinite-horizon (or a stabilising) terminal controller is available.

F PLATOON MODEL DERIVATION

The problem described in Section 4.2 can be decomposed into the regulation problem of driving the relative displacements and velocities,
$$\begin{bmatrix} z_2 - z_1 \\ \vdots \\ z_{n_v} - z_{n_v-1} \end{bmatrix} := \begin{bmatrix} y_2 - y_1 - y_{ss} \\ \vdots \\ y_{n_v} - y_{n_v-1} - y_{ss} \end{bmatrix}, \qquad \begin{bmatrix} \dot z_2 - \dot z_1 \\ \vdots \\ \dot z_{n_v} - \dot z_{n_v-1} \end{bmatrix} = \begin{bmatrix} \dot y_2 - \dot y_1 \\ \vdots \\ \dot y_{n_v} - \dot y_{n_v-1} \end{bmatrix},$$
to the origin, subject to the constraints
$$\begin{bmatrix} z_2 - z_1 \\ \vdots \\ z_{n_v} - z_{n_v-1} \end{bmatrix} \ge 1_{n_v-1}(\underline{y} - y_{ss}), \quad \text{and} \quad 1_{n_v}b \le \ddot y \le 1_{n_v}a.$$
If each vehicle is modelled as a point mass with acceleration input ($\ddot y_i = u_i$), then a continuous-time LTI state-space model can be formed as
$$\frac{d}{dt}\begin{bmatrix} z_2 - z_1 \\ \vdots \\ z_{n_v} - z_{n_v-1} \\ \dot z_2 - \dot z_1 \\ \vdots \\ \dot z_{n_v} - \dot z_{n_v-1} \end{bmatrix} = \begin{bmatrix} O & I \\ O & O \end{bmatrix}\begin{bmatrix} z_2 - z_1 \\ \vdots \\ \dot z_{n_v} - \dot z_{n_v-1} \end{bmatrix} + \begin{bmatrix} O \\ \hat{B}_c \end{bmatrix}u, \qquad \hat{B}_c = \begin{bmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{bmatrix},$$
which can be written compactly as
$$\dot x = \begin{bmatrix} O & I \\ O & O \end{bmatrix}x + \begin{bmatrix} O \\ \hat{B}_c \end{bmatrix}u.$$
If it is assumed that the control input is constant between sampling intervals $t$ and $t + dt$, then this can be given in discrete time as
$$x_{t+dt} = \begin{bmatrix} I & dt\,I \\ O & I \end{bmatrix}x_t + \begin{bmatrix} \tfrac{1}{2}\hat{B}_c(dt)^2 \\ \hat{B}_c\,dt \end{bmatrix}u_t,$$
where $x_t \in \mathbb{R}^{2(n_v-1)}$ and $u_t \in \mathbb{R}^{n_v}$, and the system is subject to the constraints $x_t \ge (\underline{y} - y_{ss})1_{n_v-1}$ (applied to the relative-displacement elements of $x_t$) and $b1_{n_v} \le u_t \le a1_{n_v}$, $\forall t$.

G NONLINEAR MODELS

As discussed in the main paper, our approach is currently limited to Linear Time Invariant (LTI) systems. In general, conditions for infinite-horizon optimality of systems that are not LTI are non-trivial. Some of the results on MPC stability could however be maintained, for example in the case where the LQR value function, $x^\top Px$, is a local control Lyapunov function (Khalil, 2014; Mayne et al., 2000). In this case, the stability and intrinsic robustness results are maintained (see Limon et al., 2003; 2009). For these systems, it would be possible to use our method, for instance in combination with Amos et al. (2018), to provide a stable non-linear MPC. This is, however, a strong assumption for systems that are very non-linear. Assessing this LQR controllability condition could be done, for instance, by training a local linear model around the target equilibrium (origin) and then checking whether the DARE is solvable. This should be performed before starting the imitation learning. We leave the study of more general systems to future work.
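A sketch of the check suggested in Appendix G, assuming transition samples collected near the equilibrium; the least-squares model fit and the try/except feasibility test are illustrative:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def local_model_is_lqr_stabilizable(X, U, X_next, Q=None, R=None):
    """Fit a local linear model x+ ~ A x + B u by least squares and check
    whether the DARE (8) admits a stabilizing solution for it.

    X, U, X_next: arrays of shape (T, n), (T, m), (T, n) with samples collected
    near the target equilibrium (illustrative data interface)."""
    n, m = X.shape[1], U.shape[1]
    Theta, _, _, _ = np.linalg.lstsq(np.hstack([X, U]), X_next, rcond=None)
    A, B = Theta[:n].T, Theta[n:].T
    Q = np.eye(n) if Q is None else Q
    R = np.eye(m) if R is None else R
    try:
        P = solve_discrete_are(A, B, Q, R)
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # LQR gain (7)
        stable = np.max(np.abs(np.linalg.eigvals(A + B @ K))) < 1.0
        return stable, (A, B, P, K)
    except np.linalg.LinAlgError:
        # No stabilizing DARE solution found for the fitted local model.
        return False, (A, B, None, None)
```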