# Safe Pontryagin Differentiable Programming

Wanxin Jin (University of Pennsylvania, wanxinjin@gmail.com) · Shaoshuai Mou (Purdue University, mous@purdue.edu) · George J. Pappas (University of Pennsylvania, pappasg@seas.upenn.edu)

We propose a Safe Pontryagin Differentiable Programming (Safe PDP) methodology, which establishes a theoretical and algorithmic framework to solve a broad class of safety-critical learning and control problems that require the guarantee of safety constraint satisfaction at any stage of the learning and control process. In the spirit of interior-point methods, Safe PDP handles different types of system constraints on states and inputs by incorporating them into the cost or loss through barrier functions. We prove three fundamental results for the proposed Safe PDP: first, both the solution and its gradient in the backward pass can be approximated by solving their more efficient unconstrained counterparts; second, the approximation for both the solution and its gradient can be controlled to arbitrary accuracy by a barrier parameter; and third, importantly, all intermediate results throughout the approximation and optimization strictly respect the constraints, thus guaranteeing safety throughout the entire learning and control process. We demonstrate the capabilities of Safe PDP in solving various safety-critical tasks, including safe policy optimization, safe motion planning, and learning MPCs from demonstrations, on different challenging systems such as a 6-DoF maneuvering quadrotor and a 6-DoF rocket powered landing.

1 Introduction

Safety is usually a priority in the deployment of a learning or control algorithm to real-world systems. For a physical system (agent), safety is normally given as various constraints on system states and inputs, which must not be violated by the algorithm at any stage of the learning and control process; otherwise, irrevocable or unacceptable failure/damage will occur. Such systems are referred to as safety-critical. The constraints in a safety-critical system can include immediate ones, which are directly imposed on the system state and input at certain or all time instances, and long-term ones, which are defined on the trajectory of system states and inputs over a long period. Compared to the abundant results that focus on system optimality [1-3], systematic and principled treatments for safety-critical learning and control problems seem largely insufficient, particularly with respect to the following gaps (detailed in Section 1.1). First, existing safety strategies are either too conservative, which may restrict task performance, or violation-tolerable, which only pursue near-constraint satisfaction and thus are not strictly constraint-respecting. Second, a systematic safety paradigm capable of handling different types of constraints, including system state and input (or mixed), immediate, or/and long-term constraints, is still lacking. Third, some existing safety strategies suffer from huge computational and data complexity, making them difficult to integrate into any differentiable programming framework to solve large-scale learning and continuous control tasks. To address the above research gaps, this paper aims to develop a safe differentiable programming framework with the following key capabilities.
First, the framework provides a systematic treatment for different types of constraints in a safety-critical problem; second, it attains provable safety and accuracy guarantees throughout the learning and control process; third, it is flexible enough to perform safe learning of any unknown aspects of a constrained decision-making system, including policy, dynamics, state and input constraints, and control cost; finally, it can be integrated into any differentiable programming framework to efficiently solve large-scale safe learning and control tasks.

1.1 Related Work

In the machine learning and control fields, safety has been defined by different criteria, such as worst-case [4, 5], risk-sensitive [6], ergodicity [7], robustness [8, 9], etc., most of which are formulated by directly altering an objective function [10]. In this paper, we focus only on constrained learning and control problems, where constraints are explicitly formulated and must be satisfied. We categorize existing techniques into at-convergence safety methods, which only concern constraint satisfaction at convergence, and in-progress safety methods, which attempt to ensure constraint satisfaction during the entire optimization process.

At-convergence safety methods. In reinforcement learning (RL), a constrained agent is typically formulated as a Constrained Markov Decision Process (CMDP) [11], seeking a policy that not only optimizes a reward but also satisfies an upper bound on a cumulative cost. A common strategy [12-17] to solve CMDPs is to use the primal-dual method, by establishing the unconstrained Lagrangian and performing saddle-point optimization. In deep learning, the primal-dual method has recently been used [18] to train deep neural networks with constraints. In control, the primal-dual method has been used to solve constrained optimal control (constrained trajectory) problems [19-21]. While proven to satisfy constraints at convergence [22, 23], primal-dual methods cannot guarantee constraint satisfaction during optimization, as shown in [13, 24], and thus are not suitable for safety-critical tasks.

In-progress safety methods. To enforce safety during training, [25] and [26] solve CMDPs by introducing additional constraints into Trust Region Policy Optimization (TRPO) [27]. Since these methods only obtain a near-constraint guarantee, constraint violation is not fully eliminated. Another line of constrained RL [24, 28-30] leverages Lyapunov theory [31] to bound the behavior of an agent. But how to choose a valid Lyapunov function for general tasks is still an open problem to date [32], particularly for constrained RL, since it requires a Lyapunov function to be consistent with the constraints and to permit optimal policies [28]. Some other work also attempts to handle immediate constraints, i.e., the constraints imposed on agent state and input at any time. In [33], a safe exploration scheme is proposed to produce a safe reachable region, but it only considers finite state spaces. [34] develops a method that learns safety constraints and then optimizes a reward within the certified safe region; the method defines constraints purely on the agent state and thus may not be readily applicable to mixed state-input constraints. In control, in-progress safety can be achieved via two model-based frameworks: reachability theory [35] and control barrier functions [36, 37].
Safe control based on reachability theory [35, 38-40] explicitly considers adversarial factors and seeks a strategy that maintains the constraints despite those adversarial factors. This process typically requires solving the Hamilton-Jacobi-Isaacs equations [41], which become computationally difficult for high-dimensional systems [35]. Control barrier functions [36, 37] constrain a system only on safety boundaries, making them a less-conservative strategy for safety-critical tasks [42-44]. Most of these methods consider affine dynamics and directly use the given constraint function as a control barrier function. Such a choice could be problematic when a system is uncontrollable at the boundary of the sublevel set. Thus, how to find a valid control barrier function is still an ongoing research topic [45-47]. The above two control safety frameworks mainly focus on pure state constraints and cannot be readily extended to other constraints, such as mixed state-input constraints or cumulative constraints defined on the system trajectory.

Interior-point methods and control. Interior-point methods (IPMs) [48-50] solve constrained optimization by sequentially finding solutions to unconstrained problems whose objective combines the original objective with a barrier that prevents leaving the feasible region. IPMs have been used for constrained linear quadratic regulator (LQR) control in [51-57]. While IPMs for nonlinear constrained optimal control are studied in [58-62], they mostly focus on developing algorithms to solve the unconstrained approximation (from the perturbed KKT conditions) and lack performance analysis. Most recently, [63] uses an IPM to develop a zeroth-order non-convex optimization method, and [64] uses IPMs to solve reinforcement learning with only cumulative constraints. Despite the promise of this line of work, theoretical results and systematic algorithms regarding the differentiability of general constrained control systems based on IPMs have not been studied and established.

Differentiable projection layer. In machine learning, a recent line of work considers embedding a differentiable projection layer [65-67] into a general training process to ensure safety. In particular, [67] and [66] enforce safety by constructing a dedicated projection layer, which projects the unsafe actions output by a neural policy into a safe region (satisfying safety constraints). This projection layer is a differentiable convex layer [68, 69], which can be trained end-to-end. In [65], safety is defined as robustness against the worst adversarial disturbance, and the set of robust policies is solved by classic robust control (solving LMIs). An RL neural policy with a differentiable projection layer is learned such that the action from the neural policy lies in the robust policy set. Different from the above work, Safe PDP does not enforce safety by projection; instead, Safe PDP uses barrier functions to guarantee safety constraint satisfaction. More importantly, we show, in both theory and experiments, that with barrier functions, differentiability can also be attained.

Sensitivity analysis and differentiable MPCs. Other work related to Safe PDP includes recent results on sensitivity analysis [70], which focuses on differentiating a solution to a general nonlinear program, and differentiable MPC [68], which is based on differentiable quadratic programming.
In long-horizon control settings, directly applying [70] and [68] can be inefficient: the complexity of [68, 70] for differentiating a solution to a general optimal control system is at least $O(T^2)$ ($T$ is the time horizon) due to computing the inverse of the Jacobian of the KKT conditions. Since an optimal control system has more sparse structure than general nonlinear or quadratic programs, by exploiting that structure and proposing the Auxiliary Control System, Safe PDP enjoys a complexity of $O(T)$ for differentiating a solution to a general control system. Such advantages have been discussed and shown in the foundational PDP work [71] and will also be shown later (in Section 8) in this paper.

1.2 Paper Contributions

We propose a safe differentiable programming methodology named Safe Pontryagin Differentiable Programming (Safe PDP). Safe PDP provides a systematic treatment of different types of system constraints, including state and input (or mixed), immediate, and long-term constraints, with provable safety and performance guarantees. Safe PDP is also a unified differentiable programming framework, which can be used to efficiently solve a broad class of safety-critical learning and control tasks. In the spirit of interior-point methods, Safe PDP incorporates different types of system constraints into the control cost and loss through barrier functions, approximating a constrained control system and task using their unconstrained counterparts. The contributions of Safe PDP are theoretical and algorithmic. Theoretically, we prove in Theorem 2 and Theorem 3 that (I) not only a solution but also the gradient of the solution can be safely approximated by solving a more efficient unconstrained counterpart; (II) any intermediate results throughout the approximation and optimization are strictly safe, that is, they never violate the original system/task constraints; and (III) the approximations for both the solution and its gradient can be controlled to arbitrary accuracy by barrier parameters. Algorithmically, (IV) we prove in Theorem 1 that if a constrained control system is differentiable, the gradient of its trajectory is the globally unique solution to an Auxiliary Control System [71], which can be solved efficiently with a complexity of only $O(T)$, where $T$ is the control horizon; (V) in Section 7, we experimentally demonstrate the capability of Safe PDP for efficiently solving various safety-critical learning and control problems, including safe neural policy optimization, safe motion planning, and learning MPCs from demonstrations.

2 Safe PDP Problem Formulation

Consider a class of constrained optimal control systems (models) $\Sigma(\theta)$, which are parameterized by a tunable parameter $\theta\in\mathbb{R}^r$ in their control cost function, dynamics, initial condition, and constraints:

$$
\Sigma(\theta):\quad
\begin{aligned}
\text{control cost:}\quad & J(\theta)=\sum_{t=0}^{T-1} c_t(x_t,u_t,\theta)+c_T(x_T,\theta)\\
\text{subject to dynamics:}\quad & x_{t+1}=f(x_t,u_t,\theta)\ \ \text{with}\ x_0=x_0(\theta),\ \forall t,\\
\text{terminal constraints:}\quad & g_T(x_T,\theta)\le 0,\qquad h_T(x_T,\theta)=0,\\
\text{path constraints:}\quad & g_t(x_t,u_t,\theta)\le 0,\qquad h_t(x_t,u_t,\theta)=0,\ \forall t.
\end{aligned}
\tag{1}
$$

Here, $x_t\in\mathbb{R}^n$ is the system state; $u_t\in\mathbb{R}^m$ is the control input; $c_t:\mathbb{R}^n\times\mathbb{R}^m\times\mathbb{R}^r\to\mathbb{R}$ and $c_T:\mathbb{R}^n\times\mathbb{R}^r\to\mathbb{R}$ are the stage and final costs, respectively; $f:\mathbb{R}^n\times\mathbb{R}^m\times\mathbb{R}^r\to\mathbb{R}^n$ is the dynamics with initial state $x_0=x_0(\theta)\in\mathbb{R}^n$; $t=0,1,\dots,T$ is the time step with $T$ the time horizon; $g_T:\mathbb{R}^n\times\mathbb{R}^r\to\mathbb{R}^{q_T}$ and $h_T:\mathbb{R}^n\times\mathbb{R}^r\to\mathbb{R}^{s_T}$ are the final inequality and equality constraints, respectively; and $g_t:\mathbb{R}^n\times\mathbb{R}^m\times\mathbb{R}^r\to\mathbb{R}^{q_t}$ and $h_t:\mathbb{R}^n\times\mathbb{R}^m\times\mathbb{R}^r\to\mathbb{R}^{s_t}$ are the immediate inequality and equality constraints at time $t$, respectively.
All inequalities (here and below) are entry-wise. We consider that all functions in $\Sigma(\theta)$ are three-times continuously differentiable (i.e., $C^3$) with respect to their arguments. Although we have parameterized all aspects of $\Sigma(\theta)$ here, for a specific application (see Section 7) one only needs to parameterize the unknown aspects of $\Sigma(\theta)$ and keep the others given. Any unknown aspects of $\Sigma(\theta)$ can be implemented by differentiable neural networks. For a given $\theta$, $\Sigma(\theta)$ produces a trajectory $\xi_\theta=\{x^\theta_{0:T},u^\theta_{0:T-1}\}$ by solving the following Problem B($\theta$):

$$
\xi_\theta=\{x^\theta_{0:T},\,u^\theta_{0:T-1}\}\in\arg\min_{\{x_{0:T},\,u_{0:T-1}\}} J(\theta)
\quad\text{subject to}\quad
\begin{aligned}
& x_{t+1}=f(x_t,u_t,\theta)\ \text{with}\ x_0=x_0(\theta),\ \forall t,\\
& g_T(x_T,\theta)\le 0\ \text{and}\ h_T(x_T,\theta)=0,\\
& g_t(x_t,u_t,\theta)\le 0\ \text{and}\ h_t(x_t,u_t,\theta)=0,\ \forall t.
\end{aligned}
\tag{B($\theta$)}
$$

Here, we use $\in$ since the solution $\xi_\theta$ to Problem B($\theta$) may not be unique in general, thus constituting a solution set $\{\xi_\theta\}$. We will discuss the existence and uniqueness of $\{\xi_\theta\}$ in Section 4. For a specific task, we aim to find a specific model $\Sigma(\theta^*)$, i.e., to search for a specific $\theta^*$, such that its trajectory $\xi_{\theta^*}$ from B($\theta^*$) meets the following two given requirements. First, $\xi_{\theta^*}$ minimizes a given task loss $\ell(\xi_\theta,\theta)$; and second, $\xi_{\theta^*}$ satisfies the given task constraints $R_i(\xi_\theta,\theta)\le 0$, $i=1,2,\dots,l$. Note that we need to distinguish between the two types of objectives, the task loss $\ell(\xi_\theta,\theta)$ and the control cost $J(\theta)$, and also the two types of constraints, the task constraints $R_i(\xi_\theta,\theta)$ and the model constraints $g_t(\theta)$. In fact, $J(\theta)$ and $g_t(\theta)$ in $\Sigma(\theta)$ are unknown and parameterized by $\theta$, and can represent the unknown inherent aspects of a physical agent, while $\ell(\xi_\theta,\theta)$ and $R_i(\xi_\theta,\theta)$ are given and known depending on the specific task (they also explicitly depend on $\theta$ since $\theta$ needs to be regularized in some learning cases). Assume $\ell(\xi_\theta,\theta)$ and $R_i(\xi_\theta,\theta)$ are both twice-continuously differentiable. The problem of searching for $\theta^*$ can be formally written as:

$$
\theta^*=\arg\min_{\theta}\ \ell(\xi_\theta,\theta)\quad\text{subject to}\quad R_i(\xi_\theta,\theta)\le 0,\ i=1,2,\dots,l,\quad\text{and}\quad \xi_\theta\ \text{solves Problem B}(\theta).
\tag{P}
$$

For a specific learning and control task, one only needs to specify the details of $\Sigma(\theta)$ and give a task loss $\ell(\xi_\theta,\theta)$ and constraints $R_i(\xi_\theta,\theta)\le 0$. Section 7 will give representative examples.

3 Challenges to Solve Problem P

Problem P belongs to bi-level optimization [72]: each time $\theta$ is updated in the outer level (involving the task loss $\ell$ and task constraints $R_i$) of Problem P, the corresponding trajectory $\xi_\theta$ needs to be solved from the inner-level Problem B($\theta$). Similar to PDP [71], one could approach Problem P using gradient-based methods by ignoring the process of solving the inner-level Problem B($\theta$) and just viewing $\xi_\theta$ as an explicit differentiable function of $\theta$. Then, based on interior-point methods [48], one can introduce a logarithmic barrier function $-\ln(-R_i)$ for each task constraint and a barrier parameter $\epsilon>0$. This leads to sequentially solving the following unconstrained Problem SP($\epsilon$)

$$
\theta^*(\epsilon)=\arg\min_{\theta}\ \ell(\xi_\theta,\theta)-\epsilon\sum_{i=1}^{l}\ln\big(-R_i(\xi_\theta,\theta)\big)
\tag{SP($\epsilon$)}
$$

for a fixed $\epsilon$. By letting $\epsilon\to 0$, $\theta^*(\epsilon)$ is expected to converge to the solution $\theta^*$ of Problem P. Although plausible, the above process has the following technical challenges to be addressed: (1) As $\xi_\theta$ is a solution to the constrained optimal control Problem B($\theta$), is $\xi_\theta$ differentiable? Does the auxiliary control system [71] exist for solving $\frac{\partial\xi_\theta}{\partial\theta}$? (2) Since we want to obtain $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$ at as low a cost as possible, instead of solving the constrained Problem B($\theta$), can we use an unconstrained system to approximate both $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$? Importantly, can the accuracy of the approximations for $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$ be arbitrarily and safely controlled?
(3) Can we guarantee that the approximation for $\xi_\theta$ is safe in the sense that it always respects the system's original inequality constraints $g_t\le 0$ and $g_T\le 0$? (4) With the safe approximations for both $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$, can the accuracy of the solution to the outer-level unconstrained optimization SP($\epsilon$) be arbitrarily controlled towards $\theta^*$? (5) With the safe approximations for both $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$, can we guarantee the safety of the outer-level inequality constraints $R_i\le 0$ during the optimization of the outer-level SP($\epsilon$)? The remainder of the paper addresses the above challenges. For reference, we give a quick overview: Challenge (1) will be addressed in Section 4 and the result is in Theorem 1; Challenges (2) and (3) will be addressed in Section 5 and the result is in Theorem 2; Challenges (4) and (5) will be addressed in Section 6 and the result is in Theorem 3; and Section 7 gives some representative applications.

4 Differentiability for $\Sigma(\theta)$ and its Auxiliary Control System

4.1 Differentiability of $\xi_\theta$

For the constrained optimal control system $\Sigma(\theta)$ in (1), we define the following Hamiltonians $L_t$ for $t=0,1,\dots,T-1$ and $L_T$, respectively:

$$
L_t = c_t(x_t,u_t,\theta)+\lambda_{t+1}^{\top} f(x_t,u_t,\theta)+v_t^{\top} g_t(x_t,u_t,\theta)+w_t^{\top} h_t(x_t,u_t,\theta),
\tag{2a}
$$

$$
L_T = c_T(x_T,\theta)+v_T^{\top} g_T(x_T,\theta)+w_T^{\top} h_T(x_T,\theta),
\tag{2b}
$$

where $\lambda_t\in\mathbb{R}^n$ is the costate, and $v_t\in\mathbb{R}^{q_t}$ and $w_t\in\mathbb{R}^{s_t}$ are the multipliers for the inequality and equality constraints, respectively. The well-known second-order condition for $\xi_\theta$ to be a local isolated (locally unique) minimizing trajectory to $\Sigma(\theta)$ in Problem B($\theta$) has been well established in [73]. For completeness, we present it in Lemma A.2 in Appendix A. Lemma A.2 states that there exist a costate sequence $\lambda^\theta_{1:T}$ and multiplier sequences $v^\theta_{0:T}$ and $w^\theta_{0:T}$ such that $(\xi_\theta,\lambda^\theta_{1:T},v^\theta_{0:T},w^\theta_{0:T})$ satisfies the well-known Constrained Pontryagin Minimum Principle (C-PMP) given in (S.5) in Lemma A.2. Based on the above, one has the following result for the differentiability of $\xi_\theta$.

Lemma 1 (Differentiability of $\xi_\theta$). Given a fixed $\bar\theta$, assume the following conditions hold for $\Sigma(\bar\theta)$: (i) the second-order condition (Lemma A.2) is satisfied for $\Sigma(\bar\theta)$; (ii) the gradients of all binding constraints at $\xi_{\bar\theta}$ are linearly independent (binding constraints include all equality constraints and all active inequality constraints); (iii) strict complementarity holds at $\xi_{\bar\theta}$, i.e., every active inequality constraint has a positive multiplier. Then, for all $\theta$ in a neighborhood of $\bar\theta$, there exists a unique once-continuously differentiable function $(\xi_\theta,\lambda^\theta_{1:T},v^\theta_{0:T},w^\theta_{0:T})$ that satisfies the second-order condition (Lemma A.2) for the constrained optimal control system $\Sigma(\theta)$, with $(\xi_\theta,\lambda^\theta_{1:T},v^\theta_{0:T},w^\theta_{0:T})=(\xi_{\bar\theta},\lambda^{\bar\theta}_{1:T},v^{\bar\theta}_{0:T},w^{\bar\theta}_{0:T})$ at $\theta=\bar\theta$. Hence, $\xi_\theta$ is a local isolated minimizing trajectory to $\Sigma(\theta)$. Further, for all $\theta$ near $\bar\theta$, strict complementarity is preserved, and the linear independence of the gradients of all binding constraints at $\xi_\theta$ holds.

The proof of Lemma 1 directly follows the well-known first-order sensitivity result in Theorem 2.1 of [74]. Here, conditions (i)-(iii) are sufficient conditions to guarantee the applicability of the well-known implicit function theorem [75] to the C-PMP. Condition (ii) is well known and serves as a sufficient condition for the constraint qualification to establish the C-PMP (see Corollary 3, p. 22, [48]).
Condition (iii) is necessary to ensure that the Jacobian matrix in the implicit function theorem is invertible, and it also leads to the persistence of strict complementarity, meaning that inactive inequalities remain inactive, active ones remain active, and there is no switching between them near $\bar\theta$. Both our practice and previous works [68, 69, 71, 74, 76] show that conditions (i)-(iii) are very mild and the differentiability of $\xi_\theta$ can be attained almost everywhere in the space of $\theta$.

4.2 Auxiliary Control System to Solve $\frac{\partial\xi_\theta}{\partial\theta}$

If conditions (i)-(iii) in Lemma 1 for the differentiability of $\xi_\theta$ hold, we next show that $\frac{\partial\xi_\theta}{\partial\theta}$ can also be efficiently solved by an auxiliary control system, originally proposed in the foundational work [71]. First, we define the new state and input (matrix) variables $X_t\in\mathbb{R}^{n\times r}$ and $U_t\in\mathbb{R}^{m\times r}$, respectively. Then, we introduce the following auxiliary control system $\overline{\Sigma}(\xi_\theta)$:

$$
\begin{aligned}
\text{control cost:}\quad & \bar J=\operatorname{Tr}\sum_{t=0}^{T-1}\left(\frac{1}{2}\begin{bmatrix}X_t\\U_t\end{bmatrix}^{\!\top}\!\begin{bmatrix}L^{xx}_t & L^{xu}_t\\ L^{ux}_t & L^{uu}_t\end{bmatrix}\!\begin{bmatrix}X_t\\U_t\end{bmatrix}+\begin{bmatrix}L^{x\theta}_t\\ L^{u\theta}_t\end{bmatrix}^{\!\top}\!\begin{bmatrix}X_t\\U_t\end{bmatrix}\right)+\operatorname{Tr}\left(\frac{1}{2}X_T^{\top} L^{xx}_T X_T+\big(L^{x\theta}_T\big)^{\!\top} X_T\right)\\
\text{dynamics:}\quad & X_{t+1}=F^x_t X_t+F^u_t U_t+F^\theta_t\ \ \text{with}\ X_0=\frac{\partial x_0(\theta)}{\partial\theta},\\
\text{terminal constraint:}\quad & \bar G^x_T X_T+\bar G^\theta_T=0,\qquad H^x_T X_T+H^\theta_T=0,\\
\text{path constraint:}\quad & \bar G^x_t X_t+\bar G^u_t U_t+\bar G^\theta_t=0,\qquad H^x_t X_t+H^u_t U_t+H^\theta_t=0.
\end{aligned}
\tag{3}
$$

Here, $L^{x}_t$ and $L^{xx}_t$ denote the first- and second-order derivatives, respectively, of the Hamiltonian $L_t$ in (2) with respect to $x$; $F^x_t$, $H^x_t$, $\bar G^x_t$ denote the first-order derivatives of $f$, $h_t$, $\bar g_t$ with respect to $x$, respectively, where $\bar g_t$ is the vector stacking all active inequality constraints in $g_t$; a similar convention applies to the other notation. All derivative matrices defining $\overline{\Sigma}(\xi_\theta)$ are evaluated at $(\xi_\theta,\lambda^\theta_{1:T},v^\theta_{0:T},w^\theta_{0:T})$, where $\lambda^\theta_{1:T}$, $v^\theta_{0:T}$, and $w^\theta_{0:T}$ are usually byproducts of a constrained optimal control solver [77] or can easily be solved from the C-PMP given $\xi_\theta$, as done in [71]. We note that $\overline{\Sigma}(\xi_\theta)$ is an equality-constrained linear quadratic regulator (LQR) system, as its control cost function is quadratic and its dynamics and constraints are linear. For the above $\overline{\Sigma}(\xi_\theta)$, we have the following important result without additional assumptions.

Theorem 1 ($\frac{\partial\xi_\theta}{\partial\theta}$ is a globally unique minimizing trajectory to $\overline{\Sigma}(\xi_\theta)$). Let conditions (i)-(iii) in Lemma 1 for the differentiability of $\xi_\theta$ hold. Then, the auxiliary control system $\overline{\Sigma}(\xi_\theta)$ in (3) has a globally unique minimizing trajectory, denoted as $\{X^\theta_{0:T},U^\theta_{0:T-1}\}$, which is exactly $\frac{\partial\xi_\theta}{\partial\theta}$:

$$
\big\{X^\theta_{0:T},\,U^\theta_{0:T-1}\big\}=\frac{\partial\xi_\theta}{\partial\theta}
\quad\text{with}\quad
X^\theta_t=\frac{\partial x^\theta_t}{\partial\theta}\ \ \text{and}\ \ U^\theta_t=\frac{\partial u^\theta_t}{\partial\theta}.
\tag{4}
$$

The proof of the above theorem is in Appendix B. Theorem 1 states that as long as conditions (i)-(iii) in Lemma 1 for the differentiability of $\xi_\theta$ are satisfied, without additional assumptions, the auxiliary control system $\overline{\Sigma}(\xi_\theta)$ always has a globally unique minimizing trajectory, which is exactly $\frac{\partial\xi_\theta}{\partial\theta}$. Thus, obtaining $\frac{\partial\xi_\theta}{\partial\theta}$ is equivalent to solving $\overline{\Sigma}(\xi_\theta)$, which can be done efficiently thanks to the recent development of equality-constrained LQR algorithms [78-80], all of which have a complexity of $O(T)$. The algorithm that implements Theorem 1 is given in Algorithm 1 in Appendix E.1.

5 Safe Unconstrained Approximations for $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$

From Section 4, we know that one can solve the constrained system $\Sigma(\theta)$ to obtain $\xi_\theta$ and solve its auxiliary control system $\overline{\Sigma}(\xi_\theta)$ to obtain $\frac{\partial\xi_\theta}{\partial\theta}$. Although theoretically appealing, there are several difficulties in implementation. First, solving a constrained optimal control Problem B($\theta$) is not as easy as solving an unconstrained optimal control problem, for which many trajectory optimization algorithms, e.g., iLQR [81] and DDP [82], are available.
Second, establishing $\overline{\Sigma}(\xi_\theta)$ requires the values of the multipliers $v^\theta_{0:T}$ and $w^\theta_{0:T}$. Third, to construct $\overline{\Sigma}(\xi_\theta)$, one also needs to identify all active inequality constraints $\bar g_t$, which can be difficult due to numerical error (we will show this in later experiments). All of these difficulties motivate us to develop a more efficient paradigm to obtain both $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$, which is the goal of this section. To proceed, we first convert the constrained system $\Sigma(\theta)$ to an unconstrained system $\Sigma(\theta,\gamma)$ by adding all constraints to its control cost via barrier functions. Here, we use a quadratic barrier function for each equality constraint and a logarithmic barrier function for each inequality constraint; all barrier functions are associated with the same barrier parameter $\gamma>0$. This leads to $\Sigma(\theta,\gamma)$:

$$
\begin{aligned}
\text{control cost:}\quad & J(\theta,\gamma)=\sum_{t=0}^{T-1}\Big(c_t(x_t,u_t,\theta)-\gamma\sum_{i=1}^{q_t}\ln\big(-g_{t,i}(x_t,u_t,\theta)\big)+\frac{1}{\gamma}\sum_{i=1}^{s_t}h_{t,i}(x_t,u_t,\theta)^2\Big)\\
&\qquad\qquad + c_T(x_T,\theta)-\gamma\sum_{i=1}^{q_T}\ln\big(-g_{T,i}(x_T,\theta)\big)+\frac{1}{\gamma}\sum_{i=1}^{s_T}h_{T,i}(x_T,\theta)^2,\\
\text{subject to dynamics:}\quad & x_{t+1}=f(x_t,u_t,\theta)\ \ \text{with}\ x_0=x_0(\theta),\ \forall t.
\end{aligned}
\tag{5}
$$

The trajectory $\xi_{(\theta,\gamma)}=\{x^{(\theta,\gamma)}_{0:T},u^{(\theta,\gamma)}_{0:T-1}\}$ produced by the above unconstrained system $\Sigma(\theta,\gamma)$ is

$$
\xi_{(\theta,\gamma)}=\big\{x^{(\theta,\gamma)}_{0:T},\,u^{(\theta,\gamma)}_{0:T-1}\big\}\in\arg\min_{\{x_{0:T},\,u_{0:T-1}\}} J(\theta,\gamma)\quad\text{s.t.}\quad x_{t+1}=f(x_t,u_t,\theta)\ \text{with}\ x_0=x_0(\theta),
\tag{SB($\theta,\gamma$)}
$$

that is, $\xi_{(\theta,\gamma)}$ minimizes the new control cost $J(\theta,\gamma)$ subject only to the dynamics. We then have the following important result about the safe unconstrained approximation of $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$ using $\Sigma(\theta,\gamma)$.

Theorem 2. Let conditions (i)-(iii) in Lemma 1 for the differentiability of $\xi_\theta$ hold. For any small $\gamma>0$, (a) there exists a local isolated minimizing trajectory $\xi_{(\theta,\gamma)}$ that solves Problem SB($\theta,\gamma$), and $\Sigma(\theta,\gamma)$ is well-defined at $\xi_{(\theta,\gamma)}$, i.e., $g_t\big(x^{(\theta,\gamma)}_t,u^{(\theta,\gamma)}_t,\theta\big)<0$ and $g_T\big(x^{(\theta,\gamma)}_T,\theta\big)<0$; (b) $\xi_{(\theta,\gamma)}$ is once-continuously differentiable with respect to $(\theta,\gamma)$, and

$$
\xi_{(\theta,\gamma)}\to\xi_\theta
\quad\text{and}\quad
\frac{\partial\xi_{(\theta,\gamma)}}{\partial\theta}\to\frac{\partial\xi_\theta}{\partial\theta}
\quad\text{as}\quad\gamma\to 0;
\tag{6}
$$

(c) the trajectory derivative $\frac{\partial\xi_{(\theta,\gamma)}}{\partial\theta}$ is a globally unique minimizing trajectory to the auxiliary control system $\overline{\Sigma}(\xi_{(\theta,\gamma)})$ corresponding to $\Sigma(\theta,\gamma)$.

The proof of the above theorem is given in Appendix C. It is worth noting that the above assertions require no additional assumptions beyond the same conditions (i)-(iii) for the differentiability of $\xi_\theta$. We make the following comments on the above results, using an illustrative cartpole example in Fig. 1.

[Figure 1: $\xi_{(\theta,\gamma)}$ and $\partial\xi_{(\theta,\gamma)}/\partial\theta$ approximate $\xi_\theta$ and $\partial\xi_\theta/\partial\theta$ under different $\gamma>0$; upper panels plot the approximation errors versus $\gamma$, and bottom panels plot the cart force input and cart position over time.]

First, assertion (b) states that by choosing a small $\gamma>0$, one can simply use $\xi_{(\theta,\gamma)}$ and $\frac{\partial\xi_{(\theta,\gamma)}}{\partial\theta}$ of the unconstrained optimal control system $\Sigma(\theta,\gamma)$ to approximate $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$ of the original constrained system $\Sigma(\theta)$, respectively. Second, notably, assertion (b) also states that the above approximations can be controlled to arbitrary accuracy by simply letting $\gamma\to 0$, as illustrated in the upper panels of Fig. 1. Third, more importantly, assertion (a) states that the above approximations are always safe in the sense that the approximation $\xi_{(\theta,\gamma)}$ with any small $\gamma>0$ is guaranteed to satisfy all inequality constraints of the original $\Sigma(\theta)$, as illustrated in the bottom panels of Fig. 1. Finally, similar to Theorem 1, assertion (c) states that the derivative $\frac{\partial\xi_{(\theta,\gamma)}}{\partial\theta}$ for $\Sigma(\theta,\gamma)$ is a globally unique minimizing trajectory to its corresponding auxiliary control system $\overline{\Sigma}(\xi_{(\theta,\gamma)})$; thus PDP [71] directly applies here.
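To make the barrier-augmented cost in (5) concrete, the following minimal Python sketch (illustrative only, not the released implementation; the callables `c`, `g`, `h` and the helper name are hypothetical) assembles a single barrier-augmented stage cost from user-supplied cost and constraint functions, using a log barrier with weight $\gamma$ for the inequalities and a quadratic penalty with weight $1/\gamma$ for the equalities.

```python
import numpy as np

def barrier_stage_cost(c, g, h, gamma):
    """Return a stage cost in the spirit of (5): original cost plus barrier terms.

    c(x, u, theta) -> float       : original stage cost c_t
    g(x, u, theta) -> 1-D array   : inequality constraints, feasible iff g < 0
    h(x, u, theta) -> 1-D array   : equality constraints
    gamma                         : barrier parameter (> 0)
    """
    def ct_gamma(x, u, theta):
        g_val = g(x, u, theta)
        if np.any(g_val >= 0):
            # The log barrier is only defined strictly inside the feasible set.
            raise ValueError("iterate left the feasible region; log barrier undefined")
        h_val = h(x, u, theta)
        return (c(x, u, theta)
                - gamma * np.sum(np.log(-g_val))         # log barrier for g_t <= 0
                + (1.0 / gamma) * np.sum(h_val ** 2))    # quadratic penalty for h_t = 0
    return ct_gamma
```

The resulting unconstrained cost can then be handed to any standard trajectory optimizer (e.g., iLQR/DDP), which is precisely the practical point of Theorem 2.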
In addition to the theoretical importance of Theorem 2, we also summarize its algorithmic advantages compared to directly handling the original constrained system $\Sigma(\theta)$ and its auxiliary control system $\overline{\Sigma}(\xi_\theta)$. First, solving the unconstrained $\Sigma(\theta,\gamma)$ is easier than solving the constrained $\Sigma(\theta)$, as more off-the-shelf algorithms are available for unconstrained trajectory optimization than for constrained trajectory optimization. Second, when solving $\frac{\partial\xi_{(\theta,\gamma)}}{\partial\theta}$ using $\overline{\Sigma}(\xi_{(\theta,\gamma)})$, there is no need to identify the inactive and active inequality constraints, as opposed to solving $\frac{\partial\xi_\theta}{\partial\theta}$ using $\overline{\Sigma}(\xi_\theta)$; thus it is easier to implement and more numerically stable (we will show this later in experiments). Third, in contrast to Theorem 1, the unconstrained $\Sigma(\theta,\gamma)$ and $\overline{\Sigma}(\xi_{(\theta,\gamma)})$ avoid dealing with the multipliers $v_{0:T}$ and $w_{0:T}$. Finally, by absorbing hard inequality constraints into the control cost through barrier functions, $\Sigma(\theta,\gamma)$ introduces softness into the constraints and mitigates the discontinuous switching between inactive/active inequalities over a large range of $\theta$. This leads to a more numerically stable algorithm, as we will show in later experiments. The implementation of Theorem 2 is given in Algorithm 2 in Appendix E.2.

6 Safe PDP to Solve Problem P

According to Theorem 2, we use the safe unconstrained approximation system $\Sigma(\theta,\gamma)$ in (5) to replace the original inner-level constrained system $\Sigma(\theta)$ in (1). Then, we give the following important result for solving Problem P, which addresses Challenges (4) and (5) in Section 3.

Theorem 3. Suppose all functions defining the constrained optimal control system $\Sigma(\theta)$ are at least three-times continuously differentiable, and let conditions (i)-(iii) in Lemma 1 for the differentiability of $\xi_\theta$ hold in a neighborhood of $\theta^*$. Suppose that the second-order condition for a local isolated minimizer $\theta^*$ to Problem P is satisfied, that the gradients $\nabla_\theta R_i(\xi_{\theta^*},\theta^*)$ of all binding constraints $R_i(\xi_\theta,\theta)=0$ are linearly independent at $\theta^*$, and that strict complementarity holds at $\theta^*$. Then, for any small $\epsilon>0$ and any small $\gamma>0$, the following outer-level unconstrained approximation

$$
\theta^*(\epsilon,\gamma)=\arg\min_{\theta}\ \ell\big(\xi_{(\theta,\gamma)},\theta\big)-\epsilon\sum_{i=1}^{l}\ln\Big(-R_i\big(\xi_{(\theta,\gamma)},\theta\big)\Big),
\tag{SP($\epsilon,\gamma$)}
$$

with $\xi_{(\theta,\gamma)}$ being the optimal trajectory of the inner-level safe unconstrained approximation system $\Sigma(\theta,\gamma)$ in (5), satisfies the following assertions: (a) there exists a local isolated minimizer $\theta^*(\epsilon,\gamma)$ to the above SP($\epsilon,\gamma$), and the corresponding trajectory $\xi_{(\theta^*(\epsilon,\gamma),\gamma)}$ from the inner-level approximation system $\Sigma(\theta^*(\epsilon,\gamma),\gamma)$ is safe with respect to the original outer-level constraints, i.e., $R_i\big(\xi_{(\theta^*(\epsilon,\gamma),\gamma)},\theta^*(\epsilon,\gamma)\big)<0$, $i=1,2,\dots,l$; (b) $\theta^*(\epsilon,\gamma)$ is once-continuously differentiable with respect to both $\epsilon$ and $\gamma$, and

$$
\theta^*(\epsilon,\gamma)\to\theta^*\quad\text{as}\quad(\epsilon,\gamma)\to(0,0);
\tag{7}
$$

(c) for any $\theta$ near $\theta^*(\epsilon,\gamma)$, $\xi_{(\theta,\gamma)}$ from the inner-level approximation system $\Sigma(\theta,\gamma)$ is safe with respect to the original outer-level constraints, i.e., $R_i\big(\xi_{(\theta,\gamma)},\theta\big)<0$, $i=1,2,\dots,l$.

The proof of the above theorem is given in Appendix D. The above result says that instead of solving the original constrained Problem P with the inner-level constrained system $\Sigma(\theta)$ in (1), one can solve an unconstrained approximation Problem SP($\epsilon,\gamma$) with the inner-level safe unconstrained approximation system $\Sigma(\theta,\gamma)$ in (5). In particular, we make the following comments on the importance of the above theorem.
First, claim (a) affirms that although the inner-level trajectory $\xi_{(\theta,\gamma)}$ is an approximation (recall Theorem 2), the outer-level unconstrained Problem SP($\epsilon,\gamma$) always has a locally unique solution $\theta^*(\epsilon,\gamma)$; furthermore, at $\theta^*(\epsilon,\gamma)$, the corresponding inner-level trajectory $\xi_{(\theta^*(\epsilon,\gamma),\gamma)}$ is safe with respect to the original outer-level constraints, i.e., $R_i\big(\xi_{(\theta^*(\epsilon,\gamma),\gamma)},\theta^*(\epsilon,\gamma)\big)<0$, $i=1,2,\dots,l$. Second, claim (b) asserts that the accuracy of the solution $\theta^*(\epsilon,\gamma)$ to the outer-level approximation Problem SP($\epsilon,\gamma$) is controlled jointly by the inner-level barrier parameter $\gamma$ and the outer-level barrier parameter $\epsilon$: as both barrier parameters approach zero, $\theta^*(\epsilon,\gamma)$ converges to the true solution $\theta^*$ of the original Problem P. Third, claim (c) says that during the local search for the outer-level solution $\theta^*(\epsilon,\gamma)$, the corresponding inner-level trajectory $\xi_{(\theta,\gamma)}$ is always safe with respect to the original outer-level constraints, i.e., $R_i\big(\xi_{(\theta,\gamma)},\theta\big)<0$, $i=1,2,\dots,l$. Theorem 3, together with Theorem 2, provides the safety and accuracy guarantees for the whole Safe PDP framework. The entire Safe PDP algorithm is given in Algorithm 3 in Appendix E.3.

7 Applications to Different Safety-Critical Tasks

We apply Safe PDP to solve some representative safety-critical learning/control tasks. For a specific task, one only needs to specify the parameterization details of $\Sigma(\theta)$, a task loss $\ell(\xi_\theta,\theta)$, and task constraints $R_i(\xi_\theta,\theta)$ in Problem P. The experiments are performed on systems of different complexities listed in Table 1. All codes are available at https://github.com/wanxinjin/Safe-PDP.

Table 1: Experimental environments [71]

| System $\Sigma(\theta)$ | Dynamics $f(\theta_{\rm dyn})$ | Control cost $J(\theta_{\rm obj})$ | Constraints $g(\theta_{\rm cstr})$ |
|---|---|---|---|
| Cartpole | cart & pole masses and length | $c_t=\lVert u\rVert_2^2+\lVert x-x_{\rm goal}\rVert^2_{\theta_{\rm obj}}$, $c_T=\lVert x-x_{\rm goal}\rVert^2_{\theta_{\rm obj}}$ | $g_x(x)\le X_{\max}$, $\lVert u\rVert_{2/\infty}\le U_{\max}$, $\theta_{\rm cstr}=\{X_{\max},U_{\max}\}$ |
| Two-link robot arm | length and mass of links | | |
| 6-DoF quadrotor | mass, wing length, inertia | | |
| 6-DoF rocket landing | rocket mass, length, inertia | | |

Note that for each system, $g(\theta_{\rm cstr})$ includes the immediate constraints on the system input $u$ and state $x$ at any time instance; $g_x$ is known; $\lVert\cdot\rVert_{2/\infty}$ is the 2- or $\infty$-norm; and the time horizon $T$ is around 50 for all systems.

Problem I: Safe Policy Optimization aims to find a policy that minimizes a control cost $J$ subject to constraints $g$ while guaranteeing that any intermediate policy during optimization never violates the constraints. To apply Safe PDP to solve such a problem for the systems in Table 1, we set

$$
\Sigma(\theta):\qquad\text{dynamics: } x_{t+1}=f(x_t,u_t)\ \text{with}\ x_0,\qquad\text{policy: } u_t=\pi_t(x_t,\theta),
\tag{8}
$$

where the dynamics $f$ is learned from demonstrations in Problem III, and $\pi(\theta)$ is represented by a (deep) feedforward neural network (NN) with $\theta$ the NN parameters. In Problem P, the task loss $\ell(\xi_\theta,\theta)$ is set as $J(\theta_{\rm obj})$, and the task constraints $R_i(\xi_\theta,\theta)$ as $g(\theta_{\rm cstr})$, with both $\theta_{\rm obj}$ and $\theta_{\rm cstr}$ known. Then, safe policy optimization amounts to solving Problem P using Safe PDP. The results for the robot arm and the 6-DoF maneuvering quadrotor are in Fig. 2, and the other results and details are in Appendix F.1.
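For intuition about how the pieces fit together in this task, here is a minimal sketch of one outer gradient step on SP($\epsilon,\gamma$); this is a simplification under assumed interfaces (`solve_unconstrained_ocp`, `auxiliary_system_gradient`, and the `loss`/`constraints` callables are hypothetical placeholders, not the released Safe PDP API): roll out the barrier-smoothed inner problem SB($\theta,\gamma$), differentiate its trajectory through the auxiliary control system, and take a gradient step on the barrier-penalized task loss.

```python
import numpy as np

def safe_pdp_step(theta, loss, constraints, eps, gamma, lr,
                  solve_unconstrained_ocp, auxiliary_system_gradient):
    """One outer gradient step on SP(eps, gamma); all callables are placeholders.

    Assumed shapes: the trajectory xi is flattened to length d, theta has length r,
    loss returns (l, dl_dxi (d,), dl_dtheta (r,)), and
    constraints returns (R (l,), dR_dxi (l, d), dR_dtheta (l, r)).
    """
    # Forward pass: solve the barrier-smoothed inner problem SB(theta, gamma).
    xi = solve_unconstrained_ocp(theta, gamma)                 # xi_(theta, gamma)
    dxi_dtheta = auxiliary_system_gradient(theta, gamma, xi)   # (d, r), via Theorem 2(c)

    R, dR_dxi, dR_dtheta = constraints(xi, theta)
    assert np.all(R < 0), "Safe PDP needs a safe (feasible) iterate for the log barrier"
    l, dl_dxi, dl_dtheta = loss(xi, theta)

    # Chain rule for d/dtheta [ l - eps * sum_i ln(-R_i) ].
    grad = dl_dxi @ dxi_dtheta + dl_dtheta
    dR_dtheta_total = dR_dxi @ dxi_dtheta + dR_dtheta          # (l, r)
    grad += eps * np.sum(dR_dtheta_total / (-R)[:, None], axis=0)
    return theta - lr * grad, l
```

In practice, one would also gradually decrease $\epsilon$ and $\gamma$ across iterations, as suggested by Theorem 3(b).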
[Figure 2: Safe neural policy optimization for robot-arm (a)-(b) and 6-DoF quadrotor (c)-(d); panels show (a) robot-arm loss, (b) constraint violation during optimization, (c) quadrotor loss, and (d) constraint violation during optimization.]

Fig. 2a and 2c plot the loss (control cost) versus gradient-descent iteration under different $\epsilon$, showing that the NN policy achieves good convergence when $\epsilon\le 10^{-2}$ (as asserted by Theorem 3). Fig. 2b and 2d show all intermediate control trajectories generated by the NN policy over the entire course of optimization; we also mark the constraint $U_{\max}$ and compare with unconstrained policy optimization under the same settings. The results confirm that Safe PDP attains an optimal policy while guaranteeing that any intermediate policy throughout optimization is safe.

Problem II: Safe Motion Planning searches for a dynamics-feasible trajectory that optimizes a criterion and avoids unsafe regions (obstacles), meanwhile guaranteeing that any intermediate motion trajectory during the search also avoids the unsafe regions. To apply Safe PDP to solve such a problem, we specialize $\Sigma(\theta)$ as in (8) except that the policy here is $u_t=u(t,\theta)$, which is represented by a Lagrange polynomial [83] with $\theta$ the parameters (pivot points). In Problem P, the task loss is set as $J(\theta_{\rm obj})$ and the task constraints as $g(\theta_{\rm cstr})$, with $\theta_{\rm obj}$ and $\theta_{\rm cstr}$ known in Table 1. The safe planning results using Safe PDP for the cartpole and the 6-DoF rocket landing are in Fig. 3, in comparison with ALTRO, a state-of-the-art constrained trajectory optimization method [21]. Other results and more details are in Appendix F.2.

[Figure 3: Safe motion planning for cartpole (a)-(b) and 6-DoF rocket powered landing (c)-(d); panels show (a) cartpole loss, (b) constraint violation during optimization, (c) rocket loss, and (d) constraint violation during optimization.]

Fig. 3a and 3c plot the task loss versus gradient-descent iteration, showing that the trajectory achieves good convergence with $\epsilon\le 10^{-2}$. Fig. 3b and 3d show all intermediate motion trajectories during the entire optimization, with the constraints marked. The results confirm that Safe PDP can find an optimal trajectory while always respecting the constraints throughout the planning process.

Problem III: Learning MPC from Demonstrations. Suppose that for all systems in Table 1, the control cost $J(\theta_{\rm cost})$, dynamics $f(\theta_{\rm dyn})$, and constraints $g_t(\theta_{\rm cstr})$ are all unknown and parameterized as in Table 1. We aim to jointly learn $\theta=\{\theta_{\rm cost},\theta_{\rm dyn},\theta_{\rm cstr}\}$ from demonstrations $\xi^{\rm demo}=\{x^{\rm demo}_{0:T},u^{\rm demo}_{0:T-1}\}$ of a true expert system. In Problem P, we set $\Sigma(\theta)$ as (1), consisting of the parameterized $J(\theta_{\rm cost})$, $f(\theta_{\rm dyn})$, and $g_t(\theta_{\rm cstr})$; we set the task loss $\ell(\xi_\theta,\theta)=\lVert\xi^{\rm demo}-\xi_\theta\rVert^2$, which quantifies the reproducing loss between $\xi^{\rm demo}$ and $\xi_\theta$; and there are no task constraints. By solving Problem P, we can learn $\Sigma(\theta)$ such that its reproduced $\xi_\theta$ is closest to the given $\xi^{\rm demo}$. The demonstrations $\xi^{\rm demo}$ here are generated with a known ground-truth $\theta$ (two episode trajectories for each system with time horizon $T=50$). The plots of the loss versus gradient-descent iteration are in Fig. 4, and more details and results are in Appendix F.3.
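As a small illustration of the chain rule used for this imitation objective (a sketch; the flattened-trajectory representation and the helper names are assumptions, and the Jacobian is assumed to come from the auxiliary control system), the reproducing loss and its gradient with respect to $\theta$ can be written as:

```python
import numpy as np

def imitation_loss_and_grad(xi, dxi_dtheta, xi_demo):
    """Reproducing loss l = ||xi_demo - xi||^2 and its gradient w.r.t. theta.

    xi, xi_demo : flattened trajectories of equal length d
    dxi_dtheta  : (d, r) Jacobian d xi / d theta from the auxiliary control system
    """
    err = xi - xi_demo
    loss = float(err @ err)
    grad = 2.0 * err @ dxi_dtheta   # chain rule: dl/dtheta = (dl/dxi) (dxi/dtheta)
    return loss, grad
```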
In Fig. 4a-4d, for each system, we use three strategies to obtain $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$ for $\Sigma(\theta)$: (A) use a solver [77] to obtain $\xi_\theta$ and use Theorem 1 to obtain $\frac{\partial\xi_\theta}{\partial\theta}$; (B) use Theorem 2 to approximate both $\xi_\theta$ and $\frac{\partial\xi_\theta}{\partial\theta}$ by $\xi_{(\theta,\gamma)}$ and $\frac{\partial\xi_{(\theta,\gamma)}}{\partial\theta}$, respectively, with $\gamma=10^{-2}$; and (C) use a solver to obtain $\xi_\theta$ and Theorem 2 only for $\frac{\partial\xi_\theta}{\partial\theta}$. Fig. 4a-4d show that for Strategies (B) and (C), the reproducing loss quickly converges to zero, indicating that the dynamics, constraints, and control cost are successfully learned to reproduce the demonstrations. Fig. 4a-4d also show numerical instability for Strategy (A); this is due to the discontinuous switching of active inequalities between iterations, and also to errors in correctly identifying the active inequalities (we identify them by checking $g_{t,i}>-\delta$ with $\delta>0$ a small threshold), as analyzed in Section 5. More analysis is given in Appendix F.3. Note that we are not aware of any existing methods that can handle the joint learning of cost, dynamics, and constraints considered here, and thus we do not provide a benchmark comparison. Fig. 4e gives timing results of Safe PDP.

[Figure 4: Jointly learning dynamics, constraints, and control cost from demonstrations; panels (a) cartpole (log-y), (b) robot-arm, (c) quadrotor (log-y), and (d) rocket plot the reproducing loss versus iteration for the three strategies, and panel (e) plots the time per iteration versus time horizon T for the forward and backward passes of Theorem 1 and Theorem 2.]

8 Discussion

[Figure 5: Time for differentiating an optimal trajectory with different T, comparing Safe PDP, CasADi, and differentiable MPC.]

Comparisons with other differentiable frameworks. Fig. 5 compares Safe PDP, CasADi [70], and Differentiable MPC [68] in terms of the computational efficiency of differentiating an optimal trajectory of a constrained optimal control system with different control horizons $T$. The results show a significant computational advantage of Safe PDP over CasADi and Differentiable MPC. Specifically, Safe PDP has a complexity of $O(T)$, while CasADi and Differentiable MPC have at least $O(T^2)$. This is because both CasADi and Differentiable MPC are based on the implicit function theorem [75] and need to compute the inverse of a Hessian matrix whose size is proportional to $T\times T$. In contrast, Safe PDP solves the gradient of a trajectory by constructing an Auxiliary Control System, which can be solved using the Riccati equation.

Limitation of Safe PDP. Safe PDP requires a safe (feasible) initialization such that the log-barrier control cost or loss is well-defined. While restrictive, safe initialization is common in safe learning [63, 84]. We offer the following empirical guidance on how to provide safe initializations for different types of problems, as adopted in our experiments in Section 7. In safe policy optimization, one could first use supervised learning to learn a safe policy from some safe trajectories/demonstrations (not necessarily optimal) and then use the learned safe policy to initialize Safe PDP. In safe motion planning, one could arbitrarily provide a safe trajectory (not necessarily optimal) to initialize Safe PDP. In learning MPCs, the goal includes learning the constraints themselves, and there is no such requirement.
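Since the log barriers are only defined on strictly feasible trajectories, a practical first step is to verify the initialization before running Safe PDP; a minimal check (with hypothetical constraint callables `g_path` and `g_terminal`) might look as follows.

```python
import numpy as np

def is_safe_initialization(xs, us, g_path, g_terminal, margin=0.0):
    """Return True if a candidate trajectory is strictly feasible,
    so that every log-barrier term in Safe PDP is well-defined.

    xs : states x_0..x_T, us : inputs u_0..u_{T-1};
    g_path(x, u) and g_terminal(x) return arrays that must be < 0 when safe.
    """
    for x, u in zip(xs[:-1], us):
        if np.any(g_path(x, u) >= -margin):
            return False
    return bool(np.all(g_terminal(xs[-1]) < -margin))
```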
Strategies to accelerate the forward pass of Safe PDP. There are many strategies to accelerate a long-horizon trajectory optimization (optimal control) in the forward pass of Safe PDP. (I) One effective way is to scale the (continuous) long-horizon problem into a smaller one (e.g., a unit horizon) by applying a time-warping function to the dynamics and cost function [85]; after solving the scaled short-horizon problem, re-scale the trajectory back. (II) There are also warm-start tricks, e.g., one can initialize the trajectory at the next iteration using the result of the previous iteration. (III) One can also use a hierarchical strategy to solve the trajectory optimization from coarse to fine resolutions. We have tested and provided a comparison of the above three acceleration strategies in Appendix G.2. Please refer to Appendix G for more discussion, which includes G.1: comparison between Safe PDP and non-safe PDP; G.2: comparison of different strategies for accelerating long-horizon trajectory optimization; G.3: trade-offs between accuracy and computational efficiency using barrier penalties; G.4: learning MPCs from non-optimal data; and G.5: detailed discussion on the limitations of Safe PDP.

9 Conclusions

This paper proposes a Safe Pontryagin Differentiable Programming methodology, which establishes a provable and systematic safe differentiable framework to solve a broad class of safety-critical control and learning tasks with different types of safety constraints. For a constrained system and task, Safe PDP approximates both the solution and its gradient in the backward pass by solving their more efficient unconstrained counterparts. Safe PDP establishes two guarantees: one is the controlled accuracy of the approximations of the solution and its gradient, and the other is the safety of constraint satisfaction throughout the control and learning process. We envision the potential of Safe PDP for addressing various safety-critical problems in the machine learning, control, and robotics fields.

Acknowledgments and Disclosure of Funding

This work is supported by the NASA University Leadership Initiative (ULI) under grant number 80NSSC20M0161. The research of Prof. George J. Pappas is supported by the AFOSR Assured Autonomy in Congested Environments program under grant number FA9550-19-1-0265. This work was done primarily in the last semester of Wanxin Jin's Ph.D. study at Purdue University. Wanxin Jin thanks Prof. Zhaoran Wang for discussions about this work.

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[2] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[3] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
[4] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
[5] Matthias Heger. Consideration of risk in reinforcement learning. In International Conference on Machine Learning, pages 105–111, 1994.
[6] Ronald A Howard and James E Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
[7] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. International Conference on Machine Learning, 2012.
[8] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[9] Kemin Zhou and John Comstock Doyle. Essentials of robust control, volume 104. Prentice Hall, Upper Saddle River, NJ, 1998.
[10] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[11] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
[12] Eitan Altman. Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Mathematical Methods of Operations Research, 48(3):387–417, 1998.
[13] Ming Yu, Zhuoran Yang, Mladen Kolar, and Zhaoran Wang. Convergent policy optimization for safe reinforcement learning. Advances in Neural Information Processing Systems, 2019.
[14] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(1):6070–6120, 2017.
[15] Shalabh Bhatnagar and K Lakshmanan. An online actor-critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.
[16] Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained Markov decision processes. Advances in Neural Information Processing Systems, 33, 2020.
[17] Miguel Calvo-Fullana, Santiago Paternain, Luiz FO Chamon, and Alejandro Ribeiro. State augmented constrained reinforcement learning: Overcoming the limitations of learning with rewards. arXiv preprint arXiv:2102.11941, 2021.
[18] Yatin Nandwani, Abhishek Pathak, Parag Singla, et al. A primal-dual formulation for deep learning with constraints. Advances in Neural Information Processing Systems, 2019.
[19] Maïtine Bergounioux, Kazufumi Ito, and Karl Kunisch. Primal-dual strategy for constrained optimal control problems. SIAM Journal on Control and Optimization, 37(4):1176–1194, 1999.
[20] Matthew R Kirchner, Gary Hewer, Jérôme Darbon, and Stanley Osher. A primal-dual method for optimal control and trajectory generation in high-dimensional systems. In Conference on Control Technology and Applications, pages 1583–1590, 2018.
[21] Taylor A Howell, Brian E Jackson, and Zachary Manchester. ALTRO: A fast solver for constrained trajectory optimization. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 7674–7679, 2019.
[22] Simon S Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. In International Conference on Artificial Intelligence and Statistics, pages 196–205. PMLR, 2019.
[23] Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pages 4880–4889. PMLR, 2020.
[24] Yinlam Chow, Ofir Nachum, Aleksandra Faust, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. Lyapunov-based safe policy optimization for continuous control.
arXiv preprint arXiv:1901.10031, 2019.
[25] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
[26] Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, and Peter J Ramadge. Projection-based constrained policy optimization. International Conference on Learning Representations, 2020.
[27] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
[28] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. Advances in Neural Information Processing Systems, 2018.
[29] Theodore J Perkins and Andrew G Barto. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3(Dec):803–832, 2002.
[30] Felix Berkenkamp, Matteo Turchetta, Angela P Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. Advances in Neural Information Processing Systems, 2017.
[31] Aleksandr Mikhailovich Lyapunov. The general problem of the stability of motion. International Journal of Control, 55(3):531–534, 1992.
[32] Peter Giesl and Sigurdur Hafstein. Review on computational methods for Lyapunov functions. Discrete & Continuous Dynamical Systems-B, 20(8):2291, 2015.
[33] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. Advances in Neural Information Processing Systems, 29:4312–4320, 2016.
[34] Akifumi Wachi and Yanan Sui. Safe reinforcement learning in constrained Markov decision processes. In International Conference on Machine Learning, pages 9797–9806. PMLR, 2020.
[35] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J Tomlin. Hamilton-Jacobi reachability: A brief overview and recent advances. In IEEE Conference on Decision and Control, pages 2242–2253, 2017.
[36] Peter Wieland and Frank Allgöwer. Constructive safety using control barrier functions. IFAC Proceedings Volumes, 40(12):462–467, 2007.
[37] Aaron D Ames, Xiangru Xu, Jessy W Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–3876, 2016.
[38] Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
[39] Sylvia L Herbert, Mo Chen, Soo Jean Han, Somil Bansal, Jaime F Fisac, and Claire J Tomlin. FaSTrack: A modular framework for fast and guaranteed safe motion planning. In IEEE Conference on Decision and Control, pages 1517–1522, 2017.
[40] Mo Chen, Jaime F Fisac, Shankar Sastry, and Claire J Tomlin. Safe sequential path planning of multi-vehicle systems via double-obstacle Hamilton-Jacobi-Isaacs variational inequality. In European Control Conference, pages 3304–3309, 2015.
[41] Lawrence C Evans and Panagiotis E Souganidis. Differential games and representation formulas for solutions of Hamilton-Jacobi-Isaacs equations. Indiana University Mathematics Journal, 33(5):773–797, 1984.
[42] Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In European Control Conference, pages 3420–3431, 2019.
[43] Jason Choi, Fernando Castaneda, Claire J Tomlin, and Koushil Sreenath. Reinforcement learning for safety-critical control under model uncertainty, using control Lyapunov functions and control barrier functions. arXiv preprint arXiv:2004.07584, 2020.
[44] Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In AAAI Conference on Artificial Intelligence, pages 3387–3395, 2019.
[45] Alexander Robey, Haimin Hu, Lars Lindemann, Hanwen Zhang, Dimos V Dimarogonas, Stephen Tu, and Nikolai Matni. Learning control barrier functions from expert demonstrations. In IEEE Conference on Decision and Control, pages 3717–3724, 2020.
[46] Alexander Robey, Lars Lindemann, Stephen Tu, and Nikolai Matni. Learning robust hybrid control barrier functions for uncertain systems. arXiv preprint arXiv:2101.06492, 2021.
[47] Wanxin Jin, Zhaoran Wang, Zhuoran Yang, and Shaoshuai Mou. Neural certificates for safe control policies. arXiv preprint arXiv:2006.08465, 2020.
[48] Anthony V Fiacco and Garth P McCormick. Nonlinear programming: sequential unconstrained minimization techniques. SIAM, 1990.
[49] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
[50] Anders Forsgren, Philip E Gill, and Margaret H Wright. Interior methods for nonlinear optimization. SIAM Review, 44(4):525–597, 2002.
[51] AEB Lim, JB Moore, and L Faybusovich. Linearly constrained LQ and LQG optimal control. IFAC Proceedings Volumes, 29(1):1110–1115, 1996.
[52] SJ Wright. Structured interior point methods for optimal control. In IEEE Conference on Decision and Control, pages 1711–1716, 1991.
[53] Stephen J Wright. Interior point methods for optimal control of discrete time systems. Journal of Optimization Theory and Applications, 77(1):161–187, 1993.
[54] Christopher V Rao, Stephen J Wright, and James B Rawlings. Application of interior-point methods to model predictive control. Journal of Optimization Theory and Applications, 99(3):723–757, 1998.
[55] Anders Hansson and S Boyd. Robust optimal control of linear discrete-time systems using primal-dual interior-point methods. In American Control Conference, volume 1, pages 183–187, 1998.
[56] Anders Hansson. A primal-dual interior-point method for robust optimal control of linear discrete-time systems. IEEE Transactions on Automatic Control, 45(9):1639–1655, 2000.
[57] Christian Feller and Christian Ebenbauer. Relaxed logarithmic barrier function based model predictive control of linear systems. IEEE Transactions on Automatic Control, 62(3):1223–1238, 2016.
[58] Julien Laurent-Varin, J Frederic Bonnans, Nicolas Bérend, Mounir Haddou, and Christophe Talbot. Interior-point approach to trajectory optimization. Journal of Guidance, Control, and Dynamics, 30(5):1228–1238, 2007.
[59] John Hauser and Alessandro Saccon. A barrier function method for the optimization of trajectory functionals with constraints. In IEEE Conference on Decision and Control, pages 864–869. IEEE, 2006.
[60] Paul Malisani, François Chaplais, and Nicolas Petit. An interior penalty method for optimal control problems with state and input constraints of nonlinear systems. Optimal Control Applications and Methods, 37(1):3–33, 2016.
[61] Alexander Domahidi, Aldo U Zgraggen, Melanie N Zeilinger, Manfred Morari, and Colin N Jones. Efficient interior point methods for multistage problems arising in receding horizon control.
In IEEE Conference on Decision and Control, pages 668–674, 2012.
[62] Andrei Pavlov, Iman Shames, and Chris Manzie. Interior point differential dynamic programming. IEEE Transactions on Control Systems Technology, 2021.
[63] Ilnura Usmanova, Andreas Krause, and Maryam Kamgarpour. Safe non-smooth black-box optimization with application to policy search. In Learning for Dynamics and Control, pages 980–989. PMLR, 2020.
[64] Yongshuai Liu, Jiaxin Ding, and Xin Liu. IPO: Interior-point policy optimization under constraints. In AAAI Conference on Artificial Intelligence, pages 4940–4947, 2020.
[65] Priya L Donti, Melrose Roderick, Mahyar Fazlyab, and J Zico Kolter. Enforcing robust control guarantees within neural network policies. In International Conference on Learning Representations, 2020.
[66] Bingqing Chen, Priya L. Donti, Kyri Baker, J. Zico Kolter, and Mario Bergés. Enforcing policy feasibility constraints through differentiable projection for energy optimization. In ACM International Conference on Future Energy Systems, pages 199–210, New York, NY, USA, 2021. Association for Computing Machinery.
[67] Tu-Hoa Pham, Giovanni De Magistris, and Ryuki Tachibana. OptLayer: Practical constrained optimization for deep reinforcement learning in the real world. In International Conference on Robotics and Automation, pages 6236–6243. IEEE, 2018.
[68] Brandon Amos, Ivan Dario Jimenez Rodriguez, Jacob Sacks, Byron Boots, and J Zico Kolter. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, 2018.
[69] Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136–145. PMLR, 2017.
[70] Joel AE Andersson and James B Rawlings. Sensitivity analysis for nonlinear programming in CasADi. IFAC-PapersOnLine, 51(20):331–336, 2018.
[71] Wanxin Jin, Zhaoran Wang, Zhuoran Yang, and Shaoshuai Mou. Pontryagin differentiable programming: An end-to-end learning and control framework. In Advances in Neural Information Processing Systems, 2020.
[72] Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295, 2017.
[73] J Pearson and R Sridhar. A discrete optimal control problem. IEEE Transactions on Automatic Control, 11(2):171–174, 1966.
[74] Anthony V Fiacco. Sensitivity analysis for nonlinear programming using penalty methods. Mathematical Programming, 10(1):287–311, 1976.
[75] Walter Rudin et al. Principles of mathematical analysis, volume 3. McGraw-Hill, New York, 1976.
[76] Charles D Kolstad and Leon S Lasdon. Derivative evaluation and computational experience with large bilevel mathematical programs. Journal of Optimization Theory and Applications, 65(3):485–499, 1990.
[77] Joel AE Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1–36, 2019.
[78] Athanasios Sideris and Luis A Rodriguez. A Riccati approach to equality constrained linear quadratic optimal control. In American Control Conference, pages 5167–5172, 2010.
[79] Shuo Yang, Gerry Chen, Yetong Zhang, Frank Dellaert, and Howie Choset. Equality constrained linear optimal control with factor graphs. arXiv preprint arXiv:2011.01360, 2020.
[80] Forrest Laine and Claire Tomlin.
Efficient computation of feedback control for equalityconstrained lqr. In International Conference on Robotics and Automation, pages 6748 6754. IEEE, 2019. [81] Weiwei Li and Emanuel Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In International Conference on Informatics in Control, Automation and Robotics, pages 222 229, 2004. [82] David H Jacobson and David Q Mayne. Differential dynamic programming. Number 24. Elsevier Publishing Company, 1970. [83] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government printing office, 1964. [84] Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. Machine Learning, pages 1 35, 2021. [85] Wanxin Jin, Todd D Murphey, Dana Kuli c, Neta Ezer, and Shaoshuai Mou. Learning from sparse demonstrations. ar Xiv preprint ar Xiv:2008.02159, 2020. [86] Michael Athans. The matrix minimum principle. Information and control, 11(5-6):592 606, 1967. [87] Jean Dieudonné. Foundations of modern analysis. New York: Academic Press. Volume 1 of Treatise on Analysis, 2011. [88] Rolf Johansson. System modeling and identification. Prentice-hall, 1993. [89] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018. [90] Wanxin Jin, Dana Kuli c, Shaoshuai Mou, and Sandra Hirche. Inverse optimal control from incomplete trajectory observations. The International Journal of Robotics Research, 40(67):848 865, 2021. [91] Wanxin Jin and Shaoshuai Mou. Distributed inverse optimal control. Automatica, 129, 2021. [92] Wanxin Jin, Dana Kuli c, Jonathan Feng-Shun Lin, Shaoshuai Mou, and Sandra Hirche. Inverse optimal control for multiphase cost functions. IEEE Transactions on Robotics, 35(6):1387 1398, 2019. [93] Wanxin Jin, Todd D Murphey, and Shaoshuai Mou. Learning from incremental directional corrections. ar Xiv preprint ar Xiv:2011.15014, 2020. [94] Michael A Patterson and Anil V Rao. Gpops-ii: A matlab software for solving multiple-phase optimal control problems using hp-adaptive gaussian quadrature collocation methods and sparse nonlinear programming. ACM Transactions on Mathematical Software (TOMS), 41(1):1 37, 2014. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] Please see Sections 8 and Appendix G.5 in the paper. (c) Did you discuss any potential negative societal impacts of your work? [N/A] (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] Pleae find them in all theorems in the paper. (b) Did you include complete proofs of all theoretical results? [Yes] Please find complete proofs for all theoretical results in the Appendix in the supplementary file. 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Please refer to the code at https://github.com/wanxinjin/Safe-PDP. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? 
[Yes] Please see Appendix F in the supplementary file.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Please see Appendix F in the supplementary file.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] Please see the citation of the experimental environments in Table 1 in the paper.
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] Some video demo links are included in Appendix F in the supplementary file.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]