Published as a conference paper at ICLR 2024

PHYSICS-REGULATED DEEP REINFORCEMENT LEARNING: INVARIANT EMBEDDINGS

Hongpeng Cao¹, Yanbing Mao², Lui Sha³ & Marco Caccamo¹
¹TUM, Germany, ²WSU, United States, ³UIUC, United States
(Equal contribution. Correspondence to Yanbing Mao: maoyanbing.eth@gmail.com.)

ABSTRACT

This paper proposes Phy-DRL: a physics-regulated deep reinforcement learning (DRL) framework for safety-critical autonomous systems. Phy-DRL has three distinguishing invariant-embedding designs: i) a residual action policy (i.e., integrating a data-driven DRL action policy and a physics-model-based action policy), ii) an automatically constructed safety-embedded reward, and iii) physics-model-guided neural network (NN) editing, including link editing and activation editing. Theoretically, Phy-DRL exhibits 1) a mathematically provable safety guarantee and 2) strict compliance of the critic and actor networks with physics knowledge about the action-value function and action policy. Finally, we evaluate Phy-DRL on a cart-pole system and a quadruped robot. The experiments validate our theoretical results and demonstrate that Phy-DRL features guaranteed safety compared to purely data-driven DRL and solely model-based design, while offering remarkably fewer learning parameters and fast training towards a safety guarantee.

1 INTRODUCTION

1.1 MOTIVATIONS

Machine learning (ML) technologies have been integrated into autonomous systems, defining learning-enabled autonomous systems. These have succeeded tremendously in many complex tasks with high-dimensional state and action spaces. However, recent incidents caused by deployed ML models overshadow the revolutionizing potential of ML, especially for safety-critical autonomous systems Zachary & Helen (2021). Developing safe ML is thus more vital today.

In the ML community, deep reinforcement learning (DRL) has demonstrated breakthroughs in sequential decision-making in broad areas, ranging from autonomous driving Kendall et al. (2019) to games Silver et al. (2018). This motivates us to develop a DRL-based safe learning framework for achieving safe and complex tasks of safety-critical autonomous systems.

[Figure 1: Phy-DRL Framework, applied to a quadruped robot. The Phy-DRL agent combines the data-driven action policy a_drl(k), produced by the actor and critic networks, with the model-based action policy a_phy(k) = F s(k), and applies the residual action to the environment.]

Despite the tremendous success of DRL in many autonomous systems for complex decision-making, applying DRL to safety-critical autonomous systems remains a challenging problem. It is deeply rooted in the action policy of DRL being parameterized by deep neural networks (DNNs), whose behaviors are hard to predict Huang et al. (2017) and verify Katz et al. (2017), raising the first safety concern. The second safety concern stems from the purely data-driven DNNs that DRL adopts for powerful function approximation and representation learning of the action-value function, action policy, and environment states Mnih et al. (2015); Silver et al. (2016). Specifically, recent studies revealed that purely data-driven DNNs applied to physical systems can infer relations violating physics laws, which sometimes leads to catastrophic consequences (e.g., data-driven blackouts owing to violation of physical limits Zachary & Helen (2021)).
1.2 CONTRIBUTIONS

To address the aforementioned safety concerns, we propose Phy-DRL: a physics-regulated deep reinforcement learning framework with enhanced safety assurance. Depicted in Figure 1, Phy-DRL has three novel (invariant-embedding) architectural designs:

- Residual Action Policy, which integrates a data-driven DRL action policy and a physics-model-based action policy.
- Safety-Embedded Reward, which, in conjunction with the Residual Action Policy, empowers Phy-DRL with a mathematically provable safety guarantee and fast training.
- Physics-Knowledge-Enhanced Critic and Actor Networks, whose neural architectures have two key components: i) NN input augmentation for directly capturing hard-to-learn features, and ii) NN editing, including link editing and activation editing, for guaranteeing strict compliance with available knowledge about the action-value function and action policy.

1.3 RELATED WORK AND OPEN PROBLEMS

Residual Action Policy. Recent research on DRL for controlling autonomous systems has shifted towards integrating data-driven DRL and model-based decision-making, leading to a residual action policy diagram. In this diagram, the model-based action policy can guide the exploration of DRL agents during training, while the DRL policy learns to effectively deal with uncertainties and to compensate for the model mismatch of the model-based action policy. Existing residual frameworks Rana et al.; Li et al. (a); Cheng et al. (2019b); Johannink et al. (2019) mainly focus on stability guarantees, the exception being Cheng et al. (2019a), which targets a safety guarantee. Moreover, the physics models used in these frameworks are nonlinear, which makes analyzable and verifiable behavior difficult to deliver. Furthermore, the model knowledge has not yet been explored to regulate the construction of DRL towards a safety guarantee. In summary, the open problems in this domain are:

Problem 1.1. How to design a residual action policy to be the best trade-off between the model-based action policy and the data-driven DRL action policy?

Problem 1.2. How can the knowledge of the physics model be used to construct a DRL's reward towards a safety guarantee?

Safety-Embedded Reward. A safety-embedded reward is crucial for DRL to search for policies that are safe. To achieve a safety guarantee, the control Lyapunov function (CLF) is a potential safety-embedded reward Perkins & Barto (2002); Berkenkamp et al. (2017); Chang & Gao (2021); Zhao et al. (2023; 2024). Meanwhile, the seminal work Westenbroek et al. (2022) discovered that if DRL's reward is CLF-like, systems controlled by a well-trained DRL policy can retain a mathematically provable stability guarantee. Moving forward, the question of how to construct such a CLF-like reward for achieving a safety guarantee remains open, i.e.,

Problem 1.3. What is the systematic guidance for constructing the safety-embedded reward (e.g., a CLF-like reward) for DRL?

Knowledge-Enhanced Neural Networks. The critical flaw of purely data-driven DNNs, i.e., violation of physics laws, motivates the emerging research on physics-enhanced DNNs. Current frameworks include physics-informed NNs Wang & Yu; Willard et al. (2021); Jia et al. (2021); Lu et al. (2021); Chen et al. (2021); Xu & Darve (2022); Karniadakis et al. (2021); Wang et al. (2020); Cranmer et al. (2020) and physics-guided NN architectures Muralidhar et al. (2020); Masci et al. (2015); Monti et al. (2017); Horie et al.; Wang (2021); Li et al. (b).
Both use compact partial differential equations (PDEs) for formulating training loss functions and/or architectural components. These frameworks improve the degree of consistency with prior physics knowledge but remain problematic when applied to DRL. For example, we define DRL's reward in advance. The critic network of DRL is to learn or estimate the expected future reward, also known as the action-value function. Because the action-value function involves unknown future rewards, its compact governing equation is unavailable for physics-informed networks and physics-guided architectures. In summary, only partial knowledge about the action-value function and action policy is available, which thus motivates the open problem:

Problem 1.4. How do we develop end-to-end critic and actor networks that strictly comply with partially available physics knowledge about the action-value function and action policy?

1.4 SUMMARY: ANSWERS TO PROBLEM 1.1–PROBLEM 1.4

The proposed Phy-DRL answers Problem 1.1–Problem 1.4 simultaneously. As shown in Figure 1, the residual diagram of Phy-DRL simplifies the model-based action policy to an analyzable and verifiable linear one, while offering fast training towards a safety guarantee. Meanwhile, the linear model knowledge (leveraged for computing the model-based policy) works as model-based guidance for constructing the safety-embedded reward for DRL towards a mathematically provable safety guarantee. Lastly, the proposed NN editing guarantees strict compliance of the critic and actor networks with partially available physics knowledge about the action-value function and action policy.

2 PRELIMINARIES

Table 1 in Appendix A summarizes the notations used throughout the paper.

2.1 DYNAMICS MODEL OF REAL PLANT

The generic dynamics model of a real plant can be described by

s(k+1) = A s(k) + B a(k) + f(s(k), a(k)), k ∈ ℕ, (1)

where f(s(k), a(k)) ∈ ℝⁿ is the unknown model mismatch, A ∈ ℝ^{n×n} and B ∈ ℝ^{n×m} denote the known system matrix and control structure matrix, respectively, s(k) ∈ ℝⁿ is the system state, and a(k) ∈ ℝ^m is the applied control action. The available model knowledge pertaining to the real plant (1) is represented by (A, B).

2.2 SAFETY DEFINITION

The considered safety problem stems from safety regulations or constraints on system states, which motivates the following definition of the safety set X.

Safety Set: X ≜ {s ∈ ℝⁿ | v̲ ⪯ D s − v ⪯ v̄}, D ∈ ℝ^{h×n}, v, v̲, v̄ ∈ ℝ^h, (2)

where D, v, v̲ and v̄ are given in advance for formulating the h safety conditions. Considering the safety set, we present the definition of a safety guarantee.

Definition 2.1. Consider the safety set X in Equation (2) and its subset Ω. The real plant (1) is said to be safety guaranteed if, for any s(1) ∈ Ω ⊆ X, we have s(k) ∈ Ω ⊆ X for all k > 1 ∈ ℕ.

Remark 2.2 (Role of Ω). The subset Ω is called the safety envelope, whose details will be explained in Section 5. Ω will bridge the many (i.e., high-dimensional) safety conditions in the safety set (2) and the one-dimensional safety-embedded reward. Meanwhile, Definition 2.1 indicates that a safety guarantee means Phy-DRL successfully searches for a policy that renders Ω invariant (i.e., operating from any initial sample inside Ω, the system state never leaves Ω at any time).

3 DESIGN OVERVIEW: INVARIANT EMBEDDINGS

In this paper, an invariant refers to a prior policy, prior knowledge, or a designed property independent of DRL agent training. As shown in Figure 1, the proposed Phy-DRL can address Problem 1.1–Problem 1.4 because of three invariant-embedding designs.
Specifically: i) the residual action policy integrates the data-driven action policy with an invariant model-based action policy that completely depends on the prior model knowledge (A, B); ii) the safety-embedded reward, whose off-line-designed inequality (shown in Equation (8)) for assistance in delivering the mathematically provable safety guarantee is also completely independent of agent training; iii) the physics-knowledge-enhanced DNN, whose NN editing embeds the prior invariant knowledge about the action-value function and action policy into the critic and actor networks, respectively. Section 4, Section 5 and Section 6 detail the three designs, respectively.

4 INVARIANT EMBEDDING 1: RESIDUAL ACTION POLICY

As shown in Figure 1, the applied control action a(k) from Phy-DRL is given in the residual form:

$$\mathbf{a}(k) = \underbrace{\mathbf{a}_{\mathrm{drl}}(k)}_{\text{data-driven}} + \underbrace{\mathbf{a}_{\mathrm{phy}}(k)}_{\text{invariant: model-based}}, \tag{3}$$

where a_drl(k) denotes a data-driven action from DRL, while a_phy(k) is a model-based action, computed according to the invariant policy:

a_phy(k) = F s(k), (4)

where the computation of F is based on the model knowledge (A, B), carried out in Section 5. The developed Phy-DRL is based on the actor-critic architecture in DRL Lillicrap et al. (2016); Schulman et al.; Haarnoja et al. (2018) for searching an action policy a_drl(k) = π(s(k)) that maximizes the expected return from the initial state s(k):

$$Q^{\pi}(\mathbf{s}(k), \mathbf{a}_{\mathrm{drl}}(k)) = \mathbb{E}_{\mathbf{s}(k)\sim\rho,\ \mathbf{a}_{\mathrm{drl}}(k)\sim\pi}\left[\sum_{t=k}^{\infty} \gamma^{\,t-k}\, \mathcal{R}\big(\mathbf{s}(t), \mathbf{a}_{\mathrm{drl}}(t)\big)\right], \tag{5}$$

where ρ represents the initial state distribution, R(·) maps a state-action pair to a real-valued reward, and γ ∈ [0, 1] is the discount factor. The expected return (5) and the action policy π(·) are parameterized by the critic and actor networks, respectively.

5 INVARIANT EMBEDDING 2: SAFETY-EMBEDDED REWARD

The safety formula (2) is not yet ready for constructing the safety-embedded reward, since it has multiple safety conditions while the reward R(·) in Equation (5) is one-dimensional. To bridge the gap, we introduce the following safety envelope, which converts the multi-dimensional safety conditions into a scalar value.

Safety Envelope: Ω ≜ {s ∈ ℝⁿ | sᵀ P s ≤ 1, P ≻ 0}. (6)

The following lemma builds the connection: the safety envelope is a subset of the safety set (as required in Definition 2.1). Its formal proof appears in Appendix C.

Lemma 5.1. Consider the sets defined in Equation (2) and Equation (6). We have Ω ⊆ X if

$$[\overline{\mathbf{D}}]_{i,:}\,\mathbf{P}^{-1}\,[\overline{\mathbf{D}}]^{\top}_{:,i} \le 1 \quad \text{and} \quad [\underline{\mathbf{D}}]_{i,:}\,\mathbf{P}^{-1}\,[\underline{\mathbf{D}}]^{\top}_{:,i} \begin{cases} = 1, & [\mathbf{d}]_i = 1 \\ \le 1, & [\mathbf{d}]_i = -1 \end{cases}, \quad i \in \{1, 2, \ldots, h\}, \tag{7}$$

where D̄ = D Λ̄, D̲ = D Λ̲, and d, Λ̄ and Λ̲ are defined below for i, j ∈ {1, 2, . . . , h}:

$$[\mathbf{d}]_i \triangleq \begin{cases} 1, & [\mathbf{v}+\underline{\mathbf{v}}]_i > 0 \\ 1, & [\mathbf{v}+\overline{\mathbf{v}}]_i < 0 \\ -1, & \text{otherwise} \end{cases}, \quad
[\overline{\Lambda}]_{i,j} \triangleq \begin{cases} 0, & i \neq j \\ \tfrac{1}{[\mathbf{v}+\overline{\mathbf{v}}]_i}, & [\mathbf{v}+\underline{\mathbf{v}}]_i > 0 \\ \tfrac{1}{[\mathbf{v}+\underline{\mathbf{v}}]_i}, & [\mathbf{v}+\overline{\mathbf{v}}]_i < 0 \\ \tfrac{1}{[\mathbf{v}+\overline{\mathbf{v}}]_i}, & \text{otherwise} \end{cases}, \quad
[\underline{\Lambda}]_{i,j} \triangleq \begin{cases} 0, & i \neq j \\ \tfrac{1}{[\mathbf{v}+\underline{\mathbf{v}}]_i}, & [\mathbf{v}+\underline{\mathbf{v}}]_i > 0 \\ \tfrac{1}{[\mathbf{v}+\overline{\mathbf{v}}]_i}, & [\mathbf{v}+\overline{\mathbf{v}}]_i < 0 \\ \tfrac{1}{[-\mathbf{v}-\underline{\mathbf{v}}]_i}, & \text{otherwise} \end{cases}$$

Referring to the model knowledge (A, B), Equation (6) and Equation (4), the proposed reward is

$$\mathcal{R}(\mathbf{s}(k), \mathbf{a}_{\mathrm{drl}}(k)) = \underbrace{\mathbf{s}^{\top}(k)\,\overline{\mathbf{A}}^{\top}\mathbf{P}\,\overline{\mathbf{A}}\,\mathbf{s}(k) - \mathbf{s}^{\top}(k+1)\,\mathbf{P}\,\mathbf{s}(k+1)}_{r(\mathbf{s}(k),\,\mathbf{s}(k+1)):\ \text{invariant property}\ \mathbf{P} - \overline{\mathbf{A}}^{\top}\mathbf{P}\,\overline{\mathbf{A}} \succ 0} + w(\mathbf{s}(k), \mathbf{a}_{\mathrm{drl}}(k)), \tag{8}$$

where the sub-reward r(s(k), s(k+1)) is safety-embedded, and we define

Ā ≜ A + B F. (9)

Remark 5.2 (Sub-rewards). In Equation (8), the safety-embedded sub-reward r(s(k), s(k+1)) is critical for keeping a system safe, such as avoiding car crashes and preventing car sliding and slipping on an icy road. The sub-reward w(s(k), a_drl(k)) aims at high-performance operations, such as minimizing the energy consumption of resource-limited robots Yang et al. (2022), and can be optional in some time- and safety-critical environments.
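To make Equations (3), (4), (6) and (8) concrete, the short sketch below shows how one control step and the safety-embedded sub-reward could be computed, assuming F and P have already been obtained as described next in Section 5. The matrices A, B, F, P, the mismatch term, and the DRL action below are illustrative placeholders, not the values used in our experiments.

```python
import numpy as np

# Illustrative placeholders: a 2-D plant with known (A, B); F and P are assumed
# to have been computed from the LMIs in Theorem 5.3 (see Remark 5.4).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
F = np.array([[-1.0, -1.5]])          # model-based gain: a_phy(k) = F s(k)
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # safety-envelope matrix (P > 0)
A_bar = A + B @ F                     # Equation (9)

def plant_step(s, a):
    """One step of the real plant (1); the sin() term stands in for the unknown mismatch f."""
    return A @ s + B @ a + 0.01 * np.sin(s)

def residual_action(s, a_drl):
    """Residual action policy, Equation (3): data-driven action plus invariant model-based action."""
    a_phy = F @ s                     # Equation (4)
    return a_drl + a_phy

def sub_reward(s_k, s_k1):
    """Safety-embedded sub-reward r(s(k), s(k+1)) in Equation (8)."""
    return s_k @ A_bar.T @ P @ A_bar @ s_k - s_k1 @ P @ s_k1

def in_safety_envelope(s):
    """Membership test for the safety envelope Omega, Equation (6)."""
    return s @ P @ s <= 1.0

s = np.array([0.2, -0.1])
a_drl = np.array([0.05])              # stand-in for the DRL actor output
a = residual_action(s, a_drl)
s_next = plant_step(s, a)
print(sub_reward(s, s_next), in_safety_envelope(s_next))
```

The safety condition of Theorem 5.3 below amounts to checking that `sub_reward(s, s_next)` stays above α − 1 along closed-loop trajectories.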
Moving forward, we present the following theorem, which states the conditions on the matrices F, P and the reward for a safety guarantee; its proof is given in Appendix D.1.

Theorem 5.3 (Mathematically Provable Safety Guarantee). Consider the safety set X (2), the safety envelope Ω (6), and the system (1) under control of Phy-DRL. The matrices F and P involved in the model-based action policy (4) and the safety-embedded reward (8) are computed according to

F = R̄ Q̄⁻¹, P = Q̄⁻¹, (10)

where R̄ and Q̄ ≻ 0 satisfy the inequalities (7) and

$$\begin{bmatrix} \alpha\,\overline{\mathbf{Q}} & \overline{\mathbf{Q}}\,\mathbf{A}^{\top} + \overline{\mathbf{R}}^{\top}\mathbf{B}^{\top} \\ \mathbf{A}\,\overline{\mathbf{Q}} + \mathbf{B}\,\overline{\mathbf{R}} & \overline{\mathbf{Q}} \end{bmatrix} \succeq 0, \quad \text{with a given } \alpha \in (0, 1). \tag{11}$$

Given any s(1) ∈ Ω, the system state s(k) ∈ Ω ⊆ X holds for all k ∈ ℕ (i.e., the safety of system (1) is guaranteed), if the sub-reward r(s(k), s(k+1)) in (8) satisfies r(s(k), s(k+1)) ≥ α − 1 for all k ∈ ℕ.

Remark 5.4 (Solving Optimal R̄ and Q̄). Equation (4), Equation (8) and Equation (9) imply that the designs of the model-based action policy and the safety-embedded reward reduce to the computations of F and P, while Equation (10) shows these computations depend on R̄ and Q̄ only. So, the remaining work is obtaining R̄ and Q̄. There are multiple toolboxes for solving R̄ and Q̄ from the linear matrix inequalities (LMIs) (7) and (11), such as MATLAB's LMI Solver Boyd et al. (1994). What we are more interested in is finding optimal R̄ and Q̄ that maximize the safety envelope. Since the volume of the safety envelope (6) is proportional to √det(P⁻¹), the optimization of interest is a typical analytic centering problem, formulated as: given α ∈ (0, 1),

$$\arg\min_{\overline{\mathbf{Q}},\, \overline{\mathbf{R}}}\ \log\det\big(\overline{\mathbf{Q}}^{-1}\big) = \arg\max_{\overline{\mathbf{Q}},\, \overline{\mathbf{R}}}\ \log\det\big(\mathbf{P}^{-1}\big), \quad \text{subject to LMIs (7) and (11)}, \tag{12}$$

from which the optimal R̄ and Q̄ can be solved via the CVX toolbox Grant et al. (2009) (a schematic CVXPY sketch is given below).

Remark 5.5 (F is given). Equation (12) also works in the scenario of a given model-based action policy (i.e., a given F), which is carried out in Section 7.2 as an example.

Remark 5.6 (Provable Stability Guarantee). Following the same proof path as Theorem 5.3, Phy-DRL also exhibits a mathematically provable stability guarantee, which is presented in Appendix E.

Remark 5.7 (Fast Training). The proof path of Theorem 5.3 is leveraged to reveal the driving factor of Phy-DRL's fast training towards a safety guarantee, which is presented in Appendix D.2.

Remark 5.8 (Obtaining (A, B)). For a system with an available nonlinear dynamics model, the model knowledge (A, B) can be directly obtained by simplifying the nonlinear model to a linear one. For a system whose dynamics model is not available, (A, B) can be obtained via system identification Oymak & Ozay (2019), as used in social systems Mao et al. (2022).

6 INVARIANT EMBEDDING 3: PHYSICS-KNOWLEDGE-ENHANCED DNN

[Figure 2: Physics-Knowledge-Enhanced DNN architecture: (a) the physics-knowledge-enhanced DNN with NN input augmentation (augmentation order r_p) and physics-model-guided NN editing; (b) the Phy_N architecture with activation editing.]

Phy-DRL is built on the actor-critic architecture, where a critic network and an actor network are used to approximate the action-value function (i.e., Q^π(s(k), a_drl(k)) in Equation (5)) and to learn an action policy (i.e., a_drl(k) = π(s(k))), respectively. We note from Equation (5) that the action-value function is a direct function of our defined reward but involves unknown future rewards. So, some invariant knowledge exists with which the critic and/or actor networks shall strictly comply, which motivates Problem 1.4. To address this problem, as shown in Figure 1, we develop physics-enhanced critic and actor networks for Phy-DRL.
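Before detailing the network design, we illustrate Remark 5.4 with a schematic sketch of the analytic-centering problem (12) in CVXPY (Diamond & Boyd, 2016). The matrices A, B, the safety-set rows, and the value of α below are placeholders, and only the simple "≤ 1" containment case of LMI (7) is shown; this is a sketch of the optimization setup, not the exact script used in our experiments.

```python
import cvxpy as cp
import numpy as np

# Placeholders for the known model (A, B), scaled safety-set rows, and alpha.
n, m, alpha = 2, 1, 0.9
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
D_rows = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # stand-ins for rows of D_bar / D_underbar

Q = cp.Variable((n, n), symmetric=True)   # Q_bar = P^{-1}
R = cp.Variable((m, n))                   # R_bar, with F = R_bar Q_bar^{-1}

# LMI (11) as a 2x2 block matrix, plus positive definiteness of Q_bar.
lmi = cp.bmat([[alpha * Q, Q @ A.T + R.T @ B.T],
               [A @ Q + B @ R, Q]])
constraints = [Q >> 1e-6 * np.eye(n), lmi >> 0]

# Containment constraints from LMI (7): D_i P^{-1} D_i^T <= 1  <=>  D_i Q_bar D_i^T <= 1,
# written here only for the "<= 1" case.
constraints += [d @ Q @ d <= 1 for d in D_rows]

# Analytic centering (12): maximize log det(P^{-1}) = log det(Q_bar).
prob = cp.Problem(cp.Maximize(cp.log_det(Q)), constraints)
prob.solve()                              # solver choice (e.g., SCS) may matter in practice

P = np.linalg.inv(Q.value)
F = R.value @ P                           # F = R_bar Q_bar^{-1}
print(F, P)
```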
The proposed networks are built on the physics-knowledge-enhanced DNN, whose architecture is depicted in Figure 2. The DNN has two innovations in neural architecture: i) Neural Network (NN) Input Augmentation, described by Algorithm 1, and ii) Physics-Model-Guided NN Editing, described by Algorithm 2. To understand how Algorithm 1 and Algorithm 2 address Problem 1.4, we describe the ground-truth models of the action-value function and the action policy as

$$Q^{\pi}(\mathbf{s}, \mathbf{a}_{\mathrm{drl}}) = \underbrace{\mathcal{A}_{Q}}_{\text{weight matrix}}\,\underbrace{\mathfrak{m}(\mathbf{s}, \mathbf{a}_{\mathrm{drl}}, r_{Q})}_{\text{node-representation vector}} + \underbrace{\mathfrak{p}(\mathbf{s}, \mathbf{a}_{\mathrm{drl}})}_{\text{unknown model mismatch}}, \tag{13}$$

$$\pi(\mathbf{s}) = \underbrace{\mathcal{A}_{\pi}}_{\text{weight matrix}}\,\underbrace{\mathfrak{m}(\mathbf{s}, r_{\pi})}_{\text{node-representation vector}} + \underbrace{\mathfrak{p}(\mathbf{s})}_{\text{unknown model mismatch}} \in \mathbb{R}^{\mathrm{len}(\pi(\mathbf{s}))}, \tag{14}$$

where the vectors m(s, a_drl, r_Q) and m(s, r_π) are, respectively, augmentations of the input vectors [s; a_drl] and s, which embrace all the non-missing and non-redundant monomials of a Taylor series. One motivation behind the augmentations is that, according to Taylor's Theorem in Appendix G, the Taylor series can approximate arbitrary nonlinear functions with controllable accuracy via controlling the series orders r_Q and r_π. The second motivation is that our proposed safety-embedded reward r(s(k), s(k+1)) in Equation (8) is a typical Taylor series and is pre-defined. If using the Taylor series to approximate the action-value function and an action policy, we can also discover hidden invariant knowledge.

Algorithm 2 needs as inputs the knowledge sets K_Q and K_π, which include the available knowledge about the governing Equation (13) and Equation (14), respectively. The two knowledge sets are defined below.

K_Q ≜ {[A_Q]_i | no [m(s, a_drl, r_Q)]_i in p(s, a_drl), i ∈ {1, . . . , len(m(s, a_drl, r_Q))}}, (15)

K_π ≜ {[A_π]_{i,j} | no [m(s, r_π)]_j in [p(s)]_i, i ∈ {1, . . . , len(p(s))}, j ∈ {1, . . . , len(m(s, r_π))}}. (16)

Remark 6.1 (Toy Example: Obtaining Knowledge Sets). Due to the page limit, an example for obtaining K_Q via Taylor's theorem is presented in Appendix H. This example is about obtaining K_π. According to the dynamics and control of vehicles, the throttle command of traction control for preventing sliding and slipping depends on the longitudinal velocity (denoted by v) and the angular velocity (denoted by w) only Rajamani (2011); Mao et al. (2023). For simplification, we let s = [v, w, ζ]ᵀ, where ζ denotes yaw, while the action policy π(s) ∈ ℝ², with [π(s)]₁ and [π(s)]₂ denoting the throttle command and the steering command, respectively. By Algorithm 1 with r⟨t⟩ = r_π = 2 and y⟨t⟩ = s, we have m(s, r_π) = [1, v, w, ζ, v², vw, vζ, w², wζ, ζ²]ᵀ. Considering Equation (14), the m(s, r_π), in conjunction with the knowledge "[π(s)]₁ depends on w and v only", leads to the information: 1) [p(s)]₁ in Equation (14) in this example does not have the monomials 1, ζ, vζ, wζ, and ζ², and 2)

$$\mathcal{A}_{\pi} = \begin{bmatrix} 0 & w_{1} & w_{2} & 0 & w_{3} & w_{4} & 0 & w_{5} & 0 & 0 \\ w_{6} & w_{7} & w_{8} & w_{9} & w_{10} & w_{11} & w_{12} & w_{13} & w_{14} & w_{15} \end{bmatrix},$$

where w₁, . . . , w₁₅ are learning weights. Referring to Equation (16), we then have K_π = {[A_π]₁,₁ = 0, [A_π]₁,₂ = w₁, . . . , [A_π]₁,₁₀ = 0}.
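As a minimal illustration of how the knowledge in Remark 6.1 becomes a structural constraint, the sketch below builds the zero-pattern of A_π for the toy example and applies one "edited" linear layer whose masked weights cannot re-introduce the forbidden monomials. The weight values are random placeholders, the knowledge set is read here simply as the set of entries known to be zero, and the editing rule anticipates Algorithm 2 below.

```python
import numpy as np

# Monomials produced by Algorithm 1 for s = [v, w, zeta] with order r_pi = 2.
monomials = ["1", "v", "w", "zeta", "v^2", "v*w", "v*zeta", "w^2", "w*zeta", "zeta^2"]

# Knowledge from Remark 6.1: the throttle command [pi(s)]_1 depends on v and w only,
# so every monomial containing zeta (and the constant 1) must be absent from row 1.
forbidden_row1 = {j for j, m in enumerate(monomials) if "zeta" in m or m == "1"}

# Weight-masking matrix M: 0 where the knowledge fixes an entry of A_pi to zero, 1 elsewhere.
M = np.ones((2, len(monomials)))
for j in forbidden_row1:
    M[0, j] = 0.0

# One edited layer (cf. Algorithm 2): only the masked weights act,
# so the forbidden monomials cannot leak into the throttle output.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, len(monomials)))   # raw (fully-connected) weights, placeholders
U = M * W                                   # U = M ⊙ W

def edited_layer(m_vec):
    return U @ m_vec

v, w, zeta = 0.3, -0.1, 0.7
m_vec = np.array([1, v, w, zeta, v**2, v*w, v*zeta, w**2, w*zeta, zeta**2])
print(edited_layer(m_vec))   # entry 0 is independent of zeta by construction
```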
With knowledge sets at hand, we can introduce two design aims for addressing Problem 1.4.

Aim 6.2. Given K_Q (15), consider the critic network built on the physics-knowledge-enhanced DNN in Figure 2, where x = (s, a_drl) and y = Q̂(s, a_drl) (i.e., y approximates Q^π(s, a_drl)). The end-to-end input/output of the critic network strictly complies with the available knowledge about the governing Equation (13), i.e., if [A_Q]_i ∈ K_Q, then y does not contain the monomial [m(s, a_drl, r_Q)]_i.

Aim 6.3. Given K_π (16), consider the actor network built on the physics-knowledge-enhanced DNN in Figure 2, where x = s and y = π̂(s) (i.e., y approximates π(s)). The end-to-end input/output of the actor network strictly complies with the available knowledge about the governing Equation (14), i.e., if [A_π]_{i,j} ∈ K_π, then [y]_i does not contain the monomial [m(s, r_π)]_j.

Algorithm 2, in conjunction with Algorithm 1, is able to deliver Aim 6.2 and Aim 6.3. This is formally stated in Theorem 6.4, whose proof appears in Appendix J.

Theorem 6.4. If the critic and actor networks are built on the physics-knowledge-enhanced DNN (described in Figure 2), whose NN input augmentation and NN editing are described by Algorithm 1 and Algorithm 2, respectively, then Aim 6.2 and Aim 6.3 are achieved.

Finally, we refer to the toy example in Remark 6.1 for an overview of NN editing. For the end-to-end mapping [y]₁ = [π̂(s)]₁, given K_π and the output of Algorithm 1, the link editing of Algorithm 2 removes all connections with the node representations 1, ζ, vζ, wζ, ζ² and maintains the link connections with v, w, v², vw and w². Meanwhile, the activation editing of Algorithm 2 guarantees that the use of activation functions in all Phy_N layers does not introduce monomials of ζ into the mapping [y]₁ = [π̂(s)]₁.

Algorithm 1 NN Input Augmentation
▷ Aim: node-representation vectors embrace all the non-missing and non-redundant monomials of the Taylor series.
1: Input: augmentation order r⟨t⟩, input y⟨t⟩; t indicates the layer number, e.g., t = 2 denotes the second NN layer.
2: Generate the index vector of the input: i ← [1; 2; . . . ; len(y⟨t⟩)];
3: Initialize the augmentation vector: m(y⟨t⟩, r⟨t⟩) ← y⟨t⟩;
4: for _ = 2 to r⟨t⟩ do
5:   for i = 1 to len(y⟨t⟩) do
6:     Compute temp: t̃_a ← [y⟨t⟩]_i · [y⟨t⟩]_{[i]_i : len(y⟨t⟩)};  ▷ Capture hard-to-learn nonlinear representations, in the form of monomials of the Taylor series, such as the 2nd-order monomials of the sub-reward r(s(k), s(k+1)) in Equation (8).
7:     if i == 1 then
8:       Generate temp: t̃_b ← t̃_a;
9:     else if i > 1 then
10:      Generate temp: t̃_b ← [t̃_b; t̃_a];  ▷ Avoid missing and redundant monomials.
11:    end if
12:    Update the index entry: [i]_i ← len(y⟨t⟩);
13:    Augment: m(y⟨t⟩, r⟨t⟩) ← [m(y⟨t⟩, r⟨t⟩); t̃_b];
14:  end for
15: end for
▷ Lines 4–15 generate the node-representation vector that embraces all the non-missing and non-redundant monomials of the Taylor series. One illustrative example appears in Figure 8 in Appendix F.
16: Output: m(y⟨t⟩, r⟨t⟩) ← [1; m(y⟨t⟩, r⟨t⟩)].

Controllable Model Accuracy: The algorithm provides one option for approximating the ground-truth models (13) and (14) via the Taylor series. According to Taylor's Theorem (Chapter 2.4, Königsberger (2013)), the networks have controllable model accuracy by controlling the augmentation orders r⟨t⟩ (see Appendix G for further explanations).

Algorithm 2 Physics-Model-Guided Neural Network Editing
▷ Performed on the deep Phy_N; each layer needs Algorithm 1 for generating node-representation vectors. Detailed explanations of Algorithm 2 appear in Appendix I.
1: Input: network type set T = {Q, π}, knowledge sets K_Q (15) and K_π (16), numbers of Phy_Ns p_Q and p_π, original input x, augmentation orders r_Q and r_π, model matrices A_Q and A_π, terminal output dimension len(y).
2: Choose the network type ϖ ∈ T;
3: Specify the augmentation order of the first Phy_N: r⟨1⟩ ← r_ϖ;
4: for t = 1 to p_ϖ do
5:   if t == 1 then
6:     Generate the node-representation vector m(x, r⟨1⟩) via Algorithm 1;  ▷ Corresponding to m(s, a_drl, r_Q) and m(s, r_π) in the ground-truth models (13) and (14), because r⟨1⟩ = r_ϖ.
7:     Generate the raw weight matrix via a gradient-descent algorithm: W⟨1⟩;  ▷ A raw weight matrix usually corresponds to a fully-connected NN layer, which can violate physics knowledge.
8:     Generate the knowledge matrix K⟨1⟩: [K⟨1⟩]_{i,j} ← [A_ϖ]_{i,j} if [A_ϖ]_{i,j} ∈ K_ϖ, and 0 otherwise;  ▷ Include all elements in the knowledge set.
9:     Generate the weight-masking matrix M⟨1⟩: [M⟨1⟩]_{i,j} ← 0 if [A_ϖ]_{i,j} ∈ K_ϖ, and 1 otherwise;
10:    Generate the activation-masking vector a⟨1⟩: [a⟨1⟩]_i ← 0 if [M⟨1⟩]_{i,j} = 0 for all j ∈ {1, . . . , len(m(x, r⟨1⟩))}, and 1 otherwise;
11:  else if t > 1 then
12:    Generate the raw weight matrix via a gradient-descent algorithm: W⟨t⟩;  ▷ A raw weight matrix usually corresponds to a fully-connected NN layer, which can violate physics knowledge.
13:    Generate the node-representation vector m(y⟨t−1⟩, r⟨t⟩) via Algorithm 1;
14:    Generate the knowledge matrix: K⟨t⟩ ← [ O_{(len(y⟨t⟩)−len(y)) × len(m(y⟨t−1⟩, r⟨t⟩))} ;  0_{len(y)}  I_{len(y)}  O_{len(y) × (len(m(y⟨t−1⟩, r⟨t⟩)) − len(y) − 1)} ];
15:    Generate the weight-masking matrix M⟨t⟩: [M⟨t⟩]_{i,j} ← 0 if [m(y⟨t⟩, r⟨t⟩)]_j depends on some [m(x, r⟨1⟩)]_v with [M⟨1⟩]_{i,v} = 0, v ∈ {1, . . . , len(m(x, r⟨1⟩))}, and 1 otherwise;
16:    Generate the activation-masking vector a⟨t⟩ ← [a⟨1⟩; 1_{len(y⟨t⟩) − len(y)}];
17:  end if
18:  Generate the uncertainty matrix U⟨t⟩ ← M⟨t⟩ ⊙ W⟨t⟩;
19:  Compute the output: y⟨t⟩ ← K⟨t⟩ m(y⟨t−1⟩, r⟨t⟩) + a⟨t⟩ ⊙ act(U⟨t⟩ m(y⟨t−1⟩, r⟨t⟩));
20: end for
21: Output: ŷ ← y⟨p_ϖ⟩.

[Figure 3: Testing areas of (a) the Phy-DRL policy, (b) the model-based policy, and (c) the DRL policy. Blue: area of IE samples defined in Equation (18). Green: area of EE samples defined in Equation (19). Rectangular area: safety set. Ellipse area: safety envelope.]

7 EXPERIMENTS

7.1 CART-POLE SYSTEM

We take the cart-pole simulator provided in OpenAI Gym Brockman et al. (2016). Its mechanical analog is shown in Figure 11 in Appendix K, characterized by the pendulum's angle θ and angular velocity ω = θ̇, and the cart's position x and velocity v = ẋ. Phy-DRL's action policy is to stabilize the pendulum at the equilibrium s* = [x*, v*, θ*, ω*]ᵀ = [0, 0, 0, 0]ᵀ, while constraining the system state to the

Safety Set: X = {s ∈ ℝ⁴ | −0.9 ≤ x ≤ 0.9, −0.8 < θ < 0.8}. (17)

To demonstrate the robustness of Phy-DRL, we intentionally create a large model mismatch in obtaining the model-based action policy of Phy-DRL. Specifically, as explained in Appendix K.1, the physics-model knowledge represented by (A, B) is obtained by ignoring the friction force and letting cos θ ≈ 1, sin θ ≈ θ and ω² sin θ ≈ 0. The system trajectories in Figure 12 in Appendix K.4 show that the sole model-based action policy does not guarantee safety. The computations of F, P and Ā are presented in Appendix K.1–Appendix K.3. Meanwhile, for the high-performance sub-reward in Equation (8), we let w(s(k), a_drl(k)) = −a²_drl(k). Finally, to validate our theoretical results and present experimental comparisons, we define two kinds of safe samples:

Safe Internal-Envelope (IE) Sample s̃: if s(1) = s̃ ∈ Ω, then s(k) ∈ Ω, ∀k ∈ ℕ. (18)
Safe External-Envelope (EE) Sample s̃: if s(1) = s̃ ∈ X, then s(k) ∈ X \ Ω, ∀k ∈ ℕ. (19)

We consider a CLF (control Lyapunov function) reward, proposed in Westenbroek et al. (2022), as R(·) = sᵀ(k) P s(k) − sᵀ(k+1) P s(k+1) + w(s(k), a(k)), where P is the same as the one in Phy-DRL's safety-embedded reward. We mainly compare our Phy-DRL with purely data-driven DRL having the CLF reward for testing. Both models are trained for 75,000 steps and have the same configurations of critic and actor networks, presented in Appendix K.5. The performance metrics are the areas of IE and EE samples.
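As a concrete reading of these metrics, the sketch below labels an initial state as a safe IE sample (18) or a safe EE sample (19) by rolling a policy out on the simulator and checking envelope and safety-set membership at every step; it is a finite-horizon empirical check of the defining conditions, and the `policy`, `env_step`, `in_safety_set` callables, the horizon, and P are placeholders for the actual cart-pole setup in Appendix K.

```python
import numpy as np

def classify_initial_state(s0, policy, env_step, P, in_safety_set, horizon=1500):
    """Label an initial state per Equations (18)-(19): 'IE', 'EE', or 'neither'.

    policy(s) -> action, env_step(s, a) -> next state, in_safety_set(s) -> bool,
    and P defines the safety envelope s^T P s <= 1; all are stand-ins here.
    """
    in_envelope = lambda s: s @ P @ s <= 1.0

    s = s0
    always_in_envelope = in_envelope(s0)
    never_in_envelope = not in_envelope(s0)
    always_in_set = in_safety_set(s0)
    for _ in range(horizon):
        s = env_step(s, policy(s))
        always_in_envelope &= in_envelope(s)
        never_in_envelope &= not in_envelope(s)
        always_in_set &= in_safety_set(s)

    if always_in_envelope:
        return "IE"       # Equation (18): starts inside Omega and never leaves it
    if always_in_set and never_in_envelope:
        return "EE"       # Equation (19): stays in X but outside Omega
    return "neither"
```

Sweeping such labels over a grid of initial states yields the IE/EE areas visualized in Figure 3.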
Figure 3 shows that i) Phy-DRL successfully renders the safety envelope invariant, demonstrating Theorem 5.3 (r(s(k), s(k+1)) ≥ α − 1 holds in the final training episode), and ii) the safety areas of the sole model-based policy and the purely data-driven DRL are much smaller. Additional comparisons that incorporate a model for state prediction, for both model-based DRL (with the proposed CLF reward) and our Phy-DRL, are presented in Appendix K.7.

7.2 QUADRUPED ROBOT

In this experiment, the action policy's missions are concurrent safe center-of-gravity management, safe lane tracking along the x-axis, and safe velocity regulation. We define the safety constraints as

X = {ŝ | |CoM z-height − 0.24 m| ≤ 0.13 m, |yaw| ≤ 0.17 rad, |CoM x-velocity − r_x| ≤ |r_x|}, (20)

Targeted Equilibrium: ŝ* = [0; 0; 0.24 m; 0; 0; 0; r_x; 0; 0; 0; 0; 0], (21)

where ŝ denotes the robot's state vector (given in Equation (74) in Appendix L.2) and r_x denotes the desired CoM x-velocity. The system state of the model (1) is expressed as s = ŝ − ŝ*. The designs of the model-based policy and the reward appear in Appendix L.3, and the training details are presented in Appendix L.6 and Appendix L.7.

[Figure 4: Phase plots (CoM height versus the safety envelope) of models running in different environments, given different velocity commands: (a) velocity 1 m/s, snow road; (b) velocity 0.5 m/s, snow road; (c) velocity -1.4 m/s, wet road; (d) velocity -0.4 m/s, wet road; each panel compares DRL and Phy-DRL.]

To demonstrate the performance of the trained Phy-DRL, we consider a comparison of four policies:

- Phy-DRL policy, whose network configurations are summarized in the model PKN-15 in Table 2 in Appendix L.6. Its total number of training steps is only 10^6.
- DRL policy, denoting a purely data-driven action policy trained in standard DRL. Its network configurations are summarized in the model FC-MLP in Table 2. Its training reward is the CLF proposed in Westenbroek et al. (2022) and given in Equation (84), where P is the same as the one in Phy-DRL's safety-embedded reward. Its number of training steps is as large as 10^7.
- PD policy, denoting a default proportional-derivative controller developed in Da et al. (2021).
- Linear policy, which is the sole model-based action policy (4) used in Phy-DRL.

We compare the four policies in four testing environments: a) r_x = 1 m/s on a snow road, b) r_x = 0.5 m/s on a snow road, c) r_x = -1.4 m/s on a wet road, and d) r_x = -0.4 m/s on a wet road. Links to demonstration videos are available in Appendix L.9, with Figure 4 showing that Phy-DRL successfully constrains the robot's states to the safety set. Given the more reasonable velocity commands in environments b) and d), Phy-DRL can also successfully constrain the system states to the safety envelope. The Linear and PD policies can only constrain the system states to the safety envelope in environment d). The DRL policy violates the safety requirements in all environments, which implies that purely data-driven DRL needs more training steps to search for a safe and robust policy. Meanwhile, Appendix L.4, Appendix L.6, and Appendix L.7 show that Phy-DRL features remarkably better velocity-regulation performance, fewer learning parameters, and fast and stable training.
8 CONCLUSION AND DISCUSSION

This paper proposes Phy-DRL: a physics-regulated deep reinforcement learning framework for safety-critical autonomous systems. Phy-DRL exhibits a mathematically provable safety guarantee. Compared with purely data-driven DRL and solely model-based design, Phy-DRL features fewer learning parameters and fast and stable training while offering enhanced safety assurance. We recall that the computation of the matrix P (used in defining the reward and the safety envelope) depends on a linear model. However, if the linear model's mismatch is large, no safe policy may exist that renders the safety envelope defined by P invariant. How to address safety concerns induced by a faulty P constitutes our future research. We also note that the derived safety-guarantee condition in Theorem 5.3 is not yet ready for a practical testing procedure, due to the necessary full-coverage testing within the domain Ω. Transforming the theoretical safety conditions into practical and efficient ones will be another direction of future research.

9 REPRODUCIBILITY STATEMENT

The code to reproduce our experimental results and the supplementary materials are available at https://github.com/HP-CAO/phy_rl. The experimental settings are described in Appendix K.5, Appendix K.6, and Appendix L.8.

10 ACKNOWLEDGEMENTS

We would like to first thank the anonymous reviewers for their helpful feedback, thoughtful reviews, and insightful comments. We appreciate Mirco Theile's helpful suggestions regarding the technical details, which inspire our future research directions. We also thank Yihao Cai for his help in deploying Phy-DRL on a physical quadruped robot. This work was partly supported by the National Science Foundation under Grant CPS-2311084 and Grant CPS-2311085, and by the Alexander von Humboldt Professorship endowed by the German Federal Ministry of Education and Research.

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, and Matthieu Devin. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint, https://arxiv.org/abs/1603.04467.

Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. Advances in Neural Information Processing Systems, 30, 2017.

Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear matrix inequalities in system and control theory. SIAM, 1994.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Ya-Chien Chang and Sicun Gao. Stabilizing neural control using self-learned almost Lyapunov critics. In 2021 IEEE International Conference on Robotics and Automation, pp. 1803–1809. IEEE, 2021.

Yuntian Chen, Dou Huang, Dongxiao Zhang, Junsheng Zeng, Nanzhe Wang, Haoran Zhang, and Jinyue Yan. Theory-guided hard constraint projection (HCP): A knowledge-based data-driven scientific machine learning method. Journal of Computational Physics, 445:110624, 2021.

Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3387–3395, 2019a.

Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, and Joel Burdick. Control regularization for reduced variance reinforcement learning.
In International Conference on Machine Learning, pp. 1141–1150, 2019b.

Miles Cranmer, Sam Greydanus, Stephan Hoyer, Peter Battaglia, David Spergel, and Shirley Ho. Lagrangian neural networks. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.

Xingye Da, Zhaoming Xie, David Hoeller, Byron Boots, Anima Anandkumar, Yuke Zhu, Buck Babich, and Animesh Garg. Learning a contact-adaptive controller for robust, efficient legged locomotion. In Conference on Robot Learning, pp. 883–894. PMLR, 2021.

Jared Di Carlo, Patrick M Wensing, Benjamin Katz, Gerardo Bledt, and Sangbae Kim. Dynamic locomotion in the MIT Cheetah 3 through convex model-predictive control. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–9. IEEE, 2018.

Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

Razvan V Florian. Correct equations for the dynamics of the cart-pole system. Center for Cognitive and Neural Studies (Coneural), Romania, 2007. https://coneural.org/florian/papers/05_cart_pole.pdf.

Michael Grant, Stephen Boyd, and Yinyu Ye. CVX users' guide. Online: http://www.stanford.edu/boyd/software.html, 2009.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1861–1870. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/haarnoja18b.html.

Masanobu Horie, Naoki Morita, Toshiaki Hishinuma, Yu Ihara, and Naoto Mitsume. Isometric transformation invariant and equivariant graph convolutional networks. arXiv:2005.06316. URL https://arxiv.org/abs/2005.06316.

Sandy H. Huang, Nicolas Papernot, Ian J. Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. In 5th International Conference on Learning Representations, ICLR 2017, Workshop Track Proceedings, 2017. URL https://openreview.net/forum?id=ryvlRyBKl.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems, 32, 2019.

Xiaowei Jia, Jared Willard, Anuj Karpatne, Jordan S Read, Jacob A Zwart, Michael Steinbach, and Vipin Kumar. Physics-guided machine learning for scientific discovery: An application in simulating lake temperature profiles. ACM/IMS Transactions on Data Science, 2(3):1–26, 2021.

Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Residual reinforcement learning for robot control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6023–6029. IEEE, 2019.

George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.

Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In Computer Aided Verification: 29th International Conference, CAV 2017, pp. 97–117. Springer, 2017.
Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, and Amar Shah. Learning to drive in a day. In 2019 International Conference on Robotics and Automation, pp. 8248–8254. IEEE, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, https://arxiv.org/abs/1412.6980.

Konrad Königsberger. Analysis 2. Springer-Verlag, 2013.

Tongxin Li, Ruixiao Yang, Guannan Qu, Yiheng Lin, Steven Low, and Adam Wierman. Equipping black-box policies with model-based advice for stable nonlinear control. arXiv preprint, https://arxiv.org/pdf/2206.01341.pdf, a.

Yunzhu Li, Hao He, Jiajun Wu, Dina Katabi, and Antonio Torralba. Learning compositional Koopman operators for model-based control. arXiv:1910.08264, b. URL https://arxiv.org/abs/1910.08264.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR, 2016.

Lu Lu, Raphael Pestourie, Wenjie Yao, Zhicheng Wang, Francesc Verdugo, and Steven G Johnson. Physics-informed neural networks with hard constraints for inverse design. SIAM Journal on Scientific Computing, 43(6):B1105–B1132, 2021.

Yanbing Mao, Naira Hovakimyan, Tarek Abdelzaher, and Evangelos Theodorou. Social system inference from noisy observations. IEEE Transactions on Computational Social Systems, pp. 1–13, 2022. doi: 10.1109/TCSS.2022.3229599.

Yanbing Mao, Yuliang Gu, Naira Hovakimyan, Lui Sha, and Petros Voulgaris. SL1-Simplex: Safe velocity regulation of self-driving vehicles in dynamic and unforeseen environments. ACM Transactions on Cyber-Physical Systems, 7(1):1–24, 2023.

Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 37–45, 2015.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124, 2017.

Nikhil Muralidhar, Jie Bu, Ze Cao, Long He, Naren Ramakrishnan, Danesh Tafti, and Anuj Karpatne. PhyNet: Physics guided neural networks for particle drag force prediction in assembly. In Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 559–567, 2020.

Samet Oymak and Necmiye Ozay. Non-asymptotic identification of LTI systems from a single trajectory. In 2019 American Control Conference, pp. 5655–5661. IEEE, 2019.

Theodore J Perkins and Andrew G Barto. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3(Dec):803–832, 2002.

Rajesh Rajamani. Vehicle dynamics and control. Springer Science & Business Media, 2011.

Krishan Rana, Vibhavari Dasagi, Jesse Haviland, Ben Talbot, Michael Milford, and Niko Sünderhauf.
Bayesian controller fusion: Leveraging control priors in deep reinforcement learning for robotics. arXiv preprint, https://arxiv.org/pdf/2107.09822.pdf.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint, https://arxiv.org/abs/1707.06347.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

R Wang. Incorporating symmetry into deep dynamics models for improved generalization. In International Conference on Learning Representations, 2021.

Rui Wang and Rose Yu. Physics-guided deep learning for dynamical systems: A survey. arXiv:2107.01272. URL https://arxiv.org/pdf/2107.01272.pdf.

Rui Wang, Karthik Kashinath, Mustafa Mustafa, Adrian Albert, and Rose Yu. Towards physics-informed deep learning for turbulent flow prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1457–1466, 2020.

Tyler Westenbroek, Fernando Castaneda, Ayush Agrawal, Shankar Sastry, and Koushil Sreenath. Lyapunov design for robust and efficient robotic reinforcement learning. arXiv:2208.06721, 2022. URL https://arxiv.org/pdf/2208.06721.pdf.

Jared Willard, Xiaowei Jia, Shaoming Xu, Michael Steinbach, and Vipin Kumar. Integrating scientific knowledge with machine learning for engineering and environmental systems. ACM Computing Surveys, 2021.

Kailai Xu and Eric Darve. Physics constrained learning for data-driven inverse modeling from sparse observations. Journal of Computational Physics, pp. 110938, 2022.

Ruihan Yang, Minghao Zhang, Nicklas Hansen, Huazhe Xu, and Xiaolong Wang. Learning vision-guided quadrupedal locomotion end-to-end with cross-modal transformers. 2022 International Conference on Learning Representations, 2022.

Arnold Zachary and Toner Helen. AI Accidents: An emerging threat. Center for Security and Emerging Technology, 2021. URL https://doi.org/10.51593/20200072.

Fuzhen Zhang. The Schur complement and its applications, volume 4. Springer Science & Business Media, 2006.

Liqun Zhao, Konstantinos Gatsis, and Antonis Papachristodoulou. Stable and safe reinforcement learning via a barrier-Lyapunov actor-critic approach. In 2023 62nd IEEE Conference on Decision and Control (CDC), pp. 1320–1325. IEEE, 2023.

Liqun Zhao, Keyan Miao, Konstantinos Gatsis, and Antonis Papachristodoulou. NLBAC: A neural ordinary differential equations-based framework for stable and safe reinforcement learning. arXiv preprint arXiv:2401.13148, 2024.

APPENDIX CONTENTS

A Notations throughout Paper
B Auxiliary Lemmas
C Proof of Lemma 5.1
D Theorem 5.3
  D.1 Proof of Theorem 5.3
  D.2 Extension: Explanation of Fast Training Toward Safety Guarantee
E Extension: Mathematically-Provable Safety and Stability Guarantees
  E.1 Safety versus Stability
  E.2 Provable Safety and Stability Guarantees
F NN Input Augmentation: Explanations and Example
G Controllable Model Accuracy
H Example: Obtaining Knowledge Set K_Q
I Physics-Model-Guided Neural Network Editing: Explanations
  I.1 Activation Editing
  I.2 Knowledge Preserving and Passing
J Proof of Theorem 6.4
K Experiment: Cart-Pole System
  K.1 Physics-Model Knowledge
  K.2 Safety Knowledge
  K.3 Model-Based Action Policy and DRL Reward
  K.4 Sole Model-Based Action Policy: Failure Due to Large Model Mismatch
  K.5 Configurations: Networks and Training Conditions
  K.6 Training
  K.7 Testing Comparisons: Model for State Prediction?
L Experiment: Quadruped Robot
  L.1 Overview: Best Trade-Off Between Model-Based Design and Data-Driven Design
  L.2 Real System Dynamics: Highly Nonlinear!
  L.3 Simplifying Model-Based Designs
  L.4 Testing Experiment: Velocity Tracking Performance
  L.5 Safety-Embedded Rewards
  L.6 Physics-Knowledge-Enhanced Critic Network
  L.7 Reward Comparisons
  L.8 Training
  L.9 Links: Demonstration Videos

A NOTATIONS THROUGHOUT PAPER

Table 1: Notation
  ℝⁿ           set of n-dimensional real vectors
  ℕ            set of natural numbers
  len(s)       length of vector s
  [x]_i        i-th entry of vector x
  [x]_{i:j}    sub-vector formed by the i-th to j-th entries of vector x
  [W]_{i,:}    i-th row of matrix W
  [W]_{i,j}    element at row i and column j of matrix W
  P ≻ 0        matrix P is positive definite
  P ≺ 0        matrix P is negative definite
  ᵀ            matrix or vector transposition
  I_n          n × n-dimensional identity matrix
  1_n          n-dimensional vector of all ones
  0_n          n-dimensional vector of all zeros
  ⊙            Hadamard product
  [x; y]       stacked (tall column) vector of vectors x and y
  act(·)       activation function
  O_{m×n}      m × n-dimensional zero matrix
  X \ Ω        complement set of Ω with respect to X
Elements of a knowledge set and learning elements are additionally distinguished typographically.

B AUXILIARY LEMMAS

Lemma B.1 (Schur Complement Zhang (2006)). For any symmetric matrix

$$\mathbf{M} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^{\top} & \mathbf{C} \end{bmatrix},$$

we have M ⪰ 0 if and only if C ≻ 0 and A − B C⁻¹ Bᵀ ⪰ 0.

Lemma B.2. Corresponding to the set X defined in Equation (2), we define

X̂ ≜ {s ∈ ℝⁿ | −1_h ⪯ d ⪯ D̲ s, and D̄ s ⪯ 1_h}. (22)

The sets satisfy X = X̂ if and only if D̄ = D Λ̄ and D̲ = D Λ̲, where d, Λ̄ and Λ̲ are defined in Lemma 5.1.

Proof.
The condition of the set defined in Equation (2) is equivalent to

[v + v̲]_i ≤ [D]_{i,:} s ≤ [v + v̄]_i, i ∈ {1, 2, . . . , h}, (23)

based on which we consider three cases.

Case One: If [v + v̲]_i > 0, we obtain from Equation (23) that [v + v̄]_i > 0 as well, such that Equation (23) can be rewritten equivalently as

[D]_{i,:} s ≤ [v + v̄]_i ⟺ [D]_{i,:} s [Λ̄]_{i,i} ≤ 1, and [D]_{i,:} s ≥ [v + v̲]_i ⟺ [D]_{i,:} s [Λ̲]_{i,i} ≥ 1 = [d]_i, i ∈ {1, 2, . . . , h}, (24)

which is obtained by considering the second items of [Λ̄]_{i,j} and [Λ̲]_{i,j} and the first item of [d]_i, presented in Lemma 5.1.

Case Two: If [v + v̄]_i < 0, we obtain from Equation (23) that [v + v̲]_i < 0 as well, such that Equation (23) can be rewritten equivalently as

[D]_{i,:} s ≥ [v + v̲]_i ⟺ [D]_{i,:} s [Λ̄]_{i,i} ≤ 1, and [D]_{i,:} s ≤ [v + v̄]_i ⟺ [D]_{i,:} s [Λ̲]_{i,i} ≥ 1 = [d]_i, i ∈ {1, 2, . . . , h}, (25)

which is obtained by considering the third items of [Λ̄]_{i,j} and [Λ̲]_{i,j} and the second item of [d]_i, presented in Lemma 5.1.

Case Three: If [v + v̄]_i > 0 and [v + v̲]_i < 0, Equation (23) can be rewritten equivalently as

[D]_{i,:} s ≤ [v + v̄]_i ⟺ [D]_{i,:} s [Λ̄]_{i,i} ≤ 1, and [D]_{i,:} s ≥ [v + v̲]_i ⟺ [D]_{i,:} s [Λ̲]_{i,i} ≥ −1 = [d]_i, i ∈ {1, 2, . . . , h}, (26)

which is obtained by considering the fourth items of [Λ̄]_{i,j} and [Λ̲]_{i,j} and the third item of [d]_i, presented in Lemma 5.1.

We note from the first items of [Λ̄]_{i,j} and [Λ̲]_{i,j} in Lemma 5.1 that the defined Λ̄ and Λ̲ are diagonal matrices. The conjunctive results (23)–(26) can thus be equivalently described by D Λ̄ s ⪯ 1_h and D Λ̲ s ⪰ d; substituting D̄ = D Λ̄ and D̲ = D Λ̲ into these, we obtain −1_h ⪯ d ⪯ D̲ s and D̄ s ⪯ 1_h, which is the condition for defining the set X̂ in Equation (22). We thus conclude the statement.

C PROOF OF LEMMA 5.1

In light of Lemma B.2 in Appendix B, we have X = X̂. Therefore, to prove Ω ⊆ X, we consider the proof of Ω ⊆ X̂, which is carried out below. The X̂ defined in Equation (22) relies on two conjunctive conditions: −1_h ⪯ d ⪯ D̲ s and D̄ s ⪯ 1_h, based on which the proof is separated into two cases.

Case One: D̄ s ⪯ 1_h, which can be rewritten as [D̄ s]_i ≤ 1, i ∈ {1, 2, . . . , h}. We next prove that max_{s∈Ω}{[D̄ s]_i} = √([D̄]_{i,:} P⁻¹ [D̄]ᵀ_{:,i}), i ∈ {1, 2, . . . , h}. To achieve this, let us consider the constrained optimization problem:

max [D̄]_{i,:} s, subject to sᵀ P s ≤ 1, i ∈ {1, 2, . . . , h}.

Let s° be the optimal solution. Then, according to the Kuhn–Tucker conditions, we have [D̄]ᵀ_{i,:} − 2λ P s° = 0 and λ (1 − (s°)ᵀ P s°) = 0, which, in conjunction with λ > 0, lead to

2λ P s° = [D̄]ᵀ_{i,:}, (27)
(s°)ᵀ P s° = 1. (28)

Multiplying both (left-hand) sides of Equation (27) by (s°)ᵀ yields 2λ (s°)ᵀ P s° = (s°)ᵀ [D̄]ᵀ_{i,:}, which, in conjunction with Equation (28), results in

2λ = (s°)ᵀ [D̄]ᵀ_{i,:} > 0. (29)

Multiplying both (left-hand) sides of Equation (27) by P⁻¹ leads to 2λ s° = P⁻¹ [D̄]ᵀ_{i,:}; multiplying both (left-hand) sides of this by [D̄]_{i,:}, we arrive at

2λ [D̄]_{i,:} s° = [D̄]_{i,:} P⁻¹ [D̄]ᵀ_{i,:}. (30)

Substituting Equation (29) into Equation (30), we obtain (s°)ᵀ [D̄]ᵀ_{i,:} [D̄]_{i,:} s° = [D̄]_{i,:} P⁻¹ [D̄]ᵀ_{i,:}, from which we have (s°)ᵀ [D̄]ᵀ_{i,:} > 0, which with Equation (29) indicates

2λ = (s°)ᵀ [D̄]ᵀ_{i,:} = √([D̄]_{i,:} P⁻¹ [D̄]ᵀ_{i,:}). (31)

We note that Equation (27) is equivalent to s° = (1/(2λ)) P⁻¹ [D̄]ᵀ_{i,:}; substituting Equation (31) into it results in s° = P⁻¹ [D̄]ᵀ_{i,:} / √([D̄]_{i,:} P⁻¹ [D̄]ᵀ_{i,:}); multiplying both sides by [D̄]_{i,:} gives

max_{s∈Ω}{[D̄ s]_i} = [D̄]_{i,:} s° = √([D̄]_{i,:} P⁻¹ [D̄]ᵀ_{:,i}), i ∈ {1, 2, . . . , h},

which means

max_{s∈Ω}{[D̄ s]_i} ≤ 1 if and only if √([D̄]_{i,:} P⁻¹ [D̄]ᵀ_{:,i}) ≤ 1, i ∈ {1, 2, . . . , h}, (32)

which further implies that D̄ s ⪯ 1_h holds for all s ∈ Ω if [D̄]_{i,:} P⁻¹ [D̄]ᵀ_{:,i} ≤ 1, i ∈ {1, 2, . . . , h}. (33)

Case Two: −1_h ⪯ d ⪯ D̲ s, which includes two scenarios: [d]_i = 1 and [d]_i = −1, referring to d in Lemma 5.1. We first consider [d]_i = 1, i.e., [D̲ s]_i ≥ 1, which can be rewritten as [D̂ s]_i ≤ −[d]_i = −1, i ∈ {1, 2, . . . , h}, with D̂ = −D̲.
Following the same steps used to derive Equation (32), we obtain max_{s∈Ω}{[D̂ s]_i} ≤ −1 if and only if [D̂]_{i,:} P⁻¹ [D̂]ᵀ_{:,i} < 1, i ∈ {1, 2, . . . , h}, which with D̂ = −D̲ indicates that

min_{s∈Ω}{[D̲ s]_i} ≥ 1 iff [D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i} < 1, [d]_i = 1, i ∈ {1, 2, . . . , h}. (34)

We next consider [d]_i = −1. We prove that in this scenario min_{s∈Ω}{[D̲ s]_i} = −√([D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i}). To achieve this, let us consider the constrained optimization problem:

min [D̲ s]_i, subject to sᵀ P s ≤ 1.

Let ŝ° be the optimal solution. Then, according to the Kuhn–Tucker conditions, we have [D̲]ᵀ_{i,:} + 2λ̂ P ŝ° = 0 and λ̂ (1 − (ŝ°)ᵀ P ŝ°) = 0, which, in conjunction with λ̂ < 0, lead to

2λ̂ P ŝ° = −[D̲]ᵀ_{i,:}, (35)
(ŝ°)ᵀ P ŝ° = 1. (36)

Multiplying both (left-hand) sides of Equation (35) by (ŝ°)ᵀ yields 2λ̂ (ŝ°)ᵀ P ŝ° = −(ŝ°)ᵀ [D̲]ᵀ_{i,:}, which, in conjunction with Equation (36), results in

2λ̂ = −(ŝ°)ᵀ [D̲]ᵀ_{i,:} < 0. (37)

Multiplying both (left-hand) sides of Equation (35) by P⁻¹ leads to 2λ̂ ŝ° = −P⁻¹ [D̲]ᵀ_{i,:}; multiplying both (left-hand) sides of this by [D̲]_{i,:}, we arrive at

2λ̂ [D̲]_{i,:} ŝ° = −[D̲]_{i,:} P⁻¹ [D̲]ᵀ_{i,:}. (38)

Substituting Equation (37) into Equation (38), we obtain −(ŝ°)ᵀ [D̲]ᵀ_{i,:} [D̲]_{i,:} ŝ° = −[D̲]_{i,:} P⁻¹ [D̲]ᵀ_{i,:}, which together with Equation (37) indicates the solution

2λ̂ = −(ŝ°)ᵀ [D̲]ᵀ_{i,:} = −√([D̲]_{i,:} P⁻¹ [D̲]ᵀ_{i,:}). (39)

We note that Equation (35) is equivalent to ŝ° = −(1/(2λ̂)) P⁻¹ [D̲]ᵀ_{i,:}; substituting Equation (39) into it results in ŝ° = −P⁻¹ [D̲]ᵀ_{i,:} / √([D̲]_{i,:} P⁻¹ [D̲]ᵀ_{i,:}); multiplying both sides by [D̲]_{i,:} means

min_{s∈Ω}{[D̲ s]_i} = min_{s∈Ω}{[D̲]_{i,:} s} = [D̲]_{i,:} ŝ° = −[D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i} / √([D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i}) = −√([D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i}), i ∈ {1, 2, . . . , h},

which means

min_{s∈Ω}{[D̲ s]_i} ≥ −1 iff [D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i} ≤ 1, [d]_i = −1, i ∈ {1, 2, . . . , h}. (40)

Summarizing Equation (34) and Equation (40), with consideration of d presented in Lemma 5.1, we conclude that min_{s∈Ω}{[D̲ s]_i} ≥ [d]_i iff

[D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i} { = 1, if [d]_i = 1; ≤ 1, if [d]_i = −1 }, i ∈ {1, 2, . . . , h},

which, in conjunction with the fact that D̲ s and d are vectors, implies that

D̲ s ⪰ d ⪰ −1_h, if [D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i} { = 1, if [d]_i = 1; ≤ 1, if [d]_i = −1 }, i ∈ {1, 2, . . . , h}. (41)

We now conclude from Equation (33) and Equation (41) that D̄ s ⪯ 1_h and D̲ s ⪰ d ⪰ −1_h hold if, for any i ∈ {1, 2, . . . , h},

[D̄]_{i,:} P⁻¹ [D̄]ᵀ_{:,i} ≤ 1 and [D̲]_{i,:} P⁻¹ [D̲]ᵀ_{:,i} { = 1, if [d]_i = 1; ≤ 1, if [d]_i = −1 }.

Meanwhile, noticing that D̄ s ⪯ 1_h and D̲ s ⪰ d ⪰ −1_h is the condition forming the set X̂ in Equation (22), we finally obtain Equation (7).

D THEOREM 5.3

D.1 PROOF OF THEOREM 5.3

We note that Q̄ = Q̄ᵀ and that F = R̄ Q̄⁻¹ is equivalent to F Q̄ = R̄; substituting this into Equation (11) yields

$$\begin{bmatrix} \alpha\,\overline{\mathbf{Q}} & \overline{\mathbf{Q}}\,(\mathbf{A} + \mathbf{B}\mathbf{F})^{\top} \\ (\mathbf{A} + \mathbf{B}\mathbf{F})\,\overline{\mathbf{Q}} & \overline{\mathbf{Q}} \end{bmatrix} \succeq 0. \tag{42}$$

We note that Equation (42) implies α > 0 and Q̄ ⪰ 0. Then, according to the auxiliary Lemma B.1 in Appendix B, we have

α Q̄ − Q̄ (A + B F)ᵀ Q̄⁻¹ (A + B F) Q̄ ⪰ 0. (43)

Since P = Q̄⁻¹, multiplying both the left-hand and right-hand sides of Equation (43) by P, we obtain α P − (A + B F)ᵀ P (A + B F) ⪰ 0, which, in conjunction with Ā defined in Equation (9), leads to

α P − Āᵀ P Ā ⪰ 0. (44)

We now define the function

V(s(k)) = sᵀ(k) P s(k). (45)
With consideration of the function in Equation (45), along the real plant (1) with Equation (3) and Equation (4), we have

V(s(k+1)) = (B a_drl(k) + f(s(k), a(k)))ᵀ P (B a_drl(k) + f(s(k), a(k))) + 2 sᵀ(k) Āᵀ P (B a_drl(k) + f(s(k), a(k))) + sᵀ(k) (Āᵀ P Ā) s(k) (46)
< (B a_drl(k) + f(s(k), a(k)))ᵀ P (B a_drl(k) + f(s(k), a(k))) + 2 sᵀ(k) Āᵀ P (B a_drl(k) + f(s(k), a(k))) + α V(s(k)), (47)

where Equation (47) is obtained from the previous step by considering Equation (45) and Equation (44). Observing Equation (46), we obtain

−r(s(k), s(k+1)) = V(s(k+1)) − sᵀ(k) (Āᵀ P Ā) s(k) = (B a_drl(k) + f(s(k), a(k)))ᵀ P (B a_drl(k) + f(s(k), a(k))) + 2 sᵀ(k) Āᵀ P (B a_drl(k) + f(s(k), a(k))), (48)

where r(s(k), s(k+1)) is defined in Equation (8). Substituting Equation (48) into Equation (47) yields V(s(k+1)) < α V(s(k)) − r(s(k), s(k+1)), which further implies that

V(s(k+1)) − V(s(k)) < (α − 1) V(s(k)) − r(s(k), s(k+1)). (49)

Since 0 < α < 1, we have α − 1 < 0. So, (α − 1) V(s(k)) − r(s(k), s(k+1)) ≥ 0 means V(s(k)) ≤ r(s(k), s(k+1))/(α − 1). Therefore, if r(s(k), s(k+1))/(α − 1) ≤ 1, we have

V(s(k)) ≤ 1, and (α − 1) V(s(k)) − r(s(k), s(k+1)) ≥ 0, (50)

where the second inequality, in conjunction with Equation (49), implies that there exists a scalar θ such that

V(s(k+1)) − V(s(k)) < θ, with θ > 0. (51)

The result in Equation (51) can guarantee the safety of real plants. To prove this, let us consider the worst-case scenario in which V(s(k)) is strictly increasing with respect to time k ∈ ℕ. Starting from a system state s(k) satisfying V(s(k)) ≤ r(s(k), s(k+1))/(α − 1) ≤ 1, V(s(k)) will increase until V(s(q)) = r(s(q), s(q+1))/(α − 1) ≤ 1, where q > k ∈ ℕ. Meanwhile, we note that (α − 1) V(s(k)) − r(s(k), s(k+1)) ≤ 0 is equivalent to V(s(k)) ≥ r(s(k), s(k+1))/(α − 1), and, in conjunction with Equation (49), it implies that V(s(k+1)) − V(s(k)) < 0. This means that once V(s(q)) = r(s(q), s(q+1))/(α − 1) ≤ 1 is reached, V(s(k)) will start decreasing¹. We thus conclude that, in the worst-case scenario, if starting from a point not larger than r(s(k), s(k+1))/(α − 1), i.e., V(s(k)) ≤ r(s(k), s(k+1))/(α − 1) ≤ 1, we have

V(s(k)) ≤ 1, ∀k ∈ ℕ. (52)

We now consider the other case, i.e., 1 ≥ V(s(k)) > r(s(k), s(k+1))/(α − 1). Recalling that in this case V(s(k+1)) − V(s(k)) < 0, V(s(k)) is strictly decreasing with respect to time k, until V(s(q)) ≤ r(s(q), s(q+1))/(α − 1) ≤ 1, q > k ∈ ℕ. Then, following the same analysis path as the worst case, we conclude Equation (52) consequently. In other words, V(s(k)) ≤ 1, ∀k ∈ ℕ, if V(s(1)) ≤ 1, which, in conjunction with Equation (6) and Equation (45), leads to

s(k) ∈ Ω, ∀k ∈ ℕ, if s(1) ∈ Ω. (53)

This proof path is illustrated in Figure 5. Finally, we note from Lemma 5.1 that the condition in Equation (7) guarantees Ω ⊆ X, which with Equation (53) results in s(k) ∈ Ω ⊆ X, ∀k ∈ ℕ, if s(1) ∈ Ω, which completes the proof.

[Figure 5: Illustration of the proof path of Theorem 5.3 on the safety envelope.]

¹In practice, the control command over one control period should not drive the system to escape the safety envelope Ω from V(s(q)) = r(s(q), s(q+1))/(α − 1).

D.2 EXTENSION: EXPLANATION OF FAST TRAINING TOWARD SAFETY GUARANTEE

The experimental results demonstrate that, compared with purely data-driven DRL, our proposed Phy-DRL has much faster training toward a safety guarantee, which can be explained by Equation (44),
D.2 EXTENSION: EXPLANATION OF FAST TRAINING TOWARD SAFETY GUARANTEE

The experimental results demonstrate that, compared with purely data-driven DRL, our proposed Phy-DRL has much faster training toward a safety guarantee. This can be explained by Equation (44), Equation (45), and Equation (46). Specifically, because of them and 0 < α < 1, we have

V(s(k+1)) − V(s(k)) ≤ V(s(k+1)) − α V(s(k)) = (B a_drl(k) + f(s(k), a(k)))^⊤ P (B a_drl(k) + f(s(k), a(k))) + 2 s^⊤(k) Ā^⊤ P (B a_drl(k) + f(s(k), a(k))) + s^⊤(k) (Ā^⊤ P Ā − α P) s(k). (54)

The (off-line designed) model-based property in Equation (44) implies that Equation (54) contains an always-negative term, i.e., s^⊤(k) (Ā^⊤ P Ā − α P) s(k) < 0 for s(k) ≠ 0. As depicted in Figure 6, this term can be understood as placing a global attractor inside the safety envelope, which generates an attracting force toward the safety envelope. Because of this always-existing attracting force, system states under the control of Phy-DRL are more likely, and quicker, to stay inside the safety envelope than under purely data-driven DRL frameworks. In summary, the driving factor behind Phy-DRL's fast training toward a safety guarantee is the concurrence of the safety-embedded reward and the residual action policy.

Figure 6: Explanation of Phy-DRL's fast training: the root reason is the (off-line designed) model-based property in Equation (44), i.e., s^⊤(Ā^⊤ P Ā − α P) s < 0, which generates an always-existing attracting force toward the safety envelope.

E EXTENSION: MATHEMATICALLY-PROVABLE SAFETY AND STABILITY GUARANTEES

We first present the definition of a stability guarantee.

Definition E.1. The real plant (1) is said to have a stability guarantee if, given any s(1) ∈ R^n, lim_{k→∞} s(k) = 0_n.

The relation between the safety guarantee and the stability guarantee is presented in Appendix E.1.

E.1 SAFETY VERSUS STABILITY

Figure 7: Explanations of safety and stability via phase plots around the equilibrium point: (a) only stability is guaranteed, (b) only safety is guaranteed, (c) both safety and stability are guaranteed.

According to Definition 2.1 and Definition E.1, the phase plots in Figure 7 depict the relation between safety and stability:

Figure 7 (a): Only stability is guaranteed. Operating from any initial condition (inside or outside the safety set), the system state converges to the equilibrium (zero). But safety is not guaranteed, since, operating from an initial condition inside the safety set, the system can still leave the safety set later.

Figure 7 (b): Only safety is guaranteed. Operating from any initial condition inside the safety set, the system state never leaves the safety set but does not converge to the equilibrium point.

Figure 7 (c): Both safety and stability are guaranteed. Operating from any initial condition inside the safety set, the system state never leaves the safety set and converges to the equilibrium.

E.2 PROVABLE SAFETY AND STABILITY GUARANTEES

Thanks to the residual action policy and the safety-embedded reward, the proposed Phy-DRL exhibits mathematically-provable safety and stability guarantees, which is formally stated in the following theorem.

Theorem E.2 (Mathematically-Provable Safety and Stability Guarantees). Consider the safety set X (2), the safety envelope Ω (6), and the system (1) under the control of Phy-DRL. The matrices F and P involved in the model-based action policy (4) and the safety-embedded reward (8) are computed according to Equation (11), where R and Q satisfy the inequalities (7) and (11).
Both the safety and the stability of the system (1) are guaranteed if the sub-reward r(s(k), s(k+1)) in Equation (8) satisfies r(s(k), s(k+1)) > (α − 1) s^⊤(k) P s(k).

Proof. The proof is straightforward and builds on the proof of Theorem 5.3 in Appendix D.1. If (α − 1) V(s(k)) − r(s(k), s(k+1)) < 0, we obtain from Equation (49) that V(s(k+1)) < V(s(k)) for all k ∈ N, which implies that V(s(k)) is strictly decreasing with respect to time k ∈ N. Under this condition, Phy-DRL thus stabilizes the real plant (1). Additionally, because of the strict decrease of V(s(k)), we obtain Equation (53) by considering Equation (6). In light of Lemma 5.1, the condition in Equation (7) guarantees that Ω ⊆ X, which with Equation (53) results in s(k) ∈ Ω ⊆ X for all k ∈ N if s(1) ∈ Ω. We thus conclude that under this condition both safety and stability are guaranteed, which completes the proof.

F NN INPUT AUGMENTATION: EXPLANATIONS AND EXAMPLE

Line 16 of Algorithm 1 shows that the algorithm finally stacks the vector with a one. This operation means that one Phy_N node is assigned the constant value one, so the bias is treated as the link weights associated with this node of ones. As the example in Figure 8 shows, the NN input augmentation empowers Phy_N to capture core nonlinearities of physical quantities, such as kinetic energy (½mv²) and aerodynamic drag force (½ρv²C_D A), that drive the state dynamics of physical systems, and thus to represent or approximate physics knowledge in the form of a polynomial function. Lines 6–13 of Algorithm 1 guarantee that the generated node-representation vectors embrace all the non-missing and non-redundant monomials of a polynomial function. One such example is shown in Figure 8: the monomial [x]₁²[x]₂² is generated only by [x]₁ · ([x]₁[x]₂²), not by others (e.g., not by [x]₂ · ([x]₂[x]₁²)). Meanwhile, it is straightforward to verify from the compact example in Figure 8 that all the generated monomials are non-missing and non-redundant.

Figure 8: An example of Algorithm 1 in the TensorFlow framework, where the input is x ∈ R³ and the augmentation order is r = 3 (the nodes are grouped into zero-order, first-order, second-order and third-order monomials).

G CONTROLLABLE MODEL ACCURACY

Taylor's Theorem offers a series expansion of arbitrary nonlinear functions, as shown below.

Taylor's Theorem (Chapter 2.4, Königsberger (2013)): Let g: R^n → R be an r-times continuously differentiable function at the point o ∈ R^n. Then there exist functions h_α: R^n → R, with |α| = r, such that

g(x) = Σ_{|α| ≤ r} (∂^α g(o)/α!) (x − o)^α + Σ_{|α| = r} h_α(x) (x − o)^α,  and  lim_{x→o} h_α(x) = 0, (55)

where α = [α₁; α₂; ...; α_n], |α| = Σ_{i=1}^{n} α_i, α! = Π_{i=1}^{n} α_i!, x^α = Π_{i=1}^{n} x_i^{α_i}, and ∂^α g = ∂^{|α|} g / (∂x₁^{α₁} ... ∂x_n^{α_n}).

Given that h_α(x) is finite and ‖x − o‖ < 1, the error Σ_{|α|=r} h_α(x)(x − o)^α of approximating the ground truth g(x) drops significantly as the order r = |α| increases, and lim_{r→∞} h_α(x)(x − o)^α = 0. This allows for controllable model accuracy via controlling the order r.

H EXAMPLE: OBTAINING KNOWLEDGE SET K_Q

We use a simple example to explain how K_Q is derived from Equation (5), according to Taylor's theorem Königsberger (2013). For simplicity, we let s(k) ∈ R, a_drl(k) ∈ R, and r_Q = 2. By Algorithm 1 with r_t = r_Q and y_t = [s(k); s(k+1); a_drl(k)], we have

m(s(k), a_drl(k), r_Q) = [1, s(k), s(k+1), a_drl(k), s²(k), s(k)·s(k+1), s(k)·a_drl(k), s²(k+1), s(k+1)·a_drl(k), a²_drl(k)]^⊤. (56)
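For concreteness, the node-representation vector in Equation (56) can be reproduced by enumerating monomials up to the given order; the sketch below is a minimal stand-in for Algorithm 1 (it is not the paper's implementation) using itertools and sympy.

```python
from itertools import combinations_with_replacement
import sympy as sp

def augment(y, order):
    """All monomials of the entries of y up to the given order, led by the constant 1
    (a minimal stand-in for the node-representation vector of Algorithm 1)."""
    monomials = [sp.Integer(1)]
    for r in range(1, order + 1):
        for combo in combinations_with_replacement(y, r):
            monomials.append(sp.Mul(*combo))
    return monomials

s_k, s_k1, a = sp.symbols("s_k s_k1 a_drl")
print(augment([s_k, s_k1, a], 2))
# -> [1, s_k, s_k1, a_drl, s_k**2, s_k*s_k1, s_k*a_drl, s_k1**2, s_k1*a_drl, a_drl**2]
```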
Observing Equation (5), we can also denote the action-value function as Q^π(R(s(k), a_drl(k))) ≡ Q^π(s(k), a_drl(k)). For our reward, we let

R(s(k), a_drl(k)) = s^⊤(k) Ā^⊤ P Ā s(k) − s^⊤(k+1) P s(k+1). (57)

Now, according to Taylor's theorem in Appendix G, expanding the action-value function Q^π(R(s(k), a_drl(k))) with respect to R(s(k), a_drl(k)), we have

Q^π(R(s(k), a_drl(k))) = b + w₁ R(s(k), a_drl(k)) + w₂ R²(s(k), a_drl(k)) + ... = A_Q m(s(k), a_drl(k), r_Q) + p(s(k), a_drl(k)), (58)

where p(s(k), a_drl(k)) collects the expansion terms of order higher than r_Q (i.e., w₂R² + ...). Recalling Equation (56), Equation (57), and Taylor's theorem in Appendix G, we then conclude from Equation (58) that the weight matrix is

A_Q = [ b  0  0  0  w₁[P]_{1,1}  2w₁[P]_{1,2}  0  w₁[P]_{2,2}  0  0 ],

where b and w₁ are learning parameters. The unknown p(s(k), a_drl(k)) does not include any monomial in Equation (56). So the elements of K_Q in this example are all the entries of A_Q, i.e., K_Q = {[A_Q]₁, ..., [A_Q]₁₀}.

I PHYSICS-MODEL-GUIDED NEURAL NETWORK EDITING: EXPLANATIONS

I.1 ACTIVATION EDITING

For the edited weight matrix W_t, if the entries in one of its rows are all in the knowledge set, the associated activation should be inactivated. Otherwise, the end-to-end input/output of the DNN may not strictly preserve the available physics knowledge, due to the additional nonlinear mappings induced by the activation functions. This motivates the physics-knowledge-preserving computing in Line 19 of Algorithm 2. Figure 9 summarizes the flowchart of NN editing in a single Phy_N layer:

Given the node-representation vector from Algorithm 1, the original (fully-connected) weight matrix is edited via link editing to embed the assigned physics knowledge, resulting in W_t.

The edited weight matrix W_t is separated into a knowledge matrix K_t and an uncertainty matrix U_t, such that W_t = K_t + U_t. Specifically, K_t, generated in Line 8 and Line 14 of Algorithm 2, includes all the parameters in the knowledge set, while M_t, generated in Line 9 and Line 15, is used to generate the uncertainty matrix U_t (see Line 18), which includes all the parameters excluded from the knowledge set. This is achieved by freezing to zero those parameters of W_t that are included in the knowledge set.

K_t, M_t and the activation-masking vector a_t (generated in Line 10 and Line 16) are used by activation editing for the physics-knowledge-preserving computation of the output of each Phy_N layer. The function of a_t is to avoid the extra mapping (induced by activation) that the prior physics knowledge does not include.

Figure 9: Flowchart of NN editing in a single Phy_N layer: link editing produces the knowledge matrix K_t and the uncertainty matrix U_t from the original weight matrix, and activation editing applies the activation-masking vector a_t in the physics-knowledge-preserving computation y_t = K_t m(y_{t−1}, r_t) + a_t ⊙ act(U_t m(y_{t−1}, r_t)); filled and hollow markers denote parameters included in and excluded from the knowledge set, respectively.

I.2 KNOWLEDGE PRESERVING AND PASSING

The flowchart of NN editing operating on cascade Phy_Ns is depicted in Figure 10. Lines 5–9 of Algorithm 2 mean that A_ϖ = K_1 + M_1 ⊙ A_ϖ; leveraging this and the setting r_1 = r_ϖ, the ground-truth models (13) and (14) can be rewritten as

y = (K_1 + M_1 ⊙ A_ϖ) m(x, r_1) + p(x) = K_1 m(x, r_1) + (M_1 ⊙ A_ϖ) m(x, r_1) + p(x), (59)

where we define

y ≜ { Q^π(s, a_drl), if ϖ = Q;  π(s), if ϖ = π },  x ≜ { [s; a_drl], if ϖ = Q;  s, if ϖ = π },  r ≜ { r_Q, if ϖ = Q;  r_π, if ϖ = π }.
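The single-layer computation in Figure 9 (Line 19 of Algorithm 2) can be sketched numerically as follows; the shapes, the knowledge pattern (a fully known first row), and the tanh activation are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
act = np.tanh                      # any activation; it is bypassed where knowledge is embedded

len_m, len_y = 6, 3                # length of m(y_{t-1}, r_t) and of the layer output (assumed)
W = rng.standard_normal((len_y, len_m))          # original fully connected weight matrix

# Link editing: suppose the knowledge set fixes row 0 (all of its entries are known).
known = np.zeros_like(W, dtype=bool)
known[0, :] = True
K = np.where(known, W, 0.0)        # knowledge matrix K_t: known parameters only
M = (~known).astype(float)         # masking matrix M_t: 1 for trainable entries, 0 for known ones
U = M * W                          # uncertainty matrix U_t, so that W_t = K_t + U_t

# Activation editing: rows whose parameters are all known get no activation.
a_mask = (M.sum(axis=1) > 0).astype(float)       # activation-masking vector a_t

m = rng.standard_normal(len_m)     # node-representation vector from Algorithm 1
y = K @ m + a_mask * act(U @ m)    # physics-knowledge-preserving computation (Figure 9)

# The knowledge-governed output entry stays exactly linear in m, untouched by the activation.
assert np.isclose(y[0], K[0] @ m)
print(y)
```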
We obtain from Line 19 of Algorithm 2 that the output of the first Phy_N layer is

y_1 = K_1 m(x, r_1) + a_1 ⊙ act(U_1 m(x, r_1)). (60)

Recalling that K_1 includes all the knowledge-set entries of A_ϖ while U_1 includes the remainder, we conclude from Equation (59) and Equation (60) that the available physics knowledge pertaining to the ground-truth model has been embedded into the first Phy_N layer. As Figure 10 shows, the knowledge embedded in the first layer shall be passed down to the remaining cascade Phy_Ns and preserved therein, such that the end-to-end critic and actor networks strictly comply with the physics knowledge. This knowledge passing is achieved by the block matrix K_p generated in Line 14, thanks to which the output of the t-th Phy_N layer satisfies

[y_t]_{1:len(y)} = K_1 m(x, r_1)  (knowledge passing)  +  [a_t ⊙ act(U_t m(y_{t−1}, r_t))]_{1:len(y)}  (knowledge preserving),  t ∈ {2, ..., p}. (61)

Meanwhile, U_t = M_t ⊙ W_t means that the masking matrix M_t generated in Line 15 removes the spurious correlations in the cascade Phy_Ns, which is depicted by the link-cutting operation in Figure 10.

Figure 10: Example of NN editing, i.e., Algorithm 2, over three cascade Phy_N layers with augmentation orders r₁ = 2, r₂ = 2 and r₃ = 1. (i) Parameters excluded from the knowledge set are formed by the grey links, while parameters included in the knowledge set are formed by the red and blue links. (ii) Cutting the black links avoids spurious correlations; otherwise, these links can lead to violations of the physics knowledge about the governing Equation (13) and Equation (14).

J PROOF OF THEOREM 6.4

Let us first consider the first Phy_N layer, i.e., the case t = 1. Line 8 of Algorithm 2 means that the knowledge matrix K_1 includes the parameters belonging to the knowledge set, whose corresponding entries in the masking matrix M_1 (generated in Line 9 of Algorithm 2) are frozen to zero. Consequently, both M_1 ⊙ A_ϖ and U_1 = M_1 ⊙ W_1 exclude all the parameters of the knowledge matrix K_1. We thus conclude that (M_1 ⊙ A_ϖ) m(x, r_1) + p(x) in the ground-truth model (59) and a_1 ⊙ act(U_1 m(x, r_1)) in the output computation in Line 21 are independent of the term K_1 m(x, r_1). Moreover, the activation-masking vector (generated in Line 10 of Algorithm 2) indicates that if all the entries in the i-th row of the masking matrix are zeros (implying that all the entries in the i-th row of the weight matrix are included in the knowledge set K_ϖ), the activation function corresponding to the i-th entry of the output is inactive. Finally, we arrive at the conclusion that the input/output of the first Phy_N layer strictly complies with the available physics knowledge pertaining to the ground truth (59); i.e., if [A_ϖ]_{i,j} ∈ K_ϖ, then [y_1]_i carries no dependence on the monomial [m(x, r_ϖ)]_j other than that prescribed by the knowledge.
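The knowledge-passing mechanism of Equation (61), which is iterated formally below, can be previewed with a small two-layer sketch; the block structure of K_t and the masking pattern are assumptions chosen to mimic Lines 14–15 of Algorithm 2, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
act = np.tanh
len_y = 3                                     # length of the end-to-end output y

def monomials(v):
    """[1; v; second-order terms] -- a stand-in for m(v, r=2)."""
    quad = [v[i] * v[j] for i in range(len(v)) for j in range(i, len(v))]
    return np.concatenate(([1.0], v, quad))

y1 = rng.standard_normal(len_y)               # output of the first (knowledge-embedding) layer
m2 = monomials(y1)

# Block matrix K_2 (Line 14): its first len_y rows copy entries 2..(len_y+1) of m2, i.e. y1 itself.
K2 = np.zeros((len_y, m2.size))
K2[:, 1:1 + len_y] = np.eye(len_y)

# Masking (Line 15): cut links from the knowledge-carrying nodes [1; y1] to the trainable part.
W2 = rng.standard_normal((len_y, m2.size))
M2 = np.ones_like(W2)
M2[:, :1 + len_y] = 0.0
U2 = M2 * W2
a2 = np.ones(len_y)

y2 = K2 @ m2 + a2 * act(U2 @ m2)              # Line 19, second layer

# Knowledge passing (Equation (61)): the first len_y entries carry y1 forward unchanged,
# while the trainable residual only sees the higher-order monomials.
assert np.allclose(y2 - act(U2 @ m2), y1)
print(y2)
```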
We next consider the remaining Phy_N layers. Considering Line 19 of Algorithm 2, we have

[y_p]_{1:len(y)} = [K_p m(y_{p−1}, r_p)]_{1:len(y)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}
= I_{len(y)} [m(y_{p−1}, r_p)]_{2:(len(y)+1)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)} (62)
= I_{len(y)} [y_{p−1}]_{1:len(y)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)} (63)
= [y_{p−1}]_{1:len(y)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}
= [K_{p−1} m(y_{p−2}, r_{p−1})]_{1:len(y)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}
= I_{len(y)} [m(y_{p−2}, r_{p−1})]_{2:(len(y)+1)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}
= I_{len(y)} [y_{p−2}]_{1:len(y)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}
= [y_{p−2}]_{1:len(y)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}
= ...
= [y_1]_{1:len(y)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}
= [K_1 m(x, r_1)]_{1:len(y)} + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}
= K_1 m(x, r_1) + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)}, (64)

where Equation (62) and Equation (63) are obtained from their previous steps by considering the structure of the block matrix K_t (generated in Line 14 of Algorithm 2) and the formula of the augmented monomials, m(y, r) = [1; y; [m(y, r)]_{(len(y)+2):len(m(y,r))}] (generated via Algorithm 1). The remaining iterative steps follow the same path. The training loss function pushes the terminal output of Algorithm 2 to approximate the real output y, which in light of Equation (64) yields

ŷ = K_1 m(x, r_1) + [a_p ⊙ act(U_p m(y_{p−1}, r_p))]_{1:len(y)} = K_1 m(x, r_1) + a_p ⊙ act(U_p m(y_{p−1}, r_p)), (65)

where the last equality in Equation (65) is obtained by considering the fact that len(ŷ) = len(y) = len(y_p). Meanwhile, the condition for generating the weight-masking matrix in Line 15 of Algorithm 2 removes all the node-representation connections to the parameters of the knowledge set included in K_1. Therefore, we conclude that in the terminal output computation in Equation (65), the term a_p ⊙ act(U_p m(y_{p−1}, r_p)) has no influence on the computation of the knowledge term K_1 m(x, r_1). Thus, Algorithm 2 strictly embeds and preserves the available knowledge pertaining to the physics model of the ground truth in Equation (59).

K EXPERIMENT: CART-POLE SYSTEM

Figure 11: Mechanical analog of inverted pendulums (with the friction force annotated).

K.1 PHYSICS-MODEL KNOWLEDGE

To obtain the model-based action policy, the first step is to obtain the system matrix A and the control structure matrix B of the real plant (1). In other words, the available model knowledge about the dynamics of the cart-pole system is the linear model

s(k+1) = A s(k) + B a(k), k ∈ N. (66)

We refer to the dynamics model of the cart-pole system described in Florian (2007) and consider the approximations cos θ ≈ 1, sin θ ≈ θ and ω² sin θ ≈ 0 to obtain (A, B) as

A = [ 1  0.0333  0       0
      0  1       0.0565  0
      0  0       1       0.0333
      0  0       0.8980  1 ],   B = [0  0.0334  0  0.0783]^⊤. (67)

K.2 SAFETY KNOWLEDGE

Considering the safety conditions in Equation (17) and the formula of the safety set in Equation (2), we have

D = [ 1 0 0 0
      0 0 1 0 ],  v̄ = [0.9; 0.8],  v = [−0.9; −0.8], (68)

based on which, according to Λ, Λ̄ and d̄ defined in Lemma 5.1, we have

Λ = Λ̄ = [ 0.9  0
           0    0.8 ], (69)

from which and D given in Equation (68), we then have

D̄ = [ 10/9  0  0    0
       0     0  5/4  0 ]. (70)

K.3 MODEL-BASED ACTION POLICY AND DRL REWARD

With the knowledge given in Equation (69) and Equation (70), the matrices F and P are ready to be obtained by solving the centering problem in Equation (12). We let α = 0.98.
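For reference, the sketch below shows one way the design problem above could be posed with CVXPY; the exact forms of the conditions (7), (11) and the centering objective (12) are given in the main text, so the constraint reading and the log-det objective used here are assumptions for illustration only.

```python
import numpy as np
import cvxpy as cp

# Cart-pole model knowledge (A, B) from Equation (67), the normalized safety matrix D_bar from
# Appendix K.2, and alpha = 0.98 as stated above.
A = np.array([[1, 0.0333, 0,      0],
              [0, 1,      0.0565, 0],
              [0, 0,      1,      0.0333],
              [0, 0,      0.8980, 1]])
B = np.array([[0.0], [0.0334], [0.0], [0.0783]])
D_bar = np.array([[1 / 0.9, 0, 0,       0],
                  [0,       0, 1 / 0.8, 0]])
alpha = 0.98

Q = cp.Variable((4, 4), symmetric=True)   # Q = P^{-1}
R = cp.Variable((1, 4))                   # R = F Q

# One reading of the LMI (11): [[alpha*Q, (A Q + B R)^T], [A Q + B R, Q]] is positive semidefinite.
S = cp.Variable((8, 8), symmetric=True)
constraints = [Q >> 1e-6 * np.eye(4),
               S == cp.bmat([[alpha * Q, (A @ Q + B @ R).T],
                             [A @ Q + B @ R, Q]]),
               S >> 0]
# One reading of condition (7): [D_bar Q D_bar^T]_{ii} <= 1, so the envelope lies in the safety set.
for i in range(D_bar.shape[0]):
    d = D_bar[i:i + 1, :]
    constraints.append(d @ Q @ d.T <= 1)

# "Centering" objective: maximize the volume of the safety envelope (a common choice).
problem = cp.Problem(cp.Maximize(cp.log_det(Q)), constraints)
problem.solve()

P = np.linalg.inv(Q.value)
F = R.value @ P                           # F = R Q^{-1}
print("P =\n", P, "\nF =\n", F)
```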
Using the CVXPY toolbox Diamond & Boyd (2016) in Python, we obtain

Q = [ 0.66951866   0.69181711   0.27609583   0.55776279
      0.69181711   9.86247186   0.1240829    12.4011146
      0.27609583   0.1240829    0.66034399   2.76789607
      0.55776279   12.4011146   2.76789607   32.32280039 ],
R = [ 6.40770185  18.97723676  6.10235911  31.03838284 ],

based on which we then have

P = Q^{-1} = [ 4.6074554    1.49740096   5.80266046   0.99189224
               1.49740096   0.81703147   2.61779592   0.51179642
               5.80266046   2.61779592   11.29182733  1.87117709
               0.99189224   0.51179642   1.87117709   0.37041435 ],
F = R P = [ 8.25691599  6.76016534  40.12484514  6.84742553 ], (72)
Ā = A + B F = [ 1           0.03333333  0           0
                0.27592037  1.22590363  1.2843559   0.2288196
                0           0           1           0.03333333
                0.64668827  0.52946156  2.24458365  0.46370415 ].

With these solutions and letting w(s(k), a(k)) = a²_drl(k), the model-based action policy (4) and the safety-embedded reward (8) are then ready for the Phy-DRL.

K.4 SOLE MODEL-BASED ACTION POLICY: FAILURE DUE TO LARGE MODEL MISMATCH

The trajectories of the cart-pole system under the control of the sole model-based action policy, i.e., a(k) = a_phy(k) = F s(k), are shown in Figure 12. The system's initial condition lies in the safety envelope, i.e., s(1) ∈ Ω. Figure 12 shows that the sole model-based action policy can neither stabilize the system nor guarantee its safety. This failure is due to the large model mismatch between the simplified linear model (66) and the real system, whose dynamics are nonlinear. In contrast, Figure 3 shows that the Phy-DRL successfully overcomes the large model mismatch and renders the system safe and stable, tested with many initial conditions s(1) ∈ X.

Figure 12: System trajectories of the cart-pole system under the sole model-based action policy: unstable and unsafe.

K.5 CONFIGURATIONS: NETWORKS AND TRAINING CONDITIONS

In this case study, the goal of the action policy is to stabilize the pendulum at the equilibrium s* = [x*, v*, θ*, ω*]^⊤ = [0, 0, 0, 0]^⊤ while constraining the system state to the safety set in Equation (17). We convert the measured angle θ into sin(θ) and cos(θ) to simplify the learning process. Therefore, the observation can be expressed as s = [x, v, sin(θ), cos(θ), ω]^⊤. We also add a terminal condition to the training episodes that stops the running of the cart-pole system when a safety violation occurs, for both DRL and Phy-DRL, defined as

β(s(k)) = { 1, if |x(k)| ≥ 0.9 or |θ(k)| ≥ 0.8;  0, otherwise }.

Specifically, the running of the cart-pole system (starting from an initial condition) during training is terminated if either the cart position or the pendulum angle exceeds the safety bounds, or the pendulum falls. During training, we reset episodes of the system running from random initial conditions inside the safety set whenever the maximum number of steps is reached or β(s(k)) = 1.

The development of Phy-DRL is based on the DDPG algorithm. The actor and critic networks in the DDPG algorithm are implemented as multi-layer perceptrons (MLPs) with four fully connected layers, whose output dimensions are 256, 128, 64, and 1, respectively. The activation functions of the first three layers are ReLU, while the output layer uses the Tanh function for the actor network and a linear activation for the critic network. The input of the critic network is [s; a], while the input of the actor network is s.

K.6 TRAINING

For the code, we use the Python API of the TensorFlow framework Abadi et al. and the Adam optimizer Kingma & Ba for training.
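A minimal Keras sketch of the critic and actor MLPs described above (layer widths 256, 128, 64 and 1, ReLU hidden activations, Tanh/linear outputs); the helper name and the default initializers are placeholders, not the project's code.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Cart-pole observation s = [x, v, sin(theta), cos(theta), omega] and a scalar action a_drl.
obs_dim, act_dim = 5, 1

def make_mlp(in_dim, out_dim, out_activation):
    """Four fully connected layers with output widths 256, 128, 64 and out_dim."""
    return tf.keras.Sequential([
        layers.Dense(256, activation="relu", input_shape=(in_dim,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(out_dim, activation=out_activation),
    ])

actor = make_mlp(obs_dim, act_dim, "tanh")          # a_drl = actor(s), bounded by Tanh
critic = make_mlp(obs_dim + act_dim, 1, "linear")   # Q-value estimate for the input [s; a]
actor.summary()
critic.summary()
```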
This project uses the following settings: 1) Ubuntu 20.04, 2) Python 3.7, 3) TensorFlow 2.5.0, 4) Numpy 1.19.5, and 5) Gym 0.20. For Phy-DRL, we let the discount factor γ = 0.4, and the learning rates of the critic and actor networks are both 0.0003. We set the batch size to 200. The total number of training steps is 10⁶, and the maximum number of steps in one episode is 1000. Each weight matrix is initialized randomly from a truncated normal distribution with zero mean, discarding and re-drawing any samples more than two standard deviations from the mean. Each bias is initialized according to a normal distribution with zero mean.

K.7 TESTING COMPARISONS: MODEL FOR STATE PREDICTION?

We perform testing comparisons of two Phy-DRLs (with and without a model inside for state prediction) and two DRLs (with and without a model inside for state prediction). The two DRLs use the same CLF reward, R(·) = s^⊤(k) P s(k) − s^⊤(k+1) P s(k+1) + w(s(k), a(k)) (proposed in Westenbroek et al. (2022)), where P is the same as the one in Phy-DRL's safety-embedded reward. The model used for state prediction is the one in Equation (66). The testing results are presented in Figure 13, in which

mf-Phy-DRL denotes a policy trained via our Phy-DRL that does not adopt the model in Equation (66) for state prediction;
mb-Phy-DRL denotes a policy trained via our Phy-DRL that adopts the model in Equation (66) for state prediction;
mf-DRL denotes a policy trained via DRL that does not adopt the model in Equation (66) for state prediction;
mb-DRL denotes a policy trained via DRL that adopts the model in Equation (66) for state prediction.

Besides, all the training models of mf-Phy-DRL, mb-Phy-DRL, mf-DRL and mb-DRL have the same configurations of critic and actor networks, presented in Appendix K.5. The performance metrics are the areas of IE and EE samples, defined in Equation (18) and Equation (19), respectively.

Figure 13: Blue: area of IE samples defined in Equation (18). Green: area of EE samples defined in Equation (19). Rectangular area: safety set. Ellipse area: safety envelope. Columns (left to right): mf-Phy-DRL, mb-Phy-DRL, mf-DRL, mb-DRL. (a)-(d): policies trained for 5·10⁴ steps; (e)-(h): 7.5·10⁴ steps; (i)-(l): 10⁵ steps; (m)-(p): 2·10⁵ steps.

Observing Figure 13, we conclude the following:

Our Phy-DRL without a model for state prediction can quickly complete training for rendering the safety envelope invariant (within only 5·10⁴ steps), while our Phy-DRL with a model for state prediction is slightly slower, completing the safety task within 7.5·10⁴ training steps.
The safe areas of the DRL policies are much smaller than those of our Phy-DRL policies. Even when training is increased to 2·10⁵ steps, the mf-DRL and mb-DRL policies cannot render the safety envelope invariant, i.e., they cannot provide a safety guarantee.

Incorporating our linear model into DRL for state prediction does not improve system performance. A root reason was revealed in Janner et al. (2019): the performance of model-based RL is constrained by modeling errors or model mismatch. Specifically, if a model has a large mismatch with the nonlinear dynamics of the real system, relying on that model for state prediction (in model-based RL) may lead to sub-optimal performance and safety violations.

L EXPERIMENT: QUADRUPED ROBOT

Figure 14: Quadruped robot: 3D single rigid-body model.

The developed package for training the robot via DRL is built on a Python-based framework for the A1 robot from Unitree, released on GitHub. The original framework includes a simulation based on PyBullet, an interface for direct sim-to-real transfer, and an implementation of the convex MPC controller for basic motion control. For the quadruped robot, the outputs of the designed action policies are the desired positional and rotational accelerations. The computed accelerations are then converted to low-level motor torque commands.

L.1 OVERVIEW: BEST TRADE-OFF BETWEEN MODEL-BASED DESIGN AND DATA-DRIVEN DESIGN

In the following sections, we demonstrate that the residual action policy of Phy-DRL is a best trade-off between the model-based policy and the data-driven DRL policy for safety-critical control. Specifically, Appendix L.2 shows explicitly that the dynamics of the quadruped robot is highly nonlinear, so that directly leveraging it to design an analyzable and verifiable model-based action policy is extremely hard. Appendix L.3 then shows that the residual action policy of Phy-DRL allows the model-based design to be simplified to an analyzable and verifiable linear one, while offering fast and stable training (see Appendix L.6 and Appendix L.7).

L.2 REAL SYSTEM DYNAMICS: HIGHLY NONLINEAR!

The dynamics model of the robot is based on a single rigid body subject to forces at the contact patches Di Carlo et al. (2018). Referring to Figure 14, the considered robot dynamics is characterized by the position of the body's center of mass (CoM) p = [p_x; p_y; p_z] ∈ R³, the CoM velocity v = ṗ ∈ R³, the Euler angles e = [ϕ; θ; ψ] ∈ R³ with ϕ, θ and ψ being the roll, pitch and yaw angles, respectively, and the angular velocity in world coordinates w ∈ R³. The robot's state vector is

ŝ = [CoM x-position; CoM y-position; CoM z-height; roll; pitch; yaw; CoM x-velocity; CoM y-velocity; CoM z-velocity; angular velocity w ∈ R³]. (74)

Before presenting the body dynamics model, we introduce the set of foot states {Left Front (LF), Right Front (RF), Left Rear (LR), Right Rear (RR)}, based on which we define the following two footsteps for describing the trotting behavior of the quadruped robot:

Step1 ≜ {LF = 1, RF = 0, LR = 0, RR = 1},  Step2 ≜ {LF = 0, RF = 1, LR = 1, RR = 0}, (75)

where 0 indicates that the corresponding foot is in stance and 1 indicates otherwise (swing). The considered walking controller lifts two feet at a time by switching between the two stepping primitives in the order Step1 → Step2 → Step1 → Step2 → ... (repeating). According to the literature Di Carlo et al.
(2018), the body dynamics of quadruped robots can be described by

dŝ/dt = [ O₃ O₃ I₃ O₃
          O₃ O₃ O₃ R(ϕ, θ, ψ)
          O₃ O₃ O₃ O₃
          O₃ O₃ O₃ O₃ ] ŝ + B̂_σ(t) a_σ(t) + [0₃; 0₃; 0₃; e_g] + f(ŝ), (76)

where e_g = [0; 0; −g] ∈ R³ with g being the gravitational acceleration, f(ŝ) denotes the model mismatch, and R(ϕ, θ, ψ) = R_z(ψ) R_y(θ) R_x(ϕ) ∈ R^{3×3}, with R_i(α) ∈ R^{3×3} being the rotation by angle α about axis i. The a_σ(t) ∈ R⁶, σ(t) ∈ S ≜ {Step1, Step2}, in the dynamics (76) are the switching action commands, i.e.,

a_Step1 = [f_RF; f_LR] ∈ R⁶,  a_Step2 = [f_LF; f_RR] ∈ R⁶, (77)

where f_LF, f_RF, f_RR, f_LR ∈ R³ are the ground reaction forces, while B̂_σ(t) ∈ R^{12×6} denotes the corresponding switching control structure matrices:

B̂_Step1 = [ O₃                         O₃
             O₃                         O₃
             (1/m) I₃                   (1/m) I₃
             I^{-1}(ϕ, θ, ψ) [r_RF]×    I^{-1}(ϕ, θ, ψ) [r_LR]× ],
B̂_Step2 = [ O₃                         O₃
             O₃                         O₃
             (1/m) I₃                   (1/m) I₃
             I^{-1}(ϕ, θ, ψ) [r_LF]×    I^{-1}(ϕ, θ, ψ) [r_RR]× ], (78)

where I(ϕ, θ, ψ) ∈ R^{3×3} is the robot's inertia tensor, r_LF, r_RF, r_LR, r_RR ∈ R³ denote the positions of the four feet relative to the CoM, and [r_o]× is the skew-symmetric matrix

[r_o]× = [  0          −[r_o]_z    [r_o]_y
            [r_o]_z     0          −[r_o]_x
           −[r_o]_y     [r_o]_x     0       ],  o ∈ {LF, RF, LR, RR}.

L.3 SIMPLIFYING MODEL-BASED DESIGNS

To obtain the model knowledge, represented by (A, B), pertaining to the robot dynamics (76), we make the following simplifications:

R(ϕ, θ, ψ) = I₃,   B̃ = [ O₃ O₃
                          O₃ O₃
                          I₃ O₃
                          O₃ I₃ ], (79)

where R(ϕ, θ, ψ) = I₃ is obtained by setting the roll, pitch and yaw angles to zero, i.e., ϕ = θ = ψ = 0. Referring to the matrices in Equation (78), with the simplifications in Equation (79) at hand and ignoring the unknown model mismatch of the ground-truth model, we obtain a simplified linear model pertaining to the robot dynamics (76):

d/dt [p̃; ẽ; ṽ; w̃] = [ O₃ O₃ I₃ O₃
                        O₃ O₃ O₃ I₃
                        O₃ O₃ O₃ O₃
                        O₃ O₃ O₃ O₃ ] [p̃; ẽ; ṽ; w̃] + B̃ ũ_σ(t), (80)

where ũ_Step1 ≜ [f_RF; f_LR] ∈ R⁶ and ũ_Step2 ≜ [f_LF; f_RR] ∈ R⁶.

In light of the equilibrium point s* in Equation (21) and s̃ given in Equation (80), we define s ≜ s̃ − s*. It is then straightforward to obtain from Equation (80) the dynamics ṡ = Ã s + B̃ ũ_σ(t), which is transformed into a discrete-time model via sampling:

s(k+1) = A s(k) + B ũ_σ(k)(k), with A = I₁₂ + T·Ã and B = T·B̃, (81)

where T = 0.001 s is the sampling period. Considering the safety conditions in Equation (20), we obtain the safety set defined in Equation (2), whose matrices D, v̄ and v encode the constraints in Equation (20), with bounds built from the values 0.24, 0.17, 0.13, r_x and |r_x|. We now obtain model-based solutions P (a 12×12 positive-definite matrix) and F (a 6×12 feedback gain) that satisfy the LMIs in Equation (7) and Equation (11); with these and the matrices A and B in Equation (81), we are able to deliver the model-based policy (4) and the safety-embedded reward (8).
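The simplified model of Equations (79)–(81) is easy to reproduce numerically; the sketch below builds Ã and B̃ as block matrices and applies the sampling-based discretization with T = 0.001 s (the example input is arbitrary).

```python
import numpy as np

# Simplified continuous-time model of Equations (79)-(80) and its discretization, Equation (81).
O3, I3 = np.zeros((3, 3)), np.eye(3)

A_tilde = np.block([[O3, O3, I3, O3],
                    [O3, O3, O3, I3],
                    [O3, O3, O3, O3],
                    [O3, O3, O3, O3]])        # 12 x 12
B_tilde = np.block([[O3, O3],
                    [O3, O3],
                    [I3, O3],
                    [O3, I3]])                # 12 x 6

T = 0.001                                     # sampling period in seconds
A = np.eye(12) + T * A_tilde                  # A = I_12 + T * A_tilde
B = T * B_tilde                               # B = T * B_tilde

# One step of the nominal discrete-time model s(k+1) = A s(k) + B u(k).
s = np.zeros(12)
u = np.array([0.0, 0.0, 0.3, 0.0, 0.0, 0.1])  # commanded accelerations (illustrative values)
print(A @ s + B @ u)
```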
L.4 TESTING EXPERIMENT: VELOCITY-TRACKING PERFORMANCE

The velocity trajectories of the four models (defined in Section 7.2) running in Environments 1–4 are shown in Figure 15. Observing Figure 15, we find that the trained Phy-DRL leads to much better velocity regulation and tracking performance than the solely model-based action policies and the purely data-driven DRL action policy.

Figure 15: Velocity trajectories of the quadruped robot in the four environments defined in Section 7.2: (a) velocity reference 1 m/s, friction coefficient 0.44; (b) velocity reference 0.5 m/s, friction coefficient 0.44; (c) velocity reference −1.4 m/s, friction coefficient 0.6; (d) velocity reference −0.4 m/s, friction coefficient 0.6. Each panel shows the velocity reference together with the trajectories of the compared policies.

L.5 SAFETY-EMBEDDED REWARDS

For the aim of a mathematically-provable safety guarantee, the reward most similar to ours is the CLF reward proposed in Westenbroek et al. (2022), with which we also perform comparisons. To simplify the comparisons, we do not consider the high-performance sub-reward, which means that both rewards degrade to

Ours:       R(s(k), a_drl(k)) = s^⊤(k) (Ā^⊤ P Ā) s(k) − s^⊤(k+1) P s(k+1), (83)
CLF reward: R(s(k), a_drl(k)) = s^⊤(k) P s(k) − s^⊤(k+1) P s(k+1). (84)

L.6 PHYSICS-KNOWLEDGE-ENHANCED CRITIC NETWORK

To apply NN editing, we first obtain the available knowledge about the action-value function and the action policy. Referring to Equation (5), the action-value function can be re-denoted as

Q^π(s(k), a_drl(k)) = Q^π(R(s(k), a_drl(k))). (85)

According to Taylor's theorem in Appendix G, expanding the action-value function in Equation (85) with respect to the (one-dimensional real value) R(s(k), a_drl(k)) and recalling Equation (83), we conclude that the action-value function does not include any odd-order monomials of [s(k)]_i, i = 1, ..., 12, and is independent of a_drl(k). The critic network shall strictly comply with this knowledge, which can be achieved via our proposed NN editing. We do not obtain invariant knowledge about the action policy in this example; in other words, according to our analysis, the action policy depends on all the elements of the system state s(k). So, in this example, we only need to design a physics-knowledge-enhanced critic network.

The architecture of the considered physics-knowledge-enhanced critic network in this example is shown in Figure 16, where the Phy_N architecture is given in Figure 2 (b). We compare the performance of physics-knowledge-enhanced critic networks with a fully-connected multi-layer perceptron (FC MLP). As Figure 16 shows, different critic networks can be obtained by changing only the output dimension n. We consider three models: the physics-knowledge-enhanced critic network with n = 10 (PKN-10), with n = 15 (PKN-15), and with n = 20 (PKN-20). The parameter counts of all network models are summarized in Table 2.

Figure 16: Considered physics-knowledge-enhanced critic network: three cascade Phy_N layers with augmentation orders r₁ = 2, r₂ = 2 and r₃ = 1, and intermediate output dimension n.
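The knowledge claim above (no odd-order state monomials and no dependence on a_drl(k)) follows from Q^π being a function of the quadratic form R in Equation (83); the toy symbolic check below illustrates this for a 2-dimensional state and a truncated expansion, with made-up matrices standing in for Ā^⊤PĀ and P.

```python
import sympy as sp

# Toy 2-D check: any polynomial in R(s(k), a_drl(k)) of Equation (83) contains only
# even-order monomials of the state entries and no a_drl term.
s1, s2, n1, n2, a = sp.symbols("s1 s2 n1 n2 a_drl")   # s(k) = [s1, s2], s(k+1) = [n1, n2]
M = sp.Matrix([[1.3, 0.2], [0.2, 0.9]])               # stands in for A_bar^T P A_bar
P = sp.Matrix([[2.0, 0.3], [0.3, 1.0]])

sk, sk1 = sp.Matrix([s1, s2]), sp.Matrix([n1, n2])
R = (sk.T * M * sk - sk1.T * P * sk1)[0, 0]

b, w1, w2 = sp.symbols("b w1 w2")
Q_expansion = sp.expand(b + w1 * R + w2 * R**2)       # truncated Taylor expansion of Q^pi(R)

poly = sp.Poly(Q_expansion, s1, s2, n1, n2, a)
for powers in poly.monoms():
    assert sum(powers) % 2 == 0                       # only even-order state monomials
    assert powers[-1] == 0                            # no dependence on a_drl
print("all monomials are even-order in the state and independent of a_drl")
```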
The trajectories of episode reward of the four models in Table 2 are shown in Figure 17, which, in conjunction with Table 2, shows that except for the smallest knowledge-enhanced critic network (i.e., model PKN-10), the other networks (i.e., models PKN-15 and PKN-20) outperform the much larger FC MLP model in terms of the number of parameters, the episode reward, and the stability of training. This, in turn, implies that the physics-knowledge-enhanced critic network can avoid significant spurious correlations via NN editing.

Table 2: Model parameters (#weights/#bias per layer, and total).
PKN-10:  Layer 1: 1710/10,   Layer 2: 650/10,    Layer 3: 10/1,                       total 2391
PKN-15:  Layer 1: 2565/15,   Layer 2: 2025/15,   Layer 3: 15/1,                       total 4636
PKN-20:  Layer 1: 3420/20,   Layer 2: 4600/20,   Layer 3: 20/1,                       total 8081
FC MLP:  Layer 1: 2304/128,  Layer 2: 16384/128, Layer 3: 16384/128, Layer 4: 128/1,  total 35585

L.7 REWARD COMPARISONS

We next compare the two rewards in Equation (83) and Equation (84) from the perspectives of design differences and experiments. To have a fair experimental comparison, we compare the two rewards within the same Phy-DRL package; in other words, we use two Phy-DRL models to train the robot, and the only difference between them is the reward: one uses our proposed reward in Equation (83), while the other uses the CLF reward in Equation (84).

L.7.1 COMPARISON: DESIGN DIFFERENCES

Along the ground-truth model of the real plant (1), with the consideration of Equation (3), Equation (4) and Equation (9), we have

s^⊤(k+1) P s(k+1) = (B a_drl(k) + f(s(k), a(k)))^⊤ P (B a_drl(k) + f(s(k), a(k))) + 2 s^⊤(k) Ā^⊤ P (B a_drl(k) + f(s(k), a(k))) + s^⊤(k) (Ā^⊤ P Ā) s(k). (86)

Figure 17: Trajectories of episode reward in training (smoothing rate 0.15, 3 random seeds): PKN-10, PKN-15 and PKN-20 versus FC MLP, shown as means with 95% confidence intervals.

Table 3: Episode numbers (average).
Ours:        PKN-15: 351,  PKN-20: 348
CLF reward:  PKN-15: 376,  PKN-20: 409

We next define two invariants and one unknown:

invariant-1 ≜ s^⊤(k) Ā^⊤ P Ā s(k), (87)
invariant-2 ≜ s^⊤(k) (P − Ā^⊤ P Ā) s(k), (88)
unknown ≜ (B a_drl(k) + f(s(k), a(k)))^⊤ P (B a_drl(k) + f(s(k), a(k))) + 2 s^⊤(k) Ā^⊤ P (B a_drl(k) + f(s(k), a(k))). (89)

We note that the formulas in Equation (87) and Equation (88) are called "invariants" because all the terms on their right-hand sides (i.e., the designed matrices Ā and P) are known to us and their properties are not influenced by training, while the formula in Equation (89) is called the "unknown" since the terms on its right-hand side are unknown to us, due to the unknown model mismatch f(s(k), a(k)) and the unknown (changing) data-driven action policy a_drl(k) during training. Using the definitions in Equations (87)–(89), the formula in Equation (86) is rewritten as s^⊤(k+1) P s(k+1) = unknown + invariant-1, by which, and recalling Equations (87)–(89), the two rewards in Equation (83) and Equation (84) are equivalently rewritten as

Ours:       R(s(k), a_drl(k)) = invariant-1 − s^⊤(k+1) P s(k+1) = −unknown, (90)
CLF reward: R(s(k), a_drl(k)) = s^⊤(k) P s(k) − s^⊤(k+1) P s(k+1) = invariant-2 − unknown. (91)
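The decomposition in Equations (90) and (91) is an algebraic identity that can be verified numerically; in the sketch below, P, Ā, B and the mismatch term are illustrative placeholders.

```python
import numpy as np

# Numerical check of Equations (90) and (91) with illustrative P, A_bar, B.
rng = np.random.default_rng(3)
P = np.array([[2.0, 0.3], [0.3, 1.0]])
A_bar = np.array([[0.8, 0.1], [-0.2, 0.7]])
B = np.array([[0.0], [0.5]])

s = rng.standard_normal(2)
a_drl = rng.standard_normal(1)
f = 0.1 * rng.standard_normal(2)                    # unknown model mismatch f(s, a)
s_next = A_bar @ s + B @ a_drl + f                  # ground-truth transition per (1), (3), (4), (9)

residual = B @ a_drl + f
invariant_1 = s @ (A_bar.T @ P @ A_bar) @ s         # Equation (87)
invariant_2 = s @ (P - A_bar.T @ P @ A_bar) @ s     # Equation (88)
unknown = residual @ P @ residual + 2 * s @ A_bar.T @ P @ residual   # Equation (89)

ours = invariant_1 - s_next @ P @ s_next            # Equation (83)
clf = s @ P @ s - s_next @ P @ s_next               # Equation (84)

assert np.isclose(ours, -unknown)                   # Equation (90): ours depends only on the unknown
assert np.isclose(clf, invariant_2 - unknown)       # Equation (91): CLF mixes invariant and unknown
print(ours, clf)
```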
Observing the formulas in Equation (90) and Equation (91), we discover a critical difference between our proposed reward in Equation (83) and the CLF reward in Equation (84): our reward decouples the invariant and the unknown for learning (i.e., the data-driven DRL only learns the unknown), while the CLF reward mixes the invariant and the unknown (i.e., the data-driven DRL learns both the unknown and the invariant).

L.7.2 COMPARISON: TRAINING

We next present the training behavior. We note that our reward in Equation (90) and the CLF reward in Equation (91) have different scales. To present a fair comparison, we normalize the raw episode rewards. Specifically, the processed episode reward, called the united episode reward, is defined as

R̄(m) ≜ R(m) / min_{r=1,2,...} {|R(r)|}, (92)

where R(m) denotes the raw episode reward at episode index m. We consider two models, PKN-15 and PKN-20, whose network configurations are summarized in Table 2. Each model is trained with three seeds. For all the training of DRL and Phy-DRL, we set the maximum number of steps of an episode to 10200, while an episode terminates early if the CoM height drops below 0.12 m (the robot falls). The two models' episode numbers, averaged over the 10⁶ training steps and the three random seeds, are presented in Table 3. A smaller average value means a longer successful running time of the robot, i.e., fewer falls. Meanwhile, the trajectories of the processed episode rewards are shown in Figure 18; observing them together with Table 3, we find that our proposed reward leads to much more stable and safe training. The root reason is that our reward decouples the invariant and the unknown and only lets the data-driven DRL learn the unknown defined in Equation (89).

Figure 18: Trajectories of the united episode reward (smoothing rate 0.15, 3 random seeds): our reward versus the CLF reward, shown as means with 95% confidence intervals, for PKN-15 and PKN-20.

L.8 TRAINING

For the code, we use the Python API of the TensorFlow framework Abadi et al. and the Adam optimizer Kingma & Ba for training. This project uses the following settings: 1) Ubuntu 22.04, 2) Python 3.7, 3) TensorFlow 2.5.0, 4) Numpy 1.19.5, and 5) PyBullet. For Phy-DRL, the observation of the policy is a 12-dimensional tracking-error vector between the robot's state vector and the mission vector. The agent's actions offset the desired positional and rotational accelerations generated by the model-based policy. The computed accelerations are then converted to low-level motor torque commands. The policy is trained using the DDPG algorithm Lillicrap et al. (2016). The actor and critic networks are implemented as multi-layer perceptrons (MLPs) with four fully connected layers. The output dimensions of the critic network's layers are 256, 128, 64, and 1, and those of the actor network's layers are 256, 128, 64, and 6. The input of the critic network is the tracking-error vector together with the action vector; the input of the actor network is the tracking-error vector. The activation functions of the first three layers are ReLU, while the output layer uses the Tanh function for the actor network and a linear activation for the critic network. We let the discount factor γ = 0.2, and the learning rates of the critic and actor networks are both 0.0003. We set the batch size to 300.
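A minimal sketch of how the residual action described above can be assembled: the learned offset is added to the model-based command F s(k) before the result is mapped to motor torques; F, the actor stub, and the tracking error below are placeholders, not the project's trained components.

```python
import numpy as np

# Residual action policy for the robot: a(k) = a_phy(k) + a_drl(k).
obs_dim, act_dim = 12, 6
F = np.zeros((act_dim, obs_dim))                 # model-based gain from Appendix L.3 (placeholder)

def actor(s_err):                                # stands in for the trained DDPG actor
    return np.zeros(act_dim)

def residual_action(s_err):
    """Model-based acceleration command plus the learned acceleration offset."""
    return F @ s_err + actor(s_err)

print(residual_action(np.zeros(obs_dim)))        # the result is then converted to motor torques
```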
The maximum number of steps of one episode is 10200. Each weight matrix is initialized randomly from a truncated normal distribution with zero mean, discarding and re-drawing any samples more than two standard deviations from the mean. Each bias is initialized according to a normal distribution with zero mean.

L.9 LINKS: DEMONSTRATION VIDEOS

Environment i): velocity command r_x = 1 m/s, snow road. A demonstration video is available at https://www.youtube.com/watch?v=tspPMbZwfig&t=1s.
Environment ii): velocity command r_x = 0.5 m/s, snow road. A demonstration video is available at https://www.youtube.com/watch?v=BK8k92jahfI&t=21s.
Environment iii): velocity command r_x = −1.4 m/s, wet road. A demonstration video is available at https://www.youtube.com/shorts/gbC-CwqGj78.
Environment iv): velocity command r_x = −0.4 m/s, wet road. A demonstration video is available at https://www.youtube.com/shorts/UwQYRveLJUs.