# Online Control-Informed Learning

Published in Transactions on Machine Learning Research (03/2025)

Zihao Liang (liang331@purdue.edu), School of Aeronautics and Astronautics, Purdue University
Tianyu Zhou (zhou1043@purdue.edu), School of Aeronautics and Astronautics, Purdue University
Zehui Lu (lu846@purdue.edu), School of Aeronautics and Astronautics, Purdue University
Shaoshuai Mou (mous@purdue.edu), School of Aeronautics and Astronautics, Purdue University

Reviewed on OpenReview: https://openreview.net/forum?id=LDzvZEVl5H

This paper proposes an Online Control-Informed Learning (OCIL) framework, which employs the well-established optimal control and state estimation techniques in the field of control to solve a broad class of learning tasks in an online fashion. This novel integration effectively handles practical issues in machine learning such as noisy measurement data, online learning, and data efficiency. By considering any robot as a tunable optimal control system, we propose an online parameter estimator based on extended Kalman filter (EKF) to incrementally tune the system in an online fashion, enabling it to complete designated learning or control tasks. The proposed method also improves the robustness in learning by effectively managing noise in the data. Theoretical analysis is provided to demonstrate the convergence of OCIL. Three learning modes of OCIL, i.e. Online Imitation Learning, Online System Identification, and Policy Tuning On-the-fly, are investigated via experiments, which validate their effectiveness.

1 Introduction

Informed Machine Learning (IML) (Von Rueden et al., 2021) represents an emerging approach that integrates prior knowledge into the machine learning (ML) process.
While classic classification tasks in unsupervised, semi-supervised, or supervised ML primarily focus on extracting patterns from labeled or unlabeled data (LeCun et al., 2015), IML leverages prior knowledge such as physical laws, expert knowledge, or existing models to uncover underlying connections within data (Karniadakis et al., 2021). This integration enables models to produce more reliable and interpretable predictions, especially when dealing with noisy data. The approach is particularly advantageous in domains where theoretical understanding is well established and can therefore guide ML. One notable example of IML is physics-informed machine learning (Wu et al., 2018; Karniadakis et al., 2021; Kashinath et al., 2021), which is particularly valuable for solving partial differential equations in computational fluid dynamics.

Control-informed learning (CIL) is a subset of IML tailored for system control, autonomy, and robotics. This approach merges standard control theory with ML techniques to enhance the capabilities of autonomous systems. The integration leverages the complementary strengths of control and learning. Control theory provides model structures and optimization guidance that enable efficient and reliable algorithms for handling complex tasks. Meanwhile, ML improves these models by learning from data, a capacity that some conventional control methods lack (Jin et al., 2020; 2021b).

This paper aims to tackle learning tasks in autonomous systems that are governed by optimal control (OC) systems. An optimal control system usually consists of dynamics, a control policy, and an objective function. From a unified perspective, learning these components can be understood as addressing a common problem with unknown parameters in different parts of the system and using different loss functions.
For example, in learning dynamics, the task involves parameterizing a differential equation, with the loss function measuring the prediction error between the model's output and the target data. In learning policies, the unknown parameters are within the feedback policy and the control objective itself serves as a loss function. When learning control objective functions, the objective is parameterized, and the loss measures the discrepancy between the reproduced trajectory and observed demonstrations.

To tackle these problems, many works in the field of so-called Learning for Dynamics and Control aim to leverage the integration of learning and control but often treat them as separate or sequential tasks. For example, control theories are used for algorithm development and convergence analysis of online unconstrained or constrained optimizations (Casti et al., 2023; Bastianello et al., 2024; Lu et al., 2024); model-based reinforcement learning (Heess et al., 2015; Gu et al., 2016) improves sample efficiency by using dynamics models; Koopman-operator control (Proctor et al., 2018; Abraham & Murphey, 2019; Hao et al., 2024) employs learning to transform nonlinear systems into a linear observable space, simplifying control design. In contrast, CIL integrates these processes, allowing learning algorithms to incorporate control insights directly. The integration enables ML and control techniques to run simultaneously, reducing computational complexity and improving practical applicability.

CIL differentiates itself by utilizing Pontryagin's maximum principle, a foundational concept in OC theory. This principle defines the optimality conditions for the state and input trajectories of an OC system. CIL employs these conditions to provide gradients for machine learning (Jin et al., 2020; 2021b; Böttcher et al., 2022).
CIL integrates these gradients directly into its learning process, ensuring that machine learning outcomes are efficient while remaining consistent with established control theories and physical models. This approach enhances both the reliability and accuracy of the results.

1.1 Related Work

This section presents existing research on learning various components of an autonomous system and explores related learning frameworks that address these problems from a unified perspective.

Learning dynamics. To learn a nonlinear system with possibly noisy measurement, Markov decision process-based methods are widely used, such as linear regression (Haruno et al., 2001), observation-transition modeling (Finn et al., 2016), latent space modeling (Watter et al., 2015), (deep) neural networks (NN) (Li et al., 2018; Li & Hao, 2018; Han et al., 2019; Zhang et al., 2019; Benning et al., 2019; Liu & Markowich, 2020; Beintema et al., 2023; Pillonetto et al., 2025), Gaussian processes (Deisenroth & Rasmussen, 2011), and transition graphs (Zhang et al., 2018). Despite their widespread use, these methods often must balance data efficiency with prediction accuracy. To improve both metrics, physics-informed learning approaches (Lutter et al., 2019; Xu et al., 2020; Saemundsson et al., 2020; Sharma et al., 2023) incorporate physical laws into learning models. Koopman operator theory offers a method for lifting states to an infinite-dimensional linear observable space (Mauroy et al., 2020; Liang et al., 2023; Hao et al., 2023; Liu et al., 2024).

Learning objective functions. Objective learning is typically referred to as inverse reinforcement learning (IRL) in the ML community and inverse optimal control (IOC) in the system control community. These methods aim to deduce a control objective function from observed optimal demonstrations (Brown et al., 2019). The objective function is generally represented as a weighted sum of features (Abbeel & Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Arora & Doshi, 2021). Approaches to find these unknown weights include feature matching (Abbeel & Ng, 2004), maximum entropy (Ziebart et al., 2008), maximum margin (Ratliff et al., 2006), and approximate variational reward imitation learning (Chan & van der Schaar, 2021). As for learning nonlinear parameter mappings of objective functions, prior and system-dependent knowledge is required to further extend the methods above. On the other hand, given the system dynamics, IOC aims for efficient learning approaches (Mombaur et al., 2010). For example, some methods (Keshavarz et al., 2011; Jin et al., 2019; 2021a; Jin & Mou, 2021; Liang et al., 2022; 2023) directly calculate unknown weights by minimizing the violation of optimality conditions by the observed demonstration data, which avoids repeatedly solving OC problems.

Learning control policies. Policy learning is generally termed reinforcement learning (RL) in the ML community and OC in the control community. In the RL community, there are mainly two streams of research, namely model-free and model-based RL. Model-free RL learns policies by directly interacting with the environment, without using a model of it (Mnih et al., 2013; 2015; Oh et al., 2016). To improve data efficiency, model-based RL learns a dynamics model before policy learning (Schneider, 1997; Abbeel et al., 2006; Deisenroth & Rasmussen, 2011; Levine & Abbeel, 2014; Gu et al., 2016).
For OC, the first strategy is based on dynamic programming. Examples include the linear quadratic regulator (LQR) (Scokaert & Rawlings, 1998), which solves the OC problem with linear dynamics and quadratic cost; the linear quadratic Gaussian (Athans, 1971), which combines LQR with a Kalman filter to solve the OC problem for a linear system affected by Gaussian noise; the iterative linear quadratic regulator (iLQR) (Li & Todorov, 2004), which linearizes the dynamics and quadratizes the value function; and differential dynamic programming, which quadratizes both the dynamics and the value function. Another strategy relies on Pontryagin's maximum/minimum principle (PMP) (Pontryagin, 2018), such as shooting methods (Bock & Plitt, 1984) and collocation methods (Patterson & Rao, 2014). These open-loop methods are further improved by closed-loop methods such as model predictive control (MPC) (Schwenzer et al., 2021), which repeatedly solves an OC problem over a finite horizon to generate control inputs. Recently, Jin et al. (2020) proposed a framework for learning an optimal policy based on differentiating Pontryagin's Maximum Principle.

Many research studies also focus on incremental policy tuning. One of the most popular tracks is transfer learning, which exploits the generalization of existing knowledge so that it can be transferred across different domains (Taylor & Stone, 2009). Recently, transfer learning has been implemented to speed up the learning process in RL (Taylor & Stone, 2009). Another popular method is behavior cloning (Torabi et al., 2018; Czarnecki et al., 2019; Sasaki & Yamashina, 2021). In the control community, tuning OC systems initially refers to neighboring extremal optimal control (NEOC) (Bryson, 1975; Ghaemi et al., 2009).
Other popular methods include adaptive control (Ioannou & Sun, 2012; Bertsekas, 2022; Luo et al., 2023; Guo & Pan, 2023), which adjusts its parameters in real time to maintain optimal performance even in the presence of uncertainties or changes in system dynamics, and Bayesian optimization for controller tuning (Khosravi et al., 2021; Sorourifar et al., 2021; Berkenkamp et al., 2023).

To sum up, there are numerous existing methods focused on individual tasks. These approaches are effective when only one component of the system is unknown. However, in many real-world scenarios, multiple components may be unavailable or uncertain simultaneously. For instance, in autonomous driving, the dynamics of the vehicle may be unknown due to changes in road conditions or vehicle wear and tear. Simultaneously, the control policy may also be unavailable due to a lack of predefined rules or data. In such cases, existing methods often fall short, as they are not designed to handle the joint learning of multiple interdependent components, limiting their applicability in more complex or incomplete systems.

Unified learning frameworks. Several studies have explored unified learning frameworks to tackle learning challenges in autonomous systems. These approaches integrate an implicit planner directly within the policy (Okada et al., 2017; Pereira et al., 2018; Amos et al., 2018; Srinivas et al., 2018). The main challenge in these methods is learning the OC system, which is very similar to the goal of this work. Okada et al. (2017) and Pereira et al. (2018) learn a path-integral OC system (Kappen, 2005), which is a special class of OC systems. Srinivas et al. (2018) learn an OC system in a latent space. These methods rely on an unrolling strategy to make differentiation easier. Essentially, they treat solving an OC problem as an unrolled computational graph created by applying gradient descent repeatedly.
This allows automatic differentiation tools (Abadi et al., 2016) to be used directly. This approach faces two challenges: (i) it requires storing all intermediate steps, making it memory-intensive, and (ii) the accuracy of the gradients depends on how many steps are included in the graph, leading to a trade-off between computational cost and accuracy. To tackle these issues, Amos et al. (2018) proposed a differentiable MPC framework. In the forward pass, it uses an LQR approximation of the OC system, and in the backward pass, gradients are computed by differentiating this LQR approximation. This framework has a major drawback: differentiating LQR requires solving a large linear equation, involving the inversion of a matrix whose size is proportional to the time horizon, which makes it very costly for long-horizon systems.

To address the challenges of the frameworks mentioned above, Jin et al. (2020) proposed Pontryagin's differential programming (PDP). PDP avoids unrolled computational graphs by storing only the resulting trajectory, regardless of how it was solved. Instead of relying on intermediate LQR approximations, it directly differentiates through Pontryagin's Maximum Principle (PMP) to obtain exact gradients. Furthermore, its backward pass uses an auxiliary control system to obtain the gradient, reducing memory and computational complexity. However, PDP has two main limitations. First, it lacks the ability for online learning, as it relies on gradient descent to update unknown parameters in the OC system, requiring significant computation time to reach convergence. This drawback is particularly problematic in applications like autonomous driving, where quick adaptation to new scenarios is essential for safety and performance. Second, PDP does not account for noisy measurement data, limiting its effectiveness in real-world situations where sensor data is often unreliable or noisy.
1.2 Contributions

This paper introduces an online learning framework called Online Control-Informed Learning (OCIL). This framework is designed to be data efficient for various learning and control tasks while providing robustness against noisy data. In this paper, we consider an autonomous system as an OC system, which is parameterized by tunable parameters within different components of the system, including dynamics, policy, and objective function. By tuning the OC system in an online fashion, the proposed OCIL tackles three learning tasks in robotics, namely Online Imitation Learning, Online System Identification, and Policy Tuning On-the-fly.

The proposed OCIL consists of two main components, both of which are inspired by control theory. Specifically, the framework first proposes an online parameter estimator based on the classic online state estimation techniques in control theory. The estimator continually updates the parameter estimates in an online fashion as new data becomes available, aiming to minimize a cumulative loss defined for a specific task. To do so, the gradient of the loss with respect to the tunable parameter is required. Therefore, OCIL employs a gradient generator (GG) based on Pontryagin Differential Programming in OC theory to calculate the exact gradient.

Notations. $\|\cdot\|$ denotes the Euclidean norm. Given a matrix $A \in \mathbb{R}^{n \times m}$, let $A^\top$ denote its transpose. For positive integers $n$ and $m$, let $I_n$ be the $n \times n$ identity matrix; $0_n \in \mathbb{R}^n$ denotes a vector with all entries 0; $0_{n \times m}$ denotes an $n \times m$ matrix with all entries 0. Let $\mathrm{col}\{v_1, \ldots, v_a\}$ denote a column stack of elements $v_1, \ldots, v_a$, which may be scalars, vectors or matrices, i.e. $\mathrm{col}\{v_1, \ldots, v_a\} \triangleq [v_1^\top \ \cdots \ v_a^\top]^\top$.

2 Problem Formulation

Consider the following class of OC systems $\Sigma(\theta^*)$, where $\theta^* \in \mathbb{R}^p$ denotes the unknown and constant parameter.
The behavior of $\Sigma(\theta^*)$ is determined by minimizing a control objective function:

$$\{x_{1:T}(\theta^*), u_{0:T-1}(\theta^*)\} = \arg\min_{x_{1:T},\, u_{0:T-1}} J(x_{0:T}, u_{0:T-1}, \theta^*) = \sum_{t=0}^{T-1} c(x_t, u_t, \theta^*) + h(x_T, \theta^*) \tag{1a}$$

$$\text{s.t.} \quad x_{t+1} = f(x_t, u_t, \theta^*), \quad \text{with } x_0 \text{ given}, \tag{1b}$$

where $t = 0, 1, 2, \ldots, T$ is the time index with $T$ being the final time; $x_t \in \mathbb{R}^n$ and $u_t \in \mathbb{R}^m$ denote the system state and control input, respectively; $x_{0:T}(\theta^*) \triangleq \mathrm{col}\{x_0(\theta^*), \ldots, x_T(\theta^*)\}$ and $u_{0:T-1}(\theta^*) \triangleq \mathrm{col}\{u_0(\theta^*), \ldots, u_{T-1}(\theta^*)\}$ denote the state and input trajectories given parameter $\theta^*$, respectively; $x_t^*(\theta^*)$ and $u_t^*(\theta^*)$ denote the state and input given $\theta^*$ at time $t$, respectively; $f : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}^n$ denotes the twice-differentiable, time-invariant system dynamics; $c : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ and $h : \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}$ denote the running cost and the final cost, respectively, both of which are assumed to be twice-differentiable.

Remark 1. Including the parameter $\theta^*$ in the system dynamics allows for the representation of both partially known and completely unknown dynamics. Partially known dynamics are parameterized via a known physical dynamic model with unknown physical parameters. For example, this could be a quadrotor dynamics model with known structure and unknown inertia and mass (Wang et al., 2014; Jin et al., 2020; Revach et al., 2022). Completely unknown dynamics are parameterized by neural networks. In this case, the neural network captures the evolution of the state, and the parameter $\theta^*$ represents the weights and biases of the neural network (Kumpati et al., 1990; Lewis et al., 1998; Nelles & Nelles, 2020).

For notation simplicity, we define the unknown trajectory of the optimal control system $\Sigma(\theta^*)$ as

$$\xi(\theta^*) \triangleq \mathrm{col}\{x_{0:T}(\theta^*), u_{0:T-1}(\theta^*)\} \in \mathbb{R}^{(T+1)n + Tm}. \tag{2}$$

The goal of this paper is to estimate $\theta^*$. Define $\hat{\theta} \in \mathbb{R}^p$ as an arbitrary estimate of $\theta^*$.
Then for estimation purposes, a copy, $\Sigma(\hat{\theta})$, of the autonomous system $\Sigma(\theta^*)$ can be proposed by replacing $\theta^*$ with $\hat{\theta}$ in (1), i.e.,

$$\{x_{1:T}(\hat{\theta}), u_{0:T-1}(\hat{\theta})\} = \arg\min_{x_{1:T},\, u_{0:T-1}} J(x_{0:T}, u_{0:T-1}, \hat{\theta}) = \sum_{t=0}^{T-1} c(x_t, u_t, \hat{\theta}) + h(x_T, \hat{\theta}) \tag{3a}$$

$$\text{s.t.} \quad x_{t+1} = f(x_t, u_t, \hat{\theta}), \quad \text{with } x_0 \text{ given}. \tag{3b}$$

At each time $t$, a noisy measurement $O_t \in \mathbb{R}^r$ is observed, where

$$O_t = h(\xi_t(\theta^*)) + v_t. \tag{4}$$

Here, $h : \mathbb{R}^{n+m} \to \mathbb{R}^r$ denotes a twice-differentiable measurement function; $\xi_t(\theta^*) = \{x_t^*(\theta^*), u_t^*(\theta^*)\}$; $v_t \sim \mathcal{N}(0_r, R_t)$ denotes the measurement noise, which is multivariate Gaussian with covariance matrix $R_t \in \mathbb{R}^{r \times r}$. With the measurement equation (4) defined, this paper considers a signed residual function:

$$l(\xi_t(\hat{\theta}), O_t) = O_t - h(\xi_t(\hat{\theta})) \in \mathbb{R}^r. \tag{5}$$

Then, the performance of the entire trajectory can be evaluated by a cumulative loss, which is assumed to be twice-differentiable:

$$L(\xi(\hat{\theta})) = \sum_{t=0}^{T} \| l(\xi_t(\hat{\theta}), O_t) \|^2. \tag{6}$$

The problem of interest is to develop an online method to update the estimate $\hat{\theta}_t \in \mathbb{R}^p$ of $\theta^*$ at every time $t$, such that its trajectory $\xi(\hat{\theta}_t)$ from (3) minimizes a task-specific cumulative loss $L(\xi(\hat{\theta}))$. To achieve a specific learning or control task, one needs to select the most suitable measurement $O_t$. Below, we present three modes of the proposed OCIL framework. It is worth noting that in different applications, the configuration of system $\Sigma(\hat{\theta})$ must be adjusted according to the task.

Online Sys ID: For a Sys ID problem, the goal is to identify the dynamics model of a physical system from the state-input trajectory $\xi^o = \{x^o_{0:T}, u_{0:T-1}\}$, where the superscript $o$ denotes the observed trajectory. The trajectory is often generated by persistent excitation of the system without considering any control objectives (Keesman, 2011). Therefore, we can set $J(x_{0:T}, u_{0:T-1}, \hat{\theta}) = 0$:

$$\Sigma(\hat{\theta}): \quad \begin{aligned} &\text{dynamics:} && x_{t+1} = f(x_t, u_t, \hat{\theta}), \ \text{with } x_0 \text{ given}, \\ &\text{objective:} && J(x_{0:T}, u_{0:T-1}, \hat{\theta}) = 0. \end{aligned} \tag{7}$$

To identify the model dynamics, namely finding the $\theta^*$ in the dynamics $f(x_t, u_t, \theta^*)$, one could design the signed residual function to represent the discrepancy between the observed trajectory and the trajectory produced by $\hat{\theta}$, i.e. $l(\xi_t(\hat{\theta}), \xi^o_t) = \xi^o_t - \xi_t(\hat{\theta})$, where $\xi^o_t$ is a slice of $\xi^o$ at time $t$. In the Sys ID mode, the measurement $O_t$ received at time $t$ is a slice $\xi^o_t$ of the trajectory of a physical system.

Online Imitation Learning: The objective function and the model dynamics are parameterized by an unknown $\theta^*$. The OC system follows (3). Suppose one can observe the measurement $y_t^*$ of the expert demonstration at each time $t$. Then, the signed residual function can be designed as $l(\xi_t(\hat{\theta}), y_t^*) = y_t^* - g(x_t(\hat{\theta}), u_t(\hat{\theta}))$. In this case, the measurement $O_t$ received at time $t$ is the expert demonstration $y_t^*$. The optimal demonstration can be continuous or sparse, depending on the practical application scenario.

Tuning Policy On-the-fly: For an autonomous system, one would like to obtain a control policy such that the trajectory minimizes a certain task loss. This mode considers a feedback controller parameterized by $\hat{\theta}$, i.e. $u_t = \mu(x_t, \hat{\theta})$. Then the OC system is written as follows:

$$\Sigma(\hat{\theta}): \quad \begin{aligned} &\text{dynamics:} && x_{t+1} = f(x_t, \mu(x_t, \hat{\theta})), \ \text{with } x_0 \text{ given}, \\ &\text{objective:} && J(x_{0:T}, u_{0:T-1}, \hat{\theta}) = \sum_{t=0}^{T-1} c(x_t, \mu(x_t, \hat{\theta})) + h(x_T). \end{aligned} \tag{8}$$

Then we can design the signed residual function such that it represents trajectory tracking. For instance, the signed residual function could be $l(\xi_t(\hat{\theta}), \xi^d_t) = \xi^d_t - \xi_t(\hat{\theta})$, where $\xi^d_t$ is a slice of the desired trajectory to track at time $t$.

3 Main Results

The proposed OCIL consists of two main components, both of which are inspired by control theory. Specifically, OCIL first proposes an online parameter estimator based on the extended Kalman filter (EKF). Going forward, we will show the challenge of obtaining the Kalman gain.
To tackle this challenge, the gradient of the loss with respect to the tunable parameter is required. Therefore, OCIL employs a gradient generator (GG) based on Pontryagin Differential Programming to calculate the exact gradient. Then the proposed OCIL framework will be introduced and supported with theoretical analysis.

3.1 Online Parameter Estimator

To minimize the cumulative task loss $L(\xi(\hat{\theta}))$ with measurement $O_t$, which is unavailable until time $t$, the optimization problem that needs to be solved in an online fashion is:

$$\min_{\hat{\theta}} \ \sum_{t=0}^{T} \| l(\xi_t(\hat{\theta}), O_t) \|^2 \quad \text{subject to } \xi(\hat{\theta}) \text{ is the trajectory of (3)}. \tag{9}$$

The optimization problem (9) is essentially a least squares problem, although under constraints. One of the best-known methods for solving least squares problems incrementally is the EKF (Bertsekas, 1996; Ribeiro, 2004). The EKF was proposed to incrementally estimate the state of a system using the measured output available at each time step. In our problem setting, instead of estimating the state of a system, our goal is to estimate the parameter $\theta^*$ by utilizing the measurement $O_t$ that is available at each time $t$. Therefore, by considering the parameter $\theta^*$ as the state to be estimated, one can introduce a new dynamical system:

$$\begin{aligned} &\text{dynamics:} && \theta_{t+1} = \theta_t, \ \text{with } \theta_0 = \theta^*, \\ &\text{measurement:} && O_t = h(\xi_t(\theta_t)) + v_t. \end{aligned} \tag{10}$$

The online estimation of $\theta^*$ via EKF can be done as follows (Ribeiro, 2004):

$$\hat{\theta}_t^- := \hat{\theta}_{t-1}, \qquad P_t^- := P_{t-1}, \tag{11a}$$

$$K_t := P_t^- L_t^\top (L_t P_t^- L_t^\top + R_t)^{-1}, \qquad P_t := (I_p - K_t L_t) P_t^-, \qquad \hat{\theta}_t := \hat{\theta}_t^- + K_t \big(O_t - h(\xi_t(\hat{\theta}_t^-))\big), \tag{11b}$$

$$L_t \triangleq \left. \frac{d\, l(\xi_t(\theta_t), O_t)}{d \theta_t} \right|_{\theta_t = \hat{\theta}_t^-} \in \mathbb{R}^{r \times p}, \tag{12}$$

where (11a) predicts the dynamics and (11b) updates the parameter estimate. Here, the superscript $-$ means the term is not yet updated by the measurement residual; $P_t \in \mathbb{R}^{p \times p}$ is a positive-definite matrix that denotes the covariance of the estimate; $K_t \in \mathbb{R}^{p \times r}$ denotes the Kalman gain. Throughout the estimation process, all of the terms are known except $L_t$.
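As a rough illustration of the predict/update cycle in (11), the sketch below implements one estimator step in NumPy. The function name `ekf_parameter_step` and the argument `H_jac` are hypothetical and do not come from the paper. The sketch follows the standard EKF convention, in which the Jacobian is that of the predicted measurement $h(\xi_t(\theta))$ with respect to $\theta$. Since the residual is $l = O_t - h$, this Jacobian equals $-dl/d\theta$, so the sign convention should be checked against (12) when adapting the code. The usage lines use a noise-free linear measurement, which is an assumption made only to keep the example easy to check.

```python
import numpy as np

def ekf_parameter_step(theta_prev, P_prev, O_t, h_pred, H_jac, R_t):
    """One predict/update cycle for a constant parameter (theta_{t+1} = theta_t).

    theta_prev : (p,)   previous estimate
    P_prev     : (p, p) previous covariance
    O_t        : (r,)   measurement received at time t
    h_pred     : (r,)   predicted measurement h(xi_t(theta_prev))
    H_jac      : (r, p) Jacobian of the predicted measurement w.r.t. theta
                 (assumption: standard EKF convention, i.e. -dl/dtheta)
    R_t        : (r, r) measurement-noise covariance
    """
    # Prediction step: the parameter is modeled as constant.
    theta_minus, P_minus = theta_prev, P_prev
    # Kalman gain K = P^- H^T (H P^- H^T + R)^{-1}.
    S = H_jac @ P_minus @ H_jac.T + R_t
    K = P_minus @ H_jac.T @ np.linalg.inv(S)
    # Measurement update of the covariance and the estimate.
    P_new = (np.eye(theta_prev.size) - K @ H_jac) @ P_minus
    theta_new = theta_minus + K @ (O_t - h_pred)
    return theta_new, P_new

# Usage: a noise-free linear measurement O = H theta* with a weak prior.
theta_true = np.array([2.0, -1.0])
H = np.array([[1.0, 2.0], [0.0, 1.0]])
theta0 = np.zeros(2)
theta1, P1 = ekf_parameter_step(
    theta0, 100.0 * np.eye(2), H @ theta_true, H @ theta0, H, 1e-8 * np.eye(2)
)
```

With a large prior covariance and a small noise covariance, a single update on an invertible linear measurement lands very close to the true parameter, which is a convenient sanity check before plugging in a nonlinear trajectory model.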
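Obtaining $L_t$ is challenging because the signed residual function $l(\xi_t(\hat{\theta}), O_t)$ is not an explicit function of $\theta$. In the next subsection, we will present a gradient generator which computes the exact value of $L_t$.

3.2 Gradient Generator

In this section, for notation simplicity, the parameter estimate $\hat{\theta}_t^-$ is simplified to $\theta$, and $\left. \frac{d\, l(\xi_t(\theta_t), O_t)}{d \theta_t} \right|_{\theta_t = \hat{\theta}_t^-}$ is written as $\frac{d\, l(\xi_t(\theta))}{d\theta}$. To obtain the gradient $\frac{d\, l(\xi_t(\theta))}{d\theta}$, one can employ the chain rule:

$$\frac{d\, l(\xi_t(\theta))}{d\theta} = \frac{\partial l(\xi_t(\theta))}{\partial \xi_t(\theta)} \frac{\partial \xi_t(\theta)}{\partial \theta}, \tag{13}$$

where $\frac{\partial l(\xi_t(\theta))}{\partial \xi_t(\theta)}$ is known since the signed residual function is pre-designed. The challenge that remains is to find the partial derivative $\frac{\partial \xi_t(\theta)}{\partial \theta}$, i.e. an analytical relation between the trajectory $\xi_t$ and the tunable parameter $\theta$. To tackle this challenge, the gradient generator in Jin et al. (2020) is used to obtain the exact value of $\frac{\partial \xi_t(\theta)}{\partial \theta}$.

Given the OC system (3), one can obtain the Hamiltonian

$$H_t = c(x_t, u_t, \theta) + f(x_t, u_t, \theta)^\top \lambda_{t+1} \tag{14}$$

for all $t = 0, \ldots, T-1$, where $\lambda_t \in \mathbb{R}^n$ denotes the Lagrangian multiplier associated with the equality constraint of the model dynamics. With the definition of $\xi(\theta)$, one has $\frac{\partial \xi(\theta)}{\partial \theta} = \mathrm{col}\left\{ \frac{\partial x_{1:T}(\theta)}{\partial \theta}, \frac{\partial u_{0:T-1}(\theta)}{\partial \theta} \right\}$. By defining

$$X_t \triangleq \frac{\partial x_t(\theta)}{\partial \theta} \in \mathbb{R}^{n \times p}, \qquad U_t \triangleq \frac{\partial u_t(\theta)}{\partial \theta} \in \mathbb{R}^{m \times p}, \tag{15}$$

one can utilize the following lemma from Jin et al. (2020) to obtain the partial derivatives $\frac{\partial \xi_t(\theta)}{\partial \theta}$.

Lemma 1 (Jin et al., 2020). Define the Jacobian and Hessian matrices related to $\xi(\theta)$ as:

$$\begin{aligned} &F_t = \frac{\partial f}{\partial x_t}, \quad G_t = \frac{\partial f}{\partial u_t}, \quad E_t = \frac{\partial f}{\partial \theta}, \quad H^{xx}_t = \frac{\partial^2 H_t}{\partial x_t \partial x_t}, \quad H^{xu}_t = \frac{\partial^2 H_t}{\partial x_t \partial u_t} = (H^{ux}_t)^\top, \\ &H^{uu}_t = \frac{\partial^2 H_t}{\partial u_t \partial u_t}, \quad H^{x\theta}_t = \frac{\partial^2 H_t}{\partial x_t \partial \theta}, \quad H^{u\theta}_t = \frac{\partial^2 H_t}{\partial u_t \partial \theta}, \quad H^{xx}_T = \frac{\partial^2 h}{\partial x_T \partial x_T}, \quad H^{x\theta}_T = \frac{\partial^2 h}{\partial x_T \partial \theta}. \end{aligned} \tag{16}$$

If $H^{uu}_t$ is invertible for all $t = 0, \ldots, T-1$, the following recursions from $t = T$ to $t = 0$ hold:

$$V_t = C_t + A_t^\top (I + V_{t+1} B_t)^{-1} V_{t+1} A_t, \qquad W_t = A_t^\top (I + V_{t+1} B_t)^{-1} (W_{t+1} + V_{t+1} M_t) + N_t, \tag{17}$$

with $V_T = H^{xx}_T$ and $W_T = H^{x\theta}_T$.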
Here, $A_t = F_t - G_t (H^{uu}_t)^{-1} H^{ux}_t$, $B_t = G_t (H^{uu}_t)^{-1} G_t^\top$, $M_t = E_t - G_t (H^{uu}_t)^{-1} H^{u\theta}_t$, $C_t = H^{xx}_t - H^{xu}_t (H^{uu}_t)^{-1} H^{ux}_t$, and $N_t = H^{x\theta}_t - H^{xu}_t (H^{uu}_t)^{-1} H^{u\theta}_t$ are all known given (16). Then, the partial derivative $\frac{\partial \xi(\theta)}{\partial \theta}$ can be obtained by recursively solving the following equations from $t = 0$ to $T-1$ with $X_0(\theta) = 0$:

$$\begin{aligned} U_t &= -(H^{uu}_t)^{-1} \Big( H^{ux}_t X_t + H^{u\theta}_t + G_t^\top (I + V_{t+1} B_t)^{-1} \big( V_{t+1} A_t X_t + V_{t+1} M_t + W_{t+1} \big) \Big), \\ X_{t+1} &= F_t X_t + G_t U_t + E_t. \end{aligned} \tag{18}$$

The terms in (16) are based on the trajectory $\xi(\theta)$ and the associated Lagrangian multiplier $\lambda_{0:T-1}$. According to the discrete-time Pontryagin Maximum Principle (Jin et al., 2020), the trajectory of the Lagrangian multiplier can be obtained by

$$\lambda_T = \frac{\partial h}{\partial x_T}, \qquad \lambda_t = \frac{\partial H_t}{\partial x_t} = \frac{\partial c}{\partial x_t} + F_t^\top \lambda_{t+1}, \quad \text{for } t = T-1, \ldots, 1. \tag{19}$$

Remark 2. Lemma 1 proposes a recursive way to obtain the exact gradient of the trajectory $\xi(\theta)$ with respect to the parameter $\theta$, i.e. $\frac{\partial \xi(\theta)}{\partial \theta}$.

3.3 OCIL Framework

With the online parameter estimator and the gradient generator, we propose the Online Control-Informed Learning framework in Fig. 1. The framework is summarized in Algorithm 2.

Figure 1: Framework of Online Control-Informed Learning.

As shown in Fig. 1, at each time step, the predefined OC system $\Sigma(\hat{\theta}_t)$ generates a system trajectory $\xi(\hat{\theta}_t)$ by performing optimal control with given $x_0$ and $\hat{\theta}_t$. The trajectory $\xi(\hat{\theta}_t)$ is then fed into the signed residual function $l(\xi_t(\hat{\theta}_t), O_t)$ and the gradient generator. Along with the information $O_t$ obtained at time $t$, the signed residual function generates $\frac{\partial l(\xi_t(\hat{\theta}_t^-), O_t)}{\partial \xi_t(\hat{\theta}_t^-)}$, while $\frac{\partial \xi_t(\hat{\theta}_t^-)}{\partial \hat{\theta}_t^-}$ is generated by the gradient generator in Algorithm 1. The chain rule is then applied to obtain the Jacobian matrix $L_t$, which is passed into the online parameter estimator for the estimation of $\theta^*$.
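The backward and forward sweeps (17)-(18) used by the gradient generator can be sketched in NumPy as follows. This is an illustrative reading of Lemma 1 under stated assumptions, not the authors' implementation: the function name `gradient_generator` and the list-of-matrices argument layout are hypothetical, and the coefficient matrices in (16) are assumed to be already evaluated along the trajectory. The usage lines check the sketch on a hand-solvable scalar problem (chosen here for illustration) with $x_{t+1} = x_t + u_t$, running cost $\frac{1}{2}u_t^2$, final cost $\frac{1}{2}(x_T - \theta)^2$ and $T = 2$, whose optimal inputs are $u_0 = u_1 = (\theta - x_0)/3$.

```python
import numpy as np

def gradient_generator(F, G, E, Hxx, Hxu, Huu, Hxth, Huth, HxxT, HxthT):
    """Return X[t] = dx_t/dtheta (t = 0..T) and U[t] = du_t/dtheta (t = 0..T-1).

    F, G, E, Hxx, Hxu, Huu, Hxth, Huth are length-T lists holding the matrices
    of (16) at each time step; HxxT and HxthT are the terminal blocks.
    """
    T = len(F)
    n, p = HxthT.shape
    I = np.eye(n)
    A, B, M, C, N = [], [], [], [], []
    for t in range(T):
        Huu_inv = np.linalg.inv(Huu[t])
        Hux = Hxu[t].T
        A.append(F[t] - G[t] @ Huu_inv @ Hux)
        B.append(G[t] @ Huu_inv @ G[t].T)
        M.append(E[t] - G[t] @ Huu_inv @ Huth[t])
        C.append(Hxx[t] - Hxu[t] @ Huu_inv @ Hux)
        N.append(Hxth[t] - Hxu[t] @ Huu_inv @ Huth[t])

    # Backward sweep (17), starting from the terminal conditions.
    V = [None] * (T + 1)
    W = [None] * (T + 1)
    V[T], W[T] = HxxT, HxthT
    for t in range(T - 1, -1, -1):
        Z = np.linalg.inv(I + V[t + 1] @ B[t])
        V[t] = C[t] + A[t].T @ Z @ V[t + 1] @ A[t]
        W[t] = A[t].T @ Z @ (W[t + 1] + V[t + 1] @ M[t]) + N[t]

    # Forward sweep (18), starting from X_0 = 0 because x_0 is given.
    X, U = [np.zeros((n, p))], []
    for t in range(T):
        Z = np.linalg.inv(I + V[t + 1] @ B[t])
        inner = V[t + 1] @ A[t] @ X[t] + V[t + 1] @ M[t] + W[t + 1]
        U.append(-np.linalg.inv(Huu[t])
                 @ (Hxu[t].T @ X[t] + Huth[t] + G[t].T @ Z @ inner))
        X.append(F[t] @ X[t] + G[t] @ U[t] + E[t])
    return X, U

# Usage: scalar example with T = 2; theta enters only the final cost.
one, zero = np.array([[1.0]]), np.array([[0.0]])
T = 2
X, U = gradient_generator(
    F=[one] * T, G=[one] * T, E=[zero] * T,
    Hxx=[zero] * T, Hxu=[zero] * T, Huu=[one] * T,
    Hxth=[zero] * T, Huth=[zero] * T,
    HxxT=one, HxthT=-one,
)
```

For this example the closed-form sensitivities are $du_t/d\theta = 1/3$ and $dx_t/d\theta = t/3$, which the two sweeps reproduce without differentiating through a solver.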
Algorithm 1: Gradient Generator (GG)

Input: Trajectory $\xi(\hat{\theta}_t^-)$ from $\Sigma(\hat{\theta}_t^-)$

1. Compute the coefficient matrices in (16).
2. Set $V_T = H^{xx}_T$ and $W_T = H^{x\theta}_T$.
3. For $t$ from $T$ to $0$: update $V_t$ and $W_t$ using (17).
4. Set $X_0(\hat{\theta}_t^-) = 0$.
5. For $t$ from $0$ to $T$: update $X_t(\hat{\theta}_t^-)$ and $U_t(\hat{\theta}_t^-)$ using (18).
6. Return $\frac{\partial \xi(\hat{\theta}_t^-)}{\partial \hat{\theta}_t^-} = \{X_{1:T}(\hat{\theta}_t^-), U_{0:T-1}(\hat{\theta}_t^-)\}$.

Algorithm 2: Online Control-Informed Learning

System and Residual: $\Sigma(\hat{\theta})$ and $l(\xi_t(\hat{\theta}), O_t)$

Initialize: $\hat{\theta}_0$, $P_0$

For $t = t_0, t_1, \ldots$:

1. Obtain new information $O_t$.
2. Solve $\xi(\hat{\theta}_t)$ from the current OC system $\Sigma(\hat{\theta}_t)$.
3. Obtain $\frac{\partial \xi_t(\hat{\theta}_t^-)}{\partial \hat{\theta}_t^-}$ with GG in Algorithm 1.
4. Obtain $\frac{\partial l(\xi_t(\hat{\theta}_t^-), O_t)}{\partial \xi_t(\hat{\theta}_t^-)}$ from $l(\xi_t(\hat{\theta}_t^-), O_t)$.
5. Obtain $L_t$ via the chain rule (13).
6. Update $\hat{\theta}_t$ using the estimator (11).

3.4 Convergence Analysis

This subsection presents the convergence analysis of the online parameter estimator. The analysis employs a candidate Lyapunov function and shows how the measurement covariance matrix $R_t$ affects the convergence of the cumulative loss $L(\xi(\hat{\theta}))$. In this section, for brevity, the signed residual function $l(\xi_t(\hat{\theta}), O_t)$ is written as $l(\xi_t(\hat{\theta}))$. Suppose that, for a specific task, the optimal cumulative loss is $L(\xi(\theta^*)) = 0$. Then, we define the estimation error as $\tilde{\theta}_t = \theta^* - \hat{\theta}_t$. Furthermore, we define

$$\text{Measurement error: } e_t = l(\xi_t(\theta^*)) - l(\xi_t(\hat{\theta}_t^-)), \qquad \text{Prediction error: } \tilde{\theta}_t^- = \theta^* - \hat{\theta}_t^-. \tag{20}$$

To perform the convergence analysis, a candidate Lyapunov function is employed:

$$V_t = \tilde{\theta}_t^\top P_t^{-1} \tilde{\theta}_t. \tag{21}$$

The goal here is to determine conditions under which the candidate Lyapunov function $\{V_t\}_{t=1,2,\ldots}$ is a decreasing sequence, i.e. $V_{t+1} - V_t \le 0, \ \forall t$. For rigorous analysis of the candidate Lyapunov function, as proposed in Boutayeb et al. (1997), unknown diagonal matrices $F_t \in \mathbb{R}^{r \times r}$ and $G_t \in \mathbb{R}^{p \times p}$ are introduced to model the measurement and prediction errors defined in (20):

$$F_t e_t = L_t \tilde{\theta}_t^-, \qquad \tilde{\theta}_t^- = G_t \tilde{\theta}_{t-1}. \tag{22}$$

To ensure convergence of the proposed estimator, the following assumptions need to be made.

Assumption 1.
The derivative $L_t = \frac{d\, l(\xi_t(\hat{\theta}_t^-))}{d \hat{\theta}_t^-}$ is of full rank for every $\hat{\theta}_t^-$.

Remark 3. The discrete-time dynamical system (10) satisfies the observability rank condition, i.e., for every $\hat{\theta}_t^-$, $\mathrm{rank}\left( \mathrm{col}\left\{ \frac{d\, l(\xi_t(\hat{\theta}_t^-))}{d \hat{\theta}_t^-}, \ \frac{d\, l(\xi_t(\hat{\theta}_t^-))}{d \hat{\theta}_t^-} I_p, \ \ldots, \ \frac{d\, l(\xi_t(\hat{\theta}_t^-))}{d \hat{\theta}_t^-} I_p^{p-1} \right\} \right) = p$ (Song & Grizzle, 1992). That means that if Assumption 1 is satisfied for every $\hat{\theta}_t^-$, the system (10) is observable for every $\hat{\theta}_t^-$. The observability condition ensures that $P_t$ is bounded from above and below (Song & Grizzle, 1992; Boutayeb & Aubry, 1999).

As is common in EKF analysis, we adopt the following assumption:

Assumption 2. $L_t$ is a uniformly bounded matrix.

The following lemma shows how the measurement covariance matrix $R_t$ affects the convergence of the tunable parameter. The proof can be found in Appendix A.

Lemma 2. Let Assumptions 1 and 2 hold. If the following inequalities are satisfied:

$$(F_t - I_r)^2 \le R_t (L_t P_t^- L_t^\top + R_t)^{-1}, \tag{23}$$

$$G_t^\top (P_t^-)^{-1} G_t - (P_t^-)^{-1} \le 0, \tag{24}$$

then the proposed estimator (11), when used as an observer for the system (10), ensures local asymptotic convergence, i.e. $\lim_{t \to \infty} \tilde{\theta}_t = 0$.

Remark 4. Lemma 2 provides sufficient conditions for the convergence of $\hat{\theta}_t$. As the diagonal matrices $F_t$ and $G_t$ are unknown, one can design the matrix $R_t$ to satisfy inequality (23). For example, one can set the matrix $R_t$ to be sufficiently large, i.e. much larger than $L_t P_t^- L_t^\top$, so that (23) is satisfied, which means the parameter estimator can tolerate an arbitrarily large initial prediction error. It is worth noting that as long as (23) and (24) are satisfied, $\hat{\theta}_t$ converges to $\theta^*$ and consequently $F_t$ and $G_t$ become identity matrices. In the case when there is no measurement noise, i.e. $R_t = 0_{r \times r}$, $F_t$ and $G_t$ can only be identity matrices to satisfy the inequalities (23) and (24), indicating the convergence of $\hat{\theta}_t$ to $\theta^*$.

Remark 5.
Inequalities (23) and (24) indicate one of the limitations of the estimator, namely its sensitivity to the initial guess. If the initial guess of $\theta^*$ results in $F_0$ and $G_0$ that do not satisfy (23) and (24), the value of the Lyapunov function (21) becomes larger, which leads to even larger $F_t$ and $G_t$, causing the estimation to diverge.

The following main theorem shows how the measurement covariance matrix $R_t$ affects the convergence of the cumulative loss $L(\xi(\hat{\theta}))$ by utilizing the inequalities introduced in Lemma 2. The proof can be found in Appendix B.

Theorem 1. Let Assumptions 1 and 2 hold. If the inequalities in Lemma 2 are met, then estimating $\theta^*$ with the proposed estimator (11) employing the gradient generator in (17)-(18) ensures local asymptotic convergence of the cumulative loss $L$ in (6) to 0, i.e. $\lim_{t \to \infty} L(\xi(\hat{\theta}_t)) = 0$.

4 Applications to Different Online Learning Modes and Experiments

This section demonstrates the capability of the proposed OCIL framework and its three modes through three applications: Online Imitation Learning, Online System Identification, and Learning Policy On-the-fly. This section includes a performance comparison with some state-of-the-art frameworks for three environments that are summarized in Table 1. Let $O_t^* = h(\xi_t(\theta^*))$ denote the measurement without noise. The noisy measurement follows a multivariate Gaussian distribution $\mathcal{N}(O_t^*, \sigma^2 I_r)$. To highlight the flexibility of OCIL, each experiment includes two phases: 1) an online phase, where OCIL keeps learning the unknown parameter while new data comes in before the final time $T$; 2) an offline phase, where OCIL keeps learning the parameter given the learned parameter at time $T$ and the entire trajectory obtained from time $t = 0$ to time $T$. For each environment and task, a terminal time $T \in \mathbb{Z}$ is defined to represent the desired time duration within which the system shall finish the task.
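The two-phase protocol can be illustrated with a toy scalar Sys ID problem. The sketch below is not one of the paper's experiments: the system $x_{t+1} = \theta x_t + u_t$, the constant input, the initial guess, and the helper names `rollout` and `sweep` are all assumptions chosen for illustration. The offline phase is assumed here to be one further pass over the stored measurements, starting from the estimate and covariance reached at time $T$. As in the Sys ID mode, the objective is zero, so the model trajectory is a plain rollout under the known inputs, and the sensitivity $X_t = \partial x_t / \partial \theta$ follows $X_{t+1} = \theta X_t + x_t$. The update uses the Jacobian of the predicted measurement, which equals $-dl/d\theta$ for the residual $l = O_t - x_t(\hat{\theta})$.

```python
import numpy as np

def rollout(theta, x0, u):
    """Simulate x_{t+1} = theta * x_t + u_t and X_t = dx_t/dtheta."""
    x, X = [x0], [0.0]
    for t in range(len(u)):
        X.append(theta * X[t] + x[t])   # sensitivity recursion
        x.append(theta * x[t] + u[t])   # model dynamics
    return np.array(x), np.array(X)

def sweep(theta, P, obs, x0, u, R):
    """One pass of scalar EKF parameter updates over obs[0..T]."""
    for t, O_t in enumerate(obs):
        x, X = rollout(theta, x0, u)    # trajectory of the current model
        H = X[t]                        # d(prediction)/d(theta) = -dl/dtheta
        K = P * H / (H * P * H + R)     # scalar Kalman gain
        theta = theta + K * (O_t - x[t])
        P = (1.0 - K * H) * P
    return theta, P

rng = np.random.default_rng(0)
theta_true, x0, T, sigma = 0.8, 1.0, 30, 0.05
u = np.ones(T)                          # known excitation input (assumed)
x_true, _ = rollout(theta_true, x0, u)
obs = x_true + sigma * rng.standard_normal(T + 1)   # noisy measurements

# Online phase: measurements are processed once, in order of arrival.
theta_online, P_online = sweep(0.5, 1.0, obs, x0, u, sigma**2)
# Offline phase (assumed): one more pass over the stored trajectory.
theta_offline, _ = sweep(theta_online, P_online, obs, x0, u, sigma**2)
```

In this toy setting the first informative measurement already moves the estimate close to the true value, and later measurements refine it as the sensitivity $X_t$ grows.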
To unify the data visualization of both online and offline phases, the horizontal axis represents the number of data points, where a vertical red line corresponds to the final time $T$, i.e. the end of the online phase. The number of data points equals the number of iterations multiplied by the total number of time steps in each iteration. The solid blue curves indicate the online portion of OCIL, whereas the dashed blue curves indicate the offline portion. For every environment and every method, 5 trials are performed with random initial conditions; the number of trials is limited by the high computational cost of the other methods. The computational performance and analysis for OCIL are shown in Section 5.

Online Imitation Learning. The control objective is parameterized as a weighted distance to the goal. The signed residual function of imitation learning is set to $l(\xi_t(\hat{\theta}), y_t^*) = y_t^* - g(x_t(\hat{\theta}), u_t(\hat{\theta}))$. The optimal cumulative loss is zero, i.e. $L(\xi(\theta^*)) = 0$, with full knowledge of the parameter. Three existing methods are used for comparison: (i) inverse KKT (Englert et al., 2017), (ii) neural policy cloning (Bojarski et al., 2016), and (iii) PDP (Jin et al., 2020). These methods do not handle measurement noise well because of their limitations, so the experiments for these methods were performed without measurement noise. For OCIL, $\sigma = 0.1$ for all of the systems.

Table 1: Experiment Environments

| System | Dynamics parameter $\theta_{\text{dyn}}$ | Objective parameter $\theta_{\text{obj}}$ |
|---|---|---|
| Cartpole | cart mass, pole mass and length | $c(x, u) = \lVert \theta_{\text{obj}}^\top (x - x_g) \rVert^2 + \lVert u \rVert^2$, $h(x) = \lVert \theta_{\text{obj}}^\top (x - x_g) \rVert^2$ |
| 6-DoF Quadrotor | mass, wing length, inertia matrix | |
| 6-DoF Rocket | mass, rocket length, inertia matrix | |

Fig. 2a-2c summarize the comparison result, where OCIL converges faster and obtains lower loss than the other offline methods, in both online and offline phases. The initial loss for each method is different because the learning representation (parameterization) is different.
Thus, it is hard to guarantee that an initial neural network has the same loss as another initial parameter vector. Nevertheless, the initial representation of each method is adjusted such that OCIL does not take advantage of good initialization. Fig. 2a-2c validate the effectiveness of OCIL's online and offline performance, even with measurement noise.

Figure 2: Imitation loss vs. number of data points. Panels: (a) Cartpole; (b) Quadrotor.

Figure 3: Trajectories of the cartpole in online imitation learning. Blue solid lines: learned trajectory. Green solid lines: observed noisy trajectory. Red dashed lines: ground truth.

Online System Identification. The signed residual function is set to be $l(\xi_t(\hat{\theta}), \xi_t^o) = \xi_t^o - \xi_t(\hat{\theta})$. The optimal cumulative loss is zero, i.e. $L(\xi(\theta^*)) = 0$, with full knowledge of the parameter. Three other methods are used for comparison: (i) PyTorch Adam solver (Pillonetto et al., 2025), (ii) DMDc (Proctor et al., 2016), and (iii) PDP (Jin et al., 2020). No measurement noise is injected into the observed data for existing methods due to their inherent limitations. For OCIL, $\sigma = 0.05$ for all of the systems. Fig. 5a-5c summarize the result, where OCIL outperforms PDP with faster convergence and lower loss, in both online and offline phases. Unlike in Online Imitation Learning, OCIL does not decrease its SysID loss significantly at first because the number of data points is not yet sufficient for online learning. Once the number of data points becomes sufficient, the SysID loss starts decreasing significantly. This phenomenon can also be observed in the other methods, but their critical number of data points is significantly larger than OCIL's. In Fig.
5d-5f, OCIL and other methods are applied to learn the neural dynamics using the same observed trajectory. It can be seen that OCIL achieves lower loss than the other methods. Fig. 6 demonstrates the capability of OCIL to deal with neural dynamics represented by neural networks of different sizes.

Figure 4: Trajectories of the quadrotor in online imitation learning. Blue solid lines: learned trajectory. Green solid lines: observed noisy trajectory. Red dashed lines: ground truth.

Figure 5: SysID loss vs. number of data points. Panels: (a) Cartpole; (b) Quadrotor; (d) Cartpole, offline, NN dynamics; (e) Quadrotor, offline, NN dynamics; (f) Rocket, offline, NN dynamics.

Figure 6: SysID loss vs. number of data points, given different sizes of neural dynamics. Panels: (a) Cartpole w/o noise; (b) Quadrotor w/o noise; (c) Rocket w/o noise.

Policy Tuning On-the-fly. The parameterized OC system in (8) is used here, where the policy is in a state-feedback form and parameterized by the tunable parameter $\hat{\theta}$. The signed residual function is set to be $l(\xi_t(\hat{\theta}), \xi_t^*) = \xi_t^* - \xi_t(\hat{\theta})$, where $\xi_t^*$ is the trajectory that needs to be tracked. The optimal cumulative loss is zero, i.e. $L(\xi(\theta^*)) = 0$, with full knowledge of the parameter. Three other methods are used for comparison: (i) iLQR (Li & Todorov, 2004), (ii) GPS (Levine & Abbeel, 2014), and (iii) PDP (Jin et al., 2020). No measurement noise is included for existing methods due to their limitations. For OCIL, $\sigma = 0.1$ for cartpole and quadrotor; $\sigma = 0.25$ for rocket. Fig. 9a-9c summarize the result, where the loss and its variation of OCIL converge very quickly.
The buffers in Fig. 9a-9f indicate three standard deviations. Fig. 9d-9f present the online phase of OCIL over 1000 random trials, which further validates the effectiveness and robustness of OCIL under measurement noise.

Figure 7: Trajectories of the cartpole in policy tuning on-the-fly. Blue solid lines: learned trajectory. Green solid lines: observed noisy trajectory. Red dashed lines: ground truth.

In general, across all figures, the loss trajectory of OCIL is not as smooth as those of the other offline methods. This is because in the online phase an optimal gain matrix $K_t$ from (11a) is computed to update $\hat{\theta}_t$, whereas the other methods use either a constant or an iteration-dependent step size. The optimal gain is conceptually similar to searching for an optimal step size in line-search optimization algorithms. Thus, the loss variation, as represented by the blue buffers, is relatively high initially but starts decreasing significantly as new data comes in because $K_t$ is continually updated. In contrast, the loss variation barely changes for the other offline methods after some data points.

5 Online Computational Performance

The experiments with OCIL were performed on a desktop with one Intel Core i7-8700K CPU and 8 GB RAM. No GPU was used. The experiments with the other methods were performed on a desktop with one AMD Ryzen 9 5900X CPU, one NVIDIA GeForce RTX 4070 Ti GPU, and 32 GB RAM. A more powerful PC was selected for the other methods because of their high computational cost. As noted at the beginning of Section 4, only 5 trials were conducted due to the computational expense.
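The analogy drawn in Section 4 between the gain $K_t$ and an adaptive step size can be illustrated with a small sketch. It uses a generic Kalman-style recursion on a toy linear measurement model, which is an assumption made here for illustration; it is not the estimator (11) itself, and the Jacobian `H` of the predicted measurement may differ in sign convention from $L_t$. All function names and the toy problem are hypothetical.

```python
import numpy as np


def ekf_parameter_step(theta, P, H, y, y_pred, R):
    """One Kalman-style correction of a parameter estimate.

    theta: (p,) current estimate; P: (p, p) covariance;
    H: (s, p) Jacobian of the predicted measurement w.r.t. theta;
    y, y_pred: (s,) measured and predicted outputs; R: (s, s) noise covariance.
    """
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # gain, acts as a matrix-valued step size
    theta_new = theta + K @ (y - y_pred)   # correct the estimate with the residual
    P_new = (np.eye(len(theta)) - K @ H) @ P
    return theta_new, P_new, K


def run_toy_estimation(num_steps=200, sigma=0.1, seed=0):
    """Estimate a 2-parameter vector from scalar noisy measurements y = H theta + e."""
    rng = np.random.default_rng(seed)
    theta_true = np.array([1.0, -2.0])     # hypothetical true parameter
    theta = np.zeros(2)                    # initial guess
    P = np.eye(2)                          # initial covariance
    R = sigma**2 * np.eye(1)
    gain_norms = []
    for _ in range(num_steps):
        H = rng.standard_normal((1, 2))
        y = H @ theta_true + sigma * rng.standard_normal(1)
        theta, P, K = ekf_parameter_step(theta, P, H, y, H @ theta, R)
        gain_norms.append(np.linalg.norm(K))
    return theta_true, theta, P, gain_norms


theta_true, theta_hat, P_final, gain_norms = run_toy_estimation()
print(np.linalg.norm(theta_hat - theta_true), gain_norms[0], gain_norms[-1])
```

In this toy setting the gain is recomputed at every step from the current covariance, so early corrections tend to be large and later ones small, whereas a gradient method with a fixed step size applies the same scaling throughout.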
Figure 8: Trajectories of the quadrotor in policy tuning on-the-fly. Blue solid lines: learned trajectory. Green solid lines: observed noisy trajectory. Red dashed lines: ground truth.

Figure 9: Policy Tuning Loss vs. number of data points. Buffers represent loss variation under 3σ with random initial conditions. Panels: (a) Cartpole; (b) Quadrotor; (d) Cartpole, online phase, with noise; (e) Quadrotor, online phase, with noise; (f) Rocket, online phase, with noise.

To demonstrate that OCIL is fast enough to be used in an online fashion, we recorded the computational time of OCIL in different modes over 100 trials. The code is implemented in Python, utilizing the CasADi library with the IPOPT solver to solve the OC problem. Table 3 summarizes OCIL's computational performance for the system identification task on three environments. In the table, OCIL Time indicates the computational time of running OCIL at each time $t$, i.e. one iteration within the for-loop of Algorithm 2; GG Time indicates the computational time of running the gradient generator (GG) at each time $t$, i.e. Algorithm 1; Estimator Time indicates the computational time of updating $\hat{\theta}$, i.e. Line 7 of Algorithm 2; Time Step indicates the time step of each environment, i.e., the time duration between two consecutive data measurements, which is also the maximum time an online algorithm is allowed for computation; Percentage indicates the percentage of the average OCIL Time with respect to the time step. Tables 2 and 4 share the same header.
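As a worked example of the Percentage column defined above, the following minimal sketch computes the share of one time step consumed by the average OCIL time. The helper names are hypothetical, and the numbers are the Cartpole entries of Table 3.

```python
def step_budget_percentage(ocil_time_ms, time_step_ms):
    """Share of one sampling interval consumed by the average per-step OCIL time."""
    return 100.0 * ocil_time_ms / time_step_ms


def fits_online(ocil_time_ms, time_step_ms):
    """True when the per-step computation finishes before the next measurement arrives."""
    return ocil_time_ms < time_step_ms


# Cartpole row of Table 3: average OCIL time 17.18 ms, time step 50 ms.
pct = step_budget_percentage(17.18, 50.0)
print(f"{pct:.2f}%")  # prints 34.36%
print(fits_online(17.18, 50.0))  # prints True
```

A percentage below 100 means the update completes, on average, within the interval between two consecutive measurements.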
Roughly speaking, OCIL time = GG time + Estimator time + Optimal Control computation time. Table 3 illustrates that OCIL can estimate a dynamical system with a neural network representation in an online fashion, within the period at which the system receives new data. Table 4 illustrates that OCIL can tune the neural policy online. As indicated in Line 2 of Algorithm 2, the most computationally heavy part is solving for the optimal control trajectory online, rather than the GG or the parameter estimator. As noted at the beginning of this section, OCIL does not require heavy computational resources such as a GPU. Therefore, OCIL is capable of running in an online fashion.

Table 2: Computational Performance for Online Imitation Learning

| Env. | OCIL Time [ms] | GG Time [ms] | Estimator Time [ms] | Time Step [ms] | Percentage |
|---|---|---|---|---|---|
| Cartpole | 62.10 ± 6.63 | 7.47 ± 0.25 | 0.031 ± 0.0023 | 100 | 62.10% |
| Quadrotor | 81.70 ± 2.51 | 21.72 ± 0.84 | 0.058 ± 0.039 | 100 | 81.70% |
| Rocket | 72.25 ± 13.91 | 19.77 ± 6.26 | 0.060 ± 0.012 | 100 | 72.25% |

Table 3: Computational Performance for SysID with Neural System

| Env. | OCIL Time [ms] | GG Time [ms] | Estimator Time [ms] | Time Step [ms] | Percentage |
|---|---|---|---|---|---|
| Cartpole | 17.18 ± 5.15 | 6.05 ± 2.59 | 1.93 ± 0.85 | 50 | 34.36% |
| Quadrotor | 35.53 ± 8.98 | 12.18 ± 4.77 | 16.18 ± 6.11 | 100 | 35.53% |
| Rocket | 29.54 ± 8.29 | 11.25 ± 4.67 | 12.89 ± 5.29 | 200 | 14.77% |

Table 4: Computational Performance for Policy Tuning with Neural Policy

| Env. | OCIL Time [ms] | GG Time [ms] | Estimator Time [ms] | Time Step [ms] | Percentage |
|---|---|---|---|---|---|
| Cartpole | 16.41 ± 5.35 | 7.72 ± 3.07 | 3.96 ± 1.91 | 50 | 32.82% |
| Quadrotor | 62.67 ± 9.51 | 30.86 ± 2.25 | 22.47 ± 8.33 | 100 | 62.67% |
| Rocket | 59.02 ± 7.25 | 33.99 ± 2.98 | 12.94 ± 5.62 | 100 | 59.02% |

6 Limitations

This section discusses the major limitations of the proposed framework from three perspectives.

Local convergence: Since OCIL is based on first-order gradients, it can only achieve local minima for general non-convex optimal control problems in (3). Furthermore, the general problem proposed in this paper belongs to a bi-level optimization framework.
Under certain assumptions such as convexity and smoothness of the models (e.g., dynamics model, policy, loss function, and control objective function), global convergence of the bi-level optimization can be established. However, such conditions are too restrictive in the context of dynamic control systems. Therefore, a local convergence analysis based on general nonlinear optimization is sufficient for the setting considered here.

Parameterization matters for global convergence: When performing experiments, we find that how models are parameterized matters for good convergence performance. For example, in online SysID mode, we observe that using neural network dynamics (in Fig. 5d-5f) is more likely to get trapped in local minima than using the true dynamics with unknown parameters (in Fig. 5a-5c). In general, a more complex parameterization introduces stronger non-convexity into the optimization problem, making the algorithm more easily trapped in local minima. Determining the parameterization of an object to be learned requires prior or expert knowledge, which is common in ML.

Initialization matters: As OCIL borrows its optimal gain update from the EKF, the two share the same drawback that convergence depends on the initialization. As shown in Remark 5, a bad initial guess might cause the estimator to diverge according to Lemma 2. Therefore, if a relatively good initial guess is hard to obtain, one might need to use other methods to cold start OCIL.

7 Conclusions

This paper proposes Online Control-Informed Learning (OCIL), an online learning method tailored for diverse learning tasks. By considering an optimal control system with a tunable parameter, OCIL is a unified learning framework that effectively addresses tasks such as online imitation learning, online system identification, and policy tuning on-the-fly.
By designing a signed residual function specific to each task and treating the parameter as a state of a new system, we employ the online parameter estimator to estimate the parameter online and minimize the signed residual at each time step. Theoretical analysis establishes the convergence conditions for OCIL, while experiments across various environments and tasks, with comparisons against existing methods, validate its data efficiency, versatility, and robustness against measurement noise.

Acknowledgments

This material is based upon work supported by the Office of Naval Research (ONR) and Saab, Inc. under the Threat and Situational Understanding of Networked Online Machine Intelligence (TSUNOMI) program (grant no. N00014-23-C-1016). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the ONR, the U.S. Government, or Saab, Inc.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, pp. 1-8, 2004.
Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement learning. In International Conference on Machine Learning, pp. 1-8, 2006.
Ian Abraham and Todd D Murphey. Active learning of dynamics for data-driven control using koopman operators. IEEE Transactions on Robotics, 35(5):1071-1083, 2019.
Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J Zico Kolter. Differentiable mpc for end-to-end planning and control. In Advances in Neural Information Processing Systems, pp. 8289-8300, 2018.
Saurabh Arora and Prashant Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress.
Artificial Intelligence, 297:103500, 2021.
Michael Athans. The role and use of the stochastic linear-quadratic-gaussian problem in control system design. IEEE Transactions on Automatic Control, 16(6):529-552, 1971.
Nicola Bastianello, Ruggero Carli, and Sandro Zampieri. Internal model-based online optimization. IEEE Transactions on Automatic Control, 69(1):689-696, 2024.
Gerben I Beintema, Maarten Schoukens, and Roland Tóth. Deep subspace encoders for nonlinear system identification. Automatica, 156:111210, 2023.
Martin Benning, Elena Celledoni, Matthias J Ehrhardt, Brynjulf Owren, and Carola-Bibiane Schönlieb. Deep learning as optimal control problems: models and numerical methods. arXiv preprint arXiv:1904.05657, 2019.
Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. Machine Learning, 112(10):3713-3747, 2023.
Dimitri Bertsekas. Lessons from AlphaZero for optimal, model predictive, and adaptive control. Athena Scientific, 2022.
Dimitri P. Bertsekas. Incremental least squares methods and the extended kalman filter. SIAM Journal on Optimization, 6(3):807-822, 1996.
Hans Georg Bock and Karl-Josef Plitt. A multiple shooting algorithm for direct solution of optimal control problems. IFAC Proceedings Volumes, 17(2):1603-1608, 1984.
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
Lucas Böttcher, Nino Antulov-Fantulin, and Thomas Asikis. Ai pontryagin or how artificial neural networks learn to control dynamical systems. Nature Communications, 13(1):333, 2022.
M. Boutayeb and D. Aubry. A strong tracking extended kalman observer for nonlinear discrete-time systems.
IEEE Transactions on Automatic Control, 44(8):1550-1556, 1999.
M. Boutayeb, H. Rafaralahy, and M. Darouach. Convergence analysis of the extended kalman filter used as an observer for nonlinear deterministic discrete-time systems. IEEE Transactions on Automatic Control, 42(4):581-586, 1997.
Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pp. 783-792. PMLR, 2019.
Arthur Earl Bryson. Applied optimal control: optimization, estimation and control. CRC Press, 1975.
Umberto Casti, Nicola Bastianello, Ruggero Carli, and Sandro Zampieri. A control theoretical approach to online constrained optimization. arXiv preprint arXiv:2309.15498, 2023.
Alex J Chan and Mihaela van der Schaar. Scalable bayesian inverse reinforcement learning. arXiv preprint arXiv:2102.06483, 2021.
Wojciech M Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant Jayakumar, Grzegorz Swirszcz, and Max Jaderberg. Distilling policy distillation. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1331-1340. PMLR, 2019.
Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465-472, 2011.
Peter Englert, Ngo Anh Vien, and Marc Toussaint. Inverse kkt: Learning cost functions of manipulation tasks from demonstrations. The International Journal of Robotics Research, 36(13-14):1474-1488, 2017.
Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pp. 64-72, 2016.
Reza Ghaemi, Jing Sun, and Ilya V Kolmanovsky. Neighboring extremal solution for nonlinear discrete-time optimal control problems with state inequality constraints.
IEEE Transactions on Automatic Control, 54(11):2674-2679, 2009.
Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829-2838, 2016.
Kai Guo and Yongping Pan. Composite adaptation and learning for robot control: A survey. Annual Reviews in Control, 55:279-290, 2023.
Jiequn Han, Qianxiao Li, et al. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):1-41, 2019.
Wenjian Hao, Paulo C Heredia, Bowen Huang, Zehui Lu, Zihao Liang, and Shaoshuai Mou. Policy learning based on deep koopman representation. arXiv preprint arXiv:2305.15188, 2023.
Wenjian Hao, Zehui Lu, Devesh Upadhyay, and Shaoshuai Mou. A distributed deep koopman learning algorithm for control. arXiv preprint arXiv:2412.07212, 2024.
Masahiko Haruno, Daniel M Wolpert, and Mitsuo Kawato. Mosaic model for sensorimotor learning and control. Neural Computation, 13(10):2201-2220, 2001.
Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944-2952, 2015.
Petros A Ioannou and Jing Sun. Robust adaptive control. Courier Corporation, 2012.
Wanxin Jin and Shaoshuai Mou. Distributed inverse optimal control. Automatica, 129:109658, 2021. ISSN 0005-1098.
Wanxin Jin, Dana Kulić, Jonathan Feng-Shun Lin, Shaoshuai Mou, and Sandra Hirche. Inverse optimal control for multiphase cost functions. IEEE Transactions on Robotics, 35(6):1387-1398, 2019.
Wanxin Jin, Zhaoran Wang, Zhuoran Yang, and Shaoshuai Mou. Pontryagin differentiable programming: An end-to-end learning and control framework. Advances in Neural Information Processing Systems, 33:7979-7992, 2020.
Wanxin Jin, Dana Kulić, Shaoshuai Mou, and Sandra Hirche.
Inverse optimal control from incomplete trajectory observations. The International Journal of Robotics Research, 40(6-7):848-865, 2021a.
Wanxin Jin, Shaoshuai Mou, and George J Pappas. Safe pontryagin differentiable programming. Advances in Neural Information Processing Systems, 34:16034-16050, 2021b.
Hilbert J Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005.
George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422-440, 2021.
Karthik Kashinath, M Mustafa, Adrian Albert, JL Wu, C Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, R Wang, Ashesh Chattopadhyay, A Singh, et al. Physics-informed machine learning: case studies for weather and climate modelling. Philosophical Transactions of the Royal Society A, 379(2194):20200093, 2021.
Karel J Keesman. System identification: an introduction. Springer Science & Business Media, 2011.
Arezou Keshavarz, Yang Wang, and Stephen Boyd. Imputing a convex objective function. In IEEE International Symposium on Intelligent Control, pp. 613-619, 2011.
Mohammad Khosravi, Varsha N Behrunani, Piotr Myszkorowski, Roy S Smith, Alisa Rupenyan, and John Lygeros. Performance-driven cascade controller tuning with bayesian optimization. IEEE Transactions on Industrial Electronics, 69(1):1032-1042, 2021.
Jack B Kuipers. Quaternions and rotation sequences, volume 66. Princeton University Press, 1999.
S Narendra Kumpati, Parthasarathy Kannan, et al. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4-27, 1990.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics.
In Advances in Neural Information Processing Systems, pp. 1071-1079, 2014.
FW Lewis, Suresh Jagannathan, and Aydin Yesildirak. Neural network control of robot manipulators and non-linear systems. CRC press, 1998.
Qianxiao Li and Shuji Hao. An optimal control approach to deep learning and applications to discrete-weight neural networks. arXiv preprint arXiv:1803.01299, 2018.
Qianxiao Li, Long Chen, Cheng Tai, and E Weinan. Maximum principle based algorithms for deep learning. Journal of Machine Learning Research, 18(165):1-29, 2018.
Weiwei Li and Emanuel Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In International Conference on Informatics in Control, Automation and Robotics, pp. 222-229, 2004.
Zihao Liang, Wanxin Jin, and Shaoshuai Mou. An iterative method for inverse optimal control. In 2022 13th Asian Control Conference (ASCC), pp. 959-964, 2022.
Zihao Liang, Wenjian Hao, and Shaoshuai Mou. A data-driven approach for inverse optimal control. In 2023 62nd IEEE Conference on Decision and Control (CDC), pp. 3632-3637, 2023.
Hailiang Liu and Peter Markowich. Selection dynamics for deep neural networks. Journal of Differential Equations, 269(12):11540-11574, 2020.
Yong Liu, Chenyu Li, Jianmin Wang, and Mingsheng Long. Koopa: Learning non-stationary time series dynamics with koopman predictors. Advances in Neural Information Processing Systems, 36, 2024.
Zehui Lu, Tianpeng Zhang, and Yebin Wang. Torque constraint modeling and reference shaping for servo systems. IEEE Control Systems Letters, 8:2637-2642, 2024.
Rui Luo, Zhinan Peng, Jiangping Hu, and Bijoy Kumar Ghosh. Adaptive optimal control of affine nonlinear systems via identifier critic neural network approximation with relaxed pe conditions. Neural Networks, 167:588-600, 2023.
Michael Lutter, Christian Ritter, and Jan Peters. Deep lagrangian networks: Using physics as model prior for deep learning. arXiv preprint arXiv:1907.04490, 2019.
Alexandre Mauroy, Y Susuki, and Igor Mezić. Koopman operator in systems and control. Springer, 2020.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
Katja Mombaur, Anh Truong, and Jean-Paul Laumond. From human to humanoid locomotion - an inverse optimal control approach. Autonomous Robots, 28(3):369-383, 2010.
Oliver Nelles and Oliver Nelles. Nonlinear dynamic system identification. Springer, 2020.
Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, and Honglak Lee. Control of memory, active perception, and action in minecraft. arXiv preprint arXiv:1605.09128, 2016.
Masashi Okada, Luca Rigazio, and Takenobu Aoshima. Path integral networks: End-to-end differentiable optimal control. arXiv preprint arXiv:1706.09597, 2017.
Michael A Patterson and Anil V Rao. Gpops-ii: A matlab software for solving multiple-phase optimal control problems using hp-adaptive gaussian quadrature collocation methods and sparse nonlinear programming. ACM Transactions on Mathematical Software, 41(1):1, 2014.
Marcus Pereira, David D Fan, Gabriel Nakajima An, and Evangelos Theodorou. Mpc-inspired neural network policies for sequential decision making. arXiv preprint arXiv:1802.05803, 2018.
Gianluigi Pillonetto, Aleksandr Aravkin, Daniel Gedon, Lennart Ljung, Antônio H Ribeiro, and Thomas B Schön. Deep networks for system identification: a survey. Automatica, 171:111907, 2025.
Lev Semenovich Pontryagin. Mathematical theory of optimal processes. Routledge, 2018.
Joshua L Proctor, Steven L Brunton, and J Nathan Kutz. Dynamic mode decomposition with control. SIAM Journal on Applied Dynamical Systems, 15(1):142-161, 2016.
Joshua L Proctor, Steven L Brunton, and J Nathan Kutz. Generalizing koopman theory to allow for inputs and control. SIAM Journal on Applied Dynamical Systems, 17(1):909-930, 2018.
Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In International Conference on Machine Learning, pp. 729-736, 2006.
Guy Revach, Nir Shlezinger, Xiaoyong Ni, Adrià López Escoriza, Ruud J. G. van Sloun, and Yonina C. Eldar. Kalmannet: Neural network aided kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70:1532-1547, 2022.
Maria Isabel Ribeiro. Kalman and extended kalman filters: Concept, derivation and properties. Institute for Systems and Robotics, 43(46):3736-3741, 2004.
Steindor Saemundsson, Alexander Terenin, Katja Hofmann, and Marc Deisenroth. Variational integrator networks for physically structured embeddings. In International Conference on Artificial Intelligence and Statistics, pp. 3078-3087, 2020.
Fumihiro Sasaki and Ryota Yamashina. Behavioral cloning from noisy demonstrations. In International Conference on Learning Representations, 2021.
Jeff G Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems, pp. 1047-1053, 1997.
Max Schwenzer, Muzaffer Ay, Thomas Bergs, and Dirk Abel. Review on model predictive control: An engineering perspective. The International Journal of Advanced Manufacturing Technology, 117(5):1327-1349, 2021.
Pierre OM Scokaert and James B Rawlings. Constrained linear quadratic regulation. IEEE Transactions on Automatic Control, 43(8):1163-1169, 1998.
Pushan Sharma, Wai Tong Chung, Bassem Akoush, and Matthias Ihme. A review of physics-informed machine learning in fluid mechanics. Energies, 16(5):2343, 2023.
Yongkyu Song and Jessy W. Grizzle.
The extended kalman filter as a local asymptotic observer for nonlinear discrete-time systems. In 1992 American Control Conference, pp. 3365-3369, 1992.
Farshud Sorourifar, Georgios Makrygirgos, Ali Mesbah, and Joel A Paulson. A data-driven automatic tuning method for mpc under uncertainty using constrained bayesian optimization. IFAC-PapersOnLine, 54(3):243-250, 2021.
Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.
Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
Laura Von Rueden, Sebastian Mayer, Katharina Beckh, Bogdan Georgiev, Sven Giesselbach, Raoul Heese, Birgit Kirsch, Julius Pfrommer, Annika Pick, Rajkumar Ramamurthy, et al. Informed machine learning - a taxonomy and survey of integrating prior knowledge into learning systems. IEEE Transactions on Knowledge and Data Engineering, 35(1):614-633, 2021.
Hai Wang, Zhihong Man, Weixiang Shen, Zhenwei Cao, Jinchuan Zheng, Jiong Jin, et al. Robust control for steer-by-wire systems with partially known dynamics. IEEE Transactions on Industrial Informatics, 10(4):2003-2015, 2014.
Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746-2754, 2015.
Jin-Long Wu, Heng Xiao, and Eric Paterson. Physics-informed machine learning approach for augmenting turbulence models: A comprehensive framework. Physical Review Fluids, 3(7):074602, 2018.
Peng Xu, Fred Roosta, and Michael W Mahoney. Second-order optimization for non-convex machine learning: An empirical study.
In SIAM International Conference on Data Mining, pp. 199-207, 2020.
Amy Zhang, Sainbayar Sukhbaatar, Adam Lerer, Arthur Szlam, and Rob Fergus. Composable planning with attributes. In International Conference on Machine Learning, pp. 5842-5851, 2018.
Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019.
Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1433-1438, 2008.

A Proof of Lemma 2

Since the matrices $P_t$ and $L_t$ are bounded according to Assumptions 1 and 2, from (11b), one will have:
$$K_t = P_t L_t^\top R_t^{-1} \tag{25}$$
$$\phantom{K_t} = P_t^- L_t^\top (L_t P_t^- L_t^\top + R_t)^{-1}. \tag{26}$$
Then, by taking the inverse of (25) and (26), one will get:
$$P_t^{-1} = (P_t^-)^{-1} + L_t^\top R_t^{-1} L_t. \tag{27}$$
Substituting (25) into (11b) and subtracting both sides from $\theta_t$, one will have:
$$\tilde{\theta}_t = \tilde{\theta}_t^- - P_t L_t^\top R_t^{-1} e_t. \tag{28}$$
Then, plug (28) into the Lyapunov function (21):
$$V_t = \tilde{\theta}_t^\top P_t^{-1} \tilde{\theta}_t \tag{29}$$
$$= (\tilde{\theta}_t^- - P_t L_t^\top R_t^{-1} e_t)^\top P_t^{-1} (\tilde{\theta}_t^- - P_t L_t^\top R_t^{-1} e_t) \tag{30}$$
$$= (\tilde{\theta}_t^-)^\top P_t^{-1} \tilde{\theta}_t^- - (\tilde{\theta}_t^-)^\top L_t^\top R_t^{-1} e_t - e_t^\top R_t^{-1} L_t \tilde{\theta}_t^- + e_t^\top R_t^{-1} L_t P_t L_t^\top R_t^{-1} e_t. \tag{31}$$
Next, we plug (27) into (31):
$$V_t = (\tilde{\theta}_t^-)^\top \big((P_t^-)^{-1} + L_t^\top R_t^{-1} L_t\big) \tilde{\theta}_t^- - (\tilde{\theta}_t^-)^\top L_t^\top R_t^{-1} e_t - e_t^\top R_t^{-1} L_t \tilde{\theta}_t^- + e_t^\top R_t^{-1} L_t P_t L_t^\top R_t^{-1} e_t \tag{32}$$
$$= V_t^- + (\tilde{\theta}_t^-)^\top L_t^\top R_t^{-1} L_t \tilde{\theta}_t^- - (\tilde{\theta}_t^-)^\top L_t^\top R_t^{-1} e_t - e_t^\top R_t^{-1} L_t \tilde{\theta}_t^- + e_t^\top R_t^{-1} L_t P_t L_t^\top R_t^{-1} e_t, \tag{33}$$
where
$$V_t^- = (\tilde{\theta}_t^-)^\top (P_t^-)^{-1} \tilde{\theta}_t^- \tag{34}$$
$$= \tilde{\theta}_{t-1}^\top G_t^\top P_{t-1}^{-1} G_t \tilde{\theta}_{t-1}. \tag{35}$$
Using (22), (33) becomes:
$$V_t = V_t^- + (\tilde{\theta}_t^-)^\top L_t^\top R_t^{-1} L_t \tilde{\theta}_t^- - (\tilde{\theta}_t^-)^\top L_t^\top R_t^{-1} e_t - e_t^\top R_t^{-1} L_t \tilde{\theta}_t^- + e_t^\top R_t^{-1} L_t P_t L_t^\top R_t^{-1} e_t \tag{36}$$
$$= V_t^- + e_t^\top F_t R_t^{-1} F_t e_t - e_t^\top F_t R_t^{-1} e_t - e_t^\top R_t^{-1} F_t e_t + e_t^\top R_t^{-1} L_t P_t L_t^\top R_t^{-1} e_t \tag{37}$$
$$= V_t^- + e_t^\top \big(F_t R_t^{-1} F_t - F_t R_t^{-1} - R_t^{-1} F_t + R_t^{-1} L_t P_t L_t^\top R_t^{-1}\big) e_t. \tag{38}$$
To ensure that the Lyapunov function $\{V_t\}_{t=1,2,\dots}$
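is a decreasing sequence, we need $V_t - V_{t-1} \leq 0$, that is,

\begin{align}
& V_t - V_{t-1} \tag{39} \\
&= e_t^\top \big(F_t R_t^{-1} F_t - F_t R_t^{-1} - R_t^{-1} F_t + R_t^{-1} L_t P_t L_t^\top R_t^{-1}\big) e_t \tag{40} \\
&\quad + \tilde\theta_{t-1}^\top \big(G_t^\top P_{t-1}^{-1} G_t - P_{t-1}^{-1}\big) \tilde\theta_{t-1} \leq 0. \tag{41}
\end{align}

Therefore, to ensure the Lyapunov function is a decreasing sequence, it suffices that

$$F_t R_t^{-1} F_t - F_t R_t^{-1} - R_t^{-1} F_t + R_t^{-1} L_t P_t L_t^\top R_t^{-1} \preceq 0, \tag{42}$$

and

$$G_t^\top P_{t-1}^{-1} G_t - P_{t-1}^{-1} \preceq 0. \tag{43}$$

With some manipulations, (42) becomes:

\begin{align}
(F_t - I_s) R_t^{-1} (F_t - I_s) - R_t^{-1} + R_t^{-1} L_t P_t L_t^\top R_t^{-1} &\preceq 0, \tag{44} \\
(F_t - I_s) R_t^{-1} (F_t - I_s) - R_t^{-1} \big(I_s - L_t P_t^- L_t^\top (L_t P_t^- L_t^\top + R_t)^{-1}\big) &\preceq 0. \tag{45}
\end{align}

By letting $I_s = (L_t P_t^- L_t^\top + R_t)(L_t P_t^- L_t^\top + R_t)^{-1}$, we have

$$(F_t - I_s) R_t^{-1} (F_t - I_s) - (L_t P_t^- L_t^\top + R_t)^{-1} \preceq 0. \tag{46}$$

Since $F_t$ and $R_t$ are diagonal matrices, we have

$$R_t^{-1} (F_t - I_s)^2 - (L_t P_t^- L_t^\top + R_t)^{-1} \preceq 0, \tag{47}$$

which finally yields

$$(F_t - I_s)^2 \preceq R_t (L_t P_t^- L_t^\top + R_t)^{-1}, \tag{48}$$

and the proof is completed.

B Proof of Theorem 1

This proof is straightforward once Lemma 2 is provided. Suppose Assumptions 1 and 2 are met. According to Lemma 2, with the exact gradient generated by the gradient generator in (17)-(18), $\lim_{t \to \infty} \tilde\theta_t = 0$. As the estimate $\hat\theta_t$ converges to the true $\theta^*$, and the true parameter gives zero cumulative loss, the cumulative loss $L(\xi(\hat\theta))$ goes to 0.

C Experiment Details

C.1 System/Environment Setups

Cartpole. We consider the following continuous dynamics of the cartpole:

\begin{align}
\ddot p &= \frac{F + m_p l \dot\theta^2 \sin(\theta)}{m_t} - \frac{m_p l \ddot\theta \cos(\theta)}{m_t}, \nonumber \\
\ddot\theta &= \frac{g \sin(\theta) - \cos(\theta) \, \dfrac{F + m_p l \dot\theta^2 \sin(\theta)}{m_t}}{l \left( \dfrac{4}{3} - \dfrac{m_p \cos(\theta)^2}{m_t} \right)}, \tag{49}
\end{align}

where $p \in \mathbb{R}$ is the horizontal displacement of the cart; $\theta \in \mathbb{R}$ is the pole angle; $F \in \mathbb{R}$ denotes the horizontal force applied to the cart, which is between $-1$ and $+1$; $l \in \mathbb{R}$ is the length of the pole; $m_p, m_t \in \mathbb{R}$ are the masses of the pole and of the total cartpole, respectively. The state and control input of the cartpole are defined as

$$x \triangleq [p, \; \dot p, \; \theta, \; \dot\theta]^\top \quad \text{and} \quad u \triangleq F, \tag{50}$$

respectively.

Quadrotor UAV. We consider a quadrotor UAV with the following dynamics:

\begin{align}
\dot p_I &= v_I, \nonumber \\
m \dot v_I &= m g_I + F_I, \nonumber \\
\dot q_{B/I} &= \tfrac{1}{2} \Omega(\omega_B) q_{B/I}, \nonumber \\
J_B \dot\omega_B &= M_B - \omega_B \times J_B \omega_B. \tag{51}
\end{align}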
Here, the subscripts $B$ and $I$ denote quantities expressed in the body frame and the inertial (world) frame, respectively; $m$ and $J_B \in \mathbb{R}^{3 \times 3}$ are the mass and the moment of inertia with respect to the body frame of the UAV, respectively. $g$ is the gravitational constant ($g = 10\ \mathrm{m/s^2}$), and $g_I = [0, 0, g]^\top$. $p \in \mathbb{R}^3$ and $v \in \mathbb{R}^3$ are the position and velocity vectors of the UAV; $\omega_B \in \mathbb{R}^3$ is the angular velocity vector of the UAV; $q_{B/I} \in \mathbb{R}^4$ is the unit quaternion (Kuipers, 1999) that describes the attitude of the UAV with respect to the inertial frame; $\Omega(\omega_B)$ is defined as:

$$\Omega(\omega_B) = \begin{bmatrix} 0 & -\omega_x & -\omega_y & -\omega_z \\ \omega_x & 0 & \omega_z & -\omega_y \\ \omega_y & -\omega_z & 0 & \omega_x \\ \omega_z & \omega_y & -\omega_x & 0 \end{bmatrix}. \tag{52}$$

$M_B \in \mathbb{R}^3$ is the torque applied to the UAV; $F_I \in \mathbb{R}^3$ is the force vector applied to the UAV center of mass. The total force magnitude $f = \|F_I\| \in \mathbb{R}$ (along the z-axis of the body frame) and the torque $M_B = [M_x, M_y, M_z]^\top$ are generated by the thrusts $[T_1, T_2, T_3, T_4]^\top$ of four rotating propellers. Their relationship can be expressed as:

$$\begin{bmatrix} f \\ M_x \\ M_y \\ M_z \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & -l_w/2 & 0 & l_w/2 \\ -l_w/2 & 0 & l_w/2 & 0 \\ c & -c & c & -c \end{bmatrix} \begin{bmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{bmatrix}, \tag{53}$$

where $l_w$ is the wing length of the UAV and $c$ is a fixed constant. The state and input vectors of the UAV are defined as:

$$x \triangleq [p^\top, \; v^\top, \; q^\top, \; \omega^\top]^\top \in \mathbb{R}^{13}, \quad u \triangleq [T_1, \; T_2, \; T_3, \; T_4]^\top \in \mathbb{R}^4. \tag{54}$$

Rocket. The rocket is treated as a rigid body subject to constant gravitational acceleration $g_I \in \mathbb{R}^3$, and aerodynamic forces are neglected. The vehicle is assumed to actuate a single gimbaled rocket engine to generate a thrust vector within a feasible range of magnitudes and gimbal angles. We assume that the depletion of fuel during the landing phase is insignificant and therefore omit the dynamics of the rocket mass. The rocket has the following dynamics:

\begin{align*}
\dot p_I &= v_I, \\
\dot v_I &= \tfrac{1}{m} C_{I/B} T_B + g_I, \\
\dot q_{B/I} &= \tfrac{1}{2} \Omega(\omega_B) q_{B/I}, \\
J_B \dot\omega_B &= M_B - [\omega_B \times] J_B \omega_B.
\end{align*}

Here, the subscripts $B$ and $I$ denote quantities expressed in the body frame and the inertial (world) frame, respectively; $m$ and $J_B \in \mathbb{R}^{3 \times 3}$ are the mass and the moment of inertia with respect to the body frame of the rocket, respectively.
$p \in \mathbb{R}^3$ and $v \in \mathbb{R}^3$ are the position and velocity vectors of the rocket; $\omega_B \in \mathbb{R}^3$ is the angular velocity vector of the rocket; $q_{B/I} = [q_0, q_1, q_2, q_3]^\top$ is the unit quaternion that describes the attitude of the rocket with respect to the inertial frame; $T_B \in \mathbb{R}^3$ is the commanded thrust vector; $M_B \in \mathbb{R}^3$ is the torque applied to the rocket; $C_{B/I}$ is the direction cosine matrix that encodes the attitude transformation from body frame to inertial frame and is related to $q_{B/I}$ by the following relationship:

$$C_{B/I} = \begin{bmatrix} 1 - 2(q_2^2 + q_3^2) & 2(q_1 q_2 + q_0 q_3) & 2(q_1 q_3 - q_0 q_2) \\ 2(q_1 q_2 - q_0 q_3) & 1 - 2(q_1^2 + q_3^2) & 2(q_2 q_3 + q_0 q_1) \\ 2(q_1 q_3 + q_0 q_2) & 2(q_2 q_3 - q_0 q_1) & 1 - 2(q_1^2 + q_2^2) \end{bmatrix}.$$

The inverse transformation is denoted as $C_{I/B} = C_{B/I}^\top$. The skew-symmetric matrices $[\omega_B \times]$ and $\Omega(\omega_B)$ are defined as follows:

$$[\omega_B \times] = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}, \quad \Omega(\omega_B) = \begin{bmatrix} 0 & -\omega_x & -\omega_y & -\omega_z \\ \omega_x & 0 & \omega_z & -\omega_y \\ \omega_y & -\omega_z & 0 & \omega_x \\ \omega_z & \omega_y & -\omega_x & 0 \end{bmatrix}.$$

The state and input vectors of the rocket are defined as:

$$x = [p_I^\top, \; v_I^\top, \; q_{B/I}^\top, \; \omega_B^\top]^\top \in \mathbb{R}^{13}, \quad u = T_B = [T_x, \; T_y, \; T_z]^\top \in \mathbb{R}^3. \tag{56}$$

Discretization. The continuous dynamics are discretized in the following form:

$$x_{t+1} \approx x_t + \Delta \cdot g(x_t, u_t) \triangleq f(x_t, u_t), \tag{57}$$

where $\Delta$ is the discretization interval.

C.2 Online Imitation Learning

Data acquisition. The dataset of expert demonstrations is generated by solving an optimal control system with the true dynamics and control objective parameter $\theta^* = \{\theta_{dyn}, \theta_{obj}\}$ given. We generate five trajectories with different initial conditions $x_0$ and time horizons $T$.

PDP. We employed the PDP method of Jin et al. (2020) to solve this problem. The learning rate is $\eta = 10^{-4}$. Five trials were run from random initial $\theta_0$.

Inverse KKT method. We choose the inverse KKT method (Englert et al., 2017) for comparison because it is suitable for learning objective functions for high-dimensional continuous-space systems. We define the KKT loss as the 2-norm violation of the KKT conditions by the demonstration data:

$$\left\| \frac{\partial L}{\partial x_{0:T}} (x^*_{0:T}, u^*_{0:T-1}) \right\|^2 + \left\| \frac{\partial L}{\partial u_{0:T-1}} (x^*_{0:T}, u^*_{0:T-1}) \right\|^2.$$

Neural policy cloning.
For the neural policy cloning, we directly learn a neural network policy $u = \mu(x, \theta)$ from the dataset using supervised learning, that is,

$$\min_\theta \sum_{t=0}^{T-1} \left\| u^*_t - \mu(x^*_t, \theta) \right\|^2. \tag{58}$$

C.3 Online System Identification

Data acquisition. In the system identification experiment, we collect a total of five trajectories from systems with known dynamics. The trajectories $\xi^o = \{x^o_{0:T}, u_{0:T-1}\}$ have different initial conditions $x_0$ and horizons $T$ ($T$ ranges from 10 to 20 depending on the environment and task), with random inputs $u_{0:T-1}$ drawn from a uniform distribution.

PDP. We employed the PDP method of Jin et al. (2020) to solve this problem. The learning rate is $\eta = 10^{-4}$. Five trials were run from random initial $\theta_0$. For the neural dynamics case, the learning rate is $\eta = 10^{-5}$.

PyTorch Adam to learn neural dynamics. We represent the dynamics of each system (cartpole, quadrotor, and rocket) by a fully-connected feed-forward neural network $\hat f(x_t, u_t, \theta)$. The neural network has a layer structure of $(n + m)$-$2(n + m)$-$n$ with tanh activation functions, i.e., there is an input layer with $(n + m)$ neurons equal to the combined dimension of the state and control input, one hidden layer with $2(n + m)$ neurons, and one output layer with $n$ neurons. The trajectories $\xi^o = \{x^o_{0:T}, u_{0:T-1}\}$ obtained previously are used in the stage loss. We conducted five trials for each method with different initial $\theta$. We use PyTorch Adam to train the neural network by minimizing the following residual:

$$\sum_{t=0}^{T-1} \left\| x^o_{t+1} - \hat f(x^o_t, u_t, \theta) \right\|^2. \tag{59}$$

DMDc. The DMDc method (Proctor et al., 2016) is based on Koopman theory and represents nonlinear dynamics with linear dynamics of observables. The observables $\psi(x_t)$ are basis functions of the states, and the observable space has a much higher dimension than the state space. The dynamics are estimated by the following optimization:

$$\min_{A, B} \sum_{t=0}^{T-1} \left\| \psi(x^o_{t+1}) - A \psi(x^o_t) - B u_t \right\|^2. \tag{60}$$

C.4 Policy Tuning On-the-fly

Neural State Feedback Policy.
In this application, we learn the parameters of a neural state feedback policy by minimizing given control objective functions. Specifically, we use a fully connected feed-forward neural network that has a layer structure of $3n$-$3n$-$m$ with tanh activation functions, i.e., there is an input layer with $3n$ neurons equal to the dimension of state, one hidden layer with $3n$ neurons, and one output layer with $m$ neurons. The policy parameter $\theta$ is the neural network parameter. For comparison, we apply the guided policy search (GPS) method (Levine & Abbeel, 2014) and iLQR (Li & Todorov, 2004) to solve the same problem.

PDP. We employed the PDP method of Jin et al. (2020) to solve this problem. The learning rate is set to $\eta = 10^{-4}$ or $\eta = 10^{-6}$. Five trials were run from random initial $\theta_0$. For the neural objective function case, the learning rate is $\eta = 10^{-5}$.
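To make the $3n$-$3n$-$m$ layer structure of the neural state feedback policy concrete, the following is a minimal NumPy sketch, not the implementation used in the experiments. The function names `init_policy`, `policy`, and `flatten`, the weight initialization, and the linear output layer are all assumptions made for illustration. The text does not specify how the $3n$-dimensional input is built from the state, so the sketch simply takes a generic input vector `z` of that size.

```python
import numpy as np


def init_policy(n, m, seed=0):
    """Random parameters for a 3n-3n-m network (hypothetical initialization)."""
    rng = np.random.default_rng(seed)
    d = 3 * n  # width of the input and hidden layers, as stated in the text
    return {
        "W1": 0.1 * rng.standard_normal((d, d)),
        "b1": np.zeros(d),
        "W2": 0.1 * rng.standard_normal((m, d)),
        "b2": np.zeros(m),
    }


def policy(z, theta):
    """Map a 3n-dimensional input z to an m-dimensional control input."""
    hidden = np.tanh(theta["W1"] @ z + theta["b1"])
    # Assumption: the output layer is linear; the text does not say.
    return theta["W2"] @ hidden + theta["b2"]


def flatten(theta):
    """Stack all weights and biases into a single parameter vector."""
    return np.concatenate([theta[k].ravel() for k in ("W1", "b1", "W2", "b2")])


n, m = 4, 1  # cartpole state and input dimensions from (50)
theta = init_policy(n, m)
u = policy(np.zeros(3 * n), theta)
print(u.shape, flatten(theta).size)
```

Flattening the weights and biases into one vector reflects the statement that the policy parameter $\theta$ is the neural network parameter; under the assumptions above, the cartpole case gives a parameter vector of length $(3n)^2 + 3n + 3nm + m$.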