# deep_koopman_learning_using_noisy_data__44defe0b.pdf Published in Transactions on Machine Learning Research (04/2025) Deep Koopman Learning using Noisy Data Wenjian Hao hao93@purdue.edu School of Aeronautics and Astronautics Purdue University Devesh Upadhyay devesh.upadhyay@saabinc.com Saab, Inc. Shaoshuai Mou mous@purdue.com School of Aeronautics and Astronautics Purdue University Reviewed on Open Review: https: // openreview. net/ forum? id= j6Rm6T2l FU This paper proposes a data-driven framework to learn a finite-dimensional approximation of a Koopman operator for approximating the state evolution of a dynamical system under noisy observations. To this end, our proposed solution has two main advantages. First, the proposed method only requires the measurement noise to be bounded. Second, the proposed method modifies the existing deep Koopman operator formulations by characterizing the effect of the measurement noise on the Koopman operator learning and then mitigating it by updating the tunable parameter of the observable functions of the Koopman operator, making it easy to implement. The performance of the proposed method is demonstrated on several standard benchmarks. We then compare the presented method with similar methods proposed in the latest literature on Koopman learning. 1 Introduction Directly dealing with complex nonlinear dynamical systems for model-based control design has remained a challenge for the control community. One long-standing solution to this problem has been to use linearized models and the associated vast body of knowledge for linear analysis. Linear control theory is a very rich and well-developed field that provides rigorous control development with methods for stability and robustness guarantees. Lyapunov showed that for a linearized system that is stable around an equilibrium point, there exists a region of stability around this equilibrium point for which the original nonlinear system is also stable A.Lyapunov (1992). Recent advances in data-driven methods have spurred new and increased research interest in machine learning (ML) based methods for deriving reduced-order models (ROM) as surrogates for complex nonlinear systems. This has also led to the adoption of these methods for developing control and autonomy/automation solutions for robotic and unmanned systems. Examples include learning dynamics using deep neural networks (DNNs) Murphy (2002); Gillespie et al. (2018), physics-informed neural networks (PINNs) Raissi et al. (2019), and lifting linearization methods such as Koopman operator methods Koopman (1931); Koopman & Neumann (1932); Mezić (2015); Proctor et al. (2018); Mauroy & Goncalves (2016). Lifting linearization allows one to represent a nonlinear system with an equivalent linear system in a lifted, higher-dimensional space. One widely adopted lifting method is the extending dynamic mode decomposition (EDMD) Williams et al. (2015), which lifts the state space to a higher-dimensional space, for which the temporal evolution is approximately linear Korda & Mezić (2018). It is, however, typically difficult to find an exact finite-dimensional linear representation for most nonlinear systems. Further, the Koopman operator focuses on non-autonomous systems: dx dt = f(x, u, t). This poses a challenge for the control design of dynamical systems in the choice of a sufficient basis function necessary for the lifted system to be linear Published in Transactions on Machine Learning Research (04/2025) and exact. Extensions to such systems require truncation-based approximations, and the finite-dimensional representation is no longer exact. To this end, various eigen-decomposition-based truncations are proposed. In Lusch et al. (2017) the authors proposed using deep learning methods to discover the eigenfunctions of the approximated Koopman operator, and Yeung et al. (2019); Lusch et al. (2018); Han et al. (2020); Bevanda et al. (2021) employed DNNs as observable functions of the Koopman operator, which are tuned based on collected state-control pairs by minimizing an appropriately defined loss function which is also referred as the deep Koopman operator method (DKO). Recent work, such as Hao et al. (2024), has extended the DKO method to approximate nonlinear time-varying systems. These methods rely on a set of measured output variables that collectively define some nonlinear representation of the independent state variables. Establishing a sufficient set of these observable functions remains an active area of research. Further, real-world noisy measurements impose additional challenges. Additionally, for most practical systems, it is also critical to find computationally feasible approximation methods for extracting finite-dimensional representations. Related work. While Koopman-based methods have been proven to be effective in learning dynamics from a system s input-output data pairs. However, in practical real-world applications, the output measurements are noisy and can result in biased estimates of the linear system. Even if the noise of the state variables is assumed to be uncorrelated, the nonlinear transformations in the observables may lead to complex noiseinfluence correlations between the noise-free states and the transformed observables. Several methods are proposed to solve the measurement noise issue. One solution Sotiropoulos. (2021) is to directly measure the states and the observables, this, however, may not always be possible. Noisy measurements are also shown to further complicate the anti-causal observable problem when dealing with the lifting of controlled systems Selby (2021). In other approaches, authors in Dawson et al. (2016); Hemati et al. (2017) introduce total least square (TLS) methods in DMD, in Haseli & Cortés (2019) the authors propose a combination of the EDMD and TLS methods to account for the measurement noise, and in Sinha et al. (2020); Wanner & Mezic (2022); Sinha et al. (2023) the authors propose to solve the EDMD with measurement noise as a robust Koopman operator problem which is a min-max optimization problem. This paper extends the DKO method to the scenario where the system state data is corrupted by unknown but bounded measurement noise. As already discussed, this creates the challenge of generating additional noise transformations impacted by the DNN-derived basis functions of the deep Koopman operator. This leads to distortion of the noise, and the properties of the measurement noise and associated correlations may not remain the same after lifting. The contributions of this work are that we first propose a data-driven framework to learn the deep Koopman operator from the system states-inputs data pairs under unknown and bounded measurement noise, and then we provide numerical evidence that our proposed method can approximate the system dynamics with reasonable accuracy adequate for control applications. This paper is organized as follows. Section 2 states the problem. Section 3 presents the proposed algorithm and its theoretical development. The numerical simulations and comparison of the algorithms are shown in Section 4. Finally, Section 5 concludes the paper. Notations. We denote as the Euclidean norm. For a matrix A Rm n, A F denotes its Frobenius norm, A denotes its transpose, and A denotes its Moore-Penrose pseudoinverse. Additionally, In represents the n n identity matrix. Consider the following time-invariant system: x(t + 1) = f(x(t), u(t)), x(0) given, (1) y(t) = x(t) + w(t), (2) where t = 0, 1, 2, denotes the time index, x(t) Rn and u(t) Rm denote the system state and control input, respectively, y(t) Rn denotes the measured state, w(t) Rn corresponds to the unknown measurement noise, which is assumed to be bounded (i.e., w(t) wmax), and f : Rn Rm Rn denotes the dynamics mapping which is assumed to be unknown. Published in Transactions on Machine Learning Research (04/2025) Suppose an observed system states-inputs trajectory from time 0 to T is denoted as: ξ = {(yt, ut), t = 0, 1, 2, , T}. (3) One approach to approximating the unknown dynamics f in Eq. (1) using ξ is through the deep Koopman operator (DKO) method. We begin with an approximate representation of the original system dynamics as ˆx(t + 1) = ˆf(ˆx(t), u(t), θ) with ˆx(0) = x(0), where ˆx(t) Rn represents the system state. The function ˆf is constructed based on the Koopman operator theory-based process, as described by: g(ˆx(t + 1), θ) = Ag(ˆx(t), θ) + Bu(t), (4) ˆx(t) = Cg(ˆx(t), θ), (5) where g( , θ) : Rn Rr is represented by a Lipschitz continuous DNN with a known architecture and tunable parameters θ Rp, and A Rr r, B Rr m, C Rn r are variable matrices to be determined later. Here, Eq. (4) with r n represents the dynamics evolution in the lifted space Rr, and Eq. (5) defines the mapping between the lifted space Rr and the original space Rn. By integrating Eq. (4)-(5), the deep Koopman operator dynamics can be expressed as follows: ˆx(t + 1) = ˆf(ˆx(t), u(t), θ) = C(Ag(ˆx(t), θ) + Bu(t)), ˆx(0) = x(0). (6) The problem of interest is to determine the constant matrices A , B , C and the optimal parameter θ using the noisy trajectory ξ in Eq. (3) such that, for any 0 t T 1, the following approximation holds: g(yt+1, θ ) = A g(yt, θ ) + B ut, (7) xt = C g(yt, θ ). (8) For notational brevity, we define the set of A , B , C , θ satisfying Eq. (7)-(8) as the Deep Koopman Representation (DKR), which will be referenced throughout this paper. K = {A , B , C , θ }. (9) 3 Main Results This section first outlines the main challenges and key ideas underlying the proposed approach, followed by the presentation of an algorithm to achieve the DKR in Eq. (9). 3.1 Challenges and Key Ideas To achieve the DKR, one natural approach is to minimize the following estimation errors using ξ in Eq. (3): A , B , C , θ = arg min A,B,C,θ 1 2T t=0 xt+1 ˆf(yt, ut, θ) 2. (10) A fundamental challenge in solving Eq. (10) arises from the fact that the true system states xt are unknown in our problem setting. To address this challenge, we propose an alternative minimization problem to Eq. (10). To proceed, for any 0 t T 1, we first introduce the notation: xt+1 = f(xt, ut, θ ) + ϵt = C ( A g(xt, θ ) + B ut) + ϵt and yt+1 = f(yt, ut, θ ) + ϵt = C ( A g(yt, θ ) + B ut) + ϵt, where f and f are introduced DKO dynamics achieved using the noise-free and noisy trajectories, respectively, based on the same function g( , θ). The terms ϵt and ϵt represent the estimation errors that arise from solving the following optimization problems: A , B , C , θ = arg min A,B,C,θ 1 2T t=0 xt+1 ˆf(xt, ut, θ) 2 (11) Published in Transactions on Machine Learning Research (04/2025) A , B , C , θ = arg min A,B,C,θ 1 2T t=0 yt+1 ˆf(yt, ut, θ) 2. (12) Then, by expanding the error function in Eq. (10) using the introduced f and f, and applying the triangle inequality, we obtain: xt+1 f(xt, ut, θ ) + f(xt, ut, θ ) f(yt, ut, θ ) + f(yt, ut, θ ) yt+1 + yt+1 ˆf(yt, ut, θ) 2 yt+1 ˆf(yt, ut, θ) 2 + f(xt, ut, θ ) f(yt, ut, θ ) 2 + ϵt 2 + ϵt 2. (13) Note that ϵt and ϵt are typically small positive constants. For a detailed analysis of these estimation errors, we refer to the existing work Hao et al. (2024). Removing the constant terms from the upper bound derived in Eq. (13), we formulate the following loss function to achieve the DKR in Eq. (9): Lf(A, B, C, θ) = 1 t=0 ( yt+1 ˆf(yt, ut, θ) 2 + f(xt, ut, θ ) f(yt, ut, θ ) 2). (14) Using this development, we propose that instead of minimizing the residual in Eq. (10), we can minimize the upper bound of Eq. (10) as in Eq. (14). This minimization problem is split into two components. First, given a noisy trajectory ξ, the goal is to determine a dynamical model that approximates the relationship between yt, ut and yt+1 as stated in Eq. (12). The second component focuses on minimizing the norm difference between the model f(xt, ut, θ ) from Eq. (11) and f(yt, ut, θ ) from Eq. (12). The key challenge in solving Eq. (14) lies in quantifying the difference between the two models, f(xt, ut, θ ) and f(yt, ut, θ ). Remark 1 Note that, in contrast to the conventional error function of xt+1 ˆf(yt, ut, θ) 2 = yt+1 wt+1 ˆf(yt, ut, θ) 2 yt+1 ˆf(yt, ut, θ) 2 + wt+1 2, the proposed loss function in Eq. (14) substitutes the term wt+1 2 with the discrepancy between the system dynamics, f f 2. This formulation allows for further minimization by tuning θ, thereby enhancing robustness and stability in estimation. 3.2 Algorithm We now introduce an algorithm to solve Eq. (14) utilizing the noisy data from Eq. (3). We start by addressing the first term in Eq. (14), for which we define the following loss function: Lf,1(A, B, C, θ) = 1 t=0 g(yt+1, θ) Ag(yt, θ) But 2 + yt Cg(yt, θ) 2), (15) where the first and second parts of Lf,1 represent the estimation errors in the lifted space, as described in Eq. (4), and the original space, as outlined in Eq. (5), respectively. If the matrices A, B, C are known, the optimal θ that minimizes Lf,1 can be directly obtained using the gradient descent method. However, when these matrices are unknown, an alternative iterative approach can be employed. Specifically, let k = 0, 1, 2, be the iteration index, and let θk represent the estimation of θ at the k-th iteration. Initially, the relationship between the constant matrices and θk is established based on the trajectory ξ. Once this relationship is identified, the gradient θLf,1 can be computed only using θk, enabling the application of the gradient descent method to minimize Lf,1 by iteratively updating θk. To this end, we first introduce the following data matrices formed from ξ: Y = [y0, y1, , y T 1] Rn T , Y = [y1, y2, , y T ] Rn T , U = [u0, u1, , u T 1] Rm T , Gk = [g(y0, θk), g(y1, θk), , g(y T 1, θk)] Rr T , Gk = [g(y1, θk), g(y2, θk), , g(y T , θk)] Rr T . (16) It leads to the following compact form of Eq. (15) using given θk and the definition of the Frobenius norm: ˆLf,1(A, B, C) = 1 2T ( Gk [A B] Gk U 2 F + Y CGk 2 F ). (17) Published in Transactions on Machine Learning Research (04/2025) If the matrices Gk Rr T and Gk U R(r+m) T in Eq. (17) have full row ranks (i.e., are right-invertible), then the parameter θk can be updated using the following rule: [ A k, B k], C k = arg min [A,B],C ˆLf,1 = Gk , YG k, (18) θk+1 = θk αk θLf,1( A k, B k, C k, θk), θ0 given, (19) where the step size αk satisfies the standard conditions P k=0 αk = and P k=0 α2 k < , and θLf,1( A k, B k, C k, θk) = 1 ( θg(yt+1, θk) A k θg(yt, θk)) (g(yt+1, θk) A kg(yt, θk) B kut) ( C k θg(yt, θk) (yt C kg(yt, θk)). Note that the matrices A k, B k, and C k remain constant while computing θLf,1( A k, B k, C k, θk). To minimize the second part of Eq. (14), which quantifies the discrepancy between the two models, f(xt, ut, θ ) and f(yt, ut, θ ), we observe that this difference arises due to the measurement noise wt. To systematically characterize this discrepancy under wt, we define the following loss function: Lf,2( θ ) = 1 t=0 max wt f(xt, ut, θ ) f(yt, ut, θ ) 2 2T max wt ( t=0 g(xt, θ ) g(yt, θ ) 2 + [ A , B ] [ A , B ] 2 F + C C 2 F ). Here, we assume that the matrices A , B , C and A , B , C are obtained following the same procedure outlined in Eq. (18), using the same function g( , θ), but derived from noise-free data and noisy data, respectively. Let G and G denote the data matrices computed with an arbitrary given θ, analogous to Gk and Gk in Eq. (16), respectively. Given that the system state xt in Eq. (20) is unknown within our problem setting, we now introduce the following theorem to determine the optimal θ that minimizes Lf,2( θ ). Theorem 1 If the unknown measurement noise wt is bounded by wmax and g( , θ) is a Lipschitz continuous function, then the optimal θ that minimizes the following loss function will also minimize Eq. (20): ˆLf,2(θ) = 1 ) 1 2 F (( G 2 F + G G U 2 F ) G 2 F + G 2 F ) + (GG ) 1 2 F YG 2 F G 2 F . Proof of Theorem 1 is given in the Appendix. Based on ˆLf,1 in Eq. (17) and ˆLf,2 in Eq. (21), one have the following loss function: ˆLf(A, B, C, θ) = 1 2T ( G [A, B] G U 2 F + Y CG 2 F ) + ˆLf,2(θ). (22) In summary, rather than directly minimizing the residual in Eq. (10), this paper proposes a method to determine A , B , C , and θ that minimize Eq. (22), which serves as an upper bound for Eq. (10). The proposed algorithm follows an iterative process. In each iteration k, the algorithm first computes the matrices [ A k, B k] and C k using Eq. (18). Then, the gradient descent method is employed to update θk to find the optimal θ that minimizes ˆLf in Eq. (22). This iterative procedure is summarized in Algorithm 1, referred to as deep Koopman learning with the noisy data (DKND) in the rest of this paper. Published in Transactions on Machine Learning Research (04/2025) Algorithm 1: Deep Koopman learning with the noisy data (DKND) Input: Y, Y, U in Eq. (16). Output: A , B , C , θ . Initialization: Set the learning rate sequences {αk}K k=0 and terminal accuracy ϵ 0, build DNN g( , θ) : Rn Rr with given θ0 Rp. for k = 0, 1, 2, , K do Compute [ A k, B k] and C k by solving Eq. (18), and construct the loss function ˆLf in Eq. (22), replacing its [A, B], C, and θ with [ A k, B k], C k, and θk, respectively. Update the θk using the gradient descent: θk+1 = θk αk θ ˆLf( A k, B k, C k, θk). Stop if ˆLf < ϵ and save the resulting A k, B k, C k, and θk as A , B , C , and θ . end Remark 2 Note that by following the definition of Moore Penrose inverse (i.e., for any D Rm n with full row rank, D = D (DD ) 1), the inverse terms ( G U ) 1 and (GG ) 1 may pose challenges in com- puting the gradient of ˆ Lf. One way to address this issue is to utilize the relation θK 1 = K 1( θK)K 1, where K Rn n. This approach enables gradient computation involving matrix inverses in a more manageable form. 4 Experiments In this subsection, we first demonstrate the performance of the proposed algorithm by analyzing the estimation errors between the predicted system states and the true noise-free states across four benchmark dynamics: one 2D simple linear discrete time-invariant dynamics: xt+1 = 0.9 0.1 0 0.8 ut, x0 = 1 0 cartpole (xt R4, ut R) and lunar lander (xt R6, ut R2) examples from the Openai gym Brockman et al. (2016), and one real-world example of unmanned surface vehicles (xt R6, ut R2), of which the details can be found in Li et al. (2024). Then, we compare the proposed algorithm with related methods. Experiment setup. In this experiment, we first gather noise-free state-input pairs D = {(xt, ut)}T t=0 from the aforementioned four examples, where ut represents randomly generated control inputs drawn from a uniform distribution bounded between 1 and 1. Subsequently, we introduce three types of bounded measurement noise: Gaussian noise (w G t ) with mean µ = 0 and standard deviation σ = 2, Poisson distribution (w P t ) with an expected separation λ = 3, and uniform distribution (w U t ) generated from the open interval [ 1, 2). To ensure bounded noise, we apply a clipping procedure to the measurement noise. These noise types are added to the system states to yield noisy measurements. Specifically, we denote the noisy measurements under Gaussian noise as y G t = xt + w G t . The corresponding dataset, denoted DG = {(y G t , ut)}T t=0, is used for the experiments. To facilitate training and testing, we allocate 80% of DG to train DKND (denoted as DG train), reserving the remaining 20% for testing (denoted as DG test). For performance evaluation, we compute the root mean square deviation (RMSD) over the test dataset DG test: RMSD(DG test) = v u u t 1 |DG test| (yt,ut) DG test xt+1 ˆf(yt, ut, θ ) 2, where |DG test| denote the number of data pairs (yt, ut) in DG test, and ˆ f represents the estimated dynamics obtained from the proposed DKND method. Additionally, we compare the performance of DKND against three baseline algorithms: DK, which solves Eq. (12) using noisy measurements yt, DMDTLS from Dawson et al. (2016), and the multilayer perceptron (MLP) approach. To fairly evaluate the algorithms, we assign the above methods with the same DNN structure, training parameters (e.g., learning rate, training epochs, etc.), Published in Transactions on Machine Learning Research (04/2025) and training and testing datasets. To mitigate the influence of random initialization of DNN parameters, each gradient-based method is run for 10 experimental trials. The average RMSD and their standard deviations are reported in the tables over these 10 trials. RMSD Methods 2D example Cartpole Lunar lander Surface vehicle Gaussian noise proposed 0.1963 0.0002 0.3705 0.0027 0.4007 0.0004 0.3059 0.0091 DK 0.2149 0.0106 0.5604 0.0037 0.4730 0.0051 0.3068 0.0027 MLP 0.2118 0.0016 0.3987 0.0006 0.4650 0.0027 0.2960 0.0030 DMDTLS 0.2483 29.9163 0.6836 2.1779 - wmax = 0.8 wmax = 1.0 wmax = 1.0 wmax = 1.0 Poisson noise proposed 0.4431 0.0018 0.6975 0.0025 0.8269 0.0029 0.7642 0.0031 DK 0.4707 0.0206 0.7996 0.0075 0.8565 0.0031 0.7768 0.0013 MLP 0.4644 0.0011 0.6808 0.0017 0.8329 0.0019 0.7727 0.0009 DMDTLS 0.4709 3.7518 0.9268 2.0631 - wmax = 1.2 wmax = 1.3 wmax = 1.5 wmax = 1.5 Uniform noise proposed 1.4471 0.0089 2.1534 0.2912 2.1224 0.5110 1.7541 0.0177 DK 1.7493 0.1877 2.3839 0.1021 2.6577 0.0422 1.5712 0.0264 MLP 2.3287 0.0236 4.0224 0.0035 4.8612 0.0037 1.7127 0.0248 DMDTLS 27.2598 39.2297 286.4929 24.2376 - wmax = 5.3 wmax = 7.2 wmax = 8.2 wmax = 1.2 Table 1: Averaged RSMD over training data. RSMD Methods 2D example Cartpole Lunar lander Surface vehicle Gaussian noise proposed 0.2074 0.0008 0.3974 0.0076 0.4877 0.0185 0.4539 0.0661 DK 0.2124 0.0072 0.6190 0.0088 0.8433 0.0737 0.5642 0.1301 MLP 0.3721 0.0311 0.8757 0.0172 1.9348 0.1598 0.7048 0.0581 DMDTLS 0.2514 28.3258 0.6551 2.4107 - wmax = 0.8 wmax = 1.0 wmax = 1.0 wmax = 1.0 Poisson noise proposed 0.4551 0.0014 0.7118 0.0030 0.8268 0.0255 0.9456 0.0485 DK 0.4784 0.0211 0.8281 0.0088 1.0857 0.1158 1.0846 0.1584 MLP 0.4888 0.0053 1.0596 0.0429 1.8316 0.1631 0.9229 0.0612 DMDTLS 0.4709 4.4250 0.8958 3.3943 - wmax = 1.2 wmax = 1.3 wmax = 1.5 wmax = 1.5 Uniform noise proposed 1.4832 0.0117 2.1362 0.2796 2.3323 0.6968 2.2739 0.1600 DK 2.0752 0.2770 2.3234 0.1033 3.0809 0.0639 3.9145 0.6408 MLP 2.6835 0.0831 4.8319 0.0939 6.2894 0.1411 3.1910 0.4081 DMDTLS 26.7608 40.9191 288.2916 24.6145 - wmax = 5.3 wmax = 7.2 wmax = 8.2 wmax = 1.2 Table 2: Averaged RSMD over testing data. Results analysis. As presented in Tables. 1-2, the proposed DKND method achieves smaller average RSMD and standard deviation on testing data when compared to other methods, even as the complexity of the dynamics is increasing. Specifically, when the noise follows a uniform distribution and wt grows larger, the gap in RSMD between the proposed DKND and DK methods becomes more pronounced. Note that the RSMD for all gradient-based comparison methods over the training data does not show significant differences. This can be attributed to the fact that their training processes are terminated at the same terminal accuracy. Figs. 1-4 display detailed estimation error plots across the testing data, with shaded regions indicating the variability across 10 trials. Due to space limitations, additional experimental details, such as the generation of measurement noise, the structure of the DNNs, and the training parameters, are provided in the Appendix. Published in Transactions on Machine Learning Research (04/2025) 0 20 40 60 80 100 Data index (k) Estimation errors DMDTLS MLP DK DKND (a) Gaussian noise. 0 20 40 60 80 100 Data index (k) Estimation errors DMDTLS MLP DK DKND (b) Poisson noise. 0 20 40 60 80 100 Data index (k) Estimation errors MLP DK DKND (c) Uniform noise. Figure 1: Prediction errors over testing data for the linear dynamics example. 0 20 40 60 80 100 120 Data index (k) Estimation errors MLP DK DKND (a) Gaussian noise. 0 20 40 60 80 100 120 Data index (k) Estimation errors MLP DK DKND (b) Poisson noise. 0 20 40 60 80 100 120 Data index (k) Estimation errors MLP DK DKND (c) Uniform noise. Figure 2: Prediction errors over testing data for the cartpole example. 0 50 100 150 200 250 300 Data index (k) Estimation errors DMDTLS MLP DK DKND (a) Gaussian noise. 0 50 100 150 200 250 300 Data index (k) Estimation errors DMDTLS MLP DK DKND (b) Poisson noise. 0 50 100 150 200 250 300 Data index (k) Estimation errors MLP DK DKND (c) Uniform noise. Figure 3: Prediction errors over testing data for the lunar lander example. Published in Transactions on Machine Learning Research (04/2025) 0 10 20 30 40 50 60 70 80 Data index (k) Estimation errors MLP DK DKND (a) Gaussian noise. 0 10 20 30 40 50 60 70 80 Data index (k) Estimation errors MLP DK DKND (b) Poisson noise. 0 10 20 30 40 50 60 70 80 Data index (k) Estimation errors MLP DK DKND (c) Uniform noise. Figure 4: Prediction errors over testing data for the surface vehicle example. 5 Concluding Remarks In this paper, we have introduced a data-driven framework called Deep Koopman Learning with Noisy Data (DKND) to address the challenge of learning system dynamics from data affected by measurement noise. By learning dynamics, we refer to estimating dynamics where, given yt, ut, the output of the estimated dynamics, ˆxt+1, approximates the true system state xt+1 with reasonable accuracy. The key contribution of this work lies in modifying the existing deep Koopman framework by explicitly characterizing the noise effect on the learned representation in Eq. (9) and mitigating the impact of noise on the DKR through tuning the DNN parameters to minimize Eq. (21) requiring only that the measurement noise be bounded. We evaluated the proposed DKND framework on datasets with three different types of measurement noise, using examples including simple 2D dynamics, cartpole, lunar lander, and surface vehicle systems. Our results demonstrate the robustness of DKND under different types of measurement noise compared to related methods. Limitations. Since the formulation presented in this paper only addresses the scenario where the measurement noise is bounded, the effect of this bound on the performance of the proposed approach is not formally investigated and remains an open question. Due to the non-convex nature of DNN optimization, the DKND framework is inherently limited to achieving local minima. Future research could explore several aspects, including the design of optimal control strategies based on the learned dynamics using the measured noisy system states. Acknowledgments This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) via Contract No. N65236-23-C-8012 and under subcontract to Saab, Inc. as part of the Refle XAI project. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the DARPA, the U.S. Government, or Saab, Inc. A.Lyapunov. The general problem of the stability of motion. International Journal of Control, 55(3), 1992. Petar Bevanda, Max Beier, Sebastian Kerz, Armin Lederer, Stefan Sosnowski, and Sandra Hirche. Koopmanizingflows: Diffeomorphically learning stable koopman operators. ar Xiv preprint ar Xiv:2112.04085, 2021. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. ar Xiv preprint ar Xiv:1606.01540, 2016. Scott TM Dawson, Maziar S Hemati, Matthew O Williams, and Clarence W Rowley. Characterizing and correcting for the effect of sensor noise in the dynamic mode decomposition. Experiments in Fluids, 57: 1 19, 2016. Published in Transactions on Machine Learning Research (04/2025) Morgan T Gillespie, Charles M Best, Eric C Townsend, David Wingate, and Marc D Killpack. Learning nonlinear dynamic models of soft robots for model predictive control with neural networks. In 2018 IEEE International Conference on Soft Robotics (Robo Soft), pp. 39 45. IEEE, 2018. Yiqiang Han, Wenjian Hao, and Umesh Vaidya. Deep learning of koopman representation for control. In 2020 59th IEEE Conference on Decision and Control (CDC), pp. 1890 1895. IEEE, 2020. Wenjian Hao, Bowen Huang, Wei Pan, Di Wu, and Shaoshuai Mou. Deep koopman learning of nonlinear time-varying systems. Automatica, 159:111372, 2024. Masih Haseli and Jorge Cortés. Approximating the koopman operator using noisy data: noise-resilient extended dynamic mode decomposition. In 2019 American Control Conference (ACC), pp. 5499 5504. IEEE, 2019. Maziar S Hemati, Clarence W Rowley, Eric A Deem, and Louis N Cattafesta. De-biasing the dynamic mode decomposition for applied koopman spectral analysis of noisy datasets. Theoretical and Computational Fluid Dynamics, 31:349 368, 2017. Bernard O Koopman. Hamiltonian systems and transformation in hilbert space. Proceedings of the national academy of sciences of the united states of america, 17(5):315, 1931. Bernard O Koopman and J v Neumann. Dynamical systems of continuous spectra. Proceedings of the National Academy of Sciences, 18(3):255 263, 1932. Milan Korda and Igor Mezić. Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control. Automatica, 93:149 160, 2018. Jianwen Li, Hyunsang Park, Wenjian Hao, Lei Xin, Jalil Chavez-Galaviz, Ajinkya Chaudhary, Meredith Bloss, Kyle Pattison, Christopher Vo, Devesh Upadhyay, et al. C3d: Cascade control with change point detection and deep koopman learning for autonomous surface vehicles. ar Xiv preprint ar Xiv:2403.05972, 2024. Bethany Lusch, Steven L Brunton, and J Nathan Kutz. Data-driven discovery of koopman eigenfunctions using deep learning. Bulletin of the American Physical Society, 2017. Bethany Lusch, J Nathan Kutz, and Steven L Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature communications, 9(1):1 10, 2018. Alexandre Mauroy and Jorge Goncalves. Linear identification of nonlinear systems: A lifting technique based on the koopman operator. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 6500 6505. IEEE, 2016. Igor Mezić. On applications of the spectral theory of the koopman operator in dynamical systems and control theory. In 2015 54th IEEE Conference on Decision and Control (CDC), pp. 7034 7041. IEEE, 2015. Kevin Patrick Murphy. Dynamic bayesian networks: representation, inference and learning. University of California, Berkeley, 2002. Joshua L Proctor, Steven L Brunton, and J Nathan Kutz. Generalizing koopman theory to allow for inputs and control. SIAM Journal on Applied Dynamical Systems, 17(1):909 930, 2018. Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686 707, 2019. Nicholas Stearns Selby. On the Application of Machine Learning and Physical Modeling Theory to Causal Lifting Linearizations of Nonlinear Dynamical Systems with Exogenous Input and Control. MIT Dissertation, Cambridge, 2021. Published in Transactions on Machine Learning Research (04/2025) Subhrajit Sinha, Bowen Huang, and Umesh Vaidya. On robust computation of koopman operator and prediction in random dynamical systems. Journal of Nonlinear Science, 30(5):2057 2090, 2020. Subhrajit Sinha, Sai P Nandanoori, and DA Barajas-Solano. Online real-time learning of dynamical systems from noisy streaming data. Scientific Reports, 13(1):22564, 2023. Filippos E Sotiropoulos. Methods for Control in Robotic Excavation. MIT Dissertation, Cambridge, 2021. Mathias Wanner and Igor Mezic. Robust approximation of the stochastic koopman operator. SIAM Journal on Applied Dynamical Systems, 21(3):1930 1951, 2022. Matthew O Williams, Ioannis G Kevrekidis, and Clarence W Rowley. A data driven approximation of the koopman operator: Extending dynamic mode decomposition. Journal of Nonlinear Science, 25(6): 1307 1346, 2015. Enoch Yeung, Soumya Kundu, and Nathan Hodas. Learning deep neural network representations for koopman operators of nonlinear dynamical systems. In 2019 American Control Conference (ACC), pp. 4832 4839. IEEE, 2019. In this section, we first provide the proof of Theorem 1, followed by an overview of the detailed simulation setup used in the experiments presented throughout this paper. A.1 Proof of Theorem 1. Recall that G and G denote the data matrices constructed using an arbitrary given parameter vector θ, in the same structure as Gk and Gk in Eq. (16), respectively. Before presenting the proof, we introduce the following notions. Define δgt = g(yt, θ) g(xt, θ) as the difference between the DNN g( , θ) evaluated at the observed noisy system state yt = xt +wt and the true system state xt. Next, we introduce the following data matrices: G = [δg0, δg1, , δg T 1] Rr T , G = [δg1, δg2, , δg T ] Rr T , Gx = [g(x0, θ), g(x1, θ), , g(x T 1, θ)] Rr T , Gx = [g(x1, θ), g(x2, θ), , g(x T , θ)] Rr T , X = [x0, x1, , x T 1] Rn T , X = [x1, x2, , x T ] Rn T , W = [w0, w1, , w T 1] Rn T , W = [w1, w2, , w T ] Rn T , For brevity, we omit the iteration index k, as the constant matrices of f and f are assumed to be computed using the same function g( , θ). Utilizing Eq. (16) and Eq. (23), we obtain: Y = X + W, Y = X + W, G = Gx + G, G = Gx + G. (24) We now proceed by minimizing Eq. (11) (over the noise-free trajectory) with respect to the dynamics matrices. The solution to this problem is analogous to the one derived in Eq. (18) (over the noisy trajectory). By utilizing the notations introduced in Eq. (23), the following results can be obtained through a reformulation of Eq. (18): [ A , B ] = Gx C = XG x. (26) Published in Transactions on Machine Learning Research (04/2025) We then expand Eq. (25)-(26) using Eq. (24) and the definition of the Moore Penrose inverse. This results in the following expression: = ( G G) G G U = ( G G) G G U [( G G) G + GG , GU ] | {z } Nw + ( G G) G GG GU (27) and C = XG x = (Y W)(G G) , = (Y W)(G G) ((G G)(G G) ) 1, = (YG ((Y + W) G WG ) | {z } Nw )(GG |{z} P + (( G + G) G GG ) | {z } Mw Ir) 1. (28) Note here that [ A , B ] = G G U ) 1 and C = YG (GG ) 1. Applying Sherman Morrison formula to Eq. (27)-(28), that is, for given invertible matrix P Rn n and column vectors m, v Rn, if 1 + v P 1m = 0, the following holds: (P + mv ) 1 = P 1 P 1mv P 1 1 + v P 1m . Using this formula, we derive the following results: [ A , B ] = [ A , B ] + (Nw P 1 [ A , B ])Mw P 1(Ir+m + P 1Mw) 1 Nw P 1 (29) and C = C + ( Nw P 1 C ) Mw P 1(Ir + P 1 Mw) 1 Nw P 1. (30) Here, we recall the dynamics difference Lf,2( θ ) defined in Eq. (20), where we replace θ with θ, as θ represents the optimal solution of Lf,1, which is found using the same variable θ in the overall loss function Lf = Lf,1 + Lf,2 as defined in Eq. (14), given by: Lf,2(θ) = 1 2T max wt ( t=0 g(xt, θ ) g(yt, θ) 2 + [ A , B ] [ A , B ] 2 F + C C 2 F ). By following Eq. (29)-(30), Lf,2(θ) becomes Lf,2(θ) = 1 2T max wt ( (Nw P 1 [ A , B ])Mw P 1(Ir+m + P 1Mw) 1 Nw P 1 2 F + ( Nw P 1 C ) Mw P 1(Ir + P 1 Mw) 1 Nw P 1 2 F + t=0 g(xt, θ ) g(yt, θ) 2). (31) To proceed and for clarity, we define δgmax t = g(xt + wmax, θ) g(xt, θ) and Gmax = [δgmax 0 , δgmax 1 , , δgmax T 1] Rr T , Gmax = [δgmax 1 , δgmax 2 , , δgmax T ] Rr T , Wmax = [wmax, wmax, , wmax] Rn T , Ymax = X + Wmax. Published in Transactions on Machine Learning Research (04/2025) Accordingly, we define the following matrices under the maximum measurement noise: Nmax = [( G Gmax) G max + Gmax G , Gmax U ], Nmax = (Ymax + Wmax) G max Wmax G , Mmax = ( Gmax G) G max Gmax G Gmax U Mmax = ( G + Gmax) G max Gmax G . Then, by applying the triangle inequality and utilizing the fact that g(x, θ) is Lipschitz continuous with Lipschitz constants Lx and Lθ, where Lx denotes the Lipschitz constant of g with respect to x and Lθ denotes the Lipschitz constant of g with respect to θ, the function Lf,2(θ) can be expressed as: Lf,2(θ) = 1 2T ( (Nmax P 1 [ A , B ])Mmax P 1(Ir+m + P 1Mmax) 1 Nmax P 1 2 F + ( Nmax P 1 C ) Mmax P 1(Ir + P 1 Mmax) 1 Nmax P 1 2 F t=0 g(xt, θ ) g(xt + wmax, θ) 2) 2T ( Nmax P 1 2 F + [ A , B ] 2 F ) Mmax P 1 2 F (Ir+m + P 1Mmax) 1 2 F + Nmax P 1 2 F + ( Nmax P 1 2 F + C 2 F ) Mmax P 1 2 F (Ir + P 1 Mmax) 1 2 F + Nmax P 1 2 F +TL2 g), (33) where Lg = 2max{Lx, Lθ}. Observing that the upper bound in Eq. (33) contains complex terms (Ir+m + P 1Mmax) 1 2 F and (Ir + P 1 Mmax) 1 2 F , which complicate the minimization of the upper bound by tuning θ, we propose an alternative approach. Instead of attempting to find a solution that makes (Ir+m + P 1Mmax) 1 2 F = 0 and (Ir + P 1 Mmax) 1 2 F = 0, we aim to achieve P 1Mmax 2 F = 0 and P 1 Mmax 2 F = 0 such that (Ir+m + P 1Mmax) 1 2 F = r + m and (Ir + P 1 Mmax) 1 2 F = r, and then proceed to minimize the remaining terms. Thus, we define the upper bound of Eq. (33) as: Lf,2(θ) = 1 Nmax P 1 2 F + [ A , B ] 2 F ) Mmax P 1 2 F P 1Mmax 2 F + Nmax P 1 2 F + ( Nmax P 1 2 F + C 2 F ) Mmax P 1 2 F P 1 Mmax 2 F + Nmax P 1 2 F +TL2 g) Nmax 2 F P 1 2 F + [ A , B ] 2 F ) Mmax 4 F P 1 4 F + Nmax 2 F P 1 2 F + ( Nmax 2 F P 1 2 F + C 2 F ) Mmax 2 F P 1 4 F + Nmax 2 F P 1 2 F +TL2 g). Here, using the the Lipschitz continuity of g and Eq. (32), one obtains: Nmax 2 F ( G 2 F + G 2 F +(TLxwmax)2+ U 2 F )(TLxwmax)2, Mmax 2 F (2 G 2 F +2 U 2 F +(TLxwmax)2)(TLxwmax)2, Nmax 2 F ( Y 2 F +(Twmax)2)(TLxwmax)2 + (Twmax)2 G 2 F , Mmax 2 F (2 G 2 F +(TLxwmax)2)(TLxwmax)2. Finally, using Eq. (35) and removing the constant terms while merging the repeated terms from Eq. (34), the resulting loss function is expressed as: ˆLf,2(θ) = 1 P 1 2 F (( G 2 F + [ A , B ] 2 F ) G 2 F + G 2 F )+ P 1 2 F C 2 F G 2 F , where [ A , B ] = G G U and C = YG . Published in Transactions on Machine Learning Research (04/2025) A.2 Simulation details In this subsection, we provide the simulation details regarding the experiment in Section 4. A.2.1 Computation resource and training parameters 2D dynamics Cartpole Lunar lander Surface vehicle Optimizer Adam Accuracy (ϵ) 1e 4 Training epochs (S) 1e4 Learning rate (αk) 1e 5 The number of data pairs (T) 500 600 1600 600 Compute device Apple M2, 16GB RAM Table 3: Training parameters. A.2.2 DNNs architecture The DNN architectures of method DKND and DK used in this paper are presented in Table 4. We refer to https://pytorch.org/docs/stable/nn.html for the definition of functions Linear(), Re LU() and we denote layeri as the i-th layer of the DNN and Linear([n, m]) denotes a linear function with a weight matrix of shape n m. Since for the DKND and DK methods, the input of its DNN observable function is the 2D dynamics Cartpole Lunar lander Surface vehicle layer1 type Linear([2, 512]) Linear([4, 512]) Linear([6, 512]) Linear([6, 512]) layer2 type Re LU() Re LU() Re LU() Re LU() layer3 type Linear([512, 128]) Linear([512, 128]) Linear([512, 128]) Linear([512, 128]) layer4 type Re LU() Re LU() Re LU() Re LU() layer5 type Linear([128, 4]) Linear([128, 6]) Linear([128, 4]) Linear([128, 10]) Table 4: DNN structures of DKND and DK. measured state yt and the input of the MLP method is a stacked vector of [y t, u t] , we show the DNN structure of the MLP method in the following table. 2D dynamics Cartpole Lunar lander Surface vehicle layer1 type Linear([3, 512]) Linear([5, 512]) Linear([8, 512]) Linear([8, 512]) layer2 type Re LU() Re LU() Re LU() Re LU() layer3 type Linear([512, 128]) Linear([512, 128]) Linear([512, 128]) Linear([512, 128]) layer4 type Re LU() Re LU() Re LU() Re LU() layer5 type Linear([128, 4]) Linear([128, 6]) Linear([128, 4]) Linear([128, 10]) Table 5: DNN structures of MLP.