# Learning Stable Stochastic Nonlinear Dynamical Systems

Jonas Umlauft¹, Sandra Hirche¹

¹Chair of Information-oriented Control, Technical University of Munich, Munich, Germany. Correspondence to: Jonas Umlauft.

Abstract

A data-driven identification of dynamical systems requiring only minimal prior knowledge is promising whenever no analytically derived model structure is available, e.g., from first principles in physics. However, meta-knowledge on the system's behavior is often given and should be exploited: stability as a fundamental property is essential when the model is used for controller design or movement generation. Therefore, this paper proposes a framework for learning stable stochastic systems from data. We focus on identifying a state-dependent coefficient form of the nonlinear stochastic model which is globally asymptotically stable according to probabilistic Lyapunov methods. We compare our approach to other state-of-the-art methods on real-world datasets in terms of flexibility and stability.

1. Introduction

An accurate identification of the system dynamics is the first and very crucial step in many modern control methods. Although reinforcement learning also allows a model-free search for optimal policies, it is known to be less efficient and difficult to analyze. Therefore, classical control engineers employ system identification techniques to obtain parametric model descriptions of dynamical systems from observation data, e.g., in the linear case ARX and ARMAX models. The identification focuses on model selection, i.e., finding the model structure and the corresponding set of parameters. But often this set of model candidates is difficult to find, especially for complex, possibly non-deterministic, systems (Ljung, 1998). Therefore, the need for data-driven models has emerged recently, as control engineering is increasingly applied in areas without an analytic description of the dynamical system. We consider the following two application scenarios: First, assume a set of trajectories for a robotic task is given through human demonstrations, e.g., object grasping. The goal is to represent the motion with a dynamical system. To ensure the reproduction terminates at the desired final point (the object to grasp), we introduce the stability constraint. Second, consider a dynamical system which is known to be stable, e.g., a pendulum which rests in the hanging position. The goal is to identify the dynamics precisely without further physical insight.

Bayesian non-parametric methods, more particularly Gaussian Processes (GPs), were successfully employed by Kocijan et al. (2005) and Wang et al. (2005) for system identification. Other approaches focus on learning switching linear systems (Fox et al., 2009) or employ an EM algorithm (Ghahramani & Roweis, 1999) for nonlinear systems. However, these approaches neglect the prior assumption that the dynamical system is stable, which becomes crucial when the learned model is used as a generative process, such as in movement generation for robotics (Ijspeert et al., 2002). If stability is not considered during learning, the identified model suffers from spurious attractors, which are not part of the true dynamics, or from instability.
Only little work has merged the extensive knowledge on stability theory from control engineering with the powerful data-driven approaches for system identification: For example, Boots et al. (2008) and Chiuso & Pillonetto (2010) take stability constraints for learning dynamical systems into account, but are limited to linear systems. The work by Khansari-Zadeh & Billard (2011) ensures stability of the system by constraining the optimization of a Gaussian Mixture Model (GMM) to stability conditions derived from Lyapunov methods. The work by Paraschos et al. (2013) relies on a phase variable to ensure stability, which makes the approach time-dependent and therefore less robust. Control Lyapunov functions are used by Khansari-Zadeh & Billard (2014) to ensure global stability of the learned system. These approaches partially employ probabilistic models (GP, GMM), but limit the analysis to the deterministic part by only considering the mean regression. By discarding the true underlying probability distribution, information regarding the reliability of the model provided by the data is lost. This leads to overconfident conclusions regarding performance or safety on the real system.

Therefore, this work proposes a framework for learning probabilistic nonlinear dynamical systems from observations which takes the prior assumption of stability into account. The required stochastic stability conditions of the discrete-time Markov processes are derived from Lyapunov theory. We provide simulation results to validate the proposed approach and compare it to the previously mentioned methods for identifying dynamical systems.

2. Problem Formulation

We consider an autonomous, dynamical, discrete-time system with continuous-valued state $x_k \in \mathcal{X} = \mathbb{R}^d$. The state evolves according to an unknown stochastic process¹

$$x_{k+1} = \hat{f}(x_k, \hat{\omega}_k), \qquad (1)$$

with initial value $x_0 \in \mathcal{X}$, where $\hat{\omega}_k$ is a random variable from the probability space $(\Omega, \mathcal{F}, P)$ with sample space $\Omega$, the corresponding $\sigma$-algebra $\mathcal{F}$ and the probability measure $P$. Since $x_k \in \mathcal{X}$ is fixed at each step, (1) describes a state-dependent distribution over $x_{k+1}$. A realization of $\hat{\omega}_k \in \Omega$ is drawn at every time step, yielding a realization of the next state. As the distribution of $x_{k+1}$ only depends on the state at time step $k$, $\hat{f}$ is a Markov process, denoted by $\{x_k\}$.

We assume that consecutive measurements of the state are available, thus $N$ data pairs are given in the training set $\mathcal{D} = \{\bar{x}_n, \bar{x}_{n+1}\}_{n=1}^{N}$. Based on these measurements, we model the unknown dynamics $\hat{f}$, including the distribution of $\hat{\omega}_k$, using the prior knowledge that the stochastic process (1) converges to the origin $x_k = 0$. The model consists of the mapping $f^\psi$ and an encoding of the random variable $\omega$, defined by a finite parameter vector $\psi \in \Psi$. As the model $f^\psi$ must explain the data $\mathcal{D}$ as well as possible, the problem is formulated as a constrained likelihood maximization

$$\psi^* = \arg\max_{\psi} \sum_{n=1}^{N} \log P\big(\bar{x}_{n+1} \,\big|\, \bar{x}_n, f^\psi\big), \qquad (2a)$$
$$\text{s.t. } \{x_k\} \text{ converges to the origin for } k \to \infty. \qquad (2b)$$

As different stochastic stability concepts exist, the convergence in (2b) is defined as convergence with probability one (w.p.1) (Kushner, 1971):

Definition 1 (Convergence w.p.1). $\{x_k\}$ converges to the origin w.p.1 if, for each $\epsilon > 0$, $\|x_k\| \geq \epsilon$ occurs only finitely often.

¹Notation: Bold symbols denote vectors or multivariate functions, capital letters matrices, and $I_p$ the $p \times p$ identity matrix. $A \succ 0$ denotes positive definiteness of the matrix $A$; $\mathrm{E}[\cdot]$ denotes the expected value, $\mathrm{V}[\cdot]$ the variance of a random variable, and $\mathrm{C}[\cdot,\cdot]$ the covariance between two random variables, where $\mathrm{C}[a] = \mathrm{C}[a, a]$. $\bar{X}$ denotes a realization of the random variable $X$. Imitating Matlab indexing, $A_{(:,i)}$ denotes the $i$-th column, $A_{(j,:)}$ the $j$-th row, and $A_{(1:2,i)}$ the first and second element in the $i$-th column of $A$. The $i$-th entry of the vector $x_k$ is denoted $x_{k,i}$.
This also implies the following type of convergence, which might be more intuitive to the reader.

Definition 2 (Convergence in probability). The chain $\{x_k\}$ converges to the origin in probability if $P(\|x_k\| \geq \epsilon) \to 0$ for each $\epsilon > 0$.

We do not consider any control input here; thus, the identification takes place for the closed-loop system with an existing controller, or for an uncontrolled system.

3. Stability Conditions for the Model

3.1. The Model

Consider the state-dependent coefficient form of $f^\psi$,

$$x_{k+1} = A(x_k)\, x_k, \qquad (3)$$

where, for a fixed $x_k$, $A$ is a random variable from the probability space $(\Omega_A, \mathcal{F}_A, P_A)$ with sample space $\Omega_A \subseteq \mathbb{R}^{d \times d}$. The probability density function of $A$ is specified by the vector $\theta \in \Theta$, which is state-dependent through $\theta^\psi : \mathcal{X} \to \Theta$. This mapping is itself parametrized by a vector $\psi$. At each step, a realization of $A$, denoted by $\bar{A}$, is drawn and multiplied by the state $x_k$ to proceed by one step. This is visualized in Figure 1 along with the two-layer model structure: The first layer maps the current state $x_k \in \mathcal{X}$ onto the parameter $\theta \in \Theta$, denoted by $\theta^\psi : \mathcal{X} \to \Theta$. The mapping is parametrized by $\psi$. The second layer is the probability distribution on $A$, which assigns to each element of the sample space $\Omega_A$ a probability based on $\theta$.

[Figure 1. Illustration of the two-layer design of the proposed model and the process of simulation.]

To illustrate this multilayer design, we give a brief example in the scalar case $d = 1$: Assume $A(x_k)$ follows, for a given $x_k$, a Gaussian distribution $A \sim \mathcal{N}(\mu, \sigma)$. Therefore, the parameter vector is $\theta_\mathcal{N} = [\mu \;\; \sigma]^\top$ with $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}_+$, thus $\Theta_\mathcal{N} = \mathbb{R} \times \mathbb{R}_+$. The dependency of these parameters on the current state $x_k$ is expressed in $\theta^\psi_\mathcal{N}$, e.g.,

$$\begin{bmatrix} \mu(x_k) \\ \sigma(x_k) \end{bmatrix} = \theta^\psi_\mathcal{N}(x_k) = \begin{bmatrix} w x_k \\ z x_k^2 \end{bmatrix},$$

where a linear dependency of the mean on the state and a quadratic relation between variance and state are assumed. The parameters defining $\theta^\psi_\mathcal{N}$ here are $\psi = [w \;\; z]^\top$. Generally, the first layer $\theta^\psi : \mathcal{X} \to \Theta$ can be any state-of-the-art parametric regression method parametrized by $\psi$. For layer two, any probability distribution with a fixed set of parameters is applicable for $A$.

Leaving the stochastic aspect aside, model (3) is the state-dependent coefficient (SDC) form, which is reached by factorizing a nonlinear system into a linear-like structure. It was shown that for any continuously differentiable function $f$ with $f(0) = 0$, there exists a matrix-valued function $A(x)$ such that $f(x) = A(x)x$; see Cimen (2008). Thus, the SDC form does not limit the expressive power of our model. It also reflects the setup of many real-world systems: e.g., consider an actuator whose output is generally noisy, where the magnitude of the noise depends on the temperature. By modeling the temperature as a state, the model (3) allows capturing this varying precision of the actuator.
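To make the two-layer construction concrete, the following minimal sketch simulates the scalar Gaussian example above; the values of $\psi = [w \;\; z]^\top$, the seed, the initial state and the horizon are illustrative assumptions, not parameters from the paper.

```python
# Minimal sketch of the two-layer model (3) for the scalar Gaussian example.
# Layer one maps the state to the distribution parameters theta = (mu, sigma);
# layer two draws a realization of A(x_k). All numerical values are assumed.
import numpy as np

rng = np.random.default_rng(0)
w, z = -0.1, 0.01            # assumed psi = [w, z]

def theta_psi(x):
    """Layer one: state -> Gaussian parameters of A(x)."""
    return w * x, z * x**2   # mu linear in x, sigma quadratic in x

x = 5.0                      # assumed initial state x_0
for k in range(50):
    mu, sigma = theta_psi(x)
    A_bar = rng.normal(mu, sigma)   # layer two: realization of A(x_k)
    x = A_bar * x                   # x_{k+1} = A(x_k) x_k
print(x)                     # one sample path after 50 steps
```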
The structure of the model (3) combines two important criteria: First, it provides more flexibility than a linear system with random parameters, so it also encodes nonlinear dynamics. Second, it is simple enough to allow a quadratic Lyapunov function analysis, and therefore the derivation of analytic constraints for convergence, as needed for the optimization in (2).

3.2. Stability Analysis

To approach the problem as formulated in Section 2, an analytic condition for the constraint (2b) of the optimization problem, given that $f^\psi$ is of the form (3), is needed. The literature on stability criteria for dynamical systems is very rich, and for nonlinear systems, Lyapunov-type methods are often used. They are based on the following idea: If there is a function representing the energy in the system (called a Lyapunov function) which constantly decreases over time, the state will converge to a zero-energy state, the origin. More precisely, the Lyapunov function must be positive definite and strictly decreasing over time, except at the origin. Using the stochastic discrete-time version of Lyapunov methods and the Borel-Cantelli lemma leads to the following conditions for exponential stability (which implies convergence w.p.1 as defined in Definition 1).

Theorem 1 (Exponential Stability, (Kushner, 1971)). Given a positive definite function $V(x_k) \succ 0$ for which

$$\mathrm{E}[V(x_{k+1}) \mid x_k] - V(x_k) \leq -\alpha V(x_k), \quad \forall x_k \in \mathcal{X} \setminus \{0\}, \qquad (5)$$

for some $\alpha > 0$, then

$$\mathrm{E}[V(x_{k+m}) \mid x_k] \leq (1 - \alpha)^m\, V(x_k) \quad \text{and} \qquad (6)$$
$$V(x_{k+m}) \to 0 \;\text{ for } m \to \infty \;\text{(w.p.1)}. \qquad (7)$$

For the class of systems in (3), a quadratic function $V(x_k)$ is a proper Lyapunov function to derive sufficient stability constraints for arbitrary distributions of $A$, as shown in the following proposition.

Proposition 1 (Stability of the model (3)). Consider a stochastic process generated by (3), where in each step a realization of $A$ is drawn from the sample space $\Omega_A \subseteq \mathbb{R}^{d \times d}$. The process is globally exponentially stable at $x_k = 0$, for any $x_0 \in \mathcal{X}$, if there exists a $P \succ 0$ such that

$$\mathrm{E}[A(x_k)]^\top P\, \mathrm{E}[A(x_k)] + Q - (1 - \alpha) P \preceq 0, \quad \forall x_k \in \mathcal{X}, \qquad (8)$$

for some $\alpha > 0$, where $Q$ is defined as

$$Q_{(i,j)}(x_k) = \sum_{l} P_{(l,:)}\, \mathrm{C}\big[A_{(:,i)}(x_k),\, A_{(l,j)}(x_k)\big]. \qquad (9)$$

Proof. Considering a quadratic Lyapunov function $V(x_k) = x_k^\top P x_k$ with $P \succ 0$, the inequality from Theorem 1 in (5) is given as

$$\mathrm{E}\big[x_{k+1}^\top P x_{k+1} \,\big|\, x_k\big] - x_k^\top P x_k \leq -\alpha\, x_k^\top P x_k,$$

which yields for the stochastic² process $x_{k+1} = A(x_k) x_k$

$$x_k^\top \mathrm{E}[A]^\top P\, \mathrm{E}[A]\, x_k + \mathrm{Tr}\big(P\, \mathrm{C}[A x_k]\big) - (1 - \alpha)\, x_k^\top P x_k \leq 0, \quad \forall x_k \in \mathcal{X}. \qquad (10)$$

Now, an expression for the trace is derived as follows:

$$\mathrm{Tr}\big(P\, \mathrm{C}[A x_k]\big) = \mathrm{Tr}\Big(P\, \mathrm{C}\Big[\sum_i A_{(:,i)} x_{k,i}\Big]\Big) = \mathrm{Tr}\Big(P \sum_{i,j} x_{k,i} x_{k,j}\, \mathrm{C}\big[A_{(:,i)}, A_{(:,j)}\big]\Big) = \sum_{i,j,l} P_{(l,:)}\, \mathrm{C}\big[A_{(:,i)}, A_{(l,j)}\big]\, x_{k,i} x_{k,j} = x_k^\top Q x_k,$$

where the definition of $Q$ in (9) was substituted. Using this simplification, (10) is rewritten as

$$x_k^\top \big(\mathrm{E}[A]^\top P\, \mathrm{E}[A] + Q - (1 - \alpha) P\big)\, x_k \leq 0,$$

which must hold for all $x_k \in \mathcal{X}$. To ensure this, the matrix $\mathrm{E}[A]^\top P\, \mathrm{E}[A] + Q - (1 - \alpha)P$ must be negative semidefinite, which concludes the proof.

²The $x_k$ dependency of the random process $A$ has been dropped for notational convenience.

The interpretation of Proposition 1 is analogous to the linear deterministic case $x_{k+1} = A x_k$, which is stable if there exists a matrix $P$ for which $A^\top P A - P \prec 0$: In the nonlinear case (3), the negative definiteness must be fulfilled for $A(x_k)$ for all $x_k \in \mathcal{X}$. The probabilistic nature of the system (3) in addition requires a "buffer", which here is $Q$. The deterministic case is recovered if $A$ has zero variance.
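Condition (8) must hold for all $x_k \in \mathcal{X}$; in practice one can at least sanity-check a candidate $P$ numerically on a grid of states. The sketch below does this for assumed toy moment functions $\mathrm{E}[A(x)]$ and $Q(x)$ (none of which come from the paper); a grid check is of course not a proof.

```python
# Hedged sketch: numerical sanity check of condition (8) on a state grid.
# P, alpha, the moment functions and the grid are illustrative assumptions.
import numpy as np

d, alpha = 2, 0.05
P = np.eye(d)                                  # candidate Lyapunov matrix

def E_A(x):
    """Assumed mean of A(x): a contraction that weakens away from 0."""
    return 0.5 / (1.0 + x @ x) * np.eye(d)

def Q(x):
    """Assumed covariance term (9), here chosen diagonal for simplicity."""
    return 0.1 * np.exp(-(x @ x)) * np.eye(d)

def condition_8(x):
    M = E_A(x).T @ P @ E_A(x) + Q(x) - (1.0 - alpha) * P
    return np.all(np.linalg.eigvalsh(M) <= 0)  # negative semidefiniteness

grid = [np.array([a, b]) for a in np.linspace(-3, 3, 7)
        for b in np.linspace(-3, 3, 7)]
print(all(condition_8(x) for x in grid))       # True for this toy choice
```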
The scalar case, considered in the following remark, also allows an intuitive insight into Proposition 1: there is a trade-off between the magnitude of the expected value and the variance of $A$.

Remark 1. In the scalar case³, i.e., $d = 1$ in (3), with $Q = P\, \mathrm{V}[A(x_k)]$, condition (8) simplifies for any $P > 0$ to

$$\mathrm{E}[A(x_k)]^2 + \mathrm{V}[A(x_k)] \leq 1 - \alpha, \quad \forall x_k \in \mathcal{X}. \qquad (11)$$

³Even though $A$, $Q$, $P$ are scalars here, we keep them capitalized for notational consistency.

4. Stable Learning with Various Distributions

Our learning framework consists of three major steps:

1. Choose any probability distribution for the random variable $A$ in (3) which is given by a fixed set of parameters $\theta \in \Theta$ and whose first two moments are available. It is assumed that the subset $\Theta^* \subseteq \Theta$ for which (8) is fulfilled is non-empty, thus $\Theta^* \neq \emptyset$.

2. Choose any parametric regression method to represent the mapping $\theta^\psi : \mathcal{X} \to \Theta$. The parameters of this mapping are denoted by $\psi \in \Psi$. The set of all $\psi$ for which all $x_k \in \mathcal{X}$ map to $\Theta^*$ is denoted by $\Psi^*$.

3. The likelihood maximization under constraints,

$$\psi^* = \arg\max_{\psi \in \Psi^*} \sum_{n=1}^{N} \log P\big(\bar{x}_{n+1} \,\big|\, \bar{x}_n, \theta^\psi\big), \qquad (12)$$

is solved, where $\psi \in \Psi^*$ is equivalent to constraint (8) with $P \succ 0$ and $\alpha > 0$.

The optimization (12) is a general constrained nonlinear program in a rather high-dimensional space (depending on the number of parameters of the regression method in step 2). However, independent of the optimality, any model $f^\psi$ of the form (3) with $\psi \in \Psi^*$ is exponentially stable; thus, any sample path of the system converges. For computational simplicity, we focus on two types of distributions which naturally fulfill the constraints, as explained in the next sections.

4.1. Stability with the Beta Distribution

For certain choices of distributions, constraint (8) is fulfilled for all possible parameters $\theta$, thus $\Theta^* = \Theta$, which makes the optimization unconstrained. One example of such a distribution is the Beta distribution, as shown in the following corollary.

Corollary 1. The scalar system $x_{k+1} = A(x_k) x_k$, where $A(x_k) = \kappa(\tilde{A}(x_k) - \eta)$ with Beta-distributed $\tilde{A}(x_k) \sim \mathcal{B}(a(x_k), b(x_k))$, $\kappa = 2$, $\eta = 0.5$, and state-dependent parameters $[a(x_k) \;\; b(x_k)]^\top = \theta^\psi_\mathcal{B}(x_k)$ for any $\theta^\psi_\mathcal{B} : \mathcal{X} \to \Theta_\mathcal{B} = \mathbb{R}^2_+$, is exponentially stable.

Proof. Applying the affine transformation to mean and variance leads to⁴

$$\mathrm{E}[A(x_k)] = \kappa\big(\mathrm{E}[\tilde{A}(x_k)] - \eta\big) = \kappa\left(\frac{a}{a+b} - \eta\right),$$
$$\mathrm{V}[A(x_k)] = \kappa^2\, \mathrm{V}[\tilde{A}(x_k)] = \kappa^2\, \frac{ab}{(a+b)^2 (a+b+1)}.$$

Condition (11) is rewritten as

$$\big(\kappa(\mathrm{E}[\tilde{A}(x_k)] - \eta)\big)^2 + \kappa^2\, \mathrm{V}[\tilde{A}(x_k)] \leq 1 - \alpha, \quad \forall x_k \in \mathcal{X},$$

where the best possible choice for $\eta$ minimizes $(\mathrm{E}[\tilde{A}(x_k)] - \eta)^2$, because it leaves the largest possible range for $\kappa$. As $\mathrm{E}[\tilde{A}]$ lies in the interval $]0, 1[$, the minimization is achieved with the choice $\eta = \frac{1}{2}$. Then, the left-hand side of condition (11), divided by $\kappa^2$ on both sides, evaluates to

$$\left(\frac{a}{a+b} - \frac{1}{2}\right)^2 + \frac{ab}{(a+b)^2(a+b+1)} = \frac{1}{4} - \underbrace{\frac{ab}{(a+b)(a+b+1)}}_{>0} < \frac{1}{4}.$$

As $\alpha > 0$ can be chosen arbitrarily small, the condition holds for every $|\kappa| \leq 2$. Hence, according to Theorem 1, the system $x_{k+1} = A(x_k) x_k$ is exponentially stable.

⁴The state dependency of $a$, $b$ is dropped for notational convenience.

To ensure maximal flexibility of the model, $\kappa = 2$ is set for all further considerations. This leads to the conclusion that $\Theta^*_\mathcal{B} = \Theta_\mathcal{B} = \mathbb{R}^2_+$. Therefore, in the optimization, no constraints on $\psi$ must be considered, thus $\Psi^* = \Psi$.
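The bound derived in the proof of Corollary 1 is easy to verify numerically. A small sketch follows (the tested $(a, b)$ pairs are arbitrary assumptions): for $A = 2(\tilde{A} - 0.5)$ with $\tilde{A} \sim \mathcal{B}(a, b)$, the scalar condition (11), $\mathrm{E}[A]^2 + \mathrm{V}[A] < 1$, holds for every choice.

```python
# Empirical check of Corollary 1 using the closed-form Beta moments;
# the tested (a, b) pairs are arbitrary assumptions.
for a, b in [(0.5, 0.5), (2.0, 5.0), (10.0, 1.0), (30.0, 30.0)]:
    mean_t = a / (a + b)                         # E[A_tilde]
    var_t = a * b / ((a + b)**2 * (a + b + 1))   # V[A_tilde]
    mean_A = 2.0 * (mean_t - 0.5)                # kappa = 2, eta = 0.5
    var_A = 4.0 * var_t
    lhs = mean_A**2 + var_A
    print(f"a={a}, b={b}: E[A]^2 + V[A] = {lhs:.4f} < 1: {lhs < 1.0}")
```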
4.2. Stability with the Dirichlet Distribution

Constructing $A$ from a Dirichlet distribution also allows for unconstrained optimization, as it also leads to stable behavior, as shown in the following corollary.

Corollary 2. The $d$-dimensional system $x_{k+1} = A(x_k) x_k$, where each row of $A(x_k)$ consists of the first $d$ elements of a $(d+1)$-dimensional Dirichlet-distributed vector, thus $A_{(i,:)} = a^{(i)}_{(1:d)}$ with $a^{(i)} \sim \mathcal{D}\big(\theta^{\psi_i}_\mathcal{D}(x_k)\big)$, $i = 1, \dots, d$, for any $\theta^{\psi_i}_\mathcal{D} : \mathcal{X} \to \Theta_\mathcal{D}$, $\forall i$, is asymptotically stable w.p.1.

Proof. By construction, the sample space $\Omega_A$ of $A$ contains only elements for which

$$0 < \bar{A}_{(i,j)} < 1, \quad i, j = 1, \dots, d, \qquad \sum_{j=1}^{d} \bar{A}_{(i,j)} < 1, \quad i = 1, \dots, d. \qquad (13)$$

Consider now a realization of $A(x_k)$, denoted by $\bar{A}$; since the following statements hold for any realization in the sample space, we omit writing $\bar{A} \in \Omega_A$. It follows that

$$\sum_{j=1}^{d} \bar{A}_{(i,j)} < 1 \;\; \forall i \quad \Rightarrow \quad \max_{i=1:d} \sum_{j=1}^{d} \bar{A}_{(i,j)} < 1 \quad \Rightarrow \quad \|\bar{A}\|_\infty = \max_{i=1:d} \sum_{j=1}^{d} \big|\bar{A}_{(i,j)}\big| < 1,$$

where the last equality holds because all elements of $\bar{A}$ are strictly positive, and $\|\cdot\|_\infty$ denotes the maximum absolute row sum norm. Consider now $M$ consecutive realizations $\bar{A}^{(i)}$ with $i = 1, \dots, M$. For the maximum norm of the state in the $M$-th step,

$$\|x_{k+M}\|_\infty = \left\| \prod_{i=1}^{M} \bar{A}^{(i)}\, x_k \right\|_\infty \leq \prod_{i=1}^{M} \big\|\bar{A}^{(i)}\big\|_\infty\, \|x_k\|_\infty,$$

where the submultiplicativity property of induced matrix norms (Horn & Johnson, 2013) is used. As convergence towards the origin holds for each element in the sample space, the system is stable with probability one.

Therefore, the parameter space is unrestricted: $\Theta^*_\mathcal{D} = \Theta_\mathcal{D} = \mathbb{R}^{d+1}_+$.

Remark 2. Note that this approach only allows representing the special class of positive systems. Nevertheless, positive systems play an important role in control engineering for modeling the evolution of strictly positive quantities, as shown in (Farina & Rinaldi, 2011).

Remark 3. An affine transformation, as shown for the Beta distribution, is not possible here, because absolute values are taken in the row sum. Therefore, from $\sum_{j=1}^{d} |\bar{A}_{(i,j)}| < 1$ one cannot conclude $\sum_{j=1}^{d} |\kappa(\bar{A}_{(i,j)} - 0.5)| < 1$ for any $\kappa > 1$.

5. Simulations

We validate our approach, labeled LeSSS (for Learning Stable Stochastic Systems), using synthetic and human motion data and the simulation of a chemical reactor.

For the Beta distribution, Gaussian Mixture Regression (GMR) is used for the mapping from the state to the parameters, $\theta_\mathcal{B} : \mathcal{X} \to \Theta_\mathcal{B}$. Thus, the parameter vector $\psi$ is the concatenation of the priors $\pi_l$, the means $\mu_l$ and the covariances $\Sigma_l$ for $l = 1, \dots, L$. The code (based on Calinon (2009)) includes a k-means clustering initialization and a transformation of $\Sigma_l$ and $\pi_l$ to make it an unconstrained optimization. To evaluate the likelihood function for each training point $\{\bar{x}_{n+1}, \bar{x}_n\}$, the Beta distribution parameters are computed as $[a_n \;\; b_n]^\top = \theta^\psi_\mathcal{B}(\bar{x}_n)$ using GMR. Then, the log-likelihood of $\bar{A}_n = \bar{x}_{n+1}/\bar{x}_n$ given the parameters $[a_n \;\; b_n]$ is evaluated using the density function of the Beta distribution; a sketch of this evaluation is given below. As all possible parameters $\theta_\mathcal{B} = [a \;\; b]^\top \in \mathbb{R}^2_+$ lead to stability, finding $\psi$ is an unconstrained optimization problem.

For the Dirichlet distribution, the mapping from the state to the parameters, $\theta^\psi_\mathcal{D} : \mathcal{X} \to \Theta_\mathcal{D}$, uses a nearest-neighbor approach for computational simplicity. The $2d = 4$ closest data points are considered for fitting the parameters of the Dirichlet distribution locally. Then a training point is placed at the center of these four points. At reproduction, the closest such training point and its Dirichlet parameters are taken for regression. This does not necessarily maximize the likelihood, but shows accurate results for reproduction.
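As a concrete illustration of the Beta likelihood evaluation described above, the following sketch computes the log-likelihood of the one-step ratios $\bar{A}_n = \bar{x}_{n+1}/\bar{x}_n$. The parameter map is a fixed toy stand-in for the trained GMR, the data pairs are invented, and the affine transform $A = 2(\tilde{A} - 0.5)$ of Corollary 1 is inverted explicitly together with its Jacobian factor $1/2$ (whether the original implementation includes this factor is our assumption).

```python
# Sketch of the Beta log-likelihood evaluation for training pairs.
# theta_psi_B stands in for the trained GMR; data pairs are toy assumptions.
import numpy as np
from scipy.stats import beta

def theta_psi_B(x):
    """Stand-in parameter map: state -> Beta parameters (a, b) > 0."""
    return 2.0 + x**2, 3.0 + abs(x)

def log_likelihood(pairs):
    ll = 0.0
    for x_n, x_np1 in pairs:
        A_n = x_np1 / x_n                 # realization of A at state x_n
        a_n, b_n = theta_psi_B(x_n)
        A_tilde = A_n / 2.0 + 0.5         # invert A = 2*(A_tilde - 0.5)
        ll += beta.logpdf(A_tilde, a_n, b_n) + np.log(0.5)  # Jacobian 1/2
    return ll

pairs = [(1.0, 0.4), (0.4, 0.1), (-0.5, -0.2)]  # toy training pairs
print(log_likelihood(pairs))
```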
We compare the following models from the literature regarding reproduction precision and convergence properties:

The approach introduced by Boots et al. (2008) learns a stable linear dynamical system (stable LDS) from data. It constrains the search for the deterministic dynamic matrix $A$ to ensure the stability of $x_{k+1} = A x_k$.

Gaussian Process Dynamical Models (GPDM) (Wang et al., 2005) represent dynamical systems in the general form $x_{k+1} = f(x_k)$, with a Gaussian Process $f \sim \mathcal{GP}(0, k(x_k, x_k'))$. We employ a zero prior mean function and a squared exponential kernel. The hyperparameters of the kernel are optimized using the likelihood, as described by Rasmussen & Williams (2006). In reproduction, this method can either be used in a deterministic setting, by only taking the posterior mean prediction $\mu_{\mathrm{GP}}(x_k)$, thus $x_{k+1} = \mu_{\mathrm{GP}}(x_k)$, or in the stochastic setting $x_{k+1} \sim \mathcal{N}(\mu_{\mathrm{GP}}(x_k), \Sigma_{\mathrm{GP}}(x_k))$, where $\Sigma_{\mathrm{GP}}(x_k)$ is the posterior variance. GPDMs are bounded (Beckers & Hirche, 2016a;b) but not stable.

The Stable Estimator of Dynamical Systems (SEDS), as introduced by Khansari-Zadeh & Billard (2011), constrains the likelihood optimization of the GMR parameters to a class of mean-stable, time-continuous dynamical systems. The GMR maps from the current state $x$ to the time derivative $\dot{x}$. It focuses on deterministic systems by only considering stability criteria for the mean prediction of the GMR, $\mu_{\mathrm{GMM}}(x)$, while ignoring the stochastic nature of GMMs (its variance prediction $\Sigma_{\mathrm{GMM}}$). We also run this method in a stochastic setting, where $\dot{x} \sim \mathcal{N}(\mu_{\mathrm{GMM}}(x), \Sigma_{\mathrm{GMM}}(x))$. For our simulations, five mixtures are employed.

Before starting the comparison to existing approaches, LeSSS is demonstrated on a synthetic dataset.

5.2. Simulation 1: Synthetic Data

For the first simulation, the task is to identify the stable nonlinear stochastic system given by

$$x_{k+1} = A(x_k)\, x_k, \qquad (14)$$

where $A(x_k) \sim \mathcal{B}\big((x_k - 5)^2, (x_k + 5)^2\big)$. The learning algorithm is given 100 training points $\{\bar{x}_n, \bar{x}_{n+1}\}_{n=1}^{100}$, equally spaced in the state-space interval $[-8, 8]$, which are drawn from the state-dependent Beta distribution (14). Here, $L = 3$ was chosen for the number of mixtures in the GMR for the mapping $\theta^\psi_\mathcal{B} : \mathcal{X} \to \Theta_\mathcal{B}$.
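A sketch of how such training data can be generated is given below, assuming the Beta variable enters through the transform $A = 2(\tilde{A} - 0.5)$ of Corollary 1; the seed and these sampling details are our assumptions.

```python
# Sketch: generate the 100 synthetic training pairs of Simulation 1.
# Assumes the Corollary 1 transform A = 2*(A_tilde - 0.5); seed is arbitrary.
import numpy as np

rng = np.random.default_rng(1)
x_n = np.linspace(-8.0, 8.0, 100)     # 100 equally spaced states in [-8, 8]
a = (x_n - 5.0)**2                    # a(x_k) from (14)
b = (x_n + 5.0)**2                    # b(x_k) from (14)
A_tilde = rng.beta(a, b)              # one draw of A_tilde per state
A_bar = 2.0 * (A_tilde - 0.5)
x_np1 = A_bar * x_n                   # one-step targets x_{n+1}
data = list(zip(x_n, x_np1))          # training set D
```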
[Figure 2. Comparison of the true and inferred $x_k$-dependent variance (top) and mean (bottom) functions.]

Figure 2 compares the mean and variance of the original system (14) to the ones inferred by our model. It clearly shows that the model offers sufficient flexibility to reconstruct the original system. Note: It is also possible to verify the parameter functions $a(x_k)$, $b(x_k)$ as given in (14), but we directly look at the mean and variance functions, as there exists a unique mapping and it is more intuitive for interpretation. It must be admitted that the data was generated from the same model which the algorithm is learning. This explains the good fit, but is of course not often the case in practical applications. Therefore, we continue with a real-world dataset in the following.

5.3. Simulation 2: Human Motion Data

For the next simulation, we use the dataset of letter-shaped motions provided by Khansari-Zadeh & Billard (2011). The 225 training points of 3 trajectories of the two-dimensional Z-shaped motion are projected onto the y-axis. The GMR for $\theta^\psi_\mathcal{B} : \mathcal{X} \to \Theta_\mathcal{B}$ is trained with two mixtures. Figure 3 shows the training data along with the fit of the mean and variance functions. The mean function is a smoothed estimate of the training data. The model properly identifies that the training data has higher variability (around $x_k = 0$) and captures this in its variance function.

[Figure 3. The inferred mean and variance functions for the training set of the projected Z-shape movement, shown along with the training data (the realizations of the random variable $A$).]

[Figure 4. Comparison of simulations for $x_0 = 193$ for the stable LDS and the mean predictions of the models GPDM, SEDS and LeSSS.]

Figure 4 compares the reproduction of the models stable LDS, GPDM, SEDS and LeSSS when taking the deterministic (mean) output of each model, all starting from the same initial point. The stable LDS approach leads to a converging trajectory, but fails to capture the complexity of the dynamics (as the true dynamics are nonlinear). The GPDM converges to a spurious attractor at $x \approx 9.3$, which is undesired but not surprising. SEDS and LeSSS both lead to asymptotically stable reproductions of the movement. Since the data does not contain the full state (due to the projection onto the y-axis), it is not possible to reproduce the movement precisely with a dynamical system model.

[Figure 5. Comparison of simulations for $x_0 = 193$ with the stochastic models of GPDM, SEDS and LeSSS (panel annotations: "No convergence to origin", "Oscillation but no convergence").]

Figure 5 compares the reproduction of the three stochastic dynamical models GPDM, SEDS and LeSSS based on three sample paths drawn from each model. The GPDM again converges to the spurious attractor. SEDS clearly shows that convergence of the mean is not sufficient for converging trajectories of a stochastic system, as the drawn sample paths oscillate strongly around the origin without a tendency to converge. In the stochastic case, only LeSSS generates converging trajectories.

[Figure 6. 2D simulation for the human motion dataset with LeSSS (training data and deterministic trajectory).]

Figure 6 shows an example of human motion imitation in the 2D case on a different training dataset. It shows the deterministic trajectory and 5 sample path realizations, all of which show high reproduction precision and convergence to the origin.

5.4. Simulation 3: Chemical Reactor Simulation

For the last validation, we utilize simulated data from a simplified chemical reactor (Einarsson, 1998). The closed-loop reactor is modeled by a piecewise affine system with two states: the fluid level $x_1$ and the temperature $x_2$. Both states are physically positive quantities; therefore, the approach in Section 4.2 is suitable. The switching between different dynamic matrices is state-dependent and occurs at $x_1 = 3$ and $x_2 = 50$, which corresponds to a discrete change of the control inputs. The training data consists of 8 trajectories of 15 steps each, which are pairwise initialized in the 4 different regions of the dynamics and perturbed with white noise with $\sigma = 0.01$ for both states.

Error/stable?    stable LDS   GPDM     SEDS      LeSSS
deterministic    322/yes      173/no   332/yes   162/yes
stochastic       n/a          177/no   364/no    165/yes

Table 1. Comparison of the reproduction error, in terms of the area between each demonstration and its corresponding reproduction, for the stochastic and deterministic cases of the chemical reactor simulation. The area is computed for each of the 8 initial points separately and accumulated for each of the approaches. The table also indicates which models are stable.

Figure 7 shows the training data along with the reproduction using stable LDS, GPDM, SEDS and LeSSS in the deterministic setting. The initial points in the test case were set close to the ones in the training data. The stable LDS is not capable of capturing the varying behavior in the different regions of the piecewise affine system and therefore fails in reproduction accuracy. GPDM again converges outside the origin, which is undesirable. SEDS and LeSSS both converge, as is enforced by design.

[Figure 7. Training data (black) of the chemical reactor and the deterministic simulation for the stable LDS, GPDM, SEDS and LeSSS approaches (axes: $x_1$ fluid level, $x_2$ temperature; panel annotations: "Reproduction very imprecise", "No convergence to origin", "Precise reproduction and convergence").]
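Since the reactor states are positive, LeSSS uses the Dirichlet construction of Section 4.2 here. The following sketch illustrates why any such model produces positive, converging sample paths; the parameter map is a fixed toy stand-in for the nearest-neighbor regressor described above, and the initial state is an arbitrary assumption.

```python
# Sketch: sample path of the Dirichlet construction from Corollary 2.
# The parameter map and initial state are toy assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)
d = 2

def theta_psi_D(x):
    """Stand-in map: state -> one (d+1)-dim Dirichlet parameter vector per row."""
    return [np.array([1.0 + xi, 2.0, 3.0]) for xi in x]

x = np.array([3.0, 50.0])             # positive initial state (level, temp.)
for k in range(100):
    rows = [rng.dirichlet(p)[:d] for p in theta_psi_D(x)]  # first d entries
    A_bar = np.vstack(rows)           # each row sums to < 1 by construction
    x = A_bar @ x                     # stays positive; ||A_bar||_inf < 1
print(x)                              # near the origin after 100 steps
```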
[Figure 8. Training data (black) of the chemical reactor and the stochastic simulation for the GPDM, SEDS and LeSSS approaches (axes: $x_1$ fluid level, $x_2$ temperature; panel annotations: "No convergence to origin", "Oscillation hinders convergence", "Precise reproduction and convergence").]

Figure 8 shows that, similar to the 1D case, GPDM and SEDS fail to converge in the stochastic case, while LeSSS is stable in all sample paths. Table 1 compares the methods quantitatively with regard to reproduction precision. It shows that LeSSS outperforms the other methods in this measure, while providing the necessary guarantees regarding convergence.

5.5. Discussion

The simulations show that LeSSS is powerful enough to represent various nonlinear dynamics, while capturing the probabilistic nature of the process. The incorporation of the prior knowledge on goal convergence ensures that the learned model is stable in probability.

The computational complexity for learning the parameters of the model using interior-point methods is mainly determined by the mapping employed in the first layer, $\theta^\psi$. The computation times on an i5 CPU (2.30 GHz, 2 cores, 8 GB RAM) are given for Simulations 2 and 3 in Table 2. Since GPDM, SEDS and LeSSS all solve non-convex optimization problems, their computation times are of the same order of magnitude. The linear model has an advantage here.

Time           LDS      GPDM     SEDS     LeSSS
Simulation 2   0.016 s  8.53 s   2.47 s   9.88 s
Simulation 3   4.54 s   11.2 s   22.3 s   18.2 s

Table 2. Computation times for model learning and simulation for all compared approaches.

Regarding scalability with more training points, the parameter fitting performs similarly to other approaches requiring likelihood computation, since this is the major cost factor. However, the scalability strongly depends on the employed distribution and the mapping in the first layer, $\theta^\psi$.

This work only deals with systems with a single equilibrium point, but it could be extended to systems with more complex attractor dynamics. However, further knowledge is required because, in addition to the positions of all equilibrium points, their regions of attraction must be known.

6. Conclusion

This work proposes a framework for learning stable nonlinear stochastic dynamical systems from data. We introduce a flexible model, which builds on the state-dependent coefficient form, and derive exponential stability conditions based on stochastic Lyapunov methods. The criterion is applicable to various probability distributions, while we focus on investigating the application to Beta and Dirichlet distributions. Simulation results verify the sufficient flexibility of the model and the correct identification of the system's uncertainty. In comparison to existing approaches, it shows advantages in reproduction precision and convergence properties on human motion data and on simulated data from a real system.

Acknowledgment

The research leading to these results has received funding from the European Research Council under the European Union Seventh Framework Program (FP7/2007-2013), ERC Starting Grant "Control based on Human Models" (conhumo), agreement number 337654.
We also would like to thank the reviewers for their very constructive feedback.

References

Beckers, Thomas and Hirche, Sandra. Equilibrium distributions and stability analysis of Gaussian process state space models. In Conference on Decision and Control (CDC), pp. 6355-6361, 2016a.

Beckers, Thomas and Hirche, Sandra. Stability of Gaussian process state space models. In European Control Conference (ECC), pp. 2275-2281, 2016b.

Boots, Byron, Gordon, Geoffrey J., and Siddiqi, Sajid M. A constraint generation approach to learning stable linear dynamical systems. In Advances in Neural Information Processing Systems (NIPS), pp. 1329-1336. Curran Associates, Inc., 2008.

Calinon, Sylvain. Robot Programming by Demonstration: A Probabilistic Approach. EPFL/CRC Press, 2009.

Chiuso, Alessandro and Pillonetto, Gianluigi. Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors. In Advances in Neural Information Processing Systems (NIPS), pp. 397-405. Curran Associates, Inc., 2010.

Cimen, Tayfun. State-dependent Riccati equation (SDRE) control: A survey. IFAC Proceedings Volumes, 41(2):3761-3775, 2008.

Einarsson, Valur. On Verification of Switched Systems Using Abstractions. PhD thesis, Linköping University, Automatic Control, The Institute of Technology, 1998.

Farina, Lorenzo and Rinaldi, Sergio. Positive Linear Systems: Theory and Applications. John Wiley & Sons, 2011.

Fox, Emily, Sudderth, Erik B., Jordan, Michael I., and Willsky, Alan S. Nonparametric Bayesian learning of switching linear dynamical systems. In Advances in Neural Information Processing Systems (NIPS), pp. 457-464. Curran Associates, Inc., 2009.

Ghahramani, Zoubin and Roweis, Sam T. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems (NIPS), pp. 431-437. MIT Press, 1999.

Horn, Roger A. and Johnson, Charles R. Matrix Analysis. Cambridge University Press, Cambridge, 2013.

Ijspeert, Auke Jan, Nakanishi, Jun, and Schaal, Stefan. Movement imitation with nonlinear dynamical systems in humanoid robots. In International Conference on Robotics and Automation (ICRA). IEEE, 2002.

Khansari-Zadeh, Seyed Mohammad and Billard, Aude. Learning stable nonlinear dynamical systems with Gaussian mixture models. IEEE Transactions on Robotics, 27(5):943-957, 2011.

Khansari-Zadeh, Seyed Mohammad and Billard, Aude. Learning control Lyapunov function to ensure stability of dynamical system-based robot reaching motions. Robotics and Autonomous Systems, 62(6):752-765, 2014.

Kocijan, J., Girard, A., Banko, B., and Murray-Smith, R. Dynamic systems identification with Gaussian processes. Mathematical and Computer Modelling of Dynamical Systems, 11(4):411-424, 2005.

Kushner, Harold Joseph. Introduction to Stochastic Control. Holt, Rinehart and Winston, New York, 1971.

Ljung, Lennart. System Identification. Prentice Hall PTR, NJ, USA, 1998.

Paraschos, Alexandros, Daniel, Christian, Peters, Jan, and Neumann, Gerhard. Probabilistic movement primitives. In Advances in Neural Information Processing Systems (NIPS), pp. 2616-2624, 2013.

Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, USA, 2006.

Wang, Jack M., Fleet, David J., and Hertzmann, Aaron. Gaussian process dynamical models. In Advances in Neural Information Processing Systems (NIPS), pp. 1441-1448, 2005.